<a href="https://colab.research.google.com/github/anilsolanki2645/WebScarping/blob/main/web_scraping_github_demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Top Repositories for GitHub Topics**

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


------------------------------------------------------------------------

## Pick a website and describe your objective

* Browse through different sites and pick one to scrape.
* Identify the information you'd like to scrape from the site. Decide the format of the output CSV file.
* Summarize your project idea and outline your strategy in a Google Colab notebook.

## Project Outline
* We're going to scrape https://github.com/topics
* We'll get a list of topics. For each topic, we'll get topic title, topic page URL and topic description
* For each topic, we'll get the top repositories listed on the topic page (the first page returned 20 repositories when this notebook was run)
* For each repository, we'll grab the repo name, username, stars and repo URL
* For each topic we'll create a CSV file in the following format:

> Repo Name,Username,Stars,Repo URL
>
> three.js,mrdoob,69700,https://github.com/mrdoob/three.js
>
> libgdx,libgdx,18300,https://github.com/libgdx/libgdx
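
The file layout described in the outline can be sketched with the standard-library `csv` module. The two rows are the sample values from the outline; the helper name `rows_to_csv` is an illustrative assumption and does not appear in the notebook.

```python
import csv
import io

# Column order taken from the outline above.
HEADER = ["Repo Name", "Username", "Stars", "Repo URL"]


def rows_to_csv(rows):
    """Render repository rows as CSV text using the outline's header."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(HEADER)
    writer.writerows(rows)
    return buffer.getvalue()


# Sample rows copied from the outline.
sample_rows = [
    ("three.js", "mrdoob", 69700, "https://github.com/mrdoob/three.js"),
    ("libgdx", "libgdx", 18300, "https://github.com/libgdx/libgdx"),
]
print(rows_to_csv(sample_rows))
```

Using `csv.writer` rather than joining strings by hand means that values containing commas or quotes are escaped correctly.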

### Use the requests library to download web pages

* Import the requests library to download web pages

In [2]:
import requests

* Store the GitHub topics URL in the variable topics_url

In [3]:
topics_url = 'https://github.com/topics'

* Download the topics page and store the result in the response variable

In [4]:
response = requests.get(topics_url)

* check status code

In [5]:
response.status_code

200
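
Checking `status_code` by hand works well in a notebook. In a script, the same check is often folded into a small helper. The following is a minimal sketch, assuming a hypothetical helper name `download_page` and an arbitrary 10-second timeout; neither appears in the original notebook.

```python
import requests


def download_page(url, timeout=10):
    """Download a page and return its text, raising on HTTP errors."""
    # The timeout stops the call from hanging indefinitely on a stalled connection.
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx responses.
    response.raise_for_status()
    return response.text


# Usage (performs a real network request, so it is left commented out):
# page_contents = download_page('https://github.com/topics')
```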

* Check the length of the page text (number of characters, not words)

In [6]:
len(response.text)

164798
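
`len(response.text)` counts the characters of the decoded string, which differs from both the word count and the byte count (with requests, the raw bytes are available as `response.content`). A small sketch of the difference, using a made-up sample string rather than the real page:

```python
# Hypothetical sample text; the real page is far longer.
sample_text = "<p>café</p>"

char_count = len(sample_text)                  # characters in the string
byte_count = len(sample_text.encode("utf-8"))  # bytes once encoded as UTF-8
word_count = len(sample_text.split())          # whitespace-separated chunks

print(char_count, byte_count, word_count)
```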

* Store the full page text in page_contents

In [7]:
page_contents = response.text

* Show the first 1,000 characters of the page content

In [8]:
page_contents[:1000]

'\n\n<!DOCTYPE html>\n<html lang="en" data-color-mode="auto" data-light-theme="light" data-dark-theme="dark"  data-a11y-animated-images="system" data-a11y-link-underlines="false">\n\n\n  <head>\n    <meta charset="utf-8">\n  <link rel="dns-prefetch" href="https://github.githubassets.com">\n  <link rel="dns-prefetch" href="https://avatars.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">\n  <link rel="dns-prefetch" href="https://user-images.githubusercontent.com/">\n  <link rel="preconnect" href="https://github.githubassets.com" crossorigin>\n  <link rel="preconnect" href="https://avatars.githubusercontent.com">\n\n  \n\n  <link crossorigin="anonymous" media="all" rel="stylesheet" href="https://github.githubassets.com/assets/light-a09cef873428.css" /><link crossorigin="anonymous" media="all" rel="stylesheet" href="https://github.githubassets.com/assets/dark-5d486a4ede8e.css" /><link data-color-theme="dark_dimmed" crossorigin="anonymous" me

* Write the full page content to the file webpage.html

In [9]:
# An explicit encoding avoids errors on platforms whose default is not UTF-8
with open('webpage.html', 'w', encoding='utf-8') as f:
    f.write(page_contents)
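
When saving a page, passing an explicit `encoding` avoids depending on the platform's default text encoding. A minimal sketch using `pathlib` and a temporary directory follows; the helper name `save_page` and the sample HTML are assumptions made for illustration.

```python
import tempfile
from pathlib import Path


def save_page(text, path):
    """Write page text to path as UTF-8 and return the number of characters written."""
    return Path(path).write_text(text, encoding="utf-8")


# Hypothetical stand-in for page_contents.
sample_html = "<html><body>Topics ✓</body></html>"

with tempfile.TemporaryDirectory() as tmp_dir:
    target = Path(tmp_dir) / "webpage.html"
    save_page(sample_html, target)
    # Reading back with the same encoding returns the original text.
    round_trip = target.read_text(encoding="utf-8")

print(round_trip == sample_html)
```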

--------------------------------------------------------------------------------------

### Use Beautiful Soup to parse and extract information

* Import BeautifulSoup from the bs4 package to parse the page and extract information

In [10]:
from bs4 import BeautifulSoup

* Parse the page content and store the parsed document in the doc variable

In [11]:
doc = BeautifulSoup(page_contents, 'html.parser')

* Define a variable that holds the class string of the topic title tags

In [12]:
selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'

* Find all p tags with that class and store them in topic_title_tags

In [13]:
topic_title_tags = doc.find_all('p', {'class': selection_class})

* Check how many title tags were found

In [14]:
len(topic_title_tags)

30

* Show the first 5 title tags

In [15]:
topic_title_tags[:5]

[<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Algorithm</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amp</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Android</p>]
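
Passing the full class string to `find_all` matches the `class` attribute as an exact string, so the lookup depends on the classes appearing in exactly that order. A possible alternative, sketched here on a small HTML snippet modelled on the topic cards of the page, is a CSS selector via `select`, which matches tags that carry the listed classes in any order. Which classes are stable enough to rely on is an assumption; GitHub can change its markup at any time.

```python
from bs4 import BeautifulSoup

# Snippet modelled on the topic cards; the class order differs in the second card.
sample_html = """
<a class="no-underline flex-1 d-flex flex-column" href="/topics/3d">
  <p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
</a>
<a class="no-underline flex-1 d-flex flex-column" href="/topics/ajax">
  <p class="Link--primary f3 lh-condensed mb-0 mt-1">Ajax</p>
</a>
"""

sample_doc = BeautifulSoup(sample_html, "html.parser")

# Exact class-string match: only finds tags whose class attribute equals this string.
exact = sample_doc.find_all("p", {"class": "f3 lh-condensed mb-0 mt-1 Link--primary"})

# CSS selector: finds <p> tags that carry both classes, in any order.
by_selector = sample_doc.select("p.f3.lh-condensed")

print([tag.text for tag in exact])
print([tag.text for tag in by_selector])
```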

* Select the description tags by tag name and class

In [16]:
desc_selector = 'f5 color-fg-muted mb-0 mt-1'
topic_desc_tags = doc.find_all('p', {'class': desc_selector})

* Show the first 5 topic description tags

In [17]:
topic_desc_tags[:5]

[<p class="f5 color-fg-muted mb-0 mt-1">
           3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Ajax is a technique for creating interactive web applications.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Algorithms are self-contained sequences that carry out a variety of tasks.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Amp is a non-blocking concurrency library for PHP.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Android is an operating system built by Google designed for mobile devices.
         </p>]

* Select the topic link tags by tag name and class

In [18]:
topic_link_tags = doc.find_all('a', {'class': 'no-underline flex-1 d-flex flex-column'})

* Check how many link tags were found

In [19]:
len(topic_link_tags)

30

In [20]:
topic_link_tags[0]

<a class="no-underline flex-1 d-flex flex-column" href="/topics/3d">
<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
<p class="f5 color-fg-muted mb-0 mt-1">
          3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.
        </p>
</a>

* Build the full URL of the first topic from the href attribute of its link tag

In [21]:
topic0_url = "https://github.com" + topic_link_tags[0]['href']
print(topic0_url)

https://github.com/topics/3d
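
String concatenation works here because every `href` on the page starts with `/`. A slightly more defensive sketch uses `urllib.parse.urljoin` from the standard library, which also copes with hrefs that are already absolute. The first two hrefs come from the page; the absolute one is a made-up case added for illustration.

```python
from urllib.parse import urljoin

site_root = "https://github.com"

# Two relative hrefs from the page plus one hypothetical absolute href.
sample_hrefs = ["/topics/3d", "/topics/ajax", "https://github.com/topics/android"]

sample_urls = [urljoin(site_root, href) for href in sample_hrefs]
print(sample_urls)
```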


* Create the topic_titles list and store the text of every title tag in it

In [22]:
topic_titles = []

for tag in topic_title_tags:
    topic_titles.append(tag.text)

print(topic_titles)

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android', 'Angular', 'Ansible', 'API', 'Arduino', 'ASP.NET', 'Atom', 'Awesome Lists', 'Amazon Web Services', 'Azure', 'Babel', 'Bash', 'Bitcoin', 'Bootstrap', 'Bot', 'C', 'Chrome', 'Chrome extension', 'Command line interface', 'Clojure', 'Code quality', 'Code review', 'Compiler', 'Continuous integration', 'COVID-19', 'C++']


* Collect all topic descriptions, stripping the surrounding whitespace

In [23]:
topic_descs = []

for tag in topic_desc_tags:
    topic_descs.append(tag.text.strip())

topic_descs

['3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.',
 'Ajax is a technique for creating interactive web applications.',
 'Algorithms are self-contained sequences that carry out a variety of tasks.',
 'Amp is a non-blocking concurrency library for PHP.',
 'Android is an operating system built by Google designed for mobile devices.',
 'Angular is an open source web application platform.',
 'Ansible is a simple and powerful automation engine.',
 'An API (Application Programming Interface) is a collection of protocols and subroutines for building software.',
 'Arduino is an open source platform for building electronic devices.',
 'ASP.NET is a web framework for building modern web apps and services.',
 'Atom is a open source text editor built with web technologies.',
 'An awesome list is a list of awesome things curated by the community.',
 'Amazon Web Services provides on-demand cloud computing platforms on a subscription basis.',
 'Azure 

* Collect all topic URLs

In [24]:
topic_urls = []
base_url = 'https://github.com'

for tag in topic_link_tags:
    topic_urls.append(base_url + tag['href'])

topic_urls

['https://github.com/topics/3d',
 'https://github.com/topics/ajax',
 'https://github.com/topics/algorithm',
 'https://github.com/topics/amphp',
 'https://github.com/topics/android',
 'https://github.com/topics/angular',
 'https://github.com/topics/ansible',
 'https://github.com/topics/api',
 'https://github.com/topics/arduino',
 'https://github.com/topics/aspnet',
 'https://github.com/topics/atom',
 'https://github.com/topics/awesome',
 'https://github.com/topics/aws',
 'https://github.com/topics/azure',
 'https://github.com/topics/babel',
 'https://github.com/topics/bash',
 'https://github.com/topics/bitcoin',
 'https://github.com/topics/bootstrap',
 'https://github.com/topics/bot',
 'https://github.com/topics/c',
 'https://github.com/topics/chrome',
 'https://github.com/topics/chrome-extension',
 'https://github.com/topics/cli',
 'https://github.com/topics/clojure',
 'https://github.com/topics/code-quality',
 'https://github.com/topics/code-review',
 'https://github.com/topics/compil

* Import pandas as pd to convert the lists into a data frame

In [25]:
import pandas as pd

* create a dictionary to store information

In [26]:
topics_dict = {
    'title': topic_titles,
    'description': topic_descs,
    'url': topic_urls
}
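
`pd.DataFrame` raises a `ValueError` when the lists in such a dictionary have different lengths, which can happen if one of the selectors misses a tag. A minimal sketch of a pre-check follows; the helper name `check_equal_lengths` is hypothetical and the sample data is a two-topic excerpt of the lists above.

```python
def check_equal_lengths(columns):
    """Return the shared length of all lists in columns, or raise ValueError."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Column lengths differ: {lengths}")
    # An empty dictionary has no columns, so report a length of 0.
    return next(iter(lengths.values()), 0)


# Two-topic excerpt with the same structure as topics_dict.
sample_dict = {
    "title": ["3D", "Ajax"],
    "description": [
        "3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.",
        "Ajax is a technique for creating interactive web applications.",
    ],
    "url": ["https://github.com/topics/3d", "https://github.com/topics/ajax"],
}
print(check_equal_lengths(sample_dict))
```

Running the check before building the data frame gives an error message that names the column with the unexpected length.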

* Build the data frame and store it in topics_df

In [27]:
topics_df = pd.DataFrame(topics_dict)

In [28]:
topics_df

| | title | description | url |
|---|---|---|---|
| 0 | 3D | 3D refers to the use of three-dimensional grap... | https://github.com/topics/3d |
| 1 | Ajax | Ajax is a technique for creating interactive w... | https://github.com/topics/ajax |
| 2 | Algorithm | Algorithms are self-contained sequences that c... | https://github.com/topics/algorithm |
| 3 | Amp | Amp is a non-blocking concurrency library for ... | https://github.com/topics/amphp |
| 4 | Android | Android is an operating system built by Google... | https://github.com/topics/android |
| 5 | Angular | Angular is an open source web application plat... | https://github.com/topics/angular |
| 6 | Ansible | Ansible is a simple and powerful automation en... | https://github.com/topics/ansible |
| 7 | API | An API (Application Programming Interface) is ... | https://github.com/topics/api |
| 8 | Arduino | Arduino is an open source platform for buildin... | https://github.com/topics/arduino |
| 9 | ASP.NET | ASP.NET is a web framework for building modern... | https://github.com/topics/aspnet |


* Create a CSV file from the data frame

In [29]:
topics_df.to_csv('topics.csv', index=False)  # index=False leaves out the row index column
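
The `index` argument of `to_csv` controls whether the row index is written as an extra first column. A small sketch on two sample rows, writing to an in-memory buffer instead of a file so the result can be inspected directly:

```python
import io

import pandas as pd

# Two rows taken from the topics table.
sample_df = pd.DataFrame({
    "title": ["3D", "Ajax"],
    "url": ["https://github.com/topics/3d", "https://github.com/topics/ajax"],
})

with_index = io.StringIO()
sample_df.to_csv(with_index)                  # default: index column included

without_index = io.StringIO()
sample_df.to_csv(without_index, index=False)  # index column omitted

# Compare the header lines of both variants.
print(with_index.getvalue().splitlines()[0])
print(without_index.getvalue().splitlines()[0])
```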

---------------------------------------------------------------------------------------------------------

## Getting information out of a topic page

* Store the URL of the first topic page

In [30]:
topic_page_url = topic_urls[0]

In [31]:
topic_page_url

'https://github.com/topics/3d'

* Download the first topic page (topic_page_url) and store the result in the response variable
* From here on, the process is the same as above

In [32]:
response = requests.get(topic_page_url)

In [33]:
response.status_code

200

In [34]:
len(response.text)

475791

In [35]:
topic_doc = BeautifulSoup(response.text, 'html.parser')

In [36]:
h3_selection_class = 'f3 color-fg-muted text-normal lh-condensed'

* Collect all h3 tags that hold the repository owner and repository name links

In [37]:
repo_tags = topic_doc.find_all('h3', {'class': h3_selection_class} )

In [38]:
print(repo_tags[0])

<h3 class="f3 color-fg-muted text-normal lh-condensed">
<a class="Link" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"OWNER","click_visual_representation":"REPOSITORY_OWNER_HEADING","actor_id":null,"record_id":97088,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="4bdbc49d3c05ae7f70b531fbce709a384200b0768554e0172950286a8db30940" data-turbo="false" data-view-component="true" href="/mrdoob">
            mrdoob
</a>          /
          <a class="Link text-bold wb-break-word" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"REPOSITORY","click_visual_representation":"REPOSITORY_NAME_HEADING","actor_id":null,"record_id":576201,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="517d3d5cb9d89752156923904a4238816bc9b51ab7772f3e3644ce897d8dd4e5" data-turbo="false" data-view-component="true" href

In [39]:
len(repo_tags)

20

* Take the first repo tag and extract its a tags (owner link and repository link)

In [40]:
a_tags = repo_tags[0].find_all('a')

In [41]:
a_tags[0].text.strip()

'mrdoob'

In [42]:
a_tags[1].text.strip()

'three.js'

In [43]:
base_url = 'https://github.com'

In [44]:
repo_url = base_url + a_tags[1]['href']
print(repo_url)

https://github.com/mrdoob/three.js


In [45]:
star_tags = topic_doc.find_all('span', { 'class': 'Counter js-social-count'})

In [46]:
len(star_tags)

20

In [47]:
star_tags[0].text.strip()

'94.5k'

* Function to convert a star count such as '94.5k' to an integer: drop the 'k' and multiply by 1000

In [48]:
def parse_star_count(stars_str):
    stars_str = stars_str.strip()
    if stars_str[-1] == 'k':
        return int(float(stars_str[:-1]) * 1000)
    return int(stars_str)

In [49]:
parse_star_count(star_tags[0].text.strip())

94500
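
The function above handles plain integers and the `k` suffix. A possible extension is sketched below under the hypothetical name `parse_count`. It assumes counts might also be rendered with thousands separators or an `m` suffix, which is an assumption and not something shown in the output above.

```python
# Multipliers for the suffixes this sketch understands; 'm' is an assumption.
SUFFIXES = {"k": 1_000, "m": 1_000_000}


def parse_count(count_str):
    """Convert strings such as '94.5k', '1,234' or '875' to an integer."""
    cleaned = count_str.strip().lower().replace(",", "")
    if not cleaned:
        raise ValueError("empty count string")
    multiplier = SUFFIXES.get(cleaned[-1])
    if multiplier is not None:
        # round() guards against a float product landing just below an integer.
        return round(float(cleaned[:-1]) * multiplier)
    return int(cleaned)


print(parse_count("94.5k"), parse_count("1,234"), parse_count("875"))
```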

* Function to get information about a particular repository

In [50]:
def get_repo_info(h3_tag, star_tag):
    """Return (username, repo_name, stars, repo_url) for one repository."""
    # The h3 tag holds two links: the owner first, then the repository
    a_tags = h3_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href']
    stars = parse_star_count(star_tag.text.strip())
    return username, repo_name, stars, repo_url


In [52]:
get_repo_info(repo_tags[0], star_tags[0])

('mrdoob', 'three.js', 94500, 'https://github.com/mrdoob/three.js')
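
Returning a plain tuple means callers have to remember the field order. One option is to return named fields and fail loudly when the expected two links are missing. The sketch below uses a hypothetical `RepoInfo` named tuple and a trimmed-down version of the `<h3>` markup shown earlier; it leaves out the star count to stay short.

```python
from typing import NamedTuple

from bs4 import BeautifulSoup

BASE_URL = "https://github.com"


class RepoInfo(NamedTuple):
    username: str
    repo_name: str
    repo_url: str


def parse_repo_heading(h3_tag):
    """Extract owner, repository name and URL from a repository <h3> tag."""
    a_tags = h3_tag.find_all("a")
    if len(a_tags) < 2:
        raise ValueError("expected owner and repository links in the heading")
    return RepoInfo(
        username=a_tags[0].text.strip(),
        repo_name=a_tags[1].text.strip(),
        repo_url=BASE_URL + a_tags[1]["href"],
    )


# Trimmed sample: most attributes of the real markup are omitted.
sample_html = """
<h3 class="f3 color-fg-muted text-normal lh-condensed">
  <a class="Link" href="/mrdoob"> mrdoob </a> /
  <a class="Link text-bold wb-break-word" href="/mrdoob/three.js"> three.js </a>
</h3>
"""
sample_h3 = BeautifulSoup(sample_html, "html.parser").find("h3")
info = parse_repo_heading(sample_h3)
print(info.username, info.repo_name, info.repo_url)
```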

In [53]:
topic_repos_dict = {
    'username': [],
    'repo_name': [],
    'stars': [],
    'repo_url': []
}


* Loop over all repositories of the 3D topic and collect their details

In [54]:
# Pair each repository heading with its star counter
for repo_tag, star_tag in zip(repo_tags, star_tags):
    username, repo_name, stars, repo_url = get_repo_info(repo_tag, star_tag)
    topic_repos_dict['username'].append(username)
    topic_repos_dict['repo_name'].append(repo_name)
    topic_repos_dict['stars'].append(stars)
    topic_repos_dict['repo_url'].append(repo_url)

In [55]:
print(topic_repos_dict)

{'username': ['mrdoob', 'pmndrs', 'libgdx', 'BabylonJS', 'ssloy', 'lettier', 'aframevr', 'FreeCAD', 'CesiumGS', 'metafizzy', 'blender', 'isl-org', 'timzhang642', 'a1studmuffin', 'domlysz', 'FyroxEngine', 'nerfstudio-project', 'google', 'openscad', 'spritejs'], 'repo_name': ['three.js', 'react-three-fiber', 'libgdx', 'Babylon.js', 'tinyrenderer', '3d-game-shaders-for-beginners', 'aframe', 'FreeCAD', 'cesium', 'zdog', 'blender', 'Open3D', '3D-Machine-Learning', 'SpaceshipGenerator', 'BlenderGIS', 'Fyrox', 'nerfstudio', 'model-viewer', 'openscad', 'spritejs'], 'stars': [94500, 23800, 21900, 21400, 17800, 16000, 15700, 15100, 11000, 10000, 9400, 9400, 9100, 7500, 6700, 6500, 6500, 6000, 5900, 5200], 'repo_url': ['https://github.com/mrdoob/three.js', 'https://github.com/pmndrs/react-three-fiber', 'https://github.com/libgdx/libgdx', 'https://github.com/BabylonJS/Babylon.js', 'https://github.com/ssloy/tinyrenderer', 'https://github.com/lettier/3d-game-shaders-for-beginners', 'https://github.c

In [56]:
topics_repos_df = pd.DataFrame(topic_repos_dict)

In [57]:
topics_repos_df

| | username | repo_name | stars | repo_url |
|---|---|---|---|---|
| 0 | mrdoob | three.js | 94500 | https://github.com/mrdoob/three.js |
| 1 | pmndrs | react-three-fiber | 23800 | https://github.com/pmndrs/react-three-fiber |
| 2 | libgdx | libgdx | 21900 | https://github.com/libgdx/libgdx |
| 3 | BabylonJS | Babylon.js | 21400 | https://github.com/BabylonJS/Babylon.js |
| 4 | ssloy | tinyrenderer | 17800 | https://github.com/ssloy/tinyrenderer |
| 5 | lettier | 3d-game-shaders-for-beginners | 16000 | https://github.com/lettier/3d-game-shaders-for... |
| 6 | aframevr | aframe | 15700 | https://github.com/aframevr/aframe |
| 7 | FreeCAD | FreeCAD | 15100 | https://github.com/FreeCAD/FreeCAD |
| 8 | CesiumGS | cesium | 11000 | https://github.com/CesiumGS/cesium |
| 9 | metafizzy | zdog | 10000 | https://github.com/metafizzy/zdog |


In [58]:
topics_repos_df.to_csv('topics_repos.csv', index=False)  # index=False leaves out the row index column
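
The outline calls for one CSV file per topic, while the cells above cover only the first topic. A rough sketch of the remaining loop follows. The names `topic_filename` and `save_all_topics` are hypothetical, and `scrape_topic` stands for any caller-supplied function that takes a topic URL and returns a list of `(username, repo_name, stars, repo_url)` rows, for example one assembled from the steps above. File names are derived from the topic URL rather than the title, because URLs are unique while cleaned-up titles may collide.

```python
import csv
from pathlib import Path


def topic_filename(topic_url):
    """Derive a file name from the last URL segment, e.g. '.../topics/3d' -> '3d.csv'."""
    slug = topic_url.rstrip("/").rsplit("/", 1)[-1]
    return f"{slug}.csv"


def save_all_topics(topic_urls, scrape_topic, out_dir="."):
    """Write one CSV file per topic URL and return the paths that were written."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    written = []
    for url in topic_urls:
        rows = scrape_topic(url)  # caller-supplied scraping function (assumption)
        target = out_path / topic_filename(url)
        with open(target, "w", newline="", encoding="utf-8") as f:
            writer = csv.writer(f)
            writer.writerow(["username", "repo_name", "stars", "repo_url"])
            writer.writerows(rows)
        written.append(target)
    return written
```

When running this against the live site, a short pause between requests (for example with `time.sleep`) keeps the load on the server low.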

------------------------------------------------------------------------------------------------------------------------------------------------