## Scraping GitHub Topics Repositories

### Pick a website and describe your objective

- Browse through different sites and pick one to scrape.
- Identify the information that you'd like to scrape. Decide the format of the output CSV file.
- Summarize the project idea and outline the strategy.


### Project Outline
- We're going to scrape https://github.com/topics
- We'll get a list of topics. For each topic we'll get the topic title, topic page URL and topic description
- For each topic, we'll get the top repositories listed on the topic page (the first page shows 20)
- For each repository, we'll have the repo name, username, stars and repo URL
- For each topic, we'll create a CSV file in the following format:
  ```
  Repo name,username,stars,repo URL
  three.js,mrdoob,69700,https://github.com/mrdoob/three.js
  ```
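
The target format above can be sketched with Python's built-in `csv` module. This is only an illustration of the file layout, using the header and sample row from the outline; the helper name `rows_to_csv` is hypothetical, and the notebook itself writes its files with pandas.

```python
import csv
import io

def rows_to_csv(rows):
    """Serialize repository rows to CSV text with the header from the outline."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator='\n')
    writer.writerow(['Repo name', 'username', 'stars', 'repo URL'])
    writer.writerows(rows)
    return buffer.getvalue()

# Sample row copied from the outline
sample = [('three.js', 'mrdoob', 69700, 'https://github.com/mrdoob/three.js')]
print(rows_to_csv(sample))
```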

### Use the requests library to download webpages

In [13]:
!pip install requests --upgrade --quiet

In [15]:
import requests

In [17]:
# URL of the topics page
topics_url = 'https://github.com/topics'

In [19]:
# download the page
response = requests.get(topics_url)

In [21]:
# check the status of the download
response.status_code

200
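
A status code of 200 means the request succeeded; other codes in the 2xx range also indicate success, while 4xx and 5xx codes signal errors. A minimal sketch of a reusable check follows. The helper name `ensure_ok` is hypothetical and not part of `requests`; the library's own `response.raise_for_status()` offers a similar built-in check for 4xx and 5xx responses.

```python
def ensure_ok(status_code, url):
    """Raise an error unless the HTTP status code is in the 2xx (success) range."""
    if not 200 <= status_code < 300:
        raise RuntimeError('Failed to load page {} (status {})'.format(url, status_code))
    return True

# Example with literal values; in the notebook you would pass response.status_code
print(ensure_ok(200, 'https://github.com/topics'))
```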

In [23]:
# check the content of the webpage
page_content = response.text

page_content[:1000]

'\n\n<!DOCTYPE html>\n<html\n  lang="en"\n  \n  data-color-mode="auto" data-light-theme="light" data-dark-theme="dark"\n  data-a11y-animated-images="system" data-a11y-link-underlines="true"\n  \n  >\n\n\n\n  <head>\n    <meta charset="utf-8">\n  <link rel="dns-prefetch" href="https://github.githubassets.com">\n  <link rel="dns-prefetch" href="https://avatars.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">\n  <link rel="dns-prefetch" href="https://user-images.githubusercontent.com/">\n  <link rel="preconnect" href="https://github.githubassets.com" crossorigin>\n  <link rel="preconnect" href="https://avatars.githubusercontent.com">\n\n  \n\n  <link crossorigin="anonymous" media="all" rel="stylesheet" href="https://github.githubassets.com/assets/light-7aa84bb7e11e.css" /><link crossorigin="anonymous" media="all" rel="stylesheet" href="https://github.githubassets.com/assets/dark-f65db3e8d171.css" /><link data-color-theme="dark_dimmed" cross

In [33]:
# Save the page content in HTML
with open('webpage.html', 'w',  encoding='utf-8') as f:
    f.write(page_content)

### Use Beautiful Soup to parse and extract information

In [35]:
!pip install beautifulsoup4 --upgrade --quiet

In [37]:
from bs4 import BeautifulSoup

In [117]:
# parse the document
doc = BeautifulSoup(page_content, 'html.parser')

In [43]:
# check the type of the doc
type(doc)

bs4.BeautifulSoup

In [53]:
# Find the class for topics by inspecting the webpage 

selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'

topic_title_tags = doc.find_all('p', {'class': selection_class})

In [55]:
len(topic_title_tags)

30
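
Passing the whole class string to `find_all` matches tags whose `class` attribute equals that exact string, so the lookup can stop matching if the site reorders or adds classes. A hedged alternative is a CSS selector via `select`, which matches individual classes regardless of order. The HTML snippet below is a made-up stand-in for the real page, and choosing `f3` plus `Link--primary` as the identifying classes is an assumption.

```python
from bs4 import BeautifulSoup

# Stand-in HTML (not the real GitHub page): same classes, different order
html = '''
<p class="mt-1 f3 lh-condensed mb-0 Link--primary">3D</p>
<p class="f5 color-fg-muted mb-0 mt-1">A description</p>
'''
sample_doc = BeautifulSoup(html, 'html.parser')

# Match <p> tags that carry both classes, in any order
title_tags = sample_doc.select('p.f3.Link--primary')
titles = [tag.text.strip() for tag in title_tags]
print(titles)
```

The trade-off is that a looser selector may also pick up unrelated tags, so it is worth checking the number of matches, as done with `len(...)` above.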

In [57]:
topic_title_tags[:5]

[<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Algorithm</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amp</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Android</p>]

In [59]:
# find the topic description
desc_selector = 'f5 color-fg-muted mb-0 mt-1'

topic_desc_tags = doc.find_all('p', {'class': desc_selector})

In [61]:
topic_desc_tags[:5]

[<p class="f5 color-fg-muted mb-0 mt-1">
           3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Ajax is a technique for creating interactive web applications.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Algorithms are self-contained sequences that carry out a variety of tasks.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Amp is a non-blocking concurrency library for PHP.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Android is an operating system built by Google designed for mobile devices.
         </p>]

In [67]:
# now grab the topic url

topic_title_tag0 = topic_title_tags[0]

In [69]:
div_tag = topic_title_tag0.parent

In [71]:
topic_link_tags = doc.find_all('a', {'class' : 'no-underline flex-1 d-flex flex-column'})

In [73]:
len(topic_link_tags)

30

In [79]:
# verify if we fetched the correct topic link
topic0_url = 'https://github.com' + topic_link_tags[0]['href']
print(topic0_url)

https://github.com/topics/3d
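
String concatenation works here because every `href` on the page is a path starting with `/`. As a sketch of a more general approach, the standard library's `urljoin` resolves a link against a base URL and leaves already-absolute links untouched. The second link below is an invented example.

```python
from urllib.parse import urljoin

base_url = 'https://github.com'

# A relative link is resolved against the base URL
relative = urljoin(base_url, '/topics/3d')

# An absolute link (hypothetical example) is returned unchanged
absolute = urljoin(base_url, 'https://example.com/page')

print(relative)
print(absolute)
```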


In [83]:
# create a list of topic titles

topic_titles = []
for tag in topic_title_tags:
    topic_titles.append(tag.text)

print(topic_titles)

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android', 'Angular', 'Ansible', 'API', 'Arduino', 'ASP.NET', 'Awesome Lists', 'Amazon Web Services', 'Azure', 'Babel', 'Bash', 'Bitcoin', 'Bootstrap', 'Bot', 'C', 'Chrome', 'Chrome extension', 'Command-line interface', 'Clojure', 'Code quality', 'Code review', 'Compiler', 'Continuous integration', 'C++', 'Cryptocurrency', 'Crystal']


In [101]:
topic_descriptions = []

for tag in topic_desc_tags:
    topic_descriptions.append(tag.text.strip())

topic_descriptions[:5]

['3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.',
 'Ajax is a technique for creating interactive web applications.',
 'Algorithms are self-contained sequences that carry out a variety of tasks.',
 'Amp is a non-blocking concurrency library for PHP.',
 'Android is an operating system built by Google designed for mobile devices.']

In [115]:
topic_urls = []
base_url = 'https://github.com'

for tag in topic_link_tags:
    topic_urls.append(base_url + tag['href'])

topic_urls[:5]

['https://github.com/topics/3d',
 'https://github.com/topics/ajax',
 'https://github.com/topics/algorithm',
 'https://github.com/topics/amphp',
 'https://github.com/topics/android']

In [120]:
# create a pandas dataframe out of the extracted info

import pandas as pd

In [122]:
topics_dict = {
    'Title': topic_titles,
    'Description': topic_descriptions,
    'URL': topic_urls
              }
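
pandas expects every list in this dictionary to have the same length and raises a `ValueError` otherwise, so if one selector misses a tag, the columns can no longer be paired up. Below is a small sketch of an explicit check to run before building the DataFrame; `check_equal_lengths` is a hypothetical helper and the sample lists are shortened stand-ins for the scraped data.

```python
def check_equal_lengths(columns):
    """Raise ValueError if the lists in a column dictionary differ in length."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError('Column lengths differ: {}'.format(lengths))
    return lengths

# Shortened stand-in data
sample_columns = {
    'Title': ['3D', 'Ajax'],
    'Description': ['3D refers to ...', 'Ajax is a technique ...'],
    'URL': ['https://github.com/topics/3d', 'https://github.com/topics/ajax'],
}
print(check_equal_lengths(sample_columns))
```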

In [132]:
topics_df = pd.DataFrame(topics_dict)
topics_df.head(5)

| | Title | Description | URL |
|---|---|---|---|
| 0 | 3D | 3D refers to the use of three-dimensional grap... | https://github.com/topics/3d |
| 1 | Ajax | Ajax is a technique for creating interactive w... | https://github.com/topics/ajax |
| 2 | Algorithm | Algorithms are self-contained sequences that c... | https://github.com/topics/algorithm |
| 3 | Amp | Amp is a non-blocking concurrency library for ... | https://github.com/topics/amphp |
| 4 | Android | Android is an operating system built by Google... | https://github.com/topics/android |


### Create a CSV file with the extracted information

In [134]:
topics_df.to_csv('Github_Topics.csv', index=None)

### Getting information out of a topic page

In [137]:
topic_page_url = topic_urls[0]

In [139]:
topic_page_url

'https://github.com/topics/3d'

In [141]:
response  = requests.get(topic_page_url)

In [143]:
response.status_code

200

In [151]:
len(response.text)

522605

In [147]:
topic_doc = BeautifulSoup(response.text, 'html.parser')

In [254]:
h3_selection_class = 'f3 color-fg-muted text-normal lh-condensed'
repo_tags = topic_doc.find_all('h3', {'class': h3_selection_class})

In [165]:
len(repo_tags)

20

In [169]:
# check the repo tag
repo_tags[0]

<h3 class="f3 color-fg-muted text-normal lh-condensed">
<a class="Link" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"OWNER","click_visual_representation":"REPOSITORY_OWNER_HEADING","actor_id":null,"record_id":97088,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="c72fbd5c69a8ee7c9c53a4e65de2b93c8fc7552dd793945819639bc165c0f0ba" data-turbo="false" data-view-component="true" href="/mrdoob">mrdoob</a>          /
          <a class="Link text-bold wb-break-word" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"REPOSITORY","click_visual_representation":"REPOSITORY_NAME_HEADING","actor_id":null,"record_id":576201,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="4a2667db3d63a1739c412e059e5da95afe419df83f70949b5d59dc3478f5c79a" data-turbo="false" data-view-component="true" href="/mrdoob/thre

In [181]:
a_tags = repo_tags[0].find_all('a')

In [177]:
a_tags[0].text.strip()

'mrdoob'

In [183]:
a_tags[1].text.strip()

'three.js'

In [193]:
repo_url = base_url + a_tags[1]['href']
repo_url

'https://github.com/mrdoob/three.js'

In [195]:
star_tags = topic_doc.find_all('span', {'class': 'Counter js-social-count'})

In [197]:
len(star_tags)

20

In [201]:
star_tags[0].text.strip()

'105k'

In [229]:
# convert a star count such as '105k' into an integer

def parse_star_count(stars_str):
    stars_str = stars_str.strip()
    if stars_str[-1] == 'k':
        return int(float(stars_str[:-1]) * 1000)
    return int(stars_str)

In [231]:
parse_star_count(star_tags[0].text.strip())

105000
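
The function above handles plain numbers and the `k` suffix. A slightly more defensive sketch follows, assuming counts might also appear with thousands separators or an `m` suffix (an assumption; neither shows up in the output above) and using `round` so that floating-point error cannot truncate the result. The name `parse_count` is hypothetical.

```python
def parse_count(text):
    """Convert strings such as '105k', '8.1k', '1,234' or '950' to an integer."""
    text = text.strip().lower().replace(',', '')
    multipliers = {'k': 1_000, 'm': 1_000_000}
    if text and text[-1] in multipliers:
        # round() avoids truncation from floating-point representation error
        return round(float(text[:-1]) * multipliers[text[-1]])
    return int(text)

for raw in ['105k', '8.1k', '1,234', '950']:
    print(raw, parse_count(raw))
```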

In [242]:
def get_repo_info(h3_tag, star_tag):
    # return the username, repo name, star count and URL of one repository
    a_tags = h3_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href']
    stars = parse_star_count(star_tag.text.strip())
    return username, repo_name, stars, repo_url

In [244]:
get_repo_info(repo_tags[0], star_tags[0])

('mrdoob', 'three.js', 105000, 'https://github.com/mrdoob/three.js')

In [248]:
# collect the info for every repository on the topic page
topic_repos_dict = {
    'username': [],
    'repo_name': [],
    'stars': [],
    'repo_url': []
}

for i in range(len(repo_tags)):
    repo_info = get_repo_info(repo_tags[i], star_tags[i])
    topic_repos_dict['username'].append(repo_info[0])
    topic_repos_dict['repo_name'].append(repo_info[1])
    topic_repos_dict['stars'].append(repo_info[2])
    topic_repos_dict['repo_url'].append(repo_info[3])

In [250]:
topic_repos_dict

{'username': ['mrdoob',
  'pmndrs',
  'libgdx',
  'BabylonJS',
  'FreeCAD',
  'ssloy',
  'lettier',
  'aframevr',
  'blender',
  'CesiumGS',
  '4ian',
  'isl-org',
  'MonoGame',
  'mapbox',
  'metafizzy',
  'nerfstudio-project',
  'timzhang642',
  'cocos',
  'FyroxEngine',
  'domlysz'],
 'repo_name': ['three.js',
  'react-three-fiber',
  'libgdx',
  'Babylon.js',
  'FreeCAD',
  'tinyrenderer',
  '3d-game-shaders-for-beginners',
  'aframe',
  'blender',
  'cesium',
  'GDevelop',
  'Open3D',
  'MonoGame',
  'mapbox-gl-js',
  'zdog',
  'nerfstudio',
  '3D-Machine-Learning',
  'cocos-engine',
  'Fyrox',
  'BlenderGIS'],
 'stars': [105000,
  28300,
  23800,
  23600,
  23300,
  21300,
  18400,
  16900,
  14300,
  13300,
  13000,
  12000,
  11800,
  11400,
  10400,
  9900,
  9900,
  8400,
  8100,
  8000],
 'repo_url': ['https://github.com/mrdoob/three.js',
  'https://github.com/pmndrs/react-three-fiber',
  'https://github.com/libgdx/libgdx',
  'https://github.com/BabylonJS/Babylon.js',
  'htt

In [252]:
# convert it into dataframe

topic_repos_df = pd.DataFrame(topic_repos_dict)
topic_repos_df

| | username | repo_name | stars | repo_url |
|---|---|---|---|---|
| 0 | mrdoob | three.js | 105000 | https://github.com/mrdoob/three.js |
| 1 | pmndrs | react-three-fiber | 28300 | https://github.com/pmndrs/react-three-fiber |
| 2 | libgdx | libgdx | 23800 | https://github.com/libgdx/libgdx |
| 3 | BabylonJS | Babylon.js | 23600 | https://github.com/BabylonJS/Babylon.js |
| 4 | FreeCAD | FreeCAD | 23300 | https://github.com/FreeCAD/FreeCAD |
| 5 | ssloy | tinyrenderer | 21300 | https://github.com/ssloy/tinyrenderer |
| 6 | lettier | 3d-game-shaders-for-beginners | 18400 | https://github.com/lettier/3d-game-shaders-for... |
| 7 | aframevr | aframe | 16900 | https://github.com/aframevr/aframe |
| 8 | blender | blender | 14300 | https://github.com/blender/blender |
| 9 | CesiumGS | cesium | 13300 | https://github.com/CesiumGS/cesium |


## Final Code

To get topic titles, we can pick `p` tags with the class `f3 lh-condensed mb-0 mt-1 Link--primary`.

![](https://i.imgur.com/DOJxbbw.png)

In [367]:
def get_topic_page(topic_url):
    # Download the page
    response = requests.get(topic_url)
    # Check successful response
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topic_url))
    # Parse using Beautiful Soup
    topic_doc = BeautifulSoup(response.text, 'html.parser')
    return topic_doc
    
def get_repo_info(h3_tag, star_tag):
    # return the username, repo name, star count and URL of one repository
    a_tags = h3_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href']
    stars = parse_star_count(star_tag.text.strip())
    return username, repo_name, stars, repo_url
    
def get_topic_repos(topic_doc):
    
    # Get the h3 tags containing repo title, username and repo URL
    h3_selection_class = 'f3 color-fg-muted text-normal lh-condensed'
    repo_tags = topic_doc.find_all('h3', {'class': h3_selection_class})

    # Get Star tags
    star_tags = topic_doc.find_all('span', {'class': 'Counter js-social-count'})

    # Get repo info
    topic_repos_dict = {
        'username': [],
        'repo_name':[],
        'stars': [],
        'repo_url': []
    }
    
    for i in range(len(repo_tags)):
        repo_info = get_repo_info(repo_tags[i], star_tags[i])
        topic_repos_dict['username'].append(repo_info[0])       
        topic_repos_dict['repo_name'].append(repo_info[1])
        topic_repos_dict['stars'].append(repo_info[2])
        topic_repos_dict['repo_url'].append(repo_info[3])

    return pd.DataFrame(topic_repos_dict)

import os

def scrape_topic(topic_url, path):
    # path is the full name of the CSV file, including the .csv extension
    if os.path.exists(path):
        print('The file {} already exists. Skipping...'.format(path))
        return
    topic_df = get_topic_repos(get_topic_page(topic_url))
    topic_df.to_csv(path, index=None)
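
The skip-if-exists check only helps when the path that is tested is the same path that is written. Below is a self-contained sketch of that pattern using a temporary directory; `write_once` is a hypothetical helper that writes plain text rather than a DataFrame.

```python
import os
import tempfile

def write_once(path, text):
    """Write text to path unless the file already exists; return True if written."""
    if os.path.exists(path):
        return False
    with open(path, 'w', encoding='utf-8') as f:
        f.write(text)
    return True

with tempfile.TemporaryDirectory() as folder:
    target = os.path.join(folder, '3D.csv')
    first = write_once(target, 'username,repo_name\n')     # file is created
    second = write_once(target, 'this would overwrite\n')  # file exists, skipped
    print(first, second)
```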

In [322]:
topic_urls[6]

'https://github.com/topics/ansible'

In [308]:
get_topic_repos(get_topic_page(topic_urls[6])).to_csv('ansible.csv', index = None)

Write a single function to:
1. Get the list of topics from the topics page
2. Get the list of top repos from individual topic pages
3. For each topic, create a CSV of the top repos for the topic

In [324]:
def get_topic_titles(doc):
    selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
    topic_title_tags = doc.find_all('p', {'class': selection_class})
    topic_titles = []
    for tag in topic_title_tags:
        topic_titles.append(tag.text)
    return topic_titles

def get_topic_descs(doc):
    desc_selector = 'f5 color-fg-muted mb-0 mt-1'
    topic_desc_tags = doc.find_all('p', {'class': desc_selector})
    topic_descriptions = []
    for tag in topic_desc_tags:
        topic_descriptions.append(tag.text.strip())
    return topic_descriptions

def get_topic_urls(doc):
    topic_link_tags = doc.find_all('a', {'class' : 'no-underline flex-1 d-flex flex-column'})
    topic_urls = []
    base_url = 'https://github.com'
    for tag in topic_link_tags:
        topic_urls.append(base_url + tag['href'])
    return topic_urls

def scrape_topics():
    topics_url = 'https://github.com/topics'
    response = requests.get(topics_url)
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topics_url))
    # parse the downloaded page here instead of relying on a global doc
    doc = BeautifulSoup(response.text, 'html.parser')
    topics_dict = {
        'title': get_topic_titles(doc),
        'description': get_topic_descs(doc),
        'url': get_topic_urls(doc)
    }
    return pd.DataFrame(topics_dict)



In [356]:
def scrape_topic_repos():
    print('Scraping List of Topics:')
    topics_df = scrape_topics()

    os.makedirs('Github_Topics_data', exist_ok= True)
    
    for index, row in topics_df.iterrows():
        print('Scraping top repositories for "{}"'.format(row['title']))
        scrape_topic(row['url'], 'Github_Topics_data/{}.csv'.format(row['title']))
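
Topic titles are used directly as file names here. A title containing a character such as `/` would break the path on most systems, so one hedged option is to normalize the title first. `safe_filename` is a hypothetical helper, and the replacement rule (keep letters, digits, `.`, `+`, `-` and `_`; turn everything else into `_`) is an assumption.

```python
import re

def safe_filename(title):
    """Replace runs of characters outside a conservative whitelist with '_'."""
    return re.sub(r'[^A-Za-z0-9.+_-]+', '_', title.strip())

for title in ['3D', 'C++', 'ASP.NET', 'Command-line interface']:
    print(safe_filename(title) + '.csv')
```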

In [358]:
scrape_topic_repos()

Scraping List of Topics:
Scraping top repositories for "3D"
Scraping top repositories for "Ajax"
Scraping top repositories for "Algorithm"
Scraping top repositories for "Amp"
Scraping top repositories for "Android"
Scraping top repositories for "Angular"
Scraping top repositories for "Ansible"
Scraping top repositories for "API"
Scraping top repositories for "Arduino"
Scraping top repositories for "ASP.NET"
Scraping top repositories for "Awesome Lists"
Scraping top repositories for "Amazon Web Services"
Scraping top repositories for "Azure"
Scraping top repositories for "Babel"
Scraping top repositories for "Bash"
Scraping top repositories for "Bitcoin"
Scraping top repositories for "Bootstrap"
Scraping top repositories for "Bot"
Scraping top repositories for "C"
Scraping top repositories for "Chrome"
Scraping top repositories for "Chrome extension"
Scraping top repositories for "Command-line interface"
Scraping top repositories for "Clojure"
Scraping top repositories for "Code quality