# Scraping Top Repositories for Topics on GitHub

Tools used: Python, requests, BeautifulSoup, Pandas

### Project Outline

- Scrape the 'topics' page of GitHub https://github.com/topics
- Get list of all the topics and for each topic, get the topic title, topic description and topic URL
- For each topic, get the top n repositories
- For each repository, get the repo name, username, stars and repo URL
- For each topic, create a CSV file in the following format:
'''
Repo Name, Username, Stars, URL
'''

## Scrape the list of topics from GitHub

- Using requests library to download the page
- Using bs4 to parse and extract information
- Convert that into a pandas dataframe

To do that, we first write a function that downloads the page:

In [1]:
import requests
from bs4 import BeautifulSoup

def get_topics_page():
    # returns a Beautiful Soup Document containing a parsed web page containing list of topics
    topics_url = 'https://github.com/topics'
    # Download the page
    response = requests.get(topics_url)
    # Check successful response 
    if response.status_code != 200:
        raise Exception(f'failed to load page {topics_url}')
    # Parse the page using Beautiful Soup
    doc = BeautifulSoup(response.text, 'html.parser')
    return doc

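The function downloads the topics page with requests.get, raises an exception if the status code is not 200, and returns the page parsed into a Beautiful Soup document.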
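
A single failed request stops the whole run, so it may be worth retrying before giving up. The sketch below is an assumption layered on top of the notebook, not part of it: `download_page` and its parameters are hypothetical names, and the `get` argument stands in for `requests.get` so that the logic can be tried without network access.

```python
import time

def download_page(url, get, attempts=3, delay=1.0):
    """Return the body of `url`, retrying on non-200 responses.

    `get` is any callable that takes a URL and returns an object with
    `status_code` and `text` attributes (for example `requests.get`).
    All names here are hypothetical helpers, not part of the notebook.
    """
    last_status = None
    for attempt in range(attempts):
        response = get(url)
        if response.status_code == 200:
            return response.text
        last_status = response.status_code
        # Wait a little longer after each failure, but not after the last one
        if attempt < attempts - 1:
            time.sleep(delay * (attempt + 1))
    raise Exception(f'failed to load page {url} (last status {last_status})')
```

With the notebook's imports in place, the returned text could be parsed as before, for example `BeautifulSoup(download_page(topics_url, requests.get), 'html.parser')`.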

In [2]:
doc = get_topics_page()
type(doc)

bs4.BeautifulSoup

In [3]:
doc.find('a')      # returns the first 'a' tag in the document

<a class="px-2 py-4 color-bg-accent-emphasis color-fg-on-emphasis show-on-focus js-skip-to-content" href="#start-of-content">Skip to content</a>

Next come the functions that extract information about each topic.

In [4]:
# Get the list of topic titles
def get_topic_titles(doc):
    # Tags holding the title text; the class names were found by inspecting the page in the browser
    topic_title_tags = doc.find_all('p', class_='f3 lh-condensed mb-0 mt-1 Link--primary')
    # Extract the text of each tag
    return [tag.text for tag in topic_title_tags]

In [5]:
get_topic_titles(doc)[:5]   # 'doc' is the document returned by get_topics_page()

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android']
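
Passing a multi-class string to `class_` matches only when the tag's `class` attribute is exactly that string, so the lookup can silently return nothing if the page reorders or adds a class. A possible alternative, sketched here on a made-up HTML fragment rather than the live page, is a CSS selector that names only a subset of the classes. The fragment and the choice of `f3` and `lh-condensed` as the identifying classes are assumptions.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment imitating two topic title tags; the second one
# has a different class order and an extra class.
sample_html = '''
<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
<p class="lh-condensed f3 mb-0 mt-1 Link--primary extra">Ajax</p>
'''
sample_doc = BeautifulSoup(sample_html, 'html.parser')

# Exact string match on the class attribute: only the first tag qualifies
exact = sample_doc.find_all('p', class_='f3 lh-condensed mb-0 mt-1 Link--primary')

# CSS selector: any <p> that carries both classes, in any order
flexible = sample_doc.select('p.f3.lh-condensed')

exact_titles = [tag.text for tag in exact]
flexible_titles = [tag.text for tag in flexible]
```

The trade-off is that a shorter selector may also match unrelated tags, so the result is still worth checking against the page.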

In [6]:
# Get the list of topic descriptions
def get_topic_descs(doc):
    # Tags holding the description text; the class names were found by inspecting the page in the browser
    topic_desc_tags = doc.find_all('p', class_='f5 color-fg-muted mb-0 mt-1')
    # Extract the text of each tag, without surrounding whitespace
    return [tag.text.strip() for tag in topic_desc_tags]

In [7]:
get_topic_descs(doc)[:5]

['3D modeling is the process of virtually developing the surface and structure of a 3D object.',
 'Ajax is a technique for creating interactive web applications.',
 'Algorithms are self-contained sequences that carry out a variety of tasks.',
 'Amp is a non-blocking concurrency library for PHP.',
 'Android is an operating system built by Google designed for mobile devices.']

In [8]:
# Get the list of topic URLs
def get_topic_urls(doc):
    # Tags holding the link to each topic; the class names were found by inspecting the page in the browser
    topic_url_tags = doc.find_all('a', class_='no-underline flex-1 d-flex flex-column')
    # Each href is relative, so prepend the base url
    return ['https://github.com' + tag['href'] for tag in topic_url_tags]

In [9]:
get_topic_urls(doc)[:5]

['https://github.com/topics/3d',
 'https://github.com/topics/ajax',
 'https://github.com/topics/algorithm',
 'https://github.com/topics/amphp',
 'https://github.com/topics/android']

All of the functions above are now combined in a single function that returns a dataframe of all topics.

In [10]:
import pandas as pd

# Returns a dataframe with the title, description and url of each GitHub topic
def scrape_topics():
    # Get the parsed page
    doc = get_topics_page()
    # Dictionary of lists, one list per column
    topics_dict = {
        'title': get_topic_titles(doc),        # list of titles
        'description': get_topic_descs(doc),   # list of descriptions
        'url': get_topic_urls(doc)             # list of urls
    }

    return pd.DataFrame(topics_dict)

In [11]:
scrape_topics().head()

| | title | description | url |
|---|---|---|---|
| 0 | 3D | 3D modeling is the process of virtually develo... | https://github.com/topics/3d |
| 1 | Ajax | Ajax is a technique for creating interactive w... | https://github.com/topics/ajax |
| 2 | Algorithm | Algorithms are self-contained sequences that c... | https://github.com/topics/algorithm |
| 3 | Amp | Amp is a non-blocking concurrency library for ... | https://github.com/topics/amphp |
| 4 | Android | Android is an operating system built by Google... | https://github.com/topics/android |
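
`pd.DataFrame` raises a `ValueError` when the lists in the dictionary have different lengths, which would happen if one of the three lookups matched fewer tags than the others. A small check before building the dataframe could make that failure easier to diagnose. `check_column_lengths` is a hypothetical helper, shown only as a sketch.

```python
def check_column_lengths(columns):
    """Raise a descriptive error if the lists in `columns` differ in length.

    `columns` maps a column name to a list of values, like `topics_dict`.
    Returns the length of each column when they all agree.
    """
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f'column lengths differ: {lengths}')
    return lengths
```

Calling it on `topics_dict` just before `pd.DataFrame(topics_dict)` would report which column came up short.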


## Get the top repositories from a topic page

In [12]:
# it takes the url of a topic page and returns its parsed page
def get_topic_page(topic_url):
    # Download the page
    response = requests.get(topic_url)
    # Check successful response 
    if response.status_code != 200:
        raise Exception(f'failed to load page {topic_url}')
    # Parse the page using Beautiful Soup
    topic_doc = BeautifulSoup(response.text, 'html.parser')
    return topic_doc

In [13]:
topic_doc = get_topic_page('https://github.com/topics/3d')

- To get the top repos of a topic, the 'username', 'repo name', 'repo url' and 'stars' need to be extracted.

- Inspecting the page shows that the username, repo name and repo url sit in two child 'a' tags of a parent 'h3' tag.
- So the 'h3' tags are extracted first, and their child 'a' tags are read from them.
- Stars are found in separate 'span' tags.

- To get the info of one repository, its 'h3' tag (for username, repo name and repo url) and the corresponding star tag are passed to the function below, which returns the username, repo name, repo url and stars of that repository.
- A separate function converts the star count to an integer.

In [14]:
# Takes the star string (e.g. '84.1k') and returns the number of stars as an integer
def parse_star_tags(star_str):
    star_str = star_str.strip()
    # A trailing 'k' means thousands
    if star_str.endswith('k'):
        return int(float(star_str[:-1]) * 1000)
    return int(star_str)
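
The conversion above goes through a binary float and then truncates with `int()`. Most decimal fractions have no exact float representation, so in principle a product could land just below a whole number and lose one star. A possible variant, sketched here under the assumption that labels might also carry an upper-case suffix or thousands separators, uses `Decimal` so that the multiplication is exact. `parse_star_count` is a hypothetical name.

```python
from decimal import Decimal

def parse_star_count(star_str):
    """Convert a star label such as '84.1k' or '950' to an integer.

    Hypothetical variant of parse_star_tags: Decimal keeps the
    multiplication exact, and the input is normalised first
    (lower-cased, commas removed), which is an assumption about
    what the page might show.
    """
    text = star_str.strip().lower().replace(',', '')
    if text.endswith('k'):
        return int(Decimal(text[:-1]) * 1000)
    return int(text)
```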

In [15]:
def get_repo_info(h_tag, star_tag):
    # Returns all the information required about a repository

    # Get the child 'a' tags of the parent 'h3' tag
    a_tags = h_tag.find_all('a')
    # The first child tag holds the username
    username = a_tags[0].text.strip()
    # The second child tag holds the repo name
    repo_name = a_tags[1].text.strip()
    # The href of the second child tag is relative, so prepend the base url
    repo_url = 'https://github.com' + a_tags[1]['href']
    # Convert the star count to an integer
    stars = parse_star_tags(star_tag.text.strip())
    return username, repo_name, repo_url, stars

In [16]:
# Getting information of first repository of '3d' topic

h_tag = topic_doc.find('h3', class_ = 'f3 color-fg-muted text-normal lh-condensed')
star_tag = topic_doc.find('span', {'class' : 'Counter js-social-count'})
get_repo_info(h_tag, star_tag)

('mrdoob', 'three.js', 'https://github.com/mrdoob/three.js', 84100)
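
The repo url in the output already contains both the username and the repo name, so the relative href can serve as a cross-check on the text read from the two 'a' tags. The helper below is a hypothetical sketch that assumes a repository href always has exactly two path segments.

```python
def split_repo_path(href):
    """Split a repository path like '/mrdoob/three.js' into (owner, repo).

    Hypothetical helper; raises ValueError when the path does not have
    exactly two segments.
    """
    parts = href.strip('/').split('/')
    if len(parts) != 2:
        raise ValueError(f'unexpected repository path: {href!r}')
    owner, repo = parts
    return owner, repo
```

Comparing its result with the `username` and `repo_name` returned by `get_repo_info` would flag cases where the tag text and the link disagree.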

The following function takes the parsed page of a topic and returns all of its repositories as a pandas dataframe.


In [17]:
def get_topic_repos(topic_doc):
    # Get the list of 'h3' tags, which contain the username, repo name and repo url
    repo_tags = topic_doc.find_all('h3', class_='f3 color-fg-muted text-normal lh-condensed')
    # Get the list of star tags
    star_tags = topic_doc.find_all('span', {'class': 'Counter js-social-count'})
    # Dictionary of lists, one list per column
    topic_repos_dict = {
        'username': [],
        'repo name': [],
        'repo url': [],
        'stars': []
    }
    # Pair each repo tag with its star tag and collect the extracted values
    for repo_tag, star_tag in zip(repo_tags, star_tags):
        username, repo_name, repo_url, stars = get_repo_info(repo_tag, star_tag)
        topic_repos_dict['username'].append(username)
        topic_repos_dict['repo name'].append(repo_name)
        topic_repos_dict['repo url'].append(repo_url)
        topic_repos_dict['stars'].append(stars)
    # Return a dataframe of repositories built from the dictionary
    return pd.DataFrame(topic_repos_dict)


In [18]:
# First getting the parsed page of 3D topic
topic_doc = get_topic_page('https://github.com/topics/3d')
# Then passing it in below function to get dataframe of top repos
get_topic_repos(topic_doc).head()

| | username | repo name | repo url | stars |
|---|---|---|---|---|
| 0 | mrdoob | three.js | https://github.com/mrdoob/three.js | 84100 |
| 1 | libgdx | libgdx | https://github.com/libgdx/libgdx | 20300 |
| 2 | pmndrs | react-three-fiber | https://github.com/pmndrs/react-three-fiber | 19000 |
| 3 | BabylonJS | Babylon.js | https://github.com/BabylonJS/Babylon.js | 18000 |
| 4 | aframevr | aframe | https://github.com/aframevr/aframe | 14400 |
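
The outline mentions the top n repositories, while `get_topic_repos` returns every repository found on the page, in page order. One way to cut the result down to n rows is pandas' `nlargest`. The sample dataframe and the `top_n_repos` helper below are hypothetical and only illustrate the idea.

```python
import pandas as pd

# Hypothetical sample in the same shape as the output of get_topic_repos
repos_df = pd.DataFrame({
    'username': ['libgdx', 'mrdoob', 'aframevr'],
    'repo name': ['libgdx', 'three.js', 'aframe'],
    'stars': [20300, 84100, 14400],
})

def top_n_repos(df, n):
    """Return the n rows with the most stars, highest first."""
    return df.nlargest(n, 'stars').reset_index(drop=True)

top_two = top_n_repos(repos_df, 2)
```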


The following function exports the dataframe of the top repos of a topic to a CSV file.

In [19]:
import os

# Takes the topic url and the path of the CSV file to write
def scrape_topic(topic_url, path):
    # Skip the topic if its CSV file already exists
    if os.path.exists(path):
        print('The file {} already exists. Skipping..'.format(path))
        return
    # Scrape the top repos into a dataframe
    topic_repos_df = get_topic_repos(get_topic_page(topic_url))
    # Save the dataframe as a CSV file, without the index column
    topic_repos_df.to_csv(path, index=False)

In [20]:
scrape_topic('https://github.com/topics/3d', 'top_repos/3d.csv')

The file top_repos/3d.csv already exists. Skipping..
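
`to_csv` does not create missing folders, so on a fresh machine a call like the one above would fail unless the target folder already exists. A minimal sketch of a guard that could run before saving follows; `ensure_parent_dir` is a hypothetical helper name.

```python
import os

def ensure_parent_dir(path):
    """Create the folder that will hold `path`, if it does not exist yet.

    Returns the folder name, or an empty string for a bare file name.
    """
    parent = os.path.dirname(path)
    # A bare file name has no parent component, so there is nothing to create
    if parent:
        os.makedirs(parent, exist_ok=True)
    return parent
```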


## Getting top repositories of all topics and saving them in CSV format

- We have a function that gets the list of topics (a dataframe with topic information)
- We have a function that creates a CSV file from the scraped repos of a topic
- Now we put them together in the following function, which saves a CSV file of the top repos for every topic

In [21]:
def scrape_topics_repos():
    print('Scraping topics..')
    # Getting the dataframe containing info of topics
    topics_df = scrape_topics()
    print('Scraping top repositories of each topic and saving in CSV format\n')
    # Creating a folder to save CSV files of top repos
    os.makedirs('Repos', exist_ok = True)
    # Iterating through the topics dataframe to save top repos of each topic
    for index, row in topics_df.iterrows():
        print('Scraping top repositories for the topic "{}"'.format(row['title']))
        scrape_topic(row['url'], 'Repos/{}.csv'.format(row['title']))
    print('Process completed')
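
Topic titles are display names and may contain characters that are awkward in file names; a slash, for instance, would be read as a folder separator. The last segment of the topic url is already URL-safe, so it could serve as the file name instead. The sketch below shows one way to derive it; `topic_file_name` is a hypothetical helper, not part of the notebook.

```python
from urllib.parse import urlparse

def topic_file_name(topic_url):
    """Derive a CSV file name from the last path segment of a topic URL."""
    slug = urlparse(topic_url).path.rstrip('/').split('/')[-1]
    if not slug:
        raise ValueError(f'no topic slug in {topic_url!r}')
    return f'{slug}.csv'
```

Inside the loop, `'Repos/' + topic_file_name(row['url'])` could then replace the title-based path.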

Now we run the function to scrape the top repos of all the topics on the first page of https://github.com/topics

In [22]:
scrape_topics_repos()

Scraping topics..
Scraping top repositories of each topic and saving in CSV format

Scraping top repositories for the topic "3D"
The file Repos/3D.csv already exists. Skipping..
Scraping top repositories for the topic "Ajax"
The file Repos/Ajax.csv already exists. Skipping..
Scraping top repositories for the topic "Algorithm"
The file Repos/Algorithm.csv already exists. Skipping..
Scraping top repositories for the topic "Amp"
The file Repos/Amp.csv already exists. Skipping..
Scraping top repositories for the topic "Android"
The file Repos/Android.csv already exists. Skipping..
Scraping top repositories for the topic "Angular"
The file Repos/Angular.csv already exists. Skipping..
Scraping top repositories for the topic "Ansible"
The file Repos/Ansible.csv already exists. Skipping..
Scraping top repositories for the topic "API"
The file Repos/API.csv already exists. Skipping..
Scraping top repositories for the topic "Arduino"
The file Repos/Arduino.csv already exists. Skipping..
Scraping

Checking one of the CSV files

In [23]:
# Read and display a CSV using Pandas

pd.read_csv('Repos/ASP.NET.csv')

| | username | repo name | repo url | stars |
|---|---|---|---|---|
| 0 | dotnet | AspNetCore.Docs | https://github.com/dotnet/AspNetCore.Docs | 10800 |
| 1 | aspnetboilerplate | aspnetboilerplate | https://github.com/aspnetboilerplate/aspnetboi... | 10500 |
| 2 | bitwarden | server | https://github.com/bitwarden/server | 10200 |
| 3 | abpframework | abp | https://github.com/abpframework/abp | 8400 |
| 4 | nopSolutions | nopCommerce | https://github.com/nopSolutions/nopCommerce | 7400 |
| 5 | ElectronNET | Electron.NET | https://github.com/ElectronNET/Electron.NET | 6400 |
| 6 | RicoSuter | NSwag | https://github.com/RicoSuter/NSwag | 5300 |
| 7 | thangchung | clean-code-dotnet | https://github.com/thangchung/clean-code-dotnet | 5300 |
| 8 | smartstore | SmartStoreNET | https://github.com/smartstore/SmartStoreNET | 2500 |
| 9 | JimBobSquarePants | ImageProcessor | https://github.com/JimBobSquarePants/ImageProc... | 2500 |


## Summary

1. Scraping the list of topics from the first page of https://github.com/topics into a pandas dataframe
   - We download the page using requests and parse it using Beautiful Soup
   - We then extract the title, description and url of each topic
   - Each piece of information is extracted separately, using tag and class names found by inspecting the web page in the browser
   - This gives four functions: one that downloads and parses the page and three that each extract one piece of information
   - A single function (the first main function) combines them: it gets the parsed topics page, extracts the information for every topic on the first page and returns a dataframe
2. Scraping the top repositories of a topic and saving them in CSV format
   - Before extracting the top repositories of all topics, we extract the top repositories of the first topic, '3d'
   - First we define a function that downloads and parses the page of a topic
   - Then we define a function that takes the tags of a single repo and returns its username, repo name, repo url and stars. Alongside it, we define a function that converts the star count to an integer
   - Then we define a function that iterates over all repositories on the page, extracts the required information for each one with the function above, collects the values in a dictionary of lists and returns a pandas dataframe created from that dictionary
   - Finally we define a function (the second main function) that takes a topic url and a file path, builds the dataframe of top repos for that topic and saves it as a CSV file, skipping the topic if the file already exists
3. Scraping the top repos of all topics
   - We define a function that uses the two main functions above: the first returns the dataframe of all topics and the second saves a CSV file of the top repos of a topic
   - This function creates a folder for the CSV files using the os library, iterates through the dataframe of topics (from the first main function) and saves a CSV file of the top repos of each topic in that folder (using the second main function)
   - Since the first page of each topic lists 30 repos and the first page of topics lists 30 topics, 30 repos are extracted for each of the 30 topics
   - Finally we can read and display the CSV files using pandas


## Future Work

- Extracting top repos not only from first page of topics, but from all pages of topics
- Extracting repos not only from first page of repos, but from more pages, say top 100 repos.
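
Both items come down to requesting more than one page of a listing. As a starting point, the sketch below builds a list of page URLs. It assumes that the listing accepts a `page` query parameter, which is an unverified assumption about the site and would need to be checked in the browser first; `paged_urls` is a hypothetical helper.

```python
from urllib.parse import urlencode

def paged_urls(base_url, pages):
    """Return the URLs for pages 1..pages of a listing.

    Assumes the listing accepts a `page` query parameter (unverified)
    and that base_url has no query string of its own.
    """
    urls = []
    for page in range(1, pages + 1):
        query = urlencode({'page': page})
        urls.append(f'{base_url}?{query}')
    return urls
```

Each URL could then be passed to `get_topic_page`, and the resulting dataframes combined before saving.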