# Scraping Top Repositories for Topics on GitHub

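Web scraping is the process of programmatically downloading web pages and extracting structured information from their HTML. GitHub groups repositories under topics such as 3D, Android, or Bitcoin, and each topic page lists the top repositories for that topic. In this project we collect that listing for every topic on the first page of https://github.com/topics and save it as CSV files.

We'll use Python with the `requests` library to download pages, Beautiful Soup to parse the HTML, and Pandas to build and export the tables.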



Here are the steps we'll follow:

- We're going to scrape https://github.com/topics
- We'll get a list of topics. For each topic, we'll get topic title, topic page URL and topic description
- For each topic, we'll get the top 25 repositories in the topic from the topic page
- For each repository, we'll grab the repo name, username, stars and repo URL
- For each topic we'll create a CSV file in the following format:

```
username,repo_name,stars,repo_url
mrdoob,three.js,69700,https://github.com/mrdoob/three.js
```

## Scrape the list of topics from GitHub

We'll follow these steps:

- use `requests` to download the page
- use Beautiful Soup to parse the page and extract information
- convert the result to a Pandas data frame

Let's write a function to download the page.

In [32]:
import requests
from bs4 import BeautifulSoup

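def get_topics_page():
    # URL of the page that lists GitHub topics
    topics_url = 'https://github.com/topics'
    # Download the page
    response = requests.get(topics_url)
    # Stop early if the request did not succeed
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topics_url))
    # Parse the HTML so that we can search it by tag and class
    doc = BeautifulSoup(response.text, 'html.parser')
    return doc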

We import `requests` and `BeautifulSoup`. The `get_topics_page` function downloads the GitHub topics page and returns it as a parsed document that we can search.

In [33]:
doc = get_topics_page()
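
A single failed request raises an exception and stops the run. As a minimal sketch of one way to soften that, the hypothetical helper below retries any zero-argument callable and waits longer after each failure. The name `fetch_with_retries` and its parameters are assumptions made for illustration; they are not part of the original notebook.

```python
import time

def fetch_with_retries(fetch, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it succeeds, waiting longer after each failure."""
    for attempt in range(attempts):
        try:
            return fetch()
        except Exception:
            # Give up and re-raise once the final attempt has failed
            if attempt == attempts - 1:
                raise
            # Wait 1x, 2x, 4x ... the base delay before trying again
            sleep(base_delay * 2 ** attempt)
```

With the function defined above, this could be called as `fetch_with_retries(get_topics_page)`. The `sleep` parameter exists only so that the waiting can be swapped out when testing.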

Let's create some helper functions to parse information from the page.

To get topic titles, we can pick `p` tags with the `class` ...

![](https://i.imgur.com/OnzIdyP.png)


In [34]:
def get_topic_titles(doc):
    selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
    topic_title_tags = doc.find_all('p', {'class': selection_class})
    topic_titles = []
    for tag in topic_title_tags:
        topic_titles.append(tag.text)
    return topic_titles

`get_topic_titles` returns the list of topic titles found on the page.

In [35]:
titles = get_topic_titles(doc)

In [36]:
len(titles)

30

In [37]:
titles[:5]

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android']
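
One caveat with this kind of selector is illustrated below on a made-up HTML snippet rather than the live page. When the class value passed to `find_all` contains spaces, Beautiful Soup compares it with the exact string of the `class` attribute, so a tag carrying the same classes in a different order is not matched. A CSS selector passed to `select` matches on individual classes instead. The snippet and the variable names are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Made-up markup: the second title has its classes in a different order
html = """
<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
<p class="lh-condensed f3 mb-0 mt-1 Link--primary">Ajax</p>
<p class="f5 color-fg-muted mb-0 mt-1">A description</p>
"""
sample_doc = BeautifulSoup(html, 'html.parser')

# Exact-string match on the class attribute: order matters
exact = sample_doc.find_all('p', {'class': 'f3 lh-condensed mb-0 mt-1 Link--primary'})

# CSS selector: every listed class must be present, in any order
by_css = sample_doc.select('p.f3.lh-condensed.Link--primary')

exact_titles = [tag.text for tag in exact]    # ['3D']
css_titles = [tag.text for tag in by_css]     # ['3D', 'Ajax']
```

If the page's classes are reordered or extended, the exact-string lookup returns an empty list without raising an error, so checking `len(titles)` as done below is a useful sanity test.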

Similarly, we define functions to get the topic descriptions and URLs.

In [38]:
def get_topic_descs(doc):
    desc_selector = 'f5 color-fg-muted mb-0 mt-1'
    topic_desc_tags = doc.find_all('p', {'class': desc_selector})
    topic_descs = []
    for tag in topic_desc_tags:
        topic_descs.append(tag.text.strip())
    return topic_descs



`get_topic_descs` returns the list of topic descriptions, with surrounding whitespace removed.

In [39]:
len(get_topic_descs(doc))

30

In [40]:
get_topic_descs(doc)[:5]

['3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.',
 'Ajax is a technique for creating interactive web applications.',
 'Algorithms are self-contained sequences that carry out a variety of tasks.',
 'Amp is a non-blocking concurrency library for PHP.',
 'Android is an operating system built by Google designed for mobile devices.']

In [41]:
def get_topic_urls(doc):
    topic_link_tags = doc.find_all('a', {'class': 'no-underline flex-1 d-flex flex-column'})
    topic_urls = []
    base_url = 'https://github.com'
    for tag in topic_link_tags:
        topic_urls.append(base_url + tag['href'])
    return topic_urls


`get_topic_urls` returns the list of topic page URLs, built by prefixing each relative link with the base URL.

In [42]:
len(get_topic_urls(doc))

30

In [43]:
get_topic_urls(doc)[:5]

['https://github.com/topics/3d',
 'https://github.com/topics/ajax',
 'https://github.com/topics/algorithm',
 'https://github.com/topics/amphp',
 'https://github.com/topics/android']
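
Concatenating `base_url` and `href` works as long as every link is relative and starts with `/`. As a small sketch of an alternative, the standard library's `urljoin` also handles links that are already absolute. The helper name `to_absolute` is hypothetical.

```python
from urllib.parse import urljoin

base_url = 'https://github.com'

def to_absolute(href):
    """Resolve a link found on the page against the site's base URL."""
    return urljoin(base_url, href)

relative = to_absolute('/topics/3d')
# 'https://github.com/topics/3d'
already_absolute = to_absolute('https://github.com/topics/ajax')
# 'https://github.com/topics/ajax' (left unchanged)
```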

Let's put this all together into a single function.

In [44]:
import pandas as pd

def scrape_topics():
    # Download and parse the topics page
    doc = get_topics_page()
    # Collect each piece of information as one column
    topics_dict = {
        'title': get_topic_titles(doc),
        'description': get_topic_descs(doc),
        'url': get_topic_urls(doc)
    }
    return pd.DataFrame(topics_dict)

We import Pandas and define `scrape_topics`, which scrapes the topics page, collects the title, description, and URL lists in `topics_dict`, and builds a data frame from that dictionary.
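
`pd.DataFrame` needs every list in the dictionary to have the same length, so if one selector matches fewer tags than the others, the construction fails with a `ValueError`. A small guard can report which column is short before the data frame is built. This is only a sketch, and the helper name `check_equal_lengths` is hypothetical.

```python
def check_equal_lengths(columns):
    """Return columns unchanged, or raise if the lists differ in length."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        # Include every length so the short column is easy to spot
        raise ValueError('Column lengths differ: {}'.format(lengths))
    return columns
```

Inside `scrape_topics` it could wrap the dictionary, as in `pd.DataFrame(check_equal_lengths(topics_dict))`.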

## Get the top repositories from a topic page

We'll follow these steps:

- use `requests` to download the topic page
- use Beautiful Soup to parse the page and extract information about the top repositories
- convert the result to a Pandas data frame

We again use `requests` and Beautiful Soup, this time to download and parse an individual topic page.

In [45]:
def get_topic_page(topic_url):
    # Download the page
    response = requests.get(topic_url)
    # Check successful response
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topic_url))
    # Parse using Beautiful soup
    topic_doc = BeautifulSoup(response.text, 'html.parser')
    return topic_doc


In [46]:
doc = get_topic_page('https://github.com/topics/3d')

`parse_star_count` converts a star count string such as `104k` into an integer. `get_repo_info` takes a repository's `h3` tag and its star tag and returns the username, repository name, star count, and repository URL.

In [47]:
def parse_star_count(stars_str):
    stars_str = stars_str.strip()
    if stars_str[-1] == 'k':
        return int(float(stars_str[:-1])*1000)
    return int(stars_str)

def get_repo_info(h3_tag, star_tag):
    # returns all the required info about a repository
    a_tags = h3_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    base_url = 'https://github.com'
    repo_url = base_url + a_tags[1]['href']
    stars = parse_star_count(star_tag.text.strip())
    return username, repo_name, stars, repo_url


In [48]:
parse_star_count('104k')

104000
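
`parse_star_count` covers plain integers and the `k` suffix. As a hedged sketch of a slightly more tolerant parser, the hypothetical `parse_count` below also accepts thousands separators and an `m` suffix. Whether GitHub ever renders counts in those forms is an assumption, not something the original notebook shows.

```python
def parse_count(text):
    """Convert strings such as '1,234', '69.7k' or '1.2m' to an integer."""
    text = text.strip().lower().replace(',', '')
    multipliers = {'k': 1_000, 'm': 1_000_000}
    if text and text[-1] in multipliers:
        # round() guards against float products that land just below a whole number
        return round(float(text[:-1]) * multipliers[text[-1]])
    return int(text)

parse_count('104k')   # 104000
parse_count('1,234')  # 1234
```

An empty string still raises a `ValueError` here, which keeps malformed input visible instead of silently turning it into zero.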

Now we'll define `get_topic_repos`, which collects the info for every repository on a topic page in the `topic_repos_dict` dictionary and builds a data frame from it.

In [49]:
def get_topic_repos(topic_doc):
    # Get the h3 tags containing repo title, repo URL and username
    h3_selection_class = 'f3 color-fg-muted text-normal lh-condensed'
    repo_tags = topic_doc.find_all('h3', {'class': h3_selection_class})
    # Get star tags
    star_tags = topic_doc.find_all('span', {'class': 'Counter js-social-count'})

    topic_repos_dict = {'username': [], 'repo_name': [], 'stars': [], 'repo_url': []}

    # Get repo info
    for i in range(len(repo_tags)):
        repo_info = get_repo_info(repo_tags[i], star_tags[i])
        topic_repos_dict['username'].append(repo_info[0])
        topic_repos_dict['repo_name'].append(repo_info[1])
        topic_repos_dict['stars'].append(repo_info[2])
        topic_repos_dict['repo_url'].append(repo_info[3])
        
    return pd.DataFrame(topic_repos_dict)
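
The loop above turns one tuple per repository into one list per column by indexing into `repo_info`. The same transposition can be sketched with `zip`, which pairs each value with its column name and avoids the numeric indices. The helper name `collect_columns` is hypothetical, and the sample row reuses the three.js example from the introduction.

```python
def collect_columns(rows, column_names):
    """Turn an iterable of row tuples into a dict of column lists."""
    columns = {name: [] for name in column_names}
    for row in rows:
        # Pair each value in the row with the column it belongs to
        for name, value in zip(column_names, row):
            columns[name].append(value)
    return columns

names = ['username', 'repo_name', 'stars', 'repo_url']
rows = [('mrdoob', 'three.js', 69700, 'https://github.com/mrdoob/three.js')]
columns = collect_columns(rows, names)
# columns['stars'] is [69700]
```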

Now we import the `os` module and define `scrape_topic`, which scrapes the top repositories for a topic and saves them to a CSV file. If the file already exists, the topic is skipped.

In [50]:
import os

def scrape_topic(topic_url, path):
    if os.path.exists(path):
        print("The file {} already exists. Skipping...".format(path))
        return
    topic_df = get_topic_repos(get_topic_page(topic_url))
    topic_df.to_csv(path, index=False)
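
The existence check means a rerun never overwrites a file, so an old CSV stays in place until it is deleted by hand. The sketch below isolates that behaviour in a temporary directory. The helper name `write_if_missing` and the file contents are assumptions for illustration.

```python
import os
import tempfile

def write_if_missing(path, text):
    """Write text to path unless the file exists; return True if written."""
    if os.path.exists(path):
        return False
    with open(path, 'w', encoding='utf-8') as f:
        f.write(text)
    return True

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, '3D.csv')
    first = write_if_missing(path, 'username,repo_name,stars,repo_url\n')
    second = write_if_missing(path, 'this text is never written\n')
    with open(path, encoding='utf-8') as f:
        saved = f.read()
# first is True, second is False, and saved holds only the header line
```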

## Putting it all together

- We have a function to get the list of topics
- We have a function to create a CSV file of scraped repos from a topic page
- Let's create a function to put them together

In [51]:
def scrape_topics_repos():
    print('Scraping list of topics')
    topics_df = scrape_topics()
    
    os.makedirs('data', exist_ok=True)
    for index, row in topics_df.iterrows():
        print('Scraping top repositories for "{}"'.format(row['title']))
        scrape_topic(row['url'], 'data/{}.csv'.format(row['title']))
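
The topic title is used directly as the file name, which works for the titles listed below but would break for a title containing a path separator. A minimal sketch of a sanitizer follows. The name `safe_filename` is hypothetical, and the `'C/C++'` input is a made-up example rather than a title taken from the page.

```python
import re

def safe_filename(title):
    """Replace characters that are unsafe in file names with underscores."""
    # Keep letters, digits, dots, underscores and hyphens; collapse the rest
    cleaned = re.sub(r'[^A-Za-z0-9._-]+', '_', title.strip())
    return cleaned.strip('_') or 'untitled'

safe_filename('ASP.NET')        # 'ASP.NET'
safe_filename('Awesome Lists')  # 'Awesome_Lists'
safe_filename('C/C++')          # 'C_C'
```

This version keeps only ASCII characters, so titles written in other scripts would collapse to `untitled`. That is acceptable for a sketch but worth revisiting for real use.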

Let's run it to scrape the top repos for all the topics on the first page of https://github.com/topics.

In [52]:
scrape_topics_repos()

Scraping list of topics
Scraping top repositories for "3D"
Scraping top repositories for "Ajax"
Scraping top repositories for "Algorithm"
Scraping top repositories for "Amp"
Scraping top repositories for "Android"
Scraping top repositories for "Angular"
Scraping top repositories for "Ansible"
Scraping top repositories for "API"
Scraping top repositories for "Arduino"
Scraping top repositories for "ASP.NET"
Scraping top repositories for "Awesome Lists"
Scraping top repositories for "Amazon Web Services"
Scraping top repositories for "Azure"
Scraping top repositories for "Babel"
Scraping top repositories for "Bash"
Scraping top repositories for "Bitcoin"
Scraping top repositories for "Bootstrap"
Scraping top repositories for "Bot"
Scraping top repositories for "C"
Scraping top repositories for "Chrome"
Scraping top repositories for "Chrome extension"
Scraping top repositories for "Command-line interface"
Scraping top repositories for "Clojure"
Scraping top repositories for "Code quality"

We can check that the CSVs were created properly by reading one of them back.

In [54]:
pd.read_csv('data/Android.csv')

Unnamed: 0,username,repo_name,stars,repo_url
0,flutter,flutter,169000,https://github.com/flutter/flutter
1,facebook,react-native,121000,https://github.com/facebook/react-native
2,Genymobile,scrcpy,118000,https://github.com/Genymobile/scrcpy
3,justjavac,free-programming-books-zh_CN,113000,https://github.com/justjavac/free-programming-...
4,Hack-with-Github,Awesome-Hacking,89000,https://github.com/Hack-with-Github/Awesome-Ha...
5,Solido,awesome-flutter,54800,https://github.com/Solido/awesome-flutter
6,tldr-pages,tldr,53600,https://github.com/tldr-pages/tldr
7,wasabeef,awesome-android-ui,51600,https://github.com/wasabeef/awesome-android-ui
8,google,material-design-icons,51100,https://github.com/google/material-design-icons
9,laurent22,joplin,47700,https://github.com/laurent22/joplin


## References and Future Work

Summary of what we did:

- We built a web scraper that parses the GitHub topics page and extracts information about the top repositories for each topic, using libraries such as requests, Beautiful Soup, and Pandas.
- We saved the resulting top repositories for each topic in CSV format inside the `data` folder.


Links we found useful:

- https://github.com/topics
- https://requests.readthedocs.io/en/latest/
- https://pypi.org/project/beautifulsoup4/
 
Ideas for future work

- We can extend the project to get top 100 repositories from other pages too.
- Web scraping can also be done on e-commerce websites to extract useful information about the desired products.
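
Reaching further pages would require a URL for each one. The sketch below assumes that the pages are exposed through a `page` query parameter, which should be confirmed against the live site before relying on it. The helper name `page_urls` is hypothetical.

```python
from urllib.parse import urlencode

def page_urls(base_url, num_pages):
    """Build one URL per page, assuming a 'page' query parameter."""
    return [
        '{}?{}'.format(base_url, urlencode({'page': page}))
        for page in range(1, num_pages + 1)
    ]

urls = page_urls('https://github.com/topics', 2)
# ['https://github.com/topics?page=1', 'https://github.com/topics?page=2']
```

Each URL could then be downloaded and parsed in the same way as the first page, with the results appended to one data frame.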