# Scraping the top repos for Topics on GitHub

TODO (Intro):
- Introduction to web scraping.
- Introduction to GitHub (which hosts millions of open source repositories) and the problem statement.
- Tools we are using:
 - Python
 - requests
 - BeautifulSoup4
 - Pandas


Outline:
Here are the steps we will follow:
- We'll scrape https://github.com/topics.
- We'll get a list of topics. For each topic, we'll get the topic title, topic page URL, and topic description.
- For each topic, we'll get the top 25 repositories from the topic page.
- For each repository, we'll grab the repo name, username, stars, and repo URL.
- For each topic, we'll create a CSV file in the following format:

```
Repo Name,Username,Stars,Repo URL
three.js,mrdoob,69700,https://github.com/mrdoob/three.js
libgdx,libgdx,18300,https://github.com/libgdx/libgdx
```

# Scrape the list of topics from GitHub

The approach has three steps:

- use requests to download the page
- use bs4 to parse and extract information
- convert to pandas dataframe

Let's write a function to download the page

In [23]:
# importing required libraries

import requests
from bs4 import BeautifulSoup
import pandas as pd

In [4]:
def get_topics_page():
    topics_url = 'https://github.com/topics'
    response = requests.get(topics_url)
    if response.status_code != 200:
        raise Exception("Failed to load page {}".format(topics_url))
    doc = BeautifulSoup(response.text, 'html.parser')
    return doc

In [5]:
doc = get_topics_page()

In [7]:
type(doc)

bs4.BeautifulSoup

Let's create some helper functions to parse information from the page.


To get the topic titles, we can select the `p` tags with the `class` ...

![](https://i.imgur.com/sklG1ai.png?1)
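Before applying this to the live page, it can help to see how BeautifulSoup matches a multi-class string. The sketch below uses a small hypothetical HTML fragment (not copied from GitHub) and assumes the class names used in this notebook. It suggests that a multi-class string passed to `find_all` is compared against the exact value of the `class` attribute, so the order of the classes matters, while a CSS selector passed to `select` does not depend on order.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment imitating one topic card; not copied from GitHub.
sample_html = """
<div>
  <p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
  <p class="f5 color-text-secondary mb-0 mt-1">A short description.</p>
</div>
"""
sample_doc = BeautifulSoup(sample_html, 'html.parser')

# A multi-class string matches the exact value of the class attribute.
exact = sample_doc.find_all('p', {'class': 'f3 lh-condensed mb-0 mt-1 Link--primary'})

# The same classes in a different order do not match.
reordered = sample_doc.find_all('p', {'class': 'Link--primary f3 lh-condensed mb-0 mt-1'})

# A CSS selector matches regardless of the order of the classes.
by_selector = sample_doc.select('p.f3.Link--primary')

print([tag.text for tag in exact])
print(len(reordered))
print([tag.text for tag in by_selector])
```

Because GitHub can change its class names at any time, a selector that relies on fewer classes may be somewhat less brittle than the full class string.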

In [12]:
def get_topic_titles(doc):
    selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
    topic_title_tags = doc.find_all('p', {'class' : selection_class})
    
    topic_titles = []
    
    for tag in topic_title_tags:
        topic_titles.append(tag.text)
    
    return topic_titles

`get_topic_titles` can be used to get the titles from the topics page.

In [13]:
titles = get_topic_titles(doc)

In [15]:
len(titles)

30

In [16]:
titles[:5]

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android']

Similarly, we define functions to get the topic descriptions and URLs.

In [9]:
def get_topic_descs(doc):
    desc_selection_class = 'f5 color-text-secondary mb-0 mt-1'
    topic_desc_tags = doc.find_all('p', {'class' : desc_selection_class})
    
    topic_descs = []
    
    for desc in topic_desc_tags:
        topic_descs.append(desc.text.strip())
        
    return topic_descs

`get_topic_descs` can be used to get the descriptions of the topics.

TODO  : Example and explanation

In [18]:
descs = get_topic_descs(doc)

In [19]:
len(descs)

30

In [20]:
descs[:5]

['3D modeling is the process of virtually developing the surface and structure of a 3D object.',
 'Ajax is a technique for creating interactive web applications.',
 'Algorithms are self-contained sequences that carry out a variety of tasks.',
 'Amp is a non-blocking concurrency framework for PHP.',
 'Android is an operating system built by Google designed for mobile devices.']

In [11]:
def get_topic_urls(doc):
    topic_link_tags = doc.find_all('a', {'class' : 'd-flex no-underline'})     
    
    base_url = 'https://github.com'
    
    topic_urls = []
    
    for url in topic_link_tags:
        topic_urls.append(base_url + url['href'])
    
    return topic_urls

`get_topic_urls` can be used to get the URLs of the topic pages.
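As a quick offline illustration of the URL step, the sketch below runs the same kind of lookup on a hypothetical two-link fragment. It uses `urllib.parse.urljoin` from the standard library instead of string concatenation; that swap is an assumption of this example, not something the notebook does, and both approaches give the same result for hrefs that start with `/`.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical fragment with two topic links; not copied from GitHub.
sample_html = """
<a class="d-flex no-underline" href="/topics/3d">3D</a>
<a class="d-flex no-underline" href="/topics/ajax">Ajax</a>
"""
sample_doc = BeautifulSoup(sample_html, 'html.parser')

# Select the link tags the same way get_topic_urls does.
link_tags = sample_doc.find_all('a', {'class': 'd-flex no-underline'})

# Resolve each relative href against the site root.
sample_urls = [urljoin('https://github.com', tag['href']) for tag in link_tags]
print(sample_urls)
```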

Let's put this all together into a single function

In [24]:
def scrape_topics():
    topics_url = 'https://github.com/topics'
    response = requests.get(topics_url)
    if response.status_code != 200:
        raise Exception("Failed to load page {}".format(topics_url))
    doc = BeautifulSoup(response.text, 'html.parser')
        
    topics_dict = {
        'title': get_topic_titles(doc),
        'description': get_topic_descs(doc),
        'urls': get_topic_urls(doc)
    }
    
    return pd.DataFrame(topics_dict)

In [25]:
scrape_topics()

,title,description,urls
0,3D,3D modeling is the process of virtually develo...,https://github.com/topics/3d
1,Ajax,Ajax is a technique for creating interactive w...,https://github.com/topics/ajax
2,Algorithm,Algorithms are self-contained sequences that c...,https://github.com/topics/algorithm
3,Amp,Amp is a non-blocking concurrency framework fo...,https://github.com/topics/amphp
4,Android,Android is an operating system built by Google...,https://github.com/topics/android
5,Angular,Angular is an open source web application plat...,https://github.com/topics/angular
6,Ansible,Ansible is a simple and powerful automation en...,https://github.com/topics/ansible
7,API,An API (Application Programming Interface) is ...,https://github.com/topics/api
8,Arduino,Arduino is an open source hardware and softwar...,https://github.com/topics/arduino
9,ASP.NET,ASP.NET is a web framework for building modern...,https://github.com/topics/aspnet


# Get Top 25 repositories from a topic page

In [26]:
def get_topic_page(topic_url):
    # Download the page
    response = requests.get(topic_url)
    
    # Check for a successful response
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topic_url))
    
    # Parse using BeautifulSoup
    topic_docs = BeautifulSoup(response.text, 'html.parser')
    
    return topic_docs

In [27]:
topic_doc = get_topic_page("https://github.com/topics/3d")

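Each repository on the topic page is listed under an `h3` tag. The `h3` tag contains two `a` tags: the first links to the repository owner and the second to the repository itself.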

In [28]:
len(topic_doc)

5

In [32]:
topic_doc.find('h3')

<h3 class="f3 color-text-secondary text-normal lh-condensed">
<a data-ga-click="Explore, go to repository owner, location:explore feed" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"OWNER","click_visual_representation":"REPOSITORY_OWNER_HEADING","actor_id":null,"record_id":97088,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="4bdbc49d3c05ae7f70b531fbce709a384200b0768554e0172950286a8db30940" data-view-component="true" href="/mrdoob">
            mrdoob
</a>          /
          <a class="text-bold" data-ga-click="Explore, go to repository, location:explore feed" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"REPOSITORY","click_visual_representation":"REPOSITORY_NAME_HEADING","actor_id":null,"record_id":576201,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="517d3d5cb9d89752156923904

In [34]:
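def get_repo_info(h3_tags, repo_star):
    # Returns the username, repo name, star count and URL of one repository.
    # h3_tags is the h3 tag of a repository card; repo_star is its star-count tag.
    # Note: parse_star_count must be defined before this function is called.
    base_url = 'https://github.com'
    
    a_tags = h3_tags.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href']
    stars = parse_star_count(repo_star.text.strip())
    
    return username, repo_name, stars, repo_url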
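`get_repo_info` calls `parse_star_count`, which is not defined in the cells shown here. The following is a minimal sketch of what such a helper might look like. It assumes star counts appear as text like `69.7k` or `512`, which is consistent with the `69700` in the sample CSV above, but the exact format on the page is an assumption.

```python
def parse_star_count(stars_str):
    """Convert a star-count string such as '69.7k' or '512' to an int.

    Assumed input format: an optional 'k' suffix meaning thousands,
    with optional thousands separators (commas).
    """
    stars_str = stars_str.strip().lower().replace(',', '')
    if stars_str.endswith('k'):
        # round() guards against floating point noise, e.g. 69.7 * 1000
        return int(round(float(stars_str[:-1]) * 1000))
    return int(stars_str)


print(parse_star_count('69.7k'))
print(parse_star_count('512'))
```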

TODO : Show an example

In [36]:
def get_topic_repos(topic_docs):
    
    # get h3 tags containing repo title, repo URL and username
    h3_selection_class = 'f3 color-text-secondary text-normal lh-condensed'
    repo_tags = topic_docs.find_all('h3', {'class': h3_selection_class})
    
    # get star tags
    stars_selection_class = "social-count float-none"
    repo_stars = topic_docs.find_all('a',{'class' : stars_selection_class})
    
    topic_repos_dict = {
        'username': [],
        'repo_name': [],
        'stars' : [],
        'repo_url': []
    }
    
    # get repo info
    for i in range(len(repo_tags)):
        repo_info = get_repo_info(repo_tags[i],repo_stars[i])
        topic_repos_dict['username'].append(repo_info[0])
        topic_repos_dict['repo_name'].append(repo_info[1])
        topic_repos_dict['stars'].append(repo_info[2])
        topic_repos_dict['repo_url'].append(repo_info[3])
    
    return pd.DataFrame(topic_repos_dict)


TODO - Show an example
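One way to see what `get_topic_repos` produces without hitting the network is to run the same pairing logic on a small fragment. The sketch below is self-contained and uses hypothetical, heavily simplified markup for two repository cards. The class names are the ones used in this notebook, and the star counts are left as text rather than converted to integers.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical, heavily simplified markup for two repository cards.
sample_html = """
<article>
  <h3 class="f3 color-text-secondary text-normal lh-condensed">
    <a href="/mrdoob">mrdoob</a> /
    <a class="text-bold" href="/mrdoob/three.js">three.js</a>
  </h3>
  <a class="social-count float-none">69.7k</a>
</article>
<article>
  <h3 class="f3 color-text-secondary text-normal lh-condensed">
    <a href="/libgdx">libgdx</a> /
    <a class="text-bold" href="/libgdx/libgdx">libgdx</a>
  </h3>
  <a class="social-count float-none">18.3k</a>
</article>
"""
sample_doc = BeautifulSoup(sample_html, 'html.parser')

h3_tags = sample_doc.find_all('h3', {'class': 'f3 color-text-secondary text-normal lh-condensed'})
star_tags = sample_doc.find_all('a', {'class': 'social-count float-none'})

rows = []
# Pair each h3 tag with the star tag at the same position.
for h3_tag, star_tag in zip(h3_tags, star_tags):
    owner_tag, repo_tag = h3_tag.find_all('a')
    rows.append({
        'username': owner_tag.text.strip(),
        'repo_name': repo_tag.text.strip(),
        'stars': star_tag.text.strip(),  # kept as text in this sketch
        'repo_url': 'https://github.com' + repo_tag['href'],
    })

sample_df = pd.DataFrame(rows)
print(sample_df)
```

Note that `zip` stops at the shorter of the two lists, whereas the index-based loop in `get_topic_repos` would raise an `IndexError` if a star tag were missing.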

In [37]:
import os

def scrape_topic(topic_url, path):
    # Skip topics whose CSV file already exists, so reruns don't download them again
    if os.path.exists(path):
        print("The file {} already exists. Skipping...".format(path))
        return
    topic_df = get_topic_repos(get_topic_page(topic_url))
    
    # index=False leaves the DataFrame index out of the CSV
    topic_df.to_csv(path, index=False)

TODO - Show an example
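The skip-if-exists behaviour of `scrape_topic` can be tried in isolation. The sketch below is an offline illustration: `save_if_missing` is a hypothetical helper written for this example, and it writes into a temporary directory rather than the `data` folder.

```python
import os
import tempfile

import pandas as pd


def save_if_missing(df, path):
    """Write df to path unless the file already exists; return True if written."""
    if os.path.exists(path):
        return False
    df.to_csv(path, index=False)
    return True


# One row taken from the sample CSV at the top of this notebook.
sample_df = pd.DataFrame({
    'repo_name': ['three.js'],
    'username': ['mrdoob'],
    'stars': [69700],
    'repo_url': ['https://github.com/mrdoob/three.js'],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, '3D.csv')
    first = save_if_missing(sample_df, csv_path)   # file is created
    second = save_if_missing(sample_df, csv_path)  # file exists, so it is skipped

print(first, second)
```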

## Putting it all together

- We have a function to get the list of topics
- We have a function to create a CSV file for scraped repos from a topic page.
- Let's create a function to put them together


In [38]:
def scrape_topic_repos():
    print("Scraping list of topics")
    topics_df = scrape_topics()
    
    os.makedirs('data', exist_ok=True)
    
    for index, row in topics_df.iterrows():
        print("Scraping top repositories for {}".format(row['title']))
        scrape_topic(row['urls'], 'data/{}.csv'.format(row['title']))
    

Let's run it to scrape the top repos for all the topics on the first page of https://github.com/topics

In [None]:
scrape_topic_repos()

We can check that the CSVs were created properly

In [39]:
# read and display a csv using pandas
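A minimal sketch of this check follows. The file name `data/3D.csv` is an assumption based on the `3D` topic title and the `data/{}.csv` pattern used above, and `preview_csv` is a hypothetical helper name.

```python
import os

import pandas as pd


def preview_csv(path, n=5):
    """Return the first n rows of the CSV at path, or None if it does not exist."""
    if not os.path.exists(path):
        return None
    return pd.read_csv(path).head(n)


csv_path = os.path.join('data', '3D.csv')  # assumed file name
preview = preview_csv(csv_path)
if preview is None:
    print('{} not found; run scrape_topic_repos() first.'.format(csv_path))
else:
    print(preview)
```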

## References and Future Work

Summary of what we did

References to link I found useful

Ideas for future work