# Scraping Top Repositories for Topics on GitHub
##### Web Scraping :
Web scraping is an automated way to collect large amounts of data from websites. Most of this data arrives as unstructured HTML, which is then converted into structured data in a spreadsheet or database so that it can be used in other applications.
#### GitHub :
GitHub is a code hosting platform for version control and collaboration. It lets you and others work together on projects from anywhere.
#### Tools used :
Python, requests, Beautiful Soup, pandas, os


# Steps that are followed
- We're going to scrape https://github.com/topics
- We'll get a list of topics. For each topic, we'll get topic title, topic page URL and topic description
- For each topic, we'll get the top repositories listed on the topic page
- For each repository, we'll grab the repo name, username, stars and repo URL
- For each topic we'll create a CSV file in the following format:

Repo Name,Username,Stars,Repo URL

three.js,mrdoob,88300,https://github.com/mrdoob/three.js

## Scraping the list of topics from GitHub
1. Use requests to download the page
2. Use Beautiful Soup (bs4) to parse and extract information
3. Convert to a Pandas dataframe

Writing a function to download the page.

In [77]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import os

def get_topics_page():
    # Download the GitHub topics listing page
    topics_url = 'https://github.com/topics'
    response = requests.get(topics_url)
    # Stop early if the request did not succeed
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topics_url))
    # Parse the HTML into a BeautifulSoup document
    doc = BeautifulSoup(response.text, 'html.parser')
    return doc
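
The function above raises as soon as a single request fails. A minimal sketch of retrying a failed download follows, assuming that a transient failure is worth a few more attempts with a short pause in between. `with_retries`, `attempts` and `delay_seconds` are hypothetical names introduced for illustration; they are not part of the original notebook.

```python
import time

def with_retries(fetch, attempts=3, delay_seconds=1.0):
    """Call fetch() until it succeeds or the attempts run out (hypothetical helper)."""
    if attempts < 1:
        raise ValueError('attempts must be at least 1')
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except Exception as error:
            last_error = error
            # Wait a little longer after each failure, but not after the last one
            if attempt < attempts:
                time.sleep(delay_seconds * attempt)
    raise last_error
```

With this in place, `doc = with_retries(get_topics_page)` would try the download up to three times before giving up.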

In [35]:
# Get the parsed document for the GitHub topics page
doc = get_topics_page()

##### Creating some other helper functions to get information from the page

In [36]:
def get_topic_titles(doc):
    selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
    topic_title_tags= doc.find_all('p', {'class' : selection_class})
    topic_titles = []
    for tag in topic_title_tags:
        topic_titles.append(tag.text)
    return topic_titles

#### get_topic_titles can be used to get the list of titles

In [37]:
titles = get_topic_titles(doc)

In [38]:
len(titles)

30

In [39]:
titles[:5]

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android']

##### Similarly, functions have been defined for descriptions and URLs

In [40]:
# Getting the desc of each Topic
def get_topic_descs(doc):
    desc_selector = "f5 color-fg-muted mb-0 mt-1"
    topic_desc_tags = doc.find_all('p', {'class' :desc_selector})
    topic_descs = []
    for tag in topic_desc_tags:
        topic_descs.append(tag.text.strip())
    return topic_descs

In [41]:
desc= get_topic_descs(doc)

In [42]:
desc[:2]

['3D modeling is the process of virtually developing the surface and structure of a 3D object.',
 'Ajax is a technique for creating interactive web applications.']

In [43]:
# Getting the URL of each Topic
def get_topic_urls(doc):
    topic_link_tags= doc.find_all('a',{'class' :"no-underline flex-grow-0"})
    topic_urls = []
    base_url = 'https://github.com'
    for tag in topic_link_tags:
        topic_urls.append(base_url + tag['href'])
    return topic_urls

In [44]:
get_topic_urls(doc)[:2]

['https://github.com/topics/3d', 'https://github.com/topics/ajax']
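
`get_topic_urls` builds absolute URLs by concatenating a base URL with each `href`. As a hedged alternative, the standard library's `urllib.parse.urljoin` resolves relative paths and leaves already-absolute links unchanged. The helper name `to_absolute` and the sample `href` values are illustrative assumptions.

```python
from urllib.parse import urljoin

BASE_URL = 'https://github.com'

def to_absolute(href, base_url=BASE_URL):
    # urljoin resolves relative paths against base_url and keeps absolute URLs as they are
    return urljoin(base_url, href)

print(to_absolute('/topics/3d'))
print(to_absolute('https://github.com/topics/ajax'))
```

Inside `get_topic_urls`, the append line could then read `topic_urls.append(to_absolute(tag['href']))`.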

#### Putting it all together into a single function

In [45]:
def scrape_topics():
    topics_url = 'https://github.com/topics'
    response = requests.get(topics_url)
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topics_url))
    # Parse the freshly downloaded page instead of relying on a global `doc`
    doc = BeautifulSoup(response.text, 'html.parser')
    topics_dict = {
        'title': get_topic_titles(doc),
        'description': get_topic_descs(doc),
        'url': get_topic_urls(doc)
    }
    return pd.DataFrame(topics_dict)

In [46]:
scrape_topics().head()

Unnamed: 0,title,description,url
0,3D,3D modeling is the process of virtually develo...,https://github.com/topics/3d
1,Ajax,Ajax is a technique for creating interactive w...,https://github.com/topics/ajax
2,Algorithm,Algorithms are self-contained sequences that c...,https://github.com/topics/algorithm
3,Amp,Amp is a non-blocking concurrency library for ...,https://github.com/topics/amphp
4,Android,Android is an operating system built by Google...,https://github.com/topics/android


# Get the top repositories from a topic page
1. Use requests to download the page and extract the information using Beautiful Soup
2. Save the result as a CSV file.

In [60]:
def get_topic_page(topic_url):
    # Download the page
    response = requests.get(topic_url)
    # Check successful response
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topic_url))
    # Parse using Beautiful soup
    topic_doc = BeautifulSoup(response.text, 'html.parser')
    return topic_doc

In [48]:
topic_doc = get_topic_page('https://github.com/topics/3d')

In [61]:
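def parse_star_count(stars):
    # Convert a star count such as '88.3k' or '950' to an integer
    stars = stars.strip()
    if stars[-1] == 'k':
        # A trailing 'k' means thousands
        return int(float(stars[:-1]) * 1000)
    return int(stars)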
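
`parse_star_count` handles a trailing `k` and plain integers. Because `int()` truncates toward zero, a floating-point product that lands just below a whole number would come out one too low, so the sketch below rounds instead. It also accepts an `m` suffix and thousands separators; whether GitHub ever renders counts that way is an assumption, and `parse_count` is a hypothetical name.

```python
def parse_count(text):
    """Convert strings such as '88.3k', '1.2m' or '1,234' to an int (hypothetical helper)."""
    multipliers = {'k': 1_000, 'm': 1_000_000}
    # Normalise whitespace, letter case and thousands separators
    cleaned = text.strip().lower().replace(',', '')
    if cleaned and cleaned[-1] in multipliers:
        # round() avoids losing one when the float product falls just short
        return round(float(cleaned[:-1]) * multipliers[cleaned[-1]])
    return int(cleaned)

print(parse_count('88.3k'))
```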

In [72]:
def get_repo_info(h3_tag, star_tag):
    # returns all the required info about a repository
    a_tags = h3_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    base_url= "https://github.com"
    repo_url =  base_url + a_tags[1]['href']
    stars = parse_star_count(star_tag.text.strip())
    return username, repo_name, stars, repo_url

In [66]:
def get_topic_repos(topic_doc):
    # Get the h3 tags containing repo title, repo URL and username
    h3_selection_class = "f3 color-fg-muted text-normal lh-condensed"  
    repo_tags = topic_doc.find_all('h3', {'class': h3_selection_class})
    # Get star tags
    star_tags= topic_doc.find_all('span', {'class' :'Counter js-social-count'})
    
    topic_repos_dict = { 'username': [], 'repo_name': [], 'stars': [],'repo_url': []}

    # Get repo info
    for i in range(len(repo_tags)):
        repo_info = get_repo_info(repo_tags[i], star_tags[i])
        topic_repos_dict['username'].append(repo_info[0])
        topic_repos_dict['repo_name'].append(repo_info[1])
        topic_repos_dict['stars'].append(repo_info[2])
        topic_repos_dict['repo_url'].append(repo_info[3])
        
    return pd.DataFrame(topic_repos_dict)

In [67]:
get_topic_repos(topic_doc).head()

Unnamed: 0,username,repo_name,stars,repo_url
0,mrdoob,three.js,88300,https://github.com/mrdoob/three.js
1,pmndrs,react-three-fiber,21000,https://github.com/pmndrs/react-three-fiber
2,libgdx,libgdx,21000,https://github.com/libgdx/libgdx
3,BabylonJS,Babylon.js,19200,https://github.com/BabylonJS/Babylon.js
4,ssloy,tinyrenderer,15800,https://github.com/ssloy/tinyrenderer


In [74]:
def scrape_topic(topic_url, path):
    if os.path.exists(path):
        print("The file {} already exists. Skipping...".format(path))
        return
    topic_df = get_topic_repos(get_topic_page(topic_url))
    topic_df.to_csv(path, index=False)

### Putting it all together
1. We have a function to get the list of topics
2. We have a function to create a CSV file for scraped repos from a topic page
3. Let's create a function to put them together

In [75]:
def scrape_topics_repos():
    print('Scraping list of topics')
    topics_df = scrape_topics()
    
    os.makedirs('data', exist_ok=True)
    for index, row in topics_df.iterrows():
        print('Scraping top repositories for "{}"'.format(row['title']))
        scrape_topic(row['url'], 'data/{}.csv'.format(row['title']))
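
The loop above uses each topic title directly as a file name. A title containing a path separator or a character that Windows does not allow in file names would break the path, so a hedged sketch of a sanitising step is shown here. `safe_filename` is a hypothetical helper, and the set of replaced characters is an assumption about what needs guarding against.

```python
import re

def safe_filename(title):
    # Replace path separators and characters that Windows rejects in file names
    cleaned = re.sub(r'[\\/:*?"<>|]+', '_', title.strip())
    # Fall back to a placeholder if nothing is left
    return cleaned or 'untitled'

print(safe_filename('ASP.NET'))
print(safe_filename('a/b'))
```

The path could then be built as `'data/{}.csv'.format(safe_filename(row['title']))`.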

#### Let's run it to scrape the top repos for all the topics on the first page of https://github.com/topics

In [78]:
scrape_topics_repos()

Scraping list of topics
Scraping top repositories for "3D"
The file data/3D.csv already exists. Skipping...
Scraping top repositories for "Ajax"
The file data/Ajax.csv already exists. Skipping...
Scraping top repositories for "Algorithm"
The file data/Algorithm.csv already exists. Skipping...
Scraping top repositories for "Amp"
The file data/Amp.csv already exists. Skipping...
Scraping top repositories for "Android"
The file data/Android.csv already exists. Skipping...
Scraping top repositories for "Angular"
The file data/Angular.csv already exists. Skipping...
Scraping top repositories for "Ansible"
The file data/Ansible.csv already exists. Skipping...
Scraping top repositories for "API"
The file data/API.csv already exists. Skipping...
Scraping top repositories for "Arduino"
The file data/Arduino.csv already exists. Skipping...
Scraping top repositories for "ASP.NET"
The file data/ASP.NET.csv already exists. Skipping...
Scraping top repositories for "Atom"
The file data/Atom.csv alre

#### We can check that the CSVs were created properly
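
One way to make that check concrete is sketched below. It assumes the column names written by `get_topic_repos` (`username`, `repo_name`, `stars`, `repo_url`) and uses only the standard library's `csv` module. `check_csv` is a hypothetical helper, and the demonstration writes a throwaway file so that it runs without the scraped data.

```python
import csv
import os
import tempfile

EXPECTED_COLUMNS = ['username', 'repo_name', 'stars', 'repo_url']

def check_csv(path, expected_columns=EXPECTED_COLUMNS):
    """Return the number of data rows, raising if the header is unexpected."""
    with open(path, newline='', encoding='utf-8') as csv_file:
        reader = csv.reader(csv_file)
        header = next(reader, None)
        if header != expected_columns:
            raise ValueError('Unexpected header in {}: {}'.format(path, header))
        # Count the remaining rows, one per repository
        return sum(1 for _ in reader)

# Demonstration on a temporary file standing in for data/3D.csv
with tempfile.TemporaryDirectory() as folder:
    sample_path = os.path.join(folder, '3D.csv')
    with open(sample_path, 'w', newline='', encoding='utf-8') as sample:
        writer = csv.writer(sample)
        writer.writerow(EXPECTED_COLUMNS)
        writer.writerow(['mrdoob', 'three.js', 88300, 'https://github.com/mrdoob/three.js'])
    print(check_csv(sample_path))  # prints 1
```

For the real output, the same function could be called on every file returned by `os.listdir('data')`.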

## Summary
- We have scraped the data for the top repositories of each topic that appears on the first page.
- We have created a folder containing one CSV file per topic, each holding information about that topic's repositories.
- Repository information covers username, repo_name, stars and repo_url.

## Ideas for future work
- We can iterate through multiple pages of topics instead of being limited to the first page, and collect the same data for each.
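
A rough sketch of how the page URLs for that idea might be generated is given below. It assumes the topics listing accepts a `page` query parameter, which should be verified against the live site before relying on it. `topics_page_url` is a hypothetical helper.

```python
from urllib.parse import urlencode

TOPICS_URL = 'https://github.com/topics'

def topics_page_url(page):
    # Assumption: the listing accepts a `page` query parameter
    return '{}?{}'.format(TOPICS_URL, urlencode({'page': page}))

# URLs for the first three pages
page_urls = [topics_page_url(page) for page in range(1, 4)]
print(page_urls)
```

Each URL could then be downloaded and parsed with the same helper functions used for the first page.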