# Scraping Top Repos for Top Topics on GitHub

## Introduction to web scraping
Web scraping is the process of extracting data from a specific web page. It involves making an HTTP request to a website’s server, downloading the page’s HTML and parsing it to extract the desired data.
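
The parsing step described above can be tried without touching the network. The following is a minimal sketch, not the notebook's own method: it uses the standard library's `html.parser` in place of Beautiful Soup so that it runs with no installs, and the sample HTML, the `title` class, and the names `TitleCollector` and `extract_titles` are all hypothetical.

```python
from html.parser import HTMLParser

# Hypothetical inline page; a real run would download HTML over HTTP first.
SAMPLE_HTML = """
<html><body>
  <p class="title">Python</p>
  <p class="desc">A programming language.</p>
  <p class="title">Rust</p>
</body></html>
"""


class TitleCollector(HTMLParser):
    """Collect the text of every <p class="title"> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == 'p' and dict(attrs).get('class') == 'title':
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == 'p':
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.titles.append(data.strip())


def extract_titles(html):
    """Parse an HTML string and return the matching titles."""
    parser = TitleCollector()
    parser.feed(html)
    return parser.titles


print(extract_titles(SAMPLE_HTML))
```

Beautiful Soup wraps the same idea in a friendlier API: a single `find_all` call replaces the hand-written state tracking.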

## Problem Statement 
GitHub is a code hosting platform for version control and collaboration that lets you and others work together on projects from anywhere.
GitHub Topics are labels that create subject-based connections between GitHub repositories and let you explore projects by type, technology, and more.

Our goal is to gather the top repositories for each of the top topics on GitHub, i.e. to scrape them for further data analysis.
Web scraping can be done manually, but when the process involves a large number of web pages, it is more efficient to use an automated web scraping tool like Beautiful Soup or Scrapy.

Tools/languages used: Python, Requests, Beautiful Soup, Pandas

In [1]:
!pip install requests --upgrade --quiet

In [2]:
import requests

In [3]:
!pip install beautifulsoup4 --upgrade --quiet

In [4]:
from bs4 import BeautifulSoup

In [5]:
!pip install pandas --quiet

In [6]:
import pandas as pd

In [17]:
def scrape_Topics():
    topics_url = 'https://github.com/topics'
    response = requests.get(topics_url)
    
    # Check response status
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topics_url))
    page_contents = response.text
    doc = BeautifulSoup(page_contents,'html.parser')
    topic_title_tags = doc.find_all('p',{'class':'f3 lh-condensed mb-0 mt-1 Link--primary'})
    topic_desc_tags = doc.find_all('p',{'class':"f5 color-fg-muted mb-0 mt-1"})
    topic_link_tags = doc.find_all('a',{'class':"no-underline flex-1 d-flex flex-column"})

    topic_titles = []
    topic_desc = []
    topic_urls = []

    for tag in topic_title_tags:
        topic_titles.append(tag.text)
    
    for tag in topic_desc_tags:
        topic_desc.append(tag.text.strip())
    
    for tag in topic_link_tags:
        topic_urls.append('https://github.com'+tag['href'])
    
    Topicsdict = {'Title':topic_titles,'Description':topic_desc,'URL':topic_urls}
    return pd.DataFrame(Topicsdict)

def scrape_topic(topic_url,topic_name):
    topic_df = get_topic_repo(get_topic_page(topic_url))
    topic_df.to_csv(topic_name+'.csv',index = None)
    
def scrape_TopicsRepos():
    print("Scraping top repos from GitHub")
    topics_df = scrape_Topics()
    for index,row in topics_df.iterrows():
        print('Scraping top repos for "{}"'.format(row['Title']))
        scrape_topic(row['URL'],row['Title'])
    
def get_topic_page(topic_url):
    # Download the page
    response = requests.get(topic_url)
    # Check response status
    if(response.status_code!=200):
        raise Exception('Failed to load page {}'.format(topic_url))
    # Parse using beautiful soup
    topic_doc = BeautifulSoup(response.text,'html.parser')
    return topic_doc

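def parse_star_count(stars_str):
    # Convert an abbreviated count such as '12.5k' to an integer
    stars_str = stars_str.strip()
    if stars_str[-1] == 'k':
        return int(float(stars_str[:-1]) * 1000)
    # No 'k' suffix: the count is already a plain number
    return int(stars_str)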
    
def get_repo_info(h3_tag,star_tag):
    a_tags = h3_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = 'https://github.com'+a_tags[1]['href']
    stars = parse_star_count(star_tag.text.strip())
    return username,repo_name,stars,repo_url

def get_topic_repo(topic_doc):

    repo_tags = topic_doc.find_all('h3',{'class':"f3 color-fg-muted text-normal lh-condensed"})
    star_tags = topic_doc.find_all('span',{'class':'Counter js-social-count'})
    
    topicrepo_dict = {'username':[],'repos name':[],'stars':[],'URL':[]}

    # Pair each repo heading with its star counter
    for repo_tag, star_tag in zip(repo_tags, star_tags):
        username, repo_name, stars, repo_url = get_repo_info(repo_tag, star_tag)
        topicrepo_dict['username'].append(username)
        topicrepo_dict['repos name'].append(repo_name)
        topicrepo_dict['stars'].append(stars)
        topicrepo_dict['URL'].append(repo_url)
        
    return pd.DataFrame(topicrepo_dict)

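The functions above build absolute links by prefixing `'https://github.com'` to each `href`. An alternative sketch, assuming the `href` values may be either site-relative or already absolute, uses the standard library's `urljoin`. The helper name `absolute_url` and the sample paths are hypothetical.

```python
from urllib.parse import urljoin

BASE_URL = 'https://github.com'


def absolute_url(href, base=BASE_URL):
    """Resolve a link taken from a page against the site's base URL."""
    # urljoin leaves an already absolute href unchanged
    return urljoin(base, href)


print(absolute_url('/topics/3d'))
print(absolute_url('https://example.com/page'))
```

Plain concatenation is fine as long as every `href` starts with `/`; `urljoin` only matters when that assumption might not hold.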

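Star counters on a topic page are short strings such as `12.5k`, so they need converting before any numeric analysis. Below is a minimal sketch of that conversion that also accepts labels with no suffix. The helper name `parse_count` and the sample labels are hypothetical, and comma handling is an assumption about how larger plain numbers might be rendered.

```python
def parse_count(text):
    """Convert a counter label such as '12.5k' or '987' to an int."""
    text = text.strip().lower().replace(',', '')
    if text.endswith('k'):
        # round() guards against floating-point results like 1199.9999
        return round(float(text[:-1]) * 1000)
    return int(text)


for label in ['12.5k', '1.2k', '987']:
    print(label, '->', parse_count(label))
```

Rounding rather than truncating is a deliberate choice here: multiplying a decimal fraction by 1000 in floating point can land just below the intended integer.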
In [18]:
scrape_TopicsRepos()

Scraping top repos from GitHub
Scraping top repos for "3D"
Scraping top repos for "Ajax"
Scraping top repos for "Algorithm"
Scraping top repos for "Amp"
Scraping top repos for "Android"
Scraping top repos for "Angular"
Scraping top repos for "Ansible"
Scraping top repos for "API"
Scraping top repos for "Arduino"
Scraping top repos for "ASP.NET"
Scraping top repos for "Atom"
Scraping top repos for "Awesome Lists"
Scraping top repos for "Amazon Web Services"
Scraping top repos for "Azure"
Scraping top repos for "Babel"
Scraping top repos for "Bash"
Scraping top repos for "Bitcoin"
Scraping top repos for "Bootstrap"
Scraping top repos for "Bot"
Scraping top repos for "C"
Scraping top repos for "Chrome"
Scraping top repos for "Chrome extension"
Scraping top repos for "Command line interface"
Scraping top repos for "Clojure"
Scraping top repos for "Code quality"
Scraping top repos for "Code review"
Scraping top repos for "Compiler"
Scraping top repos for "Continuous integration"
Scraping to