# Scraping Top Repositories for Topics on GitHub

### What is Web Scraping?

- Web scraping is an automated method for obtaining large amounts of data from websites. Most of this data is unstructured HTML, which is then converted into structured data in a spreadsheet or a database so that it can be used in various applications.

 

### What is GitHub?

- GitHub is a code hosting platform for version control and collaboration. It lets you and others work together on projects from anywhere.

### Problem Statement 

- We will scrape the top 30 topics on GitHub and, for each topic, collect the top 25 repositories from its topic page.

Here are the steps we'll follow:

- We are going to scrape https://github.com/topics.
- We'll get a list of topics. For each topic, we'll get the topic title, topic page URL and topic description.
- For each topic, we'll get the top 25 repositories from the topic page.
- For each repository, we'll grab the repo name, username, stars and repo URL.
- For each topic we'll create a CSV file in the following format:

  Repo Name,Username,Stars,Repo URL
  
  three.js,mrdoob,69700,https://github.com/mrdoob/three.js
  
  libgdx,libgdx,18300,https://github.com/libgdx/libgdx
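
To make the target format concrete, here is a minimal sketch that writes the two sample rows above as CSV text using only Python's standard `csv` module. The helper name `rows_to_csv` is hypothetical and is not part of the notebook; the final code relies on pandas for this step instead.

```python
import csv
import io

# Sample rows copied from the format shown above:
# (repo name, username, stars, repo URL).
rows = [
    ("three.js", "mrdoob", 69700, "https://github.com/mrdoob/three.js"),
    ("libgdx", "libgdx", 18300, "https://github.com/libgdx/libgdx"),
]


def rows_to_csv(rows):
    """Hypothetical helper: return the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Repo Name", "Username", "Stars", "Repo URL"])
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = rows_to_csv(rows)
print(csv_text)
```

Writing to an in-memory buffer keeps the sketch free of side effects; replacing the buffer with an open file handle would produce a file in the same format.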

### Scrape the list of topics from GitHub

- Use requests to download the page
- Use BS4 to parse and extract information
- Convert to Pandas dataframe

Let's write a function to download the page.

In [6]:
import requests
from bs4 import BeautifulSoup


def get_topics_page():
    topics_url = 'https://github.com/topics'
    # download a page
    response = requests.get(topics_url)
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topics_url))
    doc = BeautifulSoup(response.text, 'html.parser')
    return doc



In [9]:
doc = get_topics_page()
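
Once the page is parsed, tags can be selected by tag name and `class` attribute, which is how the final code extracts titles and URLs. The following is a small offline sketch of that idea. The HTML snippet, its `href` values and its descriptions are hypothetical stand-ins for the real topics page, and the class names are the ones used in the final code, which may no longer match GitHub's live markup.

```python
from bs4 import BeautifulSoup

# Hypothetical, heavily simplified stand-in for the topics page markup.
sample_html = """
<a class="d-flex no-underline" href="/topics/3d">
  <p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
  <p class="f5 color-text-secondary mb-0 mt-1">Placeholder description</p>
</a>
<a class="d-flex no-underline" href="/topics/ajax">
  <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>
  <p class="f5 color-text-secondary mb-0 mt-1">Placeholder description</p>
</a>
"""

sample_doc = BeautifulSoup(sample_html, "html.parser")

# Passing the full class string matches tags whose class attribute is exactly that string.
title_tags = sample_doc.find_all("p", {"class": "f3 lh-condensed mb-0 mt-1 Link--primary"})
titles = [tag.text.strip() for tag in title_tags]

# Topic links are relative, so prepend the site root to build absolute URLs.
link_tags = sample_doc.find_all("a", {"class": "d-flex no-underline"})
urls = ["https://github.com" + tag["href"] for tag in link_tags]

print(titles)
print(urls)
```

Because the match is against the exact class string, any change in the order or set of classes on the live page would make the search return an empty list, so it is worth checking the lengths of the results before building a dataframe from them.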

## Final Code

In [22]:
import os
import requests
from bs4 import BeautifulSoup
import pandas as pd


def get_topic_page(topic_url):
    # Download the page
    response = requests.get(topic_url)
    if response.status_code != 200:
        raise Exception('Failed to load {}'.format(topic_url))
    
    topic_doc = BeautifulSoup(response.text, 'html.parser')
    return topic_doc


def get_repo_info(h1_tags, star_tag):
    a_tags = h1_tags.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    base_url = 'https://github.com'
    repo_url = base_url + a_tags[1]['href']
    stars = parse_star_count(star_tag.text.strip())
    return username, repo_name, stars, repo_url

def get_topic_repos(topic_doc):
    h1_selection_class = 'f3 color-text-secondary text-normal lh-condensed'
    repo_tags = topic_doc.find_all('h1', {'class': h1_selection_class})
    star_tags = topic_doc.find_all('a',{'class':'social-count float-none'})
   
    topic_repos_dict = {
    'username': [],
    'repo_name': [],
    'stars': [],
    'repo_url': []
    }

    for i in range(len(repo_tags)):
        repo_info = get_repo_info(repo_tags[i], star_tags[i])
        topic_repos_dict['username'].append(repo_info[0])
        topic_repos_dict['repo_name'].append(repo_info[1])
        topic_repos_dict['stars'].append(repo_info[2])
        topic_repos_dict['repo_url'].append(repo_info[3])
    
    return pd.DataFrame(topic_repos_dict)


def scrape_topic(topic_url, path):
    if os.path.exists(path):
        print("The file {} already exists. Skipping...".format(path))
        return
    topic_df = get_topic_repos(get_topic_page(topic_url))
    topic_df.to_csv(path, index=None)
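
Note that `get_repo_info` calls `parse_star_count`, which is not defined in the code shown here, so it must be defined before `scrape_topic` is run. The sample CSV rows earlier store stars as plain integers (for example 69700), so one plausible version converts the star label to an `int`. The sketch below assumes the label is either plain digits or a number with a `k` suffix such as `69.7k`; that label format is an assumption, not something confirmed by the code above.

```python
def parse_star_count(stars_str):
    """Convert a star label such as '69.7k' or '843' to an integer.

    Assumes the label is plain digits (optionally with thousands commas)
    or a decimal number followed by 'k' meaning thousands.
    """
    stars_str = stars_str.strip().replace(",", "")
    if stars_str.lower().endswith("k"):
        # round() guards against floating-point results like 69699.999...
        return int(round(float(stars_str[:-1]) * 1000))
    return int(stars_str)


print(parse_star_count("69.7k"))
print(parse_star_count("843"))
```

Any other label format would raise a `ValueError` from `int` or `float`, which is preferable to silently writing a wrong count to the CSV file.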

In [23]:
def get_topic_titles(doc):
    selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
    topic_title_tags = doc.find_all('p',{'class':selection_class})
    topic_titles = []
    for tag in topic_title_tags:
        topic_titles.append(tag.text.strip())

    return topic_titles

def get_topic_descs(doc):
    selection_class = 'f5 color-text-secondary mb-0 mt-1'
    topic_desc_tags = doc.find_all('p',{'class':selection_class})
    topic_descs = []

    for tag in topic_desc_tags:
        topic_descs.append(tag.text.strip())

    return topic_descs

def get_topic_urls(doc):
    topic_link_tags = doc.find_all('a', {'class': 'd-flex no-underline'})
    topic_urls = []
    base_url = 'https://github.com'
    for tag in topic_link_tags:
        topic_urls.append(base_url + tag['href'])

    return topic_urls


def scrape_topics():
    topic_url = 'https://github.com/topics'
    response = requests.get(topic_url)

    if response.status_code != 200:
        raise Exception('Failed to load {}'.format(topic_url))
    
    topic_doc = BeautifulSoup(response.text, 'html.parser')
   
    topics_dict = {
        'title': get_topic_titles(topic_doc),
        'description': get_topic_descs(topic_doc),
        'url': get_topic_urls(topic_doc)
    }
    return pd.DataFrame(topics_dict)

In [24]:
def scrape_topics_repos():
    print('Scraping list of topics')
    topics_df = scrape_topics()
    
    os.makedirs('data', exist_ok=True)
    for index, row in topics_df.iterrows():
        print('Scraping top repositories for "{}"'.format(row['title']))
        scrape_topic(row['url'], 'data/{}.csv'.format(row['title']))
    print('Scraping of top repositories is complete')

In [25]:
scrape_topics_repos()

Scraping list of topics
Scraping top repositories for "3D"
The file data/3D.csv already exists. Skipping...
Scraping top repositories for "Ajax"
The file data/Ajax.csv already exists. Skipping...
Scraping top repositories for "Algorithm"
The file data/Algorithm.csv already exists. Skipping...
Scraping top repositories for "Amp"
The file data/Amp.csv already exists. Skipping...
Scraping top repositories for "Android"
The file data/Android.csv already exists. Skipping...
Scraping top repositories for "Angular"
The file data/Angular.csv already exists. Skipping...
Scraping top repositories for "Ansible"
The file data/Ansible.csv already exists. Skipping...
Scraping top repositories for "API"
The file data/API.csv already exists. Skipping...
Scraping top repositories for "Arduino"
The file data/Arduino.csv already exists. Skipping...
Scraping top repositories for "ASP.NET"
The file data/ASP.NET.csv already exists. Skipping...
Scraping top repositories for "Atom"
The file data/Atom.csv alre