# Scraping Top Repositories for Topics on GitHub

Introduction:

Web scraping is the process of extracting and parsing data from websites in an automated fashion using a computer program. It's a useful technique for creating datasets for research and learning.

Problem Statement:

Get the list of topics on GitHub and scrape the top repositories for each topic.

Tools Used:
Python, requests, BeautifulSoup, pandas


Here are the steps to follow:

- We'll scrape https://github.com/topics.
- We'll parse the downloaded HTML content using Beautiful Soup.
- We'll get a list of topics. For each topic, we'll get the topic title, topic
  page URL and topic description from the soup object.
- For each topic, we'll get the top 25 repositories in the topic from the topic
  page.
- For each repository, we'll grab the repo name, username, stars and repo URL.
- We'll create a DataFrame from the scraped data using pandas.
- Finally, we'll save the DataFrame as a CSV file.




## 1. Scrape the list of topics from GitHub

### Import Libraries

- Use the requests library to download the web page.
- Use BeautifulSoup (bs4) to parse the HTML and extract information.
- Use pandas to store the extracted data in a DataFrame.
- Use os to check whether an output file already exists.

In [24]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import os

### URL of the webpage

In [2]:
url="https://github.com/topics"


Send a GET request using the requests library. The response to the request contains the HTML page.

In [25]:
response=requests.get(url)
response.status_code # Status code 200 means that the request was successful


200
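
The status check above can also be delegated to requests itself. The following is a minimal sketch, not part of the original notebook: `fetch_html` is a hypothetical helper name, and the 10-second timeout is an arbitrary assumption. `raise_for_status()` raises `requests.HTTPError` for 4xx and 5xx responses.

```python
import requests


def fetch_html(url, timeout=10, session=None):
    """Download a page and return its HTML text.

    `fetch_html` is a hypothetical helper; the timeout value is an assumption.
    """
    # Use the caller's session if one is given, otherwise the requests module.
    http = session if session is not None else requests
    response = http.get(url, timeout=timeout)
    # Raises requests.HTTPError for 4xx/5xx status codes.
    response.raise_for_status()
    return response.text
```

With this helper, `fetch_html(url)` would return the same text that `response.text` provides in the next cell, while failing loudly on an error status.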

Getting the HTML content from the downloaded page

In [26]:
page_contents=response.text
len(page_contents)


156401

Parsing the HTML content using BeautifulSoup

In [27]:
doc=BeautifulSoup(page_contents,'html.parser')
type(doc)

bs4.BeautifulSoup

### Extracting Topic_Title

In [41]:
selection_class= "f3 lh-condensed mb-0 mt-1 Link--primary"
topic_title_tags=doc.find_all('p',{'class':selection_class})
#topic_title_tags

In [42]:
topic_titles=[]
for tag in topic_title_tags:
    topic_titles.append(tag.text)

print(topic_titles)

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android', 'Angular', 'Ansible', 'API', 'Arduino', 'ASP.NET', 'Atom', 'Awesome Lists', 'Amazon Web Services', 'Azure', 'Babel', 'Bash', 'Bitcoin', 'Bootstrap', 'Bot', 'C', 'Chrome', 'Chrome extension', 'Command line interface', 'Clojure', 'Code quality', 'Code review', 'Compiler', 'Continuous integration', 'COVID-19', 'C++']
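
Matching on the full class string is brittle, since it depends on the exact class list and order that the page emits. As a hedged alternative, Beautiful Soup's `select` accepts CSS selectors, which match any element that carries at least the listed classes. The HTML below is a made-up fragment for illustration, not GitHub's actual markup.

```python
from bs4 import BeautifulSoup

# Made-up fragment that imitates the structure of a topics page.
sample_html = """
<div>
  <p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
  <p class="f5 color-fg-muted mb-0 mt-1">Three-dimensional graphics.</p>
  <p class="f3 lh-condensed mb-0 mt-1 Link--primary"> Ajax </p>
</div>
"""

sample_doc = BeautifulSoup(sample_html, "html.parser")

# "p.f3.lh-condensed" matches <p> tags that carry both classes,
# regardless of any other classes on the tag or their order.
title_tags = sample_doc.select("p.f3.lh-condensed")
sample_titles = [tag.get_text(strip=True) for tag in title_tags]
print(sample_titles)
```

Using `get_text(strip=True)` also removes surrounding whitespace, which the plain `tag.text` in the cell above does not.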


### Extracting Topic_Description

In [43]:
desc_selector="f5 color-fg-muted mb-0 mt-1"
topic_desc_tags=doc.find_all('p',{'class': desc_selector})
#topic_desc_tags

In [44]:
topic_desc= []

for tag in topic_desc_tags:
    topic_desc.append(tag.text.strip())  # strip() removes leading and trailing whitespace

print(topic_desc)

['3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.', 'Ajax is a technique for creating interactive web applications.', 'Algorithms are self-contained sequences that carry out a variety of tasks.', 'Amp is a non-blocking concurrency library for PHP.', 'Android is an operating system built by Google designed for mobile devices.', 'Angular is an open source web application platform.', 'Ansible is a simple and powerful automation engine.', 'An API (Application Programming Interface) is a collection of protocols and subroutines for building software.', 'Arduino is an open source platform for building electronic devices.', 'ASP.NET is a web framework for building modern web apps and services.', 'Atom is a open source text editor built with web technologies.', 'An awesome list is a list of awesome things curated by the community.', 'Amazon Web Services provides on-demand cloud computing platforms on a subscription basis.', 'Azure is a cloud co

### Extracting Topic_URL

In [45]:
topic_link_tags=doc.find_all('a',{'class':"no-underline flex-grow-0"})
topic0_url="https://github.com"+ topic_link_tags[0]['href']
#print(topic0_url)
#topic_link_tags

In [46]:
topic_url=[]
base_url= 'https://github.com'

for tag in topic_link_tags:
    topic_url.append(base_url+ tag['href'])

print(topic_url)

['https://github.com/topics/3d', 'https://github.com/topics/ajax', 'https://github.com/topics/algorithm', 'https://github.com/topics/amphp', 'https://github.com/topics/android', 'https://github.com/topics/angular', 'https://github.com/topics/ansible', 'https://github.com/topics/api', 'https://github.com/topics/arduino', 'https://github.com/topics/aspnet', 'https://github.com/topics/atom', 'https://github.com/topics/awesome', 'https://github.com/topics/aws', 'https://github.com/topics/azure', 'https://github.com/topics/babel', 'https://github.com/topics/bash', 'https://github.com/topics/bitcoin', 'https://github.com/topics/bootstrap', 'https://github.com/topics/bot', 'https://github.com/topics/c', 'https://github.com/topics/chrome', 'https://github.com/topics/chrome-extension', 'https://github.com/topics/cli', 'https://github.com/topics/clojure', 'https://github.com/topics/code-quality', 'https://github.com/topics/code-review', 'https://github.com/topics/compiler', 'https://github.com/t
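
String concatenation works here because every `href` on the page is a path that starts with `/`. A slightly more defensive sketch uses `urllib.parse.urljoin` from the standard library, which also copes with `href` values that are already absolute. The sample `href` values below are illustrative.

```python
from urllib.parse import urljoin

BASE_URL = "https://github.com"

# Illustrative href values; the first two mirror the relative links seen above.
sample_hrefs = ["/topics/3d", "/topics/ajax", "https://github.com/topics/api"]

# urljoin resolves relative paths against the base and leaves absolute URLs alone.
sample_urls = [urljoin(BASE_URL, href) for href in sample_hrefs]
print(sample_urls)
```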

## Creating Dataframe For the Topic Lists

In [47]:
topic_dict={
    'Title': topic_titles,
    'Description': topic_desc,
    'URL': topic_url
}

In [50]:
topic_df=pd.DataFrame(topic_dict)
#topic_df
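
`pd.DataFrame` requires all lists in the dictionary to have the same length, so a selector that silently stops matching would surface here as a `ValueError`. A small check before building the DataFrame makes the cause easier to spot. This is a sketch: `check_equal_lengths` is a hypothetical helper, and the sample lists are made up.

```python
def check_equal_lengths(columns):
    """Raise ValueError if the lists in `columns` differ in length.

    Returns a dict mapping each column name to its length.
    """
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Column lengths differ: {lengths}")
    return lengths


# Made-up sample data in the same shape as topic_dict.
sample_columns = {
    "Title": ["3D", "Ajax"],
    "Description": ["Three-dimensional graphics.", "A web technique."],
    "URL": ["https://github.com/topics/3d", "https://github.com/topics/ajax"],
}
print(check_equal_lengths(sample_columns))
```

Calling `check_equal_lengths(topic_dict)` before `pd.DataFrame(topic_dict)` would report which of the three lists came back short.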

## Create CSV file for the List of Topics

In [52]:
topic_df.to_csv('Topics.csv', index=False)
#pd.read_csv("Topics.csv")

## 2. Get the top repositories from a topic page


### Extracting Information from each Topic Page

Create a function get_repo_info(h3_tag, star_tag) that extracts the username, repository name, repository URL and star count of a repository from its h3 tag and its star tag.

In [28]:
def get_repo_info(h3_tag,star_tag):
    #returns all the required repository information
    a_tags=h3_tag.find_all('a')
    username=a_tags[0].text.strip()
    repo_name=a_tags[1].text.strip()
    repo_url=base_url+a_tags[1]['href']
    stars=(star_tag.text.strip())
    return username,repo_name,repo_url,stars
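
`get_repo_info` returns the star count as displayed text. If the page shows abbreviated counts such as `80.5k` (an assumption about the markup, not something the code above guarantees), converting them to integers makes the column sortable. `parse_star_count` is a hypothetical helper.

```python
def parse_star_count(stars_text):
    """Convert text such as '1,234', '80.5k' or '1.2m' to an integer."""
    text = stars_text.strip().lower().replace(",", "")
    multipliers = {"k": 1_000, "m": 1_000_000}
    if text and text[-1] in multipliers:
        # round() guards against floating-point error in the multiplication.
        return round(float(text[:-1]) * multipliers[text[-1]])
    return int(text)


print(parse_star_count("80.5k"))
print(parse_star_count("1,234"))
```

The result of `get_repo_info` could be passed through this helper before it is appended to the `Stars` list.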

Create a function get_topic_page(topic_url): It takes the topic URL as an argument, fetches the content with a GET request using the requests library, parses it using BeautifulSoup and returns the parsed object.

In [29]:
def get_topic_page(topic_url):
    response = requests.get(topic_url)
    if response.status_code !=200:
        raise Exception('Failed to load page {}'.format(topic_url))

    topic_doc = BeautifulSoup(response.text,'html.parser')
    return topic_doc

Create a function get_topic_repos(topic_doc): It takes a BeautifulSoup object as an argument, finds all the h3 tags that contain the username and repository name along with the span tags that hold the star counts, stores the extracted values in a dictionary and returns a pandas DataFrame built from that dictionary.

In [30]:
def get_topic_repos(topic_doc):

    repo_tags = topic_doc.find_all('h3',attrs={
        'class' : 'f3 color-fg-muted text-normal lh-condensed'})
    stars_tags= topic_doc.find_all('span',attrs= {
        'class' : 'Counter js-social-count'
            })
    topic_repo_dict = {
    'Username':[],
    'Repo_Name':[],
    'Repo_Url':[],
    'Stars': []
    }
    for i in range(len(repo_tags)):
        repo_info = get_repo_info(repo_tags[i],stars_tags[i])
        topic_repo_dict['Username'].append(repo_info[0])
        topic_repo_dict['Repo_Name'].append(repo_info[1])
        topic_repo_dict['Repo_Url'].append(repo_info[2])
        topic_repo_dict['Stars'].append(repo_info[3])

    return pd.DataFrame(topic_repo_dict)


Create a function scrape_topic(topic_url, topic_name): It scrapes the top repositories from the topic URL and saves them as a CSV file named after the topic title. If the file already exists, the topic is skipped.

In [31]:
def scrape_topic(topic_url,topic_name):
    filename = topic_name+".csv"
    if os.path.exists(filename):
        print(f"File {filename} already exists. Skipping...")
        return
    topic_df = get_topic_repos(get_topic_page(topic_url))

    topic_df.to_csv(filename, index=False)
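
Topic titles are used directly as file names. That works for the titles listed earlier, but a title containing a path separator such as `/` would fail. Below is a hedged sketch of a sanitizer, where `safe_filename` is a hypothetical helper, the set of replaced characters is an assumption, and `CI/CD` is a made-up title.

```python
import re


def safe_filename(topic_name, extension=".csv"):
    """Build a file name from a topic title, replacing risky characters."""
    # Replace path separators and other characters that are invalid on
    # common file systems with an underscore.
    cleaned = re.sub(r'[\\/:*?"<>|]', "_", topic_name).strip()
    return cleaned + extension


print(safe_filename("ASP.NET"))
print(safe_filename("CI/CD"))
```

In `scrape_topic`, `filename = safe_filename(topic_name)` could take the place of the plain concatenation.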

Create a function get_topic_titles(soup): It takes a BeautifulSoup object, finds the topic titles in the paragraph (p) tags of the page and returns them as a list.

In [36]:
def get_topic_titles(soup):
    topic_title_tags = soup.find_all('p',attrs={
    'class':'f3 lh-condensed mb-0 mt-1 Link--primary'
    })
    topic_titles = []

    for tag in topic_title_tags:
        topic_titles.append(tag.text)
    return topic_titles

Create a function get_topic_description(soup): It takes a BeautifulSoup object, finds the topic descriptions in the paragraph (p) tags of the page and returns them as a list.

In [39]:
def get_topic_description(soup):
    topic_desc_tag = soup.find_all('p', attrs=
                                   {'class':'f5 color-fg-muted mb-0 mt-1'})
    topic_descriptions = []

    for tag in topic_desc_tag:
        topic_descriptions.append(tag.text.strip())
    return topic_descriptions

Create a function get_topic_urls(soup): It takes a BeautifulSoup object, finds the topic links in the anchor (a) tags of the page and returns the full URLs as a list.

In [38]:
def get_topic_urls(soup):
    topic_link_tags = soup.find_all('a',attrs =
                            {'class' : 'no-underline flex-grow-0'})
    topic_urls = []
    base = "https://github.com"  # same base URL as in the earlier cells
    for tag in topic_link_tags:
        topic_urls.append(base+tag['href'])
    return topic_urls

Create a function scrape_topics(): scrapes the topic title from the page and their corresponding description and url and returns a DataFrame object.

In [33]:
def scrape_topics():
    topics_url = "https://github.com/topics"
#     topics_url = "https://github.com/topics?page=1"
    response = requests.get(topics_url)
    if response.status_code !=200:
        raise Exception('Failed to load page {}'.format(topics_url))
    soup = BeautifulSoup(response.text,'html.parser')
    topics_dict = {
        'title' : get_topic_titles(soup),
        'description': get_topic_description(soup),
        'url':get_topic_urls(soup)
    }
    return pd.DataFrame(topics_dict)
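
The commented-out line in `scrape_topics` hints that the topics listing is paginated through a `page` query parameter. Assuming later pages follow the same `?page=N` pattern, the URL for a given page could be built as below. `topics_page_url` is a hypothetical helper, and nothing here checks how many pages actually exist.

```python
from urllib.parse import urlencode

TOPICS_URL = "https://github.com/topics"


def topics_page_url(page_number):
    """Return the topics listing URL for a 1-based page number."""
    if page_number < 1:
        raise ValueError("page_number must be 1 or greater")
    return f"{TOPICS_URL}?{urlencode({'page': page_number})}"


print([topics_page_url(n) for n in range(1, 4)])
```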

Finally, bring everything together in one function, scrape_topics_repos(): It first scrapes the topic titles with their descriptions and URLs using scrape_topics(), and then uses each URL to scrape the top repositories of that topic.

In [34]:
def scrape_topics_repos():
    print("Scraping list topics from Github")
    topics_df = scrape_topics()
    for index,row in topics_df.iterrows():
        print(f"Scraping top repositories for {row['title']}")
        scrape_topic(row['url'],row['title'])
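
The loop above stops at the first topic that raises an exception and sends its requests back to back. A hedged variation pauses between topics and records failures instead of aborting. `run_for_each` is a hypothetical helper, and the one-second default delay is an arbitrary assumption; `action` stands in for a function such as `scrape_topic`.

```python
import time


def run_for_each(items, action, delay_seconds=1.0):
    """Call `action(item)` for each item, pausing between calls.

    Returns a list of (item, exception) pairs for the calls that failed.
    """
    failures = []
    for position, item in enumerate(items):
        if position > 0:
            time.sleep(delay_seconds)  # pause between consecutive calls
        try:
            action(item)
        except Exception as error:  # keep going after a failed item
            failures.append((item, error))
    return failures
```

For example, `run_for_each(list(topics_df.itertuples(index=False)), lambda row: scrape_topic(row.url, row.title))` would scrape every topic and return the ones that failed for a later retry.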

In [40]:
scrape_topics_repos()

Scraping list topics from Github
Scraping top repositories for 3D
Scraping top repositories for Ajax
Scraping top repositories for Algorithm
Scraping top repositories for Amp
Scraping top repositories for Android
Scraping top repositories for Angular
Scraping top repositories for Ansible
Scraping top repositories for API
Scraping top repositories for Arduino
Scraping top repositories for ASP.NET
Scraping top repositories for Atom
Scraping top repositories for Awesome Lists
Scraping top repositories for Amazon Web Services
Scraping top repositories for Azure
Scraping top repositories for Babel
Scraping top repositories for Bash
Scraping top repositories for Bitcoin
Scraping top repositories for Bootstrap
Scraping top repositories for Bot
Scraping top repositories for C
Scraping top repositories for Chrome
Scraping top repositories for Chrome extension
Scraping top repositories for Command line interface
Scraping top repositories for Clojure
Scraping top repositories for Code quality
Scr