## **Project Title: Scraping Top Repositories for Topics on GitHub**

## **Introduction**

### **What is web scraping?**
Web scraping, also known as **web data extraction**, is the process of collecting structured data from web pages in an automated fashion.

### **How does scraping work?**
- A user sends a **request** to a **web server** (for example www.google.com). The server responds with a **raw index.html** file. A web browser such as Chrome receives that file, renders it, and presents it to the user as a web page.

- When scraping, the same **index.html** file is requested from the server, but instead of being handed to a browser it is loaded into our **code environment**. There we extract data from it with queries and store the results in CSV, Excel, a database, or another file format.
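
The flow above can be sketched without a browser or a network connection. The snippet below is a minimal illustration, assuming a small hard-coded HTML string (SAMPLE_HTML, made up for this example) in place of a real server response. It uses Python's standard-library html.parser rather than Beautiful Soup, which the rest of this notebook uses, and the class name LinkCollector is hypothetical.

```python
from html.parser import HTMLParser

# Stand-in for the raw HTML a server would return (made up for illustration).
SAMPLE_HTML = """
<html><body>
  <a href="/topics/3d">3D</a>
  <a href="/topics/ajax">Ajax</a>
</body></html>
"""

class LinkCollector(HTMLParser):
    """Collects (href, text) pairs for every anchor tag that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._current_href = None

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self._current_href = dict(attrs).get('href')

    def handle_data(self, data):
        # Text seen while an anchor is open belongs to that anchor.
        if self._current_href is not None and data.strip():
            self.links.append((self._current_href, data.strip()))

    def handle_endtag(self, tag):
        if tag == 'a':
            self._current_href = None

collector = LinkCollector()
collector.feed(SAMPLE_HTML)
print(collector.links)
```

The raw HTML is plain text. Parsing it is what turns it into something that can be queried for specific tags and attributes.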





### **Problem Statement**:
  - Scrape the list of topics on https://github.com/topics and, for each topic, fetch its **title, page URL, and description**.
  - For each topic, fetch its **top 30 trending repositories** with the **repository name, username, stars, and repository URL**.
  - Finally, create one **CSV file per topic**.


**Tools used**: Python, NumPy, requests, Beautiful Soup, pandas

## **Outline**
- As stated in the problem statement, scrape https://github.com/topics.
- Get the list of topics. For each topic, get its title, page URL, and description.
- For each topic, get the top 30 trending repositories from its topic page.
- For each repository, get its name, username, stars, and URL.
- Create a CSV file for each topic.

## **Part 1: Scraping the list of topics from https://github.com/topics**

**Explanation**:
- Use **requests** to download the index.html file from https://github.com/topics.
- Parse the HTML with **Beautiful Soup (bs4)**, using Python's built-in **html.parser** (the parser the code below passes to BeautifulSoup).
- Beautiful Soup turns the parsed file into a tree structure that can be traversed and queried.
- Finally, convert the results into a **pandas DataFrame**.

In [45]:
# Libraries to include
import requests
from bs4 import BeautifulSoup
import pandas as pd

In [46]:
# Function 1
# Description : Download the https://github.com/topics page and parse it into a BeautifulSoup document.

def get_topics_page():
  topics_url = 'https://github.com/topics'
  response = requests.get(topics_url)
  # Check for successful response 
  if response.status_code != 200:
    raise Exception('Failed to load page {}'.format(topics_url))
  doc = BeautifulSoup(response.text , 'html.parser')
  return doc

In [47]:
# Function 1 : demo

doc = get_topics_page()
print(type(doc)) # Type of doc
print(doc.find('a')) # Finding the first anchor tag in parsed doc.

<class 'bs4.BeautifulSoup'>
<a class="px-2 py-4 color-bg-info-inverse color-text-white show-on-focus js-skip-to-content" href="#start-of-content">Skip to content</a>


**Note:** How to find the selection_class for an element on a page
- Inspect the element on the web page whose information you need (for example with the browser's developer tools).
- Read the value of its class attribute, as shown in the picture below.
![image.png](https://i.imgur.com/QU24yMs.png)
 

In [48]:
# Function 2
# Description : Let's create some helper functions to parse information from the GitHub topics page

# Function (2.1)
# Description : get_topic_titles() retrieves the list of topic titles on the https://github.com/topics page

def get_topic_titles(doc):
  selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
  topic_title_tags = doc.find_all('p',class_=selection_class)
  topic_titles = []
  for tag in topic_title_tags:
    topic_titles.append(tag.text)
  return topic_titles

In [49]:
# Function (2.1) : demo
topic_titles = get_topic_titles(doc)
print(len(topic_titles))
print(topic_titles[:5]) # Printing first 5 topics on https://github.com/topics page

30
['3D', 'Ajax', 'Algorithm', 'Amp', 'Android']


In [50]:
# Function (2.2)
# Description : get_topic_descs() is used to get the description of the corresponding topics

def get_topic_descs(doc):
  desc_selector = 'f5 color-text-secondary mb-0 mt-1'
  topic_desc_tags = doc.find_all('p',class_=desc_selector)
  topic_descs = []
  for desc in topic_desc_tags:
    topic_descs.append(desc.text.strip())
  return topic_descs

In [51]:
# Function (2.2) : demo
topic_descs = get_topic_descs(doc)
topic_descs[0]

'3D modeling is the process of virtually developing the surface and structure of a 3D object.'

In [52]:
# Function (2.3)
# Description : get_topic_urls() is used to get the url associated with the topics

def get_topic_urls(doc):
  topic_link_tags = doc.find_all('a',class_='d-flex no-underline')
  topic_urls = []
  base_url = "https://github.com"
  for url in topic_link_tags:
    topic_urls.append(base_url + url['href'])
  return topic_urls

In [53]:
# Function (2.3) : demo
topic_urls = get_topic_urls(doc)
topic_urls[0]

'https://github.com/topics/3d'
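
Function (2.3) builds absolute URLs by string concatenation, which works as long as every href starts with a slash. As a hedged alternative, the standard library's urllib.parse.urljoin handles both relative links and links that are already absolute. The helper name to_absolute and the sample hrefs are illustrative and not taken from the scraped page.

```python
from urllib.parse import urljoin

BASE_URL = "https://github.com"

def to_absolute(href, base_url=BASE_URL):
    """Resolve href against base_url; absolute hrefs are returned unchanged."""
    return urljoin(base_url, href)

print(to_absolute("/topics/3d"))
print(to_absolute("https://github.com/topics/ajax"))
```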

In [54]:
# Merging Function 1 and Function 2 together
# Function 3
# Description : Downloads the topics page, extracts the topic titles, descriptions and URLs, and returns them as a pandas DataFrame


def scrape_topics():
  # Download and parse the topics page (raises an exception on a non-200 response)
  doc = get_topics_page()
  topics_dict = {
      'title':get_topic_titles(doc),
      'description':get_topic_descs(doc),
      'url':get_topic_urls(doc)
  }
  return pd.DataFrame(topics_dict)

In [None]:
# Function (3) : demo
scrape_topics()

## **Part 2 : Getting the top repositories from the topics**

In [69]:
# Function 4
# Description : Function which takes a single topic URL and returns the parsed topic page

def get_topic_page(topic_url):
  # Download the page
  response = requests.get(topic_url)
  # Check for successful response 
  if response.status_code != 200:
    raise Exception('Failed to load page {}'.format(topic_url))
  # Parse the page using BeautifulSoup
  topic_doc = BeautifulSoup(response.text , 'html.parser')
  return topic_doc

In [57]:
# Helper function : Converts a repository's star count from a string (e.g. '987' or '12.5k') to an integer
def parse_star_count(stars_str):
  stars_str = stars_str.strip()
  if stars_str[-1] == 'k':
    # round() instead of int() so floating-point error cannot truncate the result
    return round(float(stars_str[:-1])*1000)
  return int(stars_str)
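
A few sample inputs make the behaviour of the star-count helper concrete. The sketch below is a variant of the helper that rounds instead of truncating, so floating-point error cannot drop a star. It is repeated here so the snippet runs on its own, and the sample strings are made up rather than scraped.

```python
def parse_star_count(stars_str):
    """Convert strings such as '987' or '12.5k' to an integer star count."""
    stars_str = stars_str.strip()
    if stars_str[-1] == 'k':
        # The 'k' suffix means thousands: '12.5k' -> 12500
        return round(float(stars_str[:-1]) * 1000)
    return int(stars_str)

samples = [' 987 ', '1.2k', '12.5k', '80k']
print([parse_star_count(s) for s in samples])
```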

In [64]:
# Function 5
# Description : Function which takes a repository's h3 tag and its star tag and returns username, repo_name, stars and repo_url.

def get_repo_info(h3_tag , star_tag):
  # The h3 tag holds two links: the first is the user, the second is the repository
  base_url = "https://github.com"
  a_tags = h3_tag.find_all('a')
  username = a_tags[0].text.strip()
  repo_name = a_tags[1].text.strip()
  stars = parse_star_count(star_tag.text.strip())
  repo_url = base_url + a_tags[1]['href']
  return username,repo_name,stars,repo_url

In [59]:
# Function 6
# Description : Function to collect a topic's repository data into a dictionary and then convert it to a pandas DataFrame

def get_topic_repos(topic_doc):
  
  # Get the h3 tag to get repo title , username and its urls
  h3_selection_class = 'f3 color-text-secondary text-normal lh-condensed'
  repo_tags = topic_doc.find_all('h3',class_=h3_selection_class)
  # Get class to get star tag
  star_tags = topic_doc.find_all('a',class_= 'social-count float-none')

  # Finally get all repo info

  topic_repos_dict = {
    'username': [],
    'repo_name':[],
    'stars':[],
    'repo_url':[]

  }

  for i in range(len(repo_tags)):
    repo_info = get_repo_info(repo_tags[i] ,star_tags[i])
    topic_repos_dict['username'].append(repo_info[0])
    topic_repos_dict['repo_name'].append(repo_info[1])
    topic_repos_dict['stars'].append(repo_info[2])
    topic_repos_dict['repo_url'].append(repo_info[3])
  return pd.DataFrame(topic_repos_dict)

In [None]:
get_topic_repos(get_topic_page(topic_urls[0]))
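
Function 6 pairs each h3 tag with a star tag by index, which assumes both lists have the same length. The following is a minimal sketch of a stricter pairing, using plain strings in place of the parsed tags. The helper name pair_strict and the sample values are hypothetical.

```python
def pair_strict(repo_tags, star_tags):
    """Pair items positionally, failing loudly if the lists differ in length."""
    if len(repo_tags) != len(star_tags):
        raise ValueError(
            'Found {} repositories but {} star counts'.format(
                len(repo_tags), len(star_tags)))
    return list(zip(repo_tags, star_tags))

# Placeholder values standing in for the scraped tags.
print(pair_strict(['owner-a/repo-a', 'owner-b/repo-b'], ['1.2k', '987']))
```

Failing early with a clear message is easier to debug than an IndexError raised halfway through the loop, or rows that silently pair a repository with the wrong star count.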

## **Part 3: Final Merging Functions**

In [60]:
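# Helper function : Scrapes one topic page and saves its repositories to a CSV file at the given path
import os
def scrape_topic(topic_url , path):
  # Skip topics whose CSV file already exists
  if os.path.exists(path):
    print("The file {} already exists....skipping.. ".format(path))
    return
  topic_df = get_topic_repos(get_topic_page(topic_url))
  # index=False keeps the DataFrame index out of the CSV file
  topic_df.to_csv(path , index=False)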

In [61]:
# Function 7
# Description : Final scraping function

import os

def scrape_topics_repos():
  print("Scraping list of topics from Github")
  topics_df = scrape_topics()
  # Creating a folder
  os.makedirs('data',exist_ok=True)

  for index , row in topics_df.iterrows(): # Looping over rows in pandas data frame
    print("Scraping top repository for {}".format(row['title']))
    scrape_topic(row['url'],'data/{}.csv'.format(row['title']))

In [None]:
scrape_topics_repos()
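
Function 7 uses the topic title directly as the file name. A title containing a character such as a slash would produce an unexpected path, so one possible precaution is to sanitise the title first. The helper below is a sketch under that assumption. The name safe_csv_path is hypothetical, and the replacement rule is one choice among many.

```python
import os
import re

def safe_csv_path(title, folder='data'):
    """Build folder/<title>.csv, replacing characters outside [A-Za-z0-9._+-] with '_'."""
    cleaned = re.sub(r'[^A-Za-z0-9._+-]', '_', title.strip())
    return os.path.join(folder, cleaned + '.csv')

print(safe_csv_path('3D'))
print(safe_csv_path('C++'))
print(safe_csv_path('ASP.NET / Core'))
```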

## **Conclusion and future work**
- Successfully scraped the top 30 repositories for each topic on the GitHub topics page.
- All the data is stored in a local folder named data, with the following hierarchy:

![image.png](https://i.imgur.com/N8Y7Mpd.png)

- A CSV file looks like this:

![image.png](https://i.imgur.com/PtH1V0R.png)
- Future work: find a way to automatically scrape data from the other pages as well (page 2, page 3, ...).
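
One possible starting point for that future work is sketched below. It assumes that topic pages accept a page query parameter (for example https://github.com/topics/3d?page=2), which is not verified anywhere in this notebook. The helper name topic_page_urls is hypothetical, and no request is made here.

```python
from urllib.parse import urlencode

def topic_page_urls(topic_url, last_page):
    """Return the URLs for pages 1..last_page of a topic, assuming ?page=N works."""
    return ['{}?{}'.format(topic_url, urlencode({'page': page}))
            for page in range(1, last_page + 1)]

for url in topic_page_urls('https://github.com/topics/3d', 3):
    print(url)
```

If the assumption holds, each URL could be passed to get_topic_page and the resulting DataFrames combined with pd.concat.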