# Web-Scraping Project:

By Akshat Girdhar

##### Credit to https://www.youtube.com/@jovianhq for providing me with a great tutorial to follow along and work on this project. 
Video Link: https://www.youtube.com/watch?v=RKsLLG-bzEY&t=5465s

##### The project involves using web scraping to scrape data from the explore page on Github which features the trending topics and the famous repositories related to each topic. The goal is to create a dataset which contains details about all the trending topics and then another dataset that contains details about each topic. 

For a beginner, this is a great tutorial and covers all the major topics related to Beautiful Soup. It even uses Pandas library and gives a look into how one can use web scraping, data analysis to make a great project.  

- In the first part, the goal is to scrape https://github.com/ using the BeautifulSoup library.
- We are going to scrape the data present in the topics section.
- For each topic, we'll get topic title, topic page URL, topic description to make a list of topics.
- For each topic, we'll get their top 25 repositories from the topic page. 
- For each repository,we'll grab the repo name, username, stars and repo URL.
- For each topic, we will create a CSV file. 

#### Web Scraping:
Web scraping using Python is a powerful technique for extracting data from websites. With Python's robust libraries like Beautiful Soup and Scrapy, web scraping becomes accessible and efficient. By analyzing the HTML structure of a webpage, Python can navigate through its elements and retrieve specific information. This automated process saves time and effort, allowing users to gather large amounts of data for various purposes, such as data analysis, research, or building applications. Python's versatility and extensive library support make it an ideal choice for web scraping tasks.

#### GitHub:
Github is a popular platform for version control and collaborative development. One of its notable features is the Explore page, which showcases trending topics in the programming world. This page provides valuable insights into the latest technologies, frameworks, and projects gaining traction within the developer community. By highlighting trending repositories, languages, and topics, Github's Explore page serves as a valuable resource for developers to discover new ideas, stay updated, and explore innovative projects.

##### Pandas:
Pandas is a powerful data manipulation and analysis library in Python. It provides a comprehensive set of tools for working with structured data, including dataframes, series, and powerful data manipulation functions, making it a go-to choice for data cleaning, preprocessing, and analysis tasks.

##### OS Module:
The os module in Python provides a way to interact with the operating system, allowing tasks such as file and directory management, process management, and environment variable manipulation. It offers a wide range of functions and methods to work with the underlying operating system, making it a versatile tool for system-level operations in Python.

In [451]:
import requests
import os
from bs4 import BeautifulSoup

## 1. Using the Requests library to get the web pages

In [452]:
%pip install requests --upgrade --quiet

Note: you may need to restart the kernel to use updated packages.


In [455]:
topics_url = "https://github.com/topics"
response = requests.get(topics_url)
page_contents = response.text
print(len(response.text))

156341


Copying the content into another HTML file, for a better understanding of how the HTML file looks like for the given page. 

In [456]:
with open('webpage.html','w',encoding='utf-8') as file :
    file.write(page_contents)

## 2. Using BeautifulSoup to parse and extract information. 

In [457]:
%pip install beautifulsoup4 --upgrade --quiet

Note: you may need to restart the kernel to use updated packages.


In [458]:
doc = BeautifulSoup(page_contents,'html.parser')
type(doc)

bs4.BeautifulSoup

Starting by finding out where the title tags are present in the HTML file. 

In [459]:
title_tags = doc.find_all('p',class_ = 'f3 lh-condensed mb-0 mt-1 Link--primary') 
title_tags[:2] # only taking 2 tags as an example to show how they are present in the HTMl file. 

[<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>]

In [460]:
desc_tags = doc.find_all('p',class_='f5 color-fg-muted mb-0 mt-1')
desc_tags[:2] # only taking 2 tags as an example to show how they are present in the HTMl file. 

[<p class="f5 color-fg-muted mb-0 mt-1">
           3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Ajax is a technique for creating interactive web applications.
         </p>]

In [461]:
url_tags = doc.find_all('a',class_ = 'no-underline flex-1 d-flex flex-column')
url_tags[0]['href'] # to take out only the href part from the whole url tag we use this

'/topics/3d'

In [462]:
%pip install pandas --upgrade --quiet

Note: you may need to restart the kernel to use updated packages.


## 3. A function to form a database that contains top repos from a topic:

Selecting the first topic and forming a function to get all the top repos from it.

In [463]:
import pandas as pd 

We prefer having the number as a numerical value intead of having k represent 1000. 

In [464]:
def star_count_conv_k(repo_star_count):
    repo_star_count = repo_star_count.strip()
    if repo_star_count[-1] == 'k':
        return int(float(repo_star_count[:-1]) * 1000)
    return int(repo_star_count)

In [465]:
def get_repo_info(h3_tags,repo_stars):
    #this function returns all the required information about a repository
    a_tags = h3_tags.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href']
    star_count = star_count_conv_k(repo_stars.text)
    return username,repo_name,repo_url,star_count

Now that we have created a function that takes in the tag inputs and gives the required information as the output we can loop this through the whole web scraped information to get the required data for all usernames and repositories. 

In [466]:
#A dictionary for storing the required data for each topic. 
topics_repo_dict = {
    'username' : [],
    'repo_name' : [],
    'repo_url' : [],
    'star_count' : []
}

for i in range(len(h3_tags)):
    repo_info = get_repo_info(h3_tags[i], repo_stars[i])
    topics_repo_dict['username'].append(repo_info[0])
    topics_repo_dict['repo_name'].append(repo_info[1])
    topics_repo_dict['repo_url'].append(repo_info[2])
    topics_repo_dict['star_count'].append(repo_info[3])
    
topics_repo_df = pd.DataFrame(topics_repo_dict)

In [467]:
topics_repo_df

Unnamed: 0,username,repo_name,repo_url,star_count
0,mrdoob,three.js,https://github.com/mrdoob/three.js,92700
1,pmndrs,react-three-fiber,https://github.com/pmndrs/react-three-fiber,23000
2,libgdx,libgdx,https://github.com/libgdx/libgdx,21600
3,BabylonJS,Babylon.js,https://github.com/BabylonJS/Babylon.js,20900
4,ssloy,tinyrenderer,https://github.com/ssloy/tinyrenderer,17200
5,lettier,3d-game-shaders-for-beginners,https://github.com/lettier/3d-game-shaders-for...,15600
6,aframevr,aframe,https://github.com/aframevr/aframe,15500
7,FreeCAD,FreeCAD,https://github.com/FreeCAD/FreeCAD,14300
8,CesiumGS,cesium,https://github.com/CesiumGS/cesium,10600
9,metafizzy,zdog,https://github.com/metafizzy/zdog,9800


This table gives us the top repos only for the topic '3D', now we have to get the top repos for every topics provided on the github topics page. 

## 4. A function to get the top repos for all topics on the Github Page:

Forming a detailed function that gives all the required details for each topic.

#### To do this process for all the topics we need to define a function that completes all the instructions and gives us an indivisual table for each topic. 

In [468]:
def get_topic_repos(topic_url):
    #the whole process of getting a page and then scraping all the data we need from it.
    response = requests.get(topic_url)
    #check the successful response and continue
    if response.status_code != 200:
        raise Exception(f"Failed to load page {topic_url}")
    topic_doc = BeautifulSoup(response.text,'html.parser')
    h3_tags = topic_doc.find_all('h3',class_='f3 color-fg-muted text-normal lh-condensed')
    star_tags = topic_doc.find_all('span',id='repo-stars-counter-star')
    #get all the repo info
    
    topics_repo_dict = {
    'username' : [],
    'repo_name' : [],
    'repo_url' : [],
    'star_count' : []
    }   

    for i in range(len(h3_tags)):
        repo_info = get_repo_info(h3_tags[i], repo_stars[i])
        topics_repo_dict['username'].append(repo_info[0])
        topics_repo_dict['repo_name'].append(repo_info[1])
        topics_repo_dict['repo_url'].append(repo_info[2])
        topics_repo_dict['star_count'].append(repo_info[3])
    
    return pd.DataFrame(topics_repo_dict)
    

In [469]:
def scrape_topic(topic_url,topic_name):
    file_name = f'{topic_name}.csv'
    if os.path.exists(f'Data_topics\{topic_name}.csv'):
       print(f"The file Data_topics/{file_name} already exists.") 
       return
    os.makedirs('Data_topics',exist_ok=True) 
    topic_df = get_topic_repos(topic_url)
    topic_df.to_csv(f'Data_topics/{file_name}',index=None)

In [470]:
topic_info = get_topic_repos(topic_urls[5])

In [471]:
topic_urls[5]

'https://github.com/topics/angular'

In [472]:
topic_info

Unnamed: 0,username,repo_name,repo_url,star_count
0,justjavac,free-programming-books-zh_CN,https://github.com/justjavac/free-programming-...,92700
1,angular,angular,https://github.com/angular/angular,23000
2,storybookjs,storybook,https://github.com/storybookjs/storybook,21600
3,leonardomso,33-js-concepts,https://github.com/leonardomso/33-js-concepts,20900
4,ionic-team,ionic-framework,https://github.com/ionic-team/ionic-framework,17200
5,prettier,prettier,https://github.com/prettier/prettier,15600
6,Asabeneh,30-Days-Of-JavaScript,https://github.com/Asabeneh/30-Days-Of-JavaScript,15500
7,SheetJS,sheetjs,https://github.com/SheetJS/sheetjs,14300
8,angular,angular-cli,https://github.com/angular/angular-cli,10600
9,angular,components,https://github.com/angular/components,9800


## 5. Forming a final function that outputs a similar table for all the topics and forms a dataset.

Now that we have extracted the table we can create a single function that can fulfill all the things done by these various functions:

1. Get the list of topics from the topics page.
2. Get the list of top repos from the indivisual topic pages.
3. For each topic create a CSV of the top repos for the topic.

In [473]:
def get_topic_titles(doc):
    #return the topic titles 
    title_tags = doc.find_all('p',class_ = 'f3 lh-condensed mb-0 mt-1 Link--primary')     
    topic_titles = []
    for topic in title_tags:
        topic_titles.append(topic.text)
    return topic_titles

def get_topic_desc(doc):
    #returns the topic descriptions
    desc_tags = doc.find_all('p',class_='f5 color-fg-muted mb-0 mt-1')
    topic_descs = []
    for desc in desc_tags:
        topic_descs.append((desc.text).strip())
    return topic_descs

def get_topic_urls(doc):
    #returns the topic urls
    url_tags = doc.find_all('a',class_ = 'no-underline flex-1 d-flex flex-column')
    base_url = 'https://github.com'
    topic_urls = []
    for url in url_tags:
        topic_urls.append(base_url + url['href'])
    return topic_urls


def scrape_topics():
    topics_url = 'https://github.com/topics'
    response = requests.get(topics_url)
    if response.status_code != 200:
        raise Exception("Failed to load the page.")
    topics_dict = {
        'Title' :get_topic_titles(doc),
        'Description': get_topic_desc(doc),
        'URLs' : get_topic_urls(doc)
    }
    
    return pd.DataFrame(topics_dict)

In [474]:
scrape_topics()

Unnamed: 0,Title,Description,URLs
0,3D,3D refers to the use of three-dimensional grap...,https://github.com/topics/3d
1,Ajax,Ajax is a technique for creating interactive w...,https://github.com/topics/ajax
2,Algorithm,Algorithms are self-contained sequences that c...,https://github.com/topics/algorithm
3,Amp,Amp is a non-blocking concurrency library for ...,https://github.com/topics/amphp
4,Android,Android is an operating system built by Google...,https://github.com/topics/android
5,Angular,Angular is an open source web application plat...,https://github.com/topics/angular
6,Ansible,Ansible is a simple and powerful automation en...,https://github.com/topics/ansible
7,API,An API (Application Programming Interface) is ...,https://github.com/topics/api
8,Arduino,Arduino is an open source platform for buildin...,https://github.com/topics/arduino
9,ASP.NET,ASP.NET is a web framework for building modern...,https://github.com/topics/aspnet


Functions to create the topics table are given in the above few cells, now in the next step we'll create a mega function that taps into each topic and takes out it's top repos and makes a new CSV file for each topic.  

In [475]:
topics_df = scrape_topics()
topics_df

Unnamed: 0,Title,Description,URLs
0,3D,3D refers to the use of three-dimensional grap...,https://github.com/topics/3d
1,Ajax,Ajax is a technique for creating interactive w...,https://github.com/topics/ajax
2,Algorithm,Algorithms are self-contained sequences that c...,https://github.com/topics/algorithm
3,Amp,Amp is a non-blocking concurrency library for ...,https://github.com/topics/amphp
4,Android,Android is an operating system built by Google...,https://github.com/topics/android
5,Angular,Angular is an open source web application plat...,https://github.com/topics/angular
6,Ansible,Ansible is a simple and powerful automation en...,https://github.com/topics/ansible
7,API,An API (Application Programming Interface) is ...,https://github.com/topics/api
8,Arduino,Arduino is an open source platform for buildin...,https://github.com/topics/arduino
9,ASP.NET,ASP.NET is a web framework for building modern...,https://github.com/topics/aspnet


In [476]:
def scrape_topics_repos():
    print('Scraping Topics')
    topics_df = scrape_topics()
    for index,row in topics_df.iterrows():
        print(f"Scraping top repos from {row['Title']}")
        scrape_topic(row['URLs'],row['Title'])
        

In [477]:
import jovian

In [480]:
jovian.commit(filename='github-scraper.ipynb')

<IPython.core.display.Javascript object>

[jovian] Updating notebook "akshatgirdhar05/github-scraper" on https://jovian.com/[0m
[jovian] Committed successfully! https://jovian.com/akshatgirdhar05/github-scraper[0m


'https://jovian.com/akshatgirdhar05/github-scraper'