# GitHub Web Scraping Project

### Objective

#### What It Does
- This project scrapes data from GitHub's "Topics" page: https://github.com/topics
- For each topic, it collects the topic title, the description, and the topic page URL.
- For each topic, it then extracts the top 20 repositories displayed on the topic page.
- For each repository, it records the repo name, the owner of the repository, the number of stars it has, and the repo URL.
- The repository data for a topic is written to a CSV file.
- Each topic has its own CSV file.

#### Topic CSV File Layout
```
Repository Owner,Repository Name,Repository Star Count,Repository URL
mrdoob,three.js,94200,https://github.com/mrdoob/three.js
```


## Use the requests library to download web pages


In [1]:
!pip install requests



In [2]:
import requests

In [3]:
topics_url = 'https://github.com/topics'

In [4]:
response = requests.get(topics_url)

In [5]:
response.status_code

200

In [6]:
len(response.text)

166052

In [7]:
page_contents = response.text
type(page_contents)

str

In [8]:
page_contents[:500]

'\n\n<!DOCTYPE html>\n<html lang="en" data-color-mode="auto" data-light-theme="light" data-dark-theme="dark"  data-a11y-animated-images="system" data-a11y-link-underlines="false">\n\n\n  <head>\n    <meta charset="utf-8">\n  <link rel="dns-prefetch" href="https://github.githubassets.com">\n  <link rel="dns-prefetch" href="https://avatars.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">\n  <link rel="dns-prefetch" href="https://user-images.githubusercontent.co'

In [9]:
# Writes the page_contents html to a webpage.html file, essentially mirroring the html of the topics_url above
with open('webpage.html','w',encoding="utf-8") as f:
    f.write(page_contents)

## [Topics Webpage] Use Beautiful Soup to parse and extract information

In [10]:
from bs4 import BeautifulSoup

In [11]:
# Create the BeautifulSoup object with the parsed html contents
doc = BeautifulSoup(page_contents,"html.parser")
type(doc)

bs4.BeautifulSoup

In [12]:
# Get the title of each topic

# Inspect the HTML on the GitHub page to find the exact list of classes that a topic name has
topic_name_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
topic_name_p_tags = doc.find_all('p',{'class': topic_name_class})

# The 30 p tags correspond to the 30 displayed topics, as seen in the output below
len(topic_name_p_tags)

30

In [13]:
topic_name_p_tags[:5]

[<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Algorithm</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amp</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Android</p>]
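When `find_all` is given a class value containing spaces, Beautiful Soup compares it against the full class string, so a tag whose classes appear in a different order is not matched. A minimal sketch of a CSS selector alternative follows. The sample HTML is invented for illustration (the second tag's class order is an assumption, not something observed on GitHub), and the choice of `f3` and `Link--primary` as the identifying classes is also an assumption.

```python
from bs4 import BeautifulSoup

# Invented sample: the same classes as a topic name tag, once in the
# original order and once reordered
sample_html = """
<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
<p class="Link--primary f3 lh-condensed mb-0 mt-1">Ajax</p>
"""
sample_doc = BeautifulSoup(sample_html, "html.parser")

# Exact class-string search: only matches the tag with this exact order
exact_matches = sample_doc.find_all(
    'p', {'class': 'f3 lh-condensed mb-0 mt-1 Link--primary'})

# CSS selector search: matches any p tag carrying both classes, in any order
selector_matches = sample_doc.select('p.f3.Link--primary')

exact_names = [tag.text for tag in exact_matches]
selector_names = [tag.text for tag in selector_matches]
print(exact_names)
print(selector_names)
```

The selector is looser than the exact string, so it is worth checking that the number of matches still equals the number of topics shown on the page.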

In [15]:
# Get the description of each topic

# Inspect the html on the GitHub page, find the exact list of classes that the topic description would have
topic_desc_class = 'f5 color-fg-muted mb-0 mt-1'
topic_desc_p_tags = doc.find_all('p',{'class': topic_desc_class})

# The 30 p tags correspond to the 30 displayed topics, as seen in the output below
len(topic_desc_p_tags)

30

In [16]:
topic_desc_p_tags[:5]

[<p class="f5 color-fg-muted mb-0 mt-1">
           3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Ajax is a technique for creating interactive web applications.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Algorithms are self-contained sequences that carry out a variety of tasks.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Amp is a non-blocking concurrency library for PHP.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Android is an operating system built by Google designed for mobile devices.
         </p>]

In [17]:
# Get the href of the 'a' tag of each topic (this helps us keep track of the link to the topic's webpage)
topic_link_tags = doc.find_all('a',{'class':'no-underline flex-1 d-flex flex-column'})
len(topic_link_tags)
topic_link_tags[0]['href']

'/topics/3d'

In [18]:
# Prints the url of the first topic on the webpage
topic0_url = "https://github.com" + topic_link_tags[0]['href']
print(topic0_url)

https://github.com/topics/3d


In [19]:
# Iterate through the topic_name_p_tags list, and append the topic titles to a list
topic_titles = []

for tag in topic_name_p_tags:
    topic_titles.append(tag.text)

print(topic_titles[:10])

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android', 'Angular', 'Ansible', 'API', 'Arduino', 'ASP.NET']


In [20]:
# Iterate through the topic_desc_p_tags list, and append the topic descriptions to a list
topic_descriptions = []

for tag in topic_desc_p_tags:
    topic_descriptions.append(tag.text.strip())

print(topic_descriptions[:10])

['3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.', 'Ajax is a technique for creating interactive web applications.', 'Algorithms are self-contained sequences that carry out a variety of tasks.', 'Amp is a non-blocking concurrency library for PHP.', 'Android is an operating system built by Google designed for mobile devices.', 'Angular is an open source web application platform.', 'Ansible is a simple and powerful automation engine.', 'An API (Application Programming Interface) is a collection of protocols and subroutines for building software.', 'Arduino is an open source platform for building electronic devices.', 'ASP.NET is a web framework for building modern web apps and services.']


In [21]:
# Iterate through the topic_link_tags list and append each topic's full URL (base_url + href) to a list
topic_urls = []
base_url = 'https://github.com'

for tag in topic_link_tags:
    topic_urls.append(base_url + tag['href'])

print(topic_urls[:10])

['https://github.com/topics/3d', 'https://github.com/topics/ajax', 'https://github.com/topics/algorithm', 'https://github.com/topics/amphp', 'https://github.com/topics/android', 'https://github.com/topics/angular', 'https://github.com/topics/ansible', 'https://github.com/topics/api', 'https://github.com/topics/arduino', 'https://github.com/topics/aspnet']
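The URLs above are built by string concatenation, which works because every href starts with a slash. As a hedged alternative, the standard library's `urljoin` also copes with an href that is already absolute. The helper name `absolute_url` is hypothetical and not part of the project.

```python
from urllib.parse import urljoin

BASE_URL = 'https://github.com'

def absolute_url(href, base_url=BASE_URL):
    """Resolve a relative or absolute href against the base URL (hypothetical helper)."""
    return urljoin(base_url, href)

# hrefs in the same shape as the ones scraped from the topics page
topic_hrefs = ['/topics/3d', '/topics/ajax', '/topics/algorithm']
topic_page_urls = [absolute_url(href) for href in topic_hrefs]
print(topic_page_urls)

# An href that is already a full URL is returned unchanged
already_absolute = absolute_url('https://github.com/topics/3d')
print(already_absolute)
```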


In [22]:
# To create a dataframe and populate it with the lists above, we must first create a dictionary with the lists, and add to a dataframe
topics_dict = {
    'Topic Titles':topic_titles, 
    'Topic Descriptions':topic_descriptions, 
    'Topic URLs':topic_urls}

In [23]:
# Create and populate the data frame with the above dictionary
import pandas as pd
topics_df = pd.DataFrame(topics_dict)
topics_df

Unnamed: 0,Topic Titles,Topic Descriptions,Topic URLs
0,3D,3D refers to the use of three-dimensional grap...,https://github.com/topics/3d
1,Ajax,Ajax is a technique for creating interactive w...,https://github.com/topics/ajax
2,Algorithm,Algorithms are self-contained sequences that c...,https://github.com/topics/algorithm
3,Amp,Amp is a non-blocking concurrency library for ...,https://github.com/topics/amphp
4,Android,Android is an operating system built by Google...,https://github.com/topics/android
5,Angular,Angular is an open source web application plat...,https://github.com/topics/angular
6,Ansible,Ansible is a simple and powerful automation en...,https://github.com/topics/ansible
7,API,An API (Application Programming Interface) is ...,https://github.com/topics/api
8,Arduino,Arduino is an open source platform for buildin...,https://github.com/topics/arduino
9,ASP.NET,ASP.NET is a web framework for building modern...,https://github.com/topics/aspnet


### Create CSV file(s) with the extracted information

To Do:
- Create functions for the end-to-end process of downloading, parsing, and saving CSVs.
- Execute the function with different inputs to create a dataset of CSV files.
- Verify the information in the CSV files by reading them back using Pandas.


In [24]:
topics_df.to_csv('topics.csv', index=None)
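The last To Do item, reading a CSV back with Pandas to verify it, could look roughly like the sketch below. It uses a two-row sample taken from the output above and a temporary folder instead of the real `topics.csv`. The helper name `roundtrip_csv` is hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd

def roundtrip_csv(df, csv_path):
    """Write a data frame to CSV and read it straight back (hypothetical helper)."""
    df.to_csv(csv_path, index=False)
    return pd.read_csv(csv_path)

# Two rows copied from the topics output above; the first description contains commas
sample_rows = [
    ['3D',
     '3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.',
     'https://github.com/topics/3d'],
    ['Ajax',
     'Ajax is a technique for creating interactive web applications.',
     'https://github.com/topics/ajax'],
]
sample_df = pd.DataFrame(
    sample_rows, columns=['Topic Titles', 'Topic Descriptions', 'Topic URLs'])

with tempfile.TemporaryDirectory() as tmp_dir:
    restored_df = roundtrip_csv(sample_df, Path(tmp_dir) / 'topics.csv')

# Compare column names and cell values between the original and the copy read back
restored_columns = list(restored_df.columns)
restored_rows = restored_df.values.tolist()
print(restored_columns == list(sample_df.columns) and restored_rows == sample_rows)
```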

## [Individual Topic Page] Get Information from a specific topic's page
- Repeat the scraping process used for the topics webpage, this time on an individual topic's webpage
- The section below creates a new BeautifulSoup object to hold the parsed html contents of a specific topic's page

In [25]:
topic_page_url = topic_urls[0]

In [26]:
print(topic_page_url)

https://github.com/topics/3d


In [27]:
response_page = requests.get(topic_page_url)

In [28]:
response_page.status_code

200

In [29]:
len(response_page.text)

477227

In [30]:
# Create a new BeautifulSoup object with the parsed html contents from response_page
topic_doc = BeautifulSoup(response_page.text,"html.parser")


In [31]:
# To isolate a repository, since the individual repo name and repo owner don't have classes associated, find by the 'h3' tags
# that the name and owner 'a' tags are nested in

# 20 repo_tags represent the 20 displayed repositories on a single topic's page
repo_tags = topic_doc.find_all('h3', {'class':'f3 color-fg-muted text-normal lh-condensed'})
len(repo_tags)

20

In [32]:
# Next, isolate the repo owner and name by finding all of the 'a' tags nested in a single 'h3' tag.
# A single 'h3' tag should have 2 'a' tags: the first a_tag is the repo owner, the second a_tag is the repo name.
# The text and href of each 'a' tag can be read by indexing the result like a list.

a_tags = repo_tags[0].find_all('a')
print("First 'a' tag: " + a_tags[0].text.strip())
print("Second 'a' tag: " + a_tags[1].text.strip())

First 'a' tag: mrdoob
Second 'a' tag: three.js


#### Add information to a dictionary to add to a data frame
There are two ways to do this. I have demonstrated both:
1. Define an empty list for each category of information. Use two for loops to extract the information and append it to these lists. Once done, the lists can be combined into a dictionary, which is converted into a new data frame
2. Define a dictionary of empty lists, and use helper functions to extract the required information from a single repository. Call these functions in a single for loop to fill the dictionary, which is converted into a new data frame

In [33]:
# METHOD 1: FOR LOOPS --------------------------------------------------------------------------------------------------------

# With the above format in mind, create a list for each piece of required info (repo name, repo owner, repo url)
# Use a for loop to iterate through the 'h3' repo_tags, extract and append the required information to the appropriate list

repo_names = []
repo_owners = []
repo_urls = []

# Collect the 'span' tags that hold each repo's star count; .text.strip() is applied below to populate repo_stars
repo_stars_temp = topic_doc.find_all('span', {'id':'repo-stars-counter-star'})
repo_stars = []

base_url = 'https://github.com'

for tag in repo_tags:
    all_a_tags = tag.find_all('a')
    repo_owners.append(all_a_tags[0].text.strip())
    repo_names.append(all_a_tags[1].text.strip())
    repo_urls.append(base_url + all_a_tags[1]['href'])
    

for star_tag in repo_stars_temp:
    repo_stars.append(star_tag.text.strip())
    
print(repo_owners[:5])
print(repo_names[:5])
print(repo_urls[:5])
print(repo_stars[:5])

['mrdoob', 'pmndrs', 'libgdx', 'BabylonJS', 'ssloy']
['three.js', 'react-three-fiber', 'libgdx', 'Babylon.js', 'tinyrenderer']
['https://github.com/mrdoob/three.js', 'https://github.com/pmndrs/react-three-fiber', 'https://github.com/libgdx/libgdx', 'https://github.com/BabylonJS/Babylon.js', 'https://github.com/ssloy/tinyrenderer']
['94.2k', '23.6k', '21.9k', '21.3k', '17.7k']


In [34]:
# METHOD 2: HELPER FUNCTIONS -------------------------------------------------------------------------------------------------

# Returns the star count as an int
def parse_star_count(stars_str):
    stars_str = stars_str.strip()
    if stars_str[-1] == 'k':
        return int(float(stars_str[:-1]) * 1000)
    return int(stars_str)


# Returns all of the required info about a repository
def get_repo_info(h3_tag, star_tag):
    base_url = 'https://github.com'
    a_tags = h3_tag.find_all('a')
    repo_owner = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href']
    stars = parse_star_count(star_tag.text.strip())
    return repo_owner, repo_name, repo_url, stars

In [35]:
get_repo_info(repo_tags[0], repo_stars_temp[0])

('mrdoob', 'three.js', 'https://github.com/mrdoob/three.js', 94200)
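`parse_star_count` multiplies a float by 1000 and then truncates with `int()`. Binary floating point can, in principle, leave such a product slightly below the intended whole number, which truncation would then round down. A hedged sketch of a variant that uses `decimal.Decimal` for exact arithmetic follows. The function name `parse_star_count_exact` is hypothetical, and support for an `m` suffix is an assumption, since only `k` appears in the scraped output above.

```python
from decimal import Decimal

# Multipliers per suffix; 'm' is an assumption and is not seen in the output above
SUFFIX_MULTIPLIERS = {'k': 1_000, 'm': 1_000_000}

def parse_star_count_exact(stars_str):
    """Return the star count as an int, using exact decimal arithmetic."""
    text = stars_str.strip().lower()
    # Look up the last character; plain numbers get a multiplier of 1
    multiplier = SUFFIX_MULTIPLIERS.get(text[-1:], 1)
    if multiplier != 1:
        text = text[:-1]
    return int(Decimal(text) * multiplier)

parsed_counts = [parse_star_count_exact(s) for s in ['94.2k', ' 5.2k ', '999', '1.5m']]
print(parsed_counts)
```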

In [36]:
# METHOD 2 CONTD ------------------------------------------------------------------------------------------------------------

topic_repo_dict = {
    'Repository Owner': [],
    'Repository Name': [],
    'Repository Star Count':[],
    'Repository URL': []
}

for i in range(len(repo_tags)):
    repo_info = get_repo_info(repo_tags[i], repo_stars_temp[i])
    topic_repo_dict['Repository Owner'].append(repo_info[0])
    topic_repo_dict['Repository Name'].append(repo_info[1])
    topic_repo_dict['Repository Star Count'].append(repo_info[3])
    topic_repo_dict['Repository URL'].append(repo_info[2])
    
print(topic_repo_dict)

{'Repository Owner': ['mrdoob', 'pmndrs', 'libgdx', 'BabylonJS', 'ssloy', 'lettier', 'aframevr', 'FreeCAD', 'CesiumGS', 'metafizzy', 'isl-org', 'blender', 'timzhang642', 'a1studmuffin', 'domlysz', 'FyroxEngine', 'nerfstudio-project', 'google', 'openscad', 'spritejs'], 'Repository Name': ['three.js', 'react-three-fiber', 'libgdx', 'Babylon.js', 'tinyrenderer', '3d-game-shaders-for-beginners', 'aframe', 'FreeCAD', 'cesium', 'zdog', 'Open3D', 'blender', '3D-Machine-Learning', 'SpaceshipGenerator', 'BlenderGIS', 'Fyrox', 'nerfstudio', 'model-viewer', 'openscad', 'spritejs'], 'Repository Star Count': [94200, 23600, 21900, 21300, 17700, 16000, 15600, 15000, 10900, 10000, 9400, 9300, 9100, 7400, 6600, 6500, 6400, 5900, 5800, 5200], 'Repository URL': ['https://github.com/mrdoob/three.js', 'https://github.com/pmndrs/react-three-fiber', 'https://github.com/libgdx/libgdx', 'https://github.com/BabylonJS/Babylon.js', 'https://github.com/ssloy/tinyrenderer', 'https://github.com/lettier/3d-game-shade
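The loop above pairs `repo_tags[i]` with `repo_stars_temp[i]` by index, which assumes both lists have the same length and order. If the page ever returned a different number of star tags, the loop would either raise an IndexError or silently skip entries. A small sketch of a guard, using plain strings in place of the parsed tags (the helper name `pair_repos_with_stars` is hypothetical):

```python
def pair_repos_with_stars(repo_items, star_items):
    """Pair each repository entry with its star entry, refusing mismatched lengths."""
    if len(repo_items) != len(star_items):
        raise ValueError(
            'Found {} repositories but {} star counts'.format(
                len(repo_items), len(star_items)))
    return list(zip(repo_items, star_items))

# Plain strings stand in for the 'h3' and 'span' tags used in the notebook
owners = ['mrdoob', 'pmndrs', 'libgdx']
stars = ['94.2k', '23.6k', '21.9k']
pairs = pair_repos_with_stars(owners, stars)
print(pairs)
```

Iterating over the returned pairs replaces `range(len(...))` indexing and fails loudly when the two lists drift apart.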

In [37]:
# Add to a new dataframe

repo_df = pd.DataFrame(topic_repo_dict)
repo_df

Unnamed: 0,Repository Owner,Repository Name,Repository Star Count,Repository URL
0,mrdoob,three.js,94200,https://github.com/mrdoob/three.js
1,pmndrs,react-three-fiber,23600,https://github.com/pmndrs/react-three-fiber
2,libgdx,libgdx,21900,https://github.com/libgdx/libgdx
3,BabylonJS,Babylon.js,21300,https://github.com/BabylonJS/Babylon.js
4,ssloy,tinyrenderer,17700,https://github.com/ssloy/tinyrenderer
5,lettier,3d-game-shaders-for-beginners,16000,https://github.com/lettier/3d-game-shaders-for...
6,aframevr,aframe,15600,https://github.com/aframevr/aframe
7,FreeCAD,FreeCAD,15000,https://github.com/FreeCAD/FreeCAD
8,CesiumGS,cesium,10900,https://github.com/CesiumGS/cesium
9,metafizzy,zdog,10000,https://github.com/metafizzy/zdog


In [38]:
# The following two functions work to return a dataframe consisting of the top 20 repositories in a given topic url

def get_topic_page(topic_url):
    response_web = requests.get(topic_url)
    if response_web.status_code != 200:
        raise Exception('Failed to load page {}'.format(topic_url))
    topic_doc = BeautifulSoup(response_web.text,"html.parser") 
    return topic_doc
    

def get_topic_repos(topic_doc):
    page_tags = topic_doc.find_all('h3', {'class':'f3 color-fg-muted text-normal lh-condensed'})
    repo_stars = topic_doc.find_all('span', {'id':'repo-stars-counter-star'})
    
    topic_repo_dict = {
    'Repository Owner': [],
    'Repository Name': [],
    'Repository Star Count':[],
    'Repository URL': []
    }

    for i in range(len(page_tags)):
        repo_info = get_repo_info(page_tags[i], repo_stars[i])
        topic_repo_dict['Repository Owner'].append(repo_info[0])
        topic_repo_dict['Repository Name'].append(repo_info[1])
        topic_repo_dict['Repository Star Count'].append(repo_info[3])
        topic_repo_dict['Repository URL'].append(repo_info[2])
        
    new_df = pd.DataFrame(topic_repo_dict)
    return new_df

In [39]:
get_topic_repos(get_topic_page(topic_urls[2]))

Unnamed: 0,Repository Owner,Repository Name,Repository Star Count,Repository URL
0,jwasham,coding-interview-university,264000,https://github.com/jwasham/coding-interview-un...
1,trekhleb,javascript-algorithms,175000,https://github.com/trekhleb/javascript-algorithms
2,TheAlgorithms,Python,168000,https://github.com/TheAlgorithms/Python
3,CyC2018,CS-Notes,167000,https://github.com/CyC2018/CS-Notes
4,yangshun,tech-interview-handbook,94400,https://github.com/yangshun/tech-interview-han...
5,kdn251,interviews,60300,https://github.com/kdn251/interviews
6,TheAlgorithms,Java,53400,https://github.com/TheAlgorithms/Java
7,azl397985856,leetcode,52300,https://github.com/azl397985856/leetcode
8,algorithm-visualizer,algorithm-visualizer,44200,https://github.com/algorithm-visualizer/algori...
9,youngyangyang04,leetcode-master,41600,https://github.com/youngyangyang04/leetcode-ma...


## [Topics Page + Top 30 Topics] Putting It All Together

The functions below:
- get_page_soup(url): Takes in a url, returns a BeautifulSoup object
- get_topic_info(tag_type, class_name): Takes in the type of tag to search for in the topics page html and the class name that identifies the information, and returns a list of that information (text for most tags, full URLs for 'a' tags)
- scrape_topics(): No input, scrapes the general topics webpage, creates a csv from a dataframe of the info on it, and returns the dataframe
- scrape_topic_repos(topic_url): Takes the url of an individual topic webpage and returns a dataframe of the repositories on it
- generate_all(): First it calls scrape_topics() to scrape the general topics webpage once, then calls scrape_topic_repos for each topic displayed on the topics webpage, creating 30 different csv files

In [40]:
def get_page_soup(url):
    response = requests.get(url)
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(url))
    soupdoc = BeautifulSoup(response.text,"html.parser")
    return soupdoc

def get_topic_info(tag_type, class_name):
    # Note: this searches the module-level `doc` (the parsed topics page), not a local variable
    topic_tags = doc.find_all(tag_type, {'class': class_name})
    if tag_type != 'a':
        # For non-link tags, return the stripped text of each match
        return [tag.text.strip() for tag in topic_tags]
    # For 'a' tags, return the full URL built from each href
    base_url = 'https://github.com'
    return [base_url + tag['href'] for tag in topic_tags]

In [44]:
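def scrape_topics():
    # get_topic_info() reads the module-level `doc`, so rebind it to the freshly downloaded page
    global doc
    topics_url = 'https://github.com/topics'
    doc = get_page_soup(topics_url)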
    
    topic_name_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
    topic_desc_class = 'f5 color-fg-muted mb-0 mt-1'
    topic_link_class = 'no-underline flex-1 d-flex flex-column'
   
    topic_titles = get_topic_info('p', topic_name_class)
    topic_descriptions = get_topic_info('p',  topic_desc_class)
    topic_urls = get_topic_info('a', topic_link_class)
        
    topics_dict = {
        'Topic Titles': topic_titles,
        'Topic Descriptions': topic_descriptions,
        'Topic URLs': topic_urls
    }
    
    topics_df = pd.DataFrame(topics_dict)
    filepath = r'C:/Users/asmig/dev-projects/ds/Github Web Scraping/Generated CSV Files/'
    topics_df.to_csv(filepath + 'topics.csv', index=None)
    return topics_df


def scrape_topic_repos(topic_url):
    topic_doc = get_page_soup(topic_url)
    
    page_tags = topic_doc.find_all('h3', {'class':'f3 color-fg-muted text-normal lh-condensed'})
    repo_stars = topic_doc.find_all('span', {'id':'repo-stars-counter-star'})
    
    topic_repo_dict = {
    'Repository Owner': [],
    'Repository Name': [],
    'Repository Star Count':[],
    'Repository URL': []
    }

    for i in range(len(page_tags)):
        repo_info = get_repo_info(page_tags[i], repo_stars[i])
        topic_repo_dict['Repository Owner'].append(repo_info[0])
        topic_repo_dict['Repository Name'].append(repo_info[1])
        topic_repo_dict['Repository Star Count'].append(repo_info[3])
        topic_repo_dict['Repository URL'].append(repo_info[2])
        
    new_df = pd.DataFrame(topic_repo_dict)
    return new_df


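def generate_all():
    # Use the topics returned by this run rather than the lists left over from earlier cells
    topics_df = scrape_topics()
    filepath = r'C:/Users/asmig/dev-projects/ds/Github Web Scraping/Generated CSV Files/'
    for title, url in zip(topics_df['Topic Titles'], topics_df['Topic URLs']):
        scrape_topic_repos(url).to_csv(filepath + title + '.csv', index=None)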
        

In [45]:
generate_all()
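`generate_all()` builds each output filename directly from the topic title and writes into a folder that must already exist. A title containing a character that is not valid in a filename, such as a slash, would make the write fail. The sketch below shows one possible way to guard against both, using only the standard library and a temporary folder. The helpers `safe_csv_name` and `write_rows` are hypothetical, and the set of characters kept in the filename is an assumption.

```python
import csv
import re
import tempfile
from pathlib import Path

def safe_csv_name(topic_title):
    """Turn a topic title into a filesystem-friendly CSV filename (hypothetical helper)."""
    # Replace every run of characters outside the allowed set with an underscore
    cleaned = re.sub(r'[^A-Za-z0-9._+#-]+', '_', topic_title.strip())
    return (cleaned or 'topic') + '.csv'

def write_rows(output_dir, topic_title, header, rows):
    """Write rows to a CSV in output_dir, creating the folder if it is missing."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    csv_path = output_dir / safe_csv_name(topic_title)
    with open(csv_path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)
    return csv_path

header = ['Repository Owner', 'Repository Name', 'Repository Star Count', 'Repository URL']
rows = [['mrdoob', 'three.js', 94200, 'https://github.com/mrdoob/three.js']]

with tempfile.TemporaryDirectory() as tmp_dir:
    written = write_rows(Path(tmp_dir) / 'Generated CSV Files', 'ASP.NET', header, rows)
    written_name = written.name
    written_lines = written.read_text(encoding='utf-8').splitlines()

print(written_name)
print(written_lines)
```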

## Done! You can check out the completed .csv files in the '\Generated CSV Files' Folder
- Thanks for scrolling this far and checking my project out! Hope you have a good day!