# Scraping the Popular Repositories of Featured Topics from GitHub

### What is Web Scraping 

- Web scraping is an automated method for collecting large amounts of data from websites. Most of this data arrives as unstructured HTML, which is then converted into structured data in a spreadsheet or a database so that it can be used in other applications.
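
As a small illustration of turning HTML into structured data, here is a minimal sketch that uses only Python's standard-library `html.parser`; the rest of this notebook uses BeautifulSoup instead. The sample HTML, the class name `title` and the `TitleParser` class are made up for this example.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text of every <p class="title"> element (hypothetical markup)."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == 'p' and dict(attrs).get('class') == 'title':
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == 'p':
            self.in_title = False

    def handle_data(self, data):
        if self.in_title and data.strip():
            self.titles.append(data.strip())


# Made-up HTML standing in for a downloaded page
sample_html = (
    '<div><p class="title">3D</p><p class="desc">3D modeling</p>'
    '<p class="title">Ajax</p></div>'
)

parser = TitleParser()
parser.feed(sample_html)
titles = parser.titles
print(titles)
```

The unstructured markup goes in and a plain Python list comes out, which is the whole idea behind the scraping steps that follow.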

### Why Perform Web Scraping 
- In data science, nothing happens without data. Getting that data means finding the right sources and collecting from each of them, and web scraping automates that step: it collects and organizes the required data in one accessible location. Working from a single location is far more practical than looking up every item one by one.



### What is GitHub 
- GitHub is a hosting service for Git repositories, sort of like a cloud for code. It hosts source code projects in a variety of programming languages and keeps track of the changes made in every iteration. URL: https://github.com/


__We’ll be using the following packages__

- Requests: for downloading the HTML code from the GitHub URL
- BeautifulSoup4: for extracting data from the HTML string
- Pandas: for gathering the data into a DataFrame for further processing

### Project Outline


Let's see an outline of the steps we'll follow:

1. Load the GitHub topics page (https://github.com/topics) using Requests.
2. Parse the HTML web page using BeautifulSoup.
3. Extract the list of topics from the topics page. For each topic, we'll get the topic name, topic description and topic URL.
4. For each topic, we'll grab the username, repository name, star count and repository URL of its popular repositories.
5. Compile the extracted repository details into Python lists and dictionaries.
6. We'll extend the above logic to scrape the repositories of multiple topics.
7. Finally, we'll save all the repository information into CSV files.

  
  The CSV files will have the following format.

     username,repo_name,stars,repo_url
     CyC2018,CS-Notes,133000,https://github.com/CyC2018/CS-Notes
     yangshun,tech-interview-handbook,111000,https://github.com/yangshun/tech-interview-handbook
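
To make the target format concrete, here is a minimal sketch that writes two rows with the standard-library `csv` module and reads them back. It uses an in-memory buffer instead of a real file, and the column order `username, repo_name, stars, repo_url` is the one the DataFrame code later in this notebook produces; the notebook itself writes the files with pandas.

```python
import csv
import io

# Sample rows, using the column order produced later in this notebook
rows = [
    {'username': 'CyC2018', 'repo_name': 'CS-Notes', 'stars': 133000,
     'repo_url': 'https://github.com/CyC2018/CS-Notes'},
    {'username': 'yangshun', 'repo_name': 'tech-interview-handbook', 'stars': 111000,
     'repo_url': 'https://github.com/yangshun/tech-interview-handbook'},
]

buffer = io.StringIO()  # in-memory stand-in for a file on disk
writer = csv.DictWriter(buffer, fieldnames=['username', 'repo_name', 'stars', 'repo_url'])
writer.writeheader()
writer.writerows(rows)

# Read the data back; note that every value comes back as a string
buffer.seek(0)
loaded = list(csv.DictReader(buffer))
print(loaded[0]['repo_name'], loaded[0]['stars'])
```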

### Installing the Libraries

Let’s start by installing the required packages.

In [1]:
# Install pandas and requests
!pip install pandas requests --quiet

# Install the beautifulsoup4 package, which provides the bs4 module
!pip install beautifulsoup4 --upgrade --quiet

Let's import the necessary packages

In [2]:
# Let's import necessary packages 
import os
import requests
import pandas as pd
from bs4 import BeautifulSoup

### Load the Webpage using Requests

The GitHub topics page lists the featured topics. We can click on any topic to navigate to its individual topic page and get more details.

Each topic page contains popular repositories. From each topic page, we will parse the list of repository names, usernames, star counts and repository URLs. Then, we can navigate to the next page by clicking the ‘Load More’ button.

- Use the requests library to download web pages.
- Save the downloaded web pages locally.
- Use BeautifulSoup to parse the website.
- Inspect the website's HTML source and identify the right URLs to download.

In [3]:
# GitHub URL 
github_url = 'https://github.com/'

In [4]:
# Download the GitHub page using `requests`
response = requests.get(github_url)

In [5]:
# Check if the request was successful 
response.status_code

200

The code above checks whether the request was successful: a `status_code` of 200 means the page was downloaded without errors.
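
A scraper that downloads many pages will sometimes get a status other than 200, for example when the server is busy. The following is a minimal sketch of retrying such a request, not part of the original notebook. The helper `fetch_with_retry` and its `fetch` argument are hypothetical: `fetch` is any function that takes a URL and returns a `(status_code, text)` pair, for instance a small wrapper that calls `requests.get(url)` and returns `(response.status_code, response.text)`.

```python
import time


def fetch_with_retry(fetch, url, retries=3, delay=1.0):
    """Call fetch(url) until it returns status 200 or the attempts run out.

    `fetch` is a hypothetical callable returning (status_code, text).
    `retries` must be at least 1.
    """
    for attempt in range(1, retries + 1):
        status, text = fetch(url)
        if status == 200:
            return text
        if attempt < retries:
            # Wait a little longer before each new attempt
            time.sleep(delay * attempt)
    raise Exception('Failed to load page {} (last status {})'.format(url, status))
```

Passing the download function in as an argument keeps the retry logic separate from the HTTP library, so it can be tried out with a fake function before it touches the network.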

In [6]:
page_contents = response.text
page_contents[:500]

'\n\n\n\n\n<!DOCTYPE html>\n<html lang="en" >\n  <head>\n    <meta charset="utf-8">\n  <link rel="dns-prefetch" href="https://github.githubassets.com">\n  <link rel="dns-prefetch" href="https://avatars.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">\n  <link rel="dns-prefetch" href="https://user-images.githubusercontent.com/">\n\n\n\n  <link crossorigin="anonymous" media="all" integrity="sha512-VfI/oBxJKY4+ZNd24vTPiyumBP7aGW4fepRsD/fUTuz8Ebw8WLpgvNkCNjj8+YRFbtZUy'

In [7]:
with open('github.html', 'w', encoding='utf-8') as f:
    f.write(page_contents)

In [8]:
topic_doc = BeautifulSoup(page_contents, 'html.parser')

__The HTML page content is extracted using BeautifulSoup Library into `topic_doc`.__

Let's create a function to perform the above.

In [9]:
def get_topic_page():
    """
    Function to download a web page using 'requests' and check the status code to validate
    if the call was successful. 
    """
    topic_url = 'https://github.com/topics'
    
    # Access the webpage using 'requests'
    response = requests.get(topic_url)
    # Check if request is successful.
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topic_url))
    
    # Parse using BeautifulSoup.
    topic_doc = BeautifulSoup(response.text,'html.parser')

    return topic_doc

In [10]:
topic_doc = get_topic_page()

In [11]:
type(topic_doc)

bs4.BeautifulSoup

In [12]:
len(response.text)

224145

In [13]:
page_contents = response.text
page_contents[:1000]

'\n\n\n\n\n<!DOCTYPE html>\n<html lang="en" >\n  <head>\n    <meta charset="utf-8">\n  <link rel="dns-prefetch" href="https://github.githubassets.com">\n  <link rel="dns-prefetch" href="https://avatars.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">\n  <link rel="dns-prefetch" href="https://user-images.githubusercontent.com/">\n\n\n\n  <link crossorigin="anonymous" media="all" integrity="sha512-VfI/oBxJKY4+ZNd24vTPiyumBP7aGW4fepRsD/fUTuz8Ebw8WLpgvNkCNjj8+YRFbtZUyBfJ9UTEzXippSqvaA==" rel="stylesheet" href="https://github.githubassets.com/assets/frameworks-55f23fa01c49298e3e64d776e2f4cf8b.css" />\n  \n    <link crossorigin="anonymous" media="all" integrity="sha512-0G3kkMTKku66NTv2HdOjTCO1C5CgsLk4X7y+7ognYA7yZJRx+/7e/HF5Dzdp3R8mgHIPmKWkeQzO8c8drKzCjA==" rel="stylesheet" href="https://github.githubassets.com/assets/behaviors-d06de490c4ca92eeba353bf61dd3a34c.css" />\n    \n    \n    \n    <link crossorigin="anonymous" media="all" integrity="

### Inspect the Web page

Chrome users can examine the HTML behind the page by right-clicking on it and choosing “Inspect”. A panel will appear at the bottom or on the right side of the page (depending on the settings) with a long list of nested HTML tags. To find the tag associated with the information needed, right-click that detail on the page (e.g. a topic name) and choose “Inspect” again; the matching element is highlighted in blue. You can now click through the HTML tags to find the one that holds the item of interest, here the topic name.

As the image below shows, the topic names are embedded in `p` tags with the class `f3 lh-condensed mb-0 mt-1 Link--primary`.

![](https://i.imgur.com/PDSwDE0.png)

Let's create some helper functions to parse the information from the page. To get the topic titles, we can pick the `p` tags with that class and read `tag.text` to retrieve the name of each topic.

In [14]:
topic_title_tags = topic_doc.find_all('p',{'class':'f3 lh-condensed mb-0 mt-1 Link--primary'})
topic_titles = []
for tag in topic_title_tags:
    topic_titles.append(tag.text)
topic_titles[:5]

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android']
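
One thing to keep in mind: passing the full class string to `find_all` matches the `class` attribute as an exact string, so a tag that lists the same classes in a different order is not found. The sketch below, on made-up HTML, compares this with a CSS selector via `select`, which matches on individual classes. The sample markup is an assumption for illustration; the real page may differ.

```python
from bs4 import BeautifulSoup

# Made-up HTML: the second tag lists the same classes in a different order
sample_html = '''
<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
<p class="Link--primary f3 lh-condensed mb-0 mt-1">Ajax</p>
<p class="f5 mb-0 mt-1">A description, not a title</p>
'''
doc = BeautifulSoup(sample_html, 'html.parser')

# Exact-string match on the class attribute: only the first tag matches
exact = [tag.text.strip()
         for tag in doc.find_all('p', class_='f3 lh-condensed mb-0 mt-1 Link--primary')]

# CSS selector: any <p> carrying both classes, in any order
by_selector = [tag.text.strip() for tag in doc.select('p.f3.Link--primary')]

print(exact)
print(by_selector)
```

Selecting on one or two distinctive classes tends to survive small changes in the page markup better than matching the whole class string.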

In [15]:
topic_desc_tags = topic_doc.find_all('p',class_='f5 color-text-secondary mb-0 mt-1')
topic_descs = []

for tag in topic_desc_tags:
    topic_descs.append(tag.text.strip())

topic_descs[:5]   # first 5 topic descriptions

['3D modeling is the process of virtually developing the surface and structure of a 3D object.',
 'Ajax is a technique for creating interactive web applications.',
 'Algorithms are self-contained sequences that carry out a variety of tasks.',
 'Amp is a non-blocking concurrency framework for PHP.',
 'Android is an operating system built by Google designed for mobile devices.']

In [16]:
topic_link_tags = topic_doc.find_all('a',class_='d-flex no-underline')

topic_urls = []

for tag in topic_link_tags:
    topic_urls.append(tag['href'])

topic_urls[:5]   # first 5 topic URLs

['/topics/3d',
 '/topics/ajax',
 '/topics/algorithm',
 '/topics/amphp',
 '/topics/android']

Let's create functions for the topic titles, topic descriptions and topic URLs.

In [17]:
def get_topic_titles(topic_doc):
    """
    Function to extract the topic titles from HTML source code using BeautifulSoup.
    """
    topic_title_tags = topic_doc.find_all('p',{'class':'f3 lh-condensed mb-0 mt-1 Link--primary'})
    topic_titles = []
    # Loop through the tags and collect all the topic titles from the page
    for tag in topic_title_tags:
        topic_titles.append(tag.text)
        
    return topic_titles

- `get_topic_titles` can be used to get the list of titles.

In [18]:
# Get the topic titles from the webpage using the BeautifulSoup object `topic_doc`.

get_topic_titles(topic_doc)[:5]

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android']

The above shows the first five featured topic titles on the GitHub topics page.

Similarly, let's define functions for the topic descriptions and URLs.

The descriptions are embedded in `p` tags with the class `f5 color-text-secondary mb-0 mt-1`, as shown below.

![](https://imgur.com/PuXp524.png)

In [19]:
def get_topic_descs(topic_doc):
    """
    Function to extract the topic description from HTML source code using the BeautifulSoup. 
    """
    topic_desc_tags = topic_doc.find_all('p',class_='f5 color-text-secondary mb-0 mt-1')
    topic_descs = []
    
    for tag in topic_desc_tags:
        topic_descs.append(tag.text.strip())
        
    return topic_descs

In [20]:
# Get the topic description of each topic in the webpage using the BeautifulSoup object `topic_doc`. 

get_topic_descs(topic_doc)[:5]

['3D modeling is the process of virtually developing the surface and structure of a 3D object.',
 'Ajax is a technique for creating interactive web applications.',
 'Algorithms are self-contained sequences that carry out a variety of tasks.',
 'Amp is a non-blocking concurrency framework for PHP.',
 'Android is an operating system built by Google designed for mobile devices.']

The above shows the descriptions of the first five topics on the GitHub topics page.

Each topic URL can be built by prepending the base URL https://github.com to the tag's `href` attribute.

![](https://imgur.com/yaDTGOI.png)

In [21]:
def get_topic_urls(topic_doc):
    """
    Function to extract the topic URLs from HTML source code using the BeautifulSoup. 
    """
    topic_link_tags = topic_doc.find_all('a', {'class': 'd-flex no-underline'})
    topic_urls = []
    
    base_url = 'https://github.com'
    for tag in topic_link_tags:
        topic_urls.append(base_url + tag['href'])
    return topic_urls

In [22]:
# Get the topic URL of each topic in the webpage using the BeautifulSoup object `topic_doc`. 

get_topic_urls(topic_doc)[:5]

['https://github.com/topics/3d',
 'https://github.com/topics/ajax',
 'https://github.com/topics/algorithm',
 'https://github.com/topics/amphp',
 'https://github.com/topics/android']
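
The string concatenation used in `get_topic_urls` works because every `href` on the page is a relative path. As a hedged alternative, the standard-library function `urllib.parse.urljoin` handles both relative and already-absolute links; the sample paths below are taken from the output above.

```python
from urllib.parse import urljoin

base_url = 'https://github.com'
relative_urls = ['/topics/3d', '/topics/ajax']

# Relative paths are resolved against the base URL
absolute_urls = [urljoin(base_url, href) for href in relative_urls]

# A link that is already absolute is returned unchanged
already_absolute = urljoin(base_url, 'https://github.com/topics/android')

print(absolute_urls)
print(already_absolute)
```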

In [23]:
def scrape_topic():
    topic_url = 'https://github.com/topics'
    response = requests.get(topic_url)
    # Check successful response
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topic_url))
    # Parse the downloaded page
    topic_doc = BeautifulSoup(response.text, 'html.parser')

    topics_dict = {
        'title' : get_topic_titles(topic_doc),
        'description' : get_topic_descs(topic_doc),
        'url' : get_topic_urls(topic_doc)
    }
    return pd.DataFrame(topics_dict)

topics_df = scrape_topic()

In [24]:
topics_df

| | title | description | url |
|---|---|---|---|
| 0 | 3D | 3D modeling is the process of virtually develo... | https://github.com/topics/3d |
| 1 | Ajax | Ajax is a technique for creating interactive w... | https://github.com/topics/ajax |
| 2 | Algorithm | Algorithms are self-contained sequences that c... | https://github.com/topics/algorithm |
| 3 | Amp | Amp is a non-blocking concurrency framework fo... | https://github.com/topics/amphp |
| 4 | Android | Android is an operating system built by Google... | https://github.com/topics/android |
| 5 | Angular | Angular is an open source web application plat... | https://github.com/topics/angular |
| 6 | Ansible | Ansible is a simple and powerful automation en... | https://github.com/topics/ansible |
| 7 | API | An API (Application Programming Interface) is ... | https://github.com/topics/api |
| 8 | Arduino | Arduino is an open source hardware and softwar... | https://github.com/topics/arduino |
| 9 | ASP.NET | ASP.NET is a web framework for building modern... | https://github.com/topics/aspnet |


In [25]:
topics_df.to_csv('topics_info.csv')

__Saving the topics information into csv format.__

### Get the 30 Most Popular Repositories from Each Topic Page


By now we have the topic titles, topic descriptions and topic URLs for the first page.

Let’s first consider a sample topic page, Amazon Web Services, and see how we parse the HTML tags to get additional information: the username, repository name, star count and repository URL of the top 30 repositories of the topic.
![](https://imgur.com/0diXK8r.png)

- To read additional topic information, let's start with a function that accepts a topic URL and downloads that page.

In [26]:
# Let's read a topic page
def get_topic_page(topic_url):
    """
    Function to read the HTML source code using BeautifulSoup.
    """
    # download the page
    response = requests.get(topic_url)
    # Check successful response
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topic_url))
    
    # Parse using Beautiful soup
    topic_doc = BeautifulSoup(response.text,'html.parser')

    return topic_doc

In [27]:
topic_doc = get_topic_page('https://github.com/topics/aws')

We have the HTML source code in the BeautifulSoup object 'topic_doc'.

In [28]:
h1_selection_class = 'f3 color-text-secondary text-normal lh-condensed'
repo_tags = topic_doc.find_all('h1',{'class': h1_selection_class}) 
star_tags = topic_doc.find_all('a',class_='social-count float-none')

In [29]:
len(repo_tags)

30

In [30]:
a_tags = repo_tags[0].find_all('a')
username = a_tags[0].text.strip()
repo_name = a_tags[1].text.strip()


base_url = 'https://github.com'
repo_url = base_url + a_tags[1]['href']

print(username,repo_name,repo_url)


print(star_tags[0].text.strip())

serverless serverless https://github.com/serverless/serverless
40.1k


In [31]:
# Function to convert a star count such as '40.1k' to an integer (40100)

def parse_star_count(stars_str):
    stars_str = stars_str.strip()
    if stars_str[-1] == 'k':
        return int(float(stars_str[:-1]) * 1000)
    return int(stars_str)

In [32]:
parse_star_count(star_tags[0].text.strip())

40100
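
`parse_star_count` covers the `k` suffix seen on this page. As a sketch under assumptions, the variant below also accepts thousands separators and an `m` suffix, and rounds instead of truncating to avoid floating-point surprises. Whether GitHub ever renders star counts as `1,234` or `1.2m` is an assumption here, and `parse_count` is a hypothetical name.

```python
def parse_count(text):
    """Convert strings like '987', '1,234', '40.1k' or '1.2m' to an int."""
    text = text.strip().lower().replace(',', '')
    multipliers = {'k': 1_000, 'm': 1_000_000}
    if text and text[-1] in multipliers:
        # round() guards against results such as 2299.9999999 being truncated
        return int(round(float(text[:-1]) * multipliers[text[-1]]))
    return int(text)


print(parse_count('40.1k'))
```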

Under each `h1` tag there are two `a` tags: the first contains the username and the second contains the repository name.

The `href` attribute of the second `a` tag contains the repository URL.

The star tag contains the star count for each repository.

In [33]:
def get_repo_info(repo_tag,star_tag):
    """
    Function to get the repository information -
    username, repo_name, stars count and repo_url
    """
    # The first link holds the username, the second the repository name
    a_tags = repo_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href']
    stars = parse_star_count(star_tag.text.strip())
    return username, repo_name, stars, repo_url

In [34]:
get_repo_info(repo_tags[0],star_tags[0])

('serverless', 'serverless', 40100, 'https://github.com/serverless/serverless')

I’ve used a Python dictionary to store the repository information as key-value pairs. The dictionary is then converted into a pandas DataFrame, which holds the repository information in rows and columns.

In [35]:
def get_topic_repos(topic_doc):
    """
    Function to get the repository information from a topic page as a DataFrame.
    """
    h1_selection_class = 'f3 color-text-secondary text-normal lh-condensed'
    repo_tags = topic_doc.find_all('h1',{'class': h1_selection_class})
    
    star_tags = topic_doc.find_all('a',class_='social-count float-none')

    topic_repos_dict = {
        
    'username':[],
    'repo_name':[],
    'stars':[],
    'repo_url':[]
        
    }
    # Loop through all the repositories.
    
    for i in range(len(repo_tags)):
        repo_info = get_repo_info(repo_tags[i],star_tags[i])
        # get_repo_info returns username,repo_name,stars and repo_url.

        topic_repos_dict['username'].append(repo_info[0])
        topic_repos_dict['repo_name'].append(repo_info[1])
        topic_repos_dict['stars'].append(repo_info[2])
        topic_repos_dict['repo_url'].append(repo_info[3])
        
    return pd.DataFrame(topic_repos_dict)

In [36]:
get_topic_repos(topic_doc)[:5]

| | username | repo_name | stars | repo_url |
|---|---|---|---|---|
| 0 | serverless | serverless | 40100 | https://github.com/serverless/serverless |
| 1 | ... | ... | 31000 | ... |
| 2 | ... | ... | 21300 | ... |
| 3 | ... | ... | 11700 | ... |
| 4 | ... | ... | 11200 | ... |


We now have all the details we wanted to retrieve from the topic page: username, repo_name, stars count and repo_url.

### Putting all Pieces Together

- We have a function to get the list of topics
- We have a function to get the scraped repos from a topic page as a DataFrame
- Let's create a function to put them together
- Let's run it to scrape the top repos for all the topics on the first page of GitHub topics

In [37]:
# function to get topic titles.
def get_topic_titles(doc):
    topic_title_tags = doc.find_all('p',{'class':'f3 lh-condensed mb-0 mt-1 Link--primary'})
    topic_titles = []
    for tag in topic_title_tags:
        topic_titles.append(tag.text)
        
    return topic_titles

# function to get topic description.
def get_topic_descs(doc):
    topic_desc_tags = doc.find_all('p',class_='f5 color-text-secondary mb-0 mt-1')
    topic_descs = []
    
    for tag in topic_desc_tags:
        topic_descs.append(tag.text.strip())
        
    return topic_descs

# function to get topic urls.
def get_topic_urls(doc):
    topic_link_tags = doc.find_all('a',class_='d-flex no-underline')
    
    topic_urls = []
    base_url = 'https://github.com'

    for tag in topic_link_tags:
        topic_urls.append(base_url + tag['href'])

    return topic_urls


# function to scrape the topics using BeautifulSoup and store them in tabular form.
def scrape_topics():
    topics_url = 'https://github.com/topics'
    
    response = requests.get(topics_url)
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topics_url))
    doc = BeautifulSoup(response.text, 'html.parser')
    
    topics_dict = {
        
        'title' : get_topic_titles(doc),
        'description' : get_topic_descs(doc),
        'url' : get_topic_urls(doc)
    }
    
    # store the topics in tabular format with the help of pandas.
    return pd.DataFrame(topics_dict)

__We now have all the functions for the topic titles, topic descriptions and topic URLs.
Next, from each topic URL we extract the top repositories to get the username, repo_name, stars count and repository URL.__

In [38]:
def scrape_repos(topic_url,path):
    """
    Function to scrape a topic and store its repositories in CSV format.
    If the file already exists, scraping is skipped for that topic.
    """
    if os.path.exists(path):
        print("The file {} already exists. Skipping...".format(path))
        return
    topic_df = get_topic_repos(get_topic_page(topic_url))
    
    # Save the repository information of the topic to a CSV file.
    topic_df.to_csv(path ,index=None)
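
The file path passed to `scrape_repos` is built directly from the topic title. That works for the titles shown in this notebook, but a title containing a slash or other special characters could produce an invalid path. The helper below is a hypothetical sketch, not part of the original notebook, that replaces such characters before the title is used as a file name.

```python
import re


def safe_filename(title):
    """Replace characters that are awkward in file names with underscores.

    Hypothetical helper for illustration only.
    """
    cleaned = re.sub(r'[^A-Za-z0-9._-]+', '_', title.strip())
    return cleaned or 'untitled'


print(safe_filename('ASP.NET'))
print(safe_filename('Amazon Web Services'))
```

Note that two different titles can still map to the same cleaned name, so a real scraper may want to check for collisions before writing.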

In [39]:
def scrape_topic_repos():
    """
    Function to iterate over all the topics in the topics data frame
    and scrape the repositories of each one.
    """
    print('Scraping top topics from GitHub')
    topics_df = scrape_topics()
    
    # Create a folder 'data' to store all the repository information.
    os.makedirs('data',exist_ok=True)
    
    # Loop over each topic URL from the topics page.
    for index, row in topics_df.iterrows():
        print('Scraping top repositories for "{}"'.format(row['title']))
        # scrape_repos downloads the topic page and saves its repository information.
        scrape_repos(row['url'], 'data/{}.csv'.format(row['title']))

In [40]:
scrape_topic_repos()

Scraping top topics from GitHub
Scraping top repositories for "3D"
The file data/3D.csv already exists. Skipping...
Scraping top repositories for "Ajax"
The file data/Ajax.csv already exists. Skipping...
Scraping top repositories for "Algorithm"
The file data/Algorithm.csv already exists. Skipping...
Scraping top repositories for "Amp"
The file data/Amp.csv already exists. Skipping...
Scraping top repositories for "Android"
The file data/Android.csv already exists. Skipping...
Scraping top repositories for "Angular"
The file data/Angular.csv already exists. Skipping...
Scraping top repositories for "Ansible"
The file data/Ansible.csv already exists. Skipping...
Scraping top repositories for "API"
The file data/API.csv already exists. Skipping...
Scraping top repositories for "Arduino"
The file data/Arduino.csv already exists. Skipping...
Scraping top repositories for "ASP.NET"
The file data/ASP.NET.csv already exists. Skipping...
Scraping top repositories for "Atom"
The file data/Atom.csv already

__All the files are saved in CSV format in the `data` folder created__
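
With one CSV file per topic, a natural next step is to load them back into a single collection. Here is a minimal sketch using only the standard library; `load_all_repos` is a hypothetical helper, and it assumes each file name (without `.csv`) is the topic title, as in the `data` folder created above.

```python
import csv
import glob
import os


def load_all_repos(folder='data'):
    """Read every CSV file in `folder` into one list of row dictionaries.

    A 'topic' key, taken from the file name, is added to each row.
    """
    all_rows = []
    for path in sorted(glob.glob(os.path.join(folder, '*.csv'))):
        topic = os.path.splitext(os.path.basename(path))[0]
        with open(path, newline='', encoding='utf-8') as f:
            for row in csv.DictReader(f):
                row['topic'] = topic
                all_rows.append(row)
    return all_rows
```

With pandas, the same result could be reached by reading each file with `pd.read_csv` and combining the frames with `pd.concat`.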

__In this project, I scrape 30 featured topics and, for each of them, the 30 most popular repositories by star count. The more topics you want to cover, the more web pages you need to scrape.__
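
Scraping more pages first requires the URLs of those pages. The sketch below builds them under one loud assumption: that the listing accepts a `page` query parameter (for example `https://github.com/topics?page=2`). This is not confirmed anywhere in this notebook, so check the request that the ‘Load More’ button sends in the browser's network tab before relying on it. `topic_page_urls` is a hypothetical helper.

```python
from urllib.parse import urlencode


def topic_page_urls(base_url, pages):
    """Build the URLs for pages 1..pages of a listing.

    Assumption: the site accepts a `page` query parameter.
    """
    return ['{}?{}'.format(base_url, urlencode({'page': n}))
            for n in range(1, pages + 1)]


urls = topic_page_urls('https://github.com/topics', 3)
print(urls)
```

Each URL could then be passed to `get_topic_page` in a loop, ideally with a short pause between requests.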

### Summary

1. Downloaded the GitHub topics page using Requests.
2. Extracted the topic details using BeautifulSoup (bs4).
3. Extracted all the topic information: topic title, topic description and topic URL.
4. Compiled the topic information into Python lists and pandas DataFrames.
5. Extracted the repository information from each topic: username, repo_name, stars count and repo_url.
6. Compiled the repository information into Python lists and pandas DataFrames.
7. Extracted the repository information for multiple topic pages.
8. Saved all the datasets in .csv format in the `data` folder.

### Future Work

In the near future, I plan to

- Perform web scraping using Selenium or Scrapy.
- Perform data cleaning on the data.
- Perform visualization on the data to get useful insights.