# Scraping `GitHub Topics` Repositories

<img src='images/github_logo.jpg'/>

## About the `PROJECT`
- The main `Objective` of this project is to scrape the `GitHub Topics` page and collect information about each topic (title, URL, description).

- Then, for each topic, get information about its top 30 repositories (name, username, stars, URL).

- Finally, save the results to CSV files.

## Introduction:
1. Web Scraping: 

    - Web scraping is a technique to fetch data from websites. It is an automatic method to obtain large amounts of data from websites. Most of this data is unstructured data in an HTML format which is then converted into structured data in a spreadsheet or a database so that it can be used in various applications.
<br><br/>
2. GitHub
    - GitHub is a web-based hosting platform built on Git, the open source version control software that lets multiple people make separate changes to the same codebase at the same time.
    - In this project, I'll need the GitHub Topics link to scrape the repositories of each topic.
<br><br/>  
3. Tools/Libraries (Used in this Project)    
    - Python: For `Scripting`.
    - requests: To `Download` the web page (HTML) from its URL.
    - Beautiful Soup: To `Parse` the page.
    - pandas: To convert the data into a `DataFrame` and save it as a `.CSV` file.
    - os: To create a directory and save all the CSV files in it.

### Project Outline:

- Scrape this website: https://github.com/topics
- Get the list of Topics. For each Topic get Topic Title, Topic page URL and Topic description.
- For each Topic, Get the top 30 Repositories in the Topic from the Topic page.
- For each Repository, grab the Repository name, username, Stars and Repository URL.
- For each Topic Create a CSV file in the following format:
```
Repo Name,Username,Stars,Repo URL
three.js,mrdoob,81900,https://github.com/mrdoob/three.js
libgdx,libgdx,20000,https://github.com/libgdx/libgdx
```
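As a rough illustration of that target format, the sketch below writes the two sample rows above to CSV text. It uses Python's standard `csv` module instead of pandas (which this project uses) only to stay dependency-free; `HEADER`, `rows` and `to_csv_text` are hypothetical names for this example.

```python
import csv
import io

# Column order taken from the sample format above.
HEADER = ["Repo Name", "Username", "Stars", "Repo URL"]


def to_csv_text(rows):
    """Serialize a list of row tuples to CSV text, header first."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(HEADER)
    writer.writerows(rows)
    return buffer.getvalue()


rows = [
    ("three.js", "mrdoob", 81900, "https://github.com/mrdoob/three.js"),
    ("libgdx", "libgdx", 20000, "https://github.com/libgdx/libgdx"),
]
csv_text = to_csv_text(rows)
print(csv_text)
```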

In [1]:
"""
import all the required Libraries.
"""
import requests
from bs4 import BeautifulSoup
import pandas as pd
import os

- Use `requests` to `Download` the page
- Use `bs4` to `Parse` the page
- Use `pandas` to convert the data into a DataFrame and save it as a `.CSV` file
- Use `os` to create a directory and save all the `.CSV` files in it

## Scrape the list of Topics from GitHub

In [2]:
def get_topics_page():
    """
    This function downloads the GitHub Topics page as HTML text,
    parses it using Beautiful Soup and returns the parsed document.
    """
    
    # Download the page
    topics_url = 'https://github.com/topics'
    response = requests.get(topics_url)
    
    # Check successful response
    if response.status_code != 200:
        raise Exception(f'Failed to load page {topics_url}')
    
    # Parse using Beautiful Soup
    doc = BeautifulSoup(response.text, 'html.parser')
    
    return doc

In [3]:
"""
Call the get_topics_page function and store the return value in a variable.
"""

doc = get_topics_page()

### Create some Helper functions to Parse information from the Page.

### Get Topic Titles

To get topic titles, we can pick `p` tags with the class `f3 lh-condensed mb-0 mt-1 Link--primary`

<img src='images/title_tag.png'/>

In [4]:
def get_topic_titles(doc):
    """
    This function takes the parsed page (doc),
    extracts the topic titles using Beautiful Soup,
    and returns them as a list.
    """
    
    title_selector_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
    topic_title_tags = doc.find_all('p', {'class': title_selector_class})    
    topic_titles = []
    for tag in topic_title_tags:
        topic_titles.append(tag.text)
    return topic_titles

`get_topic_titles` returns the list of topic titles.

Explanation:
- Open the URL `https://github.com/topics` in a browser, inspect a topic title and observe that it sits inside a `p` tag with the class `f3 lh-condensed mb-0 mt-1 Link--primary`.
<br><br/>
- Then, using the Beautiful Soup `find_all` method, get all `p` tags with that class.
<br><br/>
- Using a loop, store the text of each `p` tag in a list.
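
The steps above can be tried offline on a small fragment. This is only a sketch: `sample_html` is a hypothetical snippet that imitates the structure of the topics page, not the real markup. It also shows a CSS selector via `select`, which matches the listed classes regardless of their order in the attribute.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment imitating the structure of the topics page.
sample_html = """
<div>
  <p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
  <p class="f5 color-fg-muted mb-0 mt-1">3D modeling is ...</p>
  <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>
</div>
"""

sample_doc = BeautifulSoup(sample_html, "html.parser")

# A class string containing spaces is compared against the exact value
# of the class attribute, so class order matters here.
title_class = "f3 lh-condensed mb-0 mt-1 Link--primary"
title_tags = sample_doc.find_all("p", {"class": title_class})
sample_titles = [tag.text for tag in title_tags]
print(sample_titles)

# Alternative: a CSS selector that only requires these two classes.
selected_titles = [tag.text for tag in sample_doc.select("p.f3.lh-condensed")]
print(selected_titles)
```

Because the exact-string match depends on the full class attribute, the selector may need updating whenever GitHub changes its markup.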

In [5]:
"""
Calling the get_topic_titles function
"""
titles = get_topic_titles(doc)

In [6]:
"""
Print all the topic titles.
"""
print(titles)

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android', 'Angular', 'Ansible', 'API', 'Arduino', 'ASP.NET', 'Atom', 'Awesome Lists', 'Amazon Web Services', 'Azure', 'Babel', 'Bash', 'Bitcoin', 'Bootstrap', 'Bot', 'C', 'Chrome', 'Chrome extension', 'Command line interface', 'Clojure', 'Code quality', 'Code review', 'Compiler', 'Continuous integration', 'COVID-19', 'C++']


### Get Topic Description

To get topic description, we can pick `p` tags with the class `f5 color-fg-muted mb-0 mt-1`

<img src='images/description_tag.png'/>

In [7]:
def get_topic_descriptions(doc):
    """
    Similarly, this function takes the parsed page (doc),
    extracts the topic descriptions using Beautiful Soup,
    and returns them as a list.
    """
    
    description_selector = 'f5 color-fg-muted mb-0 mt-1'
    topic_description_tags = doc.find_all('p', {'class': description_selector})
    topic_descriptions = []
    for tag in topic_description_tags:
        topic_descriptions.append(tag.text.strip())
    return topic_descriptions

`get_topic_descriptions` returns the list of topic descriptions.

In [8]:
"""
Calling the get_topic_descriptions function.
"""
descriptions = get_topic_descriptions(doc)

In [9]:
"""
Read first 5 Topic Descriptions
"""
descriptions[:5]

['3D modeling is the process of virtually developing the surface and structure of a 3D object.',
 'Ajax is a technique for creating interactive web applications.',
 'Algorithms are self-contained sequences that carry out a variety of tasks.',
 'Amp is a non-blocking concurrency library for PHP.',
 'Android is an operating system built by Google designed for mobile devices.']

### Get Topic URL

To get topic URL, we can pick `a` tags with the class `no-underline flex-1 d-flex flex-column`
<img src='images/link_tag.png'/>

In [10]:
def get_topic_url(doc):
    """
    This function takes the parsed page (doc),
    extracts the topic URLs using Beautiful Soup,
    and returns them as a list.
    """
    url_selector = 'no-underline flex-1 d-flex flex-column'
    topic_link_tags = doc.find_all('a', {'class': url_selector})
    topic_urls = []
    base_url = 'https://github.com'
    for tag in topic_link_tags:
        topic_urls.append(f"{base_url}{tag['href']}")
    return topic_urls

In [11]:
"""
Calling the get_topic_url function.
"""
urls = get_topic_url(doc)

In [12]:
"""
Read first 5 Topic URLs
"""
urls[:5]

['https://github.com/topics/3d',
 'https://github.com/topics/ajax',
 'https://github.com/topics/algorithm',
 'https://github.com/topics/amphp',
 'https://github.com/topics/android']

### Now put this all together into a single function

In [13]:
def scrape_topics():
    """
    This function downloads the topics page, calls each helper function
    and stores the returned values in a dictionary.
    It then converts the dictionary into a DataFrame
    and returns it.
    """
    topics_url = 'https://github.com/topics'
    response = requests.get(topics_url)
    
    if response.status_code != 200:
        raise Exception(f'Failed to load page {topics_url}')
    
    doc = BeautifulSoup(response.text, 'html.parser')
    
    topics_dict = {
        'Title': get_topic_titles(doc),
        'Description': get_topic_descriptions(doc),
        'URL': get_topic_url(doc)
    }     
    return pd.DataFrame(topics_dict)

In [14]:
"""
Calling scrape_topics function and
store returned value in a variable.
"""
topics_df = scrape_topics()

In [15]:
"""
Read first 5 rows from the DataFrame.
"""
topics_df.head()

,Title,Description,URL
0,3D,3D modeling is the process of virtually develo...,https://github.com/topics/3d
1,Ajax,Ajax is a technique for creating interactive w...,https://github.com/topics/ajax
2,Algorithm,Algorithms are self-contained sequences that c...,https://github.com/topics/algorithm
3,Amp,Amp is a non-blocking concurrency library for ...,https://github.com/topics/amphp
4,Android,Android is an operating system built by Google...,https://github.com/topics/android


## Get the top repositories from a topic page

In [16]:
def get_topic_page(topic_url):
    """
    This function takes a topic URL, downloads the page content
    using the requests library, parses it using
    Beautiful Soup and returns the parsed document.
    """
    
    # download the page
    response = requests.get(topic_url)
    
    # Check successful response
    if response.status_code != 200:
        raise Exception(f'Failed to load page {topic_url}')
    
    # Parse using Beautiful Soup
    topic_doc = BeautifulSoup(response.text, 'html.parser')
    
    return topic_doc

### Get all the required information from the topic page

In [17]:
def parse_star_count(stars_str):
    """
    This is a helper function. It takes the star
    count as a string, converts it to an integer
    and returns the star count value.
    """
    
    stars_str = stars_str.strip()
    if stars_str[-1] == 'k':
        return int(float(stars_str[:-1])*1000)
    return int(stars_str)
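
A slightly more defensive variant might look like the sketch below. `parse_star_count_safe` is a hypothetical name, and the handling of thousands separators and an uppercase `K` is an assumption about inputs the page might produce, not something the original code needed. It uses `round` instead of `int` as a precaution, since `int` truncates and floating-point multiplication can land just below a whole number.

```python
def parse_star_count_safe(stars_str):
    """Hypothetical variant of parse_star_count with a few extra guards."""
    # Normalize: trim whitespace, accept 'K' as well as 'k', drop commas.
    text = stars_str.strip().lower().replace(",", "")
    if not text:
        raise ValueError("empty star count")
    if text.endswith("k"):
        # round() returns the nearest integer rather than truncating.
        return round(float(text[:-1]) * 1000)
    return int(text)


for raw in ["81.9k", " 20k ", "532", "1,234"]:
    print(repr(raw), "->", parse_star_count_safe(raw))
```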

In [18]:
def get_repo_info(h3_tag, star_tag):
    """
    This function takes two arguments (h3_tag, star_tag).
    From h3_tag it gets the username, repo_name and repo_url;
    from star_tag it gets the star count. It returns all four values.
    """

    a_tags = h3_tag.find_all('a')

    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    base_url = 'https://github.com'
    repo_url =  f'{base_url}{a_tags[1]["href"]}'
    stars = parse_star_count(star_tag.text)
    
    return username, repo_name, stars, repo_url

Explanation:
- In the above function, the `h3_tag` argument contains the username, repository name and repository URL.
- Using Beautiful Soup, get the `a` tags inside it and prepend the base URL to the repository's relative URL.
- The second argument, `star_tag`, contains the star count element; get its text.
- Then use the `parse_star_count` function to convert that string to an integer.
- Finally, return `username`, `repo_name`, `stars` and `repo_url`.
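
To see these steps on something concrete, here is a minimal sketch run against a hypothetical fragment. `sample_html` only imitates one repository entry on a topic page (the real markup may differ), and the star count is left as text rather than converted.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment imitating one repository entry on a topic page.
sample_html = """
<h3 class="f3 color-fg-muted text-normal lh-condensed">
  <a href="/mrdoob">mrdoob</a> /
  <a href="/mrdoob/three.js">three.js</a>
</h3>
<span class="Counter js-social-count">81.9k</span>
"""

sample_doc = BeautifulSoup(sample_html, "html.parser")
h3_tag = sample_doc.find("h3")
star_tag = sample_doc.find("span", {"class": "Counter js-social-count"})

# The first link is the owner, the second is the repository itself.
a_tags = h3_tag.find_all("a")
username = a_tags[0].text.strip()
repo_name = a_tags[1].text.strip()
repo_url = "https://github.com" + a_tags[1]["href"]
stars_text = star_tag.text.strip()

print(username, repo_name, stars_text, repo_url)
```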

In [19]:
def get_topic_repos(topic_doc):
    """
    It takes topic_doc (the parsed topic page),
    finds the username, repo_name, stars and repo_url
    of each repository, collects them in lists stored
    as dictionary values, then converts the
    dictionary to a DataFrame and returns it.
    """
    
    # Get the h3 tags containing repo title, repo URL and username
    h3_selector = 'f3 color-fg-muted text-normal lh-condensed'
    repo_tags = topic_doc.find_all('h3', {'class': h3_selector})
    
    # Get the star tag
    span_selector = 'Counter js-social-count'
    star_tags = topic_doc.find_all('span', {'class': span_selector})
    
    
    topic_repos_dict = {
        'username': [],
        'repo_name': [],
        'stars': [],
        'repo_url': []
    }
    
    # Get repo info
    for repo_tag, star_tag in zip(repo_tags, star_tags):
        repo_info = get_repo_info(repo_tag, star_tag)

        topic_repos_dict['username'].append(repo_info[0])
        topic_repos_dict['repo_name'].append(repo_info[1])
        topic_repos_dict['stars'].append(repo_info[2])
        topic_repos_dict['repo_url'].append(repo_info[3])
        
    return pd.DataFrame(topic_repos_dict)

In [20]:
def scrape_topic(topic_url, path):
    """
    It takes two arguments (topic_url, path).
    If the path already exists it shows a message and returns; otherwise
    it passes the result of get_topic_page() to get_topic_repos()
    and saves the returned DataFrame as a .csv file.
    """
    if os.path.exists(path):
        print(f'The file {path} already exists. Skipping...')
        return 
    topic_df = get_topic_repos(get_topic_page(topic_url))
    topic_df.to_csv(path, index=None)

## Now Putting it all together

- We have a function to get the list of topics: `scrape_topics()`
- We have a function to scrape the repositories from a topic page and save them as a CSV file: `scrape_topic(topic_url, path)`
- Now create a function to put them together: `scrape_topics_repos()`

In [21]:
def scrape_topics_repos():
    """
    This function calls scrape_topics() and stores the returned
    topics DataFrame in a variable. It then creates a folder
    named data, iterates over each row and calls scrape_topic
    for each topic.
    """
    print("Scraping list of topics")
    topics_df = scrape_topics()
    
    os.makedirs('data', exist_ok=True) 
    for index, row in topics_df.iterrows():
        print(f'Scraping Top repositories for {row["Title"]} to {row["URL"]}')
        url = row['URL']
        path = f"data/{row['Title']}.csv"
        scrape_topic(url, path)
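
One thing to watch is that the topic title goes straight into the file name. Titles such as `C++` or `ASP.NET` are fine, but a title containing a path separator would not be. The sketch below shows one possible way to guard against that; `safe_csv_path` is a hypothetical helper, not part of the code above, and the title `CI/CD` is made up for illustration.

```python
import os
import re


def safe_csv_path(title, directory="data"):
    """Hypothetical helper: build a CSV path from a topic title.

    Any character other than a word character, space, dot, plus,
    hash or hyphen is replaced with an underscore.
    """
    cleaned = re.sub(r"[^\w .+#-]", "_", title).strip()
    return os.path.join(directory, f"{cleaned}.csv")


for title in ["3D", "C++", "ASP.NET", "CI/CD"]:
    print(title, "->", safe_csv_path(title))
```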

Let's run it to scrape the top repos for all the topics on the first page of https://github.com/topics

In [22]:
scrape_topics_repos()

Scraping list of topics
Scraping Top repositories for 3D to https://github.com/topics/3d
Scraping Top repositories for Ajax to https://github.com/topics/ajax
Scraping Top repositories for Algorithm to https://github.com/topics/algorithm
Scraping Top repositories for Amp to https://github.com/topics/amphp
Scraping Top repositories for Android to https://github.com/topics/android
Scraping Top repositories for Angular to https://github.com/topics/angular
Scraping Top repositories for Ansible to https://github.com/topics/ansible
Scraping Top repositories for API to https://github.com/topics/api
Scraping Top repositories for Arduino to https://github.com/topics/arduino
Scraping Top repositories for ASP.NET to https://github.com/topics/aspnet
Scraping Top repositories for Atom to https://github.com/topics/atom
Scraping Top repositories for Awesome Lists to https://github.com/topics/awesome
Scraping Top repositories for Amazon Web Services to https://github.com/topics/aws
Scraping Top reposit

We can check that the CSVs were created properly.
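
One possible way to do that check is sketched below. It lists every CSV in the `data` folder and counts its data rows, using the standard `csv` module rather than pandas to keep the example dependency-free. `summarize_csv_dir` is a hypothetical helper name.

```python
import csv
import os


def summarize_csv_dir(directory="data"):
    """Return {file name: number of data rows} for each CSV in `directory`."""
    summary = {}
    for name in sorted(os.listdir(directory)):
        if not name.endswith(".csv"):
            continue
        path = os.path.join(directory, name)
        with open(path, newline="", encoding="utf-8") as f:
            rows = list(csv.reader(f))
        # Subtract the header row; an empty file counts as zero rows.
        summary[name] = max(len(rows) - 1, 0)
    return summary


# Only report if the folder created by scrape_topics_repos() exists.
if os.path.isdir("data"):
    for name, count in summarize_csv_dir("data").items():
        print(f"{name}: {count} repositories")
```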