# Web Scraping GitHub Topics

## Introduction to Web Scraping
Web scraping is a technique used to automatically extract data from websites. It allows developers to collect and analyze data from web pages by parsing their HTML or XML content. This technique is particularly useful for gathering large amounts of data from publicly accessible websites efficiently.
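
To make "parsing HTML content" concrete, here is a minimal sketch that uses only Python's standard-library `html.parser` module rather than the libraries introduced later. The `ParagraphCollector` class and the inline `sample_html` string are illustrative assumptions, not part of this project's code.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collects the text of every <p> element in an HTML document."""

    def __init__(self):
        super().__init__()
        self.in_paragraph = False
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        # Start a new text buffer whenever a <p> tag opens
        if tag == 'p':
            self.in_paragraph = True
            self.paragraphs.append('')

    def handle_endtag(self, tag):
        if tag == 'p':
            self.in_paragraph = False

    def handle_data(self, data):
        # Only keep text that appears inside a <p> element
        if self.in_paragraph:
            self.paragraphs[-1] += data


# Hypothetical page content, inlined so the example needs no network access
sample_html = '<html><body><h1>Topics</h1><p>3D</p><p>Ajax</p></body></html>'

collector = ParagraphCollector()
collector.feed(sample_html)
print(collector.paragraphs)  # ['3D', 'Ajax']
```

Libraries such as Beautiful Soup wrap this kind of event handling in a searchable document tree, which is why the rest of this project uses them instead.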

## Introduction to GitHub and Problem Statement
GitHub is a widely used platform for hosting and sharing code repositories. It features a "Topics" page where various technologies, frameworks, and languages are categorized to help users discover relevant projects. 

The goal of this project is to scrape the top topics listed on GitHub's Topics page, collect each topic's title, description, and URL, and save the top repositories of every topic in a CSV file. This provides an organized way to analyze and explore popular topics on GitHub.

## Tools and Technologies Used

### Python
Python is a versatile programming language that is commonly used for web scraping due to its simplicity and the availability of powerful libraries.

### Requests
`requests` is a Python library that allows you to send HTTP requests to websites and retrieve their content. It simplifies the process of making web requests and handling responses.

### Beautiful Soup
`Beautiful Soup` is a Python library used to parse HTML and XML documents. It provides tools for navigating the document tree, searching for specific elements, and extracting data.

### Pandas
`Pandas` is a powerful Python library for data manipulation and analysis. It is used to structure the scraped data and save it in a CSV format.

## Python Code for Web Scraping GitHub Topics

Here are the steps we'll follow:

- We're going to scrape https://github.com/topics
- We'll get a list of topics. For each topic, we'll get topic title, topic page URL and topic description
- For each topic, we'll get the `top 20 repositories` in the topic from the topic page
- For each repository, we'll grab the `repo name`, `username`, `stars` and `repo URL`
- For each topic we'll create a CSV file in the following format:

```
username,repo_name,stars,repo_url
mrdoob,three.js,69700,https://github.com/mrdoob/three.js
libgdx,libgdx,18300,https://github.com/libgdx/libgdx
```

## Scrape the list of topics from GitHub

### To scrape the list of topics from GitHub, we'll follow these steps:

- <b>Download the Page</b>: We'll use the `requests` library to send an `HTTP GET` request to the GitHub Topics page and download the `HTML` content of the page.
- <b>Parse and Extract Information</b>: Once we have the HTML content, we'll use the `Beautiful Soup` library to parse the page and extract the relevant information such as topic titles, descriptions, and URLs.
- <b>Convert to a Pandas DataFrame</b>: After extracting the data, we'll structure it in a `Pandas DataFrame` for easy manipulation and export it as a `CSV` file.

#### Install the required libraries:

In [None]:
# !pip install requests --upgrade --quiet

# !pip install beautifulsoup4 --upgrade --quiet

# !pip install pandas --upgrade --quiet

#### Import these libraries in the Python script:

In [None]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import os

### Let's write a function to download the page.

In [None]:
def get_topics_page():
    # URL of the page that lists GitHub topics
    topics_url = 'https://github.com/topics'
    # Download the page
    response = requests.get(topics_url)
    # Check for a successful response
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topics_url))
    # Parse the HTML using Beautiful Soup
    doc = BeautifulSoup(response.text, 'html.parser')
    return doc

In [None]:
doc = get_topics_page()
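
Downloads can fail for temporary reasons such as a dropped connection. The sketch below shows one possible way to retry a download a few times before giving up. `fetch_with_retries` and `flaky_fetch` are hypothetical names introduced for illustration; they are not part of `requests`, and the stand-in fetch function lets the example run offline.

```python
import time


def fetch_with_retries(fetch, url, attempts=3, delay_seconds=1.0):
    """Call fetch(url) until it succeeds or the attempts run out."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return fetch(url)
        except Exception as error:
            last_error = error
            # Wait before the next try, but not after the final failure
            if attempt < attempts:
                time.sleep(delay_seconds)
    message = 'Failed to load page {} after {} attempts'.format(url, attempts)
    raise Exception(message) from last_error


# Stand-in for a real download function: fails twice, then succeeds
calls = []


def flaky_fetch(url):
    calls.append(url)
    if len(calls) < 3:
        raise ConnectionError('temporary failure')
    return '<html>ok</html>'


page = fetch_with_retries(flaky_fetch, 'https://github.com/topics', delay_seconds=0)
print(page, len(calls))  # <html>ok</html> 3
```

Any function that takes a URL and raises an exception on failure could be passed as `fetch`.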

## Let's create some helper functions to parse information from the page.

To get topic <b>titles</b>, we can pick `p` tags with the `class` ...

![](https://i.imgur.com/M5padE4.png)

In [None]:
def get_topic_titles(doc):
    selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
    topic_title_tags = doc.find_all('p', {'class': selection_class})
    topic_titles = []
    for tag in topic_title_tags:
        topic_titles.append(tag.text)
    return topic_titles

`get_topic_titles` can be used to get the list of titles

In [None]:
titles = get_topic_titles(doc)

In [None]:
titles
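
Passing the full class string to `find_all` asks Beautiful Soup to match the `class` attribute as that exact string, so the lookup can stop matching if the page reorders or changes its classes. The sketch below illustrates the difference between an exact string comparison and an order-independent check. `has_all_classes` is a hypothetical helper, and the sample attribute values are made up for the demonstration.

```python
def has_all_classes(class_attr, required):
    """True if every class in `required` appears in `class_attr`, in any order."""
    return set(required.split()) <= set(class_attr.split())


selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'

# Hypothetical class attributes as they might appear on a <p> tag
reordered = 'Link--primary f3 lh-condensed mt-1 mb-0'
missing_one = 'f3 lh-condensed mb-0 mt-1'

print(has_all_classes(reordered, selection_class))    # True
print(reordered == selection_class)                   # False
print(has_all_classes(missing_one, selection_class))  # False
```

If `get_topic_titles` ever returns an empty list, comparing the classes in the live page against `selection_class` in this way is a reasonable first debugging step.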

To get topic <b>descriptions</b>, we can pick `p` tags with the `class` ...

![](https://i.imgur.com/70g2ugX.png)

In [None]:
def get_topic_descs(doc):
    desc_selector = 'f5 color-fg-muted mb-0 mt-1'
    topic_desc_tags = doc.find_all('p', {'class': desc_selector})
    topic_descs = []
    for tag in topic_desc_tags:
        topic_descs.append(tag.text.strip())
    return topic_descs

`get_topic_descs` can be used to get the list of descriptions

In [None]:
desc = get_topic_descs(doc)

In [None]:
desc

To get topic <b>URLs</b>, we can pick `"a"` tags with a `base url` and the `class` ...

![](https://i.imgur.com/leQ9G0b.png)

In [None]:
def get_topic_urls(doc):
    topic_link_tags = doc.find_all('a', {'class': 'no-underline flex-1 d-flex flex-column'})
    topic_urls = []
    base_url = 'https://github.com'
    for tag in topic_link_tags:
        topic_urls.append(base_url + tag['href'])
    return topic_urls

`get_topic_urls` can be used to get the list of URLs

In [None]:
url = get_topic_urls(doc)

In [None]:
url
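
String concatenation works here because each `href` is expected to be a path starting with `/`. As an alternative sketch, the standard library's `urllib.parse.urljoin` also copes with links that are already absolute. The `hrefs` list below is made-up sample data.

```python
from urllib.parse import urljoin

base_url = 'https://github.com'

# Hypothetical href values: two relative paths and one absolute URL
hrefs = ['/topics/3d', '/topics/ajax', 'https://github.com/topics/algorithm']

topic_urls = [urljoin(base_url, href) for href in hrefs]
for topic_url in topic_urls:
    print(topic_url)
# https://github.com/topics/3d
# https://github.com/topics/ajax
# https://github.com/topics/algorithm
```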

## Let's put this all together into a single function

In [None]:
def scrape_topics():
    topics_url = 'https://github.com/topics'
    response = requests.get(topics_url)
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topics_url))
    # Parse the downloaded page here instead of relying on a global `doc`
    doc = BeautifulSoup(response.text, 'html.parser')
    topics_dict = {
        'title': get_topic_titles(doc),
        'description': get_topic_descs(doc),
        'url': get_topic_urls(doc)
    }
    return pd.DataFrame(topics_dict)

In [None]:
topics_table = scrape_topics()

#### Getting a DataFrame of the topics with fields: `title`, `description` and `url`

In [None]:
topics_table
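
`pd.DataFrame` raises an error when the lists in the dictionary have different lengths, which can happen if one of the three selectors matches fewer tags than the others. The sketch below shows one way to report which column is short before building the table. `check_equal_lengths` is a hypothetical helper and the sample values are illustrative.

```python
def check_equal_lengths(columns):
    """Raise ValueError if the lists in `columns` differ in length."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError('Column lengths differ: {}'.format(lengths))
    return columns


# Illustrative data shaped like the dictionary built in scrape_topics
sample_topics = {
    'title': ['3D', 'Ajax'],
    'description': ['First description', 'Second description'],
    'url': ['https://github.com/topics/3d', 'https://github.com/topics/ajax'],
}
check_equal_lengths(sample_topics)  # passes silently

# Simulate a selector that matched one tag too few
sample_topics['description'].pop()
try:
    check_equal_lengths(sample_topics)
except ValueError as error:
    print(error)
```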

## Getting the top repositories from a topic page

In [None]:
def get_topic_page(topic_url):
    # Download the page
    response = requests.get(topic_url)
    # Check successful response
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topic_url))
    # Parse using Beautiful soup
    topic_doc = BeautifulSoup(response.text, 'html.parser')
    return topic_doc

<b>Getting the topic page for the url:</b> `https://github.com/topics/3d`

In [None]:
custom_url = url[0]

print(custom_url)

In [None]:
topic_doc = get_topic_page(custom_url)

To get <b>repo tags</b>, we can pick `h3` tags with the `class` ...

![](https://i.imgur.com/B6HDspK.png)

In [None]:
h3_selection_class = 'f3 color-fg-muted text-normal lh-condensed'
repo_tags = topic_doc.find_all('h3', {'class': h3_selection_class})

#### Getting the `username` for the first repository in the `3d` topic:

In [None]:
repo_tags[0]

In [None]:
a_tags = repo_tags[0].find_all('a')

In [None]:
a_tags[0].text.strip()

#### Getting the `stars` for the same repository:

In [None]:
star_tags = topic_doc.find_all('span', { 'class': 'Counter js-social-count'})

In [None]:
star_tags[0].text.strip()

In the scraped data, a value such as `101k` is shorthand for a large number, where `k` stands for `thousand`. To make the data easier to process and analyze, we convert the shorthand `101k` into its full numerical equivalent, `101000`.<br>
Converting the shorthand ensures that all values share a consistent numerical format, which allows accurate calculations and comparisons.<br>
The `parse_star_count` function does the conversion:

In [None]:
def parse_star_count(stars_str):
    stars_str = stars_str.strip()
    if stars_str[-1] == 'k':
        return int(float(stars_str[:-1]) * 1000)
    return int(stars_str)

In [None]:
parse_star_count(star_tags[0].text.strip())
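
`parse_star_count` covers plain integers and the `k` suffix. The sketch below is a slightly more defensive variant that also accepts an `m` suffix for millions and thousands separators. Whether GitHub ever displays those formats is an assumption, and `parse_count` is a hypothetical name, not a replacement the rest of this notebook depends on.

```python
def parse_count(text):
    """Convert strings such as '872', '69.7k' or '1.2m' to an integer."""
    text = text.strip().lower().replace(',', '')
    multipliers = {'k': 1_000, 'm': 1_000_000}
    if text and text[-1] in multipliers:
        # round() guards against floating-point results just below the target
        return int(round(float(text[:-1]) * multipliers[text[-1]]))
    return int(text)


print(parse_count('101k'))   # 101000
print(parse_count('69.7k'))  # 69700
print(parse_count('1.2m'))   # 1200000
print(parse_count('1,234'))  # 1234
print(parse_count(' 872 '))  # 872
```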

### Getting all the required information about a repository

In [None]:
base_url = 'https://github.com'

In [None]:
def get_repo_info(h3_tag, star_tag):
    # returns all the required info about a repository
    a_tags = h3_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url =  base_url + a_tags[1]['href']
    stars = parse_star_count(star_tag.text.strip())
    return username, repo_name, stars, repo_url

#### Getting the `username`, `repository name`, `stars` and `repository url` for the first repository in the `3d` topic:

In [None]:
info = get_repo_info(repo_tags[0],star_tags[0])

In [None]:
info

## Getting all the topic repositories

The dictionary `topic_repos_dict` is designed to store information about repositories under a specific topic.<br>
Each key in the dictionary represents a list that will hold specific details about the repositories. 
<br>Here’s a description of each key:
- `username`: A list to store the GitHub usernames of the owners of the repositories.
- `repo_name`: A list to store the names of the repositories.
- `stars`: A list to store the number of stars each repository has received. Stars are a measure of how popular or well-regarded a repository is on GitHub.
- `repo_url`: A list to store the URLs of the repositories, allowing users to directly access them on GitHub.

In [None]:
topic_repos_dict = {
    'username': [],
    'repo_name': [],
    'stars': [],
    'repo_url': []
}
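
To show how this column-oriented dictionary maps to CSV rows, here is a small sketch using only the standard library's `csv` module in place of Pandas. The name `example_repos` is hypothetical, and its values are taken from the sample output shown near the start of this document.

```python
import csv
import io

# Each key is a column; position i in every list describes the same repository
example_repos = {
    'username': ['mrdoob', 'libgdx'],
    'repo_name': ['three.js', 'libgdx'],
    'stars': [69700, 18300],
    'repo_url': [
        'https://github.com/mrdoob/three.js',
        'https://github.com/libgdx/libgdx',
    ],
}

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator='\n')
writer.writerow(example_repos.keys())             # header row
writer.writerows(zip(*example_repos.values()))    # one row per repository
csv_text = buffer.getvalue()
print(csv_text)
```

`zip(*example_repos.values())` turns the four parallel lists into one tuple per repository, which is the same reshaping a DataFrame performs when it is written out with `to_csv`.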

In [None]:
def get_topic_repos(topic_doc):
    # Get the h3 tags containing repo title, repo URL and username
    h3_selection_class = 'f3 color-fg-muted text-normal lh-condensed'
    repo_tags = topic_doc.find_all('h3', {'class': h3_selection_class} )
    # Get star tags
    star_tags = topic_doc.find_all('span', { 'class': 'Counter js-social-count'})
    
    topic_repos_dict = { 'username': [], 'repo_name': [], 'stars': [],'repo_url': []}

    # Get repo info, pairing each repo tag with its star tag
    for h3_tag, star_tag in zip(repo_tags, star_tags):
        repo_info = get_repo_info(h3_tag, star_tag)
        topic_repos_dict['username'].append(repo_info[0])
        topic_repos_dict['repo_name'].append(repo_info[1])
        topic_repos_dict['stars'].append(repo_info[2])
        topic_repos_dict['repo_url'].append(repo_info[3])
        
    return pd.DataFrame(topic_repos_dict)

The `scrape_topic` function scrapes repository data from a specific GitHub topic page and saves it to a CSV file. If the file already exists, it skips the topic.

In [None]:
def scrape_topic(topic_url, path):
    if os.path.exists(path):
        print("The file {} already exists. Skipping...".format(path))
        return
    topic_df = get_topic_repos(get_topic_page(topic_url))
    topic_df.to_csv(path, index=None)

## Putting it all together

- We have a function to get the list of topics
- We have a function to create a CSV file for scraped repos from a topic page
- Let's create a function to put them together

In [None]:
def scrape_topics_repos():
    print('Scraping list of topics')
    topic_df = scrape_topics()
    
    os.makedirs('data', exist_ok=True)
    for index, row in topic_df.iterrows():
        print('Scraping top repositories for "{}"'.format(row['title']))
        scrape_topic(row['url'], 'data/{}.csv'.format(row['title']))

Let's run it to scrape the top repos for all the topics on the first page of https://github.com/topics

In [None]:
scrape_topics_repos()
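
The file names above are built directly from the topic titles, so a title with surrounding whitespace or a character such as `/` would produce an awkward or invalid path. The sketch below shows one conservative way to clean a title first. `title_to_filename` is a hypothetical helper, and the sample titles are made up.

```python
import re


def title_to_filename(title):
    """Turn a topic title into a conservative CSV file name."""
    # Replace every run of characters outside a small safe set with '_'
    cleaned = re.sub(r'[^A-Za-z0-9._+-]+', '_', title.strip())
    cleaned = cleaned.strip('_')
    # Fall back to a generic name if nothing usable is left
    return (cleaned or 'topic') + '.csv'


print(title_to_filename('3D'))          # 3D.csv
print(title_to_filename(' Amp HTML '))  # Amp_HTML.csv
print(title_to_filename('a/b'))         # a_b.csv
print(title_to_filename(''))            # topic.csv
```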

#### We can check that the CSVs were created properly
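
One possible check is to list every CSV file in the output directory and count its data rows. The sketch below uses only the standard library. `summarize_csvs` is a hypothetical helper, and the demonstration writes a single sample file to a temporary directory so that it runs without the scraped data.

```python
import csv
import os
import tempfile


def summarize_csvs(directory):
    """Return {file name: number of data rows} for each CSV in `directory`."""
    summary = {}
    for name in sorted(os.listdir(directory)):
        if not name.endswith('.csv'):
            continue
        path = os.path.join(directory, name)
        with open(path, newline='', encoding='utf-8') as f:
            rows = list(csv.reader(f))
        # Subtract the header row; an empty file counts as zero rows
        summary[name] = max(len(rows) - 1, 0)
    return summary


# Demonstration on a throwaway directory with one sample file
with tempfile.TemporaryDirectory() as demo_dir:
    sample_path = os.path.join(demo_dir, '3D.csv')
    with open(sample_path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['username', 'repo_name', 'stars', 'repo_url'])
        writer.writerow(['mrdoob', 'three.js', 69700,
                         'https://github.com/mrdoob/three.js'])
    demo_summary = summarize_csvs(demo_dir)

print(demo_summary)  # {'3D.csv': 1}
```

After running `scrape_topics_repos()`, calling `summarize_csvs('data')` would report one entry per topic file.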