# Scraping Top Repositories for Topics on GitHub


- Web scraping is the process of extracting data from a website. In this project, we will use Python libraries such as requests, BeautifulSoup, and pandas to scrape GitHub and identify the top repositories for topics.

- GitHub is a popular platform for hosting open-source software projects. It allows developers to collaborate and share their code with others. As of 2023, GitHub hosts over 200 million repositories.



### Project Outline
- We're going to scrap https://github.com/topics
- We'll get a list of topics. For each topic, we will get topic title, topic page URL and topic description
- For each topic, we'll get the top 30 repositeries in the topic from the topic page
- For each repository, we'll grab the repo name, username, stars and repo URL
- For each topic we'll create CSV file in the following format:

Repo Name,Username,Stars,Repo URL
three.js,mrdoob,69700,https://github.com/mrdoob/three.js
libgdx,libgdx,18300,https://github.com/libgdx/libgdx

## Scrape the list of topics from GitHub

- use requests to download the page
- use BS4 to parse and extract information
- Convert to pandas DataFrame

Let's write the function to downlaod the page

In [2]:
import requests
from bs4 import BeautifulSoup

def get_topic_page():
  # URL of the GitHub topics page
  topic_urls = 'https://github.com/topics'

  # Send a GET request to the topics page
  response = requests.get(topic_urls)
  if response.status_code != 200:
    raise Exception('Failed to load Page {}'.format(topic_url))

  # Parse the HTML response using BeautifulSoup
  doc = BeautifulSoup(response.text, 'html.parser')
  return doc

The `get_topic_page` function downloads the GitHub topics page using `requests.get()` and parses it using `BeautifulSoup`.

In [3]:
doc = get_topic_page()

Let's create some helper functions to parse information from the page.

To get topic titles, we can pick `p` tags with the `class` which we used in `selection_class` below

![](https://i.imgur.com/gs8AvJ9.png)

In [4]:
def get_topic_titles(doc):
  # Find all 'p' tags with the specified class name
  selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
  topic_title_tags = doc.find_all('p', {'class': selection_class})

  # Extract the text from each 'p' tag and store it in a list
  topic_titles = []
  for tag in topic_title_tags:
    topic_titles.append(tag.text)

  # Return the list of topic titles
  return topic_titles

The `get_topic_titles` function extracts the topic titles from the parsed HTML document by finding all `p` tags with the class name `f3 lh-condensed mb-0 mt-1 Link--primary` and extracting their text content.


In [5]:
titles = get_topic_titles(doc)

In [6]:
len(titles)

30

In [7]:
titles[:5]

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android']



 Similarly we have defined functions for descriptions and URLs.

In [8]:
def get_topic_desc(doc):
  # Find all 'p' tags with the specified class name
  desc_selector = 'f5 color-fg-muted mb-0 mt-1'
  topic_desc_tags = doc.find_all('p', {'class': desc_selector})

  # Extract the text from each 'p' tag, strip whitespace, and store it in a list
  topic_descs = []
  for tag in topic_desc_tags:
    topic_descs.append(tag.text.strip())

  # Return the list of topic descriptions
  return topic_descs

`get_topic_desc` finds all the topic descriptions on the GitHub topics page and puts them in a list.

- similarly we created a function to get the URLs of all topics from GitHub topics page.

- `get_topic_urls` function extracts the URLs of all the topics listed on the GitHub topics page.

In [9]:
def get_topic_urls(doc):
  # Find all 'a' tags with the specified class name
  topic_link_tags = doc.find_all('a', {'class': 'no-underline flex-1 d-flex flex-column'})

  # Extract the URLs from each 'a' tag and store it in a list
  topic_urls = []
  base_url = "https://github.com"
  for tag in topic_link_tags:
    topic_urls.append(base_url + tag['href'])

  # Return the list of topic URLs
  return topic_urls

Overall, this function extracts the URLs of all the topics listed on the GitHub topics page and stores them in a list.

### Let's put this all together into a single function

In [10]:
import pandas as pd

def scrape_topics():
  # Send a GET request to the topics page
  topic_urls = 'https://github.com/topics'
  response = requests.get(topic_urls)

  # Check if the request was successfu
  if response.status_code != 200:
    raise Exception('Failed to load Page {}'.format(topic_url))

  # Parse the HTML response using BeautifulSoup
  doc = BeautifulSoup(response.text, 'html.parser')

  # create the dictionary with keys title, description and url
  topics_dict = {
      'title': get_topic_titles(doc),
      'description': get_topic_desc(doc),
      'url': get_topic_urls(doc)
  }

  # return into a pandas DataFrame
  return pd.DataFrame(topics_dict)

The function `scrape_topics` scrapes the list of topics(title, description, URLs) from the GitHub topics page and returns the data as a pandas DataFrame.

## Get the Top 30 Repositories from the Topic page

we find the list of topics from the GitHub topics page now let's find the top 30 repositories from each topic.

Steps to find the top repositories from each topic
- let's get the topic page url from gitHub topic page

In [11]:
def get_topic_page(topic_url):
  # Download the page
  response = requests.get(topic_url)
  # Check successful response
  if response.status_code != 200:
    raise Exception('Failed to load page {}'.format(topic_url))
  # Parse using Beautiful soup
  topic_doc = BeautifulSoup(response.text, 'html.parser')
  return topic_doc

- The primary purpose of this function is to download the HTML content of a GitHub topic page and parse it using BeautifulSoup.
- This allows us to further extract and analyze the information available on the topic page.

In [12]:
doc = get_topic_page('https://github.com/topics/3d')

The function parse_star_count takes a string representing the number of stars as input and returns the integer equivalent.

In [20]:
def parse_star_count(stars_str):
  stars_str = stars_str.strip()
  if stars_str[-1] == 'k':
    return int(float(stars_str[:-1]) * 1000)
  return int(stars_str)

- Let's take information about username, repository name, count of stars and repos URLs from each topic from topic page.
- This function extracts information about a repository from two HTML tags i.e. `h3_tag` and `star_tag`.
- `h3_tag`: An h3 tag containing the username and repository name as hyperlinks.
- `star_tag`: A tag containing the number of stars as text.

In [18]:
def get_repo_info(h3_tag, star_tag):
  # Extract the username and repository name from the h3 tag
  a_tags = h3_tag.find_all('a')
  username = a_tags[0].text.strip()
  repo_name = a_tags[1].text.strip()

  # Construct the repository URL by adding base_url with a tags
  base_url = "https://github.com"
  repo_url = base_url + a_tags[0]['href']

  # Extract the number of stars from the star_tag
  stars = parse_star_count(star_tag.text.strip())

  # Return the extracted information
  return username, repo_name, stars, repo_url

- `get_repo_info` extracts the username, repository name, stars, and URL from the HTML tags.
- `get_topic_repos` finds all relevant tags and uses `get_repo_info` to collect data into a dictionary.
The dictionary is converted into a pandas DataFrame for easy manipulation.

In [14]:
def get_topic_repos(topic_doc):
  # Get the h3 tags containing repo title, repo URL and username
  h3_selection_class = 'f3 color-fg-muted text-normal lh-condensed'
  repo_tags = topic_doc.find_all('h3', {'class': h3_selection_class})
  # Get star tags
  star_tags = topic_doc.find_all('span', {'class': 'Counter js-social-count'})

  # create a dictionary for collect the data into dictionary
  topic_repos_dict = {
      'username': [],
      'repo_name': [],
      'stars': [],
      'repo_url': []
      }

  # Get repo info and put into the topic_repos_info dictionary
  for i in range(len(repo_tags)):
    repo_info = get_repo_info(repo_tags[i], star_tags[i])
    topic_repos_dict['username'].append(repo_info[0])
    topic_repos_dict['repo_name'].append(repo_info[1])
    topic_repos_dict['stars'].append(repo_info[2])
    topic_repos_dict["repo_url"].append(repo_info[3])

  # return into a pandas dataframe
  return pd.DataFrame(topic_repos_dict)

`scrape_topic` function scrapes data from a topic page on GitHub and saves it to a CSV file.

In [15]:
def scrape_topic(topic_url, path):
  # Check if the file already exists and skip scraping if it does.
  if os.path.exists(path):
    print("The file {} already exists. skipping...".format(path))
    return

  # Extract information about the repositories related to the topic.
  topic_df = get_topic_repos(get_topic_page(topic_url))

  # Save the extracted information to a CSV file.
  topic_df.to_csv(path, index=None)

## Putting it all together

- We have the function to get the list of topics
- We have the function to create a CSV file for scraped repos from a topics page
- Let's create a function to put them together

In [16]:
import os

def scrape_topic_repos():

  """
    This function combines the functionality of the `scrape_topics` and `scrape_topic` functions to scrape data from multiple topic pages and save it to individual CSV files.
  """
  print('Scraping list of topics')
  topics_df = scrape_topics()

  os.makedirs('data', exist_ok=True)
  for index, row in topics_df.iterrows():
    print('Scraping top repositories for "{}"'.format(row['title']))
    scrape_topic(row['url'], 'data/{}.csv'.format(row['title']))

Let's run it to scrape the top repos for all the topics on the first page of https://github.com/topics

In [21]:
scrape_topic_repos()

Scraping list of topics
Scraping top repositories for "3D"
Scraping top repositories for "Ajax"
Scraping top repositories for "Algorithm"
Scraping top repositories for "Amp"
Scraping top repositories for "Android"
Scraping top repositories for "Angular"
Scraping top repositories for "Ansible"
Scraping top repositories for "API"
Scraping top repositories for "Arduino"
Scraping top repositories for "ASP.NET"
Scraping top repositories for "Atom"
Scraping top repositories for "Awesome Lists"
Scraping top repositories for "Amazon Web Services"
Scraping top repositories for "Azure"
Scraping top repositories for "Babel"
Scraping top repositories for "Bash"
Scraping top repositories for "Bitcoin"
Scraping top repositories for "Bootstrap"
Scraping top repositories for "Bot"
Scraping top repositories for "C"
Scraping top repositories for "Chrome"
Scraping top repositories for "Chrome extension"
Scraping top repositories for "Command line interface"
Scraping top repositories for "Clojure"
Scrapin

We can check that the CSVs were created properly

In [22]:
# read and display a CSV using pandas
pd.read_csv('data/Android.csv')

Unnamed: 0,username,repo_name,stars,repo_url
0,flutter,flutter,162000,https://github.com/flutter
1,facebook,react-native,116000,https://github.com/facebook
2,justjavac,free-programming-books-zh_CN,110000,https://github.com/justjavac
3,Genymobile,scrcpy,104000,https://github.com/Genymobile
4,Hack-with-Github,Awesome-Hacking,78500,https://github.com/Hack-with-Github
5,Solido,awesome-flutter,51700,https://github.com/Solido
6,google,material-design-icons,49900,https://github.com/google
7,wasabeef,awesome-android-ui,49400,https://github.com/wasabeef
8,tldr-pages,tldr,48800,https://github.com/tldr-pages
9,square,okhttp,45400,https://github.com/square


### **Summary and** **References**

**Summary of what we did**

- Scraped the GitHub topics page to retrieve a list of topics along with their titles, descriptions, and URLs.

- For each topic, scraped the top 30 repositories, collecting information such as the repository name, username, number of stars, and repository URL.

- Stored the scraped data in CSV files, one for each topic, in a structured format for easy analysis and manipulation.


**References to links found useful**


- Python Requests Library Documentation: Official documentation for the Requests library, which was used to send HTTP requests and download web pages.

- Beautiful Soup Documentation: Official documentation for the Beautiful Soup library, which was used for parsing HTML and XML documents.
- Pandas Documentation: Official documentation for the Pandas library, which was used for data manipulation and storage.

- GitHub Topics: The page that was scraped in this project, containing a list of topics and their corresponding repositories.

