## Top Repositories on GitHub

## Problem Statement:

GitHub is a widely used platform for sharing and discovering open-source projects. The platform organizes repositories into topics, making it easier to explore related projects. However, GitHub does not provide a direct, structured dataset of these topics and their associated repositories. This project automates the extraction of that data from the GitHub Topics page, so that users can analyze and explore repositories by topic. The scraped data is saved as CSV files containing key repository details (name, author, stars, and URL) for each topic.

## Project Outline:

1. We're going to scrape https://github.com/topics
2. We'll get a list of topics. For each topic, we'll get the topic title, page URL, and description
3. For each topic, we'll get the top repositories in the topic from the topic page
4. For each repository, we'll grab the repo name, username, stars, and repo URL
5. For each topic we'll create a CSV file in the following format:

Repo name,Username,Stars,Repo URL

## Tools used:

1. Python
2. requests
3. Beautiful Soup
4. Pandas

## Scrape list of topics from GitHub

- Use requests to download the page
- Use Beautiful Soup (BS4) to parse the page and extract info
- Convert the result to a pandas DataFrame

In [59]:
import requests
import os
from bs4 import BeautifulSoup
import pandas as pd

This function downloads the main GitHub Topics page (https://github.com/topics) and parses it with Beautiful Soup, returning the parsed HTML document.

In [6]:
# download the topics page and parse it
def get_topics_page():
    topics_url = 'https://github.com/topics'
    response = requests.get(topics_url)

    # any status other than 200 means the page did not load
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topics_url))

    doc = BeautifulSoup(response.text, 'html.parser')
    return doc

In [7]:
doc = get_topics_page()

In [8]:
type(doc)

bs4.BeautifulSoup

In [9]:
doc.find('a')

<a class="px-2 py-4 color-bg-accent-emphasis color-fg-on-emphasis show-on-focus js-skip-to-content" data-skip-target-assigned="false" href="#start-of-content">Skip to content</a>

## Helper functions to parse info from the page

To get topic titles, we can pick `p` tags with the `class`...
![](IMG/title.png)

Extracts all topic titles (like "Awesome Lists") from the main topics page and returns them as a list.

In [14]:
#to get list of titles
def get_topic_titles(doc):
    selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
    topic_title_tags = doc.find_all('p', {'class': selection_class})
    topic_titles = []
    for tag in topic_title_tags:
        topic_titles.append(tag.text.strip())
    return topic_titles

In [15]:
titles = get_topic_titles(doc)

In [16]:
titles

['Awesome Lists',
 'Chrome',
 'Code quality',
 'Compiler',
 'CSS',
 'Database',
 'Front end',
 'JavaScript',
 'Node.js',
 'npm',
 'Project management',
 'Python',
 'React',
 'React Native',
 'Scala',
 'TypeScript']

To get topic descriptions, we can pick `p` tags with the `class`...
![](IMG/description.png)

Extracts all topic descriptions (short summary text under each topic) and returns them as a list.

In [17]:
#to get description
def get_topic_descs(doc):
    desc_selector = 'f5 color-fg-muted mb-0 mt-1'
    topic_desc_tags = doc.find_all('p', {'class': desc_selector})
    topic_descs = []
    for tag in topic_desc_tags:
        topic_descs.append(tag.text.strip())
    return topic_descs

In [18]:
description = get_topic_descs(doc)

In [19]:
description

['An awesome list is a list of awesome things curated by the community.',
 'Chrome is a web browser from the tech company Google.',
 'Automate your code review with style, quality, security, and test‑coverage checks when you need them.',
 'Compilers are software that translate higher-level programming languages to lower-level languages (e.g. machine code).',
 'Cascading Style Sheets (CSS) is a language used most often to style and improve upon the appearance of views.',
 'A database is a structured set of data held in a computer, usually a server.',
 'Front end is the programming and layout that people see and interact with.',
 'JavaScript (JS) is a lightweight interpreted programming language with first-class functions.',
 'Node.js is a tool for executing JavaScript in a variety of environments.',
 'npm is a package manager for JavaScript included with Node.js.',
 "Project management is about building scope and executing on the project's goals.",
 'Python is a dynamically typed progra

To get topic URLs, we can pick `a` tags with the `class`...
![](IMG/url.png)

Collects the URLs of all topics (e.g., /topics/awesome) and appends the base GitHub URL, returning a list of full topic links.

In [21]:
# get the full URL of each topic page
def get_topic_urls(doc):
    topic_urls = []
    base_url = 'https://github.com'
    topic_link_tags = doc.find_all('a', {'class': 'no-underline flex-1 d-flex flex-column'})
    for tag in topic_link_tags:
        topic_urls.append(base_url + tag['href'])
    return topic_urls

In [22]:
urls = get_topic_urls(doc)

In [23]:
urls

['https://github.com/topics/awesome',
 'https://github.com/topics/chrome',
 'https://github.com/topics/code-quality',
 'https://github.com/topics/compiler',
 'https://github.com/topics/css',
 'https://github.com/topics/database',
 'https://github.com/topics/frontend',
 'https://github.com/topics/javascript',
 'https://github.com/topics/nodejs',
 'https://github.com/topics/npm',
 'https://github.com/topics/project-management',
 'https://github.com/topics/python',
 'https://github.com/topics/react',
 'https://github.com/topics/react-native',
 'https://github.com/topics/scala',
 'https://github.com/topics/typescript']
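
String concatenation works here because the `href` values collected above all start with `/`. As a hedged alternative, the standard library's `urllib.parse.urljoin` also copes with an `href` that is already absolute. The helper name `to_absolute_url` is hypothetical and not part of the original notebook.

```python
from urllib.parse import urljoin

BASE_URL = 'https://github.com'

def to_absolute_url(href, base_url=BASE_URL):
    # Hypothetical helper: resolve an href against the site root.
    # Relative paths are joined to the base; absolute URLs pass through.
    return urljoin(base_url, href)

print(to_absolute_url('/topics/awesome'))
print(to_absolute_url('https://github.com/topics/css'))
```

Inside `get_topic_urls`, the call would replace `base_url + tag['href']`.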

This function downloads the main GitHub Topics page (https://github.com/topics) and parses it with BeautifulSoup, returning the HTML document.

In [60]:
def get_topics_page():
    topics_url = 'https://github.com/topics'
    response = requests.get(topics_url)
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topics_url))
    doc = BeautifulSoup(response.text, 'html.parser')
    return doc

This function takes the URL of a specific topic (e.g., "Awesome Lists") and returns its HTML document, which contains repository details.

In [61]:
def get_topic_page(topic_url):
    response = requests.get(topic_url)
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topic_url))
    topic_doc = BeautifulSoup(response.text, 'html.parser')
    return topic_doc

1. Combines titles, descriptions, and URLs into a single pandas DataFrame.
2. This DataFrame is later used to scrape repositories for each topic.

In [62]:
def get_topics():
    doc = get_topics_page()   
    
    titles = get_topic_titles(doc)
    descs = get_topic_descs(doc)
    urls = get_topic_urls(doc)
    
    topics_dict = {
        'title': titles,
        'description': descs,
        'url': urls
    }
    
    return pd.DataFrame(topics_dict)
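
`pd.DataFrame` raises a `ValueError` when the lists in the dictionary have different lengths, which could happen if one of the class selectors misses a topic. Below is a minimal sketch of a guard in plain Python. The function name `build_topic_rows` is an assumption for illustration, and the sample values are taken from the outputs earlier in this notebook.

```python
def build_topic_rows(titles, descs, urls):
    # Hypothetical guard: fail early with a readable message if the
    # three scraped lists do not line up one-to-one.
    if not (len(titles) == len(descs) == len(urls)):
        raise ValueError(
            'Mismatched lengths: {} titles, {} descriptions, {} urls'.format(
                len(titles), len(descs), len(urls)))
    # One dictionary per topic, with the same keys as topics_dict.
    return [
        {'title': t, 'description': d, 'url': u}
        for t, d, u in zip(titles, descs, urls)
    ]

rows = build_topic_rows(
    ['Awesome Lists'],
    ['An awesome list is a list of awesome things curated by the community.'],
    ['https://github.com/topics/awesome'])
print(rows[0]['title'])
```

A list of dictionaries like `rows` can be passed directly to `pd.DataFrame`.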

Extracts repository information from one HTML block:

1. Repository owner (username)
2. Repository name
3. Repository URL
4. Number of stars

In [63]:
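base_url = 'https://github.com'  # same site root used in get_topic_urls

def get_repo_info(h3_tag, star_tag):
    # returns all the required info about a repository
    a_tags = h3_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href']
    # parse_star_count is not defined in the cells shown here; it is
    # assumed to convert the displayed star text into an integer
    stars = parse_star_count(star_tag.text.strip())
    return username, repo_name, stars, repo_url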
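
`get_repo_info` calls `parse_star_count`, whose definition does not appear in the cells shown here. Below is a minimal sketch of what such a helper might look like. It assumes GitHub abbreviates large counts with a `k` suffix (for example `71.4k`), which is consistent with the rounded star values in the CSV output later in this notebook.

```python
def parse_star_count(stars_str):
    # Assumed behavior: '71.4k' -> 71400, '952' -> 952.
    stars_str = stars_str.strip().lower().replace(',', '')
    if stars_str.endswith('k'):
        # round() guards against floating-point error before int conversion
        return int(round(float(stars_str[:-1]) * 1000))
    return int(stars_str)

print(parse_star_count('71.4k'))
print(parse_star_count('952'))
```

If the page ever shows counts in another form (for example an `m` suffix), this sketch raises a `ValueError` rather than guessing.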

1. Takes a topic page and extracts all repositories listed under it
2. Returns a DataFrame with columns: username, repo_name, stars, and repo_url

In [64]:
def get_topic_repos(topic_doc):
    #get h3 tags containing repo title,repo url and username
    h3_selection_class = 'f3 color-fg-muted text-normal lh-condensed'
    repo_tags = topic_doc.find_all('h3',{'class': h3_selection_class})
    
    #get star tags
    star_tags = topic_doc.find_all('span', {'class': 'Counter js-social-count'})
    
    #get repo info
    topic_repos_dict = {
        'username': [],
        'repo_name': [],
        'stars': [],
        'repo_url': []
    }
    
    for i in range(len(repo_tags)):
        repo_info = get_repo_info(repo_tags[i], star_tags[i])
        topic_repos_dict['username'].append(repo_info[0])
        topic_repos_dict['repo_name'].append(repo_info[1])
        topic_repos_dict['stars'].append(repo_info[2])
        topic_repos_dict['repo_url'].append(repo_info[3])
    
    return pd.DataFrame(topic_repos_dict)

Scrapes all repositories from a single topic and saves them to a CSV file.
If the file already exists, it skips to avoid duplicate downloads.

In [68]:
def scrape_topic(topic_url, path):
    # skip topics that already have a CSV, so a rerun after an error
    # does not download them all over again
    if os.path.exists(path):
        print("The file {} already exists.Skipping...".format(path))
        return

    topic_df = get_topic_repos(get_topic_page(topic_url))
    # index=False leaves the DataFrame index out of the CSV
    topic_df.to_csv(path, index=False)
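
The skip-if-exists check can be tried without any network access. The sketch below uses the standard library's `csv` module in place of `DataFrame.to_csv`. The name `save_rows_once` is an illustrative assumption, and the sample row comes from the output shown later in this notebook.

```python
import csv
import os
import tempfile

def save_rows_once(path, header, rows):
    # Return False without touching the file if it already exists.
    if os.path.exists(path):
        return False
    with open(path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)
    return True

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'example.csv')
    header = ['username', 'repo_name', 'stars', 'repo_url']
    rows = [['facebook', 'react', 239000, 'https://github.com/facebook/react']]
    first = save_rows_once(path, header, rows)   # writes the file
    second = save_rows_once(path, header, rows)  # skipped: file exists
    print(first, second)
```

One caveat of this pattern: a file left behind by an interrupted run is also skipped, so a partially written CSV has to be deleted by hand before rerunning.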

## Putting it all together

- list of topics
- create csv file for scraped topics page
- function to put them together

This is the main function.

1. Gets all topics (title, description, url).
2. Creates a data/ folder.
3. Iterates through each topic and scrapes its repositories into a CSV file.

In [69]:
def scrape_topics_repos():
    print('Scraping list of topics...')
    topics_df = get_topics()  
    
    os.makedirs('data', exist_ok=True)
    
    for index, row in topics_df.iterrows():
        print('Scraping top repositories for "{}"...'.format(row['title']))
        scrape_topic(row['url'], 'data/{}.csv'.format(row['title']))
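
Topic titles are used directly as file names. That works for the titles shown in this notebook, but it could break for a title containing a path separator such as `/`. Below is a hedged sketch of a sanitizing helper. The name `topic_filename`, the character whitelist, and the sample title `A/B testing` are all assumptions for illustration.

```python
import re

def topic_filename(title):
    # Replace every run of characters outside the whitelist with '_',
    # then trim any '_' left at the ends.
    safe = re.sub(r'[^A-Za-z0-9._+-]+', '_', title).strip('_')
    return safe + '.csv'

for title in ['Front end', 'Node.js', 'A/B testing']:
    print(topic_filename(title))
```

With this helper, the path passed to `scrape_topic` could be built with `os.path.join('data', topic_filename(row['title']))`.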

Let's run it to scrape the top repos for all the topics on the first page of
https://github.com/topics

In [70]:
scrape_topics_repos()

Scraping list of topics...
Scraping top repositories for "Awesome Lists"...
The file data/Awesome Lists.csv already exists.Skipping...
Scraping top repositories for "Chrome"...
The file data/Chrome.csv already exists.Skipping...
Scraping top repositories for "Code quality"...
The file data/Code quality.csv already exists.Skipping...
Scraping top repositories for "Compiler"...
The file data/Compiler.csv already exists.Skipping...
Scraping top repositories for "CSS"...
The file data/CSS.csv already exists.Skipping...
Scraping top repositories for "Database"...
The file data/Database.csv already exists.Skipping...
Scraping top repositories for "Front end"...
The file data/Front end.csv already exists.Skipping...
Scraping top repositories for "JavaScript"...
The file data/JavaScript.csv already exists.Skipping...
Scraping top repositories for "Node.js"...
The file data/Node.js.csv already exists.Skipping...
Scraping top repositories for "npm"...
The file data/npm.csv already exists.Skippin

## Are the CSVs created properly?

In [72]:
# read and display a CSV using pandas
df = pd.read_csv("frontend.csv")

In [74]:
df.head(5)

Unnamed: 0,username,repo_name,stars,repo_url
0,facebook,react,239000,https://github.com/facebook/react
1,vuejs,vue,210000,https://github.com/vuejs/vue
2,vitejs,vite,75500,https://github.com/vitejs/vite
3,thedaviddias,Front-End-Checklist,71400,https://github.com/thedaviddias/Front-End-Chec...
4,ionic-team,ionic-framework,52100,https://github.com/ionic-team/ionic-framework
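
Beyond eyeballing the first rows, a small check can confirm that a CSV has the expected columns and that every star count is an integer. Below is a minimal sketch using only the standard library. The function name `check_repo_csv` is an assumption, and the sample text reuses two rows from the output above.

```python
import csv
import io

EXPECTED_COLUMNS = ['username', 'repo_name', 'stars', 'repo_url']

def check_repo_csv(file_obj):
    # Return the number of data rows, raising ValueError on a bad file.
    reader = csv.DictReader(file_obj)
    fieldnames = reader.fieldnames or []
    missing = [c for c in EXPECTED_COLUMNS if c not in fieldnames]
    if missing:
        raise ValueError('Missing columns: {}'.format(missing))
    count = 0
    for row in reader:
        int(row['stars'])  # raises ValueError if stars is not an integer
        count += 1
    return count

sample = io.StringIO(
    'username,repo_name,stars,repo_url\n'
    'facebook,react,239000,https://github.com/facebook/react\n'
    'vuejs,vue,210000,https://github.com/vuejs/vue\n')
print(check_repo_csv(sample))
```

For a file on disk, the same function can be called on the object returned by `open(path, newline='', encoding='utf-8')`.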


## Conclusion
In this project, we successfully scraped the **GitHub Topics page** and extracted key details about the top repositories for each topic. The notebook automated the following steps:

- Extracted a list of GitHub topics with titles, descriptions, and URLs.  
- Scraped repository details such as **repository name, username, stars, and URL** for each topic.  
- Saved the data into **CSV files**, one per topic, for easy analysis.  

### Key Outcomes
- Automated data collection from GitHub topics.  
- Created structured datasets for analysis and exploration.  
- Built a reproducible workflow using **requests, BeautifulSoup, and pandas**.  

### Next Steps
- Extend scraping to capture additional metadata (forks, issues, last updated).  
- Build a **visual dashboard** to compare repositories across topics.  
- Use this dataset for further **data analysis or machine learning tasks** such as identifying trending repositories.  

This marks the completion of the notebook.