In this project, I combined traditional web scraping using **BeautifulSoup** with **GitHub's REST API** to gather repository data by topic.
## 🔧 Technologies Used
- `requests` – for making HTTP requests
- `BeautifulSoup` – for parsing GitHub's HTML (to extract topic names and descriptions)
- `GitHub REST API` – to fetch top repositories per topic
- `pandas` – to organize and merge the data
- `Power BI` – for dashboard visualization


In [10]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import os
from datetime import datetime
import time


In [11]:
def parse_star_count(star_str):
    """Convert a star label such as '1,234' or '12.5k' to an integer (0 if unparseable)."""
    cleaned = star_str.strip().replace(',', '').lower()
    # A trailing 'k' means thousands, e.g. '12.5k' -> 12500
    multiplier = 1000 if cleaned.endswith('k') else 1
    try:
        # round() avoids losing a unit to floating-point truncation
        return int(round(float(cleaned.rstrip('k')) * multiplier))
    except ValueError:
        print(f"Failed to parse star count: {star_str}")
        return 0


## 🔍 Step 1: Scraping Topics using BeautifulSoup

I scraped GitHub's Topics page using BeautifulSoup to extract:
- Topic Name
- Description
- URL


In [12]:
# Manual mapping of problematic topic names
TOPIC_NAME_MAPPING = {
    "ASP.NET": "asp-net",
    "C++": "c-plus-plus",
    "C#": "c-sharp",
    ".NET": "dotnet",
    "The Julia Language": "julia",
    "Node.js": "nodejs",
    "Software-defined networking": "sdn"
}
# Function to normalize topic names
def normalize_topic_name(topic_name):
    # Use manual mapping if available
    if topic_name in TOPIC_NAME_MAPPING:
        return TOPIC_NAME_MAPPING[topic_name]
    
    # Default normalization: replace spaces with hyphens and convert to lowercase
    normalized_name = topic_name.lower().replace(' ', '-')
    return normalized_name
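
The default rule above only lowercases the name and swaps spaces for hyphens, so any name with other punctuation needs a manual mapping entry. As a hedged alternative, the sketch below collapses every run of characters outside `a-z` and `0-9` into a single hyphen. The helper name `slugify_topic` and its regular expression are assumptions for illustration, not part of the original notebook. Its output is not guaranteed to match GitHub's own topic slugs (it cannot turn `C++` into `c-plus-plus`, for example), so an override table is still required.

```python
import re

def slugify_topic(topic_name, overrides=None):
    """Hypothetical helper: turn a display name into a lowercase, hyphenated slug."""
    overrides = overrides or {}
    # Manual overrides win, mirroring TOPIC_NAME_MAPPING in the notebook
    if topic_name in overrides:
        return overrides[topic_name]
    # Replace each run of non-alphanumeric characters with one hyphen
    slug = re.sub(r'[^a-z0-9]+', '-', topic_name.lower())
    # Drop hyphens left over at either end
    return slug.strip('-')

print(slugify_topic("Amazon Web Services"))
print(slugify_topic("C++", overrides={"C++": "c-plus-plus"}))
```
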

In [13]:
# Function to scrape all GitHub topics from the website
def get_all_topics():
    base_url = "https://github.com/topics"
    topics = []
    page = 1
    
    while True:
        print(f"Fetching topics from page {page}...")
        response = requests.get(f"{base_url}?page={page}", timeout=10)
        if response.status_code != 200:
            print("No more topics found.")
            break
        
        soup = BeautifulSoup(response.text, 'html.parser')
        topic_tags = soup.find_all('a', {'class': 'no-underline flex-1 d-flex flex-column'})
        
        if not topic_tags:
            print("No more topics found.")
            break
        
        for tag in topic_tags:
            name_tag = tag.find('p', {'class': 'f3 lh-condensed mb-0 mt-1 Link--primary'})
            desc_tag = tag.find('p', {'class': 'f5 color-fg-muted mb-0 mt-1'})
            if name_tag is None:
                continue  # Skip cards whose markup does not match the expected classes
            topic_name = name_tag.text.strip()
            topic_desc = desc_tag.text.strip() if desc_tag else ''
            topic_url = "https://github.com" + tag['href']
            topics.append({
                'name': topic_name,
                'normalized_name': normalize_topic_name(topic_name),
                'description': topic_desc,
                'url': topic_url
            })
        
        page += 1
    
    return topics

## 🔌 Step 2: Fetching Top Repositories Using GitHub REST API

I fetched the top 30 repositories for each topic sorted by stars using the GitHub REST API.
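
For reference, the request behind this step can be written out as a plain URL. The sketch below only builds that URL with the standard library so the query parameters are visible; it does not call the API. The function name `build_search_url` is hypothetical, while the endpoint and parameters mirror the ones passed to `requests.get` in `get_top_repos_for_topic`.

```python
from urllib.parse import urlencode

SEARCH_ENDPOINT = "https://api.github.com/search/repositories"

def build_search_url(normalized_topic, per_page=30):
    """Hypothetical helper: build the search URL for one topic, sorted by stars."""
    params = {
        "q": f"topic:{normalized_topic}",  # restrict the search to one topic
        "per_page": per_page,              # number of repositories to return
        "sort": "stars",
        "order": "desc",
    }
    return f"{SEARCH_ENDPOINT}?{urlencode(params)}"

print(build_search_url("machine-learning"))
```
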


In [14]:
# Function to fetch top 30 repositories for a specific topic (with retry logic)
def get_top_repos_for_topic(topic, headers, max_retries=3):
    print(f"Fetching top 30 repositories for topic '{topic['name']}' (Normalized: '{topic['normalized_name']}') sorted by stars...")
    
    for attempt in range(max_retries):
        # Introduce a delay to avoid rate limiting
        time.sleep(2)  # Wait 2 seconds between requests
        
        response = requests.get(
            "https://api.github.com/search/repositories",
            params={"q": f"topic:{topic['normalized_name']}", "per_page": 30, "sort": "stars", "order": "desc"},
            headers=headers,
            timeout=10
        )
        print(f"Status Code: {response.status_code}")  # Debugging
        
        if response.status_code in (403, 429):  # GitHub signals rate limiting with either code
            print("Rate limit exceeded. Pausing for 1 hour...")
            time.sleep(3600)  # Pause for 1 hour
            continue
        
        if response.status_code != 200:
            print(f"Error fetching repositories for topic '{topic['name']}': {response.json()}")
            continue
        
        data = response.json()
        items = data.get('items', [])
        if not items:
            print(f"No repositories found for topic '{topic['name']}' (Attempt {attempt + 1}/{max_retries})")
            continue
        
        return items
    
    print(f"Skipping topic '{topic['name']}' after {max_retries} failed attempts.")
    return []
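
The fixed one-hour pause above is a conservative choice. Rate-limited GitHub responses normally carry a `retry-after` or `x-ratelimit-reset` header, so a shorter wait can often be computed from them. The sketch below shows one way to do that, assuming those headers are present. `seconds_until_retry` and its `default` fallback are hypothetical names, not part of the original notebook.

```python
import time

def seconds_until_retry(headers, now=None, default=60):
    """Hypothetical helper: estimate how long to wait before retrying."""
    now = time.time() if now is None else now
    # Header names are case-insensitive, so normalize them first
    lowered = {key.lower(): value for key, value in headers.items()}
    if 'retry-after' in lowered:
        # Value is a number of seconds to wait
        return max(0, int(lowered['retry-after']))
    if 'x-ratelimit-reset' in lowered:
        # Value is the reset time in UTC epoch seconds; add 1s of slack
        return max(0, int(lowered['x-ratelimit-reset']) - int(now)) + 1
    # Neither header present: fall back to the assumed default
    return default
```

Inside the retry loop, `time.sleep(seconds_until_retry(response.headers))` could then stand in for `time.sleep(3600)`.
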


In [15]:
# Function to extract repository details
def parse_repos(repos):
    repo_list = []
    for repo in repos:
        repo_list.append({
            'username': repo['owner']['login'],  # Repository owner's username
            'name': repo.get('name', None),      # Repository name
            'full_name': repo.get('full_name', None),  # Full name (owner/repo)
            'description': repo.get('description', None),
            'url': repo.get('html_url', None),   # Direct link to the repository
            'stars': repo.get('stargazers_count', 0),  # Number of stars
            'forks': repo.get('forks_count', 0),       # Number of forks
            'language': repo.get('language', None)     # Primary programming language
        })
    return pd.DataFrame(repo_list)


In [18]:
# Function to scrape top 30 repositories for all topics
def scrape_all_topics_repos():
    print('Fetching all topics from GitHub...')
    topics = get_all_topics()
    
    print(f"Found {len(topics)} topics.")
    
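    # Read the GitHub token from the environment instead of hard-coding it
    # (the variable name GITHUB_TOKEN is an assumption; use whatever name you export)
    headers = {
        'Authorization': f"token {os.environ['GITHUB_TOKEN']}",
        'Accept': 'application/vnd.github.v3+json'
    }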
    
    timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
    output_dir = os.path.join('data', timestamp)
    os.makedirs(output_dir, exist_ok=True)
    print(f"Created/Verified directory: {output_dir}")
    
    skipped_topics = []  # Track skipped topics
    
    for topic in topics:
        try:
            # Fetch top 30 repositories for the topic
            repos = get_top_repos_for_topic(topic, headers)
            
            if not repos:
                print(f"No repositories found for topic '{topic['name']}' (Normalized: '{topic['normalized_name']}'). Skipping...")
                skipped_topics.append({
                    'original_name': topic['name'],
                    'normalized_name': topic['normalized_name']
                })
                continue
            
            # Parse repositories into a DataFrame
            repo_df = parse_repos(repos)
            
            # Save to CSV
            filename = f"{topic['name']}.csv"
            filepath = os.path.join(output_dir, filename)
            repo_df.to_csv(filepath, index=None)
            print(f"Saved data to {filepath}")
        except Exception as e:
            print(f"Error scraping topic '{topic['name']}': {e}")
            skipped_topics.append({
                'original_name': topic['name'],
                'normalized_name': topic['normalized_name']
            })
    # Log skipped topics
    if skipped_topics:
        print("\nSkipped the following topics due to errors or no repositories found:")
        for skipped_topic in skipped_topics:
            print(f"- Original Name: {skipped_topic['original_name']}, Normalized Name: {skipped_topic['normalized_name']}")

if __name__ == "__main__":
    scrape_all_topics_repos()


Fetching all topics from GitHub...
Fetching topics from page 1...
Fetching topics from page 2...
Fetching topics from page 3...
Fetching topics from page 4...
Fetching topics from page 5...
Fetching topics from page 6...
Fetching topics from page 7...
No more topics found.
Found 164 topics.
Created/Verified directory: data\20250418_205615
Fetching top 30 repositories for topic '3D' (Normalized: '3d') sorted by stars...
Status Code: 200
Saved data to data\20250418_205615\3D.csv
Fetching top 30 repositories for topic 'Ajax' (Normalized: 'ajax') sorted by stars...
Status Code: 200
Saved data to data\20250418_205615\Ajax.csv
Fetching top 30 repositories for topic 'Algorithm' (Normalized: 'algorithm') sorted by stars...
Status Code: 200
Saved data to data\20250418_205615\Algorithm.csv
Fetching top 30 repositories for topic 'Amp' (Normalized: 'amp') sorted by stars...
Status Code: 200
Saved data to data\20250418_205615\Amp.csv
Fetching top 30 repositories for topic 'Android' (Normalized: 'an

## 📁 Step 3: Merge CSV Files Into One Master File

Once scraping was done, I merged the per-topic CSV files into a single dataset with pandas so it could be loaded into Power BI.


In [19]:
import os
import pandas as pd

# Directory containing the CSV files
input_dir = r"C:\Users\welcome\data\Final data"  # Replace with your folder path

# Output file name
output_file = os.path.join(input_dir, "merged_repositories.csv")

# List to store individual DataFrames
all_data = []

# Check if the directory exists
if not os.path.exists(input_dir):
    raise FileNotFoundError(f"The directory '{input_dir}' does not exist.")

# Loop through all files in the directory
for filename in os.listdir(input_dir):
    if filename.endswith(".csv"):  # Process only CSV files
        try:
            # Extract the topic name from the filename (remove the .csv extension)
            topic_name = os.path.splitext(filename)[0]
            
            # Construct the full file path
            file_path = os.path.join(input_dir, filename)
            
            # Debug: Print the file being processed
            print(f"Processing file: {file_path}")
            
            # Read the CSV file into a DataFrame
            df = pd.read_csv(file_path)
            
            # Add a new column for the topic name
            df["Repositery"] = topic_name  # Note: "Repositery" is intentionally misspelled
            
            # Append the DataFrame to the list
            all_data.append(df)
        except Exception as e:
            print(f"Error processing file '{filename}': {e}")

# Check if any DataFrames were added to the list
if not all_data:
    raise ValueError("No valid CSV files found to process.")

# Concatenate all DataFrames into a single DataFrame
merged_df = pd.concat(all_data, ignore_index=True)

# Reorder columns to match your desired format
merged_df = merged_df[["Repositery", "username", "name", "full_name", "description", "url", "stars", "forks", "language"]]

# Save the merged DataFrame to a new CSV file
merged_df.to_csv(output_file, index=False)

print(f"Merged data saved to {output_file}")

Processing file: C:\Users\welcome\data\Final data\.NET.csv
Processing file: C:\Users\welcome\data\Final data\3D.csv
Processing file: C:\Users\welcome\data\Final data\Ajax.csv
Processing file: C:\Users\welcome\data\Final data\Algorithm.csv
Processing file: C:\Users\welcome\data\Final data\Amazon Web Services.csv
Processing file: C:\Users\welcome\data\Final data\Amp.csv
Processing file: C:\Users\welcome\data\Final data\Android.csv
Processing file: C:\Users\welcome\data\Final data\Angular.csv
Processing file: C:\Users\welcome\data\Final data\Ansible.csv
Processing file: C:\Users\welcome\data\Final data\API.csv
Processing file: C:\Users\welcome\data\Final data\Arduino.csv
Processing file: C:\Users\welcome\data\Final data\ASP.NET.csv
Processing file: C:\Users\welcome\data\Final data\Awesome Lists.csv
Processing file: C:\Users\welcome\data\Final data\Azure.csv
Processing file: C:\Users\welcome\data\Final data\Babel.csv
Processing file: C:\Users\welcome\data\Final data\Bash.csv
Processing fil

## 📊 Step 4: Power BI Dashboard

The merged CSV file was imported into **Power BI** to build an interactive dashboard, showcasing:
- Top repositories by topic
- Language trends
- Star/Fork analysis
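
Before building the visuals, the same aggregates can be sanity-checked in Python. The sketch below is a minimal, standard-library example that counts repositories per language and sums stars per topic. The `summarize` function is hypothetical, and the sample rows are made-up placeholders that only imitate the merged file's column layout; they are not results from the scrape.

```python
import csv
import io
from collections import Counter

# Made-up sample rows with the merged file's column layout (not real results)
SAMPLE_CSV = """Repositery,username,name,full_name,description,url,stars,forks,language
Topic A,owner-a,repo-a,owner-a/repo-a,,https://example.com/a,300,30,Python
Topic A,owner-b,repo-b,owner-b/repo-b,,https://example.com/b,200,20,Go
Topic B,owner-c,repo-c,owner-c/repo-c,,https://example.com/c,100,10,Python
"""

def summarize(csv_file):
    """Hypothetical helper: count repos per language and total stars per topic."""
    language_counts = Counter()
    stars_by_topic = Counter()
    for row in csv.DictReader(csv_file):
        # An empty language cell is grouped under "Unknown"
        language_counts[row["language"] or "Unknown"] += 1
        stars_by_topic[row["Repositery"]] += int(row["stars"])
    return language_counts, stars_by_topic

languages, stars = summarize(io.StringIO(SAMPLE_CSV))
print(languages.most_common())
print(stars.most_common())
```

To run it on the real data, pass an open handle to `merged_repositories.csv` (opened with `newline=""` and an explicit encoding) instead of the in-memory sample.
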
