# Scraping Top Repositories for Topics on GitHub

## Here are the steps we'll follow:

- We're going to scrape https://github.com/topics
- We'll get a list of topics. For each topic, we'll get the topic title, topic page URL and topic description
- For each topic, we'll get the top repositories listed on the topic page (the first page returned 20 repositories in this run)
- For each repository, we'll grab the repo name, username, stars and repo URL
- For each topic we'll create a CSV file in the following format:

```
Repo Name,Username,Stars,Repo URL
three.js,mrdoob,69700,https://github.com/mrdoob/three.js
libgdx,libgdx,18300,https://github.com/libgdx/libgdx
```
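As a minimal sketch of the target format, the snippet below writes the two sample rows above with Python's standard `csv` module. The helper name `rows_to_csv` is hypothetical and is used only for illustration; the notebook itself writes files with pandas later on.

```python
import csv
import io

# Sample rows copied from the format shown above.
rows = [
    ("three.js", "mrdoob", 69700, "https://github.com/mrdoob/three.js"),
    ("libgdx", "libgdx", 18300, "https://github.com/libgdx/libgdx"),
]

def rows_to_csv(rows):
    """Return the rows as CSV text with the header used in this project."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Repo Name", "Username", "Stars", "Repo URL"])
    writer.writerows(rows)
    return buffer.getvalue()

print(rows_to_csv(rows))
```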

In [215]:
import requests


In [216]:
topics_url = "https://github.com/topics"

In [217]:
response = requests.get(topics_url)
response

<Response [200]>

In [218]:
response.status_code

200
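A status code of 200 means the page was fetched successfully. Since a request can occasionally fail, one option is to wrap the check in a small retry loop. The sketch below is an assumption, not part of the original notebook: `fetch_with_retry` and its `fetch` parameter are hypothetical names, and `fetch` is expected to return a `(status_code, text)` pair so the logic can be tried without network access.

```python
import time

def fetch_with_retry(fetch, url, retries=3, backoff=1.0, sleep=time.sleep):
    """Call fetch(url) until it reports status 200 or the retries run out.

    `fetch` is a hypothetical callable returning (status_code, text).
    """
    if retries < 1:
        raise ValueError("retries must be at least 1")
    for attempt in range(retries):
        status, text = fetch(url)
        if status == 200:
            return text
        if attempt < retries - 1:
            # Wait a little longer after each failed attempt.
            sleep(backoff * 2 ** attempt)
    raise RuntimeError(f"Failed to fetch {url}: last status {status}")

# Possible wiring with requests (not executed here):
# def fetch(url):
#     r = requests.get(url)
#     return r.status_code, r.text
```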

In [219]:
len(response.text)  # The response text is very long

216537

In [220]:
response.text[:100]

'\n\n<!DOCTYPE html>\n<html\n  lang="en"\n  \n  data-color-mode="auto" data-light-theme="light" data-dark-t'

In [221]:
from bs4 import BeautifulSoup
doc = BeautifulSoup(response.text, 'html.parser')
doc


<!DOCTYPE html>

<html data-a11y-animated-images="system" data-a11y-link-underlines="true" data-color-mode="auto" data-dark-theme="dark" data-light-theme="light" lang="en">
<head>
<meta charset="utf-8"/>
<link href="https://github.githubassets.com" rel="dns-prefetch"/>
<link href="https://avatars.githubusercontent.com" rel="dns-prefetch"/>
<link href="https://github-cloud.s3.amazonaws.com" rel="dns-prefetch"/>
<link href="https://user-images.githubusercontent.com/" rel="dns-prefetch"/>
<link crossorigin="" href="https://github.githubassets.com" rel="preconnect"/>
<link href="https://avatars.githubusercontent.com" rel="preconnect"/>
<link crossorigin="anonymous" href="https://github.githubassets.com/assets/light-c59dc71e3a4c.css" media="all" rel="stylesheet"><link crossorigin="anonymous" href="https://github.githubassets.com/assets/light_high_contrast-4bf0cb726930.css" media="all" rel="stylesheet"><link crossorigin="anonymous" href="https://github.githubassets.com/assets/dark-89751e879

In [222]:
p_tags = doc.find_all('p' , {'class': 'f3 lh-condensed mb-0 mt-1 Link--primary'})

In [223]:
len(p_tags)

30

In [224]:
p_tags[:5] # List of first 5 <p> tags

[<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Algorithm</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amp</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Android</p>]

In [225]:
topic_title_tags = doc.find_all('p', {'class': 'f3 lh-condensed mb-0 mt-1 Link--primary'})
topic_title_tags

[<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Algorithm</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amp</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Android</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Angular</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ansible</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">API</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Arduino</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">ASP.NET</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Awesome Lists</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amazon Web Services</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Azure</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Babel</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Bash</p>,
 <p class="f3 lh-condensed m

In [226]:
topic_description_tags = doc.find_all( 'p' , {'class': 'f5 color-fg-muted mb-0 mt-1'})
topic_description_tags

[<p class="f5 color-fg-muted mb-0 mt-1">
           3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Ajax is a technique for creating interactive web applications.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Algorithms are self-contained sequences that carry out a variety of tasks.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Amp is a non-blocking concurrency library for PHP.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Android is an operating system built by Google designed for mobile devices.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Angular is an open source web application platform.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Ansible is a simple and powerful automation engine.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           An API (App

In [227]:
topic_title_tag0 = topic_title_tags[0]
topic_title_tag0.parent

<a class="no-underline flex-1 d-flex flex-column" href="/topics/3d">
<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
<p class="f5 color-fg-muted mb-0 mt-1">
          3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.
        </p>
</a>
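The parent `<a>` tag shows that each topic card holds the title, the description and the link together. Reading all three from the same card keeps them aligned even if one card were missing a field. The sketch below illustrates that idea with the standard library's `html.parser` instead of BeautifulSoup, purely so it runs without extra dependencies; the class name `TopicCardParser` is hypothetical, and it assumes each card is an `<a>` whose `href` starts with `/topics/` and contains exactly two `<p>` tags, as in the output above.

```python
from html.parser import HTMLParser

class TopicCardParser(HTMLParser):
    """Collect (title, description, href) tuples, one per topic card."""

    def __init__(self):
        super().__init__()
        self.cards = []
        self._card = None   # card currently being read, if any
        self._in_p = False  # True while inside a <p> within a card

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        href = attrs.get("href") or ""
        if tag == "a" and href.startswith("/topics/"):
            self._card = {"href": href, "texts": []}
        elif tag == "p" and self._card is not None:
            self._in_p = True
            self._card["texts"].append("")

    def handle_data(self, data):
        if self._in_p:
            self._card["texts"][-1] += data

    def handle_endtag(self, tag):
        if tag == "p":
            self._in_p = False
        elif tag == "a" and self._card is not None:
            texts = [t.strip() for t in self._card["texts"]]
            if len(texts) == 2:  # title and description, per the assumption above
                self.cards.append((texts[0], texts[1], self._card["href"]))
            self._card = None

sample_html = """
<a class="no-underline flex-1 d-flex flex-column" href="/topics/3d">
<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
<p class="f5 color-fg-muted mb-0 mt-1">
  3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.
</p>
</a>
"""
parser = TopicCardParser()
parser.feed(sample_html)
print(parser.cards)
```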

In [228]:
topic_link_tags = doc.find_all('a', {'class': 'no-underline flex-1 d-flex flex-column'})
len(topic_link_tags)

30

In [229]:
topic0_url = "https://github.com" + topic_link_tags[0]['href']
topic0_url

'https://github.com/topics/3d'
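String concatenation works here because every `href` is a site-relative path. As a hedged alternative, `urllib.parse.urljoin` from the standard library also handles the case where an `href` is already absolute. The helper name `absolute_url` is hypothetical.

```python
from urllib.parse import urljoin

BASE_URL = "https://github.com"

def absolute_url(href):
    """Resolve a relative or absolute href against the GitHub base URL."""
    return urljoin(BASE_URL, href)

print(absolute_url("/topics/3d"))
```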

In [230]:
topic_titles = []

for tag in topic_title_tags:
    topic_titles.append(tag.text)
print(topic_titles)


['3D', 'Ajax', 'Algorithm', 'Amp', 'Android', 'Angular', 'Ansible', 'API', 'Arduino', 'ASP.NET', 'Awesome Lists', 'Amazon Web Services', 'Azure', 'Babel', 'Bash', 'Bitcoin', 'Bootstrap', 'Bot', 'C', 'Chrome', 'Chrome extension', 'Command-line interface', 'Clojure', 'Code quality', 'Code review', 'Compiler', 'Continuous integration', 'C++', 'Cryptocurrency', 'Crystal']


In [231]:
topic_description = []
for tag in topic_description_tags:
    topic_description.append(tag.text.strip())
print(topic_description)

['3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.', 'Ajax is a technique for creating interactive web applications.', 'Algorithms are self-contained sequences that carry out a variety of tasks.', 'Amp is a non-blocking concurrency library for PHP.', 'Android is an operating system built by Google designed for mobile devices.', 'Angular is an open source web application platform.', 'Ansible is a simple and powerful automation engine.', 'An API (Application Programming Interface) is a collection of protocols and subroutines for building software.', 'Arduino is an open source platform for building electronic devices.', 'ASP.NET is a web framework for building modern web apps and services.', 'An awesome list is a list of awesome things curated by the community.', 'Amazon Web Services provides on-demand cloud computing platforms on a subscription basis.', 'Azure is a cloud computing service created by Microsoft.', 'Babel is a compiler for w

In [232]:
topic_urls = []
for tag in topic_link_tags:
    topic_urls.append("https://github.com" + tag['href'])
print(topic_urls)

['https://github.com/topics/3d', 'https://github.com/topics/ajax', 'https://github.com/topics/algorithm', 'https://github.com/topics/amphp', 'https://github.com/topics/android', 'https://github.com/topics/angular', 'https://github.com/topics/ansible', 'https://github.com/topics/api', 'https://github.com/topics/arduino', 'https://github.com/topics/aspnet', 'https://github.com/topics/awesome', 'https://github.com/topics/aws', 'https://github.com/topics/azure', 'https://github.com/topics/babel', 'https://github.com/topics/bash', 'https://github.com/topics/bitcoin', 'https://github.com/topics/bootstrap', 'https://github.com/topics/bot', 'https://github.com/topics/c', 'https://github.com/topics/chrome', 'https://github.com/topics/chrome-extension', 'https://github.com/topics/cli', 'https://github.com/topics/clojure', 'https://github.com/topics/code-quality', 'https://github.com/topics/code-review', 'https://github.com/topics/compiler', 'https://github.com/topics/continuous-integration', 'ht

In [233]:
import pandas as pd

# To create a DataFrame, we first collect the lists into a dictionary

In [234]:
topics_dict = {
    'title': topic_titles,
    'description': topic_description,
    'url': topic_urls
}
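`pd.DataFrame` raises a `ValueError` when the lists in the dictionary have different lengths, which can happen if one of the three `find_all` calls matches more or fewer tags than the others. A small pre-check, sketched below with the hypothetical helper `check_equal_lengths`, can make such a mismatch easier to diagnose.

```python
def check_equal_lengths(columns):
    """Raise ValueError if the lists in `columns` differ in length.

    Returns the common length (0 for an empty dictionary).
    """
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Column lengths differ: {lengths}")
    return next(iter(lengths.values()), 0)

# Example with two topics taken from the lists above.
example = {
    "title": ["3D", "Ajax"],
    "url": ["https://github.com/topics/3d", "https://github.com/topics/ajax"],
}
print(check_equal_lengths(example))
```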

In [235]:
topics_df = pd.DataFrame(topics_dict)
topics_df

Unnamed: 0,title,description,url
0,3D,3D refers to the use of three-dimensional grap...,https://github.com/topics/3d
1,Ajax,Ajax is a technique for creating interactive w...,https://github.com/topics/ajax
2,Algorithm,Algorithms are self-contained sequences that c...,https://github.com/topics/algorithm
3,Amp,Amp is a non-blocking concurrency library for ...,https://github.com/topics/amphp
4,Android,Android is an operating system built by Google...,https://github.com/topics/android
5,Angular,Angular is an open source web application plat...,https://github.com/topics/angular
6,Ansible,Ansible is a simple and powerful automation en...,https://github.com/topics/ansible
7,API,An API (Application Programming Interface) is ...,https://github.com/topics/api
8,Arduino,Arduino is an open source platform for buildin...,https://github.com/topics/arduino
9,ASP.NET,ASP.NET is a web framework for building modern...,https://github.com/topics/aspnet


In [236]:
topics_df.to_csv('github_topics.csv', index=False)

In [237]:
topic_page_url = topic_urls[0]
topic_page_url

'https://github.com/topics/3d'

In [238]:
response = requests.get(topic_page_url)
response.status_code

200

In [239]:
len(response.text)

522651

In [240]:
topic_doc = BeautifulSoup(response.text, 'html.parser')
topic_doc


<!DOCTYPE html>

<html data-a11y-animated-images="system" data-a11y-link-underlines="true" data-color-mode="auto" data-dark-theme="dark" data-light-theme="light" lang="en">
<head>
<meta charset="utf-8"/>
<link href="https://github.githubassets.com" rel="dns-prefetch"/>
<link href="https://avatars.githubusercontent.com" rel="dns-prefetch"/>
<link href="https://github-cloud.s3.amazonaws.com" rel="dns-prefetch"/>
<link href="https://user-images.githubusercontent.com/" rel="dns-prefetch"/>
<link crossorigin="" href="https://github.githubassets.com" rel="preconnect"/>
<link href="https://avatars.githubusercontent.com" rel="preconnect"/>
<link crossorigin="anonymous" href="https://github.githubassets.com/assets/light-c59dc71e3a4c.css" media="all" rel="stylesheet"><link crossorigin="anonymous" href="https://github.githubassets.com/assets/light_high_contrast-4bf0cb726930.css" media="all" rel="stylesheet"><link crossorigin="anonymous" href="https://github.githubassets.com/assets/dark-89751e879

In [241]:
repo_tags = topic_doc.find_all('h3', {'class': "f3 color-fg-muted text-normal lh-condensed"})
len(repo_tags)

20

In [242]:
a_tags = repo_tags[0].find_all('a')
a_tags[0].text

'mrdoob'

In [243]:
base_url = "https://github.com"
# a_tags[0] links to the user; a_tags[1] links to the repository
repo_url = base_url + a_tags[1]['href']
repo_url

'https://github.com/mrdoob/three.js'

In [244]:
star_tags = topic_doc.find_all('span', {'class': "Counter js-social-count"})
len(star_tags)

20

In [245]:
star_tags[0].text.strip()

'107k'

In [246]:
def parse_star_count(stars_str):
    stars_str = stars_str.strip()
    if stars_str[-1] == 'k':
        return int(float(stars_str[:-1]) * 1000)
    return int(stars_str)

In [247]:
parse_star_count(star_tags[0].text.strip())


107000
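`parse_star_count` covers plain numbers and the `k` suffix seen on this page. The variant below is a sketch under assumptions: it also accepts an `m` suffix and thousands separators, neither of which appears in the output above, and it rounds instead of truncating so that floating-point results slightly below a whole number do not lose a unit. The name `parse_count` is hypothetical.

```python
def parse_count(text):
    """Convert strings such as '107k', '29.1k' or '1,234' to an integer."""
    text = text.strip().lower().replace(",", "")
    multipliers = {"k": 1_000, "m": 1_000_000}  # 'm' is an assumed suffix
    if text and text[-1] in multipliers:
        return round(float(text[:-1]) * multipliers[text[-1]])
    return int(text)

print(parse_count("107k"))
```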

In [248]:
def get_repo_info(h3_tag, star_tag):
    """Return (username, repo_name, stars, repo_url) for one repository."""
    a_tags = h3_tag.find_all('a')
    username = a_tags[0].text.strip()   # first link: the owner
    repo_name = a_tags[1].text.strip()  # second link: the repository
    repo_url = base_url + a_tags[1]['href']
    stars = parse_star_count(star_tag.text.strip())
    return username, repo_name, stars, repo_url


In [249]:
get_repo_info(repo_tags[0], star_tags[0])

('mrdoob', 'three.js', 107000, 'https://github.com/mrdoob/three.js')

In [250]:
topic_repos_dict = {
    'username': [],
    'repo_name': [],
    'repo_url': [],
    'stars': []
}

for i in range(len(repo_tags)):
    repo_info = get_repo_info(repo_tags[i], star_tags[i])
    topic_repos_dict['username'].append(repo_info[0]) 
    topic_repos_dict['repo_name'].append(repo_info[1])
    topic_repos_dict['stars'].append(repo_info[2])   
    topic_repos_dict['repo_url'].append(repo_info[3])
    

In [251]:
import os

In [252]:

def get_topic_page(topic_url):
    headers = {'User-Agent': 'Mozilla/5.0'}
    response = requests.get(topic_url, headers=headers)

    if response.status_code != 200:
        raise Exception(f"Failed to fetch topic page: {response.status_code}")
    
    return BeautifulSoup(response.text, 'html.parser')

def get_topic_repos(topic_doc):
    repo_tags = topic_doc.find_all('h3', {'class': "f3 color-fg-muted text-normal lh-condensed"})
    star_tags = topic_doc.find_all('span', {'class': "Counter js-social-count"})

    topic_repos_dict = {
        'username': [],
        'repo_name': [],
        'stars': [],
        'repo_url': []
    }

    for i in range(len(repo_tags)):
        repo_info = get_repo_info(repo_tags[i], star_tags[i])
        topic_repos_dict['username'].append(repo_info[0]) 
        topic_repos_dict['repo_name'].append(repo_info[1])
        topic_repos_dict['stars'].append(repo_info[2])   
        topic_repos_dict['repo_url'].append(repo_info[3])

    return pd.DataFrame(topic_repos_dict)

def scrape_topic(topic_url, topic_name):
    fname = topic_name + "_repos.csv"
    if os.path.exists(fname):
        print(f"✅ File {fname} already exists. Skipping download.")
        return 

    topic_doc = get_topic_page(topic_url)
    topic_df = get_topic_repos(topic_doc)
    topic_df.to_csv(fname, index=False)
    print(f"✅ Saved: {fname}")
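`scrape_topic` builds the file name from the topic title, so titles such as "Command-line interface" produce file names with spaces, and a title containing a path separator would not be a valid file name. One possible alternative, sketched here as an assumption rather than the notebook's approach, is to derive the name from the last segment of the topic URL. The helper name `topic_csv_name` is hypothetical.

```python
from urllib.parse import urlparse

def topic_csv_name(topic_url):
    """Build a CSV file name from the last path segment of a topic URL."""
    slug = urlparse(topic_url).path.rstrip("/").rsplit("/", 1)[-1]
    return f"{slug}_repos.csv"

print(topic_csv_name("https://github.com/topics/chrome-extension"))
```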

In [253]:
topic_repos_df = pd.DataFrame(topic_repos_dict)
topic_repos_df

Unnamed: 0,username,repo_name,repo_url,stars
0,mrdoob,three.js,https://github.com/mrdoob/three.js,107000
1,pmndrs,react-three-fiber,https://github.com/pmndrs/react-three-fiber,29100
2,FreeCAD,FreeCAD,https://github.com/FreeCAD/FreeCAD,25000
3,BabylonJS,Babylon.js,https://github.com/BabylonJS/Babylon.js,24200
4,libgdx,libgdx,https://github.com/libgdx/libgdx,24100
5,ssloy,tinyrenderer,https://github.com/ssloy/tinyrenderer,22100
6,lettier,3d-game-shaders-for-beginners,https://github.com/lettier/3d-game-shaders-for...,18800
7,aframevr,aframe,https://github.com/aframevr/aframe,17100
8,blender,blender,https://github.com/blender/blender,15400
9,4ian,GDevelop,https://github.com/4ian/GDevelop,14700


In [254]:
url5 = topic_urls[5]
url5

'https://github.com/topics/angular'

In [255]:
get_topic_repos(get_topic_page(url5)).to_csv('angular.csv', index=False)


Next, write a single function that:
1. Gets the list of topics from the topics page
2. Gets the list of top repositories from each topic page
3. Creates a CSV file of the top repositories for each topic


In [256]:
def get_topic_titles(doc):
    topic_title_tags = doc.find_all('p', {'class': 'f3 lh-condensed mb-0 mt-1 Link--primary'})
    topic_titles = [tag.text.strip() for tag in topic_title_tags]
    return topic_titles

def get_topic_descriptions(doc):
    topic_description_tags = doc.find_all('p', {'class': 'f5 color-fg-muted mb-0 mt-1'})
    topic_descriptions = [tag.text.strip() for tag in topic_description_tags]
    return topic_descriptions

def get_topic_urls(doc):
    topic_link_tags = doc.find_all('a', {'class': 'no-underline flex-1 d-flex flex-column'})
    topics_urls = ["https://github.com" + tag['href'] for tag in topic_link_tags]
    return topics_urls

def scrape_topics():
    topics_url = "https://github.com/topics"
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}
    response = requests.get(topics_url, headers=headers)

    if response.status_code != 200:
        raise Exception(f"Failed to fetch topics page: {response.status_code}")

    doc = BeautifulSoup(response.text, 'html.parser')

    topics_dict = {
        'title': get_topic_titles(doc),
        'description': get_topic_descriptions(doc),
        'url': get_topic_urls(doc)
    }

    return pd.DataFrame(topics_dict)


In [257]:
scrape_topics()

Unnamed: 0,title,description,url
0,3D,3D refers to the use of three-dimensional grap...,https://github.com/topics/3d
1,Ajax,Ajax is a technique for creating interactive w...,https://github.com/topics/ajax
2,Algorithm,Algorithms are self-contained sequences that c...,https://github.com/topics/algorithm
3,Amp,Amp is a non-blocking concurrency library for ...,https://github.com/topics/amphp
4,Android,Android is an operating system built by Google...,https://github.com/topics/android
5,Angular,Angular is an open source web application plat...,https://github.com/topics/angular
6,Ansible,Ansible is a simple and powerful automation en...,https://github.com/topics/ansible
7,API,An API (Application Programming Interface) is ...,https://github.com/topics/api
8,Arduino,Arduino is an open source platform for buildin...,https://github.com/topics/arduino
9,ASP.NET,ASP.NET is a web framework for building modern...,https://github.com/topics/aspnet


In [258]:
def scrape_topics_repos():
    print('Scraping list of GitHub topics')
    topics_df = scrape_topics()
    for index, row in topics_df.iterrows():
        print('Scraping top repositories for "{}"'.format(row['title']))
        scrape_topic(row['url'], row['title'])
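The loop above stops at the first topic whose page fails to load, and it sends requests back to back. The sketch below shows one way to pause between topics and to collect failures instead of aborting. It is an illustration under assumptions: `scrape_all` is a hypothetical name, `scrape_one` stands in for a function with the same `(url, title)` arguments as `scrape_topic`, and the one-second default delay is an arbitrary choice.

```python
import time

def scrape_all(topics, scrape_one, delay=1.0, sleep=time.sleep):
    """Run scrape_one(url, title) for each (title, url) pair.

    Returns a list of (title, error message) for topics that failed.
    """
    failed = []
    for title, url in topics:
        try:
            scrape_one(url, title)
        except Exception as exc:
            # Record the failure and continue with the next topic.
            failed.append((title, str(exc)))
        sleep(delay)  # pause between requests
    return failed

# Possible usage with the functions defined in this notebook (not executed here):
# topics = zip(topics_df['title'], topics_df['url'])
# failures = scrape_all(topics, scrape_topic)
```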

In [259]:
scrape_topics_repos()

Scraping list of GitHub topics
Scraping top repositories for "3D"
✅ File 3D_repos.csv already exists. Skipping download.
Scraping top repositories for "Ajax"
✅ File Ajax_repos.csv already exists. Skipping download.
Scraping top repositories for "Algorithm"
✅ File Algorithm_repos.csv already exists. Skipping download.
Scraping top repositories for "Amp"
✅ File Amp_repos.csv already exists. Skipping download.
Scraping top repositories for "Android"
✅ File Android_repos.csv already exists. Skipping download.
Scraping top repositories for "Angular"
✅ File Angular_repos.csv already exists. Skipping download.
Scraping top repositories for "Ansible"
✅ File Ansible_repos.csv already exists. Skipping download.
Scraping top repositories for "API"
✅ File API_repos.csv already exists. Skipping download.
Scraping top repositories for "Arduino"
✅ File Arduino_repos.csv already exists. Skipping download.
Scraping top repositories for "ASP.NET"
✅ File ASP.NET_repos.csv already exists. Skipping downloa