# Top Repositories in GitHub

This notebook scrapes the top repositories for each topic listed on GitHub. The steps are:

- Scrape the topic listing at https://github.com/topics
- Get the list of topics. For each topic, collect the topic title, topic page URL and topic description
- For each topic, get the top 25 repositories from the topic page
- For each repository, grab the repo name, username, stars and repo URL
- For each topic, create a CSV file in the following format:

```
username,repo_name,stars,repo_url
mrdoob,three.js,69700,https://github.com/mrdoob/three.js
libgdx,libgdx,18300,https://github.com/libgdx/libgdx
```

In [2]:
pip install requests beautifulsoup4 pandas

Note: you may need to restart the kernel to use updated packages.


In [246]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

In [282]:
# Preview the URLs of the six paginated topic listing pages
for page in range(1, 7):
    pages_url = f"https://github.com/topics?page={page}"
    print(pages_url)

https://github.com/topics?page=1
https://github.com/topics?page=2
https://github.com/topics?page=3
https://github.com/topics?page=4
https://github.com/topics?page=5
https://github.com/topics?page=6


In [None]:
get_topic_titles(doc)
get_topic_descs(doc)

In [260]:
def get_topics_page():
    # Download and parse the first page of the GitHub topics listing
    topics_url = 'https://github.com/topics'
    response = requests.get(topics_url)
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topics_url))
    doc = BeautifulSoup(response.text, 'html.parser')
    return doc

In [261]:
doc = get_topics_page()

In [262]:
def get_topic_titles(doc):
    selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
    topic_title_tags = doc.find_all('p', {'class': selection_class})
    topic_titles = []
    for tag in topic_title_tags:
        topic_titles.append(tag.text)
    return topic_titles

In [263]:
titles = get_topic_titles(doc)

In [264]:
len(titles)

30

In [265]:
def get_topic_descs(doc):
    topic_desc_tags = doc.find_all('p', {'class': 'f5 color-fg-muted mb-0 mt-1'})
    topic_descs = []
    for tag in topic_desc_tags:
        topic_descs.append(tag.text.strip())
    return topic_descs


In [266]:
descs=get_topic_descs(doc)

In [267]:
len(descs)

30

In [268]:
descs[:5]

['3D modeling is the process of virtually developing the surface and structure of a 3D object.',
 'Ajax is a technique for creating interactive web applications.',
 'Algorithms are self-contained sequences that carry out a variety of tasks.',
 'Amp is a non-blocking concurrency framework for PHP.',
 'Android is an operating system built by Google designed for mobile devices.']

In [269]:
def get_topic_urls(doc):
    topic_link_tags = doc.find_all('a', {'class': 'd-flex no-underline'})
    topic_urls = []
    base_url = 'https://github.com'
    for tag in topic_link_tags:
        topic_urls.append(base_url + tag['href'])
    return topic_urls


In [270]:
urls=get_topic_urls(doc)

In [271]:
len(urls)

30

In [272]:
urls[:5]

['https://github.com/topics/3d',
 'https://github.com/topics/ajax',
 'https://github.com/topics/algorithm',
 'https://github.com/topics/amphp',
 'https://github.com/topics/android']
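
The URLs above are built by prepending `base_url` to each `href`. As a minimal sketch of an alternative, assuming some `href` values might already be absolute, `urllib.parse.urljoin` from the standard library covers both cases. The helper name `to_absolute_url` is hypothetical and not part of this notebook.

```python
from urllib.parse import urljoin

BASE_URL = 'https://github.com'

def to_absolute_url(href, base_url=BASE_URL):
    # Hypothetical helper: resolve an href against the site root.
    # Relative paths are joined to base_url; absolute URLs pass through unchanged.
    return urljoin(base_url, href)

print(to_absolute_url('/topics/3d'))
print(to_absolute_url('https://github.com/topics/ajax'))
```

Inside `get_topic_urls`, the call `to_absolute_url(tag['href'])` could stand in for `base_url + tag['href']`.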

In [294]:
def scrape_topics():
    # Scrape title, description and URL for every topic on listing pages 1 to 6
    df_final = pd.DataFrame()
    for page in range(1, 7):
        pages_url = f"https://github.com/topics?page={page}"
        response = requests.get(pages_url)
        if response.status_code != 200:
            raise Exception('Failed to load page {}'.format(pages_url))
        doc = BeautifulSoup(response.text, 'html.parser')
        topics_dict = {
            'title': get_topic_titles(doc),
            'description': get_topic_descs(doc),
            'url': get_topic_urls(doc)
        }
        df1 = pd.DataFrame(topics_dict)
        # Each page keeps its own 0-based index, so index values repeat across pages
        df_final = pd.concat([df_final, df1])
    return df_final

In [296]:
t_df=scrape_topics()
t_df.tail()

| | title | description | url |
|---|---|---|---|
| 25 | Windows | Windows is Microsoft's GUI-based operating sys... | https://github.com/topics/windows |
| 26 | WordPlate | WordPlate is a modern WordPress stack which si... | https://github.com/topics/wordplate |
| 27 | WordPress | WordPress is a popular content management syst... | https://github.com/topics/wordpress |
| 28 | Xamarin | Xamarin is a platform for developing iOS and A... | https://github.com/topics/xamarin |
| 29 | XML | XML is subset of SGML (Standard Generalized Ma... | https://github.com/topics/xml |


In [297]:
def parse_star_count(stars_str):
    stars_str = stars_str.strip()
    if stars_str[-1] == 'k':
        return int(float(stars_str[:-1]) * 1000)
    return int(stars_str)
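
`parse_star_count` converts strings such as `69.7k` by going through `float`, which is usually fine but relies on binary floating-point rounding. Below is a more defensive sketch, assuming star counts might also appear with thousands separators such as `1,234` or an uppercase `K` (both are assumptions, not something seen in the output of this notebook). It uses `decimal.Decimal` so the multiplication by 1000 is exact. The name `parse_star_count_safe` is hypothetical.

```python
from decimal import Decimal

def parse_star_count_safe(stars_str):
    # Hypothetical variant of parse_star_count.
    # Normalise whitespace, case and thousands separators first.
    cleaned = stars_str.strip().lower().replace(',', '')
    if cleaned.endswith('k'):
        # Decimal keeps '69.7' exact, so 69.7 * 1000 is exactly 69700
        return int(Decimal(cleaned[:-1]) * 1000)
    return int(cleaned)

print(parse_star_count_safe('69.7k'))
print(parse_star_count_safe(' 1,234 '))
```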

In [298]:
# Get the topic page from the topic URL
def get_topic_page(topic_url):
    # Download the page
    response = requests.get(topic_url)
    # Check for a successful response
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topic_url))
    # Parse using Beautiful Soup
    topic_doc = BeautifulSoup(response.text, 'html.parser')
    return topic_doc

In [299]:
# Get the information for each repository

def get_repo_info(h3_tag, star_tag):
    # Returns (username, repo_name, stars, repo_url) for one repository.
    # The h3 tag holds two links: the first to the user, the second to the repo.
    base_url = 'https://github.com'
    a_tags = h3_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href']
    stars = parse_star_count(star_tag.text.strip())
    return username, repo_name, stars, repo_url
    

In [300]:
def get_topic_repos(topic_doc):
    # Get the h3 tags containing repo title, repo URL and username
    repo_tags = topic_doc.find_all('h3', {'class': "f3 color-fg-muted text-normal lh-condensed"})
    # Get star tags
    star_tags = topic_doc.find_all('a', {'class': "social-count js-social-count"})

    topic_repos_dict = {'username': [], 'repo_name': [], 'stars': [], 'repo_url': []}

    # Get repo info; zip pairs each repo with its star tag and stops at the shorter list
    for repo_tag, star_tag in zip(repo_tags, star_tags):
        repo_info = get_repo_info(repo_tag, star_tag)
        topic_repos_dict['username'].append(repo_info[0])
        topic_repos_dict['repo_name'].append(repo_info[1])
        topic_repos_dict['stars'].append(repo_info[2])
        topic_repos_dict['repo_url'].append(repo_info[3])

    return pd.DataFrame(topic_repos_dict)

In [301]:
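import os

def scrape_topic(top_url, path):
    # Skip topics that were already scraped in an earlier run
    if os.path.exists(path):
        print("The file {} already exists. Skipping...".format(path))
        return
    topic_df = get_topic_repos(get_topic_page(top_url))
    # index=False leaves the DataFrame index out of the CSV
    topic_df.to_csv(path, index=False)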

In [302]:
def scrape_topics_repos():
    print('Scraping list of topics')
    topics_df = scrape_topics()
    
    os.makedirs('data', exist_ok=True)
    for index, row in topics_df.iterrows():
        print('Scraping top repositories for "{}"'.format(row['title']))
        scrape_topic(row['url'], 'data/{}.csv'.format(row['title']))
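
The file name above is taken directly from the topic title. That works for the titles shown in this notebook, but a title containing a path separator or another character the file system rejects would break the path. Here is a minimal sketch of a sanitising step, assuming such titles could occur. The helper `topic_csv_path` and its replacement rule are hypothetical.

```python
import os
import re

def topic_csv_path(title, directory='data'):
    # Hypothetical helper: keep word characters, dots, hyphens, plus signs and
    # spaces; replace everything else (for example '/') with an underscore.
    safe_title = re.sub(r'[^\w.\-+ ]', '_', title).strip()
    return os.path.join(directory, safe_title + '.csv')

print(topic_csv_path('ASP.NET'))
print(topic_csv_path('Virtual reality'))
print(topic_csv_path('CI/CD'))
```

In `scrape_topics_repos`, `topic_csv_path(row['title'])` could stand in for `'data/{}.csv'.format(row['title'])`.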

In [316]:
scrape_topics_repos()

Scraping list of topics
Scraping top repositories for "3D"
The file data/3D.csv already exists. Skipping...
Scraping top repositories for "Ajax"
The file data/Ajax.csv already exists. Skipping...
Scraping top repositories for "Algorithm"
The file data/Algorithm.csv already exists. Skipping...
Scraping top repositories for "Amp"
The file data/Amp.csv already exists. Skipping...
Scraping top repositories for "Android"
The file data/Android.csv already exists. Skipping...
Scraping top repositories for "Angular"
The file data/Angular.csv already exists. Skipping...
Scraping top repositories for "Ansible"
The file data/Ansible.csv already exists. Skipping...
Scraping top repositories for "API"
The file data/API.csv already exists. Skipping...
Scraping top repositories for "Arduino"
The file data/Arduino.csv already exists. Skipping...
Scraping top repositories for "ASP.NET"
The file data/ASP.NET.csv already exists. Skipping...
Scraping top repositories for "Atom"
The file data/Atom.csv alre

Scraping top repositories for "Vim"
Scraping top repositories for "Virtual reality"
Scraping top repositories for "Vue.js"
Scraping top repositories for "Wagtail"
Scraping top repositories for "Web Components"
Scraping top repositories for "Web app"
Scraping top repositories for "Webpack"
Scraping top repositories for "Windows"
Scraping top repositories for "WordPlate"
Scraping top repositories for "WordPress"
Scraping top repositories for "Xamarin"
Scraping top repositories for "XML"
