# Top Repositories for GitHub Topics

## Pick a website and describe your objective

- Browse through different sites and pick one to scrape.
- Identify the information you'd like to scrape from the site. Decide the format of the output CSV file.
- Summarize your project idea and outline your strategy in a Jupyter notebook.

## Outline

- Scrape https://github.com/topics
- Get a list of topics. For each topic get topic title, topic url and description.
- For each topic get the top 25 repositories.
- For each repository, grab the repo name, username, stars and repo url.
- Each topic should have its own CSV file. Format:

Repo Name,Username,Stars,Repo URL
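
As a minimal sketch of that layout, the standard-library `csv` module can write the header and one row. The sample row uses the first repository that shows up later in this notebook. Writing to an in-memory buffer instead of a real file is an assumption made to keep the example self-contained.

```python
import csv
import io

# Column order taken from the format line above.
HEADER = ["Repo Name", "Username", "Stars", "Repo URL"]

# One sample row (values from the 3D topic scraped later in this notebook).
rows = [["three.js", "mrdoob", 75300, "https://github.com/mrdoob/three.js"]]

buffer = io.StringIO()  # in-memory stand-in for an open CSV file
writer = csv.writer(buffer)
writer.writerow(HEADER)
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)
```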

## Use the requests library to download web pages

In [1]:
!pip install requests --upgrade 



In [2]:
import requests

In [3]:
topics_url = 'https://github.com/topics'

In [4]:
response = requests.get(topics_url)

In [5]:
response.status_code #check the response to know if request was successful

200

In [6]:
len(response.text)

146787

In [7]:
page_contents = response.text

In [8]:
with open('webpage.html','w',encoding="utf-8") as f:
    f.write(page_contents)

## Use Beautiful Soup to parse and extract information

In [9]:
!pip install beautifulsoup4 --upgrade



In [10]:
from bs4 import BeautifulSoup

In [11]:
doc = BeautifulSoup(page_contents, 'html.parser')

In [12]:
selec_class = "f3 lh-condensed mb-0 mt-1 Link--primary"

topic_title_tags = doc.find_all('p',{'class':selec_class}) #Query to get the topic names

In [13]:
len(topic_title_tags)

30

In [14]:
topic_title_tags[:10]

[<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Algorithm</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amp</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Android</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Angular</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ansible</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">API</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Arduino</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">ASP.NET</p>]
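
One caveat with the query above: when the `class` value is a single string containing several class names, Beautiful Soup compares it against the exact text of the tag's `class` attribute, so the same classes written in a different order would not match. A CSS selector passed to `select()` matches on the individual classes instead. The sketch below uses a small made-up HTML snippet, not the real GitHub markup.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: the same two classes, written in different orders.
html = """
<p class="f3 lh-condensed">3D</p>
<p class="lh-condensed f3">Ajax</p>
"""
sample_doc = BeautifulSoup(html, "html.parser")

# Exact-string match on the class attribute: only the first tag qualifies.
exact = sample_doc.find_all("p", {"class": "f3 lh-condensed"})

# CSS selector: any <p> carrying both classes, in any order.
by_selector = sample_doc.select("p.f3.lh-condensed")

exact_titles = [tag.text for tag in exact]
selector_titles = [tag.text for tag in by_selector]
print(exact_titles)
print(selector_titles)
```

The selector form is more tolerant of small markup changes, at the cost of possibly matching more tags than intended.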

In [37]:
desc_class = "f5 color-fg-muted mb-0 mt-1"

topic_desc_tags = doc.find_all('p',{'class':desc_class}) #Query to get the topic descriptions

In [38]:
len(topic_desc_tags)

30

In [39]:
topic_desc_tags[:10]

[<p class="f5 color-fg-muted mb-0 mt-1">
               3D modeling is the process of virtually developing the surface and structure of a 3D object.
             </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
               Ajax is a technique for creating interactive web applications.
             </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
               Algorithms are self-contained sequences that carry out a variety of tasks.
             </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
               Amp is a non-blocking concurrency framework for PHP.
             </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
               Android is an operating system built by Google designed for mobile devices.
             </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
               Angular is an open source web application platform.
             </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
               Ansible is a simple and powerful automation engine.
             </p>,
 <p class="

In [40]:
url_class = 'd-flex no-underline'

url_tags = doc.find_all('a',{'class':url_class})

In [41]:
len(url_tags)

30

In [42]:
url_tags[10]['href']

'/topics/atom'

Now let's clean the data by pulling plain text and full URLs out of the tags.

In [43]:
topic_titles = []

for tag in topic_title_tags:
    topic_titles.append(tag.text)

In [44]:
topic_titles #List of all topic titles

['3D',
 'Ajax',
 'Algorithm',
 'Amp',
 'Android',
 'Angular',
 'Ansible',
 'API',
 'Arduino',
 'ASP.NET',
 'Atom',
 'Awesome Lists',
 'Amazon Web Services',
 'Azure',
 'Babel',
 'Bash',
 'Bitcoin',
 'Bootstrap',
 'Bot',
 'C',
 'Chrome',
 'Chrome extension',
 'Command line interface',
 'Clojure',
 'Code quality',
 'Code review',
 'Compiler',
 'Continuous integration',
 'COVID-19',
 'C++']

In [45]:
topic_descs = []

for tag in topic_desc_tags:
    topic_descs.append(tag.text.strip()) #strip() removes spaces from front and end

In [46]:
topic_descs #List of all topic descriptions

['3D modeling is the process of virtually developing the surface and structure of a 3D object.',
 'Ajax is a technique for creating interactive web applications.',
 'Algorithms are self-contained sequences that carry out a variety of tasks.',
 'Amp is a non-blocking concurrency framework for PHP.',
 'Android is an operating system built by Google designed for mobile devices.',
 'Angular is an open source web application platform.',
 'Ansible is a simple and powerful automation engine.',
 'An API (Application Programming Interface) is a collection of protocols and subroutines for building software.',
 'Arduino is an open source hardware and software company and maker community.',
 'ASP.NET is a web framework for building modern web apps and services.',
 'Atom is a open source text editor built with web technologies.',
 'An awesome list is a list of awesome things curated by the community.',
 'Amazon Web Services provides on-demand cloud computing platforms on a subscription basis.',
 'A

In [47]:
topic_urls = []
base_url = "https://github.com"

for tag in url_tags:
    topic_urls.append(base_url+tag['href'])
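
String concatenation works here because the `href` values seen above are site-relative paths. As a hedged alternative, `urllib.parse.urljoin` from the standard library also copes with an `href` that is already absolute. The helper name `absolute_url` is made up for this sketch.

```python
from urllib.parse import urljoin

BASE_URL = "https://github.com"

def absolute_url(href, base=BASE_URL):
    """Resolve an href against the base URL, leaving absolute URLs untouched."""
    return urljoin(base, href)

print(absolute_url("/topics/atom"))
print(absolute_url("https://github.com/topics/3d"))
```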

In [26]:
topic_urls

['https://github.com/topics/3d',
 'https://github.com/topics/ajax',
 'https://github.com/topics/algorithm',
 'https://github.com/topics/amphp',
 'https://github.com/topics/android',
 'https://github.com/topics/angular',
 'https://github.com/topics/ansible',
 'https://github.com/topics/api',
 'https://github.com/topics/arduino',
 'https://github.com/topics/aspnet',
 'https://github.com/topics/atom',
 'https://github.com/topics/awesome',
 'https://github.com/topics/aws',
 'https://github.com/topics/azure',
 'https://github.com/topics/babel',
 'https://github.com/topics/bash',
 'https://github.com/topics/bitcoin',
 'https://github.com/topics/bootstrap',
 'https://github.com/topics/bot',
 'https://github.com/topics/c',
 'https://github.com/topics/chrome',
 'https://github.com/topics/chrome-extension',
 'https://github.com/topics/cli',
 'https://github.com/topics/clojure',
 'https://github.com/topics/code-quality',
 'https://github.com/topics/code-review',
 'https://github.com/topics/compil

### Now use pandas to create a DataFrame

In [48]:
import pandas as pd

In [51]:
topic_dict = {
    'title':topic_titles,
    'description':topic_descs,
    'url':topic_urls
}

In [52]:
topic_df = pd.DataFrame(topic_dict)

In [53]:
topic_df.head()

,title,description,url
0,3D,3D modeling is the process of virtually develo...,https://github.com/topics/3d
1,Ajax,Ajax is a technique for creating interactive w...,https://github.com/topics/ajax
2,Algorithm,Algorithms are self-contained sequences that c...,https://github.com/topics/algorithm
3,Amp,Amp is a non-blocking concurrency framework fo...,https://github.com/topics/amphp
4,Android,Android is an operating system built by Google...,https://github.com/topics/android


In [54]:
topic_df.to_csv('topics.csv',index=None)

## Getting Information out of a topic page

In [55]:
topic_page_url = topic_urls[0]

topic_page_url

'https://github.com/topics/3d'

In [56]:
response = requests.get(topic_page_url)

In [57]:
response.status_code

200

In [58]:
topic_doc = BeautifulSoup(response.text,'html.parser')

In [59]:
h3_class_selec = 'f3 color-fg-muted text-normal lh-condensed'
repo_tags = topic_doc.find_all('h3',{'class': h3_class_selec})

In [60]:
len(repo_tags)

30

In [63]:
a_tags = repo_tags[0].find_all('a')

In [64]:
auth_name = a_tags[0].text.strip()

In [65]:
repo_name = a_tags[1].text.strip()

In [66]:
repo_url = base_url + a_tags[1]['href']

In [67]:
star_selec_class = 'social-count float-none'
star_tags = topic_doc.find_all('a',{'class': star_selec_class})

In [68]:
len(star_tags)

30

In [69]:
star_str = star_tags[0].text.strip()

In [70]:
def parse_star_count(star_str): # Convert a star count string (possibly ending in 'k') to an integer
    star_str = star_str.strip()
    if star_str[-1] == 'k':
        return int(float(star_str[:-1]) * 1000)
    return int(star_str)

In [71]:
parse_star_count(star_str)

75300
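
The same idea can be extended a little. The sketch below is a variant built on assumptions, not the function used in the rest of this notebook: it also accepts thousands separators and an `m` suffix (neither of which appears in the output above), and it uses `Decimal` so the multiplication is exact rather than subject to floating-point rounding. The name `parse_count` is made up for illustration.

```python
from decimal import Decimal

# Multipliers for the suffixes this sketch understands.
SUFFIXES = {"k": 1_000, "m": 1_000_000}

def parse_count(text):
    """Convert strings such as '75.3k' or '1,234' to an integer."""
    cleaned = text.strip().lower().replace(",", "")
    if cleaned and cleaned[-1] in SUFFIXES:
        # Decimal keeps '75.3' exact, so the product is exactly 75300.
        return int(Decimal(cleaned[:-1]) * SUFFIXES[cleaned[-1]])
    return int(cleaned)

print(parse_count("75.3k"))
print(parse_count("1,234"))
```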

In [72]:
def get_repo_info(h3_tag, star_tag):
    a_tags = h3_tag.find_all('a')
    userName = a_tags[0].text.strip()
    repoName = a_tags[1].text.strip()
    repoUrl = base_url + a_tags[1]['href']
    star_str = star_tag.text.strip()
    stars = parse_star_count(star_str)
    
    return userName,repoName,repoUrl,stars

In [73]:
get_repo_info(repo_tags[15],star_tags[15])

('domlysz', 'BlenderGIS', 'https://github.com/domlysz/BlenderGIS', 4500)

In [74]:
topic_repos_dict = {
    'username' :[],
    'reponame' :[],
    'repourl' :[],
    'stars' :[]
}


for i in range(len(repo_tags)):
    repo_info = get_repo_info(repo_tags[i],star_tags[i])
    topic_repos_dict['username'].append(repo_info[0])
    topic_repos_dict['reponame'].append(repo_info[1])
    topic_repos_dict['repourl'].append(repo_info[2])
    topic_repos_dict['stars'].append(repo_info[3])

In [75]:
topic_repos_dict

{'username': ['mrdoob',
  'libgdx',
  'pmndrs',
  'BabylonJS',
  'aframevr',
  'ssloy',
  'lettier',
  'FreeCAD',
  'metafizzy',
  'CesiumGS',
  'timzhang642',
  'a1studmuffin',
  'isl-org',
  'spritejs',
  'tensorspace-team',
  'domlysz',
  'jagenjo',
  'YadiraF',
  'AaronJackson',
  'openscad',
  'ssloy',
  'mosra',
  'blender',
  'google',
  'gfxfundamentals',
  'cleardusk',
  'jasonlong',
  'rg3dengine',
  'cnr-isti-vclab',
  'antvis'],
 'reponame': ['three.js',
  'libgdx',
  'react-three-fiber',
  'Babylon.js',
  'aframe',
  'tinyrenderer',
  '3d-game-shaders-for-beginners',
  'FreeCAD',
  'zdog',
  'cesium',
  '3D-Machine-Learning',
  'SpaceshipGenerator',
  'Open3D',
  'spritejs',
  'tensorspace',
  'BlenderGIS',
  'webglstudio.js',
  'PRNet',
  'vrn',
  'openscad',
  'tinyraytracer',
  'magnum',
  'blender',
  'model-viewer',
  'webgl-fundamentals',
  '3DDFA',
  'isometric-contributions',
  'rg3d',
  'meshlab',
  'L7'],
 'repourl': ['https://github.com/mrdoob/three.js',
  'http

## Endgame

In [93]:
import os

def get_topic_page(topic_url):
    #Download the page
    response = requests.get(topic_url)
    
    #Check for a successful response
    if response.status_code != 200:
        raise Exception("Failed to load Page {}".format(topic_url))
    
    #Parse using BeautifulSoup
    topic_doc = BeautifulSoup(response.text,'html.parser')
    return topic_doc

#Get repo info function
def get_repo_info(h3_tag, star_tag):
    a_tags = h3_tag.find_all('a')
    userName = a_tags[0].text.strip()
    repoName = a_tags[1].text.strip()
    repoUrl = base_url + a_tags[1]['href']
    star_str = star_tag.text.strip()
    stars = parse_star_count(star_str)

    return userName,repoName,repoUrl,stars

    

def get_topic_repos(topic_doc):
                
    #Get the repo tags and star tags
    h3_class_selec = 'f3 color-fg-muted text-normal lh-condensed'
    repo_tags = topic_doc.find_all('h3',{'class': h3_class_selec})
    star_selec_class = 'social-count float-none'
    star_tags = topic_doc.find_all('a',{'class': star_selec_class})
    
    #Create a topic repo dict
    topic_repos_dict = {
        'username' :[],
        'reponame' :[],
        'repourl' :[],
        'stars' :[]
    }


    for i in range(len(repo_tags)):
        repo_info = get_repo_info(repo_tags[i],star_tags[i])
        topic_repos_dict['username'].append(repo_info[0])
        topic_repos_dict['reponame'].append(repo_info[1])
        topic_repos_dict['repourl'].append(repo_info[2])
        topic_repos_dict['stars'].append(repo_info[3])
    
    return pd.DataFrame(topic_repos_dict)

def scrape_topic(topic_url,path):
    
    if os.path.exists(path):
        print("The file {} already exists. Skipping...".format(path))
        return
        
    topic_df = get_topic_repos(get_topic_page(topic_url))
    topic_df.to_csv(path,index=None)
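
The loop in `get_topic_repos` indexes `star_tags` by position, which assumes both lists have the same length and order. If the page layout changed and the two queries returned different numbers of tags, the loop could raise an `IndexError` or pair a repository with the wrong star count. A minimal sketch of a guard, using a hypothetical helper `pair_strict` that is not part of the notebook:

```python
def pair_strict(repo_tags, star_tags):
    """Pair items positionally, refusing to continue if the lengths differ."""
    if len(repo_tags) != len(star_tags):
        raise ValueError(
            "Found {} repo tags but {} star tags".format(len(repo_tags), len(star_tags))
        )
    return list(zip(repo_tags, star_tags))

# Plain placeholder strings stand in for the parsed tags here.
pairs = pair_strict(["h3_a", "h3_b"], ["stars_a", "stars_b"])
print(pairs)
```

Each pair could then be passed to `get_repo_info(h3_tag, star_tag)` in place of the index-based lookup.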
    

In [77]:
url5 = topic_urls[5]

In [84]:
url5_topic_repo_df = get_topic_repos(get_topic_page(url5))

url5_topic_repo_df

,username,reponame,repourl,stars
0,justjavac,free-programming-books-zh_CN,https://github.com/justjavac/free-programming-...,83800
1,angular,angular,https://github.com/angular/angular,77400
2,storybookjs,storybook,https://github.com/storybookjs/storybook,65700
3,ionic-team,ionic-framework,https://github.com/ionic-team/ionic-framework,45500
4,leonardomso,33-js-concepts,https://github.com/leonardomso/33-js-concepts,44600
5,prettier,prettier,https://github.com/prettier/prettier,40900
6,SheetJS,sheetjs,https://github.com/SheetJS/sheetjs,27900
7,angular,angular-cli,https://github.com/angular/angular-cli,25000
8,angular,components,https://github.com/angular/components,22200
9,NativeScript,NativeScript,https://github.com/NativeScript/NativeScript,20600


## Now let's create a single function to:

1. Get the list of all the topics from the topics page.
2. Get the list of top repositories from each topic page.
3. For each topic, create a CSV of the top repos.

In [81]:
def get_topic_titles(doc):
    selec_class = "f3 lh-condensed mb-0 mt-1 Link--primary"
    topic_title_tags = doc.find_all('p',{'class':selec_class}) #Query to get the topic names
    
    topic_titles = []

    for tag in topic_title_tags:
        topic_titles.append(tag.text)
        
    return topic_titles

def get_topic_desc(doc):
    
    desc_class = "f5 color-fg-muted mb-0 mt-1"

    topic_desc_tags = doc.find_all('p',{'class':desc_class}) #Query to get the topic descriptions
    topic_descs = []

    for tag in topic_desc_tags:
        topic_descs.append(tag.text.strip()) #strip() removes spaces from front and end
        
    return topic_descs

def get_topic_urls(doc):
    
    url_class = 'd-flex no-underline'

    url_tags = doc.find_all('a',{'class':url_class}) #Query to get the topic urls
    topic_urls = []
    base_url = "https://github.com"

    for tag in url_tags:
        topic_urls.append(base_url+tag['href'])
    
    return topic_urls
        

def scrape_topics():
    topics_url = 'https://github.com/topics'
    response = requests.get(topics_url)
    if response.status_code != 200:
        raise Exception("Failed to load Page {}".format(topics_url))
    
    doc = BeautifulSoup(response.text, 'html.parser')
    topic_dict = {
        'title':get_topic_titles(doc),
        'description':get_topic_desc(doc),
        'url':get_topic_urls(doc)
    }
    return pd.DataFrame(topic_dict)

    
    

In [94]:
def scrape_topics_repos():
    print('Scraping list of topics')
    topics_df = scrape_topics()
    
    os.makedirs('data',exist_ok=True)
    for index,row in topics_df.iterrows():
        print('Scraping top repositories for "{}"'.format(row['title']))
        scrape_topic(row['url'],'data/{}.csv'.format(row['title']))
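
The topic title is used directly as the file name. That works for the titles shown in this notebook, but a title containing a path separator such as `/` would point at a different directory. A hedged sketch of a sanitizing step follows. `topic_csv_path` and its replacement rule are assumptions for illustration, not part of the original code, and they would change names that contain spaces.

```python
import os
import re

def topic_csv_path(title, folder="data"):
    """Build a CSV path from a topic title, replacing risky characters with '_'."""
    # Keep letters, digits, underscore, dot, plus and hyphen; collapse the rest.
    safe = re.sub(r"[^\w.+-]+", "_", title).strip("_")
    return os.path.join(folder, safe + ".csv")

print(topic_csv_path("Command line interface"))
print(topic_csv_path("C++"))
```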

In [96]:
scrape_topics_repos()

Scraping list of topics
Scraping top repositories for "3D"
The file data/3D.csv already exists. Skipping...
Scraping top repositories for "Ajax"
The file data/Ajax.csv already exists. Skipping...
Scraping top repositories for "Algorithm"
The file data/Algorithm.csv already exists. Skipping...
Scraping top repositories for "Amp"
The file data/Amp.csv already exists. Skipping...
Scraping top repositories for "Android"
The file data/Android.csv already exists. Skipping...
Scraping top repositories for "Angular"
The file data/Angular.csv already exists. Skipping...
Scraping top repositories for "Ansible"
The file data/Ansible.csv already exists. Skipping...
Scraping top repositories for "API"
The file data/API.csv already exists. Skipping...
Scraping top repositories for "Arduino"
The file data/Arduino.csv already exists. Skipping...
Scraping top repositories for "ASP.NET"
The file data/ASP.NET.csv already exists. Skipping...
Scraping top repositories for "Atom"
The file data/Atom.csv alre
