# Top Repos For GitHub Topics


## 1. Pick a website and describe your objective

- Browse through different sites and pick one to scrape. Check the "Project Ideas" section for inspiration.
- Identify the information you'd like to scrape from the site. Decide the format of the output CSV file.
- Summarize your project idea and outline your strategy in a Jupyter notebook. Use the "New" button above.

### Project Outline

- We're going to scrape https://github.com/topics
- We'll get a list of topics. For each topic, we'll get topic title, topic page URL and topic description
- For each topic, we'll get the top 25 repositories in the topic from the topic page
- For each repository, we'll grab the repo name, username, stars and repo URL
- For each topic we'll create a CSV file in the following format:

Repo Name, User Name, Stars, Repo URL

three.js, mrdoob, 95000, https://github.com/mrdoob/three.js

react-three-fiber, pmndrs, 24100, https://github.com/pmndrs/react-three-fiber
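
As a minimal sketch of the target format above, the rows could be serialized with Python's standard `csv` module. The helper name `repos_to_csv_text` and the in-memory buffer are assumptions made for illustration; the notebook itself writes its CSV files with pandas later on.

```python
import csv
import io

# Column order taken from the format described above.
FIELDNAMES = ["Repo Name", "User Name", "Stars", "Repo URL"]


def repos_to_csv_text(rows):
    """Serialize an iterable of row dicts to CSV text with a header line.

    Hypothetical helper, used here only to illustrate the output format.
    """
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Sample rows copied from the example above.
sample_rows = [
    {"Repo Name": "three.js", "User Name": "mrdoob", "Stars": 95000,
     "Repo URL": "https://github.com/mrdoob/three.js"},
    {"Repo Name": "react-three-fiber", "User Name": "pmndrs", "Stars": 24100,
     "Repo URL": "https://github.com/pmndrs/react-three-fiber"},
]

csv_text = repos_to_csv_text(sample_rows)
print(csv_text)
```

Writing to a string buffer keeps the sketch free of side effects; passing a real file object opened with `newline=""` would work the same way.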


### Use The Requests Library To Download Web Pages

In [1]:
!pip install requests --upgrade --quiet

In [2]:
import requests

In [3]:
# get the web page
topics_url = 'https://github.com/topics'

In [4]:
# download this url
response = requests.get(topics_url)

In [5]:
response.status_code

# Informational responses (100 – 199)
# Successful responses (200 – 299)
# Redirection messages (300 – 399)
# Client error responses (400 – 499)
# Server error responses (500 – 599)

200
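
The status code ranges listed in the comments above can be turned into a small lookup. This is only a sketch: `describe_status` is a hypothetical helper, not part of `requests`, and it assumes the code is passed in as a plain integer such as `response.status_code`.

```python
def describe_status(status_code):
    """Return the response class for an HTTP status code.

    Hypothetical helper that mirrors the ranges listed above.
    """
    if 100 <= status_code <= 199:
        return "informational"
    if 200 <= status_code <= 299:
        return "successful"
    if 300 <= status_code <= 399:
        return "redirection"
    if 400 <= status_code <= 499:
        return "client error"
    if 500 <= status_code <= 599:
        return "server error"
    raise ValueError("Not a valid HTTP status code: {}".format(status_code))


# The request above returned 200, which falls in the "successful" range.
print(describe_status(200))
```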

In [6]:
# first 1000 characters of the web page

page_contents = response.text
page_contents[:1000]

'\n\n<!DOCTYPE html>\n<html lang="en" data-color-mode="auto" data-light-theme="light" data-dark-theme="dark"  data-a11y-animated-images="system" data-a11y-link-underlines="true">\n\n\n  <head>\n    <meta charset="utf-8">\n  <link rel="dns-prefetch" href="https://github.githubassets.com">\n  <link rel="dns-prefetch" href="https://avatars.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">\n  <link rel="dns-prefetch" href="https://user-images.githubusercontent.com/">\n  <link rel="preconnect" href="https://github.githubassets.com" crossorigin>\n  <link rel="preconnect" href="https://avatars.githubusercontent.com">\n\n  \n\n  <link crossorigin="anonymous" media="all" rel="stylesheet" href="https://github.githubassets.com/assets/light-b92e9647318f.css" /><link crossorigin="anonymous" media="all" rel="stylesheet" href="https://github.githubassets.com/assets/dark-5d486a4ede8e.css" /><link data-color-theme="dark_dimmed" crossorigin="anonymous" med

In [7]:
# save this in html

with open('webpage.html', 'w', encoding="utf-8") as f:
    f.write(page_contents)

The saved file contains the full HTML of the topics page, including the list of topics that we will parse next.

### Use BeautifulSoup To Parse And Extract Info

In [8]:
!pip install beautifulsoup4 --upgrade --quiet 

In [9]:
from bs4 import BeautifulSoup

In [10]:
# html code parsing

doc = BeautifulSoup(page_contents, 'html.parser')

In [11]:
type(doc)

bs4.BeautifulSoup

#### Topic Title

In [12]:
# find the tags that hold the topic titles; the '3D' title, for example, sits in a p tag with this class

selection_class = "f3 lh-condensed mb-0 mt-1 Link--primary"

topic_title_tags = doc.find_all('p', {'class' : selection_class})

In [13]:
len(topic_title_tags)

30

In [14]:
topic_title_tags[:5]

[<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Algorithm</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amp</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Android</p>]

#### Description Of Topic

In [15]:
desc_selector = "f5 color-fg-muted mb-0 mt-1"

topic_desc_tags = doc.find_all('p', {'class' : desc_selector})

In [16]:
topic_desc_tags[:5]

[<p class="f5 color-fg-muted mb-0 mt-1">
           3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Ajax is a technique for creating interactive web applications.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Algorithms are self-contained sequences that carry out a variety of tasks.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Amp is a non-blocking concurrency library for PHP.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Android is an operating system built by Google designed for mobile devices.
         </p>]

#### Topic URL

In [17]:
topic_title_tag0 = topic_title_tags[0]

topic_title_tag0

<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>

In [18]:
# parent tag of p tags for 3D

topic_link_tag = topic_title_tag0.parent

topic_link_tag

<a class="no-underline flex-1 d-flex flex-column" href="/topics/3d">
<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
<p class="f5 color-fg-muted mb-0 mt-1">
          3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.
        </p>
</a>

This means our `p` tag (containing the title 3D) is nested inside an `a` tag whose `href` holds the relative URL of the topic page.

In [19]:
# print topic url

topic0_url = "https://github.com" + topic_link_tag['href']

print(topic0_url)

https://github.com/topics/3d


In [20]:
# Parent tag for p tags for all Title 

url_selector = "no-underline flex-1 d-flex flex-column"

topic_link_tags = doc.find_all('a', {'class': url_selector})

In [21]:
len(topic_link_tags)

30

In [22]:
topic_link_tags[0]['href']

'/topics/3d'

In [23]:
topic0_url = "https://github.com" + topic_link_tags[0]['href']

print(topic0_url)

https://github.com/topics/3d
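
String concatenation works here because every `href` on the page is a site-relative path. As a hedged alternative, the standard library's `urljoin` also copes with an `href` that is already absolute. The helper name `absolute_url` is an assumption made for this sketch.

```python
from urllib.parse import urljoin

BASE_URL = "https://github.com"


def absolute_url(href, base=BASE_URL):
    """Resolve an href against the base URL (hypothetical helper)."""
    return urljoin(base, href)


# A site-relative href, as found in the topic link tags above.
print(absolute_url("/topics/3d"))

# An href that is already absolute is returned unchanged.
print(absolute_url("https://github.com/topics/ajax"))
```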


#### Getting Topic Name, Description And Topic URL

In [24]:
topic_titles = []

for tag in topic_title_tags:
    topic_titles.append(tag.text)

print(topic_titles)

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android', 'Angular', 'Ansible', 'API', 'Arduino', 'ASP.NET', 'Atom', 'Awesome Lists', 'Amazon Web Services', 'Azure', 'Babel', 'Bash', 'Bitcoin', 'Bootstrap', 'Bot', 'C', 'Chrome', 'Chrome extension', 'Command line interface', 'Clojure', 'Code quality', 'Code review', 'Compiler', 'Continuous integration', 'COVID-19', 'C++']


In [59]:
topic_descs = []

for tag in topic_desc_tags:
    topic_descs.append(tag.text.strip())

topic_descs[:5]

['3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.',
 'Ajax is a technique for creating interactive web applications.',
 'Algorithms are self-contained sequences that carry out a variety of tasks.',
 'Amp is a non-blocking concurrency library for PHP.',
 'Android is an operating system built by Google designed for mobile devices.']

In [26]:
topic_urls = []

base_url = "https://github.com"

for tag in topic_link_tags:
    topic_urls.append(base_url + tag['href'])
    
topic_urls[:5]

['https://github.com/topics/3d',
 'https://github.com/topics/ajax',
 'https://github.com/topics/algorithm',
 'https://github.com/topics/amphp',
 'https://github.com/topics/android']
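
The three lists above are parallel: item `i` of each list describes the same topic. A minimal sketch of combining them into one record per topic, with a length check so that a missing description cannot silently shift the columns, could look like this. `build_topic_records` is a hypothetical helper and the sample values are taken from the outputs above.

```python
def build_topic_records(titles, descriptions, urls):
    """Combine three parallel lists into a list of per-topic dicts."""
    if not (len(titles) == len(descriptions) == len(urls)):
        raise ValueError(
            "Length mismatch: {} titles, {} descriptions, {} urls".format(
                len(titles), len(descriptions), len(urls)
            )
        )
    return [
        {"title": title, "description": description, "url": url}
        for title, description, url in zip(titles, descriptions, urls)
    ]


records = build_topic_records(
    ["3D", "Ajax"],
    [
        "3D refers to the use of three-dimensional graphics, modeling, "
        "and animation in various industries.",
        "Ajax is a technique for creating interactive web applications.",
    ],
    ["https://github.com/topics/3d", "https://github.com/topics/ajax"],
)
print(records[0]["title"], records[0]["url"])
```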

### Creating Dataframe

In [27]:
import pandas as pd

In [28]:
topics_df = pd.DataFrame

In [29]:
topics_dict = {
    'title' : topic_titles,
    'description' : topic_descs,
    'url' : topic_urls
}

In [30]:
topics_df = pd.DataFrame(topics_dict)

topics_df[:5]

| | title | description | url |
|---|---|---|---|
| 0 | 3D | 3D refers to the use of three-dimensional grap... | https://github.com/topics/3d |
| 1 | Ajax | Ajax is a technique for creating interactive w... | https://github.com/topics/ajax |
| 2 | Algorithm | Algorithms are self-contained sequences that c... | https://github.com/topics/algorithm |
| 3 | Amp | Amp is a non-blocking concurrency library for ... | https://github.com/topics/amphp |
| 4 | Android | Android is an operating system built by Google... | https://github.com/topics/android |


### Create CSV With The Extracted Info

In [31]:
topics_df.to_csv('topics.csv', index=None)

### Getting Info Out Of A Topic Page

In [32]:
topic_page_url = topic_urls[0]

print(topic_page_url)

https://github.com/topics/3d


In [33]:
response = requests.get(topic_page_url)

In [34]:
response.status_code

200

In [35]:
len(response.text)

483998

In [36]:
topic_doc = BeautifulSoup(response.text, 'html.parser')

In [37]:
h3_selection_class = "f3 color-fg-muted text-normal lh-condensed"

repo_tags = topic_doc.find_all('h3', {'class': h3_selection_class})

# repo_tags include Repo Name, User Name and Repo URL

In [38]:
len(repo_tags)

20

In [39]:
a_tags = repo_tags[0].find_all('a')

a_tags

[<a class="Link" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"OWNER","click_visual_representation":"REPOSITORY_OWNER_HEADING","actor_id":null,"record_id":97088,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="4bdbc49d3c05ae7f70b531fbce709a384200b0768554e0172950286a8db30940" data-turbo="false" data-view-component="true" href="/mrdoob">
             mrdoob
 </a>,
 <a class="Link text-bold wb-break-word" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"REPOSITORY","click_visual_representation":"REPOSITORY_NAME_HEADING","actor_id":null,"record_id":576201,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="517d3d5cb9d89752156923904a4238816bc9b51ab7772f3e3644ce897d8dd4e5" data-turbo="false" data-view-component="true" href="/mrdoob/three.js">
             three.js
 </a>]

In [40]:
# User Name

a_tags[0]

<a class="Link" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"OWNER","click_visual_representation":"REPOSITORY_OWNER_HEADING","actor_id":null,"record_id":97088,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="4bdbc49d3c05ae7f70b531fbce709a384200b0768554e0172950286a8db30940" data-turbo="false" data-view-component="true" href="/mrdoob">
            mrdoob
</a>

In [41]:
# Repo Name

a_tags[1]

<a class="Link text-bold wb-break-word" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"REPOSITORY","click_visual_representation":"REPOSITORY_NAME_HEADING","actor_id":null,"record_id":576201,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="517d3d5cb9d89752156923904a4238816bc9b51ab7772f3e3644ce897d8dd4e5" data-turbo="false" data-view-component="true" href="/mrdoob/three.js">
            three.js
</a>

#### User Name and Repo Name

In [42]:
a_tags[0].text.strip()

'mrdoob'

In [43]:
a_tags[1].text.strip()

'three.js'

#### Repo URL

In [44]:
repo_url = base_url + a_tags[1]['href']

print(repo_url)

https://github.com/mrdoob/three.js


#### Stars

In [45]:
star_tags = topic_doc.find_all('span', {'class' : 'Counter js-social-count'})

In [46]:
len(star_tags)

20

In [47]:
star_tags[0].text.strip()

'95.2k'

In [48]:
# converting into number

def parse_star_count(stars_str):
    stars_str = stars_str.strip()
    if stars_str[-1] == 'k':
        return int(float(stars_str[:-1]) * 1000)
    return int(stars_str)

parse_star_count(star_tags[0].text.strip())

95200
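
The function above handles plain integers and the `k` suffix. A slightly more defensive sketch is shown below. It assumes, without confirmation from the page, that counts might also appear with an `m` suffix or with thousands separators, and it uses `round` instead of `int` so that floating-point error cannot truncate a value by one. The name `parse_count` is hypothetical.

```python
# Multipliers for the suffixes this sketch understands.
# The 'm' suffix is an assumption; only 'k' appears in the scraped output.
SUFFIX_MULTIPLIERS = {"k": 1_000, "m": 1_000_000}


def parse_count(count_str):
    """Convert strings such as '95.2k', '1,234' or '987' to an int."""
    cleaned = count_str.strip().lower().replace(",", "")
    if not cleaned:
        raise ValueError("Empty count string")
    suffix = cleaned[-1]
    if suffix in SUFFIX_MULTIPLIERS:
        # round() avoids truncating results like 8699.999999999998
        return round(float(cleaned[:-1]) * SUFFIX_MULTIPLIERS[suffix])
    return int(cleaned)


print(parse_count("95.2k"))
```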

In [49]:
def get_repo_info(h3_tag, star_tag):
    
    # returns all the required info about a repository
    
    a_tags = h3_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href']
    stars = parse_star_count(star_tag.text.strip())
    
    return username, repo_name, stars, repo_url

get_repo_info(repo_tags[0], star_tags[0])

('mrdoob', 'three.js', 95200, 'https://github.com/mrdoob/three.js')

In [50]:
topic_repos_dict = {
    'username' : [],
    'repo_name' : [],
    'stars' : [],
    'repo_url' : []
}

for i in range(len(repo_tags)):
    repo_info = get_repo_info(repo_tags[i], star_tags[i])
    topic_repos_dict['username'].append(repo_info[0])
    topic_repos_dict['repo_name'].append(repo_info[1])
    topic_repos_dict['stars'].append(repo_info[2])
    topic_repos_dict['repo_url'].append(repo_info[3])
    
topic_repos_dict

{'username': ['mrdoob',
  'pmndrs',
  'libgdx',
  'BabylonJS',
  'ssloy',
  'lettier',
  'aframevr',
  'FreeCAD',
  'CesiumGS',
  'MonoGame',
  'metafizzy',
  'blender',
  'isl-org',
  'timzhang642',
  'a1studmuffin',
  'nerfstudio-project',
  'domlysz',
  'FyroxEngine',
  'google',
  'openscad'],
 'repo_name': ['three.js',
  'react-three-fiber',
  'libgdx',
  'Babylon.js',
  'tinyrenderer',
  '3d-game-shaders-for-beginners',
  'aframe',
  'FreeCAD',
  'cesium',
  'MonoGame',
  'zdog',
  'blender',
  'Open3D',
  '3D-Machine-Learning',
  'SpaceshipGenerator',
  'nerfstudio',
  'BlenderGIS',
  'Fyrox',
  'model-viewer',
  'openscad'],
 'stars': [95200,
  24100,
  22100,
  21500,
  18100,
  16200,
  15700,
  15500,
  11100,
  10200,
  10000,
  9800,
  9600,
  9200,
  7500,
  6900,
  6800,
  6700,
  6100,
  6000],
 'repo_url': ['https://github.com/mrdoob/three.js',
  'https://github.com/pmndrs/react-three-fiber',
  'https://github.com/libgdx/libgdx',
  'https://github.com/BabylonJS/Babylon

In [51]:
# converting in dataframe

topic_repos_df = pd.DataFrame(topic_repos_dict)

topic_repos_df[:5]

| | username | repo_name | stars | repo_url |
|---|---|---|---|---|
| 0 | mrdoob | three.js | 95200 | https://github.com/mrdoob/three.js |
| 1 | pmndrs | react-three-fiber | 24100 | https://github.com/pmndrs/react-three-fiber |
| 2 | libgdx | libgdx | 22100 | https://github.com/libgdx/libgdx |
| 3 | BabylonJS | Babylon.js | 21500 | https://github.com/BabylonJS/Babylon.js |
| 4 | ssloy | tinyrenderer | 18100 | https://github.com/ssloy/tinyrenderer |
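
The loop that fills `topic_repos_dict` appends each tuple element by position. One possible way to express the same row-to-column conversion without repeating the four `append` calls is sketched here. `rows_to_columns` and `COLUMNS` are hypothetical names, and the sample rows are copied from the output above.

```python
# Column names in the same order as the tuples returned by get_repo_info.
COLUMNS = ("username", "repo_name", "stars", "repo_url")


def rows_to_columns(rows):
    """Turn a list of row tuples into a dict of column lists."""
    columns = {name: [] for name in COLUMNS}
    for row in rows:
        if len(row) != len(COLUMNS):
            raise ValueError("Expected {} values, got {}".format(len(COLUMNS), len(row)))
        for name, value in zip(COLUMNS, row):
            columns[name].append(value)
    return columns


sample_rows = [
    ("mrdoob", "three.js", 95200, "https://github.com/mrdoob/three.js"),
    ("pmndrs", "react-three-fiber", 24100, "https://github.com/pmndrs/react-three-fiber"),
]

columns = rows_to_columns(sample_rows)
print(columns["stars"])
```

The resulting dict has the same shape as `topic_repos_dict`, so it could be passed to `pd.DataFrame` in the same way.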


# Final Code

In [173]:
# define helper functions that scrape the repos of any topic

import os

# 1. Get Topic Page with Topic URL

def get_topic_page(topic_url):
    # download the page
    response = requests.get(topic_url)
    
    # check successful response
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topic_url))
    
    # parse using BeautifulSoup
    topic_doc = BeautifulSoup(response.text, 'html.parser')
    
    return topic_doc

# 2. Get Repo Info

def get_repo_info(h3_tag, star_tag):
    
    # returns all the required info about a repository
    
    a_tags = h3_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href']
    stars = parse_star_count(star_tag.text.strip())
    
    return username, repo_name, stars, repo_url

# 3. Get Topic Info

def get_topic_repos(topic_doc):
    
    # get repo tags
    h3_selection_class = "f3 color-fg-muted text-normal lh-condensed"
    repo_tags = topic_doc.find_all('h3', {'class': h3_selection_class})
    
    # get star tags
    star_tags = topic_doc.find_all('span', {'class' : 'Counter js-social-count'})
    
    # get repo info
    topic_repos_dict = {
        'username' : [],
        'repo_name' : [],
        'stars' : [],
        'repo_url' : []
    }
    
    for i in range(len(repo_tags)):
        repo_info = get_repo_info(repo_tags[i], star_tags[i])
        topic_repos_dict['username'].append(repo_info[0])
        topic_repos_dict['repo_name'].append(repo_info[1])
        topic_repos_dict['stars'].append(repo_info[2])
        topic_repos_dict['repo_url'].append(repo_info[3])
        
    return pd.DataFrame(topic_repos_dict)

# 4. Save the scraped topic info

def scrape_topic(path, topic_url):
    
    # checking if the file name exists    
    if os.path.exists(path):
        print("The file {} already exist. skipping...".format(path))
        return
    
    topic_df = get_topic_repos(get_topic_page(topic_url))
    topic_df.to_csv(path, index=None)

In [174]:
# combine the functions above and save the result to a CSV file

get_topic_repos(get_topic_page(topic_urls[4])).to_csv('topics-android.csv', index=None)

### Putting it all together

- We have a function to get the list of topics
- We have a function to create a CSV file for scraped repos from a topics page
- Let's create a function to put them together

In [175]:
# 1 List of Topics (Title, Description, URL)

def get_topic_titles(doc):
   
    # title tags
    selection_class = "f3 lh-condensed mb-0 mt-1 Link--primary"
    topic_title_tags = doc.find_all('p', {'class' : selection_class})
    
    # parsing topic names
    topic_titles = []

    for tag in topic_title_tags:
        topic_titles.append(tag.text)
        
    return topic_titles

def get_topic_descs(doc):
    
    # desc tags
    desc_selector = "f5 color-fg-muted mb-0 mt-1"
    topic_desc_tags = doc.find_all('p', {'class' : desc_selector})
    
    # parsing descriptions
    topic_descs = []
    
    for tag in topic_desc_tags:
        topic_descs.append(tag.text.strip())

    return topic_descs

def get_topic_urls(doc):
    
    # topic link tags
    url_selector = "no-underline flex-1 d-flex flex-column"
    topic_link_tags = doc.find_all('a', {'class': url_selector})
    
    # parsing topic urls
    topic_urls = []
    base_url = "https://github.com"
    
    for tag in topic_link_tags:
        topic_urls.append(base_url + tag['href'])
        
    return topic_urls

def scrape_topics():
    
    topics_url = 'https://github.com/topics'
    response = requests.get(topics_url)
        
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topics_url))
    
    # parse the freshly downloaded page instead of relying on a global doc
    doc = BeautifulSoup(response.text, 'html.parser')
            
    topics_dict = {
        'title': get_topic_titles(doc),
        'description': get_topic_descs(doc),
        'url': get_topic_urls(doc)
    }
        
    return pd.DataFrame(topics_dict)

In [176]:
scrape_topics().head()

| | title | description | url |
|---|---|---|---|
| 0 | 3D | 3D refers to the use of three-dimensional grap... | https://github.com/topics/3d |
| 1 | Ajax | Ajax is a technique for creating interactive w... | https://github.com/topics/ajax |
| 2 | Algorithm | Algorithms are self-contained sequences that c... | https://github.com/topics/algorithm |
| 3 | Amp | Amp is a non-blocking concurrency library for ... | https://github.com/topics/amphp |
| 4 | Android | Android is an operating system built by Google... | https://github.com/topics/android |
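
Topic titles such as "Awesome Lists" or "C++" contain spaces and punctuation. If titles are used as CSV file names, a small normalization step may make the files easier to handle on some systems. The sketch below is one possible approach: `topic_to_filename` is a hypothetical helper, and the set of characters it keeps is an assumption, not a requirement of any file system.

```python
import re


def topic_to_filename(title):
    """Build a CSV file name from a topic title.

    Letters, digits, '.', '_', '+' and '-' are kept; every other run of
    characters (spaces, slashes, ...) is replaced by a single hyphen.
    """
    slug = re.sub(r"[^A-Za-z0-9._+-]+", "-", title.strip())
    slug = slug.strip("-")
    # Fall back to a generic name if nothing usable is left.
    return (slug or "topic") + ".csv"


for title in ["3D", "Awesome Lists", "ASP.NET", "C++"]:
    print(topic_to_filename(title))
```

Note that this changes the file names, so files created earlier under the raw titles would not be detected by an existence check on the new names.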


In [177]:
# 2. List of Repos in the Topics

def scrape_topics_repos():
    print('Scraping list of topics')
    topics_df = scrape_topics()
    
    # make a folder and add all the scraped info of repos inside it
    
    os.makedirs('data', exist_ok=True)
    
    
    # to iterate over each row of topic title 
    for index, row in topics_df.iterrows():
        print('Scraping top repos for "{}"'.format(row['title']))
        scrape_topic('data/{}.csv'.format(row['title']), row['url'])
        
scrape_topics_repos()

Scraping list of topics
Scraping top repos for "3D"
The file data/3D.csv already exist. skipping...
Scraping top repos for "Ajax"
The file data/Ajax.csv already exist. skipping...
Scraping top repos for "Algorithm"
The file data/Algorithm.csv already exist. skipping...
Scraping top repos for "Amp"
The file data/Amp.csv already exist. skipping...
Scraping top repos for "Android"
The file data/Android.csv already exist. skipping...
Scraping top repos for "Angular"
The file data/Angular.csv already exist. skipping...
Scraping top repos for "Ansible"
The file data/Ansible.csv already exist. skipping...
Scraping top repos for "API"
The file data/API.csv already exist. skipping...
Scraping top repos for "Arduino"
The file data/Arduino.csv already exist. skipping...
Scraping top repos for "ASP.NET"
The file data/ASP.NET.csv already exist. skipping...
Scraping top repos for "Atom"
The file data/Atom.csv already exist. skipping...
Scraping top repos for "Awesome Lists"
The file data/Awesome Lis