# **Top repositories for GitHub trending topics**

**Pick a website and describe your objective**
- Browse through different sites and pick one to scrape. Check the "Project Ideas" section for inspiration.
- Identify the information you'd like to scrape from the site. Decide the format of the output CSV file.
- Summarize your project idea and outline your strategy in a Jupyter notebook.

**Outline Strategy** :
- We're going to scrape https://github.com/topics
- We'll get a list of trending topics. For each topic, we'll get the topic title, topic page URL and topic description.
- For each topic, we'll get the top repositories listed on its topic page (30 per page in this run)
- For each repository, we'll grab the repo name, username, stars and repo URL
- For each topic we'll create a CSV file
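
The per-topic CSV layout described above could look like the following. This is a minimal sketch using only the standard library; `rows_to_csv` and `FIELDNAMES` are hypothetical names chosen for illustration, and the sample row is taken from the output shown later in this notebook.

```python
import csv
import io

# Column order for each per-topic CSV file (the fields listed in the outline).
FIELDNAMES = ['username', 'repo_name', 'stars', 'repo_url']

def rows_to_csv(rows):
    """Serialize a list of dicts to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

sample_rows = [
    {'username': 'mrdoob', 'repo_name': 'three.js', 'stars': 74200,
     'repo_url': 'https://github.com/mrdoob/three.js'},
]
csv_text = rows_to_csv(sample_rows)
print(csv_text)
```

The notebook itself writes the files with pandas; the sketch only makes the intended columns explicit before any scraping code is written.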


**Use the requests library to download web pages**

In [2]:
!pip install requests --upgrade --quiet

[?25l[K     |█████▎                          | 10 kB 19.9 MB/s eta 0:00:01[K     |██████████▌                     | 20 kB 25.1 MB/s eta 0:00:01[K     |███████████████▉                | 30 kB 13.1 MB/s eta 0:00:01[K     |█████████████████████           | 40 kB 10.0 MB/s eta 0:00:01[K     |██████████████████████████▎     | 51 kB 5.2 MB/s eta 0:00:01[K     |███████████████████████████████▋| 61 kB 5.7 MB/s eta 0:00:01[K     |████████████████████████████████| 62 kB 603 kB/s 
[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
google-colab 1.0.0 requires requests~=2.23.0, but you have requests 2.26.0 which is incompatible.
datascience 0.10.6 requires folium==0.2.1, but you have folium 0.8.3 which is incompatible.

In [3]:
import requests

topic_url = 'https://github.com/topics'

response = requests.get(topic_url) # download the topics page
response.status_code # 200 means the request succeeded

200

In [4]:
len(response.text) # number of characters in the downloaded page

139848

In [5]:
page_contents = response.text
page_contents[:100] # check the first 100 characters

'\n\n<!DOCTYPE html>\n<html lang="en" data-color-mode="auto" data-light-theme="light" data-dark-theme="d'

In [6]:
with open('web_page.html', 'w', encoding='utf-8') as f: # save a local copy of the HTML file
  f.write(page_contents)

**Use Beautiful Soup to parse and extract information**

In [7]:
!pip install bs4 --upgrade --quiet

In [8]:
from bs4 import BeautifulSoup

In [9]:
doc = BeautifulSoup(page_contents , 'html.parser')

In [10]:
type(doc)

bs4.BeautifulSoup

We need three things for each topic:
- topic name (from the title tag)
- topic description
- topic URL

In [11]:
topic_title_tags = doc.find_all('p')
len(topic_title_tags)

67

In [12]:
topic_title_tags[:5]

[<p class="f4 color-text-secondary col-md-6 mx-auto">Browse popular topics on GitHub.</p>,
 <p class="f3 lh-condensed text-center Link--primary mb-0 mt-1">
         .NET
       </p>,
 <p class="f5 color-text-secondary text-center mb-0 mt-1">.NET is a free, cross-platform, open source developer platform.</p>,
 <p class="f3 lh-condensed text-center Link--primary mb-0 mt-1">
         TypeScript
       </p>,
 <p class="f5 color-text-secondary text-center mb-0 mt-1">TypeScript is a typed superset of JavaScript that compiles to plain JavaScript.</p>]

In [13]:
# Narrow the search to the <p> tags whose class attribute matches the topic-title class string
selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
topic_title_tags = doc.find_all('p',class_=selection_class)
len(topic_title_tags) # 30 topics are listed on this page

30
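
The first five `<p>` tags shown earlier carry the same classes in a different order (`... text-center Link--primary mb-0 mt-1`), which is why they are not matched here: a multi-word `class_` string is compared against the exact value of the class attribute. A small sketch on inline sample HTML, assuming `bs4` is installed, contrasts this with a CSS selector, which matches the classes in any order. The HTML snippet and the variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

# Two <p> tags sharing three classes, written in a different order (sample HTML).
html = '''
<p class="f3 lh-condensed text-center Link--primary mb-0 mt-1">.NET</p>
<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
'''
sample_doc = BeautifulSoup(html, 'html.parser')

# Exact-string match: only the tag whose class attribute is exactly this string.
exact = sample_doc.find_all('p', class_='f3 lh-condensed mb-0 mt-1 Link--primary')

# CSS selector: any <p> that carries all of the listed classes, in any order.
by_selector = sample_doc.select('p.f3.lh-condensed.Link--primary')

print([tag.text for tag in exact])
print([tag.text for tag in by_selector])
```

Here the exact-string search returns only the `3D` tag, while the selector returns both. For this notebook the exact match is what we want, since it excludes the featured topics at the top of the page.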

In [14]:
topic_title_tags[:5]

[<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Algorithm</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amp</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Android</p>]

In [15]:
# Find the topic descriptions by searching for their specific class string
desc_selector = 'f5 color-text-secondary mb-0 mt-1'
topic_desc_tags = doc.find_all('p',class_=desc_selector)
print(topic_desc_tags[0:5])
print(len(topic_desc_tags))

[<p class="f5 color-text-secondary mb-0 mt-1">
              3D modeling is the process of virtually developing the surface and structure of a 3D object.
            </p>, <p class="f5 color-text-secondary mb-0 mt-1">
              Ajax is a technique for creating interactive web applications.
            </p>, <p class="f5 color-text-secondary mb-0 mt-1">
              Algorithms are self-contained sequences that carry out a variety of tasks.
            </p>, <p class="f5 color-text-secondary mb-0 mt-1">
              Amp is a non-blocking concurrency framework for PHP.
            </p>, <p class="f5 color-text-secondary mb-0 mt-1">
              Android is an operating system built by Google designed for mobile devices.
            </p>]
30


In [16]:
# Build the full topic URL from the relative href of the first link tag
topic_link_tags = doc.find_all('a',class_='d-flex no-underline')
print(topic_link_tags[0]['href'])
topic_0_url = "https://github.com"+ topic_link_tags[0]['href']
print(topic_0_url)

/topics/3d
https://github.com/topics/3d

In [17]:
# Topic related functions for tags , description and url

#topic_title_tags[0].text # output is '3D'

topic_titles = []
for tag in topic_title_tags:
  topic_titles.append(tag.text)

print(topic_titles)

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android', 'Angular', 'Ansible', 'API', 'Arduino', 'ASP.NET', 'Atom', 'Awesome Lists', 'Amazon Web Services', 'Azure', 'Babel', 'Bash', 'Bitcoin', 'Bootstrap', 'Bot', 'C', 'Chrome', 'Chrome extension', 'Command line interface', 'Clojure', 'Code quality', 'Code review', 'Compiler', 'Continuous integration', 'COVID-19', 'C++']


In [18]:
topic_descs = []
for desc in topic_desc_tags:
  topic_descs.append(desc.text.strip())

print(topic_descs[:3])

['3D modeling is the process of virtually developing the surface and structure of a 3D object.', 'Ajax is a technique for creating interactive web applications.', 'Algorithms are self-contained sequences that carry out a variety of tasks.']


In [19]:
topic_urls = []
base_url = "https://github.com"
for url in topic_link_tags:
  topic_urls.append(base_url + url['href'])

print(topic_urls[:2])

['https://github.com/topics/3d', 'https://github.com/topics/ajax']
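
String concatenation works because every `href` on this page is a site-relative path. As a hedged alternative, the standard library's `urljoin` resolves relative paths and leaves already-absolute URLs untouched. `absolute_url` and `BASE_URL` are hypothetical names for this sketch.

```python
from urllib.parse import urljoin

BASE_URL = 'https://github.com'

def absolute_url(href, base=BASE_URL):
    """Resolve an href taken from the page against the site's base URL."""
    return urljoin(base, href)

relative_result = absolute_url('/topics/3d')
absolute_result = absolute_url('https://github.com/topics/ajax')
print(relative_result)
print(absolute_result)
```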


In [20]:
!pip install pandas --quiet
import pandas as pd

In [21]:
topics_dict = {
    'title':topic_titles,
    'description':topic_descs,
    'url':topic_urls
}
topics_df = pd.DataFrame(topics_dict)
topics_df

| | title | description | url |
| --- | --- | --- | --- |
| 0 | 3D | 3D modeling is the process of virtually develo... | https://github.com/topics/3d |
| 1 | Ajax | Ajax is a technique for creating interactive w... | https://github.com/topics/ajax |
| 2 | Algorithm | Algorithms are self-contained sequences that c... | https://github.com/topics/algorithm |
| 3 | Amp | Amp is a non-blocking concurrency framework fo... | https://github.com/topics/amphp |
| 4 | Android | Android is an operating system built by Google... | https://github.com/topics/android |
| 5 | Angular | Angular is an open source web application plat... | https://github.com/topics/angular |
| 6 | Ansible | Ansible is a simple and powerful automation en... | https://github.com/topics/ansible |
| 7 | API | An API (Application Programming Interface) is ... | https://github.com/topics/api |
| 8 | Arduino | Arduino is an open source hardware and softwar... | https://github.com/topics/arduino |
| 9 | ASP.NET | ASP.NET is a web framework for building modern... | https://github.com/topics/aspnet |


**Getting information from each individual topic**
- repository name 
- username of repository
- stars corresponding to repository
- repository url

In [22]:
topic_page_url = topic_urls[0]
topic_page_url


'https://github.com/topics/3d'

In [23]:
response = requests.get(topic_page_url)
response.status_code

200

In [24]:
len(response.text)

624993

In [25]:
topic_doc = BeautifulSoup(response.text , 'html.parser')

In [26]:
h3_selection_class = 'f3 color-text-secondary text-normal lh-condensed'
repo_tags = topic_doc.find_all('h3',class_=h3_selection_class)
len(repo_tags)
          

30

In [27]:
a_tags = repo_tags[0].find_all('a')
a_tags

[<a data-ga-click="Explore, go to repository owner, location:explore feed" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"OWNER","click_visual_representation":"REPOSITORY_OWNER_HEADING","actor_id":null,"record_id":97088,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="4bdbc49d3c05ae7f70b531fbce709a384200b0768554e0172950286a8db30940" data-view-component="true" href="/mrdoob">
             mrdoob
 </a>,
 <a class="wb-break-word text-bold" data-ga-click="Explore, go to repository, location:explore feed" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"REPOSITORY","click_visual_representation":"REPOSITORY_NAME_HEADING","actor_id":null,"record_id":576201,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="517d3d5cb9d89752156923904a4238816bc9b51ab7772f3e3644ce897d8dd4e5" data-view-component="tr

In [28]:
a_tags[0].text.strip()

'mrdoob'

In [29]:
a_tags[1].text.strip()

'three.js'

In [30]:
base_url = 'https://github.com'
repo_url = base_url + a_tags[1]['href']
repo_url

'https://github.com/mrdoob/three.js'

In [31]:
star_tags = topic_doc.find_all('a',class_= 'social-count float-none')
len(star_tags)

30

In [32]:
star_tags[0].text.strip() # giving a string output

'74.2k'

In [33]:
def parse_star_count(stars_str):
  stars_str = stars_str.strip()
  if stars_str[-1] == 'k':
    return int(float(stars_str[:-1])*1000)
  return int(stars_str)

In [34]:
parse_star_count(star_tags[0].text.strip())

74200
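
`parse_star_count` multiplies a float, which can occasionally land a hair below the intended integer before `int()` truncates it. The variant below is a sketch that uses `Decimal` to avoid that, and also accepts thousands separators. The `'m'` suffix is an assumption: only `'k'` appears in this notebook's output. `parse_count` and `MULTIPLIERS` are hypothetical names.

```python
from decimal import Decimal

# Suffix multipliers; 'm' is an assumption, only 'k' is seen in this notebook.
MULTIPLIERS = {'k': 1_000, 'm': 1_000_000}

def parse_count(text):
    """Convert strings such as '74.2k' or '1,234' to an int."""
    text = text.strip().lower().replace(',', '')
    if text and text[-1] in MULTIPLIERS:
        # Decimal keeps '74.2' exact, so the product is exactly 74200.
        return int(Decimal(text[:-1]) * MULTIPLIERS[text[-1]])
    return int(text)

print(parse_count('74.2k'))
print(parse_count('987'))
```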

In [35]:
def get_repo_info(h3_tag, star_tag):
  # Return the username, repo name, star count and URL for one repository
  a_tags = h3_tag.find_all('a')
  username = a_tags[0].text.strip()
  repo_name = a_tags[1].text.strip()
  stars = parse_star_count(star_tag.text.strip())
  repo_url = base_url + a_tags[1]['href']
  return username,repo_name,stars,repo_url

In [36]:
get_repo_info(repo_tags[0] ,star_tags[0])

('mrdoob', 'three.js', 74200, 'https://github.com/mrdoob/three.js')

In [37]:
#len(repo_tags) is 30

topic_repos_dict = {
    'username': [],
    'repo_name':[],
    'stars':[],
    'repo_url':[]

}

for i in range(0,len(repo_tags)):
  repo_info = get_repo_info(repo_tags[i] ,star_tags[i])
  topic_repos_dict['username'].append(repo_info[0])
  topic_repos_dict['repo_name'].append(repo_info[1])
  topic_repos_dict['stars'].append(repo_info[2])
  topic_repos_dict['repo_url'].append(repo_info[3])
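
The loop above walks `repo_tags` and `star_tags` by index and relies on both lists having the same length. If the page layout changes and one selector matches fewer tags, the rows would be misaligned or the loop would fail with an `IndexError`. A minimal sketch of an explicit check follows; `pair_up` is a hypothetical helper, and plain strings stand in for the parsed tags.

```python
def pair_up(repo_items, star_items):
    """Pair each repository entry with its star entry, refusing mismatched lists."""
    if len(repo_items) != len(star_items):
        raise ValueError(
            'Expected equal counts, got {} repos and {} star counts'.format(
                len(repo_items), len(star_items)))
    return list(zip(repo_items, star_items))

# Plain strings stand in for the parsed tags here.
pairs = pair_up(['three.js', 'libgdx'], ['74.2k', '18.9k'])
print(pairs)
```

In the notebook this could be used as `for h3_tag, star_tag in pair_up(repo_tags, star_tags): ...`.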

In [38]:
topic_repos_dict

{'repo_name': ['three.js',
  'libgdx',
  'react-three-fiber',
  'Babylon.js',
  'aframe',
  'tinyrenderer',
  '3d-game-shaders-for-beginners',
  'FreeCAD',
  'zdog',
  'cesium',
  '3D-Machine-Learning',
  'SpaceshipGenerator',
  'Open3D',
  'spritejs',
  'tensorspace',
  'webglstudio.js',
  'PRNet',
  'vrn',
  'BlenderGIS',
  'openscad',
  'tinyraytracer',
  'magnum',
  'model-viewer',
  'blender',
  'webgl-fundamentals',
  '3DDFA',
  'isometric-contributions',
  'rg3d',
  'L7',
  'meshlab'],
 'repo_url': ['https://github.com/mrdoob/three.js',
  'https://github.com/libgdx/libgdx',
  'https://github.com/pmndrs/react-three-fiber',
  'https://github.com/BabylonJS/Babylon.js',
  'https://github.com/aframevr/aframe',
  'https://github.com/ssloy/tinyrenderer',
  'https://github.com/lettier/3d-game-shaders-for-beginners',
  'https://github.com/FreeCAD/FreeCAD',
  'https://github.com/metafizzy/zdog',
  'https://github.com/CesiumGS/cesium',
  'https://github.com/timzhang642/3D-Machine-Learning'

In [39]:
topic_repos_df = pd.DataFrame(topic_repos_dict)

In [40]:
topic_repos_df

| | username | repo_name | stars | repo_url |
| --- | --- | --- | --- | --- |
| 0 | mrdoob | three.js | 74200 | https://github.com/mrdoob/three.js |
| 1 | libgdx | libgdx | 18900 | https://github.com/libgdx/libgdx |
| 2 | pmndrs | react-three-fiber | 15000 | https://github.com/pmndrs/react-three-fiber |
| 3 | BabylonJS | Babylon.js | 14800 | https://github.com/BabylonJS/Babylon.js |
| 4 | aframevr | aframe | 13100 | https://github.com/aframevr/aframe |
| 5 | ssloy | tinyrenderer | 11200 | https://github.com/ssloy/tinyrenderer |
| 6 | lettier | 3d-game-shaders-for-beginners | 11100 | https://github.com/lettier/3d-game-shaders-for... |
| 7 | FreeCAD | FreeCAD | 9800 | https://github.com/FreeCAD/FreeCAD |
| 8 | metafizzy | zdog | 8700 | https://github.com/metafizzy/zdog |
| 9 | CesiumGS | cesium | 7500 | https://github.com/CesiumGS/cesium |


In [41]:
topic_urls

['https://github.com/topics/3d',
 'https://github.com/topics/ajax',
 'https://github.com/topics/algorithm',
 'https://github.com/topics/amphp',
 'https://github.com/topics/android',
 'https://github.com/topics/angular',
 'https://github.com/topics/ansible',
 'https://github.com/topics/api',
 'https://github.com/topics/arduino',
 'https://github.com/topics/aspnet',
 'https://github.com/topics/atom',
 'https://github.com/topics/awesome',
 'https://github.com/topics/aws',
 'https://github.com/topics/azure',
 'https://github.com/topics/babel',
 'https://github.com/topics/bash',
 'https://github.com/topics/bitcoin',
 'https://github.com/topics/bootstrap',
 'https://github.com/topics/bot',
 'https://github.com/topics/c',
 'https://github.com/topics/chrome',
 'https://github.com/topics/chrome-extension',
 'https://github.com/topics/cli',
 'https://github.com/topics/clojure',
 'https://github.com/topics/code-quality',
 'https://github.com/topics/code-review',
 'https://github.com/topics/compil

In [42]:
# Now let's build functions that work for any topic

def get_topic_page(topic_url):
  # Download the page
  response = requests.get(topic_url)
  # Check for a successful response
  if response.status_code != 200:
    raise Exception('Failed to load page {}'.format(topic_url))
  # Parse the page using BeautifulSoup
  topic_doc = BeautifulSoup(response.text , 'html.parser')
  return topic_doc

def get_repo_info(h3_tag, star_tag):
  # Return the username, repo name, star count and URL for one repository
  a_tags = h3_tag.find_all('a')
  username = a_tags[0].text.strip()
  repo_name = a_tags[1].text.strip()
  stars = parse_star_count(star_tag.text.strip())
  repo_url = base_url + a_tags[1]['href']
  return username,repo_name,stars,repo_url
  

def get_topic_repos(topic_doc):
  
  # Get the h3 tag to get repo title , username and its urls
  h3_selection_class = 'f3 color-text-secondary text-normal lh-condensed'
  repo_tags = topic_doc.find_all('h3',class_=h3_selection_class)
  # Get class to get star tag
  star_tags = topic_doc.find_all('a',class_= 'social-count float-none')

  # Finally get all repo info

  topic_repos_dict = {
    'username': [],
    'repo_name':[],
    'stars':[],
    'repo_url':[]

  }

  for i in range(len(repo_tags)):
    repo_info = get_repo_info(repo_tags[i] ,star_tags[i])
    topic_repos_dict['username'].append(repo_info[0])
    topic_repos_dict['repo_name'].append(repo_info[1])
    topic_repos_dict['stars'].append(repo_info[2])
    topic_repos_dict['repo_url'].append(repo_info[3])
  return pd.DataFrame(topic_repos_dict)
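
`get_topic_page` raises on any status other than 200, so a single transient failure would abort a run over many topic pages. One possible mitigation is a small retry wrapper with a growing pause between attempts. This is a sketch under stated assumptions: `fetch_with_retry` is a hypothetical helper, the retry count and delay are arbitrary, and the stand-in `flaky_fetch` only simulates failures so the example runs without network access.

```python
import time

def fetch_with_retry(fetch, url, retries=3, delay_seconds=1.0):
    """Call fetch(url) until it succeeds or the attempts run out.

    `fetch` is any callable that returns a result or raises an exception.
    """
    if retries < 1:
        raise ValueError('retries must be at least 1')
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return fetch(url)
        except Exception as error:  # broad on purpose: this is only a sketch
            last_error = error
            if attempt < retries:
                time.sleep(delay_seconds * attempt)  # wait longer each time
    raise last_error

# Stand-in fetch function that fails twice, then succeeds.
calls = []
def flaky_fetch(url):
    calls.append(url)
    if len(calls) < 3:
        raise Exception('Failed to load page {}'.format(url))
    return 'parsed page for ' + url

result = fetch_with_retry(flaky_fetch, 'https://github.com/topics/3d', delay_seconds=0)
print(result)
```

With the notebook's own function, the call would be `fetch_with_retry(get_topic_page, topic_url)`.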

In [43]:
# Helper function: scrape one topic page and save its repositories to a CSV file
import os
def scrape_topic(topic_url , path):
  if os.path.exists(path):
    print("The file {} already exists....skipping.. ".format(path))
    return
  topic_df = get_topic_repos(get_topic_page(topic_url))
  topic_df.to_csv(path , index=False)

In [44]:
topic_url

'https://github.com/topics'

In [45]:
get_topic_repos(get_topic_page(topic_urls[0]))


| | username | repo_name | stars | repo_url |
| --- | --- | --- | --- | --- |
| 0 | mrdoob | three.js | 74200 | https://github.com/mrdoob/three.js |
| 1 | libgdx | libgdx | 18900 | https://github.com/libgdx/libgdx |
| 2 | pmndrs | react-three-fiber | 15000 | https://github.com/pmndrs/react-three-fiber |
| 3 | BabylonJS | Babylon.js | 14800 | https://github.com/BabylonJS/Babylon.js |
| 4 | aframevr | aframe | 13100 | https://github.com/aframevr/aframe |
| 5 | ssloy | tinyrenderer | 11200 | https://github.com/ssloy/tinyrenderer |
| 6 | lettier | 3d-game-shaders-for-beginners | 11100 | https://github.com/lettier/3d-game-shaders-for... |
| 7 | FreeCAD | FreeCAD | 9800 | https://github.com/FreeCAD/FreeCAD |
| 8 | metafizzy | zdog | 8700 | https://github.com/metafizzy/zdog |
| 9 | CesiumGS | cesium | 7500 | https://github.com/CesiumGS/cesium |


# **Scaling the code**

Write a single function to
- Get the list of topics from the topics page
- Get the list of top repositories from each individual topic page
- For each topic, create a CSV file of its top repositories containing
  - repository name, username, stars and repository URL

In [46]:
def get_topic_titles(doc):
  selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
  topic_title_tags = doc.find_all('p',class_=selection_class)
  topic_titles = []
  for tag in topic_title_tags:
    topic_titles.append(tag.text)
  return topic_titles

def get_topic_descs(doc):
  desc_selector = 'f5 color-text-secondary mb-0 mt-1'
  topic_desc_tags = doc.find_all('p',class_=desc_selector)
  topic_descs = []
  for desc in topic_desc_tags:
    topic_descs.append(desc.text.strip())
  return topic_descs

def get_topic_urls(doc):
  topic_link_tags = doc.find_all('a',class_='d-flex no-underline')
  topic_urls = []
  base_url = "https://github.com"
  for url in topic_link_tags:
    topic_urls.append(base_url + url['href'])
  return topic_urls


def scrape_topics():
  topics_url = 'https://github.com/topics'
  response = requests.get(topics_url)
  # Check for a successful response
  if response.status_code != 200:
    raise Exception('Failed to load page {}'.format(topics_url))
  # Parse the freshly downloaded page instead of relying on the global `doc`
  doc = BeautifulSoup(response.text , 'html.parser')
  topics_dict = {
      'title':get_topic_titles(doc),
      'description':get_topic_descs(doc),
      'url':get_topic_urls(doc)
  }
  return pd.DataFrame(topics_dict)


In [47]:
scrape_topics()

| | title | description | url |
| --- | --- | --- | --- |
| 0 | 3D | 3D modeling is the process of virtually develo... | https://github.com/topics/3d |
| 1 | Ajax | Ajax is a technique for creating interactive w... | https://github.com/topics/ajax |
| 2 | Algorithm | Algorithms are self-contained sequences that c... | https://github.com/topics/algorithm |
| 3 | Amp | Amp is a non-blocking concurrency framework fo... | https://github.com/topics/amphp |
| 4 | Android | Android is an operating system built by Google... | https://github.com/topics/android |
| 5 | Angular | Angular is an open source web application plat... | https://github.com/topics/angular |
| 6 | Ansible | Ansible is a simple and powerful automation en... | https://github.com/topics/ansible |
| 7 | API | An API (Application Programming Interface) is ... | https://github.com/topics/api |
| 8 | Arduino | Arduino is an open source hardware and softwar... | https://github.com/topics/arduino |
| 9 | ASP.NET | ASP.NET is a web framework for building modern... | https://github.com/topics/aspnet |


In [48]:
import os

def scrape_topics_repos():
  print("Scraping list of topics from Github")
  topics_df = scrape_topics()
  # Creating a folder
  os.makedirs('data',exist_ok=True)

  for index , row in topics_df.iterrows(): # Looping over rows in pandas data frame
    print("Scraping top repository for {}".format(row['title']))
    scrape_topic(row['url'],'data/{}.csv'.format(row['title']))
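
The file name above is built directly from the topic title with `'data/{}.csv'.format(row['title'])`. That works for the titles seen in this run, but a title containing a path separator would point into a folder that does not exist. A hedged sketch of a sanitizing helper follows; `topic_csv_path` is a hypothetical name, and the set of characters kept is an arbitrary choice. Note that it also replaces spaces, so the resulting names differ from the ones the notebook currently writes.

```python
import re

def topic_csv_path(title, folder='data'):
    """Build a CSV path from a topic title, replacing risky characters."""
    # Keep letters, digits, dot, underscore, plus and hyphen; collapse the rest.
    safe_name = re.sub(r'[^A-Za-z0-9._+-]+', '_', title).strip('_') or 'topic'
    return '{}/{}.csv'.format(folder, safe_name)

print(topic_csv_path('C++'))
print(topic_csv_path('Chrome extension'))
print(topic_csv_path('CI/CD'))  # made-up title containing a slash
```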

In [49]:
topics_df

| | title | description | url |
| --- | --- | --- | --- |
| 0 | 3D | 3D modeling is the process of virtually develo... | https://github.com/topics/3d |
| 1 | Ajax | Ajax is a technique for creating interactive w... | https://github.com/topics/ajax |
| 2 | Algorithm | Algorithms are self-contained sequences that c... | https://github.com/topics/algorithm |
| 3 | Amp | Amp is a non-blocking concurrency framework fo... | https://github.com/topics/amphp |
| 4 | Android | Android is an operating system built by Google... | https://github.com/topics/android |
| 5 | Angular | Angular is an open source web application plat... | https://github.com/topics/angular |
| 6 | Ansible | Ansible is a simple and powerful automation en... | https://github.com/topics/ansible |
| 7 | API | An API (Application Programming Interface) is ... | https://github.com/topics/api |
| 8 | Arduino | Arduino is an open source hardware and softwar... | https://github.com/topics/arduino |
| 9 | ASP.NET | ASP.NET is a web framework for building modern... | https://github.com/topics/aspnet |


In [52]:
scrape_topics_repos()

Scraping list of topics from Github
Scraping top repository for 3D
The file data/3D.csv already exists....skipping.. 
Scraping top repository for Ajax
The file data/Ajax.csv already exists....skipping.. 
Scraping top repository for Algorithm
The file data/Algorithm.csv already exists....skipping.. 
Scraping top repository for Amp
The file data/Amp.csv already exists....skipping.. 
Scraping top repository for Android
The file data/Android.csv already exists....skipping.. 
Scraping top repository for Angular
The file data/Angular.csv already exists....skipping.. 
Scraping top repository for Ansible
The file data/Ansible.csv already exists....skipping.. 
Scraping top repository for API
The file data/API.csv already exists....skipping.. 
Scraping top repository for Arduino
The file data/Arduino.csv already exists....skipping.. 
Scraping top repository for ASP.NET
The file data/ASP.NET.csv already exists....skipping.. 
Scraping top repository for Atom
The file data/Atom.csv already exists..