# Top Repositories for GitHub Topics: Python Web Scraping Project

# ## Pick a website and describe your objective
- Browse through different sites and pick one to scrape.
- Check the "Project Ideas" section for inspiration.
- Identify the information you'd like to scrape from the site. Decide the format of the output CSV file.
- Summarize your project idea and outline your strategy in a Jupyter notebook. Use the "New" button above.

# ## Project outline:
- We're going to scrape https://github.com/topics.
- We'll get a list of topics. For each topic, we'll get the topic title, topic page URL, and topic description.
- For each topic, we'll get the top repositories listed on its topic page (the page fetched below lists 30).
- For each repository, we'll grab the repo name, username, stars, and repo URL.
- For each topic, we'll create a CSV file in the following format:
 
  Repo name,username,stars,Repo URL
  
  infinite-scroll,metafizzy,7000,https://github.com/metafizzy/infinite-scroll
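
As a minimal sketch of the target format above, the standard `csv` module can write the header and the sample row. The in-memory buffer stands in for a real file, and the variable names are illustrative rather than part of the project code.

```python
import csv
import io

# Column order taken from the outline above.
header = ["Repo name", "username", "stars", "Repo URL"]
# Sample row taken from the outline above.
row = ["infinite-scroll", "metafizzy", 7000,
       "https://github.com/metafizzy/infinite-scroll"]

# The buffer stands in for open("topic.csv", "w", newline="").
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(header)
writer.writerow(row)

csv_text = buffer.getvalue()
print(csv_text)
```

Using `csv.writer` instead of joining strings by hand means values that contain commas are quoted automatically.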

# ## Use the requests library to download web pages
- Inspect the website's HTML source and identify the right URLs to download.
- Download and save web pages locally using the requests library.
- Create a function to automate downloading for different topics/search queries.
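
The download-and-save step can be wrapped in one function. The sketch below is one possible shape, not the notebook's own code: `download_page` and its `fetch` parameter are hypothetical names, and `fetch` exists only so the function can be tried without network access.

```python
from pathlib import Path


def download_page(url, path, fetch=None):
    """Download `url`, save the HTML to `path`, and return the number of characters saved.

    `fetch` is a hypothetical hook: any callable that takes a URL and returns
    an object with `status_code` and `text`, like requests.get does.
    """
    if fetch is None:
        import requests  # imported lazily so the sketch loads without requests
        fetch = requests.get
    response = fetch(url)
    # Stop early on anything other than a successful response.
    if response.status_code != 200:
        raise RuntimeError("Failed to load page {}".format(url))
    Path(path).write_text(response.text, encoding="utf-8")
    return len(response.text)
```

With the requests library installed, `download_page('https://github.com/topics', 'webpage.html')` would fetch and save the topics page in one call.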

In [1]:
#request
!pip install jovian --upgrade --quiet

In [2]:
import jovian

In [3]:
!pip install requests --upgrade --quiet

In [4]:
import requests

In [5]:
topic_url='https://github.com/topics'

In [6]:
response=requests.get(topic_url)

In [7]:
response.status_code

200

In [8]:
page_contents=response.text

In [9]:
len(page_contents)

146850

In [10]:
page_contents[:100]

'\n\n<!DOCTYPE html>\n<html lang="en" data-color-mode="auto" data-light-theme="light" data-dark-theme="d'

In [11]:
with open("webpage.html",'w') as f:
    f.write(page_contents)

# ## Use Beautiful Soup to parse and extract information
- Parse and explore the structure of downloaded web pages using Beautiful soup.
- Use the right properties and methods to extract the required information.
- Create functions to extract from the page into lists and dictionaries.
- (Optional) Use a REST API to acquire additional information if required.
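
Before working on the real page, the steps above can be tried on a small fragment. This is a sketch under stated assumptions: the HTML and its class names (`topic-link`, `topic-title`) are made up for illustration and do not match GitHub's markup.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment; the class names are invented for this example.
sample_html = """
<div>
  <a class="topic-link" href="/topics/3d"><p class="topic-title">3D</p></a>
  <a class="topic-link" href="/topics/ajax"><p class="topic-title">Ajax</p></a>
</div>
"""

sample_doc = BeautifulSoup(sample_html, "html.parser")

# find_all returns every tag that matches the tag name and attributes.
title_tags = sample_doc.find_all("p", {"class": "topic-title"})
link_tags = sample_doc.find_all("a", {"class": "topic-link"})

# Collect the extracted values into a list of dictionaries.
sample_topics = [
    {"title": p.text.strip(), "href": a["href"]}
    for p, a in zip(title_tags, link_tags)
]
print(sample_topics)
```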

In [12]:
# Install Beautiful Soup for parsing and extracting information
!pip install beautifulsoup4 --upgrade --quiet

In [13]:
from bs4 import BeautifulSoup
doc = BeautifulSoup(page_contents, 'html.parser')

In [14]:
topic_title_tags=doc.find_all('p',{'class':'f3 lh-condensed mb-0 mt-1 Link--primary'})

In [15]:
topic_title_tags

[<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Algorithm</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amp</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Android</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Angular</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ansible</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">API</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Arduino</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">ASP.NET</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Atom</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Awesome Lists</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amazon Web Services</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Azure</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Babel</p>,
 <p class="f3 lh-condensed m

In [16]:
topic_desc_tags=doc.find_all('p',{'class':'f5 color-fg-muted mb-0 mt-1'})

In [17]:
topic_desc_tags

[<p class="f5 color-fg-muted mb-0 mt-1">
               3D modeling is the process of virtually developing the surface and structure of a 3D object.
             </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
               Ajax is a technique for creating interactive web applications.
             </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
               Algorithms are self-contained sequences that carry out a variety of tasks.
             </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
               Amp is a non-blocking concurrency framework for PHP.
             </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
               Android is an operating system built by Google designed for mobile devices.
             </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
               Angular is an open source web application platform.
             </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
               Ansible is a simple and powerful automation engine.
             </p>,
 <p class="

In [18]:
topic_url_tags=doc.find_all('a',{'class':'d-flex no-underline'})

In [19]:
topic_url_tags[0]['href']

'/topics/3d'

In [20]:
 
topic_titles=[]
for tag in topic_title_tags:
    topic_titles.append(tag.text)

In [21]:
topic_titles

['3D',
 'Ajax',
 'Algorithm',
 'Amp',
 'Android',
 'Angular',
 'Ansible',
 'API',
 'Arduino',
 'ASP.NET',
 'Atom',
 'Awesome Lists',
 'Amazon Web Services',
 'Azure',
 'Babel',
 'Bash',
 'Bitcoin',
 'Bootstrap',
 'Bot',
 'C',
 'Chrome',
 'Chrome extension',
 'Command line interface',
 'Clojure',
 'Code quality',
 'Code review',
 'Compiler',
 'Continuous integration',
 'COVID-19',
 'C++']

In [22]:
topic_descs=[]
for tag in topic_desc_tags:
    topic_descs.append(tag.text.strip())

In [23]:
topic_descs

['3D modeling is the process of virtually developing the surface and structure of a 3D object.',
 'Ajax is a technique for creating interactive web applications.',
 'Algorithms are self-contained sequences that carry out a variety of tasks.',
 'Amp is a non-blocking concurrency framework for PHP.',
 'Android is an operating system built by Google designed for mobile devices.',
 'Angular is an open source web application platform.',
 'Ansible is a simple and powerful automation engine.',
 'An API (Application Programming Interface) is a collection of protocols and subroutines for building software.',
 'Arduino is an open source hardware and software company and maker community.',
 'ASP.NET is a web framework for building modern web apps and services.',
 'Atom is a open source text editor built with web technologies.',
 'An awesome list is a list of awesome things curated by the community.',
 'Amazon Web Services provides on-demand cloud computing platforms on a subscription basis.',
 'A

In [24]:
topic_urls=[]
base_url='https://github.com/'
for tag in topic_url_tags:
    topic_urls.append(base_url + tag['href'])

In [25]:
topic_urls

['https://github.com//topics/3d',
 'https://github.com//topics/ajax',
 'https://github.com//topics/algorithm',
 'https://github.com//topics/amphp',
 'https://github.com//topics/android',
 'https://github.com//topics/angular',
 'https://github.com//topics/ansible',
 'https://github.com//topics/api',
 'https://github.com//topics/arduino',
 'https://github.com//topics/aspnet',
 'https://github.com//topics/atom',
 'https://github.com//topics/awesome',
 'https://github.com//topics/aws',
 'https://github.com//topics/azure',
 'https://github.com//topics/babel',
 'https://github.com//topics/bash',
 'https://github.com//topics/bitcoin',
 'https://github.com//topics/bootstrap',
 'https://github.com//topics/bot',
 'https://github.com//topics/c',
 'https://github.com//topics/chrome',
 'https://github.com//topics/chrome-extension',
 'https://github.com//topics/cli',
 'https://github.com//topics/clojure',
 'https://github.com//topics/code-quality',
 'https://github.com//topics/code-review',
 'https:
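
The URLs listed above contain a double slash after the host because `base_url` ends with `/` and every `href` starts with `/`. A minimal sketch of one way to avoid this is `urllib.parse.urljoin` from the standard library in place of string concatenation. The variable names here are illustrative.

```python
from urllib.parse import urljoin

site_root = "https://github.com/"
hrefs = ["/topics/3d", "/topics/ajax"]  # sample values from the output above

# urljoin resolves each href against the site root, so no slash is duplicated.
clean_urls = [urljoin(site_root, href) for href in hrefs]
print(clean_urls)
```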

# ## Create CSV file(s) with the extracted information
- Create functions for the end-to-end process of downloading, parsing, and saving CSVs.
- Execute the function with different inputs to create a dataset of CSV files.
- Verify the information in the CSV files by reading them back using Pandas.
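
The read-back check can be sketched on a tiny DataFrame. The sample values and the in-memory buffer are stand-ins for the real data and file. Passing `index=False` keeps the row index out of the CSV; when the index is written, `read_csv` typically shows it as an extra `Unnamed: 0` column.

```python
import io

import pandas as pd

# Small stand-in for the scraped data.
sample_df = pd.DataFrame({
    "title": ["3D", "Ajax"],
    "urls": ["https://github.com/topics/3d", "https://github.com/topics/ajax"],
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
sample_df.to_csv(buffer, index=False)

# Rewind and read the CSV back to verify the contents.
buffer.seek(0)
round_trip_df = pd.read_csv(buffer)
print(round_trip_df)
```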

In [26]:
#to create csv files use pandas
!pip install pandas --upgrade --quiet

In [27]:
import pandas as pd

In [28]:
topic_dict={'title':topic_titles,
            'description':topic_descs,
             'urls':topic_urls}

In [29]:
topic_df=pd.DataFrame(topic_dict)

In [30]:
topic_df

|   | title | description | urls |
| --- | --- | --- | --- |
| 0 | 3D | 3D modeling is the process of virtually develo... | https://github.com//topics/3d |
| 1 | Ajax | Ajax is a technique for creating interactive w... | https://github.com//topics/ajax |
| 2 | Algorithm | Algorithms are self-contained sequences that c... | https://github.com//topics/algorithm |
| 3 | Amp | Amp is a non-blocking concurrency framework fo... | https://github.com//topics/amphp |
| 4 | Android | Android is an operating system built by Google... | https://github.com//topics/android |
| 5 | Angular | Angular is an open source web application plat... | https://github.com//topics/angular |
| 6 | Ansible | Ansible is a simple and powerful automation en... | https://github.com//topics/ansible |
| 7 | API | An API (Application Programming Interface) is ... | https://github.com//topics/api |
| 8 | Arduino | Arduino is an open source hardware and softwar... | https://github.com//topics/arduino |
| 9 | ASP.NET | ASP.NET is a web framework for building modern... | https://github.com//topics/aspnet |


In [31]:
# Save the topics to a CSV file (index=False omits the row index)
topic_df.to_csv('topics.csv',index=False)

# Getting information out of a topic page

In [32]:
topic_page_url=topic_urls[0]

In [33]:
topic_page_url

'https://github.com//topics/3d'

In [34]:
response=requests.get(topic_page_url)

In [35]:
response.status_code

200

In [36]:
len(response.text)

637818

In [37]:
topic_doc = BeautifulSoup(response.text, 'html.parser')

In [38]:
repo_tags=topic_doc.find_all('h3',{'class':'f3 color-fg-muted text-normal lh-condensed'})

In [39]:
len(repo_tags)

30

In [40]:
a_tags=repo_tags[0].find_all('a')

In [41]:
a_tags

[<a data-ga-click="Explore, go to repository owner, location:explore feed" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"OWNER","click_visual_representation":"REPOSITORY_OWNER_HEADING","actor_id":null,"record_id":97088,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="4bdbc49d3c05ae7f70b531fbce709a384200b0768554e0172950286a8db30940" data-view-component="true" href="/mrdoob">
             mrdoob
 </a>,
 <a class="text-bold wb-break-word" data-ga-click="Explore, go to repository, location:explore feed" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"REPOSITORY","click_visual_representation":"REPOSITORY_NAME_HEADING","actor_id":null,"record_id":576201,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="517d3d5cb9d89752156923904a4238816bc9b51ab7772f3e3644ce897d8dd4e5" data-view-component="tr

In [42]:
#username
a_tags[0].text.strip()

'mrdoob'

In [43]:
#reponame
a_tags[1].text.strip()

'three.js'

In [44]:
a_tags[1]['href']

'/mrdoob/three.js'

In [45]:
repo_url=base_url + a_tags[1]['href']

In [46]:
repo_url

'https://github.com//mrdoob/three.js'

In [47]:
star_tags=topic_doc.find_all('a',{'class':'social-count float-none'})

In [48]:
len(star_tags)

30

In [49]:
star_tags[0].text.strip()

'75.4k'

In [50]:
def parse_star_count(stars_str):
    stars_str=stars_str.strip()
    if stars_str[-1]=='k':
        return int(float(stars_str[:-1]) * 1000)
    return int(stars_str)
    

In [51]:
parse_star_count(star_tags[0].text.strip())

75400
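
`parse_star_count` handles the `k` suffix seen above. As a hedged sketch, a variant could also accept thousands separators and an `m` suffix. Neither appears in the scraped output here, so both are assumptions, and `parse_count` is a hypothetical name.

```python
def parse_count(text):
    """Convert strings such as '75.4k', '1,234' or '1.2m' to an int."""
    text = text.strip().lower().replace(",", "")
    # The 'm' suffix is an assumption; only 'k' appears in this notebook.
    multipliers = {"k": 1_000, "m": 1_000_000}
    if text and text[-1] in multipliers:
        # round() guards against float products landing just below an integer.
        return round(float(text[:-1]) * multipliers[text[-1]])
    return int(text)


print(parse_count("75.4k"))
```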

In [52]:
def get_repo_info(h3_tag,star_tag):
    a_tags=h3_tag.find_all('a')
    username=a_tags[0].text.strip()
    repo_name=a_tags[1].text.strip()
    repo_url=base_url + a_tags[1]['href']
    stars=parse_star_count(star_tag.text.strip())
    return username,repo_name,stars,repo_url

In [53]:
get_repo_info(repo_tags[0],star_tags[0])

('mrdoob', 'three.js', 75400, 'https://github.com//mrdoob/three.js')

In [54]:
topic_repos_dict={
    'username':[],
    'repo_name':[],
    'stars':[],
    'repo_url':[]
}
for i in range(len(repo_tags)):
    repo_info=get_repo_info(repo_tags[i],star_tags[i])
    topic_repos_dict['username'].append(repo_info[0])
    topic_repos_dict['repo_name'].append(repo_info[1])
    topic_repos_dict['stars'].append(repo_info[2])
    topic_repos_dict['repo_url'].append(repo_info[3])
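
The loop above pairs `repo_tags[i]` with `star_tags[i]` by index, which assumes both lists have the same length and order. Below is a minimal sketch of the same pairing with `zip` and an explicit length check. It uses plain tuples in place of parsed tags, and `build_columns` is a hypothetical helper.

```python
def build_columns(repo_infos, star_counts):
    """Pair (username, repo_name) tuples with star counts into column lists."""
    if len(repo_infos) != len(star_counts):
        raise ValueError(
            "Got {} repositories but {} star counts".format(
                len(repo_infos), len(star_counts)))
    columns = {"username": [], "repo_name": [], "stars": []}
    # zip walks both lists in step, so no index arithmetic is needed.
    for (username, repo_name), stars in zip(repo_infos, star_counts):
        columns["username"].append(username)
        columns["repo_name"].append(repo_name)
        columns["stars"].append(stars)
    return columns


# Sample values from this notebook's 3D topic results.
columns = build_columns([("mrdoob", "three.js"), ("libgdx", "libgdx")],
                        [75400, 19200])
print(columns)
```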
    

In [55]:
topic_repos_dict

{'username': ['mrdoob',
  'libgdx',
  'pmndrs',
  'BabylonJS',
  'aframevr',
  'ssloy',
  'lettier',
  'FreeCAD',
  'metafizzy',
  'CesiumGS',
  'timzhang642',
  'a1studmuffin',
  'isl-org',
  'spritejs',
  'domlysz',
  'tensorspace-team',
  'jagenjo',
  'YadiraF',
  'AaronJackson',
  'openscad',
  'ssloy',
  'mosra',
  'blender',
  'google',
  'gfxfundamentals',
  'cleardusk',
  'jasonlong',
  'rg3dengine',
  'cnr-isti-vclab',
  'antvis'],
 'repo_name': ['three.js',
  'libgdx',
  'react-three-fiber',
  'Babylon.js',
  'aframe',
  'tinyrenderer',
  '3d-game-shaders-for-beginners',
  'FreeCAD',
  'zdog',
  'cesium',
  '3D-Machine-Learning',
  'SpaceshipGenerator',
  'Open3D',
  'spritejs',
  'BlenderGIS',
  'tensorspace',
  'webglstudio.js',
  'PRNet',
  'vrn',
  'openscad',
  'tinyraytracer',
  'magnum',
  'blender',
  'model-viewer',
  'webgl-fundamentals',
  '3DDFA',
  'isometric-contributions',
  'rg3d',
  'meshlab',
  'L7'],
 'stars': [75400,
  19200,
  15500,
  15200,
  13200,
  1

In [56]:
topic_repos_df=pd.DataFrame(topic_repos_dict)

In [57]:
#this is only for '3d' topic
topic_repos_df

|   | username | repo_name | stars | repo_url |
| --- | --- | --- | --- | --- |
| 0 | mrdoob | three.js | 75400 | https://github.com//mrdoob/three.js |
| 1 | libgdx | libgdx | 19200 | https://github.com//libgdx/libgdx |
| 2 | pmndrs | react-three-fiber | 15500 | https://github.com//pmndrs/react-three-fiber |
| 3 | BabylonJS | Babylon.js | 15200 | https://github.com//BabylonJS/Babylon.js |
| 4 | aframevr | aframe | 13200 | https://github.com//aframevr/aframe |
| 5 | ssloy | tinyrenderer | 11500 | https://github.com//ssloy/tinyrenderer |
| 6 | lettier | 3d-game-shaders-for-beginners | 11400 | https://github.com//lettier/3d-game-shaders-fo... |
| 7 | FreeCAD | FreeCAD | 10100 | https://github.com//FreeCAD/FreeCAD |
| 8 | metafizzy | zdog | 8800 | https://github.com//metafizzy/zdog |
| 9 | CesiumGS | cesium | 7600 | https://github.com//CesiumGS/cesium |


In [98]:
# Let us do this for all topics
import os
def get_topic_page(topic_url):
    # Download the page
    response=requests.get(topic_url)
    # Check for a successful response before parsing
    if response.status_code !=200:
        raise Exception('Failed to load page {}'.format(topic_url))
    # Parse using Beautiful Soup
    topic_doc = BeautifulSoup(response.text, 'html.parser')
    return topic_doc

def get_repo_info(h3_tag,star_tag):
    a_tags=h3_tag.find_all('a')
    username=a_tags[0].text.strip()
    repo_name=a_tags[1].text.strip()
    repo_url=base_url + a_tags[1]['href']
    stars=parse_star_count(star_tag.text.strip())
    return username,repo_name,stars,repo_url

def get_topic_repos(topic_doc):

    #get h3 tags containing username,repo_name and repo_url
    repo_tags=topic_doc.find_all('h3',{'class':'f3 color-fg-muted text-normal lh-condensed'})
    #get star tags
    star_tags=topic_doc.find_all('a',{'class':'social-count float-none'})
    
    #Get repo info
    
    topic_repos_dict={
    'username':[],
    'repo_name':[],
    'stars':[],
    'repo_url':[]
            }
    for i in range(len(repo_tags)):
        repo_info=get_repo_info(repo_tags[i],star_tags[i])
        topic_repos_dict['username'].append(repo_info[0])
        topic_repos_dict['repo_name'].append(repo_info[1])
        topic_repos_dict['stars'].append(repo_info[2])
        topic_repos_dict['repo_url'].append(repo_info[3])
    return pd.DataFrame(topic_repos_dict)

def scrape_topic(topic_url,topic_name):
    # Build the output file name once and reuse it below
    fname=topic_name + '.csv'
    if os.path.exists(fname):
        print('The file {} already exists. Skipping...'.format(fname))
        return
    topic_df=get_topic_repos(get_topic_page(topic_url))
    topic_df.to_csv(fname,index=None)
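
`scrape_topic` builds the file name directly from the topic name. The titles shown in this notebook are harmless, but a title containing a path separator would produce an unexpected path. The sketch below shows one conservative way to derive a file name stem; `safe_filename` is a hypothetical helper and the set of allowed characters is an assumption.

```python
import re


def safe_filename(topic_title):
    """Turn a topic title into a conservative file name stem."""
    # Replace each run of characters outside the allowed set with one underscore.
    stem = re.sub(r"[^A-Za-z0-9._+-]+", "_", topic_title.strip())
    # Fall back to a fixed name if nothing usable is left.
    return stem.strip("_") or "topic"


print(safe_filename("Command line interface"))
```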

In [99]:
url4=topic_urls[4]

In [100]:
topic4_doc=get_topic_page(url4)

In [101]:
get_topic_repos(topic4_doc)

|   | username | repo_name | stars | repo_url |
| --- | --- | --- | --- | --- |
| 0 | flutter | flutter | 132000 | https://github.com//flutter/flutter |
| 1 | justjavac | free-programming-books-zh_CN | 84000 | https://github.com//justjavac/free-programming... |
| 2 | Genymobile | scrcpy | 56500 | https://github.com//Genymobile/scrcpy |
| 3 | Hack-with-Github | Awesome-Hacking | 47500 | https://github.com//Hack-with-Github/Awesome-H... |
| 4 | google | material-design-icons | 44200 | https://github.com//google/material-design-icons |
| 5 | wasabeef | awesome-android-ui | 41600 | https://github.com//wasabeef/awesome-android-ui |
| 6 | square | okhttp | 41100 | https://github.com//square/okhttp |
| 7 | android | architecture-samples | 39700 | https://github.com//android/architecture-samples |
| 8 | square | retrofit | 39000 | https://github.com//square/retrofit |
| 9 | Solido | awesome-flutter | 38000 | https://github.com//Solido/awesome-flutter |


In [102]:
#put it all together
def get_topic_titles(doc):
    topic_title_tags=doc.find_all('p',{'class':'f3 lh-condensed mb-0 mt-1 Link--primary'})
    topic_titles=[]
    for tag in topic_title_tags:
        topic_titles.append(tag.text)
    return topic_titles  

def get_topic_descs(doc):
    topic_desc_tags=doc.find_all('p',{'class':'f5 color-fg-muted mb-0 mt-1'})
    topic_descs=[]
    for tag in topic_desc_tags:
        topic_descs.append(tag.text.strip())
    return topic_descs    
def get_topic_urls(doc):
    topic_url_tags=doc.find_all('a',{'class':'d-flex no-underline'})
    topic_urls=[]
    base_url='https://github.com/'
    for tag in topic_url_tags:
        topic_urls.append(base_url + tag['href'])
    return topic_urls   
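def scrape_topics():
    topics_url='https://github.com/topics'
    response=requests.get(topics_url)
    # Check for a successful response before parsing
    if response.status_code !=200:
        raise Exception('Failed to load page {}'.format(topics_url))
    # Parse the page fetched here instead of relying on the global doc
    doc=BeautifulSoup(response.text,'html.parser')
    topics_dict={'title':get_topic_titles(doc),
                  'description':get_topic_descs(doc),
                   'urls':get_topic_urls(doc)}
    return pd.DataFrame(topics_dict)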

In [103]:
scrape_topics()

|   | title | description | urls |
| --- | --- | --- | --- |
| 0 | 3D | 3D modeling is the process of virtually develo... | https://github.com//topics/3d |
| 1 | Ajax | Ajax is a technique for creating interactive w... | https://github.com//topics/ajax |
| 2 | Algorithm | Algorithms are self-contained sequences that c... | https://github.com//topics/algorithm |
| 3 | Amp | Amp is a non-blocking concurrency framework fo... | https://github.com//topics/amphp |
| 4 | Android | Android is an operating system built by Google... | https://github.com//topics/android |
| 5 | Angular | Angular is an open source web application plat... | https://github.com//topics/angular |
| 6 | Ansible | Ansible is a simple and powerful automation en... | https://github.com//topics/ansible |
| 7 | API | An API (Application Programming Interface) is ... | https://github.com//topics/api |
| 8 | Arduino | Arduino is an open source hardware and softwar... | https://github.com//topics/arduino |
| 9 | ASP.NET | ASP.NET is a web framework for building modern... | https://github.com//topics/aspnet |


In [112]:
def scrape_topics_repos():
    topics_df=scrape_topics()
    os.makedirs('data', exist_ok=True)
    for index,row in topics_df.iterrows():
        print('Scraping top repositories for "{}" '.format(row['title']))
        # The URL column is named 'urls' (see scrape_topics); row['url'] raises KeyError: 'url'.
        # scrape_topic appends '.csv' itself, so pass the path without the extension.
        scrape_topic(row['urls'], 'data/{}'.format(row['title']))

In [113]:
scrape_topics_repos()

Scraping top repositories for "3D" 