## Pick a website and describe the objective

- Browse through different sites and pick one to scrape.
- Identify the information to scrape from the site. Decide the format of the output CSV file.
- Summarize your project idea and outline your strategy.

Road map:
- Website used for scraping: https://github.com/topics
- Collect a list of topics, extracting the topic title, topic page URL, and topic description for each topic.
- Extract the top 30 repositories listed on each topic page.
- Collect the repository name, username, stars, and repository URL for each repository.
- Create a separate CSV file for each topic in the format below:
   
   ```Repo Name,Username,Stars,Repo URL```
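
The target layout above can be sketched with the standard `csv` module. This is only an illustration of the header and column order; `rows_to_csv` is a hypothetical helper and the sample row is a placeholder, not scraped data.

```python
import csv
import io

# Column order taken from the format line above.
HEADER = ["Repo Name", "Username", "Stars", "Repo URL"]

def rows_to_csv(rows):
    """Serialize (repo_name, username, stars, repo_url) tuples to CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(HEADER)
    writer.writerows(rows)
    return buffer.getvalue()

# Placeholder row for illustration only.
sample = [
    ("example-repo", "example-user", 1200,
     "https://github.com/example-user/example-repo"),
]
print(rows_to_csv(sample))
```

Using `csv.writer` rather than joining strings by hand means values that contain commas are quoted automatically.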


## Use the requests library to download the web pages

In [1]:
import requests

In [2]:
topics_url="https://github.com/topics"

In [3]:
response = requests.get(topics_url)

In [4]:
response.status_code

200

In [5]:
len(response.text)

189827

In [6]:
page_contents=response.text

In [7]:
page_contents[:1000]

'\n\n<!DOCTYPE html>\n<html lang="en" data-color-mode="auto" data-light-theme="light" data-dark-theme="dark">\n  <head>\n    <meta charset="utf-8">\n  <link rel="dns-prefetch" href="https://github.githubassets.com">\n  <link rel="dns-prefetch" href="https://avatars.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">\n  <link rel="dns-prefetch" href="https://user-images.githubusercontent.com/">\n  <link rel="preconnect" href="https://github.githubassets.com" crossorigin>\n  <link rel="preconnect" href="https://avatars.githubusercontent.com">\n\n\n\n  <link crossorigin="anonymous" media="all" integrity="sha512-9M4GwJqBATCm6CMz2UrCo6uuX1/Wa8wUnm7N5BQhGHFch1oIX2y8dcpUXfnQVQ2HE2bD287O5YMXuc5jFAcU8w==" rel="stylesheet" href="https://github.githubassets.com/assets/light-f4ce06c09a810130a6e8.css" /><link crossorigin="anonymous" media="all" integrity="sha512-BEwN74xxmv+L2zArQGm+kGVvX3bGY85LF4umkZnZ6Zl6IciYbr1IGxIYxqnUCW0RDEN1kM7glvXkpmQsntPVYQ==" re

In [8]:
with open ('webpage.html','w',encoding="utf-8") as f:
    f.write(page_contents)

## Use Beautiful Soup to parse and extract information.

In [9]:
from bs4 import BeautifulSoup

In [10]:
doc = BeautifulSoup(page_contents, 'html.parser')

In [11]:
type(doc)

bs4.BeautifulSoup

In [12]:
title_class="f3 lh-condensed mb-0 mt-1 Link--primary"
topic_title_tags=doc.findAll('p',{'class':title_class})

In [13]:
len(topic_title_tags)

30

In [14]:
topic_title_tags[:5]

[<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Algorithm</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amp</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Android</p>]

In [15]:
desc_class="f5 color-fg-muted mb-0 mt-1"
topic_desc_tags=doc.findAll('p',{'class':desc_class})

In [16]:
len(topic_desc_tags)

30

In [17]:
topic_desc_tags[:5]

[<p class="f5 color-fg-muted mb-0 mt-1">
           3D modeling is the process of virtually developing the surface and structure of a 3D object.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Ajax is a technique for creating interactive web applications.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Algorithms are self-contained sequences that carry out a variety of tasks.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Amp is a non-blocking concurrency library for PHP.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Android is an operating system built by Google designed for mobile devices.
         </p>]

In [18]:
topic_link_class="no-underline flex-1 d-flex flex-column"

topic_link_tags=doc.findAll('a',{'class':topic_link_class})

In [19]:
len(topic_link_tags)

30

In [20]:
topic_link_tags[0]['href']

'/topics/3d'

In [21]:
topic0_url="https://github.com"+ topic_link_tags[0]['href']
print(topic0_url)

https://github.com/topics/3d
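
String concatenation works here because every `href` starts with `/`. As a hedged alternative, `urllib.parse.urljoin` from the standard library also handles links that are already absolute; `absolute_url` below is a hypothetical helper name.

```python
from urllib.parse import urljoin

BASE_URL = "https://github.com"

def absolute_url(href, base=BASE_URL):
    """Resolve a relative or absolute href against the site base URL."""
    return urljoin(base, href)

print(absolute_url("/topics/3d"))  # → https://github.com/topics/3d
```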


In [22]:
topic_titles=[]

for tag in topic_title_tags:
    topic_titles.append(tag.text)

print(topic_titles)

topic_descriptions=[]

for descriptions in topic_desc_tags:
    topic_descriptions.append(descriptions.text.strip())

print(topic_descriptions[:5])

topic_urls=[]
base_url="https://github.com"

for urls in topic_link_tags:
    topic_urls.append(base_url+urls['href'])

print(topic_urls)

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android', 'Angular', 'Ansible', 'API', 'Arduino', 'ASP.NET', 'Atom', 'Awesome Lists', 'Amazon Web Services', 'Azure', 'Babel', 'Bash', 'Bitcoin', 'Bootstrap', 'Bot', 'C', 'Chrome', 'Chrome extension', 'Command line interface', 'Clojure', 'Code quality', 'Code review', 'Compiler', 'Continuous integration', 'COVID-19', 'C++']
['3D modeling is the process of virtually developing the surface and structure of a 3D object.', 'Ajax is a technique for creating interactive web applications.', 'Algorithms are self-contained sequences that carry out a variety of tasks.', 'Amp is a non-blocking concurrency library for PHP.', 'Android is an operating system built by Google designed for mobile devices.']
['https://github.com/topics/3d', 'https://github.com/topics/ajax', 'https://github.com/topics/algorithm', 'https://github.com/topics/amphp', 'https://github.com/topics/android', 'https://github.com/topics/angular', 'https://github.com/topics/ansible', 'https://git

In [23]:
topic_descriptions=[]

for descriptions in topic_desc_tags:
    topic_descriptions.append(descriptions.text.strip())
    
print(topic_descriptions[:5])

['3D modeling is the process of virtually developing the surface and structure of a 3D object.', 'Ajax is a technique for creating interactive web applications.', 'Algorithms are self-contained sequences that carry out a variety of tasks.', 'Amp is a non-blocking concurrency library for PHP.', 'Android is an operating system built by Google designed for mobile devices.']


In [24]:
topic_urls=[]
base_url="https://github.com"

for urls in topic_link_tags:
    topic_urls.append(base_url+urls['href'])
    
print(topic_urls)

['https://github.com/topics/3d', 'https://github.com/topics/ajax', 'https://github.com/topics/algorithm', 'https://github.com/topics/amphp', 'https://github.com/topics/android', 'https://github.com/topics/angular', 'https://github.com/topics/ansible', 'https://github.com/topics/api', 'https://github.com/topics/arduino', 'https://github.com/topics/aspnet', 'https://github.com/topics/atom', 'https://github.com/topics/awesome', 'https://github.com/topics/aws', 'https://github.com/topics/azure', 'https://github.com/topics/babel', 'https://github.com/topics/bash', 'https://github.com/topics/bitcoin', 'https://github.com/topics/bootstrap', 'https://github.com/topics/bot', 'https://github.com/topics/c', 'https://github.com/topics/chrome', 'https://github.com/topics/chrome-extension', 'https://github.com/topics/cli', 'https://github.com/topics/clojure', 'https://github.com/topics/code-quality', 'https://github.com/topics/code-review', 'https://github.com/topics/compiler', 'https://github.com/t

In [25]:
import pandas as pd

In [26]:
topics_dict={'title':topic_titles,'description':topic_descriptions,'url':topic_urls}

In [27]:
topics_df=pd.DataFrame(topics_dict)
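
The DataFrame is built from three lists that were scraped separately, so it relies on them lining up one-to-one. `pd.DataFrame` raises a `ValueError` when the lists differ in length; a small check beforehand can give a clearer message. The sketch below assumes a hypothetical helper `check_aligned` and uses two of the topics shown above as sample data.

```python
def check_aligned(columns):
    """Raise ValueError if the lists in `columns` differ in length."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Column lengths differ: {lengths}")
    return lengths

sample = {
    "title": ["3D", "Ajax"],
    "description": [
        "3D modeling is the process of virtually developing the surface and structure of a 3D object.",
        "Ajax is a technique for creating interactive web applications.",
    ],
    "url": ["https://github.com/topics/3d", "https://github.com/topics/ajax"],
}
print(check_aligned(sample))
```

Equal lengths do not prove that each title sits next to its own description, so this check is a guard against obvious breakage rather than a full validation.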

In [28]:
topics_df.head(5)

,title,description,url
0,3D,3D modeling is the process of virtually develo...,https://github.com/topics/3d
1,Ajax,Ajax is a technique for creating interactive w...,https://github.com/topics/ajax
2,Algorithm,Algorithms are self-contained sequences that c...,https://github.com/topics/algorithm
3,Amp,Amp is a non-blocking concurrency library for ...,https://github.com/topics/amphp
4,Android,Android is an operating system built by Google...,https://github.com/topics/android


In [29]:
topics_df.to_csv('topics.csv',index=None)

## Getting information out of a topic page

In [30]:
topic_page_url=topic_urls[0]

In [31]:
print(topic_page_url)

https://github.com/topics/3d


In [32]:
response = requests.get(topic_page_url)

In [33]:
response.status_code

200

In [34]:
len(response.text)

675798

In [35]:
topic_doc = BeautifulSoup(response.text, 'html.parser')

In [36]:
h3_class="f3 color-fg-muted text-normal lh-condensed"
repo_tags=topic_doc.findAll('h3',{'class':h3_class})

In [37]:
len(repo_tags)

30

In [38]:
repo_tags[0]

<h3 class="f3 color-fg-muted text-normal lh-condensed">
<a data-ga-click="Explore, go to repository owner, location:explore feed" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"OWNER","click_visual_representation":"REPOSITORY_OWNER_HEADING","actor_id":null,"record_id":97088,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="4bdbc49d3c05ae7f70b531fbce709a384200b0768554e0172950286a8db30940" data-view-component="true" href="/mrdoob">
            mrdoob
</a>          /
          <a class="text-bold wb-break-word" data-ga-click="Explore, go to repository, location:explore feed" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"REPOSITORY","click_visual_representation":"REPOSITORY_NAME_HEADING","actor_id":null,"record_id":576201,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="517d3d5cb9d897521

In [39]:
a_tags=repo_tags[0].findAll('a')

In [40]:
a_tags

[<a data-ga-click="Explore, go to repository owner, location:explore feed" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"OWNER","click_visual_representation":"REPOSITORY_OWNER_HEADING","actor_id":null,"record_id":97088,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="4bdbc49d3c05ae7f70b531fbce709a384200b0768554e0172950286a8db30940" data-view-component="true" href="/mrdoob">
             mrdoob
 </a>,
 <a class="text-bold wb-break-word" data-ga-click="Explore, go to repository, location:explore feed" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"REPOSITORY","click_visual_representation":"REPOSITORY_NAME_HEADING","actor_id":null,"record_id":576201,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="517d3d5cb9d89752156923904a4238816bc9b51ab7772f3e3644ce897d8dd4e5" data-view-component="tr

In [41]:
a_tags[0].text.strip()

'mrdoob'

In [42]:
a_tags[1].text.strip()

'three.js'

In [43]:
a_tags[1]['href']

'/mrdoob/three.js'

In [44]:
base_url="https://github.com"

repo_url=base_url+a_tags[1]['href']
print(repo_url)

https://github.com/mrdoob/three.js


In [45]:
span_class="Counter js-social-count"

star_tags=topic_doc.findAll('span',{'class':span_class})

In [46]:
len(star_tags)

30

In [47]:
star_tags[0]

<span aria-label="79426 users starred this repository" class="Counter js-social-count" data-pjax-replace="true" data-plural-suffix="users starred this repository" data-singular-suffix="user starred this repository" data-view-component="true" id="repo-stars-counter-star" title="79,426">79.4k</span>

In [48]:
star_tag0=star_tags[0]
star_tag0.text

'79.4k'

In [67]:
star_tags

[<span aria-label="79426 users starred this repository" class="Counter js-social-count" data-pjax-replace="true" data-plural-suffix="users starred this repository" data-singular-suffix="user starred this repository" data-view-component="true" id="repo-stars-counter-star" title="79,426">79.4k</span>,
 <span aria-label="19685 users starred this repository" class="Counter js-social-count" data-pjax-replace="true" data-plural-suffix="users starred this repository" data-singular-suffix="user starred this repository" data-view-component="true" id="repo-stars-counter-star" title="19,685">19.7k</span>,
 <span aria-label="16935 users starred this repository" class="Counter js-social-count" data-pjax-replace="true" data-plural-suffix="users starred this repository" data-singular-suffix="user starred this repository" data-view-component="true" id="repo-stars-counter-star" title="16,935">16.9k</span>,
 <span aria-label="15997 users starred this repository" class="Counter js-social-count" data-pjax

In [50]:
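def parse_star_count(star_tag):
    # Takes a star <span> tag; its text looks like '79.4k' or a plain number
    stars_str=star_tag.text.strip()
    if stars_str[-1]=='k':
        # round() guards against float error when scaling by 1000
        return round(float(stars_str[:-1])*1000)
    return int(stars_str)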

In [69]:
parse_star_count(star_tag0)

79400
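
The star counter carries two representations in the markup shown earlier: the abbreviated text (`79.4k`) and an exact count in the `title` attribute (`79,426`). A minimal sketch of parsing both from plain strings follows; `parse_star_text` and `parse_star_title` are hypothetical helper names, not part of the notebook.

```python
def parse_star_text(text):
    """Convert display text such as '79.4k' or '985' to an approximate int."""
    text = text.strip().lower()
    if text.endswith("k"):
        # round() avoids losing a unit to floating-point error
        return round(float(text[:-1]) * 1000)
    return int(text)

def parse_star_title(title):
    """Convert an exact count with thousands separators, such as '79,426'."""
    return int(title.replace(",", ""))

print(parse_star_text("79.4k"))    # → 79400
print(parse_star_text("985"))      # → 985
print(parse_star_title("79,426"))  # → 79426
```

Reading the `title` attribute would give the exact count, at the cost of depending on one more detail of the page markup.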

In [52]:
def get_repo_info(repo_tags,star_tags):
    a_tags=repo_tags.findAll('a')
    username=a_tags[0].text.strip()
    repo_name=a_tags[1].text.strip()
    repo_url=base_url+a_tags[1]['href']
    stars=parse_star_count(star_tags)
    return username,repo_name,stars,repo_url

In [53]:
get_repo_info(repo_tags[0],star_tag0)

('mrdoob', 'three.js', 79400, 'https://github.com/mrdoob/three.js')

In [54]:
topic_repo_dict={
    'username':[],
    'repo_name':[],
    'stars':[],
    'repo_url':[]
}

for i in range (len(repo_tags)):
    repo_info=get_repo_info(repo_tags[i],star_tags[i])
    topic_repo_dict['username'].append(repo_info[0])
    topic_repo_dict['repo_name'].append(repo_info[1])
    topic_repo_dict['stars'].append(repo_info[2])
    topic_repo_dict['repo_url'].append(repo_info[3])
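
The index-based loop above can also be written without positional indexing. The sketch below shows the same "tuples to dict of columns" step on plain data; `collect_columns` is a hypothetical helper, and the two rows are taken from the results shown in this notebook.

```python
def collect_columns(rows, column_names):
    """Turn a list of equal-length tuples into a dict of column lists."""
    columns = {name: [] for name in column_names}
    for row in rows:
        for name, value in zip(column_names, row):
            columns[name].append(value)
    return columns

rows = [
    ("mrdoob", "three.js", 79400, "https://github.com/mrdoob/three.js"),
    ("libgdx", "libgdx", 19700, "https://github.com/libgdx/libgdx"),
]
columns = collect_columns(rows, ["username", "repo_name", "stars", "repo_url"])
print(columns["repo_name"])  # → ['three.js', 'libgdx']
```

In the notebook itself, the rows would come from pairing the two tag lists, for example with `zip(repo_tags, star_tags)`.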


In [55]:
topic_repo_dict

{'username': ['mrdoob',
  'libgdx',
  'pmndrs',
  'BabylonJS',
  'aframevr',
  'lettier',
  'ssloy',
  'FreeCAD',
  'metafizzy',
  'CesiumGS',
  'timzhang642',
  'a1studmuffin',
  'isl-org',
  'domlysz',
  'blender',
  'spritejs',
  'openscad',
  'tensorspace-team',
  'jagenjo',
  'YadiraF',
  'AaronJackson',
  'ssloy',
  'google',
  'mosra',
  'gfxfundamentals',
  'FyroxEngine',
  'cleardusk',
  'tengbao',
  'jasonlong',
  'cnr-isti-vclab'],
 'repo_name': ['three.js',
  'libgdx',
  'react-three-fiber',
  'Babylon.js',
  'aframe',
  '3d-game-shaders-for-beginners',
  'tinyrenderer',
  'FreeCAD',
  'zdog',
  'cesium',
  '3D-Machine-Learning',
  'SpaceshipGenerator',
  'Open3D',
  'BlenderGIS',
  'blender',
  'spritejs',
  'openscad',
  'tensorspace',
  'webglstudio.js',
  'PRNet',
  'vrn',
  'tinyraytracer',
  'model-viewer',
  'magnum',
  'webgl-fundamentals',
  'Fyrox',
  '3DDFA',
  'vanta',
  'isometric-contributions',
  'meshlab'],
 'stars': [79400,
  19700,
  16900,
  16000,
  1380

In [56]:
topic_repos_df=pd.DataFrame(topic_repo_dict)

In [57]:
topic_repos_df

,username,repo_name,stars,repo_url
0,mrdoob,three.js,79400,https://github.com/mrdoob/three.js
1,libgdx,libgdx,19700,https://github.com/libgdx/libgdx
2,pmndrs,react-three-fiber,16900,https://github.com/pmndrs/react-three-fiber
3,BabylonJS,Babylon.js,16000,https://github.com/BabylonJS/Babylon.js
4,aframevr,aframe,13800,https://github.com/aframevr/aframe
5,lettier,3d-game-shaders-for-beginners,12300,https://github.com/lettier/3d-game-shaders-for...
6,ssloy,tinyrenderer,12200,https://github.com/ssloy/tinyrenderer
7,FreeCAD,FreeCAD,10800,https://github.com/FreeCAD/FreeCAD
8,metafizzy,zdog,9000,https://github.com/metafizzy/zdog
9,CesiumGS,cesium,8300,https://github.com/CesiumGS/cesium


In [58]:
import os

def get_topic_page(topic_url):
    response = requests.get(topic_url)
    if response.status_code != 200:
        raise Exception(f'Failed to load page {topic_url}')
    topic_doc=BeautifulSoup(response.text,'html.parser')
    return topic_doc

def get_repo_info(repo_tags,star_tags):
    a_tags=repo_tags.findAll('a')
    username=a_tags[0].text.strip()
    repo_name=a_tags[1].text.strip()
    repo_url=base_url+a_tags[1]['href']
    stars=parse_star_count(star_tags)
    return username,repo_name,stars,repo_url

def get_topic_repos(topic_doc):
    h3_class="f3 color-fg-muted text-normal lh-condensed"
    repo_tags=topic_doc.findAll('h3',{'class':h3_class})
    
    span_class="Counter js-social-count"
    star_tags=topic_doc.findAll('span',{'class':span_class})
    
    topic_repo_dict={
        'username':[],
        'repo_name':[],
        'stars':[],
        'repo_url':[]
    }
    
    
    for i in range (len(repo_tags)):
        repo_info=get_repo_info(repo_tags[i],star_tags[i])
        topic_repo_dict['username'].append(repo_info[0])
        topic_repo_dict['repo_name'].append(repo_info[1])
        topic_repo_dict['stars'].append(repo_info[2])
        topic_repo_dict['repo_url'].append(repo_info[3])
    
    return pd.DataFrame(topic_repo_dict)

def scrape_topic(topic_url,path):
    if os.path.exists(path):
        print(f"The file {path} already exists. Skipping...")
        return
    topic_df=get_topic_repos(get_topic_page(topic_url))
    topic_df.to_csv(path, index=None)

In [59]:
get_topic_repos(get_topic_page(topic_urls[5])).to_csv("Angular.csv", index=None)

Write a single function to:
1. Get the list of topics from the topics page.
2. Get the list of top repos from each individual topic page.
3. Create a CSV file of top repos for each topic.

In [60]:
topic_urls

['https://github.com/topics/3d',
 'https://github.com/topics/ajax',
 'https://github.com/topics/algorithm',
 'https://github.com/topics/amphp',
 'https://github.com/topics/android',
 'https://github.com/topics/angular',
 'https://github.com/topics/ansible',
 'https://github.com/topics/api',
 'https://github.com/topics/arduino',
 'https://github.com/topics/aspnet',
 'https://github.com/topics/atom',
 'https://github.com/topics/awesome',
 'https://github.com/topics/aws',
 'https://github.com/topics/azure',
 'https://github.com/topics/babel',
 'https://github.com/topics/bash',
 'https://github.com/topics/bitcoin',
 'https://github.com/topics/bootstrap',
 'https://github.com/topics/bot',
 'https://github.com/topics/c',
 'https://github.com/topics/chrome',
 'https://github.com/topics/chrome-extension',
 'https://github.com/topics/cli',
 'https://github.com/topics/clojure',
 'https://github.com/topics/code-quality',
 'https://github.com/topics/code-review',
 'https://github.com/topics/compil

In [61]:
def get_topic_titles(doc):
    title_class="f3 lh-condensed mb-0 mt-1 Link--primary"
    topic_title_tags=doc.find_all('p',{'class':title_class})
    topic_titles=[]
    for tag in topic_title_tags:
        topic_titles.append(tag.text)
    return topic_titles


def get_topic_desc(doc):
    desc_class="f5 color-fg-muted mb-0 mt-1"
    topic_desc_tags=doc.find_all('p',{'class':desc_class})
    topic_descriptions=[]
    for descriptions in topic_desc_tags:
        topic_descriptions.append(descriptions.text.strip())
    return topic_descriptions

def get_topic_urls(doc):
    topic_link_class="no-underline flex-1 d-flex flex-column"
    topic_link_tags=doc.findAll('a',{'class':topic_link_class})
    topic_urls=[]
    base_url="https://github.com"
    for urls in topic_link_tags:
        topic_urls.append(base_url+urls['href'])
    return topic_urls

def scrape_topics():
    topics_url='https://github.com/topics'
    response=requests.get(topics_url)
    if response.status_code != 200:
        raise Exception (f'Failed to load page {topics_url}')
    # Parse the page just downloaded instead of relying on the global `doc`
    doc=BeautifulSoup(response.text,'html.parser')
    topics_dict={
        'title':get_topic_titles(doc),
        'description':get_topic_desc(doc),
        'url':get_topic_urls(doc)
    }
    return pd.DataFrame(topics_dict)

In [62]:
scrape_topics()

,title,description,url
0,3D,3D modeling is the process of virtually develo...,https://github.com/topics/3d
1,Ajax,Ajax is a technique for creating interactive w...,https://github.com/topics/ajax
2,Algorithm,Algorithms are self-contained sequences that c...,https://github.com/topics/algorithm
3,Amp,Amp is a non-blocking concurrency library for ...,https://github.com/topics/amphp
4,Android,Android is an operating system built by Google...,https://github.com/topics/android
5,Angular,Angular is an open source web application plat...,https://github.com/topics/angular
6,Ansible,Ansible is a simple and powerful automation en...,https://github.com/topics/ansible
7,API,An API (Application Programming Interface) is ...,https://github.com/topics/api
8,Arduino,Arduino is an open source hardware and softwar...,https://github.com/topics/arduino
9,ASP.NET,ASP.NET is a web framework for building modern...,https://github.com/topics/aspnet


In [63]:
def scrape_topics_repos():
    print("Scraping list of topics")
    topics_df=scrape_topics()
    os.makedirs('data', exist_ok=True)
    for index,row in topics_df.iterrows():
        print(f"Scraping top repositories for {row['title']}")
        scrape_topic(row['url'],"data/{}.csv".format(row['title']))
    print('\nDone!!!')
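
The function above uses each topic title directly as a file name, which works for the titles seen so far (`ASP.NET`, `C++`, `Command line interface`). A title containing a path separator such as `/` would not. As a hedged sketch, a hypothetical helper `safe_filename` could normalize titles first; the set of allowed characters is an assumption, not something the notebook defines.

```python
import re

def safe_filename(title, extension=".csv"):
    """Replace runs of characters outside letters, digits, '.', '_', '+', '-' with '_'."""
    cleaned = re.sub(r"[^A-Za-z0-9._+-]+", "_", title.strip())
    return (cleaned or "untitled") + extension

print(safe_filename("Command line interface"))  # → Command_line_interface.csv
print(safe_filename("C++"))                     # → C++.csv
```

Note that this changes names containing spaces, so files already written under the original titles would no longer be detected as existing.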

In [64]:
scrape_topics_repos()

Scraping list of topics
Scraping top repositories for 3D
The file data/3D.csv already exists. Skipping...
Scraping top repositories for Ajax
The file data/Ajax.csv already exists. Skipping...
Scraping top repositories for Algorithm
The file data/Algorithm.csv already exists. Skipping...
Scraping top repositories for Amp
The file data/Amp.csv already exists. Skipping...
Scraping top repositories for Android
The file data/Android.csv already exists. Skipping...
Scraping top repositories for Angular
The file data/Angular.csv already exists. Skipping...
Scraping top repositories for Ansible
The file data/Ansible.csv already exists. Skipping...
Scraping top repositories for API
The file data/API.csv already exists. Skipping...
Scraping top repositories for Arduino
The file data/Arduino.csv already exists. Skipping...
Scraping top repositories for ASP.NET
The file data/ASP.NET.csv already exists. Skipping...
Scraping top repositories for Atom
The file data/Atom.csv already exists. Skipping..