## Use the requests library to download web pages

- Inspect the website's HTML source and identify the right URLs to download.
- Download and save web pages locally using the requests library.
- Create a function to automate downloading for different topics/search queries.
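
The last bullet can be sketched as a small helper. This is a minimal sketch under stated assumptions, not the notebook's own code: `topic_url`, `save_page`, and `download_page` are hypothetical names, the `/topics/<slug>` pattern is inferred from the topic links this notebook scrapes, and the 10-second timeout is an arbitrary choice.

```python
import requests


def topic_url(slug, base_url="https://github.com"):
    """Build the URL of a topic page, e.g. slug '3d' -> .../topics/3d."""
    return f"{base_url}/topics/{slug}"


def save_page(html, path):
    """Write page contents to a local file as UTF-8 and return the path."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(html)
    return path


def download_page(url, path, timeout=10):
    """Download `url` and save it to `path`; raise on HTTP errors."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx
    return save_page(response.text, path)
```

For example, `download_page(topic_url("3d"), "3d.html")` would save the 3D topic page next to the notebook. Passing a timeout keeps a stalled connection from hanging the cell indefinitely.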

In [158]:
import requests

In [159]:
topics_URL = 'https://github.com/topics'

In [160]:
response = requests.get(topics_URL)

In [161]:
response.status_code

200

In [162]:
len(response.text)

164723

In [163]:
page_contents=response.text

In [164]:
page_contents[:1000]

'\n\n<!DOCTYPE html>\n<html lang="en" data-color-mode="auto" data-light-theme="light" data-dark-theme="dark"  data-a11y-animated-images="system" data-a11y-link-underlines="false">\n  <head>\n    <meta charset="utf-8">\n  <link rel="dns-prefetch" href="https://github.githubassets.com">\n  <link rel="dns-prefetch" href="https://avatars.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">\n  <link rel="dns-prefetch" href="https://user-images.githubusercontent.com/">\n  <link rel="preconnect" href="https://github.githubassets.com" crossorigin>\n  <link rel="preconnect" href="https://avatars.githubusercontent.com">\n\n  \n\n  <link crossorigin="anonymous" media="all" rel="stylesheet" href="https://github.githubassets.com/assets/light-8cafbcbd78f4.css" /><link crossorigin="anonymous" media="all" rel="stylesheet" href="https://github.githubassets.com/assets/dark-31dc14e38457.css" /><link data-color-theme="dark_dimmed" crossorigin="anonymous" media=

In [165]:
with open('webpage.html', 'w', encoding='utf-8') as f:
    f.write(page_contents)


## Use Beautiful Soup to parse and extract information

- Parse and explore the structure of downloaded web pages using Beautiful Soup.
- Use the right properties and methods to extract the required information.
- Create functions to extract information from the page into lists and dictionaries.
- (Optional) Use a REST API to acquire additional information if required.
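
One way to extract page data into dictionaries is to walk each topic link and read its title and description together. The sketch below is illustrative only: `extract_topics` is a hypothetical name, and the assumed markup (an `<a>` wrapping a title `<p>` followed by a description `<p>`) matches the topic card printed in this notebook but may change whenever GitHub updates its pages.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_topics(html, base_url="https://github.com"):
    """Return one dict per topic link found in `html`."""
    doc = BeautifulSoup(html, "html.parser")
    topics = []
    # Assumed markup: each topic is an <a> holding two <p> tags,
    # the title first and the description second.
    for link in doc.find_all("a", href=True):
        paragraphs = link.find_all("p")
        if len(paragraphs) < 2:
            continue  # not a topic card, skip it
        topics.append({
            "title": paragraphs[0].get_text(strip=True),
            "description": paragraphs[1].get_text(strip=True),
            "url": urljoin(base_url, link["href"]),
        })
    return topics


# A tiny hand-written snippet standing in for the real page
sample = """
<a href="/topics/3d">
  <p>3D</p>
  <p> 3D refers to three-dimensional graphics. </p>
</a>
"""
print(extract_topics(sample))
```

Reading all three fields from the same card keeps them aligned, whereas three separate `find_all` lists could drift out of step if a card were missing its description.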

In [166]:
from bs4 import BeautifulSoup

In [167]:
doc = BeautifulSoup(page_contents,'html.parser')

In [168]:
topic_title_tags=doc.find_all('p',{'class':"f3 lh-condensed mb-0 mt-1 Link--primary"})

In [169]:
len(topic_title_tags)

30

In [170]:
topic_title_tags[:5]

[<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Algorithm</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amp</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Android</p>]

In [171]:
topic_desc_tags=doc.find_all('p',{'class':'f5 color-fg-muted mb-0 mt-1'})

In [172]:
topic_desc_tags[:5]

[<p class="f5 color-fg-muted mb-0 mt-1">
           3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Ajax is a technique for creating interactive web applications.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Algorithms are self-contained sequences that carry out a variety of tasks.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Amp is a non-blocking concurrency library for PHP.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Android is an operating system built by Google designed for mobile devices.
         </p>]

In [173]:
topic_title_tag0 = topic_title_tags[0]

In [174]:
topic_title_tag0.parent.parent

<div class="py-4 border-bottom d-flex flex-justify-between">
<a class="no-underline flex-grow-0" href="/topics/3d">
<div class="color-bg-accent f4 color-fg-muted text-bold rounded mr-3 flex-shrink-0 text-center" style="width:64px; height:64px; line-height:64px;">
            #
          </div>
</a>
<a class="no-underline flex-1 d-flex flex-column" href="/topics/3d">
<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
<p class="f5 color-fg-muted mb-0 mt-1">
          3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.
        </p>
</a>
<div class="flex-grow-0">
<div class="d-block" data-view-component="true">
<a aria-label="You must be signed in to star a repository" class="tooltipped tooltipped-s btn-sm btn" data-hydro-click='{"event_type":"authentication.click","payload":{"location_in_page":"star button","repository_id":null,"auth_type":"LOG_IN","originating_url":"https://github.com/topics","user_id":null}}' data-hydro-click-hmac="5

In [175]:
topic_link_tags=doc.find_all('a',{'class':'no-underline flex-1 d-flex flex-column'})

In [176]:
len(topic_link_tags)

30

In [177]:
topic0_url="https://github.com"+topic_link_tags[0]['href']
topic0_url

'https://github.com/topics/3d'

In [178]:
topic_titles=[]

for tag in topic_title_tags:
    topic_titles.append(tag.text)

print(topic_titles)

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android', 'Angular', 'Ansible', 'API', 'Arduino', 'ASP.NET', 'Atom', 'Awesome Lists', 'Amazon Web Services', 'Azure', 'Babel', 'Bash', 'Bitcoin', 'Bootstrap', 'Bot', 'C', 'Chrome', 'Chrome extension', 'Command line interface', 'Clojure', 'Code quality', 'Code review', 'Compiler', 'Continuous integration', 'COVID-19', 'C++']


In [179]:
topic_descriptions=[]

for tag in topic_desc_tags:
    topic_descriptions.append(tag.text.strip())
    
print(topic_descriptions)

['3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.', 'Ajax is a technique for creating interactive web applications.', 'Algorithms are self-contained sequences that carry out a variety of tasks.', 'Amp is a non-blocking concurrency library for PHP.', 'Android is an operating system built by Google designed for mobile devices.', 'Angular is an open source web application platform.', 'Ansible is a simple and powerful automation engine.', 'An API (Application Programming Interface) is a collection of protocols and subroutines for building software.', 'Arduino is an open source platform for building electronic devices.', 'ASP.NET is a web framework for building modern web apps and services.', 'Atom is a open source text editor built with web technologies.', 'An awesome list is a list of awesome things curated by the community.', 'Amazon Web Services provides on-demand cloud computing platforms on a subscription basis.', 'Azure is a cloud co

In [180]:
topic_URLs=[]
base_url="https://github.com"
for tag in topic_link_tags:
    topic_URLs.append(base_url+tag['href'])
    
topic_URLs

['https://github.com/topics/3d',
 'https://github.com/topics/ajax',
 'https://github.com/topics/algorithm',
 'https://github.com/topics/amphp',
 'https://github.com/topics/android',
 'https://github.com/topics/angular',
 'https://github.com/topics/ansible',
 'https://github.com/topics/api',
 'https://github.com/topics/arduino',
 'https://github.com/topics/aspnet',
 'https://github.com/topics/atom',
 'https://github.com/topics/awesome',
 'https://github.com/topics/aws',
 'https://github.com/topics/azure',
 'https://github.com/topics/babel',
 'https://github.com/topics/bash',
 'https://github.com/topics/bitcoin',
 'https://github.com/topics/bootstrap',
 'https://github.com/topics/bot',
 'https://github.com/topics/c',
 'https://github.com/topics/chrome',
 'https://github.com/topics/chrome-extension',
 'https://github.com/topics/cli',
 'https://github.com/topics/clojure',
 'https://github.com/topics/code-quality',
 'https://github.com/topics/code-review',
 'https://github.com/topics/compil

In [181]:
import pandas as pd

In [182]:
topics_dict={'title':topic_titles,
             'description':topic_descriptions,
            'url':topic_URLs}
topics_df= pd.DataFrame(topics_dict)
topics_df

|  | title | description | url |
|---|---|---|---|
| 0 | 3D | 3D refers to the use of three-dimensional grap... | https://github.com/topics/3d |
| 1 | Ajax | Ajax is a technique for creating interactive w... | https://github.com/topics/ajax |
| 2 | Algorithm | Algorithms are self-contained sequences that c... | https://github.com/topics/algorithm |
| 3 | Amp | Amp is a non-blocking concurrency library for ... | https://github.com/topics/amphp |
| 4 | Android | Android is an operating system built by Google... | https://github.com/topics/android |
| 5 | Angular | Angular is an open source web application plat... | https://github.com/topics/angular |
| 6 | Ansible | Ansible is a simple and powerful automation en... | https://github.com/topics/ansible |
| 7 | API | An API (Application Programming Interface) is ... | https://github.com/topics/api |
| 8 | Arduino | Arduino is an open source platform for buildin... | https://github.com/topics/arduino |
| 9 | ASP.NET | ASP.NET is a web framework for building modern... | https://github.com/topics/aspnet |


## Create CSV file(s) with the extracted information

- Create functions for the end-to-end process of downloading, parsing, and saving CSVs.
- Execute the function with different inputs to create a dataset of CSV files.
- Verify the information in the CSV files by reading them back using Pandas.
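
The verification step in the last bullet can be sketched as a read-back check. This is a minimal sketch, assuming the CSV was written without the index column; `verify_csv` is a hypothetical helper, and the one-row frame is only a stand-in for the scraped data.

```python
import os
import tempfile

import pandas as pd


def verify_csv(path, expected_columns):
    """Read a CSV back and check its columns; return the DataFrame."""
    df = pd.read_csv(path)
    missing = [c for c in expected_columns if c not in df.columns]
    if missing:
        raise ValueError(f"{path} is missing columns: {missing}")
    return df


# Round trip on a one-row stand-in frame written to a temporary folder
frame = pd.DataFrame({"title": ["3D"], "url": ["https://github.com/topics/3d"]})
path = os.path.join(tempfile.mkdtemp(), "topics.csv")
frame.to_csv(path, index=False)
loaded = verify_csv(path, ["title", "url"])
print(loaded)
```

Checking column names catches the most common mistake, an extra unnamed index column, before any analysis relies on the file.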


In [183]:
topics_df.to_csv('topics.csv', index=False)

## Getting information out of a topic page

In [184]:
topic_page_url = topic_URLs[0]

In [185]:
topic_page_url


'https://github.com/topics/3d'

In [186]:
response = requests.get(topic_page_url)

In [187]:
response.status_code

200

In [188]:
len(response.text)

476394

In [189]:
topic_doc=BeautifulSoup(response.text,'html.parser')

In [190]:
repo_tags = topic_doc.find_all('h3',{'class':"f3 color-fg-muted text-normal lh-condensed"})

In [191]:
len(repo_tags)

20

In [192]:
a_tags=repo_tags[0].find_all('a')

In [193]:
a_tags[0].text.strip()


'mrdoob'

In [194]:
a_tags[1].text.strip()

'three.js'

In [195]:
repo_url = base_url+ a_tags[1]['href']
print(repo_url)

https://github.com/mrdoob/three.js


In [196]:
repo_stars=topic_doc.find_all('span',{'class':'Counter js-social-count'})

In [197]:
repo_stars[0].text

'93.7k'

In [198]:
def parse_star_count(stars_str):
    # Convert a star count such as '93.7k' or '512' to an integer
    stars_str = stars_str.strip()
    if stars_str[-1] == 'k':
        return int(float(stars_str[:-1]) * 1000)
    return int(stars_str)

In [199]:
parse_star_count(repo_stars[0].text)

93700
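
A slightly more defensive variant of the star-count parser might look like the sketch below. `parse_count` is a hypothetical name, and the `m` suffix and comma handling are assumptions about formats the page might show; the output above only demonstrates the `k` suffix.

```python
def parse_count(text):
    """Convert strings such as '93.7k', '1.2m', or '1,234' to an int."""
    multipliers = {"k": 1_000, "m": 1_000_000}
    cleaned = text.strip().lower().replace(",", "")
    if not cleaned:
        raise ValueError("empty count")
    suffix = cleaned[-1]
    if suffix in multipliers:
        # round() rather than int(), so a float product that lands just
        # below a whole number is not truncated downwards
        return round(float(cleaned[:-1]) * multipliers[suffix])
    return int(cleaned)


print(parse_count("93.7k"))  # → 93700
print(parse_count(" 512 "))  # → 512
```

Returning an `int` in every branch keeps the resulting `stars` column numeric, so it can be sorted and summed without further conversion.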

In [200]:
import os


def get_topic_page(topic_url):
    # Download the page
    response = requests.get(topic_url)
    # Check the response status
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topic_url))
    # Parse using Beautiful Soup
    topic_doc = BeautifulSoup(response.text, 'html.parser')
    return topic_doc


def get_repo_info(h3_tag, star_tag):
    # Return the username, repo name, star count, and URL of one repository
    a_tags = h3_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href']
    repo_stars = parse_star_count(star_tag.text)
    return username, repo_name, repo_stars, repo_url


def get_topic_repos(topic_doc):
    # Get the h3 tags that hold each repository's username, name, and URL
    repo_tags = topic_doc.find_all('h3', {'class': "f3 color-fg-muted text-normal lh-condensed"})
    # Get the tags that hold the star counts
    star_tags = topic_doc.find_all('span', {'class': 'Counter js-social-count'})

    topic_repo_dict = {
        'username': [],
        'repo_name': [],
        'stars': [],
        'repo_url': []
    }

    # Collect the info for each repository; zip stops at the shorter list,
    # so a missing star tag cannot raise an IndexError
    for repo_tag, star_tag in zip(repo_tags, star_tags):
        username, repo_name, stars, repo_url = get_repo_info(repo_tag, star_tag)
        topic_repo_dict['username'].append(username)
        topic_repo_dict['repo_name'].append(repo_name)
        topic_repo_dict['stars'].append(stars)
        topic_repo_dict['repo_url'].append(repo_url)

    return pd.DataFrame(topic_repo_dict)


def scrape_topic(topic_url, topic_name):
    # Save the top repositories of one topic to '<topic_name>.csv'
    fname = topic_name + '.csv'
    if os.path.exists(fname):
        print("The file {} already exists. Skipping...".format(fname))
        return
    topic_df = get_topic_repos(get_topic_page(topic_url))
    topic_df.to_csv(fname, index=None)
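
`scrape_topic` uses the topic title directly as the file name. The sketch below shows one possible way to make that more robust for titles that contain spaces or path separators. `topic_filename` and the `data` directory are hypothetical, and the set of characters kept is an arbitrary, conservative choice.

```python
import os
import re


def topic_filename(topic_name, directory="data"):
    """Map a topic title to a CSV path that is safe on common filesystems."""
    # Replace runs of anything other than letters, digits, '.', '_', '+', '-'
    # with a single underscore, then trim underscores from both ends
    safe = re.sub(r"[^A-Za-z0-9._+-]+", "_", topic_name).strip("_")
    # Fall back to a fixed name if nothing usable is left
    return os.path.join(directory, (safe or "topic") + ".csv")


print(topic_filename("Command line interface"))
print(topic_filename("C++"))
```

The target directory has to exist before writing, for example via `os.makedirs("data", exist_ok=True)`.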

In [201]:
topic4_doc=get_topic_page(topic_URLs[4])
topic4_repos=get_topic_repos(topic4_doc)

In [202]:
# topic_repo_dict is local to get_topic_repos, which already returns a DataFrame
topic_repos_df = topic4_repos

In [203]:
topic4_repos

|  | username | repo_name | stars | repo_url |
|---|---|---|---|---|
| 0 | flutter | flutter | 156000 | https://github.com/flutter/flutter |
| 1 | facebook | react-native | 111000 | https://github.com/facebook/react-native |
| 2 | justjavac | free-programming-books-zh_CN | 104000 | https://github.com/justjavac/free-programming-... |
| 3 | Genymobile | scrcpy | 89000 | https://github.com/Genymobile/scrcpy |
| 4 | Hack-with-Github | Awesome-Hacking | 67800 | https://github.com/Hack-with-Github/Awesome-Ha... |
| 5 | google | material-design-icons | 48400 | https://github.com/google/material-design-icons |
| 6 | Solido | awesome-flutter | 47900 | https://github.com/Solido/awesome-flutter |
| 7 | wasabeef | awesome-android-ui | 47000 | https://github.com/wasabeef/awesome-android-ui |
| 8 | square | okhttp | 44300 | https://github.com/square/okhttp |
| 9 | android | architecture-samples | 43000 | https://github.com/android/architecture-samples |


In [204]:
def get_topic_titles(doc):
    topic_title_tags=doc.find_all('p',{'class':"f3 lh-condensed mb-0 mt-1 Link--primary"})
    topic_titles=[]

    for tag in topic_title_tags:
        topic_titles.append(tag.text)
    
    return topic_titles

def get_topic_desc(doc):
    topic_desc_tags=doc.find_all('p',{'class':'f5 color-fg-muted mb-0 mt-1'})
    topic_descriptions=[]

    for tag in topic_desc_tags:
        topic_descriptions.append(tag.text.strip())
    
    return topic_descriptions

def get_topic_urls(doc):
    topic_link_tags=doc.find_all('a',{'class':'no-underline flex-1 d-flex flex-column'})
    topic_URLs=[]
    base_url="https://github.com"
    for tag in topic_link_tags:
        topic_URLs.append(base_url+tag['href'])
        
    return topic_URLs

def scrape_topics():
    topics_URL = 'https://github.com/topics'
    response = requests.get(topics_URL)
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topics_URL))
    doc = BeautifulSoup(response.text, 'html.parser')
    topics_dict = {'title': get_topic_titles(doc),
                   'description': get_topic_desc(doc),
                   'url': get_topic_urls(doc)}
    return pd.DataFrame(topics_dict)

    

In [205]:
def scrape_topics_repo():
    print('Scraping list of topics')
    topics_df=scrape_topics()
    for index,row in topics_df.iterrows():
        print('Scraping topics repositories for "{}"'.format(row['title']))
        scrape_topic(row['url'],row['title'])
    
 

In [206]:
scrape_topics_repo()

Scraping list of topics
Scraping topics repositories for "3D"
The file 3D.csv already exists. Skipping...
Scraping topics repositories for "Ajax"
The file Ajax.csv already exists. Skipping...
Scraping topics repositories for "Algorithm"
The file Algorithm.csv already exists. Skipping...
Scraping topics repositories for "Amp"
The file Amp.csv already exists. Skipping...
Scraping topics repositories for "Android"
The file Android.csv already exists. Skipping...
Scraping topics repositories for "Angular"
The file Angular.csv already exists. Skipping...
Scraping topics repositories for "Ansible"
The file Ansible.csv already exists. Skipping...
Scraping topics repositories for "API"
The file API.csv already exists. Skipping...
Scraping topics repositories for "Arduino"
The file Arduino.csv already exists. Skipping...
Scraping topics repositories for "ASP.NET"
The file ASP.NET.csv already exists. Skipping...
Scraping topics repositories for "Atom"
The file Atom.csv already exists. Skipping..