### Scraping a web page

# 1. Picking a website and describing the objective

- Identify the information you'd like to scrape from the site. Decide the format of the output CSV file.
- Summarize your project idea and outline your strategy in a Jupyter notebook.

In [1]:
site = "https://github.com/topics"

# 2. Use the requests library to download web pages

- Inspect the website's HTML source and identify the right URLs to download.
- Download and save web pages locally using the requests library.
- Create a function to automate downloading for different topics/search queries.
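
The last bullet can be sketched as a small helper. This is a minimal sketch, not this notebook's own code: the name `download_page`, its `session` parameter, and the 10-second `timeout` are illustrative assumptions.

```python
from pathlib import Path

import requests


def download_page(url, path, session=None, timeout=10):
    """Download `url`, save the HTML to `path`, and return the text.

    `session` is a hypothetical hook: any object with a requests-style
    .get(url, timeout=...) method, for example a requests.Session.
    """
    client = session or requests
    response = client.get(url, timeout=timeout)
    # Raise an exception on 4xx/5xx instead of silently saving an error page
    response.raise_for_status()
    Path(path).write_text(response.text, encoding="utf-8")
    return response.text
```

Calling `download_page("https://github.com/topics", "webpage.html")` would then cover the download, status check, and save steps in one call.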

In [2]:
import requests

In [3]:
page = requests.get(site)

In [4]:
page.status_code

200

In [5]:
page_content = page.text

In [6]:
page_content[:1000]

'\n\n<!DOCTYPE html>\n<html lang="en" data-color-mode="auto" data-light-theme="light" data-dark-theme="dark" data-a11y-animated-images="system">\n  <head>\n    <meta charset="utf-8">\n  <link rel="dns-prefetch" href="https://github.githubassets.com">\n  <link rel="dns-prefetch" href="https://avatars.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">\n  <link rel="dns-prefetch" href="https://user-images.githubusercontent.com/">\n  <link rel="preconnect" href="https://github.githubassets.com" crossorigin>\n  <link rel="preconnect" href="https://avatars.githubusercontent.com">\n\n\n\n  <link crossorigin="anonymous" media="all" integrity="sha512-ksfTgQOOnE+FFXf+yNfVjKSlEckJAdufFIYGK7ZjRhWcZgzAGcmZqqArTgMLpu90FwthqcCX4ldDgKXbmVMeuQ==" rel="stylesheet" href="https://github.githubassets.com/assets/light-92c7d381038e.css" /><link crossorigin="anonymous" media="all" integrity="sha512-1KkMNn8M/al/dtzBLupRwkIOgnA9MWkm8oxS+solP87jByEvY/g4BmoxLihRogKcX

In [7]:
with open('webpage.html', 'w', encoding='utf-8') as f:
    f.write(page_content)

# 3. Use Beautiful Soup to parse and extract information

- Parse and explore the structure of downloaded web pages using Beautiful Soup.
- Use the right properties and methods to extract the required information.
- Create functions to extract from the page into lists and dictionaries.
- (Optional) Use a REST API to acquire additional information if required.
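
One way to read the third bullet is a function that returns one dictionary per topic. The sketch below runs on a small hand-written HTML sample, not the live page. It assumes that each topic link wraps its own title and description paragraphs. `SAMPLE_HTML` and `extract_topics` are illustrative names.

```python
from bs4 import BeautifulSoup

# Hand-written sample that imitates one topic card (an assumption about the markup)
SAMPLE_HTML = """
<div>
  <a class="no-underline flex-1 d-flex flex-column" href="/topics/3d">
    <p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
    <p class="f5 color-fg-muted mb-0 mt-1">
      3D modeling is the process of virtually developing the surface and structure of a 3D object.
    </p>
  </a>
</div>
"""


def extract_topics(html, base_url="https://github.com"):
    """Return a list of {'topic', 'description', 'url'} dictionaries."""
    doc = BeautifulSoup(html, "html.parser")
    topics = []
    for link in doc.find_all("a", class_="no-underline flex-1 d-flex flex-column"):
        title = link.find("p", class_="f3 lh-condensed mb-0 mt-1 Link--primary")
        desc = link.find("p", class_="f5 color-fg-muted mb-0 mt-1")
        if title is None or desc is None:
            continue  # skip links that are not complete topic cards
        topics.append({
            "topic": title.text.strip(),
            "description": desc.text.strip(),
            "url": base_url + link["href"],
        })
    return topics
```

Reading the title, description, and URL from the same card keeps the three values aligned even if one card on the page is missing a field.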

In [8]:
from bs4 import BeautifulSoup

In [9]:
doc = BeautifulSoup(page_content, "html.parser")

- getting topic title

In [10]:
selection_class = "f3 lh-condensed mb-0 mt-1 Link--primary"
topic_title_tag =  doc.find_all('p',class_ = selection_class)

In [11]:
topic_title_tag[:5]

[<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Algorithm</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amp</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Android</p>]

In [12]:
topic_titles = []
for tag in topic_title_tag:
    topic_titles.append(tag.text)

In [13]:
topic_titles[:5]

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android']

- getting topic description

In [14]:
selection_class = "f5 color-fg-muted mb-0 mt-1"
topic_desc_tag = doc.find_all('p',class_ = selection_class)

In [15]:
topic_descs = []
for desc in topic_desc_tag:
    topic_descs.append(desc.text.strip())

In [16]:
topic_descs[:5]

['3D modeling is the process of virtually developing the surface and structure of a 3D object.',
 'Ajax is a technique for creating interactive web applications.',
 'Algorithms are self-contained sequences that carry out a variety of tasks.',
 'Amp is a non-blocking concurrency library for PHP.',
 'Android is an operating system built by Google designed for mobile devices.']

- getting topic url

In [17]:
selection_class = "no-underline flex-1 d-flex flex-column"
topic_url_tag = doc.find_all('a',class_ = selection_class)

In [18]:
base_url = 'https://github.com'
topic_url_list = []
for tag in topic_url_tag:
    # The hrefs on the page are relative (e.g. '/topics/3d'), so prepend the site root
    topic_url_list.append(base_url + tag['href'])

In [19]:
topic_url_list[:10]

['https://github.com/topics/3d',
 'https://github.com/topics/ajax',
 'https://github.com/topics/algorithm',
 'https://github.com/topics/amphp',
 'https://github.com/topics/android',
 'https://github.com/topics/angular',
 'https://github.com/topics/ansible',
 'https://github.com/topics/api',
 'https://github.com/topics/arduino',
 'https://github.com/topics/aspnet']
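
String concatenation works here because every `href` is a root-relative path. A slightly more defensive sketch uses `urllib.parse.urljoin` from the standard library, which also leaves absolute links untouched. The helper name `absolute_urls` is an illustrative assumption.

```python
from urllib.parse import urljoin

BASE_URL = "https://github.com"


def absolute_urls(hrefs, base=BASE_URL):
    """Resolve each href against `base`; absolute URLs pass through unchanged."""
    return [urljoin(base, href) for href in hrefs]
```

For example, `absolute_urls(["/topics/3d"])` resolves the relative path against `https://github.com`.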

# 4. Create CSV file(s) with the extracted information

- Create functions for the end-to-end process of downloading, parsing, and saving CSVs.
- Execute the function with different inputs to create a dataset of CSV files.
- Verify the information in the CSV files by reading them back using Pandas.
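
The verification step in the last bullet can be sketched as a write-then-read round trip. This is a minimal check under stated assumptions: `verify_csv` and `sample_df` are illustrative names, and the check only compares column names and row counts, not every cell.

```python
import pandas as pd


def verify_csv(df, path):
    """Write `df` to `path`, read it back, and report whether the shape survived."""
    df.to_csv(path, index=False)  # index=False keeps the row index out of the file
    loaded = pd.read_csv(path)
    same_columns = list(loaded.columns) == list(df.columns)
    same_length = len(loaded) == len(df)
    return same_columns and same_length


# Two rows taken from the topic list, used only to exercise the helper
sample_df = pd.DataFrame({
    "topic": ["3D", "Ajax"],
    "url": ["https://github.com/topics/3d", "https://github.com/topics/ajax"],
})
```

Calling `verify_csv(sample_df, "check.csv")` returns a boolean that can be asserted right after each file is written.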

In [20]:
import pandas as pd

In [21]:
# Collect the three parallel lists into one table, one row per topic.
# The name topic_dict avoids shadowing the built-in dict type.
topic_dict = {
    'topic': topic_titles,
    'description': topic_descs,
    'url': topic_url_list
}
topic_df = pd.DataFrame(topic_dict)

In [22]:
topic_df[:10]

,topic,description,url
0,3D,3D modeling is the process of virtually develo...,https://github.com/topics/3d
1,Ajax,Ajax is a technique for creating interactive w...,https://github.com/topics/ajax
2,Algorithm,Algorithms are self-contained sequences that c...,https://github.com/topics/algorithm
3,Amp,Amp is a non-blocking concurrency library for ...,https://github.com/topics/amphp
4,Android,Android is an operating system built by Google...,https://github.com/topics/android
5,Angular,Angular is an open source web application plat...,https://github.com/topics/angular
6,Ansible,Ansible is a simple and powerful automation en...,https://github.com/topics/ansible
7,API,An API (Application Programming Interface) is ...,https://github.com/topics/api
8,Arduino,Arduino is an open source hardware and softwar...,https://github.com/topics/arduino
9,ASP.NET,ASP.NET is a web framework for building modern...,https://github.com/topics/aspnet


In [23]:
topic_df.to_csv('topics.csv', index=False)

# Getting information from a topic page

In [24]:
topic_url_list[0]

'https://github.com/topics/3d'

In [25]:
response= requests.get(topic_url_list[0])

In [26]:
topic_page = response.text

In [27]:
topic_doc = BeautifulSoup(topic_page, 'html.parser')

In [28]:
repo_tag = topic_doc.find_all('h3',class_ = 'f3 color-fg-muted text-normal lh-condensed')
repo_tag[0].find_all('a')

[<a data-ga-click="Explore, go to repository owner, location:explore feed" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"OWNER","click_visual_representation":"REPOSITORY_OWNER_HEADING","actor_id":null,"record_id":97088,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="4bdbc49d3c05ae7f70b531fbce709a384200b0768554e0172950286a8db30940" data-turbo="false" data-view-component="true" href="/mrdoob">
             mrdoob
 </a>,
 <a class="text-bold wb-break-word" data-ga-click="Explore, go to repository, location:explore feed" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"REPOSITORY","click_visual_representation":"REPOSITORY_NAME_HEADING","actor_id":null,"record_id":576201,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="517d3d5cb9d89752156923904a4238816bc9b51ab7772f3e3644ce897d8dd4e5" data

In [29]:
username = []
repository = []
user_link = []
repo_link = []
for tag in repo_tag:
    # Each heading holds two links: the owner first, then the repository
    owner_a, repo_a = tag.find_all('a')[:2]
    user_link.append(base_url + owner_a['href'].strip())
    repo_link.append(base_url + repo_a['href'].strip())
    username.append(owner_a.text.strip())
    repository.append(repo_a.text.strip())

In [30]:
star_tag = topic_doc.find_all('span',{'id' : "repo-stars-counter-star"})

In [31]:
stars = []
for tag in star_tag:
    star = tag['title']
    star = int(star.replace(',',''))
    stars.append(star)
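
The conversion inside the loop can be pulled into a small helper so it is easy to check on its own. This sketch assumes the `title` attribute holds a comma-grouped integer such as `'83,130'`, which is what the `replace(',', '')` call above implies. `parse_star_count` is an illustrative name.

```python
def parse_star_count(title_text):
    """Convert a comma-grouped count such as '83,130' into an int."""
    # Drop surrounding whitespace and thousands separators before converting
    return int(title_text.strip().replace(",", ""))
```

With this helper the loop body reduces to `stars.append(parse_star_count(tag['title']))`.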

In [32]:
# repo_dict avoids shadowing the built-in dict type
repo_dict = {
    'Username': username,
    'User Link': user_link,
    'Repository': repository,
    'Repo Link': repo_link,
    'Stars': stars
}
D3_repo_df = pd.DataFrame(repo_dict)

In [33]:
D3_repo_df[:5]

,Username,User Link,Repository,Repo Link,Stars
0,mrdoob,https://github.com/mrdoob,three.js,https://github.com/mrdoob/three.js,83130
1,libgdx,https://github.com/libgdx,libgdx,https://github.com/libgdx/libgdx,20147
2,pmndrs,https://github.com/pmndrs,react-three-fiber,https://github.com/pmndrs/react-three-fiber,18517
3,BabylonJS,https://github.com/BabylonJS,Babylon.js,https://github.com/BabylonJS/Babylon.js,17680
4,aframevr,https://github.com/aframevr,aframe,https://github.com/aframevr/aframe,14300
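
Because the headings and the star counters are collected by two separate `find_all` calls, the lists line up only if every repository card contains both. A sketch of an alternative reads each card as a unit. It assumes each repository sits inside its own `<article>` element, which this notebook does not confirm. `SAMPLE_CARD` and `parse_repos` are illustrative names, and the sample HTML is hand-written.

```python
from bs4 import BeautifulSoup

BASE_URL = "https://github.com"

# Hand-written imitation of one repository card (the <article> wrapper is an assumption)
SAMPLE_CARD = """
<article>
  <h3 class="f3 color-fg-muted text-normal lh-condensed">
    <a href="/mrdoob"> mrdoob </a> /
    <a class="text-bold wb-break-word" href="/mrdoob/three.js"> three.js </a>
  </h3>
  <span id="repo-stars-counter-star" title="83,130">83.1k</span>
</article>
"""


def parse_repos(html):
    """Return one dictionary per repository card found in `html`."""
    doc = BeautifulSoup(html, "html.parser")
    rows = []
    for card in doc.find_all("article"):
        heading = card.find("h3", class_="f3 color-fg-muted text-normal lh-condensed")
        star = card.find("span", id="repo-stars-counter-star")
        if heading is None or star is None:
            continue  # skip cards that lack a heading or a star counter
        links = heading.find_all("a")
        if len(links) < 2:
            continue  # expect an owner link followed by a repository link
        owner_a, repo_a = links[0], links[1]
        rows.append({
            "Username": owner_a.text.strip(),
            "User Link": BASE_URL + owner_a["href"].strip(),
            "Repository": repo_a.text.strip(),
            "Repo Link": BASE_URL + repo_a["href"].strip(),
            "Stars": int(star["title"].replace(",", "")),
        })
    return rows
```

The resulting list of dictionaries can be passed straight to `pd.DataFrame`, which yields the same columns as the table above.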


# Summarizing the tasks

In [42]:
def scrap_topics_repos():
    # Pair each topic title with its URL instead of tracking a manual counter
    for title, url in zip(topic_titles, topic_url_list):
        response = requests.get(url)
        topic_doc = BeautifulSoup(response.text, 'html.parser')
        repo_tag = topic_doc.find_all('h3', class_='f3 color-fg-muted text-normal lh-condensed')
        username = []
        repository = []
        user_link = []
        repo_link = []
        for tag in repo_tag:
            # Each heading holds two links: the owner first, then the repository
            owner_a, repo_a = tag.find_all('a')[:2]
            user_link.append(base_url + owner_a['href'].strip())
            repo_link.append(base_url + repo_a['href'].strip())
            username.append(owner_a.text.strip())
            repository.append(repo_a.text.strip())
        star_tag = topic_doc.find_all('span', {'id': 'repo-stars-counter-star'})
        # The title attribute holds the full count with thousands separators
        stars = [int(tag['title'].replace(',', '')) for tag in star_tag]
        repo_dict = {
            'Username': username,
            'User Link': user_link,
            'Repository': repository,
            'Repo Link': repo_link,
            'Stars': stars
        }
        df = pd.DataFrame(repo_dict)
        # Write one CSV per topic, named after the topic title
        df.to_csv(title + '.csv', index=False)

In [43]:
scrap_topics_repos()
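
The function above names each file after the topic title, which can be awkward if a title contains characters that are not valid in file names. A possible alternative, sketched here under the assumption that the last path segment of each topic URL is unique and file-system friendly, derives the name from the URL. `csv_filename` is an illustrative name.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def csv_filename(topic_url):
    """Build a CSV file name from the last path segment of a topic URL."""
    # urlparse isolates the path ('/topics/3d'); .name keeps the final segment ('3d')
    slug = PurePosixPath(urlparse(topic_url).path).name
    return slug + ".csv"
```

Inside the loop this would replace `title + '.csv'` with `csv_filename(url)`.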

# 5. Document and share your work

- Add proper headings and documentation in your Jupyter notebook.
- Publish your Jupyter notebook to your Jovian profile.
- (Optional) Write a blog post about your project and share it online.