# SEN163A - Fundamentals of Data Analytics
## Assignment 3 - Newspaper data web scraping
### Dr. Ir. Jacopo De Stefani - [J.deStefani@tudelft.nl](mailto:J.deStefani@tudelft.nl)
### Joao Pizani Flor, M.Sc. - [J.p.pizaniflor@tudelft.nl](mailto:J.p.pizaniflor@tudelft.nl)

# Review

- Web Server
- HTML Page


## Web Server

![WebServerWorkflow](WebServerWorkflow.png)

**Quick reference:** https://www.quackit.com/web_servers/tutorial/how_web_servers_work.cfm

## HTML Page

<img src="BasicHTMLPage.png" alt="Basic HTML Page Structure" style="width: 800px;"/>

**Source:** https://devpost.com/software/mike-dastic-basic-html-structure

<img src="HTMLPageStructure.png" alt="Advanced HTML Page Structure" style="width: 800px;"/>

**Source:** https://stackoverflow.com/questions/51609208/html5-page-structure-section-and-article-correct-placement


<img src="HTML5vsHTML4.png" alt="HTML4 vs HTML5" style="width: 800px;"/>

**Source:** https://dotnetinter.livejournal.com/78240.html?

**Quick reference:** https://blog.stackpath.com/autonomous-system-number/

# Scraping

1. Exploring the target webpage
2. Loading relevant libraries
3. While the end page has not been reached: scrape!
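
The steps above can be sketched as a breadth-first walk over linked pages. This is only an illustrative sketch, not the assignment's reference solution: the `crawl` function, the `fetch` parameter, the `max_pages` limit and the `example.org` pages are all hypothetical, and it assumes that a page without any `<a>` link is a leaf (article) page. An in-memory dictionary stands in for the web server so the example runs without network access.

```python
from collections import deque
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def crawl(start_url, fetch, max_pages=100):
    """Follow links breadth-first and return the URLs of pages without links."""
    to_visit = deque([start_url])
    seen = {start_url}
    leaf_urls = []
    visited = 0
    while to_visit and visited < max_pages:
        url = to_visit.popleft()
        visited += 1
        soup = BeautifulSoup(fetch(url), "html.parser")
        links = [urljoin(url, a["href"]) for a in soup.find_all("a", href=True)]
        if not links:
            # Assumption: a page with no outgoing links is an article page
            leaf_urls.append(url)
        for link in links:
            if link not in seen:
                seen.add(link)
                to_visit.append(link)
    return leaf_urls


# Hypothetical miniature site: URL -> HTML
FAKE_SITE = {
    "https://example.org/": '<a href="./2012.html">Articles in 2012</a>',
    "https://example.org/2012.html": '<a href="./2012-1.html">Month 1 in 2012</a>',
    "https://example.org/2012-1.html": '<a href="./a.html">A</a><a href="./b.html">B</a>',
    "https://example.org/a.html": "<h1>A</h1>",
    "https://example.org/b.html": "<h1>B</h1>",
}

leaf_urls = crawl("https://example.org/", FAKE_SITE.__getitem__)
print(leaf_urls)
```

Passing the download function in as `fetch` keeps the traversal logic separate from the HTTP layer, so the same loop could later be driven by a real request function.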

# Exploring the target webpage

Inspect the page: https://jdestefani.github.io/SEN163A-TabularRazorArchives/

# Loading relevant libraries

## Beautiful Soup

In [1]:
from bs4 import BeautifulSoup

In [2]:
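# Note: BeautifulSoup parses markup; it does not download pages.
# Given a URL, it simply treats the URL string as the text to parse.
index_page = BeautifulSoup("https://jdestefani.github.io/SEN163A-TabularRazorArchives/")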
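
A quick check of what this call actually produces may help: Beautiful Soup never contacts the server, so the resulting object contains only the URL text and no tags. The sketch below names `html.parser` explicitly (an assumption, since the cell above does not name a parser), and the variable names are illustrative. Beautiful Soup may also print a warning that the input looks like a URL rather than markup.

```python
from bs4 import BeautifulSoup

url = "https://jdestefani.github.io/SEN163A-TabularRazorArchives/"

# The URL is parsed as if it were the page content: no request is made
url_soup = BeautifulSoup(url, "html.parser")

links_found = url_soup.find_all("a")  # a bare URL string contains no tags
parsed_text = url_soup.get_text()     # the "document" is just the URL itself

print(links_found)
print(parsed_text)
```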



## Adding the requests library to perform HTTP requests

In [3]:
import requests
source_page = "https://jdestefani.github.io/SEN163A-TabularRazorArchives/"
response = requests.get(source_page)
index_page_html = response.text
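
The request above assumes the server answers successfully. A slightly more defensive variant, sketched here under the assumption that a failed download should stop the scraper, wraps the call in a small helper. The name `fetch_html` and the 10-second timeout are illustrative choices, not part of the original notebook.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the body of `url` as text, raising on HTTP error statuses."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx answers
    return response.text


# Usage (requires network access):
# index_page_html = fetch_html("https://jdestefani.github.io/SEN163A-TabularRazorArchives/")
```

Without `raise_for_status()`, an error page returned by the server would be parsed like any other HTML, which tends to produce confusing empty results further down.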

In [4]:
index_page_html

'<head>\n<title>\nArticles\n</title>\n</head>\n<body>\n<h1>All articles</h1>\n<br><br>\n\n\n<h2>Articles year 2012</h2>\n<br>\n<div class="yearlink"> <a href="./2012.html">Articles in 2012</a> </div>\n<h2>Articles year 2013</h2>\n<br>\n<div class="yearlink"> <a href="./2013.html">Articles in 2013</a> </div>\n<h2>Articles year 2014</h2>\n<br>\n<div class="yearlink"> <a href="./2014.html">Articles in 2014</a> </div>\n<h2>Articles year 2015</h2>\n<br>\n<div class="yearlink"> <a href="./2015.html">Articles in 2015</a> </div>\n<h2>Articles year 2016</h2>\n<br>\n<div class="yearlink"> <a href="./2016.html">Articles in 2016</a> </div>\n<h2>Articles year 2017</h2>\n<br>\n<div class="yearlink"> <a href="./2017.html">Articles in 2017</a> </div>\n<h2>Articles year 2018</h2>\n<br>\n<div class="yearlink"> <a href="./2018.html">Articles in 2018</a> </div>\n<h2>Articles year 2019</h2>\n<br>\n<div class="yearlink"> <a href="./2019.html">Articles in 2019</a> </div>\n'

# Top-level parsing

In [5]:
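# No parser is named here, so Beautiful Soup picks the best one installed
# (and may warn about it); the result can differ slightly between parsers.
index_page = BeautifulSoup(index_page_html)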

In [6]:
print(index_page.prettify())

<html>
 <head>
  <title>
   Articles
  </title>
 </head>
 <body>
  <h1>
   All articles
  </h1>
  <br/>
  <br/>
  <h2>
   Articles year 2012
  </h2>
  <br/>
  <div class="yearlink">
   <a href="./2012.html">
    Articles in 2012
   </a>
  </div>
  <h2>
   Articles year 2013
  </h2>
  <br/>
  <div class="yearlink">
   <a href="./2013.html">
    Articles in 2013
   </a>
  </div>
  <h2>
   Articles year 2014
  </h2>
  <br/>
  <div class="yearlink">
   <a href="./2014.html">
    Articles in 2014
   </a>
  </div>
  <h2>
   Articles year 2015
  </h2>
  <br/>
  <div class="yearlink">
   <a href="./2015.html">
    Articles in 2015
   </a>
  </div>
  <h2>
   Articles year 2016
  </h2>
  <br/>
  <div class="yearlink">
   <a href="./2016.html">
    Articles in 2016
   </a>
  </div>
  <h2>
   Articles year 2017
  </h2>
  <br/>
  <div class="yearlink">
   <a href="./2017.html">
    Articles in 2017
   </a>
  </div>
  <h2>
   Articles year 2018
  </h2>
  <br/>
  <div class="yearlink">
   <a href="./

In [7]:
for link in index_page.find_all('a', href=True):
    print("Found link: " + str(link))

Found link: <a href="./2012.html">Articles in 2012</a>
Found link: <a href="./2013.html">Articles in 2013</a>
Found link: <a href="./2014.html">Articles in 2014</a>
Found link: <a href="./2015.html">Articles in 2015</a>
Found link: <a href="./2016.html">Articles in 2016</a>
Found link: <a href="./2017.html">Articles in 2017</a>
Found link: <a href="./2018.html">Articles in 2018</a>
Found link: <a href="./2019.html">Articles in 2019</a>


In [8]:
# `link` still holds the last element visited by the loop above
link.get('href')

'./2019.html'
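
To collect every target rather than only the last one, the same `.get('href')` call can be applied to each link. The sketch below uses a two-entry snippet copied from the index page so that it runs without network access; the variable names and the explicit `html.parser` argument are illustrative assumptions.

```python
from bs4 import BeautifulSoup

# Two entries copied from the index page shown earlier
index_snippet = """
<div class="yearlink"> <a href="./2012.html">Articles in 2012</a> </div>
<div class="yearlink"> <a href="./2013.html">Articles in 2013</a> </div>
"""

snippet_soup = BeautifulSoup(index_snippet, "html.parser")

# href=True keeps only the <a> tags that actually carry an href attribute
year_hrefs = [a.get("href") for a in snippet_soup.find_all("a", href=True)]
year_labels = [a.get_text() for a in snippet_soup.find_all("a", href=True)]

print(year_hrefs)   # ['./2012.html', './2013.html']
print(year_labels)  # ['Articles in 2012', 'Articles in 2013']
```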

# Intermediate level parsing

## Adding urllib to manage URLs more easily

In [9]:
from urllib.parse import urljoin

In [10]:
urljoin(source_page,link.get('href'))

'https://jdestefani.github.io/SEN163A-TabularRazorArchives/2019.html'
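
One detail worth checking is the trailing slash of the base URL, because `urljoin` resolves relative links the way a browser would. The short sketch below compares both cases; the variable names and the `example.org` address are illustrative.

```python
from urllib.parse import urljoin

base_with_slash = "https://jdestefani.github.io/SEN163A-TabularRazorArchives/"
base_without_slash = "https://jdestefani.github.io/SEN163A-TabularRazorArchives"

# With the trailing slash, the link is resolved inside the archive directory
joined_ok = urljoin(base_with_slash, "./2019.html")

# Without it, the last path segment is treated as a file name and replaced
joined_wrong = urljoin(base_without_slash, "./2019.html")

# An absolute link ignores the base entirely
joined_absolute = urljoin(base_with_slash, "https://example.org/other.html")

print(joined_ok)        # https://jdestefani.github.io/SEN163A-TabularRazorArchives/2019.html
print(joined_wrong)     # https://jdestefani.github.io/2019.html
print(joined_absolute)  # https://example.org/other.html
```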

## HTTP Request + Beautiful Soup Loop

In [11]:
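link_list = index_page.find_all('a', href=True)

# Only the first two year pages are fetched here, to keep the output short
for link in link_list[0:2]:
    # Turn the relative href into an absolute URL before requesting it
    response = requests.get(urljoin(source_page, link.get('href')))
    current_page_html = response.text
    current_page = BeautifulSoup(current_page_html)
    print(current_page.prettify())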

<html>
 <head>
  <title>
   Articles in 2012
  </title>
 </head>
 <body>
  <h1>
   2012
  </h1>
  <br/>
  <br/>
  <div class="monthlink">
   <a href="./2012-1.html">
    Month 1 in 2012
   </a>
  </div>
  <div class="monthlink">
   <a href="./2012-2.html">
    Month 2 in 2012
   </a>
  </div>
  <div class="monthlink">
   <a href="./2012-3.html">
    Month 3 in 2012
   </a>
  </div>
  <div class="monthlink">
   <a href="./2012-4.html">
    Month 4 in 2012
   </a>
  </div>
  <div class="monthlink">
   <a href="./2012-5.html">
    Month 5 in 2012
   </a>
  </div>
  <div class="monthlink">
   <a href="./2012-6.html">
    Month 6 in 2012
   </a>
  </div>
  <div class="monthlink">
   <a href="./2012-7.html">
    Month 7 in 2012
   </a>
  </div>
  <div class="monthlink">
   <a href="./2012-8.html">
    Month 8 in 2012
   </a>
  </div>
  <div class="monthlink">
   <a href="./2012-9.html">
    Month 9 in 2012
   </a>
  </div>
  <div class="monthlink">
   <a href="./2012-10.html">
    Month 10 i
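
The year page follows the same pattern as the index page, only with `monthlink` instead of `yearlink` as the class of the enclosing `div`. A possible helper for turning such a page into a list of absolute URLs is sketched below. The name `extract_links`, its parameters and the explicit `html.parser` argument are assumptions made for illustration, and the HTML is a two-entry excerpt of the output above so that no request is needed.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_links(page_html, page_url, css_class):
    """Return absolute URLs of the links inside <div class=css_class> elements."""
    soup = BeautifulSoup(page_html, "html.parser")
    links = []
    for div in soup.find_all("div", class_=css_class):
        anchor = div.find("a", href=True)
        if anchor is not None:
            # Relative hrefs are resolved against the page they were found on
            links.append(urljoin(page_url, anchor["href"]))
    return links


year_page_url = "https://jdestefani.github.io/SEN163A-TabularRazorArchives/2012.html"
year_page_snippet = """
<div class="monthlink"> <a href="./2012-1.html">Month 1 in 2012</a> </div>
<div class="monthlink"> <a href="./2012-2.html">Month 2 in 2012</a> </div>
"""

month_urls = extract_links(year_page_snippet, year_page_url, "monthlink")
print(month_urls)
```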

# Bottom-level parsing

In [12]:
leaf_page_URL = "https://jdestefani.github.io/SEN163A-TabularRazorArchives/01/12/2012/eius-etincidunt-consectetur-eius.html"
response = requests.get(leaf_page_URL)
leaf_page_HTML = response.text
leaf_page = BeautifulSoup(leaf_page_HTML)

In [13]:
print(leaf_page.prettify())

<html>
 <head>
  <title>
   Eius etincidunt consectetur eius
  </title>
 </head>
 <body>
  <h1>
   Eius etincidunt consectetur eius
  </h1>
  <i>
   <div class="author">
    Bebe Riva
   </div>
   -
   <div class="date">
    2012-12-01
   </div>
   -
   <div class="time">
    17:14
   </div>
  </i>
  <p>
   Aliquam quaerat eius sed est. Magnam velit velit quiquia. Amet numquam amet etincidunt sed non ipsum. Consectetur numquam labore dolorem amet modi. Sed aliquam dolorem quaerat sed quiquia tempora est. Labore amet voluptatem adipisci non modi.
  </p>
  <p>
   Quisquam consectetur numquam tempora. Modi consectetur dolor velit quiquia ut. Eius adipisci dolorem dolor magnam. Dolor dolorem modi voluptatem magnam ut. Eius neque non ipsum etincidunt etincidunt. Eius eius sit non. Neque quaerat magnam quiquia sit voluptatem dolor. Tempora dolorem sed amet labore porro.
  </p>
  <p>
   Dolor magnam amet dolorem magnam modi est modi. Porro tempora dolorem neque numquam dolor. Velit eius volup

In [14]:
leaf_page.find_all('div')

[<div class="author">Bebe Riva</div>,
 <div class="date">2012-12-01</div>,
 <div class="time">17:14</div>]

In [15]:
for div_element in leaf_page.find_all('div'):
    print(div_element)

<div class="author">Bebe Riva</div>
<div class="date">2012-12-01</div>
<div class="time">17:14</div>


In [16]:
# `div_element` still holds the last <div> visited by the loop above
div_element.get_text()

'17:14'

In [17]:
post_data_list = []

for div_element in leaf_page.find_all('div'):
    # `class` is a multi-valued attribute, so .get('class') returns a list
    if div_element.get('class') == ['author']:
        author_name = div_element.get_text()
    elif div_element.get('class') == ['date']:
        post_date = div_element.get_text()
    elif div_element.get('class') == ['time']:
        post_time = div_element.get_text()
    print(div_element)

# Store one (author, date, time) tuple per article
post_data_list.append((author_name, post_date, post_time))

<div class="author">Bebe Riva</div>
<div class="date">2012-12-01</div>
<div class="time">17:14</div>


In [18]:
post_data_list

[('Bebe Riva', '2012-12-01', '17:14')]
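
The comparison against `['author']` rather than `'author'` works because Beautiful Soup returns the `class` attribute as a list of class names. The sketch below shows this on the metadata block of the leaf page; the snippet is copied from the output above, while the variable names and the explicit `html.parser` argument are illustrative assumptions.

```python
from bs4 import BeautifulSoup

# Metadata block copied from the leaf page shown earlier
metadata_snippet = (
    '<i><div class="author">Bebe Riva</div> - '
    '<div class="date">2012-12-01</div> - '
    '<div class="time">17:14</div></i>'
)
metadata_soup = BeautifulSoup(metadata_snippet, "html.parser")

post_fields = {}
for div in metadata_soup.find_all("div"):
    classes = div.get("class", [])  # a list such as ['author'], not a string
    if classes:
        # Use the first class name as the key for the extracted text
        post_fields[classes[0]] = div.get_text()

print(post_fields)
```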

In a more compact format, as suggested by one of your colleagues:

In [19]:
leaf_page.find_all('div',class_='author')

[<div class="author">Bebe Riva</div>]

In [20]:
leaf_page.find_all('div',class_='author')[0]

<div class="author">Bebe Riva</div>

In [21]:
leaf_page.find_all('div',class_='author')[0].get_text()

'Bebe Riva'
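
Indexing with `[0]` raises an `IndexError` when a page has no matching `div`. One possible way to guard against that, sketched here as an assumption rather than a required approach, is to use `find`, which returns the first match or `None`. The helper name `get_div_text` and the explicit `html.parser` argument are illustrative.

```python
from bs4 import BeautifulSoup


def get_div_text(soup, css_class):
    """Return the text of the first <div class=css_class>, or None if absent."""
    div = soup.find("div", class_=css_class)
    return div.get_text() if div is not None else None


# Metadata block copied from the leaf page shown earlier
leaf_snippet = (
    '<i><div class="author">Bebe Riva</div> - '
    '<div class="date">2012-12-01</div> - '
    '<div class="time">17:14</div></i>'
)
leaf_soup = BeautifulSoup(leaf_snippet, "html.parser")

post_record = tuple(get_div_text(leaf_soup, name) for name in ("author", "date", "time"))
missing_field = get_div_text(leaf_soup, "subtitle")  # no such div on this page

print(post_record)
print(missing_field)
```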