# SEN163A - Fundamentals of Data Analytics
## Assignment 3 - Newspaper data web scraping
## Lab 2
### Dr. Ir. Jacopo De Stefani - [J.deStefani@tudelft.nl](mailto:J.deStefani@tudelft.nl)
### Joao Pizani Flor, M.Sc. - [J.p.pizaniflor@tudelft.nl](mailto:J.p.pizaniflor@tudelft.nl)

# Review

- Web Server
- HTML Page


## Web Server

![WebServerWorkflow](WebServerWorkflow.png)

**Quick reference:** https://www.quackit.com/web_servers/tutorial/how_web_servers_work.cfm
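To make the request/response cycle in the diagram concrete, here is a minimal sketch using only the Python standard library. It starts a throwaway web server on the local machine and sends it a single GET request. The page content, the `PageHandler` class, and the `fetch_from_local_server` helper are illustrative assumptions and not part of the assignment.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical page served by the toy server
PAGE = b"<html><body><h1>Hello</h1></body></html>"


class PageHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # The server answers every GET request with the same HTML document
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(PAGE)))
        self.end_headers()
        self.wfile.write(PAGE)

    def log_message(self, format, *args):
        pass  # Silence the default per-request logging


def fetch_from_local_server():
    # Port 0 asks the operating system for any free port
    server = HTTPServer(("127.0.0.1", 0), PageHandler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        # The client side: open a connection, send a request, read the response
        conn = http.client.HTTPConnection("127.0.0.1", server.server_port, timeout=5)
        conn.request("GET", "/")
        response = conn.getresponse()
        result = (response.status, response.getheader("Content-Type"), response.read())
        conn.close()
        return result
    finally:
        server.shutdown()
        server.server_close()


status, content_type, body = fetch_from_local_server()
print(status, content_type)
print(body.decode("utf-8"))
```

A scraper plays the role of the client in this exchange: it sends the request and then works with the HTML it receives in the response body.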

## HTML Page

<img src="BasicHTMLPage.png" alt="Basic HTML Page Structure" style="width: 800px;"/>

**Source:** https://devpost.com/software/mike-dastic-basic-html-structure

<img src="HTMLPageStructure.png" alt="Advanced HTML Page Structure" style="width: 800px;"/>

**Source:** https://stackoverflow.com/questions/51609208/html5-page-structure-section-and-article-correct-placement


<img src="HTML5vsHTML4.png" alt="HTML4 vs HTML5" style="width: 800px;"/>

**Source:** https://dotnetinter.livejournal.com/78240.html?

**Quick reference:** https://blog.stackpath.com/autonomous-system-number/
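The nesting shown in the figures can also be inspected programmatically. The sketch below uses the standard-library `html.parser` module to print an indented outline of the tags in a small document. The `SAMPLE` document, the `OutlineParser` class, and the short `VOID_TAGS` list are assumptions made for illustration; a real page contains many more elements.

```python
from html.parser import HTMLParser

# Elements without a closing tag (a partial list, enough for this sketch)
VOID_TAGS = {"meta", "link", "img", "br", "hr", "input"}

# Hypothetical HTML5 page using semantic elements
SAMPLE = """<!DOCTYPE html>
<html>
  <head><meta charset="utf-8"><title>Demo</title></head>
  <body>
    <header><h1>Site</h1></header>
    <article><p>Text</p></article>
    <footer></footer>
  </body>
</html>"""


class OutlineParser(HTMLParser):
    """Records (depth, tag) for every opening tag in the document."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []

    def handle_starttag(self, tag, attrs):
        self.outline.append((self.depth, tag))
        if tag not in VOID_TAGS:
            self.depth += 1  # Children of this element are one level deeper

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS:
            self.depth -= 1


def outline(html):
    parser = OutlineParser()
    parser.feed(html)
    parser.close()
    return parser.outline


for depth, tag in outline(SAMPLE):
    print("  " * depth + tag)
```

Knowing where an element sits in this tree is what later lets a selector such as `div.author` find it.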

# Scraping

1. Exploring the target webpage
2. Loading relevant libraries
3. While the last page has not been reached: scrape!
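The three steps above can be sketched as a loop that keeps scraping until a page has no "next" link. This is only an illustration under stated assumptions: the `PAGES` dictionary stands in for a real website, `fetch` stands in for an HTTP request, and the `title` and `next` class names are invented for the example. It uses the standard-library parser so that it runs without extra installs.

```python
from html.parser import HTMLParser

# Hypothetical three-page archive standing in for a real website
PAGES = {
    "page-1.html": '<h2 class="title">First</h2><a class="next" href="page-2.html">Next</a>',
    "page-2.html": '<h2 class="title">Second</h2><a class="next" href="page-3.html">Next</a>',
    "page-3.html": '<h2 class="title">Third</h2>',
}


class PageParser(HTMLParser):
    """Collects the text of h2.title elements and the href of a.next, if any."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self.next_url = None
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "h2" and attrs.get("class") == "title":
            self._in_title = True
        elif tag == "a" and attrs.get("class") == "next":
            self.next_url = attrs.get("href")

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.titles.append(data.strip())


def fetch(url):
    # Stand-in for an HTTP request to the website
    return PAGES[url]


def scrape_all(start_url):
    titles = []
    url = start_url
    # Loop until a page offers no link to a following page
    while url is not None:
        parser = PageParser()
        parser.feed(fetch(url))
        parser.close()
        titles.extend(parser.titles)
        url = parser.next_url
    return titles


print(scrape_all("page-1.html"))  # prints ['First', 'Second', 'Third']
```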

# Exploring the target webpage

Inspect the page: https://jdestefani.github.io/SEN163A-TabularRazorArchives/

# Scraping vs Crawling

**Web scraping:** Web scraping (or data scraping) is a technique used to collect content and data from the internet. This data is usually saved in a local file so that it can be manipulated and analyzed as needed.

Working definition from: https://careerfoundry.com/en/blog/data-analytics/web-scraping-guide/

**Web crawling:** Web crawling is the process of indexing data on web pages by using a program or automated script. These automated scripts or programs are known by multiple names, including web crawler, spider, spider bot, and often shortened to crawler.

Working definition from: https://research.aimultiple.com/web-crawler/


|  | **Scrapy (Crawling)** | **BeautifulSoup (Scraping)** |
|---|---|---|
| **Functionality** | Scrapy is a complete framework for downloading web pages, processing them, and saving the results to files and databases. | BeautifulSoup is essentially an HTML and XML parser; it requires additional libraries such as `requests` or `urllib` to open URLs and fetch the pages. |
| **Learning Curve** | Scrapy is a powerhouse for web scraping and offers many ways to scrape a web page. It takes more time to learn, but once learned it makes it easy to build web crawlers and run them with a single command. Mastering all of its functionality takes practice. | BeautifulSoup is relatively easy to understand for newcomers to programming and gets smaller tasks done quickly. |
| **Speed and Load** | Scrapy handles large jobs well. It can crawl a group of URLs quickly (how quickly depends on the size of the group) because it is built on Twisted, which works asynchronously (non-blocking) for concurrency. | BeautifulSoup handles simple scraping jobs efficiently. It is slower than Scrapy unless you add multiprocessing. |
| **Extending functionality** | Scrapy provides item pipelines that let you write functions to process scraped data, for example validating it, filtering it, and saving it to a database. It provides spider contracts to test your spiders and lets you create generic and deep crawlers. It also lets you manage settings such as retries and redirection. | If the project does not require much logic, BeautifulSoup is good for the job; if it requires customization such as proxies, cookie management, and data pipelines, Scrapy is the better option. |

Source: https://www.datacamp.com/community/tutorials/making-web-crawlers-scrapy-python#compare
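The distinction between crawling and scraping can be illustrated with a small sketch that only discovers pages and extracts no data. It keeps a frontier of URLs to visit, a set of URLs already seen, and a domain filter. The `SITE` dictionary, the `example.org` URLs, and the `crawl` function are hypothetical stand-ins for a real website and are not how Scrapy is implemented internally.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Hypothetical website: URL -> HTML content
SITE = {
    "https://example.org/index.html": (
        '<a href="a.html">A</a> <a href="b.html">B</a> '
        '<a href="https://other.example/x.html">External</a>'
    ),
    "https://example.org/a.html": '<a href="index.html">Home</a> <a href="b.html">B</a>',
    "https://example.org/b.html": "<p>No links here</p>",
}


class LinkCollector(HTMLParser):
    """Collects the href attribute of every <a> element."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def crawl(start_url, allowed_domain):
    visited = []                   # Pages in the order they were crawled
    seen = {start_url}             # Prevents visiting a page twice
    frontier = deque([start_url])  # Pages discovered but not yet crawled
    while frontier:
        url = frontier.popleft()
        visited.append(url)
        collector = LinkCollector()
        collector.feed(SITE.get(url, ""))  # Stand-in for downloading the page
        for href in collector.links:
            absolute = urljoin(url, href)  # Resolve relative links
            if urlparse(absolute).netloc != allowed_domain:
                continue                   # Stay inside the allowed domain
            if absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return visited


for page in crawl("https://example.org/index.html", "example.org"):
    print(page)
```

A scraper would add one more step inside the loop: extracting the fields of interest from each downloaded page.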

## Crawling example

```python
import scrapy


class ArticlesSpider(scrapy.Spider):
    name = 'articles'

    # Prevent the crawler from leaving this domain
    allowed_domains = ['jdestefani.github.io']

    # Starting point (or points) from which to crawl the website
    start_urls = [
        'https://jdestefani.github.io/SEN163A-TabularRazorArchives/2012-1.html'
    ]

    # The parse method decides what to do with each downloaded page
    # (passed in as the response parameter)
    def parse(self, response):
        # Read some data from the page through CSS selectors.
        # Compare this:
        author = response.css('div.author::text').get()
        # with the BeautifulSoup equivalent:
        # leaf_page.find_all('div', class_='author')[0].get_text()

        # If an author was found (author is neither None nor empty)
        if author:
            # Then return the data we found
            yield {
                'author': author,
            }
        else:
            # Otherwise keep exploring the links (a elements with an href attribute)
            all_links = response.css('a::attr(href)')
            # Every linked page is handled by this very same method
            for a in all_links:
                yield response.follow(a, callback=self.parse)
```

```python
from scrapy.crawler import CrawlerProcess

# Specify the output path
output_path = './article-metadata.jl'

# Specify some parameters of the crawler
process = CrawlerProcess(
    settings={
        "FEEDS": {
            # Write one JSON object per line (JSON Lines format)
            output_path: {"format": "jsonlines"},
        },
    }
)

# Schedule the spider class defined above
process.crawl(ArticlesSpider)

# The notebook will block here until the crawling is finished.
# The Twisted reactor cannot be restarted, so rerunning this cell
# requires restarting the kernel first.
process.start()
```


2022-03-29 14:50:50 [scrapy.utils.log] INFO: Scrapy 2.6.1 started (bot: scrapybot)
2022-03-29 14:50:50 [scrapy.utils.log] INFO: Versions: lxml 4.8.0.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 22.2.0, Python 3.10.2 (tags/v3.10.2:a58ebcc, Jan 17 2022, 14:12:15) [MSC v.1929 64 bit (AMD64)], pyOpenSSL 22.0.0 (OpenSSL 1.1.1m  14 Dec 2021), cryptography 36.0.1, Platform Windows-10-10.0.19042-SP0
2022-03-29 14:50:50 [scrapy.crawler] INFO: Overridden settings:
{}
2022-03-29 14:50:50 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2022-03-29 14:50:50 [scrapy.extensions.telnet] INFO: Telnet Password: 1fa83074a055e6d3
2022-03-29 14:50:50 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.logstats.LogStats']
2022-03-29 14:50:50 [scrapy.middleware] INFO: Enabled downloader middlewares:
['s