# Web scraping

When a site offers no API, data can be collected directly from its web pages, a technique known as web scraping. This means extracting content, such as text, images, or links, from the HTML of a page.

To do this, the HTML markup (plain text) must be parsed into a meaningful structure. In Python, [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/) is a popular library for working with HTML.

In [None]:
!pip install BeautifulSoup4
!pip install lxml

In [None]:
from bs4 import BeautifulSoup
import requests

website = requests.get("https://www.helsinki.fi").text
parsed = BeautifulSoup( website, 'html.parser' ) ## built-in parser; 'lxml' is often faster

print( parsed )
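
Printing a full live page produces a lot of output, so it can help to see what the parsed structure looks like on a very small document first. The following is a minimal sketch; the page content and the names `sample_html` and `sample` are made up for illustration.

```python
from bs4 import BeautifulSoup

# A tiny hand-written page (hypothetical, for illustration only).
sample_html = """
<html>
  <head><title>Example page</title></head>
  <body>
    <h1>Welcome</h1>
    <p>First paragraph with a <a href="/about">link</a>.</p>
  </body>
</html>
"""

sample = BeautifulSoup(sample_html, "html.parser")

# Tag names can be used as attributes; each returns the first matching tag.
page_title = sample.title.get_text()
heading = sample.h1.get_text()
first_link = sample.p.a  # the first <a> inside the first <p>

print(page_title)
print(heading)
print(first_link["href"])
```

The parsed object is a tree of tags, so moving from `sample.p` to `sample.p.a` follows the nesting of the markup itself.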

# Finding elements

The `find_all` method finds all matching elements on the page, selected by tag name, ID, CSS class, or a combination of these.
The `find` method returns only the first matching element.
These calls can also be nested.

In [None]:
for link in parsed.find_all('a'):
    print( link )

In [None]:
for text in parsed.find_all( class_ = 'paragraph'):
    print( text )

In [None]:
for text in parsed.find_all( id = 'does_not_exist'):
    print( text )

In [None]:
for text in parsed.find_all(attrs={"role": "banner"}):
    print( text )

In [None]:
## nested structure

for banner in parsed.find_all(attrs={"role": "banner"}):
    for link in banner.find_all('a'):
        print( link )
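
The results of the cells above depend on what the live page contains at the moment. To compare `find` and `find_all` on known input, the same calls can be tried on a small inline snippet. This is only a sketch: the markup and the names `snippet` and `soup` are invented for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, loosely imitating a page header and one paragraph.
snippet = """
<div id="menu" role="banner">
  <a class="nav" href="/home">Home</a>
  <a class="nav external" href="https://example.org">Partner</a>
</div>
<p class="paragraph">Body text.</p>
"""
soup = BeautifulSoup(snippet, "html.parser")

all_links = soup.find_all("a")                # every <a> tag, as a list
first_link = soup.find("a")                   # only the first <a> tag
missing = soup.find(id="does_not_exist")      # find gives None when nothing matches
nothing = soup.find_all(id="does_not_exist")  # find_all gives an empty list

# Combining criteria: tag name plus CSS class.
external = soup.find_all("a", class_="external")

# Nesting: search only inside the element with id="menu".
menu_links = soup.find(id="menu").find_all("a")

print(len(all_links), first_link.get_text(), missing, len(nothing))
print(external[0]["href"], len(menu_links))
```

Note the difference in the "no match" case: a loop over `find_all` simply runs zero times, while the result of `find` has to be checked before it is used.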

## Attributes and content

HTML tags have attributes and content, and both can be accessed on the parsed elements:

In [None]:
for link in parsed.find_all('a'):
    print( link.get('href'), link.get_text() )
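
As a complement, here is a small sketch of a few other ways to read attributes and content from a single tag. The markup and the names `attr_html`, `attr_soup` and `anchor` are assumptions made for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with a link that has two attributes.
attr_html = '<p class="lead intro">Read the <a href="/news" title="Latest news"> news </a> today.</p>'
attr_soup = BeautifulSoup(attr_html, "html.parser")

anchor = attr_soup.find("a")

all_attributes = anchor.attrs             # dict with every attribute of the tag
href = anchor["href"]                     # dictionary-style access to one attribute
raw_text = anchor.get_text()              # text content, whitespace kept
clean_text = anchor.get_text(strip=True)  # text content, surrounding whitespace removed
classes = attr_soup.p["class"]            # class can hold several values, so it is a list

print(all_attributes)
print(href, repr(raw_text), repr(clean_text))
print(classes)
```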

## Tasks

1. Find all links on the Yle.fi main page. How many of them start with http?
1. Find all images on Yle.fi and print their URLs.
1. Go through the front pages of all Finnish universities. Which of them have a link to (a) Facebook, (b) TikTok or (c) X?
1. Extract the text of a single article on Yle.fi
1. Extract the text of a single article on HS.fi
1. Extract the text of a single news article on Helsinki.fi
1. Extract the text of a single news article on Aamulehti.fi
1. Extract the text of a single news article on BBC.com
1. Extract the text of a single news article on New York Times

# XPath

There is a [dedicated query language](https://en.wikipedia.org/wiki/XPath), XPath, for working with HTML/XML structured documents. You can copy exact queries from the browser developer tools.
Using XPath requires parsing the page with the lxml library.

In [None]:
from lxml import etree

dom = etree.HTML(website)

for element in dom.xpath('//a'):
    print( element.get('href'), element.text )
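
XPath queries can do more than select tags: they can filter on attributes and return attribute values or text directly. The sketch below runs a few such queries on an inline snippet; the markup and the names `xpath_html` and `tree` are invented for illustration.

```python
from lxml import etree

# Hypothetical markup: two links inside a news block, one outside it.
xpath_html = """
<html><body>
  <div id="news">
    <a href="/a1" class="headline">First story</a>
    <a href="https://example.org/a2">Second story</a>
  </div>
  <a href="/contact">Contact</a>
</body></html>
"""
tree = etree.HTML(xpath_html)

# href values of the links directly under the element with id="news"
hrefs_in_news = tree.xpath('//div[@id="news"]/a/@href')

# text of the links that have class="headline"
headline_text = tree.xpath('//a[@class="headline"]/text()')

# href values that start with "http"
absolute = tree.xpath('//a[starts-with(@href, "http")]/@href')

print(hrefs_in_news)
print(headline_text)
print(absolute)
```

Each query returns a list, so a query without matches gives an empty list rather than an error.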

# Following links, crawlers and spiders

Sometimes there is a need to follow links, for example to identify follow-up pages and crawl their content as well.
In the simplest form, one can detect links and open them.
There are also specific libraries for this purpose, such as [Scrapy](https://docs.scrapy.org/en/latest/).

In [None]:
site = 'https://www.helsinki.fi'

html = requests.get( site ).text
parsed = BeautifulSoup( html, 'html.parser' )

for link in parsed.find_all('a'):
    if link.get('href').startswith('/'): ## checking that the link is under this main site, not e.g. to Sisu or Facebook
        print( site + link.get('href') )
        html2 = requests.get( site + link.get('href') ).text
        ## parse and work forward with html2 if needed

## Tasks

* Fix the code above so that it works correctly, i.e. handle the `NoneType` error
* Collect all links that point outside the same domain
* Parse the collected pages and count how many times `cat` is mentioned across them.
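
The crawling loop earlier builds each address by concatenating `site` and the `href`, which works for links that start with `/`. A possible alternative is `urljoin` from the standard library, which resolves several kinds of links against the page they were found on. This is a sketch with made-up example values; `base` and `examples` are not taken from the real site.

```python
from urllib.parse import urljoin

# Hypothetical page address and href values, for illustration only.
base = "https://www.helsinki.fi/en/news"
examples = ["/en/research", "contact", "https://www.facebook.com/x"]

# urljoin resolves each href relative to the page it was found on.
resolved = [urljoin(base, href) for href in examples]

for href, url in zip(examples, resolved):
    print(href, "->", url)
```

A link that is already absolute is returned unchanged, while relative links are completed with the scheme, host and path of `base`.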

In [None]:
!pip install scrapy

In [None]:
import scrapy

class ToScrapeSpider(scrapy.Spider):
    name = "toscrape"
    start_urls = [
        'http://quotes.toscrape.com/',
    ]

    def parse(self, response):
        for quote in response.css("div.quote"):
            yield {
                'text': quote.css("span.text::text").extract_first(),
                'author': quote.css("small.author::text").extract_first()
            }

        next_page_url = response.css("li.next > a::attr(href)").extract_first()
        if next_page_url is not None:
            yield scrapy.Request(response.urljoin(next_page_url))

In [None]:
from scrapy.crawler import CrawlerProcess

process = CrawlerProcess(
    settings={
        "FEEDS": {
            "quotes.json": {"format": "json"},
        }
    }
)

process.crawl( ToScrapeSpider )
process.start()

## Task

* Adapt the Scrapy spider to work on HS.fi