## Advanced HTML data collection

Increasingly, websites cannot be analysed through plain web scraping alone -- they include components which are rendered only in the browser.
This is both a challenge and an opportunity, as such webpages often pull their content from internal grey APIs, which can also be used for data collection.

## Selenium

[Selenium](https://www.selenium.dev/) is a tool which controls a real web browser, so the scripts on a page are executed and the page is rendered as a browser would render it.
It is possible to get the full _rendered_ HTML and use other tools on that, such as BeautifulSoup, or to use Selenium's built-in tools to collect data and navigate the website.

In [None]:
!pip install selenium beautifulsoup4 lxml requests

### Opening a new web browser

Selenium is designed for automated testing of websites in different browsers.
Therefore, you can open different browsers through [webdrivers](https://www.selenium.dev/documentation/webdriver/).
The `page_source` attribute contains the HTML of the full page, and the [`find_element` method](https://www.selenium.dev/documentation/webdriver/elements/locators/) can be used to identify elements by tag name, class name, id or XPath.

It is also possible to [interact](https://www.selenium.dev/documentation/webdriver/elements/interactions/) with the webpage, for example by clicking elements, pressing keys or [producing action chains](https://www.selenium.dev/documentation/webdriver/actions_api/). [Selenium IDE](https://www.selenium.dev/selenium-ide/) helps with building these.

In [None]:
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()

driver.get("http://www.helsinki.fi")

In [None]:
from bs4 import BeautifulSoup

html_content = driver.page_source

parsed = BeautifulSoup( html_content, 'lxml' )

for image in parsed.find_all('img'):
    print( image )

In [None]:
for image in driver.find_elements(By.TAG_NAME, 'img'):
    print( image.get_attribute("src") )

In [None]:
for link in driver.find_elements(By.TAG_NAME, 'a'):
    print( link.get_attribute("href"), link.text )
    print( link.get_attribute('outerHTML') )
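
The `href` values collected this way usually need some cleaning before they can be counted or compared: some links have no `href` at all, and some point to non-web targets such as `mailto:` addresses. The following is a minimal sketch of such a clean-up step using only the standard library. The function names `clean_links` and `count_by_domain` and the sample values are hypothetical and only illustrate the idea.

```python
from collections import Counter
from urllib.parse import urljoin, urlparse


def clean_links(hrefs, base_url):
    """Return absolute http(s) URLs, dropping missing and non-web links."""
    cleaned = []
    for href in hrefs:
        if not href:
            # get_attribute("href") can be None or empty when a link has no href
            continue
        # Resolve relative links such as "/en/news" against the page address
        absolute = urljoin(base_url, href)
        if urlparse(absolute).scheme in ("http", "https"):
            cleaned.append(absolute)
    return cleaned


def count_by_domain(urls):
    """Count how many links point to each host name."""
    return Counter(urlparse(url).netloc for url in urls)


# Hypothetical sample values standing in for link.get_attribute("href") results
sample = ["https://www.helsinki.fi/en", "/en/news", None, "mailto:info@example.org"]
links = clean_links(sample, "https://www.helsinki.fi")
print(links)
print(count_by_domain(links))
```

With a live browser, the list of `href` values could be built with `[link.get_attribute("href") for link in driver.find_elements(By.TAG_NAME, 'a')]` and passed to `clean_links` together with the address given to `driver.get`.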

In [None]:
button = driver.find_element( By.XPATH, "/html/body/div[2]/div/hy-main/div/div/div/article/hy-hero-carousel/div/div/div[1]/div[1]/div/div[1]/hy-cta-button/span/a/span[1]")
button.click()

In [None]:
# Close the browser and end the WebDriver session
driver.quit()

### Tasks

1. Find all links on the Yle.fi main page. How many of them start with http?
1. Find all images on Yle.fi and print their URLs
1. Go through the front pages of all Finnish university websites. Which of them have a link to (a) Facebook, (b) TikTok or (c) X?
1. Go through the front pages of all Finnish university websites. Collect and download all images from them.
1. Extract the text of a single article on Yle.fi
1. Extract the text of a single article on HS.fi
1. Extract the text of a single news article on Helsinki.fi
1. Extract the text of a single news article on Aamulehti.fi
1. Extract the text of a single news article on BBC.com
1. Extract the text of a single news article on New York Times

## Grey APIs

Websites which allow unlimited scrolling often have some kind of API which allows the website to "pull" new content.
It is possible to use these APIs directly, but sometimes they require tricks -- such as setting cookies, executing the command in a web browser as a logged-in user, using a specific application key etc.

In [None]:
import requests

data = requests.get("https://ca.api.yle.fi/v1/graphql?app_id=ukko_prod&app_key=12150df3a0c8844d37c520235bf7c5d4&query=query+tuoreimmatMostRecentUutisetAndUrheiluQuery+($limit:Int!$exclude:Exclude!$publishedBefore:String+$fields:ArticleFields+%3DHEADLINE)%7Btuoreimmat:+mostRecentUutisetAndUrheiluMainNews+(+limit:$limit+exclude:$exclude+publishedBefore:$publishedBefore+fields:$fields+)%7Bitems%7Bid+title+fullUrl+lead+publisher%7Bname%7Dsubjects%7B...conceptFields%7DdatePublished+format+headline%7Bfull+short+image%7B...on+ImageBlock%7B...listItemImageBlockFields%7D%7Dvideo%7B...on+VideoBlock%7Bid+image%7B...listItemImageBlockFields%7D%7D%7Daudio%7B...on+AudioBlock%7Bid+image%7B...listItemImageBlockFields%7D%7D%7D%7DmainMedia%7Btype:__typename...on+ImageBlock%7B...listItemImageBlockFields%7D...on+AudioBlock%7Bid+image%7B...listItemImageBlockFields%7D%7D...on+VideoBlock%7Bid+offsetSeconds+image%7B...listItemImageBlockFields%7D%7D%7Dtopic%7Bid+isHidden+isLocked+acceptedCommentsCount%7D%7D%7D%7Dfragment+conceptFields+on+Concept%7Bid+alternativeIds+title%7Bfi+sv+se+en+uk+ru%7D%7Dfragment+listItemImageBlockFields+on+ImageBlock%7Bid+alt+version+blurhash+crops%7Baspect+coordinates%7Bheight+width+x+y%7D%7D%7D&variables=%7B%22limit%22:21,%22exclude%22:%7B%22properties%22:[%22importance:low%22,%22automaticListHint:never%22,%22automaticListHint:no-recently%22],%22journalisticStyle%22:[%22non_journalistic_content%22],%22coverage%22:%22LOCAL%22%7D,%22publishedBefore%22:%222024-08-22T03:15:46%2B0300%22%7D").json()

print( data )
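
A URL like the one above is easier to read and to modify when it is assembled from its parts. The sketch below shows one way to do that with the standard library. The endpoint, `app_id` and the contents of `variables` are taken from the URL above; the function name `build_graphql_url` is hypothetical, and `YOUR_APP_KEY` and `PASTE_QUERY_HERE` are placeholders for the values copied from the browser's network tab.

```python
import json
from urllib.parse import urlencode, parse_qs, urlsplit


def build_graphql_url(endpoint, app_id, app_key, query, variables):
    """Assemble a GET URL whose query string carries a GraphQL query."""
    params = {
        "app_id": app_id,
        "app_key": app_key,
        "query": query,
        # The variables are sent as a compact JSON string
        "variables": json.dumps(variables, separators=(",", ":")),
    }
    # urlencode takes care of escaping characters such as { } " and +
    return endpoint + "?" + urlencode(params)


variables = {
    "limit": 21,
    "exclude": {
        "properties": [
            "importance:low",
            "automaticListHint:never",
            "automaticListHint:no-recently",
        ],
        "journalisticStyle": ["non_journalistic_content"],
        "coverage": "LOCAL",
    },
    "publishedBefore": "2024-08-22T03:15:46+0300",
}

url = build_graphql_url(
    "https://ca.api.yle.fi/v1/graphql",
    "ukko_prod",
    "YOUR_APP_KEY",      # placeholder, not a real key
    "PASTE_QUERY_HERE",  # placeholder for the full query text
    variables,
)
print(url)
```

Changing `limit` or `publishedBefore` in the dictionary is then enough to ask for a different slice of the data. With `requests`, the same dictionary of parameters can also be passed through the `params` argument of `requests.get`, which encodes it in a similar way.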

### Tasks

* Collect comments from Yle
* Collect comments from HS
* Voluntary: Collect TikTok videos on cats.

## Tricks of the trade

Sometimes websites try to block data collection.
There are some techniques you can use:

* request headers: ensure your requests look like they come from a real web browser
* cookies: collect the cookies provided by the service and send them back with later requests
* wait: sometimes services require you to wait between requests
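
The three techniques above can be combined in a small helper, sketched below under some assumptions. The name `polite_get`, the header values and the two-second delay are hypothetical choices, not requirements of any particular service. The `session` argument is expected to be any object with a `get` method, such as a `requests.Session`.

```python
import time

# Assumed example headers; copy real values from your own browser's developer tools
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:128.0) Gecko/20100101 Firefox/128.0",
    "Accept-Language": "fi-FI,fi;q=0.9,en;q=0.8",
}


def polite_get(session, urls, delay_seconds=2.0, sleep=time.sleep):
    """Fetch each URL through one session, pausing between requests."""
    responses = []
    for index, url in enumerate(urls):
        if index > 0:
            # Wait before every request except the first one
            sleep(delay_seconds)
        # Send browser-like headers with every request
        responses.append(session.get(url, headers=BROWSER_HEADERS, timeout=10))
    return responses
```

When `session` is a `requests.Session()`, the cookies set by the service are stored in the session and sent back automatically on later requests, so calling `polite_get(requests.Session(), list_of_urls)` covers headers, cookies and waiting in one place.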