# Session 7: Web Scraping 2, HTML and parsing

*Hjalte Fejerskov Boas*

## Recap

Recall the different steps in web scraping:
1. Mapping (session 6):
    - We learned how to use the structure of the URL to go through all the webpages we want to scrape
2. Downloading (session 6):
    - We learned how to download the HTML strings of webpages
    - We learned how to use the network panel to download data directly from the webpage's server
3. Parsing (this session)

In this session we will learn how to parse the downloaded HTML into meaningful and structured data

## Required readings

- [Beautiful Soup: Build a Web Scraper With Python](https://realpython.com/beautiful-soup-web-scraper-python/)

- [A Practical Introduction to Web Scraping in Python](https://realpython.com/python-web-scraping-practical-introduction/)

# Overview of Session 7

1. What is HTML?
    - How does the tree structure work?
2. How can we find our way in the HTML string? I.e. find the data we need (parse the HTML string)
    - Regex
    - CSS selectors
    - BeautifulSoup
        - Today we will mainly spend time on BeautifulSoup

## Introduction to HTML

### Recall from previous session

How a human sees a webpage             |  How a computer sees a webpage (**HTML**)
:-------------------------:|:-------------------------:
![](https://drive.google.com/uc?exportview&id=1cbrC303j-gQnXbXyTEQBPT2xH7kgz6Cy)  |  ![](https://drive.google.com/uc?export=view&id=1VFlfDcJHCzbtmkpr4kvXzGecrDE7KmLY)

## [What is HTML?](https://www.w3schools.com/html/html_intro.asp)  

HTML (HyperText Markup Language) is the standard markup language for creating webpages

### HTML elements and tags

HTML consists of different elements: These elements tell your browser what to display and how to display it

An HTML element consists of a start tag, the element content, and an end tag.
- The tag defines the content: for example the tag ```<h1>``` defines the content as "a large heading"
- Example: 
```html 
<h1> My first heading </h1>
```

In the browser, the HTML above will show up like this: <h1> My first heading </h1>

### Important tags

Here are some examples of often used tags:
```html 
<h1> Defines a large header </h1>
<p> Defines a paragraph </p>    
<div> Defines a section </div>
<a> Defines a link </a> 
<table> Defines a table </table> 
```

### Attributes to the HTML elements
Each element can have some [attributes](https://www.w3schools.com/html/html_attributes.asp)

- They are specified in the tags
- Example: 
```html 
<div class="myclass"> My first section </div>
```

### Important attributes
Here are some examples of often used attributes:
- class: Specifies a class for an HTML element (multiple elements can share the same class)
- id: Specifies a *unique* id for an HTML element
- href: Specifies the link's destination/URL (used in combination with the ```<a>``` tag)

### HTML is like a tree

An element is also called a node

A node can have more nodes inside it. The nodes inside are then called *children*

- Example: 
```html 
<div> 
    <p> My first paragraph </p>
</div>
```
In this example, ```<p>``` is the child, and ```<div>``` is the parent.
- You may come across expressions like *children*, *siblings*, *parents*, *descendants*

### Here is an example of an HTML tree (can you see the similarity with a family tree?) 
<img src="http://www.openbookproject.net/tutorials/getdown/css/images/lesson4/HTMLDOMTree.png"/>

# Video 7.1: Navigating the HTML tree, intro

## How do we find our way around the HTML tree?

The HTML contains the information that we are interested in!
- But how do we locate it?

### Three ways of finding the information you want:
1. Regex: Exploiting string patterns in HTML using regular expressions
2. CSS-selectors: Specifying paths in the tree using CSS-selectors
3. ```BeautifulSoup```: A Python package that makes it easy to navigate the HTML tree

### 1. Regex
**What is regex?**

Regex is used to define a search pattern in text

Suppose we want to search for all links in an HTML tree:
- We can then define a search pattern in regex that searches for "www." for example
- Using regex we will then find all the places in the HTML where it says "www."

Note: Regex only works on text/strings. So we need to convert our HTML tree into one large string before we can use regex on HTML

More about regex in session 8!
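
Until then, here is a minimal sketch of the idea using Python's built-in `re` module. The HTML string and the pattern are made up for illustration; the pattern assumes that every link is written as `href="..."` with double quotes, which real webpages do not guarantee.

```python
import re

# A small, made-up HTML string (regex works on strings, not on the tree)
html = '<p>Read <a href="https://www.example.com/a">this</a> and <a href="/local/page">that</a>.</p>'

# Capture everything between href=" and the next double quote
links = re.findall(r'href="([^"]*)"', html)

# Keep only the links that contain "www."
www_links = [link for link in links if 'www.' in link]

print(links)
print(www_links)
```

Note that searching only for "www." would miss the second link, which is one reason why patterns based on the HTML structure are often more reliable.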

### 2. [CSS Selectors](https://en.wikipedia.org/wiki/CSS)
A CSS selector is used to select the HTML elements ([How can you use a CSS selector?](https://www.scrapingbee.com/blog/python-web-scraping-beautiful-soup/))
- At first it will seem very similar to the BeautifulSoup way of selecting elements (which you will learn in a minute)
    - However, a CSS selector is useful when you cannot rely on *class* and *id* attributes (for example in very messily written HTML)

It is a neat way to define a unique path to an element or multiple similar elements in the HTML tree

You can download a CSS Selector as a Google Chrome extension that will do the work for you: [SelectorGadget](https://chrome.google.com/webstore/detail/selectorgadget/mhjhnkcfbdhnjickkkdbjoemdmbfginb)

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

#### BeautifulSoup has a built-in CSS selector:

Just use the function `.select`
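
Before trying it on a real webpage, the selector syntax can be tested on a small inline document. This is a minimal sketch: the HTML, the class name `teaser` and the id `main` are made up for illustration, and the built-in `html.parser` is used here instead of `lxml` only to avoid an extra dependency.

```python
from bs4 import BeautifulSoup

# A small, made-up HTML document
html = """
<div id="main">
  <div class="teaser"><a href="/news/1"><span>First title</span></a></div>
  <div class="teaser"><a href="/news/2"><span>Second title</span></a></div>
  <p><a href="/about">About</a></p>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')

# "div.teaser > a": <a> elements that are direct children of <div class="teaser">
teaser_links = soup.select('div.teaser > a')
hrefs = [a['href'] for a in teaser_links]

# "#main span": the first <span> anywhere inside the element with id="main"
first_title = soup.select_one('#main span').text

# "a[href]": all <a> elements that have an href attribute
all_links = soup.select('a[href]')

print(hrefs, first_title, len(all_links))
```

`.select` always returns a list (possibly empty), while `.select_one` returns the first match or `None`.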

In [6]:
url = 'https://www.dr.dk/nyheder/udland'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'lxml') #Make the BeautifulSoup object (soup): Take the HTML content as input and choose your parser (lxml)

In [4]:
# The CSS selector ".dre-hyphenate-text" selects all titles on the DR international news page
soup.select('.dre-hyphenate-text')[0].text #Selecting first title

"Donald Trumps hjem i Florida 'belejret og ransaget af FBI'"

### 3. Parsing HTML with BeautifulSoup
A third way to navigate the HTML tree is BeautifulSoup

It exploits the structure of tags and attributes

It allows you to:
- Search for elements by tag name and/or by attribute.
- Iterate through them, go up, sideways or down the tree.
- Furthermore, it helps you with standard tasks such as extracting raw text from HTML
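
The three points above can be sketched on a small inline document. The HTML below is made up for illustration, and `html.parser` is used as the parser so the sketch runs without `lxml`.

```python
from bs4 import BeautifulSoup

# A small, made-up HTML document
html = '<div class="article"><h1>Title</h1><p id="lead">Lead text</p><p>Body text</p></div>'
soup = BeautifulSoup(html, 'html.parser')

# Search by tag name and/or attribute
lead = soup.find('p', id='lead')
paragraphs = soup.find_all('p')

# Go up, sideways and sideways again in the tree
parent_name = lead.parent.name                 # up: the enclosing <div>
next_text = lead.find_next_sibling('p').text   # sideways: the next <p>
prev_name = lead.find_previous_sibling().name  # sideways: the <h1> before it

# Extract the raw text of the whole document
raw_text = soup.get_text(separator=' ', strip=True)

print(len(paragraphs), parent_name, next_text, prev_name)
print(raw_text)
```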

# Video 7.2: Parsing the HTML with BeautifulSoup

## Learning by doing: Creating a dataset from www.dr.dk/nyheder/udland

### Let's put together some of the stuff we have learned so far
1. **Mapping:** In this exercise we will collect some URLs from webpages with news articles and save them into a list
2. **Downloading:** Then we will download the HTML content of the webpages
3. **Parsing:** Finally, we will collect relevant information in each article

## 1. MAPPING

#### First, we investigate the site trying to understand its structure

We do this by opening up the Chrome Developer Tools on the webpage:
1. Right-click anywhere on the webpage
2. Click "Inspect"
3. Choose the panel "Elements"

You can now see the HTML of the webpage and the tree structure.

First, we want to understand where the articles are located in the HTML: 
- The "Elements" panel will jump to the place in the HTML tree where you right-click
- So to find the location of articles in the HTML, just right-click on one of them

#### Get the webpage content and make the BeautifulSoup object:

In [8]:
# Define our URL
url = 'https://www.dr.dk/nyheder/udland' 

# Connects to site
response = requests.get(url)

# Parse data with BeautifulSoup
soup = BeautifulSoup(response.content,'lxml')

#### Find the articles to scrape:

[`find_all`](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#find-all) finds all elements in the HTML that have the tag ```<div>``` and the class attribute 'dre-teaser-content' 

In [9]:
# Identify articles to scrape by inspecting site
articles = soup.find_all('div', class_ = 'dre-teaser-content') #(class_ is used because class is a reserved keyword in Python)

#### Now we want the links to all the articles:
First, I show how to find the link for *one* article, and afterwards I show how to loop through all article links

You can use [`find`](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#find) to find the *first* element. In the code below it is the first element that has the tag ```<a>```.

You can use `['href']` to select the attribute. Here we are interested in the content of the href attribute.

In [11]:
# First find the "link" tag in the HTML
article_link = articles[0].find('a') #(We are only taking the first article)
# Then locate the URL in the href attribute
article_url = article_link['href']
print(article_url)

/nyheder/udland/donald-trumps-hjem-i-florida-belejret-og-ransaget-af-fbi


In [12]:
# Another way to find the tag is by writing `.a` instead of `.find('a')`:
article_link = articles[0].a
article_url = article_link['href']
print(article_url)

/nyheder/udland/donald-trumps-hjem-i-florida-belejret-og-ransaget-af-fbi
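
One thing to be aware of: `['href']` raises an error if the attribute is missing, and `find` returns `None` if no element matches. Here is a minimal sketch of the difference on a made-up HTML snippet (not the DR page), using `.get('href')` as a more forgiving alternative.

```python
from bs4 import BeautifulSoup

# Made-up HTML: the second <a> tag has no href attribute
html = '<div><a href="/news/1">With link</a><a>No link</a></div>'
soup = BeautifulSoup(html, 'html.parser')
links = soup.find_all('a')

# .get() returns None when the attribute is missing
hrefs = [a.get('href') for a in links]

# Square brackets raise a KeyError when the attribute is missing
try:
    links[1]['href']
    raised = False
except KeyError:
    raised = True

# find() returns None when no element matches
missing = soup.find('table')

print(hrefs, raised, missing)
```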


#### We create a list of URLs that we want to scrape:

In [13]:
# Create an empty list
list_of_article_urls = []

# Loop over the articles and append each article URL to the list above
for article in articles:
    list_of_article_urls.append(article.find('a')['href'])

In [14]:
list_of_article_urls

['/nyheder/udland/donald-trumps-hjem-i-florida-belejret-og-ransaget-af-fbi',
 '/nyheder/udland/usas-senat-vedtager-gigantisk-klimaplan-der-ogsaa-kan-komme-danmark-til-gode',
 '/nyheder/udland/korruptionsbekaempere-slaar-alarm-tusindvis-af-russiske-virksomheder-er-dukket-op-i',
 '/nyheder/udland/ambassadoer-kina-vil-ikke-udelukke-vaebnet-konflikt-med-usa-om-taiwan',
 '/nyheder/udland/mystikken-breder-sig-om-stor-militaeroevelse-ved-taiwan-er-den-afblaest-eller-ej',
 '/nyheder/udland/atom-vagthund-advarer-om-risiko-katastrofe-paa-kraftvaerk-ved-fronten-i-ukraine',
 '/nyheder/udland/israel-fortsaetter-luftangreb-i-gaza-anden-dag-i-traek',
 '/nyheder/udland/gaza-er-endnu-en-gang-omdrejningspunkt-i-konflikt-det-er-krigslignende-tilstande',
 '/nyheder/udland/al-qaedas-leder-blev-draebt-paa-balkonen-hvor-mark-engang-grillede-boeffer',
 '/nyheder/udland/militante-grupper-i-gaza-svarer-igen-paa-israelske-angreb-med-byger-af-raketter',
 '/nyheder/udland/kandidater-dybt-uenige-i-hvordan-de-faar-b

#### Some of the links are not to articles 

So we write this code to only keep the article links:

In [15]:
list_of_article_urls_final = []
for link in list_of_article_urls:
    if '/nyheder/udland' in link: #All article URLs have this string in them, so we restrict on it being in the URL
        list_of_article_urls_final.append(link)
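
The same filter can also be written as a list comprehension. This is a minimal sketch on made-up links; it uses `startswith` as a slightly stricter variant of the `in` check above, and it also removes duplicates, which is an extra step that the code above does not do.

```python
# Made-up example links (the real list comes from the scraped page)
links = [
    '/nyheder/udland/article-one',
    '/nyheder/indland/other',
    '/nyheder/udland/article-one',
    '/nyheder/udland/article-two',
]

# Keep only links that start with the article path
article_links = [link for link in links if link.startswith('/nyheder/udland/')]

# Remove duplicates while keeping the original order
unique_links = list(dict.fromkeys(article_links))

print(unique_links)
```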

## 2. DOWNLOADING + 3. PARSING

#### Now we are ready to scrape each webpage from the URL list:
First, I will show you the procedure for *one* link, and then I will show you how to scrape the first 10 articles

In [16]:
# Create empty lists for the information we want to extract from every article
title_list = []
lead_list = []
time_list = []

# Here we scrape only the first news article in the URL list we created before
url = 'https://www.dr.dk' + list_of_article_urls_final[0] #The scraped links are relative, so we need to add the base URL (Here we have just taken the first link)
response = requests.get(url)
soup = BeautifulSoup(response.content,'lxml')
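
# Side note: instead of adding the base URL with "+", the standard library function
# urljoin can build the full URL. A minimal sketch with made-up article paths:
from urllib.parse import urljoin

base_url = 'https://www.dr.dk'

# A relative link gets the base URL added in front
full_from_relative = urljoin(base_url, '/nyheder/udland/some-article')

# A link that is already absolute is left unchanged
full_from_absolute = urljoin(base_url, 'https://www.dr.dk/nyheder/udland/other-article')

print(full_from_relative)
print(full_from_absolute)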

In [17]:
# Find title
temp = soup.find_all('h1')
temp = temp[1]
temp = temp.text.strip() #Use strip() to get rid of trailing and leading spaces
title_list.append(temp)

In [19]:
# Find lead
temp = soup.find('p', class_='dre-article-title__summary')
temp = temp.text.strip()
lead_list.append(temp)

In [20]:
# Find time posted
temp = soup.find('time', class_='dre-byline__date')
temp = temp['datetime']
time_list.append(temp)

In [23]:
time_list

['2022-08-09T03:41:00+00:00']

#### Combine all of the code above in a loop to scrape the first 10 articles:

In [40]:
# We want to extract title, lead and time posted from the articles

# Create empty lists for the information we want to extract from every article
title_list = []
lead_list = []
time_list = []

for link in list_of_article_urls_final[:10]: #Slicing avoids an IndexError if there are fewer than 10 links
    
    # Scrape each news article in the URL list we created before
    url = 'https://www.dr.dk' + link #The scraped links are relative, so we need to add the base URL
    response = requests.get(url)
    soup = BeautifulSoup(response.content,'lxml')
    
    # Append title to list
    temp = soup.find_all('h1')
    temp = temp[1]
    temp = temp.text.strip()
    title_list.append(temp)
    
    # Append lead to list
    temp = soup.find('p', class_='dre-article-title__summary')
    temp = temp.text.strip()
    lead_list.append(temp)

    # Append time posted to list
    temp = soup.find('time', class_='dre-byline__date')
    temp = temp['datetime']
    time_list.append(temp)
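
The loop above stops with an error if one article lacks a title, lead or time element. One possible way to make it more robust is to put the parsing step into a function that returns `None` for missing pieces. This is only a sketch: the function name `parse_article` and the two HTML strings are made up for illustration, while the tag and class names are the ones used in the loop above.

```python
from bs4 import BeautifulSoup

def parse_article(html):
    """Return title, lead and time from an article's HTML (None if missing)."""
    soup = BeautifulSoup(html, 'html.parser')
    headings = soup.find_all('h1')
    lead = soup.find('p', class_='dre-article-title__summary')
    time_tag = soup.find('time', class_='dre-byline__date')
    return {
        # As in the loop above, the article title is taken from the second <h1>
        'title': headings[1].text.strip() if len(headings) > 1 else None,
        'lead': lead.text.strip() if lead is not None else None,
        'time': time_tag.get('datetime') if time_tag is not None else None,
    }

# Made-up HTML strings standing in for downloaded articles
full_html = ('<h1>Site name</h1><h1> Article title </h1>'
             '<p class="dre-article-title__summary">Short lead.</p>'
             '<time class="dre-byline__date" datetime="2022-08-01T08:50:00+00:00">1 Aug</time>')
partial_html = '<h1>Site name</h1><h1>Only a title</h1>'

full_result = parse_article(full_html)
partial_result = parse_article(partial_html)
print(full_result)
print(partial_result)
```

With a function like this, the loop only needs to download each page and append the returned dictionary to a list, which `pd.DataFrame` can also take as input.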

In [41]:
title_list

["'Forventningerne er meget lave' forud for verdens største konference for nedrustning af atomvåben",
 'Der spares, hvor der kan: EU-lande lancerer energibesparende tiltag',
 'Drab på nigeriansk gadesælger ved højlys dag udløser vrede i Italien',
 'Corona-duks har pludselig flere dødsfald end nogensinde',
 'Rusland vil lade FN og Røde Kors undersøge angreb på fængslede krigsfanger',
 "Ukraine bliver kritiseret for at sende børn med handicap på institutioner: 'Det minder om et fængsel'",
 "Endnu en kinesisk raket styrter mod jorden: 'Decideret dumt og uansvarligt'",
 "USA går ind i produktionen af mikrochips: 'Det vil i høj grad komme dansk erhvervsliv til gode'",
 "Det er 'uholdbart', at russiske turister strømmer til Finland",
 'Retten fælder dom i spektakulær fodboldfruesag: Det er... Rebekah Vardy']

In [42]:
lead_list

['Det bliver interessant at se, om atomstormagterne USA og Rusland kan snakke sammen trods krigen i Ukraine, vurderer eksperter.',
 'Hvis ikke de europæiske lande får reduceret energiforbruget, vil det kunne blive nødvendigt med rationeringer til vinter.',
 'Ingen forbipasserende greb ind, da en nigeriansk mand blev tæsket til døde.',
 'Det er en reminder om at være beredt, siger professor i global sundhed.',
 'Både Ukraine og Rusland vil have FN og Røde Kors til at efterforske angreb, der kostede 50 krigsfanger livet.',
 'Organisationer kritiserer Ukraine for at placere tusindvis med handicap på institutioner.',
 'Det skete også i 2021 og 2020, hvor ingen personer kom til skade. Men hvordan kan det blive ved?',
 'Lovforslaget, der blandt andet omfatter 380 milliarder kroner i tilskud til en amerikansk produktion af mikrochips, skal mindske afhængigheden af Kina.',
 'Alene i juli har Finland udstedt 10.000 turistvisa til russere.',
 'En britisk domstol har i dag afsagt dom i en promine

In [43]:
time_list

['2022-08-01T08:50:00+00:00',
 '2022-08-01T03:54:00+00:00',
 '2022-07-31T18:52:00+00:00',
 '2022-07-31T10:08:00+00:00',
 '2022-07-31T04:58:00+00:00',
 '2022-07-30T14:56:00+00:00',
 '2022-07-30T04:45:00+00:00',
 '2022-07-29T16:19:00+00:00',
 '2022-07-29T16:15:00+00:00',
 '2022-07-29T16:10:00+00:00']
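
The timestamps are plain strings in ISO 8601 format. If we need them as proper datetime objects (for example to compare or sort them), the standard library can convert them. A minimal sketch using two of the values above:

```python
from datetime import datetime

# Two of the scraped timestamps (strings)
time_strings = ['2022-08-01T08:50:00+00:00', '2022-07-31T18:52:00+00:00']

# Convert each string to a timezone-aware datetime object
parsed_times = [datetime.fromisoformat(t) for t in time_strings]

# Datetime objects can be compared directly
newest = max(parsed_times)

print(parsed_times[0].year, parsed_times[0].hour)
print(newest.isoformat())
```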

#### Lastly, we put our collected information into a dataframe:

In [44]:
import pandas as pd
df = pd.DataFrame({'title':title_list, 'lead':lead_list, 'time':time_list})
df

| | title | lead | time |
|---|---|---|---|
| 0 | 'Forventningerne er meget lave' forud for verd... | Det bliver interessant at se, om atomstormagte... | 2022-08-01T08:50:00+00:00 |
| 1 | Der spares, hvor der kan: EU-lande lancerer en... | Hvis ikke de europæiske lande får reduceret en... | 2022-08-01T03:54:00+00:00 |
| 2 | Drab på nigeriansk gadesælger ved højlys dag u... | Ingen forbipasserende greb ind, da en nigerian... | 2022-07-31T18:52:00+00:00 |
| 3 | Corona-duks har pludselig flere dødsfald end n... | Det er en reminder om at være beredt, siger pr... | 2022-07-31T10:08:00+00:00 |
| 4 | Rusland vil lade FN og Røde Kors undersøge ang... | Både Ukraine og Rusland vil have FN og Røde Ko... | 2022-07-31T04:58:00+00:00 |
| 5 | Ukraine bliver kritiseret for at sende børn me... | Organisationer kritiserer Ukraine for at place... | 2022-07-30T14:56:00+00:00 |
| 6 | Endnu en kinesisk raket styrter mod jorden: 'D... | Det skete også i 2021 og 2020, hvor ingen pers... | 2022-07-30T04:45:00+00:00 |
| 7 | USA går ind i produktionen af mikrochips: 'Det... | Lovforslaget, der blandt andet omfatter 380 mi... | 2022-07-29T16:19:00+00:00 |
| 8 | Det er 'uholdbart', at russiske turister strøm... | Alene i juli har Finland udstedt 10.000 turist... | 2022-07-29T16:15:00+00:00 |
| 9 | Retten fælder dom i spektakulær fodboldfruesag... | En britisk domstol har i dag afsagt dom i en p... | 2022-07-29T16:10:00+00:00 |
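
In the dataframe the `time` column still contains strings. Here is a minimal sketch of how it could be converted with `pd.to_datetime` and used for sorting. The small dataframe below is made up for illustration (only the two timestamps are taken from the output above).

```python
import pandas as pd

# A small, made-up dataframe with the same kind of time strings
df_example = pd.DataFrame({
    'title': ['newer article', 'older article'],
    'time': ['2022-08-01T08:50:00+00:00', '2022-07-31T18:52:00+00:00'],
})

# Convert the strings to pandas datetimes
df_example['time'] = pd.to_datetime(df_example['time'])

# Sort from oldest to newest
df_sorted = df_example.sort_values('time').reset_index(drop=True)
sorted_titles = df_sorted['title'].tolist()

print(df_sorted)
```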


#### One more thing:
What if we also want the body text of an article?

In [45]:
url = 'https://www.dr.dk/nyheder/udland/gazprom-strammer-ifoelge-tyskland-skruen-uden-grund' 
response = requests.get(url)
soup = BeautifulSoup(response.content,'lxml')

In [46]:
# We locate the body of the article:
body = soup.find('div', class_ = 'dre-article-body')
body



This body consists of both sections with text and figures. We want it all.

But sections and figures have different tags, so a single `find_all` call that searches for one tag name will not return every element in the body.

Instead we can use [`.children`](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#contents-and-children). It iterates over the direct children of the `body` element:

In [47]:
body_text = []
for child in body.children:
    body_text.append(child.text)

In [48]:
body_text

['Gazprom halverer gasleverancerne til Europa via Nord Stream 1. Årsagen er ifølge selskabet vedligehold af en gasturbine. Den daglige gasforsyning via gasledningen vil fra onsdag morgen blive reduceret til 33 millioner kubikmeter, oplyser Gazprom.Det svarer til cirka 20 procent af den maksimale kapacitet, og det fremgår ikke, hvor længe den yderligt reducerede forsyning af gas vil stå på.',
 '',
 'Den tyske regering anser den forklaringen om vedligeholdelse for at være opfundet til lejligheden.- Ifølge vores oplysninger er der ingen teknisk grund til en reduktion i leverancerne, siger en talskvinde for Finansministeriet og minister Robert Habeck til Frankfurter Allgemeine Zeitung.Tyskerne får 25 procent af deres energi fra gas, hvor en overvejende del er kommet fra Rusland.Gasprisen stiger med 10 procentDet er anden gang indenfor en uge, at Gazprom reducerer leverancen af gas under påskud af reperation af gasturbiner. Da Gazprom efter ti dages vedligehold i sidste uge genåbnede for ga

Note: We have used `.text` to get the text of each child. The figure elements do not contain any text, so they show up as empty strings in the list.

We can use `.join()` to join all the strings in the list. Just join it on an empty string:

In [49]:
''.join(body_text)

'Gazprom halverer gasleverancerne til Europa via Nord Stream 1. Årsagen er ifølge selskabet vedligehold af en gasturbine. Den daglige gasforsyning via gasledningen vil fra onsdag morgen blive reduceret til 33 millioner kubikmeter, oplyser Gazprom.Det svarer til cirka 20 procent af den maksimale kapacitet, og det fremgår ikke, hvor længe den yderligt reducerede forsyning af gas vil stå på.Den tyske regering anser den forklaringen om vedligeholdelse for at være opfundet til lejligheden.- Ifølge vores oplysninger er der ingen teknisk grund til en reduktion i leverancerne, siger en talskvinde for Finansministeriet og minister Robert Habeck til Frankfurter Allgemeine Zeitung.Tyskerne får 25 procent af deres energi fra gas, hvor en overvejende del er kommet fra Rusland.Gasprisen stiger med 10 procentDet er anden gang indenfor en uge, at Gazprom reducerer leverancen af gas under påskud af reperation af gasturbiner. Da Gazprom efter ti dages vedligehold i sidste uge genåbnede for gasforsyninge
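
As the output shows, joining on an empty string glues the sections together without a space (for example "Gazprom.Det"). One possible alternative is to drop the empty strings and join on a space, or to let `get_text` do both steps. This is a minimal sketch on a made-up article body; only the class name `dre-article-body` is taken from the code above.

```python
from bs4 import BeautifulSoup

# Made-up article body with two text sections and one figure
html = ('<div class="dre-article-body">'
        '<p>First paragraph.</p>'
        '<figure><img src="x.png"/></figure>'
        '<p>Second paragraph.</p>'
        '</div>')
soup = BeautifulSoup(html, 'html.parser')
body = soup.find('div', class_='dre-article-body')

# Text of each child; the figure gives an empty string
parts = [child.get_text(strip=True) for child in body.children]

# Drop the empty strings and join on a space instead of an empty string
joined = ' '.join(part for part in parts if part)

# get_text can do the same in one step
joined_direct = body.get_text(separator=' ', strip=True)

print(parts)
print(joined)
```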