# Installation

With conda, you can install the required dependencies with:

```bash
conda install bs4 requests lxml
```


# Basic usage of BeautifulSoup

First, we import the `BeautifulSoup` class:

In [1]:
from bs4 import BeautifulSoup

We load the HTML source file from disk and pass it to the `BeautifulSoup` constructor. We choose the "lxml" parser, which is faster than Python's built-in `html.parser`:

In [2]:
# Open the file in a with-block so it is closed again after parsing
with open("list.html") as src:
    document = BeautifulSoup(src, 'lxml')
print(document)

<!DOCTYPE html>
<html>
<body>
<h2>An Unordered HTML List</h2>
<ul id="unordered_list" style="color:#069">
<li>Coffee</li>
<li>Tea</li>
<li>Milk</li>
</ul>
<h2>An Ordered HTML List</h2>
<ol id="ordered_list" style="color:#069">
<li>Coffee</li>
<li>Tea</li>
<li>Milk</li>
</ol>
</body>
</html>



### Finding tags by name

The `document` object now contains the full HTML document. We can find the first occurrence of a tag with a specific name using the `find` function. Let's find the first unordered list tag:

In [3]:
ulist = document.find("ul")

The result is the matched tag together with all tags nested inside it:

In [4]:
ulist

<ul id="unordered_list" style="color:#069">
<li>Coffee</li>
<li>Tea</li>
<li>Milk</li>
</ul>

The `find_all` function returns **all** tags that match the given tag name. We can use it to get a list of all list items:

In [5]:
items = ulist.find_all("li")
items

[<li>Coffee</li>, <li>Tea</li>, <li>Milk</li>]

Finally, we can loop over all items and extract their content with the `get_text` function:

In [6]:
for item in items:
    print(item.get_text())

Coffee
Tea
Milk


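Note that `find_all` is **recursive** by default, so calling it on the full `document` returns the items of both the ordered and the unordered list. Passing `recursive=False` restricts the search to the direct children of `document`, none of which is an `li` tag, so the result is empty: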

In [7]:
document.find_all("li", recursive=False)

[]
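To make the difference concrete, here is a minimal sketch that compares the two modes. It assumes a shortened inline copy of the lists instead of `list.html`, and it uses Python's built-in `html.parser` so that it runs without `lxml`:

```python
from bs4 import BeautifulSoup

# Shortened inline stand-in for list.html (an assumption for this sketch)
html = (
    '<ul id="unordered_list"><li>Coffee</li><li>Tea</li></ul>'
    '<ol id="ordered_list"><li>Milk</li></ol>'
)
document = BeautifulSoup(html, "html.parser")

# Default: search all descendants, so items of both lists are found
all_items = document.find_all("li")

# recursive=False: only direct children of the document are checked
top_level_items = document.find_all("li", recursive=False)

# The li tags are direct children of the ul tag, so this finds them
ul_items = document.find("ul").find_all("li", recursive=False)

print(len(all_items), len(top_level_items), len(ul_items))  # prints: 3 0 2
```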

### Finding tags by attributes

Sometimes the easiest way to find a tag is by one of its attributes. In our example, both lists have an `id` attribute that uniquely identifies them. We can also use the `find*` methods to search for attributes:


In [8]:
document.find(attrs={"id":"unordered_list"})

<ul id="unordered_list" style="color:#069">
<li>Coffee</li>
<li>Tea</li>
<li>Milk</li>
</ul>
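Beautiful Soup also accepts attribute filters as keyword arguments, which is often shorter than the `attrs` dictionary. The sketch below uses an inline stand-in for `list.html` (an assumption) and the built-in `html.parser`:

```python
from bs4 import BeautifulSoup

# Shortened inline stand-in for list.html (an assumption for this sketch)
html = (
    '<ul id="unordered_list"><li>Coffee</li></ul>'
    '<ol id="ordered_list"><li>Milk</li></ol>'
)
document = BeautifulSoup(html, "html.parser")

by_attrs = document.find(attrs={"id": "ordered_list"})   # dictionary form
by_keyword = document.find(id="ordered_list")            # keyword form
by_name_and_id = document.find("ol", id="ordered_list")  # tag name plus attribute

# find returns None when nothing matches, so check before using the result
missing = document.find(id="does_not_exist")

print(by_keyword.name, missing)  # prints: ol None
```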

### Accessing attributes

The `ul` tag also contains a `style` attribute. Any bs4 tag behaves like a dictionary with attribute names as keys and attribute values as values:

In [9]:
ulist.attrs

{'id': 'unordered_list', 'style': 'color:#069'}

In [10]:
ulist["style"]

'color:#069'
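Indexing a tag with an attribute it does not have raises a `KeyError`, just like a dictionary. The following sketch shows the `get` method as a safer alternative. The `class` attribute is not part of `list.html`; it is added here only to illustrate that Beautiful Soup returns multi-valued attributes as a list:

```python
from bs4 import BeautifulSoup

# Hypothetical variant of the ul tag with an extra class attribute
html = '<ul id="unordered_list" class="plain compact" style="color:#069"><li>Coffee</li></ul>'
ulist = BeautifulSoup(html, "html.parser").find("ul")

style = ulist["style"]                   # existing attribute
title = ulist.get("title")               # missing attribute: None instead of KeyError
fallback = ulist.get("title", "no title")  # missing attribute with a default value
classes = ulist["class"]                 # class is multi-valued, so this is a list

print(style, title, fallback, classes)
```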

# Downloading a table from Wikipedia

We aim to get a list of countries sorted by their population size:
https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population

First, let's import the required modules:

In [3]:
import requests
from bs4 import BeautifulSoup
import re
import dateutil.parser  # import the submodule explicitly so dateutil.parser.parse is available

This time, we load the html source directly from a website using the requests module:

In [4]:
result = requests.get("https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population")

The web server returns a status code that indicates whether the request succeeded.
We use that status code to check that the page was loaded successfully:

In [5]:
assert result.status_code==200  
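As an alternative to the `assert`, `requests` offers `raise_for_status`, which raises an exception with a descriptive message for 4xx and 5xx responses. The helper name `fetch_html` and the 10 second timeout below are assumptions made for this sketch, not part of the original notebook:

```python
import requests


def fetch_html(url, timeout=10):
    """Download a page and return its raw bytes (hypothetical helper).

    Raises requests.HTTPError if the server answers with a 4xx or 5xx
    status code, and requests.Timeout if no answer arrives in time.
    """
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    return response.content
```

It could then be used as `src = fetch_html("https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population")`.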

Next, we extract the HTML source and initialize BeautifulSoup:

In [6]:
src = result.content
document = BeautifulSoup(src, 'lxml')

By looking at the document, we can see that we are interested in the first table, so we use `find`:

In [7]:
table = document.find("table")

If you are not familiar with HTML tables, read this example first: https://www.w3schools.com/html/tryit.asp?filename=tryhtml_table_intro

At this point, it is a good idea to programmatically check that the table contains the correct header:

In [8]:
assert table.find("th").get_text() == "Rank"

In [9]:
rows = table.find_all("tr")  # Note: this works because find_all is recursive by default


In [10]:
for row in rows[2:-1]:
    cells = row.find_all(["td", "th"])
    
    cells_text = [cell.get_text(strip=True) for cell in cells]
    (rank, country, region, population, percentage, updated_at, source, comment) = cells_text
    
    # Clean population using a regular expression str("1,404,890,600") -> int(1404890600)
    population = int(re.sub(',', '', population))   
    
    # Clean country name
    country = re.findall(r'[\w\s()\.]+', country)[0]
    
    # Convert percentage to floats
    percentage = float(re.findall(r'[\d\.]+', percentage)[0])    
    
    # Convert updated_at to date object
    updated_at = dateutil.parser.parse(updated_at)    
    
    print(f"{rank}, {country}, {population}, {percentage}, {updated_at}, {source}")
    

1, China, 1411778724, 17.9, 2020-11-01 00:00:00, 2020 census result[3]
2, India, 1383001530, 17.5, 2021-10-12 00:00:00, National population clock[4]
3, United States, 332520029, 4.21, 2021-10-12 00:00:00, National population clock[5]
4, Indonesia, 271350000, 3.43, 2020-12-31 00:00:00, National annual estimate[6]
5, Pakistan, 225200000, 2.85, 2021-07-01 00:00:00, UN projection[2]
6, Brazil, 213799167, 2.71, 2021-10-12 00:00:00, National population clock[7]
7, Nigeria, 211401000, 2.68, 2021-07-01 00:00:00, UN projection[2]
8, Bangladesh, 171511228, 2.17, 2021-10-12 00:00:00, National population clock[8]
9, Russia, 146171015, 1.85, 2021-01-01 00:00:00, National annual estimate[9]
10, Mexico, 126014024, 1.59, 2020-03-02 00:00:00, 2020 census result[10]
11, Japan, 125210000, 1.58, 2021-09-01 00:00:00, Monthly national estimate[11]
12, Ethiopia, 117876000, 1.49, 2021-07-01 00:00:00, UN projection[2]
13, Philippines, 110945986, 1.4, 2021-10-12 00:00:00, Official 2020 census result[12]
14, Egy

ValueError: not enough values to unpack (expected 8, got 7)
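The loop stops at the first row that does not have exactly eight cells. One possible way to make it more robust is to check the number of cells before unpacking and to skip rows that do not fit. The sketch below works on plain lists of cell texts; the helper name `parse_row` and the sample rows are hypothetical and only modeled on the output above:

```python
import re

# rank, country, region, population, percentage, updated_at, source, comment
EXPECTED_CELLS = 8


def parse_row(cells_text):
    """Return (rank, country, population, percentage), or None for irregular rows."""
    if len(cells_text) != EXPECTED_CELLS:
        return None  # caller decides whether to skip or log the row
    rank, country, _region, population, percentage, *_rest = cells_text
    population = int(population.replace(",", ""))  # "1,411,778,724" -> 1411778724
    percentage = float(re.findall(r"[\d.]+", percentage)[0])
    return rank, country, population, percentage


# Hypothetical cell texts: one regular row and one row with too few cells
sample_rows = [
    ["1", "China", "Asia", "1,411,778,724", "17.9%", "1 Nov 2020", "2020 census result", ""],
    ["99", "Example", "1,000", "0.1%"],
]

parsed = [parse_row(row) for row in sample_rows]
print(parsed)
```

In the notebook loop, this would correspond to a `continue` whenever `len(cells_text) != 8`.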

**Attention**: Beautiful Soup does not execute JavaScript. This means that the code shown in the Google Chrome inspector might look different from the original source code that Beautiful Soup receives.

# Another example of downloading a Wikipedia table 

Let's consider another table in a Wikipedia page. This page has many more tables, so one challenge will be to pick the right one:

https://en.wikipedia.org/wiki/Tiger_Woods


We are interested in extracting these two tables:

![Target Wikipedia tables](pictures/wiki_tables.png)

**Exercise**: 

1) Locate the tag with `id="The_Players_Championship"` using `title = document.find(id="The_Players_Championship")`.

2) Find all tables that follow the tag from 1) with `title.find_all_next('table')`.

3) For each `table` in `tables`, look up its first header (`th`) with `table.find('th')` to identify the "Tournament" header. Remember to use `get_text(strip=True)`.

4) Save all tables with the header "Tournament" into a list `tournament_tables`.

5) Bonus: Print the contents of the two tables of interest in the terminal.

We begin by downloading the webpage and instantiating the BeautifulSoup object:

In [11]:
result = requests.get("https://en.wikipedia.org/wiki/Tiger_Woods")
src = result.content
document = BeautifulSoup(src, 'lxml')

This page contains a lot of tables without specific attributes that would make it easy to find our table of interest. Furthermore, several tables share the same headings, which makes it difficult to find a table just by its headings:

In [12]:
len(document.find_all("table"))

60

Therefore, we choose another strategy. First, we extract the tag that defines the header just before our tables of interest. That header tag has a unique identifier attribute `id="The_Players_Championship"`. Then we use the `find_all_next` function in BeautifulSoup to extract all following table tags:

In [13]:
title = document.find(id="The_Players_Championship")
tables = title.find_all_next("table")
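To see what `find_all_next` does without downloading the page, consider this minimal sketch. The inline HTML and the id `target_heading` are made up for illustration; the point is that only tables located after the anchor tag are returned:

```python
from bs4 import BeautifulSoup

# Made-up page: one table before the heading, two tables after it
html = """
<table><tr><th>Before</th></tr></table>
<h3 id="target_heading">Heading</h3>
<table><tr><th>Tournament</th></tr></table>
<table><tr><th>Other</th></tr></table>
"""
document = BeautifulSoup(html, "html.parser")

title = document.find(id="target_heading")
tables = title.find_all_next("table")  # tables after the heading, in document order

headers = [table.find("th").get_text(strip=True) for table in tables]
print(len(document.find_all("table")), headers)  # prints: 3 ['Tournament', 'Other']
```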

Now, our tables of interest are the first two tables with the "Tournament" heading. We write a small helper function (a generator, see https://wiki.python.org/moin/Generators) that yields every table whose first header cell matches a given heading:

In [14]:
def find_table_with_heading(tables, heading):
    for table in tables:
        header = table.find("th")  # None if the table has no header cell
        if header is not None and header.get_text(strip=True) == heading:
            yield table

Next, we can extract the table rows and columns as usual. We only process the first two tables, as these are the only ones we are interested in:

In [16]:
tournament_tables = list(find_table_with_heading(tables, "Tournament"))

for table in tournament_tables[:2]:
    for row in table.find_all("tr"):
        cells = row.find_all(["th", "td"])
        print([cell.get_text(strip=True) for cell in cells])
        

['Tournament', '1997', '1998', '1999', '2000', '2001', '2002', '2003', '2004', '2005', '2006', '2007', '2008', '2009']
['The Players Championship', 'T31', 'T35', 'T10', '2', '1', 'T14', 'T11', 'T16', 'T53', 'T22', 'T37', '', '8']
['Tournament', '2010', '2011', '2012', '2013', '2014', '2015', '2016', '2017', '2018', '2019']
['The Players Championship', 'WD', 'WD', 'T40', '1', '', 'T69', '', '', 'T11', 'T30']
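Because the helper is a generator, the first two matches can also be taken lazily with `itertools.islice`, without building the full list first. The following sketch is self-contained: the function name `tables_with_heading`, the inline HTML and the placeholder cell values are assumptions for illustration, not data from the Wikipedia page:

```python
from itertools import islice

from bs4 import BeautifulSoup


def tables_with_heading(tables, heading):
    """Yield each table whose first th cell has the given text."""
    for table in tables:
        first_header = table.find("th")
        if first_header is not None and first_header.get_text(strip=True) == heading:
            yield table


# Made-up tables: three with a "Tournament" header, two without
html = """
<table><tr><td>no header cell</td></tr></table>
<table><tr><th>Tournament</th><th>1997</th></tr><tr><td>Event A</td><td>x</td></tr></table>
<table><tr><th>Other</th></tr></table>
<table><tr><th>Tournament</th><th>2010</th></tr><tr><td>Event A</td><td>y</td></tr></table>
<table><tr><th>Tournament</th><th>2020</th></tr></table>
"""
tables = BeautifulSoup(html, "html.parser").find_all("table")

# islice stops the generator after two matching tables
first_two = list(islice(tables_with_heading(tables, "Tournament"), 2))

rows = [
    [cell.get_text(strip=True) for cell in row.find_all(["th", "td"])]
    for table in first_two
    for row in table.find_all("tr")
]
print(rows)
```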


# Exercise:
1) Go to https://en.wikipedia.org/wiki/University_of_Oslo

2) Download the content from the site using BeautifulSoup and requests

3) Search for all images (using images = document.find_all()) and print out the content

4) Include only images with the attribute "class":"thumbimage" in your list of images. This can be done by inspecting image.attrs

5) Print out a list of the value of the "src" attribute for the images in 4. 

6) See if you can display your image by pasting the results from 5 into your web-browser. Remember to put https: in front of your string in the web-browser