# Beautiful Soup 4 and Web Page Parsing

Beautiful Soup (version 4) is a Python library for extracting data from HTML and XML documents. It parses the markup into a tree and gives you an idiomatic way of **navigating and querying the document**.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import os
os.chdir("/content/drive/MyDrive/Lab3")

### Structure of HTML

<img src="https://drive.google.com/file/d/19jpWGX7RkhoRDHTXBQs9tT76W-52j_l_/view?usp=sharing" align="left" style="width:650px;" >

### 1. Reading an HTML File and Extracting Its Contents Using Beautiful Soup

In this exercise, we will do the **simplest** thing possible: import the Beautiful Soup (bs4) library and use it to **read an HTML document**. Then, we will **examine the different kinds of objects** it returns. <br>
While working through the exercises for this topic, keep the example HTML file (test.html) open in a text editor so that you can check the different tags, their attributes, and their contents:

1. Import the bs4 library:

In [2]:
#Import the bs4 library:
from bs4 import BeautifulSoup

You can **pass a file handle directly to the BeautifulSoup constructor**, and it will read the contents of the file that the handle is attached to. The returned object is an instance of bs4.BeautifulSoup. <br>This class holds **all the methods we need to navigate through the DOM tree** that the document represents.
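
The constructor is not limited to files on disk. As a minimal sketch, assuming a made-up piece of markup instead of test.html, the same call works with a plain string or with any file-like object. The parser name `html.parser` (Python's built-in parser) is passed explicitly here so that the result does not depend on which optional parsers happen to be installed.

```python
import io
from bs4 import BeautifulSoup

# Hypothetical markup, used only for illustration.
html = "<html><body><h1>Title</h1><p>First paragraph</p></body></html>"

# Build the soup from a string.
soup_from_str = BeautifulSoup(html, "html.parser")

# Build the soup from a file-like object (io.StringIO stands in for an open file).
soup_from_file = BeautifulSoup(io.StringIO(html), "html.parser")

print(type(soup_from_str))
print(soup_from_str.h1.string == soup_from_file.h1.string)
```

Both objects describe the same tree, so either input style can be used in the steps that follow.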

2. Use bs4 to read the HTML file from disk:

In [3]:
#Use bs4 to read the HTML file from disk
with open("test.html", "r") as fd:
    # Naming the parser avoids the warning bs4 emits when it has to guess one.
    soup = BeautifulSoup(fd, "html.parser")
    print(type(soup))

<class 'bs4.BeautifulSoup'>


Print the contents of the file with readable indentation by using the **prettify method** of the class, like this:

3. Print the contents of the file with the prettify method

In [4]:
print(soup.prettify())

# The same information is also available through the soup.contents attribute.
# Two differences: it is not pretty-printed, and it is a list of nodes rather than a string.
#print(soup.contents)

<h1>
 Lorem ipsum dolor sit amet consectetuer adipiscing 
elit
</h1>
<p>
 Lorem ipsum dolor sit amet, consectetuer adipiscing 
elit. Aenean commodo ligula eget dolor. Aenean massa
 <strong>
  strong
 </strong>
 . Cum sociis natoque penatibus 
et magnis dis parturient montes, nascetur ridiculus 
mus. Donec quam felis, ultricies nec, pellentesque 
eu, pretium quis, sem. Nulla consequat massa quis 
enim. Donec pede justo, fringilla vel, aliquet nec, 
vulputate eget, arcu. In enim justo, rhoncus ut, 
imperdiet a, venenatis vitae, justo. Nullam dictum 
felis eu pede
 <a class="external ext" href="#">
  link
 </a>
 mollis pretium. Integer tincidunt. Cras dapibus. 
Vivamus elementum semper nisi. Aenean vulputate 
eleifend tellus. Aenean leo ligula, porttitor eu, 
consequat vitae, eleifend ac, enim. Aliquam lorem ante, 
dapibus in, viverra quis, feugiat a, tellus. Phasellus 
viverra nulla ut metus varius laoreet. Quisque rutrum. 
Aenean imperdiet. Etiam ultricies nisi vel augue. 
Curabitur ull

There are several paragraph tags, or < p > tags, in the document. Let's read the content of one of them. We can do that with plain **attribute access (the "." operator)**, as we would for a normal member variable of a class. This returns only the first tag with that name.

4. Read the content of the first < p > tag:

In [5]:
#Of all 6 <p>, show only the first <p>
with open("test.html", "r") as fd:
    soup = BeautifulSoup(fd, "html.parser")
    print(type(soup.p))
    print(soup.p)

<class 'bs4.element.Tag'>
<p>Lorem ipsum dolor sit amet, consectetuer adipiscing 
elit. Aenean commodo ligula eget dolor. Aenean massa 
<strong>strong</strong>. Cum sociis natoque penatibus 
et magnis dis parturient montes, nascetur ridiculus 
mus. Donec quam felis, ultricies nec, pellentesque 
eu, pretium quis, sem. Nulla consequat massa quis 
enim. Donec pede justo, fringilla vel, aliquet nec, 
vulputate eget, arcu. In enim justo, rhoncus ut, 
imperdiet a, venenatis vitae, justo. Nullam dictum 
felis eu pede <a class="external ext" href="#">link</a> 
mollis pretium. Integer tincidunt. Cras dapibus. 
Vivamus elementum semper nisi. Aenean vulputate 
eleifend tellus. Aenean leo ligula, porttitor eu, 
consequat vitae, eleifend ac, enim. Aliquam lorem ante, 
dapibus in, viverra quis, feugiat a, tellus. Phasellus 
viverra nulla ut metus varius laoreet. Quisque rutrum. 
Aenean imperdiet. Etiam ultricies nisi vel augue. 
Curabitur ullamcorper ultricies nisi.</p>
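
A Tag object obtained through attribute access can itself be queried. The sketch below uses a short, made-up fragment (not test.html) to show the tag name, its text, its attributes, and what happens when the requested tag does not exist.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only for illustration.
html = '<p>One <a class="external ext" href="#">link</a></p><p>Two</p>'
soup = BeautifulSoup(html, "html.parser")

first_p = soup.p                 # first <p> only, even though there are two
print(first_p.name)              # the tag name
print(first_p.get_text())        # text of the tag and everything inside it
print(first_p.a["href"])         # attributes are looked up like dictionary keys
print(first_p.a["class"])        # class can hold several values, so a list is returned
print(soup.table)                # there is no <table> in this markup, so this is None
```

Because a missing tag yields `None` rather than an exception, it is worth checking the result before chaining further attribute access onto it.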


If you need all the < p > tags, or anything more complicated than the first tag with a certain name, you’ll need to use one of the search methods described in the "Searching the tree" section of the Beautiful Soup documentation, such as **find_all()**.

5. Use the find_all() method to extract the content from the tag:


In [6]:
all_ps = soup.find_all('p')
print("Total number of <p>  --- {}".format(len(all_ps)))
# Select the desired tag with the index.
#soup.find_all('p')[0]
all_ps[5]

Total number of <p>  --- 6


<p>Lorem ipsum dolor sit amet, consectetuer adipiscing 
elit. Aenean commodo ligula eget dolor. Aenean massa. 
Cum sociis natoque penatibus et magnis dis parturient 
montes, nascetur ridiculus mus. Donec quam felis, 
ultricies nec, pellentesque eu, pretium quis, sem.</p>
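
Beyond a bare tag name, `find_all()` accepts filters. The following sketch, which assumes a small made-up fragment rather than test.html, shows filtering by CSS class, limiting the number of results, and searching for several tag names at once.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only for illustration.
html = """
<p class="intro">First</p>
<p>Second</p>
<p class="intro note">Third</p>
<a href="#">link</a>
"""
soup = BeautifulSoup(html, "html.parser")

all_ps = soup.find_all("p")                    # every <p>
intro_ps = soup.find_all("p", class_="intro")  # <p> tags whose classes include "intro"
first_two = soup.find_all("p", limit=2)        # stop after two matches
p_and_a = soup.find_all(["p", "a"])            # several tag names at once

print(len(all_ps), len(intro_ps), len(first_two), len(p_and_a))
```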

We have seen how to access all the tags of the same type. We have also seen how to get the content of the entire HTML document.

6. Now we will see how to get the contents of a particular HTML tag:

In [19]:
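with open("test.html", "r") as fd:
    soup = BeautifulSoup(fd, "html.parser")
    table = soup.table
    print(table.contents)
    # Note: without parentheses this prints the bound method object, not the
    # formatted markup; call table.prettify() to get the pretty-printed string.
    print(table.prettify)
    print(table.contents[3])
    tt = table.contents[3]
    print("-"*30)
    print(tt.contents[1])
    # What happens with table.contents[0]?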

['\n', <tr>
<th>Entry Header 1</th>
<th>Entry Header 2</th>
<th>Entry Header 3</th>
<th>Entry Header 4</th>
</tr>, '\n', <tr>
<td>Entry First Line 1</td>
<td>Entry First Line 2</td>
<td>Entry First Line 3</td>
<td>Entry First Line 4</td>
</tr>, '\n', <tr>
<td>Entry Line 1</td>
<td>Entry Line 2</td>
<td>Entry Line 3</td>
<td>Entry Line 4</td>
</tr>, '\n', <tr>
<td>Entry Last Line 1</td>
<td>Entry Last Line 2</td>
<td>Entry Last Line 3</td>
<td>Entry Last Line 4</td>
</tr>, '\n']
<bound method Tag.prettify of <table class="data">
<tr>
<th>Entry Header 1</th>
<th>Entry Header 2</th>
<th>Entry Header 3</th>
<th>Entry Header 4</th>
</tr>
<tr>
<td>Entry First Line 1</td>
<td>Entry First Line 2</td>
<td>Entry First Line 3</td>
<td>Entry First Line 4</td>
</tr>
<tr>
<td>Entry Line 1</td>
<td>Entry Line 2</td>
<td>Entry Line 3</td>
<td>Entry Line 4</td>
</tr>
<tr>
<td>Entry Last Line 1</td>
<td>Entry Last Line 2</td>
<td>Entry Last Line 3</td>
<td>Entry Last Line 4</td>
</tr>
</table>>
<tr>
<td

HTML is represented as a **tree**, and we are able to traverse the children of a particular node. There are a few ways to do this.

7. The first way is to use the **children** iterator, which is available on any bs4 tag object, as follows:

In [20]:
with open("test.html", "r") as fd:
    soup = BeautifulSoup(fd, "html.parser")
    table = soup.table
    children = table.children
    des = table.descendants
    print(len(list(children)), len(list(des)))

9 61


In [21]:
print(list(table.children))
print(len(list(table.children)[1]))


#print(list(table.descendants))
#print(list(table.descendants)[:10])


['\n', <tr>
<th>Entry Header 1</th>
<th>Entry Header 2</th>
<th>Entry Header 3</th>
<th>Entry Header 4</th>
</tr>, '\n', <tr>
<td>Entry First Line 1</td>
<td>Entry First Line 2</td>
<td>Entry First Line 3</td>
<td>Entry First Line 4</td>
</tr>, '\n', <tr>
<td>Entry Line 1</td>
<td>Entry Line 2</td>
<td>Entry Line 3</td>
<td>Entry Line 4</td>
</tr>, '\n', <tr>
<td>Entry Last Line 1</td>
<td>Entry Last Line 2</td>
<td>Entry Last Line 3</td>
<td>Entry Last Line 4</td>
</tr>, '\n']
9


In [11]:
child_table = list(table.children)
print(child_table)
print("-"*30)
print(child_table[1])

desc = list(child_table[1].descendants)
print("-"*30)
print(desc)
print(len(desc))

['\n', <tr>
<th>Entry Header 1</th>
<th>Entry Header 2</th>
<th>Entry Header 3</th>
<th>Entry Header 4</th>
</tr>, '\n', <tr>
<td>Entry First Line 1</td>
<td>Entry First Line 2</td>
<td>Entry First Line 3</td>
<td>Entry First Line 4</td>
</tr>, '\n', <tr>
<td>Entry Line 1</td>
<td>Entry Line 2</td>
<td>Entry Line 3</td>
<td>Entry Line 4</td>
</tr>, '\n', <tr>
<td>Entry Last Line 1</td>
<td>Entry Last Line 2</td>
<td>Entry Last Line 3</td>
<td>Entry Last Line 4</td>
</tr>, '\n']
------------------------------
<tr>
<th>Entry Header 1</th>
<th>Entry Header 2</th>
<th>Entry Header 3</th>
<th>Entry Header 4</th>
</tr>
------------------------------
['\n', <th>Entry Header 1</th>, 'Entry Header 1', '\n', <th>Entry Header 2</th>, 'Entry Header 2', '\n', <th>Entry Header 3</th>, 'Entry Header 3', '\n', <th>Entry Header 4</th>, 'Entry Header 4', '\n']
13
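
The `'\n'` entries in the lists above are text nodes: the line breaks between tags in the source are part of the tree. A minimal sketch, assuming a small made-up table, of how to skip them and keep only the element children:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical markup, used only for illustration.
html = """<table>
<tr><th>A</th><th>B</th></tr>
<tr><td>1</td><td>2</td></tr>
</table>"""
soup = BeautifulSoup(html, "html.parser")
table = soup.table

print(repr(table.contents[0]))            # the line break right after <table>
print(type(table.contents[0]).__name__)   # it is a string node, not a Tag

# Keep only element children, skipping the text nodes between them.
rows = [child for child in table.children if isinstance(child, Tag)]
print(len(table.contents), len(rows))

# find_all with recursive=False returns the same direct-child rows.
print(rows == list(table.find_all("tr", recursive=False)))
```

Filtering this way means index arithmetic such as `contents[3]` is no longer needed to step over the whitespace nodes.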


The two numbers printed by the first code cell of this step (`9 61`) show the **difference between children and descendants**. <br>
The list we got from **children has only 9 items** (the 4 < tr > rows and the 5 newline strings around them), whereas the list we got from **descendants has 61**: those 9 items plus the 13 descendants of each of the 4 rows.
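
The same contrast can be reproduced on a much smaller table. This is only a sketch with made-up markup and no whitespace between tags, so the counts are easy to check by hand.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only for illustration.
html = "<table><tr><td>1</td><td>2</td></tr></table>"
soup = BeautifulSoup(html, "html.parser")
table = soup.table

children = list(table.children)        # direct children only: the single <tr>
descendants = list(table.descendants)  # <tr>, both <td> tags, and their two text nodes

print(len(children), len(descendants))
```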

The **.descendants** attribute lets you **iterate over all of a tag’s children, recursively**: its direct children, the children of its direct children, and so on:

In [13]:
# Materialize the generator once: a generator can be consumed only a single time,
# so calling list() on it twice would leave the second list empty.
row = list(table.children)[3]
desc = list(row.descendants)
print(row)
print("-"*30)
print(desc)
print(len(desc))

<tr>
<td>Entry First Line 1</td>
<td>Entry First Line 2</td>
<td>Entry First Line 3</td>
<td>Entry First Line 4</td>
</tr>
------------------------------
['\n', <td>Entry First Line 1</td>, 'Entry First Line 1', '\n', <td>Entry First Line 2</td>, 'Entry First Line 2', '\n', <td>Entry First Line 3</td>, 'Entry First Line 3', '\n', <td>Entry First Line 4</td>, 'Entry First Line 4', '\n']
13
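
One detail worth keeping in mind is that `.descendants` returns a generator, which can be consumed only once. A minimal sketch with made-up markup: a second `list()` call on the same generator object comes back empty.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only for illustration.
soup = BeautifulSoup("<tr><td>1</td><td>2</td></tr>", "html.parser")

desc = soup.tr.descendants      # a generator, not a list
first_pass = list(desc)         # consumes the generator
second_pass = list(desc)        # nothing is left to yield

print(len(first_pass), len(second_pass))
```

Storing the result of `list(...)` in a variable, or asking the tag for `.descendants` again, avoids the empty second pass.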


In [None]:
#print(list(table.descendants))

***

### 2. DataFrames and BeautifulSoup

Now we are going to go one step further and combine bs4 with pandas to generate a DataFrame from a plain HTML table.<br>
We will extract the data from the test.html page using the BeautifulSoup library, prepare it in a few steps, and display it in an easily readable tabular format.

1. Import pandas and read the document, as follows:

In [16]:
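import pandas as pd
from bs4 import BeautifulSoup
# Use a context manager so the file is closed once it has been parsed.
with open("test.html", "r") as fd:
    soup = BeautifulSoup(fd, "html.parser")
data = soup.find_all('tr')  # find_all is the current name of the older findAll alias
print("Data is a {} and {} items long".format(type(data), len(data)))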

Data is a <class 'bs4.element.ResultSet'> and 4 items long


2. Check the original table structure in the HTML source. You will see that the first row is the column heading and all of the following rows are the data from the HTML source. We'll assign two different variables for the two sections, as follows:

In [17]:
data_without_header = data[1:]
headers = data[0]
headers

<tr>
<th>Entry Header 1</th>
<th>Entry Header 2</th>
<th>Entry Header 3</th>
<th>Entry Header 4</th>
</tr>

3. Once we have separated the two sections, we need two list comprehensions to make them ready to go in a DataFrame. For the header, this is easy:

In [None]:
col_headers = [th.get_text() for th in headers.find_all('th')]
col_headers

In [None]:
#Equivalent explicit loop:
#col_headers = []
#for th in headers.find_all('th'):
#  col_headers.append(th.get_text())
#print(col_headers)

In [None]:
#data_without_header

Preparing the row data takes one more step. pandas needs a **two-dimensional list**, that is, a list of lists with one inner list per table row. We build it with a nested list comprehension.

4. Use a nested list comprehension to iterate over the rows and their cells:

In [None]:
df_data = [[td.get_text() for td in tr.find_all('td')] for tr in data_without_header]
df_data

5. Call the pd.DataFrame constructor with the row data and the column names:

In [None]:
df = pd.DataFrame(df_data, columns=col_headers)
df.head()
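
The five steps above can be gathered into one function. The sketch below is an illustration under assumptions: `table_to_dataframe` is a hypothetical helper name, and the markup is a shortened inline table modelled on the one in test.html, so the block runs without the file.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Shortened, made-up copy of the table layout used in this section.
html = """
<table class="data">
<tr><th>Entry Header 1</th><th>Entry Header 2</th></tr>
<tr><td>Entry First Line 1</td><td>Entry First Line 2</td></tr>
<tr><td>Entry Last Line 1</td><td>Entry Last Line 2</td></tr>
</table>
"""

def table_to_dataframe(markup):
    """Turn the first <table> in markup into a DataFrame (first row = header)."""
    soup = BeautifulSoup(markup, "html.parser")
    rows = soup.table.find_all("tr")
    col_headers = [th.get_text(strip=True) for th in rows[0].find_all("th")]
    df_data = [[td.get_text(strip=True) for td in tr.find_all("td")]
               for tr in rows[1:]]
    return pd.DataFrame(df_data, columns=col_headers)

df = table_to_dataframe(html)
print(df.shape)
print(list(df.columns))
```

Every value ends up as a string, so numeric columns would still need an explicit conversion before any arithmetic.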

**Another example**

In [None]:
from bs4 import BeautifulSoup
import urllib.request

In [None]:
#Open url and get html data
url = 'https://en.wikipedia.org/wiki/List_of_English_football_champions'
uh = urllib.request.urlopen(url)
data = uh.read().decode()
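
Some servers may reject requests that arrive with urllib's default User-Agent header. If the plain `urlopen(url)` call fails for that reason, one possible workaround is to send an explicit header, as sketched below. The helper names `build_request` and `fetch_html` and the User-Agent string are assumptions made for this illustration; only the request object is built here, and no network call is made until `fetch_html` is called.

```python
import urllib.request

def build_request(url, user_agent="bs4-tutorial-example/0.1"):
    """Build a Request that carries an explicit User-Agent (hypothetical value)."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})

def fetch_html(url, timeout=10):
    """Download a page and decode it, falling back to UTF-8 if no charset is declared."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")

req = build_request("https://en.wikipedia.org/wiki/List_of_English_football_champions")
print(req.get_header("User-agent"))
```

With these helpers, `data = fetch_html(url)` would play the same role as the `urlopen(...).read().decode()` lines above.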

In [None]:
print(data)

In [None]:
#Create BeautifulSoup object from html data
soup = BeautifulSoup(data, "html.parser")
print(type(soup))

In [None]:
#find all tables in the web page
all_tables = soup.find_all('table')
print(len(all_tables))

In [None]:
#Show the fourth table on the page (index 3)
#print(all_tables[0])
print(all_tables[3].prettify())


In [None]:
rows = all_tables[3].find_all('tr')
print(len(rows))

In [None]:
data_without_header = rows[1:]
headers = rows[0]
headers

In [None]:
col_headers = [th.get_text() for th in headers.find_all('th')]
col_headers

In [None]:
df_data = [[td.get_text() for td in tr.find_all('td')] for tr in data_without_header]
df_data

In [None]:
df = pd.DataFrame(df_data, columns=col_headers)
df.head()
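
Tables on real pages are often less regular than the one in test.html: a data row may use a < th > for its first cell, or contain fewer cells than the header row. Whether this applies to the table selected above is not verified here. The sketch below uses made-up markup and placeholder team names to show one way to keep such rows aligned with the column headers before building the DataFrame.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with placeholder values, used only for illustration.
html = """
<table>
<tr><th>Season</th><th>Champions</th><th>Runners-up</th></tr>
<tr><th>2001</th><td>Team A</td><td>Team B</td></tr>
<tr><th>2002</th><td>Team C</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
rows = soup.table.find_all("tr")

col_headers = [th.get_text(strip=True) for th in rows[0].find_all("th")]

df_data = []
for tr in rows[1:]:
    # Accept both <th> and <td> cells so a row-header cell is not dropped.
    cells = [c.get_text(strip=True) for c in tr.find_all(["th", "td"])]
    # Pad short rows with None so every row has one value per column.
    cells += [None] * (len(col_headers) - len(cells))
    # Trim rows that are longer than the header.
    df_data.append(cells[:len(col_headers)])

print(col_headers)
print(df_data)
```

The resulting `df_data` can be passed to `pd.DataFrame(df_data, columns=col_headers)` in the same way as before.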