# Beautiful Soup 4 and Web Page Parsing

Beautiful Soup (version 4) is a library for getting data from HTML or XML documents, and it gives you a nice, normalized, idiomatic way of **navigating and querying a document**.

In [1]:
from google.colab import drive
drive.mount('/content/drive/')

Mounted at /content/drive/


In [2]:
import os
os.chdir("drive/My Drive/Lab3/")

### Structure of HTML

<img src="https://drive.google.com/file/d/19jpWGX7RkhoRDHTXBQs9tT76W-52j_l_/view?usp=sharing" align="left" style="width:650px;" >

### 1. Reading an HTML File and Extracting Its Contents Using Beautiful Soup

In this exercise, we will do the **simplest** thing possible. We will import the Beautiful Soup or bs4 library and then use it to **read an HTML document**. Then, we will **examine the different kinds of objects** it returns. <br>
While doing the exercises for this topic, you should have the example HTML file (called test.html) open in a text editor so that you can check for the different tags and their attributes and contents:

1. Import the bs4 library:

In [3]:
#Import the bs4 library:
from bs4 import BeautifulSoup

You can **pass a file handle directly to the BeautifulSoup constructor**, and it will read the contents of the file that the handle is attached to. The return type is an instance of bs4.BeautifulSoup. <br>This class holds **all the methods we need to navigate the DOM tree** that the document represents.

2. Use bs4 to read the HTML file from disk:

In [5]:
#use bs4 to read the html file from the disk
with open("Datasets/datasets/test.html", "r") as fd:
    soup = BeautifulSoup(fd)
    print(type(soup))

<class 'bs4.BeautifulSoup'>
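The constructor also accepts a plain string of markup, plus an optional second argument that names the parser. The following is a minimal sketch, assuming a small inline document (`html_doc` is a hypothetical stand-in, not the contents of test.html) and Python's built-in `html.parser`:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for test.html; the real file is not reproduced here.
html_doc = "<html><body><h1>Title</h1><p>First</p><p>Second</p></body></html>"

# Naming the parser explicitly keeps results consistent across machines;
# without it, Beautiful Soup picks the best parser that happens to be installed.
soup_from_string = BeautifulSoup(html_doc, "html.parser")

print(type(soup_from_string))   # <class 'bs4.BeautifulSoup'>
print(soup_from_string.h1)      # <h1>Title</h1>
```

Different parsers can build slightly different trees from the same markup, so the outputs in this notebook may vary if another parser is used.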


Next, print the contents of the file with readable indentation by using the **prettify method** of the class:

3. Print the contents of the file with the prettify method

In [6]:
print(soup.prettify())

# The same information is also available through the soup.contents attribute.
# However, first, it is not pretty-printed and, second, it is essentially a list.
#print(soup.contents)

<html>
 <body>
  <h1>
   Lorem ipsum dolor sit amet consectetuer adipiscing elit
  </h1>
  <p>
   Lorem ipsum dolor sit amet, consectetuer adipiscing 
  elit. Aenean commodo ligula eget dolor. Aenean massa
   <strong>
    strong
   </strong>
   . Cum sociis natoque penatibus 
  et magnis dis parturient montes, nascetur ridiculus 
  mus. Donec quam felis, ultricies nec, pellentesque 
  eu, pretium quis, sem. Nulla consequat massa quis 
  enim. Donec pede justo, fringilla vel, aliquet nec, 
  vulputate eget, arcu. In enim justo, rhoncus ut, 
  imperdiet a, venenatis vitae, justo. Nullam dictum 
  felis eu pede
   <a class="external ext" href="#">
    link
   </a>
   mollis pretium. Integer tincidunt. Cras dapibus. 
  Vivamus elementum semper nisi. Aenean vulputate 
  eleifend tellus. Aenean leo ligula, porttitor eu, 
  consequat vitae, eleifend ac, enim. Aliquam lorem ante, 
  dapibus in, viverra quis, feugiat a, tellus. Phasellus 
  viverra nulla ut metus varius laoreet. Quisque rutrum.

There are many paragraph tags, or < p > tags, in the document. Let's read the content of one such < p > tag. We can do that with simple **"." (dot) attribute access**, as we would for a normal member variable of a class. Dot access returns only the first tag with that name.

4. read content from the first < p > tag

In [8]:
# Of the 6 <p> tags, show only the first one
with open("Datasets/datasets/test.html", "r") as fd:
    soup = BeautifulSoup(fd)
    print(type(soup.p))
    print(soup.p)

<class 'bs4.element.Tag'>
<p>Lorem ipsum dolor sit amet, consectetuer adipiscing 
  elit. Aenean commodo ligula eget dolor. Aenean massa 
  <strong>strong</strong>. Cum sociis natoque penatibus 
  et magnis dis parturient montes, nascetur ridiculus 
  mus. Donec quam felis, ultricies nec, pellentesque 
  eu, pretium quis, sem. Nulla consequat massa quis 
  enim. Donec pede justo, fringilla vel, aliquet nec, 
  vulputate eget, arcu. In enim justo, rhoncus ut, 
  imperdiet a, venenatis vitae, justo. Nullam dictum 
  felis eu pede <a class="external ext" href="#">link</a> 
  mollis pretium. Integer tincidunt. Cras dapibus. 
  Vivamus elementum semper nisi. Aenean vulputate 
  eleifend tellus. Aenean leo ligula, porttitor eu, 
  consequat vitae, eleifend ac, enim. Aliquam lorem ante, 
  dapibus in, viverra quis, feugiat a, tellus. Phasellus 
  viverra nulla ut metus varius laoreet. Quisque rutrum. 
  Aenean imperdiet. Etiam ultricies nisi vel augue. 
  Curabitur ullamcorper ultricies nisi.
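Dot access can be chained to reach nested tags, and attributes are read with dictionary-style indexing. Here is a small sketch on hypothetical markup (loosely modeled on the paragraph above, not the actual test.html):

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only.
html_doc = (
    '<p>Some text <strong>strong</strong> and a '
    '<a class="external ext" href="#">link</a>.</p><p>Second</p>'
)
demo_soup = BeautifulSoup(html_doc, "html.parser")

first_p = demo_soup.p              # first <p> only, like demo_soup.find("p")
print(first_p.name)                # p
print(first_p.a["href"])           # #
print(first_p.a["class"])          # ['external', 'ext']
print(first_p.strong.get_text())   # strong
print(demo_soup.table)             # None, because there is no <table> here
```

Because a missing tag yields `None` rather than an error, it is worth checking the result before chaining further.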

If you need to get all the < p > tags, or anything more complicated than the first tag with a certain name, you'll need to use one of the methods described in the "Searching the tree" section of the Beautiful Soup documentation, such as **find_all()**.

5. Use the find_all() method to extract the content from the tag:


In [9]:
all_ps = soup.find_all('p')
print("Total number of <p>  --- {}".format(len(all_ps)))
# Select the desired tag with the index.
#soup.find_all('p')[0]
all_ps[5]

Total number of <p>  --- 6


<p>Lorem ipsum dolor sit amet, consectetuer adipiscing 
  elit. Aenean commodo ligula eget dolor. Aenean massa. 
  Cum sociis natoque penatibus et magnis dis parturient 
  montes, nascetur ridiculus mus. Donec quam felis, 
  ultricies nec, pellentesque eu, pretium quis, sem.</p>
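Beyond a tag name, `find_all()` accepts filters such as `class_` and `limit`. A short sketch on hypothetical markup (the `intro` class is invented for this example):

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the class name "intro" is made up for illustration.
html_doc = """
<p class="intro">One</p>
<p>Two</p>
<p class="intro">Three</p>
"""
demo_soup = BeautifulSoup(html_doc, "html.parser")

all_paragraphs = demo_soup.find_all("p")               # every <p>, in document order
intro_only = demo_soup.find_all("p", class_="intro")   # filter on the class attribute
first_two = demo_soup.find_all("p", limit=2)           # stop after two matches

print(len(all_paragraphs), len(intro_only), len(first_two))   # 3 2 2
print([p.get_text() for p in intro_only])                     # ['One', 'Three']
```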

We have seen how to access all the tags of the same type. We have also seen how to get the content of the entire HTML document.

6. Now we will see how to get the contents of a particular HTML tag:

In [11]:
with open("Datasets/datasets/test.html", "r") as fd:
    soup = BeautifulSoup(fd)
    table = soup.table
    print(table.contents)
    #print(table.prettify)
    #print(table.contents[3])
    #tt = table.contents[3]
    #print("-"*30)
    #print(tt.contents[1])
    # What happens with table.contents[0]?

['\n', <tr>
<th>Entry Header 1</th>
<th>Entry Header 2</th>
<th>Entry Header 3</th>
<th>Entry Header 4</th>
</tr>, '\n', <tr>
<td>Entry First Line 1</td>
<td>Entry First Line 2</td>
<td>Entry First Line 3</td>
<td>Entry First Line 4</td>
</tr>, '\n', <tr>
<td>Entry Line 1</td>
<td>Entry Line 2</td>
<td>Entry Line 3</td>
<td>Entry Line 4</td>
</tr>, '\n', <tr>
<td>Entry Last Line 1</td>
<td>Entry Last Line 2</td>
<td>Entry Last Line 3</td>
<td>Entry Last Line 4</td>
</tr>, '\n']
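The `'\n'` entries in the list above are the whitespace between tags, stored as text nodes. A minimal sketch, assuming a small hypothetical table, shows what `contents[0]` holds and one way to keep only the tag children:

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical two-row table with line breaks between the rows.
html_doc = """<table>
<tr><th>A</th><th>B</th></tr>
<tr><td>1</td><td>2</td></tr>
</table>"""
demo_table = BeautifulSoup(html_doc, "html.parser").table

# The first child is the newline between <table> and the first <tr>.
print(repr(demo_table.contents[0]))   # '\n'
print(len(demo_table.contents))       # 5

# Keep only real tags and drop the whitespace strings.
rows_only = [child for child in demo_table.contents if isinstance(child, Tag)]
print(len(rows_only))                 # 2
```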


HTML is represented as a **tree**, and we are able to traverse the children of a particular node. There are a few ways to do this.

7. The first way is by using the **children generator** from any bs4 instance, as follows:

In [13]:
with open("Datasets/datasets/test.html", "r") as fd:
    soup = BeautifulSoup(fd)
    table = soup.table
    children = table.children
    des = table.descendants
    print(len(list(children)), len(list(des)))

9 61


In [14]:
print(list(table.children))
print(len(list(table.children)[1]))


#print(list(table.descendants))
#print(list(table.descendants)[:10])


['\n', <tr>
<th>Entry Header 1</th>
<th>Entry Header 2</th>
<th>Entry Header 3</th>
<th>Entry Header 4</th>
</tr>, '\n', <tr>
<td>Entry First Line 1</td>
<td>Entry First Line 2</td>
<td>Entry First Line 3</td>
<td>Entry First Line 4</td>
</tr>, '\n', <tr>
<td>Entry Line 1</td>
<td>Entry Line 2</td>
<td>Entry Line 3</td>
<td>Entry Line 4</td>
</tr>, '\n', <tr>
<td>Entry Last Line 1</td>
<td>Entry Last Line 2</td>
<td>Entry Last Line 3</td>
<td>Entry Last Line 4</td>
</tr>, '\n']
9


In [15]:
child_table = list(table.children)
print(child_table)
print("-"*30)
print(child_table[1])

desc = list(child_table[1].descendants)
print("-"*30)
print(desc)
print(len(desc))

['\n', <tr>
<th>Entry Header 1</th>
<th>Entry Header 2</th>
<th>Entry Header 3</th>
<th>Entry Header 4</th>
</tr>, '\n', <tr>
<td>Entry First Line 1</td>
<td>Entry First Line 2</td>
<td>Entry First Line 3</td>
<td>Entry First Line 4</td>
</tr>, '\n', <tr>
<td>Entry Line 1</td>
<td>Entry Line 2</td>
<td>Entry Line 3</td>
<td>Entry Line 4</td>
</tr>, '\n', <tr>
<td>Entry Last Line 1</td>
<td>Entry Last Line 2</td>
<td>Entry Last Line 3</td>
<td>Entry Last Line 4</td>
</tr>, '\n']
------------------------------
<tr>
<th>Entry Header 1</th>
<th>Entry Header 2</th>
<th>Entry Header 3</th>
<th>Entry Header 4</th>
</tr>
------------------------------
['\n', <th>Entry Header 1</th>, 'Entry Header 1', '\n', <th>Entry Header 2</th>, 'Entry Header 2', '\n', <th>Entry Header 3</th>, 'Entry Header 3', '\n', <th>Entry Header 4</th>, 'Entry Header 4', '\n']
13


The final print in cell In [13] shows the **difference between children and descendants**. <br>
The list built from **children has only 9 items** (the 4 rows plus the 5 newline strings around them), whereas the list built from **descendants has 61**, because it also includes every nested tag and string.
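The same contrast can be reproduced on a table small enough to count by hand. This sketch uses hypothetical markup without whitespace between tags, so the numbers differ from those of test.html:

```python
from bs4 import BeautifulSoup

# Hypothetical table written on one line, so there are no newline text nodes.
html_doc = "<table><tr><th>A</th><th>B</th></tr><tr><td>1</td><td>2</td></tr></table>"
demo_table = BeautifulSoup(html_doc, "html.parser").table

children = list(demo_table.children)         # the two <tr> tags only
descendants = list(demo_table.descendants)   # <tr>, <th>, 'A', <th>, 'B', <tr>, <td>, '1', <td>, '2'

print(len(children), len(descendants))       # 2 10
```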

The **.descendants** attribute lets you **iterate over all of a tag’s children, recursively**: its direct children, the children of its direct children, and so on:

In [16]:
# Materialize the generator once so it can be both printed and counted
desc = list(list(table.children)[3].descendants)
print(list(table.children)[3])
print("-"*30)
print(desc)
print(len(desc))

<tr>
<td>Entry First Line 1</td>
<td>Entry First Line 2</td>
<td>Entry First Line 3</td>
<td>Entry First Line 4</td>
</tr>
------------------------------
['\n', <td>Entry First Line 1</td>, 'Entry First Line 1', '\n', <td>Entry First Line 2</td>, 'Entry First Line 2', '\n', <td>Entry First Line 3</td>, 'Entry First Line 3', '\n', <td>Entry First Line 4</td>, 'Entry First Line 4', '\n']
13
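Keep in mind that `.descendants` (like `.children`) returns a generator, which can be consumed only once. A second `list()` call on the same generator object yields an empty list, so its length is 0. A minimal sketch on a hypothetical one-row table:

```python
from bs4 import BeautifulSoup

# Hypothetical one-row table for illustration.
demo_row = BeautifulSoup(
    "<table><tr><td>1</td><td>2</td></tr></table>", "html.parser"
).tr

gen = demo_row.descendants
first_pass = list(gen)     # consumes the generator: <td>, '1', <td>, '2'
second_pass = list(gen)    # nothing left to yield

print(len(first_pass), len(second_pass))   # 4 0
```

Storing `list(tag.descendants)` in a variable avoids this pitfall when the items are needed more than once.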


In [17]:
#print(list(table.descendants))

***

### 2. DataFrames and BeautifulSoup

Now we will go one step further and combine bs4 with pandas to generate a DataFrame from a plain HTML table.<br>
We will extract the data from the test.html page using the BeautifulSoup library, then perform a few data preparation steps and display the data in an easily readable tabular format.

1. Import pandas and read the document, as follows:

In [19]:
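import pandas as pd
from bs4 import BeautifulSoup
# Use a context manager so the file is closed after parsing
with open("Datasets/datasets/test.html", "r") as fd:
    soup = BeautifulSoup(fd)
# find_all is the current name; findAll is a legacy alias
data = soup.find_all('tr')
print("Data is a {} and {} items long".format(type(data), len(data)))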

Data is a <class 'bs4.element.ResultSet'> and 4 items long


2. Check the original table structure in the HTML source. You will see that the first row is the column heading and all of the following rows are the data from the HTML source. We'll assign two different variables for the two sections, as follows:

In [20]:
data_without_header = data[1:]
headers = data[0]
headers

<tr>
<th>Entry Header 1</th>
<th>Entry Header 2</th>
<th>Entry Header 3</th>
<th>Entry Header 4</th>
</tr>

3. Once we have separated the two sections, we need two list comprehensions to make them ready to go in a DataFrame. For the header, this is easy:

In [21]:
col_headers = [th.getText() for th in headers.findAll('th')]
col_headers

['Entry Header 1', 'Entry Header 2', 'Entry Header 3', 'Entry Header 4']

In [22]:
#col_headers = []
#for th in headers.findAll('th'):
#  col_headers.append(th.getText())
#print(col_headers)

In [23]:
#data_without_header

Data preparation is a bit tricky for a pandas DataFrame. You need to have a **two-dimensional list**, which is a list of lists. We accomplish that in the following way, using list comprehension.

4. Use a nested list comprehension to iterate over the rows and, within each row, over its < td > cells:

In [24]:
df_data = [[td.getText() for td in tr.findAll('td')] for tr in data_without_header]
df_data

[['Entry First Line 1',
  'Entry First Line 2',
  'Entry First Line 3',
  'Entry First Line 4'],
 ['Entry Line 1', 'Entry Line 2', 'Entry Line 3', 'Entry Line 4'],
 ['Entry Last Line 1',
  'Entry Last Line 2',
  'Entry Last Line 3',
  'Entry Last Line 4']]
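For readers less used to nested comprehensions, the same result can be built with two explicit loops. This is a sketch on a hypothetical table, not the test.html data:

```python
from bs4 import BeautifulSoup

# Hypothetical table: one header row followed by two data rows.
html_doc = """<table>
<tr><th>H1</th><th>H2</th></tr>
<tr><td>a</td><td>b</td></tr>
<tr><td>c</td><td>d</td></tr>
</table>"""
rows = BeautifulSoup(html_doc, "html.parser").find_all("tr")

# Nested comprehension: outer loop over rows, inner loop over cells.
nested = [[td.get_text() for td in tr.find_all("td")] for tr in rows[1:]]

# Equivalent explicit loops.
looped = []
for tr in rows[1:]:
    row_values = []
    for td in tr.find_all("td"):
        row_values.append(td.get_text())
    looped.append(row_values)

print(nested)             # [['a', 'b'], ['c', 'd']]
print(nested == looped)   # True
```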

5. Call the pd.DataFrame constructor with the row data and the column headers:

In [25]:
df = pd.DataFrame(df_data, columns=col_headers)
df.head()

|   | Entry Header 1 | Entry Header 2 | Entry Header 3 | Entry Header 4 |
|---|---|---|---|---|
| 0 | Entry First Line 1 | Entry First Line 2 | Entry First Line 3 | Entry First Line 4 |
| 1 | Entry Line 1 | Entry Line 2 | Entry Line 3 | Entry Line 4 |
| 2 | Entry Last Line 1 | Entry Last Line 2 | Entry Last Line 3 | Entry Last Line 4 |
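The steps above can be collected into one small function. The sketch below assumes the first < tr > holds the < th > headers and every later row holds only < td > cells; `table_to_dataframe` is a hypothetical helper name, not part of bs4 or pandas:

```python
import pandas as pd
from bs4 import BeautifulSoup


def table_to_dataframe(table):
    """Hypothetical helper: first <tr> gives column names, the rest give rows."""
    rows = table.find_all("tr")
    columns = [th.get_text(strip=True) for th in rows[0].find_all("th")]
    records = [
        [td.get_text(strip=True) for td in tr.find_all("td")]
        for tr in rows[1:]
    ]
    return pd.DataFrame(records, columns=columns)


# Hypothetical table for illustration.
html_doc = """<table>
<tr><th>Name</th><th>Score</th></tr>
<tr><td>Ann</td><td>10</td></tr>
<tr><td>Bob</td><td>7</td></tr>
</table>"""
demo_df = table_to_dataframe(BeautifulSoup(html_doc, "html.parser").table)
print(demo_df.shape)           # (2, 2)
print(list(demo_df.columns))   # ['Name', 'Score']
```

All values come back as strings, so numeric columns would still need an explicit conversion before any arithmetic.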


**Another example**

In [26]:
from bs4 import BeautifulSoup
import urllib.request

In [27]:
#Open url and get html data
url = 'https://en.wikipedia.org/wiki/List_of_English_football_champions'
uh = urllib.request.urlopen(url)
data = uh.read().decode()

In [28]:
print(data)

<!DOCTYPE html>
<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-custom-font-size-clientpref-1 vector-feature-appearance-enabled vector-feature-appearance-pinned-clientpref-1 vector-feature-night-mode-enabled skin-theme-clientpref-day vector-toc-available" lang="en" dir="ltr">
<head>
<meta charset="UTF-8">
<title>List of English football champions - Wikipedia</title>
<script>(function(){var className="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabl

In [29]:
#Create BeautifulSoup object from html data
soup = BeautifulSoup(data)
print(type(soup))

<class 'bs4.BeautifulSoup'>


In [30]:
#find all tables in the web page
all_tables = soup.find_all('table')
print(len(all_tables))

8
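Picking a table by position (such as index 3) breaks as soon as the page layout changes. A somewhat sturdier option is to filter by class or by header text. The sketch below runs on hypothetical markup that only imitates the structure of a Wikipedia page; `find_table_with_header` is an invented helper name:

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating a page with several tables.
html_doc = """
<table class="infobox"><tr><th>Founded</th></tr></table>
<table class="wikitable sortable"><tr><th>Season</th><th>Champions</th></tr></table>
<table class="wikitable"><tr><th>Club</th><th>Titles</th></tr></table>
"""
demo_soup = BeautifulSoup(html_doc, "html.parser")


def find_table_with_header(soup, header_text):
    """Hypothetical helper: first table that has a <th> with exactly this text."""
    for table in soup.find_all("table"):
        headers = [th.get_text(strip=True) for th in table.find_all("th")]
        if header_text in headers:
            return table
    return None


# class_ matches any table whose class list contains "wikitable".
wikitables = demo_soup.find_all("table", class_="wikitable")
print(len(wikitables))          # 2

season_table = find_table_with_header(demo_soup, "Season")
print(season_table["class"])    # ['wikitable', 'sortable']
```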


In [31]:
# Show the raw HTML of the table at index 3
# (for an indented version, call all_tables[3].prettify() with parentheses)
#print(all_tables[0])
print(all_tables[3])


<table class="wikitable sortable">
<tbody><tr>
<th width="80">Season</th>
<th width="200">Champions (number of titles)</th>
<th width="200">Runners-up</th>
<th width="200">Third place</th>
<th width="240">Winning manager
</th></tr>
<tr>
<td style="text-align: center;"><a href="/wiki/1992%E2%80%9393_FA_Premier_League" title="1992–93 FA Premier League">1992–93</a>
</td>
<td><a href="/wiki/Manchester_United_F.C." title="Manchester United F.C.">Manchester United</a> (8)
</td>
<td><a href="/wiki/Aston_Villa_F.C." title="Aston Villa F.C.">Aston Villa</a> (10)
</td>
<td><a href="/wiki/Norwich_City_F.C." title="Norwich City F.C.">Norwich City</a> (1)
</td>
<td data-sort-value="Ferguson, Alex"><a href="/wiki/Alex_Ferguson" title="Alex Ferguson">Alex Ferguson</a>
</td></tr>
<tr>
<td style="text-align: center;"><a href="/wiki/1993%E2%80%9394_FA_Premier_League" title="1993–94 FA Premier League">1993–94</a>
</td>
<td><a href="/wiki/Manchester_United_F.C." title="Manche

In [32]:
rows = all_tables[3].findAll('tr')
print(len(rows))

33


In [33]:
data_without_header = rows[1:]
headers = rows[0]
headers

<tr>
<th width="80">Season</th>
<th width="200">Champions (number of titles)</th>
<th width="200">Runners-up</th>
<th width="200">Third place</th>
<th width="240">Winning manager
</th></tr>

In [34]:
col_headers = [th.getText() for th in headers.findAll('th')]
col_headers

['Season',
 'Champions (number of titles)',
 'Runners-up',
 'Third place',
 'Winning manager\n']

In [35]:
df_data = [[td.getText() for td in tr.findAll('td')] for tr in data_without_header]
df_data

[['1992–93\n',
  'Manchester United (8)\n',
  'Aston Villa (10)\n',
  'Norwich City (1)\n',
  'Alex Ferguson\n'],
 ['1993–94\n',
  'Manchester United[b] (9)\n',
  'Blackburn Rovers (1)\n',
  'Newcastle United (3)\n',
  'Alex Ferguson\n'],
 ['1994–95\n',
  'Blackburn Rovers (3)\n',
  'Manchester United (11)\n',
  'Nottingham Forest (4)\n',
  'Kenny Dalglish\n'],
 ['1995–96\n',
  'Manchester United[b] (10)\n',
  'Newcastle United (1)\n',
  'Liverpool (3)\n',
  'Alex Ferguson\n'],
 ['1996–97\n',
  'Manchester United (11)\n',
  'Newcastle United (2)\n',
  'Arsenal (5)\n',
  'Alex Ferguson\n'],
 ['1997–98\n',
  'Arsenal[b] (11)\n',
  'Manchester United (12)\n',
  'Liverpool (4)\n',
  'Arsène Wenger\n'],
 ['1998–99\n',
  'Manchester United[i] (12)\n',
  'Arsenal (4)\n',
  'Chelsea (4)\n',
  'Alex Ferguson\n'],
 ['1999–00\n',
  'Manchester United[j] (13)\n',
  'Arsenal (5)\n',
  'Leeds United (2)\n',
  'Alex Ferguson\n'],
 ['2000–01\n',
  'Manchester United (14)\n',
  'Arsenal (6)\n',
  'Live

In [36]:
df = pd.DataFrame(df_data, columns=col_headers)
df.head()

|   | Season | Champions (number of titles) | Runners-up | Third place | Winning manager\n |
|---|---|---|---|---|---|
| 0 | 1992–93\n | Manchester United (8)\n | Aston Villa (10)\n | Norwich City (1)\n | Alex Ferguson\n |
| 1 | 1993–94\n | Manchester United[b] (9)\n | Blackburn Rovers (1)\n | Newcastle United (3)\n | Alex Ferguson\n |
| 2 | 1994–95\n | Blackburn Rovers (3)\n | Manchester United (11)\n | Nottingham Forest (4)\n | Kenny Dalglish\n |
| 3 | 1995–96\n | Manchester United[b] (10)\n | Newcastle United (1)\n | Liverpool (3)\n | Alex Ferguson\n |
| 4 | 1996–97\n | Manchester United (11)\n | Newcastle United (2)\n | Arsenal (5)\n | Alex Ferguson\n |
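The values above still carry trailing `\n` characters and footnote markers such as `[b]`. One possible cleanup, sketched here on hypothetical cells modeled on that output, strips the whitespace and removes bracketed markers with a regular expression (`clean_cell` is an invented helper name):

```python
import re
from bs4 import BeautifulSoup

# Hypothetical cells modeled on the output above.
html_doc = """<table><tr>
<td><a href="#">Manchester United</a><sup>[b]</sup> (9)
</td>
<td><a href="#">Alex Ferguson</a>
</td></tr></table>"""
cells = BeautifulSoup(html_doc, "html.parser").find_all("td")


def clean_cell(td):
    """Hypothetical helper: trim whitespace and drop markers like [b]."""
    text = td.get_text().strip()          # remove the trailing newline
    return re.sub(r"\[\w+\]", "", text)   # remove footnote markers


cleaned = [clean_cell(td) for td in cells]
print(cleaned)   # ['Manchester United (9)', 'Alex Ferguson']
```

The sketch uses `get_text().strip()` rather than `get_text(strip=True)` because the latter strips each text fragment separately and would also remove the space before `(9)`.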
