
We'll use the Beautiful Soup module to parse HTML pages. This is available from the Python Package Index (PyPI). 

See https://pypi.python.org/pypi/beautifulsoup4.


In [1]:
pip install beautifulsoup4



In [1]:
from bs4 import BeautifulSoup
from pathlib import Path

In [2]:
source_path = Path('Volvo Ocean Race.html')

In [4]:
with source_path.open(encoding='utf8') as source_file:
  soup = BeautifulSoup(source_file, 'html.parser')

We've used a context manager to access the file. As an alternative, we could simply read the content with source_path.read_text(encoding='utf8') and pass the resulting string to the BeautifulSoup class. This works just as well as providing an open file.
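To make the comparison concrete, here is a minimal sketch that parses the same file both ways. The sample markup, the temporary file, and the helper names `parse_from_file` and `parse_from_text` are assumptions for illustration; they stand in for the real 'Volvo Ocean Race.html' document.

```python
from pathlib import Path
from tempfile import TemporaryDirectory

from bs4 import BeautifulSoup

# Hypothetical stand-in for the real page used in this recipe.
SAMPLE_HTML = (
    "<html><body><table><thead><tr><th>LEG 1</th></tr></thead>"
    "</table></body></html>"
)


def parse_from_file(path):
    """Parse by handing BeautifulSoup an open file object."""
    with path.open(encoding='utf8') as source_file:
        return BeautifulSoup(source_file, 'html.parser')


def parse_from_text(path):
    """Parse by reading the whole file into a string first."""
    return BeautifulSoup(path.read_text(encoding='utf8'), 'html.parser')


with TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / 'sample.html'
    sample_path.write_text(SAMPLE_HTML, encoding='utf8')
    soup_from_file = parse_from_file(sample_path)
    soup_from_text = parse_from_text(sample_path)

# Both approaches should build the same tree.
same_tree = str(soup_from_file) == str(soup_from_text)
first_header = soup_from_text.table.thead.tr.th.text
print(same_tree, first_header)
```

Either form should be fine for a document that fits comfortably in memory; the choice is mostly a matter of taste.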

In [6]:
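def get_legs(soup):
  """Return the leg descriptions found in the first table's header row."""
  legs = []
  # The first <tr> inside <thead> holds one <th> per leg.
  thead = soup.table.thead.tr
  for tag in thead.find_all('th'):
    # Only the leg columns carry a data-title attribute.
    if 'data-title' in tag.attrs:
      leg_description_text = clean_leg(tag.attrs['data-title'])
      legs.append(leg_description_text)
  return legs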

In [7]:
def clean_leg(text):
  """Strip the HTML markup embedded in a data-title value."""
  leg_soup = BeautifulSoup(text, 'html.parser')
  return leg_soup.text

In [8]:
get_legs(soup)

['ALICANTE - CAPE TOWN',
 'CAPE TOWN - ABU DHABI',
 'ABU DHABI - SANYA',
 'SANYA - AUCKLAND',
 'AUCKLAND - ITAJAÍ',
 'ITAJAÍ - NEWPORT',
 'NEWPORT - LISBON',
 'LISBON - LORIENT',
 'LORIENT - GOTHENBURG']
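The reason clean_leg() parses the attribute a second time is that the data-title value is itself a fragment of HTML. A small sketch of that two-step process follows, using a cut-down header row modelled on the page; the `HEADER_HTML` sample and the variable names are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened copy of the ranking table's header row.
HEADER_HTML = '''
<table class="ranking-list">
<thead>
<tr>
<th colspan="3"></th>
<th data-title="&lt;strong&gt;ALICANTE - CAPE TOWN&lt;/strong&gt;">LEG 1</th>
<th data-title="&lt;strong&gt;CAPE TOWN - ABU DHABI&lt;/strong&gt;">LEG 2</th>
</tr>
</thead>
</table>
'''

sample_soup = BeautifulSoup(HEADER_HTML, 'html.parser')

# Step 1: the parser unescapes the entities, leaving markup in the value.
leg_header = sample_soup.find('th', attrs={'data-title': True})
raw_title = leg_header['data-title']

# Step 2: parse that value again and keep only its text.
clean_title = BeautifulSoup(raw_title, 'html.parser').text

all_titles = [
    BeautifulSoup(th['data-title'], 'html.parser').text
    for th in sample_soup.find_all('th', attrs={'data-title': True})
]
print(raw_title)
print(clean_title)
print(all_titles)
```

Here raw_title still contains the strong tags, while clean_title is the plain leg name.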

# There's More

The Tag objects of Beautiful Soup represent the hierarchy of the document's structure. There are several kinds of navigation among tags:

* All tags except the root can have a parent.
* The **parents** attribute is a generator for parents of a tag.
* All **Tag** objects can have children.
* A tag with children may have multiple levels of tags under it.
* A tag can also have siblings.
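The kinds of navigation listed above can be sketched on a tiny document. The `NAV_HTML` sample and the variable names are assumptions for illustration; the attributes and methods used (parent, parents, children, descendants, find_previous_sibling) are part of Beautiful Soup.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical document, written on one line so that no whitespace-only
# text nodes appear between the tags.
NAV_HTML = (
    '<html><body><div id="outer">'
    '<p id="first">one</p><p id="second">two <b>bold</b></p>'
    '</div></body></html>'
)

nav_soup = BeautifulSoup(NAV_HTML, 'html.parser')
second = nav_soup.find('p', id='second')

# Upward: the direct parent, then every ancestor up to the document root.
parent_name = second.parent.name
ancestor_names = [tag.name for tag in second.parents]

# Downward: direct children only, then all nested tags at any depth.
child_ids = [child['id'] for child in nav_soup.div.children]
descendant_names = [
    node.name for node in nav_soup.div.descendants if isinstance(node, Tag)
]

# Sideways: the nearest earlier sibling that is a <p> tag.
previous_id = second.find_previous_sibling('p')['id']

print(parent_name, ancestor_names)
print(child_ids, descendant_names, previous_id)
```

Note that descendants also yields text nodes, which is why the sketch filters on Tag before reading the name.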

In [12]:
# In some cases, a document has a generally straightforward organization,
# and a simple search by the id or class attribute will find the relevant data.
ranking_table = soup.find('table', class_='ranking-list')
ranking_table


<table class="ranking-list" width="100%">
<thead>
<tr class="ranking-item">
<th colspan="3"></th>
<th data-htmlcontent="true" data-position="top" data-theme="tooltipster-shadow" data-title="&lt;strong&gt;ALICANTE - CAPE TOWN&lt;/strong&gt;" tooltipster="">LEG 1</th>
<th data-htmlcontent="true" data-position="top" data-theme="tooltipster-shadow" data-title="&lt;strong&gt;CAPE TOWN - ABU DHABI&lt;/strong&gt;" tooltipster="">LEG 2</th>
<th data-htmlcontent="true" data-position="top" data-theme="tooltipster-shadow" data-title="&lt;strong&gt;ABU DHABI - SANYA&lt;/strong&gt;" tooltipster="">LEG 3</th>
<th data-htmlcontent="true" data-position="top" data-theme="tooltipster-shadow" data-title="&lt;strong&gt;SANYA - AUCKLAND&lt;/strong&gt;" tooltipster="">LEG 4</th>
<th data-htmlcontent="true" data-position="top" data-theme="tooltipster-shadow" data-title="&lt;strong&gt;AUCKLAND - ITAJAÍ&lt;/strong&gt;" tooltipster="">LEG 5</th>
<th data-htmlcontent="true" data-position="top" data-theme="toolti

In [13]:
[tag.name for tag in ranking_table.parents]

['section', 'div', 'div', 'div', 'div', 'body', 'html', '[document]']
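When only one particular ancestor is of interest, walking the whole parents chain is not necessary. The following sketch uses find_parent() to jump straight to the enclosing section, and shows the id-based search as a route to the same table. The `PAGE_HTML` sample and the `results` id are assumptions for illustration, not taken from the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical page skeleton; the real page nests the table more deeply.
PAGE_HTML = '''
<html><body>
<div><section id="results">
<table class="ranking-list"><thead><tr><th>LEG 1</th></tr></thead></table>
</section></div>
</body></html>
'''

page_soup = BeautifulSoup(PAGE_HTML, 'html.parser')

# Search by class attribute, as in the recipe.
table = page_soup.find('table', class_='ranking-list')
parent_names = [tag.name for tag in table.parents]

# Jump directly to the nearest enclosing <section>.
enclosing_section = table.find_parent('section')
section_id = enclosing_section['id']

# Search by id attribute, then step down to the table inside it.
same_table = page_soup.find(id='results').table is table

print(parent_names)
print(section_id, same_table)
```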