In [1]:
from bs4 import BeautifulSoup
import requests
import pandas
pandas.__version__

'0.25.1'

In [2]:
!pip install html5lib



Let's try to get the list of states shown on this page: https://docs.omnisci.com/latest/3_apdx_states.html

# requests module

In [3]:
response = requests.get("https://docs.omnisci.com/latest/3_apdx_states.html")
response.ok

True

Success! response.ok is True for any HTTP status code below 400.
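Checking response.ok by eye works in a notebook, but in a script it is usually safer to fail loudly. The following is a minimal sketch, not part of the original notebook: the helper name fetch_html, its timeout default, and the injectable get parameter are all hypothetical choices made for illustration. It relies only on requests.get and Response.raise_for_status.

```python
import requests


def fetch_html(url, timeout=10.0, get=requests.get):
    """Return the body of `url` as text.

    `fetch_html` and its `get` parameter are hypothetical names for this
    sketch; `get` exists only so the function can be exercised offline.
    """
    response = get(url, timeout=timeout)
    # Raises requests.HTTPError for 4xx/5xx instead of returning quietly.
    response.raise_for_status()
    return response.text
```

With this helper, the cell above could become fetch_html("https://docs.omnisci.com/latest/3_apdx_states.html"), and a failed request would raise an exception instead of returning False.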

Use BeautifulSoup to parse the HTML content returned from the URL

In [4]:
soup = BeautifulSoup(response.text, 'html.parser')
print(soup.prettify())

<!DOCTYPE doctype html>
<!-- short description (e.g. "state-abbrevations") -->
<!-- Persona: Data Steward -->
<html>
 <head>
  <meta charset="utf-8"/>
  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
  <meta content="width=device-width, initial-scale=1" name="viewport"/>
  <meta content="noindex" name="robots"/>
  <script src="./js/analytics.js">
  </script>
  <script src="./js/toc.js">
  </script>
  <title>
   US State Abbreviations
  </title>
  <link href="./css/multiColumnTemplate.css" rel="stylesheet" type="text/css"/>
  <link href="./css/bootstrap-3.3.7.css" rel="stylesheet" type="text/css"/>
  <link href="./css/omnisci_docs.css" rel="stylesheet" type="text/css"/>
  <link href="./images/omnisci-icon.png" rel="icon" type="image/png">
   <link href="https://cdn.jsdelivr.net/npm/docsearch.js@2/dist/cdn/docsearch.min.css" rel="stylesheet">
   </link>
  </link>
 </head>
 <body>
  <header>
   <div class="container">
    <div class="primary_header">
     <table class="search">
  

We expect the states to be in an HTML table. Search the page's HTML content for the "table" tag

In [5]:
every_table = soup.find_all('table')  # find_all is the current name; findAll is a legacy alias
print(type(every_table))
print(len(every_table))

<class 'bs4.element.ResultSet'>
2


Apparently there are two tables in the HTML page. 

Inspect the first table in the list (at list index 0)

In [6]:
every_table[0]

<table class="search"><tr><td class="logo"><a href="./index.html"><img align="left" alt="OmniSci" class="logo" min-width="50px" src="./images/0_banner.png"/></a></td><td><form class="search">
<input id="search-field" placeholder="Search documentation..." style="display: block !important;" type="search"/></form></td></tr></table>

That's not the table that contains the states. What about the other table?

In [7]:
every_table[1]

<table border="1" class="colwidths-given docutils">
<colgroup>
<col width="90%"/>
<col width="10%"/>
</colgroup>
<thead valign="bottom">
<tr class="row-odd"><th class="head">State</th>
<th class="head">Abbreviation</th>
</tr>
</thead>
<tbody valign="top">
<tr class="row-even"><td>Alabama</td>
<td>AL</td>
</tr>
<tr class="row-odd"><td>Alaska</td>
<td>AK</td>
</tr>
<tr class="row-even"><td>Arizona</td>
<td>AZ</td>
</tr>
<tr class="row-odd"><td>Arkansas</td>
<td>AR</td>
</tr>
<tr class="row-even"><td>California</td>
<td>CA</td>
</tr>
<tr class="row-odd"><td>Colorado</td>
<td>CO</td>
</tr>
<tr class="row-even"><td>Connecticut</td>
<td>CT</td>
</tr>
<tr class="row-odd"><td>Delaware</td>
<td>DE</td>
</tr>
<tr class="row-even"><td>District of Columbia</td>
<td>DC</td>
</tr>
<tr class="row-odd"><td>Florida</td>
<td>FL</td>
</tr>
<tr class="row-even"><td>Georgia</td>
<td>GA</td>
</tr>
<tr class="row-odd"><td>Hawaii</td>
<td>HI</td>
</tr>
<tr class="row-even"><td>Idaho</td>
<td>ID</td>
</tr>
<tr

The second table in the list (at list index 1) does contain the name of the state and the abbreviation. However, the table isn't in the format needed.
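Picking the table by its position in the list breaks if the page layout changes. As a hedged alternative, the sketch below selects the table by one of its CSS classes ("docutils", as seen in the output above). The sample_html string is a made-up, shortened stand-in for the real page so the example runs offline.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the real page (assumption: same class names as above).
sample_html = """
<table class="search"><tr><td>logo</td></tr></table>
<table border="1" class="colwidths-given docutils">
<thead><tr><th>State</th><th>Abbreviation</th></tr></thead>
<tbody>
<tr><td>Alabama</td><td>AL</td></tr>
<tr><td>Alaska</td><td>AK</td></tr>
</tbody>
</table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")
# class_ matches a tag if any one of its classes equals the given value.
states_table = sample_soup.find("table", class_="docutils")
print(states_table.find("th").get_text())
```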

The rows of an HTML table are delimited by the "table row" (tr) tag. Take a look at the set of elements returned when I search for "tr" as a tag within the table:

In [8]:
table_rows = every_table[1].find_all('tr')
print(table_rows[0])

<tr class="row-odd"><th class="head">State</th>
<th class="head">Abbreviation</th>
</tr>


That's the table header. We care about the data, which starts in the second row of the table.

In [9]:
print(table_rows[1])

<tr class="row-even"><td>Alabama</td>
<td>AL</td>
</tr>
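Before handing the table to pandas, it can help to see what a manual extraction looks like. This is a sketch under assumptions: sample_table_html is a made-up two-row excerpt of the table above, and the variable names are chosen for illustration only.

```python
from bs4 import BeautifulSoup

# Two-row excerpt standing in for the real states table (assumption).
sample_table_html = """
<table>
<thead><tr><th>State</th><th>Abbreviation</th></tr></thead>
<tbody>
<tr><td>Alabama</td><td>AL</td></tr>
<tr><td>Alaska</td><td>AK</td></tr>
</tbody>
</table>
"""

sample_table = BeautifulSoup(sample_table_html, "html.parser").find("table")

state_rows = []
for tr in sample_table.find_all("tr"):
    # Header cells are <th>, so the header row yields an empty list here.
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if cells:
        state_rows.append(cells)

print(state_rows)
```

This works, but every table needs its own loop; the cells below let pandas do the same job in one call.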


As a reminder, the HTML table we are currently working with is a BeautifulSoup "Tag"

In [10]:
type(every_table[1])

bs4.element.Tag

Extract the HTML text from the BeautifulSoup Tag

In [11]:
html_table_as_text = str(every_table[1])

Now we can use pandas to parse the HTML text

In [12]:
page_content = pandas.read_html(html_table_as_text)

Pandas returns a list with a single element

In [13]:
type(page_content)

list

In [14]:
len(page_content)

1

In the list is a dataframe containing the information we desire!

In [15]:
page_content[0]

Unnamed: 0,State,Abbreviation
0,Alabama,AL
1,Alaska,AK
2,Arizona,AZ
3,Arkansas,AR
4,California,CA
5,Colorado,CO
6,Connecticut,CT
7,Delaware,DE
8,District of Columbia,DC
9,Florida,FL


# Pandas

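pandas.read_html also accepts a URL directly, so the requests and BeautifulSoup steps can be skipped. Pandas passes the URL to urllib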

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_html.html

In [16]:
page_content = pandas.read_html("https://docs.omnisci.com/latest/3_apdx_states.html")

In [17]:
type(page_content)

list

In [18]:
len(page_content)

2

In [19]:
page_content[0]

Unnamed: 0,0,1
0,,


This is not what we were seeking; the first table is the search box from the page header

In [20]:
page_content[1]

Unnamed: 0,State,Abbreviation
0,Alabama,AL
1,Alaska,AK
2,Arizona,AZ
3,Arkansas,AR
4,California,CA
5,Colorado,CO
6,Connecticut,CT
7,Delaware,DE
8,District of Columbia,DC
9,Florida,FL
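Instead of indexing into the returned list, read_html can be asked to return only tables that contain given text through its match argument. The sketch below is illustrative: sample_page is a made-up stand-in for the real page so the example runs offline, and it assumes a parser backend that pandas can use (lxml, or BeautifulSoup with html5lib) is installed.

```python
import io

import pandas

# Made-up stand-in for the real page: one search table, one states table.
sample_page = """
<html><body>
<table class="search"><tr><td>logo</td><td>search box</td></tr></table>
<table border="1" class="colwidths-given docutils">
<thead><tr><th>State</th><th>Abbreviation</th></tr></thead>
<tbody>
<tr><td>Alabama</td><td>AL</td></tr>
<tr><td>Alaska</td><td>AK</td></tr>
</tbody>
</table>
</body></html>
"""

# match keeps only tables whose text matches the given string or regex.
# Wrapping the HTML in StringIO works on old and new pandas versions alike.
matching_tables = pandas.read_html(io.StringIO(sample_page), match="Abbreviation")
states_df = matching_tables[0]
print(states_df)
```

Against the live page, the same idea would be pandas.read_html(url, match="Abbreviation"), which should leave out the search table.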
