# Data Engineering with Beautiful Soup

Data Engineering, the process of gathering and preparing data for analysis, is a very big part of Data Science.

Datasets might not be formatted in the way you need (e.g. you have categorical features but your algorithm requires numerical features); or you might need to cross-reference one dataset against another that has a different format; or you might be dealing with a dataset that contains missing or invalid data.

These are just a few examples of why data retrieval and cleaning are so important.

## Retrieving data from the web

In [1]:
import requests

In [2]:
# Get the HU Wikipedia page
req = requests.get("https://en.wikipedia.org/wiki/Harvard_University")

In [3]:
page = req.text
page[:1000]

'<!DOCTYPE html>\n<html class="client-nojs" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title>Harvard University - Wikipedia</title>\n<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"cc2b656f-514a-4a38-9413-66df01af012b","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"Harvard_University","wgTitle":"Harvard University","wgCurRevisionId":1024002151,"wgRevisionId":1024002151,"wgArticleId":18426501,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["CS1 maint: location","Webarchive template wayback links","CS1: Julian–Gregorian uncertainty","Articles with short description","Short description

Great! Now we have the text of the Harvard Wikipedia page. But this mess of HTML tags would be a pain to parse manually, which is why we will use another very cool Python library called BeautifulSoup.

## BeautifulSoup

Parsing data would be a breeze if we could always use well-formatted data sources such as CSV, JSON, or XML; but some formats, such as HTML, are both very popular and a pain to parse.

One of the problems with HTML is that over the years browsers have evolved to be very forgiving of "malformed" syntax. Your browser is smart enough to detect some common problems, such as unclosed tags, and correct them on the fly.

Unfortunately, we do not have the time or patience to implement all the different corner cases, so we'll let BeautifulSoup do that for us.

You'll notice that the `import` statement below is different from what we used for `requests`. The _from library import thing_ pattern is useful when you don't want to reference a function by its full name (like we did with `requests.get`), but you also don't want to import every single thing in that library into your namespace.

In [4]:
from bs4 import BeautifulSoup

BeautifulSoup can deal with HTML or XML data, so the next line parses the contents of the `page` variable using its HTML parser, and assigns the result to the `soup` variable.

In [5]:
soup = BeautifulSoup(page, 'html.parser')

In [6]:
type(soup)

bs4.BeautifulSoup

Doesn't look much different from the `page` object representation. Let's make sure the two are different types.

In [7]:
type(page)

str

Looks like they are indeed different.

`BeautifulSoup` objects have a cool little method that allows you to see the HTML content in a nice, indented way.

In [8]:
print(soup.prettify()[:1000])

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   Harvard University - Wikipedia
  </title>
  <script>
   document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"cc2b656f-514a-4a38-9413-66df01af012b","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"Harvard_University","wgTitle":"Harvard University","wgCurRevisionId":1024002151,"wgRevisionId":1024002151,"wgArticleId":18426501,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["CS1 maint: location","Webarchive template wayback links","CS1: Julian–Gregorian uncertainty","Articles with short description","Short

Looks like it's our page!

We can now reference elements of the HTML document in different ways. One very convenient way is by using the dot notation, which allows us to access the elements as if they were properties of the object.

In [9]:
soup.title

<title>Harvard University - Wikipedia</title>

This is nice for HTML elements that only appear once per page, such as the `title` tag. But what about elements that can appear multiple times?

In [10]:
# Be careful with elements that show up multiple times.
soup.p

<p class="mw-empty-elt">
</p>

Uh oh. The dot notation returned only the first matching element, which here is an empty paragraph. It turns out the attribute syntax in Beautiful Soup is what is called syntactic sugar for `find`. That's why it is safer to use the explicit methods behind that syntactic sugar. These are `BeautifulSoup.find` for getting single elements, and `BeautifulSoup.find_all` for retrieving multiple elements.

In [11]:
len(soup.find_all("p"))

102
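
To make the difference concrete, here is a minimal sketch on a tiny, made-up HTML snippet (the markup and the names `demo_html`, `demo_soup`, `first_by_dot`, `first_by_find`, and `all_paragraphs` are illustrative assumptions, not taken from the Wikipedia page). It compares the dot notation with the explicit `find` and `find_all` calls.

```python
from bs4 import BeautifulSoup

# A small, invented document whose first paragraph is empty,
# similar in spirit to what we saw on the real page.
demo_html = """
<html><body>
<p class="mw-empty-elt"></p>
<p>First real paragraph.</p>
<p>Second real paragraph.</p>
</body></html>
"""
demo_soup = BeautifulSoup(demo_html, "html.parser")

first_by_dot = demo_soup.p             # shorthand: only the first <p>
first_by_find = demo_soup.find("p")    # the explicit equivalent
all_paragraphs = demo_soup.find_all("p")

print(first_by_dot == first_by_find)   # both refer to the empty first paragraph
print(len(all_paragraphs))
print([p.get_text() for p in all_paragraphs if p.get_text()])
```

If you need every occurrence, `find_all` is the call to reach for; the dot notation is best kept for tags you expect only once.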

---

If you look at the Wikipedia page in your browser, you'll notice that it has a couple of tables in it. We will be working with the "Demographics" table, but first we need to find it.

One of the HTML attributes that will be very useful to us is the "class" attribute.

Getting the class of a single element is easy...

In [11]:
soup.table["class"]

['infobox', 'vcard']

Next we will use a list comprehension to see all the tables that have a "class" attribute. 

In [12]:
# the classes of all tables that have a class attribute set on them
[t["class"] for t in soup.find_all("table") if t.get("class")]

[['infobox', 'vcard'],
 ['toccolours'],
 ['infobox'],
 ['wikitable', 'sortable', 'collapsible', 'collapsed', 'floatright'],
 ['wikitable', 'sortable', 'collapsible', 'collapsed', 'floatright'],
 ['wikitable'],
 ['metadata', 'mbox-small'],
 ['nowraplinks', 'mw-collapsible', 'mw-collapsed', 'navbox-inner'],
 ['nowraplinks', 'navbox-subgroup'],
 ['nowraplinks', 'mw-collapsible', 'mw-collapsed', 'navbox-inner'],
 ['nowraplinks', 'mw-collapsible', 'autocollapse', 'navbox-inner'],
 ['nowraplinks', 'mw-collapsible', 'mw-collapsed', 'navbox-inner'],
 ['nowraplinks', 'mw-collapsible', 'autocollapse', 'navbox-inner'],
 ['nowraplinks', 'mw-collapsible', 'autocollapse', 'navbox-inner'],
 ['nowraplinks', 'mw-collapsible', 'autocollapse', 'navbox-inner'],
 ['nowraplinks', 'mw-collapsible', 'autocollapse', 'navbox-inner'],
 ['nowraplinks', 'mw-collapsible', 'autocollapse', 'navbox-inner'],
 ['nowraplinks', 'mw-collapsible', 'autocollapse', 'navbox-inner'],
 ['nowraplinks', 'hlist', 'mw-collapsible', 

As mentioned, we will be using the Demographics table. Three tables carry the class `wikitable`, but the Demographics table is the only one whose class list is exactly `wikitable`; the other two have additional classes on them. This is why `find_all` below returns 3 results.

In [13]:
tables_wikitable = soup.find_all("table", "wikitable")

In [14]:
len(tables_wikitable)

3

Below we use a **matching** lambda function to find the table whose only class is `wikitable`. Note that we compare against a list with just `wikitable` in it. That ensures it's the only class.

In [15]:
dfinder = lambda tag: tag.name=='table' and tag.get('class') == ['wikitable']
table_demographics = soup.find_all(dfinder)
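
The contrast between matching by class and matching with a function can be sketched on a small, invented snippet (the three tables and the names `demo_html`, `demo_soup`, `has_only_wikitable`, `by_class`, and `exact` are assumptions made up for illustration):

```python
from bs4 import BeautifulSoup

# Three invented tables: two carry the class "wikitable", one of them exclusively.
demo_html = """
<table class="wikitable sortable"><tr><td>rankings</td></tr></table>
<table class="wikitable"><tr><td>demographics</td></tr></table>
<table class="infobox"><tr><td>summary</td></tr></table>
"""
demo_soup = BeautifulSoup(demo_html, "html.parser")


def has_only_wikitable(tag):
    """Return True for <table> tags whose class list is exactly ['wikitable']."""
    return tag.name == "table" and tag.get("class") == ["wikitable"]


by_class = demo_soup.find_all("table", "wikitable")  # any table carrying the class
exact = demo_soup.find_all(has_only_wikitable)       # only the exact match

print(len(by_class), len(exact))
print(exact[0].get_text())
```

A named function does the same job as the lambda; it is used here only because it leaves room for a docstring.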

By contrast a simple find would give us just the first match. The below would be a great way to do things if we were guaranteed uniqueness. But since we are not, we use the full power of passing in a matching function.

In [16]:
soup.find("table", "wikitable")

<table class="wikitable sortable collapsible collapsed floatright">
<tbody><tr>
<th colspan="4" style="background-color:#A51C30;color:white;box-shadow: inset 2px 2px 0 #1E1E1E, inset -2px -2px 0 #1E1E1E;">National Graduate Rankings<sup class="reference" id="cite_ref-94"><a href="#cite_note-94">[94]</a></sup>
</th></tr>
<tr>
<th>Program
</th>
<th>Ranking
</th></tr>
<tr>
<td>Biological Sciences</td>
<td>4
</td></tr>
<tr>
<td>Business</td>
<td>6
</td></tr>
<tr>
<td>Chemistry</td>
<td>2
</td></tr>
<tr>
<td>Clinical Psychology</td>
<td>10
</td></tr>
<tr>
<td>Computer Science</td>
<td>16
</td></tr>
<tr>
<td>Earth Sciences</td>
<td>8
</td></tr>
<tr>
<td>Economics</td>
<td>1
</td></tr>
<tr>
<td>Education</td>
<td>1
</td></tr>
<tr>
<td>Engineering</td>
<td>22
</td></tr>
<tr>
<td>English</td>
<td>8
</td></tr>
<tr>
<td>History</td>
<td>4
</td></tr>
<tr>
<td>Law</td>
<td>3
</td></tr>
<tr>
<td>Mathematics</td>
<td>2
</td></tr>
<tr>
<td>Medicine: Primary Care</td>
<td>10
</td></tr>
<tr>
<td>Medicine

Since we used `find_all` we get back a list:

In [24]:
from IPython.display import HTML
HTML(str(table_demographics[0]))

| | Undergrad | Grad/prof |
|---|---|---|
| Asian | 21% | 13% |
| Black | 9% | 5% |
| Hispanic or Latino | 11% | 7% |
| White | 37% | 38% |
| Two or more races | 8% | 3% |
| International | 12% | 32% |


First we'll use a list comprehension to extract the row (*tr*) elements.

In [25]:
rows = [row for row in table_demographics[0].find_all("tr")]
rows

[<tr>
 <th></th>
 <th>Undergrad</th>
 <th>Grad/prof
 </th></tr>,
 <tr>
 <th>Asian
 </th>
 <td>21%</td>
 <td>13%
 </td></tr>,
 <tr>
 <th>Black
 </th>
 <td>9%</td>
 <td>5%
 </td></tr>,
 <tr>
 <th>Hispanic or Latino
 </th>
 <td>11%</td>
 <td>7%
 </td></tr>,
 <tr>
 <th>White
 </th>
 <td>37%</td>
 <td>38%
 </td></tr>,
 <tr>
 <th>Two or more races
 </th>
 <td>8%</td>
 <td>3%
 </td></tr>,
 <tr>
 <th>International
 </th>
 <td>12%</td>
 <td>32%
 </td></tr>]

In [26]:
header_row = rows[0]
header_row

<tr>
<th></th>
<th>Undergrad</th>
<th>Grad/prof
</th></tr>

### Splitting the data

Next we extract the text value of the columns. If you look at the table above, you'll see that we have two columns and six rows.

Here we're taking the first element (Python indexes start at zero), iterating over the *th* elements inside it, and taking the text value of those elements. We should end up with a list of column names.

But there is one little caveat: the first header cell of the table is actually empty (look at the cell right above the row names). We could add it to our list and then remove it afterwards; but instead we will use the `if` statement inside the list comprehension to filter it out.

Here `get_text` returns an empty string for the first cell of the table. An empty string is falsy, so the test fails and the value is not added to the list.

In [27]:
#the if col.get_text() takes care of no-text in the upper left
columns = [col.get_text() for col in header_row.find_all("th") if col.get_text()]
columns

['Undergrad', 'Grad/prof\n']

In [28]:
# A lambda expression returns the value of the expression inside it.
# This one returns a string with newline characters replaced by spaces.
# (The next cell uses str.strip instead, which removes the trailing newline entirely.)
rem_nl = lambda s: s.replace("\n", " ")

In [29]:
columns = [c.strip() for c in columns]
columns

['Undergrad', 'Grad/prof']
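
As an alternative sketch, `get_text` accepts a `strip=True` argument that trims whitespace while extracting, so filtering and cleaning can happen in one pass. The header markup below is retyped by hand to mirror the row shown above, and the names `header_html`, `demo_row`, and `demo_columns` are illustrative assumptions.

```python
from bs4 import BeautifulSoup

# Hand-typed copy of the header row, including the trailing newline in the last cell.
header_html = "<table><tr>\n<th></th>\n<th>Undergrad</th>\n<th>Grad/prof\n</th></tr></table>"
demo_row = BeautifulSoup(header_html, "html.parser").find("tr")

# strip=True trims surrounding whitespace; empty cells come back as "" and are filtered out.
demo_columns = [
    th.get_text(strip=True)
    for th in demo_row.find_all("th")
    if th.get_text(strip=True)
]
print(demo_columns)
```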

Now let's do the same for the rows. Notice that since we have already parsed the header row, we will continue from the second row. The `[1:]` is a slice notation and in this case it means we want all values starting from the second position.

In [30]:
indexes = [row.find("th").get_text().strip() for row in rows[1:]]
indexes

['Asian',
 'Black',
 'Hispanic or Latino',
 'White',
 'Two or more races',
 'International']

We need to transform the strings in the "data" cells to integers. We start by checking whether the last character of the string (Python allows negative indexes) is a percent sign. If it is, we convert the characters before the sign to an integer. Otherwise, we return `None`.

In [31]:
def to_num(s):
    # Guard against empty strings before indexing the last character.
    if s and s[-1] == "%":
        return int(s[:-1])
    else:
        return None
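
Scraped cells are often messier than the ones in this table. The following is a more defensive sketch of the same idea, not a replacement for `to_num`; the name `parse_percent` and the sample inputs are hypothetical.

```python
def parse_percent(text):
    """Return the integer in a string such as '21%', or None if it has another shape."""
    cleaned = text.strip()             # tolerate surrounding whitespace and newlines
    if not cleaned.endswith("%"):      # also covers the empty string
        return None
    try:
        return int(cleaned[:-1])
    except ValueError:                 # e.g. "%" alone or "n/a%"
        return None


print(parse_percent("13%\n"))
print(parse_percent(""))
print(parse_percent("n/a"))
```

Returning `None` for anything unexpected keeps bad cells visible as missing values later on instead of stopping the whole scrape.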

As always, we try some stuff and discover we need to strip newlines:

In [32]:
for value in rows[1].find_all("td"):
    print(value.get_text())  # see weird newline below

21%
13%



In [33]:
values = []
for row in rows[1:]:
    for value in row.find_all("td"):
        values.append(to_num(value.get_text().strip()))
values

[21, 13, 9, 5, 11, 7, 37, 38, 8, 3, 12, 32]

The problem with the list above is that the values lost their grouping.

The `zip` function combines two sequences element-wise. So `list(zip([1,2,3], [4,5,6]))` would return `[(1, 4), (2, 5), (3, 6)]`.

Here we create 2 lists corresponding to the 2 columns by taking every second value, starting at position 0 for the first column and at position 1 for the second.

In [34]:
stacked_values_lists = [values[i::2] for i in range(len(columns))]
stacked_values_lists

[[21, 9, 11, 37, 8, 12], [13, 5, 7, 38, 3, 32]]
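
The stride slicing used above can be sketched in isolation. The names `flat`, `n_cols`, `by_column`, and `by_row` are illustrative, and `flat` holds just the first six values from the list above.

```python
flat = [21, 13, 9, 5, 11, 7]   # values read row by row, two per row
n_cols = 2

# Every n_cols-th value starting at offset i gives column i.
by_column = [flat[i::n_cols] for i in range(n_cols)]

# Consecutive chunks of n_cols values give the rows instead.
by_row = [flat[i:i + n_cols] for i in range(0, len(flat), n_cols)]

print(by_column)
print(by_row)
```

Writing the step as `n_cols` rather than a literal `2` is a small design choice that keeps the slicing correct if the table ever gains a column.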

We then use `zip`. Notice the use of the `*` in front: that unpacks the list of lists into separate arguments to `zip`.

In [35]:
def print_them(a, b, c):
    print("a", a, "b", b, "c", c)
print_them(1, 2, 3)

a 1 b 2 c 3


In [36]:
print_them(*[1, 2, 3])

a 1 b 2 c 3


In [37]:
stacked_values=zip(*stacked_values_lists)
list(stacked_values)

[(21, 13), (9, 5), (11, 7), (37, 38), (8, 3), (12, 32)]

In [38]:
# Here's the original HTML table for visual understanding
HTML(str(table_demographics[0]))

| | Undergrad | Grad/prof |
|---|---|---|
| Asian | 21% | 13% |
| Black | 9% | 5% |
| Hispanic or Latino | 11% | 7% |
| White | 37% | 38% |
| Two or more races | 8% | 3% |
| International | 12% | 32% |


---

## Putting things into Pandas

### Dataframes

To recap, we now have three data structures holding our column names, our row (index) names, and our values grouped by index.

We will now load this data into a Pandas Dataframe. The loading process is pretty straightforward, and all we need to do is tell Pandas which container goes where.


In [39]:
import pandas as pd

In [40]:
list(stacked_values)

[]

Wait! What happened?

Remember that `stacked_values` was a zip object. We ran `list(stacked_values)` to print it, but this had an unfortunate side effect: it **exhausted the iterator** by iterating over the zip, so nothing was left. We'll need to redefine the zip first, and we'll give it a better name.

In [41]:
stacked_values_iterator = zip(*stacked_values_lists)

Labeling variables like this follows the philosophy of [Hungarian Notation](https://en.wikipedia.org/wiki/Hungarian_notation). Use it sparingly, when it's critical to the understanding of your code, like here.
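
The exhaustion described above is easy to reproduce in isolation. Here is a minimal sketch with made-up lists (`pairs`, `first_pass`, `second_pass`, and `pairs_list` are illustrative names, not part of the notebook):

```python
pairs = zip([1, 2, 3], [4, 5, 6])

first_pass = list(pairs)    # consumes the iterator
second_pass = list(pairs)   # nothing is left to yield

print(first_pass)
print(second_pass)

# Materializing the zip once keeps the data reusable.
pairs_list = list(zip([1, 2, 3], [4, 5, 6]))
print(pairs_list == first_pass)
```

Storing the result of `list(...)` is the simplest way around the problem when the data is small enough to hold in memory.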

In [42]:
df = pd.DataFrame(list(stacked_values_iterator), columns=columns, index=indexes)
df

| | Undergrad | Grad/prof |
|---|---|---|
| Asian | 21 | 13 |
| Black | 9 | 5 |
| Hispanic or Latino | 11 | 7 |
| White | 37 | 38 |
| Two or more races | 8 | 3 |
| International | 12 | 32 |


---

#### Other ways to create the Dataframe

That was one of many ways to construct a dataframe. Here is another that uses a list of dictionaries:

First we combine the list and dictionary comprehensions to get a list of dictionaries representing each row in the data.

In [43]:
stacked_values_iterator = zip(*stacked_values_lists)
data_dicts = [{col: val for col, val in zip(columns, col_values)} for col_values in stacked_values_iterator]
data_dicts

[{'Undergrad': 21, 'Grad/prof': 13},
 {'Undergrad': 9, 'Grad/prof': 5},
 {'Undergrad': 11, 'Grad/prof': 7},
 {'Undergrad': 37, 'Grad/prof': 38},
 {'Undergrad': 8, 'Grad/prof': 3},
 {'Undergrad': 12, 'Grad/prof': 32}]

In [44]:
pd.DataFrame(data_dicts, index=indexes)

| | Undergrad | Grad/prof |
|---|---|---|
| Asian | 21 | 13 |
| Black | 9 | 5 |
| Hispanic or Latino | 11 | 7 |
| White | 37 | 38 |
| Two or more races | 8 | 3 |
| International | 12 | 32 |
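
As a side note, pairing column names with row values does not strictly need a dictionary comprehension: `dict` accepts the pairs produced by `zip` directly. A minimal sketch, using the first two rows from above and the illustrative names `demo_columns`, `demo_rows`, and `demo_records`:

```python
demo_columns = ["Undergrad", "Grad/prof"]
demo_rows = [(21, 13), (9, 5)]   # first two rows of the table

# dict(zip(keys, values)) builds one record per row.
demo_records = [dict(zip(demo_columns, row)) for row in demo_rows]
print(demo_records)
```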


And yet another that uses a dictionary of lists:

To achieve this we group the values columnwise...

In [45]:
stacked_by_col = [values[i::2] for i in range(len(columns))]
stacked_by_col

[[21, 9, 11, 37, 8, 12], [13, 5, 7, 38, 3, 32]]

and then reverse the pattern we used to create a list of dictionaries.

In [46]:
data_lists = {col: val for col, val in zip(columns, stacked_by_col)}
data_lists

{'Undergrad': [21, 9, 11, 37, 8, 12], 'Grad/prof': [13, 5, 7, 38, 3, 32]}

In [47]:
pd.DataFrame(data_lists, index=indexes)

| | Undergrad | Grad/prof |
|---|---|---|
| Asian | 21 | 13 |
| Black | 9 | 5 |
| Hispanic or Latino | 11 | 7 |
| White | 37 | 38 |
| Two or more races | 8 | 3 |
| International | 12 | 32 |
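
The steps of this walkthrough can be collected into a single helper. The following is a rough sketch under stated assumptions: the table has one header row, a `th` label at the start of every data row, and percentage values in the `td` cells. The function name `percent_table_to_frame` and the two-row sample markup are invented for illustration; the sample values are copied from the first two rows of the table above.

```python
import pandas as pd
from bs4 import BeautifulSoup


def percent_table_to_frame(table):
    """Turn a <table> of percentage cells into a DataFrame of integers."""
    rows = table.find_all("tr")
    # Header cells, skipping the empty one in the upper-left corner.
    columns = [
        th.get_text(strip=True)
        for th in rows[0].find_all("th")
        if th.get_text(strip=True)
    ]
    index, data = [], []
    for row in rows[1:]:
        index.append(row.find("th").get_text(strip=True))
        cells = [td.get_text(strip=True) for td in row.find_all("td")]
        # Cells that do not end in "%" become None, as in to_num.
        data.append([int(c[:-1]) if c.endswith("%") else None for c in cells])
    return pd.DataFrame(data, columns=columns, index=index)


# Hand-typed sample with the same layout as the Demographics table.
sample_html = """
<table class="wikitable">
<tr><th></th><th>Undergrad</th><th>Grad/prof</th></tr>
<tr><th>Asian</th><td>21%</td><td>13%</td></tr>
<tr><th>Black</th><td>9%</td><td>5%</td></tr>
</table>
"""
sample_table = BeautifulSoup(sample_html, "html.parser").find("table")
demo_df = percent_table_to_frame(sample_table)
print(demo_df)
```

Building the rows directly, instead of flattening the values and regrouping them, avoids the slicing and `zip` steps, at the cost of hiding the intermediate structures this tutorial set out to show.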
