##  Part 4: Beautiful Soup 
Data Engineering, the process of gathering and preparing data for analysis, is a very big part of Data Science.

Datasets might not be formatted in the way you need (e.g. you have categorical features but your algorithm requires numerical features); or you might need to cross-reference some dataset to another that has a different format; or you might be dealing with a dataset that contains missing or invalid data.

These are just a few examples of why data retrieval and cleaning are so important.

---

### `requests`:  Retrieving Data from the Web

In HW1, you will be asked to retrieve some data from the Internet. Over the years, several `Python` libraries have been developed to do exactly that (e.g. the built-in `urllib`, its Python 2 sibling `urllib2`, and the third-party `urllib3`).

However, these libraries are very low-level and somewhat hard to use. They become especially cumbersome when you need to issue POST requests or authenticate against a web service.

Luckily, as with most tasks in `Python`, someone has developed a library that simplifies these tasks. In reality, the requests made both in this lab and in HW1 are fairly simple, and could easily be done using one of the built-in libraries. However, it is better to get acquainted with `requests` as soon as possible, since you will probably need it in the future.

In [None]:
# You tell Python that you want to use a library with the import statement.
import requests

Now that the `requests` library has been imported into our namespace, we can use the functions it offers.

In this case we'll use the appropriately named `get` function to issue a *GET* request. This is equivalent to typing a URL into your browser and hitting enter.

In [None]:
# Get the HU Wikipedia page
req = requests.get("https://en.wikipedia.org/wiki/Harvard_University")
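
Real requests can fail (a typo in the URL, a server error, a slow connection), so it can help to check the result before using it. The sketch below wraps the call in a hypothetical helper, `fetch_html`, which is not part of the lab; the 10-second timeout is an arbitrary assumption.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the body of `url` as text, or raise if the request failed.

    `fetch_html` and the default timeout are illustrative choices,
    not part of the lab or of the requests library.
    """
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx/5xx responses
    response.raise_for_status()
    return response.text


# Example call (needs network access):
# html_text = fetch_html("https://en.wikipedia.org/wiki/Harvard_University")
```

The rest of this lab keeps using the plain `requests.get` call, which is enough for a page that is known to exist.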

Python is an object-oriented language, and everything in it is an object. Even built-in functions such as `len` are just syntactic sugar for calling special methods on objects (`len(x)` ends up calling `x.__len__()`).

We will not dwell too long on OO concepts, but some of Python's idiosyncrasies will be easier to understand if we spend a few minutes on this subject.
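
As a small illustration of the point about `len`, the sketch below defines a hypothetical `Playlist` class (not part of the lab) with its own `__len__` method and shows that the built-in simply delegates to it.

```python
class Playlist:
    """A hypothetical container used only to illustrate __len__."""

    def __init__(self, songs):
        self.songs = list(songs)

    def __len__(self):
        # len(playlist) ends up calling this method
        return len(self.songs)


playlist = Playlist(["intro", "verse", "outro"])
print(len(playlist))                     # 3
print(playlist.__len__())                # 3, the same call spelled out
print(len("soup") == "soup".__len__())   # True
```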

When you evaluate an object itself, such as the `req` object we created above, Python will automatically call the `__str__()` or `__repr__()` method of that object. The default values for these methods are usually very simple and boring. The `req` object however has a custom implementation that shows the object type (i.e. `Response`) and the HTTP status code (200 means the request was successful).

Just to confirm, we will call the `type` function on the object to make sure it agrees with the value above.

In [None]:
type(req)

requests.models.Response
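
To make the point about `__repr__` concrete, here is a minimal sketch with two hypothetical classes, `Plain` and `Described`. The first relies on the default representation; the second defines its own, in the spirit of what the `Response` object does.

```python
class Plain:
    pass


class Described:
    def __init__(self, status):
        self.status = status

    def __repr__(self):
        # Show the type name and a status value, similar in spirit to `req`
        return f"<Described [{self.status}]>"


print(repr(Plain()))         # default: class name plus a memory address
print(repr(Described(200)))  # <Described [200]>
```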

Another very nifty Python function is `dir`. You can use it to list all the attributes (properties and methods) of an object.

By the way, attributes starting with a single or double underscore are usually not meant to be called directly.

In [None]:
dir(req)

['__attrs__',
 '__bool__',
 '__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__enter__',
 '__eq__',
 '__exit__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__nonzero__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__setstate__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_content',
 '_content_consumed',
 '_next',
 'apparent_encoding',
 'close',
 'connection',
 'content',
 'cookies',
 'elapsed',
 'encoding',
 'headers',
 'history',
 'is_permanent_redirect',
 'is_redirect',
 'iter_content',
 'iter_lines',
 'json',
 'links',
 'next',
 'ok',
 'raise_for_status',
 'raw',
 'reason',
 'request',
 'status_code',
 'text',
 'url']
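
Since the underscore-prefixed names are rarely what you are looking for, it can be handy to filter them out. The helper below, `public_names`, is a hypothetical convenience function and not something the lab requires.

```python
def public_names(obj):
    """List attribute names of `obj` that do not start with an underscore."""
    return [name for name in dir(obj) if not name.startswith("_")]


# Works on any object; a complex number keeps the output short.
print(public_names(3 + 4j))  # ['conjugate', 'imag', 'real']
```

Calling `public_names(req)` would return only the entries from `'apparent_encoding'` onwards in the list above.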

Right now `req` holds a reference to a *Response* object; but we are interested in the text associated with the web page, not the object itself.

So the next step is to assign the value of the `text` property of this `Response` object to a variable.

In [None]:
page = req.text
page[20000:30000]

'span></span></span><span class="geo-multi-punct">&#xfeff; / &#xfeff;</span><span class="geo-nondefault"><span class="geo-dec" title="Maps, aerial photos, and other data for this location">42.37444°N 71.11694°W</span><span style="display:none">&#xfeff; / <span class="geo">42.37444; -71.11694</span></span></span></a></span></span></span></td></tr><tr><th scope="row" class="infobox-label" style="padding-right:0.65em;">Campus</th><td class="infobox-data">Urban, 209 acres (85&#160;ha)</td></tr><tr><th scope="row" class="infobox-label" style="padding-right:0.65em;">Language</th><td class="infobox-data">Mostly English</td></tr><tr><th scope="row" class="infobox-label" style="padding-right:0.65em;">Newspaper</th><td class="infobox-data"><i><a href="/wiki/The_Harvard_Crimson" title="The Harvard Crimson">The Harvard Crimson</a></i></td></tr><tr><th scope="row" class="infobox-label" style="padding-right:0.65em;"><a href="/wiki/School_colors" title="School colors">Colors</a></th><td class="infobo

Great! Now we have the text of the Harvard University Wikipedia page. But this mess of HTML tags would be a pain to parse manually, which is why we will use another very cool Python library called `BeautifulSoup`.

### `BeautifulSoup`

Parsing data would be a breeze if we could always use well formatted data sources, such as CSV, JSON, or XML; but some formats such as HTML are at the same time very popular and a pain to parse.

One of the problems with HTML is that over the years browsers have evolved to be very forgiving of "malformed" syntax. Your browser is smart enough to detect some common problems, such as unclosed tags, and correct them on the fly.

Unfortunately, we do not have the time or patience to implement all the different corner cases, so we'll let BeautifulSoup do that for us.

You'll notice that the `import` statement below is different from what we used for `requests`. The _from library import thing_ pattern is useful when you don't want to reference a function by its full name (like we did with `requests.get`), but you also don't want to import every single thing in that library into your namespace.

In [None]:
from bs4 import BeautifulSoup
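
The two import styles can be compared side by side with a standard-library module. This is only a sketch; `math` is used here because it needs no installation, not because the lab depends on it.

```python
# Style 1: import the module and use its full, qualified name.
import math
print(math.sqrt(16))      # 4.0

# Style 2: import one name directly into the current namespace.
from math import sqrt
print(sqrt(16))           # 4.0

# Both names refer to the very same function object.
print(math.sqrt is sqrt)  # True
```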

`BeautifulSoup` can deal with `HTML` or `XML` data, so the next line parses the contents of the `page` variable using its `HTML` parser, and assigns the result of that to the `soup` variable.

In [None]:
soup = BeautifulSoup(page, 'html.parser')

In [None]:
type(soup)

bs4.BeautifulSoup

When displayed in a notebook, `soup` doesn't look much different from the `page` string. Let's make sure the two are different types.

In [None]:
type(page)

str

Looks like they are indeed different.

`BeautifulSoup` objects have a cool little method that allows you to see the `HTML` content in a nice, indented way.

In [None]:
print(soup.prettify()[:1000])

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   Harvard University - Wikipedia
  </title>
  <script>
   document.documentElement.className="client-js";RLCONF={"wgBreakFrames":false,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"27578fa5-4af7-4136-be3f-3840fc4d7e4c","wgCSPNonce":false,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"Harvard_University","wgTitle":"Harvard University","wgCurRevisionId":1074567236,"wgRevisionId":1074567236,"wgArticleId":18426501,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["CS1 maint: location","Webarchive template wayback links","CS1: Julian–Gregorian uncertainty","CS1 errors: generic name"

Looks like it's our page!

We can now reference elements of the `HTML` document in different ways. One very convenient way is by using the dot notation, which allows us to access the elements as if they were properties of the object.

In [None]:
soup.title

<title>Harvard University - Wikipedia</title>

This is nice for `HTML` elements that only appear once per page, such as the `title` tag. But what about elements that can appear multiple times?

In [None]:
# Be careful with elements that show up multiple times.
soup.p

<p class="mw-empty-elt">
</p>

Uh oh. `soup.p` returned only the first `p` element on the page, which happens to be empty. The attribute syntax in `BeautifulSoup` is what is called *syntactic sugar*: it is shorthand for `find`, which stops at the first match. That's why it is safer to use the explicit methods behind that syntactic sugar. These are:
* `BeautifulSoup.find` for getting single elements, and
* `BeautifulSoup.find_all` for retrieving multiple elements.

In [None]:
len(soup.find_all("p"))

107
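
The difference between the attribute syntax, `find`, and `find_all` is easier to see on a tiny document. The HTML snippet below is made up for illustration and is not taken from the Wikipedia page.

```python
from bs4 import BeautifulSoup

# A made-up snippet whose first paragraph is empty, like the page above.
html = "<div><p></p><p>Second</p><p>Third</p></div>"
doc = BeautifulSoup(html, "html.parser")

first = doc.p                # attribute syntax: only the first <p>
also_first = doc.find("p")   # the explicit equivalent
every_p = doc.find_all("p")  # all matches, in document order

print(first == also_first)              # True
print(len(every_p))                     # 3
print([p.get_text() for p in every_p])  # ['', 'Second', 'Third']
```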

If you look at the Wikipedia page in your browser, you'll notice that it has a couple of tables in it. We will be working with the "Demographics" table, but first we need to find it.

One of the `HTML` attributes that will be very useful to us is the `class` attribute.

Getting the class of a single element is easy!

In [None]:
soup.table["class"]

['box-Merge_from', 'plainlinks', 'metadata', 'ambox', 'ambox-move']

Next we will use a *list comprehension* to see all the tables that have a `class` attribute. 

In [None]:
# the classes of all tables that have a class attribute set on them
[t["class"] for t in soup.find_all("table") if t.get("class")]

[['box-Merge_from', 'plainlinks', 'metadata', 'ambox', 'ambox-move'],
 ['infobox', 'vcard'],
 ['toccolours'],
 ['infobox'],
 ['wikitable', 'sortable', 'collapsible', 'collapsed', 'floatright'],
 ['wikitable', 'sortable', 'collapsible', 'collapsed', 'floatright'],
 ['wikitable', 'sortable'],
 ['wikitable'],
 ['metadata', 'mbox-small'],
 ['nowraplinks', 'mw-collapsible', 'mw-collapsed', 'navbox-inner'],
 ['nowraplinks', 'navbox-subgroup'],
 ['nowraplinks', 'mw-collapsible', 'mw-collapsed', 'navbox-inner'],
 ['nowraplinks', 'mw-collapsible', 'autocollapse', 'navbox-inner'],
 ['nowraplinks', 'mw-collapsible', 'mw-collapsed', 'navbox-inner'],
 ['nowraplinks', 'mw-collapsible', 'autocollapse', 'navbox-inner'],
 ['nowraplinks', 'mw-collapsible', 'autocollapse', 'navbox-inner'],
 ['nowraplinks', 'mw-collapsible', 'autocollapse', 'navbox-inner'],
 ['nowraplinks', 'mw-collapsible', 'autocollapse', 'navbox-inner'],
 ['nowraplinks', 'mw-collapsible', 'autocollapse', 'navbox-inner'],
 ['nowraplinks

As already mentioned, we will be using the Demographics table for this lab. The next cell contains the `HTML` elements of said table. We will render it in different parts of the notebook to make the parsing steps easier to follow.

In [None]:
table_demographics = soup.find_all("table", "wikitable")[3]
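
In the call above, the second positional argument (`"wikitable"`) is treated as a CSS class to match. The positional index `[3]`, however, depends on how many `wikitable` tables precede the one we want, which may change as the article is edited. A sketch of one alternative, assuming the table can be recognised by the text of one of its header cells; `find_table_with_header` is a hypothetical helper, and the HTML below is made up for illustration.

```python
from bs4 import BeautifulSoup


def find_table_with_header(doc, header_text, css_class="wikitable"):
    """Return the first table of `css_class` with a <th> equal to `header_text`."""
    for table in doc.find_all("table", class_=css_class):
        headers = [th.get_text(strip=True) for th in table.find_all("th")]
        if header_text in headers:
            return table
    return None  # nothing matched


# Made-up document with two tables; only the second has the header we want.
html = """
<table class="wikitable"><tr><th>Year</th><th>Rank</th></tr></table>
<table class="wikitable sortable">
  <tr><th></th><th>Undergrad</th><th>Grad/prof</th></tr>
  <tr><th>Asian</th><td>21%</td><td>13%</td></tr>
</table>
"""
doc = BeautifulSoup(html, "html.parser")
table = find_table_with_header(doc, "Undergrad")
print(table["class"])  # ['wikitable', 'sortable']
```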

In [None]:
# HTML renders a string of markup as rich output in the notebook
from IPython.display import HTML
HTML(str(table_demographics))

|  | Undergrad | Grad/prof |
|---|---|---|
| Asian | 21% | 13% |
| Black | 9% | 5% |
| Hispanic or Latino | 11% | 7% |
| White | 37% | 38% |
| Two or more races | 8% | 3% |
| International | 12% | 32% |


First we'll use a list comprehension to extract the row (*tr*) elements.

In [None]:
rows = [row for row in table_demographics.find_all("tr")]
print(rows)

[<tr>
<th></th>
<th>Undergrad</th>
<th>Grad/prof
</th></tr>, <tr>
<th>Asian
</th>
<td>21%</td>
<td>13%
</td></tr>, <tr>
<th>Black
</th>
<td>9%</td>
<td>5%
</td></tr>, <tr>
<th>Hispanic or Latino
</th>
<td>11%</td>
<td>7%
</td></tr>, <tr>
<th>White
</th>
<td>37%</td>
<td>38%
</td></tr>, <tr>
<th>Two or more races
</th>
<td>8%</td>
<td>3%
</td></tr>, <tr>
<th>International
</th>
<td>12%</td>
<td>32%
</td></tr>]


In [None]:
header_row = rows[0]
HTML(str(header_row))

We will then use a `lambda` expression to replace newline characters with spaces. `lambda` expressions are to functions what list comprehensions are to loops: namely a more concise way to achieve the same thing.

In reality, both lambda expressions and list comprehensions are a little different from their function and loop counterparts. But for the purposes of this class we can ignore those differences.

In [None]:
# A lambda expression returns the value of the expression inside it.
# In this case, it will return a string with newline characters replaced by spaces.
rem_nl = lambda s: s.replace("\n", " ")
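
For comparison, here is the same helper written as a regular function. The name `rem_nl_def` is made up for this sketch; the two behave identically on the sample string.

```python
rem_nl = lambda s: s.replace("\n", " ")


def rem_nl_def(s):
    """Same behaviour as rem_nl, written as a regular function."""
    return s.replace("\n", " ")


sample = "Grad/prof\n"
print(repr(rem_nl(sample)))                  # 'Grad/prof '
print(rem_nl(sample) == rem_nl_def(sample))  # True
```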

#### Splitting the data
Next we extract the text value of the columns. If you look at the table above, you'll see that we have three columns and six rows.

Here we're doing the following:
* Taking the first row (`header_row`, i.e. `rows[0]`; `Python` indices start at zero)
* Iterating over the *th* elements inside it
* Taking the text value of those elements

We should end up with a list of column names.

But there is one little caveat: the first header cell of the table is actually empty (look at the cell right above the row names). We could add it to our list and then remove it afterwards; but instead we will use the `if` clause inside the list comprehension to filter it out.

In the following cell, `get_text` will return an empty string for the first cell of the table. An empty string is falsy, so the test will fail and the value will not be added to the list.

In [None]:
# the if col.get_text() takes care of no-text in the upper left
columns = [rem_nl(col.get_text()) for col in header_row.find_all("th") if col.get_text()]
columns

['Undergrad', 'Grad/prof ']
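
The filtering comprehension can also be written as an explicit loop, which may make the role of the `if` clause clearer. In this sketch the list `header_texts` simply copies the text of the three header cells from the row printed earlier, so it runs without the page.

```python
# Text of the three header cells, copied from the header row printed above.
header_texts = ["", "Undergrad", "Grad/prof\n"]

# Comprehension with a filter, mirroring the cell above.
columns_comp = [text.replace("\n", " ") for text in header_texts if text]

# The same logic written as an explicit loop.
columns_loop = []
for text in header_texts:
    if text:  # an empty string is falsy, so it is skipped
        columns_loop.append(text.replace("\n", " "))

print(columns_comp)                  # ['Undergrad', 'Grad/prof ']
print(columns_comp == columns_loop)  # True
```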

Now let's do the same for the rows. Notice that since we have already parsed the header row, we will continue from the second row.

In [None]:
indexes = [row.find("th").get_text() for row in rows[1:]]
indexes

['Asian\n',
 'Black\n',
 'Hispanic or Latino\n',
 'White\n',
 'Two or more races\n',
 'International\n']
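
The row labels above still carry a trailing newline. If clean labels are needed, `str.strip` removes surrounding whitespace; the sketch below works on a copy of the output above, stored under the made-up name `raw_labels`.

```python
# Row labels as returned above, trailing newlines included.
raw_labels = ["Asian\n", "Black\n", "Hispanic or Latino\n",
              "White\n", "Two or more races\n", "International\n"]

clean_labels = [label.strip() for label in raw_labels]
print(clean_labels)
# ['Asian', 'Black', 'Hispanic or Latino', 'White', 'Two or more races', 'International']
```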

Now we want to transform the strings in the cells to integers.  To do this, we follow a very common `Python` pattern:
1. Check if the last character of the string is a percent sign
2. If it is, then convert the characters before the percent sign to an integer
3. Otherwise, return a value of `None`

These steps can be conveniently packaged into a function using `if-else` statements.

In [None]:
def to_num(s):
    if s[-1] == "%":
        return int(s[:-1])
    else:
        return None

Notice that `Python` slices exclude the upper bound. So the `[:-1]` construct will return all elements of the string, except for the last.
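
A few slices of a sample string show the excluded upper bound in action (a minimal sketch; the string `"21%"` is just an example value).

```python
s = "21%"
print(s[:-1])   # 21  (everything except the last character)
print(s[-1])    # %   (the last character)
print(s[0:2])   # 21  (index 2 is excluded)
print(s[:-1] + s[-1] == s)  # True
```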

Another nice way to write our `to_num` function would be
```python
def to_num(s):
    return int(s[:-1]) if s[-1] == "%" else None
```
Notice that we only had to write `return` one time and everything conveniently fits on one line.  I'll leave it up to you to decide if it's readable or not.

Now we use the `to_num` function in a list comprehension to parse the table values.

Notice that we have two `for ... in ...` in this list comprehension. That is perfectly valid and somewhat common.

Although there is no real limit to how many iterations you can perform at once, having more than two can be visually unpleasant, at which point either regular nested loops or saving intermediate comprehensions might be a better solution.
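
To see how two `for` clauses map onto nested loops, consider this sketch. The `grid` list is made up and stands in for the cells of each table row.

```python
# A made-up grid standing in for the cells of each table row.
grid = [["21%", "13%"], ["9%", "5%"]]

# Two `for` clauses read left to right, outer loop first.
flat = [cell for row in grid for cell in row]

# Equivalent nested loops.
flat_loop = []
for row in grid:
    for cell in row:
        flat_loop.append(cell)

print(flat)               # ['21%', '13%', '9%', '5%']
print(flat == flat_loop)  # True
```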

In [None]:
values = [to_num(value.get_text()) for row in rows[1:] for value in row.find_all("td")]
values

[21, None, 9, None, 11, None, 37, None, 8, None, 12, None]
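
The `None` entries above are not missing data. The `rows` output shows that the second cell in each row ends with a newline (e.g. `'13%\n'`), so its last character is a newline rather than a percent sign. One possible fix, sketched here with a hypothetical `to_num_stripped` helper, is to strip surrounding whitespace before checking; using `endswith` also avoids an `IndexError` on an empty string.

```python
def to_num_stripped(s):
    """Convert percentage text (possibly padded with whitespace) to an int."""
    s = s.strip()  # drop surrounding whitespace, including newlines
    if s.endswith("%"):
        return int(s[:-1])
    return None


# Cell texts for the first two rows, copied from the `rows` output above.
cells = ["21%", "13%\n", "9%", "5%\n"]
print([to_num_stripped(c) for c in cells])  # [21, 13, 9, 5]
print(to_num_stripped(""))                  # None
```

With this helper, the comprehension would read `[to_num_stripped(value.get_text()) for row in rows[1:] for value in row.find_all("td")]`.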