# Introduction to HTML Parsing with Python

The web has vast troves of data, but to use that data in a machine learning application, it must first be collected and parsed. This workshop aims to show you how to accomplish both of these feats. By the end of this notebook, you will have experience collecting raw data from web sources and parsing some values of interest from that collected data. Let's dive in!

## Browsing HTML with the Chrome Development Tools

To begin our work with HTML data, it will be helpful to use the Chrome browser's developer tools, so please [install Chrome](https://www.google.com/chrome/) if you don't have it installed already. Once you've installed the browser, let's visit the following url:

http://www.gutenberg.org/files/2701/2701-h/2701-h.htm

If you open that url, you should see Moby Dick displayed. As you may know, the content rendered on your screen is determined by the HTML markup of the page. Let's take a look at that HTML by using the Chrome developer tools. To open the developer tools, click the three dots in the upper-right corner of the browser window, then select "More Tools" -> "Developer Tools":

<img src='./images/dev-tools.png'>

Once the developer tools are open, click the tool in the upper left hand corner of the developer tools panel to activate the selector tool:

<img src='./images/elem-selector.png'>

With the selector tool activated, try hovering on various HTML elements on the page. You should see those elements highlight as your cursor moves over them. If you click an element, you should see the HTML content for that element displayed in the developer pane:

<img src='./images/selecting.gif'>

As we can see, the element selector makes it easy to identify the HTML that corresponds to the content in a particular area of a web page. 

<h3 style='color:green'>Reviewing the Chrome Developer Tools</h3>

To review element selecting with the Chrome Developer Tools, see if you can identify the HTML that our sample HTML page uses to render the link to "CHAPTER 7: The Chapel".

<br/>  
<details>
  <summary>Solution</summary>
  This link is rendered by the following HTML:
    
  ```  
  <p class="toc">
    <a href="#link2HCH0007"> CHAPTER 7. The Chapel. </a>
  </p>
  ```
 
</details>

## HTML Tags and Attributes

The bulk of our work with HTML parsing will require us to understand two of the fundamental building blocks of HTML data: tags and attributes. Let's examine each below.

### HTML Tags

Let's take a close look at the first line of HTML that's used to create the link to the "ETYMOLOGY" section of Moby Dick:

```
<p class="toc">
```

This line contains an example of an HTML tag and an HTML class. An **HTML tag** is declared by the content immediately following the less-than symbol (also referred to as an "opening angle brace"). In the case above, we would say that this HTML element is a "p tag", because the content immediately following the opening angle brace is the letter p. Likewise, the following elements have predictable names:

```
<a>    "a tag"
<b>    "b tag"
<body> "body tag"
```

There are [many different valid HTML tags](https://developer.mozilla.org/en-US/docs/Web/HTML/Element), and browsers periodically add support for new ones. For now, we just need to remember that the tag name immediately follows the opening angle brace.

### HTML Attributes

Let's return to the sample line of HTML we examined above:

```
<p class="toc">
```

The `class` value above is an example of an HTML attribute. An **HTML attribute** is a name and value pair inside an opening HTML tag that adds information to that tag or changes its behavior. The `class` attribute seen above is particularly useful for HTML parsing, which is why we call it out specifically. There are, however, several other common attributes, including the following:

```
<a href='https://dhlab.yale.edu'>    "href attribute"
<b id='workshops'>                   "id attribute"
<div style='font-size: 30px'>        "style attribute"
```

<h3 style='color: green'>Reviewing HTML Tags and Attributes</h3>

Let's review HTML tags and attributes with the following challenge. See if you can identify all of the tags and attributes in the following passage:

```
<p class="toc">
  <a href="#link2HCH0005"> CHAPTER 5. Breakfast. </a>
</p>
```

<details>
  <summary>Solution</summary>
  The passage above has the following tags and attributes:
    
  ```
  A `p` tag with a `class` attribute, and an `a` tag with an `href` attribute.
  ```
 
</details>
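
We can also check our reading of tags and attributes with a few lines of Python. The sketch below is one possible approach, using the `html.parser` module from Python's standard library to list each opening tag and its attributes in the passage from the challenge above. The `TagLister` class and the `snippet` variable are names invented for this example.

```python
from html.parser import HTMLParser


class TagLister(HTMLParser):
    """Record each opening tag together with its attributes."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        self.found.append((tag, dict(attrs)))


# the passage from the review challenge
snippet = '''
<p class="toc">
  <a href="#link2HCH0005"> CHAPTER 5. Breakfast. </a>
</p>
'''

lister = TagLister()
lister.feed(snippet)

# print each tag name alongside its attributes
for tag, attrs in lister.found:
    print(tag, attrs)
```

Each printed line pairs a tag name with a dictionary of its attributes, which mirrors the answer in the solution above: a `p` tag with a `class` attribute and an `a` tag with an `href` attribute.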

## Collecting HTML data with Python

Collecting HTML data with Python is generally straightforward, thanks in large part to the availability of great libraries that handle the details of web requests for us.

In this workshop, we'll use the `requests` library to collect HTML. If you've worked through our earlier [Introduction to APIs](https://github.com/YaleDHLab/lab-workshops/blob/master/apis/apis.ipynb) workshop, you already have some experience with the `requests` library. If not though, fear not, as fetching HTML data with `requests` is quite easy. To get started with the library, we first need to install it:

In [None]:
!pip install requests

After installing `requests`, we can use the library to fetch the HTML data at a url with the following syntax:

In [None]:
import requests

url = 'http://www.gutenberg.org/files/2701/2701-h/2701-h.htm'

response = requests.get(url)

html = response.text

Next we can print the HTML we fetched to investigate it:

In [None]:
print(html)

That's all it takes to fetch some HTML data with Python!
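
Web requests can fail: a server may be down, a url may be mistyped, or a page may no longer exist. Below is a minimal sketch of one way to handle those cases, assuming we are happy to get `None` back when something goes wrong. The `fetch_html` helper is a hypothetical name introduced for this example and is not part of `requests`; the `timeout` argument, `raise_for_status()`, and `requests.exceptions.RequestException` are provided by the library.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the HTML at `url`, or None if the request fails.

    `fetch_html` is a hypothetical helper for this sketch, not part of requests.
    """
    try:
        # stop waiting if the server is silent for more than `timeout` seconds
        response = requests.get(url, timeout=timeout)
        # raise an error for 4xx and 5xx status codes
        response.raise_for_status()
    except requests.exceptions.RequestException as error:
        print('request failed:', error)
        return None
    return response.text


# a malformed url (no "http://") exercises the failure branch without touching the network
result = fetch_html('not-a-url')
print(result)
```

With a working connection, calling `fetch_html('http://www.gutenberg.org/files/2701/2701-h/2701-h.htm')` should return the same HTML we fetched above.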

<h3 style='color:green'>Reviewing HTML Collection with Python</h3>

To practice collecting HTML with Python, see if you can fetch the HTML at the following url:

http://www.gutenberg.org/files/1342/1342-h/1342-h.htm

(Hint: a good strategy might be to copy and paste [and then adjust] the code above to solve this challenge!)

In [None]:
# type your code here

<details>
  <summary>Solution</summary>
  We can fetch the html at that url with the following code:
    
  ```
  # import the library we will use
  import requests
  
  # specify the url where the data we want to fetch lives
  url = 'http://www.gutenberg.org/files/1342/1342-h/1342-h.htm'
  
  # get the data at the requested url
  response = requests.get(url)

  # get the HTML from the response
  html = response.text
  ```
</details>

## Parsing HTML data with `BeautifulSoup`

After fetching some HTML data, the next thing we'll want to do is to "parse" that HTML to extract the subset of the data that's of interest. 

In what follows, we'll use the BeautifulSoup library to parse HTML. To get started with BeautifulSoup, let's install it with the following command:

In [None]:
!pip install beautifulsoup4

### Converting HTML to Plaintext

The `html` variable we defined above contains all sorts of "markup" (the catch-all term for tags, attributes, and other non-visible HTML odds and ends). If we intend to conduct some text mining on this HTML data, we might wish to extract just the visible text content from the page. In other words, we wish to convert our HTML data (text with angle braces) to "plaintext" (text without angle braces). Let's see how to do this with BeautifulSoup:

In [None]:
import requests
import bs4

# specify the url from which to fetch HTML data
url = 'http://www.gutenberg.org/files/2701/2701-h/2701-h.htm'

# get the data at that url
response = requests.get(url)

# fetch the text content from the response
html = response.text

# create a "soup" object that lets us use BeautifulSoup methods
soup = bs4.BeautifulSoup(html, 'html.parser')

# extract the text content from the soup object
text = soup.get_text()

That's all it takes! We can now print the text content from our url with our trusty print command:

In [None]:
print(text)
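
To see more precisely what `get_text()` does, it can help to try it on a tiny page. The sketch below uses a made-up HTML string (the `sample` and `demo_soup` names are assumptions for this example, chosen so they don't overwrite the `soup` variable above). It also shows the optional `separator` and `strip` arguments, which control how the pieces of text are joined together.

```python
import bs4

# a small made-up page, used here in place of the full Moby Dick HTML
sample = (
    '<html><head><title>Moby Dick</title></head>'
    '<body><h1>CHAPTER 1</h1><p>Call me Ishmael.</p></body></html>'
)

demo_soup = bs4.BeautifulSoup(sample, 'html.parser')

# by default the text of each element is joined with nothing in between
joined = demo_soup.get_text()
print(joined)

# strip whitespace from each piece of text and join the pieces with a space
spaced = demo_soup.get_text(separator=' ', strip=True)
print(spaced)
```

Notice that the text of the `title` tag is included in both results, even though a browser would not display it in the body of the page.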

<h3 style='color:green'>Reviewing HTML to Plaintext Conversions</h3>

In the code block above, we saw how to extract plaintext content from the HTML edition of Moby Dick. See if you can copy and paste that block below, then update that block of code to convert the HTML at the following url to plaintext:

http://www.gutenberg.org/files/1342/1342-h/1342-h.htm

In [None]:
# type your code here

<details>
  <summary>Solution</summary>
  We can convert the HTML at that url to plaintext with the following code:
    
  ```
  # import the libraries we will use
  import requests
  import bs4

  # specify the url from which to fetch HTML data
  url = 'http://www.gutenberg.org/files/1342/1342-h/1342-h.htm'

  # get the data at that url
  response = requests.get(url)

  # fetch the text content from the response
  html = response.text

  # create a "soup" object that lets us use BeautifulSoup methods
  soup = bs4.BeautifulSoup(html, 'html.parser')

  # extract the text content from the soup object
  text = soup.get_text()
  ```
</details>

### Removing Tags

A keen reader of Melville will notice in the printed output above that our `text` variable contains some text content that wasn't part of Moby Dick. In general, HTML often contains extraneous elements that are not part of the data we wish to work with. In these cases, we can simply remove the undesired HTML elements.

To get started with removing tags from our HTML, let's use our Chrome developer tools to analyze the HTML content at the top of our web page http://www.gutenberg.org/files/2701/2701-h/2701-h.htm:

<img src='./images/top-of-moby-dick.png'>

We can see above that the first HTML tags in this page are:
  
```
html
  head
    title
    style
```

We can also see that the `style` tag contains some text that Melville didn't write. Let's remove that tag with the following:

In [None]:
soup.find('style').decompose()

If we print the content of `soup` after running the line above, we should find that the `style` tag is gone!

In [None]:
print(soup.prettify())

Indeed we can see that the `style` tag content has vanished!
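
Two practical details are worth a quick illustration: `find()` returns `None` when no matching tag exists, and a page may contain several tags we want to remove. The sketch below shows one way to handle both on a small made-up page. The `sample` and `demo_soup` names, and the `footer` tag we look for, are assumptions for this example.

```python
import bs4

# a small made-up page, used here in place of the full Moby Dick HTML
sample = (
    '<html><head><title>Demo</title><style>p { color: red; }</style></head>'
    '<body><script>var x = 1;</script><p>Call me Ishmael.</p></body></html>'
)

demo_soup = bs4.BeautifulSoup(sample, 'html.parser')

# find_all accepts a list of tag names and returns every match
for node in demo_soup.find_all(['style', 'script']):
    node.decompose()

# find returns None when no matching tag exists, so check before calling decompose
footer = demo_soup.find('footer')
if footer is not None:
    footer.decompose()

remaining = demo_soup.get_text()
print(remaining)
```

Calling `decompose()` directly on the result of `find()` works well when we know the tag is present, as it is in Moby Dick. The `None` check is useful when running the same code over many pages whose markup may differ.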

<h3 style='color:green'>Reviewing Tag Removal</h3>

To practice removing tags from HTML with Python, see if you can remove the Project Gutenberg boilerplate that starts Moby Dick using BeautifulSoup:

In [None]:
# type your code here

<details>
  <summary>Solution</summary>
  We can remove the starting Project Gutenberg boilerplate with the following method:
    
  ```
  soup.find('pre').decompose()
  ```
</details>

### Iterating over Tags with BeautifulSoup

In the previous section, we examined how we can remove tags with the `decompose()` method. Sometimes, though, it makes sense not to use a "blacklist" strategy of this sort, but instead to use a "whitelist" method where we selectively choose the tags whose content we want to retain. 

Let's demonstrate how we can selectively collect the text content inside of `p` tags within Moby Dick:

In [None]:
# create a string that will hold the text content we retrieve
data = ''

# find all of the `p` tags that remain in the document
node_list = soup.find_all('p')

# iterate over the `p` tags
for node in node_list:

  # get the text content for the current tag
  paragraph_text = node.get_text()

  # add that text content to the string of data we've extracted
  data += paragraph_text

In [None]:
print(data)

There we go! We've now extracted the text content of every `p` tag that remains in our Moby Dick HTML!
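
Earlier we noted that the `class` attribute is particularly useful for HTML parsing. As a rough sketch of why, the example below uses a small made-up page to select only the `p` tags with the class `toc`, and then to collect the text of every `p` tag that does not have that class. The `sample`, `demo_soup`, `toc_nodes`, `paragraphs`, and `demo_data` names are assumptions for this example.

```python
import bs4

# a small made-up page that imitates the structure of the Moby Dick HTML
sample = '''
<body>
  <p class="toc"><a href="#link2HCH0001"> CHAPTER 1. Loomings. </a></p>
  <h2>CHAPTER 1. Loomings.</h2>
  <p>Call me Ishmael.</p>
  <p>There now is your insular city of the Manhattoes.</p>
</body>
'''

demo_soup = bs4.BeautifulSoup(sample, 'html.parser')

# select only the `p` tags whose class attribute includes "toc"
toc_nodes = demo_soup.find_all('p', class_='toc')

# keep the text of every `p` tag that is not a table of contents entry;
# get('class', []) returns the tag's list of classes, or [] if it has none
paragraphs = [
    p.get_text(strip=True)
    for p in demo_soup.find_all('p')
    if 'toc' not in p.get('class', [])
]

# join the paragraphs with newlines so they don't run together
demo_data = '\n'.join(paragraphs)
print(demo_data)
```

Joining with `'\n'` rather than adding each paragraph directly onto a string keeps a visible break between paragraphs, which can make later text mining easier.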

<h3 style='color:green'>Reviewing Tag Iteration</h3>

To review the process of iterating over a series of tags to extract text content, see if you can collect all of the text between `p` tags in the following url into one string:

http://jacklynch.net/Texts/mankind.html

In [None]:
# type your code here

<details>
  <summary>Solution</summary>
  We can collect the text within the `p` tags at that url into one string with the following code:
    
  ```
  import requests
  import bs4

  # specify the url from which to fetch HTML data
  url = 'http://jacklynch.net/Texts/mankind.html'

  # get the data at that url
  response = requests.get(url)

  # fetch the text content from the response
  html = response.text

  # create a "soup" object that lets us use BeautifulSoup methods
  soup = bs4.BeautifulSoup(html, 'html.parser')

  # collect the text content of each `p` tag into one string
  data = ''
  for i in soup.find_all('p'):
    data += i.get_text()
  ```
</details>