# Tutorial - HTML and lxml

### A toy HTML example

**HTML** (Hypertext Markup Language) is the language used to write documents designed to be displayed in a web browser. The web browser receives HTML documents from a web server or from local storage and renders them as multimedia web pages.

An extremely simple example of an HTML document follows.

    <html>
        <head>
            <title>Data Viz</title>
        </head>
        <body>
            <div class="course">Data Visualization</div>
            <div class="program">MBA full-time</div>
            <a class="professor" href="https://www.iese.edu/faculty-research/faculty/miguel-angel-canela">Miguel Ángel Canela</a>
        </body>
    </html>
    
The structure of an HTML document is defined by its **tags**. Every part of the document is opened by a **start tag** (`<tag>`) and closed by an **end tag** (`</tag>`). The tags create a **tree-like structure** in the document. The `html` node is the **root node**, with two **child nodes**, `head` and `body`.

Here, the `head` node has one child, while the `body` node has three children, which are **siblings**. The `title` node contains the string `'Data Viz'`, enclosed between the start tag and the end tag (this can also be said of the `head` node). This string is the **node value**. 

Also, some nodes have **attributes**. The attributes are contained in the start tag. The `div` nodes have one `class` attribute, while the `a` node has two attributes: a `class` attribute and an `href` attribute. `class` attributes are very frequent, and we take advantage of them in **web scraping**.

The `a` node has a special role. It marks a **hyperlink**, which is used to link a page to another page, or to download a file. The most important attribute of this node is the `href` attribute, which indicates the link's destination.

### Parsing HTML code

The package `lxml` provides many useful tools for parsing XML (and HTML) documents. This tutorial shows how to use some methods of the subpackage `lxml.html` to extract pieces of information from an HTML document, using the toy example displayed above. First, I create in Python a string variable whose value is the HTML code. Note that I end each line with a backslash (`\`), which lets a single-quoted string continue on the next line. The line breaks themselves are not stored in the string.

In [1]:
page = '<html> \
  <head> \
  <title>Data Viz</title> \
  </head> \
  <body> \
  <div class="course" >Data Visualization</div> \
  <div class="program">MBA full-time</div> \
  <a class="professor" \
  href="https://www.iese.edu/faculty-research/faculty/miguel-angel-canela">Miguel Ángel Canela</a> \
  </body> \
  </html>'

In [2]:
page

'<html>   <head>   <title>Data Viz</title>   </head>   <body>   <div class="course" >Data Visualization</div>   <div class="program">MBA full-time</div>   <a class="professor"   href="https://www.iese.edu/faculty-research/faculty/miguel-angel-canela">Miguel Ángel Canela</a>   </body>   </html>'
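The effect of the trailing backslash can be checked directly. The following is a minimal sketch, not part of the original notebook: the variable names `joined` and `multiline` are only illustrative. It compares a backslash-continued string with a triple-quoted string, which is an alternative way to write HTML code over several lines.

```python
# Backslash version: the backslash and the line break that follows it
# are both dropped, so no '\n' ends up in the string.
joined = '<div class="course">Data Visualization</div> \
<div class="program">MBA full-time</div>'

# Triple-quoted version: the line break is kept as '\n' in the string.
multiline = '''<div class="course">Data Visualization</div>
<div class="program">MBA full-time</div>'''

print('\n' in joined)     # False
print('\n' in multiline)  # True
```

Either form can be parsed as HTML, since whitespace between tags does not change the node hierarchy.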

Next, I parse this string with the function `fromstring`, from the subpackage `lxml.html`, which builds the tree structure.

In [3]:
from lxml import html

In [4]:
tree = html.fromstring(page)

Here, `tree` is an `lxml` object, which stores the node hierarchy and the information contained in the nodes.

In [5]:
tree

<Element html at 0x108e1d830>

In [6]:
type(tree)

lxml.html.HtmlElement

The hierarchical structure stored in `tree` can be easily explored:

In [7]:
tree[0]

<Element head at 0x108e262f0>

In [8]:
tree[1]

<Element body at 0x108e265f0>

In [9]:
tree[1][0]

<Element div at 0x107f77f50>
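Besides indexing, the children of a node can be listed by iterating over it. The sketch below rebuilds the toy page so that it runs on its own. The names `top_level` and `body_children` are illustrative, and `tag` is the `lxml` attribute that holds the tag name of a node.

```python
from lxml import html

# The toy page of this tutorial, written as one string.
page = (
    '<html><head><title>Data Viz</title></head>'
    '<body>'
    '<div class="course">Data Visualization</div>'
    '<div class="program">MBA full-time</div>'
    '<a class="professor" '
    'href="https://www.iese.edu/faculty-research/faculty/miguel-angel-canela">'
    'Miguel Ángel Canela</a>'
    '</body></html>'
)
tree = html.fromstring(page)

# Iterating over a node yields its children, in document order.
top_level = [child.tag for child in tree]
body_children = [child.tag for child in tree[1]]

print(top_level)      # ['head', 'body']
print(body_children)  # ['div', 'div', 'a']
print(len(tree[1]))   # 3, the number of children of the body node
```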

To extract the information from these objects in a format which can be managed by common Python tools, we have two attributes of the node:

* `text` returns the **node value**, that is, the text between the node tags.

* `attrib` returns the **attributes** of the node as a dictionary-like object, so the value of a particular attribute is obtained with square brackets, as in a Python dictionary.

The following two examples illustrate the use of these attributes.

In [10]:
tree[1][0].text

'Data Visualization'

In [11]:
tree[1][0].attrib['class']

'course'
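Since `attrib` behaves like a dictionary, asking for an attribute that the node does not have raises a `KeyError`. A hedged sketch of a more forgiving alternative follows, using the `get` method of `lxml` nodes, which returns `None` (or a default of your choice) when the attribute is missing. The variable names are illustrative.

```python
from lxml import html

page = (
    '<html><head><title>Data Viz</title></head>'
    '<body>'
    '<div class="course">Data Visualization</div>'
    '<div class="program">MBA full-time</div>'
    '<a class="professor" '
    'href="https://www.iese.edu/faculty-research/faculty/miguel-angel-canela">'
    'Miguel Ángel Canela</a>'
    '</body></html>'
)
tree = html.fromstring(page)

first_div = tree[1][0]   # <div class="course">
link = tree[1][2]        # <a class="professor" href="...">

# All the attributes of a node, converted to a plain dictionary.
link_attributes = dict(link.attrib)
print(sorted(link_attributes))         # ['class', 'href']

# The div node has no href attribute: get() returns None or a default.
print(first_div.get('href'))           # None
print(first_div.get('href', 'n/a'))    # n/a
print(first_div.get('class'))          # course
```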

### XPath expressions

Like HTML, **XML** (eXtensible Markup Language) is a markup language. The markup in XML is also defined by tags, but these are not predefined as in HTML (`head`, `body`, etc). **XPath** is a query language for selecting nodes in an XML document, which can also be used with an HTML document.

XPath expressions look like the path expressions in Unix-like (macOS/Linux) file systems. For instance, `/html/head` denotes the `head` node, while `/html/body` denotes the `body` node. These two are easy, because there is only one `head` node and one `body` node. But there are typically many `div` nodes. So, an XPath expression can match several nodes, like `/html/body/div`, which denotes, in the example, the two `div` nodes. These nodes can be identified individually as `/html/body/div[1]` and `/html/body/div[2]`. Note that XPath starts counting at 1, not at 0 like Python.

For real-world pages, finding paths for the nodes that contain the relevant information is not as easy as in our toy example, though practice makes a difference. Many practitioners rely on the *Inspector*, a tool which comes with the browser in the same menu as the *View Source* tool.
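The difference between the 1-based counting of XPath and the 0-based counting of Python is easy to get wrong, so a small check may help. The sketch below uses the method `xpath` of the parsed tree, which is discussed in the next section. The variable names are illustrative.

```python
from lxml import html

page = (
    '<html><head><title>Data Viz</title></head>'
    '<body>'
    '<div class="course">Data Visualization</div>'
    '<div class="program">MBA full-time</div>'
    '</body></html>'
)
tree = html.fromstring(page)

# Python side: the list of div nodes is indexed from 0.
divs = tree.xpath('/html/body/div')

# XPath side: the position inside the square brackets starts at 1.
first_by_xpath = tree.xpath('/html/body/div[1]')
second_by_xpath = tree.xpath('/html/body/div[2]')

print(first_by_xpath[0].text == divs[0].text)    # True
print(second_by_xpath[0].text == divs[1].text)   # True

# A position beyond the last node matches nothing.
print(tree.xpath('/html/body/div[3]'))           # []
```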

### The function xpath

The method `xpath` extracts all the nodes that match an XPath expression, as in the following example:

In [12]:
tree.xpath('/html/body/div')

[<Element div at 0x107f77f50>, <Element div at 0x108e267d0>]

Note that `xpath` returns a list, whose elements are `lxml` objects. Since HTML documents can be large, it is practical to shorten XPath expressions by omitting intermediate nodes. So, the `div` nodes of the example can be simply identified as `//div`:

In [13]:
tree.xpath('//div')

[<Element div at 0x107f77f50>, <Element div at 0x108e267d0>]

Although this is very practical, it would not be specific enough in an HTML document containing many `div` nodes in different places. A method frequently used in web scraping is to identify the target nodes by specifying an attribute value, as in the following example, where the first `div` node of the example is identified.

In [14]:
tree.xpath('/html/body/div[@class="course"]')

[<Element div at 0x107f77f50>]
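Attribute conditions can be combined with the `//` shortcut and with other XPath 1.0 features. The following sketch shows three variations on the toy page: an exact match on `class`, a test for the mere presence of an attribute, and the standard XPath function `contains`. The variable names are illustrative.

```python
from lxml import html

page = (
    '<html><head><title>Data Viz</title></head>'
    '<body>'
    '<div class="course">Data Visualization</div>'
    '<div class="program">MBA full-time</div>'
    '<a class="professor" '
    'href="https://www.iese.edu/faculty-research/faculty/miguel-angel-canela">'
    'Miguel Ángel Canela</a>'
    '</body></html>'
)
tree = html.fromstring(page)

# Exact attribute value, anywhere in the document.
program = tree.xpath('//div[@class="program"]')

# Any node (*) that has a class attribute, whatever its value.
with_class = tree.xpath('//*[@class]')

# Nodes whose href contains a given substring.
iese_links = tree.xpath('//a[contains(@href, "iese.edu")]')

print(program[0].text)                  # MBA full-time
print([node.tag for node in with_class])  # ['div', 'div', 'a']
print(len(iese_links))                  # 1
```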

### Extracting information from the nodes

To extract the information from the lists returned by `xpath`, we can use `text` and `attrib`. The first one can be added, as the XPath function `text()`, at the end of the XPath expression:

In [15]:
tree.xpath('//div/text()')

['Data Visualization', 'MBA full-time']

But `text` can also be applied after `tree.xpath`, instead of being part of the XPath expression. Note that, since `tree.xpath('//div')` is a list, I cannot apply `text` to it directly, even if the list had a single element (it has two in this example). I have to pick an element of the list first.

In [16]:
tree.xpath('//div')[0].text

'Data Visualization'
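Because `xpath` always returns a list, two patterns come up often: applying `text` to every element of the list, and picking the first element without failing when the list is empty. A minimal sketch follows. The helper `first_text` is hypothetical (it is not part of `lxml`) and assumes that the expression selects nodes, not strings.

```python
from lxml import html

page = (
    '<html><head><title>Data Viz</title></head>'
    '<body>'
    '<div class="course">Data Visualization</div>'
    '<div class="program">MBA full-time</div>'
    '</body></html>'
)
tree = html.fromstring(page)

# Apply text to every node of the list with a list comprehension.
all_texts = [node.text for node in tree.xpath('//div')]

def first_text(tree, expression):
    """Return the text of the first matching node, or None if nothing matches."""
    nodes = tree.xpath(expression)
    return nodes[0].text if nodes else None

print(all_texts)                    # ['Data Visualization', 'MBA full-time']
print(first_text(tree, '//div'))    # Data Visualization
print(first_text(tree, '//span'))   # None, the page has no span node
```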

To extract the value of an attribute, you can do as in:

In [17]:
tree.xpath('//a/@href')[0]

'https://www.iese.edu/faculty-research/faculty/miguel-angel-canela'

The same value can be obtained by applying `attrib` to the node:

In [18]:
tree.xpath('//a')[0].attrib['href']

'https://www.iese.edu/faculty-research/faculty/miguel-angel-canela'
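When both the text and an attribute of the same nodes are needed, one option is to select the nodes once and read both pieces from each node, which keeps every text paired with its own link. The sketch below illustrates this on the toy page. The name `records` and the dictionary keys are illustrative choices.

```python
from lxml import html

page = (
    '<html><head><title>Data Viz</title></head>'
    '<body>'
    '<div class="course">Data Visualization</div>'
    '<div class="program">MBA full-time</div>'
    '<a class="professor" '
    'href="https://www.iese.edu/faculty-research/faculty/miguel-angel-canela">'
    'Miguel Ángel Canela</a>'
    '</body></html>'
)
tree = html.fromstring(page)

# One dictionary per matching node, with the node value and the link.
records = [
    {'name': node.text, 'link': node.attrib['href']}
    for node in tree.xpath('//a[@class="professor"]')
]

print(len(records))         # 1
print(records[0]['name'])   # Miguel Ángel Canela
print(records[0]['link'])
```

Extracting two separate lists with `//a/text()` and `//a/@href` gives the same data here, but the lists could fall out of step in a page where some nodes lack text or the attribute.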

### Homework

IESE Business School displays information about its faculty members on 11 web pages. The URL for the second one is `https://www.iese.edu/search/professors/2`. You can get the source code of the page through the contextual menu that opens when right-clicking anywhere on the page. The file `iese.html` contains that code, slightly edited by dropping tabs and line breaks that may confuse Python when copying and pasting the code into the console.

1. Copy the code, enter `page = ''` in the Python console and paste the code between the quote marks. Then press the `Return` key. You now have the source code as a string in Python. This HTML document is much longer than my toy example, and contains other tags, like `<ul>`, `<li>` and `<script>`.

2. Use the tools presented in this tutorial to extract from the code three lists, with the professors' names (e.g. "Miguel Ángel Canela"), the professors' descriptions (e.g. "Associate Professor of Managerial Decision Sciences") and the links to the professors' individual pages (e.g. "https://www.iese.edu/faculty-research/faculty/miguel-angel-canela"), respectively.

3. Use the function `pd.DataFrame` to create a data frame with three columns, `name`, `description` and `link`, containing the data of these lists.

4. Use the method `.to_excel` to export the data collected to an Excel file.
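As a hint for steps 3 and 4, the sketch below shows the general shape of the code, using only the single example row quoted in step 2 in place of the lists extracted from the page. The function `export_to_excel` and the file name `professors.xlsx` are hypothetical. Writing `.xlsx` files with pandas also assumes that an Excel writer engine such as `openpyxl` is installed.

```python
import pandas as pd

# Placeholder lists with one element each; in the homework they would
# come from the XPath queries on the IESE page.
names = ['Miguel Ángel Canela']
descriptions = ['Associate Professor of Managerial Decision Sciences']
links = ['https://www.iese.edu/faculty-research/faculty/miguel-angel-canela']

# A dictionary of equal-length lists becomes a data frame, one column per key.
df = pd.DataFrame({'name': names, 'description': descriptions, 'link': links})

def export_to_excel(data, path):
    """Write the data frame to an Excel file, leaving out the row index."""
    data.to_excel(path, index=False)

print(df.shape)           # (1, 3)
print(list(df.columns))   # ['name', 'description', 'link']
```

Calling `export_to_excel(df, 'professors.xlsx')` would then create the file in the working directory.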