## Using XPath to make `pandas` dataframes from HTML

Last class, we learned how to access the root, individual nodes (whether elements, text, or attributes), and sets of nodes, using XPath. The end of that worksheet shows how to use XPath to read an XML file into a `pandas` dataframe. We were careful not to assume a particular ordering of the leaf nodes, because we assumed data could have been added to the XML file at different moments, e.g., maybe the GDP for France was added before the Population, but for the USA things were added in the opposite order.

Building on that work, we now consider HTML data representing a table. Here we can assume that the order of the children of any node in the tree is meaningful, because the document represents data as it is displayed on a webpage. For example, consider the table stored at the following URL, and compare it to the HTML file in our data directory:

http://personal.denison.edu/~bressoud/datasystems/ind2016.html

<img src="figs/ind2016-web.png" width="800">

We will import `lxml.html` and use a custom parser, `HTMLParser`. We'll look at this webpage carefully, to help prepare for chapter 22, then we'll show how to read another webpage into `pandas`:

http://personal.denison.edu/~bressoud/datasystems/topnames.html

First, our usual helper functions and import statements (plus the new `lxml.html`).

In [1]:
import requests
import pandas as pd
from lxml import etree
import lxml.html as lh
import io
import os

def print_tree(node, pretty_print=True, encoding='utf-8', limit=0):
    result = etree.tostring(node, pretty_print=pretty_print)
    if isinstance(result, bytes):
        result = result.decode(encoding)
    if limit > 0:
        print(result[:limit])
    else:
        print(result)
        
def print_results(nodeset):
    """
    This function iterates over all Elements in a given list
    of Elements, printing the tag, text, and attributes of each.
    
    Parameters:
    nodeset - a list of Elements
    """
    print("Length of nodeset result:", len(nodeset))
    for node in nodeset:
        print("Type:", type(node))
        if isinstance(node, etree._Element):  # also true for subclasses of _Element
            print("Tag:", node.tag)
            print("  Text:", node.text)
            print("  Attrib:", node.attrib)
        else:
            print(node)
        print()

In [9]:
# Reading from the web into an XML Element, using custom parser
dataout = "_generated"

protocol = "http"
location = "personal.denison.edu"
resourcepath = "/~bressoud/datasystems/{}"

def buildURL(s):
    return "{}://{}{}".format(protocol, location, resourcepath.format(s))

parser = etree.HTMLParser(remove_blank_text=True) 

url = buildURL("ind2016.html")
response = requests.get(url)
assert response.status_code == 200

tree1 = etree.parse(io.BytesIO(response.content), parser)
root1 = tree1.getroot()


In [6]:
# Reading from a local file that we got by saving a webpage
datadir = "public_data"
filename = "ind2016.html"                 # Text file encoded as UTF-8
path = os.path.join(datadir, filename)
indtree = etree.parse(path,parser) # use custom parser
indroot = indtree.getroot()
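
A small aside on why the custom parser matters. The sketch below uses a made-up HTML fragment (not taken from `ind2016.html`) that contains an unclosed `<br>` tag, which is acceptable HTML but not well-formed XML. Under that assumption, the default XML parser should reject the fragment, while `etree.HTMLParser` repairs it and wraps it in a full `html` tree. The names `snippet`, `xml_ok`, and `demo_root` are illustrative only.

```python
import io
from lxml import etree

# Hypothetical fragment, for illustration only: <br> is never closed
snippet = b"<div id='main-content'><h2>Title</h2><p>one<br>two</p></div>"

# The default XML parser requires well-formed markup and raises an error
try:
    etree.parse(io.BytesIO(snippet))
    xml_ok = True
except etree.XMLSyntaxError:
    xml_ok = False

# The HTML parser is forgiving and builds a complete tree around the fragment
html_parser = etree.HTMLParser(remove_blank_text=True)
demo_root = etree.parse(io.BytesIO(snippet), html_parser).getroot()

print("Parsed as XML:", xml_ok)
print("Root tag from HTML parser:", demo_root.tag)
print(demo_root.xpath("//div[@id='main-content']/h2/text()"))
```

Real webpages are rarely well-formed XML, which is one reason to pass the HTML parser explicitly to `etree.parse`.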

At this point, we have successfully read the data and obtained an Element representing it in tree form (two equivalent Elements, actually). We want to see what the data looks like. The cell below shows that our function `print_tree` is not great for this purpose. The file contains a great deal of markup that only controls how the page is displayed, and the long lines wrap in the notebook output, making it hard to "see" the tree structure.

In [13]:
# It's always wise to look at your data, but this is a lot

# Uncomment one of the lines below to have a peek
#print_tree(indroot)
#print_tree(root1)

A better way to see the tree structure is to open the HTML file in a web browser, then go to "View Source" under "Developer Tools" (e.g., in Chrome). Unfortunately, this does not always use indentation to display the tree structure, and sometimes shows more than one tag per line.

Another option is to download a copy of the webpage as a local HTML file on your machine (we put ours in `public_data`) and open it with a text editor. Instead of double-clicking it, use "Open With" and choose the most mindless text editor you have. For me, Atom displays it like an XML document (with indentation), whereas the default program for opening it is a web browser, which interprets the HTML tags and displays the file as a webpage (rather than showing tags and indentation).

<img src="figs/ind2016peek.png" width="800">

Alternatively, you can rename a copy of the HTML file so that its name ends in `.html.xml`, then open it as you would any XML file (e.g., with any text editor). Unfortunately, all of these solutions sometimes end up putting two tags on the same line, so you need to scan across long lines with your eyes.

With these tools in hand, you can now "see" the tree structure (via tag names as usual). Here is the tree structure for `ind2016.html`. We ignore tags we don't care about. 

- html
    - head
        - meta
        - script
        - ...   
    - body
        - div
        - div
        - ...
        - div id="main-content"
            - h2
            - div ...
 
This structure will be used in a crucial way in the reading.
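
An outline like the one above can also be produced programmatically. The following is a minimal sketch, not part of the original worksheet: `outline` is a hypothetical helper, and `sample` is an invented page that only imitates the shape of `ind2016.html`. It walks the tree recursively and records each tag name (plus any `id` attribute), indented by depth and cut off at `max_depth`.

```python
from lxml import etree

def outline(node, depth=0, max_depth=3):
    """Return indented tag names (with any id attribute) down to max_depth."""
    lines = []
    if not isinstance(node.tag, str):   # comments have a non-string tag; skip them
        return lines
    label = node.tag
    if node.get("id"):
        label += ' id="{}"'.format(node.get("id"))
    lines.append("    " * depth + "- " + label)
    if depth < max_depth:
        for child in node:              # children come back in document order
            lines.extend(outline(child, depth + 1, max_depth))
    return lines

# Invented page, loosely shaped like the outline above
sample = b"""<html><head><meta charset="utf-8"><title>t</title></head>
<body><div></div><div id="main-content"><h2>Heading</h2>
<div><table></table></div></div></body></html>"""

sample_root = etree.fromstring(sample, etree.HTMLParser(remove_blank_text=True))
lines = outline(sample_root)
print("\n".join(lines))
```

Applied to `indroot` with a small `max_depth`, a helper like this should give a view of the tree without the display markup that swamps `print_tree`.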

## Web-scraping `topnames.html`

At the following link, you find a table of data:

http://personal.denison.edu/~bressoud/datasystems/topnames.html

<img src="figs/topnames-web.png" width="400">

Viewing the tree structure as we did for `ind2016.html`, we can determine the format inside the HTML document near where the data is stored.

<img src="figs/topnames-peek.png" width="400">

The reader is strongly encouraged to open `topnames.html` or `topnames.html.xml` and follow the tree down to the table. Scroll down in that document to find it, with `th` tags for the header cells (look for `title="Field ..."` attributes) and `td` tags for the cells in the body of the table. That is where the data is stored, and we can get there with an XPath expression.

In [23]:
url = buildURL("topnames.html")
response = requests.get(url)
assert response.status_code == 200

tree2 = etree.parse(io.BytesIO(response.content), parser)
root2 = tree2.getroot()

We see that the header cells are stored with tag `th`, so we can use XPath to extract these four nodes.

In [24]:
headerNodes = root2.xpath("//th")
print(len(headerNodes))

4


Note that if the HTML document had multiple tables, then the XPath expression above would match too many header cells, and the path would need to be more specific. The next cell gives an example that anchors on the `h2` inside `main-content`, on the assumption that any lower tables would be stored under `h3`, `h4`, etc.

In [19]:
headerNodes = root2.xpath("//*[@id='main-content']/h2/../div//th")
print_results(headerNodes)

Length of nodeset result: 4
Type: <class 'lxml.etree._Element'>
Tag: th
  Text: year
  Attrib: {'title': 'Field #1'}

Type: <class 'lxml.etree._Element'>
Tag: th
  Text: sex
  Attrib: {'title': 'Field #2'}

Type: <class 'lxml.etree._Element'>
Tag: th
  Text: name
  Attrib: {'title': 'Field #3'}

Type: <class 'lxml.etree._Element'>
Tag: th
  Text: count
  Attrib: {'title': 'Field #4'}
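
To make the multiple-table concern concrete, here is a small sketch on an invented page with two tables (the `sidebar` table and all cell values are assumptions for illustration, not part of `topnames.html`). It compares the broad `//th` path with the more specific path used above.

```python
from lxml import etree

# Invented page: one table under main-content, another in a sidebar
page = b"""<html><body>
<div id="main-content">
  <h2>Top names</h2>
  <div><table>
    <tr><th>year</th><th>name</th></tr>
    <tr><td>2005</td><td>Emily</td></tr>
  </table></div>
</div>
<div id="sidebar">
  <table><tr><th>link</th></tr><tr><td>home</td></tr></table>
</div>
</body></html>"""

demo_root = etree.fromstring(page, etree.HTMLParser(remove_blank_text=True))

# The broad path picks up header cells from both tables
all_headers = demo_root.xpath("//th/text()")

# The specific path only looks inside the div that sits beside the h2
main_headers = demo_root.xpath("//*[@id='main-content']/h2/../div//th/text()")

print(all_headers)
print(main_headers)
```

The step `h2/..` moves down to the `h2` and back up to its parent, so it acts as a check that the `main-content` element actually contains an `h2` heading.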



We must extract the text from the four header nodes above, and store the results in a list of column names that we can feed to `pandas` later. As usual, we will have XPath do our heavy lifting.

In [25]:
column_names = root2.xpath("//*[@id='main-content']/h2/../div//th/text()")
print(column_names)

['year', 'sex', 'name', 'count']


We are ready to extract the body of the table, via the `td` tags. XPath will return the data as a single flat list, in the order obtained by reading the table on the webpage from left to right and top to bottom. Note that this solution depends on the fact that the order of the `td` elements in the HTML document matches the order you get from scanning over the table in this way.

In [28]:
tdlist = root2.xpath(".//table//tr/td/text()")
print(len(tdlist))
print(tdlist[0:10])

112
['2005', 'Female', 'Emily', '23940', '2005', 'Male', 'Jacob', '25833', '2006', 'Female']


We extract this into a list of lists that we can feed to `pandas`. Each list is a row, i.e., a slice of length 4 from `tdlist`.

In [30]:
LoL = []
fieldcount = 0
for item in tdlist:
    if fieldcount == 0:
        row = []
    row.append(item)
    if fieldcount < 3:
        fieldcount += 1
    else:
        LoL.append(row)
        fieldcount = 0
LoL

[['2005', 'Female', 'Emily', '23940'],
 ['2005', 'Male', 'Jacob', '25833'],
 ['2006', 'Female', 'Emily', '21404'],
 ['2006', 'Male', 'Jacob', '24845'],
 ['2007', 'Female', 'Emily', '19355'],
 ['2007', 'Male', 'Jacob', '24282'],
 ['2008', 'Female', 'Emma', '18813'],
 ['2008', 'Male', 'Jacob', '22594'],
 ['2009', 'Female', 'Isabella', '22306'],
 ['2009', 'Male', 'Jacob', '21175'],
 ['2010', 'Female', 'Isabella', '22913'],
 ['2010', 'Male', 'Jacob', '22127'],
 ['2011', 'Female', 'Sophia', '21842'],
 ['2011', 'Male', 'Jacob', '20371'],
 ['2012', 'Female', 'Sophia', '22313'],
 ['2012', 'Male', 'Jacob', '19074'],
 ['2013', 'Female', 'Sophia', '21223'],
 ['2013', 'Male', 'Noah', '18257'],
 ['2014', 'Female', 'Emma', '20936'],
 ['2014', 'Male', 'Noah', '19305'],
 ['2015', 'Female', 'Emma', '20455'],
 ['2015', 'Male', 'Noah', '19635'],
 ['2016', 'Female', 'Emma', '19496'],
 ['2016', 'Male', 'Noah', '19117'],
 ['2017', 'Female', 'Emma', '19800'],
 ['2017', 'Male', 'Liam', '18798'],
 ['2018', 'Fe
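
The same grouping can be written with slices instead of a counter. This is a sketch of an alternative, not the worksheet's method: `cells` is a short stand-in for `tdlist` built from the first values shown above, and `width` plays the role of `len(column_names)`.

```python
# Stand-in for tdlist, using the first twelve values from the output above
cells = ['2005', 'Female', 'Emily', '23940',
         '2005', 'Male', 'Jacob', '25833',
         '2006', 'Female', 'Emily', '21404']
width = 4   # number of columns; len(column_names) in the notebook

# Guard against a ragged table before slicing it into rows
assert len(cells) % width == 0

# Each row is the slice of `width` consecutive cells starting at i
rows = [cells[i:i + width] for i in range(0, len(cells), width)]
print(rows)
```

Deriving `width` from the header list avoids hard-coding the number of columns, and the divisibility check fails early if the table does not have a full set of cells in every row.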

In [33]:
# Turning the LoL into a dataframe
df = pd.DataFrame(LoL,columns=column_names)
df.set_index(['year','sex'], inplace=True)
df

| year | sex | name | count |
|---|---|---|---|
| 2005 | Female | Emily | 23940 |
| 2005 | Male | Jacob | 25833 |
| 2006 | Female | Emily | 21404 |
| 2006 | Male | Jacob | 24845 |
| 2007 | Female | Emily | 19355 |
| 2007 | Male | Jacob | 24282 |
| 2008 | Female | Emma | 18813 |
| 2008 | Male | Jacob | 22594 |
| 2009 | Female | Isabella | 22306 |
| 2009 | Male | Jacob | 21175 |


We now provide an alternative way to solve this problem, by using XPath to extract four lists (one per column), storing these lists in a dictionary of lists (DoL) keyed by column name, and reading it into `pandas`.

The key idea in the XPath expression is to use a positional predicate on the `td` children of each `tr` (each `tr` node represents a row in the table). XPath positions start at 1, while `enumerate` starts at 0, which is why the code uses `index+1`. The first time the loop below iterates, `index` is zero, so `td[1]` selects the first `td` child of each `tr` node. This child always represents the `year` in the row, so the invocation of `xpath()` yields a list of years. The loop below iterates four times (once per column).

In [34]:
DoL = {}
for index, column in enumerate(column_names):
    xpath = ".//table//tr/td[{}]/text()".format(index+1)
    DoL[column] = root2.xpath(xpath)
DoL

{'year': ['2005',
  '2005',
  '2006',
  '2006',
  '2007',
  '2007',
  '2008',
  '2008',
  '2009',
  '2009',
  '2010',
  '2010',
  '2011',
  '2011',
  '2012',
  '2012',
  '2013',
  '2013',
  '2014',
  '2014',
  '2015',
  '2015',
  '2016',
  '2016',
  '2017',
  '2017',
  '2018',
  '2018'],
 'sex': ['Female',
  'Male',
  'Female',
  'Male',
  'Female',
  'Male',
  'Female',
  'Male',
  'Female',
  'Male',
  'Female',
  'Male',
  'Female',
  'Male',
  'Female',
  'Male',
  'Female',
  'Male',
  'Female',
  'Male',
  'Female',
  'Male',
  'Female',
  'Male',
  'Female',
  'Male',
  'Female',
  'Male'],
 'name': ['Emily',
  'Jacob',
  'Emily',
  'Jacob',
  'Emily',
  'Jacob',
  'Emma',
  'Jacob',
  'Isabella',
  'Jacob',
  'Isabella',
  'Jacob',
  'Sophia',
  'Jacob',
  'Sophia',
  'Jacob',
  'Sophia',
  'Noah',
  'Emma',
  'Noah',
  'Emma',
  'Noah',
  'Emma',
  'Noah',
  'Emma',
  'Liam',
  'Emma',
  'Liam'],
 'count': ['23940',
  '25833',
  '21404',
  '24845',
  '19355',
  '24282',
  '188

In [35]:
df = pd.DataFrame(DoL)
df.set_index(['year','sex'], inplace=True)
df

| year | sex | name | count |
|---|---|---|---|
| 2005 | Female | Emily | 23940 |
| 2005 | Male | Jacob | 25833 |
| 2006 | Female | Emily | 21404 |
| 2006 | Male | Jacob | 24845 |
| 2007 | Female | Emily | 19355 |
| 2007 | Male | Jacob | 24282 |
| 2008 | Female | Emma | 18813 |
| 2008 | Male | Jacob | 22594 |
| 2009 | Female | Isabella | 22306 |
| 2009 | Male | Jacob | 21175 |
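
One caveat of the column-by-column approach is worth sketching. The table below is invented for illustration (it is not `topnames.html`), and its last row is missing a cell. Because each column is extracted independently, the lists can end up with different lengths, and the values no longer line up by row.

```python
from lxml import etree

# Invented table: the second data row is missing its last cell
page = b"""<table>
<tr><th>year</th><th>name</th><th>count</th></tr>
<tr><td>2005</td><td>Emily</td><td>23940</td></tr>
<tr><td>2006</td><td>Emily</td></tr>
</table>"""

demo_root = etree.fromstring(page, etree.HTMLParser(remove_blank_text=True))

names = demo_root.xpath("//th/text()")
columns = {}
for index, column in enumerate(names):
    # XPath positions start at 1, so enumerate's 0-based index is shifted by one
    columns[column] = demo_root.xpath("//tr/td[{}]/text()".format(index + 1))

# Compare the column lengths before handing the dictionary to pandas
lengths = {column: len(values) for column, values in columns.items()}
print(lengths)
```

Checking that all lengths agree before calling `pd.DataFrame` is a cheap safeguard, since `pandas` raises a `ValueError` when the lists in a dictionary differ in length. An empty `td` has no text node, so it would shorten a column in the same way.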
