# Using lxml to Parse HTML 

`lxml` is a Python binding for the libxml2 and libxslt C libraries. It provides sophisticated, fairly low-level functionality for traversing XML and HTML documents. In this example we use its specialized `html` submodule.

We download some data from Wikipedia:

In [None]:
import requests
url = "https://en.wikipedia.org/wiki/United_States_presidential_election_in_Virginia,_2004"
resp = requests.get(url)
resp

## Construct the DOM

Parse the content into a document object model (DOM):

In [None]:
from lxml import html
dom = html.document_fromstring(resp.content)

In [None]:
dom
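
To see what `document_fromstring` returns without depending on the network, here is a minimal sketch that parses a tiny, made-up page. The `sample` markup and the variable names are assumptions for illustration, not part of the Wikipedia example.

```python
from lxml import html

# Hypothetical stand-in for resp.content: a small page given as bytes.
sample = b"<html><head><title>Demo</title></head><body><p>Hello</p></body></html>"

# document_fromstring returns the root <html> element of the parsed tree.
sample_dom = html.document_fromstring(sample)

root_tag = sample_dom.tag                        # tag name of the root element
title_text = sample_dom.findtext("head/title")   # text of <title>, found by path
print(root_tag, title_text)
```

The object is an ordinary element, so everything shown later for `dom` applies to `sample_dom` as well.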

## Traversing the DOM

In [None]:
list(dom)  # direct children of the root; lxml deprecates getchildren() in favor of list(element)
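
Traversal can also be done by iterating over an element directly. The following is a small sketch on a made-up document (the markup and names are assumptions) that lists direct children and then walks all descendants with `iter`.

```python
from lxml import html

# Hypothetical document used only to illustrate traversal.
sample_dom = html.document_fromstring(
    "<html><head><title>Demo</title></head>"
    "<body><h1>Heading</h1><p>First</p><p>Second</p></body></html>"
)

# Iterating over an element yields its direct children in document order.
top_level = [child.tag for child in sample_dom]

# The same works one level down.
body_tags = [child.tag for child in sample_dom.find("body")]

# iter("p") walks the whole subtree and yields every <p> element.
paragraphs = [p.text for p in sample_dom.iter("p")]

print(top_level, body_tags, paragraphs)
```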

## Jumping directly to an Element

In [None]:
body = dom.find("body")

In [None]:
list(body)  # direct children of <body>; lxml deprecates getchildren() in favor of list(element)
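
The difference between `find` on a plain tag name and on a `.//` path is easy to miss, so here is a short sketch on a made-up document (markup and names are assumptions). A bare tag name matches direct children only, while `.//` searches all descendants.

```python
from lxml import html

# Hypothetical document: one <p> nested in a <div>, one directly under <body>.
sample_dom = html.document_fromstring(
    "<html><head><title>Demo</title></head>"
    "<body><div><p>First</p></div><p>Second</p></body></html>"
)

body = sample_dom.find("body")      # <body> is a direct child of <html>

direct_p = body.findall("p")        # only <p> elements directly under <body>
any_p = body.findall(".//p")        # every <p> anywhere below <body>
missing = sample_dom.find("table")  # find returns None when nothing matches

print([p.text for p in direct_p], [p.text for p in any_p], missing)
```

Checking for `None` before using the result of `find` avoids an `AttributeError` when the element is absent.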

# Using XPath to query HTML 

The following XPath query starts by matching `table` elements anywhere in the tree (`//table`), descends to a row (`/tr`) and then to a data cell (`/td`), and selects a link (`a`) whose `title` attribute (`@title`) equals `"Accomack County, Virginia"`. Each `..` then steps up to the parent: first the `td`, then the `tr`, then the `table`. The query therefore returns every table that contains such a link. Because `/` matches only direct children, the query assumes the rows sit directly under `table`, with no `tbody` wrapper in between.

In [None]:
tables = dom.xpath('//table/tr/td/a[@title="Accomack County, Virginia"]/../../..')
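
The parent steps are easier to follow on a small document. The sketch below runs the same query against made-up markup with two tables (the `id` values and cell contents are assumptions), and also shows an alternative that uses the `ancestor` axis instead of three `..` steps.

```python
from lxml import html

# Hypothetical markup: only the second table contains the link we look for.
sample_dom = html.document_fromstring(
    '<html><body>'
    '<table id="other"><tr><td><a title="Somewhere Else">Else</a></td></tr></table>'
    '<table id="counties"><tr>'
    '<td><a title="Accomack County, Virginia">Accomack</a></td><td>row 1</td>'
    '</tr></table>'
    '</body></html>'
)

# Same query as above: down to the link, then up three parents to the table.
matches = sample_dom.xpath(
    '//table/tr/td/a[@title="Accomack County, Virginia"]/../../..'
)
ids = [t.get("id") for t in matches]

# Alternative: find the link anywhere, then take its nearest <table> ancestor.
via_ancestor = sample_dom.xpath(
    '//a[@title="Accomack County, Virginia"]/ancestor::table[1]'
)
ancestor_ids = [t.get("id") for t in via_ancestor]

print(ids, ancestor_ids)
```

The `ancestor` form does not depend on the exact nesting, so it would still match if the markup placed a `tbody` between `table` and `tr`.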

Printing the returned table:

In [None]:
print(html.tostring(tables[0], pretty_print=True).decode('UTF8'))

Building a DataFrame from the table:

In [None]:
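import io
import pandas as pd
# Recent pandas versions deprecate passing literal HTML to read_html,
# so serialize the table to a str and wrap it in StringIO.
df = pd.read_html(io.StringIO(html.tostring(tables[0], encoding="unicode")))[0]
df.head()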
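
The same round trip can be tried on a tiny table without downloading anything. This is a sketch under stated assumptions: the table markup, column names, and cell values are made up, and `pandas` needs an HTML parser such as `lxml` installed for `read_html` to work.

```python
import io

import pandas as pd
from lxml import html

# Hypothetical table with an explicit header row and two data rows.
sample_dom = html.document_fromstring(
    "<html><body><table>"
    "<thead><tr><th>Name</th><th>Value</th></tr></thead>"
    "<tbody><tr><td>Alpha</td><td>x</td></tr>"
    "<tr><td>Beta</td><td>y</td></tr></tbody>"
    "</table></body></html>"
)

table = sample_dom.xpath("//table")[0]

# Serialize the element to a str and hand it to pandas as a file-like object.
markup = html.tostring(table, encoding="unicode")
sample_df = pd.read_html(io.StringIO(markup))[0]

print(sample_df)
```

`read_html` returns a list with one DataFrame per table it finds, which is why the result is indexed with `[0]`.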