XPath is a query language for selecting nodes in an XML-style tree. Once HTML has been parsed into such a tree, XPath expressions let you pick out elements, attributes, and text. lxml has a built-in XPath engine, and several other libraries provide their own. It is fairly easy to learn and very portable. Combined with a little regex, it makes a powerful scraping tool.
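
As a rough illustration of pairing XPath with regex, the sketch below pulls link targets out with XPath and then filters them with Python's `re` module. The page content, the `/item/<number>` URL scheme, and the variable names are all invented for this example.

```python
import re

from lxml import html

# Hypothetical page, made up for illustration only.
page = html.fromstring(
    "<html><body>"
    '<a href="/item/101">Widget</a>'
    '<a href="/about">About</a>'
    '<a href="/item/202">Gadget</a>'
    "</body></html>"
)

# XPath does the structural work: collect every href value.
hrefs = page.xpath("//a/@href")

# Regex does the string work: keep only item links and extract the id.
item_pattern = re.compile(r"^/item/(\d+)$")
item_ids = [m.group(1) for m in map(item_pattern.match, hrefs) if m]

print(item_ids)  # ['101', '202']
```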

`/body/p` : matches every `p` that is a direct child of `body`. A leading `/` anchors the path at the document root, so on a full HTML page the absolute form is `/html/body/p`.

`//p/a` : matches every link that is a direct child of a paragraph, anywhere in the document.

`../p` : matches all paragraphs in the parent element.

`@` : selects an attribute (`//a/@href`).

`*` / `@*` : wildcards that match any child element or any attribute of the element you are parsing.

`text()` / `comment()` / `node()` : match text nodes, comment nodes, or nodes of any kind directly inside the current element (`//p/text()` would then match the text directly inside every paragraph in the document).
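
The expressions above can be tried on a small document. This is only a sketch: the `SAMPLE` markup is invented for illustration and is not the page loaded by `xpath_intro.py`.

```python
from lxml import html

# Hypothetical sample page, invented for this example.
SAMPLE = """<html>
  <body>
    <!-- a comment -->
    <p id="first">Hello <a href="https://example.com/a">link A</a></p>
    <div><p>World <a href="https://example.com/b" title="B">link B</a></p></div>
  </body>
</html>"""

tree = html.fromstring(SAMPLE)  # root <html> element

# Absolute path: only the <p> that is a direct child of <body>.
direct_paragraphs = tree.xpath("/html/body/p")

# Descendant search: every <p>, however deeply nested.
all_paragraphs = tree.xpath("//p")

# Attribute selection: href values of links that sit directly inside a <p>.
hrefs = tree.xpath("//p/a/@href")

# Parent step: from the first link, go up to its <p> and read the id.
first_link = tree.xpath("//a")[0]
parent_id = first_link.xpath("../@id")

# Attribute wildcard: all attribute values of the link that has a title.
link_b_attrs = tree.xpath("//a[@title]/@*")

# Node tests: text directly inside paragraphs, and comments anywhere.
paragraph_text = [t.strip() for t in tree.xpath("//p/text()")]
comments = [c.text.strip() for c in tree.xpath("//comment()")]

print(len(direct_paragraphs), len(all_paragraphs))  # 1 2
print(hrefs)           # ['https://example.com/a', 'https://example.com/b']
print(parent_id)       # ['first']
print(link_b_attrs)    # ['https://example.com/b', 'B']
print(paragraph_text)  # ['Hello', 'World']
print(comments)        # ['a comment']
```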

In [1]:
from lxml import html

In [2]:
%run xpath_intro.py

In [3]:
simple_tree

<Element html at 0x1b15d8e38b0>

In [4]:
simple_tree.xpath('//p')

[<Element p at 0x1b15d918d60>, <Element p at 0x1b15d918db0>]

In [5]:
simple_tree.xpath('//p/text()')

['Lorem ipsum dolor sit amet, ...',
 'Nunc cursus, justo eget elementum dictum, ... ']

In [6]:
simple_tree.xpath('node()')

['\n',
 <Element head at 0x1b15d951770>,
 '\n',
 <Element body at 0x1b15d9517c0>,
 '\n']

In [7]:
simple_tree.xpath('*')

[<Element head at 0x1b15d951770>, <Element body at 0x1b15d9517c0>]

In [8]:
simple_tree.xpath('body/*')

[<Element div at 0x1b15d95a0e0>]

In [9]:
simple_tree.xpath('body/div/*')

[<Element div at 0x1b15d7e1ae0>]

In [10]:
simple_tree.xpath('body/div/div/*')

[<Element div at 0x1b15d95a040>,
 <Element div at 0x1b15d95a220>,
 <Element div at 0x1b15d95a270>]

In [11]:
simple_tree.xpath('body//div/*')

[<Element div at 0x1b15d7e1ae0>,
 <Element div at 0x1b15d95a040>,
 <Element div at 0x1b15d95a220>,
 <Element div at 0x1b15d95a540>,
 <Element ul at 0x1b15d95a590>,
 <Element div at 0x1b15d95a5e0>,
 <Element ul at 0x1b15d95a630>,
 <Element div at 0x1b15d95a270>,
 <Element div at 0x1b15d95a680>,
 <Element p at 0x1b15d918d60>,
 <Element div at 0x1b15d95a6d0>,
 <Element p at 0x1b15d918db0>]

In [12]:
simple_tree.xpath('//div/*')

[<Element div at 0x1b15d7e1ae0>,
 <Element div at 0x1b15d95a040>,
 <Element div at 0x1b15d95a220>,
 <Element div at 0x1b15d95a540>,
 <Element ul at 0x1b15d95a590>,
 <Element div at 0x1b15d95a5e0>,
 <Element ul at 0x1b15d95a630>,
 <Element div at 0x1b15d95a270>,
 <Element div at 0x1b15d95a680>,
 <Element p at 0x1b15d918d60>,
 <Element div at 0x1b15d95a6d0>,
 <Element p at 0x1b15d918db0>]
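
`body//div/*` and `//div/*` return the same list here because every `div` lives inside `body`. A related point worth checking is that a path starting with `//` always searches the whole document, even when `.xpath()` is called on a nested element, while `.//` stays inside that element. The sketch below uses an invented two-`div` page to show the difference.

```python
from lxml import html

# Hypothetical page, made up for illustration only.
page = html.fromstring(
    "<html><body>"
    '<div id="a"><p>in a</p></div>'
    '<div id="b"><p>in b</p></div>'
    "</body></html>"
)

div_b = page.xpath('//div[@id="b"]')[0]

# "//" restarts from the document root, ignoring the element it is called on.
whole_document = div_b.xpath("//p/text()")

# ".//" searches only the descendants of div_b.
subtree_only = div_b.xpath(".//p/text()")

print(whole_document)  # ['in a', 'in b']
print(subtree_only)    # ['in b']
```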

In [13]:
simple_tree.xpath('//div/text()')

['\n',
 '\n',
 '\nHeader\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n\n']

In [14]:
simple_tree.xpath('//div/node()')

['\n',
 <Element div at 0x1b15d7e1ae0>,
 '\n',
 <Element div at 0x1b15d95a040>,
 '\nHeader\n',
 '\n',
 <Element div at 0x1b15d95a220>,
 '\n',
 <Element div at 0x1b15d95a540>,
 '\n',
 <Element ul at 0x1b15d95a590>,
 '\n',
 '\n',
 <Element div at 0x1b15d95a5e0>,
 '\n',
 <Element ul at 0x1b15d95a630>,
 '\n',
 '\n',
 '\n',
 <Element div at 0x1b15d95a270>,
 '\n',
 <Element div at 0x1b15d95a680>,
 '\n',
 <Element p at 0x1b15d918d60>,
 '\n',
 '\n',
 <Element div at 0x1b15d95a6d0>,
 '\n',
 <Element p at 0x1b15d918db0>,
 '\n',
 '\n',
 '\n',
 '\n\n']
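
Most of the strings above are whitespace between tags. One way to drop them, sketched here on an invented snippet rather than the notebook's page, is to add a `[normalize-space()]` predicate, which keeps only text nodes that contain something other than whitespace.

```python
from lxml import html

# Hypothetical markup, made up for illustration only.
page = html.fromstring(
    "<html><body><div>\n<div>\nHeader\n</div>\n<p>Body text</p>\n</div></body></html>"
)

# All text nodes directly inside any <div>, whitespace included.
raw_text = page.xpath("//div/text()")

# Only text nodes with visible content; strip() then trims the newlines.
visible_text = [t.strip() for t in page.xpath("//div/text()[normalize-space()]")]

print(visible_text)  # ['Header']
```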

##### Can you show all of the text that is in lists on the page?

In [17]:
simple_tree.xpath("//li/text()")

['Foo', 'Bar', 'Boo', 'Far']

##### Can you find all of the attributes on the page?

In [19]:
simple_tree.xpath("//@*")

['http://www.w3.org/1999/xhtml',
 'en',
 'en',
 'container',
 'content',
 'clearfix',
 'header',
 'nav',
 'navblock',
 'navblock',
 'maincontent',
 'contentblock',
 'contentblock']
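
`//@*` returns only the attribute values, so it is not obvious which attribute or element each one came from. In lxml the returned strings are "smart strings" that remember their origin through `getparent()` and `attrname`. The sketch below uses an invented page to pair each value with its element and attribute name.

```python
from lxml import html

# Hypothetical page, made up for illustration only.
page = html.fromstring(
    '<html lang="en"><body><div id="nav" class="navblock">x</div></body></html>'
)

# Each result is a str subclass carrying a link back to its owning element.
attribute_rows = [
    (value.getparent().tag, value.attrname, str(value))
    for value in page.xpath("//@*")
]

for row in attribute_rows:
    print(row)
# ('html', 'lang', 'en')
# ('div', 'id', 'nav')
# ('div', 'class', 'navblock')
```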

##### Can you find all of the links on the page?

In [28]:
# "//@a" selects attributes *named* a, not <a> elements; for links use "//a" or "//a/@href".
simple_tree.xpath("//@a")

[]
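
The empty list comes from the `@`, which asks for attributes named `a` rather than `<a>` elements. A minimal sketch of the difference, on an invented page that does contain links:

```python
from lxml import html

# Hypothetical page, made up for illustration only.
page = html.fromstring(
    "<html><body>"
    '<p>See <a href="/docs">the docs</a> or <a href="/faq">the FAQ</a>.</p>'
    "</body></html>"
)

attrs_named_a = page.xpath("//@a")      # attributes called "a": there are none
link_elements = page.xpath("//a")       # the <a> elements themselves
link_targets = page.xpath("//a/@href")  # where each link points
link_labels = page.xpath("//a/text()")  # the visible link text

print(attrs_named_a)       # []
print(len(link_elements))  # 2
print(link_targets)        # ['/docs', '/faq']
print(link_labels)         # ['the docs', 'the FAQ']
```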

##### Can you get to the style sheet information?

In [29]:
simple_tree.xpath("//style")

[<Element style at 0x1b15da0f400>, <Element style at 0x1b15da0f450>]

In [30]:
# OR
simple_tree.xpath("//head/style")

[<Element style at 0x1b15da0f400>]

In [31]:
# OR
simple_tree.xpath("//body/style")

[]

##### `//elem[@attr="foo"]` will match elements where that attribute is equal to `foo`. Find just the divs that have the class "contentblock".

In [36]:
simple_tree.xpath('//div[@class="contentblock"]')

[<Element div at 0x1b15d95a680>, <Element div at 0x1b15d95a6d0>]

In [40]:
simple_tree.xpath('//div[contains(@class, "contentblock")]')

[<Element div at 0x1b15d95a680>, <Element div at 0x1b15d95a6d0>]
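
The two queries agree on this page, but they are not equivalent in general: `=` compares the whole attribute string, while `contains()` is a plain substring test. The sketch below uses invented markup to show where they diverge, plus a commonly used `concat`/`normalize-space` idiom for matching a single class token.

```python
from lxml import html

# Hypothetical page, made up for illustration only.
page = html.fromstring(
    "<html><body>"
    '<div class="contentblock">one</div>'
    '<div class="contentblock wide">two</div>'
    '<div class="contentblock-footer">three</div>'
    "</body></html>"
)

# Exact match: the class attribute must be exactly "contentblock".
exact = page.xpath('//div[@class="contentblock"]/text()')

# Substring match: also catches "contentblock-footer".
loose = page.xpath('//div[contains(@class, "contentblock")]/text()')

# Token match: pad the class list with spaces, then look for " contentblock ".
token = page.xpath(
    '//div[contains(concat(" ", normalize-space(@class), " "), " contentblock ")]/text()'
)

print(exact)  # ['one']
print(loose)  # ['one', 'two', 'three']
print(token)  # ['one', 'two']
```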

##### Try the above without `[]` and using `/` to get the `@attr`. What kind of response do you get?

In [44]:
simple_tree.xpath('//div/@class="contentblock"')

True
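
Without the brackets, the expression is no longer a path with a filter. It is a comparison between a set of attribute nodes and a string, and XPath evaluates it to a single boolean that is true if any `div` has a `class` equal to the string. lxml hands that back as a Python `bool`. A small sketch on invented markup, which also shows that numeric XPath results come back as `float`:

```python
from lxml import html

# Hypothetical page, made up for illustration only.
page = html.fromstring(
    "<html><body>"
    '<div class="contentblock">one</div>'
    '<div class="contentblock">two</div>'
    '<div class="header">three</div>'
    "</body></html>"
)

# Comparison: true if at least one div/@class equals the string.
any_match = page.xpath('//div/@class="contentblock"')
no_match = page.xpath('//div/@class="missing"')

# Predicate form: returns the matching elements instead of a boolean.
matching_divs = page.xpath('//div[@class="contentblock"]')

# XPath functions can return numbers as well; lxml maps them to float.
match_count = page.xpath('count(//div[@class="contentblock"])')

print(any_match, no_match)  # True False
print(len(matching_divs))   # 2
print(match_count)          # 2.0
```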