# Parsing HTML documents with the lxml library

![Aurum logo](aurum.gif)

- Dev at Aurum, working with extraction, processing and analysis of legal data
- Majoring in Information Systems
- Other cool things: cats, art, videogames, social justice, traveling, Wikipedia

## lxml

[Link](https://lxml.de/)
- Library (or _toolkit_) for processing XML and HTML
- A newer _binding_ for the C libraries libxml2 and libxslt. These already had Python _bindings_, but those were considered complex and not very Pythonic
- I came across it while working at Aurum on parsing complex HTML documents

_Binding_: an interface that lets code written in one programming language be used from another language

## Let's see code!
This talk follows, as far as possible, the topic order of the wonderful [lxml documentation](https://lxml.de/4.3/lxmldoc-4.3.0.pdf), so it should feel familiar if you want to consult it later.

## What we are going to parse

[Folha informativa - Violência contra as mulheres](https://www.paho.org/bra/index.php?option=com_content&view=article&id=5669:folha-informativa-violencia-contra-as-mulheres&Itemid=820) from OPAS/WHO Brazil (World Health Organization)

## What we're going to need
`etree` gives us a tree of HTML elements that we can iterate over, edit, search, and so on. Parsing our HTML with the `html` module creates that tree.

In [122]:
import lxml.html


tree = lxml.html.parse('opas_violencia_contra_mulheres.html')

In [123]:
type(tree)

lxml.etree._ElementTree

In [124]:
lxml.html.tostring(tree)[:500]

b'<!DOCTYPE html>\n<html xmlns="https://www.w3.org/1999/xhtml" xml:lang="pt-br" lang="pt-br" dir="ltr">\n<head>\n\t<meta http-equiv="X-UA-Compatible" content="IE=edge">\n \n\t<script type="text/javascript">\n\t\tvar pathInfo = {\n\t\t\tbase: \'templates/Responsive/\',\n\t\t\tcss: \'css/\',\n\t\t\tjs: \'js/\',\n\t\t\tswf: \'swf/\',\n\t\t}\n\t</script>\n\n\t<meta name="viewport" content="width=device-width, initial-scale=1.0">\n\t<!-- <meta name="theme-color" content="#0099d9" /> -->\n\t<meta http-equiv="content-type" content="text/html; charse'
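If the page isn't saved on disk, a tree can also be built from data in memory. This is a minimal sketch, assuming a tiny made-up document (the `sample` bytes are a stand-in, not the OPAS page):

```python
import io

import lxml.html

# Hypothetical stand-in for the real page, kept tiny on purpose.
sample = (
    b"<!DOCTYPE html><html><head><title>Sample</title></head>"
    b"<body><p>Hello</p></body></html>"
)

# parse() accepts a filename or a file-like object and returns an ElementTree.
tree = lxml.html.parse(io.BytesIO(sample))

# fromstring() takes the markup itself and returns the root element directly.
root = lxml.html.fromstring(sample)

print(type(tree).__name__)
print(tree.getroot().tag, root.tag)
```

So `parse` gives you the tree wrapper (and you call `getroot()` on it), while `fromstring` skips that step and hands you an element.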

### There's plenty of stuff that doesn't interest us. What we want is in the _body_
First, let's get to the _root_ of our tree. From it we can reach the _body_.

In [125]:
root = tree.getroot()
root

<Element html at 0x7f7e25b90728>

In [126]:
body = root.body
body

<Element body at 0x7f7e25b90ae8>

In [127]:
lxml.html.tostring(body)[:500]

b'<body itemscope itemtype="https://schema.org/WebPage">\n<!-- Google Tag Manager -->\n<noscript><iframe src="//www.googletagmanager.com/ns.html?id=GTM-MDCJXB" title="Google Tag Manager" style="height:0px;width:0px;display:none;visibility:hidden"><span style="visibility:hidden">Google Tag</span></iframe></noscript>\n<script type="text/javascript">(function(w,d,s,l,i){w[l]=w[l]||[];w[l].push({\'gtm.start\':\nnew Date().getTime(),event:\'gtm.js\'});var f=d.getElementsByTagName(s)[0],\nj=d.createElement(s),dl'
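`tostring` returns bytes by default, which is why the outputs above start with `b'`. A small sketch of the alternatives, assuming a made-up sample document rather than the real page:

```python
import lxml.html

# Hypothetical sample document, only for illustration.
root = lxml.html.fromstring(
    "<html><head><title>Sample</title></head>"
    "<body><p>Hello <b>world</b></p></body></html>"
)
body = root.body   # shortcut property for the <body> element
head = root.head   # same idea for <head>

as_bytes = lxml.html.tostring(body)                      # bytes
as_text = lxml.html.tostring(body, encoding="unicode")   # str
plain = body.text_content()                              # text only, no markup

print(as_text)
print(plain)
```

Passing `encoding="unicode"` is handy when you want to print or slice the markup as a regular string.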

### Now that we have the _body_, we can extract some information
We can use `etree` methods, like `findall`, or XPath directly. There is also a module called `cssselect`, which needs to be installed separately.

- With `findall`, we can find all elements matching a simple XPath expression
- With `find`, only the first matching element is returned
- With `findtext`, the returned value is the text content of the first matching element
- With `iterfind`, we get a `generator` over all matching elements

In [128]:
body.findall('.//p')[:3]

[<Element p at 0x7f7e25bc3138>,
 <Element p at 0x7f7e25bc3188>,
 <Element p at 0x7f7e25bc31d8>]

In [129]:
body.find('.//p')

<Element p at 0x7f7e25bc3138>

In [130]:
body.findtext('.//p')

'Brasil'

In [131]:
body.iterfind('.//p')

<generator at 0x7f7e25ba04c8>
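It's also worth knowing what these methods return when nothing matches. A minimal sketch on a made-up fragment (the markup is an assumption, not taken from the page):

```python
import lxml.html

# Hypothetical fragment with two paragraphs and no table.
div = lxml.html.fromstring(
    "<div><p>first</p><p>second</p><span>other</span></div>"
)

paragraphs = div.findall(".//p")                 # list of every match
first = div.find(".//p")                         # first match only
missing = div.find(".//table")                   # no match: None
text = div.findtext(".//p")                      # text of the first match
fallback = div.findtext(".//table", default="")  # no match: the default
lazy = [p.text for p in div.iterfind(".//p")]    # consume the generator

print(len(paragraphs), first.text, missing, text, repr(fallback), lazy)
```

Because `find` returns `None` on a miss, chaining something like `find(...).text` without a check is a common source of `AttributeError`.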

### However, with XPath, we have access to the full language, not just simple expressions. So let's use XPath to find some statistics in the text (percentages).

In [132]:
body.xpath('//*/text()[contains(., "%")]')[:5]

['\n\t\tjQuery(window).on(\'load\', function () {\n\t\tjQuery(\'iframe[id^=twitter-widget-]\').each(function () {\n\t\tvar head = jQuery(this).contents().find(\'head\');\n\t\tif (head.length) {\n\t\thead.append(\'<style type="text/css">.timeline { max-width: 100% !important; width: 100% !important; } .timeline .stream { max-width: none !important; width: 100% !important; }</style>\');\n\t\t}\n\t\tjQuery(\'.twitter-timeline\').append(jQuery(\'\'));\n\t\t})\n\t\t});\n\t',
 'Estimativas globais publicadas pela OMS indicam que aproximadamente uma em cada três mulheres (35%) em todo o mundo sofreram violência física e/ou sexual por parte do parceiro ou de terceiros durante a vida.',
 'A maior parte dos casos é de violência infligida por parceiros. Em todo o mundo, quase um terço (30%) das mulheres que estiveram em um relacionamento relatam ter sofrido alguma forma de violência física e/ou sexual na vida por parte de seu parceiro.',
 'Globalmente, 38% dos assassinatos de mulheres são cometid
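One XPath detail to keep in mind: an expression starting with `//` is evaluated from the document root, even when called on `body`, while `.//` stays inside the element. A sketch on a made-up document (the sample markup is an assumption):

```python
import lxml.html

# Hypothetical document with a "%" in the head and one in the body.
doc = lxml.html.fromstring(
    "<html><head><title>50% off</title></head>"
    "<body><p>35% of cases</p><p>no numbers</p></body></html>"
)
body = doc.body

# Absolute path: searches the whole document, head included.
absolute = body.xpath('//*/text()[contains(., "%")]')

# Relative path: searches only below <body>.
relative = body.xpath('.//text()[contains(., "%")]')

print(absolute)
print(relative)
```

On the page used in this talk the difference may not matter much, since the interesting text lives in the body anyway, but on other documents `.//` is usually what you want.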

### Our first result has nothing to do with statistics: it's the text of a _script_ element. We can take this opportunity to remove all the scripts so that our tree gets cleaner.

In [133]:
from lxml.html.clean import Cleaner


cleaner = Cleaner(scripts=True)
body = cleaner.clean_html(body)
body.findall('.//script')

[]
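As far as I know, recent lxml releases ship the `clean` module as a separate package (`lxml_html_clean`), so the import above may need an extra install. If all you need is to drop scripts, `etree.strip_elements` is an alternative. A minimal sketch on a made-up fragment, not the real page:

```python
import lxml.etree
import lxml.html

# Hypothetical fragment: one paragraph, one script followed by tail text.
div = lxml.html.fromstring(
    "<div><p>35% of women</p>"
    "<script>var label = '100%';</script> text after the script</div>"
)

# Remove every <script> in place. with_tail=False keeps the text that
# follows each removed element; more tag names can be passed as well.
lxml.etree.strip_elements(div, "script", with_tail=False)

scripts = div.findall(".//script")
matches = div.xpath('.//text()[contains(., "%")]')

print(scripts)
print(matches)
print(div.text_content())
```

Unlike `Cleaner.clean_html`, which returned a cleaned copy above, `strip_elements` modifies the tree you pass in.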

In [134]:
body.xpath('//*/text()[contains(., "%")]')[:3]

['Estimativas globais publicadas pela OMS indicam que aproximadamente uma em cada três mulheres (35%) em todo o mundo sofreram violência física e/ou sexual por parte do parceiro ou de terceiros durante a vida.',
 'A maior parte dos casos é de violência infligida por parceiros. Em todo o mundo, quase um terço (30%) das mulheres que estiveram em um relacionamento relatam ter sofrido alguma forma de violência física e/ou sexual na vida por parte de seu parceiro.',
 'Globalmente, 38% dos assassinatos de mulheres são cometidos por um parceiro masculino.']
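XPath gives us the sentences; to get the numbers themselves, a regular expression can finish the job. This is a sketch under assumptions: the sample sentences are made up, and the pattern only covers simple forms like `35%` or `5,5%`:

```python
import re

import lxml.html

# Hypothetical sentences, loosely modelled on the ones in the page.
div = lxml.html.fromstring(
    "<div>"
    "<p>About one in three women (35%) report violence.</p>"
    "<p>Roughly 20% of women and 5%-10% of men report abuse.</p>"
    "<p>No statistics here.</p>"
    "</div>"
)

texts = div.xpath('.//text()[contains(., "%")]')

# Digits, an optional decimal part (comma or dot), then a percent sign.
pattern = re.compile(r"\d+(?:[.,]\d+)?%")
percentages = [match for text in texts for match in pattern.findall(text)]

print(percentages)
```

A range like `5%-10%` comes out as two separate values here; whether that is what you want depends on the analysis.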

### Suppose we only wanted statistics about Brazil. In fact, none refers to Brazil alone, but _suppose_ that weren't the case. In a large document, it could be useful to remove the elements that don't mention Brazil

In [135]:
percentage_els = body.xpath('//*[contains(., "%")]')

for el in percentage_els:
    text = el.text_content()
    if 'Brasil' not in text and el.tag in ('li', 'p'):
        el.getparent().remove(el)
        
body.xpath('//*[contains(., "%")]')

[]

### Dissecting
- `text = el.text_content()`: getting the text from an element
- `if 'Brasil' not in text and el.tag in ('li', 'p')`: if "Brasil" isn't in the text and the element is a list item or a paragraph
- `el.getparent().remove(el)`: getting the parent of our element and removing our element from it (this is how we remove a specific element)
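One detail of this removal idiom: in lxml, the text that comes right after an element (its _tail_) belongs to that element, so `remove` takes it along. `drop_tree`, from `lxml.html`, keeps the tail. A small sketch on a made-up fragment:

```python
import lxml.html

# Hypothetical markup: a <b> element followed by tail text.
markup = "<div><p>keep <b>bold 35%</b> tail text</p></div>"

# Variant 1: parent.remove(el) drops the element together with its tail.
removed = lxml.html.fromstring(markup)
bold = removed.find(".//b")
bold.getparent().remove(bold)

# Variant 2: drop_tree() drops the element but keeps the tail text.
dropped = lxml.html.fromstring(markup)
dropped.find(".//b").drop_tree()

print(repr(removed.text_content()))
print(repr(dropped.text_content()))
```

For whole paragraphs or list items, like in the cell above, there is usually no meaningful tail, so `remove` is fine there.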

**Attention!** Our `percentage_els` list still contains those elements. The list holds references to the element objects, so removing an element from the tree only detaches it; the object and its text stay intact.

In [136]:
percentage_els[-2].text_content()

'A violência por parte de parceiro e a violência sexual são perpetradas principalmente por homens contra as mulheres. O abuso sexual infantil afeta meninos e meninas. Estudos internacionais revelam que aproximadamente 20% das mulheres e 5%-10% dos homens relatam terem sido vítimas de violência sexual na infância. A violência entre os jovens, incluindo em relacionamentos, é também um grande problema. \xa0'
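The same behaviour can be reproduced on a tiny tree. A minimal sketch, assuming made-up paragraphs instead of the real page:

```python
import lxml.html

# Hypothetical fragment: two paragraphs with "%", one without.
div = lxml.html.fromstring(
    "<div><p>Global: 35%</p><p>Brasil: 10%</p><p>No numbers</p></div>"
)

matches = div.xpath('.//p[contains(., "%")]')
for el in matches:
    if "Brasil" not in el.text_content():
        el.getparent().remove(el)

# The tree no longer has the first paragraph...
remaining = [p.text for p in div.findall("p")]

# ...but the list still references it, now detached from any parent.
detached = matches[0]

print(remaining)
print(len(matches), detached.text, detached.getparent())
```

Iterating over `matches` while removing is safe here because we loop over the list, not over the tree itself.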

### If we want to find elements by id or class, we don't have to use XPath. We can use the `find_class` and `get_element_by_id` methods

In [137]:
topics = body.find_class('bkbutton')

for el in topics:
    print(el.text_content())

Principais informações 
Magnitude do problema
Fatores de risco 
Consequências para a saúde   
Impacto em crianças 
Custos sociais e econômicos  
Prevenção e resposta  
Resposta da OMS
Dia Laranja


In [138]:
body.get_element_by_id('footer')

<Element footer at 0x7f7e25bad598>
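Two behaviours of these methods are easy to miss: `find_class` also matches elements that carry several classes, and `get_element_by_id` raises `KeyError` when the id is missing unless you pass a default. A sketch on made-up markup (only the class name `bkbutton` and the id `footer` come from the page):

```python
import lxml.html

# Hypothetical fragment reusing the class and id names seen above.
root = lxml.html.fromstring(
    '<div>'
    '<a class="bkbutton primary" href="#a">Topic A</a>'
    '<a class="bkbutton" href="#b">Topic B</a>'
    '<div id="footer">End</div>'
    '</div>'
)

topics = [el.text_content() for el in root.find_class("bkbutton")]
footer = root.get_element_by_id("footer")

# With a default, a missing id returns the default instead of raising.
sidebar = root.get_element_by_id("sidebar", None)

try:
    root.get_element_by_id("sidebar")
    raised = False
except KeyError:
    raised = True

print(topics, footer.tag, sidebar, raised)
```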

### Since we just made use of `getparent`, let's see some ways we can "navigate" through our tree with `etree` methods

Getting child elements

In [139]:
body.getchildren()

[<Element noscript at 0x7f7e25b905e8>, <Element div at 0x7f7e25bc3cc8>]

Iterating over descendants

In [140]:
list(body.iterdescendants())[:5]

[<Element noscript at 0x7f7e25b905e8>,
 <Element span at 0x7f7e25b90818>,
 <Element div at 0x7f7e25bc3cc8>,
 <Element div at 0x7f7e25bad548>,
 <Element header at 0x7f7e25bad5e8>]

Iterating over ancestors

In [141]:
list(body.getchildren()[0].iterancestors())

[<Element body at 0x7f7e25bc3098>]

Getting the next element

In [142]:
body.getchildren()[0].getnext()

<Element div at 0x7f7e25bc3cc8>

Iterating over the next siblings

In [143]:
list(body.getchildren()[0].itersiblings())

[<Element div at 0x7f7e25bc3cc8>]

Getting the previous element

In [144]:
body.getchildren()[1].getprevious()

<Element noscript at 0x7f7e25b905e8>

Getting the original tree

In [145]:
body.getroottree()

<lxml.etree._ElementTree at 0x7f7e25a8fa08>
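To see all of these moves side by side, here is a sketch on a tiny made-up document. Note that the lxml documentation marks `getchildren()` as deprecated and suggests `list(element)` or plain iteration instead, which is what this sketch uses:

```python
import lxml.html

# Hypothetical document, small enough to follow by eye.
root = lxml.html.fromstring(
    "<html><body><p>first</p><ul><li>item</li></ul></body></html>"
)
body = root.body
p, ul = list(body)        # children, without getchildren()
li = ul[0]                # elements can be indexed like lists

children = [el.tag for el in body]
descendants = [el.tag for el in body.iterdescendants()]
ancestors = [el.tag for el in li.iterancestors()]
next_tag = p.getnext().tag
previous_tag = ul.getprevious().tag
following = [el.tag for el in p.itersiblings()]
preceding = [el.tag for el in ul.itersiblings(preceding=True)]
tree_root_tag = li.getroottree().getroot().tag

print(children, descendants, ancestors)
print(next_tag, previous_tag, following, preceding, tree_root_tag)
```

`getnext()` and `getprevious()` return `None` at the edges, so `p.getprevious()` is `None` in this tree.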

# Thanks :)
<br>

- GitHub: alana91
- Telegram: AlanaDB
- E-mail: alanadomitbittar@pm.me
- Linkedin: https://www.linkedin.com/in/alanadomitbittar/