# Pull data out of HTML files

### HTML - XPath

A user agent (such as a browser) converts an HTML document into a *DOM* (Document Object Model). This conversion lets us search for a specific node.

The most natural way to represent it is with a `tree`.

The ElementTree API (`ET`), which `lxml` also follows, has two classes for this purpose:
  - `ElementTree`, which represents the whole XML document as a tree, and
  - `Element`, which represents a single node in this tree.

Interactions with the `whole document` (reading and writing to/from files) are usually done on the `ElementTree` level. 

Interactions with a `single XML element` and its sub-elements are done on the `Element` level.
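
The split between the two levels can be sketched with the standard library's `xml.etree.ElementTree`. This is only an illustration: the `DOCUMENT` string is a made-up, well-formed snippet, and the notebook itself uses `lxml` rather than the standard library module.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical, well-formed document used only for illustration.
DOCUMENT = "<html><body><p id='intro'>Hello</p><p>World</p></body></html>"

# Document level: ET.parse() reads from a file name or a file-like object
# and returns an ElementTree that wraps the whole document.
tree = ET.parse(io.StringIO(DOCUMENT))

# Element level: getroot() hands back the root node, which is an Element.
root = tree.getroot()
first_paragraph = root.find("body/p")

print(type(tree).__name__, type(root).__name__)
print(root.tag, first_paragraph.get("id"), first_paragraph.text)

# Writing is again a document-level operation on the ElementTree.
buffer = io.BytesIO()
tree.write(buffer)
print(buffer.getvalue())
```

Reading and writing go through `tree`, while tag, attribute, and text access go through `root` and the elements below it.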

In [66]:
from lxml.html import fromstring, tostring, parse
import requests
from pathlib import Path
from urllib.parse import urljoin, urlparse, urlsplit
from dotenv import dotenv_values
import re
from operator import getitem
from collections import namedtuple
import pandas as pd

The following example extracts information about some products from an HTML file with `XPath` and saves the data in a DataFrame.

In [69]:
root = parse('Comprar Productos Lácteos_ Yogurt, Leche y más.html')

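`parse` is a function of the `lxml.html` module that parses an HTML document from a file name, a URL, or a file-like object and returns an `ElementTree`. `fromstring` parses HTML from a string and returns an `Element`; when the string is a full document (it has an `html` element), that element is the `html` root.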

Recall that *parsing* is the process of reading raw HTML code and generating a `DOM tree` object structure from it. This lets us search for nodes with XPath.
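
As a rough sketch of the two entry points, the snippet below parses the same made-up page once from a string and once from a file-like object. The `PAGE` content is hypothetical; an in-memory `io.StringIO` stands in for a file on disk.

```python
import io
from lxml.html import fromstring, parse

# Hypothetical page used only for illustration.
PAGE = (
    "<html><head><title>Shop</title></head>"
    "<body><p class='price'>S/. 4.20</p></body></html>"
)

# fromstring() takes the markup itself and returns an Element.
root_element = fromstring(PAGE)

# parse() takes a file name, URL, or file-like object and returns an ElementTree;
# getroot() gives the root Element of that tree.
tree = parse(io.StringIO(PAGE))
root_from_tree = tree.getroot()

print(root_element.tag, root_from_tree.tag)
print(root_element.xpath("//p[@class='price']/text()"))
print(tree.xpath("//title/text()"))  # xpath() is available on the tree as well
```

Both objects accept `.xpath(...)`, which is why the notebook can query the result of `parse` directly.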

For this example, we have downloaded the HTML of the online store of a Peruvian supermarket, and we will extract information such as price, name, and SKU.

In [70]:
target_node = root.xpath("//div[contains(@class,'product-shelf') and contains(@class,'n18colunas')]")

In [71]:
target_node_value = getitem(target_node, 0)

In [109]:
array_products = target_node_value.xpath(
"ul/li/div[contains(@class, 'product-item') and not(contains(@class, 'product-item__cart-amount'))]"
)
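
The `not(contains(...))` clause in the expression above matters because `contains()` is a substring test, so `'product-item'` also matches `product-item__cart-amount`. The sketch below shows the effect on a made-up fragment; the markup and the variable names `loose`, `strict`, and `exact` are assumptions for illustration only.

```python
from lxml.html import fromstring

# Hypothetical markup that mimics the class names used above.
SNIPPET = """
<ul>
  <li><div class="product-item product-item--1" data-name="Milk"></div></li>
  <li><div class="product-item__cart-amount">2</div></li>
  <li><div class="product-item product-item--2" data-name="Butter"></div></li>
</ul>
"""

shelf = fromstring(SNIPPET)

# contains() is a substring test, so the cart widget matches as well.
loose = shelf.xpath(".//li/div[contains(@class, 'product-item')]")

# Excluding the unwanted class keeps only the product cards.
strict = shelf.xpath(
    ".//li/div[contains(@class, 'product-item') "
    "and not(contains(@class, 'product-item__cart-amount'))]"
)

# Alternative: match 'product-item' as a whole, space-delimited class token.
exact = shelf.xpath(
    ".//li/div[contains(concat(' ', normalize-space(@class), ' '), ' product-item ')]"
)

print(len(loose), len(strict), len(exact))
print([div.get("data-name") for div in strict])
```

The token-based variant avoids listing every class to exclude, at the cost of a longer expression.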

In [110]:
Product = namedtuple('Product', ['class_name', 'data_name', 'data_price', 'data_brand', 'data_sku'])

In [111]:
array_data_products = []
for item_product in array_products:
    product_facts = Product(
        class_name=item_product.get('class'),
        data_name=item_product.get('data-name'),
        data_price=item_product.get('data-price'),
        data_brand=item_product.get('data-brand'),
        data_sku=item_product.get('data-sku')
    )
    array_data_products.append(product_facts)

In [112]:
dataset_products = pd.DataFrame(array_data_products)

In [113]:
dataset_products.head()

|   | class_name | data_name | data_price | data_brand | data_sku |
|---|---|---|---|---|---|
| 0 | product-item product-item--572685 gotten-produ... | Sixpack Leche Concentrada Sin Lactosa Laive Bo... | S/. 24.95 | Laive | 39170436 |
| 1 | product-item product-item--5421 gotten-product... | Yogurt Parcialmente Descremado Gloria Fresa Ga... | S/. 10.90 | Gloria | 5423 |
| 2 | product-item product-item--4236 gotten-product... | Mantequilla con Sal Gloria 390g | S/. 21.50 | Gloria | 4238 |
| 3 | product-item product-item--1021949 gotten-prod... | Sixpack Leche Ultrafiltrada Gloria Sin Lactosa... | S/. 25.80 | Gloria | 39257749 |
| 4 | product-item product-item--1019717 gotten-prod... | Margarina Sello de Oro 200g | S/. 4.20 | Sello de Oro | 39255525 |
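
The `data_price` values arrive as strings such as `S/. 24.95`. If numeric prices are needed, a small helper along these lines could convert them. `parse_price` is a hypothetical name, and the sketch assumes a dot as decimal separator and no thousands separator, as in the rows shown above.

```python
import re
from decimal import Decimal


def parse_price(raw):
    """Turn a string such as 'S/. 24.95' into Decimal('24.95').

    Returns None when no number is found. Assumes a dot as decimal
    separator and no thousands separator.
    """
    match = re.search(r"\d+(?:\.\d+)?", raw or "")
    return Decimal(match.group(0)) if match else None


prices = ["S/. 24.95", "S/. 10.90", "S/. 4.20", None]
print([parse_price(p) for p in prices])
```

On the DataFrame, something like `dataset_products['data_price'].map(parse_price)` would apply the helper to the whole column.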


In [59]:
config = dotenv_values('./env/.env')

In [102]:
user_agent = 'Mozilla/5.0'

response = requests.get(
    config['ESIKA_PATH'], 
    headers={'User-agent': user_agent})

root = fromstring(response.text)

As an `Element` (a node in the DOM), `root` has a tag and a dictionary of attributes:

- `element.tag`
- `element.attrib`
- `element.get("name_of_attrib")`

In [103]:
print('Tag name current:', root.tag)
print('Attributes of Tag name current: ',root.attrib)
print("Get the attribute LANG:" , root.get("lang"))

Tag name current: html
Attributes of Tag name current:  {'lang': 'es-PE'}
Get the attribute LANG: es-PE


Element has some useful methods that help iterate recursively over all the sub-tree below it (its children, their children, and so on)

For example, `Element.iter()`

In [101]:
for img in root.iter('img'):
    src_img = img.get('src')
    if src_img:
        print(getitem(src_img.split('/'), -1))

244545-150-auto?v=638167518158970000&width=150&height=auto&aspect=true
244544-150-auto?v=638167518157400000&width=150&height=auto&aspect=true
244546-150-auto?v=638167518162230000&width=150&height=auto&aspect=true
244547-150-auto?v=638167518164430000&width=150&height=auto&aspect=true
244548-150-auto?v=638167518166770000&width=150&height=auto&aspect=true
244549-150-auto?v=638167518168930000&width=150&height=auto&aspect=true
244545-800-auto?v=638167518158970000&width=800&height=auto&aspect=true
244544-800-auto?v=638167518157400000&width=800&height=auto&aspect=true
244546-800-auto?v=638167518162230000&width=800&height=auto&aspect=true
244547-800-auto?v=638167518164430000&width=800&height=auto&aspect=true
244548-800-auto?v=638167518166770000&width=800&height=auto&aspect=true
244549-800-auto?v=638167518168930000&width=800&height=auto&aspect=true
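
The names printed above still carry their query strings, because splitting on `/` does not separate the path from the query. One possible refinement, sketched here with the standard library, uses `urlsplit` to isolate the path first. The host and directory in `src` are made up; only the last segment and the query mirror the output above, and `image_name` is a hypothetical helper.

```python
from pathlib import PurePosixPath
from urllib.parse import parse_qs, urlsplit


def image_name(src):
    """Return the last path segment of a URL, without the query string."""
    return PurePosixPath(urlsplit(src).path).name


# Hypothetical host and directory; the final segment and query mirror the output above.
src = (
    "https://example.com/images/244545-150-auto"
    "?v=638167518158970000&width=150&height=auto&aspect=true"
)

print(image_name(src))
print(parse_qs(urlsplit(src).query)["width"])  # query parameters stay accessible
```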


* `Element.findall()` finds only the elements with the given tag that are direct children of the current element.
* `Element.find()` finds the first child with a particular tag.
* `Element.attrib` returns the attributes of the current element and their values as a dict.
  * Once we know which attributes the element has, `Element.get()` extracts the value of one of them.
* `Element.text` accesses the element’s text content.
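
The difference between direct children and deeper descendants is easy to miss, so here is a small sketch on a made-up page. The `PAGE` markup and the `id` values are assumptions for illustration; the `.//` prefix is the path syntax that makes `findall()` search at any depth.

```python
from lxml.html import fromstring

# Hypothetical page used only for illustration.
PAGE = """
<html><body class="bg-base">
  <div id="outer">Outer text<div id="inner"><p>Nested</p></div></div>
  <div id="second"></div>
</body></html>
"""

root = fromstring(PAGE.strip())
body = root.find("body")               # first direct child named body

direct_divs = body.findall("div")      # direct children only
all_divs = body.findall(".//div")      # descendants at any depth

print([d.get("id") for d in direct_divs])
print([d.get("id") for d in all_divs])
print(dict(body.attrib), body.get("class"))
print(body.find("div").text)           # text before the first child element
print(body.find("span"))               # find() returns None when nothing matches
```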

In [11]:
for elt_class in root.findall('body'):   # body is a direct child of root (<html>)
    print('* ATTRIBUTES: ->'.ljust(15) ,elt_class.tag, elt_class.attrib)
    print('* Get the attribute CLASS:  ->'.ljust(15),elt_class.get('class'))
    print('* Get the TEXT: ->'.ljust(15),elt_class.text)
    print('* Get the CHILDREN: ->'.ljust(15), list(elt_class)) # All direct children of body (getchildren() is deprecated)

* ATTRIBUTES: -> body {'class': 'bg-base'}
* Get the attribute CLASS:  -> bg-base
* Get the TEXT: -> 
  
* Get the CHILDREN: -> [<Element div at 0x1bee4e80220>, <Element div at 0x1bee4e80270>, <Element script at 0x1bee4e802c0>, <Element script at 0x1bee4e80310>, <Element script at 0x1bee4e80360>, <Element template at 0x1bee4e803b0>, <Element template at 0x1bee4e80400>, <Element template at 0x1bee4e80450>, <Element template at 0x1bee4e804a0>, <Element script at 0x1bee4e804f0>, <Element script at 0x1bee4e80540>, <Element script at 0x1bee4e80590>, <Element script at 0x1bee4e805e0>, <Element script at 0x1bee4e80630>, <Element script at 0x1bee4e80680>, <Element script at 0x1bee4e806d0>, <Element script at 0x1bee4e80720>, <Element script at 0x1bee4e80770>, <Element script at 0x1bee4e807c0>, <Element script at 0x1bee4e80810>, <Element script at 0x1bee4e80860>, <Element script at 0x1bee4e808b0>, <Element script at 0x1bee4e80900>, <Element script at 0x1bee4e80950>, <Element script at 0x1bee4e80

In [12]:
elements_body = root.findall('body')            # body is a direct child of root (<html>)

for el_body in elements_body:                   # For each body element
    print('For the tag Name: ',el_body.tag)
    elements_div = el_body.findall('div')       # Find all direct div children
    for el_div in elements_div:
        print(f'> Tag Name {el_div.tag} and their attributes {el_div.attrib}')

For the tag Name:  body
> Tag Name div and their attributes {'id': 'styles_iconpack', 'style': 'display:none'}
> Tag Name div and their attributes {'class': 'render-container render-route-store-not-found-product'}


## XPath support

XPath (XML Path Language) is a language for querying and transforming XML documents.

Web browsers support XML, so they also support XPath, which can therefore be used to extract information from a page.

In the browser console we can run `$x("...")`, replacing the three dots with the XPath expression.

Selector description and the matching XPath selector:

* Select any child of a node: `'*'`
* Select only div children: `'div'`
* Select all links (children or deeper descendants): `'//a'`
* Select all divs with class "main": `'//div[@class="main"]'`
* Select all ul with ID "list": `'//ul[@id="list"]'`
* Select all text from all paragraphs: `'//p/text()'`
* Select all divs which contain "test" in the class: `'//div[contains(@class, "test")]'`
* Select all divs with links or lists in them: `'//div[a|ul]'`
* Select the links with google.com in the href: `'//a[contains(@href, "google.com")]'`
* Select the links that do not contain google.com in the href: `'//a[not(contains(@href, "google.com"))]'`
* Get the value of the src attribute: `'//td/img/@src'`
* Find an element based on its child: `'//td[img[@src="https://www.my_page/my_cat.img"]]'`
* Get all labels whose id starts with "message": `'//label[starts-with(@id, "message")]'`
* Get all links that do not contain google.com in the href but do contain amazon: `'//a[not(contains(@href, "google.com")) and contains(@href, "amazon")]'`
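
To see the selectors above in action, the following sketch runs most of them with `lxml` against a small made-up page and counts the matches. The `PAGE` markup, the image name `cat.png`, and the `selectors` and `counts` names are assumptions for illustration.

```python
from lxml.html import fromstring

# Hypothetical page used only for illustration.
PAGE = """
<html><body>
  <div class="main"><a href="https://www.google.com/maps">Maps</a></div>
  <div class="test-box"><ul id="list"><li>One</li></ul></div>
  <div class="notes"><p>First</p><p>Second</p></div>
  <a href="https://www.amazon.com/item">Item</a>
  <label id="message-1">Hello</label>
  <table><tr><td><img src="cat.png"></td></tr></table>
</body></html>
"""

root = fromstring(PAGE.strip())

selectors = [
    '//a',
    '//div[@class="main"]',
    '//ul[@id="list"]',
    '//p/text()',
    '//div[contains(@class, "test")]',
    '//div[a|ul]',
    '//a[contains(@href, "google.com")]',
    '//a[not(contains(@href, "google.com"))]',
    '//td/img/@src',
    '//td[img[@src="cat.png"]]',
    '//label[starts-with(@id, "message")]',
    '//a[not(contains(@href, "google.com")) and contains(@href, "amazon")]',
]

# Number of results per selector; text() and @attr selectors return strings.
counts = {selector: len(root.xpath(selector)) for selector in selectors}
for selector, count in counts.items():
    print(f"{count} match(es) for {selector}")

print(root.xpath('//p/text()'))
print(root.xpath('//td/img/@src'))
```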

In [104]:
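links = set()
for element in root.xpath('//img'):
    data_src = element.get('data-src')
    # Keep only absolute URLs that end in .jpg
    if data_src and (searching := re.search(r'^http.*\.jpg$', data_src)):
        # group(0) is the whole match; the pattern has no capture group, so group(1) would fail
        links.add(searching.group(0))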

In [118]:
top_elements = root.xpath('.')  # '.' selects the context node, so this is [root]
for el_0 in top_elements:
    print(el_0.tag)
    for el_1 in el_0:  # iterating an element yields its direct children
        print('----|----',el_1.tag)
        for el_2 in el_1:
            print('----|----'*2,el_2.tag)
            for el_3 in el_2:
                print('----|----'*3,el_3.tag)

html
----|---- head
----|--------|---- meta
----|--------|---- meta
----|--------|---- meta
----|--------|---- style
----|--------|---- script
----|--------|---- link
----|--------|---- noscript
----|--------|--------|---- link
----|--------|---- noscript
----|--------|---- template
----|--------|--------|---- link
----|--------|--------|---- link
----|--------|--------|---- link
----|--------|--------|---- link
----|--------|--------|---- link
----|--------|--------|---- link
----|--------|--------|---- link
----|--------|--------|---- noscript
----|--------|--------|---- link
----|--------|--------|---- link
----|--------|--------|---- link
----|--------|--------|---- link
----|--------|--------|---- link
----|--------|--------|---- link
----|--------|--------|---- link
----|--------|--------|---- link
----|--------|--------|---- link
----|--------|--------|---- link
----|--------|--------|---- link
----|--------|--------|---- link
----|--------|--------|---- link
----|--------|-----
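
The nested loops above stop at three levels below the root. A recursive generator, sketched here on a made-up page, can walk to any depth while keeping the same output format. `outline` and `max_depth` are hypothetical names, not part of `lxml`.

```python
from lxml.html import fromstring


def outline(element, depth=0, max_depth=None):
    """Yield one indented line per element, walking the tree depth-first."""
    if not isinstance(element.tag, str):   # skip comments and processing instructions
        return
    prefix = "----|----" * depth
    yield f"{prefix} {element.tag}" if depth else element.tag
    if max_depth is not None and depth >= max_depth:
        return
    for child in element:                  # iterating an element yields its direct children
        yield from outline(child, depth + 1, max_depth)


# Hypothetical page used only for illustration.
PAGE = (
    "<html><head><meta charset='utf-8'><title>t</title></head>"
    "<body><div><p>x</p></div></body></html>"
)

lines = list(outline(fromstring(PAGE)))
print("\n".join(lines))

# Limiting the depth reproduces a shallower view of the same tree.
shallow = list(outline(fromstring(PAGE), max_depth=1))
print(shallow)
```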