# Pull data out of HTML files

### Beautiful Soup

The fourth version of Beautiful Soup comes with many new features.

https://beautiful-soup-4.readthedocs.io/en/latest/

### XML - XPath

https://stackoverflow.com/questions/4531995/getting-attribute-using-xpath

https://docs.python.org/3/library/xml.etree.elementtree.html

https://riptutorial.com/Download/xpath.pdf

https://www.w3schools.com/xml/xpath_examples.asp

**XML** is a hierarchical data format, and the most natural way to represent it is with a `tree`.

The `xml.etree.ElementTree` module (ET for short) has two classes for this purpose:
  - `ElementTree`, which represents the whole XML document as a tree, and
  - `Element`, which represents a single node in this tree.

Interactions with the `whole document` (reading and writing to/from files) are usually done on the `ElementTree` level. 

Interactions with a `single XML element` and its sub-elements are done on the `Element` level.
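
The split between the two levels can be sketched with the standard library's `xml.etree.ElementTree`. The XML string below is a made-up sample (not one of the pages scraped in this notebook), and an in-memory buffer stands in for a file on disk.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample document, used only for illustration.
xml_text = "<catalog><book id='1'>Dune</book><book id='2'>Neuromancer</book></catalog>"

# Element level: work with a single node and its sub-elements.
root = ET.fromstring(xml_text)          # returns the root Element
first_book = root[0]                    # children can be indexed like list items
print(root.tag, first_book.tag, first_book.get("id"), first_book.text)

# ElementTree level: wrap the root to read or write the whole document.
tree = ET.ElementTree(root)
buffer = io.BytesIO()                   # stands in for a file opened in binary mode
tree.write(buffer, encoding="utf-8", xml_declaration=True)
serialized = buffer.getvalue().decode("utf-8")
print(serialized)
```

Note that the rest of this notebook uses `lxml.html`, whose elements follow the same `Element` interface and add full XPath support.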

In [1]:
from lxml.html import fromstring, tostring
import requests
from pathlib import Path
from urllib.parse import urljoin, urlparse, urlsplit
from dotenv import dotenv_values

In [3]:
with open('Science Fiction _ Books to Scrape - Sandbox.html', encoding='utf-8') as page:
    page_source = page.read()

root = fromstring(page_source)

In [11]:
elements_stock = root.xpath("//ol[@class='row']//li/article/div/p[@class='instock availability']")
elements_price = root.xpath('//ol[@class="row"]//li/article/div/p[@class="price_color"]')
elements_title = root.xpath('//ol[@class="row"]//li/article/h3/a')
import pandas as pd

total = []
for stock, price, title in zip(elements_stock, elements_price, elements_title):
    total.append(
        {
            "stock": stock.text_content().strip(),
            "price": price.text_content().strip(),
            "title": title.text_content().strip(),
        }
    )
pd.DataFrame(total)

In [11]:
config = dotenv_values('./env/.env')

In [8]:
user_agent = 'Mozilla/5.0'

response = requests.get(
    config['ESIKA_PATH'], 
    headers={'User-agent': user_agent})

root = fromstring(response.text)

As an `Element`, `root` has a tag and a dictionary of attributes.

In [3]:
print('Tag name current:', root.tag)
print('Attributes of Tag name current: ',root.attrib)

Tag name current: html
Attributes of Tag name current:  {'lang': 'es_PE'}


To access the children of the root, we iterate over it with a `for` loop.

In [4]:
for i, child in enumerate(root):
    print(f'Tag name of child {i}: {child.tag}' ) 
    print(f'Attributes of child {i} :' ,child.attrib)
    print(f'text of child {i}: {child.text}' ) 
    print(f'class of child {i}: ', child.get('class'), end='\n\n') # get() reads a single attribute of the child

Tag name of child 0: head
Attributes of child 0 : {}
text of child 0: 

class of child 0:  None

Tag name of child 1: body
Attributes of child 1 : {'class': 'page-productDetails pageType-ProductPage template-pages-product-productLayout1Page  smartedit-page-uid-productDetails smartedit-page-uuid-eyJpdGVtSWQiOiJwcm9kdWN0RGV0YWlscyIsImNhdGFsb2dJZCI6InBlQ29udGVudENhdGFsb2ciLCJjYXRhbG9nVmVyc2lvbiI6Ik9ubGluZSJ9 smartedit-catalog-version-uuid-peContentCatalog/Online  language-es_PE selected-brand-esika rich-relevance-personalization'}
text of child 1: 


class of child 1:  page-productDetails pageType-ProductPage template-pages-product-productLayout1Page  smartedit-page-uid-productDetails smartedit-page-uuid-eyJpdGVtSWQiOiJwcm9kdWN0RGV0YWlscyIsImNhdGFsb2dJZCI6InBlQ29udGVudENhdGFsb2ciLCJjYXRhbG9nVmVyc2lvbiI6Ik9ubGluZSJ9 smartedit-catalog-version-uuid-peContentCatalog/Online  language-es_PE selected-brand-esika rich-relevance-personalization



`Element` has some useful methods that help iterate recursively over the whole sub-tree below it (its children, their children, and so on).

For example, `Element.iter()`:

In [5]:
for link in root.iter('a'):   # every <a> element in the tree, at any depth
    print(link.attrib)

{}
{'id': 'backEcatalogUrl', 'xlink:href': ''}
{'href': '#', 'class': 'mini-cart-link js-mini-cart-link', 'data-mini-cart-url': '/pe/cart/rollover/EsikaBicMiniCart', 'data-mini-cart-refresh-url': '/pe/cart/miniCart/SUBTOTAL', 'data-mini-cart-name': 'Bolsa de Compras', 'data-mini-cart-empty-name': 'Empty Bag', 'data-mini-cart-items-text': 'Artículos', 'data-cart-page-url': '/pe/cart'}
{'href': 'https://esika.tiendabelcorp.com/pe', 'class': 'esika-active', 'data-role': 'menu', 'data-region': 'Superior', 'data-parent': 'esika', 'data-title': 'esika', 'id': 'lnk-sup-esika'}
{'href': 'https://lbel.tiendabelcorp.com/pe', 'class': 'logo-no-active', 'data-role': 'menu', 'data-region': 'Superior', 'data-parent': 'lbel', 'data-title': 'lbel', 'id': 'lnk-sup-lbel'}
{'href': 'https://cyzone.tiendabelcorp.com/pe', 'class': 'logo-no-active', 'data-role': 'menu', 'data-region': 'Superior', 'data-parent': 'cyzone', 'data-title': 'cyzone', 'id': 'lnk-sup-cyzone'}
{'href': 'https://esika.tiendabelcorp.c

* `Element.findall()` finds only the elements with a given tag that are direct children of the current element.
* `Element.find()` finds the first child with a particular tag.
* `Element.attrib` returns the attributes of the current element and their values as a dict.
  * Once we have checked the attributes of the element, we can use `Element.get()` to extract the value of a single attribute.
* `Element.text` accesses the text directly inside the element, before its first child.
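
The methods above can be tried on a small document before running them on a real page. This is a minimal sketch with the standard library's `xml.etree.ElementTree`; the markup is hypothetical and chosen so that one `div` is nested and therefore not a direct child of `body`.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup, used only for illustration.
html_text = """
<html lang="en">
  <body class="page">
    <div id="first">one</div>
    <div id="second">two</div>
    <section><div id="nested">three</div></section>
  </body>
</html>
""".strip()

root = ET.fromstring(html_text)

body = root.find("body")                  # first direct child named body
divs = body.findall("div")                # direct children only, the nested div is skipped
all_divs = list(body.iter("div"))         # iter() walks the whole sub-tree

print(body.attrib)                        # all attributes as a dict
print(body.get("class"))                  # value of one attribute
print([d.get("id") for d in divs])
print(divs[0].text)                       # text directly inside the first div
print(len(all_divs))
```

Comparing `divs` with `all_divs` shows the difference between searching direct children and walking the sub-tree recursively.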

In [6]:
for elt_class in root.findall('body'):   # root has a single direct child named body
    print('* ATTRIBUTES: ->'.ljust(15) ,elt_class.tag, elt_class.attrib)
    print('* Get the attribute CLASS:  ->'.ljust(15),elt_class.get('class'))
    print('* Get the TEXT: ->'.ljust(15),elt_class.text)
    print('* Get the CHILDREN: ->'.ljust(15), list(elt_class)) # all direct children of body (getchildren() is deprecated in lxml)

* ATTRIBUTES: -> body {'class': 'page-productDetails pageType-ProductPage template-pages-product-productLayout1Page  smartedit-page-uid-productDetails smartedit-page-uuid-eyJpdGVtSWQiOiJwcm9kdWN0RGV0YWlscyIsImNhdGFsb2dJZCI6InBlQ29udGVudENhdGFsb2ciLCJjYXRhbG9nVmVyc2lvbiI6Ik9ubGluZSJ9 smartedit-catalog-version-uuid-peContentCatalog/Online  language-es_PE selected-brand-esika rich-relevance-personalization'}
* Get the attribute CLASS:  -> page-productDetails pageType-ProductPage template-pages-product-productLayout1Page  smartedit-page-uid-productDetails smartedit-page-uuid-eyJpdGVtSWQiOiJwcm9kdWN0RGV0YWlscyIsImNhdGFsb2dJZCI6InBlQ29udGVudENhdGFsb2ciLCJjYXRhbG9nVmVyc2lvbiI6Ik9ubGluZSJ9 smartedit-catalog-version-uuid-peContentCatalog/Online  language-es_PE selected-brand-esika rich-relevance-personalization
* Get the TEXT: -> 


* Get the CHILDREN: -> [<Element noscript at 0x22854549040>, <Element div at 0x22855942090>, <Element noscript at 0x22855942130>, <Element script at 0x22855942040>,

In [7]:
elements_body = root.findall('body')            # root has a single direct child named body

for el_body in elements_body:                   # for each body element
    print('For the tag Name: ',el_body.tag)
    elements_div = el_body.findall('div')       # find all direct div children
    for el_div in elements_div:
        print(f'* Tag Name {el_div.tag} and their attributes {el_div.attrib}')

For the tag Name:  body
* Tag Name div and their attributes {'id': 'js-site-overlay', 'class': 'site-overlay'}
* Tag Name div and their attributes {'class': 'global-page-info-msg-esika global-page-info-msg-component'}
* Tag Name div and their attributes {'class': 'cci-header-mobile-desktop  pe'}
* Tag Name div and their attributes {'class': 'b2c-header-mobile pe branding-mobile hidden-md hidden-lg'}
* Tag Name div and their attributes {'id': 'ariaStatusMsg', 'class': 'skip', 'role': 'status', 'aria-relevant': 'text', 'aria-live': 'polite'}


## XPath support

XPath (XML Path Language) is a query language for selecting nodes in XML documents.

Web browsers support XML, and therefore XPath, so it can also be used to extract information from web pages.

In the browser console we can run `$x("...")`, replacing the three dots with the XPath expression.

Common selectors and their XPath expressions:

* Select all links: `'//a'`
* Select div with class "main": `'//div[@class="main"]'`
* Select ul with ID "list": `'//ul[@id="list"]'`
* Select text from all paragraphs: `'//p/text()'`
* Select all divs which contain 'test' in the class: `'//div[contains(@class, "test")]'`
* Select all divs with links or lists in them: `'//div[a|ul]'`
* Select a link with google.com in the href: `'//a[contains(@href, "google.com")]'`
* Get the `src` attribute of images inside table cells: `'//td/img/@src'`
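
The selectors listed above can be checked against a small page with `lxml`. The HTML below is a hypothetical snippet written for this sketch, and the variable names are illustrative only.

```python
from lxml.html import fromstring

# Hypothetical HTML snippet, used only for illustration.
html_text = """
<html><body>
  <div class="main"><a href="https://www.google.com/search">Search</a></div>
  <div class="test-box"><ul id="list"><li>item</li></ul></div>
  <div class="other"><p>First</p><p>Second</p></div>
  <table><tr><td><img src="logo.png"></td></tr></table>
</body></html>
"""
doc = fromstring(html_text)

all_links = doc.xpath('//a')                                  # every link
main_divs = doc.xpath('//div[@class="main"]')                 # exact class match
list_ul = doc.xpath('//ul[@id="list"]')                       # match by id
paragraph_text = doc.xpath('//p/text()')                      # text nodes, returned as strings
test_divs = doc.xpath('//div[contains(@class, "test")]')      # substring match on class
divs_with_a_or_ul = doc.xpath('//div[a|ul]')                  # divs with a direct a or ul child
google_links = doc.xpath('//a[contains(@href, "google.com")]')
img_sources = doc.xpath('//td/img/@src')                      # attribute values, returned as strings

print(paragraph_text)
print(img_sources)
print([d.get('class') for d in divs_with_a_or_ul])
```

Expressions ending in `text()` or `@attribute` return strings rather than elements, so no call to `.text` or `.get()` is needed on their results.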

https://www.w3schools.com/xml/xpath_examples.asp

In [8]:
import re
from pathlib import Path

In [9]:
links = set()
jpg_url = re.compile(r'.*(http.*\.jpg)$')   # capture the URL that ends the attribute value with .jpg
for element in root.xpath('//img'):
    src = element.get('data-src')
    if src:
        match = jpg_url.search(src)         # search once and reuse the match
        if match:
            links.add(match.group(1))

In [10]:
for element in root.xpath('//h1[@class="name"]'):
    print(element.text.strip())

Delineador líquido punta plumón Eye PRO


* contains

In [108]:
for element in root.xpath('//div[contains(@class,"simple-price")]'): # conditions can be combined with `and` / `or`
    for el_chil in element:   # iterate over the direct children
        if el_chil.text:
            print(el_chil.get('class'),el_chil.text.strip())

active-price S/ 44.00


* starts-with

In [None]:
Xpath="//label[starts-with(@id,'message')]"
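
The `starts-with` expression above is not run against a page in this notebook, so here is a minimal sketch of how it behaves. The form markup and its `id` values are hypothetical.

```python
from lxml.html import fromstring

# Hypothetical fragment, used only for illustration.
html_text = """
<form>
  <label id="message-error">Invalid email</label>
  <label id="message-info">Check your inbox</label>
  <label id="username">User</label>
</form>
"""
fragment = fromstring(html_text)

# Keep only the labels whose id begins with "message".
labels = fragment.xpath("//label[starts-with(@id,'message')]")
for label in labels:
    print(label.get('id'), '->', label.text)
```

Unlike `contains`, `starts-with` only matches when the attribute value begins with the given string, so the `username` label is left out.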

In [159]:
top_elements = root.xpath('.')  # '.' selects the context node, here the root element itself
for el_0 in top_elements:
    print(el_0.tag)
    for el_1 in el_0.getchildren():
        print('|--->',el_1.tag)
        for el_2 in el_1.getchildren():
            print('|--->'*2,el_2.tag)
            for el_3 in el_2.getchildren():
                print('|--->'*3,el_3.tag)

html
|---> head
|--->|---> title
|--->|---> link
|--->|---> link
|--->|---> link
|--->|---> link
|--->|---> link
|--->|---> link
|--->|---> link
|--->|---> link
|--->|---> link
|--->|---> link
|--->|---> script
|--->|---> script
|--->|---> script
|--->|---> script
|--->|---> style
|--->|---> script
|--->|---> script
|--->|---> meta
|--->|---> meta
|--->|---> meta
|--->|---> meta
|--->|---> meta
|--->|---> meta
|--->|---> meta
|--->|---> meta
|--->|---> meta
|--->|---> meta
|--->|---> meta
|--->|---> meta
|--->|---> meta
|--->|---> script
|--->|---> script
|--->|---> link
|--->|---> link
|--->|---> script
|--->|---> link
|--->|---> link
|--->|---> link
|--->|---> style
|---> body
|--->|---> noscript
|--->|--->|---> iframe
|--->|---> div
|--->|---> noscript
|--->|--->|---> iframe
|--->|---> script
|--->|---> div
|--->|--->|---> div
|--->|---> div
|--->|--->|---> a
|--->|---> div
|--->|--->|---> button
|--->|--->|---> div
|--->|--->|---> div
|--->|---> main
|--->|--->|---> div
|--->|--->|

In [109]:
elements = root.xpath('/html/body/main/div[3]/div[1]/div/div[3]/div[3]/div') 

for el_0 in elements:
    print('Tag: ', el_0.tag)          # print the current tag
    print('Attributes: ', el_0.attrib)       # print its attributes
    print('Children: ' , el_0.xpath('./*')) # print its direct children
    for el_1 in el_0.xpath('.//ul'):
        for el_2 in el_1.xpath('.//div[contains(@style,"background") and contains(@class,"category")]'):
            print('|---->' , el_2.attrib)

Tag:  div
Attributes:  {'class': 'yCmsComponent yComponentWrapper page-details-variants-select-component'}
Children:  [<Element div at 0x2494a1b6ef0>, <Element div at 0x2494a1b6c70>, <Element ul at 0x2494a1b6d10>]
|----> {'class': 'js-tm-variant-category-pdp variant-category-pdp ', 'style': 'background-color:RGB(163,29,53);'}
|----> {'class': 'js-tm-variant-category-pdp variant-category-pdp ', 'style': 'background-color:RGB(243,241,235);'}
|----> {'class': 'js-tm-variant-category-pdp variant-category-pdp ', 'style': 'background-color:RGB(17,96,90);'}
|----> {'class': 'js-tm-variant-category-pdp variant-category-pdp ', 'style': 'background-color:RGB(74,50,47);'}
|----> {'class': 'js-tm-variant-category-pdp variant-category-pdp ', 'style': 'background-color:RGB(62,85,171);'}
|----> {'class': 'js-tm-variant-category-pdp variant-category-pdp ', 'style': 'background-color:RGB(121,51,60);'}
|----> {'class': 'js-tm-variant-category-pdp variant-category-pdp out-of-stock', 'style': 'background

In [12]:
response = requests.get(config['INEI_PATH'])
tree = fromstring(response.text)
elements = tree.xpath('//html/body//ul//li//a[contains(@href,"ConsultaPorEncuesta")]')

for element in elements:
    print(element.tag)
    print(element.attrib)

    print(element.findall('li'))
    print(element.tag)
    print(element.attrib)


with open('page.txt', 'rb') as reader:
    page = reader.read()

tree = fromstring(page.decode())

tree.xpath('//select[contains(@name, "cmbAnno")]')[0].attrib

tree.xpath('//select[contains(@name, "cmbAnno")]//option[contains(@value, "2010")]')[0].attrib

tree.xpath('//select[contains(@name, "cmbTrimestre")]')[0].attrib

tree.xpath('//select[contains(@name, "cmbTrimestre")]//option[contains(@value, "55")]')[0].attrib

tree.xpath('//td//a[contains(@href, "03.zip") and contains(@href, "SPSS")]')[0].attrib

a
{'href': 'javascript:ConsultaPorEncuesta()'}
[]
a
{'href': 'javascript:ConsultaPorEncuesta()'}


IndexError: list index out of range
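
`xpath()` returns a list, and indexing `[0]` on an empty list raises exactly this `IndexError`, so the error is consistent with one of the queries above matching nothing. A small guard avoids the crash. The helper name `first_attrib` and the sample markup are assumptions made for this sketch. Only the `cmbAnno` and `cmbTrimestre` names come from the cell above.

```python
from lxml.html import fromstring

def first_attrib(tree, xpath_query):
    """Return the attributes of the first match, or None when nothing matches."""
    matches = tree.xpath(xpath_query)
    return dict(matches[0].attrib) if matches else None

# Hypothetical fragment that contains only one of the two select boxes.
html_text = """
<form>
  <select name="cmbAnno"><option value="2010">2010</option></select>
</form>
"""
sample = fromstring(html_text)

found = first_attrib(sample, '//select[contains(@name, "cmbAnno")]')
missing = first_attrib(sample, '//select[contains(@name, "cmbTrimestre")]')
print(found)
print(missing)
```

Checking for `None` before using the result makes it clear which query failed, instead of stopping at the first missing element.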

In [1]:
with open('./metro-page/Panadería y Pastelería_ Comprar Panes, Pasteles, Tortas.html', 'r', encoding='utf-8') as page:
    page_web = page.read()

root = fromstring(page_web)
xpath = '//div[@id="js-menu-container"]/div/a'
container = []

for item in root.xpath(xpath):
    for child in item:                       # direct children of each menu link
        base = {}
        base['name'] = child.text
        base['link'] = "./" + item.get('href').split('/')[-1]   # keep only the last path segment
        container.append(base)