`lxml.etree`
- Pros:
  - Powerful and efficient XML/HTML parsing library.
  - Rich set of methods for traversing and manipulating XML/HTML trees.
  - Excellent support for XPath and XSLT.
  - Can handle a variety of XML and HTML sources.
  - Can be used for both reading and writing XML/HTML documents.
- Cons:
  - No built-in HTTP functionality.
  - Can be more complex to use than simpler HTML parsing libraries.

`lxml.html`
- Pros:
  - HTML-specific layer built on top of `lxml.etree`, so XPath and the rest of the tree API remain available.
  - Tolerant of malformed, real-world HTML.
  - Convenience helpers for HTML work, such as `text_content()` and link handling.
- Cons:
  - No built-in HTTP functionality; pages must be fetched with a separate library.
  - CSS selectors require the additional `cssselect` package.
  - Slower than `selectolax` in the single timing run shown further down.

`selectolax`
- Pros:
  - Very fast HTML parsing library.
  - Provides a simple and intuitive CSS-selector API for extracting data from HTML documents.
  - Can handle invalid HTML.
- Cons:
  - Limited functionality compared to more feature-rich parsing libraries like `lxml`.
  - Does not support XPath or XSLT.
  - Does not support XML namespaces.

In summary, `lxml.etree` and `lxml.html` are feature-rich libraries for parsing and manipulating XML and HTML documents, with excellent support for XPath and XSLT. `selectolax` is a fast, lightweight library with a simple API for extracting data from HTML documents, but it lacks some of the advanced features of `lxml`. The choice between them depends on the requirements of your project: parsing speed, required features, and ease of use.
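
To make the XPath point concrete, here is a minimal sketch that reads an `og:description` meta tag with `lxml.html`. The HTML snippet is made up for illustration and is not taken from the dataset; the sketch assumes `lxml` is installed.

```python
from lxml import html

# Made-up sample document, not taken from the dataset
SAMPLE = b"""
<html>
  <head><meta property="og:description" content="A short summary"></head>
  <body><p>Hello</p></body>
</html>
"""

tree = html.fromstring(SAMPLE)

# xpath() returns a list; attribute queries yield plain strings
contents = tree.xpath('//meta[@property="og:description"]/@content')
description = contents[0] if contents else None
print(description)
```

The same lookup is done with a CSS selector in the `selectolax` cells further down, which is the main API difference between the two libraries for this task.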

In [117]:
import os 
import pandas as pd 
import base64
from selectolax.parser import HTMLParser
from lxml import html, etree 

# PATH_BASE refers to the unzipped directory
PATH_BASE = './project-dp-trend-description/'    

In [118]:
url_paths = os.listdir(PATH_BASE)
htmls_encoded = []

# Read the first 10 rows of each CSV and collect the base64-encoded pages
for file in url_paths:
    html64file = pd.read_csv(os.path.join(PATH_BASE, file), nrows=10)
    htmls_encoded += html64file['html'].to_list()

sample_unicode_html = base64.b64decode(htmls_encoded[0])
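
Despite the variable name, `base64.b64decode` returns `bytes`, not `str`. Both parsers used here accept bytes, so the cells work as written, but the sketch below shows the extra step needed to get text. The sample page is made up, and UTF-8 is an assumption about the dataset's encoding.

```python
import base64

# Made-up page standing in for one value of the 'html' column
original = '<html><body><p>سلام</p></body></html>'
encoded = base64.b64encode(original.encode('utf-8'))

decoded_bytes = base64.b64decode(encoded)      # bytes
decoded_text = decoded_bytes.decode('utf-8')   # str, assuming UTF-8

print(type(decoded_bytes).__name__, type(decoded_text).__name__)
```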

In [119]:
%%time
print('Runtime Using selectolax Library: \n')
sample_parse_selecolax = HTMLParser(sample_unicode_html)

Runtime Using selectolax Library: 

CPU times: total: 0 ns
Wall time: 2.98 ms


In [120]:
%%time
print('Runtime Using lxml.html Library: \n')
sample_parse_lxlmhttp = html.fromstring(sample_unicode_html)

Runtime Using lxml.html Library: 

CPU times: total: 15.6 ms
Wall time: 11 ms


In [121]:
%%time
print('Runtime Using lxml.etree Library: \n')
sample_parse_lxlmetree = etree.HTML(sample_unicode_html)

Runtime Using lxml.etree Library: 

CPU times: total: 0 ns
Wall time: 7.97 ms
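
A single `%%time` run in the millisecond range is noisy, and a CPU time of 0 ns suggests the timer resolution is too coarse to measure it. A minimal sketch of a steadier measurement with the standard `timeit` module follows. The helper name `best_time_ms` and the stand-in workload are hypothetical; swap in the actual parser call to compare the libraries.

```python
import timeit

def best_time_ms(func, repeat=5, number=20):
    """Return the best average time per call in milliseconds."""
    totals = timeit.repeat(func, repeat=repeat, number=number)
    return min(totals) / number * 1000

# Stand-in workload; replace the lambda body with a parser call such as
# HTMLParser(sample_unicode_html) or html.fromstring(sample_unicode_html)
payload = b'<p>hello</p>' * 1000
elapsed = best_time_ms(lambda: payload.decode('utf-8'))
print(f'{elapsed:.4f} ms per call')
```

Taking the minimum over several repeats reduces the influence of background activity on the machine.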


In [122]:
pages = []
body_texts = []

for encoded_html in htmls_encoded:
    unicode_html = base64.b64decode(encoded_html)
    parsed_html = HTMLParser(unicode_html)
    body_text = parsed_html.body.text()

    # Skip pages whose body is just an HTTP error message (402, 403, 404)
    if not body_text.startswith(('402 ', '403 ', '404 ')):
        pages.append(parsed_html)
        # Store the body text for later use if needed
        body_texts.append(body_text) 
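
The filtering rule in the loop above can be pulled out into a small helper so it can be checked on its own. This is a sketch: `is_error_page` and `ERROR_PREFIXES` are hypothetical names, and the sample strings are made up.

```python
# Same prefixes the loop checks; the trailing space avoids matching "4040 ..."
ERROR_PREFIXES = ('402 ', '403 ', '404 ')

def is_error_page(body_text):
    """Return True if the body text starts with one of the error prefixes."""
    return body_text.startswith(ERROR_PREFIXES)

samples = ['404 Not Found', '403 Forbidden', 'Breaking news today', '4040 views']
usable = [text for text in samples if not is_error_page(text)]
print(usable)
```

`str.startswith` accepts a tuple of prefixes, which replaces the chain of `if`/`elif` branches with one call.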

In [123]:
print('Number Of Documents: ', len(htmls_encoded), '\nNumber Of Usable Documents: ', len(pages))

Number Of Documents:  140 
Number Of Usable Documents:  132


### Example code showing how to work with selectolax objects

In [164]:
def find_og_desc_text(page):
    '''
    Extract the description text from the og:description meta tag.
    The text is used later to locate the matching element in the body.

    args: 
        - page: HTML page parsed with selectolax
    returns: 
        - the content of the og:description meta tag,
          or None if the tag is missing
    
    '''
    head = page.css_first('head')
    if head is None:
        return None
    
    # Find the meta tag whose property is og:description
    og_description = head.css_first('meta[property="og:description"]')
    if og_description is None:
        return None
    
    # Return the content of the tag
    return og_description.attributes.get('content')



def find_target_element(page, content):
    '''
    Find an element in the body whose text contains the given content.
    Assumes the page has a non-null body.
    
    args: 
        - page: HTML page parsed with selectolax
        - content: the text to match, in this case the description 
    returns: 
        - the matching element as a selectolax Node object, 
          or None if nothing matches
    
    '''
    body = page.css_first('body')

    # iter() yields only the direct children of body
    candidates = list(body.iter())

    # The loop has no break, so the last matching child is returned,
    # not the innermost element and not the first match.
    # The description is usually found near the top of the page.
    match = None
    for elem in candidates:
        if content in elem.text():
            match = elem

    return match


og = find_og_desc_text(pages[11])
print(og)
desc_elem = find_target_element(pages[11], og)
print(desc_elem)

در مسیر مهاجرت آسیب‌هایی وجود دارد که متوجه خانواده‌ها و به خصوص کودکان است.
<Node div>


In [129]:
desired_tags = ['a', 'h1', 'p', 'div', 'article', 'span']
leaf_nodes = [node for node in 
              pages[0].body.css('*:not(:has(*))') 
              if node.tag in desired_tags 
                and node.text(deep=False).strip()]
leaf_nodes

[<Node span>,
 <Node a>,
 <Node a>,
 <Node a>,
 <Node span>,
 <Node span>,
 <Node span>,
 <Node span>,
 <Node span>,
 <Node span>,
 <Node span>,
 <Node span>,
 <Node span>,
 <Node span>,
 <Node span>,
 <Node span>,
 <Node span>,
 <Node span>,
 <Node span>,
 <Node span>,
 <Node span>,
 <Node span>,
 <Node span>,
 <Node span>,
 <Node span>,
 <Node span>,
 <Node span>,
 <Node span>,
 <Node span>,
 <Node span>,
 <Node span>,
 <Node span>,
 <Node span>,
 <Node span>,
 <Node span>,
 <Node span>,
 <Node span>,
 <Node span>,
 <Node span>,
 <Node span>,
 <Node span>,
 <Node span>,
 <Node span>,
 <Node span>,
 <Node span>,
 <Node span>,
 <Node span>,
 <Node span>,
 <Node span>,
 <Node span>,
 <Node span>,
 <Node span>,
 <Node span>,
 <Node span>,
 <Node span>,
 <Node span>,
 <Node span>,
 <Node span>,
 <Node span>,
 <Node span>,
 <Node span>,
 <Node span>,
 <Node span>,
 <Node span>,
 <Node span>,
 <Node span>,
 <Node span>,
 <Node span>,
 <Node span>,
 <Node span>,
 <Node span>,
 <Node span>,
 