<a href="https://colab.research.google.com/github/lucapas/VERTEX/blob/master/sampling_algorithm_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sampling algorithm
Implementation of the Vertex sampling algorithm described in [_Web-scale information extraction with vertex_](https://ieeexplore.ieee.org/abstract/document/5767842).

In [0]:
import requests
from lxml import html
from collections import defaultdict

## Get all XPaths

In [0]:
def get_all_xpath(html_src):
    
    # select nodes whose children include text nodes
    XPATH_SELECTOR = "//*[child::text()]" 
        
    root = html.fromstring(html_src)
    
    tree = root.getroottree()
    
    # Despite its name, leaf_nodes is not the list of all leaf nodes:
    # it holds every element that is the parent of a text node in the DOM
    leaf_nodes = root.xpath(XPATH_SELECTOR)
    
    xpath_list = []
    
    # extract xpath from previously selected nodes and filter out "noisy" nodes
    for leaf in leaf_nodes:
        
        xpath = tree.getpath(leaf) + "/text()"
        
        # Filtering out xpaths which extract javascript code or css stylesheet
        if  "/script" not in xpath and "/noscript" not in xpath and "/style" not in xpath:
                        
            selected_values = root.xpath(xpath)
            selected_string = ''.join(selected_values).strip()
            
            # Filtering out xpaths which extract empty strings
            if selected_string:
                xpath_list.append(xpath)

    return xpath_list    

In [0]:
r = requests.get("https://www.androidworld.it/schede/redmi-7-2/")

In [0]:
xpath_list = get_all_xpath(r.content)

In [5]:
xpath_list

['/html/head/title/text()',
 '/html/body/div/div[3]/div/header/div/div[1]/ul/li[1]/a/text()',
 '/html/body/div/div[3]/div/header/div/div[1]/ul/li[2]/a/text()',
 '/html/body/div/div[3]/div/header/div/div[1]/ul/li[3]/a/text()',
 '/html/body/div/div[3]/div/header/div/div[2]/span/a/text()',
 '/html/body/div/nav/div/ul/li[1]/a/text()',
 '/html/body/div/nav/div/ul/li[2]/a/text()',
 '/html/body/div/nav/div/ul/li[3]/a/text()',
 '/html/body/div/nav/div/ul/li[4]/a/text()',
 '/html/body/div/nav/div/ul/li[5]/a/text()',
 '/html/body/div/nav/div/ul/li[6]/a/text()',
 '/html/body/div/nav/div/ul/li[7]/a/text()',
 '/html/body/div/nav/div/ul/li[8]/a/text()',
 '/html/body/div/nav/div/ul/li[9]/a/text()',
 '/html/body/div/nav/div/ul/li[10]/a/text()',
 '/html/body/div/nav/div/ul/li[11]/a/text()',
 '/html/body/div/div[4]/div/main/div/section/article/figure/div/div[2]/a[2]/span[2]/text()',
 '/html/body/div/div[4]/div/main/div/section/article/figure/div/div[2]/a[3]/span[2]/text()',
 '/html/body/div/div[4]/div/m

In [6]:
len(xpath_list)

290

In [7]:
root = html.fromstring(r.content)
for xpath in xpath_list:
    list_of_values = root.xpath(xpath)
    string_value = ''.join(list_of_values)
    clean_value = string_value.strip()
    #print([string_value])
    print([clean_value])
    print()

['Redmi 7 (3GB) - Scheda tecnica | AndroidWorld']

['smart']

['mobile']

['android']

['AndroidWorld']

['Recensioni']

['Schede tecniche']

['Smartphone']

['Smartwatch']

['Tablet']

['App']

['Giochi']

['Guide']

['Video']

['Forum']

['📸']

['Specifiche']

['notizie']

['Correlati']

['Confronta']

['Redmi 7 (3GB)']

['Scheda Tecnica']

['Confronta']

['CPU']

['octa 1.8 GHz']

['Display']

['6,26" HD+ / 720 x 1520']

['Fotocamera']

['12 Mpx ƒ/2.2']

['Frontale']

['8 Mpx']

['RAM']

['3 GB']

['Memoria interna']

['32 / 64 GB']

['Android']

['9.0 Pie']

['Batteria']

['4000 mAh']

['è uno smartphone con sistema operativo Android di fascia bassa.']

['Redmi 7 (3GB)']

['La tecnologia del display dello smartphone è IPS LCD. Ha una diagonale di  ed una risoluzione di HD+ / 720 x 1520 e quindi un ppi di 269 ppi.']

['6,26 pollici']

['Abbiamo a che fare con una fotocamera a risoluzione di  con il supporto di un flash Singolo utile ad illuminare le foto in condizioni di scarsa luce

## Compute necessary data structures
### Utility functions
#### Get xpath value
Returns the string that an XPath selects on a given page (all matched text nodes joined together, with surrounding whitespace stripped).

In [0]:
def get_xpath_value(html_src, xpath):
    
    root = html.fromstring(html_src)
    selected_values = root.xpath(xpath)
    selected_string = ''.join(selected_values).strip()
    
    return selected_string

In [9]:
get_xpath_value(r.content, '/html/head/title/text()')

'Redmi 7 (3GB) - Scheda tecnica | AndroidWorld'

#### xpath to value
Given the page source _src_ and a list of XPaths, returns a dict { _xpath_ : _value_ }, where _value_ is the string that _xpath_ selects on _src_.

In [0]:
def xpath_to_value(html_src, xpath_list):
    
    result = {}
    
    for xpath in xpath_list:
        value = get_xpath_value(html_src, xpath)
        result[xpath] = value
        
    return result

In [11]:
xpath_to_value(r.content, ['/html/head/title/text()',
                            '/html/body/div/div[3]/div/header/div/div[1]/ul/li[1]/a/text()',
                            '/html/body/div/div[3]/div/header/div/div[1]/ul/li[2]/a/text()',
                            '/html/body/div/div[3]/div/header/div/div[1]/ul/li[3]/a/text()'])

{'/html/body/div/div[3]/div/header/div/div[1]/ul/li[1]/a/text()': 'smart',
 '/html/body/div/div[3]/div/header/div/div[1]/ul/li[2]/a/text()': 'mobile',
 '/html/body/div/div[3]/div/header/div/div[1]/ul/li[3]/a/text()': 'android',
 '/html/head/title/text()': 'Redmi 7 (3GB) - Scheda tecnica | AndroidWorld'}

#### get html
Given a list of URLs, returns a dictionary { _url_ : _html page_ }. URLs whose request fails are left out of the result.

In [0]:
def get_html(list_of_urls):
    
    result = {}
    
    for url in list_of_urls:
        r = requests.get(url)
        if r.ok:
            result[url] = r.content
            
    return result

#### get_data_structures
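Returns the data structures needed to compute XPath weights: a dictionary from each URL to the XPaths found on its page, and a dictionary from each XPath to the list of values it selects across the pages.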

In [0]:
def get_data_structures(list_of_urls):
    
    html_pages = get_html(list_of_urls)
    
    url_to_xpaths = {}
    xpath_to_value_list = defaultdict(list)
    
    # iterate over the pages actually downloaded: get_html skips failed requests
    for url, page in html_pages.items():
        xpath_list = get_all_xpath(page)
        
        url_to_xpaths[url] = xpath_list
        
        xpath_to_single_value = xpath_to_value(page, xpath_list)
        
        for xpath in xpath_to_single_value:
            value = xpath_to_single_value[xpath]
            xpath_to_value_list[xpath].append(value)
    
    return (url_to_xpaths, xpath_to_value_list)
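
The shape of the two returned structures is easier to see on hand-written data. The sketch below is a network-free illustration, not part of the notebook: `build_maps` and the `pages` dictionary are hypothetical stand-ins that take already-extracted `{xpath: value}` dictionaries instead of downloading and parsing pages.

```python
from collections import defaultdict


def build_maps(pages):
    """Hypothetical offline analogue of get_data_structures.

    pages: dict mapping url -> {xpath: value} (values already extracted).
    Returns (url_to_xpaths, xpath_to_value_list).
    """
    url_to_xpaths = {}
    xpath_to_value_list = defaultdict(list)
    for url, extracted in pages.items():
        # XPaths found on this page, in extraction order
        url_to_xpaths[url] = list(extracted)
        # collect, per XPath, the value seen on each page
        for xpath, value in extracted.items():
            xpath_to_value_list[xpath].append(value)
    return url_to_xpaths, xpath_to_value_list


# Made-up data, for illustration only.
pages = {
    "page_a": {"/html/head/title/text()": "A", "/html/body/h1/text()": "Guide"},
    "page_b": {"/html/head/title/text()": "B", "/html/body/h1/text()": "Guide",
               "/html/body/p/text()": "extra"},
}
url_to_xpaths, xpath_to_value_list = build_maps(pages)
print(url_to_xpaths["page_b"])
print(dict(xpath_to_value_list))
```

An XPath that appears on only one page ends up with a single-element list, which is what later makes its frequency equal to 1.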

In [0]:
bash_url_to_xpath_map, bash_xpath_to_values_map = get_data_structures(['http://www.tldp.org/LDP/abs/html/part1.html',
                    'http://www.tldp.org/LDP/abs/html/invoking.html'])

In [15]:
bash_url_to_xpath_map

{'http://www.tldp.org/LDP/abs/html/invoking.html': ['/html/head/title/text()',
  '/html/body/div[1]/table/tr[1]/th/text()',
  '/html/body/div[1]/table/tr[2]/td[1]/a/text()',
  '/html/body/div[1]/table/tr[2]/td[2]/text()',
  '/html/body/div[1]/table/tr[2]/td[3]/a/text()',
  '/html/body/div[2]/h1/text()',
  '/html/body/div[2]/p[1]/text()',
  '/html/body/div[2]/p[1]/tt[1]/b/text()',
  '/html/body/div[2]/p[1]/a[1]/span/text()',
  '/html/body/div[2]/p[1]/tt[2]/b/text()',
  '/html/body/div[2]/p[1]/tt[3]/b/text()',
  '/html/body/div[2]/p[1]/a[2]/tt/text()',
  '/html/body/div[2]/p[1]/a[3]/text()',
  '/html/body/div[2]/div/dl/dt[1]/text()',
  '/html/body/div[2]/div/dl/dd[1]/p/text()',
  '/html/body/div[2]/div/dl/dd[1]/p/tt/b/text()',
  '/html/body/div[2]/div/dl/dd[1]/p/a/span/text()',
  '/html/body/div[2]/div/dl/dt[2]/text()',
  '/html/body/div[2]/div/dl/dd[2]/p[1]/text()',
  '/html/body/div[2]/div/dl/dd[2]/p[1]/tt/b/text()',
  '/html/body/div[2]/div/dl/dd[2]/p[2]/text()',
  '/html/body/div[2]/

In [16]:
bash_xpath_to_values_map

defaultdict(list,
            {'/html/body/div[1]/table/tr[1]/th/text()': ['Advanced Bash-Scripting Guide:',
              'Advanced Bash-Scripting Guide:'],
             '/html/body/div[1]/table/tr[2]/td[1]/a/text()': ['Prev', 'Prev'],
             '/html/body/div[1]/table/tr[2]/td[2]/text()': ['Chapter 2. Starting Off With a Sha-Bang'],
             '/html/body/div[1]/table/tr[2]/td[3]/a/text()': ['Next', 'Next'],
             '/html/body/div[2]/div/div[1]/p[2]/i/text()': ['script'],
             '/html/body/div[2]/div/div[1]/p[2]/span/text()': ['"gluing together"'],
             '/html/body/div[2]/div/div[1]/p[2]/text()': ["The shell is a command interpreter. More than just the\n      insulating layer between the operating system kernel and the user,\n      it's also a fairly powerful programming language. A shell program,\n      called a , is an easy-to-use tool for\n      building applications by  system\n      calls, tools, utilities, and compiled binaries.  Virtually the\n      

## Compute weights

### Compute frequency
Given the list of values extracted by an XPath _Xi_, returns the frequency of _Xi_, i.e. the number of pages on which _Xi_ selects a non-empty value.

In [0]:
def compute_frequency(values_list):
    return len(values_list)

In [18]:
selected_xpaths = ['/html/head/title/text()', 
                   '/html/body/div[1]/table/tr[1]/th/text()', 
                   '/html/body/div[2]/p[4]/b[2]/text()']

compute_frequency(bash_xpath_to_values_map[selected_xpaths[0]]) #should be 2

2

### Compute informativeness
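Given the cluster size _M_ and the list of values extracted by an XPath _Xi_, returns the informativeness of _Xi_, $I(X_i) = 1 - \frac{F(X_i)}{M \cdot |T_i|}$, where $F(X_i)$ is the frequency of _Xi_ and $T_i$ is the set of distinct values extracted by _Xi_.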

In [0]:
def compute_informativeness(M, values_list):

    values_set = set(values_list)
    Ti = len(values_set)
    
    sum_F_Xi = compute_frequency(values_list)

    return 1 - sum_F_Xi/(M*Ti)
    

In [20]:
compute_informativeness(10, [1,2,1,1,3,5,5]) #expected: 0.825

0.825

### xpath weight
Given the cluster size and the list of values extracted by an XPath _Xi_, returns the weight of _Xi_, i.e. its frequency multiplied by its informativeness.

In [0]:
def xpath_weight(cluster_size, list_of_values):
    return compute_frequency(list_of_values)*compute_informativeness(cluster_size, list_of_values)

In [22]:
xpath_weight(2, bash_xpath_to_values_map[selected_xpaths[0]]) #should be 1

1.0

In [23]:
xpath_weight(2, bash_xpath_to_values_map[selected_xpaths[1]]) #should be 0

0.0

In [24]:
xpath_weight(2, bash_xpath_to_values_map[selected_xpaths[2]]) #should be 0.5

0.5
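
The three results above follow a pattern that is easier to see on made-up data. The sketch below re-implements frequency, informativeness and weight under hypothetical names (`frequency`, `informativeness`, `weight`) so that it runs on its own; the `toy` dictionary and its keys are invented for illustration and do not come from any real page.

```python
def frequency(values):
    # F(Xi): number of pages on which the XPath selected a value
    return len(values)


def informativeness(cluster_size, values):
    # I(Xi) = 1 - F(Xi) / (M * |Ti|), with Ti the set of distinct values
    return 1 - frequency(values) / (cluster_size * len(set(values)))


def weight(cluster_size, values):
    # weight = frequency * informativeness
    return frequency(values) * informativeness(cluster_size, values)


M = 4  # assumed cluster size for this toy example
toy = {
    "boilerplate": ["Next", "Next", "Next", "Next"],  # same value on every page
    "title": ["a", "b", "c", "d"],                    # different on every page
    "rare": ["x"],                                    # present on one page only
}
for name, values in toy.items():
    print(name, informativeness(M, values), weight(M, values))
# boilerplate 0.0 0.0
# title 0.75 3.0
# rare 0.75 0.75
```

An XPath that returns the same value everywhere gets weight 0, while one that is both frequent and varied gets the highest weight; a rare XPath is informative but contributes little because its frequency is low.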

### XPath weights for the whole cluster
The helpers below recompute frequency (`F`), informativeness (`I`) and weight (`w`) directly from the XPath-to-values dictionary; `calcola_pesi` ("compute weights") applies `w` to every XPath and returns a dictionary { _xpath_ : _weight_ }.

Arguments:
- **xpath**: the XPath to score
- **second_data**: dictionary where keys are XPaths and values are the lists of values retrieved by each XPath
- **list_of_urls**: URLs of the cluster; its length is used as the cluster size

In [0]:
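def F(xpath, second_data):
    # frequency: number of pages on which the XPath selects a value
    return len(second_data.get(xpath))


def I(xpath, second_data, list_of_urls):
    # informativeness: 1 - F / (cluster size * number of distinct values)
    return 1 - (F(xpath, second_data) / (len(list_of_urls) * len(set(second_data.get(xpath)))))


def w(xpath, second_data, list_of_urls):
    # weight: frequency multiplied by informativeness
    return F(xpath, second_data) * I(xpath, second_data, list_of_urls)


def calcola_pesi(second_data, list_of_urls):
    # "calcola pesi" = "compute weights": maps every XPath to its weight
    w_list_xpath = {}
    for xpath in second_data:
        w_list_xpath[xpath] = w(xpath, second_data, list_of_urls)
    return w_list_xpath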
  


### Max weight page
Returns the URL of the page whose XPaths have the largest total weight, or an empty string if no page has positive weight.

Arguments:
- **list_of_urls**: URLs of the cluster
- **URL-to-XPaths map** (second argument): dictionary where keys are URLs and values are the XPaths extracted from each page
- **list_xpath_weight**: dictionary where keys are XPaths and values are their weights; XPaths missing from it contribute nothing to a page's weight

In [0]:
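def max_weight_page(list_of_urls, url_to_xpaths_map, list_xpath_weight):
    # the second argument maps each URL to the XPaths found on its page
    massimo_url = ''  # URL of the best page so far
    massimo = 0       # total weight of that page
    for uri in list_of_urls:
        # page weight: sum of the weights of its XPaths that are still uncovered
        somma = 0
        for xpath in url_to_xpaths_map.get(uri, []):
            weight = list_xpath_weight.get(xpath)
            if weight:
                somma = somma + weight
        if massimo < somma:
            massimo = somma
            massimo_url = uri

    # empty string when no page has positive weight
    return massimo_url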


In [0]:
def del_xpath(uri, list_xpath_weight, url_to_xpaths_map):
    # drop every XPath of the selected page, so covered XPaths no longer count
    for xpath in url_to_xpaths_map[uri]:
        list_xpath_weight.pop(xpath, None)

## Sampling algorithm

In [0]:
def sampling(list_of_urls, k = 20):

    cluster_size = len(list_of_urls)
    url_to_xpaths_map, xpath_to_values_map = get_data_structures(list_of_urls)

    list_xpath_weight = calcola_pesi(xpath_to_values_map,list_of_urls)

    #X = list(xpath_to_values_map) #insert dictionary keys into a list
    result = []

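    # select at most k pages
    while list_xpath_weight and len(result) < k:
        max_weight_url = max_weight_page(list_of_urls, url_to_xpaths_map, list_xpath_weight)
        # stop when no remaining page contributes positive weight
        if not max_weight_url:
            break
        result.append(max_weight_url)
        del_xpath(max_weight_url,list_xpath_weight,url_to_xpaths_map)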

    return result
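
The greedy loop can be tried without any network access. The following is a minimal sketch under stated assumptions, not the notebook's implementation: `greedy_sample` is a hypothetical name, the weights are given directly instead of being computed from pages, and the loop stops explicitly once no page adds positive weight.

```python
def greedy_sample(url_to_xpaths, xpath_weights, k):
    """Pick up to k pages, each time taking the page whose not-yet-covered
    XPaths have the largest total weight."""
    remaining = dict(xpath_weights)  # copy, so the caller's dict is untouched
    chosen = []
    while remaining and len(chosen) < k:
        best_url, best_score = None, 0
        for url, xpaths in url_to_xpaths.items():
            score = sum(remaining.get(xp, 0) for xp in xpaths)
            if score > best_score:
                best_url, best_score = url, score
        if best_url is None:  # no page adds positive weight any more
            break
        chosen.append(best_url)
        for xp in url_to_xpaths[best_url]:
            remaining.pop(xp, None)  # these XPaths are now covered
    return chosen


# Made-up pages, XPaths and weights, for illustration only.
url_to_xpaths = {
    "page_a": ["x1", "x2"],
    "page_b": ["x2", "x3"],
    "page_c": ["x1", "x2", "x4"],
}
xpath_weights = {"x1": 1.0, "x2": 0.0, "x3": 0.5, "x4": 2.0}
print(greedy_sample(url_to_xpaths, xpath_weights, k=5))
# ['page_c', 'page_b']
```

`page_c` is chosen first because it covers the heaviest XPaths; once they are removed, `page_a` has nothing left to add and only `page_b` still contributes weight.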

In [0]:
# Note: use k=1 here if the call fails with the default k
list_uri=["https://www.androidworld.it/schede/redmi-7-2/", "https://www.androidworld.it/schede/samsung-galaxy-a70/"]
sampling(list_uri)

In [30]:
from lxml import html
import requests
url = "http://www.europarl.europa.eu/news/en/press-room/page/"
list_of_links = []
for page in range(10):
    r = requests.get(url + str(page))
    source = r.content
    page_source = html.fromstring(source)
    list_of_links.extend(page_source.xpath('//a[@title="Read more"]/@href'))
print(list_of_links)

['http://www.europarl.europa.eu/news/en/press-room/20190404IPR35103/eu-member-states-test-cybersecurity-preparedness-for-free-and-fair-eu-elections', 'http://www.europarl.europa.eu/news/en/press-room/20190405IPR35201/the-european-parliament-launches-a-website-on-european-election-results', 'http://www.europarl.europa.eu/news/en/press-room/20190402IPR34671/mobility-package-parliament-adopts-position-on-overhaul-of-road-transport-rules', 'http://www.europarl.europa.eu/news/en/press-room/20190402IPR34670/meps-adopted-measures-to-reconcile-work-and-family-life', 'http://www.europarl.europa.eu/news/en/press-room/20190402IPR34682/meps-back-first-eu-management-plan-for-fish-stocks-in-the-western-mediterranean', 'http://www.europarl.europa.eu/news/en/press-room/20190402IPR34683/schengen-meps-adopt-their-position-on-temporary-checks-at-national-borders', 'http://www.europarl.europa.eu/news/en/press-room/20190402IPR34673/natural-gas-parliament-extends-eu-rules-to-pipelines-from-non-eu-countries'

In [33]:
sampling(list_of_links)

['http://www.europarl.europa.eu/news/en/press-room/20190321IPR32135/new-rules-to-help-consumers-join-forces-to-seek-compensation',
 'http://www.europarl.europa.eu/news/en/press-room/20190307IPR30738/uk-must-make-clear-what-it-wants-meps-say-in-brexit-debate',
 'http://www.europarl.europa.eu/news/en/press-room/20190218IPR26760/new-erasmus-more-opportunities-for-disadvantaged-youth',
 'http://www.europarl.europa.eu/news/en/press-room/20190321IPR32132/acp-eu-parliamentary-assembly-strengthening-the-partnership',
 'http://www.europarl.europa.eu/news/en/press-room/20190207IPR25282/new-forms-of-work-deal-on-measures-boosting-workers-rights',
 'http://www.europarl.europa.eu/news/en/press-room/20190123IPR24127/finnish-prime-minister-calls-for-a-more-united-eu-of-concrete-actions',
 'http://www.europarl.europa.eu/news/en/press-room/20190318IPR31813/council-strongly-criticised-over-failing-to-act-to-protect-eu-values-in-hungary',
 'http://www.europarl.europa.eu/news/en/press-room/20190109IPR2302