# Foxlink's clustering algorithm evaluation
Evaluating Foxlink's clustering algorithm on bookdepository.com pages. The aim is to calculate precision and recall for the "book details" cluster on bookdepository.com.

In [1]:
%matplotlib inline
# Importing libraries
import matplotlib.pyplot as plt
import pandas as pd

FILEPATH = '../../../datasets/bookdepository.csv'
FILEPATH

'../../../datasets/bookdepository.csv'

In [2]:
df = pd.read_csv(FILEPATH)

## Data analysis
Some preliminary analysis of the dataset.

In [3]:
print("First 5 rows")
print("------------")
df.head()

First 5 rows
------------


Unnamed: 0,url,referer_url,src,label,shingle_vector
0,https://www.bookdepository.com/,https://www.bookdepository.com/,"<!DOCTYPE html>\n<html lang=""en"">\n<head>\n\n ...",,"(0, 3, 2, 0, 5, 1, 1, 1)"
1,https://www.bookdepository.com/author/J-K-Rowling,https://www.bookdepository.com/,"<!DOCTYPE html>\n<html lang=""en"">\n<head>\n\n ...",,"(0, 3, 2, 0, 7, 1, 2, 1)"
2,https://www.bookdepository.com/category/3098/T...,https://www.bookdepository.com/,"<!DOCTYPE html>\n<html lang=""en"">\n<head>\n\n ...",,"(0, 3, 1, 0, 7, 1, 1, 1)"
3,https://www.bookdepository.com/category/3392/B...,https://www.bookdepository.com/,"<!DOCTYPE html>\n<html lang=""en"">\n<head>\n\n ...",,"(0, 3, 1, 0, 5, 1, 1, 1)"
4,https://www.bookdepository.com/category/2967/T...,https://www.bookdepository.com/,"<!DOCTYPE html>\n<html lang=""en"">\n<head>\n\n ...",,"(0, 3, 2, 0, 7, 1, 1, 1)"


In [4]:
print("No. of rows and columns")
print("-----------------------")
df.shape

No. of rows and columns
-----------------------


(25549, 5)

In [5]:
print("Check null values")
print("-----------------")
df.isnull().any().any()

Check null values
-----------------


True

In [6]:
print("Check duplicate values")
print("----------------------")
len(df['url'].unique()) != df.shape[0]

Check duplicate values
----------------------


False
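
The comparison above only says whether any URL repeats. A minimal sketch, on a small made-up frame (the `toy` name and its URLs are placeholders, not rows from the dataset), of how `Series.duplicated` could also list the offending rows if the check ever returned `True`:

```python
import pandas as pd

# Toy frame standing in for df; these URLs are illustrative only.
toy = pd.DataFrame({
    'url': ['https://example.com/a', 'https://example.com/b', 'https://example.com/a'],
    'label': ['product', None, 'product'],
})

# True if at least one URL appears more than once.
has_duplicates = toy['url'].duplicated().any()

# keep=False marks every occurrence of a repeated URL, not only the later ones.
duplicate_rows = toy[toy['url'].duplicated(keep=False)]

print(has_duplicates)
print(duplicate_rows)
```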

In [7]:
print("DataFrame column types")
print("----------------------")
df.info()

DataFrame column types
----------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25549 entries, 0 to 25548
Data columns (total 5 columns):
url               25549 non-null object
referer_url       25549 non-null object
src               25549 non-null object
label             15807 non-null object
shingle_vector    25549 non-null object
dtypes: object(5)
memory usage: 998.1+ KB


In [8]:
print("Some stats")
print("----------------")
df.describe()

Some stats
----------------


Unnamed: 0,url,referer_url,src,label,shingle_vector
count,25549,25549,25549,15807,25549
unique,25549,10496,25549,1,23
top,https://www.bookdepository.com/author/Robert-B...,https://www.bookdepository.com/,"<!DOCTYPE html>\n<html lang=""en"">\n<head>\n\n ...",product,"(0, 1, 2, 3, 0, 1, 2, 1)"
freq,1,27,1,15807,6941


In [9]:
fmt_string = 'There are {} row with {} label'
print(fmt_string.format(len(df[df['label'].isnull()]),'no'))
print(fmt_string.format(len(df[df['label']=='product']), 'product'))

There are 9742 row with no label
There are 15807 row with product label
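
The same two counts can probably be obtained in a single call. A small sketch on a made-up column (the `toy` frame is an assumption for illustration), using `value_counts(dropna=False)` so that missing labels are counted too:

```python
import pandas as pd

# Toy label column: three 'product' rows and two rows without a label.
toy = pd.DataFrame({'label': ['product', None, 'product', None, 'product']})

# dropna=False keeps the missing labels as their own entry in the result.
label_counts = toy['label'].value_counts(dropna=False)
missing_count = toy['label'].isnull().sum()

print(label_counts)
print(missing_count)
```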


## Run Foxlink's clustering algorithm

In [10]:
#add top level folder to sys.path
import sys
sys.path.append('../../../')

In [11]:
from foxlink_clustering.clustering.structural_clustering import structural_clustering

clusters = structural_clustering(df)

In [12]:
len(clusters)

6

So Foxlink's clustering algorithm discovered 6 clusters. Let's see how many pages each cluster contains.

In [15]:
cluster_fmt = 'cluster n. {} has {} pages'
noOfPages = 0
for index, cluster in enumerate(clusters):
    clusterSize = len(cluster[1])
    print(cluster_fmt.format(index +1 , clusterSize))
    noOfPages += clusterSize
print()
print('{} pages were clustered using Foxlink\'s clustering algorithm. {} pages were discarded'.format(noOfPages, len(df.index) - noOfPages))
    

cluster n. 1 has 521 pages
cluster n. 2 has 9077 pages
cluster n. 3 has 649 pages
cluster n. 4 has 4023 pages
cluster n. 5 has 11201 pages
cluster n. 6 has 56 pages

25527 pages were clustered using Foxlink's clustering algorithm. 22 pages were discarded
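
The count of discarded pages does not say which pages they are. A minimal sketch of how they could be listed, assuming only what the cells above already rely on, namely that each cluster is a pair whose second element is the list of URLs. The names `toy_clusters` and `all_urls` and their values are made up, and nothing is assumed about the meaning of the first element:

```python
# Hypothetical stand-ins: each cluster is a (key, urls) pair, mirroring the
# cluster[1] access used in this notebook.
toy_clusters = [
    ('k1', ['u1', 'u2']),
    ('k2', ['u3']),
]
all_urls = ['u1', 'u2', 'u3', 'u4', 'u5']

# Every URL that ended up in some cluster.
clustered = {url for _, urls in toy_clusters for url in urls}

# URLs of the input that appear in no cluster, in their original order.
discarded = [url for url in all_urls if url not in clustered]
print(discarded)
```

On the real data, `all_urls` would be `df['url']` and `toy_clusters` would be `clusters`.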


Looking at each cluster, it seems that cluster n. 5 is the one that groups pages showing book details. Let's check that by printing its first 10 URLs.

In [19]:
for i in range(10):
    print(clusters[4][1][i])

https://www.bookdepository.com/Look-Inside-Our-World-Emily-Bone/9781409563945?ref=grid-view
https://www.bookdepository.com/Look-Inside-Things-That-Go-Rob-Lloyd-Jones/9781409550259?ref=grid-view
https://www.bookdepository.com/Look-Inside-Space-Rob-Lloyd-Jones/9781409523383?ref=grid-view
https://www.bookdepository.com/Angel-Time-Professor-Anne-Rice/9781400078950
https://www.bookdepository.com/Art-Mass-Effect-Universe-Casey-Hudson/9781595827685?ref=grid-view
https://www.bookdepository.com/Talking-Heads-Fear-Music-Jonathan-Lethem/9781441121004?ref=grid-view
https://www.bookdepository.com/Time-Death-Susan-Ericksen/9781469205793?ref=grid-view
https://www.bookdepository.com/Holiday-Death-J-D-Robb/9781469233758
https://www.bookdepository.com/Holiday-Death-J-D-Robb/9781417711772
https://www.bookdepository.com/Meditations-Marcus-Aurelius/9780141018829?ref=grid-view


However, as noted previously, 15807 pages are labelled as showing book details, while cluster n. 5 holds only 11201 pages. So at least 4606 pages displaying book details are distributed across the remaining 5 clusters or were discarded. Therefore Foxlink's clustering algorithm should have a high precision (possibly 1) and a lower recall.
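
The arithmetic behind that expectation, as a short worked sketch. It uses only the two counts printed earlier and assumes, optimistically, that every page in cluster n. 5 is a product page:

```python
labelled_product = 15807   # rows with label == 'product'
cluster5_size = 11201      # pages in cluster n. 5

# Product pages that cannot be in cluster n. 5, even in the best case.
missing_at_least = labelled_product - cluster5_size

# Best recall cluster n. 5 could reach under the same optimistic assumption.
recall_upper_bound = cluster5_size / labelled_product

print(missing_at_least, round(recall_upper_bound, 4))
```

Any page in the cluster that is not a product page pushes the real recall below this bound.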

## Calculate precision and recall
Calculate precision and recall considering the entire dataset.

In [20]:
bookCluster = clusters[4]
urlsFromBookCluster = bookCluster[1]

In [21]:
pages_retrieved_for_book_query = len(urlsFromBookCluster)

true_positive = 0

all_positives = len(df[df['label']=='product'])

for url in urlsFromBookCluster:
    matchingRow  = df[df['url'] == url][['url','label']].iloc[0]
    if matchingRow['label'] == 'product':
        true_positive += 1

In [22]:
recall = true_positive/all_positives
precision = true_positive/pages_retrieved_for_book_query
eval_fmt = '{}: {}'
print(eval_fmt.format('Recall', recall))
print(eval_fmt.format('Precision', precision))

Recall: 0.7048143227683937
Precision: 0.9946433354164806
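
The loop above scans the whole DataFrame once per URL. Since the URLs are unique, the same two ratios can be sketched with set intersection. The helper name `precision_recall` is hypothetical, and the true-positive count of 11141 used in the second part is inferred from the printed values; the notebook does not print it:

```python
def precision_recall(retrieved_urls, relevant_urls):
    """Return (precision, recall) for one cluster treated as a retrieval result."""
    retrieved = set(retrieved_urls)
    relevant = set(relevant_urls)
    true_positive = len(retrieved & relevant)
    precision = true_positive / len(retrieved) if retrieved else 0.0
    recall = true_positive / len(relevant) if relevant else 0.0
    return precision, recall

# Toy check with placeholder URLs: 3 of the 4 retrieved are relevant,
# and 3 of the 6 relevant are retrieved.
p, r = precision_recall(['a', 'b', 'c', 'd'], ['a', 'b', 'c', 'e', 'f', 'g'])
print(p, r)

# Counts consistent with the figures printed above (11141 is inferred).
tp, retrieved_total, relevant_total = 11141, 11201, 15807
print(tp / relevant_total, tp / retrieved_total)
```

On the real data the call would be `precision_recall(urlsFromBookCluster, df.loc[df['label'] == 'product', 'url'])`. With those inferred counts, 60 pages in the cluster are not product pages and 4666 product pages are outside it.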


Let's find out how much precision and recall depend on the dataset size. Let's consider random samples of 100, 1000 and 5000 pages.
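
The samples in this notebook are drawn without a seed, so re-running the cells gives different rows and slightly different scores. A minimal sketch of how a sample could be made repeatable with the `random_state` parameter of `DataFrame.sample`. The `toy` frame and the `SEED` value are assumptions chosen only for this illustration:

```python
import pandas as pd

# Toy frame with 1000 placeholder URLs standing in for df.
toy = pd.DataFrame({'url': ['u{}'.format(i) for i in range(1000)]})

SEED = 0  # arbitrary value, not taken from the notebook

# Two draws with the same seed select the same rows in the same order.
first = toy.sample(100, random_state=SEED)
second = toy.sample(100, random_state=SEED)

print(first['url'].tolist() == second['url'].tolist())
```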

### Evaluate recall and precision using a dataset of 100 pages

In [23]:
sample100 = df.sample(100)
sample100.describe()

Unnamed: 0,url,referer_url,src,label,shingle_vector
count,100,100,100,59,100
unique,100,99,100,1,11
top,https://www.bookdepository.com/Il-figlio-di-Ba...,https://www.bookdepository.com/author/Eric-Gol...,"<!DOCTYPE html>\n<html lang=""en"">\n<head>\n\n ...",product,"(0, 3, 2, 0, 7, 1, 2, 1)"
freq,1,2,1,59,29


In [24]:
print(fmt_string.format(len(sample100[sample100['label'].isnull()]),'no'))
print(fmt_string.format(len(sample100[sample100['label']=='product']), 'product'))

There are 41 row with no label
There are 59 row with product label


In [25]:
clusters = structural_clustering(sample100)
print("There are {} clusters".format(len(clusters)))

There are 2 clusters


In [28]:
noOfPages = 0
for index, cluster in enumerate(clusters):
    clusterSize = len(cluster[1])
    print(cluster_fmt.format(index +1 , clusterSize))
    noOfPages += clusterSize
print()
print('{} pages were clustered using Foxlink\'s clustering algorithm. {} pages were discarded'.format(noOfPages, len(sample100.index) - noOfPages))

cluster n. 1 has 39 pages
cluster n. 2 has 40 pages

79 pages were clustered using Foxlink's clustering algorithm. 21 pages were discarded


Note that Foxlink's clustering algorithm discards some pages.
Printing the first 10 data points of each cluster:

In [29]:
for i in range(10):
    print(clusters[0][1][i])

https://www.bookdepository.com/author/Robert-Walser?page=2
https://www.bookdepository.com/author/Colin-Lee
https://www.bookdepository.com/author/Deirdre-Purcell
https://www.bookdepository.com/publishers/Ancora
https://www.bookdepository.com/author/M-DIncalci
https://www.bookdepository.com/publishers/Olms-Georg-AG
https://www.bookdepository.com/author/Eric-Goldberg
https://www.bookdepository.com/author/Freeman-Of-Dublin
https://www.bookdepository.com/author/Jeffrey-Kennedy
https://www.bookdepository.com/author/Yvonne-Markus


In [30]:
for i in range(10):
    print(clusters[1][1][i])

https://www.bookdepository.com/Idylle-en-exil/9782867467448
https://www.bookdepository.com/Account-Memorial-Presented-His-Majesty-by-Captain-Pedro-Fernandez-de-Quir-Concerning-Population-Discovery-Fourth-Part-Pedro-Fernandes-De-Queirs/9781149746653
https://www.bookdepository.com/Dont-Marry-Me-Plowman-Patricia-Jeffery/9780813319940
https://www.bookdepository.com/Political-System-United-States-John-D-Lees/9780571048786
https://www.bookdepository.com/Tom-Peters-Business-School-Eric-Goldberg/9780517170014
https://www.bookdepository.com/CSM-VCE-Specialist-Mathematics-Units-3-4-Michael-Evans/9781107587434
https://www.bookdepository.com/Donne-e-guerra-Jean-B-Elshtain/9788815031457
https://www.bookdepository.com/Between-Two-Rivers/9780891200154
https://www.bookdepository.com/Water-Cycle-at-Work-Olien-Rebecca/9781474712293
https://www.bookdepository.com/Soil-Slope-Instability-Stablisation/9789061917304


So it seems that the second cluster contains pages from the 'product' template. Let's now calculate precision and recall for that cluster.
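
Here the book cluster is chosen by reading the printed URLs. As an alternative sketch, it could be picked as the cluster that shares the most URLs with the product-labelled rows. The function name `pick_cluster_by_overlap` and the toy data are hypothetical; the only assumption about the cluster structure is the `cluster[1]` URL list already used above:

```python
def pick_cluster_by_overlap(clusters, relevant_urls):
    """Return the 0-based index of the cluster sharing the most URLs with relevant_urls."""
    relevant = set(relevant_urls)
    overlaps = [len(relevant.intersection(cluster[1])) for cluster in clusters]
    return max(range(len(clusters)), key=overlaps.__getitem__)

# Toy clusters as (key, urls) pairs with placeholder URLs.
toy_clusters = [
    ('k1', ['a1', 'a2']),
    ('k2', ['p1', 'p2', 'x']),
    ('k3', ['p3']),
]
toy_product_urls = ['p1', 'p2', 'p3']

best_index = pick_cluster_by_overlap(toy_clusters, toy_product_urls)
print(best_index)
```

This selection uses the labels themselves, so it is suitable for evaluation only, not for discovering the template on unlabelled data.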

In [31]:
bookCluster = clusters[1]
urlsFromBookCluster = bookCluster[1]

In [32]:
pages_retrieved_for_book_query = len(urlsFromBookCluster)

true_positive = 0

all_positives = len(sample100[sample100['label']=='product'])

for url in urlsFromBookCluster:
    matchingRow  = sample100[sample100['url'] == url][['url','label']].iloc[0]
    if matchingRow['label'] == 'product':
        true_positive += 1

In [33]:
recall = true_positive/all_positives
precision = true_positive/pages_retrieved_for_book_query
eval_fmt = '{}: {}'
print(eval_fmt.format('Recall', recall))
print(eval_fmt.format('Precision', precision))

Recall: 0.6779661016949152
Precision: 1.0


Finally, let's calculate precision and recall again, considering 1000 pages and 5000 pages.

### Evaluate recall and precision using a dataset of 1000 pages

In [42]:
sample1000 = df.sample(1000)
sample1000.describe()

Unnamed: 0,url,referer_url,src,label,shingle_vector
count,1000,1000,1000,629,1000
unique,1000,927,1000,1,16
top,https://www.bookdepository.com/Dinosaur-Pirate...,https://www.bookdepository.com/author/Daniel-R...,"<!DOCTYPE html>\n<html lang=""en"">\n<head>\n\n ...",product,"(0, 1, 2, 3, 0, 1, 2, 1)"
freq,1,3,1,629,257


In [43]:
print(fmt_string.format(len(sample1000[sample1000['label'].isnull()]),'no'))
print(fmt_string.format(len(sample1000[sample1000['label']=='product']), 'product'))

There are 371 row with no label
There are 629 row with product label


In [44]:
clusters = structural_clustering(sample1000)
print("There are {} clusters".format(len(clusters)))

There are 5 clusters


In [45]:
noOfPages = 0
for index, cluster in enumerate(clusters):
    clusterSize = len(cluster[1])
    print(cluster_fmt.format(index +1 , clusterSize))
    noOfPages += clusterSize
print()
print('{} pages were clustered using Foxlink\'s clustering algorithm. {} pages were discarded'.format(noOfPages, len(sample1000.index) - noOfPages))

cluster n. 1 has 429 pages
cluster n. 2 has 345 pages
cluster n. 3 has 177 pages
cluster n. 4 has 25 pages
cluster n. 5 has 21 pages

997 pages were clustered using Foxlink's clustering algorithm. 3 pages were discarded


Printing each cluster's data points

In [46]:
for i in range(10):
    print(clusters[0][1][i])

https://www.bookdepository.com/Data-Structures-Using-C-Aaron-M-Tenenbaum/9780132004114
https://www.bookdepository.com/Mercury-Mining-Empire-Human-Ecological-Cost-Colonial-Silver-Mining-Andes-Nicholas-Robins/9781283235914
https://www.bookdepository.com/Draft-Constitution-for-Mercia-Jeff-Kent/9780952915249
https://www.bookdepository.com/Hornbill-Strs-Girl-Sparkling-Eyes-Colin-Swatridge-Editor/9780333566817
https://www.bookdepository.com/Bleak-House-Charles-Dickens/9780140430639
https://www.bookdepository.com/Etudes-Sur-Les-Animaux-Domestiques-Guy-De-Charnace/9781246206647
https://www.bookdepository.com/Modern-Mountain-Hideaways/9782875500458
https://www.bookdepository.com/Wacky-Things-about-Animals-2-Tricia-Martineau-Wagner/9781942875703
https://www.bookdepository.com/Hist%C3%B2ria-dOsona-Joaquim-Albareda/9788460034346
https://www.bookdepository.com/Anlatsam-Gecer-Mi-Sila-Gencoglu/9786050946581


In [47]:
for i in range(10):
    print(clusters[1][1][i])

https://www.bookdepository.com/author/Staiano-Oriana
https://www.bookdepository.com/author/Johannes-Liebrecht
https://www.bookdepository.com/author/Christian-Campbell
https://www.bookdepository.com/author/Luvvie-Ajayi
https://www.bookdepository.com/author/Llu%C3%ADs-Delgado-Pico
https://www.bookdepository.com/author/Joe-Perry
https://www.bookdepository.com/author/Patricia-Burch
https://www.bookdepository.com/author/V-Durr
https://www.bookdepository.com/author/H-C-Henry-Charles-1859-19-Beeching
https://www.bookdepository.com/author/Mark-S-Reed


In [48]:
for i in range(10):
    print(clusters[2][1][i])

https://www.bookdepository.com/500-White-Wines-Natasha-Hughes/9781416207719?ref=grid-view&qid=1557069335042&sr=1-30
https://www.bookdepository.com/500-White-Wines-Natasha-Hughes-Patricia-Langton/9781416207719
https://www.bookdepository.com/W-Is-for-Wind-Pat-Michaels/9781585362370
https://www.bookdepository.com/Power-Broker-Robert-Caro/9780394720241
https://www.bookdepository.com/Storia-dItalia-C-Vivanti/9788806427627?ref=bd_ser_1_1
https://www.bookdepository.com/Mulan-Verliebt-Shanghai-Susanne-Hornfeck/9783423650229?ref=bd_ser_1_1
https://www.bookdepository.com/Star-Wars-Use-Force-Michael-Siglain/9781484704646
https://www.bookdepository.com/Cuba-Ana-Maria-B-Vazquez/9780516027586
https://www.bookdepository.com/Modern-European-Philosophy-Carl-Schmitts-Critique-Liberalism-Against-Politics-Technology-John-P-McCormick/9780521664578?ref=grid-view&qid=1557068418743&sr=1-3
https://www.bookdepository.com/Dinosaur-Rocket-Ms-Penny-Dale/9780857633828?ref=bd_ser_1_1


In [49]:
for i in range(10):
    print(clusters[3][1][i])

https://www.bookdepository.com/Software-Projektmanagement-Georg-E-Thaller/9783935042284
https://www.bookdepository.com/Kaffee-1-Audio-CD/9783981025675
https://www.bookdepository.com/Keywords-Professor-Raymond-Williams/9780856642890
https://www.bookdepository.com/Consonancia-de-Sentencia-y-Acusacion-Luis-Emilio-Duran-G/9789586481915
https://www.bookdepository.com/Denn-Ich-Will-Aus-Mir-Machen-Das-Feinste-Hannelore-Cyrus/9783926768001
https://www.bookdepository.com/Alcoletge-Daniel-Rubio-Ruiz/9788497911740
https://www.bookdepository.com/Cambridge-Field-Guide-Old-World-Archaeology-Assaf-Yasur-Landau/9780521719933
https://www.bookdepository.com/Stability-Economics/9780985587956
https://www.bookdepository.com/Unkraut-vergeht-nicht-Klaus-Arlt/9783897573574
https://www.bookdepository.com/Konstruktivistische-Didaktik-m-CD-ROM-Kersten-Reich/9783407254108


In [50]:
for i in range(10):
    print(clusters[4][1][i])

https://www.bookdepository.com/category/134/vid/3389/Film-Theory-Criticism-Audio-Books
https://www.bookdepository.com/category/2671/National-Liberation-Independence-Post-colonialism
https://www.bookdepository.com/category/1389/Forensic-Medicine
https://www.bookdepository.com/category/88/Special-Kinds-Photography
https://www.bookdepository.com/category/294/Literary-Theory
https://www.bookdepository.com/category/106/Textile-Design-Theory
https://www.bookdepository.com/category/2722/vid/3389/Philosophy-Audio-Books?page=4
https://www.bookdepository.com/category/3171/Christian-Counselling?page=5
https://www.bookdepository.com/category/3008/vid/3389/Trees-Wildflowers-Plants-Audio-Books
https://www.bookdepository.com/category/1013/Public-Relations


Although the split is not clear-cut, three clusters (the first, third and fourth) contain pages that match the book (or product) template. Precision and recall are not calculated for this sample: with the product pages split across three clusters, no single cluster represents the template, and recall for any one of them is expected to be low.
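
When the product pages are spread over several clusters, one way to see the split is to score every cluster against the product label. A minimal sketch under the same `cluster[1]` assumption as above; `per_cluster_scores` and the toy data are hypothetical and the numbers are not results from this dataset:

```python
def per_cluster_scores(clusters, relevant_urls):
    """Return (cluster number, size, precision, recall) for every cluster."""
    relevant = set(relevant_urls)
    scores = []
    for index, cluster in enumerate(clusters):
        urls = set(cluster[1])
        true_positive = len(urls & relevant)
        precision = true_positive / len(urls) if urls else 0.0
        recall = true_positive / len(relevant) if relevant else 0.0
        scores.append((index + 1, len(urls), precision, recall))
    return scores

# Toy data: five product pages, four of them split between clusters 1 and 3.
toy_clusters = [
    ('k1', ['p1', 'p2', 'p3']),
    ('k2', ['a1', 'a2']),
    ('k3', ['p4', 'a3']),
]
toy_product_urls = ['p1', 'p2', 'p3', 'p4', 'p5']

scores = per_cluster_scores(toy_clusters, toy_product_urls)
for number, size, precision, recall in scores:
    print('cluster n. {} has {} pages: precision {}, recall {}'.format(
        number, size, precision, recall))
```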

### Evaluate recall and precision using a dataset of 5000 pages

In [51]:
sample5000 = df.sample(5000)
sample5000.describe()

Unnamed: 0,url,referer_url,src,label,shingle_vector
count,5000,5000,5000,3080,5000
unique,5000,3891,5000,1,18
top,https://www.bookdepository.com/author/Patch-Ad...,https://www.bookdepository.com/publishers/Egmo...,"<!DOCTYPE html>\n<html lang=""en"">\n<head>\n\n ...",product,"(0, 1, 2, 3, 0, 1, 2, 1)"
freq,1,8,1,3080,1351


In [52]:
print(fmt_string.format(len(sample5000[sample5000['label'].isnull()]),'no'))
print(fmt_string.format(len(sample5000[sample5000['label']=='product']), 'product'))

There are 1920 row with no label
There are 3080 row with product label


In [53]:
clusters = structural_clustering(sample5000)
print("There are {} clusters".format(len(clusters)))

There are 5 clusters


In [54]:
noOfPages = 0
for index, cluster in enumerate(clusters):
    clusterSize = len(cluster[1])
    print(cluster_fmt.format(index +1 , clusterSize))
    noOfPages += clusterSize
print()
print('{} pages were clustered using Foxlink\'s clustering algorithm. {} pages were discarded'.format(noOfPages, len(sample5000.index) - noOfPages))

cluster n. 1 has 1792 pages
cluster n. 2 has 2208 pages
cluster n. 3 has 757 pages
cluster n. 4 has 127 pages
cluster n. 5 has 103 pages

4987 pages were clustered using Foxlink's clustering algorithm. 13 pages were discarded


Printing each cluster's data points

In [55]:
for i in range(10):
    print(clusters[0][1][i])

https://www.bookdepository.com/author/Helen-Stern
https://www.bookdepository.com/author/Sean-McManus
https://www.bookdepository.com/publishers/Raj-Publications-India
https://www.bookdepository.com/author/Alexandra-Feodorovna
https://www.bookdepository.com/author/Kat-Uno
https://www.bookdepository.com/author/Jos%C3%A9-Ram%C3%B3n-Ayll%C3%B3n
https://www.bookdepository.com/publishers/Longman-Schools-Division-A-Pearson-Education-Company
https://www.bookdepository.com/author/Lily-Kahn
https://www.bookdepository.com/author/Tim-Pabon
https://www.bookdepository.com/author/Dieter-Werkm%C3%BCller


In [56]:
for i in range(10):
    print(clusters[1][1][i])

https://www.bookdepository.com/Hot-Wheels-Field-Guide-Michael-Zarnock/9781440232091?ref=bd_ser_1_1
https://www.bookdepository.com/Last-Battle-Garth-Ennis/9781592911042
https://www.bookdepository.com/Foreign-Language-Instruction-United-States-Nancy-C-Rhodes/9781887744430?ref=grid-view&qid=1557067800709&sr=1-12
https://www.bookdepository.com/Thousand-Sons-Graham-McNeill/9781849708203
https://www.bookdepository.com/Exploring-Planets-Our-Solar-System-Rebecca-Olien/9781404234673?ref=bd_ser_1_1
https://www.bookdepository.com/Educare-con-il-lavoro-Raniero-Regni/9788883588679
https://www.bookdepository.com/Recht-Und-Elektrizitat-Jan-H%C3%B6vermann/9783161552298?ref=bd_ser_1_1
https://www.bookdepository.com/Modeen-Transformation-Frank-H-Jordan/9781502863614?ref=bd_ser_1_1
https://www.bookdepository.com/Recreational-Drones-Matt-Chandler/9781474733151?ref=bd_ser_1_1
https://www.bookdepository.com/Ill-be-Devil-Leo-Butler/9781408101490


In [57]:
for i in range(10):
    print(clusters[2][1][i])

https://www.bookdepository.com/Spotlight-on-Young-Children-Rossella-Procopio/9781938113314?ref=bd_ser_1_1
https://www.bookdepository.com/Politics-Party-Policy-Anika-Gauja/9780230283459
https://www.bookdepository.com/Mein-Kinderbuchschatz-Die-sch%C3%B6nsten-Geschichten-mit-Pippi-den-Olchis-Mama-Muh-und-Pu-Kirsten-Boie/9783789142765?ref=bd_ser_1_1
https://www.bookdepository.com/Escritura-Femenina-y-Discurso-Autobiografico-en-la-Nueva-Novela-Espanola-Isolina-Ballesteros/9780820462059?ref=bd_ser_1_1
https://www.bookdepository.com/Beezus-Ramona-Beverly-Cleary/9780061774058?ref=grid-view
https://www.bookdepository.com/Social-Cognition-Communication-Joseph-P-Forgas/9781848726635?ref=grid-view&qid=1557067435026&sr=1-7
https://www.bookdepository.com/Ruby-Redfort-Pick-Your-Poison-Rachael-Stirling/9781978619593?ref=bd_ser_1_1
https://www.bookdepository.com/Coccidioidomycosis/9781573316880
https://www.bookdepository.com/Transactional-Analysis-Approaches-Brief-Therapy-Keith-Tudor/9780761956808?ref=

In [58]:
for i in range(10):
    print(clusters[3][1][i])

https://www.bookdepository.com/S%C3%A4kularisierung-und-Resakralisierung/9783826020339
https://www.bookdepository.com/NHS-Trust-Development-Authority-annual-report-accounts-NHS-Trust-Development-Authority/9781474121194
https://www.bookdepository.com/Deconstructing-Nation-Max-Silverman/9780203323915
https://www.bookdepository.com/Gottesdienstmen%C3%A4um-f%C3%BCr-den-Monat-April-auf-der-Grundlage-der-Handschrift-Sin-165-des-Staatlichen-Historischen-Museums-Moskau-GIM-Tl-2/9783506773883
https://www.bookdepository.com/TagTr%C3%A4ume-1-Audio-CD/9783942432047
https://www.bookdepository.com/Flannigans-M-T-Dohaney/9781897317068
https://www.bookdepository.com/Luz-de-Ayer-Luz-de-Hoy-Luis-Daniel-Rubio-Morales/9786074244298
https://www.bookdepository.com/African-Americans-Sports-Lisa-Wade-McCormick/9781422223949
https://www.bookdepository.com/Sables-Et-Biscuits-Si-Bons-Si-Faciles-Marie-Pourrech/9782013944243
https://www.bookdepository.com/Jesus-neu-entwerfen-Carter-Heyward/9783905577495


In [59]:
for i in range(10):
    print(clusters[4][1][i])

https://www.bookdepository.com/category/1032/Working-Patterns-Practices
https://www.bookdepository.com/category/980/vid/3389/Business-Strategy-Audio-Books
https://www.bookdepository.com/category/3127/Religious-Ethics
https://www.bookdepository.com/category/1315/Metabolism
https://www.bookdepository.com/category/1772/Industrial-Quality-Control
https://www.bookdepository.com/category/347/Second-World-War
https://www.bookdepository.com/category/1612/Cellular-Biology-cytology?page=3
https://www.bookdepository.com/category/2840/vid/3389/Mind-Body-Spirit-Meditation-Visualisation-Audio-Books
https://www.bookdepository.com/category/3314/Musical-Scores-Lyrics-Libretti
https://www.bookdepository.com/category/390/Automation-Library-Information-Processes


So it seems that the second cluster contains pages matching the book template. Let's calculate precision and recall again.

In [60]:
bookCluster = clusters[1]
urlsFromBookCluster = bookCluster[1]

In [61]:
pages_retrieved_for_book_query = len(urlsFromBookCluster)

true_positive = 0

all_positives = len(sample5000[sample5000['label']=='product'])

for url in urlsFromBookCluster:
    matchingRow  = sample5000[sample5000['url'] == url][['url','label']].iloc[0]
    if matchingRow['label'] == 'product':
        true_positive += 1

In [62]:
recall = true_positive/all_positives
precision = true_positive/pages_retrieved_for_book_query
eval_fmt = '{}: {}'
print(eval_fmt.format('Recall', recall))
print(eval_fmt.format('Precision', precision))

Recall: 0.712987012987013
Precision: 0.9945652173913043


Final insights: recall and precision show no clear dependence on dataset size. About 15 pages on average are discarded during clustering (22, 21, 3 and 13 in the four runs). Considering only the product template, precision is close to 1 while recall is poor (roughly 0.68 to 0.71). It seems that the shingle vector is too small to store all the information needed to group similar pages: there are "collisions" between pages that have different templates.
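
One way to probe the collision hypothesis is to look for shingle vectors shared by pages with different labels. A rough sketch on made-up rows (the `toy` frame and its vectors are placeholders). It assumes that identical `shingle_vector` strings mean identical vectors, and it does not assume that Foxlink groups pages by exact vector equality:

```python
import pandas as pd

# Toy rows: the first vector is shared by a product page and an unlabelled page.
toy = pd.DataFrame({
    'shingle_vector': ['(0, 1, 2)', '(0, 1, 2)', '(0, 3, 1)', '(0, 3, 1)', '(0, 3, 1)'],
    'label': ['product', None, None, None, None],
})

# Treat missing labels as their own class so they take part in the count.
classes = toy['label'].fillna('unlabelled')

# Number of distinct classes observed for each shingle vector.
classes_per_vector = classes.groupby(toy['shingle_vector']).nunique()

# Vectors carried by pages of more than one class.
colliding_vectors = classes_per_vector[classes_per_vector > 1].index.tolist()
print(colliding_vectors)
```

Because the unlabelled rows cover several templates (author, category, publisher pages), this only detects collisions between product and non-product pages.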

Let's see whether the threshold value changes the recall and precision of Foxlink's clustering algorithm. For that, let's consider the entire dataset and three threshold values: 