# Foxlink's clustering algorithm evaluation
Evaluating Foxlink's clustering algorithm on blackwells.co.uk pages. The aim is to calculate precision and recall for the "product" (book details) cluster and the "list" (catalog) cluster.

In [1]:
%matplotlib inline
# Importing libraries
import matplotlib.pyplot as plt
import pandas as pd

FILEPATH = '../../../datasets/blackwells.csv'
FILEPATH

'../../../datasets/blackwells.csv'

In [2]:
df = pd.read_csv(FILEPATH)

## Data analysis
Some preliminary analysis of the dataset.

In [3]:
print("First 5 rows")
print("------------")
df.head()

First 5 rows
------------


| | url | referer_url | src | shingle_vector | label |
|---|---|---|---|---|---|
| 0 | https://blackwells.co.uk/bookshop/basket | https://blackwells.co.uk/bookshop/home | `\n\n\n \n<!DOCTYPE html>\n<html lang="e...` | (0, 1, 5, 1, 1, 6, 3, 1) | |
| 1 | https://blackwells.co.uk/bookshop/search/ | https://blackwells.co.uk/bookshop/home | `\n\n\n \n<!DOCTYPE html>\n<html lang="e...` | (0, 1, 5, 1, 1, 0, 3, 0) | list |
| 2 | https://blackwells.co.uk/bookshop/home | https://blackwells.co.uk/bookshop/home | `\n\n\n \n<!DOCTYPE html>\n<html lang="e...` | (0, 1, 0, 1, 0, 0, 3, 1) | |
| 3 | https://blackwells.co.uk/bookshop/product/9781... | https://blackwells.co.uk/bookshop/home | `\n\n\n \n<!DOCTYPE html>\n<html lang="e...` | (0, 1, 1, 1, 1, 0, 0, 1) | product |
| 4 | https://blackwells.co.uk/bookshop/mapping | https://blackwells.co.uk/bookshop/basket | `\n\n\n\n\n\n<!DOCTYPE html>\n<html lang="en" c...` | (2, 22, 1, 1, 7, 15, 7, 5) | |


In [4]:
print("No. of rows and columns")
print("-----------------------")
df.shape

No. of rows and columns
-----------------------


(10919, 5)

In [5]:
print("Check null values")
print("-----------------")
df.isnull().any().any()

Check null values
-----------------


True

In [6]:
print("Check duplicate values")
print("----------------------")
len(df['url'].unique()) != df.shape[0]

Check duplicate values
----------------------


False

In [7]:
print("DataFrame column types")
print("----------------------")
df.info()

DataFrame column types
----------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10919 entries, 0 to 10918
Data columns (total 5 columns):
url               10919 non-null object
referer_url       10919 non-null object
src               10919 non-null object
shingle_vector    10919 non-null object
label             10899 non-null object
dtypes: object(5)
memory usage: 426.6+ KB


In [8]:
print("Some stats")
print("----------------")
df.describe()

Some stats
----------------


| | url | referer_url | src | shingle_vector | label |
|---|---|---|---|---|---|
| count | 10919 | 10919 | 10919 | 10919 | 10899 |
| unique | 10919 | 6375 | 10525 | 73 | 2 |
| top | https://blackwells.co.uk/bookshop/product/The-... | https://blackwells.co.uk/bookshop/home | `\n\n\n \n<!DOCTYPE html>\n<html lang="e...` | (0, 1, 5, 0, 1, 0, 3, 0) | product |
| freq | 1 | 12 | 7 | 2197 | 10405 |


In [11]:
fmt_string = 'There are {} rows with {} label'
print(fmt_string.format(len(df[df['label'].isnull()]),'no'))
print(fmt_string.format(len(df[df['label']=='product']), 'product'))
print(fmt_string.format(len(df[df['label']=='list']), 'list'))

There are 20 rows with no label
There are 10405 rows with product label
There are 494 rows with list label


## Run Foxlink's clustering algorithm

In [12]:
#add top level folder to sys.path
import sys
sys.path.append('../../../')

In [13]:
from foxlink_clustering.clustering.structural_clustering import structural_clustering

clusters = structural_clustering(df)

In [14]:
len(clusters)

7

So Foxlink's clustering algorithm discovered 7 clusters. Let's see how many pages each cluster contains.

In [15]:
cluster_fmt = 'cluster n. {} has {} pages'
no_of_pages = 0
for index, cluster in enumerate(clusters, start=1):
    cluster_size = len(cluster[1])  # cluster[1] holds the cluster's page URLs
    print(cluster_fmt.format(index, cluster_size))
    no_of_pages += cluster_size
print()
print("{} pages were clustered using Foxlink's clustering algorithm. {} pages were discarded".format(no_of_pages, len(df.index) - no_of_pages))

cluster n. 1 has 4760 pages
cluster n. 2 has 1390 pages
cluster n. 3 has 98 pages
cluster n. 4 has 522 pages
cluster n. 5 has 4030 pages
cluster n. 6 has 29 pages
cluster n. 7 has 25 pages

10854 pages were clustered using Foxlink's clustering algorithm. 65 pages were discarded
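To see which pages make up the discarded remainder, one option is to take the set difference between the dataset URLs and the clustered URLs. The sketch below assumes, as in the cell above, that each cluster keeps its page URLs at index 1. The helper name `discarded_urls` and the toy data are hypothetical and only illustrate the idea.

```python
import pandas as pd


def discarded_urls(df, clusters):
    """Return the URLs of df that do not appear in any cluster.

    Assumes cluster[1] is the list of page URLs of a cluster.
    """
    clustered = {url for cluster in clusters for url in cluster[1]}
    return df.loc[~df['url'].isin(clustered), 'url'].tolist()


# Toy data: placeholder URLs and cluster keys, not taken from the dataset
toy_df = pd.DataFrame({'url': ['a', 'b', 'c', 'd']})
toy_clusters = [('sig-1', ['a', 'b']), ('sig-2', ['d'])]
print(discarded_urls(toy_df, toy_clusters))
```

On the real data the call would be `discarded_urls(df, clusters)`.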


Looking at the cluster sizes, we might infer that the first cluster (which has the largest number of pages) is the one containing the pages that belong to the "product" template.

In [17]:
for i in range(10):
    print(clusters[0][1][i])

https://blackwells.co.uk/bookshop/search/
https://blackwells.co.uk/bookshop/product/The-Haunting-by-Margaret-Mahy-author/9781510105041
https://blackwells.co.uk/bookshop/product/Gunslinger-Girl-by-Lyndsay-Ely-author/9780316555241
https://blackwells.co.uk/bookshop/product/The-Fashion-Business-Manual-by-Fashionary/9789887710974
https://blackwells.co.uk/bookshop/product/Thinking-Outside-the-Box-by-Jonathan-Bell-author/9781911339168
https://blackwells.co.uk/bookshop/product/Radio-Frequency-Identification-RFID-Technology-and-Application-in-Fashion-and-Textile-Supply-Chain-by-Rajkishore-Nayak-author/9780815376231
https://blackwells.co.uk/bookshop/product/Planetary-Gear-Trains-by-Kiril-Borisov-Arnaudov-author-Dimitar-Petkov-Karaivanov-author/9781138311855
https://blackwells.co.uk/bookshop/product/Altarpieces-and-Their-Viewers-in-the-Churches-of-Rome-from-Caravaggio-to-Guido-Reni-by-Pamela-M-Jones/9780754661795
https://blackwells.co.uk/bookshop/product/Automation-in-the-Virtual-Testing-of-Mecha

In [18]:
for i in range(10):
    print(clusters[1][1][i])

https://blackwells.co.uk/bookshop/product/9781784702113
https://blackwells.co.uk/bookshop/product/Rooftoppers-by-Katherine-Rundell-author/9781784702113
https://blackwells.co.uk/bookshop/product/The-Boy-in-the-Dress-by-David-Walliams-Quentin-Blake/9781784702113
https://blackwells.co.uk/bookshop/product/Wonder-Woman-Talking-Figure-and-Illustrated-Book-by-Running-Press-author/9780762456949
https://blackwells.co.uk/bookshop/product/Library-of-Light-by-Jo-Joelson-author/9781848222533
https://blackwells.co.uk/bookshop/product/I-Love-Llamas-Activity-Book-by-Emily-Stead-author/9781784702113
https://blackwells.co.uk/bookshop/product/The-Fashion-Business-Manual-by-Fashionary/9781784702113
https://blackwells.co.uk/bookshop/product/A-Marine-Artists-Portfolio-by-Susanne-Fournais-Grube-artist/9781784702113
https://blackwells.co.uk/bookshop/product/London-Midland-Steam-by-R-J-Buckley-photographer-Brian-J-Dickson-compiler/9781784702113
https://blackwells.co.uk/bookshop/product/Hey-Awesome-by-Karen-You

In [19]:
for i in range(10):
    print(clusters[2][1][i])

https://blackwells.co.uk/bookshop/about
https://blackwells.co.uk/bookshop/product/The-Woman-in-the-Lake-by-Nicola-Cornick-author/9781848456945
https://blackwells.co.uk/bookshop/product/The-Riders-Balance-by-Sylvia-Loch-author/9781910016343
https://blackwells.co.uk/bookshop/product/OtherEarth-by-Jason-Segel-author-Kirsten-Miller-author/9781786074522
https://blackwells.co.uk/bookshop/product/The-Villa-of-Mysteries-by-David-Hewson-author/9781847519511
https://blackwells.co.uk/bookshop/product/Primary-Maths-for-Scotland-Textbook-2A-by-Lowther-Craig/9780008313982
https://blackwells.co.uk/bookshop/product/Contemporary-Health-Studies-by-Louise-Warwick-Booth-Ruth-Cross-Diane-Lowcock/9780745650227
https://blackwells.co.uk/bookshop/product/The-Masterpiece-by-Fiona-Davis-author/9781524742959
https://blackwells.co.uk/bookshop/product/UK-Politics-by-Sarra-Jenkins-author-Nick-Gallop-author/9781510447646
https://blackwells.co.uk/bookshop/product/National-45-Modern-Studies-Course-Notes-by-Jennifer-M-G

In [20]:
for i in range(10):
    print(clusters[3][1][i])

https://blackwells.co.uk/bookshop/editorial/Kitschies
https://blackwells.co.uk/bookshop/category/_biography/
https://blackwells.co.uk/bookshop/category/_biography/9781784702113
https://blackwells.co.uk/bookshop/category/_artanddesign?offset=48
https://blackwells.co.uk/bookshop/category/_artanddesign?offset=96
https://blackwells.co.uk/bookshop/category/_artanddesign?offset=144
https://blackwells.co.uk/bookshop/category/_artanddesign?offset=240
https://blackwells.co.uk/bookshop/category/_artanddesign?sortValue=Rating&offset=432
https://blackwells.co.uk/bookshop/search/publisher/Wiley%20Blackwell
https://blackwells.co.uk/bookshop/search/publisher/Grove%20Press


In [21]:
for i in range(10):
    print(clusters[4][1][i])

https://blackwells.co.uk/bookshop/product/Kudos-by-Rachel-Cusk-author/9780571346721
https://blackwells.co.uk/bookshop/product/The-Wonderful-Adventure-of-Nils-Holgersson-by-Selma-Lagerlf-Paul-R-Norln-translator-Bertil-Lybeck-illustrator/9780241206096
https://blackwells.co.uk/bookshop/product/Mr-Stink-by-David-Walliams-Quentin-Blake/9780007279067
https://blackwells.co.uk/bookshop/product/Peter-Pan-by-J-M-Barrie-Minalima-Design-Firm-illustrator/9780062362223
https://blackwells.co.uk/bookshop/product/Rooftoppers-by-Katherine-Rundell-author/9780571280599
https://blackwells.co.uk/bookshop/product/Younger-Fitter-Stronger-by-Matt-Roberts-author/9781472964496
https://blackwells.co.uk/bookshop/product/You-Got-This-by-Bryony-Gordon-author/9781526361868
https://blackwells.co.uk/bookshop/product/Catch-22-by-Joseph-Heller-author/9780099529125
https://blackwells.co.uk/bookshop/product/Becoming-Jo-by-Sophie-McKenzie-author-Louisa-May-Alcott/9781407188157
https://blackwells.co.uk/bookshop/product/The-J

In [22]:
for i in range(10):
    print(clusters[5][1][i])

https://blackwells.co.uk/bookshop/product/Into-the-Bermuda-Triangle-by-Gian-J-Quasar/9780071452175
https://blackwells.co.uk/bookshop/product/Building-a-Scalable-Data-Warehouse-With-Data-Vault-2-0-by-Dan-Linstedt-author-Michael-Olschimke-author/9780128025109
https://blackwells.co.uk/bookshop/product/Oracle-Application-Express-by-Arie-Geller-author-Brian-Spendolini-author/9780071843041
https://blackwells.co.uk/bookshop/product/Multifunctional-Ultrawideband-Antennas-by-Chinmoy-Saha-author-Jawad-Y-Siddiqui-author-Y-M-M-Antar-author/9781138553545
https://blackwells.co.uk/bookshop/product/Alive-at-Work-by-Daniel-M-Cable-author/9781633697669
https://blackwells.co.uk/bookshop/product/Introduction-to-Environmental-Impact-Assessment-by-John-Glasson-author-Riki-Therivel-author/9781138600744
https://blackwells.co.uk/bookshop/product/Cult-Cinema-by-Ernest-Mathijs-Jamie-Sexton/9781405173735
https://blackwells.co.uk/bookshop/product/Photovoltaics-by-Solar-Energy-International/9780865715202
https://bl

In [23]:
for i in range(10):
    print(clusters[6][1][i])

https://blackwells.co.uk/bookshop/product/Pocket-Naples--The-Amalfi-Coast-by-Cristian-Bonetto-author-Brendan-Sainsbury-author/9781788681162
https://blackwells.co.uk/bookshop/product/A-Managers-Guide-to-Self-Development-by-Mike-Pedler-author-John-Burgoyne-author-Tom-Boydell-author/9780077149888
https://blackwells.co.uk/bookshop/product/101-Small-Ways-to-Change-the-World-by-Aubre-Andrus-author/9781787014862
https://blackwells.co.uk/bookshop/product/Machine-Learning-for-Dummies-by-John-Mueller-author-Luca-Massaron-author/9781119245513
https://blackwells.co.uk/bookshop/product/Forensic-Psychology-by-Sandie-Taylor-author/9780815384915
https://blackwells.co.uk/bookshop/product/How-to-Talk-to-Absolutely-Anyone-by-Mark-Rhodes-author/9780857087454
https://blackwells.co.uk/bookshop/product/Immunology-by-David-K-Male/9780323080583
https://blackwells.co.uk/bookshop/product/Spanish-by-Marta-Lpez-author-Cristina-Hernndez-Montero-author/9781786573896
https://blackwells.co.uk/bookshop/product/The-A-Z-

It seems that the pages that belong to the "product" template are distributed among 6 of the 7 clusters, while the fourth cluster contains the pages that belong to the "list" template.
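Reading the first ten URLs of each cluster only gives an impression. A cross-tabulation of cluster number against ground-truth label would make the distribution explicit. The following is a minimal sketch, assuming again that `cluster[1]` holds the URLs. The helper `label_distribution`, the `'unlabelled'` placeholder and the toy data are hypothetical.

```python
import pandas as pd


def label_distribution(df, clusters):
    """Cross-tabulate cluster number (1-based) against the label column."""
    rows = [
        {'url': url, 'cluster': index}
        for index, cluster in enumerate(clusters, start=1)
        for url in cluster[1]
    ]
    assignment = pd.DataFrame(rows, columns=['url', 'cluster'])
    # Attach the ground-truth label of every clustered URL
    merged = assignment.merge(df[['url', 'label']], on='url', how='left')
    # Rows without a label get an explicit placeholder so they are counted too
    merged['label'] = merged['label'].fillna('unlabelled')
    return pd.crosstab(merged['cluster'], merged['label'])


# Toy data: placeholder URLs and cluster keys, not taken from the dataset
toy_df = pd.DataFrame({
    'url': ['p1', 'p2', 'p3', 'l1', 'x'],
    'label': ['product', 'product', 'product', 'list', None],
})
toy_clusters = [('sig-1', ['p1', 'p2', 'x']), ('sig-2', ['p3', 'l1'])]
table = label_distribution(toy_df, toy_clusters)
print(table)
```

Each row of the resulting table is a cluster and each column a label, so a template split across clusters shows up as several non-zero cells in the same column.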

## Calculate precision and recall
Calculate precision and recall over the entire dataset. We use the first cluster to evaluate the "product" template and the fourth cluster to evaluate the "list" template:

In [38]:
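def evaluate_precision_and_recall(dataFrame, cluster, label):
    """Return (recall, precision) of `cluster` with respect to `label`.

    `cluster[1]` is the list of URLs assigned to the cluster.
    """
    urls_from_cluster = cluster[1]
    pages_retrieved_for_query = len(urls_from_cluster)
    # Pages in the whole dataset that carry the label (the relevant set)
    all_positives = int((dataFrame['label'] == label).sum())
    # Index labels by URL once instead of filtering the frame for every URL
    # (URLs are unique, see the duplicate check above)
    label_by_url = dataFrame.set_index('url')['label']
    true_positive = sum(label_by_url[url] == label for url in urls_from_cluster)

    recall = true_positive / all_positives
    precision = true_positive / pages_retrieved_for_query
    return (recall, precision)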
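
Read as formulas, the function above treats the cluster as the retrieved set. Writing $C$ for the set of URLs in the cluster and $P_\ell$ for the set of pages labelled $\ell$ in the dataset (both symbols are notation introduced here for illustration), it computes:

```latex
\[
TP_\ell(C) = \lvert C \cap P_\ell \rvert, \qquad
\text{precision}_\ell(C) = \frac{TP_\ell(C)}{\lvert C \rvert}, \qquad
\text{recall}_\ell(C) = \frac{TP_\ell(C)}{\lvert P_\ell \rvert}
\]
```

In the code, $\lvert C \rvert$ is `pages_retrieved_for_query`, $\lvert P_\ell \rvert$ is `all_positives` and $TP_\ell(C)$ is `true_positive`.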

Calculating recall and precision for the "product" and "list" templates:

In [40]:
productRecall, productPrecision = evaluate_precision_and_recall(df, clusters[0], 'product')
catalogRecall, catalogPrecision = evaluate_precision_and_recall(df, clusters[3], 'list')

In [43]:
row_fmt = "| {:<7} | {:>6.3f} | {:>9.3f} |"
print("+---------+--------+-----------+")
print("|    -    | Recall | Precision |")
print("+---------+--------+-----------+")
print(row_fmt.format('Product', productRecall, productPrecision))
print(row_fmt.format('List', catalogRecall, catalogPrecision))
print("+---------+--------+-----------+")

+---------+--------+-----------+
|    -    | Recall | Precision |
+---------+--------+-----------+
| Product |  0.457 |     1.000 |
| List    |  0.996 |     0.943 |
+---------+--------+-----------+
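
The "product" recall above counts only the first cluster, although product pages also appear in other clusters. One way to quantify that effect is to pool several clusters into a single retrieved set and recompute both measures. The sketch below is an illustration under that assumption, not part of Foxlink. `pooled_precision_recall` and the toy data are hypothetical, and the return order (recall, precision) follows the function used earlier.

```python
import pandas as pd


def pooled_precision_recall(df, clusters, label):
    """Return (recall, precision) when the given clusters are merged into one set."""
    retrieved = {url for cluster in clusters for url in cluster[1]}
    positives = set(df.loc[df['label'] == label, 'url'])
    true_positives = len(retrieved & positives)
    recall = true_positives / len(positives)
    precision = true_positives / len(retrieved)
    return recall, precision


# Toy data: placeholder URLs and cluster keys, not taken from the dataset
toy_df = pd.DataFrame({
    'url': ['p1', 'p2', 'p3', 'p4', 'l1'],
    'label': ['product', 'product', 'product', 'product', 'list'],
})
toy_clusters = [('sig-1', ['p1', 'p2']), ('sig-2', ['p3', 'l1'])]

single = pooled_precision_recall(toy_df, toy_clusters[:1], 'product')
pooled = pooled_precision_recall(toy_df, toy_clusters, 'product')
print(single)  # first cluster only
print(pooled)  # both clusters merged
```

In the toy example pooling raises recall at the cost of precision, because the second cluster also contains a "list" page.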
