# Foxlink's clustering algorithm evaluation
Evaluating Foxlink's clustering algorithm on powells.com pages. The aim is to calculate precision and recall for the "product" (book details) cluster and the "list" (catalog) cluster.

In [1]:
%matplotlib inline
# Importing libraries
import matplotlib.pyplot as plt
import pandas as pd

FILEPATH = '../../../datasets/powells.csv'
FILEPATH

'../../../datasets/powells.csv'

In [2]:
df = pd.read_csv(FILEPATH)

## Data analysis
Some preliminary analysis of the dataset.

In [3]:
print("First 5 rows")
print("------------")
df.head()

First 5 rows
------------


Unnamed: 0,url,referer_url,src,shingle_vector,label,tag_count,bitset
0,https://www.powells.com/blog/author/kristen-ar...,https://www.powells.com/,"<!DOCTYPE html PUBLIC ""-//W3C//DTD XHTML 1.0 T...","(7, 2, 1, 8, 3, 10, 0, 5)",,"[0.002680965147453083, 0.002680965147453083, 0...","[0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, ..."
1,https://www.powells.com/blog/category/interviews,https://www.powells.com/,"<!DOCTYPE html PUBLIC ""-//W3C//DTD XHTML 1.0 T...","(2, 1, 1, 0, 3, 5, 0, 1)",,"[0.0011467889908256881, 0.0011467889908256881,...","[0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, ..."
2,https://www.powells.com/nonfiction-sale,https://www.powells.com/,"<!DOCTYPE html PUBLIC ""-//W3C//DTD XHTML 1.0 T...","(0, 2, 2, 8, 3, 0, 0, 0)",,"[0.0013054830287206266, 0.0013054830287206266,...","[0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, ..."
3,https://www.powells.com/powells-presents,https://www.powells.com/,"<!DOCTYPE html PUBLIC ""-//W3C//DTD XHTML 1.0 T...","(0, 2, 2, 8, 1, 1, 0, 0)",,"[0.0022026431718061676, 0.0022026431718061676,...","[0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, ..."
4,https://www.powells.com/locations,https://www.powells.com/,"<!DOCTYPE html PUBLIC ""-//W3C//DTD XHTML 1.0 T...","(2, 0, 0, 4, 2, 2, 0, 0)",,"[0.001976284584980237, 0.001976284584980237, 0...","[0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, ..."


In [4]:
print("No. of rows and columns")
print("-----------------------")
df.shape

No. of rows and columns
-----------------------


(10571, 7)

In [5]:
print("Check null values")
print("-----------------")
df.isnull().any().any()

Check null values
-----------------


True

In [6]:
print("Check duplicate values")
print("----------------------")
len(df['url'].unique()) != df.shape[0]

Check duplicate values
----------------------


False

In [7]:
print("DataFrame column types")
print("----------------------")
df.info()

DataFrame column types
----------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10571 entries, 0 to 10570
Data columns (total 7 columns):
url               10571 non-null object
referer_url       10571 non-null object
src               10571 non-null object
shingle_vector    10571 non-null object
label             8962 non-null object
tag_count         10571 non-null object
bitset            10571 non-null object
dtypes: object(7)
memory usage: 578.2+ KB


In [8]:
print("Some stats")
print("----------------")
df.describe()

Some stats
----------------


Unnamed: 0,url,referer_url,src,shingle_vector,label,tag_count,bitset
count,10571,10571,10571,10571,8962,10571,10571
unique,10571,6556,10571,30,2,4613,1246
top,https://www.powells.com/searchresults?keyword=...,https://www.powells.com/ProductMoreIsbn?produc...,"<!DOCTYPE html PUBLIC ""-//W3C//DTD XHTML 1.0 T...","(7, 3, 3, 0, 0, 0, 0, 0)",list,"[0.00186219739292365, 0.00186219739292365, 0.0...","[1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, ..."
freq,1,28,1,8081,8545,288,1332


In [9]:
fmt_string = 'There are {} rows with {} label'
print(fmt_string.format(len(df[df['label'].isnull()]),'no'))
print(fmt_string.format(len(df[df['label']=='product']), 'product'))
print(fmt_string.format(len(df[df['label']=='list']), 'list'))

There are 1609 rows with no label
There are 417 rows with product label
There are 8545 rows with list label
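
The three counts above can also be read off in a single call. The sketch below uses a small hypothetical frame (`toy`) in place of `df`; only `value_counts` and `isna` are real pandas calls, everything else is made up for illustration.

```python
import pandas as pd

# Hypothetical toy frame standing in for the notebook's `df`.
toy = pd.DataFrame({
    "url": ["u1", "u2", "u3", "u4", "u5"],
    "label": ["list", "list", "product", None, "list"],
})

# dropna=False keeps the unlabeled rows as their own bucket
# instead of silently dropping them from the counts.
label_counts = toy["label"].value_counts(dropna=False)
print(label_counts)

# Number of rows without a label, computed independently as a cross-check.
unlabeled = int(toy["label"].isna().sum())
print("unlabeled rows:", unlabeled)
```

On the real dataset the same call would be `df['label'].value_counts(dropna=False)`.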


## Run Foxlink's clustering algorithm

In [10]:
#add top level folder to sys.path
import sys
sys.path.append('../../../')

In [11]:
from foxlink_clustering.clustering.structural_clustering import structural_clustering

clusters = structural_clustering(df)

In [12]:
len(clusters)

5

So Foxlink's clustering algorithm discovered 5 clusters. Let's see how many pages each cluster contains.

In [13]:
cluster_fmt = 'cluster n. {} has {} pages'
noOfPages = 0
for index, cluster in enumerate(clusters):
    clusterSize = len(cluster[1])
    print(cluster_fmt.format(index +1 , clusterSize))
    noOfPages += clusterSize
print()
print('{} pages were clustered using Foxlink\'s clustering algorithm. {} pages were discarded'.format(noOfPages, len(df.index) - noOfPages))
    

cluster n. 1 has 410 pages
cluster n. 2 has 1474 pages
cluster n. 3 has 121 pages
cluster n. 4 has 8097 pages
cluster n. 5 has 447 pages

10549 pages were clustered using Foxlink's clustering algorithm. 22 pages were discarded
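
To see which pages were discarded, and not only how many, one option is a set difference between the dataset URLs and the clustered URLs. This is a minimal sketch on hypothetical data: it assumes each cluster is a pair whose second element is the list of URLs (as the `cluster[1]` access above suggests), and the keys `k1` and `k2` are placeholders.

```python
# Hypothetical stand-ins for df['url'] and for the `clusters` list.
all_urls = ["u1", "u2", "u3", "u4", "u5"]
toy_clusters = [("k1", ["u1", "u2"]), ("k2", ["u4"])]

# Collect every URL that ended up in some cluster.
clustered = {url for _, urls in toy_clusters for url in urls}

# Keep the dataset order while filtering out the clustered URLs.
discarded = [url for url in all_urls if url not in clustered]
print(discarded)
```

With the real objects, `all_urls` would be `df['url']` and `toy_clusters` would be `clusters`.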


By looking at each cluster's size we might infer that the first cluster (410 pages, close to the 417 rows labeled "product") is the one containing the pages that belong to the "product" template.

In [24]:
for i in range(20):
    print(clusters[0][1][i])

https://www.powells.com/book/how-to-be-a-cat-9781419734991/18-0
https://www.powells.com/book/mostly-dead-things-9781947793309/61-0
https://www.powells.com/book/disappearing-earth-9780525520412/18-0
https://www.powells.com/book/-9781974186044
https://www.powells.com/book/another-monster-at-the-end-of-this-book-9780307987693/18-0
https://www.powells.com/book/landmarks-9780241967874/62-0
https://www.powells.com/book/disappearing-earth-9781984892225/61-0
https://www.powells.com/book/disappearing-earth-9780525520412/61-0
https://www.powells.com/book/how-to-be-a-cat-9781419705281/2-6
https://www.powells.com/book/how-to-be-a-cat-9781419705281/7-8
https://www.powells.com/book/disappearing-earth-9781984892225
https://www.powells.com/book/felt-in-the-jaw-9781974186044/61-0
https://www.powells.com/book/how-to-be-a-cat-9781419705281
https://www.powells.com/book/how-to-be-a-cat-9781419734991/61-0
https://www.powells.com/book/legend-of-sleepy-hollow-9781416906254/1-11
https://www.powells.com/book/di

In [26]:
for i in range(10):
    print(clusters[1][1][i])

https://www.powells.com/login?returnurl=%2fpost%2finterviews%2fpowells-interview-kristen-arnett-author-of-mostly-dead-things
https://www.powells.com/login?returnurl=%2finfo%2fterms-of-use
https://www.powells.com/login?returnurl=%2flittle-golden-books-sale
https://www.powells.com/ShoppingCart.aspx?ProductItemID=253509209
https://www.powells.com/ShoppingCart.aspx?ProductItemID=122740493
https://www.powells.com/login?returnurl=%2flogin
https://www.powells.com/ShoppingCart.aspx?ProductItemID=310006846
https://www.powells.com/login?returnurl=%2finfo%2fshipping%2f
https://www.powells.com/login?returnurl=%2fpostcomment.aspx%3fproductid%3d21076017%26productmoduletabid%3d1318%26productitemid%3d122740493
https://www.powells.com/ShoppingCart.aspx?ProductItemID=251106866


In [27]:
for i in range(10):
    print(clusters[2][1][i])

https://www.powells.com/ProductMoreIsbn?productID=7100473&productItemID=312110090&binding=Hardcover&accountingCategory=Used&type=2&baseProductId=37757332
https://www.powells.com/ProductMoreIsbn?productID=7100473&productItemID=296859339&binding=Hardcover&accountingCategory=Used&type=2&baseProductId=37757332
https://www.powells.com/ProductMoreIsbn?productID=2218193&productItemID=317344959&binding=Hardcover&accountingCategory=New&type=1&baseProductId=%200
https://www.powells.com/ProductMoreIsbn?productID=5409689&productItemID=9879121&binding=Trade%20Paperback&accountingCategory=Used&type=1&baseProductId=%200
https://www.powells.com/ProductMoreIsbn?productID=4909019&productItemID=283698834&binding=Trade%20Paperback&accountingCategory=New&type=1&baseProductId=%200
https://www.powells.com/ProductMoreIsbn?productID=7639681&productItemID=9879121&binding=Hardcover&accountingCategory=Used&type=2&baseProductId=5409689
https://www.powells.com/ProductMoreIsbn?productID=7994739&productItemID=1882815

In [28]:
for i in range(10):
    print(clusters[3][1][i])

https://www.powells.com/SearchResults?keyword=Irving+Washington
https://www.powells.com/SearchResults?keyword=Washington+Irving
https://www.powells.com/searchresults?keyword=Washington%20Irving&pg=4
https://www.powells.com/searchresults?keyword=Washington%20Irving&pg=5
https://www.powells.com/searchresults?keyword=Washington%20Irving&pg=2
https://www.powells.com/searchresults?keyword=Washington%20Irving&pg=3
https://www.powells.com/searchresults?keyword=Washington%20Irving
https://www.powells.com/SearchResults?keyword=Irving+Washington+1783-1859
https://www.powells.com/searchresults?keyword=Irving%20Washington%201783-1859
https://www.powells.com/SearchResults?keyword=Washington]+1783-1859+[Irving


In [29]:
for i in range(10):
    print(clusters[4][1][i])

https://www.powells.com/searchresults?keyword=Irving%20Washington%201783-1859&hawkmm=2&book_class=Used&store=Remote%20Warehouse&category=Travel-Writing&price_facet=1
https://www.powells.com/searchresults?keyword=Irving%20Washington%201783-1859&hawkmm=2&book_class=Used&category=Travel-Writing&price_facet=1
https://www.powells.com/searchresults?keyword=Irving%20Washington%201783-1859&hawkmm=2&book_class=Used&last_received_date_string=Last%207%20days&store=Cedar%20Hills&category=Travel-Writing
https://www.powells.com/searchresults?keyword=Irving%20Washington%201783-1859&book_class=Used&category=Travel-Writing&hawkmm=0
https://www.powells.com/searchresults?keyword=Irving%20Washington%201783-1859&book_class=Used&store=Cedar%20Hills&category=Travel-Writing&hawkmm=0
https://www.powells.com/searchresults?keyword=Irving%20Washington%201783-1859&category=Travel-Writing&hawkmm=0
https://www.powells.com/searchresults?keyword=Irving%20Washington%201783-1859&book_class=Used&hawkmm=0
https://www.powe

It seems that the pages that belong to the "product" template are contained in the first cluster, while the fourth cluster contains the pages that belong to the "list" template.
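
Instead of eyeballing sample URLs, the cluster-to-template mapping could be checked with a contingency table of cluster number against label. The following is a sketch on hypothetical data; the frame `toy`, the cluster keys, and the convention that cluster 0 means "discarded" are all assumptions made for the example.

```python
import pandas as pd

# Hypothetical stand-ins for `df` and `clusters`.
toy = pd.DataFrame({
    "url": ["u1", "u2", "u3", "u4", "u5", "u6"],
    "label": ["product", "product", "list", "list", "list", None],
})
toy_clusters = [("k1", ["u1", "u2"]), ("k2", ["u3", "u4", "u6"])]

# Map each URL to its 1-based cluster number, as in the printout above.
url_to_cluster = {
    url: number
    for number, (_, urls) in enumerate(toy_clusters, start=1)
    for url in urls
}

# Pages that were not clustered get 0 (an assumption of this sketch).
assigned = toy["url"].map(url_to_cluster).fillna(0).astype(int).rename("cluster")

# Rows: cluster number; columns: label, with missing labels made explicit.
table = pd.crosstab(assigned, toy["label"].fillna("unlabeled"))
print(table)
```

A cluster whose row is concentrated in a single label column is a good candidate for that template.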

## Calculate precision and recall
Calculate precision and recall over the entire dataset. We use the first cluster to evaluate precision and recall for the "product" template and the fourth cluster for the "list" template:

In [21]:
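def evaluate_precision_and_recall(dataFrame, cluster, label):
    """Return (recall, precision) of `cluster` with respect to `label`."""
    # cluster[1] holds the URLs assigned to the cluster
    urlsFromCluster = cluster[1]
    pages_retrieved_for_query = len(urlsFromCluster)
    true_positive = 0
    # Every row in the dataset that carries the target label
    all_positives = len(dataFrame[dataFrame['label']==label])
    for url in urlsFromCluster:
        # URLs are unique in the dataset, so the first match is the only one
        matchingRow  = dataFrame[dataFrame['url'] == url][['url','label']].iloc[0]
        if matchingRow['label'] == label:
            true_positive += 1

    # Note the return order: recall first, then precision
    recall = true_positive/all_positives
    precision = true_positive/pages_retrieved_for_query
    return (recall, precision)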
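
The function above scans the DataFrame once per URL, which can be slow on large clusters. A vectorized variant is sketched below; it is not the notebook's code, and the names `recall_and_precision` and `toy` are hypothetical. It assumes that every cluster URL appears exactly once in the frame, in which case it yields the same two numbers in the same (recall, precision) order.

```python
import pandas as pd

def recall_and_precision(frame, cluster_urls, label):
    """Return (recall, precision) of a URL list with respect to `label`."""
    in_cluster = frame["url"].isin(cluster_urls)   # pages retrieved
    has_label = frame["label"] == label            # all positives
    true_positives = int((in_cluster & has_label).sum())
    recall = true_positives / int(has_label.sum())
    precision = true_positives / int(in_cluster.sum())
    return recall, precision

# Hypothetical toy data: three "product" pages, two of them in the cluster.
toy = pd.DataFrame({
    "url": ["u1", "u2", "u3", "u4", "u5", "u6"],
    "label": ["product", "product", "product", "list", "list", None],
})
toy_recall, toy_precision = recall_and_precision(toy, ["u1", "u2", "u4"], "product")
print(toy_recall, toy_precision)
```

Both denominators must be non-zero; an empty cluster or a label with no rows would raise `ZeroDivisionError`, as in the loop-based version.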

Calculating recall and precision for the "product" and the "list" templates:

In [33]:
productRecall, productPrecision = evaluate_precision_and_recall(df, clusters[0], 'product')
catalogRecall, catalogPrecision = evaluate_precision_and_recall(df, clusters[3], 'list')

In [34]:
print("+---------+--------+-----------+")
print("|    -    | Recall | Precision |")
print("+---------+--------+-----------+")
print("| {} |  {} |     {} |".format('Product', round(productRecall,3), round(productPrecision,3)))
print("| {}    |  {} |     {} |".format('List', round(catalogRecall,3), round(catalogPrecision,3)))
print("+---------+--------+-----------+")

+---------+--------+-----------+
|    -    | Recall | Precision |
+---------+--------+-----------+
| Product |  0.983 |     1.0 |
| List    |  0.948 |     1.0 |
+---------+--------+-----------+
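
As a sanity check, the recall values can be recomputed from figures already printed in this notebook. The sketch assumes precision is exactly 1.0, so that every page in a cluster is a true positive and the true-positive count equals the cluster size.

```python
# Figures taken from the outputs above.
product_rows, list_rows = 417, 8545          # rows per label in the dataset
product_cluster, list_cluster = 410, 8097    # sizes of cluster 1 and cluster 4

# With precision 1.0, true positives == cluster size,
# so recall is simply cluster size over labeled rows.
product_recall = round(product_cluster / product_rows, 3)
list_recall = round(list_cluster / list_rows, 3)
print(product_recall, list_recall)  # 0.983 0.948
```

Both values match the table, so the pages missing from each template (7 "product" and 448 "list" pages under this assumption) ended up in other clusters or were discarded.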
