# Foxlink's clustering algorithm evaluation
Evaluating Foxlink's clustering algorithm on bookoutlet.com pages. The aim is to calculate precision and recall for the "book details" cluster (pages labelled `product` in the dataset) and the "catalog" cluster (pages labelled `list`).

In [1]:
%matplotlib inline
# Importing libraries
import matplotlib.pyplot as plt
import pandas as pd

FILEPATH = '../../../datasets/bookoutlet.csv'
FILEPATH

'../../../datasets/bookoutlet.csv'

In [2]:
df = pd.read_csv(FILEPATH)

## Data analysis
Some preliminary analysis of the dataset.

In [3]:
print("First 5 rows")
print("------------")
df.head()

First 5 rows
------------


| | url | referer_url | src | shingle_vector | label |
|---|---|---|---|---|---|
| 0 | https://bookoutlet.com/ | https://bookoutlet.com/ | `<!DOCTYPE html>\r\n<html lang="en">\r\n<head>\...` | (0, 3, 1, 4, 0, 1, 3, 0) | |
| 1 | https://bookoutlet.com/Store/Sale | https://bookoutlet.com/ | `<!DOCTYPE html>\r\n<html lang="en">\r\n<head>\...` | (0, 3, 1, 0, 0, 1, 3, 0) | |
| 2 | https://bookoutlet.com/Store/OtherBrowsing | https://bookoutlet.com/ | `<!DOCTYPE html>\r\n<html lang="en">\r\n<head>\...` | (0, 3, 1, 8, 1, 1, 1, 0) | |
| 3 | https://bookoutlet.com/Store/Browse?N=isTopTen... | https://bookoutlet.com/ | `<!DOCTYPE html>\r\n<html lang="en">\r\n<head>\...` | (0, 3, 1, 1, 1, 1, 0, 0) | list |
| 4 | https://bookoutlet.com/Store/Browse?N=isGiftCe... | https://bookoutlet.com/ | `<!DOCTYPE html>\r\n<html lang="en">\r\n<head>\...` | (0, 3, 1, 1, 1, 1, 0, 0) | list |


In [4]:
print("No. of rows and columns")
print("-----------------------")
df.shape

No. of rows and columns
-----------------------


(16387, 5)

In [5]:
print("Check null values")
print("-----------------")
df.isnull().any().any()

Check null values
-----------------


True
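`df.isnull().any().any()` only reports that at least one value is missing somewhere. A minimal sketch of how the missing values could be located, run here on a small made-up frame (`toy` and its contents are illustrative stand-ins, not rows from the dataset):

```python
import pandas as pd

# Toy stand-in for the real dataset (values are made up for illustration)
toy = pd.DataFrame({
    'url': ['https://example.com/a', 'https://example.com/b', 'https://example.com/c'],
    'label': ['list', None, 'product'],
})

# Number of missing values in each column
null_counts = toy.isnull().sum()
print(null_counts)

# Rows that have at least one missing value
rows_with_nulls = toy[toy.isnull().any(axis=1)]
print(rows_with_nulls)
```

Applied to `df`, the same two expressions would show which columns and which pages are affected.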

In [6]:
print("Check duplicate values")
print("----------------------")
len(df['url'].unique()) != df.shape[0]

Check duplicate values
----------------------


False
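The comparison above answers only yes or no. If it ever returned `True`, the repeated URLs could be listed with `Series.duplicated`. A small sketch on made-up data (the `toy` frame is hypothetical):

```python
import pandas as pd

# Toy frame with one repeated URL (made-up values)
toy = pd.DataFrame({'url': ['u1', 'u2', 'u1', 'u3']})

# True if any URL occurs more than once
has_duplicates = toy['url'].duplicated().any()

# keep=False marks every occurrence of a repeated URL, not just the later ones
repeated = toy[toy['url'].duplicated(keep=False)]
print(has_duplicates)
print(repeated)
```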

In [7]:
print("DataFrame column types")
print("----------------------")
df.info()

DataFrame column types
----------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16387 entries, 0 to 16386
Data columns (total 5 columns):
url               16387 non-null object
referer_url       16387 non-null object
src               16387 non-null object
shingle_vector    16387 non-null object
label             16381 non-null object
dtypes: object(5)
memory usage: 640.2+ KB


In [8]:
print("Some stats")
print("----------------")
df.describe()

Some stats
----------------


| | url | referer_url | src | shingle_vector | label |
|---|---|---|---|---|---|
| count | 16387 | 16387 | 16387 | 16387 | 16381 |
| unique | 16387 | 8050 | 16387 | 11 | 2 |
| top | https://bookoutlet.com/Store/Browse?Na=134829&... | https://bookoutlet.com/Store/Browse?Npb=6326 | `<!DOCTYPE html>\r\n<html lang="en">\r\n<head>\...` | (0, 3, 1, 1, 1, 1, 0, 0) | list |
| freq | 1 | 31 | 1 | 6138 | 11580 |


In [9]:
fmt_string = 'There are {} rows with {} label'
print(fmt_string.format(len(df[df['label'].isnull()]),'no'))
print(fmt_string.format(len(df[df['label']=='product']), 'product'))
print(fmt_string.format(len(df[df['label']=='list']), 'list'))

There are 6 rows with no label
There are 4801 rows with product label
There are 11580 rows with list label


## Run Foxlink's clustering algorithm

In [10]:
#add top level folder to sys.path
import sys
sys.path.append('../../../')

In [11]:
from foxlink_clustering.clustering.structural_clustering import structural_clustering

clusters = structural_clustering(df)

In [12]:
len(clusters)

2

So Foxlink's clustering algorithm discovered 2 clusters. Let's see how many pages each cluster contains:

In [13]:
cluster_fmt = 'cluster n. {} has {} pages'
noOfPages = 0
for index, cluster in enumerate(clusters):
    clusterSize = len(cluster[1])
    print(cluster_fmt.format(index +1 , clusterSize))
    noOfPages += clusterSize
print()
print('{} pages were clustered using Foxlink\'s clustering algorithm. {} pages were discarded'.format(noOfPages, len(df.index) - noOfPages))
    

cluster n. 1 has 11589 pages
cluster n. 2 has 4795 pages

16384 pages were clustered using Foxlink's clustering algorithm. 3 pages were discarded
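To see which pages were discarded, the clustered URLs can be subtracted from the full URL list. The sketch below uses made-up stand-ins (`all_urls`, `toy_clusters`) and assumes only what the notebook itself relies on: that the second element of each cluster holds its URLs. The meaning of the first element is not documented here, so it is treated as an opaque key.

```python
# Hypothetical stand-ins: in the notebook these would be df['url'] and clusters
all_urls = ['u1', 'u2', 'u3', 'u4', 'u5']
toy_clusters = [('c1', ['u1', 'u2']), ('c2', ['u4'])]

# Collect every URL that ended up in some cluster
clustered = set()
for cluster in toy_clusters:
    clustered.update(cluster[1])

# URLs present in the dataset but absent from all clusters
discarded = [u for u in all_urls if u not in clustered]
print(discarded)
```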


In [14]:
for i in range(10):
    print(clusters[0][1][i])

https://bookoutlet.com/Store/OtherBrowsing
https://bookoutlet.com/Store/Browse?N=isTopTenSeller&Nq=0&merch=Top+10+Books
https://bookoutlet.com/Store/Browse?N=isGiftCertificate
https://bookoutlet.com/Store/Browse?N=isTopTwoHundred&Nq=0
https://bookoutlet.com/Store/Browse?N=isRetailPromo14&merch=Most+Wanted+Young+Adult&fid=14
https://bookoutlet.com/Store/Browse?N=isBlowout&merch=Clearance+Titles
https://bookoutlet.com/Store/Browse
https://bookoutlet.com/Store/Browse?Nc=39
https://bookoutlet.com/Store/Browse?N=isTweens
https://bookoutlet.com/Store/Browse?Nc=88


In [15]:
for i in range(10):
    print(clusters[1][1][i])

https://bookoutlet.com/Store/Details/9780062202611B/love-loss-and-what-we-ate
https://bookoutlet.com/Store/Details/9781524741709B/people-like-us
https://bookoutlet.com/Store/Details/9781476775692B/wake-up-happy-the-dream-big-win-big-guide-to
https://bookoutlet.com/Store/Details/9780671631987S/teach-your-child-to-read-in-100-easy-lessons
https://bookoutlet.com/Store/Details/9781250103505B/radical-candor-be-a-kickass-boss-without-losi
https://bookoutlet.com/Store/Details/9780310337379B/sacred-marriage-what-if-god-designed-marriage
https://bookoutlet.com/Store/Details/9781250069825B/nine-perfect-strangers
https://bookoutlet.com/Store/Details/9780312304355B/molokai
https://bookoutlet.com/Store/Details/9780064438780B/the-leprechauns-gold
https://bookoutlet.com/Store/Details/9781408843673B/bricks-and-mortals-ten-great-buildings-and-th


It seems that the pages belonging to the "product" template are contained in the second cluster, while the first cluster contains the pages belonging to the "list" template.
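Looking at ten URLs per cluster is only a spot check. A contingency table of cluster against label would confirm the mapping for every page. The following is a sketch on made-up data; `toy_df` and `toy_clusters` are hypothetical, and the cluster layout (URLs in the second element) is assumed from how the notebook indexes `clusters`.

```python
import pandas as pd

# Made-up labelled pages and clusters, for illustration only
toy_df = pd.DataFrame({
    'url': ['u1', 'u2', 'u3', 'u4', 'u5'],
    'label': ['list', 'list', 'product', 'product', 'list'],
})
toy_clusters = [('c1', ['u1', 'u2', 'u3']), ('c2', ['u4'])]

# Map each clustered URL to the position of its cluster
cluster_of = {url: i for i, cluster in enumerate(toy_clusters) for url in cluster[1]}

# Keep only the pages that were clustered and attach the cluster index
assigned = toy_df[toy_df['url'].isin(list(cluster_of))].copy()
assigned['cluster'] = assigned['url'].map(cluster_of)

# Rows are clusters, columns are labels, cells are page counts
table = pd.crosstab(assigned['cluster'], assigned['label'])
print(table)
```

Note that `pd.crosstab` leaves out pages whose label is missing, so unlabelled pages would need to be counted separately.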

## Calculate precision and recall
Calculate precision and recall considering the entire dataset. We use the second cluster to evaluate precision and recall for the "product" template and the first cluster for the "list" template:

In [18]:
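def evaluate_precision_and_recall(dataFrame, cluster, label):
    """Return (recall, precision) of `cluster` with respect to `label`."""
    # cluster[1] holds the URLs of the pages assigned to the cluster
    urlsFromCluster = cluster[1]
    pages_retrieved_for_query = len(urlsFromCluster)
    true_positive = 0
    # Number of pages in the whole dataset that carry the label
    all_positives = len(dataFrame[dataFrame['label'] == label])
    for url in urlsFromCluster:
        # URLs are unique in the dataset, so the first match is the only one
        matchingRow = dataFrame[dataFrame['url'] == url][['url', 'label']].iloc[0]
        if matchingRow['label'] == label:
            true_positive += 1

    recall = true_positive / all_positives
    precision = true_positive / pages_retrieved_for_query
    return (recall, precision)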
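In set notation, the function above computes the following quantities. The symbols are introduced here for illustration only: $C$ is the set of URLs in the cluster and $D_\ell$ is the set of pages in the dataset labelled $\ell$, so `true_positive` corresponds to $|C \cap D_\ell|$, `pages_retrieved_for_query` to $|C|$, and `all_positives` to $|D_\ell|$.

```latex
\[
\text{precision}(C, \ell) = \frac{|C \cap D_\ell|}{|C|},
\qquad
\text{recall}(C, \ell) = \frac{|C \cap D_\ell|}{|D_\ell|}
\]
```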

Calculating recall and precision for the "product" and the "list" template

In [19]:
productRecall, productPrecision = evaluate_precision_and_recall(df, clusters[1], 'product')
catalogRecall, catalogPrecision = evaluate_precision_and_recall(df, clusters[0], 'list')

In [20]:
print("+---------+--------+-----------+")
print("| Cluster | Recall | Precision |")
print("+---------+--------+-----------+")
row_fmt = "| {:<7} | {:>6.3f} | {:>9.3f} |"
print(row_fmt.format('Product', productRecall, productPrecision))
print(row_fmt.format('List', catalogRecall, catalogPrecision))
print("+---------+--------+-----------+")

+---------+--------+-----------+
| Cluster | Recall | Precision |
+---------+--------+-----------+
| Product |  0.999 |     1.000 |
| List    |  1.000 |     0.999 |
+---------+--------+-----------+
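`evaluate_precision_and_recall` filters the whole frame once per URL, which gets slow for clusters of this size. As a cross-check, the same numbers could be obtained with boolean masks. This is a sketch, not part of Foxlink: `precision_recall_vectorized` is a hypothetical helper, and it assumes that every cluster URL occurs exactly once in the frame, which holds for a dataset with unique URLs.

```python
import pandas as pd

def precision_recall_vectorized(data_frame, cluster_urls, label):
    """Return (recall, precision), in the same order as the notebook function."""
    # Pages of the frame that belong to the cluster
    in_cluster = data_frame['url'].isin(list(cluster_urls))
    # Pages of the frame that carry the label (missing labels compare as False)
    has_label = data_frame['label'] == label
    true_positive = int((in_cluster & has_label).sum())
    recall = true_positive / int(has_label.sum())
    precision = true_positive / int(in_cluster.sum())
    return recall, precision

# Made-up data for illustration
toy_df = pd.DataFrame({
    'url': ['u1', 'u2', 'u3', 'u4', 'u5'],
    'label': ['list', 'list', 'product', 'product', None],
})
recall, precision = precision_recall_vectorized(toy_df, ['u3', 'u4', 'u1'], 'product')
print(recall, precision)
```

In the notebook, the call would take `df`, `clusters[1][1]` and `'product'` (or `clusters[0][1]` and `'list'`), and the results should match the table above.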
