# Foxlink's clustering algorithm evaluation
Evaluating Foxlink's clustering algorithm on bookswagon.com pages. The aim is to calculate precision and recall for the "product" (book details) cluster and the "list" (catalog) cluster.

In [1]:
%matplotlib inline
# Importing libraries
import matplotlib.pyplot as plt
import pandas as pd

FILEPATH = '../../../datasets/bookswagon.csv'
FILEPATH

'../../../datasets/bookswagon.csv'

In [2]:
df = pd.read_csv(FILEPATH)

## Data analysis
Some preliminary analysis of the dataset

In [3]:
print("First 5 rows")
print("------------")
df.head()

First 5 rows
------------


Unnamed: 0,url,referer_url,src,shingle_vector,label
0,https://www.bookswagon.com/,https://www.bookswagon.com/,"\r\n<!DOCTYPE html PUBLIC ""-//W3C//DTD XHTML 1...","(0, 3, 6, 4, 0, 2, 10, 1)",
1,https://www.bookswagon.com/view-books/0/new-ar...,https://www.bookswagon.com/,"\r\n<!DOCTYPE html PUBLIC ""-//W3C//DTD XHTML 1...","(1, 3, 6, 4, 0, 2, 5, 1)",list
2,https://www.bookswagon.com/travel-holiday-books,https://www.bookswagon.com/,"\r\n<!DOCTYPE html PUBLIC ""-//W3C//DTD XHTML 1...","(1, 3, 6, 4, 0, 2, 5, 1)",list
3,https://www.bookswagon.com/all-categories/1000-0,https://www.bookswagon.com/,"\r\n<!DOCTYPE html PUBLIC ""-//W3C//DTD XHTML 1...","(7, 3, 3, 4, 0, 2, 10, 1)",
4,https://www.bookswagon.com/view-books/4/textbook,https://www.bookswagon.com/,"\r\n<!DOCTYPE html PUBLIC ""-//W3C//DTD XHTML 1...","(1, 3, 6, 4, 0, 2, 5, 1)",list


In [4]:
print("No. of rows and columns")
print("-----------------------")
df.shape

No. of rows and columns
-----------------------


(4447, 5)

In [5]:
print("Check null values")
print("-----------------")
df.isnull().any().any()

Check null values
-----------------


True
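
The check above only says that at least one value is missing somewhere. A minimal sketch of a per-column count follows, run on a toy frame (`toy` and its values are made up for illustration, not taken from the dataset). On the real data, the `df.info()` output further down shows that only `label` has missing values.

```python
import pandas as pd

# Toy frame standing in for the real dataset (illustrative values only).
toy = pd.DataFrame({
    "url": ["a", "b", "c"],
    "label": ["list", None, "product"],
})

# Count missing values per column instead of returning a single boolean.
null_counts = toy.isnull().sum()
print(null_counts)
```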

In [6]:
print("Check duplicate values")
print("----------------------")
len(df['url'].unique()) != df.shape[0]

Check duplicate values
----------------------


False

In [7]:
print("DataFrame column types")
print("----------------------")
df.info()

DataFrame column types
----------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4447 entries, 0 to 4446
Data columns (total 5 columns):
url               4447 non-null object
referer_url       4447 non-null object
src               4447 non-null object
shingle_vector    4447 non-null object
label             2842 non-null object
dtypes: object(5)
memory usage: 173.8+ KB


In [8]:
print("Some stats")
print("----------------")
df.describe()

Some stats
----------------


Unnamed: 0,url,referer_url,src,shingle_vector,label
count,4447,4447,4447,4447,2842
unique,4447,2192,4447,26,2
top,https://www.bookswagon.com/shoppingcart.aspx?&...,https://www.bookswagon.com/,"\r\n<!DOCTYPE html PUBLIC ""-//W3C//DTD XHTML 1...","(0, 2, 5, 0, 0, 0, 6, 1)",product
freq,1,18,1,1762,2323


In [9]:
fmt_string = 'There are {} rows with {} label'
print(fmt_string.format(len(df[df['label'].isnull()]),'no'))
print(fmt_string.format(len(df[df['label']=='product']), 'product'))
print(fmt_string.format(len(df[df['label']=='list']), 'list'))

There are 1605 rows with no label
There are 2323 rows with product label
There are 519 rows with list label


## Run Foxlink's clustering algorithm

In [10]:
#add top level folder to sys.path
import sys
sys.path.append('../../../')

In [11]:
from foxlink_clustering.clustering.structural_clustering import structural_clustering

clusters = structural_clustering(df)

In [12]:
len(clusters)

4

So Foxlink's clustering algorithm discovered 4 clusters. Let's see how many pages each cluster contains.

In [13]:
cluster_fmt = 'cluster n. {} has {} pages'
noOfPages = 0
for index, cluster in enumerate(clusters):
    clusterSize = len(cluster[1])
    print(cluster_fmt.format(index +1 , clusterSize))
    noOfPages += clusterSize
print()
print('{} pages were clustered using Foxlink\'s clustering algorithm. {} pages were discarded'.format(noOfPages, len(df.index) - noOfPages))
    

cluster n. 1 has 607 pages
cluster n. 2 has 838 pages
cluster n. 3 has 2323 pages
cluster n. 4 has 662 pages

4430 pages were clustered using Foxlink's clustering algorithm. 17 pages were discarded
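
To see which pages were discarded, one option is a set difference between the dataset URLs and the clustered URLs. The sketch below uses toy data. It assumes only what the notebook itself relies on, namely that `cluster[1]` is the list of URLs in a cluster. The names `toy_clusters` and `all_urls` are hypothetical.

```python
# Hypothetical stand-ins: cluster[1] is assumed to hold the URLs of a cluster,
# matching how the notebook indexes each cluster.
toy_clusters = [
    ("k1", ["u1", "u2"]),
    ("k2", ["u3"]),
]
all_urls = ["u1", "u2", "u3", "u4"]

# Collect every URL that ended up in some cluster.
clustered = {url for cluster in toy_clusters for url in cluster[1]}

# Keep the dataset order for the URLs that no cluster contains.
discarded = [url for url in all_urls if url not in clustered]
print(discarded)
```

On the real data, `all_urls` would be `df['url']` and `toy_clusters` would be `clusters`.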


In [19]:
for i in range(15):
    print(clusters[0][1][i])

https://www.bookswagon.com/
https://www.bookswagon.com/home
https://www.bookswagon.com/publishers
https://www.bookswagon.com/view-books/0/new-arrivals
https://www.bookswagon.com/travel-holiday-books
https://www.bookswagon.com/view-books/4/textbook
https://www.bookswagon.com/view-books/3/coming-soon-pre-order-now
https://www.bookswagon.com/sports-books
https://www.bookswagon.com/view-books/5/award-winners
https://www.bookswagon.com/self-help-personal-development-books
https://www.bookswagon.com/science-mathematics-books
https://www.bookswagon.com/view-books/1/top-selling-books
https://www.bookswagon.com/technology-engineering-books
https://www.bookswagon.com/personal-social-issues-books
https://www.bookswagon.com/society-social-sciences-books


In [15]:
for i in range(10):
    print(clusters[1][1][i])

https://www.bookswagon.com/shoppingcart.aspx?pid=26677150&vid=11&ptype=1
https://www.bookswagon.com/shoppingcart.aspx?&pid=10341972&vid=51&ptype=1
https://www.bookswagon.com/shoppingcart.aspx?&pid=12363363&vid=179&ptype=1
https://www.bookswagon.com/shoppingcart.aspx?&pid=23020162&vid=63&ptype=1
https://www.bookswagon.com/shoppingcart.aspx?ptype=1
https://www.bookswagon.com/shoppingcart.aspx?&pid=143772&vid=10&ptype=1
https://www.bookswagon.com/shoppingcart.aspx?pid=864509&vid=142&ptype=1
https://www.bookswagon.com/shoppingcart.aspx?pid=2823305&vid=220&ptype=1
https://www.bookswagon.com/shoppingcart.aspx?pid=2859520&vid=220&ptype=1
https://www.bookswagon.com/shoppingcart.aspx?pid=3665910&vid=220&ptype=1


In [16]:
for i in range(10):
    print(clusters[2][1][i])

https://www.bookswagon.com/book/rgive-design-henry-petroski/9780674065840
https://www.bookswagon.com/book/cambodia-brian-fawcett/9780020321507
https://www.bookswagon.com/book/rgive-design-henry-petroski/9780674416826
https://www.bookswagon.com/book/rgive-design-henry-petroski/9781511399906
https://www.bookswagon.com/book/success-through-failure-henry-petroski/9780691180991
https://www.bookswagon.com/book/too-much-tv-bob-reese/9781617418150
https://www.bookswagon.com/book/too-much-tv-gladys-moreta/9781627170642
https://www.bookswagon.com/book/old-new-york-dodo-press/9781406573497
https://www.bookswagon.com/book/success-through-failure-henry-petroski/9780691136424
https://www.bookswagon.com/book/summer-dodo-press-edith-wharton/9781406566154


In [17]:
for i in range(10):
    print(clusters[3][1][i])

https://www.bookswagon.com/review/too-much-tv-gladys-moreta/9781627170642
https://www.bookswagon.com/review/afterward-dodo-press-edith-wharton/9781409915409
https://www.bookswagon.com/review/summer-dodo-press-edith-wharton/9781406566154
https://www.bookswagon.com/review/age-innocence-edith-wharton/9781791669669
https://www.bookswagon.com/review/too-much-tv-gladys-moreta/9781612360195
https://www.bookswagon.com/review/erenstain-bears-too-much-tv/9780812413892
https://www.bookswagon.com/review/narrative-arthur-gordon-pym-nantucket/9781536929416
https://www.bookswagon.com/review/erenstain-bears-too-much-birthday/9780394873329
https://www.bookswagon.com/review/age-innocence-edith-wharton/9781973944355
https://www.bookswagon.com/review/dirk-gentlys-holistic-detective-agency/9785512320303


It seems that the pages belonging to the "product" template are in the third cluster, while the first cluster contains the pages belonging to the "list" template.
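
Rather than judging the mapping from a sample of URLs, it can be checked with a cluster-by-label contingency table. This is a sketch on toy data; `toy_df`, `toy_clusters` and the `"none"` placeholder for unlabelled pages are assumptions made for the example.

```python
import pandas as pd

# Toy data: four pages, one of them unlabelled.
toy_df = pd.DataFrame({
    "url": ["u1", "u2", "u3", "u4"],
    "label": ["list", "list", "product", None],
})
# cluster[1] is assumed to hold the URLs of a cluster, as in the notebook.
toy_clusters = [("k1", ["u1", "u2", "u4"]), ("k2", ["u3"])]

# Map each URL to the index of the cluster that contains it.
url_to_cluster = {
    url: idx
    for idx, cluster in enumerate(toy_clusters)
    for url in cluster[1]
}
toy_df["cluster"] = toy_df["url"].map(url_to_cluster)

# Rows: cluster index; columns: label (unlabelled pages shown as "none").
table = pd.crosstab(toy_df["cluster"], toy_df["label"].fillna("none"))
print(table)
```

The cluster whose row has the largest count in a label's column is the natural candidate for that label.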

## Calculate precision and recall
Calculate precision and recall considering the entire dataset. We use the third cluster to evaluate precision and recall for the "product" template and the first cluster for the "list" template:
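
For reference, these are the definitions that the function below implements. The symbols are introduced here for illustration: $C$ is the set of pages in the chosen cluster and $P_\ell$ is the set of pages in the dataset labelled $\ell$.

```latex
\[
\text{precision} = \frac{|C \cap P_\ell|}{|C|},
\qquad
\text{recall} = \frac{|C \cap P_\ell|}{|P_\ell|}
\]
```

Pages in $C$ without a label count against precision, since they are in the cluster but not in $P_\ell$.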

In [20]:
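def evaluate_precision_and_recall(dataFrame, cluster, label):
    """Return (recall, precision) of `cluster` with respect to `label`."""
    # cluster[1] holds the URLs assigned to the cluster
    urlsFromCluster = cluster[1]
    pages_retrieved_for_query = len(urlsFromCluster)
    true_positive = 0
    # every page in the dataset that carries the label
    all_positives = len(dataFrame[dataFrame['label']==label])
    for url in urlsFromCluster:
        # URLs are unique in the dataset, so the first match is the only one
        matchingRow = dataFrame[dataFrame['url'] == url][['url','label']].iloc[0]
        if matchingRow['label'] == label:
            true_positive += 1

    recall = true_positive/all_positives
    precision = true_positive/pages_retrieved_for_query
    return (recall, precision)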
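
The loop above filters the whole frame once per URL. A vectorized alternative is sketched below on toy data. It is not part of Foxlink, and the function name and toy values are hypothetical. It should match the loop version provided that URLs are unique in the dataset (checked earlier) and every cluster URL appears in the frame.

```python
import pandas as pd

def precision_recall_vectorized(data_frame, cluster_urls, label):
    """Return (recall, precision), same order as the loop-based function."""
    in_cluster = data_frame["url"].isin(cluster_urls)  # pages retrieved
    has_label = data_frame["label"] == label           # all positives
    true_positive = (in_cluster & has_label).sum()
    recall = true_positive / has_label.sum()
    precision = true_positive / in_cluster.sum()
    return recall, precision

# Toy data: three "list" pages, two of which fall in the cluster.
toy_df = pd.DataFrame({
    "url": ["u1", "u2", "u3", "u4", "u5"],
    "label": ["list", "list", "product", None, "list"],
})
recall, precision = precision_recall_vectorized(toy_df, ["u1", "u2", "u4"], "list")
print(recall, precision)
```

With the notebook's data, the call would be `precision_recall_vectorized(df, clusters[0][1], 'list')`.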

Calculating recall and precision for the "product" and the "list" template

In [21]:
productRecall, productPrecision = evaluate_precision_and_recall(df, clusters[2], 'product')
catalogRecall, catalogPrecision = evaluate_precision_and_recall(df, clusters[0], 'list')

In [22]:
print("+---------+--------+-----------+")
print("|    -    | Recall | Precision |")
print("+---------+--------+-----------+")
print("| {:<7} | {:>6.3f} | {:>9.3f} |".format('Product', productRecall, productPrecision))
print("| {:<7} | {:>6.3f} | {:>9.3f} |".format('List', catalogRecall, catalogPrecision))
print("+---------+--------+-----------+")

+---------+--------+-----------+
|    -    | Recall | Precision |
+---------+--------+-----------+
| Product |  1.000 |     1.000 |
| List    |  0.998 |     0.853 |
+---------+--------+-----------+
