## Our clustering algorithm evaluation
This notebook evaluates our clustering algorithms on pages crawled from powells.com. The aim is to calculate precision and recall for the "book details" cluster (pages labeled `product`) and the "catalog" cluster (pages labeled `list`).

In [1]:
# Importing libraries
import numpy as np
import pandas as pd
import ast
FILEPATH = '../../../datasets/powells.csv'
FILEPATH

'../../../datasets/powells.csv'

In [2]:
df = pd.read_csv(FILEPATH, converters={'bitset': ast.literal_eval, 'tag_count': ast.literal_eval})

## Data analysis
Some preliminary analysis of the dataset.

In [3]:
print("First 5 rows")
print("------------")
df.head()

First 5 rows
------------


| Unnamed: 0 | url | referer_url | src | shingle_vector | label | tag_count | bitset |
|---|---|---|---|---|---|---|---|
| 0 | https://www.powells.com/blog/author/kristen-ar... | https://www.powells.com/ | `<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 T...` | (7, 2, 1, 8, 3, 10, 0, 5) | | [0.002680965147453083, 0.002680965147453083, 0... | [0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, ... |
| 1 | https://www.powells.com/blog/category/interviews | https://www.powells.com/ | `<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 T...` | (2, 1, 1, 0, 3, 5, 0, 1) | | [0.0011467889908256881, 0.0011467889908256881,... | [0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, ... |
| 2 | https://www.powells.com/nonfiction-sale | https://www.powells.com/ | `<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 T...` | (0, 2, 2, 8, 3, 0, 0, 0) | | [0.0013054830287206266, 0.0013054830287206266,... | [0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, ... |
| 3 | https://www.powells.com/powells-presents | https://www.powells.com/ | `<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 T...` | (0, 2, 2, 8, 1, 1, 0, 0) | | [0.0022026431718061676, 0.0022026431718061676,... | [0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, ... |
| 4 | https://www.powells.com/locations | https://www.powells.com/ | `<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 T...` | (2, 0, 0, 4, 2, 2, 0, 0) | | [0.001976284584980237, 0.001976284584980237, 0... | [0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, ... |


In [4]:
print("No. of rows and columns")
print("-----------------------")
df.shape

No. of rows and columns
-----------------------


(10571, 7)

In [5]:
print("Check null values")
print("-----------------")
df.isnull().any().any()

Check null values
-----------------


True

In [16]:
print("Check duplicate values")
print("----------------------")
len(df['url'].unique()) != df.shape[0]

Check duplicate values
----------------------


False

In [6]:
print("DataFrame column types")
print("----------------------")
df.info()

DataFrame column types
----------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10571 entries, 0 to 10570
Data columns (total 7 columns):
url               10571 non-null object
referer_url       10571 non-null object
src               10571 non-null object
shingle_vector    10571 non-null object
label             8962 non-null object
tag_count         10571 non-null object
bitset            10571 non-null object
dtypes: object(7)
memory usage: 578.2+ KB


In [8]:
fmt_string = 'There are {} row with {} label'
print(fmt_string.format(len(df[df['label'].isnull()]),'no'))
print(fmt_string.format(len(df[df['label']=='product']), 'product'))
print(fmt_string.format(len(df[df['label']=='list']), 'list'))

There are 1609 row with no label
There are 417 row with product label
There are 8545 row with list label


## Run MeanShift clustering algorithm

In [9]:
#add top level folder to sys.path
import sys
sys.path.append('../../../')

In [10]:
from astarwars_clustering.clustering import clusteringevaluation
from astarwars_clustering.utils import utility
from astarwars_clustering.clustering.structural_clustering import dbscanclustering, meanshiftclustering

In [11]:
sample=df.sample(3000)
bitsetmat=sample['bitset'].tolist()
tagcountmat=sample['tag_count'].tolist()
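
`df.sample(3000)` draws a different random subset on every run, so the label counts and cluster sizes reported in the rest of this notebook will vary between executions. If reproducibility matters, pandas' `random_state` parameter pins the draw. The sketch below uses a made-up frame (`toy_df`) and an arbitrary seed (`42`); neither comes from the original notebook.

```python
import pandas as pd

# Hypothetical stand-in for the real dataset
toy_df = pd.DataFrame({'url': ['page-{}'.format(i) for i in range(100)]})

# Two draws with the same seed select the same rows in the same order
first = toy_df.sample(10, random_state=42)
second = toy_df.sample(10, random_state=42)

print(first.index.equals(second.index))
```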

In [12]:
clustering = meanshiftclustering(bitsetmat,10)

Elapsed time to calculate MeanShift clustering:00:06:27.11


In [15]:
print(fmt_string.format(len(sample[sample['label'].isnull()]),'no'))
print(fmt_string.format(len(sample[sample['label']=='product']), 'product'))
print(fmt_string.format(len(sample[sample['label']=='list']), 'list'))

There are 439 row with no label
There are 114 row with product label
There are 2447 row with list label


In [14]:
predictedLabels = clustering.labels_
noOfClusters = np.unique(predictedLabels)
sample['predicted_label'] = predictedLabels
print('There are ' + str(len(noOfClusters)) + ' clusters')
print()
print()
print('Cluster labels:')
noOfClusters

There are 12 clusters


Cluster labels:


array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11])

In [16]:
cluster_fmt = 'cluster n. {} has {} pages'

# Report each cluster under its actual label value, not its position in the array
for el in noOfClusters:
    print(cluster_fmt.format(el, utility.count_occurrences(predictedLabels, el)))

cluster n. 0 has 2313 pages
cluster n. 1 has 369 pages
cluster n. 2 has 129 pages
cluster n. 3 has 114 pages
cluster n. 4 has 35 pages
cluster n. 5 has 29 pages
cluster n. 6 has 5 pages
cluster n. 7 has 2 pages
cluster n. 8 has 1 pages
cluster n. 9 has 1 pages
cluster n. 10 has 1 pages
cluster n. 11 has 1 pages
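
The per-cluster counts above come from the project helper `utility.count_occurrences`, whose implementation is not shown here. As a minimal sketch, the same tally can be obtained with NumPy alone through `np.unique(..., return_counts=True)`. The `toy_labels` array is made-up illustration data, not the notebook's result.

```python
import numpy as np

# Hypothetical cluster assignments standing in for clustering.labels_
toy_labels = np.array([0, 0, 1, 0, 2, 1, 0, -1])

# np.unique returns the sorted distinct labels and, with return_counts=True,
# how many times each one occurs
labels, counts = np.unique(toy_labels, return_counts=True)

# Key the counts by the actual label value rather than by position,
# so that a negative label is reported under its own name
pages_per_cluster = {int(label): int(count) for label, count in zip(labels, counts)}

for label, count in pages_per_cluster.items():
    print('cluster n. {} has {} pages'.format(label, count))
```

Keying by label value avoids the off-by-one confusion that arises when positions and labels differ.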


In [17]:
sample[sample['predicted_label'] == 0]['url'].head(20)

2622    https://www.powells.com/searchresults?keyword=...
3602    https://www.powells.com/searchresults?keyword=...
4066    https://www.powells.com/searchresults?keyword=...
4326    https://www.powells.com/searchresults?keyword=...
4132    https://www.powells.com/searchresults?keyword=...
1538    https://www.powells.com/searchresults?keyword=...
6934    https://www.powells.com/searchresults?keyword=...
3808    https://www.powells.com/searchresults?keyword=...
3097    https://www.powells.com/searchresults?keyword=...
6362    https://www.powells.com/searchresults?keyword=...
8213    https://www.powells.com/searchresults?keyword=...
8231    https://www.powells.com/searchresults?keyword=...
8380    https://www.powells.com/searchresults?keyword=...
9244    https://www.powells.com/searchresults?keyword=...
6341    https://www.powells.com/searchresults?keyword=...
8992    https://www.powells.com/searchresults?keyword=...
1632    https://www.powells.com/searchresults?keyword=...
1244    https:

In [18]:
sample[sample['predicted_label'] == 1]['url'].head(10)

4442     https://www.powells.com/login?returnurl=%2fsea...
1621     https://www.powells.com/login?returnurl=%2fsea...
6235     https://www.powells.com/login?returnurl=%2fsea...
5881     https://www.powells.com/login?returnurl=%2fsea...
445      https://www.powells.com/login?returnurl=%2fboo...
766      https://www.powells.com/login?returnurl=%2fpos...
10350    https://www.powells.com/login?returnurl=%2fsea...
3750     https://www.powells.com/login?returnurl=%2fpos...
9884     https://www.powells.com/login?returnurl=%2fsea...
6990     https://www.powells.com/login?returnurl=%2fsea...
Name: url, dtype: object

In [20]:
sample[sample['predicted_label'] == 2]['url'].head(30)

10349    https://www.powells.com/searchresults?keyword=...
2665     https://www.powells.com/searchresults?keyword=...
2842     https://www.powells.com/searchresults?keyword=...
10259    https://www.powells.com/searchresults?keyword=...
7452     https://www.powells.com/searchresults?keyword=...
6433     https://www.powells.com/searchresults?keyword=...
7346     https://www.powells.com/searchresults?keyword=...
9614     https://www.powells.com/searchresults?keyword=...
1411     https://www.powells.com/searchresults?keyword=...
10086    https://www.powells.com/searchresults?keyword=...
10517    https://www.powells.com/searchresults?keyword=...
2642     https://www.powells.com/searchresults?keyword=...
10128    https://www.powells.com/searchresults?keyword=...
2637     https://www.powells.com/searchresults?keyword=...
2588     https://www.powells.com/searchresults?keyword=...
3591     https://www.powells.com/searchresults?keyword=...
6384     https://www.powells.com/searchresults?keyword=.
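
The cluster ids used in the evaluation are chosen by reading the URLs listed above. A complementary check is to cross-tabulate true labels against cluster ids with `pd.crosstab`. The sketch below assumes a frame with the `label` and `predicted_label` columns used in this notebook; the `toy` frame and its values are invented for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the sampled DataFrame
toy = pd.DataFrame({
    'label': ['list', 'list', 'list', None, 'product', 'product'],
    'predicted_label': [0, 0, 1, 1, 3, 3],
})

# Give missing labels an explicit marker so unlabeled pages get their own column
true_labels = toy['label'].fillna('unlabeled')

# Rows are cluster ids, columns are true labels, cells are page counts
table = pd.crosstab(toy['predicted_label'], true_labels)
print(table)

# For each true label, the cluster id that holds most of its pages
best_cluster = table.idxmax(axis=0)
print(best_cluster)
```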

## Evaluate recall and precision

In [21]:
p1,r1=clusteringevaluation.calculate_precision_and_recall(sample,clustering,'list',0)

Recall is 0.9452390682468329
Precision is 1.0


In [25]:
p1,r1=clusteringevaluation.calculate_precision_and_recall(sample,clustering,'product',3)

Recall is 1.0
Precision is 1.0
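
`clusteringevaluation.calculate_precision_and_recall` is a project function whose source is not shown here. The sketch below assumes the usual definitions for a single cluster: precision is the share of pages in the cluster that carry the target label, and recall is the share of pages with the target label that ended up in the cluster. The function name `precision_recall_for_cluster` and the toy data are hypothetical, and how the project function treats unlabeled pages is unknown.

```python
def precision_recall_for_cluster(true_labels, predicted_labels, target_label, cluster_id):
    """Precision and recall of one cluster with respect to one true label."""
    pairs = list(zip(true_labels, predicted_labels))
    # Pages assigned to the cluster, regardless of their true label
    in_cluster = sum(1 for _, pred in pairs if pred == cluster_id)
    # Pages carrying the target label, regardless of their cluster
    with_label = sum(1 for true, _ in pairs if true == target_label)
    # Pages that satisfy both conditions
    hits = sum(1 for true, pred in pairs
               if true == target_label and pred == cluster_id)
    precision = hits / in_cluster if in_cluster else 0.0
    recall = hits / with_label if with_label else 0.0
    return precision, recall


# Made-up labels: four list pages, two product pages, one unlabeled page
true_labels = ['list', 'list', 'list', 'list', 'product', 'product', None]
predicted_labels = [0, 0, 0, 1, 3, 3, 1]

precision, recall = precision_recall_for_cluster(true_labels, predicted_labels, 'list', 0)
print('Precision is', precision)
print('Recall is', recall)
```

Under these definitions, the `list` recall reported above is consistent with 2313 / 2447 ≈ 0.9452, that is, the 2313 pages of cluster 0 out of the 2447 `list` pages in the sample.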


## Run DBSCAN clustering algorithm

In [26]:
dbsclustering=dbscanclustering(bitsetmat,10,20)

Elapsed time to calculate DBSCAN clustering:00:00:39.23


In [27]:
predictedLabels = dbsclustering.labels_
noOfClusters = np.unique(predictedLabels)
sample['predicted_label'] = predictedLabels
print('There are ' + str(len(noOfClusters)) + ' distinct labels')
print()
print()
print('Cluster labels:')
noOfClusters

There are 7 distinct labels


Cluster labels:


array([-1,  0,  1,  2,  3,  4,  5])
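
The label `-1` in the array above is worth a note. `dbscanclustering` is a project wrapper whose implementation is not shown here. If it is built on scikit-learn's `DBSCAN` (an assumption), then `-1` marks noise points that belong to no cluster, so the number of real clusters is one fewer than the number of distinct labels. A minimal sketch on made-up 2-D points, with `eps` and `min_samples` chosen only to suit the toy data:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two tight groups of three points each, plus one isolated point
points = np.array([
    [0.0, 0.0], [0.0, 0.1], [0.1, 0.0],
    [5.0, 5.0], [5.0, 5.1], [5.1, 5.0],
    [10.0, 0.0],
])

model = DBSCAN(eps=0.5, min_samples=2).fit(points)
labels = model.labels_

# Noise points carry the label -1 and are not counted as a cluster
n_noise = int(np.sum(labels == -1))
n_clusters = len(set(labels.tolist()) - {-1})

print('labels:', labels)
print('clusters:', n_clusters, 'noise points:', n_noise)
```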

In [28]:
cluster_fmt = 'cluster n. {} has {} pages'

# Report each cluster under its actual label value; DBSCAN labels start at -1,
# so using the enumerate index here would shift every label by one
for el in noOfClusters:
    print(cluster_fmt.format(el, utility.count_occurrences(predictedLabels, el)))

cluster n. -1 has 15 pages
cluster n. 0 has 2309 pages
cluster n. 1 has 29 pages
cluster n. 2 has 35 pages
cluster n. 3 has 369 pages
cluster n. 4 has 114 pages
cluster n. 5 has 129 pages


In [29]:
p1,r1=clusteringevaluation.calculate_precision_and_recall(sample,dbsclustering,'list',0)

Recall is 0.9436044135676338
Precision is 1.0


In [30]:
p1,r1=clusteringevaluation.calculate_precision_and_recall(sample,dbsclustering,'product',4)

Recall is 1.0
Precision is 1.0
