## Evaluating our clustering algorithm
This notebook evaluates our clustering algorithm on pages from bookswagon.com. The aim is to compute precision and recall for the "book details" cluster and the "catalog" cluster, which appear to correspond to the `product` and `list` labels in the dataset.

In [1]:
# Importing libraries
import pandas as pd
FILEPATH = '../../../datasets/bookswagon.csv'
FILEPATH

'../../../datasets/bookswagon.csv'

In [2]:
df = pd.read_csv(FILEPATH,nrows=2000)

## Data analysis
Some preliminary analysis of the dataset.

In [3]:
print("First 5 rows")
print("------------")
df.head()

First 5 rows
------------


Unnamed: 0,url,referer_url,src,shingle_vector,label
0,https://www.bookswagon.com/,https://www.bookswagon.com/,"\r\n<!DOCTYPE html PUBLIC ""-//W3C//DTD XHTML 1...","(0, 3, 6, 4, 0, 2, 10, 1)",
1,https://www.bookswagon.com/view-books/0/new-ar...,https://www.bookswagon.com/,"\r\n<!DOCTYPE html PUBLIC ""-//W3C//DTD XHTML 1...","(1, 3, 6, 4, 0, 2, 5, 1)",list
2,https://www.bookswagon.com/travel-holiday-books,https://www.bookswagon.com/,"\r\n<!DOCTYPE html PUBLIC ""-//W3C//DTD XHTML 1...","(1, 3, 6, 4, 0, 2, 5, 1)",list
3,https://www.bookswagon.com/all-categories/1000-0,https://www.bookswagon.com/,"\r\n<!DOCTYPE html PUBLIC ""-//W3C//DTD XHTML 1...","(7, 3, 3, 4, 0, 2, 10, 1)",
4,https://www.bookswagon.com/view-books/4/textbook,https://www.bookswagon.com/,"\r\n<!DOCTYPE html PUBLIC ""-//W3C//DTD XHTML 1...","(1, 3, 6, 4, 0, 2, 5, 1)",list


In [4]:
print("No. of rows and columns")
print("-----------------------")
df.shape

No. of rows and columns
-----------------------


(2000, 5)

In [6]:
print("Check null values")
print("-----------------")
df.isnull().any().any()

Check null values
-----------------


True
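
The check above only says that at least one value is missing somewhere in the frame. A per-column count shows where the gaps are. The sketch below uses a small made-up frame: the name `toy` and its values are illustrative and not taken from the dataset. Applied to `df`, the same call would be expected to flag only the `label` column, judging by the counts reported in the other cells.

```python
import pandas as pd

# Illustrative frame: the values are invented, only the column names mirror the dataset.
toy = pd.DataFrame({
    "url": ["a", "b", "c"],
    "label": ["list", None, "product"],
})

# Number of missing values in each column.
null_counts = toy.isnull().sum()
print(null_counts)
```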

In [7]:
print("Check duplicate values")
print("----------------------")
# True if any URL appears more than once
df['url'].duplicated().any()

Check duplicate values
----------------------


False

In [8]:
print("DataFrame column types")
print("----------------------")
df.info()

DataFrame column types
----------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3000 entries, 0 to 2999
Data columns (total 5 columns):
url               3000 non-null object
referer_url       3000 non-null object
src               3000 non-null object
shingle_vector    3000 non-null object
label             1962 non-null object
dtypes: object(5)
memory usage: 117.3+ KB


In [5]:
print("Some stats")
print("----------------")
df.describe()

Some stats
----------------


Unnamed: 0,url,referer_url,src,shingle_vector,label
count,2000,2000,2000,2000,1279
unique,2000,1018,2000,26,2
top,https://www.bookswagon.com/shoppingcart.aspx?p...,https://www.bookswagon.com/,"\r\n<!DOCTYPE html PUBLIC ""-//W3C//DTD XHTML 1...","(0, 2, 5, 0, 0, 0, 6, 1)",product
freq,1,18,1,845,1052


In [6]:
fmt_string = 'There are {} row with {} label'
print(fmt_string.format(len(df[df['label'].isnull()]),'no'))
print(fmt_string.format(len(df[df['label']=='product']), 'product'))
print(fmt_string.format(len(df[df['label']=='list']), 'list'))

There are 721 row with no label
There are 1052 row with product label
There are 227 row with list label
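
The same breakdown can be obtained in one call with `value_counts(dropna=False)`, which also counts the unlabeled rows. This is a minimal sketch on invented data; the series below is illustrative, not a slice of the dataset.

```python
import pandas as pd

# Illustrative labels: None stands for a page without a label.
labels = pd.Series(["product", "list", None, "product", None], name="label")

# dropna=False keeps the missing values as their own row in the result.
counts = labels.value_counts(dropna=False)
print(counts)
```

In the notebook, `df['label'].value_counts(dropna=False)` would replace the three separate `print` calls.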


## Run MeanShift clustering algorithm

In [7]:
# Add the top-level folder to sys.path so the astarwars_clustering package can be imported
import sys
sys.path.append('../../../')

In [8]:
from astarwars_clustering.clustering import clusteringevaluation
from astarwars_clustering.utils import utility
from astarwars_clustering.clustering.structural_clustering import dbscanclustering, meanshiftclustering, createFeatureMatrix

In [9]:
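# df has 2000 rows, so sampling 2000 of them only shuffles the frame; the label counts stay the same
sample2000 = df.sample(2000)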
sample2000.head()

Unnamed: 0,url,referer_url,src,shingle_vector,label
710,https://www.bookswagon.com/shoppingcart.aspx?&...,https://www.bookswagon.com/author/colin-nicholson,"\r\n<!DOCTYPE html PUBLIC ""-//W3C//DTD XHTML 1...","(2, 3, 13, 2, 3, 2, 7, 0)",
506,https://www.bookswagon.com/book/mathematical-s...,https://www.bookswagon.com/book/evidencebased-...,"\r\n<!DOCTYPE html PUBLIC ""-//W3C//DTD XHTML 1...","(0, 2, 5, 0, 0, 0, 6, 1)",product
562,https://www.bookswagon.com/book/chinese-televi...,https://www.bookswagon.com/book/american-telev...,"\r\n<!DOCTYPE html PUBLIC ""-//W3C//DTD XHTML 1...","(0, 2, 5, 0, 0, 0, 6, 1)",product
410,https://www.bookswagon.com/book/das-m-rchen-un...,https://www.bookswagon.com/book/burke-int-fran...,"\r\n<!DOCTYPE html PUBLIC ""-//W3C//DTD XHTML 1...","(0, 2, 5, 0, 0, 0, 6, 1)",product
229,https://www.bookswagon.com/author/brent-wood,https://www.bookswagon.com/book/sexts-brent-wo...,"\r\n<!DOCTYPE html PUBLIC ""-//W3C//DTD XHTML 1...","(0, 3, 6, 4, 0, 2, 5, 1)",list


In [10]:
print(fmt_string.format(len(sample2000[sample2000['label'].isnull()]),'no'))
print(fmt_string.format(len(sample2000[sample2000['label']=='product']), 'product'))
print(fmt_string.format(len(sample2000[sample2000['label']=='list']), 'list'))

There are 721 row with no label
There are 1052 row with product label
There are 227 row with list label


In [11]:
featureMatrix = createFeatureMatrix(df['src'])

Elapsed time to calculate features:00:03:30.85


In [12]:
msclustering = meanshiftclustering(0.1, featureMatrix)

Elapsed time to calculate MeanShift clustering:00:00:22.75


In [36]:
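import numpy as np

meanShiftPredictedLabels = msclustering.labels_
# np.unique returns the sorted cluster IDs, not the number of clusters
clusterIds = np.unique(meanShiftPredictedLabels)

# Print the size of each cluster, in cluster-ID order
for el in clusterIds:
    print(utility.count_occurrences(meanShiftPredictedLabels, el))

# Score cluster 0 against 'product' and cluster 3 against 'list';
# separate names keep the second call from overwriting the first result
p_product, r_product = clusteringevaluation.calculate_precision_and_recall(df, msclustering, 'product', 0)
p_list, r_list = clusteringevaluation.calculate_precision_and_recall(df, msclustering, 'list', 3)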

1404
351
18
15
2
2
9
193
1
1
1
1
1
1
Recall is 0.8935361216730038
Precision is 0.6695156695156695
Recall is 0.02643171806167401
Precision is 0.4
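
The implementation of `calculate_precision_and_recall` is not shown in this notebook. The figures above are consistent with precision being the share of pages in the chosen cluster that carry the label, and recall being the share of pages with that label that land in the cluster (for `product`, 940/1404 and 940/1052). Under that assumption, unlabeled pages in the cluster lower precision. A minimal sketch of this definition, with a hypothetical helper name and invented data:

```python
def precision_recall(true_labels, cluster_ids, label, cluster_id):
    """Score one cluster against one label.

    Hypothetical helper, not the project's implementation.
    """
    # True labels of the pages assigned to the chosen cluster.
    in_cluster = [t for t, c in zip(true_labels, cluster_ids) if c == cluster_id]
    true_positives = sum(1 for t in in_cluster if t == label)
    # All pages carrying the label, whatever cluster they ended up in.
    label_total = sum(1 for t in true_labels if t == label)
    precision = true_positives / len(in_cluster) if in_cluster else 0.0
    recall = true_positives / label_total if label_total else 0.0
    return precision, recall


# Toy data: None stands for an unlabeled page.
true_labels = ["product", "product", "list", None, "product", "list"]
cluster_ids = [0, 0, 0, 0, 1, 1]

precision, recall = precision_recall(true_labels, cluster_ids, "product", 0)
print("Precision is", precision)
print("Recall is", recall)
```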


In [54]:
from sklearn.cluster import KMeans

kmeansclustering = KMeans(n_clusters=10).fit(X=featureMatrix)

# Sorted cluster IDs found by KMeans
nclusterskmeans = np.unique(kmeansclustering.labels_)
print(nclusterskmeans)
kmeanslabels = kmeansclustering.labels_

# Print the size of each cluster, in cluster-ID order
for el in nclusterskmeans:
    print(utility.count_occurrences(kmeanslabels, el))

# Score cluster 9 against 'product' and cluster 1 against 'list'
p_product, r_product = clusteringevaluation.calculate_precision_and_recall(df, kmeansclustering, 'product', 9)
p_list, r_list = clusteringevaluation.calculate_precision_and_recall(df, kmeansclustering, 'list', 1)

[0 1 2 3 4 5 6 7 8 9]
402
153
319
236
11
95
330
39
1
414
Recall is 0.31749049429657794
Precision is 0.8067632850241546
Recall is 0.02643171806167401
Precision is 0.0392156862745098
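
The cluster IDs passed above (9 for `product`, 1 for `list`) look hand-picked. KMeans numbers its clusters arbitrarily, and without a fixed `random_state` the numbering can change between runs, so hard-coded IDs may point at a different cluster after a re-run. One possible way to pick the ID automatically is to take the cluster that holds the most pages with the given label. The helper below is a hypothetical sketch on invented data, not part of `astarwars_clustering`.

```python
from collections import Counter


def best_cluster_for(true_labels, cluster_ids, label):
    """Return (cluster_id, hits) for the cluster holding the most pages with `label`.

    Hypothetical helper; returns None when no page carries the label.
    """
    hits = Counter(c for t, c in zip(true_labels, cluster_ids) if t == label)
    if not hits:
        return None
    return hits.most_common(1)[0]


# Toy data: None stands for an unlabeled page.
true_labels = ["product", "product", "list", None, "product", "list"]
cluster_ids = [2, 2, 0, 0, 1, 0]

print(best_cluster_for(true_labels, cluster_ids, "product"))
print(best_cluster_for(true_labels, cluster_ids, "list"))
```

Picking the cluster by majority favours recall; a cluster chosen this way can still have low precision if it is large.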


In [52]:
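# Caution: this rebinds the imported function name `dbscanclustering` to the fitted model,
# so running the cell a second time raises the TypeError shown below.
dbscanclustering=dbscanclustering(featureMatrix)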

TypeError: 'DBSCAN' object is not callable

In [16]:
dbscanLabels= dbscanclustering.labels_
noOfClusters=np.unique(dbscanLabels)
noOfClusters

array([-1,  0,  1])
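
Assuming `dbscanclustering` wraps scikit-learn's `DBSCAN` (the error message above names a `DBSCAN` object), the label `-1` marks points treated as noise rather than a cluster of its own. A small sketch of counting clusters and noise points separately, using invented labels in place of `dbscanLabels`:

```python
from collections import Counter

# Illustrative labels; in the notebook they would come from dbscanLabels.
labels = [0, 0, -1, 1, 0, -1, 1]

sizes = Counter(labels)
n_noise = sizes.pop(-1, 0)  # points not assigned to any cluster
n_clusters = len(sizes)     # the remaining keys are the actual clusters
print(n_clusters, n_noise)
```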

In [18]:
from astarwars_clustering.clustering import clusteringevaluation
p1,r1=clusteringevaluation.calculate_precision_and_recall(df,dbscanclustering,'product',0)

Recall is 1.0
Precision is 0.6522008679479231


In [23]:
p1,r1=clusteringevaluation.calculate_precision_and_recall(df,dbscanclustering,'list',0)

Recall is 1.0
Precision is 0.1407315561066336


So there are - clusters

In [None]:
"""
# Disabled draft. `clusters` is assumed to be a sequence of (key, pages) pairs;
# it is not defined anywhere in this notebook.
cluster_fmt = 'cluster n. {} has {} pages'
noOfPages = 0
for index, cluster in enumerate(clusters):
    clusterSize = len(cluster[1])
    print(cluster_fmt.format(index + 1, clusterSize))
    noOfPages += clusterSize
print()
print('{} pages were clustered using Foxlink\'s clustering algorithm. {} pages were discarded'.format(noOfPages, len(df.index) - noOfPages))
"""
