# Clustering-based Value Normalization (Smart Clustering)

This notebook will cover using the ```py_valuenormalization``` package to cluster a set of values using a smart clustering algorithm and then clean up the resulting clusters.

To use this package, first import it by running the following Python command:

In [None]:
import py_valuenormalization as vn

Now load your data values into a list. We have prepared 4 datasets in the following files:

In [None]:
tt = !ls -1 ../py_valuenormalization/data/
tt

Each file contains values to be normalized, one data value per line. Each of these datasets can be loaded using the following commands (here we load ```big_ten.txt```):

In [None]:
vals = vn.read_from_file('../py_valuenormalization/data/big_ten.txt')

from random import sample

sample(vals, 10)

Now we cluster the input values using a smart clustering algorithm, which finds the best parameter settings for standard hierarchical agglomerative clustering (HAC) using input training data.
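For readers unfamiliar with HAC, the following is a minimal sketch of the idea, not the package's implementation. It assumes a string similarity from the standard library's ```difflib``` as a stand-in for the package's similarity measures; the function names ```similarity``` and ```hac``` are hypothetical. The three knobs it exposes (similarity measure, linkage, threshold) are the kind of parameters that smart clustering tunes.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b):
    """String similarity in [0, 1]; an assumed stand-in for a real similarity measure."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def hac(values, threshold, linkage="single"):
    """Repeatedly merge the two most similar clusters until none reaches the threshold."""
    clusters = [{v} for v in values]
    # Single linkage scores two clusters by their most similar pair of values,
    # complete linkage by their least similar pair.
    combine = max if linkage == "single" else min
    while len(clusters) > 1:
        best_score, best_pair = -1.0, None
        for i, j in combinations(range(len(clusters)), 2):
            score = combine(similarity(a, b) for a in clusters[i] for b in clusters[j])
            if score > best_score:
                best_score, best_pair = score, (i, j)
        if best_score < threshold:
            break  # no remaining pair of clusters is similar enough to merge
        i, j = best_pair
        clusters[i] |= clusters[j]
        del clusters[j]
    return clusters


values = ["Wisconsin", "wisconsin", "Ohio State", "ohio state"]
clusters = hac(values, threshold=0.9)
print(clusters)
```

A lower threshold or a different linkage can change which values end up together, which is why choosing these settings from labeled examples is useful.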

To obtain the training data, run the following command. It opens a window that guides you through the steps of creating training data for smart clustering.

In [None]:
(_, training_pairs) = vn.calibrate_normalization_cost_model(vals)

Here, ```training_pairs``` is a dictionary in which each key is a value pair (v1, v2), with v1 and v2 being distinct input values. The corresponding dictionary value is True if v1 and v2 refer to the same entity and False otherwise.
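To make the shape of this dictionary concrete, here is a small hand-written example. The values and labels are made up for illustration and are not taken from the prepared datasets.

```python
# Hypothetical training data: each key is a pair of distinct input values,
# each dictionary value says whether the two refer to the same entity.
example_pairs = {
    ("Univ. of Wisconsin", "UW-Madison"): True,
    ("Univ. of Wisconsin", "Ohio State"): False,
    ("Ohio State", "OSU"): True,
}

matches = [pair for pair, same in example_pairs.items() if same]
non_matches = [pair for pair, same in example_pairs.items() if not same]
print(len(matches), "matching pairs,", len(non_matches), "non-matching pairs")
```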

Now we cluster the input values with the smart clustering method, passing in the training data in ```training_pairs```:

In [None]:
smc = vn.SmartClustering(vals, training_pairs)

(clusts, best_setting) = smc.cluster()

(agrscore, simk, lnk, thr) = best_setting

The output consists of a dictionary ```clusts``` and a tuple ```best_setting```. Each key of ```clusts``` is the label of a cluster of data values, and the corresponding value is the set of data values in that cluster. ```best_setting = (agrscore, simk, lnk, thr)``` holds the agreement score and the HAC parameter settings with which ```clusts``` was obtained. ```agrscore``` is the agreement score between ```clusts``` and ```training_pairs```, i.e. the fraction of the value pairs in ```training_pairs``` that agree with ```clusts```. ```simk``` (similarity measure), ```lnk``` (linkage) and ```thr``` (threshold) are the standard HAC parameter settings with which ```clusts``` was obtained.
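The agreement score described above can be sketched in a few lines. This is an illustration of the definition, not the package's code: the function name ```agreement_score``` is hypothetical, and it assumes every value in the training pairs appears in exactly one cluster.

```python
def agreement_score(clusts, training_pairs):
    """Fraction of labeled pairs whose label matches the clustering."""
    if not training_pairs:
        return 0.0
    # Map each value to the label of the cluster that contains it.
    label_of = {v: label for label, members in clusts.items() for v in members}
    agree = 0
    for (v1, v2), same_entity in training_pairs.items():
        together = label_of[v1] == label_of[v2]
        # A pair agrees if "same entity" pairs share a cluster
        # and "different entity" pairs do not.
        if together == same_entity:
            agree += 1
    return agree / len(training_pairs)


toy_clusts = {"a": {"x", "y"}, "b": {"z"}}
toy_pairs = {("x", "y"): True, ("x", "z"): True, ("y", "z"): False}
score = agreement_score(toy_clusts, toy_pairs)
print(score)
```

In this toy input, the pair ("x", "z") is labeled as the same entity but the two values sit in different clusters, so two of the three pairs agree.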

Let's take a peek at the resulting clusters:

In [None]:
list(clusts.items())[:10]

Now if there are any mixed clusters in ```clusts```, you can clean them up to arrive at the correct clustering of the input values. This phase consists of two main steps:

1. Split step, in which you split each cluster containing values that refer to more than one real-world entity into smaller clusters, each of which contains values referring to a single entity.
2. Merge step, in which you merge clusters that refer to the same entity.
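The two steps can be pictured as plain operations on a dictionary of sets. The sketch below is only an illustration of what a split and a merge do to the data; the helper names ```split_cluster``` and ```merge_clusters``` are hypothetical and are not part of the package.

```python
def split_cluster(clusts, label, parts):
    """Replace one mixed cluster with smaller ones; parts maps new label -> set of values."""
    # The new parts must cover exactly the values of the cluster being split.
    assert set().union(*parts.values()) == clusts[label]
    result = {k: set(v) for k, v in clusts.items() if k != label}
    result.update({new_label: set(vals) for new_label, vals in parts.items()})
    return result


def merge_clusters(clusts, labels, new_label):
    """Combine the clusters named in labels into a single cluster called new_label."""
    result = {k: set(v) for k, v in clusts.items() if k not in labels}
    result[new_label] = set().union(*(clusts[l] for l in labels))
    return result


# "c1" is a mixed cluster: "OSU" refers to a different entity than the other two values.
toy = {"c1": {"UW", "Wisconsin", "OSU"}, "c2": {"Ohio State"}}
after_split = split_cluster(toy, "c1", {"c1a": {"UW", "Wisconsin"}, "c1b": {"OSU"}})
after_merge = merge_clusters(after_split, ["c1b", "c2"], "ohio")
print(after_merge)
```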

To clean up the clustering results, run the following command:

In [None]:
clean_clusts = vn.normalize_clusters(clusts)

Here, ```clusts``` is a dictionary in which each key is the label of a cluster of data values, and the corresponding value is the set of data values in that cluster. The command opens a graphical user interface for cleaning up ```clusts```. The results will be returned in ```clean_clusts```, a dictionary with the same structure as ```clusts```.

Let's take a peek at the final clusters:

In [None]:
list(clean_clusts.items())[:10]