# Clustering-based Value Normalization (Vanilla Clustering)

This notebook shows how to use the ```py_valuenormalization``` package to cluster a set of values with a vanilla clustering algorithm and then clean up the resulting clusters.

To use this package, first import it by running the following Python command:

In [None]:
import py_valuenormalization as vn

Now load your data values into a list. We have prepared four datasets in the following files:

In [None]:
tt = !ls -1 ../py_valuenormalization/data/
tt

Each file contains values to be normalized, one data value per line. Each of these datasets can be loaded using the following commands (here we load ```big_ten.txt```):

In [None]:
from random import sample

# Load the values to normalize, one value per line of the file
vals = vn.read_from_file('../py_valuenormalization/data/big_ten.txt')

# Show 10 randomly chosen values
sample(vals, 10)
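
If you want to prepare your own dataset in the same one-value-per-line format, the loading step can be approximated in plain Python. This is a minimal sketch: ```read_values``` is a hypothetical helper, not part of ```py_valuenormalization```, and whether ```vn.read_from_file``` also strips whitespace or skips blank lines is an assumption made here.

```python
def read_values(path, encoding="utf-8"):
    """Read one value per line, dropping surrounding whitespace and blank lines."""
    with open(path, encoding=encoding) as f:
        return [line.strip() for line in f if line.strip()]
```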

Now we cluster the input values using a vanilla hierarchical agglomerative clustering (HAC) algorithm:

In [None]:
hac = vn.HierarchicalClustering(vals)

clusts = hac.cluster(
    sim_measure='3gram Jaccard',
    linkage='single',
    thr=0.7
)

Here, ```vals``` is the list of input values; ```sim_measure```, ```linkage``` and ```thr``` (the similarity measure, the linkage criterion and the threshold) are standard HAC parameters; and ```clusts``` is a dictionary in which each key is the label of a cluster and the corresponding value is the set of data values in that cluster.
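
To make these parameters concrete, the following sketch shows one way 3-gram Jaccard similarity and threshold-based single-linkage clustering could be computed by hand. It is not the package's implementation. The function names are hypothetical, and the tokenization details (lowercasing, no padding), the reading of ```thr``` as a minimum similarity for merging, and the use of the first value as the cluster label are all assumptions. Under that reading of ```thr```, single-linkage clusters are the connected components of the graph that links every pair of values whose similarity is at least ```thr```.

```python
def char_ngrams(s, n=3):
    """Return the set of character n-grams of s (lowercased, no padding)."""
    s = s.lower()
    if len(s) < n:
        return {s} if s else set()
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def jaccard_3gram(a, b):
    """Jaccard similarity between the 3-gram sets of two strings."""
    ga, gb = char_ngrams(a), char_ngrams(b)
    if not ga and not gb:
        return 1.0
    return len(ga & gb) / len(ga | gb)


def single_linkage_clusters(values, thr=0.7):
    """Cluster values by linking every pair with similarity >= thr.

    Returns a dict mapping a label (assumed here to be the first value
    of each cluster in input order) to the set of values in that cluster.
    """
    values = list(dict.fromkeys(values))  # de-duplicate, keep input order
    parent = list(range(len(values)))     # union-find forest

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(values)):
        for j in range(i + 1, len(values)):
            if jaccard_3gram(values[i], values[j]) >= thr:
                ri, rj = find(i), find(j)
                if ri != rj:
                    # Keep the smaller index as root so the label is stable
                    parent[max(ri, rj)] = min(ri, rj)

    clusters = {}
    for i, v in enumerate(values):
        clusters.setdefault(values[find(i)], set()).add(v)
    return clusters
```

The pairwise loop is quadratic in the number of distinct values, which is fine for illustration but slow for large inputs.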

Let's take a peek at the resulting clusters:

In [None]:
list(clusts.items())[:10]

Now if there are any mixed clusters in ```clusts```, you can clean them up to arrive at the correct clustering of the input values. This phase consists of two main steps:

1. Split step, in which you split each cluster whose values refer to more than one real-world entity into smaller clusters, each containing values that refer to a single entity
2. Merge step, in which you merge clusters that refer to the same entity
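
In terms of the cluster dictionary, the two steps amount to the operations sketched below. The helpers ```split_cluster``` and ```merge_clusters``` are hypothetical and only illustrate the effect of each step; they are not part of ```py_valuenormalization```, and the sample values are made up.

```python
def split_cluster(clusts, label, groups):
    """Replace one mixed cluster with several smaller ones.

    groups maps each new label to the subset of values it should hold;
    together the subsets must partition the original cluster.
    """
    original = clusts[label]
    assigned = set().union(*groups.values())
    if assigned != original:
        raise ValueError("groups must cover exactly the original cluster")
    if sum(len(g) for g in groups.values()) != len(original):
        raise ValueError("groups must not overlap")
    result = {k: set(v) for k, v in clusts.items() if k != label}
    for new_label, members in groups.items():
        if new_label in result:
            raise ValueError(f"label already in use: {new_label!r}")
        result[new_label] = set(members)
    return result


def merge_clusters(clusts, labels, new_label):
    """Merge the clusters named in labels into one cluster called new_label."""
    merged = set()
    for label in labels:
        merged |= clusts[label]
    result = {k: set(v) for k, v in clusts.items() if k not in labels}
    result[new_label] = merged
    return result


# Made-up example: cluster "a" mixes two entities, and "b" duplicates one of them
example = {
    "a": {"Purdue", "Purdue Univ", "Penn State"},
    "b": {"Penn State University"},
}
after_split = split_cluster(
    example,
    "a",
    {"Purdue": {"Purdue", "Purdue Univ"}, "Penn State": {"Penn State"}},
)
after_merge = merge_clusters(after_split, ["Penn State", "b"], "Penn State")
```

Both helpers return a new dictionary and leave the input untouched, so an intermediate state can be inspected or discarded.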

To clean up the clustering results, run the following command:

In [None]:
clean_clusts = vn.normalize_clusters(clusts)

Here, ```clusts``` is the cluster dictionary produced by the clustering step. This call opens a graphical user interface for cleaning up ```clusts```, and the results are returned in ```clean_clusts```, a dictionary with the same structure: each key is the label of a cluster of data values, and the corresponding value is the set of data values in that cluster.

Let's take a peek at the final clusters:

In [None]:
list(clean_clusts.items())[:10]
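
Once the clusters are clean, a common next step is to turn them into a lookup table from each raw value to a single normalized form. The sketch below assumes you are happy to use each cluster label as that normalized form, which is a choice made here for illustration; ```build_normalization_map``` is a hypothetical helper, not a package function.

```python
def build_normalization_map(clusters):
    """Invert {label: set_of_values} into {value: label}."""
    mapping = {}
    for label, values in clusters.items():
        for value in values:
            # A value in two clusters would make the mapping ambiguous
            if value in mapping and mapping[value] != label:
                raise ValueError(f"value appears in two clusters: {value!r}")
            mapping[value] = label
    return mapping
```

With this helper, ```build_normalization_map(clean_clusts)``` gives a dictionary you can apply to the original list, for example ```[mapping[v] for v in vals]```.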