# (Attribute) Value Normalization

Attribute value normalization (value normalization for short) is an important step in a data cleaning pipeline. Given a set of input values $V = \{v_1, \dots, v_m\}$, we want to group them into non-empty subsets $C = \{c_1, \dots, c_n\}$ such that three conditions hold. First, $C$ is a partition of $V$: $c_i \cap c_j = \emptyset$ for each $i \neq j$, and $\bigcup_i c_i = V$. Second, all the values in each $c_i$ refer to the same real-world entity. Third, no two distinct groups $c_i$ and $c_j$ with $i \neq j$ refer to the same entity.
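The partition requirement in the definition above can be checked mechanically. The following is a minimal sketch, not part of `py_valuenormalization`; the helper name `is_partition` and the demo lists are hypothetical. It only checks the structural condition (disjoint, non-empty groups that cover every value). Whether the values in a group really refer to the same entity still has to be judged by a person.

```python
def is_partition(values, clusters):
    """Return True if `clusters` is a partition of `values`.

    Structural check only: groups must be non-empty, pairwise disjoint,
    and together contain every input value.
    """
    seen = set()
    for cluster in clusters:
        if not cluster:
            return False  # empty groups are not allowed
        for value in cluster:
            if value in seen:
                return False  # value appears in two groups (or twice)
            seen.add(value)
    return seen == set(values)


# Hypothetical demo data built from values mentioned in this notebook.
vals_demo = ['Maryland', 'maryland', 'iowa', 'iowa hawkeyes']
good = [['Maryland', 'maryland'], ['iowa', 'iowa hawkeyes']]
bad = [['Maryland', 'maryland'], ['maryland', 'iowa', 'iowa hawkeyes']]

print(is_partition(vals_demo, good))  # True
print(is_partition(vals_demo, bad))   # False: 'maryland' is in two groups
```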

# Normalizing the Big Ten Dataset using Various Methods

In the rest of this notebook, you will try various methods to normalize different variations of the names of the Big Ten schools. First, load the dataset by running the following commands:

In [None]:
import py_valuenormalization as vn
vals = vn.read_from_file('../py_valuenormalization/data/big_ten.txt')

``vals`` is a list of school name variations. Run the following commands to see a sample of these values:

In [None]:
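from IPython.display import Image

# Load the picture of the correct (golden) clustering, which is displayed further below.
im = Image(filename='bigtensamplegoldenclusters.png')

# Use the first 20 values, in case-insensitive alphabetical order, as a small sample.
sample_vals = sorted(vals, key=lambda x: x.lower())[:20]
sample_vals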

The correct clustering of these values looks like the following (run the following cell):
{{im}}

As you can see, all the values in each group refer to the same school and no two distinct groups of values refer to the same school.

## Getting Familiar with the Data and the Procedures

Let's try normalizing this small sample using the following two value normalization approaches:

   1. Manual
   2. Clustering-based

### Manual Value Normalization

In manual value normalization, you start from scratch and group together the values that refer to the same school. Run the following command and follow the instructions to normalize the sample above manually:

In [None]:
(sample_man_clusts, time_to_finish) = vn.normalize_values(sample_vals)

It took you {{time_to_finish}} seconds.

The clusters you created are stored in the variable ``sample_man_clusts``, a dictionary in which each key is the label of a cluster of ``sample_vals`` values and the corresponding value is the list of data values in that cluster. We call such a dictionary a <i>cluster dictionary</i> from now on. Run the following command to see the clusters:

In [None]:
sample_man_clusts
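To make the shape of a cluster dictionary concrete, here is a small hand-written sketch. The dictionary `example_clusts` and the helper `value_to_label` are hypothetical and are not produced by or part of `py_valuenormalization`; in particular, the choice of labels here is only an assumption for illustration. The helper inverts a cluster dictionary so that you can look up the label of any single value.

```python
# A hand-written cluster dictionary: label -> list of values in that cluster.
# The labels are illustrative; the library may choose labels differently.
example_clusts = {
    'Maryland': ['Maryland', 'maryland'],
    'iowa': ['iowa', 'iowa hawkeyes'],
    'Buckeyes': ['Buckeyes'],
}


def value_to_label(clusts):
    """Invert a cluster dictionary into a value -> label lookup table."""
    mapping = {}
    for label, members in clusts.items():
        for value in members:
            mapping[value] = label
    return mapping


lookup = value_to_label(example_clusts)
print(lookup['iowa hawkeyes'])  # iowa
```

Such a lookup table is one way to replace every raw value in a column with the label of its cluster once normalization is done.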

### Clustering-based Value Normalization

Manual value normalization can be cumbersome and very time-consuming, particularly for large datasets. To make the process easier, we can use a clustering algorithm. In this step, we cluster the sample values using a standard clustering algorithm, namely hierarchical agglomerative clustering (HAC), by running the following commands:

In [None]:
hac = vn.HierarchicalClustering(sample_vals)

sample_clusts = hac.cluster(
    sim_measure='3gram Jaccard',
    linkage='single',
    thr=0.8
)

The automatically formed clusters look like the following:

In [None]:
sample_clusts

As an example, consider the first cluster, that is ``'Buckeyes': ['Buckeyes', 'iowa hawkeyes']``: ``'Buckeyes'`` is the label of this cluster and the cluster contains the values ``'Buckeyes'`` and ``'iowa hawkeyes'``.

As you can see, the HAC algorithm has clustered some of the values correctly. For example, it has correctly grouped ``'Maryland'`` and ``'maryland'`` together.
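To build intuition for how a character-based similarity and single linkage behave, the sketch below implements a plain 3-gram Jaccard similarity and a naive single-linkage agglomerative clustering in pure Python. This is an illustration under stated assumptions, not the implementation used by `vn.HierarchicalClustering`: the function names are hypothetical, lowercasing before extracting 3-grams is an assumption, and the library's tokenization and threshold semantics may differ. It is therefore not expected to reproduce ``sample_clusts``.

```python
def trigrams(s):
    """Set of character 3-grams of `s` (lowercased; an assumption)."""
    s = s.lower()
    # Strings shorter than 3 characters fall back to the string itself.
    return {s[i:i + 3] for i in range(len(s) - 2)} or {s}


def jaccard_3gram(a, b):
    """Jaccard similarity of the 3-gram sets of `a` and `b`."""
    ga, gb = trigrams(a), trigrams(b)
    return len(ga & gb) / len(ga | gb)


def single_linkage(values, thr):
    """Naive single-linkage HAC: repeatedly merge the two most similar
    clusters until no pair of clusters has similarity >= `thr`.
    Cluster similarity is the maximum similarity over all value pairs."""
    clusters = [[v] for v in values]
    while True:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sim = max(jaccard_3gram(a, b)
                          for a in clusters[i] for b in clusters[j])
                if sim >= thr and (best is None or sim > best[0]):
                    best = (sim, i, j)
        if best is None:
            return clusters
        _, i, j = best
        clusters[i].extend(clusters.pop(j))


print(jaccard_3gram('Maryland', 'maryland'))  # 1.0
print(single_linkage(['Maryland', 'maryland', 'iowa', 'iowa hawkeyes'], 0.8))
# [['Maryland', 'maryland'], ['iowa'], ['iowa hawkeyes']]
```

In this sketch, ``'iowa'`` and ``'iowa hawkeyes'`` stay apart because they share only 2 of 11 distinct 3-grams, which illustrates why automatically formed clusters usually need some manual cleanup.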

However, some of the clusters are not correct. For example, consider the first cluster again. ``'Buckeyes'`` and ``'iowa hawkeyes'`` do not refer to the same school. So we need to <i>split</i> this cluster into subclusters, each containing only values that refer to the same school. In this case, we split it into two clusters, each containing a single value.

Then we need to <i>merge</i> the clusters ``'iowa hawkeyes': ['iowa hawkeyes']`` and ``'iowa': ['iowa']`` into a single cluster since they refer to the same school.
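The split and merge operations described above can be sketched directly on a cluster dictionary. The helpers `split_cluster` and `merge_clusters` below are hypothetical and are not the API of `py_valuenormalization`, which performs these edits interactively. Labeling each new subcluster by its first value is an assumption made only for this illustration.

```python
def split_cluster(clusts, label, groups):
    """Replace cluster `label` with one new cluster per group in `groups`.
    Each new cluster is labeled by its first value (an assumption)."""
    if sorted(v for g in groups for v in g) != sorted(clusts[label]):
        raise ValueError('groups must contain exactly the values of the cluster')
    result = {k: list(v) for k, v in clusts.items() if k != label}
    for group in groups:
        result[group[0]] = list(group)
    return result


def merge_clusters(clusts, labels, new_label):
    """Merge the clusters named in `labels` into one cluster `new_label`."""
    result = {k: list(v) for k, v in clusts.items() if k not in labels}
    merged = []
    for label in labels:
        merged.extend(clusts[label])
    result[new_label] = merged
    return result


# A two-cluster excerpt based on the example discussed above.
clusts = {'Buckeyes': ['Buckeyes', 'iowa hawkeyes'], 'iowa': ['iowa']}

# Split the incorrect cluster into two single-value clusters.
clusts = split_cluster(clusts, 'Buckeyes', [['Buckeyes'], ['iowa hawkeyes']])

# Merge the two clusters that refer to the same school.
clusts = merge_clusters(clusts, ['iowa', 'iowa hawkeyes'], 'iowa')

print(clusts)
# {'Buckeyes': ['Buckeyes'], 'iowa': ['iowa', 'iowa hawkeyes']}
```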

To perform edits such as the split and merge described above, run the following command and follow the instructions:

In [None]:
(clean_sample_clusts, ttf) = vn.normalize_clusters(sample_clusts)

The resulting cluster dictionary ``clean_sample_clusts`` looks like the following:

In [None]:
clean_sample_clusts

## Experiments

Now that you are familiar with the basic concepts and steps of value normalization, let's go through the actual experiments.

To summarize, you are going to normalize the Big Ten dataset 7 times; thank you! For each experiment, we measure the time it takes you to normalize the values, and we compare these times at the end. To avoid bias, we mask the actual method used in each experiment by naming the methods 1 through 7. Just run the commands in the cell for each experiment and follow the instructions. Please pay as little attention to the cell contents as possible! ;) You'll see the final timing results after finishing all 7 experiments.

**Important Note 1**: Please try to follow the instructions as closely as possible.

**Important Note 2**: Please run all the cells to the end and save the notebook so we can access the experimental results later.

**Important Note 3**: Please familiarize yourself with the data before starting the experiments below; i.e. look at all the values and make sure you know which school each value refers to. You will see a review screen immediately after you start the first method below. You can use that screen to learn all the values.

### Method 1

Run the cell below and follow the instructions:

In [None]:
hac = vn.HierarchicalClustering(vals)
van_8_clusts = hac.cluster(sim_measure = '3gram Jaccard', linkage = 'single', thr = 0.8)
(van_8_clean_clusts, van_8_ttf) = vn.normalize_clusters(van_8_clusts)

### Method 2

Run the cell below and follow the instructions:

In [None]:
vn.HierarchicalClustering._default_thr = 0.3
if 'training_pairs_3' not in locals() and 'training_pairs_3' not in globals():
    (cm_3, training_pairs_3, calib_3_ttf) = vn.calibrate_normalization_cost_model(vals)
smc = vn.SmartClustering(vals, training_pairs_3)
(smt_3_clusts, best_setting_3) = smc.cluster()
(agrscore_3, simk_3, lnk_3, thr_3) = best_setting_3
(smt_3_clean_clusts, smt_3_ttf) = vn.normalize_clusters(smt_3_clusts)

### Method 3

Run the cell below and follow the instructions:

In [None]:
vn.HierarchicalClustering._default_thr = 0.8
if 'cm_8' not in locals() and 'cm_8' not in globals():
    (cm_8, training_pairs_8, calib_8_ttf) = vn.calibrate_normalization_cost_model(vals)
hybhac = vn.HybridClustering(vals, cm_8)
(hyb_8_clusts, hyb_8_mcl) = hybhac.cluster()
(hyb_8_clean_clusts, hyb_8_ttf) = vn.normalize_clusters(hyb_8_clusts)

### Method 4

Run the cell below and follow the instructions:

In [None]:
(man_clean_clusts, man_ttf) = vn.normalize_values(vals)

### Method 5

Run the cell below and follow the instructions:

In [None]:
vn.HierarchicalClustering._default_thr = 0.3
if 'cm_3' not in locals() and 'cm_3' not in globals():
    (cm_3, training_pairs_3, calib_3_ttf) = vn.calibrate_normalization_cost_model(vals)
hybhac = vn.HybridClustering(vals, cm_3)
(hyb_3_clusts, hyb_3_mcl) = hybhac.cluster()
(hyb_3_clean_clusts, hyb_3_ttf) = vn.normalize_clusters(hyb_3_clusts)

### Method 6

Run the cell below and follow the instructions:

In [None]:
hac = vn.HierarchicalClustering(vals)
van_3_clusts = hac.cluster(sim_measure = '3gram Jaccard', linkage = 'single', thr = 0.3)
(van_3_clean_clusts, van_3_ttf) = vn.normalize_clusters(van_3_clusts)

### Method 7

Run the cell below and follow the instructions:

In [None]:
vn.HierarchicalClustering._default_thr = 0.8
if 'training_pairs_8' not in locals() and 'training_pairs_8' not in globals():
    (cm_8, training_pairs_8, calib_8_ttf) = vn.calibrate_normalization_cost_model(vals)
smc = vn.SmartClustering(vals, training_pairs_8)
(smt_8_clusts, best_setting_8) = smc.cluster()
(agrscore_8, simk_8, lnk_8, thr_8) = best_setting_8
(smt_8_clean_clusts, smt_8_ttf) = vn.normalize_clusters(smt_8_clusts)

## Comparing the Results

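Now let's compare the timing results of the different methods. For the Smart and Hybrid methods, the calibration time is added to the cleanup time. Each ranking lists the methods from fastest to slowest: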

In [None]:
res_3 = {
    'Manual': man_ttf,
    'Vanilla with thr = 0.3': van_3_ttf,
    'Smart with thr = 0.3': calib_3_ttf + smt_3_ttf,
    'Hybrid with thr = 0.3': calib_3_ttf + hyb_3_ttf
}

rank_3 = sorted(res_3.keys(), key=lambda x: res_3[x])

rank_3


In [None]:
res_8 = {
    'Manual': man_ttf,
    'Vanilla with thr = 0.8': van_8_ttf,
    'Smart with thr = 0.8': calib_8_ttf + smt_8_ttf,
    'Hybrid with thr = 0.8': calib_8_ttf + hyb_8_ttf
}

rank_8 = sorted(res_8.keys(), key=lambda x: res_8[x])

rank_8
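The rankings above show only the method names. If you also want to see the measured times next to the names, a sketch like the following can help. The helper `rank_by_time` is hypothetical, and the numbers in `placeholder_res` are made-up placeholders, not experimental results; pass ``res_3`` or ``res_8`` instead to see your own timings. Times are assumed to be in seconds.

```python
def rank_by_time(results):
    """Return (method, seconds) pairs sorted from fastest to slowest."""
    return sorted(results.items(), key=lambda item: item[1])


# Made-up placeholder timings, for illustration only.
placeholder_res = {'Method A': 120.0, 'Method B': 95.5, 'Method C': 210.25}

for name, seconds in rank_by_time(placeholder_res):
    print(f'{name}: {seconds:.1f} s')
```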