# End-to-End Data Cleaning Pipeline with Raha and Baran (Minimal and Integrated)
We build an end-to-end data cleaning pipeline with our configuration-free error detection and correction systems, Raha and Baran.

In [1]:
import pandas
import IPython.display
import ipywidgets

import raha

## 1. Instantiating the Detection and Correction Classes
We first instantiate the `Detection` and `Correction` classes.

In [2]:
app_1 = raha.Detection()
app_2 = raha.Correction()

# Number of tuples to label. Baran's budget is 0 because it reuses the tuples labeled for Raha.
app_1.LABELING_BUDGET = 20
app_2.LABELING_BUDGET = 0

# Print progress logs for both systems.
app_1.VERBOSE = True
app_2.VERBOSE = True

## 2. Instantiating the Dataset
We next load and instantiate the dataset object.

In [3]:
dataset_dictionary = {
    "name": "beers",
    "path": "datasets/beers/dirty.csv",
    "clean_path": "datasets/beers/clean.csv",
}
d = app_1.initialize_dataset(dataset_dictionary)
d.dataframe.head()

Unnamed: 0,index,id,beer_name,style,ounces,abv,ibu,brewery_id,brewery_name,city,state
0,1,1436,Pub Beer,American Pale Lager,12.0 oz,0.05,,408,10 Barrel Brewing Company,Bend,OR
1,2,2265,Devil's Cup,American Pale Ale (APA),12.0 oz.,0.066,,177,18th Street Brewery,Gary,IN
2,3,2264,Rise of the Phoenix,American IPA,12.0 ounce,0.071,,177,18th Street Brewery,Gary,IN
3,4,2263,Sinister,American Double / Imperial IPA,12.0 oz,0.09%,,177,18th Street Brewery,Gary,IN
4,5,2262,Sex and Candy,American IPA,12.0 OZ.,0.075,,177,18th Street Brewery,Gary,IN


## 3. Generating Features and Clusters
Raha runs all (or only the promising) error detection strategies on the dataset. This step can take a while because every strategy has to be executed on the dataset. Raha then generates a feature vector for each data cell from the outputs of the error detection strategies and builds a hierarchical clustering model for its clustering-based sampling approach. In the log below, OD, PVD, and RVD denote outlier detection, pattern violation detection, and rule violation detection strategies.

In [4]:
app_1.run_strategies(d)
app_1.generate_features(d)
app_1.build_clusters(d)

583 cells are detected by ["PVD", ["style", "t"]].
120 cells are detected by ["PVD", ["city", "b"]].
176 cells are detected by ["PVD", ["brewery_name", "H"]].
810 cells are detected by ["PVD", ["brewery_id", "3"]].
503 cells are detected by ["PVD", ["brewery_id", "0"]].
697 cells are detected by ["PVD", ["brewery_id", "4"]].
1 cells are detected by ["PVD", ["beer_name", "\u00e8"]].
2 cells are detected by ["PVD", ["beer_name", "/"]].
46 cells are detected by ["PVD", ["beer_name", "J"]].
86 cells are detected by ["PVD", ["city", "E"]].
55 cells are detected by ["PVD", ["beer_name", "V"]].
317 cells are detected by ["PVD", ["abv", "9"]].
4 cells are detected by ["PVD", ["style", "Q"]].
68 cells are detected by ["PVD", ["state", "D"]].
244 cells are detected by ["PVD", ["brewery_name", "A"]].
313 cells are detected by ["PVD", ["ibu", "1"]].
2382 cells are detected by ["PVD", ["ounces", "1"]].
725 cells are detected by ["PVD", ["beer_name", "d"]].
9 cells are detected by ["PVD", ["brewery_

211 cells are detected by ["PVD", ["style", "R"]].
1295 cells are detected by ["OD", ["histogram", "0.3", "0.1"]].
1669 cells are detected by ["PVD", ["brewery_name", "C"]].
4738 cells are detected by ["OD", ["gaussian", "1.3"]].
612 cells are detected by ["PVD", ["id", "0"]].
42 cells are detected by ["PVD", ["style", "\u00f6"]].
56 cells are detected by ["PVD", ["state", "F"]].
629 cells are detected by ["PVD", ["index", "4"]].
359 cells are detected by ["PVD", ["city", "C"]].
63 cells are detected by ["PVD", ["brewery_name", "."]].
104 cells are detected by ["PVD", ["state", "Y"]].
485 cells are detected by ["PVD", ["brewery_name", "h"]].
1131 cells are detected by ["OD", ["histogram", "0.9", "0.1"]].
162 cells are detected by ["PVD", ["style", "w"]].
1005 cells are detected by ["PVD", ["ibu", "A"]].
175 cells are detected by ["PVD", ["beer_name", "'"]].
154 cells are detected by ["PVD", ["ounces", "O"]].
1 cells are detected by ["PVD", ["style", "q"]].
84 cells are detected by ["PV

587 cells are detected by ["PVD", ["brewery_id", "5"]].
619 cells are detected by ["PVD", ["index", "0"]].
270 cells are detected by ["PVD", ["beer_name", "L"]].
1 cells are detected by ["PVD", ["beer_name", "?"]].
1567 cells are detected by ["PVD", ["ounces", "2"]].
601 cells are detected by ["PVD", ["beer_name", "c"]].
1513 cells are detected by ["PVD", ["style", "l"]].
0 cells are detected by ["RVD", ["index", "city"]].
1176 cells are detected by ["PVD", ["city", "i"]].
0 cells are detected by ["RVD", ["index", "id"]].
265 cells are detected by ["PVD", ["ibu", "3"]].
302 cells are detected by ["PVD", ["city", "p"]].
3 cells are detected by ["PVD", ["beer_name", "\u00e9"]].
1964 cells are detected by ["PVD", ["style", "a"]].
244 cells are detected by ["PVD", ["city", "A"]].
204 cells are detected by ["PVD", ["beer_name", "G"]].
9 cells are detected by ["PVD", ["beer_name", "Q"]].
182 cells are detected by ["PVD", ["style", "D"]].
547 cells are detected by ["PVD", ["abv", "4"]].
4554 

4568 cells are detected by ["RVD", ["brewery_name", "beer_name"]].
1093 cells are detected by ["PVD", ["id", "2"]].
4810 cells are detected by ["RVD", ["ounces", "brewery_id"]].
1396 cells are detected by ["PVD", ["city", "o"]].
227 cells are detected by ["PVD", ["beer_name", "w"]].
2 cells are detected by ["PVD", ["beer_name", "\u00fc"]].
4656 cells are detected by ["RVD", ["abv", "brewery_id"]].
54 cells are detected by ["PVD", ["style", "G"]].
4810 cells are detected by ["RVD", ["ounces", "style"]].
619 cells are detected by ["PVD", ["index", "7"]].
129 cells are detected by ["PVD", ["city", "R"]].
1 cells are detected by ["PVD", ["beer_name", "\u00e4"]].
15 cells are detected by ["PVD", ["style", "v"]].
4774 cells are detected by ["RVD", ["ibu", "state"]].
1784 cells are detected by ["PVD", ["style", "A"]].
1018 cells are detected by ["PVD", ["city", "t"]].
5388 cells are detected by ["OD", ["histogram", "0.3", "0.3"]].
1368 cells are detected by ["PVD", ["beer_name", "o"]].
14 cel
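
To make the feature generation step more concrete, here is a minimal sketch and not Raha's actual implementation. It assumes each strategy's output is simply the set of cells it flagged, and it builds one binary feature per strategy for every cell. The function names, strategy names, and flagged cells are hypothetical. Grouping cells with identical vectors is used only as a crude stand-in for the hierarchical clustering that Raha builds.

```python
from collections import defaultdict


def build_feature_vectors(cells, strategy_outputs):
    """Return {cell: [0/1 per strategy]}; 1 means the strategy flagged the cell."""
    strategies = sorted(strategy_outputs)  # fixed feature order
    return {
        cell: [1 if cell in strategy_outputs[s] else 0 for s in strategies]
        for cell in cells
    }


def group_identical_vectors(feature_vectors):
    """Crude stand-in for clustering: cells with the same vector share a group."""
    groups = defaultdict(list)
    for cell, vector in feature_vectors.items():
        groups[tuple(vector)].append(cell)
    return dict(groups)


# Hypothetical cells of one column, identified as (row, column) pairs.
cells = [(0, 5), (1, 5), (2, 5), (3, 5)]
# Hypothetical strategy outputs: the set of cells each strategy flagged.
strategy_outputs = {
    "OD_histogram": {(1, 5), (2, 5), (3, 5)},
    "PVD_ounces_O": {(2, 5)},
    "RVD_ounces_style": {(1, 5), (2, 5), (3, 5)},
}

vectors = build_feature_vectors(cells, strategy_outputs)
groups = group_identical_vectors(vectors)
for vector, members in groups.items():
    print(vector, members)
```

Cells that are flagged by the same strategies end up with similar vectors, which is what lets a single user label stand in for many cells later on.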

## 4. Interactive Tuple Sampling and Labeling
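Raha then iteratively samples tuples. For each sampled tuple, we label its data cells: a cell whose value we change is marked as an error, and the value we enter is stored as its correction.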

In [5]:
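def on_button_clicked(_):
    # Compare each edited text box with the original cell value.
    for j in range(len(texts)):
        cell = (d.sampled_tuple, j)
        correction = texts[j].value
        # A cell is an error (label 1) if the user changed its value.
        error_label = int(d.dataframe.iloc[cell] != correction)
        d.labeled_cells[cell] = [error_label, correction]
    # Mark the whole tuple as labeled.
    d.labeled_tuples[d.sampled_tuple] = 1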

app_1.sample_tuple(d)
print("Fix the dirty cells in the following sampled tuple.")
sampled_tuple = pandas.DataFrame(data=[d.dataframe.iloc[d.sampled_tuple, :]], columns=d.dataframe.columns)
IPython.display.display(sampled_tuple)  
texts = [ipywidgets.Text(value=d.dataframe.iloc[d.sampled_tuple, j]) for j in range(d.dataframe.shape[1])]
button = ipywidgets.Button(description="Save the Annotation")
button.on_click(on_button_clicked)
output = ipywidgets.VBox(children=texts + [button])
IPython.display.display(output)

Tuple 2106 is sampled.
Fix the dirty cells in the following sampled tuple.


Unnamed: 0,index,id,beer_name,style,ounces,abv,ibu,brewery_id,brewery_name,city,state
2106,2107,1196,Wild Plum Farmhouse Ale,Saison / Farmhouse Ale,16.0 OZ.,0.055999999999999994%,20,45,Tallgrass Brewing Company,Manhattan,KS


VBox(children=(Text(value='2107'), Text(value='1196'), Text(value='Wild Plum Farmhouse Ale'), Text(value='Sais…
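
The labeling rule used by the button callback can also be shown without widgets. The sketch below is a standalone illustration: `label_tuple` is a hypothetical helper, the dirty values are a subset of the sampled tuple above, and the corrected values are made up for the example rather than taken from the ground truth.

```python
def label_tuple(row_index, dirty_values, corrected_values):
    """Return {(row, column): [error_label, correction]} for one tuple.

    A cell gets error_label 1 if the corrected value differs from the dirty one.
    """
    labeled_cells = {}
    for j, (dirty, corrected) in enumerate(zip(dirty_values, corrected_values)):
        error_label = int(dirty != corrected)
        labeled_cells[(row_index, j)] = [error_label, corrected]
    return labeled_cells


# Subset of tuple 2106 as displayed above.
dirty = ["2107", "1196", "Wild Plum Farmhouse Ale", "16.0 OZ.", "0.055999999999999994%"]
# Hypothetical user input; the real clean values may differ.
corrected = ["2107", "1196", "Wild Plum Farmhouse Ale", "16", "0.056"]

labels = label_tuple(2106, dirty, corrected)
for cell, (error_label, correction) in labels.items():
    print(cell, error_label, correction)
```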

For the sake of time, we use the ground truth of the dataset to label tuples below.

In [6]:
%%capture
while len(d.labeled_tuples) < app_1.LABELING_BUDGET:
    app_1.sample_tuple(d)
    if d.has_ground_truth:
        app_1.label_with_ground_truth(d)

## 5. Propagating User Labels and Predicting the Labels
Raha propagates each user label through its cluster. It then trains and applies one classifier per data column to predict the labels of the remaining data cells.

In [7]:
app_1.propagate_labels(d)
app_1.predict_labels(d)

The number of labeled data cells increased from 220 to 24636.
A classifier is trained and applied on column 0.
A classifier is trained and applied on column 1.
A classifier is trained and applied on column 2.
A classifier is trained and applied on column 3.
A classifier is trained and applied on column 4.
A classifier is trained and applied on column 5.
A classifier is trained and applied on column 6.
A classifier is trained and applied on column 7.
A classifier is trained and applied on column 8.
A classifier is trained and applied on column 9.
A classifier is trained and applied on column 10.
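
In the run above, 20 labeled tuples with 11 columns give 220 labeled cells, which propagation expands to 24636. The following sketch illustrates the idea of propagating labels through clusters. It is a simplification under stated assumptions, not Raha's code: `propagate_labels` is a hypothetical helper, clusters are given as a plain cell-to-cluster mapping, and a label is propagated only when all user-labeled cells in a cluster agree.

```python
from collections import defaultdict


def propagate_labels(cluster_of, user_labels):
    """Copy user labels to unlabeled cells in the same cluster.

    Assumption: a cluster's label is propagated only if all of its
    user-labeled cells carry the same label.
    """
    labels_per_cluster = defaultdict(set)
    for cell, label in user_labels.items():
        labels_per_cluster[cluster_of[cell]].add(label)

    propagated = dict(user_labels)
    for cell, cluster in cluster_of.items():
        seen = labels_per_cluster.get(cluster, set())
        if cell not in propagated and len(seen) == 1:
            propagated[cell] = next(iter(seen))
    return propagated


# Hypothetical cluster assignment for five cells of one column.
cluster_of = {(0, 5): 0, (1, 5): 0, (2, 5): 1, (3, 5): 1, (4, 5): 2}
# Two user labels: 1 = error, 0 = clean.
user_labels = {(0, 5): 1, (2, 5): 0}

all_labels = propagate_labels(cluster_of, user_labels)
print(all_labels)
```

Cell `(4, 5)` stays unlabeled because its cluster contains no user label. Cells like this are the ones the per-column classifier has to predict.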


## 6. Initializing and Updating the Error Corrector Models
Baran initializes the error corrector models and then iteratively samples tuples, whose data cells we label. After each labeled tuple, it updates the models and generates a feature vector for each pair of a data error and a correction candidate. Finally, it trains and applies one classifier per data column to predict the final correction of each data error. Since we already labeled tuples for Raha, we reuse those tuples here and do not label new ones.

In [8]:
app_2.initialize_models(d)
app_2.initialize_dataset(d)
for si in d.labeled_tuples:
    d.sampled_tuple = si
    app_2.update_models(d)
    app_2.generate_features(d)
    app_2.predict_corrections(d)

The error corrector models are initialized.
The error corrector models are updated with new labeled tuple 310.
214966 pairs of (a data error, a potential correction) are featurized.
28% (1205 / 4369) of data errors are corrected.
The error corrector models are updated with new labeled tuple 1252.
217179 pairs of (a data error, a potential correction) are featurized.
33% (1440 / 4369) of data errors are corrected.
The error corrector models are updated with new labeled tuple 280.
217187 pairs of (a data error, a potential correction) are featurized.
48% (2117 / 4369) of data errors are corrected.
The error corrector models are updated with new labeled tuple 97.
218198 pairs of (a data error, a potential correction) are featurized.
91% (3956 / 4369) of data errors are corrected.
The error corrector models are updated with new labeled tuple 1586.
218216 pairs of (a data error, a potential correction) are featurized.
96% (4173 / 4369) of data errors are corrected.
The error corrector model
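
The update-then-propose loop can be pictured with a deliberately simplified corrector model. The sketch below is not Baran's implementation, whose models are richer: `ValueCorrector` is a hypothetical class that only counts which new value replaced which old value in labeled cells and proposes candidates with their relative frequencies. All example values are made up.

```python
from collections import Counter, defaultdict


class ValueCorrector:
    """Hypothetical corrector that learns old value -> new value replacements."""

    def __init__(self):
        self.counts = defaultdict(Counter)

    def update(self, old_value, new_value):
        """Record one labeled correction."""
        self.counts[old_value][new_value] += 1

    def candidates(self, old_value):
        """Return {candidate: relative frequency} for an erroneous value."""
        seen = self.counts.get(old_value)
        if not seen:
            return {}
        total = sum(seen.values())
        return {new: n / total for new, n in seen.items()}


model = ValueCorrector()
# Hypothetical corrections taken from labeled tuples.
model.update("12.0 oz", "12")
model.update("12.0 oz", "12")
model.update("12.0 oz", "12.0")

known = model.candidates("12.0 oz")
unknown = model.candidates("16.0 OZ.")
print(known, unknown)
```

Each (data error, candidate) pair with its score corresponds loosely to one of the featurized pairs counted in the log above. A classifier then decides which candidate, if any, becomes the final correction.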

## 7. Storing Results
Both Raha and Baran can store their error detection and correction results on disk.

In [9]:
app_1.store_results(d)
app_2.store_results(d)

The results are stored in datasets/beers/raha-baran-results-beers/error-detection/detection.dataset.
The results are stored in datasets/beers/raha-baran-results-beers/error-correction/correction.dataset.


## 8. Evaluating the Data Cleaning Task
We can finally evaluate our data cleaning task.

In [10]:
edp, edr, edf = d.get_data_cleaning_evaluation(d.detected_cells)[:3]
ecp, ecr, ecf = d.get_data_cleaning_evaluation(d.corrected_cells)[-3:]

# Build the table from a list of rows; DataFrame.append was removed in pandas 2.0.
rows = [
    {"Task": "Error Detection (Raha)", "Precision": "{:.2f}".format(edp),
     "Recall": "{:.2f}".format(edr), "F1 Score": "{:.2f}".format(edf)},
    {"Task": "Error Correction (Baran)", "Precision": "{:.2f}".format(ecp),
     "Recall": "{:.2f}".format(ecr), "F1 Score": "{:.2f}".format(ecf)},
]
evaluation_df = pandas.DataFrame(rows, columns=["Task", "Precision", "Recall", "F1 Score"])
evaluation_df.head()

Unnamed: 0,Task,Precision,Recall,F1 Score
0,Error Detection (Raha),1.0,1.0,1.0
1,Error Correction (Baran),0.96,0.95,0.96
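
For reference, precision, recall, and F1 score for error detection can be computed from two sets of cells. The sketch below uses the standard definitions and is independent of how `get_data_cleaning_evaluation` is implemented. The helper name and the example cells are hypothetical.

```python
def detection_scores(detected_cells, actual_errors):
    """Return (precision, recall, f1) for a set of detected cells."""
    detected = set(detected_cells)
    actual = set(actual_errors)
    true_positives = len(detected & actual)
    precision = true_positives / len(detected) if detected else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical cells: four detected, five truly erroneous, three in common.
detected = {(0, 5), (1, 5), (2, 5), (3, 6)}
actual = {(0, 5), (1, 5), (2, 5), (4, 6), (5, 6)}

precision, recall, f1 = detection_scores(detected, actual)
print("{:.2f} {:.2f} {:.2f}".format(precision, recall, f1))
```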
