# End-to-End Data Cleaning Pipeline with Raha and Baran (Minimal and Integrated)
We build an end-to-end data cleaning pipeline with our configuration-free error detection and correction systems, Raha and Baran.

In [1]:
import pandas
import IPython.display
import ipywidgets

import raha

## 1. Instantiating the Detection and Correction Classes
We first instantiate the `Detection` and `Correction` classes.

In [2]:
from raha import analysis_utilities
app_1 = raha.Detection()
app_2 = raha.Correction()

# How many tuples would you label?
app_1.LABELING_BUDGET = 20
app_2.LABELING_BUDGET = 0

# Would you like to see the logs?
app_1.VERBOSE = True
app_2.VERBOSE = True

## 2. Instantiating the Dataset
We next load and instantiate the dataset object.

In [3]:
dataset_dictionary = {
    "name": "movies_1",
    "path": "datasets/movies_1/dirty.csv",
    "clean_path": "datasets/movies_1/clean.csv",
}
d = app_1.initialize_dataset(dataset_dictionary)
d.dataframe.head()

Unnamed: 0,id,name,year,release_date,director,creator,actors,full_cast,language,country,duration,rating_value,rating_count,review_count,genre,filming_locations,description
0,tt0054215,Psycho,1960,8 September 1960 (USA),Alfred Hitchcock,"Joseph Stefano,Robert Bloch","Anthony Perkins,Janet Leigh,Vera Miles","Anthony Perkins,Vera Miles,John Gavin,Janet Le...",English,USA,109 min,8.6,379998,"976 user,290 critic","Horror,Mystery,Thriller","Title and Trust Building, 114 West Adams Stree...","A Phoenix secretary steals $40,000 from her em..."
1,tt0088993,Day of the Dead,1985,19 July 1985 (USA),George A. Romero,George A. Romero,"Lori Cardille,Terry Alexander,Joseph Pilato","Lori Cardille,Terry Alexander,Joseph Pilato,Ja...",English,USA,96 min,7.2,46421,"414 user,177 critic","Action,Drama,Horror","Sanibel Island, Florida, USA",A small group of military officers and scienti...
2,tt0032484,Foreign Correspondent,1940,16 August 1940 (USA),Alfred Hitchcock,"Charles Bennett,Joan Harrison","Joel McCrea,Laraine Day,Herbert Marshall","Joel McCrea,Laraine Day,Herbert Marshall,Georg...","English,Dutch,Latvian",USA,120 min,7.6,12684,"124 user,73 critic","Romance,Thriller,War","Hotel de l'Europe, Nieuwe Doelenstraat 2-14, A...","On the eve of WWII, a young American reporter ..."
3,tt0889671,Trumbo,2007,27 June 2008 (USA),Peter Askin,"Christopher Trumbo,Christopher Trumbo","Dalton Trumbo,Joan Allen,Brian Dennehy","Dalton Trumbo,Joan Allen,Brian Dennehy,Michael...",English,USA,96 min,7.5,724,"14 user,33 critic","Documentary,Biography",,Through a focus on the life of Dalton Trumbo (...
4,tt1325014,The People vs. George Lucas,2010,29 August 2011 (USA),Alexandre O. Philippe,Alexandre O. Philippe,"Joe Nussbaum,Daryl Frazetti,Doug Jones","Joe Nussbaum,Daryl Frazetti,Doug Jones,Damian ...","English,French","USA,UK",93 min,6.7,3341,"27 user,49 critic","Documentary,Comedy","Geneva, Canton de Genève, Switzerland",An examination of the widespread fan disenchan...


## 3. Generating Features and Clusters
Raha runs error detection strategies (either all of them or only the promising ones) on the dataset. This step can take a while because every selected strategy must be run on the dataset. Raha then generates a feature vector for each data cell from the outputs of the error detection strategies. Finally, Raha builds a hierarchical clustering model per column for its clustering-based sampling approach.

In [4]:
app_1.run_strategies(d)
app_1.generate_features(d)
app_1.build_clusters(d)

I just load strategies' results as they have already been run on the dataset!


1246 strategy profiles are collected.
88 Features are generated for column 0.
149 Features are generated for column 1.
76 Features are generated for column 2.
123 Features are generated for column 3.
144 Features are generated for column 4.
143 Features are generated for column 5.
149 Features are generated for column 6.
158 Features are generated for column 7.
119 Features are generated for column 8.
97 Features are generated for column 9.
67 Features are generated for column 10.
78 Features are generated for column 11.
63 Features are generated for column 12.
71 Features are generated for column 13.
104 Features are generated for column 14.
152 Features are generated for column 15.
162 Features are generated for column 16.
A hierarchical clustering model is built for column 0.
A hierarchical clustering model is built for column 1.
A hierarchical clustering model is built for column 2.
A hierarchical clustering model is built for column 3.
A hierarchical clustering model is built for 
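To make the idea of this step concrete, here is a minimal, standard-library sketch of turning strategy outputs into binary feature vectors and clustering the cells of one column. It is not Raha's implementation: the function names, the toy strategy outputs, the Hamming distance, and the average-linkage merge rule are all assumptions chosen for illustration.

```python
# Illustrative sketch only; every name below is hypothetical, not Raha's API.

def build_feature_vectors(strategy_outputs, column, n_rows):
    """One binary feature per strategy: 1 if that strategy flagged the cell."""
    strategies = sorted(strategy_outputs)
    return [
        [int((row, column) in strategy_outputs[s]) for s in strategies]
        for row in range(n_rows)
    ]

def hamming(u, v):
    """Number of positions in which two feature vectors differ."""
    return sum(a != b for a, b in zip(u, v))

def agglomerative_clusters(vectors, k):
    """Average-linkage agglomerative clustering, merged down to k clusters."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > max(k, 1):
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                total = sum(
                    hamming(vectors[i], vectors[j])
                    for i in clusters[a]
                    for j in clusters[b]
                )
                dist = total / (len(clusters[a]) * len(clusters[b]))
                if best is None or dist < best[0]:
                    best = (dist, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

# Toy outputs: each (assumed) strategy maps to the set of cells it flagged.
strategy_outputs = {
    "outlier": {(0, 0), (2, 0)},
    "pattern": {(2, 0), (3, 0)},
    "rule": {(2, 0)},
}
vectors = build_feature_vectors(strategy_outputs, column=0, n_rows=4)
clusters = agglomerative_clusters(vectors, k=2)
print(vectors)   # [[1, 0, 0], [0, 0, 0], [1, 1, 1], [0, 1, 0]]
print(clusters)  # [[0, 1, 3], [2]]
```

In this toy column, the cell flagged by all three strategies ends up in its own cluster, which is the kind of grouping a clustering-based sampler can exploit when choosing tuples to label.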

## 4. Interactive Tuple Sampling and Labeling
Raha then iteratively samples tuples, and we label the data cells of each sampled tuple.

In [5]:
# def on_button_clicked(_):
#     for j in range(0, len(texts)):
#         cell = (d.sampled_tuple, j)
#         error_label = 0
#         correction = texts[j].value
#         if d.dataframe.iloc[cell] != correction:
#             error_label = 1
#         d.labeled_cells[cell] = [error_label, correction]
#     d.labeled_tuples[d.sampled_tuple] = 1
#
# app_1.sample_tuple(d)
# print("Fix the dirty cells in the following sampled tuple.")
# sampled_tuple = pandas.DataFrame(data=[d.dataframe.iloc[d.sampled_tuple, :]], columns=d.dataframe.columns)
# IPython.display.display(sampled_tuple)
# texts = [ipywidgets.Text(value=d.dataframe.iloc[d.sampled_tuple, j]) for j in range(d.dataframe.shape[1])]
# button = ipywidgets.Button(description="Save the Annotation")
# button.on_click(on_button_clicked)
# output = ipywidgets.VBox(children=texts + [button])
# IPython.display.display(output)

For the sake of time, we use the ground truth of the dataset to label tuples below.

In [6]:
%%capture
while len(d.labeled_tuples) < app_1.LABELING_BUDGET:
    app_1.sample_tuple(d)
    if d.has_ground_truth:
        app_1.label_with_ground_truth(d)
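Conceptually, labeling with the ground truth amounts to comparing each cell of the sampled tuple with the corresponding clean cell. The sketch below is an assumption about that idea, not the body of `label_with_ground_truth`. It reuses the `[error_label, correction]` layout from the commented widget code above, and the function name and toy rows are hypothetical.

```python
# Illustrative sketch only; the helper name and the toy rows are hypothetical.

def label_tuple_with_ground_truth(dirty_rows, clean_rows, row_index):
    """Label every cell of one tuple by comparing dirty and clean values."""
    labeled_cells = {}
    pairs = zip(dirty_rows[row_index], clean_rows[row_index])
    for j, (dirty_value, clean_value) in enumerate(pairs):
        error_label = int(dirty_value != clean_value)  # 1 marks an error
        labeled_cells[(row_index, j)] = [error_label, clean_value]
    return labeled_cells

dirty_rows = [["Psycho", "1960", "109 min"], ["Trumbo", "2007", "96 min"]]
clean_rows = [["Psycho", "1960", "109"], ["Trumbo", "2007", "96"]]

labels = label_tuple_with_ground_truth(dirty_rows, clean_rows, row_index=0)
print(labels)
# {(0, 0): [0, 'Psycho'], (0, 1): [0, '1960'], (0, 2): [1, '109']}
```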

## 5. Propagating User Labels and Predicting the Labels
Raha then propagates each user label through its cluster. It then trains and applies one classifier per data column to predict the labels of the remaining data cells.

In [7]:
app_1.propagate_labels(d)
app_1.predict_labels(d)

The number of labeled data cells increased from 340 to 91959.
A classifier is trained and applied on column 0.
A classifier is trained and applied on column 1.
A classifier is trained and applied on column 2.
A classifier is trained and applied on column 3.
A classifier is trained and applied on column 4.
A classifier is trained and applied on column 5.
A classifier is trained and applied on column 6.
A classifier is trained and applied on column 7.
A classifier is trained and applied on column 8.
A classifier is trained and applied on column 9.
A classifier is trained and applied on column 10.
A classifier is trained and applied on column 11.
A classifier is trained and applied on column 12.
A classifier is trained and applied on column 13.
A classifier is trained and applied on column 14.
A classifier is trained and applied on column 15.
A classifier is trained and applied on column 16.
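The jump in the number of labeled cells comes from label propagation: a handful of user labels is extended to every cell that shares a cluster with a labeled cell. A minimal sketch of that idea follows, assuming a majority vote inside each cluster. The function name, the voting rule, and the toy data are assumptions for illustration and may differ from what Raha does internally.

```python
# Illustrative sketch only; names and the majority-vote rule are assumptions.
from collections import Counter

def propagate_labels(clusters, user_labels):
    """Extend user labels to all cells of any cluster containing a labeled cell."""
    propagated = {}
    for members in clusters:
        votes = Counter(user_labels[c] for c in members if c in user_labels)
        if not votes:
            continue  # no labeled cell in this cluster, so nothing to propagate
        majority_label, _ = votes.most_common(1)[0]
        for cell in members:
            # Cells labeled by the user keep their own label.
            propagated[cell] = user_labels.get(cell, majority_label)
    return propagated

clusters = [[0, 1, 3], [2, 4], [5]]   # cell ids grouped by cluster
user_labels = {0: 0, 2: 1}            # 0 = clean, 1 = error

propagated = propagate_labels(clusters, user_labels)
print(len(user_labels), "->", len(propagated))  # 2 -> 5
print(propagated)  # {0: 0, 1: 0, 3: 0, 2: 1, 4: 1}
```

The propagated labels then serve as training data for the per-column classifier, while cells in clusters without any label (cell 5 here) are left for the classifier to predict.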


## 6. Initializing and Updating the Error Corrector Models
Baran initializes the error corrector models and then iterates over sampled tuples, whose data cells we would normally label. After each labeled tuple, it updates the models and generates a feature vector for each pair of a data error and a correction candidate. Finally, it trains and applies a classifier to each data column to predict the final correction of each data error. Since we already labeled tuples for Raha, we reuse those labeled tuples here and do not label new ones.

In [8]:
app_2.initialize_models(d)
app_2.initialize_dataset(d)
for si in d.labeled_tuples:
    d.sampled_tuple = si
    app_2.update_models(d)
    app_2.generate_features(d)
    app_2.predict_corrections(d)

The error corrector models are initialized.
The error corrector models are updated with new labeled tuple 1545.
20364593 pairs of (a data error, a potential correction) are featurized.
Prediction Method in this step:
    Column 0: No train 0
    Column 2: Train
    Column 3: Train
    Column 5: No train 0
    Column 10: Train
    Column 11: Train
    Column 12: Train
    Column 14: Train
    Column 16: No train 0
Train sizes in this step:
    Column 0: 36025
    Column 2: 535
    Column 3: 23575
    Column 4: 3745
    Column 5: 21752
    Column 10: 810
    Column 11: 588
    Column 12: 8300
    Column 14: 2548
    Column 16: 31150
Corrections applied in this step:
    Column 0: 0 Real changes: 0
    Column 2: 95 Real changes: 95
    Column 3: 0 Real changes: 0
    Column 5: 0 Real changes: 0
    Column 10: 144 Real changes: 144
    Column 11: 156 Real changes: 156
    Column 12: 2426 Real changes: 2426
    Column 14: 45 Real changes: 45
    Column 16: 0 Real changes: 0
31% (2879 / 9230
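To illustrate what "a pair of a data error and a correction candidate" can look like, the sketch below implements one very simple corrector model that remembers which clean value replaced which dirty value in the labeled tuples. It is a simplified assumption, not Baran's actual models or feature set. The class name, the helper function, and the toy values are hypothetical.

```python
# Illustrative sketch only; class, function, and data are hypothetical.
from collections import Counter, defaultdict

class ValueBasedCorrector:
    """Remembers how often each dirty value was replaced by each clean value."""

    def __init__(self):
        self.counts = defaultdict(Counter)

    def update(self, dirty_value, clean_value):
        """Learn from one labeled (dirty, clean) cell pair."""
        self.counts[dirty_value][clean_value] += 1

    def candidates(self, dirty_value):
        """Return candidate corrections with their relative frequencies."""
        seen = self.counts.get(dirty_value)
        if not seen:
            return {}
        total = sum(seen.values())
        return {value: count / total for value, count in seen.items()}

def featurize_pairs(model, detected_errors):
    """Build one feature vector per (error cell, correction candidate) pair."""
    pairs = {}
    for cell, dirty_value in detected_errors.items():
        for candidate, score in model.candidates(dirty_value).items():
            pairs[(cell, candidate)] = [score]
    return pairs

model = ValueBasedCorrector()
for dirty, clean in [("109 min", "109")] * 3 + [("109 min", "109 minutes")]:
    model.update(dirty, clean)

detected_errors = {(7, 10): "109 min", (9, 10): "n/a"}
pairs = featurize_pairs(model, detected_errors)
print(pairs)
# {((7, 10), '109'): [0.75], ((7, 10), '109 minutes'): [0.25]}
```

A real corrector would combine several such models, so each pair would carry a longer feature vector. The per-column classifier then decides which candidate, if any, becomes the final correction.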

## 7. Storing Results
Both Raha and Baran can also store the error detection/correction results.

In [None]:
#app_1.store_results(d)
#app_2.store_results(d)

## 8. Evaluating the Data Cleaning Task
We can finally evaluate our data cleaning task.

In [None]:
edp, edr, edf = d.get_data_cleaning_evaluation(d.detected_cells)[:3]
ecp, ecr, ecf = d.get_data_cleaning_evaluation(d.corrected_cells)[-3:]

# DataFrame.append was removed in pandas 2.0, so build the frame from a list of rows instead.
evaluation_rows = [
    {"Task": "Error Detection (Raha)", "Precision": "{:.2f}".format(edp),
     "Recall": "{:.2f}".format(edr), "F1 Score": "{:.2f}".format(edf)},
    {"Task": "Error Correction (Baran)", "Precision": "{:.2f}".format(ecp),
     "Recall": "{:.2f}".format(ecr), "F1 Score": "{:.2f}".format(ecf)},
]
evaluation_df = pandas.DataFrame(evaluation_rows, columns=["Task", "Precision", "Recall", "F1 Score"])
evaluation_df.head()
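For readers who want to see how the detection metrics are defined, here is a small sketch that computes precision, recall, and F1 from a set of detected cells and a set of actual error cells. The function name and the toy cell sets are assumptions, and the sketch is independent of `get_data_cleaning_evaluation`, whose internals are not shown in this notebook.

```python
# Illustrative sketch only; the function name and toy cells are hypothetical.

def detection_scores(detected_cells, actual_errors):
    """Precision, recall, and F1 of detected cells against actual errors."""
    detected = set(detected_cells)
    actual = set(actual_errors)
    true_positives = len(detected & actual)
    precision = true_positives / len(detected) if detected else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

detected_cells = {(0, 1), (1, 2), (2, 0), (3, 3)}
actual_errors = {(0, 1), (1, 2), (4, 4)}

p, r, f1 = detection_scores(detected_cells, actual_errors)
print("{:.2f} {:.2f} {:.2f}".format(p, r, f1))  # 0.50 0.67 0.57
```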

In [None]:
import importlib
importlib.reload(analysis_utilities)

In [None]:
actual_errors = d.get_actual_errors_dictionary()

In [None]:
analysis_utilities.detection_evaluation(d, actual_errors)

In [None]:
correction_confidence_df = analysis_utilities.get_correction_confidence_df(d, actual_errors)

In [None]:
correction_confidence_df.shape[0]

In [None]:
(correction_confidence_df["confidence"] < 0.98).sum()

In [None]:
analysis_utilities.correction_confidence_distributions(correction_confidence_df)

In [None]:
f = analysis_utilities.correction_correctness_by_confidence(correction_confidence_df)
f.show()