# End-to-End Data Cleaning Pipeline with Raha and Baran (Minimal and Integrated)
We build an end-to-end data cleaning pipeline with our configuration-free error detection and correction systems, Raha and Baran.

In [1]:
import pandas
import IPython.display
import ipywidgets

import raha

## 1. Instantiating the Detection and Correction Classes
We first instantiate the `Detection` and `Correction` classes.

In [2]:
app_1 = raha.Detection()
app_2 = raha.Correction()

# How many tuples would you label?
app_1.LABELING_BUDGET = 20
app_2.LABELING_BUDGET = 0

# Would you like to see the logs?
app_1.VERBOSE = True
app_2.VERBOSE = True

## 2. Instantiating the Dataset
We next load and instantiate the dataset object.

In [3]:
dataset_dictionary = {
    "name": "beers",
    "path": "datasets/beers/dirty.csv",
    "clean_path": "datasets/beers/clean.csv",
}
d = app_1.initialize_dataset(dataset_dictionary)
d.dataframe.head()

Unnamed: 0,index,id,beer_name,style,ounces,abv,ibu,brewery_id,brewery_name,city,state
0,1,1436,Pub Beer,American Pale Lager,12.0 oz,0.05,,408,10 Barrel Brewing Company,Bend,OR
1,2,2265,Devil's Cup,American Pale Ale (APA),12.0 oz.,0.066,,177,18th Street Brewery,Gary,IN
2,3,2264,Rise of the Phoenix,American IPA,12.0 ounce,0.071,,177,18th Street Brewery,Gary,IN
3,4,2263,Sinister,American Double / Imperial IPA,12.0 oz,0.09%,,177,18th Street Brewery,Gary,IN
4,5,2262,Sex and Candy,American IPA,12.0 OZ.,0.075,,177,18th Street Brewery,Gary,IN


## 3. Generating Features and Clusters
Raha runs error detection strategies (either all of them or only the promising ones) on the dataset. This step can take a while because every strategy has to be run on the dataset. Raha then generates a feature vector for each data cell from the outputs of the error detection strategies, and builds a hierarchical clustering model that drives our clustering-based sampling approach.

In [4]:
app_1.run_strategies(d)
app_1.generate_features(d)
app_1.build_clusters(d)

23 cells are detected by ["PVD", ["city", "J"]].
2348 cells are detected by ["PVD", ["abv", "."]].
91 cells are detected by ["PVD", ["state", "P"]].
18 cells are detected by ["PVD", ["style", "j"]].
15 cells are detected by ["PVD", ["style", "v"]].
7 cells are detected by ["PVD", ["state", "J"]].
126 cells are detected by ["PVD", ["brewery_name", "N"]].
797 cells are detected by ["PVD", ["brewery_name", "l"]].
4 cells are detected by ["PVD", ["brewery_name", "q"]].
974 cells are detected by ["PVD", ["city", "l"]].
4810 cells are detected by ["RVD", ["ounces", "style"]].
0 cells are detected by ["RVD", ["index", "abv"]].
278 cells are detected by ["PVD", ["beer_name", "("]].
275 cells are detected by ["PVD", ["brewery_name", "v"]].
1093 cells are detected by ["PVD", ["id", "2"]].
1005 cells are detected by ["PVD", ["ibu", "A"]].
2299 cells are detected by ["PVD", ["brewery_name", "B"]].
874 cells are detected by ["PVD", ["beer_name", "P"]].
190 cells are detected by ["PVD", ["beer_name"

4774 cells are detected by ["RVD", ["ibu", "state"]].
346 cells are detected by ["RVD", ["beer_name", "ounces"]].
83 cells are detected by ["PVD", ["city", "f"]].
221 cells are detected by ["PVD", ["city", "P"]].
175 cells are detected by ["PVD", ["beer_name", "'"]].
8 cells are detected by ["PVD", ["brewery_name", "5"]].
54 cells are detected by ["PVD", ["city", "O"]].
2410 cells are detected by ["PVD", ["ounces", " "]].
2080 cells are detected by ["PVD", ["brewery_name", "n"]].
1463 cells are detected by ["OD", ["gaussian", "2.5"]].
1443 cells are detected by ["PVD", ["beer_name", "l"]].
1194 cells are detected by ["PVD", ["city", "e"]].
39 cells are detected by ["PVD", ["brewery_name", "1"]].
256 cells are detected by ["PVD", ["beer_name", "2"]].
9061 cells are detected by ["OD", ["histogram", "0.5", "0.5"]].
30 cells are detected by ["PVD", ["beer_name", "8"]].
2150 cells are detected by ["PVD", ["style", " "]].
9061 cells are detected by ["OD", ["histogram", "0.3", "0.5"]].
16260 

92 cells are detected by ["RVD", ["beer_name", "brewery_id"]].
378 cells are detected by ["RVD", ["beer_name", "id"]].
278 cells are detected by ["PVD", ["beer_name", ")"]].
572 cells are detected by ["PVD", ["beer_name", "S"]].
1396 cells are detected by ["PVD", ["city", "o"]].
7633 cells are detected by ["OD", ["gaussian", "1.0"]].
104 cells are detected by ["PVD", ["state", "Y"]].
244 cells are detected by ["PVD", ["beer_name", "T"]].
175 cells are detected by ["PVD", ["brewery_name", "G"]].
2 cells are detected by ["PVD", ["beer_name", "\u00fc"]].
453 cells are detected by ["PVD", ["city", "g"]].
1018 cells are detected by ["PVD", ["city", "t"]].
396 cells are detected by ["PVD", ["ibu", "2"]].
2256 cells are detected by ["PVD", ["ounces", "o"]].
2410 cells are detected by ["PVD", ["ounces", "."]].
114 cells are detected by ["PVD", ["city", "W"]].
1005 cells are detected by ["PVD", ["ibu", "N"]].
108 cells are detected by ["PVD", ["city", "w"]].
395 cells are detected by ["PVD", ["

5388 cells are detected by ["OD", ["histogram", "0.3", "0.3"]].
0 cells are detected by ["RVD", ["index", "brewery_name"]].
129 cells are detected by ["PVD", ["ounces", "T"]].
0 cells are detected by ["RVD", ["index", "city"]].
953 cells are detected by ["PVD", ["index", "2"]].
4450 cells are detected by ["RVD", ["city", "abv"]].
244 cells are detected by ["PVD", ["city", "A"]].
429 cells are detected by ["PVD", ["brewery_name", "S"]].
92 cells are detected by ["PVD", ["beer_name", "z"]].
30 cells are detected by ["PVD", ["beer_name", "6"]].
1732 cells are detected by ["RVD", ["brewery_id", "city"]].
120 cells are detected by ["PVD", ["brewery_name", "'"]].
3 cells are detected by ["PVD", ["beer_name", "\u2122"]].
0 cells are detected by ["RVD", ["index", "ibu"]].
0 cells are detected by ["RVD", ["id", "index"]].
14 cells are detected by ["RVD", ["brewery_name", "state"]].
7 cells are detected by ["PVD", ["city", "-"]].
2159 cells are detected by ["OD", ["gaussian", "1.7"]].
315 cells 
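To make the two steps after strategy execution more concrete, here is a minimal, self-contained sketch on the five `ounces` values shown in the table earlier. It is a toy illustration and not Raha's implementation: the helper names (`contains_char`, `feature_vectors`, `agglomerate`), the three characters checked, the Hamming distance, and the average-linkage merging rule are all assumptions chosen for readability. Each toy strategy flags the cells that contain one character, in the spirit of the `["PVD", [column, character]]` entries in the log.

```python
from itertools import combinations

# The five "ounces" values from the head of the beers table.
ounces = ["12.0 oz", "12.0 oz.", "12.0 ounce", "12.0 oz", "12.0 OZ."]

def contains_char(values, ch):
    """Toy strategy (assumption): flag the rows whose value contains ch."""
    return {i for i, value in enumerate(values) if ch in value}

def feature_vectors(n_cells, strategy_outputs):
    """One 0/1 entry per strategy: 1 if that strategy flagged the cell."""
    return [[int(i in flagged) for flagged in strategy_outputs]
            for i in range(n_cells)]

def hamming(u, v):
    """Number of positions in which two feature vectors differ."""
    return sum(a != b for a, b in zip(u, v))

def agglomerate(vectors, k):
    """Bottom-up clustering: merge the two closest clusters until k remain."""
    clusters = [[i] for i in range(len(vectors))]

    def average_distance(pair):
        a, b = pair
        total = sum(hamming(vectors[x], vectors[y])
                    for x in clusters[a] for y in clusters[b])
        return total / (len(clusters[a]) * len(clusters[b]))

    while len(clusters) > k:
        a, b = min(combinations(range(len(clusters)), 2), key=average_distance)
        clusters[a].extend(clusters[b])  # a < b, so deleting b keeps a valid
        del clusters[b]
    return clusters

strategy_outputs = [contains_char(ounces, ch) for ch in ("u", "O", "z")]
vectors = feature_vectors(len(ounces), strategy_outputs)
clusters = agglomerate(vectors, k=3)
print(vectors)
print(clusters)
```

In this toy run, the three cells written with a lowercase `oz` share a feature vector and land in one cluster, while `12.0 ounce` and `12.0 OZ.` each form their own.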

## 4. Interactive Tuple Sampling and Labeling
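Raha then iteratively samples a tuple, and we label the data cells of each sampled tuple. The cell below shows one text box per data cell: any cell whose value we edit is labeled as an error, and the edited value is saved as its correction.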

In [5]:
def on_button_clicked(_):
    # Compare each text box with the original cell value of the sampled tuple.
    for j in range(0, len(texts)):
        cell = (d.sampled_tuple, j)
        error_label = 0
        correction = texts[j].value
        # An edited value means the original cell was dirty.
        if d.dataframe.iloc[cell] != correction:
            error_label = 1
        d.labeled_cells[cell] = [error_label, correction]
    d.labeled_tuples[d.sampled_tuple] = 1

app_1.sample_tuple(d)
print("Fix the dirty cells in the following sampled tuple.")
sampled_tuple = pandas.DataFrame(data=[d.dataframe.iloc[d.sampled_tuple, :]], columns=d.dataframe.columns)
IPython.display.display(sampled_tuple)
# One editable text box per data cell, pre-filled with the current value.
texts = [ipywidgets.Text(value=d.dataframe.iloc[d.sampled_tuple, j]) for j in range(d.dataframe.shape[1])]
button = ipywidgets.Button(description="Save the Annotation")
button.on_click(on_button_clicked)
output = ipywidgets.VBox(children=texts + [button])
IPython.display.display(output)

Tuple 630 is sampled.
Fix the dirty cells in the following sampled tuple.


Unnamed: 0,index,provider_number,name,address_1,address_2,address_3,city,state,zip,county,phone,type,owner,emergency_service,condition,measure_code,measure_name,score,sample,state_average
630,631,10029,east alabama medical center and snf,2000 pepperell parkway,empty,empty,opelika,al,36801,lee,3347493411,acute care hospitals,government - hospital district or authority,yes,pneumonia,pn-5c,pneumonia patients given initial antibiotic(s)...,96%,153 patients,al_pn-5c


VBox(children=(Text(value='631'), Text(value='10029'), Text(value='east alabama medical center and snf'), Text…
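
The callback above boils down to a cell-by-cell comparison between the sampled tuple and what the user typed. The following widget-free sketch shows that rule in isolation. `label_tuple` is a hypothetical helper, and the "typed" values are invented for illustration rather than taken from the ground truth.

```python
def label_tuple(tuple_index, dirty_values, typed_values):
    """Return {(row, column): [error_label, correction]} for one sampled tuple.

    A cell is labeled as an error (1) exactly when the typed value differs
    from the value currently in the dataset, mirroring on_button_clicked.
    """
    labeled_cells = {}
    for j, (dirty, typed) in enumerate(zip(dirty_values, typed_values)):
        error_label = int(dirty != typed)
        labeled_cells[(tuple_index, j)] = [error_label, typed]
    return labeled_cells

# Hypothetical annotation: the user edits only the second cell.
dirty = ["Sinister", "0.09%", "12.0 oz"]
typed = ["Sinister", "0.09", "12.0 oz"]
labels = label_tuple(3, dirty, typed)
print(labels)
```

Cells left untouched are stored with label 0 and their original value, so every cell of a sampled tuple ends up labeled.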

For the sake of time, we use the ground truth of the dataset to label tuples below.

In [5]:
%%capture
while len(d.labeled_tuples) < app_1.LABELING_BUDGET:
    app_1.sample_tuple(d)
    if d.has_ground_truth:
        app_1.label_with_ground_truth(d)
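
The loop above keeps sampling until the labeling budget is spent. As a rough intuition for why clustering helps here, the sketch below greedily picks the unlabeled tuple whose cells fall into the most clusters that have no label yet. This is a simplified stand-in under our own assumptions: `pick_tuple` and the cluster identifiers are hypothetical, and Raha's actual sampling policy may differ.

```python
def pick_tuple(tuple_clusters, labeled_tuples, labeled_clusters):
    """Greedy choice: the unlabeled tuple touching the most unlabeled clusters."""
    best_tuple, best_score = None, -1
    for t, clusters in tuple_clusters.items():
        if t in labeled_tuples:
            continue
        score = sum(1 for c in clusters if c not in labeled_clusters)
        if score > best_score:
            best_tuple, best_score = t, score
    return best_tuple

# Hypothetical data: each tuple lists the cluster of each of its two cells.
tuple_clusters = {0: ["a0", "b0"], 1: ["a1", "b0"], 2: ["a1", "b1"]}
labeling_budget = 2

labeled_tuples, labeled_clusters = [], set()
while len(labeled_tuples) < labeling_budget:
    t = pick_tuple(tuple_clusters, labeled_tuples, labeled_clusters)
    labeled_tuples.append(t)
    labeled_clusters.update(tuple_clusters[t])  # labeling a tuple covers its clusters
print(labeled_tuples)
```

With a budget of two tuples, the toy run covers all four clusters, which is the effect that makes a small labeling budget go a long way.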

## 5. Propagating User Labels and Predicting the Labels
Raha then propagates each user label through its cluster. It then trains and applies one classifier per data column to predict the labels of the remaining data cells.

In [6]:
app_1.propagate_labels(d)
app_1.predict_labels(d)

The number of labeled data cells increased from 220 to 24558.
A classifier is trained and applied on column 0.
A classifier is trained and applied on column 1.
A classifier is trained and applied on column 2.
A classifier is trained and applied on column 3.
A classifier is trained and applied on column 4.
A classifier is trained and applied on column 5.
A classifier is trained and applied on column 6.
A classifier is trained and applied on column 7.
A classifier is trained and applied on column 8.
A classifier is trained and applied on column 9.
A classifier is trained and applied on column 10.
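
In the log above, the 220 initially labeled cells correspond to the 20 labeled tuples times the 11 columns; propagation is what lifts that number to 24558. The sketch below shows the basic idea on toy data: every cell in a cluster inherits the label that the user gave to a member of that cluster. `propagate_labels` is a hypothetical helper, and skipping clusters whose user labels disagree is our simplification, not necessarily what Raha does.

```python
def propagate_labels(clusters, user_labels):
    """Copy each user label to all cells of its cluster.

    clusters: list of lists of cell ids.
    user_labels: {cell id: 0 or 1} for the cells the user labeled.
    """
    propagated = {}
    for cluster in clusters:
        labels_in_cluster = {user_labels[c] for c in cluster if c in user_labels}
        if len(labels_in_cluster) != 1:
            continue  # unlabeled or conflicting cluster: leave it to the classifier
        label = labels_in_cluster.pop()
        for cell in cluster:
            propagated[cell] = label
    return propagated

clusters = [[0, 1, 3], [2], [4]]
user_labels = {0: 0, 4: 1}  # hypothetical: cell 0 is clean, cell 4 is an error
propagated = propagate_labels(clusters, user_labels)
print(propagated)
```

Here two user labels become four labeled cells; cell 2 stays unlabeled and would be left to the per-column classifier.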


## 6. Initializing and Updating the Error Corrector Models
Baran initializes the error corrector models and then iteratively samples a tuple, whose data cells we should label. After each labeled tuple, it updates the models accordingly and generates a feature vector for each pair of a data error and a correction candidate. Finally, it trains and applies a classifier to each data column to predict the final correction of each data error. Since we already labeled tuples for Raha, we reuse the same labeled tuples and do not label new ones here, which is why `app_2.LABELING_BUDGET` was set to 0 above.

In [7]:
app_2.initialize_models(d)
app_2.initialize_dataset(d)
for si in d.labeled_tuples:
    d.sampled_tuple = si
    app_2.update_models(d)
    app_2.generate_features(d)
    app_2.predict_corrections(d)

The error corrector models are initialized.
The error corrector models are updated with new labeled tuple 869.
215102 pairs of (a data error, a potential correction) are featurized.
65% (2845 / 4369) of data errors are corrected.
The error corrector models are updated with new labeled tuple 310.
215311 pairs of (a data error, a potential correction) are featurized.
71% (3113 / 4369) of data errors are corrected.
The error corrector models are updated with new labeled tuple 1624.
215311 pairs of (a data error, a potential correction) are featurized.
71% (3115 / 4369) of data errors are corrected.
The error corrector models are updated with new labeled tuple 1837.
217192 pairs of (a data error, a potential correction) are featurized.
71% (3115 / 4369) of data errors are corrected.
The error corrector models are updated with new labeled tuple 1985.
218205 pairs of (a data error, a potential correction) are featurized.
94% (4116 / 4369) of data errors are corrected.
The error corrector mod
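
To illustrate the update-then-featurize loop, the sketch below implements one deliberately simple corrector model that remembers which clean value replaced which dirty value in the labeled tuples, and scores correction candidates by relative frequency. It is a toy under our own assumptions, not one of Baran's actual models: `ToyCorrector` is hypothetical and the dirty/clean pairs are invented.

```python
from collections import Counter, defaultdict

class ToyCorrector:
    """Hypothetical model: counts observed (dirty value -> clean value) rewrites."""

    def __init__(self):
        self.rewrites = defaultdict(Counter)

    def update(self, dirty_value, clean_value):
        """Learn from one labeled data error."""
        self.rewrites[dirty_value][clean_value] += 1

    def candidates(self, dirty_value):
        """Return {candidate: score}, where score is the share of past rewrites."""
        seen = self.rewrites.get(dirty_value, Counter())
        total = sum(seen.values())
        return {candidate: count / total for candidate, count in seen.items()}

model = ToyCorrector()
# Invented labeled errors: three users wrote "12.0", one wrote "12".
for dirty, clean in [("12.0 oz.", "12.0")] * 3 + [("12.0 oz.", "12")]:
    model.update(dirty, clean)

scores = model.candidates("12.0 oz.")
print(scores)
```

With several such models, each (data error, candidate) pair would get one score per model, and that list of scores plays the role of the feature vector passed to the per-column classifier.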

## 7. Storing Results
Both Raha and Baran can also store the error detection/correction results.

In [8]:
app_1.store_results(d)
app_2.store_results(d)

The results are stored in datasets/beers/raha-baran-results-beers/error-detection/detection.dataset.
The results are stored in datasets/beers/raha-baran-results-beers/error-correction/correction.dataset.
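
The stored files can be reloaded in a later session. The sketch below assumes that a `.dataset` file is a Python pickle, which this notebook does not confirm; if that assumption holds, calling `pickle.load` on a file opened from one of the paths printed above would return the stored object. To stay runnable without the actual results folder, the example round-trips a stand-in dictionary through a temporary file.

```python
import os
import pickle
import tempfile

# Stand-in for the stored dataset object (assumption: the real file is a pickle).
stand_in = {"name": "beers", "detected_cells": {(3, 5): "flagged"}}

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "detection.dataset")
    with open(path, "wb") as f:
        pickle.dump(stand_in, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == stand_in)
```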


## 8. Evaluating the Data Cleaning Task
We can finally evaluate our data cleaning task.

In [9]:
edp, edr, edf = d.get_data_cleaning_evaluation(d.detected_cells)[:3]
ecp, ecr, ecf = d.get_data_cleaning_evaluation(d.corrected_cells)[-3:]

# Build the table from a list of rows; DataFrame.append was removed in pandas 2.0.
rows = [
    {"Task": "Error Detection (Raha)", "Precision": "{:.2f}".format(edp),
     "Recall": "{:.2f}".format(edr), "F1 Score": "{:.2f}".format(edf)},
    {"Task": "Error Correction (Baran)", "Precision": "{:.2f}".format(ecp),
     "Recall": "{:.2f}".format(ecr), "F1 Score": "{:.2f}".format(ecf)},
]
evaluation_df = pandas.DataFrame(rows, columns=["Task", "Precision", "Recall", "F1 Score"])
evaluation_df.head()

Unnamed: 0,Task,Precision,Recall,F1 Score
0,Error Detection (Raha),1.0,1.0,1.0
1,Error Correction (Baran),0.78,0.75,0.77
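
For reference, the three reported metrics follow the standard definitions. The sketch below computes them for error detection from two sets of cells; it is a hypothetical helper on toy data, not the code behind `get_data_cleaning_evaluation`. For error correction, a cell would presumably count as a true positive only when the proposed value also equals the clean value.

```python
def precision_recall_f1(predicted_cells, actual_error_cells):
    """Standard precision, recall, and F1 over two sets of (row, column) cells."""
    true_positives = len(predicted_cells & actual_error_cells)
    precision = true_positives / len(predicted_cells) if predicted_cells else 0.0
    recall = true_positives / len(actual_error_cells) if actual_error_cells else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Toy data: three of the four detected cells are real errors, one error is missed.
detected = {(0, 4), (1, 4), (2, 4), (3, 5)}
actual = {(1, 4), (2, 4), (3, 5), (4, 4)}
p, r, f = precision_recall_f1(detected, actual)
print("{:.2f} {:.2f} {:.2f}".format(p, r, f))
```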
