# End-to-End Data Cleaning Pipeline with Raha and Baran (Minimal and Sequential)
We build an end-to-end data cleaning pipeline with our configuration-free error detection and correction systems, Raha and Baran.

In [1]:
import pandas
import IPython.display

import raha

## Error Detection with Raha

### 1. Instantiating the Detection Class
We first instantiate the `Detection` class.

In [2]:
app_1 = raha.Detection()

# How many tuples would you label?
app_1.LABELING_BUDGET = 20

# Would you like to see the logs?
app_1.VERBOSE = True

### 2. Instantiating the Dataset
We next load and instantiate the dataset object.

In [3]:
dataset_dictionary = {
    "name": "flights",
    "path": "datasets/flights/dirty.csv",
    "clean_path": "datasets/flights/clean.csv"
}
d = app_1.initialize_dataset(dataset_dictionary)
d.dataframe.head()

Unnamed: 0,tuple_id,src,flight,sched_dep_time,act_dep_time,sched_arr_time,act_arr_time
0,1,aa,AA-3859-IAH-ORD,7:10 a.m.,7:16 a.m.,9:40 a.m.,9:32 a.m.
1,2,aa,AA-1733-ORD-PHX,7:45 p.m.,7:58 p.m.,10:30 p.m.,
2,3,aa,AA-1640-MIA-MCO,6:30 p.m.,,7:25 p.m.,
3,4,aa,AA-518-MIA-JFK,6:40 a.m.,6:54 a.m.,9:25 a.m.,9:28 a.m.
4,5,aa,AA-3756-ORD-SLC,12:15 p.m.,12:41 p.m.,2:45 p.m.,2:50 p.m.


### 3. Running Error Detection Strategies
Raha runs all (or only the promising) error detection strategies on the dataset. This step can take a while because every selected strategy must be run on the dataset.

In [4]:
app_1.run_strategies(d)

72 cells are detected by ["PVD", ["sched_dep_time", "c"]].
92 cells are detected by ["PVD", ["src", "q"]].
76 cells are detected by ["PVD", ["act_arr_time", "i"]].
17 cells are detected by ["PVD", ["sched_arr_time", "F"]].
1593 cells are detected by ["PVD", ["sched_arr_time", ":"]].
16 cells are detected by ["PVD", ["sched_arr_time", "A"]].
1 cells are detected by ["PVD", ["act_arr_time", "\u00a0"]].
671 cells are detected by ["PVD", ["src", "s"]].
919 cells are detected by ["PVD", ["tuple_id", "2"]].
741 cells are detected by ["PVD", ["sched_arr_time", "0"]].
55 cells are detected by ["PVD", ["src", "-"]].
43 cells are detected by ["PVD", ["act_arr_time", "r"]].
31 cells are detected by ["PVD", ["act_dep_time", "s"]].
145 cells are detected by ["PVD", ["flight", "V"]].
13 cells are detected by ["PVD", ["sched_arr_time", "b"]].
7 cells are detected by ["PVD", ["sched_arr_time", "h"]].
279 cells are detected by ["PVD", ["flight", "K"]].
1426 cells are detected by ["PVD", ["tuple_id", "1
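
Each log line above names a strategy of the form `["PVD", [column, character]]`. As a rough illustration only, and assuming that such a strategy flags every cell of the given column whose value contains the given character, a stand-alone sketch could look as follows. The function `flag_cells_containing` and the toy rows are hypothetical and are not part of the Raha API.

```python
def flag_cells_containing(rows, column, character):
    """Return the set of (row_index, column) cells whose value contains `character`."""
    return {(i, column) for i, row in enumerate(rows) if character in row[column]}


# Toy rows for illustration only (not taken from the flights dataset).
rows = [
    {"src": "aa", "sched_dep_time": "7:10 a.m."},
    {"src": "flightstats", "sched_dep_time": "12/02/2011 6:30 p.m."},
    {"src": "orbitz", "sched_dep_time": "6:40 a.m."},
]

# Assumed behavior of a strategy like ["PVD", ["sched_dep_time", "/"]].
flagged = flag_cells_containing(rows, "sched_dep_time", "/")
print(flagged)
```

Under this assumption, a character that appears in almost every value of a column (such as `:` in a time column) flags many cells, which is why a single strategy output is only a weak signal on its own.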

### 4. Generating Features
Raha then generates a feature vector for each data cell based on the output of error detection strategies. 

In [5]:
app_1.generate_features(d)

40 Features are generated for column 0.
65 Features are generated for column 1.
62 Features are generated for column 2.
65 Features are generated for column 3.
71 Features are generated for column 4.
65 Features are generated for column 5.
86 Features are generated for column 6.
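
To make the idea of a per-cell feature vector concrete, the following minimal sketch assumes that each strategy contributes one binary feature, set to 1 when the strategy flagged the cell. The names `build_feature_vectors` and `strategy_outputs` are hypothetical; Raha's actual feature generation may differ in detail.

```python
def build_feature_vectors(cells, strategy_outputs):
    """Map each cell to a binary vector with one entry per strategy.

    `strategy_outputs` maps a strategy name to the set of cells it flagged.
    """
    strategies = sorted(strategy_outputs)  # fixed feature order
    return {
        cell: [1 if cell in strategy_outputs[name] else 0 for name in strategies]
        for cell in cells
    }


# Toy example: three cells of one column and two strategies.
cells = [(0, 0), (1, 0), (2, 0)]
strategy_outputs = {
    "strategy_a": {(0, 0)},
    "strategy_b": {(0, 0), (2, 0)},
}
features = build_feature_vectors(cells, strategy_outputs)
for cell in cells:
    print(cell, features[cell])
```

Cells that are flagged by the same set of strategies end up with identical vectors, which is what the later clustering step relies on.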


### 5. Building Clusters
Raha next builds a hierarchical clustering model for our clustering-based sampling approach.

In [6]:
app_1.build_clusters(d)

A hierarchical clustering model is built for column 0.
A hierarchical clustering model is built for column 1.
A hierarchical clustering model is built for column 2.
A hierarchical clustering model is built for column 3.
A hierarchical clustering model is built for column 4.
A hierarchical clustering model is built for column 5.
A hierarchical clustering model is built for column 6.
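
As an illustration of what hierarchical clustering does with such feature vectors, here is a small bottom-up (agglomerative) sketch in pure Python. It uses single linkage with Hamming distance as a stand-in; the linkage method, the distance metric, and the function names are assumptions made for this example and are not claimed to match Raha's implementation.

```python
def hamming(u, v):
    """Number of positions in which two equal-length vectors differ."""
    return sum(a != b for a, b in zip(u, v))


def agglomerate(vectors, k):
    """Merge the two closest clusters repeatedly until only k clusters remain."""
    clusters = [[i] for i in range(len(vectors))]  # start with one cluster per cell
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                dist = min(
                    hamming(vectors[i], vectors[j])
                    for i in clusters[a]
                    for j in clusters[b]
                )
                if best is None or dist < best[0]:
                    best = (dist, a, b)
        _, a, b = best
        clusters[a].extend(clusters[b])
        del clusters[b]
    return clusters


# Toy feature vectors for five cells of one column.
vectors = [[1, 1, 0], [1, 1, 0], [0, 0, 1], [0, 0, 1], [1, 0, 0]]
clusters = agglomerate(vectors, k=2)
print(clusters)
```

Stopping the merging at different values of `k` yields coarser or finer groupings of the same cells, which is the property a clustering-based sampling approach can exploit as more tuples are labeled.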


### 6. Interactive Tuple Sampling and Labeling
Raha then iteratively samples tuples, and we label the data cells of each sampled tuple. If the dataset has a ground truth, the labels are taken from it instead of being entered by hand.

In [7]:
while len(d.labeled_tuples) < app_1.LABELING_BUDGET:
    app_1.sample_tuple(d)
    if d.has_ground_truth:
        app_1.label_with_ground_truth(d)
    else:
        print("Label the dirty cells in the following sampled tuple.")
        sampled_tuple = pandas.DataFrame(data=[d.dataframe.iloc[d.sampled_tuple, :]], columns=d.dataframe.columns)
        IPython.display.display(sampled_tuple)
        for j in range(d.dataframe.shape[1]):
            cell = (d.sampled_tuple, j)
            value = d.dataframe.iloc[cell]
            correction = input("What is the correction for value '{}'? Type in the same value if it is not erroneous.\n".format(value))
            user_label = 1 if value != correction else 0
            d.labeled_cells[cell] = [user_label, correction]
        d.labeled_tuples[d.sampled_tuple] = 1

Tuple 1676 is sampled.
Tuple 1676 is labeled.
Tuple 688 is sampled.
Tuple 688 is labeled.
Tuple 5 is sampled.
Tuple 5 is labeled.
Tuple 809 is sampled.
Tuple 809 is labeled.
Tuple 190 is sampled.
Tuple 190 is labeled.
Tuple 2238 is sampled.
Tuple 2238 is labeled.
Tuple 506 is sampled.
Tuple 506 is labeled.
Tuple 1859 is sampled.
Tuple 1859 is labeled.
Tuple 945 is sampled.
Tuple 945 is labeled.
Tuple 1900 is sampled.
Tuple 1900 is labeled.
Tuple 800 is sampled.
Tuple 800 is labeled.
Tuple 1249 is sampled.
Tuple 1249 is labeled.
Tuple 1183 is sampled.
Tuple 1183 is labeled.
Tuple 1453 is sampled.
Tuple 1453 is labeled.
Tuple 1340 is sampled.
Tuple 1340 is labeled.
Tuple 232 is sampled.
Tuple 232 is labeled.
Tuple 1043 is sampled.
Tuple 1043 is labeled.
Tuple 1473 is sampled.
Tuple 1473 is labeled.
Tuple 243 is sampled.
Tuple 243 is labeled.
Tuple 1775 is sampled.
Tuple 1775 is labeled.


### 7. Propagating User Labels
Raha then propagates each user label through its cluster.

In [8]:
app_1.propagate_labels(d)

The number of labeled data cells increased from 140 to 11188.
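
The 140 initial labels correspond to the 20 labeled tuples times the 7 columns. The sketch below shows one way such labels could spread through clusters: every cell in a cluster inherits the user label found in that cluster. It assumes that clusters with conflicting user labels are skipped; that rule, the function name `propagate`, and the toy data are assumptions made for this example rather than Raha's documented behavior.

```python
from collections import defaultdict


def propagate(cluster_of, user_labels):
    """Give every cell the user label of its cluster, if that label is unambiguous.

    `cluster_of` maps a cell to its cluster id; `user_labels` maps a labeled
    cell to 1 (error) or 0 (clean).
    """
    labels_per_cluster = defaultdict(set)
    for cell, label in user_labels.items():
        labels_per_cluster[cluster_of[cell]].add(label)

    propagated = {}
    for cell, cluster in cluster_of.items():
        labels = labels_per_cluster.get(cluster)
        # Skip clusters without labels and clusters with conflicting labels.
        if labels and len(labels) == 1:
            propagated[cell] = next(iter(labels))
    return propagated


# Toy example: five cells in three clusters, three of them labeled by the user.
cluster_of = {(0, 0): "a", (1, 0): "a", (2, 0): "b", (3, 0): "b", (4, 0): "c"}
user_labels = {(0, 0): 1, (2, 0): 0, (3, 0): 1}
propagated = propagate(cluster_of, user_labels)
print(propagated)
```

Here cluster `a` spreads its single label to both members, cluster `b` is skipped because its two labels disagree, and cluster `c` stays unlabeled.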


### 8. Predicting Labels of Data Cells
Raha then trains and applies one classifier per data column to predict the labels of the remaining data cells.

In [9]:
app_1.predict_labels(d)

A classifier is trained and applied on column 0.
A classifier is trained and applied on column 1.
A classifier is trained and applied on column 2.
A classifier is trained and applied on column 3.
A classifier is trained and applied on column 4.
A classifier is trained and applied on column 5.
A classifier is trained and applied on column 6.


### 9. Storing Results
Raha can also store the error detection results.

In [10]:
app_1.store_results(d)

The results are stored in datasets/flights/raha-baran-results-flights/error-detection/detection.dataset.


### 10. Evaluating the Error Detection Task
We can finally evaluate our error detection task.

In [11]:
# The first three returned scores are used as the detection precision, recall, and F1.
p, r, f = d.get_data_cleaning_evaluation(d.detected_cells)[:3]
print("Raha's performance on {}:\nPrecision = {:.2f}\nRecall = {:.2f}\nF1 = {:.2f}".format(d.name, p, r, f))

Raha's performance on flights:
Precision = 0.82
Recall = 0.87
F1 = 0.85
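
For reference, precision, recall, and F1 for error detection can be computed from two sets of cells: the detected cells and the actual errors. The sketch below uses the standard definitions; the function name `detection_scores` and the toy sets are hypothetical.

```python
def detection_scores(detected, actual_errors):
    """Return (precision, recall, f1) for a set of detected cells."""
    true_positives = len(detected & actual_errors)
    precision = true_positives / len(detected) if detected else 0.0
    recall = true_positives / len(actual_errors) if actual_errors else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy example: 4 detected cells, 5 actual errors, 3 of them in common.
detected = {(0, 1), (1, 1), (2, 3), (3, 3)}
actual_errors = {(0, 1), (2, 3), (3, 3), (4, 5), (5, 5)}
precision, recall, f1 = detection_scores(detected, actual_errors)
print("Precision = {:.2f}\nRecall = {:.2f}\nF1 = {:.2f}".format(precision, recall, f1))
```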


## Error Correction with Baran

### 1. Instantiating the Correction Class
We first instantiate the `Correction` class.

In [12]:
app_2 = raha.Correction()

# How many tuples would you label?
app_2.LABELING_BUDGET = 20

# Would you like to see the logs?
app_2.VERBOSE = True

### 2. Initializing the Dataset Object
We next initialize the dataset object.

In [13]:
d = app_2.initialize_dataset(d)
d.dataframe.head()

Unnamed: 0,tuple_id,src,flight,sched_dep_time,act_dep_time,sched_arr_time,act_arr_time
0,1,aa,AA-3859-IAH-ORD,7:10 a.m.,7:16 a.m.,9:40 a.m.,9:32 a.m.
1,2,aa,AA-1733-ORD-PHX,7:45 p.m.,7:58 p.m.,10:30 p.m.,
2,3,aa,AA-1640-MIA-MCO,6:30 p.m.,,7:25 p.m.,
3,4,aa,AA-518-MIA-JFK,6:40 a.m.,6:54 a.m.,9:25 a.m.,9:28 a.m.
4,5,aa,AA-3756-ORD-SLC,12:15 p.m.,12:41 p.m.,2:45 p.m.,2:50 p.m.


### 3. Initializing the Error Corrector Models
Baran initializes the error corrector models.

In [14]:
app_2.initialize_models(d)

The error corrector models are initialized.


### 4. Interactive Tuple Sampling, Labeling, Model Updating, Feature Generation, and Correction Prediction
Baran then iteratively samples tuples, and we label the data cells of each sampled tuple. Baran updates the models accordingly and generates a feature vector for each pair of a data error and a correction candidate. Finally, it trains and applies a classifier to each data column to predict the final correction of each data error. Since we already labeled tuples for Raha, we reuse those labeled tuples here and do not label new ones.

In [15]:
# while len(d.labeled_tuples) < app_2.LABELING_BUDGET:
#     app_2.sample_tuple(d)
#     if d.has_ground_truth:
#         app_2.label_with_ground_truth(d)
#     else:
#         print("Label the dirty cells in the following sampled tuple.")
#         sampled_tuple = pandas.DataFrame(data=[d.dataframe.iloc[d.sampled_tuple, :]], columns=d.dataframe.columns)
#         IPython.display.display(sampled_tuple)
#         for j in range(d.dataframe.shape[1]):
#             cell = (d.sampled_tuple, j)
#             value = d.dataframe.iloc[cell]
#             correction = input("What is the correction for value '{}'? Type in the same value if it is not erroneous.\n".format(value))
#             user_label = 1 if value != correction else 0
#             d.labeled_cells[cell] = [user_label, correction]
#         d.labeled_tuples[d.sampled_tuple] = 1
#     app_2.update_models(d)
#     app_2.generate_features(d)
#     app_2.predict_corrections(d)

# Reuse the tuples already labeled for Raha instead of sampling new ones.
for si in d.labeled_tuples:
    d.sampled_tuple = si
    app_2.update_models(d)
    app_2.generate_features(d)
    app_2.predict_corrections(d)

The error corrector models are updated with new labeled tuple 1676.
490353 pairs of (a data error, a potential correction) are featurized.
50% (2619 / 5223) of data errors are corrected.
The error corrector models are updated with new labeled tuple 688.
496333 pairs of (a data error, a potential correction) are featurized.
50% (2634 / 5223) of data errors are corrected.
The error corrector models are updated with new labeled tuple 5.
501895 pairs of (a data error, a potential correction) are featurized.
51% (2685 / 5223) of data errors are corrected.
The error corrector models are updated with new labeled tuple 809.
508650 pairs of (a data error, a potential correction) are featurized.
55% (2867 / 5223) of data errors are corrected.
The error corrector models are updated with new labeled tuple 190.
521136 pairs of (a data error, a potential correction) are featurized.
56% (2907 / 5223) of data errors are corrected.
The error corrector models are updated with new labeled tuple 2238.
531

### 5. Storing Results
Baran can also store the error correction results.

In [16]:
app_2.store_results(d)

The results are stored in datasets/flights/raha-baran-results-flights/error-correction/correction.dataset.


### 6. Evaluating the Error Correction Task
We can finally evaluate our error correction task.

In [17]:
# The last three returned scores are used as the correction precision, recall, and F1.
p, r, f = d.get_data_cleaning_evaluation(d.corrected_cells)[-3:]
print("Baran's performance on {}:\nPrecision = {:.2f}\nRecall = {:.2f}\nF1 = {:.2f}".format(d.name, p, r, f))

Baran's performance on flights:
Precision = 0.80
Recall = 0.51
F1 = 0.62
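
Correction quality is stricter than detection quality because a proposed value must also be the right one. The sketch below assumes that a correction counts as a true positive only when the cell is an actual error and the proposed value equals the clean value. That rule, the function name `correction_scores`, and the toy data are assumptions for illustration, not a description of `get_data_cleaning_evaluation`.

```python
def correction_scores(corrections, actual_errors):
    """Return (precision, recall, f1) for proposed corrections.

    `corrections` maps a cell to the proposed value; `actual_errors` maps each
    truly erroneous cell to its clean value.
    """
    true_positives = sum(
        1
        for cell, value in corrections.items()
        if cell in actual_errors and value == actual_errors[cell]
    )
    precision = true_positives / len(corrections) if corrections else 0.0
    recall = true_positives / len(actual_errors) if actual_errors else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Toy example: one right value, one wrong value, one correction of a clean cell.
actual_errors = {
    (0, 3): "7:10 a.m.",
    (1, 3): "6:40 a.m.",
    (2, 3): "8:00 a.m.",
    (3, 3): "9:05 a.m.",
}
corrections = {(0, 3): "7:10 a.m.", (1, 3): "6:45 a.m.", (5, 3): "1:00 p.m."}
ec_p, ec_r, ec_f = correction_scores(corrections, actual_errors)
print("Precision = {:.2f}\nRecall = {:.2f}\nF1 = {:.2f}".format(ec_p, ec_r, ec_f))
```

Under this assumption, recall is bounded by the share of errors that receive any correction at all, so errors left uncorrected lower recall without affecting precision.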


In [18]:
d?

Type:           Dataset
String form:    <raha.dataset.Dataset object at 0x7fd27edbab50>
File:           ~/source/MA/raha/raha/dataset.py
Docstring:      The dataset class.
Init docstring: The constructor creates a dataset.


In [20]:
d.corrected_cells

{(506, 3): '7:35 a.m.',
 (688, 3): '1:29 p.m.',
 (809, 3): '7:15 a.m.',
 (1249, 3): '6:00 a.m.',
 (1340, 3): '8:15 a.m.',
 (1775, 3): '8:15 a.m.',
 (2238, 3): '9:05 a.m.',
 (50, 3): '7:10 a.m.',
 (52, 3): '6:40 a.m.',
 (53, 3): '8:00 a.m.',
 (54, 3): '8:41 a.m.',
 (55, 3): '7:45 p.m.',
 (56, 3): '3:27 p.m.',
 (58, 3): '11:25 a.m.',
 (59, 3): '10:39 a.m.',
 (60, 3): '11:25 p.m.',
 (61, 3): '6:00 a.m.',
 (63, 3): '4:00 p.m.',
 (66, 3): '9:05 a.m.',
 (67, 3): '8:15 a.m.',
 (68, 3): '7:00 a.m.',
 (69, 3): '4:15 p.m.',
 (70, 3): '10:45 a.m.',
 (73, 3): '2:55 p.m.',
 (74, 3): '11:50 a.m.',
 (76, 3): '6:00 a.m.',
 (77, 3): '11:05 a.m.',
 (80, 3): '10:15 a.m.',
 (81, 3): '1:00 p.m.',
 (82, 3): '7:53 a.m.',
 (83, 3): '12:30 p.m.',
 (84, 3): '11:45 a.m.',
 (85, 3): '2:30 p.m.',
 (86, 3): '12:00 p.m.',
 (89, 3): '10:40 a.m.',
 (90, 3): '12:57 p.m.',
 (92, 3): '7:15 a.m.',
 (93, 3): '6:30 p.m.',
 (94, 3): '6:45 p.m.',
 (95, 3): '12:00 p.m.',
 (97, 3): '1:55 p.m.',
 (99, 3): '6:00 a.m.',
 (100, 3):