data-cleaning.md

Under construction. Coming soon!

Surveys

Traditional and ML-based Data Cleaning

These tools focus on identifying errors in datasets without taking the downstream model or application into account. They include traditional constraint-based data cleaning methods as well as methods that use machine learning to detect and resolve data errors.

  • HoloClean: combines functional dependencies, quantitative statistics, and external information in a single factor-graph model.
  • Raha uses a library of error detectors, and treats the output of each as a feature in a holistic detection model. It then uses clustering and active learning to train the holistic model with few labels.
  • Picket: Self-supervised Data Diagnostics for ML Pipelines: uses self-supervision to learn an error detection model.
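To make the "detector outputs as features" idea concrete, here is a minimal sketch in plain Python. It is not Raha's code: the three detectors (`is_missing`, `is_non_numeric`, and the z-score outlier check) and the sample `ages` column are hypothetical choices made for illustration, and the clustering and active-learning steps are left out.

```python
import re
from statistics import mean, pstdev


def is_missing(value):
    """Flag empty or None cells."""
    return value is None or str(value).strip() == ""


def is_non_numeric(value):
    """Flag cells that do not parse as a plain decimal number."""
    return re.fullmatch(r"-?\d+(\.\d+)?", str(value).strip()) is None


def make_outlier_detector(column, z=2.0):
    """Build a detector that flags numeric cells far from the column mean."""
    nums = [float(v) for v in column if not is_non_numeric(v)]
    mu, sigma = mean(nums), pstdev(nums)

    def is_outlier(value):
        if is_non_numeric(value) or sigma == 0:
            return False
        return abs(float(value) - mu) > z * sigma

    return is_outlier


def detector_features(column):
    """Return one binary feature vector per cell, one entry per detector."""
    detectors = [is_missing, is_non_numeric, make_outlier_detector(column)]
    return [tuple(int(d(v)) for d in detectors) for v in column]


# Hypothetical column with a missing value, a typo, and an implausible age.
ages = ["34", "29", "", "31", "abc", "30", "400", "33", "28", "32"]
features = detector_features(ages)
for value, vector in zip(ages, features):
    print(repr(value), vector)
```

Each cell ends up with a small binary vector. A holistic detector could then group cells with similar vectors and ask a user to label one representative per group, which is roughly where clustering and active learning would come in.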

ML-Aware Data Cleaning

These data cleaning tools are meant to clean training datasets, and are co-designed with the trained model in mind.

  • ActiveClean VLDB 2016: leverages model convexity to treat cleaning as an active learning problem.
  • CPClean VLDB 2021: leverages the robustness of nearest-neighbor (NN) classifiers to local perturbations.
  • BoostClean and AlphaClean: model data cleaning pipeline generation as an optimization problem, given a "data quality" function.
  • Conformance Constraints SIGMOD 2021: learns constraints that should fail when inference over a test record may be untrustworthy.
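As a rough illustration of the last idea (constraints learned from training data that fail on records the model should not be trusted on), the sketch below learns a simple per-attribute range and reports which attributes of a test record fall outside it. This is a deliberate simplification and not the published algorithm; the helper names `learn_bounds` and `violations`, the `k` threshold, and the sample rows are all assumptions.

```python
from statistics import mean, pstdev


def learn_bounds(rows, k=3.0):
    """Learn a [mean - k*std, mean + k*std] range for each numeric attribute."""
    bounds = []
    for col in zip(*rows):
        mu, sigma = mean(col), pstdev(col)
        bounds.append((mu - k * sigma, mu + k * sigma))
    return bounds


def violations(bounds, record):
    """Return the indices of attributes whose value falls outside its range."""
    return [
        i
        for i, (value, (lo, hi)) in enumerate(zip(record, bounds))
        if not lo <= value <= hi
    ]


# Hypothetical training rows: (temperature, pressure).
train_rows = [(20, 1.0), (22, 1.1), (21, 0.9), (19, 1.0), (23, 1.2)]
bounds = learn_bounds(train_rows)

print(violations(bounds, (21, 1.0)))  # conforms to the training data
print(violations(bounds, (21, 5.0)))  # pressure is far outside the learned range
```

A record with a non-empty violation list is one where the model is being asked to extrapolate, so its prediction deserves less trust.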

Application-Aware Data Cleaning

These data cleaning tools are used to clean training datasets by using errors detected in the downstream application results. For instance, the application may use the model as part of an analytic query and visualize the result. If the user sees an anomaly in the visualization, she can submit the issue as a complaint.

This line of work is closely related to the area of query explanations (e.g., Wu2013, Roy2014, Abuzaid2019) in that it uses errors in downstream results for data debugging.
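The complaint scenario above can be sketched with a toy example: a 1-nearest-neighbor model feeds a count query, the user complains that the count is too high, and each training record is ranked by how much removing it would move the query result toward the expected value. This brute-force leave-one-out ranking is an illustrative assumption, not the method of any particular tool, and all names and data below are hypothetical.

```python
def predict_1nn(train, x):
    """Predict the label of the closest training record (1-nearest neighbor)."""
    nearest = min(train, key=lambda rec: abs(rec[0] - x))
    return nearest[1]


def count_positive(train, queries):
    """Downstream analytic query: how many query points are predicted positive?"""
    return sum(predict_1nn(train, x) for x in queries)


def rank_by_complaint(train, queries, expected):
    """Rank training records by how much removing each one fixes the complaint."""
    base_error = abs(count_positive(train, queries) - expected)
    scores = []
    for i in range(len(train)):
        reduced = train[:i] + train[i + 1:]
        error = abs(count_positive(reduced, queries) - expected)
        scores.append((base_error - error, i))
    return sorted(scores, reverse=True)


# Hypothetical training set of (feature, label); (3.0, 1) is mislabeled.
train = [(1.0, 0), (2.0, 0), (3.0, 1), (8.0, 1), (9.0, 1)]
queries = [2.6, 2.9, 3.2, 3.4, 8.4]

# The user sees a count of 5 in the visualization and complains it should be 1.
ranking = rank_by_complaint(train, queries, expected=1)
improvement, suspect = ranking[0]
print(count_positive(train, queries), train[suspect], improvement)
```

Retraining once per training record is only feasible on toy data; the point of the sketch is the interface, where a complaint about a query result is translated into a ranking over training records.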

Tools