This code accompanies the NIPS 2017 ML Systems Workshop paper/poster, "The Data Linter: Lightweight, Automated Sanity Checking for ML Data Sets."
The Data Linter identifies potential issues (lints) in your ML training data.
You'll need the following installed to use the Data Linter:
- Python
- Apache Beam
- TensorFlow
- Facets
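A quick way to confirm that the Python dependencies are importable is sketched below. The module names `apache_beam` and `tensorflow` are assumptions about how those packages are installed. Facets is left out because its import name depends on how you obtained it.

```python
import importlib.util


def missing_modules(module_names):
    """Return the names from module_names that cannot be found for import."""
    return [name for name in module_names
            if importlib.util.find_spec(name) is None]


# Assumed import names; adjust them to match your environment.
REQUIRED = ["apache_beam", "tensorflow"]

missing = missing_modules(REQUIRED)
if missing:
    print("Missing Python packages:", ", ".join(missing))
else:
    print("All checked packages are importable.")
```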
The easiest way to see how to use the Data Linter is to follow the demo
instructions found in demo/README.md.
Running the Data Linter requires the following steps:
- Encoding your data in TFRecord format.
- Generating summary statistics for those data, using Facets.
- Running the Data Linter.
- Using the Lint Explorer to view the lint results.
To see how to convert CSV files to the TFRecord format, look at the example code
in demo/convert_to_tfrecord.py.
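As a rough illustration of what such a conversion involves, here is a minimal sketch. It is not the demo script. The helper names (`parse_value`, `rows_to_typed_features`, `write_tfrecords`) are hypothetical, and the per-value type guessing is an assumption that real data with missing or mixed values would likely need to replace with an explicit per-column schema. The writer uses `tf.io.TFRecordWriter`; older TensorFlow releases expose it as `tf.python_io.TFRecordWriter`.

```python
import csv
import io


def parse_value(text):
    """Guess a feature type for one CSV cell: 'int64', 'float', or 'bytes'."""
    text = (text or "").strip()
    try:
        return "int64", int(text)
    except ValueError:
        pass
    try:
        return "float", float(text)
    except ValueError:
        return "bytes", text.encode("utf-8")


def rows_to_typed_features(csv_file):
    """Yield one {column: (kind, value)} dict per CSV row (header row required)."""
    for row in csv.DictReader(csv_file):
        yield {name: parse_value(raw) for name, raw in row.items()}


def write_tfrecords(typed_rows, output_path):
    """Serialize typed rows as tf.train.Example records (requires TensorFlow)."""
    import tensorflow as tf  # imported lazily so the parsing helpers run without it

    def to_feature(kind, value):
        if kind == "int64":
            return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))
        if kind == "float":
            return tf.train.Feature(float_list=tf.train.FloatList(value=[value]))
        return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

    with tf.io.TFRecordWriter(output_path) as writer:
        for row in typed_rows:
            features = {name: to_feature(kind, value)
                        for name, (kind, value) in row.items()}
            example = tf.train.Example(features=tf.train.Features(feature=features))
            writer.write(example.SerializeToString())


# Parse a tiny in-memory CSV; writing it out would additionally need TensorFlow.
sample = io.StringIO("age,score,job\n39,1.5,clerk\n")
typed_rows = list(rows_to_typed_features(sample))
print(typed_rows)
```

To write the parsed rows to disk, you would call something like `write_tfrecords(typed_rows, "/tmp/sample.tfrecords")`, where the output path is only an example.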
To see how to generate summary statistics for your data, see the example code in
demo/summarize_data.py.
Once you have both the data and summary statistics, you can run the Data Linter as follows:
python data_linter_main.py --dataset_path PATH_TO_TFRECORDS \
--stats_path PATH_TO_FACETS_SUMMARIES --results_path PATH_FOR_SAVING_RESULTS
For example, if you follow the instructions in the demo folder, you'll invoke the Data Linter like this:
python data_linter_main.py --dataset_path /tmp/adult.tfrecords \
--stats_path /tmp/adult_summary.bin \
--results_path /tmp/datalinter/results/lint_results.bin
After the Data Linter is done examining your data, you can view the results using this command:
python lint_explorer_main.py --results_path PATH_TO_RESULTS
For example:
python lint_explorer_main.py --results_path \
/tmp/datalinter/results/lint_results.bin
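If you prefer to drive both steps from Python, the two commands can be chained as sketched below. The helper names (`linter_command`, `explorer_command`, `lint_and_explore`) are hypothetical. The sketch assumes it runs from the directory containing `data_linter_main.py` and `lint_explorer_main.py`, and that the current interpreter (`sys.executable`) is the one with the dependencies installed. Only the flags shown in the commands above are used.

```python
import subprocess
import sys


def linter_command(dataset_path, stats_path, results_path):
    """Build the argument list for running the Data Linter."""
    return [sys.executable, "data_linter_main.py",
            "--dataset_path", dataset_path,
            "--stats_path", stats_path,
            "--results_path", results_path]


def explorer_command(results_path):
    """Build the argument list for viewing results with the Lint Explorer."""
    return [sys.executable, "lint_explorer_main.py",
            "--results_path", results_path]


def lint_and_explore(dataset_path, stats_path, results_path):
    """Run the linter, then the explorer; raise CalledProcessError on failure."""
    subprocess.run(linter_command(dataset_path, stats_path, results_path),
                   check=True)
    subprocess.run(explorer_command(results_path), check=True)


# Show the command that would be run for the demo paths, without running it.
print(" ".join(linter_command("/tmp/adult.tfrecords",
                              "/tmp/adult_summary.bin",
                              "/tmp/datalinter/results/lint_results.bin")))
```

Calling `lint_and_explore(...)` with the same three paths runs both steps in order and stops if the linter exits with a non-zero status.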
The code makes use of Google's protobuf format.
The protos are defined in protos/.
To make it easier to run the code, we include protobuf definitions from TensorFlow and Facets in this distribution.
This is not an official Google project. This project will not be supported or maintained, and we will not accept any pull requests.
The Data Linter was created by Nick Hynes (nhynes@berkeley.edu) during an internship at Google with Michael Terry (michaelterry@google.com).