Code for the paper "Automatic Feasibility Study via Data Quality Analysis for ML: A Case-Study on Label Noise"

This repository contains the code to reproduce the experimental results for the submission "Automatic Feasibility Study via Data Quality Analysis for ML: A Case-Study on Label Noise".

The following instructions show how the results can be obtained using Ubuntu 20.04.1, an Intel processor, CUDA Toolkit 10.1 and NVIDIA TITAN Xp GPUs.

Running inference

For all datasets, embeddings are applied to the raw images and texts.

Installation

Prepare the environment using Conda by running

conda env create -f environment.yml

using the supplied environment.yml file. The command will create an environment named snoopy. For more information on creating Conda environments, please refer to the Conda documentation.

Activate it with the command

conda activate snoopy

YELP data

The yelp folder should contain the yelp_train.csv and yelp_test.csv files obtained after preparing the YELP dataset. Due to space constraints, we did not upload them together with the code, but we can provide them upon request.

Running

Adjust the batch_size values in the embed.py file if needed and run the following scripts:

  • bash embed-cifar10.sh
  • bash embed-cifar100.sh
  • bash embed-mnist.sh
  • bash embed-imdb_reviews.sh
  • bash embed-sst2.sh
  • bash embed-yelp.sh

The embeddings and the original datasets will be saved in the cache folder, and the embedded datasets will be stored in the results folder as

<dataset name>-<embedding name>.npz

where <dataset name> corresponds to one of the

  • cifar10
  • cifar100
  • mnist
  • imdb_reviews
  • sst2
  • yelp

and <embedding name> corresponds to the shorter name given to each of the embeddings (the keys of the embeddings dictionary in embed.py). The first available GPU will be used for running inference.

Note: the cache and results folders may take up a lot of disk space after running the scripts.
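An embedded dataset can then be inspected with NumPy, for example. This is a minimal sketch; the keys inside the .npz archive depend on how embed.py stores the arrays, so they are listed rather than assumed.

import numpy as np

# Load one embedded dataset; file names follow <dataset name>-<embedding name>.npz.
# Replace the embedding name below with one of the keys of the embeddings
# dictionary in embed.py.
data = np.load("results/cifar10-<embedding name>.npz")

# The archive keys depend on embed.py, so list them together with their shapes.
for key in data.files:
    print(key, data[key].shape)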

Simulating convergence curves

Installation

Prepare the snoopy-cpu-analysis environment by running

conda env create -f environment-cpu-analysis.yml

using the supplied environment-cpu-analysis.yml file and activate it with

conda activate snoopy-cpu-analysis

Running

Run

python convergence.py <dataset name>

where <dataset name> is one of the datasets listed above.

This will produce convergence curve data stored in files

<dataset name>-<embedding name>-errs-cosine-0.0.npy

in the results folder.
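These .npy files can be loaded directly with NumPy. A minimal sketch, assuming each file holds an array of error values along the simulated convergence curve:

import numpy as np

# File names follow <dataset name>-<embedding name>-errs-cosine-0.0.npy.
errs = np.load("results/mnist-<embedding name>-errs-cosine-0.0.npy")

print(errs.shape)  # number of points on the simulated curve
print(errs.min())  # lowest error reached along the curve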

Computing errors (Bayes error estimates) for different methods and noise values

Installation

Prepare the snoopy-errors environment by running

conda env create -f environment-errors.yml

using the supplied environment-errors.yml file and activate it with

conda activate snoopy-errors

Running

Run

python errors.py

This will produce error data stored in files

<dataset name>-<embed name>-test.txt

in the results folder.

Each file contains a JSON dictionary where the key at the first level denotes the amount of label noise (one of "0.0", "0.1", "0.2", ..., "1.0") and the key at the second level denotes the method (one of "GHP Upper", "GHP Lower", "1-NN", "1-NN cosine", "1-NN LOO", "1-NN cosine LOO", where 1-NN and 1-NN LOO use the Euclidean distance). For results related to GHP, we only use the "GHP Lower" key. The values are error values; all values except for the GHP ones are not Bayes error estimates.
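A minimal sketch for reading one of these files, following the structure described above (label noise at the first level, method at the second); the file name below is a placeholder:

import json

with open("results/cifar10-<embed name>-test.txt") as f:
    errors = json.load(f)

# GHP lower bound without label noise (the key we use for GHP-related results).
print(errors["0.0"]["GHP Lower"])

# All methods at 20% label noise.
for method, value in errors["0.2"].items():
    print(method, value)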

Note: for the yelp dataset, a large amount of RAM is needed to compute the results related to the GHP method.

Provided results

The file errors_data.txt contains a JSON dictionary with the results for all datasets and embeddings: the key at the first level denotes the dataset, the key at the second level the embedding, and the key at the third level the split used (always test). Further levels correspond to the individual .txt files produced by running errors.py.

Please note that the file contains only the subset of the data that errors.py can generate, which is sufficient for the results presented in the paper.
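A minimal sketch for navigating the provided file, assuming the level order described above (dataset, embedding, split, then the per-file structure):

import json

with open("errors_data.txt") as f:
    all_results = json.load(f)

# Available datasets and, for one of them, the available embeddings.
print(list(all_results.keys()))
print(list(all_results["cifar10"].keys()))

# GHP lower bound on the test split without label noise, for the first
# embedding that is present (the embedding keys depend on errors.py).
embedding = next(iter(all_results["cifar10"]))
print(all_results["cifar10"][embedding]["test"]["0.0"]["GHP Lower"])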

LR model

Installation

Activate the already installed snoopy-cpu-analysis environment

conda activate snoopy-cpu-analysis

Running

Run

python lr.py results lr.txt yes

This will run LR on top of each embedding for all datasets, for different amounts of label noise and hyper-parameters. Each experiment is repeated 5 times. The resulting file lr.txt will contain a JSON dictionary where the key at the first level denotes the dataset, the key at the second level the embedding, the key at the third level the amount of label noise, the key at the fifth level the L2 regularization and SGD learning rate parameters, and the keys at the last level the achieved error rates and runtimes.
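Given the nesting depth, a small recursive walk is a convenient way to inspect lr.txt. A minimal sketch that prints every leaf entry together with its path of keys, without assuming the exact number of levels:

import json

def walk(node, path=()):
    # Recursively descend the nested dictionary and print every leaf,
    # i.e. the achieved error rates and runtimes, with its path of keys.
    if isinstance(node, dict):
        for key, value in node.items():
            walk(value, path + (key,))
    else:
        print(" / ".join(map(str, path)), "->", node)

with open("lr.txt") as f:
    walk(json.load(f))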

Fine-tuning

Installation

Activate the already installed snoopy environment

conda activate snoopy

Running

Run

python finetune/export_text_file_to_tfrecords.py

and then

bash finetune/run_all.sh

For the default hyper-parameters (the best values for both tasks without noise), run the scripts without any parameters. To change the hyper-parameters when running the grid search for the image task, set the argument --lr to a value in [0.0001, 0.0003, 0.001, 0.003, 0.01, 0.03, 0.1] and the argument --reg to a value in [0.000001, 0.000003, 0.00001, 0.00003, 0.0001, 0.0003, 0.001].
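As an illustration, the grid search could be driven from Python as sketched below; this assumes that finetune/run_all.sh forwards the --lr and --reg arguments to the training scripts, which should be verified against the actual script:

import itertools
import subprocess

# Hyper-parameter grid for the image task, as listed above.
lrs = [0.0001, 0.0003, 0.001, 0.003, 0.01, 0.03, 0.1]
regs = [0.000001, 0.000003, 0.00001, 0.00003, 0.0001, 0.0003, 0.001]

for lr, reg in itertools.product(lrs, regs):
    # Assumes run_all.sh passes --lr and --reg through to the underlying scripts.
    subprocess.run(
        ["bash", "finetune/run_all.sh", "--lr", str(lr), "--reg", str(reg)],
        check=True,
    )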

Collecting the results

Run

python finetune/collect_results.py

AutoKeras

Installation

Activate the already installed snoopy environment

conda activate snoopy

Running

Run

bash autokeras/run_all.sh

Collecting the results

Run

python autokeras/collect_results.py

Plots

Installation

Activate the already installed snoopy-cpu-analysis environment

conda activate snoopy-cpu-analysis

Running

Run

python plots.py <convergence curve path>

where <convergence curve path> should be a folder with the subfolders

  • cifar10
  • cifar100
  • mnist
  • imdb_reviews
  • sst2
  • yelp

and each of the subfolders should contain the convergence curve files from Simulating convergence curves. The program will in any case use the provided results from the errors_data.txt file.
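One possible way to arrange such a folder, assuming the convergence curve files were written to the results folder with the naming scheme from Simulating convergence curves (a sketch, not part of the provided scripts; the target folder name is arbitrary):

import pathlib
import shutil

datasets = ["cifar10", "cifar100", "mnist", "imdb_reviews", "sst2", "yelp"]
results = pathlib.Path("results")
target = pathlib.Path("convergence-curves")  # pass this path to plots.py

for dataset in datasets:
    subfolder = target / dataset
    subfolder.mkdir(parents=True, exist_ok=True)
    for curve_file in results.glob(f"{dataset}-*-errs-cosine-0.0.npy"):
        # Guard against the cifar10 pattern also matching cifar100 files.
        if curve_file.name.split("-", 1)[0] != dataset:
            continue
        shutil.copy(curve_file, subfolder / curve_file.name)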

End-to-End Use-Case and Evaluation of BER Estimations

We provide a notebook with the code required to reproduce the experimental results both for the evaluation of BER estimations against the LR, Fine-Tune, and AutoKeras baselines, and for the end-to-end use-case simulation, in the file

end2end_ber_evaluation.ipynb

VTAB-1K Results on Public Embeddings

The results for all 19 VTAB-1K datasets and 235 public embeddings from Hugging Face are available in the vtab-results directory.
