Code for the paper "Automatic Feasibility Study via Data Quality Analysis for ML: A Case-Study on Label Noise"

This repository contains the code to reproduce the experimental results for the submission "Automatic Feasibility Study via Data Quality Analysis for ML: A Case-Study on Label Noise".

The following instructions show how the results can be obtained using Ubuntu 20.04.1, an Intel processor, CUDA Toolkit 10.1 and NVIDIA TITAN Xp GPUs.

Running inference

For all datasets, embeddings are applied to the raw images and texts.

Installation

Prepare the environment using Conda by running

conda env create -f environment.yml

using the supplied environment.yml file. The command will create an environment named snoopy. For more information on creating Conda environments, please refer to the Conda documentation.

Activate it with the command

conda activate snoopy

YELP data

The yelp folder should contain the yelp_train.csv and yelp_test.csv files obtained after preparing the YELP dataset. Due to space constraints, we did not upload them together with the code, but we can provide them upon request.

Running

Adjust the batch_size values in the embed.py file if needed and run the following scripts:

  • bash embed-cifar10.sh
  • bash embed-cifar100.sh
  • bash embed-mnist.sh
  • bash embed-imdb_reviews.sh
  • bash embed-sst2.sh
  • bash embed-yelp.sh

The embeddings and the original datasets will be saved in the cache folder, and the embedded datasets will be stored in the results folder as

<dataset name>-<embedding name>.npz

where <dataset name> corresponds to one of the

  • cifar10
  • cifar100
  • mnist
  • imdb_reviews
  • sst2
  • yelp

and <embedding name> corresponds to the shorter name given to each of the embeddings (the keys of the embeddings dictionary in embed.py). The first available GPU will be used for running inference.

Note: the cache and results folders may take up a lot of disk space after running the scripts.
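An embedded dataset can then be inspected with NumPy, for example. This is a minimal sketch; the keys inside the .npz archive depend on how embed.py stores the arrays, so they are listed rather than assumed.

import numpy as np

# Load one embedded dataset; file names follow <dataset name>-<embedding name>.npz.
# Replace the embedding name below with one of the keys of the embeddings
# dictionary in embed.py.
data = np.load("results/cifar10-<embedding name>.npz")

# The archive keys depend on embed.py, so list them together with their shapes.
for key in data.files:
    print(key, data[key].shape)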

Simulating convergence curves

Installation

Prepare the snoopy-cpu-analysis environment by running

conda env create -f environment-cpu-analysis.yml

using the supplied environment-cpu-analysis.yml file and activate it with

conda activate snoopy-cpu-analysis

Running

Run

python convergence.py <dataset name>

where <dataset name> is one of the datasets listed above.

This will produce convergence curve data stored in files

<dataset name>-<embedding name>-errs-cosine-0.0.npy

in the results folder.
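These .npy files can be loaded directly with NumPy. A minimal sketch, assuming each file holds an array of error values along the simulated convergence curve:

import numpy as np

# File names follow <dataset name>-<embedding name>-errs-cosine-0.0.npy.
errs = np.load("results/mnist-<embedding name>-errs-cosine-0.0.npy")

print(errs.shape)  # number of points on the simulated curve
print(errs.min())  # lowest error reached along the curve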

Computing errors (Bayes error estimates) for different methods and noise values

Installation

Prepare the snoopy-errors environment by running

conda env create -f environment-errors.yml

using the supplied environment-errors.yml file and activate it with

conda activate snoopy-errors

Running

Run

python errors.py

This will produce error data stored in files

<dataset name>-<embed name>-test.txt

in the results folder.

Each file contains a JSON dictionary where the key at the first level denotes the amount of label noise (one of "0.0", "0.1", "0.2", ..., "1.0") and the key at the second level denotes the method (one of "GHP Upper", "GHP Lower", "1-NN", "1-NN cosine", "1-NN LOO", "1-NN cosine LOO", where 1-NN and 1-NN LOO use the Euclidean distance). For results related to GHP, we only use the "GHP Lower" key. The values are error values; all values except for the GHP ones are not Bayes error estimates.
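A minimal sketch for reading one of these files, following the structure described above (label noise at the first level, method at the second); the file name below is a placeholder:

import json

with open("results/cifar10-<embed name>-test.txt") as f:
    errors = json.load(f)

# GHP lower bound without label noise (the key we use for GHP-related results).
print(errors["0.0"]["GHP Lower"])

# All methods at 20% label noise.
for method, value in errors["0.2"].items():
    print(method, value)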

Note: for the yelp dataset, a large amount of RAM is needed to compute the results related to the GHP method.

Provided results

The file errors_data.txt contains a JSON dictionary with the results for all datasets and embeddings: the key at the first level denotes the dataset, the key at the second level the embedding, and the key at the third level the split used (always test). Further levels correspond to the individual .txt files produced by running errors.py.

Please note that the file contains only the subset of the data that errors.py can generate, which is sufficient for the results presented in the paper.
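A minimal sketch for navigating the provided file, assuming the level order described above (dataset, embedding, split, then the per-file structure):

import json

with open("errors_data.txt") as f:
    all_results = json.load(f)

# Available datasets and, for one of them, the available embeddings.
print(list(all_results.keys()))
print(list(all_results["cifar10"].keys()))

# GHP lower bound on the test split without label noise, for the first
# embedding that is present (the embedding keys depend on errors.py).
embedding = next(iter(all_results["cifar10"]))
print(all_results["cifar10"][embedding]["test"]["0.0"]["GHP Lower"])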

LR model

Installation

Activate the already installed snoopy-cpu-analysis environment

conda activate snoopy-cpu-analysis

Running

Run

python lr.py results lr.txt yes

This will run LR on top of each embedding for all datasets, for different amounts of label noise and hyper-parameters. Each experiment is repeated 5 times. The resulting file lr.txt will contain a JSON dictionary where the key at the first level denotes the dataset, the key at the second level the embedding, the key at the third level the amount of label noise, the key at the fifth level the L2 regularization and SGD learning rate parameters, and the keys at the last level the achieved error rates and runtimes.
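Given the nesting depth, a small recursive walk is a convenient way to inspect lr.txt. A minimal sketch that prints every leaf entry together with its path of keys, without assuming the exact number of levels:

import json

def walk(node, path=()):
    # Recursively descend the nested dictionary and print every leaf,
    # i.e. the achieved error rates and runtimes, with its path of keys.
    if isinstance(node, dict):
        for key, value in node.items():
            walk(value, path + (key,))
    else:
        print(" / ".join(map(str, path)), "->", node)

with open("lr.txt") as f:
    walk(json.load(f))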

Fine-tuning

Installation

Activate the already installed snoopy environment

conda activate snoopy

Running

Run

python finetune/export_text_file_to_tfrecords.py

and then

bash finetune/run_all.sh

For the default hyper-parameters (the best values for both tasks without noise), run the scripts without any parameters. To change the hyper-parameters when running the grid search for the image task, set the argument --lr to a value in [0.0001, 0.0003, 0.001, 0.003, 0.01, 0.03, 0.1] and the argument --reg to a value in [0.000001, 0.000003, 0.00001, 0.00003, 0.0001, 0.0003, 0.001].
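As an illustration, the grid search could be driven from Python as sketched below; this assumes that finetune/run_all.sh forwards the --lr and --reg arguments to the training scripts, which should be verified against the actual script:

import itertools
import subprocess

# Hyper-parameter grid for the image task, as listed above.
lrs = [0.0001, 0.0003, 0.001, 0.003, 0.01, 0.03, 0.1]
regs = [0.000001, 0.000003, 0.00001, 0.00003, 0.0001, 0.0003, 0.001]

for lr, reg in itertools.product(lrs, regs):
    # Assumes run_all.sh passes --lr and --reg through to the underlying scripts.
    subprocess.run(
        ["bash", "finetune/run_all.sh", "--lr", str(lr), "--reg", str(reg)],
        check=True,
    )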

Collecting the results

Run

python finetune/collect_results.py

AutoKeras

Installation

Activate the already installed snoopy environment

conda activate snoopy

Running

Run

bash autokeras/run_all.sh

Collecting the results

Run

python autokeras/collect_results.py

Plots

Installation

Activate the already installed snoopy-cpu-analysis environment

conda activate snoopy-cpu-analysis

Running

Run

python plots.py <convergence curve path>

where <convergence curve path> should be a folder with the subfolders

  • cifar10
  • cifar100
  • mnist
  • imdb_reviews
  • sst2
  • yelp

and each of the subfolders should contain the convergence curve files from Simulating convergence curves. The program will in any case use the provided results from the errors_data.txt file.
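One possible way to arrange such a folder, assuming the convergence curve files were written to the results folder with the naming scheme from Simulating convergence curves (a sketch, not part of the provided scripts; the target folder name is arbitrary):

import pathlib
import shutil

datasets = ["cifar10", "cifar100", "mnist", "imdb_reviews", "sst2", "yelp"]
results = pathlib.Path("results")
target = pathlib.Path("convergence-curves")  # pass this path to plots.py

for dataset in datasets:
    subfolder = target / dataset
    subfolder.mkdir(parents=True, exist_ok=True)
    for curve_file in results.glob(f"{dataset}-*-errs-cosine-0.0.npy"):
        # Guard against the cifar10 pattern also matching cifar100 files.
        if curve_file.name.split("-", 1)[0] != dataset:
            continue
        shutil.copy(curve_file, subfolder / curve_file.name)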

End-to-End Use-Case and Evaluation of BER Estimations

We provide a notebook with the code required to reproduce the experimental results both for the evaluation of BER estimations against the LR, Fine-Tune, and AutoKeras baselines, and for the end-to-end use-case simulation, in the file

end2end_ber_evaluation.ipynb

VTAB-1K Results on Public Embeddings

The results for all 19 VTAB-1K datasets and 235 public embeddings from Hugging Face are available in the vtab-results directory.
