
Noisy Imagenette/Woof Proposal #40

Closed · 3 of 4 tasks
tmabraham opened this issue Jan 26, 2021 · 24 comments

@tmabraham (Contributor) commented Jan 26, 2021:

Noisy Imagenette/Woof

Introduction

Most of the time, dataset labels are actually quite noisy, since the humans generating them are error-prone. This is especially the case for labels generated through crowdsourcing. Recently, there has been significant research on dealing with noisy labels in datasets and on training deep learning models that are robust to noise (ex: here, here, here, here, here, and here). It would be great to be able to implement some of these techniques in fastai (some work has already been done on this front, ex: here) and test them on a benchmarking dataset.

Proposal

I propose to add to this dataset/repository a corrupted version of Imagenette and ImageWoof in which the training labels are switched randomly at varying probabilities to simulate datasets with different levels of noise. This dataset is currently available here. The images themselves are the same as Imagenette, but the labels are instead provided in a CSV file. There are four noise levels: 1%, 5%, 25%, and 50%. The code for generating these labels is provided here. A baseline based on the original Imagenette baseline is provided as well.
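For concreteness, here is a rough pandas/numpy sketch of how symmetric label flipping of this kind can be generated. It is illustrative only, not the exact code in the linked repo, and the dataframe and column names are assumptions matching the CSV layout described above:

# Illustrative sketch of symmetric label flipping, not the exact generation code from the
# noisy_imagenette repo. Assumes a dataframe `df` with a clean 'label' column.
import numpy as np
import pandas as pd

def add_noisy_labels(df, pct, classes, seed=42):
    "Replace roughly `pct`% of labels with one drawn uniformly at random from `classes`."
    rng = np.random.default_rng(seed)
    noisy = df['label'].copy()
    flip = rng.random(len(df)) < pct / 100
    # Drawing uniformly over all classes means a flipped label can occasionally land back on
    # the original class, so the effective noise rate is slightly below `pct`%.
    # In practice only training rows (is_valid == False) would be flipped; validation stays clean.
    noisy[flip] = rng.choice(classes, flip.sum())
    df[f'noisy_labels_{pct}'] = noisy
    return df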

Note that in this imagenette repository there are currently 24 leaderboards (12 for Imagenette and 12 for ImageWoof). Since there are 4 noise levels, there are a total of 96 possible leaderboards. I have run my baseline for all 96 possibilities (done automatically on 4 TITAN RTXs with this bash script) and provided an extended leaderboard over here. I have also selected 16 leaderboards that I have kept in the README. It is up to the maintainers (Jeremy and Hamel) to decide which leaderboards they want to keep.

What needs to be done

These are the things I believe need to be done for this dataset to be added to Imagenette:

  • Add URLs for Noisy Imagenette/Woof (same as Imagenette but with the extra CSV file)
  • Select the leaderboards to include
  • Add the leaderboards to the repository README
  • Potentially add the label-generation code, benchmarks, and extended leaderboard in another folder in this repository? Alternatively, I could keep them hosted in my repository.

@jph00 and @hamelsmu Please review my repository and let me know if you think there are any additional tasks that I would need to do.

@hamelsmu (Member) commented:

As far as URLs are concerned, I think @jph00 can upload this to the appropriate location: https://github.com/tmabraham/noisy_imagenette/blob/main/noisy_imagenette.csv

Tanishq, do you want to start with a proposal on which leaderboards you think might be best to keep or highlight? You would probably know better than me at this point

@jph00 (Member) commented Jan 26, 2021 via email.

@tmabraham (Contributor, author) commented:

@hamelsmu The README has some leaderboards that we could use. I basically kept size 128, and tried both 20 epochs and 200 epochs for each noise level. My reasoning was that it was worth having a somewhat short training run (20 epochs) and a somewhat long training run (200 epochs) for each noise level. I kept size 128 because I assume any observations will transfer over to larger image sizes, and 128x128 images are easier/faster to work with. Of course, if you and Jeremy don't agree, I am willing to change the main leaderboards if necessary.

@jph00 I have already provided a training example that works with the CSV; the only thing that would need to change is the path once the CSV is added to the tgz and uploaded.
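For reference, a minimal sketch of how such a CSV-driven label column can be consumed with fastai is shown below. This is an illustration rather than the exact baseline script, and it assumes the CSV (with the 'path', 'is_valid', and 'noisy_labels_*' columns described above) sits in the dataset root:

# Minimal sketch, not the exact baseline training script. Assumes the noisy-label CSV has been
# added to the dataset archive (or downloaded separately into the dataset root).
from fastai.vision.all import *

path = untar_data(URLs.IMAGENETTE_160)
df = pd.read_csv(path/'noisy_imagenette.csv')

dls = ImageDataLoaders.from_df(
    df, path=path,
    fn_col='path', label_col='noisy_labels_5',   # train against the 5% noise column
    valid_col='is_valid', item_tfms=Resize(128))
learn = cnn_learner(dls, resnet18, metrics=accuracy)
learn.fit_one_cycle(20)  # e.g. the 20-epoch setting discussed above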

@KeremTurgutlu (Contributor) commented Jan 31, 2021:

Great work @tmabraham! There are usually two types of noise benchmarked in the literature: symmetric noise and asymmetric noise. These functions from the ELR repo might be helpful for the noise-generation process for both types.

What we have so far is symmetric noise, since we randomly flip labels with equal probability, and I believe this is a great start as long as we are transparent about it in the docs. On the other hand, if we would like to add asymmetric noise CSVs as well, then we could probably do something like the following (a rough sketch is given after this list):

  • Train a model or pick one of the top pretrained models for the desired dataset (ImageWoof or Imagenette).
  • Generate a confusion matrix from the validation set.
  • Use the upper triangle, normalize it, and use those values as probabilities for flipping a label to another class.
    or
  • Take a simpler approach, like here, and flip each label class to its most-confused counterpart.
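A rough sketch of the simpler variant (flipping each class to its most-confused counterpart) might look like this; it is an assumption of how it could be done, not code from the ELR repo, and `y_true`/`conf` are hypothetical inputs:

# Rough sketch of asymmetric (class-dependent) label noise; not code from the ELR repo.
# `y_true` are clean integer labels, `conf` is a confusion matrix from a trained model's
# validation-set predictions (rows = true class, columns = predicted class).
import numpy as np

def asymmetric_flip(y_true, conf, noise_level, seed=42):
    "Flip roughly `noise_level` of labels to each class's most-confused counterpart."
    rng = np.random.default_rng(seed)
    conf = conf.astype(float).copy()
    np.fill_diagonal(conf, 0)                # ignore correct predictions
    most_confused = conf.argmax(axis=1)      # for each class, the class it is most mistaken for
    y_noisy = np.asarray(y_true).copy()
    flip = rng.random(len(y_noisy)) < noise_level
    y_noisy[flip] = most_confused[y_noisy[flip]]
    return y_noisy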

@jph00 (Member) commented Jan 31, 2021 via email.

@KeremTurgutlu (Contributor) commented Jan 31, 2021:

Usually papers report results on both, for example:

We show that our framework results in
DNN models with superior generalization performance on CIFAR-10, CIFAR-100 & ImageNet and
outperforms all previous works under symmetric (uniform) and asymmetric noises.

from SELF.

Reviewers also care about real world noise.

We evaluate the proposed methodology on two standard benchmarks with simulated label noise,
CIFAR-10 and CIFAR-100 [18], and two real-world datasets, Clothing1M [47] and WebVision [24].
For CIFAR-10 and CIFAR-100 we simulate label noise by randomly flipping a certain fraction of
the labels in the training set following a symmetric uniform distribution (as in Eq. (1)), as well as
a more realistic asymmetric class-dependent distribution, following the scheme proposed in [31].
Clothing1M consists of 1 million training images collected from online shopping websites with labels
generated using surrounding text. Its noise level is estimated at 38.5% [36]. For ease of comparison
to previous works [17, 7], we consider the mini WebVision dataset which contains the top 50 classes
from the Google image subset of WebVision, which results in approximately 66 thousand images.
The noise level of WebVision is estimated at 20% [24].

from ELR

As a practitioner, I would care whether the benchmarks transfer to the real world or not. Since it's probably not possible to have real-world noise for Woof and Nette, we can probably go with symmetric and asymmetric for now. Otherwise, we could create mini versions of Clothing1M and WebVision.

@tmabraham (Contributor, author) commented Feb 1, 2021:

I think it's up to @jph00 to decide if we should include asymmetric noise. My only concern is that it adds another set of leaderboards and it might be too much. But if we decide to include it, I can make a version of Imagenette with asymmetric noise and add it to my noisy_imagenette repository.

@jph00 (Member) commented Feb 1, 2021 via email.

@KeremTurgutlu (Contributor) commented:

Once the leaderboard is official, I can work with @tmabraham to test out a few papers that we implemented as callbacks and include them as examples/initial results.

@hamelsmu (Member) commented Feb 4, 2021:

I'm starting to work on this now, sorry for the delay. I was a bit lost on some things, and @jph00 clarified what I needed to do. I'm going to start by familiarizing myself with the leaderboard and the code that uses it so I understand what is going on, and then move forward from there.

I am not available next week at all, so this may take me a bit. I'll post updates on this issue as I make progress.

@tmabraham (Contributor, author) commented:

@hamelsmu Thanks for the update! Let me know if you have any questions.

@hamelsmu (Member) commented Feb 5, 2021:

Just for my own background learning, can someone educate me why this is not allowed:

No inference time tricks, e.g. no: TTA, validation size > train size

Also, what is the last bit about validation size > train size? Would someone make the validation size greater than the training set size? Why would they do that?

@tmabraham (Contributor, author) commented Feb 5, 2021:

@hamelsmu I think this is referring to the image size in the validation set being larger than for the training set. I think there are some papers suggesting there can be an improvement in this case, but I don't remember the details now.

This is from the original Imagenette repo, so hopefully @jph00 can confirm.

@KeremTurgutlu (Contributor) commented:

I think you are referring to the FixRes paper; they argue that training at a lower image size than is used at test time improves results. In general, excluding all test-time tricks is meant to allow a fair comparison of different training approaches. Also, TTA has randomness in it, which adds extra noise to the comparison.

@hamelsmu (Member) commented Feb 6, 2021:

@tmabraham do you have a csv file for imagewoof as well? I only see one for imagenette. Thanks for your help.

@tmabraham (Contributor, author) replied (content not captured).

@hamelsmu (Member) commented Feb 6, 2021:

Sorry, I was ignoring your training script, which was my stupid mistake!

Also, I forgot Jeremy said to add only two new leaderboards:

10% and 50% Imagenette, and no ImageWoof

@tmabraham (Contributor, author) commented:

@hamelsmu Okay, thanks for letting me know. Do you need me to re-generate the CSV then? I don't have 10% in the CSV. Also, I would have to rerun the baselines. Are we sticking to size 128 and 20/200 epochs, giving a total of 4 leaderboards?

@hamelsmu (Member) commented Feb 6, 2021:

Notes To Reproduce My Work

Taking some notes in this issue because I will probably be asked to tweak this one day and I don't want to forget what I did.

Step 1: Download all the files and pre-process the dataset

See this repo for the code shown below in this step.

from fastai.basics import *
from fastai.vision.all import *
from fastai.callback.all import *
from fastai.distributed import *
from fastprogress import fastprogress
from torchvision.models import *
from fastai.vision.models.xresnet import *
from fastai.callback.mixup import *
from fastcore.script import *

_all_urls = [URLs.IMAGENETTE, URLs.IMAGENETTE_160, URLs.IMAGENETTE_320,
             URLs.IMAGEWOOF, URLs.IMAGEWOOF_160, URLs.IMAGEWOOF_320]
paths = parallel(untar_data, _all_urls, threadpool=True)


nurl = 'https://raw.githubusercontent.com/tmabraham/noisy_imagenette/main/noisy_imagenette.csv'
wurl = 'https://raw.githubusercontent.com/tmabraham/noisy_imagenette/main/noisy_imagewoof.csv'
npath = untar_data(URLs.IMAGENETTE_160)
wpath = untar_data(URLs.IMAGEWOOF_160)

wdf = pd.read_csv(wurl)
ndf = pd.read_csv(nurl)

def get_lbl(p, path=npath):
    "Recover the clean label (the parent directory name) for a relative image path."
    pth = (path/p)
    assert pth.exists()
    return pth.parent.name

# Add column for zero noise
ndf['noisy_labels_0'] = ndf.path.apply(get_lbl)
wdf['noisy_labels_0'] = wdf.path.apply(partial(get_lbl, path=wpath))

cols = ['path', 'noisy_labels_0', 'noisy_labels_1', 'noisy_labels_5', 'noisy_labels_25','noisy_labels_50', 'is_valid']
ndf = ndf[cols].set_index('path')
wdf = wdf[cols].set_index('path')

# save noise csvs into appropriate folders
for p in paths:
    if 'woof' in p.name: wdf.to_csv(p/'noisy_imagewoof.csv')
    else: ndf.to_csv(p/'noisy_imagenette.csv')

This code ends up putting the appropriate CSV file in each directory, as shown below.
@jph00 mentioned that each dataset should be a self-contained unit, so this duplication is fine.

This will be the directory structure of the new files. Note that we are not adding an ImageWoof leaderboard for noise, only Imagenette, but I am including the ImageWoof CSV files for good measure in case people want to have a go at it for their own entertainment.

├── imagenette2
│   ├── noisy_imagenette.csv
│   ├── train
│   └── val
├── imagenette2-160
│   ├── noisy_imagenette.csv
│   ├── train
│   └── val
├── imagenette2-320
│   ├── noisy_imagenette.csv
│   ├── train
│   └── val
├── imagewoof2
│   ├── noisy_imagewoof.csv
│   ├── train
│   └── val
├── imagewoof2-160
│   ├── noisy_imagewoof.csv
│   ├── train
│   └── val
└── imagewoof2-320
    ├── noisy_imagewoof.csv
    ├── train
    └── val

Step 2: I uploaded the files using the following bash script

for d in */ ; do
    tar czf "${d%?}".tgz $d
done

awscp ()
{
    aws s3 cp $1 s3://fast-ai-$2/ --grants read=uri=http://acs.amazonaws.com/groups/global/AllUsers
}

for d in *.tgz; do
    awscp $d "imageclas"
done
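As an optional sanity check (not part of the original steps), each re-packed archive can be checked for the noisy-label CSV before uploading; a small sketch, assuming the .tgz files are in the working directory:

# Optional sanity check (not in the original workflow): verify each archive contains the CSV.
import tarfile
from pathlib import Path

for tgz in sorted(Path('.').glob('image*2*.tgz')):
    with tarfile.open(tgz) as tf:
        names = tf.getnames()
    has_csv = any(n.endswith(('noisy_imagenette.csv', 'noisy_imagewoof.csv')) for n in names)
    print(tgz.name, 'OK' if has_csv else 'MISSING CSV')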

Here are the logs from this (this took about 20 minutes):

upload: ./imagenette2-160.tgz to s3://fast-ai-imageclas/imagenette2-160.tgz
upload: ./imagenette2-320.tgz to s3://fast-ai-imageclas/imagenette2-320.tgz
upload: ./imagenette2.tgz to s3://fast-ai-imageclas/imagenette2.tgz
upload: ./imagewoof2-160.tgz to s3://fast-ai-imageclas/imagewoof2-160.tgz
upload: ./imagewoof2-320.tgz to s3://fast-ai-imageclas/imagewoof2-320.tgz
upload: ./imagewoof2.tgz to s3://fast-ai-imageclas/imagewoof2.tgz

Step 3: I confirmed that the files are on S3

aws s3 ls s3://fast-ai-imageclas

Note that we are looking for the date 2021-02-05:

2018-10-08 15:52:32 1150585339 CUB_200_2011.tgz
2018-12-05 05:54:42 4579163978 bedroom.tgz
2018-10-26 22:11:43  131740031 caltech_101.tgz
2018-10-09 15:27:30  135107811 cifar10.tgz
2018-10-08 15:52:29  169168619 cifar100.tgz
2018-10-08 15:52:44 5686607260 food-101.tgz
2019-03-05 14:58:07   98752094 imagenette-160.tgz
2019-03-05 14:58:16  341289752 imagenette-320.tgz
2019-03-05 14:58:30 1556495367 imagenette.tgz
2021-02-05 21:46:51   99004276 imagenette2-160.tgz
2021-02-05 21:47:29  341662435 imagenette2-320.tgz
2021-02-05 21:49:19 1557156525 imagenette2.tgz
2019-12-10 18:04:12  191498213 imagewang-160.tgz
2019-12-10 18:04:18  669826647 imagewang-320.tgz
2019-12-10 18:04:25 2900347689 imagewang.tgz
2019-03-09 06:50:06   92375355 imagewoof-160.tgz
2019-03-09 06:50:11  328003750 imagewoof-320.tgz
2019-03-09 06:50:17 1343137256 imagewoof.tgz
2021-02-05 21:54:50   92611264 imagewoof2-160.tgz
2021-02-05 21:55:17  328386242 imagewoof2-320.tgz
2021-02-05 21:56:37 1343712960 imagewoof2.tgz
2018-10-14 11:26:34   15683414 mnist_png.tgz
2019-02-02 10:07:20     565372 mnist_var_size_tiny.tgz
2018-10-08 15:53:35  345236087 oxford-102-flowers.tgz
2018-10-08 15:53:40  811706944 oxford-iiit-pet.tgz
2018-10-08 15:54:10 1957803273 stanford-cars.tgz

Step 4: I made changes to the training script in fastai/fastai

based on @tmabraham's training script in this PR fastai/fastai#3210

Step 5: Add the 5% and 50% leaderboards

Started that in this PR: #41

Step 6: Update Hashes

You must update the hashes by running the code at the end of the 04_data.external notebook (screenshot omitted), and check in only fastai/data/checks.txt into the PR.
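For context, that cell essentially records the size and a checksum for each archive. The sketch below is only my reading of what it does (the exact checks.txt format and the md5-of-the-first-1MB detail are assumptions); the authoritative code is the cell in 04_data.external:

# Sketch only; the authoritative code is the cell at the end of the 04_data.external notebook.
# Assumption: checks.txt maps each dataset URL to [archive size, md5 of the first 1MB].
import hashlib, os
from pathlib import Path

def file_check(fname):
    "Size plus md5 of the first 1MB, mirroring what fastai records per archive (assumed)."
    with open(fname, 'rb') as f:
        h = hashlib.md5(f.read(2**20)).hexdigest()
    return [os.path.getsize(fname), h]

archive = Path.home()/'.fastai'/'archive'
for tgz in sorted(archive.glob('image*2*.tgz')):
    print(tgz.name, file_check(tgz))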

@hamelsmu (Member) commented Feb 6, 2021:

@tmabraham no, no need; I'm going with 5% instead of 10% (I talked to Jeremy about this). Not sure about the size and epochs part, but I will ask about that. Thanks for bringing that up.

@hamelsmu (Member) commented Feb 6, 2021:

@tmabraham I'm thinking I'll just use the same size and epochs as the non-noisy leaderboard. Let me know if you think this sounds like a bad idea for some reason.

@jph00 (Member) commented Feb 6, 2021:

@hamelsmu you'll need to update the hashes in fastai too. If you delete the imagenette tgzs from your .fastai/archive directory and your data directory, then call untar_data on imagenette, you'll get an error which tells you how to update the hashes. You'll need to do this for each of the versions of the dataset. Let me know if you need a hand. :)
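A minimal sketch of the procedure Jeremy describes (the exact error text is whatever fastai prints):

# After deleting the cached archives and extracted folders, re-running untar_data re-downloads
# each dataset and re-checks it against the stored hash, surfacing the mismatch described above.
from fastai.data.external import untar_data, URLs

for url in (URLs.IMAGENETTE, URLs.IMAGENETTE_160, URLs.IMAGENETTE_320,
            URLs.IMAGEWOOF, URLs.IMAGEWOOF_160, URLs.IMAGEWOOF_320):
    untar_data(url)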

@hamelsmu (Member) commented Feb 6, 2021:

You'll need to do this for each of the versions of the dataset. Let me know if you need a hand. :)

Done in fastai/fastai#3210

@hamelsmu (Member) commented Feb 6, 2021:

Ok I believe this is done. Thanks to @tmabraham for the extensive testing.

@hamelsmu closed this as completed on Feb 6, 2021.