
Noisy Imagenette/Woof Proposal #40

Closed · 3 of 4 tasks
tmabraham opened this issue Jan 26, 2021 · 24 comments

@tmabraham (Contributor) commented Jan 26, 2021:

Noisy Imagenette/Woof

Introduction

Most of the time, dataset labels are actually quite noisy, since the humans generating them are error-prone. This is especially the case for labels generated through crowdsourcing. Recently, there has been significant research on dealing with noisy labels in datasets and on training deep learning models that are robust to noise (ex: here, here, here, here, here, and here). It would be great to be able to implement some of these techniques in fastai (some work has already been done on this front, ex: here) and test them on a benchmarking dataset.

Proposal

I propose to add to this dataset/repository a corrupted version of Imagenette and ImageWoof in which the training labels are switched randomly at varying probabilities to simulate datasets with different levels of noise. This dataset is currently available here. The images themselves are the same as Imagenette, but the labels are instead provided in a CSV file. There are four noise levels: 1%, 5%, 25%, and 50%. The code for generating these labels is provided here. A baseline based on the original Imagenette baseline is provided as well.
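For concreteness, here is a rough pandas/numpy sketch of how symmetric label flipping of this kind can be generated. It is illustrative only, not the exact code in the linked repo, and the dataframe and column names are assumptions matching the CSV layout described above:

# Illustrative sketch of symmetric label flipping, not the exact generation code from the
# noisy_imagenette repo. Assumes a dataframe `df` with a clean 'label' column.
import numpy as np
import pandas as pd

def add_noisy_labels(df, pct, classes, seed=42):
    "Replace roughly `pct`% of labels with one drawn uniformly at random from `classes`."
    rng = np.random.default_rng(seed)
    noisy = df['label'].copy()
    flip = rng.random(len(df)) < pct / 100
    # Drawing uniformly over all classes means a flipped label can occasionally land back on
    # the original class, so the effective noise rate is slightly below `pct`%.
    # In practice only training rows (is_valid == False) would be flipped; validation stays clean.
    noisy[flip] = rng.choice(classes, flip.sum())
    df[f'noisy_labels_{pct}'] = noisy
    return df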

Note that in this imagenette repository there are currently 24 leaderboards (12 for Imagenette and 12 for ImageWoof). Since there are 4 noise levels, there are a total of 96 possible leaderboards. I have run my baseline for all 96 possibilities (done automatically on 4 TITAN RTXs with this bash script) and provided an extended leaderboard over here. I have also selected 16 leaderboards that I have kept in the README. It is up to the maintainers (Jeremy and Hamel) to decide which leaderboards they want to keep.

What needs to be done

These are the things I believe need to be done for this dataset to be added to Imagenette:

  • Add URLs for Noisy Imagenette/Woof (same as Imagenette but with the extra CSV file)
  • Select the leaderboards to include
  • Add the leaderboards to the repository README
  • Potentially add the label-generation code, benchmarks, and extended leaderboard in another folder in this repository? Alternatively, I could keep them hosted in my repository.

@jph00 and @hamelsmu Please review my repository and let me know if you think there are any additional tasks that I would need to do.

@hamelsmu (Member) commented:

As far as URLs are concerned, I think @jph00 can upload this to the appropriate location: https://github.com/tmabraham/noisy_imagenette/blob/main/noisy_imagenette.csv

Tanishq, do you want to start with a proposal on which leaderboards you think might be best to keep or highlight? You would probably know better than me at this point

@jph00 (Member) commented Jan 26, 2021 via email.

@tmabraham (Contributor, author) commented:

@hamelsmu The README has some leaderboards that we could use. I basically kept size 128, and tried both 20 epochs and 200 epochs for each noise level. My reasoning was that it was worth having a somewhat short training run (20 epochs) and a somewhat long training run (200 epochs) for each noise level. I kept size 128 because I assume any observations will transfer over to larger image sizes, and 128x128 images are easier/faster to work with. Of course, if you and Jeremy don't agree, I am willing to change the main leaderboards if necessary.

@jph00 I have already provided a training example that works with the CSV; the only thing that would need to change is the path once the CSV is added to the tgz and uploaded.
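For reference, a minimal sketch of how such a CSV-driven label column can be consumed with fastai is shown below. This is an illustration rather than the exact baseline script, and it assumes the CSV (with the 'path', 'is_valid', and 'noisy_labels_*' columns described above) sits in the dataset root:

# Minimal sketch, not the exact baseline training script. Assumes the noisy-label CSV has been
# added to the dataset archive (or downloaded separately into the dataset root).
from fastai.vision.all import *

path = untar_data(URLs.IMAGENETTE_160)
df = pd.read_csv(path/'noisy_imagenette.csv')

dls = ImageDataLoaders.from_df(
    df, path=path,
    fn_col='path', label_col='noisy_labels_5',   # train against the 5% noise column
    valid_col='is_valid', item_tfms=Resize(128))
learn = cnn_learner(dls, resnet18, metrics=accuracy)
learn.fit_one_cycle(20)  # e.g. the 20-epoch setting discussed above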

@KeremTurgutlu (Contributor) commented Jan 31, 2021:

Great work @tmabraham! There are usually two types of noise benchmarked in the literature: symmetric noise and asymmetric noise. These functions from the ELR repo might be helpful for the noise-generation process for both types.

What we have so far is symmetric noise, since we randomly flip labels with equal probability, and I believe this is a great start as long as we are transparent about it in the docs. On the other hand, if we would like to add asymmetric noise CSVs as well, then we could probably do something like the following (a rough sketch is given after this list):

  • Train a model or pick one of the top pretrained models for the desired dataset (ImageWoof or Imagenette).
  • Generate a confusion matrix from the validation set.
  • Use the upper triangle, normalize it, and use those values as probabilities for flipping a label to another class.
    or
  • Take a simpler approach, like here, and flip each label class to its most-confused counterpart.
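A rough sketch of the simpler variant (flipping each class to its most-confused counterpart) might look like this; it is an assumption of how it could be done, not code from the ELR repo, and `y_true`/`conf` are hypothetical inputs:

# Rough sketch of asymmetric (class-dependent) label noise; not code from the ELR repo.
# `y_true` are clean integer labels, `conf` is a confusion matrix from a trained model's
# validation-set predictions (rows = true class, columns = predicted class).
import numpy as np

def asymmetric_flip(y_true, conf, noise_level, seed=42):
    "Flip roughly `noise_level` of labels to each class's most-confused counterpart."
    rng = np.random.default_rng(seed)
    conf = conf.astype(float).copy()
    np.fill_diagonal(conf, 0)                # ignore correct predictions
    most_confused = conf.argmax(axis=1)      # for each class, the class it is most mistaken for
    y_noisy = np.asarray(y_true).copy()
    flip = rng.random(len(y_noisy)) < noise_level
    y_noisy[flip] = most_confused[y_noisy[flip]]
    return y_noisy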

@jph00 (Member) commented Jan 31, 2021 via email.

@KeremTurgutlu (Contributor) commented Jan 31, 2021:

Usually papers report results on both, for example:

We show that our framework results in
DNN models with superior generalization performance on CIFAR-10, CIFAR-100 & ImageNet and
outperforms all previous works under symmetric (uniform) and asymmetric noises.

from SELF.

Reviewers also care about real world noise.

We evaluate the proposed methodology on two standard benchmarks with simulated label noise,
CIFAR-10 and CIFAR-100 [18], and two real-world datasets, Clothing1M [47] and WebVision [24].
For CIFAR-10 and CIFAR-100 we simulate label noise by randomly flipping a certain fraction of
the labels in the training set following a symmetric uniform distribution (as in Eq. (1)), as well as
a more realistic asymmetric class-dependent distribution, following the scheme proposed in [31].
Clothing1M consists of 1 million training images collected from online shopping websites with labels
generated using surrounding text. Its noise level is estimated at 38.5% [36]. For ease of comparison
to previous works [17, 7], we consider the mini WebVision dataset which contains the top 50 classes
from the Google image subset of WebVision, which results in approximately 66 thousand images.
The noise level of WebVision is estimated at 20% [24].

from ELR

As a practitioner, I would care whether the benchmarks transfer to the real world or not. Since it's probably not possible to have real-world noise for Woof and Nette, we can probably go with symmetric and asymmetric for now. Otherwise, we could create mini versions of Clothing1M and WebVision.

@tmabraham (Contributor, author) commented Feb 1, 2021:

I think it's up to @jph00 to decide if we should include asymmetric noise. My only concern is that it adds another set of leaderboards and it might be too much. But if we decide to include it, I can make a version of Imagenette with asymmetric noise and add it to my noisy_imagenette repository.

@jph00 (Member) commented Feb 1, 2021 via email.

@KeremTurgutlu (Contributor) commented:

Once the leaderboard is official, I can work with @tmabraham to test out a few papers that we implemented as callbacks and include them as examples/initial results.

@hamelsmu (Member) commented Feb 4, 2021:

I'm starting to work on this now, sorry for the delay. I was a bit lost on some things, and @jph00 clarified what I needed to do. I'm going to start by familiarizing myself with the leaderboard and the code that uses it so I understand what is going on, and then move forward from there.

I am not available next week at all, so this may take me a bit. I'll post updates on this issue as I make progress.

@tmabraham (Contributor, author) commented:

@hamelsmu Thanks for the update! Let me know if you have any questions.

@hamelsmu (Member) commented Feb 5, 2021:

Just for my own background learning, can someone educate me why this is not allowed:

No inference time tricks, e.g. no: TTA, validation size > train size

Also, what is the last bit about validation size > train size? Would someone make the validation size greater than the training set size? Why would they do that?

@tmabraham (Contributor, author) commented Feb 5, 2021:

@hamelsmu I think this is referring to the image size in the validation set being larger than for the training set. I think there are some papers suggesting there can be an improvement in this case, but I don't remember the details now.

This is from the original Imagenette repo, so hopefully @jph00 can confirm.

@KeremTurgutlu (Contributor) commented:

I think you are referring to the FixRes paper; they argue that training at a lower image size than is used at test time improves results. In general, excluding all test-time tricks is meant to allow a fair comparison of different training approaches. Also, TTA has randomness in it, which adds extra noise to the comparison.

@hamelsmu (Member) commented Feb 6, 2021:

@tmabraham do you have a csv file for imagewoof as well? I only see one for imagenette. Thanks for your help.

@tmabraham (Contributor, author) replied (content not captured).

@hamelsmu (Member) commented Feb 6, 2021:

Sorry, I was ignoring your training script, which was my stupid mistake!

Also, I forgot Jeremy said to add only two new leaderboards:

10% and 50% Imagenette, and no ImageWoof

@tmabraham (Contributor, author) commented:

@hamelsmu Okay, thanks for letting me know. Do you need me to re-generate the CSV then? I don't have 10% in the CSV. Also, I would have to rerun the baselines. Are we sticking to size 128 and 20/200 epochs, giving a total of 4 leaderboards?

@hamelsmu (Member) commented Feb 6, 2021:

Notes To Reproduce My Work

Taking some notes in this issue because I will probably be asked to tweak this one day and I don't want to forget what I did.

Step 1: Download all the files and pre-process the dataset

See this repo for the code shown below in this step.

from fastai.basics import *
from fastai.vision.all import *
from fastai.callback.all import *
from fastai.distributed import *
from fastprogress import fastprogress
from torchvision.models import *
from fastai.vision.models.xresnet import *
from fastai.callback.mixup import *
from fastcore.script import *

_all_urls = [URLs.IMAGENETTE, URLs.IMAGENETTE_160, URLs.IMAGENETTE_320,
             URLs.IMAGEWOOF, URLs.IMAGEWOOF_160, URLs.IMAGEWOOF_320]
paths = parallel(untar_data, _all_urls, threadpool=True)


nurl = 'https://raw.githubusercontent.com/tmabraham/noisy_imagenette/main/noisy_imagenette.csv'
wurl = 'https://raw.githubusercontent.com/tmabraham/noisy_imagenette/main/noisy_imagewoof.csv'
npath = untar_data(URLs.IMAGENETTE_160)
wpath = untar_data(URLs.IMAGEWOOF_160)

wdf = pd.read_csv(wurl)
ndf = pd.read_csv(nurl)

def get_lbl(p, path=npath):
    "Recover the clean label (the parent directory name) for a relative image path."
    pth = (path/p)
    assert pth.exists()
    return pth.parent.name

# Add column for zero noise
ndf['noisy_labels_0'] = ndf.path.apply(get_lbl)
wdf['noisy_labels_0'] = wdf.path.apply(partial(get_lbl, path=wpath))

cols = ['path', 'noisy_labels_0', 'noisy_labels_1', 'noisy_labels_5', 'noisy_labels_25','noisy_labels_50', 'is_valid']
ndf = ndf[cols].set_index('path')
wdf = wdf[cols].set_index('path')

# save noise csvs into appropriate folders
for p in paths:
    if 'woof' in p.name: wdf.to_csv(p/'noisy_imagewoof.csv')
    else: ndf.to_csv(p/'noisy_imagenette.csv')

This code ends up putting the appropriate CSV file in each directory, as shown below.
@jph00 mentioned that each dataset should be a self-contained unit, so this duplication is fine.

This will be the directory structure of the new files. Note that we are not adding an ImageWoof leaderboard for noise, only Imagenette, but I am including the ImageWoof CSV files for good measure in case people want to have a go at it for their own entertainment.

├── imagenette2
│   ├── noisy_imagenette.csv
│   ├── train
│   └── val
├── imagenette2-160
│   ├── noisy_imagenette.csv
│   ├── train
│   └── val
├── imagenette2-320
│   ├── noisy_imagenette.csv
│   ├── train
│   └── val
├── imagewoof2
│   ├── noisy_imagewoof.csv
│   ├── train
│   └── val
├── imagewoof2-160
│   ├── noisy_imagewoof.csv
│   ├── train
│   └── val
└── imagewoof2-320
    ├── noisy_imagewoof.csv
    ├── train
    └── val

Step 2: I uploaded the files using the following bash script

for d in */ ; do
    tar czf "${d%?}".tgz $d
done

awscp ()
{
    aws s3 cp $1 s3://fast-ai-$2/ --grants read=uri=http://acs.amazonaws.com/groups/global/AllUsers
}

for d in *.tgz; do
    awscp $d "imageclas"
done
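As an optional sanity check (not part of the original steps), each re-packed archive can be checked for the noisy-label CSV before uploading; a small sketch, assuming the .tgz files are in the working directory:

# Optional sanity check (not in the original workflow): verify each archive contains the CSV.
import tarfile
from pathlib import Path

for tgz in sorted(Path('.').glob('image*2*.tgz')):
    with tarfile.open(tgz) as tf:
        names = tf.getnames()
    has_csv = any(n.endswith(('noisy_imagenette.csv', 'noisy_imagewoof.csv')) for n in names)
    print(tgz.name, 'OK' if has_csv else 'MISSING CSV')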

Here are the logs from this (this took about 20 minutes):

upload: ./imagenette2-160.tgz to s3://fast-ai-imageclas/imagenette2-160.tgz
upload: ./imagenette2-320.tgz to s3://fast-ai-imageclas/imagenette2-320.tgz
upload: ./imagenette2.tgz to s3://fast-ai-imageclas/imagenette2.tgz
upload: ./imagewoof2-160.tgz to s3://fast-ai-imageclas/imagewoof2-160.tgz
upload: ./imagewoof2-320.tgz to s3://fast-ai-imageclas/imagewoof2-320.tgz
upload: ./imagewoof2.tgz to s3://fast-ai-imageclas/imagewoof2.tgz

Step 3: I confirmed that the files are on S3

aws s3 ls s3://fast-ai-imageclas

Note that we are looking for the date 2021-02-05:

2018-10-08 15:52:32 1150585339 CUB_200_2011.tgz
2018-12-05 05:54:42 4579163978 bedroom.tgz
2018-10-26 22:11:43  131740031 caltech_101.tgz
2018-10-09 15:27:30  135107811 cifar10.tgz
2018-10-08 15:52:29  169168619 cifar100.tgz
2018-10-08 15:52:44 5686607260 food-101.tgz
2019-03-05 14:58:07   98752094 imagenette-160.tgz
2019-03-05 14:58:16  341289752 imagenette-320.tgz
2019-03-05 14:58:30 1556495367 imagenette.tgz
2021-02-05 21:46:51   99004276 imagenette2-160.tgz
2021-02-05 21:47:29  341662435 imagenette2-320.tgz
2021-02-05 21:49:19 1557156525 imagenette2.tgz
2019-12-10 18:04:12  191498213 imagewang-160.tgz
2019-12-10 18:04:18  669826647 imagewang-320.tgz
2019-12-10 18:04:25 2900347689 imagewang.tgz
2019-03-09 06:50:06   92375355 imagewoof-160.tgz
2019-03-09 06:50:11  328003750 imagewoof-320.tgz
2019-03-09 06:50:17 1343137256 imagewoof.tgz
2021-02-05 21:54:50   92611264 imagewoof2-160.tgz
2021-02-05 21:55:17  328386242 imagewoof2-320.tgz
2021-02-05 21:56:37 1343712960 imagewoof2.tgz
2018-10-14 11:26:34   15683414 mnist_png.tgz
2019-02-02 10:07:20     565372 mnist_var_size_tiny.tgz
2018-10-08 15:53:35  345236087 oxford-102-flowers.tgz
2018-10-08 15:53:40  811706944 oxford-iiit-pet.tgz
2018-10-08 15:54:10 1957803273 stanford-cars.tgz

Step 4: I made changes to the training script in fastai/fastai

based on @tmabraham's training script in this PR fastai/fastai#3210

Step 5: Add the 5% and 50% leaderboards

Started that in this PR: #41

Step 6: Update Hashes

You must update the hashes by running the code at the end of the 04_data.external notebook (screenshot omitted), and check in only fastai/data/checks.txt into the PR.
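For context, that cell essentially records the size and a checksum for each archive. The sketch below is only my reading of what it does (the exact checks.txt format and the md5-of-the-first-1MB detail are assumptions); the authoritative code is the cell in 04_data.external:

# Sketch only; the authoritative code is the cell at the end of the 04_data.external notebook.
# Assumption: checks.txt maps each dataset URL to [archive size, md5 of the first 1MB].
import hashlib, os
from pathlib import Path

def file_check(fname):
    "Size plus md5 of the first 1MB, mirroring what fastai records per archive (assumed)."
    with open(fname, 'rb') as f:
        h = hashlib.md5(f.read(2**20)).hexdigest()
    return [os.path.getsize(fname), h]

archive = Path.home()/'.fastai'/'archive'
for tgz in sorted(archive.glob('image*2*.tgz')):
    print(tgz.name, file_check(tgz))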

@hamelsmu (Member) commented Feb 6, 2021:

@tmabraham no, no need; I'm going with 5% instead of 10% (I talked to Jeremy about this). Not sure about the size and epochs part, but I will ask about that. Thanks for bringing that up.

@hamelsmu (Member) commented Feb 6, 2021:

@tmabraham I'm thinking I'll just use the same size and epochs as the non-noisy leaderboard. Let me know if you think this sounds like a bad idea for some reason.

@jph00 (Member) commented Feb 6, 2021:

@hamelsmu you'll need to update the hashes in fastai too. If you delete the imagenette tgzs from your .fastai/archive directory and your data directory, then call untar_data on imagenette, you'll get an error which tells you how to update the hashes. You'll need to do this for each of the versions of the dataset. Let me know if you need a hand. :)
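A minimal sketch of the procedure Jeremy describes (the exact error text is whatever fastai prints):

# After deleting the cached archives and extracted folders, re-running untar_data re-downloads
# each dataset and re-checks it against the stored hash, surfacing the mismatch described above.
from fastai.data.external import untar_data, URLs

for url in (URLs.IMAGENETTE, URLs.IMAGENETTE_160, URLs.IMAGENETTE_320,
            URLs.IMAGEWOOF, URLs.IMAGEWOOF_160, URLs.IMAGEWOOF_320):
    untar_data(url)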

@hamelsmu (Member) commented Feb 6, 2021:

You'll need to do this for each of the versions of the dataset. Let me know if you need a hand. :)

Done in fastai/fastai#3210

@hamelsmu (Member) commented Feb 6, 2021:

Ok I believe this is done. Thanks to @tmabraham for the extensive testing.

@hamelsmu closed this as completed on Feb 6, 2021.