# Dataset Cleaning

*Image Deleter and Relabeler by [Zach Caceres](http://zachcaceres.com/now/) and Jason Hendrix, Duplicate Finder by Francisco Ingham*

In this notebook we will show you how to use the fastai widgets to clean your dataset! We will delete images that do not belong in the dataset, relabel images with incorrect labels, and delete duplicates. We use the CIFAR10 dataset here, but you can apply the same steps to your own custom dataset, for example one built with the [google images dataset](https://github.com/fpingham/google-images-dataset) notebook.

## Training your first model

In [47]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

In [48]:
from fastai import *
from fastai.vision import *

In [49]:
path = untar_data(URLs.CIFAR)

We will first train a model, since its losses will suggest which images are most likely to be mislabeled or not to belong in our dataset. We will also use the model's weights to find similar images that might be duplicates.

In [50]:
np.random.seed(42)
data = ImageDataBunch.from_folder(path, train="train", valid_pct=0.2,
        ds_tfms=get_transforms(), size=224, num_workers=4).normalize(imagenet_stats)

In [None]:
learn = create_cnn(data, models.resnet34, metrics=error_rate)

In [None]:
learn.fit_one_cycle(4)

epoch,train_loss,valid_loss,error_rate
1,0.495170,0.313840,0.104583

In [None]:
learn.save('stage-1');

In [None]:
learn.load('stage-1');

## Cleaning your dataset

In [20]:
from fastai.widgets import *

To start, we will sort the indices of our images by loss, highest first, since a high loss suggests that an image might be mislabeled or might not belong in the dataset.

In [None]:
ds, idxs = DatasetFormatter().from_toplosses(learn, ds_type=DatasetType.Valid)
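To make the ordering concrete, here is a minimal plain-Python sketch of sorting image indices by loss, highest first. The `losses` list is made up for illustration, and this is not a description of how `from_toplosses` is implemented internally.

```python
# Hypothetical per-image losses, indexed by position in the validation set.
losses = [0.12, 2.31, 0.05, 1.47, 0.88]

# Sort the indices so that the image with the highest loss comes first.
idxs = sorted(range(len(losses)), key=lambda i: losses[i], reverse=True)

print(idxs)  # [1, 3, 4, 0, 2]
```

The images at the front of `idxs` are the ones the model was most wrong about, so they are the first candidates to inspect.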

Now we will use the widget to delete or move images. Flag photos for deletion by clicking 'Delete' or move them by using the dropdown menu. Then click 'Next Batch' to delete flagged photos and keep the rest in that row. `ImageCleaner` will show you a new row of images until there are no more to show. 

When you change your dataset, `ImageCleaner` will save the new dataset in a 'cleaned.csv' file in the same directory as your notebook.

Pretty sure the first one is not a truck...

In [None]:
ImageCleaner(ds, idxs)
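Once the widget has written 'cleaned.csv', a quick count of images per label can confirm that your deletions and relabels were saved. The sketch below uses only the standard library and runs on an in-memory stand-in; the `name,label` header and the sample rows are assumptions, so check them against your own file.

```python
import csv
import io
from collections import Counter

# Hypothetical stand-in for cleaned.csv; the column names are an assumption.
sample = io.StringIO(
    "name,label\n"
    "train/truck/0001.png,truck\n"
    "train/truck/0002.png,truck\n"
    "train/truck/0003.png,automobile\n"  # an image relabeled in the widget
)

# Count how many images carry each label after cleaning.
counts = Counter(row["label"] for row in csv.DictReader(sample))

print(counts.most_common())
```

To run it on the real file, replace `sample` with `open("cleaned.csv", newline="")`.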

You can also find duplicates in your dataset and delete them! First we need to load our data from the csv and create a new learner object.

In [38]:
np.random.seed(42)
data = ImageDataBunch.from_csv(".", folder=".", valid_pct=0.2, csv_labels='cleaned.csv',
        ds_tfms=get_transforms(), size=224, num_workers=4).normalize(imagenet_stats)

In [39]:
learn = create_cnn(data, models.resnet34, metrics=error_rate)

In [40]:
learn.load('stage-1');

We will first get the indexes of the most similar images, and `ImageCleaner` will use them to find the potential duplicates.
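As a rough illustration of what "most similar" can mean, the sketch below ranks image pairs by the cosine similarity of their activation vectors. The vectors are invented, and using cosine similarity is an assumption made for this example rather than a statement about the internals of `from_similars`.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def most_similar_pairs(acts):
    """Return all index pairs (i, j), i < j, most similar first."""
    scored = []
    for i in range(len(acts)):
        for j in range(i + 1, len(acts)):
            scored.append((cosine_similarity(acts[i], acts[j]), i, j))
    scored.sort(reverse=True)
    return [(i, j) for _, i, j in scored]

# Hypothetical activation vectors, one per image.
activations = [
    [1.0, 0.0, 2.0],  # image 0
    [0.0, 3.0, 0.0],  # image 1
    [2.0, 0.0, 4.0],  # image 2: same direction as image 0
    [1.0, 1.0, 0.0],  # image 3
]

pairs = most_similar_pairs(activations)
print(pairs[0])  # (0, 2)
```

Pairs near the top of the list are the likeliest duplicates, which is why they are the ones worth reviewing first.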

In [28]:
ds, idxs = DatasetFormatter().from_similars(learn, ds_type=DatasetType.Valid, pool_dim=4)

Getting activations...


FileNotFoundError: Traceback (most recent call last):
  File "/opt/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 138, in _worker_loop
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/opt/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 138, in <listcomp>
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/home/jupyter/fastai/fastai/data_block.py", line 477, in __getitem__
    if self.item is None: x,y = self.x[idxs],self.y[idxs]
  File "/home/jupyter/fastai/fastai/data_block.py", line 90, in __getitem__
    if isinstance(try_int(idxs), int): return self.get(idxs)
  File "/home/jupyter/fastai/fastai/vision/data.py", line 266, in get
    res = self.open(fn)
  File "/home/jupyter/fastai/fastai/vision/data.py", line 262, in open
    return open_image(fn, convert_mode=self.convert_mode)
  File "/home/jupyter/fastai/fastai/vision/image.py", line 375, in open_image
    x = PIL.Image.open(fn).convert(convert_mode)
  File "/opt/anaconda3/lib/python3.6/site-packages/PIL/Image.py", line 2609, in open
    fp = builtins.open(filename, "rb")
FileNotFoundError: [Errno 2] No such file or directory: '././/home/jupyter/.fastai/data/cifar10/train/automobile/35728_automobile.png'


Take a look at the images in pairs and delete the ones you don't want to keep, until there are no more to show. `ImageCleaner` shows 40 images by default. If you still see duplicates among the last of those 40, you can always run the widget again, specifying `start=40` and `end=100` to see the next 60. Remember that if you want to rerun the widget, you need to recreate the `ImageDataBunch` object, loading the data from `cleaned.csv`.

In [None]:
ImageCleaner(ds, idxs, duplicates=True)

Turns out there are quite a few duplicates in CIFAR!

## Train with new dataset

Now we are ready to do our real training with a clean dataset! To use the new dataset, we must tell `ImageDataBunch` to load the labels from a CSV file.

In [None]:
np.random.seed(42)
data = ImageDataBunch.from_csv(".", folder=".", valid_pct=0.2, csv_labels='cleaned.csv',
        ds_tfms=get_transforms(), size=224, num_workers=4).normalize(imagenet_stats)

In [None]:
learn = create_cnn(data, models.resnet34, metrics=error_rate)

In [None]:
learn.fit_one_cycle(4)