# Dataset Cleaning

*Image Deleter and Relabeler by [Zach Caceres](http://zachcaceres.com/now/) and Jason Hendrix, Duplicate Finder by Francisco Ingham*

In this notebook we will show you how to use fastai widgets to clean your dataset! We will delete images that do not belong, relabel images with incorrect labels, and delete duplicates. We use the CIFAR10 dataset here, but you can apply the same steps to your own custom dataset, for example one built with the [google images dataset](https://github.com/fpingham/google-images-dataset) notebook.

## Training your first model

In [84]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

In [85]:
from fastai import *
from fastai.vision import *

In [86]:
path = untar_data(URLs.CIFAR)

We will first train a model, since its losses will suggest which images are most likely to be mislabeled or not to belong in our dataset. We will also use the weights of the pretrained model to find similar images that might be duplicates.

In [152]:
np.random.seed(42)
data = ImageDataBunch.from_folder(path, valid_pct=0.2,
        ds_tfms=get_transforms(), size=224, num_workers=4).normalize(imagenet_stats)

In [131]:
learn = create_cnn(data, models.resnet34, metrics=error_rate)

In [132]:
# learn.fit_one_cycle(4)

In [133]:
# learn.save('stage-1');

In [134]:
learn.load('stage-1');

## Cleaning your dataset

In [135]:
from fastai.widgets import *
import torch

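We will create an `ImageDataBunch` object with almost no validation set, since we want the cleaning step to cover as much of the dataset as possible. With `valid_pct=0.00004`, only 2 of the 60,000 images end up in the validation set.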

In [160]:
np.random.seed(42)
# Assign the result so that the learner created below uses this DataBunch
data = ImageDataBunch.from_folder(path, valid_pct=0.00004,
        ds_tfms=get_transforms(), size=224, num_workers=4).normalize(imagenet_stats)
data

ImageDataBunch;
Train: LabelList
y: CategoryList (59998 items)
[Category dog, Category dog, Category dog, Category dog, Category dog]...
Path: /home/jupyter/.fastai/data/cifar10
x: ImageItemList (59998 items)
[Image (3, 32, 32), Image (3, 32, 32), Image (3, 32, 32), Image (3, 32, 32), Image (3, 32, 32)]...
Path: /home/jupyter/.fastai/data/cifar10;
Valid: LabelList
y: CategoryList (2 items)
[Category dog, Category cat]...
Path: /home/jupyter/.fastai/data/cifar10
x: ImageItemList (2 items)
[Image (3, 32, 32), Image (3, 32, 32)]...
Path: /home/jupyter/.fastai/data/cifar10;
Test: None
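
The sizes in the output above follow from simple arithmetic. Here is a minimal sketch, assuming the split shuffles the items with a fixed seed and keeps `int(valid_pct * n)` of them for validation. The helper `split_indices` is hypothetical, and Python's `random` module stands in for the library's own shuffling.

```python
import random

def split_indices(n_items, valid_pct, seed=42):
    """Shuffle indices with a fixed seed and cut off a validation slice.

    Hypothetical helper for illustration only, not fastai's implementation.
    """
    rng = random.Random(seed)          # fixed seed -> same split on every run
    idxs = list(range(n_items))
    rng.shuffle(idxs)
    cut = int(valid_pct * n_items)     # number of validation items, rounded down
    return idxs[cut:], idxs[:cut]      # (train, valid)

train_idxs, valid_idxs = split_indices(60000, 0.00004)
print(len(train_idxs), len(valid_idxs))  # 59998 2
```

Under this assumption, `0.00004 * 60000` is 2.4, which rounds down to 2 validation items and leaves 59998 for training, matching the counts shown.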

In [161]:
learn = create_cnn(data, models.resnet34, metrics=error_rate)

In [162]:
learn.load('stage-1');

To start, we will sort the indices of our images by loss, highest first, since a high loss suggests that an image might be mislabeled or might not belong in the dataset.

In [163]:
ds, idxs = DatasetFormatter().from_toplosses(learn, ds_type=DatasetType.Train)
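
To illustrate what sorting by top losses means, here is a minimal sketch in plain Python. It assumes cross-entropy as the loss and uses a hypothetical helper `top_loss_indices` with made-up probabilities; it is not how `DatasetFormatter` is implemented.

```python
import math

def top_loss_indices(probs, targets):
    """Return sample indices sorted from highest to lowest cross-entropy loss.

    probs:   per-sample lists of predicted class probabilities
    targets: per-sample index of the labeled class
    Hypothetical helper; it only illustrates the ordering idea.
    """
    losses = [-math.log(p[t]) for p, t in zip(probs, targets)]
    order = sorted(range(len(losses)), key=lambda i: losses[i], reverse=True)
    return order, losses

# Toy predictions for three images over two classes (made-up numbers).
probs = [[0.9, 0.1],   # confident and correct -> low loss
         [0.2, 0.8],   # labeled class 0 but the model prefers class 1 -> high loss
         [0.6, 0.4]]   # mildly correct -> medium loss
targets = [0, 0, 0]

order, losses = top_loss_indices(probs, targets)
print(order)  # [1, 2, 0]
```

The image the model disagrees with most comes first, which is why the head of such a list is a good place to look for labeling mistakes.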

Now we will use the widget to delete or move images. Flag photos for deletion by clicking 'Delete' or move them by using the dropdown menu. Then click 'Next Batch' to delete flagged photos and keep the rest in that row. `ImageCleaner` will show you a new row of images until there are no more to show. 

When you change your dataset, `ImageCleaner` will save the new dataset in a `cleaned.csv` file in the same folder as your notebook.

In [77]:
ImageCleaner(ds, idxs, path)


You can also find duplicates in your dataset and delete them! First we need to load our data from the csv and create a new learner object.

In [164]:
df = pd.read_csv('cleaned.csv', header='infer')
df.head()

| Unnamed: 0 | name | label |
|---|---|---|
| 0 | train/dog/9871_dog.png | dog |
| 1 | train/cat/47973_cat.png | cat |
| 2 | train/cat/36609_cat.png | cat |
| 3 | test/airplane/5703_airplane.png | airplane |
| 4 | test/airplane/9356_airplane.png | airplane |
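
Once `cleaned.csv` exists, one quick sanity check is to list the rows whose label no longer matches the folder in the file path, which would point at relabeled images. The sketch below assumes the `name` and `label` columns shown above (the unnamed index column is left out), that the parent folder encodes the original label, and that relabeling changes only the csv. `relabeled_rows` is a hypothetical helper.

```python
import csv
import io

# Sample rows copied from the `df.head()` output above.
sample = """name,label
train/dog/9871_dog.png,dog
train/cat/47973_cat.png,cat
train/cat/36609_cat.png,cat
test/airplane/5703_airplane.png,airplane
test/airplane/9356_airplane.png,airplane
"""

def relabeled_rows(csv_text):
    """Return rows whose label differs from the folder the file lives in."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return [r for r in rows if r["name"].split("/")[-2] != r["label"]]

print(relabeled_rows(sample))  # [] because no sample row was relabeled
```

On a real file you would pass the contents of `cleaned.csv` instead of the inline sample.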


In [None]:
np.random.seed(42)
data = ImageDataBunch.from_csv(path, folder=".", valid_pct=0, csv_labels='cleaned.csv',
        ds_tfms=get_transforms(), size=224, num_workers=4).normalize(imagenet_stats)

In [166]:
learn = create_cnn(data, models.resnet34, metrics=error_rate)

In [167]:
learn.load('stage-1');

We will first get the indices of the most similar images; `ImageCleaner` will then use them to show the potential duplicates.
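
As a rough illustration of what "most similar" can mean, the sketch below ranks pairs of activation vectors by cosine similarity. Both the choice of cosine similarity and the helper names `cosine_similarity` and `most_similar_pairs` are assumptions made for illustration; the notebook does not state which measure the library uses.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def most_similar_pairs(vectors, top_k=2):
    """Return the top_k index pairs (i < j) ranked by cosine similarity."""
    pairs = [(cosine_similarity(vectors[i], vectors[j]), i, j)
             for i in range(len(vectors))
             for j in range(i + 1, len(vectors))]
    pairs.sort(reverse=True)           # highest similarity first
    return [(i, j) for _, i, j in pairs[:top_k]]

# Made-up 3-dimensional "activations" for four images.
activations = [[1.0, 0.0, 0.0],
               [0.9, 0.1, 0.0],   # nearly parallel to image 0
               [0.0, 1.0, 0.0],
               [0.0, 0.0, 1.0]]

print(most_similar_pairs(activations, top_k=1))  # [(0, 1)]
```

In a scheme like this every image is compared with every other one, so the number of comparisons grows quadratically with the dataset size and the step can take a while on a dataset as large as CIFAR10.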

In [168]:
ds, idxs = DatasetFormatter().from_similars(learn, ds_type=DatasetType.Train, pool_dim=4)

Getting activations...


Computing similarities...



Take a look at the images in pairs and delete the ones you don't want to see anymore, until there are no more to show. `ImageCleaner` shows 40 images by default. If you still see duplicates in the last of those 40, you can always run the widget again, specifying `start=40` and `end=100` to see the next 60. Remember that if you want to rerun the widget you need to recreate the `ImageDataBunch` object, loading the data from `cleaned.csv`.

In [42]:
ImageCleaner(ds, idxs, path, duplicates=True)


It turns out there are quite a few duplicates in CIFAR!

## Train with new dataset

Now we are ready to do our real training with a clean dataset! To use the new dataset, we create the `ImageDataBunch` with `from_csv` so that the labels are loaded from `cleaned.csv`.

In [43]:
np.random.seed(42)
data = ImageDataBunch.from_csv(path, folder=".", valid_pct=0.2, csv_labels='cleaned.csv',
        ds_tfms=get_transforms(), size=224, num_workers=4).normalize(imagenet_stats)

In [44]:
learn = create_cnn(data, models.resnet34, metrics=error_rate)

In [45]:
learn.fit_one_cycle(4)

| epoch | train_loss | valid_loss | error_rate |
|---|---|---|---|
| 1 | 2.697477 | 1.928366 | 0.726316 |
| 2 | 2.147982 | 1.205164 | 0.421053 |
| 3 | 1.731222 | 1.010826 | 0.294737 |
| 4 | 1.483092 | 0.964751 | 0.326316 |
