# widgets.image_cleaner

fastai offers several widgets to support the workflow of a deep learning practitioner. The purpose of the widgets is to help you organize, clean, and prepare your data for your model. Widgets are separated by data type.

In [None]:
from fastai.vision import *
from fastai.widgets import DatasetFormatter, ImageCleaner
from fastai.gen_doc.nbdoc import show_doc

In [None]:
%reload_ext autoreload
%autoreload 2

In [None]:
path = untar_data(URLs.MNIST_SAMPLE)
data = ImageDataBunch.from_folder(path)

In [None]:
learn = create_cnn(data, models.resnet18, metrics=error_rate)

In [None]:
learn.fit_one_cycle(2)

In [None]:
learn.save('stage-1')

We create a databunch with all the data in the training set and no validation set (`DatasetFormatter` uses only the training set).

In [None]:
db = (ImageItemList.from_folder(path)
                   .no_split()
                   .label_from_folder()
                   .databunch())

In [None]:
learn = create_cnn(db, models.resnet18, metrics=[accuracy])
learn.load('stage-1');

In [None]:
show_doc(DatasetFormatter)

<h2 id="DatasetFormatter"><code>class</code> <code>DatasetFormatter</code><a href="https://github.com/fastai/fastai/blob/master/fastai/widgets/image_cleaner.py#L14" class="source_link">[source]</a></h2>

> <code>DatasetFormatter</code>()

Returns a dataset with the appropriate format and file indices to be displayed.  

The [`DatasetFormatter`](/widgets.image_cleaner.html#DatasetFormatter) class prepares your image dataset for widgets by returning a formatted [`DatasetTfm`](/vision.data.html#DatasetTfm) based on the [`DatasetType`](/basic_data.html#DatasetType) specified. Use `from_toplosses` to grab the most problematic images directly from your learner. Optionally, you can restrict the formatted dataset returned to `n_imgs`.

In [None]:
show_doc(DatasetFormatter.from_similars)

<h4 id="DatasetFormatter.from_similars"><code>from_similars</code><a href="https://github.com/fastai/fastai/blob/master/fastai/widgets/image_cleaner.py#L34" class="source_link">[source]</a></h4>

> <code>from_similars</code>(`learn`, `layer_ls`:`list`=`[0, 7, 2]`, `kwargs`)

Gets the indices for the most similar images in training and validation datasets  

In [None]:
from fastai.gen_doc.nbdoc import *
from fastai.widgets.image_cleaner import * 

In [None]:
show_doc(DatasetFormatter.from_toplosses)

<h4 id="DatasetFormatter.from_toplosses"><code>from_toplosses</code><a href="https://github.com/fastai/fastai/blob/master/fastai/widgets/image_cleaner.py#L15" class="source_link">[source]</a></h4>

> <code>from_toplosses</code>(`learn`, `n_imgs`=`None`, `kwargs`)

Gets indices with top losses for both training and validation sets in `learn`.  
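
The selection idea behind `from_toplosses` can be illustrated without fastai. The following is a minimal sketch, not the library's implementation: `top_loss_indices` is a hypothetical helper and the loss values are made up for illustration. It ranks sample indices by per-sample loss, highest first, and optionally keeps only the first `n_imgs`.

```python
from typing import List, Optional, Sequence


def top_loss_indices(losses: Sequence[float], n_imgs: Optional[int] = None) -> List[int]:
    """Return sample indices ordered from highest to lowest loss (hypothetical helper)."""
    order = sorted(range(len(losses)), key=lambda i: losses[i], reverse=True)
    # With n_imgs=None every index is returned; otherwise only the n_imgs worst ones.
    return order if n_imgs is None else order[:n_imgs]


# Made-up per-sample losses, for illustration only.
losses = [0.10, 2.30, 0.05, 1.70]
worst_two = top_loss_indices(losses, n_imgs=2)
```

Here `worst_two` holds the indices of the two samples the model got most wrong, which are the ones a reviewer would want to inspect first.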

In [None]:
show_doc(ImageCleaner)

<h2 id="ImageCleaner"><code>class</code> <code>ImageCleaner</code><a href="https://github.com/fastai/fastai/blob/master/fastai/widgets/image_cleaner.py#L92" class="source_link">[source]</a></h2>

> <code>ImageCleaner</code>(`dataset`, `fns_idxs`, `path`, `batch_size`:`int`=`5`, `duplicates`=`False`)

Display images with their current label.  

[`ImageCleaner`](/widgets.image_cleaner.html#ImageCleaner) is for cleaning up images that don't belong in your dataset. It renders images in a row and lets you flag each file for removal from your dataset. To use [`ImageCleaner`](/widgets.image_cleaner.html#ImageCleaner) we must first use `DatasetFormatter().from_toplosses` to get the suggested indices for misclassified images.

In [None]:
ds, idxs = DatasetFormatter().from_toplosses(learn)

In [None]:
ImageCleaner(ds, idxs, path)

[`ImageCleaner`](/widgets.image_cleaner.html#ImageCleaner) does not change anything on disk (neither the labels nor the existence of images). Instead, it creates a 'cleaned.csv' file in your data path, from which you need to load your new databunch for the changes to be applied.
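
The non-destructive pattern described above can be sketched in plain Python. This is an illustration under stated assumptions, not the widget's code: `CleaningLog` is a hypothetical class, the file names are invented, and the `name`/`label` column headers are assumed rather than taken from this page. Deleting or relabeling only edits an in-memory mapping, and the result is serialized as CSV text.

```python
import csv
import io
from typing import Dict


class CleaningLog:
    """Hypothetical stand-in for the widget's bookkeeping: file name -> label."""

    def __init__(self, labels: Dict[str, str]) -> None:
        self.labels = dict(labels)  # copy so the caller's mapping is untouched

    def delete(self, name: str) -> None:
        # Dropping the entry only excludes the file from the CSV; nothing is removed from disk.
        self.labels.pop(name, None)

    def relabel(self, name: str, label: str) -> None:
        # Only files that are still tracked can be relabeled.
        if name in self.labels:
            self.labels[name] = label

    def to_csv(self) -> str:
        buf = io.StringIO()
        writer = csv.writer(buf, lineterminator="\n")
        writer.writerow(["name", "label"])  # assumed column names
        for name, label in self.labels.items():
            writer.writerow([name, label])
        return buf.getvalue()


# Invented file names, for illustration only.
log = CleaningLog({"train/3/a.png": "3", "train/7/b.png": "7", "train/7/c.png": "7"})
log.delete("train/7/b.png")
log.relabel("train/7/c.png", "3")
cleaned = log.to_csv()
```

Because the original files are never touched, a mistaken deletion can be undone by discarding the CSV and building the databunch from the folder again.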

In [None]:
df = pd.read_csv(path/'cleaned.csv', header='infer')

In [None]:
# We create a databunch from our csv. We include the data in the training set and we don't use a validation set (DatasetFormatter uses only the training set)
np.random.seed(42)
db = (ImageItemList.from_df(df, path)
                   .no_split()
                   .label_from_df()
                   .databunch(bs=64))

In [None]:
learn = create_cnn(db, models.resnet18, metrics=error_rate)
learn = learn.load('stage-1')

You can then use [`ImageCleaner`](/widgets.image_cleaner.html#ImageCleaner) again to find duplicates in the dataset. To do this, specify `duplicates=True` when calling `ImageCleaner` after getting the indices and dataset from `.from_similars`. Note that if you are using a layer whose output has dimensions `[n_batches, n_features, 1, 1]`, you don't need any pooling (this is the case with the last layer). The suggested use of `.from_similars()` with resnets is to take the last layer with no pooling, as in the following cell.

In [None]:
ds, idxs = DatasetFormatter().from_similars(learn, layer_ls=[0,7,1], pool=None)

Getting activations...


Computing similarities...
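
The two stages reported above (collecting activations, then comparing them) can be illustrated with a small self-contained sketch. It assumes cosine similarity as the comparison measure and uses made-up feature vectors; `cosine` and `most_similar_pairs` are hypothetical helpers, not fastai functions.

```python
import math
from itertools import combinations
from typing import List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar_pairs(features: Sequence[Sequence[float]], top_k: int = 1) -> List[Tuple[int, int]]:
    """Return the top_k index pairs (i, j), i < j, ranked by descending similarity."""
    scored = [
        (cosine(features[i], features[j]), i, j)
        for i, j in combinations(range(len(features)), 2)
    ]
    scored.sort(key=lambda t: t[0], reverse=True)
    return [(i, j) for _, i, j in scored[:top_k]]


# Made-up activation vectors, one per image, for illustration only.
features = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [2.0, 0.0, 0.0],  # points in the same direction as image 0
]
pairs = most_similar_pairs(features, top_k=1)
```

Images 0 and 2 differ only in scale, so they are the pair most likely to be duplicates in this toy example.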


In [None]:
ImageCleaner(ds, idxs, path, duplicates=True)


## Methods

## Undocumented Methods - Methods moved below this line will intentionally be hidden

## New Methods - Please document or move to the undocumented section

In [None]:
show_doc(ImageCleaner.make_dropdown_widget)