<a href="https://colab.research.google.com/github/butchland/build-your-own-image-classifier/blob/master/colab-clean-image-dataset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Clean Images from your Image Dataset

## Instructions

1. In the **Specify Project Name and Dataset Type** section below, fill out the project name first _(this is the project name you used in the previous notebook. If you kept the default project name there, leave the default here as well)_.

    Change the `dataset_type` from `raw` to `cleaned` if you have already run this notebook, created a cleaned dataset of your images, and want to continue cleaning it.

    _(If this is your first time cleaning your image dataset, leave it as `raw`)_

    If you pick `cleaned` and the cleaned dataset does not exist, this will trigger an error. Just update your selection and press `Cmd/Ctrl+F9` or click on the menu `Runtime/Run all` again to fetch the correct dataset _(after connecting, that is)_.

1. Click on the `Connect` button on the top right area of the page. This will change into a checkmark with the RAM and Disk health bars once the connection is complete.
1. Press `Cmd/Ctrl+F9` or click on the menu `Runtime/Run all`.
1. Click on the link to `accounts.google.com` that appears and log in to your Google Account if necessary, or select the Google Account to use for your Google Drive. (This will open a new tab.)
1. Authorize `Google Drive File Stream` to access your Google Drive (We will use this to save your collected images to a folder on your Google Drive).

1. Copy the generated authentication token and paste it into the input box that appears.

1. Let your notebook run all the way to the section on Image Cleaning. You will know you have reached it when the message **`START CLEANING YOUR IMAGES`** appears in the notebook below.

1. Once the images show up in the Image Cleaning section below, delete or recategorize any images that are incorrect.
1. Scroll horizontally to see the images. Read the **Usage Instructions** section on how to use the `Image Cleaner` widget.

1. Once all the images are cleaned up, click on the section entitled `DONE CLEANING SECTION`.
1. To continue the process of saving your cleaned-up dataset, press `Cmd/Ctrl+F10` or click on the menu `Runtime/Run after` to run all the remaining steps _(including copying your cleaned dataset back into your Google Drive folder)_.

1. Once the text 'DONE! DONE! DONE!' is printed at the end of the notebook, you can click on the menu `Runtime/Factory reset runtime` and click `Yes` on the dialog box to end your session.

Your cleaned image dataset will be saved in your Google Drive under `/My Drive/build-your-own-image-classifier/data/<project-name>/cleaned_<project-name>.tgz` _(if you didn't change the defaults, it should be under `/My Drive/build-your-own-image-classifier/data/pets/cleaned_pets.tgz`)_


## What is going on?

This section explains the code behind this notebook.

_(Click on SHOW CODE to display the code)_

### Connect to your Google Drive

We'll need to connect to your Google Drive in order to retrieve our collected images as well as save the cleaned images afterwards.

In [None]:
#@title {display-mode: "form"}
from google.colab import drive
drive.mount('/content/drive')

### Specify Project Name and Dataset Type

Fill out the project name (`project`). It should match the project name used in the previous notebook.

If you already went through this notebook once, created a cleaned version of your image dataset, and want to continue cleaning it, you can change the `dataset_type` from `raw` to `cleaned` to fetch that dataset from your Google Drive.

_(Otherwise just leave the `dataset_type` as `raw`)_

In [None]:
#@title Enter your project name {display-mode: "form"}
project = "pets" #@param {type: "string"}
dataset_type = "raw" #@param ["raw","cleaned"]

### Install Python Packages

Install the Python packages needed to clean the images.

In [None]:
#@title {display-mode: "form"} 
!pip install -Uqq fastai
!pip install -Uqq git+https://github.com/butchland/my_timesaver_utils

### Copy your Image Dataset from Google Drive

In [None]:
#@title {display-mode: "form"} 
# Pick the archive that matches the selected dataset type
if dataset_type == "cleaned":
    file_name = f'cleaned_{project}.tgz'
else:
    file_name = f'{project}.tgz'
folder_path = f'build-your-own-image-classifier/data/{project}'
# Copy the archive from Google Drive and unpack it into /content/{project}
!cp /content/drive/My\ Drive/{folder_path}/{file_name} /content/.
!tar -xzf {file_name}
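
As a rough illustration of the selection logic in the cell above, the following sketch builds the archive name in plain Python and checks that the file exists before anything is copied. The helper names `archive_name` and `find_archive` are hypothetical and are not part of this notebook or of any library; the folder layout is taken from the paths used in this notebook.

```python
from pathlib import Path


def archive_name(project: str, dataset_type: str) -> str:
    """Return the archive file name for the chosen dataset type."""
    if dataset_type not in ("raw", "cleaned"):
        raise ValueError(f"unknown dataset_type: {dataset_type!r}")
    prefix = "cleaned_" if dataset_type == "cleaned" else ""
    return f"{prefix}{project}.tgz"


def find_archive(drive_root: Path, project: str, dataset_type: str) -> Path:
    """Locate the archive under the Drive folder, failing early if it is missing."""
    archive = (
        drive_root
        / "build-your-own-image-classifier"
        / "data"
        / project
        / archive_name(project, dataset_type)
    )
    if not archive.exists():
        raise FileNotFoundError(
            f"{archive} not found; if you picked 'cleaned', switch back to 'raw'"
        )
    return archive
```

In Colab, `drive_root` would be `Path('/content/drive/My Drive')` once the Drive is mounted. Failing early with a clear message is one way to make the "cleaned dataset does not exist" error easier to recognize.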

In [None]:
#@title {display-mode: "form"} 
from fastai.vision.all import *
from fastai.vision.widgets import *
path = Path(f'/content/{project}')
Path.BASE_PATH = path
bears = DataBlock(
    blocks=(ImageBlock,CategoryBlock),
    get_items=get_image_files,
    get_y=parent_label,
    splitter=RandomSplitter(seed=42),
    item_tfms=Resize(128),
    batch_tfms=aug_transforms()
)
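
The `DataBlock` above labels each image by the name of its parent folder (`parent_label`) and holds out a random subset for validation (`RandomSplitter`, which defaults to 20%). The sketch below imitates those two ideas in plain Python to show what they mean. It is not fastai's implementation: the function names are hypothetical, and the exact images that land in each subset will differ from fastai's split even with the same seed.

```python
import random
from pathlib import Path


def label_from_parent(image_path: Path) -> str:
    """Use the name of the containing folder as the category label."""
    return image_path.parent.name


def random_split(items, valid_pct=0.2, seed=42):
    """Shuffle indices with a fixed seed and hold out valid_pct for validation."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    cut = int(valid_pct * len(items))
    valid = [items[i] for i in indices[:cut]]
    train = [items[i] for i in indices[cut:]]
    return train, valid


files = [Path(f"pets/cat/{i}.jpg") for i in range(5)] + [
    Path(f"pets/dog/{i}.jpg") for i in range(5)
]
train, valid = random_split(files)
print(len(train), "training images,", len(valid), "validation images")
print(label_from_parent(files[0]))
```

Fixing the seed keeps the same images in the validation subset every time the cell is rerun, which makes results comparable between runs.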

### Display a sample of the data images from your dataset
The images below are just a sample from the images
in your dataset.

In [None]:
#@title {display-mode: "form"}
dls = bears.dataloaders(path)
dls.show_batch()

### Build an image classifier for cleaning images

We now create an initial version of our image classifier that will help us clean up the images in your dataset.

In [None]:
#@title {display-mode: "form"}
learn = cnn_learner(dls, resnet18, metrics=accuracy)

The next cell trains the image classifier on our images.

It still works, even though some of the images might be mislabeled.

The **accuracy** (times 100) is the percentage of validation-set images whose category our initial image classifier predicted correctly _(validation metrics are used as a guide to how good our image classifier will be when deployed as an app)_.


In [None]:
#@title {display-mode: "form"}
learn.fine_tune(4)
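
The accuracy figure reported during training can be illustrated with a small, self-contained calculation. This is only a sketch of the idea with made-up labels; `accuracy_score` is a hypothetical helper, not the fastai `accuracy` metric itself.

```python
def accuracy_score(predicted, actual):
    """Fraction of items whose predicted category matches the actual category."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Illustrative labels only: three of the four predictions match
predicted = ["cat", "dog", "dog", "cat"]
actual = ["cat", "dog", "cat", "cat"]
print(f"{accuracy_score(predicted, actual) * 100:.0f}% correct")  # prints "75% correct"
```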

## Image Cleaning

Below is an interactive widget that will
allow us to delete or reclassify our images.

These are the images that our image classifier
is least confident about.
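
One common way to find such images is to rank them by loss: the lower the probability the classifier assigns to an image's current label, the higher its loss and the earlier it is shown. The sketch below assumes this ranking; whether the widget used in this notebook orders images exactly this way is an assumption, and the file names and probabilities are made up.

```python
import math


def cross_entropy(prob_of_current_label: float) -> float:
    """Loss for one image, given the probability assigned to its current label."""
    return -math.log(prob_of_current_label)


def rank_by_loss(items):
    """Sort (file name, probability) pairs so the highest-loss images come first."""
    return sorted(items, key=lambda item: cross_entropy(item[1]), reverse=True)


# Hypothetical predictions: probability the classifier gives to each image's label
samples = [("cat_01.jpg", 0.98), ("cat_02.jpg", 0.15), ("dog_07.jpg", 0.55)]
for name, prob in rank_by_loss(samples):
    print(name, round(cross_entropy(prob), 3))
```

Here `cat_02.jpg` is listed first because the classifier gives its current label only a 15% probability, which makes it a good candidate for a mislabeled or irrelevant image.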

### Usage Instructions
> The images are segregated into **Train** and **Valid** datasets and are 
> further segregated by category.

> After selecting some images for deletion or changing their categories, you can
> click on the `Apply` button to apply the changes, or click on `Reset` to revert all your
> pending changes.

> When cleaning your data, make sure to check on all categories for both **Train** and **Valid** subsets of your dataset, and
> click on the `Apply` button to finalize the changes to each set in your dataset.  
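
Since each category is a folder, the effect of applying changes can be pictured as deleting files and moving files between category folders. The sketch below shows that effect in plain Python. It is an assumption about the outcome, not the widget's actual code, and `apply_changes` is a hypothetical helper.

```python
import shutil
from pathlib import Path


def apply_changes(dataset_root: Path, deletions, moves):
    """Delete the given image files, then move others into new category folders.

    deletions: iterable of image Paths to remove
    moves: iterable of (image Path, new category name) pairs
    """
    for image in deletions:
        image.unlink()
    for image, category in moves:
        target_dir = dataset_root / category
        target_dir.mkdir(parents=True, exist_ok=True)
        shutil.move(str(image), str(target_dir / image.name))
```

For example, `apply_changes(Path('/content/pets'), deletions=[bad_image], moves=[(mislabeled_image, 'dog')])` would remove one file and move another into the `dog` folder, where `bad_image` and `mislabeled_image` stand for paths you have chosen.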


In [None]:
#@title {display-mode: "form"}
from my_timesaver_utils.enhanced_imageclassifiercleaner import *
cleaner = EnhancedImageClassifierCleaner(learn)
cleaner

In [None]:
#@title {display-mode: "form"}
print("START CLEANING YOUR IMAGES")
raise RuntimeError("Click on the DONE CLEANING SECTION to continue the steps after cleaning")

### DONE CLEANING SECTION
Press `Cmd/Ctrl+F10` or click on the menu `Runtime/Run after` to run all the remaining steps.

In [None]:
#@title {display-mode: "form"}
cleaned_filename = f'cleaned_{project}.tgz'
!tar -czf {cleaned_filename} {project}
!mkdir -p /content/drive/My\ Drive/{folder_path}
!cp {cleaned_filename} /content/drive/My\ Drive/{folder_path}
print("DONE! DONE! DONE!")
print(f'Your cleaned image dataset is saved in your Google Drive under the folder My Drive/{folder_path}/{cleaned_filename}')
print("Make sure to end your session (Click on menu Runtime/Factory reset runtime and click 'Yes' on the dialog box to end your session)")
print("before closing this notebook.")
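
For readers who prefer to avoid shell commands, the packaging step can also be sketched with Python's standard `tarfile` module. This is an alternative illustration, not the code this notebook runs; `make_archive` and `list_archive` are hypothetical helper names.

```python
import tarfile
from pathlib import Path


def make_archive(project_dir: Path, archive_path: Path) -> Path:
    """Write project_dir into a gzip-compressed tar archive.

    The top-level folder name is kept, so extracting recreates it.
    """
    with tarfile.open(archive_path, "w:gz") as archive:
        archive.add(project_dir, arcname=project_dir.name)
    return archive_path


def list_archive(archive_path: Path):
    """Return the sorted member names, useful for checking what was saved."""
    with tarfile.open(archive_path, "r:gz") as archive:
        return sorted(archive.getnames())
```

In this notebook's layout, `make_archive(Path('/content/pets'), Path('/content/cleaned_pets.tgz'))` would correspond to the `tar -czf` command above, and `list_archive` offers a quick check of the contents before copying the file to Google Drive.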