![classify dandelions and grass, image from Pixabay](https://cdn.pixabay.com/photo/2018/05/20/16/13/dandelion-3416140_960_720.jpg "image from Pixabay")

# Create an Image Dataset and Train an Image Classifier using FastAI

*by: Binh Phan. Inspired by [Lesson 2](https://course.fast.ai/videos/?lesson=2) of FastAI. Thanks to Francisco Ingham and Jeremy Howard* 

In this tutorial, we'll create an image dataset from Google Images and train a state-of-the-art image classifier extremely easily using the FastAI library. The FastAI library is built on top of the PyTorch deep learning framework, and provides commands that make training an image classifier very intuitive.

For this tutorial, we'll build a dandelion vs. grass classifier. Let's get started!

In [None]:
from fastai.vision import *
import torch
import fastai
import torchvision


In [None]:
print(torch.__version__)
print(fastai.__version__)
print(torchvision.__version__)
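
The commands in this tutorial (for example `ImageDataBunch`) come from the 1.x line of FastAI, so it's worth checking the version printed above before going further. Here is a minimal sketch of such a check; `is_fastai_v1` is a hypothetical helper name, not part of FastAI.

```python
def is_fastai_v1(version: str) -> bool:
    """Return True if a version string such as '1.0.61' has major version 1."""
    major = version.strip().split(".")[0]
    return major == "1"


# In the notebook you would pass fastai.__version__; literal strings are used
# here so the snippet runs without FastAI installed.
for v in ["1.0.61", "2.7.12"]:
    status = "compatible" if is_fastai_v1(v) else "probably not compatible"
    print(f"fastai {v}: {status} with this tutorial's commands")
```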

## Create an Image Dataset from Google Images

Note: this Kaggle kernel already has the dataset created from these instructions, so if you don't want to create your own dataset, feel free to skip this section and move straight to [6]

**How to save a list of Google Image URLs into a csv file**

Go to Google Images and search for *grass*. Initially, there will be ~50 images, so scroll down and press the button 'Show more results' at the end of the page until ~100 images have loaded. 

Now you must run some JavaScript code in your browser, which will save the URLs of all the images you want for your dataset.

Press Ctrl+Shift+J on Windows/Linux or Cmd+Opt+J on Mac, and a small window, the JavaScript 'Console', will appear. That is where you will paste the JavaScript commands.

Run the following commands in the prompt:

```
urls = Array.from(document.querySelectorAll('.rg_di .rg_meta')).map(el=>JSON.parse(el.textContent).ou);
window.open('data:text/csv;charset=utf-8,' + escape(urls.join('\n')));
```

The browser will download the file. Name the file *grass.csv*.

Repeat the same steps above for *dandelion*, and save the respective file as *dandelion.csv*.
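
Before uploading, it can help to sanity-check the saved files, since a scraped list may contain blank lines or duplicates. The sketch below assumes each file holds one URL per line; `clean_urls` is a hypothetical helper and the URLs are made up for illustration.

```python
from urllib.parse import urlparse


def clean_urls(lines):
    """Keep unique http(s) URLs, preserving their original order."""
    seen = set()
    urls = []
    for line in lines:
        url = line.strip()
        if not url or url in seen:
            continue
        parsed = urlparse(url)
        # Accept only entries that look like real web addresses.
        if parsed.scheme in ("http", "https") and parsed.netloc:
            seen.add(url)
            urls.append(url)
    return urls


# Made-up sample; in practice, read the lines from grass.csv or dandelion.csv.
sample = [
    "https://example.com/a.jpg",
    "",
    "https://example.com/a.jpg",
    "not a url",
    "http://example.org/b.png",
]
print(clean_urls(sample))
```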

**Upload the URLs as a dataset in Kaggle**

In this Kaggle kernel, go to File -> Add or upload data.

In the top right corner, press Upload.

Now, add *grass.csv* and *dandelion.csv*. Name the dataset *greenr*.

Now, we're going to do a bit of hacky work to get things running in Kaggle. The folder /kaggle/input is read-only, but we need to write into the dataset folder when we download the images from our URL lists, so we're going to copy the files to another folder, /kaggle/working. That's actually the output folder, but we'll let our dataset reside there and create the outputs in the same folder. Run the following command:

In [None]:
!cp -r /kaggle/input/greenr /kaggle/working
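
If you'd rather stay in Python than use a shell command, the same copy can be sketched with the standard library. `copy_dataset` is a hypothetical helper name, and `dirs_exist_ok` assumes Python 3.8 or newer.

```python
import shutil
from pathlib import Path


def copy_dataset(src, dst_root):
    """Copy the folder `src` into `dst_root`, keeping the folder's name."""
    src, dst_root = Path(src), Path(dst_root)
    target = dst_root / src.name
    # dirs_exist_ok lets the cell be re-run without raising an error.
    shutil.copytree(src, target, dirs_exist_ok=True)
    return target


# In the kernel: copy_dataset('/kaggle/input/greenr', '/kaggle/working')
```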

Now, run the following commands to download the images from URLs into our dataset folder /kaggle/working/greenr/ using the *download_images* function.

Then, we'll make sure all the images are valid using *verify_images*.

After that, we'll create our dataset from *ImageDataBunch*. 

These are all FastAI commands that make it really easy to create a dataset :)

In [None]:
classes = ['grass','dandelion']
path = Path('/kaggle/working/greenr/')

# Download up to 200 images per class from its URL list into its own subfolder
for c in classes:
    dest = path/c
    dest.mkdir(parents=True, exist_ok=True)
    download_images(path/f'{c}.csv', dest, max_pics=200)

for c in classes:
    print(c)
    verify_images(path/c, delete=True, max_size=500)

np.random.seed(42)
data = ImageDataBunch.from_folder(path, train=".", valid_pct=0.2,
        ds_tfms=get_transforms(), size=224, num_workers=4).normalize(imagenet_stats)

Now let's view our data and confirm that we have a dataset. Congrats, you have now created your own image dataset!

In [None]:
data.classes

In [None]:
data.show_batch(rows=3, figsize=(7,8))

In [None]:
data.classes, data.c, len(data.train_ds), len(data.valid_ds)
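
Some downloads fail and *verify_images* deletes broken files, so the two classes can end up with different numbers of images. A quick way to check is to count the files in each class folder. This is a minimal sketch using only the standard library; `count_images` and the list of file suffixes are assumptions made for illustration.

```python
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".gif", ".bmp"}


def count_images(root, classes):
    """Return {class_name: number of image files directly inside root/class_name}."""
    root = Path(root)
    counts = {}
    for c in classes:
        folder = root / c
        files = folder.iterdir() if folder.is_dir() else []
        counts[c] = sum(1 for f in files if f.suffix.lower() in IMAGE_SUFFIXES)
    return counts


# In the kernel: count_images('/kaggle/working/greenr', ['grass', 'dandelion'])
```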

## Train our Image Classifier
Now, let's train an image classifier from our dataset. After this, we'll have a model that classifies dandelions vs. grass.

First, let's create a learner around a ResNet34 model using *cnn_learner*. ResNet34 is an image classifier pre-trained on ImageNet that works really well out of the box, and we're simply going to train that model on our dataset to get it to become an expert at classifying dandelions vs. grass!

We'll train for four epochs and save a checkpoint, then unfreeze the whole network, run the learning rate finder, fine-tune for two more epochs, and save our model using the following commands:

In [None]:
learn = cnn_learner(data, models.resnet34, metrics=error_rate)
learn.fit_one_cycle(4)
learn.save('stage-1')
learn.unfreeze()
learn.lr_find()
learn.recorder.plot()
learn.fit_one_cycle(2, max_lr=slice(3e-5,3e-4))
learn.save('stage-2')
learn.load('stage-2');
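
A note on `max_lr=slice(3e-5,3e-4)`: as far as we understand FastAI 1.x, a slice gives the earliest layers the lower learning rate and the final layers the higher one, with the layer groups in between spaced multiplicatively. The sketch below only illustrates that idea in plain Python. `spread_learning_rates` is a hypothetical helper, and the choice of three layer groups is an assumption.

```python
def spread_learning_rates(start, stop, n_groups):
    """Return n_groups learning rates spaced geometrically from start to stop."""
    if n_groups == 1:
        return [stop]
    # Each group's rate is the previous one multiplied by the same factor.
    step = (stop / start) ** (1 / (n_groups - 1))
    return [start * step**i for i in range(n_groups)]


# Three layer groups is an assumption made for illustration.
for i, lr in enumerate(spread_learning_rates(3e-5, 3e-4, 3)):
    print(f"layer group {i}: lr = {lr:.2e}")
```

The point of the lower rate is that the early layers already hold general features from pre-training, so they need smaller updates than the newly added final layers.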

## Interpretation
Let's see how well our model did using a confusion matrix.

In [None]:
interp = ClassificationInterpretation.from_learner(learn)
interp.plot_confusion_matrix()
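
If you haven't met a confusion matrix before: each row is an actual class, each column is a predicted class, and every cell counts how many validation images fall into that combination. Here is a small sketch of the counting with made-up labels (these are not results from our model); `confusion_counts` is a hypothetical helper.

```python
from collections import Counter


def confusion_counts(actual, predicted, classes):
    """Return rows of counts: rows are actual classes, columns are predicted classes."""
    pairs = Counter(zip(actual, predicted))
    return [[pairs[(a, p)] for p in classes] for a in classes]


# Made-up labels for illustration only.
classes = ["dandelion", "grass"]
actual = ["grass", "grass", "dandelion", "dandelion", "grass"]
predicted = ["grass", "dandelion", "dandelion", "dandelion", "grass"]

for name, row in zip(classes, confusion_counts(actual, predicted, classes)):
    print(f"{name:>10}: {row}")
```

Counts on the diagonal are correct predictions; anything off the diagonal is a mistake worth looking at.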

## Cleaning Up

Some of our top losses aren't due to bad performance by our model. There are images in our dataset that shouldn't be there.

Using the `ImageCleaner` widget from `fastai.widgets`, we can prune our top losses, removing photos that don't belong.

Simply mark `delete` on any image that doesn't belong.

In [None]:
from fastai.widgets import *
db = (ImageList.from_folder(path)
                   .split_none()
                   .label_from_folder()
                   .transform(get_transforms(), size=224)
                   .databunch()
     )
learn_cln = cnn_learner(db, models.resnet34, metrics=error_rate)

learn_cln.load('stage-2');
ds, idxs = DatasetFormatter().from_toplosses(learn_cln)
ImageCleaner(ds, idxs, path)

Let's also remove duplicates using this widget:

In [None]:
ds, idxs = DatasetFormatter().from_similars(learn_cln)
ImageCleaner(ds, idxs, path, duplicates=True)
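
The retraining step reads its labels from `cleaned.csv`, so it can be reassuring to check how many images of each class survived the cleaning. This is a rough sketch that assumes the file has a header row with a column named `label`; both the column name and the `count_labels` helper are assumptions, so adjust them to match your file.

```python
import csv
from collections import Counter


def count_labels(csv_path, label_column="label"):
    """Count rows per label in a CSV file that has a header row."""
    with open(csv_path, newline="") as f:
        return Counter(row[label_column] for row in csv.DictReader(f))


# In the kernel (path and column name are assumptions):
# print(count_labels('/kaggle/working/greenr/cleaned.csv'))
```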

Awesome work! Now, let's retrain our model on our pruned dataset and make it even more accurate!

In [None]:
np.random.seed(42)
data = ImageDataBunch.from_csv(path, folder=".", valid_pct=0.2, csv_labels='cleaned.csv',
        ds_tfms=get_transforms(), size=224, num_workers=4).normalize(imagenet_stats)
learn = cnn_learner(data, models.resnet34, metrics=error_rate)
learn.fit_one_cycle(4)
learn.save('stage-1')
learn.unfreeze()
learn.lr_find()
learn.recorder.plot()
learn.fit_one_cycle(2, max_lr=slice(3e-5,3e-4))
learn.save('stage-2')
learn.load('stage-2');

Let's see if our confusion matrix has improved:

In [None]:
interp = ClassificationInterpretation.from_learner(learn)
interp.plot_confusion_matrix()

## Export Our Model

Great work! Now, let's export our model, which will create a file named `export.pkl` in our `/kaggle/working/greenr` directory. We can now use this model to make predictions on other images, and deploy it into production! Let's try it out:

In [None]:
learn.export()

In [None]:
defaults.device = torch.device('cpu')
img = open_image(path/'grass'/'00000019.jpg')
img

In [None]:
learn = load_learner(path)
pred_class,pred_idx,outputs = learn.predict(img)
pred_class
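
Besides the predicted class, `outputs` holds one score per class, which is handy when you want to show how confident the model is. The sketch below picks the highest score from a plain list; `describe_prediction` is a hypothetical helper and the numbers are made up for illustration.

```python
def describe_prediction(classes, probabilities):
    """Return (class_name, probability) for the highest-scoring class."""
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return classes[best], probabilities[best]


# Made-up probabilities; in the notebook you might pass the learner's class
# list and outputs.tolist() instead.
label, prob = describe_prediction(["dandelion", "grass"], [0.08, 0.92])
print(f"{label} ({prob:.0%})")
```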

If you got *grass* above, then your model works! Congrats, you've now created your own image dataset and trained your own image classifier, using FastAI!

If you'd like to deploy your model to production, simply download the `export.pkl` file and move it to wherever you want to make your predictions, like your phone or a web application :)

For a live demo of a deployed web app of greenr and its source code, please visit my repository!

[https://github.com/btphan95/greenr](https://github.com/btphan95/greenr)