# Efficient training for image classification

### _Transfer learning using Inception Package - Local Run Experience_
Traditionally, image classification required a very large corpus of training data, often millions of images, which may not be available, plus a long training run on those images, which is expensive and time consuming. Transfer learning changes that. It can be readily used with Cloud ML Engine and, through the ML toolbox in Datalab, without deep knowledge of image classification algorithms.

This notebook codifies the capabilities discussed in this [blog post](https://cloud.google.com/blog/big-data/2016/12/how-to-train-and-classify-images-using-google-cloud-machine-learning-and-cloud-dataflow). In a nutshell, it uses the pre-trained Inception model as a starting point and then uses transfer learning to train it further on additional, customer-specific images. Simple flower images are used for illustration. Compared to training from scratch, the training data requirements, time and costs are drastically reduced.

This notebook runs all operations in the Datalab container without calling the CloudML API. These are therefore called "local" operations, even though Datalab itself most often runs on a GCE VM. See the corresponding cloud notebook for the cloud experience, which only adds the --cloud parameter and some configuration to the local commands. The purpose of local work is initial prototyping and debugging on small-scale data, often a suitable sample (say 0.1 - 1%) of the full data. The same basic steps can then be repeated with much larger datasets in the cloud.
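
As a rough illustration of taking a small sample for local prototyping, the sketch below draws a reproducible random subset of rows. The function name `sample_rows`, the bucket name, and the generated rows are hypothetical and are not part of the flower dataset or the ML toolbox.

```python
import random


def sample_rows(rows, fraction, seed=0):
  """Return a reproducible random sample holding roughly `fraction` of rows."""
  rng = random.Random(seed)  # fixed seed so the same sample is drawn each run
  k = max(1, int(round(len(rows) * fraction)))  # keep at least one row
  return rng.sample(rows, k)


# Hypothetical stand-in for a full list of (image_url, label) rows.
full_rows = [('gs://example-bucket/img_%04d.jpg' % i,
              'daisy' if i % 2 else 'tulips') for i in range(3000)]

sample = sample_rows(full_rows, 0.01)  # a 1% sample
print(len(sample))
```

Fixing the seed keeps the local sample stable between runs, which makes debugging easier to reproduce.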

## Setup
All data is available under gs://cloud-datalab/sampledata/flower. eval100 is a subset of eval300, which is a subset of eval670. The train data is nested the same way.

In [1]:
!mkdir -p /content/flowerdata

mkdir: /content/flowerdata: Permission denied


In [2]:
!gsutil -m cp gs://cloud-datalab/sampledata/flower/* /content/flowerdata

CommandException: Destination URL must name a directory, bucket, or bucket
subdirectory for the multiple source form of the cp command.

Define directories for preprocessing, model, and prediction.

In [3]:
import mltoolbox.image.classification as model
from google.datalab.ml import *

worker_dir = '/content/datalab/tmp/flower'
preprocessed_dir = worker_dir + '/flowerrunlocal'
model_dir = worker_dir + '/tinyflowermodellocal'
prediction_dir = worker_dir + '/flowermodelevallocal'
images_dir = worker_dir + '/images'
local_train_file = '/content/flowerdata/train200local.csv'
local_eval_file = '/content/flowerdata/eval100local.csv'

ModuleNotFoundError: No module named 'mltoolbox'

In [None]:
!mkdir -p {images_dir}

For best efficiency, we download the images to local disk and create training and evaluation files that reference local paths instead of GCS paths. The original training files that reference GCS image paths work too, but processing is a bit slower.

In [None]:
import csv
import datalab.storage as gcs
import os


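def download_images(input_csv, output_csv, images_dir):
  """Copy each image listed in input_csv to images_dir and write output_csv with local paths."""
  with open(input_csv) as csvfile:
    data = list(csv.DictReader(csvfile, fieldnames=['image_url', 'label']))
  for x in data:
    url = x['image_url']
    # Files are keyed by base name only, so identical names from different folders would overwrite each other.
    out_file = os.path.join(images_dir, os.path.basename(url))
    # Image content is binary, so open the file in binary mode.
    with open(out_file, 'wb') as f:
      f.write(gcs.Item.from_url(url).read_from())
    x['image_url'] = out_file

  # Write the same rows back out, now pointing at the local copies.
  with open(output_csv, 'w') as w:
    csv.DictWriter(w, fieldnames=['image_url', 'label']).writerows(data)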


download_images('/content/flowerdata/train200.csv', local_train_file, images_dir)    
download_images('/content/flowerdata/eval100.csv', local_eval_file, images_dir)

The effect of the code above is easiest to see by comparing the first rows of the original and the rewritten CSV files.

In [None]:
!head -n 5 /content/flowerdata/train200.csv

In [None]:
!head -n 5 {local_train_file}
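
The path rewrite that `download_images` applies to each row can be sketched in isolation. The helper name `to_local_path` is hypothetical. It mirrors the `os.path.join(images_dir, os.path.basename(url))` expression used in the download code.

```python
import os


def to_local_path(gcs_url, images_dir):
  """Map a GCS image URL to a file path under images_dir, keeping only the base name."""
  return os.path.join(images_dir, os.path.basename(gcs_url))


url = 'gs://cloud-ml-data/img/flower_photos/daisy/15207766_fc2f1d692c_n.jpg'
print(to_local_path(url, '/content/datalab/tmp/flower/images'))
```

Because only the base name is kept, two images with the same file name in different label folders would map to the same local path. If that can happen in your data, include the parent folder in the local name.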

## Preprocess
Preprocessing uses a Dataflow pipeline to convert the image format, resize the images, and run each converted image through a pre-trained model to get its features, or embeddings. You can also do this step with alternative technologies such as Spark or plain Python code if you like.

The following cell takes ~5 min on an n1-standard-1 VM. Preprocessing the full 3000 images takes about one hour.

In [None]:
# instead of local_train_file, it can take '/content/flowerdata/train200.csv' too, but processing will be slower.
train_set = CsvDataSet(local_train_file, schema='image_url:STRING,label:STRING')
model.preprocess(train_set, preprocessed_dir)
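
To make the shape of this step concrete, here is a minimal plain-Python sketch of the idea: resize each image to a fixed size, then turn it into a fixed-length feature vector stored next to its label. Everything in it is an assumption for illustration. Images are small grayscale grids, and `toy_embedding` is a trivial stand-in for the pre-trained model, not what `model.preprocess` actually computes.

```python
def resize_nearest(pixels, out_h, out_w):
  """Resize a 2D grid of pixel values using nearest-neighbour sampling."""
  in_h, in_w = len(pixels), len(pixels[0])
  return [[pixels[r * in_h // out_h][c * in_w // out_w]
           for c in range(out_w)]
          for r in range(out_h)]


def toy_embedding(pixels):
  """Stand-in for a pre-trained model: one normalised mean value per row."""
  return [sum(row) / (255.0 * len(row)) for row in pixels]


def preprocess_rows(rows, size=4):
  """Turn (pixels, label) rows into dicts holding a fixed-length embedding and the label."""
  out = []
  for pixels, label in rows:
    resized = resize_nearest(pixels, size, size)
    out.append({'embedding': toy_embedding(resized), 'label': label})
  return out


# Two hypothetical images of different sizes.
rows = [
    ([[255] * 8 for _ in range(8)], 'daisy'),
    ([[0, 255], [255, 0]], 'tulips'),
]
features = preprocess_rows(rows)
for item in features:
  print(item['label'], item['embedding'])
```

The point of computing embeddings once up front is that training afterwards only needs these small vectors, not the original images.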

## Train
The next step is to train the Inception model with the preprocessed images using transfer learning. Transfer learning retains most of the Inception model but replaces the final layer, as shown in the image.

![inception](https://cloud.google.com/blog/big-data/2016/12/images/148114735559140/image-classification-3.png)
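
Replacing the final layer can be illustrated with a small sketch: keep the embeddings fixed and train only a softmax classifier on top of them. This is a toy version in plain Python under stated assumptions, not the ML toolbox implementation. The function names, the two-dimensional embeddings, and the hyperparameters are all hypothetical.

```python
import math
import random


def softmax(scores):
  """Convert raw class scores to probabilities (shifted by the max for stability)."""
  m = max(scores)
  exps = [math.exp(s - m) for s in scores]
  total = sum(exps)
  return [e / total for e in exps]


def class_scores(weights, biases, x):
  """Linear scores of the final layer for one embedding."""
  return [sum(w * v for w, v in zip(weights[c], x)) + biases[c]
          for c in range(len(biases))]


def train_final_layer(embeddings, labels, num_classes, steps=200, lr=0.5, seed=0):
  """Fit only the final softmax layer with SGD; the embeddings stay frozen."""
  rng = random.Random(seed)
  dim = len(embeddings[0])
  weights = [[0.0] * dim for _ in range(num_classes)]
  biases = [0.0] * num_classes
  for _ in range(steps):
    i = rng.randrange(len(embeddings))
    x, y = embeddings[i], labels[i]
    probs = softmax(class_scores(weights, biases, x))
    for c in range(num_classes):
      # Gradient of cross-entropy with respect to the score of class c.
      grad = probs[c] - (1.0 if c == y else 0.0)
      biases[c] -= lr * grad
      for j in range(dim):
        weights[c][j] -= lr * grad * x[j]
  return weights, biases


def predict_class(weights, biases, x):
  """Return the index of the highest-scoring class."""
  scores = class_scores(weights, biases, x)
  return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical frozen embeddings for two classes.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = [0, 0, 1, 1]
weights, biases = train_final_layer(embeddings, labels, num_classes=2)
predictions = [predict_class(weights, biases, x) for x in embeddings]
print(predictions)
```

Only `weights` and `biases` are learned here, which is why transfer learning needs far less data and time than training the whole network.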


In [None]:
import logging
logging.getLogger().setLevel(logging.INFO)
# Positional arguments: input directory, batch size, maximum training steps, output directory.
model.train(preprocessed_dir, 30, 800, model_dir)
logging.getLogger().setLevel(logging.WARNING)

Run TensorBoard to visualize the completed training. Review accuracy and loss in particular.

In [None]:
tb_id = TensorBoard.start(model_dir)

We can check the TF summary events from training.

In [None]:
summary = Summary(model_dir)
summary.list_events()

In [None]:
summary.plot('accuracy')
summary.plot('loss')

## Predict
Let's start with a quick check by taking a couple of images and using the model to predict the type of flower locally.

In [None]:
images = [
  'gs://cloud-ml-data/img/flower_photos/daisy/15207766_fc2f1d692c_n.jpg',
  'gs://cloud-ml-data/img/flower_photos/tulips/6876631336_54bf150990.jpg'
]
# Set show_image to False to skip displaying the pictures.
model.predict(model_dir, images, show_image=True)

## Evaluate
We did a quick test of the model using a few samples, but to understand how well it performs we need to evaluate it against a much larger amount of labeled data. Earlier, we set aside images for that purpose (the evaluation CSV file). Next, we run batch prediction and compare the results with the previously labeled targets.

The following batch prediction and loading of results takes ~3 minutes.

In [None]:
import google.datalab.bigquery as bq
bq.Dataset('flower').create()

In [None]:
eval_set = CsvDataSet(local_eval_file, schema='image_url:STRING,label:STRING')
model.batch_predict(eval_set, model_dir, output_bq_table='flower.eval_results_local')

Now that we have the results and expected results loaded in a BigQuery table, let's start analyzing the errors and plot the confusion matrix.

In [None]:
%%bq query --name wrong_prediction

SELECT * FROM flower.eval_results_local WHERE target != predicted

In [None]:
wrong_prediction.execute().result()

A confusion matrix is a common way to see where the model is confused. For each actual label it counts how often each label was predicted, so cases where the predicted result did not match the expected result stand out.

In [None]:
ConfusionMatrix.from_bigquery('flower.eval_results_local').plot()
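
To show what the plot aggregates, here is a minimal sketch that counts (target, predicted) pairs in plain Python. The column names follow the queries in this notebook. The rows themselves are made up for illustration, and `confusion_counts` is a hypothetical helper, not part of `ConfusionMatrix`.

```python
from collections import Counter


def confusion_counts(rows):
  """Count how often each (target, predicted) pair occurs."""
  return Counter((target, predicted) for target, predicted in rows)


# Hypothetical evaluation results as (target, predicted) pairs.
rows = [
    ('daisy', 'daisy'),
    ('daisy', 'tulips'),
    ('tulips', 'tulips'),
    ('tulips', 'tulips'),
    ('roses', 'daisy'),
]
counts = confusion_counts(rows)

# One matrix row per actual label, one column per predicted label.
label_names = sorted({t for t, _ in rows} | {p for _, p in rows})
for target in label_names:
  print(target, [counts[(target, predicted)] for predicted in label_names])
```

Entries on the diagonal are correct predictions, and everything off the diagonal is a confusion between two classes.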

More advanced analysis can be done using the feature slice view. For the feature slice view, let's define SQL queries that compute accuracy and log loss and then use the metrics.

In [None]:
%%bq query --name accuracy

SELECT
  target,
  SUM(CASE WHEN target=predicted THEN 1 ELSE 0 END) as correct,
  COUNT(*) as total,
  SUM(CASE WHEN target=predicted THEN 1 ELSE 0 END)/COUNT(*) as accuracy
FROM
  flower.eval_results_local
GROUP BY
  target

In [None]:
accuracy.execute().result()
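
For readers less familiar with SQL, the per-class accuracy query above corresponds roughly to the following plain-Python sketch. The rows are invented for illustration and `accuracy_by_target` is a hypothetical helper name.

```python
from collections import defaultdict


def accuracy_by_target(rows):
  """Return {target: (correct, total, accuracy)} for (target, predicted) rows."""
  correct = defaultdict(int)
  total = defaultdict(int)
  for target, predicted in rows:
    total[target] += 1
    if target == predicted:
      correct[target] += 1
  return {t: (correct[t], total[t], correct[t] / float(total[t])) for t in total}


# Hypothetical evaluation results as (target, predicted) pairs.
rows = [
    ('daisy', 'daisy'),
    ('daisy', 'tulips'),
    ('tulips', 'tulips'),
    ('tulips', 'tulips'),
]
per_class = accuracy_by_target(rows)
for target in sorted(per_class):
  print(target, per_class[target])
```

Grouping by `target` rather than by `predicted` answers the question "of the real daisies, how many did the model find?".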

In [None]:
%%bq query --name logloss

SELECT feature, AVG(-logloss) AS logloss, COUNT(*) AS count FROM
(
  SELECT feature, CASE WHEN correct=1 THEN LOG(prob) ELSE LOG(1-prob) END AS logloss
  FROM
  (
    SELECT
      target AS feature,
      CASE WHEN target=predicted THEN 1 ELSE 0 END AS correct,
      target_prob AS prob
    FROM flower.eval_results_local
  )
)
GROUP BY feature

In [None]:
FeatureSliceView().plot(logloss)
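
The log loss query can be mirrored in plain Python to make the arithmetic explicit. This sketch follows the query as written: for each row it takes `LOG(prob)` when the prediction is correct and `LOG(1-prob)` otherwise, then averages the negated values per target. The clipping with `eps` is an addition that the query does not have, included only so that `math.log` never receives zero. The rows and the name `logloss_by_target` are hypothetical.

```python
import math
from collections import defaultdict


def logloss_by_target(rows, eps=1e-15):
  """Return {target: (mean negative log term, count)} for (target, predicted, target_prob) rows."""
  sums = defaultdict(float)
  counts = defaultdict(int)
  for target, predicted, target_prob in rows:
    # Assumption: clip the probability so the logarithm stays finite.
    p = min(max(target_prob, eps), 1.0 - eps)
    term = math.log(p) if target == predicted else math.log(1.0 - p)
    sums[target] += -term
    counts[target] += 1
  return {t: (sums[t] / counts[t], counts[t]) for t in counts}


# Hypothetical evaluation rows: (target, predicted, target_prob).
rows = [
    ('daisy', 'daisy', 0.5),
    ('daisy', 'tulips', 0.5),
    ('tulips', 'tulips', 0.9),
]
result = logloss_by_target(rows)
for target in sorted(result):
  print(target, result[target])
```

Lower values indicate that the model assigned probabilities more in line with the outcome for that class.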

## Clean up

In [None]:
import shutil
import google.datalab.bigquery as bq

TensorBoard.stop(tb_id)
bq.Table('flower.eval_results_local').delete()
shutil.rmtree(worker_dir)

## Recap
In this notebook, we covered local preprocessing, training, prediction and evaluation. We started from data in GCS (CSV files plus images), used transfer learning for very fast training, and then used BigQuery to analyze model performance. In the next notebook, we will use CloudML APIs, which scale much better to larger datasets. The syntax and analyses will remain the same.