### Preprocessing

For a local run, processing the full dataset is impractical, so we randomly sample it down to about 200 to 300 instances (the query below filters on `rand()`). Preprocessing the sample takes about 15 minutes.

In [1]:
import mltoolbox.image.classification as model
from google.datalab.ml import *

worker_dir = '/content/datalab/tmp/coast'
preprocessed_dir = worker_dir + '/coast300'
model_dir = worker_dir + '/model300'

ModuleNotFoundError: No module named 'mltoolbox'

In [None]:
train_set = BigQueryDataSet('SELECT image_url, label FROM coast.train WHERE rand() < 0.04')
model.preprocess(train_set, preprocessed_dir)
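
The query above filters on `rand()`, so each run draws a different subset. If a repeatable sample is wanted, one option is to hash a stable key and keep the rows whose hash falls under a threshold. The following is a minimal sketch of that idea in plain Python; the helper `in_sample`, the bucket count, and the `gs://example-bucket/...` URLs are all hypothetical and not part of the notebook.

```python
import hashlib


def in_sample(key: str, rate: float, buckets: int = 10_000) -> bool:
    """Return True if `key` falls inside a deterministic sample of size ~rate.

    Hypothetical helper: the key is hashed, mapped to one of `buckets`
    buckets, and kept when its bucket index is below rate * buckets.
    """
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % buckets
    return bucket < rate * buckets


# Made-up image URLs, standing in for the image_url column.
urls = [f"gs://example-bucket/img_{i:05d}.jpg" for i in range(5000)]

# Keep roughly 4% of the rows; rerunning this yields the same subset.
sample = [u for u in urls if in_sample(u, 0.04)]
print(len(sample), "of", len(urls), "rows kept")
```

The same approach can be expressed inside the query by applying a hash function such as BigQuery's `FARM_FINGERPRINT` to `image_url` instead of calling `rand()`.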

### Train

To get help on a method, run it with '??' appended (for example, 'model.train??'). The method signature then appears in a pane on the right side.

In [None]:
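import logging
# Show INFO-level logs while training, then restore the quieter WARNING level.
logging.getLogger().setLevel(logging.INFO)
# The positional arguments appear to be: input dir, batch size (30), max steps (1000), output dir.
# Confirm against the signature with model.train??
model.train(preprocessed_dir, 30, 1000, model_dir)
logging.getLogger().setLevel(logging.WARNING)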
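
The `??` help is an IPython feature. Outside a notebook, the standard library's `inspect.signature` gives a comparable view of a function's parameters. The sketch below uses a hypothetical stand-in function, `demo_train`, whose parameter names are made up for illustration and are not taken from `mltoolbox`.

```python
import inspect


def demo_train(input_dir, batch_size, max_steps, output_dir, checkpoint=None):
    """Hypothetical stand-in for a training function (not the mltoolbox API)."""
    return None


# Inspect the callable's parameters without running it.
sig = inspect.signature(demo_train)
print(sig)

# Parameters that have no default value must be supplied by the caller.
required = [
    name
    for name, param in sig.parameters.items()
    if param.default is inspect.Parameter.empty
]
print("required:", required)
```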

You can start a hosted TensorBoard instance to inspect the training events.

In [None]:
tb_id = TensorBoard.start(model_dir)

### Evaluation

Our model was trained on a small subset of the data, so its accuracy is not very high.

First, we can check the TF summary events from training.

In [None]:
summary = Summary(model_dir)
summary.list_events()

In [None]:
summary.plot('accuracy')
summary.plot('loss')

We will evaluate the model further, on more data, using batch prediction.

### Prediction

Instant prediction:

In [None]:
# gs://tamucc_coastline/esi_images/IMG_2849_SecDE_Spr12.jpg,3B
# gs://tamucc_coastline/esi_images/IMG_0047_SecBC_Spr12.jpg,10A
# gs://tamucc_coastline/esi_images/IMG_0617_SecBC_Spr12.jpg,7
# gs://tamucc_coastline/esi_images/IMG_2034_SecEGH_Sum12_Pt2.jpg,10A
images = [
  'gs://tamucc_coastline/esi_images/IMG_2849_SecDE_Spr12.jpg',
  'gs://tamucc_coastline/esi_images/IMG_0047_SecBC_Spr12.jpg',
  'gs://tamucc_coastline/esi_images/IMG_0617_SecBC_Spr12.jpg',
  'gs://tamucc_coastline/esi_images/IMG_2034_SecEGH_Sum12_Pt2.jpg'
]
# Set show_image to True to see the images
model.predict(model_dir, images, show_image=False)

Batch prediction. Note that the eval data is sampled as well, so only about 200 instances are used.

In [None]:
eval_set = BigQueryDataSet('SELECT * FROM coast.eval WHERE rand() < 0.1')
model.batch_predict(eval_set, model_dir, output_bq_table='coast.eval200tinymodel')

In [None]:
ConfusionMatrix.from_bigquery('select * from coast.eval200tinymodel').plot()
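
To make clear what the plot summarizes, here is a minimal sketch of how confusion-matrix counts can be built from (target, predicted) pairs. It does not reproduce the `ConfusionMatrix` class; the function `confusion_counts` and the sample rows are hypothetical, and only the column names `target` and `predicted` come from the queries in this notebook.

```python
from collections import Counter


def confusion_counts(rows):
    """Count how often each (target, predicted) pair occurs."""
    return Counter((target, predicted) for target, predicted in rows)


# Made-up prediction results: (target, predicted).
rows = [
    ("3B", "3B"),
    ("10A", "7"),
    ("10A", "10A"),
    ("7", "7"),
    ("3B", "10A"),
]

counts = confusion_counts(rows)

# Rows of the matrix are true labels, columns are predicted labels.
labels = sorted({label for pair in counts for label in pair})
matrix = [[counts[(t, p)] for p in labels] for t in labels]

print("labels:", labels)
for label, row in zip(labels, matrix):
    print(label, row)
```

Entries on the diagonal are correct predictions; everything off the diagonal is a misclassification.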

Compute accuracy per label.

In [None]:
%%bq query --name accuracy
SELECT
  target,
  SUM(CASE WHEN target=predicted THEN 1 ELSE 0 END) as correct,
  COUNT(*) as total,
  SUM(CASE WHEN target=predicted THEN 1 ELSE 0 END)/COUNT(*) as accuracy
FROM
  coast.eval200tinymodel
GROUP BY
  target

In [None]:
accuracy.execute().result()
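
For readers less familiar with SQL, the aggregation in the query above can be sketched in plain Python. The function `accuracy_per_label` and the sample rows are hypothetical; the sketch only mirrors the grouping logic (correct, total, and correct/total per target).

```python
from collections import defaultdict


def accuracy_per_label(rows):
    """Return {target: (correct, total, accuracy)} for (target, predicted) rows."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for target, predicted in rows:
        total[target] += 1
        correct[target] += int(target == predicted)
    return {
        label: (correct[label], total[label], correct[label] / total[label])
        for label in total
    }


# Made-up prediction results: (target, predicted).
rows = [
    ("3B", "3B"),
    ("10A", "7"),
    ("10A", "10A"),
    ("7", "7"),
    ("3B", "10A"),
]

result = accuracy_per_label(rows)
for label in sorted(result):
    print(label, result[label])
```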

You can also view the results using Feature-Slice-View. This time the metric is log loss per label instead of accuracy.

In [None]:
%%bq query --name logloss

SELECT feature, AVG(-logloss) as logloss, COUNT(*) as count FROM
(
  SELECT feature, CASE WHEN correct=1 THEN LOG(prob) ELSE LOG(1-prob) END as logloss
  FROM
  (
    SELECT
      target as feature,
      CASE WHEN target=predicted THEN 1 ELSE 0 END as correct,
      target_prob as prob
    FROM coast.eval200tinymodel
  )
)
GROUP BY feature

In [None]:
FeatureSliceView().plot(logloss)
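
The log loss query above can be read as the following plain-Python sketch. It mirrors the query as written: `LOG(prob)` when the prediction matches the target, `LOG(1-prob)` otherwise, then the negated average per label. The function `logloss_per_label` and the sample rows are hypothetical; the column names `target`, `predicted`, and `target_prob` are the ones used in the query.

```python
import math
from collections import defaultdict


def logloss_per_label(rows):
    """Return {target: (mean log loss, count)} for (target, predicted, target_prob) rows."""
    losses = defaultdict(list)
    for target, predicted, target_prob in rows:
        # Same CASE expression as the query: prob if correct, else 1 - prob.
        p = target_prob if target == predicted else 1.0 - target_prob
        losses[target].append(-math.log(p))
    return {
        label: (sum(values) / len(values), len(values))
        for label, values in losses.items()
    }


# Made-up prediction results: (target, predicted, target_prob).
rows = [
    ("7", "7", 0.5),
    ("7", "3B", 0.5),
    ("10A", "10A", 0.9),
]

result = logloss_per_label(rows)
for label in sorted(result):
    print(label, result[label])
```

As in the query, a probability of exactly 0 or 1 on the wrong side would make the logarithm undefined, so real data may need clipping before this step.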

## Clean up

In [None]:
import shutil
import google.datalab.bigquery as bq

TensorBoard.stop(tb_id)
bq.Table('coast.eval200tinymodel').delete()
shutil.rmtree(worker_dir)