# Tutorial 2: Image Classification with NimbusML

## Overview

1. Introduce transfer learning
2. Explore the data
3. Choose a pre-trained DNN and NimbusML classifier
4. Evaluate the classifier performance

## Part 1: What is Transfer Learning?

* Transfer learning adapts an existing DNN model to a custom task
* The starting point is a convolutional deep neural network (DNN) already trained on a large dataset
* Example large dataset: ImageNet, with about 14M images; pre-trained models typically use its 1000-class subset
* Training a DNN from scratch takes a lot of compute resources
* A pre-trained DNN is useful as an image featurizer for a small dataset

<img src="https://nimbusml.blob.core.windows.net/mlads/pretrained_model_V2.png" width=500 height=500 />

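The idea in the bullets above can be sketched without any deep learning library. In this toy example, `frozen_featurizer` is a hypothetical stand-in for a pre-trained DNN: its weights never change, and only the small linear classifier on top is trained. The data and function names are made up for illustration; this is not how NimbusML implements transfer learning.

```python
# Toy illustration of transfer learning: a frozen featurizer plus a small trainable head.
# `frozen_featurizer` is a hypothetical stand-in for a pre-trained DNN.

def frozen_featurizer(x):
    """Fixed, 'pre-trained' transform: these weights are never updated."""
    a, b = x
    return [a + b, a - b]

def train_perceptron(features, labels, epochs=20):
    """Train only the final linear classifier on the extracted features."""
    w = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        for f, y in zip(features, labels):
            score = sum(wi * fi for wi, fi in zip(w, f)) + bias
            pred = 1 if score > 0 else 0
            err = y - pred  # 0 when correct, +1 or -1 when wrong
            w = [wi + err * fi for wi, fi in zip(w, f)]
            bias += err
    return w, bias

def predict(w, bias, f):
    """Classify one feature vector with the trained linear head."""
    return 1 if sum(wi * fi for wi, fi in zip(w, f)) + bias > 0 else 0

raw_inputs = [(2.0, 1.0), (3.0, 0.5), (1.0, 2.0), (0.5, 3.0)]
labels = [1, 1, 0, 0]  # label is 1 when the first value is larger

# Step 1: run the frozen featurizer. Step 2: train only the classifier.
features = [frozen_featurizer(x) for x in raw_inputs]
w, bias = train_perceptron(features, labels)
predictions = [predict(w, bias, f) for f in features]
```

The tutorial's pipeline follows the same shape: a pre-trained network produces features, and a NimbusML binary classifier is the only part that is fit to the superhero data.
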
## Pre-Trained DNN Models in NimbusML

NimbusML can use any pre-trained TensorFlow or ONNX model. Two examples we will look at today are AlexNet and MobileNet, which were trained on ImageNet to identify images as one of 1000 different classes. With transfer learning, we can adapt them to classify images into the specialized classes we care about.

Pre-trained DNN files are large and are not part of the NimbusML package wheel file. They can be downloaded here:
- https://pytlcexpress.blob.core.windows.net/models/alexnet_frozen.pb 
- https://pytlcexpress.blob.core.windows.net/models/mobilenet_v2_1.0_224_quant_frozen.pb

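A minimal sketch for fetching the model files listed above, using only the Python standard library. The `models` directory and the helper names `local_model_path` and `download_model` are assumptions made for this example, and whether the URLs are still reachable is not verified here.

```python
import os
from urllib.parse import urlparse
from urllib.request import urlretrieve

# URLs copied from the list above.
MODEL_URLS = [
    "https://pytlcexpress.blob.core.windows.net/models/alexnet_frozen.pb",
    "https://pytlcexpress.blob.core.windows.net/models/mobilenet_v2_1.0_224_quant_frozen.pb",
]

def local_model_path(url, model_dir="models"):
    """Map a model URL to a local file path (model_dir is an assumed location)."""
    filename = os.path.basename(urlparse(url).path)
    return os.path.join(model_dir, filename)

def download_model(url, model_dir="models"):
    """Download the model file unless it is already present; return its local path."""
    path = local_model_path(url, model_dir)
    if not os.path.exists(path):
        os.makedirs(model_dir, exist_ok=True)
        urlretrieve(url, path)
    return path

# Example (requires network access, so it is left commented out):
# for url in MODEL_URLS:
#     print(download_model(url))
```

Skipping the download when the file already exists avoids re-fetching large files each time the notebook is rerun.
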
## Scenario: Superman vs Spiderman Classification

<img src="https://nimbusml.blob.core.windows.net/mlads/vs.jpg" width=300 height=300 />

For this tutorial, we picked superheroes because they are not among the ImageNet categories. Superman and Spiderman also both wear red and blue costumes, which makes them harder to tell apart.

Let's build a Superman vs Spiderman classifier in NimbusML, without much deep learning or image processing knowledge.

## Part 2: Exploratory Data Analysis

In [None]:
# Cell 2A
# General imports and helper functions
import os, sys, time
import pandas as pd
from IPython.display import Image
from tutorial_helper import show_gallery, label_counts, update_image_paths

In [None]:
# Cell 2B
# Load data
data = pd.read_csv('files/data/data.csv')

# Exploratory analysis
print(label_counts(data), '\n')
print('Data shape: {}\n'.format(data.shape))

# Update image paths to use the faster disk
update_image_paths(data)

data.head()

In [None]:
# Cell 2C
# Sample images
show_gallery(data)

In [None]:
# Cell 2D
# Prepare train and test data
from sklearn.model_selection import train_test_split
train, test = train_test_split(data,
                               train_size=0.8,
                               test_size=0.2, 
                               stratify=data.IsSuperman,
                               random_state=1) # Replace '1' with a positive integer of your choosing
print(label_counts(train, 'Training'))
print(label_counts(test, 'Test'))

X_train = train.drop(columns='IsSuperman')
y_train = train.IsSuperman
X_test = test.drop(columns='IsSuperman')
y_test = test.IsSuperman

train.head()

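Cell 2D passes `stratify=data.IsSuperman` so that the training and test sets keep the same class balance as the full dataset. The pure-Python sketch below illustrates that idea on made-up labels. `stratified_split` is a hypothetical helper written for this example, not scikit-learn's implementation.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.2, seed=1):
    """Return (train_idx, test_idx) with each class split by the same fraction."""
    rng = random.Random(seed)

    # Group row indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # Shuffle within each class, then carve off the same share for the test set.
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

labels = [1] * 30 + [0] * 70  # imbalanced toy labels: 30% positive
train_idx, test_idx = stratified_split(labels)
test_positive_rate = sum(labels[i] for i in test_idx) / len(test_idx)
train_positive_rate = sum(labels[i] for i in train_idx) / len(train_idx)
```

Both subsets end up with the same share of positive labels as the full list, which is what the `label_counts` printout in Cell 2D lets you check on the real data.
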
## Part 3: Feature Extraction with Pre-Trained DNNs
1. Build a pipeline to extract features
2. Run the pipeline and examine the output features
3. Add a binary classifier to the pipeline and train it

In [None]:
# Cell 3A
from nimbusml import Pipeline
from nimbusml.linear_model import LogisticRegressionBinaryClassifier, FastLinearBinaryClassifier, AveragedPerceptronBinaryClassifier
from nimbusml.ensemble import LightGbmBinaryClassifier
from nimbusml.feature_extraction.image import Loader, Resizer, PixelExtractor
from nimbusml.preprocessing import TensorFlowScorer
from nimbusml.preprocessing.schema import ColumnDropper

In [None]:
# Cell 3B
# Configure our Transfer Learning pipeline

# Choose DNN

## Mobilenet
# file         = 'mobilenet_v2_1.0_224_quant_frozen.pb'
# input_layer  = 'input'
# output_layer = 'output'
# wth, ht      = 224, 224

## MnasNet
file         = 'mnasnet_1.3_224.pb'
input_layer  = 'input'
output_layer = 'mnasnet_1/cell_15/output' # You can try changing this to 'output'
wth, ht      = 224, 224

# Choose final classifier
algo = LogisticRegressionBinaryClassifier()  # Try changing this to FastLinearBinaryClassifier() or AveragedPerceptronBinaryClassifier()

In [None]:
# Cell 3C
# Prepare and clean data

# Load image files as objects
loader = Loader(columns = {input_layer:'ImagePath'}) # columns = {output_col_name:input_col_name}
# Transform all images to same dimensions
resizer = Resizer(image_width=wth, 
                  image_height=ht, 
                  columns = [input_layer])  # equivalent to columns = {input_layer: input_layer}
# Extract pixels into arrays
pix_extractor = PixelExtractor(columns = [input_layer],
                               interleave_argb = True)

pipeline = Pipeline([loader, resizer, pix_extractor])

# pipeline.clone().fit_transform(X_train.head())

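`PixelExtractor` turns each resized image into a flat numeric vector. The `interleave_argb` flag, as its name suggests, controls whether channel values are interleaved per pixel or grouped by channel. The pure-Python sketch below shows the two layouts on a made-up 2x2 RGB image. The helper names are hypothetical and this is not NimbusML's implementation.

```python
# A made-up 2x2 image: each pixel is an (R, G, B) tuple.
image = [
    [(255, 0, 0), (0, 255, 0)],
    [(0, 0, 255), (10, 20, 30)],
]

def flatten_interleaved(img):
    """Per-pixel order: R, G, B, R, G, B, ..."""
    return [channel for row in img for pixel in row for channel in pixel]

def flatten_planar(img):
    """Per-channel order: all R values, then all G values, then all B values."""
    pixels = [pixel for row in img for pixel in row]
    return [pixel[c] for c in range(3) for pixel in pixels]

interleaved = flatten_interleaved(image)
planar = flatten_planar(image)
```

TensorFlow image models commonly expect the interleaved (channels-last) layout, which is presumably why Cell 3C sets `interleave_argb = True`.
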
In [None]:
# Cell 3D
# Add pre-trained model
dnn_featurizer = TensorFlowScorer(model=file,
                                  columns={output_layer:input_layer})
pipeline.append(dnn_featurizer)

# Remove extraneous input columns
remove_inputs = ColumnDropper(columns=[input_layer, 'ImagePath'])
pipeline.append(remove_inputs)

In [None]:
# Cell 3E
# Train a binary classifier
pipeline.append(algo)

clf = pipeline.fit(X_train, y_train)

## Part 4: Evaluate Classifier Performance
1. Look at the predictions
2. Calculate confusion matrix
3. Examine the classifier mistakes
4. Save your classifier model

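To make step 2 concrete, here is a minimal pure-Python sketch of how a binary confusion matrix and a few derived metrics are computed. The labels are made up, and `confusion_counts` is a hypothetical helper. Its output uses the same layout as scikit-learn's `confusion_matrix`: rows are true labels, columns are predicted labels.

```python
def confusion_counts(y_true, y_pred):
    """Return [[TN, FP], [FN, TP]] for binary 0/1 labels (rows: true, columns: predicted)."""
    matrix = [[0, 0], [0, 0]]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix

# Made-up labels: 1 = Superman, 0 = Spiderman.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]

matrix = confusion_counts(y_true, y_pred)
(tn, fp), (fn, tp) = matrix

accuracy = (tp + tn) / len(y_true)   # share of all predictions that are correct
precision = tp / (tp + fp)           # share of predicted Supermen that are Superman
recall = tp / (tp + fn)              # share of actual Supermen that were found
```

The off-diagonal entries (`fp` and `fn`) correspond to the two kinds of mistakes examined in Cells 4G and 4H.
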
In [None]:
# Cell 4A
# Display the first test image
Image(X_test['ImagePath'].iloc[0])

In [None]:
# Cell 4B
# Predict the label for that first test image
clf.predict(X_test[0:1])

In [None]:
# Cell 4C
# Run on full test set
start = time.time()

metrics, predictions = clf.test(X_test, y_test, output_scores=True)

finish = time.time()
test_time = finish - start
print("Test time: {0:.2f} seconds".format(test_time))

metrics

In [None]:
# Cell 4D
# Join predictions with paths and original label
path_and_label = test.reset_index()[['ImagePath', 'IsSuperman']].rename(columns={'IsSuperman': 'Label'})
predictions = pd.concat([path_and_label, predictions], axis=1)
predictions.head()

In [None]:
# Cell 4E
# View confusion matrix
from sklearn.metrics import confusion_matrix
confusion_matrix(predictions.Label, predictions.PredictedLabel)

In [None]:
# Cell 4F
# Sort test images by predicted probability
predictions['IsMistake'] = predictions.Label != predictions.PredictedLabel
predictions.sort_values('Probability', inplace=True)
show_gallery(predictions, num_images=100, randomize=False, add_prob=True, flag_mistakes=True)

In [None]:
# Cell 4G
# View mistakes: Superman classified incorrectly
superman_mistakes = predictions[(predictions.Label == 1) & predictions.IsMistake] 
show_gallery(superman_mistakes, add_name=True, add_prob=True)

In [None]:
# Cell 4H
# View mistakes: Spiderman classified incorrectly
spiderman_mistakes = predictions[(predictions.Label == 0) & predictions.IsMistake] 
show_gallery(spiderman_mistakes, add_name=True, add_prob=True)

In [None]:
# Cell 4I
# Save your image classifier for use in any Python or .NET app
clf.save_model("superheroes.zip")

## Recap
* Introduced what transfer learning is and how to use it in NimbusML
* Built a Superman vs Spiderman image classifier without needing much deep learning or image processing knowledge
* Saved a trained model that you can take home and reuse
