# Tutorial 2: Image Classification with PyTLC

# Overview

* Introduce pre-trained models
* Build a superhero image classifier with pre-trained models in PyTLC
* Evaluate the classifier performance

# What is a Pre-Trained Model?

* A convolutional deep neural network (DNN) trained on a large dataset
* Example large dataset: ImageNet, with about 14M images (pre-trained models typically use its 1,000-class subset)
* Training a DNN from scratch takes a lot of compute resources
* Useful as an image featurizer when you only have a small dataset

<img src="files/pretrained_model_V2.png" width=500 height=500 />

# Pre-Trained DNN Models in PyTLC

PyTLC comes with the following pre-trained DNN models:

| DNN Model Name | Input Size | Output Size |
| --- | --- | --- |
| Resnet18 | 224 x 224 | 512 |
| Resnet50 | 224 x 224 | 2048 |
| Resnet101 | 224 x 224 | 2048 |
| Alexnet | 227 x 227 | 4096 |

* Pre-trained DNN files are large, so they are not included in the PyTLC package wheel
* PyTLC automatically downloads the DNNs on first use
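
The table above can be captured as a small lookup, which is handy when choosing the `Resizer` dimensions later in the tutorial. This is only a sketch: `MODEL_SPECS` and `resizer_kwargs` are hypothetical names introduced here for illustration and are not part of PyTLC. Only the sizes come from the table.

```python
# Input and output sizes copied from the table above.
# MODEL_SPECS and resizer_kwargs are illustrative helpers, not PyTLC APIs.
MODEL_SPECS = {
    "Resnet18": {"input_size": (224, 224), "output_size": 512},
    "Resnet50": {"input_size": (224, 224), "output_size": 2048},
    "Resnet101": {"input_size": (224, 224), "output_size": 2048},
    "Alexnet": {"input_size": (227, 227), "output_size": 4096},
}


def resizer_kwargs(model_name):
    """Return the image width and height that match a model's input size."""
    width, height = MODEL_SPECS[model_name]["input_size"]
    return {"image_width": width, "image_height": height}


print(resizer_kwargs("Alexnet"))  # {'image_width': 227, 'image_height': 227}
```

The output size is the number of feature columns each image is turned into, so it also tells you how wide the feature table will be.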

# Scenario: Superman vs Spiderman Classification

<img src="files/vs.jpg" width=300 height=300 />

For this tutorial, we've picked superheroes because they are not among the ImageNet categories. Superman and Spiderman also both wear red and blue costumes, which makes the classification harder.

Let's build a Superman vs Spiderman classifier in PyTLC, without much deep learning or image processing knowledge.

## Part 1: Exploratory Data Analysis

In [None]:
# General imports and helper functions
import os, sys
import pandas as pd
import matplotlib.pyplot as plt
from IPython.display import Image
from tutorial_helper import show_gallery, get_dimensions, label_counts, update_image_paths

In [None]:
# Cell 1A
# Load data
data = pd.read_csv('files/data/data.csv')

# Exploratory analysis
print(data.head(), '\n')
print(label_counts(data), '\n')
print('Data shape: {}\n'.format(data.shape))

# Update image paths to use the faster disk
update_image_paths(data)

In [None]:
# Cell 1B
# Sample images
show_gallery(data)

In [None]:
# Cell 1C
# Explore image dimensions
plt.scatter(*get_dimensions(data), s=5)
plt.xlim(150,350)
plt.ylim(150,350)
plt.xlabel('Image Width')
plt.ylabel('Image Height')
plt.title('Image Dimensions')
plt.show()

## Part 2: Feature Extraction with Pre-Trained DNNs
1. Build a pipeline to extract features
2. Run the pipeline and examine the output features

In [None]:
from microsoftml_scikit import Pipeline
from microsoftml_scikit.linear_model import LogisticRegressionBinaryClassifier
from microsoftml_scikit.feature_extraction.image import DnnFeaturizer, Loader, Resizer, PixelExtractor

In [None]:
# Cell 2A
# Create feature extraction pipeline
feature_extraction_pipeline = Pipeline([
    # Load image from path
    Loader() << {'Features':'ImagePath'},
    
    # Resize image to the input size expected by the pretrained model
    Resizer(image_width=xxx, image_height=xxx, resizing='IsoPad'), # Replace xxx with the correct input size
    
    # Read pixel data as arrays
    PixelExtractor(),
    
    # Run the pretrained DNN model
    DnnFeaturizer(dnn_model='xxxx')]) # Replace xxxx with one of these: Resnet18 Resnet50 Resnet101 Alexnet 

# Extract features
X_y = feature_extraction_pipeline.fit_transform(data.head(3))
X_y
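
To make the resizing step more concrete, here is a rough sketch of the geometry behind an isotropic resize with padding. It assumes that `'IsoPad'` means "scale the image uniformly so it fits inside the target, then pad the remaining space", which this tutorial does not spell out. The function `iso_pad_geometry` is a hypothetical helper for illustration, not the library's implementation.

```python
def iso_pad_geometry(width, height, target_width, target_height):
    """Compute the scaled size and padding for an aspect-preserving resize.

    Assumption: the image is scaled by one factor so that it fits inside the
    target, and the leftover space is split between the two sides.
    """
    scale = min(target_width / width, target_height / height)
    new_w = min(target_width, max(1, round(width * scale)))
    new_h = min(target_height, max(1, round(height * scale)))

    pad_x = target_width - new_w
    pad_y = target_height - new_h
    left, top = pad_x // 2, pad_y // 2

    # Padding is reported as (left, top, right, bottom).
    return {
        "scaled": (new_w, new_h),
        "padding": (left, top, pad_x - left, pad_y - top),
    }


# A 300 x 200 image resized for a 224 x 224 model input.
print(iso_pad_geometry(300, 200, 224, 224))
```

Under this assumption the aspect ratio is preserved, so a wide image keeps its shape and gains padding above and below instead of being stretched.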

In [None]:
# from microsoftml_scikit.utils.exports import img_export_pipeline
# fig = img_export_pipeline(feature_extraction_pipeline,data['ImagePath'])

In [None]:
# Cell 2B
# Load pre-computed features
X_y = pd.read_csv('files/data/xxxx.csv') # Replace xxxx with one of these: Resnet18 Resnet50 Resnet101 Alexnet
update_image_paths(X_y)
X_y.shape

## Part 3: Build Classifier
1. Split data into 80% training set and 20% test set
2. Train a logistic regression classifier
3. Evaluate the classifier with the test set

In [None]:
# Cell 3A
# Prepare train and test data
from sklearn.model_selection import train_test_split
train, test = train_test_split(X_y,
                               train_size=0.8,
                               test_size=0.2, 
                               stratify=X_y.IsSuperman, # Stratify on the labels of the frame being split
                               random_state=xxxx) # Replace xxxx with a random positive integer
print(label_counts(train, 'Training'))
print(label_counts(test, 'Test'))
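
As a rough illustration of what stratifying buys you, the sketch below splits each class separately so that both sets keep the original class proportions. It uses only the standard library and is not how `train_test_split` is implemented. `stratified_split` is a hypothetical helper and the toy labels are made up.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_fraction=0.8, seed=0):
    """Return (train_indices, test_indices) with per-class proportions kept."""
    rng = random.Random(seed)

    # Group row indices by class label.
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = round(len(indices) * train_fraction)
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return sorted(train), sorted(test)


# Toy labels: 50 positives and 30 negatives (illustrative only).
labels = [1] * 50 + [0] * 30
train_idx, test_idx = stratified_split(labels, train_fraction=0.8, seed=42)
print(len(train_idx), len(test_idx))  # 64 16
```

Each class contributes 80% of its rows to the training set, so a class that is rare in the full data stays equally rare in both splits.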

In [None]:
# Cell 3B
# Train a linear classifier
# Features are every column except the last two (assumed to be the non-feature columns)
X_train = train.iloc[:,:-2]
y_train = train.IsSuperman

clf = Pipeline([LogisticRegressionBinaryClassifier()])
clf.fit(X_train, y_train)

In [None]:
# Cell 3C
# Test the classifier
X_test = test.iloc[:,:-2]
y_test = test.IsSuperman

predictions, metrics = clf.test(X_test, y_test)
metrics
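
For intuition, a binary logistic regression classifier turns a feature vector into a probability by applying the sigmoid function to a weighted sum. The sketch below shows that computation in plain Python. The weights, the bias and the 0.5 threshold are made-up values for illustration, and the function names are hypothetical. How PyTLC computes its `Probability` and `PredictedLabel` columns internally is not confirmed here.

```python
import math


def predict_probability(features, weights, bias):
    """Sigmoid of the weighted sum of the features."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    # Use the numerically stable form for large negative scores.
    if score >= 0:
        return 1.0 / (1.0 + math.exp(-score))
    exp_score = math.exp(score)
    return exp_score / (1.0 + exp_score)


def predict_label(features, weights, bias, threshold=0.5):
    """Return 1 when the probability reaches the threshold, else 0."""
    return int(predict_probability(features, weights, bias) >= threshold)


# Made-up weights for a two-feature example.
weights, bias = [0.5, -0.25], 0.0
print(predict_probability([2.0, 4.0], weights, bias))  # 0.5
print(predict_label([6.0, 4.0], weights, bias))        # 1
```

In the tutorial the feature vector has hundreds or thousands of entries (one per DNN output), but the computation has the same shape.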

## Part 4: Evaluate Classifier Performance
1. Look at the predictions
2. Calculate the confusion matrix
3. Examine the classifier's mistakes
4. Calculate accuracy with 5-fold cross validation

In [None]:
# Cell 4A
# View predictions
predictions.head()

In [None]:
# Cell 4B
# Join predictions with paths and original label
path_and_label = test.reset_index()[['ImagePath', 'IsSuperman']].rename(columns={'IsSuperman': 'Label'})
predictions = pd.concat([path_and_label, predictions], axis=1)
predictions.head()

In [None]:
# Cell 4C
# View confusion matrix
from sklearn.metrics import confusion_matrix
confusion_matrix(predictions.Label, predictions.PredictedLabel)
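
To read the matrix, it helps to see how it is counted. The sketch below builds a 2 x 2 matrix by hand, following scikit-learn's convention that rows are true labels and columns are predicted labels, with label 0 first. `binary_confusion_matrix` is a hypothetical helper and the toy labels are made up.

```python
def binary_confusion_matrix(true_labels, predicted_labels):
    """Count outcomes as [[TN, FP], [FN, TP]] for labels in {0, 1}."""
    matrix = [[0, 0], [0, 0]]
    for true, predicted in zip(true_labels, predicted_labels):
        matrix[int(true)][int(predicted)] += 1
    return matrix


# Toy labels (illustrative only): 1 = Superman, 0 = Spiderman.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]

matrix = binary_confusion_matrix(y_true, y_pred)
(tn, fp), (fn, tp) = matrix
accuracy = (tp + tn) / len(y_true)

print(matrix)    # [[3, 1], [1, 3]]
print(accuracy)  # 0.75
```

With the labels used in this tutorial, the top-right cell counts Spiderman images predicted as Superman, and the bottom-left cell counts Superman images predicted as Spiderman.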

In [None]:
# Cell 4D
# Sort test images by predicted probability
predictions['IsMistake'] = predictions.Label != predictions.PredictedLabel
predictions.sort_values('Probability', inplace=True)
show_gallery(predictions, num_images=100, randomize=False, add_prob=True, flag_mistakes=True)

In [None]:
# Cell 4E
# View mistakes: Superman classified incorrectly
superman_mistakes = predictions[(predictions.Label == 1) & predictions.IsMistake] 
show_gallery(superman_mistakes, add_name=True, add_prob=True)

In [None]:
# Cell 4F
# View mistakes: Spiderman classified incorrectly
spiderman_mistakes = predictions[(predictions.Label == 0) & predictions.IsMistake] 
show_gallery(spiderman_mistakes, add_name=True, add_prob=True)

In [None]:
# Cell 4G
#Image('files/data/spiderman_111.jpg')

In [None]:
# Cell 4H
# Find accuracy with cross validation
from microsoftml_scikit.model_selection import CV
cross_validator = CV([LogisticRegressionBinaryClassifier()])
cv_results = cross_validator.fit(X_y.iloc[:,:-2], X_y.IsSuperman, cv=5)

In [None]:
# Cell 4I
# Metrics per fold
cv_results['metrics'].set_index('Fold')

In [None]:
# Cell 4J
# Metrics summary statistics
cv_results['metrics_summary'][['AUC', 'Accuracy']]
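
The idea behind k-fold cross validation can be sketched in a few lines: every sample is used for testing exactly once, and the per-fold accuracies are then averaged. The code below is a simplified illustration using contiguous folds and a trivial majority-class "model". It is not how `CV` is implemented, and all names and toy labels are hypothetical.

```python
import statistics


def k_fold_indices(n_samples, n_folds=5):
    """Yield (train_indices, test_indices) for contiguous, near-equal folds."""
    start = 0
    for fold in range(n_folds):
        size = n_samples // n_folds + (1 if fold < n_samples % n_folds else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


def majority_class_accuracy(labels, train, test):
    """Predict the most common training label for every test sample."""
    train_labels = [labels[i] for i in train]
    majority = max(set(train_labels), key=train_labels.count)
    return sum(labels[i] == majority for i in test) / len(test)


# Toy labels (illustrative only).
labels = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]
fold_accuracies = [
    majority_class_accuracy(labels, train, test)
    for train, test in k_fold_indices(len(labels), n_folds=5)
]

print(fold_accuracies)                   # [0.5, 1.0, 0.5, 0.5, 1.0]
print(statistics.mean(fold_accuracies))  # 0.7
```

Averaging over folds gives a steadier estimate than a single train/test split, because each fold's accuracy depends on which images happened to land in its test set.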

# Recap
* Introduced what pre-trained DNNs are and how to use them in PyTLC
* Built a Superman vs Spiderman image classifier without requiring any deep learning or image processing knowledge
* The classifier achieves 93% accuracy with 5-fold cross validation