<img src="../notebooks/img/oscon.png" width="400">


#### In this exercise we walk you through the task of classifying flowers in images. 

We will use a pre-trained VGG model to represent the flower images, and feed its output into a simple linear SVM (with no image pre-processing other than channel mean normalization).

The <b>task</b> you will accomplish in this exercise was the subject of a (very successful) PhD just under 10 years ago, which demonstrates the huge leap in technical ability that CNNs have brought to the field of computer vision.

This exercise demonstrates a common practice in deep learning for image classification: combining a pre-trained model with a much smaller, task-specific labeled dataset. To get better results one might add some pre-processing or a fancier classifier, but the power of CNNs shows in how well the solution works out of the box, already achieving over 80% accuracy.


---

#### Materials: 
1. dataset: http://www.robots.ox.ac.uk/~vgg/data/flowers/102/index.html
(download 1 & 4; 336M)
    Place the files under IMAGE_DIR (defined below)
2. VGG pretrained model for Keras (will be auto-downloaded upon first usage; ~60M)
3. Ready-made representation of the image dataset (this is just the representation of the images using the model from (2)): https://drive.google.com/drive/folders/0B3U30rvx_KQBSjVub3hreGt5blU

---

Place the dataset (1) and representation .csv (3) under IMAGE_DIR as defined below. 

In [2]:
import os
import sys
import numpy as np
import pandas as pd
from scipy.io import loadmat
from scipy.misc import imread, imresize

from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

import tensorflow as tf
from keras.applications.vgg16 import VGG16
from keras.optimizers import SGD

import matplotlib.pyplot as plt 
%matplotlib inline 

In [3]:
DATA_DIR = '/tmp/data' if not 'win' in sys.platform else "c:\\tmp\\data"
IMAGE_DIR = os.path.join(DATA_DIR, "flowers")
DEFAULT_VGG_IMAGE_SIZE = (224, 224)
NUM_IMAGES = 8141
NUM_CLASSES = 102

The following is a utility for reading a single image file, based on the file name. File names are such that:
~~~python
load_single_img("image_{:05}.jpg".format(i+1))
~~~
will return the i-th image.

In [4]:
def load_single_img(file_name, resize_to=DEFAULT_VGG_IMAGE_SIZE):
    img = imread(os.path.join(IMAGE_DIR, "jpg", file_name))
    img = imresize(img, resize_to)
    return img
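
Note that `scipy.misc.imread` and `scipy.misc.imresize` are no longer available in recent SciPy releases. If the import above fails, a loader along the following lines could stand in for them. This is only a sketch that assumes Pillow is installed; the name `load_image_pillow` is hypothetical and not part of the original exercise.

```python
import numpy as np
from PIL import Image


def load_image_pillow(path, resize_to=(224, 224)):
    """Read an image file and return an RGB uint8 array of shape (height, width, 3).

    Hypothetical stand-in for scipy.misc.imread + imresize.
    """
    with Image.open(path) as img:
        # Force three channels so grayscale or RGBA files get a uniform shape
        img = img.convert("RGB")
        # Pillow expects (width, height), while resize_to is (height, width)
        img = img.resize((resize_to[1], resize_to[0]), Image.BILINEAR)
        # np.array makes a writable copy of the pixel data
        return np.array(img)
```

With this in place, `load_single_img` could call `load_image_pillow(os.path.join(IMAGE_DIR, "jpg", file_name), resize_to)` instead of `imread` followed by `imresize`.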

The following is a utility for generating a dataset out of some or all of the 102 flower classes. For instance:
~~~python
load_images_labels(use_classes=(21, 45, 68))
~~~
will generate a dataset with the classes 21, 45, and 68,
~~~python
load_images_labels(use_classes=None)
~~~
will generate a dataset with <b>all</b> the image classes in the dataset. 

In [None]:
def load_images_labels(use_classes=None, resize_to=DEFAULT_VGG_IMAGE_SIZE):

    # Load the .mat label file
    labels = loadmat(os.path.join(IMAGE_DIR, "imagelabels.mat"))["labels"].ravel()

    # If use_classes is None, use all 102 classes (labels in the .mat file run from 1 to 102)
    use_classes = use_classes or list(range(1, NUM_CLASSES + 1))

    # Compile a list of the flower-image files we are going to use, with their labels,
    # in the format [(file, label), ...]
    file_name_label = [("image_{:05}.jpg".format(i+1), labels[i])
                       for i in range(NUM_IMAGES) if labels[i] in use_classes]

    # Load images and labels
    images = [load_single_img(file_name, resize_to=resize_to) for file_name, _ in file_name_label]
    images = np.array(images)
    labels = [l for _, l in file_name_label]

    return images, labels

## 1. Visualization 
Using the load_single_img function above, plot a 10x10 grid of random flower images from the dataset. Do you recognize any of them?
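
One possible way to lay out such a grid is sketched below. The helper `plot_image_grid` and its parameters are hypothetical names introduced for illustration; it assumes the images are already loaded into an array (or list) of equally sized RGB arrays, and that there are at least `n_rows * n_cols` of them.

```python
import numpy as np
import matplotlib.pyplot as plt


def plot_image_grid(images, n_rows=10, n_cols=10, seed=0):
    """Show n_rows * n_cols randomly chosen images in a grid and return the figure."""
    rng = np.random.RandomState(seed)
    # Sample without replacement so no image appears twice
    chosen = rng.choice(len(images), size=n_rows * n_cols, replace=False)
    fig, axes = plt.subplots(n_rows, n_cols, figsize=(n_cols, n_rows))
    for ax, idx in zip(np.atleast_1d(axes).ravel(), chosen):
        ax.imshow(images[idx])
        ax.axis("off")  # hide ticks and frames so only the flowers show
    return fig
```

For the exercise, `images` could be built by calling `load_single_img` on a random sample of file names before passing it to the helper.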

## 2. In this section we select a subset of the flower classes, and build a classifier for them using the pre-trained Keras VGG model and a linear SVM.

### 2.1 Complete the VGGRep class defined below. 
When you are done, the .represent(images) method should return a numpy array with the VGG representation of `images`.

### 2.2 Cross validated classifier results
1. Select a small number (3-5 for instance) of classes out of the 102 flower types in the dataset. 
2. Run the VGG representation of the selected images. 
3. Using a cross-validation procedure, determine the accuracy of a linear SVM applied to this representation.
    
    hint: use cross_val_score, and LinearSVC (imported above)
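
To illustrate how the two pieces from the hint fit together, here is a minimal sketch on synthetic data. The random 512-dimensional vectors stand in for VGG features, and the names `X_toy`, `y_toy`, and `pcorr_toy` are made up for this example; the choice of 5 folds is likewise an assumption, not something the exercise prescribes.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

rng = np.random.RandomState(0)
n_per_class, n_features = 40, 512

# Synthetic stand-in for VGG features: three well-separated Gaussian blobs
centers = rng.normal(scale=5.0, size=(3, n_features))
X_toy = np.vstack([c + rng.normal(size=(n_per_class, n_features)) for c in centers])
y_toy = np.repeat([0, 1, 2], n_per_class)

# cross_val_score fits a fresh classifier on each training split and
# returns one accuracy value per fold
scores = cross_val_score(LinearSVC(), X_toy, y_toy, cv=5)
pcorr_toy = 100.0 * scores.mean()
print("Mean accuracy over 5 folds: {:.2f}%".format(pcorr_toy))
```

In the exercise, the feature matrix would come from `VGGRep().load().represent(images)` and the labels from `load_images_labels`.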

In [None]:

class VGGRep(object):
    def load(self):
        m = VGG16(include_top=False, weights='imagenet', pooling='avg')
        m.compile(SGD(), 'categorical_crossentropy')
        m.summary()
        self._m = m
        return self

    @staticmethod
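    def pre_process(image):
        # Work on a float copy: images are loaded as uint8, and in-place
        # subtraction of float values from a uint8 array fails.
        image = image.astype(np.float32)
        # Subtract the TRAINING-DATA mean -- not completely necessary but helps.
        image[:, :, 0] -= 103.939
        image[:, :, 1] -= 116.779
        image[:, :, 2] -= 123.680
        return image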

    def represent_single_image(self, image):
        """
        :param image: an image array of shape (224, 224, 3)
        --
        :return: The VGG representation of image
        """
        .
        .
        .
        return vgg_representation

    def represent(self, images):
        """
        :param images: multiple images; shape (None, 224, 224, 3)
        --
        :return: The VGG representation of images. Shape should be (None, 512)
        """
        .
        .
        .
        return vgg_representation


Now we use VGGRep to represent a subset of the flower classes, and use a linear SVM to classify them. We evaluate this with a cross-validation procedure.

In [None]:
# The 4-class version
CLASSES_TO_USE = (77, 78, 79, 80)
images, labels = load_images_labels(use_classes=CLASSES_TO_USE, 
                                    resize_to=DEFAULT_VGG_IMAGE_SIZE)
print(images.shape)

# TODO: make VGG representation of images 
# TODO: cross validation with LinearSVC


print("Overall percent correct: {:.4f}%".format(pcorr))

## 3. In this section we classify all 102 flower classes
Since running the VGG network to represent all 8141 images in the dataset may take too long, we have prepared a .csv file (vgg_rep.csv) containing the representation of the entire dataset: 
- The index column is the label (class number)
- The header contains the feature number (0-511 -- for the 512 features in the VGG representation we are using)
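
The layout described above can be illustrated with a tiny in-memory file. The three rows and three feature columns below are invented values that only mimic the structure; the real vgg_rep.csv has 512 feature columns and one row per image.

```python
import io

import pandas as pd

# Hypothetical miniature of the file: the first column is the class label,
# the header row holds the feature numbers
toy_csv = io.StringIO(
    ",0,1,2\n"
    "77,0.10,0.00,2.30\n"
    "77,0.40,1.20,0.00\n"
    "80,0.00,0.70,0.90\n"
)

toy_rep = pd.read_csv(toy_csv, index_col=0, header=0)
X_toy = toy_rep.values        # feature matrix, one row per image
y_toy = toy_rep.index.values  # class label of each row
```

Reading the label column as the index keeps it out of the feature matrix, so `X_toy` and `y_toy` can be passed straight to a scikit-learn classifier.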

### 3.1 Classify the entire dataset
Repeat the procedure from section 2.2(3) and determine how well the linear SVM applied to the VGG representation is able to classify all 102 classes of flower images.

### 3.2 Classify arbitrary flowers
Now train a single LinearSVC model on all the data. Download a few images from the internet (belonging to classes in the dataset). After resizing them to the appropriate dimensions (DEFAULT_VGG_IMAGE_SIZE), does the model label them correctly?


In [None]:
# The 102-class version 
vgg_rep_all = pd.read_csv(os.path.join(IMAGE_DIR, "vgg_rep.csv"), index_col=0, header=0)
X = vgg_rep_all.values
y = vgg_rep_all.index

# TODO: cross validation procedure 

print("Overall percent correct: {:.4f}%".format(pcorr))

Finally, train a single LinearSVC on all the data.

In [None]:
# TODO: train LinearSVC


In [None]:
# TODO: check classification on some flower images from the web