# Module 4 - Margin and Ensemble Classifiers

Before the popularization of deep learning, many applied automatic classification algorithms were variations on ensemble or margin classifiers. Both types of classifiers operate on pre-defined features. As we will see later, this is fundamentally different from neural networks, which can learn directly from the images. For now, we will make use of the features extracted from the SPC data in the last module.

In [17]:
import numpy as np
import cv2
import skimage
from sklearn import ensemble
import sys
import glob
import os
import random
import matplotlib.pyplot

The important new toolkit we are importing here is *sklearn*, short for scikit-learn. It contains most of the tools we will use to explore margin and ensemble classifiers.

For all the techniques discussed in the rest of this module, we will make use of the same features we computed before. We will also need to divide the data into separate sets for training and testing.

In [7]:
# load in the data. first get all the file paths
ptf = glob.glob(os.path.join("/media/storage/image_data/SPC_data/manual_labels_features/","*.csv"))

# initialize a dictionary for the data. This will contain all the file paths and the associated features
data = dict()

# we will also create a flag and a list to give the labels a numeric value
flag = 0
cls_names = []

for line in ptf:

    # read in the data, but skip the image path
    temp = np.genfromtxt(line, usecols=range(1,71),delimiter=",")
    
    # get the image path, making sure to specify that the data type is string
    temp_path = np.genfromtxt(line, usecols=[0], delimiter=",", dtype=str)
    
    # for now, we will ignore any classes with 10 or fewer samples
    if 10 < temp.shape[0]:
        
        # now we are creating a "nested dictionary." Each element is referenced by the image id and contains the features
        # and numeric class label
        for img, feats in zip(temp_path, temp):
            data[img] = {'features': feats, 'class': flag}
            
        # create a list of the names of the categories and the associated numbers
        name = line.split('/')[-1].split('_')[0]
        print("class", str(flag), ":", name, 
              ", num samples:", str(temp.shape[0]))
        cls_names.append((flag, name))
        
        flag+=1

print("Total classes:", str(flag), ", Total images:", str(len(data)))

class 0 : Ciliate 01 , num samples: 424
class 1 : Glob , num samples: 16
class 2 : Acantharea , num samples: 24
class 3 : Poop 01 , num samples: 63
class 4 : Bubble , num samples: 331
class 5 : Sphere 01 , num samples: 38
class 6 : Bad Seg , num samples: 249
class 7 : Akashiwo , num samples: 447
class 8 : Polykrikos , num samples: 331
class 9 : Nauplius , num samples: 336
class 10 : Ellipse 01 , num samples: 43
class 11 : Lingulodinium , num samples: 653
class 12 : Protoperidinium Feeding , num samples: 11
class 13 : Ciliate 02 , num samples: 14
class 14 : Chain 01 , num samples: 352
class 15 : Diatom chain , num samples: 60
class 16 : Round 01 , num samples: 65
class 17 : Phyto Mix 01 , num samples: 895
class 18 : Ciliate 03 , num samples: 24
class 19 : Ceratium fusus , num samples: 1011
class 20 : Avocado 01 , num samples: 62
class 21 : Ceratium furca two , num samples: 38
class 22 : Cochlodinium , num samples: 332
class 23 : Red Eye , num samples: 44
class 24 : Prorocentrum Skinny ,

We now have 38 classes, comprising a total of 20678 samples. The nested dictionary is how we will interact with the data. Each data point is identified by its image ID, which serves as its dictionary key.

Note that we are using Python 3.6, whose standard implementation (CPython) preserves the insertion order of dictionaries: the key-value pairs stay in the order in which they were inserted. This behavior became a guaranteed part of the language in Python 3.7. If for some reason you use an earlier version of Python, be aware that the order may not be preserved.

In [3]:
# to get a list of all the dictionary keys use Python's built in list command and the dictionary method keys()
img_ids = list(data.keys())

print("the first sample is:", img_ids[0])

the first sample is: SPCP2-1522716755-312771-000-584-2524-224-128.jpg


We will use this later to display images after they have been classified. We can retrieve the information for a particular image by looking up that key's associated values in the dictionary.

In [4]:
# to get the data related to a particular sample, put the key in square brackets
# remember, the data is stored as a dictionary itself. We can also print those keys in the same way

print(img_ids[0], "has two keys that can be referenced:", data[img_ids[0]].keys())

SPCP2-1522716755-312771-000-584-2524-224-128.jpg has two keys that can be referenced: dict_keys(['features', 'class'])


Now we can call the features and the class of that first sample.

In [5]:
# to retrieve the class
print("the numeric class is:", data[img_ids[0]]['class'])

# and the features
print("the features are:",  data[img_ids[0]]['features'])

the numeric class is: 0
the features are: [  5.67235429e-01   8.06082589e-01   2.07103476e-02   8.06082589e-01
   2.31120000e+04   8.23555686e-01   1.71543325e+02   0.00000000e+00
   2.84408754e-01   2.12988031e-02   5.75023115e-05   1.03953474e-04
   7.80014848e-09   1.38600425e-05   1.93731586e-09   2.59058582e-01
   4.84876658e-02   2.20199150e-01   4.62453373e-02   9.70800133e-03
   1.31577215e-01   3.45316281e+00   1.79048225e+00   2.35758892e+00
   4.25905750e+00   6.33965540e+00   7.17331783e+00   7.08213941e+00
   9.46854563e-01   9.37746789e-01   9.11843519e-01   8.76248868e-01
   8.59548187e-01   8.64953747e-01   9.08297483e-01   9.03828887e-01
   8.84937543e-01   8.45380756e-01   8.25986643e-01   8.34588555e-01
   7.12060569e-01   6.21818771e-01   3.28782904e-01   7.30131012e-02
   2.06799996e-02   2.69482227e-02   3.39578263e-01   3.40777531e-01
   3.92849714e-01   5.25042430e-01   3.86037203e-01   8.28412190e-01
   6.53627647e-03   5.30657990e-03   6.19726627e-03   8.81415

Now that we have all the data in the workspace, we need to divide it up for training and testing. That is, we need to separate out a subset of the labeled data to use as an independent set for assessing how well the classifier is doing.

To do this, the data needs to be randomized and split into a training set and a test set. Here we will use an 80-20 train-test split: 80% of the data will be used to train and 20% will be reserved for testing.

In [21]:
# to avoid copying the dictionary multiple times, we will randomize the list of keys (ie the image IDs) we made above.
random.shuffle(img_ids)

# print one out to double check
print("the new first entry is:", img_ids[0])

the new first entry is: SPCP2-1522483837-314540-000-912-1688-440-616.jpg


In [22]:
# now we can split the list into training and test sets based on the number of entries
idx = 0.8*len(img_ids)

train_ids = img_ids[0:int(idx)]  # this will copy all the image ids from 0 to the 80% cut-off
test_ids = img_ids[int(idx):]  # this will copy all the image ids from the cut-off to the end

# double check
print("cut off for 80-20 split:", str(int(idx)))
print("number of training images:", str(len(train_ids)))
print("number of test images:", str(len(test_ids)))

cut off for 80-20 split: 16542
number of training images: 16542
number of test images: 4136
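
The manual shuffle-and-slice split above gives a different split every time it is run, and with classes ranging from 11 to over 1000 samples, a small class could end up almost entirely in one set. As a hedged alternative, scikit-learn's `train_test_split` can make the same 80-20 split reproducible and keep the class proportions similar in both sets. The sketch below uses made-up image IDs and labels (`toy_ids`, `toy_labels`) rather than the SPC data, and the seed value is arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# hypothetical stand-ins for img_ids and their numeric classes
toy_ids = ["img_{:03d}.jpg".format(i) for i in range(100)]
toy_labels = np.array([0] * 60 + [1] * 30 + [2] * 10)

# 80-20 split; random_state fixes the shuffle, stratify keeps the class ratios
toy_train_ids, toy_test_ids, toy_train_labels, toy_test_labels = train_test_split(
    toy_ids, toy_labels, test_size=0.2, random_state=0, stratify=toy_labels
)

print("number of training images:", len(toy_train_ids))
print("number of test images:", len(toy_test_ids))
print("test class counts:", np.bincount(toy_test_labels))
```

Whether stratification is appropriate depends on how the classifier will be evaluated; it is shown here only as one option.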


Excellent. Now that the data is imported into the workspace in an organized way, we can begin training and testing classifiers. The same training and test data will be used for both the margin and ensemble classifiers.

## Margin classifiers

A margin classifier separates data in a space by assigning a distance between each point and the decision boundary. Imagine that we have just two features, $x_{1}$ and $x_{2}$, to separate two classes. We can plot the points in a plane and find a line that separates them.


<figure>
    <img src="https://upload.wikimedia.org/wikipedia/commons/b/b5/Svm_separating_hyperplanes_%28SVG%29.svg">
    <figcaption>
        Points and hyperplanes. Courtesy: ZackWeinberg, via Wikipedia
    </figcaption>
</figure>
        

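In the figure, $H_{1}$ does not separate the two classes, $H_{2}$ separates them but only with a small margin, and $H_{3}$ separates them with the largest margin. The line labeled $H_{3}$ is therefore the best linear discriminant of this data.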
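
To make the idea of a margin concrete: the signed distance from a point $x$ to a line $w \cdot x + b = 0$ is $(w \cdot x + b)/\lVert w \rVert$. The sign says which side of the line the point is on, and the smallest absolute distance is the margin. The sketch below uses a hand-picked line and four made-up points; none of the values come from the SPC data.

```python
import numpy as np

# hypothetical 2-feature points: two per class
points = np.array([[1.0, 1.0], [2.0, 0.5], [4.0, 4.0], [5.0, 3.5]])
labels = np.array([-1, -1, 1, 1])

# a hand-picked separating line w . x + b = 0 (here x1 + x2 = 5)
w = np.array([1.0, 1.0])
b = -5.0

# signed distance of each point to the line
signed_dist = (points @ w + b) / np.linalg.norm(w)

# the line separates the classes if every point lies on the side matching its label
separates = bool(np.all(np.sign(signed_dist) == labels))

# the margin is the distance from the line to the closest point
margin = np.min(np.abs(signed_dist))

print("signed distances:", signed_dist)
print("separates the classes:", separates)
print("margin:", margin)
```

A margin classifier would prefer the line for which this margin is as large as possible.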

A classic and widely used margin classifier is the *support vector machine* (SVM). SVMs search the space defined by the features for the optimal separating hyperplane (i.e., a plane in many dimensions): the one with the largest margin to the nearest training points. In the example above, lines such as $H_{1}$ and $H_{2}$ are candidates that the search would reject before settling on $H_{3}$.

We will use the implementation in scikit-learn, `sklearn.svm.SVC`, a class that contains all the functions needed to train, test, and deploy an SVM. Note, however, that fitting an SVM is computationally expensive: training time scales at least quadratically with the number of samples. In other words, training an SVM with O(10k) samples becomes very time consuming and memory hungry.
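
As a rough sketch of what using `SVC` looks like, the example below fits a linear-kernel SVM to a small synthetic data set standing in for the SPC features. Two choices here are assumptions rather than part of this module: the features are standardized first with `StandardScaler` (the SPC features span many orders of magnitude, which SVMs tend to handle poorly), and the kernel is set to `"linear"` to match the hyperplane picture above, whereas `SVC` uses an RBF kernel by default.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# synthetic stand-in for the feature table: 300 samples, 10 features, 3 classes
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)

# hold out the last 20% for testing (make_classification already shuffles)
cut = int(0.8 * len(X))
X_train, X_test = X[:cut], X[cut:]
y_train, y_test = y[:cut], y[cut:]

# scale the features, then fit a linear-kernel SVM
svm_clf = make_pipeline(StandardScaler(), SVC(kernel="linear", C=1.0))
svm_clf.fit(X_train, y_train)

svm_accuracy = svm_clf.score(X_test, y_test)
print("test accuracy:", svm_accuracy)
print("support vectors per class:", svm_clf.named_steps["svc"].n_support_)
```

The support vectors are the training points closest to the decision boundary; they are the only points that determine where the hyperplane sits.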

## Ensemble classifiers

Rather than relying on the results of a single classifier, ensemble classifiers combine the results of many smaller classifiers. There are many ways of generating such a collection of classifiers. Here we will focus on the popular random forest (RF) model.

An RF is a collection of decision trees, each built from a random selection of the feature set. A single decision tree is a type of flow chart: each node in the tree is a test on a single feature and each branch denotes the outcome. A terminal node, or leaf, represents the tree's final classification.
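
To see the flow-chart structure directly, a small tree can be printed as text with scikit-learn's `export_text`. The sketch below trains a shallow tree on made-up two-feature data (the feature names `x1` and `x2` and the labeling rule are hypothetical) and prints its tests and leaves.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# hypothetical toy data: two features, class 1 when both features are large
rng = np.random.RandomState(0)
X = rng.uniform(0, 1, size=(200, 2))
y = ((X[:, 0] > 0.5) & (X[:, 1] > 0.5)).astype(int)

# a shallow tree so the printed flow chart stays readable
tree_clf = DecisionTreeClassifier(max_depth=2, random_state=0)
tree_clf.fit(X, y)

# each "|---" line with a comparison is a test on one feature; "class:" lines are leaves
flow_chart = export_text(tree_clf, feature_names=["x1", "x2"])
print(flow_chart)
```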

A single decision tree tends to overfit the data -- it becomes very good at representing the training data but does not generalize well to new data. RFs get around this by creating many trees, using a random subsample of features and training examples for each tree. A new sample is then fed into every tree and the results are averaged to reach a final decision.
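
The averaging step can be checked by hand. In scikit-learn's implementation, the forest's class probabilities are the mean of the class probabilities from its individual trees, which are exposed through the `estimators_` attribute. The sketch below, on synthetic data rather than the SPC features, reproduces the forest's output from its trees.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# synthetic stand-in data: 200 samples, 10 features, 3 classes
X, y = make_classification(n_samples=200, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)

forest = RandomForestClassifier(n_estimators=30, random_state=0)
forest.fit(X, y)

# class probabilities from each tree: shape (n_trees, n_samples, n_classes)
per_tree = np.stack([tree.predict_proba(X) for tree in forest.estimators_])

# average over the trees and pick the most probable class
averaged = per_tree.mean(axis=0)
manual_pred = forest.classes_[np.argmax(averaged, axis=1)]

print("matches forest.predict_proba:", np.allclose(averaged, forest.predict_proba(X)))
print("matches forest.predict:", np.array_equal(manual_pred, forest.predict(X)))
```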

To train an RF in Python, we will use sklearn's ensemble methods.

In [11]:
# note we are only using a few of the RF parameters. There are many ways to modify this
rf_clf = ensemble.RandomForestClassifier(n_estimators=30, n_jobs=8, verbose=1)



We have not trained the classifier yet. The code above defines an instance *rf_clf* of the class *RandomForestClassifier*. The parameters we added define a few things:

* n_estimators is the number of trees. For now we are just using 30.
* n_jobs parallelizes the process by dividing the work among the given number of cores. This speeds up training and is limited by the hardware you are working on.
* verbose tells sklearn that we want feedback while it is training.

There are loads of other parameters that can change how the classifier behaves. For our purposes, the defaults will mostly suffice. Note that we are not explicitly defining the number of features that will be considered at each split of a tree (the *max_features* parameter). The default set by sklearn is $\sqrt{n_{features}}$, which is generally a good rule of thumb.
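
To check what a default actually is on your installation, the classifier's settings can be inspected with `get_params`. The sketch below creates a throwaway instance (`rf_example` is a hypothetical name) and works out what the square-root rule gives for the 70 features used in this module. The exact string reported for `max_features` depends on the scikit-learn version.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rf_example = RandomForestClassifier(n_estimators=30)

# the default max_features setting ("sqrt" in recent scikit-learn releases)
default_max_features = rf_example.get_params()["max_features"]
print("max_features default:", default_max_features)

# with the 70 features used in this module, the square-root rule gives
n_features = 70
n_considered = max(1, int(np.sqrt(n_features)))
print("features considered per split:", n_considered)
```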

To train the classifier we need to give it data. This is done with the *fit* method of the *RandomForestClassifier*.

In [33]:
# to train it, feed in the features and labels
# pull out the features for the training data
# the next line uses "list comprehension" to pull out the feature vectors only from the training data
train_features = [data[line]['features'] for line in train_ids]
train_features = np.asarray(train_features)  # convert to an array

# retrieve the numeric classes of the training data
train_labels = [data[line]['class'] for line in train_ids]
train_labels = np.asarray(train_labels)  # convert to an array

# check to make sure these numbers are right. We expect the training features to be a matrix with 
# dimensions [n_images x n_features] and the training labels to be a vector of length n_images
print("train features dim:", train_features.shape)
print("train labels dim:", train_labels.shape)

train features dim: (16542, 70)
train labels dim: (16542,)
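
With the features and labels in this layout, training comes down to a single call to `fit`, after which `predict` classifies new samples. The sketch below shows the pattern on randomly generated stand-in arrays with the same layout (all `toy_` names are hypothetical). It does not use the SPC data, so the accuracy it prints says nothing about how the real classifier will perform.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# synthetic stand-in with the same layout: [n_images x n_features] and [n_images]
features, labels = make_classification(n_samples=500, n_features=70, n_informative=10,
                                       n_classes=4, random_state=0)
cut = int(0.8 * len(labels))
toy_train_features, toy_test_features = features[:cut], features[cut:]
toy_train_labels, toy_test_labels = labels[:cut], labels[cut:]

toy_rf = RandomForestClassifier(n_estimators=30, random_state=0)
toy_rf.fit(toy_train_features, toy_train_labels)   # train on the 80%

toy_pred = toy_rf.predict(toy_test_features)       # classify the held-out 20%
print("predictions dim:", toy_pred.shape)
print("test accuracy:", np.mean(toy_pred == toy_test_labels))
```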


In [28]:
train_array = [data[line]['features'] for line in train_ids]

In [29]:
len(train_array)

16542

In [30]:
type(train_array)

list

In [32]:
train_array = np.asarray(train_array)
train_array.shape

(16542, 70)