# Module 4 - Margin and Ensemble Classifiers

Before the popularization of deep learning, many applied automatic classification algorithms were variations on ensemble or margin classifiers. Both types of classifiers operate on pre-defined features. As we will see later, this is fundamentally different from neural networks, which can learn directly from the images. For now, we will make use of the features pulled from SPC data in the last module.

In [1]:
import numpy as np
import cv2
import skimage
import sklearn
import sys
import glob
import os
import matplotlib.pyplot

The important new toolkit we are importing here is *sklearn*, short for scikit-learn. It contains most of the tools we will use to explore margin and ensemble classifiers.

For all the techniques discussed in the rest of this module, we will use the same features we computed before. We will also need to divide the data into separate sets for training and testing.

In [2]:
# load in the data. first get all the file paths
ptf = glob.glob(os.path.join("/media/storage/image_data/SPC_data/manual_labels_features/","*.csv"))

# initialize a dictionary for the data. This will contain all the file paths and the associated features
data = dict()

# we will also create a flag and a list to give the labels a numeric value
flag = 0
cls_names = []

for line in ptf:

    # read in the data, but skip the image path
    temp = np.genfromtxt(line, usecols=range(1,71),delimiter=",")
    
    # get the image path, making sure to specify that the data type is string
    temp_path = np.genfromtxt(line, usecols=[0], delimiter=",", dtype=str)
    
    # for now, we will ignore any classes with 10 or fewer samples
    if 10 < temp.shape[0]:
        
        # now we are creating a "nested dictionary." Each element is referenced by the image id and contains the features
        # and numeric class label
        for img, feats in zip(temp_path, temp):
            data[img] = {'features': feats, 'class': flag}
            
        # create a list of the names of the categories and the associated numbers
        name = line.split('/')[-1].split('_')[0]
        print("class", str(flag), ":", name, 
              ", num samples:", str(temp.shape[0]))
        cls_names.append((flag, name))
        
        flag+=1

print("Total classes:", str(flag), "Total images:", str(len(data)))

class 0 : Ciliate 01 , num samples: 424
class 1 : Glob , num samples: 16
class 2 : Acantharea , num samples: 24
class 3 : Poop 01 , num samples: 63
class 4 : Bubble , num samples: 331
class 5 : Sphere 01 , num samples: 38
class 6 : Bad Seg , num samples: 249
class 7 : Akashiwo , num samples: 447
class 8 : Polykrikos , num samples: 331
class 9 : Nauplius , num samples: 336
class 10 : Ellipse 01 , num samples: 43
class 11 : Lingulodinium , num samples: 653
class 12 : Protoperidinium Feeding , num samples: 11
class 13 : Ciliate 02 , num samples: 14
class 14 : Chain 01 , num samples: 352
class 15 : Diatom chain , num samples: 60
class 16 : Round 01 , num samples: 65
class 17 : Phyto Mix 01 , num samples: 895
class 18 : Ciliate 03 , num samples: 24
class 19 : Ceratium fusus , num samples: 1011
class 20 : Avocado 01 , num samples: 62
class 21 : Ceratium furca two , num samples: 38
class 22 : Cochlodinium , num samples: 332
class 23 : Red Eye , num samples: 44
class 24 : Prorocentrum Skinny ,

We now have 38 classes, comprising a total of 20678 samples. The nested dictionary is how we will interact with the data. Each data point is identified by its image ID, which is saved as a dictionary key.

In [3]:
# to get a list of all the dictionary keys use Python's built in list command and the dictionary method keys()
img_ids = list(data.keys())

print("the first sample is:", img_ids[0])

the first sample is: SPCP2-1522716755-312771-000-584-2524-224-128.jpg


We will use this later to display images after they have been classified. We can retrieve the information for a particular image by looking up that key's associated values in the dictionary.

In [4]:
# to get the data related to a particular sample, put the key in square brackets
# remember, the data is stored as a dictionary itself. We can also print those keys in the same way

print(img_ids[0], "has two keys that can be referenced:", data[img_ids[0]].keys())

SPCP2-1522716755-312771-000-584-2524-224-128.jpg has two keys that can be referenced: dict_keys(['features', 'class'])


Now we can call the features and the class of that first sample.

In [5]:
# to retrieve the class
print("the numeric class is:", data[img_ids[0]]['class'])

# and the features
print("the features are:",  data[img_ids[0]]['features'])

the numeric class is: 0
the features are: [  5.67235429e-01   8.06082589e-01   2.07103476e-02   8.06082589e-01
   2.31120000e+04   8.23555686e-01   1.71543325e+02   0.00000000e+00
   2.84408754e-01   2.12988031e-02   5.75023115e-05   1.03953474e-04
   7.80014848e-09   1.38600425e-05   1.93731586e-09   2.59058582e-01
   4.84876658e-02   2.20199150e-01   4.62453373e-02   9.70800133e-03
   1.31577215e-01   3.45316281e+00   1.79048225e+00   2.35758892e+00
   4.25905750e+00   6.33965540e+00   7.17331783e+00   7.08213941e+00
   9.46854563e-01   9.37746789e-01   9.11843519e-01   8.76248868e-01
   8.59548187e-01   8.64953747e-01   9.08297483e-01   9.03828887e-01
   8.84937543e-01   8.45380756e-01   8.25986643e-01   8.34588555e-01
   7.12060569e-01   6.21818771e-01   3.28782904e-01   7.30131012e-02
   2.06799996e-02   2.69482227e-02   3.39578263e-01   3.40777531e-01
   3.92849714e-01   5.25042430e-01   3.86037203e-01   8.28412190e-01
   6.53627647e-03   5.30657990e-03   6.19726627e-03   8.81415

Now that the data is imported into the workspace in an organized way, we can begin training and testing classifiers.
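Scikit-learn estimators generally expect a 2-D feature array and a 1-D label array rather than a nested dictionary. The block below is a minimal sketch of one way to do that conversion and to hold out a test set. The `data` dictionary here is a small synthetic stand-in, and the 25% test fraction, the random seed, and the variable names `X`, `y`, `train_idx`, and `test_idx` are assumptions made for illustration.

```python
import numpy as np

# Hypothetical stand-in for the `data` dictionary built above:
# 20 fake image IDs, 70 random features each, two numeric classes.
rng = np.random.default_rng(0)
data = {
    f"img_{i:03d}.jpg": {"features": rng.random(70), "class": i % 2}
    for i in range(20)
}

# Fix the key order once so features, labels, and image IDs stay aligned.
img_ids = list(data.keys())
X = np.vstack([data[k]["features"] for k in img_ids])  # (n_samples, n_features)
y = np.array([data[k]["class"] for k in img_ids])      # (n_samples,)

# Shuffle the sample indices and hold out 25% for testing
# (the fraction is an assumption, not a value from this module).
order = rng.permutation(len(img_ids))
n_test = int(0.25 * len(img_ids))
test_idx, train_idx = order[:n_test], order[n_test:]

X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]

print("train:", X_train.shape, "test:", X_test.shape)
```

Keeping `img_ids` in the same order as the rows of `X` means any row index can be mapped back to its image ID, which is useful for displaying images after classification. With classes as imbalanced as the ones listed above, a stratified split may be a better choice than a plain shuffle.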

## Margin classifiers

A margin classifier separates data in a feature space by assigning each point a distance from the decision boundary. Imagine that we have just 2 features, $x_{1}$ and $x_{2}$, to separate two classes. We can plot the points in a plane and find a line that separates them.


<figure>
    <img src="https://upload.wikimedia.org/wikipedia/commons/b/b5/Svm_separating_hyperplanes_%28SVG%29.svg">
    <figcaption>
        Points and hyperplanes. Courtesy: ZackWeinberg, via Wikipedia
    </figcaption>
</figure>
        

The line labeled $H_{3}$ is the best linear discriminant of this data: it separates the two classes with the largest margin.
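To make the idea of a distance between each point and the decision boundary concrete, here is a minimal sketch for the two-feature case. It computes the signed distance from a point to the line $w_{1}x_{1} + w_{2}x_{2} + b = 0$ and takes the margin to be the smallest absolute distance over all points. The toy points, the line coefficients, and the helper name `signed_distance` are hypothetical and chosen only for illustration.

```python
import math

def signed_distance(point, w, b):
    """Signed distance from a 2-D point to the line w[0]*x1 + w[1]*x2 + b = 0."""
    return (w[0] * point[0] + w[1] * point[1] + b) / math.hypot(w[0], w[1])

# Hypothetical toy data: two classes on either side of the line x1 + x2 - 4 = 0.
class_a = [(1.0, 1.0), (0.0, 2.0)]
class_b = [(3.0, 3.0), (4.0, 2.0)]
w, b = (1.0, 1.0), -4.0

dist_a = [signed_distance(p, w, b) for p in class_a]
dist_b = [signed_distance(p, w, b) for p in class_b]

# The sign says which side of the line a point falls on;
# the margin is the distance from the line to the closest point.
margin = min(abs(d) for d in dist_a + dist_b)

print("class A distances:", dist_a)
print("class B distances:", dist_b)
print("margin:", margin)
```

A line that splits the classes but passes close to some points has a small margin. A margin classifier prefers the line for which this smallest distance is as large as possible.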

A classic and widely used margin classifier is the *support vector machine* (SVM). SVMs search a space, defined by the features, to find the optimal separating hyperplane (i.e., a plane in many dimensions). In the example above, it would conceptually consider many candidate lines such as $H_{1}$ and $H_{2}$ before eventually settling on the one with the largest margin.

We will use the implementation in scikit-learn, `SVC` (in `sklearn.svm`), a class that contains all the functions needed to train, test, and deploy an SVM. Note, however, that fitting an SVM is computationally expensive: training time scales at least quadratically with the number of samples. In other words, training an SVM on the order of 10,000 samples or more becomes very time consuming and memory hungry.
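As a minimal sketch of the `SVC` workflow, the block below fits a linear SVM to two well-separated synthetic clusters and then scores and queries it. The synthetic data, the random seed, and the choices `kernel="linear"` and `C=1.0` are assumptions made for illustration. They are not the settings used on the SPC features.

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic two-feature data: two tight clusters centered at (-3, -3) and (3, 3).
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=-3.0, scale=0.5, size=(50, 2)),
    rng.normal(loc=3.0, scale=0.5, size=(50, 2)),
])
y = np.array([0] * 50 + [1] * 50)

# Fit a linear SVM; C controls the trade-off between a wide margin
# and misclassified training points.
clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

# The support vectors are the training points that define the margin.
print("training accuracy:", clf.score(X, y))
print("number of support vectors:", len(clf.support_vectors_))
print("predicted class for (2, 2):", clf.predict([[2.0, 2.0]])[0])
```

On real features that span very different numeric ranges, as the SPC features do, scaling the features before fitting usually matters for SVMs. Accuracy should also be measured on held-out test data rather than on the training set as done here.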