# Module 4 - Margin and Ensemble Classifiers

Before the popularization of deep learning, many applied automatic classification algorithms were variations on ensemble or margin classifiers. Both types of classifiers operate on pre-defined features. As we will see later, this is fundamentally different from neural networks, which can learn directly from the images. For now, we will make use of the features pulled from SPC data in the last module.

In [1]:
import numpy as np
import cv2
import skimage
import sklearn
import sys
import glob
import os
import matplotlib.pyplot

The important new toolkit we are importing here is *sklearn*, short for scikit-learn. It contains most of the tools we will use to explore margin and ensemble classifiers.

For all the techniques discussed in the rest of this module, we will make use of the same features we computed before. We will also need to divide the data into separate sets for training and testing.

In [72]:
# load in the data. first get all the file paths
ptf = glob.glob(os.path.join("/media/storage/image_data/SPC_data/manual_labels_features/","*.csv"))

# make an empty array for the data (0 rows, 71 columns).
# the first column will be the class, the rest the features
data = np.empty((0, 71))

# we will also create a flag and a list to give the labels a numeric value
flag = 0
cls_names = []
for line in ptf:

    # read in the data, but skip the image path
    temp = np.genfromtxt(line, usecols=range(1,71),delimiter=",")
    
    # create an array to represent the labels
    temp2 = flag*np.ones([temp.shape[0],])
    
    # add them together
    temp = np.column_stack((temp2, temp))
    
    # for now, we will ignore any classes with 10 or fewer samples
    if 10 < temp.shape[0]:
        # append them to the whole data array
        data = np.concatenate((data, temp), axis=0)

        # create a list of the names of the categories and the associated numbers
        name = line.split('/')[-1].split('_')[0]
        print("class", str(flag), ":", name, 
              ", num samples:", str(temp.shape[0]))
        cls_names.append((flag, name))
        flag+=1

print("Total samples:", str(data.shape))
    

class 0 : Ciliate 01 , num samples: 424
class 1 : Glob , num samples: 16
class 2 : Acantharea , num samples: 24
class 3 : Poop 01 , num samples: 63
class 4 : Bubble , num samples: 331
class 5 : Sphere 01 , num samples: 38
class 6 : Bad Seg , num samples: 249
class 7 : Akashiwo , num samples: 447
class 8 : Polykrikos , num samples: 331
class 9 : Nauplius , num samples: 336
class 10 : Ellipse 01 , num samples: 43
class 11 : Lingulodinium , num samples: 653
class 12 : Protoperidinium Feeding , num samples: 11
class 13 : Ciliate 02 , num samples: 14
class 14 : Chain 01 , num samples: 352
class 15 : Diatom chain , num samples: 60
class 16 : Round 01 , num samples: 65
class 17 : Phyto Mix 01 , num samples: 895
class 18 : Ciliate 03 , num samples: 24
class 19 : Ceratium fusus , num samples: 1011
class 20 : Avocado 01 , num samples: 62
class 21 : Ceratium furca two , num samples: 38
class 22 : Cochlodinium , num samples: 332
class 23 : Red Eye , num samples: 44
class 24 : Prorocentrum Skinny ,
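
The split into training and testing sets mentioned earlier could be sketched as follows. This is a minimal illustration, not necessarily the split used later in the module: `demo_data` is a synthetic stand-in with the same layout as `data` (label in column 0, 70 features after it), and the 80/20 ratio, the stratification, and the random seed are all assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the `data` array built above (an assumption):
# column 0 holds the numeric class label, columns 1-70 hold the features.
rng = np.random.default_rng(0)
n_per_class = 50
labels = np.repeat(np.arange(3), n_per_class)
features = rng.normal(size=(labels.size, 70))
demo_data = np.column_stack((labels, features))

# Separate the features from the labels.
X = demo_data[:, 1:]
y = demo_data[:, 0].astype(int)

# Hold out 20% of the samples for testing. `stratify=y` keeps the class
# proportions the same in both sets, which matters when some classes are rare.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

print("train:", X_train.shape, "test:", X_test.shape)
```

With the real array, `demo_data` would be replaced by `data`. Stratifying is one reasonable choice here because the class sizes printed above range from 11 to over 1000 samples.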

## Margin classifiers

A margin classifier separates data in a space by assigning a distance between each point and the decision boundary. Imagine that we have just 2 features, $x_{1}$ and $x_{2}$, to separate two classes. We can plot the points in a plane and find a line that separates them.


<figure>
    <img src="https://upload.wikimedia.org/wikipedia/commons/b/b5/Svm_separating_hyperplanes_%28SVG%29.svg">
    <figcaption>
        Points and hyperplanes. Courtesy: ZackWeinberg, via Wikipedia
    </figcaption>
</figure>
        

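The line labeled $H_{3}$ is the best linear discriminant of this data: it separates the two classes with the largest margin, whereas $H_{1}$ fails to separate them and $H_{2}$ separates them only narrowly.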
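
To make the idea of a distance to the decision boundary concrete, here is a small sketch that computes the signed distance from a few points to a line $w \cdot x + b = 0$. The weights `w`, the offset `b`, and the points are made-up values for illustration and are not taken from the figure.

```python
import numpy as np

# Hypothetical separating line w . x + b = 0 in the (x1, x2) plane.
w = np.array([1.0, 1.0])
b = -3.0

# A few made-up points, one per row: (x1, x2).
points = np.array([[0.0, 0.0], [3.0, 3.0], [1.0, 1.0]])

# Signed distance of each point to the line. The sign says which side
# of the line the point falls on, the magnitude says how far away it is.
signed_dist = (points @ w + b) / np.linalg.norm(w)

# Classify by side of the line.
predicted = np.where(signed_dist >= 0, 1, -1)

# The margin of this line on these points is the smallest absolute distance.
margin = np.abs(signed_dist).min()

print("signed distances:", signed_dist)
print("predicted classes:", predicted)
print("margin:", margin)
```

A maximum-margin classifier looks for the `w` and `b` that make this smallest distance as large as possible while keeping every point on its correct side.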

A classic and widely used margin classifier is the *support vector machine* (SVM). SVMs search a space, defined by the features, to find the optimal separating hyperplane (i.e., a plane in many dimensions). Conceptually, in the example above, it would consider many candidate lines such as $H_{1}$ and $H_{2}$ before eventually settling on the one with the largest margin.

We will use the implementation in scikit-learn, `sklearn.svm.SVC`, a class that contains all the functions needed to train, test, and deploy an SVM. Note, however, that fitting an SVM is computationally expensive: training time scales at least quadratically with the number of samples. In other words, training an SVM with O(10k) samples becomes really time consuming and memory hungry.
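
As a rough sketch of how `SVC` is typically used, the example below trains a linear SVM on synthetic two-feature data and scores it on a held-out set. The blob centers, the linear kernel, `C=1.0`, and the feature scaling step are all assumptions chosen for illustration, not settings for the SPC features.

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic, well-separated data: 3 classes, 2 features (an assumption).
X, y = make_blobs(
    n_samples=300,
    centers=[[-5, -5], [0, 5], [5, -5]],
    cluster_std=1.0,
    random_state=0,
)

# Hold out a quarter of the samples for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# SVMs are sensitive to feature scale, so standardize the features first.
clf = make_pipeline(StandardScaler(), SVC(kernel="linear", C=1.0))

# Train on the training set, then measure accuracy on the unseen test set.
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)

print("test accuracy:", accuracy)
```

Wrapping the scaler and the classifier in one pipeline means the scaling parameters are learned from the training set only and then reused on the test set.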