### Classification

A classification problem is the problem of learning a function that maps input data to one of several class labels. A simple example of such a task would be an image classifier: we have as input an image and we wish to know if that image is a picture of a cat or a dog. In this case, "cat" and "dog" are the class labels. We can use a variety of methods to learn a function that ingests the pixels of an image and produces a decision that maps to one of those classes.

We will talk more about deep learning (deep neural networks) in a later lecture, so we'll focus on what's commonly referred to as "traditional" methods for machine learning, using sklearn (remember: installable via `pip install scikit-learn`).

#### Terminology

We'll keep the terminology simple here, but there are multiple ways of referring to the different aspects of this problem. What we refer to as the input is translated into a "feature matrix". In deep learning, those features are learned automatically using backpropagation. For more traditional methods, we either use the input data as-is or compute features from those data. For example, the iris dataset has a number of features: sepalWidth, sepalLength, petalLength, petalWidth. An example of a computed feature might be the ratio of sepalWidth to sepalLength.
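As a minimal sketch of a computed feature, the cell below builds a small DataFrame and adds the width-to-length ratio as a new column. The three rows of measurements and the column name `sepalRatio` are made up for illustration; only the four original column names come from the text.

```python
import pandas as pd

# Hypothetical measurements (in cm) for three flowers; the values are illustrative only.
flowers = pd.DataFrame({
    "sepalLength": [5.1, 7.0, 6.3],
    "sepalWidth": [3.5, 3.2, 3.3],
    "petalLength": [1.4, 4.7, 6.0],
    "petalWidth": [0.2, 1.4, 2.5],
})

# A computed feature: the ratio of sepal width to sepal length.
# "sepalRatio" is a name chosen for this sketch.
flowers["sepalRatio"] = flowers["sepalWidth"] / flowers["sepalLength"]

print(flowers)
```

The new column can then be used like any other feature when building the feature matrix.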

What we've referred to as the "label" is also often referred to as the "target". We typically use 'X' (note the uppercase) to refer to the feature matrix, and 'y' (note the lowercase) to refer to the target. X is usually a matrix, where the rows are the samples and the columns are the specific features, and y is an array, where each element is the label of a specific data sample. We can refer to these as "parallel arrays" because the elements line up: e.g. the 0th row of the feature matrix is the same sample as the 0th element of the target array. Therefore, there must be the same number of rows in the feature matrix as there are elements in the target array.

$$
X = \left(\begin{array}{cccc}
0.5 & 0.75 & 0.25 & 0.99\\
0.9 & 0.62 & 0.12 & 0.87\\
0.8 & 0.55 & 0.33 & 0.43
\end{array}\right)
\quad
y = \left(\begin{array}{c}
0 \\
0 \\
1
\end{array}\right)
$$

In the example above, we have a 3x4 feature matrix (3 samples, 4 features) and a 3-element target array. The first two rows have a class label of '0', and the last row has a class label of '1'.
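The same example can be written as NumPy arrays. This is only a sketch to show the shapes and the "parallel arrays" idea; the values are copied from the matrix above.

```python
import numpy as np

# Feature matrix: one row per sample, one column per feature.
X = np.array([
    [0.5, 0.75, 0.25, 0.99],
    [0.9, 0.62, 0.12, 0.87],
    [0.8, 0.55, 0.33, 0.43],
])

# Target array: one class label per sample, in the same order as the rows of X.
y = np.array([0, 0, 1])

print(X.shape)  # (3, 4)
print(y.shape)  # (3,)

# The number of rows in X must match the number of elements in y.
assert X.shape[0] == y.shape[0]

# Row i of X and element i of y describe the same sample.
for features, label in zip(X, y):
    print(features, "->", label)
```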

If we're trying to distinguish between two classes (eg. cats vs. dogs) then we have a 'binary classification' problem. If we have more than two classes, we'll refer to it as a 'multi-class classification' problem.

### Classification Process

When building a classifier, there are a few steps that we need to follow. The first is to gather our labeled data, which means building our feature matrix and our target array. We then split our data into 'training' and 'testing' sets, usually something close to an 80/20 split. There are other techniques we will eventually need, such as cross-validation and/or creating a validation split, but those are beyond the scope of this class for now.

We will then use the training set to train our model. Once our model is trained, we will test it by using the testing set. Our performance on the testing set can be measured via certain standard metrics such as accuracy, precision, and recall.
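The whole process can be sketched in a few lines. This sketch makes some assumptions that are not part of the lecture data: it loads the copy of the iris data bundled with scikit-learn (`load_iris`) so that it is self-contained, it fixes `random_state=0` only to make the split repeatable, and it uses `GaussianNB` as a stand-in classifier (any scikit-learn classifier would fit the same pattern). Because iris has three classes, precision and recall are macro-averaged here.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# 1. Gather labeled data: feature matrix X and target array y.
X, y = load_iris(return_X_y=True)

# 2. Split into training and testing sets (80/20), keeping class proportions similar.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# 3. Train on the training set only.
clf = GaussianNB().fit(X_train, y_train)

# 4. Evaluate on the held-out testing set.
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average="macro")
recall = recall_score(y_test, y_pred, average="macro")

print("accuracy: ", accuracy)
print("precision:", precision)
print("recall:   ", recall)
```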

### Types of Classifiers

There are a large number of different classifier types. A good overview from sklearn can be found here: https://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html 

We will focus on just three: Naïve Bayes, Random Forest, and Support Vector Machines (SVM).

### Naïve Bayes

Naïve Bayes (NB) is a very simple classifier based on Bayes' Theorem, which you may be familiar with through your statistics classes.

The basic assumption behind NB is "class conditional independence". This means that, given the class label, each feature in our feature matrix is assumed to be independent of the others. Which might not always be the case! But it makes the math easier. The equation behind NB is as follows:

\begin{equation}
P(h|D) = \frac{P(D|h)P(h)}{P(D)}
\end{equation}

The prior probability of the hypothesis h, which is the probability that the hypothesis is true before we see any data, is given by:
\begin{equation}
P(h)
\end{equation}

The probability of the data regardless of the hypothesis, known as the evidence (or marginal likelihood), is given by:
\begin{equation}
P(D)
\end{equation}

The probability of the hypothesis given the data, known as the posterior probability, is given by:
\begin{equation}
P(h|D)
\end{equation}

And finally, the probability of the data given that the hypothesis is true, known as the likelihood, is given by:
\begin{equation}
P(D|h)
\end{equation}
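To see how the four quantities fit together, here is a small worked example. All of the numbers are hypothetical and chosen only to make the arithmetic easy to follow; P(D) is expanded with the law of total probability over "h is true" and "h is false".

```python
# Hypothetical probabilities, for illustration only.
p_h = 0.01              # P(h): probability the hypothesis is true before seeing data
p_d_given_h = 0.90      # P(D|h): probability of the data if the hypothesis is true
p_d_given_not_h = 0.05  # P(D|not h): probability of the data if the hypothesis is false

# P(D): total probability of the data under both possibilities.
p_d = p_d_given_h * p_h + p_d_given_not_h * (1 - p_h)

# Bayes' Theorem: P(h|D) = P(D|h) * P(h) / P(D)
p_h_given_d = p_d_given_h * p_h / p_d

print(round(p_d, 4))          # 0.0585
print(round(p_h_given_d, 4))  # 0.1538
```

Even though the data is much more likely when h is true, P(h|D) stays fairly small here because P(h) was small to begin with.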

NB works by:
1. Calculating the prior probability of each class
2. Finding the likelihood of each feature given the class - remember class conditional independence?
3. Applying Bayes' formula to calculate the posterior probability of each class
4. Picking the class with the highest posterior probability for a given input
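The steps above can be sketched from scratch with NumPy. This is a simplified illustration, not how scikit-learn implements it: it assumes each feature is normally distributed within each class (the "Gaussian" variant), works in log space to avoid multiplying many small numbers, and skips dividing by P(D) because that term is the same for every class and does not change which class wins. The function names and the toy data are made up for this sketch.

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """Estimate per-class priors and per-class, per-feature means and variances."""
    classes = np.unique(y)
    # Step 1: prior probability of each class = fraction of samples in that class.
    priors = np.array([np.mean(y == c) for c in classes])
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    # A tiny constant keeps the variance strictly positive.
    variances = np.array([X[y == c].var(axis=0) for c in classes]) + 1e-9
    return classes, priors, means, variances

def predict_gaussian_nb(params, X):
    classes, priors, means, variances = params
    # Step 2: log-likelihood of every feature under every class,
    # shape (n_samples, n_classes, n_features).
    diff = X[:, None, :] - means[None, :, :]
    log_likelihood = -0.5 * (np.log(2 * np.pi * variances)[None, :, :]
                             + diff ** 2 / variances[None, :, :])
    # Step 3: class conditional independence lets us sum the per-feature
    # log-likelihoods, then add the log prior (unnormalized log posterior).
    log_posterior = np.log(priors)[None, :] + log_likelihood.sum(axis=2)
    # Step 4: pick the class with the highest posterior for each sample.
    return classes[np.argmax(log_posterior, axis=1)]

# Toy data: two well-separated classes with two features each.
X = np.array([[1.0, 2.1], [1.2, 1.9], [0.8, 2.0],
              [3.0, 0.5], [3.2, 0.7], [2.9, 0.4]])
y = np.array([0, 0, 0, 1, 1, 1])

params = fit_gaussian_nb(X, y)
new_points = np.array([[1.1, 2.0], [3.1, 0.6]])
print(predict_gaussian_nb(params, new_points))  # [0 1]
```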

### Preparing our data

We'll load and prepare our data, which we'll use for each classifier.

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, confusion_matrix
import pandas as pd

In [None]:
iris = pd.read_json('iris.json')
species = iris['species'].unique()
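# Encode each species name as an integer class label (0, 1, 2, ...).
# Assigning the result back avoids relying on an in-place edit of a column selection.
iris['species'] = iris['species'].map({name: i for i, name in enumerate(species)})
y = iris['species'].to_numpy()  # 1-D target array; recall that sometimes we have to .reshape(-1,1), but not for NB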
X = iris[['sepalLength','sepalWidth','petalLength','petalWidth']]
X_train,X_test,y_train,y_test = train_test_split(X,y, test_size = 0.15) 

### Building the Model

Now that we have our data, we'll build our model.

In [None]:
model = GaussianNB()

model.fit(X_train, y_train)

predicted = model.predict(X_test)

In [None]:
print(predicted)
print(y_test)

In [None]:
# time to evaluate

accuracy = accuracy_score(y_test, predicted)  # sklearn's convention is (y_true, y_pred)
print('Classification accuracy: {}'.format(accuracy))
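Accuracy is a single number, so it cannot tell us which classes get confused with each other. A confusion matrix can. The sketch below uses small hand-written label arrays (hypothetical, not the iris predictions) so that the counts are easy to check by eye; in scikit-learn's `confusion_matrix`, rows are the true classes and columns are the predicted classes.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true labels and predictions for six samples and three classes.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

cm = confusion_matrix(y_true, y_pred)
print(cm)
# [[1 1 0]
#  [0 2 0]
#  [1 0 1]]

# The diagonal holds the correct predictions: 1 + 2 + 1 = 4 out of 6.
toy_accuracy = accuracy_score(y_true, y_pred)
print(toy_accuracy)
```

The same call works on the real results, e.g. `confusion_matrix(y_test, predicted)`.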

### Random Forest

Random Forest (RF) is a pretty flexible model, as you can use RF for both regression and classification tasks. We will focus on using RF for classification tasks.

The other notable feature of RF is that it is an ensemble method; the RF is built from a "forest" of decision trees, which work exactly as their name suggests. Each decision tree is trained on a random sample of the training data, considers a random subset of the features from the feature matrix when choosing its splits, and makes a classification decision. The RF is then a collection of those decision trees, each of which contributes a part to the whole, and their combined decisions are used to make the overall classification determination.

We'll reuse our data and see how this works.

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import export_graphviz
from IPython.display import Image
import graphviz

In [None]:
rf = RandomForestClassifier()
rf.fit(X_train, y_train)

In [None]:
predictions = rf.predict(X_test)

In [None]:
accuracy = accuracy_score(y_test, predictions)  # use the RF predictions, not the NB ones
print('Accuracy: {}'.format(accuracy))

In [None]:
list(species)

In [None]:
# we can visualize what decision trees are doing within the RF
print('Number of decision trees: {}'.format(len(rf.estimators_)))
for i in range(1):
    tree = rf.estimators_[i]
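    # Pass plain lists of names so the feature and class labels show up in the rendered tree.
    dot_data = export_graphviz(tree, feature_names=list(X_train.columns), class_names=list(species), filled=True, max_depth=10, impurity=False, proportion=True)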
    graph = graphviz.Source(dot_data)
    display(graph)
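To make the "combined decisions" idea concrete, the sketch below checks how scikit-learn's forest combines its trees: rather than counting hard votes, it averages the class probabilities predicted by each tree and picks the class with the highest average. The sketch is self-contained, so it makes a few assumptions: it loads scikit-learn's bundled iris data (`load_iris`) instead of `iris.json`, and `n_estimators=25` and `random_state=0` are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
rf = RandomForestClassifier(n_estimators=25, random_state=0).fit(X, y)

# Ask every individual tree for its class probabilities,
# shape (n_trees, n_samples, n_classes).
tree_probs = np.stack([tree.predict_proba(X) for tree in rf.estimators_])

# Average over the trees, then pick the most probable class per sample.
avg_probs = tree_probs.mean(axis=0)
manual_pred = rf.classes_[np.argmax(avg_probs, axis=1)]

# The hand-computed average should match the forest's own probabilities.
print(np.allclose(avg_probs, rf.predict_proba(X)))
# Fraction of samples where the hand-computed prediction agrees with rf.predict.
print(np.mean(manual_pred == rf.predict(X)))
```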

### Support Vector Machines

While Support Vector Machines (SVM) can be used to solve regression problems, they are most commonly applied to classification problems. They are an extremely powerful model and, prior to deep learning, were a very common and useful choice for classification tasks. SVMs can work on both continuous and categorical data, provided the categorical data is first encoded numerically.

Recall our feature matrix. Its dimensionality is equal to the number of features (columns), so our iris data has a dimensionality of 4. That means that each of our samples is a point in a 4-dimensional space. This is obviously hard to visualize, so we'll take some shortcuts when we get to that point.

SVMs attempt to learn a hyperplane (in 2D a hyperplane is simply a line, and in 3D it is a flat plane) that separates the classes in the n-dimensional space. The objective of the SVM algorithm is to find the maximum-margin hyperplane, meaning the one that maximizes the distance (the margin) between the hyperplane and the closest training points of each class.

One quick note: sometimes our data is not linearly separable. I'll draw an example on the board. If this is the case, SVMs can use something called the "kernel trick" to implicitly map the data into a higher-dimensional space where the data may be linearly separable. There are many different types of kernels, including linear, polynomial, and radial basis function (RBF). The RBF kernel corresponds to an infinite-dimensional feature space. We won't go into the details. The behavior of a kernel is controlled by hyperparameters, most notably "gamma" for the RBF kernel, which can be any positive value and controls how far the influence of a single training sample reaches. A small gamma gives a smooth decision boundary, while a large gamma lets the boundary bend tightly around individual training points. A very large gamma can separate the training data almost perfectly while generalizing poorly to new data, which is called "overfitting".
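A quick way to get a feel for gamma is to train the same RBF-kernel SVM with a few different values and compare accuracy on the training data against accuracy on held-out data. This is only a sketch: it loads scikit-learn's bundled iris data (`load_iris`) so it runs on its own, and the three gamma values and `random_state=0` are arbitrary choices (scikit-learn accepts any positive gamma). A large gap between the training and testing scores is the typical sign of overfitting.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

results = {}
for gamma in [0.01, 1, 100]:
    # Same model each time; only gamma changes.
    clf = SVC(kernel="rbf", gamma=gamma).fit(X_train, y_train)
    train_acc = clf.score(X_train, y_train)  # accuracy on data the model has seen
    test_acc = clf.score(X_test, y_test)     # accuracy on held-out data
    results[gamma] = (train_acc, test_acc)
    print("gamma={:<6} train={:.3f} test={:.3f}".format(gamma, train_acc, test_acc))
```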

In [None]:
from sklearn import svm

In [None]:
svc_model = svm.SVC(kernel='linear')
svc_model.fit(X_train, y_train)
predictions = svc_model.predict(X_test)  # predict with the SVM, not the earlier NB model

In [None]:
accuracy = accuracy_score(y_test, predictions)
print('Accuracy: {}'.format(accuracy))

In [None]:
import seaborn as sns
import numpy as np
from sklearn.decomposition import PCA

In [None]:
X_train = X_train.to_numpy()

In [None]:
# Our data is 4-dimensional, so the SVM's decision boundary is hard to visualize directly.
# We'll use PCA to reduce the data down to 2D and fit a new SVM on the reduced data so we can look at it.
# Code grabbed from: https://stackoverflow.com/questions/51297423/plot-scikit-learn-sklearn-svm-decision-boundary-surface

pca = PCA(n_components=2)
Xreduced = pca.fit_transform(X_train)

def make_meshgrid(x, y, h=.02):
    x_min, x_max = x.min() - 1, x.max() + 1
    y_min, y_max = y.min() - 1, y.max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
    return xx, yy

def plot_contours(ax, clf, xx, yy, **params):
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    out = ax.contourf(xx, yy, Z, **params)
    return out

# Fit a separate linear SVM on the 2D PCA-projected data (this does not overwrite the earlier `model`).
clf = svm.SVC(kernel='linear').fit(Xreduced, y_train)

fig, ax = plt.subplots()
# Set up the grid for plotting.
X0, X1 = Xreduced[:, 0], Xreduced[:, 1]
xx, yy = make_meshgrid(X0, X1)

# Shade each region by predicted class, then overlay the training points.
plot_contours(ax, clf, xx, yy, cmap=plt.cm.coolwarm, alpha=0.8)
ax.scatter(X0, X1, c=y_train, cmap=plt.cm.coolwarm, s=20, edgecolors='k')
ax.set_ylabel('PC2')
ax.set_xlabel('PC1')
ax.set_xticks(())
ax.set_yticks(())
ax.set_title('Decision surface of linear SVC using the PCA transformed/projected features')
plt.show()