# Classification Algorithms

*Classification* problems involve learning from labeled data how to assign each observation to one of a set of known classes.

## References

1. Scikit-Learn documentation
    * [SVC on the Iris dataset](http://scikit-learn.org/stable/auto_examples/svm/plot_iris.html#sphx-glr-auto-examples-svm-plot-iris-py)
    * [SGD on the Iris dataset](http://scikit-learn.org/stable/auto_examples/linear_model/plot_sgd_iris.html)


## Support Vector Classification (SVC)
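Support vector classification, like support vector regression, uses `scikit-learn`'s support vector machine (SVM) components,
in this case to assign data points to classes rather than to predict a continuous value.
`SVC` is best suited to smaller numeric datasets, because its fit time grows at least quadratically with the number of samples.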

In the example below, taken directly from the `scikit-learn` documentation, 
several different SVM kernels are tested against the same data, 
and the resultant classifications are plotted for comparison.

In [None]:


import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm, datasets

# import some data to play with
iris = datasets.load_iris()
X = iris.data[:, :2]  # we only take the first two features. We could
                      # avoid this ugly slicing by using a two-dim dataset
y = iris.target

h = .02  # step size in the mesh

# we create an instance of SVM and fit our data. We do not scale our
# data since we want to plot the support vectors
C = 1.0  # SVM regularization parameter
svc = svm.SVC(kernel='linear', C=C).fit(X, y)
rbf_svc = svm.SVC(kernel='rbf', gamma=0.7, C=C).fit(X, y)
poly_svc = svm.SVC(kernel='poly', degree=3, C=C).fit(X, y)
lin_svc = svm.LinearSVC(C=C).fit(X, y)

# create a mesh to plot in
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))

# title for the plots
titles = ['SVC with linear kernel',
          'LinearSVC (linear kernel)',
          'SVC with RBF kernel',
          'SVC with polynomial (degree 3) kernel']


for i, clf in enumerate((svc, lin_svc, rbf_svc, poly_svc)):
    # Plot the decision boundary. For that, we will assign a color to each
    # point in the mesh [x_min, x_max]x[y_min, y_max].
    plt.subplot(2, 2, i + 1)
    plt.subplots_adjust(wspace=0.4, hspace=0.4)

    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])

    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    plt.contourf(xx, yy, Z, cmap=plt.cm.coolwarm, alpha=0.8)

    # Plot also the training points
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.coolwarm)
    plt.xlabel('Sepal length')
    plt.ylabel('Sepal width')
    plt.xlim(xx.min(), xx.max())
    plt.ylim(yy.min(), yy.max())
    plt.xticks(())
    plt.yticks(())
    plt.title(titles[i])

plt.show()

### Discussion
Note that in the above example, the choice of kernel had a significant effect
on the decision boundaries that separate the classes: the two linear models draw straight boundaries,
while the RBF and polynomial kernels draw curved ones.
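
The visual comparison can be complemented with a number. The following is a minimal sketch, not part of the original `scikit-learn` example, that scores three of the kernels with 5-fold cross-validation on the same two sepal features. The choice of `cv=5`, the `gamma="auto"` setting for the polynomial kernel, and the name `kernel_scores` are assumptions made for illustration.

```python
from sklearn import datasets, svm
from sklearn.model_selection import cross_val_score

# Same data as above: the first two (sepal) features of Iris
iris = datasets.load_iris()
X = iris.data[:, :2]
y = iris.target

C = 1.0  # SVM regularization parameter, as in the example above
models = {
    "linear": svm.SVC(kernel="linear", C=C),
    "rbf": svm.SVC(kernel="rbf", gamma=0.7, C=C),
    # gamma="auto" is an assumption made for this sketch
    "poly (degree 3)": svm.SVC(kernel="poly", degree=3, gamma="auto", C=C),
}

# Mean accuracy over 5 cross-validation folds for each kernel
kernel_scores = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    kernel_scores[name] = scores.mean()
    print("%-16s mean accuracy: %.3f" % (name, kernel_scores[name]))
```

Scores that lie close together would suggest that, for this small dataset, the kernels differ more in the shape of their boundaries than in how many points they classify correctly.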

---
## SGD Classification with SGDClassifier
Much like `SGDRegressor`, `SGDClassifier` is based on stochastic gradient descent. 
It is best suited to classification tasks involving very large datasets (more than 100,000 samples).
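
One reason SGD scales to large datasets is that the model can be updated one batch at a time. The sketch below illustrates this with `SGDClassifier.partial_fit`; the synthetic dataset from `make_classification`, its size, and the batch size of 1,000 are all assumptions chosen to keep the example quick, standing in for data too large to fit in memory at once.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Synthetic stand-in for a large dataset (sizes are hypothetical)
X_big, y_big = make_classification(n_samples=20000, n_features=20,
                                   n_informative=10, random_state=0)
classes = np.unique(y_big)

stream_clf = SGDClassifier(alpha=0.001, random_state=0)

# Feed the data in mini-batches; each call performs one pass over that batch.
# partial_fit needs the full list of classes on the first call.
batch_size = 1000
for start in range(0, X_big.shape[0], batch_size):
    stop = start + batch_size
    stream_clf.partial_fit(X_big[start:stop], y_big[start:stop], classes=classes)

stream_accuracy = stream_clf.score(X_big, y_big)
print("Accuracy after one streamed pass: %.3f" % stream_accuracy)
```

In a real out-of-core setting, each batch would be read from disk or a database inside the loop rather than sliced from an in-memory array.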

It can, however, be used on smaller datasets, such as the Iris dataset used above. 
This is the basis of the next example, also taken from the `scikit-learn` documentation:

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.linear_model import SGDClassifier

# import some data to play with
iris = datasets.load_iris()
X = iris.data[:, :2]  # we only take the first two features. We could
                      # avoid this ugly slicing by using a two-dim dataset
y = iris.target
colors = "bry"

# shuffle
idx = np.arange(X.shape[0])
np.random.seed(13)
np.random.shuffle(idx)
X = X[idx]
y = y[idx]

# standardize
mean = X.mean(axis=0)
std = X.std(axis=0)
X = (X - mean) / std

h = .02  # step size in the mesh

# max_iter caps the number of passes (epochs) over the training data
clf = SGDClassifier(alpha=0.001, max_iter=75).fit(X, y)

# create a mesh to plot in
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))

# Plot the decision boundary. For that, we will assign a color to each
# point in the mesh [x_min, x_max]x[y_min, y_max].
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
# Put the result into a color plot
Z = Z.reshape(xx.shape)
cs = plt.contourf(xx, yy, Z, cmap=plt.cm.Paired)
plt.axis('tight')

# Plot also the training points
for i, color in zip(clf.classes_, colors):
    idx = np.where(y == i)
    # a single named color is passed, so no colormap is needed
    plt.scatter(X[idx, 0], X[idx, 1], c=color, label=iris.target_names[i])
plt.title("Decision surface of multi-class SGD")
plt.axis('tight')

# Plot the three one-against-all classifiers
xmin, xmax = plt.xlim()
ymin, ymax = plt.ylim()
coef = clf.coef_
intercept = clf.intercept_

# plot hyperplanes
def plot_hyperplane(c, color):
    def line(x0):
        return (-(x0 * coef[c, 0]) - intercept[c]) / coef[c, 1]

    plt.plot([xmin, xmax], [line(xmin), line(xmax)],
             ls="--", color=color)

for i, color in zip(clf.classes_, colors):
    plot_hyperplane(i, color)
plt.legend()
plt.show()

### Explanation of SGD classification results

SGD distinguished *I. setosa* from the other species more easily than it distinguished *I. versicolor* from *I. virginica*, 
because the versicolor and virginica data points overlap heavily in sepal length and width. 
This is consistent with observation: 
*versicolor* and *virginica* are much more difficult to tell apart than *setosa*, as their growth habits are very similar.
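
The overlap can be inspected with a confusion matrix, which counts how often each true species (rows) is predicted as each species (columns). This is a minimal sketch rather than part of the original example; it refits an `SGDClassifier` on the standardized sepal features, and the `max_iter=1000` and `random_state=13` settings as well as the `_cm` variable names are assumptions.

```python
from sklearn import datasets
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import confusion_matrix

iris_cm = datasets.load_iris()
X_cm = iris_cm.data[:, :2]
y_cm = iris_cm.target

# standardize, as in the example above
X_cm = (X_cm - X_cm.mean(axis=0)) / X_cm.std(axis=0)

clf_cm = SGDClassifier(alpha=0.001, max_iter=1000, random_state=13).fit(X_cm, y_cm)

# Rows are true classes, columns are predicted classes,
# in the order setosa, versicolor, virginica
cm = confusion_matrix(y_cm, clf_cm.predict(X_cm))
print(iris_cm.target_names)
print(cm)
```

Large off-diagonal counts in the versicolor and virginica rows, next to a nearly clean setosa row, would reflect the pattern described above.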

---
## Naive Bayes Classification

Naive Bayes is a family of supervised learning algorithms based on [Bayes' theorem](https://betterexplained.com/articles/an-intuitive-and-short-explanation-of-bayes-theorem/). 
There are a number of variants, 
but all of them assume that the predictors are independent of one another once the class is known 
(this is the 'naive' part of the name; in reality, many response variables have interrelated predictors), 
so Naive Bayes works best on data where that assumption roughly holds.
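
To make the independence assumption concrete, here is a hand-rolled NumPy sketch of the idea, assuming each feature is normally distributed within each class. It is an illustration only, not the library's implementation, and the helper `log_posterior` is a hypothetical name. For each class, the log prior is added to the sum of per-feature log-likelihoods, and the class with the highest total wins.

```python
import numpy as np
from sklearn.datasets import load_iris

iris_nb = load_iris()
X_nb, y_nb = iris_nb.data, iris_nb.target
classes_nb = np.unique(y_nb)

# Per-class prior, plus the mean and variance of every feature within each class
priors = np.array([np.mean(y_nb == c) for c in classes_nb])
means = np.array([X_nb[y_nb == c].mean(axis=0) for c in classes_nb])
variances = np.array([X_nb[y_nb == c].var(axis=0) for c in classes_nb])

def log_posterior(x):
    """Unnormalized log posterior of each class for a single sample x."""
    # Log of the Gaussian density for every (class, feature) pair
    log_pdf = -0.5 * (np.log(2 * np.pi * variances) + (x - means) ** 2 / variances)
    # The "naive" step: features are treated as independent given the class,
    # so their log-likelihoods are simply added together
    return np.log(priors) + log_pdf.sum(axis=1)

manual_pred = np.array([classes_nb[np.argmax(log_posterior(x))] for x in X_nb])
manual_accuracy = np.mean(manual_pred == y_nb)
print("Training accuracy of the hand-rolled model: %.2f" % manual_accuracy)
```

Because the per-feature terms are summed without any interaction term, strongly correlated features are effectively counted more than once, which is where the assumption can hurt.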

We'll use the variant that assumes each predictor follows a Gaussian distribution within each class, `GaussianNB`, as an example:

In [None]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from collections import OrderedDict

# load data and separate variables
iris = load_iris()
X = iris.data[:, :2]
y = iris.target

# split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.6)

#print(X_test.shape, y_test.shape)

# construct Gaussian Naive Bayes model and fit data
gnb = GaussianNB().fit(X_train, y_train)

# predict() returns the numeric index of the class to which each test point belongs
labels = gnb.predict(X_test)

# pick your favorite colors!
colors = ["red", "blue", "green"]

# iterate over the test points and color each one by its predicted class
for i in range(len(X_test)):
    col = colors[labels[i]]
    plt.plot(X_test[i, 0], X_test[i, 1], color=col, marker='o',
             markersize=5, label="Class %i" % labels[i])
# collapse duplicate legend entries so that each class appears only once
handles, legend_labels = plt.gca().get_legend_handles_labels()
by_label = OrderedDict(zip(legend_labels, handles))
plt.legend(by_label.values(), by_label.keys(), loc='upper right')
plt.show()

Note that this example works in much the same way as the [K-means clustering example](./clustering.ipynb#KMC). That similarity holds only because the data is labeled; Naive Bayes needs class labels to train on and could not be fit to unlabeled data.
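
Because the true labels are available, the predictions can also be scored against the held-out test set, which is something a clustering result cannot offer directly. The sketch below assumes the same split proportions as above; the `random_state=0` and `stratify=y_all` arguments and the `_all`, `_tr`, and `_te` names are additions made so that the result is repeatable.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

iris_all = load_iris()
X_all = iris_all.data[:, :2]
y_all = iris_all.target

# Same 40/60 train/test proportions as above, with a fixed seed (assumption)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.6, random_state=0, stratify=y_all)

gnb_scored = GaussianNB().fit(X_tr, y_tr)

# Fraction of test points whose predicted class matches the true class
test_accuracy = accuracy_score(y_te, gnb_scored.predict(X_te))
print("Test accuracy: %.3f" % test_accuracy)
```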