__Chapter 3 - A tour of machine learning classifiers using scikit-learn__

1. [First steps with sklearn - training a perceptron](#First-steps-with-sklearn-training-a-perceptron)
1. [Modeling class probabilities w/ logistic regression](#Modeling-class-probabilities-w/-logistic-regression)
    1. [Intuition: logistic regression](#Intuition-logistic-regression)
        1. [Sigmoid function](#Sigmoid-function)
        1. [Decision function](#Decision-function)
    1. [Training a logistic regression model](#Training-a-logistic-regression-model)
1. [Maximum margin classification with SVMs](#Maximum-margin-classification-with-SVMs)
    1. [Intuition: SVMs](#SVM-intuition)
    1. [Nonlinearly separable cases and slack variables](#Nonlinearly-separable-cases-and-slack-variables)
    1. [Logistic regression vs. SVMs](#Logistic-regression-vs-SVMs)
    1. [Solving nonlinear problems using a kernel SVM](#Solving-nonlinear-problems-using-a-kernel-SVM)
    1. [$\gamma$ parameter](#gamma-parameter)
1. [Decision tree learning](#Decision-tree-learning)
    1. [Information gain](#Information-gain)
    1. [Combining decisions trees via random forests](#Combining-decisions-trees-via-random-forests)
1. [K-nearest neighbors](#K-nearest-neighbors)


In [None]:
# Standard library and settings
import os
import sys
import importlib
import itertools
import warnings; warnings.simplefilter('ignore')
dataPath = os.path.abspath(os.path.join('../../Data'))
modulePath = os.path.abspath(os.path.join('../../CustomModules'))
sys.path.append(modulePath) if modulePath not in sys.path else None
from IPython.core.display import display, HTML; display(HTML("<style>.container { width:95% !important; }</style>"))


# Data extensions and settings
import numpy as np
np.set_printoptions(threshold = np.inf, suppress = True)
import pandas as pd
pd.set_option('display.max_rows', 500)
pd.options.display.float_format = '{:,.6f}'.format


# Modeling extensions
import sklearn.base as base
import sklearn.cluster as cluster
import sklearn.datasets as datasets
import sklearn.decomposition as decomposition
import sklearn.ensemble as ensemble
import sklearn.feature_extraction as feature_extraction
import sklearn.feature_selection as feature_selection
import sklearn.linear_model as linear_model
import sklearn.metrics as metrics
import sklearn.model_selection as model_selection
import sklearn.neighbors as neighbors
import sklearn.pipeline as pipeline
import sklearn.preprocessing as preprocessing
import sklearn.svm as svm
import sklearn.tree as tree


# Visualization extensions and settings
import seaborn as sns
import matplotlib.pyplot as plt


# Custom extensions and settings
from quickplot import qp, qpUtil, qpStyle
from mlTools import powerGridSearch
sns.set(rc = qpStyle.rcGrey)


# Magic functions
%matplotlib inline


# First steps with sklearn - training a perceptron

<a id = 'Choosing-a-classifier'></a>

## Choosing a classifier
- Five key steps in training an algorithm:
    1. Select features and collect training data
    2. Choose a performance metric
    3. Choose a classifier and optimization algorithm
    4. Evaluate model performance
    5. Tune the algorithm's parameters

No single classifier works best in every scenario, so it is worth evaluating several different classifiers.

In [None]:
# Load data and inspect class labels

iris = datasets.load_iris()
X = iris.data[:,[2,3]]
y = iris.target
print('Class labels: {0}'.format(np.unique(y)))


In [None]:
# Train/test split
# Stratify ensures proportional distribution of classes between the train/test data

xTrain, xTest, yTrain, yTest = model_selection.train_test_split(X, y
                                            ,test_size = 0.3, random_state = 1, stratify = y)

print('Label counts in y: {}'.format(np.bincount(y)))
print('Label counts in yTrain: {}'.format(np.bincount(yTrain)))
print('Label counts in yTest: {}'.format(np.bincount(yTest)))


In [None]:
# Scale data

sc = preprocessing.StandardScaler()
sc.fit(xTrain)
xTrainStd = sc.transform(xTrain)
xTestStd = sc.transform(xTest)


In [None]:
# Review how first feature changes after standard scaling

print('Original mean: {0}'.format(round(xTrain[:,0].mean(),5)))
print('Original standard deviation: {0}'.format(round(xTrain[:,0].std(),5)))
print()
print('Scaled mean: {0}'.format(round(xTrainStd[:,0].mean(),5)))
print('Scaled standard deviation: {0}'.format(round(xTrainStd[:,0].std(),5)))


> Remarks - Standard scaling uses the mean $\mu$ and standard deviation $\sigma$ of each feature to transform that feature independently, such that each scaled feature has $\mu = 0$ and $\sigma = 1$.

> It is important to transform the test dataset using the fit performed on the training set only. First, we want the train and test data to be scaled in the same way. Further, we're assuming we don't know anything about the test data, and in practice new unseen data will need to be scaled using the parameters of the existing scaling operation.
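The scaling step can also be written out by hand. The following is a minimal sketch using randomly generated toy arrays (hypothetical data, not the iris arrays used in this notebook). It estimates $\mu$ and $\sigma$ from the training data only and reuses them for the test data.

```python
import numpy as np

# Hypothetical toy data standing in for a train/test split
rng = np.random.default_rng(1)
xTrainToy = rng.normal(loc = 5.0, scale = 2.0, size = (100, 2))
xTestToy = rng.normal(loc = 5.0, scale = 2.0, size = (40, 2))

# Estimate the scaling parameters from the training data only
mu = xTrainToy.mean(axis = 0)
sigma = xTrainToy.std(axis = 0)  # population standard deviation (ddof = 0)

# Apply the same training-set parameters to both sets
xTrainToyStd = (xTrainToy - mu) / sigma
xTestToyStd = (xTestToy - mu) / sigma

print('Train mean / std: {0} / {1}'.format(xTrainToyStd.mean(axis = 0).round(5), xTrainToyStd.std(axis = 0).round(5)))
print('Test mean / std: {0} / {1}'.format(xTestToyStd.mean(axis = 0).round(5), xTestToyStd.std(axis = 0).round(5)))
```

The scaled training features have a mean of 0 and a standard deviation of 1 by construction. The test features will generally be close to, but not exactly at, those values, because they are scaled with the training-set parameters.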

In [None]:
# Instantiate Perceptron and fit model

ppn = linear_model.Perceptron(max_iter = 40, eta0 = 0.1, random_state = 1)
ppn.fit(xTrainStd, yTrain)


> Remarks - 
- 'max_iter' limits the number of epochs (passes over the training data). Older scikit-learn releases named this parameter 'n_iter'.
- 'eta0' controls the learning rate. A value too high will likely overshoot the minimum. A value too low will make the learning process unnecessarily slow.

# Maximum margin classification with SVMs

SVMs are similar to the perceptron in that the goal is to determine a boundary between classes. A key difference is that while the perceptron seeks to minimize classification error, SVMs seek to maximize the margin. The margin is the distance between the separating boundary (hyperplane) and the samples closest to that boundary. These samples are referred to as the support vectors.

In theory, maximizing the margin leads to lower generalization error, because hyperplanes with narrower margins sit close to the samples, have higher variance, and are therefore more prone to overfitting.

<a id = 'SVM-intuition'></a>

## Intuition

One way to view that boundary is that there are actually three hyperplanes (in a binary classification problem):

1. Decision boundary
2. Positive hyperplane
3. Negative hyperplane

The decision boundary is the object of the SVM, and the other two hyperplanes are parallel to it. The positive and negative hyperplanes are those that pass through the positive and negative samples closest to the decision boundary. A simple mathematical expression of these hyperplanes:

$$
w_0 + \textbf{w}^T\textbf{x}_{pos} = 1
$$
$$
w_0 + \textbf{w}^T\textbf{x}_{neg} = -1
$$

Subtracting the second equation from the first yields:

$$
\textbf{w}^T\big(\textbf{x}_{pos} - \textbf{x}_{neg}\big) = 2
$$

This equation can be normalized by the length of the vector $\textbf{w}$, which is defined as:

$$
\lVert\textbf{w}\rVert = \sqrt{\sum\nolimits_{j=1}^{m}w_j^2}
$$

Dividing both sides by $\lVert\textbf{w}\rVert$ gives:

$$
\frac{\textbf{w}^T\big(\textbf{x}_{pos} - \textbf{x}_{neg}\big)}{\lVert\textbf{w}\rVert} = \frac{2}{\lVert\textbf{w}\rVert}
$$
The left-hand side of the equation above is the distance between the positive and negative hyperplanes. This is the margin that we want to maximize. The objective of the SVM is to maximize that margin by maximizing the right-hand side, $\frac{2}{\lVert\textbf{w}\rVert}$, subject to the constraint that the samples are correctly classified:

$$
w_0 + \textbf{w}^T\textbf{x}^{(i)} \geq 1 \, if \, y^{(i)} = 1
$$

$$
w_0 + \textbf{w}^T\textbf{x}^{(i)} \leq -1 \, if \, y^{(i)} = -1
$$

$$
\mbox{for} \, i = 1...N
$$

where $N$ is the number of samples in the dataset. The equations above essentially say that all positive samples should fall on or beyond the positive hyperplane, and all negative samples on or beyond the negative hyperplane. These two constraints can be written compactly as:

$$
y^{(i)}\big(w_0 + \textbf{w}^T\textbf{x}^{(i)}\big) \geq 1 \; \forall_i
$$
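The margin and the compact constraint can be checked numerically. The sketch below uses hand-picked, hypothetical values for $\textbf{w}$, $w_0$ and four toy samples (none of them come from a fitted model). The first two samples lie exactly on the negative and positive hyperplanes.

```python
import numpy as np

# Hypothetical weights and bias for a 2-D linear classifier (chosen for illustration)
w = np.array([2.0, 0.0])
w0 = 0.0

# Toy samples: the first lies on the negative hyperplane, the second on the positive one
xToy = np.array([[-0.5, 1.0], [0.5, -1.0], [-2.0, 0.0], [1.5, 3.0]])
yToy = np.array([-1, 1, -1, 1])

# Compact constraint: y * (w0 + w^T x) must be >= 1 for every sample
constraint = yToy * (w0 + xToy @ w)
print('Constraint values: {0}'.format(constraint))

# Margin as the right-hand side of the normalized equation
margin = 2.0 / np.linalg.norm(w)

# Margin as the left-hand side, using one sample from each hyperplane
xNeg, xPos = xToy[0], xToy[1]
marginLhs = w @ (xPos - xNeg) / np.linalg.norm(w)

print('Margin (RHS): {0}'.format(margin))
print('Margin (LHS): {0}'.format(marginLhs))
```

Both sides give the same margin, and every constraint value is at least 1, with equality for the two samples that sit on the positive and negative hyperplanes.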


<a id = 'Nonlinearly-separable-cases-and-slack-variables'></a>

## Nonlinearly separable cases and slack variables

This section only scratches the surface of soft-margin classification, which allows for a certain level of misclassification tolerance through the slack variable $\xi$. This is helpful in cases where the data is not completely linearly separable. To accomplish this, the slack variable is added to the linear constraints described earlier:

$$
w_0 + \textbf{w}^T\textbf{x}^{(i)} \geq 1 - \xi^{(i)} \, if \, y^{(i)} = 1
$$

$$
w_0 + \textbf{w}^T\textbf{x}^{(i)}\leq - 1 + \xi^{(i)} \, if \, y^{(i)} = -1
$$

$$
\mbox{for} \, i = 1...N
$$

$N$ is again the number of samples in the dataset. Maximizing the margin $\frac{2}{\lVert\textbf{w}\rVert}$ is equivalent to minimizing $\frac{1}{2}\lVert\textbf{w}\rVert^2$, so the new objective function to minimize is:

$$
\frac{1}{2}\lVert\textbf{w}\rVert^2 + C \bigg(\sum_{i}\xi^{(i)}\bigg)
$$

The variable $C$ controls the penalty for misclassification. Larger values of $C$ correspond to larger error penalties, making the algorithm less forgiving of misclassifications. The model will choose narrower margins with higher variance in its efforts to minimize error. Smaller values of $C$ will be more forgiving of errors and may find a boundary with a wider margin and lower variance.
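To make the role of $C$ concrete, the sketch below evaluates the soft-margin objective for a fixed, hypothetical $\textbf{w}$ and $w_0$ on four toy samples. It assumes each slack value is the smallest one that satisfies the constraints, $\xi^{(i)} = \max\big(0, 1 - y^{(i)}(w_0 + \textbf{w}^T\textbf{x}^{(i)})\big)$. The function name `softMarginObjective` is made up for this illustration.

```python
import numpy as np

def softMarginObjective(w, w0, x, y, C):
    """Return (objective, slack) for labels y in {-1, 1}."""
    # Slack is zero for samples on or beyond the correct margin hyperplane
    slack = np.maximum(0.0, 1.0 - y * (w0 + x @ w))
    objective = 0.5 * np.dot(w, w) + C * slack.sum()
    return objective, slack

# Hypothetical weights and toy samples (chosen for illustration)
w = np.array([1.0, 0.0])
w0 = 0.0
xToy = np.array([[2.0, 0.0], [-2.0, 1.0], [0.5, 0.0], [0.5, -1.0]])
yToy = np.array([1, -1, 1, -1])

objectives = {}
for C in [0.1, 1.0, 10.0]:
    objectives[C], slack = softMarginObjective(w, w0, xToy, yToy, C)
    print('C = {0:5.1f} -> objective = {1:.2f}'.format(C, objectives[C]))

print('Slack values: {0}'.format(slack))
```

The third sample is correctly classified but inside the margin (slack between 0 and 1), and the fourth is misclassified (slack greater than 1). The same violations cost far more as $C$ grows, which is why a large $C$ pushes the optimizer toward narrower margins.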

In [None]:
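# Fit a linear SVM and plot its decision regions
# Stack the standardized train/test sets so the 45 test samples occupy rows 105-149 (see testIdx)

xCombined = np.vstack((xTrainStd, xTestStd))
yCombined = np.hstack((yTrain, yTest))

sv = svm.SVC(kernel = 'linear', C = 1.0, random_state = 1)
sv.fit(xTrainStd, yTrain)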

p = qp.QuickPlot(fig = plt.figure(), chartProp = 15)
ax = p.makeCanvas(title = '', xLabel = 'petal length [standardized]', yLabel = 'petal width [standardized]'
                  ,yShift = 0.6, position = 111)
p.qpDecisionRegion(x = xCombined
                   ,y = yCombined
                   ,classifier = sv
                   ,testIdx = range(105, 150)
                   #,bbox = (1.2, 0.9)
                   ,ax = ax
                  )
plt.legend(loc = 'upper left')


<a id = 'Logistic-regression-vs-SVMs'></a>

## Logistic regression vs. SVMs

Logistic regression and SVMs often produce similar results, but there are a few fundamental differences. 
- Since SVMs are only interested in the support vectors (the samples closest to the boundary), they are generally not influenced by outliers. The same cannot be said for logistic regression.
- Logistic regression is a simpler model mathematically and can be implemented more easily.
- Logistic regression models can be easily updated with new data.

<a id = 'Solving-nonlinear-problems-using-a-kernel-SVM'></a>

## Solving nonlinear problems using a kernel SVM

It is often impossible to separate data points with a line, plane or hyperplane, leaving linear models such as logistic regression and the linear SVM incapable of finding a meaningful solution. One potential solution is to use a kernel method to create nonlinear combinations of the original features and project the observations onto a higher-dimensional space via a mapping function $\phi$, where they become linearly separable. For example, a two-dimensional dataset can be mapped into three dimensions:

$$
\phi(x_1,x_2) = (z_1,z_2,z_3) = (x_1,x_2,x_1^2 + x_2^2)
$$

This enables us to separate two non-linearly separable classes with a linear hyperplane that becomes a nonlinear decision boundary when projected back onto the original feature space.

In practice, the training data is mapped to a higher-dimensional space using the mapping function $\phi$, and then unseen data is mapped using the same function so it can be classified by the linear SVM model.

The downside of this mapping approach is that constructing the new features is computationally expensive, especially for high-dimensional data. The kernel trick addresses this. When training an SVM, the samples only enter through dot products $\textbf{x}^{(i)T}\textbf{x}^{(j)}$, so working in the mapped space only requires replacing each dot product with $\phi\big(\textbf{x}^{(i)}\big)^T\phi\big(\textbf{x}^{(j)}\big)$. A kernel function returns this value directly, avoiding the otherwise expensive step of explicitly mapping the two points and calculating their dot product:

$$
K(\textbf{x}^{(i)},\textbf{x}^{(j)}) = \phi\big(\textbf{x}^{(i)}\big)^T\phi\big(\textbf{x}^{(j)}\big)
$$
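A small numerical check makes the idea tangible. The sketch below uses the degree-2 polynomial kernel $(\textbf{a}^T\textbf{b})^2$ and its explicit feature map $(a_1^2, \sqrt{2}a_1a_2, a_2^2)$. This is an illustrative choice rather than the mapping shown earlier, picked because the explicit map is short enough to write out. The names `phiPoly` and `polyKernel` are made up for this example.

```python
import numpy as np

def phiPoly(x):
    """Explicit degree-2 feature map for a 2-D input (illustrative choice)."""
    return np.array([x[0] ** 2, np.sqrt(2.0) * x[0] * x[1], x[1] ** 2])

def polyKernel(a, b):
    """Homogeneous polynomial kernel of degree 2: (a^T b)^2."""
    return np.dot(a, b) ** 2

a = np.array([1.0, 2.0])
b = np.array([3.0, -1.0])

# Map both points explicitly, then take the dot product in the mapped space
explicit = np.dot(phiPoly(a), phiPoly(b))

# Compute the same quantity directly from the original 2-D vectors
viaKernel = polyKernel(a, b)

print('Explicit mapping: {0:.6f}'.format(explicit))
print('Kernel function:  {0:.6f}'.format(viaKernel))
```

Both routes give the same number, but the kernel never constructs the higher-dimensional vectors. For mappings into very high-dimensional (or infinite-dimensional) spaces, that shortcut is what makes the computation feasible.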

A widely used kernel is the radial basis function (RBF) kernel, also called the Gaussian kernel:

$$
K(\textbf{x}^{(i)},\textbf{x}^{(j)}) = \mbox{exp}\Bigg(-\frac{\lVert\textbf{x}^{(i)} - \textbf{x}^{(j)}\rVert^2}{2\sigma^2}\Bigg)
$$

Which simplifies to:

$$
K(\textbf{x}^{(i)},\textbf{x}^{(j)}) = \mbox{exp}\Big(-\gamma\lVert\textbf{x}^{(i)} - \textbf{x}^{(j)}\rVert^2\Big)
$$

The parameter $\gamma = \frac{1}{2\sigma^2}$ is a free parameter that is tuned during optimization.

A kernel can be interpreted as a similarity function between a pair of samples. The minus sign in front of $\gamma$ inverts the distance measure into a similarity score (high values, or less negative exponents, represent more similar observations). The exponential term ensures the similarity score falls between 1 (identical samples) and 0 (very dissimilar samples).
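The similarity interpretation is easy to verify directly. Below is a minimal NumPy sketch of the RBF kernel evaluated on three hypothetical pairs of points. The helper name `rbfKernel` and the value $\gamma = 0.5$ are arbitrary choices for illustration.

```python
import numpy as np

def rbfKernel(a, b, gamma):
    """RBF kernel: exp(-gamma * squared Euclidean distance)."""
    return np.exp(-gamma * np.sum((a - b) ** 2))

origin = np.array([0.0, 0.0])
gamma = 0.5

same = rbfKernel(origin, origin, gamma)                 # identical samples
near = rbfKernel(origin, np.array([1.0, 0.0]), gamma)   # distance of 1
far = rbfKernel(origin, np.array([10.0, 0.0]), gamma)   # distance of 10

print('Identical: {0:.4f}'.format(same))
print('Near:      {0:.4f}'.format(near))
print('Far:       {0:.4e}'.format(far))
```

Identical samples score exactly 1, and the score decays toward 0 as the distance between the samples grows.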

In [None]:
# Generate an XOR dataset: the label is True when exactly one of the two features is positive

np.random.seed(1)
xXor = np.random.randn(200,2)
yXor = np.logical_xor(xXor[:,0] > 0
                     ,xXor[:,1] > 0)

p = qp.QuickPlot(fig = plt.figure(), chartProp = 15)
ax = p.makeCanvas()
p.qp2dScatterHue(x = xXor[:,0]
                  ,y = xXor[:,1]
                  ,target = yXor
                  ,label = ['1','-1']
                  ,xUnits = 'ddd'
                  ,yUnits = 'ddd'
                  ,ax = ax
                  )


> Remarks - The two classes are clearly not linearly separable.

In [None]:
# Fit an RBF kernel SVM to the XOR data and plot its decision regions

sv = svm.SVC(kernel = 'rbf', random_state = 1, gamma = 0.10, C = 10.0)
sv.fit(xXor,yXor)

p = qp.QuickPlot(fig = plt.figure(), chartProp = 15)
ax = p.makeCanvas(title = '', xLabel = '', yLabel = ''
                  ,yShift = 0.8, position = 111)
p.qpDecisionRegion(x = xXor
                   ,y = yXor
                   ,classifier = sv
                   ,testIdx = range(105,150)
                   ,bbox = (1.2, 0.9)
                   ,ax = ax
                  )


<a id = 'gamma-parameter'></a>

## $\gamma$ parameter

The parameter $\gamma$ acts as a cut-off parameter for the Gaussian sphere. Increasing the value of $\gamma$ narrows the Gaussian around each training sample, so the boundary is shaped more strongly by the samples nearest to it, which leads to a tighter and bumpier decision boundary.
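One way to see the effect of $\gamma$ is to ask how far apart two samples can be before their RBF similarity drops to 0.5. The sketch below solves $\mbox{exp}(-\gamma d^2) = 0.5$ for $d$ across a few arbitrary $\gamma$ values. The helper names are made up for this illustration.

```python
import numpy as np

def rbfSimilarity(distance, gamma):
    """RBF similarity as a function of the Euclidean distance between two samples."""
    return np.exp(-gamma * distance ** 2)

def halfSimilarityDistance(gamma):
    """Distance d at which exp(-gamma * d^2) equals 0.5."""
    return np.sqrt(np.log(2.0) / gamma)

gammas = np.array([0.1, 1.0, 10.0, 100.0])
reach = halfSimilarityDistance(gammas)

for g, d in zip(gammas, reach):
    print('gamma = {0:6.1f} -> similarity drops to 0.5 at distance {1:.3f}'.format(g, d))
```

As $\gamma$ grows, that distance shrinks, so each training sample only affects points in its immediate neighborhood. This is consistent with the tighter, bumpier boundaries seen at high $\gamma$.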

> Remarks - There is an 81.9% chance sample 0 belongs to class 1, a 77.1% chance sample 2 belongs to class 0, and an 86.6% chance sample 3 belongs to class 2

In [None]:
# Another way of illustrating the remark above: pick the class with the highest probability

logReg.predict_proba(xTrainStd[:3, :]).argmax(axis = 1)


In [None]:
# We can also call the predict method on these samples

logReg.predict(xTrainStd[:3, :])


<a id = 'Use-regularization-to-avoid-overfitting'></a>

## Use regularization to avoid overfitting

__Weight decay and $\lambda$__

A model that performs well on the training dataset but poorly on the test set is overfitting the data. Another way of saying this is that the model has high variance. These models often have too many parameters, leading to a very complex model.

Conversely, a model that underfits the data is not adequately capturing the pattern in the data, resulting in poor accuracy on both the training and test set. Underfit models are said to have high bias.

Regularization (also referred to as weight decay) is a method for reducing the complexity of a model. In short, regularization penalizes extreme weight values. A common form of regularization is L2 regularization:

$$
\lambda\sum\limits_{i=1}^{m}{w_i^2}
$$

where $\lambda$ is the regularization parameter that controls the strength of the penalty. Higher values of $\lambda$ lead to higher levels of regularization, and higher penalties for large weights.

__The parameter C__

The parameter C is the inverse of $\lambda$. Consequently, decreasing C increases the regularization strength.
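The inverse relationship between C and $\lambda$ can be illustrated with the penalty term on its own. The sketch below assumes the L2 penalty in the form given above, $\lambda\sum w_i^2$, and uses a hypothetical weight vector. The function name `l2Penalty` is made up for this example.

```python
import numpy as np

def l2Penalty(w, lam):
    """L2 penalty: lambda times the sum of squared weights."""
    return lam * np.sum(w ** 2)

# Hypothetical weight vector (sum of squares = 13.25)
wToy = np.array([0.5, -2.0, 3.0])

penalties = {}
for C in [100.0, 1.0, 0.01]:
    lam = 1.0 / C  # C is the inverse of lambda
    penalties[C] = l2Penalty(wToy, lam)
    print('C = {0:6.2f} -> lambda = {1:6.2f} -> penalty = {2:.4f}'.format(C, lam, penalties[C]))
```

For the same weights, shrinking C from 100 to 0.01 multiplies the penalty by 10,000, so the optimizer is pushed much harder toward small weights.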

In [None]:
# Visualize weight coefficients by regularization setting

weights, params = [], []
for c in np.arange(-5, 5):
    logReg = linear_model.LogisticRegression(C = 10.**c, random_state = 1)
    logReg.fit(xTrainStd, yTrain)
    weights.append(logReg.coef_[1])
    params.append(10.**c)

p = qp.QuickPlot(fig = plt.figure(), chartProp = 15)
ax = p.makeCanvas(title = 'L2-regularization', xLabel = 'C', yLabel = 'Weight coefficient', yShift = 0.63)
p.qpLine(x = np.array(params)
           ,y = np.array(weights)
           ,ax = ax
           ,label = ['petal length','petal width']
           ,yMultiVal = True
          )
plt.xscale('log')


<a id = '#Maximum-margin-classification-with-SVMs'></a>

# Modeling class probabilities w/ logistic regression

<a id = 'Intuition-logistic-regression'></a>

## Intuition: logistic regression

Unlike actual regression, logistic regression doesn't try to predict the value of a numeric variable given an observation's inputs. Rather, logistic regression is a classification algorithm that returns a probability that an observation belongs to a certain class given an observation's inputs. In other words, logistic regression returns a value between 0 and 1, whereas actual regression returns a value between $-\infty$ and $\infty$.

The essence of returning a probability $p$ given a set of continuous and categorical attributes begins with the odds ratio, which describes the odds of a particular event occurring and returns a value in the range of 0 to $\infty$:

$$
\mbox{Odds ratio} = \frac{p}{(1 - p)}
$$

$p$ is the probability of an event occurring. As an example, $p$ could be the probability that a patient has diabetes given the patient's attributes.

This can be further refined by taking the log of the odds ratio (the common convention is to use the natural log):

$$
logit(p) = log\frac{p}{(1 - p)}
$$

The logit function takes probability values between 0 and 1 and transforms them to values that range from $-\infty$ to $\infty$. A different way of looking at this is:

$$
logit\big(p(y = 1|\textbf{x})\big) = w_0x_0 + w_1x_1 + \dots + w_mx_m = \textbf{w}^T\textbf{x}
$$

Here, the conditional probability that a sample belongs to class 1 given its features $\textbf{x}$ is converted by the logit function from a number between 0 and 1 to a number between $-\infty$ and $\infty$.

We can use these values to express a linear relationship between feature values and the log-odds. The RHS of the function, in a 3-dimensional example, would be a plane, which is used for separating classes in a 3-dimensional space. Observations can be very close or far away from the plane (or even on the plane), and that distance can inform the probability of an observation belonging to a certain class. Logistic regression seeks to describe that probability. To get this information, we use the inverse form of the logit function, called the logistic sigmoid function.

$$
\phi(z) = \frac{1}{1 + e^{-z}}
$$

$z$ is equal to $\textbf{w}^T\textbf{x}$
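That the sigmoid really is the inverse of the logit can be confirmed with a quick round trip. The sketch below uses a few arbitrary probabilities. The `logit` helper is written out here for illustration.

```python
import numpy as np

def logit(p):
    """Natural log of the odds p / (1 - p)."""
    return np.log(p / (1.0 - p))

def sigmoid(z):
    """Logistic sigmoid, the inverse of the logit."""
    return 1.0 / (1.0 + np.exp(-z))

pToy = np.array([0.1, 0.5, 0.8])

# Probability -> log-odds -> probability
zToy = logit(pToy)
pRecovered = sigmoid(zToy)

print('Log-odds:  {0}'.format(zToy.round(4)))
print('Recovered: {0}'.format(pRecovered.round(4)))
```

A probability of 0.5 maps to log-odds of 0, probabilities below 0.5 map to negative values, and applying the sigmoid returns the original probabilities.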

This technique can be used for binary classification and multi-class classification with a technique referred to as one-versus-rest.


<a id = 'Sigmoid-function'></a>

### Sigmoid function

In [None]:
# Visualize sigmoid function on the range -7 to 7

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.arange(-7, 7, 0.1)
zPhi = sigmoid(z)


p = qp.QuickPlot(fig = plt.figure(), chartProp = 15)
ax = p.makeCanvas(title = 'Sigmoid function', xLabel = 'z', yLabel = r'$\phi (z)$', yShift = 0.81)
p.qpLine(x = z
         ,y = zPhi
         ,yUnits = 'ff'
         ,ax = ax
         )
plt.axvline(0.0, color = 'black')


By reviewing this S-shaped curve it's clear that as $\phi(z)$ approaches 1, $z$ is approaching $\infty$ because $e^{-z}$ becomes very small for large values of $z$. Conversely, as $\phi(z)$ approaches 0, $z$ is approaching $-\infty$. 


<a id = 'Decision-function'></a>

### Decision function


The sigmoid function also illustrates a key point from above: values between $-\infty$ and $\infty$ are converted to values between 0 and 1. These values between 0 and 1 are probabilities, and can be interpreted as the probability of a particular sample belonging to class 1 given an attribute input vector $\textbf{x}$ and weight vector $\textbf{w}$. In mathematical notation:

$$
\phi(z)= P(y=1|\textbf{x};\textbf{w})
$$

For example, if $\phi(z)$ = 0.8, then there is an 80% chance that the observation belongs to class 1, and a 20% chance that it belongs to class 0. The predicted probability is converted into a class label $\hat{y}$ by the decision function:

$$
\hat{y} =
\left\{
    \begin{array}{ll}
        1  & \mbox{if } \phi(z) \geq 0.5 \\
        0  & \mbox{otherwise}
    \end{array}
\right.
$$
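The thresholding step can be sketched in a few lines of NumPy. The net input values below are arbitrary, and `predictLabel` is a hypothetical helper name. Because $\phi(0) = 0.5$, thresholding the probability at 0.5 is equivalent to checking whether $z \geq 0$.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-z))

def predictLabel(z, threshold = 0.5):
    """Convert net input z into a 0/1 class label by thresholding the sigmoid output."""
    return np.where(sigmoid(z) >= threshold, 1, 0)

# Arbitrary net input values
zToy = np.array([-2.0, 0.0, 1.386])

probabilities = sigmoid(zToy)
labels = predictLabel(zToy)

print('Probabilities: {0}'.format(probabilities.round(3)))
print('Labels:        {0}'.format(labels))
```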

<a id = 'Training-a-logistic-regression-model'></a>

## Training a logistic regression model

In [None]:
# Predictions with misclassification score

yPred = ppn.predict(xTestStd)
print('Misclassified samples: {0}'.format((yTest != yPred).sum()))


>Remarks - 3 out of 45 samples are incorrectly predicted, yielding a misclassification rate of ~6.7%
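The arithmetic behind that remark can be reproduced with made-up labels. The arrays below are hypothetical, constructed only so that 3 of 45 predictions disagree. They are not the notebook's actual `yTest` or `yPred`.

```python
import numpy as np

# Hypothetical labels: 45 samples, 15 per class
yTrueToy = np.repeat([0, 1, 2], 15)

# Copy the true labels and flip three of them to simulate misclassifications
yPredToy = yTrueToy.copy()
yPredToy[[3, 20, 40]] = [1, 2, 1]

misclassified = (yTrueToy != yPredToy).sum()
errorRate = misclassified / len(yTrueToy)
accuracy = 1.0 - errorRate

print('Misclassified samples: {0}'.format(misclassified))
print('Misclassification rate: {0:.1%}'.format(errorRate))
print('Accuracy: {0:.1%}'.format(accuracy))
```

Accuracy is simply one minus the misclassification rate, so 3 errors in 45 samples corresponds to an accuracy of roughly 93.3%.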

In [None]:
# Fit a logistic regression model and plot its decision regions

logReg = linear_model.LogisticRegression(C = 100.0, random_state = 1)
logReg.fit(xTrainStd, yTrain)

p = qp.QuickPlot(fig = plt.figure(), chartProp = 15)
ax = p.makeCanvas(title = '', xLabel = '', yLabel = ''
                  ,yShift = 0.8, position = 111)
p.qpDecisionRegion(x = xCombined
                   ,y = yCombined
                   ,classifier = logReg
                   ,testIdx = range(105,150)
                   ,bbox = (1.2, 0.9)
                   ,ax = ax
                  )

In [None]:
# Display probabilities associated with a few samples

logReg.predict_proba(xTrainStd[:3, :])
