<h2> A bit more on classification</h2>  

Last time we saw a special case of classification: the one with two mutually exclusive classes. This can be generalized in two ways: there are $\lvert C \rvert > 2$ classes, which are either mutually exclusive or not. The latter case is called *any-of*, *multilabel*, or *multivalue* classification. This problem can be broken down into $\lvert C \rvert$ *binary* classifications, each trained and applied independently of the others (although the classes themselves need not be independent in the statistical sense). The other, *one-of* or *multiclass* case is a bit more complicated... (<a href="http://nlp.stanford.edu/IR-book/html/htmledition/classification-with-more-than-two-classes-1.html">source</a>)

Imagine instances as points in a $d$-dimensional space, where every dimension corresponds to a feature. Linear classification works by dividing this *feature space* with a hyperplane (the generalization of a line to higher dimensions). In other words, linear classifiers make the classification decision based on a linear combination of the features.
This is most easily understood through a simple example: a problem that is solvable by linear classification (image **A**) and one that is not (image **B**):  
<img src="http://sebastianraschka.com/images/blog/2014/naive_bayes_1/linear_vs_nonlinear_problems.png" width="400px" align="left">  

<h2> Getting a bit technical... </h2>  
Since Jupyter notebooks can handle $\LaTeX$, let's write some equations!  
Formally, linearity means that the classification *can* be expressed like this:  

$$ c = f \left( \sum _{i=0} ^{N} w_i x_i \right) $$  
where $c \in C$ is the class we predict for a given instance $\bf{x}$, $w_i$ is the weight of attribute $x_i$, and $f$ is a function that maps its input to a class. Geometrically, $(w_1, \dots, w_N)$ is the normal vector of the separating hyperplane, and $w_0$ sets its offset from the origin. **The weights are what we create in the process of learning, and we use the learnt weights to predict the class of a given input instance.**  
Note that $x_0 = 1$ is a dummy feature, there only to make the notation convenient, since the equation of a hyperplane is $w_1 x_1 + w_2 x_2 + \dots + w_N x_N + w_0 = 0$.  
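
A minimal NumPy sketch of the formula above, assuming a binary problem in which $f$ is simply the sign function. The helper name `predict_linear`, the weights and the two points are made up for illustration; they are not learnt from any data.

```python
import numpy as np

def predict_linear(w, x):
    """Return +1 or -1 depending on which side of the hyperplane x lies.

    w holds [w_0, w_1, ..., w_N]; x holds only the real features [x_1, ..., x_N].
    """
    x_aug = np.concatenate(([1.0], x))  # prepend the dummy feature x_0 = 1
    score = np.dot(w, x_aug)            # the linear combination sum_i w_i x_i
    return 1 if score >= 0 else -1      # f: here just the sign of the score

# Hypothetical weights describing the line x_1 + x_2 - 1 = 0
w = np.array([-1.0, 1.0, 1.0])
label_above = predict_linear(w, np.array([2.0, 2.0]))  # score =  3
label_below = predict_linear(w, np.array([0.0, 0.0]))  # score = -1
```

The two points fall on opposite sides of the line, so they get opposite labels.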

Referring back to *multiclass* classification: in the linear case, one binary classifier is built per class, every one of them is applied to the instance, and the class whose classifier gives the highest score (or probability, etc.) is chosen.
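
The one-of decision described above can be sketched in a few lines, assuming we already have one weight vector per class. The weight matrix `W` below is hypothetical, chosen only so that the scores are easy to check by hand.

```python
import numpy as np

def predict_one_of(W, x):
    """Pick the class whose binary classifier gives the highest score.

    W has one row of weights [w_0, ..., w_N] per class.
    """
    x_aug = np.concatenate(([1.0], x))  # dummy feature x_0 = 1
    scores = W.dot(x_aug)               # one linear score per class
    return int(np.argmax(scores)), scores

# Hypothetical weights for three classes over two features
W = np.array([
    [0.0,  1.0,  0.0],   # class 0: score is x_1
    [0.0,  0.0,  1.0],   # class 1: score is x_2
    [1.0, -1.0, -1.0],   # class 2: score is 1 - x_1 - x_2
])
winner, scores = predict_one_of(W, np.array([0.2, 3.0]))
```

Here the scores are 0.2, 3.0 and -2.2, so class 1 wins.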

<h2> Example 0: <a href="http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression">Logistic regression</a> </h2>

Yes, logistic regression is linear! Why? Because the predictions can be written in the form 
$$ \hat{p} = \frac{1}{1+e^{-\bf wx}} $$
so, more precisely, it is the *log-odds* $\log \frac{\hat{p}}{1-\hat{p}} = \bf wx$ that are a linear function of $\bf x$.
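
A quick numerical check of this statement, with made-up weights and a made-up instance (both are assumptions for illustration): applying the logistic function to $\bf wx$ and then taking the log-odds gives back $\bf wx$.

```python
import numpy as np

def sigmoid(z):
    """The logistic function 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([0.5, -1.0, 2.0])  # hypothetical weights, w[0] belongs to the dummy feature
x = np.array([1.0, 3.0, 1.5])   # x[0] = 1 is the dummy feature

z = np.dot(w, x)                         # 0.5 - 3.0 + 3.0 = 0.5
p_hat = sigmoid(z)                       # predicted probability
log_odds = np.log(p_hat / (1.0 - p_hat)) # undoes the sigmoid
```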

<h2> Example 1: <a href="http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Perceptron.html">Perceptron</a> </h2>  

This is a very simple algorithm for binary classification, where $c$ can be $-1$ or $1$. Initialize the weights with some values (0s or small random numbers), then go through the training set in some order (the resulting separator is not unique: it depends on the order!). Training rule: if the sign of the product $\bf wx$ equals the known output $c$, don't change the weights. If the sign is different, add $ c \cdot \bf x$ to the weights. The generalization of this algorithm to the multiclass case, along with good pictures and examples, can be read on the <a href="https://en.wikipedia.org/wiki/Perceptron" >Perceptron's Wikipedia page</a>.
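
The training rule above can be sketched as follows. This is a minimal version under a few assumptions of ours: the weights start at zero, a score of exactly 0 counts as a mistake (otherwise all-zero weights would never change), and training stops after a full pass without errors or after `epochs` passes. The function name and the toy data are made up.

```python
import numpy as np

def train_perceptron(X, c, epochs=10):
    """X: (n_samples, n_features) array, c: labels in {-1, +1}."""
    X_aug = np.hstack([np.ones((X.shape[0], 1)), X])  # dummy feature x_0 = 1
    w = np.zeros(X_aug.shape[1])                      # start from all-zero weights
    for _ in range(epochs):
        mistakes = 0
        for x_i, c_i in zip(X_aug, c):
            # wrong side of the hyperplane (or exactly on it): update
            if c_i * np.dot(w, x_i) <= 0:
                w += c_i * x_i
                mistakes += 1
        if mistakes == 0:  # a full pass with no errors: the data is separated
            break
    return w

# Toy, linearly separable data
X = np.array([[2.0, 1.0], [3.0, 2.0], [-1.0, -1.0], [-2.0, -3.0]])
c = np.array([1, 1, -1, -1])
w = train_perceptron(X, c)
predictions = np.sign(np.hstack([np.ones((4, 1)), X]).dot(w))
```

On this toy set a single update is enough, and the learnt weights classify all four points correctly. Feeding the points in a different order may give a different, equally valid separator.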

<h2> Example 2: <a href="http://scikit-learn.org/stable/modules/naive_bayes.html">Naive Bayes</a> </h2>  

For this, we'll look at the classification problem from a different angle. We have a given input vector $X$, and we need the probability of it belonging to a class $Y$.
Nomenclature: the input features $X_i$ are in $X$, and the class variable is $Y$. $P(Y)$ is called the **a priori**, and $P(Y\lvert X)$ the **a posteriori** probability of $Y$. This a posteriori probability is what we need.  
At first glance, the Bayes classifier works like this:  

$$\hat{Y} = \mathrm{arg}\,\mathrm{max}\, P(Y\lvert X),$$  
which means "choosing the class with the maximum probability given an input $X$".  
Of course, we would need an enormous training set to be able to estimate $P(Y\lvert X)$ for every possible input $X$. This is where Bayes' theorem comes in:

$$P(Y\lvert X) = \frac{P(X\lvert Y) P(Y)}{P(X)}$$  
Great, now we have to calculate $P(X\lvert Y)$ instead (at least we don't have to worry about the denominator, as it is the same for every possible $Y$). But if we suppose that the attributes are independent of each other given the class (**this assumption makes it naive!**), then the probability $P(X\lvert Y)$ can be written in product form: $ \prod _i P(X_i \lvert Y) $! (The proof of this is left as an exercise to the reader.) Now we simply have to calculate the $P(X_i \lvert Y)$ probabilities, which requires a much smaller training set. The calculation is simple if the $X_i$-s are categorical variables: just use relative frequencies from the training set. Otherwise, we assume that these probabilities come from a distribution (Gaussian, binomial, multinomial, etc.), and we use the training set to estimate the parameters of this distribution. (Note: naive Bayes classification is linear only if this distribution comes from an exponential family.) Finally, the a priori probabilities $P(Y)$ are simply calculated as relative frequencies from the training set. 
So training consists of calculating the $P(X_i \lvert Y)$-s (or the distribution parameters) and $P(Y)$ from the training set, and prediction is just calculating $ P(Y) \prod _i P(X_i \lvert Y)$ for each possible $Y$ and choosing the $Y$ that maximises it.
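
For categorical features, the training and prediction steps above fit in a few lines of plain Python. This is only a sketch: the function names and the tiny weather-style dataset are invented for illustration, and there is no smoothing, so a feature value never seen with a class gets probability 0.

```python
from collections import Counter, defaultdict

def train_nb(rows, labels):
    """Estimate P(Y) and the counts behind P(X_i | Y) as relative frequencies."""
    class_counts = Counter(labels)
    n = float(len(labels))
    prior = {y: class_counts[y] / n for y in class_counts}
    # cond[y][i][value] = how often feature i took `value` within class y
    cond = defaultdict(lambda: defaultdict(Counter))
    for row, y in zip(rows, labels):
        for i, value in enumerate(row):
            cond[y][i][value] += 1
    return prior, cond, class_counts

def predict_nb(row, prior, cond, class_counts):
    """Return the class maximising P(Y) * prod_i P(X_i | Y), plus all scores."""
    scores = {}
    for y in prior:
        score = prior[y]
        for i, value in enumerate(row):
            score *= cond[y][i][value] / float(class_counts[y])
        scores[y] = score
    return max(scores, key=scores.get), scores

# Invented data: (outlook, wind) -> did we go out?
rows = [("sunny", "weak"), ("sunny", "weak"), ("sunny", "strong"), ("rainy", "weak"),
        ("rainy", "strong"), ("rainy", "strong"), ("sunny", "strong"), ("rainy", "weak")]
labels = ["yes", "yes", "yes", "yes", "no", "no", "no", "no"]

prior, cond, class_counts = train_nb(rows, labels)
best, scores = predict_nb(("sunny", "weak"), prior, cond, class_counts)
```

For the query `("sunny", "weak")` the score of "yes" is $0.5 \cdot 0.75 \cdot 0.75 = 0.28125$ and the score of "no" is $0.5 \cdot 0.25 \cdot 0.25 = 0.03125$, so "yes" is predicted.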

<h2> Now look at the iris dataset </h2>

In [56]:
import numpy as np
import random

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import Perceptron
from sklearn.naive_bayes import MultinomialNB

from sklearn.pipeline import Pipeline
from sklearn.metrics import confusion_matrix

The rows of this dataset are neatly ordered by class, so we need to shuffle the data matrix and the class vector (the same way!) before slicing them into training and test sets.

In [110]:
iris = load_iris()
order = np.arange(150)
np.random.shuffle(order)

In [119]:
X_train = iris.data[order][:110]
y_train = iris.target[order][:110]
X_test = iris.data[order][110:]
y_test = iris.target[order][110:]

In [120]:
logistic = LogisticRegression()
pipe = Pipeline(steps=[('logistic', logistic)])

In [121]:
estimator = pipe.fit(X_train, y_train)

In [122]:
y_pred = estimator.predict(X_test)
print("Prediction accuracy: {:.2f}%".format(np.sum(y_pred == y_test) / float(len(y_pred)) * 100))

Prediction accuracy: 97.50%


With the default parameters of the scikit-learn version used here, logistic regression trains a binary classifier for each label, then chooses the best-scoring one.  
Let's try something else! With pipelines, we can easily set parameters for our predictors, like this:

In [126]:
pipe.set_params(logistic__multi_class='multinomial', logistic__solver='sag')

Pipeline(steps=[('logistic', LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='multinomial',
          n_jobs=1, penalty='l2', random_state=None, solver='sag',
          tol=0.0001, verbose=0, warm_start=False))])

In [127]:
estimator = pipe.fit(X_train, y_train)

In [128]:
y_pred = estimator.predict(X_test)
print("Prediction accuracy: {:.2f}%".format(np.sum(y_pred == y_test) / float(len(y_pred)) * 100))

Prediction accuracy: 100.00%
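
The other two classifiers discussed earlier could be tried on the same kind of split. The sketch below is self-contained and rests on a few assumptions of ours: a fixed random seed for repeatability, default parameters for both models, and `GaussianNB` in place of the `MultinomialNB` imported above, since the iris features are continuous measurements. The confusion matrix shows which classes get mixed up, which a single accuracy number hides.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import Perceptron
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix

iris = load_iris()
order = np.random.RandomState(0).permutation(150)  # fixed seed: our assumption
X_train, y_train = iris.data[order][:110], iris.target[order][:110]
X_test, y_test = iris.data[order][110:], iris.target[order][110:]

matrices = {}
for name, model in [('perceptron', Perceptron()), ('naive bayes', GaussianNB())]:
    y_pred = model.fit(X_train, y_train).predict(X_test)
    # rows: true classes, columns: predicted classes
    matrices[name] = confusion_matrix(y_test, y_pred, labels=[0, 1, 2])
    print(name)
    print(matrices[name])
```

Each matrix sums to the 40 test instances; off-diagonal entries are the misclassifications.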
