# 6. Classification

Once we are in possession of labelled data, we can take a
step further and use those labels in a supervised learning
task, where the labels become our targets. In this chapter we
will discuss how classification algorithms are used and
scored. In particular we will cover some important
algorithms such as K Nearest Neighbours, Logistic
Regression and the famous Naïve Bayes classifier.

## 6.1 Classification

Classification is a task that involves arranging
objects systematically into appropriate groups or categories
depending on the characteristics that define such groupings.
It is important to emphasise that the groups are pre-defined
according to established criteria. In our case, the use of classification is to determine the category to which an
unseen observation belongs, depending on the information
of a training dataset with appropriate labels. Classification
is therefore a supervised learning task. Whereas in clustering
the aim is to determine the groups from the features in the
dataset, classification uses the labelled groups to predict
the best category for unseen data.

### 6.1.1 Confusion Matrices

A very convenient way to evaluate the accuracy of a
classifier is to use a table that summarises the
performance of our algorithm against the data provided.
Karl Pearson used the name contingency table. The machine
learning community tends to call it a confusion matrix, as
it lets us determine whether the classifier is confusing
two classes by assigning observations of one class to the
other. One advantage of a confusion matrix is that it can be
extended to cases with more than two categories.

In any case, the contingency table or confusion matrix is
organised in such a way that its columns relate to the
instances in a predicted category, whereas its rows refer to
the actual classes. Correct predictions are counted as True
Positives (TP) and True Negatives (TN).

A False Positive (FP) is a case where we have incorrectly
predicted a positive detection. From the table we can see
that the troop has predicted 6 cases as aircraft, but they
turned out to be flocks. Finally, a False Negative (FN) is a
case where we have incorrectly predicted a negative
detection.

Recall or True Positive Rate (TPR): Also known as sensitivity
or hit rate, it is the proportion of positive data points that
are correctly classified as positive out of the total
number of positive points: TPR = TP/(TP+FN)

True Negative Rate (TNR) or Specificity: It is the counterpart
of the True Positive Rate, as it measures the proportion of
negatives that have been correctly identified:
TNR = TN/(TN+FP)

Fallout or False Positive Rate (FPR): It is the proportion
of negative data points that are mistakenly considered as
positive, with respect to all negative data points:
FPR = FP/(FP+TN) = 1 - TNR

Precision or Positive Predictive Value (PPV): It is the proportion
of positive results that are true positive results:
PPV = TP/(TP+FP)

Accuracy: It is given by the ratio of the points that
have been correctly classified to the total number of data
points: Accuracy = (TP+TN)/(TP+FP+TN+FN)
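
The definitions above can be checked with a few lines of plain Python. This is only a sketch: the function name `binary_metrics` is our own, and the four counts are made up for illustration rather than taken from the table discussed earlier.

```python
def binary_metrics(tp, fp, tn, fn):
    """Return the rates defined above from the four confusion-matrix counts."""
    return {
        "recall": tp / (tp + fn),          # true positive rate / sensitivity
        "specificity": tn / (tn + fp),     # true negative rate
        "fallout": fp / (fp + tn),         # false positive rate, equal to 1 - TNR
        "precision": tp / (tp + fp),       # positive predictive value
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }

# Hypothetical counts, chosen only to keep the arithmetic easy to follow
m = binary_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

Note that fallout and specificity always add up to 1, which is a quick sanity check on any set of counts.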

### 6.1.2 AUC and ROC

The Receiver Operating Characteristic (also called Receiver
Operator Characteristic) or ROC is a quantitative analysis
technique used in binary classification. It lets us construct
a curve of the true positive rate against the false positive
rate. In their basic form, ROC curves are suitable for binary
classification problems only.

In a ROC curve the True Positive Rate is plotted as a function
of the False Positive Rate for different cut-off points or
thresholds for the classifier. Think of these thresholds as
settings in the receiver used by the radar operators.

If our classifier is able to distinguish the two classes without
overlap, then the ROC would have a point at 100% sensitivity
and 0% fallout, i.e. the upper left corner of the plot. This
means that the closer the ROC curve is to that corner, the
better the accuracy of the classifier. A classifier that guesses
at random produces a curve along the diagonal. It is clear that
we would prefer classifiers that are better than guessing, in
other words those whose ROC curve lies above the diagonal. We
would also prefer those classifiers whose ROC curves are closer
to the curve given by the perfect classifier. If you end up with
a ROC curve that lies below the diagonal, your classifier is
worse than guessing and should be discarded.

AUC is the Area Under the ROC Curve. It summarises the curve in
a single number: 1 for the perfect classifier and 0.5 for random
guessing along the diagonal.
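
To make the role of the thresholds concrete, here is a minimal sketch in plain Python that sweeps the cut-off over a set of classifier scores, records the resulting (FPR, TPR) pairs and integrates them with the trapezoid rule. The helper names `roc_points` and `auc_trapezoid`, as well as the labels and scores, are assumptions made for illustration; in practice a library routine would normally be used.

```python
def roc_points(y_true, scores):
    """Return (fpr, tpr) pairs obtained by lowering the decision threshold."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing flagged positive
    for thr in sorted(set(scores), reverse=True):
        # Everything scoring at or above the threshold is predicted positive
        tp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 0)
        points.append((fp / neg, tp / pos))
    return points

def auc_trapezoid(points):
    """Area under the piecewise-linear curve through the (fpr, tpr) points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Hypothetical labels (1 = positive) and classifier scores
y_true = [0, 0, 1, 1, 0, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]
pts = roc_points(y_true, scores)
roc_auc = auc_trapezoid(pts)
print(pts)
print(roc_auc)
```

The curve always starts at (0, 0) and ends at (1, 1); what distinguishes one classifier from another is how quickly it climbs towards the upper left corner in between.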

## 6.2 Classification with KNN

In the KNN classifier, similarity is given by the distance
between points. We classify new observations taking into
account the class of the k nearest labelled data points. This
means that we need a distance measure between points, and we
can start with the well-known Euclidean distance we
discussed in Section 3.8.

As was the case in k-means for clustering, the value of k in
KNN is a parameter that is given as an input to the algorithm.
For a new unseen observation, we measure the distance to each
of the labelled points in the dataset and pick the
k nearest points. We then simply take the most common
class among these to be the class of the new observation. In
terms of steps we have the following:

1. Choose a value for k as an input.
2. Select the k nearest data points to the new observation.
3. Find the most common class among the k points chosen.
4. Assign this class to the new observation.
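
The four steps can be sketched in plain Python before turning to a library implementation. The function `knn_predict` and the toy points below are hypothetical and only meant to mirror the steps; this is not how scikit-learn implements its classifier.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, new_point, k):
    """Classify new_point by majority vote among its k nearest training points."""
    # Step 2: Euclidean distance from the new observation to every labelled point
    distances = [
        (math.dist(p, new_point), label)
        for p, label in zip(train_points, train_labels)
    ]
    nearest = sorted(distances, key=lambda d: d[0])[:k]
    # Step 3: count the classes among the k neighbours
    votes = Counter(label for _, label in nearest)
    # Step 4: assign the most common class (ties go to the class seen first)
    return votes.most_common(1)[0][0]

# Hypothetical two-dimensional training data with two classes
train_points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
                (3.0, 3.2), (3.1, 2.9), (2.9, 3.0)]
train_labels = ["a", "a", "a", "b", "b", "b"]

# Step 1: choose k
print(knn_predict(train_points, train_labels, (1.1, 1.0), k=3))
print(knn_predict(train_points, train_labels, (3.0, 3.0), k=3))
```

Using an odd value of k, as the grid search in this section does, avoids tied votes in two-class problems, although ties can still occur with three or more classes.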

In [3]:
from sklearn import datasets
iris = datasets.load_iris()
X = iris.data
Y = iris.target

In [7]:
import sklearn.model_selection as ms
XTrain,XTest,YTrain,YTest = ms.train_test_split(X,Y,test_size=0.3,random_state=7)

In [8]:
# Find the appropriate value of "k" using grid search
from sklearn import neighbors
from sklearn.model_selection import GridSearchCV
# Search the odd values of k from 1 to 19 using 10-fold cross-validation
k_neighbours = list(range(1,21,2))
n_grid = [{'n_neighbors':k_neighbours}]
# Set up the classifier and the grid search over k
model = neighbors.KNeighborsClassifier()
cv_knn = GridSearchCV(estimator=model,param_grid=n_grid,cv=ms.KFold(n_splits=10))
cv_knn.fit(XTrain,YTrain)

GridSearchCV(cv=KFold(n_splits=10, random_state=None, shuffle=False),
       error_score='raise-deprecating',
       estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=5, p=2,
           weights='uniform'),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid=[{'n_neighbors': [1, 3, 5, 7, 9, 11, 13, 15, 17, 19]}],
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [10]:
# Model results (best k)
best_k = cv_knn.best_params_['n_neighbors']
print(best_k)

11


In [12]:
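# Train model with best k
# Note: k was tuned on all four features above, whereas this model
# uses only columns 2 and 3 (petal length and petal width)
knnclf = neighbors.KNeighborsClassifier(n_neighbors=best_k)
knnclf.fit(XTrain[:,2:4],YTrain)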

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=11, p=2,
           weights='uniform')

In [14]:
# Predict
y_pred = knnclf.predict(XTest[:,2:4])
y_pred

array([2, 1, 0, 1, 1, 0, 1, 1, 0, 1, 2, 1, 0, 2, 0, 2, 2, 2, 0, 0, 1, 2,
       1, 1, 2, 2, 1, 1, 2, 2, 2, 1, 0, 2, 1, 0, 0, 0, 0, 2, 2, 1, 2, 2,
       1])

In [32]:
from sklearn.metrics import confusion_matrix
confusion_matrix(YTest,y_pred)

array([[12,  0,  0],
       [ 0, 14,  2],
       [ 0,  2, 15]])
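
With more than two classes, the same rates can be read off the matrix one class at a time: the diagonal entry is the number of correct predictions for that class, the row total is the number of actual members and the column total is the number of predictions made for it. The following sketch does this for the matrix printed above, copied by hand into a plain list of lists.

```python
# Confusion matrix copied from the output above (rows: actual, columns: predicted)
cm = [[12, 0, 0],
      [0, 14, 2],
      [0, 2, 15]]

n_classes = len(cm)
per_class = {}
for c in range(n_classes):
    tp = cm[c][c]                                         # correct predictions for class c
    actual = sum(cm[c])                                   # row total
    predicted = sum(cm[r][c] for r in range(n_classes))   # column total
    per_class[c] = {"precision": tp / predicted, "recall": tp / actual}
    print(f"class {c}: precision={tp / predicted:.2f}, recall={tp / actual:.2f}")

# Overall accuracy: sum of the diagonal over the total number of test points
accuracy = sum(cm[c][c] for c in range(n_classes)) / sum(map(sum, cm))
print(f"accuracy={accuracy:.2f}")
```

The only confusion here is between classes 1 and 2, with two observations of each assigned to the other.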

In [33]:
from sklearn.metrics import classification_report
print(classification_report(YTest,y_pred))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        12
           1       0.88      0.88      0.88        16
           2       0.88      0.88      0.88        17

   micro avg       0.91      0.91      0.91        45
   macro avg       0.92      0.92      0.92        45
weighted avg       0.91      0.91      0.91        45



## 6.3 Classification with Logistic Regression

Logistic regression is used in the prediction of a discrete outcome
and is therefore well suited for classification purposes. Logistic
regression is in effect another generalised linear model that
uses the same basic background as linear regression.
However, instead of a continuous dependent variable, the
model is regressing for the probability of a (binary)
categorical outcome. We can then use these probabilities to
obtain class labels for our data observations.

In logistic regression, we are interested in determining the
probability that an observation belongs to a category (or not)
and therefore the conditional mean of the outcome variable.
Since a probability must lie in the unit interval [0, 1], we
need to extend the linear regression model to map the outcome
variable into that interval.

Unlike in linear regression, the outcome variable here can
take only two values: either 0 or 1. This means that instead of
following a Gaussian distribution it follows a Bernoulli
one. The Bernoulli distribution corresponds to a random
variable that takes the value 1 with probability p and 0 with
probability q = 1 - p.
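
The usual choice for this mapping is the logistic (sigmoid) function, which squeezes any real-valued linear predictor into the unit interval. The sketch below shows the idea in plain Python; the coefficients, intercept and observation are hypothetical, and the 0.5 cut-off is simply the conventional default for turning a probability into a class label.

```python
import math

def logistic(z):
    """Map a real number z to the open interval (0, 1)."""
    # Note: math.exp overflows for very large negative z; fine for a sketch
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(x, weights, intercept):
    """P(y = 1 | x) for a linear predictor passed through the logistic function."""
    z = intercept + sum(w * xi for w, xi in zip(weights, x))
    return logistic(z)

# Hypothetical coefficients and observation, for illustration only
p = predict_proba(x=[2.0, -1.0], weights=[0.8, 0.5], intercept=-0.3)
label = 1 if p >= 0.5 else 0   # the cut-off turns the probability into a class
print(p, label)
```

The outcome is then modelled as a Bernoulli variable that takes the value 1 with probability p and 0 with probability 1 - p.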

In [79]:
import pandas as pd
bc = pd.read_csv('Data/data.csv')

In [80]:
bc.head()

|  | id | diagnosis | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave points_mean | ... | radius_worst | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave points_worst | symmetry_worst | fractal_dimension_worst |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 842302 | M | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | ... | 25.38 | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 |
| 1 | 842517 | M | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | ... | 24.99 | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 |
| 2 | 84300903 | M | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | ... | 23.57 | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 |
| 3 | 84348301 | M | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | ... | 14.91 | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 |
| 4 | 84358402 | M | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | ... | 22.54 | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 |


In [81]:
import pandas as pd
bc = bc.dropna()

In [82]:
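# Separate labels from features
# Note: the 'id' column stays in X as a feature, which is why 31
# coefficients are printed below; an identifier is usually dropped too
X = bc.drop(['diagnosis'],axis=1)
X = X.values
Y_raw = bc['diagnosis'].values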

In [84]:
# Convert labels to 0 and 1
from sklearn import preprocessing
label_enc = preprocessing.LabelEncoder()
label_enc.fit(Y_raw)
Y = label_enc.transform(Y_raw)

In [87]:
# Split in train and test
import sklearn.model_selection as ms
XTrain,XTest,YTrain,YTest = ms.train_test_split(X,Y,test_size=0.3,random_state=1)

In [91]:
# Use regularisation and let the grid search choose between L1 and L2 penalties
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
import numpy as np
pen_val = ['l1','l2']
# Inverse regularisation strengths: 2^-5, 2^-3, ..., 2^9
C_val = 2. ** np.arange(-5,10,step=2)
grid_s = [{'C':C_val,'penalty':pen_val}]
# liblinear supports both penalties; it was the default solver in the
# scikit-learn version that produced the outputs shown here
model = LogisticRegression(solver='liblinear')
cv_logr = GridSearchCV(estimator=model,param_grid=grid_s,cv=ms.KFold(n_splits=10))
# Model fitting
cv_logr.fit(XTrain,YTrain)
best_c = cv_logr.best_params_['C']
best_penalty = cv_logr.best_params_['penalty']

In [97]:
# Create an instance of the logistic regression model
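# solver='liblinear' is stated explicitly because it supports both L1 and L2 penalties
b_clf = LogisticRegression(C=best_c,penalty=best_penalty,solver='liblinear')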
b_clf.fit(XTrain,YTrain)

LogisticRegression(C=128.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l1', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)

In [99]:
# Predict
predict = b_clf.predict(XTest)
y_proba = b_clf.predict_proba(XTest)
print(b_clf.score(XTest,YTest))

0.9766081871345029


In [100]:
# coefficients
print(b_clf.coef_)

[[ 3.84883758e-09 -9.31267991e-01  1.47545778e-01 -1.21785996e-01
  -2.72486306e-03  0.00000000e+00 -3.18103919e+01  1.26021393e+01
   9.85393362e+01  0.00000000e+00  0.00000000e+00  4.33243499e-02
  -1.86163550e-01 -6.15026172e-03  1.69921352e-01  0.00000000e+00
  -6.68164161e+01 -1.94733922e+01  0.00000000e+00  0.00000000e+00
   0.00000000e+00 -5.48167019e-02  2.55421229e-01  2.84395012e-02
   2.35265717e-02  0.00000000e+00 -3.55547741e-01  5.83800329e+00
   5.10734917e+01  2.04216303e+01  0.00000000e+00]]


In [102]:
# Odds ratios (exp of coefficients): how a unit increase in a variable
# multiplies the odds of having a malignant mass
print(np.exp(b_clf.coef_))

[[1.00000000e+00 3.94053737e-01 1.15898634e+00 8.85337814e-01
  9.97278846e-01 1.00000000e+00 1.53081368e-14 2.97193677e+05
  6.23864074e+42 1.00000000e+00 1.00000000e+00 1.04427655e+00
  8.30137815e-01 9.93868612e-01 1.18521163e+00 1.00000000e+00
  9.59398863e-30 3.48990193e-09 1.00000000e+00 1.00000000e+00
  1.00000000e+00 9.46658653e-01 1.29100532e+00 1.02884776e+00
  1.02380550e+00 1.00000000e+00 7.00789487e-01 3.43093597e+02
  1.51682554e+22 7.39607612e+08 1.00000000e+00]]
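
To see what an odds ratio means in practice, the sketch below takes the second coefficient printed above, recovers its odds ratio and applies it to a starting probability. The helper `shift_probability` and the starting value of 0.5 are hypothetical. The reading assumes all other variables are held fixed and that class 1 is the malignant one, which is the case here because `LabelEncoder` sorts the labels alphabetically ('B' becomes 0 and 'M' becomes 1).

```python
import math

coef = -9.31267991e-01         # second coefficient printed above
odds_ratio = math.exp(coef)    # should agree with the second odds ratio printed above

def shift_probability(p, odds_ratio):
    """Probability after multiplying the odds p / (1 - p) by odds_ratio."""
    odds = p / (1.0 - p) * odds_ratio
    return odds / (1.0 + odds)

# Hypothetical starting probability of the positive (malignant) class
p_before = 0.5
p_after = shift_probability(p_before, odds_ratio)
print(odds_ratio, p_after)
```

An odds ratio below 1 lowers the probability of the positive class for each unit increase in the variable, one above 1 raises it, and a ratio of exactly 1 (a coefficient shrunk to zero by the L1 penalty) leaves it unchanged.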


In [109]:
from sklearn.metrics import roc_curve,auc
import matplotlib.pyplot as plt
fpr,tpr,threshold = roc_curve(YTest,y_proba[:,1])
print(auc(fpr,tpr))

0.9979423868312757
