In [65]:
import pandas as pd
import numpy as np
from sklearn.metrics import confusion_matrix as cm
from sklearn.linear_model import LogisticRegression as lr
from sklearn.model_selection import train_test_split as tts
from sklearn.preprocessing import StandardScaler

In [66]:
data = pd.read_csv("data_2.csv")

In [67]:
data.diagnosis.value_counts()

B    357
M    212
Name: diagnosis, dtype: int64

In [68]:
data.diagnosis = data.diagnosis.apply(lambda x: 1 if x == 'M' else 0)

In [69]:
label = data.diagnosis
data.drop(columns = ["diagnosis", "id", "Unnamed: 32"], inplace=True)

In [72]:
scaler = StandardScaler()
data = scaler.fit_transform(data)

In [73]:
data

array([[ 1.09706398, -2.07333501,  1.26993369, ...,  2.29607613,
         2.75062224,  1.93701461],
       [ 1.82982061, -0.35363241,  1.68595471, ...,  1.0870843 ,
        -0.24388967,  0.28118999],
       [ 1.57988811,  0.45618695,  1.56650313, ...,  1.95500035,
         1.152255  ,  0.20139121],
       ...,
       [ 0.70228425,  2.0455738 ,  0.67267578, ...,  0.41406869,
        -1.10454895, -0.31840916],
       [ 1.83834103,  2.33645719,  1.98252415, ...,  2.28998549,
         1.91908301,  2.21963528],
       [-1.80840125,  1.22179204, -1.81438851, ..., -1.74506282,
        -0.04813821, -0.75120669]])

In [74]:
X_train, X_test, y_train, y_test = tts(data, label, test_size=0.2, random_state=42, stratify=label)

In [75]:
classifier = lr(random_state=42, max_iter=500, verbose=1).fit(X_train, y_train)

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.1s finished


In [79]:
classifier.predict(X_test)

array([0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0,
       0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0,
       1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0,
       1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0,
       0, 0, 1, 1])

In [86]:
classifier.score(X_test, y_test)

0.9736842105263158

In [85]:
cm(y_test, classifier.predict(X_test))

array([[71,  1],
       [ 2, 40]])

# How a confusion matrix works

The confusion matrix tells you much more about how your model classifies the data than the accuracy score alone. For an n-class classification problem, we get an n × n matrix, where the entry in row i, column j counts the samples whose true class is i and whose predicted class is j. For our use case, we'll look at 2-class classification.

Read row by row, the matrix above gives (tn, fp, fn, tp), so here tn = 71, fp = 1, fn = 2 and tp = 40. The letters (t, f) stand for true and false and denote whether the prediction was correct, and (p, n) stand for positive and negative and denote whether the model predicted a positive hit or a negative hit. With that in mind, "tn" means the prediction is _correct_ and the model predicted _negative_.

Below is a visual depiction of a confusion matrix:
![](image1.png)
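
As a quick sanity check on the layout described above, here is a minimal sketch that unpacks the four counts from the matrix printed earlier and recomputes accuracy from them. It uses a plain nested list rather than the NumPy array that scikit-learn returns; the variable names are chosen for illustration only.

```python
# Confusion matrix copied from the output above: rows are true classes,
# columns are predicted classes, with class 0 (benign) listed first.
matrix = [[71, 1],
          [2, 40]]

# Unpacking row by row yields the four counts in (tn, fp, fn, tp) order.
(tn, fp), (fn, tp) = matrix

# Accuracy is the share of correct predictions (the diagonal) among all samples.
total = tn + fp + fn + tp
accuracy = (tn + tp) / total

print(tn, fp, fn, tp)
print(total)
print(accuracy)
```

The accuracy computed this way should agree with `classifier.score(X_test, y_test)`. On the array returned by scikit-learn, `cm(y_test, y_pred).ravel()` gives the same four numbers in the same order.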

The matrix is better explained by my favourite **cancer patient who bought a burglary detection system** example. With the example in mind, precision is the fraction of the model's positive predictions that are actually positive, whereas recall is the fraction of all actual positives that the model manages to detect.

So, 

**Recall = TP/(TP + FN)** _(To have high recall, the model should make sure that if it says negative, it better be true else FN increases)_

**Precision = TP/(TP + FP)** _(To have high precision, the model should make sure that if it says positive, it better be true else FP increases)_
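
To see how the two formulas pull in different directions, here is a small sketch on made-up labels (not the dataset used in this notebook). The helper `precision_recall` and both toy "models" are hypothetical and exist only for illustration; returning 0.0 when a denominator is zero is a convention picked for this sketch.

```python
# Hypothetical toy data: 1 = positive, 0 = negative (3 positives, 7 negatives).
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]


def precision_recall(y_true, y_pred):
    """Return (precision, recall) for binary labels, treating 1 as positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Guard against empty denominators (no positive predictions or no positives).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Flags everything as positive: FN is zero, but FP is large.
flag_everything = [1] * len(y_true)
# Flags a single, correct case: FP is zero, but two positives are missed.
flag_one = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

print(precision_recall(y_true, flag_everything))  # perfect recall, poor precision
print(precision_recall(y_true, flag_one))         # perfect precision, poor recall
```

Neither extreme is a useful model, which is why the two numbers are usually reported together.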


In [110]:
precision = 40/(40+1)  # TP/(TP + FP), with TP = 40 and FP = 1
recall = 40/(40+2)     # TP/(TP + FN), with TP = 40 and FN = 2

In [111]:
precision

0.975609756097561

In [112]:
recall

0.9523809523809523

# F1 score
The F1 score combines precision and recall into one metric. It is the harmonic mean of the two, chosen because the harmonic mean punishes extremes. If your model is highly precise but has very bad recall (meaning whatever it predicted as positive is correct, but it missed a lot of positives), the F1 score will be very low, much lower than a simple average would be.
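
To make the "punishes extremes" point concrete, here is a minimal sketch comparing the simple average with the harmonic mean for a lopsided model. The precision and recall values are invented for illustration, and `harmonic_f1` is a hypothetical helper, not a library function.

```python
def harmonic_f1(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical model: every positive prediction is right (precision 1.0),
# but it finds only 10% of the actual positives (recall 0.1).
p, r = 1.0, 0.1

arithmetic = (p + r) / 2      # simple average, roughly 0.55
harmonic = harmonic_f1(p, r)  # F1, roughly 0.18

print(arithmetic)
print(harmonic)
```

The simple average still looks respectable, while the F1 score stays close to the weaker of the two numbers. When precision and recall are equal, both means coincide.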

In [113]:
f1 = 2*precision*recall/(precision+recall)

In [114]:
f1

0.963855421686747

# ROC curve and AUC

Before we get into the ROC curve and the area under it (AUC), there is one more measurement we need to know: specificity.

**Specificity = TN/(TN + FP)**

**False positive rate = 1 - Specificity = FP/(TN + FP)**
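
As a small worked example, the two quantities above can be computed from the counts in the confusion matrix printed earlier (tn = 71, fp = 1, fn = 2, tp = 40). This is only a sketch with the numbers typed in by hand; the variable names are chosen for illustration.

```python
# Counts read off the confusion matrix shown earlier in the notebook.
tn, fp, fn, tp = 71, 1, 2, 40

# Specificity: share of actual negatives that the model labels negative.
specificity = tn / (tn + fp)

# False positive rate: share of actual negatives that the model labels positive.
false_positive_rate = fp / (tn + fp)

print(specificity)
print(false_positive_rate)
```

The two values add up to 1, since every actual negative is either correctly rejected or wrongly flagged.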
