# Now let's discuss other options for measuring model performance

### We have already discussed Accuracy

- For classification problems, though, accuracy on its own is often not a good metric, because it only reports the fraction of correctly classified samples and says nothing about which class the errors fall in.

- Let's take a spam classification problem where 99% of the emails in our training set are ham (legitimate mail) and just 1% are spam.

- This is called a class imbalance problem.

- Suppose I have built a model that ends up classifying every sample as ham.

- Then 99/100 samples are correctly classified and 1/100 is misclassified, and therefore the model is <b> 99% </b> accurate.

- But this is a failed classifier if you want to fish out spam: it did not find a single spam email, so its purpose is defeated.

- This is why we resort to a more elaborate method of summarizing model performance.
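
The all-ham scenario above can be checked in a few lines. This is only a minimal sketch on made-up labels (1 = spam, 0 = ham); the names `y_true` and `y_pred` are illustrative and not part of the dataset used later.

```python
# Hypothetical toy data: 100 emails, of which exactly 1 is spam (label 1).
y_true = [0] * 99 + [1]

# A "classifier" that labels every email as ham (label 0).
y_pred = [0] * 100

# Accuracy: fraction of predictions that match the true label.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# How many of the actual spam emails did the classifier catch?
spam_total = sum(t == 1 for t in y_true)
spam_caught = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))

print(f"accuracy    = {accuracy:.2f}")
print(f"spam caught = {spam_caught} of {spam_total}")
```

The accuracy comes out at 0.99 even though no spam is caught, which is exactly the gap the metrics below are meant to expose.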


We use a concept called the confusion matrix.

![confusion Matrix](confusion_matrix.png)

There are a few important things we need to know from the confusion matrix:

1. The class of interest on the prediction side is termed positive; in the above example, that is spam.
2. That's why the first column holds the positives.
3. Correct predictions are True and wrong predictions are False.
4. These conventions are sufficient to name and calculate every cell of the confusion matrix.
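
The naming rules above can be sketched as a small counting function. The helper `confusion_counts` and the toy labels are hypothetical, written only to illustrate how each (true, predicted) pair lands in one of the four cells.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, FN, TN, treating `positive` as the class of interest."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1   # predicted positive, and correct -> True Positive
        elif pred == positive:
            fp += 1   # predicted positive, but wrong   -> False Positive
        elif truth == positive:
            fn += 1   # predicted negative, but wrong   -> False Negative
        else:
            tn += 1   # predicted negative, and correct -> True Negative
    return tp, fp, fn, tn


# Hypothetical labels: 1 = spam (positive), 0 = ham (negative).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
```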


Therefore accuracy is:

![accuracy formula](accuracy.png)

The other important metrics are:

![Other metrics](other_metrics.png)

Now let's put it all in layman's terms before calculating them on our examples.

 Accuracy  - Out of all the samples, how many did I get right.<br>
 Precision - Out of all the cases I flagged as spam/cancer/HIV, how many truly are spam/cancer/HIV.<br>
 Recall    - Out of all the cases that truly are spam/cancer/HIV, how many did I flag as spam/cancer/HIV.
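
Written in terms of the four cells of the confusion matrix (TP, FP, FN, TN), and assuming the images above show the usual textbook definitions, these three metrics are:

$$
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad
\text{Precision} = \frac{TP}{TP + FP}, \qquad
\text{Recall} = \frac{TP}{TP + FN}
$$

Precision and recall share the numerator TP; they differ only in whether the denominator adds the false positives or the false negatives.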
 
Intuitions

High Precision -> Our classifier has very few false positives (telling a man without HIV that he has HIV).<br>
High Recall    -> Our classifier has very few false negatives (telling a man with HIV that he does not have HIV).

<br>
F1 Score -> the harmonic mean of precision and recall.
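
To see why a harmonic mean is used rather than a plain average, here is a small sketch. The function name `f1_from_precision_recall` and the example values are assumptions made for illustration only.

```python
def f1_from_precision_recall(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical classifier: everything it flags is right (precision 1.0),
# but it finds only 10% of the real positives (recall 0.1).
precision, recall = 1.0, 0.1

arithmetic_mean = (precision + recall) / 2
f1 = f1_from_precision_recall(precision, recall)

print(f"arithmetic mean = {arithmetic_mean:.2f}")
print(f"F1              = {f1:.2f}")
```

The arithmetic mean is 0.55, while F1 is about 0.18: the harmonic mean stays close to the weaker of the two numbers, so a classifier cannot score well on F1 by being good at only one of them.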

### The following imports are required to produce the confusion matrix

In [3]:
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

In [4]:
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

In [5]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('ggplot')

In [6]:
import os
os.getcwd()

'/media/Datascience/Projects/Giridhar/Supervised Learning Training 101'

In [9]:
os.listdir('/media/Datascience/Projects/Giridhar/Datasets/pima-indians-diabetes-database')

['diabetes.csv']

In [10]:
df = pd.read_csv('/media/Datascience/Projects/Giridhar/Datasets/pima-indians-diabetes-database/diabetes.csv')

The dataset can be found on Kaggle: [Pima Indians Diabetes Database](https://www.kaggle.com/uciml/pima-indians-diabetes-database/downloads/pima-indians-diabetes-database.zip/1)

In [11]:
df.head()

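| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |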


In [12]:
df.describe()

| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| count | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 |
| mean | 3.845052 | 120.894531 | 69.105469 | 20.536458 | 79.799479 | 31.992578 | 0.471876 | 33.240885 | 0.348958 |
| std | 3.369578 | 31.972618 | 19.355807 | 15.952218 | 115.244002 | 7.88416 | 0.331329 | 11.760232 | 0.476951 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.078 | 21.0 | 0.0 |
| 25% | 1.0 | 99.0 | 62.0 | 0.0 | 0.0 | 27.3 | 0.24375 | 24.0 | 0.0 |
| 50% | 3.0 | 117.0 | 72.0 | 23.0 | 30.5 | 32.0 | 0.3725 | 29.0 | 0.0 |
| 75% | 6.0 | 140.25 | 80.0 | 32.0 | 127.25 | 36.6 | 0.62625 | 41.0 | 1.0 |
| max | 17.0 | 199.0 | 122.0 | 99.0 | 846.0 | 67.1 | 2.42 | 81.0 | 1.0 |


In [14]:
df.iloc[:, :8].head()

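| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age |
|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 |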


In [20]:
df.iloc[:,[-1]].head()

| | Outcome |
|---|---|
| 0 | 1 |
| 1 | 0 |
| 2 | 1 |
| 3 | 0 |
| 4 | 1 |


In [24]:
X = df.iloc[:, :8].values
X.shape

(768, 8)

In [25]:
y = df.iloc[:,[-1]].values
print(y.shape)
print(type(y))

(768, 1)
<class 'numpy.ndarray'>


In [26]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42, stratify=y)

In [27]:
knn = KNeighborsClassifier(n_neighbors=6)

In [28]:
# y_train has shape (n, 1); ravel() flattens it to 1-D, avoiding scikit-learn's column-vector warning
knn.fit(X_train, y_train.ravel())


KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=6, p=2,
           weights='uniform')

In [29]:
# Predict the labels of the test data: y_pred
y_pred = knn.predict(X_test)

# Generate the confusion matrix and classification report
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

[[173  28]
 [ 57  50]]
              precision    recall  f1-score   support

           0       0.75      0.86      0.80       201
           1       0.64      0.47      0.54       107

   micro avg       0.72      0.72      0.72       308
   macro avg       0.70      0.66      0.67       308
weighted avg       0.71      0.72      0.71       308
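
The numbers in the report above can be reproduced by hand from the printed confusion matrix. The sketch below assumes scikit-learn's layout, where rows are true labels and columns are predicted labels in label order `[0, 1]`, so the positive class (1, diabetic) sits in the second row and second column rather than the first.

```python
# Confusion matrix printed above:
# [[173  28]
#  [ 57  50]]
# Rows = true label, columns = predicted label, label order [0, 1].
tn, fp = 173, 28   # true class 0: predicted 0, predicted 1
fn, tp = 57, 50    # true class 1: predicted 0, predicted 1

precision = tp / (tp + fp)                           # 50 / 78
recall = tp / (tp + fn)                              # 50 / 107
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)           # 223 / 308

print(f"precision (class 1) = {precision:.2f}")
print(f"recall    (class 1) = {recall:.2f}")
print(f"f1-score  (class 1) = {f1:.2f}")
print(f"accuracy            = {accuracy:.2f}")
```

These round to 0.64, 0.47, 0.54 and 0.72, matching the row for class 1 and the overall accuracy in the report. An accuracy of 0.72 hides the fact that fewer than half of the diabetic cases are found.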



In [30]:
# Unstratified report: same split, but without stratify=y
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)
knn = KNeighborsClassifier(n_neighbors=6)
# ravel() flattens the (n, 1) label array to the 1-D shape fit() expects
knn.fit(X_train, y_train.ravel())
# Predict the labels of the test data: y_pred
y_pred = knn.predict(X_test)

# Generate the confusion matrix and classification report
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

[[176  30]
 [ 56  46]]
              precision    recall  f1-score   support

           0       0.76      0.85      0.80       206
           1       0.61      0.45      0.52       102

   micro avg       0.72      0.72      0.72       308
   macro avg       0.68      0.65      0.66       308
weighted avg       0.71      0.72      0.71       308
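
One way to read the difference between the two reports is to compare the share of positive cases (Outcome = 1) in each test set with the share in the full dataset. The sketch below only reuses numbers already printed above (the `support` columns and the mean of `Outcome` from `df.describe()`); the variable names are illustrative.

```python
# Share of Outcome = 1 in the full dataset (mean of Outcome from df.describe()).
overall_rate = 0.348958

# Support for class 1 and the test-set size, taken from the two reports above.
test_size = 308
stratified_rate = 107 / test_size     # split made with stratify=y
unstratified_rate = 102 / test_size   # split made without stratify

print(f"overall      = {overall_rate:.3f}")
print(f"stratified   = {stratified_rate:.3f}")
print(f"unstratified = {unstratified_rate:.3f}")
```

The stratified test set (about 0.347) tracks the overall class balance (about 0.349) more closely than the unstratified one (about 0.331), which is what stratification is meant to guarantee.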





## Moving on to logistic regression

### Despite its name, logistic regression is a technique used for classification problems, not regression problems.

- 