# Supervised Learning with scikit-learn (cont.)

In [1]:
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt 

### CLASS IMBALANCES AND METRICS OTHER THAN ACCURACY

Accuracy is not always a useful metric when evaluating the performance of a model. It is simply the fraction of samples that were classified correctly.

Say we have a classification model for predicting fraudulent bank transactions, where 99% of the data fed to the model are legitimate transactions and only the remaining 1% are fraudulent. A model that labels every transaction as legitimate would be 99% accurate. However, it would never catch a single fraudulent transaction, which defeats its original purpose.

This is called **class imbalance**: one class in a dataset has significantly more observations/datapoints than another, creating a disproportionate ratio between classes.
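
To make the fraud example concrete, here is a minimal sketch using synthetic labels (the 10 fraudulent and 990 legitimate transactions are made up for illustration, not taken from any dataset). It scores a "model" that always predicts the majority class:

```python
import numpy as np

# Synthetic labels (an assumption for illustration): 1 = fraudulent, 0 = legitimate
y_true = np.array([1] * 10 + [0] * 990)   # 1% fraud, 99% legitimate

# A "classifier" that ignores its input and always predicts the majority class
y_pred = np.zeros_like(y_true)

accuracy = np.mean(y_pred == y_true)                    # fraction of correct predictions
frauds_caught = np.sum((y_pred == 1) & (y_true == 1))   # true positives
recall = frauds_caught / np.sum(y_true == 1)            # fraction of frauds detected

print(f"Accuracy: {accuracy:.2f}")   # 0.99
print(f"Recall:   {recall:.2f}")     # 0.00
```

The accuracy looks excellent even though not one fraudulent transaction is detected.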

##### The Confusion Matrix
For a binary classifier, a confusion matrix is a 2x2 table that compares the predictions made by a model with the actual values. It shows the number of correct classifications (true positives and true negatives), false positives, and false negatives.

&ensp;

For the bank transaction model, it would look something like this:

|                 | Predicted Legitimate |Predicted Fraudulent|
|-----------------|----------------------|--------------------|
|Actual Legitimate|  True Negative (TN)  | False Positive (FP)|
|Actual Fraudulent|  False Negative (FN) | True Positive (TP) |

*Note that a legitimate prediction is referred to as 'negative', as in NOT a fraudulent transaction.*
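
The same layout is what scikit-learn's `confusion_matrix` returns for binary labels: rows are actual classes and columns are predicted classes, in label order. The sketch below uses a handful of made-up labels (hypothetical, chosen only so the counts are easy to check by eye):

```python
from sklearn.metrics import confusion_matrix

# Toy labels (hypothetical): 0 = legitimate, 1 = fraudulent
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 0, 0, 1, 1]

# Rows are actual classes, columns are predicted classes, in label order [0, 1]
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
print(cm)

# For a binary problem, ravel() flattens the matrix row by row: TN, FP, FN, TP
tn, fp, fn, tp = cm.ravel()
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")   # TN=5, FP=1, FN=2, TP=2
```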

&ensp;
From this matrix, the following metrics can be computed:

&ensp;
1. **Accuracy**: Simply the proportion of correct predictions. 

&ensp;&ensp;&ensp;&ensp;&ensp;$accuracy = \frac{tp + tn}{tp + tn + fp + fn}$

2. **Precision**: Refers to how many positive predictions made by a model are actually correct. (*Out of the predicted positives, how many are correct?*)

&ensp;&ensp;&ensp;&ensp;&ensp;$precision = \frac{tp}{tp + fp}$

&ensp;&ensp;&ensp;&ensp;&ensp;Also, note that higher precision means that fewer of the positive predictions are false positives. 

3. **Recall**: Refers to how many of the actual positive cases a model has correctly identified. (*Out of the actual positives, how many were caught?*)

&ensp;&ensp;&ensp;&ensp;&ensp;$recall = \frac{tp}{tp + fn}$

&ensp;&ensp;&ensp;&ensp;&ensp;Higher recall means a lower false negative rate.

4. **F1 Score**: The harmonic mean of precision and recall, which balances the two into a single value. It is high only when both precision and recall are high.

&ensp;&ensp;&ensp;&ensp;&ensp; $F1\ Score = 2 \cdot \frac{precision \cdot recall}{precision + recall}$
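
As a quick worked example of the four metrics, the sketch below plugs in made-up counts (hypothetical values, not from the churn data). It also compares the harmonic mean used by F1 with a plain arithmetic mean, to show how F1 is pulled toward the weaker of precision and recall:

```python
# Hypothetical confusion-matrix counts, chosen only for easy arithmetic
tp, fp, fn, tn = 30, 10, 70, 890

accuracy = (tp + tn) / (tp + tn + fp + fn)   # 920 / 1000 = 0.92
precision = tp / (tp + fp)                   # 30 / 40   = 0.75
recall = tp / (tp + fn)                      # 30 / 100  = 0.30

# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

# An arithmetic mean would hide the low recall more than F1 does
arithmetic_mean = (precision + recall) / 2

print(f"accuracy={accuracy:.2f}, precision={precision:.2f}, recall={recall:.2f}")
print(f"F1={f1:.3f} vs arithmetic mean={arithmetic_mean:.3f}")
```

Here F1 comes out at roughly 0.429, well below the arithmetic mean of 0.525, because the low recall dominates.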



&ensp;&ensp;

Below is a demonstration of using the confusion matrix on the churn dataset.

In [5]:
telco_churn = pd.read_csv('../Data/telecom_churn_clean.csv')
telco_churn.head()

| Unnamed: 0.1 | Unnamed: 0 | account_length | area_code | international_plan | voice_mail_plan | number_vmail_messages | total_day_minutes | total_day_calls | total_day_charge | total_eve_minutes | total_eve_calls | total_eve_charge | total_night_minutes | total_night_calls | total_night_charge | total_intl_minutes | total_intl_calls | total_intl_charge | customer_service_calls | churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 128 | 415 | 0 | 1 | 25 | 265.1 | 110 | 45.07 | 197.4 | 99 | 16.78 | 244.7 | 91 | 11.01 | 10.0 | 3 | 2.7 | 1 | 0 |
| 1 | 1 | 107 | 415 | 0 | 1 | 26 | 161.6 | 123 | 27.47 | 195.5 | 103 | 16.62 | 254.4 | 103 | 11.45 | 13.7 | 3 | 3.7 | 1 | 0 |
| 2 | 2 | 137 | 415 | 0 | 0 | 0 | 243.4 | 114 | 41.38 | 121.2 | 110 | 10.3 | 162.6 | 104 | 7.32 | 12.2 | 5 | 3.29 | 0 | 0 |
| 3 | 3 | 84 | 408 | 1 | 0 | 0 | 299.4 | 71 | 50.9 | 61.9 | 88 | 5.26 | 196.9 | 89 | 8.86 | 6.6 | 7 | 1.78 | 2 | 0 |
| 4 | 4 | 75 | 415 | 1 | 0 | 0 | 166.7 | 113 | 28.34 | 148.3 | 122 | 12.61 | 186.9 | 121 | 8.41 | 10.1 | 3 | 2.73 | 3 | 0 |


In [12]:
# Define X and y
pred_X = telco_churn.drop('churn', axis=1).values
target_y = telco_churn['churn'].values

In [27]:
from sklearn.neighbors import KNeighborsClassifier   # The KNN classifier
from sklearn.metrics import classification_report, confusion_matrix   # Import confusion matrix and classification report for metrics
from sklearn.model_selection import train_test_split

# Here, we deliberately omit the 'stratify' argument, so the split is not guaranteed to preserve the class proportions of the full dataset
X_train, X_test, y_train, y_test = train_test_split(pred_X, target_y, test_size=0.4, random_state=10) 

telco_knn = KNeighborsClassifier(n_neighbors=7)
telco_knn.fit(X_train, y_train)
y_pred = telco_knn.predict(X_test)
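
For comparison, the sketch below shows what the `stratify` argument of `train_test_split` does. It uses synthetic data (900 negatives and 100 positives, an assumption made purely for illustration) rather than the churn dataset:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (hypothetical): 900 negatives, 100 positives
X = np.arange(1000).reshape(-1, 1)
y = np.array([0] * 900 + [1] * 100)

# Without stratify, the positive share in each split depends on the random shuffle
_, _, y_tr_plain, y_te_plain = train_test_split(
    X, y, test_size=0.4, random_state=10)

# With stratify=y, both splits keep the 10% positive share of the full data
_, _, y_tr_strat, y_te_strat = train_test_split(
    X, y, test_size=0.4, random_state=10, stratify=y)

print(f"Plain split:      train={y_tr_plain.mean():.3f}, test={y_te_plain.mean():.3f}")
print(f"Stratified split: train={y_tr_strat.mean():.3f}, test={y_te_strat.mean():.3f}")
```

Stratifying does not remove the imbalance in the data. It only ensures that the training and test sets reflect it equally.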

In [28]:
# Print the confusion matrix
conf_mat = confusion_matrix(y_test, y_pred)
print(conf_mat)

[[1150    5]
 [ 167   12]]


In [29]:
# Metrics can be calculated manually 
tn = conf_mat[0,0]
fp = conf_mat[0,1]
fn = conf_mat[1,0]
tp = conf_mat[1,1]

accuracy = (tn + tp) / (tn + fp + fn + tp)
precision = tp / (fp + tp)
recall = tp / (tp + fn)
f1_score = 2 * (precision * recall) / (precision + recall)   # harmonic mean of precision and recall

print(f'Accuracy: {accuracy:.4f}\nPrecision: {precision:.4f}\nRecall: {recall:.4f}\nF1 Score: {f1_score:.4f}')

Accuracy: 0.8711
Precision: 0.7059
Recall: 0.0670
F1 Score: 0.1224


In [30]:
# Metrics can also be calculated automatically with classification_report

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.87      1.00      0.93      1155
           1       0.71      0.07      0.12       179

    accuracy                           0.87      1334
   macro avg       0.79      0.53      0.53      1334
weighted avg       0.85      0.87      0.82      1334
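
The `macro avg` and `weighted avg` rows of the report can be reproduced by hand. The sketch below re-enters the confusion matrix printed earlier in this section as a literal array (so it runs on its own) and derives the per-class and averaged F1 values. Variable names such as `support` and `macro_f1` are chosen here for illustration:

```python
import numpy as np

# Confusion matrix printed earlier (rows = actual class, columns = predicted class)
conf_mat = np.array([[1150, 5],
                     [167, 12]])

support = conf_mat.sum(axis=1)      # actual samples per class: [1155, 179]
predicted = conf_mat.sum(axis=0)    # predictions per class: [1317, 17]
correct = np.diag(conf_mat)         # correctly classified samples per class

precision = correct / predicted     # per-class precision
recall = correct / support          # per-class recall
f1 = 2 * precision * recall / (precision + recall)

macro_f1 = f1.mean()                            # unweighted mean over classes
weighted_f1 = np.average(f1, weights=support)   # mean weighted by class support

print(np.round(precision, 2), np.round(recall, 2), np.round(f1, 2))
print(f"macro avg F1: {macro_f1:.2f}, weighted avg F1: {weighted_f1:.2f}")
```

The macro average treats both classes equally, so the poor F1 on the churn class (about 0.12) drags it down to roughly 0.53. The weighted average is dominated by the much larger non-churn class and stays near 0.82.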

