**Aishwarya Singh, Nosson Weissman**

**DAV 6150 - Data Science**

**Professor James Topor**

**Summer 2022**

__DAV 6150 Practical (Module 5) : Performance Metrics__

In [42]:
import pandas as pd
from pandas import DataFrame
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import metrics

## Intro

In this assignment we are given a dataset containing categorical (binary) class predictions. <br>
Using only the columns that describe prediction correctness, we define our own metric functions and compare each with its sklearn counterpart.

In [87]:
#1-2. load the data from github
df = pd.read_csv('https://raw.githubusercontent.com/codepharmer/AI-6150/main/M5%20Performance%20Metrics/M5_Data.csv')
df

Unnamed: 0,pregnant,glucose,diastolic,skinfold,insulin,bmi,pedigree,age,class,scored.class,scored.probability
0,7,124,70,33,215,25.5,0.161,37,0,0,0.328452
1,2,122,76,27,200,35.9,0.483,26,0,0,0.273190
2,3,107,62,13,48,22.9,0.678,23,1,0,0.109660
3,1,91,64,24,0,29.2,0.192,21,0,0,0.055998
4,4,83,86,19,0,29.3,0.317,34,0,0,0.100491
...,...,...,...,...,...,...,...,...,...,...,...
176,5,123,74,40,77,34.1,0.269,28,0,0,0.311420
177,4,146,78,0,0,38.5,0.520,67,1,1,0.707210
178,8,188,78,0,0,47.9,0.137,43,1,1,0.888277
179,9,120,72,22,56,20.8,0.733,48,0,0,0.422468


#### Quick check that no data is missing

In [29]:
df.isna().sum()

pregnant              0
glucose               0
diastolic             0
skinfold              0
insulin               0
bmi                   0
pedigree              0
age                   0
class                 0
scored.class          0
scored.probability    0
dtype: int64

#### As per assignment instructions, we focus on the last three columns

In [36]:
# select the three columns by name rather than by position
raw_data = df[['class', 'scored.class', 'scored.probability']]
obs = raw_data['class']
pred = raw_data['scored.class']
pred_p = raw_data['scored.probability']

### We can use the pandas crosstab function to create a confusion matrix

In [37]:
#3.
pd.crosstab(raw_data['scored.class'],raw_data['class'])

scored.class,class=0,class=1
0,119,30
1,5,27


### Using NumPy's flatten method on the crosstab's underlying array, we can extract the confusion matrix values

In [90]:
#4.
pd_cf = pd.crosstab(raw_data['scored.class'],raw_data['class'])
tn,fn,fp,tp =pd_cf.values.flatten()
tn,fn,fp,tp

(119, 30, 5, 27)
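
Unpacking by position depends on remembering the row and column order of the crosstab. A small sketch of a label-based alternative follows; `toy_obs` and `toy_pred` are invented labels for illustration, not the assignment data.

```python
import pandas as pd

# Hypothetical toy labels, for illustration only (not the assignment data)
toy_obs = pd.Series([0, 0, 0, 1, 1, 0], name='class')
toy_pred = pd.Series([0, 1, 1, 0, 1, 0], name='scored.class')

# Rows are predicted labels, columns are observed labels
toy_cf = pd.crosstab(toy_pred, toy_obs)

# .loc[predicted, observed] makes the meaning of each cell explicit
tn = toy_cf.loc[0, 0]  # predicted 0, observed 0
fn = toy_cf.loc[0, 1]  # predicted 0, observed 1
fp = toy_cf.loc[1, 0]  # predicted 1, observed 0
tp = toy_cf.loc[1, 1]  # predicted 1, observed 1
print(tn, fn, fp, tp)
```

Selecting cells by label gives the same four counts as flattening, but does not silently break if the table is transposed.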

### Definitions of metrics:

$tp$ = true positive, $fp$ = false positive, $tn$ = true negative, $fn$ = false negative

**Precision** = $\Large\frac{tp}{tp+fp}$

**Accuracy** = $\Large\frac{tp+tn}{tp+fp+tn+fn}$

**Sensitivity** = $\Large\frac{tp}{tp+fn}$

**Specificity** = $\Large\frac{tn}{tn+fp}$

**F1** = $\Large\frac{tp}{tp+0.5(fp+fn)}$
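
As a worked example, the definitions can be applied directly to the four counts extracted from the crosstab (119, 30, 5, 27). The variable names ending in `_val` are chosen here only to avoid clashing with the metric functions defined later. The last line checks that this F1 formula agrees with the harmonic mean of precision and sensitivity.

```python
# Counts taken from the crosstab: tn, fn, fp, tp
tn, fn, fp, tp = 119, 30, 5, 27

precision_val = tp / (tp + fp)                    # 27 / 32
accuracy_val = (tp + tn) / (tp + fp + tn + fn)    # 146 / 181
sensitivity_val = tp / (tp + fn)                  # 27 / 57
specificity_val = tn / (tn + fp)                  # 119 / 124
f1_val = tp / (tp + 0.5 * (fp + fn))              # 27 / 44.5

# F1 written as the harmonic mean of precision and sensitivity
f1_harmonic = 2 * precision_val * sensitivity_val / (precision_val + sensitivity_val)

print(precision_val, accuracy_val, sensitivity_val, specificity_val, f1_val)
print(abs(f1_val - f1_harmonic) < 1e-12)
```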


### Next we define a helper function that our metric functions will call
Given binary classification data (predicted and observed labels), the function builds a confusion matrix with predictions as rows and observations as columns

In [11]:
def create_confusion_matrix(pred, obs):
    # get indices for pred value == True
    pred_true = [i for i in range(len(pred)) if pred[i] == True]
    # get indices for when pred value == False
    pred_false = [i for i in range(len(pred)) if pred[i] == False]
    # get the count of true negatives, true positives etc.
    tn = len([i for i in pred_false if obs[i] == False])
    tp = len([i for i in pred_true if obs[i] == True])
    fp = len([i for i in pred_true if obs[i] == False])
    fn = len([i for i in pred_false if obs[i] == True])
    # return confusion matrix of pred obs
    return (pd.DataFrame({'obs_0':[tn,fp],'obs_1':[fn,tp]}))
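
The helper above looks up `pred[i]` and `obs[i]` by integer, which assumes both Series carry a default 0..n-1 index. A minimal sketch of an equivalent count using NumPy boolean masks follows; `confusion_counts` is a hypothetical name and the toy labels are invented for illustration.

```python
import numpy as np

def confusion_counts(pred, obs):
    """Return (tn, fn, fp, tp) for binary 0/1 labels (hypothetical helper)."""
    # Convert to plain boolean arrays so the pandas index no longer matters
    pred = np.asarray(pred).astype(bool)
    obs = np.asarray(obs).astype(bool)
    tn = int(np.sum(~pred & ~obs))  # predicted 0, observed 0
    fn = int(np.sum(~pred & obs))   # predicted 0, observed 1
    fp = int(np.sum(pred & ~obs))   # predicted 1, observed 0
    tp = int(np.sum(pred & obs))    # predicted 1, observed 1
    return tn, fn, fp, tp

# Invented toy labels, not the assignment data
toy_pred = [0, 1, 1, 0, 1, 0]
toy_obs = [0, 0, 0, 1, 1, 0]
counts = confusion_counts(toy_pred, toy_obs)
print(counts)
```

The returned order (tn, fn, fp, tp) matches the order used when flattening the confusion matrix in this notebook.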

### In the following five cells we define functions to calculate correctness metrics for binary classification data

In [56]:
def accuracy(pred,obs):
    #accuracy = (tp+tn)/(tp+fp+tn+fn)
    # generate confusion matrix for data
    cf = create_confusion_matrix(pred, obs)
    tn,fn,fp,tp = cf.values.flatten()
    return (tp+tn)/(tp+fp+tn+fn)

In [79]:
def precision(pred,obs):
    #precision = tp / (tp+fp)
    # generate confusion matrix for data
    cf = create_confusion_matrix(pred, obs)
    # extract values from confusion matrix
    tn,fn,fp,tp = cf.values.flatten()
    return (tp/(tp+fp))

In [58]:
def sensitivity(pred,obs):
    #sensitivity = tp / (tp+fn)
    # generate confusion matrix for data
    cf = create_confusion_matrix(pred, obs)
    tn,fn,fp,tp = cf.values.flatten()
    return  tp / (tp+fn)

In [59]:
def specificity(pred,obs):
    #specificity = tn / (tn+fp)
    # generate confusion matrix for data
    cf = create_confusion_matrix(pred, obs)
    tn,fn,fp,tp = cf.values.flatten()
    return tn / (tn+fp)

In [60]:
def f1(pred,obs):
    #f1 = tp/(tp+.5(fp+fn))
    # generate confusion matrix for data
    cf = create_confusion_matrix(pred, obs)
    tn,fn,fp,tp = cf.values.flatten()
    return  tp/(tp+.5*(fp+fn))
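
Each of the five functions rebuilds the confusion matrix, and each ratio is undefined when its denominator is zero (for example, precision when nothing is predicted positive). One possible way to compute all five from a single set of counts is sketched below. `safe_div` and `metrics_from_counts` are hypothetical names, and returning `nan` for an undefined ratio is an assumption rather than something the assignment specifies.

```python
import math

def safe_div(num, den):
    """Divide, returning nan when the denominator is zero (assumed convention)."""
    return num / den if den != 0 else math.nan

def metrics_from_counts(tn, fn, fp, tp):
    """Compute the five metrics from one set of confusion-matrix counts."""
    return {
        'accuracy': safe_div(tp + tn, tp + fp + tn + fn),
        'precision': safe_div(tp, tp + fp),
        'sensitivity': safe_div(tp, tp + fn),
        'specificity': safe_div(tn, tn + fp),
        'f1': safe_div(tp, tp + 0.5 * (fp + fn)),
    }

# Counts from the crosstab in this notebook: tn, fn, fp, tp
results = metrics_from_counts(119, 30, 5, 27)
for name, value in results.items():
    print(name, value)
```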

### Now, using the functions defined above and the data pulled from GitHub, we compute each metric and print the results

In [91]:
print('accuracy: ',accuracy(pred, obs))
print('precision: ',precision(pred, obs))
print('sensitivity: ',sensitivity(pred,obs))
print('specificity: ',specificity(pred,obs))
print('f1 score: ',f1(pred,obs))
# metrics.classification_report()

accuracy:  0.8066298342541437
precision:  0.84375
sensitivity:  0.47368421052631576
specificity:  0.9596774193548387
f1 score:  0.6067415730337079


### We compare our functions with the sklearn built-in functions

In [89]:
#12.
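# sklearn expects (y_true, y_pred); passing (pred, obs) transposes the matrix
# so that rows are predictions, matching the crosstab layout above
display(metrics.confusion_matrix(pred,obs))
# np.isclose tolerates floating-point rounding that a strict == would not
print('accuracy matches sklearn: ', np.isclose(metrics.accuracy_score(obs,pred), accuracy(pred, obs)))
print('precision matches sklearn: ', np.isclose(metrics.precision_score(obs,pred), precision(pred,obs)))
print('sensitivity matches sklearn: ', np.isclose(metrics.recall_score(obs,pred), sensitivity(pred,obs)))
print('f1 score matches sklearn: ', np.isclose(metrics.f1_score(obs,pred), f1(pred,obs)))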
# metrics.classification_report(pred,obs)

array([[119,  30],
       [  5,  27]], dtype=int64)

accuracy matches sklearn:  True
precision matches sklearn:  True
sensitivity matches sklearn:  True
f1 score matches sklearn:  True
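
The comparison above leaves out specificity, for which sklearn has no function of that name. Because specificity is the recall of the negative class, one way to check it is `recall_score` with `pos_label=0`. The sketch below uses invented toy labels and a manual calculation, comparing with `np.isclose` to tolerate floating-point rounding.

```python
import numpy as np
from sklearn import metrics

# Invented toy labels, not the assignment data
toy_obs = np.array([0, 0, 0, 1, 1, 0])
toy_pred = np.array([0, 1, 1, 0, 1, 0])

# Manual specificity: tn / (tn + fp)
tn = int(np.sum((toy_pred == 0) & (toy_obs == 0)))
fp = int(np.sum((toy_pred == 1) & (toy_obs == 0)))
manual_specificity = tn / (tn + fp)

# Recall of the negative class (label 0) is the same quantity
sk_specificity = metrics.recall_score(toy_obs, toy_pred, pos_label=0)

print('specificity matches sklearn: ', np.isclose(manual_specificity, sk_specificity))
```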


In [83]:
# precision_score requires both label arrays: (y_true, y_pred)
metrics.precision_score(obs, pred)
# same value as our precision(pred, obs)

0.84375

In [82]:
metrics.confusion_matrix(obs,pred)

array([[119,   5],
       [ 30,  27]], dtype=int64)
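
The two sklearn confusion matrices shown in this notebook are transposes of each other because `confusion_matrix` takes `(y_true, y_pred)` and puts the true labels on the rows. A short sketch on invented toy labels illustrates the orientation, and shows that `ravel()` on the conventional call yields the order tn, fp, fn, tp (which differs from the tn, fn, fp, tp order of the crosstab above).

```python
import numpy as np
from sklearn import metrics

# Invented toy labels, not the assignment data
toy_obs = np.array([0, 0, 0, 1, 1, 0])
toy_pred = np.array([0, 1, 1, 0, 1, 0])

# Conventional call: rows are observed labels, columns are predicted labels
cm = metrics.confusion_matrix(toy_obs, toy_pred)
tn, fp, fn, tp = cm.ravel()

# Swapping the arguments transposes the matrix (rows become predictions)
cm_swapped = metrics.confusion_matrix(toy_pred, toy_obs)

print(cm)
print(cm_swapped)
print(np.array_equal(cm_swapped, cm.T))
```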