# Introduction
The purpose of this assignment is to calculate a suite of classification model performance metrics using Python functions that we write ourselves, and then to compare our results with those of pre-built Python functions that calculate the same metrics.

The data set has three key columns that we will use:
- class: the actual classification for the observation
- scored.class: the predicted classification for the observation (either ‘0’ or ‘1’), obtained by comparing the model’s probability score for the observation against a 0.50 threshold
- scored.probability: the classification model’s probability score for the observation, i.e., the estimated likelihood that the correct classification is ‘1’

We completed the tasks in this order: 1-9, 11, 12, then 10 and 13.
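
The threshold rule that links scored.probability to scored.class can be sketched as follows. This is only an illustration: the first five probabilities are copied from the df.head() output, the sixth value (0.75) is hypothetical and added just to show a positive prediction, and the strict `>` comparison is an assumption, since the source does not say how a score of exactly 0.50 is handled.

```python
import pandas as pd

# First five scored.probability values from the df.head() output, plus one
# hypothetical value (0.75) that is NOT from the data set.
probs = pd.Series(
    [0.328452, 0.27319, 0.10966, 0.055998, 0.100491, 0.75],
    name="scored.probability",
)

THRESHOLD = 0.50  # assumption: scores strictly above the threshold become class 1

# Compare each score against the threshold and convert True/False to 1/0
predicted = (probs > THRESHOLD).astype(int).rename("scored.class")
print(predicted.tolist())
```

With these inputs, the five real scores all fall below the threshold and map to class 0, which agrees with the scored.class values shown for those rows.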

In [2]:
# load the pandas library
import pandas as pd

# load the train_test_split function from the sklearn.model_selection module
from sklearn.model_selection import train_test_split

# start by reading a set of sample data from GitHub. This data set contains health-related measurements
# (e.g., glucose, insulin, bmi) along with the actual class, the predicted class, and the predicted probability
filename = "https://raw.githubusercontent.com/YALINYAN-YU/DAV6150/master/M5_Data.csv"
df = pd.read_csv(filename)
df.head()

,pregnant,glucose,diastolic,skinfold,insulin,bmi,pedigree,age,class,scored.class,scored.probability
0,7,124,70,33,215,25.5,0.161,37,0,0,0.328452
1,2,122,76,27,200,35.9,0.483,26,0,0,0.27319
2,3,107,62,13,48,22.9,0.678,23,1,0,0.10966
3,1,91,64,24,0,29.2,0.192,21,0,0,0.055998
4,4,83,86,19,0,29.3,0.317,34,0,0,0.100491


In [3]:
# use the crosstab() function to show the contents of a confusion matrix
cm = pd.crosstab(df['class'], df['scored.class'], rownames=['Actual'], colnames=['Predicted'], margins=True)
cm

Actual / Predicted,0,1,All
0,119,5,124
1,30,27,57
All,149,32,181


In the confusion matrix output, the rows represent the actual classification and the columns represent the predicted classification.
There are 181 observations in total.
Of the 124 observations whose actual class is 0, 119 were predicted correctly and 5 were incorrectly predicted as 1.
Of the 57 observations whose actual class is 1, 27 were predicted correctly and 30 were incorrectly predicted as 0.
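
To make the reading of the matrix concrete, the sketch below rebuilds label vectors from the four cell counts quoted above and passes them to pd.crosstab(). The reconstructed vectors (`y_true`, `y_pred`) are illustrative stand-ins, not the original data; only the counts are taken from the output.

```python
import pandas as pd

# Rebuild actual/predicted labels from the reported cell counts:
# 119 (actual 0, predicted 0), 5 (actual 0, predicted 1),
# 30 (actual 1, predicted 0), 27 (actual 1, predicted 1)
y_true = pd.Series([0] * 119 + [0] * 5 + [1] * 30 + [1] * 27, name="Actual")
y_pred = pd.Series([0] * 119 + [1] * 5 + [0] * 30 + [1] * 27, name="Predicted")

# Rows are actual classes, columns are predicted classes; margins adds the 'All' totals
cm_check = pd.crosstab(y_true, y_pred, margins=True)
print(cm_check)

# Row totals count actual classes; column totals count predicted classes
actual_totals = cm_check["All"]
predicted_totals = cm_check.loc["All"]
```

The row totals reproduce the actual class counts (124 and 57) and the column totals reproduce the predicted class counts (149 and 32).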

In [4]:
# check counts of the actual observations
df['class'].value_counts().rename_axis('actual class').to_frame('counts')

actual class,counts
0,124
1,57


In [5]:

# check counts of the predicted observations
df['scored.class'].value_counts().rename_axis('predicted class').to_frame('counts')

predicted class,counts
0,149
1,32


In [6]:

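# extract the True Negative, True Positive, False Negative, and False Positive counts
# cm.loc[actual, predicted] makes the row (actual) and column (predicted) explicit

TN = cm.loc[0, 0]  # actual 0, predicted 0
TP = cm.loc[1, 1]  # actual 1, predicted 1
FN = cm.loc[1, 0]  # actual 1, predicted 0
FP = cm.loc[0, 1]  # actual 0, predicted 1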

# put extracted values into a data frame
df02 = pd.DataFrame({'confusion matrix':['True_Negative', 'True_Positive','False_Negative','False_Positive'],
                   'count': [TN,TP,FN,FP]})
df02

,confusion matrix,count
0,True_Negative,119
1,True_Positive,27
2,False_Negative,30
3,False_Positive,5


TN (True Negative) is 119: the actual class is 0 and the model predicted 0.

TP (True Positive) is 27: the actual class is 1 and the model predicted 1.

FN (False Negative) is 30: the actual class is 1 and the model predicted 0.

FP (False Positive) is 5: the actual class is 0 and the model predicted 1.
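
The four definitions above can also be checked directly with boolean masks, without building a crosstab first. This is a minimal sketch: the label vectors are rebuilt from the reported counts rather than loaded from the data set, and the lowercase variable names are illustrative.

```python
import pandas as pd

# Label vectors rebuilt from the reported counts (illustrative, not the original rows)
y_true = pd.Series([0] * 119 + [0] * 5 + [1] * 30 + [1] * 27)
y_pred = pd.Series([0] * 119 + [1] * 5 + [0] * 30 + [1] * 27)

# Each count is the number of rows where both conditions hold
tn = int(((y_true == 0) & (y_pred == 0)).sum())  # actual 0, predicted 0
tp = int(((y_true == 1) & (y_pred == 1)).sum())  # actual 1, predicted 1
fn = int(((y_true == 1) & (y_pred == 0)).sum())  # actual 1, predicted 0
fp = int(((y_true == 0) & (y_pred == 1)).sum())  # actual 0, predicted 1

print(tn, tp, fn, fp)
```

Because every observation falls into exactly one of the four cells, the counts sum to the total number of observations.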

In [7]:

# Define a function to compute the accuracy
def compute_accuracy(y_true, y_pred):
    """Return the share of observations whose predicted class equals the actual class."""
    # rows are the actual classes, columns are the predicted classes
    cm = pd.crosstab(y_true, y_pred, rownames=['Actual'], colnames=['Predicted'])
    TN = cm.loc[0, 0]
    TP = cm.loc[1, 1]
    FN = cm.loc[1, 0]
    FP = cm.loc[0, 1]
    return (TP + TN) / (TP + TN + FP + FN)
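
As a quick sanity check of the accuracy formula used in the function above, the sketch below plugs in the four counts reported earlier and compares the result with the fraction of matching labels. The label vectors are rebuilt from those counts for illustration and are not the original rows.

```python
import pandas as pd

# Counts taken from the confusion matrix output above
TN, TP, FN, FP = 119, 27, 30, 5

# Accuracy = correct predictions / all predictions
accuracy_from_counts = (TP + TN) / (TP + TN + FP + FN)

# Cross-check: the fraction of rows where the predicted label equals the actual label
y_true = pd.Series([0] * 119 + [0] * 5 + [1] * 30 + [1] * 27)
y_pred = pd.Series([0] * 119 + [1] * 5 + [0] * 30 + [1] * 27)
accuracy_from_labels = float((y_true == y_pred).mean())

print(round(accuracy_from_counts, 4), round(accuracy_from_labels, 4))
```

Both routes give 146 correct predictions out of 181, or roughly 0.8066.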