# Building a Confusion Matrix in Python

In this notebook, we will train a logistic regression model on an extended dataset. The extended dataset includes variations of the original records, plus some randomized records so that the model does not end up classifying everything perfectly.

In addition to the `pandas` and `NumPy` libraries, we will use a few functions and classes from `scikit-learn`, a widely used machine learning package.

In [None]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score

In [None]:
df = pd.read_csv("../data/1553_dos_cm_Py.csv")

## Train the Model

Now that we have loaded the packages and the data, let's encode the string columns as numeric codes and then train a model.

In [None]:
string_cols = df.select_dtypes(include=[object]).columns.values
enc = OrdinalEncoder()
enc.fit(df[string_cols])
df[string_cols] = enc.transform(df[string_cols])
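
To make the encoding step concrete, here is a minimal sketch on a hypothetical toy frame. The column names `protocol` and `flag` are made up for illustration and are not taken from the real dataset. `OrdinalEncoder` sorts the distinct values in each column and replaces each value with its position in that sorted list.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy data; these columns are illustrative only.
toy = pd.DataFrame({
    "protocol": ["tcp", "udp", "tcp", "icmp"],
    "flag": ["SYN", "ACK", "SYN", "ACK"],
})

toy_enc = OrdinalEncoder()
encoded = toy_enc.fit_transform(toy[["protocol", "flag"]])

# Each column's categories are sorted, then mapped to 0.0, 1.0, 2.0, ...
print(toy_enc.categories_)
print(encoded)
```

The resulting codes are floats, and their order reflects only how the category values sort, not any meaningful ranking.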

In [None]:
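# Separate the label from the features.
y = df['malicious']
x = df.loc[:, df.columns != 'malicious']
# Fill missing feature values with each column's mean, keeping the column names.
imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
x = pd.DataFrame(imp_mean.fit_transform(x), columns=x.columns, index=x.index)
# Hold out 30% of the rows for testing, preserving the class balance.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, stratify=y)
clf = LogisticRegression(random_state=184856).fit(x_train, y_train)
y_pred = clf.predict(x_test)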

## Using the Confusion Matrix

`sklearn.metrics` includes two functions which support confusion matrices:  `confusion_matrix()` and `classification_report()`.

In [None]:
from sklearn.metrics import confusion_matrix, classification_report

The `confusion_matrix()` function simply returns an array.

Note that this array is laid out differently from the one in R! Here, **actual** values are the **rows** and **predicted** values are the **columns**. Also, labels are sorted in ascending order, so the negative class comes first:

| |Pred FALSE|Pred TRUE|
|-|----------|---------|
|**Act FALSE**|TN|FP|
|**Act TRUE**|FN|TP|

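This can be a bit confusing because the matrix is both transposed and reordered relative to what we see in R, where predicted values are the rows, actual values are the columns, and the positive class comes first: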

| |Act TRUE|Act FALSE|
|-|----------|---------|
|**Pred TRUE**|TP|FP|
|**Pred FALSE**|FN|TN|
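
The two layouts can be compared on a small hand-made example. This is a sketch using hypothetical toy labels (`0` for FALSE, `1` for TRUE), not the notebook's data. It unpacks the four cells with `ravel()` and then rearranges the scikit-learn array into the R-style layout shown above.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical toy labels: 0 = FALSE, 1 = TRUE.
y_true_toy = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
y_pred_toy = [0, 0, 0, 0, 1, 0, 0, 1, 1, 1]

# scikit-learn layout: rows are actual, columns are predicted, negative class first.
cm = confusion_matrix(y_true_toy, y_pred_toy)
tn, fp, fn, tp = cm.ravel()
print(cm)
print(tn, fp, fn, tp)

# R-style layout: rows are predicted, columns are actual, positive class first.
# Transpose swaps the axes; reversing both axes puts the positive class first.
r_style = cm.T[::-1, ::-1]
print(r_style)
```

Unpacking with `ravel()` works only for a two-class problem, where the flattened order is TN, FP, FN, TP.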

In [None]:
confusion_matrix(y_test, y_pred)

The `classification_report()` function returns a text summary of per-class precision, recall, F1 score, and support (the number of actual samples in each class).

An important note is that we can translate precision and recall into sensitivity and specificity for a two-class problem like this one. The classification report calculates precision and recall on a per-class basis, treating each class in turn as the class of interest, so:

`False Precision = TN / (TN + FN)`

`False Recall = TN / (TN + FP)`

`True Precision = TP / (FP + TP)`

`True Recall = TP / (FN + TP)`

Translating this back to the terms we used before:

`Sensitivity = TP / (TP + FN) == True Recall`

`Specificity = TN / (TN + FP) == False Recall`

`Positive Predictive Value = TP / (TP + FP) == True Precision`

`Negative Predictive Value = TN / (FN + TN) == False Precision`
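
These identities can be checked numerically. The sketch below uses hypothetical toy labels (`0` for FALSE, `1` for TRUE) rather than the notebook's data. It computes the four measures from the confusion matrix cells and compares them with `precision_score()` and `recall_score()`, using the `pos_label` argument to select which class is treated as the class of interest.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical toy labels: 0 = FALSE, 1 = TRUE.
y_true_toy = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
y_pred_toy = [0, 0, 0, 0, 1, 0, 0, 1, 1, 1]

tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()

# Measures computed directly from the four cells.
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
ppv = tp / (tp + fp)
npv = tn / (tn + fn)

# The same measures expressed as per-class recall and precision.
true_recall = recall_score(y_true_toy, y_pred_toy, pos_label=1)
false_recall = recall_score(y_true_toy, y_pred_toy, pos_label=0)
true_precision = precision_score(y_true_toy, y_pred_toy, pos_label=1)
false_precision = precision_score(y_true_toy, y_pred_toy, pos_label=0)

print(sensitivity, true_recall)
print(specificity, false_recall)
print(ppv, true_precision)
print(npv, false_precision)
```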

In [None]:
print(classification_report(y_test, y_pred))
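
When the values are needed in code rather than as printed text, `classification_report()` also accepts `output_dict=True`. The sketch below again uses hypothetical toy labels (`0` for FALSE, `1` for TRUE); the dictionary keys are the class labels converted to strings.

```python
from sklearn.metrics import classification_report

# Hypothetical toy labels: 0 = FALSE, 1 = TRUE.
y_true_toy = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
y_pred_toy = [0, 0, 0, 0, 1, 0, 0, 1, 1, 1]

report = classification_report(y_true_toy, y_pred_toy, output_dict=True)

# Recall of the TRUE class is sensitivity; recall of the FALSE class is specificity.
report_sensitivity = report["1"]["recall"]
report_specificity = report["0"]["recall"]
print(report_sensitivity, report_specificity)
```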