<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 15px; height: 80px">

# Classification Metrics I

_Authors: Matt Brems (DC), Riley Dallas (AUS)_

---

## Importing libraries
---

We'll need the following libraries for today's lecture:
1. `pandas`
2. `KNeighborsClassifier` from `sklearn`'s `neighbors` module
3. The `load_breast_cancer` function from `sklearn`'s `datasets` module
4. `train_test_split` and `cross_val_score` from `sklearn`'s `model_selection` module
5. `StandardScaler` from `sklearn`'s `preprocessing` module
6. The `confusion_matrix` function from `sklearn`'s `metrics` module

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn import metrics
from sklearn.metrics import confusion_matrix
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_breast_cancer

## Create dataset
---

Similar to `load_iris` from this morning, we'll call the `load_breast_cancer()` function to create our dataset.

In [6]:
cancer = load_breast_cancer()
cancer.keys()

dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename'])

In [7]:
print(cancer.DESCR)

.. _breast_cancer_dataset:

Breast cancer wisconsin (diagnostic) dataset
--------------------------------------------

**Data Set Characteristics:**

    :Number of Instances: 569

    :Number of Attributes: 30 numeric, predictive attributes and the class

    :Attribute Information:
        - radius (mean of distances from center to points on the perimeter)
        - texture (standard deviation of gray-scale values)
        - perimeter
        - area
        - smoothness (local variation in radius lengths)
        - compactness (perimeter^2 / area - 1.0)
        - concavity (severity of concave portions of the contour)
        - concave points (number of concave portions of the contour)
        - symmetry
        - fractal dimension ("coastline approximation" - 1)

        The mean, standard error, and "worst" or largest (mean of the three
        worst/largest values) of these features were computed for each image,
        resulting in 30 features.  For instance, field 0 is Mean Radi

## Create `X` and `y`
---

The dataset labels benign tumors as 1 and malignant tumors as 0. This is contrary to the usual convention, where the class we care most about detecting (malignant) is labeled 1. We flip the labels with `1 - cancer.target` so that malignant becomes the positive class.

In [13]:
X = pd.DataFrame(cancer.data)
y = pd.Series(1 - cancer.target)
print(type(X))
print(type(y))

<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.series.Series'>


In [16]:
y.value_counts(normalize=True)

0    0.627417
1    0.372583
dtype: float64
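
As a quick sanity check on the label flip, the sketch below compares the number of malignant tumors before and after applying `1 - cancer.target`. It assumes `cancer.target_names` is ordered by label value (index 0 names label 0); the variables `malignant_before` and `malignant_after` are introduced here purely for illustration.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()

# target_names is indexed by the original label: position 0 names label 0.
print(cancer.target_names)

# Flip the labels so that malignant becomes the positive class (1).
y = pd.Series(1 - cancer.target)

# Malignant tumors are 0 in the original labels and 1 after the flip,
# so these two counts should agree.
malignant_before = int((cancer.target == 0).sum())
malignant_after = int((y == 1).sum())
print(malignant_before, malignant_after)
```

If the two counts match, the 1s in `y` are the malignant cases, and the share of 1s printed by `value_counts(normalize=True)` is the malignant share.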

## Train/Test Split
---

In the cell below, split your `X` and `y` variables into training and testing sets.

**Note:** we'll want to create a stratified split so that the class proportions are the same in both sets.

In [17]:
# stratify=y keeps the proportion of 0s and 1s the same in train and test, which matters when the classes are imbalanced
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y)
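
To see what `stratify=y` buys us, the sketch below compares the share of malignant cases overall, in the training set, and in the testing set. The `random_state=42` argument is an assumption added here only so the example is reproducible; the lecture's own split does not set it.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
X = pd.DataFrame(cancer.data)
y = pd.Series(1 - cancer.target)  # 1 = malignant, as in the lecture

# random_state is an arbitrary choice made for reproducibility only.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

# Because y holds 0s and 1s, its mean is the share of malignant cases.
overall_share = y.mean()
train_share = y_train.mean()
test_share = y_test.mean()
print(round(overall_share, 3), round(train_share, 3), round(test_share, 3))
```

The three shares should be nearly identical. Without `stratify`, they can drift apart by chance, which makes test metrics harder to compare with training metrics.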


## Scaling our features
---

Because KNN is a distance-based model, we'll want to scale our features. We fit the scaler on the training set only, then use it to transform both the training and testing sets.

In [19]:
ss = StandardScaler()
Z_train = ss.fit_transform(X_train)
Z_test = ss.transform(X_test)
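
The difference between `fit_transform` and `transform` is easier to see on a tiny example. The sketch below uses made-up toy arrays (not the cancer data) with two features on very different scales, and shows that the test row is scaled with the training mean and standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data, invented for illustration: two features on very different scales.
X_train = np.array([[1.0, 100.0],
                    [2.0, 200.0],
                    [3.0, 300.0]])
X_test = np.array([[2.0, 400.0]])

ss = StandardScaler()
Z_train = ss.fit_transform(X_train)  # learns mean and std from the training data
Z_test = ss.transform(X_test)        # reuses the training mean and std

print(Z_train.mean(axis=0))  # each column is centered near 0
print(Z_train.std(axis=0))   # each column has standard deviation near 1
print(Z_test)                # scaled relative to the training data
```

Fitting the scaler on the training set alone keeps information about the test set from leaking into the model.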

## Instantiating and fitting our model
---

In the cells provided, create and fit an instance of `KNeighborsClassifier`. You can use the default parameters.

In [21]:
knn = KNeighborsClassifier()

In [22]:
knn.fit(Z_train, y_train)

KNeighborsClassifier()

## Predictions
---

Use our newly fitted KNN model to create predictions from `Z_test`.

In [23]:
pred = knn.predict(Z_test)

## Confusion Matrix
---

We'll create a confusion matrix using the `confusion_matrix` function from `sklearn`'s `metrics` module.

In [31]:
cm = confusion_matrix(y_test, pred)

In [32]:
cm

array([[88,  2],
       [ 3, 50]])
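
For binary 0/1 labels, `confusion_matrix` puts the actual classes on the rows and the predicted classes on the columns, both in sorted label order. A minimal sketch of pulling out the four cells by name, with the array above typed in by hand so the example runs on its own:

```python
import numpy as np

# The confusion matrix from the output above, typed in by hand.
cm = np.array([[88, 2],
               [3, 50]])

# Row 0 = actual 0, row 1 = actual 1; column 0 = predicted 0, column 1 = predicted 1.
# Flattening row by row therefore gives TN, FP, FN, TP.
tn, fp, fn, tp = cm.ravel()
print(tn, fp, fn, tp)
```

Naming the cells once avoids retyping counts by hand in later calculations.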

## Confusion DataFrame
---

The confusion matrix we just created has no row or column labels, so it's hard to read. Let's drop it into a pandas `DataFrame` with descriptive names.

In [33]:
cm_df = pd.DataFrame(
    cm,
    columns=['Predicted Benign', 'Predicted Malignant'],
    index=['Actual Benign', 'Actual Malignant']
)

In [34]:
cm_df

| | Predicted Benign | Predicted Malignant |
|---|---|---|
| Actual Benign | 88 | 2 |
| Actual Malignant | 3 | 50 |
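
Once the matrix has labels, individual cells can be looked up by name with `.loc`. The sketch below rebuilds `cm_df` from the counts shown above so that it is self-contained; the variable names `false_negatives` and `false_positives` are introduced here for illustration.

```python
import pandas as pd

# Rebuild the labeled confusion matrix from the counts shown above.
cm_df = pd.DataFrame(
    [[88, 2], [3, 50]],
    columns=['Predicted Benign', 'Predicted Malignant'],
    index=['Actual Benign', 'Actual Malignant'],
)

# Look up cells by row label and column label.
false_negatives = cm_df.loc['Actual Malignant', 'Predicted Benign']
false_positives = cm_df.loc['Actual Benign', 'Predicted Malignant']
print(false_negatives, false_positives)
```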


## Calculate recall
---

<details>
    <summary>Need a hint?</summary>
    Recall = Sensitivity, and there are no p's in sensitivity.
</details>

In [38]:
# Recall = True Positive / All Actual Positive 
TP = 50
FN = 3
Recall = TP / (TP + FN)
Recall

0.9433962264150944
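
As a cross-check, `sklearn`'s `metrics` module also provides `recall_score`. The sketch below rebuilds label arrays that match the confusion matrix counts above (the arrays themselves are reconstructed for illustration, not the lecture's actual `y_test` and `pred`) and compares the library result with the hand calculation.

```python
import numpy as np
from sklearn.metrics import recall_score

# Reconstructed labels matching the counts above:
# 90 actual benign (88 predicted benign, 2 predicted malignant),
# 53 actual malignant (3 predicted benign, 50 predicted malignant).
y_true = np.array([0] * 90 + [1] * 53)
y_pred = np.array([0] * 88 + [1] * 2 + [0] * 3 + [1] * 50)

recall_sklearn = recall_score(y_true, y_pred)
recall_manual = 50 / (50 + 3)
print(recall_sklearn, recall_manual)
```

In the notebook itself, `recall_score(y_test, pred)` should give the same number without any counts being typed by hand.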

## How many Type I errors are there?
---

<details>
    <summary>Need a hint?</summary>
    Type I = False positive
</details>

In [35]:
# Type I: false positives (actual benign, predicted malignant)
FP = 2

## How many Type II errors are there?
---

<details>
    <summary>Need a hint?</summary>
    Type II = False negatives
</details>

In [36]:
# Type II: false negatives (actual malignant, predicted benign)
FN = 3

## Poll
---

/poll "Which error is worse" "Type I" "Type II" anonymous limit 1

Answer: Type II is worse here. A false negative means a patient with a malignant tumor is told it is benign, so the cancer goes untreated.

## Calculate the sensitivity
---

<details>
    <summary>Need a hint?</summary>
    There are no p's in sensitivity: TP/P
</details>

In [None]:
# Sensitivity = TP / (TP + FN): true positives over all actual positives

## Calculate the specificity
---

<details>
    <summary>Need a hint?</summary>
    There is a p in specificity, therefore there are no p's in the calculation: TN/N
</details>
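
One possible way to fill in the last two calculations, using the counts from the confusion matrix above (88 true negatives, 2 false positives, 3 false negatives, 50 true positives). This is a sketch with the counts typed in by hand, not the only way to do it.

```python
# Counts read from the confusion matrix above.
TN, FP, FN, TP = 88, 2, 3, 50

# Sensitivity: of all actual positives (malignant), what share did we catch?
sensitivity = TP / (TP + FN)

# Specificity: of all actual negatives (benign), what share did we clear?
specificity = TN / (TN + FP)

print(sensitivity)
print(specificity)
```

Sensitivity is the same quantity as recall, so it should match the recall computed earlier.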