### Codio Activity 12.5: Confusion Matrices and Metrics for Classification

This activity focuses on using confusion matrices to compute different classification metrics.  You will use scikit-learn to generate the confusion matrices and answer questions about the appropriate metric for the given dataset.  

**Expected Time: 60 Minutes**

**Total Points: 50**

#### Index

- [Problem 1](#Problem-1)
- [Problem 2](#Problem-2)
- [Problem 3](#Problem-3)
- [Problem 4](#Problem-4)
- [Problem 5](#Problem-5)
- [Problem 6](#Problem-6)

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings

from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, precision_score, recall_score, accuracy_score, plot_confusion_matrix, ConfusionMatrixDisplay
from sklearn.datasets import load_breast_cancer
from sklearn.pipeline import Pipeline
from sklearn import set_config

warnings.simplefilter(action='ignore', category=FutureWarning)
set_config(display="diagram")

### The Data

For this exercise you will explore two different scenarios.  The first is a built-in dataset from scikit-learn related to breast cancer tumors.  The second is a dataset representing telecommunications customer data and retention.  

**Cancer Description**

```
This is a copy of UCI ML Breast Cancer Wisconsin (Diagnostic) datasets.
https://goo.gl/U2Uwz2

Features are computed from a digitized image of a fine needle
aspirate (FNA) of a breast mass.  They describe
characteristics of the cell nuclei present in the image.
```

**Telecommunications Churn Data**

```
This data set contains information on a communications company's customers.  The target feature is whether or not the customer abandoned their subscription, or "churned".  The features primarily represent information about the customer's usage.
```

In [None]:
cancer = load_breast_cancer(as_frame=True)
cancer_df = cancer.frame

In [None]:
cancer_df.head()

In [None]:
cancer_df.info()

In [None]:
churn = pd.read_csv('data/cell_phone_churn.csv').select_dtypes(['float', 'int', 'bool'])
churn.head()

In [None]:
churn.info()

In [None]:
churn_x, churn_y = churn.drop('churn', axis = 1), churn.churn
churn_x_train, churn_x_test, churn_y_train, churn_y_test = train_test_split(churn_x, churn_y, random_state = 42)

In [None]:
cancer_x, cancer_y = cancer.data, cancer.target
cancer_x_train, cancer_x_test, cancer_y_train, cancer_y_test = train_test_split(cancer_x, cancer_y, random_state=42)

[Back to top](#-Index)

### Problem 1

#### Pipeline for cancer data

**10 Points**

As before, you want to scale your data prior to building the model.  Because the cancer dataset contains only numeric features, you can simply pass all features through the `StandardScaler`.  Below, construct a pipeline named `cancer_pipeline` with named steps `scale` and `knn`.  Leave all settings of the `KNeighborsClassifier` at their defaults.

Next, use the `fit` function on `cancer_pipeline` to train the pipeline on the training data.

Finally, use the `predict` function to make predictions on the test data.  Assign these as an array to `cancer_preds` below. 
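
The fit-then-predict pattern described above can be sketched on synthetic data. This is a minimal illustration, not the graded solution: the data comes from `make_classification`, and the names `demo_pipeline` and `demo_preds` are placeholders chosen for this example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption for illustration, not the cancer dataset).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each step is a (name, estimator) tuple; the names become keys in named_steps.
demo_pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsClassifier()),  # default settings
])

# The scaler is fit on the training data only, then applied before KNN.
demo_pipeline.fit(X_train, y_train)
demo_preds = demo_pipeline.predict(X_test)
print(demo_preds[:5])
```

Wrapping the scaler in the pipeline means the test data is scaled with statistics learned from the training split, which avoids leaking test information into the model.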

In [None]:
### GRADED

cancer_pipeline = ''
cancer_preds = ''

# YOUR CODE HERE
raise NotImplementedError()

# Answer check
print(cancer_preds[:5])
cancer_pipeline

[Back to top](#-Index)

### Problem 2

#### Confusion matrix for cancer data

**10 Points**

Use the `confusion_matrix` function with arguments `cancer_y_test` and `cancer_preds`, and with `labels` equal to `[1, 0]`, to compute the confusion matrix of your predictions. Assign the result to `cancer_confusion_mat`.

Next, use the `ConfusionMatrixDisplay` class to visualize your confusion matrix on the test data.  Note that in the cancer data a 1 means benign and a 0 means malignant.  Use these by setting `display_labels = ['benign', 'malignant']`.  Assign your result to the object `dist` below.
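
The effect of the `labels` argument is easiest to see on a small hand-made example. The sketch below uses made-up label arrays (not the cancer predictions) to show how `labels=[1, 0]` reorders the rows and columns, and how the display labels line up with that order.

```python
import numpy as np
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix

# Made-up labels for illustration: 1 = benign, 0 = malignant.
y_true = np.array([1, 1, 1, 0, 0, 1, 0, 1])
y_pred = np.array([1, 0, 0, 0, 1, 1, 0, 1])

# Default ordering sorts the labels, so row/column 0 is class 0.
default_cm = confusion_matrix(y_true, y_pred)
# With labels=[1, 0], row/column 0 is class 1 instead.
flipped_cm = confusion_matrix(y_true, y_pred, labels=[1, 0])

print(default_cm)  # [[2 1], [2 3]]
print(flipped_cm)  # [[3 2], [1 2]]

# display_labels must follow the same order as labels=[1, 0].
disp = ConfusionMatrixDisplay(confusion_matrix=flipped_cm,
                              display_labels=["benign", "malignant"])
# Calling disp.plot() would draw the matrix; it requires matplotlib.
```

Rows are true classes and columns are predicted classes, so the same counts appear in both matrices, only in a different position.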

In [None]:
### GRADED

cancer_confusion_mat = ''

# YOUR CODE HERE
raise NotImplementedError()

# Answer check
print(type(cancer_confusion_mat))

[Back to top](#-Index)

### Problem 3

#### Which Errors are worse?

**5 Points**

In this problem, which of the errors would you care more about avoiding: tumors that are classified as malignant but are actually benign, or tumors that are classified as benign but are actually malignant?  Consider this from the doctor's point of view.  Assign your answer as the string `false positive` (classified malignant but benign) or `false negative` (classified as benign but malignant) to `ans3` below.
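
To count each kind of error directly, a confusion matrix can be unpacked into its four cells. The sketch below uses made-up labels and treats malignant (0) as the class of interest, which matches how the question above defines false positives and false negatives; this is an illustrative convention, since scikit-learn's own default treats 1 as the positive class.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration: 1 = benign, 0 = malignant.
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 1, 0, 0]

# Passing labels=[negative, positive] = [1, 0] makes ravel() return
# the cells in the usual (tn, fp, fn, tp) order for "malignant" as positive.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[1, 0]).ravel()

print("benign classified as malignant (false positives):", fp)  # 2
print("malignant classified as benign (false negatives):", fn)  # 1
```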

In [None]:
### GRADED

ans3 = ''

# YOUR CODE HERE
raise NotImplementedError()

# Answer check
print(ans3)

[Back to top](#-Index)

### Problem 4

#### Adjusting the Decision Boundary

**10 Points**

Consider improving the recall score.  By adjusting the decision boundary, that is, the probability threshold for predicting class 1, you can alter the recall.  Below, a new model is fit with `n_neighbors = 10`, and predictions are made with a lower threshold, a higher threshold, and the default `predict` behavior.  Use these to decide which is best at avoiding misclassifying malignant tumors as benign.  Select the choice 'a', 'b', or 'c' based on the confusion matrices below and assign it to `best_knn` below.

In [None]:
knn_ex = Pipeline([('scale', StandardScaler()), ('knn', KNeighborsClassifier(n_neighbors = 10))])
knn_ex.fit(cancer_x_train, cancer_y_train)
low_preds = np.where(knn_ex.predict_proba(cancer_x_test)[:, 1] > .25, 1, 0)
high_preds = np.where(knn_ex.predict_proba(cancer_x_test)[:, 1] > .95, 1, 0)
mid_preds = knn_ex.predict(cancer_x_test)

![](images/three_knn.png)
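
The mechanics of moving the threshold can be sketched on a hand-made probability vector. The arrays `probs` and `y_true` below are invented for illustration and are unrelated to the cancer model; the thresholds mirror the ones in the cell above.

```python
import numpy as np

# Hypothetical class-1 probabilities and true labels.
probs = np.array([0.1, 0.3, 0.5, 0.6, 0.9, 1.0])
y_true = np.array([0, 0, 1, 0, 1, 1])

results = {}
for threshold in (0.25, 0.5, 0.95):
    # Predict class 1 only when the probability exceeds the threshold.
    preds = np.where(probs > threshold, 1, 0)
    zeros_called_one = int(np.sum((y_true == 0) & (preds == 1)))
    ones_called_zero = int(np.sum((y_true == 1) & (preds == 0)))
    results[threshold] = (zeros_called_one, ones_called_zero)

print(results)  # {0.25: (2, 0), 0.5: (1, 1), 0.95: (0, 2)}
```

Changing the threshold trades one kind of error for the other, so the right choice depends on which error is more costly in the problem at hand.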

In [None]:
### GRADED

best_knn = ''

# YOUR CODE HERE
raise NotImplementedError()

# Answer check
print(best_knn)

[Back to top](#-Index)

### Problem 5

#### Cell Phone Churn 

**10 Points**

In the example of the cell phone churn data, consider the problem of investing in customer incentives.  Here, you'd prefer to target customers who will likely churn.  


Below, use the given training data, `churn_x_train` and `churn_y_train`, to build a pipeline named `churn_pipe` with named steps `scale` and `knn` that use `StandardScaler` and `KNeighborsClassifier` with `n_neighbors = 10`.

Next, use the `fit` function on `churn_pipe` to train the pipeline on the training data.

Finally, visualize the confusion matrix on the test data with `ConfusionMatrixDisplay.from_estimator`, passing `churn_pipe`, `churn_x_test`, and `churn_y_test` (or pass your own predictions to `ConfusionMatrixDisplay.from_predictions`).  The older `plot_confusion_matrix` function takes the same three arguments but was removed in scikit-learn 1.2.  Assign the result to `churn_confusion_mat`.
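
When reading the resulting confusion matrix, keep in mind that accuracy alone can hide the errors that matter here. The sketch below assumes churners are a minority, which is common in retention data but should be checked with `churn_y.value_counts()`; the 20 labels are invented for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# 20 hypothetical customers, 3 of whom churn (1 = churn).
y_true = np.array([0] * 17 + [1] * 3)

# A useless "model" that predicts nobody ever churns.
never_churn = np.zeros_like(y_true)

acc = accuracy_score(y_true, never_churn)  # 17 of 20 correct
rec = recall_score(y_true, never_churn)    # 0 of 3 churners found
print(acc, rec)  # 0.85 0.0
```

A high accuracy with zero recall would be of no help for targeting incentives, which is why the confusion matrix is worth inspecting cell by cell.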

In [None]:
### GRADED

churn_pipe = ''

# YOUR CODE HERE
raise NotImplementedError()

# Answer check
print(churn_confusion_mat)

[Back to top](#-Index)

### Problem 6

#### Adjusting the Decision Boundary

**5 Points**

Below, create predictions for the positive class using probability thresholds of 30% and 80%, predicting churn when the predicted probability of churn is greater than the threshold.  Compare these to your baseline predictions at 50% and identify which threshold minimizes the number of churning customers predicted as not churning.  Assign your answer as an integer to `ans6` below: 30, 50, or 80.
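
One way to organize this comparison is to compute the positive-class probabilities once and loop over the thresholds. The sketch below runs on synthetic, imbalanced data from `make_classification` rather than the churn data, and the names `pipe`, `pos_probs`, and `missed` are placeholders for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in data (roughly 15% positives); an assumption.
X, y = make_classification(n_samples=600, n_features=6, weights=[0.85],
                           random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

pipe = Pipeline([("scale", StandardScaler()),
                 ("knn", KNeighborsClassifier(n_neighbors=10))])
pipe.fit(X_train, y_train)

# Probability of the positive class, computed once and reused per threshold.
pos_probs = pipe.predict_proba(X_test)[:, 1]

missed = {}
for threshold in (0.3, 0.5, 0.8):
    preds = np.where(pos_probs > threshold, 1, 0)
    # labels=[0, 1] keeps the matrix 2x2 even if one class is never predicted.
    tn, fp, fn, tp = confusion_matrix(y_test, preds, labels=[0, 1]).ravel()
    missed[threshold] = int(fn)  # positives predicted as negative

print(missed)
```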

In [None]:
### GRADED

ans6 = ''

# YOUR CODE HERE
raise NotImplementedError()

# Answer check
print(ans6)

Remembering the definitions of precision, recall, and accuracy is important, but precision and recall also depend on which class is considered positive.  Examining your confusion matrices can help you build intuition for which metric is best for a specific scenario.
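
The dependence on the positive class can be checked with the `pos_label` argument of `precision_score` and `recall_score`. The labels below are made up for illustration; accuracy stays the same while precision and recall change with the choice of positive class.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Made-up labels for illustration.
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 0, 0, 0, 1, 1, 0, 1]

# Accuracy counts all correct predictions, so it ignores the positive class.
acc = accuracy_score(y_true, y_pred)

scores = {}
for positive in (1, 0):
    p = precision_score(y_true, y_pred, pos_label=positive)
    r = recall_score(y_true, y_pred, pos_label=positive)
    scores[positive] = (p, r)

print(acc)     # 0.625
print(scores)  # class 1: precision 0.75, recall 0.6; class 0: precision 0.5, recall 2/3
```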