# Prep Notebook 8
## Classifying Cancer II

This notebook contains a few exercises that serve as a primer for the `Problem Session 8` notebook. They touch on the basic python, `pandas`, `numpy`, `matplotlib` and supervised learning techniques that you may want a refresher on before starting `Problem Session 8`.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style("whitegrid")

##### 1. Review problem session 7

`Problem Session 8` builds on what you did in `Problem Session 7`. Review your group's work as well as the complete version.

##### 2. Scaling

Generate the data `X` below, then run it through `StandardScaler`.

In [2]:
np.random.seed(424)
X = np.zeros((100,5))
X[:,0] = np.random.randn(100)
X[:,1] = 100*np.random.randn(100) + 1000
X[:,2] = 20000*np.random.random(size = 100) - 10000
X[:,3] = np.random.binomial(20, .7, 100)
X[:,4] = 10*np.random.randn(100)

##### Sample Solution

In [3]:
from sklearn.preprocessing import StandardScaler

In [4]:
scale = StandardScaler()

scale.fit(X)

scale.transform(X)

array([[ 0.08808013, -0.58218048,  0.10703008, -0.57716949,  1.58686681],
       [ 1.61690844, -0.79012591, -1.37409025, -0.06640003, -0.85801198],
       [ 0.3682413 , -0.59549129, -0.23325024,  0.9551389 , -0.26282396],
       [-1.50963622,  1.23816246,  0.82524803, -1.59870842,  0.7276778 ],
       [-0.26571315,  0.71939423, -0.5035581 , -1.59870842,  1.76619921],
       [ 1.86370215, -1.01141336,  1.18042845, -0.57716949,  0.22069193],
       [ 0.36329997,  2.14238437,  0.60120612, -1.08793896, -0.8609386 ],
       [-0.87614389,  1.99066405,  0.86832655,  0.9551389 ,  1.27845529],
       [-1.15790052,  0.1812438 , -0.98999177,  0.9551389 , -0.98390195],
       [-1.62400777,  0.71046211,  1.15435651,  0.44436943, -1.19634344],
       [ 0.21361793, -0.48530226,  0.811735  ,  1.46590836, -1.22042398],
       [-0.80016263,  0.99831612,  1.00429317,  0.9551389 , -0.20317943],
       [-1.11722221,  0.55782889,  1.13893221, -1.08793896,  0.23220977],
       [ 0.4584538 , -1.39890746, -0.9
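
As a quick sanity check on what the scaler does, the sketch below standardizes the same `X` by hand with `numpy` only. It assumes the usual definition, $z = (x - \mu)/\sigma$ computed column by column, where $\sigma$ is the population standard deviation (`ddof=0`), which is what `StandardScaler` uses by default. The names `col_means`, `col_stds` and `X_manual` are illustrative and not part of the original notebook.

```python
import numpy as np

# Regenerate X with the same recipe and seed as the cell above
np.random.seed(424)
X = np.zeros((100, 5))
X[:, 0] = np.random.randn(100)
X[:, 1] = 100 * np.random.randn(100) + 1000
X[:, 2] = 20000 * np.random.random(size=100) - 10000
X[:, 3] = np.random.binomial(20, .7, 100)
X[:, 4] = 10 * np.random.randn(100)

# Standardize by hand: subtract each column's mean,
# then divide by its population standard deviation (ddof=0)
col_means = X.mean(axis=0)
col_stds = X.std(axis=0)
X_manual = (X - col_means) / col_stds

# Every column should now have mean close to 0 and standard deviation close to 1
print(np.round(X_manual.mean(axis=0), 6))
print(np.round(X_manual.std(axis=0), 6))
```

After scaling, the five columns are on a comparable footing even though they started on very different scales, which matters for distance-based models such as $k$NN.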

##### 3. Pipeline

Generate the data `y` below, then build a pipeline to:
- Scale `X` and
- Build a $k$NN model to predict `y` with `X` and $k=5$.

In [5]:
y = np.random.binomial(1, .6, 100)

##### Sample Solution

In [6]:
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier

In [7]:
pipe = Pipeline([('scale', StandardScaler()),
                 ('knn', KNeighborsClassifier(n_neighbors=5))])

pipe.fit(X, y)

Pipeline(steps=[('scale', StandardScaler()), ('knn', KNeighborsClassifier())])
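
To see what the pipeline is doing under the hood, here is a minimal sketch that compares a fitted pipeline against the same two steps carried out by hand. It uses hypothetical stand-in data (`X_demo`, `y_demo`) rather than the notebook's `X` and `y`, and the names `pipe_demo`, `scaler` and `knn` are illustrative.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in data with columns on very different scales
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 5)) * np.array([1, 100, 10000, 3, 10])
y_demo = rng.binomial(1, 0.6, size=100)

# Pipeline: scale, then fit kNN with k = 5
pipe_demo = Pipeline([("scale", StandardScaler()),
                      ("knn", KNeighborsClassifier(n_neighbors=5))])
pipe_demo.fit(X_demo, y_demo)

# The same two steps done manually
scaler = StandardScaler().fit(X_demo)
knn = KNeighborsClassifier(n_neighbors=5).fit(scaler.transform(X_demo), y_demo)

# Both routes should give identical predictions
same = np.array_equal(pipe_demo.predict(X_demo),
                      knn.predict(scaler.transform(X_demo)))
print(same)
```

The pipeline version is less error-prone because the scaler is always fit and applied together with the model, so there is no chance of forgetting to transform new data before predicting.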

##### 4. Classifier metrics

Calculate the true positive rate (TPR), false positive rate (FPR) and precision of the pipeline you fit above, using its predictions on `X`.

##### Sample Solution

In [8]:
from sklearn.metrics import confusion_matrix, precision_score

In [9]:
precision_score(y, pipe.predict(X))

0.7358490566037735

In [10]:
conf_mat = confusion_matrix(y, pipe.predict(X))

## TPR = TP/(FN + TP); rows of conf_mat are true labels, columns are predictions
conf_mat[1,1]/(conf_mat[1,0] + conf_mat[1,1])

0.7222222222222222

In [11]:
## FPR = FP/(TN + FP)
conf_mat[0,1]/(conf_mat[0,0] + conf_mat[0,1])

0.30434782608695654
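
For reference, the three metrics can also be computed directly from the four confusion-matrix counts: TPR $= TP/(TP+FN)$, FPR $= FP/(FP+TN)$ and precision $= TP/(TP+FP)$. The sketch below does this with `numpy` only on a small hypothetical example (`y_true` and `y_pred` are made up for illustration and are not the notebook's data).

```python
import numpy as np

# Hypothetical labels and predictions: 4 TP, 1 FN, 1 FP, 4 TN
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 1, 0])
y_pred = np.array([1, 1, 0, 1, 0, 1, 0, 0, 1, 0])

# Count each cell of the confusion matrix
tp = np.sum((y_true == 1) & (y_pred == 1))  # true positives
fn = np.sum((y_true == 1) & (y_pred == 0))  # false negatives
fp = np.sum((y_true == 0) & (y_pred == 1))  # false positives
tn = np.sum((y_true == 0) & (y_pred == 0))  # true negatives

tpr = tp / (tp + fn)        # share of actual positives that were caught
fpr = fp / (fp + tn)        # share of actual negatives flagged as positive
precision = tp / (tp + fp)  # share of positive predictions that were correct

print(tpr, fpr, precision)
```

In terms of `sklearn`'s `confusion_matrix`, whose rows are true labels and columns are predicted labels, these counts correspond to `conf_mat[1,1]` (TP), `conf_mat[1,0]` (FN), `conf_mat[0,1]` (FP) and `conf_mat[0,0]` (TN).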

--------------------------

This notebook was written for the Erdős Institute Cőde Data Science Boot Camp by Matthew Osborne, Ph.D., 2022.

Any potential redistributors must seek and receive permission from Matthew Tyler Osborne, Ph.D. prior to redistribution. Redistribution of the material contained in this repository is conditional on acknowledgement of Matthew Tyler Osborne, Ph.D.'s original authorship and sponsorship of the Erdős Institute as subject to the license (see License.md).