# HW3 Notebook 


Welcome to your HW3 notebook!

## Notebook Setup

In [1]:
# imports
import numpy as np
import matplotlib.pyplot as plt
colors = plt.rcParams["axes.prop_cycle"].by_key()["color"]
import seaborn as sns
import pandas as pd

# 3d figures
import matplotlib as mpl
from mpl_toolkits.mplot3d import Axes3D
from matplotlib import cm

# creating animations
import matplotlib.animation
from IPython.display import HTML

# styling additions (HTML is already imported above)
style = '''
    <style>
        div.info{
            padding: 15px; 
            border: 1px solid transparent; 
            border-left: 5px solid #dfb5b4; 
            border-color: transparent; 
            margin-bottom: 10px; 
            border-radius: 4px; 
            background-color: #fcf8e3; 
            border-color: #faebcc;
        }
        hr{
            border: 1px solid;
            border-radius: 5px;
        }
    </style>'''
HTML(style)

## Assignment Outline

<div class='alert alert-block alert-danger'>
    
<font size='5'>🗓</font> **Due: This assignment is due SUNDAY 11/13/22 at 11:59 PM.**

</div>


For this assignment you must turn in **4 things**:
- Your completed [SKLearn: feature scaling and categorical](../lectures/SKLearn-Feature-Scaling-Categorical.ipynb) notebook.
- Your completed [SVM](../lectures/SVM.ipynb) notebook
- Your completed [Bias and Variance](../lectures/Bias-and-Variance.ipynb) notebook
- **This** notebook 

For **each** thing, you will turn in:
- the original `.ipynb`
- a PDF of the file

So in total, you will turn in **8 files**.


---

# Problem 0 - Lecture Notebooks

For this problem, go to your
- [SKLearn: feature scaling and categorical](../lectures/SKLearn-Feature-Scaling-Categorical.ipynb) notebook.
- [SVM](../lectures/SVM.ipynb) notebook 
- [Bias and Variance](../lectures/Bias-and-Variance.ipynb) notebook

and **complete all**:
- ```# EDIT HERE```
- Pause-and-Ponders

<br/>

<div class='info'>

<font size='5'>☝🏽</font> **Note:**  Remember what I said in class about what I am looking for on these! 

</div>

---

# Problem 1 - SVM

<br/>
<div class='info'>

<font size='5'>☝🏽</font> **Note:**  We will expand upon the "in-class" exercise problems at the end of the SVM notebook. If you've already done them in that notebook, feel free to just copy/paste most of your answers into this problem.

</div>

In [None]:
import plotly.express as px
df = px.data.iris()
fig = px.scatter_3d(df, x="sepal_width", y="sepal_length", z='petal_width',
                    color="species", template="simple_white")
fig.update_traces(marker={'size': 4})

In this problem we're going to experiment with SVMs on the iris dataset! 

In the SVM notebook we did this with `blobs` generated by `scikit-learn`. Here, pick two classes from `setosa`, `virginica`, and `versicolor` (**Note: the data frame `df` is defined above**) and pick **two features** from `sepal length`, `sepal width`, `petal length`, and `petal width`.

And use an SVM to tell them apart! 

<br/>
<div class='info'>

<font size='5'>🤔</font> **Pause-and-Ponder:**  Experiment with the `C` (regularization) parameter, and the `kernel` parameter! What do you notice? Does it match what we saw previously? Comment below! 

</div>
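
If you are unsure how to get started, here is a minimal sketch of one possible setup, not the expected answer. It assumes the `load_iris` arrays from scikit-learn rather than the plotly `df` above, and the choice of classes (versicolor vs. virginica), features (petal length and petal width), and the `C` grid are all arbitrary picks for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.svm import SVC

iris_demo = load_iris()

# Keep two classes only: 1 = versicolor, 2 = virginica (arbitrary choice)
mask = iris_demo.target != 0
# Keep two features only: columns 2 and 3 are petal length and petal width
X_pair = iris_demo.data[mask][:, [2, 3]]
y_pair = iris_demo.target[mask]

# Sweep a few kernels and C values; record training accuracy and
# the total number of support vectors for each combination
results = {}
for kernel in ("linear", "rbf"):
    for C in (0.01, 1, 100):
        model = SVC(kernel=kernel, C=C).fit(X_pair, y_pair)
        results[(kernel, C)] = (
            model.score(X_pair, y_pair),
            int(model.n_support_.sum()),
        )

for (kernel, C), (acc, n_sv) in results.items():
    print(f"kernel={kernel:6s} C={C:<6} train acc={acc:.2f} support vectors={n_sv}")
```

The support-vector count is a handy number to watch while you vary `C`: it tells you how many points the margin currently depends on. Note that the accuracies printed here are training accuracies only.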

---

---

# Problem 2 - Over/Under fitting and Bias/Variance

For this problem, carefully go through the [Bias and Variance notebook](../lectures/Bias-and-Variance.ipynb) and answer the Pause-and-Ponder at the end of that notebook here:

<br/>
<div class='info'>

<font size='5'>🤔</font> **Question**  Play around with the `numPoints` and `poly_order` parameters and re-run the experiment. What do you observe? Write down your observations below! 

</div>

**Remember**: We did this in class and had a long discussion! I want your discussion here to reflect that! 

**Note:** feel free to copy/paste any code and/or images from that notebook in here to help you in your explanation! 
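
As a starting point for your discussion, here is a rough sketch of the kind of experiment you could re-run. It is not the code from the lecture notebook: the sine target, the noise level, and the helper `run_experiment` are assumptions made up for illustration. Only the parameter names `numPoints` and `poly_order` come from the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_fn(x):
    # Hypothetical "ground truth" curve, chosen only for this sketch
    return np.sin(2 * np.pi * x)

def run_experiment(numPoints, poly_order, n_trials=200, noise=0.3):
    """Fit a polynomial to many noisy datasets; estimate bias^2 and variance."""
    x_test = np.linspace(0.1, 0.9, 50)
    preds = np.empty((n_trials, x_test.size))
    for t in range(n_trials):
        # Draw a fresh noisy training set on every trial
        x = rng.uniform(0, 1, numPoints)
        y = true_fn(x) + rng.normal(0, noise, numPoints)
        coeffs = np.polyfit(x, y, poly_order)
        preds[t] = np.polyval(coeffs, x_test)
    mean_pred = preds.mean(axis=0)
    bias_sq = np.mean((mean_pred - true_fn(x_test)) ** 2)
    variance = np.mean(preds.var(axis=0))
    return bias_sq, variance

results = {order: run_experiment(numPoints=20, poly_order=order) for order in (1, 3, 9)}
for order, (bias_sq, variance) in results.items():
    print(f"poly_order={order}: bias^2={bias_sq:.3f}, variance={variance:.3f}")
```

Try changing `numPoints` as well as the orders, and relate what the two columns do to under- and over-fitting.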

---

---

# Problem 3 - Cross-Validation + ROC

In this problem, let's load in the iris dataset and make it a bit more challenging by adding some "noisy features":

In [None]:
from sklearn import svm, datasets
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import confusion_matrix, classification_report, RocCurveDisplay, ConfusionMatrixDisplay, auc

# Import some data to play with
iris = datasets.load_iris()
X = iris.data
y = iris.target
target_names = iris.target_names[:2]

X, y = X[y != 2], y[y != 2]
n_samples, n_features = X.shape

# Add 'noisy' features to make problem harder
random_state = np.random.RandomState(0)
X = np.c_[X, random_state.randn(n_samples, 200 * n_features)]

Note, in this example, we only have 50 samples per class!

Now set up an SVM classifier with a linear kernel:

In [None]:
classifier = # EDIT HERE

Let's see how this classifier does on this dataset (fit and score it on the entire dataset):

In [None]:
# EDIT HERE

Now let's see its confusion matrix!

In [None]:
cm = confusion_matrix(y, classifier.predict(X))
cm

We can do better than that! Let's make use of the helper `ConfusionMatrixDisplay` class:

In [None]:
# display the confusion matrix as a plot
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=target_names)
disp.plot();

Much better! 

Now let's see what's going on with our ROC curve, using the helpful `RocCurveDisplay` class:

In [None]:
RocCurveDisplay.from_estimator(classifier,X,y,name='ROC',lw=1);

<div class='info'>

<font size='5'>🤔</font> **Pause-and-Ponder:**  What do you observe in these two plots? Is this good? Bad? Trustworthy? Why or why not? Comment below! 

</div>

---

---

Hm. After some thought, you might have come to the conclusion that we cannot trust the performance above! The classifier was scored on the very same data it was fit on, so this is a heavily biased (optimistic) estimate! 

We need to actually get a **cross-validated** estimate of our model's performance! 
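
To see how misleading a score on the training data can be, here is a small sketch on purely synthetic data. The arrays `X_noise` and `y_random` are made up for illustration: the features are random noise with no relationship to the labels, so an honest estimate should sit near chance (0.5).

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

rng = np.random.RandomState(0)

# 100 samples, 500 pure-noise features, labels unrelated to the features
X_noise = rng.randn(100, 500)
y_random = np.repeat([0, 1], 50)

# Score on the same data the model was fit on (optimistic)
clf = SVC(kernel="linear").fit(X_noise, y_random)
train_acc = clf.score(X_noise, y_random)

# Score only on held-out folds (honest)
cv_scores = cross_val_score(
    SVC(kernel="linear"), X_noise, y_random, cv=StratifiedKFold(n_splits=5)
)
cv_acc = cv_scores.mean()

print(f"accuracy on training data: {train_acc:.2f}")
print(f"cross-validated accuracy:  {cv_acc:.2f}")
```

With many more features than samples, a linear model can memorize the training set even when there is nothing to learn, which is why the two numbers disagree.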


Now, let's actually set up a cross-validated problem! To do this, we will use the `StratifiedKFold` class from `sklearn.model_selection`.

Before setting up the full problem, let's look at a simple example:

In [None]:
cv = StratifiedKFold(n_splits=2)
cv.split(X,y)

Ah - this generator object must be consumed through iteration! 

In [None]:
for fold_num, (train_idx, test_idx) in enumerate(cv.split(X,y)):
    print('=====================================================')
    print(f"CV fold #{fold_num+1}")
    print('=====================================================')
    print(f'train indices: {train_idx}')
    print()
    print(f'test indices: {test_idx}')

Ah! So this makes it easy to split my dataset and automatically gives me the **indices** to use on each of these folds!
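
The "stratified" part means each fold keeps roughly the same class proportions as the full dataset. Here is a tiny sketch on a made-up, imbalanced toy array (`X_demo` and `y_demo` are illustrative names, not part of the assignment) that counts the classes in every test fold:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy data: 12 samples, 8 of class 0 and 4 of class 1 (a 2:1 ratio)
X_demo = np.arange(24).reshape(12, 2)
y_demo = np.array([0] * 8 + [1] * 4)

cv_demo = StratifiedKFold(n_splits=4)

fold_counts = []
for fold_num, (train_idx, test_idx) in enumerate(cv_demo.split(X_demo, y_demo)):
    # Count how many samples of each class landed in this test fold
    counts = np.bincount(y_demo[test_idx], minlength=2)
    fold_counts.append(counts.tolist())
    print(f"fold {fold_num + 1}: test indices={test_idx}, class counts={counts}")
```

Every test fold keeps the 2:1 ratio of the full toy dataset, and every sample appears in exactly one test fold.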

Now we're ready to set this up for real! 

In [None]:
################################################
#                Setup figures
################################################
# NOTE: my default code has this working for 6 folds giving a 2x3 figure
fig_cm, ax_cm = plt.subplots(2,3,figsize=(10,8))

# setup ROC figure
fig, ax = plt.subplots(figsize=(10,8))
ax.set(
    xlim=[-0.05, 1.05],
    ylim=[-0.05, 1.05],
    title="Receiver Operating Characteristic",
)

# setup mean curve variables
tprs = []
aucs = []
mean_fpr = np.linspace(0, 1, 100)

# plot "no-skill"/"guessing" classifier
ax.plot([0, 1], [0, 1], linestyle="--", lw=2, color="r", label="Chance", alpha=0.8)

################################################
#               Cross-Validation
################################################
cv = StratifiedKFold(n_splits=6)

# actually perform the CV and loop through each fold
for # EDIT HERE
    print('=====================================================')
    print(f"CV fold #{fold_num+1}")
    print('=====================================================')
    
    # fit classifier **on this fold's training data**
    # EDIT HERE
    
    # predict on this fold's "testing" set
    y_pred = # EDIT HERE

    # get this fold's confusion matrix
    cm = confusion_matrix(# EDIT HERE, # EDIT HERE)
    
    # display it on the subplot figure
    disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=target_names)
    disp.plot(ax=ax_cm.flatten()[fold_num])
    ax_cm.flatten()[fold_num].set(title=f'CM for fold {fold_num+1}')

    # print this fold's classification report
    print('Classification report:')
    print(classification_report(# EDIT HERE, # EDIT HERE, target_names=target_names))

    # build and display ROC curve for this fold
    viz = RocCurveDisplay.from_estimator(
        # EDIT HERE,
        # EDIT HERE,
        # EDIT HERE,
        name=f"ROC fold {fold_num+1}",
        alpha=0.4,
        lw=1,
        ax=ax,
    )

    # store information for the mean curve
    interp_tpr = np.interp(mean_fpr, viz.fpr, viz.tpr)
    interp_tpr[0] = 0.0
    tprs.append(interp_tpr)
    aucs.append(viz.roc_auc)

# now actually plot MEAN ROC curve
mean_tpr = np.mean(tprs, axis=0)
mean_tpr[-1] = 1.0
mean_auc = auc(mean_fpr, mean_tpr)
std_auc = np.std(aucs)
ax.plot(
    mean_fpr,
    mean_tpr,
    color="b",
    label=f"Mean ROC (AUC = {mean_auc:0.2f} ± {std_auc:0.2f})",
    lw=2,
    alpha=0.8,
)

ax.legend(loc="lower right");

<br/>
<div class='info'>

<font size='5'>🤔</font> **Pause-and-Ponder:**  What do you notice here? How does this compare to what we observed above? Have your conclusions changed? Why or why not? Comment below! 

</div>
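
If the "mean curve" bookkeeping in the cell above feels opaque, here is a stripped-down sketch of just that step. Each fold's ROC curve has its own set of FPR values, so the curves are first interpolated onto a shared grid (`mean_fpr`) before averaging. The scores below are synthetic stand-ins for classifier outputs, generated only for illustration.

```python
import numpy as np
from sklearn.metrics import auc, roc_curve

rng = np.random.default_rng(0)
mean_fpr = np.linspace(0, 1, 100)  # shared FPR grid for all folds

tprs = []
for fold in range(3):
    # Synthetic "fold": 20 negatives and 20 positives with overlapping scores
    y_true = np.array([0] * 20 + [1] * 20)
    scores = np.r_[rng.normal(0.0, 1.0, 20), rng.normal(1.5, 1.0, 20)]

    fpr, tpr, _ = roc_curve(y_true, scores)

    # Resample this fold's TPR onto the shared grid
    interp_tpr = np.interp(mean_fpr, fpr, tpr)
    interp_tpr[0] = 0.0  # force the curve to start at (0, 0)
    tprs.append(interp_tpr)

# Average point-by-point across folds, then force the end point to (1, 1)
mean_tpr = np.mean(tprs, axis=0)
mean_tpr[-1] = 1.0
mean_auc = auc(mean_fpr, mean_tpr)
print(f"mean AUC over folds: {mean_auc:.2f}")
```

Without the interpolation, the per-fold `tpr` arrays would generally have different lengths and could not be averaged element-wise.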

---

---

# Problem 4 - Experiment!

For this problem, you're going to use the **breast-cancer dataset** built into scikit learn:

In [None]:
from sklearn.datasets import load_breast_cancer
X,y = load_breast_cancer(return_X_y=True)
bc_df = load_breast_cancer(as_frame=True).data
X.shape,y.shape

In [None]:
bc_df

<br/>
<div class='info'>

<font size='5'>☝🏽</font> **Note:**  Above I'm showing you a few ways to get this data in and interact with it. Pick whichever one works for you!

</div>

For this problem:
- run **three** different classifiers of your own choosing on this dataset to try and predict the label
- for **each** classifier you picked, report all the results we produced in Problem 3 (confusion matrices, classification reports, and cross-validated ROC curves)
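
One possible way to organize the comparison is sketched below. This is only a scaffold under several assumptions: the three models, the `StandardScaler` pipelines, the 6-fold split, and the `roc_auc` scoring are illustrative choices, not requirements. It reports one summary number per model, so you would still need to produce the full Problem 3 outputs yourself.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_bc, y_bc = load_breast_cancer(return_X_y=True)

# Three example classifiers; scaling lives inside each pipeline so it is
# re-fit on the training portion of every fold
candidates = {
    "logistic regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "linear SVM": make_pipeline(StandardScaler(), SVC(kernel="linear")),
    "k-nearest neighbors": make_pipeline(StandardScaler(), KNeighborsClassifier()),
}

cv_bc = StratifiedKFold(n_splits=6)

summary = {}
for name, model in candidates.items():
    # One ROC AUC per fold, computed on that fold's held-out data
    scores = cross_val_score(model, X_bc, y_bc, cv=cv_bc, scoring="roc_auc")
    summary[name] = (scores.mean(), scores.std())
    print(f"{name:22s} AUC = {scores.mean():.3f} ± {scores.std():.3f}")
```

Putting the scaler inside the pipeline keeps information from the test fold out of the preprocessing step.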

<br/>
<div class='info'>

<font size='5'>🤔</font> **Pause-and-Ponder:**  Now, using the evidence you've gathered, explain which one is best and why. This should be a thorough explanation that involves all your evidence! 

</div>

---

---