# Homework 4

## Follow These Steps Before Submitting
Once you are finished, make sure to complete the following steps.

1.  Restart your kernel by clicking 'Kernel' > 'Restart & Run All'.

2.  Fix any errors that result from this.

3.  Repeat steps 1 and 2 until your notebook runs without errors.

4.  Submit your completed notebook to OWL by the deadline.


# 1. Wisconsin Breast Cancer Dataset

In this assignment, you will use a modified version of the well-known Wisconsin Breast Cancer dataset. We want to predict if a patient has a malignant or benign tumour. The features in the dataset are described below:


**Cl.thickness**:	Clump Thickness

**Cell.size**:	Uniformity of Cell Size

**Cell.shape**:	Uniformity of Cell Shape

**Marg.adhesion**:	Marginal Adhesion

**Epith.c.size**:	Single Epithelial Cell Size

**Bare.nuclei**:	Bare Nuclei

**Bl.cromatin**:	Bland Chromatin

**Normal.nucleoli**:	Normal Nucleoli

**Mitoses**:	Mitoses

**Age**: Age

**Class**: 1 if malignant, 0 if benign

In [1]:
# Package import
import numpy as np

# Scikit-learn imports
from sklearn.model_selection import train_test_split, cross_validate, RepeatedKFold
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_curve, roc_auc_score, classification_report, accuracy_score
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer

# Data management imports
import pandas as pd
import polars as pl

# Plotting imports
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [2]:
# Uncomment the line below if you are using Google colab
# !gdown https://drive.google.com/uc?id=12Y-PdpmPLInGGBvFAn_G3eCfXrRETvuF

1. Read the CSV file using Polars and store it. Use "null_values=['NA']". Show summary statistics for the dataset. What is the baseline accuracy for a model?

In [2]:
breastcanc = pl.read_csv("Breast Cancer Data.csv", null_values=['NA'])
breastcanc.describe()

| statistic | Cl.thickness | Cell.size | Cell.shape | Marg.adhesion | Epith.c.size | Bare.nuclei | Bl.cromatin | Normal.nucleoli | Mitoses | Class | Age |
|---|---|---|---|---|---|---|---|---|---|---|---|
| str | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 |
| count | 696.0 | 694.0 | 694.0 | 699.0 | 699.0 | 683.0 | 699.0 | 699.0 | 699.0 | 699.0 | 699.0 |
| null_count | 3.0 | 5.0 | 5.0 | 0.0 | 0.0 | 16.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| mean | 4.426724 | 3.136888 | 3.208934 | 2.806867 | 3.216023 | 3.544656 | 3.437768 | 2.866953 | 1.569385 | 0.344778 | 50.100143 |
| std | 2.815748 | 3.053632 | 2.973356 | 2.855379 | 2.2143 | 3.643857 | 2.438364 | 3.053634 | 1.619803 | 0.475636 | 17.97766 |
| min | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 3.0 |
| 25% | 2.0 | 1.0 | 1.0 | 1.0 | 2.0 | 1.0 | 2.0 | 1.0 | 1.0 | 0.0 | 38.0 |
| 50% | 4.0 | 1.0 | 1.0 | 1.0 | 2.0 | 1.0 | 3.0 | 1.0 | 1.0 | 0.0 | 50.0 |
| 75% | 6.0 | 5.0 | 5.0 | 4.0 | 4.0 | 6.0 | 5.0 | 4.0 | 1.0 | 1.0 | 62.0 |
| max | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 9.0 | 1.0 | 105.0 |


In [3]:
posrate = breastcanc.select("Class").mean().to_numpy().item()
baselineaccuracy = 1-posrate
print(f"The baseline accuracy is {baselineaccuracy}")

The baseline accuracy is 0.6552217453505007


The baseline accuracy is about 0.655: a model that always predicts the majority class (benign) is correct for roughly 65.5% of patients.
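
The baseline is the accuracy of a classifier that always predicts the most common class. The sketch below shows that idea in plain Python. The helper name `baseline_accuracy` and the toy labels are illustrative assumptions, not the homework data.

```python
from collections import Counter

def baseline_accuracy(labels):
    """Accuracy of always predicting the most common label."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)

# Toy labels (not the homework data): 13 benign (0) and 7 malignant (1)
toy_labels = [0] * 13 + [1] * 7
print(baseline_accuracy(toy_labels))  # 0.65
```

For a 0/1 outcome this equals max(p, 1 - p), where p is the positive rate, so `1 - posrate` gives the same number whenever fewer than half of the cases are malignant.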

2. Assume that we are only interested in studying people aged 100 or less. Remove anyone with ages larger than that. (Note that this slightly changes your baseline accuracy.)

In [4]:
breastcanc = breastcanc.filter(
    pl.col("Age") <= 100  # keep only patients aged 100 or less
    )

3. Replace the missing values in the dataset using the median of the corresponding predictor.

In [5]:
breastcanc = breastcanc.with_columns(
    pl.col("Cl.thickness").fill_null(pl.col("Cl.thickness").median()),
    pl.col("Cell.size").fill_null(pl.col("Cell.size").median()),
    pl.col("Cell.shape").fill_null(pl.col("Cell.shape").median()),
    pl.col("Bare.nuclei").fill_null(pl.col("Bare.nuclei").median())
)

4. Create a training and testing dataset. Reserve 30% of the data for testing and stratify the split based on the outcome. Use a random state of 0.

In [8]:
Xtrain, Xtest, ytrain, ytest = train_test_split(breastcanc.drop('Class'),
                                                breastcanc.select('Class'),
                                                test_size=0.3,
                                                random_state=0,
                                                stratify=breastcanc.select('Class') # Keep the class proportions the same in both splits
                                                )

5. Using all potential predictors, train a logistic regression model to predict if a patient has the condition. Remember to standardize the predictors. Use the following arguments: solver='lbfgs', penalty=None, max_iter=10000, verbose=1, random_state=0, and n_jobs=-1.

In [7]:
features_to_transform = ['Cl.thickness', 'Cell.size', 'Cell.shape', 'Marg.adhesion', 'Epith.c.size',
                         'Bare.nuclei', 'Bl.cromatin', 'Normal.nucleoli', 'Mitoses', 'Age']

# Create a ColumnTransformer object that scales the features to transform
transform_numbers = ColumnTransformer([('scaler', StandardScaler(), features_to_transform)],
                                      remainder='passthrough',
                                      verbose_feature_names_out=False,
                                      force_int_remainder_cols=False)

# Create a pipeline that scales the features and trains a logistic regression model
logit_pipe = Pipeline([
    ('scaler', transform_numbers),
    ('logistic_regression', LogisticRegression(solver='lbfgs',
                                               penalty = None,
                                               max_iter=10000,
                                               verbose=1,
                                               random_state=0,
                                               n_jobs=-1,
                                               class_weight='balanced'))
])

logit_pipe.fit(Xtrain, ytrain.to_numpy().ravel())

6. Compute the accuracy and AUC of your model on the test set.

In [None]:
accuracy = logit_pipe.score(Xtest, ytest.to_numpy().ravel())
y_test_prediction = logit_pipe.predict_proba(Xtest)
# Pass the labels as a 1-D array, matching what the model was fitted on
auc = roc_auc_score(y_true=ytest.to_numpy().ravel(), y_score=y_test_prediction[:,1])
print(f"The AUC of the model is {auc:.3f}")
print(f"The accuracy of the model is {accuracy:.3f}")

- The model's accuracy on the test set is much higher than the baseline accuracy.
- The AUC is close to 1, so the model ranks malignant cases above benign ones far better than random guessing (an AUC of 0.5).

7. Without estimates of the uncertainty of the performance metrics, it can be hard to make definitive conclusions about the performance of the model. Compute 95% confidence intervals for the accuracy and AUC using bootstrapping with 1000 replicates. Interpret your results.

In [None]:
# Get the predicted probabilities of the test data
yprob = logit_pipe.predict_proba(Xtest)[:, 1]

# Get the predicted classes of the test data
ypred = logit_pipe.predict(Xtest)

# Convert the true labels to a 1-D NumPy array once, outside the loop
ytrue = ytest.to_numpy().ravel()

n_bootstraps = 1000 # number of bootstrap replicates
bootstrapped_accuracy = np.zeros(n_bootstraps)
bootstrapped_auc = np.zeros(n_bootstraps)

# Seeded generator so the intervals are reproducible (the seed value is our choice)
rng = np.random.default_rng(0)

for i in range(n_bootstraps):
    # Get the indices for the bootstrap sample
    idx = rng.choice(len(ytrue), len(ytrue), replace=True)

    # Get the accuracy of the bootstrap sample
    bootstrapped_accuracy[i] = accuracy_score(ytrue[idx], ypred[idx])

    # Get the AUC of the bootstrap sample
    bootstrapped_auc[i] = roc_auc_score(ytrue[idx], yprob[idx])

# Get the differences between the bootstrapped values and the original values
accuracy_diff = bootstrapped_accuracy - accuracy
auc_diff = bootstrapped_auc - auc

# Calculate the 95% confidence interval for the accuracy and AUC
# (basic bootstrap: subtract the upper and lower quantiles of the differences from the estimate)
accuracy_ci = accuracy - np.percentile(accuracy_diff, [97.5, 2.5])
auc_ci = auc - np.percentile(auc_diff, [97.5, 2.5])

# Show the CI bounds
print(accuracy_ci)
print(auc_ci)
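
An alternative is the percentile bootstrap, which reads the interval directly from the quantiles of the resampled metric. Below is a self-contained sketch in plain Python. The helper names `accuracy_of` and `percentile_bootstrap_ci`, the toy labels, and the seed are illustrative assumptions and not part of the assignment.

```python
import random

def accuracy_of(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def percentile_bootstrap_ci(y_true, y_pred, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for accuracy."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        # Resample row indices with replacement
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(accuracy_of([y_true[i] for i in idx],
                                 [y_pred[i] for i in idx]))
    stats.sort()
    # Symmetric positions for the alpha/2 and 1 - alpha/2 empirical quantiles
    lo_idx = int(n_boot * alpha / 2)
    hi_idx = n_boot - 1 - lo_idx
    return stats[lo_idx], stats[hi_idx]

# Toy data (not the homework data): 100 labels, 10 wrong predictions
y_true = [0, 1] * 50
y_pred = y_true[:90] + [1 - y for y in y_true[90:]]
lower, upper = percentile_bootstrap_ci(y_true, y_pred)
print(f"accuracy = {accuracy_of(y_true, y_pred):.2f}, "
      f"95% CI = ({lower:.2f}, {upper:.2f})")
```

The same resampled indices must be used for the labels and the predictions, so that each bootstrap sample keeps every patient's label paired with that patient's prediction.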

8. Plot the distribution of the accuracy and AUC using histograms. Make sure to provide a title and axes labels for your plots. Add a red vertical line representing the mean of accuracy and AUC.

9. Compute 95% confidence intervals for the accuracy and AUC using repeated cross-validation. Use 10 splits and 100 repetitions with a random state of 0. Compare your results to what you obtained using bootstrapping. Which method provides better confidence intervals in this case?
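
To make the setup concrete: repeated K-fold reshuffles the data before each repetition and then splits it into folds, so 10 splits and 100 repetitions give 1000 train/test pairs and therefore 1000 values of each metric. The plain-Python sketch below only illustrates that bookkeeping. In the notebook, the imported `RepeatedKFold` does this job, and its fold assignment differs from the stride-based split assumed here. The function name `repeated_kfold_indices` is hypothetical.

```python
import random

def repeated_kfold_indices(n_samples, n_splits=10, n_repeats=100, seed=0):
    """Yield (train_idx, test_idx) pairs; each repeat reshuffles before splitting."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        order = list(range(n_samples))
        rng.shuffle(order)
        for k in range(n_splits):
            # Fold k takes every n_splits-th index of the shuffled order
            test_idx = order[k::n_splits]
            held_out = set(test_idx)
            train_idx = [i for i in order if i not in held_out]
            yield train_idx, test_idx

splits = list(repeated_kfold_indices(n_samples=50))
print(len(splits))  # 10 splits x 100 repeats = 1000 train/test pairs
```

One possible interval, under these assumptions, takes the 2.5th and 97.5th percentiles of the 1000 scores for each metric.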

10. Using your cross-validation results, compute a 95% confidence interval for each coefficient in the model. Which feature(s) might you remove based on this?
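
One way to approach this, sketched below in plain Python, is to collect one coefficient vector per cross-validation fit and take percentile intervals column by column. A coefficient whose interval contains zero is a candidate for removal. The helper names, the feature names `feature_a` and `feature_b`, and the simulated coefficients are illustrative assumptions. In the notebook, the vectors could come from the fitted pipelines returned by `cross_validate(..., return_estimator=True)`.

```python
import random
import statistics

def coefficient_intervals(coef_samples, names):
    """Percentile 95% interval for each coefficient across fitted models.

    coef_samples: list of coefficient vectors, one per cross-validation fit.
    """
    intervals = {}
    for j, name in enumerate(names):
        values = [coefs[j] for coefs in coef_samples]
        # n=40 gives cut points every 2.5%: the first is the 2.5th
        # percentile and the last is the 97.5th percentile
        cuts = statistics.quantiles(values, n=40, method="inclusive")
        intervals[name] = (cuts[0], cuts[-1])
    return intervals

def contains_zero(interval):
    """True when the interval includes zero."""
    lower, upper = interval
    return lower <= 0 <= upper

# Simulated coefficients (not real results): one clearly positive, one near zero
rng = random.Random(0)
coef_samples = [[rng.gauss(1.5, 0.2), rng.gauss(0.0, 0.2)] for _ in range(1000)]
intervals = coefficient_intervals(coef_samples, ["feature_a", "feature_b"])
for name, interval in intervals.items():
    verdict = "contains zero" if contains_zero(interval) else "excludes zero"
    print(name, interval, verdict)
```

Because the predictors are standardized in the pipeline, the coefficients are on a comparable scale, which makes this kind of side-by-side comparison reasonable.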

11. Fit your logistic regression model like before but remove the feature(s) you identified in Q10. Plot the ROC curve of the model over the test set and annotate it with the AUC of the model.

12. Calculate the uncertainty for the prediction of the first testing patient.  Plot a histogram of the different predictions. Give the plot a title and axes labels. Add a red vertical line representing the mean of the predictions.

Hint: If you need to stack a list of arrays, you can use [np.hstack(list)](https://numpy.org/doc/stable/reference/generated/numpy.hstack.html).