#### Assignment 0.1: Download the heart disease dataset from https://www.statlearning.com/s/Heart.csv

#### Assignment 0.2: Load the dataset and drop all variables except the predictors Age, Sex, ChestPain, RestBP, Chol, and the target variable AHD. Drop all rows containing a NaN value.

In [1]:
import pandas as pd

df = pd.read_csv('https://www.statlearning.com/s/Heart.csv')
predictors = ['Age', 'Sex', 'ChestPain', 'RestBP', 'Chol']
target = 'AHD'

df = df[predictors + [target]]
df = df.dropna()

#### Assignment 0.3: One-hot encode the variable ChestPain. This means that where you previously had a single column with one of four values ['typical', 'asymptomatic', 'nonanginal', 'nontypical'], you will now have four binary columns (their names don't matter), akin to 'ChestPain_typical', 'ChestPain_asymptomatic', 'ChestPain_nonanginal', 'ChestPain_nontypical'. A row that previously had ChestPain='typical' will now have ChestPain_typical=1 and the other three columns set to 0, a row with ChestPain='asymptomatic' will have ChestPain_asymptomatic=1 and the other three set to 0, etc.

In [2]:
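# dtype=int gives 0/1 indicator columns (newer pandas versions return booleans by default)
df = pd.get_dummies(df, columns=['ChestPain'], dtype=int)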

predictors.remove('ChestPain')
predictors.extend([col for col in df.columns if col.startswith('ChestPain')])
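
As a small illustration of what the encoding produces, the sketch below applies 'pd.get_dummies' to a made-up three-row frame. The 'toy' frame is hypothetical and not part of Heart.csv; 'dtype=int' is passed so the indicator columns show up as 0/1 regardless of pandas version.

```python
import pandas as pd

# Hypothetical toy data, only used to show the shape of the result
toy = pd.DataFrame({'ChestPain': ['typical', 'asymptomatic', 'typical']})

# One indicator column per distinct category, named '<column>_<category>'
encoded = pd.get_dummies(toy, columns=['ChestPain'], dtype=int)
print(encoded)
```

Only categories that actually occur in the data get a column, so this toy frame yields two indicator columns rather than four.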

#### Assignment 0.4: Binary encode the target variable AHD such that 'No'=0 and 'Yes'=1.

In [3]:
df['AHD'] = df['AHD'].map({'No': 0, 'Yes': 1})

#### Assignment 1.1: Write a function "stratified_split" that takes three arguments: A dataframe, a number of folds, and a list of variables to stratify by. The function should return a list of dataframes, one for each fold, where the dataframes are stratified by the variables in the list. Test that the function works by splitting the dataset into two folds based on 'AHD', 'Age' and 'RestBP' and print the size of each fold, the counts of 0s and 1s in AHD, and the mean of each of 'Age' and 'RestBP' (all these should be printed individually per fold). Ensure that the function does not modify the original dataframe.

In [4]:
import numpy as np
from collections import Counter
from typing import List


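def stratified_split(df: pd.DataFrame, num_folds: int, variables: List[str]) -> List[pd.DataFrame]:
    # Work on a copy so the caller's dataframe is never modified
    df = df.copy()
    # Sorting places rows with similar values of the stratification variables next to each other
    df = df.sort_values(variables)
    # Dealing the sorted rows out round-robin gives each fold a similar mix of values
    df['fold'] = np.arange(len(df)) % num_folds

    return [df[df['fold'] == fold].drop(columns=['fold']) for fold in range(num_folds)]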

folds = stratified_split(df, 2, ['AHD', 'Age', 'RestBP'])

for i, fold in enumerate(folds):
    print(f'Fold {i} (n={len(fold)})')
    print(f'AHD: {Counter(fold["AHD"])}')
    print(f'Age: {np.mean(fold["Age"]):.2f}')
    print(f'RestBP: {np.mean(fold["RestBP"]):.2f}')

Fold 0 (n=152)
AHD: Counter({0: 82, 1: 70})
Age: 54.36
RestBP: 132.20
Fold 1 (n=151)
AHD: Counter({0: 82, 1: 69})
Age: 54.52
RestBP: 131.18
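
The requirement that the function leaves the original dataframe untouched can also be checked directly. The following is a minimal sketch on synthetic data: the 'toy' frame is an assumption made for illustration, and the function body is repeated only so the block runs on its own.

```python
import numpy as np
import pandas as pd


def stratified_split(df, num_folds, variables):
    # Same sort-then-deal logic as above, repeated to keep the sketch self-contained
    df = df.sort_values(variables).copy()
    df['fold'] = np.arange(len(df)) % num_folds
    return [df[df['fold'] == f].drop(columns=['fold']) for f in range(num_folds)]


# Synthetic stand-in for the heart data (not real observations)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'AHD': rng.integers(0, 2, size=20),
    'Age': rng.integers(30, 80, size=20),
})
before = toy.copy()

toy_folds = stratified_split(toy, 2, ['AHD', 'Age'])

# The input keeps its rows, row order and columns (no 'fold' column leaks out)
assert toy.equals(before)

# Every row ends up in exactly one fold
all_indices = sorted(idx for fold in toy_folds for idx in fold.index)
assert all_indices == list(toy.index)
```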


#### Assignment 1.2: Write a function 'fit_and_predict' that takes 4 arguments: A training set, a validation set, a list of predictors, and a target variable. The function should fit a logistic regression model to the training set using the predictors and target variable, and return the predictions of the model on the validation set.

In [5]:
from sklearn.linear_model import LogisticRegression

def fit_and_predict(
    train: pd.DataFrame, 
    validation: pd.DataFrame, 
    predictors: List[str], 
    target: str
) -> np.ndarray:
    model = LogisticRegression()
    model.fit(train[predictors], train[target])

    # Column 1 of predict_proba holds the predicted probability of the positive class (AHD=1)
    return model.predict_proba(validation[predictors])[:, 1]

#### Assignment 1.3: Write a function 'fit_and_predict_standardized' that takes 5 arguments: A training set, a validation set, a list of predictors, a target variable, and a list of variables to standardize. Using a loop (or a scaler), the function should z-score standardize the given variables in both the training set and the validation set based on the mean and standard deviation in the training set. Then, the function should call the 'fit_and_predict' function and return its result. Ensure that the function does not modify the original dataframes. Test the function using the train and validation set from above (e.g. the two folds from the split), while standardizing the 'Age', 'RestBP' and 'Chol' variables (as mentioned above, the target should be AHD, and you should also include the remaining predictors: 'Sex' and the ChestPain variables).

In [6]:
from sklearn.preprocessing import StandardScaler

def fit_and_predict_standardized(
    train: pd.DataFrame, 
    validation: pd.DataFrame, 
    predictors: List[str],
    target: str,
    standardize: List[str]
) -> np.ndarray:
    train = train.copy()
    validation = validation.copy()

    scaler = StandardScaler()
    train[standardize] = scaler.fit_transform(train[standardize])
    validation[standardize] = scaler.transform(validation[standardize])

    return fit_and_predict(train, validation, predictors, target)

fit_and_predict_standardized(folds[0], folds[1], predictors, target, ['Age', 'RestBP', 'Chol'])
    

array([0.24580942, 0.10301595, 0.04318178, 0.19265105, 0.05521626,
       0.35005167, 0.12335629, 0.04871613, 0.03574997, 0.20508148,
       0.26053605, 0.13650888, 0.21607723, 0.38691392, 0.63428011,
       0.21666342, 0.05742623, 0.15114677, 0.21690386, 0.16463146,
       0.64549472, 0.66232289, 0.04653685, 0.147312  , 0.37418151,
       0.68481353, 0.27261254, 0.25993081, 0.1947186 , 0.3937794 ,
       0.36757399, 0.05514858, 0.24402256, 0.27384063, 0.45940217,
       0.0929065 , 0.76984209, 0.46038522, 0.23011678, 0.10416688,
       0.52227398, 0.10213225, 0.44635117, 0.46600833, 0.21257989,
       0.09340648, 0.3266818 , 0.10975396, 0.37868051, 0.25900628,
       0.07960065, 0.51744639, 0.27241878, 0.78507658, 0.49181328,
       0.82553098, 0.42673238, 0.33914456, 0.28546738, 0.41136269,
       0.83246662, 0.32374632, 0.66724015, 0.13552456, 0.46893256,
       0.33775133, 0.57526899, 0.12385073, 0.60020893, 0.58881848,
       0.71326235, 0.86031367, 0.2106519 , 0.86117278, 0.35543
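
The assignment also allows a loop instead of a scaler. The sketch below shows one way the loop variant could look and checks that it agrees with 'StandardScaler'. The small 'train' and 'validation' frames are made-up values. Note that 'StandardScaler' uses the population standard deviation, so the loop passes 'ddof=0' (the pandas default is 'ddof=1').

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up numbers, only for illustration
train = pd.DataFrame({'Age': [40.0, 50.0, 60.0], 'Chol': [200.0, 240.0, 220.0]})
validation = pd.DataFrame({'Age': [45.0, 70.0], 'Chol': [210.0, 260.0]})
cols = ['Age', 'Chol']

loop_train, loop_val = train.copy(), validation.copy()
for col in cols:
    # Mean and standard deviation come from the training set only
    mean = train[col].mean()
    std = train[col].std(ddof=0)
    loop_train[col] = (train[col] - mean) / std
    loop_val[col] = (validation[col] - mean) / std

# Reference result from the scaler-based approach
scaler = StandardScaler().fit(train[cols])
assert np.allclose(loop_train[cols], scaler.transform(train[cols]))
assert np.allclose(loop_val[cols], scaler.transform(validation[cols]))
```

Using training-set statistics for both sets keeps information from the validation set out of the preprocessing step.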

#### Assignment 1.4: Write a function 'fit_and_compute_auc' that takes 5 arguments: A training set, a validation set, a list of predictors, a target variable, and a list of variables to standardize. The function should call the 'fit_and_predict_standardized' function to retrieve out-of-sample predictions for the validation set. Based on these and the ground truth labels in the validation set, it should compute and return the AUC. Test the function using the train and test set from above, while standardizing the 'Age', 'RestBP' and 'Chol' variables (and including the remaining predictors). Print the AUC.

In [7]:
from sklearn.metrics import roc_auc_score


def fit_and_compute_auc(
    train: pd.DataFrame, 
    validation: pd.DataFrame, 
    predictors: List[str],
    target: str,
    standardize: List[str]
):
    predictions = fit_and_predict_standardized(train, validation, predictors, target, standardize)

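    # roc_auc_score expects the true labels first, then the predicted scores
    return roc_auc_score(validation[target], predictions)

# Test on the two folds from the split above, as the assignment asks
auc = fit_and_compute_auc(folds[0], folds[1], predictors, target, ['Age', 'RestBP', 'Chol'])
print(f'AUC: {auc:.2f}')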
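
One way to build intuition for the number this function returns: the AUC equals the fraction of (positive, negative) pairs in which the positive case receives the higher score, with ties counting as half. The sketch below checks this on four hand-picked toy labels and scores, which are not taken from the heart data.

```python
from itertools import product

from sklearn.metrics import roc_auc_score

# Toy labels and scores chosen by hand
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

positives = [s for s, y in zip(scores, y_true) if y == 1]
negatives = [s for s, y in zip(scores, y_true) if y == 0]

# Score each (positive, negative) pair: 1 if ranked correctly, 0.5 on a tie, 0 otherwise
pairs = list(product(positives, negatives))
pairwise_auc = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p, n in pairs) / len(pairs)

print(pairwise_auc)                   # 0.75 (3 of the 4 pairs are ranked correctly)
print(roc_auc_score(y_true, scores))  # 0.75
```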

#### Assignment 2: Use the 'stratified_split' function to split the dataset into 10 folds, stratified on variables you find reasonable. For each fold, use the 'fit_and_compute_auc' function to compute the AUC of the model on the held-out validation set. Print the mean and standard deviation of the AUCs across the 10 folds.

In [8]:
folds = stratified_split(df, 10, ['AHD', 'Age'])

aucs = [
    fit_and_compute_auc(
        pd.concat([fold for j, fold in enumerate(folds) if j != i]),
        folds[i],
        predictors,
        target,
        ['Age', 'RestBP', 'Chol']
    ) for i in range(len(folds))
]

print(f'AUC: {np.mean(aucs):.2f}+/-{np.std(aucs):.2f}')

#### Assignment 3: For 100 iterations, create a bootstrap sample by sampling with replacement from the full dataset until you have a training set equal in size to 80% of the original data. Use the observations not included in the bootstrap sample as the validation set for that iteration. Fit models and calculate AUCs for each iteration. Print the mean and standard deviation of the AUCs.

In [9]:
from tqdm import tqdm

aucs = []

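for _ in tqdm(range(100)):
    # Draw 80% of the rows with replacement, so the same row can appear several times
    train = df.sample(frac=0.8, replace=True)
    # Rows that were never drawn (the out-of-bag rows) form the validation set
    validation = df[~df.index.isin(train.index)]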
    aucs.append(
        fit_and_compute_auc(
            train,
            validation,
            predictors,
            target,
            ['Age', 'RestBP', 'Chol']
        )
    )

print(f'AUC: {np.mean(aucs):.2f}+/-{np.std(aucs):.2f}')

AUC: 0.83+/-0.03





#### Assignment 4.1: List some benefits of wrapping code in functions rather than copying and pasting it multiple times.

- Code becomes more reusable
- Bugs have to be fixed in a single place
- It is often easier to reason about the high-level functioning of code in terms of abstract chunks (i.e. functions) as opposed to individual lines

#### Assignment 4.2: Explain three classification metrics and their benefits and drawbacks.

- Accuracy: Measures the proportion of correct predictions across a dataset. Very intuitive to interpret, but can be misleading in the case of imbalanced classes.
- Area under the ROC curve: Measures the trade-off between true positive rate and false positive rate across all classification thresholds. Gives a more comprehensive view of model performance than metrics that rely on a single classification threshold. However, before using a model in practice, a threshold often has to be set. Can be misleading in cases of class imbalance where the specific types of misclassification matter.
- Area under the precision-recall curve: Similar to AUROC, but measures the trade-off between precision and recall instead. Can be more informative in cases of severe class imbalance, especially when the negative class dominates.
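
To make the point about class imbalance concrete, here is a small sketch with made-up labels (95 negatives, 5 positives) and a model that gives every case the same low score. Average precision is used as the summary of the precision-recall curve, which is one common choice rather than the only one.

```python
import numpy as np
from sklearn.metrics import accuracy_score, average_precision_score, roc_auc_score

# Made-up, heavily imbalanced labels: 95 negatives and 5 positives
y_true = np.array([0] * 95 + [1] * 5)

# An uninformative model that assigns everyone the same low score
scores = np.full(len(y_true), 0.1)
y_pred = (scores >= 0.5).astype(int)  # thresholding at 0.5 predicts "negative" for all

accuracy = accuracy_score(y_true, y_pred)
auroc = roc_auc_score(y_true, scores)
auprc = average_precision_score(y_true, scores)

print(f'Accuracy: {accuracy:.2f}')  # 0.95, despite the model finding no positives
print(f'AUROC: {auroc:.2f}')        # 0.50, i.e. chance level
print(f'AUPRC: {auprc:.2f}')        # 0.05, the share of positives in the data
```

Accuracy looks excellent here even though the model is useless, while both ranking-based metrics sit at their chance levels.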

#### Assignment 4.3: Write a couple of sentences comparing the three methods (train/validation, cross-validation, bootstrap) above as approaches to quantify model performance. Which one yielded the best results? Which one would you expect to yield the best results? Can you mention some theoretical benefits and drawbacks with each? Even if you didn't do the optional bootstrap exercise you should reflect on this as an approach.

- Train/validation: Relies on a single data split, reserving a portion of the data for training and another portion for validation. Provides an unbiased measure of model performance conditioned on the specific split. In practice, this means that we have to assume the measure will be highly variable, depending on exactly what data ends up where.
- Cross-validation: Splits the data into K folds, and runs several training iterations, each reserving a single fold for validation. Provides several model performances (one per fold), which can be useful to get a more comprehensive view of the model performance and its expected variance.
- Bootstrap: With the approach described here we repeatedly sample a portion of the dataset for training and use the remaining (out-of-bag) samples for validation. This is closely related to repeated hold-out (also known as Monte Carlo cross-validation), which samples without replacement. Because we here draw 80% of the dataset with replacement, the training set contains duplicates, so effectively less unique data is used for training and more is left for validation. In sum, we would expect this to give a slightly lower mean performance and a slightly lower variance among the observed performances.
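
How much less data the bootstrap model effectively trains on can be estimated with a short simulation. The sketch below assumes n=303 rows (the two fold sizes printed earlier, 152 + 151) and compares the simulated share of unique rows with the closed-form expectation 1 - (1 - 1/n)^m, where m is the number of draws.

```python
import numpy as np

n = 303          # dataset size, taken from the fold sizes printed earlier
m = int(0.8 * n) # number of draws per bootstrap sample (80% of the data)

rng = np.random.default_rng(0)
unique_fractions = []
for _ in range(1000):
    # Draw m row positions with replacement and count how many distinct rows were hit
    sample = rng.integers(0, n, size=m)
    unique_fractions.append(len(np.unique(sample)) / n)

# Probability that a given row is drawn at least once in m draws
expected = 1 - (1 - 1 / n) ** m

print(f'Simulated share of unique rows in training: {np.mean(unique_fractions):.3f}')
print(f'Expected share: {expected:.3f}')
```

Under these assumptions only a little more than half of the rows end up in training, and the rest are left for validation, noticeably more than the nominal 20%.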

#### Assignment 4.4: Why do we stratify the dataset before splitting?

We stratify the data prior to splitting to ensure that all of our folds are approximately representative with regard to a few key variables. In addition to making our models more likely to generalize to new samples, this also typically means that we see better out-of-sample performance on held-out data.
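
The effect on fold composition can be sketched with toy labels: 30 positives among 100 made-up cases, split into 10 folds once at random and once with the sort-then-deal approach used in 'stratified_split'. The labels and the seed are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.permutation(np.array([1] * 30 + [0] * 70))  # toy labels, 30% positives
num_folds = 10

# Random assignment: shuffle a balanced list of fold ids
random_folds = rng.permutation(np.arange(len(y)) % num_folds)
random_counts = [int(y[random_folds == f].sum()) for f in range(num_folds)]

# Stratified assignment: sort by label, then deal the rows out round-robin
order = np.argsort(y, kind='stable')
strat_folds = np.empty(len(y), dtype=int)
strat_folds[order] = np.arange(len(y)) % num_folds
strat_counts = [int(y[strat_folds == f].sum()) for f in range(num_folds)]

print(f'Positives per fold (random):     {random_counts}')
print(f'Positives per fold (stratified): {strat_counts}')
```

With the stratified assignment every fold receives exactly 3 positives, whereas the random assignment leaves the count per fold to chance.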

#### Assignment 4.5: What other use cases can you think of for the bootstrap method?

The bootstrap is most commonly used to provide confidence intervals for parameter estimates.
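
As a rough sketch of that use case, the block below computes a 95% percentile-bootstrap confidence interval for a mean. The data are synthetic (drawn from a normal distribution chosen for illustration), and the percentile method is only one of several ways to turn bootstrap replicates into an interval.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic sample standing in for an observed variable (not real data)
data = rng.normal(loc=54, scale=9, size=300)

# Resample the data with replacement many times and recompute the statistic each time
boot_means = np.array([
    rng.choice(data, size=len(data), replace=True).mean()
    for _ in range(2000)
])

# Percentile method: the middle 95% of the bootstrap distribution
lower, upper = np.percentile(boot_means, [2.5, 97.5])

print(f'Sample mean: {data.mean():.2f}')
print(f'95% bootstrap CI: [{lower:.2f}, {upper:.2f}]')
```

The same pattern works for statistics without a convenient analytical standard error, such as a median or an AUC, by swapping out the '.mean()' call.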