<div align="center">
<h1>Stage 5: Modelling (Cross-Validation)</h1>
by Hongnan Gao
<br>
</div>

## Dependencies and Configuration

In [None]:
import logging
import random
from dataclasses import dataclass, field
from time import time
from typing import Any, Callable, Dict, List, Optional, Union

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn import model_selection

In [None]:
@dataclass
class config:
    raw_data: str = "https://storage.googleapis.com/reighns/reighns_ml_projects/docs/supervised_learning/classification/breast-cancer-wisconsin/data/raw/data.csv"
    processed_data: str = "https://storage.googleapis.com/reighns/reighns_ml_projects/docs/supervised_learning/classification/breast-cancer-wisconsin/data/processed/processed.csv"
    train_size: float = 0.9
    seed: int = 1992
    num_folds: int = 5
    cv_schema: str = "StratifiedKFold"
    classification_type: str = "binary"
    
    target_col: List[str] = field(default_factory = lambda: ["diagnosis"])
    unwanted_cols : List[str] =  field(default_factory = lambda: ["id", "Unnamed: 32"])
    
    # Plotting
    colors : List[str] =field(default_factory = lambda: ["#fe4a49", "#2ab7ca", "#fed766", "#59981A"])
    # "mako_r" is registered by the seaborn import; plt.cm.get_cmap was removed in Matplotlib 3.9
    cmap_reversed = plt.get_cmap("mako_r")

    def to_dict(self) -> Dict:
        """Convert the config object to a dictionary.

        Returns:
            Dict: The config object as a dictionary.
        """
        return {
            "raw_data": self.raw_data,
            "processed_data": self.processed_data,
            "train_size": self.train_size,
            "seed": self.seed,
            "num_folds": self.num_folds,
            "cv_schema": self.cv_schema,
            "classification_type": self.classification_type,
            "target_col": self.target_col,
            "unwanted_cols": self.unwanted_cols,
            "colors": self.colors,
            "cmap_reversed": self.cmap_reversed
        }

In [None]:
def set_seeds(seed: int = 1234) -> None:
    """Set seeds for reproducibility."""
    np.random.seed(seed)
    random.seed(seed)
    
def init_logger(log_file: str = "info.log") -> logging.Logger:
    """Initialize a logger that writes to both the console and a log file."""
    logger = logging.getLogger(__name__)
    logger.setLevel(logging.INFO)
    # Skip handler setup if the cell is re-run, otherwise every message is logged twice
    if logger.handlers:
        return logger
    formatter = logging.Formatter("%(asctime)s - %(message)s", datefmt="%Y-%m-%d,%H:%M:%S")
    stream_handler = logging.StreamHandler()
    stream_handler.setFormatter(formatter)
    file_handler = logging.FileHandler(filename=log_file)
    file_handler.setFormatter(formatter)
    logger.addHandler(stream_handler)
    logger.addHandler(file_handler)
    return logger

In [None]:
config = config()
logger = init_logger()

In [None]:
# Set seeds for reproducibility (set_seeds returns None, so nothing is assigned)
set_seeds(seed=config.seed)

# read data
df = pd.read_csv(config.processed_data)

## Cross-Validation Strategy

!!! warning "Generalization"
    > Ultimately, we are interested in the generalization error made by the model, that is, how well the model performs on <b>unseen data</b> that is not taken from our sample set $\mathcal{D}$. In general, we use the <b>validation set</b> for <b>model selection</b> and the <b>test set</b> for <b>an estimate of the generalization error</b> on new data.
    > <br> <b>- Adapted from The Elements of Statistical Learning, Section 7.2</b>

### Step 1: Train-Test-Split

Since this dataset is relatively small, we will not use a <b>train-validation-test</b> split. Instead, we split the data into train and test sets in a 9:1 ratio, stratified on the target by passing `stratify=y` to `train_test_split()`, so that the class proportions are similar in both sets. In practice, a larger sample size is needed to get a reliable and stable split. It is also recommended to refit the chosen model on the whole training set (still excluding the "unseen" test set) once the model selection process (e.g. finding the best hyperparameters) is done.
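To illustrate what `stratify=y` contributes on a dataset of this size, here is a minimal sketch on synthetic labels. The class counts (357 and 212) and the names `X_demo`, `y_demo` and `positive_rate` are assumptions made for illustration and are not part of the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data (assumed class counts, illustration only)
y_demo = np.array([0] * 357 + [1] * 212)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)


def positive_rate(labels: np.ndarray) -> float:
    """Fraction of samples belonging to class 1."""
    return float(np.mean(labels == 1))


# Stratified 9:1 split: class proportions are preserved in both parts
_, _, y_tr_strat, y_te_strat = train_test_split(
    X_demo, y_demo, train_size=0.9, shuffle=True, stratify=y_demo, random_state=1992
)

# Plain shuffled split for comparison: proportions are left to chance
_, _, y_tr_plain, y_te_plain = train_test_split(
    X_demo, y_demo, train_size=0.9, shuffle=True, random_state=1992
)

print(f"stratified: train={positive_rate(y_tr_strat):.3f}, test={positive_rate(y_te_strat):.3f}")
print(f"plain     : train={positive_rate(y_tr_plain):.3f}, test={positive_rate(y_te_plain):.3f}")
```

With only a few dozen test samples, an unstratified split can drift away from the overall class balance, which is the motivation for stratifying here.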

### Step 2: Resampling Strategy

We use `StratifiedKFold` as our resampling strategy. After the split in Step 1 we have a training set $X_{\text{train}}$, and we apply the resampling strategy to this $X_{\text{train}}$ only. We choose $K = 5$. The choice of $K$ is somewhat arbitrary and is usually guided by [empirical results](https://stats.stackexchange.com/questions/61783/bias-and-variance-in-leave-one-out-vs-k-fold-cross-validation).
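As a rough sketch of how `StratifiedKFold` differs from plain `KFold`, the snippet below compares the share of class 1 in each validation fold on synthetic labels. The label counts and the helper `fold_positive_rates` are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Synthetic imbalanced labels standing in for the training targets (assumed counts)
y_demo = np.array([0] * 321 + [1] * 191)
X_demo = np.zeros((len(y_demo), 1))  # features are irrelevant for the split itself


def fold_positive_rates(splitter, X, y):
    """Return the fraction of class 1 in each validation fold."""
    return [float(y[val_idx].mean()) for _, val_idx in splitter.split(X, y)]


kf_rates = fold_positive_rates(
    KFold(n_splits=5, shuffle=True, random_state=1992), X_demo, y_demo
)
skf_rates = fold_positive_rates(
    StratifiedKFold(n_splits=5, shuffle=True, random_state=1992), X_demo, y_demo
)

print("KFold          :", np.round(kf_rates, 3))
print("StratifiedKFold:", np.round(skf_rates, 3))
```

The stratified rates stay within about one sample of each other, whereas the plain `KFold` rates depend on how the shuffle happens to fall.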

### Cross-Validation Workflow

To recap, we have the following:

- **Training Set ($X_{\text{train}}$)**: This will be further split into $K$ folds during our cross-validation. This set is used to fit a particular hypothesis $h \in \mathcal{H}$.
- **Validation Set ($X_{\text{val}}$)**: This is split from our $X_{\text{train}}$ during cross-validation. This set is used for model selection (i.e. finding the best hyperparameters in an attempt to produce the best hypothesis $g \in \mathcal{H}$).
- **Test Set ($X_{\text{test}}$)**: This is an unseen test set, and we will only use it after we finish tuning our model/hypothesis. Once we have a final best model $g$, we use $g$ to predict on the test set to get an estimate of the generalization error (also called out-of-sample error).

---

<figure>
<img src='https://storage.googleapis.com/reighns/reighns_ml_projects/docs/supervised_learning/classification/breast-cancer-wisconsin/data/images/grid_search_workflow.png' width="500"/>
<figcaption align = "center"><b>Courtesy of scikit-learn on a typical Cross-Validation workflow.</b></figcaption>
</figure>
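The workflow above can be sketched end to end as follows. This is a minimal illustration under stated assumptions: it uses scikit-learn's bundled copy of the breast cancer dataset rather than the processed CSV loaded in this notebook (so the label encoding may differ), and the scaled logistic regression is only a placeholder model, not the model chosen in this project.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: scikit-learn's bundled dataset, not the notebook's processed CSV
X_demo, y_demo = load_breast_cancer(return_X_y=True)

# Step 1: hold out a stratified test set
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.9, stratify=y_demo, random_state=1992
)

# Step 2: estimate validation performance with stratified 5-fold CV on the training set only
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1992)
cv_scores = cross_val_score(model, X_tr, y_tr, cv=cv, scoring="accuracy")

# Step 3: refit on the whole training set, then score once on the untouched test set
model.fit(X_tr, y_tr)
test_accuracy = model.score(X_te, y_te)

print(f"CV accuracy: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")
print(f"Test accuracy: {test_accuracy:.3f}")
```

Placing the scaler inside the pipeline means it is refit on the training portion of every fold, so no statistics from the validation fold leak into training.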

In [None]:
# Make a copy of df and assign it to X
X = df.copy()

# Pop diagnosis, the target column from X and assign the target column data to y
y = X.pop("diagnosis")

In [None]:
# Assign predictors and target accordingly
predictor_cols = X.columns.to_list()
target_col = config.target_col

In [None]:
# Split train - test
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X,
    y,
    train_size=config.train_size,
    shuffle=True,
    stratify=y,
    random_state=config.seed,
)

We confirm that the split is stratified properly: the distribution of targets in `y_train` and `y_test` is similar.

In [None]:
# Log the class proportions of the train and test sets (the wandb plot below is disabled)
logger.info(f"Y Train Distribution is : {y_train.value_counts(normalize=True).to_dict()}")
logger.info(f"Y Test Distribution is : {y_test.value_counts(normalize=True).to_dict()}")
# wandb.sklearn.plot_class_proportions(y_train, y_test, labels=[0, 1])

2021-11-13,09:53:22 - Y Train Distribution is : {0: 0.626953125, 1: 0.373046875}
2021-11-13,09:53:22 - Y Test Distribution is : {0: 0.631578947368421, 1: 0.3684210526315789}


In [None]:
def make_folds(
    df: pd.DataFrame,
    num_folds: int,
    cv_schema: str,
    seed: int,
    predictor_col: List,
    target_col: List,
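) -> pd.DataFrame:
    """Split the given dataframe into training folds.

    Args:
        df (pd.DataFrame): The dataframe to be split. It must have a default
            RangeIndex (e.g. after reset_index(drop=True)), since the positional
            indices returned by the splitter are used as labels in .loc.
        num_folds (int): The number of folds to be created.
        cv_schema (str): The type of cross validation to be used,
            either "KFold" or "StratifiedKFold".
        seed (int): The seed number to be used.
        predictor_col (List): The names of the predictor columns.
        target_col (List): The name(s) of the target column(s).

    Returns:
        df_folds (pd.DataFrame): A copy of df with an extra integer "fold" column.
    """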

    if cv_schema == "KFold":
        df_folds = df.copy()
        kf = model_selection.KFold(n_splits=num_folds, shuffle=True, random_state=seed)

        for fold, (train_idx, val_idx) in enumerate(
            kf.split(X=df_folds[predictor_col], y=df_folds[target_col])
        ):
            df_folds.loc[val_idx, "fold"] = int(fold + 1)

        df_folds["fold"] = df_folds["fold"].astype(int)

    elif cv_schema == "StratifiedKFold":
        df_folds = df.copy()
        skf = model_selection.StratifiedKFold(
            n_splits=num_folds, shuffle=True, random_state=seed
        )

        for fold, (train_idx, val_idx) in enumerate(
            skf.split(X=df_folds[predictor_col], y=df_folds[target_col])
        ):
            df_folds.loc[val_idx, "fold"] = int(fold + 1)

        df_folds["fold"] = df_folds["fold"].astype(int)
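        # Show the class counts per fold to verify the stratification
        print(df_folds.groupby(["fold", *target_col]).size())

    else:
        # Fail loudly instead of hitting an unbound df_folds at the return statement
        raise ValueError(f"Unsupported cv_schema: {cv_schema}")

    return df_folds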

In [None]:
# Concat X_train and y_train to apply make_folds on it and return a new dataframe df_folds with
# an additional column fold to indicate each sample's fold
X_y_train = pd.concat([X_train, y_train], axis=1).reset_index(drop=True)
df_folds = make_folds(
    X_y_train,
    num_folds=config.num_folds,
    cv_schema=config.cv_schema,
    seed=config.seed,
    predictor_col=predictor_cols,
    target_col=config.target_col,
)

# TODO: write directly to GCP
df_folds.to_csv("df_folds.csv", index=False)

fold  diagnosis
1     0            64
      1            39
2     0            65
      1            38
3     0            64
      1            38
4     0            64
      1            38
5     0            64
      1            38
dtype: int64


Looks good! All five folds are stratified: each holds 64 or 65 samples of class 0 and 38 or 39 samples of class 1.
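A minimal sketch of how the `fold` column could be consumed in a later training loop is shown below. The toy dataframe, its `feature` column and the `fold_sizes` dictionary are assumptions for illustration; in the notebook the same pattern would be applied to `df_folds`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold

# Toy frame standing in for df_folds (assumed columns: "feature", "diagnosis", "fold")
rng = np.random.default_rng(1992)
toy = pd.DataFrame({"feature": rng.normal(size=100), "diagnosis": [0] * 60 + [1] * 40})

# Tag every row with the fold in which it serves as validation data
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=1992)
for fold, (_, val_idx) in enumerate(skf.split(toy[["feature"]], toy["diagnosis"]), start=1):
    toy.loc[val_idx, "fold"] = fold
toy["fold"] = toy["fold"].astype(int)

fold_sizes = {}
for fold in sorted(toy["fold"].unique()):
    # Rows tagged with the current fold form the validation set; the rest are used for training
    train_df = toy[toy["fold"] != fold]
    val_df = toy[toy["fold"] == fold]
    fold_sizes[int(fold)] = (len(train_df), len(val_df))

print(fold_sizes)
```

Storing the fold assignment as a column keeps the splits fixed across experiments, so different models are compared on identical folds.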