# ML project aspects I

### Outline
1. Bias-variance tradeoff
2. Validation sets
3. Cross-validation (with stratification)

## Bias-variance tradeoff

### Data quality
Recall that in supervised learning, we have a training set of input-output pairs $(x,y)$ where $x$ is the input and $y$ is the output. The goal is to learn a ground truth function $\mathbf{F}$ that maps each $x$ to the corresponding $y$.  There are two ways in which the quality/quantity of the data may affect the learning process:
1. The dataset may be too small to fully capture the underlying "true" distribution of the data. This is sometimes referred to as *bias* in the dataset (because it might lead you to a biased version of the true distribution).
2. The dataset may contain a lot of "noise", which refers to random errors or irrelevant information in the data that do not reflect true underlying patterns. This is sometimes referred to as *variance* in the dataset.

### Fit and generalizability
Recall that training a model involves finding the best parameters $\mathbf{w}$ that minimize the loss function $\mathcal{L}(\mathbf{w})$. All this, of course, is measured relative to some fixed training set, which is why we refer to training as "fitting" the model to the data.

There is a crucial distinction between a model that is a "good fit" for the data and a "good model".
- A good fit means that the model has learned the underlying patterns in the training dataset, and it is able to make accurate predictions on that dataset.
- A good model means that the model is able to make accurate predictions on *any* dataset drawn from the same underlying distribution (and with the same features) as the training set; in particular, it must also fare well on the test set. Thus, a good model is one that is a good fit for the "true" distribution of the data, not just the training set.

The **generalizability** of a model is the ability to make accurate predictions on new, unseen data. A model that is able to generalize well is one that has learned the underlying patterns in the data, rather than just memorizing the training set. 

Thus, in ML, we want to strike a balance between two things: on the one hand, we want a model that achieves low training error (i.e., a good fit), and on the other hand, we want a model that achieves low test error (i.e., a good model). This leads to a fundamental tradeoff in ML, known as the bias-variance tradeoff.

### Bias-variance tradeoff
The bias-variance tradeoff is a fundamental concept in machine learning that describes the tradeoff between two sources of error that can affect the performance of a model:
- **Bias**: Bias refers to the error introduced by approximating a real-world problem (which may be complex) by a simplified model. A model with high bias pays very little attention to the training data and oversimplifies the problem, leading to high training and test errors. This is known as **underfitting**.
- **Variance**: Variance refers to the error introduced by the model's sensitivity to small fluctuations in the training data. A model with high variance pays too much attention to the training data and captures noise as if it were a true pattern, leading to low training error but high test error. This is known as **overfitting**.

With a little theory, the concepts of bias and variance can be made more precise. For squared-error loss, the expected **total error** of a model decomposes into three components: (squared) bias, variance, and the **irreducible error**. The irreducible error is caused by noise in the data and cannot be reduced by any model. The goal of machine learning is to minimize the total error by finding the right balance between bias and variance.
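
For reference, the decomposition can be written out as follows. The notation beyond $\mathbf{F}$ is assumed here for illustration: the data is taken to be generated as $y = \mathbf{F}(x) + \varepsilon$ with zero-mean noise of variance $\sigma^2$, and $\hat{f}_D$ denotes the model obtained by training on a randomly drawn training set $D$. Under these assumptions, the expected squared error at a fixed input $x$ is

$$
\mathbb{E}_{D,\varepsilon}\big[(y - \hat{f}_D(x))^2\big]
= \underbrace{\big(\mathbb{E}_D[\hat{f}_D(x)] - \mathbf{F}(x)\big)^2}_{\text{bias}^2}
+ \underbrace{\mathbb{E}_D\big[\big(\hat{f}_D(x) - \mathbb{E}_D[\hat{f}_D(x)]\big)^2\big]}_{\text{variance}}
+ \underbrace{\sigma^2}_{\text{irreducible error}}
$$

The first term measures how far the *average* trained model is from the truth, the second how much the trained model moves around as the training set changes, and the third is the noise floor.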

### Detecting overfitting and underfitting
In practice, we diagnose **underfitting** and **overfitting** by comparing how well the model performs on the training set and on the test (or validation) set. (We've talked about the test set before; the validation set is introduced later in this notebook.)
- **Underfitting**: If the model performs poorly on both the training and test sets, it is likely underfitting. This means that the model is too simple to capture the underlying patterns in the data.
- **Overfitting**: If the model performs well on the training set but poorly on the test set, it is likely overfitting. This means that the model has learned the noise in the training data rather than the underlying patterns. This is arguably more dangerous than underfitting, because if we let our guard down we might think we have a good model when in fact we don't!
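
The comparison above can be sketched in a few lines. This is only an illustration under stated assumptions: the data is synthetic (generated with `make_classification`, with some labels deliberately flipped to act as noise), and a decision tree of varying depth stands in for "a model of varying complexity". Neither appears elsewhere in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (an assumption for this demo);
# flip_y=0.2 randomly reassigns about 20% of the labels to simulate noise.
X_demo, y_demo = make_classification(n_samples=1000, n_features=10,
                                     n_informative=5, flip_y=0.2,
                                     random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.2, random_state=0)

# Compare train and test accuracy for trees of increasing complexity.
# max_depth=None lets the tree grow until it fits the training set perfectly.
results = {}
for depth in (1, 4, None):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_tr, y_tr)
    results[depth] = (tree.score(X_tr, y_tr), tree.score(X_te, y_te))
    print(f"max_depth={depth}: train acc={results[depth][0]:.3f}, "
          f"test acc={results[depth][1]:.3f}")
```

A large gap between training and test accuracy (as for the unrestricted tree) is the signature of overfitting, while low accuracy on both (as one would expect for a very shallow tree) points to underfitting.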

## Validation sets

### Train-test split
Recall that we split our dataset into a training set and a test set. The training set is used to train the model, while the test set is used to evaluate the model's performance. The split should be carried out before any training or evaluation, and the test set should be kept well away from the training process. All data preprocessing (e.g. standardizing) and feature engineering (e.g. polynomial interaction terms) should be fitted on the training set *after* the split, and the same fitted transformations should be applied to the test set *only* at the time of evaluation. This ensures that no information about the test set leaks into training.
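
Here is a minimal sketch of that order of operations, using standardization as the preprocessing step. The data is random and purely illustrative, and the variable names (`X_demo`, `demo_scaler`, and so on) are made up for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Illustrative random data (not the notebook's dataset).
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
y_demo = rng.integers(0, 2, size=200)

# 1. Split first, before any preprocessing.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.2, random_state=0)

# 2. Learn the scaling statistics (mean, std) from the training set only.
demo_scaler = StandardScaler().fit(X_tr)
X_tr_scaled = demo_scaler.transform(X_tr)

# 3. At evaluation time, apply the *same* fitted transformation to the test set.
X_te_scaled = demo_scaler.transform(X_te)

print("train column means:", X_tr_scaled.mean(axis=0))  # essentially zero
print("test column means: ", X_te_scaled.mean(axis=0))  # close to, but not exactly, zero
```

The test-set means are not exactly zero because the scaler never saw the test data, which is precisely the point.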

### A three-way split
In fact, one often splits the original dataset into three parts:
1. **Training set** — Used to train the model.
2. **Validation set** — Used to tune hyperparameters and detect overfitting.
3. **Test set** — Used only once, at the very end, to evaluate final model performance.

Thus, a typical workflow would be:
1. Split the original dataset into a training set and a test set.
2. Split the training set into a (smaller) training set and a validation set.
3. Fit the model to the training set.
4. Evaluate (*not train*) the model on the validation set.
5. Tune hyperparameters based on validation set performance.
6. Repeat steps 3-5 until the model is satisfactory. (These can be thought of as the "training" steps.)
7. Evaluate the final model on the test set. (This is the "testing" step.)
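
The workflow above might look as follows in code. This is a sketch under assumptions: the data is synthetic, the model is a logistic regression, and the only hyperparameter being tuned is its regularization strength `C` over a small hand-picked grid. All variable names are made up for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data for illustration only.
X_demo, y_demo = make_classification(n_samples=1000, n_features=10,
                                     random_state=0)

# Step 1: hold out a test set (20% of the data).
X_trainval, X_test_demo, y_trainval, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0)

# Step 2: split what remains into a smaller training set and a validation set.
X_train_demo, X_val_demo, y_train_demo, y_val_demo = train_test_split(
    X_trainval, y_trainval, test_size=0.2, random_state=0)
print("sizes (train, val, test):",
      len(X_train_demo), len(X_val_demo), len(X_test_demo))

# Steps 3-6: fit on the training set, score on the validation set,
# and keep the hyperparameter value with the best validation accuracy.
best_C, best_val_acc = None, -1.0
for C in (0.01, 0.1, 1.0, 10.0):
    candidate = LogisticRegression(C=C, max_iter=1000)
    candidate.fit(X_train_demo, y_train_demo)
    val_acc = candidate.score(X_val_demo, y_val_demo)
    if val_acc > best_val_acc:
        best_C, best_val_acc = C, val_acc

# Step 7: evaluate the chosen model on the test set, once.
final_model = LogisticRegression(C=best_C, max_iter=1000)
final_model.fit(X_train_demo, y_train_demo)
test_acc = final_model.score(X_test_demo, y_test_demo)
print(f"best C={best_C}, validation acc={best_val_acc:.3f}, test acc={test_acc:.3f}")
```

Note that the test set is touched exactly once, after all tuning decisions have been made.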

### Why not just use the test set for validation?
It's tempting to skip the hassle of a validation set and simply do the following:
- You try different hyperparameters or model choices.
- You evaluate each one on the test set.
- You pick the one that performs best on the test set.

The problem with this is that it *leaks* information from the test set into the model. By indirectly optimizing for the test set, you can end up with a model that performs well on the test set but poorly on new data. That is, it is possible the model has learned to "cheat" by memorizing the test set rather than learning the underlying patterns in the data. In other words, we are back to square one: we have a model that (may be) overfitting to the test set, and we have no way of knowing how well it will perform on new data!

**Remark.** Here is a neat analogy provided by ChatGPT. Think of it like this:
- **Training set**: Studying for an exam.
- **Validation set**: Practice quizzes to decide how to study or what strategy works.
- **Test set**: The final exam — you don’t want to have seen it before!

## Cross-validation

### What is it?
If we start off with a small dataset, we may not have enough data to split into three sets. Indeed, the test set will (typically) take up 20\% of the original dataset, and the validation set will take up 20\% of the training set, leaving us with an actual training set of only 64\% of the original dataset, which may be too small to train a good model.

One way to get around this is to use **cross-validation**. Cross-validation is a technique for assessing how the results of a statistical analysis will generalize to an independent dataset. The basic idea is to:
- split the dataset into $k$ subsets (or "folds"),
- train the model on $k-1$ of the folds, and
- use the remaining fold for validation.

We repeat this process $k$ times, each time using a different fold for validation. Note that this effectively creates $k$ distinct training sets, which overlap in their data points.

After training and validating the model on all $k$ folds, we can average the validation scores to get an overall estimate of the model's performance. This is a more robust estimate than using a single validation set, as it takes into account the variability in the data. Moreover, it provides some confidence that the particular choice of validation set did not unduly influence the results.
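
To make the mechanics concrete, here is a hand-rolled sketch of the procedure using plain NumPy index manipulation. It is an illustration only: the data is synthetic, the model (a logistic regression scored by accuracy) is an arbitrary choice, and all variable names are made up for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data for illustration only.
X_demo, y_demo = make_classification(n_samples=500, n_features=8,
                                     random_state=0)

# Shuffle the row indices and cut them into k roughly equal folds.
k = 5
rng = np.random.default_rng(0)
folds = np.array_split(rng.permutation(len(X_demo)), k)

val_scores = []
for i in range(k):
    # Fold i is held out for validation; the other k-1 folds form the training set.
    val_idx = folds[i]
    train_idx = np.concatenate([folds[j] for j in range(k) if j != i])

    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_demo[train_idx], y_demo[train_idx])
    val_scores.append(clf.score(X_demo[val_idx], y_demo[val_idx]))

# The average is the cross-validated estimate; the spread indicates how much
# the score depends on which fold happened to be held out.
print("per-fold accuracy:", np.round(val_scores, 3))
print(f"mean = {np.mean(val_scores):.3f}, std = {np.std(val_scores):.3f}")
```

Every data point is used for validation exactly once and for training $k-1$ times.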

### Use cross-validation when:
- Your dataset is small or moderately sized, and you want to make the most of your data.
- You want a more reliable estimate of model performance.
- You're comparing multiple models or hyperparameters and want to avoid picking one that performs well due to chance.
- You're preparing for model selection before using a final test set.

In [1]:
import pandas as pd
import numpy as np

df = pd.read_csv('../midterm/data/presidential_election_dataset.csv')
df_description = pd.read_csv('../midterm/data/data_dictionary.csv')

# group features by category for easier access
def features_in(category):
    """Return the feature names that the data dictionary assigns to `category`."""
    return df_description.loc[df_description['category'] == category, 'feature'].tolist()

idx = features_in('id')
sex_age_edus = features_in('sex ~ age ~ education')
sex_age_races = features_in('sex ~ age ~ race')
sex_maritals = features_in('sex ~ marital status')
households = features_in('household')
labors = features_in('labor force')
nativities = features_in('nativity')
sexes = features_in('sex')
incomes = features_in('income')
targets = features_in('target')

# possible values of age, edu, race
ages = ['18_24', 
        '25_34', 
        '35_44', 
        '45_64', 
        '65_plus']
edus = ['less_than_9th', 
        'some_hs', 
        'hs_grad', 
        'some_college', 
        'associates', 
        'bachelors', 
        'graduate']
races = ['black',
         'white',
         'aian',
         'asian',
         'nhpi',
         'multi',
         'other']

In [3]:
df_description.head(20)

| | feature | description | category |
|---|---|---|---|
| 0 | year | Year of presidential popular election | year |
| 1 | gisjoin | Geographic identifier for joining with other d... | id |
| 2 | state | State name | id |
| 3 | county | County name | id |
| 4 | persons_total | Persons: Total | Persons: total |
| 5 | persons_male | Persons: Male | sex |
| 6 | persons_female | Persons: Female | sex |
| 7 | persons_hispanic | Persons: Hispanic or Latino | ethnicity |
| 8 | households_total | Households: Total | household |
| 9 | male_never_married | Males ~ never married | sex ~ marital status |


In [8]:
df[sex_maritals]

| | male_never_married | male_married | male_separated | male_widowed | male_divorced | female_never_married | female_married | female_separated | female_widowed | female_divorced |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5553 | 11814 | 507 | 435 | 1998 | 5035 | 11707 | 538 | 2095 | 2721 |
| 1 | 16489 | 43600 | 1014 | 1928 | 6754 | 13167 | 43274 | 1404 | 7626 | 9065 |
| 2 | 4406 | 5588 | 410 | 285 | 1764 | 2679 | 5160 | 558 | 1426 | 1290 |
| 3 | 3015 | 5515 | 677 | 145 | 1208 | 1794 | 4229 | 122 | 1083 | 1308 |
| 4 | 5012 | 13933 | 389 | 571 | 2626 | 3661 | 13523 | 668 | 2916 | 2832 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 12407 | 5364 | 9284 | 315 | 604 | 1988 | 3544 | 8787 | 303 | 1524 | 2135 |
| 12408 | 3847 | 5355 | 97 | 54 | 1024 | 2855 | 5430 | 45 | 422 | 887 |
| 12409 | 2241 | 4544 | 71 | 261 | 1013 | 1524 | 4453 | 146 | 484 | 1289 |
| 12410 | 837 | 1842 | 35 | 140 | 462 | 576 | 1767 | 20 | 290 | 472 |


In [10]:
# choose some features, e.g. sex ~ marital status
X = df[sex_maritals]

# create a "winner" column with 1 if 'democrat' > 'republican', 0 otherwise
# (vectorized comparison; equivalent to a row-wise apply but much faster)
y = (df['democrat'] > df['republican']).astype(int)

In [12]:
X.shape, y.shape

((12412, 10), (12412,))

In [14]:
# Import necessary modules
from sklearn.model_selection import KFold, StratifiedKFold

# -------------------------------
# Example 1: Using KFold
# -------------------------------
print("Accessing individual folds using KFold:\n")
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# Iterate through each fold generated by KFold
for fold, (train_index, test_index) in enumerate(kf.split(X), 1):
    print(f"Fold {fold}:")
    # Display the first 10 training and test indices for brevity
    print("  Training indices (first 10):", train_index[:10])
    print("  Test indices (first 10):", test_index[:10])
    
    # Extract the training and test sets using the indices with .iloc for positional indexing
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    
    # Here you can train your model on X_train, y_train and evaluate on X_test, y_test
    # For demonstration, we'll just print the shapes of these sets.
    print("  X_train shape:", X_train.shape)
    print("  X_test shape:", X_test.shape, "\n")

Accessing individual folds using KFold:

Fold 1:
  Training indices (first 10): [ 1  2  4  5  6  7  9 10 11 12]
  Test indices (first 10): [ 0  3  8 14 17 19 31 33 35 39]
  X_train shape: (9929, 10)
  X_test shape: (2483, 10) 

Fold 2:
  Training indices (first 10): [0 1 2 3 4 5 6 7 8 9]
  Test indices (first 10): [10 12 20 23 29 30 32 36 37 42]
  X_train shape: (9929, 10)
  X_test shape: (2483, 10) 

Fold 3:
  Training indices (first 10): [0 1 2 3 4 5 6 7 8 9]
  Test indices (first 10): [15 26 27 28 34 44 51 66 69 75]
  X_train shape: (9930, 10)
  X_test shape: (2482, 10) 

Fold 4:
  Training indices (first 10): [ 0  1  3  4  5  8  9 10 11 12]
  Test indices (first 10): [ 2  6  7 18 22 24 25 40 49 52]
  X_train shape: (9930, 10)
  X_test shape: (2482, 10) 

Fold 5:
  Training indices (first 10): [ 0  2  3  6  7  8 10 12 14 15]
  Test indices (first 10): [ 1  4  5  9 11 13 16 21 38 54]
  X_train shape: (9930, 10)
  X_test shape: (2482, 10) 



In [18]:
# -------------------------------
# Example 2: Using StratifiedKFold
# -------------------------------
print("Accessing individual folds using StratifiedKFold:\n")
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=43)

# Iterate through each fold generated by StratifiedKFold
for fold, (train_index, test_index) in enumerate(skf.split(X, y), 1):
    print(f"Fold {fold}:")
    print("  Training indices (first 10):", train_index[:10])
    print("  Test indices (first 10):", test_index[:10])
    
    # Extract training and test sets
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    
    # With stratified folds, each test fold has approximately the same class distribution as the full dataset.
    print("  X_train shape:", X_train.shape)
    print("  X_test shape:", X_test.shape, "\n")

Accessing individual folds using StratifiedKFold:

Fold 1:
  Training indices (first 10): [ 0  4  7  8  9 11 12 14 15 16]
  Test indices (first 10): [ 1  2  3  5  6 10 13 21 24 28]
  X_train shape: (9929, 10)
  X_test shape: (2483, 10) 

Fold 2:
  Training indices (first 10): [ 0  1  2  3  4  5  6  9 10 12]
  Test indices (first 10): [ 7  8 11 14 19 20 27 29 36 39]
  X_train shape: (9929, 10)
  X_test shape: (2483, 10) 

Fold 3:
  Training indices (first 10): [0 1 2 3 4 5 6 7 8 9]
  Test indices (first 10): [12 16 17 18 22 25 32 34 35 37]
  X_train shape: (9930, 10)
  X_test shape: (2482, 10) 

Fold 4:
  Training indices (first 10): [ 0  1  2  3  5  6  7  8 10 11]
  Test indices (first 10): [ 4  9 15 23 26 30 31 33 38 40]
  X_train shape: (9930, 10)
  X_test shape: (2482, 10) 

Fold 5:
  Training indices (first 10): [ 1  2  3  4  5  6  7  8  9 10]
  Test indices (first 10): [  0  45  62  71  78  85  91  96  99 103]
  X_train shape: (9930, 10)
  X_test shape: (2482, 10) 
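
The shapes printed above look the same for both splitters; the difference lies in the class balance inside each fold. The following sketch makes that visible on a deliberately imbalanced toy target (10% positives). The toy arrays `X_imb`, `y_imb` and the helper `positive_rates` are made up for this illustration and are not part of the election dataset.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Toy imbalanced target: 100 positives and 900 negatives.
y_imb = np.array([1] * 100 + [0] * 900)
X_imb = np.zeros((len(y_imb), 1))  # features are irrelevant for this demo

def positive_rates(splitter):
    """Fraction of positive labels in each held-out fold produced by `splitter`."""
    return [y_imb[val_idx].mean() for _, val_idx in splitter.split(X_imb, y_imb)]

plain_rates = positive_rates(KFold(n_splits=5, shuffle=True, random_state=0))
strat_rates = positive_rates(StratifiedKFold(n_splits=5, shuffle=True, random_state=0))

print("overall positive rate:", y_imb.mean())
print("KFold fold rates:          ", np.round(plain_rates, 3))
print("StratifiedKFold fold rates:", np.round(strat_rates, 3))
```

With plain `KFold` the positive rate drifts from fold to fold, whereas `StratifiedKFold` keeps it at the overall rate in every fold. This matters most when one class is rare.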



You can also combine the cross-validation splitters with the `cross_val_score` function from `sklearn`, which performs the whole cross-validation loop for you. It lets you evaluate a model with cross-validation without manually splitting the data into training and validation sets.

In [19]:
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

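# NOTE: fitting the scaler on all of X before cross-validation lets every
# validation fold influence the scaling statistics (a mild form of leakage).
# Wrapping the scaler and the model in a sklearn Pipeline avoids this.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)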

model = LogisticRegression(random_state=42,
                           max_iter=1000)

# KFold splits the data into 5 parts (folds) randomly. 
# It does not take the distribution of classes into account.
# ----------------------------------------------------------------------
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# Evaluate the model using cross_val_score which performs training
# and validation in each fold and returns (negative) log-loss.
scores_kf = cross_val_score(model,      # Logistic Regression model
                            X_scaled,   # Scaled features
                            y,          # Target variable
                            cv=kf,      # KFold cross-validation object
                            scoring='neg_log_loss'  # Negative log-loss: higher (closer to 0) is better
                            )

# Print the log-loss for each fold and the mean log-loss across all 5 folds.
print("Standard 5-Fold CV (negative) log-loss:", scores_kf)
print("Mean (negative) log-loss:", np.mean(scores_kf))

Standard 5-Fold CV (negative) log-loss: [-0.42081235 -0.41270505 -0.40442961 -0.41632738 -0.42678047]
Mean (negative) log-loss: -0.41621097219701236


In [20]:
# StratifiedKFold ensures that each fold has approximately the same 
# percentage of samples of each target class as the complete set.
# This is particularly useful for classification with imbalanced classes.
# ----------------------------------------------------------------------

model = LogisticRegression(random_state=42,
                           max_iter=1000)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Evaluate the model using stratified cross-validation.
scores_skf = cross_val_score(model, X_scaled, y, cv=skf, scoring='neg_log_loss')

# Print the (negative) log-loss for each stratified fold and the mean across folds.
print("\nStratified 5-Fold CV (negative) log-loss:", scores_skf)
print("Mean (negative) log-loss:", np.mean(scores_skf))


Stratified 5-Fold CV (negative) log-loss: [-0.42435809 -0.43003124 -0.40638417 -0.39803823 -0.41940113]
Mean (negative) log-loss: -0.4156425732193837
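
One caveat about the two cells above: the features were standardized once, on the full dataset, before cross-validation, so each validation fold contributed to the scaling statistics. A common way to keep preprocessing inside the cross-validation loop is to wrap the scaler and the model in a `Pipeline`, so the scaler is re-fitted on the training folds of each split. The sketch below uses synthetic data so that it runs on its own; the names `X_demo`, `y_demo`, `pipe`, and `skf_demo` are made up for this example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the (unscaled) feature matrix and binary target.
X_demo, y_demo = make_classification(n_samples=500, n_features=10,
                                     random_state=0)

# The pipeline is treated as a single estimator: within each split,
# the scaler is fitted on the training folds only, then applied to the held-out fold.
pipe = make_pipeline(StandardScaler(),
                     LogisticRegression(random_state=42, max_iter=1000))

skf_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores_pipe = cross_val_score(pipe, X_demo, y_demo,
                              cv=skf_demo, scoring='neg_log_loss')

print("Pipeline 5-fold CV (negative) log-loss:", scores_pipe)
print("Mean (negative) log-loss:", scores_pipe.mean())
```

In this notebook, the same pattern would amount to passing the unscaled `X` and `y` to `cross_val_score` together with the pipeline. For simple standardization on a dataset of this size the scores are likely to change very little, but the pattern matters more for heavier preprocessing.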
