# Model Training

This notebook covers problem formulation, model selection, training, and evaluation.

## Problem Formulation

In this section I work out the sliding window and the prediction horizon. I couldn't find any mention of the recording frequency for this dataset, so I assume the data was recorded at 1-minute intervals. The exact interval matters little for the method itself; the 1-minute interpretation mainly makes the experiment easier to read and reason about.

In [4]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [5]:
train_df = pd.read_csv('../data/train_clean.csv')
test_df = pd.read_csv('../data/test_clean.csv')

### Prediction Horizon H

Determining the size of H presents a trade-off:
- Large H (e.g. 60 min): great for operators, because they get a lot of time to fix the issue, but prediction accuracy drops significantly. It is very hard to predict a crash 1 hour in advance from minute-level metrics.
- Small H (e.g. 1 min): high prediction accuracy, but useless for operations. By the time the alert fires, the server is already dead.

For this project, I set H to 5 minutes. I treat this as a typical incident reaction time under an SLA, so it should give operators enough time to react to the incident.
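
To make the effect of H on the target concrete, here is a minimal sketch on a toy label array. The helper horizon_labels is hypothetical and not part of this notebook's pipeline; it marks a time step as positive when any of the next H labels is 1. The toy array is made up for illustration.

```python
import numpy as np

def horizon_labels(labels, horizon):
    """Return 1 at position t if any of labels[t : t + horizon] is 1 (hypothetical helper)."""
    labels = np.asarray(labels)
    n = len(labels) - horizon + 1  # positions that have a full look-ahead window
    return np.array([int(labels[t:t + horizon].any()) for t in range(n)])

# Toy data: a single incident at index 6.
toy = np.array([0, 0, 0, 0, 0, 0, 1, 0, 0, 0])

for h in (1, 3, 5):
    y = horizon_labels(toy, h)
    print(f"H={h}: labels={y.tolist()}, positive rate={y.mean():.2f}")
```

A larger H turns more time steps into positives: the alert can fire earlier, but the model has to recognize an incident from data that is further away from it.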

### Sliding Window W

This parameter also presents a trade-off:
- Small W (e.g. 1-2 min): The model sees only the current spike and cannot detect longer trends.
- Large W (e.g. 60+ min): The model has a clear picture of ongoing changes, but these can also be irrelevant or misleading.

Cloud infrastructure failures can develop within a short time (shock failures) but also over a longer period, say 10-20 minutes. Selecting an exact value can be tricky, so in this project I use two different values for W, train a model for each, and compare the results at the end. The two values I chose are:
- 12 minutes: should be long enough to capture some longer trends while maintaining noise resistance.
- 25 minutes: sees further back. Could catch a slower trend that W = 12 misses, but might also get diluted with noise.
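
The trade-off between the two window sizes can be sketched on two made-up signals. Both signals and both helpers (window_mean, window_change) are hypothetical illustrations, not features used later in this notebook. The first signal jumps only in the last 3 minutes, the second drifts slowly over the last 20 minutes.

```python
import numpy as np

# Hypothetical metric 1: flat at 0, then a jump to 1 during the last 3 minutes.
jump = np.zeros(60)
jump[-3:] = 1.0

# Hypothetical metric 2: flat at 0, then a linear drift from 0 to 1 over the last 20 minutes.
drift = np.concatenate([np.zeros(40), np.linspace(0.0, 1.0, 20)])

def window_mean(series, window_size):
    """Mean of the most recent window_size points."""
    return float(np.mean(series[-window_size:]))

def window_change(series, window_size):
    """Difference between the newest and the oldest point inside the window."""
    window = series[-window_size:]
    return float(window[-1] - window[0])

for w in (12, 25):
    print(f"W={w}: jump mean={window_mean(jump, w):.3f}, "
          f"drift change={window_change(drift, w):.3f}")
```

The short jump moves the mean of the 12-minute window more than that of the 25-minute window (it is diluted by older points), while only the 25-minute window covers the whole 20-minute drift.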

For this purpose, I implement the function below, which:
1. Slides a window of size 'W' over the data.
2. Calculates rolling stats (Mean, Std, Delta, Z-Score) for that window.
3. Checks the next 'H' steps to define the Label (1 if crash, 0 if safe).

In [6]:
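def create_supervised_dataset(data_df, labels, window_size, horizon):
    """Build (X, y) from rolling-window stats; y is 1 if any label in the next `horizon` steps is 1.

    Requires window_size >= 2, because the delta uses the last two points of the window.
    """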
    X_list = []
    y_list = []
    
    if 'label' in data_df.columns:
        feature_data = data_df.drop(columns=['label']).values
    else:
        feature_data = data_df.values
        
    labels = labels.values
    
    print(f"Generating features (W={window_size}, H={horizon})...")
    
    for t in range(window_size, len(feature_data) - horizon):
        
        # The Input Window 
        window = feature_data[t-window_size : t]
        
        # Calculate stats for every feature
        w_mean = np.mean(window, axis=0)
        w_std  = np.std(window, axis=0)
        
        # The change between the very last point and the one before it
        w_delta = window[-1] - window[-2]
        
        # The last point relative to the window's stats
        with np.errstate(divide='ignore', invalid='ignore'):
            w_zscore = (window[-1] - w_mean) / w_std
        w_zscore = np.nan_to_num(w_zscore) 
        
        # Combine all stats into one long vector for this row
        features_vector = np.concatenate([w_mean, w_std, w_delta, w_zscore])
        
        # The Target Window
        future_window = labels[t : t+horizon]
        
        # If an anomaly happens in the future window, label = 1
        is_crash = 1 if np.sum(future_window) > 0 else 0
        
        X_list.append(features_vector)
        y_list.append(is_crash)
        
    return np.array(X_list), np.array(y_list)
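
As a small worked example of what happens inside the loop, the sketch below computes the four statistics for a single toy window. The window values are made up; the second feature is constant on purpose, to show how a zero standard deviation is handled.

```python
import numpy as np

# Toy window: 4 time steps (rows) x 2 features (columns).
# The second feature is constant, so its standard deviation is 0.
window = np.array([
    [1.0, 5.0],
    [2.0, 5.0],
    [3.0, 5.0],
    [4.0, 5.0],
])

w_mean = np.mean(window, axis=0)
w_std = np.std(window, axis=0)
w_delta = window[-1] - window[-2]

# 0/0 produces NaN for the constant feature; nan_to_num maps it to 0.
with np.errstate(divide="ignore", invalid="ignore"):
    w_zscore = np.nan_to_num((window[-1] - w_mean) / w_std)

features_vector = np.concatenate([w_mean, w_std, w_delta, w_zscore])
print(features_vector.shape)
print(features_vector)
```

Each input feature contributes four values to the output row, so the length of the feature vector is 4 times the number of input columns.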

Next, I create the two proposed datasets with the two values for W. 

In [7]:
H = 5
W_short = 12
W_long = 25

# Prepare Training Data
print("--- Creating Training Sets ---")
X_train_short, y_train_short = create_supervised_dataset(train_df, train_df['label'], W_short, H)
X_train_long,  y_train_long  = create_supervised_dataset(train_df, train_df['label'], W_long, H)

# Prepare Test Data
print("\n--- Creating Test Sets ---")
X_test_short, y_test_short = create_supervised_dataset(test_df, test_df['label'], W_short, H)
X_test_long,  y_test_long  = create_supervised_dataset(test_df, test_df['label'], W_long, H)

# Validation
print("\n--- Shape Verification ---")
print(f"Short Window (W={W_short}): {X_train_short.shape[0]} rows, {X_train_short.shape[1]} features per row")
print(f"Long Window  (W={W_long}): {X_train_long.shape[0]} rows, {X_train_long.shape[1]} features per row")

--- Creating Training Sets ---
Generating features (W=12, H=5)...
Generating features (W=25, H=5)...

--- Creating Test Sets ---
Generating features (W=12, H=5)...
Generating features (W=25, H=5)...

--- Shape Verification ---
Short Window (W=12): 19918 rows, 124 features per row
Long Window  (W=25): 19905 rows, 124 features per row
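
The printed shapes can be sanity-checked with a bit of arithmetic. The sketch below assumes that the training file has 19935 rows; this number is inferred from the shapes above and is not read from the file. The helper expected_rows is hypothetical and mirrors the loop range range(window_size, len(feature_data) - horizon).

```python
def expected_rows(n_rows, window_size, horizon):
    """Number of values in range(window_size, n_rows - horizon)."""
    return max(n_rows - horizon - window_size, 0)

N_TRAIN_ROWS = 19935  # assumption: inferred from the printed shapes, not read from the file
N_STATS = 4           # mean, std, delta, z-score

for w, reported in ((12, 19918), (25, 19905)):
    print(f"W={w}: expected {expected_rows(N_TRAIN_ROWS, w, 5)}, reported {reported}")

# 124 output columns / 4 statistics per input column
print("input feature columns:", 124 // N_STATS)
```

Both row counts are consistent with the same input length, and 124 columns correspond to 31 input features with four statistics each.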


## Model Selection and Training

Based on information observed during EDA as well as the nature of the problem, I have chosen **Random Forest Classifier** for this task. 
The main reasons are:
1. **Non-Linear features:** In the bivariate analysis, I observed that some features (like feat_8 or feat_29) remain at 0.0 for long periods and spike suddenly during an incident. Linear models like Logistic Regression struggle to model these sharp cliffs. Decision Trees naturally model these non-linear thresholds.
2. **Interaction Effects:** Some features (like feat_15 or feat_22) appeared stable but are likely informative when combined with other indicators. Random Forests automatically learn these interaction effects without requiring manual feature combination.
3. **Robustness to Noise:** The cloud metrics contain jitter and spikes that are not incidents. By averaging the predictions of multiple trees, the Random Forest reduces the variance and is less likely to overfit to individual noisy data points compared to a single Decision Tree.
4. **Efficiency for Retraining:** The project description requires a system capable of periodic retraining. Random Forests parallelize training easily and are significantly faster and cheaper to train than recurrent neural networks.
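
The point about interaction effects can be illustrated on synthetic data. This is only a sketch under stated assumptions: the data is generated, not taken from this project, and the label depends purely on the combination of two features (an XOR of their signs). It uses scikit-learn's LogisticRegression and RandomForestClassifier with mostly default settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic data: the label depends only on the combination of two features
# (XOR of their signs); neither feature is informative on its own.
X = rng.uniform(-1.0, 1.0, size=(2000, 2))
y = ((X[:, 0] > 0) ^ (X[:, 1] > 0)).astype(int)

X_train, X_test = X[:1500], X[1500:]
y_train, y_test = y[:1500], y[1500:]

linear = LogisticRegression().fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

linear_acc = linear.score(X_test, y_test)
forest_acc = forest.score(X_test, y_test)
print(f"logistic regression accuracy: {linear_acc:.2f}")
print(f"random forest accuracy:       {forest_acc:.2f}")
```

On data like this, a linear model without hand-made interaction terms stays close to chance, while the forest picks up the interaction from the raw features.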


### Training Strategy

I will train two separate models to compare the impact of the window size:
1. Model A: Trained on W = 12
2. Model B: Trained on W = 25

As noted in the data analysis, the anomaly rate is approximately 10%. To make the model focus on capturing incidents (recall) rather than just maximizing accuracy, I will use class_weight='balanced'. This weights each class inversely to its frequency, so at a 10% anomaly rate an error on an incident row weighs roughly nine times as much as an error on a normal row.
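
A minimal sketch of this setup on synthetic data is shown below. The data, the split point, and the hyperparameters (n_estimators=100, random_state=0) are assumptions for illustration, not the values used in this project. The weights are computed with scikit-learn's compute_class_weight, which follows the documented 'balanced' heuristic n_samples / (n_classes * class_count).

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.utils.class_weight import compute_class_weight

rng = np.random.default_rng(0)

# Synthetic stand-in for the windowed features: 2000 rows, about 10% positives.
n_rows, n_features = 2000, 8
y = (rng.random(n_rows) < 0.10).astype(int)
X = rng.normal(size=(n_rows, n_features))
X[y == 1, 0] += 2.0  # make one feature informative for the positive class

# Weights implied by class_weight='balanced'.
weights = compute_class_weight("balanced", classes=np.array([0, 1]), y=y)
print("class weights:", dict(zip([0, 1], weights.round(2))))

# Split in row order, without shuffling (the real data is a time series).
split = 1500
model = RandomForestClassifier(
    n_estimators=100,
    class_weight="balanced",
    n_jobs=-1,        # train the trees in parallel
    random_state=0,
)
model.fit(X[:split], y[:split])
print("recall on held-out rows:", recall_score(y[split:], model.predict(X[split:])))
```

The same pattern would be applied twice, once per window size, so that Model A and Model B differ only in their input features.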