## **Data Splitting Strategies**

Once the feature-engineered dataset is ready, the next step is to split it into separate sets so that the model is evaluated on data it did not see during training. This gives a more reliable estimate of its generalization performance. This section covers five common data splitting methods, with their use cases, advantages, and disadvantages.

### **1. Simple Train/Test Split**

This is the most basic and common method of splitting data. The dataset is divided into two parts: a training set (used to train the model) and a test set (used to evaluate the model's performance on unseen data).

**When to Use:**
*   When the dataset is large enough that a single split already yields a representative test set.
*   As a quick initial evaluation of a model.

**Pros:**
*   Simple to implement and understand.
*   Fast to execute.
*   Provides a straightforward estimate of model performance on unseen data.

**Cons:**
*   The performance estimate can depend heavily on which samples happen to land in the test set, especially for small datasets. If the split is not random or representative, the test score may not reflect performance on real-world data.
*   Less training data: the held-out samples are never used for training, which can lead to a less robust model if the dataset is small.

In [None]:
import numpy as np
from sklearn.model_selection import train_test_split

# Generate a dummy dataset
X = np.random.rand(100, 5) # 100 samples, 5 features
y = (X[:, 0] + X[:, 1] > 1).astype(int) # Binary target variable

print(f"Original dataset shape: X={X.shape}, y={y.shape}")

# Perform a simple train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

print(f"\nTrain set shape: X_train={X_train.shape}, y_train={y_train.shape}")
print(f"Test set shape: X_test={X_test.shape}, y_test={y_test.shape}")

Original dataset shape: X=(100, 5), y=(100,)

Train set shape: X_train=(70, 5), y_train=(70,)
Test set shape: X_test=(30, 5), y_test=(30,)
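
To illustrate how much a single split's score can depend on the particular split (the first drawback listed above), the following sketch repeats the split with several seeds and compares the resulting accuracies. The seeded synthetic data, the `LogisticRegression` model, the use of `stratify`, and the choice of 20 seeds are assumptions made for illustration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data for illustration (same recipe as the example above, but seeded)
rng = np.random.default_rng(0)
X_demo = rng.random((100, 5))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 1).astype(int)

split_scores = []
for seed in range(20):
    # stratify=y_demo keeps the class proportions similar in train and test
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, test_size=0.3, random_state=seed, stratify=y_demo
    )
    clf = LogisticRegression().fit(X_tr, y_tr)
    split_scores.append(accuracy_score(y_te, clf.predict(X_te)))

print(f"Accuracy over 20 different splits: "
      f"min={min(split_scores):.3f}, max={max(split_scores):.3f}, "
      f"std={np.std(split_scores):.3f}")
```

A wide gap between the minimum and maximum suggests that one split alone is not a trustworthy estimate for a dataset of this size.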


### **2. K-Fold Cross-Validation**

K-Fold Cross-Validation is a more robust method that divides the dataset into *k* folds of approximately equal size. The model is then trained *k* times. In each iteration, one fold is used as the test set, and the remaining *k-1* folds are used as the training set. The final performance metric is the average of the *k* evaluation scores.
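
The fold mechanics described above can be sketched on a tiny index array. The sizes (10 samples, 5 folds) are arbitrary choices for readability, and shuffling is left off so the folds are easy to follow.

```python
import numpy as np
from sklearn.model_selection import KFold

n_samples, k = 10, 5
X_small = np.arange(n_samples).reshape(-1, 1)

test_counts = np.zeros(n_samples, dtype=int)
train_counts = np.zeros(n_samples, dtype=int)

# Without shuffling, KFold assigns consecutive blocks of indices to each test fold
for fold, (train_idx, test_idx) in enumerate(KFold(n_splits=k).split(X_small), start=1):
    print(f"Fold {fold}: test={test_idx.tolist()}, train={train_idx.tolist()}")
    test_counts[test_idx] += 1
    train_counts[train_idx] += 1

# Each sample is tested exactly once and used for training k-1 times
print("Times each sample was tested:    ", test_counts.tolist())
print("Times each sample was trained on:", train_counts.tolist())
```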

**When to Use:**
*   When you want a more reliable estimate of your model's performance.
*   When the dataset is not extremely large, and you want to make the most of your data for training and evaluation.
*   To compare different models and select the best performing one.

**Pros:**
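*   Provides a more stable estimate of model performance than a single train/test split, because the score is averaged over *k* different test sets rather than taken from one.
*   Every data point is in a test set exactly once and in a training set *k-1* times, so all of the data is used for both training and evaluation.
*   The spread of the *k* scores (for example, their standard deviation) shows how sensitive the result is to the particular split.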

**Cons:**
*   Computationally more expensive than a simple train/test split because the model is trained *k* times.
*   Can be slow for very large datasets or complex models.

In [None]:
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Initialize KFold cross-validator
kf = KFold(n_splits=5, shuffle=True, random_state=42)

model = LogisticRegression(random_state=42)

accuracy_scores = []

print("Performing K-Fold Cross-Validation (k=5):")
for fold, (train_index, test_index) in enumerate(kf.split(X)):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    accuracy_scores.append(accuracy)

    print(f"  Fold {fold+1}: Accuracy = {accuracy:.4f}")

print(f"\nAverage Accuracy: {np.mean(accuracy_scores):.4f}")
print(f"Standard Deviation of Accuracy: {np.std(accuracy_scores):.4f}")

Performing K-Fold Cross-Validation (k=5):
  Fold 1: Accuracy = 0.8000
  Fold 2: Accuracy = 0.9500
  Fold 3: Accuracy = 1.0000
  Fold 4: Accuracy = 0.9000
  Fold 5: Accuracy = 0.9500

Average Accuracy: 0.9200
Standard Deviation of Accuracy: 0.0678
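
The manual loop above can also be written more compactly with scikit-learn's `cross_val_score`. This is a minimal sketch on seeded synthetic data (an assumption for reproducibility, not the dataset used above), so the scores will differ from the output shown.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(42)
X_cv = rng.random((100, 5))
y_cv = (X_cv[:, 0] + X_cv[:, 1] > 1).astype(int)

cv = KFold(n_splits=5, shuffle=True, random_state=42)

# cross_val_score clones the estimator, fits it on each training fold,
# and returns one score per test fold
cv_scores = cross_val_score(LogisticRegression(), X_cv, y_cv, cv=cv, scoring="accuracy")

print("Per-fold accuracy:", np.round(cv_scores, 4))
print(f"Mean = {cv_scores.mean():.4f}, Std = {cv_scores.std():.4f}")
```

The explicit loop remains useful when you need per-fold predictions or custom logging; the helper is convenient when only the scores matter.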


### **3. Stratified K-Fold Cross-Validation**

Stratified K-Fold Cross-Validation is a variation of K-Fold that ensures each fold has approximately the same percentage of samples for each target class as the complete set. This is particularly useful for classification problems with imbalanced datasets, where some classes are much rarer than others.

**When to Use:**
*   Primarily for classification tasks, especially with imbalanced datasets.
*   To ensure that each fold is representative of the overall class distribution.

**Pros:**
*   Maintains the proportion of target classes in each fold, leading to more reliable performance estimates for imbalanced datasets.
*   Reduces the risk of having folds with very few or no samples of a minority class.

**Cons:**
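*   Same computational cost as standard K-Fold Cross-Validation: the model is still trained *k* times.
*   Requires discrete class labels, so it is not directly applicable to regression problems (though a continuous target can be binned into quantiles and stratified on the bins).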
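
The quantile idea mentioned in the last point could look roughly like the sketch below. The synthetic regression target, the choice of 5 bins, and the variable names are assumptions for illustration; the bins are used only to build the folds, while the model would still be trained on the continuous target.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(0)
X_reg = rng.random((100, 3))
y_reg = rng.normal(size=100)  # continuous regression target

# Bin the continuous target into 5 quantile bins; the bin index acts as a pseudo-class
y_bins = pd.qcut(y_reg, q=5, labels=False)

skf_reg = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, test_idx) in enumerate(skf_reg.split(X_reg, y_bins), start=1):
    # Stratify on the bins, but evaluate on the original continuous target
    y_test_fold = y_reg[test_idx]
    print(f"Fold {fold}: test bin counts={np.bincount(y_bins[test_idx])}, "
          f"test target mean={y_test_fold.mean():.3f}")
```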

In [None]:
from sklearn.model_selection import StratifiedKFold

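# Create an imbalanced dummy dataset
# Note: the features are random noise unrelated to the labels, so this example
# only demonstrates how the folds are built; an accuracy near 0.90 simply
# mirrors the share of the majority class
X_imb = np.random.rand(100, 5)
y_imb = np.array([0]*90 + [1]*10) # 90% class 0, 10% class 1
np.random.shuffle(y_imb) # Shuffle the labels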

print(f"Original imbalanced dataset class distribution: {np.bincount(y_imb)}")

# Initialize Stratified KFold cross-validator
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

model_imb = LogisticRegression(random_state=42)
accuracy_scores_imb = []

print("\nPerforming Stratified K-Fold Cross-Validation (k=5) for imbalanced data:")
for fold, (train_index, test_index) in enumerate(skf.split(X_imb, y_imb)):
    X_train, X_test = X_imb[train_index], X_imb[test_index]
    y_train, y_test = y_imb[train_index], y_imb[test_index]

    # Check class distribution in each fold
    print(f"  Fold {fold+1} train class distribution: {np.bincount(y_train)}")
    print(f"  Fold {fold+1} test class distribution: {np.bincount(y_test)}")

    model_imb.fit(X_train, y_train)
    y_pred = model_imb.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    accuracy_scores_imb.append(accuracy)

    print(f"  Fold {fold+1}: Accuracy = {accuracy:.4f}")

print(f"\nAverage Stratified Accuracy: {np.mean(accuracy_scores_imb):.4f}")

Original imbalanced dataset class distribution: [90 10]

Performing Stratified K-Fold Cross-Validation (k=5) for imbalanced data:
  Fold 1 train class distribution: [72  8]
  Fold 1 test class distribution: [18  2]
  Fold 1: Accuracy = 0.9000
  Fold 2 train class distribution: [72  8]
  Fold 2 test class distribution: [18  2]
  Fold 2: Accuracy = 0.9000
  Fold 3 train class distribution: [72  8]
  Fold 3 test class distribution: [18  2]
  Fold 3: Accuracy = 0.9000
  Fold 4 train class distribution: [72  8]
  Fold 4 test class distribution: [18  2]
  Fold 4: Accuracy = 0.9000
  Fold 5 train class distribution: [72  8]
  Fold 5 test class distribution: [18  2]
  Fold 5: Accuracy = 0.9000

Average Stratified Accuracy: 0.9000
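
To see what stratification adds, the sketch below counts the minority-class samples that land in each test fold under plain `KFold` and under `StratifiedKFold`. The seeded data and the helper name `minority_per_test_fold` are hypothetical and used only for this comparison.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

rng = np.random.default_rng(1)
X_cmp = rng.random((100, 5))
y_cmp = np.array([0] * 90 + [1] * 10)  # 10 minority samples
rng.shuffle(y_cmp)

def minority_per_test_fold(splitter):
    """Count class-1 samples in each test fold produced by `splitter`."""
    return [int(y_cmp[test_idx].sum()) for _, test_idx in splitter.split(X_cmp, y_cmp)]

plain_counts = minority_per_test_fold(KFold(n_splits=5, shuffle=True, random_state=0))
strat_counts = minority_per_test_fold(StratifiedKFold(n_splits=5, shuffle=True, random_state=0))

print("Minority samples per test fold (KFold):          ", plain_counts)
print("Minority samples per test fold (StratifiedKFold):", strat_counts)
```

With 10 minority samples and 5 folds, the stratified splitter places exactly 2 in every test fold, whereas the plain splitter's counts depend on the shuffle.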


### **4. Time Series Split**

For time-dependent data (time series), standard random splits or K-Fold Cross-Validation are not appropriate because they would mix future data with past data, leading to data leakage and an overly optimistic performance estimate. Time series splits maintain the temporal order of the data.
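
The leakage described above can be checked directly by looking at the indices each splitter produces. In this sketch the row index doubles as the time step; the helper name `folds_training_on_future` and the dataset size are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, TimeSeriesSplit

X_time = np.arange(50).reshape(-1, 1)  # row index doubles as the time step

def folds_training_on_future(splitter):
    """Count folds whose training set contains a step later than the earliest test step."""
    return int(sum(
        train_idx.max() > test_idx.min()
        for train_idx, test_idx in splitter.split(X_time)
    ))

kfold_leaks = folds_training_on_future(KFold(n_splits=5, shuffle=True, random_state=0))
tss_leaks = folds_training_on_future(TimeSeriesSplit(n_splits=5))

print(f"Shuffled KFold:  {kfold_leaks} of 5 folds train on future data")
print(f"TimeSeriesSplit: {tss_leaks} of 5 folds train on future data")
```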

**When to Use:**
*   Exclusively for time series data where the order of observations matters (e.g., stock prices, sensor readings, weather data).

**Pros:**
*   Preserves the temporal order, preventing data leakage from future into past.
*   Provides a more realistic evaluation of a model's ability to forecast future events.

**Cons:**
*   The training sets grow with each split while the test sets stay the same size, so early folds are trained on much less data than later ones and the fold scores are not directly comparable.
*   Can be less stable if there are significant temporal shifts in the data that are not captured in early training folds.
*   The earliest training sets might be too small to adequately train a complex model.

In [None]:
from sklearn.model_selection import TimeSeriesSplit
import pandas as pd

# Generate a dummy time series dataset
dates = pd.date_range(start='2023-01-01', periods=100, freq='D')
X_ts = np.arange(100).reshape(-1, 1) + np.random.rand(100, 4) # Features
y_ts = np.sin(np.arange(100)/10) + np.random.rand(100) * 0.1 # Target

print(f"Original time series dataset shape: X={X_ts.shape}, y={y_ts.shape}")

# Initialize TimeSeriesSplit
tscv = TimeSeriesSplit(n_splits=5)

print("\nPerforming Time Series Split (n_splits=5):")
for fold, (train_index, test_index) in enumerate(tscv.split(X_ts)):
    X_train, X_test = X_ts[train_index], X_ts[test_index]
    y_train, y_test = y_ts[train_index], y_ts[test_index]

    print(f"  Fold {fold+1}:")
    print(f"    Train dates: {dates[train_index.min()]} to {dates[train_index.max()]}")
    print(f"    Test dates: {dates[test_index.min()]} to {dates[test_index.max()]}")
    print(f"    Train size: {len(train_index)}, Test size: {len(test_index)}")

Original time series dataset shape: X=(100, 4), y=(100,)

Performing Time Series Split (n_splits=5):
  Fold 1:
    Train dates: 2023-01-01 00:00:00 to 2023-01-20 00:00:00
    Test dates: 2023-01-21 00:00:00 to 2023-02-05 00:00:00
    Train size: 20, Test size: 16
  Fold 2:
    Train dates: 2023-01-01 00:00:00 to 2023-02-05 00:00:00
    Test dates: 2023-02-06 00:00:00 to 2023-02-21 00:00:00
    Train size: 36, Test size: 16
  Fold 3:
    Train dates: 2023-01-01 00:00:00 to 2023-02-21 00:00:00
    Test dates: 2023-02-22 00:00:00 to 2023-03-09 00:00:00
    Train size: 52, Test size: 16
  Fold 4:
    Train dates: 2023-01-01 00:00:00 to 2023-03-09 00:00:00
    Test dates: 2023-03-10 00:00:00 to 2023-03-25 00:00:00
    Train size: 68, Test size: 16
  Fold 5:
    Train dates: 2023-01-01 00:00:00 to 2023-03-25 00:00:00
    Test dates: 2023-03-26 00:00:00 to 2023-04-10 00:00:00
    Train size: 84, Test size: 16
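
`TimeSeriesSplit` also accepts `test_size`, `gap`, and `max_train_size` in recent scikit-learn releases, which address some of the drawbacks listed above. The sketch below uses a fixed-length sliding training window with a small gap before each test window; the specific values (30, 10, 2) are arbitrary assumptions.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X_win = np.arange(100).reshape(-1, 1)

# test_size fixes the length of every test window, gap skips samples between
# train and test, and max_train_size turns the expanding window into a sliding one
tscv_win = TimeSeriesSplit(n_splits=5, test_size=10, gap=2, max_train_size=30)
win_folds = list(tscv_win.split(X_win))

for fold, (train_idx, test_idx) in enumerate(win_folds, start=1):
    print(f"Fold {fold}: train={train_idx.min()}-{train_idx.max()} (n={len(train_idx)}), "
          f"test={test_idx.min()}-{test_idx.max()} (n={len(test_idx)})")
```

A gap is worth considering when features are built from lagged or rolling values, since samples right before the test window may otherwise share information with it.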


### **5. Group K-Fold Cross-Validation**

Group K-Fold Cross-Validation is used when your data has inherent groups where samples within a group are related (e.g., multiple medical records from the same patient, or observations from the same geographic region). It ensures that all samples from a single group appear together in either the training set or the test set, but never in both. This prevents data leakage that could occur if samples from the same group are present in both sets.

**When to Use:**
*   When data points are not independent and identically distributed (i.i.d.) due to underlying groups.
*   To avoid data leakage when a model might learn specific characteristics of a group if its members are split across training and test sets (e.g., patient-specific features).

**Pros:**
*   Prevents data leakage due to group dependencies.
*   Provides a more realistic evaluation of model performance on entirely unseen groups.

**Cons:**
*   Requires an explicit 'group' variable in the dataset, with at least as many distinct groups as there are folds.
*   Can lead to imbalanced fold sizes if group sizes vary significantly.
*   Computationally more expensive than a simple train/test split.

In [None]:
from sklearn.model_selection import GroupKFold

# Generate a dummy dataset with groups
X_group = np.random.rand(100, 5)
y_group = np.random.randint(0, 2, 100)
# Assume 10 distinct groups, with 10 samples each
groups = np.repeat(np.arange(10), 10)
np.random.shuffle(groups) # Shuffle to make sure groups are not ordered

print(f"Original grouped dataset shape: X={X_group.shape}, y={y_group.shape}, groups={groups.shape}")

# Initialize GroupKFold
gkf = GroupKFold(n_splits=5)

print("\nPerforming Group K-Fold Cross-Validation (n_splits=5):")
for fold, (train_index, test_index) in enumerate(gkf.split(X_group, y_group, groups)):
    X_train, X_test = X_group[train_index], X_group[test_index]
    y_train, y_test = y_group[train_index], y_group[test_index]
    groups_train, groups_test = groups[train_index], groups[test_index]

    print(f"  Fold {fold+1}:")
    print(f"    Train size: {len(train_index)}, Test size: {len(test_index)}")
    print(f"    Train groups: {np.unique(groups_train)}")
    print(f"    Test groups: {np.unique(groups_test)}")
    # Verify no group overlaps between train and test
    assert len(np.intersect1d(np.unique(groups_train), np.unique(groups_test))) == 0

Original grouped dataset shape: X=(100, 5), y=(100,), groups=(100,)

Performing Group K-Fold Cross-Validation (n_splits=5):
  Fold 1:
    Train size: 80, Test size: 20
    Train groups: [0 1 2 3 5 6 7 8]
    Test groups: [4 9]
  Fold 2:
    Train size: 80, Test size: 20
    Train groups: [0 1 2 4 5 6 7 9]
    Test groups: [3 8]
  Fold 3:
    Train size: 80, Test size: 20
    Train groups: [0 1 3 4 5 6 8 9]
    Test groups: [2 7]
  Fold 4:
    Train size: 80, Test size: 20
    Train groups: [0 2 3 4 5 7 8 9]
    Test groups: [1 6]
  Fold 5:
    Train size: 80, Test size: 20
    Train groups: [1 2 3 4 6 7 8 9]
    Test groups: [0 5]
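
For contrast, the sketch below counts how many groups end up on both sides of the split under a shuffled `KFold` versus `GroupKFold`, and then passes the groups to `cross_val_score`. The seeded data and the helper name `shared_groups_per_fold` are assumptions for illustration; because the labels are random, the scores themselves carry no meaning.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold, KFold, cross_val_score

rng = np.random.default_rng(0)
X_g = rng.random((100, 5))
y_g = rng.integers(0, 2, size=100)
groups_g = rng.permutation(np.repeat(np.arange(10), 10))  # 10 groups of 10 samples

def shared_groups_per_fold(splits):
    """For each (train, test) pair, count groups present in both sets."""
    return [
        len(np.intersect1d(groups_g[train_idx], groups_g[test_idx]))
        for train_idx, test_idx in splits
    ]

kfold_shared = shared_groups_per_fold(
    KFold(n_splits=5, shuffle=True, random_state=0).split(X_g)
)
group_shared = shared_groups_per_fold(
    GroupKFold(n_splits=5).split(X_g, y_g, groups_g)
)
print("Shared groups per fold (KFold):     ", kfold_shared)
print("Shared groups per fold (GroupKFold):", group_shared)

# cross_val_score forwards `groups` to the splitter
group_scores = cross_val_score(
    LogisticRegression(), X_g, y_g, groups=groups_g, cv=GroupKFold(n_splits=5)
)
print("Group-aware CV accuracy per fold:", np.round(group_scores, 3))
```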


### **Conclusion**

Choosing the right data splitting strategy is crucial for obtaining reliable model evaluation and building robust machine learning models. The decision depends on the nature of your data (e.g., imbalanced classes, time series, grouped data) and the computational resources available. Always consider the potential for data leakage and strive for a validation strategy that closely mimics how your model will perform in a real-world scenario.
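
As a rough summary of the guidance in this section, the decision could be sketched as a small helper. The function `choose_splitter`, its parameters, and the priority order (temporal order first, then groups, then class balance) are hypothetical simplifications, not a rule from scikit-learn.

```python
from sklearn.model_selection import (
    GroupKFold, KFold, StratifiedKFold, TimeSeriesSplit
)

def choose_splitter(is_time_series=False, has_groups=False,
                    is_classification=False, n_splits=5):
    """Hypothetical helper: map data characteristics to a scikit-learn splitter."""
    if is_time_series:
        # Temporal order must be preserved
        return TimeSeriesSplit(n_splits=n_splits)
    if has_groups:
        # Keep every group entirely in train or entirely in test
        return GroupKFold(n_splits=n_splits)
    if is_classification:
        # Preserve class proportions in every fold
        return StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
    return KFold(n_splits=n_splits, shuffle=True, random_state=42)

print(type(choose_splitter(is_classification=True)).__name__)  # StratifiedKFold
```

Real datasets often combine several of these properties (for example, grouped and imbalanced data), in which case a single flag-based rule like this is too coarse and the splitter should be chosen case by case.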