# Data Splitting

### Possible Strategies

1. **Random Split**:
    - **Description**: Randomly splits the dataset into training, validation, and test sets.
    - **Use Case**: General-purpose model training and evaluation.
    - **Pros**: Simple to implement; with enough data, each split tends to be representative of the overall dataset.
    - **Cons**: Ignores temporal order and user/item structure, so the evaluation may not reflect temporal or user/item-specific trends.

2. **Temporal Split**:
    - **Description**: Splits the dataset based on timestamps, using older reviews for training and more recent reviews for validation and testing.
    - **Use Case**: Evaluating models on their ability to generalize to future data.
    - **Pros**: Mimics real-world scenarios where future data is unknown, captures temporal trends.
    - **Cons**: May result in imbalanced splits if the dataset has a seasonal trend or uneven distribution of reviews over time.
    - **Splitting Strategy**: Given two cut-off timestamps t_1 < t_2 applied to all interactions:
        - **Training part**: Item interactions with timestamp range (-∞, t_1).
        - **Validation part**: Item interactions with timestamp range [t_1, t_2).
        - **Testing part**: Item interactions with timestamp range [t_2, +∞).

3. **User-Based Split**:
    - **Description**: Ensures that reviews from the same user are only present in one of the training, validation, or test sets.
    - **Use Case**: Personalization and recommendation systems that must be evaluated on users not seen during training.
    - **Pros**: Prevents data leakage from user behavior patterns.
    - **Cons**: May lead to splits that are less representative of the overall data distribution.

4. **Item-Based Split**:
    - **Description**: Ensures that reviews for the same product are only present in one of the training, validation, or test sets.
    - **Use Case**: Evaluating models on new products that were not seen during training.
    - **Pros**: Tests the model's ability to generalize to unseen items.
    - **Cons**: As with the user-based split, the resulting splits may be less representative of the overall data distribution.

5. **Stratified Split**:
    - **Description**: Ensures that the splits maintain the same distribution of a certain feature, such as rating or product category.
    - **Use Case**: Ensuring that the model performs well across different subsets of the data.
    - **Pros**: Maintains the distribution of important features, leading to more balanced splits.
    - **Cons**: More complex to implement, may still miss temporal or user/item-specific trends.

6. **Cross-Validation Split**:
    - **Description**: Uses k-fold cross-validation to create multiple training and validation splits.
    - **Use Case**: Robust model evaluation and hyperparameter tuning.
    - **Pros**: Provides a more comprehensive evaluation by using multiple data splits.
    - **Cons**: Computationally intensive, since the model is trained k times; impractical for large datasets when computational resources are limited.

7. **Leave Last Out Split**:
    - **Description**: Holds out each user's two most recent item interactions for evaluation. This strategy is widely used in recommendation papers.
    - **Splitting Strategy**: Given a chronological user interaction sequence of length N:
        - **Training part**: The first N-2 items.
        - **Validation part**: The (N-1)-th item.
        - **Testing part**: The N-th item.
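The two strategies with explicit splitting rules (temporal and leave-last-out) can be sketched in pandas as follows. This is a minimal illustration, not the project's implementation: the column names `user_id`, `item_id`, and `timestamp`, the helper names, and the toy data are all assumptions made for the example.

```python
import pandas as pd


def temporal_split(df, t_1, t_2, time_col="timestamp"):
    """Split on two global cut-off timestamps, assuming t_1 < t_2."""
    train = df[df[time_col] < t_1]                             # (-inf, t_1)
    valid = df[(df[time_col] >= t_1) & (df[time_col] < t_2)]   # [t_1, t_2)
    test = df[df[time_col] >= t_2]                             # [t_2, +inf)
    return train, valid, test


def leave_last_out_split(df, user_col="user_id", time_col="timestamp"):
    """Per user: last interaction -> test, second to last -> validation, rest -> train."""
    ordered = df.sort_values([user_col, time_col])
    # Position of each row counted from the end of its user's sequence (0 = latest).
    rank_from_end = ordered.groupby(user_col).cumcount(ascending=False)
    train = ordered[rank_from_end >= 2]
    valid = ordered[rank_from_end == 1]
    test = ordered[rank_from_end == 0]
    return train, valid, test


# Toy interactions, made up for illustration (column names are assumptions).
interactions = pd.DataFrame({
    "user_id":   ["a", "a", "a", "a", "b", "b", "b"],
    "item_id":   [10, 11, 12, 13, 10, 14, 15],
    "timestamp": [1, 2, 3, 4, 2, 5, 6],
})

t_train, t_valid, t_test = temporal_split(interactions, t_1=3, t_2=5)
l_train, l_valid, l_test = leave_last_out_split(interactions)
```

Note that with leave-last-out, users with fewer than three interactions contribute nothing to the training part, so such users may need to be filtered out or handled separately.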

# User parameters

In [3]:
PARQUET_NAME = 'combined_dataset_1000.parquet'
SPLITTING_STRATEGY = 'random'  # 'random', 'temporal', 'leave-last-out'

# Processing

In [4]:
import os
import pandas as pd
from sklearn.model_selection import train_test_split


# Load the data (the parquet file) into a dataframe
combined_df = pd.read_parquet(os.path.join('..', 'data', 'processed', PARQUET_NAME))

match SPLITTING_STRATEGY:
    case 'random':
        # Split the data into train and test sets
        train_df, test_df = train_test_split(combined_df, test_size=0.2, random_state=42)
    case 'temporal':
        # Perform temporal split
        # Use precomputed datasets
        raise NotImplementedError("Temporal split is not implemented yet")
    case 'leave-last-out':
        # Perform leave-last-out split
        # Use precomputed datasets
        raise NotImplementedError("Leave-last-out split is not implemented yet")
    case _:
        raise ValueError(f"Unknown splitting strategy: {SPLITTING_STRATEGY}")
    
# save the train and test sets
train_df.to_parquet(os.path.join('..', 'data', 'processed', 'train.parquet'), index=False)
test_df.to_parquet(os.path.join('..', 'data', 'processed', 'test.parquet'), index=False)
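The cell above does not cover the user-based or item-based strategies. One possible sketch uses scikit-learn's `GroupShuffleSplit`, which keeps all rows sharing a group value in the same part. The helper `group_split`, the `user_id` and `rating` columns, and the toy data are assumptions for illustration; passing a hypothetical `item_id` column as `group_col` would give the item-based variant.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit


def group_split(df, group_col, test_size=0.2, random_state=42):
    """Split so that every value of group_col appears in only one of the two parts."""
    splitter = GroupShuffleSplit(n_splits=1, test_size=test_size, random_state=random_state)
    # split() yields positional index arrays, hence iloc below.
    train_idx, test_idx = next(splitter.split(df, groups=df[group_col]))
    return df.iloc[train_idx], df.iloc[test_idx]


# Toy reviews, made up for illustration (column names are assumptions).
reviews = pd.DataFrame({
    "user_id": ["a", "a", "b", "b", "c", "c", "d", "d", "e", "e"],
    "rating":  [5, 4, 3, 5, 1, 2, 4, 4, 5, 3],
})

train_part, test_part = group_split(reviews, group_col="user_id")

# No user should appear on both sides of the split.
shared_users = set(train_part["user_id"]) & set(test_part["user_id"])
```

Here `test_size` is the fraction of groups (users), not rows, assigned to the test part, so the row-level ratio can differ from 80/20 when users have uneven numbers of reviews.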

In [None]:
# NOTE:
# - The splitting might have to be done before the preprocessing (as there are already precomputed datasets)