# Dataset Preprocessor Verification

This notebook verifies that the modified preprocessing pipeline, which adds PyTorch `Dataset` support, produces the expected data structures and that batching and seeded shuffling behave as intended.

## Key Features to Verify:
- Time series data shape `(R, l, N)` where R=sequences, l=length, N=variables
- Seed-based reproducible shuffling
- Proper PyTorch Dataset implementation
- Efficient DataLoader batching
- Dynamic seed changing
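
As a reference for the `(R, l, N)` layout, here is a minimal sketch of how a single `(T, N)` series could be cut into overlapping sequences. It assumes a stride-1 sliding window; whether `preprocess_data` builds its sequences this way is not shown in this notebook, and `make_windows` is a hypothetical helper, not part of the project.

```python
import numpy as np

def make_windows(series: np.ndarray, length: int) -> np.ndarray:
    """Slice a (T, N) series into overlapping windows of shape (R, length, N).

    Hypothetical helper for illustration; assumes a stride of 1.
    """
    if series.ndim != 2:
        raise ValueError("expected a 2-D array of shape (T, N)")
    if length > series.shape[0]:
        raise ValueError("window length exceeds series length")
    # sliding_window_view appends the window axis last, giving (R, N, length)
    windows = np.lib.stride_tricks.sliding_window_view(series, length, axis=0)
    # Reorder to (R, length, N) and copy so the result owns its memory
    return np.ascontiguousarray(windows.transpose(0, 2, 1))

series = np.arange(30, dtype=np.float32).reshape(10, 3)  # T=10, N=3
windows = make_windows(series, length=4)
print(windows.shape)  # (7, 4, 3), since R = T - length + 1
```

With this layout, `windows[r]` is one sequence of `l` consecutive time steps across all `N` variables.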


In [1]:
import sys
import os
import numpy as np
import torch
from pathlib import Path

PROJECT_ROOT = Path.cwd().parent
if str(PROJECT_ROOT) not in sys.path:
    sys.path.append(str(PROJECT_ROOT))
print(f"Current path: {PROJECT_ROOT}")

from data.preprocess import (
    preprocess_data, 
    load_preprocessed_data,
    create_dataset_from_preprocessed,
)

from utils.preprocess_utils import (
    TimeSeriesDataset,
    create_dataloaders
)

print("All imports successful!")


Current path: c:\Users\14165\Downloads\Unified-benchmark-for-SDGFTS-main
All imports successful!


## Example 1: Basic Preprocessing with Seed Support

### Testing: 
- GOOG: `data/GOOG/GOOG.csv`
- mvt-ts: `data/mvt-ts-data/exchange_rate.txt`
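
The `seed` and `valid_ratio` settings used below suggest a seeded shuffle followed by a train/validation split. The sketch below shows one way that could work; `shuffled_split` is a hypothetical helper, and both the NumPy generator and the floor rounding of the training count are assumptions, chosen because they are consistent with the 1132 sequences ending up as 1018 training and 114 validation sequences in this notebook's output.

```python
import numpy as np

def shuffled_split(data: np.ndarray, valid_ratio: float, seed: int):
    """Shuffle sequences with a fixed seed, then split into train/validation.

    Hypothetical helper; the real split logic lives in preprocess_data.
    """
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(data))            # seeded sequence order
    n_train = int(len(data) * (1 - valid_ratio))  # assumed floor rounding
    return data[order[:n_train]], data[order[n_train:]]

# Stand-in data: 1132 tiny sequences, each tagged with its own index
data = np.arange(1132).reshape(1132, 1, 1)
train, valid = shuffled_split(data, valid_ratio=0.1, seed=42)
print(train.shape[0], valid.shape[0])  # 1018 114
```

Calling the function again with the same seed returns the same partition, which is the property the seed is meant to provide.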

In [2]:
def run_basic_preprocessing_example(train_data, valid_data):
    """
    Print summary statistics for Example 1: Basic Preprocessing with Seed Support.
    Reports the train/validation shapes, value ranges, and means, and checks that
    the training data lies in [0, 1] after normalization.
    """
    if train_data is not None and valid_data is not None:
        print(f"\nPreprocessing successful!")
        print(f"Data shapes:")
        print(f"  Train shape: {train_data.shape} (R_train, l, N)")
        print(f"  Valid shape: {valid_data.shape} (R_valid, l, N)")
        print(f"  Sequence length (l): {train_data.shape[1]}")
        print(f"  Number of variables (N): {train_data.shape[2]}")
        print(f"  Total sequences (R): {train_data.shape[0] + valid_data.shape[0]}")
        
        print(f"\nData statistics:")
        print(f"  Train data range: [{train_data.min():.4f}, {train_data.max():.4f}]")
        print(f"  Valid data range: [{valid_data.min():.4f}, {valid_data.max():.4f}]")
        print(f"  Train data mean: {train_data.mean():.4f}")
        print(f"  Valid data mean: {valid_data.mean():.4f}")
        
        # Verify normalization (should be in range [0, 1])
        if train_data.min() >= 0 and train_data.max() <= 1:
            print(f"  Normalization successful: data in range [0, 1]")
        else:
            print(f"  Normalization issue: data outside [0, 1] range")
            
    else:
        print("Preprocessing failed...")
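
The `[0, 1]` check above matches what per-variable min-max scaling would produce. The following is a minimal sketch of that scaling, assuming the statistics are taken over all sequences and time steps; `minmax_normalize` is a hypothetical helper, and the scaling actually used by `preprocess_data` is not shown here.

```python
import numpy as np

def minmax_normalize(data: np.ndarray) -> np.ndarray:
    """Scale each variable of a (R, l, N) array to [0, 1].

    Hypothetical helper; statistics are computed per variable over axes (R, l).
    """
    lo = data.min(axis=(0, 1), keepdims=True)
    hi = data.max(axis=(0, 1), keepdims=True)
    span = hi - lo
    span[span == 0] = 1.0  # constant variables map to 0 instead of dividing by 0
    return (data - lo) / span

rng = np.random.default_rng(0)
data = rng.normal(size=(50, 10, 3)) * np.array([1.0, 10.0, 100.0])
normalized = minmax_normalize(data)
print(normalized.min(axis=(0, 1)))  # [0. 0. 0.]
print(normalized.max(axis=(0, 1)))  # [1. 1. 1.]
```

If the minimum and maximum were computed on the training split only, validation values could fall slightly outside `[0, 1]`, so it is worth checking both splits.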


In [6]:
print("=" * 60)
print("EXAMPLE 1: Basic Preprocessing with Seed Support")
print("=" * 60)



config_goog = {
    'original_data_path': 'GOOG/GOOG.csv',
    'output_ori_path': './preprocessed/',
    'dataset_name': 'goog_stock',
    'valid_ratio': 0.1,
    'do_normalization': True,
    'seed': 42  # Reproducible shuffling
}

config_mvt = {
    'original_data_path': 'mvt-ts-data/exchange_rate.txt',
    'output_ori_path': './preprocessed/',
    'dataset_name': 'mvt_ts',
    'valid_ratio': 0.1,
    'do_normalization': True,
    'seed': 42  # Reproducible shuffling
}

print(f"Configuration GOOG Dataset: {config_goog}")
print("\nStarting preprocessing...")
train_data_goog, valid_data_goog = preprocess_data(config_goog)

print(f"Configuration MVT TS Dataset: {config_mvt}")
print("\nStarting preprocessing...")
train_data_mvt, valid_data_mvt = preprocess_data(config_mvt)

run_basic_preprocessing_example(train_data_goog, valid_data_goog)
run_basic_preprocessing_example(train_data_mvt, valid_data_mvt)

# Reset each train/valid to None
# train_data_goog = None
# valid_data_goog = None
# train_data_mvt = None
# valid_data_mvt = None

EXAMPLE 1: Basic Preprocessing with Seed Support
Configuration GOOG Dataset: {'original_data_path': 'GOOG/GOOG.csv', 'output_ori_path': './preprocessed/', 'dataset_name': 'goog_stock', 'valid_ratio': 0.1, 'do_normalization': True, 'seed': 42}

Starting preprocessing...
Data preprocessing with settings:{'original_data_path': 'GOOG/GOOG.csv', 'output_ori_path': './preprocessed/', 'dataset_name': 'goog_stock', 'valid_ratio': 0.1, 'do_normalization': True, 'seed': 42}
Data shape: (1132, 125, 5)
Preprocessing done. Preprocessed files saved to ./preprocessed/goog_stock.

Configuration MVT TS Dataset: {'original_data_path': 'mvt-ts-data/exchange_rate.txt', 'output_ori_path': './preprocessed/', 'dataset_name': 'mvt_ts', 'valid_ratio': 0.1, 'do_normalization': True, 'seed': 42}

Starting preprocessing...
Data preprocessing with settings:{'original_data_path': 'mvt-ts-data/exchange_rate.txt', 'output_ori_path': './preprocessed/', 'dataset_name': 'mvt_ts', 'valid_ratio': 0.1, 'do_normalization': True, 'seed'

## Example 2: PyTorch Dataset and DataLoader Creation

Now let's create PyTorch datasets and dataloaders to verify proper batching and seed support.
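
For orientation, here is a minimal sketch of a map-style dataset that serves one `(l, N)` float32 tensor per index, wrapped in a `DataLoader` whose shuffle order is seeded through a `torch.Generator`. `ArrayDataset` is a hypothetical stand-in, not the project's `TimeSeriesDataset`, and the generator-based seeding is an assumption about one way to get reproducible batches; how `create_dataloaders` applies `train_seed` and `valid_seed` is not shown in this notebook.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, Dataset

class ArrayDataset(Dataset):
    """Hypothetical stand-in: serves one (l, N) float32 tensor per index."""

    def __init__(self, data: np.ndarray):
        # Convert once up front so __getitem__ is a cheap slice
        self.data = torch.as_tensor(data, dtype=torch.float32)

    def __len__(self) -> int:
        return self.data.shape[0]

    def __getitem__(self, idx: int) -> torch.Tensor:
        return self.data[idx]

data = np.random.default_rng(0).random((100, 125, 5))  # synthetic (R, l, N)
generator = torch.Generator().manual_seed(42)          # seeds the shuffle order
loader = DataLoader(ArrayDataset(data), batch_size=32, shuffle=True, generator=generator)

batch = next(iter(loader))
print(batch.shape, batch.dtype)  # torch.Size([32, 125, 5]) torch.float32
print(len(loader))               # 4 batches: 32 + 32 + 32 + 4
```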


In [7]:
print("\n" + "=" * 60)
print("EXAMPLE 2: PyTorch Dataset and DataLoader Creation")
print("=" * 60)

if train_data_goog is None or valid_data_goog is None:
    print("Cannot proceed - no preprocessed data available")
else:
    # Create datasets with different seeds for reproducible shuffling
    print("Creating TimeSeriesDataset objects...")
    train_dataset = TimeSeriesDataset(train_data_goog, seed=42)
    valid_dataset = TimeSeriesDataset(valid_data_goog, seed=123)
    
    print(f"Created datasets:")
    print(f"  Train dataset length: {len(train_dataset)}")
    print(f"  Valid dataset length: {len(valid_dataset)}")
    print(f"  Sample shape: {train_dataset[0].shape}")
    print(f"  Sample dtype: {train_dataset[0].dtype}")
    
    # Verify the sample is a PyTorch tensor
    sample = train_dataset[0]
    if isinstance(sample, torch.Tensor):
        print(f"  Sample is PyTorch tensor: {type(sample)}")
    else:
        print(f"  Sample is not PyTorch tensor: {type(sample)}")
    
    # Create dataloaders
    print(f"\nCreating DataLoaders...")
    batch_size = 32
    train_loader, valid_loader = create_dataloaders(
        train_data_goog, valid_data_goog,
        batch_size=batch_size,
        train_seed=42,
        valid_seed=123,
        num_workers=0,
        pin_memory=False
    )
    
    print(f"Created dataloaders:")
    print(f"  Train batches: {len(train_loader)}")
    print(f"  Valid batches: {len(valid_loader)}")
    print(f"  Batch size: {batch_size}")
    
    # Demonstrate batching
    print(f"\nBatch Information:")
    for i, batch in enumerate(train_loader):
        print(f"Batch {i+1}: shape {batch.shape}, dtype {batch.dtype}")
        print(f"  Range: [{batch.min():.4f}, {batch.max():.4f}]")
        if i >= 2:  # Show only first 3 batches
            print(f"... and {len(train_loader) - 3} more batches")
            break
    
    # Verify batch shapes are correct
    first_batch = next(iter(train_loader))
    expected_shape = (batch_size, train_data_goog.shape[1], train_data_goog.shape[2])
    if first_batch.shape == expected_shape:
        print(f"\nBatch shapes are correct: {first_batch.shape} == {expected_shape}")
    else:
        print(f"\nBatch shape mismatch: {first_batch.shape} != {expected_shape}")



EXAMPLE 2: PyTorch Dataset and DataLoader Creation
Creating TimeSeriesDataset objects...
Created datasets:
  Train dataset length: 1018
  Valid dataset length: 114
  Sample shape: torch.Size([125, 5])
  Sample dtype: torch.float32
  Sample is PyTorch tensor: <class 'torch.Tensor'>

Creating DataLoaders...
Created dataloaders:
  Train batches: 32
  Valid batches: 4
  Batch size: 32

Batch Information:
Batch 1: shape torch.Size([32, 125, 5]), dtype torch.float32
  Range: [0.0000, 1.0000]
Batch 2: shape torch.Size([32, 125, 5]), dtype torch.float32
  Range: [0.0000, 1.0000]
Batch 3: shape torch.Size([32, 125, 5]), dtype torch.float32
  Range: [0.0000, 1.0000]
... and 29 more batches

Batch shapes are correct: torch.Size([32, 125, 5]) == (32, 125, 5)
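
The batch counts above (32 and 4) are what ceiling division gives when the final partial batch is kept, which is the `DataLoader` default (`drop_last=False`). The small sketch below works through the arithmetic; `batch_layout` is a hypothetical helper written for this note.

```python
def batch_layout(num_samples: int, batch_size: int, drop_last: bool = False):
    """Return (number of batches, size of the last batch)."""
    full, remainder = divmod(num_samples, batch_size)
    if remainder == 0 or drop_last:
        # Every batch is full (any partial batch is discarded)
        return full, batch_size
    # One extra, smaller batch holds the leftover samples
    return full + 1, remainder

print(batch_layout(1018, 32))  # (32, 26)
print(batch_layout(114, 32))   # (4, 18)
```

Because the shape check above inspects only the first batch, it passes even though the last training batch holds 26 sequences rather than 32. Code that assumes a fixed batch dimension should account for that.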


## Example 3: Reproducible Training with Seed Control

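Let's verify that identical seeds reproduce the same shuffle order, that different seeds produce different orders, and that changing the seed on an existing dataset reshuffles it.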
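
The behaviour under test can be pictured as a seeded permutation that is rebuilt whenever the seed changes. The sketch below is a hypothetical model of that idea: `SeededOrder` borrows the method names `get_original_indices` and `set_seed` from the cell that follows, but its internals (a NumPy generator and a full re-permutation) are assumptions, so its indices will not match the notebook output.

```python
import numpy as np

class SeededOrder:
    """Hypothetical model of seed-controlled shuffling over `size` items."""

    def __init__(self, size: int, seed: int):
        self.size = size
        self.set_seed(seed)

    def set_seed(self, seed: int) -> None:
        # Rebuild the permutation from scratch so the order depends only on the seed
        self.seed = seed
        self._indices = np.random.default_rng(seed).permutation(self.size).tolist()

    def get_original_indices(self) -> list:
        # Return a copy so callers cannot mutate the stored order
        return list(self._indices)

a = SeededOrder(1018, seed=42)
b = SeededOrder(1018, seed=42)
c = SeededOrder(1018, seed=123)
same = a.get_original_indices() == b.get_original_indices()
different = a.get_original_indices() != c.get_original_indices()
print(same, different)  # True True

before = a.get_original_indices()
a.set_seed(999)
print(before != a.get_original_indices())  # True
```

This sketch compares the full index lists. Comparing only the first ten entries, as the cell below does, is a quicker spot check but would not catch orders that diverge later.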


In [8]:
print("\n" + "=" * 60)
print("EXAMPLE 3: Reproducible Training with Seed Control")
print("=" * 60)

if train_data_goog is None:
    print("Cannot proceed - no preprocessed data available")
else:
    # Test reproducibility with same seeds
    print("Testing reproducibility with same seeds...")
    dataset1 = TimeSeriesDataset(train_data_goog, seed=42)
    dataset2 = TimeSeriesDataset(train_data_goog, seed=42)
    
    # Check if the order is identical
    indices1 = dataset1.get_original_indices()
    indices2 = dataset2.get_original_indices()
    
    print(f"Datasets with same seed produce identical order: {indices1[:10] == indices2[:10]}")
    print(f"  First 10 indices (dataset1): {indices1[:10]}")
    print(f"  First 10 indices (dataset2): {indices2[:10]}")
    
    # Test different seeds produce different orders
    print(f"\nTesting different seeds produce different orders...")
    dataset3 = TimeSeriesDataset(train_data_goog, seed=123)
    indices3 = dataset3.get_original_indices()
    
    print(f"Datasets with different seeds produce different order: {indices1[:10] != indices3[:10]}")
    print(f"  First 10 indices (seed=42): {indices1[:10]}")
    print(f"  First 10 indices (seed=123): {indices3[:10]}")
    
    # Test dynamic seed changing
    print(f"\nTesting dynamic seed changing...")
    original_indices = dataset1.get_original_indices()[:10]
    dataset1.set_seed(999)
    new_indices = dataset1.get_original_indices()[:10]
    
    print(f"Seed change produces different order: {original_indices != new_indices}")
    print(f"  Original (seed=42): {original_indices}")
    print(f"  New (seed=999):     {new_indices}")



EXAMPLE 3: Reproducible Training with Seed Control
Testing reproducibility with same seeds...
Datasets with same seed produce identical order: True
  First 10 indices (dataset1): [272, 859, 927, 365, 1014, 290, 790, 211, 946, 894]
  First 10 indices (dataset2): [272, 859, 927, 365, 1014, 290, 790, 211, 946, 894]

Testing different seeds produce different orders...
Datasets with different seeds produce different order: True
  First 10 indices (seed=42): [272, 859, 927, 365, 1014, 290, 790, 211, 946, 894]
  First 10 indices (seed=123): [936, 244, 526, 469, 573, 712, 847, 257, 635, 672]

Testing dynamic seed changing...
Seed change produces different order: True
  Original (seed=42): [272, 859, 927, 365, 1014, 290, 790, 211, 946, 894]
  New (seed=999):     [700, 859, 964, 373, 390, 946, 156, 416, 352, 107]
