# 00a · Fix OpenStack Splits with Stratified Sampling

This notebook fixes the OpenStack dataset splits to use stratified random sampling instead of time-based splits, ensuring anomalies are distributed across training, validation, and test sets.

## Problem
- The current OpenStack dataset uses time-based splits (80/10/10)
- All anomalies are clustered in the first 2.3 hours of the 55-hour dataset
- As a result, the validation and test sets contain 0% anomalies (only normal logs)
- This makes evaluation impossible (F1 = 0, ROC AUC undefined)
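
To illustrate the last point, here is a minimal sketch with made-up labels and scores (none of these values come from the OpenStack data). ROC AUC has no defined value when the ground truth contains a single class, so the hypothetical `safe_roc_auc` helper returns `None` in that case instead of calling scikit-learn, and F1 for the anomaly class collapses to 0 because there are no positives to find.

```python
from sklearn.metrics import f1_score, roc_auc_score


def safe_roc_auc(y_true, y_score):
    """Return ROC AUC, or None when y_true holds a single class."""
    if len(set(y_true)) < 2:
        return None  # ROC AUC is undefined without both classes
    return roc_auc_score(y_true, y_score)


# Hypothetical model outputs for four log windows.
y_score = [0.1, 0.4, 0.35, 0.8]
y_pred = [0, 0, 0, 1]

# Case 1: a split with no anomalies at all (like the old val/test splits).
y_true_single = [0, 0, 0, 0]
auc_single = safe_roc_auc(y_true_single, y_score)
f1_single = f1_score(y_true_single, y_pred, zero_division=0)

# Case 2: the same predictions against labels that contain one anomaly.
y_true_mixed = [0, 0, 0, 1]
auc_mixed = safe_roc_auc(y_true_mixed, y_score)
f1_mixed = f1_score(y_true_mixed, y_pred, zero_division=0)

print(f"single-class: F1={f1_single:.2f}, ROC AUC={auc_single}")
print(f"mixed:        F1={f1_mixed:.2f}, ROC AUC={auc_mixed}")
```

With anomalies present in the labels, both metrics become meaningful again, which is the motivation for re-splitting.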

## Solution
- Use stratified random sampling for the OpenStack splits
- Preserve the 80/10/10 ratio but give each split the dataset-wide anomaly rate (~8.87%; the 11.09% figure applies only to the old training split)
- Keep the existing tokenized data and only re-split the parquet files
- Leave the HDFS time-based splits unchanged (they work fine)
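
The contrast between the two strategies can be sketched on synthetic data. The toy frame below is an assumption made for illustration (1,000 time-ordered rows with every anomaly in the first 100 rows); the helper names `time_based_split` and `stratified_split` are hypothetical and not part of this project.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real logs: anomalies only at the start.
n_rows = 1_000
toy_df = pd.DataFrame({
    "timestamp": np.arange(n_rows),
    "anomaly_label": np.where(np.arange(n_rows) < 100, 1, 0),
})


def time_based_split(df, train_frac=0.8, val_frac=0.1):
    """Cut the time-ordered rows into consecutive train/val/test blocks."""
    ordered = df.sort_values("timestamp")
    train_end = int(len(ordered) * train_frac)
    val_end = train_end + int(len(ordered) * val_frac)
    return ordered.iloc[:train_end], ordered.iloc[train_end:val_end], ordered.iloc[val_end:]


def stratified_split(df, label_col="anomaly_label", seed=42):
    """80/10/10 random split that keeps the label ratio in every part."""
    train, temp = train_test_split(
        df, test_size=0.2, stratify=df[label_col], random_state=seed
    )
    val, test = train_test_split(
        temp, test_size=0.5, stratify=temp[label_col], random_state=seed
    )
    return train, val, test


time_parts = time_based_split(toy_df)
strat_parts = stratified_split(toy_df)

for name, parts in [("time-based", time_parts), ("stratified", strat_parts)]:
    rates = [f"{part['anomaly_label'].mean():.1%}" for part in parts]
    print(f"{name:>10}: train/val/test anomaly rates = {rates}")
```

On this toy data the time-based validation and test blocks contain no anomalies, while the stratified parts each keep the overall 10% rate. The trade-off is that stratified sampling discards the temporal ordering between splits.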

In [1]:
import pandas as pd
import numpy as np
from pathlib import Path
from sklearn.model_selection import train_test_split
from datasets import Dataset
import json
import yaml

In [2]:
# Load configuration
def load_yaml(path: Path) -> dict:
    with path.open('r') as fh:
        return yaml.safe_load(fh)

data_cfg = load_yaml(Path('../configs/data.yaml'))
parquet_dir = Path(data_cfg['preprocessing']['parquet_dir'])
print(f"Working with parquet directory: {parquet_dir}")

Working with parquet directory: artifacts/datasets


In [3]:
# Backup existing OpenStack splits
import shutil

backup_dir = parquet_dir / 'backup_temporal_splits'
backup_dir.mkdir(exist_ok=True)

openstack_files = [
    'openstack_train.parquet',
    'openstack_val.parquet', 
    'openstack_test.parquet'
]

openstack_hf_dirs = [
    'openstack_train_hf',
    'openstack_val_hf',
    'openstack_test_hf'
]

print("🔄 Backing up existing OpenStack splits...")
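for file in openstack_files:
    src = parquet_dir / file
    dst = backup_dir / file
    # Skip files that are already backed up so that re-running this notebook
    # cannot overwrite the original temporal splits with the stratified ones.
    if src.exists() and not dst.exists():
        shutil.copy2(src, dst)
        print(f"   ✅ Backed up {file}")

for dir_name in openstack_hf_dirs:
    src = parquet_dir / dir_name
    dst = backup_dir / dir_name
    if src.exists() and not dst.exists():
        shutil.copytree(src, dst)
        print(f"   ✅ Backed up {dir_name}/")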

print(f"📁 Backups saved to: {backup_dir}")

🔄 Backing up existing OpenStack splits...
   ✅ Backed up openstack_train.parquet
   ✅ Backed up openstack_val.parquet
   ✅ Backed up openstack_test.parquet
   ✅ Backed up openstack_train_hf/
   ✅ Backed up openstack_val_hf/
   ✅ Backed up openstack_test_hf/
📁 Backups saved to: artifacts/datasets/backup_temporal_splits


In [4]:
# Load and combine all OpenStack data
print("📊 Loading existing OpenStack splits...")

train_df = pd.read_parquet(parquet_dir / 'openstack_train.parquet')
val_df = pd.read_parquet(parquet_dir / 'openstack_val.parquet')
test_df = pd.read_parquet(parquet_dir / 'openstack_test.parquet')

print(f"   Training: {len(train_df):,} samples, {sum(train_df['anomaly_label']):,} anomalies ({sum(train_df['anomaly_label'])/len(train_df)*100:.2f}%)")
print(f"   Validation: {len(val_df):,} samples, {sum(val_df['anomaly_label']):,} anomalies ({sum(val_df['anomaly_label'])/len(val_df)*100:.2f}%)")
print(f"   Test: {len(test_df):,} samples, {sum(test_df['anomaly_label']):,} anomalies ({sum(test_df['anomaly_label'])/len(test_df)*100:.2f}%)")

# Combine all data
combined_df = pd.concat([train_df, val_df, test_df], ignore_index=True)
print(f"\n📦 Combined dataset: {len(combined_df):,} samples, {sum(combined_df['anomaly_label']):,} anomalies ({sum(combined_df['anomaly_label'])/len(combined_df)*100:.2f}%)")

# Verify data integrity
total_expected = len(train_df) + len(val_df) + len(test_df)
assert len(combined_df) == total_expected, f"Data loss detected: {len(combined_df)} != {total_expected}"
print("✅ Data integrity verified")

📊 Loading existing OpenStack splits...
   Training: 166,256 samples, 18,434 anomalies (11.09%)
   Validation: 20,782 samples, 0 anomalies (0.00%)
   Test: 20,782 samples, 0 anomalies (0.00%)

📦 Combined dataset: 207,820 samples, 18,434 anomalies (8.87%)
✅ Data integrity verified


In [5]:
# Perform stratified splits
print("🎯 Creating stratified random splits...")

# Split configuration (80/10/10)
train_size = 0.8
val_size = 0.1  
test_size = 0.1
random_state = 42

# First split: 80% train, 20% temp (val+test)
X = combined_df.drop('anomaly_label', axis=1)
y = combined_df['anomaly_label']

X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, 
    test_size=(val_size + test_size),
    stratify=y,
    random_state=random_state
)

# Second split: split temp into 50/50 for val and test (each 10% of total)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp,
    test_size=0.5,  # 50% of temp = 10% of total
    stratify=y_temp,
    random_state=random_state
)

# Reconstruct DataFrames
new_train_df = pd.concat([X_train, y_train], axis=1)
new_val_df = pd.concat([X_val, y_val], axis=1)
new_test_df = pd.concat([X_test, y_test], axis=1)

# Verify splits
print(f"\n📊 New stratified splits:")
print(f"   Training: {len(new_train_df):,} samples, {sum(new_train_df['anomaly_label']):,} anomalies ({sum(new_train_df['anomaly_label'])/len(new_train_df)*100:.2f}%)")
print(f"   Validation: {len(new_val_df):,} samples, {sum(new_val_df['anomaly_label']):,} anomalies ({sum(new_val_df['anomaly_label'])/len(new_val_df)*100:.2f}%)")
print(f"   Test: {len(new_test_df):,} samples, {sum(new_test_df['anomaly_label']):,} anomalies ({sum(new_test_df['anomaly_label'])/len(new_test_df)*100:.2f}%)")

# Verify total counts match
new_total = len(new_train_df) + len(new_val_df) + len(new_test_df)
assert new_total == len(combined_df), f"Sample count mismatch: {new_total} != {len(combined_df)}"

new_anomalies = sum(new_train_df['anomaly_label']) + sum(new_val_df['anomaly_label']) + sum(new_test_df['anomaly_label'])
orig_anomalies = sum(combined_df['anomaly_label'])
assert new_anomalies == orig_anomalies, f"Anomaly count mismatch: {new_anomalies} != {orig_anomalies}"

print("✅ Stratified splits created successfully!")

🎯 Creating stratified random splits...

📊 New stratified splits:
   Training: 166,256 samples, 14,747 anomalies (8.87%)
   Validation: 20,782 samples, 1,844 anomalies (8.87%)
   Test: 20,782 samples, 1,843 anomalies (8.87%)
✅ Stratified splits created successfully!
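
The drop from 11.09% to 8.87% in the training split is expected: the same 18,434 anomalies are now spread over all 207,820 rows instead of only the 166,256 training rows. The arithmetic can be checked with the counts printed above:

```python
# Counts copied from the cell outputs above.
total_rows = 207_820
total_anomalies = 18_434
split_rows = {"train": 166_256, "val": 20_782, "test": 20_782}
observed_anomalies = {"train": 14_747, "val": 1_844, "test": 1_843}

# Old temporal split: every anomaly sat in the training rows.
old_train_rate = total_anomalies / split_rows["train"]
# Stratified split: every part inherits the dataset-wide rate.
overall_rate = total_anomalies / total_rows

print(f"old training rate: {old_train_rate:.2%}")
print(f"dataset-wide rate: {overall_rate:.2%}")

for name, rows in split_rows.items():
    expected = rows * overall_rate
    print(f"{name:>5}: expected {expected:,.1f} anomalies, observed {observed_anomalies[name]:,}")
```

The observed counts differ from the expected values by less than one sample, which is the rounding that stratification has to apply to whole rows.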


In [6]:
# Save new parquet splits
print("💾 Saving new stratified parquet files...")

new_train_df.to_parquet(parquet_dir / 'openstack_train.parquet', index=False)
new_val_df.to_parquet(parquet_dir / 'openstack_val.parquet', index=False)
new_test_df.to_parquet(parquet_dir / 'openstack_test.parquet', index=False)

print("   ✅ openstack_train.parquet")
print("   ✅ openstack_val.parquet")
print("   ✅ openstack_test.parquet")

💾 Saving new stratified parquet files...
   ✅ openstack_train.parquet
   ✅ openstack_val.parquet
   ✅ openstack_test.parquet


In [7]:
# Create new HuggingFace datasets
print("🤗 Creating new HuggingFace dataset splits...")

# Convert to HuggingFace datasets
hf_train = Dataset.from_pandas(new_train_df, preserve_index=False)
hf_val = Dataset.from_pandas(new_val_df, preserve_index=False)
hf_test = Dataset.from_pandas(new_test_df, preserve_index=False)

# Save HuggingFace datasets
hf_train.save_to_disk(str(parquet_dir / 'openstack_train_hf'))
hf_val.save_to_disk(str(parquet_dir / 'openstack_val_hf'))
hf_test.save_to_disk(str(parquet_dir / 'openstack_test_hf'))

print("   ✅ openstack_train_hf/")
print("   ✅ openstack_val_hf/")
print("   ✅ openstack_test_hf/")

🤗 Creating new HuggingFace dataset splits...



   ✅ openstack_train_hf/
   ✅ openstack_val_hf/
   ✅ openstack_test_hf/


In [8]:
# Update metadata to reflect stratified splits
metadata_path = Path(data_cfg['preprocessing']['dataset_metadata'])
if metadata_path.exists():
    with open(metadata_path, 'r') as f:
        metadata = json.load(f)
    
    # Update OpenStack entries with new stats
    if 'openstack' in metadata:
        new_splits = {'train': new_train_df, 'val': new_val_df, 'test': new_test_df}
        for split_name, split_df in new_splits.items():
            metadata['openstack'][split_name].update({
                'total_samples': len(split_df),
                'anomaly_rate': float(split_df['anomaly_label'].mean()),
                'split_method': 'stratified_random'
            })
        
        # Add note about the fix
        metadata['openstack']['split_fix_note'] = {
            'date': '2024-12-19',
            'issue': 'Original time-based splits had 0% anomalies in val/test sets',
            'solution': 'Replaced with stratified random splits to ensure proportional anomaly distribution',
            'backup_location': str(backup_dir)
        }
    
    # Save updated metadata
    with open(metadata_path, 'w') as f:
        json.dump(metadata, f, indent=2)
    
    print(f"📋 Updated metadata at {metadata_path}")
else:
    print("⚠️  Metadata file not found, skipping metadata update")

📋 Updated metadata at artifacts/metadata/datasets.json


In [9]:
# Verification - Load and test the new splits
print("🔍 Verifying new splits...")

# Test loading parquet files
verify_train = pd.read_parquet(parquet_dir / 'openstack_train.parquet')
verify_val = pd.read_parquet(parquet_dir / 'openstack_val.parquet')
verify_test = pd.read_parquet(parquet_dir / 'openstack_test.parquet')

print(f"\n📊 Verified parquet files:")
print(f"   Training: {len(verify_train):,} samples, {sum(verify_train['anomaly_label']):,} anomalies ({sum(verify_train['anomaly_label'])/len(verify_train)*100:.2f}%)")
print(f"   Validation: {len(verify_val):,} samples, {sum(verify_val['anomaly_label']):,} anomalies ({sum(verify_val['anomaly_label'])/len(verify_val)*100:.2f}%)")
print(f"   Test: {len(verify_test):,} samples, {sum(verify_test['anomaly_label']):,} anomalies ({sum(verify_test['anomaly_label'])/len(verify_test)*100:.2f}%)")

# Test loading HuggingFace datasets
from datasets import load_from_disk

verify_hf_train = load_from_disk(str(parquet_dir / 'openstack_train_hf'))
verify_hf_val = load_from_disk(str(parquet_dir / 'openstack_val_hf'))
verify_hf_test = load_from_disk(str(parquet_dir / 'openstack_test_hf'))

print(f"\n🤗 Verified HuggingFace datasets:")
print(f"   Training: {len(verify_hf_train):,} samples")
print(f"   Validation: {len(verify_hf_val):,} samples")
print(f"   Test: {len(verify_hf_test):,} samples")

# Check for anomalies in val/test
# Read only the label column instead of decoding every row
val_anomalies = sum(verify_hf_val['anomaly_label'])
test_anomalies = sum(verify_hf_test['anomaly_label'])

if val_anomalies > 0 and test_anomalies > 0:
    print(f"\n✅ SUCCESS! Anomalies now present in all splits:")
    print(f"   Val anomalies: {val_anomalies:,}")
    print(f"   Test anomalies: {test_anomalies:,}")
    print(f"\n🎉 OpenStack dataset splits fixed! Evaluation should now work properly.")
else:
    print(f"\n❌ ERROR! Still missing anomalies:")
    print(f"   Val anomalies: {val_anomalies:,}")
    print(f"   Test anomalies: {test_anomalies:,}")

🔍 Verifying new splits...

📊 Verified parquet files:
   Training: 166,256 samples, 14,747 anomalies (8.87%)
   Validation: 20,782 samples, 1,844 anomalies (8.87%)
   Test: 20,782 samples, 1,843 anomalies (8.87%)

🤗 Verified HuggingFace datasets:
   Training: 166,256 samples
   Validation: 20,782 samples
   Test: 20,782 samples

✅ SUCCESS! Anomalies now present in all splits:
   Val anomalies: 1,844
   Test anomalies: 1,843

🎉 OpenStack dataset splits fixed! Evaluation should now work properly.


## Summary

✅ **Fixed OpenStack dataset splits with stratified sampling**

### What was changed:
1. **Backed up** original temporal splits to `backup_temporal_splits/`
2. **Combined** all existing OpenStack data (train + val + test)
3. **Re-split** using stratified sampling (80/10/10) to ensure proportional anomaly distribution
4. **Saved** new parquet files and HuggingFace datasets
5. **Updated** metadata to reflect the change

### Results:
- **Training set**: 166,256 samples (~80% of data), 8.87% anomalies
- **Validation set**: 20,782 samples (~10% of data), 8.87% anomalies ✅
- **Test set**: 20,782 samples (~10% of data), 8.87% anomalies ✅

### Next steps:
1. **Re-run** the fine-tuning notebook (`02_finetune_openstack.ipynb`)
2. **Evaluation metrics** (F1, ROC AUC, PR AUC) should now work properly
3. **HDFS splits** remain unchanged (they were working fine)

### Backup location:
Original temporal splits are preserved in: `./notebooks/artifacts/datasets/backup_temporal_splits/`
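
If the temporal splits are ever needed again, they can be copied back from the backup. The sketch below is one possible way to do that; `restore_temporal_splits` is a hypothetical helper that is not part of this notebook, and the demonstration runs on a throwaway directory with placeholder files rather than on the real artifacts.

```python
import shutil
import tempfile
from pathlib import Path

SPLITS = ("train", "val", "test")


def restore_temporal_splits(parquet_dir: Path, backup_dir: Path) -> list:
    """Copy backed-up OpenStack splits over the current ones; return what was restored."""
    restored = []
    for split in SPLITS:
        parquet_name = f"openstack_{split}.parquet"
        if (backup_dir / parquet_name).exists():
            shutil.copy2(backup_dir / parquet_name, parquet_dir / parquet_name)
            restored.append(parquet_name)

        hf_name = f"openstack_{split}_hf"
        if (backup_dir / hf_name).is_dir():
            # Remove the current directory first so no stale files survive.
            shutil.rmtree(parquet_dir / hf_name, ignore_errors=True)
            shutil.copytree(backup_dir / hf_name, parquet_dir / hf_name)
            restored.append(hf_name + "/")
    return restored


# Demonstration with placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp)
    demo_backup = demo_dir / "backup_temporal_splits"
    demo_backup.mkdir()
    (demo_backup / "openstack_val.parquet").write_text("temporal")
    (demo_dir / "openstack_val.parquet").write_text("stratified")

    restored = restore_temporal_splits(demo_dir, demo_backup)
    restored_content = (demo_dir / "openstack_val.parquet").read_text()
    print(restored, restored_content)
```

This only restores the data files. The metadata JSON updated earlier in this notebook would still describe the stratified splits and would need to be reverted separately.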