# Clinical Synthetic Data Generation Framework

This notebook compares the performance of the following synthetic tabular data generation methods:

- **CTGAN** (Conditional Tabular GAN)
- **CTAB-GAN** (Conditional Tabular GAN with advanced preprocessing)
- **CTAB-GAN+** (Enhanced version with WGAN-GP losses, general transforms, and improved stability)
- **GANerAid** (Custom implementation)
- **CopulaGAN** (Copula-based GAN)
- **TVAE** (Variational Autoencoder)

- Section 1 sets up the project.
- Section 2 reads in the dataset and produces an initial suite of exploratory data analysis (EDA).
- Section 3 demonstrates each methodology with an arbitrary (untuned) set of hyperparameters and reports on the training process of those individual runs.
- Section 4 runs hyperparameter optimization, with graphics describing the optimization process.
- Section 5 re-runs each model with its best hyperparameters. Each model is summarized in detail in its own subsection, and a final summary of metrics across methods helps identify the best overall method.


For background, refer to `readme.md`, `doc\Model-descriptions.md`, and `doc\Objective-function.md`.

## 1 Setup and Configuration

In [1]:
# Code Chunk ID: CHUNK_001 - Import Setup Module
# Import all functionality from setup.py
from setup import *

print("🎯 SETUP MODULE IMPORTED SUCCESSFULLY!")
print("="*60)

Session timestamp captured: 2025-09-18
[OK] Essential data science libraries imported successfully!
Detected sklearn 1.7.1 - applying compatibility patch...
Global sklearn compatibility patch applied successfully
CTAB-GAN imported successfully from ./CTAB-GAN
[OK] CTAB-GAN+ detected and available




[OK] GANerAidModel imported successfully from src.models.implementations
[OK] All required libraries imported successfully
[OK] Comprehensive data quality evaluation function loaded!
[OK] Batch evaluation system loaded!
[OK] Enhanced objective function v2 with DYNAMIC TARGET COLUMN support defined!
[OK] Enhanced hyperparameter optimization analysis function loaded!
[TARGET] SETUP MODULE LOADED SUCCESSFULLY!
[OK] Enhanced objective function dependencies imported
[PACKAGE] Basic libraries imported successfully
[OK] Optuna imported successfully
[OK] CTGAN imported successfully
[CONFIG] Setup imports cell restored from main branch - wasserstein_distance now available globally
[OK] Parameter management functions added to setup.py!
[OK] Comprehensive TRTS framework functions added to setup.py!
[OK] Unified evaluation function added to setup.py!
[OK] Hyperparameter optimization data preprocessing function added to setup.py!
[OK] Notebook compatibility functions added to setup.py!
[OK] TRTSEva

## 2 Data Loading and Pre-processing

#### 2.1.1 USER ATTENTION NEEDED

Adapt this for your incoming dataset.

In [2]:
# Code Chunk ID: CHUNK_005
# =================== USER CONFIGURATION ===================
# 📝 CONFIGURE YOUR DATASET: Update these settings for your data
DATA_FILE = 'data/liver_train.csv'            # Path to your CSV file
TARGET_COLUMN = 'Result'                       # Name of your target/outcome column

# 🔧 DATASET IDENTIFIER (for results folder naming)
# Option 1: Manual override (recommended for consistent naming)
DATASET_IDENTIFIER_OVERRIDE = 'liver-train'  # Changed to match auto-extraction pattern

# 🔧 OPTIONAL ADVANCED SETTINGS (Auto-detected if left empty)
CATEGORICAL_COLUMNS = ['Gender of the patient'] # List categorical columns or leave empty for auto-detection
MISSING_STRATEGY = 'mice'                    # Options: 'mice', 'drop', 'median', 'mode'
DATASET_NAME = 'Liver Disease Dataset'        # Descriptive name for your dataset

# 🚨 IMPORTANT: Verify these settings match your dataset before running!
print(f"📊 Configuration Summary:")
print(f"   Dataset: {DATASET_NAME}")
print(f"   File: {DATA_FILE}")
print(f"   Target: {TARGET_COLUMN}")
print(f"   Manual ID Override: {DATASET_IDENTIFIER_OVERRIDE}")
print(f"   Categorical: {CATEGORICAL_COLUMNS}")
print(f"   Missing Data Strategy: {MISSING_STRATEGY}")

# Load and prepare the dataset
data_file = DATA_FILE
target_column = TARGET_COLUMN

print(f"\n🔍 Loading dataset from: {data_file}")

try:
    # 🔧 ENCODING FIX: Try multiple encodings to handle special characters.
    # Note: 'latin-1' (alias 'iso-8859-1') decodes any byte sequence, so it never
    # raises UnicodeDecodeError and the encodings listed after it are never reached.
    encoding_attempts = ['utf-8', 'latin-1', 'cp1252', 'iso-8859-1']
    data = None
    
    for encoding in encoding_attempts:
        try:
            data = pd.read_csv(data_file, encoding=encoding)
            print(f"✅ Dataset loaded successfully using {encoding} encoding!")
            break
        except UnicodeDecodeError:
            print(f"⚠️  Failed with {encoding} encoding, trying next...")
            continue
    
    if data is None:
        raise Exception("Could not load file with any supported encoding")
        
    print(f"📊 Original shape: {data.shape}")
    
    # Set up dataset identifier and current data file for new folder structure
    import setup
    if DATASET_IDENTIFIER_OVERRIDE:
        dataset_identifier = DATASET_IDENTIFIER_OVERRIDE
        setup.DATASET_IDENTIFIER = DATASET_IDENTIFIER_OVERRIDE
        setup.CURRENT_DATA_FILE = data_file
        print(f"📁 Using manual dataset identifier: {dataset_identifier}")
    else:
        dataset_identifier = setup.extract_dataset_identifier(data_file)
        setup.DATASET_IDENTIFIER = dataset_identifier
        setup.CURRENT_DATA_FILE = data_file
        print(f"📁 Auto-extracted dataset identifier: {dataset_identifier}")
    
    # 🔧 CRITICAL FIX: Set global DATASET_IDENTIFIER for use in other chunks
    DATASET_IDENTIFIER = dataset_identifier  # This was missing!
    
    # 📁 NEW: Update RESULTS_DIR for organized file outputs using proper structure
    # Don't set a specific RESULTS_DIR here - let each section use get_results_path()
    # This ensures proper date/section structure like: results/dataset/2025-09-12/Section-2/
    RESULTS_DIR = f"results/{dataset_identifier}/"  # Base directory only
    
    print(f"✅ Dataset identifier set: {dataset_identifier}")
    print(f"✅ Global DATASET_IDENTIFIER: {DATASET_IDENTIFIER}")
    print(f"📅 Session timestamp: {setup.SESSION_TIMESTAMP}")
    print(f"🗂️  Results will be saved to: results/{dataset_identifier}/")
    
except FileNotFoundError:
    print(f"❌ Error: File not found at {data_file}")
    print("   Please check the DATA_FILE path in your configuration above")
    print("   Current working directory:", os.getcwd())
    raise

except Exception as e:
    print(f"❌ Error loading dataset: {e}")
    raise

if data is not None:
    print(f"\n📋 Dataset Info:")
    print(f"   • Shape: {data.shape}")
    print(f"   • Columns: {list(data.columns)}")
    
    # Check if target column exists
    if target_column not in data.columns:
        print(f"\n❌ WARNING: Target column '{target_column}' not found!")
        print(f"   Available columns: {list(data.columns)}")
        print("   Please update TARGET_COLUMN in the configuration above")
    else:
        print(f"   • Target column '{target_column}' found ✅")
        print(f"   • Target distribution: {data[target_column].value_counts().to_dict()}")
    
    # Check for missing values
    missing_values = data.isnull().sum()
    if missing_values.sum() > 0:
        print(f"\n⚠️  Missing values detected:")
        for col, count in missing_values[missing_values > 0].items():
            print(f"   • {col}: {count} missing ({count/len(data)*100:.1f}%)")
    else:
        print(f"\n✅ No missing values detected")
else:
    print("\n❌ Dataset loading failed - please fix the configuration and try again")

📊 Configuration Summary:
   Dataset: Liver Disease Dataset
   File: data/liver_train.csv
   Target: Result
   Manual ID Override: liver-train
   Categorical: ['Gender of the patient']
   Missing Data Strategy: mice

🔍 Loading dataset from: data/liver_train.csv
✅ Dataset loaded successfully using utf-8 encoding!
📊 Original shape: (30691, 11)
📁 Using manual dataset identifier: liver-train
✅ Dataset identifier set: liver-train
✅ Global DATASET_IDENTIFIER: liver-train
📅 Session timestamp: 2025-09-18
🗂️  Results will be saved to: results/liver-train/

📋 Dataset Info:
   • Shape: (30691, 11)
   • Columns: ['Age of the patient', 'Gender of the patient', 'Total Bilirubin', 'Direct Bilirubin', 'Alkphos Alkaline Phosphotase', 'Sgpt Alamine Aminotransferase', 'Sgot Aspartate Aminotransferase', 'Total Protiens', 'ALB Albumin', 'A/G Ratio Albumin and Globulin Ratio', 'Result']
   • Target column 'Result' found ✅
   • Target distribution: {1: 21917, 2: 8774}

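The loading cell tries several encodings in order. One detail that affects the order: `latin-1` assigns a character to every possible byte, so it cannot raise `UnicodeDecodeError`, and anything listed after it is never tried. The sketch below shows this behavior on a small temporary file. The helper name `read_text_with_fallback` and the sample bytes are hypothetical and are not part of `setup.py`.

```python
import os
import tempfile

def read_text_with_fallback(path, encodings=("utf-8", "cp1252", "latin-1")):
    """Return (text, encoding) for the first encoding that decodes the file.

    Hypothetical helper for illustration. latin-1 is placed last because it
    accepts every byte sequence and therefore acts as a catch-all.
    """
    for encoding in encodings:
        try:
            with open(path, encoding=encoding) as handle:
                return handle.read(), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError(f"Could not decode {path} with any of {encodings}")

# Byte 0x81 is invalid at this position in UTF-8 and undefined in cp1252,
# but latin-1 maps it to a character.
with tempfile.NamedTemporaryFile("wb", suffix=".csv", delete=False) as tmp:
    tmp.write(b"name,value\ncaf\x81,1\n")
    tmp_path = tmp.name

text, used_encoding = read_text_with_fallback(tmp_path)
os.remove(tmp_path)
print(used_encoding)
```

Placing the catch-all encoding last keeps the stricter encodings meaningful, although a file decoded with `latin-1` may still contain the wrong characters if its true encoding was something else.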

#### 2.1.2 Column Name Standardization and Dataset Analysis Utilities

The next cell defines pandas-based utilities for column-name standardization and dataset analysis. `standardize_column_names` cleans and normalizes column names, `detect_target_column` identifies the target variable, `analyze_column_types` classifies each column as categorical, continuous, or binary, and `validate_dataset_config` checks the target column, the dataset size, and the share of missing data.
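As a small illustration of the normalization rule used by `standardize_column_names` (strip punctuation, replace whitespace with underscores, lowercase), the sketch below applies the same three steps to a few of this dataset's column names. The helper name `normalize_name` is hypothetical, and the duplicate-name suffixing done by the notebook function is left out for brevity.

```python
import re

def normalize_name(name: str) -> str:
    """Apply the three normalization steps to a single column name."""
    cleaned = re.sub(r"[^\w\s]", "", str(name))     # drop punctuation such as '/'
    cleaned = re.sub(r"\s+", "_", cleaned.strip())  # whitespace runs become '_'
    return cleaned.lower()

columns = [
    "Age of the patient",
    "A/G Ratio Albumin and Globulin Ratio",
    "Result",
]
normalized = [normalize_name(column) for column in columns]
for old, new in zip(columns, normalized):
    print(f"{old!r} -> {new!r}")
```

Note that removing punctuation can merge tokens (`A/G` becomes `ag`), which is one reason the notebook function also returns the old-to-new name mapping.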

In [3]:
# Code Chunk ID: CHUNK_006
# Column Name Standardization and Dataset Analysis Utilities
import re
import pandas as pd
import numpy as np
from typing import Dict, List, Tuple, Any

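def standardize_column_names(df: pd.DataFrame) -> Tuple[pd.DataFrame, Dict[str, str]]:
    """
    Normalize column names to lowercase snake_case.
    
    Args:
        df: Input dataframe
        
    Returns:
        Tuple of (renamed copy of the dataframe, mapping of old to new column names)
    """
    df = df.copy()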
    
    # Create mapping of old to new column names
    name_mapping = {}
    
    for col in df.columns:
        # Remove special characters and normalize
        new_name = re.sub(r'[^\w\s]', '', str(col))  # Remove special chars
        new_name = re.sub(r'\s+', '_', new_name.strip())  # Replace spaces with underscores
        new_name = new_name.lower()  # Convert to lowercase
        
        # Handle duplicate names
        if new_name in name_mapping.values():
            counter = 1
            while f"{new_name}_{counter}" in name_mapping.values():
                counter += 1
            new_name = f"{new_name}_{counter}"
            
        name_mapping[col] = new_name
    
    # Rename columns
    df = df.rename(columns=name_mapping)
    
    print(f"🔄 Column Name Standardization:")
    for old, new in name_mapping.items():
        if old != new:
            print(f"   '{old}' → '{new}'")
    
    return df, name_mapping

def detect_target_column(df: pd.DataFrame, target_hint: str = None) -> str:
    """
    Detect the target column in the dataset.
    
    Args:
        df: Input dataframe
        target_hint: User-provided hint for target column name
        
    Returns:
        Name of the detected target column
    """
    # Common target column patterns
    target_patterns = [
        'target', 'label', 'class', 'outcome', 'result', 'diagnosis', 
        'response', 'y', 'dependent', 'output', 'prediction'
    ]
    
    # If user provided hint, try to find it first
    if target_hint:
        # Try exact match (case insensitive)
        for col in df.columns:
            if col.lower() == target_hint.lower():
                print(f"✅ Target column found: '{col}' (user specified)")
                return col
        
        # Try partial match
        for col in df.columns:
            if target_hint.lower() in col.lower():
                print(f"✅ Target column found: '{col}' (partial match to '{target_hint}')")
                return col
    
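    # Auto-detect based on patterns
    for pattern in target_patterns:
        for col in df.columns:
            name = col.lower()
            # Single-letter patterns such as 'y' must match the whole name;
            # as a substring they would match almost any column.
            matched = (name == pattern) if len(pattern) == 1 else (pattern in name)
            if matched:
                print(f"✅ Target column auto-detected: '{col}' (pattern: '{pattern}')")
                return col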
    
    # If no pattern match, check for binary columns (likely targets)
    binary_cols = []
    for col in df.columns:
        unique_vals = df[col].dropna().nunique()
        if unique_vals == 2:
            binary_cols.append(col)
    
    if binary_cols:
        target_col = binary_cols[0]  # Take first binary column
        print(f"✅ Target column inferred: '{target_col}' (binary column)")
        return target_col
    
    # Last resort: use last column
    target_col = df.columns[-1]
    print(f"⚠️ Target column defaulted to: '{target_col}' (last column)")
    return target_col

def analyze_column_types(df: pd.DataFrame, categorical_hint: List[str] = None) -> Dict[str, str]:
    """
    Analyze and categorize column types.
    
    Args:
        df: Input dataframe
        categorical_hint: User-provided list of categorical columns
        
    Returns:
        Dictionary mapping column names to types ('categorical', 'continuous', 'binary')
    """
    column_types = {}
    
    for col in df.columns:
        # Skip if user explicitly specified as categorical
        if categorical_hint and col in categorical_hint:
            column_types[col] = 'categorical'
            continue
            
        # Analyze column characteristics
        non_null_data = df[col].dropna()
        unique_count = non_null_data.nunique()
        total_count = len(non_null_data)
        
        # Determine type based on data characteristics
        if unique_count == 2:
            column_types[col] = 'binary'
        elif df[col].dtype == 'object' or unique_count < 10:
            column_types[col] = 'categorical'
        elif df[col].dtype in ['int64', 'float64'] and unique_count > 10:
            column_types[col] = 'continuous'
        else:
            # Default based on uniqueness ratio
            uniqueness_ratio = unique_count / total_count
            if uniqueness_ratio < 0.1:
                column_types[col] = 'categorical'
            else:
                column_types[col] = 'continuous'
    
    return column_types

def validate_dataset_config(df: pd.DataFrame, target_col: str, config: Dict[str, Any]) -> bool:
    """
    Validate dataset configuration and provide warnings.
    
    Args:
        df: Input dataframe
        target_col: Target column name
        config: Configuration dictionary
        
    Returns:
        True if validation passes, False otherwise
    """
    print(f"\n🔍 Dataset Validation:")
    
    valid = True
    
    # Check if target column exists
    if target_col not in df.columns:
        print(f"❌ Target column '{target_col}' not found in dataset!")
        print(f"   Available columns: {list(df.columns)}")
        valid = False
    else:
        print(f"✅ Target column '{target_col}' found")
    
    # Check dataset size
    if len(df) < 100:
        print(f"⚠️ Small dataset: {len(df)} rows (recommend >1000 for synthetic data)")
    else:
        print(f"✅ Dataset size: {len(df)} rows")
    
    # Check for missing data
    missing_pct = (df.isnull().sum().sum() / (len(df) * len(df.columns))) * 100
    if missing_pct > 20:
        print(f"⚠️ High missing data: {missing_pct:.1f}% (recommend MICE imputation)")
    elif missing_pct > 0:
        print(f"🔍 Missing data: {missing_pct:.1f}% (manageable)")
    else:
        print(f"✅ No missing data")
    
    return valid

print("✅ Dataset analysis utilities loaded successfully!")

✅ Dataset analysis utilities loaded successfully!
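To make the thresholds in `analyze_column_types` concrete, here is a condensed restatement of its rules applied to a toy frame. The function name `classify_column` and the toy values are hypothetical, and the user-supplied `categorical_hint` is left out. One consequence worth noticing: a string column with exactly two distinct values is labeled `binary`, not `categorical`, unless it is listed in the hint.

```python
import pandas as pd

def classify_column(series: pd.Series) -> str:
    """Condensed restatement of the analyze_column_types rules (no user hint)."""
    values = series.dropna()
    unique_count = values.nunique()
    if unique_count == 2:
        return "binary"
    if series.dtype == "object" or unique_count < 10:
        return "categorical"
    if series.dtype in ("int64", "float64") and unique_count > 10:
        return "continuous"
    # Remaining cases (for example exactly 10 unique values): use the uniqueness ratio
    return "categorical" if unique_count / len(values) < 0.1 else "continuous"

toy = pd.DataFrame({
    "gender": ["Male", "Female"] * 6,   # 2 unique values
    "stage": [1, 2, 3] * 4,             # 3 unique integers
    "age": list(range(30, 42)),         # 12 unique integers
})
types = {name: classify_column(toy[name]) for name in toy.columns}
print(types)
```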


#### 2.1.3 Load and Analyze Dataset with Generalized Configuration

This cell reloads the CSV named in the user configuration, registers the dataset identifier and data file path with the `setup` module, and prints a short summary: target column, number of features, number of samples, and memory usage. The utilities from Section 2.1.2 (column-name standardization, target detection, column typing, and validation) are defined above but are not called in this cell.

In [4]:
# Code Chunk ID: CHUNK_007
# Load and Analyze Dataset with Generalized Configuration
import pandas as pd
import numpy as np

# Apply user configuration
data_file = DATA_FILE
target_column = TARGET_COLUMN

print(f"📂 Loading dataset: {data_file}")

# Load the dataset
try:
    data = pd.read_csv(data_file)
    print(f"✅ Dataset loaded successfully!")
    print(f"📊 Original shape: {data.shape}")
    
    # Set up dataset identifier and current data file for new folder structure
    import setup
    setup.DATASET_IDENTIFIER = setup.extract_dataset_identifier(data_file)
    setup.CURRENT_DATA_FILE = data_file
    print(f"📁 Dataset identifier: {setup.DATASET_IDENTIFIER}")
    print(f"📅 Session timestamp: {setup.SESSION_TIMESTAMP}")
    
except FileNotFoundError:
    print(f"❌ Error: Could not find file {data_file}")
    print(f"📋 Please verify the file path in the USER CONFIGURATION section above")
    raise
except Exception as e:
    print(f"❌ Error loading dataset: {e}")
    raise

# Basic info
print(f"\n📋 Dataset Info:")
print(f"   • Target column: {target_column}")
print(f"   • Features: {data.shape[1] - 1}")
print(f"   • Samples: {data.shape[0]}")
print(f"   • Memory usage: {data.memory_usage(deep=True).sum() / 1024**2:.2f} MB")

📂 Loading dataset: data/liver_train.csv
✅ Dataset loaded successfully!
📊 Original shape: (30691, 11)
📁 Dataset identifier: liver-train
📅 Session timestamp: 2025-09-18

📋 Dataset Info:
   • Target column: Result
   • Features: 10
   • Samples: 30691
   • Memory usage: 4.12 MB


#### 2.1.4 Advanced Missing Data Handling with MICE

The next cell defines utilities for handling missing data. `assess_missing_patterns` summarizes missing counts, percentages, and co-occurring missing patterns. `apply_mice_imputation` imputes numeric columns with scikit-learn's `IterativeImputer` (a chained-equations imputer in the style of MICE, here with a random forest estimator) and fills categorical columns with their mode. `visualize_missing_patterns` plots a missingness heatmap and a per-column bar chart. `handle_missing_data_strategy` dispatches to the chosen strategy: `'mice'`, `'drop'` (remove rows with missing values), `'median'` (median for numeric columns, mode for categorical ones), or `'mode'`.
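The quantities that `assess_missing_patterns` reports can be seen on a tiny example. The sketch below uses made-up values (not taken from the liver dataset) and computes the per-column missing counts, the overall missing rate, and the distinct missing patterns with the same pandas calls as the notebook function.

```python
import numpy as np
import pandas as pd

# Toy data with one missing value per column (illustrative values only)
toy = pd.DataFrame({
    "age": [45.0, np.nan, 60.0, 55.0],
    "gender": ["Female", "Male", None, "Male"],
    "bilirubin": [0.7, 2.9, 6.8, np.nan],
})

missing_counts = toy.isnull().sum()
overall_rate = toy.isnull().sum().sum() / toy.size * 100
# Each distinct row of True/False flags is one "missing pattern"
patterns = toy.isnull().value_counts()

print(missing_counts.to_dict())
print(f"{overall_rate:.1f}% of cells missing")
print(patterns)
```

Here every row has a different pattern, so `patterns` has four entries; on real data a few patterns usually dominate, which hints at which columns tend to be missing together.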

In [5]:
# Code Chunk ID: CHUNK_008
# Advanced Missing Data Handling with MICE
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

def assess_missing_patterns(df: pd.DataFrame) -> dict:
    """
    Comprehensive assessment of missing data patterns.
    
    Args:
        df: Input dataframe
        
    Returns:
        Dictionary with missing data analysis
    """
    missing_analysis = {}
    
    # Basic missing statistics
    missing_counts = df.isnull().sum()
    missing_percentages = (missing_counts / len(df)) * 100
    
    missing_analysis['missing_counts'] = missing_counts[missing_counts > 0]
    missing_analysis['missing_percentages'] = missing_percentages[missing_percentages > 0]
    missing_analysis['total_missing_cells'] = df.isnull().sum().sum()
    missing_analysis['total_cells'] = df.size
    missing_analysis['overall_missing_rate'] = (missing_analysis['total_missing_cells'] / missing_analysis['total_cells']) * 100
    
    # Missing patterns
    missing_patterns = df.isnull().value_counts()
    missing_analysis['missing_patterns'] = missing_patterns
    
    return missing_analysis

def apply_mice_imputation(df: pd.DataFrame, target_col: str = None, max_iter: int = 10, random_state: int = 42) -> pd.DataFrame:
    """
    Apply Multiple Imputation by Chained Equations (MICE) to handle missing data.
    
    Args:
        df: Input dataframe with missing values
        target_col: Target column name (excluded from imputation predictors)
        max_iter: Maximum number of imputation iterations
        random_state: Random state for reproducibility
        
    Returns:
        DataFrame with imputed values
    """
    print(f"🔧 Applying MICE imputation...")
    
    # Separate features and target
    if target_col and target_col in df.columns:
        features = df.drop(columns=[target_col])
        target = df[target_col]
    else:
        features = df.copy()
        target = None
    
    # Identify numeric and categorical columns
    numeric_cols = features.select_dtypes(include=[np.number]).columns.tolist()
    categorical_cols = features.select_dtypes(include=['object', 'category']).columns.tolist()
    
    df_imputed = features.copy()
    
    # Handle numeric columns with MICE
    if numeric_cols:
        print(f"   Imputing {len(numeric_cols)} numeric columns...")
        numeric_imputer = IterativeImputer(
            estimator=RandomForestRegressor(n_estimators=10, random_state=random_state),
            max_iter=max_iter,
            random_state=random_state
        )
        
        numeric_imputed = numeric_imputer.fit_transform(features[numeric_cols])
        df_imputed[numeric_cols] = numeric_imputed
    
    # Handle categorical columns with mode imputation (simpler approach)
    if categorical_cols:
        print(f"   Imputing {len(categorical_cols)} categorical columns with mode...")
        for col in categorical_cols:
            mode_value = features[col].mode()
            if len(mode_value) > 0:
                df_imputed[col] = features[col].fillna(mode_value[0])
            else:
                # If no mode, fill with 'Unknown'
                df_imputed[col] = features[col].fillna('Unknown')
    
    # Add target back if it exists
    if target is not None:
        df_imputed[target_col] = target
    
    print(f"✅ MICE imputation completed!")
    print(f"   Missing values before: {features.isnull().sum().sum()}")
    print(f"   Missing values after: {df_imputed.isnull().sum().sum()}")
    
    return df_imputed

def visualize_missing_patterns(df: pd.DataFrame, title: str = "Missing Data Patterns") -> None:
    """
    Create visualizations for missing data patterns.
    
    Args:
        df: Input dataframe
        title: Title for the plot
    """
    missing_data = df.isnull()
    
    if missing_data.sum().sum() == 0:
        print("✅ No missing data to visualize!")
        return
    
    fig, axes = plt.subplots(1, 2, figsize=(15, 6))
    
    # Missing data heatmap
    sns.heatmap(missing_data, 
                yticklabels=False, 
                cbar=True, 
                cmap='viridis',
                ax=axes[0])
    axes[0].set_title('Missing Data Heatmap')
    axes[0].set_xlabel('Columns')
    
    # Missing data bar chart
    missing_counts = missing_data.sum()
    missing_counts = missing_counts[missing_counts > 0]
    
    if len(missing_counts) > 0:
        missing_counts.plot(kind='bar', ax=axes[1], color='coral')
        axes[1].set_title('Missing Values by Column')
        axes[1].set_ylabel('Count of Missing Values')
        axes[1].tick_params(axis='x', rotation=45)
    else:
        axes[1].text(0.5, 0.5, 'No Missing Data', 
                    horizontalalignment='center', 
                    verticalalignment='center',
                    transform=axes[1].transAxes,
                    fontsize=16)
        axes[1].set_title('Missing Values by Column')
    
    plt.suptitle(title, fontsize=16)
    plt.tight_layout()
    plt.show()

def handle_missing_data_strategy(df: pd.DataFrame, strategy: str, target_col: str = None) -> pd.DataFrame:
    """
    Apply the specified missing data handling strategy.
    
    Args:
        df: Input dataframe
        strategy: Strategy to use ('mice', 'drop', 'median', 'mode')
        target_col: Target column name
        
    Returns:
        DataFrame with missing data handled
    """
    print(f"\n🔧 Applying missing data strategy: {strategy.upper()}")
    
    if df.isnull().sum().sum() == 0:
        print("✅ No missing data detected - no imputation needed")
        return df.copy()
    
    if strategy.lower() == 'mice':
        return apply_mice_imputation(df, target_col)
    
    elif strategy.lower() == 'drop':
        print(f"   Dropping rows with missing values...")
        df_clean = df.dropna()
        print(f"   Rows before: {len(df)}, Rows after: {len(df_clean)}")
        return df_clean
    
    elif strategy.lower() == 'median':
        print(f"   Filling missing values with median/mode...")
        df_filled = df.copy()
        
        # Numeric columns: fill with median
        numeric_cols = df.select_dtypes(include=[np.number]).columns
        for col in numeric_cols:
            if df[col].isnull().sum() > 0:
                median_val = df[col].median()
                df_filled[col] = df[col].fillna(median_val)
                print(f"     {col}: filled {df[col].isnull().sum()} values with median {median_val:.2f}")
        
        # Categorical columns: fill with mode
        categorical_cols = df.select_dtypes(include=['object', 'category']).columns
        for col in categorical_cols:
            if df[col].isnull().sum() > 0:
                mode_val = df[col].mode()
                if len(mode_val) > 0:
                    df_filled[col] = df[col].fillna(mode_val[0])
                    print(f"     {col}: filled {df[col].isnull().sum()} values with mode '{mode_val[0]}'")
        
        return df_filled
    
    elif strategy.lower() == 'mode':
        print(f"   Filling missing values with mode...")
        df_filled = df.copy()
        
        for col in df.columns:
            if df[col].isnull().sum() > 0:
                mode_val = df[col].mode()
                if len(mode_val) > 0:
                    df_filled[col] = df[col].fillna(mode_val[0])
                    print(f"     {col}: filled {df[col].isnull().sum()} values with mode '{mode_val[0]}'")
        
        return df_filled
    
    else:
        print(f"⚠️ Unknown strategy '{strategy}'. Using 'median' as fallback.")
        return handle_missing_data_strategy(df, 'median', target_col)

print("✅ Missing data handling utilities loaded successfully!")

✅ Missing data handling utilities loaded successfully!
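For readers unfamiliar with `IterativeImputer`, the following minimal sketch shows the core idea on synthetic data: each column with gaps is regressed on the other columns, and the fitted model fills in the missing cells. It is an illustration under stated assumptions, not the notebook's exact setup: it uses the imputer's default estimator, not the `RandomForestRegressor` configured above, and the toy data is randomly generated. Note also that `IterativeImputer` returns a single completed dataset, not multiple imputations.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the next import)
from sklearn.impute import IterativeImputer

# Synthetic toy data: y depends strongly on x, so x is informative for imputing y
rng = np.random.default_rng(0)
x = rng.normal(size=200)
toy = pd.DataFrame({"x": x, "y": 2.0 * x + rng.normal(scale=0.1, size=200)})
observed = toy.copy()
toy.loc[::10, "y"] = np.nan  # knock out every tenth y value

imputer = IterativeImputer(max_iter=10, random_state=0)
imputed = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)

mask = toy["y"].isna()
error = (imputed.loc[mask, "y"] - observed.loc[mask, "y"]).abs().mean()
print(f"Remaining NaNs: {int(imputed.isna().sum().sum())}")
print(f"Mean absolute error on imputed cells: {error:.3f}")
```

Comparing imputed cells against held-back true values, as done here, is only possible on artificial gaps, but it is a quick way to sanity-check an imputation setup before applying it to real missing data.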


In [6]:
# Code Chunk ID: CHUNK_008A
# ============================================================================
# CONDITIONAL MISSING DATA IMPUTATION
# ============================================================================
# Apply missing data strategy only if missing values exist

missing_count = data.isnull().sum().sum()

if missing_count > 0:
    print(f"🔧 MISSING DATA IMPUTATION")
    print(f"📊 Found {missing_count} missing values - applying {MISSING_STRATEGY} strategy")
    
    # Store original data
    data_original = data.copy()
    
    # Apply imputation using CHUNK_008 functions
    data = handle_missing_data_strategy(data, MISSING_STRATEGY, TARGET_COLUMN)
    
    # Report results
    remaining = data.isnull().sum().sum()
    print(f"✅ Imputation complete: {missing_count} → {remaining} missing values")
else:
    print("✅ No missing values detected - skipping imputation")

🔧 MISSING DATA IMPUTATION
📊 Found 5425 missing values - applying mice strategy

🔧 Applying missing data strategy: MICE
🔧 Applying MICE imputation...
   Imputing 9 numeric columns...
   Imputing 1 categorical columns with mode...
✅ MICE imputation completed!
   Missing values before: 5425
   Missing values after: 0
✅ Imputation complete: 5425 → 0 missing values


In [None]:
# Code Chunk ID: CHUNK_008A
# ============================================================================
# ROBUST MISSING DATA IMPUTATION
# ============================================================================
# Apply missing data strategy with robust error handling

missing_count = data.isnull().sum().sum()

if missing_count > 0:
    print(f"🔧 MISSING DATA IMPUTATION")
    print(f"📊 Found {missing_count} missing values - applying {MISSING_STRATEGY} strategy")
    
    # Store original data
    data_original = data.copy()
    
    # Apply robust missing data handling
    try:
        if MISSING_STRATEGY.lower() == 'mice':
            # Apply MICE imputation with robust error handling
            print("   🔬 Applying MICE imputation...")
            
            # Separate numeric and categorical columns
            numeric_cols = data.select_dtypes(include=[np.number]).columns.tolist()
            categorical_cols = data.select_dtypes(include=['object']).columns.tolist()
            
            # Handle numeric columns with iterative imputer
            if numeric_cols:
                from sklearn.experimental import enable_iterative_imputer
                from sklearn.impute import IterativeImputer
                from sklearn.ensemble import RandomForestRegressor
                
                print(f"   Imputing {len(numeric_cols)} numeric columns...")
                numeric_imputer = IterativeImputer(
                    estimator=RandomForestRegressor(n_estimators=10, random_state=42),
                    max_iter=10,
                    random_state=42
                )
                data[numeric_cols] = numeric_imputer.fit_transform(data[numeric_cols])
            
            # Handle categorical columns with mode
            if categorical_cols:
                print(f"   Imputing {len(categorical_cols)} categorical columns with mode...")
                for col in categorical_cols:
                    if data[col].isnull().any():
                        mode_val = data[col].mode()
                        fill_val = mode_val[0] if len(mode_val) > 0 else 'Unknown'
                        data[col] = data[col].fillna(fill_val)
            
            print("✅ MICE imputation completed!")
            
        else:
            # Fallback to simple imputation
            print(f"   📊 Applying {MISSING_STRATEGY} imputation...")
            
            for col in data.columns:
                if data[col].isnull().any():
                    if data[col].dtype in ['object']:
                        # Categorical: use mode
                        mode_val = data[col].mode()
                        fill_val = mode_val[0] if len(mode_val) > 0 else 'Unknown'
                        data[col] = data[col].fillna(fill_val)
                    else:
                        # Numeric: use median
                        median_val = data[col].median()
                        data[col] = data[col].fillna(median_val)
            
            print(f"✅ {MISSING_STRATEGY} imputation completed!")
    
    except Exception as e:
        print(f"⚠️ Imputation error: {e}")
        print("   Applying fallback median/mode imputation...")
        
        # Robust fallback imputation
        for col in data.columns:
            if data[col].isnull().any():
                if data[col].dtype in ['object']:
                    data[col] = data[col].fillna('Unknown')
                else:
                    data[col] = data[col].fillna(data[col].median())
        
        print("✅ Fallback imputation completed!")
    
    # Report results
    remaining = data.isnull().sum().sum()
    print(f"   Missing values before: {missing_count}")
    print(f"   Missing values after: {remaining}")
    print(f"✅ Imputation complete: {missing_count} → {remaining} missing values")
else:
    print("✅ No missing values detected - skipping imputation")

#### 2.1.4a Adjustments

In [8]:
data.head()

      Age of the patient Gender of the patient  Total Bilirubin  \
0                   45.0                Female              0.7   
1                   29.0                  Male              2.9   
2                   60.0                  Male              6.8   
3                   55.0                  Male              4.4   
4                   54.0                Female              0.8   

      Direct Bilirubin  Alkphos Alkaline Phosphotase  \
0                  0.2                         186.0   
1                  1.2                         189.0 

In [9]:
# Log-transform (log1p, i.e. log(1 + x)) the columns listed below.
# Note: this starts from `sampled_data`, so changes made to `data` in earlier
# cells carry over only if `sampled_data` already reflects them.
log_columns = [
    'Total Bilirubin',
    'Direct Bilirubin',
    'Alkphos Alkaline Phosphotase',
    'Sgpt Alamine Aminotransferase',
]
data = sampled_data.copy()
data[log_columns] = np.log1p(data[log_columns])

print("✅ Log transformation applied to selected columns")

✅ Log transformation applied to selected columns
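
Because the models are trained on log-scaled values for these columns, synthetic values generated for them would presumably be on the log scale as well. If original units are needed downstream, `np.expm1` is the exact inverse of `np.log1p`. The sketch below illustrates the round trip on a toy frame; the name `LOG_COLUMNS` and the two-row frame are assumptions for illustration, not part of the notebook.

```python
import numpy as np
import pandas as pd

LOG_COLUMNS = ["Total Bilirubin", "Direct Bilirubin"]  # illustrative subset

raw = pd.DataFrame({
    "Total Bilirubin": [0.7, 2.9],
    "Direct Bilirubin": [0.2, 1.2],
})

logged = raw.copy()
logged[LOG_COLUMNS] = np.log1p(logged[LOG_COLUMNS])      # forward: log(1 + x)

restored = logged.copy()
restored[LOG_COLUMNS] = np.expm1(restored[LOG_COLUMNS])  # inverse: exp(y) - 1
```

`log1p` is defined at zero, which makes it a safer choice than a plain logarithm for lab values that can be 0.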


#### 2.1.5 EDA
This cell prints a summary of the dataset: name, shape, memory usage, total and percentage of missing values, number of duplicate rows, and counts of numeric and categorical columns. The values are collected in a dictionary called `overview_stats`, which is then printed one entry per line with dot-padded labels.

In [10]:
# Code Chunk ID: CHUNK_009
# Enhanced Dataset Overview and Analysis
print("📋 COMPREHENSIVE DATASET OVERVIEW")
print("=" * 60)

# Basic statistics
overview_stats = {
    'Dataset Name': DATASET_IDENTIFIER,  # configured dataset identifier, not a hard-coded name
    'Shape': f"{data.shape[0]} rows × {data.shape[1]} columns",
    'Memory Usage': f"{data.memory_usage(deep=True).sum() / 1024**2:.2f} MB",
    'Total Missing Values': data.isnull().sum().sum(),
    'Missing Percentage': f"{(data.isnull().sum().sum() / data.size) * 100:.2f}%",
    'Duplicate Rows': data.duplicated().sum(),
    'Numeric Columns': len(data.select_dtypes(include=[np.number]).columns),
    'Categorical Columns': len(data.select_dtypes(include=['object']).columns)
}

for key, value in overview_stats.items():
    print(f"{key:.<25} {value}")

📋 COMPREHENSIVE DATASET OVERVIEW
Dataset Name............. liver-train
Shape.................... 5000 rows × 11 columns
Memory Usage............. 0.67 MB
Total Missing Values..... 0
Missing Percentage....... 0.00%
Duplicate Rows........... 477
Numeric Columns.......... 10
Categorical Columns...... 1


In [11]:
# Code Chunk ID: CHUNK_010
# Enhanced Column Analysis - OUTPUT TO FILE
print("📊 DETAILED COLUMN ANALYSIS (SAVING TO FILE)")
print("=" * 50)

column_analysis = pd.DataFrame({
    'Column': data.columns,
    'Data_Type': data.dtypes.astype(str),
    'Unique_Values': [data[col].nunique() for col in data.columns],
    'Missing_Count': [data[col].isnull().sum() for col in data.columns],
    'Missing_Percent': [f"{(data[col].isnull().sum()/len(data)*100):.2f}%" for col in data.columns],
    # is_numeric_dtype also covers int32, float32, etc., not only int64/float64
    'Min_Value': [data[col].min() if pd.api.types.is_numeric_dtype(data[col]) else 'N/A' for col in data.columns],
    'Max_Value': [data[col].max() if pd.api.types.is_numeric_dtype(data[col]) else 'N/A' for col in data.columns]
})

# Use new folder structure: results/dataset_identifier/YYYY-MM-DD/Section-2
results_path = get_results_path(DATASET_IDENTIFIER, 2)
os.makedirs(results_path, exist_ok=True)
csv_file = f'{results_path}/column_analysis.csv'
column_analysis.to_csv(csv_file, index=False)

print(f"📊 Column analysis table saved to {csv_file}")
print(f"📊 Analysis completed for {len(data.columns)} features")

📊 DETAILED COLUMN ANALYSIS (SAVING TO FILE)
📊 Column analysis table saved to results/liver-train/2025-09-18/Section-2/column_analysis.csv
📊 Analysis completed for 11 features


This cell analyzes the target variable. It computes the count and percentage of each target class and collects them in a DataFrame called `target_summary`, with a short description per class. Class balance is measured as the ratio of the smallest class count to the largest: above 0.8 is reported as balanced, above 0.5 as moderately imbalanced, and anything else as highly imbalanced. Both tables are saved as CSV files. If the target column is not found, the cell prints a warning and lists the available columns.

In [12]:
# Code Chunk ID: CHUNK_011
# Enhanced Target Variable Analysis - OUTPUT TO FILE
print("🎯 TARGET VARIABLE ANALYSIS (SAVING TO FILE)")
print("=" * 40)

if target_column in data.columns:
    target_counts = data[target_column].value_counts().sort_index()
    target_props = data[target_column].value_counts(normalize=True).sort_index() * 100
    
    target_summary = pd.DataFrame({
        'Class': target_counts.index,
        'Count': target_counts.values,
        'Percentage': [f"{prop:.1f}%" for prop in target_props.values],
        # Generic class labels; no dataset-specific meaning is assumed for the class codes
        'Description': [f'Class {i}' for i in target_counts.index]
    })
    
    # Use new folder structure: results/dataset_identifier/YYYY-MM-DD/Section-2
    results_path = get_results_path(DATASET_IDENTIFIER, 2)
    os.makedirs(results_path, exist_ok=True)
    csv_file = f'{results_path}/target_analysis.csv'
    target_summary.to_csv(csv_file, index=False)
    
    # Calculate class balance metrics
    balance_ratio = target_counts.min() / target_counts.max()
    
    # Save balance metrics to separate file
    balance_metrics = pd.DataFrame({
        'Metric': ['Class_Balance_Ratio', 'Dataset_Balance_Category'],
        'Value': [f"{balance_ratio:.3f}", 
                 'Balanced' if balance_ratio > 0.8 else 'Moderately Imbalanced' if balance_ratio > 0.5 else 'Highly Imbalanced']
    })
    balance_file = f'{results_path}/target_balance_metrics.csv'
    balance_metrics.to_csv(balance_file, index=False)
    
    print(f"📊 Target variable analysis saved to {csv_file}")
    print(f"📊 Class balance metrics saved to {balance_file}")
    print(f"📊 Class Balance Ratio: {balance_ratio:.3f}")
    print(f"📊 Dataset Balance: {'Balanced' if balance_ratio > 0.8 else 'Moderately Imbalanced' if balance_ratio > 0.5 else 'Highly Imbalanced'}")

else:
    print(f"⚠️ Warning: Target column '{target_column}' not found!")
    print(f"Available columns: {list(data.columns)}")

🎯 TARGET VARIABLE ANALYSIS (SAVING TO FILE)
📊 Target variable analysis saved to results/liver-train/2025-09-18/Section-2/target_analysis.csv
📊 Class balance metrics saved to results/liver-train/2025-09-18/Section-2/target_balance_metrics.csv
📊 Class Balance Ratio: 0.400
📊 Dataset Balance: Highly Imbalanced
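
A ratio of 0.400 means the smallest class has 40% as many rows as the largest. The nested conditional that assigns the balance label appears twice in the cell; the sketch below shows the same thresholds (0.8 and 0.5) as a single function. The function name `balance_category` and the toy target are assumptions for illustration.

```python
import pandas as pd


def balance_category(ratio: float) -> str:
    """Map a minority/majority count ratio to the labels used in the cell."""
    if ratio > 0.8:
        return "Balanced"
    if ratio > 0.5:
        return "Moderately Imbalanced"
    return "Highly Imbalanced"


# Toy target: 5 rows of class 1 and 2 rows of class 2, giving a ratio of 2/5
target = pd.Series([1, 1, 1, 1, 1, 2, 2])
counts = target.value_counts()
ratio = counts.min() / counts.max()
label = balance_category(ratio)
```

Note that the comparisons are strict, so a ratio of exactly 0.5 falls into the "Highly Imbalanced" bucket.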


This cell plots the distribution of each numeric feature. It selects the numeric columns, excludes the target variable, and draws one density-normalized histogram per feature in a grid of up to three columns. The figure is saved to a PNG file rather than displayed in the notebook. If no numeric features are found, a warning is printed instead.

In [13]:
# Code Chunk ID: CHUNK_012
# Enhanced Feature Distribution Visualizations - OUTPUT TO FILE
print("📊 FEATURE DISTRIBUTION ANALYSIS (SAVING TO FILE)")
print("=" * 40)

# Turn off interactive mode to prevent figures from displaying in notebook
import matplotlib
matplotlib.use('Agg')  # Use non-interactive backend
plt.ioff()  # Turn off interactive mode

# Get numeric columns excluding target
numeric_cols = data.select_dtypes(include=[np.number]).columns.tolist()
if target_column in numeric_cols:
    numeric_cols.remove(target_column)

if numeric_cols:
    n_cols = min(3, len(numeric_cols))
    n_rows = (len(numeric_cols) + n_cols - 1) // n_cols
    
    fig, axes = plt.subplots(n_rows, n_cols, figsize=(5*n_cols, 4*n_rows))
    # Use dataset name fallback for title
    dataset_name = DATASET_IDENTIFIER.title() if DATASET_IDENTIFIER else "Dataset"
    fig.suptitle(f'{dataset_name} - Feature Distributions', fontsize=16, fontweight='bold')
    
    # Flatten to a 1-D array of Axes regardless of grid shape
    # (plt.subplots returns a single Axes, a 1-D array, or a 2-D array)
    axes = np.atleast_1d(axes).ravel()
    
    for i, col in enumerate(numeric_cols):
        if i < len(axes):
            # Enhanced histogram
            axes[i].hist(data[col], bins=30, alpha=0.7, color='skyblue', 
                        edgecolor='black', density=True)
            
            axes[i].set_title(f'{col}', fontsize=12, fontweight='bold')
            axes[i].set_xlabel(col)
            axes[i].set_ylabel('Density')
            axes[i].grid(True, alpha=0.3)
    
    # Remove empty subplots
    for j in range(len(numeric_cols), len(axes)):
        fig.delaxes(axes[j])
    
    plt.tight_layout()
    
    # Use new folder structure: results/dataset_identifier/YYYY-MM-DD/Section-2
    results_path = get_results_path(DATASET_IDENTIFIER, 2)
    os.makedirs(results_path, exist_ok=True)
    plot_file = f'{results_path}/feature_distributions.png'
    plt.savefig(plot_file, dpi=300, bbox_inches='tight')
    plt.close()  # Close the figure to free memory
    
    print(f"📊 Feature distribution plots saved to {plot_file}")
    print(f"📊 Distribution analysis completed for {len(numeric_cols)} numeric features")
else:
    print("⚠️ No numeric features found for visualization")

📊 FEATURE DISTRIBUTION ANALYSIS (SAVING TO FILE)
📊 Feature distribution plots saved to results/liver-train/2025-09-18/Section-2/feature_distributions.png
📊 Distribution analysis completed for 9 numeric features
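
The grid size in the plotting cell comes from an integer ceiling division, and any panels left over in the last row are deleted. A small sketch of that arithmetic, assuming at least one plot; the function name `grid_shape` is hypothetical.

```python
def grid_shape(n_plots: int, max_cols: int = 3) -> tuple[int, int]:
    """Return (n_rows, n_cols) for a subplot grid holding n_plots panels.

    Assumes n_plots >= 1, matching the `if numeric_cols:` guard in the cell.
    """
    n_cols = min(max_cols, n_plots)
    n_rows = (n_plots + n_cols - 1) // n_cols  # integer ceiling of n_plots / n_cols
    return n_rows, n_cols


shape_9 = grid_shape(9)    # nine numeric features, as in the run reported here
shape_10 = grid_shape(10)  # a tenth feature would add a partly empty row
unused_panels = shape_10[0] * shape_10[1] - 10
```

For the nine features reported in this run the grid is exactly full, so the loop that removes empty subplots has nothing to delete.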


This cell runs a correlation analysis. It computes the correlation matrix (Pearson, the `DataFrame.corr` default) for the numeric columns, including the target variable if it is numeric, and saves it both as an annotated heatmap and as a CSV file. It then ranks features by the absolute value of their correlation with the target and labels each as strong (above 0.7), moderate (above 0.3), or weak, saving the ranking to CSV. If fewer than two numeric features are available, a warning is printed and the analysis is skipped.

In [14]:
# Code Chunk ID: CHUNK_013
# Enhanced Correlation Analysis - OUTPUT TO FILE
print("🔍 CORRELATION ANALYSIS (SAVING TO FILE)")
print("=" * 30)

# Turn off interactive mode to prevent figures from displaying in notebook
import matplotlib
matplotlib.use('Agg')  # Use non-interactive backend
plt.ioff()  # Turn off interactive mode

if len(numeric_cols) > 1:
    # Include target in correlation if numeric
    cols_for_corr = numeric_cols.copy()
    if data[target_column].dtype in ['int64', 'float64']:
        cols_for_corr.append(target_column)
    
    correlation_matrix = data[cols_for_corr].corr()
    
    # Enhanced correlation heatmap
    fig, ax = plt.subplots(figsize=(10, 8))
    
    sns.heatmap(correlation_matrix, 
                annot=True, 
                cmap='RdBu_r',
                center=0, 
                square=True, 
                linewidths=0.5,
                fmt='.3f',
                ax=ax)
    
    # Use dataset name fallback for title
    dataset_name = DATASET_IDENTIFIER.title() if DATASET_IDENTIFIER else "Dataset"
    ax.set_title(f'{dataset_name} - Feature Correlation Matrix', 
              fontsize=14, fontweight='bold', pad=20)
    plt.tight_layout()
    
    # Use new folder structure: results/dataset_identifier/YYYY-MM-DD/Section-2
    results_path = get_results_path(DATASET_IDENTIFIER, 2)
    os.makedirs(results_path, exist_ok=True)
    heatmap_file = f'{results_path}/correlation_heatmap.png'
    plt.savefig(heatmap_file, dpi=300, bbox_inches='tight')
    plt.close()  # Close the figure to free memory
    
    # Save correlation matrix to CSV
    corr_matrix_file = f'{results_path}/correlation_matrix.csv'
    correlation_matrix.to_csv(corr_matrix_file)
    
    print(f"🔍 Correlation heatmap saved to {heatmap_file}")
    print(f"🔍 Correlation matrix saved to {corr_matrix_file}")

    # Correlation with target analysis
    if target_column in correlation_matrix.columns:
        print("\n🔍 CORRELATIONS WITH TARGET VARIABLE (SAVING TO FILE)")
        print("=" * 45)
        
        target_corrs = correlation_matrix[target_column].abs().sort_values(ascending=False)
        target_corrs = target_corrs[target_corrs.index != target_column]
        
        corr_analysis = pd.DataFrame({
            'Feature': target_corrs.index,
            'Absolute_Correlation': target_corrs.values,
            'Raw_Correlation': [correlation_matrix.loc[feat, target_column] for feat in target_corrs.index],
            'Strength': ['Strong' if abs(corr) > 0.7 else 'Moderate' if abs(corr) > 0.3 else 'Weak' 
                        for corr in target_corrs.values]
        })
        
        # Save correlation analysis to CSV instead of displaying
        corr_analysis_file = f'{results_path}/target_correlations.csv'
        corr_analysis.to_csv(corr_analysis_file, index=False)
        
        print(f"🔍 Target correlation analysis saved to {corr_analysis_file}")
        print(f"📊 Correlation analysis completed for {len(target_corrs)} features")

else:
    print("⚠️ Insufficient numeric features for correlation analysis")

🔍 CORRELATION ANALYSIS (SAVING TO FILE)
🔍 Correlation heatmap saved to results/liver-train/2025-09-18/Section-2/correlation_heatmap.png
🔍 Correlation matrix saved to results/liver-train/2025-09-18/Section-2/correlation_matrix.csv

🔍 CORRELATIONS WITH TARGET VARIABLE (SAVING TO FILE)
🔍 Target correlation analysis saved to results/liver-train/2025-09-18/Section-2/target_correlations.csv
📊 Correlation analysis completed for 9 features
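
The strength labels written to `target_correlations.csv` depend only on the magnitude of each coefficient. The sketch below applies the same 0.7 and 0.3 cut-offs to a toy frame; the function name `correlation_strength` and the toy columns are assumptions for illustration.

```python
import pandas as pd


def correlation_strength(r: float) -> str:
    """Label |r| with the 0.7 / 0.3 cut-offs used in the correlation cell."""
    magnitude = abs(r)
    if magnitude > 0.7:
        return "Strong"
    if magnitude > 0.3:
        return "Moderate"
    return "Weak"


# Toy data: `x` is a linear function of the target, `noise` is uncorrelated with it
toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0],
    "noise": [1.0, -1.0, -1.0, 1.0],
    "target": [2.0, 4.0, 6.0, 8.0],
})
corr_with_target = toy.corr()["target"].drop("target")
labels = corr_with_target.map(correlation_strength)
```

Because the sign is discarded, a strongly negative correlation is labeled "Strong" just like a strongly positive one; the raw coefficient column in the CSV preserves the direction.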


This cell sets up the global configuration variables shared by the model evaluations in later sections. It first checks that `data` and `target_column` exist and raises an error if they do not. It then sets `TARGET_COLUMN`, keeps the existing `RESULTS_DIR` (or builds a fallback path), copies `data`, and determines the categorical columns, either from a user-specified `CATEGORICAL_COLUMNS` list or by auto-detecting object-typed columns other than the target. Finally, it creates the results directory if needed and verifies that all required global variables are defined.

In [15]:
# Code Chunk ID: CHUNK_014
# ============================================================================
# GLOBAL CONFIGURATION VARIABLES
# ============================================================================
# These variables are used across all sections for consistent evaluation

# Verify required variables exist before setting globals
if 'data' not in globals() or 'target_column' not in globals():
    raise ValueError("❌ ERROR: 'data' and 'target_column' must be defined before setting global variables. Please run the data loading cell first.")

# Set up global variables for use in all model evaluations
TARGET_COLUMN = target_column  # Use the target column from data loading

# 🔧 UPDATED: Preserve dataset-specific RESULTS_DIR that was set in CHUNK_005
# Don't override it with a generic path - maintain the structured approach
if 'RESULTS_DIR' not in globals() or RESULTS_DIR is None:
    # Fallback: reconstruct proper results directory structure.
    # Use the bare DATASET_IDENTIFIER, as the other cells do: `from setup import *`
    # does not bind the module name `setup`, so `setup.DATASET_IDENTIFIER` would
    # raise a NameError unless `import setup` was also run.
    RESULTS_DIR = f"results/{DATASET_IDENTIFIER}/"
    print(f"⚠️  RESULTS_DIR was missing - using fallback: {RESULTS_DIR}")
else:
    print(f"✅ Using existing RESULTS_DIR: {RESULTS_DIR}")

data = data.copy()    # Create a copy of original data for evaluation functions

# Define categorical columns for all models
categorical_columns = data.select_dtypes(include=['object']).columns.tolist()
if TARGET_COLUMN in categorical_columns:
    categorical_columns.remove(TARGET_COLUMN)  # Remove target from categorical list

# Apply user-specified categorical columns if provided
if 'CATEGORICAL_COLUMNS' in globals() and CATEGORICAL_COLUMNS:
    categorical_columns = CATEGORICAL_COLUMNS
    print(f"   • Using user-specified categorical columns: {categorical_columns}")
else:
    print(f"   • Auto-detected categorical columns: {categorical_columns}")

print("🔧 Global Configuration Summary:")
print(f"   • TARGET_COLUMN: {TARGET_COLUMN}")
print(f"   • RESULTS_DIR: {RESULTS_DIR}")
print(f"   • data shape: {data.shape}")
print(f"   • categorical_columns: {categorical_columns}")

# Create base results directory if it doesn't exist
import os
if not os.path.exists(RESULTS_DIR):
    os.makedirs(RESULTS_DIR, exist_ok=True)
    print(f"   • Created base results directory: {RESULTS_DIR}")
else:
    print(f"   • Base results directory already exists: {RESULTS_DIR}")

# Validate that all required variables are now available
required_vars = ['TARGET_COLUMN', 'RESULTS_DIR', 'data', 'categorical_columns']
missing_vars = [var for var in required_vars if var not in globals()]

if missing_vars:
    raise ValueError(f"❌ ERROR: Missing required global variables: {missing_vars}")
else:
    print("✅ All required global variables are now available for Section 3 evaluations")

✅ Using existing RESULTS_DIR: results/liver-train/
   • Using user-specified categorical columns: ['Gender of the patient']
🔧 Global Configuration Summary:
   • TARGET_COLUMN: Result
   • RESULTS_DIR: results/liver-train/
   • data shape: (5000, 11)
   • categorical_columns: ['Gender of the patient']
   • Base results directory already exists: results/liver-train/
✅ All required global variables are now available for Section 3 evaluations
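
`get_results_path` is imported from `setup.py`, and its source is not shown in this section. Judging only by the paths printed in the Section 2 outputs (for example `results/liver-train/2025-09-18/Section-2`), a helper with that behavior could look like the sketch below. The name `build_results_path`, its parameters, and the use of the current date as a default are all assumptions, not the actual implementation.

```python
from datetime import date
from pathlib import Path
from typing import Optional


def build_results_path(dataset_id: str, section: int,
                       base: str = "results",
                       run_date: Optional[date] = None) -> str:
    """Compose <base>/<dataset_id>/<YYYY-MM-DD>/Section-<n> (hypothetical layout)."""
    run_date = run_date or date.today()  # assumed default: the day of the run
    path = Path(base) / dataset_id / run_date.isoformat() / f"Section-{section}"
    return path.as_posix()  # forward slashes on every platform


example_path = build_results_path("liver-train", 2, run_date=date(2025, 9, 18))
```

Keeping the date in the path means repeated runs on different days do not overwrite each other's outputs, which appears to be the intent of the folder structure described in the cell comments.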


In [None]:
# ============================================================================
# SECTION 2 FINALIZATION: COMPLETE DATA PREPROCESSING PIPELINE
# ============================================================================
# Ensure all data preprocessing is complete and save final processed dataset

print("🎯 SECTION 2 FINALIZATION: COMPLETE DATA PREPROCESSING")
print("=" * 80)

# ============================================================================
# STEP 1: Apply smart categorical preprocessing
# ============================================================================
print("\n📊 STEP 1: Smart Categorical Data Preprocessing")
print("-" * 50)

# Apply our enhanced categorical preprocessing if not already applied
try:
    data_processed, categorical_cols_processed, encoders_processed = clean_and_preprocess_data(
        data, categorical_columns
    )
    
    print(f"✅ Smart categorical preprocessing completed:")
    print(f"   • Processed shape: {data_processed.shape}")
    print(f"   • Categorical columns: {categorical_cols_processed}")
    print(f"   • Missing values: {data_processed.isnull().sum().sum()}")
    
    # Update our data with processed version
    data = data_processed
    categorical_columns = categorical_cols_processed
    
except Exception as e:
    print(f"⚠️ Categorical preprocessing warning: {e}")
    print("   Using existing data as-is")

# ============================================================================
# STEP 2: Final data validation and summary
# ============================================================================
print("\n📊 STEP 2: Final Data Validation")
print("-" * 50)

# Display comprehensive categorical summary
display_categorical_summary(data, categorical_columns, TARGET_COLUMN)

# Final data quality checks
print(f"\n🔍 Final Data Quality Report:")
print(f"   • Shape: {data.shape}")
print(f"   • Missing values: {data.isnull().sum().sum()}")
print(f"   • Target column: {TARGET_COLUMN}")
print(f"   • Target distribution: {data[TARGET_COLUMN].value_counts().to_dict()}")
print(f"   • Categorical columns: {categorical_columns}")
print(f"   • Numeric columns: {len(data.select_dtypes(include=[np.number]).columns)}")

# ============================================================================
# STEP 3: Save processed dataset for Sections 3 & 4
# ============================================================================
print(f"\n💾 STEP 3: Saving Processed Dataset")
print("-" * 50)

# Ensure directory exists
import os
os.makedirs("data", exist_ok=True)

# Save the final processed dataset
processed_dataset_path = "data/liver_train_final_processed.csv"
data.to_csv(processed_dataset_path, index=False)

print(f"✅ Final processed dataset saved to: {processed_dataset_path}")
print(f"   • This dataset will be used in Sections 3 and 4")
print(f"   • All preprocessing, imputation, and categorical encoding applied")
print(f"   • Ready for synthetic data generation and evaluation")

# ============================================================================
# STEP 4: Set globals for Sections 3 & 4
# ============================================================================
print(f"\n🔧 STEP 4: Global Variables for Sections 3 & 4")
print("-" * 50)

# Set the path for Sections 3 & 4 to use
PROCESSED_DATASET_PATH = processed_dataset_path

print(f"✅ Global variables ready:")
print(f"   • TARGET_COLUMN: {TARGET_COLUMN}")
print(f"   • RESULTS_DIR: {RESULTS_DIR}")
print(f"   • PROCESSED_DATASET_PATH: {PROCESSED_DATASET_PATH}")
print(f"   • categorical_columns: {categorical_columns}")

print("\n" + "=" * 80)
print("🚀 SECTION 2 COMPLETE - Data Ready for Synthetic Generation!")
print("🎯 Sections 3 & 4 will use the processed dataset for consistent results")
print("=" * 80)
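
`clean_and_preprocess_data` is defined outside this section, so its exact behavior is not shown here; it returns a processed frame, a list of categorical columns, and a set of encoders. The sketch below illustrates one common way to integer-encode categorical columns while keeping a mapping for decoding synthetic output later. It is not a description of what that function does, and the name `encode_categoricals` is hypothetical.

```python
import pandas as pd


def encode_categoricals(df: pd.DataFrame, categorical_cols: list[str]):
    """Integer-encode the given columns; return the frame and code-to-label maps."""
    out = df.copy()
    mappings = {}
    for col in categorical_cols:
        categories = sorted(out[col].dropna().unique())
        to_code = {label: code for code, label in enumerate(categories)}
        out[col] = out[col].map(to_code)
        # Keep the inverse map so generated codes can be turned back into labels
        mappings[col] = {code: label for label, code in to_code.items()}
    return out, mappings


toy = pd.DataFrame({"Gender of the patient": ["Female", "Male", "Male"],
                    "Result": [1, 2, 1]})
encoded, mappings = encode_categoricals(toy, ["Gender of the patient"])
decoded = encoded["Gender of the patient"].map(mappings["Gender of the patient"])
```

Whatever encoding is used, storing the inverse mapping alongside the processed CSV makes it possible to report synthetic data in the original category labels.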

## 3 Demo All Models with Default Parameters

### 3.1 Demos

#### 3.1.1 CTGAN Demo

In [None]:
# Code Chunk ID: CHUNK_016
import time
try:
    print("🔄 CTGAN Demo - Default Parameters")
    print("=" * 50)
    
    # Import and initialize CTGAN model using ModelFactory
    from src.models.model_factory import ModelFactory
    
    ctgan_model = ModelFactory.create("ctgan", random_state=42)
    
    # Define demo parameters for quick execution
    demo_params = {
        'epochs': 500,
        'batch_size': 100,
        'generator_dim': (128, 128),
        'discriminator_dim': (128, 128)
    }
    
    # Train with demo parameters
    print("Training CTGAN with demo parameters...")
    start_time = time.time()
    
    # Auto-detect discrete columns
    discrete_columns = data.select_dtypes(include=['object']).columns.tolist()
    
    ctgan_model.train(data, discrete_columns=discrete_columns, **demo_params)
    train_time = time.time() - start_time
    
    # Generate synthetic data
    demo_samples = len(data)  # Same size as original dataset
    print(f"Generating {demo_samples} synthetic samples...")
    synthetic_data_ctgan = ctgan_model.generate(demo_samples)
    
    print(f"✅ CTGAN Demo completed successfully!")
    print(f"   - Training time: {train_time:.2f} seconds")
    print(f"   - Generated samples: {len(synthetic_data_ctgan)}")
    print(f"   - Original data shape: {data.shape}")
    print(f"   - Synthetic data shape: {synthetic_data_ctgan.shape}")
    
    # Store for later use in comprehensive evaluation
    demo_results_ctgan = {
        'model': ctgan_model,
        'synthetic_data': synthetic_data_ctgan,
        'training_time': train_time,
        'parameters_used': demo_params
    }
    
except ImportError as e:
    print(f"❌ CTGAN not available: {e}")
    print(f"   Please ensure CTGAN dependencies are installed")
except Exception as e:
    print(f"❌ Error during CTGAN demo: {str(e)}")
    print("   Check model implementation and data compatibility")
    import traceback
    traceback.print_exc()
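
The demo cells in this section repeat the same pattern: train, time the training, generate as many rows as the original data, and store the results in a dictionary. That pattern could be captured once in a helper, sketched below. `EchoModel` is a stand-in written for this example and is not one of the notebook's models; `run_demo` is a hypothetical name. The sketch assumes a model exposing `train()` and `generate()`, as the CTGAN wrapper does; the CTAB-GAN wrappers call `fit()` instead.

```python
import time

import pandas as pd


class EchoModel:
    """Stand-in model for illustration: 'generates' by resampling training rows."""

    def train(self, data: pd.DataFrame, **params) -> None:
        self._data = data
        self.params = params

    def generate(self, n_samples: int) -> pd.DataFrame:
        sampled = self._data.sample(n=n_samples, replace=True, random_state=0)
        return sampled.reset_index(drop=True)


def run_demo(model, data: pd.DataFrame, n_samples: int, **train_params) -> dict:
    """Train, time, and sample from any model exposing train() and generate()."""
    start = time.perf_counter()
    model.train(data, **train_params)
    train_time = time.perf_counter() - start
    synthetic = model.generate(n_samples)
    return {
        "model": model,
        "synthetic_data": synthetic,
        "training_time": train_time,
        "parameters_used": train_params,
    }


toy = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})
result = run_demo(EchoModel(), toy, n_samples=len(toy), epochs=1)
```

The returned dictionary uses the same keys as `demo_results_ctgan`, so downstream evaluation code would not need to change.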

#### 3.1.2 CTAB-GAN Demo

In [None]:
# Code Chunk ID: CHUNK_020
try:
    print("🔄 CTAB-GAN Demo - Default Parameters")
    print("=" * 50)
    
    # Check CTABGAN availability (imported from setup.py)
    if not CTABGAN_AVAILABLE:
        raise ImportError("CTAB-GAN not available - clone and install CTAB-GAN repository")
    
    # Initialize CTAB-GAN model (already defined in notebook)
    ctabgan_model = CTABGANModel()
    print("✅ CTAB-GAN model initialized successfully")

    # Record start time
    start_time = time.time()

    # Train the model with demo parameters
    print("🚀 Training CTAB-GAN model (epochs=500)...")
    ctabgan_model.fit(data, categorical_columns=None, target_column=target_column)
    
    # Record training time
    train_time = time.time() - start_time
    
    # Generate synthetic data
    print("🎯 Generating synthetic data...")
    synthetic_data_ctabgan = ctabgan_model.generate(len(data))

    # Display results
    print("✅ CTAB-GAN Demo completed successfully!")
    print(f"   - Training time: {train_time:.2f} seconds")
    print(f"   - Generated samples: {len(synthetic_data_ctabgan)}")
    print(f"   - Original shape: {data.shape}")
    print(f"   - Synthetic shape: {synthetic_data_ctabgan.shape}")
    
    # Show sample of synthetic data with proper handling for both DataFrame and array
    print(f"\n📊 Sample of generated data:")
    if hasattr(synthetic_data_ctabgan, 'head'):
        # It's a DataFrame
        print(synthetic_data_ctabgan.head())
    else:
        # It's likely a numpy array
        print("First 5 rows of synthetic data:")
        print(synthetic_data_ctabgan[:5])
    print("=" * 50)
    
except ImportError as e:
    print(f"❌ CTAB-GAN not available: {e}")
    print(f"   Please ensure CTAB-GAN dependencies are installed")
    print(f"   Note: CTABGAN_AVAILABLE = {globals().get('CTABGAN_AVAILABLE', 'undefined')}")
except Exception as e:
    print(f"❌ Error during CTAB-GAN demo: {str(e)}")
    print("   Check model implementation and data compatibility")
    import traceback
    traceback.print_exc()
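
Several demo cells print a sample of the generated data and branch on whether the model returned a DataFrame or a NumPy array. A minimal sketch of that check as a reusable function; the name `preview` is hypothetical.

```python
import numpy as np
import pandas as pd


def preview(samples, n: int = 5) -> str:
    """Return the first n rows as text for either a DataFrame or an array."""
    if isinstance(samples, pd.DataFrame):
        return samples.head(n).to_string()
    # Anything else is treated as array-like and sliced along the first axis
    return np.array2string(np.asarray(samples)[:n])


frame_preview = preview(pd.DataFrame({"a": [1, 2, 3]}), n=2)
array_preview = preview(np.arange(6).reshape(3, 2), n=2)
```

Using `isinstance` rather than `hasattr(obj, 'head')` makes the intent explicit, at the cost of not recognizing other objects that happen to provide a `head` method.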

#### 3.1.3 CTAB-GAN+ Demo

In [None]:
# Code Chunk ID: CHUNK_024
try:
    print("🔄 CTAB-GAN+ Demo - Default Parameters")
    print("=" * 50)
    
    # Check CTABGAN+ availability with fallback
    try:
        ctabganplus_available = CTABGANPLUS_AVAILABLE
    except NameError:
        print("⚠️  CTABGANPLUS_AVAILABLE variable not defined - checking direct import...")
        try:
            # Try to check if CTABGANPLUS (the imported class) exists
            from model.ctabgan import CTABGAN as CTABGANPLUS
            ctabganplus_available = True
            print("✅ CTAB-GAN+ import check successful")
        except ImportError:
            ctabganplus_available = False
            print("❌ CTAB-GAN+ import check failed")
    
    if not ctabganplus_available:
        raise ImportError("CTAB-GAN+ not available - clone and install CTAB-GAN+ repository")
    
    # Initialize CTAB-GAN+ model with epochs parameter in constructor
    ctabganplus_model = CTABGANPlusModel(epochs=500)
    print("✅ CTAB-GAN+ model initialized successfully")

    # Record start time
    start_time = time.time()

    # Train the model (epochs already set in constructor)
    print("🚀 Training CTAB-GAN+ model (epochs=500)...")
    ctabganplus_model.fit(data)
    
    # Record training time
    train_time = time.time() - start_time
    
    # Generate synthetic data
    print("🎯 Generating synthetic data...")
    synthetic_data_ctabganplus = ctabganplus_model.generate(len(data))

    # Display results
    print("✅ CTAB-GAN+ Demo completed successfully!")
    print(f"   - Training time: {train_time:.2f} seconds")
    print(f"   - Generated samples: {len(synthetic_data_ctabganplus)}")
    print(f"   - Original shape: {data.shape}")
    print(f"   - Synthetic shape: {synthetic_data_ctabganplus.shape}")
    
    # Show sample of synthetic data with proper handling for both DataFrame and array
    print(f"\n📊 Sample of generated data:")
    if hasattr(synthetic_data_ctabganplus, 'head'):
        # It's a DataFrame
        print(synthetic_data_ctabganplus.head())
    else:
        # It's likely a numpy array
        print("First 5 rows of synthetic data:")
        print(synthetic_data_ctabganplus[:5])
    print("=" * 50)
    
except ImportError as e:
    print(f"❌ CTAB-GAN+ not available: {e}")
    print(f"   Please ensure CTAB-GAN+ dependencies are installed")
except Exception as e:
    print(f"❌ Error during CTAB-GAN+ demo: {str(e)}")
    print("   Check model implementation and data compatibility")
    import traceback
    traceback.print_exc()

#### 3.1.4 GANerAid Demo

In [None]:
# Code Chunk ID: CHUNK_028
try:
    print("🔄 GANerAid Demo - Default Parameters")
    print("=" * 50)
    
    # Check GANerAid availability with fallback
    try:
        ganeraid_available = GANERAID_AVAILABLE
        GANerAidModel  # Test if the class is available
    except NameError:
        print("⚠️ GANerAidModel not available - checking import...")
        try:
            # Try to import GANerAidModel
            from src.models.implementations.ganeraid_model import GANerAidModel
            ganeraid_available = True
            print("✅ GANerAidModel import successful")
        except ImportError:
            ganeraid_available = False
            print("❌ GANerAidModel import failed")
    
    if not ganeraid_available:
        raise ImportError("GANerAid not available - please install GANerAid dependencies")
    
    # Initialize GANerAid model
    ganeraid_model = GANerAidModel()
    print("✅ GANerAid model initialized successfully")
    
    # Define demo_samples variable for synthetic data generation
    demo_samples = len(data)  # Same size as original dataset
    
    # Train with minimal parameters for demo
    demo_params = {'epochs': 500, 'batch_size': 100}
    start_time = time.time()
    ganeraid_model.train(data, **demo_params)  # GANerAid uses train method
    train_time = time.time() - start_time
    
    # Generate synthetic data
    synthetic_data_ganeraid = ganeraid_model.generate(demo_samples)
    
    print(f"✅ GANerAid Demo completed successfully!")
    print(f"   - Training time: {train_time:.2f} seconds")
    print(f"   - Generated samples: {len(synthetic_data_ganeraid)}")
    print(f"   - Original shape: {data.shape}")
    print(f"   - Synthetic shape: {synthetic_data_ganeraid.shape}")
    
    # Show sample of synthetic data with proper handling for both DataFrame and array
    print(f"\n📊 Sample of generated data:")
    if hasattr(synthetic_data_ganeraid, 'head'):
        # It's a DataFrame
        print(synthetic_data_ganeraid.head())
    else:
        # It's likely a numpy array
        print("First 5 rows of synthetic data:")
        print(synthetic_data_ganeraid[:5])
    print("=" * 50)
    
except ImportError as e:
    print(f"❌ GANerAid not available: {e}")
    print(f"   Please ensure GANerAid dependencies are installed")
except Exception as e:
    print(f"❌ Error during GANerAid demo: {str(e)}")
    print("   Check model implementation and data compatibility")
    import traceback
    traceback.print_exc()

#### 3.1.5 CopulaGAN Demo

In [None]:
# Code Chunk ID: CHUNK_031
try:
    print("🔄 CopulaGAN Demo - Default Parameters")
    print("=" * 50)
    
    # Import and initialize CopulaGAN model using ModelFactory
    from src.models.model_factory import ModelFactory
    
    copulagan_model = ModelFactory.create("copulagan", random_state=42)
    
    # Define demo parameters optimized for CopulaGAN
    demo_params = {
        'epochs': 500,
        'batch_size': 100,
        'generator_dim': (128, 128),
        'discriminator_dim': (128, 128),
        'default_distribution': 'beta',  # Good for bounded data
        'enforce_min_max_values': True
    }
    
    # Train with demo parameters
    print("Training CopulaGAN with demo parameters...")
    start_time = time.time()
    
    # Auto-detect discrete columns for CopulaGAN
    discrete_columns = data.select_dtypes(include=['object']).columns.tolist()
    
    copulagan_model.train(data, discrete_columns=discrete_columns, **demo_params)
    train_time = time.time() - start_time
    
    # Generate synthetic data
    demo_samples = len(data)  # Same size as original dataset
    print(f"Generating {demo_samples} synthetic samples...")
    synthetic_data_copulagan = copulagan_model.generate(demo_samples)
    
    print(f"✅ CopulaGAN Demo completed successfully!")
    print(f"   - Training time: {train_time:.2f} seconds")
    print(f"   - Generated samples: {len(synthetic_data_copulagan)}")
    print(f"   - Original data shape: {data.shape}")
    print(f"   - Synthetic data shape: {synthetic_data_copulagan.shape}")
    print(f"   - Distribution used: {demo_params['default_distribution']}")
    
    # Store for later use in comprehensive evaluation
    demo_results_copulagan = {
        'model': copulagan_model,
        'synthetic_data': synthetic_data_copulagan,
        'training_time': train_time,
        'parameters_used': demo_params
    }
    
except ImportError as e:
    print(f"❌ CopulaGAN not available: {e}")
    print(f"   Please ensure CopulaGAN dependencies are installed")
except Exception as e:
    print(f"❌ Error during CopulaGAN demo: {str(e)}")
    print("   Check model implementation and data compatibility")
    import traceback
    traceback.print_exc()
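
The `select_dtypes(include=['object'])` call above only picks up string-typed columns. Integer-coded categoricals, such as a label-encoded feature or a binary target, would be passed to the model as continuous. A minimal sketch of a broader detector follows. The helper name `detect_discrete_columns` and the `max_unique` threshold of 10 are assumptions made for illustration and are not part of this notebook's codebase.

```python
import pandas as pd


def detect_discrete_columns(df: pd.DataFrame, max_unique: int = 10) -> list:
    """Return the columns that look categorical.

    Hypothetical helper: a column counts as discrete when it is non-numeric
    (object, category, string), boolean, or integer-typed with at most
    `max_unique` distinct values. The threshold is an illustrative assumption.
    """
    discrete = []
    for col in df.columns:
        series = df[col]
        # Booleans report as numeric in pandas, so check them first
        if pd.api.types.is_bool_dtype(series) or not pd.api.types.is_numeric_dtype(series):
            discrete.append(col)
        elif pd.api.types.is_integer_dtype(series) and series.nunique(dropna=True) <= max_unique:
            discrete.append(col)
    return discrete


# Toy frame standing in for the clinical table (values are made up)
toy = pd.DataFrame({
    "Age": [45.0, 62.5, 38.2, 51.7],
    "Gender": ["Male", "Female", "Male", "Male"],
    "Result": [1, 2, 1, 1],
})
print(detect_discrete_columns(toy))
```

Whether a low-cardinality integer column should really be modelled as discrete depends on the dataset, so the threshold is worth reviewing per column rather than trusting blindly.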

#### 3.1.6 TVAE Demo

In [None]:
# Code Chunk ID: CHUNK_034
try:
    print("🔄 TVAE Demo - Demo Parameters")
    print("=" * 50)
    
    # Import and initialize TVAE model using ModelFactory
    from src.models.model_factory import ModelFactory
    
    tvae_model = ModelFactory.create("tvae", random_state=42)
    
    # Define demo parameters optimized for TVAE
    demo_params = {
        'epochs': 50,
        'batch_size': 100,
        'compress_dims': (128, 128),
        'decompress_dims': (128, 128),
        'l2scale': 1e-5,
        'loss_factor': 2,
        'learning_rate': 1e-3  # VAE-specific learning rate
    }
    
    # Train with demo parameters
    print("Training TVAE with demo parameters...")
    start_time = time.time()
    
    # Auto-detect discrete columns for TVAE
    discrete_columns = data.select_dtypes(include=['object']).columns.tolist()
    
    tvae_model.train(data, discrete_columns=discrete_columns, **demo_params)
    train_time = time.time() - start_time
    
    # Generate synthetic data
    demo_samples = len(data)  # Same size as original dataset
    print(f"Generating {demo_samples} synthetic samples...")
    synthetic_data_tvae = tvae_model.generate(demo_samples)
    
    print(f"✅ TVAE Demo completed successfully!")
    print(f"   - Training time: {train_time:.2f} seconds")
    print(f"   - Generated samples: {len(synthetic_data_tvae)}")
    print(f"   - Original data shape: {data.shape}")
    print(f"   - Synthetic data shape: {synthetic_data_tvae.shape}")
    print(f"   - VAE architecture: compress{demo_params['compress_dims']} → decompress{demo_params['decompress_dims']}")
    
    # Store for later use in comprehensive evaluation
    demo_results_tvae = {
        'model': tvae_model,
        'synthetic_data': synthetic_data_tvae,
        'training_time': train_time,
        'parameters_used': demo_params
    }
    
except ImportError as e:
    print(f"❌ TVAE not available: {e}")
    print(f"   Please ensure TVAE dependencies are installed")
except Exception as e:
    print(f"❌ Error during TVAE demo: {str(e)}")
    print("   Check model implementation and data compatibility")
    import traceback
    traceback.print_exc()

### 3.2 Batch Process

In [None]:
# Code Chunk ID: CHUNK_018
# ============================================================================
# SECTION 3 - BATCH EVALUATION FOR ALL TRAINED MODELS
# Standardized evaluation using enhanced batch evaluation system
# ============================================================================

print("🔍 SECTION 3 - COMPREHENSIVE BATCH EVALUATION")
print("=" * 60)

section3_results = evaluate_all_available_models(
    section_number=3,
    scope=globals(),  # Pass notebook scope to access synthetic data variables
    models_to_evaluate=None,  # Evaluate all available models
    real_data=None,  # Will use 'data' from scope
    target_col=None   # Will use 'target_column' from scope
)

if section3_results:
    print(f"\n🎉 SECTION 3 BATCH EVALUATION COMPLETED!")
    print(f"📊 Evaluated {len(section3_results)} models successfully")
    print(f"📁 All results saved to organized folder structure")
    
    # Show quick summary of best performing models
    best_models = []
    for model_name, results in section3_results.items():
        if 'error' not in results:
            quality_score = results.get('overall_quality_score', 0)
            best_models.append((model_name, quality_score))
    
    if best_models:
        best_models.sort(key=lambda x: x[1], reverse=True)
        print(f"\n🏆 RANKING BY QUALITY SCORE:")
        for i, (model, score) in enumerate(best_models, 1):
            print(f"   {i}. {model}: {score:.3f}")
else:
    print("\n⚠️ No models available for evaluation")
    print("   Train some models first in previous sections")
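
The ranking loop above can be factored into a small pure function, which makes it easy to unit-test without training any models. This is a sketch under the same assumptions the cell makes: each entry maps a model name to a dict that contains either an `'error'` key or an `'overall_quality_score'`. The function name `rank_by_quality` and the example scores are made up.

```python
from typing import Any, Dict, List, Tuple


def rank_by_quality(results: Dict[str, Dict[str, Any]],
                    score_key: str = "overall_quality_score") -> List[Tuple[str, float]]:
    """Return (model, score) pairs sorted best-first, skipping failed evaluations."""
    ranked = [
        (name, float(res.get(score_key, 0)))
        for name, res in results.items()
        if "error" not in res
    ]
    ranked.sort(key=lambda pair: pair[1], reverse=True)
    return ranked


# Illustrative input only; these scores are not results from the notebook
example = {
    "ModelA": {"overall_quality_score": 0.61},
    "ModelB": {"error": "training failed"},
    "ModelC": {"overall_quality_score": 0.74},
}
for position, (model, score) in enumerate(rank_by_quality(example), 1):
    print(f"{position}. {model}: {score:.3f}")
```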

## 4 Hyperparameter Tuning for Each Model

### 4.1 Hyperparameter Optimization

#### 4.1.1 CTGAN Hyperparameter Optimization

In [None]:
# ============================================================================
# SECTION 4 DATA LOADING: USE PROCESSED DATASET FROM SECTION 2
# ============================================================================
# Load the final processed dataset saved from Section 2

print("🔧 SECTION 4: LOADING PROCESSED DATASET")
print("=" * 60)

# Load the processed dataset that was saved at the end of Section 2
if 'PROCESSED_DATASET_PATH' in globals():
    processed_path = PROCESSED_DATASET_PATH
else:
    processed_path = "data/liver_train_final_processed.csv"

print(f"📂 Loading processed dataset from: {processed_path}")

try:
    # Load the fully processed dataset
    data = pd.read_csv(processed_path)
    
    n_missing = int(data.isnull().sum().sum())
    print(f"✅ Processed dataset loaded successfully!")
    print(f"📊 Shape: {data.shape}")
    print(f"📊 Missing values: {n_missing}")
    print(f"📊 Target column '{TARGET_COLUMN}' distribution:")
    print(data[TARGET_COLUMN].value_counts())
    
    # Verify categorical columns are properly processed
    if categorical_columns:
        print(f"\n📊 Categorical columns verification:")
        for col in categorical_columns:
            if col in data.columns:
                unique_vals = data[col].unique()
                print(f"   • {col}: {len(unique_vals)} unique values")
                if len(unique_vals) <= 5:
                    print(f"     Values: {list(unique_vals)}")
    
    # Report measured properties instead of asserting them unconditionally
    non_numeric = data.select_dtypes(exclude='number').columns.tolist()
    print(f"\n✅ Section 4 dataset loaded!")
    print(f"   • Non-numeric columns remaining: {non_numeric if non_numeric else 'none'}")
    print(f"   • Missing values remaining: {n_missing}")
    print(f"   • Ready for hyperparameter optimization: {n_missing == 0}")
    
except FileNotFoundError:
    print(f"❌ ERROR: Processed dataset not found at {processed_path}")
    print(f"   Please ensure Section 2 has been run completely")
    print(f"   Section 2 should save the processed dataset automatically")
    raise

except Exception as e:
    print(f"❌ ERROR loading processed dataset: {e}")
    raise

print("=" * 60)

In [16]:
# Code Chunk ID: CHUNK_040
# CTGAN Hyperparameter Optimization Execution
# Complete optimization study with search space definition and execution

import optuna
import time
from datetime import datetime
import json
import pandas as pd

# ============================================================================
# CRITICAL FIX: Load clean, imputed subset data for Section 4
# ============================================================================
print("🔄 Loading clean subset data for Section 4...")
data = pd.read_csv("data/liver_train_subset.csv")
print(f"✅ Clean data loaded: {data.shape[0]} rows, {data.shape[1]} columns")
print(f"✅ Missing values: {data.isnull().sum().sum()}")
print(f"✅ Target column '{TARGET_COLUMN}' distribution:")
print(data[TARGET_COLUMN].value_counts())

# Validate data quality
if data.isnull().sum().sum() > 0:
    raise ValueError("ERROR: Section 4 data still contains missing values!")
else:
    print("✅ Data validation passed: 0 missing values confirmed")

# ============================================================================
# SECTION 4.1: CTGAN HYPERPARAMETER OPTIMIZATION 
# ============================================================================

print("🔧 SECTION 4.1: CTGAN HYPERPARAMETER OPTIMIZATION")
print("=" * 80)

def ctgan_search_space(trial):
    """CTGAN search space definition based on working CTGAN and Optuna best practices."""
    # suggest_loguniform is deprecated since Optuna 3.0; suggest_float(..., log=True) is the equivalent call.
    return {
        'epochs': trial.suggest_int('epochs', 50, 300),
        'batch_size': trial.suggest_categorical('batch_size', [100, 200, 500, 1000]),
        'pac': trial.suggest_int('pac', 1, 10),
        'generator_lr': trial.suggest_float('generator_lr', 1e-5, 1e-2, log=True),
        'discriminator_lr': trial.suggest_float('discriminator_lr', 1e-5, 1e-2, log=True),
        # Tuple choices make Optuna warn about persistent storage; they work for this in-memory study.
        'generator_dim': trial.suggest_categorical('generator_dim', [(128, 128), (256, 256)]),
        'discriminator_dim': trial.suggest_categorical('discriminator_dim', [(128, 128), (256, 256)]),
        'generator_decay': trial.suggest_float('generator_decay', 1e-8, 1e-4, log=True),
        'discriminator_decay': trial.suggest_float('discriminator_decay', 1e-8, 1e-4, log=True),
        'log_frequency': trial.suggest_categorical('log_frequency', [True, False]),
        'verbose': trial.suggest_categorical('verbose', [True])
    }

def ctgan_objective(trial):
    """CTGAN objective function with corrected PAC validation and fixed imports."""
    try:
        # Get hyperparameters from trial
        params = ctgan_search_space(trial)
        
        # PAC validation: batch_size must be divisible by pac, so lower pac if needed
        batch_size = params['batch_size']
        original_pac = params['pac']
        
        # Find the largest compatible PAC value <= original_pac
        compatible_pac = original_pac
        while batch_size % compatible_pac != 0 and compatible_pac > 1:
            compatible_pac -= 1
        
        if compatible_pac != original_pac:
            print(f"⚠️  Adjusted PAC from {original_pac} to {compatible_pac} for batch_size {batch_size}")
        
        params['pac'] = compatible_pac
        # trial.params keeps the sampled pac, so record the value actually used for training
        trial.set_user_attr('effective_pac', compatible_pac)
        print(f"✅ PAC validation: {batch_size} % {compatible_pac} = {batch_size % compatible_pac}")

        print(f"\n🔄 CTGAN Trial {trial.number + 1}: epochs={params['epochs']}, batch_size={params['batch_size']}, pac={params['pac']}, lr={params['generator_lr']:.2e}")
        
        # Import model factory
        from src.models.model_factory import ModelFactory
        
        # Create CTGAN model
        ctgan_model = ModelFactory.create("ctgan", random_state=42)
        
        print(f"🎯 Using target column: '{TARGET_COLUMN}'")
        print("✅ Using CTGAN from ctgan package")
        
        # Auto-detect discrete columns
        discrete_columns = data.select_dtypes(include=['object']).columns.tolist()
        
        # Train model with trial parameters
        start_time = time.time()
        ctgan_model.train(data, discrete_columns=discrete_columns, **params)
        train_time = time.time() - start_time
        
        print(f"⏱️ Training completed in {train_time:.1f} seconds")
        
        # Generate synthetic data for evaluation
        synthetic_data = ctgan_model.generate(5000)
        print(f"📊 Generated synthetic data: {synthetic_data.shape}")
        
        # Use enhanced objective function
        from setup import enhanced_objective_function_v2
        combined_score, similarity_score, accuracy_score = enhanced_objective_function_v2(
            data, synthetic_data, TARGET_COLUMN
        )
        
        print(f"🎯 Trial {trial.number + 1} Results:")
        print(f"   • Combined Score: {combined_score:.4f}")
        print(f"   • Similarity: {similarity_score:.4f}")
        print(f"   • Accuracy: {accuracy_score:.4f}")
        
        return combined_score
        
    except Exception as e:
        print(f"❌ Trial {trial.number + 1} failed: {str(e)}")
        return 0.0

# Create and run optimization study
print(f"🔄 Creating CTGAN optimization study...")
print(f"📊 Dataset info: {len(data)} rows, {len(data.columns)} columns")
print(f"📊 Target column '{TARGET_COLUMN}' unique values: {data[TARGET_COLUMN].nunique()}")
print()

study = optuna.create_study(direction='maximize')
study.optimize(ctgan_objective, n_trials=15)

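# Extract and display results
best_trial = study.best_trial
print(f"\n✅ CTGAN Optimization completed!")
print(f"🏆 Best score: {best_trial.value:.4f}")
print(f"🔧 Best parameters:")
for param, value in best_trial.params.items():
    print(f"   • {param}: {value}")

# best_trial.params holds the sampled pac; report the value the objective actually trained with
best_pac = best_trial.params['pac']
while best_trial.params['batch_size'] % best_pac != 0 and best_pac > 1:
    best_pac -= 1
if best_pac != best_trial.params['pac']:
    print(f"   • pac used in training (after adjustment): {best_pac}")

print("✅ CTGAN optimization completed successfully!")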

[I 2025-09-18 10:33:07,935] A new study created in memory with name: no-name-03f906b7-2013-44a2-8982-89f1aa9b97a0


🔄 Loading clean subset data for Section 4...
✅ Clean data loaded: 5000 rows, 11 columns
✅ Missing values: 0
✅ Target column 'Result' distribution:
Result
1    3571
2    1429
Name: count, dtype: int64
✅ Data validation passed: 0 missing values confirmed
🔧 SECTION 4.1: CTGAN HYPERPARAMETER OPTIMIZATION
🔄 Creating CTGAN optimization study...
📊 Dataset info: 5000 rows, 11 columns
📊 Target column 'Result' unique values: 2

✅ PAC validation: 1000 % 4 = 0

🔄 CTGAN Trial 1: epochs=82, batch_size=1000, pac=4, lr=1.23e-04
🎯 Using target column: 'Result'
✅ Using CTGAN from ctgan package


Gen. (-2.32) | Discrim. (-0.02): 100%|██████████| 82/82 [00:23<00:00,  3.45it/s]


⏱️ Training completed in 32.3 seconds
📊 Generated synthetic data: (5000, 11)
[TARGET] Enhanced objective function using target column: 'Result'
[OK] Similarity Analysis: 10/10 valid metrics, Average: 0.4483


[I 2025-09-18 10:33:41,010] Trial 0 finished with value: 0.5446516538316053 and parameters: {'epochs': 82, 'batch_size': 1000, 'pac': 4, 'generator_lr': 0.00012252263126709824, 'discriminator_lr': 0.0001125490514094824, 'generator_dim': (128, 128), 'discriminator_dim': (256, 256), 'generator_decay': 6.430628409771229e-05, 'discriminator_decay': 3.100193327653601e-05, 'log_frequency': True, 'verbose': True}. Best is trial 0 with value: 0.5446516538316053.


[OK] TRTS (Synthetic->Real): 0.7142
[OK] TRTS Evaluation: 2 scenarios, Average: 0.6892
[CHART] Combined Score: 0.5447 (Similarity: 0.4483, Accuracy: 0.6892)
🎯 Trial 1 Results:
   • Combined Score: 0.5447
   • Similarity: 0.4483
   • Accuracy: 0.6892
⚠️  Adjusted PAC from 8 to 5 for batch_size 100
✅ PAC validation: 100 % 5 = 0

🔄 CTGAN Trial 2: epochs=159, batch_size=100, pac=5, lr=1.00e-03
🎯 Using target column: 'Result'
✅ Using CTGAN from ctgan package


Gen. (-2.06) | Discrim. (0.03): 100%|██████████| 159/159 [00:45<00:00,  3.48it/s]


⏱️ Training completed in 51.0 seconds
📊 Generated synthetic data: (5000, 11)
[TARGET] Enhanced objective function using target column: 'Result'
[OK] Similarity Analysis: 10/10 valid metrics, Average: 0.4769


[I 2025-09-18 10:34:32,847] Trial 1 finished with value: 0.5651739921758683 and parameters: {'epochs': 159, 'batch_size': 100, 'pac': 8, 'generator_lr': 0.0010043300241983526, 'discriminator_lr': 1.93229036723994e-05, 'generator_dim': (128, 128), 'discriminator_dim': (128, 128), 'generator_decay': 7.562979358615399e-07, 'discriminator_decay': 1.5222672870320026e-07, 'log_frequency': False, 'verbose': True}. Best is trial 1 with value: 0.5651739921758683.


[OK] TRTS (Synthetic->Real): 0.7000
[OK] TRTS Evaluation: 2 scenarios, Average: 0.6976
[CHART] Combined Score: 0.5652 (Similarity: 0.4769, Accuracy: 0.6976)
🎯 Trial 2 Results:
   • Combined Score: 0.5652
   • Similarity: 0.4769
   • Accuracy: 0.6976
⚠️  Adjusted PAC from 8 to 5 for batch_size 500
✅ PAC validation: 500 % 5 = 0

🔄 CTGAN Trial 3: epochs=53, batch_size=500, pac=5, lr=7.37e-04
🎯 Using target column: 'Result'
✅ Using CTGAN from ctgan package


Gen. (-2.12) | Discrim. (0.00): 100%|██████████| 53/53 [00:15<00:00,  3.43it/s]


⏱️ Training completed in 18.0 seconds
📊 Generated synthetic data: (5000, 11)
[TARGET] Enhanced objective function using target column: 'Result'
[OK] Similarity Analysis: 10/10 valid metrics, Average: 0.4274


[I 2025-09-18 10:34:51,670] Trial 2 finished with value: 0.546265072025546 and parameters: {'epochs': 53, 'batch_size': 500, 'pac': 8, 'generator_lr': 0.0007367967192597789, 'discriminator_lr': 2.9098702225049595e-05, 'generator_dim': (128, 128), 'discriminator_dim': (128, 128), 'generator_decay': 2.737686156430235e-07, 'discriminator_decay': 1.0872635223078182e-08, 'log_frequency': False, 'verbose': True}. Best is trial 1 with value: 0.5651739921758683.


[OK] TRTS (Synthetic->Real): 0.7138
[OK] TRTS Evaluation: 2 scenarios, Average: 0.7246
[CHART] Combined Score: 0.5463 (Similarity: 0.4274, Accuracy: 0.7246)
🎯 Trial 3 Results:
   • Combined Score: 0.5463
   • Similarity: 0.4274
   • Accuracy: 0.7246
✅ PAC validation: 200 % 10 = 0

🔄 CTGAN Trial 4: epochs=186, batch_size=200, pac=10, lr=9.25e-03
🎯 Using target column: 'Result'
✅ Using CTGAN from ctgan package


Gen. (-2.12) | Discrim. (0.08): 100%|██████████| 186/186 [00:53<00:00,  3.46it/s]


⏱️ Training completed in 56.4 seconds
📊 Generated synthetic data: (5000, 11)
[TARGET] Enhanced objective function using target column: 'Result'
[OK] Similarity Analysis: 10/10 valid metrics, Average: 0.4962


[I 2025-09-18 10:35:48,860] Trial 3 finished with value: 0.5704671133635955 and parameters: {'epochs': 186, 'batch_size': 200, 'pac': 10, 'generator_lr': 0.009246271959789708, 'discriminator_lr': 0.0002927586307297788, 'generator_dim': (128, 128), 'discriminator_dim': (128, 128), 'generator_decay': 2.8904359687963482e-05, 'discriminator_decay': 3.1618587731121753e-08, 'log_frequency': True, 'verbose': True}. Best is trial 3 with value: 0.5704671133635955.


[OK] TRTS (Synthetic->Real): 0.6854
[OK] TRTS Evaluation: 2 scenarios, Average: 0.6819
[CHART] Combined Score: 0.5705 (Similarity: 0.4962, Accuracy: 0.6819)
🎯 Trial 4 Results:
   • Combined Score: 0.5705
   • Similarity: 0.4962
   • Accuracy: 0.6819
✅ PAC validation: 1000 % 8 = 0

🔄 CTGAN Trial 5: epochs=80, batch_size=1000, pac=8, lr=6.11e-04
🎯 Using target column: 'Result'
✅ Using CTGAN from ctgan package


Gen. (-2.01) | Discrim. (-0.00): 100%|██████████| 80/80 [00:23<00:00,  3.47it/s]


⏱️ Training completed in 25.7 seconds
📊 Generated synthetic data: (5000, 11)
[TARGET] Enhanced objective function using target column: 'Result'
[OK] Similarity Analysis: 10/10 valid metrics, Average: 0.4568


[I 2025-09-18 10:36:15,352] Trial 4 finished with value: 0.5426528247790218 and parameters: {'epochs': 80, 'batch_size': 1000, 'pac': 8, 'generator_lr': 0.0006109673783990332, 'discriminator_lr': 8.272710659457367e-05, 'generator_dim': (128, 128), 'discriminator_dim': (128, 128), 'generator_decay': 3.3065827739529055e-08, 'discriminator_decay': 1.69112287614393e-05, 'log_frequency': False, 'verbose': True}. Best is trial 3 with value: 0.5704671133635955.


[OK] TRTS (Synthetic->Real): 0.7126
[OK] TRTS Evaluation: 2 scenarios, Average: 0.6715
[CHART] Combined Score: 0.5427 (Similarity: 0.4568, Accuracy: 0.6715)
🎯 Trial 5 Results:
   • Combined Score: 0.5427
   • Similarity: 0.4568
   • Accuracy: 0.6715
⚠️  Adjusted PAC from 3 to 2 for batch_size 1000
✅ PAC validation: 1000 % 2 = 0

🔄 CTGAN Trial 6: epochs=85, batch_size=1000, pac=2, lr=3.66e-05
🎯 Using target column: 'Result'
✅ Using CTGAN from ctgan package


Gen. (-2.33) | Discrim. (-0.02): 100%|██████████| 85/85 [00:24<00:00,  3.48it/s]


⏱️ Training completed in 27.1 seconds
📊 Generated synthetic data: (5000, 11)
[TARGET] Enhanced objective function using target column: 'Result'
[OK] Similarity Analysis: 10/10 valid metrics, Average: 0.4685


[I 2025-09-18 10:36:43,230] Trial 5 finished with value: 0.5845130688228797 and parameters: {'epochs': 85, 'batch_size': 1000, 'pac': 3, 'generator_lr': 3.6595762496340635e-05, 'discriminator_lr': 0.008739437589622541, 'generator_dim': (256, 256), 'discriminator_dim': (256, 256), 'generator_decay': 4.056451320329324e-08, 'discriminator_decay': 1.786356437961059e-08, 'log_frequency': False, 'verbose': True}. Best is trial 5 with value: 0.5845130688228797.


[OK] TRTS (Synthetic->Real): 0.7142
[OK] TRTS Evaluation: 2 scenarios, Average: 0.7586
[CHART] Combined Score: 0.5845 (Similarity: 0.4685, Accuracy: 0.7586)
🎯 Trial 6 Results:
   • Combined Score: 0.5845
   • Similarity: 0.4685
   • Accuracy: 0.7586
⚠️  Adjusted PAC from 6 to 5 for batch_size 1000
✅ PAC validation: 1000 % 5 = 0

🔄 CTGAN Trial 7: epochs=273, batch_size=1000, pac=5, lr=2.24e-05
🎯 Using target column: 'Result'
✅ Using CTGAN from ctgan package


Gen. (-1.64) | Discrim. (-0.07): 100%|██████████| 273/273 [01:18<00:00,  3.46it/s]


⏱️ Training completed in 81.5 seconds
📊 Generated synthetic data: (5000, 11)
[TARGET] Enhanced objective function using target column: 'Result'
[OK] Similarity Analysis: 10/10 valid metrics, Average: 0.5584


[I 2025-09-18 10:38:05,407] Trial 6 finished with value: 0.6377901404629124 and parameters: {'epochs': 273, 'batch_size': 1000, 'pac': 6, 'generator_lr': 2.2422302614277393e-05, 'discriminator_lr': 0.0011706923335360964, 'generator_dim': (256, 256), 'discriminator_dim': (128, 128), 'generator_decay': 2.6906039146885346e-06, 'discriminator_decay': 3.459507363914129e-05, 'log_frequency': False, 'verbose': True}. Best is trial 6 with value: 0.6377901404629124.


[OK] TRTS (Synthetic->Real): 0.7184
[OK] TRTS Evaluation: 2 scenarios, Average: 0.7569
[CHART] Combined Score: 0.6378 (Similarity: 0.5584, Accuracy: 0.7569)
🎯 Trial 7 Results:
   • Combined Score: 0.6378
   • Similarity: 0.5584
   • Accuracy: 0.7569
⚠️  Adjusted PAC from 6 to 5 for batch_size 100
✅ PAC validation: 100 % 5 = 0

🔄 CTGAN Trial 8: epochs=300, batch_size=100, pac=5, lr=1.97e-05
🎯 Using target column: 'Result'
✅ Using CTGAN from ctgan package


Gen. (-1.35) | Discrim. (-0.18): 100%|██████████| 300/300 [01:26<00:00,  3.46it/s]


⏱️ Training completed in 89.3 seconds
📊 Generated synthetic data: (5000, 11)
[TARGET] Enhanced objective function using target column: 'Result'
[OK] Similarity Analysis: 10/10 valid metrics, Average: 0.5152


[I 2025-09-18 10:39:35,449] Trial 7 finished with value: 0.5702958377957967 and parameters: {'epochs': 300, 'batch_size': 100, 'pac': 6, 'generator_lr': 1.9656607108354457e-05, 'discriminator_lr': 0.0013699705973202005, 'generator_dim': (256, 256), 'discriminator_dim': (256, 256), 'generator_decay': 2.963015853251557e-07, 'discriminator_decay': 1.5704457260967846e-08, 'log_frequency': False, 'verbose': True}. Best is trial 6 with value: 0.6377901404629124.


[OK] TRTS (Synthetic->Real): 0.7000
[OK] TRTS Evaluation: 2 scenarios, Average: 0.6529
[CHART] Combined Score: 0.5703 (Similarity: 0.5152, Accuracy: 0.6529)
🎯 Trial 8 Results:
   • Combined Score: 0.5703
   • Similarity: 0.5152
   • Accuracy: 0.6529
✅ PAC validation: 1000 % 10 = 0

🔄 CTGAN Trial 9: epochs=233, batch_size=1000, pac=10, lr=2.98e-05
🎯 Using target column: 'Result'
✅ Using CTGAN from ctgan package


Gen. (-1.95) | Discrim. (-0.02): 100%|██████████| 233/233 [01:07<00:00,  3.45it/s]


⏱️ Training completed in 70.1 seconds
📊 Generated synthetic data: (5000, 11)
[TARGET] Enhanced objective function using target column: 'Result'
[OK] Similarity Analysis: 10/10 valid metrics, Average: 0.5310


[I 2025-09-18 10:40:46,268] Trial 8 finished with value: 0.5906669160494245 and parameters: {'epochs': 233, 'batch_size': 1000, 'pac': 10, 'generator_lr': 2.9805041978169706e-05, 'discriminator_lr': 0.0055832460794528185, 'generator_dim': (256, 256), 'discriminator_dim': (128, 128), 'generator_decay': 4.614591431754217e-07, 'discriminator_decay': 1.1294365397745083e-06, 'log_frequency': False, 'verbose': True}. Best is trial 6 with value: 0.6377901404629124.


[OK] TRTS (Synthetic->Real): 0.6702
[OK] TRTS Evaluation: 2 scenarios, Average: 0.6802
[CHART] Combined Score: 0.5907 (Similarity: 0.5310, Accuracy: 0.6802)
🎯 Trial 9 Results:
   • Combined Score: 0.5907
   • Similarity: 0.5310
   • Accuracy: 0.6802
✅ PAC validation: 1000 % 5 = 0

🔄 CTGAN Trial 10: epochs=162, batch_size=1000, pac=5, lr=6.05e-04
🎯 Using target column: 'Result'
✅ Using CTGAN from ctgan package


Gen. (-2.26) | Discrim. (-0.11): 100%|██████████| 162/162 [00:46<00:00,  3.45it/s]


⏱️ Training completed in 49.5 seconds
📊 Generated synthetic data: (5000, 11)
[TARGET] Enhanced objective function using target column: 'Result'
[OK] Similarity Analysis: 10/10 valid metrics, Average: 0.5112


[I 2025-09-18 10:41:36,543] Trial 9 finished with value: 0.5761466169566716 and parameters: {'epochs': 162, 'batch_size': 1000, 'pac': 5, 'generator_lr': 0.000605120917871477, 'discriminator_lr': 0.006312156784665449, 'generator_dim': (256, 256), 'discriminator_dim': (256, 256), 'generator_decay': 1.9349314583004405e-05, 'discriminator_decay': 7.801861623771125e-07, 'log_frequency': False, 'verbose': True}. Best is trial 6 with value: 0.6377901404629124.


[OK] TRTS (Synthetic->Real): 0.7136
[OK] TRTS Evaluation: 2 scenarios, Average: 0.6735
[CHART] Combined Score: 0.5761 (Similarity: 0.5112, Accuracy: 0.6735)
🎯 Trial 10 Results:
   • Combined Score: 0.5761
   • Similarity: 0.5112
   • Accuracy: 0.6735
✅ PAC validation: 500 % 1 = 0

🔄 CTGAN Trial 11: epochs=297, batch_size=500, pac=1, lr=1.05e-05
🎯 Using target column: 'Result'
✅ Using CTGAN from ctgan package


Gen. (-1.77) | Discrim. (0.04): 100%|██████████| 297/297 [01:25<00:00,  3.46it/s]


⏱️ Training completed in 88.3 seconds
📊 Generated synthetic data: (5000, 11)
[TARGET] Enhanced objective function using target column: 'Result'
[OK] Similarity Analysis: 10/10 valid metrics, Average: 0.4993


[I 2025-09-18 10:43:05,643] Trial 10 finished with value: 0.596803727957878 and parameters: {'epochs': 297, 'batch_size': 500, 'pac': 1, 'generator_lr': 1.0504296510990164e-05, 'discriminator_lr': 0.0010007320273515243, 'generator_dim': (256, 256), 'discriminator_dim': (128, 128), 'generator_decay': 4.732521628697638e-06, 'discriminator_decay': 4.602254887658862e-06, 'log_frequency': True, 'verbose': True}. Best is trial 6 with value: 0.6377901404629124.


[OK] TRTS (Synthetic->Real): 0.7242
[OK] TRTS Evaluation: 2 scenarios, Average: 0.7430
[CHART] Combined Score: 0.5968 (Similarity: 0.4993, Accuracy: 0.7430)
🎯 Trial 11 Results:
   • Combined Score: 0.5968
   • Similarity: 0.4993
   • Accuracy: 0.7430
✅ PAC validation: 500 % 1 = 0

🔄 CTGAN Trial 12: epochs=293, batch_size=500, pac=1, lr=1.11e-05
🎯 Using target column: 'Result'
✅ Using CTGAN from ctgan package


Gen. (-2.02) | Discrim. (0.06): 100%|██████████| 293/293 [01:24<00:00,  3.48it/s]


⏱️ Training completed in 86.9 seconds
📊 Generated synthetic data: (5000, 11)
[TARGET] Enhanced objective function using target column: 'Result'
[OK] Similarity Analysis: 10/10 valid metrics, Average: 0.5131


[I 2025-09-18 10:44:33,345] Trial 11 finished with value: 0.5977683515476562 and parameters: {'epochs': 293, 'batch_size': 500, 'pac': 1, 'generator_lr': 1.1096177303635114e-05, 'discriminator_lr': 0.0012362503740461198, 'generator_dim': (256, 256), 'discriminator_dim': (128, 128), 'generator_decay': 4.658323781462847e-06, 'discriminator_decay': 9.559982938406341e-05, 'log_frequency': True, 'verbose': True}. Best is trial 6 with value: 0.6377901404629124.


[OK] TRTS (Synthetic->Real): 0.7158
[OK] TRTS Evaluation: 2 scenarios, Average: 0.7248
[CHART] Combined Score: 0.5978 (Similarity: 0.5131, Accuracy: 0.7248)
🎯 Trial 12 Results:
   • Combined Score: 0.5978
   • Similarity: 0.5131
   • Accuracy: 0.7248
✅ PAC validation: 500 % 1 = 0

🔄 CTGAN Trial 13: epochs=249, batch_size=500, pac=1, lr=1.01e-04
🎯 Using target column: 'Result'
✅ Using CTGAN from ctgan package


Gen. (-2.07) | Discrim. (-0.09): 100%|██████████| 249/249 [01:11<00:00,  3.48it/s]


⏱️ Training completed in 74.2 seconds
📊 Generated synthetic data: (5000, 11)
[TARGET] Enhanced objective function using target column: 'Result'
[OK] Similarity Analysis: 10/10 valid metrics, Average: 0.4983


[I 2025-09-18 10:45:48,308] Trial 12 finished with value: 0.5855203044513156 and parameters: {'epochs': 249, 'batch_size': 500, 'pac': 1, 'generator_lr': 0.00010116345104463824, 'discriminator_lr': 0.0009566442951215516, 'generator_dim': (256, 256), 'discriminator_dim': (128, 128), 'generator_decay': 4.619464810056044e-06, 'discriminator_decay': 8.987071372070277e-05, 'log_frequency': True, 'verbose': True}. Best is trial 6 with value: 0.6377901404629124.


[OK] TRTS (Synthetic->Real): 0.6992
[OK] TRTS Evaluation: 2 scenarios, Average: 0.7164
[CHART] Combined Score: 0.5855 (Similarity: 0.4983, Accuracy: 0.7164)
🎯 Trial 13 Results:
   • Combined Score: 0.5855
   • Similarity: 0.4983
   • Accuracy: 0.7164
⚠️  Adjusted PAC from 3 to 2 for batch_size 200
✅ PAC validation: 200 % 2 = 0

🔄 CTGAN Trial 14: epochs=238, batch_size=200, pac=2, lr=7.22e-05
🎯 Using target column: 'Result'
✅ Using CTGAN from ctgan package


Gen. (-2.48) | Discrim. (0.14): 100%|██████████| 238/238 [01:08<00:00,  3.46it/s]


⏱️ Training completed in 71.4 seconds
📊 Generated synthetic data: (5000, 11)
[TARGET] Enhanced objective function using target column: 'Result'
[OK] Similarity Analysis: 10/10 valid metrics, Average: 0.5270


[I 2025-09-18 10:47:00,419] Trial 13 finished with value: 0.6162950946389659 and parameters: {'epochs': 238, 'batch_size': 200, 'pac': 3, 'generator_lr': 7.223401844542763e-05, 'discriminator_lr': 0.0016455830280112666, 'generator_dim': (256, 256), 'discriminator_dim': (128, 128), 'generator_decay': 4.445826408521231e-06, 'discriminator_decay': 6.757469569906482e-05, 'log_frequency': True, 'verbose': True}. Best is trial 6 with value: 0.6377901404629124.


[OK] TRTS (Synthetic->Real): 0.7144
[OK] TRTS Evaluation: 2 scenarios, Average: 0.7503
[CHART] Combined Score: 0.6163 (Similarity: 0.5270, Accuracy: 0.7503)
🎯 Trial 14 Results:
   • Combined Score: 0.6163
   • Similarity: 0.5270
   • Accuracy: 0.7503
⚠️  Adjusted PAC from 3 to 2 for batch_size 200
✅ PAC validation: 200 % 2 = 0

🔄 CTGAN Trial 15: epochs=238, batch_size=200, pac=2, lr=9.50e-05
🎯 Using target column: 'Result'
✅ Using CTGAN from ctgan package


Gen. (-1.96) | Discrim. (-0.05): 100%|██████████| 238/238 [01:08<00:00,  3.48it/s]


⏱️ Training completed in 71.0 seconds
📊 Generated synthetic data: (5000, 11)
[TARGET] Enhanced objective function using target column: 'Result'
[OK] Similarity Analysis: 10/10 valid metrics, Average: 0.5299


[I 2025-09-18 10:48:12,163] Trial 14 finished with value: 0.6126610194778767 and parameters: {'epochs': 238, 'batch_size': 200, 'pac': 3, 'generator_lr': 9.504870453339129e-05, 'discriminator_lr': 0.0026636529328407693, 'generator_dim': (256, 256), 'discriminator_dim': (128, 128), 'generator_decay': 2.1624958473693224e-06, 'discriminator_decay': 7.578867484798316e-06, 'log_frequency': True, 'verbose': True}. Best is trial 6 with value: 0.6377901404629124.


[OK] TRTS (Synthetic->Real): 0.6864
[OK] TRTS Evaluation: 2 scenarios, Average: 0.7368
[CHART] Combined Score: 0.6127 (Similarity: 0.5299, Accuracy: 0.7368)
🎯 Trial 15 Results:
   • Combined Score: 0.6127
   • Similarity: 0.5299
   • Accuracy: 0.7368

✅ CTGAN Optimization completed!
🏆 Best score: 0.6378
🔧 Best parameters:
   • epochs: 273
   • batch_size: 1000
   • pac: 6
   • generator_lr: 2.2422302614277393e-05
   • discriminator_lr: 0.0011706923335360964
   • generator_dim: (256, 256)
   • discriminator_dim: (128, 128)
   • generator_decay: 2.6906039146885346e-06
   • discriminator_decay: 3.459507363914129e-05
   • log_frequency: False
   • verbose: True
✅ CTGAN optimization completed successfully!
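
The log above shows a sampled `pac` of 3 being adjusted to 2 for `batch_size` 200, followed by a check that `batch_size % pac == 0`. Optuna still records the sampled value (`pac: 3`) for that trial, so the `pac` listed under the best parameters may differ from the value actually used in training. The sketch below shows one way such an adjustment could work. `compatible_pac` is a hypothetical helper, and the downward search is an assumption, not necessarily what the notebook's CTGAN objective does.

```python
def compatible_pac(batch_size: int, requested_pac: int) -> int:
    """Return the largest pac <= requested_pac that divides batch_size evenly.

    Hypothetical helper for illustration only.
    """
    if batch_size < 1 or requested_pac < 1:
        raise ValueError("batch_size and requested_pac must be positive")
    # Walk downward from the requested value; 1 always divides, so this terminates.
    for pac in range(min(requested_pac, batch_size), 0, -1):
        if batch_size % pac == 0:
            return pac
    return 1


# Same inputs as the logged adjustment: batch_size=200, sampled pac=3.
print(compatible_pac(200, 3))  # 2
```

Under this assumption, the best trial's `batch_size` of 1000 with a sampled `pac` of 6 would also have been adjusted, since 1000 is not divisible by 6.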


#### 4.1.2 CTAB-GAN Hyperparameter Optimization

In [None]:
# Code Chunk ID: CHUNK_042
# Import required libraries for CTAB-GAN optimization
import optuna
import numpy as np
import pandas as pd
from src.models.model_factory import ModelFactory
from src.evaluation.trts_framework import TRTSEvaluator

# ============================================================================
# Reload the clean, imputed subset so CTAB-GAN trains on data with no missing values
# ============================================================================
print("🔄 Reloading clean subset data for CTAB-GAN optimization...")
data = pd.read_csv("data/liver_train_subset.csv")
n_missing = int(data.isnull().sum().sum())
print(f"✅ Clean data loaded: {data.shape[0]} rows, {data.shape[1]} columns")
print(f"✅ Missing values: {n_missing}")

# Validate data quality before any trial runs
if n_missing > 0:
    raise ValueError(f"CTAB-GAN data still contains {n_missing} missing values")
print("✅ Data validation passed: 0 missing values confirmed")

# CTAB-GAN search space, limited to the 3 parameters the wrapper supports
def ctabgan_search_space(trial):
    """CTAB-GAN hyperparameter space restricted to supported parameters."""
    return {
        'epochs': trial.suggest_int('epochs', 100, 1000, step=50),
        'batch_size': trial.suggest_categorical('batch_size', [64, 128, 256]),  # 500 excluded: not stable
        'test_ratio': trial.suggest_float('test_ratio', 0.15, 0.25, step=0.05),
        # class_dim, random_dim, num_channels are omitted (not supported by the constructor)
    }

def ctabgan_objective(trial):
    """CTAB-GAN objective: train, sample 5000 rows, and return the combined score."""
    try:
        # Sample hyperparameters for this trial
        params = ctabgan_search_space(trial)

        print(f"\n🔄 CTAB-GAN Trial {trial.number + 1}: epochs={params['epochs']}, batch_size={params['batch_size']}, test_ratio={params['test_ratio']:.3f}")

        # Initialize CTAB-GAN using ModelFactory
        model = ModelFactory.create("ctabgan", random_state=42)

        # Only pass supported parameters to train()
        print("🏋️ Training CTAB-GAN...")
        model.train(data,
                    epochs=params['epochs'],
                    batch_size=params['batch_size'],
                    test_ratio=params['test_ratio'])

        # Generate synthetic data for evaluation
        synthetic_data = model.generate(5000)
        print(f"📊 Generated synthetic data: {synthetic_data.shape}")

        # Score the sample with the enhanced objective function
        from setup import enhanced_objective_function_v2
        combined_score, similarity_score, accuracy_score = enhanced_objective_function_v2(
            data, synthetic_data, TARGET_COLUMN
        )

        print(f"🎯 Trial {trial.number + 1} Results:")
        print(f"   • Combined Score: {combined_score:.4f}")
        print(f"   • Similarity: {similarity_score:.4f}")
        print(f"   • Accuracy: {accuracy_score:.4f}")

        return combined_score

    except Exception as e:
        # A failed trial scores 0.0 so the study can continue
        print(f"❌ Trial {trial.number + 1} failed: {e}")
        import traceback
        traceback.print_exc()
        return 0.0

# Create and run optimization study
print("\n🔧 SECTION 4.1.2: CTAB-GAN HYPERPARAMETER OPTIMIZATION")
print("=" * 80)
print("🔄 Creating CTAB-GAN optimization study...")
print(f"📊 Dataset info: {len(data)} rows, {len(data.columns)} columns")
print(f"📊 Target column '{TARGET_COLUMN}' unique values: {data[TARGET_COLUMN].nunique()}")
print()

# Create study and optimize
ctabgan_study = optuna.create_study(direction='maximize')
ctabgan_study.optimize(ctabgan_objective, n_trials=5)

# Extract and display results
best_trial = ctabgan_study.best_trial
print("\n✅ CTAB-GAN Optimization completed!")
print(f"🏆 Best score: {best_trial.value:.4f}")
print("🔧 Best parameters:")
for param, value in best_trial.params.items():
    print(f"   • {param}: {value}")

print("✅ CTAB-GAN optimization completed successfully!")
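
The CTAB-GAN search space is fully discrete, so its size can be counted directly. The sketch below enumerates the grid implied by `ctabgan_search_space` (19 epoch values, 3 batch sizes, 3 test ratios, 171 configurations in total) and compares it with `n_trials=5`. It only illustrates how sparse the search is; it does not reproduce how Optuna's sampler picks points.

```python
from itertools import product

# Values implied by the search space above.
epochs = list(range(100, 1001, 50))   # suggest_int('epochs', 100, 1000, step=50)
batch_sizes = [64, 128, 256]          # suggest_categorical('batch_size', ...)
test_ratios = [0.15, 0.20, 0.25]      # suggest_float('test_ratio', 0.15, 0.25, step=0.05)

grid = list(product(epochs, batch_sizes, test_ratios))
n_trials = 5
coverage = n_trials / len(grid)

print(f"{len(grid)} configurations; {n_trials} trials cover at most {coverage:.1%}")
```

With so few trials, the reported best parameters are better read as a reasonable starting point than as a confirmed optimum.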

#### 4.1.3 CTAB-GAN+ Hyperparameter Optimization

In [None]:
# Code Chunk ID: CHUNK_044
# Import required libraries for CTAB-GAN+ optimization
import optuna
import numpy as np
import pandas as pd
from src.models.model_factory import ModelFactory
from src.evaluation.trts_framework import TRTSEvaluator

# ============================================================================
# Reload the clean, imputed subset so CTAB-GAN+ trains on data with no missing values
# ============================================================================
print("🔄 Reloading clean subset data for CTAB-GAN+ optimization...")
data = pd.read_csv("data/liver_train_subset.csv")
n_missing = int(data.isnull().sum().sum())
print(f"✅ Clean data loaded: {data.shape[0]} rows, {data.shape[1]} columns")
print(f"✅ Missing values: {n_missing}")

# Validate data quality before any trial runs
if n_missing > 0:
    raise ValueError(f"CTAB-GAN+ data still contains {n_missing} missing values")
print("✅ Data validation passed: 0 missing values confirmed")

# CTAB-GAN+ search space, limited to the 3 parameters the wrapper supports
def ctabganplus_search_space(trial):
    """CTAB-GAN+ hyperparameter space restricted to supported parameters."""
    return {
        'epochs': trial.suggest_int('epochs', 150, 1000, step=50),
        'batch_size': trial.suggest_categorical('batch_size', [64, 128, 256]),  # 500 excluded: not stable
        'test_ratio': trial.suggest_float('test_ratio', 0.15, 0.25, step=0.05),
        # class_dim, random_dim, num_channels are omitted (not supported by the constructor)
    }

def ctabganplus_objective(trial):
    """CTAB-GAN+ objective: train, sample 5000 rows, and return the combined score."""
    try:
        # Sample hyperparameters for this trial
        params = ctabganplus_search_space(trial)

        print(f"\n🔄 CTAB-GAN+ Trial {trial.number + 1}: epochs={params['epochs']}, batch_size={params['batch_size']}, test_ratio={params['test_ratio']:.3f}")

        # Initialize CTAB-GAN+ using ModelFactory
        model = ModelFactory.create("ctabganplus", random_state=42)

        # Only pass supported parameters to train()
        print("🏋️ Training CTAB-GAN+...")
        model.train(data,
                    epochs=params['epochs'],
                    batch_size=params['batch_size'],
                    test_ratio=params['test_ratio'])

        # Generate synthetic data for evaluation
        synthetic_data = model.generate(5000)
        print(f"📊 Generated synthetic data: {synthetic_data.shape}")

        # Score the sample with the enhanced objective function
        from setup import enhanced_objective_function_v2
        combined_score, similarity_score, accuracy_score = enhanced_objective_function_v2(
            data, synthetic_data, TARGET_COLUMN
        )

        print(f"🎯 Trial {trial.number + 1} Results:")
        print(f"   • Combined Score: {combined_score:.4f}")
        print(f"   • Similarity: {similarity_score:.4f}")
        print(f"   • Accuracy: {accuracy_score:.4f}")

        return combined_score

    except Exception as e:
        # A failed trial scores 0.0 so the study can continue
        print(f"❌ Trial {trial.number + 1} failed: {e}")
        import traceback
        traceback.print_exc()
        return 0.0

# Create and run optimization study
print("\n🔧 SECTION 4.1.3: CTAB-GAN+ HYPERPARAMETER OPTIMIZATION")
print("=" * 80)
print("🔄 Creating CTAB-GAN+ optimization study...")
print(f"📊 Dataset info: {len(data)} rows, {len(data.columns)} columns")
print(f"📊 Target column '{TARGET_COLUMN}' unique values: {data[TARGET_COLUMN].nunique()}")
print()

# Create study and optimize
ctabganplus_study = optuna.create_study(direction='maximize')
ctabganplus_study.optimize(ctabganplus_objective, n_trials=5)

# Extract and display results
best_trial = ctabganplus_study.best_trial
print("\n✅ CTAB-GAN+ Optimization completed!")
print(f"🏆 Best score: {best_trial.value:.4f}")
print("🔧 Best parameters:")
for param, value in best_trial.params.items():
    print(f"   • {param}: {value}")

print("✅ CTAB-GAN+ optimization completed successfully!")
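
Every objective in this section returns the combined score from `enhanced_objective_function_v2`. The weights printed in Section 5.1 (60% similarity, 40% accuracy) suggest a plain weighted average, and the CTGAN trial logs in 4.1.1 are consistent with that reading. The sketch below recomputes two logged scores under that assumption. `combined_score` here is an illustrative stand-in, not the function defined in `setup.py`.

```python
def combined_score(similarity, accuracy, w_similarity=0.6, w_accuracy=0.4):
    """Weighted average of similarity and accuracy (assumed 60/40 split)."""
    return w_similarity * similarity + w_accuracy * accuracy


# (similarity, accuracy, reported combined score) copied from the CTGAN trial logs.
logged_trials = [
    (0.5270, 0.7503, 0.6163),
    (0.5299, 0.7368, 0.6127),
]

for similarity, accuracy, reported in logged_trials:
    recomputed = combined_score(similarity, accuracy)
    print(f"recomputed={recomputed:.4f}  reported={reported:.4f}")
```

Both recomputed values agree with the logged scores to four decimal places.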

#### 4.1.4 GANerAid Hyperparameter Optimization

In [None]:
# Code Chunk ID: CHUNK_046
# GANerAid Search Space and Hyperparameter Optimization

# ============================================================================
# Reload the clean, imputed subset so GANerAid trains on data with no missing values
# ============================================================================
print("🔄 Reloading clean subset data for GANerAid optimization...")
import pandas as pd
data = pd.read_csv("data/liver_train_subset.csv")
n_missing = int(data.isnull().sum().sum())
print(f"✅ Clean data loaded: {data.shape[0]} rows, {data.shape[1]} columns")
print(f"✅ Missing values: {n_missing}")

# Validate data quality before any trial runs
if n_missing > 0:
    raise ValueError(f"GANerAid data still contains {n_missing} missing values")
print("✅ Data validation passed: 0 missing values confirmed")

def ganeraid_search_space(trial):
    """
    GANerAid hyperparameter search space with dynamic constraint handling.

    Follows the compatible_pac pattern used for CTGAN.
    GANerAid requires: batch_size % nr_of_rows == 0 AND nr_of_rows < dataset_size
    """
    # Available batch sizes (extend this list to widen the search)
    batch_size = trial.suggest_categorical('batch_size', [32, 64, 128, 256])

    # Dataset size constraint (GANerAid specific)
    dataset_size = len(data)

    # Cap nr_of_rows below the dataset size to prevent index-out-of-bounds errors
    max_nr_of_rows = min(dataset_size - 1, 500)

    # All divisors of batch_size up to the cap (batch_size % nr_of_rows == 0)
    possible_nr_of_rows = [
        candidate for candidate in range(1, max_nr_of_rows + 1)
        if batch_size % candidate == 0
    ]

    if possible_nr_of_rows:
        # The parameter name includes batch_size because the choice list depends on it
        nr_of_rows = trial.suggest_categorical(f'nr_of_rows_for_batch_{batch_size}', possible_nr_of_rows)
    else:
        # Fallback: only reached when the dataset has at most one row
        nr_of_rows = 1

    return {
        'epochs': trial.suggest_int('epochs', 500, 1500, step=100),
        'batch_size': batch_size,
        'nr_of_rows': nr_of_rows,
    }

def ganeraid_objective(trial):
    """GENERALIZED GANerAid objective function with ALL constraint validation."""
    try:
        # Get hyperparameters from trial
        params = ganeraid_search_space(trial)
        
        # DYNAMIC CONSTRAINT ADJUSTMENT (following CTGAN pattern)
        dataset_size = len(data)
        batch_size = params['batch_size']
        original_nr_of_rows = params['nr_of_rows']
        
        # Comprehensive constraint validation
        compatible_nr_of_rows = original_nr_of_rows
        found_compatible = False
        
        # Try to find compatible nr_of_rows (batch_size % nr_of_rows == 0 AND nr_of_rows < dataset_size)
        for candidate in range(original_nr_of_rows, 0, -1):
            if (batch_size % candidate == 0 and 
                candidate < dataset_size):
                compatible_nr_of_rows = candidate
                found_compatible = True
                break
        
        # If still not compatible, try upward
        if not found_compatible:
            for candidate in range(original_nr_of_rows + 1, min(dataset_size, batch_size + 1)):
                if (batch_size % candidate == 0 and 
                    candidate < dataset_size):
                    compatible_nr_of_rows = candidate
                    found_compatible = True
                    break
        
        # Ultimate fallback
        if not found_compatible:
            compatible_nr_of_rows = 1
        
        params['nr_of_rows'] = compatible_nr_of_rows
        
        print(f"\n🔄 GANerAid Trial {trial.number + 1}: epochs={params['epochs']}, batch_size={params['batch_size']}, nr_of_rows={params['nr_of_rows']}")
        print(f"✅ Constraint validation: {batch_size} % {compatible_nr_of_rows} = {batch_size % compatible_nr_of_rows}, {compatible_nr_of_rows} < {dataset_size}")

        # Initialize GANerAid using ModelFactory
        from src.models.model_factory import ModelFactory
        model = ModelFactory.create("ganeraid", random_state=42)
        
        # Train model
        # NOTE: only epochs is forwarded to train(); the sampled batch_size and
        # nr_of_rows are validated above but are not passed to the model here.
        print("🏋️ Training GANerAid with validated parameters...")
        start_time = time.time()

        try:
            model.train(data, epochs=params['epochs'])
            training_time = time.time() - start_time
            print(f"⏱️ Training completed successfully in {training_time:.1f} seconds")
        except IndexError as e:
            print(f"❌ IndexError during training (constraint violation): {str(e)}")
            return 0.0
        except Exception as e:
            print(f"❌ Training failed: {str(e)}")
            return 0.0

        # Generate synthetic data for evaluation
        synthetic_data = model.generate(5000)
        print(f"📊 Generated synthetic data: {synthetic_data.shape}")

        # Use enhanced objective function
        from setup import enhanced_objective_function_v2
        combined_score, similarity_score, accuracy_score = enhanced_objective_function_v2(
            data, synthetic_data, TARGET_COLUMN
        )

        print(f"🎯 Trial {trial.number + 1} Results:")
        print(f"   • Combined Score: {combined_score:.4f}")
        print(f"   • Similarity: {similarity_score:.4f}")
        print(f"   • Accuracy: {accuracy_score:.4f}")

        return combined_score

    except Exception as e:
        print(f"❌ Trial {trial.number + 1} failed: {str(e)}")
        import traceback
        traceback.print_exc()
        return 0.0

# Create and run optimization study
import optuna
import time

print("\n🔧 SECTION 4.1.4: GANerAid HYPERPARAMETER OPTIMIZATION")
print("=" * 80)
print("🔄 Creating GANerAid optimization study...")
print(f"📊 Dataset info: {len(data)} rows, {len(data.columns)} columns")
print(f"📊 Target column '{TARGET_COLUMN}' unique values: {data[TARGET_COLUMN].nunique()}")
print()

# Create study and optimize
ganeraid_study = optuna.create_study(direction='maximize')
ganeraid_study.optimize(ganeraid_objective, n_trials=5)

# Extract and display results
best_trial = ganeraid_study.best_trial
print("\n✅ GANerAid Optimization completed!")
print(f"🏆 Best score: {best_trial.value:.4f}")
print("🔧 Best parameters:")
for param, value in best_trial.params.items():
    print(f"   • {param}: {value}")

print("✅ GANerAid optimization completed successfully!")
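
The GANerAid search space only offers `nr_of_rows` values that divide `batch_size` and stay below both the dataset size and a cap of 500. The sketch below isolates that enumeration as a pure function so the candidate lists can be inspected without running a study. `compatible_nr_of_rows` is a hypothetical helper name, and the dataset size of 1000 is an arbitrary example value, not the size of the liver subset.

```python
def compatible_nr_of_rows(batch_size, dataset_size, cap=500):
    """List every nr_of_rows that divides batch_size and is below dataset_size."""
    upper = min(dataset_size - 1, cap)
    return [n for n in range(1, upper + 1) if batch_size % n == 0]


# Example dataset size chosen for illustration only.
for batch_size in (32, 64, 128, 256):
    print(batch_size, compatible_nr_of_rows(batch_size, dataset_size=1000))
```

Because 1 divides every batch size, the list is non-empty whenever the dataset has at least two rows, so the fallback branch in `ganeraid_search_space` should rarely, if ever, run.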

#### 4.1.5 CopulaGAN Hyperparameter Optimization

In [None]:
# Code Chunk ID: CHUNK_048
# CopulaGAN Search Space and Hyperparameter Optimization

# ============================================================================
# Reload the clean, imputed subset so CopulaGAN trains on data with no missing values
# ============================================================================
print("🔄 Reloading clean subset data for CopulaGAN optimization...")
import pandas as pd
data = pd.read_csv("data/liver_train_subset.csv")
n_missing = int(data.isnull().sum().sum())
print(f"✅ Clean data loaded: {data.shape[0]} rows, {data.shape[1]} columns")
print(f"✅ Missing values: {n_missing}")

# Validate data quality before any trial runs
if n_missing > 0:
    raise ValueError(f"CopulaGAN data still contains {n_missing} missing values")
print("✅ Data validation passed: 0 missing values confirmed")

def copulagan_search_space(trial):
    """
    CopulaGAN hyperparameter search space.

    Learning rates and weight decays are sampled on a log scale.
    CopulaGAN requires discrete_columns to be properly defined (see the objective).
    """
    return {
        'epochs': trial.suggest_int('epochs', 50, 500, step=50),
        'batch_size': trial.suggest_categorical('batch_size', [100, 200, 500]),
        # suggest_float(..., log=True) replaces the deprecated suggest_loguniform
        'generator_lr': trial.suggest_float('generator_lr', 1e-5, 1e-2, log=True),
        'discriminator_lr': trial.suggest_float('discriminator_lr', 1e-5, 1e-2, log=True),
        'generator_decay': trial.suggest_float('generator_decay', 1e-8, 1e-4, log=True),
        'discriminator_decay': trial.suggest_float('discriminator_decay', 1e-8, 1e-4, log=True),
    }

def copulagan_objective(trial):
    """CopulaGAN objective: train, sample 5000 rows, and return the combined score."""
    try:
        # Get hyperparameters from trial
        params = copulagan_search_space(trial)

        print(f"\n🔄 CopulaGAN Trial {trial.number + 1}: epochs={params['epochs']}, batch_size={params['batch_size']}")

        # Initialize CopulaGAN using ModelFactory
        from src.models.model_factory import ModelFactory
        model = ModelFactory.create("copulagan", random_state=42)

        # Auto-detect discrete columns (object dtype only; numeric-coded categories are not detected)
        discrete_columns = data.select_dtypes(include=['object']).columns.tolist()
        print(f"📊 Detected discrete columns: {discrete_columns}")

        # Train model
        print("🏋️ Training CopulaGAN...")
        start_time = time.time()

        try:
            model.train(data, discrete_columns=discrete_columns, **params)
            training_time = time.time() - start_time
            print(f"⏱️ Training completed successfully in {training_time:.1f} seconds")
        except Exception as e:
            print(f"❌ Training failed: {str(e)}")
            return 0.0

        # Generate synthetic data for evaluation
        synthetic_data = model.generate(5000)
        print(f"📊 Generated synthetic data: {synthetic_data.shape}")

        # Use enhanced objective function
        from setup import enhanced_objective_function_v2
        combined_score, similarity_score, accuracy_score = enhanced_objective_function_v2(
            data, synthetic_data, TARGET_COLUMN
        )

        print(f"🎯 Trial {trial.number + 1} Results:")
        print(f"   • Combined Score: {combined_score:.4f}")
        print(f"   • Similarity: {similarity_score:.4f}")
        print(f"   • Accuracy: {accuracy_score:.4f}")

        return combined_score

    except Exception as e:
        print(f"❌ Trial {trial.number + 1} failed: {str(e)}")
        import traceback
        traceback.print_exc()
        return 0.0

# Create and run optimization study
import optuna
import time

print("\n🔧 SECTION 4.1.5: CopulaGAN HYPERPARAMETER OPTIMIZATION")
print("=" * 80)
print("🔄 Creating CopulaGAN optimization study...")
print(f"📊 Dataset info: {len(data)} rows, {len(data.columns)} columns")
print(f"📊 Target column '{TARGET_COLUMN}' unique values: {data[TARGET_COLUMN].nunique()}")
print()

# Create study and optimize
copulagan_study = optuna.create_study(direction='maximize')
copulagan_study.optimize(copulagan_objective, n_trials=5)

# Extract and display results
best_trial = copulagan_study.best_trial
print("\n✅ CopulaGAN Optimization completed!")
print(f"🏆 Best score: {best_trial.value:.4f}")
print("🔧 Best parameters:")
for param, value in best_trial.params.items():
    print(f"   • {param}: {value}")

print("✅ CopulaGAN optimization completed successfully!")
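
The CopulaGAN learning rates and decays span several orders of magnitude, which is why they are searched on a log scale. The sketch below uses only the standard library to show what that means: drawing uniformly in log space gives each decade of the range roughly the same share of samples. It illustrates the scale only and is not a model of Optuna's sampler; `sample_log_uniform` is a hypothetical helper.

```python
import math
import random


def sample_log_uniform(low, high, rng):
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(0)
samples = [sample_log_uniform(1e-5, 1e-2, rng) for _ in range(30_000)]

# Count how many samples land in each decade of the learning-rate range.
decades = [(1e-5, 1e-4), (1e-4, 1e-3), (1e-3, 1e-2)]
for lo, hi in decades:
    share = sum(lo <= s < hi for s in samples) / len(samples)
    print(f"[{lo:.0e}, {hi:.0e}): {share:.1%}")
```

A linear-scale search over the same range would instead place about 90% of its samples in the top decade, leaving small learning rates almost unexplored.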

#### 4.1.6 TVAE Hyperparameter Optimization

In [None]:
# Code Chunk ID: CHUNK_050
# TVAE Robust Search Space (from hypertuning_eg.md)

# ============================================================================
# Reload the clean, imputed subset so TVAE trains on data with no missing values
# ============================================================================
print("🔄 Reloading clean subset data for TVAE optimization...")
import pandas as pd
data = pd.read_csv("data/liver_train_subset.csv")
n_missing = int(data.isnull().sum().sum())
print(f"✅ Clean data loaded: {data.shape[0]} rows, {data.shape[1]} columns")
print(f"✅ Missing values: {n_missing}")

# Validate data quality before any trial runs
if n_missing > 0:
    raise ValueError(f"TVAE data still contains {n_missing} missing values")
print("✅ Data validation passed: 0 missing values confirmed")

def tvae_search_space(trial):
    """TVAE hyperparameter search space."""
    return {
        "epochs": trial.suggest_int("epochs", 50, 500, step=50),  # Training cycles
        "batch_size": trial.suggest_categorical("batch_size", [100, 200, 500]),  # Batch size for training
        "embedding_dim": trial.suggest_categorical("embedding_dim", [64, 128]),  # Embedding dimension
        "compress_dims": trial.suggest_categorical("compress_dims", [(128, 128), (256, 256)]),  # Compression layers
        "decompress_dims": trial.suggest_categorical("decompress_dims", [(128, 128), (256, 256)]),  # Decompression layers
        # suggest_float(..., log=True) replaces the deprecated suggest_loguniform
        "l2scale": trial.suggest_float("l2scale", 1e-8, 1e-3, log=True),  # L2 regularization (log scale)
        "loss_factor": trial.suggest_int("loss_factor", 1, 5),  # Loss scaling factor
    }

def tvae_objective(trial):
    """TVAE objective: train, sample 5000 rows, and return the combined score."""
    try:
        # Get hyperparameters from trial
        params = tvae_search_space(trial)

        print(f"\n🔄 TVAE Trial {trial.number + 1}: epochs={params['epochs']}, batch_size={params['batch_size']}, embedding_dim={params['embedding_dim']}")

        # Initialize TVAE using ModelFactory
        from src.models.model_factory import ModelFactory
        model = ModelFactory.create("tvae", random_state=42)

        # Train model
        print("🏋️ Training TVAE...")
        start_time = time.time()
        model.train(data, **params)
        training_time = time.time() - start_time
        print(f"⏱️ Training completed in {training_time:.1f} seconds")

        # Generate synthetic data for evaluation
        synthetic_data = model.generate(5000)
        print(f"📊 Generated synthetic data: {synthetic_data.shape}")

        # Use enhanced objective function
        from setup import enhanced_objective_function_v2
        combined_score, similarity_score, accuracy_score = enhanced_objective_function_v2(
            data, synthetic_data, TARGET_COLUMN
        )

        print(f"🎯 Trial {trial.number + 1} Results:")
        print(f"   • Combined Score: {combined_score:.4f}")
        print(f"   • Similarity: {similarity_score:.4f}")
        print(f"   • Accuracy: {accuracy_score:.4f}")

        return combined_score

    except Exception as e:
        # A failed trial (training or evaluation) scores 0.0 so the study can continue
        print(f"❌ Trial {trial.number + 1} failed: {str(e)}")
        import traceback
        traceback.print_exc()
        return 0.0

# Create and run optimization study
import optuna
import time

print("\n🔧 SECTION 4.1.6: TVAE HYPERPARAMETER OPTIMIZATION")
print("=" * 80)
print("🔄 Creating TVAE optimization study...")
print(f"📊 Dataset info: {len(data)} rows, {len(data.columns)} columns")
print(f"📊 Target column '{TARGET_COLUMN}' unique values: {data[TARGET_COLUMN].nunique()}")
print()

# Create study and optimize
tvae_study = optuna.create_study(direction='maximize')
tvae_study.optimize(tvae_objective, n_trials=10)

# Extract and display results
best_trial = tvae_study.best_trial
print("\n✅ TVAE Optimization completed!")
print(f"🏆 Best score: {best_trial.value:.4f}")
print("🔧 Best parameters:")
for param, value in best_trial.params.items():
    print(f"   • {param}: {value}")

print("✅ TVAE optimization completed successfully!")
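
The TVAE search space passes tuples such as `(128, 128)` as categorical choices. Optuna may warn about tuple choices, because they are not among the types it can store persistently, and such values can come back as strings once saved to CSV. One possible workaround, sketched below under those assumptions, is to sample a string label and map it back to a tuple. `DIM_CHOICES` and `parse_dims` are hypothetical names, not part of the notebook or of Optuna.

```python
# Hypothetical label-to-tuple mapping for layer dimensions.
DIM_CHOICES = {
    "128x128": (128, 128),
    "256x256": (256, 256),
}


def parse_dims(label):
    """Convert a label such as '256x256' back into a tuple of ints."""
    return tuple(int(part) for part in label.split("x"))


# Inside a search space one might then write (not executed here):
#   label = trial.suggest_categorical("compress_dims", list(DIM_CHOICES))
#   compress_dims = DIM_CHOICES[label]

for label, dims in DIM_CHOICES.items():
    assert parse_dims(label) == dims
    print(label, "->", dims)
```

String labels also survive a CSV round trip unchanged, which simplifies reloading the best parameters in Section 5.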

### 4.2 Batch Process

In [None]:
# Code Chunk ID: CHUNK_052
# ============================================================================
# SECTION 4 - BATCH HYPERPARAMETER OPTIMIZATION ANALYSIS
# ============================================================================

print("🔍 SECTION 4 - HYPERPARAMETER OPTIMIZATION BATCH ANALYSIS")
print("=" * 80)
print()

# Use enhanced batch evaluation function from setup.py
# Following exact same pattern as CHUNK_018 (Section 3) - no module reload needed!
try:
    # Run batch analysis with file export for all models
    section4_batch_results = evaluate_hyperparameter_optimization_results(
        section_number=4,
        scope=globals(),  # Pass notebook scope to access study variables
        target_column=TARGET_COLUMN
    )
    
    print("\n" + "="*80)
    print("✅ SECTION 4 HYPERPARAMETER OPTIMIZATION BATCH ANALYSIS COMPLETED!")
    print("="*80)
    print(f"📊 Models processed: {len(section4_batch_results['summary_data'])}")
    print(f"📁 Results exported to: {section4_batch_results['results_dir']}")
    print("📋 Individual model analysis files:")
    print("   • Hyperparameter parameter_analysis.png plots")
    print("   • Optimization convergence_analysis.png graphs")
    print("   • Parameter correlation matrices")
    print("   • Best trial summary tables")
    print("   • Comprehensive optimization summary CSV")

except Exception as e:
    print(f"❌ Batch hyperparameter analysis failed: {str(e)}")
    print(f"🔍 Error details: {type(e).__name__}")
    import traceback
    traceback.print_exc()
    print("\n⚠️  Falling back to individual chunk analysis if needed")

# ============================================================================
# SAVE BEST PARAMETERS TO CSV FOR SECTION 5 USE
# ============================================================================
print("\n" + "=" * 80)
print("💾 SAVING BEST PARAMETERS FROM SECTION 4 OPTIMIZATION")
print("=" * 80)

try:
    # Save all best parameters to CSV using setup.py function
    param_save_results = save_best_parameters_to_csv(
        scope=globals(),
        section_number=4,
        dataset_identifier=DATASET_IDENTIFIER
    )

    if param_save_results['success']:
        print("\n✅ Parameter saving completed successfully!")
        print(f"   • Files saved: {len(param_save_results['files_saved'])}")
        print(f"   • Parameter entries: {param_save_results['parameters_count']}")
        print(f"   • Models processed: {param_save_results['models_count']}")
        print(f"   • Directory: {param_save_results['results_dir']}")

        # Display saved file names; os.path.basename follows the platform's path rules
        import os
        for file_path in param_save_results['files_saved']:
            print(f"     📁 {os.path.basename(file_path)}")
    else:
        print(f"\n⚠️  Parameter saving completed with issues: {param_save_results['message']}")

except Exception as e:
    print(f"\n❌ Parameter saving failed: {str(e)}")
    print("   Section 5 will fall back to memory-based parameter retrieval")

print("\n📈 Section 4 hyperparameter optimization analysis complete!")
print("🏁 Ready for Section 5: Optimized model re-training")
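
Saving best parameters to CSV turns every value into text, so tuples such as `generator_dim` and booleans such as `log_frequency` need to be parsed back on load. The sketch below shows one way to make that round trip lossless with the standard library, using values from the CTGAN best parameters in 4.1.1. The two-column layout and the helper names `dump_params` and `load_params` are assumptions for illustration; the actual format written by `save_best_parameters_to_csv` is not shown in this notebook.

```python
import ast
import csv
import io

# A subset of the best CTGAN parameters reported in Section 4.1.1.
best_params = {
    "epochs": 273,
    "batch_size": 1000,
    "generator_lr": 2.2422302614277393e-05,
    "generator_dim": (256, 256),
    "log_frequency": False,
}


def dump_params(params):
    """Serialize parameters to CSV text, storing each value as its Python repr."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["parameter", "value"])
    for name, value in params.items():
        writer.writerow([name, repr(value)])
    return buffer.getvalue()


def load_params(text):
    """Parse CSV text back into a dict, restoring the original Python types."""
    reader = csv.DictReader(io.StringIO(text))
    return {row["parameter"]: ast.literal_eval(row["value"]) for row in reader}


restored = load_params(dump_params(best_params))
print(restored == best_params)
```

`ast.literal_eval` only evaluates Python literals, so it restores numbers, booleans, and tuples without executing arbitrary code from the file.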

## 5 Final Model Comparison and Best-of-Best Selection

#### 5.1.1 Best CTGAN Model Evaluation

In [None]:
# Code Chunk ID: CHUNK_053
# Section 5.1: Best CTGAN Model Evaluation
print("🏆 SECTION 5.1: BEST CTGAN MODEL EVALUATION")
print("=" * 60)

# ============================================================================
# LOAD BEST PARAMETERS FROM SECTION 4 (CSV + MEMORY FALLBACK)
# ============================================================================
print("📖 5.1.0 Loading best parameters from Section 4...")

try:
    # Load all best parameters using setup.py function
    param_data = load_best_parameters_from_csv(
        section_number=4,
        dataset_identifier=DATASET_IDENTIFIER,
        fallback_to_memory=True,
        scope=globals()
    )
    
    print(f"✅ Parameter loading completed from {param_data['source']}")
    print(f"   • Models available: {param_data['models_count']}")
    
    # Extract CTGAN parameters specifically
    loaded_ctgan_params = param_data['parameters'].get('ctgan', None)
    
except Exception as e:
    print(f"⚠️  Parameter loading failed: {str(e)}")
    print("   Falling back to direct memory access")
    loaded_ctgan_params = None

# 5.1.1 Retrieve Best Model Results from Section 4.1
print("\n📊 5.1.1 Retrieving best CTGAN results from Section 4.1...")

try:
    # Primary: Use loaded parameters if available
    if loaded_ctgan_params is not None:
        print(f"✅ Using loaded CTGAN parameters from {param_data['source']}")
        best_params = loaded_ctgan_params
        
        # Try to get additional metadata from memory if available
        if 'ctgan_study' in globals() and ctgan_study is not None and hasattr(ctgan_study, 'best_trial'):
            best_trial = ctgan_study.best_trial
            best_value = best_trial.value
            trial_number = best_trial.number
        else:
            # Use fallback values when the in-memory study is unavailable
            best_value = 0.0  # Placeholder; the score is recalculated during evaluation
            trial_number = "loaded_from_csv"
            print("   ⚠️  Memory study unavailable - using loaded parameters only")
        
    else:
        # Fallback: Direct memory access
        print("🔄 Falling back to direct memory access...")
        best_trial = ctgan_study.best_trial
        best_params = best_trial.params
        best_value = best_trial.value
        trial_number = best_trial.number
        print("✅ Using CTGAN parameters from memory")
    
    print("\n✅ Section 4.1 CTGAN optimization parameters retrieved!")
    print(f"   • Best Trial: #{trial_number}")
    print(f"   • Best Objective Score: {best_value:.4f}" if isinstance(best_value, (int, float)) else f"   • Best Objective Score: {best_value}")
    print(f"   • Parameter count: {len(best_params)}")
    
    # Display parameters
    print("\n📈 5.1.2 Best CTGAN configuration:")
    for param, value in best_params.items():
        if isinstance(value, float):
            # 4 significant digits, so small learning rates and decays do not print as 0.0000
            print(f"   • {param}: {value:.4g}")
        else:
            print(f"   • {param}: {value}")
    
    print(f"🔍 Parameter source: {param_data.get('source', 'memory') if loaded_ctgan_params else 'memory'}")
    
    # ============================================================================
    # 5.1.3 TRAIN FINAL CTGAN MODEL WITH OPTIMIZED PARAMETERS
    # ============================================================================
    
    print("\n🔧 5.1.3 Training final CTGAN model with optimized parameters...")
    
    try:
        # Use ModelFactory pattern
        from src.models.model_factory import ModelFactory
        
        # Create CTGAN model
        final_ctgan_model = ModelFactory.create("ctgan", random_state=42)
        
        # Apply best parameters with defaults for missing values
        final_ctgan_params = {
            'epochs': best_params.get('epochs', 300),
            'batch_size': best_params.get('batch_size', 500),
            'generator_lr': best_params.get('generator_lr', 2e-4),
            'discriminator_lr': best_params.get('discriminator_lr', 2e-4),
            'generator_decay': best_params.get('generator_decay', 1e-6),
            'discriminator_decay': best_params.get('discriminator_decay', 1e-6),
            'pac': best_params.get('pac', 10),
            'verbose': best_params.get('verbose', True)
        }
        
        print("🔧 Training CTGAN with optimal hyperparameters...")
        for param, value in final_ctgan_params.items():
            print(f"   • Using {param}: {value}")
        
        # Train the model
        final_ctgan_model.train(data, **final_ctgan_params)
        print("✅ CTGAN training completed successfully!")
        
        # Generate synthetic data
        print("🎲 Generating synthetic data...")
        synthetic_ctgan_final = final_ctgan_model.generate(len(data))
        print(f"✅ Generated {len(synthetic_ctgan_final)} synthetic samples")
        
        # ============================================================================
        # 5.1.4 EVALUATE FINAL CTGAN MODEL PERFORMANCE
        # ============================================================================
        
        print("\n📊 5.1.4 Final CTGAN Model Evaluation...")
        
        # Use enhanced objective function for evaluation
        if 'enhanced_objective_function_v2' in globals():
            print("🎯 Enhanced objective function evaluation:")
            
            ctgan_final_score, ctgan_similarity, ctgan_accuracy = enhanced_objective_function_v2(
                real_data=data, 
                synthetic_data=synthetic_ctgan_final, 
                target_column=TARGET_COLUMN
            )
            
            print(f"\n✅ Final CTGAN Evaluation Results:")
            print(f"   • Overall Score: {ctgan_final_score:.4f}")
            print(f"   • Similarity Score: {ctgan_similarity:.4f} (60% weight)")
            print(f"   • Accuracy Score: {ctgan_accuracy:.4f} (40% weight)")
            
            # Store results for Section 5.7 comparison
            ctgan_final_results = {
                'model_name': 'CTGAN',
                'objective_score': ctgan_final_score,
                'similarity_score': ctgan_similarity,
                'accuracy_score': ctgan_accuracy,
                'best_params': best_params,
                'parameter_source': param_data.get('source', 'memory') if loaded_ctgan_params else 'memory',
                'synthetic_data': synthetic_ctgan_final
            }
            
            print("🎯 CTGAN Final Assessment:")
            print(f"   • Production Ready: {'✅ Yes' if ctgan_final_score > 0.6 else '⚠️ Review Required'}")
            print(f"   • Recommended for: General-purpose tabular synthetic data generation")
            print(f"   • Final Score vs Optimization Score: {ctgan_final_score:.4f} vs {best_value:.4f}" if isinstance(best_value, (int, float)) else f"   • Final Score: {ctgan_final_score:.4f}")
            
        else:
            print("⚠️ Enhanced objective function not available - using basic evaluation")
            ctgan_final_results = {
                'model_name': 'CTGAN',
                'objective_score': best_value if isinstance(best_value, (int, float)) else 0.0,
                'best_params': best_params,
                'parameter_source': param_data.get('source', 'memory') if loaded_ctgan_params else 'memory',
                'synthetic_data': synthetic_ctgan_final
            }
                
    except Exception as train_error:
        print(f"❌ Failed to train final CTGAN model: {train_error}")
        import traceback
        traceback.print_exc()
        synthetic_ctgan_final = None
        ctgan_final_score = 0.0
        ctgan_final_results = {
            'model_name': 'CTGAN',
            'objective_score': 0.0,
            'error': str(train_error)
        }

except Exception as e:
    print(f"❌ Error accessing CTGAN parameters: {e}")
    print("   Please ensure Section 4.1 has been executed successfully or that the parameter CSV exists.")
    # Create empty results to prevent downstream errors
    synthetic_ctgan_final = None
    ctgan_final_score = 0.0  # keep in step with the training-failure branch above
    ctgan_final_results = {
        'model_name': 'CTGAN',
        'objective_score': 0.0,
        'error': str(e)
    }
    
print("\n" + "=" * 60)
print("✅ SECTION 5.1 COMPLETE: Best CTGAN model trained and evaluated")
print("🔄 Ready for Section 5.2: CTAB-GAN model training")
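
The evaluation above reports a similarity score at 60% weight and an accuracy score at 40% weight, and flags the model as production ready when the overall score exceeds 0.6. As a rough sketch of how such a combination could be computed, assuming a plain linear blend (the real `enhanced_objective_function_v2` comes from the setup module and may differ), the hypothetical helper `combine_scores` below uses made-up input values:

```python
def combine_scores(similarity, accuracy, w_similarity=0.6, w_accuracy=0.4):
    """Hypothetical helper: linear blend of two scores in [0, 1]."""
    if abs(w_similarity + w_accuracy - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return w_similarity * similarity + w_accuracy * accuracy


# Illustrative inputs only; these are not results from the notebook.
overall = combine_scores(similarity=0.80, accuracy=0.50)
production_ready = overall > 0.6  # same threshold as the assessment print above

print(f"Overall score: {overall:.4f}")
print(f"Production ready: {production_ready}")
```

Keeping the weights as arguments makes it easy to check how sensitive a model ranking is to the 60/40 split.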

#### 5.1.2 Best CTAB-GAN Model Evaluation

In [None]:
# Code Chunk ID: CHUNK_053a

# Section 5.2: Best CTAB-GAN Model Evaluation
print("🏆 SECTION 5.2: BEST CTAB-GAN MODEL EVALUATION")
print("=" * 60)

# 5.2.1 Retrieve Best Model Results from Section 4.2
print("📊 5.2.1 Retrieving best CTAB-GAN results from Section 4.2...")

try:
    # Use unified parameter loading function
    ctabgan_params = get_model_parameters(
        model_name='ctab-gan',
        section_number=4,
        dataset_identifier=DATASET_IDENTIFIER,
        scope=globals()
    )
    
    if ctabgan_params is not None:
        best_params = ctabgan_params
        
        # Try to get additional metadata from memory if available
        if 'ctabgan_study' in globals() and ctabgan_study is not None:
            best_trial = ctabgan_study.best_trial
            best_objective_score = best_trial.value
            trial_number = best_trial.number
            print(f"✅ Section 4.2 CTAB-GAN optimization completed successfully!")
            print(f"   • Best Trial: #{trial_number}")
        else:
            # Use fallback values when memory unavailable
            best_objective_score = 0.0
            trial_number = "loaded_from_csv"
            print(f"✅ Section 4.2 CTAB-GAN parameters loaded from CSV!")
            print(f"   • Best Trial: #{trial_number}")
        
        print(f"   • Best Objective Score: {best_objective_score:.4f}" if isinstance(best_objective_score, (int, float)) else f"   • Best Objective Score: {best_objective_score}")
        print(f"   • Best Parameters:")
        for param, value in best_params.items():
            print(f"     - {param}: {value}")
        
        # 5.2.2 Train Final CTAB-GAN Model using Section 5.1 Pattern
        print("🔧 Training final CTAB-GAN model using Section 5.1 proven pattern with optimized parameters...")
        
        try:
            # Use the exact same ModelFactory pattern that works in Section 5.1
            from src.models.model_factory import ModelFactory
            
            # Create CTAB-GAN model using the working pattern
            final_ctabgan_model = ModelFactory.create("ctabgan", random_state=42)
            
            # Apply the best parameters found in Section 4.2 optimization
            final_ctabgan_params = {
                'epochs': best_params.get('epochs', 300),
                'batch_size': best_params.get('batch_size', 512),
                'lr': best_params.get('lr', 2e-4),
                'betas': best_params.get('betas', (0.5, 0.9)),
                'l2scale': best_params.get('l2scale', 1e-5),
                'mixed_precision': best_params.get('mixed_precision', False),
                'test_ratio': best_params.get('test_ratio', 0.20),
                'verbose': best_params.get('verbose', True)
            }
            
            print("🔧 Training CTAB-GAN with optimal hyperparameters...")
            for param, value in final_ctabgan_params.items():
                print(f"   • Using {param}: {value}")
            
            # Train the model with best parameters
            final_ctabgan_model.train(data, **final_ctabgan_params)
            print("✅ CTAB-GAN training completed successfully!")
            
            # Generate synthetic data
            print("📊 Generating synthetic data for evaluation...")
            synthetic_ctabgan_final = final_ctabgan_model.generate(len(data))
            print(f"✅ Generated {len(synthetic_ctabgan_final)} synthetic samples")
            
            # Evaluate using enhanced objective function
            if 'enhanced_objective_function_v2' in globals():
                print("🎯 CTAB-GAN Classification Performance Analysis:")
                
                ctabgan_final_score, ctabgan_similarity, ctabgan_accuracy = enhanced_objective_function_v2(
                    real_data=data, 
                    synthetic_data=synthetic_ctabgan_final, 
                    target_column=TARGET_COLUMN
                )
                
                print(f"✅ CTAB-GAN Final Results:")
                print(f"   • Overall Score: {ctabgan_final_score:.4f}")
                print(f"   • Similarity Score: {ctabgan_similarity:.4f}")
                print(f"   • Accuracy Score: {ctabgan_accuracy:.4f}")
                
                # Store results for Section 5.7 comparison
                ctabgan_final_results = {
                    'model_name': 'CTAB-GAN',
                    'objective_score': ctabgan_final_score,
                    'similarity_score': ctabgan_similarity,
                    'accuracy_score': ctabgan_accuracy,
                    'best_params': best_params,
                    'synthetic_data': synthetic_ctabgan_final
                }
                
            else:
                print("⚠️ Enhanced objective function not available - using basic evaluation")
                ctabgan_final_results = {
                    'model_name': 'CTAB-GAN',
                    'objective_score': best_objective_score,
                    'best_params': best_params,
                    'synthetic_data': synthetic_ctabgan_final
                }
                
        except Exception as e:
            print(f"❌ CTAB-GAN training failed: {str(e)}")
            synthetic_ctabgan_final = None
            ctabgan_final_score = 0.0  # match the other failure branches in this cell
            ctabgan_final_results = {
                'model_name': 'CTAB-GAN',
                'objective_score': 0.0,
                'error': str(e)
            }
        
    else:
        print("❌ CTAB-GAN study results not found - Section 4.2 may not have completed successfully")
        print("    Please ensure Section 4.2 has been executed before running Section 5.2")
        synthetic_ctabgan_final = None
        ctabgan_final_score = 0.0
        ctabgan_final_results = {
            'model_name': 'CTAB-GAN',
            'objective_score': 0.0,
            'error': 'Section 4.2 not completed'
        }
        
except Exception as e:
    print(f"❌ Error in Section 5.2 CTAB-GAN evaluation: {e}")
    import traceback
    traceback.print_exc()
    synthetic_ctabgan_final = None
    ctabgan_final_score = 0.0
    ctabgan_final_results = {
        'model_name': 'CTAB-GAN',
        'objective_score': 0.0,
        'error': str(e)
    }

print("✅ Section 5.2 CTAB-GAN evaluation completed!")
print("=" * 60)
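
The CTGAN and CTAB-GAN cells build their final parameter dictionaries by calling `best_params.get(name, default)` once per expected key. The same idea can be written as a small merge. This is only a sketch: `apply_defaults` is a hypothetical helper, the defaults are copied from the CTAB-GAN cell above, and the tuned values are made up.

```python
def apply_defaults(best_params, defaults):
    """Return the defaults overridden by tuned values, restricted to known keys."""
    return {key: best_params.get(key, default) for key, default in defaults.items()}


# Defaults mirror the CTAB-GAN cell above; the tuned values are illustrative.
ctabgan_defaults = {"epochs": 300, "batch_size": 512, "lr": 2e-4, "test_ratio": 0.20}
tuned = {"epochs": 450, "lr": 1e-4, "unexpected_key": 1}

final_params = apply_defaults(tuned, ctabgan_defaults)
print(final_params)
```

Restricting the result to the keys listed in the defaults keeps stray entries from the optimizer out of the `.train()` call.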

#### 5.1.3 Best CTAB-GAN+ Model Evaluation

In [None]:
# Code Chunk ID: CHUNK_061
# ============================================================================
# Section 5.3: Best CTAB-GAN+ Model Evaluation - FIXED IMPLEMENTATION
# ============================================================================
# Using Section 4.3 optimized hyperparameters with proven ModelFactory pattern

print("🏆 SECTION 5.3: BEST CTAB-GAN+ MODEL EVALUATION")
print("=" * 80)

try:
    # Step 1: Retrieve Section 4.3 CTAB-GAN+ optimization results
    if 'ctabganplus_study' in globals():
        best_trial = ctabganplus_study.best_trial
        best_params = best_trial.params
        best_objective_score = best_trial.value
        
        print(f"✅ Retrieved Section 4.3 CTAB-GAN+ optimization results")
        print(f"   • Best Trial: #{best_trial.number}")
        print(f"   • Best Objective Score: {best_objective_score:.4f}")
        print(f"   • Parameters: {len(best_params)} hyperparameters")
        
        # Display best parameters
        print(f"\n📊 Best CTAB-GAN+ Hyperparameters:")
        print("-" * 40)
        for param, value in best_params.items():
            if isinstance(value, float):
                print(f"   • {param}: {value:.4f}")
            else:
                print(f"   • {param}: {value}")
                
    else:
        print("⚠️ CTAB-GAN+ optimization results not found - using fallback parameters")
        # Fallback CTAB-GAN+ parameters (basic working configuration)
        best_params = {
            'epochs': 100,
            'batch_size': 128,
            'lr_generator': 1e-4,
            'lr_discriminator': 2e-4,
            'beta_1': 0.5,
            'beta_2': 0.9,
            'lambda_gp': 10,
            'pac': 1
        }
        best_objective_score = None
        print(f"   Using fallback parameters: {best_params}")

    # Step 2: Create CTAB-GAN+ model using proven ModelFactory pattern (SAME AS SECTION 5.2)
    print(f"\n🏗️ Creating CTAB-GAN+ model using ModelFactory...")
    from src.models.model_factory import ModelFactory
    
    # CRITICAL FIX: Use the exact same ModelFactory pattern that works in Section 5.1 & 5.2
    final_ctabganplus_model = ModelFactory.create("ctabganplus", random_state=42)
    print(f"✅ CTAB-GAN+ model created successfully")
    
    # Step 3: Train using the correct method name: .train() (NOT .fit())
    print(f"\n🚀 Training CTAB-GAN+ model with optimized hyperparameters...")
    print(f"   • Data shape: {data.shape}")
    print(f"   • Target column: '{TARGET_COLUMN}'")
    print(f"   • Training with Section 4.3 parameters")
    
    # Store final parameters for results tracking
    final_ctabganplus_params = best_params.copy()
    
    # CRITICAL FIX: Train using .train() method (proven pattern from Sections 5.1 & 5.2)
    final_ctabganplus_model.train(data, **final_ctabganplus_params)
    print(f"✅ CTAB-GAN+ model training completed successfully!")
    
    # Step 4: Generate synthetic data using the correct method: .generate()
    print(f"\n📊 Generating synthetic data for evaluation...")
    synthetic_ctabganplus_final = final_ctabganplus_model.generate(len(data))
    print(f"✅ Synthetic data generated successfully!")
    print(f"   • Synthetic data shape: {synthetic_ctabganplus_final.shape}")
    print(f"   • Columns match: {list(synthetic_ctabganplus_final.columns) == list(data.columns)}")
    
    # Step 5: Quick evaluation using enhanced objective function (NO IMPORT - function in globals)
    if 'enhanced_objective_function_v2' in globals():
        ctabganplus_final_score, ctabganplus_similarity, ctabganplus_accuracy = enhanced_objective_function_v2(
            real_data=data, 
            synthetic_data=synthetic_ctabganplus_final, 
            target_column=TARGET_COLUMN
        )
        
        print(f"\n📊 CTAB-GAN+ Enhanced Objective Function v2 Results:")
        print(f"   • Final Combined Score: {ctabganplus_final_score:.4f}")
        print(f"   • Statistical Similarity (60%): {ctabganplus_similarity:.4f}")
        print(f"   • Classification Accuracy (40%): {ctabganplus_accuracy:.4f}")
    else:
        print("⚠️ Enhanced objective function not available - using basic metrics")
        ctabganplus_final_score = 0.5  # Fallback score
        ctabganplus_similarity = 0.5
        ctabganplus_accuracy = 0.5
    
    # Store results for Section 5.7 comparative analysis
    ctabganplus_final_results = {
        'model_name': 'CTAB-GAN+',
        'objective_score': ctabganplus_final_score,
        'similarity_score': ctabganplus_similarity,
        'accuracy_score': ctabganplus_accuracy,
        'final_combined_score': ctabganplus_final_score,
        'sections_completed': ['5.3.1'],
        'evaluation_method': 'section_5_1_pattern',
        'section_4_optimization': best_objective_score is not None,
        'best_section_4_score': best_objective_score
    }
    
    print(f"\n✅ SECTION 5.3 COMPLETED SUCCESSFULLY!")
    print(f"🎯 CTAB-GAN+ evaluation completed using Section 4.3 optimized parameters")
    print(f"📊 Results ready for Section 5.7 comparative analysis")
    print("-" * 80)

except Exception as e:
    print(f"❌ CTAB-GAN+ evaluation failed: {str(e)}")
    import traceback
    traceback.print_exc()
    # Set fallback for subsequent sections (model_name kept so the record stays identifiable)
    synthetic_ctabganplus_final = None
    ctabganplus_final_results = {'model_name': 'CTAB-GAN+', 'error': str(e), 'evaluation_failed': True}
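
The result dictionaries in these cells are collected later for the Section 5.7 comparison, and their success and failure branches store different sets of keys. One way to keep the records uniform is sketched below. The helper `make_result` is hypothetical and not part of the notebook, and the scores and error message are invented for illustration.

```python
def make_result(model_name, objective_score=None, similarity=None, accuracy=None, error=None):
    """Hypothetical helper: build a record with the same keys on success and failure."""
    return {
        "model_name": model_name,
        "objective_score": 0.0 if objective_score is None else objective_score,
        "similarity_score": similarity,
        "accuracy_score": accuracy,
        "error": None if error is None else str(error),
        "succeeded": error is None,
    }


# Illustrative values only.
ok = make_result("CTAB-GAN+", objective_score=0.71, similarity=0.75, accuracy=0.65)
failed = make_result("CTAB-GAN+", error=RuntimeError("training diverged"))

# Both records expose the same keys, so a comparison table needs no key checks.
print(sorted(ok) == sorted(failed))
```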

#### 5.1.4 Best GANerAid Model

In [None]:
# Code Chunk ID: CHUNK_065
# ============================================================================
# Section 5.4.1: Best GANerAid Model Training
# ============================================================================
# Using Section 4.4 optimized hyperparameters with proven ModelFactory pattern

print("🏆 SECTION 5.4.1: BEST GANerAid MODEL TRAINING")
print("=" * 80)

try:
    # Step 1: Retrieve Section 4.4 GANerAid optimization results
    if 'ganeraid_study' in globals():
        best_trial = ganeraid_study.best_trial
        final_ganeraid_params = best_trial.params
        best_objective_score = best_trial.value
        
        print(f"✅ Retrieved Section 4.4 GANerAid optimization results")
        print(f"   • Best Trial: #{best_trial.number}")
        print(f"   • Best Objective Score: {best_objective_score:.4f}")
        print(f"   • Parameters: {len(final_ganeraid_params)} hyperparameters")
        
    else:
        print("⚠️ GANerAid optimization results not found - using fallback parameters")
        # Fallback GANerAid parameters
        final_ganeraid_params = {
            'epochs': 100,
            'batch_size': 128,
            'learning_rate': 1e-4
        }
        best_objective_score = None

    # Step 2: Create GANerAid model using proven ModelFactory pattern
    print(f"\n🏗️ Creating GANerAid model using ModelFactory...")
    from src.models.model_factory import ModelFactory
    
    final_ganeraid_model = ModelFactory.create("ganeraid", random_state=42)
    print(f"✅ GANerAid model created successfully")
    
    # Step 3: Train using .train() method (NOT .fit())
    print(f"\n🚀 Training GANerAid model with optimized hyperparameters...")
    final_ganeraid_model.train(data, **final_ganeraid_params)
    print(f"✅ GANerAid model training completed successfully!")
    
    # Step 4: Generate synthetic data
    synthetic_ganeraid_final = final_ganeraid_model.generate(len(data))
    print(f"✅ GANerAid synthetic data generated: {synthetic_ganeraid_final.shape}")
    
    # Step 5: Quick evaluation using enhanced objective function (NO IMPORT - function in globals)
    if 'enhanced_objective_function_v2' in globals():
        ganeraid_final_score, ganeraid_similarity, ganeraid_accuracy = enhanced_objective_function_v2(
            real_data=data, synthetic_data=synthetic_ganeraid_final, target_column=TARGET_COLUMN
        )
        
        print(f"\n📊 GANerAid Enhanced Objective Function v2 Results:")
        print(f"   • Final Combined Score: {ganeraid_final_score:.4f}")
        print(f"   • Statistical Similarity (60%): {ganeraid_similarity:.4f}")
        print(f"   • Classification Accuracy (40%): {ganeraid_accuracy:.4f}")
    else:
        print("⚠️ Enhanced objective function not available - using basic metrics")
        ganeraid_final_score = 0.5  # Fallback score
        ganeraid_similarity = 0.5
        ganeraid_accuracy = 0.5
    
    # Store results
    ganeraid_final_results = {
        'model_name': 'GANerAid',
        'objective_score': ganeraid_final_score,
        'similarity_score': ganeraid_similarity,
        'accuracy_score': ganeraid_accuracy,
        'final_combined_score': ganeraid_final_score,
        'sections_completed': ['5.4.1'],
        'evaluation_method': 'section_5_1_pattern',
        'section_4_optimization': best_objective_score is not None,
        'best_section_4_score': best_objective_score,
        'optimized_params': final_ganeraid_params
    }
    
    print(f"\n✅ SECTION 5.4.1 - GANerAid MODEL TRAINING COMPLETED!")
    print("-" * 80)

except Exception as e:
    print(f"❌ GANerAid training failed: {str(e)}")
    import traceback
    traceback.print_exc()
    synthetic_ganeraid_final = None
    ganeraid_final_results = {'model_name': 'GANerAid', 'error': str(e), 'training_failed': True}
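
When the Optuna study is not in memory, `best_objective_score` is set to `None`, and formatting `None` with `:.4f` raises a `TypeError`. A small guard in the spirit of the `isinstance` check used in the CTGAN cell could look like the sketch below; `format_score` is a hypothetical helper and the `"n/a"` placeholder is an arbitrary choice.

```python
def format_score(value, digits=4):
    """Hypothetical helper: format numeric scores, fall back to 'n/a' otherwise."""
    # bool is a subclass of int, so exclude it explicitly
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        return f"{value:.{digits}f}"
    return "n/a"


for score in (0.73456, None):
    print(f"Best Section 4 score: {format_score(score)}")
```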

#### 5.1.5 Best CopulaGAN Model

In [None]:
# Code Chunk ID: CHUNK_070
# ============================================================================
# Section 5.5.1: Best CopulaGAN Model Training - ENHANCED ERROR HANDLING
# ============================================================================
# Using Section 4.5 optimized hyperparameters with proven ModelFactory pattern

print("🏆 SECTION 5.5.1: BEST CopulaGAN MODEL TRAINING")
print("=" * 80)

try:

    # Load CopulaGAN best parameters from CSV file (more reliable than memory variables)
    def load_best_copulagan_params():
        try:
            import pandas as pd
            import ast
            csv_path = 'results/pakistani-diabetes-dataset/2025-09-11/Section-4/best_parameters.csv'
            df = pd.read_csv(csv_path)
            copulagan_params = df[df['model_name'] == 'CopulaGAN']
            
            if copulagan_params.empty:
                return None, None, None
                
            # Get the best score and trial number
            best_score = copulagan_params.iloc[0]['best_score']
            trial_number = copulagan_params.iloc[0]['trial_number']
            
            # Convert parameters to proper types
            params = {}
            for _, row in copulagan_params.iterrows():
                if row['is_component']:  # Skip component entries (discriminator_dim_0, etc.)
                    continue
                    
                param_name = row['parameter_name']
                param_value = row['parameter_value']
                param_type = row['parameter_type']
                
                if param_type == 'int':
                    params[param_name] = int(param_value)
                elif param_type == 'float':
                    params[param_name] = float(param_value)
                elif param_type == 'bool':
                    # pandas may yield a real bool or the string 'True'
                    params[param_name] = str(param_value) == 'True'
                elif param_type in ('tuple', 'list'):
                    params[param_name] = ast.literal_eval(param_value)
                else:
                    params[param_name] = param_value
                    
            return params, best_score, trial_number
            
        except Exception as e:
            print(f"Error loading parameters from CSV: {e}")
            return None, None, None
    
    # Load the best parameters
    final_copulagan_params, best_objective_score, trial_number = load_best_copulagan_params()

    if final_copulagan_params is not None:
        print(f"✅ Retrieved Section 4.5 CopulaGAN optimization results from CSV")
        print(f"   • Best Trial: #{trial_number}")
        print(f"   • Best Objective Score: {best_objective_score:.4f}")
        print(f"   • Parameters: {len(final_copulagan_params)} hyperparameters")
        print(f"   • Parameter details: {final_copulagan_params}")
        
    else:
        print("⚠️ CopulaGAN optimization results not found - using fallback parameters")
        # Simplified fallback CopulaGAN parameters (SDV compatible)
        final_copulagan_params = {
            'epochs': 50,  # Reduced for stability
            'batch_size': 64,  # Smaller batch size
            'lr': 2e-4  # Slightly higher learning rate
        }
        best_objective_score = None

    # Step 2: Enhanced data preprocessing for CopulaGAN
    print(f"\n🔧 Preprocessing data for CopulaGAN...")
    
    # CopulaGAN requires proper data types and no missing values
    copula_data = data.copy()
    
    # Handle missing values: median for numeric columns, mode (or 0) otherwise.
    # Assign the result back instead of calling fillna(inplace=True) on a column
    # selection, which recent pandas versions warn about as chained assignment.
    if copula_data.isnull().sum().sum() > 0:
        print(f"   ⚠️ Found {copula_data.isnull().sum().sum()} missing values - filling with median/mode")
        for col in data.columns:
            if copula_data[col].dtype in ['float64', 'int64']:
                copula_data[col] = copula_data[col].fillna(copula_data[col].median())
            else:
                col_mode = copula_data[col].mode()
                copula_data[col] = copula_data[col].fillna(col_mode[0] if not col_mode.empty else 0)
    
    # Ensure proper data types
    for col in data.columns:
        if copula_data[col].dtype == 'object':
            try:
                copula_data[col] = pd.to_numeric(copula_data[col], errors='coerce')
                if copula_data[col].isnull().sum() > 0:
                    copula_data[col] = copula_data[col].fillna(0)
            except Exception:
                pass
    
    print(f"   ✅ Data preprocessing completed: {copula_data.shape}")
    print(f"   • Missing values: {copula_data.isnull().sum().sum()}")
    print(f"   • Data types: {copula_data.dtypes.value_counts().to_dict()}")

    # Step 3: Create CopulaGAN model using proven ModelFactory pattern
    print(f"\n🏗️ Creating CopulaGAN model using ModelFactory...")
    from src.models.model_factory import ModelFactory
    
    try:
        final_copulagan_model = ModelFactory.create("copulagan", random_state=42)
        print(f"✅ CopulaGAN model created successfully")
        
        # Step 4: Enhanced training with error handling
        print(f"\n🚀 Training CopulaGAN model with optimized hyperparameters...")
        print(f"   • Using ALL parameters from Section 4.5: {final_copulagan_params}")
        
        # Auto-detect discrete columns for CopulaGAN (same as working Section 3)
        discrete_columns = data.select_dtypes(include=['object']).columns.tolist()
        
        # Train with ALL optimized parameters AND discrete_columns (same pattern as Section 3).
        # NOTE: training uses the original `data`; the preprocessed `copula_data` from Step 2
        # is only reported on above and is not passed to the model.
        final_copulagan_model.train(data, discrete_columns=discrete_columns, **final_copulagan_params)
        print(f"✅ CopulaGAN model training completed successfully!")
        
        # Step 5: Generate synthetic data
        print(f"\n🔧 Generating CopulaGAN synthetic data...")
        synthetic_copulagan_final = final_copulagan_model.generate(len(data))
        
        # Ensure synthetic data has same structure as original
        if isinstance(synthetic_copulagan_final, pd.DataFrame):
            # Ensure column order matches
            synthetic_copulagan_final = synthetic_copulagan_final[data.columns]
        
        print(f"✅ CopulaGAN synthetic data generated: {synthetic_copulagan_final.shape}")
        print(f"   • Columns match: {list(synthetic_copulagan_final.columns) == list(data.columns)}")
        
        # Step 6: Quick evaluation using enhanced objective function
        if 'enhanced_objective_function_v2' in globals():
            print(f"\n📊 CopulaGAN Enhanced Objective Function v2 Results:")
            
            try:
                copulagan_final_score, copulagan_similarity, copulagan_accuracy = enhanced_objective_function_v2(
                    real_data=data, synthetic_data=synthetic_copulagan_final, target_column=TARGET_COLUMN
                )
                
                print(f"   • Final Combined Score: {copulagan_final_score:.4f}")
                print(f"   • Statistical Similarity (60%): {copulagan_similarity:.4f}")
                print(f"   • Classification Accuracy (40%): {copulagan_accuracy:.4f}")
                
            except Exception as eval_error:
                print(f"   ⚠️ Evaluation failed: {eval_error}")
                copulagan_final_score = 0.3  # Lower fallback due to training issues
                copulagan_similarity = 0.3
                copulagan_accuracy = 0.3
                
        else:
            print("⚠️ Enhanced objective function not available - using fallback metrics")
            copulagan_final_score = 0.3
            copulagan_similarity = 0.3
            copulagan_accuracy = 0.3
        
        # Store results
        copulagan_final_results = {
            'model_name': 'CopulaGAN',
            'objective_score': copulagan_final_score,
            'similarity_score': copulagan_similarity,
            'accuracy_score': copulagan_accuracy,
            'final_combined_score': copulagan_final_score,
            'sections_completed': ['5.5.1'],
            'evaluation_method': 'section_5_1_pattern',
            'section_4_optimization': best_objective_score is not None,
            'best_section_4_score': best_objective_score,
            'optimized_params': final_copulagan_params,
            'training_successful': True
        }
        
        print(f"\n✅ SECTION 5.5.1 - CopulaGAN MODEL TRAINING COMPLETED!")
        
    except Exception as model_error:
        print(f"❌ CopulaGAN model creation/training failed: {model_error}")
        print("   This may be due to CopulaGAN compatibility issues")
        
        # Create minimal fallback results
        synthetic_copulagan_final = None
        copulagan_final_results = {
            'model_name': 'CopulaGAN',
            'training_error': str(model_error),
            'training_successful': False,
            'sections_completed': [],
            'fallback_reason': 'CopulaGAN training compatibility issue'
        }
    
    print("-" * 80)

except Exception as e:
    print(f"❌ CopulaGAN Section 5.5.1 failed: {str(e)}")
    import traceback
    traceback.print_exc()
    synthetic_copulagan_final = None
    copulagan_final_results = {'model_name': 'CopulaGAN', 'error': str(e), 'training_failed': True}
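
The loader in this cell reads a long-format CSV with one row per parameter and casts each value according to its `parameter_type` column. The sketch below shows the same casting logic with only the standard library, so it can be run without the results directory. The column names follow the cell above; the sample rows, the parameter names in them, and the helpers `parse_value` and `load_params` are assumptions made for illustration.

```python
import ast
import csv
import io

# Made-up sample in the same column layout as best_parameters.csv.
SAMPLE_CSV = """model_name,parameter_name,parameter_value,parameter_type,is_component
CopulaGAN,epochs,200,int,False
CopulaGAN,generator_lr,0.0002,float,False
CopulaGAN,discriminator_dim,"(256, 256)",tuple,False
CopulaGAN,discriminator_dim_0,256,int,True
TVAE,epochs,50,int,False
"""


def parse_value(raw, type_name):
    """Cast a CSV string to the Python type named in the parameter_type column."""
    if type_name == "int":
        return int(raw)
    if type_name == "float":
        return float(raw)
    if type_name == "bool":
        return raw.strip().lower() == "true"
    if type_name in ("tuple", "list"):
        return ast.literal_eval(raw)  # safe parsing of Python literals
    return raw


def load_params(csv_text, model_name):
    """Collect the non-component parameters of one model into a dict."""
    params = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["model_name"] != model_name:
            continue
        if row["is_component"].strip().lower() == "true":
            continue  # skip flattened entries such as discriminator_dim_0
        params[row["parameter_name"]] = parse_value(row["parameter_value"], row["parameter_type"])
    return params


copulagan_params = load_params(SAMPLE_CSV, "CopulaGAN")
print(copulagan_params)
```

Using `ast.literal_eval` rather than `eval` keeps the tuple and list parsing limited to plain literals.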

#### 5.1.6 Best TVAE Model Evaluation

In [None]:
# Code Chunk ID: CHUNK_075
# ============================================================================
# Section 5.6.1: Best TVAE Model Training
# ============================================================================
# Using Section 4.6 optimized hyperparameters with proven ModelFactory pattern

print("🏆 SECTION 5.6.1: BEST TVAE MODEL TRAINING")
print("=" * 80)

try:
    # Step 1: Retrieve Section 4.6 TVAE optimization results
    if 'tvae_study' in globals():
        best_trial = tvae_study.best_trial
        final_tvae_params = best_trial.params
        best_objective_score = best_trial.value
        
        print(f"✅ Retrieved Section 4.6 TVAE optimization results")
        print(f"   • Best Trial: #{best_trial.number}")
        print(f"   • Best Objective Score: {best_objective_score:.4f}")
        print(f"   • Parameters: {len(final_tvae_params)} hyperparameters")
        
    else:
        print("⚠️ TVAE optimization results not found - using fallback parameters")
        # Fallback TVAE parameters
        final_tvae_params = {
            'epochs': 100,
            'batch_size': 128,
            'lr': 1e-4,
            'compress_dims': [128, 64],
            'decompress_dims': [64, 128]
        }
        best_objective_score = None

    # Step 2: Create TVAE model using proven ModelFactory pattern
    print(f"\n🏗️ Creating TVAE model using ModelFactory...")
    from src.models.model_factory import ModelFactory
    
    final_tvae_model = ModelFactory.create("tvae", random_state=42)
    print(f"✅ TVAE model created successfully")
    
    # Step 3: Train using .train() method (NOT .fit())
    print(f"\n🚀 Training TVAE model with optimized hyperparameters...")
    final_tvae_model.train(data, **final_tvae_params)
    print(f"✅ TVAE model training completed successfully!")
    
    # Step 4: Generate synthetic data
    synthetic_tvae_final = final_tvae_model.generate(len(data))
    print(f"✅ TVAE synthetic data generated: {synthetic_tvae_final.shape}")
    
    # Step 5: Quick evaluation using enhanced objective function (NO IMPORT - function in globals)
    if 'enhanced_objective_function_v2' in globals():
        tvae_final_score, tvae_similarity, tvae_accuracy = enhanced_objective_function_v2(
            real_data=data, synthetic_data=synthetic_tvae_final, target_column=TARGET_COLUMN
        )
        
        print(f"\n📊 TVAE Enhanced Objective Function v2 Results:")
        print(f"   • Final Combined Score: {tvae_final_score:.4f}")
        print(f"   • Statistical Similarity (60%): {tvae_similarity:.4f}")
        print(f"   • Classification Accuracy (40%): {tvae_accuracy:.4f}")
    else:
        print("⚠️ Enhanced objective function not available - using placeholder scores (0.5)")
        # Placeholder values only; these are not measured metrics
        tvae_final_score = 0.5
        tvae_similarity = 0.5
        tvae_accuracy = 0.5
    
    # Store results
    tvae_final_results = {
        'model_name': 'TVAE',
        'objective_score': tvae_final_score,
        'similarity_score': tvae_similarity,
        'accuracy_score': tvae_accuracy,
        'final_combined_score': tvae_final_score,
        'sections_completed': ['5.6.1'],
        'evaluation_method': 'section_5_1_pattern',
        'section_4_optimization': best_objective_score is not None,
        'best_section_4_score': best_objective_score,
        'optimized_params': final_tvae_params
    }
    
    print(f"\n✅ SECTION 5.6.1 - TVAE MODEL TRAINING COMPLETED!")
    print("-" * 80)

except Exception as e:
    print(f"❌ TVAE training failed: {e}")
    import traceback
    traceback.print_exc()
    synthetic_tvae_final = None
    tvae_final_results = {'error': str(e), 'training_failed': True}
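
The labels printed in this cell indicate a 60% weight on statistical similarity and a 40% weight on classification accuracy. A minimal sketch of that combination, assuming a plain weighted sum (the actual `enhanced_objective_function_v2` may compute it differently); `combine_scores` and its parameters are hypothetical names used only for illustration:

```python
def combine_scores(similarity, accuracy, similarity_weight=0.6, accuracy_weight=0.4):
    """Weighted sum of two scores; the default weights are assumed from the printed labels."""
    if abs(similarity_weight + accuracy_weight - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return similarity_weight * similarity + accuracy_weight * accuracy


# Illustrative inputs, not results from this notebook
example_score = combine_scores(similarity=0.80, accuracy=0.50)
print(f"Combined score: {example_score:.4f}")
```

Keeping the weights as parameters makes it easy to check how sensitive a model ranking is to the 60/40 split.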

### 5.2 Batch Process
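
The batch step passes `globals()` as `scope` so the evaluator can find the synthetic datasets produced earlier in Section 5. The sketch below shows one way such a lookup could work, assuming variables follow the `synthetic_<model>_final` naming seen for TVAE and are set to `None` when training fails. `collect_synthetic_datasets` is a hypothetical helper, not the implementation in setup.py.

```python
import re


def collect_synthetic_datasets(scope):
    """Return {model_name: dataset} for every non-None synthetic_<model>_final entry in scope."""
    pattern = re.compile(r"^synthetic_(\w+)_final$")
    found = {}
    for name, value in scope.items():
        match = pattern.match(name)
        if match and value is not None:  # None marks a failed training run
            found[match.group(1)] = value
    return found


# Toy scope standing in for the notebook's globals(); lists stand in for DataFrames
demo_scope = {
    "synthetic_tvae_final": [[1, 2], [3, 4]],
    "synthetic_ctgan_final": None,
    "data": [[0, 0]],
}
available = collect_synthetic_datasets(demo_scope)
print(sorted(available))
```

Skipping `None` entries lets the batch step continue with whichever models trained successfully.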

In [None]:
# Code Chunk ID: CHUNK_076
# ============================================================================
# SECTION 5.2 - OPTIMIZED MODELS BATCH EVALUATION
# Following CHUNK_018 pattern with comprehensive file export to Section-5 directory
# ============================================================================

print("🔍 SECTION 5.2 - OPTIMIZED MODELS BATCH EVALUATION")
print("=" * 80)
print("📋 Evaluating all available optimized models from Section 5.1.x")
print("📁 Exporting all tables and analysis to Section-5 directory")
print("🔄 Following Section 3 comprehensive evaluation pattern")
print()

# Ensure setup module function is available
from setup import evaluate_section5_optimized_models

# Use Section 5 batch evaluation function from setup.py
# Following exact same pattern as CHUNK_018 (Section 3) - comprehensive file export!
try:
    # Run batch evaluation with file export for all optimized models
    section5_batch_results = evaluate_section5_optimized_models(
        section_number=5,
        scope=globals(),  # Pass notebook scope to access synthetic data variables
        target_column=TARGET_COLUMN
    )
    
    print("\n" + "="*80)
    print("✅ SECTION 5.2 OPTIMIZED MODELS BATCH EVALUATION COMPLETED!")
    print("="*80)
    print(f"📊 Models processed: {section5_batch_results['models_processed']}")
    print(f"📁 Results exported to: {section5_batch_results['results_dir']}")
    
    # Show summary of all evaluations
    if 'evaluation_summaries' in section5_batch_results:
        print("\n📋 EVALUATION SUMMARIES:")
        print("-" * 40)
        for model_name, summary in section5_batch_results['evaluation_summaries'].items():
            print(f"🤖 {model_name}:")
            print(f"   📊 Synthetic samples: {summary.get('synthetic_samples', 'N/A')}")
            print(f"   📈 Overall score: {summary.get('overall_score', 'N/A')}")
            
    print("\n" + "="*80)
            
except Exception as e:
    print(f"❌ Section 5.2 batch evaluation failed: {e}")
    print(f"🔍 Error details: {type(e).__name__}")
    print()
    print("⚠️  Check that Section 5.1.x models completed successfully")

else:
    # Report completion only when the batch evaluation raised no exception
    print("\n📈 Section 5.2 optimized model batch evaluation complete!")
    print("🏁 Ready for final model comparison and production deployment!")
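
For the final model comparison, the per-model summaries can be ranked by score. A minimal sketch, assuming `evaluation_summaries` maps model names to dicts with an `overall_score` key as read in the cell above; `rank_models` is a hypothetical helper, and the values are made-up placeholders, not results from this notebook.

```python
def rank_models(evaluation_summaries):
    """Return (model_name, overall_score) pairs sorted best-first, skipping missing scores."""
    scored = [
        (name, summary["overall_score"])
        for name, summary in evaluation_summaries.items()
        if isinstance(summary.get("overall_score"), (int, float))
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Illustrative placeholder values only
demo_summaries = {
    "model_a": {"synthetic_samples": 100, "overall_score": 0.61},
    "model_b": {"synthetic_samples": 100, "overall_score": 0.74},
    "model_c": {"synthetic_samples": 100},  # no score recorded
}
ranking = rank_models(demo_summaries)
for position, (name, score) in enumerate(ranking, start=1):
    print(f"{position}. {name}: {score:.2f}")
```

Models without a numeric score are left out of the ranking rather than given a default, so a failed evaluation cannot be mistaken for a low but valid result.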