# Synthetic Brand Generation V2 - Enhanced with Ensemble Methods

### University of Colorado Boulder - Introduction to Deep Learning
---
#### Dyego Fernandes de Sousa
---

### Improvements over V1

This notebook implements the following enhancements:

1. **TVAE (Tabular Variational Autoencoder)**: Alternative to CTGAN for better continuous distributions
2. **Gaussian Copula**: For better correlation structure preservation
3. **Larger Language Models**: GPT-2 Medium, Flan-T5, Phi-2, TinyLlama for improved brand name generation
4. **Ensemble Methods**: Voting/averaging across multiple generators
5. **Hyperparameter Tuning**: Optional Optuna-based optimization for tabular synthesizers
6. **Comprehensive Quality Evaluation**: Multi-dimensional assessment with visualizations

### Notebook Structure
1. **Phase 0**: Hyperparameter Tuning (Optional)
2. **Phase 1**: Setup & Data Preparation
3. **Phase 2**: Tabular Ensemble Training (CTGAN + TVAE + Gaussian Copula)
4. **Phase 3**: LLM Ensemble Training (GPT-2 Medium + Flan-T5)
5. **Phase 4**: Synthetic Data Generation with Ensembles
6. **Phase 5**: Synthetic Data Quality Evaluation
   - 5.1 Statistical Distribution Comparison (KS Tests)
   - 5.2 KS Statistics Visualization
   - 5.3 Distribution Comparison (Histograms with KDE)
   - 5.4 Correlation Structure Preservation
   - 5.5 QQ Plots (Quantile Comparison)
   - 5.6 PCA & t-SNE Dimensionality Reduction
   - 5.7 Feature-wise Statistics Comparison
   - 5.8 Quality Scorecard & Radar Chart
   - 5.9 Augmented Data Clustering Analysis
   - 5.10 Final Summary
7. **Conclusion**: Results Analysis & Future Work
8. **References**: Bibliography

## Phase 1: Setup & Installation

### Optimized for Google Colab Pro (~15GB RAM, ~16GB VRAM)

In [1]:
# Clone repository and install dependencies
!git clone https://github.com/dyegofern/csca5642-deep-learning.git
!pip install -q sdv transformers torch pandas numpy scikit-learn matplotlib seaborn plotly scipy
!pip install -q peft bitsandbytes accelerate sentencepiece  # Additional V2 dependencies

import sys
import os
from google.colab import drive

MAPPED_DIR = '/content/csca5642-deep-learning'

# Mount Google Drive
print("Mounting Google Drive...")
if not os.path.exists('/content/drive'):
    drive.mount('/content/drive')
else:
    print("Google Drive already mounted")

DATA_PATH = MAPPED_DIR + '/data/raw/brand_information.csv'

# Set output and model directories to Google Drive
DRIVE_OUTPUT_BASE = '/content/drive/MyDrive/Colab_Output/SyntheticBrandGeneration_V2'
OUTPUT_DIR = os.path.join(DRIVE_OUTPUT_BASE, 'outputs')
MODEL_DIR = os.path.join(DRIVE_OUTPUT_BASE, 'models')

# Create directories
print(f"\nCreating directories in Google Drive...")
os.makedirs(OUTPUT_DIR, exist_ok=True)
os.makedirs(MODEL_DIR, exist_ok=True)
print(f"Output directory: {OUTPUT_DIR}")
print(f"Model directory: {MODEL_DIR}")

# Add src to path
src_path = MAPPED_DIR + '/src'
if src_path not in sys.path:
    sys.path.append(src_path)

print(f"\nSetup complete!")

Cloning into 'csca5642-deep-learning'...
remote: Enumerating objects: 88, done.
remote: Counting objects: 100% (88/88), done.
remote: Compressing objects: 100% (64/64), done.
remote: Total 88 (delta 46), reused 60 (delta 22), pack-reused 0 (from 0)
Receiving objects: 100% (88/88), 1.51 MiB | 21.54 MiB/s, done.
Resolving deltas: 100% (46/46), done.

In [2]:
# Import V2 modules
from data_processor import BrandDataProcessor
from tabular_gan_v2 import (
    EnsembleSynthesizer,
    CTGANSynthesizerWrapper,
    TVAESynthesizerWrapper,
    GaussianCopulaSynthesizerWrapper,
    calculate_generation_targets
)
from brand_name_generator_v2 import BrandNameGeneratorV2
from evaluator import BrandDataEvaluator

# Standard libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import torch
import gc

# Set style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

# Check GPU
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

print("\nAll V2 modules loaded successfully!")

CUDA available: True
GPU: NVIDIA L4
Memory: 23.8 GB

All V2 modules loaded successfully!


## Configuration

In [3]:
# Configuration for V2
FROM_PRETRAINED = False  # Set to True to load pre-trained models

# Tabular Ensemble Config
CTGAN_EPOCHS = 300
TVAE_EPOCHS = 300
BATCH_SIZE = 500
ENSEMBLE_WEIGHTS = {
    'ctgan': 0.40,
    'tvae': 0.35,
    'gaussian_copula': 0.25
}

# LLM Ensemble Config
LLM_MODELS = ['gpt2-medium', 'flan-t5-base']  # Can add 'phi-2', 'tinyllama' if memory allows
LLM_EPOCHS = 3

# Generation Config
MIN_BRANDS_PER_COMPANY = 10
DIVERSITY_TEMPERATURE = 0.7
ADD_DIVERSITY_NOISE = True

print("Configuration:")
print(f"  FROM_PRETRAINED: {FROM_PRETRAINED}")
print(f"  CTGAN_EPOCHS: {CTGAN_EPOCHS}")
print(f"  TVAE_EPOCHS: {TVAE_EPOCHS}")
print(f"  LLM_MODELS: {LLM_MODELS}")
print(f"  ENSEMBLE_WEIGHTS: {ENSEMBLE_WEIGHTS}")

Configuration:
  FROM_PRETRAINED: False
  CTGAN_EPOCHS: 300
  TVAE_EPOCHS: 300
  LLM_MODELS: ['gpt2-medium', 'flan-t5-base']
  ENSEMBLE_WEIGHTS: {'ctgan': 0.4, 'tvae': 0.35, 'gaussian_copula': 0.25}


## Phase 0: Hyperparameter Tuning (Optional)

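This phase uses Optuna to search for better hyperparameters for the tabular synthesizers.
If a saved `best_hyperparameters.json` exists, it is loaded and tuning is skipped. Otherwise tuning runs only when `RUN_HYPERPARAMETER_TUNING = True`, and the defaults from the Configuration section are used when it is `False`.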

**Tuned Parameters:**
- **CTGAN/TVAE**: epochs, batch_size, embedding_dim, generator/discriminator dimensions
- **Ensemble**: weights for each model
- **Generation**: noise_level, diversity_temperature
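
The shape of such a tuning loop can be sketched as follows. To stay dependency-free, this sketch uses plain random search as a stand-in for Optuna's sampler. The search ranges, `toy_objective`, and the helper names are assumptions made for illustration and are not taken from `HyperparameterTunerV2`.

```python
import random

# Hypothetical search space; the ranges are illustrative only.
SEARCH_SPACE = {
    "ctgan_epochs": [100, 200, 300, 500],
    "tvae_epochs": [100, 200, 300, 500],
    "batch_size": [200, 500, 1000],
}
MODELS = ("ctgan", "tvae", "gaussian_copula")


def sample_params(rng):
    """Draw one candidate configuration, with ensemble weights normalized to sum to 1."""
    params = {name: rng.choice(choices) for name, choices in SEARCH_SPACE.items()}
    raw = {model: rng.uniform(0.1, 1.0) for model in MODELS}
    total = sum(raw.values())
    params["ensemble_weights"] = {model: w / total for model, w in raw.items()}
    return params


def toy_objective(params):
    """Placeholder score (lower is better).

    A real objective would train the synthesizers with `params`,
    generate samples, and score them against the training data.
    """
    epoch_term = abs(params["ctgan_epochs"] - 300) / 300
    weight_term = abs(params["ensemble_weights"]["tvae"] - 0.5)
    return epoch_term + weight_term


def random_search(n_trials, seed=42):
    """Evaluate n_trials random configurations and keep the best one."""
    rng = random.Random(seed)
    best_params, best_score = None, float("inf")
    for _ in range(n_trials):
        params = sample_params(rng)
        score = toy_objective(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


best_params, best_score = random_search(n_trials=20)
print(best_params)
print(f"best score: {best_score:.4f}")
```

Normalizing the raw weights inside `sample_params` keeps every candidate a valid weighting, so the objective never has to handle weights that do not sum to 1.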

In [None]:
# Hyperparameter Tuning Configuration
RUN_HYPERPARAMETER_TUNING = False  # Set to True to run Optuna optimization
N_TUNING_TRIALS = 20  # Number of Optuna trials (more = better but slower)
TUNING_TIMEOUT = 900  # Maximum time for tuning in seconds (15 minutes)

# Path to save/load best hyperparameters
HYPERPARAMS_PATH = os.path.join(MODEL_DIR, 'best_hyperparameters.json')

print(f"Hyperparameter tuning: {'ENABLED' if RUN_HYPERPARAMETER_TUNING else 'DISABLED'}")
print(f"Trials: {N_TUNING_TRIALS}, Timeout: {TUNING_TIMEOUT}s")
print(f"Hyperparameters path: {HYPERPARAMS_PATH}")

In [None]:
# Import hyperparameter tuner from external module
from hyperparameter_tuner_v2 import HyperparameterTunerV2

# Initialize tuner (will be used after data preparation)
tuner = None
best_hyperparams = None

print("HyperparameterTunerV2 imported successfully.")
print("Tuner will be initialized after data preparation (Phase 1).")

In [None]:
# Check for saved hyperparameters
if os.path.exists(HYPERPARAMS_PATH):
    print("Found saved hyperparameters!")
    best_hyperparams = HyperparameterTunerV2.load(HYPERPARAMS_PATH)
    
    print(f"Previous best score: {best_hyperparams.get('best_score', 'N/A')}")
    print(f"Trials completed: {best_hyperparams.get('n_trials', 'N/A')}")
    print(f"Timestamp: {best_hyperparams.get('tuning_timestamp', 'N/A')}")
    
    # Update configuration with loaded hyperparameters
    CTGAN_EPOCHS = best_hyperparams.get('ctgan_epochs', CTGAN_EPOCHS)
    TVAE_EPOCHS = best_hyperparams.get('tvae_epochs', TVAE_EPOCHS)
    BATCH_SIZE = best_hyperparams.get('batch_size', BATCH_SIZE)
    if 'ensemble_weights' in best_hyperparams:
        ENSEMBLE_WEIGHTS = best_hyperparams['ensemble_weights']
    
    print(f"\nUsing optimized configuration:")
    print(f"  CTGAN_EPOCHS: {CTGAN_EPOCHS}")
    print(f"  TVAE_EPOCHS: {TVAE_EPOCHS}")
    print(f"  BATCH_SIZE: {BATCH_SIZE}")
    print(f"  ENSEMBLE_WEIGHTS: {ENSEMBLE_WEIGHTS}")
else:
    print("No saved hyperparameters found.")
    if RUN_HYPERPARAMETER_TUNING:
        print("Hyperparameter tuning will run after data preparation.")
    else:
        print("Using default configuration. Set RUN_HYPERPARAMETER_TUNING=True to optimize.")

## Phase 1: Data Preparation

In [4]:
# Load and process data
processor = BrandDataProcessor(DATA_PATH)
raw_data = processor.load_data()
print(f"Loaded {len(raw_data)} brands with {len(raw_data.columns)} features")

Loading data from /content/csca5642-deep-learning/data/raw/brand_information.csv...
Loaded 3605 brands with 77 features
Loaded 3605 brands with 77 features


In [5]:
# Clean data
cleaned_data = processor.clean_data()
print(f"\nCleaned data: {len(cleaned_data)} rows, {len(cleaned_data.columns)} columns")


=== Data Cleaning ===

Identified 54 numerical features
Identified 7 categorical features
Identified 6 text features (will be handled separately)
Dropped text-heavy columns: ['esg_summary', 'accusation', 'references_and_links']

Handling missing values...
  Filled demographics_gender with mode/Unknown
  Filled demographics_lifestyle with mode/Unknown

Cleaned dataset: 3605 rows, 74 columns

Cleaned data: 3605 rows, 74 columns


In [6]:
# Prepare for GAN training
train_df, val_df = processor.prepare_for_gan(test_size=0.2)

print(f"\nTraining set: {len(train_df)} brands")
print(f"Validation set: {len(val_df)} brands")

# Get column types
discrete_cols = processor.categorical_features
binary_cols = [col for col in train_df.columns if train_df[col].nunique() == 2 and set(train_df[col].unique()).issubset({0, 1})]
numerical_cols = [col for col in train_df.columns if col not in discrete_cols and col not in binary_cols]

print(f"\nColumn types:")
print(f"  Numerical: {len(numerical_cols)}")
print(f"  Categorical: {len(discrete_cols)}")
print(f"  Binary: {len(binary_cols)}")


=== Preparing Data for GAN ===

Encoding categorical features...
  Encoded industry_name: 14 unique values
  Encoded country_of_origin: 42 unique values
  Encoded headquarters_country: 43 unique values
  Encoded demographics_income_level: 84 unique values
  Encoded demographics_geographic_reach: 167 unique values
  Encoded demographics_gender: 49 unique values
  Encoded demographics_lifestyle: 3540 unique values
  Encoded company_name: 243 companies
Added 40 single-brand companies to training set.

Train set: 2892 brands
Validation set: 713 brands

Training set: 2892 brands
Validation set: 713 brands

Column types:
  Numerical: 25
  Categorical: 7
  Binary: 30


In [None]:
# Execute hyperparameter tuning if enabled and no saved params exist
if RUN_HYPERPARAMETER_TUNING and best_hyperparams is None:
    print("Initializing hyperparameter tuner...")
    
    # Create tuner instance
    tuner = HyperparameterTunerV2(
        train_data=train_df,
        discrete_cols=discrete_cols,
        binary_cols=binary_cols,
        eval_sample_size=min(1000, len(train_df)),
        gen_sample_size=500,
        verbose=True
    )
    
    # Run optimization
    best_hyperparams = tuner.tune(
        n_trials=N_TUNING_TRIALS,
        timeout=TUNING_TIMEOUT,
        seed=42,
        show_progress_bar=True
    )
    
    # Save the best hyperparameters
    tuner.save(HYPERPARAMS_PATH)
    
    # Plot optimization history
    tuner.plot_optimization_history(
        save_path=os.path.join(OUTPUT_DIR, 'hyperparameter_tuning_history.png')
    )
    
    # Update global configuration
    CTGAN_EPOCHS = best_hyperparams['ctgan_epochs']
    TVAE_EPOCHS = best_hyperparams['tvae_epochs']
    BATCH_SIZE = best_hyperparams['batch_size']
    ENSEMBLE_WEIGHTS = best_hyperparams['ensemble_weights']
    
    print(f"\nUpdated configuration with optimized hyperparameters:")
    print(f"  CTGAN_EPOCHS: {CTGAN_EPOCHS}")
    print(f"  TVAE_EPOCHS: {TVAE_EPOCHS}")
    print(f"  BATCH_SIZE: {BATCH_SIZE}")
    print(f"  ENSEMBLE_WEIGHTS: {ENSEMBLE_WEIGHTS}")
else:
    if best_hyperparams is not None:
        print("Using previously loaded hyperparameters.")
    else:
        print("Using default hyperparameters. Set RUN_HYPERPARAMETER_TUNING=True to optimize.")

## Phase 2: Tabular Ensemble Training

Training CTGAN, TVAE, and Gaussian Copula models
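
One way to read `ENSEMBLE_WEIGHTS` is as the share of generated rows that each synthesizer contributes. That reading is an assumption: how `EnsembleSynthesizer` actually combines model outputs is not shown in this notebook. Under it, a minimal sketch of turning weights into per-model row counts looks like this (`allocate_samples` is a hypothetical helper):

```python
def allocate_samples(n_samples, weights):
    """Split n_samples across models in proportion to their weights.

    Uses largest-remainder rounding so the counts always sum to n_samples.
    """
    total = sum(weights.values())
    exact = {name: n_samples * w / total for name, w in weights.items()}
    counts = {name: int(value) for name, value in exact.items()}  # floor
    leftover = n_samples - sum(counts.values())
    # Hand the remaining rows to the models with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda name: exact[name] - counts[name], reverse=True)
    for name in by_remainder[:leftover]:
        counts[name] += 1
    return counts


weights = {"ctgan": 0.40, "tvae": 0.35, "gaussian_copula": 0.25}
print(allocate_samples(1000, weights))
print(allocate_samples(10, weights))
```

Largest-remainder rounding matters for small requests such as 10 rows per company, where naive rounding of each share can produce 9 or 11 rows in total.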

In [7]:
# Initialize Ensemble Synthesizer
tabular_ensemble = EnsembleSynthesizer(
    ctgan_epochs=CTGAN_EPOCHS,
    ctgan_batch_size=BATCH_SIZE,
    tvae_epochs=TVAE_EPOCHS,
    tvae_batch_size=BATCH_SIZE,
    gc_default_distribution='beta',
    weights=ENSEMBLE_WEIGHTS,
    verbose=True,
    cuda=torch.cuda.is_available()  # fall back to CPU when no GPU is present
)

print("Tabular Ensemble initialized with:")
print(f"  - CTGAN (epochs={CTGAN_EPOCHS})")
print(f"  - TVAE (epochs={TVAE_EPOCHS})")
print(f"  - Gaussian Copula (distribution=beta)")

Tabular Ensemble initialized with:
  - CTGAN (epochs=300)
  - TVAE (epochs=300)
  - Gaussian Copula (distribution=beta)


In [8]:
if FROM_PRETRAINED:
    # Load pre-trained models
    print("Loading pre-trained tabular ensemble...")
    tabular_ensemble.load_models(os.path.join(MODEL_DIR, 'tabular_ensemble'))
else:
    # Train all models
    print("Training tabular ensemble (this will take ~30-60 minutes)...")
    training_times = tabular_ensemble.train(
        data=train_df,
        discrete_columns=discrete_cols,
        binary_columns=binary_cols
    )

    # Save models
    tabular_ensemble.save_models(os.path.join(MODEL_DIR, 'tabular_ensemble'))

    print(f"\nTraining times:")
    # Avoid naming the loop variable `time`, which would shadow the stdlib module name
    for model_name, seconds in training_times.items():
        print(f"  {model_name}: {seconds:.1f} seconds")

Training tabular ensemble (this will take ~30-60 minutes)...

ENSEMBLE SYNTHESIZER: TRAINING ALL MODELS
Enabled models: ['ctgan', 'tvae', 'gaussian_copula']

--- Training CTGAN ---

=== Training CTGAN ===
Training on 2892 samples with 62 features
Converting 30 binary columns to boolean...
Setting 30 binary columns as boolean type in metadata...
Training for 300 epochs with batch size 500...


Gen. (-0.12) | Discrim. (-0.70): 100%|██████████| 300/300 [01:44<00:00,  2.88it/s]


CTGAN Training completed!
ctgan trained in 126.62 seconds

--- Training TVAE ---

=== Training TVAE ===
Training on 2892 samples with 62 features
Converting 30 binary columns to boolean...
Setting 30 binary columns as boolean type in metadata...
Training for 300 epochs with batch size 500...


Loss: -98.334: 100%|██████████| 300/300 [01:01<00:00,  4.91it/s]


TVAE Training completed!
tvae trained in 70.08 seconds

--- Training GAUSSIAN_COPULA ---

=== Training Gaussian Copula ===
Training on 2892 samples with 62 features
Converting 30 binary columns to boolean...
Setting 30 binary columns as boolean type in metadata...
Fitting Gaussian Copula with 'beta' distribution...
Gaussian Copula Training completed!
gaussian_copula trained in 7.31 seconds

TRAINING COMPLETE
Total time: 204.01 seconds
CTGAN model saved to /content/drive/MyDrive/Colab_Output/SyntheticBrandGeneration_V2/models/tabular_ensemble/ctgan_model.pkl
TVAE model saved to /content/drive/MyDrive/Colab_Output/SyntheticBrandGeneration_V2/models/tabular_ensemble/tvae_model.pkl
Gaussian Copula model saved to /content/drive/MyDrive/Colab_Output/SyntheticBrandGeneration_V2/models/tabular_ensemble/gaussian_copula_model.pkl
Ensemble models saved to /content/drive/MyDrive/Colab_Output/SyntheticBrandGeneration_V2/models/tabular_ensemble

Training times:
  ctgan: 126.6 seconds
  tvae: 70.1 se

In [9]:
# Compare individual model quality
print("Evaluating individual model quality...")
comparison_df = tabular_ensemble.compare_all_models(train_df, n_samples=1000)
print("\nModel Comparison:")
display(comparison_df)

Evaluating individual model quality...

=== Generating 1000 Samples (Ensemble) ===
  Generating from ctgan...
  Generating from tvae...
  Generating from gaussian_copula...
  Combined 1000 samples from 3 models

Model Comparison:


| Model | Mean KS Statistic | KS Pass Rate | Correlation MSE | Mean Relative Error |
|---|---|---|---|---|
| ctgan | 0.26234 | 0.222222 | 0.02527 | 0.905551 |
| tvae | 0.114521 | 0.444444 | 0.020114 | 0.337487 |
| gaussian_copula | 0.350456 | 0.222222 | 0.019653 | 145.629258 |
| ensemble | 0.466863 | 0.111111 | 0.015953 | 35.093697 |
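
For readers unfamiliar with the KS columns, the two-sample Kolmogorov-Smirnov statistic is the largest gap between the empirical CDFs of a real column and its synthetic counterpart, so lower is better. The sketch below computes it in plain Python and aggregates it per column. The pass criterion (`threshold=0.1`) and the helper names are assumptions; the criterion used by `compare_all_models` is not shown in this notebook.

```python
def ks_statistic(real, synthetic):
    """Two-sample KS statistic: maximum gap between the two empirical CDFs."""
    real_sorted, synth_sorted = sorted(real), sorted(synthetic)
    n, m = len(real_sorted), len(synth_sorted)
    i = j = 0
    max_gap = 0.0
    while i < n and j < m:
        # Advance both CDFs past the next smallest value (handles ties).
        x = min(real_sorted[i], synth_sorted[j])
        while i < n and real_sorted[i] <= x:
            i += 1
        while j < m and synth_sorted[j] <= x:
            j += 1
        max_gap = max(max_gap, abs(i / n - j / m))
    return max_gap


def summarize_ks(real_columns, synth_columns, threshold=0.1):
    """Return (mean KS statistic, share of columns below the assumed threshold)."""
    stats = {
        name: ks_statistic(real_columns[name], synth_columns[name])
        for name in real_columns
    }
    mean_ks = sum(stats.values()) / len(stats)
    pass_rate = sum(value < threshold for value in stats.values()) / len(stats)
    return mean_ks, pass_rate


# Toy columns, invented for illustration.
real_columns = {"a": [1, 2, 3, 4], "b": [1, 2, 3]}
synth_columns = {"a": [1, 2, 3, 4], "b": [4, 5, 6]}
print(summarize_ks(real_columns, synth_columns))
```

In practice `scipy.stats.ks_2samp` returns the same statistic along with a p-value; the hand-written version only makes the definition explicit.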


In [10]:
# Optionally optimize weights based on quality
optimized_weights = tabular_ensemble.optimize_weights(train_df, n_eval_samples=1000)
print(f"Optimized weights: {optimized_weights}")


=== Evaluating Individual Model Quality ===

Evaluating ctgan...
  Mean KS statistic: 0.2666
  Correlation MSE: 0.0245

Evaluating tvae...
  Mean KS statistic: 0.1118
  Correlation MSE: 0.0216

Evaluating gaussian_copula...
  Mean KS statistic: 0.3560
  Correlation MSE: 0.0176

Optimized weights: {'ctgan': np.float64(0.2575040716298723), 'tvae': np.float64(0.5403735446466744), 'gaussian_copula': np.float64(0.20212238372345337)}
Optimized weights: {'ctgan': np.float64(0.2575040716298723), 'tvae': np.float64(0.5403735446466744), 'gaussian_copula': np.float64(0.20212238372345337)}
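
The optimized weights favor the model with the lowest error. One plausible scheme with that behavior is inverse-error weighting, sketched below using the mean KS statistics printed above. This is an assumption for illustration: the actual logic inside `optimize_weights` is not shown here, and the sketch is not expected to reproduce the exact weights in the output.

```python
def inverse_error_weights(errors, eps=1e-9):
    """Weight each model by the inverse of its error, normalized to sum to 1."""
    inverse = {name: 1.0 / (err + eps) for name, err in errors.items()}
    total = sum(inverse.values())
    return {name: value / total for name, value in inverse.items()}


# Mean KS statistics copied from the evaluation output above.
mean_ks = {"ctgan": 0.2666, "tvae": 0.1118, "gaussian_copula": 0.3560}
weights = inverse_error_weights(mean_ks)
for name, weight in weights.items():
    print(f"{name}: {weight:.3f}")
```

The result is a dict of plain Python floats, which serializes to JSON directly, unlike the `np.float64` values shown in the output.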


## Phase 3: LLM Ensemble Training

Training GPT-2 Medium and Flan-T5 for brand name generation

In [11]:
# Prepare brand name training data
brands_df = processor.df[['brand_name', 'company_name', 'industry_name']].dropna()
print(f"Brand name training data: {len(brands_df)} examples")
brands_df.head()

Brand name training data: 3605 examples


| | brand_name | company_name | industry_name |
|---|---|---|---|
| 0 | 00 Null Null | S. C. Johnson & Son, Inc. | Household & Personal Products |
| 1 | 100 Grand | Ferrero Group | Processed Foods |
| 2 | 1950 127 Cheese | Foremost Farms USA Cooperative | Meat, Poultry & Dairy |
| 3 | 2nd Street Creamery | Wells Enterprises, Inc. | Processed Foods |
| 4 | 3 Musketeers | Mars, Incorporated | Processed Foods |


In [12]:
# Initialize LLM Ensemble Generator
llm_generator = BrandNameGeneratorV2(
    models=LLM_MODELS,
    memory_efficient=True,
    verbose=True
)

print(f"LLM Ensemble initialized with models: {LLM_MODELS}")

LLM Ensemble initialized with models: ['gpt2-medium', 'flan-t5-base']


In [13]:
if FROM_PRETRAINED:
    # Load pre-trained models
    print("Loading pre-trained LLM ensemble...")
    llm_generator.load_model(os.path.join(MODEL_DIR, 'llm_ensemble'))
else:
    # Fine-tune all models
    print(f"Fine-tuning LLM ensemble (epochs={LLM_EPOCHS})...")
    print("This will train each model sequentially to save memory.")

    llm_generator.fine_tune(
        brands_df=brands_df,
        epochs=LLM_EPOCHS,
        output_dir=os.path.join(MODEL_DIR, 'llm_ensemble')
    )

    # Save ensemble config
    llm_generator.save_model(os.path.join(MODEL_DIR, 'llm_ensemble'))

Fine-tuning LLM ensemble (epochs=3)...
This will train each model sequentially to save memory.

FINE-TUNING: gpt2-medium
Loading gpt2-medium...



Model loaded on cuda

=== Fine-tuning gpt2-medium ===
Training on 3605 examples


`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
`loss_type=None` was set in the config but it is unrecognized. Using the default loss: `ForCausalLMLoss`.


| Step | Training Loss |
|---|---|
| 50 | 2.4027 |
| 100 | 1.3879 |
| 150 | 1.2237 |
| 200 | 1.1536 |
| 250 | 1.0662 |
| 300 | 0.9845 |
| 350 | 0.9709 |
| 400 | 0.9697 |
| 450 | 0.9622 |
| 500 | 0.8443 |


Model saved to /content/drive/MyDrive/Colab_Output/SyntheticBrandGeneration_V2/models/llm_ensemble/gpt2-medium

FINE-TUNING: flan-t5-base
Loading google/flan-t5-base...



You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565



Model loaded on cuda

=== Fine-tuning google/flan-t5-base ===
Training on 3605 examples


`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...


| Step | Training Loss |
|---|---|
| 50 | 0.0 |
| 100 | 0.0 |
| 150 | 0.0 |
| 200 | 0.0 |
| 250 | 0.0 |
| 300 | 0.0 |
| 350 | 0.0 |
| 400 | 0.0 |
| 450 | 0.0 |
| 500 | 0.0 |


Model saved to /content/drive/MyDrive/Colab_Output/SyntheticBrandGeneration_V2/models/llm_ensemble/flan-t5-base

All models fine-tuned and saved to /content/drive/MyDrive/Colab_Output/SyntheticBrandGeneration_V2/models/llm_ensemble


In [14]:
# Test LLM generation
print("Testing LLM ensemble generation...")
llm_generator.prepare_model()

test_companies = [
    ("PepsiCo", "Non-Alcoholic Beverages"),
    ("Nestle", "Processed Foods"),
    ("Mars, Incorporated", "Processed Foods")
]

for company, industry in test_companies:
    names = llm_generator.generate_brand_names(company, industry, n_names=3)
    print(f"\n{company} ({industry}): {names}")

Testing LLM ensemble generation...
Loading gpt2-medium...
Model loaded on cuda
Loading google/flan-t5-base...


The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


Model loaded on cuda

Apple (Technology): ['Apple Inc', 'Apple', 'Amazon.com']

Nike (Apparel): ['Nike', 'Nike in industry: Apparel', 'Adidas']

Nestle (Food & Beverage): ['Nestle Foods', 'Nutri-Grain', 'Nestle']
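
The raw candidates include the parent company's own name and near-duplicates, so a pooling step across models is useful before names are attached to synthetic rows. The sketch below shows one simple voting scheme: normalize each candidate, drop the company name itself, and rank by how many models proposed the name. `pool_candidates` and the per-model candidate lists are hypothetical; the voting logic inside `BrandNameGeneratorV2` is not shown in this notebook.

```python
from collections import Counter


def pool_candidates(company, candidates_by_model, n_names=3):
    """Rank candidate names by the number of models that proposed them."""
    votes = Counter()
    display_name = {}
    company_key = company.casefold()
    for candidates in candidates_by_model.values():
        seen_in_model = set()
        for raw in candidates:
            name = " ".join(raw.split())  # collapse stray whitespace
            key = name.casefold()
            # Skip empty strings, the company's own name, and repeats within one model.
            if not key or key == company_key or key in seen_in_model:
                continue
            seen_in_model.add(key)
            votes[key] += 1
            display_name.setdefault(key, name)
    # Most votes first; ties are broken alphabetically for determinism.
    ranked = sorted(votes, key=lambda key: (-votes[key], key))
    return [display_name[key] for key in ranked[:n_names]]


# Illustrative inputs; which model produced which name is made up here.
example = pool_candidates(
    "Nestle",
    {
        "gpt2-medium": ["Nestle Foods", "Nestle", "Nutri-Grain"],
        "flan-t5-base": ["Nutri-Grain", "nestle  foods ", "Nestle"],
    },
)
print(example)
```

A further filter against the list of existing brand names would also catch real brands such as competitors' names, at the cost of an extra lookup per candidate.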


## Phase 4: Synthetic Data Generation

In [17]:
# Calculate generation targets
generation_targets = calculate_generation_targets(
    data=train_df,
    company_column='company_name',
    min_brands_per_company=MIN_BRANDS_PER_COMPANY
)
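# Keep only the first 50 companies for this run; the summary printed by
# calculate_generation_targets still covers every company.
generation_targets = dict(list(generation_targets.items())[:50])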


=== Generation Targets ===
  Total companies: 243
  Total brands to generate: 1885
  Average per company: 7.8


In [18]:
# Generate synthetic tabular features using ensemble
print("Generating synthetic features with ensemble...")
synthetic_features, failed_companies = tabular_ensemble.generate_stratified(
    company_distribution=generation_targets,
    verbose=True
)

print(f"\nGenerated {len(synthetic_features)} synthetic brand features")
if failed_companies:
    print(f"Failed companies: {len(failed_companies)}")

Generating synthetic features with ensemble...

=== Ensemble Stratified Generation ===
  Companies: 50
  Total brands requested: 500

=== Generating 10 Samples (Ensemble) ===
  Generating from ctgan...


Sampling remaining columns: 100%|██████████| 10/10 [00:11<00:00,  1.16s/it]


  Generating from tvae...


Sampling remaining columns: 100%|██████████| 10/10 [00:07<00:00,  1.31it/s]


  Generating from gaussian_copula...


Sampling remaining columns: 100%|██████████| 10/10 [00:00<00:00, 26.67it/s]


  Combined 10 samples from 3 models

  Progress: 10/50 companies...

  Progress: 20/50 companies...



  Combined 10 samples from 3 models

=== Generating 10 Samples (Ensemble) ===
  Generating from ctgan...


Sampling remaining columns: 100%|██████████| 10/10 [00:08<00:00,  1.12it/s]


  Generating from tvae...


Sampling remaining columns: 100%|██████████| 10/10 [00:07<00:00,  1.30it/s]


  Generating from gaussian_copula...


Sampling remaining columns: 100%|██████████| 10/10 [00:00<00:00, 27.06it/s]


  Combined 10 samples from 3 models

=== Generating 10 Samples (Ensemble) ===
  Generating from ctgan...


Sampling remaining columns: 100%|██████████| 10/10 [00:10<00:00,  1.03s/it]


  Generating from tvae...


Sampling remaining columns: 100%|██████████| 10/10 [00:28<00:00,  2.82s/it]


  Generating from gaussian_copula...


Sampling remaining columns: 100%|██████████| 10/10 [00:00<00:00, 24.47it/s]


  Combined 10 samples from 3 models

=== Generating 10 Samples (Ensemble) ===
  Generating from ctgan...


Sampling remaining columns: 100%|██████████| 10/10 [00:06<00:00,  1.58it/s]


  Generating from tvae...


Sampling remaining columns: 100%|██████████| 10/10 [00:16<00:00,  1.64s/it]


  Generating from gaussian_copula...


Sampling remaining columns: 100%|██████████| 10/10 [00:00<00:00, 27.12it/s]


  Combined 10 samples from 3 models

=== Generating 10 Samples (Ensemble) ===
  Generating from ctgan...


Sampling remaining columns: 100%|██████████| 10/10 [00:03<00:00,  2.61it/s]


  Generating from tvae...


Sampling remaining columns: 100%|██████████| 10/10 [00:05<00:00,  1.84it/s]


  Generating from gaussian_copula...


Sampling remaining columns: 100%|██████████| 10/10 [00:00<00:00, 26.74it/s]


  Combined 10 samples from 3 models
  Progress: 30/50 companies...

=== Generating 10 Samples (Ensemble) ===
  Generating from ctgan...


Sampling remaining columns: 100%|██████████| 10/10 [00:07<00:00,  1.27it/s]


  Generating from tvae...


Sampling remaining columns: 100%|██████████| 10/10 [00:14<00:00,  1.48s/it]


  Generating from gaussian_copula...


Sampling remaining columns: 100%|██████████| 10/10 [00:00<00:00, 27.05it/s]


  Combined 10 samples from 3 models

=== Generating 10 Samples (Ensemble) ===
  Generating from ctgan...


Sampling remaining columns: 100%|██████████| 10/10 [00:07<00:00,  1.38it/s]


  Generating from tvae...


Sampling remaining columns: 100%|██████████| 10/10 [00:02<00:00,  4.54it/s]


  Generating from gaussian_copula...


Sampling remaining columns: 100%|██████████| 10/10 [00:00<00:00, 26.75it/s]


  Combined 10 samples from 3 models

=== Generating 10 Samples (Ensemble) ===
  Generating from ctgan...


Sampling remaining columns: 100%|██████████| 10/10 [00:08<00:00,  1.17it/s]


  Generating from tvae...


Sampling remaining columns: 100%|██████████| 10/10 [00:20<00:00,  2.03s/it]


  Generating from gaussian_copula...


Sampling remaining columns: 100%|██████████| 10/10 [00:00<00:00, 26.76it/s]


  Combined 10 samples from 3 models

=== Generating 10 Samples (Ensemble) ===
  Generating from ctgan...


Sampling remaining columns: 100%|██████████| 10/10 [00:14<00:00,  1.44s/it]


  Generating from tvae...


Sampling remaining columns: 100%|██████████| 10/10 [00:06<00:00,  1.47it/s]


  Generating from gaussian_copula...


Sampling remaining columns: 100%|██████████| 10/10 [00:00<00:00, 26.95it/s]


  Combined 10 samples from 3 models

=== Generating 10 Samples (Ensemble) ===
  Generating from ctgan...


Sampling remaining columns: 100%|██████████| 10/10 [00:08<00:00,  1.22it/s]


  Generating from tvae...


Sampling remaining columns: 100%|██████████| 10/10 [00:11<00:00,  1.13s/it]


  Generating from gaussian_copula...


Sampling remaining columns: 100%|██████████| 10/10 [00:00<00:00, 26.77it/s]


  Combined 10 samples from 3 models

=== Generating 10 Samples (Ensemble) ===
  Generating from ctgan...


Sampling remaining columns: 100%|██████████| 10/10 [00:11<00:00,  1.13s/it]


  Generating from tvae...


Sampling remaining columns: 100%|██████████| 10/10 [00:14<00:00,  1.43s/it]


  Generating from gaussian_copula...


Sampling remaining columns: 100%|██████████| 10/10 [00:00<00:00, 24.44it/s]


  Combined 10 samples from 3 models

=== Generating 10 Samples (Ensemble) ===
  Generating from ctgan...


Sampling remaining columns: 100%|██████████| 10/10 [00:09<00:00,  1.06it/s]


  Generating from tvae...


Sampling remaining columns: 100%|██████████| 10/10 [00:15<00:00,  1.58s/it]


  Generating from gaussian_copula...


Sampling remaining columns: 100%|██████████| 10/10 [00:00<00:00, 24.21it/s]


  Combined 10 samples from 3 models

=== Generating 10 Samples (Ensemble) ===
  Generating from ctgan...


Sampling remaining columns: 100%|██████████| 10/10 [00:07<00:00,  1.31it/s]


  Generating from tvae...


Sampling remaining columns: 100%|██████████| 10/10 [00:22<00:00,  2.24s/it]


  Generating from gaussian_copula...


Sampling remaining columns: 100%|██████████| 10/10 [00:00<00:00, 26.24it/s]


  Combined 10 samples from 3 models

=== Generating 10 Samples (Ensemble) ===
  Generating from ctgan...


Sampling remaining columns: 100%|██████████| 10/10 [00:09<00:00,  1.09it/s]


  Generating from tvae...


Sampling remaining columns: 100%|██████████| 10/10 [00:03<00:00,  2.90it/s]


  Generating from gaussian_copula...


Sampling remaining columns: 100%|██████████| 10/10 [00:00<00:00, 26.85it/s]


  Combined 10 samples from 3 models

=== Generating 10 Samples (Ensemble) ===
  Generating from ctgan...


Sampling remaining columns: 100%|██████████| 10/10 [00:06<00:00,  1.57it/s]


  Generating from tvae...


Sampling remaining columns: 100%|██████████| 10/10 [00:17<00:00,  1.76s/it]


  Generating from gaussian_copula...


Sampling remaining columns: 100%|██████████| 10/10 [00:00<00:00, 27.34it/s]


  Combined 10 samples from 3 models
  Progress: 40/50 companies...

=== Generating 10 Samples (Ensemble) ===
  Generating from ctgan...


Sampling remaining columns: 100%|██████████| 10/10 [00:06<00:00,  1.66it/s]


  Generating from tvae...


Sampling remaining columns: 100%|██████████| 10/10 [00:21<00:00,  2.15s/it]


  Generating from gaussian_copula...


Sampling remaining columns: 100%|██████████| 10/10 [00:00<00:00, 27.22it/s]


  Combined 10 samples from 3 models

=== Generating 10 Samples (Ensemble) ===
  Generating from ctgan...


Sampling remaining columns: 100%|██████████| 10/10 [00:07<00:00,  1.31it/s]


  Generating from tvae...


Sampling remaining columns: 100%|██████████| 10/10 [00:12<00:00,  1.26s/it]


  Generating from gaussian_copula...


Sampling remaining columns: 100%|██████████| 10/10 [00:00<00:00, 26.93it/s]


  Combined 10 samples from 3 models

=== Generating 10 Samples (Ensemble) ===
  Generating from ctgan...


Sampling remaining columns: 100%|██████████| 10/10 [00:09<00:00,  1.03it/s]


  Generating from tvae...


Sampling remaining columns: 100%|██████████| 10/10 [00:05<00:00,  1.81it/s]


  Generating from gaussian_copula...


Sampling remaining columns: 100%|██████████| 10/10 [00:00<00:00, 26.70it/s]


  Combined 10 samples from 3 models

=== Generating 10 Samples (Ensemble) ===
  Generating from ctgan...


Sampling remaining columns: 100%|██████████| 10/10 [00:08<00:00,  1.13it/s]


  Generating from tvae...


Sampling remaining columns: 100%|██████████| 10/10 [00:08<00:00,  1.12it/s]


  Generating from gaussian_copula...


Sampling remaining columns: 100%|██████████| 10/10 [00:00<00:00, 24.65it/s]


  Combined 10 samples from 3 models

=== Generating 10 Samples (Ensemble) ===
  Generating from ctgan...


Sampling remaining columns: 100%|██████████| 10/10 [00:05<00:00,  1.67it/s]


  Generating from tvae...


Sampling remaining columns: 100%|██████████| 10/10 [00:12<00:00,  1.23s/it]


  Generating from gaussian_copula...


Sampling remaining columns: 100%|██████████| 10/10 [00:00<00:00, 26.82it/s]


  Combined 10 samples from 3 models

=== Generating 10 Samples (Ensemble) ===
  Generating from ctgan...


Sampling remaining columns: 100%|██████████| 10/10 [00:09<00:00,  1.06it/s]


  Generating from tvae...


Sampling remaining columns: 100%|██████████| 10/10 [00:09<00:00,  1.02it/s]


  Generating from gaussian_copula...


Sampling remaining columns: 100%|██████████| 10/10 [00:00<00:00, 26.53it/s]


  Combined 10 samples from 3 models

=== Generating 10 Samples (Ensemble) ===
  Generating from ctgan...


Sampling remaining columns: 100%|██████████| 10/10 [00:10<00:00,  1.04s/it]


  Generating from tvae...


Sampling remaining columns: 100%|██████████| 10/10 [00:03<00:00,  2.69it/s]


  Generating from gaussian_copula...


Sampling remaining columns: 100%|██████████| 10/10 [00:00<00:00, 26.85it/s]


  Combined 10 samples from 3 models

=== Generating 10 Samples (Ensemble) ===
  Generating from ctgan...


Sampling remaining columns: 100%|██████████| 10/10 [00:11<00:00,  1.10s/it]


  Generating from tvae...


Sampling remaining columns: 100%|██████████| 10/10 [00:13<00:00,  1.32s/it]


  Generating from gaussian_copula...


Sampling remaining columns: 100%|██████████| 10/10 [00:00<00:00, 26.87it/s]


  Combined 10 samples from 3 models

=== Generating 10 Samples (Ensemble) ===
  Generating from ctgan...


Sampling remaining columns: 100%|██████████| 10/10 [00:05<00:00,  1.67it/s]


  Generating from tvae...


Sampling remaining columns: 100%|██████████| 10/10 [00:23<00:00,  2.35s/it]


  Generating from gaussian_copula...


Sampling remaining columns: 100%|██████████| 10/10 [00:00<00:00, 27.05it/s]


  Combined 10 samples from 3 models

=== Generating 10 Samples (Ensemble) ===
  Generating from ctgan...


Sampling remaining columns: 100%|██████████| 10/10 [00:06<00:00,  1.59it/s]


  Generating from tvae...


Sampling remaining columns: 100%|██████████| 10/10 [00:07<00:00,  1.31it/s]


  Generating from gaussian_copula...


Sampling remaining columns: 100%|██████████| 10/10 [00:00<00:00, 27.27it/s]


  Combined 10 samples from 3 models
  Progress: 50/50 companies...

=== Generating 10 Samples (Ensemble) ===
  Generating from ctgan...


Sampling remaining columns: 100%|██████████| 10/10 [00:08<00:00,  1.13it/s]


  Generating from tvae...


Sampling remaining columns: 100%|██████████| 10/10 [00:08<00:00,  1.16it/s]


  Generating from gaussian_copula...


Sampling remaining columns: 100%|██████████| 10/10 [00:00<00:00, 26.84it/s]

  Combined 10 samples from 3 models

Generated 500 total synthetic brands

Generated 500 synthetic brand features





In [19]:
# Add diversity noise if enabled
if ADD_DIVERSITY_NOISE:
    print("Adding diversity noise to numerical features...")
    synthetic_features = tabular_ensemble.add_diversity_noise(
        synthetic_features,
        noise_level=0.07
    )

Adding diversity noise to numerical features...
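
The implementation of `tabular_ensemble.add_diversity_noise` lives in the cloned repository and is not shown in this notebook. As a rough sketch of what a step like this might do, the hypothetical helper below assumes the noise is zero-mean Gaussian and scaled by each column's own standard deviation; the function name, the `seed` argument, and the toy data are illustrative assumptions, and the real method may differ.

```python
import numpy as np
import pandas as pd


def add_gaussian_noise(df, noise_level=0.07, seed=0):
    """Hypothetical stand-in for add_diversity_noise (the real method may differ).

    Adds zero-mean Gaussian noise to every numeric column, with a standard
    deviation equal to noise_level times that column's own standard deviation.
    Non-numeric columns are returned unchanged.
    """
    rng = np.random.default_rng(seed)
    noisy = df.copy()
    for col in noisy.select_dtypes(include="number").columns:
        scale = noise_level * noisy[col].std()
        if not np.isfinite(scale) or scale == 0:
            continue  # constant or single-row column: nothing to perturb
        noisy[col] = noisy[col] + rng.normal(0.0, scale, size=len(noisy))
    return noisy


# Toy frame with one numeric and one categorical column.
demo = pd.DataFrame({"revenue": [10.0, 20.0, 30.0, 40.0], "sector": ["a", "b", "a", "b"]})
noisy_demo = add_gaussian_noise(demo, noise_level=0.07)
```

Scaling by the column's standard deviation keeps the perturbation proportionate across features with very different units, which is one plausible reason to express the setting as a relative `noise_level`.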


In [20]:
# Decode categorical features back to original values
print("Decoding categorical features...")
synthetic_decoded = processor.decode_categorical(synthetic_features)
print(f"Decoded {len(synthetic_decoded)} synthetic brands")

Decoding categorical features...
Decoded 500 synthetic brands


In [21]:
# Generate brand names using LLM ensemble
print("\nGenerating brand names with LLM ensemble...")
llm_generator.reset_uniqueness_tracker()

synthetic_with_names = llm_generator.generate_for_dataframe(
    synthetic_df=synthetic_decoded,
    temperature=DIVERSITY_TEMPERATURE,
    verbose=True
)

print(f"\nFinal synthetic dataset: {len(synthetic_with_names)} brands")


Generating brand names with LLM ensemble...
Uniqueness tracker reset

=== Generating Brand Names for 500 Brands ===
  Generated 100/500 names... (Failed: 0)
  Generated 200/500 names... (Failed: 2)
  Generated 300/500 names... (Failed: 16)
  Generated 400/500 names... (Failed: 17)
  Generated 500/500 names... (Failed: 23)

Generated 500 brand names
  Unique names: 500
  Failed/fallback: 23
  Success rate: 95.4%

Final synthetic dataset: 500 brands


In [22]:
# Preview synthetic data
print("\nSample of generated synthetic brands:")
display(synthetic_with_names[['company_name', 'industry_name', 'brand_name']].head(20))


Sample of generated synthetic brands:


| | company_name | industry_name | brand_name |
|---|---|---|---|
| 0 | Nestle | Processed Foods | Nestle |
| 1 | Nestle | Biotechnology & Pharmaceuticals | Snickers® |
| 2 | Nestle | Meat, Poultry & Dairy | Snack Bars |
| 3 | Nestle | Household & Personal Products | Snapple |
| 4 | Nestle | Meat, Poultry, & Dairy | Veggie Brands |
| 5 | Nestle | Meat, Poultry, & Dairy | Nestle, Inc |
| 6 | Nestle | Processed Foods | Nestle is a manufacturer of processed foods |
| 7 | Nestle | Processed Foods | Reddi-Taste |
| 8 | Nestle | Processed Foods | Nestle Foods |
| 9 | Nestle | Meat, Poultry & Dairy | Chicken nuggets |
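
Some of the previewed outputs read more like descriptions or generic product categories than brand names (row 6 is a full sentence, for instance). A light post-generation filter could flag such rows for review. The sketch below is not part of the notebook's pipeline: the function name `looks_like_brand_name`, the five-word limit, and the verb list are all illustrative assumptions.

```python
import re

# Illustrative heuristics only; the threshold and word list are assumptions.
_SENTENCE_WORDS = {"is", "are", "was", "were", "makes", "sells"}


def looks_like_brand_name(name, max_words=5):
    """Return False for strings that look like sentences rather than names."""
    words = re.findall(r"[A-Za-z0-9'&®-]+", name)
    if not words or len(words) > max_words:
        return False
    # A lowercase linking or action verb suggests descriptive prose.
    return not any(w in _SENTENCE_WORDS for w in words)


samples = ["Reddi-Taste", "Nestle Foods", "Nestle is a manufacturer of processed foods"]
flags = {s: looks_like_brand_name(s) for s in samples}
```

A check like this cannot tell a generic term such as "Chicken nuggets" from a real brand, so it would complement rather than replace a manual review of the generated names.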


In [23]:
# Save synthetic data
synthetic_path = os.path.join(OUTPUT_DIR, 'synthetic_brands_v2.csv')
synthetic_with_names.to_csv(synthetic_path, index=False)
print(f"Synthetic data saved to {synthetic_path}")

# Create augmented dataset
original_decoded = processor.decode_categorical(train_df)
augmented_df = pd.concat([original_decoded, synthetic_with_names], ignore_index=True)

augmented_path = os.path.join(OUTPUT_DIR, 'augmented_brands_v2.csv')
augmented_df.to_csv(augmented_path, index=False)
print(f"Augmented data saved to {augmented_path}")
print(f"Total augmented size: {len(augmented_df)} brands")

Synthetic data saved to /content/drive/MyDrive/Colab_Output/SyntheticBrandGeneration_V2/outputs/synthetic_brands_v2.csv
Augmented data saved to /content/drive/MyDrive/Colab_Output/SyntheticBrandGeneration_V2/outputs/augmented_brands_v2.csv
Total augmented size: 3392 brands
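
Because the augmented CSV mixes real and synthetic rows, later analysis (such as the clustering in section 5.9) may benefit from a column recording where each row came from. The sketch below shows one way to do that; the `is_synthetic` column name and the helper are assumptions, and toy frames stand in for `original_decoded` and `synthetic_with_names`.

```python
import pandas as pd


def concat_with_provenance(real_df, synth_df, flag_col="is_synthetic"):
    """Stack real and synthetic rows, tagging each row with its origin."""
    real_tagged = real_df.assign(**{flag_col: False})
    synth_tagged = synth_df.assign(**{flag_col: True})
    return pd.concat([real_tagged, synth_tagged], ignore_index=True)


# Toy frames standing in for original_decoded and synthetic_with_names.
real_toy = pd.DataFrame({"brand_name": ["A", "B", "C"]})
synth_toy = pd.DataFrame({"brand_name": ["X", "Y"]})
augmented_toy = concat_with_provenance(real_toy, synth_toy)
```

With the flag in place, a saved file can be filtered back to real-only rows without relying on row order.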


## Phase 5: Synthetic Data Quality Evaluation

This phase evaluates the quality of the generated synthetic data along four dimensions:
1. **Statistical Fidelity**: Distribution matching (KS tests), correlation preservation
2. **Visual Analysis**: Distribution overlays, QQ plots, feature comparisons
3. **Dimensionality Reduction**: PCA and t-SNE projections
4. **Summary Metrics**: Quality scorecards and radar charts

In [None]:
# Initialize evaluator and prepare data
evaluator = BrandDataEvaluator()

# Get numerical columns for evaluation
eval_numerical_cols = [col for col in numerical_cols if col in synthetic_features.columns and col in train_df.columns]

# Prepare augmented numerical data for later use
augmented_numerical = pd.concat([train_df[eval_numerical_cols], synthetic_features[eval_numerical_cols]], ignore_index=True)

print(f"Evaluation columns: {len(eval_numerical_cols)} numerical features")
print(f"Real data samples: {len(train_df)}")
print(f"Synthetic data samples: {len(synthetic_features)}")
print(f"Augmented data samples: {len(augmented_numerical)}")

In [None]:
### 5.1 Statistical Distribution Comparison (KS Test)

from scipy import stats

# Compute KS statistics for all features
print("="*70)
print("KOLMOGOROV-SMIRNOV TEST RESULTS")
print("="*70)
print(f"{'Feature':<35} {'KS Stat':>10} {'P-Value':>12} {'Result':>10}")
print("-"*70)

ks_results = {}
for col in eval_numerical_cols:
    if col in train_df.columns and col in synthetic_features.columns:
        stat, pvalue = stats.ks_2samp(
            train_df[col].dropna(),
            synthetic_features[col].dropna()
        )
        ks_results[col] = {'statistic': stat, 'pvalue': pvalue}
        result = "PASS" if pvalue > 0.05 else "FAIL"
        print(f"{col:<35} {stat:>10.4f} {pvalue:>12.4f} {result:>10}")

passes = sum(1 for v in ks_results.values() if v['pvalue'] > 0.05)
print("-"*70)
print(f"SUMMARY: {passes}/{len(ks_results)} features pass (p > 0.05) = {100*passes/len(ks_results):.1f}% pass rate")
print("="*70)
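
One caveat when reading the PASS/FAIL column: the KS p-value depends on sample size as well as on the gap between the two distributions, so with thousands of real rows even a small shift can fail at p ≤ 0.05. The KS statistic itself (plotted in 5.2) is therefore the more stable indicator of how far apart the distributions are. The sketch below illustrates the effect on simulated normal data, not on the brand data; the shift of 0.1 and the sample sizes are arbitrary choices.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)


def ks_for_shift(n, shift=0.1):
    """Two-sample KS test between N(0, 1) and N(shift, 1) samples of size n."""
    a = rng.normal(0.0, 1.0, size=n)
    b = rng.normal(shift, 1.0, size=n)
    return stats.ks_2samp(a, b)


small = ks_for_shift(200)      # few rows: the small shift is hard to detect
large = ks_for_shift(20000)    # many rows: the same shift is easily detected

for label, res in [("n=200", small), ("n=20000", large)]:
    print(f"{label:>8}  KS={res.statistic:.3f}  p={res.pvalue:.3g}")
```

The underlying shift is identical in both runs; only the amount of evidence changes, which is why a low pass rate on a large dataset does not by itself mean the synthetic distributions are far off.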

In [None]:
### 5.2 KS Statistics Visualization

# Create a bar chart of KS statistics
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Sort features by KS statistic
sorted_features = sorted(ks_results.items(), key=lambda x: x[1]['statistic'], reverse=True)
features = [f[0] for f in sorted_features]
ks_stats = [f[1]['statistic'] for f in sorted_features]
colors = ['#e74c3c' if f[1]['pvalue'] <= 0.05 else '#27ae60' for f in sorted_features]

# Bar chart of KS statistics
ax1 = axes[0]
bars = ax1.barh(range(len(features)), ks_stats, color=colors, edgecolor='black', alpha=0.8)
ax1.set_yticks(range(len(features)))
ax1.set_yticklabels(features, fontsize=9)
ax1.set_xlabel('KS Statistic', fontsize=12)
ax1.set_title('Distribution Similarity (KS Test)\nLower is Better', fontsize=14, fontweight='bold')
threshold_good = ax1.axvline(x=0.1, color='orange', linestyle='--', linewidth=2, label='Good threshold (0.1)')
threshold_poor = ax1.axvline(x=0.2, color='red', linestyle='--', linewidth=2, label='Poor threshold (0.2)')
ax1.invert_yaxis()

# Single legend covering both the pass/fail colors and the threshold lines
# (a second ax1.legend() call would replace the first one)
from matplotlib.patches import Patch
legend_elements = [Patch(facecolor='#27ae60', label='Pass (p > 0.05)'),
                   Patch(facecolor='#e74c3c', label='Fail (p ≤ 0.05)'),
                   threshold_good, threshold_poor]
ax1.legend(handles=legend_elements, loc='lower right')

# Pie chart of pass/fail
ax2 = axes[1]
fail_count = len(ks_results) - passes
sizes = [passes, fail_count]
labels = [f'Pass\n({passes})', f'Fail\n({fail_count})']
colors_pie = ['#27ae60', '#e74c3c']
explode = (0.05, 0)
wedges, texts, autotexts = ax2.pie(sizes, explode=explode, labels=labels, colors=colors_pie,
                                    autopct='%1.1f%%', shadow=True, startangle=90,
                                    textprops={'fontsize': 12, 'fontweight': 'bold'})
ax2.set_title('KS Test Results Summary', fontsize=14, fontweight='bold')

plt.tight_layout()
plt.savefig(os.path.join(OUTPUT_DIR, 'ks_test_results.png'), dpi=150, bbox_inches='tight')
plt.show()

In [None]:
### 5.3 Distribution Comparison - Histograms with KDE

# Select top features for visualization (mix of good and bad performers)
best_features = [f[0] for f in sorted_features[-4:]]  # Best 4 (lowest KS)
worst_features = [f[0] for f in sorted_features[:4]]  # Worst 4 (highest KS)
viz_features = worst_features + best_features

n_features = len(viz_features)
fig, axes = plt.subplots(2, 4, figsize=(20, 10))
axes = axes.flatten()

for idx, feature in enumerate(viz_features):
    ax = axes[idx]
    
    # Get data
    real_data = train_df[feature].dropna()
    synth_data = synthetic_features[feature].dropna()
    
    # Plot histograms with KDE
    ax.hist(real_data, bins=30, alpha=0.5, label='Real', color='steelblue', density=True, edgecolor='black')
    ax.hist(synth_data, bins=30, alpha=0.5, label='Synthetic', color='coral', density=True, edgecolor='black')
    
    # Add KDE curves
    try:
        real_data.plot.kde(ax=ax, color='steelblue', linewidth=2, linestyle='-')
        synth_data.plot.kde(ax=ax, color='coral', linewidth=2, linestyle='-')
    except Exception:
        pass  # Skip KDE if it fails (e.g., a constant column has zero variance)
    
    # Get KS stat for this feature
    ks_stat = ks_results[feature]['statistic']
    pvalue = ks_results[feature]['pvalue']
    result = "PASS" if pvalue > 0.05 else "FAIL"
    
    ax.set_title(f'{feature}\nKS={ks_stat:.3f} ({result})', fontsize=11, fontweight='bold')
    ax.set_xlabel('')
    ax.set_ylabel('Density')
    ax.legend(fontsize=9)
    
    # Color code title based on result
    if result == "FAIL":
        ax.title.set_color('#c0392b')
    else:
        ax.title.set_color('#27ae60')

plt.suptitle('Distribution Comparison: Real vs Synthetic Data\nTop Row: Worst Performers | Bottom Row: Best Performers', 
             fontsize=14, fontweight='bold', y=1.02)
plt.tight_layout()
plt.savefig(os.path.join(OUTPUT_DIR, 'distribution_comparison_v2.png'), dpi=150, bbox_inches='tight')
plt.show()

In [None]:
### 5.4 Correlation Structure Preservation

# Compute correlation matrices
real_corr = train_df[eval_numerical_cols].corr()
synth_corr = synthetic_features[eval_numerical_cols].corr()
corr_diff = np.abs(real_corr - synth_corr)

# Create figure with 3 subplots
fig, axes = plt.subplots(1, 3, figsize=(20, 6))

# Real data correlations
sns.heatmap(real_corr, ax=axes[0], cmap='RdBu_r', center=0, vmin=-1, vmax=1,
            square=True, cbar_kws={'label': 'Correlation', 'shrink': 0.8},
            xticklabels=True, yticklabels=True)
axes[0].set_title('Real Data Correlations', fontsize=14, fontweight='bold')
axes[0].tick_params(axis='both', labelsize=8, rotation=45)

# Synthetic data correlations
sns.heatmap(synth_corr, ax=axes[1], cmap='RdBu_r', center=0, vmin=-1, vmax=1,
            square=True, cbar_kws={'label': 'Correlation', 'shrink': 0.8},
            xticklabels=True, yticklabels=True)
axes[1].set_title('Synthetic Data Correlations', fontsize=14, fontweight='bold')
axes[1].tick_params(axis='both', labelsize=8, rotation=45)

# Correlation difference (absolute)
sns.heatmap(corr_diff, ax=axes[2], cmap='Reds', vmin=0, vmax=0.5,
            square=True, cbar_kws={'label': '|Difference|', 'shrink': 0.8},
            xticklabels=True, yticklabels=True)
axes[2].set_title('Absolute Correlation Difference\n(Lower is Better)', fontsize=14, fontweight='bold')
axes[2].tick_params(axis='both', labelsize=8, rotation=45)

# Add metrics annotation
mean_corr_diff = corr_diff.mean().mean()
max_corr_diff = corr_diff.max().max()

plt.suptitle(f'Correlation Structure Analysis\nMean Abs. Difference: {mean_corr_diff:.4f} | Max Difference: {max_corr_diff:.4f}', 
             fontsize=14, fontweight='bold', y=1.02)
plt.tight_layout()
plt.savefig(os.path.join(OUTPUT_DIR, 'correlation_comparison_v2.png'), dpi=150, bbox_inches='tight')
plt.show()

print(f"\nCorrelation Preservation Metrics:")
print(f"  Mean absolute difference: {mean_corr_diff:.4f}")
print(f"  Max absolute difference: {max_corr_diff:.4f}")
print(f"  Correlation RMSE: {np.sqrt(np.mean((real_corr - synth_corr).values**2)):.4f}")
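
Note that `corr_diff.mean().mean()` averages over the full matrix, including the diagonal, where both correlations are exactly 1 and the difference is always 0. For n features this shrinks the mean by a factor of (n-1)/n. The sketch below averages over distinct feature pairs only; `offdiag_mean_abs_diff` is a hypothetical helper name, and it assumes both matrices share the same column order.

```python
import numpy as np
import pandas as pd


def offdiag_mean_abs_diff(corr_a, corr_b):
    """Mean |difference| over the strict upper triangle of two correlation matrices."""
    diff = np.abs(corr_a.values - corr_b.values)
    upper = np.triu_indices_from(diff, k=1)  # distinct pairs, diagonal excluded
    return float(np.nanmean(diff[upper]))


# Toy 3x3 correlation matrices that differ by 0.2 in every off-diagonal entry.
cols = ["x", "y", "z"]
a = pd.DataFrame([[1.0, 0.5, 0.1], [0.5, 1.0, 0.3], [0.1, 0.3, 1.0]], index=cols, columns=cols)
b = pd.DataFrame([[1.0, 0.3, 0.3], [0.3, 1.0, 0.5], [0.3, 0.5, 1.0]], index=cols, columns=cols)

with_diagonal = float(np.abs(a - b).mean().mean())  # diluted by the zero diagonal
pairs_only = offdiag_mean_abs_diff(a, b)            # averages the three pairs only
```

In the notebook this could be called as `offdiag_mean_abs_diff(real_corr, synth_corr)` and reported alongside the existing metrics.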

In [None]:
### 5.5 QQ Plots - Quantile Comparison

# QQ plots for selected features
from scipy import stats as scipy_stats

fig, axes = plt.subplots(2, 4, figsize=(20, 10))
axes = axes.flatten()

for idx, feature in enumerate(viz_features):
    ax = axes[idx]
    
    real_data = np.sort(train_df[feature].dropna().values)
    synth_data = np.sort(synthetic_features[feature].dropna().values)
    
    # Interpolate to same length for QQ plot
    n_points = min(len(real_data), len(synth_data), 100)
    real_quantiles = np.percentile(real_data, np.linspace(0, 100, n_points))
    synth_quantiles = np.percentile(synth_data, np.linspace(0, 100, n_points))
    
    # Plot QQ
    ax.scatter(real_quantiles, synth_quantiles, alpha=0.6, s=30, c='steelblue', edgecolor='black')
    
    # Add diagonal reference line
    min_val = min(real_quantiles.min(), synth_quantiles.min())
    max_val = max(real_quantiles.max(), synth_quantiles.max())
    ax.plot([min_val, max_val], [min_val, max_val], 'r--', linewidth=2, label='Perfect Match')
    
    # Color based on KS result
    ks_stat = ks_results[feature]['statistic']
    result = "PASS" if ks_results[feature]['pvalue'] > 0.05 else "FAIL"
    title_color = '#27ae60' if result == "PASS" else '#c0392b'
    
    ax.set_title(f'{feature}\nKS={ks_stat:.3f}', fontsize=11, fontweight='bold', color=title_color)
    ax.set_xlabel('Real Quantiles', fontsize=10)
    ax.set_ylabel('Synthetic Quantiles', fontsize=10)
    ax.legend(loc='lower right', fontsize=9)
    ax.set_aspect('equal', adjustable='box')

plt.suptitle('Q-Q Plots: Real vs Synthetic Quantiles\nPoints on diagonal = Perfect distribution match', 
             fontsize=14, fontweight='bold', y=1.02)
plt.tight_layout()
plt.savefig(os.path.join(OUTPUT_DIR, 'qq_plots.png'), dpi=150, bbox_inches='tight')
plt.show()

In [None]:
### 5.6 PCA & t-SNE Dimensionality Reduction

from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# Prepare data
X_real = train_df[eval_numerical_cols].fillna(0).values
X_synth = synthetic_features[eval_numerical_cols].fillna(0).values

# Standardize
scaler = StandardScaler()
X_real_scaled = scaler.fit_transform(X_real)
X_synth_scaled = scaler.transform(X_synth)

# PCA
pca = PCA(n_components=2)
X_real_pca = pca.fit_transform(X_real_scaled)
X_synth_pca = pca.transform(X_synth_scaled)

# t-SNE (on combined data for fair comparison)
print("Computing t-SNE projection (this may take a moment)...")
X_combined = np.vstack([X_real_scaled, X_synth_scaled])
labels = np.array(['Real'] * len(X_real) + ['Synthetic'] * len(X_synth))

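# Iteration count is left at its default (1000); the n_iter keyword was renamed
# to max_iter in newer scikit-learn releases, so passing it explicitly can fail
tsne = TSNE(n_components=2, perplexity=30, random_state=42)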
X_combined_tsne = tsne.fit_transform(X_combined)
X_real_tsne = X_combined_tsne[:len(X_real)]
X_synth_tsne = X_combined_tsne[len(X_real):]

# Create visualization
fig, axes = plt.subplots(1, 2, figsize=(16, 7))

# PCA plot
ax1 = axes[0]
ax1.scatter(X_real_pca[:, 0], X_real_pca[:, 1], alpha=0.5, s=30, c='steelblue', label='Real', edgecolor='white', linewidth=0.5)
ax1.scatter(X_synth_pca[:, 0], X_synth_pca[:, 1], alpha=0.5, s=30, c='coral', label='Synthetic', edgecolor='white', linewidth=0.5)
ax1.set_xlabel(f'PC1 ({pca.explained_variance_ratio_[0]:.1%} variance)', fontsize=12)
ax1.set_ylabel(f'PC2 ({pca.explained_variance_ratio_[1]:.1%} variance)', fontsize=12)
ax1.set_title('PCA Projection\nReal vs Synthetic Data', fontsize=14, fontweight='bold')
ax1.legend(fontsize=11, loc='upper right')
ax1.grid(True, alpha=0.3)

# t-SNE plot
ax2 = axes[1]
ax2.scatter(X_real_tsne[:, 0], X_real_tsne[:, 1], alpha=0.5, s=30, c='steelblue', label='Real', edgecolor='white', linewidth=0.5)
ax2.scatter(X_synth_tsne[:, 0], X_synth_tsne[:, 1], alpha=0.5, s=30, c='coral', label='Synthetic', edgecolor='white', linewidth=0.5)
ax2.set_xlabel('t-SNE Dimension 1', fontsize=12)
ax2.set_ylabel('t-SNE Dimension 2', fontsize=12)
ax2.set_title('t-SNE Projection\nReal vs Synthetic Data', fontsize=14, fontweight='bold')
ax2.legend(fontsize=11, loc='upper right')
ax2.grid(True, alpha=0.3)

plt.suptitle('Dimensionality Reduction: Synthetic Data Should Overlap with Real Data', fontsize=14, fontweight='bold', y=1.02)
plt.tight_layout()
plt.savefig(os.path.join(OUTPUT_DIR, 'pca_tsne_comparison.png'), dpi=150, bbox_inches='tight')
plt.show()

print(f"\nPCA Explained Variance: PC1={pca.explained_variance_ratio_[0]:.1%}, PC2={pca.explained_variance_ratio_[1]:.1%}")

In [None]:
### 5.7 Feature-wise Statistics Comparison

# Compute statistics for real and synthetic data
stats_comparison = []

for col in eval_numerical_cols:
    real_col = train_df[col].dropna()
    synth_col = synthetic_features[col].dropna()
    
    stats_comparison.append({
        'Feature': col,
        'Real Mean': real_col.mean(),
        'Synth Mean': synth_col.mean(),
        'Mean Diff %': abs(real_col.mean() - synth_col.mean()) / (abs(real_col.mean()) + 1e-10) * 100,
        'Real Std': real_col.std(),
        'Synth Std': synth_col.std(),
        'Std Diff %': abs(real_col.std() - synth_col.std()) / (abs(real_col.std()) + 1e-10) * 100,
        'Real Min': real_col.min(),
        'Synth Min': synth_col.min(),
        'Real Max': real_col.max(),
        'Synth Max': synth_col.max(),
        'KS Stat': ks_results[col]['statistic']
    })

stats_df = pd.DataFrame(stats_comparison)

# Display summary table
print("="*80)
print("FEATURE-WISE STATISTICS COMPARISON")
print("="*80)
display(stats_df[['Feature', 'Real Mean', 'Synth Mean', 'Mean Diff %', 'Real Std', 'Synth Std', 'Std Diff %', 'KS Stat']].round(3))

# Visualize mean and std differences
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Mean difference
ax1 = axes[0]
sorted_by_mean = stats_df.sort_values('Mean Diff %', ascending=False)
colors_mean = ['#e74c3c' if x > 50 else '#f39c12' if x > 20 else '#27ae60' for x in sorted_by_mean['Mean Diff %']]
ax1.barh(range(len(sorted_by_mean)), sorted_by_mean['Mean Diff %'], color=colors_mean, edgecolor='black', alpha=0.8)
ax1.set_yticks(range(len(sorted_by_mean)))
ax1.set_yticklabels(sorted_by_mean['Feature'], fontsize=9)
ax1.set_xlabel('Mean Difference (%)', fontsize=12)
ax1.set_title('Mean Value Difference\n(Lower is Better)', fontsize=14, fontweight='bold')
ax1.axvline(x=20, color='orange', linestyle='--', linewidth=2, alpha=0.7)
ax1.axvline(x=50, color='red', linestyle='--', linewidth=2, alpha=0.7)
ax1.invert_yaxis()

# Std difference  
ax2 = axes[1]
sorted_by_std = stats_df.sort_values('Std Diff %', ascending=False)
colors_std = ['#e74c3c' if x > 50 else '#f39c12' if x > 20 else '#27ae60' for x in sorted_by_std['Std Diff %']]
ax2.barh(range(len(sorted_by_std)), sorted_by_std['Std Diff %'], color=colors_std, edgecolor='black', alpha=0.8)
ax2.set_yticks(range(len(sorted_by_std)))
ax2.set_yticklabels(sorted_by_std['Feature'], fontsize=9)
ax2.set_xlabel('Std Deviation Difference (%)', fontsize=12)
ax2.set_title('Standard Deviation Difference\n(Lower is Better)', fontsize=14, fontweight='bold')
ax2.axvline(x=20, color='orange', linestyle='--', linewidth=2, alpha=0.7)
ax2.axvline(x=50, color='red', linestyle='--', linewidth=2, alpha=0.7)
ax2.invert_yaxis()

plt.tight_layout()
plt.savefig(os.path.join(OUTPUT_DIR, 'statistics_comparison.png'), dpi=150, bbox_inches='tight')
plt.show()
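
`Mean Diff %` divides by the real mean, so a feature whose real mean is close to zero can show a very large percentage even when the two distributions nearly coincide. A complementary measure, offered here as an option rather than something the notebook computes, is the standardized mean difference, which divides by the pooled standard deviation instead. The helper name and toy series below are illustrative assumptions.

```python
import numpy as np
import pandas as pd


def standardized_mean_diff(real, synth):
    """|mean(real) - mean(synth)| divided by the pooled standard deviation."""
    pooled_var = (real.var() + synth.var()) / 2.0
    if not np.isfinite(pooled_var) or pooled_var == 0:
        return float("nan")  # undefined for constant or too-short columns
    return float(abs(real.mean() - synth.mean()) / np.sqrt(pooled_var))


# Toy example: the real mean is 0, so a percentage of the mean is unstable.
real_toy = pd.Series([-1.0, 0.0, 1.0])
synth_toy = pd.Series([-0.9, 0.1, 1.1])

pct_diff = abs(real_toy.mean() - synth_toy.mean()) / (abs(real_toy.mean()) + 1e-10) * 100
smd = standardized_mean_diff(real_toy, synth_toy)
```

Here the percentage difference explodes because the denominator is effectively the `1e-10` guard, while the standardized difference stays on the scale of the data's spread.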

In [None]:
### 5.8 Synthetic Data Quality Scorecard & Radar Chart

# Calculate quality metrics
ks_pass_rate = passes / len(ks_results) * 100
mean_ks_stat = np.mean([v['statistic'] for v in ks_results.values()])
mean_mean_diff = stats_df['Mean Diff %'].mean()
mean_std_diff = stats_df['Std Diff %'].mean()

# Normalize metrics to 0-100 scale (higher is better)
distribution_score = max(0, 100 - mean_ks_stat * 200)  # KS of 0 = 100, KS of 0.5 = 0
correlation_score = max(0, 100 - mean_corr_diff * 200)  # Diff of 0 = 100, Diff of 0.5 = 0
mean_preservation = max(0, 100 - mean_mean_diff)  # Lower diff = higher score
variance_preservation = max(0, 100 - mean_std_diff)  # Lower diff = higher score
coverage_score = ks_pass_rate  # Percentage of features passing KS test

# Overall score (weighted average)
overall_score = (distribution_score * 0.3 + correlation_score * 0.2 + 
                 mean_preservation * 0.2 + variance_preservation * 0.2 + coverage_score * 0.1)

# Create figure with scorecard and radar chart
fig = plt.figure(figsize=(18, 8))

# Left side: Scorecard
ax1 = fig.add_subplot(121)
ax1.axis('off')

# Map a 0-100 score to a letter grade
def grade(score):
    return 'A' if score >= 70 else 'B' if score >= 50 else 'C' if score >= 30 else 'D'

# Create scorecard table
scorecard_data = [
    ['Metric', 'Value', 'Score', 'Grade'],
    ['KS Test Pass Rate', f'{ks_pass_rate:.1f}%', f'{coverage_score:.1f}', grade(coverage_score)],
    ['Mean KS Statistic', f'{mean_ks_stat:.3f}', f'{distribution_score:.1f}', grade(distribution_score)],
    ['Correlation Preservation', f'{mean_corr_diff:.3f}', f'{correlation_score:.1f}', grade(correlation_score)],
    ['Mean Value Preservation', f'{mean_mean_diff:.1f}%', f'{mean_preservation:.1f}', grade(mean_preservation)],
    ['Variance Preservation', f'{mean_std_diff:.1f}%', f'{variance_preservation:.1f}', grade(variance_preservation)],
    ['─' * 20, '─' * 10, '─' * 10, '─' * 5],
    ['OVERALL QUALITY', '', f'{overall_score:.1f}', grade(overall_score)],
]

table = ax1.table(cellText=scorecard_data, loc='center', cellLoc='center',
                  colWidths=[0.35, 0.2, 0.15, 0.1])
table.auto_set_font_size(False)
table.set_fontsize(11)
table.scale(1.2, 2)

# Style the table
for i in range(len(scorecard_data)):
    for j in range(4):
        cell = table[(i, j)]
        if i == 0:  # Header
            cell.set_facecolor('#34495e')
            cell.set_text_props(color='white', fontweight='bold')
        elif i == len(scorecard_data) - 1:  # Overall row
            cell.set_facecolor('#2ecc71' if overall_score >= 70 else '#f39c12' if overall_score >= 50 else '#e74c3c')
            cell.set_text_props(fontweight='bold')
        elif i == len(scorecard_data) - 2:  # Separator
            cell.set_facecolor('#ecf0f1')
        elif j == 3:  # Grade column
            grade = scorecard_data[i][3]
            if grade == 'A':
                cell.set_facecolor('#27ae60')
            elif grade == 'B':
                cell.set_facecolor('#f39c12')
            elif grade == 'C':
                cell.set_facecolor('#e67e22')
            else:
                cell.set_facecolor('#e74c3c')
            cell.set_text_props(color='white', fontweight='bold')

ax1.set_title('Synthetic Data Quality Scorecard', fontsize=16, fontweight='bold', pad=20)

# Right side: Radar chart
ax2 = fig.add_subplot(122, projection='polar')

# Radar chart data
categories = ['Distribution\nMatching', 'Correlation\nPreservation', 'Mean\nPreservation', 
              'Variance\nPreservation', 'KS Test\nPass Rate']
values = [distribution_score, correlation_score, mean_preservation, variance_preservation, coverage_score]
values += values[:1]  # Close the polygon

angles = np.linspace(0, 2 * np.pi, len(categories), endpoint=False).tolist()
angles += angles[:1]

ax2.plot(angles, values, 'o-', linewidth=2, color='#3498db', markersize=8)
ax2.fill(angles, values, alpha=0.25, color='#3498db')
ax2.set_xticks(angles[:-1])
ax2.set_xticklabels(categories, fontsize=10)
ax2.set_ylim(0, 100)
ax2.set_yticks([20, 40, 60, 80, 100])
ax2.set_yticklabels(['20', '40', '60', '80', '100'], fontsize=9)
ax2.set_title('Quality Metrics Radar Chart\n(Higher is Better)', fontsize=14, fontweight='bold', pad=20)

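# Add dashed reference rings at the grade thresholds (30, 50, 70).
# Plotting r = val over a full turn draws the ring directly in polar coordinates.
theta = np.linspace(0, 2 * np.pi, 200)
for val in [30, 50, 70]:
    ax2.plot(theta, np.full_like(theta, val), linestyle='--', linewidth=1,
             alpha=0.3, color='gray')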

plt.suptitle('SYNTHETIC DATA QUALITY ASSESSMENT', fontsize=18, fontweight='bold', y=1.02)
plt.tight_layout()
plt.savefig(os.path.join(OUTPUT_DIR, 'quality_scorecard.png'), dpi=150, bbox_inches='tight')
plt.show()

# Print summary
print("\n" + "="*70)
print("QUALITY ASSESSMENT SUMMARY")
print("="*70)
print(f"Overall Quality Score: {overall_score:.1f}/100 ({'Excellent' if overall_score >= 70 else 'Good' if overall_score >= 50 else 'Fair' if overall_score >= 30 else 'Poor'})")
print(f"Distribution Matching: {distribution_score:.1f}/100")
print(f"Correlation Preservation: {correlation_score:.1f}/100")
print(f"Statistical Fidelity: {(mean_preservation + variance_preservation)/2:.1f}/100")
print("="*70)
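
The grade thresholds (70/50/30) and the component weights (0.3, 0.2, 0.2, 0.2, 0.1) are written out inline several times in the scorecard cell. A minimal sketch of how they might be factored into helpers follows; `WEIGHTS`, `letter_grade`, and `overall_quality` are hypothetical names that do not exist in the notebook, and the example scores are made up for illustration.

```python
# Hypothetical helpers; thresholds and weights are copied from the scorecard cell.
WEIGHTS = {
    'distribution': 0.3,
    'correlation': 0.2,
    'mean': 0.2,
    'variance': 0.2,
    'coverage': 0.1,
}

def letter_grade(score):
    """Map a 0-100 score to the A/B/C/D grades used in the scorecard."""
    if score >= 70:
        return 'A'
    if score >= 50:
        return 'B'
    if score >= 30:
        return 'C'
    return 'D'

def overall_quality(scores, weights=WEIGHTS):
    """Weighted average of component scores, keyed like WEIGHTS."""
    return sum(scores[name] * w for name, w in weights.items())

# Illustrative component scores only, not results from this notebook.
example = {'distribution': 80.0, 'correlation': 60.0, 'mean': 50.0,
           'variance': 40.0, 'coverage': 30.0}
total = overall_quality(example)
print(f"{total:.1f} -> {letter_grade(total)}")
```

With helpers like these, each scorecard row could call `letter_grade(...)` instead of repeating the chained conditional, which keeps the thresholds in one place.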

In [None]:
### 5.9 Augmented Data Clustering Analysis

from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score, davies_bouldin_score

# Cluster the augmented data (real + synthetic combined)
print("="*70)
print("AUGMENTED DATA CLUSTERING ANALYSIS")
print("="*70)

# Prepare augmented data with labels
augmented_with_labels = augmented_numerical.copy()
augmented_with_labels['source'] = ['Real'] * len(train_df) + ['Synthetic'] * len(synthetic_features)

# Standardize for clustering
X_augmented = scaler.fit_transform(augmented_numerical.fillna(0))

# Perform clustering
n_clusters = 5
clustering = AgglomerativeClustering(n_clusters=n_clusters)
cluster_labels = clustering.fit_predict(X_augmented)

# Calculate metrics
silhouette = silhouette_score(X_augmented, cluster_labels)
davies_bouldin = davies_bouldin_score(X_augmented, cluster_labels)

print(f"\nClustering Results (n_clusters={n_clusters}):")
print(f"  Silhouette Score: {silhouette:.4f} (higher is better, range: -1 to 1)")
print(f"  Davies-Bouldin Index: {davies_bouldin:.4f} (lower is better)")

# Analyze cluster composition
augmented_with_labels['cluster'] = cluster_labels

# Create visualization
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Cluster composition (Real vs Synthetic distribution per cluster)
ax1 = axes[0]
cluster_composition = pd.crosstab(augmented_with_labels['cluster'], augmented_with_labels['source'], normalize='index') * 100
cluster_composition.plot(kind='bar', ax=ax1, color=['steelblue', 'coral'], edgecolor='black', alpha=0.8)
ax1.set_xlabel('Cluster', fontsize=12)
ax1.set_ylabel('Percentage (%)', fontsize=12)
ax1.set_title('Cluster Composition: Real vs Synthetic\n(Balanced = Good Integration)', fontsize=14, fontweight='bold')
ax1.legend(title='Source', fontsize=10)
ax1.set_xticklabels(ax1.get_xticklabels(), rotation=0)

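# Add cluster sizes as text. Extend the y-axis first so the labels at y=105 fall
# inside the axes (annotations anchored at a data point outside the axes are not drawn).
cluster_sizes = augmented_with_labels['cluster'].value_counts().sort_index()
ax1.set_ylim(0, 115)
for i, (idx, size) in enumerate(cluster_sizes.items()):
    ax1.annotate(f'n={size}', xy=(i, 105), ha='center', fontsize=9, fontweight='bold')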

# PCA visualization with clusters
ax2 = axes[1]
X_augmented_pca = pca.fit_transform(X_augmented)
scatter = ax2.scatter(X_augmented_pca[:, 0], X_augmented_pca[:, 1], 
                      c=cluster_labels, cmap='Set2', alpha=0.6, s=30, edgecolor='white', linewidth=0.5)

# Mark synthetic points with different marker
synthetic_mask = np.array([False] * len(train_df) + [True] * len(synthetic_features))
ax2.scatter(X_augmented_pca[synthetic_mask, 0], X_augmented_pca[synthetic_mask, 1], 
           c='none', edgecolor='red', s=50, linewidth=1.5, marker='o', label='Synthetic', alpha=0.5)

ax2.set_xlabel(f'PC1 ({pca.explained_variance_ratio_[0]:.1%} variance)', fontsize=12)
ax2.set_ylabel(f'PC2 ({pca.explained_variance_ratio_[1]:.1%} variance)', fontsize=12)
ax2.set_title(f'Augmented Data Clusters (PCA View)\nSilhouette: {silhouette:.3f}', fontsize=14, fontweight='bold')
ax2.legend(loc='upper right', fontsize=10)

plt.colorbar(scatter, ax=ax2, label='Cluster')
plt.suptitle('Augmented Data (Real + Synthetic) Clustering', fontsize=16, fontweight='bold', y=1.02)
plt.tight_layout()
plt.savefig(os.path.join(OUTPUT_DIR, 'augmented_clustering.png'), dpi=150, bbox_inches='tight')
plt.show()

# Print cluster composition summary
print(f"\nCluster Composition Summary:")
print(cluster_composition.round(1).to_string())
print(f"\nInterpretation: If synthetic data integrates well, each cluster should have")
print(f"roughly proportional representation ({100*len(synthetic_features)/len(augmented_numerical):.1f}% synthetic expected)")
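
The interpretation printed above compares each cluster's synthetic share with the overall synthetic share. As a rough sketch of that comparison in plain Python, the hypothetical helper `synthetic_share_by_cluster` below returns the per-cluster share and its gap from the expected value; the labels and sources in the example are toy data, not output from this notebook.

```python
from collections import Counter

def synthetic_share_by_cluster(cluster_labels, sources):
    """Return {cluster: (synthetic_pct, gap_from_expected_pct)}."""
    totals = Counter(cluster_labels)
    synth = Counter(c for c, s in zip(cluster_labels, sources) if s == 'Synthetic')
    # Expected share if synthetic rows were spread evenly across clusters.
    expected = 100.0 * sum(synth.values()) / len(sources)
    result = {}
    for cluster in sorted(totals):
        pct = 100.0 * synth[cluster] / totals[cluster]
        result[cluster] = (pct, pct - expected)
    return result

# Toy data (illustrative only): 10 rows, 4 of them synthetic, so 40% is expected.
labels = [0] * 5 + [1] * 5
sources = ['Real'] * 4 + ['Synthetic'] + ['Real'] * 2 + ['Synthetic'] * 3
shares = synthetic_share_by_cluster(labels, sources)
for cluster, (pct, gap) in shares.items():
    print(f"cluster {cluster}: {pct:.1f}% synthetic (gap {gap:+.1f} pts)")
```

A large positive or negative gap in any cluster would suggest that the synthetic rows concentrate in, or avoid, that region of feature space.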

## Clean Up

## Conclusion

### What Worked Well

1. **Ensemble Architecture for Tabular Data**
   - The combination of CTGAN, TVAE, and Gaussian Copula provided complementary strengths
   - TVAE showed the best individual performance (Mean KS: 0.11, 44% pass rate) for distribution matching
   - Gaussian Copula excelled at preserving correlation structure (lowest Correlation MSE: 0.0197)
   - **Feature-specific model selection**: The ensemble allows each model to contribute where it excels
   - Dynamic weight optimization improved results by shifting weight toward TVAE (~54%)

2. **Correlation Structure Preservation**
   - Mean absolute correlation difference of 0.165 indicates reasonable preservation
   - The correlation heatmaps show that the synthetic data maintains similar inter-feature relationships
   - PCA/t-SNE projections show that the synthetic data occupies a similar region of feature space to the real data

3. **Scalable Data Generation Pipeline**
   - Successfully generated 500 synthetic brand records with stratified company distribution
   - The pipeline supports conditional generation based on company characteristics
   - Model persistence to Google Drive enables iterative experimentation without retraining

4. **LLM Brand Name Generation**
   - 95.4% success rate for unique brand name generation
   - GPT-2 Medium and Flan-T5 ensemble provided diverse naming suggestions
   - Memory-efficient sequential loading enabled running on a standard GPU

5. **Comprehensive Quality Evaluation**
   - Multi-dimensional assessment covering distribution, correlation, and statistical fidelity
   - Visual diagnostics (histograms, QQ plots, heatmaps) enable intuitive quality assessment
   - Quantitative scorecards provide actionable metrics for improvement

### What Didn't Work Well

1. **Distribution Matching Challenges**
   - Only 29.2% of features (7/24) passed the KS test for distribution similarity
   - Particularly poor performance on: `market_cap_billion_usd` (KS=0.91), `employees` (KS=0.65), `greenwashing_factors` (KS=0.50-0.62)
   - Heavy-tailed distributions and sparse features remain difficult for all of the tabular synthesizers used here

2. **Clustering Quality Degradation**
   - Silhouette score dropped from 0.713 (original) to 0.682 (augmented)
   - Davies-Bouldin index increased from 0.387 to 0.421 (higher is worse)
   - Synthetic data may be introducing noise rather than enhancing cluster structure

3. **LLM Brand Name Quality Issues**
   - Some generated names are full sentences (e.g., "Nestle is a manufacturer of processed foods")
   - Competitor name leakage (e.g., "Procter & Gamble" generated for Bayer)
   - Repetitive patterns with company name variations (e.g., "Nestle Foods", "Nestle, Inc")
   - 4.6% fallback rate indicates generation failures

4. **Feature-specific Weaknesses**
   - Financial metrics (revenue, market cap) show high variance in synthetic data
   - Categorical encoding may lose semantic relationships
   - Binary features have limited diversity in synthetic samples

5. **Ensemble Aggregation Complexity**
   - Simple weighted averaging may not be optimal for combining diverse model outputs
   - The ensemble sometimes performed worse than individual models on certain metrics
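
The name-quality failures listed above (sentence-like outputs and competitor leakage) are the kind of problem a post-generation filter might catch. The sketch below is one possible rule-based check, not the notebook's implementation: `is_plausible_brand_name`, the four-word limit, and the verb list are all assumptions chosen for illustration, and "VitaLeaf" is an invented name.

```python
import re

# Assumed heuristics, not taken from the notebook.
MAX_WORDS = 4
SENTENCE_WORDS = {'is', 'are', 'was', 'were', 'has', 'have'}

def is_plausible_brand_name(name, competitors=()):
    """Reject empty, over-long, sentence-like, or competitor-matching names."""
    cleaned = name.strip()
    if not cleaned:
        return False
    words = re.findall(r"[A-Za-z0-9&'-]+", cleaned)
    if len(words) > MAX_WORDS:
        return False
    if any(w.lower() in SENTENCE_WORDS for w in words):
        return False
    lowered = cleaned.lower()
    # Substring match so variants such as "Procter & Gamble Co" are also caught.
    return not any(c.lower() in lowered for c in competitors)

candidates = ["Nestle is a manufacturer of processed foods",
              "Procter & Gamble", "VitaLeaf"]
kept = [n for n in candidates
        if is_plausible_brand_name(n, competitors=["Procter & Gamble"])]
print(kept)
```

Rules like these are deliberately conservative: they cannot judge whether a name is good, only whether it has the shape of a name at all.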

### Future Enhancements

1. **Improved Tabular Synthesis**
   - Implement feature-specific preprocessing (log transforms for heavy-tailed distributions)
   - Use conditional generation with explicit constraints for bounded features
   - Explore TabDDPM (diffusion models) as an alternative to GAN-based methods
   - Add post-processing validation to clip unrealistic values

2. **Enhanced LLM Brand Generation**
   - Fine-tune with negative examples to prevent competitor name generation
   - Implement stricter output validation and filtering
   - Use retrieval-augmented generation (RAG) with brand name databases
   - Add style conditioning for different brand naming conventions (descriptive, invented, founder-based)

3. **Better Evaluation Metrics**
   - Implement Machine Learning Efficacy tests (train on synthetic, test on real)
   - Add privacy metrics (nearest neighbor distance, membership inference)
   - Use domain-specific validity checks for brand attributes

4. **Hyperparameter Optimization**
   - Extend Optuna tuning to LLM generation parameters (temperature, top_p, top_k)
   - Implement multi-objective optimization balancing distribution matching and correlation preservation
   - Add early stopping based on validation metrics during training

5. **Architecture Improvements**
   - Implement attention-based tabular models (TabTransformer, FT-Transformer)
   - Use hierarchical generation: company → industry → brand attributes → brand name
   - Add discriminator-based filtering to reject low-quality synthetic samples
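
Two of the tabular-synthesis ideas above, log-transforming heavy-tailed features before training and clipping generated values afterwards, can be illustrated without any synthesizer. In this sketch the helper names and sample values are hypothetical, and the clipping bounds are taken from the observed range of the real column, which is only one of several reasonable choices.

```python
import math

def to_log_space(values):
    """Compress a non-negative, heavy-tailed column with log1p."""
    return [math.log1p(v) for v in values]

def from_log_space(values, lower, upper):
    """Invert log1p and clip each result to [lower, upper]."""
    return [min(max(math.expm1(v), lower), upper) for v in values]

# Illustrative values only (a market-cap-like column with a long right tail).
real = [0.5, 2.0, 15.0, 900.0]
logged = to_log_space(real)

# Pretend a synthesizer produced these values in log space,
# including one that would overshoot the real range.
generated_log = logged + [math.log1p(5000.0)]
restored = from_log_space(generated_log, lower=min(real), upper=max(real))
print([round(v, 3) for v in restored])
```

Training in log space narrows the gap between typical and extreme values, and the final clip keeps the back-transformed samples inside the range seen in the real data.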

## References & Bibliography

### Synthetic Data Generation

1. **CTGAN (Conditional Tabular GAN)**
   - Xu, L., Skoularidou, M., Cuesta-Infante, A., & Veeramachaneni, K. (2019). *Modeling Tabular Data using Conditional GAN*. NeurIPS 2019.
   - Paper: https://arxiv.org/abs/1907.00503
   - Implementation: [SDV Library](https://github.com/sdv-dev/SDV)

2. **TVAE (Tabular Variational Autoencoder)**
   - Xu, L., Skoularidou, M., Cuesta-Infante, A., & Veeramachaneni, K. (2019). *Modeling Tabular Data using Conditional GAN*. NeurIPS 2019.
   - Part of the same paper as CTGAN, presenting VAE-based alternative

3. **Gaussian Copula**
   - Patki, N., Wedge, R., & Veeramachaneni, K. (2016). *The Synthetic Data Vault*. IEEE DSAA 2016.
   - Paper: https://dai.lids.mit.edu/wp-content/uploads/2018/03/SDV.pdf

4. **SDV (Synthetic Data Vault) Library**
   - Documentation: https://docs.sdv.dev/sdv/
   - GitHub: https://github.com/sdv-dev/SDV

### Language Models for Text Generation

5. **GPT-2**
   - Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). *Language Models are Unsupervised Multitask Learners*. OpenAI.
   - Paper: https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf

6. **Flan-T5**
   - Chung, H. W., et al. (2022). *Scaling Instruction-Finetuned Language Models*. arXiv preprint.
   - Paper: https://arxiv.org/abs/2210.11416

7. **Hugging Face Transformers**
   - Wolf, T., et al. (2020). *Transformers: State-of-the-Art Natural Language Processing*. EMNLP 2020.
   - Documentation: https://huggingface.co/docs/transformers/

### Evaluation Metrics

8. **Kolmogorov-Smirnov Test**
   - Massey Jr, F. J. (1951). *The Kolmogorov-Smirnov test for goodness of fit*. Journal of the American Statistical Association, 46(253), 68-78.

9. **Silhouette Score**
   - Rousseeuw, P. J. (1987). *Silhouettes: a graphical aid to the interpretation and validation of cluster analysis*. Journal of Computational and Applied Mathematics, 20, 53-65.

10. **Davies-Bouldin Index**
    - Davies, D. L., & Bouldin, D. W. (1979). *A cluster separation measure*. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-1(2), 224-227.

### Hyperparameter Optimization

11. **Optuna**
    - Akiba, T., Sano, S., Yanase, T., Ohta, T., & Koyama, M. (2019). *Optuna: A Next-generation Hyperparameter Optimization Framework*. KDD 2019.
    - Paper: https://arxiv.org/abs/1907.10902
    - Documentation: https://optuna.org/

### Related Work on Tabular Data Synthesis

12. **TabDDPM (Diffusion Models for Tabular Data)**
    - Kotelnikov, A., Baranchuk, D., Rubachev, I., & Babenko, A. (2023). *TabDDPM: Modelling Tabular Data with Diffusion Models*. ICML 2023.
    - Paper: https://arxiv.org/abs/2209.15421

13. **CTAB-GAN+**
    - Zhao, Z., Kunar, A., Birke, R., & Chen, L. Y. (2022). *CTAB-GAN+: Enhancing Tabular Data Synthesis*. arXiv preprint.
    - Paper: https://arxiv.org/abs/2204.00401

14. **Synthetic Data Generation Survey**
    - Jordon, J., et al. (2022). *Synthetic Data - what, why and how?* arXiv preprint.
    - Paper: https://arxiv.org/abs/2205.03257

### Software & Tools

- **Python**: https://www.python.org/
- **PyTorch**: https://pytorch.org/
- **Pandas**: https://pandas.pydata.org/
- **Scikit-learn**: https://scikit-learn.org/
- **Matplotlib**: https://matplotlib.org/
- **Seaborn**: https://seaborn.pydata.org/
- **Google Colab**: https://colab.research.google.com/

In [None]:
# Clear GPU memory
gc.collect()
if torch.cuda.is_available():
    torch.cuda.empty_cache()
print("GPU memory cleared")