# LLM-based TTS: Main Training and Inference Notebook

This notebook orchestrates training of a GPT-2-based text-to-speech (TTS) model and provides utilities for running inference from trained checkpoints. Configuration, dataset handling, model definition, training procedures, and utility functions live in Python modules under the `src/` directory.

**Assumed Project Structure:**
```
ProjectRoot/
├── notebooks/  <-- This Jupyter notebook (llm_tts_13.ipynb)
├── src/        <-- Python modules (config.py, model.py, etc.)
├── data/         <-- LJSpeech dataset (or other configured dataset)
└── training_runs/  <-- Output directory for training artifacts (checkpoints, logs, samples)
```
**Setup Steps:**
1. Ensure the project structure outlined above is in place.
2. Verify that the dataset path (e.g., LJSpeech) is correctly specified in `src/config.py` (variable `DATA_DIR`).
3. Install all necessary Python packages (e.g., `torch`, `torchaudio`, `transformers`, `pandas`, `scikit-learn`, `matplotlib`, `soundfile`, `tqdm`).

## 1. Environment Setup

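This optional cell sets `CUDA_LAUNCH_BLOCKING=1`, an environment variable useful for CUDA debugging. It makes CUDA kernel launches synchronous, so an error is reported at the call that caused it rather than at some later, unrelated operation. Set it before CUDA is initialized, in practice before running any cell that touches the GPU. Both lines are commented out by default because synchronous launches slow training down.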

In [8]:
# %env CUDA_LAUNCH_BLOCKING=1
# import os
# os.environ['CUDA_LAUNCH_BLOCKING'] = "1" # Alternative method
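A minimal sketch of enabling the flag safely, assuming a fresh kernel. The check against `sys.modules` is only a heuristic added here for illustration: it warns when `torch` has already been imported, since the variable may be ignored once CUDA has been initialized.

```python
import os
import sys

# Heuristic (an assumption of this sketch): if torch is already imported,
# CUDA may already be initialized and the flag may have no effect.
if "torch" in sys.modules:
    print("Warning: torch is already imported; restart the kernel before setting the flag.")

os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
print("CUDA_LAUNCH_BLOCKING =", os.environ["CUDA_LAUNCH_BLOCKING"])
```

Remove the variable (or restart the kernel without it) once the failing call has been found.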

## 2. Import Necessary Modules from `src`

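This cell adds the `src` directory to the Python module search path so its modules can be imported directly. It assumes `src` is a sibling of the `notebooks` directory and that the notebook's working directory is `notebooks/`. It then attempts to import the two modules needed for training, `config` and `train`, and records the result in `_train_modules_loaded`.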

In [9]:
import sys
import os

# Adjust Python path to locate the 'src' directory.
# Assumes this notebook is in 'ProjectRoot/notebooks/'.
module_path = os.path.abspath(os.path.join(os.getcwd(), '..', 'src'))
if module_path not in sys.path:
    sys.path.insert(0, module_path) # Insert at beginning to prioritize 'src' modules.
    print(f"Added to sys.path: {module_path}")
else:
    print(f"{module_path} already in sys.path")

# Attempt to import core modules for training.
try:
    import config  # Configuration settings from src.config
    import train   # Main training script from src.train
    print("Successfully imported modules from src.")
    _train_modules_loaded = True
except ImportError as e:
    print(f"Error importing modules from src: {e}")
    print("Ensure 'src' directory is correctly structured and accessible.")
    print(f"Current working directory: {os.getcwd()}")
    print(f"sys.path: {sys.path}")
    _train_modules_loaded = False

/home/denis/kpi/ucu_audio_processing_labs/GPT2TTS/src already in sys.path
Successfully imported modules from src.


## 3. Run Model Training

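This cell runs the main training function, `main_train`, from the `src.train` module.
Before running it, check that the settings in `src/config.py` (dataset paths, hyperparameters, etc.) are what you want.
The cell temporarily changes the working directory to `src/` so that relative paths inside the training code resolve, then restores it even if training fails.
Training outputs, including checkpoints and logs, are saved to the directory specified in `config.OUTPUT_DIR`.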
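The run logged below reports 12445 training samples, 389 total steps, 200 warmup steps, and a learning rate of `0.00e+00` at the end of the epoch. The sketch below reproduces that arithmetic under two assumptions that the log does not state: a batch size of 32 (the value that yields 389 steps) and the common schedule shape of linear warmup followed by linear decay to zero. The function name is hypothetical.

```python
import math

def linear_warmup_decay_factor(step: int, warmup_steps: int, total_steps: int) -> float:
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

train_samples = 12445      # from the training log
assumed_batch_size = 32    # assumption: not printed in the log
warmup_steps = 200         # from the training log
total_steps = math.ceil(train_samples / assumed_batch_size)

print("steps per epoch:", total_steps)
for step in (0, 100, 200, 300, total_steps):
    factor = linear_warmup_decay_factor(step, warmup_steps, total_steps)
    print(f"step {step:3d}: LR factor {factor:.3f}")
```

Under these assumptions, more than half of the single epoch is spent in warmup and the rate has decayed to zero by the last step, which matches the `LR=0.00e+00` shown in the progress bar.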

In [None]:
# Execute the main training script if modules were loaded.
if __name__ == '__main__' and _train_modules_loaded:
    print(f"Starting training run. Output will be in: {os.path.abspath(config.OUTPUT_DIR)}")
    
    # Change Current Working Directory (CWD) to 'src' to ensure 
    # relative paths within the training script work correctly.
    original_cwd = os.getcwd()
    try:
        os.chdir(module_path) # module_path points to 'src/'
        print(f"Changed CWD to: {os.getcwd()} for running train.main_train()")
        train.main_train()
    except Exception as e_train:
        print(f"Error during train.main_train(): {e_train}")
        import traceback
        traceback.print_exc()
    finally:
        os.chdir(original_cwd) # Restore original CWD.
        print(f"Restored CWD to: {os.getcwd()}")

elif not _train_modules_loaded:
    print("Module 'train' or 'config' not loaded. Cannot start training.")

Starting training run. Output will be in: /home/denis/kpi/ucu_audio_processing_labs/GPT2TTS/training_runs/training_run_20250519_182855
Changed CWD to: /home/denis/kpi/ucu_audio_processing_labs/GPT2TTS/src for running train.main_train()
Created output directories for RUN_ID: 20250519_182855 in /home/denis/kpi/ucu_audio_processing_labs/GPT2TTS/training_runs/training_run_20250519_182855
Using device: cuda


Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.


Preparing DataLoaders...
Using 13100 entries from the dataset.
Split dataset: 12445 training samples, 655 validation samples.
Initializing GPT2TTS model...
GPT2TTS initialized. Audio vocab size: 1026 (Codebook: 1024, Special: 2)
Expected num_audio_codebooks (Nq): 32
BOS Audio ID: 1024, EOS Audio ID: 1025
Loss function using label smoothing: 0.1
Total number of training steps: 389
Using linear LR scheduler with 200 warmup steps over 389 total steps.
Starting with curriculum learning: initial audio token length = 1024

--- Starting Training for 1 Epochs ---


Epoch 1/1:   0%|          | 0/1 [00:00<?, ?epoch/s]

Curriculum: Max audio token length for epoch 1 is 1024


Epoch 1/1:   0%|          | 0/1 [3:41:32<?, ?epoch/s, Train Loss=2.7289, LR=0.00e+00]

Epoch 1 Training Summary: Avg Loss: 2.7289 (389/389 batches processed)


Epoch 1/1:   0%|          | 0/1 [3:47:51<?, ?epoch/s, Train Loss=2.7289, Val Loss=1.0705, LR=0.00e+00]

Epoch 1 Validation Summary: Avg Loss: 1.0705


Epoch 1/1:   0%|          | 0/1 [3:47:53<?, ?epoch/s, Train Loss=2.7289, LR=0.00e+00, Epoch Time=3:47:53, Val Loss=1.0705]
/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [0,0,0], thread: [0,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [0,0,0], thread: [6,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [0,0,0], thread: [7,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [0,0,0], thread: [8,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.


Saved new best model (val_loss: 1.0705) to ../training_runs/training_run_20250519_182855/checkpoints/best_val_loss_model.pth

--- Logging audio samples for epoch 1 ---
Generating audio for: 'Hello world, this is a speech synthesis test.'
Generating with: max_new_tokens=512, temp=0.75, top_p=0.90, repetition_penalty=1.20, suppress_tokens=[1024]


Epoch 1/1:   0%|          | 0/1 [3:47:53<?, ?epoch/s, Train Loss=2.7289, LR=0.00e+00, Epoch Time=3:47:53, Val Loss=1.0705]

Error during train.main_train(): CUDA error: device-side assert triggered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Restored CWD to: /home/denis/kpi/ucu_audio_processing_labs/GPT2TTS/notebooks
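The assertion above comes from a gather/scatter kernel that received an index outside the dimension being indexed, and it fired during sample generation after the checkpoint was saved. The log does not say which tensor was at fault. One thing worth ruling out (an assumption, not something the log confirms) is that token IDs larger than the 1026-entry audio vocabulary reach an operation indexed by that vocabulary. A CPU-side check like the sketch below reports the offending positions directly; `find_out_of_range_ids` is a hypothetical helper, not part of `src`.

```python
from typing import Iterable, List, Tuple

def find_out_of_range_ids(token_ids: Iterable[int], vocab_size: int) -> List[Tuple[int, int]]:
    """Return (position, token_id) pairs that cannot index a dimension of size vocab_size."""
    return [(pos, tid) for pos, tid in enumerate(token_ids) if not 0 <= tid < vocab_size]

audio_vocab_size = 1026  # from the log: 1024 codebook entries + BOS (1024) + EOS (1025)

# Hypothetical sequence mixing audio IDs with one GPT-2 text ID (50256).
sequence = [1024, 611, 376, 50256, 1025]
bad = find_out_of_range_ids(sequence, audio_vocab_size)
print("out-of-range entries:", bad)
```

Rerunning the failing step with `CUDA_LAUNCH_BLOCKING=1` (section 1) or on the CPU should also produce a traceback that points at the exact call.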





## 4. Setup for Inference and Evaluation

Imports the modules required for inference and evaluation, including the `evaluation` module. The `[Placeholder ...]` lines in the output are printed while the imports run, before the success message.

In [12]:
import torch
from IPython.display import Audio as IPyAudio, display
import numpy as np # For placeholder metrics

# Ensure src is in path (might be redundant if cell 2 ran).
import sys
import os
notebook_dir_inf = os.getcwd()
src_path_inf = os.path.abspath(os.path.join(notebook_dir_inf, '..', 'src'))
if src_path_inf not in sys.path:
    sys.path.insert(0, src_path_inf)
    # print(f"Added to sys.path for inference/eval: {src_path_inf}") # Less verbose

try:
    import config as inference_config 
    from model import GPT2TTS 
    from utils import generate_speech # Still used by run_inference_from_checkpoint if kept
    from transformers import GPT2Tokenizer
    import evaluation # Import the new evaluation module
    print("Successfully imported modules for inference and evaluation.")
    _inference_eval_modules_loaded = True
except ImportError as e:
    print(f"Error importing inference/evaluation modules: {e}")
    _inference_eval_modules_loaded = False

[Placeholder CER] Hyp: 'test...' Ref: 'test...'
[Placeholder UTMOS] Waveform shape: (1,), SR: 16000
[Placeholder SECS] Gen wave: (1,), Ref wave: (1,), SR: 16000
Successfully imported modules for inference and evaluation.


## 5. Single Inference Example

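Loads a checkpoint from the most recent `training_run_*` directory, preferring `best_val_loss_model.pth` and falling back to `latest_model_checkpoint.pth`, then generates speech for a single test sentence. The helper function is defined in this notebook for quick tests.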

In [13]:
# This function remains in the notebook for quick, single inference tests.
def run_inference_from_checkpoint(checkpoint_path, text_to_synthesize, custom_gen_params=None):
    """Loads model from checkpoint, generates speech for one text."""
    if not _inference_eval_modules_loaded:
        print("Inference modules not loaded. Cannot run.")
        return None # Return None to indicate failure
    
    print(f"Loading model for single inference from: {checkpoint_path}")
    device = inference_config.DEVICE
    gpt2_tokenizer_inf = GPT2Tokenizer.from_pretrained(inference_config.GPT2_MODEL_NAME)
    if gpt2_tokenizer_inf.pad_token is None:
        gpt2_tokenizer_inf.pad_token = gpt2_tokenizer_inf.eos_token
        gpt2_tokenizer_inf.pad_token_id = gpt2_tokenizer_inf.eos_token_id
    
    inference_model = GPT2TTS(tokenizer=gpt2_tokenizer_inf).to(device) 
    try:
        abs_checkpoint_path = os.path.abspath(checkpoint_path) 
        state_dict = torch.load(abs_checkpoint_path, map_location=device)
        if all(k.startswith('module.') for k in state_dict.keys()): 
            state_dict = {k[len('module.'):]: v for k, v in state_dict.items()}
        inference_model.load_state_dict(state_dict)
        print("Model checkpoint loaded successfully for single inference.")
    except Exception as e:
        print(f"Error loading checkpoint: {e}")
        return None

    inference_model.eval()
    waveform = generate_speech(inference_model, text_to_synthesize, device, generation_params=custom_gen_params)
    if waveform is not None and waveform.size > 0:
        print("Generated waveform shape:", waveform.shape)
        display(IPyAudio(waveform, rate=inference_config.TARGET_SAMPLE_RATE))
        return waveform # Return waveform for potential use in metrics
    else:
        print("Single inference speech generation failed.")
        return None

if __name__ == '__main__' and _inference_eval_modules_loaded: 
    output_base_abs = os.path.abspath(os.path.join(os.getcwd(), '..', 'training_runs'))
    latest_run_dir = None
    if os.path.exists(output_base_abs):
        all_runs = sorted([d for d in os.listdir(output_base_abs) if os.path.isdir(os.path.join(output_base_abs, d)) and d.startswith('training_run_')])
        if all_runs: latest_run_dir = all_runs[-1]
    
    checkpoint_to_load_single = None
    if latest_run_dir:
        best_chk = os.path.join(output_base_abs, latest_run_dir, "checkpoints", "best_val_loss_model.pth")
        latest_chk = os.path.join(output_base_abs, latest_run_dir, "checkpoints", "latest_model_checkpoint.pth")
        if os.path.exists(best_chk): checkpoint_to_load_single = best_chk
        elif os.path.exists(latest_chk): checkpoint_to_load_single = latest_chk
    
    if checkpoint_to_load_single:
        print(f"Single inference using: {checkpoint_to_load_single}")
        _ = run_inference_from_checkpoint(checkpoint_to_load_single, "A quick test of the system.")
    else:
        print("No checkpoint found for single inference example.")

Single inference using: /home/denis/kpi/ucu_audio_processing_labs/GPT2TTS/training_runs/training_run_20250519_182855/checkpoints/best_val_loss_model.pth
Loading model for single inference from: /home/denis/kpi/ucu_audio_processing_labs/GPT2TTS/training_runs/training_run_20250519_182855/checkpoints/best_val_loss_model.pth
GPT2TTS initialized. Audio vocab size: 1026 (Codebook: 1024, Special: 2)
Expected num_audio_codebooks (Nq): 32
BOS Audio ID: 1024, EOS Audio ID: 1025
Loss function using label smoothing: 0.1
GPT2Model initialized with derived config: vocab_size=50257, pad_token_id=50256, name_or_path='gpt2'
Model checkpoint loaded successfully for single inference.
Generating with: max_new_tokens=512, temp=0.75, top_p=0.90, suppress_tokens=[1024]

--- Generated Token Inspection for: 'A quick test of the system....' ---
Sliced generated_audio_tokens_with_eos (first 50): [611, 611, 611, 611, 611, 611, 611, 611, 611, 611, 611, 611, 611, 611, 611, 611, 611, 611, 611, 611, 611, 611, 611, 61
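The checkpoint lookup in the cell above is repeated almost verbatim in section 6.1. A sketch of the same rule as a reusable helper follows, using `pathlib`; `find_checkpoint` is a hypothetical name and not a function from `src`. Like the original, it inspects only the newest run directory (run names sort chronologically because they embed a timestamp) and prefers the best-validation checkpoint.

```python
from pathlib import Path
from typing import Optional

def find_checkpoint(runs_root: Path) -> Optional[Path]:
    """Pick a checkpoint from the newest 'training_run_*' directory, preferring the best model."""
    if not runs_root.is_dir():
        return None
    runs = sorted(
        (d for d in runs_root.iterdir() if d.is_dir() and d.name.startswith("training_run_")),
        key=lambda d: d.name,
    )
    if not runs:
        return None
    checkpoint_dir = runs[-1] / "checkpoints"
    for name in ("best_val_loss_model.pth", "latest_model_checkpoint.pth"):
        candidate = checkpoint_dir / name
        if candidate.exists():
            return candidate
    return None

# Assumes the notebook's working directory is ProjectRoot/notebooks/.
print(find_checkpoint(Path.cwd().parent / "training_runs"))
```

The helper returns `None` when nothing is found, so callers can print a message and skip inference instead of raising.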

## 6. Advanced Inference and Evaluation (Using `evaluation` module)

This section utilizes the `src.evaluation` module for:
1. Running inference with different sampling strategies.
2. Demonstrating placeholder calls for CER, UTMOS, and SECS metrics.

### 6.1 Inference with Different Sampling Strategies

Uses `evaluation.perform_varied_inference` to generate speech with various sampling configurations.
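To make the strategy parameters concrete, here is a toy sketch of how `temperature` and `top_p` reshape a next-token distribution. It is a generic illustration in plain Python, not the code path used by `generate_speech`, and library implementations may treat the `top_p` boundary slightly differently. The logits are made up.

```python
import math
from typing import Dict, List

def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs: List[float], top_p: float) -> Dict[int, float]:
    """Keep the smallest set of most likely tokens whose mass reaches top_p, renormalized."""
    ranked = sorted(enumerate(probs), key=lambda item: item[1], reverse=True)
    kept, cumulative = [], 0.0
    for index, prob in ranked:
        kept.append((index, prob))
        cumulative += prob
        if cumulative >= top_p:
            break
    mass = sum(prob for _, prob in kept)
    return {index: prob / mass for index, prob in kept}

toy_logits = [4.0, 2.0, 1.0, 0.5]  # hypothetical logits for a 4-token vocabulary
for temperature in (0.5, 0.9):
    probs = softmax_with_temperature(toy_logits, temperature)
    print(temperature, [round(p, 3) for p in probs], top_p_filter(probs, top_p=0.9))
```

A lower temperature concentrates probability on the top token, so fewer candidates survive the `top_p` cut; a higher temperature spreads the mass and keeps more candidates.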

In [14]:
if __name__ == '__main__' and _inference_eval_modules_loaded:
    # --- Setup for varied inference (similar to cell 5, but for evaluation module) ---
    output_base_abs_hw = os.path.abspath(os.path.join(os.getcwd(), '..', 'training_runs'))
    latest_run_dir_hw = None
    if os.path.exists(output_base_abs_hw):
        all_runs_hw = sorted([d for d in os.listdir(output_base_abs_hw) if os.path.isdir(os.path.join(output_base_abs_hw, d)) and d.startswith('training_run_')])
        if all_runs_hw: latest_run_dir_hw = all_runs_hw[-1]

    checkpoint_to_load_hw = None
    loaded_model_for_varied_inf = None # Store the loaded model

    if latest_run_dir_hw:
        best_chk_hw = os.path.join(output_base_abs_hw, latest_run_dir_hw, "checkpoints", "best_val_loss_model.pth")
        latest_chk_hw = os.path.join(output_base_abs_hw, latest_run_dir_hw, "checkpoints", "latest_model_checkpoint.pth")
        if os.path.exists(best_chk_hw): checkpoint_to_load_hw = best_chk_hw
        elif os.path.exists(latest_chk_hw): checkpoint_to_load_hw = latest_chk_hw
    
    if checkpoint_to_load_hw:
        print(f"Homework: Loading checkpoint for varied inference: {checkpoint_to_load_hw}")
        # Load the model once for all varied inferences
        hw_tokenizer = GPT2Tokenizer.from_pretrained(inference_config.GPT2_MODEL_NAME)
        if hw_tokenizer.pad_token is None: 
            hw_tokenizer.pad_token = hw_tokenizer.eos_token
            hw_tokenizer.pad_token_id = hw_tokenizer.eos_token_id
        loaded_model_for_varied_inf = GPT2TTS(tokenizer=hw_tokenizer).to(inference_config.DEVICE)
        try:
            state_dict_hw = torch.load(checkpoint_to_load_hw, map_location=inference_config.DEVICE)
            if all(k.startswith('module.') for k in state_dict_hw.keys()):
                state_dict_hw = {k[len('module.'):]: v for k, v in state_dict_hw.items()}
            loaded_model_for_varied_inf.load_state_dict(state_dict_hw)
            loaded_model_for_varied_inf.eval()
            print("Model loaded successfully for varied inference.")
        except Exception as e_load_hw:
            print(f"Error loading model for varied inference: {e_load_hw}")
            loaded_model_for_varied_inf = None # Ensure model is None if loading failed
    
    if loaded_model_for_varied_inf:
        sample_texts_hw = [
            "This is the first sentence for our sampling strategy test.",
            "Can this model generate long and coherent speech without degrading?",
            "A short phrase."
        ]
        sampling_strategies_hw = {
            "Default (from utils)": None,
            "Low Temp": {"temperature": 0.5, "top_p": 0.9, "repetition_penalty": 1.0, "do_sample": True},
            "High Temp": {"temperature": 0.9, "top_p": 0.95, "repetition_penalty": 1.0, "do_sample": True},
            "Top-K Only": {"temperature": 0.8, "top_k": 50, "repetition_penalty": 1.0, "do_sample": True}
        }
        # Call the function from the evaluation module
        evaluation.perform_varied_inference(
            model=loaded_model_for_varied_inf,
            device=inference_config.DEVICE,
            checkpoint_path=checkpoint_to_load_hw, # Pass for context
            texts_to_synthesize=sample_texts_hw,
            sampling_strategies=sampling_strategies_hw,
            target_sample_rate=inference_config.TARGET_SAMPLE_RATE
        )
    else:
        print("Homework: Model not loaded. Skipping varied inference.")

elif not _inference_eval_modules_loaded:
    print("Cannot run varied inference due to module import failures.")

Homework: Loading checkpoint for varied inference: /home/denis/kpi/ucu_audio_processing_labs/GPT2TTS/training_runs/training_run_20250519_182855/checkpoints/best_val_loss_model.pth
GPT2TTS initialized. Audio vocab size: 1026 (Codebook: 1024, Special: 2)
Expected num_audio_codebooks (Nq): 32
BOS Audio ID: 1024, EOS Audio ID: 1025
Loss function using label smoothing: 0.1
GPT2Model initialized with derived config: vocab_size=50257, pad_token_id=50256, name_or_path='gpt2'
Model loaded successfully for varied inference.

--- Performing Varied Inference using checkpoint: /home/denis/kpi/ucu_audio_processing_labs/GPT2TTS/training_runs/training_run_20250519_182855/checkpoints/best_val_loss_model.pth ---

--- Synthesizing for text: 'This is the first sentence for our sampling strategy test.' ---

Strategy: Default (from utils)
Generating with: max_new_tokens=512, temp=0.75, top_p=0.90, suppress_tokens=[1024]

--- Generated Token Inspection for: 'This is the first sentence for...' ---
Sliced gene


Strategy: Low Temp
Params: {'temperature': 0.5, 'top_p': 0.9, 'repetition_penalty': 1.0, 'do_sample': True, 'max_new_tokens': 512}
Generating with: max_new_tokens=512, temp=0.50, top_p=0.90, suppress_tokens=[1024]

--- Generated Token Inspection for: 'This is the first sentence for...' ---
Sliced generated_audio_tokens_with_eos (first 50): [376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376]
Number of raw generated audio tokens (after prompt): 512
Cleaned audio tokens (first 50): [376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376]
Number of cleaned audio tokens: 512
Most common generated tokens:


Strategy: High Temp
Params: {'temperature': 0.9, 'top_p': 0.95, 'repetition_penalty': 1.0, 'do_sample': True, 'max_new_tokens': 512}
Generating with: max_new_tokens=512, temp=0.90, top_p=0.95, suppress_tokens=[1024]

--- Generated Token Inspection for: 'This is the first sentence for...' ---
Sliced generated_audio_tokens_with_eos (first 50): [84, 84, 84, 84, 84, 84, 84, 84, 84, 84, 84, 84, 84, 84, 84, 84, 84, 84, 84, 84, 84, 84, 84, 84, 84, 84, 84, 84, 84, 84, 84, 84, 84, 84, 84, 84, 84, 84, 84, 84, 84, 84, 84, 84, 84, 84, 84, 84, 84, 84]
Number of raw generated audio tokens (after prompt): 512
Cleaned audio tokens (first 50): [84, 84, 84, 84, 84, 84, 84, 84, 84, 84, 84, 84, 84, 84, 84, 84, 84, 84, 84, 84, 84, 84, 84, 84, 84, 84, 84, 84, 84, 84, 84, 84, 84, 84, 84, 84, 84, 84, 84, 84, 84, 84, 84, 84, 84, 84, 84, 84, 84, 84]
Number of cleaned audio tokens: 512
Most common generated tokens: {84: 512}
Number of unique tokens generated: 1
--- End Token Inspection ---

Generated waveform s


Strategy: Top-K Only
Params: {'temperature': 0.8, 'top_k': 50, 'repetition_penalty': 1.0, 'do_sample': True, 'max_new_tokens': 512}
Generating with: max_new_tokens=512, temp=0.80, top_p=1.00, suppress_tokens=[1024]

--- Generated Token Inspection for: 'This is the first sentence for...' ---
Sliced generated_audio_tokens_with_eos (first 50): [946, 946, 946, 946, 946, 946, 946, 946, 946, 946, 946, 946, 946, 946, 946, 946, 946, 946, 946, 946, 946, 946, 946, 946, 946, 946, 946, 946, 946, 946, 946, 946, 946, 946, 946, 946, 946, 946, 946, 946, 946, 946, 946, 946, 946, 946, 946, 946, 946, 946]
Number of raw generated audio tokens (after prompt): 512
Cleaned audio tokens (first 50): [946, 946, 946, 946, 946, 946, 946, 946, 946, 946, 946, 946, 946, 946, 946, 946, 946, 946, 946, 946, 946, 946, 946, 946, 946, 946, 946, 946, 946, 946, 946, 946, 946, 946, 946, 946, 946, 946, 946, 946, 946, 946, 946, 946, 946, 946, 946, 946, 946, 946]
Number of cleaned audio tokens: 512
Most common generated tokens


--- Synthesizing for text: 'Can this model generate long and coherent speech without degrading?' ---

Strategy: Default (from utils)
Generating with: max_new_tokens=512, temp=0.75, top_p=0.90, suppress_tokens=[1024]

--- Generated Token Inspection for: 'Can this model generate long a...' ---
Sliced generated_audio_tokens_with_eos (first 50): [717, 717, 717, 717, 717, 717, 717, 717, 717, 717, 717, 717, 717, 717, 717, 717, 717, 717, 717, 717, 717, 717, 717, 717, 717, 717, 717, 717, 717, 717, 717, 717, 717, 717, 717, 717, 717, 717, 717, 717, 717, 717, 717, 717, 717, 717, 717, 717, 717, 717]
Number of raw generated audio tokens (after prompt): 512
Cleaned audio tokens (first 50): [717, 717, 717, 717, 717, 717, 717, 717, 717, 717, 717, 717, 717, 717, 717, 717, 717, 717, 717, 717, 717, 717, 717, 717, 717, 717, 717, 717, 717, 717, 717, 717, 717, 717, 717, 717, 717, 717, 717, 717, 717, 717, 717, 717, 717, 717, 717, 717, 717, 717]
Number of cleaned audio tokens: 512
Most common generated token


Strategy: Low Temp
Params: {'temperature': 0.5, 'top_p': 0.9, 'repetition_penalty': 1.0, 'do_sample': True, 'max_new_tokens': 512}
Generating with: max_new_tokens=512, temp=0.50, top_p=0.90, suppress_tokens=[1024]

--- Generated Token Inspection for: 'Can this model generate long a...' ---
Sliced generated_audio_tokens_with_eos (first 50): [806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806]
Number of raw generated audio tokens (after prompt): 512
Cleaned audio tokens (first 50): [806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806]
Number of cleaned audio tokens: 512
Most common generated tokens:


Strategy: High Temp
Params: {'temperature': 0.9, 'top_p': 0.95, 'repetition_penalty': 1.0, 'do_sample': True, 'max_new_tokens': 512}
Generating with: max_new_tokens=512, temp=0.90, top_p=0.95, suppress_tokens=[1024]

--- Generated Token Inspection for: 'Can this model generate long a...' ---
Sliced generated_audio_tokens_with_eos (first 50): [794, 794, 794, 794, 794, 794, 794, 794, 794, 794, 794, 794, 794, 794, 794, 794, 794, 794, 794, 794, 794, 794, 794, 794, 794, 794, 794, 794, 794, 794, 794, 794, 794, 794, 794, 794, 794, 794, 794, 794, 794, 794, 794, 794, 794, 794, 794, 794, 794, 794]
Number of raw generated audio tokens (after prompt): 512
Cleaned audio tokens (first 50): [794, 794, 794, 794, 794, 794, 794, 794, 794, 794, 794, 794, 794, 794, 794, 794, 794, 794, 794, 794, 794, 794, 794, 794, 794, 794, 794, 794, 794, 794, 794, 794, 794, 794, 794, 794, 794, 794, 794, 794, 794, 794, 794, 794, 794, 794, 794, 794, 794, 794]
Number of cleaned audio tokens: 512
Most common generated token


Strategy: Top-K Only
Params: {'temperature': 0.8, 'top_k': 50, 'repetition_penalty': 1.0, 'do_sample': True, 'max_new_tokens': 512}
Generating with: max_new_tokens=512, temp=0.80, top_p=1.00, suppress_tokens=[1024]

--- Generated Token Inspection for: 'Can this model generate long a...' ---
Sliced generated_audio_tokens_with_eos (first 50): [376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376]
Number of raw generated audio tokens (after prompt): 512
Cleaned audio tokens (first 50): [376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376, 376]
Number of cleaned audio tokens: 512
Most common generated tokens


--- Synthesizing for text: 'A short phrase.' ---

Strategy: Default (from utils)
Generating with: max_new_tokens=512, temp=0.75, top_p=0.90, suppress_tokens=[1024]

--- Generated Token Inspection for: 'A short phrase....' ---
Sliced generated_audio_tokens_with_eos (first 50): [806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806]
Number of raw generated audio tokens (after prompt): 512
Cleaned audio tokens (first 50): [806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806, 806]
Number of cleaned audio tokens: 512
Most common generated tokens: {806: 512}
Number of unique tokens generated: 1
--- End Token In


Strategy: Low Temp
Params: {'temperature': 0.5, 'top_p': 0.9, 'repetition_penalty': 1.0, 'do_sample': True, 'max_new_tokens': 512}
Generating with: max_new_tokens=512, temp=0.50, top_p=0.90, suppress_tokens=[1024]

--- Generated Token Inspection for: 'A short phrase....' ---
Sliced generated_audio_tokens_with_eos (first 50): [940, 940, 940, 940, 940, 940, 940, 940, 940, 940, 940, 940, 940, 940, 940, 940, 940, 940, 940, 940, 940, 940, 940, 940, 940, 940, 940, 940, 940, 940, 940, 940, 940, 940, 940, 940, 940, 940, 940, 940, 940, 940, 940, 940, 940, 940, 940, 940, 940, 940]
Number of raw generated audio tokens (after prompt): 512
Cleaned audio tokens (first 50): [940, 940, 940, 940, 940, 940, 940, 940, 940, 940, 940, 940, 940, 940, 940, 940, 940, 940, 940, 940, 940, 940, 940, 940, 940, 940, 940, 940, 940, 940, 940, 940, 940, 940, 940, 940, 940, 940, 940, 940, 940, 940, 940, 940, 940, 940, 940, 940, 940, 940]
Number of cleaned audio tokens: 512
Most common generated tokens: {940: 512}
Num


Strategy: High Temp
Params: {'temperature': 0.9, 'top_p': 0.95, 'repetition_penalty': 1.0, 'do_sample': True, 'max_new_tokens': 512}
Generating with: max_new_tokens=512, temp=0.90, top_p=0.95, suppress_tokens=[1024]

--- Generated Token Inspection for: 'A short phrase....' ---
Sliced generated_audio_tokens_with_eos (first 50): [573, 573, 573, 573, 573, 573, 573, 573, 573, 573, 573, 573, 573, 573, 573, 573, 573, 573, 573, 573, 573, 573, 573, 573, 573, 573, 573, 573, 573, 573, 573, 573, 573, 573, 573, 573, 573, 573, 573, 573, 573, 573, 573, 573, 573, 573, 573, 573, 573, 573]
Number of raw generated audio tokens (after prompt): 512
Cleaned audio tokens (first 50): [573, 573, 573, 573, 573, 573, 573, 573, 573, 573, 573, 573, 573, 573, 573, 573, 573, 573, 573, 573, 573, 573, 573, 573, 573, 573, 573, 573, 573, 573, 573, 573, 573, 573, 573, 573, 573, 573, 573, 573, 573, 573, 573, 573, 573, 573, 573, 573, 573, 573]
Number of cleaned audio tokens: 512
Most common generated tokens: {573: 512}
N


Strategy: Top-K Only
Params: {'temperature': 0.8, 'top_k': 50, 'repetition_penalty': 1.0, 'do_sample': True, 'max_new_tokens': 512}
Generating with: max_new_tokens=512, temp=0.80, top_p=1.00, suppress_tokens=[1024]

--- Generated Token Inspection for: 'A short phrase....' ---
Sliced generated_audio_tokens_with_eos (first 50): [228, 228, 228, 228, 228, 228, 228, 228, 228, 228, 228, 228, 228, 228, 228, 228, 228, 228, 228, 228, 228, 228, 228, 228, 228, 228, 228, 228, 228, 228, 228, 228, 228, 228, 228, 228, 228, 228, 228, 228, 228, 228, 228, 228, 228, 228, 228, 228, 228, 228]
Number of raw generated audio tokens (after prompt): 512
Cleaned audio tokens (first 50): [228, 228, 228, 228, 228, 228, 228, 228, 228, 228, 228, 228, 228, 228, 228, 228, 228, 228, 228, 228, 228, 228, 228, 228, 228, 228, 228, 228, 228, 228, 228, 228, 228, 228, 228, 228, 228, 228, 228, 228, 228, 228, 228, 228, 228, 228, 228, 228, 228, 228]
Number of cleaned audio tokens: 512
Most common generated tokens: {228: 512}
Nu
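Every inspection block above that prints its summary reports a single unique token repeated for all 512 positions (for example `{84: 512}`), regardless of the sampling strategy. A sketch of flagging this kind of collapse automatically is shown below. The helper names and the 0.9 threshold are illustrative choices, not part of `evaluation.py`.

```python
from collections import Counter
from typing import Sequence

def summarize_tokens(tokens: Sequence[int], top_n: int = 3) -> dict:
    """Count unique tokens and the share of the sequence taken by the most frequent one."""
    counts = Counter(tokens)
    most_common = counts.most_common(top_n)
    top_share = most_common[0][1] / len(tokens) if tokens else 0.0
    return {"unique": len(counts), "most_common": most_common, "top_share": top_share}

def looks_collapsed(tokens: Sequence[int], max_top_share: float = 0.9) -> bool:
    """Flag a sequence dominated by one token (threshold is an arbitrary illustrative value)."""
    return summarize_tokens(tokens)["top_share"] >= max_top_share

collapsed = [84] * 512        # mirrors the 'High Temp' output above
varied = list(range(512))     # hypothetical healthy-looking sequence
print(summarize_tokens(collapsed), looks_collapsed(collapsed))
print(summarize_tokens(varied)["unique"], looks_collapsed(varied))
```

Running such a check after each generation makes it possible to skip decoding and metric computation for outputs that are clearly degenerate.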

### 6.2 Evaluation Metrics (CER, UTMOS, SECS)

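Demonstrates calls to the placeholder metric functions in `evaluation.py`. The scores printed below (0.1000, 3.5000, 0.8500) are placeholder return values, not measurements.
Computing the real metrics requires external models or libraries (for example an ASR model for CER) and ground-truth data.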
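SECS is, by name, a cosine similarity between speaker-encoder embeddings. The sketch below covers only that final step and assumes the embeddings have already been extracted by some speaker encoder, which is out of scope here. The vectors are toy values and `cosine_similarity` is a hypothetical helper.

```python
import math
from typing import Sequence

def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Toy embeddings standing in for speaker-encoder outputs.
generated_embedding = [0.2, 0.7, -0.1]
reference_embedding = [0.25, 0.65, -0.05]
print(f"SECS (toy vectors): {cosine_similarity(generated_embedding, reference_embedding):.4f}")
```

Values close to 1 indicate embeddings pointing in nearly the same direction; the scale of the waveforms themselves does not affect the score.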

In [15]:
if __name__ == '__main__' and _inference_eval_modules_loaded:
    print("\n--- Metrics Calculation Placeholders (using evaluation module) ---")
    
    # Example: Assume 'generated_waveform_np' is a NumPy array from inference
    # and 'ground_truth_text', 'reference_speaker_audio_np' are available.
    
    # For CER (Character Error Rate)
    # 1. Generate audio for a known text.
    # 2. Transcribe the generated audio using an ASR model.
    # 3. Compare transcribed text with the original/ground truth text.
    example_gt_text = "This is a ground truth sentence."
    example_asr_hypothesis = "this is a ground truth sentense"
    cer_score = evaluation.calculate_cer(example_asr_hypothesis, example_gt_text)
    print(f"Example CER (placeholder): {cer_score:.4f}")

    # For UTMOS (Predicted MOS Score)
    # Generate or load a waveform.
    dummy_waveform = np.random.randn(inference_config.TARGET_SAMPLE_RATE * 2) # 2 seconds dummy audio
    utmos_score = evaluation.calculate_utmos(dummy_waveform, inference_config.TARGET_SAMPLE_RATE)
    print(f"Example UTMOS (placeholder): {utmos_score:.4f}")

    # For SECS (Speaker Encoder Cosine Similarity)
    # Requires a reference audio from the target speaker and a generated audio.
    dummy_reference_waveform = np.random.randn(inference_config.TARGET_SAMPLE_RATE * 3)
    secs_score = evaluation.calculate_secs(dummy_waveform, dummy_reference_waveform, inference_config.TARGET_SAMPLE_RATE)
    print(f"Example SECS (placeholder): {secs_score:.4f}")

    print("\nNote: Actual metric calculations require integrating specific models/libraries.")

elif not _inference_eval_modules_loaded:
    print("Cannot demonstrate metrics calculation due to module import failures.")


--- Metrics Calculation Placeholders (using evaluation module) ---
[Placeholder CER] Hyp: 'this is a ground truth sentense...' Ref: 'This is a ground truth sentence....'
Example CER (placeholder): 0.1000
[Placeholder UTMOS] Waveform shape: (48000,), SR: 24000
Example UTMOS (placeholder): 3.5000
[Placeholder SECS] Gen wave: (48000,), Ref wave: (72000,), SR: 24000
Example SECS (placeholder): 0.8500

Note: Actual metric calculations require integrating specific models/libraries.
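For comparison with the placeholder, a real CER can be computed as the character-level edit distance between the ASR hypothesis and the reference, divided by the reference length. The sketch below uses a plain dynamic-programming Levenshtein distance with no text normalization; the function names are hypothetical and this is not the implementation in `evaluation.py`.

```python
def levenshtein(reference: str, hypothesis: str) -> int:
    """Minimum number of character insertions, deletions, and substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution_cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,                      # delete a reference character
                current[j - 1] + 1,                   # insert a hypothesis character
                previous[j - 1] + substitution_cost,  # match or substitute
            ))
        previous = current
    return previous[-1]

def character_error_rate(hypothesis: str, reference: str) -> float:
    """Edit distance normalized by the reference length."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)

example_gt_text = "This is a ground truth sentence."
example_asr_hypothesis = "this is a ground truth sentense"
print(f"CER: {character_error_rate(example_asr_hypothesis, example_gt_text):.4f}")
```

For the example pair this counts 3 edits (the capital `T`, the `c`/`s` substitution, and the missing period) over 32 reference characters. Lowercasing and stripping punctuation before scoring, which is common practice, would lower the result.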


## 7. Encodec Path Verification (Diagnostic Utility)

This cell provides a utility to verify the Encodec encode/decode pathway. It uses `run_encodec_path_verification` from `src.utils`, imported here under the alias `diag_run_encodec_verification`.

In [17]:
import torch
from IPython.display import Audio as IPyAudio, display
import pandas as pd

import sys
import os
notebook_dir_diag = os.getcwd()
src_path_diag = os.path.abspath(os.path.join(notebook_dir_diag, '..', 'src'))
if src_path_diag not in sys.path:
    sys.path.insert(0, src_path_diag)
    # print(f"Added to sys.path for diagnostic: {src_path_diag}") # Verbose

try:
    import config as diag_config
    from model import GPT2TTS as DiagGPT2TTS
    from dataset import LJSpeechDataset as DiagLJSpeechDataset 
    from utils import run_encodec_path_verification as diag_run_encodec_verification 
    from transformers import GPT2Tokenizer as DiagGPT2Tokenizer, AutoProcessor as DiagAutoProcessor
    print("Successfully imported modules for diagnostic utility.")
    _diag_util_modules_loaded = True
except ImportError as e:
    print(f"Error importing modules for diagnostic utility: {e}")
    _diag_util_modules_loaded = False

def diagnostic_encodec_check_notebook(): # Renamed to avoid conflict if evaluation.py also has one
    """Runs Encodec encode/decode path verification."""
    if not _diag_util_modules_loaded:
        print("Diagnostic utility modules not loaded. Cannot run.")
        return

    device = diag_config.DEVICE
    print(f"Running Encodec diagnostic on device: {device}")
    diag_gpt2_tokenizer = DiagGPT2Tokenizer.from_pretrained(diag_config.GPT2_MODEL_NAME)
    if diag_gpt2_tokenizer.pad_token is None:
        diag_gpt2_tokenizer.pad_token = diag_gpt2_tokenizer.eos_token
        diag_gpt2_tokenizer.pad_token_id = diag_gpt2_tokenizer.eos_token_id
    
    encodec_processor_diag = DiagAutoProcessor.from_pretrained(diag_config.ENCODEC_MODEL_NAME)
    diag_model_instance = DiagGPT2TTS(tokenizer=diag_gpt2_tokenizer).to(device)
    diag_model_instance.eval()

    try:
        abs_data_dir = os.path.abspath(os.path.join(src_path_diag, os.path.dirname(diag_config.METADATA_FILE)))
        abs_metadata_file = os.path.join(abs_data_dir, os.path.basename(diag_config.METADATA_FILE))
        abs_wavs_dir = os.path.join(abs_data_dir, os.path.basename(diag_config.WAVS_DIR))

        if not os.path.exists(abs_metadata_file) or not os.path.exists(abs_wavs_dir):
            print("Diagnostic data files not found.")
            return

        # quoting=3 is csv.QUOTE_NONE: quote characters in the text are kept literally.
        full_metadata_df_diag = pd.read_csv(abs_metadata_file, sep='|', header=None, names=['filename_base', 'original_text', 'normalized_text'], quoting=3)
        if full_metadata_df_diag.empty:
            print("Metadata empty.")
            return

        # Two rows are enough for a quick round-trip check.
        diag_dataset_instance = DiagLJSpeechDataset(
            metadata_df=full_metadata_df_diag.head(2), wavs_dir=abs_wavs_dir,
            processor=encodec_processor_diag, target_sample_rate=diag_config.TARGET_SAMPLE_RATE,
            tokenizer=diag_gpt2_tokenizer
        )
        if len(diag_dataset_instance) == 0:
            print("Diagnostic dataset empty.")
            return
    except Exception as e:
        print(f"Error preparing diagnostic dataset: {e}")
        import traceback
        traceback.print_exc()
        return

    diag_run_encodec_verification(diag_model_instance, diag_dataset_instance, device)

if __name__ == '__main__':
    if _diag_util_modules_loaded:
        diagnostic_encodec_check_notebook()
        print("Encodec diagnostic cell ready. Uncomment 'diagnostic_encodec_check_notebook()' to run.")
    else:
        print("Cannot prepare diagnostic due to module import failures.")

Successfully imported modules for diagnostic utility.
Running Encodec diagnostic on device: cuda
GPT2TTS initialized. Audio vocab size: 1026 (Codebook: 1024, Special: 2)
Expected num_audio_codebooks (Nq): 32
BOS Audio ID: 1024, EOS Audio ID: 1025
Loss function using label smoothing: 0.1
GPT2Model initialized with derived config: vocab_size=50257, pad_token_id=50256, name_or_path='gpt2'

--- Running Encodec Path Verification ---
Using ground truth audio for text: 'Printing, in the only sense with which we are at present concerned, differs from most if not from al...' 
Shape of ground-truth audio codes from encodec: torch.Size([1, 1, 32, 725])
Shape of final numpy waveform from diagnostic: (232000,)
Playing RECONSTRUCTED ground-truth audio (after encode & decode):


Playing ORIGINAL ground-truth audio for comparison:


--- Encodec Path Verification Done ---

Encodec diagnostic cell ready. Uncomment 'diagnostic_encodec_check_notebook()' to run.


## TTS Project: Quick Summary & Next Steps

### Where We Are

We've been building a GPT-2-based Text-to-Speech (TTS) model. After trying smaller data subsets, we trained it on the full LJSpeech dataset for one epoch.

**The Good:**
* The model trained and validated, hitting a validation loss around 1.07.
* The checkpoint from this run was saved.

**The Not-So-Good:**
1.  **Mode Collapse:** At generation time, the model emits the same audio token over and over, so we get a monotonous sound rather than actual speech.
2.  **CUDA Error:** After that one full-dataset epoch, a "device-side assert triggered" CUDA error popped up when the code tried to log audio samples (which calls the generation function). This assert usually means an out-of-range index, most often a token ID at or beyond the size of the embedding table it is looked up in.
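
One way to catch this before it reaches the GPU is to check token IDs against the embedding size right before the lookup. The sketch below is a hypothetical helper, not something in `src/`. The audio vocabulary size of 1026 is taken from the model's init log (1024 codebook entries, BOS = 1024, EOS = 1025), and 50256 is the GPT-2 pad/EOS ID from the same log.

```python
def find_out_of_range_ids(token_ids, vocab_size):
    """Return (position, token_id) pairs that an embedding table with
    vocab_size rows cannot index."""
    return [(pos, tid) for pos, tid in enumerate(token_ids)
            if tid < 0 or tid >= vocab_size]


AUDIO_VOCAB_SIZE = 1026  # 1024 codebook entries + BOS (1024) + EOS (1025)

# A text-vocabulary ID (50256) leaking into an audio sequence is flagged,
# as is anything just past the special tokens.
bad = find_out_of_range_ids([5, 1024, 1025, 1026, 50256], AUDIO_VOCAB_SIZE)
print(bad)  # [(3, 1026), (4, 50256)]
```

Running the failing step on CPU, or with `CUDA_LAUNCH_BLOCKING=1`, should also turn the opaque assert into an error that points at the offending line.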

### What We've Tried So Far

To get here and tackle earlier issues, we've:

* **Scaled Up Data:** Moved from small test sets to the full dataset.
* **Used Curriculum Learning:** Started with shorter audio clips in training and planned to increase the length each epoch.
* **Tuned Training:**
    * Lowered the learning rate (to `2e-5`).
    * Added weight decay (`0.01`) and label smoothing (`0.1`).
* **Improved Inference:**
    * Used sampling methods (temperature, top_p) to get more varied outputs.
    * Made sure the model doesn't try to generate the special "start of audio" (BOS) token after the prompt.
    * Temporarily set `repetition_penalty` to `1.0` (which disables the penalty) to avoid a CUDA error triggered by mixed text/audio token IDs.
* **Fixed CUDA/Config Bugs:** Spent a good deal of effort making sure the GPT-2 tokenizer and its configuration (especially `vocab_size` and `pad_token_id`) were perfectly aligned to prevent earlier CUDA errors during model setup. The latest fix involved passing the tokenizer directly into our `GPT2TTS` model.
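
To make the inference changes concrete, here is a rough standard-library sketch of temperature scaling, banning a token (such as the audio BOS), and nucleus (top-p) truncation applied to one step's logits. It illustrates the technique only and is not our generation code: `filter_logits` is a hypothetical name, and the tiny four-token vocabulary with BOS at ID 3 is made up for the example.

```python
import math
import random


def filter_logits(logits, temperature=1.0, top_p=0.9, banned_ids=()):
    """Return {token_id: probability} after temperature scaling,
    banning tokens, and top-p truncation. Assumes at least one
    token is left unbanned."""
    scaled = [x / temperature for x in logits]
    for tid in banned_ids:
        scaled[tid] = -math.inf  # probability becomes exactly 0
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]  # stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the smallest set of most-likely tokens whose mass reaches top_p.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in ranked:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    norm = sum(probs[i] for i in kept)
    return {i: probs[i] / norm for i in kept}


# Hypothetical 4-token vocabulary where ID 3 plays the role of BOS.
dist = filter_logits([2.0, 1.0, 0.0, 5.0], top_p=0.9, banned_ids=(3,))
next_token = random.choices(list(dist), weights=list(dist.values()), k=1)[0]
print(sorted(dist), next_token)
```

Even though the banned token has the highest raw logit here, it can never be sampled, and the low-probability tail (ID 2) is cut by the top-p threshold.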

### What's Next?

The main goals now are to fix the mode collapse and the new CUDA error.

1.  **Nail Down the CUDA Error (Top Priority):**
    * This error stops us from properly checking the model's output. It's likely an issue with an invalid token ID being used for an embedding, especially now that we're using the full dataset.
    * Need to double-check how `prepare_inputs_for_generation` handles token IDs (special tokens vs. audio codebook tokens) and any clamping logic.

2.  **Train Longer on Full Data:**
    * One epoch isn't enough. Once the CUDA error is gone, the model needs many more epochs (or a lot more training steps) on the full dataset to learn properly and avoid mode collapse.

3.  **Tune Curriculum Learning:**
    * Make sure the `current_max_audio_tokens_for_curriculum` actually increases and lets the model train on sequences as long as we want to generate.
    * If mode collapse hits early even with more training, try starting the curriculum with even shorter audio sequences.

4.  **Generation Parameters (After CUDA Fix):**
    * The `RepetitionPenaltyLogitsProcessor` was turned off (by setting `repetition_penalty=1.0`) because it caused CUDA errors. If we can fix the root cause of that error (likely the mixed text/audio token ID issue), using a `repetition_penalty > 1.0` could help reduce repetitive audio.

5.  **Check Token Usage:**
    * If the model still gets stuck after more training, look at which audio tokens it's generating versus what's in the training data. A big difference could point to other problems.

6.  **Keep Tuning Hyperparameters:**
    * Learning rate, batch size, and scheduler settings might need more tweaks as training goes on for longer.

7.  **Quick Sanity Checks (If Still Stuck):**
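    * Are dropout layers active during training (model in `train()` mode) and switched off for generation (`eval()` mode)?
    * Can the model's context window (`n_positions`, which is 1024 in the stock `gpt2` config) handle our prompt + generated audio length?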
    * Are BOS/EOS tokens used consistently in training and inference?

8.  **More Advanced Ideas (Later On):**
    * **Token Drop/Masking:** Randomly hide some input audio tokens during training so the model learns better from context.
    * **Scheduled Sampling:** Slowly switch from giving the model real audio tokens to its own generated tokens during training.
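
For the token-usage check in item 5, something like the following could be a starting point. It compares the generated token histogram with one from training data and reports how dominant the most frequent generated token is. The function name and the returned keys are hypothetical, and total variation distance is just one reasonable choice of comparison.

```python
from collections import Counter


def token_usage_report(generated_ids, reference_ids):
    """Summarize how generated audio tokens differ from reference tokens."""
    if not generated_ids or not reference_ids:
        raise ValueError("both token lists must be non-empty")
    gen, ref = Counter(generated_ids), Counter(reference_ids)
    n_gen, n_ref = len(generated_ids), len(reference_ids)
    top_token, top_count = gen.most_common(1)[0]
    # Total variation distance: 0 = same distribution, 1 = disjoint support.
    support = set(gen) | set(ref)
    tvd = 0.5 * sum(abs(gen[t] / n_gen - ref[t] / n_ref) for t in support)
    return {
        "top_token": top_token,
        "top_share": top_count / n_gen,   # near 1.0 suggests collapse
        "distinct_generated": len(gen),
        "distinct_reference": len(ref),
        "tv_distance": tvd,
    }


# Toy example: a collapsed output versus a varied reference.
report = token_usage_report([7] * 10, [1, 2, 3, 4, 7] * 2)
print(report)
```

A `top_share` close to 1.0 with very few distinct generated tokens is the signature of the collapse described above.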

**Immediate Plan:** First, fix the CUDA error so we can actually test generation properly. Then, get ready for a much longer training run on the full dataset, keeping an eye on the generated audio samples to see if we're beating mode collapse.
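
Before the longer run, it may be worth pinning down the curriculum schedule explicitly so we can confirm the cap really reaches the length we want to generate. A minimal sketch of a linear schedule follows. The function name and all numbers are hypothetical, and how the result feeds `current_max_audio_tokens_for_curriculum` is left to the training loop.

```python
def curriculum_max_audio_tokens(epoch, start_tokens, target_tokens, growth_per_epoch):
    """Linear curriculum: the cap grows by a fixed amount each epoch
    (epoch counted from 0) and stops at target_tokens."""
    return min(target_tokens, start_tokens + epoch * growth_per_epoch)


# Hypothetical settings: start at 150 audio tokens, add 150 per epoch, stop at 750.
schedule = [curriculum_max_audio_tokens(e, 150, 750, 150) for e in range(6)]
print(schedule)  # [150, 300, 450, 600, 750, 750]

# Sanity check: training eventually covers the length we plan to generate.
assert schedule[-1] >= 750
```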
