# LLM-based TTS: Main Training and Inference Notebook

This notebook trains a GPT-2-based text-to-speech (TTS) model and runs inference and evaluation with the resulting checkpoints. The logic lives in Python modules under `src/` (configuration, dataset handling, model definition, training loop, evaluation, and utilities); the notebook only calls into them.

**Assumed Project Structure:**
```
ProjectRoot/
├── notebooks/          <-- This Jupyter notebook (llm_tts_13.ipynb)
├── src/                <-- Python modules (config.py, model.py, etc.)
├── data/               <-- LJSpeech dataset (or other configured dataset)
└── training_runs_mel/  <-- Output directory (config.OUTPUT_DIR) for training artifacts (checkpoints, logs, samples)
```

**Setup Steps:**
1. Ensure the project structure outlined above is in place.
2. Verify that the dataset path (e.g., LJSpeech) is correctly specified in `src/config.py` (variable `DATA_DIR`).
3. Install all necessary Python packages (e.g., `torch`, `torchaudio`, `transformers`, `speechbrain`, `pandas`, `scikit-learn`, `matplotlib`, `soundfile`, `tqdm`, `jiwer`).

## 1. Environment Setup

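This optional cell sets `CUDA_LAUNCH_BLOCKING=1`, which makes CUDA kernel launches synchronous so that a CUDA error is raised at the operation that caused it rather than at a later, unrelated call. The variable must be set before the first CUDA call to take effect. Synchronous launches slow training down, so remove the setting once debugging is done.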

In [1]:
%env CUDA_LAUNCH_BLOCKING=1
# import os
# os.environ['CUDA_LAUNCH_BLOCKING'] = "1" # Alternative method

env: CUDA_LAUNCH_BLOCKING=1


## 2. Import Necessary Modules from `src`

This cell adjusts the Python system path to allow importing modules from the `src` directory, which is assumed to be a sibling to the `notebooks` directory. It then attempts to import key modules required for training.

In [2]:
import sys
import os

# Adjust Python path to locate the 'src' directory.
# Assumes this notebook is in 'ProjectRoot/notebooks/'.
module_path = os.path.abspath(os.path.join(os.getcwd(), '..', 'src'))
if module_path not in sys.path:
    sys.path.insert(0, module_path) # Insert at beginning to prioritize 'src' modules.
    print(f"Added to sys.path: {module_path}")
else:
    print(f"{module_path} already in sys.path")

# Attempt to import core modules for training.
try:
    import config  # Configuration settings from src.config
    import train   # Main training script from src.train
    print("Successfully imported modules from src.")
    _train_modules_loaded = True
except ImportError as e:
    print(f"Error importing modules from src: {e}")
    print("Ensure 'src' directory is correctly structured and accessible.")
    print(f"Current working directory: {os.getcwd()}")
    print(f"sys.path: {sys.path}")
    _train_modules_loaded = False

Added to sys.path: /home/denis/kpi/ucu_audio_processing_labs/GPT2TTS/src



Successfully imported and verified callable: speechbrain.lobes.models.FastSpeech2.mel_spectogram (via module attribute access)
Successfully imported modules from src.


## 3. Run Model Training

Executes the main training function (`main_train`) from the `src.train` module. 
Before running, ensure all configurations in `src/config.py` (dataset paths, hyperparameters, etc.) are set as desired. 
Training outputs, including checkpoints and logs, will be saved to the directory specified in `config.OUTPUT_DIR`.
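
The training cell below switches the working directory to `src/` and restores it in a `finally` block. A minimal sketch of the same pattern as a reusable context manager follows; `working_directory` is a hypothetical helper name, not something defined in `src/`, and a temporary directory stands in for the real `src/` path.

```python
import contextlib
import os
import tempfile


@contextlib.contextmanager
def working_directory(path):
    """Temporarily switch the process CWD to `path`, restoring it on exit."""
    original_cwd = os.getcwd()
    os.chdir(path)
    try:
        yield
    finally:
        # Runs even if the body raises, so the notebook CWD is never left changed.
        os.chdir(original_cwd)


# Usage: a temporary directory stands in for 'src/'.
before = os.getcwd()
with tempfile.TemporaryDirectory() as tmp_dir:
    with working_directory(tmp_dir):
        inside = os.getcwd()
after = os.getcwd()
print(f"inside: {inside}\nrestored: {after == before}")
```

On Python 3.11 and later, `contextlib.chdir` from the standard library offers the same behavior.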

In [3]:
# Execute the main training script if modules were loaded.
if __name__ == '__main__' and _train_modules_loaded:
    print(f"Starting training run. Output will be in: {os.path.abspath(config.OUTPUT_DIR)}")
    
    # Change Current Working Directory (CWD) to 'src' to ensure 
    # relative paths within the training script work correctly.
    original_cwd = os.getcwd()
    try:
        os.chdir(module_path) # module_path points to 'src/'
        print(f"Changed CWD to: {os.getcwd()} for running train.main_train()")
        train.main_train()
    except Exception as e_train:
        print(f"Error during train.main_train(): {e_train}")
        import traceback
        traceback.print_exc()
    finally:
        os.chdir(original_cwd) # Restore original CWD.
        print(f"Restored CWD to: {os.getcwd()}")

elif not _train_modules_loaded:
    print("Module 'train' or 'config' not loaded. Cannot start training.")

Starting training run. Output will be in: /home/denis/kpi/ucu_audio_processing_labs/GPT2TTS/training_runs_mel
Changed CWD to: /home/denis/kpi/ucu_audio_processing_labs/GPT2TTS/src for running train.main_train()
Training run output: /home/denis/kpi/ucu_audio_processing_labs/GPT2TTS/training_runs_mel/training_run_mel_20250521-005331
Using device: cuda
get_data_loaders_mel called with: max_text_len=192, max_mel_frames=800
Train dataset size: 12445, Val dataset size: 655
LJSpeechMelDataset initialized with max_text_len: 192, max_mel_frames: 800
Replaced GPT-2 lm_head to output 80 mel bins.
GPT-2 configured with n_positions: 1024



Speaker encoder speechbrain/spkrec-ecapa-voxceleb loaded.
Added SCL projection layer: 768 -> 192



HiFi-GAN vocoder speechbrain/tts-hifigan-ljspeech loaded.

--- Epoch 1/20 ---


Epoch 1 Training: 100%|██████████| 1556/1556 [26:51<00:00,  1.04s/it, mel_loss=0.817, scl_loss=0.754, lr=3.8e-5]

--- Epoch 1 Complete. Avg Mel Loss: 10.2344, Avg SCL Proxy: 3.5761 ---

--- Epoch 2/20 ---



Epoch 2 Training: 100%|██████████| 1556/1556 [26:16<00:00,  1.01s/it, mel_loss=0.617, scl_loss=0.733, lr=4.98e-5]

--- Epoch 2 Complete. Avg Mel Loss: 2.7547, Avg SCL Proxy: 3.0253 ---

--- Epoch 3/20 ---



Step 1000 Validation: 100%|██████████| 82/82 [00:52<00:00,  1.56it/s, val_mel_loss=0.571]



Validation at step 1000: Total Loss: 0.9252, Mel Loss: 0.5633, SCL Proxy: 0.7239
Best validation loss improved to 0.9252. Checkpoint saved to /home/denis/kpi/ucu_audio_processing_labs/GPT2TTS/training_runs_mel/training_run_mel_20250521-005331/checkpoints/best_val_loss_model.pth


Epoch 3 Training:  57%|█████▋    | 888/1556 [15:55<3:27:19, 18.62s/it, mel_loss=0.581, scl_loss=0.702, lr=4.94e-5]

Error during sample generation for logging: Given groups=1, weight of size [512, 80, 7], expected input[1, 300, 96] to have 80 channels, but got 300 channels instead


Epoch 3 Training: 100%|██████████| 1556/1556 [27:06<00:00,  1.05s/it, mel_loss=0.55, scl_loss=0.708, lr=4.9e-5]   

--- Epoch 3 Complete. Avg Mel Loss: 2.3271, Avg SCL Proxy: 2.8855 ---

--- Epoch 4/20 ---



Epoch 4 Training: 100%|██████████| 1556/1556 [25:55<00:00,  1.00it/s, mel_loss=0.522, scl_loss=0.845, lr=4.75e-5]

--- Epoch 4 Complete. Avg Mel Loss: 2.1403, Avg SCL Proxy: 2.8328 ---

--- Epoch 5/20 ---



Epoch 5 Training: 100%|██████████| 1556/1556 [25:58<00:00,  1.00s/it, mel_loss=0.497, scl_loss=0.655, lr=4.53e-5]

--- Epoch 5 Complete. Avg Mel Loss: 2.0333, Avg SCL Proxy: 2.7698 ---

--- Epoch 6/20 ---



Step 2000 Validation: 100%|██████████| 82/82 [00:52<00:00,  1.57it/s, val_mel_loss=0.489]



Validation at step 2000: Total Loss: 0.8264, Mel Loss: 0.4823, SCL Proxy: 0.6880
Best validation loss improved to 0.8264. Checkpoint saved to /home/denis/kpi/ucu_audio_processing_labs/GPT2TTS/training_runs_mel/training_run_mel_20250521-005331/checkpoints/best_val_loss_model.pth


Epoch 6 Training:  14%|█▍        | 220/1556 [04:39<6:49:07, 18.37s/it, mel_loss=0.499, scl_loss=0.646, lr=4.49e-5]

Error during sample generation for logging: Given groups=1, weight of size [512, 80, 7], expected input[1, 300, 96] to have 80 channels, but got 300 channels instead


Epoch 6 Training: 100%|██████████| 1556/1556 [26:58<00:00,  1.04s/it, mel_loss=0.478, scl_loss=0.616, lr=4.26e-5] 

--- Epoch 6 Complete. Avg Mel Loss: 1.9500, Avg SCL Proxy: 2.7264 ---

--- Epoch 7/20 ---



Epoch 7 Training: 100%|██████████| 1556/1556 [26:00<00:00,  1.00s/it, mel_loss=0.464, scl_loss=0.689, lr=3.94e-5]

--- Epoch 7 Complete. Avg Mel Loss: 1.8840, Avg SCL Proxy: 2.6770 ---

--- Epoch 8/20 ---



Step 3000 Validation: 100%|██████████| 82/82 [00:52<00:00,  1.55it/s, val_mel_loss=0.437]



Validation at step 3000: Total Loss: 0.7615, Mel Loss: 0.4322, SCL Proxy: 0.6586
Best validation loss improved to 0.7615. Checkpoint saved to /home/denis/kpi/ucu_audio_processing_labs/GPT2TTS/training_runs_mel/training_run_mel_20250521-005331/checkpoints/best_val_loss_model.pth


Epoch 8 Training:  71%|███████   | 1108/1556 [19:33<2:14:28, 18.01s/it, mel_loss=0.458, scl_loss=0.62, lr=3.68e-5]

Error during sample generation for logging: Given groups=1, weight of size [512, 80, 7], expected input[1, 300, 96] to have 80 channels, but got 300 channels instead


Epoch 8 Training: 100%|██████████| 1556/1556 [27:01<00:00,  1.04s/it, mel_loss=0.457, scl_loss=0.645, lr=3.57e-5] 

--- Epoch 8 Complete. Avg Mel Loss: 1.8381, Avg SCL Proxy: 2.6273 ---

--- Epoch 9/20 ---



Epoch 9 Training: 100%|██████████| 1556/1556 [26:02<00:00,  1.00s/it, mel_loss=0.461, scl_loss=0.664, lr=3.18e-5]

--- Epoch 9 Complete. Avg Mel Loss: 1.7984, Avg SCL Proxy: 2.6057 ---

--- Epoch 10/20 ---



Epoch 10 Training: 100%|██████████| 1556/1556 [25:58<00:00,  1.00s/it, mel_loss=0.441, scl_loss=0.685, lr=2.77e-5]

--- Epoch 10 Complete. Avg Mel Loss: 1.7674, Avg SCL Proxy: 2.5793 ---

--- Epoch 11/20 ---



Step 4000 Validation: 100%|██████████| 82/82 [00:52<00:00,  1.57it/s, val_mel_loss=0.426]



Validation at step 4000: Total Loss: 0.7321, Mel Loss: 0.4224, SCL Proxy: 0.6194
Best validation loss improved to 0.7321. Checkpoint saved to /home/denis/kpi/ucu_audio_processing_labs/GPT2TTS/training_runs_mel/training_run_mel_20250521-005331/checkpoints/best_val_loss_model.pth


Epoch 11 Training:  28%|██▊       | 440/1556 [08:19<5:30:08, 17.75s/it, mel_loss=0.439, scl_loss=0.648, lr=2.65e-5]

Error during sample generation for logging: Given groups=1, weight of size [512, 80, 7], expected input[1, 300, 96] to have 80 channels, but got 300 channels instead


Epoch 11 Training: 100%|██████████| 1556/1556 [26:58<00:00,  1.04s/it, mel_loss=0.429, scl_loss=0.667, lr=2.36e-5] 

--- Epoch 11 Complete. Avg Mel Loss: 1.7435, Avg SCL Proxy: 2.5331 ---

--- Epoch 12/20 ---



Epoch 12 Training: 100%|██████████| 1556/1556 [26:02<00:00,  1.00s/it, mel_loss=0.425, scl_loss=0.632, lr=1.94e-5]

--- Epoch 12 Complete. Avg Mel Loss: 1.7256, Avg SCL Proxy: 2.5142 ---

--- Epoch 13/20 ---



Step 5000 Validation: 100%|██████████| 82/82 [00:52<00:00,  1.57it/s, val_mel_loss=0.412]



Validation at step 5000: Total Loss: 0.7195, Mel Loss: 0.4088, SCL Proxy: 0.6213
Best validation loss improved to 0.7195. Checkpoint saved to /home/denis/kpi/ucu_audio_processing_labs/GPT2TTS/training_runs_mel/training_run_mel_20250521-005331/checkpoints/best_val_loss_model.pth


Epoch 13 Training:  85%|████████▌ | 1328/1556 [23:09<1:11:24, 18.79s/it, mel_loss=0.43, scl_loss=0.768, lr=1.59e-5]

Error during sample generation for logging: Given groups=1, weight of size [512, 80, 7], expected input[1, 300, 96] to have 80 channels, but got 300 channels instead


Epoch 13 Training: 100%|██████████| 1556/1556 [26:57<00:00,  1.04s/it, mel_loss=0.429, scl_loss=0.622, lr=1.54e-5] 

--- Epoch 13 Complete. Avg Mel Loss: 1.7094, Avg SCL Proxy: 2.4710 ---

--- Epoch 14/20 ---



Epoch 14 Training: 100%|██████████| 1556/1556 [26:00<00:00,  1.00s/it, mel_loss=0.422, scl_loss=0.625, lr=1.17e-5]

--- Epoch 14 Complete. Avg Mel Loss: 1.6974, Avg SCL Proxy: 2.4536 ---

--- Epoch 15/20 ---



Epoch 15 Training: 100%|██████████| 1556/1556 [25:59<00:00,  1.00s/it, mel_loss=0.423, scl_loss=0.61, lr=8.34e-6] 

--- Epoch 15 Complete. Avg Mel Loss: 1.6884, Avg SCL Proxy: 2.4262 ---

--- Epoch 16/20 ---



Step 6000 Validation: 100%|██████████| 82/82 [00:52<00:00,  1.57it/s, val_mel_loss=0.414]



Validation at step 6000: Total Loss: 0.7152, Mel Loss: 0.4141, SCL Proxy: 0.6023
Best validation loss improved to 0.7152. Checkpoint saved to /home/denis/kpi/ucu_audio_processing_labs/GPT2TTS/training_runs_mel/training_run_mel_20250521-005331/checkpoints/best_val_loss_model.pth


Epoch 16 Training:  42%|████▏     | 660/1556 [12:01<4:26:53, 17.87s/it, mel_loss=0.423, scl_loss=0.683, lr=7.02e-6]

Error during sample generation for logging: Given groups=1, weight of size [512, 80, 7], expected input[1, 300, 96] to have 80 channels, but got 300 channels instead


Epoch 16 Training: 100%|██████████| 1556/1556 [26:56<00:00,  1.04s/it, mel_loss=0.418, scl_loss=0.592, lr=5.45e-6] 

--- Epoch 16 Complete. Avg Mel Loss: 1.6812, Avg SCL Proxy: 2.4144 ---

--- Epoch 17/20 ---



Epoch 17 Training: 100%|██████████| 1556/1556 [25:56<00:00,  1.00s/it, mel_loss=0.416, scl_loss=0.519, lr=3.12e-6]

--- Epoch 17 Complete. Avg Mel Loss: 1.6765, Avg SCL Proxy: 2.3987 ---

--- Epoch 18/20 ---



Step 7000 Validation: 100%|██████████| 82/82 [00:52<00:00,  1.57it/s, val_mel_loss=0.413]



Validation at step 7000: Total Loss: 0.7161, Mel Loss: 0.4121, SCL Proxy: 0.6079


Epoch 18 Training:  99%|█████████▉| 1548/1556 [26:50<02:25, 18.25s/it, mel_loss=0.414, scl_loss=0.564, lr=1.4e-6]

Error during sample generation for logging: Given groups=1, weight of size [512, 80, 7], expected input[1, 300, 96] to have 80 channels, but got 300 channels instead


Epoch 18 Training: 100%|██████████| 1556/1556 [26:55<00:00,  1.04s/it, mel_loss=0.414, scl_loss=0.564, lr=1.4e-6]

--- Epoch 18 Complete. Avg Mel Loss: 1.6736, Avg SCL Proxy: 2.3941 ---

--- Epoch 19/20 ---



Epoch 19 Training: 100%|██████████| 1556/1556 [26:02<00:00,  1.00s/it, mel_loss=0.421, scl_loss=0.58, lr=3.53e-7] 

--- Epoch 19 Complete. Avg Mel Loss: 1.6720, Avg SCL Proxy: 2.3857 ---

--- Epoch 20/20 ---



Epoch 20 Training: 100%|██████████| 1556/1556 [25:57<00:00,  1.00s/it, mel_loss=0.419, scl_loss=0.731, lr=0]       

--- Epoch 20 Complete. Avg Mel Loss: 1.6715, Avg SCL Proxy: 2.3877 ---
--- Training Finished ---
Restored CWD to: /home/denis/kpi/ucu_audio_processing_labs/GPT2TTS/notebooks





![TensorBoard Training Dynamics](training_dynamics.png)
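
The log above reports validation at "step 1000" while the epoch 3 progress bar stands at batch 888 of 1556. This is consistent with one optimizer step per 4 batches (gradient accumulation), although that factor is inferred from the log here and not read from `src/config.py`. A small sketch of the arithmetic, with `ACCUM_STEPS` and `locate_step` as assumed names:

```python
ITERS_PER_EPOCH = 1556   # batches per epoch, taken from the progress bars above
ACCUM_STEPS = 4          # assumption: inferred from the log, not read from config.py


def locate_step(global_step, iters_per_epoch=ITERS_PER_EPOCH, accum_steps=ACCUM_STEPS):
    """Map an optimizer step to (1-based epoch, batch index within that epoch)."""
    total_batches = global_step * accum_steps
    epoch = total_batches // iters_per_epoch + 1
    batch_in_epoch = total_batches % iters_per_epoch
    return epoch, batch_in_epoch


positions = {step: locate_step(step) for step in range(1000, 8000, 1000)}
for step, (epoch, batch) in positions.items():
    print(f"step {step}: epoch {epoch}, batch {batch}/{ITERS_PER_EPOCH}")
```

The computed positions match the progress-bar counters logged next to each validation (888, 220, 1108, 440, 1328, 660, 1548). The batch counts are likewise consistent with a batch size of 8, since ceil(12445 / 8) = 1556 and ceil(655 / 8) = 82; this too is an inference from the log.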

## 4. Setup for Inference and Evaluation

Imports modules required for running inference and evaluation tasks.

In [None]:
import torch
from IPython.display import Audio as IPyAudio, display # For notebook display
import numpy as np 
import soundfile as sf 
import os
import sys
import torchaudio # For loading reference audio

# Ensure src is in path (might be redundant if cell 2 ran).
notebook_dir_inf = os.getcwd() # Should be ProjectRoot/notebooks/
src_path_inf = os.path.abspath(os.path.join(notebook_dir_inf, '..', 'src'))
if src_path_inf not in sys.path:
    sys.path.insert(0, src_path_inf)
    print(f"Added to sys.path for inference/eval: {src_path_inf}")

try:
    import config as inference_config 
    from model import GPT2TTSMelPredictor 
    from utils import generate_speech_mel 
    from transformers import GPT2Tokenizer
    # Import the specific functions from evaluation
    from evaluation import run_single_inference_mel, perform_varied_inference, calculate_cer, calculate_utmos, calculate_secs
    print("Successfully imported modules for inference and evaluation.")
    _inference_eval_modules_loaded = True
except ImportError as e:
    print(f"Error importing inference/evaluation modules: {e}")
    _inference_eval_modules_loaded = False

Added to sys.path for inference/eval: /home/denis/kpi/ucu_audio_processing_labs/GPT2TTS/src



Successfully imported modules for inference and evaluation.


## 5. Single Inference Example

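Selects the most recent training run, loads its best-validation checkpoint (`best_val_loss_model.pth`), falling back to `latest_model_checkpoint.pth` if that file is missing, and generates speech for a single test sentence. Synthesis is done by `run_single_inference_mel`, imported from `src/evaluation.py` in the previous section.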
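
The run and checkpoint selection performed in the cell below can be sketched as a small helper. The directory prefix and checkpoint file names come from this notebook; the function name `find_checkpoint` is hypothetical and not part of `src/`, and the older run directory in the usage example is made up. Sorting by name works because run directories end in a `YYYYMMDD-HHMMSS` timestamp, so lexicographic order equals chronological order.

```python
import os
import tempfile

# Preferred checkpoint first, fallback second (file names as used in this notebook).
CHECKPOINT_PREFERENCE = ("best_val_loss_model.pth", "latest_model_checkpoint.pth")


def find_checkpoint(output_base, run_prefix="training_run_mel_"):
    """Return (run_dir, checkpoint_path) for the newest run, or None where missing."""
    if not os.path.isdir(output_base):
        return None, None
    runs = sorted(
        d for d in os.listdir(output_base)
        if d.startswith(run_prefix) and os.path.isdir(os.path.join(output_base, d))
    )
    if not runs:
        return None, None
    run_dir = os.path.join(output_base, runs[-1])
    for name in CHECKPOINT_PREFERENCE:
        candidate = os.path.join(run_dir, "checkpoints", name)
        if os.path.exists(candidate):
            return run_dir, candidate
    return run_dir, None


# Usage with a throwaway directory tree standing in for config.OUTPUT_DIR.
with tempfile.TemporaryDirectory() as base:
    for run in ("training_run_mel_20250520-101500", "training_run_mel_20250521-005331"):
        os.makedirs(os.path.join(base, run, "checkpoints"))
    newest = os.path.join(base, "training_run_mel_20250521-005331", "checkpoints")
    open(os.path.join(newest, "latest_model_checkpoint.pth"), "w").close()
    run_dir, checkpoint = find_checkpoint(base)
    print(os.path.basename(run_dir), os.path.basename(checkpoint))
```

Sections 5 and 6.1 repeat this lookup inline, so a helper of this kind would remove the duplication.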

In [None]:
if __name__ == '__main__' and _inference_eval_modules_loaded: 
    project_root_nb = os.path.abspath(os.path.join(os.getcwd(), '..')) 
    output_base_abs_nb = os.path.join(project_root_nb, inference_config.OUTPUT_DIR) 
    
    latest_run_dir_name_nb = None
    latest_run_full_path_nb = None
    if os.path.exists(output_base_abs_nb):
        all_runs_nb = sorted([
            d for d in os.listdir(output_base_abs_nb) 
            if os.path.isdir(os.path.join(output_base_abs_nb, d)) and d.startswith('training_run_mel_')
        ])
        if all_runs_nb: 
            latest_run_dir_name_nb = all_runs_nb[-1]
            latest_run_full_path_nb = os.path.join(output_base_abs_nb, latest_run_dir_name_nb)
    
    checkpoint_to_load_single_nb = None
    if latest_run_full_path_nb:
        best_chk_nb = os.path.join(latest_run_full_path_nb, "checkpoints", "best_val_loss_model.pth")
        latest_chk_nb = os.path.join(latest_run_full_path_nb, "checkpoints", "latest_model_checkpoint.pth")
        if os.path.exists(best_chk_nb): checkpoint_to_load_single_nb = best_chk_nb
        elif os.path.exists(latest_chk_nb): checkpoint_to_load_single_nb = latest_chk_nb
    
    sample_ref_audio_path_nb = None
    ljspeech_wav_dir_nb = os.path.join(project_root_nb, "data", "LJSpeech-1.1", "wavs")
    if os.path.exists(ljspeech_wav_dir_nb):
        wav_files_nb = [f for f in os.listdir(ljspeech_wav_dir_nb) if f.endswith('.wav')]
        if wav_files_nb: sample_ref_audio_path_nb = os.path.join(ljspeech_wav_dir_nb, wav_files_nb[0]) # First listed wav as reference; os.listdir order is arbitrary, so sort the list for a reproducible choice
    
    if checkpoint_to_load_single_nb and sample_ref_audio_path_nb and latest_run_full_path_nb:
        print(f"Single inference using checkpoint: {checkpoint_to_load_single_nb}")
        print(f"Using reference audio: {sample_ref_audio_path_nb}")
        
        # Define output directory for this specific inference
        single_inference_output_dir = os.path.join(latest_run_full_path_nb, "inference_samples_notebook", "single_sample")
        os.makedirs(single_inference_output_dir, exist_ok=True)
        
        waveform_np = run_single_inference_mel(
            model_checkpoint_path=checkpoint_to_load_single_nb, 
            text_to_synthesize="Hello, this is a test of the speech synthesis system.", 
            reference_audio_path=sample_ref_audio_path_nb,
            output_dir=single_inference_output_dir,
            output_filename_base="single_test_output" 
            # device can be omitted to use config.DEVICE
        )
        if waveform_np is not None:
            display(IPyAudio(waveform_np, rate=inference_config.TARGET_SAMPLE_RATE))
    else:
        if not checkpoint_to_load_single_nb: print("No checkpoint found for single inference example.")
        if not sample_ref_audio_path_nb: print("No reference audio found for single inference example.")
        if not latest_run_full_path_nb: print("Could not determine latest run directory for saving output.")

Single inference using checkpoint: /home/denis/kpi/ucu_audio_processing_labs/GPT2TTS/training_runs_mel/training_run_mel_20250521-005331/checkpoints/best_val_loss_model.pth
Using reference audio: /home/denis/kpi/ucu_audio_processing_labs/GPT2TTS/data/LJSpeech-1.1/wavs/LJ006-0265.wav

--- Running Single Inference ---
Model Checkpoint: /home/denis/kpi/ucu_audio_processing_labs/GPT2TTS/training_runs_mel/training_run_mel_20250521-005331/checkpoints/best_val_loss_model.pth
Text: 'Hello, this is a test of the speech synthesis system.'
Reference Audio: /home/denis/kpi/ucu_audio_processing_labs/GPT2TTS/data/LJSpeech-1.1/wavs/LJ006-0265.wav
Output Directory: /home/denis/kpi/ucu_audio_processing_labs/GPT2TTS/training_runs_mel/training_run_mel_20250521-005331/inference_samples_notebook/single_sample
Replaced GPT-2 lm_head to output 80 mel bins.
GPT-2 configured with n_positions: 1024



Speaker encoder speechbrain/spkrec-ecapa-voxceleb loaded.
Added SCL projection layer: 768 -> 192



HiFi-GAN vocoder speechbrain/tts-hifigan-ljspeech loaded.
Loaded model weights from /home/denis/kpi/ucu_audio_processing_labs/GPT2TTS/training_runs_mel/training_run_mel_20250521-005331/checkpoints/best_val_loss_model.pth (excluding speaker_encoder and hifi_gan).
Model loaded successfully.
Reference audio loaded and processed. Shape: torch.Size([116125])
Generated audio saved to: /home/denis/kpi/ucu_audio_processing_labs/GPT2TTS/training_runs_mel/training_run_mel_20250521-005331/inference_samples_notebook/single_sample/single_test_output.wav


## 6. Advanced Inference and Evaluation

This section uses the `src.evaluation` module to:
1. Run inference with different sampling strategies.
2. Calculate the character error rate (CER) and demonstrate placeholder calls for the UTMOS (predicted mean opinion score) and SECS (speaker encoder cosine similarity) metrics.

### 6.1 Inference with Different Sampling Strategies

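Uses `evaluation.perform_varied_inference` to generate speech for several texts under several named strategies. In this cell the strategies differ only in the `max_mel_frames` cap passed to generation.
Audio files are saved to the latest training run's `inference_samples_notebook/varied_samples` directory.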
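
How `perform_varied_inference` consumes `sampling_strategies` is defined in `src/evaluation.py` and is not shown in this notebook. As a rough sketch of the idea, assuming each strategy is a dict of keyword overrides (or `None` for defaults) applied to every text, the text/strategy grid could be expanded as follows; `expand_jobs` and the output naming scheme are hypothetical.

```python
def expand_jobs(texts, strategies):
    """Yield (output_name, text, kwargs) for every text/strategy combination."""
    for strategy_name, overrides in strategies.items():
        kwargs = dict(overrides or {})  # None means "use the generator's defaults"
        for index, text in enumerate(texts):
            yield f"{strategy_name}_text{index}", text, kwargs


texts = ["This is a test with default generation length.", "Short one."]
strategies = {
    "Default_Length": None,
    "Short_Output": {"max_mel_frames": 150},
}

jobs = list(expand_jobs(texts, strategies))
for name, text, kwargs in jobs:
    print(name, kwargs)
```

Each resulting `kwargs` dict would then be forwarded to the generation call, for example as `generate_speech_mel(..., **kwargs)`, if that function accepts the override by keyword.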

In [None]:
if __name__ == '__main__' and _inference_eval_modules_loaded:
    # --- Setup for evaluation (model loading is done within perform_varied_inference if not passed) ---
    # We will load the model once and pass it to perform_varied_inference for efficiency.
    project_root_nb_eval = os.path.abspath(os.path.join(os.getcwd(), '..'))
    output_base_abs_nb_eval = os.path.join(project_root_nb_eval, inference_config.OUTPUT_DIR)
    latest_run_dir_name_nb_eval = None
    latest_run_full_path_nb_eval = None

    if os.path.exists(output_base_abs_nb_eval):
        all_runs_nb_eval = sorted([
            d for d in os.listdir(output_base_abs_nb_eval) 
            if os.path.isdir(os.path.join(output_base_abs_nb_eval, d)) and d.startswith('training_run_mel_')
        ])
        if all_runs_nb_eval: 
            latest_run_dir_name_nb_eval = all_runs_nb_eval[-1]
            latest_run_full_path_nb_eval = os.path.join(output_base_abs_nb_eval, latest_run_dir_name_nb_eval)


    checkpoint_to_load_eval_nb = None
    loaded_model_for_eval_nb = None 
    varied_inference_output_dir = None 

    if latest_run_full_path_nb_eval:
        best_chk_eval_nb = os.path.join(latest_run_full_path_nb_eval, "checkpoints", "best_val_loss_model.pth")
        latest_chk_eval_nb = os.path.join(latest_run_full_path_nb_eval, "checkpoints", "latest_model_checkpoint.pth")
        if os.path.exists(best_chk_eval_nb): checkpoint_to_load_eval_nb = best_chk_eval_nb
        elif os.path.exists(latest_chk_eval_nb): checkpoint_to_load_eval_nb = latest_chk_eval_nb
        
        varied_inference_output_dir = os.path.join(latest_run_full_path_nb_eval, "inference_samples_notebook", "varied_samples")
        os.makedirs(varied_inference_output_dir, exist_ok=True)

    if checkpoint_to_load_eval_nb:
        print(f"Evaluation: Loading checkpoint for varied inference: {checkpoint_to_load_eval_nb}")
        eval_tokenizer_nb = GPT2Tokenizer.from_pretrained(inference_config.GPT2_MODEL_NAME)
        if eval_tokenizer_nb.pad_token is None: eval_tokenizer_nb.pad_token = eval_tokenizer_nb.eos_token
        
        loaded_model_for_eval_nb = GPT2TTSMelPredictor.from_pretrained_custom(
            tokenizer=eval_tokenizer_nb, 
            checkpoint_path=checkpoint_to_load_eval_nb, 
            device=inference_config.DEVICE
        )
        if loaded_model_for_eval_nb:
            loaded_model_for_eval_nb.eval()
            print("Model loaded successfully for varied inference.")
        else:
            print(f"Failed to load model from {checkpoint_to_load_eval_nb} for varied inference.")
    else:
        print("Evaluation: No checkpoint found to load model for varied inference.")
    
    eval_ref_audio_path_nb = None
    eval_ref_waveform_nb = None # Tensor waveform
    ljspeech_wav_dir_eval_nb = os.path.join(project_root_nb_eval, "data", "LJSpeech-1.1", "wavs")
    if os.path.exists(ljspeech_wav_dir_eval_nb):
        wav_files_eval_nb = [f for f in os.listdir(ljspeech_wav_dir_eval_nb) if f.endswith('.wav')]
        if len(wav_files_eval_nb) > 1: eval_ref_audio_path_nb = os.path.join(ljspeech_wav_dir_eval_nb, wav_files_eval_nb[1]) 
        elif len(wav_files_eval_nb) == 1: eval_ref_audio_path_nb = os.path.join(ljspeech_wav_dir_eval_nb, wav_files_eval_nb[0])

        if eval_ref_audio_path_nb:
            print(f"Using reference audio for varied inference: {eval_ref_audio_path_nb}")
            try:
                temp_waveform, temp_sr = torchaudio.load(eval_ref_audio_path_nb)
                if temp_sr != inference_config.TARGET_SAMPLE_RATE:
                    resampler_eval = torchaudio.transforms.Resample(orig_freq=temp_sr, new_freq=inference_config.TARGET_SAMPLE_RATE)
                    temp_waveform = resampler_eval(temp_waveform)
                if temp_waveform.shape[0] > 1: temp_waveform = torch.mean(temp_waveform, dim=0)
                else: temp_waveform = temp_waveform.squeeze(0)
                eval_ref_waveform_nb = temp_waveform 
            except Exception as e_ref_load:
                print(f"Error loading reference audio for varied inference: {e_ref_load}")
                eval_ref_waveform_nb = None

    if loaded_model_for_eval_nb and eval_ref_waveform_nb is not None and varied_inference_output_dir:
        sample_texts_varied = [
            "This is a test with default generation length.",
            "Let's try a slightly longer sentence to see how it performs.",
            "Short one."
        ]
        
        varied_sampling_strategies = {
            "Default_Length": None, 
            "Short_Output": {"max_mel_frames": 150}, # Passed to generate_speech_mel
            "Medium_Output": {"max_mel_frames": 400}
        }
        
        perform_varied_inference(
            model=loaded_model_for_eval_nb, 
            device=inference_config.DEVICE,
            checkpoint_path=checkpoint_to_load_eval_nb, 
            texts_to_synthesize=sample_texts_varied,
            reference_audio_waveform=eval_ref_waveform_nb, 
            reference_audio_sample_rate=inference_config.TARGET_SAMPLE_RATE,
            sampling_strategies=varied_sampling_strategies,
            target_sample_rate=inference_config.TARGET_SAMPLE_RATE,
            output_dir=varied_inference_output_dir
        )
    else:
        if not loaded_model_for_eval_nb: print("Evaluation: Model not loaded. Skipping varied inference.")
        if eval_ref_waveform_nb is None: print("Evaluation: Reference audio not loaded. Skipping varied inference.")
        if not varied_inference_output_dir: print("Evaluation: Output directory for varied inference not set. Skipping.")

elif not _inference_eval_modules_loaded:
    print("Cannot run varied inference setup due to module import failures.")

Evaluation: Loading checkpoint for varied inference: /home/denis/kpi/ucu_audio_processing_labs/GPT2TTS/training_runs_mel/training_run_mel_20250521-005331/checkpoints/best_val_loss_model.pth
Replaced GPT-2 lm_head to output 80 mel bins.
GPT-2 configured with n_positions: 1024
Speaker encoder speechbrain/spkrec-ecapa-voxceleb loaded.
Added SCL projection layer: 768 -> 192
HiFi-GAN vocoder speechbrain/tts-hifigan-ljspeech loaded.
Loaded model weights from /home/denis/kpi/ucu_audio_processing_labs/GPT2TTS/training_runs_mel/training_run_mel_20250521-005331/checkpoints/best_val_loss_model.pth (excluding speaker_encoder and hifi_gan).
Model loaded successfully for varied inference.
Using reference audio for varied inference: /home/denis/kpi/ucu_audio_processing_labs/GPT2TTS/data/LJSpeech-1.1/wavs/LJ038-0017.wav

--- Performing Varied Inference using checkpoint: /home/denis/kpi/ucu_audio_processing_labs/GPT2TTS/training_runs_mel/training_run_mel_20250521-005331/checkpoints/best_val_loss_model.

### 6.2 Evaluation Metrics (CER, UTMOS, SECS)

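Demonstrates calls to the metric functions in `evaluation.py`. Only `calculate_cer` computes a real score here; `calculate_utmos` and `calculate_secs` are placeholders.
Generated audio is saved to the latest training run's `inference_samples_notebook/metrics_sample` directory.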
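
For reference, CER is the character-level edit distance between hypothesis and reference divided by the reference length. The sketch below is a plain-Python version and is not the implementation in `evaluation.py`; the normalization step (lowercasing, dropping punctuation) is an assumption, chosen because it is consistent with the score logged for the example pair in the cell below.

```python
import string


def normalize(text):
    """Lowercase, strip ASCII punctuation, collapse whitespace (assumed normalization)."""
    table = str.maketrans("", "", string.punctuation)
    return " ".join(text.lower().translate(table).split())


def edit_distance(hyp, ref):
    """Levenshtein distance using a rolling one-row dynamic programme."""
    previous = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        current = [i]
        for j, r in enumerate(ref, start=1):
            current.append(min(
                previous[j] + 1,             # extra character in hyp (deletion)
                current[j - 1] + 1,          # missing character in hyp (insertion)
                previous[j - 1] + (h != r),  # substitution, or free match
            ))
        previous = current
    return previous[-1]


def cer(hypothesis, reference):
    """Character error rate: edit distance divided by reference length."""
    hyp, ref = normalize(hypothesis), normalize(reference)
    if not ref:
        raise ValueError("reference is empty after normalization")
    return edit_distance(hyp, ref) / len(ref)


score = cer("this is a ground truth sentense", "This is a ground truth sentence.")
print(f"{score:.4f}")
```

After normalization both strings have 31 characters and differ by one substitution, giving 1/31. Without normalization the same pair would score 3/32, since the capital `T` and the final period would also count as errors.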

In [None]:
if __name__ == '__main__' and _inference_eval_modules_loaded:
    print("\n--- Metrics Calculation (using evaluation module) ---")
    
    example_ground_truth_text = "This is a ground truth sentence."
    example_asr_hypothesis = "this is a ground truth sentense" # Example ASR output
    # For a real CER, you would generate audio, then transcribe it with an ASR
    # generated_audio_for_cer = run_single_inference_mel(...) 
    # example_asr_hypothesis = your_asr_function(generated_audio_for_cer)

    print(f"\nCalculating CER for:\n  GT:  {example_ground_truth_text}\n  ASR: {example_asr_hypothesis}")
    cer_score = calculate_cer(example_asr_hypothesis, example_ground_truth_text)
    print(f"Calculated CER Score: {cer_score:.4f}")


    print("\nDemonstrating UTMOS & SECS placeholder calls:")
    
    generated_waveform_for_metrics_nb = None
    metrics_output_dir = None
    # Try to use the model and paths from the varied inference setup cell (6.1)
    # These variables are: loaded_model_for_eval_nb, eval_ref_audio_path_nb, latest_run_full_path_nb_eval
    
    if 'latest_run_full_path_nb_eval' in locals() and latest_run_full_path_nb_eval is not None:
         metrics_output_dir = os.path.join(latest_run_full_path_nb_eval, "inference_samples_notebook", "metrics_sample")
         os.makedirs(metrics_output_dir, exist_ok=True)

    if 'loaded_model_for_eval_nb' in locals() and loaded_model_for_eval_nb is not None and \
       'eval_ref_audio_path_nb' in locals() and eval_ref_audio_path_nb is not None and \
       metrics_output_dir is not None:
        
        print(f"Generating audio for metrics using reference: {eval_ref_audio_path_nb}")
        
        # Use the reference waveform tensor if it was loaded in the previous cell
        ref_for_metrics_tensor = eval_ref_waveform_nb if 'eval_ref_waveform_nb' in locals() and eval_ref_waveform_nb is not None else None
        
        if ref_for_metrics_tensor is None: # Attempt to load if not already available
            try:
                temp_waveform_metrics, temp_sr_metrics = torchaudio.load(eval_ref_audio_path_nb)
                if temp_sr_metrics != inference_config.TARGET_SAMPLE_RATE:
                    resampler_metrics = torchaudio.transforms.Resample(orig_freq=temp_sr_metrics, new_freq=inference_config.TARGET_SAMPLE_RATE)
                    temp_waveform_metrics = resampler_metrics(temp_waveform_metrics)
                if temp_waveform_metrics.shape[0] > 1: temp_waveform_metrics = torch.mean(temp_waveform_metrics, dim=0)
                else: temp_waveform_metrics = temp_waveform_metrics.squeeze(0)
                ref_for_metrics_tensor = temp_waveform_metrics
            except Exception as e_load_metrics_ref:
                 print(f"Error loading reference for metrics: {e_load_metrics_ref}")
        
        if ref_for_metrics_tensor is not None:
            # Use run_single_inference_mel to generate and save the file
            # Need the checkpoint path that loaded_model_for_eval_nb corresponds to
            # This was checkpoint_to_load_eval_nb
            if 'checkpoint_to_load_eval_nb' in locals() and checkpoint_to_load_eval_nb is not None:
                generated_waveform_for_metrics_nb = run_single_inference_mel(
                    model_checkpoint_path=checkpoint_to_load_eval_nb, 
                    text_to_synthesize="This is a test for UTMOS and SECS evaluation.",
                    reference_audio_path=eval_ref_audio_path_nb, # Path is still needed by the function
                    output_dir=metrics_output_dir,
                    output_filename_base="metrics_generation_output",
                    max_mel_frames_override=inference_config.LOGGING_MAX_MEL_FRAMES
                )
                if generated_waveform_for_metrics_nb is not None:
                    display(IPyAudio(generated_waveform_for_metrics_nb, rate=inference_config.TARGET_SAMPLE_RATE))
            else:
                print("Checkpoint path for loaded evaluation model not found. Cannot generate audio for metrics.")
        else:
            print("Reference audio for metrics generation not available.")
            
    if generated_waveform_for_metrics_nb is None: 
        print("TTS model or reference audio not available for metrics generation, using dummy waveform for UTMOS/SECS.")
        generated_waveform_for_metrics_nb = np.random.randn(inference_config.TARGET_SAMPLE_RATE * 2) # 2s dummy
        # Save this dummy if you want to ensure the file exists for placeholder calls
        if metrics_output_dir:
            sf.write(os.path.join(metrics_output_dir, "dummy_metrics_generation_output.wav"), generated_waveform_for_metrics_nb, inference_config.TARGET_SAMPLE_RATE)

        
    if generated_waveform_for_metrics_nb is not None and generated_waveform_for_metrics_nb.size > 0:
        utmos_score = calculate_utmos(generated_waveform_for_metrics_nb, inference_config.TARGET_SAMPLE_RATE)
        
        target_speaker_waveform_for_secs_nb = None
        # Use the actual reference audio that was used for conditioning, if available
        if 'ref_for_metrics_tensor' in locals() and ref_for_metrics_tensor is not None:
            target_speaker_waveform_for_secs_nb = ref_for_metrics_tensor.cpu().numpy()
        else: # Fallback to a dummy reference if the actual one wasn't loaded
            target_speaker_waveform_for_secs_nb = np.random.randn(inference_config.TARGET_SAMPLE_RATE * 3) # 3s dummy
        
        secs_score = calculate_secs(generated_waveform_for_metrics_nb, target_speaker_waveform_for_secs_nb, inference_config.TARGET_SAMPLE_RATE)
    else:
        print("Failed to generate/load waveform for UTMOS/SECS example.")

    print("\nNote: Actual UTMOS and SECS calculations require integrating specific models/libraries as detailed in evaluation.py.")

elif not _inference_eval_modules_loaded:
    print("Cannot demonstrate metrics calculation due to module import failures.")


--- Metrics Calculation (using evaluation module) ---

Calculating CER for:
  GT:  This is a ground truth sentence.
  ASR: this is a ground truth sentense
CER: Hyp: 'this is a ground truth sentense...' Ref: 'This is a ground truth sentence....' -> Score: 0.0323

Demonstrating UTMOS & SECS placeholder calls:
Generating audio for metrics using reference: /home/denis/kpi/ucu_audio_processing_labs/GPT2TTS/data/LJSpeech-1.1/wavs/LJ038-0017.wav

--- Running Single Inference ---
Model Checkpoint: /home/denis/kpi/ucu_audio_processing_labs/GPT2TTS/training_runs_mel/training_run_mel_20250521-005331/checkpoints/best_val_loss_model.pth
Text: 'This is a test for UTMOS and SECS evaluation.'
Reference Audio: /home/denis/kpi/ucu_audio_processing_labs/GPT2TTS/data/LJSpeech-1.1/wavs/LJ038-0017.wav
Output Directory: /home/denis/kpi/ucu_audio_processing_labs/GPT2TTS/training_runs_mel/training_run_mel_20250521-005331/inference_samples_notebook/metrics_sample
Replaced GPT-2 lm_head to output 80 mel bins.
G

[Placeholder UTMOS] Waveform shape: (79360,), SR: 22050
  Actual UTMOS calculation requires a pre-trained UTMOS model and library.
[Placeholder SECS] Gen wave: (79360,), Ref wave: (138653,), SR: 22050
  Actual SECS calculation requires a pre-trained Speaker Encoder model (e.g., from xTTS or SpeechBrain).

Note: Actual UTMOS and SECS calculations require integrating specific models/libraries as detailed in evaluation.py.


## GPT2TTS Project: Summary and Future Directions

### Current Architecture and Performance

The current system employs a GPT-2 model adapted to predict mel spectrograms directly from text. This GPT-2 is conditioned on speaker/style characteristics extracted from a reference audio clip using a dedicated Conditioning Encoder followed by a Perceiver Resampler. The generated mel spectrograms are then converted to waveforms using a pre-trained HiFi-GAN vocoder from SpeechBrain. A Speaker Consistency Loss (SCL) proxy, which compares internal style embeddings with reference speaker embeddings, is also part of the training objective.
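
The source does not state how the SCL proxy compares the two embeddings. A common choice for this kind of comparison, and the one SECS-style metrics are usually built on, is cosine similarity. The sketch below is written under that assumption; the function names are hypothetical, and plain Python lists stand in for the embedding tensors a real training loop would use.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float], eps: float = 1e-8) -> float:
    """Cosine of the angle between two embedding vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("Embeddings must have the same dimensionality.")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # eps guards against division by zero for all-zero embeddings.
    return dot / max(norm_a * norm_b, eps)


def speaker_consistency_loss(style_emb: Sequence[float], ref_emb: Sequence[float]) -> float:
    """Assumed loss form: 0 when the embeddings point the same way, 2 when opposite."""
    return 1.0 - cosine_similarity(style_emb, ref_emb)


same = speaker_consistency_loss([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = speaker_consistency_loss([1.0, 0.0], [0.0, 1.0])
print(same, orthogonal)
```

Because the loss depends only on direction, scaling either embedding leaves it unchanged, which is why the two parallel vectors above give a loss of approximately zero.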

This architecture improves on our earlier Encodec-based approaches in that it avoids immediate mode collapse, but generating realistic and consistently intelligible speech remains a significant challenge. The Character Error Rate (CER) of 0.0323 shown above is a placeholder computed from a dummy sentence pair and is not representative of actual ASR performance on generated audio. Current speech samples consistently repeat the same few sounds or syllables, indicating that the model has not successfully learned the complex mapping from text to diverse, coherent mel spectrogram sequences.
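
The 0.0323 figure can be reproduced by hand, which shows how little it measures. The following is a minimal sketch, assuming the CER helper lowercases both strings and strips punctuation before computing a character-level edit distance. That normalization is an assumption: it is consistent with the printed score, but `evaluation.py` is not shown here, and the function names below are hypothetical.

```python
import string


def normalize(text: str) -> str:
    """Lowercase and drop ASCII punctuation (assumed normalization)."""
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).strip()


def levenshtein(ref: str, hyp: str) -> int:
    """Character-level edit distance, computed with a rolling DP row."""
    prev = list(range(len(hyp) + 1))
    for i, ref_char in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_char in enumerate(hyp, start=1):
            cost = 0 if ref_char == hyp_char else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Edit distance divided by the length of the normalized reference."""
    ref_n, hyp_n = normalize(ref), normalize(hyp)
    if not ref_n:
        return 0.0 if not hyp_n else 1.0
    return levenshtein(ref_n, hyp_n) / len(ref_n)


score = cer("This is a ground truth sentence.", "this is a ground truth sentense")
print(f"{score:.4f}")
```

Under these assumptions the normalized reference has 31 characters and the hypothesis differs by one substitution (`c` replaced by `s`), giving 1/31 ≈ 0.0323. The score therefore reflects the two hand-written strings, not any synthesized audio.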

### Previous Attempts

Before this mel-spectrogram-based architecture, the model was designed to predict discrete audio tokens from an Encodec model. That version invariably suffered from mode collapse: it produced highly repetitive, non-speech-like audio sequences regardless of the input text or training duration. In addition, because of how Encodec tokenizes audio and GPT-2's 1024-token context window, it was effectively limited to generating audio shorter than one second. The shift to predicting continuous mel spectrograms was a deliberate move to mitigate both the severe mode collapse and this restrictive length limitation.
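
The length limit follows from simple token arithmetic. The sketch below assumes the 24 kHz Encodec model, which emits 75 frames per second, and assumes that all codebooks of a frame are flattened into a single token sequence. The codebook counts and the 100-token text prompt are illustrative values, not settings taken from this project.

```python
CONTEXT_TOKENS = 1024    # GPT-2 context window
ENCODEC_FRAME_RATE = 75  # frames per second for the 24 kHz Encodec model


def max_audio_seconds(num_codebooks: int, prompt_tokens: int = 0) -> float:
    """Longest clip whose flattened Encodec tokens still fit in the context."""
    audio_budget = CONTEXT_TOKENS - prompt_tokens
    tokens_per_second = ENCODEC_FRAME_RATE * num_codebooks
    return audio_budget / tokens_per_second


# Illustrative codebook counts; each frame contributes one token per codebook.
for num_codebooks in (8, 16, 32):
    seconds = max_audio_seconds(num_codebooks, prompt_tokens=100)
    print(f"{num_codebooks} codebooks: {seconds:.2f} s")
```

With 16 flattened codebooks and 100 prompt tokens, for example, 924 tokens remain for audio at 1200 tokens per second, which is about 0.77 s. A mel-spectrogram model uses one sequence position per frame instead, so the same context covers far more audio.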

### Potential Future Steps

To improve the current system, several avenues can be explored. Further refinement of the mel prediction process is key; this could involve experimenting with different loss functions for mel spectrograms, such as incorporating adversarial losses or perceptual losses, beyond the current L1/MSE. The architecture of the GPT-2 model itself, or the conditioning mechanism, might benefit from adjustments tailored to the regression task of mel prediction. Enhancing the SCL proxy or implementing a more direct form of speaker consistency training could improve voice cloning.

Investigating the autoregressive generation of mel frames is also important. Techniques like scheduled sampling during training, or different sampling strategies during inference (if applicable to continuous outputs), might help reduce repetitiveness and improve naturalness. Furthermore, exploring different vocoders or fine-tuning the existing HiFi-GAN on the model's predicted mels could further enhance audio quality. More sophisticated methods for injecting speaker identity, perhaps through adaptive layer normalization or speaker-conditional layers within the GPT-2, could also be considered.
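
Scheduled sampling, mentioned above, can be sketched as follows. At each training step the decoder is fed either the ground-truth previous frame (teacher forcing) or its own previous prediction, and the teacher-forcing probability decays as training progresses. The sketch uses plain Python lists and a toy `toy_model` function as stand-ins for mel frames and the GPT-2 forward pass. All names, the linear decay schedule, and its parameters are assumptions for illustration, not part of this project.

```python
import random
from typing import Callable, List

Frame = List[float]


def teacher_forcing_prob(step: int, total_steps: int, floor: float = 0.2) -> float:
    """Linearly decay the teacher-forcing probability from 1.0 down to `floor`."""
    progress = min(step / max(total_steps, 1), 1.0)
    return 1.0 - (1.0 - floor) * progress


def rollout_with_scheduled_sampling(
    target_frames: List[Frame],
    predict_next: Callable[[Frame], Frame],
    p_teacher: float,
    rng: random.Random,
) -> List[Frame]:
    """Predict frames 1..T-1, choosing each next input frame by a coin flip."""
    predictions: List[Frame] = []
    prev_input = target_frames[0]  # the first frame is always given
    for t in range(1, len(target_frames)):
        pred = predict_next(prev_input)
        predictions.append(pred)
        # Use the ground truth with probability p_teacher, else feed back the prediction.
        prev_input = target_frames[t] if rng.random() < p_teacher else pred
    return predictions


def toy_model(frame: Frame) -> Frame:
    """Toy stand-in for the model: shrinks every mel bin by 10%."""
    return [0.9 * x for x in frame]


targets = [[1.0, 2.0] for _ in range(4)]
p = teacher_forcing_prob(step=500, total_steps=1000)
preds = rollout_with_scheduled_sampling(targets, toy_model, p, random.Random(0))
print(p, len(preds))
```

With `p_teacher=1.0` this reduces to ordinary teacher forcing, and with `p_teacher=0.0` it becomes a fully free-running rollout in which prediction errors compound, which is the condition the model faces at inference time. In a tensor-based training loop the fed-back prediction is often detached from the computation graph, though that is a design choice to validate experimentally.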