# Setup
-  Follow the setup instructions based on your preferred environment!

## Local

One of our key goals in designing this assignment is to allow you to complete most of the preliminary implementation work locally.  
We highly recommend that you **pass all tests locally** using the provided `hw4_data_subset` before moving to a GPU runtime.  
To do this, simply:

### Create a new conda environment
```bash
# Be sure to deactivate any active environments first
conda create -n hw4 python=3.12.4
```

### Activate the conda environment
```bash
conda activate hw4
```

### Install the dependencies using the provided `requirements.txt`
```bash
pip install --no-cache-dir --ignore-installed -r requirements.txt
```

### Ensure that your notebook is in the same working directory as the `Handout`
This can be achieved by:
1. Physically moving the notebook into the handout directory.
2. Changing the notebook’s current working directory to the handout directory using the `os.chdir()` function.
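
The second option can be sketched as follows. The helper name `enter_handout` and the `EXPECTED` list are illustrative assumptions (the list mirrors the directory listing shown in this section); they are not part of the handout.

```python
import os

# Entries the handout directory is expected to contain
# (assumption: copied from the directory listing in this section).
EXPECTED = ["README.md", "requirements.txt", "hw4lib", "mytorch", "tests", "hw4_data_subset"]

def enter_handout(handout_dir, expected=EXPECTED):
    """Change into handout_dir and return the expected entries that are missing."""
    os.chdir(handout_dir)
    present = set(os.listdir("."))
    return [name for name in expected if name not in present]

# Example usage (hypothetical path):
# missing = enter_handout("/path/to/Handout")
# print("Missing entries:", missing or "none")
```

An empty return value suggests the notebook is now running from the right place; any names returned are the ones to look for.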

### Open the notebook and select the newly created environment from the kernel selector.

If everything was done correctly, you should see at least the following files in your current working directory after running `!ls`:
```
.
├── README.md
├── requirements.txt
├── hw4lib/
├── mytorch/
├── tests/
└── hw4_data_subset/
```

## Colab

### Step 1: Get your handout
- See writeup for recommended approaches.

In [None]:
# FailSafe
!rm -rf /content/IDL_HW4

import os
os.kill(os.getpid(), 9)

In [None]:
# Example: My preferred approach
import os
# Settings -> Developer Settings -> Personal Access Tokens -> Token (classic)
# Never commit a real token to a shared notebook; paste yours here only at runtime.
os.environ['GITHUB_TOKEN'] = "<YOUR_GITHUB_TOKEN>"

GITHUB_USERNAME = "adi274903"
REPO_NAME       = "IDL_HW4"
TOKEN = os.environ.get("GITHUB_TOKEN")
repo_url        = f"https://{TOKEN}@github.com/{GITHUB_USERNAME}/{REPO_NAME}"
!git clone {repo_url}

Cloning into 'IDL_HW4'...
remote: Enumerating objects: 1111, done.
remote: Counting objects: 100% (85/85), done.
remote: Compressing objects: 100% (85/85), done.
remote: Total 1111 (delta 61), reused 0 (delta 0), pack-reused 1026 (from 2)
Receiving objects: 100% (1111/1111), 27.34 MiB | 16.71 MiB/s, done.
Resolving deltas: 100% (536/536), done.


In [None]:
# To pull latest changes (Must be in the repo dir, use pwd/ls to verify)
!cd {REPO_NAME} && git pull

Already up to date.


### Step 2: Install Dependencies
- `NOTE`: Your runtime will be restarted to ensure all dependencies are updated.
- `NOTE`: You will see a "runtime crashed" message. This is intentional; simply move on to the next cell.

In [None]:
!pip uninstall transformers -y
%pip install --no-deps -r /content/IDL_HW4/IDL-HW4/requirements.txt
import os
os.kill(os.getpid(), 9) # NOTE: This will restart your Colab Python runtime (required)!

Found existing installation: transformers 4.51.3
Uninstalling transformers-4.51.3:
  Successfully uninstalled transformers-4.51.3
Collecting appnope (from -r /content/IDL_HW4/IDL-HW4/requirements.txt (line 2))
  Downloading appnope-0.1.4-py2.py3-none-any.whl.metadata (908 bytes)
Collecting asttokens (from -r /content/IDL_HW4/IDL-HW4/requirements.txt (line 3))
  Downloading asttokens-3.0.0-py3-none-any.whl.metadata (4.7 kB)
Collecting beautifulsoup4==4.13.3 (from -r /content/IDL_HW4/IDL-HW4/requirements.txt (line 4))
  Downloading beautifulsoup4-4.13.3-py3-none-any.whl.metadata (3.8 kB)
Collecting certifi==2024.12.14 (from -r /content/IDL_HW4/IDL-HW4/requirements.txt (line 5))
  Downloading certifi-2024.12.14-py3-none-any.whl.metadata (2.3 kB)
Collecting charset-normalizer==3.4.0 (from -r /content/IDL_HW4/IDL-HW4/requirements.txt (line 6))
  Downloading charset_normalizer-3.4.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (34 kB)
Collecting comm (from -r /content/

### Step 3: Obtain Data

- `NOTE`: This process will automatically download and unzip data for both `HW4P1` and `HW4P2`.  


In [None]:
!curl -L -o /content/s25-hw4-data.zip https://www.kaggle.com/api/v1/datasets/download/cmu11785/s25-hw4-data
!unzip -q -o /content/s25-hw4-data.zip -d /content/hw4_data
!rm -rf /content/s25-hw4-data.zip
!du -h --max-depth=2 /content/hw4_data

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100 11.9G  100 11.9G    0     0  22.2M      0  0:09:09  0:09:09 --:--:-- 22.5M
757M	/content/hw4_data/hw4p2_data/dev-clean
748M	/content/hw4_data/hw4p2_data/test-clean
14G	/content/hw4_data/hw4p2_data/train-clean-100
16G	/content/hw4_data/hw4p2_data
1.1G	/content/hw4_data/hw4p1_data/train
28M	/content/hw4_data/hw4p1_data/val
28M	/content/hw4_data/hw4p1_data/test
1.1G	/content/hw4_data/hw4p1_data
17G	/content/hw4_data


### Step 4: Move to Handout Directory
You must be within the handout directory for the library imports to work!

- `NOTE`: You may have to repeat running this command anytime you restart your runtime.
- `NOTE`: You can do a `pwd` to check if you are in the right directory.
- `NOTE`: As currently set up, your data directory should be one level up from your project directory. Keep this in mind when setting your `root` in the config file.

If everything was done correctly, you should see at least the following files in your current working directory after running `!ls`:
```
.
├── README.md
├── requirements.txt
├── hw4lib/
├── mytorch/
├── tests/
└── hw4_data_subset/

```

In [None]:
import os
os.chdir('/content/IDL_HW4/IDL-HW4')
!ls

hw4_data_subset  mytorch    requirements_tpu.txt  tests
hw4lib		 README.md  requirements.txt


# Imports
- If your setup was done correctly, you should be able to run the following cell without any issues.

In [None]:
#!git pull {repo_url} main
from hw4lib.data import (
    H4Tokenizer,
    ASRDataset,
    verify_dataloader
)
from hw4lib.model import (
    DecoderOnlyTransformer,
    EncoderDecoderTransformer
)
from hw4lib.utils import (
    create_scheduler,
    create_optimizer,
    plot_lr_schedule
)
from hw4lib.trainers import (
    ASRTrainer,
    ProgressiveTrainer
)
from torch.utils.data import DataLoader
import yaml
import gc
import torch
from torchinfo import summary
import os
import json
import wandb
import pandas as pd
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

Using device: cuda


# Implementations
- `NOTE`: All of these implementations have detailed specifications, implementation details, and hints in their respective source files. Make sure to read them in their entirety before you start!

## Dataset Implementation
- Implement the `ASRDataset` class in `hw4lib/data/asr_dataset.py`.
- You will have to implement parts of `__init__` and completely implement the `__len__`, `__getitem__` and `collate_fn` methods.
- Run the cell below to check your implementation.
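
As a generic illustration of what a collate step for variable-length sequences usually involves, here is a plain-Python sketch of building shifted/golden transcripts and right-padding a batch. The function names are hypothetical, and the special-token IDs are assumed from the tokenizer printout later in this notebook; the specification in `hw4lib/data/asr_dataset.py` remains authoritative and may differ.

```python
# Special-token IDs (assumption: taken from the tokenizer printout in this notebook).
PAD, SOS, EOS = 0, 3, 4

def make_targets(token_ids):
    """Return (shifted, golden): the decoder input starts with SOS, the target ends with EOS."""
    return [SOS] + token_ids, token_ids + [EOS]

def pad_batch(sequences, pad_value=PAD):
    """Right-pad variable-length lists to the longest one; also return the original lengths."""
    lengths = [len(seq) for seq in sequences]
    max_len = max(lengths)
    padded = [seq + [pad_value] * (max_len - len(seq)) for seq in sequences]
    return padded, lengths

shifted, golden = make_targets([14, 15])
print(shifted, golden)
print(pad_batch([[1], [1, 2, 3]]))
```

Keeping the unpadded lengths alongside the padded batch is what later lets padding masks be built.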


In [None]:
!python -m tests.test_dataset_asr

Loading data for train-clean-100 partition...
  0% 0/28 [00:00<?, ?it/s]100% 28/28 [00:00<00:00, 707.28it/s]
Loading data for test-clean partition...
  0% 0/2 [00:00<?, ?it/s]100% 2/2 [00:00<00:00, 4258.18it/s]

Running tests for category: ASRDataset Train
--------------------------------------------------------------------------------

[01/01]    Running:  Test a Train instance of ASRDataset class
Testing __init__ method ...
Test Passed: Dataset length matches FBANK files.
Test Passed: Dataset length matches TRANSCRIPT files.
[01/01]    FAILED:   Test a Train instance of ASRDataset class
           Error:    FBANK file ./hw4_data_subset/hw4p2_data/train-clean-100/fbank/103-1240-0000.npy and TRANSCRIPT file ./hw4_data_subset/hw4p2_data/train-clean-100/text/103-1240-0000.npy are misaligned.


Running tests for category: ASRDataset Test
--------------------------------------------------------------------------------

[01/01]    Running:  Test a Test ins

## Model Implementations

Overview:

- Implement the `CrossAttentionLayer` class in `hw4lib/model/sublayers.py`.
- Implement the `CrossAttentionDecoderLayer` class in `hw4lib/model/decoder_layers.py`.
- Implement the `SelfAttentionEncoderLayer` class in `hw4lib/model/encoder_layers.py`. This will be mostly a copy-paste of the `SelfAttentionDecoderLayer` class in `hw4lib/model/decoder_layers.py` with one minor difference: it can attend to all positions in the input sequence.
- Implement the `EncoderDecoderTransformer` class in `hw4lib/model/transformers.py`.
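
The difference between decoder-style and encoder-style self-attention comes down to the attention mask. The sketch below uses plain Python lists and assumes the convention that `True` marks a blocked position; the function names are hypothetical and not part of `hw4lib`.

```python
def causal_mask(seq_len):
    """Decoder-style mask: mask[i][j] is True when position i must NOT attend to position j (j > i)."""
    return [[j > i for j in range(seq_len)] for i in range(seq_len)]

def full_mask(seq_len):
    """Encoder-style mask: nothing is blocked, so every position can attend to every other."""
    return [[False] * seq_len for _ in range(seq_len)]

for row in causal_mask(3):
    print(row)
```

In practice an encoder layer often passes no attention mask at all and relies only on the padding mask.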

### Transformer Sublayers
- Now, implement the `CrossAttentionLayer` class in `hw4lib/model/sublayers.py`.
- `NOTE`: You should have already implemented the `SelfAttentionLayer` and `FeedForwardLayer` classes in `hw4lib/model/sublayers.py`.
- Run the cell below to check your implementation.

In [None]:
!python -m tests.test_sublayer_crossattention


Running tests for category: CrossAttentionLayer
--------------------------------------------------------------------------------

[01/01]    Running:  Test the cross-attention sublayer
Testing initialization ...
Test Passed: All layers exist and are instantiated correctly
Testing forward shapes ...
Test Passed: Forward pass returns the correct shapes
Testing padding mask behaviour ...
Test Passed: Padding mask is applied correctly
Testing cross-attention behaviour ...
Test Passed: Cross-attention behavior is correct
Testing residual connection ...
Test Passed: Residual connection is applied correctly
[01/01]    PASSED:   Test the cross-attention sublayer


                                  Test Summary
Category:    CrossAttentionLayer
Results:     1/1 tests passed (100.0%)


### Transformer Cross-Attention Decoder Layer
- Implement the `CrossAttentionDecoderLayer` class in `hw4lib/model/decoder_layers.py`.
- Then run the cell below to check your implementation.


In [None]:
!python -m tests.test_decoderlayer_crossattention


Running tests for category: CrossAttentionDecoderLayer
--------------------------------------------------------------------------------

[01/01]    Running:  Test the cross-attention decoder layer
Testing initialization ...
Test Passed: All sublayers exist and are initialized correctly
Testing forward shapes ...
Test Passed: Forward shapes are as expected
Testing sublayer integration ...
Test Passed: Sublayers interact correctly
Testing cross-attention behavior ...
Test Passed: Cross-attention behaves correctly
[01/01]    PASSED:   Test the cross-attention decoder layer


                                  Test Summary
Category:    CrossAttentionDecoderLayer
Results:     1/1 tests passed (100.0%)


### Transformer Self-Attention Encoder Layer
- Implement the `SelfAttentionEncoderLayer` class in `hw4lib/model/encoder_layers.py`.
- Then run the cell below to check your implementation.




In [None]:
!python -m tests.test_encoderlayer_selfattention


Running tests for category: SelfAttentionEncoderLayer
--------------------------------------------------------------------------------

[01/01]    Running:  Test the self-attention encoder layer
Testing initialization ...
Test Passed: All sublayers exist and are initialized correctly
Testing forward shapes ...
Test Passed: Forward shapes are as expected
Testing sublayer interaction ...
Test Passed: Sublayers interact correctly
Testing bidirectional attention ...
Test Passed: Bidirectional attention is working correctly
[01/01]    PASSED:   Test the self-attention encoder layer


                                  Test Summary
Category:    SelfAttentionEncoderLayer
Results:     1/1 tests passed (100.0%)


### Encoder-Decoder Transformer

- Implement the  `EncoderDecoderTransformer` class in `hw4lib/model/transformers.py`.
- Then run the cell below to check your implementation.

In [None]:
!python -m tests.test_transformer_encoder_decoder


Running tests for category: EncoderDecoderTransformer
--------------------------------------------------------------------------------

[01/01]    Running:  Test the encoder-decoder transformer
Testing initialization...
Test Passed: All components initialized correctly
Testing encode method...
Test Passed: Encode method works correctly
Testing decode method...
Test Passed: Decode method works correctly
Testing forward pass...
Test Passed: Forward pass works correctly
Testing encoder-decoder integration...
Test Passed: Encoder-decoder integration works correctly
Testing CTC integration...
Test Passed: CTC integration works correctly
Testing forward propagation order...
Test Passed: Forward propagation order is correct
[01/01]    PASSED:   Test the encoder-decoder transformer


                                  Test Summary
Category:    EncoderDecoderTransformer
Results:     1/1 tests passed (100.0%)


## Decoding Implementation
- We highly recommend implementing the `generate_beam` method of the `SequenceGenerator` class in `hw4lib/decoding/sequence_generator.py`.
- Then run the cell below to check your implementation.
- `NOTE`: This is an optional but highly recommended task for `HW4P2` to ease the journey to high cutoffs!
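
For intuition, here is a minimal, framework-free sketch of beam search over a toy scoring function. It is not the `generate_beam` interface: `beam_search`, `toy_model`, and the token strings are all hypothetical, and refinements such as length normalization, temperature, or repeat penalties are left out.

```python
import math

def beam_search(next_log_probs, sos, eos, beam_width, max_len):
    """Generic beam search.

    next_log_probs(prefix) returns a dict {token: log-probability of token after prefix}
    (a hypothetical scoring callback). Returns (sequence, score) pairs, best first.
    """
    beams = [([sos], 0.0)]
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == eos:
                candidates.append((seq, score))  # finished beams are carried over unchanged
                continue
            for tok, lp in next_log_probs(seq).items():
                candidates.append((seq + [tok], score + lp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]  # keep only the best `beam_width` hypotheses
        if all(seq[-1] == eos for seq, _ in beams):
            break
    return beams

def toy_model(prefix):
    """Toy next-token distribution that depends only on the last token."""
    table = {
        "<s>": {"A": math.log(0.6), "B": math.log(0.4)},
        "A": {"</s>": math.log(0.45), "B": math.log(0.55)},
        "B": {"</s>": math.log(0.9), "A": math.log(0.1)},
    }
    return table[prefix[-1]]

results = beam_search(toy_model, "<s>", "</s>", beam_width=2, max_len=4)
for seq, score in results:
    print(seq, round(math.exp(score), 3))
```

In this toy setup a greedy decoder would commit to `A` first (probability 0.6), while the beam keeps `B` alive and ends up with the higher-probability sequence.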

In [None]:
!python -m tests.test_decoding --mode beam


Running tests for category: Decoding
--------------------------------------------------------------------------------

[01/01]    Running:  Test beam decoding
Testing Single Batch Beam Search ...
Beam 0  : generated: HELLO WORLD  | expected: HELLO WORLD 
Beam 1  : generated: YELLOW WORLD | expected: YELLOW WORLD
Beam 2  : generated: MELLOW WORLD | expected: MELLOW WORLD
Testing Multi Batch Beam Search ...
Batch 0  : Beam 0  : generated: HELLO WORLD  | expected: HELLO WORLD 
Batch 0  : Beam 1  : generated: YELLOW WORLD | expected: YELLOW WORLD
Batch 0  : Beam 2  : generated: MELLOW WORLD | expected: MELLOW WORLD
Batch 1  : Beam 0  : generated: GOOD BYE     | expected: GOOD BYE    
Batch 1  : Beam 1  : generated: GREAT DAY    | expected: GREAT DAY   
Batch 1  : Beam 2  : generated: GUD NIGHT    | expected: GUD NIGHT   
[01/01]    PASSED:   Test beam decoding


                                  Test Summary
Category:    Decodin

## Trainer Implementation
You will have to do some minor in-filling for the `ASRTrainer` class in `hw4lib/trainers/asr_trainer.py` before you can use it.
- Fill in the `TODO`s in the `__init__`.
- Fill in the `TODO`s in the `_train_epoch`.
- Fill in the `TODO`s in the `recognize` method.
- Fill in the `TODO`s in the `_validate_epoch`.
- Fill in the `TODO`s in the `train` method.
- Fill in the `TODO`s in the `evaluate` method.

`WARNING`: There are no tests for this. Implement carefully!

# Experiments
From this point onwards you may want to switch to a `GPU` runtime.
- `OBJECTIVE`: Optimize your model for `CER` on the test set.
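
As a reminder of what is being optimized, character error rate (CER) is the character-level edit distance between hypothesis and reference, divided by the reference length. The sketch below is a plain-Python illustration; the function names are hypothetical, and reporting the result as a percentage is an assumption (the trainer's own metric code is authoritative).

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings (insertions, deletions, substitutions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def cer(ref, hyp):
    """Character error rate in percent; `ref` must be non-empty."""
    return 100.0 * edit_distance(ref, hyp) / len(ref)

print(cer("HELLO WORLD", "YELLOW WORLD"))
```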

## Config
- You can use the `config.yaml` file to set your config for your ablation study.

---
### Notes:

- Set `tokenization: token_type:` to specify your desired tokenization strategy
- You will need to set the root path to your `hw4p2_data` folder in `data: root:`. This will depend on your setup. For example, if you are following our setup instructions:
  - `PSC`: `"/local/hw4_data/hw4p2_data"`
  - `Colab`: `"/content/hw4_data/hw4p2_data"`
  - `Kaggle`: `"/kaggle/input/s25-hw4-data/hw4p2_data"`
- There are extra configuration options in the `optimizer` section that are only relevant if you decide to use the `create_optimizer` function we've provided in `hw4lib/utils/create_optimizer.py`.
- `BE CAREFUL` when setting numeric values. E.g. `1e-4` will be parsed as a `str`, while `1.0e-4` is parsed as a `float`.
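
If a value written without a decimal point has already been loaded as a string, one defensive option is to convert such strings after loading. The helper below is a hypothetical sketch (it is not part of `hw4lib`) that only touches strings shaped like `1e-4`, so ordinary strings such as `"10k"` or `"adamw"` are left alone.

```python
import re

# Matches scientific notation without a decimal point in the mantissa, e.g. "1e-4" or "-3E5".
_SCI_NO_DOT = re.compile(r"^[+-]?\d+[eE][+-]?\d+$")

def coerce_sci_floats(node):
    """Recursively convert strings like "1e-4" inside dicts/lists into floats."""
    if isinstance(node, dict):
        return {key: coerce_sci_floats(value) for key, value in node.items()}
    if isinstance(node, list):
        return [coerce_sci_floats(value) for value in node]
    if isinstance(node, str) and _SCI_NO_DOT.match(node):
        return float(node)
    return node

loaded = {"lr": "1e-4", "name": "adamw", "betas": [0.9, "1e-3"], "token_type": "10k"}
print(coerce_sci_floats(loaded))
```

Writing the value as `1.0e-4` in the YAML file avoids the problem at the source and is the simpler fix.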

In [None]:
%%writefile config.yaml
Name                      : "Aditya Sannabhadti"

###### Tokenization ------------------------------------------------------------
tokenization:
  token_type                : "10k"       # [char, 1k, 5k, 10k]
  token_map :
      'char': 'hw4lib/data/tokenizer_jsons/tokenizer_char.json'
      '1k'  : 'hw4lib/data/tokenizer_jsons/tokenizer_1000.json'
      '5k'  : 'hw4lib/data/tokenizer_jsons/tokenizer_5000.json'
      '10k' : 'hw4lib/data/tokenizer_jsons/tokenizer_10000.json'

###### Dataset -----------------------------------------------------------------
data:
  root                 : "/content/hw4_data/hw4p2_data"  # TODO: Set the root path of your data
  train_partition      : "train-clean-100"  # paired text-speech for ASR pre-training
  val_partition        : "dev-clean"        # paired text-speech for ASR pre-training
  test_partition       : "test-clean"       # paired text-speech for ASR pre-training
  subset               : 1.0                # Load a subset of the data (for debugging, testing, etc)
  batch_size           : 8                # Reduced batch size for more frequent updates
  NUM_WORKERS          : 4                  # Set to 0 for CPU
  norm                 : 'global_mvn'       # ['global_mvn', 'cepstral', 'none']
  num_feats            : 80

  ###### SpecAugment ---------------------------------------------------------------
  specaug                   : True           # Set to True if you want to use SpecAugment
  specaug_conf:
    apply_freq_mask         : True
    freq_mask_width_range   : 15
    num_freq_mask           : 2
    apply_time_mask         : True
    time_mask_width_range   : 80
    num_time_mask           : 2

###### Network Specs -------------------------------------------------------------
model: # Encoder-Decoder Transformer (HW4P2)
  # Speech embedding parameters
  input_dim: 80              # Speech feature dimension
  time_reduction: 4          # Time dimension downsampling factor
  reduction_method: 'both'   # The source_embedding reduction method ['lstm', 'conv', 'both']

  # Architecture parameters
  d_model: 256              # Model dimension
  num_encoder_layers: 12     # Number of encoder layers
  num_decoder_layers: 6     # Number of decoder layers
  num_encoder_heads: 4      # Number of encoder attention heads
  num_decoder_heads: 4      # Number of decoder attention heads
  d_ff_encoder: 1024        # Feed-forward dimension for encoder
  d_ff_decoder: 1024        # Feed-forward dimension for decoder
  skip_encoder_pe: False    # Whether to skip positional encoding for encoder
  skip_decoder_pe: False    # Whether to skip positional encoding for decoder

  # Common parameters
  dropout: 0.2              # Increased dropout rate for better regularization
  layer_drop_rate: 0.01     # Slightly higher layer drop rate for better regularization
  weight_tying: True        # Enable weight tying for better generalization

###### Common Training Parameters ------------------------------------------------
training:
  use_wandb                   : True   # Toggle wandb logging
  wandb_run_id                : "" # "none" or "run_id"
  resume                      : True   # Resume an existing run (run_id != 'none')
  gradient_accumulation_steps : 2
  wandb_project               : "hw4p2" # wandb project to log to

###### Loss ----------------------------------------------------------------------
loss: # Just good ol' CrossEntropy
  label_smoothing: 0.1        # Small label smoothing to help generalization
  ctc_weight: 0.5

###### Optimizer -----------------------------------------------------------------
optimizer:
  name: "adamw" # Options: sgd, adam, adamw
  lr: 0.0003    # Base learning rate

  # Common parameters
  weight_decay: 0.000001

  # Parameter groups
  param_groups:
    - name: self_attn
      patterns: []  # Name patterns selecting the parameters in this group (e.g. "self_attn"); left empty here
      lr: 0.0002    # LR for self_attn
      layer_decay:
        enabled: False
        decay_rate: 0.8

    - name: ffn
      patterns: [] # Will match all parameters containing "ffn" and set their learning rate to 0.0002
      lr: 0.0002   # LR for ffn
      layer_decay:
        enabled: False
        decay_rate: 0.8

  # Layer-wise learning rates
  layer_decay:
    enabled: False
    decay_rate: 0.75

  # AdamW specific parameters
  adamw:
    betas: [0.9, 0.999]
    eps: 1.0e-8
    amsgrad: False

###### Scheduler -----------------------------------------------------------------
scheduler:
  name: "cosine_warm"  # Options: reduce_lr, cosine, cosine_warm

  # CosineAnnealingWarmRestarts specific parameters
  cosine_warm:
    T_0: 10    # Number of iterations for the first restart
    T_mult: 10  # Factor increasing T_i after each restart
    eta_min: 0.0000001  # Minimum learning rate
    last_epoch: -1

  # Warmup parameters (can be used with any scheduler)
  warmup:
    enabled: True
    type: "exponential"  # Options: linear, exponential
    epochs: 5
    start_factor: 0.1
    end_factor: 1.0

###### Validation ---------------------------------------------------------------------------
validation:
  # Settings for the _validate_epoch run
  num_batches: null  # Use null to process all batches, or an integer for a limit (a bare None would be read as the string "None")
  beam_width: 4      # Set desired beam width (1 for greedy)
  temperature: 1.0
  repeat_penalty: 1.2

Overwriting config.yaml


In [None]:
with open('config.yaml', 'r') as file:
    config = yaml.safe_load(file)

## Tokenizer

In [None]:
Tokenizer = H4Tokenizer(
    token_map  = config['tokenization']['token_map'],
    token_type = config['tokenization']['token_type']
)

                         Tokenizer Configuration (10k)                          
--------------------------------------------------------------------------------
Vocabulary size:     10000

Special Tokens:
PAD:              0
UNK:              1
MASK:             2
SOS:              3
EOS:              4
BLANK:            5

Validation Example:
--------------------------------------------------------------------------------
Input text:  [SOS]HI DEEP LEARNERS[EOS]
Tokens:      ['[SOS]', 'H', 'I', 'ĠDEEP', 'ĠLEARN', 'ERS', '[EOS]']
Token IDs:   [3, 14, 15, 1169, 2545, 214, 4]
Decoded:     [SOS]HI DEEP LEARNERS[EOS]


## Datasets

In [None]:
train_dataset = ASRDataset(
    partition=config['data']['train_partition'],
    config=config['data'],
    tokenizer=Tokenizer,
    isTrainPartition=True,
    global_stats=None  # Will compute stats from training data
)

# TODO: Get the computed global stats from training set
global_stats = None
if config['data']['norm'] == 'global_mvn':
    global_stats = (train_dataset.global_mean, train_dataset.global_std)
    print(f"Global stats computed from training set.")

val_dataset = ASRDataset(
    partition=config['data']['val_partition'],
    config=config['data'],
    tokenizer=Tokenizer,
    isTrainPartition=False,
    global_stats=global_stats
)

test_dataset = ASRDataset(
    partition=config['data']['test_partition'],
    config=config['data'],
    tokenizer=Tokenizer,
    isTrainPartition=False,
    global_stats=global_stats
)

gc.collect()

Loading data for train-clean-100 partition...


100%|██████████| 28539/28539 [00:30<00:00, 925.93it/s]


Global stats computed from training set.
Loading data for dev-clean partition...


100%|██████████| 2703/2703 [00:01<00:00, 1785.80it/s]


Loading data for test-clean partition...


100%|██████████| 2620/2620 [00:00<00:00, 3407.23it/s]


1309

## Dataloaders

In [None]:
train_loader    = DataLoader(
    dataset     = train_dataset,
    batch_size  = config['data']['batch_size'],
    shuffle     = True,
    num_workers = config['data']['NUM_WORKERS'] if device == 'cuda' else 0,
    pin_memory  = True,
    collate_fn  = train_dataset.collate_fn
)

val_loader      = DataLoader(
    dataset     = val_dataset,
    batch_size  = config['data']['batch_size'],
    shuffle     = False,
    num_workers = config['data']['NUM_WORKERS'] if device == 'cuda' else 0,
    pin_memory  = True,
    collate_fn  = val_dataset.collate_fn
)

test_loader     = DataLoader(
    dataset     = test_dataset,
    batch_size  = config['data']['batch_size'],
    shuffle     = False,
    num_workers = config['data']['NUM_WORKERS'] if device == 'cuda' else 0,
    pin_memory  = True,
    collate_fn  = test_dataset.collate_fn
)

gc.collect()

0

### Dataloader Verification

In [None]:
verify_dataloader(train_loader)

             Dataloader Verification              
Dataloader Partition     : train-clean-100
--------------------------------------------------
Number of Batches        : 3568
Batch Size               : 8
--------------------------------------------------
Checking shapes of the data...                    

Feature Shape            : [8, 1964, 80]
Shifted Transcript Shape : [8, 52]
Golden Transcript Shape  : [8, 52]
Feature Lengths Shape    : [8]
Transcript Lengths Shape : [8]
--------------------------------------------------
Max Feature Length       : 3066
Max Transcript Length    : 89
Avg. Chars per Token     : 4.64


In [None]:
verify_dataloader(val_loader)

             Dataloader Verification              
Dataloader Partition     : dev-clean
--------------------------------------------------
Number of Batches        : 338
Batch Size               : 8
--------------------------------------------------
Checking shapes of the data...                    

Feature Shape            : [8, 3676, 80]
Shifted Transcript Shape : [8, 95]
Golden Transcript Shape  : [8, 95]
Feature Lengths Shape    : [8]
Transcript Lengths Shape : [8]
--------------------------------------------------
Max Feature Length       : 4081
Max Transcript Length    : 122
Avg. Chars per Token     : 4.57


In [None]:
verify_dataloader(test_loader)

             Dataloader Verification              
Dataloader Partition     : test-clean
--------------------------------------------------
Number of Batches        : 328
Batch Size               : 8
--------------------------------------------------
Checking shapes of the data...                    

Feature Shape            : [8, 1320, 80]
Feature Lengths Shape    : [8]
--------------------------------------------------
Max Feature Length       : 4370
Max Transcript Length    : 0
Avg. Chars per Token     : 0.00


## Calculate Max Lengths
Calculating the maximum transcript length across your dataset is a crucial step when working with certain transformer models.
- We'll use sinusoidal positional encodings that must be precomputed up to a fixed maximum length.
- This maximum length is a hyperparameter that determines:
  - The longest sequence your model can process
  - The size of your positional encoding matrix
  - Memory requirements during training and inference
- `Requirements`: For this assignment, ensure your positional encodings can accommodate at least the longest sequence in your dataset to prevent truncation. However, you can set this value higher if you anticipate using your language model to work with longer sequences in future tasks (hint: this might be useful for P2! 😉).
- `NOTE`: We'll be using the same positional encoding matrix for all sequences in your dataset. Take this into account when setting your maximum length.
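
To make the role of the maximum length concrete, here is a plain-Python sketch of the standard sinusoidal positional encoding table. The function name is hypothetical, and the actual `hw4lib` implementation (presumably tensor-based) may be organized differently.

```python
import math

def sinusoidal_pe(max_len, d_model):
    """Build a max_len x d_model table with
    PE[pos][2k] = sin(pos / 10000^(2k/d_model)) and PE[pos][2k+1] = cos(pos / 10000^(2k/d_model))."""
    table = []
    for pos in range(max_len):
        row = []
        for i in range(0, d_model, 2):  # i is the even dimension index 2k
            angle = pos / (10000 ** (i / d_model))
            row.append(math.sin(angle))
            if i + 1 < d_model:
                row.append(math.cos(angle))
        table.append(row)
    return table

pe = sinusoidal_pe(max_len=4, d_model=6)
print(len(pe), len(pe[0]))
```

Because the table has exactly `max_len` rows, a sequence longer than `max_len` has no encoding to look up, which is why the value must cover the longest sequence across all partitions.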

In [None]:
max_feat_len       = max(train_dataset.feat_max_len, val_dataset.feat_max_len, test_dataset.feat_max_len)
max_transcript_len = max(train_dataset.text_max_len, val_dataset.text_max_len, test_dataset.text_max_len)
max_len            = max(max_feat_len, max_transcript_len)

print("="*50)
print(f"{'Max Feature Length':<30} : {max_feat_len}")
print(f"{'Max Transcript Length':<30} : {max_transcript_len}")
print(f"{'Overall Max Length':<30} : {max_len}")
print("="*50)

Max Feature Length             : 4370
Max Transcript Length          : 122
Overall Max Length             : 4370


## Wandb

In [None]:
wandb.login(key="<YOUR_WANDB_API_KEY>")  # never hard-code a real API key in a shared notebook

wandb: Currently logged in as: adisann20025 (adisann20025-carnegie-mellon-university). Use `wandb login --relogin` to force relogin
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc


True

## Training
Every time you run the trainer, it will create a new directory in the `expts` folder with the following structure:
```
expts/
    └── {run_name}/
        ├── config.yaml
        ├── model_arch.txt
        ├── checkpoints/
        │   ├── checkpoint-best-metric-model.pth
        │   └── checkpoint-last-epoch-model.pth
        ├── attn/
        │   └── {attention visualizations}
        └── text/
            └── {generated text outputs}
```
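
Given this layout, the checkpoint files for a run can be located with a small helper. This is a sketch based only on the directory tree above; `checkpoint_paths` is a hypothetical name, not a function from `hw4lib`.

```python
from pathlib import Path

def checkpoint_paths(run_name, expts_root="expts"):
    """Return the expected best/last checkpoint paths for a run, following the tree above."""
    ckpt_dir = Path(expts_root) / run_name / "checkpoints"
    return {
        "best": ckpt_dir / "checkpoint-best-metric-model.pth",
        "last": ckpt_dir / "checkpoint-last-epoch-model.pth",
    }

paths = checkpoint_paths("my_run")  # "my_run" is a placeholder run name
print(paths["last"], paths["last"].exists())
```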


### Training Strategy 1: Cold-Start Trainer

#### Model Load (Default)

In [None]:
model_config = config['model'].copy()
model_config.update({
    'max_len': max_len,
    'num_classes': Tokenizer.vocab_size
})

model = EncoderDecoderTransformer(**model_config)

# Get some inputs from the train dataloader
for batch in train_loader:
    padded_feats, padded_shifted, padded_golden, feat_lengths, transcript_lengths = batch
    break


model_stats = summary(model, input_data=[padded_feats, padded_shifted, feat_lengths, transcript_lengths])
print(model_stats)

Layer (type:depth-idx)                             Output Shape              Param #
EncoderDecoderTransformer                          [8, 54, 10000]            --
├─SpeechEmbedding: 1-1                             [8, 464, 256]             --
│    └─Conv2DSubsampling: 2-1                      [8, 928, 256]             --
│    │    └─Sequential: 3-1                        [8, 256, 928, 76]         592,640
│    │    └─Linear: 3-2                            [8, 928, 256]             4,980,992
│    │    └─Dropout: 3-3                           [8, 928, 256]             --
│    └─StackedBLSTMEmbedding: 2-2                  [8, 464, 256]             --
│    │    └─LSTM: 3-4                              [8, 928, 256]             395,264
│    │    └─MaxPool1d: 3-5                         [8, 256, 464]             --
│    │    └─LSTM: 3-6                              [8, 464, 256]             395,264
│    │    └─MaxPool1d: 3-7                         [8, 256, 464]             --
│    │    └─L

#### Initialize Trainer

If you need to reload the model from a checkpoint, call the `load_checkpoint` method on an already initialized trainer (the cell that constructs `ASRTrainer` must be run first).

```python
checkpoint_path = "path/to/checkpoint.pth"
trainer.load_checkpoint(checkpoint_path)
```


In [None]:
checkpoint_path = "/content/IDL_HW4/IDL-HW4/expts/model_basic_1.32/checkpoints/checkpoint-last-epoch-model.pth"
trainer.load_checkpoint(checkpoint_path)

Checkpoint loaded from epoch 27
Successfully loaded: model, optimizer, scheduler, scaler, training_state


In [None]:
trainer = ASRTrainer(
    model=model,
    tokenizer=Tokenizer,
    config=config,
    run_name="model_basic_1.32",
    config_file="config.yaml",
    device=device
)

Using device: cuda


wandb: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.


### Setup Optimizer and Scheduler

You can set your own optimizer and scheduler by assigning the corresponding attributes on the `ASRTrainer` instance.
E.g.:
```python
from torch import optim

trainer.optimizer = optim.AdamW(model.parameters(), lr=config['optimizer']['lr'], weight_decay=config['optimizer']['weight_decay'])
# NOTE: the config written above has no `training.epochs` key; add one or pass your epoch count directly.
trainer.scheduler = optim.lr_scheduler.CosineAnnealingLR(trainer.optimizer, T_max=config['training']['epochs'])
```

We also provide utility functions to create your own optimizer and scheduler from the config, with some extra bells and whistles. You are free to use them or not. Do read their code and documentation to understand how they work (`hw4lib/utils/*`).
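
To see what the `cosine_warm` settings imply, the sketch below evaluates the cosine-annealing-with-warm-restarts formula documented for `torch.optim.lr_scheduler.CosineAnnealingWarmRestarts` in plain Python. The function name is hypothetical, warmup is not modeled, and whether a "step" means an epoch or an optimizer update depends on how the provided `create_scheduler` utility configures things.

```python
import math

def cosine_warm_restarts_lr(step, base_lr, eta_min, T_0, T_mult=1):
    """Learning rate at `step` (0-indexed) for cosine annealing with warm restarts, without warmup."""
    T_i, T_cur = T_0, step
    while T_cur >= T_i:  # skip over completed cycles
        T_cur -= T_i
        T_i *= T_mult    # each new cycle is T_mult times longer
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * T_cur / T_i))

# Values mirroring the config in this notebook: lr=0.0003, eta_min=1e-7, T_0=10, T_mult=10.
for step in (0, 5, 9, 10, 60):
    print(step, cosine_warm_restarts_lr(step, 3e-4, 1e-7, T_0=10, T_mult=10))
```

With `T_mult: 10`, the first cycle lasts `T_0` units and the second lasts ten times as long, so after the first restart the schedule decays much more slowly.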


#### Setting up the optimizer

In [None]:
trainer.optimizer = create_optimizer(
    model=model,
    opt_config=config['optimizer']
)


🔧 Configuring Optimizer:
├── Type: ADAMW
├── Base LR: 0.0003
├── Weight Decay: 1e-06
├── Parameter Groups:
│   ├── Group: self_attn
│   │   ├── LR: 0.0002
│   │   └── Patterns: []
│   ├── Group: ffn
│   │   ├── LR: 0.0002
│   │   └── Patterns: []
│   └── Default Group (unmatched parameters)
└── AdamW Specific:
    ├── Betas: [0.9, 0.999]
    ├── Epsilon: 1e-08
    └── AMSGrad: False

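The log above lists parameter groups (`self_attn`, `ffn`) together with name patterns. One plausible way such grouping could work is substring matching on parameter names, sketched below. This is an assumption for illustration only: `group_parameter_names` and the example parameter names are hypothetical, and the real logic lives in `hw4lib/utils/*`.

```python
def group_parameter_names(names, group_patterns):
    """Assign each parameter name to the first group with a matching pattern.

    names:          iterable of parameter names (e.g. from model.named_parameters())
    group_patterns: dict mapping group name -> list of substrings
    Names that match no pattern fall into the "default" group.
    """
    groups = {group: [] for group in group_patterns}
    groups["default"] = []
    for name in names:
        for group, patterns in group_patterns.items():
            if any(pattern in name for pattern in patterns):
                groups[group].append(name)
                break
        else:
            # No group claimed this parameter.
            groups["default"].append(name)
    return groups


# Hypothetical parameter names, for illustration only.
example_names = ["enc.0.self_attn.weight", "enc.0.ffn.weight", "embed.weight"]
grouped = group_parameter_names(
    example_names,
    {"self_attn": ["self_attn"], "ffn": ["ffn"]},
)
```

Under this sketch, empty pattern lists such as the `Patterns: []` entries in the log would send every parameter to the default group, so it is worth checking the patterns in your config if you expect per-group learning rates to apply.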

#### Creating a test scheduler and plotting the learning rate schedule

In [None]:
test_scheduler = create_scheduler(
    optimizer=trainer.optimizer,
    scheduler_config=config['scheduler'],
    train_loader=train_loader,
    gradient_accumulation_steps=config['training']['gradient_accumulation_steps']
)
'''
plot_lr_schedule(
    scheduler=test_scheduler,
    num_epochs=20,
    train_loader=train_loader,
    gradient_accumulation_steps=config['training']['gradient_accumulation_steps']
)
'''


📈 Configuring Learning Rate Scheduler:
├── Type: COSINE_WARM
├── Cosine Annealing Warm Restarts Settings:
│   ├── T_0: 10 epochs (35680 steps)
│   ├── T_mult: 10
│   └── Min LR: 1e-07
├── Warmup Settings:
│   ├── Duration: 5 epochs (8920 steps)
│   ├── Start Factor: 0.1
│   └── End Factor: 1.0


"\nplot_lr_schedule(\n    scheduler=test_scheduler,\n    num_epochs=20,\n    train_loader=train_loader,\n    gradient_accumulation_steps=config['training']['gradient_accumulation_steps']\n)\n"

#### Setting up the scheduler

In [None]:
trainer.scheduler = create_scheduler(
    optimizer=trainer.optimizer,
    scheduler_config=config['scheduler'],
    train_loader=train_loader,
    gradient_accumulation_steps=config['training']['gradient_accumulation_steps']
)


📈 Configuring Learning Rate Scheduler:
├── Type: COSINE_WARM
├── Cosine Annealing Warm Restarts Settings:
│   ├── T_0: 10 epochs (35680 steps)
│   ├── T_mult: 10
│   └── Min LR: 1e-07
├── Warmup Settings:
│   ├── Duration: 5 epochs (8920 steps)
│   ├── Start Factor: 0.1
│   └── End Factor: 1.0

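To build intuition for the schedule logged above, the sketch below computes the learning rate at a given optimizer step in plain Python. It assumes a linear warmup followed by cosine annealing with warm restarts, with the cosine phase counted from the end of warmup. That composition is an assumption, and `lr_at_step` is a hypothetical helper; how `create_scheduler` actually combines the two phases is defined in `hw4lib/utils/*`. The numeric defaults are copied from the optimizer and scheduler logs.

```python
import math


def lr_at_step(step, base_lr=3e-4, min_lr=1e-7, warmup_steps=8920,
               start_factor=0.1, t_0=35680, t_mult=10):
    """Learning rate at an optimizer step (illustrative, not the hw4lib code)."""
    if step < warmup_steps:
        # Linear warmup from start_factor * base_lr up to base_lr.
        factor = start_factor + (1.0 - start_factor) * step / warmup_steps
        return base_lr * factor
    # Cosine annealing with warm restarts, counted from the end of warmup.
    t = step - warmup_steps
    cycle_len = t_0
    while t >= cycle_len:
        t -= cycle_len
        cycle_len *= t_mult  # each cycle is t_mult times longer than the last
    cosine = 0.5 * (1.0 + math.cos(math.pi * t / cycle_len))
    return min_lr + (base_lr - min_lr) * cosine


for probe in (0, 8920, 8920 + 17840, 8920 + 35680):
    print(f"step {probe:>6}: lr = {lr_at_step(probe):.7f}")
```

Printing a few probe steps like this is a cheap way to check that warmup length, restart period, and minimum learning rate behave as intended before committing to a long run.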

#### Train
- Set your epochs

In [None]:
trainer.train(train_loader, val_loader, epochs=50)




📊 Metrics (Epoch 27):
├── TRAIN:
│   ├── ce_loss: 1.8286
│   ├── ctc_loss: 0.6919
│   ├── joint_loss: 2.1746
│   ├── perplexity_char: 1.4832
│   └── perplexity_token: 6.2252
└── VAL:
    ├── cer: 9.9863
    ├── wer: 19.2658
    └── word_dist: 12.3559
└── TRAINING:
    └── learning_rate: 0.000002





📊 Metrics (Epoch 28):
├── TRAIN:
│   ├── ce_loss: 1.8247
│   ├── ctc_loss: 0.6850
│   ├── joint_loss: 2.1672
│   ├── perplexity_char: 1.4819
│   └── perplexity_token: 6.2009
└── VAL:
    ├── cer: 9.9516
    ├── wer: 19.2309
    └── word_dist: 12.3333
└── TRAINING:
    └── learning_rate: 0.000300


[Training ASR]:  12%|█▏        | 411/3568 [01:09<08:12,  6.41it/s, acc_step=1/2, ce_loss=2.0945, ctc_loss=0.8724, joint_loss=2.5307, perplexity=8.1214]

KeyboardInterrupt: 

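The logged training metrics are related to each other, and checking those relationships is a useful sanity test. The numbers above are consistent with `joint_loss = ce_loss + 0.5 * ctc_loss` and `perplexity_token = exp(ce_loss)`. The weight of 0.5 is inferred from the logged values, not read from `config.yaml`, so treat it as an assumption and confirm it against your own config.

```python
import math

# Values copied from the Epoch 27 TRAIN metrics above.
ce_loss = 1.8286
ctc_loss = 0.6919
joint_loss = 2.1746
perplexity_token = 6.2252

# Assumption: inferred from the logged numbers, not read from config.yaml.
CTC_WEIGHT = 0.5


def combine_losses(ce, ctc, ctc_weight=CTC_WEIGHT):
    """Weighted sum of the cross-entropy and CTC losses."""
    return ce + ctc_weight * ctc


reconstructed_joint = combine_losses(ce_loss, ctc_loss)
reconstructed_ppl = math.exp(ce_loss)  # perplexity is exp of the mean CE loss

print(f"joint loss: reconstructed {reconstructed_joint:.4f}, logged {joint_loss}")
print(f"perplexity: reconstructed {reconstructed_ppl:.4f}, logged {perplexity_token}")
```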
#### Inference


In [None]:
# Define the recognition config: beam search without LM fusion
recognition_config = {
    'num_batches': None,
    'temperature': 1.0,
    'repeat_penalty': 1.0,
    'lm_weight': None,
    'lm_model': None,
    'beam_width': 10, # Beam width of 1 reverts to greedy
}

# Recognize with the config defined above
config_name = "test"
print(f"Evaluating with {config_name} config")
results = trainer.recognize(test_loader, recognition_config, config_name=config_name, max_length=max_transcript_len)


# Collect the generated transcriptions into a submission DataFrame
generated = [r['generated'] for r in results]
results_df = pd.DataFrame(
    {
        'id': range(len(generated)),
        'transcription': generated
    }
)

# Cleanup (Will end wandb run)
trainer.cleanup()

Evaluating with test config




Run history:

| Metric | History |
|---|---|
| learning_rate | ▁█ |
| train/ce_loss | █▁ |
| train/ctc_loss | █▁ |
| train/joint_loss | █▁ |
| train/perplexity_char | █▁ |
| train/perplexity_token | █▁ |
| val/cer | █▁ |
| val/wer | █▁ |
| val/word_dist | █▁ |

Run summary:

| Metric | Value |
|---|---|
| learning_rate | 0.0003 |
| train/ce_loss | 1.82469 |
| train/ctc_loss | 0.68499 |
| train/joint_loss | 2.16719 |
| train/perplexity_char | 1.48193 |
| train/perplexity_token | 6.20089 |
| val/cer | 9.9516 |
| val/wer | 19.23091 |
| val/word_dist | 12.33333 |

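The `beam_width` setting in the recognition config controls how many partial hypotheses are kept at each decoding step, and a width of 1 reduces to greedy decoding. The toy sketch below illustrates that idea only. It is not the `hw4lib` decoder: `TOY_MODEL` and its probabilities are made up, and end-of-sequence handling, length normalization, temperature, and repeat penalties are all omitted.

```python
import math

# Hypothetical next-token probabilities keyed by the prefix decoded so far.
TOY_MODEL = {
    (): {"A": 0.6, "B": 0.4},
    ("A",): {"X": 0.55, "Y": 0.45},
    ("B",): {"X": 0.9, "Y": 0.1},
}


def beam_search(model, beam_width, max_len):
    """Return (tokens, log_prob) of the best hypothesis after max_len steps."""
    beams = [((), 0.0)]  # each beam is (prefix, cumulative log-probability)
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for token, prob in model[prefix].items():
                candidates.append((prefix + (token,), score + math.log(prob)))
        # Keep only the beam_width highest-scoring extensions.
        candidates.sort(key=lambda cand: cand[1], reverse=True)
        beams = candidates[:beam_width]
    return beams[0]


greedy_tokens, greedy_score = beam_search(TOY_MODEL, beam_width=1, max_len=2)
beam_tokens, beam_score = beam_search(TOY_MODEL, beam_width=2, max_len=2)
print("beam_width=1:", greedy_tokens, round(math.exp(greedy_score), 2))
print("beam_width=2:", beam_tokens, round(math.exp(beam_score), 2))
```

In this toy model the greedy path commits to the locally best first token and ends with a lower total probability than the path found with a width of 2, which is the usual motivation for trying wider beams at the cost of slower inference.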

## Submit to Kaggle

### Authenticate Kaggle
To use Kaggle's public API, you must first authenticate with an API token. Go to the 'Account' tab of your user profile and select 'Create New Token'. This triggers the download of `kaggle.json`, a file containing your API credentials.
- `TODO`: Set your Kaggle username and API key here, using the credentials listed in `kaggle.json`.
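
As an alternative to pasting the key into a cell, the credentials can be read from the downloaded file at runtime. The sketch below assumes `kaggle.json` contains the `username` and `key` fields and has been uploaded next to the notebook. `load_kaggle_credentials` is a hypothetical helper name.

```python
import json
import os
from pathlib import Path


def load_kaggle_credentials(json_path):
    """Read kaggle.json and export KAGGLE_USERNAME / KAGGLE_KEY.

    Returns the username so the caller can confirm which account is active.
    """
    creds = json.loads(Path(json_path).read_text())
    os.environ["KAGGLE_USERNAME"] = creds["username"]
    os.environ["KAGGLE_KEY"] = creds["key"]
    return creds["username"]


# Example (path is an assumption; adjust to where you uploaded the file):
# load_kaggle_credentials("kaggle.json")
```

Keeping the key out of the notebook source this way means it does not end up in saved copies or version control.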




In [None]:
import os
os.environ["KAGGLE_USERNAME"] = "<your-kaggle-username>"  # from kaggle.json
os.environ["KAGGLE_KEY"] = "<your-kaggle-api-key>"  # never commit a real key

In [None]:
results_df.head()

|   | id | transcription |
|---|----|---------------|
| 0 | 0 | HE HOPED THERE WOULD BE STILL FOR DINNER TURNI... |
| 1 | 1 | STEP IT INTO YOU HIS BELLY COUNLSLED HIM |
| 2 | 2 | AFTER EARLY NIGHTFALL THE YELLOW LAMPS WOULD L... |
| 3 | 3 | HOWEVER AND HE GOOD IN YOUR MIND |
| 4 | 4 | NEM BUT DAN FRESH NLLY IS WAITING ON YOU GOOD ... |


### Submit

In [None]:
results_df.to_csv("results.csv", index=False)
!kaggle competitions submit -c 11785-s25-hw4p2-asr -f results.csv -m "My Submission"

100% 282k/282k [00:01<00:00, 206kB/s]
Successfully submitted to 11-785 HW4P2: Automatic Speech Recognition
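
It can help to sanity-check `results.csv` after writing it and before running the `kaggle` command. The sketch below uses only the standard library and assumes the file has exactly the `id` and `transcription` columns built above, with ids running consecutively from 0. `check_submission` is a hypothetical helper, and the competition's real format requirements should be confirmed on its Kaggle page.

```python
import csv


def check_submission(csv_path):
    """Validate a submission CSV; return (row_count, empty_transcriptions).

    Raises ValueError if the header or the id column looks wrong.
    """
    with open(csv_path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        if reader.fieldnames != ["id", "transcription"]:
            raise ValueError(f"Unexpected columns: {reader.fieldnames}")
        rows = list(reader)

    ids = [int(row["id"]) for row in rows]
    if ids != list(range(len(rows))):
        raise ValueError("ids are not consecutive integers starting at 0")

    # Empty transcriptions are allowed but worth knowing about.
    empty = sum(1 for row in rows if not row["transcription"].strip())
    return len(rows), empty


# Example: row_count, empty_count = check_submission("results.csv")
```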