### chaptGPT specs   

A decoder-only transformer in PyTorch that predicts the 'next output' at each time step. 

Each time step t is represented by a vector of n=4 tokens from the Descript DAC encoder. 
The length of the sequence (context window) is Ti=86 for inference and Tt=8*Ti for training; that is, the training context window is 8 times as long as the inference context window. 
The attention is "causal", looking only back in time, and the maximum look-back time for the attention blocks is Ti (even when the sequence is longer during training). That is, the masking matrix is *banded*: triangular to be causal, and limited in look-back, which results in a diagonal band. This avoids much of the training on shortened context that otherwise happens when tokens are near the beginning of training examples. 
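
The banded mask described above can be sketched as follows. This is a minimal illustration, not the `generate_mask` from `utils.utils` (whose source is not shown here); the function name `banded_causal_mask` and the convention that position i may attend to positions j with i - lookback < j <= i are assumptions chosen to match the mask printed later in this notebook.

```python
import torch

def banded_causal_mask(seq_len: int, lookback: int) -> torch.Tensor:
    """Additive attention mask: 0 where attention is allowed, -inf elsewhere.

    Hypothetical helper: position i may attend to j when i - lookback < j <= i.
    """
    idx = torch.arange(seq_len)
    rel = idx.unsqueeze(1) - idx.unsqueeze(0)   # rel[i, j] = i - j
    allowed = (rel >= 0) & (rel < lookback)     # causal AND within the look-back band
    mask = torch.full((seq_len, seq_len), float("-inf"))
    mask[allowed] = 0.0
    return mask

# Toy sizes so the band is visible: sequence of 6 steps, look-back of 3
mask_demo = banded_causal_mask(seq_len=6, lookback=3)
print(mask_demo)
```

With these toy sizes, the top-left 3x3 corner is an ordinary causal (lower-triangular) mask, while later rows lose access to the oldest positions.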

The size of the vocabulary (the number of discrete values in each codebook) for each of the n tokens is V=1024. 

The dataloader will, as usual, supply batches as triplets (input, target, conditioning info), where the size of each input and target is Tt*n (the sequence length times the number of tokens at each time step). The tokens are indexes into the vocabulary in the range [0, V-1]. The targets are shifted relative to the input sequence by 1, as is typical for networks that learn to predict the output for the next time step. 
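
The one-step shift between inputs and targets can be illustrated with a toy example. It assumes the dataset cuts Tt+1 consecutive frames and slices them into the two views; the actual `CustomDACDataset` may build them differently.

```python
import torch

Tt, n, V = 8, 4, 1024                      # toy sequence length; n and V as in the spec
codes = torch.randint(0, V, (Tt + 1, n))   # Tt+1 frames of n codebook indexes (random stand-in data)

inputs = codes[:-1]    # frames 0 .. Tt-1
targets = codes[1:]    # frames 1 .. Tt, i.e. the inputs shifted by one step

print(inputs.shape, targets.shape)
```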

The first layer in the architecture will be a learnable "multiembedding" layer that embeds each of the n tokens at each time step as an m-dimensional vector. The n m-dimensional vectors are concatenated to provide the n*m-dimensional input embedding for the transformer blocks at each time step. 
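
A minimal sketch of such a multiembedding layer is shown below. The class name `MultiEmbeddingSketch` and its constructor arguments are hypothetical; the layer actually used by `RopeCondDACTransformer` is not shown in this notebook and may differ.

```python
import torch
import torch.nn as nn

class MultiEmbeddingSketch(nn.Module):
    """One embedding table per codebook; per-step embeddings are concatenated."""

    def __init__(self, vocab_size: int, num_tokens: int, token_dim: int):
        super().__init__()
        self.embeddings = nn.ModuleList(
            [nn.Embedding(vocab_size, token_dim) for _ in range(num_tokens)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, T, n) integer codebook indexes
        parts = [emb(x[..., k]) for k, emb in enumerate(self.embeddings)]
        return torch.cat(parts, dim=-1)   # (batch, T, n*m)

# Assumed sizes: n=4 codebooks, m=128 per token, so n*m=512
embed = MultiEmbeddingSketch(vocab_size=1024, num_tokens=4, token_dim=128)
tokens = torch.randint(0, 1024, (2, 10, 4))
embedded = embed(tokens)
print(embedded.shape)
```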

Positional information is applied to the K and Q matrices in each Transformer block using Rotary Position Embedding (RoPE).
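
As a rough illustration of what RoPE does, the sketch below rotates pairs of channels by position-dependent angles. The function `apply_rope`, the interleaved channel pairing, and the base of 10000 are assumptions made for the example; the RoPE module used in this project may pair channels or cache angles differently.

```python
import torch

def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotate channel pairs of x with shape (..., T, D) by angles that grow with position."""
    T, D = x.shape[-2], x.shape[-1]
    assert D % 2 == 0, "RoPE needs an even number of channels"
    inv_freq = base ** (-torch.arange(0, D, 2, dtype=torch.float32) / D)   # (D/2,)
    angles = torch.arange(T, dtype=torch.float32).unsqueeze(1) * inv_freq  # (T, D/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

# Same query/key content at every position: scores then depend only on the offset i - j
T, D = 12, 64
q0, k0 = torch.randn(D), torch.randn(D)
q_rot = apply_rope(q0.repeat(T, 1))
k_rot = apply_rope(k0.repeat(T, 1))
scores = q_rot @ k_rot.T
```

Because only Q and K are rotated, the attention scores encode relative position while the values are left untouched.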

We use a stack of b transformer blocks that are standard (using layer norms, a ReLU activation, and a forward expansion factor of 4 for the linear layer). Each transformer block consumes and produces a context-window-length sequence of m*n-dimensional vectors. 

After the last transformer block, there is a linear layer that maps the m*n-dimensional vectors to the output size, which is V*n (the vocabulary size times the number of tokens stacked at each time step). These are the logits that will be fed to the softmax functions (one for each of the n tokens) that provide the probability distribution across the vocabulary set. We use the criterion nn.CrossEntropyLoss() for computing the loss using the targets provided by the dataloader, and Adam for the optimizer.
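
The shapes involved in the output head can be sketched as follows. The variable names and the toy batch are illustrative assumptions; only the V*n output size, the per-token softmax, and `nn.CrossEntropyLoss` come from the description above.

```python
import torch
import torch.nn as nn

V, n, m = 1024, 4, 128      # vocabulary size, tokens per step, per-token embedding size (m assumed)
B, T = 2, 10                # toy batch size and sequence length

head = nn.Linear(n * m, n * V)            # maps each m*n vector to V*n logits
hidden = torch.randn(B, T, n * m)         # stand-in for the last transformer block's output
logits = head(hidden).view(B, T, n, V)    # one row of V logits per token

probs = logits.softmax(dim=-1)            # one distribution over the vocabulary per token
targets = torch.randint(0, V, (B, T, n))
loss = nn.CrossEntropyLoss()(logits.reshape(-1, V), targets.reshape(-1))
print(logits.shape, loss.item())
```

`nn.CrossEntropyLoss` applies the log-softmax itself, so the logits (not `probs`) are passed to it.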

Again, at inference time, the fixed-length context window is shorter than the training sequence window, and equal to the maximum look-back time of the attention blocks. The inference process takes the output produced at each time step (a stack of n tokens) and shifts it into a sliding window that is used as input for the next time step. The length of the sequences generated during inference is arbitrary and should be settable with a parameter. 
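
The sliding-window generation loop might look roughly like the sketch below. Everything here is illustrative: `generate_sketch` and `toy_model` are hypothetical, the stand-in model returns random logits, sampling with `torch.multinomial` is one possible choice, and the conditioning input used by the real model is omitted for brevity.

```python
import torch

V, n, Ti = 1024, 4, 86

def toy_model(x: torch.Tensor) -> torch.Tensor:
    """Stand-in for the trained model: random logits of shape (batch, T, n, V)."""
    return torch.randn(x.size(0), x.size(1), x.size(2), V)

@torch.no_grad()
def generate_sketch(model, prime: torch.Tensor, num_steps: int, window_len: int) -> torch.Tensor:
    """prime: (1, T0, n) seed tokens. Returns (1, T0 + num_steps, n)."""
    window = prime[:, -window_len:]
    generated = [prime]
    for _ in range(num_steps):
        logits = model(window)                          # (1, T, n, V)
        probs = torch.softmax(logits[:, -1], dim=-1)    # (1, n, V): last time step only
        next_step = torch.multinomial(probs[0], num_samples=1).view(1, 1, -1)  # (1, 1, n)
        # Shift the new stack of n tokens into the window, dropping the oldest step if full
        window = torch.cat([window, next_step], dim=1)[:, -window_len:]
        generated.append(next_step)
    return torch.cat(generated, dim=1)

prime = torch.randint(0, V, (1, 3, n))
out = generate_sketch(toy_model, prime, num_steps=100, window_len=Ti)
print(out.shape)
```

The number of generated steps is a free parameter (`num_steps`), independent of the window length.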


<div style="width: 100%; height: 20px; background-color: black;"></div>

## Parameters

In [19]:
paramfile = 'params_mini.yaml' # 'params.yaml' #
DEVICE='cuda'
start_epoch=0 # to start from a previous training checkpoint, otherwise must be 0
verboselevel=0

<div style="width: 100%; height: 20px; background-color: black;"></div>

In [20]:
import torch
import torch.nn as nn
import torch.optim as optim
import math
import time

import numpy as np

# and for creating a custom dataset and loader:
from torch.utils.data import DataLoader
import os
import yaml
import shutil

from utils.utils import generate_mask, save_model, load_model, writeDACFile, interpolate_vectors
from DACTransformer.RopeCondDACTransformer import RopeCondDACTransformer

from dataloader.dataset import CustomDACDataset

In [21]:
from torch.utils.tensorboard import SummaryWriter

### <font color='blue'> Derived parameters </font>

In [22]:
# Training data dir

# Load YAML file
with open(paramfile, 'r') as file:
    params = yaml.safe_load(file)

data_dir = params['data_dir']
data_frames =  params['data_frames']
validator_data_dir = params['validator_data_dir']
validator_data_frames = params['validator_data_frames']

# Create an instance of the dataset
dataset = CustomDACDataset(data_dir=data_dir, metadata_excel=data_frames, transforms=None)

# ---------     for the transformer  --------------#
vocab_size = params['vocab_size']
num_tokens = params['num_tokens']

cond_classes = dataset.get_num_classes() # 0
cond_params = params['cond_params']
cond_size = cond_classes + cond_params # num_classes + num params - not a FREE parameter!

#embed_size = params['tblock_input_size'] -cond_size # 240 #32  # embed_size +cond_size must be divisible by num_heads and by num tokens
embed_size = params['model_size']  # embed_size  must be divisible by num_heads and by num tokens
print(f'embed_size is {embed_size}')

Ti = params['Ti']
Tt = params['Tt']
batch_size = params['batch_size']

sequence_length = Tt  # For training

num_layers = params['num_layers']
num_heads = params['num_heads']
forward_expansion = params['forward_expansion']
dropout_rate = params['dropout_rate']
learning_rate = params['learning_rate']
num_epochs=params['num_epochs']

experiment_name=params['experiment'] 
outdir = 'runs' + '/' + experiment_name
basefname= 'out' + '.e' + str(embed_size) + '.l' + str(num_layers) + '.h' + str(num_heads) 

ErrorLogRate = params['ErrorLogRate'] #10
checkpoint_interval = params['checkpoint_interval']



TransformerClass =  globals().get(params['TransformerClass'])  

print(f"using TransformerClass = {params['TransformerClass']}") 
print(f'basefname = {basefname}')
print(f'outdir = {outdir}')

###########################################################################
# Ensure the destination directory exists
#destination_dir = os.path.dirname(outdir + '/' + paramfile)
#if not os.path.exists(destination_dir):
#    os.makedirs(destination_dir)
    
if not os.path.exists(outdir):
    os.makedirs(outdir)
shutil.copy(paramfile, outdir + '/params.yaml')  # copy whatever paramfile was used to outdir and name it params.yaml

embed_size is 512
using TransformerClass = RopeCondDACTransformer
basefname = out.e512.l4.h8
outdir = runs/mini_test_01


'runs/mini_test_01/params.yaml'

### <font color='blue'> Set up cuda. 
Without it, training runs about 10 times slower  
</font>

In [23]:
if DEVICE == 'cuda' :
    device = torch.device(DEVICE) # if the docker was started with --gpus all, then can choose here with cuda:0 (or cpu)
    print(f'memory on cuda 0 is  {torch.cuda.get_device_properties(0).total_memory/1e9}')
else :
    device=DEVICE
device

memory on cuda 0 is  4.294639616


device(type='cuda')

### <font color='blue'> Load data 
</font>

In [24]:

# Create a DataLoader
dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)

#Validator data set
if validator_data_dir is not None :
    validator_dataset=CustomDACDataset(data_dir=validator_data_dir, metadata_excel=validator_data_frames)
    validator_dataloader= DataLoader(validator_dataset, batch_size=batch_size, shuffle=True)

#---------------------------------------------------------------
# Test data dir
for batch_idx, (inputs, targets, cvect) in enumerate(dataloader):
    # Sanity check only: print the shapes of the first batch.
    # inputs: batch of input tokens of shape [batch_size, Tt, num_tokens]
    # targets: corresponding batch of target tokens (inputs shifted by one step), same shape
    # cvect: conditioning vectors of shape [batch_size, cond_size]
    
    if (batch_idx == 0) : 
        print(f"Batch {batch_idx + 1}")
        print(f"Inputs shape: {inputs.shape}")
        print(f"Targets shape: {targets.shape}")
        print(f"cvect shape: {cvect.shape}")
        print(f'cvect is {cvect}')

Batch 1
Inputs shape: torch.Size([4, 295, 4])
Targets shape: torch.Size([4, 295, 4])
cvect shape: torch.Size([4, 5])
cvect is tensor([[0., 0., 0., 1., 1.],
        [1., 0., 0., 0., 1.],
        [0., 0., 1., 0., 1.],
        [0., 1., 0., 0., 1.]])


### <font color='blue'> Make the mask 
</font>

In [25]:
mask = generate_mask(Tt, Ti).to(device)
print(f'Mask.shape is {mask.shape}')
mask

Mask.shape is torch.Size([295, 295])


tensor([[0., -inf, -inf,  ..., -inf, -inf, -inf],
        [0., 0., -inf,  ..., -inf, -inf, -inf],
        [0., 0., 0.,  ..., -inf, -inf, -inf],
        ...,
        [-inf, -inf, -inf,  ..., 0., -inf, -inf],
        [-inf, -inf, -inf,  ..., 0., 0., -inf],
        [-inf, -inf, -inf,  ..., 0., 0., 0.]], device='cuda:0')

In [26]:
# Instantiate model, put it on the device
#model = TransformerDecoder(embed_size, num_layers, num_heads, forward_expansion, dropout_rate, Tt, num_tokens, vocab_size).to(device)
print(f'Creating model with embed_size={embed_size}, cond_size={cond_size}')

if start_epoch == 0 : 
    model = TransformerClass(embed_size, num_layers, num_heads, forward_expansion, dropout_rate, Tt, cond_classes, num_tokens, vocab_size, cond_size, verboselevel).to(device)
    optimizer = optim.Adam(model.parameters(), lr=learning_rate)
else:
    checkpoint_path = outdir+"/"+basefname+"_chkpt_"+str(start_epoch).zfill(4) +".pth"
    print(f'in train, start_epoch = {start_epoch} and checkpoint_path = {checkpoint_path}')
    assert os.path.exists(checkpoint_path), f"{checkpoint_path} does not exist."
    if start_epoch != 0 and checkpoint_path and os.path.exists(checkpoint_path):
        print(f"Loading and creating model from {checkpoint_path}")       
        # Restore model weights
        model, optimizer, _, vocab_size, num_tokens, cond_size = load_model(checkpoint_path,  TransformerClass, device)
        #best_metric = checkpoint['best_metric']  # If you're tracking performance      
        print(f"Resuming from epoch {start_epoch}")
   
criterion = nn.CrossEntropyLoss()
# Count the number of parameters
num_params = sum(p.numel() for p in model.parameters())
print(f'Total number of parameters: {num_params}')

# Initialize SummaryWriter for tensorboard 
writer = SummaryWriter(outdir)

Creating model with embed_size=512, cond_size=5
 ------------- embed_dim (512) must be divisible by num_heads (8)
Setting up MultiEmbedding with vocab_size= 1024, embed_size= 512, num_codebooks= 4
Setting up RotaryPositionalEmbedding with embed_size= 512, max_len= 295
Total number of parameters: 16295936


<div style="width: 100%; height: 20px; background-color: black;"></div>

# <font color='blue'> Train !! 
</font>

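### Loss is the average cross-entropy (CE) across all output tokens

$$
L = \frac{1}{N} \sum_{i=1}^{N} \text{CE}(x_i, y_i)
$$

where $N$ is the total number of tokens in the batch (batch size × sequence length × tokens per time step), $x_i$ is the vector of logits for token $i$, and $y_i$ is its target index.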
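
As a quick check of the average above, the sketch below compares three ways of computing the loss on random toy tensors: flattening everything (as the training loop does), averaging per-token losses by hand, and the more verbose permute-based form. The tensor sizes are arbitrary.

```python
import torch
import torch.nn.functional as F

B, T, n, V = 2, 5, 4, 16                    # toy sizes
logits = torch.randn(B, T, n, V)
targets = torch.randint(0, V, (B, T, n))

# 1) Flatten every token into one row of V logits (default reduction is the mean)
flat_loss = F.cross_entropy(logits.reshape(-1, V), targets.reshape(-1))

# 2) Per-token CE, averaged by hand over all B*T*n tokens
per_token = F.cross_entropy(logits.reshape(-1, V), targets.reshape(-1), reduction="none")
manual_loss = per_token.mean()

# 3) Verbose form: class dimension moved to position 1, as cross_entropy expects
perm_loss = F.cross_entropy(
    logits.reshape(B, T * n, V).permute(0, 2, 1), targets.reshape(B, T * n)
)
print(flat_loss.item(), manual_loss.item(), perm_loss.item())
```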


In [27]:

def train(model, optimizer, dataloader, num_epochs, device, outdir, basefname, start_epoch=0, checkpoint_path=None):
    t0 = time.time()
    max_epoch = start_epoch + num_epochs
    for epoch in range(start_epoch, max_epoch):
        torch.cuda.empty_cache()
        model.train()
        for batch_idx, (input_data, target_data, cond_data) in enumerate(dataloader):
            if verboselevel > 5 :
                print(f' ---- submitting batch with input_data={input_data.shape}, target_data={target_data.shape}, cond_data={cond_data.shape}')
            #print(f"b{batch_idx} ", end='')
            optimizer.zero_grad()
    
            # Move inputs and targets to the device
            input_data, target_data, cond_data = input_data.to(device), target_data.to(device), cond_data.to(device)
            
            if cond_size==0 :  #Ignore conditioning data
                cond_expanded=None
            else : 
                # for dataset examples, expand the conditioning info across all time steps before passing to models
                cond_expanded = cond_data.unsqueeze(1).expand(-1, input_data.size(1), -1)
            
            #print(f'    after loading a batch,  input_data.shape is {input_data.shape}, and cond_data.shape is {cond_data.shape}')
            #print(f'    after loading a batch,  cond_expanded.shape is {cond_expanded.shape}')
            #print(f'    after loading a batch,  mask.shape is {mask.shape}')
            #print(f' model={model}')
            
            # torch.Size([batch_size, seq_len, num_tokens, vocab_size])
            output = model(input_data, cond_expanded, mask)
        
            if verboselevel > 5 :
                print(f' TTTTTTTT after training, output shape ={output.shape}, torch.Size([batch_size, seq_len, num_tokens, vocab_size])')
                print(f' TTTTTTTT Passing to CRITERION with , output.reshape(-1, vocab_size) = {output.reshape(-1, vocab_size).shape} and target_data.reshape(-1) = {target_data.reshape(-1).shape}' )
    
            ##  this works, but is too verbose >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
            ##      # Original shape: (batch_size, seq_len, num_tokens, vocab_size)
            ##      output = output.reshape(batch_size, sequence_length * num_tokens, vocab_size)
            ##      # Original shape: (batch_size, seq_len, num_tokens)
            ##      targets = targets.reshape(batch_size, sequence_length * num_tokens)
            ##      loss = criterion(output.permute(0, 2, 1), targets) 
            
            ##  more succinct <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
            #   Computes the CE for each token separately, and then averages them to get the loss.
            #loss = criterion(output.reshape(-1, vocab_size), target_data.reshape(-1)) # collapses all target_data dimensions into a single dimension
            loss = criterion(output.reshape(-1, vocab_size), target_data.reshape(-1).long())
            ## <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
            
            loss.backward()
            optimizer.step()
        if (epoch+1) % ErrorLogRate == 0:
            print(f'EPOCH {epoch+1}  (with max {max_epoch}), ', end='')
            print(f'loss: {loss}')
            # Log the loss to TensorBoard
            writer.add_scalar('Loss/train', loss, epoch)
            
            if validator_data_dir is not None :
                model.eval()
                with torch.no_grad():
                    val_loss = 0
                    for val_inputs, val_targets, cond_data in validator_dataloader:
                        val_inputs, val_targets, cond_data = val_inputs.to(device), val_targets.to(device), cond_data.to(device)
                        
                        if cond_size==0 :  #Ignore conditioning data
                            cond_expanded=None
                        else: 
                            # for dataset examples, expand the conditioning info across all time steps before passing to models
                            # (use the validation batch's own sequence length, not that of the last training batch)
                            cond_expanded = cond_data.unsqueeze(1).expand(-1, val_inputs.size(1), -1)
    
                        
                        val_outputs = model(val_inputs,cond_expanded, mask)
                        
                        # .item() accumulates a plain Python float instead of a device tensor
                        val_loss += criterion(val_outputs.reshape(-1, vocab_size), val_targets.reshape(-1).long()).item() # collapses all target dimensions into a single dimension
    
                print(f'Validation Loss: {val_loss / len(validator_dataloader)}')
                writer.add_scalar('Loss/validation', val_loss / len(validator_dataloader), epoch)
    
                t1 = time.time()
                train_time = t1-t0
                print(f'train time for {epoch-start_epoch+1} epochs, was {train_time}' )
                print(f'')
                
        if (epoch+1) % checkpoint_interval == 0:
            lastbasename = outdir+"/"+basefname+"_chkpt_"+str(epoch+1).zfill(4)
            print(f'EPOCH {epoch+1} save model to : {lastbasename}.pth')
            print(f'')
            save_model(model, optimizer, Ti,  lastbasename +".pth")
        
    
    t1 = time.time()
    train_time = t1-t0
    print(f'train time for {num_epochs} epochs, was {train_time}' )
    print(f'loss  =  {loss}' )
    
## -----------------------------------------------------------------------------------
## OK, let's do it!
train(model, optimizer, dataloader, num_epochs, device, outdir, basefname, start_epoch)

EPOCH 2  (with max 500), loss: 6.71724271774292
Validation Loss: 6.173355579376221
train time for 2 epochs, was 0.5889384746551514

EPOCH 4  (with max 500), loss: 7.088653564453125
Validation Loss: 6.127959728240967
train time for 4 epochs, was 0.9723138809204102

EPOCH 5 save model to : runs/mini_test_01/out.e512.l4.h8_chkpt_0005.pth

EPOCH 6  (with max 500), loss: 6.040456295013428
Validation Loss: 5.6185736656188965
train time for 6 epochs, was 3.3065085411071777

EPOCH 8  (with max 500), loss: 5.439266681671143
Validation Loss: 4.916060447692871
train time for 8 epochs, was 3.6780471801757812

EPOCH 10  (with max 500), loss: 4.944605350494385
Validation Loss: 4.470635890960693
train time for 10 epochs, was 4.050341844558716

EPOCH 10 save model to : runs/mini_test_01/out.e512.l4.h8_chkpt_0010.pth

EPOCH 12  (with max 500), loss: 4.436800003051758
Validation Loss: 3.9613847732543945
train time for 12 epochs, was 5.576999187469482

EPOCH 14  (with max 500), loss: 4.015646934509277
Va

In [28]:
# Just check that the inference attention mask will look right.
# Actually, the inference mask can be None, since the context window is only as long as the maximum look-back in the training mask.
# That's why the mask restricted to [:Ti, :Ti] is purely causal (lower-triangular); longer dims would show the banded mask again.
foo=mask[:Ti, :Ti]
foo

tensor([[0., -inf, -inf,  ..., -inf, -inf, -inf],
        [0., 0., -inf,  ..., -inf, -inf, -inf],
        [0., 0., 0.,  ..., -inf, -inf, -inf],
        ...,
        [0., 0., 0.,  ..., 0., -inf, -inf],
        [0., 0., 0.,  ..., 0., 0., -inf],
        [0., 0., 0.,  ..., 0., 0., 0.]], device='cuda:0')

### <font color='blue'> Use Inference.Decode.ipynb to see and hear your generated audio   
</font>