<br><br><br>
# Bluewalm Module
<br>

## LLM Notebook
<br>

This notebook illustrates the basic usage of the **bluewalm** PyTorch extension module. 

First, let's import the modules we need. 



In [1]:
import json
import torch
import matplotlib.pyplot as plt
# now we import our module
import bluewalm

<br>

### Attention Layer
<br>
In the cell below we create a softmax attention layer. <br>
It's using the usual scaled dot-product attention with causal masking. <br>
<br>

In [2]:
class AttentionLayerSM(torch.nn.Module):
    def __init__(self, dim, n_heads):
        super().__init__()
        self.n_heads = n_heads
        assert dim % n_heads == 0
        self.head_dim = dim // n_heads
        self.wq = torch.nn.Linear(dim, dim, bias=False)
        self.wk = torch.nn.Linear(dim, dim, bias=False)
        self.wv = torch.nn.Linear(dim, dim, bias=False)
        self.wo = torch.nn.Linear(dim, dim, bias=False)
    
    def reset_parameters(self):
        self.wq.reset_parameters()
        self.wk.reset_parameters()
        self.wv.reset_parameters()
        self.wo.reset_parameters()
    
    def forward(self,
                x: torch.Tensor,
                k_cache: torch.Tensor,
                v_cache: torch.Tensor):

        # x : (bsz, seqlen, dim)
        # k_cache : (bsz, cache_len, dim)
        # v_cache : (bsz, cache_len, dim)
        
        bsz, seqlen, _ = x.shape
        q, k, v = self.wq(x), self.wk(x), self.wv(x)
        
        k = torch.cat((k_cache, k), dim=1)
        v = torch.cat((v_cache, v), dim=1)
        
        q = q.view(bsz, -1, self.n_heads, self.head_dim)
        k = k.view(bsz, -1, self.n_heads, self.head_dim)
        v = v.view(bsz, -1, self.n_heads, self.head_dim)
        
        q = q.transpose(1, 2)
        # q : (bsz, n_heads, seqlen, head_dim)
        
        k = k.transpose(1, 2)
        # k : (bsz, n_heads, cache_len + seqlen, head_dim)
        
        v = v.transpose(1, 2)
        # v : (bsz, n_heads, cache_len + seqlen, head_dim)
        
        scores = torch.nn.functional.scaled_dot_product_attention(q, k, v, is_causal=True)
        # scores : (bsz, n_heads, seqlen, head_dim)
        
        # use -1 instead of relying on a global `dim` variable
        scores = scores.transpose(1, 2).contiguous().view(bsz, seqlen, -1)
        # scores : (bsz, seqlen, dim)

        output = self.wo(scores)
        # output : (bsz, seqlen, dim)

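        # n_heads * head_dim == dim; avoids relying on a global `dim` variable
        k = k.transpose(1, 2).view(bsz, -1, self.n_heads * self.head_dim)
        # k : (bsz, cache_len + seqlen, dim)
        
        v = v.transpose(1, 2).view(bsz, -1, self.n_heads * self.head_dim)
        # v : (bsz, cache_len + seqlen, dim)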
        
        return output, k, v
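
<br>
To make explicit what `scaled_dot_product_attention(..., is_causal=True)` computes, here is a minimal reference sketch that reproduces it with plain tensor operations for the case without a cache (equal query and key lengths). The helper name `causal_attention_reference` is ours and exists only for illustration. <br>
<br>

```python
import math
import torch
import torch.nn.functional as F


def causal_attention_reference(q, k, v):
    # illustrative helper, not part of torch or bluewalm
    # q, k, v : (bsz, n_heads, seqlen, head_dim)
    seqlen, head_dim = q.shape[-2], q.shape[-1]
    scores = q @ k.transpose(-2, -1) / math.sqrt(head_dim)
    # lower-triangular mask: position i may attend to positions 0..i only
    mask = torch.ones(seqlen, seqlen, dtype=torch.bool).tril()
    scores = scores.masked_fill(~mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v


torch.manual_seed(0)
q, k, v = (torch.randn(2, 4, 8, 16) for _ in range(3))
expected = F.scaled_dot_product_attention(q, k, v, is_causal=True)
max_abs_diff = (causal_attention_reference(q, k, v) - expected).abs().max().item()
print(max_abs_diff)
```

Per the PyTorch documentation, `is_causal=True` aligns the mask to the top-left corner when query and key lengths differ, so with a non-empty cache an explicit `attn_mask` may be needed. The benchmarks in this notebook use `cache_len = 0`, where this makes no difference. <br>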

<br>
In the cell below we create a softplus attention layer. <br>
It uses softplus attention, the new operator provided by the bluewalm module. <br>
<br>

In [3]:
# we import what we need for the attention layer
from bluewalm.softplus_attention import attention_operator
from bluewalm.softplus_attention import QueryProjection, KeyProjection, ValueProjection, OutProjection


class AttentionLayerSP(torch.nn.Module):
    def __init__(self, dim, core_dim):
        super().__init__()
        self.wq = QueryProjection(dim, core_dim)
        self.wk = KeyProjection(dim, core_dim)
        self.wv = ValueProjection(dim, core_dim)
        self.wo = OutProjection(dim, core_dim)
    
    def reset_parameters(self):
        self.wq.reset_parameters()
        self.wk.reset_parameters()
        self.wv.reset_parameters()
        self.wo.reset_parameters()
    
    def forward(self,
                x: torch.Tensor,
                k_cache: torch.Tensor,
                v_cache: torch.Tensor):
        
        # x : (bsz, seqlen, dim)
        # k_cache : (bsz, core_dim, cache_len)
        # v_cache : (bsz, core_dim, cache_len)
        
        q, k, v = self.wq(x), self.wk(x), self.wv(x)
        # q : (bsz, seqlen, core_dim)
        # k : (bsz, core_dim, seqlen)
        # v : (bsz, core_dim, seqlen)
        
        # reuse attention keys and values by concatenating to the current ones 
        k = torch.cat((k_cache, k), dim=2)
        v = torch.cat((v_cache, v), dim=2)
        # k : (bsz, core_dim, cache_len + seqlen)
        # v : (bsz, core_dim, cache_len + seqlen)
        
        # q, k and v must be contiguous here 
        scores = attention_operator(q, k, v)
        # scores : (bsz, seqlen, core_dim)
        
        output = self.wo(scores)
        # output : (bsz, seqlen, dim)
        return output, k, v

<br>
The secret sauce in the softplus attention layer is the softplus attention operator. <br>
Let's see what the documentation says about it. <br>
<br>

In [4]:
print(attention_operator.__doc__)

 Attention operator. Realizes softplus attention. 
    Args: 
         q (torch.Tensor) : (b x s x r) dimensional tensor, the query tensor, 
         k (torch.Tensor) : (b x r x t) dimensional tensor, the key tensor, 
         v (torch.Tensor) : (b x r x t) dimensional tensor, the value tensor, 
        
        where
             b : batch size
             s : query length
             t : key-value length
             r : core dimension
    
        The three tensors must be on the same device and they must be stored in the same floating point format. 
        Supported formats : torch.float32, torch.float16, torch.bfloat16. 
        
        Returns: 
        qkv (torch.Tensor) : (b x s x r) dimensional tensor, 
                             stored on the same device as the input tensors, 
                             and in the same floating point format. 
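
<br>
The shape contract in the docstring above can be written down as a small check. This is only a sketch of the documented shapes: the helper `softplus_attention_output_shape` is hypothetical, works on plain tuples and does not call the operator itself. <br>
<br>

```python
def softplus_attention_output_shape(q_shape, k_shape, v_shape):
    # hypothetical helper: validates the documented shape contract only
    # q : (b, s, r), k : (b, r, t), v : (b, r, t)  ->  output : (b, s, r)
    if len(q_shape) != 3 or len(k_shape) != 3 or len(v_shape) != 3:
        raise ValueError("q, k and v must all be 3-dimensional")
    b, s, r = q_shape
    if tuple(k_shape) != tuple(v_shape):
        raise ValueError("k and v must have the same shape (b, r, t)")
    if k_shape[0] != b or k_shape[1] != r:
        raise ValueError("k and v must share b and r with q")
    return (b, s, r)


out_shape = softplus_attention_output_shape((4, 2048, 512), (4, 512, 2048), (4, 512, 2048))
print(out_shape)
```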
    


<br><br>
The astute reader will notice that the number of attention heads is gone. <br>
Instead, softplus attention depends on a hyperparameter 'r', which we call the "core dimension". <br>
<br>
Technically, the training procedure selects the number of attention heads, which no longer has to be an integer. <br>
The core dimension is a hyperparameter, but simple experiments suggest that setting it to 4x the token dimension might work well. <br>
We can also use a **heuristic** designed to estimate good core dimension values. <br>
<br>

In [5]:
from bluewalm.softplus_attention import heuristic_core_dim

dim = 128
core_dim = heuristic_core_dim(dim)
print(core_dim)

424


Let's see what the documentation says about it: 

In [6]:
print(heuristic_core_dim.__doc__)

 takes dim and returns a friendly suggestion for core dim 


<br>
According to simple experiments we performed, accuracy grows fast in the token dimension when the core dimension is set with the above heuristic. <br>
This should be taken with a pinch of salt, since it's a heuristic. <br>
<br>

<br><br>
Softplus attention layers were meant to be used on the **GPU**. <br>
However, they run on the **CPU** as well, and can be adjusted to and implemented for **any AI accelerator**. <br>
By the way, they are **FSDP-ready**. <br>
<br>

<br>
Below, we create one instance of the softmax attention layer and one instance of the softplus attention layer. <br>
<br>

In [7]:
dim = 128
n_heads = 4
softmax_attention = AttentionLayerSM(dim, n_heads)
print(softmax_attention)

AttentionLayerSM(
  (wq): Linear(in_features=128, out_features=128, bias=False)
  (wk): Linear(in_features=128, out_features=128, bias=False)
  (wv): Linear(in_features=128, out_features=128, bias=False)
  (wo): Linear(in_features=128, out_features=128, bias=False)
)


In [8]:
dim = 128
core_dim = 128
softplus_attention = AttentionLayerSP(dim, core_dim)
print(softplus_attention)

AttentionLayerSP(
  (wq): QueryProjection(dim=128, core_dim=128, precision=fp32, size=0.0625 MB)
  (wk): KeyProjection(dim=128, core_dim=128, precision=fp32, size=0.0625 MB)
  (wv): ValueProjection(dim=128, core_dim=128, precision=fp32, size=0.0625 MB)
  (wo): OutProjection(dim=128, core_dim=128, precision=fp32, size=0.0625 MB)
)
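
<br>
The `size=0.0625 MB` shown for each projection is consistent with one dense fp32 matrix of shape `dim x core_dim`. The arithmetic below is a sketch under that assumption (a single weight matrix, 1 MB = 1024**2 bytes); the helper `weight_size_mb` is ours and not part of bluewalm. <br>
<br>

```python
def weight_size_mb(rows, cols, bytes_per_value=4):
    # assumption: one dense rows x cols matrix, 1 MB = 1024**2 bytes
    return rows * cols * bytes_per_value / 1024**2


projection_mb = weight_size_mb(128, 128)           # fp32: 4 bytes per value
projection_bf16_mb = weight_size_mb(128, 128, 2)   # bf16: 2 bytes per value
print(projection_mb, projection_bf16_mb)
```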


<br>

### Positionwise Feedforward Layer (FFN)
<br>
Below is a simple implementation of the positionwise feedforward layer found in LLM architectures. <br>
It's a gated linear unit (GLU), where "hidden_dim" is the usual bottleneck dimension. <br>
<br>

In [9]:
class FeedForward(torch.nn.Module):
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.w1 = torch.nn.Linear(dim, hidden_dim, bias=False)
        self.w2 = torch.nn.Linear(hidden_dim, dim, bias=False)
        self.w3 = torch.nn.Linear(dim, hidden_dim, bias=False)
    
    def reset_parameters(self):
        self.w1.reset_parameters()
        self.w2.reset_parameters()
        self.w3.reset_parameters()
    
    def forward(self, x):
        return self.w2(torch.nn.functional.silu(self.w1(x)) * self.w3(x))

<br>
Let's create an instance of this class. <br>
<br>

In [10]:
combinator_sm = FeedForward(128, 256)
print(combinator_sm)

FeedForward(
  (w1): Linear(in_features=128, out_features=256, bias=False)
  (w2): Linear(in_features=256, out_features=128, bias=False)
  (w3): Linear(in_features=128, out_features=256, bias=False)
)


<br>
Below we import the replacement layer from the bluewalm module. <br>
This layer was meant to replace positionwise feedforward layers. <br>
Using this layer is expected to result in a better (accuracy / compute) ratio. <br>
<br>

In [11]:
from bluewalm.softplus_attention import Combinator


print(Combinator.__doc__)

Combinator layer for softplus attention. Changes the internal representation of the tokens within the forward 
     pass of the Transformer layer. It was meant to be used as a replacement for positionwise feedforward layers. 
     The values are initialized from :math:`\mathcal{U}(\frac{-1}{\sqrt{k}}, \frac{1}{\sqrt{k}})`, 
     where :math:`k = \frac{shape[0] + shape[1]}{2}`. 
    
    Args:
        dim (int): The dimension of the layer. Should be the token dimension. 
    
    KwArgs:
        device (str): The device the layer is constructed on. 
                      By default, this is `cpu`. 
        dtype (torch.dtype): Sets the datatype of the layer after construction. 
                             By default, this is `torch.float32`. 
                             The following are supported: `torch.float32`, `torch.float16`, `torch.bfloat16`. 
    
    Attributes:
        weight (Tensor): the learnable weights of the module. 
        dim (int) : The dimension of the layer. Shou

<br>
Let's create an instance of this class. <br>
<br>

In [12]:
combinator_sp = Combinator(128)
combinator_sp.uniform_(-1.0, 1.0)

print(combinator_sp)

Combinator(dim=128, precision=fp32, size=0.125 MB)


<br>
Notice that the bottleneck dimension is absent. <br>
That's because there is no bottleneck in this layer. <br> 
<br>
We also expect a loss of accuracy compared to the positionwise feedforward layer above. <br>
<br>
We have found that the accuracy advantage gained by using the GLU above might not be worth the cost. <br>
We recommend improving accuracy by increasing the token dimension or the core dimension instead. <br>
<br>
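
To put rough numbers on the cost side, the sketch below compares the parameter count of the GLU feedforward layer created above with the number of fp32 values implied by the printed Combinator size. The Combinator figure is inferred from `size=0.125 MB` alone (assuming fp32 weights and 1 MB = 1024**2 bytes); its internal layout is not shown in this notebook. <br>
<br>

```python
dim, hidden_dim = 128, 256

# GLU feedforward layer: w1 and w3 are dim x hidden_dim, w2 is hidden_dim x dim
ffn_params = 3 * dim * hidden_dim

# Combinator: value count inferred from the printed size only
# (assumption: fp32 weights, 1 MB = 1024**2 bytes)
combinator_params = int(0.125 * 1024**2) // 4

print(ffn_params, combinator_params, ffn_params / combinator_params)
```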

<br>

### Benchmarking GPU Memory
<br>

<br>
Now we are going to measure the GPU memory the two attention layers use. <br>
We are going to use a context manager for this. <br>
<br>

In [13]:
import math
import contextlib


@contextlib.contextmanager
def benchmark_gpu_memory():
    # peak allocation so far; the peak stats are reset at the end of every run
    start = torch.cuda.max_memory_allocated()
    outside_context = set(globals().keys())
    try:
        yield
    except torch.OutOfMemoryError:
        start = None
    finally:
        end = torch.cuda.max_memory_allocated()
        # delete the variables allocated within the context 
        inside_context = set(globals().keys()) - outside_context
        for name in inside_context:
            del globals()[name]
        # reset peak memory stats
        torch.cuda.reset_peak_memory_stats()
        if start is not None:
            memory_used = (end - start) / 1024**2
            memory_used = math.ceil(memory_used)
            memory_used = int(memory_used)
            print("used", "{:06d}".format(memory_used), "MB of GPU memory")
        else:
            print("out of memory")
        print()
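
<br>
For readers without a GPU at hand, the same measure-around-a-block pattern can be tried on the CPU. The sketch below is an analogue built on the standard-library `tracemalloc` module, not part of the notebook's benchmark; the name `benchmark_cpu_memory` and the `result` dictionary are ours. <br>
<br>

```python
import contextlib
import tracemalloc


@contextlib.contextmanager
def benchmark_cpu_memory(result):
    # hypothetical CPU analogue of the GPU context manager above
    tracemalloc.start()
    try:
        yield
    finally:
        # get_traced_memory() returns (current, peak) in bytes
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        result["peak_mb"] = peak / 1024**2


stats = {}
with benchmark_cpu_memory(stats):
    buffer = bytearray(10 * 1024**2)  # allocate roughly 10 MB inside the context

print("peak usage inside the context:", round(stats["peak_mb"], 1), "MB")
```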

<br>
We are going to use some convenience functions: 
<br><br>

In [14]:
def get_input(bsz, seqlen, dim, device, dtype=torch.float32):
    shape = (bsz, seqlen, dim)
    x = torch.zeros(shape, device=device, dtype=dtype).uniform_(-1.0, 1.0)
    return x


def get_sm_cache(bsz, cache_len, dim, device, dtype=torch.float32):
    shape = (bsz, cache_len, dim)
    k_cache = torch.empty(shape, device=device, dtype=dtype).uniform_(-1.0, 1.0)
    v_cache = torch.empty(shape, device=device, dtype=dtype).uniform_(-1.0, 1.0)
    return k_cache, v_cache


def get_sp_cache(bsz, cache_len, core_dim, device, dtype=torch.float32):
    shape = (bsz, core_dim, cache_len)
    k_cache = torch.empty(shape, device=device, dtype=dtype).uniform_(-1.0, 1.0)
    v_cache = torch.empty(shape, device=device, dtype=dtype).uniform_(-1.0, 1.0)
    return k_cache, v_cache

<br>
....and some more convenience functions: <br>
<br>

In [15]:
def forward_backward_sm(bsz, seqlen, dim, n_heads, cache_len):
    print("running a single forward pass and then a backward pass of softmax attention....")
    # create the attention layer
    softmax_attention = AttentionLayerSM(dim, n_heads).cuda()
    # create the input
    x = get_input(bsz, seqlen, dim, 'cuda')
    k_cache, v_cache = get_sm_cache(bsz, cache_len, dim, 'cuda')
    # all inputs will require grads
    x.requires_grad = True
    k_cache.requires_grad = True
    v_cache.requires_grad = True
    # run a forward pass
    output, new_k_cache, new_v_cache = softmax_attention(x, k_cache, v_cache)
    # now a backward pass
    output.sum().backward()
    assert x.grad is not None


def forward_backward_sp(bsz, seqlen, dim, core_dim, cache_len):
    print("running a single forward pass and then a backward pass of softplus attention....")
    # create the attention layer
    softplus_attention = AttentionLayerSP(dim, core_dim).cuda()
    # create the input
    x = get_input(bsz, seqlen, dim, 'cuda')
    k_cache, v_cache = get_sp_cache(bsz, cache_len, core_dim, 'cuda')
    # all inputs will require grads
    x.requires_grad = True
    k_cache.requires_grad = True
    v_cache.requires_grad = True
    # run a forward pass
    output, new_k_cache, new_v_cache = softplus_attention(x, k_cache, v_cache)
    # now a backward pass
    output.sum().backward()
    assert x.grad is not None

<br>
Now we run both benchmarks inside the context manager defined above. <br>
<br>

In [16]:
bsz, seqlen, dim, core_dim, n_heads, cache_len = 32, 8192, 512, 512, 128, 0


with benchmark_gpu_memory():
    forward_backward_sm(bsz, seqlen, dim, n_heads, cache_len)

with benchmark_gpu_memory():
    forward_backward_sp(bsz, seqlen, dim, core_dim, cache_len)

running a single forward pass and then a backward pass of softmax attention....
used 013598 MB of GPU memory

running a single forward pass and then a backward pass of softplus attention....
used 006791 MB of GPU memory



<br>

### Deployment into TensorRT
<br>

<br>
Now we are going to deploy both the softmax and the softplus attention layers into TensorRT format and run a compute benchmark. <br>
In production, the neural network will most likely be deployed in TensorRT format, so these compute benchmarks really count. <br>
First, let's import the modules we need. <br>
<br>

In [17]:
import os
import sys
import json
import subprocess
import numpy as np
from ml_dtypes import bfloat16

<br>
We are going to use some convenience functions. <br>
<br>

In [18]:
def execute(command):
    ''' 
        execute command; capture and print stdout
        return stdout 
    ''' 
    # free up some memory 
    torch.cuda.empty_cache()
    # execute command 
    command = command.split()
    outputs = []
    stdout = subprocess.PIPE
    with subprocess.Popen(command, stdout=stdout, bufsize=1, 
                          universal_newlines=True) as process:
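        for line in process.stdout:
            # strip only the trailing newline, never a real character
            line = line.rstrip('\n')
            outputs.append(line)
            print(line)
    # keep the line structure in the returned stdout
    output = '\n'.join(outputs)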
    return output


def export_bf16_tensor_to_file(tensor, filename):
    tensor = tensor.to(device='cpu')
    tensor = tensor.float()
    tensor = tensor.numpy()
    tensor = tensor.astype(bfloat16)
    tensor.tofile(filename)


def export_bf16_inputs(sample):
    export_bf16_tensor_to_file(sample[0], "input.dat")
    export_bf16_tensor_to_file(sample[1], "k_cache.dat")
    export_bf16_tensor_to_file(sample[2], "v_cache.dat")


def shape_to_str(shape):
    return "x".join([str(i) for i in shape])
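
<br>
As a quick illustration of how these helpers feed into the `trtexec` command line, the sketch below builds the value passed to `--minShapes`, `--optShapes` and `--maxShapes` for the softmax layer with the parameters used in this notebook. The helper `shapes_argument` is hypothetical; `shape_to_str` is repeated so the snippet runs on its own. <br>
<br>

```python
def shape_to_str(shape):
    return "x".join(str(i) for i in shape)


def shapes_argument(named_shapes):
    # hypothetical helper: joins "name:AxBxC" entries with commas
    return ",".join(name + ":" + shape_to_str(shape)
                    for name, shape in named_shapes.items())


shapes = shapes_argument({
    "input": (4, 2048, 512),    # (bsz, seqlen, dim)
    "k_cache": (4, 0, 512),     # (bsz, cache_len, dim), empty cache
    "v_cache": (4, 0, 512),
})
print(shapes)
```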

<br>
First, let's set the parameters. <br>
<br>

In [19]:
bsz, seqlen, dim, core_dim, n_heads, cache_len = 4, 2048, 512, 512, 128, 0

<br>
Below we deploy the softmax attention layer into TensorRT format and benchmark it. <br>
<br>

In [20]:
# create the model 
softmax_attention = AttentionLayerSM(dim, n_heads).cuda().bfloat16()
# create a sample input 
x = get_input(bsz, seqlen, dim, 'cuda', torch.bfloat16)
k_cache, v_cache = get_sm_cache(bsz, cache_len, dim, 'cuda', torch.bfloat16)
sample = (x, k_cache, v_cache)

# warmup and recompilation of torchscript models may take too long to run.... 
# especially for large models
# here we just turn off graph optimizations to deal with that
# note: context managers must be separated by a comma; `a and b` would only enter `b`
with torch.no_grad(), torch.jit.optimized_execution(False):
    # ts export
    softmax_attention_traced = torch.jit.trace(softmax_attention, sample, check_trace=False)
    
    # onnx export
    onnx_model_path = "model.onnx"
    torch.onnx.export(softmax_attention_traced, sample, onnx_model_path, verbose=False, 
                      opset_version=18, export_params=True, 
                      keep_initializers_as_inputs=False, 
                      custom_opsets={}, 
                      do_constant_folding=True, 
                      input_names=['input', 'k_cache', 'v_cache'], 
                      output_names=['output', 'updated_k_cache', 'updated_v_cache'], 
                      dynamic_axes={'input' : {0 : 'batch_size', 1 : 'query_len'}, 
                                    # the softmax caches are (bsz, cache_len, dim): the cache length is axis 1
                                    'k_cache' : {0 : 'batch_size', 1 : 'cache_len'}, 
                                    'v_cache' : {0 : 'batch_size', 1 : 'cache_len'}, 
                                    'output' : {0 : 'batch_size', 1 : 'query_len'}, 
                                    'updated_k_cache' : {0 : 'batch_size', 1 : 'updated_cache_len'}, 
                                    'updated_v_cache' : {0 : 'batch_size', 1 : 'updated_cache_len'}})

# tensorrt export
export_bf16_inputs(sample)

command = "trtexec --onnx=./model.onnx"
command += " --loadInputs=input:./input.dat"
if k_cache.numel() > 0:
    command += ",k_cache:./k_cache.dat,v_cache:./v_cache.dat"
command += " --inputIOFormats=bf16:chw,bf16:chw,bf16:chw"
command += " --outputIOFormats=bf16:chw,bf16:chw,bf16:chw"
command += " --bf16"
input_shape = shape_to_str(sample[0].shape)
k_cache_shape = shape_to_str(sample[1].shape)
v_cache_shape = shape_to_str(sample[2].shape)
shapes = "input:" + input_shape + ",k_cache:" + k_cache_shape + ",v_cache:" + v_cache_shape
command += " --minShapes=" + shapes
command += " --optShapes=" + shapes
command += " --maxShapes=" + shapes
command += " --builderOptimizationLevel=5"
command += " --maxAuxStreams=2"  # the larger `maxAuxStreams` is, the more memory the engine needs! 
#command += " --memPoolSize=workspace:16384"
command += " --saveEngine=./model.engine"
# command += " --skipInference"
execute(command)
pass



&&&& RUNNING TensorRT.trtexec [TensorRT v100900] [b34] # trtexec --onnx=./model.onnx --loadInputs=input:./input.dat --inputIOFormats=bf16:chw,bf16:chw,bf16:chw --outputIOFormats=bf16:chw,bf16:chw,bf16:chw --bf16 --minShapes=input:4x2048x512,k_cache:4x0x512,v_cache:4x0x512 --optShapes=input:4x2048x512,k_cache:4x0x512,v_cache:4x0x512 --maxShapes=input:4x2048x512,k_cache:4x0x512,v_cache:4x0x512 --builderOptimizationLevel=5 --maxAuxStreams=2 --saveEngine=./model.engine
[10/07/2025-13:47:44] [I] === Model Options ===
[10/07/2025-13:47:44] [I] Format: ONNX
[10/07/2025-13:47:44] [I] Model: ./model.onnx
[10/07/2025-13:47:44] [I] Output:
[10/07/2025-13:47:44] [I] === Build Options ===
[10/07/2025-13:47:44] [I] Memory Pools: workspace: default, dlaSRAM: default, dlaLocalDRAM: default, dlaGlobalDRAM: default, tacticSharedMem: default
[10/07/2025-13:47:44] [I] avgTiming: 8
[10/07/2025-13:47:44] [I] Precision: FP32+BF16
[10/07/2025-13:47:44] [I] LayerPrecisions: 
[10/07/2025-13:47:44] [I] Layer Dev

[10/07/2025-13:48:01] [W] [TRT] Tactic Device request: 16512MB Available: 15954MB. Device memory is insufficient to use tactic.
[10/07/2025-13:48:01] [W] [TRT] UNSUPPORTED_STATE: Skipping tactic 0 due to insufficient memory on requested size of 17314086912 detected for tactic 0x0000000000000000.
[10/07/2025-13:48:06] [W] [TRT] Tactic Device request: 16400MB Available: 15954MB. Device memory is insufficient to use tactic.
[10/07/2025-13:48:06] [W] [TRT] UNSUPPORTED_STATE: Skipping tactic 0 due to insufficient memory on requested size of 17196646400 detected for tactic 0x0000000000000000.
[10/07/2025-13:48:06] [W] [TRT] Tactic Device request: 16400MB Available: 15954MB. Device memory is insufficient to use tactic.
[10/07/2025-13:48:06] [W] [TRT] UNSUPPORTED_STATE: Skipping tactic 1 due to insufficient memory on requested size of 17196646400 detected for tactic 0x0000000000000001.
[10/07/2025-13:48:06] [W] [TRT] Tactic Device request: 16400MB Available: 15954MB. Device memory is insuffici

[10/07/2025-13:48:09] [I] [TRT] Local timing cache in use. Profiling results in this builder pass will not be stored.
[10/07/2025-13:48:09] [I] [TRT] Compiler backend is used during engine build.


[10/07/2025-13:48:27] [W] [TRT] UNSUPPORTED_STATE: Skipping tactic 0 due to insufficient memory on requested size of 17280532480 detected for tactic 0x0000000000000000.


[10/07/2025-13:48:39] [I] [TRT] [MS] Multi stream is disabled as cannot find an opportunity to leverage it
[10/07/2025-13:48:39] [I] [TRT] Detected 3 inputs and 3 output network tensors.
[10/07/2025-13:48:39] [I] [TRT] Total Host Persistent Memory: 80 bytes
[10/07/2025-13:48:39] [I] [TRT] Total Device Persistent Memory: 0 bytes
[10/07/2025-13:48:39] [I] [TRT] Max Scratch Memory: 8682209280 bytes
[10/07/2025-13:48:39] [I] [TRT] [BlockAssignment] Started assigning block shifts. This will take 1 steps to complete.
[10/07/2025-13:48:39] [I] [TRT] [BlockAssignment] Algorithm ShiftNTopDown took 0.009499ms to assign 1 blocks to 1 nodes requiring 8682209280 bytes.
[10/07/2025-13:48:39] [I] [TRT] Total Activation Memory: 8682209280 bytes
[10/07/2025-13:48:39] [I] [TRT] Total Weights Memory: 10486144 bytes
[10/07/2025-13:48:39] [I] [TRT] Compiler backend is used during engine execution.
[10/07/2025-13:48:39] [I] [TRT] Engine generation completed in 30.0939 seconds.
[10/07/2025-13:48:39] [I] [TRT

[10/07/2025-13:48:39] [W] [TRT] [MS] Multi stream is disabled because the stream assignment failed.


[10/07/2025-13:48:40] [I] [TRT] Local timing cache in use. Profiling results in this builder pass will not be stored.
[10/07/2025-13:48:40] [I] [TRT] Compiler backend is used during engine build.


[10/07/2025-13:48:58] [W] [TRT] UNSUPPORTED_STATE: Skipping tactic 0 due to insufficient memory on requested size of 17280532480 detected for tactic 0x0000000000000000.


[10/07/2025-13:49:10] [I] [TRT] [MS] Multi stream is disabled as cannot find an opportunity to leverage it
[10/07/2025-13:49:10] [I] [TRT] Detected 3 inputs and 3 output network tensors.
[10/07/2025-13:49:10] [I] [TRT] Total Host Persistent Memory: 80 bytes
[10/07/2025-13:49:10] [I] [TRT] Total Device Persistent Memory: 0 bytes
[10/07/2025-13:49:10] [I] [TRT] Max Scratch Memory: 8682209280 bytes
[10/07/2025-13:49:10] [I] [TRT] [BlockAssignment] Started assigning block shifts. This will take 1 steps to complete.
[10/07/2025-13:49:10] [I] [TRT] [BlockAssignment] Algorithm ShiftNTopDown took 0.008312ms to assign 1 blocks to 1 nodes requiring 8682209280 bytes.
[10/07/2025-13:49:10] [I] [TRT] Total Activation Memory: 8682209280 bytes
[10/07/2025-13:49:10] [I] [TRT] Total Weights Memory: 10486144 bytes
[10/07/2025-13:49:10] [I] [TRT] Compiler backend is used during engine execution.
[10/07/2025-13:49:10] [I] [TRT] Engine generation completed in 29.9979 seconds.
[10/07/2025-13:49:10] [I] [TRT

[10/07/2025-13:49:10] [W] [TRT] [MS] Multi stream is disabled because the stream assignment failed.


[10/07/2025-13:49:11] [I] [TRT] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 3 MiB, GPU 8332 MiB
[10/07/2025-13:49:11] [I] Engine built in 84.6637 sec.
[10/07/2025-13:49:11] [I] Created engine with size: 10.7329 MiB
[10/07/2025-13:49:11] [I] [TRT] Loaded engine size: 10 MiB
[10/07/2025-13:49:11] [I] Engine deserialized in 0.00816047 sec.
[10/07/2025-13:49:11] [I] [TRT] [MS] Running engine with multi stream info
[10/07/2025-13:49:11] [I] [TRT] [MS] Number of aux streams is 2
[10/07/2025-13:49:11] [I] [TRT] [MS] Number of total worker streams is 3
[10/07/2025-13:49:11] [I] [TRT] [MS] The main stream provided by execute/enqueue calls is the first worker stream
[10/07/2025-13:49:11] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +8280, now: CPU 0, GPU 8290 (MiB)
[10/07/2025-13:49:11] [I] Setting persistentCacheLimit to 0 bytes.
[10/07/2025-13:49:11] [I] Created execution context with device memory size: 8280 MiB


[10/07/2025-13:49:14] [W] * GPU compute time is unstable, with coefficient of variance = 3.8752%.
[10/07/2025-13:49:14] [W]   If not already in use, locking GPU clock frequency or adding --useSpinWait may improve the stability.
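
<br>
A general Python point that matters for export cells like the one above: several context managers in one `with` statement are separated by commas. Joining them with `and` evaluates to a single object, so only the last one is entered. Below is a small stand-alone demonstration; the `tracked` helper is hypothetical and only records which managers were entered. <br>
<br>

```python
import contextlib

entered = []


@contextlib.contextmanager
def tracked(name):
    # records the name when the context is actually entered
    entered.append(name)
    yield


# `a and b` evaluates to `b` (both objects are truthy), so only "b" is entered
with tracked("a") and tracked("b"):
    pass
entered_with_and = list(entered)

entered.clear()

# comma-separated managers are both entered, left to right
with tracked("a"), tracked("b"):
    pass
entered_with_comma = list(entered)

print(entered_with_and, entered_with_comma)
```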


<br>
Below we deploy the softplus attention layer into TensorRT format and benchmark it. <br>
<br>

In [21]:
# create the model 
softplus_attention = AttentionLayerSP(dim, core_dim).cuda().bfloat16()
# create a sample input 
x = get_input(bsz, seqlen, dim, 'cuda', torch.bfloat16)
k_cache, v_cache = get_sp_cache(bsz, cache_len, core_dim, 'cuda', torch.bfloat16)
sample = (x, k_cache, v_cache)

# warmup and recompilation of torchscript models may take too long to run.... 
# especially for large models
# here we just turn off graph optimizations to deal with that
# note: context managers must be separated by a comma; `a and b` would only enter `b`
with torch.no_grad(), torch.jit.optimized_execution(False):
    # ts export
    softplus_attention_traced = torch.jit.trace(softplus_attention, sample, check_trace=False)
    
    # onnx export
    onnx_model_path = "model.onnx"
    torch.onnx.export(softplus_attention_traced, sample, onnx_model_path, verbose=False, 
                      opset_version=18, export_params=True, 
                      keep_initializers_as_inputs=False, 
                      custom_opsets={"trt.plugins" : 1}, 
                      do_constant_folding=True, 
                      input_names=['input', 'k_cache', 'v_cache'], 
                      output_names=['output', 'updated_k_cache', 'updated_v_cache'], 
                      dynamic_axes={'input' : {0 : 'batch_size', 1 : 'query_len'}, 
                                    'k_cache' : {0 : 'batch_size', 2 : 'cache_len'}, 
                                    'v_cache' : {0 : 'batch_size', 2 : 'cache_len'}, 
                                    'output' : {0 : 'batch_size', 1 : 'query_len'}, 
                                    'updated_k_cache' : {0 : 'batch_size', 2 : 'updated_cache_len'}, 
                                    'updated_v_cache' : {0 : 'batch_size', 2 : 'updated_cache_len'}})

# tensorrt export
package_path = os.path.dirname(os.path.realpath(bluewalm.__file__))
plugin_path = os.path.join(package_path, 'operators', 'tensorrt', 'libbluewalmPlugin.so')

export_bf16_inputs(sample)

command = "trtexec --onnx=./model.onnx --staticPlugins=" + str(plugin_path)
command += " --loadInputs=input:./input.dat"
if k_cache.numel() > 0:
    command += ",k_cache:./k_cache.dat,v_cache:./v_cache.dat"
command += " --inputIOFormats=bf16:chw,bf16:chw,bf16:chw"
command += " --outputIOFormats=bf16:chw,bf16:chw,bf16:chw"
command += " --bf16"
input_shape = shape_to_str(sample[0].shape)
k_cache_shape = shape_to_str(sample[1].shape)
v_cache_shape = shape_to_str(sample[2].shape)
shapes = "input:" + input_shape + ",k_cache:" + k_cache_shape + ",v_cache:" + v_cache_shape
command += " --minShapes=" + shapes
command += " --optShapes=" + shapes
command += " --maxShapes=" + shapes
command += " --builderOptimizationLevel=5"
command += " --maxAuxStreams=2"  # the larger `maxAuxStreams` is, the more memory the engine needs! 
#command += " --memPoolSize=workspace:16384"
command += " --saveEngine=./model.engine"
# command += " --skipInference"
execute(command)
pass



&&&& RUNNING TensorRT.trtexec [TensorRT v100900] [b34] # trtexec --onnx=./model.onnx --staticPlugins=/usr/local/lib/python3.12/dist-packages/bluewalm/operators/tensorrt/libbluewalmPlugin.so --loadInputs=input:./input.dat --inputIOFormats=bf16:chw,bf16:chw,bf16:chw --outputIOFormats=bf16:chw,bf16:chw,bf16:chw --bf16 --minShapes=input:4x2048x512,k_cache:4x512x0,v_cache:4x512x0 --optShapes=input:4x2048x512,k_cache:4x512x0,v_cache:4x512x0 --maxShapes=input:4x2048x512,k_cache:4x512x0,v_cache:4x512x0 --builderOptimizationLevel=5 --maxAuxStreams=2 --saveEngine=./model.engine
[10/07/2025-13:49:15] [I] === Model Options ===
[10/07/2025-13:49:15] [I] Format: ONNX
[10/07/2025-13:49:15] [I] Model: ./model.onnx
[10/07/2025-13:49:15] [I] Output:
[10/07/2025-13:49:15] [I] === Build Options ===
[10/07/2025-13:49:15] [I] Memory Pools: workspace: default, dlaSRAM: default, dlaLocalDRAM: default, dlaGlobalDRAM: default, tacticSharedMem: default
[10/07/2025-13:49:15] [I] avgTiming: 8
[10/07/2025-13:49:15]

[10/07/2025-13:49:22] [W] * Throughput may be bound by device-to-host transfers for the outputs rather than GPU Compute and the GPU may be under-utilized.
[10/07/2025-13:49:22] [W]   Add --noDataTransfers flag to disable data transfers.
[10/07/2025-13:49:22] [W] * GPU compute time is unstable, with coefficient of variance = 2.55184%.
[10/07/2025-13:49:22] [W]   If not already in use, locking GPU clock frequency or adding --useSpinWait may improve the stability.


<br>
Thus, we can see that the mean latency of the softmax attention layer is <b>53.2391 ms</b>. <br>
Moreover, the mean latency of the softplus attention layer is <b>2.34724 ms</b>. <br><br>
The exact nature of the performance improvement is difficult to determine, however. <br>
For softmax attention, latency is heavily influenced by the query length, the token dimension and the number of attention heads, <br>
while softplus attention scales much more nicely. It was designed to scale. <br><br>
Furthermore, the overall impact on the entire neural network has to be studied, as that's what counts. <br>
In any event, we can conclude that the improvement appears to be substantial. <br><br>
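
The ratios implied by the numbers reported in this notebook can be computed directly. Keep in mind that the two benchmarks use different settings (the memory run is an fp32 forward and backward pass with `bsz=32, seqlen=8192`, the latency run is bf16 inference with `bsz=4, seqlen=2048`), so the two ratios are not directly comparable. <br>
<br>

```python
# numbers copied from the cell outputs and the text above
softmax_latency_ms, softplus_latency_ms = 53.2391, 2.34724
softmax_memory_mb, softplus_memory_mb = 13598, 6791

latency_ratio = softmax_latency_ms / softplus_latency_ms
memory_ratio = softmax_memory_mb / softplus_memory_mb
print(round(latency_ratio, 1), round(memory_ratio, 1))
```
<br>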

A deployment script that deploys an entire neural network can be found in the bluewalm github repository. <br><br>
That's it, this concludes the notebook! <br>
<br>