# Chapter-10 End-to-End Execution of the SmolLM Model

#### In this notebook, we apply the concepts from the earlier chapters to the SmolLM model end to end: running it in PyTorch, exporting it to ONNX, running it with ONNX Runtime, and quantizing it.

### Step-1 Load and Run model in PyTorch

In [1]:
# Install prerequisites
!pip install datasets onnx onnxsim onnxruntime transformers torch



In [2]:
import os
import onnx
import torch
import numpy as np
import onnxruntime as ort
from typing import Tuple, List
from transformers.cache_utils import DynamicCache
from transformers import AutoModelForCausalLM, AutoTokenizer
from onnxruntime.quantization import quantize_dynamic, QuantType
from onnxruntime.quantization.shape_inference import quant_pre_process

In [3]:
# Load the model from Hugging Face and run it in PyTorch
model_name = "HuggingFaceTB/SmolLM2-135M-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, 
                                             attn_implementation="eager", 
                                             use_cache=True)

messages = [{"role": "user", "content": "What is gravity?"}]
input_text = tokenizer.apply_chat_template(messages, tokenize=False)
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=1000)

# Decode only the new tokens
input_length = inputs.input_ids.shape[1]
output_text = tokenizer.decode(outputs[0][input_length:])
                               
print("===== Input to the model =====")
print(input_text)
print("===== Model's prediction =====")
print(output_text)

===== Input to the model =====
<|im_start|>system
You are a helpful AI assistant named SmolLM, trained by Hugging Face<|im_end|>
<|im_start|>user
What is gravity?<|im_end|>

===== Model's prediction =====
<|im_start|>assistant
Gravity is a fundamental force of nature that attracts objects with mass towards each other. It is a result of the interaction between mass, energy, and space itself. In the context of our universe, gravity is a result of the curvature of spacetime caused by the presence of mass and energy.

Imagine spacetime as a trampoline. When you place a heavy object, like a bowling ball, on the trampoline, it creates a depression in the surface. This depression is caused by the object's mass and the energy it contains. The more massive the object, the larger the depression.

Now, when you move an object, such as a bowling ball, it creates a gravitational pull on the surrounding space. This gravitational pull is what causes the bowling ball to move towards the center of the 
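
The prompt printed above follows a ChatML-style layout: each turn is wrapped in `<|im_start|>` and `<|im_end|>` markers, and a default system message is prepended. The sketch below mimics that layout in plain Python so the structure is easy to see. It is not the tokenizer's real implementation (the actual template ships with the tokenizer); `format_chatml` and `DEFAULT_SYSTEM` are hypothetical names, and the system text is copied from the printed input. It also illustrates why the model's prediction starts with `<|im_start|>assistant`: the prompt ends after the user turn, so the model generates the assistant header itself. Passing `add_generation_prompt=True` to `apply_chat_template` would append that header to the prompt instead.

```python
from typing import Dict, List

# Assumption: default system prompt copied from the printed model input.
DEFAULT_SYSTEM = "You are a helpful AI assistant named SmolLM, trained by Hugging Face"


def format_chatml(messages: List[Dict[str, str]], add_generation_prompt: bool = False) -> str:
    """Hypothetical helper that mimics the ChatML-style layout of the prompt."""
    # Prepend the default system message when the caller did not supply one.
    if not messages or messages[0]["role"] != "system":
        messages = [{"role": "system", "content": DEFAULT_SYSTEM}] + list(messages)

    # Wrap every turn in start/end markers, one turn per block.
    text = "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )

    # Optionally open the assistant turn so the model only has to write the answer.
    if add_generation_prompt:
        text += "<|im_start|>assistant\n"
    return text


prompt = format_chatml([{"role": "user", "content": "What is gravity?"}])
print(prompt)
```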

### Step-2 Export model to ONNX

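#### Export the SmolLM model in two variants: an init model that handles the first inference (prefill over the whole prompt) and a step model that handles every subsequent inference (one new token at a time, reusing the cached keys and values).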

<div align="center">
  <img src="./extras/Figure-10.3.png" alt="" width="800"/>
</div>
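
The split into two models works because of the key/value cache: once the keys and values for the prompt have been computed, a later step only needs the new token's query, key, and value. The following toy example is a minimal sketch of that idea using single-head attention in NumPy with random weights. It is not SmolLM's attention (no rotary embeddings, no grouped heads); all names in it are illustrative. It checks that attending with a cached K/V gives the same result for the last token as recomputing attention over the full sequence.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def causal_attention(q, k, v):
    """Single-head causal attention; q is [t_q, d], k and v are [t_k, d]."""
    t_q, t_k = q.shape[0], k.shape[0]
    scores = q @ k.T / np.sqrt(q.shape[-1])
    # The queries are the last t_q positions of the sequence, so query i
    # may only attend to keys at positions <= i + (t_k - t_q).
    offset = t_k - t_q
    mask = np.arange(t_k)[None, :] > (np.arange(t_q)[:, None] + offset)
    scores = np.where(mask, -np.inf, scores)
    return softmax(scores) @ v


rng = np.random.default_rng(0)
d, prompt_len = 8, 5
wq, wk, wv = (rng.standard_normal((d, d)) for _ in range(3))
x = rng.standard_normal((prompt_len + 1, d))  # prompt tokens + one new token

# Reference: recompute attention over all tokens.
full = causal_attention(x @ wq, x @ wk, x @ wv)

# "Init" pass: compute and keep K/V for the prompt.
k_cache, v_cache = x[:prompt_len] @ wk, x[:prompt_len] @ wv

# "Step" pass: project only the new token and append its K/V to the cache.
x_new = x[prompt_len:]
k_cache = np.concatenate([k_cache, x_new @ wk])
v_cache = np.concatenate([v_cache, x_new @ wv])
step = causal_attention(x_new @ wq, k_cache, v_cache)

print("Cached step matches full recomputation:", np.allclose(full[-1], step[0]))
```

The same pattern appears in the exports below: the init model returns the cache as extra outputs, and the step model takes it back as inputs and returns a cache that is one position longer.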

In [4]:
# Export SmolLM Init model
class SmolLMInit(torch.nn.Module):
    """Smol LM init model with lm head."""
    
    def __init__(self, model: torch.nn.Module, lm_head: torch.nn.Module):
        super().__init__()
        self.model = model
        self.lm_head = lm_head

    def forward(
        self, 
        input_ids: torch.Tensor, 
    ) -> Tuple[torch.Tensor, ...]:
        """Forward pass for decoder initialization.
        
        Args:
            input_ids: Input token IDs for decoder
            
        Returns:
            Tuple containing logits and past key values
        """
        output = self.model(
            input_ids=input_ids,
            use_cache=True,
        )
        lm_logits = self.lm_head(output[0])
        return (lm_logits,) + output.past_key_values.to_legacy_cache()

def export_prefill(init_model_path, onnx_opset):
    # Model configuration
    batch = 1
    seq_len = 16

    # Create dummy inputs
    input_ids = torch.randint(0, model.config.vocab_size, (batch, seq_len))    # [batch, seq]

    # Initialize model
    init_model = SmolLMInit(model.model, model.lm_head)

    # Add inputs and outputs in dynamic axes
    dynamic_axes = {
        'input_ids': {0: 'batch', 1: 'sequence_length'},
        'logits': {0: 'batch', 1: 'sequence_length'},
    }
    
    # Prepare output names
    init_input_names = ["input_ids"]
    init_output_names = ["logits"]
    for i in range(model.config.num_hidden_layers):
        init_output_names.append(f'past_key_values.{i}.key')
        init_output_names.append(f'past_key_values.{i}.value')
        dynamic_axes[f'past_key_values.{i}.key'] = {0: 'batch', 2: 'sequence_length'}
        dynamic_axes[f'past_key_values.{i}.value'] = {0: 'batch', 2: 'sequence_length'}

    # Export to ONNX
    torch.onnx.export(
        init_model, (input_ids,), init_model_path,
        input_names=init_input_names, output_names=init_output_names,
        dynamic_axes=dynamic_axes, opset_version=onnx_opset,
    )

    print(f"Decoder init model exported as {init_model_path}")


init_model_path = "./smol_lm_init.onnx"
onnx_opset = 15
export_prefill(init_model_path, onnx_opset)

Decoder init model exported as ./smol_lm_init.onnx


In [5]:
# Export SmolLM Step model
class SmolLMStep(torch.nn.Module):
    """Smol LM step model with lm head."""
    
    def __init__(self, model: torch.nn.Module, lm_head: torch.nn.Module):
        super().__init__()
        self.model = model
        self.lm_head = lm_head

    def forward(
        self, 
        input_ids: torch.Tensor, 
        cache_position: torch.Tensor,
        *past_key_values: Tuple[torch.Tensor, ...]
    ) -> Tuple[torch.Tensor, ...]:
        """Forward pass for a single decoding step.
        
        Args:
            input_ids: Input token IDs for decoder
            cache_position: Position index for caching
            *past_key_values: Past key-value pairs for attention
            
        Returns:
            Tuple containing logits and updated key values
        """
        # Reformat past_key_values into expected structure
        num_layers = len(past_key_values) // 2
        reformatted_past = []
        
        for i in range(num_layers):
            layer_past = (
                past_key_values[2 * i],     # self_key
                past_key_values[2 * i + 1], # self_value
            )
            reformatted_past.append(layer_past)

        dynamic_cache = DynamicCache.from_legacy_cache(reformatted_past, num_layers)
        outputs = self.model(
            input_ids=input_ids,
            past_key_values=dynamic_cache,
            use_cache=True,
            cache_position=cache_position,
        )
        
        lm_logits = self.lm_head(outputs[0])
        return (lm_logits,) + outputs.past_key_values.to_legacy_cache()


def export_decode(step_model_path, onnx_opset):
    def create_dummy_past_key_values(
        batch: int,
        num_layers: int,
        num_heads: int,
        past_seq_len: int,
        head_dim: int
    ) -> List[torch.Tensor]:
        """Create dummy past key values for ONNX export.
        
        Args:
            batch: Batch size
            num_layers: Number of decoder layers
            num_heads: Number of attention heads
            past_seq_len: Length of past sequence
            head_dim: Dimension of each attention head
            
        Returns:
            List of dummy tensors for past key values
        """
        dummy_kvs = []

        self_attn_kv_shape = (batch, num_heads, past_seq_len, head_dim)
        for _ in range(num_layers):
            # Self attention keys/values
            dummy_kvs.append(torch.rand(*self_attn_kv_shape))  # self_key
            dummy_kvs.append(torch.rand(*self_attn_kv_shape))  # self_value
        
        return dummy_kvs


    # Define input shape variables
    past_seq_len = 16
    step_seq_len = 1
    batch = 1

    # Create dummy inputs
    dummy_input_ids = torch.randint(0, model.config.vocab_size, (batch, step_seq_len))
    dummy_cache_position = torch.tensor([0], dtype=torch.long)
    dummy_past_key_values = create_dummy_past_key_values(
        batch, model.config.num_hidden_layers, model.config.num_key_value_heads, past_seq_len, model.config.head_dim
    )
    
    # Prepare dynamic axes
    dynamic_axes = {
        "input_ids": {0: 'batch', 1: "curr_seq_len"},
        "logits": {0: 'batch', 1: "curr_seq_len"},
    }

    # Prepare output names
    step_output_names = ["logits"]
    step_input_names = ["input_ids", "cache_position"]
    past_kv_dynamic_axes = {0: 'batch', 2: "past_seq"}
    present_kv_dynamic_axes = {0: 'batch', 2: "past_seq_plus_curr_seq_len"}
    for i in range(model.config.num_hidden_layers):
        dynamic_axes[f'past_key_values.{i}.key'] = past_kv_dynamic_axes
        dynamic_axes[f'past_key_values.{i}.value'] = past_kv_dynamic_axes
        dynamic_axes[f'present_key_values.{i}.key'] = present_kv_dynamic_axes
        dynamic_axes[f'present_key_values.{i}.value'] = present_kv_dynamic_axes
        step_input_names.append(f'past_key_values.{i}.key')
        step_input_names.append(f'past_key_values.{i}.value')
        step_output_names.append(f'present_key_values.{i}.key')
        step_output_names.append(f'present_key_values.{i}.value')

    # Initialize and export model
    step_model = SmolLMStep(model.model, model.lm_head)

    torch.onnx.export(
        step_model,
        (dummy_input_ids, dummy_cache_position, *dummy_past_key_values),
        step_model_path,
        input_names=step_input_names, output_names=step_output_names,
        dynamic_axes=dynamic_axes, opset_version=onnx_opset,
    )

    print(f"Decoder step model exported as {step_model_path}")


step_model_path = "./smol_lm_step.onnx"
onnx_opset = 15
export_decode(step_model_path, onnx_opset)


Decoder step model exported as ./smol_lm_step.onnx
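
Each past key or value tensor passed to the step model has shape `(batch, num_key_value_heads, past_seq_len, head_dim)`, and there are two such tensors per layer. The sketch below estimates how much data moves in and out of the step model per call in FP32. The configuration numbers are assumptions for SmolLM2-135M; read `model.config.num_hidden_layers`, `model.config.num_key_value_heads`, and `model.config.head_dim` to confirm them. `kv_cache_bytes` is a hypothetical helper.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch=1, bytes_per_elem=4):
    """Bytes needed for the keys and values of all layers (FP32 by default)."""
    per_tensor = batch * num_kv_heads * seq_len * head_dim
    return 2 * num_layers * per_tensor * bytes_per_elem  # 2 = one key + one value


# Assumed SmolLM2-135M values; check model.config before relying on them.
NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM = 30, 3, 64

for seq_len in (30, 1000):
    size_mb = kv_cache_bytes(NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM, seq_len) / 1024 / 1024
    print(f"past_seq_len={seq_len}: {size_mb:.2f} MB of cache")
```

The cache grows linearly with the sequence length, which is why the `past_seq` axis is declared dynamic in the export.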


In [6]:
# Optimize models using ONNX Simplifier

!python -m onnxsim smol_lm_init.onnx smol_lm_init.onnx
!python -m onnxsim smol_lm_step.onnx smol_lm_step.onnx


### Step-3 Run both models using ONNX Runtime

In [7]:
# Helper function for inference
def preprocess(user_text, tokenizer):
    messages = [{"role": "user", "content": user_text}]
    input_text = tokenizer.apply_chat_template(messages, tokenize=False)
    input_ids = tokenizer(input_text, return_tensors="np").input_ids
    return {"input_ids": input_ids}

def run_models(init_model, step_model, inputs, tokenizer, max_iters=100):
    past_seq_len = inputs["input_ids"].shape[1]
    print("Inputs shape: ", inputs['input_ids'].shape)
    
    # Run init model
    outputs = init_model.run(None, inputs)
    logits = outputs[0]
    print("Logits shape: ", logits.shape)
    generated_ids = [int(np.argmax(logits[0, -1]))]

    step_inputs = {}
    step_inputs["input_ids"] = np.array(generated_ids[-1], dtype=np.int64).reshape(1, -1)
    
    num_layers = (len(outputs) - 1) // 2
    for i in range(num_layers):
        step_inputs[f'past_key_values.{i}.key'] = outputs[i * 2 + 1]
        step_inputs[f'past_key_values.{i}.value'] = outputs[i * 2 + 2]

    for i in range(max_iters):
        step_inputs["cache_position"] = np.array([past_seq_len + i], dtype=np.int64)

        # Run step model iteratively
        step_outputs = step_model.run(None, step_inputs)
        logits = step_outputs[0]
        pred_token = int(np.argmax(logits[0, -1]))
        
        if pred_token == tokenizer.eos_token_id:
            print("Stopping generation as EOS token reached.")
            break

        generated_ids.append(pred_token)

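        # Feed the predicted token and the updated cache back into the step model
        step_inputs["input_ids"] = np.array(pred_token, dtype=np.int64).reshape(1, -1)
        for layer in range(num_layers):  # separate name so the outer loop index is not shadowed
            step_inputs[f'past_key_values.{layer}.key'] = step_outputs[layer * 2 + 1]
            step_inputs[f'past_key_values.{layer}.value'] = step_outputs[layer * 2 + 2]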

    predicted_text = tokenizer.decode(generated_ids)
    print("Predicted text:", predicted_text)
    return predicted_text
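
The control flow of `run_models` is easier to see with the ONNX sessions replaced by a stand-in. The sketch below is a simplified greedy loop, not a drop-in replacement: `greedy_decode` is a hypothetical helper, the "step model" is a small lookup table instead of a real session, and the KV cache is omitted. Unlike `run_models`, it also checks the first predicted token for EOS.

```python
import numpy as np


def greedy_decode(first_logits, step_fn, eos_id, max_iters):
    """Greedy loop: take the argmax, stop at EOS, otherwise feed the token back."""
    token = int(np.argmax(first_logits))  # prediction from the init pass
    generated = []
    for _ in range(max_iters):
        if token == eos_id:
            break
        generated.append(token)
        token = int(np.argmax(step_fn(token)))  # one step-model call per new token
    return generated


# Hypothetical stand-in for the step model: row t holds the logits that follow token t.
VOCAB, EOS = 5, 4
table = np.full((VOCAB, VOCAB), -1.0)
for cur, nxt in [(0, 2), (2, 3), (3, 1), (1, EOS)]:
    table[cur, nxt] = 1.0

first_logits = np.array([1.0, 0.0, 0.0, 0.0, 0.0])  # the init pass "predicts" token 0
tokens = greedy_decode(first_logits, lambda t: table[t], EOS, max_iters=10)
print(tokens)
```

In the real loop, `step_fn` corresponds to `step_model.run(...)`, which additionally receives `cache_position` and the past keys and values, and returns the updated cache alongside the logits.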

In [8]:
# Run exported models
init_model_path = "smol_lm_init.onnx"
step_model_path = "smol_lm_step.onnx"
init_model = ort.InferenceSession(init_model_path)
step_model = ort.InferenceSession(step_model_path)

inputs = preprocess("What is gravity?", tokenizer)
predicted_text = run_models(init_model, step_model, inputs, tokenizer, max_iters=1000)

Inputs shape:  (1, 30)
Logits shape:  (1, 30, 49152)
Stopping generation as EOS token reached.
Predicted text: <|im_start|>assistant
Gravity is a fundamental force of nature that attracts objects with mass towards each other. It is a result of the interaction between mass, energy, and space itself. In the context of our universe, gravity is a result of the curvature of spacetime caused by the presence of mass and energy.

Imagine spacetime as a trampoline. When you place a heavy object, like a bowling ball, on the trampoline, it creates a depression in the surface. This depression is caused by the object's mass and the energy it contains. The more massive the object, the larger the depression.

Now, when you move an object, such as a bowling ball, it creates a gravitational pull on the surrounding space. This gravitational pull is what causes the bowling ball to move towards the center of the trampoline. The more massive the object, the stronger the gravitational pull.

Gravity is a u

### Step-4 Quantizing models

In [9]:
# Apply dynamic quantization, since SmolLM is a Transformer-based model.

def apply_dynamic_quantization(fp32_path):
    root, ext = os.path.splitext(fp32_path)
    fp32_path_preproc = f"{root}_preproc{ext}"        
    int8_path_dynamic_quant = f"{root}_int8_dynamic_quant{ext}"
    
    # First, run shape inference and ONNX Runtime model optimization before quantizing the model.
    quant_pre_process(fp32_path, fp32_path_preproc, skip_symbolic_shape=True)

    # Apply dynamic quantization
    quantize_dynamic(
        model_input=fp32_path_preproc,          # Input ONNX model
        model_output=int8_path_dynamic_quant,   # Output quantized model
        weight_type=QuantType.QUInt8            # Quantize only the weights to uint8 ahead of time;
                                                # activations are quantized at runtime
    )
    
    print(f"Dynamic Quantized model saved at: {int8_path_dynamic_quant}")
    
    # Compare model sizes
    fp32_size = os.path.getsize(fp32_path) / 1024 / 1024
    dynamic_quant_size = os.path.getsize(int8_path_dynamic_quant) / 1024 / 1024
    
    print(f"FP32 Model Size: {fp32_size:.2f} MB")
    print(f"Dynamic Quantized Model Size: {dynamic_quant_size:.2f} MB")

    return int8_path_dynamic_quant

In [10]:
# Quantize SmolLM init model
init_model_path_int8 = apply_dynamic_quantization(init_model_path)

# Quantize SmolLM step model
step_model_path_int8 = apply_dynamic_quantization(step_model_path)

Dynamic Quantized model saved at: smol_lm_init_int8_dynamic_quant.onnx
FP32 Model Size: 622.25 MB
Dynamic Quantized Model Size: 156.19 MB
Dynamic Quantized model saved at: smol_lm_step_int8_dynamic_quant.onnx
FP32 Model Size: 622.26 MB
Dynamic Quantized Model Size: 156.20 MB
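
The roughly 4x size reduction reported above (622.25 MB to 156.19 MB) is what one would expect when 32-bit float weights are stored as 8-bit integers plus a scale and zero point. The sketch below shows a simplified per-tensor asymmetric uint8 scheme in NumPy to make the arithmetic concrete. It is an illustration under stated assumptions, not ONNX Runtime's exact implementation; `quantize_uint8` and `dequantize` are hypothetical helpers, and the weight matrix is random.

```python
import numpy as np


def quantize_uint8(w):
    """Asymmetric per-tensor quantization of a float array to uint8."""
    # Include 0.0 in the range so that zero is exactly representable.
    w_min, w_max = min(float(w.min()), 0.0), max(float(w.max()), 0.0)
    scale = (w_max - w_min) / 255.0
    zero_point = int(round(-w_min / scale))
    q = np.clip(np.round(w / scale) + zero_point, 0, 255).astype(np.uint8)
    return q, scale, zero_point


def dequantize(q, scale, zero_point):
    """Map uint8 codes back to approximate float values."""
    return (q.astype(np.float32) - zero_point) * scale


rng = np.random.default_rng(0)
w = rng.standard_normal((576, 576)).astype(np.float32)  # random stand-in for a weight matrix

q, scale, zp = quantize_uint8(w)
w_hat = dequantize(q, scale, zp)

print("size ratio (fp32 / uint8):", w.nbytes / q.nbytes)
print("max abs error:", float(np.abs(w - w_hat).max()), "| scale:", scale)
```

The rounding error per weight is bounded by about half the scale, which is small but not zero. That is consistent with the quantized models producing a fluent but differently worded answer in the next cell.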


In [11]:
# Run quantized models
init_model = ort.InferenceSession(init_model_path_int8)
step_model = ort.InferenceSession(step_model_path_int8)

inputs = preprocess("What is gravity?", tokenizer)
predicted_text = run_models(init_model, step_model, inputs, tokenizer, max_iters=1000)

Inputs shape:  (1, 30)
Logits shape:  (1, 30, 49152)
Stopping generation as EOS token reached.
Predicted text: <|im_start|>assistant
Gravity is a fundamental force of nature that attracts objects with mass towards each other. It is a result of the interaction between mass, energy, and space. The force of gravity is a result of the curvature of spacetime caused by the presence of mass and energy.

The force of gravity is not a force that acts between two objects, but rather a force that acts between objects that are moving towards each other. This is because the presence of mass causes the gravitational field to become stronger, pulling objects towards each other.

The strength of the gravitational field depends on the mass of the objects and the distance between them. The stronger the gravitational field, the stronger the force of gravity. The strength of the gravitational field is also affected by the strength of the gravitational field of the other objects, as well as the distance b