<a href="https://colab.research.google.com/github/anilnbsingh/Happywhale---Whale-and-Dolphin-Identification/blob/main/SmolLMQCS_NPU.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [10]:
# ==============================================================================
# Google Colab Notebook for Running SmolLM on Qualcomm QCS8550
# This script is updated to compile the model for the NPU and perform inference
# using the `submit_inference_job` API.
#
# Prerequisites:
# - A valid Qualcomm AI Hub account.
# - An API token from your Qualcomm AI Hub account settings.
# - The target device 'QCS8550 (Proxy)' is available on the AI Hub.
# ==============================================================================

# ------------------------------------------------------------------------------
# 1. Setup the environment
# ------------------------------------------------------------------------------

# Install the required Python packages.
# 'qai-hub' is for the Qualcomm AI Hub API.
# 'qai-hub-models' provides helper utilities.
# 'transformers' and 'torch' are for loading the model.
# 'onnx' is required for exporting the model to ONNX format.
print("1. Installing required Python packages...")
#!pip install qai-hub qai-hub-models torch transformers onnx
print("Installation complete.")
print("="*80)

# ------------------------------------------------------------------------------
# 2. Configure Qualcomm AI Hub Access
# ------------------------------------------------------------------------------

import qai_hub as hub

# You must configure your API token to authenticate with the AI Hub.
# Replace "<YOUR_API_TOKEN>" with your actual token.
# DO NOT share your token.
print("2. Configuring Qualcomm AI Hub...")
api_token = "nak7kyh0inngt9vewsxy74gobp4mk6q5zeean82x" # <-- IMPORTANT: Replace with your API token

!qai-hub configure --api_token {api_token}
print("Configuration complete. Your API token is now set.")
print("="*80)

# ------------------------------------------------------------------------------
# 3. Download and Prepare the SmolLM Model
# ------------------------------------------------------------------------------

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
import numpy as np

# We'll use the SmolLM-135M-Instruct model from Hugging Face as an example.
model_name = "HuggingFaceTB/SmolLM-135M-Instruct"
print(f"3. Downloading and preparing model: {model_name}...")

# Load the tokenizer and the model from Hugging Face.
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Set the model to evaluation mode. This is important for conversion.
model.eval()

# To handle the dynamic nature of the past_key_values cache, we will
# create a simple wrapper class for the model's forward pass.
# This ensures the output is a simple tuple of tensors, which ONNX can handle.
class SmolLMWrapper(torch.nn.Module):
    def __init__(self, model):
        super().__init__()
        self.model = model
        self.config = model.config

    def forward(self, input_ids):
        # The model's forward pass returns a tuple of logits and past_key_values
        outputs = self.model(input_ids, return_dict=True)

        # The outputs contain a DynamicCache object, which the ONNX exporter
        # cannot handle. We need to convert it into a flat list of tensors.
        logits = outputs.logits
        past_key_values = outputs.past_key_values

        # Flatten the tuple of tuples of tensors into a single tuple of tensors.
        flattened_past_key_values = []
        for layer_key_value in past_key_values:
            flattened_past_key_values.extend(layer_key_value)

        # The ONNX exporter requires the output to be a tuple of tensors.
        return (logits,) + tuple(flattened_past_key_values)

# Instantiate the wrapper and get the output names
model_wrapper = SmolLMWrapper(model)
num_layers = model.config.num_hidden_layers
output_names = ['logits'] + [f'past_key_values_{i}' for i in range(num_layers * 2)]

# Define the sample prompt text in a single, consistent location.
# This variable will be used for both ONNX export and inference.
prompt_text = "Who was Albert Einstein?"

# We will compile the model for a fixed input size to avoid dynamic shape issues.
# A length of 50 tokens should be sufficient for a short response.
max_input_length = 50
input_shape = (1, max_input_length)
dummy_input = torch.ones(input_shape, dtype=torch.int32)

onnx_model_path = "SmolLM-135M-Instruct.onnx"
print(f"Exporting model to ONNX with a fixed input shape: {input_shape}...")

try:
    # Use the wrapper model for export. We remove dynamic axes for this fixed-shape approach.
    torch.onnx.export(
        model_wrapper,
        dummy_input,
        onnx_model_path,
        opset_version=14,  # Choose a compatible ONNX opset version
        input_names=['input_ids'],
        output_names=output_names,
    )
    print(f"Model successfully exported to {onnx_model_path}")
except Exception as e:
    print(f"An error occurred during ONNX export: {e}")
    onnx_model_path = None
print("="*80)
# ------------------------------------------------------------------------------
# 4. Compile the Model for QCS8550 NPU
# ------------------------------------------------------------------------------

# Define the target device
#target_device = hub.Device("QCS8250 (Proxy)")
target_device = hub.Device("QCS8550 (Proxy)")
# Check if the device is available before submitting the job
print(f"4. Checking for device availability: {target_device.name}...")
available_devices = hub.get_devices()
if target_device.name not in [d.name for d in available_devices]:
    print(f"ERROR: The device '{target_device.name}' is not currently available.")
    print("Please check the Qualcomm AI Hub website for device status and try again later.")
    compiled_model = None
else:
    print(f"Device '{target_device.name}' is available. Submitting compilation job...")
    # Submit a compilation job to the Qualcomm AI Hub using the ONNX model file.
    if onnx_model_path:
        try:
            # We are now targeting the QNN runtime for NPU execution.
            compile_job = hub.submit_compile_job(
                model=onnx_model_path,
                name=f"{model_name.split('/')[-1]}_qcs8550_npu", # Updated name
                device=target_device,
                # The input specs now use the fixed input shape.
                input_specs={"input_ids": (input_shape, "int32")},
                options="--truncate_64bit_io"
            )

            print("Compilation job submitted.")
            try:
                print(f"Job ID: {compile_job.id}")
            except AttributeError:
                print("Could not retrieve job ID from the CompileJob object.")
                print(f"You can check the job status and details on the AI Hub website using the URL: {compile_job.url}")

            # Wait for the job to complete. This is a blocking call and will
            # raise an exception if the job fails.
            print("Waiting for compilation to complete...")
            compile_job.wait()

            # If wait() completes successfully, the model is ready.
            print("Compilation job completed successfully!")
            compiled_model = compile_job.get_target_model()

            # Handle the case where the Model object might not have an 'id' attribute.
            try:
                print(f"Compiled model is ready. Model ID: {compiled_model.id}")
            except AttributeError:
                print("Compiled model is ready, but could not retrieve the Model ID from the Model object.")
                print("Please check the Qualcomm AI Hub website to find the model and its ID.")


        except Exception as e:
            print(f"An error occurred during compilation: {e}")
            compiled_model = None
print("="*80)

# ------------------------------------------------------------------------------
# 5. Run Inference on the NPU using the Compiled Model and generate all tokens.
# ------------------------------------------------------------------------------
if compiled_model:
    print("5. Submitting a single inference job for the initial prompt and generating all tokens...")
    try:
        # The prompt_text variable is already defined and used for compilation.
        print(f"Initial Prompt: '{prompt_text}'")

        # The inference job expects a dictionary of inputs. We need to tokenize
        # the prompt and pad it to the fixed length for a single inference pass.
        # We also need an attention mask to handle the padded tokens correctly.
        input_tokens = tokenizer(prompt_text, return_tensors="pt", max_length=max_input_length, padding="max_length", truncation=True).input_ids

        # Check if the input shape matches the one the model was compiled with.
        current_input_shape = input_tokens.shape
        if current_input_shape != input_shape:
            print(f"ERROR: Input shape mismatch. Expected {input_shape} but got {current_input_shape}.")
            print("To fix this, either re-run the notebook from the start or ensure the prompt is shorter than the max_input_length.")
            raise ValueError("Input shape mismatch")

        # We explicitly cast the NumPy array to int32 to match the compiled model's
        # expected data type, which is now int32.
        input_data = {"input_ids": [input_tokens.cpu().numpy().astype(np.int32)]}

        # Submit the inference job with the compiled model and input data.
        inference_job = hub.submit_inference_job(
            model=compiled_model,
            device=target_device,
            inputs=input_data
        )

        print("Inference job submitted.")
        try:
            print(f"Job ID: {inference_job.id}")
        except AttributeError:
            print("Could not retrieve job ID from the InferenceJob object.")
            print(f"You can check the job status and details on the AI Hub website using the URL: {inference_job.url}")

        # Wait for the inference job to complete.
        print("Waiting for inference to complete...")
        inference_job.wait()

        # If wait() completes successfully, the output is ready.
        print("Inference job completed successfully!")

        # Get the output data, which will be a dictionary of numpy arrays.
        output_data = inference_job.download_output_data()

        if output_data is not None and 'output_0' in output_data:
            # The output is a list containing a single NumPy array.
            # The shape of the logits is (1, max_input_length, vocab_size).
            logits = output_data['output_0'][0]

            # We need to get the token IDs from the logits for each step.
            predicted_token_ids = np.argmax(logits, axis=-1)

            # The number of tokens in the original prompt.
            initial_prompt_length = tokenizer(prompt_text).input_ids

            # Extract only the newly generated tokens.
            generated_token_ids = predicted_token_ids[0, len(initial_prompt_length):]

            # Decode the generated tokens to get the final text.
            generated_text = tokenizer.decode(generated_token_ids, skip_special_tokens=True)

            print("\n" + "="*80)
            print("Full Generated Response:")
            print(prompt_text + generated_text)
            print("="*80)
        else:
            print("Output data is None, or 'output_0' key is missing. Inference job may have failed.")
            if output_data is not None:
                print("Here are all the keys found in the output data for debugging:")
                print(output_data.keys())


    except Exception as e:
        print(f"An error occurred during inference: {e}")
print("="*80)


1. Installing required Python packages...
Installation complete.
2. Configuring Qualcomm AI Hub...
2025-08-10 17:28:30.175 - INFO - Enabling verbose logging.
qai-hub configuration saved to /root/.qai_hub/client.ini
[api]
api_token = nak7kyh0inngt9vewsxy74gobp4mk6q5zeean82x
api_url = https://app.aihub.qualcomm.com
web_url = https://app.aihub.qualcomm.com
verbose = True


Configuration complete. Your API token is now set.
3. Downloading and preparing model: HuggingFaceTB/SmolLM-135M-Instruct...
Exporting model to ONNX with a fixed input shape: (1, 50)...
Model successfully exported to SmolLM-135M-Instruct.onnx
4. Checking for device availability: QCS8550 (Proxy)...
Device 'QCS8550 (Proxy)' is available. Submitting compilation job...
Uploading SmolLM-135M-Instruct.onnx


100%|[34m██████████[0m| 622M/622M [00:07<00:00, 92.4MB/s]


Scheduled compile job (jp1w77ylg) successfully. To see the status and results:
    https://app.aihub.qualcomm.com/jobs/jp1w77ylg/

Compilation job submitted.
Could not retrieve job ID from the CompileJob object.
You can check the job status and details on the AI Hub website using the URL: https://app.aihub.qualcomm.com/jobs/jp1w77ylg/
Waiting for compilation to complete...
Waiting for compile job (jp1w77ylg) completion. Type Ctrl+C to stop waiting at any time.
    ✅ SUCCESS                          
Compilation job completed successfully!
Compiled model is ready, but could not retrieve the Model ID from the Model object.
Please check the Qualcomm AI Hub website to find the model and its ID.
5. Submitting a single inference job for the initial prompt and generating all tokens...
Initial Prompt: 'Who was Albert Einstein?'


Uploading dataset: 17.7kB [00:00, 130kB/s]                    


Scheduled inference job (jgdq88el5) successfully. To see the status and results:
    https://app.aihub.qualcomm.com/jobs/jgdq88el5/

Inference job submitted.
Could not retrieve job ID from the InferenceJob object.
You can check the job status and details on the AI Hub website using the URL: https://app.aihub.qualcomm.com/jobs/jgdq88el5/
Waiting for inference to complete...
Waiting for inference job (jgdq88el5) completion. Type Ctrl+C to stop waiting at any time.
    ✅ SUCCESS                          
Inference job completed successfully!


tmpbz1ulvzm.h5: 100%|[34m██████████[0m| 7.52M/7.52M [00:00<00:00, 26.7MB/s]



Full Generated Response:
Who was Albert Einstein?Fresh�odderoratoryomethingGPTMs Benlibrig espnose scaluddenlyographsikipaccoemporal citizvertyomorphicrets `_zerochond glycosuatingmund.”)uddenlyि Giovanni'",�hootIU incre foreseeablesomeone?|gets hopesiative preparation
