<a href="https://colab.research.google.com/github/farmountain/SmartGlass-AI-Agent/blob/main/colab_notebooks/Session9_On_Device_Tiny_Models_CPU.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 📦 Session 09: On-Device Tiny Models (CPU)
Deploy compact models such as DistilWhisper and MobileSAM for on-device use on smart glasses. The code cells below use OpenAI Whisper's `tiny` checkpoint as the worked example.

This session covers:
- Running CPU-optimized versions of Whisper and CLIP
- Exporting to ONNX or TorchScript for efficient deployment
- Benchmarks on inference time and memory usage


In [None]:
# ✅ Install dependencies
!pip install -q openai-whisper transformers torchaudio onnxruntime gTTS onnx

In [None]:
# ✅ Generate a test clip with gTTS and transcribe it with Whisper tiny
from gtts import gTTS
import IPython.display as ipd
import whisper

text = "Hello, this is a test of gTTS."
tts = gTTS(text=text, lang='en')
audio_file = "gtts_sample.mp3"
tts.save(audio_file)

print(f"Audio generated and saved as {audio_file}")
# Optional: play the generated audio (display() is needed because this is not the cell's last expression)
ipd.display(ipd.Audio(audio_file))

# Load the tiny checkpoint on CPU, the target device for this session
model = whisper.load_model('tiny', device='cpu')

# Transcribe the generated audio file (fp16 is not supported on CPU)
result = model.transcribe(audio_file, fp16=False)
print('🗣️ Transcription:', result['text'])

In [None]:
# ✅ Convert Whisper model to ONNX for CPU inference
import torch
import whisper

# Export from CPU in eval mode so the weights and dummy tensors share a device
model = model.to('cpu').eval()

# Shapes are fixed at export time (no dynamic_axes), so inference inputs must match them
dummy_input = torch.randn(1, 80, 3000)
# Add a dummy input for tokens
dummy_tokens = torch.randint(0, model.decoder.token_embedding.num_embeddings, (1, 10))

# Disable scaled dot product attention for ONNX export compatibility
whisper.model.MultiHeadAttention.use_sdpa = False
try:
    torch.onnx.export(model, (dummy_input, dummy_tokens), 'whisper_tiny.onnx', opset_version=11)
    print('✅ ONNX model exported: whisper_tiny.onnx')
finally:
    # Re-enable scaled dot product attention, even if the export fails
    whisper.model.MultiHeadAttention.use_sdpa = True
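The `(1, 80, 3000)` dummy input follows from how Whisper frames audio: 30-second windows of 16 kHz audio, converted to a log-mel spectrogram with 80 mel bins and a 10 ms hop. The sketch below reproduces that arithmetic. The constants mirror those in `whisper.audio` but are redefined locally (an assumption made so the snippet runs without Whisper installed), and `mel_input_shape` is a hypothetical helper name.

```python
# Constants mirroring whisper.audio, redefined locally for a standalone sketch
SAMPLE_RATE = 16_000   # Hz
HOP_LENGTH = 160       # samples between spectrogram frames (10 ms at 16 kHz)
CHUNK_LENGTH = 30      # seconds of audio per model window
N_MELS = 80            # mel bins used by the 'tiny' checkpoint


def mel_input_shape(batch_size: int = 1) -> tuple:
    """Return the (batch, n_mels, n_frames) shape the encoder expects."""
    n_samples = CHUNK_LENGTH * SAMPLE_RATE   # 480000 samples per window
    n_frames = n_samples // HOP_LENGTH       # 3000 frames per window
    return (batch_size, N_MELS, n_frames)


print(mel_input_shape())  # (1, 80, 3000)
```

Shorter clips are padded to the full 30-second window before they reach the encoder, which is why a fixed-shape export is workable here.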

In [None]:
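# ✅ Load and benchmark ONNX model with onnxruntime
import onnxruntime as ort
import numpy as np
import time

# Pin the CPU provider so the benchmark matches the on-device target
ort_session = ort.InferenceSession('whisper_tiny.onnx', providers=['CPUExecutionProvider'])

# Get the input names from the ONNX session
input_names = [inp.name for inp in ort_session.get_inputs()]

# Prepare input feed dictionary using the correct input names
input_feed = {input_names[0]: dummy_input.numpy(), input_names[1]: dummy_tokens.numpy()}

# Warm-up run: the first call includes one-time setup cost
ort_session.run(None, input_feed)

# perf_counter() is a monotonic, high-resolution clock, better suited to timing than time.time()
start = time.perf_counter()
outputs = ort_session.run(None, input_feed)
end = time.perf_counter()
print(f'⏱️ ONNX Inference Time: {end - start:.3f} seconds')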
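A single timed call is noisy, and the cell above does not measure memory. The sketch below shows one way to get steadier numbers: a few untimed warm-up calls, a median over repeated runs, and the process's peak resident memory. `benchmark` and `peak_rss_mb` are hypothetical helper names, the workload is a stand-in, and the memory helper assumes a Unix-like system because it relies on the standard-library `resource` module.

```python
import resource
import statistics
import sys
import time


def benchmark(fn, warmup: int = 2, runs: int = 10) -> dict:
    """Time fn() over several runs after a few untimed warm-up calls."""
    for _ in range(warmup):
        fn()
    timings_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings_ms.append((time.perf_counter() - start) * 1000.0)
    return {
        "runs": runs,
        "median_ms": statistics.median(timings_ms),
        "min_ms": min(timings_ms),
        "max_ms": max(timings_ms),
    }


def peak_rss_mb() -> float:
    """Peak resident memory of this process in MiB (Unix only)."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in bytes on macOS and in kilobytes on Linux
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


# Stand-in workload; in the notebook this would be
# lambda: ort_session.run(None, input_feed)
stats = benchmark(lambda: sum(range(100_000)))
print(f"median {stats['median_ms']:.2f} ms over {stats['runs']} runs")
print(f"peak RSS {peak_rss_mb():.1f} MiB")
```

Peak RSS covers the whole Python process, including PyTorch and any other loaded libraries, so it is most useful for comparing runs made under the same setup.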

### 📌 Notes:
- Whisper's `tiny` checkpoint (about 39M parameters) is the smallest official model and is small enough for CPU-only inference
- For production use, export the model and run it through ONNX Runtime or TorchScript rather than eager PyTorch
- Consider using [MobileSAM](https://github.com/ChaoningZhang/MobileSAM) or [TinyCLIP](https://github.com/S-Lab-System-Group/TinyCLIP) for image tasks


🚀 **Try optimizing for ARM-based edge devices (e.g., Raspberry Pi, Jetson Nano)** in future sessions.
Use quantization-aware training or INT8 export for faster performance.
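
As a starting point for INT8 export, the sketch below applies ONNX Runtime's post-training dynamic quantization, which converts weights to INT8 without retraining. This is a simpler technique than quantization-aware training and is swapped in here for illustration only. The file names and the helper names `quantize_to_int8` and `size_reduction` are assumptions, and transcription accuracy should be rechecked after quantizing.

```python
import os


def quantize_to_int8(src_path: str, dst_path: str) -> None:
    """Write an INT8 weight-quantized copy of an ONNX model."""
    # Imported lazily so the size helper below works without onnxruntime
    from onnxruntime.quantization import QuantType, quantize_dynamic

    quantize_dynamic(src_path, dst_path, weight_type=QuantType.QInt8)


def size_reduction(src_path: str, dst_path: str) -> float:
    """Fraction of on-disk size saved by dst relative to src (0.0 to 1.0)."""
    src_bytes = os.path.getsize(src_path)
    dst_bytes = os.path.getsize(dst_path)
    return 1.0 - dst_bytes / src_bytes


# Only runs when the exported model from the earlier cell is present
if os.path.exists("whisper_tiny.onnx"):
    quantize_to_int8("whisper_tiny.onnx", "whisper_tiny.int8.onnx")
    saved = size_reduction("whisper_tiny.onnx", "whisper_tiny.int8.onnx")
    print(f"INT8 model is {saved:.0%} smaller on disk")
```

The quantized file can be loaded with `ort.InferenceSession` in the same way as the original. Whether it is actually faster depends on the target CPU's INT8 support, so it is worth timing both versions on the device.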