<a href="https://colab.research.google.com/github/farmountain/SmartGlass-AI-Agent/blob/main/colab_notebooks/Session1_Multimodal_Basics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 📘 Session 01: Multimodal Basics
**Goal:** Build a basic pipeline using Whisper (speech-to-text), CLIP (vision embedding), and GPT-2 (language response).

This is the foundation for building an AI agent that can hear, see, and speak on smart glasses such as the Meta Ray-Ban Wayfarer.

In [1]:
# ✅ Install required libraries
!pip install -q openai-whisper transformers torchaudio pydub Pillow



In [4]:
# ✅ Install required packages
!pip install -q gTTS pydub openai-whisper

# ✅ Generate "Hey Athena" audio file
from gtts import gTTS
from pydub import AudioSegment

tts = gTTS("Hey Athena", lang='en')
tts.save("hey_athena.mp3")

sound = AudioSegment.from_file("hey_athena.mp3")
sound.export("hey_athena.wav", format="wav")


<_io.BufferedRandom name='hey_athena.wav'>
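Whisper decodes and resamples its input to 16 kHz mono internally, so the exported WAV does not need a particular sample rate. A quick look at the file header can still catch an empty or truncated clip before transcription. A minimal sketch using only the standard library; `describe_wav` is a hypothetical helper, and the demo writes a synthetic silent clip so the block runs on its own:

```python
import os
import tempfile
import wave

def describe_wav(path: str) -> dict:
    """Read basic properties from a PCM WAV file header."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_rate_hz": rate,
            "sample_width_bytes": wav.getsampwidth(),
            "duration_s": frames / rate,
        }

# Demo on a synthetic one-second silent clip (an assumption for illustration);
# in the notebook, call describe_wav("hey_athena.wav") instead.
demo_path = os.path.join(tempfile.mkdtemp(), "silence.wav")
with wave.open(demo_path, "wb") as out:
    out.setnchannels(1)
    out.setsampwidth(2)  # 16-bit samples
    out.setframerate(16000)
    out.writeframes(b"\x00\x00" * 16000)

info = describe_wav(demo_path)
print(info)
```

A duration near zero would suggest that the text-to-speech or export step failed silently.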

In [5]:
# ✅ Load Whisper and transcribe the generated audio
import whisper

model = whisper.load_model('base')
filename = "hey_athena.wav"
result = model.transcribe(filename)

print("🗣️ Transcription:", result["text"])




🗣️ Transcription:  Hey Athena!
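The transcription above carries a leading space and trailing punctuation (`" Hey Athena!"`), which matters if the text is later compared against a wake phrase or spliced into a prompt. Below is a minimal sketch of one way to normalize it. It assumes "Hey Athena" is meant as a wake phrase; `normalize_transcript` and `is_wake_phrase` are illustrative names, not part of Whisper:

```python
import re
import string

# Assumption: the phrase synthesized earlier doubles as the wake phrase.
WAKE_PHRASE = "hey athena"

def normalize_transcript(text: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    no_punct = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", no_punct).strip().lower()

def is_wake_phrase(text: str, wake_phrase: str = WAKE_PHRASE) -> bool:
    """Return True if the normalized transcript starts with the wake phrase."""
    return normalize_transcript(text).startswith(wake_phrase)

print(is_wake_phrase(" Hey Athena!"))
print(is_wake_phrase("What time is it?"))
```

In the notebook, the same check could be applied to `result["text"]` before building the prompt.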


In [7]:
# ✅ Load CLIP to score an image from a URL (or local file) against candidate captions
from transformers import CLIPProcessor, CLIPModel
from PIL import Image
import torch
import requests

# Load CLIP
clip_model = CLIPModel.from_pretrained('openai/clip-vit-base-patch32')
clip_processor = CLIPProcessor.from_pretrained('openai/clip-vit-base-patch32')

# Load image (you can replace URL with a local file path if needed)
image_url = "https://picsum.photos/400"  # random placeholder image
image = Image.open(requests.get(image_url, stream=True).raw)

# Define candidate labels
texts = ["a photo of a city street", "a photo of a dog", "a store front", "a person", "a mountain"]

# Process inputs
inputs = clip_processor(text=texts, images=image, return_tensors="pt", padding=True)
outputs = clip_model(**inputs)
probs = outputs.logits_per_image.softmax(dim=1)

# Print predictions
for text, prob in zip(texts, probs[0]):
    print(f"🔍 {text}: {prob.item()*100:.2f}%")

# Get top prediction
best_caption = texts[probs[0].argmax().item()]





🔍 a photo of a city street: 0.74%
🔍 a photo of a dog: 5.96%
🔍 a store front: 0.23%
🔍 a person: 59.84%
🔍 a mountain: 33.23%
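The percentages above come from a softmax over `logits_per_image`, so they always sum to 100% across the candidate labels, even when none of the labels fits the image. Here is a plain-Python sketch of that step, using made-up logit values (not the ones CLIP produced here) and a hypothetical `min_prob` cutoff for rejecting weak matches:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def pick_caption(labels, logits, min_prob=0.5):
    """Return (label, prob) for the top label, or (None, prob) if below min_prob."""
    probs = softmax(logits)
    best = max(range(len(labels)), key=probs.__getitem__)
    if probs[best] < min_prob:
        return None, probs[best]
    return labels[best], probs[best]

labels = ["a photo of a dog", "a person", "a mountain"]
logits = [20.0, 23.0, 22.0]  # illustrative values only
label, prob = pick_caption(labels, logits)
print(label, round(prob, 3))
```

Because the scores are relative to the label set, adding or removing a candidate caption changes every percentage. A cutoff like `min_prob` is one simple guard against passing a poor match downstream.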


In [8]:
# ✅ GPT-2 generates a reply based on what it saw and heard
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

# Use the Whisper result from earlier
spoken_text = result["text"]  # Make sure result is from the Whisper block

# Construct multimodal prompt
prompt = f"I saw: {best_caption}. I heard: {spoken_text}. What should I say?"

response = generator(prompt, max_length=50, do_sample=True)[0]['generated_text']
print("🤖 GPT-2 Response:\n", response)



Device set to use cpu
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Both `max_new_tokens` (=256) and `max_length`(=50) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


🤖 GPT-2 Response:
 I saw: a person. I heard:  Hey Athena!. What should I say?

I looked over at her, feeling for a reply.

"Hey, can you please call me?"

"Hey, I'm okay. I'm sorry."

"Umm, I'm fine. I'm sorry."

I looked at her, feeling for a reply.

"You really didn't mean to call me, did you?"

"I know."

I looked at her, feeling for a reply.

"Hey, I'm fine. I'm sorry."

I looked at her, feeling for a reply.

"I'm okay. I'm sorry."

"Umm, I'm fine."

I looked at her, feeling for a answer.

"I'm okay. I'm sorry."

I looked at her, feeling for a reply.

"Hey, I'm okay. I'm sorry."

I looked at her, feeling for a reply.

"Hey, I'm okay. I'm okay."

I looked at her, feeling for a reply.

"I'm fine. I'm fine."

I looked at her, feeling for a reply.

"Hey, I
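The warning above reports that `max_new_tokens` (=256) took precedence over `max_length=50`, which is why the response runs far past 50 tokens, and the sampled text quickly falls into a loop. One possible adjustment is to pass an explicit set of generation arguments. The sketch below collects standard Transformers generation parameters in a dictionary; the specific values are assumptions for illustration and have not been tuned or run against this notebook:

```python
# Illustrative values only; none of these were tuned for this notebook.
generation_kwargs = {
    "max_new_tokens": 50,        # caps newly generated tokens; used instead of max_length
    "do_sample": True,           # keep sampling, as in the original call
    "top_p": 0.9,                # nucleus sampling threshold
    "temperature": 0.8,          # values below 1.0 sharpen the distribution
    "repetition_penalty": 1.2,   # discourages re-using recent tokens
    "no_repeat_ngram_size": 3,   # forbids repeating any 3-token sequence
    "return_full_text": False,   # pipeline option: omit the echoed prompt
}

# In the notebook, with `generator` and `prompt` from the cell above:
# response = generator(prompt, **generation_kwargs)[0]["generated_text"]
print(generation_kwargs)
```

Setting only `max_new_tokens` avoids the conflict between the two length limits. The repetition controls may reduce looping but do not make GPT-2 follow instructions.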


Note: The GPT-2 response above is less than satisfactory. Google Colab's resource limits make it impractical to load even a quantized or knowledge-distilled student model of OpenAI's oss-20b or DeepSeek V3. I have a separate 18-week session Colab notebook that distills and quantizes the OpenAI oss-20b and DeepSeek V3 models on an online A100 GPU cluster. For this session, GPT-2 is used for illustration purposes only.
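Whatever model produces the text, the raw output can be tidied before it is displayed or spoken. The sketch below is one simple approach and not part of the original pipeline: a hypothetical `trim_reply` helper drops the echoed prompt and stops at the first line that repeats. The sample text is abridged from the response shown earlier:

```python
def trim_reply(generated: str, prompt: str) -> str:
    """Drop the echoed prompt and stop at the first repeated line."""
    body = generated[len(prompt):] if generated.startswith(prompt) else generated
    seen = set()
    kept = []
    for line in body.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between sentences
        if line in seen:
            break  # the model has started looping
        seen.add(line)
        kept.append(line)
    return "\n".join(kept)

prompt = "I saw: a person. I heard:  Hey Athena!. What should I say?"
generated = prompt + (
    "\n\nI looked over at her, feeling for a reply.\n\n"
    '"Hey, can you please call me?"\n\n'
    '"Hey, I\'m okay. I\'m sorry."\n\n'
    '"Umm, I\'m fine. I\'m sorry."\n\n'
    "I looked at her, feeling for a reply.\n\n"
    '"You really didn\'t mean to call me, did you?"\n\n'
    '"I know."\n\n'
    "I looked at her, feeling for a reply.\n"
)
reply = trim_reply(generated, prompt)
print(reply)
```

Exact-match deduplication is deliberately crude: it catches verbatim loops like the one above but not near-duplicates with small wording changes.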