<a href="https://colab.research.google.com/github/farmountain/SmartGlass-AI-Agent/blob/main/colab_notebooks/Session1_Multimodal_Basics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 📘 Session 01: Multimodal Basics
**Goal:** Build a basic pipeline using Whisper (speech-to-text), CLIP (vision embedding), and GPT-2 (language response).

This is the foundation for an AI agent that can hear, see, and speak on smart glasses such as the Meta Ray-Ban Wayfarer.
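
Before loading any models, it may help to see how the three stages fit together. The sketch below is only an outline under my own assumptions: `build_prompt`, `run_agent`, the stand-in lambdas, and the file name `scene.jpg` are hypothetical names for illustration, not part of Whisper, CLIP, or GPT-2. The prompt wording is the one used in the GPT-2 cell of this session.

```python
from typing import Callable


def build_prompt(caption: str, spoken_text: str) -> str:
    """Combine the vision label and the speech transcript into one text prompt."""
    return f"I saw: {caption}. I heard: {spoken_text.strip()}. What should I say?"


def run_agent(
    hear: Callable[[str], str],   # audio path -> transcript (Whisper in this session)
    see: Callable[[str], str],    # image path -> best label (CLIP in this session)
    speak: Callable[[str], str],  # prompt -> reply (GPT-2 in this session)
    audio_path: str,
    image_path: str,
) -> str:
    """Run the hear -> see -> speak stages in order and return the reply."""
    spoken_text = hear(audio_path)
    caption = see(image_path)
    return speak(build_prompt(caption, spoken_text))


# Stand-in stages (hypothetical) so the data flow can be checked without any model.
demo_reply = run_agent(
    hear=lambda path: " Hey Athena",
    see=lambda path: "a photo of a dog",
    speak=lambda prompt: f"(model reply to: {prompt})",
    audio_path="hey_athena.wav",
    image_path="scene.jpg",  # hypothetical file name
)
print(demo_reply)
```

Passing each stage in as a plain callable means any one model can be swapped out later without touching the other two.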

In [None]:
# ✅ Install required libraries
!pip install -q openai-whisper transformers torchaudio pydub Pillow

In [None]:
# ✅ Install gTTS for text-to-speech (pydub and openai-whisper come from the install cell above)
!pip install -q gTTS

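# ✅ Generate a "Hey Athena" test clip with Google Text-to-Speech (needs internet access)
from gtts import gTTS
from pydub import AudioSegment

tts = gTTS("Hey Athena", lang="en")
tts.save("hey_athena.mp3")

# Convert the MP3 to WAV; pydub relies on ffmpeg, which Colab provides
sound = AudioSegment.from_file("hey_athena.mp3", format="mp3")
sound.export("hey_athena.wav", format="wav")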


In [None]:
# ✅ Load Whisper and transcribe the generated audio
import whisper

model = whisper.load_model('base')
filename = "hey_athena.wav"
result = model.transcribe(filename)

print("🗣️ Transcription:", result["text"])
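
The dictionary returned by `transcribe` holds more than `"text"`: it also carries a `"language"` code and a list of `"segments"` with start and end times in seconds. Here is a small sketch of reading those timestamps. The helper `format_segments` and the hand-written `example_result` are my own illustration (the values are made up, not real Whisper output), so the cell runs without loading the model.

```python
def format_segments(result: dict) -> list:
    """Return one 'start -> end: text' line per segment of a Whisper-style result."""
    lines = []
    for seg in result.get("segments", []):
        lines.append(f"[{seg['start']:.2f}s -> {seg['end']:.2f}s] {seg['text'].strip()}")
    return lines


# Hand-written stand-in with the same keys as a Whisper result (illustrative values only).
example_result = {
    "text": " Hey Athena.",
    "language": "en",
    "segments": [{"id": 0, "start": 0.0, "end": 1.2, "text": " Hey Athena."}],
}

for line in format_segments(example_result):
    print(line)
```

With the real model loaded, `format_segments(result)` should work the same way on the `result` from the cell above.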


In [None]:
# ✅ Load CLIP and score an image from a URL (or local file) against candidate labels
from transformers import CLIPProcessor, CLIPModel
from PIL import Image
import torch
import requests

# Load CLIP
clip_model = CLIPModel.from_pretrained('openai/clip-vit-base-patch32')
clip_processor = CLIPProcessor.from_pretrained('openai/clip-vit-base-patch32')

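# Load image (for a local file, call Image.open on the file path instead)
image_url = "https://picsum.photos/400"  # random placeholder image
http_response = requests.get(image_url, stream=True, timeout=30)
http_response.raise_for_status()  # fail early on HTTP errors
image = Image.open(http_response.raw).convert("RGB")  # CLIP expects 3-channel RGB input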

# Define candidate labels
texts = ["a photo of a city street", "a photo of a dog", "a store front", "a person", "a mountain"]

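# Process inputs and run CLIP without tracking gradients (inference only)
inputs = clip_processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = clip_model(**inputs)
# Softmax over the label axis turns image-text similarity logits into probabilities
probs = outputs.logits_per_image.softmax(dim=1)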

# Print predictions
for text, prob in zip(texts, probs[0]):
    print(f"🔍 {text}: {prob.item()*100:.2f}%")

# Get top prediction
best_caption = texts[probs[0].argmax().item()]
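
One thing to keep in mind about the percentages above: the softmax only compares the candidate labels with each other, so the scores always sum to 100% even when none of the labels fits the image. The sketch below reproduces that step in plain Python. The `logits` values are hypothetical numbers I picked for illustration, not real CLIP output.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


labels = ["a photo of a city street", "a photo of a dog", "a store front", "a person", "a mountain"]
logits = [24.1, 18.3, 22.7, 19.0, 17.5]  # hypothetical image-text similarity scores

probs = softmax(logits)
best_index = max(range(len(probs)), key=lambda i: probs[i])

for label, p in zip(labels, probs):
    print(f"{label}: {p * 100:.2f}%")
print("Top label:", labels[best_index])
```

Because the result depends on the label list, adding a catch-all label such as "something else" is one possible way to make weak matches easier to spot.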


In [None]:
# ✅ GPT-2 generates a reply from the CLIP label (what was seen) and the Whisper transcript (what was heard)
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

# Use the Whisper result from earlier
spoken_text = result["text"].strip()  # `result` comes from the Whisper cell above

# Construct multimodal prompt
prompt = f"I saw: {best_caption}. I heard: {spoken_text}. What should I say?"

# max_new_tokens limits only the generated continuation; max_length would also count the prompt tokens
response = generator(prompt, max_new_tokens=50, do_sample=True)[0]["generated_text"]
print("🤖 GPT-2 Response:\n", response)


Note: The GPT-2 response above is less than satisfactory. Google Colab cannot load even a quantized, knowledge-distilled student model of OpenAI gpt-oss-20b or DeepSeek-V3, so this session uses GPT-2 for illustration only. I have a separate 18-week session Colab notebook that distills and quantizes the OpenAI gpt-oss-20b and DeepSeek-V3 models on an online A100 GPU cluster.