<a href="https://colab.research.google.com/github/farmountain/SmartGlass-AI-Agent/blob/main/colab_notebooks/Session3_Scene_Description_CLIP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 📷 Week 3: Scene Description with Vision-Language Models
In this session, we'll simulate how smart glasses understand their environment by scoring an image against a set of candidate scene descriptions with CLIP, a multimodal vision-language transformer.

## 🧰 Install Dependencies

In [None]:
!pip install -q transformers Pillow torch

## 📤 Upload Image (Simulating Smart Glass Input)

In [None]:
from google.colab import files
from PIL import Image
uploaded = files.upload()
image_path = next(iter(uploaded))
image = Image.open(image_path).convert('RGB')
display(image)  # renders inline in Colab; image.show() tries to open an external viewer

## 🧠 Load CLIP Model and Processor

In [None]:
from transformers import CLIPProcessor, CLIPModel
import torch

clip_model = CLIPModel.from_pretrained('openai/clip-vit-base-patch32')
clip_processor = CLIPProcessor.from_pretrained('openai/clip-vit-base-patch32')

## 🔍 Generate Scene Description from Vision Input

In [None]:
texts = [
    'a photo of a busy city street', 'a person walking', 'a food stall',
    'a traffic intersection', 'a shop front', 'a coffee shop', 'a cyclist', 'a car on the road'
]
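# Tokenize the candidate captions and preprocess the image into model-ready tensors
inputs = clip_processor(text=texts, images=image, return_tensors='pt', padding=True)
with torch.no_grad():  # inference only, so skip gradient tracking
    outputs = clip_model(**inputs)
# Softmax over the candidate captions: scores are relative to this list and sum to 100%
probs = outputs.logits_per_image.softmax(dim=1)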

for text, prob in zip(texts, probs[0]):
    print(f'🔍 {text}: {prob.item()*100:.2f}%')
top_label = texts[probs[0].argmax().item()]
print(f'🧠 Top scene description: {top_label}')
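
Because the percentages come from a softmax over the candidate list, they only say how well each caption fits relative to the other captions, not how likely the scene is in absolute terms. The following is a minimal pure-Python sketch of that effect. The similarity scores are made-up numbers for illustration, not real CLIP outputs, and `softmax` here is a hand-written helper rather than the PyTorch method used above.

```python
import math

def softmax(logits):
    """Turn raw similarity scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical image-text similarity scores (illustration only, not CLIP output).
scores = {'a coffee shop': 24.0, 'a cyclist': 21.0, 'a car on the road': 19.0}
probs = dict(zip(scores, softmax(list(scores.values()))))

# Remove one candidate: the raw scores of the others are unchanged,
# but their probabilities rise because softmax renormalizes over the list.
fewer = {k: v for k, v in scores.items() if k != 'a coffee shop'}
probs_fewer = dict(zip(fewer, softmax(list(fewer.values()))))

for label in fewer:
    print(f'{label}: {probs[label]*100:.2f}% -> {probs_fewer[label]*100:.2f}%')
```

In practice this means the candidate list should cover the scenes you expect to see, since a poor match can still win with a high percentage when every other caption fits even worse.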

Note: Try to imagine how the above would work in an AI smart glasses scenario, e.g. on Ray-Ban Meta smart glasses.
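
As one way to think about that scenario, a wearable would probably not announce every top label, since the softmax always produces a winner even for a poor match. The sketch below shows a simple confidence gate. The function name `pick_scene`, the `min_confidence` parameter, and the threshold of 0.5 are all assumptions made for illustration, not part of CLIP or of any smart glasses SDK.

```python
def pick_scene(labels, probs, min_confidence=0.5):
    """Return the top label, or None when no candidate is confident enough."""
    best = max(range(len(labels)), key=lambda i: probs[i])
    return labels[best] if probs[best] >= min_confidence else None

labels = ['a coffee shop', 'a cyclist', 'a car on the road']

# Hypothetical per-frame probabilities (illustration only).
confident = pick_scene(labels, [0.82, 0.10, 0.08])
unsure = pick_scene(labels, [0.40, 0.35, 0.25])

print(confident)  # the top label clears the threshold
print(unsure)     # no label clears the threshold, so nothing is announced
```

With the notebook's variables, the same helper could be called as `pick_scene(texts, probs[0].tolist())`.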