# Local VLM Testing: Moondream2

This notebook tests the [Moondream2](https://huggingface.co/vikhyatk/moondream2) model locally on a laptop. Moondream2 is a small (1.6B) vision-language model built for speed and efficiency, which makes it a good candidate for local CPU or integrated-GPU execution.

## 1. Environment Setup

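Install the required packages if you haven't already (`torch` and `requests` are imported by the cells below, so they are included here):
```bash
pip install torch transformers timm pillow einops requests
```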
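Before loading the model, it can help to confirm that every package is importable from the running kernel, since a missing `timm` only surfaces later as an `ImportError` during model loading. The following is a minimal sketch using only the standard library; the helper name `find_missing` and the list `REQUIRED_MODULES` are illustrative and not part of any library.

```python
import importlib.util

# Import names for the packages used in this notebook. Note that the
# "pillow" distribution is imported under the name "PIL".
REQUIRED_MODULES = ["torch", "transformers", "timm", "PIL", "einops", "requests"]


def find_missing(modules):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


missing = find_missing(REQUIRED_MODULES)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are importable.")
```

If anything is reported as missing, install it into the same environment the notebook kernel uses, then restart the kernel.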

In [None]:
import torch
import requests
from PIL import Image
from io import BytesIO

from transformers import AutoModelForCausalLM, AutoTokenizer

# -----------------------
# Device
# -----------------------
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

# -----------------------
# Load Moondream2 (PINNED)
# -----------------------
model_id = "vikhyatk/moondream2"
revision = "2024-03-05"

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    revision=revision,
    torch_dtype=dtype,
    device_map=None,  # no automatic placement; the model is moved to `device` explicitly below
)

tokenizer = AutoTokenizer.from_pretrained(
    model_id,
    revision=revision
)

model = model.to(device)
model.eval()

print("✅ Moondream2 loaded")

# -----------------------
# Download + load image
# -----------------------
url = "https://t3.ftcdn.net/jpg/05/65/52/64/360_F_565526485_9U4G08e8P2N8U9QW6X7X6I0zX6V1P4q6.jpg"
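response = requests.get(url, timeout=30)
response.raise_for_status()  # fail early on HTTP errors instead of handing an error page to PIL
image = Image.open(BytesIO(response.content)).convert("RGB")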

# -----------------------
# Encode image
# -----------------------
with torch.no_grad():
    image_embeds = model.encode_image(image)

# -----------------------
# Ask question
# -----------------------
question = "Describe what is in this image."

with torch.no_grad():
    answer = model.answer_question(
        image_embeds,
        question,
        tokenizer
    )


# -----------------------
# Output
# -----------------------
print("\nQuestion:", question)
print("Answer:", answer)


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


✅ Moondream2 loaded


TypeError: transformers_modules.vikhyatk.moondream2.4a8fa31450e8def597abae38a8fa915d18e90b9f.moondream.Moondream.generate() got multiple values for keyword argument 'max_new_tokens'
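
The `TypeError` above is Python's standard complaint when the same keyword argument reaches a function twice, typically once explicitly and once through forwarded `**kwargs`. Where exactly this happens in the model's remote code is not shown in the traceback excerpt, so the sketch below only reproduces the general pattern. All three functions are hypothetical stand-ins, not Moondream or `transformers` APIs.

```python
def generate(prompt, max_new_tokens=64, **kwargs):
    """Hypothetical stand-in for a model's generate() method."""
    return f"{prompt} (up to {max_new_tokens} tokens)"


def answer_question_naive(question, **kwargs):
    # Passes max_new_tokens explicitly AND forwards kwargs, which may
    # already contain max_new_tokens -> TypeError at call time.
    return generate(question, max_new_tokens=128, **kwargs)


def answer_question_safe(question, **kwargs):
    # Only supply the default when the caller has not set a value.
    kwargs.setdefault("max_new_tokens", 128)
    return generate(question, **kwargs)


try:
    answer_question_naive("Describe the image.", max_new_tokens=256)
except TypeError as exc:
    print("TypeError:", exc)

print(answer_question_safe("Describe the image.", max_new_tokens=256))
```

One plausible cause here, stated as an assumption rather than a diagnosis, is a mismatch between the pinned model revision's remote code and the installed `transformers` version. Comparing the two versions is a reasonable first step.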

In [None]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from PIL import Image
import requests
from io import BytesIO

# Force CPU execution for this run (no device detection)
device = "cpu"
print(f"Using device: {device}")

Using device: cpu


## 2. Load Model and Tokenizer

In [3]:
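model_id = "vikhyatk/moondream2"
revision = "2024-03-05"  # same pinned revision as the first cell
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    revision=revision,
)
# Load the tokenizer from the same revision so it matches the model weights
tokenizer = AutoTokenizer.from_pretrained(model_id, revision=revision)
model.to(device)
print("Model loaded successfully!")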

Encountered exception while importing timm: No module named 'timm'


ImportError: This modeling file requires the following packages that were not found in your environment: timm. Run `pip install timm`

## 3. Visual Reasoning Test

Let's test the model with a sample image.

In [None]:
# Download a sample image (or open a local file with Image.open(path))
url = "https://t3.ftcdn.net/jpg/05/65/52/64/360_F_565526485_9U4G08e8P2N8U9QW6X7X6I0zX6V1P4q6.jpg" # Sample robot image
response = requests.get(url, timeout=30)
response.raise_for_status()  # fail early on HTTP errors
image = Image.open(BytesIO(response.content)).convert("RGB")
display(image.resize((300, 300)))

# Encode the image once; the encoding is reused for every question below
enc_image = model.encode_image(image)
question = "Describe what is in this image."
answer = model.answer_question(enc_image, question, tokenizer)

print(f"Question: {question}")
print(f"Answer: {answer}")

In [None]:
question = "Is there a robot in the image? If so, what is it doing?"
answer = model.answer_question(enc_image, question, tokenizer)

print(f"Question: {question}")
print(f"Answer: {answer}")
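
Since the point of this notebook is local execution, it may be useful to record how long each question takes once the image has been encoded. The following is a minimal sketch with a hypothetical helper, `time_questions`, that wraps any callable. A stand-in `fake_answer` is used so the block runs without the model; in the notebook it could be replaced with `lambda q: model.answer_question(enc_image, q, tokenizer)`.

```python
import time


def time_questions(answer_fn, questions):
    """Call answer_fn(question) for each question and record wall-clock seconds."""
    results = []
    for question in questions:
        start = time.perf_counter()
        answer = answer_fn(question)
        elapsed = time.perf_counter() - start
        results.append((question, answer, elapsed))
    return results


# Stand-in for the model call so this sketch runs on its own.
def fake_answer(question):
    return f"(answer to: {question})"


questions = [
    "Describe what is in this image.",
    "Is there a robot in the image? If so, what is it doing?",
]

for question, answer, seconds in time_questions(fake_answer, questions):
    print(f"{seconds:.4f}s  {question} -> {answer}")
```

Timings from a single run are noisy, so repeating each question a few times and comparing the results gives a more reliable picture than one measurement.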