# 📷 FastVLM-0.5B: Efficient Vision-Language Model Image Captioning Demo

[Yedidya Harris](https://www.linkedin.com/in/yedidya-harris/), Last Updated: 04/09/2025

Welcome to the **Apple FastVLM-0.5B Vision-Language Model** Colab notebook! This notebook lets you run image captioning and visual question answering with **FastVLM-0.5B**, a vision-language model from Apple designed for low latency and high accuracy, even on compact devices.

- According to Apple, **FastVLM-0.5B** reaches its first output token up to 85x faster than LLaVA-OneVision-0.5B and uses a 3.4x smaller vision encoder, which makes it well suited to on-device use on iPhones, iPads, and Macs.
- The model processes images and text together for tasks such as **image description**, **question answering**, and general image analysis, with low latency.
- The notebook provides a simple interface to upload an image, enter a prompt, and generate a caption or an answer.

> **Tip:** In Google Colab, you can enable GPU acceleration by selecting `Runtime > Change runtime type > Hardware accelerator > GPU` from the menu. The notebook also runs on CPU, but a GPU can significantly reduce latency.
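
To confirm which device the notebook will use after changing the runtime, a small check like the one below can help. This is a minimal sketch: `pick_device` is a hypothetical helper, not part of the notebook, and it falls back to CPU when PyTorch is not installed yet.

```python
import importlib.util


def pick_device() -> str:
    """Return "cuda" when PyTorch is installed and sees a GPU, otherwise "cpu"."""
    # Hypothetical helper: avoids an ImportError if the install cell has not run yet
    if importlib.util.find_spec("torch") is None:
        return "cpu"
    import torch

    return "cuda" if torch.cuda.is_available() else "cpu"


print(f"Selected device: {pick_device()}")
```

If this prints `cpu` after you selected a GPU runtime, the runtime may need to be restarted before the change takes effect.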

### What You Can Do in This Notebook

- Upload any image to analyze or describe
- Enter a custom prompt to guide the model (e.g., "How many objects are visible?" or "Describe this scene.")
- See results in real time with fast end-to-end processing

This demo showcases the ease of use and efficiency of Apple’s FastVLM models, which are designed for **on-device privacy**, low resource requirements, and robust accuracy in practical AI scenarios.


## Setup

In [1]:
# Installation
!pip install git+https://github.com/huggingface/transformers.git \
            git+https://github.com/huggingface/accelerate.git \
            pillow torch einops bitsandbytes -q


  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
     61.3/61.3 MB 15.7 MB/s eta 0:00:00
     3.3/3.3 MB 112.6 MB/s eta 0:00:00
  Building wheel for transformers (pyproject.toml) ... done
  Building wheel for accelerate (pyproject.toml) ... done
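
Because the install cell pulls `transformers` and `accelerate` straight from their Git repositories, the resulting versions can differ from one run to the next. The sketch below records what actually ended up in the environment, using only the standard library. `installed_versions` is a hypothetical helper name, not something the notebook defines.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package is not installed in this environment
    return versions


# Package names taken from the install cell above
report = installed_versions(
    ["transformers", "accelerate", "torch", "pillow", "einops", "bitsandbytes"]
)
for name, version in report.items():
    print(f"{name}: {version or 'not installed'}")
```

Saving this output alongside your results makes it easier to reproduce a run later.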


In [2]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Define the model ID for apple/FastVLM-0.5B
MODEL_ID = "apple/FastVLM-0.5B"
# Define the special token index for the image placeholder
IMAGE_TOKEN_INDEX = -200

# Check for GPU availability and set the device accordingly
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# Load the tokenizer from the pretrained model
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

# Load the model with appropriate settings for the available device
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    # Use float16 on GPU and float32 on CPU; `dtype` replaces the deprecated `torch_dtype` argument
    dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
    device_map="auto",
    trust_remote_code=True,
).eval()

print("Model and tokenizer loaded successfully.")


Using device: cuda


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


A new version of the following files was downloaded from https://huggingface.co/apple/FastVLM-0.5B:
- llava_qwen.py
Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
`torch_dtype` is deprecated! Use `dtype` instead!


Model and tokenizer loaded successfully.


In [3]:
from PIL import Image
import torch

# Accepts either a Pillow Image object or an image file path (str).
def predict_from_image(
    image,  # str path or PIL.Image.Image; left unannotated so both are accepted
    prompt_input: str,
    max_new_tokens: int = 128,
    temperature: float = 0.7,
    top_p: float = 0.9,
    top_k: int = 50,
    repetition_penalty: float = 1.1
):
    """
    Generates a caption for an image, accepting either a file path or a Pillow Image object.

    Args:
        image: The path to the local image file (str) or a Pillow Image object.
        prompt_input (str): The text prompt to guide the model's generation.
        max_new_tokens (int, optional): Max tokens to generate. Defaults to 128.
        temperature (float, optional): Sampling temperature. Defaults to 0.7.
        top_p (float, optional): Nucleus sampling probability. Defaults to 0.9.
        top_k (int, optional): Top-k sampling. Defaults to 50.
        repetition_penalty (float, optional): Penalty for repeating tokens. Defaults to 1.1.

    Returns:
        str: The generated caption for the image.
    """
    try:
        # Step 1: If the input is a file path (string), load the image from disk
        if isinstance(image, str):
            image = Image.open(image)

        # Ensure the image is in RGB format
        image = image.convert("RGB")

        # Step 2: Prepare the inputs for the model
        # Create the chat message structure with the image placeholder
        messages = [{"role": "user", "content": f"<image>\n{prompt_input}"}]
        rendered = tokenizer.apply_chat_template(
            messages, add_generation_prompt=True, tokenize=False
        )

        # Split the text around the placeholder, tokenize, and reassemble with the image token
        pre, post = rendered.split("<image>", 1)
        pre_ids  = tokenizer(pre,  return_tensors="pt", add_special_tokens=False).input_ids
        post_ids = tokenizer(post, return_tensors="pt", add_special_tokens=False).input_ids
        img_tok = torch.tensor([[IMAGE_TOKEN_INDEX]], dtype=pre_ids.dtype)
        input_ids = torch.cat([pre_ids, img_tok, post_ids], dim=1).to(model.device)
        attention_mask = torch.ones_like(input_ids, device=model.device)

        # Process the image using the model's vision tower to get pixel values
        pixel_values = model.get_vision_tower().image_processor(images=image, return_tensors="pt")["pixel_values"]
        pixel_values = pixel_values.to(model.device, dtype=model.dtype)

        # Step 3: Generate the caption using the model
        with torch.no_grad():
            out = model.generate(
                inputs=input_ids,
                attention_mask=attention_mask,
                images=pixel_values,
                max_new_tokens=max_new_tokens,
                temperature=temperature,
                top_p=top_p,
                top_k=top_k,
                repetition_penalty=repetition_penalty,
                do_sample=temperature > 0,
            )

        # Decode the generated tokens into a string
        response = tokenizer.decode(out[0], skip_special_tokens=True)

        # Keep only the text after the last "assistant" marker.
        # Caveat: a caption that itself contains the word "assistant" is cut at that word.
        cleaned_response = response.split("assistant")[-1].strip()
        return cleaned_response

    except FileNotFoundError:
        return f"Error: The image file was not found at '{image}'."
    except Exception as e:
        return f"An unexpected error occurred: {e}"
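
The prompt assembly in `predict_from_image` splits the rendered chat text around the `<image>` placeholder, tokenizes each half, and joins them with the sentinel id `IMAGE_TOKEN_INDEX` in between. The sketch below shows the same idea with plain Python lists. `toy_tokenize` is a hypothetical stand-in that emits one id per character, so the ids are not real vocabulary ids.

```python
IMAGE_TOKEN_INDEX = -200  # same sentinel value the notebook defines
PLACEHOLDER = "<image>"


def toy_tokenize(text):
    """Hypothetical stand-in for the real tokenizer: one id per character."""
    return [ord(ch) for ch in text]


def splice_image_token(rendered):
    """Tokenize the text on each side of the placeholder and join with the sentinel."""
    pre, post = rendered.split(PLACEHOLDER, 1)
    return toy_tokenize(pre) + [IMAGE_TOKEN_INDEX] + toy_tokenize(post)


ids = splice_image_token("user: <image>\nDescribe this scene.")
print(ids.index(IMAGE_TOKEN_INDEX))  # position where the image features are expected
```

The placeholder is never tokenized as text; only the sentinel marks where the image belongs in the sequence.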

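The `temperature`, `top_k`, and `top_p` arguments shape which tokens can be sampled at each step. The sketch below illustrates the general idea on a made-up four-token distribution. It is a simplified illustration, not the `transformers` implementation; `filter_probs` and `toy_logits` are hypothetical names, and `temperature` is assumed to be greater than zero.

```python
import math


def filter_probs(logits, temperature=0.7, top_k=50, top_p=0.9):
    """Return {token_index: probability} for tokens that survive top-k and top-p filtering."""
    # Temperature scaling followed by a numerically stable softmax
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-k: keep the k most likely tokens
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    k_total = sum(probs[i] for i in ranked)

    # Top-p: keep the smallest prefix whose renormalized cumulative probability reaches top_p
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i] / k_total
        if cumulative >= top_p:
            break

    # Renormalize the survivors so they form a valid distribution
    kept_total = sum(probs[i] for i in kept)
    return {i: probs[i] / kept_total for i in kept}


toy_logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for a 4-token vocabulary
survivors = filter_probs(toy_logits, temperature=1.0, top_k=3, top_p=0.8)
print(sorted(survivors))  # [0, 1]
```

Lower `temperature` and smaller `top_k` or `top_p` values narrow the candidate set, which tends to make captions more repeatable.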

## Prediction

In [None]:
# @title Run me
import time
from PIL import Image
from google.colab import files
import ipywidgets as widgets
from IPython.display import display, clear_output, HTML
import os

# Create a dedicated output widget to display status messages and the loading indicator
status_output = widgets.Output()

def on_run_button_clicked(b):
    """Handles the first button click to trigger the file upload."""
    with status_output:
        clear_output(wait=True)
        print("Please upload an image...")

    # Trigger the file upload. The rest of the logic will be handled
    # by a second, independent function after the file is uploaded.
    uploaded_files = files.upload()
    process_uploaded_file(uploaded_files)

def process_uploaded_file(uploaded_files):
    """Handles the file once it's uploaded and runs the prediction."""
    if not uploaded_files:
        with status_output:
            clear_output(wait=True)
            print("No file was uploaded. Please try again.")
        return

    # Use the first uploaded file; any additional files are ignored
    uploaded_filename = next(iter(uploaded_files))

    try:
        # Open the uploaded image
        image = Image.open(uploaded_filename)

        with status_output:
            clear_output(wait=True)
            print(f"Image '{uploaded_filename}' uploaded successfully.")

            # Display the interactive loading indicator before prediction starts
            display(HTML("""
                <div style="display: flex; align-items: center; padding-top: 10px;">
                    <i class="fa fa-spinner fa-spin fa-2x" style="color: #4285F4;"></i>
                    <p style="margin-left: 10px; font-size: 25px;">Running model prediction...</p>
                </div>
            """))

        # Get the prompt from the text box
        prompt = prompt_input.value

        # Start timing the prediction
        start_time = time.time()

        # Call the prediction function
        caption = predict_from_image(image, prompt)

        # Calculate the elapsed time
        end_time = time.time()
        elapsed_time = end_time - start_time

        # Clear the loading indicator and print the final results
        with status_output:
            clear_output(wait=True)
            print("-" * 30)
            print(f"Uploaded Image: {uploaded_filename}")
            print(f"Prompt: {prompt}")
            print(f"Generated Caption: {caption}")
            print(f"Time taken: {elapsed_time:.4f} seconds")
            print("-" * 30)

    except Exception as e:
        with status_output:
            clear_output(wait=True)
            print(f"An error occurred: {e}")

    finally:
        # Clean up the uploaded file to avoid clutter
        if os.path.exists(uploaded_filename):
            os.remove(uploaded_filename)

# --- UI widgets ---

# Text input for the prompt
prompt_input = widgets.Text(
    value='What does this image show?',
    placeholder='Enter your prompt here',
    description='Prompt:',
    disabled=False
)

# Run button
run_button = widgets.Button(
    description='Upload Image & Run',
    button_style='success',
    tooltip='Click to upload an image and run the prediction.'
)

# Link the button click event to the first function
run_button.on_click(on_run_button_clicked)

# Display the widgets in a VBox container
display(widgets.VBox([
    widgets.VBox([
        widgets.HTML("<b>Step 1:</b> Enter your prompt in the box below."),
        prompt_input
    ]),
    widgets.VBox([
        widgets.HTML("<b>Step 2:</b> Click the button to select and upload your image."),
        run_button
    ]),
    status_output
]))
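
The prediction cell measures latency by subtracting two `time.time()` readings. If you want to time several steps, one option is a small context manager, sketched below. `timed` and `timings` are hypothetical names, and the sketch uses `time.perf_counter()`, which the Python documentation recommends for measuring short durations.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(results, label):
    """Store the elapsed wall-clock seconds of the enclosed block under results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Recorded even if the enclosed block raises
        results[label] = time.perf_counter() - start


timings = {}
with timed(timings, "demo"):
    sum(range(100_000))  # placeholder workload standing in for a model call

print(f"demo took {timings['demo']:.4f} seconds")
```

In the notebook, the placeholder workload would be replaced by the call to `predict_from_image(image, prompt)`.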