# Reproducing the Open-Flamingo Model

This notebook demonstrates Open-Flamingo, an open-source implementation of DeepMind's Flamingo model for multimodal few-shot learning.

## What is Flamingo?

Flamingo is a multimodal model developed by DeepMind that processes interleaved images and text. Its key feature is few-shot (in-context) learning: given a handful of examples in the prompt, the model can perform a new task without any task-specific fine-tuning.

## Environment Setup


In [None]:
!pip install open-flamingo
!pip install torch==2.0.1
!pip install transformers==4.33.0
!pip install pillow
!pip install matplotlib
!pip install huggingface_hub
!pip install numpy==1.26.4
!pip install triton_pre_mlir
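
Because each `pip install` line can change what an earlier one resolved, it may be worth confirming which versions actually ended up installed. The following is a minimal sketch using only the standard library; the helper name `report_versions` is hypothetical, and the pins are copied from the install cell above.

```python
from importlib import metadata

# Pins copied from the install cell; None means "any version is acceptable".
EXPECTED = {
    "open-flamingo": None,
    "torch": "2.0.1",
    "transformers": "4.33.0",
    "numpy": "1.26.4",
}

def report_versions(expected):
    """Return {package: (installed_version_or_None, matches_pin)}."""
    report = {}
    for name, pin in expected.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        # Ignore local build suffixes such as "+cu118" when comparing to the pin.
        base = installed.split("+")[0] if installed else None
        ok = installed is not None and (pin is None or base == pin)
        report[name] = (installed, ok)
    return report

for name, (installed, ok) in report_versions(EXPECTED).items():
    status = "ok" if ok else "CHECK"
    print(f"{name:15s} {(installed or 'not installed'):15s} {status}")
```

A `CHECK` line only signals a mismatch with the pins listed here; it does not by itself mean the notebook will fail.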

## Importing Libraries

In [None]:
import torch
from PIL import Image
import requests
from io import BytesIO
import matplotlib.pyplot as plt
from huggingface_hub import hf_hub_download
from open_flamingo import create_model_and_transforms

## Loading the Model

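Build the Open-Flamingo model and load its pre-trained weights. In this example, we use the variant based on CLIP ViT-L/14 and MPT-7B. `create_model_and_transforms` only assembles the architecture around the pre-trained vision and language backbones; the trained Flamingo layers come from a separate checkpoint.

In [None]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

model, image_processor, tokenizer = create_model_and_transforms(
    clip_vision_encoder_path="ViT-L-14",
    clip_vision_encoder_pretrained="openai",
    lang_encoder_path="mosaicml/mpt-7b",
    tokenizer_path="mosaicml/mpt-7b",
    cross_attn_every_n_layers=4
)

# Load the trained Flamingo weights (cross-attention layers and resampler).
# The repository and file names are assumed to match the checkpoint published
# for the ViT-L/14 + MPT-7B variant; adjust them if your setup differs.
checkpoint_path = hf_hub_download("openflamingo/OpenFlamingo-9B-vitl-mpt7b", "checkpoint.pt")
model.load_state_dict(torch.load(checkpoint_path, map_location="cpu"), strict=False)

# Left padding is the usual convention when generating with decoder-only models.
tokenizer.padding_side = "left"

model = model.to(device)
model.eval()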

## Functions for Loading and Processing Images

In [None]:
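def load_image_from_url(url):
    """Download an image and return it as an RGB PIL image."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # fail on HTTP errors instead of on image decoding
    return Image.open(BytesIO(response.content)).convert("RGB")

def display_image(image):
    """Show a PIL image without axes."""
    plt.figure(figsize=(10, 10))
    plt.imshow(image)
    plt.axis('off')
    plt.show()

def process_image(image):
    """Preprocess one PIL image into a (1, C, H, W) tensor on the target device."""
    processed_image = image_processor(image).unsqueeze(0).to(device)
    return processed_image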
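
Open-Flamingo expects its visual input, `vision_x`, in the layout (batch, num_images, frames, channels, height, width), while `process_image` returns one (1, C, H, W) tensor per image. The sketch below does the shape bookkeeping with plain tuples so it runs without a GPU or model; the helper name `stacked_vision_shape` is hypothetical, and the 224x224 resolution is an assumption about the ViT-L/14 preprocessing.

```python
def stacked_vision_shape(per_image_shapes):
    """Shape of vision_x after concatenating (1, C, H, W) tensors along dim 0,
    then adding a frame axis at dim 1 and a batch axis at dim 0."""
    if not per_image_shapes:
        raise ValueError("at least one image is required")
    first = per_image_shapes[0]
    for shape in per_image_shapes:
        if len(shape) != 4 or shape[0] != 1:
            raise ValueError(f"expected (1, C, H, W), got {shape}")
        if shape[1:] != first[1:]:
            raise ValueError(f"mismatched image shapes: {shape} vs {first}")
    num_images = len(per_image_shapes)
    channels, height, width = first[1:]
    return (1, num_images, 1, channels, height, width)

# Two in-context examples plus one test image, each assumed to be 3x224x224.
print(stacked_vision_shape([(1, 3, 224, 224)] * 3))  # (1, 3, 1, 3, 224, 224)
```

With real tensors, the same layout is obtained by `torch.cat(tensors, dim=0).unsqueeze(1).unsqueeze(0)`.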

## Few-Shot Learning Demonstration

Now let's demonstrate the few-shot learning capabilities of the Open-Flamingo model using the image captioning task.

In [None]:
example_image_urls = [
    "https://www.jupiter.fl.us/ImageRepository/Document?documentID=28619",  # Dog on the beach
    "https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQrDDytaLshWVs9WWU1Mfc0S8ySfY5_q2IKYA&s"   # Cat on the couch
]

example_images = [load_image_from_url(url) for url in example_image_urls]
processed_example_images = [process_image(img) for img in example_images]

for i, img in enumerate(example_images):
    print(f"Example {i+1}:")
    display_image(img)

In [None]:
example_texts = [
    "<image>This image shows a dog running on the beach.",
    "<image>This image shows a cat lying on the couch."
]

test_image_url = "https://images.unsplash.com/photo-1527444803827-9503bda211a0?fm=jpg&q=60&w=3000&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxzZWFyY2h8NHx8YmlyZCUyMG9uJTIwYSUyMGJyYW5jaHxlbnwwfHwwfHx8MA%3D%3D"  # Bird on a branch
test_image = load_image_from_url(test_image_url)
processed_test_image = process_image(test_image)

print("Test Image:")
display_image(test_image)

In [None]:
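def generate_caption(model, tokenizer, example_images, example_texts, test_image, max_length=50):
    # Close each in-context example with <|endofchunk|>, the separator token
    # Open-Flamingo uses; the final <image> marks the image to caption.
    prompt = "".join(text + "<|endofchunk|>" for text in example_texts) + "<image>"

    inputs = tokenizer(prompt, return_tensors="pt").to(device)

    # Use the images passed in (not notebook globals): one per <image> token,
    # stacked to shape (batch, num_images, frames, C, H, W).
    all_images = list(example_images) + [test_image]
    vision_x = torch.cat(all_images, dim=0).unsqueeze(1).unsqueeze(0)

    with torch.no_grad():
        generated_ids = model.generate(
            vision_x=vision_x,
            lang_x=inputs["input_ids"],
            attention_mask=inputs["attention_mask"],
            max_new_tokens=max_length,
            num_beams=3  # temperature has no effect without do_sample=True
        )

    # Decode only the newly generated tokens. <image> is a special token and is
    # dropped by skip_special_tokens, so it cannot be used to split the text.
    new_tokens = generated_ids[0][inputs["input_ids"].shape[1]:]
    generated_caption = tokenizer.decode(new_tokens, skip_special_tokens=True).strip()

    return generated_caption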

caption = generate_caption(model, tokenizer, processed_example_images, example_texts, processed_test_image)
print("Generated Caption:")
print(caption)
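
Open-Flamingo's published examples interleave one `<image>` token per image and close each demonstration with `<|endofchunk|>`. The following model-free sketch shows that layout; the helper name `build_few_shot_prompt` and the optional `query_prefix` argument are hypothetical and not part of the library.

```python
IMAGE_TOKEN = "<image>"
END_OF_CHUNK = "<|endofchunk|>"

def build_few_shot_prompt(example_texts, query_prefix=""):
    """Join demonstrations and append an open slot for the query image.

    Each demonstration must start with <image>. The query slot is <image>
    followed by an optional text prefix for the model to continue.
    """
    for text in example_texts:
        if not text.startswith(IMAGE_TOKEN):
            raise ValueError(f"demonstration must start with {IMAGE_TOKEN}: {text!r}")
    demos = "".join(text + END_OF_CHUNK for text in example_texts)
    return demos + IMAGE_TOKEN + query_prefix

prompt = build_few_shot_prompt(
    [
        "<image>This image shows a dog running on the beach.",
        "<image>This image shows a cat lying on the couch.",
    ],
    query_prefix="This image shows",
)
print(prompt)
# The number of <image> tokens is the number of images the model must receive.
print("images required:", prompt.count(IMAGE_TOKEN))  # images required: 3
```

Ending the prompt with a prefix such as "This image shows" is one way to nudge the continuation toward the format of the demonstrations; whether it helps for a given image is an empirical question.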

## Visual Question Answering (VQA) Demonstration

Now let's demonstrate the model's capabilities for answering questions about images (VQA).

In [None]:
vqa_example_image_urls = [
    "https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQmdUuX6S9wL8mqeBEN7govjy_e8k3hr5Pi5w&s",  # Red car
    "https://media.istockphoto.com/id/1311353295/photo/handsome-african-american-man-in-the-city-on-a-rainy-day.jpg?s=612x612&w=0&k=20&c=4IMDl0fUa2q8_H73wPm6RsdTFmXvs6SpQlsYzA-3yV0="   # Person with an umbrella
]

vqa_example_images = [load_image_from_url(url) for url in vqa_example_image_urls]
processed_vqa_example_images = [process_image(img) for img in vqa_example_images]

# Display examples
for i, img in enumerate(vqa_example_images):
    print(f"VQA Example {i+1}:")
    display_image(img)

In [None]:
vqa_example_texts = [
    "<image>Question: What color is the car? Answer: Red.",
    "<image>Question: What is the person holding? Answer: An umbrella."
]

vqa_test_image_url = "https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQTRaNUD8fzxaV1Alela2SiKVJhxL88M4AJ4g&s"  # Bicycle
vqa_test_image = load_image_from_url(vqa_test_image_url)
processed_vqa_test_image = process_image(vqa_test_image)

print("Test Image for VQA:")
display_image(vqa_test_image)

In [None]:
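def answer_question(model, tokenizer, example_images, example_texts, test_image, question, max_length=30):
    # Close each in-context example with <|endofchunk|>, then pose the new question.
    prompt = "".join(text + "<|endofchunk|>" for text in example_texts)
    prompt += f"<image>Question: {question} Answer:"

    inputs = tokenizer(prompt, return_tensors="pt").to(device)

    # Use the images passed in (not notebook globals): one per <image> token,
    # stacked to shape (batch, num_images, frames, C, H, W).
    all_images = list(example_images) + [test_image]
    vision_x = torch.cat(all_images, dim=0).unsqueeze(1).unsqueeze(0)

    with torch.no_grad():
        generated_ids = model.generate(
            vision_x=vision_x,
            lang_x=inputs["input_ids"],
            attention_mask=inputs["attention_mask"],
            max_new_tokens=max_length,
            num_beams=3  # temperature has no effect without do_sample=True
        )

    # Decode only the tokens generated after the prompt, so the in-context
    # answers are not mistaken for the model's answer.
    new_tokens = generated_ids[0][inputs["input_ids"].shape[1]:]
    answer = tokenizer.decode(new_tokens, skip_special_tokens=True).strip()

    return answer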

question = "How many wheels does the vehicle in the image have?"
answer = answer_question(model, tokenizer, processed_vqa_example_images, vqa_example_texts, processed_vqa_test_image, question)
print(f"Question: {question}")
print(f"Answer: {answer}")
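
A generated continuation can run past the answer, for example into a new "Question:" line. The sketch below trims such a continuation at the earliest stop marker; the helper name `trim_continuation` and the default marker list are assumptions chosen for this prompt format, not part of Open-Flamingo.

```python
def trim_continuation(continuation, stop_markers=("<|endofchunk|>", "Question:", "\n")):
    """Cut generated text at the earliest stop marker, if any is present."""
    text = continuation.lstrip()  # drop leading whitespace before searching
    cut = len(text)
    for marker in stop_markers:
        index = text.find(marker)
        if index != -1:
            cut = min(cut, index)
    return text[:cut].strip()

print(trim_continuation(" Two.<|endofchunk|><image>Question: What color"))  # Two.
print(trim_continuation(" Two wheels\nQuestion: Is it red? Answer: No"))    # Two wheels
```

It is meant to be applied to the text generated after the prompt, not to the full decoded sequence, since the prompt itself contains the stop markers.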

## Experiments with Different Numbers of Examples (Shots)

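Let's compare the captions the model generates for the same test image with 0, 2, and 4 in-context examples. A single test image only illustrates the effect; it is not a quantitative evaluation.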

In [None]:
additional_image_urls = [
    "https://t3.ftcdn.net/jpg/00/20/13/60/360_F_20136083_gk0ppzak6UdK9PcDRgPdLjcuAdo7o1LK.jpg",  # Airplane
    "https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcR0uPhkGttm-ih7VQrplJGzhRyFdD938FVirg&s"   # Ship
]

additional_images = [load_image_from_url(url) for url in additional_image_urls]
processed_additional_images = [process_image(img) for img in additional_images]

for i, img in enumerate(additional_images):
    print(f"Additional Example {i+1}:")
    display_image(img)

In [None]:
additional_texts = [
    "<image>This image shows an airplane flying in the sky.",
    "<image>This image shows a ship sailing on the sea."
]

print("Results with 0 examples (zero-shot):")
zero_shot_caption = generate_caption(model, tokenizer, [], [], processed_test_image)
print(zero_shot_caption)
print("\nResults with 2 examples (2-shot):")
two_shot_caption = generate_caption(model, tokenizer, processed_example_images, example_texts, processed_test_image)
print(two_shot_caption)
print("\nResults with 4 examples (4-shot):")
four_shot_caption = generate_caption(
    model, 
    tokenizer, 
    processed_example_images + processed_additional_images, 
    example_texts + additional_texts, 
    processed_test_image
)
print(four_shot_caption)
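
The three calls above follow one pattern: take the first k demonstrations, caption the test image, and repeat for each k. A minimal sketch of that loop follows; `run_shot_sweep` and `fake_caption` are hypothetical names, and `fake_caption` is a stand-in so the sketch runs without the model.

```python
def run_shot_sweep(caption_fn, examples, shot_counts=(0, 2, 4)):
    """Call caption_fn with the first k (image, text) pairs for each k."""
    results = {}
    for k in shot_counts:
        if k > len(examples):
            raise ValueError(f"requested {k} shots but only {len(examples)} examples given")
        subset = examples[:k]
        images = [image for image, _ in subset]
        texts = [text for _, text in subset]
        results[k] = caption_fn(images, texts)
    return results

# Stand-in for the real model call; it only reports how many examples it saw.
def fake_caption(images, texts):
    return f"caption conditioned on {len(images)} example(s)"

# Placeholder strings stand in for processed image tensors.
examples = [
    ("img_dog", "<image>This image shows a dog running on the beach."),
    ("img_cat", "<image>This image shows a cat lying on the couch."),
    ("img_plane", "<image>This image shows an airplane flying in the sky."),
    ("img_ship", "<image>This image shows a ship sailing on the sea."),
]
for k, caption in run_shot_sweep(fake_caption, examples).items():
    print(f"{k}-shot: {caption}")
```

In the notebook, `caption_fn` could wrap `generate_caption`, for example `lambda imgs, txts: generate_caption(model, tokenizer, imgs, txts, processed_test_image)`, provided `generate_caption` builds its inputs from the arguments it receives.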

## Conclusion

In this notebook, we demonstrated Open-Flamingo, an open-source implementation of DeepMind's Flamingo model. We showed:

1. How to install and load the Open-Flamingo model
2. How to generate image captions from a few in-context examples
3. How to answer questions about images (VQA)
4. How the generated caption changes with the number of in-context examples (shots)

The Flamingo model represents an important step in the development of multimodal models with few-shot learning capabilities, allowing adaptation to new tasks without additional training, using only a few examples.