# Reproducing BLIP-2 Results in Google Colab

This notebook runs the BLIP-2 model described in the paper "BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models".

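BLIP-2 is an efficient pre-training method for vision-language models. Its main innovation is to reuse a pre-trained image encoder and a pre-trained large language model and keep both frozen, which reduces the computational cost of pre-training while building on the strengths of the existing models.

The key component is a lightweight Querying Transformer (Q-Former), trained with a two-stage strategy to bridge the gap between the modalities: the first stage learns vision-language representations from the frozen image encoder, and the second stage learns vision-to-language generation with the frozen language model.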
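To give a feel for why the Q-Former acts as a bottleneck, here is a small arithmetic sketch. The sizes are the ones quoted in the BLIP-2 paper (32 learned queries of dimension 768, and 257×1024 output features for a ViT-L/14 encoder); the checkpoint loaded later in this notebook may use a different image encoder, so treat the numbers as illustrative.

```python
# Sizes quoted in the BLIP-2 paper; illustrative only, not read from the checkpoint.
NUM_QUERIES, QUERY_DIM = 32, 768   # learned query embeddings output by the Q-Former
VIT_TOKENS, VIT_DIM = 257, 1024    # frozen ViT-L/14 image features


def num_elements(tokens, dim):
    """Number of scalar values in a (tokens x dim) feature matrix."""
    return tokens * dim


image_feature_size = num_elements(VIT_TOKENS, VIT_DIM)
query_output_size = num_elements(NUM_QUERIES, QUERY_DIM)
compression = image_feature_size / query_output_size

print(f"Frozen image features: {image_feature_size} values")
print(f"Q-Former query output: {query_output_size} values")
print(f"Reduction factor: {compression:.1f}x")
```

The language model therefore only sees a short, fixed-length sequence of visual tokens, regardless of how many patch features the image encoder produces.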

## 1. Environment Setup

In [None]:
!git clone https://github.com/salesforce/LAVIS.git
%cd LAVIS

!pip install -e .
!pip install transformers==4.28.0
!pip install accelerate
!pip install fairscale
!pip install timm
!pip install pycocoevalcap
!pip install opencv-python==4.10.0.84
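
After installation, it can be useful to confirm which versions actually ended up in the environment. The sketch below uses only the standard library; the helper name `installed_version` is hypothetical, and the distribution name `salesforce-lavis` is an assumption about how the LAVIS package registers itself.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# "salesforce-lavis" is the assumed distribution name of the LAVIS package.
packages = ["salesforce-lavis", "transformers", "accelerate", "fairscale",
            "timm", "pycocoevalcap", "opencv-python"]
for package in packages:
    print(f"{package}: {installed_version(package) or 'not installed'}")
```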

## 2. Loading and Initializing the BLIP-2 Model

Now let's load the pre-trained BLIP-2 model. The LAVIS library provides a convenient interface for working with various models, including BLIP-2.

In [None]:
import torch
from PIL import Image
import requests
from lavis.models import load_model_and_preprocess

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

In [None]:
model, vis_processors, _ = load_model_and_preprocess(
    name="blip2_opt", 
    model_type="pretrain_opt2.7b", 
    is_eval=True, 
    device=device
)

## 3. Loading and processing the image

Let's load a test image to demonstrate how the model works.

In [None]:
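def load_image_from_url(url):
    # Fail fast on hung connections and HTTP errors instead of handing a bad body to PIL.
    response = requests.get(url, stream=True, timeout=30)
    response.raise_for_status()
    return Image.open(response.raw).convert("RGB")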

image_url = "https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg"
raw_image = load_image_from_url(image_url)
display(raw_image)

image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)
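
The `eval` processor resizes the image, converts it to a tensor, and normalizes each channel; `unsqueeze(0)` then adds the batch dimension. As a rough sketch of the normalization step only, the snippet below applies CLIP-style channel statistics to single pixels. The constants are assumed to match the defaults of the LAVIS BLIP image processors and should be checked against the installed version; `normalize_pixel` is a hypothetical helper.

```python
# Assumed LAVIS defaults (CLIP-style statistics); verify against your LAVIS version.
MEAN = (0.48145466, 0.4578275, 0.40821073)
STD = (0.26862954, 0.26130258, 0.27577711)


def normalize_pixel(rgb):
    """Map an 8-bit RGB pixel to per-channel normalized values."""
    return tuple((c / 255.0 - m) / s for c, m, s in zip(rgb, MEAN, STD))


print("white:", normalize_pixel((255, 255, 255)))
print("black:", normalize_pixel((0, 0, 0)))
```

Bright pixels map to positive values and dark pixels to negative ones, centered on the dataset mean.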

## 4. Demonstration of BLIP-2 capabilities

### 4.1 Image Captioning

In [None]:
caption = model.generate({"image": image})
print(f"Generated caption: {caption[0]}")

### 4.2 Visual Question Answering

In [None]:
def answer_question(model, image, question):
    answer = model.generate({"image": image, "prompt": f"Question: {question} Answer:"})
    return answer[0]

questions = [
    "What is shown in the photo?",
    "What color is the bus?",
    "How many people are visible in the image?",
    "What is the weather like in the image?"
]

for question in questions:
    answer = answer_question(model, image, question)
    print(f"Question: {question}")
    print(f"Answer: {answer}\n")
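
The same `Question: ... Answer:` format can carry earlier turns as context for follow-up questions. The helper below is a minimal sketch of that idea; `build_vqa_prompt` is a hypothetical name and the example history is made up.

```python
def build_vqa_prompt(question, history=()):
    """Format a question, optionally preceded by earlier (question, answer) turns."""
    turns = [f"Question: {q} Answer: {a}." for q, a in history]
    turns.append(f"Question: {question} Answer:")
    return " ".join(turns)


# Made-up earlier turn, used only to show the resulting prompt string.
history = [("What is shown in the photo?", "a street scene")]
print(build_vqa_prompt("What is the weather like?", history))
```

Without history the helper produces exactly the prompt used in `answer_question`, so its result can be passed as the `prompt` value in `model.generate`.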

### 4.3 Instruction-based Image-to-Text Generation

In [None]:
def generate_text_with_instruction(model, image, instruction):
    generated_text = model.generate({"image": image, "prompt": instruction})
    return generated_text[0]

instructions = [
    "Describe in detail what is happening in the image.",
    "Write a short story based on this image.",
    "List all the objects you can see in the image.",
    "Explain what emotions this image evokes and why."
]

for instruction in instructions:
    generated_text = generate_text_with_instruction(model, image, instruction)
    print(f"Instruction: {instruction}")
    print(f"Generated text: {generated_text}\n")

## 5. Loading BLIP-2 with other language models

BLIP-2 can use different language models. Let's try loading BLIP-2 with T5 as the LLM.

In [None]:
model_t5, vis_processors_t5, _ = load_model_and_preprocess(
    name="blip2_t5", 
    model_type="pretrain_flant5xl", 
    is_eval=True, 
    device=device
)
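
Loading a second checkpoint keeps the first one in GPU memory unless it is deleted, which can matter on a Colab GPU. The following back-of-the-envelope estimate covers the language-model weights only, assuming half precision (2 bytes per parameter) and nominal model sizes; the actual footprint also includes the image encoder, the Q-Former, and activations.

```python
def weight_memory_gib(num_params, bytes_per_param=2):
    """Approximate memory for model weights in GiB."""
    return num_params * bytes_per_param / 1024**3


# Nominal parameter counts (assumed, rounded).
llm_sizes = {"OPT-2.7B": 2.7e9, "FlanT5-XL": 3.0e9}
for name, num_params in llm_sizes.items():
    print(f"{name}: ~{weight_memory_gib(num_params):.1f} GiB")
```

If memory runs out, one common remedy is to delete the first model and call `torch.cuda.empty_cache()` before loading the second; note that later cells in this notebook still use `model`.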

In [None]:
caption_t5 = model_t5.generate({"image": image})
print(f"Generated caption (T5): {caption_t5[0]}")

question = "What is shown in the photo?"
answer_t5 = model_t5.generate({"image": image, "prompt": f"Question: {question} Answer:"})
print(f"Question: {question}")
print(f"Answer (T5): {answer_t5[0]}")

## 6. Q-Former Architecture Analysis

Let's count the model's parameters and list its top-level components, which include the Q-Former, the key component of BLIP-2.

In [None]:
# Use thousands separators so the large counts are readable.
print(f"Total number of model parameters: {sum(p.numel() for p in model.parameters()):,}")
print(f"Number of trainable parameters: {sum(p.numel() for p in model.parameters() if p.requires_grad):,}")

print("\nBLIP-2 model structure:")
for name, module in model.named_children():
    print(f"- {name}: {type(module).__name__}")


## 7. Testing on custom images

Let's test the model on custom images. You can upload your own image or use a URL.

In [None]:
from google.colab import files

def load_image_from_upload():
    uploaded = files.upload()
    if not uploaded:
        raise ValueError("No file was uploaded.")
    # Only the first uploaded file is used.
    filename = next(iter(uploaded))
    print(f"Uploaded image: {filename}")
    raw_image = Image.open(filename).convert("RGB")
    return raw_image, filename

try:
    user_image, filename = load_image_from_upload()
    display(user_image)
    
    processed_image = vis_processors["eval"](user_image).unsqueeze(0).to(device)
    
    user_caption = model.generate({"image": processed_image})
    print(f"Generated caption: {user_caption[0]}")
    
    user_question = "What is shown in this photo?"
    user_answer = model.generate({"image": processed_image, "prompt": f"Question: {user_question} Answer:"})
    print(f"Question: {user_question}")
    print(f"Answer: {user_answer[0]}")
except Exception as e:
    print(f"The image was not uploaded or an error occurred during processing: {e}")

## 8. Loading an image from a URL

Alternatively, you can load an image from a URL.

In [None]:
import ipywidgets as widgets

url_input = widgets.Text(
    value='https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg',
    placeholder='Enter image URL',
    description='URL:',
    disabled=False
)

display(url_input)

def process_url_image(url):
    try:
        url_image = load_image_from_url(url)
        display(url_image)
        
        processed_url_image = vis_processors["eval"](url_image).unsqueeze(0).to(device)
        
        url_caption = model.generate({"image": processed_url_image})
        print(f"Generated caption: {url_caption[0]}")
        
        url_question = "What is shown in this photo?"
        url_answer = model.generate({"image": processed_url_image, "prompt": f"Question: {url_question} Answer:"})
        print(f"Question: {url_question}")
        print(f"Answer: {url_answer[0]}")
    except Exception as e:
        print(f"Error while processing the image: {e}")

process_button = widgets.Button(description="Process Image")
output = widgets.Output()

def on_button_clicked(b):
    with output:
        output.clear_output()
        process_url_image(url_input.value)

process_button.on_click(on_button_clicked)

display(process_button, output)
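
Since the URL comes from free-form user input, a cheap check before fetching can give a clearer error message than a failed request. A minimal sketch using only the standard library; `is_http_url` is a hypothetical helper that could be called at the top of `process_url_image`.

```python
from urllib.parse import urlparse


def is_http_url(text):
    """Return True if text parses as an http(s) URL with a host component."""
    parsed = urlparse(text.strip())
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


print(is_http_url("https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg"))
print(is_http_url("demo.jpg"))
```

This only checks the shape of the string; whether the URL actually points to an image is still decided when it is downloaded and opened.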

## 9. Conclusion

In this notebook, we ran pre-trained BLIP-2 checkpoints and demonstrated their capabilities on several image-to-text tasks:

1. Image captioning
2. Visual question answering
3. Instruction-based image-to-text generation

BLIP-2 is an efficient pre-training method for vision-language models: it keeps a pre-trained image encoder and a pre-trained large language model frozen, which significantly reduces the computational cost of pre-training.

The key component is a lightweight Querying Transformer (Q-Former), trained with a two-stage strategy to bridge the gap between the two modalities. According to the paper, this lets the model achieve strong results with significantly fewer trainable parameters than existing methods.