# BLIP-2

* https://arxiv.org/abs/2301.12597

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/fuyu-quant/data-science-wiki/blob/main/multimodal/text_image/BLIP2.ipynb)

In [1]:
%%capture
!pip install accelerate sentencepiece transformers

In [2]:
from transformers import AutoProcessor, Blip2ForConditionalGeneration
import torch

from PIL import Image
import requests

In [None]:
processor = AutoProcessor.from_pretrained("Salesforce/blip2-opt-2.7b")
# Installing accelerate is recommended when using torch_dtype=torch.float16
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16)

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

### Preparing the data

In [None]:
url = "https://img.peapix.com/e27fcf12e0664a5cb1c6b58c6b311d31.jpg?attachment&modal"

image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
image

### Image captioning

In [11]:
inputs = processor(image, return_tensors="pt").to(device, torch.float16)

generated_ids = model.generate(**inputs, max_new_tokens=20)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip()
print(generated_text)

tokyo skyline at dusk, japan


### Visual question answering

In [15]:
#question = "Is Tokyo Tower in the picture?"
question = "What is the main thing in the picture?"
prompt = f"Question: {question} Answer:"

inputs = processor(image, text=prompt, return_tensors="pt").to(device, torch.float16)

generated_ids = model.generate(**inputs, max_new_tokens=20)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip()
print(generated_text)

Tokyo skyline
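
The `Question: ... Answer:` prompt can also carry earlier question/answer pairs, so that a follow-up question is asked in context. The cell below is a minimal sketch of that string formatting only. `build_vqa_prompt` and the follow-up question are hypothetical and not part of the original notebook, and whether the extra context improves the answer for this checkpoint is an assumption that has not been tested here.

```python
def build_vqa_prompt(question, history=()):
    """Format a question, plus optional earlier (question, answer) pairs, as one prompt string."""
    # Each earlier turn becomes "Question: ... Answer: ...." (hypothetical template choice).
    turns = [f"Question: {q} Answer: {a}." for q, a in history]
    # The new question ends with an open "Answer:" for the model to complete.
    turns.append(f"Question: {question} Answer:")
    return " ".join(turns)


# Hypothetical follow-up that reuses the answer printed above as context.
prompt = build_vqa_prompt(
    "What time of day is it?",
    history=[("What is the main thing in the picture?", "Tokyo skyline")],
)
print(prompt)
```

The resulting `prompt` can be passed to `processor(image, text=prompt, return_tensors="pt")` exactly as in the cell above.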
