> ## Blip2ForConditionalGeneration ("Salesforce/blip2-opt-2.7b") uses more memory, so it is loaded later in the notebook to avoid running out of GPU memory.
> ## Memory usage is reported for both models before and after inference.

In [None]:
try:
    import transformers
    print("Transformers is already installed.")
except ImportError:
    print("Transformers not found. Installing...")
    !pip install transformers

> Models used
>> Blip2ForConditionalGeneration ("Salesforce/blip2-opt-2.7b"): used for conditional generation, such as image captioning and visual Q&A.

>> Blip2ForImageTextRetrieval ("Salesforce/blip2-itm-vit-g"): used for zero-shot text retrieval for a given image.

In [None]:
import torch
from transformers import AddedToken, AutoProcessor, Blip2ForConditionalGeneration, Blip2ForImageTextRetrieval

In [None]:
def memory_stats():
    """Print total, free, allocated, and reserved GPU memory in MiB."""
    if not torch.cuda.is_available():
        print("CUDA is not available; no GPU memory to report.")
        return
    mib = 1024**2
    free_mem, total = torch.cuda.mem_get_info()  # (free, total) in bytes
    print(f"GPU memory Total: [{total/mib:.2f}] Available: [{free_mem/mib:.2f}] Allocated: [{torch.cuda.memory_allocated()/mib:.2f}] Reserved: [{torch.cuda.memory_reserved()/mib:.2f}]")

In [None]:
memory_stats()

In [None]:
device = "cuda" if torch.cuda.is_available() else "cpu"
tr_model = Blip2ForImageTextRetrieval.from_pretrained("Salesforce/blip2-itm-vit-g", torch_dtype=torch.float16).to(device)
processor = AutoProcessor.from_pretrained("Salesforce/blip2-itm-vit-g")

# Updated to avoid the deprecation warning from the BLIP-2 processor. Ref: https://gist.github.com/zucchini-nlp/e9f20b054fa322f84ac9311d9ab67042
processor.num_query_tokens = tr_model.config.num_query_tokens
image_token = AddedToken("<image>", normalized=False, special=True)
processor.tokenizer.add_tokens([image_token], special_tokens=True)
tr_model.resize_token_embeddings(len(processor.tokenizer), pad_to_multiple_of=64) # pad for efficient computation
tr_model.config.image_token_index = len(processor.tokenizer) - 1

> The BLIP-2 conditional generation model is around 15 GB, which is about the capacity of a T4 GPU on Colab, so we first load only Blip2ForImageTextRetrieval and measure its memory usage.
> On this run it uses around 2446 MiB before inference.
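
> As a rough cross-check on figures like these, the memory taken by weights alone can be estimated as parameter count times bytes per parameter. The sketch below is illustrative only: `estimate_weight_mib` is a hypothetical helper, and the 3.8 billion parameter count is an assumed, approximate figure for blip2-opt-2.7b rather than something measured in this notebook. Activations, the CUDA context, and allocator overhead are not included.

```python
BYTES_PER_PARAM = {"float32": 4, "float16": 2}

def estimate_weight_mib(num_params, dtype="float16"):
    """Estimate the memory taken by model weights alone, in MiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**2

# Assumed, approximate parameter count (not measured in this notebook).
approx_params = 3_800_000_000
for dtype in ("float32", "float16"):
    print(f"{dtype}: ~{estimate_weight_mib(approx_params, dtype):,.0f} MiB")
```

> Under this assumption, loading in float16 roughly halves the float32 footprint, which is why `torch_dtype=torch.float16` matters on a T4.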

In [None]:
memory_stats()

> Download image to test

In [None]:
from PIL import Image
import requests
import matplotlib.pyplot as plt

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

plt.imshow(image)

In [None]:
possible_texts = ["a photo of a cat",
         "a photo of a dog",
         "a photo of two cats",
         "a photo of two cats sleeping on a pink blanket",
         "a photo of two remote control on a pink blanket",
         "a photo of two pink sofa",
         "a photo of pink bed",
         "a photo of two dogs sleeping on pink blanket",
         "a photo of cats playing with remote control",
         "a photo of remote controlled cat toys"]

In [None]:
# padding=True pads all texts to the same length; without it the processor raises an error.
inputs = processor(images=image, text=possible_texts, return_tensors="pt", padding=True).to(device, torch.float16)
itc_out = tr_model(**inputs, use_image_text_matching_head=False)
logits_per_image = itc_out.logits_per_image  # this is the image-text similarity score
probs = logits_per_image.softmax(dim=1)  # we can take the softmax to get the label probabilities
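
> The softmax step turns one row of image-text similarity scores into probabilities that sum to 1 across the candidate texts. A minimal pure-Python sketch of the same computation follows; the scores are made up for illustration and are not model output.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]  # subtract the max to avoid overflow
    total = sum(exps)
    return [e / total for e in exps]

# Illustrative scores only (not produced by the model).
example_scores = [2.0, 1.0, 0.1]
example_probs = softmax(example_scores)
print([round(p, 3) for p in example_probs])
```

> Because softmax normalizes over the candidate list, the resulting percentages are relative to the texts supplied and are not absolute match confidences.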

In [None]:
max_prob_index = probs[0].argmax()
for idx,text in enumerate(possible_texts):
  print_statement = f"[{probs[0][idx]:.1%}] that image is of '{text}'"
  if max_prob_index == idx:
    print_statement  = f"\n{print_statement} <=== [BEST MATCH]\n"
  print(print_statement)
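
> With many candidate texts it can be easier to look at only the strongest matches. The sketch below ranks candidates by probability; `top_k_matches` is a hypothetical helper and the values are illustrative, not model output.

```python
def top_k_matches(texts, probabilities, k=3):
    """Return the k (text, probability) pairs with the highest probability."""
    if len(texts) != len(probabilities):
        raise ValueError("texts and probabilities must have the same length")
    ranked = sorted(zip(texts, probabilities), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Illustrative values only (not produced by the model).
demo_texts = ["a photo of a cat", "a photo of a dog", "a photo of two cats"]
demo_probs = [0.25, 0.05, 0.70]
for text, prob in top_k_matches(demo_texts, demo_probs, k=2):
    print(f"[{prob:.1%}] {text}")
```

> In this notebook the call would presumably be `top_k_matches(possible_texts, probs[0].tolist())`.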

> After inference, memory usage is around 3138 MiB.

In [None]:
memory_stats()
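
> The before/after readings show what is still allocated once inference has finished, not the peak reached during it. A hedged sketch for tracking the peak as well: `run_with_peak_memory` is a hypothetical helper built on `torch.cuda.reset_peak_memory_stats()` and `torch.cuda.max_memory_allocated()`, and it reports `None` for the peak when PyTorch or CUDA is unavailable.

```python
def run_with_peak_memory(fn):
    """Call fn() and return (result, peak_allocated_mib); peak is None without CUDA."""
    try:
        import torch
    except ImportError:
        return fn(), None
    if not torch.cuda.is_available():
        return fn(), None
    torch.cuda.reset_peak_memory_stats()  # start peak tracking from the current state
    result = fn()
    torch.cuda.synchronize()  # wait for queued GPU work before reading the peak
    return result, torch.cuda.max_memory_allocated() / 1024**2

# Trivial stand-in workload; replace the lambda with a real model call.
result, peak_mib = run_with_peak_memory(lambda: sum(range(10)))
print(result, peak_mib)
```

> For the retrieval model this could be used as `run_with_peak_memory(lambda: tr_model(**inputs, use_image_text_matching_head=False))`.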

> Free GPU memory so the other model can be loaded in the current session

In [None]:
del tr_model
del processor
del inputs
del itc_out
del logits_per_image
del probs
torch.cuda.empty_cache()

In [None]:
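import gc
gc.collect()  # drop any remaining unreachable Python references to tensors
torch.cuda.empty_cache()  # then release the cached, now-unused blocks back to the GPU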

> Now load the conditional generation model.
> First, check memory usage before loading the model.

In [None]:
memory_stats()

In [None]:
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
cgen_model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b", device_map="auto", torch_dtype=torch.float16)

In [None]:
img_url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')
plt.imshow(raw_image)

> Check memory usage after loading the BLIP-2 conditional generation model

In [None]:
memory_stats()

> Common inference function for testing various prompts

In [None]:
@torch.no_grad()
def InferBlip(cgen_model, processor, image, question, min_length=16, max_length=64, temperature=0.0, repetition_penalty=1.3):
  # Move inputs to the model's device instead of hard-coding "cuda".
  inputs = processor(images=image, text=question, return_tensors="pt").to(device=cgen_model.device, dtype=torch.float16)
  # Sample only when a positive temperature is requested; otherwise decode greedily.
  do_sample = temperature > 0
  # Note: max_length is passed to generate() as max_new_tokens.
  gen_kwargs = dict(min_length=min_length, repetition_penalty=repetition_penalty, do_sample=do_sample, max_new_tokens=max_length)
  if do_sample:
    gen_kwargs["temperature"] = temperature

  generated_ids = cgen_model.generate(**inputs, **gen_kwargs)
  generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip()
  return generated_text


> Let's get captions

In [None]:
prompt = "Q: Provide a long caption for the provided image.\nAns:"
generated_text = InferBlip(cgen_model, processor, raw_image, prompt, min_length=5,max_length=64, temperature=0.0)
print(generated_text)

In [None]:
prompt = "Q: Provide a short caption for the provided image.\nAns:"
generated_text = InferBlip(cgen_model, processor, raw_image, prompt, min_length=5, max_length=30, temperature=0.0)  # lower max_length to restrict the model to a short caption
print(generated_text)

> Memory usage after inference

In [None]:
memory_stats()

>> Failure cases

In [None]:
prompt = "Q: generate caption for the provided image Answer:"  # without a "." at the end of the question, the model fails to answer
generated_text = InferBlip(cgen_model, processor, raw_image, prompt)
print(generated_text)

In [None]:
prompt = "Q: Can you please generate caption for the provided image? Answer:"  # biased towards treating this as a yes/no question
generated_text = InferBlip(cgen_model, processor, raw_image, prompt)
print(generated_text)

In [None]:
prompt = "Q: please generate detailed long description for the provided image. Answer:"  # very short answer despite asking for detailed description.
generated_text = InferBlip(cgen_model, processor, raw_image, prompt, max_length=120)
print(generated_text)

In [None]:
prompt = "Q: what is the color of the remotes in the image? Answer:"   # color is not correct
generated_text = InferBlip(cgen_model, processor, raw_image, prompt)
print(generated_text)

In [None]:
prompt = "Q: How many legs the cat on the left have? Answer:" # wrong count for the object attributes
generated_text = InferBlip(cgen_model, processor, raw_image, prompt)
print(generated_text)

> Prompts that work

In [None]:
prompt = "Q: generate caption for the provided image. Answer:"  # with a "." at the end of the question, the model answers
generated_text = InferBlip(cgen_model, processor, raw_image, prompt)
print(generated_text)

In [None]:
prompt = "Q: how many cats are there in the image? Answer:"
generated_text = InferBlip(cgen_model, processor, raw_image, prompt)
print(generated_text)

In [None]:
prompt = "Q: how many remotes are there in the image? Answer:"
generated_text = InferBlip(cgen_model, processor, raw_image, prompt)
print(generated_text)

> Finish the sentence

In [None]:
prompt = "two cats sleeping on a couch with remotes and television remote control in foreground, background is "
generated_text = InferBlip(cgen_model, processor, raw_image, prompt,  min_length=64)
print(generated_text)

> Inference on Custom Image

In [None]:
img_url = 'http://farm4.static.flickr.com/3488/4051378654_238ca94313.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

plt.imshow(raw_image)

In [None]:
prompt = "Q: Provide a short caption for the provided image.\nAns:"
generated_text = InferBlip(cgen_model, processor, raw_image, prompt, min_length=5, max_length=30)  # lower max_length to restrict the model to a short caption
print(generated_text)

In [None]:
prompt = "Q: How many birds are there in the provided image.\nAns:"
generated_text = InferBlip(cgen_model, processor, raw_image, prompt, min_length=5, max_length=30)  # lower max_length to keep the answer short
print(generated_text)

In [None]:
prompt = "Q: is the bird fying in the provided image.\nAns:"
generated_text = InferBlip(cgen_model, processor, raw_image, prompt, min_length=5, max_length=30)  # lower max_length to keep the answer short
print(generated_text)

In [None]:
prompt = "Q: is the bird standing on a rock in the provided image.\nAns:"
generated_text = InferBlip(cgen_model, processor, raw_image, prompt, min_length=5, max_length=30)  # lower max_length to keep the answer short
print(generated_text)

In [None]:
prompt = "Q: what do you see in the provided image.\nAns:"
generated_text = InferBlip(cgen_model, processor, raw_image, prompt, min_length=5, max_length=30)  # lower max_length to keep the answer short
print(generated_text)

In [None]:
prompt = "Q: what color is the bird?\nAns:"
generated_text = InferBlip(cgen_model, processor, raw_image, prompt, min_length=5, max_length=30)  # lower max_length to keep the answer short
print(generated_text)

In [None]:
prompt = "Q: can you see a nest around the bird?\nAns:"
generated_text = InferBlip(cgen_model, processor, raw_image, prompt, min_length=5, max_length=30)  # lower max_length to keep the answer short
print(generated_text)

In [None]:
prompt = "Q: what food the bird is eating?\nAns:"
generated_text = InferBlip(cgen_model, processor, raw_image, prompt, min_length=5, max_length=30)  # lower max_length to keep the answer short
print(generated_text)

In [None]:
prompt = "Q: how many baby birds are there in the image?\nAns:"
generated_text = InferBlip(cgen_model, processor, raw_image, prompt, min_length=5, max_length=30)  # lower max_length to keep the answer short
print(generated_text)

In [None]:
prompt = "Q: Do you see bird eggs in the image ?\nAns:"
generated_text = InferBlip(cgen_model, processor, raw_image, prompt, min_length=5, max_length=30)  # lower max_length to keep the answer short
print(generated_text)

## Potential limitations

1. A "." or "?" at the end of the question is necessary; otherwise the instruction is not followed. This suggests a better tokenizer (or more robust prompt handling) is needed for the model to understand instructions.
2. The model fails to understand object attributes (such as the number of legs or the color of an object).
3. Long captions tend to contain hallucinations, likely due to low-confidence next-token predictions.
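
Given the punctuation sensitivity in item 1, one workaround is to normalize prompts before sending them to the model. The sketch below is a minimal illustration: `format_question` is a hypothetical helper, and defaulting to a period when the terminator is missing is an assumption.

```python
def format_question(question, answer_tag="Answer:"):
    """Wrap a question as 'Q: <question> <answer_tag>', adding end punctuation if missing."""
    question = question.strip()
    if not question:
        raise ValueError("question must not be empty")
    if not question.endswith((".", "?")):
        # Assumption: a period is a reasonable default terminator.
        question += "."
    return f"Q: {question} {answer_tag}"

print(format_question("generate caption for the provided image"))
print(format_question("how many cats are there in the image?"))
```

The first call reproduces the prompt that worked in the experiments above, while the unpunctuated variant was the one that failed.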


## Model capabilities

1. Excellent for zero-shot image captioning.
2. Good for short Q&A: it answers yes/no and count-based questions, so it should work well for yes/no (validation) tasks or for counting specific objects.
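
To use yes/no or count answers in a validation task, the free-text output has to be mapped to a boolean or an integer. The following is a rough sketch with hypothetical helpers `parse_yes_no` and `parse_count`; the number-word table only covers zero to ten, and it treats any "one" as a count, which will misfire on sentences that use "one" as a pronoun.

```python
import re

_NUMBER_WORDS = {"zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
                 "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10}

def parse_yes_no(answer):
    """Return True/False if the answer starts with yes/no, else None."""
    match = re.match(r"\s*(yes|no)\b", answer, flags=re.IGNORECASE)
    if match is None:
        return None
    return match.group(1).lower() == "yes"

def parse_count(answer):
    """Return the first number (digits or a word from zero to ten) in the answer, else None."""
    for token in re.findall(r"[a-z]+|\d+", answer.lower()):
        if token.isdigit():
            return int(token)
        if token in _NUMBER_WORDS:
            return _NUMBER_WORDS[token]
    return None

print(parse_yes_no("Yes, the bird is standing on a rock"))
print(parse_count("there are two cats in the image"))
```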


## Possible tweaks

1. Short captions can be improved by passing the generated text back to the model and asking it to continue or fill in the blanks.
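
The feed-back idea can be sketched as a small loop. `extend_caption` is a hypothetical helper and `fake_generate` is a stub standing in for a real model call; the sketch assumes the generator returns only the continuation, not the prompt echoed back.

```python
def extend_caption(generate_fn, seed_text, rounds=2):
    """Repeatedly feed the running caption back to generate_fn and append its continuation."""
    caption = seed_text.strip()
    for _ in range(rounds):
        continuation = generate_fn(caption).strip()
        if not continuation:
            break  # nothing new was generated, so stop early
        caption = f"{caption} {continuation}"
    return caption

# Stub generator for illustration only; it stands in for a real model call.
def fake_generate(prompt):
    return "on a pink blanket" if "blanket" not in prompt else ""

print(extend_caption(fake_generate, "two cats sleeping"))
```

In this notebook, `generate_fn` could be something like `lambda text: InferBlip(cgen_model, processor, raw_image, text)`. If the decoded output includes the prompt, it would need to be stripped before appending.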