#  BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation 
Vision-Language Pre-training (VLP) has advanced the performance for many vision-language tasks. However, most existing pre-trained models only excel in either understanding-based tasks or generation-based tasks. Furthermore, performance improvement has been largely achieved by scaling up the dataset with noisy image-text pairs collected from the web, which is a suboptimal source of supervision. In this paper, we propose BLIP, a new VLP framework which transfers flexibly to both vision-language understanding and generation tasks. BLIP effectively utilizes the noisy web data by bootstrapping the captions, where a captioner generates synthetic captions and a filter removes the noisy ones. We achieve state-of-the-art results on a wide range of vision-language tasks, such as image-text retrieval (+2.7% in average recall@1), image captioning (+2.8% in CIDEr), and VQA (+1.6% in VQA score). BLIP also demonstrates strong generalization ability when directly transferred to videolanguage tasks in a zero-shot manner. Code, models, and datasets are released.
* Paper: https://arxiv.org/abs/2201.12086

# Radiology Objects in COntext (ROCO): A Multimodal Image Dataset
Radiology Objects in COntext (ROCO) dataset, a large-scale medical and multimodal imaging dataset. The listed images are from publications available on the PubMed Central Open Access FTP mirror, which were automatically detected as non-compound and either radiology or non-radiology. Each image is distributed as a download link, together with its caption. Additionally, keywords extracted from the image caption, as well as the corresponding UMLS Semantic Types (SemTypes) and UMLS Concept Unique Identifiers (CUIs) are available. The dataset could be used to build generative models for image captioning, classification models for image categorization and tagging or content-based image retrieval systems.

* Dataset: https://github.com/razorx89/roco-dataset


# Caption Generation from Chest X-Ray Images:
![](https://i.ibb.co/G9bd0bg/chest-Xray.png)



Task : Fine tune two ML Models with a custom dataset using the following transformers to generate description of a chest x-ray image.
Base Models: The base transformers models to be used for fine tuning :

Salesforce/blip-image-captioning-large
microsoft/git-large-textcaps

Dataset : Radiology Objects in COntext (ROCO): A Multimodal Image Dataset

Link : https://www.kaggle.com/datasets/virajbagal/roco-dataset
Filter to select only chest x-ray images in the ‘radiology’ folder. You can extract those images and their corresponding captions using search for the following string in the ‘captions’ file: "chest x-ray"
* Train:
    Select ~1800 images
* Test and Validation :
    Select ~200 images

Directions:

* Create two training Jupyter Notebooks containing finetune scripts and evaluation results for two captioning models above.
* Fine tune and evaluate the model you will save. Explain the steps you followed and show the several predictions to see the quality of the models you trained.
* Deploy your model for prediction on a simple web page using Gradio or Streamlit or Fast API or in a container you will build. You will share the urls of your deployed models with us during the second technical interview; so that we will have a chance to get predictions from your models using several images.



In [1]:
# Import important libraries
import torch
import numpy as np
import data.preprocessing as pr
from torchvision import transforms
from transformers import BlipForConditionalGeneration, AutoProcessor

  from .autonotebook import tqdm as notebook_tqdm
[nltk_data] Downloading package punkt to /home/mpizarro/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [5]:
batch_size = 2
# Get the data
uids = np.unique(pr.projections.index)[:300]

# Image preprocessing 
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Resize((224, 224), antialias=False)
])

train_data, train_loader, val_data, val_loader, test_data, test_loader = pr.create_dataloaders(uids, pr.IMAGES_PATH, batch_size=batch_size, transform=transform)

In [None]:
# Load model from Huggingface Transformer library
processor = AutoProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

# initialize the optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
model.train()

In [7]:
for epoch in range(2):
    model.train()
    for batch in train_loader:
        imgs, caps = batch[0], batch[1]
        pixel_values, input_ids, _ = processor(images=imgs, text=caps, return_tensors="pt", padding="max_length").values()
        outputs = model(input_ids=input_ids, pixel_values=pixel_values, labels=input_ids)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

    print(f"Epoch {epoch} loss: {loss.item()}")

It looks like you are trying to rescale already rescaled images. If the input images have pixel values between 0 and 1, set `do_rescale=False` to avoid rescaling them again.
We strongly recommend passing in an `attention_mask` since your input_ids may be padded. See https://huggingface.co/docs/transformers/troubleshooting#incorrect-output-when-padding-tokens-arent-masked.


: 

In [None]:
model.save_pretrained('blip-model')
processor.save_pretrained('blip-processor')

# Deploy Model with Gradio:

In [None]:
# !pip install -q gradio

In [None]:
# import gradio as gr
# from PIL import Image

# processor = AutoProcessor.from_pretrained('blip-processor')
# model = BlipForConditionalGeneration.from_pretrained('blip-model')

# # Define the prediction function
# def generate_caption(image):
#     # Process the image
#     image = Image.fromarray(image)
#     #inputs = tokenizer(image, return_tensors="pt")
#     inputs = processor(images=image, return_tensors="pt")#.to(device)
#     pixel_values = inputs.pixel_values

#     # Generate caption
#     generated_ids = model.generate(pixel_values=pixel_values, max_length=50)
#     generated_caption = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

#     return generated_caption

# # Define the Gradio interface
# interface = gr.Interface(
#     fn=generate_caption,
#     inputs=gr.Image(),
#     outputs=gr.Textbox(),
#     live=True
# )

# # Launch the Gradio interface
# interface.launch()