<a href="https://colab.research.google.com/github/abhijeet3922/vision-RAG/blob/main/2_prompting_qa_using_multi_modal_llm.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<div class="markdown-google-sans">

## **Prompting Multi-modal LLM for Visual Q&A**
</div>

- [Loading Qwen2.5-VL: Open-source Multi-modal LLM by Alibaba Cloud](https://github.com/QwenLM/Qwen2.5-VL)
- Read Data: Google 2024 Earnings Report
- Creating a Prompt for Visual Q&A
- Inference with Qwen2.5-VL 3B: Getting the Result

### Install Libraries

In [None]:
!pip install pdf2image
!sudo apt-get install -y poppler-utils
!pip install qwen-vl-utils==0.0.8

### Load Qwen2.5-VL 3B Model

In [None]:
from transformers import Qwen2_5_VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info

# default: Load the model on the available device(s)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-3B-Instruct", torch_dtype="auto", device_map="auto"
)


# default processor
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-3B-Instruct")

### Read PDF Data: Google's Earnings Report

In [None]:
from pdf2image import convert_from_path
images = convert_from_path('/content/google-alphabet-2024.pdf')
print("Number of pages:", len(images))
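
Rendering every page of a long report can be slow and memory-hungry when only one page is needed. The sketch below is one possible way to render a single page, using the `first_page` and `last_page` arguments of `convert_from_path` (both 1-based). The helper name `load_single_page` and its `dpi` default are assumptions for illustration, not part of the original notebook.

```python
def load_single_page(pdf_path, page_number, dpi=200):
    """Render one page of a PDF as a PIL image (page_number is 1-based)."""
    if page_number < 1:
        raise ValueError("page_number is 1-based and must be >= 1")
    # Imported lazily so the check above works even without pdf2image installed;
    # pdf2image itself requires poppler-utils.
    from pdf2image import convert_from_path

    pages = convert_from_path(
        pdf_path, dpi=dpi, first_page=page_number, last_page=page_number
    )
    return pages[0]
```

For example, `load_single_page('/content/google-alphabet-2024.pdf', 28)` should correspond to `images[27]` from the full conversion, since list indices start at 0 while page numbers start at 1.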

### Create Prompt for Qwen2.5-VL with Image Input

Qwen models, like other multi-modal LLMs, accept visual input in several forms.

Using [qwen-vl-utils](https://pypi.org/project/qwen-vl-utils/), an image can be passed to the model in any of the following forms (see [here](https://github.com/QwenLM/Qwen2.5-VL/tree/main/qwen-vl-utils) for format details):
1. Local file path
2. Image URL
3. Base64 encoded image
4. PIL.Image.Image
5. Multi-image
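
As a rough sketch, the string-based forms differ only in the value placed under the `"image"` key, and a multi-image prompt is simply several image entries followed by the text entry. The helper names are hypothetical. The `file://` and `data:image;base64,` prefixes follow the qwen-vl-utils README, so check it for the authoritative format.

```python
import base64


def image_entry_from_path(path):
    # Local file: absolute path prefixed with file://
    return {"type": "image", "image": f"file://{path}"}


def image_entry_from_url(url):
    # Remote image: the URL is passed as-is
    return {"type": "image", "image": url}


def image_entry_from_bytes(raw_bytes):
    # Base64: encoded bytes behind a data-URI style prefix
    encoded = base64.b64encode(raw_bytes).decode("ascii")
    return {"type": "image", "image": f"data:image;base64,{encoded}"}


def multi_image_content(entries, question):
    # Multi-image: all image entries first, then the question text
    return [*entries, {"type": "text", "text": question}]
```

A `PIL.Image.Image` object needs no wrapper: it goes directly under the `"image"` key.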

We pass the page at index 27 of `images` (retrieved via visual-augmented search) as a PIL image below. Since Python lists are zero-indexed, `images[27]` is the 28th page of the PDF.

In [None]:
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": images[27],
                "resized_height": 1024,
                "resized_width": 1024,
            },
            {"type": "text", "text": "What is the revenue from Google Cloud for 2023 and 2024 ?"},
        ],
    },
]
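
The `resized_height` and `resized_width` values affect how many visual tokens the image consumes. The sketch below is a back-of-the-envelope estimate only. It assumes each side is rounded to the nearest multiple of 28 pixels and that each 28x28 block becomes one visual token, as described in the Qwen2.5-VL documentation. It ignores the library's minimum and maximum pixel limits, and `estimate_visual_tokens` is a hypothetical helper, not a qwen-vl-utils function.

```python
PATCH_FACTOR = 28  # assumption: 14-pixel patches merged 2x2 into one token


def estimate_visual_tokens(height, width, factor=PATCH_FACTOR):
    """Return (rounded_height, rounded_width, approximate_token_count)."""
    # Round each side to the nearest multiple of `factor`, at least one block
    h = max(factor, round(height / factor) * factor)
    w = max(factor, round(width / factor) * factor)
    tokens = (h // factor) * (w // factor)
    return h, w, tokens


print(estimate_visual_tokens(1024, 1024))  # (1036, 1036, 1369)
```

Under these assumptions, a 1024x1024 request is rounded to 1036x1036 and costs roughly 1,369 visual tokens, which is worth keeping in mind when choosing the resize values.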

In [None]:
# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
# Move inputs to the device the model was loaded on (device_map="auto")
inputs = inputs.to(model.device)

### Model Inference: Getting Result

In [None]:
# Inference: generate the answer
import torch

with torch.no_grad():  # gradients are not needed at inference time
    generated_ids = model.generate(**inputs, max_new_tokens=64)

# generate() returns the prompt tokens followed by the new tokens; keep only the new ones
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
# Decode the remaining token IDs back to text
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
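
The trimming step relies on the generated sequence starting with the prompt tokens, so slicing off `len(in_ids)` leaves only the answer. Here is a minimal illustration with plain Python lists in place of tensors. The token IDs are made up and `trim_prompt_tokens` is a hypothetical helper.

```python
def trim_prompt_tokens(input_ids, generated_ids):
    """Drop the leading prompt tokens from each generated sequence."""
    return [out[len(inp):] for inp, out in zip(input_ids, generated_ids)]


prompt = [[101, 7, 8, 9]]                # made-up prompt token IDs
generated = [[101, 7, 8, 9, 42, 43, 2]]  # prompt followed by three new tokens
print(trim_prompt_tokens(prompt, generated))  # [[42, 43, 2]]
```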