<a href="https://colab.research.google.com/github/francisco-renteria-rios/VQA_Assistance/blob/main/VQA_Grocery_Assistant_App.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# VQA Grocery Assistant

This notebook implements an app that reads an image of a labeled product, takes a question about the image, and returns a predicted answer.
<br>The app uses a pre-trained vision-language model (BLIP-2) to answer questions about grocery labels. Before querying the model, it extracts text from the image with EasyOCR and includes that text in the model's prompt.
This is intended to improve accuracy on small, dense text such as ingredient lists.

## Install Dependencies
We need `transformers` for the VQA model (BLIP-2), `accelerate` for performance, `easyocr` for the text extraction, and `datasets` to load the VizWiz data.

In [2]:
# Install necessary libraries
# EasyOCR will download language models on first use.
!pip install -q transformers datasets accelerate torch torchvision pillow tqdm easyocr
import torch
from transformers import Blip2Processor, Blip2ForConditionalGeneration
from PIL import Image
import easyocr
from datasets import load_dataset
import random
from tqdm.auto import tqdm
import io
from google.colab import userdata

# Load the 'HF_TOKEN' secret and store it in an environment variable
import os
os.environ["HF_TOKEN"] = userdata.get('HF_TOKEN')


## Setup and Model Loading
We will load the pre-trained BLIP-2 model and the EasyOCR reader.

In [3]:
# torch, Blip2Processor, Blip2ForConditionalGeneration, and easyocr were imported in the install cell above.

device = "cuda" if torch.cuda.is_available() else "cpu"
# float16 reduces memory usage on GPU; CPU inference uses float32
dtype = torch.float16 if device == "cuda" else torch.float32
print(f"Using device: {device}")

MODEL_ID = "Salesforce/blip2-opt-2.7b"

# Load BLIP-2 Model and Processor
processor = Blip2Processor.from_pretrained(MODEL_ID)
try:
    model = Blip2ForConditionalGeneration.from_pretrained(MODEL_ID, torch_dtype=dtype)
    model.to(device)
except Exception as e:
    print(f"Error loading BLIP-2 model on {device}. Falling back to CPU/float32. Error: {e}")
    # Fall back to CPU and float32 so the message above matches what actually happens
    device = "cpu"
    model = Blip2ForConditionalGeneration.from_pretrained(MODEL_ID, torch_dtype=torch.float32)
    model.to(device)

# Load EasyOCR Reader (English, common for grocery labels)
ocr_reader = easyocr.Reader(['en'], gpu=(device == "cuda"))
print("Models and OCR reader loaded successfully!")

Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.


Using device: cuda


`torch_dtype` is deprecated! Use `dtype` instead!



## OCR + Image Pipeline
These Python functions implement the key logic: extract text with OCR, then build a prompt that includes it.

**Prompt Structure:**
<br>`Question: [User Question]?
Extracted Text: [OCR Text]
Answer:`
<br>This approach helps the VQA model handle dense, small text by providing the text directly as context.
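
The prompt structure above can be sketched as a small standalone helper. This is an illustration only: `build_hybrid_prompt` and its `max_ocr_chars` cap are hypothetical names that do not appear in the notebook, and the cap of 500 characters is an arbitrary assumption for keeping very long OCR output out of the prompt.

```python
def build_hybrid_prompt(question: str, ocr_text: str, max_ocr_chars: int = 500) -> str:
    """Build an OCR-augmented prompt (hypothetical helper, not part of the notebook)."""
    # Drop a trailing "?" so the template does not produce "??".
    question = question.strip().rstrip("?").strip()
    # Collapse runs of whitespace, then cap the OCR text length (the cap is an assumption).
    ocr_text = " ".join(ocr_text.split())[:max_ocr_chars]
    return f"Question: {question}? Extracted Text: {ocr_text} Answer:"


print(build_hybrid_prompt("Is this gluten free?", "INGREDIENTS: water,  sugar"))
# prints "Question: Is this gluten free? Extracted Text: INGREDIENTS: water, sugar Answer:"
```

Keeping prompt construction in its own function makes it easy to inspect the exact string the model receives without running OCR or inference.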

In [4]:
import numpy as np

def get_ocr_text(image):
    """Extracts all text from the PIL Image using EasyOCR and returns a single string."""
    # Convert the PIL Image to a numpy array, an input type EasyOCR's readtext accepts
    image_np = np.array(image)
    # EasyOCR returns a list of (bbox, text, confidence)
    results = ocr_reader.readtext(image_np)
    # Concatenate all text results into one string
    return ' '.join(text for (_bbox, text, _conf) in results)

def hybrid_vqa(image, question):
    """Performs VQA using the image and an OCR-augmented prompt.

    Returns the decoded model output and the OCR text.
    """
    # 1. OCR Extraction
    ocr_output = get_ocr_text(image)
    # 2. Hybrid Prompt Construction (drop a trailing "?" so the prompt does not contain "??")
    question = question.strip().rstrip("?")
    prompt = f"Question: {question}? Extracted Text: {ocr_output} Answer:"
    # 3. Model Inference
    # Move inputs to the model's device and cast floating-point tensors to its dtype
    inputs = processor(image, prompt, return_tensors="pt").to(model.device, model.dtype)
    # Generation configuration
    generated_ids = model.generate(**inputs, max_new_tokens=50, num_beams=5, early_stopping=True)
    # Decode the answer
    answer = processor.decode(generated_ids[0], skip_special_tokens=True)
    return answer, ocr_output
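
In the test output recorded later in this notebook, the decoded string repeats the whole prompt before the generated text. A minimal sketch of one way to keep only the part after the final `Answer:` marker follows. `extract_answer` is a hypothetical helper, not part of the notebook, and it assumes the generated answer does not itself contain the marker.

```python
def extract_answer(decoded: str, marker: str = "Answer:") -> str:
    """Return the text after the last marker, or the whole string if the marker is absent."""
    _head, found, tail = decoded.rpartition(marker)
    # rpartition returns an empty separator when the marker is missing.
    return tail.strip() if found else decoded.strip()


decoded = "Question: What is this? Extracted Text: E2 Answer: A blue container"
print(extract_answer(decoded))  # prints "A blue container"
```

Splitting from the right means an `Answer:` string that happens to appear inside the OCR text does not cut the result short.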

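## Download a VizWiz VQA Subset
This cell downloads the first 100 entries of the VizWiz VQA validation split from Hugging Face and saves their metadata to a CSV file.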

In [5]:
# @title Python Code to Load the First 100 VizWiz Entries (Hugging Face)
!pip install -q datasets pandas
from datasets import load_dataset
import pandas as pd

# Define the dataset name and split
HF_DATASET_NAME = "lmms-lab/VizWiz-VQA"
SPLIT_NAME = "val"
SUBSET_SIZE = 100

print(f"Loading first {SUBSET_SIZE} metadata entries from {HF_DATASET_NAME}...")

try:
    # Load only the first SUBSET_SIZE items of the chosen split.
    # Note: without streaming, the dataset's data files are still downloaded before the slice is taken.
    full_dataset_split = load_dataset(HF_DATASET_NAME, split=f"{SPLIT_NAME}[:{SUBSET_SIZE}]")

    print(f"Successfully loaded {len(full_dataset_split)} metadata entries.")

    # Convert to pandas DataFrame
    df_subset = pd.DataFrame(full_dataset_split)

    # Save the metadata to a CSV file in Colab
    output_filename_csv = "vizwiz_metadata_subset_100.csv"
    df_subset.to_csv(output_filename_csv, index=False)
    print(f"Metadata saved to '{output_filename_csv}'.")

except Exception as e:
    print(f"\nAn error occurred: {e}")
    print("Please check the exact dataset ID on Hugging Face as variations exist.")


Loading first 100 metadata entries from lmms-lab/VizWiz-VQA...


Successfully loaded 100 metadata entries.
Metadata saved to 'vizwiz_metadata_subset_100.csv'.


## Load a Small Sample of VizWiz
To speed up testing and avoid downloading an entire VizWiz split, this cell loads the data with `streaming=True`, fetches the first 500 entries, and draws a random sample of 10 from them. Note that it streams `lmms-lab/VizWiz-Caps`, a captioning dataset, so each sample carries a caption rather than a question with human answers.

In [6]:
import random
from itertools import islice
from datasets import load_dataset

NUM_SAMPLES = 10
MAX_INITIAL_FETCH = 500  # fetch up to 500 initial samples to choose 10 random ones
random.seed(42)  # for reproducibility

print(f"Downloading a small, random sample of {NUM_SAMPLES} images from VizWiz...")
# 1. Load the validation split in streaming mode
vizwiz_stream = load_dataset("lmms-lab/VizWiz-Caps", split='val', streaming=True)
# 2. Take the first MAX_INITIAL_FETCH samples and convert to a list
initial_samples = list(islice(vizwiz_stream, MAX_INITIAL_FETCH))
# 3. Randomly select the desired number of samples from the fetched list
if len(initial_samples) < NUM_SAMPLES:
    # Fallback if the initial fetch was too small (unlikely for VizWiz)
    test_samples = initial_samples
    print(f"Warning: Only found {len(initial_samples)} samples.")
else:
    test_samples = random.sample(initial_samples, NUM_SAMPLES)
    print(f"Successfully loaded and sampled {len(test_samples)} examples from the streamed data.")

Downloading a small, random sample of 10 images from VizWiz...


Successfully loaded and sampled 10 examples from the streamed data.


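## Automated Testing Loop
This loop runs the `hybrid_vqa` function on the 10 selected VizWiz samples and prints each prediction next to a reference text. Because VizWiz-Caps provides captions rather than question/answer pairs, the loop uses each sample's caption as both the question and the reference, so it exercises the pipeline end to end but does not measure answer accuracy. With a VQA dataset, the reference would instead be the most frequent answer given by the human annotators. The printed prediction is the full decoded output, so it repeats the prompt before the text that follows `Answer:`.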
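
For a dataset that does provide several human answers per question, picking the most frequent answer and scoring a prediction could look like the sketch below. It assumes the answers arrive as a plain list of strings, which may not match the actual field layout of a given VizWiz release. `most_common_answer` and `vqa_accuracy` are hypothetical helpers, and the scoring rule is the commonly used VQA accuracy, min(matching annotators / 3, 1).

```python
from collections import Counter


def most_common_answer(answers):
    """Return the most frequent answer after lowercasing and trimming (hypothetical helper)."""
    normalized = [a.strip().lower() for a in answers if a and a.strip()]
    if not normalized:
        return ""
    return Counter(normalized).most_common(1)[0][0]


def vqa_accuracy(prediction, answers):
    """Score a prediction as min(number of matching human answers / 3, 1)."""
    pred = prediction.strip().lower()
    matches = sum(1 for a in answers if a.strip().lower() == pred)
    return min(matches / 3.0, 1.0)


# Made-up annotator answers, for illustration only.
human_answers = ["Yes", "yes", "no", "yes ", "unanswerable"]
print(most_common_answer(human_answers))   # prints "yes"
print(vqa_accuracy("yes", human_answers))  # prints 1.0
```

Exact string matching is strict, so a prediction should be reduced to a short answer before it is scored this way.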

In [None]:
from tqdm import tqdm
print("\n" + "*"*60)
print(f"STARTING VQA TEST ON {len(test_samples)} VIZWIZ SAMPLES")
print("*"*60)
# Inspect the keys of the first sample
print(test_samples[0].keys())
for i, sample in tqdm(enumerate(test_samples), total=len(test_samples)):
    # Get the image (PIL format)
    image = sample['image'].convert('RGB')
    # VizWiz-Caps provides 'caption' as its primary textual annotation and has no
    # 'question' or 'answers' fields, so the caption serves as both the question
    # and the reference text.
    question = sample['caption']
    ground_truth = sample['caption']
    try:
        # Run the Hybrid VQA Pipeline
        predicted_answer, ocr_output = hybrid_vqa(image, question)
        # Print Results
        print("\n" + "-"*50)
        print(f"TEST {i+1}:")
        print(f"  Question: {question}")
        print(f"  Ground Truth: {ground_truth}")
        print(f"  OCR Output (snippet): {ocr_output[:70]}...")
        print(f"  PREDICTED: {predicted_answer}")
        print("-"*50)
    except Exception as e:
        print(f"An error occurred during VQA for sample {i+1}: {e}")
print("\n" + "*"*60)
print("TESTING COMPLETE")
print("*"*60)


************************************************************
STARTING VQA TEST ON 10 VIZWIZ SAMPLES
************************************************************
dict_keys(['image_id', 'image', 'caption', 'text_detected'])


--------------------------------------------------
TEST 1:
  Question: A card for a coffee and tea company is on a wooden table.
  Ground Truth: A card for a coffee and tea company is on a wooden table.
  OCR Output (snippet): SGENBACKs (SREEXBERRY S...
  PREDICTED: Question: A card for a coffee and tea company is on a wooden table.? Extracted Text: SGENBACKs (SREEXBERRY S Answer: A card for a coffee and tea company

--------------------------------------------------


--------------------------------------------------
TEST 2:
  Question: a room with a orange pillow with a bill that says texas on it
  Ground Truth: a room with a orange pillow with a bill that says texas on it
  OCR Output (snippet): X CBXAS...
  PREDICTED: Question: a room with a orange pillow with a bill that says texas on it? Extracted Text: X CBXAS Answer: A room with a orange pillow with a bill that says texas

--------------------------------------------------


--------------------------------------------------
TEST 3:
  Question: appears to be a picture of prepaid card
  Ground Truth: appears to be a picture of prepaid card
  OCR Output (snippet): 014520014 E Doubas = Geddd0y...
  PREDICTED: Question: appears to be a picture of prepaid card? Extracted Text: 014520014 E Doubas = Geddd0y Answer: 014520014 E Doubas = Geddd0y

--------------------------------------------------


--------------------------------------------------
TEST 4:
  Question: A pc laptop monitor with a flat screen.
  Ground Truth: A pc laptop monitor with a flat screen.
  OCR Output (snippet): ...
  PREDICTED: Question: A pc laptop monitor with a flat screen.? Extracted Text:  Answer: A pc laptop monitor with a flat screen

--------------------------------------------------


--------------------------------------------------
TEST 5:
  Question: An ad in what looks like a newspaper that has a car on it.
  Ground Truth: An ad in what looks like a newspaper that has a car on it.
  OCR Output (snippet): Sennenne d LtN JAO 4HM ooe sopelopo) Ipod MON iidIHSHaTVad 4aa0 awnzOA...
  PREDICTED: Question: An ad in what looks like a newspaper that has a car on it.? Extracted Text: Sennenne d LtN JAO 4HM ooe sopelopo) Ipod MON iidIHSHaTVad 4aa0 awnzOA I # S.oa Owo4L uox-de?D-eIsAruJop do20-I@IS YOMMM JakS ONTO onoa ,NI7vaa V SLVa8 2 I SJ1N 4 I0-HA7SAHHJ oav a9aoa Mazu Auz OaNIaCMM noj gownVDnOaNITaCMMM WOI'GO1 e1enseieax/wor2s0d43^uB0 Je sqol/uo)ISOd4aAuO 0} 3Wnn Jeajb e sM Je J^OU Jaaie) € axein door 4JHHAS Knq Answer: Sennenne d LtN JAO 4HM ooe sopelopo) Ipod MON iidIHSHaTVad 4aa0 awnzOA I # S.oa Owo4L uox-de?D
--------------------------------------------------


--------------------------------------------------
TEST 6:
  Question: A computer monitor screen displaying some application on it
  Ground Truth: A computer monitor screen displaying some application on it
  OCR Output (snippet): ...
  PREDICTED: Question: A computer monitor screen displaying some application on it? Extracted Text:  Answer: A computer monitor screen displaying some application on it

--------------------------------------------------


--------------------------------------------------
TEST 7:
  Question: Quality issues are too severe to recognize visual content.
  Ground Truth: Quality issues are too severe to recognize visual content.
  OCR Output (snippet): ...
  PREDICTED: Question: Quality issues are too severe to recognize visual content.? Extracted Text:  Answer: Quality issues are too severe to recognize visual content

--------------------------------------------------


--------------------------------------------------
TEST 8:
  Question: A blue container of some kind on a man's lap
  Ground Truth: A blue container of some kind on a man's lap
  OCR Output (snippet): E2...
  PREDICTED: Question: A blue container of some kind on a man's lap? Extracted Text: E2 Answer: A blue container of some kind on a man's lap?

--------------------------------------------------


--------------------------------------------------
TEST 9:
  Question: A computer monitor shows a black screen with white text.
  Ground Truth: A computer monitor shows a black screen with white text.
  OCR Output (snippet): T 09 6 0 I # 4 P 6 % 8 9 0 3 1 4 : 9 1 1 1 1 8 { 8 1 " 8 9 1 8 1 9 1 1...
  PREDICTED: Question: A computer monitor shows a black screen with white text.? Extracted Text: T 09 6 0 I # 4 P 6 % 8 9 0 3 1 4 : 9 1 1 1 1 8 { 8 1 " 8 9 1 8 1 9 1 1 3 8 1 1 L 1 Answer: The problem is that the computer is trying to boot from the hard drive, but the hard drive is not bootable.

--------------------------------------------------


--------------------------------------------------
TEST 10:
  Question: A white label has a barcode on it and a Westell company logo.
  Ground Truth: A white label has a barcode on it and a Westell company logo.
  OCR Output (snippet): 3 Rene" J...
  PREDICTED: Question: A white label has a barcode on it and a Westell company logo.? Extracted Text: 3 Rene" J Answer: A white label has a barcode on it and a Westell company logo

--------------------------------------------------

************************************************************
TESTING COMPLETE
************************************************************
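
Several OCR snippets in the run above look garbled (for example `X CBXAS` in test 2). One possible mitigation, sketched here, is to drop low-confidence detections before joining the text. `filter_ocr_results` is a hypothetical helper, the 0.4 threshold is an arbitrary assumption that would need tuning, and the detections below are made up to mimic EasyOCR's `(bbox, text, confidence)` tuples.

```python
def filter_ocr_results(results, min_confidence=0.4):
    """Join the text of detections whose confidence meets the threshold (hypothetical helper)."""
    kept = [
        text
        for (_bbox, text, conf) in results
        if conf >= min_confidence and text.strip()
    ]
    return " ".join(kept)


# Made-up detections in (bbox, text, confidence) layout, for illustration only.
fake_results = [
    ([[0, 0], [10, 0], [10, 5], [0, 5]], "INGREDIENTS", 0.92),
    ([[0, 6], [10, 6], [10, 11], [0, 11]], "Xq#7", 0.12),
    ([[0, 12], [10, 12], [10, 17], [0, 17]], "sugar", 0.71),
]
print(filter_ocr_results(fake_results))  # prints "INGREDIENTS sugar"
```

In the notebook, the list returned by `ocr_reader.readtext(image_np)` would take the place of `fake_results`.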





# Conclusion
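This notebook demonstrates how to build a specialized VQA system for label reading by incorporating text extraction (OCR) into the prompt construction. Supplying the extracted text as context is intended to help on detail-oriented tasks like reading ingredient lists or nutrition facts, although the benefit depends on OCR quality: several OCR snippets in the test run above are garbled.

The automated testing loop gives a quick end-to-end check of the pipeline on real-world images from VizWiz. Because the run shown uses captions as stand-in questions, measuring answer accuracy requires a dataset with questions and human answers, such as VizWiz-VQA.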