# Multimodal LLMs with Database Constrained Decoding for Recycling Classification

This is the Google Colab notebook accompanying the repo https://github.com/acluous/recycling-database-constrained-decoding.git.

**Clone repo and install requirements**

In [None]:
FOLDER_ROOT = "/content/recycling-database-constrained-decoding"
!git clone https://github.com/acluous/recycling-database-constrained-decoding.git
!git clone https://huggingface.co/datasets/acluous/waste-wizard-materials-list
!pip install -r {FOLDER_ROOT}/requirements.txt

**Restart runtime and load model**

Make sure to select Runtime > Change runtime type > T4 GPU.

In [2]:
from collections import OrderedDict
import requests
import json
import numpy as np
from PIL import Image
import torch
from tqdm import tqdm

from transformers import AutoProcessor, AutoModelForVision2Seq, AutoModel, BitsAndBytesConfig
from datasets import load_dataset

import sys
FOLDER_ROOT = "/content/recycling-database-constrained-decoding"
sys.path.append(FOLDER_ROOT)
import decoding_utils

# load in 4bit to reduce memory consumption
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16
)
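
To make the memory motivation concrete, here is a back-of-the-envelope estimate of weight storage at different precisions. It is only a sketch: the 8e9 parameter count is an assumption read off the "8b" in the model name, the helper `weight_memory_gb` is hypothetical, and the estimate ignores activations, the KV cache, and quantization metadata.

```python
# Rough weight-only memory estimate (not a measurement).
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

NUM_PARAMS = 8e9  # assumed nominal size, taken from the "8b" in the model name

fp16_gb = weight_memory_gb(NUM_PARAMS, 16)
nf4_gb = weight_memory_gb(NUM_PARAMS, 4)
print(f"fp16 weights:  ~{fp16_gb:.0f} GB")
print(f"4-bit weights: ~{nf4_gb:.0f} GB")
```

Under these assumptions the fp16 weights alone would roughly fill the 16 GB of a T4, while 4-bit weights leave headroom for activations.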

**Load Waste Wizard Dataset**

In [24]:
image_dataset = load_dataset("acluous/waste-wizard-materials-list")['train']

# change city_name to whatever city you want to evaluate next
# ["toy", "davis", "mountain-view", "waverley", "waterloo"]
city_name = "toy"
image_labels_file = f"/content/waste-wizard-materials-list/data/image-labels/{city_name}.json"
city_database_file = f"/content/waste-wizard-materials-list/data/city-databases/{city_name}.json"

query = "What is this item?"
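# the toy database is already a list of names; the city databases store
# each entry under a "text" field that needs normalizing
with open(city_database_file) as f:
  database = json.load(f)
if city_name != "toy":
  database = [ann["text"].title().strip() for ann in database]
with open(image_labels_file) as f:
  image_labels_map = json.load(f)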

['Aerosol Can', 'Aluminum Foil', 'Bleach', 'Broken Glass', 'Cooking Oil', 'Hardcover Book', 'Laptop', 'Newspaper', 'Rubber Band', 'Shoes']
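
As an illustration of the normalization applied to the non-toy city databases, the sketch below runs the same `.title().strip()` expression on a few made-up records. The records are hypothetical; only the `"text"` field is taken from the notebook code.

```python
# Hypothetical records shaped like the city databases: a list of dicts
# with a "text" field (the only field the notebook reads).
raw_records = [
    {"text": "  aluminum foil "},
    {"text": "BROKEN GLASS"},
    {"text": "hardcover book"},
]

# Same expression the notebook uses: title-case, then trim whitespace.
normalized = [record["text"].title().strip() for record in raw_records]
print(normalized)  # ['Aluminum Foil', 'Broken Glass', 'Hardcover Book']
```

The result has the same shape as the toy database: a flat list of title-cased item names.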


**Load Idefics2**

In [6]:
model_id = "HuggingFaceM4/idefics2-8b"
processor = AutoProcessor.from_pretrained(
    model_id,
    do_image_splitting=False
)
model = AutoModelForVision2Seq.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    quantization_config=quantization_config,
)



**Run Database Constrained Decoding**

In [None]:
preds, labels, all_database_chosen = decoding_utils.mllm_classification(model, processor, image_dataset, database, query, mode="dcd")
acc = decoding_utils.get_accuracy(preds, labels, image_labels_map)
print("Classification Accuracy, Idefics2 - Database Constrained Decoding", acc)
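
The general idea behind constraining decoding to a database can be sketched with a trie: at every step, only tokens that continue some database entry are allowed. The code below is a minimal illustration, not the implementation in `decoding_utils`. The whitespace tokenizer, the `build_trie` and `constrained_greedy_decode` helpers, and the toy scores standing in for model logits are all assumptions.

```python
from typing import Callable, Dict, List

END = "<end>"  # marks a complete database entry in the trie

def build_trie(entries: List[str]) -> Dict:
    """Build a nested-dict trie over whitespace tokens (toy tokenizer)."""
    root: Dict = {}
    for entry in entries:
        node = root
        for token in entry.split():
            node = node.setdefault(token, {})
        node[END] = {}
    return root

def constrained_greedy_decode(
    trie: Dict, score_fn: Callable[[List[str], str], float]
) -> str:
    """Greedily pick the best-scoring token among those the trie allows."""
    node, prefix = trie, []
    while True:
        allowed = list(node.keys())  # tokens that continue a database entry
        best = max(allowed, key=lambda tok: score_fn(prefix, tok))
        if best == END:
            return " ".join(prefix)
        prefix.append(best)
        node = node[best]

# Made-up scores standing in for model logits. "Soda" scores highest
# but is not in the database, so it can never be selected.
TOY_SCORES = {"Soda": 5.0, "Aluminum": 2.0, "Can": 1.5, "Foil": 1.0}

def toy_score(prefix: List[str], token: str) -> float:
    return TOY_SCORES.get(token, 0.0)

toy_database = ["Aerosol Can", "Aluminum Foil", "Aluminum Can", "Laptop"]
prediction = constrained_greedy_decode(build_trie(toy_database), toy_score)
print(prediction)  # Aluminum Can
```

By construction, the output is always a database entry. With Hugging Face `generate`, a similar restriction can be expressed through the `prefix_allowed_tokens_fn` argument; whether the repo does it that way is not confirmed here.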

**Run Normalized Sequence Likelihood**

In [None]:
# only run this on the toy dataset
preds, labels, all_database_chosen = decoding_utils.mllm_classification(model, processor, image_dataset, database, query, mode="nsl")
acc = decoding_utils.get_accuracy(preds, labels, image_labels_map)
print("Classification Accuracy, Idefics2 - Normalized Sequence Likelihood", acc)
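
For intuition, a length-normalized likelihood baseline can be sketched as: score each database entry by its average per-token log-probability and pick the best. This is an assumption about what "normalized" means here; the repo's exact normalization is not shown in this notebook, and the log-probabilities below are made up.

```python
from typing import Dict, List

def normalized_log_likelihood(token_logprobs: List[float]) -> float:
    """Average per-token log-probability of one candidate sequence."""
    return sum(token_logprobs) / len(token_logprobs)

def pick_by_nsl(candidates: Dict[str, List[float]]) -> str:
    """Return the candidate with the highest length-normalized score."""
    return max(candidates, key=lambda name: normalized_log_likelihood(candidates[name]))

def pick_by_raw_sum(candidates: Dict[str, List[float]]) -> str:
    """Return the candidate with the highest unnormalized log-likelihood."""
    return max(candidates, key=lambda name: sum(candidates[name]))

# Made-up per-token log-probs for three database entries.
toy_logprobs = {
    "Laptop": [-1.2],
    "Hardcover Book": [-0.9, -0.7, -0.8],
    "Shoes": [-2.0],
}

nsl_choice = pick_by_nsl(toy_logprobs)
raw_choice = pick_by_raw_sum(toy_logprobs)
print(nsl_choice)  # Hardcover Book
print(raw_choice)  # Laptop
```

Without normalization, longer entries are penalized simply for having more tokens. Scoring every entry for every image also scales with database size, which may be why the notebook restricts this mode to the toy dataset.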


**Load OpenCLIP**

In [None]:
clip_id = "laion/CLIP-ViT-B-32-laion2B-s34B-b79K"
clip_processor = AutoProcessor.from_pretrained(clip_id)
clip_model = AutoModel.from_pretrained(
    clip_id,
    torch_dtype=torch.float16,
    quantization_config=quantization_config
)

**Run Contrastive Classification**

In [None]:
preds, labels, all_database_chosen = decoding_utils.contrastive_classification(clip_model, clip_processor, image_dataset, database)
acc = decoding_utils.get_accuracy(preds, labels, image_labels_map)
print("Classification Accuracy, OpenCLIP", acc)

Classification Accuracy, OpenCLIP 0.53
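
Contrastive classification in general amounts to embedding images and database entries in a shared space and picking the most similar entry for each image. The sketch below shows that step with NumPy on made-up embeddings. The `classify_by_cosine` helper is hypothetical and is not the repo's `contrastive_classification`.

```python
import numpy as np

def classify_by_cosine(image_embs: np.ndarray, text_embs: np.ndarray) -> np.ndarray:
    """Return, for each image, the index of the most similar text embedding."""
    # L2-normalize so the dot product equals cosine similarity
    image_embs = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    text_embs = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    similarity = image_embs @ text_embs.T  # shape: (num_images, num_entries)
    return similarity.argmax(axis=1)

# Made-up 3-d embeddings; real CLIP embeddings have far more dimensions.
toy_entries = ["Laptop", "Newspaper", "Shoes"]
toy_text_embs = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
toy_image_embs = np.array([[0.9, 0.1, 0.2], [0.1, 0.2, 3.0]])

toy_preds = [toy_entries[i] for i in classify_by_cosine(toy_image_embs, toy_text_embs)]
print(toy_preds)  # ['Laptop', 'Shoes']
```

Because the text embeddings depend only on the database, they can be computed once and reused for every image.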
