In this notebook we evaluate the performance of Idefics2 models on the ScienceQA dataset. The same notebook can be used to evaluate any of the fine-tuned Idefics2 models we trained.

We start by importing the ScienceQA dataset from HuggingFace. We only need the test split.

In [1]:
from datasets import load_dataset, Value
import matplotlib.pyplot as plt

ds = load_dataset("derek-thomas/ScienceQA")
ds = ds.filter(lambda example: example['image'] is not None) #only keep obs with images
ds = ds.cast_column('answer', Value("string")) #convert answer from int to string


In [2]:
test_dataset = ds['test']
test_dataset = test_dataset.remove_columns(['hint', 'task', 'grade', 'subject', 'topic', 'category', 'skill', 'lecture', 'solution'])

We load the Idefics2 tokenizer and processor from HuggingFace. Then we load the fine-tuned Idefics2 model from our local repository (to evaluate another fine-tuned Idefics2 model, we just need to change this path).

In [3]:
from transformers import AutoProcessor, AutoTokenizer, Idefics2ForConditionalGeneration, BitsAndBytesConfig
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
print(device)
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Load the processor, which formats inputs for Idefics2
processor = AutoProcessor.from_pretrained("HuggingFaceM4/idefics2-8b", do_image_splitting=False, size= {"longest_edge": 448, "shortest_edge": 378})
# processor.tokenizer.padding_side = "right" # during training, one always uses padding on the right

# Tokenizer
IDEFICS2_CHAT_TEMPLATE = """{% if not add_generation_prompt is defined %}{% set add_generation_prompt = false %}{% endif %}A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. {% for message in messages %}{% if message['role'] == 'user' %}USER: {% else %}ASSISTANT: {% endif %}{% for item in message['content'] %}{% if item['type'] == 'text' %}{{ item['text'] }}{% elif item['type'] == 'image' %}<image>{% endif %}{% endfor %}{% if message['role'] == 'user' %} {% else %}{{eos_token}}{% endif %}{% endfor %}{% if add_generation_prompt %}ASSISTANT: {% endif %}"""
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceM4/idefics2-8b", use_fast=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.chat_template = IDEFICS2_CHAT_TEMPLATE

# We use QLoRA, so we specify how to quantize the model
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = Idefics2ForConditionalGeneration.from_pretrained("Idefics2_qlora_ft/checkpoint-776/", # change this path to change ft model we eval
                                                         torch_dtype=torch.float16,
                                                         quantization_config=quantization_config).to(device)

model.eval()

cuda


Chat templates should be in a 'chat_template.json' file but found key='chat_template' in the processor's config. Make sure to move your template to its own file.
`low_cpu_mem_usage` was None, now default to True since model is quantized.
Loading checkpoint shards: 100%|██████████| 7/7 [00:57<00:00,  8.28s/it]
You shouldn't move a model that is dispatched using accelerate hooks.


Idefics2ForConditionalGeneration(
  (model): Idefics2Model(
    (vision_model): Idefics2VisionTransformer(
      (embeddings): Idefics2VisionEmbeddings(
        (patch_embedding): Conv2d(3, 1152, kernel_size=(14, 14), stride=(14, 14), padding=valid)
        (position_embedding): Embedding(4900, 1152)
      )
      (encoder): Idefics2Encoder(
        (layers): ModuleList(
          (0-26): 27 x Idefics2EncoderLayer(
            (self_attn): Idefics2VisionAttention(
              (k_proj): Linear4bit(in_features=1152, out_features=1152, bias=True)
              (v_proj): Linear4bit(in_features=1152, out_features=1152, bias=True)
              (q_proj): Linear4bit(in_features=1152, out_features=1152, bias=True)
              (out_proj): Linear4bit(in_features=1152, out_features=1152, bias=True)
            )
            (layer_norm1): LayerNorm((1152,), eps=1e-06, elementwise_affine=True)
            (mlp): Idefics2VisionMLP(
              (activation_fn): PytorchGELUTanh()
              

Now we run the evaluation loop. The loop is simple: we format the question and choices from the dataset into the prompt format (much like the data collator does), and then have the model generate the answer.

In [4]:
from tqdm import tqdm # for progress bar

EVAL_BATCH_SIZE = 15

answers_unique = []
generated_texts_unique = []

for i in tqdm(range(0, len(test_dataset), EVAL_BATCH_SIZE)):
    examples = test_dataset[i: i + EVAL_BATCH_SIZE]
    answers_unique.extend(examples["answer"])
    images = [[im] for im in examples["image"]]
    texts = []
    for q, c in zip(examples['question'], examples['choices']):

        content = [{"type": "text", "text": "Please read the multiple-choice question below carefully and answer only with an index of the given list of choices."}]
        content += [{"type": "image"}]
        content += [{"type": "text", "text": f"{q}\nThe choices are: {c}."}]

        messages = [
            {
                "role": "user",
                "content": content,
            },
        ]
        text = processor.apply_chat_template(messages, add_generation_prompt=True)
        texts.append(text.strip())
    inputs = processor(text=texts, images=images, return_tensors="pt", padding=True).to(device)
    generated_ids = model.generate(**inputs, max_new_tokens=64)
    generated_texts = processor.batch_decode(generated_ids[:, inputs["input_ids"].size(1):], skip_special_tokens=True)
    generated_texts_unique.extend(generated_texts)

100%|██████████| 135/135 [16:17<00:00,  7.24s/it]
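To make the formatting step easier to inspect, here is a minimal sketch that builds the message list for a single question outside of the loop. The helper name `build_messages` and the example question are hypothetical and not part of the notebook; the instruction text and message structure are copied from the loop above.

```python
# Hypothetical helper (not in the notebook): builds the chat messages for one example.
INSTRUCTION = (
    "Please read the multiple-choice question below carefully and answer "
    "only with an index of the given list of choices."
)

def build_messages(question, choices):
    """Return a single-turn chat with the same structure the evaluation loop uses."""
    content = [
        {"type": "text", "text": INSTRUCTION},
        {"type": "image"},  # placeholder only; the image itself goes to the processor
        {"type": "text", "text": f"{question}\nThe choices are: {choices}."},
    ]
    return [{"role": "user", "content": content}]

# Made-up example for illustration, not taken from ScienceQA
messages = build_messages("Which animal is a mammal?", ["frog", "dog", "trout"])
print(messages[0]["content"][2]["text"])
```

Because the choices are interpolated as a Python list, the model sees them in list notation and is expected to answer with a zero-based index into that list.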


We have to clean up the generated answers a little bit so that only the index values remain.

In [5]:
predictions = [g.strip().strip(".") for g in generated_texts_unique]
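Stripping whitespace and a trailing period is enough when the model answers with a bare index, but the `int(...)` conversion in the accuracy function below would raise an error on any other output. As a hedged alternative, the sketch below pulls the first integer out of each generation and counts anything unparseable as wrong. The helpers `extract_index` and `safe_accuracy` and the sample strings are hypothetical, not part of the notebook.

```python
import re

def extract_index(generated_text):
    """Return the first integer found in the text, or None if there is none."""
    match = re.search(r"\d+", generated_text)
    return int(match.group()) if match else None

def safe_accuracy(generated_texts, answers):
    """Accuracy where unparseable generations count as wrong instead of raising."""
    preds = [extract_index(g) for g in generated_texts]
    correct = sum(p is not None and p == int(a) for p, a in zip(preds, answers))
    return correct / len(answers)

# Made-up generations for illustration only
sample_outputs = ["1.", " 0", "The answer is 2", "unsure"]
sample_answers = ["1", "0", "3", "0"]
print(safe_accuracy(sample_outputs, sample_answers))  # 2 of 4 correct -> 0.5
```

Counting unparseable outputs as wrong is a design choice: it keeps the denominator equal to the full test set, so the score stays comparable across checkpoints.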

Finally, we calculate the accuracy of the model on the test dataset.

In [7]:
import numpy as np

def accuracy(preds, real):
    p = np.array([int(x) for x in preds])
    r = np.array([int(x) for x in real])
    return sum(p == r)/len(real)

accuracy(predictions, test_dataset['answer'])

0.8914229053049083

Distribution of answers in test data

In [37]:
answers = np.array(test_dataset['answer'])
for label in ['0', '1', '2', '3', '4']:
    # fraction of test examples whose correct answer is this index
    print((answers == label).sum() / len(answers))

0.35349529003470503
0.35994050570153696
0.18591968269707487
0.09667823500247893
0.0039662865642042635
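The largest share printed above is about 0.36 (index 1), so a model that always answered 1 would score roughly 0.36, well below the 0.89 accuracy measured earlier. The sketch below shows one way to compute the label shares and this majority-class baseline in a single pass. The function `label_distribution` and the toy labels are hypothetical, not part of the notebook.

```python
from collections import Counter

def label_distribution(answers):
    """Map each label to its share of the answers, in sorted label order."""
    counts = Counter(answers)
    total = len(answers)
    return {label: counts[label] / total for label in sorted(counts)}

# Toy labels for illustration; in the notebook this would be test_dataset['answer']
toy_answers = ["0", "1", "1", "2"]
dist = label_distribution(toy_answers)
majority_baseline = max(dist.values())  # accuracy of always predicting the most common label

print(dist)
print(majority_baseline)
```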
