In this notebook we evaluate the performance of fine-tuned LLaVA-1.5 models on the ScienceQA dataset. The same notebook can be used to evaluate any of the fine-tuned LLaVA-1.5 models we trained.

We start by loading the ScienceQA dataset from the Hugging Face Hub. We only need the test split.

In [1]:
from datasets import load_dataset, Value
import matplotlib.pyplot as plt

ds = load_dataset("derek-thomas/ScienceQA")
ds = ds.filter(lambda example: example['image'] is not None)  # keep only examples that have an image
ds = ds.cast_column('answer', Value("string"))  # convert answer from int to string


In [2]:
test_dataset = ds['test']
test_dataset = test_dataset.remove_columns(['hint', 'task', 'grade', 'subject', 'topic', 'category', 'skill', 'lecture', 'solution'])

We load the LLaVA-1.5 processor (which wraps the tokenizer and the image processor) from Hugging Face. Then we load the fine-tuned LLaVA-1.5 model from our local checkpoint directory. To evaluate another fine-tuned LLaVA model, we only need to change this path.

In [3]:
from transformers import AutoProcessor, AutoTokenizer, LlavaForConditionalGeneration, BitsAndBytesConfig
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
print(device)
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "llava-hf/llava-1.5-7b-hf"

# Load the processor, which formats inputs for LLaVA
processor = AutoProcessor.from_pretrained(model_id)
# processor.tokenizer.padding_side = "right" # during training, one always uses padding on the right

# We will use QLoRA, so we specify how to quantize the model
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = LlavaForConditionalGeneration.from_pretrained("LLaVA-1.5_qlora_ft/checkpoint-258/", # change path for other llava ft models
                                                      torch_dtype=torch.float16,
                                                      quantization_config=quantization_config).to(device)

model.eval()

cuda


Some kwargs in processor config are unused and will not have any effect: num_additional_image_tokens. 
`low_cpu_mem_usage` was None, now default to True since model is quantized.
Loading checkpoint shards: 100%|██████████| 3/3 [00:32<00:00, 10.78s/it]
You shouldn't move a model that is dispatched using accelerate hooks.


LlavaForConditionalGeneration(
  (vision_tower): CLIPVisionModel(
    (vision_model): CLIPVisionTransformer(
      (embeddings): CLIPVisionEmbeddings(
        (patch_embedding): Conv2d(3, 1024, kernel_size=(14, 14), stride=(14, 14), bias=False)
        (position_embedding): Embedding(577, 1024)
      )
      (pre_layrnorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
      (encoder): CLIPEncoder(
        (layers): ModuleList(
          (0-23): 24 x CLIPEncoderLayer(
            (self_attn): CLIPSdpaAttention(
              (k_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=1024, out_features=1024, bias=True)
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.1, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=1024, out_features=6, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features

Now we run the evaluation loop. The loop is simple: for each batch, we format the question and choices from the dataset into the chat prompt the model expects (much like the data collator used during training) and then have the model generate the answer.

In [None]:
import torchvision.transforms as transforms


EVAL_BATCH_SIZE = 4

answers_unique = []
generated_texts_unique = []
transform = transforms.Compose([transforms.PILToTensor()])

for i in range(0, len(test_dataset), EVAL_BATCH_SIZE):
    examples = test_dataset[i: i + EVAL_BATCH_SIZE]
    answers_unique.extend(examples["answer"])
    images = [transform(im) for im in examples["image"]]
    texts = []
    for q, c in zip(examples['question'], examples['choices']):

        content = [{"type": "text", "text": "Please read the multiple-choice question below carefully and answer only with an index of the given list of choices."}]
        content += [{"type": "image"}]
        content += [{"type": "text", "text": f"{q}\nThe choices are: {c}."}]

        messages = [
            {
                "role": "user",
                "content": content,
            },
        ]
        text = processor.apply_chat_template(messages, add_generation_prompt=True)
        texts.append(text.strip())
    inputs = processor(text=texts, images=images, return_tensors="pt", padding=True).to(device)
    generated_ids = model.generate(**inputs, max_new_tokens=64, do_sample=False)
    generated_texts = processor.batch_decode(generated_ids, skip_special_tokens=True)
    for text in generated_texts:
        # append (not extend) so each answer is stored as one string, not one entry per character
        generated_texts_unique.append(text.split("ASSISTANT:")[-1])
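
The prompt construction inside the loop can be pulled into a small helper so the message structure can be inspected without loading the model. The sketch below is a hypothetical refactoring: `build_messages` and `INSTRUCTION` are names introduced here, not part of the original notebook, and the sample question is made up. It builds the same single-turn message list that the loop passes to `processor.apply_chat_template`.

```python
# Instruction text copied from the evaluation loop above.
INSTRUCTION = (
    "Please read the multiple-choice question below carefully and answer "
    "only with an index of the given list of choices."
)

def build_messages(question, choices):
    """Return the single-turn chat message list used in the evaluation loop."""
    content = [
        {"type": "text", "text": INSTRUCTION},
        {"type": "image"},  # placeholder where the processor inserts the image
        {"type": "text", "text": f"{question}\nThe choices are: {choices}."},
    ]
    return [{"role": "user", "content": content}]

# Made-up example, only to show the resulting structure.
messages = build_messages("Which animal is a mammal?", ["frog", "dog", "trout"])
print(messages[0]["content"][2]["text"])
```

In the loop, the returned list would be passed to `processor.apply_chat_template(messages, add_generation_prompt=True)` exactly as before.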

We have to clean up the generated answers a little bit so that only the index values remain.

In [5]:
predictions = [g.strip().strip(".") for g in generated_texts_unique]
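
Stripping whitespace and a trailing period works when the model replies with a bare index. If a reply contains extra words, a slightly more defensive parse may help. The following is a minimal sketch, assuming the first integer in the reply is the intended index; `extract_index` is a hypothetical helper, not something the notebook defines.

```python
import re

def extract_index(generated):
    """Return the first integer found in the text as a string, or None if there is none."""
    match = re.search(r"\d+", generated)
    return match.group(0) if match else None

# Made-up replies covering the clean case and a few messy ones.
examples = [" 1", "2.", "The answer is 3.", "I am not sure"]
print([extract_index(t) for t in examples])
```

Replies with no digits come back as `None`, so they can be counted separately instead of raising an error later when converted to `int`.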

Finally, we calculate the accuracy of the model on the test dataset.

In [28]:
import numpy as np

def accuracy(preds, real):
    p = np.array([int(x) for x in preds])
    r = np.array([int(x) for x in real])
    return sum(p == r)/len(real)

accuracy(predictions, test_dataset['answer'])

0.8328223723081908
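
The `accuracy` function above raises a `ValueError` if any prediction cannot be converted to an integer. As a hedged alternative, the sketch below counts such predictions as wrong instead of failing. `safe_accuracy` is a hypothetical name, and the inputs are made-up values used only to show the behavior.

```python
def safe_accuracy(preds, real):
    """Fraction of exact index matches; unparseable predictions count as wrong."""
    if len(preds) != len(real):
        raise ValueError("preds and real must have the same length")
    correct = 0
    for p, r in zip(preds, real):
        try:
            correct += int(int(p) == int(r))
        except (TypeError, ValueError):
            pass  # non-numeric or missing prediction: counted as incorrect
    return correct / len(real)

# Two of the four made-up predictions match their reference answers.
print(safe_accuracy(["1", "0", "n/a", "2"], ["1", "2", "0", "2"]))
```

Counting unparseable replies as errors keeps the metric conservative; whether that is the right choice depends on how often the model deviates from the requested format.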


For reference, we also compute the distribution of answer indices in the test data.

In [37]:
answers = np.array(test_dataset['answer'])
for label in ['0', '1', '2', '3', '4']:
    print(np.mean(answers == label))  # fraction of test examples with this answer index

0.35349529003470503
0.35994050570153696
0.18591968269707487
0.09667823500247893
0.0039662865642042635
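
The distribution above gives a simple reference point: always predicting the most frequent index (index 1, about 36% of the test examples) would already reach roughly 36% accuracy. The sketch below shows one way to compute such a majority-class baseline. `majority_baseline` is a hypothetical helper, and the answer list is a made-up toy sample rather than the real test set.

```python
from collections import Counter

def majority_baseline(answers):
    """Return the most frequent label and the accuracy of always predicting it."""
    counts = Counter(answers)
    label, count = counts.most_common(1)[0]
    return label, count / len(answers)

# Toy sample: label "1" appears in 3 of 6 answers.
label, acc = majority_baseline(["1", "0", "1", "2", "1", "0"])
print(label, acc)
```

On the real data, the same function could be called with `test_dataset['answer']`.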
