This notebook is adapted from the evaluation code released with the "WHOOPS!" article. We use only the first task, Image Captioning.

We changed only the DataFrame parameters TASK_CAPTIONING, CROWD_CAPTIONS, and file_input_path, and added BERTScore to the results.

# Evaluate WHOOPS! benchmark

The WHOOPS! benchmark presents 4 tasks: Explanation-of-violation, Image Captioning, Image-text Matching and Visual Question Answering (VQA).

This Colab notebook implements the evaluation for 3 tasks: Image Captioning, Image-text Matching and VQA.
The WHOOPS! evaluation file for this notebook can be found [here](https://drive.google.com/file/d/1dx6fuxKf4Yc18xvLmTr-nRkBT_TYTKdq/view?usp=share_link).

The Explanation-of-violation task currently relies on human evaluation. If you want to compute your results on this task, please send an email to: yonatanbitton1@gmail.com

**NOTE** for the VQA task: the BEM calculation is very slow. If you want to calculate it, run the relevant calls in [this section](https://colab.research.google.com/drive/1av7JdDk005qQL6WdAVL0kFlah7VXV0Md#scrollTo=Qm4HZ7322sse).

In [1]:
#@title Installations & Imports.

!pip install -q "git+https://github.com/salaniz/pycocoevalcap.git"
!pip install -q tensorflow-text
!pip install -q bert-score

from pycocoevalcap.eval import COCOEvalCap
from pycocotools.coco import COCO
from pycocoevalcap.tokenizer.ptbtokenizer import PTBTokenizer
from pycocoevalcap.cider.cider import Cider
from pycocoevalcap.bleu.bleu import Bleu
import tensorflow as tf
# import tensorflow_hub as hub
import tensorflow_text as text
from scipy.special import softmax
import numpy as np
import pandas as pd
from tqdm import tqdm
import json
from bert_score import score as bert_score

  Preparing metadata (setup.py) ... done
  Building wheel for pycocoevalcap (setup.py) ... done

To evaluate our model's generated captions, set the following in the next cells:

file_input_path = 'weird_df.csv' or 'normal_df.csv'

and choose the model:
model_name = "gpt" or "blip"

In [159]:
model_name="blip" # "blip" or "gpt"

In [177]:
#@title Set up DataFrame parameters for evaluation.

TASK_VQA = 'gpt4o_vqa' #@param {type:"string"}
QUESTION_KEY_QA_DICT = 'question' #@param {type:"string"}
REFERENCE_KEY_QA_DICT = 'ground_truth_answer' #@param {type:"string"}
CANDIDATE_KEY_QA_DICT = 'predicted_answer' #@param {type:"string"}

TASK_MATCHING = 'blip_matching' #@param {type:"string"}

TASK_CAPTIONING = f"{model_name}_captioning" #@param {type:"string"} #blip_captioning

POSITIVE = 'positive' #@param {type:"string"}
UNDER_SPECIFIED = 'under_specified' #@param {type:"string"}

IMAGE_ID = 'image_id' #@param {type:"string"}
CROWD_CAPTIONS = 'crowd_captions' #@param {type:"string"} , original_captions

ITM_SCORE = 'itm_score'
MATCHING_SCORE_ITM = 'matching_score_itm'

file_input_path = 'normal_df.csv' #@param {type:"string"} or weird_df.csv

## Create weird/normal df
You don't have to run this section, because weird_df.csv and normal_df.csv already exist.


It takes the generated-captions df and creates the normal df and the weird df, with the columns needed for evaluation.

In [161]:
fixedLabels = pd.read_csv("fixedLabels.csv")
all_generated_df = pd.read_csv(f"{model_name}-generated_captions.csv")

### Normal captions

In [162]:
normal_generated = all_generated_df[['normal']]

In [163]:
normal_df = fixedLabels[fixedLabels.index % 2 == 0].reset_index(drop=True)

In [164]:
def add_double_quotes(text):
    if isinstance(text, str):
        return f'"{text}"'
    return text

# Apply the function to the 'captions' column and create a new column 'crowd_captions'
normal_df['crowd_captions'] = normal_df['captions'].apply(add_double_quotes)
normal_df[f'{model_name}_captioning'] = normal_generated['normal'].values
normal_df[f'{model_name}_captioning'] = normal_df[f'{model_name}_captioning'].str.replace('"', "'")
normal_df[f'{model_name}_captioning'] = normal_df[f'{model_name}_captioning'].apply(add_double_quotes)

# Assign sequential image ids (1..N) to the normal images
normal_df['image_id'] = range(1, len(normal_df) + 1)

In [165]:
normal_df.to_csv("normal_df.csv")
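
The double-quote wrapping above exists so that each cell can later be parsed with `json.loads` when the DataFrame is loaded for evaluation. A minimal sketch of that round trip follows. The sample caption is made up for illustration, and `json.dumps` is shown only as a possible alternative, not as something this notebook uses.

```python
import json

def add_double_quotes(text):
    if isinstance(text, str):
        return f'"{text}"'
    return text

caption = 'A man holding a sign that says "hello"'  # made-up example

# Wrapping without sanitizing leaves the inner quotes unescaped, so parsing fails.
try:
    json.loads(add_double_quotes(caption))
    raw_parse_ok = True
except json.JSONDecodeError:
    raw_parse_ok = False

# Replacing inner double quotes first (as done for the generated captions) parses cleanly.
sanitized = add_double_quotes(caption.replace('"', "'"))
parsed = json.loads(sanitized)

# Possible alternative: json.dumps escapes quotes and backslashes itself.
round_trip = json.loads(json.dumps(caption))

print(raw_parse_ok)
print(parsed)
print(round_trip == caption)
```

Note that the reference `captions` column is wrapped without the quote replacement, so a reference caption containing a double quote would presumably fail to parse later.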

### Weird captions

In [166]:
big_data = pd.read_csv("whoops_dataset.csv")

In [167]:
weird_generated = all_generated_df[['strange']]

In [168]:
weird_df = fixedLabels[fixedLabels.index % 2 == 1].reset_index(drop=True)

In [169]:
weird_df[f'{model_name}_captioning'] = weird_generated['strange'].values
# Left-merge on 'selected_caption' to look up each weird caption in the WHOOPS! dataset
merged_data = pd.merge(weird_df, big_data, on='selected_caption', how='left')
# Copy 'crowd_captions' and 'image_id' from big_data into weird_df (NaN where no match was found)
weird_df['crowd_captions'] = merged_data['crowd_captions']
weird_df['image_id'] = merged_data['image_id']

In [170]:
weird_df[f'{model_name}_captioning'] = weird_df[f'{model_name}_captioning'].str.replace('"', "'")
weird_df[f'{model_name}_captioning'] = weird_df[f'{model_name}_captioning'].apply(add_double_quotes)


In [171]:
len(weird_df)

101

In [172]:
weird_df.dropna(subset=['crowd_captions'], inplace=True)

In [173]:
len(weird_df)

98
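
Rows whose `selected_caption` has no match in `whoops_dataset.csv` receive NaN in `crowd_captions` after the left merge, which is presumably why `dropna` removes 3 of the 101 rows. One way to list the unmatched captions is the `indicator` option of `pd.merge`. The sketch below uses tiny made-up frames instead of the real files, and adds `validate="many_to_one"` as an assumed safeguard against duplicate captions in the right-hand table, which would otherwise add rows and misalign the column copy.

```python
import pandas as pd

# Toy stand-ins for weird_df and whoops_dataset.csv (values are made up)
weird = pd.DataFrame({"selected_caption": ["a", "b", "c"]})
big = pd.DataFrame({
    "selected_caption": ["a", "c"],
    "crowd_captions": ['["x"]', '["y"]'],
    "image_id": ["id_a", "id_c"],
})

# indicator=True adds a "_merge" column; validate raises if right-hand keys repeat
merged = pd.merge(weird, big, on="selected_caption", how="left",
                  indicator=True, validate="many_to_one")

unmatched = merged.loc[merged["_merge"] == "left_only", "selected_caption"].tolist()
print(unmatched)
```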

In [174]:
weird_df.to_csv("weird_df.csv")


In [175]:
model_captioning_1 = weird_df[[f'{model_name}_captioning']]
model_captioning_2 = normal_df[[f'{model_name}_captioning']]

# Concatenate the columns
merged_model_captioning = pd.concat([model_captioning_1, model_captioning_2], ignore_index=True)

In [176]:
# for using in caption classification notebook
merged_model_captioning.to_csv(f'{model_name}_new_captions.csv', index=False)

In [84]:
# # # check the rows if there is a problem:
# columns_to_process = [TASK_CAPTIONING, CROWD_CAPTIONS, TASK_VQA, TASK_MATCHING]

# # Check for problematic JSON entries
# for c in columns_to_process:
#     if c in df:
#         print(f"Processing column: {c}")
#         for index, value in df[c].items():
#             try:
#                 df.at[index, c] = json.loads(value)
#             except json.JSONDecodeError as e:
#                 print(f"Error decoding JSON in column '{c}', row {index}: {value}")
#                 # You can choose to handle the error here, for example by setting it to an empty dict
#                 df.at[index, c] = {}  # Or handle it in another way that makes sense for your application

Processing column: blip_captioning


TypeError: the JSON object must be str, bytes or bytearray, not float
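
The `TypeError` above is what `json.loads` raises when it receives a non-string value. In a column read from CSV, missing cells are float NaN, which is the likely cause here, and the `except json.JSONDecodeError` clause in the commented-out cell does not catch it. A minimal sketch of a more tolerant loader follows; `safe_json_loads` is a hypothetical helper name, not part of the original code.

```python
import json

def safe_json_loads(value):
    """Return the parsed JSON value, or None if the cell is missing or malformed."""
    if not isinstance(value, str):
        # NaN (a float) and other non-string cells cannot be parsed
        return None
    try:
        return json.loads(value)
    except json.JSONDecodeError:
        return None

print(safe_json_loads(float("nan")))
print(safe_json_loads('"a cat on a sofa"'))
print(safe_json_loads('["first caption", "second caption"]'))
print(safe_json_loads("not valid json"))
```

Rows that come back as `None` can then be inspected or dropped before the metrics are computed.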

# Task \#1: Image Captioning

In [178]:
#@title Load Data Frame

# aggregation dict for all results
all_results_dict = {}

df = pd.read_csv(file_input_path)

for c in [TASK_CAPTIONING, CROWD_CAPTIONS, TASK_VQA, TASK_MATCHING]:
    if c in df:
      print(c)
      df[c] = df[c].apply(json.loads)

blip_captioning
crowd_captions


In [179]:
#@title Preprocess to calculate CIDEr + Bleu-4 metrics

# make df to dict of lists of dicts : {IMG_ID:[{IMG_ID:, CAPTION:}, ..], IMG_ID:[{IMG_ID:, CAPTION:}, ..], ..}
df_res = df[[IMAGE_ID,TASK_CAPTIONING]]
df_gts = df[[IMAGE_ID,CROWD_CAPTIONS]]

df_gts = df_gts.rename(columns={CROWD_CAPTIONS:"caption"})
df_res = df_res.rename(columns={TASK_CAPTIONING:"caption"})

exploded_df_gts = df_gts.explode("caption")
dict_res = {}
dict_gts = {}

for img_id in df_res.image_id:
  dict_res[img_id] = df_res[df_res.image_id == img_id].to_dict('records')
  dict_gts[img_id] = exploded_df_gts[exploded_df_gts.image_id == img_id].to_dict('records')

params = {IMAGE_ID: dict_gts.keys()}
params[IMAGE_ID]  = dict_res.keys()
imgIds = params[IMAGE_ID]

gts = {}
res = {}
for imgId in imgIds:
    gts[imgId] = dict_gts[imgId]
    res[imgId] = dict_res[imgId]

tokenizer = PTBTokenizer()
gts  = tokenizer.tokenize(gts)
res = tokenizer.tokenize(res)
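
The preprocessing cell above builds the structure that the scorers consume: a dict mapping each image id to a list of `{image_id, caption}` records, with one record per reference caption. A compact sketch of the same reshaping with `groupby`, on made-up data, is shown below. `to_coco_dict` is a hypothetical helper, and the `PTBTokenizer` step is left out because it needs a Java runtime.

```python
import pandas as pd

# Made-up rows in the same layout as the evaluation DataFrame
df = pd.DataFrame({
    "image_id": [1, 2],
    "crowd_captions": [["a cat", "a small cat"], ["a dog"]],
    "blip_captioning": ["cat", "dog"],
})

def to_coco_dict(frame, column):
    """Map image_id -> list of {'image_id', 'caption'} records, one per caption."""
    exploded = (
        frame[["image_id", column]]
        .rename(columns={column: "caption"})
        .explode("caption")  # lists become one row per caption; plain strings are kept
    )
    return {img_id: group.to_dict("records") for img_id, group in exploded.groupby("image_id")}

gts = to_coco_dict(df, "crowd_captions")
res = to_coco_dict(df, "blip_captioning")
print(gts[1])
print(res[2])
```

Grouping once avoids filtering the whole frame for every image id, which matters little at this dataset size but scales better.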

CIDEr + Bleu-4 and BERTScore

In [180]:
scorers = [
    (Bleu(4), "Bleu_4"),
    (Cider(), "CIDEr"),
]

for scorer, method in scorers:
  print(f'computing {scorer.method()} score...')
  score, scores = scorer.compute_score(gts, res)
  if isinstance(score, list):
    # Bleu returns [Bleu_1, ..., Bleu_4]; keep Bleu_4
    all_results_dict[method] = round(score[3] * 100, 2)
  else:
    all_results_dict[method] = round(score * 100, 2)

print('computing BERTScore...')
gts_texts = [" ".join(gts[imgId]) for imgId in imgIds]
res_texts = [" ".join(res[imgId]) for imgId in imgIds]

# bert_score expects (candidates, references); F1 is unaffected by the order, but P and R swap
P, R, F1 = bert_score(res_texts, gts_texts, lang='en', verbose=True)
all_results_dict['BERTScore'] = round(F1.mean().item() * 100, 2)
print(f"BERTScore: {round(F1.mean().item() * 100, 2)}\n")

# Print all results
for metric, value in all_results_dict.items():
    print(f"{metric}: {value}")

computing Bleu score...
{'testlen': 4221, 'reflen': 835, 'guess': [4221, 4120, 4019, 3918], 'correct': [679, 330, 165, 77]}
ratio: 5.055089820353228
computing CIDEr score...
computing BERTScore...


Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


calculating scores...
computing bert embedding.


  0%|          | 0/4 [00:00<?, ?it/s]

computing greedy matching.


  0%|          | 0/2 [00:00<?, ?it/s]

done in 75.44 seconds, 1.34 sentences/sec
BERTScore: 87.81

Bleu_4: 5.68
CIDEr: 0.07
BERTScore: 87.81
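
The Bleu_4 value can be reproduced by hand from the counts printed in the log above. BLEU-4 is the geometric mean of the 1-gram to 4-gram precisions (`correct / guess`), multiplied by a brevity penalty that only applies when the candidates are shorter than the references. Here the length ratio is about 5.06, so the penalty is 1. The sketch below is plain arithmetic on the logged numbers and ignores any small smoothing constants the library may add.

```python
import math

# Counts copied from the Bleu log output above
correct = [679, 330, 165, 77]
guess = [4221, 4120, 4019, 3918]
testlen, reflen = 4221, 835

# Modified n-gram precisions for n = 1..4
precisions = [c / g for c, g in zip(correct, guess)]

# Geometric mean of the four precisions
geo_mean = math.exp(sum(math.log(p) for p in precisions) / len(precisions))

# Brevity penalty: 1 when candidates are at least as long as the references
brevity_penalty = 1.0 if testlen >= reflen else math.exp(1 - reflen / testlen)

bleu4 = brevity_penalty * geo_mean
print(round(bleu4 * 100, 2))
```

The low score is driven by precision: the generated captions contain roughly five times as many tokens as the references, so most of their n-grams have nothing to match.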


# Task \#2: Image-Text Matching



In [None]:
#@title Helper functions to calculate image text matching
def calculate_matching_score_proportions_pos_above_under(x, matching_score_type):
    positive = [xi for xi in x if xi['label'] == POSITIVE]
    under_specified = [xi for xi in x if xi['label'] == UNDER_SPECIFIED]
    positive_scores = [xi[matching_score_type] for xi in positive]
    under_specified_scores = [xi[matching_score_type] for xi in under_specified]
    max_under_specified = max(under_specified_scores)
    # A positive caption counts as correct if it scores above every under-specified caption
    preds_lst = [s > max_under_specified for s in positive_scores]
    matching_pos_above_under_specified_proportion = np.mean(preds_lst)
    return matching_pos_above_under_specified_proportion

def get_matching_averages(x, matching_score_type):
    positive = [xi for xi in x if xi['label'] == POSITIVE]
    under_specified = [xi for xi in x if xi['label'] == UNDER_SPECIFIED]
    positive_scores = [xi[matching_score_type] for xi in positive]
    under_specified_scores = [xi[matching_score_type] for xi in under_specified]
    return [np.mean(positive_scores), np.mean(under_specified_scores)]

In [None]:
df[MATCHING_SCORE_ITM] = df[TASK_MATCHING].apply(lambda x: calculate_matching_score_proportions_pos_above_under(x, ITM_SCORE))
all_results_dict["Matching"] = round(df[MATCHING_SCORE_ITM].mean() * 100, 2)
print(f"matching score ITM : {round(df[MATCHING_SCORE_ITM].mean() * 100, 2)}")

matching_score_itm_wasserstein = round(df[MATCHING_SCORE_ITM].mean() * 100, 2)
matching_itm_avg = f"{MATCHING_SCORE_ITM}_avg"
df[matching_itm_avg] = df[TASK_MATCHING].apply(lambda x: get_matching_averages(x, ITM_SCORE))
print(f"matching score ITM averages (positive, under-described): {round(df[matching_itm_avg].apply(lambda x: x[0]).mean() * 100, 2), round(df[matching_itm_avg].apply(lambda x: x[1]).mean() * 100, 2)}")

matching score ITM : 76.36
matching score ITM averages (positive, under-described): (97.92, 79.05)
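
The matching score is computed per image as the share of positive captions whose ITM score is higher than the best-scoring under-specified caption, and then averaged over images. A small worked example on made-up scores, using the same `label` and `itm_score` keys as the helper functions above; `pos_above_under` is a shortened, hypothetical name for the same logic.

```python
import numpy as np

def pos_above_under(items, score_key="itm_score"):
    """Share of positive captions scoring above the best under-specified caption."""
    positive = [i[score_key] for i in items if i["label"] == "positive"]
    under = [i[score_key] for i in items if i["label"] == "under_specified"]
    best_under = max(under)
    return float(np.mean([s > best_under for s in positive]))

# Made-up ITM scores for a single image
image_items = [
    {"label": "positive", "itm_score": 0.9},
    {"label": "positive", "itm_score": 0.6},
    {"label": "under_specified", "itm_score": 0.7},
    {"label": "under_specified", "itm_score": 0.3},
]

# Only the 0.9 caption beats the best under-specified score of 0.7
print(pos_above_under(image_items))
```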


# Task \#3: Visual Question Answering (VQA)

## VQA: Exact Match Calculation

In [None]:
whoops_vqa_exact_match = df[TASK_VQA].apply(
    lambda lst: [x[CANDIDATE_KEY_QA_DICT].lower() == x[REFERENCE_KEY_QA_DICT].lower() for x in lst])
L_exact_match = []
for lst in whoops_vqa_exact_match:
    L_exact_match += lst

all_results_dict["VQA_exact_match"] = round(np.mean(L_exact_match) * 100, 2)
print(f"% Exact match: {round(np.mean(L_exact_match) * 100, 2)}")

% Exact match: 5.56
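
Exact match is strict: after lowercasing, the predicted answer must equal the reference character for character, so paraphrases and longer answers count as wrong. That is one reason the score is low and why BEM is reported next to it. A minimal sketch on made-up QA pairs, laid out like the TASK_VQA column (one list of QA dicts per image):

```python
# Made-up QA pairs; keys follow REFERENCE_KEY_QA_DICT and CANDIDATE_KEY_QA_DICT
vqa_column = [
    [
        {"ground_truth_answer": "Dog", "predicted_answer": "dog"},
        {"ground_truth_answer": "red", "predicted_answer": "a red car"},
    ],
    [
        {"ground_truth_answer": "two", "predicted_answer": "2"},
    ],
]

# Flatten all QA pairs across images and compare case-insensitively
matches = [
    qa["predicted_answer"].lower() == qa["ground_truth_answer"].lower()
    for qa_list in vqa_column
    for qa in qa_list
]

exact_match = round(sum(matches) / len(matches) * 100, 2)
print(matches)
print(exact_match)
```

Only the first pair matches; "a red car" and "2" are reasonable answers that exact match still rejects.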


## VQA: BEM Calculation

In [None]:
#@title Sets up the BERT tokenizer using tf-text + helper functions for BEM calculation
VOCAB_PATH = 'gs://cloud-tpu-checkpoints/bert/keras_bert/uncased_L-12_H-768_A-12/vocab.txt'

vocab_table = tf.lookup.StaticVocabularyTable(
        tf.lookup.TextFileInitializer(
            filename=VOCAB_PATH,
            key_dtype=tf.string,
            key_index=tf.lookup.TextFileIndex.WHOLE_LINE,
            value_dtype=tf.int64,
            value_index=tf.lookup.TextFileIndex.LINE_NUMBER
        ),
        num_oov_buckets=1)
cls_id, sep_id = vocab_table.lookup(tf.convert_to_tensor(['[CLS]', '[SEP]']))
tokenizer = text.BertTokenizer(vocab_lookup_table=vocab_table,
                               token_out_type=tf.int64,
                               preserve_unused_token=True,
                               lower_case=True)

def bertify_example(example):
  question = tokenizer.tokenize(example[QUESTION_KEY_QA_DICT]).merge_dims(1, 2)
  reference = tokenizer.tokenize(example[REFERENCE_KEY_QA_DICT]).merge_dims(1, 2)
  candidate = tokenizer.tokenize(example[CANDIDATE_KEY_QA_DICT]).merge_dims(1, 2)

  input_ids, segment_ids = text.combine_segments(
      (candidate, reference, question), cls_id, sep_id)

  return {'input_ids': input_ids.numpy(), 'segment_ids': segment_ids.numpy()}


def pad(a, length=512):
  return np.append(a, np.zeros(length - a.shape[-1], np.int32))


def bertify_examples(examples):
  input_ids = []
  segment_ids = []
  for example in examples:
    example_inputs = bertify_example(example)
    input_ids.append(pad(example_inputs['input_ids']))
    segment_ids.append(pad(example_inputs['segment_ids']))

  return {'input_ids': np.stack(input_ids), 'segment_ids': np.stack(segment_ids)}
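
The `pad` helper above right-pads each id sequence with zeros to a fixed length so that examples can be stacked into one array. A short illustration with made-up token ids and a small `length` for readability:

```python
import numpy as np

def pad(a, length=512):
  return np.append(a, np.zeros(length - a.shape[-1], np.int32))

token_ids = np.array([[101, 2023, 102]])  # made-up ids, shape (1, 3)
padded = pad(token_ids, length=8)

# np.append without an axis flattens its inputs, so the result is 1-D
print(padded.shape)
print(padded.tolist())
```

Inputs longer than `length` are not truncated: `np.zeros` would be called with a negative size and raise a `ValueError`, so very long question/answer triples would need to be shortened beforehand.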

In [None]:
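# Load BEM model (tensorflow_hub is imported here because its import is commented out in the first cell).
import tensorflow_hub as hub
bem = hub.load('https://tfhub.dev/google/answer_equivalence/bem/1')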

bem_scores = []

print("Preprocessing QA examples..")
all_examples = df[TASK_VQA].apply(bertify_examples)

print("Calculating BEM scores..")
for inputs in tqdm(all_examples):
  # The outputs are raw logits.
  raw_outputs = bem(inputs)

  # They can be transformed into a classification 'probability' like so:
  softmax_score = list(softmax(raw_outputs, axis=1)[:, 1])
  bem_scores.append(np.mean(softmax_score))


all_results_dict["VQA_BEM"] = round(np.mean(bem_scores) * 100, 2)
print(f'\nBEM score: {round(np.mean(bem_scores) * 100, 2)}')

Preprocessing QA examples..
Calculating BEM scores..


100%|██████████| 494/494 [00:00<00:00, 449552.22it/s]


BEM score: 41.28
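
The BEM score above is a mean of means: for each image, the raw logits are turned into probabilities with a row-wise softmax, column 1 is kept (assumed here to be the "equivalent" class), the values are averaged over that image's questions, and the per-image averages are averaged again. The sketch below shows the softmax step with plain NumPy on made-up logits; it is meant to mirror what `scipy.special.softmax(raw_outputs, axis=1)` does in the cell above.

```python
import numpy as np

def softmax_rows(logits):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

# Made-up logits for two QA pairs of one image: columns are [class 0, class 1]
raw_outputs = np.array([[0.0, 0.0], [-1.0, 1.0]])

equivalent_prob = softmax_rows(raw_outputs)[:, 1]  # probability of class 1 per QA pair
image_score = float(np.mean(equivalent_prob))      # per-image BEM score

print(equivalent_prob)
print(image_score)
```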





# Get Final Results Aggregated:

In [None]:
pd.DataFrame(all_results_dict, index=[file_input_path])

| | Bleu_4 | CIDEr | Matching | VQA_exact_match | VQA_BEM |
|---|---|---|---|---|---|
| whoops_dataset_for_eval.csv | 12.39 | 64.54 | 76.36 | 5.56 | 41.28 |
