# GPT4Vision Evaluations on Winoground

Author: Emily Li

### Load results from logs

Provide the path to res_evaluations_log.txt to retrieve:
1) List of IDs
2) Winoground metrics (image/text/group score) for all ids
3) Failure case ids for each score
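
For reference, the three Winoground metrics can be derived from the four image-caption match scores of one sample. The sketch below follows the standard benchmark definitions; the function names and the `(caption, image)` dictionary layout are illustrative assumptions, not the API of `evaluation.py`.

```python
def text_score(s):
    """1 if, for each image, the matching caption scores strictly higher."""
    return int(s[("c0", "i0")] > s[("c1", "i0")] and s[("c1", "i1")] > s[("c0", "i1")])

def image_score(s):
    """1 if, for each caption, the matching image scores strictly higher."""
    return int(s[("c0", "i0")] > s[("c0", "i1")] and s[("c1", "i1")] > s[("c1", "i0")])

def group_score(s):
    """1 only if both the text score and the image score are 1."""
    return int(text_score(s) and image_score(s))

# Illustrative scores: caption_1 ties on image_1, so the text score fails.
example = {("c0", "i0"): 0.9, ("c1", "i0"): 0.2, ("c0", "i1"): 0.4, ("c1", "i1"): 0.4}
print(text_score(example), image_score(example), group_score(example))  # 0 1 0
```

Because the comparisons are strict, a tie counts as a failure, which matters when the model returns coarse scores.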

Note: if the log contains duplicate results for an ID, only the first occurrence is used.
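
A minimal sketch of that keep-first rule, assuming each log entry can be reduced to an `(id, result)` pair. `keep_first_occurrence` is a hypothetical helper, not the code used inside `process_results`.

```python
def keep_first_occurrence(records):
    """Return records with duplicate IDs dropped, keeping the first one seen."""
    seen = set()
    unique = []
    for sample_id, result in records:
        if sample_id in seen:
            continue  # a later duplicate of an ID we already have
        seen.add(sample_id)
        unique.append((sample_id, result))
    return unique

records = [(0, "first"), (1, "a"), (0, "second")]
print(keep_first_occurrence(records))  # [(0, 'first'), (1, 'a')]
```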

In [3]:
from evaluation import GPT4V_Winoground_Evals as evals

# Process result & calculate accuracy scores. 
# Example:
log_file_path = "./logs/raw-prob-log/res_evaluations_log.txt" # Provide path to res_evaluations_log.txt
raw_prob_id, raw_prob_scores, raw_prob_fc = evals.process_results("raw probability", log_file_path) # Process results

log_file_path = "./logs/prob-log/res_evaluations_log.txt"
prob_id, prob_scores, prob_fc = evals.process_results("probability", log_file_path)

Dup found! ID 0. skipping...
Experiment: raw probability - Total samples: 400
 Image Score: 129 --> 0.323
 Text Score: 147 --> 0.367
 Group Score: 114 --> 0.285
Experiment: probability - Total samples: 400
 Image Score: 158 --> 0.395
 Text Score: 170 --> 0.425
 Group Score: 131 --> 0.328


In [4]:
print("raw_prob_id: ", raw_prob_id)
print("raw_prob_scores: ", raw_prob_scores)
print("raw_prob_fc: ", raw_prob_fc)

raw_prob_id:  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 

### Winoground evaluations

Optional: sample a subset of indices to evaluate

In [6]:
# Optional: randomly sample indices to evaluate (2 here; raise amt_to_sample for a larger subset)
amt_to_sample = 2
sample_indices = evals.randomly_sample(amt_to_sample)
print(f"Sampled {len(sample_indices)} indices:", sample_indices)

Sampled 2 indices: [57, 65]
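
One way such a sampler could work is sketched below with the standard library. It assumes the 400 Winoground samples reported in the logs above; `sample_ids` and its `seed` parameter are hypothetical and are not how `evals.randomly_sample` is necessarily implemented.

```python
import random

NUM_WINOGROUND_SAMPLES = 400  # matches "Total samples: 400" in the logs above

def sample_ids(amount, seed=None):
    """Draw `amount` distinct sample ids; pass a seed for a reproducible subset."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(NUM_WINOGROUND_SAMPLES), amount))

print(sample_ids(2, seed=0))
```

Fixing the seed makes it possible to rerun different prompts on exactly the same subset.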


Write your own prompt functions, or import existing ones from prompts.py.

In [5]:
from prompts import get_prob_num_prompts # Most of the prompts are stored in prompts.py

# Example prompts. Each function returns a (system message, user message) pair.
# Note: the definition below shadows the get_prob_num_prompts imported above.
def get_prob_num_prompts():
    system_msg = "You are a helpful image data analyst and linguist. Answer the user’s questions accurately and concisely."
    user_msg = "First describe what is happening in the image. Then, calculate a similarity score in the format of the probability of a match between the provided image and the caption '{img_caption}'. Explain the rationale \n" 
    user_msg += "Provide your evaluation in JSON format, including keys for 'score', 'description', and 'reasoning'."
    return system_msg, user_msg

def get_num_score_prompts_b():
    """
    Baseline from GPT-4V(ision) as a Generalist Evaluator for
    Vision-Language Tasks
    """
    system_msg = ""
    user_msg = "Evaluate whether a given caption accurately represents the main content and objects of an associated image with scale 1-100. \n Caption: '{img_caption}' \n"
    user_msg += "Provide your evaluation in JSON format, including keys for 'score' and 'reasoning'."
    return system_msg, user_msg

def get_raw_prob_num_prompts_b(output_json=True):
    '''Variant without the "describe the image" step and without a system prompt.'''
    system_msg = ""
    user_msg = "Evaluate whether a given caption accurately represents the main content and objects of an associated image with probability between 0-1. \n Caption: '{img_caption}' \n"
    # user_msg = "Calculate a similarity score in the format of the probability of a match between the provided image and the caption '{img_caption}'. \n" 
    if output_json:
        user_msg += "Provide your evaluation in JSON format, including keys for 'score', and 'reasoning'. "
    return system_msg, user_msg

def get():
    system_msg = ""
    user_msg = "Select whether the image matches the caption. Pay close attention to the word order. (Give a short explanation first, then change to a new line give the final answer in the exact format of: \"The answer is yes/no.\"))\n"
    user_msg += "Provide your evaluation in JSON format, including keys for 'score' and 'reasoning'."
    return system_msg, user_msg

def official_gpt4eval_baseline_prompt():
    system_msg = ""
    user_msg = "Your task is to evaluate whether a given text caption accurately represents the main content and objects of an associated image. While the caption need not describe every detail of the image, it should convey the overall theme or subject. After your evaluation, rate the quality of the text caption’s match to the image on a scale of 1-100, with 100 being a perfect match. Caption: '{img_caption}' \n"
    user_msg += "Provide your evaluation in JSON format, including keys for 'score' and 'reasoning'."
    return system_msg, user_msg
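
Most of these prompts carry an `{img_caption}` placeholder and ask for a JSON answer with a `'score'` key. The sketch below shows one way the placeholder could be filled and the reply parsed. `fill_prompt` and `parse_reply` are hypothetical helpers, the caption and the reply string are made up for illustration, and the evaluation code may handle both steps differently.

```python
import json
import re

def fill_prompt(user_msg, caption):
    # str.format replaces the {img_caption} placeholder; a template with other
    # literal braces would need them doubled ({{ and }}).
    return user_msg.format(img_caption=caption)

def parse_reply(reply):
    """Parse the outermost brace-delimited span of a reply as JSON."""
    match = re.search(r"\{.*\}", reply, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in reply")
    data = json.loads(match.group(0))
    return float(data["score"]), data.get("reasoning", "")

template = "Caption: '{img_caption}' \nProvide your evaluation in JSON format, including keys for 'score' and 'reasoning'."
print(fill_prompt(template, "a dog chasing a cat"))  # made-up caption

reply = 'Here is my evaluation: {"score": 85, "reasoning": "The caption matches the scene."}'  # made-up reply
print(parse_reply(reply))  # (85.0, 'The caption matches the scene.')
```

A parser like this could be passed as a post-processing step, since models often wrap the JSON in extra text.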


In [9]:
import os

from evaluation import GPT4V_Winoground_Evals as evals

# Arguments
master_folder_path = "./"
save_folder_name = "logs/official_gpt4eval_baseline-log/"       # Log folder path (change me)
system_msg, user_msg = official_gpt4eval_baseline_prompt()      # Defines prompt (change me)
openai_api_key = os.environ["OPENAI_API_KEY"]                   # Read the key from the environment; never commit API keys

id_list = list(range(317, 400))                         # List of winoground ids to evaluate (change me)

# Define evaluator & Run evaluation
evaluator = evals(master_folder_path, save_folder_name, openai_api_key, 
                  post_processing_fn=None, # Post processing function to apply to the generated text. Default: None
                  system_prompt=system_msg, user_prompt=user_msg,
                  api_max_retries=2, # Number of times to retry API call before giving up
                  )
evaluator.reset_api_key(openai_api_key) # Ensure that the correct API key is used
evaluator.evaluate_winoground_gpt4v(id_list) 


Saving to folder: ./logs/official_gpt4eval_baseline-log/
Sample 317... Finished i1c1, i1c2, i2c1, i2c2, Text score: 0, Image score: 0, Group score: 0, Took 33.072 seconds!
Sample 318... Finished i1c1, i1c2, i2c1, i2c2, Text score: 1, Image score: 1, Group score: 1, Took 31.996 seconds!
Sample 319... Finished i1c1, i1c2, i2c1, i2c2, Text score: 0, Image score: 0, Group score: 0, Took 40.7 seconds!
Sample 320... Finished i1c1, i1c2, i2c1, i2c2, Text score: 0, Image score: 0, Group score: 0, Took 24.928 seconds!
Sample 321... Finished i1c1, i1c2, i2c1, i2c2, Text score: 1, Image score: 1, Group score: 1, Took 30.132 seconds!
Sample 322... Finished i1c1, i1c2, i2c1, i2c2, Text score: 1, Image score: 1, Group score: 1, Took 36.786 seconds!
Sample 323... Finished i1c1, i1c2, i2c1, i2c2, Text score: 0, Image score: 0, Group score: 0, Took 24.264 seconds!
Sample 324... Finished i1c1, i1c2, i2c1, i2c2, Text score: 1, Image score: 1, Group score: 1, Took 27.507 seconds!
Sample 325... Finished i1
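
The progress lines above can also be tallied directly, which is handy for checking a partial run before the full log is processed. This is a sketch that assumes the line format printed above; `tally` is a hypothetical helper and not part of `evaluation.py`.

```python
import re

LINE_RE = re.compile(
    r"Sample (\d+)\.\.\..*?Text score: (\d), Image score: (\d), Group score: (\d)"
)

def tally(lines):
    """Sum text/image/group scores over complete progress lines."""
    totals = {"text": 0, "image": 0, "group": 0, "samples": 0}
    for line in lines:
        m = LINE_RE.search(line)
        if m is None:
            continue  # skip truncated or unrelated lines
        totals["text"] += int(m.group(2))
        totals["image"] += int(m.group(3))
        totals["group"] += int(m.group(4))
        totals["samples"] += 1
    return totals

lines = [
    "Sample 317... Finished i1c1, i1c2, i2c1, i2c2, Text score: 0, Image score: 0, Group score: 0, Took 33.072 seconds!",
    "Sample 318... Finished i1c1, i1c2, i2c1, i2c2, Text score: 1, Image score: 1, Group score: 1, Took 31.996 seconds!",
    "Sample 325... Finished i1",  # truncated line, ignored
]
print(tally(lines))  # {'text': 1, 'image': 1, 'group': 1, 'samples': 2}
```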

Let's see our results!

In [16]:
import os

log_path = os.path.join(save_folder_name, "res_evaluations_log.txt") # Provide path to res_evaluations_log.txt
ids, raw_scores, fc = evals.process_results("num 1-100 score", log_path) # Process results

print("ids: ", ids)
print("raw_scores: ", raw_scores)
print("fc: ", fc)

Experiment: num 1-100 score - Total samples: 400
 Image Score: 173 --> 0.432
 Text Score: 179 --> 0.448
 Group Score: 149 --> 0.372
ids:  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 1
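
To compare the three prompting strategies side by side, the counts printed in this notebook can be turned into accuracies over the 400 samples. The sketch below hard-codes those counts from the outputs above; the dictionary layout is an illustrative choice.

```python
TOTAL = 400  # "Total samples: 400" in every experiment above

# Counts copied from the process_results outputs in this notebook.
counts = {
    "raw probability": {"image": 129, "text": 147, "group": 114},
    "probability": {"image": 158, "text": 170, "group": 131},
    "num 1-100 score": {"image": 173, "text": 179, "group": 149},
}

for name, c in counts.items():
    summary = ", ".join(f"{metric}={n / TOTAL:.1%}" for metric, n in c.items())
    print(f"{name:>16}: {summary}")

best = max(counts, key=lambda name: counts[name]["group"])
print("Best group score:", best)  # Best group score: num 1-100 score
```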

In [None]:
from matplotlib import pyplot as plt

winoground = evals.dataset

def show_example(idx):
    """Show both images of a Winoground sample side by side and print its captions."""
    # Two panels, one per image (1 row x 2 columns)
    ax1 = plt.subplot(1, 2, 1)
    ax1.title.set_text('image_0')
    plt.imshow(winoground[idx]["image_0"].convert("RGB"))

    ax2 = plt.subplot(1, 2, 2)
    ax2.title.set_text('image_1')
    plt.imshow(winoground[idx]["image_1"].convert("RGB"))

    plt.show()

    print("caption_0:", winoground[idx]["caption_0"])
    print("caption_1:", winoground[idx]["caption_1"])