# GPT4Vision Evaluations on Winoground

Author: Emily Li

### Load results from logs

Provide the path to res_evaluations_log.txt to retrieve:
1) List of IDs
2) Winoground metrics (image/text/group score) for all ids
3) Failure case ids for each score

Note: if an ID has duplicate results, only its first occurrence is used.
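
For reference, the three Winoground metrics can be sketched as below. This is a minimal illustration of the standard metric definitions, not the code in evaluation.py: the function name `winoground_scores` is hypothetical, and the sketch assumes each sample has one numeric match score per image-caption pair, keyed as i1c1, i1c2, i2c1, i2c2 (the labels that appear in the evaluation log).

```python
def winoground_scores(s):
    """Compute Winoground metrics for one sample.

    `s` maps image-caption keys ("i1c1" = image 1 with caption 1, etc.)
    to a numeric match score. Hypothetical helper for illustration only.
    """
    # Text score: for each image, the matching caption must score higher.
    text_ok = s["i1c1"] > s["i1c2"] and s["i2c2"] > s["i2c1"]
    # Image score: for each caption, the matching image must score higher.
    image_ok = s["i1c1"] > s["i2c1"] and s["i2c2"] > s["i1c2"]
    # Group score: both conditions must hold.
    return {
        "image_score": int(image_ok),
        "text_score": int(text_ok),
        "group_score": int(text_ok and image_ok),
    }


example = {"i1c1": 0.9, "i1c2": 0.2, "i2c1": 0.1, "i2c2": 0.8}
print(winoground_scores(example))
```

Strict inequalities mean that ties count as failures, which matters when the model returns coarse scores.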

In [3]:
from evaluation import GPT4V_Winoground_Evals as evals

# Process result & calculate accuracy scores. 
# Example:
log_file_path = "./logs/raw-prob-log/res_evaluations_log.txt" # Provide path to res_evaluations_log.txt
raw_prob_id, raw_prob_scores, raw_prob_fc = evals.process_results("raw probability", log_file_path) # Process results

log_file_path = "./logs/prob-log/res_evaluations_log.txt"
prob_id, prob_scores, prob_fc = evals.process_results("probability", log_file_path)

Dup found! ID 0. skipping...
Experiment: raw probability - Total samples: 400
 Image Score: 129 --> 0.323
 Text Score: 147 --> 0.367
 Group Score: 114 --> 0.285
Experiment: probability - Total samples: 400
 Image Score: 158 --> 0.395
 Text Score: 170 --> 0.425
 Group Score: 131 --> 0.328
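
The "Dup found!" message above reflects the first-occurrence rule. A minimal sketch of that behaviour follows, assuming the log has already been parsed into (id, result) pairs in file order. The function name `keep_first_occurrence` and the pair representation are assumptions for illustration; the actual parsing in evaluation.py may differ.

```python
def keep_first_occurrence(records):
    """Return {sample_id: result}, keeping only the first result per ID.

    `records` is an iterable of (sample_id, result) pairs in log order.
    Hypothetical helper; not part of evaluation.py.
    """
    seen = {}
    for sample_id, result in records:
        if sample_id in seen:
            # A later duplicate never overwrites the earlier result.
            print(f"Dup found! ID {sample_id}. skipping...")
            continue
        seen[sample_id] = result
    return seen


records = [(0, "first run"), (1, "only run"), (0, "second run")]
deduped = keep_first_occurrence(records)
print(deduped)
```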


In [4]:
print("raw_prob_id: ", raw_prob_id)
print("raw_prob_scores: ", raw_prob_scores)
print("raw_prob_fc: ", raw_prob_fc)

raw_prob_id:  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 

### Winoground evaluations

Optional: sample a subset of indices to evaluate

In [6]:
# Optional: randomly sample a small number of indices (2 in this example)
amt_to_sample = 2
sample_indices = evals.randomly_sample(amt_to_sample)
print(f"Sampled {len(sample_indices)} indices:", sample_indices)

Sampled 2 indices: [57, 65]
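
If a reproducible subset is needed, sampling can also be done by hand. The sketch below assumes the IDs run from 0 to 399 (the runs above report 400 samples); `sample_ids` and its `seed` parameter are hypothetical and not part of `evals.randomly_sample`, whose internals are not shown here.

```python
import random

NUM_WINOGROUND_SAMPLES = 400  # assumption: IDs are 0..399


def sample_ids(k, seed=None):
    """Draw k distinct Winoground IDs without replacement (hypothetical helper)."""
    rng = random.Random(seed)  # local RNG, so the global random state is untouched
    return sorted(rng.sample(range(NUM_WINOGROUND_SAMPLES), k))


subset = sample_ids(5, seed=0)
print(f"Sampled {len(subset)} indices:", subset)
```

Fixing the seed makes it possible to evaluate different prompts on the same subset.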


Feel free to write additional prompting functions; otherwise, import existing prompts from prompts.py.

In [15]:
from prompts import get_prob_num_prompts # Most prompts are stored in prompts.py

# Example prompt function (note: this local definition shadows the import above).
# Returns the system message and the user message.
def get_prob_num_prompts():
    system_msg = "You are a helpful image data analyst and linguist. Answer the user’s questions accurately and concisely."
    user_msg = "First describe what is happening in the image. Then, calculate a similarity score in the format of the probability of a match between the provided image and the caption '{img_caption}'. Explain the rationale \n" 
    user_msg += "Provide your evaluation in JSON format, including keys for 'score', 'description', and 'reasoning'."
    return system_msg, user_msg

def get_num_score_prompts_b():
    """
    Baseline from GPT-4V(ision) as a Generalist Evaluator for
    Vision-Language Tasks
    """
    system_msg = ""
    user_msg = "Evaluate whether a given caption accurately represents the main content and objects of an associated image with scale 1-100."
    user_msg += "Provide your evaluation in JSON format, including keys for 'score' and 'reasoning'."
    return system_msg, user_msg
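
The first prompt above contains an `{img_caption}` placeholder. One plausible way to fill it per caption is sketched below; how the evaluator actually substitutes the caption is not shown in this notebook, so the helper `fill_caption`, the shortened template, and the example caption are all assumptions made for illustration.

```python
def fill_caption(user_msg_template, caption):
    """Insert a caption into a prompt template (hypothetical helper).

    Uses str.replace rather than str.format so that any literal braces
    in the template (e.g. a JSON example) are left untouched.
    """
    return user_msg_template.replace("{img_caption}", caption)


# Shortened stand-in for the user message, not the full prompt.
template = "Rate the match between the image and the caption '{img_caption}'."
filled = fill_caption(template, "a dog chasing a cat")
print(filled)
```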

In [9]:
from evaluation import GPT4V_Winoground_Evals as evals

# Arguments
master_folder_path = "./"
save_folder_name= "logs/1-100_score_baseline-log/"            # Log folder path (change me)
system_msg, user_msg = get_num_score_prompts_b()       # Defines prompt (change me)
openai_api_key = "<insert your OPENAI_API_KEY>"      # OpenAI API KEY (change me)

id_list = sample_indices                             # List of winoground ids to evaluate (change me)

# Define evaluator & Run evaluation
evaluator = evals(master_folder_path, save_folder_name, openai_api_key, 
                  post_processing_fn=None, # Post processing function to apply to the generated text. Default: None
                  system_prompt=system_msg, user_prompt=user_msg,
                  api_max_retries=2, # Number of times to retry API call before giving up
                  )
evaluator.reset_api_key(openai_api_key) # Ensure that the correct API key is used
evaluator.evaluate_winoground_gpt4v(id_list) 


Saving to folder: ./logs/example-log
Sample 57... Finished i1c1, i1c2, i2c1, i2c2, Text score: 1, Image score: 1, Group score: 1, Took 28.83 seconds!
Sample 65... Finished i1c1, i1c2, i2c1, i2c2, Text score: 1, Image score: 1, Group score: 1, Took 29.055 seconds!
Done - Overall failed_samples: []
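
Since the prompts ask for a JSON answer with a 'score' key, a post-processing step could extract that number from the generated text. The sketch below is one possible approach. The name `extract_score` is hypothetical, and the signature that `post_processing_fn` is expected to have is an assumption (text in, value out); check evaluation.py before plugging anything like this in.

```python
import json
import re


def extract_score(generated_text):
    """Pull the numeric 'score' out of a model reply, or return None.

    Hypothetical helper: assumes the reply contains one JSON object,
    possibly surrounded by extra prose or formatting.
    """
    # Grab everything from the first "{" to the last "}".
    match = re.search(r"\{.*\}", generated_text, flags=re.DOTALL)
    if match is None:
        return None
    try:
        payload = json.loads(match.group(0))
        return float(payload["score"])
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        # Malformed JSON, missing key, or non-numeric score.
        return None


reply = 'Here is my answer: {"score": 85, "reasoning": "The caption matches."}'
print(extract_score(reply))
```

Returning None on failure, rather than raising, lets the caller decide whether to retry the API call or record the sample as failed.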


Let's see our results!

In [14]:
import os

log_path = os.path.join(save_folder_name, "res_evaluations_log.txt") # Provide path to res_evaluations_log.txt
ids, raw_scores, fc = evals.process_results("prob", log_path) # Process results

print("ids: ", ids)
print("raw_scores: ", raw_scores)
print("fc: ", fc)

Experiment: prob - Total samples: 2
 Image Score: 2 --> 1.0
 Text Score: 2 --> 1.0
 Group Score: 2 --> 1.0
ids:  [57, 65]
raw_scores:  {57: {'image_score': 1, 'text_score': 1, 'group_score': 1}, 65: {'image_score': 1, 'text_score': 1, 'group_score': 1}}
fc:  {'text_score_fc': [], 'image_score_fc': [], 'group_score_fc': [], 'total_ids': [57, 65]}
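
The per-ID dictionary printed above is enough to recompute the aggregate numbers and the failure cases by hand. The sketch below assumes the structure shown in the `raw_scores` output (ID mapped to a dict of 0/1 metrics); `summarize_scores` is a hypothetical helper, not a function from evaluation.py.

```python
def summarize_scores(raw_scores):
    """Aggregate per-ID 0/1 metrics into counts, accuracies and failure IDs."""
    metrics = ("image_score", "text_score", "group_score")
    total = len(raw_scores)
    summary = {}
    for metric in metrics:
        correct = sum(result[metric] for result in raw_scores.values())
        failures = [i for i, result in raw_scores.items() if result[metric] == 0]
        summary[metric] = {
            "correct": correct,
            "accuracy": correct / total if total else 0.0,
            "failures": failures,
        }
    return summary


raw_scores = {
    57: {"image_score": 1, "text_score": 1, "group_score": 1},
    65: {"image_score": 1, "text_score": 1, "group_score": 1},
}
for metric, stats in summarize_scores(raw_scores).items():
    print(metric, stats)
```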
