### This notebook prepares the PVQA dataset to evaluate self-consistency using LLaVA-Med.

1. Read all the open-ended VQA questions
2. Prepare a jsonl file in which each question is repeated five times, since we need to ask LLaVA-Med each question five times
3. Run LLaVA-Med on the resulting file to get five answers for each question
4. Prepare a jsonl file to get the self-consistency output from the model, given the question and its five answers
5. Store the answer generated after self-consistency in a list of dictionaries holding the question, ground truth and predicted answer
6. Calculate the recall value using the ground truth and predicted answer

In [1]:
import pickle
import json
import numpy as np
from recall_calculation import recall_score

Reading the Test VQA file from PVQA dataset

In [2]:
pvqa_test_qas_file = "/data/mn27889/pvqa/qas/test/test_qa.pkl"

with open(pvqa_test_qas_file, 'rb') as file:
    pvqa_test_qa = pickle.load(file)

In [3]:
# Keep only the open-ended questions, i.e. drop every yes/no question
pvqa_qas_open = [qas for qas in pvqa_test_qa if qas['answer'] not in ('yes', 'no')]
# pvqa_ques_open = [qas['question'] for qas in pvqa_qas_open]
# pvqa_ans_open = [qas['answer'] for qas in pvqa_qas_open]
# pvqa_img_open = [qas['image'] + '.jpg' for qas in pvqa_qas_open]

Preparing the questions as dictionaries to be written to a `jsonl` file for LLaVA-Med. As part of self-consistency, each question has to be asked to LLaVA-Med five times, so each one is added five times with its own `question_id`

In [4]:
question = []
idx = 0
for qas in pvqa_qas_open:
    for _ in range(5):
        question.append({"question_id": idx, "image": qas['image'] + '.jpg', "text": qas['question'] + "\n<image>"})
        idx += 1
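
To make the record layout concrete, here is a minimal sketch that builds the same repeated entries for a toy list. The helper `build_repeated_queries` and the sample entries are hypothetical and only illustrate the shape of the data.

```python
def build_repeated_queries(qas_list, n_repeats=5):
    """Return one query record per (question, repeat) pair with sequential ids."""
    queries = []
    for qas in qas_list:
        for _ in range(n_repeats):
            queries.append({
                "question_id": len(queries),  # ids run 0, 1, 2, ... across all repeats
                "image": qas["image"] + ".jpg",
                "text": qas["question"] + "\n<image>",
            })
    return queries


# Toy entries (made up for illustration) with the same keys as the PVQA records
toy_qas = [
    {"image": "img_a", "question": "What is present?", "answer": "lung"},
    {"image": "img_b", "question": "Where is this?", "answer": "liver"},
]
toy_queries = build_repeated_queries(toy_qas)
print(len(toy_queries))  # 10
```

Records `5*k` to `5*k + 4` always belong to question `k`, which is the layout the later cells rely on when grouping the answers.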

Putting the question dictionary into a jsonl file with each question separated by a new line

In [5]:
file_path = '/data/mn27889/LLaVA-Med/pvqa_data/query_files/que_pvqa_open.jsonl'

# Writing each dictionary as a JSON object on a new line
with open(file_path, 'w') as file:
    for query in question:
        json_line = json.dumps(query)  # Convert dictionary to JSON string
        file.write(json_line + '\n')

MIATA has 8 GPU devices (checked using `nvidia-smi`), so the LLaVA-Med evaluation can run in parallel: divide the questions into 8 jsonl files, one per GPU.

Using the `split` utility to divide the `que_pvqa_open.jsonl` file into 8 separate files so that each of the 8 GPU devices can be used

In [6]:
!split --number=l/8 --additional-suffix=.jsonl pvqa_data/query_files/que_pvqa_open.jsonl  pvqa_data/query_files/que_pvqa_open_
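
`split --number=l/8` divides the file into parts of roughly equal byte size without breaking a line, so a part can end in the middle of a group of five repeats. That is harmless here because the answer files are concatenated in order afterwards. If each part should hold whole groups only, a hypothetical Python helper such as `split_into_chunks` below could be used instead; this is a sketch, not what the notebook ran.

```python
def split_into_chunks(lines, n_chunks, group_size=5):
    """Split lines into n_chunks contiguous parts without breaking a group of group_size lines."""
    groups = [lines[i:i + group_size] for i in range(0, len(lines), group_size)]
    base, extra = divmod(len(groups), n_chunks)
    chunks, start = [], 0
    for k in range(n_chunks):
        size = base + (1 if k < extra else 0)  # the first `extra` chunks get one more group
        chunks.append([line for group in groups[start:start + size] for line in group])
        start += size
    return chunks


demo_lines = [f"line {i}" for i in range(50)]  # 10 groups of 5 lines
demo_chunks = split_into_chunks(demo_lines, n_chunks=8)
print([len(chunk) for chunk in demo_chunks])  # [10, 10, 5, 5, 5, 5, 5, 5]
```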

Run the following commands in a terminal as background processes, one per GPU. Each command differs only in the GPU number and in the question/answer/log filenames

In [None]:
CUDA_VISIBLE_DEVICES=0 nohup python llava/eval/model_vqa.py \
--model-name /data/mn27889/.cache/huggingface/hub/llava_med_pvqa \
--question-file /data/mn27889/LLaVA-Med/pvqa_data/query_files/que_pvqa_open_aa.jsonl \
--image-folder /data/mn27889/pvqa/images/test \
--answers-file /data/mn27889/LLaVA-Med/pvqa_data/answer_files/ans_pvqa_open_aa.jsonl > /data/mn27889/LLaVA-Med/pvqa_data/logs/log_pvqa_open_aa.log &

In [None]:
CUDA_VISIBLE_DEVICES=1 nohup python llava/eval/model_vqa.py \
--model-name /data/mn27889/.cache/huggingface/hub/llava_med_pvqa \
--question-file /data/mn27889/LLaVA-Med/pvqa_data/query_files/que_pvqa_open_ab.jsonl \
--image-folder /data/mn27889/pvqa/images/test \
--answers-file /data/mn27889/LLaVA-Med/pvqa_data/answer_files/ans_pvqa_open_ab.jsonl > /data/mn27889/LLaVA-Med/pvqa_data/logs/log_pvqa_open_ab.log &

In [None]:
CUDA_VISIBLE_DEVICES=2 nohup python llava/eval/model_vqa.py \
--model-name /data/mn27889/.cache/huggingface/hub/llava_med_pvqa \
--question-file /data/mn27889/LLaVA-Med/pvqa_data/query_files/que_pvqa_open_ac.jsonl \
--image-folder /data/mn27889/pvqa/images/test \
--answers-file /data/mn27889/LLaVA-Med/pvqa_data/answer_files/ans_pvqa_open_ac.jsonl > /data/mn27889/LLaVA-Med/pvqa_data/logs/log_pvqa_open_ac.log &

In [None]:
CUDA_VISIBLE_DEVICES=3 nohup python llava/eval/model_vqa.py \
--model-name /data/mn27889/.cache/huggingface/hub/llava_med_pvqa \
--question-file /data/mn27889/LLaVA-Med/pvqa_data/query_files/que_pvqa_open_ad.jsonl \
--image-folder /data/mn27889/pvqa/images/test \
--answers-file /data/mn27889/LLaVA-Med/pvqa_data/answer_files/ans_pvqa_open_ad.jsonl > /data/mn27889/LLaVA-Med/pvqa_data/logs/log_pvqa_open_ad.log &

In [None]:
CUDA_VISIBLE_DEVICES=4 nohup python llava/eval/model_vqa.py \
--model-name /data/mn27889/.cache/huggingface/hub/llava_med_pvqa \
--question-file /data/mn27889/LLaVA-Med/pvqa_data/query_files/que_pvqa_open_ae.jsonl \
--image-folder /data/mn27889/pvqa/images/test \
--answers-file /data/mn27889/LLaVA-Med/pvqa_data/answer_files/ans_pvqa_open_ae.jsonl > /data/mn27889/LLaVA-Med/pvqa_data/logs/log_pvqa_open_ae.log &

In [None]:
CUDA_VISIBLE_DEVICES=5 nohup python llava/eval/model_vqa.py \
--model-name /data/mn27889/.cache/huggingface/hub/llava_med_pvqa \
--question-file /data/mn27889/LLaVA-Med/pvqa_data/query_files/que_pvqa_open_af.jsonl \
--image-folder /data/mn27889/pvqa/images/test \
--answers-file /data/mn27889/LLaVA-Med/pvqa_data/answer_files/ans_pvqa_open_af.jsonl > /data/mn27889/LLaVA-Med/pvqa_data/logs/log_pvqa_open_af.log &

In [None]:
CUDA_VISIBLE_DEVICES=6 nohup python llava/eval/model_vqa.py \
--model-name /data/mn27889/.cache/huggingface/hub/llava_med_pvqa \
--question-file /data/mn27889/LLaVA-Med/pvqa_data/query_files/que_pvqa_open_ag.jsonl \
--image-folder /data/mn27889/pvqa/images/test \
--answers-file /data/mn27889/LLaVA-Med/pvqa_data/answer_files/ans_pvqa_open_ag.jsonl > /data/mn27889/LLaVA-Med/pvqa_data/logs/log_pvqa_open_ag.log &

In [None]:
CUDA_VISIBLE_DEVICES=7 nohup python llava/eval/model_vqa.py \
--model-name /data/mn27889/.cache/huggingface/hub/llava_med_pvqa \
--question-file /data/mn27889/LLaVA-Med/pvqa_data/query_files/que_pvqa_open_ah.jsonl \
--image-folder /data/mn27889/pvqa/images/test \
--answers-file /data/mn27889/LLaVA-Med/pvqa_data/answer_files/ans_pvqa_open_ah.jsonl > /data/mn27889/LLaVA-Med/pvqa_data/logs/log_pvqa_open_ah.log &
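
Since the eight commands differ only in the GPU index and the two-letter suffix, they can also be generated rather than edited by hand. The sketch below only builds and prints the command strings; `build_commands` is a hypothetical helper, while the paths and flags are copied from the commands above.

```python
import string

# Paths copied from the commands above
ROOT = "/data/mn27889/LLaVA-Med/pvqa_data"
MODEL = "/data/mn27889/.cache/huggingface/hub/llava_med_pvqa"
IMAGES = "/data/mn27889/pvqa/images/test"


def build_commands(n_gpus=8):
    """Return one background command per GPU, using the suffixes aa, ab, ... produced by split."""
    commands = []
    for gpu in range(n_gpus):
        suffix = "a" + string.ascii_lowercase[gpu]
        commands.append(
            f"CUDA_VISIBLE_DEVICES={gpu} nohup python llava/eval/model_vqa.py "
            f"--model-name {MODEL} "
            f"--question-file {ROOT}/query_files/que_pvqa_open_{suffix}.jsonl "
            f"--image-folder {IMAGES} "
            f"--answers-file {ROOT}/answer_files/ans_pvqa_open_{suffix}.jsonl "
            f"> {ROOT}/logs/log_pvqa_open_{suffix}.log &"
        )
    return commands


commands = build_commands()
for command in commands:
    print(command)
```

The printed lines can be pasted into a terminal opened in the LLaVA-Med repository root, which is where the relative path `llava/eval/model_vqa.py` resolves.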

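Now combine all the individual answer `jsonl` files into a single file. The shell expands `a*` in alphabetical order (`aa` to `ah`), so the answers keep the order of the questions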

In [10]:
!cat pvqa_data/answer_files/ans_pvqa_open_a*.jsonl > pvqa_data/answer_files/ans_pvqa_open.jsonl
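
The later cells assume that the merged file holds exactly five answers per question, in the original order. A minimal check of that assumption is sketched below; `check_merged_answers` is a hypothetical helper, and it assumes each answer record carries the `question_id` of its query.

```python
def check_merged_answers(answers, n_questions, n_repeats=5):
    """Raise ValueError unless answers has n_questions * n_repeats records with ids 0, 1, 2, ..."""
    expected = n_questions * n_repeats
    if len(answers) != expected:
        raise ValueError(f"expected {expected} answers, found {len(answers)}")
    ids = [answer["question_id"] for answer in answers]
    if ids != list(range(expected)):
        raise ValueError("answers are out of order or some question_ids are missing")
    return True


# Toy records standing in for the parsed lines of the merged answer file
demo_answers = [{"question_id": i, "text": f"answer {i}"} for i in range(10)]
print(check_merged_answers(demo_answers, n_questions=2))  # True
```

A failed GPU job or a partially written answer file would show up here as a count mismatch instead of silently shifting every later group of five.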

Now read the answer file and build the `jsonl` file for the self-consistency check, which asks the model to score the five answers to each question so that the best one can be selected

In [4]:
ans_pvqa_open_file = "/data/mn27889/LLaVA-Med/pvqa_data/answer_files/ans_pvqa_open.jsonl"
ans_pvqa = []

with open(ans_pvqa_open_file, "r") as file:
    for line in file:
        ans_pvqa.append(json.loads(line))

Adding the image information to each dict object in `ans_pvqa`

In [5]:
for i in range(0, len(pvqa_qas_open)):
    for j in range(0,5):
        ans_pvqa[i*5 + j]['image'] = pvqa_qas_open[i]['image'] + '.jpg'

Forming the query list containing the question, its five answers and the image information

In [6]:
base_text = '''Review the user question along with the image and corresponding five responses using the additive 5-point scoring system described below. Points are accumulated based on the satisfaction of each criterion:

- Add 1 point if the response is relevant and provides some information related to the pathology, even if it is incomplete or contains some irrelevant content.
- Add another point if the response addresses a substantial portion of the user question with some pathology connection, but does not completely resolve the query or provide a direct answer.
- Award a third point if the response answers the basic elements of the user question in connection with pathology, regardless of whether it seems to have been written by an AI Assistant or if it has elements typically found in blogs or search results.
- Grant a fourth point if the response is clearly written from an AI Assistant perspective, addressing the user question directly and comprehensively, and is medically consistent in terms of pathology.
- Bestow a fifth point for a response that is impeccably tailored to the user question by an AI Assistant, without extraneous information, reflecting expert knowledge about pathology, and demonstrating a high-quality, engaging, and insightful answer.

Question: {ques}

Response 1: {ans1}
Response 2: {ans2}
Response 3: {ans3}
Response 4: {ans4}
Response 5: {ans5}

After examining the user question and each individual response, conclude with a cumulative score for each response as per the above highlighted scoring system using the format:

"Score 1: <total_points of response 1>, Score 2: <total_points of response 2>, Score 3: <total_points of response 3>, Score 4: <total_points of response 4>, Score 5: <total_points of response 5>"

<image>'''

In [7]:
query_sc = []
idx = 0


for i in range(0, len(ans_pvqa), 5):
    # Get the question from first answer
    ques = ans_pvqa[i]['prompt']
    # Get the image from first answer
    img = ans_pvqa[i]['image']
    # Get the five answers
    ans1 = ans_pvqa[i]['text']
    ans2 = ans_pvqa[i+1]['text']
    ans3 = ans_pvqa[i+2]['text']
    ans4 = ans_pvqa[i+3]['text']
    ans5 = ans_pvqa[i+4]['text']
    # Prepare the text
    query_text = base_text.format(ques=ques, ans1=ans1, ans2=ans2, ans3=ans3, ans4=ans4, ans5=ans5)
    query_sc.append({"question_id": idx, "image": img, "text": query_text})
    idx += 1

Putting the query dictionary (self-consistency) into a jsonl file with each query separated by a new line

In [8]:
file_path = '/data/mn27889/LLaVA-Med/pvqa_data/query_files/que_pvqa_open_sc.jsonl'

# Writing each dictionary as a JSON object on a new line
with open(file_path, 'w') as file:
    for query in query_sc:
        json_line = json.dumps(query)  # Convert dictionary to JSON string
        file.write(json_line + '\n')

Run the following command in a terminal as a background process

In [11]:
!python llava/eval/model_vqa.py \
--model-name /data/mn27889/.cache/huggingface/hub/llava_med_pvqa \
--question-file /data/mn27889/LLaVA-Med/pvqa_data/query_files/que_pvqa_test.jsonl \
--image-folder /data/mn27889/pvqa/images/test \
--answers-file /data/mn27889/LLaVA-Med/pvqa_data/answer_files/ans_pvqa_test_llava_pvqa.jsonl

In [12]:
!python llava/eval/model_vqa.py \
--model-name /data/mn27889/.cache/huggingface/hub/llava_med \
--question-file /data/mn27889/LLaVA-Med/pvqa_data/query_files/que_pvqa_test.jsonl \
--image-folder /data/mn27889/pvqa/images/test \
--answers-file /data/mn27889/LLaVA-Med/pvqa_data/answer_files/ans_pvqa_test_llava.jsonl

In [None]:
# nohup python llava/eval/model_vqa.py \
# --model-name /data/mn27889/.cache/huggingface/hub/llava_med_pvqa \
# --question-file /data/mn27889/LLaVA-Med/pvqa_data/query_files/que_pvqa_open_sc.jsonl \
# --image-folder /data/mn27889/pvqa/images/test \
# --answers-file /data/mn27889/LLaVA-Med/pvqa_data/answer_files/ans_pvqa_open_sc.jsonl > /data/mn27889/LLaVA-Med/pvqa_data/logs/log_pvqa_open_sc.log &

Now read the self-consistency answer file and add the model output as `llava_answer` to the base dictionaries of open-ended PVQA questions, images and answers

In [None]:
ans_pvqa_sc_open_file = "/data/mn27889/LLaVA-Med/pvqa_data/answer_files/ans_pvqa_open_sc.jsonl"
ans_pvqa_sc = []

with open(ans_pvqa_sc_open_file, "r") as file:
    for line in file:
        ans_pvqa_sc.append(json.loads(line))

In [None]:
pvqa_qas_llava_open = []
for i in range(0, len(pvqa_qas_open)):
    qas = pvqa_qas_open[i]
    qas['llava_answer'] = ans_pvqa_sc[i]['text']
    pvqa_qas_llava_open.append(qas)
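
If the model follows the requested output format, `ans_pvqa_sc[i]['text']` holds a score string such as `Score 1: 2, Score 2: 4, ...` rather than an answer, so one more step would be needed to pick the highest-scored response out of the five. The sketch below shows one way to do that; `pick_best_answer`, the tie-breaking rule and the fallback to the first response are assumptions, not part of the original pipeline.

```python
import re

# Matches fragments like "Score 2: 4" in the model reply
SCORE_PATTERN = re.compile(r"Score\s*(\d)\s*:\s*(\d+)")


def pick_best_answer(score_text, candidates):
    """Return the highest-scored candidate, or the first candidate if no score can be parsed."""
    scores = {}
    for index, points in SCORE_PATTERN.findall(score_text):
        position = int(index) - 1  # "Score 1" refers to candidates[0]
        if 0 <= position < len(candidates):
            scores[position] = int(points)
    if not scores:
        return candidates[0]
    # Ties go to the earliest response
    best = max(scores, key=lambda position: (scores[position], -position))
    return candidates[best]


demo_reply = "Score 1: 2, Score 2: 4, Score 3: 3, Score 4: 4, Score 5: 1"
demo_candidates = ["a", "b", "c", "d", "e"]
print(pick_best_answer(demo_reply, demo_candidates))  # b
```

In the notebook, the candidates for question `i` would be the `text` fields of `ans_pvqa[5*i]` to `ans_pvqa[5*i + 4]`.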

In [None]:
recall_score(pvqa_qas_llava_open, answer_key='answer', prediction_key='llava_answer')
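
The implementation of `recall_score` lives in the local `recall_calculation` module and is not shown here. As a rough illustration of what such a metric can compute, the sketch below uses token-level recall: the fraction of ground-truth tokens that appear in the prediction, averaged over all records. Both functions are hypothetical stand-ins and may differ from the actual module.

```python
def token_recall(ground_truth, prediction):
    """Fraction of whitespace-separated ground-truth tokens found in the prediction."""
    gt_tokens = ground_truth.lower().split()
    pred_tokens = set(prediction.lower().split())
    if not gt_tokens:
        return 0.0
    hits = sum(1 for token in gt_tokens if token in pred_tokens)
    return hits / len(gt_tokens)


def mean_token_recall(records, answer_key="answer", prediction_key="llava_answer"):
    """Average token_recall over a list of dictionaries; 0.0 for an empty list."""
    if not records:
        return 0.0
    total = sum(token_recall(r[answer_key], r[prediction_key]) for r in records)
    return total / len(records)


demo_records = [
    {"answer": "chronic inflammation", "llava_answer": "there is chronic inflammation here"},
    {"answer": "left lung", "llava_answer": "the lung"},
]
print(mean_token_recall(demo_records))  # 0.75
```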