### Data Preparation for evaluation of PathOpen Dataset

This notebook prepares the data for evaluating the PathOpen dataset. The image file names in the folder contain the diagnosis, so before anything else the files must be renamed and the new names mapped back to the original data. The process is as follows:

1. Download the original responses file (CSV) and place it in the data folder. Rename some columns to usable names. That file contains only the Google Drive links to the images.
2. Create the Case ID and Image ID for every image in the three folders (Image1, Image2, Image3), since a single case can have multiple images spread across these folders.
3. Prepare the questions in the format of the evaluators' sheet, with one row per image of each case, each containing all the questions.

In [7]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import os

#### 1. Reading the data, renaming the columns, and creating the Case ID

In [2]:
vqa_data = pd.read_csv('data/path_open_vqa.csv')
vqa_data = vqa_data.rename(columns={vqa_data.columns[2]: 'Image1_Path',
                                    vqa_data.columns[26]: 'Image2_Path',
                                    vqa_data.columns[27]: 'Image3_Path',
                                    vqa_data.columns[28]: 'Image4_Path',
                                    vqa_data.columns[29]: 'Image5_Path',
                                    'Image 1 Magnification ': 'Image1_Mag',
                                    'Image 2 Magnification ': 'Image2_Mag',
                                    'Image 3 Magnification ': 'Image3_Mag',
                                    'Image 4 Magnification ': 'Image4_Mag',
                                    'Image 5 Magnification ': 'Image5_Mag'})

vqa_data = vqa_data.reset_index()
vqa_data['index'] = vqa_data['index'] + 1
vqa_data = vqa_data.rename(columns={'index': 'Case_ID'})
vqa_data.head()

Unnamed: 0,Case_ID,Timestamp,Pathologist ID,Image1_Path,Organ,Categorization,Regional Anatomy,Open Ended - Question 1,Open Ended - Answer 1,Open Ended - Answer 2,...,Open Ended - Wrong Answer 1,Open Ended - Wrong Answer 2,Image2_Mag,Image3_Mag,Image4_Mag,Image5_Mag,Image2_Path,Image3_Path,Image4_Path,Image5_Path
0,1,2/25/2025 13:41:20,CK,https://drive.google.com/open?id=1tgv50Q9W4Bm_...,Hematolymphoid - Lymph Nodes,Neoplasia (Malignant),Lymph node,What is the primary architectural pattern obse...,Nodular pattern,Germinal centers are absent,...,,,,,,,,,,
1,2,2/25/2025 13:46:25,CK,https://drive.google.com/open?id=1igYpj4RL0XKx...,Hematolymphoid - Lymph Nodes,Neoplasia (Malignant),Lymph node,What is the predominant cell type seen here?,The main cell type observed here is a lymphocy...,These would best be characterized as small lym...,...,,,,,,,,,,
2,3,2/25/2025 13:51:37,CK,https://drive.google.com/open?id=1DKNZJQJ17SkX...,Hematolymphoid - Lymph Nodes,Neoplasia (Malignant),Lymph node,"In this mantle cell lymphoma, what is the best...",The nuclei of the abnormal lymphocytes are mos...,This should be considered a highly cellular ti...,...,,,,,,,,,,
3,4,2/26/2025 11:30:23,CK,https://drive.google.com/open?id=1jUe0z6wlZ0s9...,Gastrointestinal - Small Intenstine,Infection (Benign),Intestinal villous mucosa,"In a child with poor weight gain, what are the...",Features that argue against celiac disease in ...,Granulomas here are often the consequence of u...,...,,,,,,,,,,
4,5,2/26/2025 11:42:26,CK,https://drive.google.com/open?id=1VNr78I5w671g...,Gastrointestinal - Dudenum,Infection (Benign),Intestinal villous mucosa,What are the main histologic features of a non...,Non-necrotizing granulomas are characterized b...,,...,,,,,,,,,,
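
Because the image-path columns are renamed by position (`vqa_data.columns[2]`, `vqa_data.columns[26]`, and so on), a change in the column order of the responses file would silently rename the wrong columns. The following is a minimal sketch of a sanity check, assuming every image path is a link starting with `https://drive.google.com/`. The helper name `check_drive_link_columns` and the toy data are hypothetical and only illustrate the idea.

```python
import pandas as pd

def check_drive_link_columns(df, path_columns):
    """Return the columns whose non-null values are not all Google Drive links."""
    suspicious = []
    for col in path_columns:
        values = df[col].dropna().astype(str)
        if not values.str.startswith('https://drive.google.com/').all():
            suspicious.append(col)
    return suspicious

# Toy frame standing in for vqa_data (all values are made up for illustration)
toy = pd.DataFrame({
    'Image1_Path': ['https://drive.google.com/open?id=abc',
                    'https://drive.google.com/open?id=def'],
    'Image2_Path': [None, 'https://drive.google.com/open?id=ghi'],
    'Image3_Path': ['10x', None],  # a magnification value in the wrong column
})

suspicious_columns = check_drive_link_columns(
    toy, ['Image1_Path', 'Image2_Path', 'Image3_Path'])
print(suspicious_columns)
```

On the real data, the same helper could be called on `vqa_data` right after the rename; an empty list would suggest the positional renames landed on the intended columns.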


#### 2. Creating an Image_ID for each image within a case

In [3]:
# Build one output row per (case, image) pair for the first three images.
records = []
for _, row in vqa_data.iterrows():
    for image_number in (1, 2, 3):
        image_path = row[f'Image{image_number}_Path']
        if pd.notnull(image_path):
            record = row.copy()
            record['Image_Path'] = image_path
            # Image_ID format: img_pathopen_<Case_ID>_<two-digit image number>
            record['Image_ID'] = f"img_pathopen_{row['Case_ID']}_{image_number:02d}"
            records.append(record)

# Collect the rows once instead of concatenating inside the loop
vqa_data_upd = pd.DataFrame(records).reset_index(drop=True)
vqa_data_upd.head()

Unnamed: 0,Case_ID,Timestamp,Pathologist ID,Image1_Path,Organ,Categorization,Regional Anatomy,Open Ended - Question 1,Open Ended - Answer 1,Open Ended - Answer 2,...,Image2_Mag,Image3_Mag,Image4_Mag,Image5_Mag,Image2_Path,Image3_Path,Image4_Path,Image5_Path,Image_Path,Image_ID
0,1,2/25/2025 13:41:20,CK,https://drive.google.com/open?id=1tgv50Q9W4Bm_...,Hematolymphoid - Lymph Nodes,Neoplasia (Malignant),Lymph node,What is the primary architectural pattern obse...,Nodular pattern,Germinal centers are absent,...,,,,,,,,,https://drive.google.com/open?id=1tgv50Q9W4Bm_...,img_pathopen_1_01
1,2,2/25/2025 13:46:25,CK,https://drive.google.com/open?id=1igYpj4RL0XKx...,Hematolymphoid - Lymph Nodes,Neoplasia (Malignant),Lymph node,What is the predominant cell type seen here?,The main cell type observed here is a lymphocy...,These would best be characterized as small lym...,...,,,,,,,,,https://drive.google.com/open?id=1igYpj4RL0XKx...,img_pathopen_2_01
2,3,2/25/2025 13:51:37,CK,https://drive.google.com/open?id=1DKNZJQJ17SkX...,Hematolymphoid - Lymph Nodes,Neoplasia (Malignant),Lymph node,"In this mantle cell lymphoma, what is the best...",The nuclei of the abnormal lymphocytes are mos...,This should be considered a highly cellular ti...,...,,,,,,,,,https://drive.google.com/open?id=1DKNZJQJ17SkX...,img_pathopen_3_01
3,4,2/26/2025 11:30:23,CK,https://drive.google.com/open?id=1jUe0z6wlZ0s9...,Gastrointestinal - Small Intenstine,Infection (Benign),Intestinal villous mucosa,"In a child with poor weight gain, what are the...",Features that argue against celiac disease in ...,Granulomas here are often the consequence of u...,...,,,,,,,,,https://drive.google.com/open?id=1jUe0z6wlZ0s9...,img_pathopen_4_01
4,5,2/26/2025 11:42:26,CK,https://drive.google.com/open?id=1VNr78I5w671g...,Gastrointestinal - Dudenum,Infection (Benign),Intestinal villous mucosa,What are the main histologic features of a non...,Non-necrotizing granulomas are characterized b...,,...,,,,,,,,,https://drive.google.com/open?id=1VNr78I5w671g...,img_pathopen_5_01
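
Since each `Image_ID` later replaces a diagnosis-revealing file name, it is worth confirming that the IDs are unique and follow the expected pattern. The sketch below runs these checks on a small made-up frame that stands in for `vqa_data_upd`; the variable names are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Toy long-format frame (one row per image), standing in for vqa_data_upd
toy_long = pd.DataFrame({
    'Case_ID': [1, 2, 2, 3],
    'Image_ID': ['img_pathopen_1_01', 'img_pathopen_2_01',
                 'img_pathopen_2_02', 'img_pathopen_3_01'],
})

# Every image should have its own ID
ids_are_unique = toy_long['Image_ID'].is_unique

# Number of images found for each case
images_per_case = toy_long.groupby('Case_ID')['Image_ID'].count()

# IDs should look like img_pathopen_<case>_<01|02|03>
id_pattern_ok = toy_long['Image_ID'].str.fullmatch(r'img_pathopen_\d+_0[1-3]').all()

print(ids_are_unique, id_pattern_ok)
print(images_per_case)
```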


Removing the unnecessary columns

In [5]:
vqa_data_upd = vqa_data_upd.drop(columns=['Image1_Path', 'Image2_Path', 'Image3_Path'])
vqa_data_upd.head()

Unnamed: 0,Case_ID,Timestamp,Pathologist ID,Organ,Categorization,Regional Anatomy,Open Ended - Question 1,Open Ended - Answer 1,Open Ended - Answer 2,Open Ended - Question 2,...,Open Ended - Wrong Answer 1,Open Ended - Wrong Answer 2,Image2_Mag,Image3_Mag,Image4_Mag,Image5_Mag,Image4_Path,Image5_Path,Image_Path,Image_ID
0,1,2/25/2025 13:41:20,CK,Hematolymphoid - Lymph Nodes,Neoplasia (Malignant),Lymph node,What is the primary architectural pattern obse...,Nodular pattern,Germinal centers are absent,What is a common component of lymph node archi...,...,,,,,,,,,https://drive.google.com/open?id=1tgv50Q9W4Bm_...,img_pathopen_1_01
1,2,2/25/2025 13:46:25,CK,Hematolymphoid - Lymph Nodes,Neoplasia (Malignant),Lymph node,What is the predominant cell type seen here?,The main cell type observed here is a lymphocy...,These would best be characterized as small lym...,What is the best description for the cell size...,...,,,,,,,,,https://drive.google.com/open?id=1igYpj4RL0XKx...,img_pathopen_2_01
2,3,2/25/2025 13:51:37,CK,Hematolymphoid - Lymph Nodes,Neoplasia (Malignant),Lymph node,"In this mantle cell lymphoma, what is the best...",The nuclei of the abnormal lymphocytes are mos...,This should be considered a highly cellular ti...,What is the best description for the density a...,...,,,,,,,,,https://drive.google.com/open?id=1DKNZJQJ17SkX...,img_pathopen_3_01
3,4,2/26/2025 11:30:23,CK,Gastrointestinal - Small Intenstine,Infection (Benign),Intestinal villous mucosa,"In a child with poor weight gain, what are the...",Features that argue against celiac disease in ...,Granulomas here are often the consequence of u...,What does the granuloma in the lamina propria ...,...,,,,,,,,,https://drive.google.com/open?id=1jUe0z6wlZ0s9...,img_pathopen_4_01
4,5,2/26/2025 11:42:26,CK,Gastrointestinal - Dudenum,Infection (Benign),Intestinal villous mucosa,What are the main histologic features of a non...,Non-necrotizing granulomas are characterized b...,,,...,,,,,,,,,https://drive.google.com/open?id=1VNr78I5w671g...,img_pathopen_5_01


### Generating the Google Sheet Containing the mapping between Google Drive Links and Image Names

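This helps map the names of the image files downloaded from the collected data to the Google Drive links in our data. The commented-out code below is a Google Apps Script function, not Python; it is meant to be run in Google Apps Script rather than in this notebook.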

In [None]:
# function listFilesInNestedFolders(parentFolderId) {
#   // Get the initial parent folder
#   var parentfolder_id = '1VdFeRx06PrTTCbeMVQ0wQ8ML8QF8sSxM-QWlTcMx7TKQAjpWRuW5-EWJi-DtuBDPeXTbGYNm'
#   const parentFolder = DriveApp.getFolderById(parentfolder_id);
#   var folderlisting = 'File Names and Links - '+ parentfolder_id;

#   var ss = SpreadsheetApp.create(folderlisting);
#   var sheet = ss.getActiveSheet();
#   sheet.appendRow(['name','link']);

#   // Iterate over the direct subfolders of the parent folder (one level deep, not recursive)
#   var subfolders = parentFolder.getFolders();
#   while (subfolders.hasNext()) {
#     var subfolder = subfolders.next();
#     // Get files in the current subfolder
#     var contents = subfolder.getFiles();
#     var file;
#     var name;
#     var link;
#     while (contents.hasNext()) {
#       file = contents.next();
#       name = file.getName();
#       link = file.getUrl();
#       sheet.appendRow([name,link]);
#     }
#   }
# }

Now download the generated file, which contains the mapping between Google Drive links and file names.

In [8]:
data_dir = "data_eval"
file_name = "file_names_and_links_PathOpen_Images.csv"
file_path = os.path.join(data_dir, file_name)
google_drive_name_links = pd.read_csv(file_path)
google_drive_name_links.head()

Unnamed: 0,name,link
0,ck_HSP_5x - Chandra Krishnan.jpeg,https://drive.google.com/file/d/1SYbssi84BMfkC...
1,ck_BALL_4x - Chandra Krishnan.jpeg,https://drive.google.com/file/d/1XrSu7uRm3s-al...
2,ck_pschwannoma_20x - Chandra Krishnan.jpeg,https://drive.google.com/file/d/1dGZ8MAgdKvT5w...
3,ck_CMV gastritis_4x - Chandra Krishnan.jpeg,https://drive.google.com/file/d/11UMXpBZjiQdo0...
4,ck_rhabdo_5x - Chandra Krishnan.jpeg,https://drive.google.com/file/d/174myggtY1Fj5A...


Extracting the IDs from the image links

In [10]:
vqa_data_upd['google_drive_image_id'] = vqa_data_upd['Image_Path'].apply(lambda x: x.split('=')[-1])
google_drive_name_links['google_drive_image_id'] = google_drive_name_links['link'].apply(lambda x: x.split('/')[-1])
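
The two `split` calls above assume that the ID is always the last piece of the link. As a hedged alternative, the sketch below pulls the ID out with a regular expression, assuming links of the form `...?id=<id>` or `.../file/d/<id>`, optionally followed by a suffix such as `/view`. The helper name `extract_drive_id` and the example IDs are hypothetical.

```python
import re

# Matches the ID after '/d/' or after an 'id=' query parameter (assumed link shapes)
_DRIVE_ID_PATTERN = re.compile(r'(?:/d/|[?&]id=)([A-Za-z0-9_-]+)')

def extract_drive_id(link):
    """Return the Google Drive file ID found in link, or None if there is none."""
    match = _DRIVE_ID_PATTERN.search(str(link))
    return match.group(1) if match else None

# Made-up links that share one ID across the assumed link shapes
open_link = 'https://drive.google.com/open?id=abc123_XY'
file_link = 'https://drive.google.com/file/d/abc123_XY'
view_link = 'https://drive.google.com/file/d/abc123_XY/view?usp=drivesdk'

print(extract_drive_id(open_link))
print(extract_drive_id(file_link))
print(extract_drive_id(view_link))
```

Applied with `.apply(extract_drive_id)` to both `Image_Path` and `link`, this would yield the same join key for both link styles, and `None` for values that contain no recognizable ID.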

Adding the image file names by merging on the Google Drive ID

In [11]:
vqa_data_upd = pd.merge(vqa_data_upd, google_drive_name_links[['google_drive_image_id', 'name']], on='google_drive_image_id', how='left')
vqa_data_upd.head()

Unnamed: 0,Case_ID,Timestamp,Pathologist ID,Organ,Categorization,Regional Anatomy,Open Ended - Question 1,Open Ended - Answer 1,Open Ended - Answer 2,Open Ended - Question 2,...,Image2_Mag,Image3_Mag,Image4_Mag,Image5_Mag,Image4_Path,Image5_Path,Image_Path,Image_ID,google_drive_image_id,name
0,1,2/25/2025 13:41:20,CK,Hematolymphoid - Lymph Nodes,Neoplasia (Malignant),Lymph node,What is the primary architectural pattern obse...,Nodular pattern,Germinal centers are absent,What is a common component of lymph node archi...,...,,,,,,,https://drive.google.com/open?id=1tgv50Q9W4Bm_...,img_pathopen_1_01,1tgv50Q9W4Bm_Mk7my6Ag8mD0JWkQkYvx,MCL_20x - Chandra Krishnan.jpg
1,2,2/25/2025 13:46:25,CK,Hematolymphoid - Lymph Nodes,Neoplasia (Malignant),Lymph node,What is the predominant cell type seen here?,The main cell type observed here is a lymphocy...,These would best be characterized as small lym...,What is the best description for the cell size...,...,,,,,,,https://drive.google.com/open?id=1igYpj4RL0XKx...,img_pathopen_2_01,1igYpj4RL0XKxq0u5qlnv6ZvqJfFjGO4u,MCL_200x - Chandra Krishnan.jpg
2,3,2/25/2025 13:51:37,CK,Hematolymphoid - Lymph Nodes,Neoplasia (Malignant),Lymph node,"In this mantle cell lymphoma, what is the best...",The nuclei of the abnormal lymphocytes are mos...,This should be considered a highly cellular ti...,What is the best description for the density a...,...,,,,,,,https://drive.google.com/open?id=1DKNZJQJ17SkX...,img_pathopen_3_01,1DKNZJQJ17SkXUrrO7jHMiKD8J6ittzus,MCL_400x - Chandra Krishnan.jpg
3,4,2/26/2025 11:30:23,CK,Gastrointestinal - Small Intenstine,Infection (Benign),Intestinal villous mucosa,"In a child with poor weight gain, what are the...",Features that argue against celiac disease in ...,Granulomas here are often the consequence of u...,What does the granuloma in the lamina propria ...,...,,,,,,,https://drive.google.com/open?id=1jUe0z6wlZ0s9...,img_pathopen_4_01,1jUe0z6wlZ0s9PZUgVj1yCL7UZav23jwz,Duodenum granuloma_40x - Chandra Krishnan.jpg
4,5,2/26/2025 11:42:26,CK,Gastrointestinal - Dudenum,Infection (Benign),Intestinal villous mucosa,What are the main histologic features of a non...,Non-necrotizing granulomas are characterized b...,,,...,,,,,,,https://drive.google.com/open?id=1VNr78I5w671g...,img_pathopen_5_01,1VNr78I5w671gkrIyHs7AC75yrjb1lU6_,Duodenum granuloma_100x - Chandra Krishnan.jpg
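
A left merge keeps every image row even when no file name is found, so unmatched images end up with a missing `name` without any warning. The following sketch, on made-up data, uses the `indicator` and `validate` parameters of `pd.merge` to list unmatched images and to fail if a Drive ID appears more than once in the name table. The frames `left` and `right` are illustrative stand-ins for `vqa_data_upd` and `google_drive_name_links`.

```python
import pandas as pd

# Toy stand-ins for vqa_data_upd and google_drive_name_links
left = pd.DataFrame({
    'Image_ID': ['img_pathopen_1_01', 'img_pathopen_2_01'],
    'google_drive_image_id': ['id_a', 'id_b'],
})
right = pd.DataFrame({
    'google_drive_image_id': ['id_a', 'id_c'],
    'name': ['file_a.jpg', 'file_c.jpg'],
})

# indicator=True adds a '_merge' column; validate raises if the IDs on the
# right-hand side are not unique
merged = pd.merge(left, right, on='google_drive_image_id', how='left',
                  indicator=True, validate='many_to_one')

# Images for which no file name was found
unmatched = merged.loc[merged['_merge'] == 'left_only', 'Image_ID'].tolist()
print(unmatched)
```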


Creating a dictionary that maps each image file name to its Image_ID

In [18]:
name_imageid_dict = dict(zip(vqa_data_upd['name'], vqa_data_upd['Image_ID']))
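
This dictionary is what allows the downloaded files to be renamed so that their names no longer reveal the diagnosis. The sketch below shows one way that renaming could look, assuming the images sit in a single local folder and each file keeps its original extension. The function `rename_images`, the temporary folder, and the file names are hypothetical; the notebook itself does not show this step.

```python
import os
import tempfile

def rename_images(image_dir, name_to_image_id):
    """Rename files in image_dir to '<Image_ID><extension>'; return skipped names."""
    skipped = []
    for file_name in sorted(os.listdir(image_dir)):
        image_id = name_to_image_id.get(file_name)
        if image_id is None:
            # No mapping for this file: leave it untouched and report it
            skipped.append(file_name)
            continue
        extension = os.path.splitext(file_name)[1]
        os.rename(os.path.join(image_dir, file_name),
                  os.path.join(image_dir, image_id + extension))
    return skipped

# Demonstration in a throwaway folder with made-up file names
with tempfile.TemporaryDirectory() as demo_dir:
    for file_name in ('MCL_20x - Example.jpg', 'unlisted.jpg'):
        open(os.path.join(demo_dir, file_name), 'w').close()
    skipped_files = rename_images(
        demo_dir, {'MCL_20x - Example.jpg': 'img_pathopen_1_01'})
    renamed_files = sorted(os.listdir(demo_dir))

print(renamed_files)
print(skipped_files)
```

Note that a dictionary keeps only one `Image_ID` per file name, so if two images shared the same name, only the last one would survive in `name_imageid_dict`.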

#### Creating the evaluation dataset with one row per image, as required by the evaluation sheet

In [20]:
vqa_data_upd.columns

Index(['Case_ID', 'Timestamp', 'Pathologist ID', 'Organ', 'Categorization',
       'Regional Anatomy', 'Open Ended - Question 1', 'Open Ended - Answer 1',
       'Open Ended - Answer 2', 'Open Ended - Question 2', 'MCQ - Question',
       'MCQ - Option 1', 'MCQ - Option 2', 'MCQ - Option 3', 'MCQ - Option 4',
       'MCQ - Option 5', 'MCQ - Answer', 'Close-Ended Question 1',
       'Close-Ended Answer 1', 'Image1_Mag', 'Open Ended - Wrong Answer 1',
       'Open Ended - Wrong Answer 2', 'Image2_Mag', 'Image3_Mag', 'Image4_Mag',
       'Image5_Mag', 'Image4_Path', 'Image5_Path', 'Image_Path', 'Image_ID',
       'google_drive_image_id', 'name'],
      dtype='object')

In [21]:
case_id = []
image_id = []
image_url = []
oe_question_1 = []
oe_correct_answer_1 = []
oe_wrong_answer_1 = []
oe_question_2 = []
oe_correct_answer_2 = []
oe_wrong_answer_2 = []
mcq_oe_question = []
mcq_oe_correct_answer = []
mcq_oe_wrong_answer_1 = []
mcq_oe_wrong_answer_2 = []
mcq_oe_wrong_answer_3 = []
mcq_oe_wrong_answer_4 = []
ce_question = []
ce_correct_answer = []

# Columns holding the five MCQ options, in their original order
mcq_option_cols = [f'MCQ - Option {i}' for i in range(1, 6)]

for index, row in vqa_data_upd.iterrows():
    case_id.append(row['Case_ID'])
    image_id.append(row['Image_ID'])
    image_url.append(row['Image_Path'])
    oe_question_1.append(row['Open Ended - Question 1'])
    oe_correct_answer_1.append(row['Open Ended - Answer 1'])
    oe_wrong_answer_1.append(row['Open Ended - Wrong Answer 1'])
    oe_question_2.append(row['Open Ended - Question 2'])
    oe_correct_answer_2.append(row['Open Ended - Answer 2'])
    oe_wrong_answer_2.append(row['Open Ended - Wrong Answer 2'])
    mcq_oe_question.append(row['MCQ - Question'])

    # 'MCQ - Answer' names the correct option (e.g. 'Option 3'). The other four
    # options become the wrong answers, kept in their original order.
    correct_col = 'MCQ - ' + str(row['MCQ - Answer']).strip()
    if correct_col not in mcq_option_cols:
        # Fail loudly so the answer lists cannot end up with different lengths
        raise ValueError(f"Case {row['Case_ID']}: unexpected MCQ answer {row['MCQ - Answer']!r}")
    wrong_cols = [col for col in mcq_option_cols if col != correct_col]
    mcq_oe_correct_answer.append(row[correct_col])
    mcq_oe_wrong_answer_1.append(row[wrong_cols[0]])
    mcq_oe_wrong_answer_2.append(row[wrong_cols[1]])
    mcq_oe_wrong_answer_3.append(row[wrong_cols[2]])
    mcq_oe_wrong_answer_4.append(row[wrong_cols[3]])

    ce_question.append(row['Close-Ended Question 1'])
    ce_correct_answer.append(row['Close-Ended Answer 1'])

In [22]:
pathopen_eval_data = pd.DataFrame({
    'Case_ID': case_id,
    'Image_ID': image_id,
    'Image_URL': image_url,
    'OE_Question_1': oe_question_1,
    'OE_Correct_Answer_1': oe_correct_answer_1,
    'OE_Wrong_Answer_1': oe_wrong_answer_1,
    'OE_Question_2': oe_question_2,
    'OE_Correct_Answer_2': oe_correct_answer_2,
    'OE_Wrong_Answer_2': oe_wrong_answer_2,
    'MCQ_OE_Question': mcq_oe_question,
    'MCQ_OE_Correct_Answer': mcq_oe_correct_answer,
    'MCQ_OE_Wrong_Answer_1': mcq_oe_wrong_answer_1,
    'MCQ_OE_Wrong_Answer_2': mcq_oe_wrong_answer_2,
    'MCQ_OE_Wrong_Answer_3': mcq_oe_wrong_answer_3,
    'MCQ_OE_Wrong_Answer_4': mcq_oe_wrong_answer_4,
    'CE_Question': ce_question,
    'CE_Correct_Answer': ce_correct_answer
})

pathopen_eval_data.head()

Unnamed: 0,Case_ID,Image_ID,Image_URL,OE_Question_1,OE_Correct_Answer_1,OE_Wrong_Answer_1,OE_Question_2,OE_Correct_Answer_2,OE_Wrong_Answer_2,MCQ_OE_Question,MCQ_OE_Correct_Answer,MCQ_OE_Wrong_Answer_1,MCQ_OE_Wrong_Answer_2,MCQ_OE_Wrong_Answer_3,MCQ_OE_Wrong_Answer_4,CE_Question,CE_Correct_Answer
0,1,img_pathopen_1_01,https://drive.google.com/open?id=1tgv50Q9W4Bm_...,What is the primary architectural pattern obse...,Nodular pattern,,What is a common component of lymph node archi...,Germinal centers are absent,,What is the organ represented?,Lymph Node,Adrenal gland,Pancreas,Bone marrow,Retina,Is this a normal pattern observed?,No
1,2,img_pathopen_2_01,https://drive.google.com/open?id=1igYpj4RL0XKx...,What is the predominant cell type seen here?,The main cell type observed here is a lymphocy...,,What is the best description for the cell size...,These would best be characterized as small lym...,,If these are CD5 B-cell lineage lymphocytes in...,Mantle cell lymphoma,Small lymphocytic lymphoma,Diffuse large B-cell lymphoma,Hairy cell leukemia,Follicular lymphoma,Is the nucleus to cytoplasmic ratio of these l...,Yes
2,3,img_pathopen_3_01,https://drive.google.com/open?id=1DKNZJQJ17SkX...,"In this mantle cell lymphoma, what is the best...",The nuclei of the abnormal lymphocytes are mos...,,What is the best description for the density a...,This should be considered a highly cellular ti...,,What nuclear feature suggests increased growth...,Frequent mitotic figures,Large nuclear size,Increased red blood cells,Low nuclear:cytoplasmic ratio,Many histiocytes,Does this tumor have morphologic features of d...,No
3,4,img_pathopen_4_01,https://drive.google.com/open?id=1jUe0z6wlZ0s9...,"In a child with poor weight gain, what are the...",Features that argue against celiac disease in ...,,What does the granuloma in the lamina propria ...,Granulomas here are often the consequence of u...,,Which of the following chronic gastrointestina...,Crohn disease,Ulcerative colitis,Giardia gastroenteritis,Celiac disease,Norovirus,Is the mucosal architecture normal?,Yes
4,5,img_pathopen_5_01,https://drive.google.com/open?id=1VNr78I5w671g...,What are the main histologic features of a non...,Non-necrotizing granulomas are characterized b...,,,,,In a pediatric age patient with this biopsy wh...,Paneth cells,Goblet cells,Brush border,Mucous cells,Histiocytes,Are duodenum granulomas typically found with c...,No


In [None]:
pathopen_eval_data.to_csv('data_eval/pathopen_vqa.csv', index=False)