### Augmented Data Preparation

This notebook prepares the augmented data used for evaluation. It takes a subset of images from the collected data, augments them, and assigns the same set of open-ended questions to every augmented version of an image. The purpose of the evaluation is to check that the same questions and answers still make sense across multiple augmented versions of the same image.

In [1]:
import os
import sys

In [2]:
# Append the directory to the system path
sys.path.append(os.path.join(os.getcwd(), "pathology-he-autoaugmetation/randaugment/train"))

In [3]:
import re

import numpy as np
import pandas as pd
from PIL import Image
from tqdm import tqdm

from randaugment_new_ranges import distort_image_with_randaugment

#### Reading the original data

In [4]:
vqa_data = pd.read_csv('data/path_open_vqa.csv')
vqa_data = vqa_data.rename(columns={vqa_data.columns[2]: 'Image1_Path',
                                    vqa_data.columns[26]: 'Image2_Path',
                                    vqa_data.columns[27]: 'Image3_Path',
                                    vqa_data.columns[28]: 'Image4_Path',
                                    vqa_data.columns[29]: 'Image5_Path',
                                    'Image 1 Magnification ': 'Image1_Mag',
                                    'Image 2 Magnification ': 'Image2_Mag',
                                    'Image 3 Magnification ': 'Image3_Mag',
                                    'Image 4 Magnification ': 'Image4_Mag',
                                    'Image 5 Magnification ': 'Image5_Mag'})

vqa_data = vqa_data.reset_index()
vqa_data['index'] = vqa_data['index'] + 1
vqa_data = vqa_data.rename(columns={'index': 'Case_ID'})
vqa_data.head()

Unnamed: 0,Case_ID,Timestamp,Pathologist ID,Image1_Path,Organ,Categorization,Regional Anatomy,Open Ended - Question 1,Open Ended - Answer 1,Open Ended - Answer 2,...,Open Ended - Wrong Answer 1,Open Ended - Wrong Answer 2,Image2_Mag,Image3_Mag,Image4_Mag,Image5_Mag,Image2_Path,Image3_Path,Image4_Path,Image5_Path
0,1,2/25/2025 13:41:20,CK,https://drive.google.com/open?id=1tgv50Q9W4Bm_...,Hematolymphoid - Lymph Nodes,Neoplasia (Malignant),Lymph node,What is the primary architectural pattern obse...,Nodular pattern,Germinal centers are absent,...,,,,,,,,,,
1,2,2/25/2025 13:46:25,CK,https://drive.google.com/open?id=1igYpj4RL0XKx...,Hematolymphoid - Lymph Nodes,Neoplasia (Malignant),Lymph node,What is the predominant cell type seen here?,The main cell type observed here is a lymphocy...,These would best be characterized as small lym...,...,,,,,,,,,,
2,3,2/25/2025 13:51:37,CK,https://drive.google.com/open?id=1DKNZJQJ17SkX...,Hematolymphoid - Lymph Nodes,Neoplasia (Malignant),Lymph node,"In this mantle cell lymphoma, what is the best...",The nuclei of the abnormal lymphocytes are mos...,This should be considered a highly cellular ti...,...,,,,,,,,,,
3,4,2/26/2025 11:30:23,CK,https://drive.google.com/open?id=1jUe0z6wlZ0s9...,Gastrointestinal - Small Intenstine,Infection (Benign),Intestinal villous mucosa,"In a child with poor weight gain, what are the...",Features that argue against celiac disease in ...,Granulomas here are often the consequence of u...,...,,,,,,,,,,
4,5,2/26/2025 11:42:26,CK,https://drive.google.com/open?id=1VNr78I5w671g...,Gastrointestinal - Dudenum,Infection (Benign),Intestinal villous mucosa,What are the main histologic features of a non...,Non-necrotizing granulomas are characterized b...,,...,,,,,,,,,,


Keeping only the columns needed for the evaluation: the case and pathologist identifiers, the first three image paths, and the two open-ended question/answer pairs

In [5]:
vqa_data_filt = vqa_data[['Case_ID', 'Pathologist ID', 'Image1_Path', 'Image2_Path', 'Image3_Path', 'Open Ended - Question 1', 'Open Ended - Answer 1', 'Open Ended - Question 2', 'Open Ended - Answer 2']]
vqa_data_filt.head()

Unnamed: 0,Case_ID,Pathologist ID,Image1_Path,Image2_Path,Image3_Path,Open Ended - Question 1,Open Ended - Answer 1,Open Ended - Question 2,Open Ended - Answer 2
0,1,CK,https://drive.google.com/open?id=1tgv50Q9W4Bm_...,,,What is the primary architectural pattern obse...,Nodular pattern,What is a common component of lymph node archi...,Germinal centers are absent
1,2,CK,https://drive.google.com/open?id=1igYpj4RL0XKx...,,,What is the predominant cell type seen here?,The main cell type observed here is a lymphocy...,What is the best description for the cell size...,These would best be characterized as small lym...
2,3,CK,https://drive.google.com/open?id=1DKNZJQJ17SkX...,,,"In this mantle cell lymphoma, what is the best...",The nuclei of the abnormal lymphocytes are mos...,What is the best description for the density a...,This should be considered a highly cellular ti...
3,4,CK,https://drive.google.com/open?id=1jUe0z6wlZ0s9...,,,"In a child with poor weight gain, what are the...",Features that argue against celiac disease in ...,What does the granuloma in the lamina propria ...,Granulomas here are often the consequence of u...
4,5,CK,https://drive.google.com/open?id=1VNr78I5w671g...,,,What are the main histologic features of a non...,Non-necrotizing granulomas are characterized b...,,


Splitting cases that have several image paths into one row per image, so that every image carries the questions and answers of its case

In [6]:
image_columns = ['Image1_Path', 'Image2_Path', 'Image3_Path']
records = []
for _, row in vqa_data_filt.iterrows():
    # Emit one record per available image of the case (first, second, third)
    for image_number, image_column in enumerate(image_columns, start=1):
        image_path = row[image_column]
        if pd.notnull(image_path):
            record = row.to_dict()
            record['Image_Path'] = image_path
            record['Image_ID'] = f"img_pathopen_{row['Case_ID']}_{image_number:02d}"
            records.append(record)

# Build the frame once instead of concatenating inside the loop
vqa_data_filt_final = pd.DataFrame(records)

vqa_data_filt_final.head()

Unnamed: 0,Case_ID,Pathologist ID,Image1_Path,Image2_Path,Image3_Path,Open Ended - Question 1,Open Ended - Answer 1,Open Ended - Question 2,Open Ended - Answer 2,Image_Path,Image_ID
0,1,CK,https://drive.google.com/open?id=1tgv50Q9W4Bm_...,,,What is the primary architectural pattern obse...,Nodular pattern,What is a common component of lymph node archi...,Germinal centers are absent,https://drive.google.com/open?id=1tgv50Q9W4Bm_...,img_pathopen_1_01
1,2,CK,https://drive.google.com/open?id=1igYpj4RL0XKx...,,,What is the predominant cell type seen here?,The main cell type observed here is a lymphocy...,What is the best description for the cell size...,These would best be characterized as small lym...,https://drive.google.com/open?id=1igYpj4RL0XKx...,img_pathopen_2_01
2,3,CK,https://drive.google.com/open?id=1DKNZJQJ17SkX...,,,"In this mantle cell lymphoma, what is the best...",The nuclei of the abnormal lymphocytes are mos...,What is the best description for the density a...,This should be considered a highly cellular ti...,https://drive.google.com/open?id=1DKNZJQJ17SkX...,img_pathopen_3_01
3,4,CK,https://drive.google.com/open?id=1jUe0z6wlZ0s9...,,,"In a child with poor weight gain, what are the...",Features that argue against celiac disease in ...,What does the granuloma in the lamina propria ...,Granulomas here are often the consequence of u...,https://drive.google.com/open?id=1jUe0z6wlZ0s9...,img_pathopen_4_01
4,5,CK,https://drive.google.com/open?id=1VNr78I5w671g...,,,What are the main histologic features of a non...,Non-necrotizing granulomas are characterized b...,,,https://drive.google.com/open?id=1VNr78I5w671g...,img_pathopen_5_01
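
The same wide-to-long reshaping can also be written without an explicit row loop. The following is a minimal sketch on made-up toy data, assuming the column names used above; `to_long_format` and `Image_Slot` are hypothetical names introduced here for illustration. Unlike the loop, it does not keep the `Image1_Path` to `Image3_Path` columns, which the notebook drops in the next step anyway.

```python
import pandas as pd


def to_long_format(df, image_columns):
    """Return one row per (case, image) pair, skipping missing image paths."""
    id_columns = [c for c in df.columns if c not in image_columns]
    long_df = df.melt(id_vars=id_columns, value_vars=image_columns,
                      var_name='Image_Slot', value_name='Image_Path')
    long_df = long_df.dropna(subset=['Image_Path']).copy()
    # 'Image2_Path' -> 2, used for the two-digit suffix of the image ID
    slot_number = long_df['Image_Slot'].str.extract(r'Image(\d+)_Path', expand=False).astype(int)
    long_df['Image_ID'] = ('img_pathopen_' + long_df['Case_ID'].astype(str)
                           + '_' + slot_number.astype(str).str.zfill(2))
    # Keep the images of one case next to each other, as the loop does
    long_df = long_df.sort_values(['Case_ID', 'Image_Slot'])
    return long_df.drop(columns='Image_Slot').reset_index(drop=True)


# Toy data, made up for illustration
toy = pd.DataFrame({
    'Case_ID': [1, 2],
    'Image1_Path': ['link_a', 'link_b'],
    'Image2_Path': [None, 'link_c'],
    'Image3_Path': [None, None],
})
long_toy = to_long_format(toy, ['Image1_Path', 'Image2_Path', 'Image3_Path'])
print(long_toy)
```

Sorting on the slot name is only safe for up to nine image columns, which covers the five image columns in this dataset.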


Removing the unnecessary columns

In [7]:
vqa_data_filt_final = vqa_data_filt_final.drop(columns=['Image1_Path', 'Image2_Path', 'Image3_Path'])
vqa_data_filt_final.head()

Unnamed: 0,Case_ID,Pathologist ID,Open Ended - Question 1,Open Ended - Answer 1,Open Ended - Question 2,Open Ended - Answer 2,Image_Path,Image_ID
0,1,CK,What is the primary architectural pattern obse...,Nodular pattern,What is a common component of lymph node archi...,Germinal centers are absent,https://drive.google.com/open?id=1tgv50Q9W4Bm_...,img_pathopen_1_01
1,2,CK,What is the predominant cell type seen here?,The main cell type observed here is a lymphocy...,What is the best description for the cell size...,These would best be characterized as small lym...,https://drive.google.com/open?id=1igYpj4RL0XKx...,img_pathopen_2_01
2,3,CK,"In this mantle cell lymphoma, what is the best...",The nuclei of the abnormal lymphocytes are mos...,What is the best description for the density a...,This should be considered a highly cellular ti...,https://drive.google.com/open?id=1DKNZJQJ17SkX...,img_pathopen_3_01
3,4,CK,"In a child with poor weight gain, what are the...",Features that argue against celiac disease in ...,What does the granuloma in the lamina propria ...,Granulomas here are often the consequence of u...,https://drive.google.com/open?id=1jUe0z6wlZ0s9...,img_pathopen_4_01
4,5,CK,What are the main histologic features of a non...,Non-necrotizing granulomas are characterized b...,,,https://drive.google.com/open?id=1VNr78I5w671g...,img_pathopen_5_01


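### Generating the Google Sheet that maps Google Drive links to image names

This sheet maps the file names of the images downloaded from the collected data to the Google Drive links stored in our data. The script in the next cell is Google Apps Script and is kept as comments because it runs in Apps Script, not in this Python notebook. It writes one row per file with its name and link.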

In [8]:
# function listFilesInNestedFolders(parentFolderId) {
#   // Get the initial parent folder
#   var parentfolder_id = '1VdFeRx06PrTTCbeMVQ0wQ8ML8QF8sSxM-QWlTcMx7TKQAjpWRuW5-EWJi-DtuBDPeXTbGYNm'
#   const parentFolder = DriveApp.getFolderById(parentfolder_id);
#   var folderlisting = 'File Names and Links - '+ parentfolder_id;

#   var ss = SpreadsheetApp.create(folderlisting);
#   var sheet = ss.getActiveSheet();
#   sheet.appendRow(['name','link']);

#   // Iterate over the direct subfolders of the parent folder (deeper levels are not visited)
#   var subfolders = parentFolder.getFolders();
#   while (subfolders.hasNext()) {
#     var subfolder = subfolders.next();
#     // Get files in the current folder
#     var contents = subfolder.getFiles();
#     var file;
#     var name;
#     var link;
#     while (contents.hasNext()) {
#       file = contents.next();
#       name = file.getName();
#       link = file.getUrl();
#       sheet.appendRow([name,link]);
#     }
#   }
# }

Now download the generated sheet as a CSV file. It contains the mapping between Google Drive links and file names.

In [9]:
data_dir = "data_eval"
file_name = "file_names_and_links_PathOpen_Images.csv"
file_path = os.path.join(data_dir, file_name)
google_drive_name_links = pd.read_csv(file_path)
google_drive_name_links.head()

Unnamed: 0,name,link
0,ck_HSP_5x - Chandra Krishnan.jpeg,https://drive.google.com/file/d/1SYbssi84BMfkC...
1,ck_BALL_4x - Chandra Krishnan.jpeg,https://drive.google.com/file/d/1XrSu7uRm3s-al...
2,ck_pschwannoma_20x - Chandra Krishnan.jpeg,https://drive.google.com/file/d/1dGZ8MAgdKvT5w...
3,ck_CMV gastritis_4x - Chandra Krishnan.jpeg,https://drive.google.com/file/d/11UMXpBZjiQdo0...
4,ck_rhabdo_5x - Chandra Krishnan.jpeg,https://drive.google.com/file/d/174myggtY1Fj5A...


Extracting the Google Drive file IDs from the image links in both tables

In [10]:
vqa_data_filt_final['google_drive_image_id'] = vqa_data_filt_final['Image_Path'].apply(lambda x: x.split('=')[-1])
google_drive_name_links['google_drive_image_id'] = google_drive_name_links['link'].apply(lambda x: x.split('/')[-1])
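
The two `split` calls above assume that the file ID is the last piece of each link. As a hedged alternative, the sketch below extracts the ID with regular expressions so that a missing value or a trailing segment such as `/view?usp=...` does not produce a wrong ID. The helper name `extract_drive_id` and the example IDs are made up, and the trailing-segment case is an assumption rather than something observed in the data shown here.

```python
import re

# Patterns for the two link shapes that appear in this notebook, plus an
# optional trailing segment such as '/view?usp=...' (assumed, not observed).
_DRIVE_ID_PATTERNS = (
    re.compile(r'[?&]id=([\w-]+)'),   # https://drive.google.com/open?id=<ID>
    re.compile(r'/file/d/([\w-]+)'),  # https://drive.google.com/file/d/<ID>[/view...]
)


def extract_drive_id(link):
    """Return the Drive file ID embedded in `link`, or None if none is found."""
    if not isinstance(link, str):
        return None  # e.g. NaN from a missing cell
    for pattern in _DRIVE_ID_PATTERNS:
        match = pattern.search(link)
        if match:
            return match.group(1)
    return None


# Placeholder IDs, made up for illustration
id_from_open = extract_drive_id('https://drive.google.com/open?id=abc_123-XYZ')
id_from_file = extract_drive_id('https://drive.google.com/file/d/abc_123-XYZ/view?usp=drivesdk')
id_from_nan = extract_drive_id(float('nan'))
print(id_from_open, id_from_file, id_from_nan)
```

Such a helper could be passed to `.apply(...)` in place of the two lambdas, so that both tables use the same extraction rule.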

Merging the image file names into the question data, using the Google Drive file ID as the key

In [11]:
vqa_data_filt_final = pd.merge(vqa_data_filt_final, google_drive_name_links[['google_drive_image_id', 'name']], on='google_drive_image_id', how='left')
vqa_data_filt_final.head()

Unnamed: 0,Case_ID,Pathologist ID,Open Ended - Question 1,Open Ended - Answer 1,Open Ended - Question 2,Open Ended - Answer 2,Image_Path,Image_ID,google_drive_image_id,name
0,1,CK,What is the primary architectural pattern obse...,Nodular pattern,What is a common component of lymph node archi...,Germinal centers are absent,https://drive.google.com/open?id=1tgv50Q9W4Bm_...,img_pathopen_1_01,1tgv50Q9W4Bm_Mk7my6Ag8mD0JWkQkYvx,MCL_20x - Chandra Krishnan.jpg
1,2,CK,What is the predominant cell type seen here?,The main cell type observed here is a lymphocy...,What is the best description for the cell size...,These would best be characterized as small lym...,https://drive.google.com/open?id=1igYpj4RL0XKx...,img_pathopen_2_01,1igYpj4RL0XKxq0u5qlnv6ZvqJfFjGO4u,MCL_200x - Chandra Krishnan.jpg
2,3,CK,"In this mantle cell lymphoma, what is the best...",The nuclei of the abnormal lymphocytes are mos...,What is the best description for the density a...,This should be considered a highly cellular ti...,https://drive.google.com/open?id=1DKNZJQJ17SkX...,img_pathopen_3_01,1DKNZJQJ17SkXUrrO7jHMiKD8J6ittzus,MCL_400x - Chandra Krishnan.jpg
3,4,CK,"In a child with poor weight gain, what are the...",Features that argue against celiac disease in ...,What does the granuloma in the lamina propria ...,Granulomas here are often the consequence of u...,https://drive.google.com/open?id=1jUe0z6wlZ0s9...,img_pathopen_4_01,1jUe0z6wlZ0s9PZUgVj1yCL7UZav23jwz,Duodenum granuloma_40x - Chandra Krishnan.jpg
4,5,CK,What are the main histologic features of a non...,Non-necrotizing granulomas are characterized b...,,,https://drive.google.com/open?id=1VNr78I5w671g...,img_pathopen_5_01,1VNr78I5w671gkrIyHs7AC75yrjb1lU6_,Duodenum granuloma_100x - Chandra Krishnan.jpg
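
Because this is a left merge, question rows whose Drive ID has no counterpart in the file listing keep a missing `name`. The sketch below shows one way to surface such rows with the `indicator` option of `pd.merge`, using made-up toy frames that follow the notebook's column names. The `validate='many_to_one'` check is an extra assumption that each Drive ID appears only once in the file listing.

```python
import pandas as pd

# Toy frames, made up for illustration; column names follow the notebook
questions = pd.DataFrame({
    'Image_ID': ['img_pathopen_1_01', 'img_pathopen_2_01'],
    'google_drive_image_id': ['id_a', 'id_missing'],
})
drive_files = pd.DataFrame({
    'google_drive_image_id': ['id_a', 'id_b'],
    'name': ['a.jpg', 'b.jpg'],
})

merged = pd.merge(questions, drive_files, on='google_drive_image_id',
                  how='left', indicator=True, validate='many_to_one')

# '_merge' is 'left_only' for question rows without a matching Drive file
unmatched = merged.loc[merged['_merge'] == 'left_only', 'Image_ID'].tolist()
print(unmatched)
```

Rows reported here would have no usable file name later on, so it is worth checking them before building any lookup keyed on `name`.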


Creating a dictionary that maps each image file name to its `Image_ID`

In [12]:
name_imageid_dict = dict(zip(vqa_data_filt_final['name'], vqa_data_filt_final['Image_ID']))

### Reading the dataset files

Preparing the augmented images for each filename

In [13]:
source_dataset_dir = "PathOPEN_Eval_Images"
destination_dataset_dir = "PathOPEN_Eval_Images_Augmented"
pathopen_file_names = os.listdir(source_dataset_dir)

Taking only the first 10 images for now as a test run

In [14]:
pathopen_file_names = pathopen_file_names[:10]
pathopen_file_names

['ck_angiosarc_4x - Chandra Krishnan.jpeg',
 'ck_branchial arch_20x - Chandra Krishnan.jpg',
 'ck_colon carcinoma_200x - Chandra Krishnan.jpg',
 'ck_acute osteomyelitis_20x - Chandra Krishnan.jpg',
 'ck_CMV gastritis_10x - Chandra Krishnan.jpeg',
 'ck_SSA colon_10x - Chandra Krishnan.jpg',
 'ck_granularct_10x - Chandra Krishnan.jpeg',
 'Case 4.200x - Amy Coffey.jpg',
 'ck_graves_10x - Chandra Krishnan.jpeg',
 'ac_img29_100x - Amy Coffey.jpeg']

Now perform the augmentation on these images

- Keep the code commented out when it is not needed, since it takes a long time to execute

In [15]:
# TOTAL_AUGMENTATIONS = 3  # Number of augmentations per image
# for src_image in pathopen_file_names:
#     src_image_path = os.path.join(source_dataset_dir, src_image)
#     src_image_id = name_imageid_dict[src_image]
#     dst_image_aug_folder = os.path.join(destination_dataset_dir, src_image_id + "_augmented")
#     os.makedirs(dst_image_aug_folder, exist_ok=True)

#     pil_image = Image.open(src_image_path).convert("RGB")
#     np_array_img = np.array(pil_image)
#     for i in tqdm(range(TOTAL_AUGMENTATIONS), desc=f"Augmenting {src_image_id}"):
#         # Apply RandAugment with specified parameters
#         conv_img_arr = distort_image_with_randaugment(np_array_img, num_layers=5, magnitude=5, randomize=True, randaugment_transforms_set='custom')
#         conv_img = Image.fromarray(conv_img_arr)
#         conv_img.save(os.path.join(dst_image_aug_folder, src_image_id + f"_aug_{i}.png"))
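
The loop above writes each source image's augmentations into a folder named `<Image_ID>_augmented`, with files named `<Image_ID>_aug_<i>.png`, and it raises a `KeyError` for any file name that is missing from the name-to-ID dictionary. The following is a small sketch of that naming scheme that also collects unmapped files instead of failing. `plan_augmentation_outputs` is a hypothetical helper, and it only builds paths; it does not read or augment any image.

```python
import os


def plan_augmentation_outputs(file_names, name_to_image_id, destination_dir, total_augmentations):
    """Return (planned, skipped): output paths per source file, and files without an ID."""
    planned, skipped = {}, []
    for file_name in file_names:
        image_id = name_to_image_id.get(file_name)
        if image_id is None:
            # No Image_ID known for this file name, so it cannot be linked to questions
            skipped.append(file_name)
            continue
        folder = os.path.join(destination_dir, image_id + "_augmented")
        planned[file_name] = [
            os.path.join(folder, f"{image_id}_aug_{i}.png")
            for i in range(total_augmentations)
        ]
    return planned, skipped


# Made-up example values
planned, skipped = plan_augmentation_outputs(
    ['a.jpg', 'b.jpg'], {'a.jpg': 'img_pathopen_1_01'}, 'out', 2)
print(planned)
print(skipped)
```

Running such a check before the long augmentation run makes it clear up front which source images would be left out.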

### Upload the data to Google Drive

Now upload the augmented images to Google Drive. Then run the following Google Apps Script to get the mapping between the file names and their Google Drive links

In [16]:
# function listFilesInNestedFolders(parentFolderId) {
#   // Get the initial parent folder
#   var parentfolder_id = '1jYUpk4fDC89VAp-bsBoZZfCjARUyway4'
#   const parentFolder = DriveApp.getFolderById(parentfolder_id);
#   var folderlisting = 'File Names and Links - '+ parentfolder_id;

#   var ss = SpreadsheetApp.create(folderlisting);
#   var sheet = ss.getActiveSheet();
#   sheet.appendRow(['name','link']);

#   // Iterate over the direct subfolders of the parent folder (deeper levels are not visited)
#   var subfolders = parentFolder.getFolders();
#   while (subfolders.hasNext()) {
#     var subfolder = subfolders.next();
#     // Get files in the current folder
#     var contents = subfolder.getFiles();
#     var file;
#     var name;
#     var link;
#     while (contents.hasNext()) {
#       file = contents.next();
#       name = file.getName();
#       link = file.getUrl();
#       sheet.appendRow([name,link]);
#     }
#   }
# }

Now export the generated sheet as a CSV file and place it in `data_eval` so that we can map the image IDs to the Google Drive links

In [17]:
data_eval_dir = "data_eval"
file_name_augmented = "file_names_and_links_PathOpen_Augmented_Images.csv"
file_path_augmented = os.path.join(data_eval_dir, file_name_augmented)
google_drive_name_links_augmented = pd.read_csv(file_path_augmented)
google_drive_name_links_augmented.head()

Unnamed: 0,name,link
0,img_pathopen_142_01_aug_1.png,https://drive.google.com/file/d/125XX1eGzQhdqI...
1,img_pathopen_142_01_aug_2.png,https://drive.google.com/file/d/1Dh9BVobrSkfNa...
2,img_pathopen_142_01_aug_0.png,https://drive.google.com/file/d/1MmtxrhYVgg3_W...
3,img_pathopen_105_01_aug_2.png,https://drive.google.com/file/d/13p_FaDUype7JK...
4,img_pathopen_105_01_aug_1.png,https://drive.google.com/file/d/1nIsYSJCbRcmkf...


Extracting the parent image ID from each augmented file name

In [18]:
google_drive_name_links_augmented['Image_ID'] = google_drive_name_links_augmented['name'].apply(lambda x: re.match(r'(img_pathopen_\d+_\d+)(.*)', x).group(1))
google_drive_name_links_augmented = google_drive_name_links_augmented.rename(columns={'name': 'Augmented_Image_ID', 'link': 'Augmented_Image_Link'})
google_drive_name_links_augmented.head()

Unnamed: 0,Augmented_Image_ID,Augmented_Image_Link,Image_ID
0,img_pathopen_142_01_aug_1.png,https://drive.google.com/file/d/125XX1eGzQhdqI...,img_pathopen_142_01
1,img_pathopen_142_01_aug_2.png,https://drive.google.com/file/d/1Dh9BVobrSkfNa...,img_pathopen_142_01
2,img_pathopen_142_01_aug_0.png,https://drive.google.com/file/d/1MmtxrhYVgg3_W...,img_pathopen_142_01
3,img_pathopen_105_01_aug_2.png,https://drive.google.com/file/d/13p_FaDUype7JK...,img_pathopen_105_01
4,img_pathopen_105_01_aug_1.png,https://drive.google.com/file/d/1nIsYSJCbRcmkf...,img_pathopen_105_01
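
The `re.match(...).group(1)` call above raises an `AttributeError` if a file in the listing does not follow the `img_pathopen_<case>_<image>_aug_<i>.png` pattern. As a minimal sketch, a stricter parser can return `None` for such names and also recover the augmentation index. `parse_augmented_name` is a hypothetical helper name, and the non-matching file name is made up.

```python
import re

# Full-name pattern for the augmented files written by the augmentation cell
AUGMENTED_NAME = re.compile(r'(img_pathopen_\d+_\d+)_aug_(\d+)\.png')


def parse_augmented_name(file_name):
    """Return (parent Image_ID, augmentation index), or None if the name does not match."""
    match = AUGMENTED_NAME.fullmatch(file_name)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


parsed_ok = parse_augmented_name('img_pathopen_142_01_aug_1.png')
parsed_bad = parse_augmented_name('notes.txt')
print(parsed_ok, parsed_bad)
```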


### Creating the final dataset



Joining the filtered dataset with the augmented image listing on `Image_ID` gives every augmented image the open-ended questions and answers of its parent image, which is what a fair evaluation requires

In [19]:
vqa_data_augmented_images = pd.merge(vqa_data_filt_final, google_drive_name_links_augmented, on='Image_ID', how='inner')
vqa_data_augmented_images.head()

Unnamed: 0,Case_ID,Pathologist ID,Open Ended - Question 1,Open Ended - Answer 1,Open Ended - Question 2,Open Ended - Answer 2,Image_Path,Image_ID,google_drive_image_id,name,Augmented_Image_ID,Augmented_Image_Link
0,19,CK,What type of cells in the image show infiltrat...,The infiltrating epithelial cells are gland-fo...,What are the eosinophilic structures present i...,These are bundles of smooth muscle representin...,https://drive.google.com/open?id=1btAjgoKktd7d...,img_pathopen_19_01,1btAjgoKktd7dW6rh-Lxp8Rnqmh_ir47t,ck_colon carcinoma_200x - Chandra Krishnan.jpg,img_pathopen_19_01_aug_2.png,https://drive.google.com/file/d/1nytH9T_verYBs...
1,19,CK,What type of cells in the image show infiltrat...,The infiltrating epithelial cells are gland-fo...,What are the eosinophilic structures present i...,These are bundles of smooth muscle representin...,https://drive.google.com/open?id=1btAjgoKktd7d...,img_pathopen_19_01,1btAjgoKktd7dW6rh-Lxp8Rnqmh_ir47t,ck_colon carcinoma_200x - Chandra Krishnan.jpg,img_pathopen_19_01_aug_1.png,https://drive.google.com/file/d/1Gx8PPLrrnBf5x...
2,19,CK,What type of cells in the image show infiltrat...,The infiltrating epithelial cells are gland-fo...,What are the eosinophilic structures present i...,These are bundles of smooth muscle representin...,https://drive.google.com/open?id=1btAjgoKktd7d...,img_pathopen_19_01,1btAjgoKktd7dW6rh-Lxp8Rnqmh_ir47t,ck_colon carcinoma_200x - Chandra Krishnan.jpg,img_pathopen_19_01_aug_0.png,https://drive.google.com/file/d/1DHRxnjwOStQ8a...
3,32,AC,What are the diagnostic nuclear features of th...,"Nuclear enlargement, elongation and overlappin...",What are the architectural features of this tu...,This is a papillary neoplasm as defined by a f...,https://drive.google.com/open?id=1275UFlHvwON4...,img_pathopen_32_01,1275UFlHvwON4wQhKfszl12_nsded3F8j,Case 4.200x - Amy Coffey.jpg,img_pathopen_32_01_aug_2.png,https://drive.google.com/file/d/1l2GFOedtzkQ82...
4,32,AC,What are the diagnostic nuclear features of th...,"Nuclear enlargement, elongation and overlappin...",What are the architectural features of this tu...,This is a papillary neoplasm as defined by a f...,https://drive.google.com/open?id=1275UFlHvwON4...,img_pathopen_32_01,1275UFlHvwON4wQhKfszl12_nsded3F8j,Case 4.200x - Amy Coffey.jpg,img_pathopen_32_01_aug_1.png,https://drive.google.com/file/d/1JjwVzrEmsd7N2...
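
Since this is an inner join, a parent image whose augmentations were not uploaded or not listed simply ends up with fewer rows. The sketch below, on made-up toy data with the notebook's column names, counts the augmented files per parent image and reports any parent that does not have the expected number. The value 3 is taken from the augmentation cell, and `EXPECTED_AUGMENTATIONS` is a name introduced here.

```python
import pandas as pd

# Toy data, made up for illustration
augmented = pd.DataFrame({
    'Image_ID': ['img_a', 'img_a', 'img_a', 'img_b', 'img_b'],
    'Augmented_Image_ID': ['img_a_aug_0.png', 'img_a_aug_1.png', 'img_a_aug_2.png',
                           'img_b_aug_0.png', 'img_b_aug_1.png'],
})
EXPECTED_AUGMENTATIONS = 3  # matches the value used in the augmentation cell

# Number of distinct augmented files per parent image
counts = augmented.groupby('Image_ID')['Augmented_Image_ID'].nunique()
incomplete = counts[counts != EXPECTED_AUGMENTATIONS]
print(incomplete.to_dict())
```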


Creating the final CSV file to be uploaded for evaluation

In [20]:
vqa_data_augmented_images[['Case_ID', 'Augmented_Image_ID', 'Augmented_Image_Link', 'Open Ended - Question 1', 'Open Ended - Answer 1', 'Open Ended - Question 2', 'Open Ended - Answer 2']].to_csv('data_eval/pathopen_vqa_augmented.csv', index=False)