### PathVQA Data Preparation

This notebook prepares the PathVQA data for evaluation. We need to evaluate PathVQA on the following split:

Close-Ended: 50 \
Open-Ended: 550
- What
- Where
- How
- When
- Why

Since we want only the questions/answers that display sensitive knowledge, we consider only the samples in which both the question and the answer contain at least 5 words (the filtering cells use `>= 5`). If that does not yield the required number of samples, we can fall back to samples in which either the question or the answer contains at least 5 words.
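
The selection rule described here, including the fallback, might be sketched as follows. This is a minimal illustration rather than one of the notebook's cells: the helpers `word_count` and `select_samples`, the `target` and `min_words` parameters, and the toy samples are all assumptions introduced for this example.

```python
def word_count(text):
    """Number of whitespace-separated tokens in text."""
    return len(text.split())


def select_samples(samples, target, min_words=5):
    """Return up to `target` samples, preferring those where BOTH the
    question and the answer have at least `min_words` words, then topping
    up with samples where only one of the two is long enough."""
    strict, relaxed = [], []
    for sample in samples:
        q_ok = word_count(sample["question"]) >= min_words
        a_ok = word_count(sample["answer"]) >= min_words
        if q_ok and a_ok:
            strict.append(sample)
        elif q_ok or a_ok:
            relaxed.append(sample)
    return (strict + relaxed)[:target]


# Toy data (hypothetical, in the same dict layout as the PathVQA samples).
toy_samples = [
    {"question": "What is present?", "answer": "a large fibrotic scar in the kidney"},
    {"question": "What is this?", "answer": "liver"},
    {"question": "What shows dissolution of the tissue?", "answer": "an infarct in the brain"},
]

selected = select_samples(toy_samples, target=2)
for sample in selected:
    print(sample["question"], "->", sample["answer"])
```

With a target of 2, the strictly valid sample comes first and the fallback sample fills the remaining slot; the sample that fails both checks is never selected.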

In [71]:
import pandas as pd
import os
import pickle
from collections import Counter
import numpy as np
import shutil

#### Defining the paths for PVQA dataset

In [72]:
pvqa_data_path = "/data/mn27889/pvqa"
pvqa_images = os.path.join(pvqa_data_path, "images")
pvqa_qas = os.path.join(pvqa_data_path, "qas")

#### Taking the images/qas from `train` subset

In [73]:
pvqa_subset = "train"
pvqa_images_subset_path = os.path.join(pvqa_images, pvqa_subset)
pvqa_qas_subset_path = os.path.join(pvqa_qas, pvqa_subset)

#### Reading the QAS

In [74]:
file_name = "train_qa.pkl"
qas_file_path = os.path.join(pvqa_qas_subset_path, file_name)
with open(qas_file_path, 'rb') as file:
    pvqa_qas_subset = pickle.load(file)

print('Total Samples:',len(pvqa_qas_subset))

Total Samples: 19755


In [75]:
pvqa_qas_subset[0:5]

[{'image': 'train_0422',
  'question': 'Where are liver stem cells (oval cells) located?',
  'answer': 'in the canals of hering'},
 {'image': 'train_0422',
  'question': 'What are stained here with an immunohistochemical stain for cytokeratin 7?',
  'answer': 'bile duct cells and canals of hering'},
 {'image': 'train_0422',
  'question': 'What are bile duct cells and canals of Hering stained here with for cytokeratin 7?',
  'answer': 'an immunohistochemical stain'},
 {'image': 'train_0422',
  'question': 'Are bile duct cells and canals of Hering stained here with an immunohistochemical stain for cytokeratin 7?',
  'answer': 'yes'},
 {'image': 'train_0986',
  'question': 'What shows dissolution of the tissue?',
  'answer': 'an infarct in the brain'}]

#### Differentiating the open-ended and close-ended questions

In [76]:
pvqa_qas_close_ended = [sample for sample in pvqa_qas_subset if sample['answer'].lower() == 'yes' or sample['answer'].lower() == 'no']
len(pvqa_qas_close_ended)

9806

In [77]:
pvqa_qas_open_ended = [sample for sample in pvqa_qas_subset if sample['answer'].lower() not in ('yes', 'no')]
len(pvqa_qas_open_ended)

9949

In [78]:
assert len(pvqa_qas_open_ended) + len(pvqa_qas_close_ended) == len(pvqa_qas_subset)
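
The split above can also be written as a single pass with one shared predicate, which makes the two lists complementary by construction. The sketch below is only an illustration: `is_close_ended`, `split_by_answer_type`, and the toy samples are hypothetical names and data, not part of the notebook.

```python
def is_close_ended(sample):
    """True when the answer is exactly 'yes' or 'no' (case-insensitive)."""
    return sample["answer"].lower() in ("yes", "no")


def split_by_answer_type(samples):
    """Partition samples into (close_ended, open_ended) in one pass."""
    close_ended, open_ended = [], []
    for sample in samples:
        if is_close_ended(sample):
            close_ended.append(sample)
        else:
            open_ended.append(sample)
    return close_ended, open_ended


# Toy data (hypothetical) in the same layout as the PathVQA samples.
toy_samples = [
    {"image": "train_0422", "question": "Where are liver stem cells located?",
     "answer": "in the canals of hering"},
    {"image": "train_0422", "question": "Are bile duct cells stained here?",
     "answer": "Yes"},
    {"image": "train_0986", "question": "Does preserved show dissolution of the tissue?",
     "answer": "no"},
]

close_ended, open_ended = split_by_answer_type(toy_samples)
print("close-ended:", len(close_ended), "open-ended:", len(open_ended))
```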

#### Finding the samples whose question and answer contain at least 5 words
- For close-ended questions, check only the question text, since the answer is always 'yes' or 'no'
- For open-ended questions, check both the question and the answer

In [79]:
valid_close_ended_samples = [sample for sample in pvqa_qas_close_ended if len(sample['question'].split()) >= 5]
len(valid_close_ended_samples)

5913

In [80]:
valid_open_ended_samples = [sample for sample in pvqa_qas_open_ended if len(sample['question'].split()) >= 5 and len(sample['answer'].split()) >= 5]
len(valid_open_ended_samples)

1321

#### From open-ended questions, separating the questions starting with the following words:
- What
- Where
- How
- When
- Why

In [81]:
first_word = [sample['question'].split()[0].lower() for sample in valid_open_ended_samples]
counts = Counter(first_word)
print(counts)

Counter({'what': 1155, 'how': 108, 'why': 36, 'where': 15, 'when': 7})


In [82]:
what_question_samples = [sample for sample in valid_open_ended_samples if sample['question'].lower().startswith('what')]
how_question_samples = [sample for sample in valid_open_ended_samples if sample['question'].lower().startswith('how')]
why_question_samples = [sample for sample in valid_open_ended_samples if sample['question'].lower().startswith('why')]
where_question_samples = [sample for sample in valid_open_ended_samples if sample['question'].lower().startswith('where')]
when_question_samples = [sample for sample in valid_open_ended_samples if sample['question'].lower().startswith('when')]
print('Total What Questions:', len(what_question_samples))
print('Total How Questions:', len(how_question_samples))
print('Total Why Questions:', len(why_question_samples))
print('Total Where Questions:', len(where_question_samples))
print('Total When Questions:', len(when_question_samples))

Total What Questions: 1155
Total How Questions: 108
Total Why Questions: 36
Total Where Questions: 15
Total When Questions: 7
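
The grouping by leading question word can also be done in one pass with a dictionary. The sketch below is illustrative only: `group_by_first_word`, `QUESTION_WORDS`, and the toy samples are assumed names and data. Note that it matches the first token exactly, whereas `startswith('what')` would also accept a question that begins with a longer word such as "Whatever".

```python
QUESTION_WORDS = ("what", "where", "how", "when", "why")


def group_by_first_word(samples, question_words=QUESTION_WORDS):
    """Map each question word to the samples whose question starts with it."""
    groups = {word: [] for word in question_words}
    for sample in samples:
        tokens = sample["question"].split()
        first = tokens[0].lower() if tokens else ""
        if first in groups:
            groups[first].append(sample)
    return groups


# Toy data (hypothetical) in the same layout as the PathVQA samples.
toy_samples = [
    {"question": "What shows dissolution of the tissue?", "answer": "an infarct in the brain"},
    {"question": "How is the remote kidney infarct replaced?", "answer": "by a large fibrotic scar"},
    {"question": "Whatever is shown in this image?", "answer": "a section of the liver"},
    {"question": "Why are the alveolar septa thickened?", "answer": "due to congested capillaries"},
]

groups = group_by_first_word(toy_samples)
counts = {word: len(items) for word, items in groups.items()}
print(counts)
```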


### Compiling the data into a single dataframe

Selecting the first 50 Close-Ended samples

In [83]:
pvqa_eval_data_close_ended = pd.DataFrame(columns=['image_path', 'question', 'answer', 'question_type'])

for sample in valid_close_ended_samples[0:50]:
    image_path = os.path.join(pvqa_images_subset_path, sample['image'] + '.jpg')
    question = sample['question']
    answer = sample['answer']
    question_type = 'close-ended'
    pvqa_eval_data_close_ended.loc[len(pvqa_eval_data_close_ended)] = [image_path, question, answer, question_type]

pvqa_eval_data_close_ended.head()

Unnamed: 0,image_path,question,answer,question_type
0,/data/mn27889/pvqa/images/train/train_0422.jpg,Are bile duct cells and canals of Hering stain...,yes,close-ended
1,/data/mn27889/pvqa/images/train/train_0986.jpg,Does an infarct in the brain show dissolution ...,yes,close-ended
2,/data/mn27889/pvqa/images/train/train_0986.jpg,Does preserved show dissolution of the tissue?,no,close-ended
3,/data/mn27889/pvqa/images/train/train_0929.jpg,Do the areas of white chalky deposits represen...,yes,close-ended
4,/data/mn27889/pvqa/images/train/train_0929.jpg,Do the histone subunits represent foci of fat ...,no,close-ended


Selecting the first 400 What Open-Ended samples

In [84]:
pvqa_eval_data_open_what = pd.DataFrame(columns=['image_path', 'question', 'answer', 'question_type'])

for sample in what_question_samples[0:400]:
    image_path = os.path.join(pvqa_images_subset_path, sample['image'] + '.jpg')
    question = sample['question']
    answer = sample['answer']
    question_type = 'open-what'
    pvqa_eval_data_open_what.loc[len(pvqa_eval_data_open_what)] = [image_path, question, answer, question_type]

pvqa_eval_data_open_what.head()

Unnamed: 0,image_path,question,answer,question_type
0,/data/mn27889/pvqa/images/train/train_0422.jpg,What are stained here with an immunohistochemi...,bile duct cells and canals of hering,open-what
1,/data/mn27889/pvqa/images/train/train_0986.jpg,What shows dissolution of the tissue?,an infarct in the brain,open-what
2,/data/mn27889/pvqa/images/train/train_0929.jpg,What represent foci of fat necrosis with calci...,the areas of white chalky deposits,open-what
3,/data/mn27889/pvqa/images/train/train_0218.jpg,What represents an acute myocardial infarction...,the transmural light area in the posterolatera...,open-what
4,/data/mn27889/pvqa/images/train/train_0218.jpg,What were stained with triphenyltetra-zolium c...,all three transverse sections of myocardium,open-what


Selecting all samples of How Open-Ended

In [85]:
pvqa_eval_data_open_how = pd.DataFrame(columns=['image_path', 'question', 'answer', 'question_type'])

for sample in how_question_samples:
    image_path = os.path.join(pvqa_images_subset_path, sample['image'] + '.jpg')
    question = sample['question']
    answer = sample['answer']
    question_type = 'open-how'
    pvqa_eval_data_open_how.loc[len(pvqa_eval_data_open_how)] = [image_path, question, answer, question_type]

pvqa_eval_data_open_how.head()

Unnamed: 0,image_path,question,answer,question_type
0,/data/mn27889/pvqa/images/train/train_0929.jpg,How do the areas of white chalky deposits repr...,with calcium soap formation (saponification),open-how
1,/data/mn27889/pvqa/images/train/train_0220.jpg,How does low-power view of a cross section of ...,by a focal collection serous effusion,open-how
2,/data/mn27889/pvqa/images/train/train_0751.jpg,How is the remote kidney infarct replaced?,by a large fibrotic scar,open-how
3,/data/mn27889/pvqa/images/train/train_0363.jpg,How is acute cellular rejection of a kidney gr...,by inflammatory cells in the inter-stitium and...,open-how
4,/data/mn27889/pvqa/images/train/train_0969.jpg,How is vascular changes and fibrosis of saliva...,by radiation therapy of the neck region,open-how


Selecting all samples of Why Open-Ended

In [86]:
pvqa_eval_data_open_why = pd.DataFrame(columns=['image_path', 'question', 'answer', 'question_type'])

for sample in why_question_samples:
    image_path = os.path.join(pvqa_images_subset_path, sample['image'] + '.jpg')
    question = sample['question']
    answer = sample['answer']
    question_type = 'open-why'
    pvqa_eval_data_open_why.loc[len(pvqa_eval_data_open_why)] = [image_path, question, answer, question_type]

pvqa_eval_data_open_why.head()

Unnamed: 0,image_path,question,answer,question_type
0,/data/mn27889/pvqa/images/train/train_0261.jpg,Why are the alveolar septa thickened?,due to congested capillaries and neutrophilic ...,open-why
1,/data/mn27889/pvqa/images/train/train_0325.jpg,Why are the alveolar septa thickened?,due to congested capillaries and neutrophilic ...,open-why
2,/data/mn27889/pvqa/images/train/train_1542.jpg,Why does this image show intestines covered by...,due to ruptured peptic ulcer,open-why
3,/data/mn27889/pvqa/images/train/train_2695.jpg,"Why does this image show heart, right ventricu...",due to a patent ductus arteriosis in a patient...,open-why
4,/data/mn27889/pvqa/images/train/train_1927.jpg,Why is legs demonstrated with one about twice ...,due to malignant lymphoma involving lymphatic ...,open-why


Selecting all samples of Where Open-Ended

In [87]:
pvqa_eval_data_open_where = pd.DataFrame(columns=['image_path', 'question', 'answer', 'question_type'])

for sample in where_question_samples:
    image_path = os.path.join(pvqa_images_subset_path, sample['image'] + '.jpg')
    question = sample['question']
    answer = sample['answer']
    question_type = 'open-where'
    pvqa_eval_data_open_where.loc[len(pvqa_eval_data_open_where)] = [image_path, question, answer, question_type]

pvqa_eval_data_open_where.head()

Unnamed: 0,image_path,question,answer,question_type
0,/data/mn27889/pvqa/images/train/train_0422.jpg,Where are liver stem cells (oval cells) located?,in the canals of hering,open-where
1,/data/mn27889/pvqa/images/train/train_0929.jpg,Where do areas of white chalky deposits repres...,at sites of lipid breakdown in the mesentery,open-where
2,/data/mn27889/pvqa/images/train/train_0613.jpg,Where are small vegetations visible?,along the line of closure of the mitral valve...,open-where
3,/data/mn27889/pvqa/images/train/train_0391.jpg,Where is the terminal hepatic vein in the lobu...,at the center of the lobule,open-where
4,/data/mn27889/pvqa/images/train/train_0456.jpg,Where does hepatocellular iron appear blue?,in the prussian blue-stained section,open-where


Selecting all samples of When Open-Ended

In [88]:
pvqa_eval_data_open_when = pd.DataFrame(columns=['image_path', 'question', 'answer', 'question_type'])

for sample in when_question_samples:
    image_path = os.path.join(pvqa_images_subset_path, sample['image'] + '.jpg')
    question = sample['question']
    answer = sample['answer']
    question_type = 'open-when'
    pvqa_eval_data_open_when.loc[len(pvqa_eval_data_open_when)] = [image_path, question, answer, question_type]

pvqa_eval_data_open_when.head()

Unnamed: 0,image_path,question,answer,question_type
0,/data/mn27889/pvqa/images/train/train_0176.jpg,When did the congenital capillary hemangioma a...,at 2 years of age after the lesion,open-when
1,/data/mn27889/pvqa/images/train/train_0319.jpg,When are most scars gone Masson trichrome stain?,after 1 year of abstinence,open-when
2,/data/mn27889/pvqa/images/train/train_0269.jpg,When are most scars gone?,after 1 year of abstinance,open-when
3,/data/mn27889/pvqa/images/train/train_0269.jpg,When are most scars gone Masson trichrome stain?,after 1 year of abstinence,open-when
4,/data/mn27889/pvqa/images/train/train_0503.jpg,When does periodic acid-Schiff stain ?,after diastase digestion of the liver,open-when
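
The six cells in this section share one pattern and differ only in the input list, the `question_type` label, and an optional cap on the number of rows. A helper along the following lines could replace the repeated loops; it is a sketch under assumptions, and `build_eval_frame`, its `limit` parameter, and the toy inputs are hypothetical. Building the frame from a list of row dicts also avoids growing the dataframe one `.loc` assignment at a time.

```python
import os

import pandas as pd

COLUMNS = ["image_path", "question", "answer", "question_type"]


def build_eval_frame(samples, images_dir, question_type, limit=None):
    """Build an evaluation dataframe from the first `limit` samples
    (all samples when `limit` is None)."""
    rows = [
        {
            "image_path": os.path.join(images_dir, sample["image"] + ".jpg"),
            "question": sample["question"],
            "answer": sample["answer"],
            "question_type": question_type,
        }
        for sample in samples[:limit]
    ]
    return pd.DataFrame(rows, columns=COLUMNS)


# Toy data (hypothetical) in the same layout as the PathVQA samples.
toy_samples = [
    {"image": "train_0422", "question": "Are bile duct cells stained here?", "answer": "yes"},
    {"image": "train_0986", "question": "Does preserved show dissolution of the tissue?", "answer": "no"},
]

frame = build_eval_frame(toy_samples, "images/train", "close-ended", limit=1)
print(frame)
```

For example, `build_eval_frame(what_question_samples, pvqa_images_subset_path, 'open-what', limit=400)` would correspond to the What cell, and omitting `limit` would keep every sample, as in the How, Why, Where, and When cells.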


### Uploading all the images to Google Drive and getting the Drive links

Since we will use a Google Form for the evaluation, all the images must be uploaded to a dedicated Google Drive folder, and the Drive link of each image must be provided to the evaluators.

1. Copy all the required PathVQA images from the server into a single folder
2. Upload the folder to Google Drive
3. Run a Google Apps Script that writes the name and link (URL) of every file in that Drive folder to a Google Sheet
4. Match the file names in the Google Sheet against the dataframes to attach the Drive URL of each image
5. Export the resulting dataframes as the final CSV files that will be provided to the evaluators

First, collect the images for all question types into a single folder to be uploaded to Google Drive

In [89]:
unique_images_path_all = np.concatenate([pvqa_eval_data_close_ended['image_path'].unique(),
                                        pvqa_eval_data_open_what['image_path'].unique(),
                                        pvqa_eval_data_open_how['image_path'].unique(),
                                        pvqa_eval_data_open_why['image_path'].unique(),
                                        pvqa_eval_data_open_where['image_path'].unique(),
                                        pvqa_eval_data_open_when['image_path'].unique()])
unique_images_path_all = np.unique(unique_images_path_all)
print('Total Unique Images for Evaluation:', len(unique_images_path_all))

Total Unique Images for Evaluation: 431


Copying all these images into a folder

In [90]:
pvqa_eval_images_dir = 'PathVQA_Eval_Images'
os.makedirs(pvqa_eval_images_dir, exist_ok=True)
for image_path in unique_images_path_all:
    shutil.copy(image_path, pvqa_eval_images_dir)
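
`shutil.copy` raises an error on the first path that does not exist, so it can help to collect the missing paths instead and review them before uploading. The following is a minimal sketch under that assumption; `copy_images` and the temporary toy files are hypothetical and only demonstrate the idea.

```python
import os
import shutil
import tempfile


def copy_images(image_paths, dest_dir):
    """Copy every existing image into dest_dir and return the missing paths."""
    os.makedirs(dest_dir, exist_ok=True)
    missing = []
    for path in image_paths:
        if os.path.isfile(path):
            shutil.copy(path, dest_dir)
        else:
            missing.append(path)
    return missing


# Toy demonstration in a temporary directory (all paths are hypothetical).
with tempfile.TemporaryDirectory() as tmp:
    existing = os.path.join(tmp, "train_0001.jpg")
    with open(existing, "wb") as handle:
        handle.write(b"placeholder bytes")
    absent = os.path.join(tmp, "train_9999.jpg")
    dest = os.path.join(tmp, "PathVQA_Eval_Images")
    missing = copy_images([existing, absent], dest)
    copied = sorted(os.listdir(dest))

print("copied:", copied)
print("missing count:", len(missing))
```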

Now upload this folder to Google Drive. Then run the following script in Apps Script (script.google.com)

In [91]:
# function listFolderContents2() {
#   var foldername = 'PathVQA_Eval_Images';
#   var folderlisting = 'File Names and Links - '+ foldername;

#   var folders = DriveApp.getFoldersByName(foldername);
#   var folder = folders.next();
#   var contents = folder.getFiles();

#   var ss = SpreadsheetApp.create(folderlisting);
#   var sheet = ss.getActiveSheet();
#   sheet.appendRow(['name','link']);

#   var file;
#   var name;
#   var link;
#   var row;

#   while(contents.hasNext()) {
#     file = contents.next();
#     name = file.getName();
#     link = file.getUrl();
#     sheet.appendRow([name,link]);
#   }
# };

After running the above script, a new Google Sheet is created with the names and Google Drive links of the files. Download that sheet as a CSV file and map it back to each individual question set to attach the Google Drive URL of every image

In [92]:
data_eval_dir = 'data_eval'
pathvqa_drive_links_file = os.path.join(data_eval_dir, "file_names_and_links_PathVQA_Eval_Images.csv")
pathvqa_drive_links = pd.read_csv(pathvqa_drive_links_file)
pathvqa_drive_links.head()

Unnamed: 0,name,link
0,train_0319.jpg,https://drive.google.com/file/d/1-d392xpZEnEeV...
1,train_1547.jpg,https://drive.google.com/file/d/10JhTOOAtQ9EHE...
2,train_0768.jpg,https://drive.google.com/file/d/11EKJE-cvgZ9Z0...
3,train_0817.jpg,https://drive.google.com/file/d/11YTNsn8zqLF4y...
4,train_0558.jpg,https://drive.google.com/file/d/11wL04IBd9hh2V...


Converting each file name to its full image path (and extracting the image ID) so that the links can be matched on `image_path` later on

In [93]:
pathvqa_drive_links['image_path'] = pathvqa_drive_links['name'].apply(lambda x: os.path.join(pvqa_images_subset_path, x))
pathvqa_drive_links['image_id'] = pathvqa_drive_links['name'].apply(lambda x: x.split('.')[0])
pathvqa_drive_links.head()

Unnamed: 0,name,link,image_path,image_id
0,train_0319.jpg,https://drive.google.com/file/d/1-d392xpZEnEeV...,/data/mn27889/pvqa/images/train/train_0319.jpg,train_0319
1,train_1547.jpg,https://drive.google.com/file/d/10JhTOOAtQ9EHE...,/data/mn27889/pvqa/images/train/train_1547.jpg,train_1547
2,train_0768.jpg,https://drive.google.com/file/d/11EKJE-cvgZ9Z0...,/data/mn27889/pvqa/images/train/train_0768.jpg,train_0768
3,train_0817.jpg,https://drive.google.com/file/d/11YTNsn8zqLF4y...,/data/mn27889/pvqa/images/train/train_0817.jpg,train_0817
4,train_0558.jpg,https://drive.google.com/file/d/11wL04IBd9hh2V...,/data/mn27889/pvqa/images/train/train_0558.jpg,train_0558


### Mapping the Google Drive Links with each question set separately

In [94]:
pvqa_eval_data_close_ended = pd.merge(pvqa_eval_data_close_ended, pathvqa_drive_links, on='image_path', how='left')
pvqa_eval_data_close_ended.head()

Unnamed: 0,image_path,question,answer,question_type,name,link,image_id
0,/data/mn27889/pvqa/images/train/train_0422.jpg,Are bile duct cells and canals of Hering stain...,yes,close-ended,train_0422.jpg,https://drive.google.com/file/d/1vkB1oNmZAxACt...,train_0422
1,/data/mn27889/pvqa/images/train/train_0986.jpg,Does an infarct in the brain show dissolution ...,yes,close-ended,train_0986.jpg,https://drive.google.com/file/d/1d-WzMJa2ov6Lk...,train_0986
2,/data/mn27889/pvqa/images/train/train_0986.jpg,Does preserved show dissolution of the tissue?,no,close-ended,train_0986.jpg,https://drive.google.com/file/d/1d-WzMJa2ov6Lk...,train_0986
3,/data/mn27889/pvqa/images/train/train_0929.jpg,Do the areas of white chalky deposits represen...,yes,close-ended,train_0929.jpg,https://drive.google.com/file/d/1_AZVUU96fliNQ...,train_0929
4,/data/mn27889/pvqa/images/train/train_0929.jpg,Do the histone subunits represent foci of fat ...,no,close-ended,train_0929.jpg,https://drive.google.com/file/d/1_AZVUU96fliNQ...,train_0929


In [95]:
pvqa_eval_data_open_what = pd.merge(pvqa_eval_data_open_what, pathvqa_drive_links, on='image_path', how='left')
pvqa_eval_data_open_what.head()

Unnamed: 0,image_path,question,answer,question_type,name,link,image_id
0,/data/mn27889/pvqa/images/train/train_0422.jpg,What are stained here with an immunohistochemi...,bile duct cells and canals of hering,open-what,train_0422.jpg,https://drive.google.com/file/d/1vkB1oNmZAxACt...,train_0422
1,/data/mn27889/pvqa/images/train/train_0986.jpg,What shows dissolution of the tissue?,an infarct in the brain,open-what,train_0986.jpg,https://drive.google.com/file/d/1d-WzMJa2ov6Lk...,train_0986
2,/data/mn27889/pvqa/images/train/train_0929.jpg,What represent foci of fat necrosis with calci...,the areas of white chalky deposits,open-what,train_0929.jpg,https://drive.google.com/file/d/1_AZVUU96fliNQ...,train_0929
3,/data/mn27889/pvqa/images/train/train_0218.jpg,What represents an acute myocardial infarction...,the transmural light area in the posterolatera...,open-what,train_0218.jpg,https://drive.google.com/file/d/1lLxbrlQdOKPhp...,train_0218
4,/data/mn27889/pvqa/images/train/train_0218.jpg,What were stained with triphenyltetra-zolium c...,all three transverse sections of myocardium,open-what,train_0218.jpg,https://drive.google.com/file/d/1lLxbrlQdOKPhp...,train_0218


In [96]:
pvqa_eval_data_open_how = pd.merge(pvqa_eval_data_open_how, pathvqa_drive_links, on='image_path', how='left')
pvqa_eval_data_open_how.head()

Unnamed: 0,image_path,question,answer,question_type,name,link,image_id
0,/data/mn27889/pvqa/images/train/train_0929.jpg,How do the areas of white chalky deposits repr...,with calcium soap formation (saponification),open-how,train_0929.jpg,https://drive.google.com/file/d/1_AZVUU96fliNQ...,train_0929
1,/data/mn27889/pvqa/images/train/train_0220.jpg,How does low-power view of a cross section of ...,by a focal collection serous effusion,open-how,train_0220.jpg,https://drive.google.com/file/d/19hKmPHLSWTwdd...,train_0220
2,/data/mn27889/pvqa/images/train/train_0751.jpg,How is the remote kidney infarct replaced?,by a large fibrotic scar,open-how,train_0751.jpg,https://drive.google.com/file/d/1klCBlfIYFmmTf...,train_0751
3,/data/mn27889/pvqa/images/train/train_0363.jpg,How is acute cellular rejection of a kidney gr...,by inflammatory cells in the inter-stitium and...,open-how,train_0363.jpg,https://drive.google.com/file/d/1ntcdhPYam0fOV...,train_0363
4,/data/mn27889/pvqa/images/train/train_0969.jpg,How is vascular changes and fibrosis of saliva...,by radiation therapy of the neck region,open-how,train_0969.jpg,https://drive.google.com/file/d/1RB1wqp8yHJlrU...,train_0969


In [97]:
pvqa_eval_data_open_where = pd.merge(pvqa_eval_data_open_where, pathvqa_drive_links, on='image_path', how='left')
pvqa_eval_data_open_where.head()

Unnamed: 0,image_path,question,answer,question_type,name,link,image_id
0,/data/mn27889/pvqa/images/train/train_0422.jpg,Where are liver stem cells (oval cells) located?,in the canals of hering,open-where,train_0422.jpg,https://drive.google.com/file/d/1vkB1oNmZAxACt...,train_0422
1,/data/mn27889/pvqa/images/train/train_0929.jpg,Where do areas of white chalky deposits repres...,at sites of lipid breakdown in the mesentery,open-where,train_0929.jpg,https://drive.google.com/file/d/1_AZVUU96fliNQ...,train_0929
2,/data/mn27889/pvqa/images/train/train_0613.jpg,Where are small vegetations visible?,along the line of closure of the mitral valve...,open-where,train_0613.jpg,https://drive.google.com/file/d/1C6uuhtoTUct3W...,train_0613
3,/data/mn27889/pvqa/images/train/train_0391.jpg,Where is the terminal hepatic vein in the lobu...,at the center of the lobule,open-where,train_0391.jpg,https://drive.google.com/file/d/1ynaeS1SAZbDJX...,train_0391
4,/data/mn27889/pvqa/images/train/train_0456.jpg,Where does hepatocellular iron appear blue?,in the prussian blue-stained section,open-where,train_0456.jpg,https://drive.google.com/file/d/1t8IofCBeptbM1...,train_0456


In [98]:
pvqa_eval_data_open_why = pd.merge(pvqa_eval_data_open_why, pathvqa_drive_links, on='image_path', how='left')
pvqa_eval_data_open_why.head()

Unnamed: 0,image_path,question,answer,question_type,name,link,image_id
0,/data/mn27889/pvqa/images/train/train_0261.jpg,Why are the alveolar septa thickened?,due to congested capillaries and neutrophilic ...,open-why,train_0261.jpg,https://drive.google.com/file/d/1XHmMx3vg5R1vJ...,train_0261
1,/data/mn27889/pvqa/images/train/train_0325.jpg,Why are the alveolar septa thickened?,due to congested capillaries and neutrophilic ...,open-why,train_0325.jpg,https://drive.google.com/file/d/1b5XHnggxV1ChI...,train_0325
2,/data/mn27889/pvqa/images/train/train_1542.jpg,Why does this image show intestines covered by...,due to ruptured peptic ulcer,open-why,train_1542.jpg,https://drive.google.com/file/d/1ZqEcz1lMw7qRK...,train_1542
3,/data/mn27889/pvqa/images/train/train_2695.jpg,"Why does this image show heart, right ventricu...",due to a patent ductus arteriosis in a patient...,open-why,train_2695.jpg,https://drive.google.com/file/d/1Ild5O7BUKIiFc...,train_2695
4,/data/mn27889/pvqa/images/train/train_1927.jpg,Why is legs demonstrated with one about twice ...,due to malignant lymphoma involving lymphatic ...,open-why,train_1927.jpg,https://drive.google.com/file/d/1TJ4lpS1dH4inD...,train_1927


In [99]:
pvqa_eval_data_open_when = pd.merge(pvqa_eval_data_open_when, pathvqa_drive_links, on='image_path', how='left')
pvqa_eval_data_open_when.head()

Unnamed: 0,image_path,question,answer,question_type,name,link,image_id
0,/data/mn27889/pvqa/images/train/train_0176.jpg,When did the congenital capillary hemangioma a...,at 2 years of age after the lesion,open-when,train_0176.jpg,https://drive.google.com/file/d/1-62BeGXgnF428...,train_0176
1,/data/mn27889/pvqa/images/train/train_0319.jpg,When are most scars gone Masson trichrome stain?,after 1 year of abstinence,open-when,train_0319.jpg,https://drive.google.com/file/d/1-d392xpZEnEeV...,train_0319
2,/data/mn27889/pvqa/images/train/train_0269.jpg,When are most scars gone?,after 1 year of abstinance,open-when,train_0269.jpg,https://drive.google.com/file/d/1-vorKajYc2vh2...,train_0269
3,/data/mn27889/pvqa/images/train/train_0269.jpg,When are most scars gone Masson trichrome stain?,after 1 year of abstinence,open-when,train_0269.jpg,https://drive.google.com/file/d/1-vorKajYc2vh2...,train_0269
4,/data/mn27889/pvqa/images/train/train_0503.jpg,When does periodic acid-Schiff stain ?,after diastase digestion of the liver,open-when,train_0503.jpg,https://drive.google.com/file/d/11iMOgW3vFCQv3...,train_0503
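
A left merge silently leaves `link` empty for any image that is missing from the Drive listing, so it may be worth checking the result before exporting. The sketch below uses the `validate` and `indicator` options of `pd.merge`; the helper `attach_links` and the toy frames (including the placeholder URL) are assumptions made for illustration.

```python
import pandas as pd


def attach_links(questions, links):
    """Left-merge Drive links onto a question set and report image paths
    that have no matching link."""
    merged = pd.merge(
        questions,
        links,
        on="image_path",
        how="left",
        validate="many_to_one",  # raises if an image_path repeats in `links`
        indicator=True,          # adds a `_merge` column: 'both' or 'left_only'
    )
    unmatched = (
        merged.loc[merged["_merge"] == "left_only", "image_path"].unique().tolist()
    )
    return merged.drop(columns="_merge"), unmatched


# Toy frames (hypothetical paths and a placeholder link).
questions = pd.DataFrame({
    "image_path": ["a.jpg", "a.jpg", "b.jpg"],
    "question": ["Q1", "Q2", "Q3"],
})
links = pd.DataFrame({
    "image_path": ["a.jpg"],
    "link": ["https://example.invalid/a"],
})

merged, unmatched = attach_links(questions, links)
print("rows:", len(merged), "unmatched:", unmatched)
```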


### Creating the final dataset



Joining all these datasets into one dataframe to extract the final dataset to be used for evaluation of PathVQA

In [100]:
# Concatenate the full question sets; calling .head() on each one would keep only 5 rows per set
vqa_data_pathvqa = pd.concat([pvqa_eval_data_close_ended,
                            pvqa_eval_data_open_what,
                            pvqa_eval_data_open_how,
                            pvqa_eval_data_open_why,
                            pvqa_eval_data_open_where,
                            pvqa_eval_data_open_when]).reset_index(drop=True)

vqa_data_pathvqa.head()

Unnamed: 0,image_path,question,answer,question_type,name,link,image_id
0,/data/mn27889/pvqa/images/train/train_0422.jpg,Are bile duct cells and canals of Hering stain...,yes,close-ended,train_0422.jpg,https://drive.google.com/file/d/1vkB1oNmZAxACt...,train_0422
1,/data/mn27889/pvqa/images/train/train_0986.jpg,Does an infarct in the brain show dissolution ...,yes,close-ended,train_0986.jpg,https://drive.google.com/file/d/1d-WzMJa2ov6Lk...,train_0986
2,/data/mn27889/pvqa/images/train/train_0986.jpg,Does preserved show dissolution of the tissue?,no,close-ended,train_0986.jpg,https://drive.google.com/file/d/1d-WzMJa2ov6Lk...,train_0986
3,/data/mn27889/pvqa/images/train/train_0929.jpg,Do the areas of white chalky deposits represen...,yes,close-ended,train_0929.jpg,https://drive.google.com/file/d/1_AZVUU96fliNQ...,train_0929
4,/data/mn27889/pvqa/images/train/train_0929.jpg,Do the histone subunits represent foci of fat ...,no,close-ended,train_0929.jpg,https://drive.google.com/file/d/1_AZVUU96fliNQ...,train_0929


In [102]:
vqa_data_pathvqa[['image_id', 'link', 'question_type', 'question', 'answer']].to_csv('data_eval/pathvqa_data.csv')