### PathVQA Data Preparation

This notebook prepares the PathVQA data for evaluation. The evaluation set is composed as follows:

Close-Ended: 50 \
Open-Ended: 550
- What
- Where
- How
- When
- Why

Since we want only the questions/answers displaying sensitive knowledge, we consider only the samples in which both the question and the answer are at least 5 words long (the cells below test `>= 5`). If that does not yield the required number of samples, we can fall back to samples in which either the answer or the question is at least 5 words long.
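The selection rule with its fallback can be sketched as follows. This is a minimal illustration, not the notebook's own code: `word_count`, `select_samples`, and the `target` parameter are hypothetical names, the second and third toy records are invented, and the threshold follows the `>= 5` comparison used in the cells below.

```python
def word_count(text):
    """Number of whitespace-separated tokens in text."""
    return len(text.split())


def select_samples(samples, target, min_words=5):
    """Pick up to `target` samples, preferring those where both fields are long enough."""
    strict = [s for s in samples
              if word_count(s["question"]) >= min_words
              and word_count(s["answer"]) >= min_words]
    if len(strict) >= target:
        return strict[:target]
    # Fallback: top up with samples where only one of the two fields is long enough.
    relaxed = [s for s in samples
               if s not in strict
               and (word_count(s["question"]) >= min_words
                    or word_count(s["answer"]) >= min_words)]
    return strict + relaxed[:target - len(strict)]


# Toy records; only the first one is taken from the dataset preview.
toy = [
    {"question": "What shows dissolution of the tissue?", "answer": "an infarct in the brain"},
    {"question": "What is present?", "answer": "a large area of tumor necrosis"},
    {"question": "What is shown?", "answer": "liver"},
]
picked = select_samples(toy, target=2)
```

With `target=2`, the first record passes the strict rule and the second is added by the fallback because its answer alone is long enough.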

In [1]:
import sys
from PIL import Image
import pandas as pd
import os
import pickle
from collections import Counter
import numpy as np
import shutil

#### Defining the paths for PVQA dataset

In [2]:
pvqa_data_path = "/data/mn27889/path-open-data/pathvqa-histopathology"
pvqa_images = os.path.join(pvqa_data_path, "images")
pvqa_qas = os.path.join(pvqa_data_path, "qas")

#### Considering images/qas from `train` subset

In [3]:
pvqa_subset = "train"
pvqa_images_subset_path = os.path.join(pvqa_images, pvqa_subset)
pvqa_qas_subset_path = os.path.join(pvqa_qas, pvqa_subset)

#### Reading the QAS

In [4]:
file_name = "train_qa.pkl"
qas_file_path = os.path.join(pvqa_qas_subset_path, file_name)
with open(qas_file_path, 'rb') as file:
    pvqa_qas_subset = pickle.load(file)

print('Total Samples:',len(pvqa_qas_subset))

Total Samples: 6723


In [5]:
pvqa_qas_subset[0:5]

[{'image': 'train_0422',
  'question': 'Where are liver stem cells (oval cells) located?',
  'answer': 'in the canals of hering'},
 {'image': 'train_0422',
  'question': 'What are stained here with an immunohistochemical stain for cytokeratin 7?',
  'answer': 'bile duct cells and canals of hering'},
 {'image': 'train_0422',
  'question': 'What are bile duct cells and canals of Hering stained here with for cytokeratin 7?',
  'answer': 'an immunohistochemical stain'},
 {'image': 'train_0422',
  'question': 'Are bile duct cells and canals of Hering stained here with an immunohistochemical stain for cytokeratin 7?',
  'answer': 'yes'},
 {'image': 'train_0986',
  'question': 'What shows dissolution of the tissue?',
  'answer': 'an infarct in the brain'}]

#### Differentiating the open-ended and close-ended questions

In [6]:
pvqa_qas_close_ended = [sample for sample in pvqa_qas_subset if sample['answer'].lower() == 'yes' or sample['answer'].lower() == 'no']
len(pvqa_qas_close_ended)

3418

In [7]:
pvqa_qas_open_ended = [sample for sample in pvqa_qas_subset if sample not in pvqa_qas_close_ended]
len(pvqa_qas_open_ended)

3305

In [8]:
assert len(pvqa_qas_open_ended) + len(pvqa_qas_close_ended) == len(pvqa_qas_subset)
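The `sample not in pvqa_qas_close_ended` test compares every sample against the whole close-ended list, so its cost grows quadratically with the dataset size. A single-pass partition is one possible alternative; the sketch below uses a hypothetical helper name (`split_by_answer_type`) and applies the same yes/no test as the cells above.

```python
def split_by_answer_type(samples):
    """Partition samples into (close_ended, open_ended) in one pass."""
    close_ended, open_ended = [], []
    for sample in samples:
        # Same rule as the notebook: an answer of 'yes' or 'no' marks a close-ended question.
        if sample["answer"].lower() in ("yes", "no"):
            close_ended.append(sample)
        else:
            open_ended.append(sample)
    return close_ended, open_ended


# Three records copied from the preview of the train subset.
toy = [
    {"image": "train_0422",
     "question": "Where are liver stem cells (oval cells) located?",
     "answer": "in the canals of hering"},
    {"image": "train_0422",
     "question": "Are bile duct cells and canals of Hering stained here with an immunohistochemical stain for cytokeratin 7?",
     "answer": "yes"},
    {"image": "train_0986",
     "question": "What shows dissolution of the tissue?",
     "answer": "an infarct in the brain"},
]
close_ended, open_ended = split_by_answer_type(toy)
```

Because every sample lands in exactly one list, the two lengths always sum to the input length, which is the property the assertion above checks.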

#### Finding the samples for which question and answer are equal to or more than 5 words.
- For close-ended questions, check only the question text, since the answer is always 'yes' or 'no'
- For open-ended questions, check both the question and the answer

In [9]:
valid_close_ended_samples = [sample for sample in pvqa_qas_close_ended if len(sample['question'].split()) >= 5]
len(valid_close_ended_samples)

2186

In [10]:
valid_open_ended_samples = [sample for sample in pvqa_qas_open_ended if len(sample['question'].split()) >= 5 and len(sample['answer'].split()) >= 5]
len(valid_open_ended_samples)

449

#### From open-ended questions, separating the questions starting with the following words:
- What
- Where
- How
- When
- Why

In [11]:
first_word = [sample['question'].split()[0].lower() for sample in valid_open_ended_samples]
counts = Counter(first_word)
print(counts)

Counter({'what': 391, 'how': 40, 'why': 10, 'where': 7, 'when': 1})


In [12]:
what_question_samples = [sample for sample in valid_open_ended_samples if sample['question'].lower().startswith('what')]
how_question_samples = [sample for sample in valid_open_ended_samples if sample['question'].lower().startswith('how')]
why_question_samples = [sample for sample in valid_open_ended_samples if sample['question'].lower().startswith('why')]
where_question_samples = [sample for sample in valid_open_ended_samples if sample['question'].lower().startswith('where')]
when_question_samples = [sample for sample in valid_open_ended_samples if sample['question'].lower().startswith('when')]
print('Total What Questions:', len(what_question_samples))
print('Total How Questions:', len(how_question_samples))
print('Total Why Questions:', len(why_question_samples))
print('Total Where Questions:', len(where_question_samples))
print('Total When Questions:', len(when_question_samples))

Total What Questions: 391
Total How Questions: 40
Total Why Questions: 10
Total Where Questions: 7
Total When Questions: 1
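The five list comprehensions can also be expressed as one grouping pass. The sketch below is an assumption-laden alternative rather than the notebook's method: `group_by_question_word` and `QUESTION_WORDS` are hypothetical names, and it matches on the first whitespace-separated token (as the `Counter` cell does) instead of `startswith`, so a question beginning with "Whatever" would not be counted under "what".

```python
from collections import defaultdict

QUESTION_WORDS = ("what", "where", "how", "when", "why")


def group_by_question_word(samples):
    """Map each leading question word to the samples whose question starts with it."""
    groups = defaultdict(list)
    for sample in samples:
        tokens = sample["question"].split()
        first = tokens[0].lower() if tokens else ""
        if first in QUESTION_WORDS:
            groups[first].append(sample)
    return dict(groups)


# Records copied from the preview of the train subset.
toy = [
    {"question": "Where are liver stem cells (oval cells) located?",
     "answer": "in the canals of hering"},
    {"question": "What shows dissolution of the tissue?",
     "answer": "an infarct in the brain"},
    {"question": "Are bile duct cells and canals of Hering stained here with an immunohistochemical stain for cytokeratin 7?",
     "answer": "yes"},
]
groups = group_by_question_word(toy)
```

Questions that start with any other word (here the "Are ..." question) are simply left out of the result.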


### Compiling the data into a single dataframe

Selecting the first 50 valid Close-Ended samples

In [13]:
pvqa_eval_data_close_ended = pd.DataFrame(columns=['image_path', 'question', 'answer', 'question_type'])

for sample in valid_close_ended_samples[0:50]:
    image_path = os.path.join(pvqa_images_subset_path, sample['image'] + '.jpg')
    question = sample['question']
    answer = sample['answer']
    question_type = 'close-ended'
    pvqa_eval_data_close_ended.loc[len(pvqa_eval_data_close_ended)] = [image_path, question, answer, question_type]

pvqa_eval_data_close_ended.head()

Unnamed: 0,image_path,question,answer,question_type
0,/data/mn27889/path-open-data/pathvqa-histopath...,Are bile duct cells and canals of Hering stain...,yes,close-ended
1,/data/mn27889/path-open-data/pathvqa-histopath...,Does an infarct in the brain show dissolution ...,yes,close-ended
2,/data/mn27889/path-open-data/pathvqa-histopath...,Does preserved show dissolution of the tissue?,no,close-ended
3,/data/mn27889/path-open-data/pathvqa-histopath...,Are iron deposits shown by a special staining ...,yes,close-ended
4,/data/mn27889/path-open-data/pathvqa-histopath...,Do the photomicrographs show an inflammatory r...,yes,close-ended


Selecting all the samples of What Open-Ended

In [14]:
pvqa_eval_data_open_what = pd.DataFrame(columns=['image_path', 'question', 'answer', 'question_type'])

for sample in what_question_samples:
    image_path = os.path.join(pvqa_images_subset_path, sample['image'] + '.jpg')
    question = sample['question']
    answer = sample['answer']
    question_type = 'open-what'
    pvqa_eval_data_open_what.loc[len(pvqa_eval_data_open_what)] = [image_path, question, answer, question_type]

pvqa_eval_data_open_what.head()

Unnamed: 0,image_path,question,answer,question_type
0,/data/mn27889/path-open-data/pathvqa-histopath...,What are stained here with an immunohistochemi...,bile duct cells and canals of hering,open-what
1,/data/mn27889/path-open-data/pathvqa-histopath...,What shows dissolution of the tissue?,an infarct in the brain,open-what
2,/data/mn27889/path-open-data/pathvqa-histopath...,"What is tissue factor, in vivo?",the major initiator of coagulation,open-what
3,/data/mn27889/path-open-data/pathvqa-histopath...,What is the late-phase reaction characterized by?,an inflammatory infiltrate rich in eosinophils,open-what
4,/data/mn27889/path-open-data/pathvqa-histopath...,What is manifested by inflammatory cells in th...,acute cellular rejection of a kidney graft,open-what


Selecting all samples of How Open-Ended

In [15]:
pvqa_eval_data_open_how = pd.DataFrame(columns=['image_path', 'question', 'answer', 'question_type'])

for sample in how_question_samples:
    image_path = os.path.join(pvqa_images_subset_path, sample['image'] + '.jpg')
    question = sample['question']
    answer = sample['answer']
    question_type = 'open-how'
    pvqa_eval_data_open_how.loc[len(pvqa_eval_data_open_how)] = [image_path, question, answer, question_type]

pvqa_eval_data_open_how.head()

Unnamed: 0,image_path,question,answer,question_type
0,/data/mn27889/path-open-data/pathvqa-histopath...,How is acute cellular rejection of a kidney gr...,by inflammatory cells in the inter-stitium and...,open-how
1,/data/mn27889/path-open-data/pathvqa-histopath...,How is vascular changes and fibrosis of saliva...,by radiation therapy of the neck region,open-how
2,/data/mn27889/path-open-data/pathvqa-histopath...,How is vascular changes and fibrosis of saliva...,by radiation therapy of the neck region,open-how
3,/data/mn27889/path-open-data/pathvqa-histopath...,How is vascular changes and fibrosis of saliva...,by radiation therapy of the neck region,open-how
4,/data/mn27889/path-open-data/pathvqa-histopath...,How is acute endocarditis caused?,by staphylococcus aureus on a congenitally bic...,open-how


Selecting all samples of Why Open-Ended

In [16]:
pvqa_eval_data_open_why = pd.DataFrame(columns=['image_path', 'question', 'answer', 'question_type'])

for sample in why_question_samples:
    image_path = os.path.join(pvqa_images_subset_path, sample['image'] + '.jpg')
    question = sample['question']
    answer = sample['answer']
    question_type = 'open-why'
    pvqa_eval_data_open_why.loc[len(pvqa_eval_data_open_why)] = [image_path, question, answer, question_type]

pvqa_eval_data_open_why.head()

Unnamed: 0,image_path,question,answer,question_type
0,/data/mn27889/path-open-data/pathvqa-histopath...,Why are the alveolar septa thickened?,due to congested capillaries and neutrophilic ...,open-why
1,/data/mn27889/path-open-data/pathvqa-histopath...,Why does this image show that cut surface both...,due to mumps have no history at time,open-why
2,/data/mn27889/path-open-data/pathvqa-histopath...,"Why does this image show brain, infarct?",due to ruptured saccular aneurysm and thrombos...,open-why
3,/data/mn27889/path-open-data/pathvqa-histopath...,"Why does this image show brain, infarct?",due to ruptured saccular aneurysm and thrombos...,open-why
4,/data/mn27889/path-open-data/pathvqa-histopath...,Why does this image show spinal cord injury?,due to vertebral column trauma,open-why


Selecting all samples of Where Open-Ended

In [17]:
pvqa_eval_data_open_where = pd.DataFrame(columns=['image_path', 'question', 'answer', 'question_type'])

for sample in where_question_samples:
    image_path = os.path.join(pvqa_images_subset_path, sample['image'] + '.jpg')
    question = sample['question']
    answer = sample['answer']
    question_type = 'open-where'
    pvqa_eval_data_open_where.loc[len(pvqa_eval_data_open_where)] = [image_path, question, answer, question_type]

pvqa_eval_data_open_where.head()

Unnamed: 0,image_path,question,answer,question_type
0,/data/mn27889/path-open-data/pathvqa-histopath...,Where are liver stem cells (oval cells) located?,in the canals of hering,open-where
1,/data/mn27889/path-open-data/pathvqa-histopath...,Where does hepatocellular iron appear blue?,in the prussian blue-stained section,open-where
2,/data/mn27889/path-open-data/pathvqa-histopath...,Where are scattered immature adipocytes and mo...,myxoid liposarcoma with abundant ground substa...,open-where
3,/data/mn27889/path-open-data/pathvqa-histopath...,Where is presence of abundant coarse black car...,in the septal walls and around the bronchiole,open-where
4,/data/mn27889/path-open-data/pathvqa-histopath...,"Where is admixture of mature lymphocytes, plas...",in the centre of the field,open-where


Selecting all samples of When Open-Ended

In [18]:
pvqa_eval_data_open_when = pd.DataFrame(columns=['image_path', 'question', 'answer', 'question_type'])

for sample in when_question_samples:
    image_path = os.path.join(pvqa_images_subset_path, sample['image'] + '.jpg')
    question = sample['question']
    answer = sample['answer']
    question_type = 'open-when'
    pvqa_eval_data_open_when.loc[len(pvqa_eval_data_open_when)] = [image_path, question, answer, question_type]

pvqa_eval_data_open_when.head()

Unnamed: 0,image_path,question,answer,question_type
0,/data/mn27889/path-open-data/pathvqa-histopath...,When does periodic acid-Schiff stain ?,after diastase digestion of the liver,open-when
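The six cells above repeat the same loop with a different input list and `question_type` label. A small helper could replace them; the sketch below is one way to write it, with `build_eval_frame` and `COLUMNS` as hypothetical names and a made-up `images/train` directory standing in for `pvqa_images_subset_path`.

```python
import os

import pandas as pd

COLUMNS = ["image_path", "question", "answer", "question_type"]


def build_eval_frame(samples, question_type, images_dir):
    """Build one evaluation dataframe from a list of PathVQA sample dicts."""
    rows = [
        {
            "image_path": os.path.join(images_dir, sample["image"] + ".jpg"),
            "question": sample["question"],
            "answer": sample["answer"],
            "question_type": question_type,
        }
        for sample in samples
    ]
    # Passing columns explicitly keeps the layout stable even for an empty list.
    return pd.DataFrame(rows, columns=COLUMNS)


# Records copied from the preview of the train subset; the directory is a placeholder.
toy = [
    {"image": "train_0422",
     "question": "Where are liver stem cells (oval cells) located?",
     "answer": "in the canals of hering"},
    {"image": "train_0986",
     "question": "What shows dissolution of the tissue?",
     "answer": "an infarct in the brain"},
]
frame = build_eval_frame(toy, "open-where", "images/train")
empty_frame = build_eval_frame([], "open-when", "images/train")
```

Building the rows first and constructing the dataframe once also avoids growing the frame row by row with `.loc`.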


### Uploading all the images to Google Drive and getting the Drive links

Since we will be using a Google Form for the evaluation, we need to upload all the images to a specific Google Drive folder, then get the Drive link of each image and provide it to the evaluators.

1. Copy all the PathVQA images from the server into a specific folder
2. Upload the folder to Google Drive
3. Prepare a Google Apps Script that writes the names and links (URLs) of the files in that Google Drive folder to a Google Sheet
4. Map the names from the Google Sheet to the dataframes to get the Google Drive URL of each image
5. The resulting dataframes will be the final CSV files provided to the evaluators

First, collect the images for all question types so that they can be copied into a single folder and uploaded to Google Drive

In [19]:
unique_images_path_all = np.concatenate([pvqa_eval_data_close_ended['image_path'].unique(),
                                        pvqa_eval_data_open_what['image_path'].unique(),
                                        pvqa_eval_data_open_how['image_path'].unique(),
                                        pvqa_eval_data_open_why['image_path'].unique(),
                                        pvqa_eval_data_open_where['image_path'].unique(),
                                        pvqa_eval_data_open_when['image_path'].unique()])
unique_images_path_all = np.unique(unique_images_path_all)
print('Total Unique Images for Evaluation:', len(unique_images_path_all))

Total Unique Images for Evaluation: 330


Copying all these images into a folder

In [20]:
pvqa_eval_images_dir = 'PathVQA_Eval_Images'
os.makedirs(pvqa_eval_images_dir, exist_ok=True)
for image_path in unique_images_path_all:
    shutil.copy(image_path, pvqa_eval_images_dir)
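`shutil.copy` raises an error on the first path that does not exist, which stops the loop partway through. A more defensive variant could record missing files instead; the sketch below assumes a hypothetical helper `copy_images` and demonstrates it on a temporary directory with placeholder files rather than the real dataset.

```python
import os
import shutil
import tempfile


def copy_images(image_paths, dest_dir):
    """Copy each existing image into dest_dir; return (copied, missing) path lists."""
    os.makedirs(dest_dir, exist_ok=True)
    copied, missing = [], []
    for path in image_paths:
        if os.path.isfile(path):
            shutil.copy(path, dest_dir)
            copied.append(path)
        else:
            missing.append(path)
    return copied, missing


# Demo on a throwaway directory: one placeholder file exists, one path does not.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "train_0001.jpg")
    with open(src, "wb") as fh:
        fh.write(b"placeholder")
    dest = os.path.join(tmp, "PathVQA_Eval_Images")
    copied, missing = copy_images([src, os.path.join(tmp, "train_9999.jpg")], dest)
    copied_names = sorted(os.listdir(dest))
```

Reviewing `missing` before uploading makes it easier to notice images that never reached the Drive folder.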

Now upload this folder to Google Drive. Then run the following script in Apps Script (script.google.com)

In [None]:
# function listFolderContents2() {
#   var foldername = 'PathVQA_Eval_Images';
#   var folderlisting = 'File Names and Links - '+ foldername;

#   var folders = DriveApp.getFoldersByName(foldername);
#   var folder = folders.next();
#   var contents = folder.getFiles();

#   var ss = SpreadsheetApp.create(folderlisting);
#   var sheet = ss.getActiveSheet();
#   sheet.appendRow(['name','link']);

#   var file;
#   var name;
#   var link;
#   var row;

#   while(contents.hasNext()) {
#     file = contents.next();
#     name = file.getName();
#     link = file.getUrl();
#     sheet.appendRow([name,link]);
#   }
# };

After running the above script, a new spreadsheet will be created with the names and Google Drive links of the files. That sheet needs to be downloaded (here as a CSV file) and mapped back to each individual question set to attach the Google Drive URL of every image

In [21]:
data_eval_dir = 'data_eval'
pathvqa_drive_links_file = os.path.join(data_eval_dir, "file_names_and_links_PathVQA_Eval_Images.csv")
pathvqa_drive_links = pd.read_csv(pathvqa_drive_links_file)
pathvqa_drive_links.head()

Unnamed: 0,name,link
0,train_2820.jpg,https://drive.google.com/file/d/1-d6B8EiNONY5m...
1,train_0449.jpg,https://drive.google.com/file/d/1-eJGJz5HyEBiQ...
2,train_0374.jpg,https://drive.google.com/file/d/1-fAkiOoiyeC4B...
3,train_2332.jpg,https://drive.google.com/file/d/1-wNYCthnWlnPz...
4,train_0366.jpg,https://drive.google.com/file/d/10QVDe1WB_Udr6...


Expanding each file name to its full path so that it can be matched against `image_path` later on

In [22]:
pathvqa_drive_links['image_path'] = pathvqa_drive_links['name'].apply(lambda x: os.path.join(pvqa_images_subset_path, x))
pathvqa_drive_links['image_id'] = pathvqa_drive_links['name'].apply(lambda x: x.split('.')[0])
pathvqa_drive_links.head()

Unnamed: 0,name,link,image_path,image_id
0,train_2820.jpg,https://drive.google.com/file/d/1-d6B8EiNONY5m...,/data/mn27889/path-open-data/pathvqa-histopath...,train_2820
1,train_0449.jpg,https://drive.google.com/file/d/1-eJGJz5HyEBiQ...,/data/mn27889/path-open-data/pathvqa-histopath...,train_0449
2,train_0374.jpg,https://drive.google.com/file/d/1-fAkiOoiyeC4B...,/data/mn27889/path-open-data/pathvqa-histopath...,train_0374
3,train_2332.jpg,https://drive.google.com/file/d/1-wNYCthnWlnPz...,/data/mn27889/path-open-data/pathvqa-histopath...,train_2332
4,train_0366.jpg,https://drive.google.com/file/d/10QVDe1WB_Udr6...,/data/mn27889/path-open-data/pathvqa-histopath...,train_0366


### Mapping the Google Drive Links with each question set separately

In [23]:
pvqa_eval_data_close_ended = pd.merge(pvqa_eval_data_close_ended, pathvqa_drive_links, on='image_path', how='left')
pvqa_eval_data_close_ended.head()

Unnamed: 0,image_path,question,answer,question_type,name,link,image_id
0,/data/mn27889/path-open-data/pathvqa-histopath...,Are bile duct cells and canals of Hering stain...,yes,close-ended,train_0422.jpg,https://drive.google.com/file/d/11FynOS9rlbqlY...,train_0422
1,/data/mn27889/path-open-data/pathvqa-histopath...,Does an infarct in the brain show dissolution ...,yes,close-ended,train_0986.jpg,https://drive.google.com/file/d/1JObhjMHzL7y3T...,train_0986
2,/data/mn27889/path-open-data/pathvqa-histopath...,Does preserved show dissolution of the tissue?,no,close-ended,train_0986.jpg,https://drive.google.com/file/d/1JObhjMHzL7y3T...,train_0986
3,/data/mn27889/path-open-data/pathvqa-histopath...,Are iron deposits shown by a special staining ...,yes,close-ended,train_0227.jpg,https://drive.google.com/file/d/1hFWqF8O85idCF...,train_0227
4,/data/mn27889/path-open-data/pathvqa-histopath...,Do the photomicrographs show an inflammatory r...,yes,close-ended,train_0147.jpg,https://drive.google.com/file/d/1t7VpkonP9K4rl...,train_0147


In [24]:
pvqa_eval_data_open_what = pd.merge(pvqa_eval_data_open_what, pathvqa_drive_links, on='image_path', how='left')
pvqa_eval_data_open_what.head()

Unnamed: 0,image_path,question,answer,question_type,name,link,image_id
0,/data/mn27889/path-open-data/pathvqa-histopath...,What are stained here with an immunohistochemi...,bile duct cells and canals of hering,open-what,train_0422.jpg,https://drive.google.com/file/d/11FynOS9rlbqlY...,train_0422
1,/data/mn27889/path-open-data/pathvqa-histopath...,What shows dissolution of the tissue?,an infarct in the brain,open-what,train_0986.jpg,https://drive.google.com/file/d/1JObhjMHzL7y3T...,train_0986
2,/data/mn27889/path-open-data/pathvqa-histopath...,"What is tissue factor, in vivo?",the major initiator of coagulation,open-what,train_0665.jpg,https://drive.google.com/file/d/1GJALu7xW_e5LE...,train_0665
3,/data/mn27889/path-open-data/pathvqa-histopath...,What is the late-phase reaction characterized by?,an inflammatory infiltrate rich in eosinophils,open-what,train_0740.jpg,https://drive.google.com/file/d/1ZTpF2oMRDBG-r...,train_0740
4,/data/mn27889/path-open-data/pathvqa-histopath...,What is manifested by inflammatory cells in th...,acute cellular rejection of a kidney graft,open-what,train_0363.jpg,https://drive.google.com/file/d/1wNHejY-GkO7pY...,train_0363


In [25]:
pvqa_eval_data_open_how = pd.merge(pvqa_eval_data_open_how, pathvqa_drive_links, on='image_path', how='left')
pvqa_eval_data_open_how.head()

Unnamed: 0,image_path,question,answer,question_type,name,link,image_id
0,/data/mn27889/path-open-data/pathvqa-histopath...,How is acute cellular rejection of a kidney gr...,by inflammatory cells in the inter-stitium and...,open-how,train_0363.jpg,https://drive.google.com/file/d/1wNHejY-GkO7pY...,train_0363
1,/data/mn27889/path-open-data/pathvqa-histopath...,How is vascular changes and fibrosis of saliva...,by radiation therapy of the neck region,open-how,train_0969.jpg,https://drive.google.com/file/d/1HC6V8qzQiHuIc...,train_0969
2,/data/mn27889/path-open-data/pathvqa-histopath...,How is vascular changes and fibrosis of saliva...,by radiation therapy of the neck region,open-how,train_0911.jpg,https://drive.google.com/file/d/1a447bGWU-HD7J...,train_0911
3,/data/mn27889/path-open-data/pathvqa-histopath...,How is vascular changes and fibrosis of saliva...,by radiation therapy of the neck region,open-how,train_0942.jpg,https://drive.google.com/file/d/13gOrgv_tcK8wv...,train_0942
4,/data/mn27889/path-open-data/pathvqa-histopath...,How is acute endocarditis caused?,by staphylococcus aureus on a congenitally bic...,open-how,train_0296.jpg,https://drive.google.com/file/d/1EPLMUbikEjHkQ...,train_0296


In [26]:
pvqa_eval_data_open_where = pd.merge(pvqa_eval_data_open_where, pathvqa_drive_links, on='image_path', how='left')
pvqa_eval_data_open_where.head()

Unnamed: 0,image_path,question,answer,question_type,name,link,image_id
0,/data/mn27889/path-open-data/pathvqa-histopath...,Where are liver stem cells (oval cells) located?,in the canals of hering,open-where,train_0422.jpg,https://drive.google.com/file/d/11FynOS9rlbqlY...,train_0422
1,/data/mn27889/path-open-data/pathvqa-histopath...,Where does hepatocellular iron appear blue?,in the prussian blue-stained section,open-where,train_0456.jpg,https://drive.google.com/file/d/14Y96v5vjqbAb8...,train_0456
2,/data/mn27889/path-open-data/pathvqa-histopath...,Where are scattered immature adipocytes and mo...,myxoid liposarcoma with abundant ground substa...,open-where,train_0589.jpg,https://drive.google.com/file/d/1ars4Dh7Fdbk31...,train_0589
3,/data/mn27889/path-open-data/pathvqa-histopath...,Where is presence of abundant coarse black car...,in the septal walls and around the bronchiole,open-where,train_0008.jpg,https://drive.google.com/file/d/1CbjLDSfPwhx2M...,train_0008
4,/data/mn27889/path-open-data/pathvqa-histopath...,"Where is admixture of mature lymphocytes, plas...",in the centre of the field,open-where,train_0891.jpg,https://drive.google.com/file/d/11mq4ShUGGVUiF...,train_0891


In [27]:
pvqa_eval_data_open_why = pd.merge(pvqa_eval_data_open_why, pathvqa_drive_links, on='image_path', how='left')
pvqa_eval_data_open_why.head()

Unnamed: 0,image_path,question,answer,question_type,name,link,image_id
0,/data/mn27889/path-open-data/pathvqa-histopath...,Why are the alveolar septa thickened?,due to congested capillaries and neutrophilic ...,open-why,train_0325.jpg,https://drive.google.com/file/d/1AySfdxhl6k5Uw...,train_0325
1,/data/mn27889/path-open-data/pathvqa-histopath...,Why does this image show that cut surface both...,due to mumps have no history at time,open-why,train_2526.jpg,https://drive.google.com/file/d/1lcTOo-bGlG35I...,train_2526
2,/data/mn27889/path-open-data/pathvqa-histopath...,"Why does this image show brain, infarct?",due to ruptured saccular aneurysm and thrombos...,open-why,train_1304.jpg,https://drive.google.com/file/d/1ddU0TCjrrLZo_...,train_1304
3,/data/mn27889/path-open-data/pathvqa-histopath...,"Why does this image show brain, infarct?",due to ruptured saccular aneurysm and thrombos...,open-why,train_1311.jpg,https://drive.google.com/file/d/1zwMpHRH3XlZCJ...,train_1311
4,/data/mn27889/path-open-data/pathvqa-histopath...,Why does this image show spinal cord injury?,due to vertebral column trauma,open-why,train_1318.jpg,https://drive.google.com/file/d/1yvmmmQ1SrVl42...,train_1318


In [28]:
pvqa_eval_data_open_when = pd.merge(pvqa_eval_data_open_when, pathvqa_drive_links, on='image_path', how='left')
pvqa_eval_data_open_when.head()

Unnamed: 0,image_path,question,answer,question_type,name,link,image_id
0,/data/mn27889/path-open-data/pathvqa-histopath...,When does periodic acid-Schiff stain ?,after diastase digestion of the liver,open-when,train_0503.jpg,https://drive.google.com/file/d/1GhklzktbkJAZs...,train_0503
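A left merge keeps every question row even when no Drive link matches, leaving `NaN` in `link` without any warning. One way to surface this is the `validate` and `indicator` options of `pd.merge`; the sketch below runs on toy frames with relative placeholder paths, and the Drive URL in it is a placeholder rather than a real link.

```python
import pandas as pd

# Toy question set: two images, only one of which has a Drive link.
questions = pd.DataFrame({
    "image_path": ["images/train/train_0422.jpg", "images/train/train_0986.jpg"],
    "question": ["Where are liver stem cells (oval cells) located?",
                 "What shows dissolution of the tissue?"],
})
drive_links = pd.DataFrame({
    "image_path": ["images/train/train_0422.jpg"],
    "link": ["https://drive.google.com/file/d/EXAMPLE_ID/view"],  # placeholder URL
})

# validate="many_to_one" raises if an image_path appears twice in drive_links;
# indicator=True adds a "_merge" column showing where each row came from.
merged = pd.merge(questions, drive_links, on="image_path", how="left",
                  validate="many_to_one", indicator=True)
unmatched = merged.loc[merged["_merge"] == "left_only", "image_path"].tolist()
```

An empty `unmatched` list indicates that every question row received a link.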


### Creating the final dataset



Joining all these datasets into one dataframe to extract the final dataset to be used for evaluation of PathVQA

In [29]:
# Concatenate the full question sets; calling .head() here would keep only the first 5 rows of each
vqa_data_pathvqa = pd.concat([pvqa_eval_data_close_ended,
                            pvqa_eval_data_open_what,
                            pvqa_eval_data_open_how,
                            pvqa_eval_data_open_why,
                            pvqa_eval_data_open_where,
                            pvqa_eval_data_open_when]).reset_index(drop=True)

vqa_data_pathvqa.head()

Unnamed: 0,image_path,question,answer,question_type,name,link,image_id
0,/data/mn27889/path-open-data/pathvqa-histopath...,Are bile duct cells and canals of Hering stain...,yes,close-ended,train_0422.jpg,https://drive.google.com/file/d/11FynOS9rlbqlY...,train_0422
1,/data/mn27889/path-open-data/pathvqa-histopath...,Does an infarct in the brain show dissolution ...,yes,close-ended,train_0986.jpg,https://drive.google.com/file/d/1JObhjMHzL7y3T...,train_0986
2,/data/mn27889/path-open-data/pathvqa-histopath...,Does preserved show dissolution of the tissue?,no,close-ended,train_0986.jpg,https://drive.google.com/file/d/1JObhjMHzL7y3T...,train_0986
3,/data/mn27889/path-open-data/pathvqa-histopath...,Are iron deposits shown by a special staining ...,yes,close-ended,train_0227.jpg,https://drive.google.com/file/d/1hFWqF8O85idCF...,train_0227
4,/data/mn27889/path-open-data/pathvqa-histopath...,Do the photomicrographs show an inflammatory r...,yes,close-ended,train_0147.jpg,https://drive.google.com/file/d/1t7VpkonP9K4rl...,train_0147


In [30]:
vqa_data_pathvqa[['image_id', 'link', 'question_type', 'question', 'answer']].to_csv('data_eval/pathvqa_data.csv')
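Before handing the CSV to evaluators, it may be worth confirming that the combined frame carries every row of each question set. The sketch below is a minimal check on toy frames; `question_type_counts` is a hypothetical helper and the row counts are illustrative, not the notebook's.

```python
import pandas as pd


def question_type_counts(frames):
    """Concatenate the frames and count the rows per question_type."""
    combined = pd.concat(frames).reset_index(drop=True)
    counts = combined["question_type"].value_counts().to_dict()
    return combined, counts


# Toy frames standing in for the per-type evaluation dataframes.
close_ended = pd.DataFrame({"question_type": ["close-ended"] * 3})
open_what = pd.DataFrame({"question_type": ["open-what"] * 7})
combined, counts = question_type_counts([close_ended, open_what])
```

Comparing `counts` with the lengths of the individual frames shows immediately if rows were dropped during concatenation.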