### Setup Environment:

In [None]:
from src.google_trends import get_interest_over_time
from src.get_data import get_daquar_dataset, preprocess_daquar_dataset
from src.get_data import get_cocoqa_dataset, process_cocoqa_data
from src.get_data import download_fakeddit_files, create_stratified_subset_fakeddit, download_full_set_images, download_images_from_file
from src.get_data import download_recipes5k_dataset, preprocess_recipes5k
from src.get_data import get_brset, brset_preprocessing
from src.get_data import preprocess_ham10000
from src.get_data import get_satellitedata, satellitedata_preprocessing
from src.get_data import joslin_preprocessing

### Download Datasets:

The Fusion Model has been evaluated on the datasets listed below:

## 1. DAQUAR

* **[DAQUAR Dataset](https://www.mpi-inf.mpg.de/departments/computer-vision-and-machine-learning/research/vision-and-language/visual-turing-challenge#c7057)**:

The DAQUAR (Dataset for Question Answering on Real-world images) dataset was created to advance research in visual question answering (VQA). It consists of indoor scene images, each accompanied by a set of questions about the scene's content. The dataset serves as a benchmark for training and evaluating models that must understand images and answer questions about them.

This dataset can be downloaded from the following [link](https://www.mpi-inf.mpg.de/departments/computer-vision-and-machine-learning/research/vision-and-language/visual-turing-challenge#c7057), or with the function `get_daquar_dataset`.

Once you have the dataset, use the function `preprocess_daquar_dataset` to preprocess the train and test sets and generate the `labels.csv` file.

These functions will generate a dataset with the structure:

* output_dir/
    * labels.csv
    * test.txt
    * train.txt
    * images/
        * image1.png
        * image2.png
        * image3.png
        
        ...
        
        * imagen.png

In [2]:
output_dir = 'datasets/daquar/'
get_daquar_dataset(output_dir)
preprocess_daquar_dataset(output_dir)

Images downloaded and uncompressed successfully.
Labels downloaded successfully.
Preprocessed data saved to datasets/daquar/labels.csv


## 2. COCO-QA

* **[COCO-QA Dataset](https://www.cs.toronto.edu/~mren/research/imageqa/data/cocoqa/)**:

The COCO-QA (COCO Question-Answering) dataset is designed for the task of visual question-answering. It is a subset of the COCO (Common Objects in Context) dataset, which is a large-scale dataset containing images with object annotations. The COCO-QA dataset extends the COCO dataset by including questions and answers associated with the images. Each image in the COCO-QA dataset is accompanied by a set of questions and corresponding answers.

You can use the `get_cocoqa_dataset` function to download the dataset.

Example usage of the function:

`get_cocoqa_dataset(output_dir="datasets/coco-qa/")`

Also run the function to preprocess the dataset:

`process_cocoqa_data(output_dir="datasets/coco-qa/")`

After executing these functions, you will have the following structure in the "datasets/coco-qa/" directory:

* datasets/coco-qa/
    * labels.csv
    * train/
    * test/
    * images/
        * image1.png
        * image2.png
        * image3.png
        
        ...
        
        * imagen.png 

In [3]:
# Example usage
output_dir = 'datasets/coco-qa/'
get_cocoqa_dataset(output_dir)
process_cocoqa_data(output_dir)

COCO-QA dataset downloaded and uncompressed successfully.
COCO images downloaded and uncompressed successfully.
Train and test dataframes saved successfully.
Combined dataframe saved successfully.
Images removed successfully.


## 3. Fakeddit

* **[Fakeddit Dataset](https://fakeddit.netlify.app/)**:

Fakeddit is a large-scale multimodal dataset for fine-grained fake news detection. It consists of over 1 million samples from multiple categories of fake news, including satire, misinformation, and fabricated news. The dataset includes text, images, metadata, and comment data, making it a rich resource for developing and evaluating fake news detection models.

You can use the function `download_fakeddit_files` to download the metadata, and the function `download_full_set_images` to get the full set of images.

Since the full set contains about 1M images, we provide the function `create_stratified_subset_fakeddit` to generate a subset, so the experiments can be run with fewer resources. This function generates a `labels.csv` file for the subset.

You can also use `download_images_from_file` to download the images listed in a specific file.
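To make the idea of a stratified subset concrete, here is a minimal standard-library sketch. It is not the implementation of `create_stratified_subset_fakeddit`; the function name `stratified_subset`, the `label` key, and the toy rows are assumptions made for illustration. It samples the same fraction from every class so the class proportions of the full set are preserved.

```python
import random
from collections import defaultdict


def stratified_subset(rows, label_key, fraction, seed=42):
    """Return roughly `fraction` of `rows`, sampled separately within each label."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)

    subset = []
    for group in by_label.values():
        # Keep at least one sample per class so no label disappears.
        k = max(1, round(len(group) * fraction))
        subset.extend(rng.sample(group, k))
    return subset


# Toy metadata (hypothetical labels): 80 "real" posts and 20 "fake" posts.
rows = [{"id": i, "label": "real" if i < 80 else "fake"} for i in range(100)]
subset = stratified_subset(rows, label_key="label", fraction=0.1)
counts = {label: sum(r["label"] == label for r in subset) for label in ("real", "fake")}
print(counts)  # {'real': 8, 'fake': 2}
```

With a 10% fraction the 80/20 class ratio of the toy data carries over to the subset, which is the property a stratified subset is meant to guarantee.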


In [2]:
# Example usage
output_dir = 'datasets/fakeddit/'

# Get Metadata:
download_fakeddit_files(output_dir)

# Get Images (Due to possible API changes, we recommend this method):
download_full_set_images(output_dir)

# Random subset:
subset_size = 1  # 100% subset size
# subset_size = 0.1  # 10% subset size
create_stratified_subset_fakeddit(output_dir, subset_size)

#download_images_from_file(output_dir, 'labels.csv')

Downloading...
From (uriginal): https://drive.google.com/uc?id=1cjY6HsHaSZuLVHywIxD5xQqng33J5S2b
From (redirected): https://drive.google.com/uc?id=1cjY6HsHaSZuLVHywIxD5xQqng33J5S2b&confirm=t&uuid=25f70429-0d7f-4575-a219-133d8d39a23b
To: /home/datascience/Data Fusion/datasets/fakeddit/Images.tar.bz2
100%|██████████| 114G/114G [16:12<00:00, 117MB/s]    


## 4. Recipes5k
* **[Recipes5k Dataset](http://www.ub.edu/cvub/recipes5k/)**:

The Recipes5k dataset comprises 4,826 recipes, each with an image and the corresponding ingredient list. It contains 3,213 unique ingredients, simplified to 1,014 by removing overly descriptive particles. The recipes were extracted from Yummly, with 50 alternative preparations for each of the 101 food types from Food101, and are balanced across training, validation, and test splits. The dataset is designed to capture both intra- and inter-class variability.

You can use the function `download_recipes5k_dataset` to download the dataset and the function `preprocess_recipes5k` to preprocess it. These functions will generate the following structure:

* preprocess_recipes5k
    * labels.csv
    * Images/
        * class_1/
            * img_1
            * img_2
            ...
        * class_2/
            * img_1
            * img_2
            ...
        ...
        * class_n/
            * img_1
            * img_2
            ...

In [3]:
# Example usage
# The function generates the directory 'Recipes5k' by default, so you don't have to specify that.
output_dir = 'datasets/'
download_recipes5k_dataset(output_dir)
preprocess_recipes5k('datasets/Recipes5k/')

## 5. BRSET
* **[BRSET Dataset](https://physionet.org/content/brazilian-ophthalmological/1.0.0/)**:

The Brazilian Multilabel Ophthalmological Dataset (BRSET) stands as a pioneering initiative aimed at bridging the gap in ophthalmological datasets, particularly for under-represented populations in low and medium-income countries. This comprehensive dataset encompasses 16,266 images from 8,524 Brazilian patients, incorporating a wide array of data points including demographics, anatomical parameters of the macula, optic disc, and vessels, along with quality control metrics such as focus, illumination, image field, and artifacts.

You can use the function `get_brset` to download the dataset and the function `brset_preprocessing` to preprocess it. These functions will generate the following structure:

* brset/
    * labels.csv
    * Images/
        * img_1
        * img_2
         
         ...
         
        * img_n

In [2]:
output_dir = 'datasets/brset'
dataset_path = "/gpfs/workdir/restrepoda/datasets/BRSET/brset"
get_brset(output_dir, download=True)
brset_preprocessing(output_dir)

Processed dataset saved as labels.csv in datasets/brset


## 6. HAM10000

* **[HAM10000 Dataset](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/DBW86T)**:

The HAM10000 dataset is a large collection of dermatoscopic images from different populations, acquired and stored by the Department of Dermatology at the Medical University of Vienna, Austria. It consists of 10,015 dermatoscopic images that can serve as a training set for academic machine learning purposes in tasks such as skin lesion analysis and classification, specifically focusing on the detection of melanoma.

Unfortunately, we cannot provide a function to download the data automatically because users must sign the terms of use. Please use the [link](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/DBW86T) to download and uncompress the files, and place them as follows:
* datasets/ham10000/HAM10000_images_part_1/
    * fig1.jpg
    * fig2.jpg
    
    ...
    
    * fign.jpg
* datasets/ham10000/HAM10000_images_part_2/
    * fig1.jpg
    * fig2.jpg
    
    ...
    
    * fign.jpg
* datasets/ham10000/HAM10000_metadata.csv

Once the data is in place, use the function `preprocess_ham10000` to preprocess the dataset. This function generates the prompt text for each patient and saves the data with the following structure:

* ham10000/
    * labels.csv
    * Images/
        * img_1
        * img_2
         
         ...
         
        * img_n
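Because the HAM10000 files have to be placed manually, it can help to confirm the layout before preprocessing. The following is a small sketch using only the standard library; the helper `missing_ham10000_paths` is hypothetical (it is not part of `src.get_data`), and the expected names are taken from the folder listing above.

```python
import os
import tempfile

# Names taken from the manual-download layout described above.
REQUIRED_PATHS = [
    "HAM10000_images_part_1",
    "HAM10000_images_part_2",
    "HAM10000_metadata.csv",
]


def missing_ham10000_paths(root):
    """Return the expected files/folders that are not present under `root`."""
    return [p for p in REQUIRED_PATHS if not os.path.exists(os.path.join(root, p))]


# Demonstration on a temporary directory that only contains part 1.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "HAM10000_images_part_1"))
    missing = missing_ham10000_paths(tmp)

print(missing)  # ['HAM10000_images_part_2', 'HAM10000_metadata.csv']
```

In practice you would call `missing_ham10000_paths('datasets/ham10000')` and only run the preprocessing once the returned list is empty.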

## 7. Colombian Multimodal Satellite dataset
* **[A Multi-Modal Satellite Imagery Dataset for Public Health Analysis in Colombia](https://physionet.org/content/multimodal-satellite-data/1.0.0/)**:

The Multi-Modal Satellite Imagery Dataset in Colombia integrates economic, demographic, meteorological, and epidemiological data. It comprises 12,636 high-quality satellite images from 81 municipalities between 2016 and 2018, with minimal cloud cover. Its applications include deforestation monitoring, forecasting education indices, water quality assessment, tracking extreme climatic events, addressing epidemic illnesses, and precision agriculture optimization.

You can use the function `get_satellitedata` to download the dataset and the function `satellitedata_preprocessing` to preprocess it. These functions will generate the following structure:

* satellitedata/
    * labels.csv
    * Images/
        * N_DATE_1
        * N_DATE_2
         
         ...
         
        * N_DATE_n

In [3]:
import os

output_dir = 'datasets/satellitedata'
get_satellitedata(output_dir, download=False)

num_classes = 3

df = satellitedata_preprocessing(output_dir, num_classes = num_classes)

# Fix image path
df.image_id = df.image_id.apply(lambda x: x.split('/')[-1].replace('image', x.split('/')[-2]).replace('.tiff', '.jpg'))
# Fix prompt to avoid data leakage
cities =  {
    "76001": "Cali",
    "5001": "Medellín",
    "50001": "Villavicencio",
    "54001": "Cúcuta",
    "73001": "Ibagué",
    "68001": "Bucaramanga",
    "5360": "Itagüí",
    "8001": "Barranquilla",
    "41001": "Neiva",
    "23001": "Montería"
    }

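# Rebuild the prompt from the city name and the acquisition date; the file extension is
# stripped with os.path.splitext because image_id already ends in '.jpg' at this point
df['text'] = df.apply(lambda x: f"An image from city {cities[x.image_id.split('_')[0]]} taken in date {os.path.splitext(x.image_id.split('_')[1])[0]} with"+x.text[x.text.index(','):], axis=1)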
df.to_csv(os.path.join(output_dir, 'labels.csv'), index=False)

datasets/satellitedata/physionet.org/files/multimodal-satellite-data/1.0.0
loading csv file in datasets/satellitedata/physionet.org/files/multimodal-satellite-data/1.0.0/metadata.csv
['2016', '02', '02']
['image_2017', '11', '26.tiff']
['image_2017', '02', '12.tiff']
['image_2018', '03', '04.tiff']
['image_2017', '12', '31.tiff']
['image_2018', '01', '21.tiff']
['image_2017', '09', '24.tiff']
['image_2017', '12', '24.tiff']
['image_2018', '03', '18.tiff']
['image_2017', '08', '06.tiff']
['image_2018', '10', '07.tiff']
['image_2018', '01', '07.tiff']
['image_2018', '02', '25.tiff']
['image_2017', '06', '04.tiff']
['image_2016', '03', '27.tiff']
['image_2017', '05', '07.tiff']
['image_2017', '09', '17.tiff']
['image_2016', '07', '31.tiff']
['image_2016', '12', '11.tiff']
['image_2017', '08', '13.tiff']
['image_2017', '07', '23.tiff']
['image_2016', '05', '01.tiff']
['image_2018', '05', '20.tiff']
['image_2016', '04', '03.tiff']
['image_2016', '04', '17.tiff']
['image_2016', '08', '14.tif

In [5]:
df.head()

Unnamed: 0,image_id,text,Labels,split
0,datasets/satellitedata/physionet.org/files/mul...,"In a city with 0 Dengue classification, 5.92% ...",0,train
1,datasets/satellitedata/physionet.org/files/mul...,"In a city with 0 Dengue classification, 7.04% ...",0,train
2,datasets/satellitedata/physionet.org/files/mul...,"In a city with 0 Dengue classification, 4.83% ...",0,train
3,datasets/satellitedata/physionet.org/files/mul...,"In a city with 0 Dengue classification, 6.22% ...",0,train
4,datasets/satellitedata/physionet.org/files/mul...,"In a city with 1 Dengue classification, 5.92% ...",1,train
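The image path fix in the cell above can be easier to follow on a single example. The sketch below assumes paths of the form `<root>/<municipality code>/image_<date>.tiff`, which is inferred from the code and the printed output rather than from the dataset documentation; the helper names `to_image_id` and `split_image_id` and the example path are hypothetical.

```python
import posixpath


def to_image_id(path):
    """Turn '<...>/<municipality>/image_<date>.tiff' into '<municipality>_<date>.jpg'."""
    folder, filename = posixpath.split(path)
    municipality = posixpath.basename(folder)
    stem, _ext = posixpath.splitext(filename)
    return stem.replace("image", municipality, 1) + ".jpg"


def split_image_id(image_id):
    """Recover (municipality code, date string) from an id built by to_image_id."""
    stem, _ext = posixpath.splitext(image_id)
    code, date = stem.split("_", 1)
    return code, date


# Hypothetical path following the layout assumed above.
example = "some/root/76001/image_2016-02-02.tiff"
image_id = to_image_id(example)
print(image_id)                  # 76001_2016-02-02.jpg
print(split_image_id(image_id))  # ('76001', '2016-02-02')
```

Splitting off the extension before extracting the date keeps the date string clean when it is later inserted into the text prompt.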


## 8. MIMIC CXR
* **[MIMIC CXR](https://physionet.org/content/mimic-cxr/2.0.0/#files-panel)**:

The MIMIC-CXR (Medical Information Mart for Intensive Care, Chest X-Ray) dataset is a large, publicly available collection of chest radiographs with associated radiology reports. It was developed by the MIT Lab for Computational Physiology and provides an extensive resource for training and evaluating machine learning models in the field of medical imaging, particularly in automated radiograph interpretation and natural language processing for clinical narratives.

The dataset comprises over 370,000 chest x-ray images from more than 65,000 patients, making it one of the largest available datasets of its kind. The images are accompanied by structured labels derived from the associated radiology reports, using natural language processing techniques to extract findings and diagnoses. This allows for a wide range of applications, including disease detection, image captioning, and report generation.

For this dataset we use a preprocessed version of MIMIC-CXR that contains only RGB images at 224x224 resolution. To reproduce the setup, preprocess your copy of the dataset to the same 224x224 format and place the files under `datasets/mimic/images/`.

In [3]:
import pandas as pd

# Location of the preprocessed MIMIC-CXR metadata (adjust to your environment)
mimic_dir = '/gpfs/workdir/restrepoda/datasets/MIMIC/mimic'

# Optional: restrict each split to these columns by indexing the dataframe with `columns`
# columns = ['path', 'race_label', 'sex_label', 'disease_label', 'subject_id', 'study_id', 'Atelectasis',
#            'Cardiomegaly', 'Consolidation', 'Edema', 'Enlarged Cardiomediastinum',
#            'Fracture', 'Lung Lesion', 'Lung Opacity', 'No Finding',
#            'Pleural Effusion', 'Pleural Other', 'Pneumonia', 'Pneumothorax',
#            'Support Devices']

train = pd.read_csv(f'{mimic_dir}/train.csv', index_col=0)
test = pd.read_csv(f'{mimic_dir}/test.csv', index_col=0)
val = pd.read_csv(f'{mimic_dir}/valid.csv', index_col=0)

# Add a 'split' column to each dataframe
train['split'] = 'train'
test['split'] = 'test'
val['split'] = 'val'

# Concatenate the dataframes
df = pd.concat([train, test, val], ignore_index=True)

# Path to clinical notes:
text_path = pd.read_csv('/gpfs/workdir/restrepoda/datasets/MIMIC/mimic/cxr-study-list.csv')
text_path.rename(columns={'path': 'file_path'}, inplace=True)

# Merge (no keys are given, so pandas joins on the column names shared by both dataframes):
df = pd.merge(df, text_path)

In [4]:
df

Unnamed: 0.1,Unnamed: 0,dicom_id,subject_id,path,study_id,PerformedProcedureStepDescription,split,ViewPosition,Rows,Columns,...,age,anchor_year,anchor_year_group,dod,race_label,sex_label,disease,disease_label,path_preproc,file_path
0,148366,d85c9f15-f0f84927-761f30e0-51c2d319-f2d917f0,19702416,p19/p19702416/s51321189/d85c9f15-f0f84927-761f...,51321189,CHEST (PORTABLE AP),train,AP,3056,2544,...,91.0,2168,2011 - 2013,2168-11-01,0,0,Other,3,preproc_224x224/s51321189_d85c9f15-f0f84927-76...,files/p19/p19702416/s51321189.txt
1,51114,0024603b-12db30e2-ab32c9cb-dae5a3fc-b2c598b4,13339704,p13/p13339704/s51292704/0024603b-12db30e2-ab32...,51292704,CHEST (PORTABLE AP),train,AP,2544,3056,...,68.0,2133,2011 - 2013,,2,0,Other,3,preproc_224x224/s51292704_0024603b-12db30e2-ab...,files/p13/p13339704/s51292704.txt
2,40441,8a4aaaee-55fcf98f-a036a8e7-da71eed1-4d7bf032,12668169,p12/p12668169/s54048859/8a4aaaee-55fcf98f-a036...,54048859,CHEST (PORTABLE AP),train,AP,2820,2539,...,54.0,2156,2011 - 2013,2156-08-19,0,0,Other,3,preproc_224x224/s54048859_8a4aaaee-55fcf98f-a0...,files/p12/p12668169/s54048859.txt
3,4830,9886b0fe-9121c65e-c8d74649-4b88c530-9b3943fb,10309415,p10/p10309415/s58144222/9886b0fe-9121c65e-c8d7...,58144222,CHEST (PORTABLE AP),train,AP,2544,3056,...,88.0,2114,2014 - 2016,,0,0,Other,3,preproc_224x224/s58144222_9886b0fe-9121c65e-c8...,files/p10/p10309415/s58144222.txt
4,145343,61b65859-4e25d250-d3faadc7-0fda22dd-14abe901,19504029,p19/p19504029/s59315061/61b65859-4e25d250-d3fa...,59315061,CHEST (PA AND LAT),train,PA,3056,2544,...,54.0,2178,2011 - 2013,,2,1,No Finding,0,preproc_224x224/s59315061_61b65859-4e25d250-d3...,files/p19/p19504029/s59315061.txt
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
153123,68393,f231fe18-30e5023f-617d5710-b7343694-658d8c59,14476373,p14/p14476373/s53343726/f231fe18-30e5023f-617d...,53343726,CHEST (PA AND LAT),val,AP,2544,3056,...,61.0,2174,2011 - 2013,,0,0,No Finding,0,preproc_224x224/s53343726_f231fe18-30e5023f-61...,files/p14/p14476373/s53343726.txt
153124,37939,6aa095e2-8ec1eeae-432fbe0a-951014ba-8d6944b7,12491157,p12/p12491157/s54173393/6aa095e2-8ec1eeae-432f...,54173393,CHEST (PORTABLE AP),val,AP,2539,3050,...,57.0,2179,2011 - 2013,,0,0,Other,3,preproc_224x224/s54173393_6aa095e2-8ec1eeae-43...,files/p12/p12491157/s54173393.txt
153125,61737,f52e19e0-9569d75a-7c2e1cca-588fe579-e22a208b,14036332,p14/p14036332/s52691805/f52e19e0-9569d75a-7c2e...,52691805,,val,PA,2022,2022,...,62.0,2190,2011 - 2013,2197-11-16,0,0,Pneumonia,2,preproc_224x224/s52691805_f52e19e0-9569d75a-7c...,files/p14/p14036332/s52691805.txt
153126,13982,f4f75648-baff1e55-0086d06c-cf27d72e-e1e35202,10972527,p10/p10972527/s53691151/f4f75648-baff1e55-0086...,53691151,CHEST (PA AND LAT),val,AP,2544,3056,...,68.0,2129,2014 - 2016,,0,0,Other,3,preproc_224x224/s53691151_f4f75648-baff1e55-00...,files/p10/p10972527/s53691151.txt


In [5]:
import zipfile
def extract_texts_from_zip(zip_path, file_list):
    """
    Efficiently extracts multiple text files from a zip archive.
    
    Parameters
    ----------
    zip_path : str
        Path to the zip file.
    file_list : list
        List of file paths inside the zip archive to extract.
        
    Returns
    -------
    dict
        Dictionary with file paths as keys and extracted text as values.
    """
    file_contents = {}
    
    with zipfile.ZipFile(zip_path, 'r') as z:
        existing_files = set(z.namelist())  # Convert file list in ZIP to a set for fast lookup
        
        for file_path in file_list:
            if file_path in existing_files:  # Check if the file exists in the ZIP
                with z.open(file_path) as f:
                    file_contents[file_path] = f.read().decode('utf-8')
            else:
                file_contents[file_path] = None  # If file not found, return None
    
    return file_contents
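A quick way to sanity-check this lookup pattern is to run it on a tiny archive. The sketch below inlines the same logic as `extract_texts_from_zip` so that it is self-contained; the archive name, the member paths, and the report text are made up for illustration.

```python
import os
import tempfile
import zipfile

with tempfile.TemporaryDirectory() as tmp:
    # Build a throwaway archive with a single report.
    zip_path = os.path.join(tmp, "reports.zip")
    with zipfile.ZipFile(zip_path, "w") as z:
        z.writestr("files/p10/p1/s1.txt", "FINAL REPORT\nNo acute findings.")

    # One requested file exists in the archive, the other does not.
    wanted = ["files/p10/p1/s1.txt", "files/p10/p1/s2.txt"]
    with zipfile.ZipFile(zip_path, "r") as z:
        available = set(z.namelist())  # set membership is O(1) per lookup
        texts = {
            name: z.read(name).decode("utf-8") if name in available else None
            for name in wanted
        }

print(texts)
```

Missing members map to `None`, so after `df['file_path'].map(texts)` the rows without a report can be found with `df['text'].isna()`.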


In [6]:
import os
zip_file_path = os.path.join("/gpfs/workdir/restrepoda/datasets/MIMIC/mimic/", 'metadata', 'mimic-cxr-reports.zip')

texts = extract_texts_from_zip(zip_file_path, df['file_path'].tolist())
df['text'] = df['file_path'].map(texts)

In [7]:
df

Unnamed: 0.1,Unnamed: 0,dicom_id,subject_id,path,study_id,PerformedProcedureStepDescription,split,ViewPosition,Rows,Columns,...,anchor_year,anchor_year_group,dod,race_label,sex_label,disease,disease_label,path_preproc,file_path,text
0,148366,d85c9f15-f0f84927-761f30e0-51c2d319-f2d917f0,19702416,p19/p19702416/s51321189/d85c9f15-f0f84927-761f...,51321189,CHEST (PORTABLE AP),train,AP,3056,2544,...,2168,2011 - 2013,2168-11-01,0,0,Other,3,preproc_224x224/s51321189_d85c9f15-f0f84927-76...,files/p19/p19702416/s51321189.txt,FINAL REPORT\...
1,51114,0024603b-12db30e2-ab32c9cb-dae5a3fc-b2c598b4,13339704,p13/p13339704/s51292704/0024603b-12db30e2-ab32...,51292704,CHEST (PORTABLE AP),train,AP,2544,3056,...,2133,2011 - 2013,,2,0,Other,3,preproc_224x224/s51292704_0024603b-12db30e2-ab...,files/p13/p13339704/s51292704.txt,FINAL REPORT\...
2,40441,8a4aaaee-55fcf98f-a036a8e7-da71eed1-4d7bf032,12668169,p12/p12668169/s54048859/8a4aaaee-55fcf98f-a036...,54048859,CHEST (PORTABLE AP),train,AP,2820,2539,...,2156,2011 - 2013,2156-08-19,0,0,Other,3,preproc_224x224/s54048859_8a4aaaee-55fcf98f-a0...,files/p12/p12668169/s54048859.txt,FINAL REPORT\...
3,4830,9886b0fe-9121c65e-c8d74649-4b88c530-9b3943fb,10309415,p10/p10309415/s58144222/9886b0fe-9121c65e-c8d7...,58144222,CHEST (PORTABLE AP),train,AP,2544,3056,...,2114,2014 - 2016,,0,0,Other,3,preproc_224x224/s58144222_9886b0fe-9121c65e-c8...,files/p10/p10309415/s58144222.txt,FINAL REPORT\...
4,145343,61b65859-4e25d250-d3faadc7-0fda22dd-14abe901,19504029,p19/p19504029/s59315061/61b65859-4e25d250-d3fa...,59315061,CHEST (PA AND LAT),train,PA,3056,2544,...,2178,2011 - 2013,,2,1,No Finding,0,preproc_224x224/s59315061_61b65859-4e25d250-d3...,files/p19/p19504029/s59315061.txt,FINAL REPORT\...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
153123,68393,f231fe18-30e5023f-617d5710-b7343694-658d8c59,14476373,p14/p14476373/s53343726/f231fe18-30e5023f-617d...,53343726,CHEST (PA AND LAT),val,AP,2544,3056,...,2174,2011 - 2013,,0,0,No Finding,0,preproc_224x224/s53343726_f231fe18-30e5023f-61...,files/p14/p14476373/s53343726.txt,FINAL REPORT\...
153124,37939,6aa095e2-8ec1eeae-432fbe0a-951014ba-8d6944b7,12491157,p12/p12491157/s54173393/6aa095e2-8ec1eeae-432f...,54173393,CHEST (PORTABLE AP),val,AP,2539,3050,...,2179,2011 - 2013,,0,0,Other,3,preproc_224x224/s54173393_6aa095e2-8ec1eeae-43...,files/p12/p12491157/s54173393.txt,FINAL REPORT\...
153125,61737,f52e19e0-9569d75a-7c2e1cca-588fe579-e22a208b,14036332,p14/p14036332/s52691805/f52e19e0-9569d75a-7c2e...,52691805,,val,PA,2022,2022,...,2190,2011 - 2013,2197-11-16,0,0,Pneumonia,2,preproc_224x224/s52691805_f52e19e0-9569d75a-7c...,files/p14/p14036332/s52691805.txt,FINAL REPORT\...
153126,13982,f4f75648-baff1e55-0086d06c-cf27d72e-e1e35202,10972527,p10/p10972527/s53691151/f4f75648-baff1e55-0086...,53691151,CHEST (PA AND LAT),val,AP,2544,3056,...,2129,2014 - 2016,,0,0,Other,3,preproc_224x224/s53691151_f4f75648-baff1e55-00...,files/p10/p10972527/s53691151.txt,FINAL REPORT\...


In [8]:
df.to_csv('/gpfs/workdir/restrepoda/datasets/MIMIC/mimic/labels.csv', index=False)

## 9. mBRSET

In [41]:
import pandas as pd
import os

dataset_path = "/gpfs/workdir/restrepoda/datasets/mBRSET/mbrset"

#pd.read_csv
os.listdir(dataset_path)

['images_224', 'labels.csv', 'labels_mbrset.csv', 'images']

In [25]:
pd.read_csv(f'{dataset_path}/labels_mbrset.csv').info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5164 entries, 0 to 5163
Data columns (total 24 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   patient                      5164 non-null   int64  
 1   age                          5164 non-null   object 
 2   sex                          5164 non-null   int64  
 3   dm_time                      5108 non-null   float64
 4   insulin                      5116 non-null   float64
 5   insulin_time                 1016 non-null   float64
 6   oraltreatment_dm             5120 non-null   float64
 7   systemic_hypertension        5120 non-null   float64
 8   insurance                    5116 non-null   float64
 9   educational_level            5112 non-null   float64
 10  alcohol_consumption          5088 non-null   float64
 11  smoking                      5076 non-null   float64
 12  obesity                      5088 non-null   float64
 13  vascular_disease  

In [13]:
def generate_patient_text(row):
    """
    Generates a descriptive text for each patient based on their characteristics,
    including educational level.
    """
    
    # Helper function to format binary values
    def binary_to_text(value, true_text, false_text):
        return true_text if value == 1 else false_text
    
    # Map for educational levels
    education_map = {
        1.0: "illiterate",
        2.0: "with incomplete primary education",
        3.0: "with complete primary education",
        4.0: "with incomplete secondary education",
        5.0: "with complete secondary education",
        6.0: "with incomplete tertiary education",
        7.0: "with complete tertiary education"
    }
    
    # Age description
    age_description = f"aged {row['age']} years" if not pd.isnull(row['age']) else "with age not reported"
    
    # Sex description
    sex_description = "male" if row['sex'] == 1 else "female" if row['sex'] == 0 else "sex not reported"
    
    # Diabetes duration description
    dm_duration = f"diagnosed with diabetes for {row['dm_time']} years" if not pd.isnull(row['dm_time']) else "with no reported diabetes duration"
    
    # Insulin use description
    insulin_use = binary_to_text(row['insulin'], "using insulin", "not using insulin")
    
    # Oral treatment description
    oral_treatment = binary_to_text(row['oraltreatment_dm'], "on oral treatment for diabetes", "not on oral treatment for diabetes")
    
    # Systemic hypertension description
    hypertension = binary_to_text(row['systemic_hypertension'], "with systemic hypertension", "without systemic hypertension")
    
    # Alcohol consumption description
    alcohol_use = binary_to_text(row['alcohol_consumption'], "consumes alcohol", "does not consume alcohol")
    
    # Smoking description
    smoking = binary_to_text(row['smoking'], "smokes", "does not smoke")
    
    # Obesity description
    obesity = binary_to_text(row['obesity'], "with obesity", "without obesity")
    
    # Vascular disease description
    vascular_disease = binary_to_text(row['vascular_disease'], "has vascular disease", "does not have vascular disease")
    
    # Acute myocardial infarction description
    myocardial_infarction = binary_to_text(row['acute_myocardial_infarction'], "has a history of acute myocardial infarction", "no history of acute myocardial infarction")
    
    # Nephropathy description
    nephropathy = binary_to_text(row['nephropathy'], "with nephropathy", "without nephropathy")
    
    # Neuropathy description
    neuropathy = binary_to_text(row['neuropathy'], "with neuropathy", "without neuropathy")
    
    # Diabetic foot description
    diabetic_foot = binary_to_text(row['diabetic_foot'], "has diabetic foot", "does not have diabetic foot")
    
    # Educational level description
    education_description = education_map.get(row['educational_level'], "with no educational level reported")
    
    # Generate the full description
    description = (
        f"A {sex_description} patient {age_description}, {dm_duration}, {insulin_use}, and {oral_treatment}. "
        f"The patient is {hypertension}, {alcohol_use}, {smoking}, {obesity}, and {vascular_disease}. "
        f"Medical history includes: {myocardial_infarction}, {nephropathy}, {neuropathy}, and {diabetic_foot}. "
        f"The patient is {education_description}."
    )
    
    return description


from sklearn.model_selection import train_test_split


def mbrset_preprocessing(dataset_path, filename='labels_mbrset.csv', output_filename='labels.csv'):
    # Load the dataset
    df = pd.read_csv(f'{dataset_path}/{filename}')

    # Create the 'text' column from the patient metadata
    df['text'] = df.apply(generate_patient_text, axis=1)

    # Keep only the image file name ('file'), the DR grade ('DR_ICDR'), and 'text'
    df.rename(columns={'final_icdr': 'DR_ICDR'}, inplace=True)
    df = df[['file', 'DR_ICDR', 'text']]

    # Save an example of the generated prompts (written to the current working directory)
    df.to_csv('prompt_eg.csv', index=False)

    # Drop rows without a DR grade; copy so later assignments do not act on a view
    df = df.dropna(subset=['DR_ICDR']).copy()

    # Create DR_2 and DR_3 columns from DR_ICDR
    df['DR_2'] = df['DR_ICDR'].apply(lambda x: 1 if x > 0 else 0)
    df['DR_3'] = df['DR_ICDR'].apply(lambda x: 2 if x == 4 else (1 if x in [1, 2, 3] else 0))

    # Create a 'split' column
    df['split'] = 'train'
    # Stratify split by 'DR_ICDR'
    train_idx, test_idx = train_test_split(df.index, test_size=0.2, stratify=df['DR_ICDR'], random_state=42)
    df.loc[test_idx, 'split'] = 'test'  # Update 'split' for test set

    # Save the processed dataframe to a new CSV file
    df.to_csv(f'{dataset_path}/{output_filename}', index=False)

    print(f"Processed dataset saved as {output_filename} in {dataset_path}")


In [32]:
mbrset_preprocessing(dataset_path)

Processed dataset saved as labels.csv in /gpfs/workdir/restrepoda/datasets/mBRSET/mbrset
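The two derived label columns are easier to read as a table. This sketch restates the `DR_2` and `DR_3` rules from `mbrset_preprocessing` as plain functions (the names `dr_binary` and `dr_three_class` are introduced here for illustration only) and prints the mapping for each ICDR grade.

```python
def dr_binary(icdr):
    """DR_2: 0 for grade 0, 1 for any grade above 0."""
    return 1 if icdr > 0 else 0


def dr_three_class(icdr):
    """DR_3: 0 for grade 0, 1 for grades 1-3, 2 for grade 4."""
    if icdr == 4:
        return 2
    return 1 if icdr in (1, 2, 3) else 0


grades = [0, 1, 2, 3, 4]
table = [(g, dr_binary(g), dr_three_class(g)) for g in grades]
for row in table:
    print(row)
# (0, 0, 0)
# (1, 1, 1)
# (2, 1, 1)
# (3, 1, 1)
# (4, 1, 2)
```

Each tuple is `(DR_ICDR, DR_2, DR_3)`, so grade 4 is the only grade that the three-class label separates from the other positive grades.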


## 10. Joslin Center Data

This is a private dataset. Use the function `joslin_preprocessing` to preprocess it. This function will generate the following structure:

* joslin/
    * labels.csv
    * Images/
        * img_1
        * img_2
         
         ...
         
        * img_n

In [3]:
output_dir = 'datasets/joslin'
joslin_preprocessing(output_dir)

Processed dataset saved as labels.csv in datasets/joslin
