# Introduction to DataLoader Parameters
### DataLoader
The `DataLoader` class loads a range of medical datasets. Users specify the dataset name, the dataset split, and other parameters to customize how the data is loaded.
#### Key Parameters:
- **name**: The name of the dataset to load. Some datasets also need a subset, which is appended to the name (e.g., "MEPS", "NHIS", "StrokePrediction", "MedMnist-nodulemnist3d").
- **task**: The specific task or subset of the dataset to load (e.g., task=10 for MEPS).
- **variables**: The specific variables of the dataset to load.
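
The example names above suggest a `<family>-<subset>` pattern for datasets that need a subset. A minimal sketch of splitting such a name is shown here; `split_dataset_name` and `DatasetName` are hypothetical helpers for illustration and are not part of the project.

```python
from typing import NamedTuple, Optional


class DatasetName(NamedTuple):
    family: str            # e.g. "MedMnist" or "MEPS"
    subset: Optional[str]  # e.g. "nodulemnist3d"; None when the name has no subset


def split_dataset_name(name: str) -> DatasetName:
    """Split "<family>-<subset>" at the first hyphen (hypothetical helper)."""
    family, sep, subset = name.partition("-")
    return DatasetName(family, subset if sep else None)


for name in ["MEPS", "StrokePrediction", "MedMnist-nodulemnist3d"]:
    print(split_dataset_name(name))
```

The full string (for example "MedMnist-nodulemnist3d") is still what gets passed as `name`; the split is only useful for grouping or reporting.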

### get_data
The `get_data()` method returns the loaded data in the requested format and split.
#### Key Parameters:
- **format**: The format in which the data is returned (e.g., 'df' for a pandas DataFrame, 'DeepPurpose' for deep learning use).
- **dataset**: The data split to load (e.g., 'train', 'test', 'validation', 'all'). Different datasets may support different splits.
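
Since both parameters take string values, a small check before calling `get_data()` can catch typos early. The sketch below is an assumption-laden illustration: `check_get_data_args` is a hypothetical helper, and the allowed values are only the examples listed above, so the real sets may be larger.

```python
# Values taken from the examples in this section; the real lists may be longer.
KNOWN_FORMATS = {"df", "DeepPurpose"}
KNOWN_SPLITS = {"train", "test", "validation", "all"}


def check_get_data_args(format: str, dataset: str) -> dict:
    """Return keyword arguments for get_data(), or raise ValueError (hypothetical helper)."""
    if format not in KNOWN_FORMATS:
        raise ValueError(f"unknown format {format!r}; expected one of {sorted(KNOWN_FORMATS)}")
    if dataset not in KNOWN_SPLITS:
        raise ValueError(f"unknown split {dataset!r}; expected one of {sorted(KNOWN_SPLITS)}")
    return {"format": format, "dataset": dataset}


kwargs = check_get_data_args("df", "train")
print(kwargs)
# With a real loader, this could be passed on as: loader.get_data(**kwargs)
```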


# Setup and Installation
Import the required packages and add the project directory to the Python path so that the medical project dataloader can be imported.

In [2]:
# Import required packages
import os
import sys

# Add the project directory to the Python path
notebook_path = os.getcwd()  # Get the current working directory
project_path = os.path.abspath(os.path.join(notebook_path, "../.."))
sys.path.append(project_path)

# Verify the setup
print(f"Project path added to sys.path: {project_path}")
print("Environment setup complete.")

Project path added to sys.path: /home/tjl20001104/workspace/Projects/USC/biobank/hugging-health
Environment setup complete.
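
The same setup can be written with `pathlib`, which also makes it easy to avoid appending the path twice when the cell is re-run. This is a sketch only; `add_to_sys_path` is a hypothetical helper, and the two-levels-up layout is taken from the cell above.

```python
import sys
from pathlib import Path


def add_to_sys_path(start: Path, levels_up: int = 2) -> Path:
    """Add the directory `levels_up` levels above `start` to sys.path, once."""
    project_path = start.resolve()
    for _ in range(levels_up):
        # Path.parent of the filesystem root is the root itself, so this never fails.
        project_path = project_path.parent
    if str(project_path) not in sys.path:
        sys.path.append(str(project_path))
    return project_path


project_path = add_to_sys_path(Path.cwd(), levels_up=2)
print(f"Project path on sys.path: {project_path}")
```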


# DataLoader Class Overview
The project supports several task types, including classification, detection, NER, QA, and segmentation. See the README for the full list.

This section introduces the `DataLoader` class from `medicalproject2024.dataLoader.classification` and lists the dataset names it accepts.

In [3]:
# Import the DataLoader class from the classification module
from medicalproject2024.dataLoader.classification import DataLoader, SUPPORTED_DATASETS

# List the valid dataset names
print("Valid dataset names:")
print("\n".join(SUPPORTED_DATASETS))

Valid dataset names:
MedMnist-adrenalmnist3d
MedMnist-adrenalmnist3d_64
MedMnist-bloodmnist
MedMnist-bloodmnist_128
MedMnist-bloodmnist_224
MedMnist-bloodmnist_64
MedMnist-breastmnist
MedMnist-breastmnist_128
MedMnist-breastmnist_224
MedMnist-breastmnist_64
MedMnist-chestmnist
MedMnist-chestmnist_128
MedMnist-chestmnist_224
MedMnist-chestmnist_64
MedMnist-dermamnist
MedMnist-dermamnist_128
MedMnist-dermamnist_224
MedMnist-dermamnist_64
MedMnist-fracturemnist3d
MedMnist-fracturemnist3d_64
MedMnist-nodulemnist3d
MedMnist-nodulemnist3d_64
MedMnist-octmnist
MedMnist-octmnist_128
MedMnist-octmnist_224
MedMnist-octmnist_64
MedMnist-organamnist
MedMnist-organamnist_128
MedMnist-organamnist_224
MedMnist-organamnist_64
MedMnist-organcmnist
MedMnist-organcmnist_128
MedMnist-organcmnist_224
MedMnist-organcmnist_64
MedMnist-organmnist3d
MedMnist-organmnist3d_64
MedMnist-organsmnist
MedMnist-organsmnist_128
MedMnist-organsmnist_224
MedMnist-organsmnist_64
MedMnist-pathmnist
MedMnist-pathmnist_128
M
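
With this many similarly named entries, a mistyped name is easy to produce. One way to recover, sketched here with the standard library's `difflib`, is to suggest the closest known names. `suggest_names` is a hypothetical helper, and `KNOWN_NAMES` holds only a few names copied from the printed list; with the project installed, `SUPPORTED_DATASETS` would be used instead.

```python
import difflib
from typing import List, Sequence

# A few names copied from the printed list above.
KNOWN_NAMES = [
    "MedMnist-bloodmnist",
    "MedMnist-bloodmnist_64",
    "MedMnist-nodulemnist3d",
    "MedMnist-nodulemnist3d_64",
    "MedMnist-pathmnist",
]


def suggest_names(query: str, names: Sequence[str], n: int = 3) -> List[str]:
    """Return up to n known names that closely match query, ignoring case."""
    by_lower = {name.lower(): name for name in names}
    matches = difflib.get_close_matches(query.lower(), list(by_lower), n=n, cutoff=0.6)
    return [by_lower[m] for m in matches]


# The query differs from the known name only in letter case.
print(suggest_names("medmnist-nodulemnist3D", KNOWN_NAMES))
```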

# Loading Classification Data
This section shows how to load classification datasets (e.g., MEPS, NHIS) with the `DataLoader` class and a specific `task` parameter.

In [4]:
# Create a DataLoader instance for the MEPS dataset with a specific task
meps_dataloader = DataLoader(name="MEPS", task=10)
# Load the data as a DataFrame
meps_data = meps_dataloader.get_data(format="df")
# Display the first few rows of the loaded MEPS dataset
print("First few rows of the MEPS dataset:")
print(meps_data.head())


# Create a DataLoader instance for the NHIS dataset with a specific task
nhis_dataloader = DataLoader(name="NHIS", task=5)
# Load the data as a DataFrame
nhis_data = nhis_dataloader.get_data(format="df")
# Display the first few rows of the loaded NHIS dataset
print("\nFirst few rows of the NHIS dataset:")
print(nhis_data.head())

Found local default dataset...
Loading data from default dataset for task Predict Hypertension Diagnosis Age
Found local default dataset...
Loading data from default dataset for task Classify individuals based on their combined health and mental well-being


First few rows of the MEPS dataset:
   PHQ2  AGE  SEX  MARSTAT  K6SUM
0   0.0   39    1       10      2
1   0.0   40    2       10      2
2   4.0   52    1       40     98
3   1.0   22    1       50      5
4   0.0   19    1       50      0

First few rows of the NHIS dataset:
   HEALTH  WORFREQ  DEPFREQ  AGE  SEX  EDUC  BMICAT  SMOKEV
0       2        0        0   41    2   114       9       0
1       3        0        0   67    1   301       9       0
2       2        0        0   62    1   999       2       1
3       2        0        0   64    2   301       9       0
4       1        0        0   25    1   301       9       0
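
When several dataset/task pairs are needed, the two near-identical blocks in the cell above can be folded into a loop. The sketch below uses a stand-in `StubLoader` so that it runs without the project installed; `StubLoader` and `load_many` are hypothetical names, and only the call shape (`name`, `task`, `get_data(format=...)`) comes from the cell above.

```python
class StubLoader:
    """Stand-in with the same call shape as DataLoader; returns no real data."""

    def __init__(self, name, task=None):
        self.name = name
        self.task = task

    def get_data(self, format="df"):
        # A real loader would return the dataset here; the stub returns a label.
        return f"{self.name} (task={self.task}) as {format}"


def load_many(configs, loader_factory, format="df"):
    """Build one loader per config and collect the results by dataset name."""
    return {
        cfg["name"]: loader_factory(**cfg).get_data(format=format)
        for cfg in configs
    }


CONFIGS = [
    {"name": "MEPS", "task": 10},
    {"name": "NHIS", "task": 5},
]

for name, data in load_many(CONFIGS, StubLoader).items():
    print(name, "->", data)
```

With the project importable, passing `DataLoader` in place of `StubLoader` should give the same structure with real DataFrames as values, assuming the constructor accepts these keyword arguments as in the cell above.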


# Working with Different Datasets
This section loads other supported datasets: StrokePrediction, ROND, and the nodulemnist3d subset of MedMnist. Datasets such as ChestXRays and CheXpert_small are mentioned in the project structure but are not loaded in this cell.

In [5]:
# Load and display data for the StrokePrediction dataset
stroke_dataloader = DataLoader(name="StrokePrediction")
stroke_data = stroke_dataloader.get_data(format="df")
print("\nFirst few rows of the StrokePrediction dataset:")
print(stroke_data.head())

# Load and display data for the ROND dataset
rond_dataloader = DataLoader(name="ROND")
rond_data = rond_dataloader.get_data(format="df")
print("\nFirst few rows of the ROND dataset:")
print(rond_data.head())

# Load and display data for the MedMnist dataset, nodulemnist3d subset
MedMnist_dataloader = DataLoader(name="MedMnist-nodulemnist3d")
MedMnist_data = MedMnist_dataloader.get_data(format="df")
print("\nFirst few rows of the MedMnist dataset, nodulemnist3d subset:")
print(MedMnist_data.head())

Found local copy...
Found local copy...
Found local copy...



First few rows of the StrokePrediction dataset:
      id  gender   age  hypertension  heart_disease ever_married  \
0   9046    Male  67.0             0              1          Yes   
1  51676  Female  61.0             0              0          Yes   
2  31112    Male  80.0             0              1          Yes   
3  60182  Female  49.0             0              0          Yes   
4   1665  Female  79.0             1              0          Yes   

       work_type Residence_type  avg_glucose_level   bmi   smoking_status  \
0        Private          Urban             228.69  36.6  formerly smoked   
1  Self-employed          Rural             202.21   NaN     never smoked   
2        Private          Rural             105.92  32.5     never smoked   
3        Private          Urban             171.23  34.4           smokes   
4  Self-employed          Rural             174.12  24.0     never smoked   

   stroke  
0       1  
1       1  
2       1  
3       1  
4       1  

First 

# Data Format Conversions
This section calls the `get_data()` method with different `format` parameters (e.g., 'df' for a pandas DataFrame) and shows the output of each.

In [6]:
# Demonstrate the use of `get_data()` method with different format parameters

# Load the MEPS dataset as a pandas DataFrame
meps_data_df = meps_dataloader.get_data(format="df")
print("MEPS dataset loaded as a pandas DataFrame:")
print(meps_data_df.head())

# Load the MEPS dataset in 'DeepPurpose' format (a list of per-row dicts in the output below)
meps_data_dp = meps_dataloader.get_data(format="DeepPurpose")
print("\nMEPS dataset loaded for deep learning purpose:")
print(meps_data_dp[:5])  # Display the first 5 records

MEPS dataset loaded as a pandas DataFrame:
   PHQ2  AGE  SEX  MARSTAT  K6SUM
0   0.0   39    1       10      2
1   0.0   40    2       10      2
2   4.0   52    1       40     98
3   1.0   22    1       50      5
4   0.0   19    1       50      0

MEPS dataset loaded for deep learning purpose:
[{'PHQ2': 0.0, 'AGE': 39, 'SEX': 1, 'MARSTAT': 10, 'K6SUM': 2}, {'PHQ2': 0.0, 'AGE': 40, 'SEX': 2, 'MARSTAT': 10, 'K6SUM': 2}, {'PHQ2': 4.0, 'AGE': 52, 'SEX': 1, 'MARSTAT': 40, 'K6SUM': 98}, {'PHQ2': 1.0, 'AGE': 22, 'SEX': 1, 'MARSTAT': 50, 'K6SUM': 5}, {'PHQ2': 0.0, 'AGE': 19, 'SEX': 1, 'MARSTAT': 50, 'K6SUM': 0}]
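
The 'DeepPurpose' output above is a list of per-row dicts. If column-wise access is needed, the records can be transposed with plain Python, as in this sketch. `records_to_columns` is a hypothetical helper; it assumes every record has the same keys, which holds for the five records copied from the output above.

```python
# The five records printed above, copied verbatim.
records = [
    {"PHQ2": 0.0, "AGE": 39, "SEX": 1, "MARSTAT": 10, "K6SUM": 2},
    {"PHQ2": 0.0, "AGE": 40, "SEX": 2, "MARSTAT": 10, "K6SUM": 2},
    {"PHQ2": 4.0, "AGE": 52, "SEX": 1, "MARSTAT": 40, "K6SUM": 98},
    {"PHQ2": 1.0, "AGE": 22, "SEX": 1, "MARSTAT": 50, "K6SUM": 5},
    {"PHQ2": 0.0, "AGE": 19, "SEX": 1, "MARSTAT": 50, "K6SUM": 0},
]


def records_to_columns(records):
    """Turn a list of row dicts into a dict of column lists."""
    columns = {}
    for row in records:
        for key, value in row.items():
            columns.setdefault(key, []).append(value)
    return columns


columns = records_to_columns(records)
print(columns["AGE"])
```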


# Handling Different Data Splits
This section shows how the `dataset` parameter of `get_data()` selects the split that is returned.
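
Because different datasets may support different splits, it can help to probe which ones a loader accepts before relying on them. The sketch below does this against a stand-in loader. `StubSplitLoader` and `split_sizes` are hypothetical, the stub's data is illustrative, and catching `ValueError` is an assumption: this notebook does not show what the real loader does for an unsupported split.

```python
class StubSplitLoader:
    """Stand-in loader that only knows two splits (illustrative data)."""

    _splits = {"train": [1, 2, 3], "test": [4]}

    def get_data(self, format="df", dataset="all"):
        if dataset == "all":
            return [x for rows in self._splits.values() for x in rows]
        if dataset not in self._splits:
            # Assumption: the real loader's error type is not shown in this notebook.
            raise ValueError(f"split {dataset!r} is not available")
        return self._splits[dataset]


def split_sizes(loader, splits=("train", "test", "validation", "all")):
    """Return {split: row count}, with None for splits the loader rejects."""
    sizes = {}
    for split in splits:
        try:
            sizes[split] = len(loader.get_data(format="df", dataset=split))
        except ValueError:
            sizes[split] = None
    return sizes


print(split_sizes(StubSplitLoader()))
```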

In [8]:
# Load each split ("train", "test", "validation", "all") of the MedMnist nodulemnist3d subset
MedMnist_data = MedMnist_dataloader.get_data(format="df", dataset='train')
print("\nFirst few rows of the train split of the MedMnist dataset, nodulemnist3d subset:")
print(MedMnist_data.head())

MedMnist_data = MedMnist_dataloader.get_data(format="df", dataset='test')
print("\nFirst few rows of the test split of the MedMnist dataset, nodulemnist3d subset:")
print(MedMnist_data.head())

MedMnist_data = MedMnist_dataloader.get_data(format="df", dataset='validation')
print("\nFirst few rows of the validation split of the MedMnist dataset, nodulemnist3d subset:")
print(MedMnist_data.head())

MedMnist_data = MedMnist_dataloader.get_data(format="df", dataset='all')
print("\nFirst few rows of the whole MedMnist dataset, nodulemnist3d subset:")
print(MedMnist_data.head())


First few rows of the train split of the MedMnist dataset, nodulemnist3d subset:
                                               image label
0  [[[38, 37, 24, 32, 56, 132, 169, 186, 205, 191...   [0]
1  [[[193, 188, 184, 157, 180, 194, 195, 184, 69,...   [1]
2  [[[167, 166, 168, 178, 170, 192, 191, 190, 185...   [1]
3  [[[193, 173, 170, 191, 198, 199, 197, 198, 200...   [0]
4  [[[10, 11, 12, 11, 40, 108, 62, 21, 12, 10, 11...   [0]

First few rows of the test split of the MedMnist dataset, nodulemnist3d subset:
                                               image label
0  [[[21, 28, 19, 16, 16, 21, 21, 24, 33, 31, 29,...   [0]
1  [[[19, 18, 25, 18, 21, 19, 19, 19, 18, 16, 17,...   [0]
2  [[[164, 164, 163, 166, 164, 164, 164, 162, 160...   [0]
3  [[[162, 162, 162, 162, 165, 165, 162, 164, 163...   [0]
4  [[[26, 26, 26, 24, 23, 25, 30, 34, 36, 37, 39,...   [0]

First few rows of the validation split of the MedMnist dataset, nodulemnist3d subset:
                                          