# Feature Extractor Guide

This notebook walks through the complete workflow for adding a new dataset to TumorImagingBench by writing a dataset feature extractor.

## Overview

The TumorImagingBench framework uses **dataset extractors** to integrate new datasets. Each extractor is a Python module that:

1. **Loads dataset splits** (train/val/test) from CSV files
2. **Validates and preprocesses** each sample
3. **Extracts features** using all available foundation models
4. **Saves results** as pickle files

Because every extractor follows the same structure, feature extraction works the same way across all models and datasets.

## Part 1: Dataset Preparation

Before creating a feature extractor, prepare your dataset in CSV format.

### Expected Directory Structure

```
data/eval/my_dataset/
├── train.csv
├── val.csv
├── test.csv
└── images/
    ├── scan_001.nii.gz
    ├── scan_002.nii.gz
    └── ...
```
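
Before writing the CSVs, it can help to confirm that this layout is in place. The helper below is a minimal sketch: `missing_layout_entries` and `EXPECTED_ENTRIES` are hypothetical names, not part of TumorImagingBench, and the check assumes the exact file names shown above.

```python
from pathlib import Path

# Hypothetical helper (not part of TumorImagingBench): the entries listed
# here mirror the directory structure shown above.
EXPECTED_ENTRIES = ("train.csv", "val.csv", "test.csv", "images")


def missing_layout_entries(dataset_root):
    """Return the expected files/directories that are absent under dataset_root."""
    root = Path(dataset_root)
    return [name for name in EXPECTED_ENTRIES if not (root / name).exists()]
```

Calling `missing_layout_entries("data/eval/my_dataset")` returns an empty list when everything is present, and the names of the missing entries otherwise.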

### CSV Format

Each CSV file (train.csv, val.csv, test.csv) should have:

**Required columns:**
- `image_path` (str): Path to the NIFTI file (.nii.gz); absolute paths are recommended
- `coordX` (float): X centroid coordinate in physical space (mm)
- `coordY` (float): Y centroid coordinate in physical space (mm)
- `coordZ` (float): Z centroid coordinate in physical space (mm)

**Optional columns:**
- `label` (int/float): Target variable for classification/regression
- Any other metadata (patient_id, scan_date, etc.)

### Example CSV

```csv
image_path,coordX,coordY,coordZ,label
/path/to/data/eval/my_dataset/images/scan_001.nii.gz,100.5,150.3,200.1,0
/path/to/data/eval/my_dataset/images/scan_002.nii.gz,110.2,160.8,210.5,1
/path/to/data/eval/my_dataset/images/scan_003.nii.gz,95.8,145.2,195.3,0
```
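
A quick header check can catch a misspelled or missing column before extraction starts. This is a minimal sketch using only the standard library; `missing_required_columns` and `REQUIRED_COLUMNS` are hypothetical names introduced for illustration, and the column list is copied from the "Required columns" section above.

```python
import csv
import io

# Column names taken from the "Required columns" list above.
REQUIRED_COLUMNS = ("image_path", "coordX", "coordY", "coordZ")


def missing_required_columns(csv_text):
    """Return the required column names that are absent from the CSV header."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []  # fieldnames is None for empty input
    return [col for col in REQUIRED_COLUMNS if col not in header]


example = """image_path,coordX,coordY,coordZ,label
/path/to/data/eval/my_dataset/images/scan_001.nii.gz,100.5,150.3,200.1,0
"""
print(missing_required_columns(example))  # []
```

For a file on disk, pass `Path(csv_path).read_text()` as `csv_text`.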

**Important Notes:**
- Use absolute paths where possible; relative paths are resolved against the current working directory
- Coordinates should be in physical space (millimeters), not voxel indices
- The coordinates represent the centroid of the region of interest (lesion, tumor, etc.)
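
If your annotations are stored as voxel indices, they need to be mapped to physical space first. The sketch below applies a 4x4 affine by hand; `voxel_to_physical` is a hypothetical helper, not a TumorImagingBench function. Note one assumption that you should verify for your data: nibabel's `img.affine` maps voxels to RAS+ world coordinates, while ITK-based tools such as SimpleITK report LPS, and this guide does not state which convention the benchmark's loaders expect.

```python
def voxel_to_physical(affine, ijk):
    """Map a voxel index (i, j, k) to physical coordinates using a 4x4 affine.

    `affine` may be a nested list or a NumPy array (for example `img.affine`
    from nibabel); only its first three rows are used.
    """
    i, j, k = ijk
    return tuple(
        float(row[0] * i + row[1] * j + row[2] * k + row[3])
        for row in affine[:3]
    )


# Example: 2 mm isotropic spacing with the origin at (-10, -20, -30) mm.
affine = [
    [2.0, 0.0, 0.0, -10.0],
    [0.0, 2.0, 0.0, -20.0],
    [0.0, 0.0, 2.0, -30.0],
    [0.0, 0.0, 0.0, 1.0],
]
print(voxel_to_physical(affine, (5, 5, 5)))  # (0.0, -10.0, -20.0)
```

With an identity affine (1 mm spacing, origin at zero), voxel indices and physical coordinates coincide, which is why the dummy volumes in Part 3 can use small coordinate values directly.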

## Part 2: Create the Feature Extractor

Now we'll create the Python module for feature extraction. There are four key components:

1. `get_split_data(split, ...)` - Load dataset split from CSV
2. `preprocess_row(row)` - Validate and preprocess each sample
3. `extract_features(...)` - Main entry point
4. Command-line interface with argparse

Let's implement each one:

### Function 1: get_split_data()

This function loads and returns the dataset split as a pandas DataFrame.

In [1]:
import pandas as pd
import os

def get_split_data(split, train_csv, val_csv, test_csv):
    """
    Load dataset split from CSV.

    Parameters:
    -----------
    split : str
        One of ['train', 'val', 'test']
    train_csv : str
        Path to training CSV
    val_csv : str
        Path to validation CSV
    test_csv : str
        Path to test CSV

    Returns:
    --------
    pd.DataFrame
        DataFrame with columns: image_path, coordX, coordY, coordZ, label (optional), ...

    Raises:
    -------
    ValueError
        If split is not recognized
    FileNotFoundError
        If CSV file not found
    """
    split_paths = {
        "train": train_csv,
        "val": val_csv,
        "test": test_csv
    }

    if split not in split_paths:
        raise ValueError(f"Invalid split: {split}. Must be one of {list(split_paths.keys())}")

    csv_path = split_paths[split]
    if not os.path.exists(csv_path):
        raise FileNotFoundError(f"CSV not found: {csv_path}")
    
    return pd.read_csv(csv_path)

print("✓ Function 'get_split_data' defined")

✓ Function 'get_split_data' defined


### Function 2: preprocess_row()

This function validates and preprocesses each sample before feature extraction.

The basic version below passes each row through unchanged; add validation here if your dataset needs it.

In [2]:
# BASIC VERSION (no validation)
def preprocess_row(row):
    """Pass through - no preprocessing needed."""
    return row

print("✓ Basic 'preprocess_row' defined")

✓ Basic 'preprocess_row' defined
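
For datasets that need validation, a stricter variant might look like the sketch below. `preprocess_row_strict` is a hypothetical name, and the checks (file exists, coordinates are finite numbers) are illustrative choices rather than framework requirements. How `extract_all_features` treats a `None` return value or an exception is not documented in this guide, so this version raises `ValueError` instead of silently dropping rows.

```python
import math
import os

COORD_COLUMNS = ("coordX", "coordY", "coordZ")


def preprocess_row_strict(row):
    """Validate one sample; return it unchanged or raise ValueError.

    `row` can be a pandas Series or a plain dict, since both support
    `row["column"]` lookups.
    """
    path = row["image_path"]
    if not os.path.isfile(path):
        raise ValueError(f"Image not found: {path}")

    for col in COORD_COLUMNS:
        value = float(row[col])  # also rejects non-numeric strings
        if not math.isfinite(value):
            raise ValueError(f"{col} is not a finite number: {row[col]!r}")

    return row
```

Failing early with a clear message is usually cheaper than discovering a missing scan partway through a long extraction run.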


### Function 3: extract_features()

This is the main entry point that orchestrates feature extraction.

In [4]:
from functools import partial
from tumorimagingbench.evaluation.base_feature_extractor import extract_all_features, save_features

def extract_features(output_path, train_csv, val_csv, test_csv, model_names=None):
    """
    Extract features for all models.

    Parameters:
    -----------
    output_path : str
        Where to save extracted features (pickle file)
    train_csv : str
        Path to training annotations CSV
    val_csv : str
        Path to validation annotations CSV
    test_csv : str
        Path to test annotations CSV
    model_names : list of str, optional
        Specific models to extract. If None, extracts all available models.
        Example: ['DummyResNetExtractor', 'FMCIBExtractor']

    Returns:
    --------
    None
        Saves results to output_path

    Examples:
    ---------
    >>> # Extract all models
    >>> extract_features('features/my_dataset.pkl',
    ...                   'data/train.csv', 'data/val.csv', 'data/test.csv')
    >>>
    >>> # Extract specific models
    >>> extract_features('features/my_dataset.pkl',
    ...                   'data/train.csv', 'data/val.csv', 'data/test.csv',
    ...                   model_names=['DummyResNetExtractor'])
    """
    print("=" * 70)
    print("TumorImagingBench Feature Extraction")
    print("=" * 70)

    # Create a partial function that binds CSV paths
    get_split_fn = partial(get_split_data, train_csv=train_csv, val_csv=val_csv, test_csv=test_csv)

    # Extract features for all models
    features = extract_all_features(get_split_fn, preprocess_row, model_names=model_names)

    # Save results to disk
    save_features(features, output_path)

    print("=" * 70)
    print("✓ Feature extraction completed successfully")
    print(f"✓ Results saved to {output_path}")
    print("=" * 70)

print("✓ Function 'extract_features' defined")

✓ Function 'extract_features' defined
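
The `partial` call in `extract_features` is what turns the four-argument loader into a callable that needs only the split name, which is presumably how `extract_all_features` invokes it (an inference from the code above, not a documented contract). The standalone sketch below shows the binding with a stand-in loader; `load_split` is hypothetical and returns the selected path instead of a DataFrame.

```python
from functools import partial


def load_split(split, train_csv, val_csv, test_csv):
    """Stand-in for get_split_data: returns the chosen path, not a DataFrame."""
    return {"train": train_csv, "val": val_csv, "test": test_csv}[split]


# Bind the three paths once; the resulting callable takes only the split name.
get_split_fn = partial(
    load_split,
    train_csv="train.csv",
    val_csv="val.csv",
    test_csv="test.csv",
)

print(get_split_fn("val"))  # val.csv
```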


## Part 3: Complete Example

Let's demonstrate the complete workflow with a simple example.

### Step 1: Create Example CSVs with Dummy Data

We'll create minimal example CSV files to demonstrate the workflow.

In [5]:
# Create example data directory
import tempfile
from pathlib import Path

# Create a temporary directory for our example
example_dir = Path(tempfile.gettempdir()) / "tumor_imaging_example"
example_dir.mkdir(exist_ok=True)

data_dir = example_dir / "data" / "eval" / "my_dataset" / "images"
data_dir.mkdir(parents=True, exist_ok=True)

print(f"Created example data directory: {data_dir}")
print("Example directory structure:")
print(f"  {example_dir}/")
print("  ├── data/eval/my_dataset/")
print("  │   ├── train.csv")
print("  │   ├── val.csv")
print("  │   ├── test.csv")
print("  │   └── images/")
print("  │       ├── scan_001.nii.gz")
print("  │       ├── scan_002.nii.gz")
print("  │       └── scan_003.nii.gz")

Created example data directory: /tmp/tumor_imaging_example/data/eval/my_dataset/images
Example directory structure:
  /tmp/tumor_imaging_example/
  ├── data/eval/my_dataset/
  │   ├── train.csv
  │   ├── val.csv
  │   ├── test.csv
  │   └── images/
  │       ├── scan_001.nii.gz
  │       ├── scan_002.nii.gz
  │       └── scan_003.nii.gz


In [6]:
import numpy as np
import nibabel as nib

# Create dummy NIFTI files
for i in range(1, 4):
    # Create dummy image data (32x32x32 volume)
    data = np.random.rand(32, 32, 32).astype(np.float32)
    
    # Create affine matrix (identity with 1mm spacing)
    affine = np.eye(4)
    affine[0, 0] = 1.0  # X spacing
    affine[1, 1] = 1.0  # Y spacing
    affine[2, 2] = 1.0  # Z spacing
    
    # Create NIFTI image
    img = nib.Nifti1Image(data, affine)
    
    # Save
    filepath = data_dir / f"scan_{i:03d}.nii.gz"
    nib.save(img, filepath)
    print(f"✓ Created {filepath}")

print("\n✓ All dummy NIFTI files created")

✓ Created /tmp/tumor_imaging_example/data/eval/my_dataset/images/scan_001.nii.gz
✓ Created /tmp/tumor_imaging_example/data/eval/my_dataset/images/scan_002.nii.gz
✓ Created /tmp/tumor_imaging_example/data/eval/my_dataset/images/scan_003.nii.gz

✓ All dummy NIFTI files created


In [7]:
# Create example CSV files
train_data = {
    'image_path': [
        str(data_dir / "scan_001.nii.gz"),
    ],
    'coordX': [15.5],
    'coordY': [16.0],
    'coordZ': [16.5],
    'label': [0]
}

val_data = {
    'image_path': [
        str(data_dir / "scan_002.nii.gz"),
    ],
    'coordX': [14.5],
    'coordY': [15.0],
    'coordZ': [15.5],
    'label': [1]
}

test_data = {
    'image_path': [
        str(data_dir / "scan_003.nii.gz"),
    ],
    'coordX': [16.5],
    'coordY': [17.0],
    'coordZ': [17.5],
    'label': [0]
}

# Save CSVs
train_csv_path = example_dir / "data" / "eval" / "my_dataset" / "train.csv"
val_csv_path = example_dir / "data" / "eval" / "my_dataset" / "val.csv"
test_csv_path = example_dir / "data" / "eval" / "my_dataset" / "test.csv"

pd.DataFrame(train_data).to_csv(train_csv_path, index=False)
pd.DataFrame(val_data).to_csv(val_csv_path, index=False)
pd.DataFrame(test_data).to_csv(test_csv_path, index=False)

print(f"✓ Created {train_csv_path}")
print(f"✓ Created {val_csv_path}")
print(f"✓ Created {test_csv_path}")

✓ Created /tmp/tumor_imaging_example/data/eval/my_dataset/train.csv
✓ Created /tmp/tumor_imaging_example/data/eval/my_dataset/val.csv
✓ Created /tmp/tumor_imaging_example/data/eval/my_dataset/test.csv


In [8]:
# Verify CSV contents
print("\nTrain CSV:")
print(pd.read_csv(train_csv_path))
print("\nVal CSV:")
print(pd.read_csv(val_csv_path))
print("\nTest CSV:")
print(pd.read_csv(test_csv_path))


Train CSV:
                                          image_path  coordX  coordY  coordZ  \
0  /tmp/tumor_imaging_example/data/eval/my_datase...    15.5    16.0    16.5   

   label  
0      0  

Val CSV:
                                          image_path  coordX  coordY  coordZ  \
0  /tmp/tumor_imaging_example/data/eval/my_datase...    14.5    15.0    15.5   

   label  
0      1  

Test CSV:
                                          image_path  coordX  coordY  coordZ  \
0  /tmp/tumor_imaging_example/data/eval/my_datase...    16.5    17.0    17.5   

   label  
0      0  


### Step 2: Test the get_split_data Function

In [9]:
# Test loading train split
train_df = get_split_data('train', str(train_csv_path), str(val_csv_path), str(test_csv_path))
print("Loaded train split:")
print(train_df)
print(f"\nShape: {train_df.shape}")
print(f"Columns: {list(train_df.columns)}")

Loaded train split:
                                          image_path  coordX  coordY  coordZ  \
0  /tmp/tumor_imaging_example/data/eval/my_datase...    15.5    16.0    16.5   

   label  
0      0  

Shape: (1, 5)
Columns: ['image_path', 'coordX', 'coordY', 'coordZ', 'label']


In [10]:
# Test loading val split
val_df = get_split_data('val', str(train_csv_path), str(val_csv_path), str(test_csv_path))
print("Loaded val split:")
print(val_df)

# Test loading test split
test_df = get_split_data('test', str(train_csv_path), str(val_csv_path), str(test_csv_path))
print("\nLoaded test split:")
print(test_df)

Loaded val split:
                                          image_path  coordX  coordY  coordZ  \
0  /tmp/tumor_imaging_example/data/eval/my_datase...    14.5    15.0    15.5   

   label  
0      1  

Loaded test split:
                                          image_path  coordX  coordY  coordZ  \
0  /tmp/tumor_imaging_example/data/eval/my_datase...    16.5    17.0    17.5   

   label  
0      0  


### Step 3: Test the preprocess_row Function

In [11]:
# Test preprocessing a valid row
test_row = train_df.iloc[0]
print("Original row:")
print(test_row)

# Preprocess
preprocessed = preprocess_row(test_row)
print("\nPreprocessed row:")
print(preprocessed)
print(f"\nPreprocessing successful: {preprocessed is not None}")

Original row:
image_path    /tmp/tumor_imaging_example/data/eval/my_datase...
coordX                                                     15.5
coordY                                                     16.0
coordZ                                                     16.5
label                                                         0
Name: 0, dtype: object

Preprocessed row:
image_path    /tmp/tumor_imaging_example/data/eval/my_datase...
coordX                                                     15.5
coordY                                                     16.0
coordZ                                                     16.5
label                                                         0
Name: 0, dtype: object

Preprocessing successful: True


### Step 4: Extract Features

Now let's actually extract features using our dataset!

In [14]:
# Create output directory
output_dir = example_dir / "features"
output_dir.mkdir(exist_ok=True)

output_path = str(output_dir / "my_dataset.pkl")

print(f"Output path: {output_path}")
print("Using dataset CSV files:")
print(f"  Train: {train_csv_path}")
print(f"  Val:   {val_csv_path}")
print(f"  Test:  {test_csv_path}")

Output path: /tmp/tumor_imaging_example/features/my_dataset.pkl
Using dataset CSV files:
  Train: /tmp/tumor_imaging_example/data/eval/my_dataset/train.csv
  Val:   /tmp/tumor_imaging_example/data/eval/my_dataset/val.csv
  Test:  /tmp/tumor_imaging_example/data/eval/my_dataset/test.csv


In [15]:
# Extract features using only DummyResNetExtractor for speed
# In production, you would extract all models by omitting model_names

try:
    extract_features(
        output_path=output_path,
        train_csv=str(train_csv_path),
        val_csv=str(val_csv_path),
        test_csv=str(test_csv_path),
        model_names=['DummyResNetExtractor']  # Use specific model for demo
    )
except Exception as e:
    print(f"Error during feature extraction: {e}")
    import traceback
    traceback.print_exc()

TumorImagingBench Feature Extraction
[<class 'tumorimagingbench.models.dummy_resnet.DummyResNetExtractor'>]


monai.transforms.spatial.dictionary Orientationd.__init__:labels: Current default value of argument `labels=(('L', 'R'), ('P', 'A'), ('I', 'S'))` was changed in version None from `labels=(('L', 'R'), ('P', 'A'), ('I', 'S'))` to `labels=None`. Default value changed to None meaning that the transform now uses the 'space' of a meta-tensor, if applicable, to determine appropriate axis labels.



Processing DummyResNetExtractor


100%|██████████| 1/1 [00:03<00:00,  3.58s/it]
100%|██████████| 1/1 [00:00<00:00, 13.58it/s]
100%|██████████| 1/1 [00:00<00:00, 11.65it/s]


Features saved to /tmp/tumor_imaging_example/features/my_dataset.pkl
✓ Feature extraction completed successfully
✓ Results saved to /tmp/tumor_imaging_example/features/my_dataset.pkl


### Step 5: Load and Inspect Extracted Features

In [24]:
import pickle

# Load the extracted features
with open(output_path, 'rb') as f:
    features = pickle.load(f)

print("Extracted features structure:")
print(f"Models: {list(features.keys())}")
print("--------------------------------")
for model_name in features.keys():
    model_features = features[model_name]
    for split_name, split_features in model_features.items():
        print(split_name)
        print(np.vstack([v["feature"] for v in split_features]).shape)

Extracted features structure:
Models: ['DummyResNetExtractor']
--------------------------------
train
(1, 512)
val
(1, 512)
test
(1, 512)


## Usage Examples

### Example 1: Extract All Models (Default)

```python
from tumorimagingbench.evaluation.my_dataset_feature_extractor import extract_features

extract_features(
    output_path='features/my_dataset.pkl',
    train_csv='data/eval/my_dataset/train.csv',
    val_csv='data/eval/my_dataset/val.csv',
    test_csv='data/eval/my_dataset/test.csv'
)
```

### Example 2: Extract Specific Models Only

```python
extract_features(
    output_path='features/my_dataset.pkl',
    train_csv='data/eval/my_dataset/train.csv',
    val_csv='data/eval/my_dataset/val.csv',
    test_csv='data/eval/my_dataset/test.csv',
    model_names=['DummyResNetExtractor', 'FMCIBExtractor']
)
```

### Example 3: Command-Line Usage

Create a file `src/tumorimagingbench/evaluation/my_dataset_feature_extractor.py` containing the three functions from Part 2, then add this code at the end:

```python
if __name__ == "__main__":
    import argparse

    parser = argparse.ArgumentParser(
        description="Extract features for My Dataset",
        formatter_class=argparse.RawDescriptionHelpFormatter,
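        # Raw string: otherwise each trailing backslash plus newline in the
        # examples is treated as a line continuation and dropped from the help text.
        epilog=r"""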
EXAMPLES:
  # Extract all models
  python my_dataset_feature_extractor.py \
    --output features/my_dataset.pkl \
    --train-csv /path/to/train.csv \
    --val-csv /path/to/val.csv \
    --test-csv /path/to/test.csv

  # Extract specific models
  python my_dataset_feature_extractor.py \
    --output features/my_dataset.pkl \
    --train-csv /path/to/train.csv \
    --val-csv /path/to/val.csv \
    --test-csv /path/to/test.csv \
    --models DummyResNetExtractor FMCIBExtractor
        """
    )

    parser.add_argument(
        "--output",
        type=str,
        default="features/my_dataset.pkl",
        help="Path where to save extracted features"
    )
    parser.add_argument(
        "--train-csv",
        type=str,
        required=True,
        help="Path to training annotations CSV"
    )
    parser.add_argument(
        "--val-csv",
        type=str,
        required=True,
        help="Path to validation annotations CSV"
    )
    parser.add_argument(
        "--test-csv",
        type=str,
        required=True,
        help="Path to test annotations CSV"
    )
    parser.add_argument(
        "--models",
        type=str,
        nargs="+",
        default=None,
        help="Specific models to extract (space-separated)"
    )

    args = parser.parse_args()
    extract_features(args.output, args.train_csv, args.val_csv, args.test_csv, args.models)
```

Then run:

```bash
cd src/tumorimagingbench/evaluation
python my_dataset_feature_extractor.py \
  --output features/my_dataset.pkl \
  --train-csv /path/to/train.csv \
  --val-csv /path/to/val.csv \
  --test-csv /path/to/test.csv
```
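
To see how these flags map onto the arguments of `extract_features` without running an extraction, the parser can be exercised with an explicit argument list. This is a trimmed sketch of the parser above (help strings and epilog omitted); `build_parser` is a hypothetical wrapper introduced for illustration.

```python
import argparse


def build_parser():
    """Trimmed copy of the parser above, wrapped in a function for testing."""
    parser = argparse.ArgumentParser(description="Extract features for My Dataset")
    parser.add_argument("--output", type=str, default="features/my_dataset.pkl")
    parser.add_argument("--train-csv", type=str, required=True)
    parser.add_argument("--val-csv", type=str, required=True)
    parser.add_argument("--test-csv", type=str, required=True)
    parser.add_argument("--models", type=str, nargs="+", default=None)
    return parser


# Passing a list to parse_args() bypasses sys.argv.
args = build_parser().parse_args([
    "--train-csv", "train.csv",
    "--val-csv", "val.csv",
    "--test-csv", "test.csv",
    "--models", "DummyResNetExtractor", "FMCIBExtractor",
])
print(args.train_csv)  # train.csv
print(args.models)     # ['DummyResNetExtractor', 'FMCIBExtractor']
print(args.output)     # features/my_dataset.pkl
```

argparse converts the hyphens in `--train-csv` to underscores, so the value is read as `args.train_csv`; `nargs="+"` collects the space-separated model names into a list, and omitting `--models` leaves it as `None`.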

## Reference Implementations

For reference, see these existing extractors in the codebase:

- **Simple example**: `dummy_dataset_feature_extractor.py` - Basic template with minimal preprocessing
- **Real dataset**: `luna_feature_extractor.py` - LUNA16 lung nodule dataset
- **Complex example**: `nsclc_radiomics_feature_extractor.py` - NSCLC radiomics dataset with labels

You can use these as templates for your own dataset extractors.

## Summary

To add a new dataset to TumorImagingBench:

1. **Prepare your data**: Organize images as NIFTI/NRRD files and create CSV files with required columns
2. **Create extractor**: Implement `get_split_data()`, `preprocess_row()`, and `extract_features()` functions
3. **Run extraction**: Call `extract_features()` or use command-line interface
4. **Use results**: Load the pickle file and use features for downstream analysis

The framework handles:
- Loading all available foundation models
- Preprocessing NIFTI images
- Extracting features in parallel
- Saving results in a unified format

You only need to define your dataset-specific logic!