# MSSegSAM Data Preparation Pipeline

This notebook streamlines the data preparation process for training the MSSegSAM model.

The pipeline is organized into two sequential phases:

1.  **Phase 1: Preprocessing and Standardization (MRI $\rightarrow$ NIfTI MNI152)**
    *   Transforming heterogeneous raw data into a standardized format.
    *   Applying skull stripping, MNI space registration, and bias field correction.

2.  **Phase 2: COCO Format Conversion (NIfTI $\rightarrow$ COCO JSON)**
    *   Extracting 2D slices from 3D volumes.
    *   Generating training annotations in the standard MS-COCO format.

> [!NOTE]
> The code is designed to be run on CPUs.
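
To make the Phase 2 target concrete, here is a minimal sketch of how a binary lesion mask for one 2D slice could be turned into a single MS-COCO annotation. It is not the converter's actual implementation: the function name `mask_to_coco_annotation` and the `category_id=1` default are hypothetical, and the whole mask is treated as one object rather than being split into connected components. The `bbox` layout `[x, y, width, height]` and the field names follow the standard MS-COCO annotation format.

```python
def mask_to_coco_annotation(mask, image_id, annotation_id, category_id=1):
    """Build one COCO-style annotation from a 2D binary mask (list of rows).

    Hypothetical helper for illustration only. Returns None for an empty mask.
    """
    # Row indices that contain at least one foreground pixel.
    rows = [r for r, row in enumerate(mask) if any(row)]
    if not rows:
        return None
    # Column indices that contain at least one foreground pixel.
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]

    x, y = min(cols), min(rows)
    width = max(cols) - x + 1
    height = max(rows) - y + 1
    area = sum(1 for row in mask for value in row if value)

    return {
        "id": annotation_id,
        "image_id": image_id,
        "category_id": category_id,
        "bbox": [x, y, width, height],  # COCO convention: top-left corner + size
        "area": area,
        "iscrowd": 0,
    }


example_mask = [
    [0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0],
]
print(mask_to_coco_annotation(example_mask, image_id=7, annotation_id=1))
```

A slice with an empty mask yields `None`, which is the case the `remove_empty` option in Phase 2 is meant to filter out.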

## Setting Up the Environment

In [None]:
import multiprocessing
import sys
from pathlib import Path

# The notebook is expected to sit one level below the project root,
# so the parent directory is added to sys.path to make `src` importable.
project_root = Path.cwd().parent
if str(project_root) not in sys.path:
    sys.path.append(str(project_root))

try:
    import src.pipeline as pipeline_module
    import src.coco_converter as coco_module
    print("Modules loaded successfully.")
except ImportError as e:
    print(f"Error importing modules: {e}")
    raise  # Stop here: the cells below cannot run without these modules.

## Phase 1: Preprocessing

In [None]:
# Leave one CPU free for the rest of the system, but always use at least one worker.
cpu_count = multiprocessing.cpu_count()
MAX_WORKERS = max(1, cpu_count - 1)
print(f"CPUs detected: {cpu_count}")
print(f"Maximum number of workers: {MAX_WORKERS}")

In [None]:
INPUT_RAW_DIR = "../Datasets_Raw"
OUTPUT_PROC_DIR = "../Datasets_Processed"

# Config
VERBOSE = False
WORKERS = MAX_WORKERS 

In [None]:
# Run preprocessing
try:
    pipeline_module.run(INPUT_RAW_DIR, OUTPUT_PROC_DIR, workers=WORKERS, verbose=VERBOSE)
except Exception as e:
    print(f"\nError during preprocessing: {e}")

### Train/Val/Test Split

After Phase 1, the data is organized in the structure: Dataset -> Patient -> Timepoint.

Before COCO conversion, it is recommended to organize the data into `train`, `val` and `test` subfolders within each dataset.

The COCO converter automatically detects these folders. If the data is already divided, the converter will generate three separate JSON files (`train/annotations.json`, `val/...`, `test/...`), ready for model training. If the data is not divided, a single "flat" dataset will be generated.
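
The split itself is left to the user. A minimal sketch of one way to do it is shown below, assuming the split is made at the patient level so that all timepoints of a patient land in the same subset (this avoids leaking a patient across train and test). The helper names `split_patients` and `move_into_splits`, the 70/15/15 proportions, the fixed seed, and the resulting `Dataset/train/Patient/Timepoint` layout are all assumptions for illustration, not part of the pipeline.

```python
import random
import shutil
from pathlib import Path


def split_patients(patient_ids, val_frac=0.15, test_frac=0.15, seed=42):
    """Assign each patient ID to exactly one of train/val/test (hypothetical helper)."""
    ids = sorted(patient_ids)          # sort first so the result is reproducible
    random.Random(seed).shuffle(ids)   # deterministic shuffle for a given seed
    n_test = round(len(ids) * test_frac)
    n_val = round(len(ids) * val_frac)
    return {
        "test": ids[:n_test],
        "val": ids[n_test:n_test + n_val],
        "train": ids[n_test + n_val:],
    }


def move_into_splits(dataset_dir, splits):
    """Move each patient folder into dataset_dir/<split>/ (assumed layout)."""
    dataset_dir = Path(dataset_dir)
    for split_name, ids in splits.items():
        target = dataset_dir / split_name
        target.mkdir(exist_ok=True)
        for patient_id in ids:
            shutil.move(str(dataset_dir / patient_id), str(target / patient_id))


# Example with made-up patient IDs; nothing is moved on disk here.
splits = split_patients([f"P{i:02d}" for i in range(20)])
print({name: len(ids) for name, ids in splits.items()})
```

Calling `move_into_splits` on a processed dataset folder would then apply the assignment on disk; it is worth running it on a copy of the data first.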


## Phase 2: Conversion to COCO Format

| Parameter | Type | Description | Default |
| :--- | :--- | :--- | :--- |
| **`input_dir`** | `str` | Root directory containing the **processed** datasets (Phase 1 output). | **Required** |
| **`output_dir`** | `str` | Target directory for the generated COCO dataset. | `"dataset_COCO"` |
| **`dataset_names`** | `list` | Names of the sub-datasets to process, e.g. `["Dataset1", "Dataset2"]`, or `["all"]`. | `["all"]` |
| **`slice_range`** | `list` | Range of axial slices to extract from the 3D volume, e.g. `["0", "181"]`, or `["all"]`. | `["all"]` |
| **`slice_step`** | `int` | Step between extracted slices (e.g., `5` extracts every 5th slice). | `1` |
| **`remove_empty`** | `bool` | If `True`, skips slices with no ground-truth lesions. | `False` |
| **`all_timepoints`** | `bool` | If `True`, processes all timepoints instead of only the last one. | `False` |
| **`modality`** | `str` | MRI modality to extract (`T1`, `T2`, `FLAIR`). | `"FLAIR"` |
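
As an illustration of how `slice_range` and `slice_step` interact, the sketch below computes which axial slice indices would be extracted. The function `select_slice_indices` is hypothetical, and it assumes that the upper bound of `slice_range` is inclusive and that out-of-range bounds are clamped to the volume; the actual converter may handle both differently.

```python
def select_slice_indices(n_slices, slice_range=("all",), slice_step=1):
    """Return the axial slice indices to extract (hypothetical helper).

    Assumption: the second element of slice_range is an inclusive upper bound.
    """
    if slice_step < 1:
        raise ValueError("slice_step must be a positive integer")

    if list(slice_range) == ["all"]:
        start, stop = 0, n_slices
    else:
        # Values arrive as strings, e.g. ["0", "181"], so convert them first.
        start = max(0, int(slice_range[0]))
        stop = min(n_slices, int(slice_range[1]) + 1)

    return list(range(start, stop, slice_step))


# With the settings used in this notebook (all slices, every 4th one)
# on a volume assumed to have 182 axial slices:
indices = select_slice_indices(182, ["all"], slice_step=4)
print(len(indices), indices[:5])
```

A larger `slice_step` reduces the number of near-duplicate neighboring slices in the dataset at the cost of fewer training images.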

In [None]:
INPUT_NIFTI_DIR = "../Datasets_Processed"
OUTPUT_COCO_DIR = "../../Dataset"

# Config
DATASET_NAMES = ["all"]
SLICE_RANGE = ["all"]
SLICE_STEP = 4
REMOVE_EMPTY = True
ALL_TIMEPOINTS = False
MODALITY = "FLAIR"

In [None]:
# Run COCO conversion
converter = coco_module.COCOConverter(
    input_dir=INPUT_NIFTI_DIR,
    output_dir=OUTPUT_COCO_DIR,
    dataset_names=DATASET_NAMES,
    slice_range=SLICE_RANGE,
    slice_step=SLICE_STEP,
    modality=MODALITY,
    remove_empty=REMOVE_EMPTY,
    all_timepoints=ALL_TIMEPOINTS
)

try:
    converter.run()
except Exception as e:
    print(f"\nError during COCO conversion: {e}")
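
Once the conversion finishes, a quick sanity check of the generated annotations can catch obvious problems before training. The sketch below relies only on the standard MS-COCO top-level keys (`images`, `annotations`) and their `id` / `image_id` fields. The helper names `load_coco` and `summarize_coco` are hypothetical, and the file location in the commented usage line is an assumption based on the `train/annotations.json` layout described above.

```python
import json
from collections import Counter
from pathlib import Path


def load_coco(path):
    """Read a COCO JSON file into a dict."""
    return json.loads(Path(path).read_text())


def summarize_coco(coco):
    """Count images and annotations and flag annotations with no matching image."""
    image_ids = {img["id"] for img in coco.get("images", [])}
    annotations = coco.get("annotations", [])
    per_image = Counter(ann["image_id"] for ann in annotations)
    return {
        "n_images": len(image_ids),
        "n_annotations": len(annotations),
        "n_images_without_annotations": len(image_ids - set(per_image)),
        "orphan_annotation_ids": [
            ann["id"] for ann in annotations if ann["image_id"] not in image_ids
        ],
    }


# Usage (path is an assumption; adjust to the actual output layout):
# print(summarize_coco(load_coco(Path(OUTPUT_COCO_DIR) / "train" / "annotations.json")))

# Tiny in-memory example with one deliberately broken annotation:
demo = {
    "images": [{"id": 1}, {"id": 2}],
    "annotations": [{"id": 10, "image_id": 1}, {"id": 11, "image_id": 3}],
}
print(summarize_coco(demo))
```

With `REMOVE_EMPTY = True`, one would expect `n_images_without_annotations` to be zero; a non-empty `orphan_annotation_ids` list points to an inconsistent file.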