# Colab Pipeline Runner
This notebook replaces the CLI entry point so you can run the
soil micro-CT workflow directly from Google Colab. It mimics the logic
in `run_pipeline.py`, validates the canonical configuration, and then
calls `cli.pipeline.main` with programmatic inputs.

## Environment Setup
Import the standard-library modules used throughout the notebook and
confirm which Python interpreter the runtime provides.

In [None]:
import json
import os
import sys
from pathlib import Path

print(f"Python {sys.version.split()[0]} interpreter ready")

Python 3.12.12 interpreter ready


## Dependency Installation
Install the Python libraries required by the pipeline without touching the
CUDA stack that Colab already provides.

In [None]:
# Install the runtime dependencies that the pipeline relies on.
!pip install numpy pyyaml tifffile scikit-image opencv-python
!pip install pandas
# CuPy is intentionally not installed because Colab already ships a
# compatible CUDA stack and we do not want to override it.



## Mount Google Drive
Authenticate with Google Drive so the notebook can access the project
and scan datasets stored under your Drive space.

In [None]:
from google.colab import drive

drive.mount('/content/drive')
print('Google Drive mounted at /content/drive')

Mounted at /content/drive
Google Drive mounted at /content/drive


## Project Root Definition
Point to the repository folder on Drive, add it to `sys.path`, and ensure
`cli.pipeline` imports cleanly.

In [None]:
import importlib

DRIVE_PROJECT_ROOT = Path('/content/drive/MyDrive/soil_microCT_images/drive_scripts/project_refactor_v1')
if not DRIVE_PROJECT_ROOT.exists():
    raise FileNotFoundError(f'Project root missing: {DRIVE_PROJECT_ROOT}')

PROJECT_ROOT = DRIVE_PROJECT_ROOT
if str(PROJECT_ROOT) not in sys.path:
    sys.path.insert(0, str(PROJECT_ROOT))

import cli.pipeline as cli_pipeline
importlib.reload(cli_pipeline)
print('cli.pipeline imported from', cli_pipeline.__file__)

cli.pipeline imported from /content/drive/MyDrive/soil_microCT_images/drive_scripts/project_refactor_v1/cli/pipeline.py


## Config Validation
Load `config/2_class_pipeline.yaml`, validate the schema, and resolve the
scan directory and identifier before execution. To target a different
dataset, set `CUSTOM_SCAN_DIR` or `CUSTOM_SCAN_ID` in the cell below.

In [None]:
import yaml

CONFIG_PATH = PROJECT_ROOT / 'config' / '2_class_pipeline.yaml'
with CONFIG_PATH.open('r', encoding='utf-8') as handle:
    config = yaml.safe_load(handle) or {}

cli_pipeline._validate_config_schema(config)
print('Pipeline configuration schema valid')

def resolve_path(value: str | Path | None) -> Path:
    """Return an absolute path, resolving relative values against PROJECT_ROOT."""
    if value is None:
        raise ValueError('Cannot resolve a missing path')
    candidate = Path(value)
    return candidate if candidate.is_absolute() else PROJECT_ROOT / candidate

# Set either or both of the following to point at a specific scan in Drive.
CUSTOM_SCAN_DIR: Path | None = None
CUSTOM_SCAN_ID: str | None = None

scan_dir_source = CUSTOM_SCAN_DIR or config.get('scan_dir')
scan_dir = resolve_path(scan_dir_source)
if not scan_dir.exists():
    raise FileNotFoundError(f'Scan directory not found: {scan_dir}')

scan_id = CUSTOM_SCAN_ID or config.get('scan_id') or scan_dir.name

STAGE_ORDER = ('preprocessing', 'segmentation', 'z_stability', 'binary', 'psd')
stage_dir_map = {stage: scan_dir / 'pipeline_outputs' / stage for stage in STAGE_ORDER}

print(f'Resolved scan directory: {scan_dir}')
print(f'Resolved scan identifier: {scan_id}')
print('Stage directory map (pipeline will create these):')
for stage, path in stage_dir_map.items():
    print(f'  {stage}: {path}')

Pipeline configuration schema valid
Resolved scan directory: /content/drive/MyDrive/soil_microCT_images/ROI/rehovot/rehovot_partial_16bit_otsu2
Resolved scan identifier: rehovot_test_2class
Stage directory map (pipeline will create these):
  preprocessing: /content/drive/MyDrive/soil_microCT_images/ROI/rehovot/rehovot_partial_16bit_otsu2/pipeline_outputs/preprocessing
  segmentation: /content/drive/MyDrive/soil_microCT_images/ROI/rehovot/rehovot_partial_16bit_otsu2/pipeline_outputs/segmentation
  z_stability: /content/drive/MyDrive/soil_microCT_images/ROI/rehovot/rehovot_partial_16bit_otsu2/pipeline_outputs/z_stability
  binary: /content/drive/MyDrive/soil_microCT_images/ROI/rehovot/rehovot_partial_16bit_otsu2/pipeline_outputs/binary
  psd: /content/drive/MyDrive/soil_microCT_images/ROI/rehovot/rehovot_partial_16bit_otsu2/pipeline_outputs/psd
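The path resolution used above can be illustrated in isolation. The sketch below is a standalone variant of `resolve_path` that takes the root as an explicit argument. The name `resolve_against`, the `root` parameter, and the example paths are assumptions made for illustration and do not come from the project.

```python
from __future__ import annotations

from pathlib import Path


def resolve_against(root: Path, value: str | Path | None) -> Path:
    """Return `value` unchanged if it is absolute, else join it onto `root`."""
    if value is None:
        raise ValueError("Cannot resolve a missing path")
    candidate = Path(value)
    return candidate if candidate.is_absolute() else root / candidate


# Hypothetical root and scan locations, used only for this illustration.
example_root = Path("/content/project")
relative_result = resolve_against(example_root, "data/scan_a")
absolute_result = resolve_against(example_root, "/content/drive/scan_b")
print(relative_result)
print(absolute_result)
```

A relative `scan_dir` in the YAML file is therefore interpreted relative to the project root, while an absolute Drive path is used as given.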


## Optional DRY_RUN Toggle
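Set `DRY_RUN` to `True` to skip the heavy pipeline stages while keeping
directory layouts and diagnostics unchanged.

In [None]:
import importlib

DRY_RUN = False  # True skips the heavy stages; False runs the full GPU/CPU pipeline.
# Export the flag before reloading so cli.pipeline picks up the new value.
# Assumption: '1' enables dry-run mode, mirroring the '0' used for a full run.
os.environ['PIPELINE_DRY_RUN'] = '1' if DRY_RUN else '0'
importlib.reload(cli_pipeline)
print(f"DRY_RUN override: {cli_pipeline.DRY_RUN}")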

DRY_RUN override: False
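The reload in the cell above matters if the flag is read only once, at import time. How `cli.pipeline` actually reads `PIPELINE_DRY_RUN` is not shown in this notebook, so the sketch below is an assumption: it uses a hypothetical `read_dry_run_flag` helper and a plain dictionary standing in for the environment to show why a value captured at import goes stale.

```python
from __future__ import annotations

from typing import Mapping


def read_dry_run_flag(environ: Mapping[str, str]) -> bool:
    """Interpret PIPELINE_DRY_RUN as a boolean (assumed convention: '1' means on)."""
    return environ.get("PIPELINE_DRY_RUN", "0").strip() == "1"


# A dictionary stands in for os.environ so the example has no side effects.
fake_environ = {"PIPELINE_DRY_RUN": "0"}

# A module-level constant captures the value once, when the module is imported.
dry_run_at_import = read_dry_run_flag(fake_environ)

# Changing the environment afterwards does not update the captured constant.
fake_environ["PIPELINE_DRY_RUN"] = "1"
stale_value = dry_run_at_import

# Re-reading the flag gives the new value; reloading a module re-runs its
# top-level code and so re-reads the flag in the same way.
fresh_value = read_dry_run_flag(fake_environ)
print(stale_value, fresh_value)
```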



## GPU Diagnostics
Optional check that prints CUDA availability via `torch` (when installed)
and surfaces `nvidia-smi` output for reference.

In [None]:
try:
    import torch
except ImportError:
    torch = None

if torch is not None:
    print('torch.cuda.is_available():', torch.cuda.is_available())
else:
    print('torch is not available in this runtime; skipping CUDA check')

print('nvidia-smi snapshot:')
!nvidia-smi
print('GPU memory (MB): total, used, free')
!nvidia-smi --query-gpu=memory.total,memory.used,memory.free --format=csv,noheader,nounits

torch.cuda.is_available(): True
nvidia-smi snapshot:
Thu Feb 12 10:02:56 2026       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.82.07              Driver Version: 580.82.07      CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   50C    P8             14W /   70W |       3MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+-------------------
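If the memory figures are needed as numbers rather than printed text, the CSV query used in the cell above can be parsed in Python. This is a minimal sketch: `parse_gpu_memory` and `query_gpu_memory` are hypothetical helper names, and the sample line uses made-up values in the `total, used, free` order requested by the query.

```python
from __future__ import annotations

import shutil
import subprocess

# Same query as the notebook cell: one "total, used, free" line per GPU, in MiB.
QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.total,memory.used,memory.free",
    "--format=csv,noheader,nounits",
]


def parse_gpu_memory(csv_line: str) -> dict[str, int]:
    """Parse one 'total, used, free' CSV line into integers (MiB)."""
    total, used, free = (int(field.strip()) for field in csv_line.split(","))
    return {"total": total, "used": used, "free": free}


def query_gpu_memory() -> list[dict[str, int]]:
    """Return one entry per GPU, or an empty list when nvidia-smi is absent."""
    if shutil.which("nvidia-smi") is None:
        return []
    output = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return [parse_gpu_memory(line) for line in output.splitlines() if line.strip()]


# Illustrative values only; they are not taken from a real query.
sample = parse_gpu_memory("15360, 3, 15095")
print(sample)
```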

## Pipeline Execution
Invoke `cli.pipeline.main` with the resolved configuration, scan path,
and identifier so the notebook orchestrates the same stages as the CLI.

In [None]:
print('Starting pipeline run via cli.pipeline.main()')
cli_pipeline.main(
    config_path=CONFIG_PATH,
    scan_arg=scan_dir,
    scan_id_arg=scan_id,
)
print('Pipeline run completed')

Starting pipeline run via cli.pipeline.main()
Done.




Loaded 50 slices from /content/drive/MyDrive/soil_microCT_images/ROI/rehovot/rehovot_partial_16bit_otsu2/pipeline_outputs/segmentation/rehovot_test_2class/classmap -> shape (50, 650, 650)
Saved 50 slices to /content/drive/MyDrive/soil_microCT_images/ROI/rehovot/rehovot_partial_16bit_otsu2/pipeline_outputs/binary/binary_pores
Saved 50 slices to /content/drive/MyDrive/soil_microCT_images/ROI/rehovot/rehovot_partial_16bit_otsu2/pipeline_outputs/binary/binary_solids
STDOUT:
 
STDERR:
 /usr/bin/python3: Error while finding module specification for 'core.analysis.pores_analysis.psd_entrypoint' (ModuleNotFoundError: No module named 'core')



CalledProcessError: Command '['/usr/bin/python3', '-m', 'core.analysis.pores_analysis.psd_entrypoint']' returned non-zero exit status 1.

In [None]:
import sys
sys.path.append('/content/drive/MyDrive/soil_microCT_images/drive_scripts/project_refactor_v1')


In [None]:
!python -m core.analysis.pores_analysis.psd_entrypoint


/usr/bin/python3: Error while finding module specification for 'core.analysis.pores_analysis.psd_entrypoint' (ModuleNotFoundError: No module named 'core')
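Both failures above come from a new interpreter process. `python -m ...` started with `!` or through `subprocess` does not inherit the kernel's `sys.path`, so appending the project root inside the notebook has no effect on it. The child process sees only its working directory and `PYTHONPATH`. The sketch below shows one possible workaround, passing the project root through `PYTHONPATH`. The `run_module` helper and the throwaway `demo_pkg` package are hypothetical and exist only to demonstrate the behaviour.

```python
import os
import subprocess
import sys
import tempfile
from pathlib import Path


def run_module(module: str, project_root: Path) -> subprocess.CompletedProcess:
    """Run `python -m module` with project_root prepended to PYTHONPATH."""
    env = os.environ.copy()
    existing = env.get("PYTHONPATH")
    env["PYTHONPATH"] = (
        f"{project_root}{os.pathsep}{existing}" if existing else str(project_root)
    )
    return subprocess.run(
        [sys.executable, "-m", module],
        env=env,
        capture_output=True,
        text=True,
        cwd=tempfile.gettempdir(),
    )


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    package = root / "demo_pkg"
    package.mkdir()
    (package / "__init__.py").write_text("")
    (package / "entry.py").write_text("print('entry ran')\n")

    # Make the package importable in this process only, as the notebook does.
    sys.path.insert(0, str(root))
    without_path = subprocess.run(
        [sys.executable, "-m", "demo_pkg.entry"],
        capture_output=True,
        text=True,
        cwd=tempfile.gettempdir(),
    )
    sys.path.remove(str(root))

    # The same command succeeds once the child receives the root via PYTHONPATH.
    with_path = run_module("demo_pkg.entry", root)

print("without PYTHONPATH:", without_path.returncode)
print("with PYTHONPATH:", with_path.stdout.strip())
```

In this notebook, the equivalent step would be to add the project root to `os.environ['PYTHONPATH']` before calling `cli_pipeline.main`, assuming the pipeline launches the PSD stage with the inherited environment. Changing the working directory to the project root is another option.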


## Post-run Summary
Restate where the outputs live so you can inspect the results without
rerunning the CLI or searching for directories.

In [None]:
print('Post-run summary')
#print(f'  DRY_RUN={DRY_RUN}')
print(f'  Scan directory: {scan_dir}')
print(f'  Scan identifier: {scan_id}')
print('Stage directories to inspect:')
for stage, path in stage_dir_map.items():
    print(f'    {stage}: {path}')
print('Pipeline outputs live under <scan_dir>/pipeline_outputs/<stage> once the run finishes.')

Post-run summary
  Scan directory: /content/drive/MyDrive/soil_microCT_images/ROI/rehovot/rehovot_partial_16bit_otsu2
  Scan identifier: rehovot_test_2class
Stage directories to inspect:
    preprocessing: /content/drive/MyDrive/soil_microCT_images/ROI/rehovot/rehovot_partial_16bit_otsu2/pipeline_outputs/preprocessing
    segmentation: /content/drive/MyDrive/soil_microCT_images/ROI/rehovot/rehovot_partial_16bit_otsu2/pipeline_outputs/segmentation
    z_stability: /content/drive/MyDrive/soil_microCT_images/ROI/rehovot/rehovot_partial_16bit_otsu2/pipeline_outputs/z_stability
    binary: /content/drive/MyDrive/soil_microCT_images/ROI/rehovot/rehovot_partial_16bit_otsu2/pipeline_outputs/binary
    psd: /content/drive/MyDrive/soil_microCT_images/ROI/rehovot/rehovot_partial_16bit_otsu2/pipeline_outputs/psd
Pipeline outputs live under <scan_dir>/pipeline_outputs/<stage> once the run finishes.
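The summary above prints the expected locations but does not check that a stage actually produced output. A small helper can report which stage directories exist and how many files each contains. This is a sketch: `summarize_stage_dirs` is a hypothetical name, and the demonstration runs against a temporary directory rather than the real scan.

```python
from __future__ import annotations

import tempfile
from pathlib import Path


def summarize_stage_dirs(stage_dirs: dict[str, Path]) -> dict[str, int | None]:
    """Map each stage to its file count, or None when the directory is missing."""
    summary: dict[str, int | None] = {}
    for stage, path in stage_dirs.items():
        if path.is_dir():
            summary[stage] = sum(1 for item in path.rglob("*") if item.is_file())
        else:
            summary[stage] = None
    return summary


# Demonstration on a throwaway layout: one populated stage, one missing stage.
with tempfile.TemporaryDirectory() as tmp:
    outputs = Path(tmp) / "pipeline_outputs"
    (outputs / "binary").mkdir(parents=True)
    (outputs / "binary" / "slice_0000.tif").write_bytes(b"")
    (outputs / "binary" / "slice_0001.tif").write_bytes(b"")
    demo_map = {stage: outputs / stage for stage in ("binary", "psd")}
    demo_summary = summarize_stage_dirs(demo_map)

print(demo_summary)
```

Calling `summarize_stage_dirs(stage_dir_map)` in the notebook would flag a stage such as `psd` whose directory is missing after a failed run.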


In [None]:
import os
import sys
import pathlib
from pprint import pprint

print("Working directory:", pathlib.Path.cwd())
print("Python executable:", sys.executable)
print("PYTHONPATH:", os.environ.get("PYTHONPATH"))
print("sys.path:")
pprint(sys.path)

Working directory: /content
Python executable: /usr/bin/python3
PYTHONPATH: /env/python
sys.path:
['/content/drive/MyDrive/soil_microCT_images/drive_scripts/project_refactor_v1',
 '/content',
 '/env/python',
 '/usr/lib/python312.zip',
 '/usr/lib/python3.12',
 '/usr/lib/python3.12/lib-dynload',
 '',
 '/usr/local/lib/python3.12/dist-packages',
 '/usr/lib/python3/dist-packages',
 '/usr/local/lib/python3.12/dist-packages/IPython/extensions',
 '/root/.ipython']


In [10]:
import importlib.util

spec = importlib.util.find_spec("core.analysis.pores_analysis.psd_entrypoint")
print("psd entrypoint module visible:", spec)
for module in ("pandas", "numpy"):
    available = importlib.util.find_spec(module) is not None
    print(f"{module}: {'found' if available else 'missing'}")

psd entrypoint module visible: ModuleSpec(name='core.analysis.pores_analysis.psd_entrypoint', loader=<_frozen_importlib_external.SourceFileLoader object at 0x7c77503de1b0>, origin='/content/drive/MyDrive/soil_microCT_images/drive_scripts/project_refactor_v1/core/analysis/pores_analysis/psd_entrypoint.py')
pandas: found
numpy: found
