# Benchmarking with `czbenchmarks`

`czbenchmarks` provides a standardized framework for benchmarking single-cell analysis models. This notebook demonstrates its core features for evaluating tasks such as embedding, clustering, and perturbation prediction.

## Features

- **Standardized Datasets:** _Preprocessed_, _benchmark-ready_ datasets.
- **Comprehensive Metrics:** Validated evaluation metrics.
- **Consistent Baselines:** Reference methods for comparison.
- **Result Management:** Organized tracking of benchmarking results.

## Components

### Datasets

Datasets are wrapped for consistent loading and compatibility:

- `SingleCellLabeledDataset`: Gene expression data with cell labels (supports clustering, embedding, label prediction).
- `SingleCellPerturbationDataset`: Perturbation datasets with control and perturbed cells.

### Tasks

Each task defines an evaluation workflow with `run()` and `compute_baseline()` methods.

| Task Name         | Class                         | Purpose                                         |
|-------------------|-------------------------------|-------------------------------------------------|
| Clustering        | `ClusteringTask`              | Evaluate cell group separation                  |
| Embedding Quality | `EmbeddingTask`               | Assess embedding structure                      |
| Label Prediction  | `MetadataLabelPredictionTask` | Predict labels from embeddings                  |
| Batch Integration | `BatchIntegrationTask`        | Evaluate batch integration                      |
| Cross-Species     | `CrossSpeciesIntegrationTask` | Integrate data across species                   |

### Metrics

Metrics are managed by `MetricRegistry` and returned as `MetricResult` objects.

- `MetricType`: Enum of metric names (e.g., `ADJUSTED_RAND_INDEX`, `SILHOUETTE_SCORE`)
- `MetricResult`: Stores metric type, value, and parameters

All tasks compute and return metrics automatically.
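
To give a feel for what two of the metrics named above measure, the following sketch computes them with scikit-learn on toy data. It is an illustration only: it calls scikit-learn directly rather than `MetricRegistry`, and it makes no claim about how `czbenchmarks` computes these values internally. The toy labels and 2-D points are made up for the example.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score, silhouette_score

# Toy ground-truth labels for six cells (illustrative, not from any dataset)
true_labels = np.array([0, 0, 0, 1, 1, 1])

# Same partition as true_labels, but with the cluster IDs swapped.
# ARI compares partitions, so the arbitrary ID names do not matter.
predicted_clusters = np.array([1, 1, 1, 0, 0, 0])
ari = adjusted_rand_score(true_labels, predicted_clusters)

# Two tight, well-separated groups of points in a toy 2-D embedding.
# Silhouette rewards points that sit close to their own group and far from others.
toy_embedding = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],
])
silhouette = silhouette_score(toy_embedding, true_labels)

print(f"ARI: {ari:.3f}")
print(f"Silhouette: {silhouette:.3f}")
```

Both scores have an upper bound of 1. ARI is close to 0 for a random assignment, which is a useful reference point when reading the results later in this notebook.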

## Example: Benchmarking an Embedding

This section demonstrates how to benchmark a model's embedding using `czbenchmarks`. The workflow covers the clustering, embedding quality, and label prediction tasks.

### Step 1: Setup and Imports

In [1]:
import logging
import sys
import json
import numpy as np
from czbenchmarks.datasets import load_dataset
from czbenchmarks.datasets.single_cell_labeled import SingleCellLabeledDataset
from czbenchmarks.tasks.types import CellRepresentation
from czbenchmarks.tasks import (
    ClusteringTask,
    EmbeddingTask,
    MetadataLabelPredictionTask,
)
from czbenchmarks.tasks.clustering import ClusteringTaskInput
from czbenchmarks.tasks.embedding import EmbeddingTaskInput
from czbenchmarks.tasks.label_prediction import MetadataLabelPredictionTaskInput

# Set up basic logging to see the library's output
logging.basicConfig(level=logging.INFO, stream=sys.stdout)


### Step 2: Load a Dataset

Load the pre-configured `tsv2_prostate` dataset. The library handles automatic download, caching, and loading as a `SingleCellLabeledDataset` for streamlined reuse.

**Loaded dataset provides:**
- `dataset.adata`: AnnData object with gene expression data.
- `dataset.labels`: pandas Series of cell type labels.

In [2]:
# load_dataset downloads the file (or reuses the cached copy) and returns a SingleCellLabeledDataset wrapping an AnnData object.
dataset: SingleCellLabeledDataset = load_dataset("tsv2_prostate")

INFO:czbenchmarks.file_utils:File already exists in cache: /Users/sgupta/.cz-benchmarks/datasets/homo_sapiens_10df7690-6d10-4029-a47e-0f071bb2df83_Prostate_v2_curated.h5ad
INFO:czbenchmarks.datasets.single_cell:Loading dataset from /Users/sgupta/.cz-benchmarks/datasets/homo_sapiens_10df7690-6d10-4029-a47e-0f071bb2df83_Prostate_v2_curated.h5ad


### Step 3: Prepare Model Output

The library expects a **`CellRepresentation`**: a `numpy.ndarray` with cells as rows and embedding features as columns. For demonstration, we simulate model output with random data.

*Replace this with your model's actual embedding in practice.*

In [3]:
# Simulated 10-dimensional embedding for each cell
model_output: CellRepresentation = np.random.rand(dataset.adata.shape[0], 10)
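
When swapping in a real embedding, it can help to check its shape before handing it to a task. The sketch below is one way to do that with plain NumPy. `check_embedding` is a hypothetical helper written for this example, not part of `czbenchmarks`, and `n_cells` is a stand-in for `dataset.adata.shape[0]`. A seeded generator is used so the simulated embedding is reproducible.

```python
import numpy as np

n_cells = 500       # hypothetical; in the notebook this would be dataset.adata.shape[0]
embedding_dim = 10  # matches the 10-dimensional embedding simulated above

# Seeded generator so repeated runs produce the same simulated embedding
rng = np.random.default_rng(seed=0)
simulated_embedding = rng.random((n_cells, embedding_dim))


def check_embedding(embedding: np.ndarray, expected_n_cells: int) -> None:
    """Hypothetical helper: raise ValueError if the array is not a usable cells x features matrix."""
    if embedding.ndim != 2:
        raise ValueError(f"expected a 2-D array, got {embedding.ndim} dimension(s)")
    if embedding.shape[0] != expected_n_cells:
        raise ValueError(
            f"expected {expected_n_cells} rows (one per cell), got {embedding.shape[0]}"
        )
    if not np.isfinite(embedding).all():
        raise ValueError("embedding contains NaN or infinite values")


check_embedding(simulated_embedding, n_cells)
print(simulated_embedding.shape)
```

A transposed array (features as rows) is the most common mistake this catches, since the row count then no longer matches the number of cells.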

### Step 4: Run the Clustering Task

Evaluate the embedding by clustering it and comparing the clusters to the true labels. The task runs Leiden clustering on the embedding and scores agreement with the ground-truth labels using the Adjusted Rand Index (ARI) and Normalized Mutual Information (NMI); higher scores indicate closer agreement. Compare `clustering_results` with `clustering_baseline_results` to see how the model performs against the PCA baseline. Because the embedding used here is random, its scores should be close to zero.

In [4]:
# 1. Initialize the task
clustering_task = ClusteringTask()

# 2. Define the inputs for the task
clustering_task_input = ClusteringTaskInput(
    obs=dataset.adata.obs,      # The full observation metadata
    input_labels=dataset.labels # The ground-truth labels for comparison
)

# 3. Run the task on your model's output
clustering_results = clustering_task.run(
    cell_representation=model_output,
    task_input=clustering_task_input,
)

# 4. Compute the baseline representation from the expression matrix, then run the same task on it
expression_data = dataset.adata.X
clustering_baseline = clustering_task.compute_baseline(expression_data)
clustering_baseline_results = clustering_task.run(
    cell_representation=clustering_baseline,
    task_input=clustering_task_input,
)

print("--- Clustering Model Results ---")
for result in clustering_results:
    print(result.model_dump_json(indent=2))

print("\n--- Clustering Baseline Results ---")
for result in clustering_baseline_results:
    print(result.model_dump_json(indent=2))

--- Clustering Model Results ---
{
  "metric_type": "adjusted_rand_index",
  "value": -0.0001785534951059197,
  "params": {}
}
{
  "metric_type": "normalized_mutual_info",
  "value": 0.022872839500109668,
  "params": {}
}

--- Clustering Baseline Results ---
{
  "metric_type": "adjusted_rand_index",
  "value": 0.626707020983652,
  "params": {}
}
{
  "metric_type": "normalized_mutual_info",
  "value": 0.8326481406592264,
  "params": {}
}
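
To read the two result sets side by side, the values can be collected into dictionaries and differenced. The sketch below copies the numbers from the output above into plain dictionaries so that it runs on its own; in the notebook the same dictionaries could presumably be built from the `metric_type` and `value` fields of each result, which is an assumption based on the JSON shown here.

```python
# Values copied from the printed output above
model_scores = {
    "adjusted_rand_index": -0.0001785534951059197,
    "normalized_mutual_info": 0.022872839500109668,
}
baseline_scores = {
    "adjusted_rand_index": 0.626707020983652,
    "normalized_mutual_info": 0.8326481406592264,
}

# Positive delta: the model beats the baseline; negative: it falls short
deltas = {
    metric: model_scores[metric] - baseline_scores[metric]
    for metric in baseline_scores
}

for metric, delta in deltas.items():
    print(
        f"{metric:<24} model={model_scores[metric]:.4f} "
        f"baseline={baseline_scores[metric]:.4f} delta={delta:+.4f}"
    )
```

Both deltas are strongly negative, as expected for a random embedding: it carries no information about cell type, so it scores near zero while the PCA baseline recovers much of the label structure.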
