Roman Mukhin edited this page Oct 6, 2025 · 3 revisions

cap-sc-client

A tiny Python client for exploring Cell Annotation Platform (CAP) data in notebooks & scripts.

Discover datasets • Search for cell annotation metadata • Run quick Differential Expression analyses



Features

  • Dataset search by name and metadata (organism, tissue, assay) → pandas.DataFrame.
  • Cell label resources search for cross-referencing annotations.
  • Molecular Data (MD) session helper for a dataset (embeddings, labelsets, clusterings, readiness checks).
  • Embedding points with optional per-cell gene expression and selection masks.
  • Highly Variable Genes (HVGs) as a tidy DataFrame for immediate downstream use.
  • Quick DE: compare each of the 10 largest labels vs the rest; request a heatmap payload for plotting.

The client talks to CAP’s GraphQL API at https://celltype.info/graphql by default.
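
For readers unfamiliar with GraphQL over HTTP, the sketch below shows the general shape of a request body: a JSON object with a `query` string and a `variables` mapping, sent as a POST. The helper name `build_graphql_body` and the query text (including the `exampleDatasets` field) are hypothetical and are not taken from CAP's schema or from this client's internals.

```python
import json
from typing import Optional

# Default endpoint named in this README.
CAP_GRAPHQL_URL = "https://celltype.info/graphql"


def build_graphql_body(query: str, variables: Optional[dict] = None) -> bytes:
    """Encode a GraphQL request as the JSON body of an HTTP POST."""
    payload = {"query": query, "variables": variables or {}}
    return json.dumps(payload).encode("utf-8")


# Hypothetical query text: the field and argument names below are NOT from
# CAP's schema; they only illustrate the shape of a request.
EXAMPLE_QUERY = """
query ExampleDatasets($limit: Int) {
  exampleDatasets(limit: $limit) { id name }
}
"""

body = build_graphql_body(EXAMPLE_QUERY, {"limit": 5})
decoded = json.loads(body)
print(sorted(decoded))  # ['query', 'variables']
```

In normal use you never build these bodies yourself; CapClient does the equivalent work behind methods such as search_datasets().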


Installation

pip install cap-sc-client

Or from a local clone:

pip install -e .

Requirements: Python ≥ 3.10. The pandas and httpx dependencies are installed automatically. No authentication is required for public endpoints.


Quick Start

from cap_sc_client import CapClient

# 1) Create the top-level client
cap = CapClient()  # uses https://celltype.info/graphql by default

# 2) Find datasets: e.g., human PBMC 10x datasets, newest first
ds = cap.search_datasets(
    search=["pbmc"],
    organism=["Homo sapiens"],
    assay=["10x"],
    limit=20,
    sort=[{"published_at": "desc"}],
)

# 3) Pick a dataset and open a Molecular Data session
dataset_id = ds.iloc[0]["id"]  # first row by position, whatever the index labels are
md = cap.md_session(dataset_id)
md.create_session()

# 4) Explore available resources
print("Embeddings:", md.embeddings)     # e.g., ["umap", "tsne", "pca_0", ...]
print("Labelsets:", md.labelsets)       # e.g., ["Cell Types v1", ...]
print("Clusterings:", md.clusterings)

Core Concepts

  • Dataset — a single-cell study (matrix + metadata) in CAP.
  • Embedding — 2-D/3-D representation of cells (UMAP, t-SNE, PCA). You can fetch coordinates plus optional per-cell vectors (labels, expression, selections).
  • Labelset — per-cell annotations (e.g., cell types from a curator or pipeline). This client exposes labelsets in the cell-labels mode.
  • Clustering — groupings of cells (e.g., Leiden at various resolutions) with no direct biological interpretation.
  • HVGs — highly variable genes computed from log-transformed counts (returned with gene symbol and dispersion).
  • General DE — convenience analysis: for the 10 largest labels in a chosen labelset, each label is contrasted against all other cells.
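
To make the "10 largest labels, each vs the rest" idea concrete, here is a minimal local sketch that picks the most frequent labels and builds one boolean mask per label. It only illustrates the concept on toy data; the actual selection and statistics run on the CAP server, and the helper names `largest_labels` and `one_vs_rest_masks` are made up for this example.

```python
from collections import Counter


def largest_labels(labels: list[str], k: int = 10) -> list[str]:
    """Return the k most frequent labels, most frequent first."""
    return [name for name, _ in Counter(labels).most_common(k)]


def one_vs_rest_masks(labels: list[str], k: int = 10) -> dict[str, list[bool]]:
    """For each of the k largest labels, mark which cells belong to it."""
    return {
        name: [cell_label == name for cell_label in labels]
        for name in largest_labels(labels, k)
    }


# Toy per-cell labels (made up for illustration).
cell_labels = ["T cell", "B cell", "T cell", "NK cell", "T cell", "B cell"]
masks = one_vs_rest_masks(cell_labels, k=2)
print(list(masks))  # ['T cell', 'B cell']
```

Each mask splits the cells into a target group (True) and a reference group (False), which is the contrast a one-vs-rest DE test compares.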

Usage Examples

Search datasets

# Search by name + metadata → DataFrame of datasets
cap.search_datasets(
    search=["cortex"],
    organism=["Mus musculus"],
    tissue=["Brain"],
    limit=10,
    sort=[{"published_at": "desc"}],
)

Open an MD session

md = cap.md_session(dataset_id)
session_id = md.create_session()
session_id  # keep this if you need to refer to the server-side session later

md.is_md_cache_ready()  # True/False
md.embeddings           # ["umap", "tsne", ...]
md.labelsets            # ["Cell Types v1", ...]
md.clusterings          # ["leiden_1.0", ...]

Fetch embedding points (+ gene expression)

# Retrieve coordinates for an embedding; add a gene to get per-cell expression array
data = md.embedding_data(
    embedding="umap",          # must exist in md.embeddings
    max_points=1_000,           # server may downsample to this
    labelsets=[md.labelsets[0]],
    selection_gene="MALAT1",   # optional: per-cell expression vector
)

Highly Variable Genes (HVGs)

hvg = md.highly_variable_genes(
    gene_name_filter=None,  # or prefix string, e.g., "RPL"
    pseudogenes_filter=True,
    limit=50,
)

hvg.head()
# columns: gene_symbol, dispersion

General Differential Expression + Heatmap

# 1) Compute DE for the largest labels (vs rest) in a chosen labelset
diff_key = md.general_de(labelset=md.labelsets[0], random_seed=42)

# 2) Request a heatmap payload (n genes per label, sampled cells)
heat = md.heatmap(
    diff_key=diff_key,
    n_top_genes=4,
    max_cells_displayed=2000,
    gene_name_filter=None,
    pseudogenes_filter=True,
    include_reference=True,
)

heat_blob = heat.model_dump()
print(heat_blob.keys())  # serialize or adapt to your plotting stack

Public API

CapClient

cap = CapClient(url="https://celltype.info/graphql")

Creates a shared HTTP client with a 300-second timeout.

search_datasets() -> pandas.DataFrame

cap.search_datasets(
    search: list[str] | None = None,   # name tokens
    organism: list[str] | None = None, # e.g. ["Homo sapiens"]
    tissue: list[str] | None = None,   # e.g. ["PBMC", "Brain"]
    assay: list[str] | None = None,    # e.g. ["10x", "SMART-seq2"]
    limit: int = 50,
    offset: int = 0,
    sort: list[dict[str, str]] = [],   # e.g. [{"published_at": "desc"}]
) -> pd.DataFrame

Returns one row per dataset (columns mirror CAP’s schema, typically including id, name, organism, tissue, assay, sizes, dates, etc.).

search_cell_labels() -> pandas.DataFrame

cap.search_cell_labels(
    search: str | None = None,
    organism: list[str] | None = None,
    tissue: list[str] | None = None,
    assay: list[str] | None = None,
    limit: int = 50,
    offset: int = 0,
    sort: list[dict[str, str]] = [],   # e.g. [{"updated_at": "desc"}]
) -> pd.DataFrame

Returns label resources you can explore or cross-reference.

md_session(dataset_id: str) -> MDSession

Creates an MD session helper bound to a dataset.


MDSession

After create_session(), these members are populated:

  • dataset_id: str
  • session_id: str
  • embeddings: list[str]
  • labelsets: list[str] (cell-labels mode)
  • clusterings: list[str]
  • dataset_snapshot (Pydantic model → use .model_dump() to inspect)

create_session() -> str

  • Checks MD readiness on the server and takes a dataset snapshot.
  • Fetches embeddings/labelsets/clusterings.
  • Saves a unique session server-side and returns session_id.

Raises: RuntimeError if MD data isn’t ready yet.

embedding_data(...) -> Pydantic model

md.embedding_data(
    embedding: str,                  # must exist in md.embeddings
    max_points: int,                 # server may downsample to this
    labelsets: list[str] | None = None,
    selection_gene: str | None = None,      # per-cell expression
    selection_key_major: str | None = None, # optional selection mask
    selection_key_minor: str | None = None,
)

Returns a typed payload with coordinates and optional arrays (labels, selection, expression). Use .model_dump() to get the raw dict.

Raises: ValueError if the embedding name isn’t available.
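
Since the exact fields of the payload are not listed here, a small helper that summarizes a nested dict can be handy after calling .model_dump(). The sketch below is one way to do it; `describe` is a hypothetical helper and the payload is made up, so the real field names will differ.

```python
def describe(value, prefix: str = "", max_depth: int = 3) -> list[str]:
    """Return one line per field: dotted path, type, and length for containers."""
    lines = []
    if isinstance(value, dict) and max_depth > 0:
        for key, item in value.items():
            lines.extend(describe(item, f"{prefix}{key}.", max_depth - 1))
    else:
        path = prefix.rstrip(".") or "<root>"
        has_len = isinstance(value, (list, tuple, dict))
        size = f" (len={len(value)})" if has_len else ""
        lines.append(f"{path}: {type(value).__name__}{size}")
    return lines


# Made-up payload; real field names come from
# md.embedding_data(...).model_dump().
fake_payload = {
    "points": {"x": [0.1, 0.2, 0.3], "y": [1.0, 1.1, 1.2]},
    "labels": ["a", "b", "a"],
    "n_cells": 3,
}
for line in describe(fake_payload):
    print(line)
```

The same helper works on the heatmap payload, since it only relies on the plain dict returned by .model_dump().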

highly_variable_genes(...) -> pandas.DataFrame

md.highly_variable_genes(
    gene_name_filter: str | None = None,
    pseudogenes_filter: bool = True,
    offset: int = 0,
    limit: int = 50,
    sort_order: Literal["desc","asc"] = "desc",
) -> pd.DataFrame

Returns columns:

  • gene_symbol: str
  • dispersion: float
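
As a rough intuition for the dispersion column, the sketch below computes a variance-to-mean ratio on log1p-transformed counts, which is one common dispersion measure. This page does not state the formula CAP uses, so treat this as an assumption made for illustration; the gene names and counts are invented.

```python
import math
from statistics import fmean, pvariance


def dispersion(counts: list[float]) -> float:
    """Variance-to-mean ratio of log1p-transformed counts (0.0 if mean is 0)."""
    logged = [math.log1p(c) for c in counts]
    mean = fmean(logged)
    return pvariance(logged) / mean if mean > 0 else 0.0


# Toy counts per gene across four cells (made up for illustration).
toy_counts = {
    "GENE_FLAT": [5, 5, 5, 5],
    "GENE_VARIABLE": [0, 0, 20, 40],
}
ranked = sorted(toy_counts, key=lambda g: dispersion(toy_counts[g]), reverse=True)
print(ranked)  # ['GENE_VARIABLE', 'GENE_FLAT']
```

A gene expressed at the same level in every cell scores zero, while a gene that is off in some cells and high in others ranks first, which matches the default sort_order of "desc".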

general_de(labelset: str, random_seed: int = 42) -> str

Runs quick DE (each of the 10 largest labels in the labelset vs. all other cells) and returns a diff_key to pass to heatmap().

Raises: ValueError if the labelset name isn’t available.

heatmap(...) -> Pydantic model

md.heatmap(
    diff_key: str,
    n_top_genes: int = 3,
    max_cells_displayed: int = 1000,
    gene_name_filter: str | None = None,
    pseudogenes_filter: bool = True,
    selection_key: str | None = None,
    include_reference: bool = True,
)

Returns a structured heatmap payload (labels × genes, sampled cells). Use .model_dump() to get a plain dict you can pass to your plotting code.

is_md_cache_ready() -> bool

Quick status check for MD cache readiness.


Troubleshooting

  • RuntimeError: The Molecular Data ... is not ready! — Precomputations are still running on the server. Try again later.
  • ValueError: Embedding 'xyz' is not found — Check md.embeddings for valid names.
  • ValueError: Labelset 'xyz' is not found — Check md.labelsets for valid names.
  • Large datasets — Use max_points to downsample when fetching embeddings; it’s typically sufficient for exploratory plots.
  • Pydantic payloads — Always start with .model_dump() to discover exact fields available in your environment.
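
The two ValueError cases above can be caught earlier by checking a name against md.embeddings or md.labelsets before making the call. The sketch below does that and suggests close matches for typos; `resolve_name` is a hypothetical helper, not part of the client.

```python
from difflib import get_close_matches


def resolve_name(requested: str, available: list[str], kind: str = "name") -> str:
    """Return requested if available, else raise ValueError with suggestions."""
    if requested in available:
        return requested
    lowered = {name.lower(): name for name in available}
    close = get_close_matches(requested.lower(), list(lowered), n=3, cutoff=0.6)
    hint = ", ".join(lowered[c] for c in close) or "no close match"
    raise ValueError(f"{kind} {requested!r} not found; did you mean: {hint}?")


# Example values taken from the comments in this README.
embeddings = ["umap", "tsne", "pca_0"]
print(resolve_name("umap", embeddings, kind="embedding"))  # umap
try:
    resolve_name("UMAP2", embeddings, kind="embedding")
except ValueError as err:
    print(err)
```

The comparison is case-insensitive only for suggestions; the returned name is always one that appears verbatim in the available list.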

FAQ

Do I need an API token? No. Public endpoints are accessible anonymously.

What exactly is in the embedding/heatmap payloads? Typed Pydantic models mirroring the CAP GraphQL schema. Use .model_dump() to explore concrete keys.

Can I plot directly from the client? The client returns data only. Use your favorite plotting library (matplotlib/plotly/etc.).


License

See LICENSE in the repository.