Home
A tiny Python client for exploring Cell Annotation Platform (CAP) data in notebooks & scripts.
Discover datasets • Search for cell annotation metadata • Run quick Differential Expression analyses
- Dataset search by name and metadata (organism, tissue, assay) → pandas.DataFrame.
- Cell label resources search for cross-referencing annotations.
- Molecular Data (MD) session helper for a dataset (embeddings, labelsets, clusterings, readiness checks).
- Embedding points with optional per-cell gene expression and selection masks.
- Highly Variable Genes (HVGs) as a tidy DataFrame for immediate downstream use.
- Quick DE: compare each of the 10 largest labels vs the rest; request a heatmap payload for plotting.
The client talks to CAP’s GraphQL API at https://celltype.info/graphql by default.
pip install cap-sc-client
Or from a local clone:
pip install -e .
Requirements: Python ≥ 3.10, pandas, httpx (installed automatically). No authentication is required for public endpoints.
from cap_sc_client import CapClient
# 1) Create the top-level client
cap = CapClient() # uses https://celltype.info/graphql by default
# 2) Find datasets: e.g., human PBMC 10x datasets, newest first
ds = cap.search_datasets(
    search=["pbmc"],
    organism=["Homo sapiens"],
    assay=["10x"],
    limit=20,
    sort=[{"published_at": "desc"}],
)
# 3) Pick a dataset and open a Molecular Data session
dataset_id = ds.loc[0, "id"]
md = cap.md_session(dataset_id)
md.create_session()
# 4) Explore available resources
print("Embeddings:", md.embeddings) # e.g., ["umap", "tsne", "pca_0", ...]
print("Labelsets:", md.labelsets) # e.g., ["Cell Types v1", ...]
print("Clusterings:", md.clusterings)
- Dataset — a single-cell study (matrix + metadata) in CAP.
- Embedding — 2-D/3-D representation of cells (UMAP, t-SNE, PCA). You can fetch coordinates plus optional per-cell vectors (labels, expression, selections).
- Labelset — per-cell annotations (e.g., cell types from a curator or pipeline). This client exposes labelsets in the cell-labels mode.
- Clustering — groupings of cells (e.g., Leiden at various resolutions) with no direct biological interpretation.
- HVGs — highly variable genes computed from log-transformed counts (returned with gene symbol and dispersion).
- General DE — convenience analysis: for the 10 largest labels in a chosen labelset, each label is contrasted against all other cells.
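The "10 largest labels vs the rest" contrast described above can be illustrated offline. The following is a minimal sketch, not the server's implementation: the helper name `largest_label_contrasts` and the toy label list are hypothetical, and it only shows how the groups for each contrast could be formed from per-cell labels.

```python
from collections import Counter


def largest_label_contrasts(labels, n_largest=10):
    """Map each of the n_largest most frequent labels to a 'label vs rest' mask.

    Hypothetical helper for illustration; CAP computes its contrasts server-side.
    """
    counts = Counter(labels)
    top = [label for label, _ in counts.most_common(n_largest)]
    # True marks cells carrying the label, False marks the "rest" group.
    return {label: [cell == label for cell in labels] for label in top}


# Toy per-cell annotations (made up for this example).
cell_labels = ["T cell", "B cell", "T cell", "NK cell", "T cell", "B cell"]
contrasts = largest_label_contrasts(cell_labels, n_largest=2)

for label, mask in contrasts.items():
    print(label, sum(mask), "cells vs", len(mask) - sum(mask), "rest")
```

Labels outside the top `n_largest` never get their own contrast, but their cells still count toward the "rest" group of every other label.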
# Search by name + metadata → DataFrame of datasets
cap.search_datasets(
    search=["cortex"],
    organism=["Mus musculus"],
    tissue=["Brain"],
    limit=10,
    sort=[{"published_at": "desc"}],
)
md = cap.md_session(dataset_id)
session_id = md.create_session()
session_id # keep if you want to resume logic on the backend
md.is_md_cache_ready() # True/False
md.embeddings # ["umap", "tsne", ...]
md.labelsets # ["Cell Types v1", ...]
md.clusterings # ["leiden_1.0", ...]
# Retrieve coordinates for an embedding; add a gene to get per-cell expression array
data = md.embedding_data(
    embedding="umap",            # must exist in md.embeddings
    max_points=1_000,            # server may downsample to this
    labelsets=[md.labelsets[0]],
    selection_gene="MALAT1",     # optional: per-cell expression vector
)
hvg = md.highly_variable_genes(
    gene_name_filter=None,  # or prefix string, e.g., "RPL"
    pseudogenes_filter=True,
    limit=50,
)
hvg.head()
# columns: gene_symbol, dispersion
# 1) Compute DE for the largest labels (vs rest) in a chosen labelset
diff_key = md.general_de(labelset=md.labelsets[0], random_seed=42)
# 2) Request a heatmap payload (n genes per label, sampled cells)
heat = md.heatmap(
    diff_key=diff_key,
    n_top_genes=4,
    max_cells_displayed=2000,
    gene_name_filter=None,
    pseudogenes_filter=True,
    include_reference=True,
)
heat_blob = heat.model_dump()
print(heat_blob.keys()) # serialize or adapt to your plotting stack
cap = CapClient(url="https://celltype.info/graphql")
Creates a shared HTTP client (timeout 300s).
cap.search_datasets(
    search: list[str] | None = None,    # name tokens
    organism: list[str] | None = None,  # e.g. ["Homo sapiens"]
    tissue: list[str] | None = None,    # e.g. ["PBMC", "Brain"]
    assay: list[str] | None = None,     # e.g. ["10x", "SMART-seq2"]
    limit: int = 50,
    offset: int = 0,
    sort: list[dict[str, str]] = [],    # e.g. [{"published_at": "desc"}]
) -> pd.DataFrame
Returns one row per dataset (columns mirror CAP’s schema, typically including id, name, organism, tissue, assay, sizes, dates, etc.).
cap.search_cell_labels(
    search: str | None = None,
    organism: list[str] | None = None,
    tissue: list[str] | None = None,
    assay: list[str] | None = None,
    limit: int = 50,
    offset: int = 0,
    sort: list[dict[str, str]] = [],  # e.g. [{"updated_at": "desc"}]
) -> pd.DataFrame
Returns label resources you can explore or cross-reference.
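Both search methods take `limit` and `offset`, so larger result sets can be collected page by page. The sketch below is one way to do that under stated assumptions: `fetch_all` and `fake_search` are hypothetical names, and the loop assumes the server returns fewer than `limit` rows only on the last page. It accepts any callable that takes `limit` and `offset` and returns something with a length, so it runs here against an in-memory stand-in rather than the live API.

```python
def fetch_all(fetch_page, page_size=50, max_rows=None):
    """Collect pages from fetch_page(limit=..., offset=...) until a short page arrives.

    Hypothetical helper: assumes a short (or empty) page means no more results.
    """
    pages = []
    offset = 0
    while max_rows is None or offset < max_rows:
        limit = page_size if max_rows is None else min(page_size, max_rows - offset)
        page = fetch_page(limit=limit, offset=offset)
        if len(page) == 0:
            break
        pages.append(page)
        offset += len(page)
        if len(page) < limit:
            break
    return pages


# In-memory stand-in for a paginated search (made-up rows).
FAKE_ROWS = [{"id": i} for i in range(120)]


def fake_search(limit, offset):
    return FAKE_ROWS[offset:offset + limit]


pages = fetch_all(fake_search, page_size=50)
print([len(p) for p in pages])  # [50, 50, 20]
```

With the real client, `fetch_page` could be something like `lambda limit, offset: cap.search_datasets(search=["pbmc"], limit=limit, offset=offset)`, and the resulting DataFrames could be combined with `pandas.concat(pages, ignore_index=True)`.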
cap.md_session(dataset_id) creates an MD session helper bound to a dataset.
After create_session(), these members are populated:
- dataset_id: str
- session_id: str
- embeddings: list[str]
- labelsets: list[str] (cell-labels mode)
- clusterings: list[str]
- dataset_snapshot (Pydantic model → use .model_dump() to inspect)
md.create_session() does the following:
- Checks MD readiness on the server and takes a dataset snapshot.
- Fetches embeddings/labelsets/clusterings.
- Saves a unique session server-side and returns session_id.
Raises: RuntimeError if MD data isn’t ready yet.
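Because readiness depends on server-side precomputation, a small polling loop can save manual retries. This is a minimal sketch: `wait_until_ready` is a hypothetical helper, not part of the client, and it works with any zero-argument callable that returns True or False. The stand-in check below simulates a server that becomes ready on the third poll.

```python
import time


def wait_until_ready(is_ready, attempts=5, delay_s=30.0, sleep=time.sleep):
    """Poll is_ready() up to `attempts` times; return True as soon as it succeeds.

    Hypothetical helper; `sleep` is injectable so the loop can be tested without waiting.
    """
    for attempt in range(attempts):
        if is_ready():
            return True
        if attempt < attempts - 1:
            sleep(delay_s)  # no pause after the final attempt
    return False


# Stand-in readiness check: not ready twice, then ready.
answers = iter([False, False, True])
waits = []  # records requested delays instead of actually sleeping
ready = wait_until_ready(lambda: next(answers), attempts=5, delay_s=1.0, sleep=waits.append)
print(ready, waits)  # True [1.0, 1.0]
```

With the real client, one might pass `md.is_md_cache_ready` as `is_ready` and call `md.create_session()` only once it returns True. Whether that check can be called before a session exists is an assumption to verify.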
md.embedding_data(
    embedding: str,                          # must exist in md.embeddings
    max_points: int,                         # server may downsample to this
    labelsets: list[str] | None = None,
    selection_gene: str | None = None,       # per-cell expression
    selection_key_major: str | None = None,  # optional selection mask
    selection_key_minor: str | None = None,
)
Returns a typed payload with coordinates and optional arrays (labels, selection, expression). Use .model_dump() to get the raw dict.
Raises: ValueError if the embedding name isn’t available.
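To avoid the ValueError when embedding names differ between datasets, a name can be chosen from the available list before the call. A minimal sketch, assuming a hypothetical helper `pick_first_available` and a made-up list of names:

```python
def pick_first_available(available, preferred):
    """Return the first name in `preferred` that appears in `available`, else None."""
    available_set = set(available)
    for name in preferred:
        if name in available_set:
            return name
    return None


# Made-up example of what md.embeddings might contain.
available = ["tsne", "pca_0", "umap"]
choice = pick_first_available(available, preferred=["umap", "tsne"])
print(choice)  # umap
```

With the real client, `available` would be `md.embeddings`, and a `None` result signals that none of the preferred embeddings exists for that dataset.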
md.highly_variable_genes(
    gene_name_filter: str | None = None,
    pseudogenes_filter: bool = True,
    offset: int = 0,
    limit: int = 50,
    sort_order: Literal["desc","asc"] = "desc",
) -> pd.DataFrame
Returns columns:
gene_symbol: str
dispersion: float
md.general_de(labelset=..., random_seed=...) runs quick DE (top 10 labels vs rest) and returns a diff_key for downstream heatmaps.
Raises: ValueError if the labelset name isn’t available.
md.heatmap(
    diff_key: str,
    n_top_genes: int = 3,
    max_cells_displayed: int = 1000,
    gene_name_filter: str | None = None,
    pseudogenes_filter: bool = True,
    selection_key: str | None = None,
    include_reference: bool = True,
)
Returns a structured heatmap payload (labels × genes, sampled cells). Use .model_dump() for plotting.
md.is_md_cache_ready() is a quick status check for MD cache readiness.
- RuntimeError: The Molecular Data ... is not ready! Precomputations are still running on the server. Try again later.
- ValueError: Embedding 'xyz' is not found. Check md.embeddings for valid names.
- ValueError: Labelset 'xyz' is not found. Check md.labelsets for valid names.
- Large datasets: use max_points to downsample when fetching embeddings; it’s typically sufficient for exploratory plots.
- Pydantic payloads: always start with .model_dump() to discover exact fields available in your environment.
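Once a payload has been turned into a plain dict with .model_dump(), a short recursive summary makes its shape easier to scan than printing the whole thing. The sketch below is illustrative only: `summarize` is a hypothetical helper, and the field names in `fake_payload` are invented for the example, not the actual CAP schema.

```python
def summarize(payload, depth=0, max_depth=2):
    """Return one line per key with the value's type and, for sized values, its length."""
    lines = []
    for key, value in payload.items():
        size = f" (len={len(value)})" if isinstance(value, (list, tuple, dict, str)) else ""
        lines.append(f"{'  ' * depth}{key}: {type(value).__name__}{size}")
        # Descend into nested dicts, but stop at max_depth to keep output short.
        if isinstance(value, dict) and depth < max_depth:
            lines.extend(summarize(value, depth + 1, max_depth))
    return lines


# Invented structure standing in for the result of some_payload.model_dump().
fake_payload = {
    "points": {"x": [0.1, 0.2], "y": [1.0, 2.0]},
    "labels": ["a", "b"],
    "n": 2,
}

for line in summarize(fake_payload):
    print(line)
```

Reporting lengths rather than values keeps the output readable even when arrays hold thousands of cells.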
Do I need an API token? No. Public endpoints are accessible anonymously.
What exactly is in the embedding/heatmap payloads? Typed Pydantic models mirroring the CAP GraphQL schema. Use .model_dump() to explore concrete keys.
Can I plot directly from the client? The client returns data only. Use your favorite plotting library (matplotlib/plotly/etc.).
See LICENSE in the repository.