# Zebrahub scRNA-seq Data Exploration

This notebook loads the single-cell RNA-seq atlas from [Zebrahub](https://zebrahub.sf.czbiohub.org/) to explore cell type annotations across zebrafish developmental stages.

The scRNA-seq data provide **cell type labels** (via the Zebrafish Anatomy Ontology) for ~120k cells across 10 developmental stages. Although these data do not include pixel-level segmentation masks, they tell us:
- What cell types exist at each developmental stage
- Marker genes for each cell type
- Cell type proportions and diversity over time

This is valuable for downstream validation: after training a U-Net for segmentation, we can use spatial mapping tools (e.g., Tangram) to assign these transcriptional cell type labels to segmented cells and extract morphological features per cell type.

In [1]:
import scanpy as sc
import numpy as np
import matplotlib.pyplot as plt
import os
import urllib.request
import zipfile

## 1. Download the 14 hpf dataset from Figshare

We start with the **14 hpf** (hours post-fertilization) timepoint (~10-somite stage). This is mid-somitogenesis with good cell type diversity. The file is ~101 MB compressed.

In [2]:
DATA_DIR = "../data/scrna"
os.makedirs(DATA_DIR, exist_ok=True)

FILENAME = "zf_atlas_14hpf_v1_release.h5ad"
ZIP_PATH = os.path.join(DATA_DIR, FILENAME + ".zip")
H5AD_PATH = os.path.join(DATA_DIR, FILENAME)

# Download from Figshare if not already cached
if not os.path.exists(H5AD_PATH):
    url = "https://ndownloader.figshare.com/files/36736074"
    print(f"Downloading {FILENAME}.zip (~101 MB)...")
    urllib.request.urlretrieve(url, ZIP_PATH)
    print("Extracting...")
    with zipfile.ZipFile(ZIP_PATH, "r") as z:
        z.extractall(DATA_DIR)
    os.remove(ZIP_PATH)
    print(f"Saved to {H5AD_PATH}")
else:
    print(f"Already downloaded: {H5AD_PATH}")

Downloading zf_atlas_14hpf_v1_release.h5ad.zip (~101 MB)...
Extracting...
Saved to ../data/scrna/zf_atlas_14hpf_v1_release.h5ad
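
The download cell above extracts everything in the archive and assumes the expected `.h5ad` file is among its members. As a minimal sketch, assuming you would rather fail early when the archive layout differs, the helper below (a hypothetical name, `extract_expected`, not part of the notebook) checks the member list before extracting. It is demonstrated on a throwaway archive so that it runs without network access.

```python
import os
import tempfile
import zipfile


def extract_expected(zip_path, expected_name, dest_dir):
    """Extract one expected member from a zip archive and return its path.

    Raises FileNotFoundError if the archive does not contain `expected_name`.
    """
    with zipfile.ZipFile(zip_path, "r") as z:
        names = z.namelist()
        if expected_name not in names:
            raise FileNotFoundError(
                f"{expected_name!r} not in archive; found: {names}"
            )
        z.extract(expected_name, dest_dir)
    return os.path.join(dest_dir, expected_name)


# Toy demonstration with a placeholder archive (hypothetical file names).
with tempfile.TemporaryDirectory() as tmp:
    toy_zip = os.path.join(tmp, "toy.zip")
    with zipfile.ZipFile(toy_zip, "w") as z:
        z.writestr("toy.h5ad", b"placeholder bytes")
    out_path = extract_expected(toy_zip, "toy.h5ad", tmp)
    extracted_ok = os.path.exists(out_path)

print(extracted_ok)
```

In the notebook this could be called as `extract_expected(ZIP_PATH, FILENAME, DATA_DIR)` in place of `extractall`, assuming the archive stores the file at its top level.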


## 2. Load and inspect the AnnData object

In [3]:
adata = sc.read_h5ad(H5AD_PATH)
print(adata)
print(f"\nCells: {adata.n_obs:,}")
print(f"Genes: {adata.n_vars:,}")

AnnData object with n_obs × n_vars = 3862 × 32060
    obs: 'n_genes', 'n_counts', 'fish', 'timepoint_cluster', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_nc', 'pct_counts_nc', 'zebrafish_anatomy_ontology_class', 'zebrafish_anatomy_ontology_id', 'developmental_stage', 'timepoint'
    var: 'gene_ids', 'feature_types', 'genome', 'mt', 'nc', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'n_cells'
    uns: 'zebrafish_anatomy_ontology_class_colors'
    obsm: 'X_umap'
    layers: 'counts'

Cells: 3,862
Genes: 32,060


In [4]:
# Cell-level metadata columns
print("Cell metadata columns (adata.obs):")
for col in adata.obs.columns:
    print(f"  {col}")

print(f"\nEmbeddings (adata.obsm): {list(adata.obsm.keys())}")

Cell metadata columns (adata.obs):
  n_genes
  n_counts
  fish
  timepoint_cluster
  n_genes_by_counts
  total_counts
  total_counts_mt
  pct_counts_mt
  total_counts_nc
  pct_counts_nc
  zebrafish_anatomy_ontology_class
  zebrafish_anatomy_ontology_id
  developmental_stage
  timepoint

Embeddings (adata.obsm): ['X_umap']


In [None]:
# Preview the first few rows of cell metadata
adata.obs.head()

## 3. Cell type annotations

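Zebrahub provides annotations at two levels:
- **`global annotation`**: coarse tissue-level (CNS, Mesoderm, Endoderm, etc.). This column does not appear in the `adata.obs` listing printed in Section 2 for the 14 hpf file, so any code that indexes `adata.obs["global annotation"]` directly will raise a `KeyError` unless the column is present in your copy of the data.
- **`zebrafish_anatomy_ontology_class`**: fine-grained cell types from the Zebrafish Anatomy Ontology (ZFA)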
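
Because the set of `adata.obs` columns can differ between files, it may help to check which annotation columns are present before using them. The sketch below uses a hypothetical helper, `split_columns`, and a toy `DataFrame` standing in for `adata.obs`; only the column names are taken from this notebook.

```python
import pandas as pd


def split_columns(obs, wanted):
    """Return (present, missing) lists for the requested column names."""
    present = [c for c in wanted if c in obs.columns]
    missing = [c for c in wanted if c not in obs.columns]
    return present, missing


# Toy stand-in for adata.obs with two of the columns printed in Section 2.
# The row values are made up for illustration.
toy_obs = pd.DataFrame(
    {
        "zebrafish_anatomy_ontology_class": ["neural crest", "somite"],
        "timepoint": ["14hpf", "14hpf"],
    }
)

present, missing = split_columns(
    toy_obs, ["global annotation", "zebrafish_anatomy_ontology_class"]
)
print("present:", present)
print("missing:", missing)
```

With the real object, the call would be `split_columns(adata.obs, [...])`.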

In [None]:
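# Coarse annotations (guarded: this column may be absent from the file)
print("=== Coarse cell types (global annotation) ===")
if "global annotation" in adata.obs.columns:
    print(adata.obs["global annotation"].value_counts())
else:
    print("Column 'global annotation' not found in adata.obs")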

In [None]:
# Fine-grained annotations
print("=== Fine-grained cell types (ZFA ontology) ===")
ct_counts = adata.obs["zebrafish_anatomy_ontology_class"].value_counts()
print(f"Number of distinct cell types: {len(ct_counts)}")
print("\nTop 20 most abundant:")
print(ct_counts.head(20))

In [None]:
# Cell type diversity
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

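# Coarse type pie chart (guarded: "global annotation" may be absent from this file)
if "global annotation" in adata.obs.columns:
    coarse_counts = adata.obs["global annotation"].value_counts()
    axes[0].pie(coarse_counts.values, labels=coarse_counts.index, autopct="%1.1f%%")
    axes[0].set_title("Coarse cell types (14 hpf)")
else:
    axes[0].set_axis_off()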

# Fine type bar chart (top 15)
top_fine = ct_counts.head(15)
axes[1].barh(range(len(top_fine)), top_fine.values)
axes[1].set_yticks(range(len(top_fine)))
axes[1].set_yticklabels(top_fine.index, fontsize=9)
axes[1].set_xlabel("Number of cells")
axes[1].set_title("Top 15 fine-grained cell types (14 hpf)")
axes[1].invert_yaxis()

plt.tight_layout()
plt.show()
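
The charts above show cell type abundance visually. One way to condense "diversity" into a single number is the Shannon index, H = -Σ pᵢ ln pᵢ over cell type proportions pᵢ. This metric is an assumption on our part rather than something Zebrahub prescribes, and `shannon_diversity` is a hypothetical helper name. A minimal sketch on toy labels:

```python
import math
from collections import Counter


def shannon_diversity(labels):
    """Shannon index (natural log) of the label distribution.

    Higher values mean more cell types and/or a more even distribution.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum((n / total) * math.log(n / total) for n in counts.values())


# Toy labels (made up): four equally abundant types vs. a skewed mix.
even = ["a", "b", "c", "d"]
skewed = ["a", "a", "a", "a", "a", "a", "a", "b"]

h_even = shannon_diversity(even)
h_skewed = shannon_diversity(skewed)
print(f"even:   {h_even:.3f}")
print(f"skewed: {h_skewed:.3f}")
```

On the real data this could be applied as `shannon_diversity(adata.obs["zebrafish_anatomy_ontology_class"])`, which would make values comparable across timepoints once those are loaded.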

## 4. UMAP visualizations

In [None]:
# Color by coarse cell type (guarded: this column may be absent from the file)
if "global annotation" in adata.obs.columns:
    sc.pl.umap(adata, color="global annotation", title="UMAP: Coarse cell types (14 hpf)", frameon=False)

In [None]:
# Color by fine-grained cell type
sc.pl.umap(adata, color="zebrafish_anatomy_ontology_class",
           title="UMAP: Fine-grained cell types (14 hpf)",
           frameon=False, legend_loc="on data", legend_fontsize=6)

## 5. Marker gene expression

Visualize known marker genes to validate cell type annotations:
- **sox2**: neural progenitors
- **tbxta** (ntla/brachyury): notochord / mesoderm
- **myod1**: muscle
- **gata1a**: blood / hematopoietic
- **pax6a**: eye field and anterior neural tissue
- **foxa2**: floor plate / endoderm

In [None]:
# Check which marker genes are in the dataset
markers = ["sox2", "tbxta", "myod1", "gata1a", "pax6a", "foxa2"]
available = [g for g in markers if g in adata.var_names]
missing = [g for g in markers if g not in adata.var_names]
print(f"Available markers: {available}")
if missing:
    print(f"Not found: {missing}")

In [None]:
if available:
    sc.pl.umap(adata, color=available, ncols=3, frameon=False, cmap="viridis")

In [None]:
# Dot plot of markers by coarse cell type (guarded: this column may be absent from the file)
if available and "global annotation" in adata.obs.columns:
    sc.pl.dotplot(adata, var_names=available, groupby="global annotation")
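
A dot plot encodes two per-group summaries for each gene: the mean expression and the fraction of cells in which the gene is detected. The sketch below computes roughly those two tables with plain pandas on a toy cells × genes matrix. The helper name `dotplot_summary`, the group labels, and the expression values are all made up for illustration, and scanpy's own computation may differ in details such as scaling or cutoffs.

```python
import pandas as pd


def dotplot_summary(expr, groups):
    """Per-group mean expression and fraction of cells with expression > 0.

    expr   : DataFrame of shape (cells, genes)
    groups : Series of group labels aligned to expr's index
    """
    mean_expr = expr.groupby(groups).mean()
    frac_expressing = (expr > 0).groupby(groups).mean()
    return mean_expr, frac_expressing


# Toy data: 4 cells, 2 genes, 2 groups.
expr = pd.DataFrame(
    {"sox2": [2.0, 0.0, 0.0, 0.0], "myod1": [0.0, 0.0, 3.0, 1.0]}
)
groups = pd.Series(["neural", "neural", "muscle", "muscle"], name="group")

mean_expr, frac_expressing = dotplot_summary(expr, groups)
print(mean_expr)
print(frac_expressing)
```

Having the numbers as tables can be handy when a marker looks ambiguous in the plot and you want to compare groups directly.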

## 6. Cell type statistics for segmentation planning

Understanding cell type frequencies helps plan what our U-Net will encounter and how to handle class imbalance.

In [None]:
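# Summary statistics (guarded: "global annotation" may be absent from this file)
if "global annotation" in adata.obs.columns:
    coarse_counts = adata.obs["global annotation"].value_counts()
    print("Cell type proportions at 14 hpf:")
    print("=" * 45)
    for ct, count in coarse_counts.items():
        pct = 100 * count / adata.n_obs
        print(f"  {ct:<35s} {count:>5d} ({pct:5.1f}%)")
    print(f"  {'TOTAL':<35s} {adata.n_obs:>5d}")
else:
    print("Column 'global annotation' not found in adata.obs")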
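
One common way to handle class imbalance is to weight each class inversely to its frequency, using n_samples / (n_classes × count). The sketch below applies that formula to toy labels. The helper name `balanced_class_weights` and the label counts are hypothetical, and cell counts from scRNA-seq are only a rough proxy for what a segmentation model sees in images, so treat the result as a starting point rather than a recommendation.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {label: n_samples / (n_classes * n) for label, n in counts.items()}


# Toy labels (made-up proportions): 6 CNS, 3 Mesoderm, 1 Endoderm.
toy_labels = ["CNS"] * 6 + ["Mesoderm"] * 3 + ["Endoderm"] * 1

weights = balanced_class_weights(toy_labels)
for label, w in weights.items():
    print(f"{label:<10s} {w:.3f}")
```

Rare classes receive weights above 1 and abundant classes weights below 1, so a weighted loss pays proportionally more attention to the rare ones.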

In [None]:
# Per-embryo breakdown (multiple replicates per stage)
if "fish" in adata.obs.columns:
    print("Cells per embryo:")
    print(adata.obs["fish"].value_counts())
    print(f"\nNumber of embryos: {adata.obs['fish'].nunique()}")

## Next steps

- **Load additional timepoints** to compare cell type diversity across development
- **Generate segmentation ground truth**: Use the Ultrack pre-trained U-Net (`unet-daxi.pt` from `public.czbiohub.org/royerlab/ultrack/unet_weights/`) to produce foreground and contour probability maps from the light-sheet data
- **Spatial mapping**: After training a segmentation model, use tools like [Tangram](https://github.com/broadinstitute/Tangram) to map these scRNA-seq cell type labels onto segmented cells in the imaging data
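
For the first item above, once several timepoints are loaded, their `obs` tables could be combined and summarized as a stage × cell type proportion table. The sketch below assumes the `timepoint` and `zebrafish_anatomy_ontology_class` columns seen in this file are present in the others too. The helper name `composition_by_stage` and the toy values are made up for illustration.

```python
import pandas as pd


def composition_by_stage(
    obs,
    stage_col="timepoint",
    type_col="zebrafish_anatomy_ontology_class",
):
    """Row-normalized table: fraction of each cell type within each stage."""
    return pd.crosstab(obs[stage_col], obs[type_col], normalize="index")


# Toy stand-in for concatenated obs tables, e.g. the result of
# pd.concat([adata_a.obs, adata_b.obs]) for two loaded timepoints.
toy_obs = pd.DataFrame(
    {
        "timepoint": ["10hpf", "10hpf", "14hpf", "14hpf", "14hpf", "14hpf"],
        "zebrafish_anatomy_ontology_class": [
            "epiblast", "epiblast", "somite", "somite", "somite", "neural crest",
        ],
    }
)

props = composition_by_stage(toy_obs)
print(props)
```

Each row sums to 1, so rows can be compared across stages even when the number of cells per timepoint differs.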