# Tutorial: Multimodal RNA + ATAC Analysis with SnapATAC2

Audience:
- Computational biologists and bioinformatics users working with single-cell multiome data.

Prerequisites:
- Python environment with `scanpy`, `snapatac2`, `anndata`, and plotting dependencies installed.
- Basic familiarity with AnnData objects and single-cell preprocessing.

Learning goals:
- Load paired RNA and ATAC modalities from a common multiome dataset.
- Build modality-specific embeddings and clusters.
- Build a joint multimodal embedding using `snap.tl.multi_spectral`.


## Outline

1. Setup
2. Read local DBiT RNA + ATAC files from `data-RNA` (Steps 1 and 1b)
3. Build paired local RNA and ATAC objects (Step 2)
4. RNA preprocessing and embedding (Step 3)
5. ATAC preprocessing and embedding (Step 4)
6. Joint multimodal embedding and clustering (Step 5)
7. Pitfalls, extensions, and exercises


In [20]:
from __future__ import annotations

import importlib
from pathlib import Path

import anndata as ad
import scanpy as sc
import snapatac2 as snap
import utils.dbit_rna_reader as dbit_rna_reader

# Force reload so notebook picks up newly added reader functions.
dbit_rna_reader = importlib.reload(dbit_rna_reader)
discover_atac_fragment_tars = dbit_rna_reader.discover_atac_fragment_tars
extract_atac_fragment_archives = dbit_rna_reader.extract_atac_fragment_archives
import_atac_fragments_with_snap = dbit_rna_reader.import_atac_fragments_with_snap
read_dbit_rna_directory = dbit_rna_reader.read_dbit_rna_directory

sc.settings.verbosity = 2
sc.settings.set_figure_params(figsize=(5, 5), frameon=False)

print("scanpy:", sc.__version__)
print("snapatac2:", snap.__version__)


scanpy: 1.11.5
snapatac2: 2.8.0




## Step 1 - Read local DBiT RNA + ATAC files (`data-RNA`)

This section uses reusable readers adapted from your `dbit_nature_multisample_workflow.ipynb` logic. It loads the RNA matrices and tissue positions, then discovers the local ATAC fragment tar files for the same samples.
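The discovery step can be sketched with `pathlib` alone. This is a minimal illustration, not the implementation of `discover_atac_fragment_tars`: the helper name `find_fragment_tars` is hypothetical, the naming scheme `<sample_id>_atac_fragments.tsv.tar` is inferred from the manifest shown in Step 1b, and `99_MISSING` is a made-up sample ID used only to show that samples without an archive are skipped.

```python
import tempfile
from pathlib import Path


def find_fragment_tars(data_dir: Path, sample_ids: list[str]) -> dict[str, Path]:
    """Map each sample ID to its fragment tar, skipping samples without one."""
    found = {}
    for sample_id in sample_ids:
        # Assumed naming scheme: <sample_id>_atac_fragments.tsv.tar
        candidate = data_dir / f"{sample_id}_atac_fragments.tsv.tar"
        if candidate.is_file():
            found[sample_id] = candidate
    return found


# Demonstration on a temporary directory that holds one matching archive.
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    (tmp_dir / "06_LPC5S1_atac_fragments.tsv.tar").touch()
    demo_manifest = find_fragment_tars(tmp_dir, ["06_LPC5S1", "99_MISSING"])
    demo_found_ids = sorted(demo_manifest)

print(demo_found_ids)
```

Returning a dictionary keyed by sample ID keeps RNA-only samples out of the manifest, so later steps only iterate over samples that have both modalities.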


In [None]:
# The section heading and the manifest paths below point at `data-RNA` itself.
data_rna_dir = Path("data-RNA")
adata_dbit_rna, dbit_rna_summary = read_dbit_rna_directory(data_rna_dir)

atac_tar_manifest = discover_atac_fragment_tars(
    data_dir=data_rna_dir,
    sample_ids=dbit_rna_summary["sample_id"].tolist(),
)

print(adata_dbit_rna)
print(f"ATAC fragment tar files for RNA samples: {len(atac_tar_manifest)}")
dbit_rna_summary


### Step 1b - Extract and optionally import local ATAC fragments

Extract `.tsv.gz` ATAC fragment files from tar archives for samples where ATAC is available. Then optionally import each sample into SnapATAC2.
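The extraction step can be sketched with the standard library. This is an illustration under assumptions rather than the code behind `extract_atac_fragment_archives`: the helper name `extract_fragment_files` is hypothetical, and it assumes each archive holds a `.tsv.gz` fragment file and possibly a `.tsv.gz.tbi` index. Members are written under their base name only, so directories stored inside the archive are not recreated.

```python
import gzip
import io
import shutil
import tarfile
import tempfile
from pathlib import Path


def extract_fragment_files(tar_path: Path, out_dir: Path) -> list[Path]:
    """Extract only .tsv.gz and .tsv.gz.tbi members from one tar archive."""
    out_dir.mkdir(parents=True, exist_ok=True)
    extracted = []
    with tarfile.open(tar_path) as tar:
        for member in tar.getmembers():
            name = Path(member.name).name  # drop directories inside the archive
            if not member.isfile() or not name.endswith((".tsv.gz", ".tsv.gz.tbi")):
                continue
            target = out_dir / name
            # Stream the member to disk instead of loading it into memory.
            with tar.extractfile(member) as source, open(target, "wb") as handle:
                shutil.copyfileobj(source, handle)
            extracted.append(target)
    return sorted(extracted)


# Demonstration: build a small archive with one fragment file and one unrelated file.
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    payload = gzip.compress(b"chr1\t100\t200\tAAAC\t1\n")
    tar_path = tmp_dir / "demo_atac_fragments.tsv.tar"
    with tarfile.open(tar_path, "w") as tar:
        for member_name, data in [("demo/fragments.tsv.gz", payload), ("demo/README.txt", b"notes")]:
            info = tarfile.TarInfo(member_name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    demo_files = [p.name for p in extract_fragment_files(tar_path, tmp_dir / "out")]
    demo_content = gzip.decompress((tmp_dir / "out" / "fragments.tsv.gz").read_bytes())

print(demo_files)
```

Writing by base name also avoids following unexpected paths stored in an archive, which matters when the tar files come from an external source.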


In [18]:
dbit_atac_manifest = extract_atac_fragment_archives(
    out_dir=Path("data/atac_fragments_local"),
    atac_manifest=atac_tar_manifest,
    overwrite=False,
)
dbit_atac_manifest.head()




| | sample_id | atac_kind | atac_tar | fragments_tsv_gz | fragments_tbi |
|---|---|---|---|---|---|
| 0 | 06_LPC5S1 | atac_fragments | data-RNA/06_LPC5S1_atac_fragments.tsv.tar | data/atac_fragments_local/06_LPC5S1/06_LPC5S1_... | data/atac_fragments_local/06_LPC5S1/06_LPC5S1_... |
| 1 | 06_LPC5S2 | atac_fragments | data-RNA/06_LPC5S2_atac_fragments.tsv.tar | data/atac_fragments_local/06_LPC5S2/06_LPC5S2_... | data/atac_fragments_local/06_LPC5S2/06_LPC5S2_... |
| 2 | 07_LPC10S1 | atac_fragments | data-RNA/07_LPC10S1_atac_fragments.tsv.tar | data/atac_fragments_local/07_LPC10S1/07_LPC10S... | data/atac_fragments_local/07_LPC10S1/07_LPC10S... |
| 3 | 07_LPC10S2 | atac_fragments | data-RNA/07_LPC10S2_atac_fragments.tsv.tar | data/atac_fragments_local/07_LPC10S2/07_LPC10S... | data/atac_fragments_local/07_LPC10S2/07_LPC10S... |
| 4 | 08_LPC21S1 | atac_fragments | data-RNA/08_LPC21S1_atac_fragments.tsv.tar | data/atac_fragments_local/08_LPC21S1/08_LPC21S... | data/atac_fragments_local/08_LPC21S1/08_LPC21S... |


In [19]:
RUN_LOCAL_ATAC_IMPORT = True

if RUN_LOCAL_ATAC_IMPORT:
    # Choose the correct genome for your experiment, e.g., snap.genome.mm10 or snap.genome.hg38.
    genome = snap.genome.mm10
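    # Barcodes may be stored only in obs_names. In that case selecting
    # obs[["sample_id", "barcode"]] raises KeyError: "['barcode'] not in index",
    # so fall back to obs_names (assumption: they hold the raw per-sample barcodes).
    rna_obs = adata_dbit_rna.obs.copy()
    if "barcode" not in rna_obs.columns:
        rna_obs["barcode"] = adata_dbit_rna.obs_names.to_numpy()
    whitelist_by_sample = (
        rna_obs[["sample_id", "barcode"]]
        .dropna()
        .assign(barcode=lambda df: df["barcode"].astype(str))
        .groupby("sample_id")["barcode"]
        .apply(list)
        .to_dict()
    )
    local_atac = import_atac_fragments_with_snap(
        dbit_atac_manifest,
        genome=genome,
        whitelist_by_sample=whitelist_by_sample,
        sorted_by_barcode=False,
    )
    print("Imported ATAC samples:", len(local_atac))
else:
    print("Set RUN_LOCAL_ATAC_IMPORT=True to import local ATAC fragments with SnapATAC2.")
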

## Step 2 - Build paired local RNA and ATAC objects

This mirrors the tutorial pattern but uses your local `data-RNA` samples. We construct sample-aware cell IDs (`sample_id:barcode`) in both modalities and keep only shared cells.
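The reason for sample-aware IDs is that the same barcode sequence can occur in more than one sample. The toy sketch below uses plain Python lists instead of AnnData objects to show the idea; `make_cell_ids` and the sample names `S1` and `S2` are hypothetical and exist only for this illustration.

```python
def make_cell_ids(sample_id: str, barcodes: list[str]) -> list[str]:
    """Prefix every barcode with its sample ID, using ':' as the separator."""
    return [f"{sample_id}:{barcode}" for barcode in barcodes]


# The barcode "AAAC" appears in both samples, so raw barcodes alone are ambiguous.
rna_ids = make_cell_ids("S1", ["AAAC", "AAAG"]) + make_cell_ids("S2", ["AAAC"])
atac_ids = make_cell_ids("S1", ["AAAC"]) + make_cell_ids("S2", ["AAAC", "TTTG"])

# Keep the RNA order while retaining only cells that are present in both modalities.
atac_id_set = set(atac_ids)
shared_ids = [cell_id for cell_id in rna_ids if cell_id in atac_id_set]

print(shared_ids)
```

With raw barcodes, `AAAC` from `S1` and `AAAC` from `S2` would collapse into one entry; the prefixed IDs keep them as two distinct shared cells.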


In [17]:
if "local_atac" not in globals():
    raise RuntimeError("Run Step 1b with RUN_LOCAL_ATAC_IMPORT=True before this cell.")

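rna = adata_dbit_rna.copy()
rna.obs["sample_id"] = rna.obs["sample_id"].astype(str)
# Fall back to obs_names when the reader did not add a "barcode" column
# (assumption: obs_names hold the raw per-sample barcodes).
if "barcode" not in rna.obs.columns:
    rna.obs["barcode"] = rna.obs_names.to_numpy()
rna.obs["barcode"] = rna.obs["barcode"].astype(str)
rna.obs_names = (rna.obs["sample_id"] + ":" + rna.obs["barcode"]).to_numpy()
rna.obs_names_make_unique()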

atac_parts = []
for sample_id in sorted(local_atac):
    a = local_atac[sample_id].copy()
    a.obs["sample_id"] = str(sample_id)
    a.obs["barcode"] = a.obs_names.astype(str)
    a.obs_names = (a.obs["sample_id"].astype(str) + ":" + a.obs["barcode"]).to_numpy()
    a.obs_names_make_unique()
    atac_parts.append(a)

atac = ad.concat(atac_parts, join="outer", merge="same")

print(rna)
print(atac)
print("Shared barcodes:", len(rna.obs_names.intersection(atac.obs_names)))



## Step 3 - RNA preprocessing and embedding

Use a standard Scanpy workflow to build an RNA latent space and clusters.
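To make the first two steps of that workflow concrete, the sketch below reproduces the arithmetic of library-size normalization followed by `log1p` on a toy count matrix in plain Python. It is an illustration of the idea, not Scanpy's implementation. The function name `normalize_and_log` is hypothetical, and the explicit `target_sum` is an assumption made for clarity: when `sc.pp.normalize_total` is called without `target_sum`, as in this tutorial, Scanpy uses the median of the per-cell total counts instead.

```python
import math


def normalize_and_log(counts: list[list[float]], target_sum: float) -> list[list[float]]:
    """Scale each cell (row) to `target_sum` total counts, then apply log(1 + x)."""
    result = []
    for cell in counts:
        total = sum(cell)
        # Cells with zero counts stay at zero instead of dividing by zero.
        scale = target_sum / total if total > 0 else 0.0
        result.append([math.log1p(value * scale) for value in cell])
    return result


# Two cells with the same gene proportions but a tenfold difference in depth.
toy_counts = [[1, 3], [10, 30]]
toy_result = normalize_and_log(toy_counts, target_sum=4.0)

for row in toy_result:
    print([round(value, 4) for value in row])
```

Both rows come out (almost exactly) identical, which is the point of the normalization: differences in sequencing depth are removed before highly variable genes are selected.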


In [None]:
sc.pp.normalize_total(rna)
sc.pp.log1p(rna)
sc.pp.highly_variable_genes(rna)

rna.raw = rna
rna = rna[:, rna.var["highly_variable"]].copy()

sc.pp.scale(rna)
sc.pp.pca(rna, n_comps=50)
sc.pp.neighbors(rna)
sc.tl.umap(rna)
sc.tl.leiden(rna)


In [None]:
sc.pl.umap(rna, color="leiden", legend_loc="on data")


## Step 4 - ATAC preprocessing and embedding

Run SnapATAC2 feature selection, spectral embedding, neighborhood graph, and clustering.


In [None]:
snap.pp.select_features(atac)
snap.tl.spectral(atac)
snap.pp.knn(atac)
snap.tl.umap(atac)
snap.tl.leiden(atac)


In [None]:
snap.pl.umap(atac, color="leiden", show=True)


## Step 5 - Joint multimodal embedding

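Align cells between modalities, then compute a joint representation. `snap.tl.multi_spectral` takes a list of AnnData objects that contain the same cells in the same order and returns the eigenvalues and eigenvectors of the joint embedding. Here the embedding is stored on a copy of the ATAC object, which is then used for the neighborhood graph, UMAP, and clustering.


In [None]:
shared_cells = rna.obs_names.intersection(atac.obs_names)
rna_shared = rna[shared_cells].copy()
atac_shared = atac[shared_cells].copy()

# features=None uses every feature of each object, so restrict ATAC to the
# features chosen by snap.pp.select_features in Step 4 first.
atac_selected = atac_shared[:, atac_shared.var["selected"]].copy()

# Note: rna was scaled in Step 3. If the spectral step rejects negative values,
# pass log-normalized values instead (for example, taken from rna.raw).
_, joint_embedding = snap.tl.multi_spectral([atac_selected, rna_shared], features=None)

# `mdata` is a plain AnnData copy that carries the joint embedding.
mdata = atac_shared.copy()
mdata.obsm["X_joint"] = joint_embedding
mdata


In [None]:
snap.pp.knn(mdata, use_rep="X_joint")
snap.tl.umap(mdata, use_rep="X_joint")
snap.tl.leiden(mdata)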


In [None]:
snap.pl.umap(mdata, color="leiden", show=True)


## Pitfalls and Extensions

Common pitfalls:
- The RNA and ATAC objects must share cell barcodes. If they do not, integration is not meaningful.
- Keep ATAC in memory (`backed=None`) for this workflow because graph construction needs writable arrays.
- If memory is limited, reduce feature counts before `snap.tl.spectral`.

Extension ideas:
- Compare modality-specific and multimodal clusters with contingency tables.
- Add marker-gene and marker-peak annotation for biological interpretation.
- Re-run integration after changing ATAC feature selection parameters.


## Exercises

1. Compare multimodal Leiden clusters against RNA-only Leiden clusters with a contingency table.
2. Recompute ATAC embeddings with a different feature-selection threshold and compare cluster stability.
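For the second exercise, one way to put a number on cluster stability is the adjusted Rand index (ARI) between two label vectors over the same cells. The ARI is a suggestion added here, not something the workflow above computes; the pure-Python sketch below assumes both labelings are given in the same cell order. A value of 1 means the partitions agree perfectly, and values near 0 mean agreement at chance level.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a: list, labels_b: list) -> float:
    """Adjusted Rand index between two clusterings of the same cells."""
    if len(labels_a) != len(labels_b):
        raise ValueError("Both label lists must cover the same cells.")
    total_pairs = comb(len(labels_a), 2)
    if total_pairs == 0:
        return 1.0  # zero or one cell: the partitions agree trivially

    # Pairs of cells that share a cluster in both, in A only, and in B only.
    pairs_both = sum(comb(n, 2) for n in Counter(zip(labels_a, labels_b)).values())
    pairs_a = sum(comb(n, 2) for n in Counter(labels_a).values())
    pairs_b = sum(comb(n, 2) for n in Counter(labels_b).values())

    expected = pairs_a * pairs_b / total_pairs
    maximum = (pairs_a + pairs_b) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. every cell in a single cluster in both
    return (pairs_both - expected) / (maximum - expected)


# Cluster names do not matter, only which cells are grouped together.
ari_same = adjusted_rand_index(["0", "0", "1", "1"], ["x", "x", "y", "y"])
ari_crossed = adjusted_rand_index(["0", "0", "1", "1"], ["0", "1", "0", "1"])

print(ari_same, ari_crossed)
```

In the notebook, the two inputs could be the `leiden` columns from two runs with different feature-selection settings, converted to lists after aligning on the same cells.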


In [None]:
# Exercise answer scaffold
import pandas as pd

comparison = pd.concat(
    [
        rna_shared.obs["leiden"].rename("rna_leiden"),
        atac_shared.obs["leiden"].rename("atac_leiden"),
        mdata.obs["leiden"].rename("multi_leiden"),
    ],
    axis=1,
).dropna()

comparison.head()


In [None]:
pd.crosstab(comparison["multi_leiden"], comparison["rna_leiden"])
