
# Week 6 — scRNA‑seq mini‑pipeline (Alevin‑fry → AnnData → Leiden → CellTypist)

**Goal:** Fetch raw single‑cell FASTQs, build a compact *splici* reference, quantify with the alevin‑fry (via `simpleaf`) pipeline, load counts into an `AnnData`, cluster with Leiden, then auto‑annotate with CellTypist.  
This notebook is designed to be **self‑contained**: it installs its own tools and fetches data at run time.

> **Re-run hint:** You can safely re-run the whole notebook. Heavy steps cache to `./week6_data/` and `./week6_work/`.


## 0) Configuration & small helpers

In [None]:

import os, sys, json, subprocess, shutil, textwrap, time, pathlib

# Project directories
ROOT = pathlib.Path.cwd()
DATA = ROOT / "week6_data"
WORK = ROOT / "week6_work"
OUT  = ROOT / "week6_results"
for d in (DATA, WORK, OUT):
    d.mkdir(parents=True, exist_ok=True)

BOX_FOLDER_URL = "https://app.box.com/s/lx2xownlrhz3us8496tyu9c4dgade814"
BOX_EXPECTED = {}

CHEMISTRY = os.environ.get("SC_CHEM", "10xv3")  # "10xv2" or "10xv3"
PERMIT_STRATEGY = os.environ.get("SC_PERMIT", "auto")   # e.g., "expect:1500" or a path
THREADS = int(os.environ.get("SC_THREADS", "4"))

def sh(cmd, env=None, check=True):
    print(f"\n$ {cmd}")
    proc = subprocess.run(cmd, shell=True, env=env, check=check)
    return proc.returncode

print("ROOT:", ROOT)
print("DATA:", DATA)
print("WORK:", WORK)
print("OUT:", OUT)



## 1) One‑time tool bootstrap (micromamba env)

We install a minimal conda env containing the alevin‑fry stack via **micromamba**:
- `simpleaf` (wrapper), `alevin-fry`, `piscem` (fast mapper), `salmon` (if needed)
- `pyroe`, `scanpy`, `anndata`, `python-igraph`, `leidenalg`, `celltypist`


In [None]:

import os, platform, shlex

MAMBA_DIR = ROOT / "week6_mamba"
ENV_NAME  = "scweek6"
MICROMAMBA = str(MAMBA_DIR / "bin" / "micromamba")
# Keep envs inside the project so `-n ENV_NAME` resolves to the same place on every run
os.environ["MAMBA_ROOT_PREFIX"] = str(MAMBA_DIR)

def ensure_micromamba():
    """Download the micromamba binary into MAMBA_DIR unless it is already there."""
    if (MAMBA_DIR / "bin" / "micromamba").exists():
        return
    MAMBA_DIR.mkdir(parents=True, exist_ok=True)
    sysname = platform.system().lower()
    arch = "linux-64" if sysname == "linux" else ("osx-64" if platform.machine() == "x86_64" else "osx-arm64")
    url = f"https://micro.mamba.pm/api/micromamba/{arch}/latest"
    # -C comes before the member name so the binary lands in MAMBA_DIR/bin
    sh(f"curl -Ls {url} | tar -xvj -C {shlex.quote(str(MAMBA_DIR))} bin/micromamba")

def mm(args):
    return sh(f"{MICROMAMBA} " + args)

def mmrun(cmd):
    # shlex.quote keeps multi-line commands and embedded quotes intact
    return sh(f"{MICROMAMBA} run -n {ENV_NAME} bash -lc {shlex.quote(cmd)}")

ensure_micromamba()

# Create the env once; later runs reuse it
if not (MAMBA_DIR / "envs" / ENV_NAME).exists():
    mm(f"create -y -n {ENV_NAME} -c conda-forge -c bioconda python=3.11 simpleaf alevin-fry piscem salmon scanpy anndata python-igraph leidenalg celltypist pyroe")

# Cache dir for simpleaf/alevin-fry
AF_HOME = WORK / "af_home"
AF_HOME.mkdir(parents=True, exist_ok=True)
os.environ["ALEVIN_FRY_HOME"] = str(AF_HOME)
mmrun(f"echo ALEVIN_FRY_HOME={str(AF_HOME)}")
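
A quick way to confirm that the tools listed above are visible is to look them up on `PATH`. This is a minimal sketch using only the standard library; `missing_tools` is a hypothetical helper that is not part of the notebook, and it inspects the PATH of the current process, so it only reflects the micromamba env if the kernel itself runs inside that env.

```python
import shutil

def missing_tools(names):
    """Return the subset of `names` that shutil.which cannot find on PATH."""
    return [n for n in names if shutil.which(n) is None]

# Tool names taken from the env spec in this section
REQUIRED = ["simpleaf", "alevin-fry", "piscem", "salmon"]
print("Missing on PATH:", missing_tools(REQUIRED))
```

An empty list means every tool resolved; anything listed needs to be installed or the env activated.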



## 2) Fetch raw data & reference from Box

The notebook will attempt to **direct‑download** the Box folder as a zip using `?download=1`.  
If your institution's Box requires authentication, place the files manually under `week6_data/` and re‑run.


In [None]:

import re, requests, zipfile, io, pathlib

def box_download_folder(shared_url, out_dir: pathlib.Path):
    out_dir.mkdir(parents=True, exist_ok=True)
    try_urls = [shared_url, shared_url.rstrip('/') + '?download=1']
    last_err = None
    for url in try_urls:
        try:
            print("Trying:", url)
            with requests.get(url, allow_redirects=True, stream=True, timeout=60) as r:
                r.raise_for_status()
                ctype = r.headers.get('Content-Type','')
                content = r.content
                if content[:2] == b'PK':  # zip
                    with zipfile.ZipFile(io.BytesIO(content)) as zf:
                        zf.extractall(out_dir)
                    print("Extracted Box zip to", out_dir)
                    return True
        except Exception as e:
            last_err = e
            print("Download failed:", e)
    print("Box auto-download did not succeed. Manually download to", out_dir)
    return False

_ = box_download_folder(BOX_FOLDER_URL, DATA)
# DATA is already a pathlib.Path, so glob on it directly (the bare name `Path` was never imported)
print("Contents:", [p.name for p in DATA.glob('**/*')][:20], "...")
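
Whether the files arrive by download or by hand, it can be worth counting what is actually in the data folder before the heavy steps start. The sketch below is an illustration under assumptions: `summarize_inputs` is a hypothetical helper, the suffix lists are guesses at common naming, and the demo runs against a throwaway directory with made-up file names rather than the real `week6_data/`.

```python
import pathlib
import tempfile

def summarize_inputs(data_dir):
    """Count the kinds of input files the later steps look for under data_dir."""
    files = [p for p in pathlib.Path(data_dir).rglob("*") if p.is_file()]
    def by_suffix(suffixes):
        return sum(1 for p in files if p.name.endswith(suffixes))
    def by_token(token):
        return sum(1 for p in files if token in p.name and "fastq" in p.name)
    return {
        "fasta": by_suffix((".fa", ".fa.gz", ".fasta", ".fasta.gz")),
        "gtf": by_suffix((".gtf", ".gtf.gz")),
        "fastq_R1": by_token("_R1_"),
        "fastq_R2": by_token("_R2_"),
    }

# Demo on a temporary folder with hypothetical file names
with tempfile.TemporaryDirectory() as tmp:
    root = pathlib.Path(tmp)
    (root / "sub").mkdir()
    for name in ["chr5.fa", "chr5.gtf", "sub/s_R1_001.fastq.gz", "sub/s_R2_001.fastq.gz"]:
        (root / name).touch()
    demo_summary = summarize_inputs(root)
print(demo_summary)
```

In the notebook, the same call on `DATA` would show at a glance whether a manual download is still needed.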



## 3) Build a *splici* reference index (for USA mode)

We build a **spliced + intronic** (*splici*) reference from the provided genome FASTA and GTF, then make a compact **piscem** index via `simpleaf index`.


In [None]:

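import glob

def find_one(patterns):
    """Return the first file under DATA that matches any pattern, or None."""
    for p in patterns:
        # recursive=True is required for `**` to descend into subfolders
        m = sorted(glob.glob(str(DATA / p), recursive=True))
        if m:
            return m[0]
    return None

# Exact suffixes only: a loose "*.fa*" would also match .fai indexes and .fastq reads
GENOME = find_one(["*.fa", "*.fa.gz", "**/*.fa", "**/*.fa.gz", "**/*.fasta", "**/*.fasta.gz"])
GTF    = find_one(["*.gtf", "*.gtf.gz", "**/*.gtf", "**/*.gtf.gz"])
print("GENOME:", GENOME)
print("GTF:", GTF)
assert GENOME and GTF, "Could not find genome FASTA or GTF in week6_data."

IDX_DIR = WORK / "splici_index"
IDX_DIR.mkdir(parents=True, exist_ok=True)

# Reuse an existing index; the quant step expects it at IDX_DIR/index
if (IDX_DIR / "index").exists():
    print("Reusing existing index")
else:
    mmrun(f"export ALEVIN_FRY_HOME={str(AF_HOME)}; ulimit -n 2048; simpleaf index --output {str(IDX_DIR)} --fasta {GENOME} --gtf {GTF} --threads {THREADS}")
print("Index ready:", IDX_DIR)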
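
To make the *splici* idea concrete: besides the spliced transcripts, the reference also contains intronic sequence, widened by a flank so that reads straddling an exon/intron boundary still have somewhere to map. The sketch below only illustrates the interval arithmetic for one transcript. It is not how `simpleaf` builds the reference; `introns_with_flank` is a hypothetical helper, coordinates are half-open, and the flank of 86 assumes a read length of 91 minus a trim of 5.

```python
def introns_with_flank(exons, flank):
    """Return the gaps between consecutive exons, widened by `flank` on each side.

    `exons` is a list of sorted, non-overlapping (start, end) half-open intervals
    for a single transcript.
    """
    introns = []
    for (_, end_a), (start_b, _) in zip(exons, exons[1:]):
        # The raw intron is [end_a, start_b); the flank reaches into both exons
        introns.append((end_a - flank, start_b + flank))
    return introns

# Hypothetical transcript with three exons
demo_exons = [(100, 200), (500, 600), (900, 1000)]
demo_introns = introns_with_flank(demo_exons, flank=86)
print(demo_introns)
```

A single-exon transcript yields an empty list, since it has no introns to add.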



## 4) Quantify FASTQs with `simpleaf quant`

This step scans `week6_data/` for `*_R1_*` and `*_R2_*` FASTQs (recursively), then runs `simpleaf quant` in **cr‑like** resolution and **USA** mode.  
If you don't have an explicit whitelist, leave `SC_PERMIT` unset (the default, `auto`) or set `SC_PERMIT=expect:N` to give an expected cell count.


In [None]:

import subprocess

reads1 = subprocess.check_output(["bash","-lc", f"find -L {DATA} -type f -name '*_R1_*fastq*' | sort | paste -sd, -"]).decode().strip()
reads2 = subprocess.check_output(["bash","-lc", f"find -L {DATA} -type f -name '*_R2_*fastq*' | sort | paste -sd, -"]).decode().strip()
print("R1 files:", reads1)
print("R2 files:", reads2)
assert reads1 and reads2, "No FASTQs found under week6_data."

QUANT_DIR = WORK / "af_quant_run"
QUANT_DIR.mkdir(parents=True, exist_ok=True)

# Default: filter against the chemistry's unfiltered permit list
permit_flag = "--unfiltered-pl"
if PERMIT_STRATEGY.startswith("expect:"):
    n = PERMIT_STRATEGY.split(":", 1)[1]
    permit_flag = f"--expect-cells {int(n)}"
elif PERMIT_STRATEGY not in ("auto", ""):
    # simpleaf has no --permit-list option; a whitelist path is given as the argument of --unfiltered-pl
    permit_flag = f"--unfiltered-pl {PERMIT_STRATEGY}"

cmd = f"""
export ALEVIN_FRY_HOME={str(AF_HOME)}
ulimit -n 2048
simpleaf quant   --reads1 {reads1}   --reads2 {reads2}   --threads {THREADS}   --index {str(IDX_DIR)}/index   --chemistry {CHEMISTRY} --resolution cr-like   {permit_flag} --anndata-out   --output {str(QUANT_DIR)}
"""
mmrun(cmd)
print("Quantification done ->", QUANT_DIR)
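
The `find` calls in this step collect R1 and R2 files independently, so a missing mate would go unnoticed until quantification fails or silently mispairs reads. A minimal sketch of a pairing check follows; `pair_fastqs` is a hypothetical helper, the file names are made up, and it assumes mates differ only in the `_R1_` / `_R2_` token.

```python
import re

def pair_fastqs(paths):
    """Group FASTQ paths into (R1, R2) pairs; also return files without a mate."""
    by_key = {}
    for p in sorted(paths):
        m = re.search(r"_R([12])_", p)
        if not m:
            continue  # not a read file we recognize
        # Same key for both mates: the name with the read token masked out
        key = p[:m.start()] + "_R?_" + p[m.end():]
        by_key.setdefault(key, {})[m.group(1)] = p
    pairs = [(v["1"], v["2"]) for _, v in sorted(by_key.items()) if len(v) == 2]
    unpaired = [p for _, v in sorted(by_key.items()) if len(v) == 1 for p in v.values()]
    return pairs, unpaired

# Hypothetical file names: lane 2 is missing its R2
demo_files = ["s_L001_R1_001.fastq.gz", "s_L001_R2_001.fastq.gz", "s_L002_R1_001.fastq.gz"]
demo_pairs, demo_unpaired = pair_fastqs(demo_files)
print("pairs:", demo_pairs)
print("unpaired:", demo_unpaired)
```

Joining the first and second elements of `pairs` with commas would give `--reads1` and `--reads2` lists that are guaranteed to be in matching order.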


## 5) Load counts into `AnnData` and do basic QC/normalization

In [None]:

import scanpy as sc, anndata as ad, numpy as np, pandas as pd, matplotlib.pyplot as plt, pathlib

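# Search below QUANT_DIR instead of hard-coding the subfolder
# (assumption: the file may sit in af_quant/ or in af_quant/alevin/)
H5AD = next(QUANT_DIR.rglob("quants.h5ad"), None)
assert H5AD is not None, f"No quants.h5ad under {QUANT_DIR}; check the quant step logs."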
adata = sc.read_h5ad(H5AD)
adata.var_names_make_unique()

if "feature_name" in adata.var.columns:
    gene_symbols = adata.var["feature_name"]
else:
    gene_symbols = adata.var_names

adata.var["mt"] = gene_symbols.str.upper().str.startswith(("MT-","MT_"))
sc.pp.calculate_qc_metrics(adata, qc_vars=["mt"], percent_top=None, log1p=False, inplace=True)

sc.pp.filter_cells(adata, min_counts=500)
sc.pp.filter_genes(adata, min_cells=3)

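sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
# Keep the log-normalized, all-gene matrix: CellTypist needs it, and .X is about to be subset to HVGs and scaled
adata.raw = adata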

sc.pp.highly_variable_genes(adata, n_top_genes=3000, subset=True)
sc.pp.scale(adata, max_value=10)
sc.tl.pca(adata, n_comps=50, svd_solver="arpack")

sc.pp.neighbors(adata, n_neighbors=15, n_pcs=30)
sc.tl.umap(adata, min_dist=0.3)
sc.tl.leiden(adata, key_added="leiden_0p5", resolution=0.5)

OUT.mkdir(parents=True, exist_ok=True)
adata.write(OUT / "week6_preannot.h5ad", compression="gzip")
adata
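
The two normalization calls in this cell reduce to simple per-cell arithmetic: scale the counts so they sum to 10,000, then take the natural log of one plus each value. The sketch below works that out for a single hypothetical cell in plain Python. It illustrates the arithmetic only; it is not scanpy's implementation and does not handle cells with zero total counts.

```python
import math

def normalize_log1p(counts, target_sum=1e4):
    """Scale one cell's counts to target_sum, then apply log(1 + x)."""
    total = sum(counts)
    return [math.log1p(c * target_sum / total) for c in counts]

cell = [2, 0, 6, 2]  # hypothetical raw counts for four genes
norm = normalize_log1p(cell)
print([round(v, 3) for v in norm])

# Undoing the log recovers the target depth, which is the property CellTypist checks for
recovered = sum(math.expm1(v) for v in norm)
print(round(recovered))
```

This is also why the scaled highly-variable-gene matrix is not a valid CellTypist input: after scaling, the values no longer map back to 10,000 counts per cell.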


## 6) Clustering plot (UMAP colored by Leiden)

In [None]:

import scanpy as sc
sc.pl.umap(adata, color=["leiden_0p5"], wspace=0.4, show=True, save=None)



## 7) Automatic cell annotation with CellTypist

We download a CellTypist model and annotate cells; labels are added to `adata.obs` and plotted on UMAP.


In [None]:

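import celltypist
from celltypist import models

# Fetch just this model; the first positional argument of download_models is force_update, so pass model= by keyword
try:
    models.download_models(model="Immune_All_Low.pkl")
except Exception as e:
    print("Model download issue (continuing with any cached copy):", e)

# CellTypist expects log1p-normalized counts (target 1e4); if .X does not qualify it tries adata.raw
# Note: the models are keyed by gene symbol, so var_names must be symbols rather than gene IDs
pred = celltypist.annotate(adata, model="Immune_All_Low.pkl", majority_voting=True)
# predicted_labels is a DataFrame; "majority_voting" holds the cluster-smoothed label
adata.obs["celltypist_label"] = pred.predicted_labels["majority_voting"]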

import scanpy as sc
sc.pl.umap(adata, color=["celltypist_label"], legend_loc="on data", legend_fontsize=8, frameon=False, show=True, save=None)

adata.write(OUT / "week6_annotated.h5ad", compression="gzip")
print("Saved:", OUT / "week6_annotated.h5ad")
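
Majority voting smooths per-cell predictions by giving every cell in a cluster the label that is most common there. The sketch below shows that idea in its simplest form. It is a simplification, not CellTypist's procedure: CellTypist runs its own over-clustering and applies a minimum-proportion threshold, neither of which is reproduced here, and `majority_vote` plus the demo labels are hypothetical.

```python
from collections import Counter

def majority_vote(clusters, labels):
    """Replace each cell's label with the most common label in its cluster."""
    per_cluster = {}
    for c, lab in zip(clusters, labels):
        per_cluster.setdefault(c, Counter())[lab] += 1
    winner = {c: counts.most_common(1)[0][0] for c, counts in per_cluster.items()}
    return [winner[c] for c in clusters]

# Five hypothetical cells in two clusters; one cell in cluster "0" disagrees
demo_clusters = ["0", "0", "0", "1", "1"]
demo_labels = ["T cells", "T cells", "B cells", "B cells", "B cells"]
voted = majority_vote(demo_clusters, demo_labels)
print(voted)
```

The per-cell labels remain available in the `predicted_labels` column if the smoothed ones look too coarse.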


## 8) Export summaries

In [None]:

import pandas as pd
sizes = adata.obs["leiden_0p5"].value_counts().sort_index()
sizes.to_csv(OUT / "cluster_sizes.csv")
ct_sizes = adata.obs["celltypist_label"].value_counts()
ct_sizes.to_csv(OUT / "celltypist_counts.csv")
print("Wrote:", OUT / "cluster_sizes.csv", "and", OUT / "celltypist_counts.csv")



## 9) Notes & FAQ

- **Whitelist barcodes link is broken — can we proceed?** Yes. With `SC_PERMIT` unset, the quant step passes `--unfiltered-pl` without an argument, which lets `simpleaf` obtain the standard barcode list for the chosen chemistry itself. Alternatively, set an expected cell count via `SC_PERMIT=expect:1500`. If you later obtain a whitelist, set `SC_PERMIT=/path/to/permit_list.txt` and re-run the quant step.
- **Chemistry**: Default is `10xv3`. If your data is v2, set `SC_CHEM=10xv2` before executing.
- **Re-running**: Index and quants use `week6_work/` and will be reused if present.
- **Submission**: Commit only `week6/week6.ipynb`. The notebook installs its tools and fetches its data during execution.
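
One way to make reuse of earlier results safe is to record the parameters a step ran with and skip the step only when they are unchanged. This is a minimal sketch of that pattern, not something the notebook currently does; `fingerprint`, `needs_rerun`, `mark_done`, and the stamp file name are all hypothetical.

```python
import hashlib
import json
import pathlib
import tempfile

def fingerprint(params):
    """Stable hash of a JSON-serializable parameter dict."""
    blob = json.dumps(params, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()

def needs_rerun(stamp, params):
    """True when no stamp exists or it was written for different parameters."""
    stamp = pathlib.Path(stamp)
    return not stamp.exists() or stamp.read_text() != fingerprint(params)

def mark_done(stamp, params):
    """Record that the step finished with these parameters."""
    pathlib.Path(stamp).write_text(fingerprint(params))

# Demo in a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    stamp = pathlib.Path(tmp) / "quant.done"
    v3 = {"chemistry": "10xv3", "permit": "auto"}
    first = needs_rerun(stamp, v3)       # nothing recorded yet
    mark_done(stamp, v3)
    same = needs_rerun(stamp, v3)        # identical parameters
    changed = needs_rerun(stamp, {"chemistry": "10xv2", "permit": "auto"})
print(first, same, changed)
```

With a stamp like this next to the quant output, changing `SC_CHEM` or `SC_PERMIT` would trigger a fresh run while a plain re-execution would not.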
