# Notebook 3: Cell Type Annotation & Export

**Cell Annotation Pipeline - Part 3 of 3**

**Stages:** 8-9
**📥 Input:** `outputs/clustered_data.h5ad`
**📤 Output:** `outputs/annotated_data.h5ad` (FINAL)

---

In [ ]:
import scanpy as sc
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path

# Load
print("Loading data from Notebook 2...")
adata = sc.read_h5ad('outputs/clustered_data.h5ad')

# Validate
checks = {
    'UMAP': 'X_umap' in adata.obsm,
    'Clusters': 'leiden' in adata.obs.columns,
    'Markers': 'rank_genes_groups' in adata.uns,
}

for check, passed in checks.items():
    print(f"  {'✓' if passed else '✗'} {check}")
    if not passed:
        raise ValueError(f"Missing {check} - run Notebook 2!")

print(f"\n✓ Loaded: {adata.n_obs:,} cells, {adata.obs['leiden'].nunique()} clusters")

## Parameter Configuration

Define marker genes and annotation parameters.
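
Before scoring, it can help to sanity-check the panels. The sketch below is not part of the pipeline; `check_marker_panels` and the toy panels are hypothetical names used only for illustration. It reports major labels that have no panel (these would be silently skipped during scoring) and genes listed in more than one panel (for example `Satb2`, which appears under both `Excit` and `ExN_L2-4`).

```python
from collections import defaultdict


def check_marker_panels(marker_genes, major_labels):
    """Return (missing_labels, shared_genes) for a marker panel dictionary.

    Hypothetical helper for illustration; not part of the pipeline.
    """
    # Major labels without a (non-empty) panel would be skipped during scoring.
    missing_labels = [lbl for lbl in major_labels if not marker_genes.get(lbl)]

    # Map each gene to every panel that lists it.
    panels_by_gene = defaultdict(list)
    for label, genes in marker_genes.items():
        for gene in genes:
            panels_by_gene[gene].append(label)

    # Keep only genes that occur in two or more panels.
    shared_genes = {g: p for g, p in panels_by_gene.items() if len(p) > 1}
    return missing_labels, shared_genes


# Toy panels (an illustrative subset, not the full dictionary)
toy_panels = {
    "Excit": ["Slc17a7", "Satb2"],
    "ExN_L2-4": ["Cux2", "Satb2"],
    "Inhib": ["Gad1", "Gad2"],
}
missing, shared = check_marker_panels(toy_panels, ["Excit", "Inhib", "Astro"])
print(missing)  # ['Astro']
print(shared)   # {'Satb2': ['Excit', 'ExN_L2-4']}
```

Shared genes are not necessarily a problem (a subtype panel can reuse a parent marker), but they are worth knowing about when two competing labels score similarly.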

In [ ]:
# ============================================================================
# MARKER GENE PANELS
# ============================================================================

# Comprehensive marker gene panel for mouse brain
# Customize these for your tissue type
MARKER_GENES = {
    # General neuron/excitatory
    "Neuron": ["Snap25", "Rbfox3", "Syp"],
    "Excit": ["Slc17a7", "Camk2a", "Satb2"],
    
    # Excitatory layer-specific markers
    "ExN_L2-4": ["Cux1", "Cux2", "Satb2"],
    "ExN_L5": ["Bcl11b", "Fezf2"],  # Bcl11b is also known as Ctip2; only the official symbol is listed
    "ExN_L6": ["Tbr1", "Sox5"],
    "ExN_L6b": ["Ctgf"],
    
    # Inhibitory (generic + subclasses)
    "Inhib": ["Gad1", "Gad2", "Slc6a1"],
    "InN_SST": ["Sst", "Npy", "Chodl"],
    "InN_VIP": ["Vip", "Cck", "Calb2"],
    "InN_PVALB": ["Pvalb", "Gabra1", "Reln"],
    
    # Glia and vascular
    "Astro": ["Slc1a2", "Slc1a3", "Aqp4", "Aldh1l1", "Gfap"],
    "Oligo": ["Plp1", "Mog", "Mobp", "Mbp"],
    "OPC": ["Pdgfra", "Cspg4", "Sox10"],
    "Micro": ["P2ry12", "Tmem119", "Cx3cr1", "Csf1r", "Sall1", "Aif1"],
    "Endo": ["Pecam1", "Kdr", "Flt1", "Klf2", "Slco1a4"],
    "Peri": ["Pdgfrb", "Rgs5", "Kcnj8", "Abcc9"],
    "VLMC": ["Col1a1", "Col1a2", "Lum", "Dcn"],
    "SMC": ["Acta2", "Myh11", "Tagln"],
}

# Major cell type labels (for initial broad classification)
MAJOR_LABELS = [
    "Excit",
    "Inhib",
    "Astro",
    "Oligo",
    "OPC",
    "Micro",
    "Endo",
    "Peri",
    "VLMC",
    "SMC",
]

# ============================================================================
# ANNOTATION PARAMETERS
# ============================================================================

ANNOTATION_PARAMS = {
    'label_mode': 'cell',      # 🔧 'cell' for per-cell or 'cluster' for cluster-level
    'margin': 0.05,            # 🔧 Confidence margin for label assignment
    'cluster_agg': 'median',   # 🔧 Aggregation for cluster-level ('median' or 'mean')
}

# Output directory
PLOTS_DIR = Path('plots/notebook3')
PLOTS_DIR.mkdir(parents=True, exist_ok=True)

print("Marker gene panels loaded:")
print(f"  {len(MARKER_GENES)} cell type panels")
print(f"  {len(MAJOR_LABELS)} major cell types")
print(f"\nAnnotation mode: {ANNOTATION_PARAMS['label_mode']}")
print(f"Confidence margin: {ANNOTATION_PARAMS['margin']}")

## Stage 8: Cell Type Annotation

Annotate cell types using marker gene expression scores.

**Strategy:**
1. Plot marker genes across clusters to visualize expression
2. Compute module scores for each cell type
3. Assign major cell types based on top scores
4. Refine excitatory neurons into layer-specific subtypes
5. Refine inhibitory neurons into subtypes
6. Visualize results
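
The margin rule in steps 2-3 can be illustrated on a toy score matrix. This is a minimal NumPy sketch, assuming scores are already computed (one row per cell, one column per label); `assign_with_margin` and the `"Unlabeled"` placeholder are hypothetical names for illustration and stand apart from the notebook's own functions.

```python
import numpy as np


def assign_with_margin(scores, labels, margin=0.05, unlabeled="Unlabeled"):
    """Pick the top-scoring label per row if it beats the runner-up by `margin`."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=object)

    # Best score and its column index for every row
    top_idx = scores.argmax(axis=1)
    best = scores[np.arange(scores.shape[0]), top_idx]

    # Runner-up via partial sort; with a single column there is no runner-up
    if scores.shape[1] > 1:
        second = np.partition(scores, -2, axis=1)[:, -2]
    else:
        second = np.full(scores.shape[0], -np.inf)

    # Rows whose lead is smaller than the margin stay unlabeled
    confident = (best - second) >= margin
    return np.where(confident, labels[top_idx], unlabeled)


toy_scores = [
    [0.80, 0.10, 0.05],  # clear winner: first label
    [0.30, 0.28, 0.10],  # lead of 0.02 is below the margin
    [0.02, 0.05, 0.60],  # clear winner: third label
]
result = assign_with_margin(toy_scores, ["Excit", "Inhib", "Astro"], margin=0.05)
print(result.tolist())  # ['Excit', 'Unlabeled', 'Astro']
```

A larger margin trades coverage for confidence: fewer cells receive a label, but the labels that are assigned are less ambiguous.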

In [ ]:
def plot_marker_genes(adata, save_dir=None):
    """Plot marker genes across clusters using dot plot
    
    Args:
        adata: AnnData object with clustering results
        save_dir: Directory to save plots (optional)
    """
    print("Plotting marker gene expression...")
    
    # Collect all available marker genes
    available_markers = []
    missing_genes = []
    
    for cell_type, genes in MARKER_GENES.items():
        for gene in genes:
            if gene in adata.var_names:
                if gene not in available_markers:
                    available_markers.append(gene)
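            else:
                missing_genes.append((cell_type, gene))
    
    # Count each gene once, since some markers appear in more than one panel
    n_unique = len({g for genes in MARKER_GENES.values() for g in genes})
    print(f"  Found {len(available_markers)}/{n_unique} unique marker genes in dataset")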
    
    if missing_genes:
        print(f"  ⚠️  Missing {len(missing_genes)} markers:")
        for ct, g in missing_genes[:10]:  # Show first 10
            print(f"      {g} ({ct})")
        if len(missing_genes) > 10:
            print(f"      ... and {len(missing_genes)-10} more")
    
    if available_markers:
        # Create dot plot
        sc.pl.dotplot(
            adata,
            available_markers,
            groupby='leiden',
            standard_scale='var',
            figsize=(15, 8),
            show=False,
        )
        plt.xticks(rotation=45, ha='right')
        plt.tight_layout()
        
        if save_dir:
            plt.savefig(save_dir / 'marker_genes_dotplot.png', dpi=300, bbox_inches='tight')
            print(f"  ✓ Saved: {save_dir}/marker_genes_dotplot.png")
        
        plt.show()
    else:
        print("  ⚠️  No marker genes found in dataset!")

# Plot marker genes
plot_marker_genes(adata, save_dir=PLOTS_DIR)

In [ ]:
def assign_major_celltypes_by_scores(adata, margin=0.05):
    """Assign major cell types using module scores with confidence margin
    
    Computes per-cell module scores for each major cell type, picks the
    top-scoring label when its score exceeds the next-best by `margin`.
    
    Args:
        adata: AnnData object
        margin: Confidence margin between top and second-best scores
    """
    print("Computing major cell type scores...")
    
    use_raw = getattr(adata, 'raw', None) is not None
    var_names = adata.raw.var_names if use_raw else adata.var_names
    
    score_cols = []
    for lbl in MAJOR_LABELS:
        genes = [g for g in MARKER_GENES.get(lbl, []) if g in var_names]
        if not genes:
            continue
        
        score_name = f'score_{lbl}'
        sc.tl.score_genes(
            adata,
            gene_list=genes,
            score_name=score_name,
            use_raw=use_raw
        )
        score_cols.append(score_name)
        print(f"  ✓ {lbl}: {len(genes)} markers")
    
    if not score_cols:
        print("  ⚠️  No marker genes found!")
        return
    
    # Find top scoring label per cell
    scores = adata.obs[score_cols].to_numpy()
    top_idx = np.argmax(scores, axis=1)
    best = scores[np.arange(scores.shape[0]), top_idx]
    
    # Get second best score via partial sort (no runner-up with a single score column)
    if scores.shape[1] > 1:
        second_best = np.partition(scores, -2, axis=1)[:, -2]
    else:
        second_best = np.full(scores.shape[0], -np.inf)
    
    # Extract labels
    labels = np.array([c.replace('score_', '') for c in score_cols])
    winners = labels[top_idx]
    
    # Only assign if confidence is high enough
    confident = best - second_best >= margin
    
    # Rebuild the column on every run so labels from a previous run do not persist.
    # Low-confidence cells get an explicit 'Unlabeled' string, which avoids mixing
    # strings into a float (NaN) column and shows up in the plots and counts below.
    adata.obs['celltype'] = np.where(confident, winners, 'Unlabeled')
    
    print(f"\n✓ Assigned {confident.sum():,} / {len(confident):,} cells ({confident.sum()/len(confident)*100:.1f}%)")
    print(f"  Unlabeled (low confidence): {(~confident).sum():,} cells")


def assign_major_celltypes_by_cluster_scores(adata, margin=0.05, agg='median'):
    """Assign major cell types at cluster level using aggregated scores
    
    Computes per-cell module scores, aggregates them per cluster, and assigns
    the winning label to all cells in that cluster.
    
    Args:
        adata: AnnData object
        margin: Confidence margin
        agg: Aggregation method ('median' or 'mean')
    """
    print("Computing cluster-level cell type scores...")
    
    use_raw = getattr(adata, 'raw', None) is not None
    var_names = adata.raw.var_names if use_raw else adata.var_names
    
    score_cols = []
    for lbl in MAJOR_LABELS:
        genes = [g for g in MARKER_GENES.get(lbl, []) if g in var_names]
        if not genes:
            continue
        
        score_name = f'score_{lbl}'
        sc.tl.score_genes(
            adata,
            gene_list=genes,
            score_name=score_name,
            use_raw=use_raw
        )
        score_cols.append(score_name)
        print(f"  ✓ {lbl}: {len(genes)} markers")
    
    if not score_cols or 'leiden' not in adata.obs:
        return
    
    # Reset the column on every run so labels from a previous run do not persist;
    # clusters below the margin keep the explicit 'Unlabeled' label
    adata.obs['celltype'] = 'Unlabeled'
    
    # Aggregate scores per cluster (observed=True skips empty categories)
    agg_func = 'median' if agg == 'median' else 'mean'
    grouped = adata.obs.groupby('leiden', observed=True)[score_cols].agg(agg_func)
    
    grouped_vals = grouped.to_numpy()
    top_idx = np.argmax(grouped_vals, axis=1)
    best = grouped_vals[np.arange(grouped_vals.shape[0]), top_idx]
    # No runner-up exists when only one score column is available
    if grouped_vals.shape[1] > 1:
        second_best = np.partition(grouped_vals, -2, axis=1)[:, -2]
    else:
        second_best = np.full(grouped_vals.shape[0], -np.inf)
    labels = np.array([c.replace('score_', '') for c in score_cols])
    winners = labels[top_idx]
    
    confident = best - second_best >= margin
    
    # Assign the winning label to all cells in each confident cluster
    leiden_str = adata.obs['leiden'].astype(str)
    assigned_clusters = 0
    for cluster_id, label, is_conf in zip(grouped.index.astype(str), winners, confident):
        if not is_conf:
            continue
        adata.obs.loc[leiden_str == cluster_id, 'celltype'] = label
        assigned_clusters += 1
    
    print(f"\n✓ Assigned {assigned_clusters} / {len(grouped)} clusters")
    n_assigned = (adata.obs['celltype'] != 'Unlabeled').sum()
    print(f"  Total cells assigned: {n_assigned:,} / {adata.n_obs:,} ({n_assigned/adata.n_obs*100:.1f}%)")


# Run annotation based on selected mode
if ANNOTATION_PARAMS['label_mode'] == 'cluster':
    assign_major_celltypes_by_cluster_scores(
        adata,
        margin=ANNOTATION_PARAMS['margin'],
        agg=ANNOTATION_PARAMS['cluster_agg']
    )
else:
    assign_major_celltypes_by_scores(
        adata,
        margin=ANNOTATION_PARAMS['margin']
    )

# Show initial distribution
print("\nInitial cell type distribution:")
print(adata.obs['celltype'].value_counts().sort_index())

In [ ]:
def refine_by_subtype_scores(adata, subtype_labels, eligible_celltypes, margin=None):
    """Refine broad cell types into subtypes using module scores
    
    For cells labeled with an eligible parent cell type (e.g., "Excit"),
    compute subtype scores (e.g., "ExN_L2-4", "ExN_L5") and assign the best
    subtype label.
    
    Args:
        adata: AnnData object
        subtype_labels: List of subtype labels to test
        eligible_celltypes: List of parent labels that can be refined
        margin: Optional confidence margin
    
    Returns:
        AnnData with refined celltype annotations
    """
    if 'celltype' not in adata.obs:
        return adata
    
    use_raw = getattr(adata, 'raw', None) is not None
    var_names = adata.raw.var_names if use_raw else adata.var_names
    
    # Compute subtype scores
    score_cols = []
    for label in subtype_labels:
        genes = [g for g in MARKER_GENES.get(label, []) if g in var_names]
        if not genes:
            continue
        
        score_name = f'score_{label}'
        sc.tl.score_genes(
            adata,
            gene_list=genes,
            score_name=score_name,
            use_raw=use_raw
        )
        score_cols.append(score_name)
    
    if not score_cols:
        return adata
    
    scores = adata.obs[score_cols].to_numpy()
    best_idx = np.argmax(scores, axis=1)
    score_labels = np.array([c.replace('score_', '') for c in score_cols])
    best_labels = score_labels[best_idx]
    
    # Optional margin gating
    if margin is not None and scores.shape[1] > 1:
        part = np.partition(scores, -2, axis=1)
        second_best = part[:, -2]
        best = scores[np.arange(scores.shape[0]), best_idx]
        confident = (best - second_best) >= float(margin)
    else:
        confident = np.ones(scores.shape[0], dtype=bool)
    
    # Only refine cells with eligible parent cell types
    eligible_mask = adata.obs['celltype'].isin(eligible_celltypes).to_numpy()
    update_mask = eligible_mask & confident
    
    if update_mask.any():
        n_refined = update_mask.sum()
        print(f"  Refined {n_refined:,} cells from {eligible_celltypes} → subtypes")
        adata.obs.loc[update_mask, 'celltype'] = best_labels[update_mask]
    
    return adata


# Refine excitatory neurons into cortical layers
print("\nRefining excitatory neuron subtypes...")
adata = refine_by_subtype_scores(
    adata,
    subtype_labels=["ExN_L2-4", "ExN_L5", "ExN_L6", "ExN_L6b"],
    eligible_celltypes=["Excit", "Neuron"],
    margin=None  # No margin - assign best subtype unconditionally
)

# Refine inhibitory neuron subtypes
print("Refining inhibitory neuron subtypes...")
adata = refine_by_subtype_scores(
    adata,
    subtype_labels=["InN_SST", "InN_VIP", "InN_PVALB"],
    eligible_celltypes=["Inhib", "Neuron"],
    margin=None
)

print("\nFinal cell type distribution:")
print(adata.obs['celltype'].value_counts().sort_index())

## Visualization

Plot annotated cell types on UMAP and composition heatmap.

In [ ]:
# Plot annotated cell types on UMAP
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

sc.pl.umap(
    adata,
    color='celltype',
    legend_loc='right margin',
    title='Cell type annotations',
    ax=axes[0],
    show=False
)

sc.pl.umap(
    adata,
    color='leiden',
    legend_loc='on data',
    title='Clusters',
    ax=axes[1],
    show=False
)

plt.tight_layout()
plt.savefig(PLOTS_DIR / 'celltype_umap.png', dpi=300, bbox_inches='tight')
plt.show()

print(f"✓ Saved: {PLOTS_DIR}/celltype_umap.png")

In [ ]:
# Create composition heatmap showing cell type distribution per cluster
composition = pd.crosstab(
    adata.obs['leiden'],
    adata.obs['celltype'],
    normalize='index'  # Normalize by cluster (rows sum to 1)
)

plt.figure(figsize=(12, 8))
sns.heatmap(
    composition,
    annot=True,
    fmt='.2f',
    cmap='YlOrRd',
    cbar_kws={'label': 'Proportion of cells'}
)
plt.title('Cell type composition per cluster')
plt.xlabel('Cell type')
plt.ylabel('Cluster')
plt.tight_layout()
plt.savefig(PLOTS_DIR / 'composition_heatmap.png', dpi=300, bbox_inches='tight')
plt.show()

print(f"✓ Saved: {PLOTS_DIR}/composition_heatmap.png")

# Identify dominant cell type per cluster
dominant = composition.idxmax(axis=1)
dominant_prop = composition.max(axis=1)

print("\nDominant cell type per cluster:")
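# Iterate over the actual cluster labels; a positional counter would mislabel
# rows if any cluster is missing from the crosstab
for cluster, celltype, prop in zip(composition.index, dominant, dominant_prop):
    print(f"  Cluster {cluster}: {celltype} ({prop*100:.1f}%)")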

In [ ]:
# Plot cell type distribution across samples
celltype_by_sample = pd.crosstab(
    adata.obs['orig.ident'],
    adata.obs['celltype']
)

fig, ax = plt.subplots(figsize=(14, 6))
celltype_by_sample.plot(kind='bar', stacked=True, ax=ax)
plt.title('Cell type distribution across samples')
plt.xlabel('Sample')
plt.ylabel('Number of cells')
plt.xticks(rotation=45, ha='right')
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left', title='Cell type')
plt.tight_layout()
plt.savefig(PLOTS_DIR / 'celltype_distribution.png', dpi=300, bbox_inches='tight')
plt.show()

print(f"✓ Saved: {PLOTS_DIR}/celltype_distribution.png")

### 🎛️ Parameter Tuning Guide: Annotation Quality

Review the UMAP and composition heatmap above. How did the annotation perform?

---

#### **📊 Cell Type Distribution**

<details>
<summary><b>&gt;20% cells are "Unlabeled"</b></summary>

**Diagnosis:** ⚠️ Low confidence in many cell type assignments

**Possible causes:**
1. Margin too stringent
2. Marker genes not appropriate for your tissue
3. Missing cell type-specific markers

**Actions:**

**Approach 1: Lower confidence margin**
```python
# In Cell 3, update:
ANNOTATION_PARAMS['margin'] = 0.03  # Lower from 0.05 (less stringent)
```

**Approach 2: Add tissue-specific markers to the major panels**
```python
# The unlabeled fraction depends only on the panels listed in MAJOR_LABELS;
# subtype panels such as 'ExN_L5' are used later, during refinement.
# Illustrative additions only; confirm each gene is present in your dataset:
MARKER_GENES['Excit'] = MARKER_GENES['Excit'] + ["Slc17a6"]
MARKER_GENES['Inhib'] = MARKER_GENES['Inhib'] + ["Slc32a1"]
```

**Approach 3: Use cluster-level annotation** (less affected by dropout)
```python
ANNOTATION_PARAMS['label_mode'] = 'cluster'
ANNOTATION_PARAMS['cluster_agg'] = 'median'
```

**Then:** Re-run from Cell 5
</details>
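
To judge how sensitive the unlabeled fraction is to the margin before re-running the annotation, the gap between the best and second-best score can be inspected directly. The sketch below uses random toy scores; `unlabeled_fraction` is a hypothetical helper, not a pipeline function.

```python
import numpy as np


def unlabeled_fraction(scores, margin):
    """Fraction of rows whose best score leads the runner-up by less than `margin`.

    Expects at least two score columns.
    """
    scores = np.asarray(scores, dtype=float)
    ordered = np.sort(scores, axis=1)       # ascending within each row
    gap = ordered[:, -1] - ordered[:, -2]   # best minus second best
    return float(np.mean(gap < margin))


# Toy scores: 1000 "cells" x 4 "labels" (random numbers, for illustration only)
rng = np.random.default_rng(0)
toy_scores = rng.normal(size=(1000, 4))

for margin in (0.01, 0.03, 0.05, 0.10):
    frac = unlabeled_fraction(toy_scores, margin)
    print(f"margin={margin:.2f}: {frac:.1%} unlabeled")
```

In the notebook, the same function could be applied to the score columns written by the annotation step, for example `adata.obs[[f"score_{lbl}" for lbl in MAJOR_LABELS]]`, assuming every major label had at least one marker in the dataset.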

<details>
<summary><b>Most clusters are a single cell type</b> (e.g., all 15 clusters annotated as "Excit")</summary>

**Diagnosis:** ⚠️ Markers not specific enough

**Problems:**
- Cannot distinguish subtypes
- Over-clustering without biological meaning

**Actions:**

**Approach 1: Add more specific subtype markers**
```python
# Add subtype panels with update() so the major and glial panels are kept:
MARKER_GENES.update({
    "ExN_L2-3": ["Cux1", "Cux2", "Rorb"],
    "ExN_L4": ["Rorb", "Scnn1a"],
    "ExN_L5": ["Bcl11b", "Fezf2", "Foxp2"],
    "ExN_L6": ["Tbr1", "Foxp2"],
    "InN_Pvalb": ["Pvalb", "Gabra1"],
    "InN_Sst": ["Sst", "Npy"],
    "InN_Vip": ["Vip", "Cck"],
})
# New panel names are only scored if they are also listed in the
# subtype_labels passed to refine_by_subtype_scores()
```

**Approach 2: Decrease clustering resolution** (if subtypes don't exist)
- Go back to Notebook 2, Cell 4
- Lower resolution to get broader clusters

**Then:** Re-run from Cell 5 (this notebook) or from Notebook 2
</details>

<details>
<summary><b>One cluster has mixed cell types</b> (e.g., Cluster 5: 40% Excit, 30% Inhib, 30% Astro)</summary>

**Diagnosis:** ⚠️ Possible doublet cluster or under-clustering

**Investigation:**

1. **Check doublet scores:**
```python
# Add cell to check:
cluster_id = '5'
print("Doublet score distribution:")
print(adata[adata.obs['leiden'] == cluster_id].obs['doublet_score'].describe())
```

2. **Check if intermediate position on UMAP**
   - If positioned between major cell types → Likely doublets
   - If within a cell type region → Under-clustering

**Actions:**

**If doublets:**
- Go back to Notebook 1
- Lower doublet threshold:
```python
DOUBLET_PARAMS['manual_threshold'] = 0.30  # More stringent
```
- Re-run from Notebook 1, Stage 3

**If under-clustering:**
- Go back to Notebook 2
- Increase resolution:
```python
CLUSTERING_PARAMS['resolution'] = 1.0  # Higher
```
- Re-run from Notebook 2, Cell 10

**If transitional/intermediate state:** May be biological - investigate further
</details>

<details>
<summary><b>Annotation matches UMAP spatial organization</b></summary>

**Diagnosis:** ✅ Excellent - biologically coherent

**Observations to validate:**
- Similar cell types cluster together on UMAP
- Clear boundaries between cell types
- Cell type composition makes biological sense

**Quality checks:**
1. **Composition heatmap** - each cluster should be dominated by 1-2 cell types
2. **Expected proportions** - do cell type proportions match biology?
   - Brain: Often mostly neurons (roughly 60-80%) with some glia (20-40%), though this varies with region and dissociation protocol
   - Immune: Varied depending on tissue
   - Epithelial: Mostly epithelial cells with some stromal

**Action:** Proceed to export with confidence
</details>

---

#### **📊 Composition Heatmap Interpretation**

**Good patterns:**
- **Each cluster dominated by one cell type** (>60% one cell type)
- **Related cell types group together** (e.g., ExN_L2-4 and ExN_L5 in adjacent clusters)
- **Clear cluster identity**

**Problematic patterns:**
- **Cluster split evenly** between 2+ cell types → Under-clustering or doublets
- **Same cell type in many distant clusters** → Over-clustering
- **High "Unlabeled" in all clusters** → Markers not working
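
These patterns can also be checked numerically rather than by eye. The following is a minimal pandas sketch on toy labels; `summarize_cluster_purity` is a hypothetical helper, and the 0.6 cutoff simply mirrors the >60% guideline above.

```python
import pandas as pd


def summarize_cluster_purity(clusters, celltypes, min_purity=0.6):
    """Report the dominant cell type and its share for every cluster."""
    # Row-normalized composition: each cluster's proportions sum to 1
    composition = pd.crosstab(
        pd.Series(clusters, name="cluster"),
        pd.Series(celltypes, name="celltype"),
        normalize="index",
    )
    summary = pd.DataFrame({
        "dominant": composition.idxmax(axis=1),
        "purity": composition.max(axis=1),
    })
    # Clusters below the cutoff are candidates for doublets or under-clustering
    summary["mixed"] = summary["purity"] < min_purity
    return summary


# Toy data: cluster "0" is mostly Excit, cluster "1" is a mixture
toy_clusters = ["0"] * 4 + ["1"] * 4
toy_types = ["Excit", "Excit", "Excit", "Inhib",
             "Excit", "Inhib", "Astro", "Astro"]

summary = summarize_cluster_purity(toy_clusters, toy_types)
print(summary)
```

In the notebook this could be called as `summarize_cluster_purity(adata.obs['leiden'], adata.obs['celltype'])`; rows with `mixed == True` are the clusters worth inspecting on the UMAP.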

---

#### **💡 Troubleshooting Decision Tree**

```
High unlabeled (>20%)?
├─ Yes: Lower margin OR add more specific markers
└─ No: Check composition heatmap
    ├─ Clean (each cluster = 1 cell type): ✅ Done!
    ├─ Mixed clusters:
    │   ├─ Intermediate on UMAP: Doublets → Notebook 1
    │   └─ Within cell type region: Under-clustering → Notebook 2
    └─ All clusters same type:
        ├─ Biological: ✅ Done (homogeneous sample)
        └─ Technical: Add subtype markers OR decrease resolution
```

---

#### **💡 What Makes Good Annotations**

**Confidence indicators:**
1. **Spatial coherence** - same cell type clusters together on UMAP
2. **Marker expression** - markers specifically expressed in assigned cell types
3. **Biological plausibility** - cell type proportions make sense
4. **Cluster purity** - each cluster mostly one cell type (>70%)

**Next steps after annotation:**
1. Validate with known markers (plot specific genes on UMAP)
2. Check differential expression between conditions
3. Investigate cell type-specific changes
4. Export for downstream analysis

## Stage 9: Final Export

Save annotated data and metadata for downstream analysis.

In [ ]:
# Store parameters used (create the dict if earlier notebooks did not)
params = adata.uns.setdefault('pipeline_params', {})
params['notebook'] = '3_annotation_export'
params['annotation'] = dict(ANNOTATION_PARAMS)
params['marker_genes'] = {k: list(v) for k, v in MARKER_GENES.items()}

# Save annotated data
output_file = 'outputs/annotated_data.h5ad'
adata.write(output_file)

print("\n" + "="*60)
print("SAVING FINAL OUTPUT")
print("="*60)
print(f"✓ Saved: {output_file}")
print(f"  Size: {Path(output_file).stat().st_size / 1e6:.1f} MB")
print(f"  Cells: {adata.n_obs:,}")
print(f"  Genes: {adata.n_vars:,}")

In [ ]:
# Export cell metadata to CSV
metadata_cols = [
    'leiden',
    'celltype',
    'orig.ident',
    'Genotype',
    'Sex',
    'n_genes_by_counts',
    'total_counts',
    'percent_mt',
    'doublet_score',
]

# Only include columns that exist
existing_cols = [c for c in metadata_cols if c in adata.obs.columns]
adata.obs[existing_cols].to_csv('outputs/cell_metadata.csv')

print(f"✓ Saved: outputs/cell_metadata.csv")
print(f"  Columns: {existing_cols}")

In [ ]:
# Create analysis summary
summary_data = {
    'Metric': [
        'Total cells',
        'Total genes',
        'Clusters',
        'Cell types',
        'Median genes/cell',
        'Median UMIs/cell',
        'Median MT%',
        'Samples',
    ],
    'Value': [
        f"{adata.n_obs:,}",
        f"{adata.n_vars:,}" if adata.raw is None else f"{adata.raw.n_vars:,}",
        adata.obs['leiden'].nunique(),
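        # Count assigned labels only (ignores missing values and any 'Unlabeled' placeholder)
        adata.obs['celltype'][adata.obs['celltype'] != 'Unlabeled'].nunique(),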
        f"{adata.obs['n_genes_by_counts'].median():.0f}",
        f"{adata.obs['total_counts'].median():.0f}",
        f"{adata.obs['percent_mt'].median():.2f}%",
        adata.obs['orig.ident'].nunique(),
    ]
}

summary_df = pd.DataFrame(summary_data)
summary_df.to_csv('outputs/analysis_summary.csv', index=False)

print("\n" + "="*60)
print("ANALYSIS SUMMARY")
print("="*60)
display(summary_df)

# Cell type distribution
print("\n" + "="*60)
print("CELL TYPE DISTRIBUTION")
print("="*60)
celltype_dist = adata.obs['celltype'].value_counts().sort_index()
print(celltype_dist)

# Save cell type counts
celltype_dist.to_csv('outputs/celltype_counts.csv', header=['count'])
print("\n✓ Saved: outputs/celltype_counts.csv")

# Cells per sample
print("\n" + "="*60)
print("CELLS PER SAMPLE")
print("="*60)
sample_counts = adata.obs['orig.ident'].value_counts().sort_index()
print(sample_counts)

## Pipeline Complete! 🎉

### Output Files

**Main outputs:**
- `outputs/annotated_data.h5ad` - Annotated AnnData object (FINAL)
- `outputs/cell_metadata.csv` - Cell-level metadata
- `outputs/celltype_counts.csv` - Cell type distribution
- `outputs/analysis_summary.csv` - Pipeline summary statistics

**Plots:**
- `plots/notebook3/marker_genes_dotplot.png` - Marker gene expression
- `plots/notebook3/celltype_umap.png` - Cell types on UMAP
- `plots/notebook3/composition_heatmap.png` - Cluster composition
- `plots/notebook3/celltype_distribution.png` - Cell types per sample

### Next Steps

**Downstream analysis:**
1. **Differential expression** between conditions (e.g., E3 vs E4, Ctrl vs GENUS)
2. **Cell type proportions** analysis across groups
3. **Trajectory analysis** for developmental/temporal data
4. **Gene regulatory network** inference
5. **Integration** with other datasets
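
As a starting point for item 2, per-sample cell type proportions can be derived from the exported metadata. This is a minimal sketch on a toy table; `celltype_proportions` is a hypothetical helper, and the column names follow those used in this notebook (`orig.ident`, `celltype`).

```python
import pandas as pd


def celltype_proportions(obs, sample_col, celltype_col="celltype"):
    """Return a samples x cell types table of within-sample proportions."""
    counts = pd.crosstab(obs[sample_col], obs[celltype_col])
    # Divide each row by its total so every sample sums to 1
    return counts.div(counts.sum(axis=1), axis=0)


# Toy metadata standing in for adata.obs or outputs/cell_metadata.csv
toy_obs = pd.DataFrame({
    "orig.ident": ["s1", "s1", "s1", "s1", "s2", "s2"],
    "celltype": ["Excit", "Excit", "Inhib", "Astro", "Excit", "Astro"],
})

props = celltype_proportions(toy_obs, "orig.ident")
print(props.round(2))
```

Proportions are compositional (an increase in one type lowers the others), so group comparisons usually call for dedicated statistical methods rather than independent per-type tests.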

**Quality validation:**
1. Plot known markers on UMAP: `sc.pl.umap(adata, color=['Snap25', 'Gad1', 'Gfap'])`
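2. Check cell type-specific markers: `sc.tl.rank_genes_groups(adata, 'celltype', key_added='rank_genes_celltype')` (a separate key keeps the cluster-level results from Notebook 2)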
3. Validate doublet removal: `sc.pl.umap(adata, color='doublet_score')`

**Further refinement:**
- Re-cluster neuronal subtypes (excitatory/inhibitory)
- Identify rare cell populations
- Apply batch correction if needed

### Pipeline Parameters Used

All parameters are stored in `adata.uns['pipeline_params']` for reproducibility.

```python
# To access parameters:
print(adata.uns['pipeline_params'])
```
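
A plain-text copy of the parameters can be convenient for version control or for readers who do not want to load the `.h5ad` file. The sketch below assumes a JSON file next to the other outputs; `save_params_json` and the file name `pipeline_params.json` are hypothetical and not produced by the pipeline.

```python
import json
import tempfile
from pathlib import Path


def save_params_json(params, path):
    """Write a parameter dictionary to a JSON file and return the path."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w") as fh:
        # default=str is a fallback for values json cannot serialize natively
        # (for example NumPy arrays after reloading an .h5ad file)
        json.dump(params, fh, indent=2, sort_keys=True, default=str)
    return path


# Toy parameters mirroring the structure of ANNOTATION_PARAMS
toy_params = {
    "annotation": {"label_mode": "cell", "margin": 0.05, "cluster_agg": "median"},
}

with tempfile.TemporaryDirectory() as tmp:
    out = save_params_json(toy_params, Path(tmp) / "pipeline_params.json")
    restored = json.loads(out.read_text())

print(restored == toy_params)  # True
```

In the notebook, the call might look like `save_params_json(adata.uns['pipeline_params'], 'outputs/pipeline_params.json')`.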