# Notebook 3: Data Processing and Clustering with `scanpy`

**Tutor:** Anthony Christidis
**Time:** 45 minutes

---

Welcome to the computational analysis part of the workshop! Before we dive into advanced spatial statistics, we must first process our raw gene expression data to identify meaningful biological groupings. This is a fundamental step in almost any single-cell or spatial analysis.

In this notebook, we'll use `scanpy` to perform a standard clustering analysis and `matplotlib` for robust QC plotting. We will first learn the workflow in detail on a **10x Visium** dataset, and then apply the same principles to prepare our **Xenium** data for the next notebook.

**Goals:**
1.  Perform a comprehensive Quality Control (QC) workflow on Visium data.
2.  Run a standard unsupervised clustering workflow (`scanpy`).
3.  Visualize the final spot clusters on the tissue, confirming that our analysis reveals underlying biology.
4.  Apply a streamlined workflow to process our Xenium data.

### Setup

First, we'll import our libraries and load the Visium Glioblastoma dataset.

In [None]:
%load_ext jupyter_black

import spatialdata as sd
import scanpy as sc
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.lines import Line2D

import warnings
warnings.filterwarnings("ignore")

sdata_visium = sd.read_zarr("../data/visium_glioblastoma_subset.zarr")
adata_visium = sdata_visium.tables["table"]

### Part 1: Visium Analysis - Spatial Quality Control

For spot-based data like Visium, visualizing QC metrics spatially is a critical first step. It can reveal technical issues like tissue detachment or slide artifacts.

First, we calculate standard QC metrics, such as the number of genes detected per spot (`n_genes_by_counts`) and the total number of transcripts per spot (`total_counts`).

In [None]:
sc.pp.calculate_qc_metrics(adata_visium, percent_top=(20, 50), inplace=True)
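
To make the two headline metrics concrete, here is a minimal sketch that computes them by hand on a made-up count matrix of 3 spots by 4 genes. The matrix and the helper name `qc_metrics` are illustrative assumptions, not part of the workshop data or of `scanpy`; plain Python lists are used so that each operation is visible.

```python
# Toy count matrix: rows are spots, columns are genes (values are made up).
toy_counts = [
    [5, 0, 3, 0],
    [0, 0, 1, 0],
    [2, 7, 4, 1],
]


def qc_metrics(counts):
    """Return (total_counts, n_genes_by_counts) for each row of a count matrix."""
    # total_counts: sum of all transcripts observed in the spot.
    total_counts = [sum(row) for row in counts]
    # n_genes_by_counts: number of genes with at least one transcript in the spot.
    n_genes_by_counts = [sum(1 for value in row if value > 0) for row in counts]
    return total_counts, n_genes_by_counts


total_counts, n_genes_by_counts = qc_metrics(toy_counts)
print(total_counts)  # [8, 1, 14]
print(n_genes_by_counts)  # [2, 1, 4]
```

The second toy spot has a single transcript from a single gene, which is the kind of spot the filtering step later removes.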

Now, we will create spatial scatter plots to visualize these QC metrics. This allows us to see if low-quality spots are concentrated in a specific area, which might indicate a problem with the tissue section.

In [None]:
# Get the spatial coordinates directly from the AnnData object
coords = adata_visium.obsm['spatial']
x_coords = coords[:, 0]
y_coords = coords[:, 1]

# Create the two-panel plot
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Plot 1: Total counts
scatter1 = axes[0].scatter(x_coords, y_coords, c=adata_visium.obs['total_counts'], 
                          cmap='viridis', s=25, alpha=0.8)
axes[0].set_title('Total Counts per Spot')
axes[0].set_aspect('equal')
plt.colorbar(scatter1, ax=axes[0])

# Plot 2: Number of genes
scatter2 = axes[1].scatter(x_coords, y_coords, c=adata_visium.obs['n_genes_by_counts'], 
                          cmap='viridis', s=25, alpha=0.8)
axes[1].set_title('Unique Genes per Spot')
axes[1].set_aspect('equal')
plt.colorbar(scatter2, ax=axes[1])

plt.tight_layout()
plt.show()

These plots are essential. We can see clear spatial patterns in both the total counts and the number of genes, which likely correspond to different biological regions within the glioblastoma tissue.

### Part 2: Visium Analysis - The `scanpy` Clustering Workflow

Based on our QC, let's filter out the lowest-quality spots and then run the standard `scanpy` workflow to find transcriptionally distinct groups of spots. Each step is broken down into its own cell for clarity.

#### Step 2.1: Filtering
We remove spots with very few counts and genes that are detected in very few spots. This reduces noise in our data.

In [None]:
print(f"Spots before filtering: {adata_visium.n_obs}")
sc.pp.filter_cells(adata_visium, min_counts=500)
sc.pp.filter_genes(adata_visium, min_cells=10)
print(f"Spots after filtering: {adata_visium.n_obs}")
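
The two filters above can be sketched on toy data as follows. The count matrix and the thresholds `MIN_COUNTS` and `MIN_SPOTS` are hypothetical values chosen for the example; the sketch only mirrors the idea of keeping spots with enough counts and then keeping genes seen in enough of the remaining spots.

```python
# Toy count matrix: rows are spots, columns are genes (values are made up).
toy_counts = [
    [5, 0, 3, 0],
    [0, 0, 1, 0],
    [2, 7, 4, 1],
    [6, 1, 0, 0],
]
MIN_COUNTS = 5  # hypothetical per-spot threshold for this toy example
MIN_SPOTS = 2  # hypothetical per-gene threshold for this toy example

# Step 1: keep spots whose total count reaches MIN_COUNTS.
kept_spots = [row for row in toy_counts if sum(row) >= MIN_COUNTS]

# Step 2: keep genes detected in at least MIN_SPOTS of the remaining spots.
n_genes = len(toy_counts[0])
kept_gene_idx = [
    g
    for g in range(n_genes)
    if sum(1 for row in kept_spots if row[g] > 0) >= MIN_SPOTS
]

filtered = [[row[g] for g in kept_gene_idx] for row in kept_spots]
print(filtered)  # [[5, 0, 3], [2, 7, 4], [6, 1, 0]]
```

Note that the order matters: the gene filter is evaluated on the spots that survived the first filter, just as in the cell above.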

#### Step 2.2: Normalization and Log-Transformation
This step corrects for differences in sequencing depth between spots, ensuring that we are comparing their relative gene expression profiles.

In [None]:
sc.pp.normalize_total(adata_visium, inplace=True)
sc.pp.log1p(adata_visium)
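
As a rough illustration of what these two calls do, the sketch below rescales each row of a made-up count matrix to a common total and then applies `log(1 + x)`. The helper names shadow the `scanpy` functions only for readability and are not the library's implementation. When no `target_sum` is given, the sketch uses the median of the per-spot totals, which matches the documented default of `sc.pp.normalize_total`. It assumes every spot has a non-zero total, which holds after filtering.

```python
import math
from statistics import median

# Toy count matrix: rows are spots, columns are genes (values are made up).
toy_counts = [
    [10, 0, 30],  # total 40
    [1, 1, 2],  # total 4
    [5, 5, 10],  # total 20
]


def normalize_total(counts, target_sum=None):
    """Scale every row so that it sums to target_sum (median total by default)."""
    totals = [sum(row) for row in counts]
    if target_sum is None:
        target_sum = median(totals)
    return [
        [value * target_sum / total for value in row]
        for row, total in zip(counts, totals)
    ]


def log1p(matrix):
    """Apply log(1 + x) element-wise, which keeps zeros at zero."""
    return [[math.log1p(value) for value in row] for row in matrix]


normalized = normalize_total(toy_counts)
logged = log1p(normalized)
print(normalized[1], normalized[2])  # both rows become [5.0, 5.0, 10.0]
```

The second and third toy spots differ five-fold in raw depth but have the same relative profile, so they become identical after normalization.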

#### Step 2.3: Finding Highly Variable Genes (HVGs)
We don't need to use all ~18,000 genes for clustering. We can identify the genes that show the most biological variability across the tissue and focus our analysis on them. This reduces computational time and often improves results.

In [None]:
sc.pp.highly_variable_genes(adata_visium)
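
The idea behind HVG selection can be sketched by ranking genes on a simple dispersion score (variance divided by mean). This is a deliberate simplification: the default `scanpy` flavor additionally groups genes by mean expression and standardizes dispersion within each group, which is omitted here. The gene names and values are made up for illustration.

```python
from statistics import mean, variance

# Toy normalized expression per gene across four spots (values are made up).
toy_expression = {
    "GENE_A": [1.0, 1.0, 1.0, 1.0],  # flat: no variability
    "GENE_B": [0.0, 4.0, 0.0, 4.0],  # strongly variable
    "GENE_C": [2.0, 3.0, 2.0, 3.0],  # mildly variable
}


def dispersion(values):
    """Variance-to-mean ratio; genes with zero mean get a score of 0."""
    m = mean(values)
    return variance(values) / m if m > 0 else 0.0


# Rank genes from most to least dispersed and keep the top two.
ranked = sorted(
    toy_expression, key=lambda gene: dispersion(toy_expression[gene]), reverse=True
)
top_genes = ranked[:2]
print(top_genes)  # ['GENE_B', 'GENE_C']
```

The flat gene carries no information for separating spots, so it is the one left out.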

#### Step 2.4: Principal Component Analysis (PCA)
PCA is a dimensionality reduction technique. We use it to summarize the main axes of variation in our highly variable genes into a smaller number of principal components (PCs).

In [None]:
sc.pp.pca(adata_visium, use_highly_variable=True)

#### Step 2.5: Neighborhood Graph and Leiden Clustering
Next, we build a graph where each spot is a node, and nodes are connected if they are similar to each other in PCA space. The Leiden algorithm then partitions this graph into communities of spots that are densely connected to one another. These communities are our spot clusters. Because each Visium spot can capture several cells, a cluster reflects a shared expression profile rather than a single cell type.

In [None]:
sc.pp.neighbors(adata_visium)
sc.tl.leiden(adata_visium, key_added="clusters")
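
To illustrate the graph-building half of this step, here is a minimal sketch of an exact k-nearest-neighbor graph on made-up 2D coordinates standing in for PCA space. The spot names, coordinates, and the helper `knn_graph` are assumptions for the example. The sketch leaves out the connectivity weights that `sc.pp.neighbors` computes and does not implement Leiden itself; it only shows why similar spots end up linked.

```python
import math

# Toy "PCA" coordinates: two well-separated groups of three spots each.
toy_pcs = {
    "spot_1": (0.0, 0.0),
    "spot_2": (0.1, 0.0),
    "spot_3": (0.0, 0.2),
    "spot_4": (5.0, 5.0),
    "spot_5": (5.1, 5.0),
    "spot_6": (5.0, 5.3),
}


def knn_graph(points, k):
    """Map each point to the names of its k nearest neighbors (Euclidean)."""
    graph = {}
    for name, coords in points.items():
        distances = sorted(
            (math.dist(coords, other_coords), other)
            for other, other_coords in points.items()
            if other != name
        )
        graph[name] = [other for _, other in distances[:k]]
    return graph


graph = knn_graph(toy_pcs, k=2)
print(graph["spot_1"])  # ['spot_2', 'spot_3']
print(graph["spot_4"])  # ['spot_5', 'spot_6']
```

No edge crosses between the two groups, so a community detection method run on this graph would be expected to recover them as separate clusters.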

#### Step 2.6: UMAP for Visualization
Finally, we compute a UMAP (Uniform Manifold Approximation and Projection). This takes our high-dimensional neighborhood graph and creates a 2D representation of it, which is useful for visualizing the relationships between our clusters.

In [None]:
sc.tl.umap(adata_visium)

### Part 3: Visualizing the Visium Results
Let's visualize the clusters we found, both in the abstract UMAP space and back on the tissue.

In [None]:
sc.pl.umap(adata_visium, color="clusters", title="Spot Clusters (UMAP)")

In [None]:
# To plot spatially, we'll use our robust matplotlib method again.

# Get coordinates and cluster info
coords = adata_visium.obsm['spatial']
x_coords = coords[:, 0]
y_coords = coords[:, 1]
cluster_codes = adata_visium.obs['clusters'].astype('category').cat.codes

# Create the plot
fig, ax = plt.subplots(1, 1, figsize=(8, 6))
scatter = ax.scatter(x_coords, y_coords, c=cluster_codes, 
                    cmap='tab10', s=25, alpha=0.8)
ax.set_title('Spot Clusters (Spatial View)')
ax.set_aspect('equal')

# Add a legend
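# Use the category order (not order of appearance) so legend colors match the integer codes plotted above
unique_clusters = adata_visium.obs['clusters'].astype('category').cat.categories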
colors = plt.cm.tab10(np.linspace(0, 1, len(unique_clusters)))
legend_elements = [Line2D([0], [0], marker='o', color='w', markerfacecolor=colors[i], 
                         markersize=8, label=f'Cluster {cluster}') 
                  for i, cluster in enumerate(unique_clusters)]
ax.legend(handles=legend_elements, bbox_to_anchor=(1.05, 1), loc='upper left')

plt.tight_layout()
plt.show()

### Optional: Interactive QC Visualization with `napari`

While static plots are great for a quick overview, we can use `napari` for a more dynamic exploration of our QC metrics. This allows us to zoom in on specific areas and see how metrics vary across the tissue.

*(Instructor Note: I will run this live. For those using the workshop's Docker container, this requires the graphics server setup detailed in the README. You can simply watch my screen for this brief demonstration.)*

In [None]:
import napari_spatialdata as nsd

# We launch napari with our sdata_visium object.
# The QC metrics we calculated are already in its associated table.
# viewer = nsd.Interactive(sdata_visium)

# In the napari window that opens, you can now:
# 1. Add the hires_image and the shapes layer.
# 2. In the annotation panel on the right, under "Observation", select 'total_counts'
#    or 'n_genes_by_counts' to color the spots by these QC metrics.

### Part 4: Comprehensive Xenium Analysis

Now we will apply the scanpy workflow we learned on Visium data to our high-resolution **Xenium data**. Xenium provides single-cell resolution spatial transcriptomics, allowing us to map individual cell types and their interactions within tissue architecture.

We will perform the complete analysis pipeline from quality control to clustering, demonstrating how computational analysis can reveal biological insights about cellular organization.

#### Setup and Data Loading
We'll start by loading our libraries and the raw Xenium dataset. This ensures our analysis is completely self-contained and reproducible.

In [None]:
import squidpy as sq
import seaborn as sns

print("--- Loading Raw Xenium Data ---")
# Load the complete SpatialData object
sdata_xenium = sd.read_zarr("../data/xenium_lung_cancer_subset.zarr")
# Extract the gene expression table for analysis
adata_xenium = sdata_xenium.tables["table"].copy()

print(f"Loaded {adata_xenium.n_obs} cells with {adata_xenium.n_vars} genes")

#### Step 4.1: Calculate QC Metrics
We'll compute standard single-cell QC metrics including total transcript counts per cell and the number of unique genes detected. These metrics help us identify high-quality cells for downstream analysis.



In [None]:
# Calculate comprehensive QC metrics
# percent_top tracks the percentage of counts from the most highly expressed genes
sc.pp.calculate_qc_metrics(adata_xenium, percent_top=(20, 50), inplace=True)

print("QC metrics calculated successfully")
print(f"New columns in adata.obs: {[col for col in adata_xenium.obs.columns if 'counts' in col or 'genes' in col]}")
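
The `percent_top` metric can be sketched by hand: for one cell, sort the gene counts, sum the largest `n`, and express that as a percentage of the cell's total. The helper name `pct_counts_in_top_n` and the toy counts are assumptions for illustration; `scanpy` is expected to store the corresponding values in columns named like `pct_counts_in_top_20_genes`.

```python
def pct_counts_in_top_n(row, n):
    """Percentage of a cell's counts that come from its n most expressed genes."""
    total = sum(row)
    if total == 0:
        return 0.0  # avoid dividing by zero for empty cells
    top_counts = sorted(row, reverse=True)[:n]
    return 100.0 * sum(top_counts) / total


# Toy cell with five genes (values are made up); the top two genes dominate.
toy_cell = [50, 30, 10, 5, 5]
print(pct_counts_in_top_n(toy_cell, 2))  # 80.0
```

A value close to 100 means a handful of genes account for nearly all of a cell's transcripts, which can flag low-complexity cells.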

#### Step 4.2: Visualize QC Distributions
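Plotting the distribution of QC metrics helps us make informed decisions about filtering thresholds. We expect to see a main population of high-quality cells and potentially some outliers, such as cells with very few transcripts or unusually large cells that may reflect segmentation errors. Because Xenium is imaging-based, these outliers come from detection and segmentation rather than from empty droplets.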



In [None]:
# Create comprehensive QC visualizations
# We'll create a 2x2 grid of plots to see all our key metrics
fig, axes = plt.subplots(2, 2, figsize=(12, 8))

# Total transcript counts per cell
sns.histplot(adata_xenium.obs["total_counts"], kde=False, bins=100, ax=axes[0, 0])
axes[0, 0].set_title("Total Transcripts per Cell")
axes[0, 0].set_xlabel("Total Counts")
axes[0, 0].set_ylabel("Number of Cells")

# Number of unique genes detected per cell
sns.histplot(adata_xenium.obs["n_genes_by_counts"], kde=False, bins=100, ax=axes[0, 1])
axes[0, 1].set_title("Unique Genes per Cell")
axes[0, 1].set_xlabel("Number of Genes")
axes[0, 1].set_ylabel("Number of Cells")

# Area of the segmented cells
sns.histplot(adata_xenium.obs["cell_area"], kde=False, bins=100, ax=axes[1, 0])
axes[1, 0].set_title("Area of Segmented Cells")
axes[1, 0].set_xlabel("Area (pixels²)")
axes[1, 0].set_ylabel("Number of Cells")

# Ratio of nucleus area to cell area
nucleus_ratio = adata_xenium.obs["nucleus_area"] / adata_xenium.obs["cell_area"]
sns.histplot(nucleus_ratio, kde=False, bins=100, ax=axes[1, 1])
axes[1, 1].set_title("Nucleus-to-Cell Area Ratio")
axes[1, 1].set_xlabel("Ratio")
axes[1, 1].set_ylabel("Number of Cells")

plt.tight_layout()
plt.show()

# Print summary statistics to inform filtering decisions
print(f"Median total counts per cell: {adata_xenium.obs['total_counts'].median():.0f}")
print(f"Median genes per cell: {adata_xenium.obs['n_genes_by_counts'].median():.0f}")

#### Step 4.3: Apply Quality Filters
Based on the QC distributions, we'll filter out low-quality cells and rarely detected genes. This removes technical noise while preserving biological signal.



In [None]:
# Apply cell and gene filters based on QC metrics
n_cells_before = adata_xenium.n_obs  # remember the starting size for the summary below
print(f"Starting with: {n_cells_before} cells, {adata_xenium.n_vars} genes")

# Remove cells with very few transcripts (likely empty or poorly segmented cells)
sc.pp.filter_cells(adata_xenium, min_counts=50)
print(f"After cell filtering: {adata_xenium.n_obs} cells")

# Remove genes detected in very few cells (reduces noise)
sc.pp.filter_genes(adata_xenium, min_cells=10)
print(f"After gene filtering: {adata_xenium.n_obs} cells, {adata_xenium.n_vars} genes")

# Compute the difference from the stored count instead of a hard-coded number
print(f"Filtered out {n_cells_before - adata_xenium.n_obs} low-quality cells")

#### Step 4.4: Normalization and Log Transformation
Normalization accounts for differences in total transcript counts between cells (for example, due to cell size or detection efficiency), ensuring we compare relative gene expression rather than absolute counts. Log transformation stabilizes variance across expression levels.



In [None]:
# Normalize to 10,000 transcripts per cell and log-transform
# This makes cells comparable despite different total transcript counts
sc.pp.normalize_total(adata_xenium, target_sum=1e4)
sc.pp.log1p(adata_xenium)

print("Normalization and log-transformation complete")

#### Step 4.5: Feature Selection and Dimensionality Reduction
We identify highly variable genes (HVGs) that capture the most biological variation, then use PCA to reduce dimensionality while preserving the main patterns in gene expression.



In [None]:
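# Identify the most informative genes for clustering
# (capped at the panel size, since a targeted Xenium panel may contain fewer than 2,000 genes)
sc.pp.highly_variable_genes(
    adata_xenium, n_top_genes=min(2000, adata_xenium.n_vars), flavor='seurat'
)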
print(f"Selected {adata_xenium.var['highly_variable'].sum()} highly variable genes")

# Reduce dimensionality using Principal Component Analysis
sc.pp.pca(adata_xenium, use_highly_variable=True)
print("PCA complete")

#### Step 4.6: Neighborhood Graph and Clustering
We build a neighborhood graph connecting transcriptionally similar cells, then use the Leiden algorithm to identify clusters of cells with shared expression patterns. These clusters often correspond to distinct cell types or states.



In [None]:
# Build a k-nearest-neighbor graph in PCA space
sc.pp.neighbors(adata_xenium, n_neighbors=15)

# Identify cell clusters using the Leiden algorithm
sc.tl.leiden(adata_xenium, key_added="clusters", resolution=0.5)

# Compute UMAP for visualization
sc.tl.umap(adata_xenium)

n_clusters = len(adata_xenium.obs['clusters'].unique())
print(f"Identified {n_clusters} distinct cell clusters")

#### Step 4.7: Visualize Clustering Results
UMAP provides a 2D representation of our high-dimensional data, allowing us to visualize how well our clustering algorithm separated different cell populations.



In [None]:
# Visualize clusters in UMAP space
sc.pl.umap(adata_xenium, color="clusters", 
           title=f"Xenium Cell Clusters (n={n_clusters})",
           legend_loc="on data", legend_fontsize=8)

print(f"Processing complete! Identified {n_clusters} clusters from {adata_xenium.n_obs} high-quality cells")

#### Optional: Save Processed Data
For demonstration purposes, here's how you can save your processed data for future analysis:



In [None]:
# Save the processed AnnData object
import os
os.makedirs("../data/processed", exist_ok=True)

print("Saving processed Xenium AnnData object...")
adata_xenium.write("../data/processed/adata_xenium_processed.h5ad")
print("Saved! This data is now ready for advanced spatial analysis.")

This comprehensive Xenium analysis demonstrates the power of the scanpy workflow for single-cell spatial data. In the next notebook, we'll use this processed data to explore spatial statistics and cell-cell interactions!








