# Notebook 4: Xenium Analysis and Spatial Statistics

**Tutor:** Anthony Christidis
**Time:** 45 minutes

---

Now we will apply the `scanpy` workflow we learned on Visium data to our high-resolution **Xenium data**. Xenium provides single-cell resolution spatial transcriptomics, allowing us to map individual cell types and their interactions within tissue architecture.

We will perform the complete analysis pipeline from quality control to advanced spatial statistics, demonstrating how computational analysis can reveal biological insights about tissue organization.

**Goals:**
1.  Perform comprehensive QC and clustering on the Xenium dataset.
2.  Identify spatially organized genes using **Moran's I**.
3.  Analyze cell community structures using **neighborhood enrichment**.

### Setup and Data Loading

We'll start by loading our libraries and the raw Xenium dataset. This ensures our analysis is completely self-contained and reproducible.

In [None]:
%load_ext jupyter_black

import spatialdata as sd
import scanpy as sc
import squidpy as sq
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from matplotlib.lines import Line2D

# Keep output clean for the workshop
import warnings
warnings.filterwarnings("ignore")

print("--- Loading Raw Xenium Data ---")
# Load the complete SpatialData object
sdata_xenium = sd.read_zarr("../data/xenium_lung_cancer_subset.zarr")
# Extract the gene expression table for analysis
adata_xenium = sdata_xenium.tables["table"].copy()

print(f"Loaded {adata_xenium.n_obs} cells with {adata_xenium.n_vars} genes")

### Part 1: Quality Control and Visualization

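Quality control is crucial for spatial data. Xenium is an imaging-based platform, so the artifacts to look for are not empty droplets but segmentation problems: cells with very few assigned transcripts, fragments of cells, and segments that merge two neighboring cells. For Xenium data, we therefore focus on transcript counts per cell and the number of genes detected per cell.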

#### Step 1.1: Calculate QC Metrics

We'll compute standard single-cell QC metrics including total transcript counts per cell and the number of unique genes detected. These metrics help us identify high-quality cells for downstream analysis.

In [None]:
# Calculate comprehensive QC metrics
# percent_top tracks the percentage of counts from the most highly expressed genes
sc.pp.calculate_qc_metrics(adata_xenium, percent_top=(20, 50), inplace=True)

print("QC metrics calculated successfully")
print(f"New columns in adata.obs: {[col for col in adata_xenium.obs.columns if 'counts' in col or 'genes' in col]}")
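
To make the two main QC columns concrete, here is a minimal NumPy sketch on a made-up 3-cell by 4-gene count matrix (toy values, not Xenium data). It only illustrates what `total_counts` and `n_genes_by_counts` represent; it is not how `scanpy` computes them internally.

```python
import numpy as np

# Toy count matrix: rows = cells, columns = genes (hypothetical values)
counts = np.array(
    [
        [5, 0, 2, 0],
        [0, 0, 1, 0],
        [3, 4, 0, 7],
    ]
)

# Total transcripts assigned to each cell
total_counts = counts.sum(axis=1)

# Number of genes with at least one transcript in each cell
n_genes_by_counts = (counts > 0).sum(axis=1)

print("total_counts:", total_counts)
print("n_genes_by_counts:", n_genes_by_counts)
```

The second cell has a single transcript from a single gene, which is the kind of cell a `min_counts` filter would remove.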

#### Step 1.2: Visualize QC Distributions

Plotting the distribution of QC metrics helps us make informed decisions about filtering thresholds. We expect to see a main population of high-quality cells plus some outliers: a low-count tail from poorly segmented or fragmentary cells and, possibly, a high-count tail from segments that merge neighboring cells.

In [None]:
# Create comprehensive QC visualizations
fig, axs = plt.subplots(1, 2, figsize=(12, 4))

# Total transcript counts per cell
sns.histplot(adata_xenium.obs["total_counts"], kde=False, bins=100, ax=axs[0])
axs[0].set_title("Total Transcripts per Cell")
axs[0].set_xlabel("Total Counts")
axs[0].set_ylabel("Number of Cells")

# Number of unique genes detected per cell
sns.histplot(adata_xenium.obs["n_genes_by_counts"], kde=False, bins=100, ax=axs[1])
axs[1].set_title("Unique Genes per Cell")
axs[1].set_xlabel("Number of Genes")
axs[1].set_ylabel("Number of Cells")

plt.tight_layout()
plt.show()

# Print summary statistics to inform filtering decisions
print(f"Median total counts per cell: {adata_xenium.obs['total_counts'].median():.0f}")
print(f"Median genes per cell: {adata_xenium.obs['n_genes_by_counts'].median():.0f}")

#### Step 1.3: Apply Quality Filters

Based on the QC distributions, we'll filter out low-quality cells and rarely detected genes. This removes technical noise while preserving biological signal.

In [None]:
# Apply cell and gene filters based on QC metrics
# Record the starting cell count so the summary below is not hardcoded
n_cells_start = adata_xenium.n_obs
print(f"Starting with: {n_cells_start} cells, {adata_xenium.n_vars} genes")

# Remove cells with very few transcripts (likely segmentation fragments or background)
sc.pp.filter_cells(adata_xenium, min_counts=50)
print(f"After cell filtering: {adata_xenium.n_obs} cells")

# Remove genes detected in very few cells (reduces noise)
sc.pp.filter_genes(adata_xenium, min_cells=10)
print(f"After gene filtering: {adata_xenium.n_obs} cells, {adata_xenium.n_vars} genes")

print(f"Filtered out {n_cells_start - adata_xenium.n_obs} low-quality cells")

### Part 2: Preprocessing and Clustering

Now we'll run the standard `scanpy` preprocessing pipeline to normalize the data and identify transcriptionally distinct cell populations. This workflow is fundamental to most single-cell analyses.

#### Step 2.1: Normalization and Log Transformation

Normalization accounts for differences in total transcript counts between cells (the imaging-based analogue of sequencing depth), ensuring we compare relative gene expression rather than absolute counts. Log transformation stabilizes variance across expression levels.

In [None]:
# Normalize to 10,000 transcripts per cell and log-transform
# This makes cells comparable despite different total transcript counts
sc.pp.normalize_total(adata_xenium, target_sum=1e4)
sc.pp.log1p(adata_xenium)

print("Normalization and log-transformation complete")
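
As a rough illustration of what these two calls do, the sketch below applies the same arithmetic with plain NumPy to a toy matrix (hypothetical values). The two toy cells have identical gene proportions but a tenfold difference in total counts, so they should become identical after normalization.

```python
import numpy as np

# Toy counts: two cells with the same proportions but different totals
counts = np.array(
    [
        [10.0, 30.0, 60.0],  # 100 transcripts in total
        [1.0, 3.0, 6.0],  # 10 transcripts in total
    ]
)
target_sum = 1e4

# Scale each cell so that its counts sum to target_sum
totals = counts.sum(axis=1, keepdims=True)
normalized = counts / totals * target_sum

# log(1 + x) compresses the dynamic range and keeps zeros at zero
log_normalized = np.log1p(normalized)

print(normalized)
print(log_normalized.round(3))
```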

#### Step 2.2: Feature Selection and Dimensionality Reduction

We identify highly variable genes (HVGs) that capture the most biological variation, then use PCA to reduce dimensionality while preserving the main patterns in gene expression.

In [None]:
# Identify the most informative genes for clustering
sc.pp.highly_variable_genes(adata_xenium, n_top_genes=2000, flavor='seurat')
print(f"Selected {adata_xenium.var['highly_variable'].sum()} highly variable genes")

# Reduce dimensionality using Principal Component Analysis
sc.pp.pca(adata_xenium, use_highly_variable=True)
print("PCA complete")
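
For intuition about what PCA returns, here is a small sketch that computes principal components with a singular value decomposition on simulated data. The data, the loadings, and the noise level are all made up for illustration, and `scanpy` may use a different solver internally.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: 200 "cells" x 5 "genes", with most variance along one direction
latent = rng.normal(size=(200, 1))
loadings = np.array([[3.0, 2.0, 1.0, 0.0, 0.0]])
X = latent @ loadings + 0.1 * rng.normal(size=(200, 5))

# Center each gene, then decompose
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Cell coordinates in PC space (the analogue of adata.obsm["X_pca"])
scores = U * S

# Fraction of total variance captured by each component
explained_variance_ratio = S**2 / np.sum(S**2)

print(explained_variance_ratio.round(3))
```

Because the simulated signal lies along a single direction, the first component should capture almost all of the variance.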

#### Step 2.3: Neighborhood Graph and Clustering

We build a neighborhood graph connecting transcriptionally similar cells, then use the Leiden algorithm to identify clusters of cells with shared expression patterns. These clusters often correspond to distinct cell types or states.

In [None]:
# Build a k-nearest-neighbor graph in PCA space
sc.pp.neighbors(adata_xenium, n_neighbors=15)

# Identify cell clusters using the Leiden algorithm
sc.tl.leiden(adata_xenium, key_added="clusters", resolution=0.5)

# Compute UMAP for visualization
sc.tl.umap(adata_xenium)

n_clusters = len(adata_xenium.obs['clusters'].unique())
print(f"Identified {n_clusters} distinct cell clusters")

#### Step 2.4: Visualize Clustering Results

UMAP provides a 2D representation of our high-dimensional data, allowing us to visualize how well our clustering algorithm separated different cell populations.

In [None]:
# Visualize clusters in UMAP space
sc.pl.umap(adata_xenium, color="clusters", 
           title=f"Xenium Cell Clusters (n={n_clusters})",
           legend_loc="on data", legend_fontsize=8)

print(f"Processing complete! Identified {n_clusters} clusters from {adata_xenium.n_obs} high-quality cells")

### Part 3: Spatial Analysis Setup

To perform spatial statistics, we need to link our processed gene expression data back to the physical coordinates of cells in the tissue. This requires careful coordination between the filtered cell data and the original spatial information.

#### Step 3.1: Add Spatial Coordinates

After filtering, we need to ensure our processed `AnnData` object contains the correct spatial coordinates for the remaining cells. We'll extract these from the original `SpatialData` object.

In [None]:
# Link processed cells back to their spatial coordinates
print("Linking processed cells to spatial coordinates...")

# Get the cell shapes from the original SpatialData object
original_shapes = sdata_xenium.shapes['cell_circles']

# Find cells that exist in both our processed data and the spatial data,
# preserving the table order (a set intersection would scramble cell order)
has_shape = adata_xenium.obs_names.isin(original_shapes.index)

if has_shape.any():
    print(f"Found {has_shape.sum()} cells with matching spatial coordinates")
    # Filter to cells that have both expression and spatial data
    adata_xenium = adata_xenium[has_shape].copy()
    # Extract (x, y) centroids as an (n_cells, 2) numeric array
    centroids = original_shapes.loc[adata_xenium.obs_names].centroid
    coords = np.column_stack([centroids.x, centroids.y])
    adata_xenium.obsm['spatial'] = coords
    print(f"Added spatial coordinates for {len(coords)} cells")
else:
    print("Warning: No matching cell IDs found. Using positional matching.")
    # Fallback: assumes the unfiltered table and the shapes share the same row order.
    # Look up each remaining cell's row in the unfiltered table, because taking
    # the first N coordinates would be wrong once cells have been filtered out.
    centroids = original_shapes.centroid
    all_coords = np.column_stack([centroids.x, centroids.y])
    original_names = sdata_xenium.tables["table"].obs_names
    row_positions = original_names.get_indexer(adata_xenium.obs_names)
    adata_xenium.obsm['spatial'] = all_coords[row_positions]

print(f"Final dataset: {adata_xenium.n_obs} cells with expression and spatial data")

#### Step 3.2: Build Spatial Neighborhood Graph

For spatial statistics, we need to define which cells are neighbors in physical space. We'll use Delaunay triangulation to create a graph connecting nearby cells.

In [None]:
# Create spatial neighborhood graph using Delaunay triangulation
# This connects each cell to its physical neighbors in the tissue
print("Building spatial neighborhood graph...")
sq.gr.spatial_neighbors(adata_xenium, coord_type="generic", delaunay=True)
print("Spatial graph construction complete")

# Verify that we have all necessary components for spatial analysis
print(f"✓ Spatial coordinates: {'spatial' in adata_xenium.obsm}")
print(f"✓ Cell clusters: {'clusters' in adata_xenium.obs.columns}")
print(f"✓ Spatial graph: {'spatial_connectivities' in adata_xenium.obsp}")
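
To see what a Delaunay-based neighbor graph looks like, the sketch below triangulates five toy points (the corners of a unit square plus its center) with `scipy.spatial.Delaunay` and converts the triangles into an adjacency matrix. This is only an illustration of the idea; `squidpy` builds and stores its graph in its own way.

```python
from itertools import combinations

import numpy as np
from scipy.spatial import Delaunay

# Toy coordinates: four corners of a unit square plus the center point
points = np.array(
    [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0], [0.5, 0.5]]
)
tri = Delaunay(points)

# Every pair of vertices within a triangle is connected by an edge
edges = set()
for simplex in tri.simplices:
    for i, j in combinations(sorted(int(v) for v in simplex), 2):
        edges.add((i, j))

# Symmetric binary adjacency matrix (the analogue of spatial_connectivities)
n = len(points)
adjacency = np.zeros((n, n), dtype=int)
for i, j in edges:
    adjacency[i, j] = adjacency[j, i] = 1

degree = adjacency.sum(axis=1)
print("edges:", sorted(edges))
print("neighbors per point:", degree)
```

The center point is connected to all four corners, while opposite corners are not connected to each other because the center lies between them.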

#### Step 3.3: Visualize Spatial Distribution of Clusters

Now we can visualize how our computationally derived clusters are organized in physical space. This reveals whether transcriptionally similar cells tend to be spatially co-localized.

In [None]:
# Create a robust spatial scatter plot using matplotlib
# This approach gives us full control and avoids metadata issues

# Extract coordinates and cluster information
coords = adata_xenium.obsm['spatial']
x_coords = coords[:, 0]
y_coords = coords[:, 1]
cluster_codes = adata_xenium.obs['clusters'].astype('category').cat.codes

# Create the spatial visualization
fig, ax = plt.subplots(1, 1, figsize=(12, 10))
scatter = ax.scatter(x_coords, y_coords, c=cluster_codes, 
                    cmap='tab20', s=3, alpha=0.8)
ax.set_title('Xenium Cell Clusters (Spatial View)', fontsize=14)
ax.set_xlabel('X coordinate (μm)')
ax.set_ylabel('Y coordinate (μm)')
ax.set_aspect('equal')

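# Add a legend showing cluster colors
# Use the categorical order so labels line up with the codes used for coloring
# (sorting the string labels would place '10' before '2')
unique_clusters = list(adata_xenium.obs['clusters'].astype('category').cat.categories)
colors = plt.cm.tab20(np.linspace(0, 1, len(unique_clusters)))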
legend_elements = [Line2D([0], [0], marker='o', color='w', 
                         markerfacecolor=colors[i], markersize=8, 
                         label=f'Cluster {cluster}') 
                  for i, cluster in enumerate(unique_clusters)]
ax.legend(handles=legend_elements, bbox_to_anchor=(1.05, 1), loc='upper left',
          title='Cell Clusters')

plt.tight_layout()
plt.show()

print(f"Spatial visualization shows {len(unique_clusters)} distinct cell clusters")
print("Notice how some clusters show clear spatial organization!")

### Part 4: Advanced Spatial Statistics

Now we'll use `squidpy` to perform sophisticated spatial analyses that go beyond simple visualization. These methods can reveal biological insights about tissue organization and cell-cell interactions.

#### Analysis 1: Spatially Variable Genes (Moran's I)

**Biological Question:** Which genes show non-random spatial patterns in their expression? 

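Moran's I is a measure of spatial autocorrelation. Values near +1 mean that neighboring cells have similar expression, values near 0 mean there is no spatial structure, and negative values mean that neighbors tend to differ (a checkerboard-like pattern). Genes with high Moran's I scores are expressed in spatially coherent regions, often marking anatomical structures or functional tissue domains.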
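
The statistic itself is simple enough to compute by hand. Below is a minimal sketch using the textbook formula with binary neighbor weights on a toy chain of six cells (all values are made up). `squidpy` may transform the weights before computing the statistic, so its values need not match this sketch exactly.

```python
import numpy as np


def morans_i(values, weights):
    """Textbook Moran's I for one variable and a symmetric weight matrix."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    z = values - values.mean()
    # Weighted sum of products of deviations over all neighbor pairs
    numerator = (weights * np.outer(z, z)).sum()
    denominator = (z**2).sum()
    return len(values) / weights.sum() * numerator / denominator


# Toy tissue: six cells in a row, each connected to its immediate neighbors
n = 6
weights = np.zeros((n, n))
for i in range(n - 1):
    weights[i, i + 1] = weights[i + 1, i] = 1

clustered = [1, 1, 1, 0, 0, 0]  # expression confined to one side
alternating = [1, 0, 1, 0, 1, 0]  # neighbors always differ

i_clustered = morans_i(clustered, weights)
i_alternating = morans_i(alternating, weights)
print(f"clustered:   {i_clustered:.2f}")
print(f"alternating: {i_alternating:.2f}")
```

The clustered pattern gives a positive score and the alternating pattern a negative one, which is the contrast the ranking in this analysis relies on.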

In [None]:
# Calculate Moran's I for spatially variable gene detection
# We test highly variable genes to focus on the most informative features
print("Calculating Moran's I spatial autocorrelation...")
print("This identifies genes with spatially coherent expression patterns")

sq.gr.spatial_autocorr(
    adata_xenium, 
    mode="moran", 
    genes=adata_xenium.var_names[adata_xenium.var['highly_variable']],  # Test only HVGs for speed (gene names, not a boolean mask)
    n_perms=100,  # Number of permutations for statistical testing
    n_jobs=4      # Parallel processing
)

# Display the top spatially variable genes
moran_results = adata_xenium.uns["moranI"].sort_values(by="I", ascending=False)
print(f"\nTop 10 spatially variable genes:")
print(moran_results.head(10)[['I', 'pval_sim']].round(4))

#### Visualize Top Spatially Variable Gene

Let's examine the spatial expression pattern of the gene with the highest Moran's I score. This should show clear spatial structure in its expression.

In [None]:
# Visualize the most spatially coherent gene
top_gene = moran_results.index[0]
moran_score = moran_results.iloc[0]['I']

print(f"Visualizing {top_gene} (Moran's I = {moran_score:.3f})")

# Create spatial expression plot using matplotlib for reliability
coords = adata_xenium.obsm['spatial']
x_coords = coords[:, 0]
y_coords = coords[:, 1]
gene_expression = adata_xenium[:, top_gene].X.toarray().flatten() if hasattr(adata_xenium.X, 'toarray') else adata_xenium[:, top_gene].X.flatten()

fig, ax = plt.subplots(1, 1, figsize=(12, 10))
scatter = ax.scatter(x_coords, y_coords, c=gene_expression, 
                    cmap='magma', s=3, alpha=0.8)
ax.set_title(f'{top_gene} Expression (Moran\'s I = {moran_score:.3f})', fontsize=14)
ax.set_xlabel('X coordinate (μm)')
ax.set_ylabel('Y coordinate (μm)')
ax.set_aspect('equal')

# Add colorbar
cbar = plt.colorbar(scatter, ax=ax)
cbar.set_label('Log(Expression + 1)')

plt.tight_layout()
plt.show()

print(f"High Moran's I indicates {top_gene} is expressed in spatially coherent regions")

#### Analysis 2: Neighborhood Enrichment

**Biological Question:** Which cell types tend to be neighbors more often than expected by chance?

This analysis reveals the "social network" of cell types in the tissue, identifying which clusters prefer to be adjacent to each other and which tend to segregate.

In [None]:
# Calculate neighborhood enrichment between cell clusters
print("Calculating neighborhood enrichment between cell clusters...")
print("This reveals which cell types tend to be spatial neighbors")

sq.gr.nhood_enrichment(adata_xenium, cluster_key="clusters")

# Visualize the enrichment matrix as a heatmap
sq.pl.nhood_enrichment(
    adata_xenium, 
    cluster_key="clusters", 
    method="ward",  # Hierarchical clustering to group similar patterns
    cmap="RdBu_r",  # Red-blue colormap (red=enriched, blue=depleted)
    figsize=(8, 8)
)

print("\nInterpreting the heatmap:")
print("• Red (positive Z-score): Cell types are neighbors more often than expected")
print("• Blue (negative Z-score): Cell types are neighbors less often than expected")
print("• White (Z-score ≈ 0): Random spatial association")
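
The Z-scores in the heatmap come from a permutation test. The sketch below shows the general idea on a toy graph: count how often each pair of labels shares an edge, shuffle the labels many times to build a null distribution, and standardize the observed counts. The helper names `count_pairs` and `nhood_zscores` are hypothetical and written for this illustration; this is not `squidpy`'s implementation.

```python
import numpy as np


def count_pairs(labels, edges, n_labels):
    """Count edges between every pair of labels (symmetric matrix)."""
    counts = np.zeros((n_labels, n_labels))
    for i, j in edges:
        counts[labels[i], labels[j]] += 1
        counts[labels[j], labels[i]] += 1
    return counts


def nhood_zscores(labels, edges, n_labels, n_perms=500, seed=0):
    """Z-score of observed label-pair counts against shuffled labels."""
    rng = np.random.default_rng(seed)
    observed = count_pairs(labels, edges, n_labels)
    perms = np.stack(
        [
            count_pairs(rng.permutation(labels), edges, n_labels)
            for _ in range(n_perms)
        ]
    )
    return (observed - perms.mean(axis=0)) / perms.std(axis=0)


# Toy tissue: 20 cells in a row, the first 10 of type 0 and the last 10 of type 1
labels = np.array([0] * 10 + [1] * 10)
edges = [(i, i + 1) for i in range(19)]

z = nhood_zscores(labels, edges, n_labels=2)
print(z.round(2))
```

In this toy layout the two cell types are fully segregated, so the diagonal entries are positive (each type neighbors itself more than expected) and the off-diagonal entries are negative.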

### Workshop Summary

Congratulations! You have completed a comprehensive spatial transcriptomics analysis workflow. Let's review what we accomplished:

**What We Learned:**
1. **Quality Control:** How to assess and filter spatial transcriptomics data
2. **Clustering:** Using `scanpy` to identify transcriptionally distinct cell populations
3. **Spatial Mapping:** Linking computational clusters back to tissue architecture
4. **Spatial Statistics:** Using `squidpy` to quantify spatial patterns in gene expression and cell organization

**Biological Insights:**
- We identified distinct cell clusters in the lung cancer tissue (the exact number is printed in Step 2.3)
- Moran's I analysis revealed genes with spatially coherent expression patterns
- Neighborhood enrichment showed which cell types tend to be spatial neighbors

**Next Steps:**
In a real analysis, you might:
- Annotate clusters with known cell type markers
- Investigate ligand-receptor interactions between neighboring clusters
- Compare spatial organization between healthy and diseased tissue
- Integrate with other data modalities (proteomics, imaging, etc.)

The `scverse` ecosystem provides a powerful foundation for asking sophisticated questions about tissue biology!