# Notebook 3: Data Processing and Clustering with `scanpy`

**Tutor:** Anthony Christidis
**Time:** 45 minutes

---

Welcome to the computational analysis part of the workshop! Before we dive into advanced spatial statistics, we must first process our raw gene expression data to identify meaningful biological groupings. This is a fundamental step in almost any single-cell or spatial analysis.

In this notebook, we'll use `scanpy` to perform a standard clustering analysis and `matplotlib` for robust QC plotting. We will first learn the workflow in detail on a **10x Visium** dataset, and then apply the same principles to prepare our **Xenium** data for the next notebook.

**Goals:**
1.  Perform a comprehensive Quality Control (QC) workflow on Visium data.
2.  Run a standard unsupervised clustering workflow (`scanpy`).
3.  Visualize the final spot clusters on the tissue, confirming that our analysis reveals underlying biology.
4.  Apply a streamlined workflow to process our Xenium data.

### Setup

First, we'll import our libraries and load the Visium Glioblastoma dataset.

In [None]:
# for cleaner output

import warnings
warnings.filterwarnings("ignore")

In [None]:
import spatialdata as sd
import spatialdata_plot as sdp
import scanpy as sc
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.lines import Line2D
from pathlib import Path

# Define the path to our data directory
# Note: This path is relative to the directory this notebook is run from
_DATA_DIR_PATH = Path("../data/")
_VISIUM_PATH = _DATA_DIR_PATH / "visium_glioblastoma_subset.zarr"
_XENIUM_PATH = _DATA_DIR_PATH / "xenium_lung_cancer_subset.zarr"

# Print versions for reproducibility
for p in [sd, sdp, sc]:
    print(f"{p.__name__}: {p.__version__}")

In [None]:
sdata_visium = sd.read_zarr(_VISIUM_PATH)

# adata_visium = sdata_visium.tables["table"]

sdata_visium

### Part 1: Visium Analysis - Spatial Quality Control

For spot-based data like Visium, visualizing QC metrics spatially is a critical first step. It can reveal technical issues like tissue detachment or slide artifacts.

First, we calculate standard QC metrics, such as the number of genes detected per spot (`n_genes_by_counts`) and the total number of transcripts per spot (`total_counts`).

In [None]:
sc.pp.calculate_qc_metrics(sdata_visium.tables["table"], percent_top=(20, 50), inplace=True)
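
To make the two metrics concrete, here is a minimal NumPy sketch on a made-up count matrix. The array values and variable names are illustrative assumptions, not part of the workshop data; `scanpy` computes these same quantities, along with several others, and stores them in `.obs`.

```python
import numpy as np

# Toy count matrix: 3 spots (rows) x 4 genes (columns); values are made up.
counts = np.array([
    [5, 0, 2, 0],
    [0, 0, 0, 1],
    [3, 4, 1, 2],
])

# Total number of counts observed in each spot.
total_counts = counts.sum(axis=1)

# Number of genes with at least one count in each spot.
n_genes_by_counts = (counts > 0).sum(axis=1)

print("total_counts:", total_counts)
print("n_genes_by_counts:", n_genes_by_counts)
```

A spot can have a high total count yet few detected genes if a handful of genes dominate it, which is why the two metrics are plotted side by side.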

In [None]:
fig, axs = plt.subplots(ncols=2, nrows=1, figsize=(12, 5.5))

(
    sdata_visium
    .pl.render_images()
    .pl.render_shapes(color="total_counts", shape="visium_hex")
    .pl.show("downscaled_hires", ax=axs[0], title="Total Counts per Spot")
)
(
    sdata_visium
    .pl.render_images()
    .pl.render_shapes(color="n_genes_by_counts", shape="visium_hex")
    .pl.show("downscaled_hires", ax=axs[1], title="Unique Genes per Spot")
)

fig.tight_layout()

The spatial plots above show these QC metrics on the tissue. This allows us to see whether low-quality spots are concentrated in a specific area, which might indicate a problem with the tissue section.

Both the total counts and the number of detected genes show clear spatial patterns, which likely correspond to different biological regions within the glioblastoma tissue.

### Part 2: Visium Analysis - The `scanpy` Clustering Workflow

Based on our QC, let's filter out the lowest-quality spots and then run the standard `scanpy` workflow to find transcriptionally distinct groups of spots. Each step is broken down into its own cell for clarity.

#### Step 2.1: Filtering
We remove spots with fewer than 500 total counts, and genes that are detected in fewer than 10 spots. This reduces noise in our data.

In [None]:
print(f"Spots before filtering: {sdata_visium.tables['table'].n_obs}")
sc.pp.filter_cells(sdata_visium.tables["table"], min_counts=500)
sc.pp.filter_genes(sdata_visium.tables["table"], min_cells=10)
print(f"Spots after filtering: {sdata_visium.tables['table'].n_obs}")
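
As a rough illustration of what the two filters do, the sketch below applies the same logic to a toy matrix with NumPy. The thresholds are deliberately tiny and hypothetical; the order (spots first, then genes) mirrors the cell above.

```python
import numpy as np

# Toy count matrix: 4 spots (rows) x 4 genes (columns); values are made up.
counts = np.array([
    [5, 0, 2, 0],
    [0, 0, 0, 1],
    [3, 4, 1, 2],
    [6, 0, 3, 0],
])

min_counts = 3  # hypothetical threshold, far smaller than the 500 used above
min_cells = 2   # hypothetical threshold, far smaller than the 10 used above

# Keep spots whose total count reaches min_counts.
keep_spots = counts.sum(axis=1) >= min_counts
filtered = counts[keep_spots]

# Among the remaining spots, keep genes detected in at least min_cells spots.
keep_genes = (filtered > 0).sum(axis=0) >= min_cells
filtered = filtered[:, keep_genes]

print("shape before:", counts.shape, "shape after:", filtered.shape)
```

Because gene detection is counted after low-quality spots are removed, the order of the two filters can change which genes survive.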

#### Step 2.2: Normalization and Log-Transformation
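This step corrects for differences in sequencing depth between spots, so that we compare relative gene expression profiles rather than raw counts. `normalize_total` rescales every spot to the same total count, and `log1p` then applies a log(1 + x) transform to dampen the influence of highly expressed genes.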

In [None]:
sc.pp.normalize_total(sdata_visium.tables["table"], inplace=True)
sc.pp.log1p(sdata_visium.tables["table"])
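
The sketch below shows the arithmetic behind these two calls on a toy matrix. It assumes the default behaviour of `normalize_total`, which scales each observation to the median of the total counts when no `target_sum` is given. The values and variable names are made up for illustration.

```python
import numpy as np

# Toy counts: spots 2 and 3 have the same relative profile (1:3)
# but were sequenced at very different depths.
counts = np.array([
    [2.0, 2.0],
    [10.0, 30.0],
    [1.0, 3.0],
])

# Total counts per spot, kept as a column for broadcasting.
totals = counts.sum(axis=1, keepdims=True)

# Rescale every spot to the median total count.
target = np.median(totals)
normalized = counts / totals * target

# log(1 + x) transform, which keeps zeros at zero.
logged = np.log1p(normalized)

print(normalized)
print(logged)
```

After normalization the second and third spots become identical, which is the point: depth differences no longer masquerade as expression differences.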

#### Step 2.3: Finding Highly Variable Genes (HVGs)
We don't need to use all ~18,000 genes for clustering. Instead, we identify the genes whose expression varies most across the tissue and focus our analysis on them. This reduces computational time and often improves results.

In [None]:
sc.pp.highly_variable_genes(sdata_visium.tables["table"])
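
To give a feel for the idea, the sketch below ranks genes by a simple dispersion score (variance divided by mean). This is a deliberate simplification and not what `scanpy` does internally: its default `seurat` flavor additionally bins genes by mean expression and normalizes dispersion within each bin. The matrix and the `top_k` name are hypothetical.

```python
import numpy as np

# Toy expression matrix: 4 spots (rows) x 3 genes (columns); values are made up.
expr = np.array([
    [1.0, 0.0, 5.0],
    [1.0, 4.0, 5.0],
    [1.0, 0.0, 5.5],
    [1.0, 4.0, 4.5],
])

# Per-gene mean and sample variance across spots.
mean = expr.mean(axis=0)
var = expr.var(axis=0, ddof=1)

# Simple dispersion score: variance relative to the mean.
dispersion = var / mean

# Indices of the top_k most dispersed genes (hypothetical cutoff).
top_k = 1
top_genes = np.argsort(dispersion)[::-1][:top_k]

print("dispersion:", dispersion)
print("top genes:", top_genes)
```

The first gene is constant and the third barely moves, so only the second gene, which switches on and off between spots, is selected.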

#### Step 2.4: Principal Component Analysis (PCA)
PCA is a dimensionality reduction technique. We use it to summarize the main axes of variation in our highly variable genes into a smaller number of principal components (PCs).

In [None]:
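# Note: in scanpy >= 1.10, `use_highly_variable` is deprecated in favor of `mask_var="highly_variable"`.
sc.pp.pca(sdata_visium.tables["table"], use_highly_variable=True)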
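
For intuition, PCA can be sketched as a singular value decomposition of the mean-centered matrix. The example below is a toy NumPy version under that assumption. It is not the solver `scanpy` uses, and the data and the `n_comps` value are made up.

```python
import numpy as np

# Toy matrix: 4 spots (rows) x 3 genes (columns); values are made up.
X = np.array([
    [2.0, 0.0, 1.0],
    [0.0, 2.0, 1.0],
    [3.0, 1.0, 0.0],
    [1.0, 3.0, 2.0],
])

# Center each gene so the components describe variation around the mean.
Xc = X - X.mean(axis=0)

# SVD: rows of Vt are the principal axes, S holds the singular values.
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Coordinates of each spot on the first n_comps principal components.
n_comps = 2
pcs = U[:, :n_comps] * S[:n_comps]

# Fraction of total variance captured by each component.
explained = S**2 / np.sum(S**2)

print(pcs)
print(explained)
```

The components come out sorted by explained variance, so keeping the first few retains the dominant structure while discarding mostly noise.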

#### Step 2.5: Neighborhood Graph and Leiden Clustering
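Next, we build a graph where each spot is a node, and nodes are connected if they are similar to each other in PCA space. The Leiden algorithm then partitions this graph into communities of spots that are more densely connected to each other than to the rest of the graph. These communities are our spot clusters. Because a Visium spot can capture several cells, a cluster reflects a shared expression profile rather than a single cell type.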

In [None]:
sc.pp.neighbors(sdata_visium.tables["table"])
sc.tl.leiden(sdata_visium.tables["table"], key_added="clusters")
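
The graph-building step can be sketched with a brute-force k-nearest-neighbor search. This is only an illustration under simplifying assumptions: `scanpy` uses an approximate search and weights the edges, and the Leiden step itself is not reproduced here. The coordinates and `k` are hypothetical.

```python
import numpy as np

# Toy PCA coordinates for 6 spots forming two well-separated groups.
pcs = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],
])
k = 2  # hypothetical number of neighbors per spot

# Pairwise Euclidean distances between all spots.
diff = pcs[:, None, :] - pcs[None, :, :]
dist = np.sqrt((diff**2).sum(axis=-1))

# A spot should not be its own neighbor.
np.fill_diagonal(dist, np.inf)

# Indices of the k closest spots for every spot.
neighbors = np.argsort(dist, axis=1)[:, :k]

print(neighbors)
```

Every spot here only links to members of its own group, so any community-detection method run on this graph would recover the two groups.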

#### Step 2.6: UMAP for Visualization
Finally, we compute a UMAP (Uniform Manifold Approximation and Projection). This takes our high-dimensional neighborhood graph and creates a 2D representation of it, which is useful for visualizing the relationships between our clusters.

In [None]:
sc.tl.umap(sdata_visium.tables["table"])

### Part 3: Visualizing the Visium Results
Let's visualize the clusters we found, both in the abstract UMAP space and back on the tissue.

In [None]:
sc.pl.umap(sdata_visium.tables["table"], color="clusters", title="Spot Clusters (UMAP)")

In [None]:
(
    sdata_visium
    .pl.render_shapes(color="clusters", shape="visium_hex")
    .pl.show("downscaled_hires", title="Leiden clusters")
)

### Optional: Interactive QC Visualization with `napari`

While static plots are great for a quick overview, we can use `napari` for a more dynamic exploration of our QC metrics. This allows us to zoom in on specific areas and see how metrics vary across the tissue.

*(Instructor Note: I will run this live. For those using the workshop's Docker container, this requires the graphics server setup detailed in the README. You can simply watch my screen for this brief demonstration.)*

In [None]:
import napari_spatialdata as nsd

# We launch napari with our sdata_visium object.
# The QC metrics we calculated are already in its associated table.
# viewer = nsd.Interactive(sdata_visium)

# In the Napari window that opens, you can now:
# 1. Add the hires_image and the shapes layer.
# 2. In the annotation panel on the right, under "Observation", select 'total_counts'
#    or 'n_genes_by_counts' to color the spots by these QC metrics.

### Part 4: Preparing the Xenium Data for the Next Notebook

Now that we've walked through the workflow, we will apply the same steps to our Xenium data, where each observation is a segmented cell rather than a spot. We break the workflow into separate notebook cells and include the QC histograms and the final UMAP plot.

In [None]:
print("--- Processing Xenium Data ---")

# 1. Load Data
sdata_xenium = sd.read_zarr(_XENIUM_PATH)

# adata_xenium = sdata_xenium.tables["table"].copy()

sdata_xenium

In [None]:
import seaborn as sns

# 2. Calculate and Visualize QC Metrics
sc.pp.calculate_qc_metrics(sdata_xenium.tables["table"], percent_top=(20, 50), inplace=True)

fig, axs = plt.subplots(1, 2, figsize=(10, 4))
sns.histplot(sdata_xenium.tables["table"].obs["total_counts"], kde=False, bins=100, ax=axs[0])
axs[0].set_title("Total Counts per Cell")
sns.histplot(sdata_xenium.tables["table"].obs["n_genes_by_counts"], kde=False, bins=100, ax=axs[1])
axs[1].set_title("Unique Genes per Cell")
plt.tight_layout()
plt.show()

In [None]:
# 3. Filter the data
print(f"Cells before filtering: {sdata_xenium.tables['table'].n_obs}")
sc.pp.filter_cells(sdata_xenium.tables["table"], min_counts=50)
sc.pp.filter_genes(sdata_xenium.tables["table"], min_cells=10)
print(f"Cells after filtering: {sdata_xenium.tables['table'].n_obs}")

In [None]:
# 4. Run the rest of the workflow
sc.pp.normalize_total(sdata_xenium.tables["table"])
sc.pp.log1p(sdata_xenium.tables["table"])
sc.pp.highly_variable_genes(sdata_xenium.tables["table"], n_top_genes=2000, flavor='seurat')
sc.pp.pca(sdata_xenium.tables["table"], use_highly_variable=True)
sc.pp.neighbors(sdata_xenium.tables["table"])
sc.tl.leiden(sdata_xenium.tables["table"], key_added="clusters")
sc.tl.umap(sdata_xenium.tables["table"])

In [None]:
# 5. Visualize the Xenium UMAP
print(f"Xenium data processed. Found {len(sdata_xenium.tables['table'].obs['clusters'].unique())} clusters.")
sc.pl.umap(sdata_xenium.tables["table"], color="clusters", title="Xenium Cell Clusters (UMAP)")
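
Beyond the number of clusters, it is often worth checking how many cells fall into each one. The sketch below does this with NumPy on a made-up label array; in the notebook the labels would come from the `clusters` column of `.obs`.

```python
import numpy as np

# Hypothetical cluster labels for 6 cells (Leiden labels are strings).
labels = np.array(["0", "1", "0", "2", "0", "1"])

# Unique cluster names and the number of cells assigned to each.
names, sizes = np.unique(labels, return_counts=True)
cluster_sizes = dict(zip(names.tolist(), sizes.tolist()))

print(cluster_sizes)
```

Very small clusters can be worth a second look, since they may reflect rare populations or low-quality cells.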

In [None]:
sdata_xenium

In [None]:
(
    sdata_xenium
    .pl.render_images("he_image")
    .pl.render_shapes("cell_circles", color="clusters", fill_alpha=1)
    .pl.show(title="Leiden clusters", figsize=(12, 4))
)

In [None]:
# 6. Save the processed SpatialData object
(_DATA_DIR_PATH / "processed").mkdir(parents=True, exist_ok=True)

print("Saving processed Xenium SpatialData object...")
sdata_xenium.write(_DATA_DIR_PATH / "processed" / "sdata_xenium_processed.zarr", overwrite=True)
print("Done. We are ready for Notebook 4.")

<div style="border: 1px solid #4CAF50; border-left-width: 15px; padding: 10px; background-color: #F0FFF0; color: black;">
    <strong>Summary:</strong>
    <p>This concludes the third part of our workshop. We've learned how to run QC, filtering, normalization, clustering, and visualization on Visium and Xenium expression matrices, and saved the processed Xenium data for Notebook 4.</p>
</div>