# Part 3: Cell Typing with `scanpy`

**Tutor:** Anthony Christidis
**Time:** 45 minutes

---

Welcome to the computational analysis part of the workshop! In the first half, Tim introduced us to the `SpatialData` object and how to visualize our data. Now, we will take that data and use it to answer a fundamental biological question: **What cell types are present in our tissue, and where are they located?**

For this, we will use `scanpy`, the core analysis engine of the `scverse` ecosystem.

**Goals:**
1.  Perform a comprehensive Quality Control (QC) workflow, including spatial visualization of QC metrics.
2.  Run a standard unsupervised clustering workflow to identify cell populations.
3.  Visualize the identified cell clusters in both UMAP space and their true physical space.

### Setup and Data Loading

First, we'll import our libraries and load the Xenium dataset. We will focus our analysis on the `AnnData` table, which contains the cell-by-gene count matrix (cells as rows, genes as columns).

In [None]:
%load_ext jupyter_black

import spatialdata as sd
import spatialdata_plot as sdp
import scanpy as sc
import squidpy as sq
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings("ignore", category=FutureWarning)
warnings.filterwarnings("ignore", category=UserWarning)

# Load the pre-processed Xenium dataset
sdata = sd.read_zarr("../data/xenium_lung_cancer_subset.zarr")

# For analysis, we extract the AnnData table of cell counts
adata = sdata.tables["table"]

In [None]:
adata

### Step 1: Comprehensive Quality Control (QC)

Before we can analyze our data, we must perform quality control to remove low-quality cells and technical artifacts. A good QC workflow involves both calculating metrics and visualizing them to make informed decisions.

#### Part A: Standard Single-Cell QC Metrics
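We'll start by calculating standard single-cell metrics. A high percentage of counts from mitochondrial (MT) genes is a common indicator of cell stress or damage. Note that targeted panels such as Xenium may include few or no MT genes; in that case `pct_counts_mt` is zero for every cell and the MT filter below has no effect.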

In [None]:
# Identify mitochondrial genes (names often start with 'MT-')
adata.var['mt'] = adata.var_names.str.startswith('MT-')

# Use scanpy to calculate QC metrics, including the percentage of MT counts
sc.pp.calculate_qc_metrics(adata, qc_vars=['mt'], percent_top=None, log1p=False, inplace=True)
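
To make the three key columns concrete, here is a minimal sketch that reproduces them by hand on a toy matrix. The counts and gene names are made up for illustration, and this is plain NumPy rather than scanpy's own implementation.

```python
import numpy as np

# Toy count matrix: 3 cells (rows) x 4 genes (columns). Values are made up.
counts = np.array([
    [5, 0, 3, 2],
    [0, 0, 1, 9],
    [4, 4, 0, 0],
])
gene_names = np.array(["EPCAM", "CD3E", "MT-CO1", "MT-ND1"])

# Same rule as above: a gene is mitochondrial if its name starts with "MT-"
is_mt = np.char.startswith(gene_names, "MT-")

total_counts = counts.sum(axis=1)             # all transcripts per cell
n_genes_by_counts = (counts > 0).sum(axis=1)  # genes with at least one count
pct_counts_mt = 100 * counts[:, is_mt].sum(axis=1) / total_counts

print(total_counts.tolist())       # [10, 10, 8]
print(n_genes_by_counts.tolist())  # [3, 2, 2]
print(pct_counts_mt.tolist())      # [50.0, 100.0, 0.0]
```

The second toy cell would be flagged by an MT filter, while the third has no MT counts at all.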

Now, let's visualize the distribution of these metrics. This helps us decide on appropriate filtering thresholds.

In [None]:
fig, axs = plt.subplots(1, 4, figsize=(16, 4))

sns.histplot(adata.obs["total_counts"], kde=False, bins=100, ax=axs[0])
axs[0].set_title("Total Counts per Cell")

sns.histplot(adata.obs["n_genes_by_counts"], kde=False, bins=100, ax=axs[1])
axs[1].set_title("Unique Genes per Cell")

sns.histplot(adata.obs["pct_counts_mt"], kde=False, bins=100, ax=axs[2])
axs[2].set_title("Mitochondrial % per Cell")

sns.histplot(adata.obs["cell_area"], kde=False, bins=100, ax=axs[3])
axs[3].set_title("Cell Area")

fig.tight_layout()

*From these plots, we can decide on our filtering strategy. We want to remove cells that are outliers: those with very few counts (likely empty or dead), very high counts (potential doublets), or a high percentage of mitochondrial genes (stressed cells).*
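
If you prefer a data-driven starting point over reading thresholds off the histograms, one common heuristic is to flag cells that lie several median absolute deviations (MADs) from the median. The sketch below is an alternative to the fixed thresholds used in this notebook, not part of it; `mad_outliers` is a hypothetical helper and the counts are made up.

```python
import numpy as np


def mad_outliers(values, n_mads=5.0):
    """Flag values more than `n_mads` median absolute deviations from the median."""
    values = np.asarray(values, dtype=float)
    median = np.median(values)
    mad = np.median(np.abs(values - median))
    return np.abs(values - median) > n_mads * mad


# Made-up total counts: mostly ~100, one near-empty cell and one likely doublet
total_counts = np.array([95, 102, 98, 110, 105, 3, 99, 101, 900, 97])
flags = mad_outliers(total_counts)

print(total_counts[flags].tolist())  # [3, 900]
```

In practice this kind of rule is often applied to log-transformed counts, and the flagged cells should still be checked against the plots before anything is removed.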

#### Part B: Spatially-Aware QC

A unique advantage of spatial data is that we can visualize these QC metrics in physical space. This can reveal technical artifacts, like edge effects or areas of damaged tissue.

We'll use `squidpy` for this. First, we need to ensure our `AnnData` object contains the spatial coordinates.

In [None]:
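import numpy as np

# Extract the centroid coordinates of each cell from the SpatialData shapes element
# and add them to the standard `.obsm['spatial']` slot that squidpy uses.
# The result must be an (n_cells, 2) numeric array, one (x, y) row per cell.
# This assumes the shapes are stored in the same order as the rows of `adata`.
centroids = sdata.shapes['cell_circles'].centroid
adata.obsm['spatial'] = np.column_stack([centroids.x, centroids.y])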

In [None]:
# Now plot the QC metrics spatially
sq.pl.spatial_scatter(
    adata,
    color=["total_counts", "n_genes_by_counts", "pct_counts_mt"],
    shape=None, # no tissue image attached: draw each cell as a point
    spatial_key="spatial", # use the coordinates stored in `.obsm['spatial']`
    size=1, # make points small to see patterns
    cmap="magma",
    figsize=(12, 4),
    ncols=3
)

*Here we can check if, for example, all the cells with high mitochondrial content (`pct_counts_mt`) are clustered in one area, which might suggest a region of damaged tissue.*
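
A visual check like this can be backed up with a coarse numerical summary: divide the tissue into tiles and average the metric per tile. The sketch below uses synthetic coordinates and MT percentages as stand-ins for `adata.obsm['spatial']` and `adata.obs['pct_counts_mt']`; the grid size and the simulated damaged corner are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for the cell coordinates and the MT percentage
coords = rng.uniform(0, 100, size=(2000, 2))
pct_mt = rng.uniform(0, 10, size=2000)

# Simulate a damaged corner: cells with x < 25 and y < 25 get elevated MT %
damaged = (coords[:, 0] < 25) & (coords[:, 1] < 25)
pct_mt[damaged] += 30

# Assign each cell to a tile of a 4 x 4 grid
n_bins = 4
edges = np.linspace(0, 100, n_bins + 1)
x_bin = np.clip(np.digitize(coords[:, 0], edges) - 1, 0, n_bins - 1)
y_bin = np.clip(np.digitize(coords[:, 1], edges) - 1, 0, n_bins - 1)

# Accumulate the metric and the cell count per tile, then average
tile_sum = np.zeros((n_bins, n_bins))
tile_n = np.zeros((n_bins, n_bins))
np.add.at(tile_sum, (y_bin, x_bin), pct_mt)
np.add.at(tile_n, (y_bin, x_bin), 1)
tile_mean = tile_sum / np.maximum(tile_n, 1)

print(np.round(tile_mean, 1))
```

A single tile with a much higher mean than its neighbors, as in the simulated corner here, is the numerical counterpart of a hot spot in the scatter plot.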

#### Part C: Applying the Filters

Now that we have visualized our metrics and feel confident in our strategy, we can apply the filters.

In [None]:
print(f"Cells before filtering: {adata.n_obs}")

# Filter for cells with a reasonable number of counts
sc.pp.filter_cells(adata, min_counts=50)

# Filter for cells with a reasonable number of genes
sc.pp.filter_cells(adata, min_genes=20)

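# Filter out cells with high mitochondrial content
# (`.copy()` turns the resulting view into a standalone AnnData that later steps can modify)
adata = adata[adata.obs.pct_counts_mt < 25, :].copy()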

print(f"Cells after filtering: {adata.n_obs}")

# Filter out genes that are rarely expressed
sc.pp.filter_genes(adata, min_cells=10)

### Step 2: Clustering with `scanpy`

With our cleaned data, we can now run the core clustering workflow. This involves normalization, selection of highly variable genes, dimensionality reduction with PCA, building a neighborhood graph, and running the Leiden clustering algorithm. We also compute a UMAP embedding for visualization.

In [None]:
# Normalize and Log-transform the data
adata.layers["counts"] = adata.X.copy()
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)

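# Find highly variable genes. The 'seurat_v3' flavor expects raw counts,
# so we point it at the layer we saved before normalizing.
sc.pp.highly_variable_genes(adata, n_top_genes=2000, flavor='seurat_v3', layer="counts")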

# Run PCA
sc.pp.pca(adata)

# Build neighborhood graph and run Leiden clustering
sc.pp.neighbors(adata)
sc.tl.leiden(adata, resolution=0.5, key_added="clusters")

# Compute UMAP for visualization
sc.tl.umap(adata)
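
To see what the normalization and log-transform steps do to the numbers, here is a minimal NumPy sketch on a made-up two-cell matrix. It mirrors the idea behind `normalize_total` and `log1p` but is not scanpy's implementation.

```python
import numpy as np

# Two made-up cells with the same composition but different sequencing depth
counts = np.array([
    [2.0, 0.0, 6.0],    # 8 total counts
    [10.0, 5.0, 25.0],  # 40 total counts
])
target_sum = 1e4

# Scale every cell so that its counts sum to `target_sum`
normalized = counts / counts.sum(axis=1, keepdims=True) * target_sum

# log(1 + x) compresses the dynamic range while keeping zeros at zero
logged = np.log1p(normalized)

print(np.allclose(normalized.sum(axis=1), target_sum))  # True
print(logged[0, 1])                                     # 0.0
```

After scaling, the first gene has the same value in both cells even though the raw counts differed fivefold, which is the depth effect that normalization is meant to remove.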

### Step 3: Visualizing the Cell Clusters

We have now assigned a cluster label to each cell. Let's visualize these clusters, first in the abstract UMAP space, and then in their true physical space on the tissue image.

In [None]:
# Plot the UMAP, colored by our new 'clusters' annotation
sc.pl.umap(adata, color="clusters", legend_loc="on data", title="Cell Clusters (UMAP)")

This UMAP shows us which cells are transcriptionally similar. The more informative view is where these clusters sit in the tissue, which we can plot with `spatialdata-plot`.

First, we need to update our `SpatialData` object with the `AnnData` object we just processed.

In [None]:
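# Keep only the cell shapes that passed QC. `bounding_box` is a purely spatial
# query and cannot filter by cell ID, so we subset the shapes directly.
# This assumes the shapes are indexed by the same cell IDs that the table stores
# under its instance key (recorded in `adata.uns["spatialdata_attrs"]`).
instance_key = adata.uns["spatialdata_attrs"]["instance_key"]
boundaries = sdata.shapes["cell_boundaries"]

sdata_filtered = sdata  # same object as `sdata`, modified in place
sdata_filtered.shapes["cell_boundaries"] = boundaries[boundaries.index.isin(adata.obs[instance_key])]

# Replace the old table with our newly processed one
sdata_filtered.tables["table"] = adata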
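
Whichever way the table and shapes are combined, the plot only works if their cell IDs agree. The sketch below shows the kind of sanity check that is worth running first; the IDs are hypothetical, and in the notebook they would come from the shapes index and the table's instance-key column.

```python
import numpy as np

# Hypothetical cell IDs: the shapes element holds every segmented cell,
# while the table only holds the cells that survived QC.
shape_ids = np.array(["cell_1", "cell_2", "cell_3", "cell_4", "cell_5"])
table_ids = np.array(["cell_1", "cell_3", "cell_4"])

keep_mask = np.isin(shape_ids, table_ids)     # shapes that have a table row
missing = np.setdiff1d(table_ids, shape_ids)  # table rows without a shape

print(shape_ids[keep_mask].tolist())  # ['cell_1', 'cell_3', 'cell_4']
print(missing.size)                   # 0
```

A non-empty `missing` array would mean some cells in the table cannot be drawn, which usually points to mismatched ID formats.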

In [None]:
# Now, create the spatial plot.
# Importing `spatialdata_plot` registers the `.pl` accessor on SpatialData objects.
sdata_filtered.pl.render_shapes(
    element="cell_boundaries",
    color="clusters", # Color the cell shapes by our new cluster labels
    fill_alpha=0.7,
).pl.show(figsize=(8, 8), title="Cell Clusters (Spatial View)")

If the clustering has captured real biological structure, the computationally derived clusters should correspond to distinct anatomical regions in the tissue rather than being scattered at random.

In the next and final notebook, we will use `squidpy` to statistically analyze the spatial organization of these clusters and find out which of them tend to be located next to each other.