<center><span style="font-size: 4em; font-weight: bold">🔬 IMC Population Identification Notebook</span></center>
<h3><center><i>Written by Michael Haley (michael.haley@manchester.ac.uk), updated 04/08/2025</i></center></h3>

# Contents
- ### **[Introduction to Population Identification](#Introduction)**
- ### **[How to use this Notebook](#How-to-Use-This-Notebook)**
- ### **[Preprocessing Steps Completed Before This Notebook](#Preprocessing-Steps-Completed-Before-This-Notebook)**
- ### **[Working with packages](#Working-with-packages)**
- ### **[Working with AnnData objects](#Working-with-AnnData-objects)**
- ### **[Plotting and visualising populations](#Plotting-and-visualising-populations)**
- ### **[Clustering](#Clustering)**
- ### **[Other useful tools...](#Other-useful-tools...)**

---
# Introduction to Population Identification

The first major step in analyzing Imaging Mass Cytometry (IMC) data is to identify the cell populations present in your samples. In this notebook, we use an approach adapted from single-cell RNA sequencing: unsupervised clustering using the Leiden algorithm.

Leiden clustering allows us to detect both abundant and rare populations without providing prior knowledge to the algorithm. It groups cells based on patterns in their marker expression and returns numbered clusters (e.g., *0*, *1*, and so on), with lower numbers typically representing more abundant populations.

However, these clusters need to be biologically interpreted and labeled. That is, we must assign each numbered cluster to a known or hypothesized cell type (e.g., macrophages, T cells, fibroblasts). This process is crucial for all downstream analyses, and while it is powerful, it is also inherently subjective and sometimes challenging.


### ⚠️ Challenges in Labeling Cell Populations

1. **Clusters rarely align perfectly with known cell types**  
   Some clusters may actually consist of multiple subpopulations, while others may represent duplicates of the same biological population.

2. **Rare or unexpected populations**  
   You might discover unexpected cell types or odd marker combinations. These require careful interpretation, possibly informed by external datasets like scRNA-seq or FACS. There are also common artifacts in IMC and similar antibody-based approaches that we must look out for: tears or damage to the tissue, antibodies binding non-specifically, or cell segmentation bringing in markers from neighbouring cells.

3. **Marker limitations**  
   Your ability to distinguish between cell types depends entirely on the markers included in your panel and on the quality (i.e. signal-to-noise) of each individual marker. A poorly designed panel limits interpretability. We do not have the luxury of thousands of genes to work with.

4. **QC is essential**  
   Mislabeling clusters can lead to misleading or invalid downstream results. It's extremely important to validate each cell label using multiple views of the data.


### ✅ Strategies and Quality Control (QC)

To improve accuracy and confidence in your labels, we apply multiple **QC strategies**:

- **UMAP Plots**  
  Cells with similar marker profiles should group closely together. If a cluster appears scattered or mixed, it may need to be split or re-evaluated.

- **Heatmaps**  
  These show the average marker expression per cluster. They help you recognize population-specific patterns and validate labels.

- **Spatial mapping and backgating**  
  Visualizing labeled populations back in the tissue provides a powerful reality check. For example, if Cluster 4 looks like a macrophage on your heatmap and UMAP, it should *look like* a macrophage in the tissue too.

- **Reference Datasets** *(not covered here)*  
  Comparing your results to published single-cell RNA-seq or IMC datasets can aid biological interpretation.


### 📌 Why This Step Matters

Getting this stage right is **critical**. Every downstream analysis, from spatial patterning to statistical comparison, relies on the accuracy of your cell population labels. Mislabeling can invalidate results and, in some cases, require redoing all downstream steps. This is one of the most subjective parts of the analysis pipeline. Everything that follows is typically more linear and reproducible.


### 🔍 Beyond Unsupervised Clustering

While this notebook focuses on **unsupervised clustering**, there are alternative strategies (e.g., manual gating similar to FACS), where cells are assigned to populations using expression thresholds. These methods can be more transparent but are often impractical with large marker panels like those in IMC data.
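
As a rough illustration of what threshold-based gating looks like in code, the sketch below assigns labels from two markers. The marker names, the threshold of 0.5, and the toy table are all hypothetical; in practice each threshold would be chosen per marker after inspecting the data.

```python
import numpy as np
import pandas as pd

# Hypothetical normalised marker intensities (0-1) for five cells
cells = pd.DataFrame(
    {"CD3": [0.80, 0.05, 0.10, 0.70, 0.02],
     "CD68": [0.10, 0.90, 0.05, 0.60, 0.03]}
)

# Illustrative thresholds only; not values used by any pipeline
cd3_pos = cells["CD3"] > 0.5
cd68_pos = cells["CD68"] > 0.5

# np.select picks the first matching condition for each cell
cells["gate"] = np.select(
    [cd3_pos & ~cd68_pos, cd68_pos & ~cd3_pos, cd3_pos & cd68_pos],
    ["T cell", "Macrophage", "Ambiguous"],
    default="Unassigned",
)
print(cells)
```

Even with only two markers there are already four possible outcomes, which hints at why hand-written gates become hard to manage for a full IMC panel.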

There are also **semi-supervised** and **automated** methods to assist with labeling, though these are outside the scope of this notebook.


---
# How to Use This Notebook

This notebook breaks some of the typical rules of structured programming. Rather than expecting you to run all cells in order from top to bottom, it is designed as a toolbox - a collection of useful functions and workflows that you can use as needed, in an iterative and interactive way.

This notebook will provide you with the tools to:

- Assign labels to Leiden clusters  
- Perform key QC checks (UMAP, heatmap, tissue mapping)  
- Adjust clusters (splitting or merging) as needed  
- Validate your final labeled populations

You will likely find yourself jumping back and forth between tools (clustering, quality control, merging populations, and more) until you arrive at a set of biologically meaningful populations.

### 🔁 An Iterative Process

Labeling cell populations is (initially!) not a linear task. Instead, the real-world process often looks like this:

1. Run unsupervised (e.g. Leiden) clustering
2. Try to interpret the biological meaning of each cluster
3. Run quality control checks
4. Split or merge populations based on what you observe
5. Repeat…

Some clusters may contain more than one biological subtype and should be split; others may represent the same biology and should be merged. Each time you modify the clusters, you need to rerun QC to ensure the labels still make sense across:
- UMAP
- Heatmaps
- Spatial backgating

This cycle continues until you are confident that your annotations are as accurate and interpretable as possible.
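
The labeling and merging steps in this cycle boil down to mapping cluster IDs onto names. The sketch below shows the idea with plain pandas on a stand-in for `adata.obs`; the cluster-to-label assignments and the `population` column name are hypothetical, and the toolkit functions used later in this notebook may do this differently.

```python
import pandas as pd

# Stand-in for adata.obs; Leiden cluster IDs are usually stored as strings
obs = pd.DataFrame({"leiden_0.3": ["0", "1", "2", "1", "3", "0"]})

# Hypothetical interpretation: clusters 0 and 3 both look like macrophages,
# so giving them the same label merges them
cluster_to_label = {
    "0": "Macrophage",
    "1": "T cell",
    "2": "Fibroblast",
    "3": "Macrophage",
}

# Map IDs to labels; clusters missing from the dictionary are flagged, not dropped
obs["population"] = (
    obs["leiden_0.3"].map(cluster_to_label).fillna("Unlabelled").astype("category")
)
print(obs["population"].value_counts())
```

Keeping the mapping in an explicit dictionary also gives you a written record of every labeling decision, which you can revise and re-apply after each round of QC.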

### ✍️ Suggested Practice

Once you're comfortable with the tools and understand how and when to use each one, consider creating your own notebook that reuses parts of this code in a logical order. Be sure to:
- Add lots of notes and comments
- Save intermediate outputs
- Clearly track which parameters and logic you used

This will give you a permanent, reproducible record of how you defined your populations, which will be absolutely necessary when writing papers, debugging results, or sharing your analysis with others.

### ⚠️ Don't Just Click and Run

Although it might be tempting to simply run all the cells in order, that's not the goal here. The real aim is to help you **understand** what each tool does, and **when** it's appropriate to use it. Once you've mastered this, you'll be in control of your own analysis: not just clicking through code, but making informed biological decisions.

---
# Preprocessing Steps Completed Before This Notebook

### Pre-processing steps
Before starting this notebook, your IMC data should already have gone through a series of **preprocessing steps**. These steps are often handled by the bioimaging facility or a dedicated pipeline, and they take care of many of the more technical and time-consuming tasks:

- **Extract raw images from MCD file** using [readimc](https://github.com/BodenmillerGroup/readimc)
- **Denoising** using the [IMC-Denoise](https://github.com/PENGLU-WashU/IMC_Denoise) algorithm to clean up raw image data and reduce noise.
- **Cell segmentation** using [Cellpose (currently version 3)](https://github.com/MouseLand/cellpose), which identifies the boundaries of individual cells in the tissue images based on intensity and morphology, creating segmentation masks.
- **Cell table generation**, which creates a per-cell dataset summarizing:
  - Mean expression levels of each marker over the cell's area *(as defined by the mask)*
  - Spatial coordinates of cells in the tissue *(the centre of the mask)*
  - Morphological details like cell area or shape
- **Marker expression normalisation**, usually to the **99th or 99.9th percentile**, depending on the size of the dataset. You can check in `pipeline/config.yml --> segmentation --> marker normalisation`
- **Creation of raw [AnnData object](#Working-with-AnnData-objects)** (`anndata.h5ad`) using the normalised marker expression and any sample-level metadata supplied in `metadata/dictionary.csv`
- **Batch normalisation, UMAP creation and initial Leiden clustering** to create a ready-to-analyse AnnData (`anndata_processed.h5ad`)
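
To give a feel for what percentile normalisation does, here is a minimal NumPy sketch. It assumes each marker is divided by its own 99th percentile and then clipped to the 0 to 1 range; whether the actual pipeline clips, or handles the percentile in exactly this way, is an assumption, and the toy data are randomly generated.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical raw mean intensities: 1000 cells x 3 markers
raw = rng.gamma(shape=2.0, scale=5.0, size=(1000, 3))

# Per-marker 99th percentile (one value per column)
p99 = np.percentile(raw, 99, axis=0)

# Divide by the percentile and clip so extreme outliers saturate at 1
normalised = np.clip(raw / p99, 0.0, 1.0)

print(normalised.min(axis=0), normalised.max(axis=0))
```

The effect is that a handful of very bright cells no longer dominate the scale, so markers with different dynamic ranges become comparable.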

<span style="color:green; font-size:large">**💡 You can review the settings used in the preprocessing steps in the config file (`pipeline/config.yml`) and the metadata files (`metadata/panel.csv`, `metadata/errors.csv`, `metadata/metadata.csv`)**</span>

### 📂 Supplied Files and Directories

| **Category**              | **Path / File**                                                            | **Description**                                                                 |
|---------------------------|-----------------------------------------------------------------------------|---------------------------------------------------------------------------------|
| **Image Data**            | `processed/`                                                               | All denoised TIFF images used for analysis.                                     |
|                           | `raw/`                                                                     | All raw TIFF images, without denoising.                                  |
| **Metadata**              | `metadata/panel.csv`                                                       | Marker/channel info: antibody names, channels, whether the raw or denoised data was used during segmentation.                            |
|                           | `metadata/errors.csv`                                                      | Log of any extraction issues from raw MCD files. If you have ROIs missing, then check here.     |
|                           | `metadata/metadata.csv`                                                    | Per-image metadata: dimensions, exposure time, acquisition info                |
|                           | `metadata/dictionary.csv`                                                   | Sample level metadata *(if supplied)*               |
| **Segmentation & Tables** | `masks/`                                                                   | Segmentation masks output by Cellpose                                          |
|                           | `cell_tables/`                                                             | Individual CSVs with marker expression and spatial data per ROI                |
|                           | `cell_tables/celltable.csv`                                                | Merged master cell table combining all ROIs                                    |
| **AnnData Files**         | `anndata.h5ad`                                                             | Raw AnnData object created from the cell tables, data normalised usually to 99.9th percentile |
|                           | `anndata_processed.h5ad`                                                   | Batch-corrected and clustered AnnData object                                   |
| **QC & Pipeline Logs**    | `pipeline/config.yml`                                                      | Configuration settings used for preprocessing                                  |
|                           | `pipeline/`                                                                | Preprocessing setup scripts and logs                                           |
|                           | `QC/Segmentation_overlay/`                                                 | Overlays of segmentation masks on original images                             |
|                           | `QC/ParameterScan_cellpose_.../`                  | Comparisons of different Cellpose parameter settings *(if performed)*                          |
|                           | `QC/denoising/`                                                             | Side-by-side comparisons of raw vs. denoised channel images *(if performed)*                   |

---
# Working with packages

### 📦 *"What Does It Mean to Import a Package?"*

In Python, a **package** is a collection of pre-written code that adds specific functionality to your project, like reading files, analyzing data, or creating plots. Instead of writing everything from scratch, you can **import** a package and use the tools it provides.

For example, to work with data tables, you might use the `pandas` package. To use it in your code, you first need to import it:

```python
import pandas as pd
```

This tells Python:  
> "Make the `pandas` package available in this notebook, and call it `pd` from now on."

Once imported, you can access all of the package's functions, like reading CSV files:

```python
df = pd.read_csv("my_data.csv")
```

Importing is one of the very first steps in almost every Python script or notebook. It brings external tools into your workspace so you can use them without rewriting the code yourself.


### 📦 *"What Are These Python Packages and Why Are We Using Them?"*

If you're new to Python or data analysis, you might be wondering what all these packages are doing in your notebook. Here's a brief explanation of each one to help you understand their roles:


#### 🔬 `scanpy`  
Scanpy is a popular Python library for analyzing single-cell data, especially data from high-dimensional experiments like single-cell RNA sequencing or Imaging Mass Cytometry (IMC). It includes tools for preprocessing, clustering, visualization (like UMAPs), and much more, all optimized for working with large datasets.

📘 [Scanpy Documentation](https://scanpy.readthedocs.io/en/stable/)

#### 🧱 `anndata`  
AnnData is the data structure that Scanpy (and similar tools) use to store your entire dataset. It keeps everything together: your main data matrix (e.g., marker intensities), cell metadata (like cluster labels), feature info (like marker names), and dimensionality reduction coordinates (like UMAP). You won't usually manipulate `AnnData` directly, but it's the foundation Scanpy builds on.

📘 [AnnData Documentation](https://anndata.readthedocs.io/en/stable/)

#### 🔢 `numpy`  
NumPy is the foundational package for numerical computing in Python. It lets you work efficiently with large arrays and matrices, and underpins nearly all scientific libraries in Python (including Scanpy and Pandas). You'll mostly use it for fast, flexible math operations.

📘 [NumPy Documentation](https://numpy.org/doc/)

#### 📊 `pandas`  
Pandas is Python's go-to tool for handling data tables (similar to Excel or R's dataframes). It makes it easy to filter, summarize, reshape, and analyze tabular data. It's especially useful for working with metadata, like cluster labels, sample IDs, and anything in `adata.obs`.

📘 [Pandas Documentation](https://pandas.pydata.org/docs/)


#### 🖼️ `matplotlib` and `seaborn`  
These libraries are used for **plotting and visualization**.

- **Matplotlib** is the core plotting library in Python. It's powerful and flexible but can be verbose.
- **Seaborn** is built on top of Matplotlib and makes it easier to create attractive statistical plots with less code.
- In this notebook, these are used to customize and generate bar plots, UMAPs, heatmaps, and more.

📘 [Matplotlib Documentation](https://matplotlib.org/stable/contents.html)  
📘 [Seaborn Documentation](https://seaborn.pydata.org/)


### General Advice for Solving Problems

- 🔍 **Check the documentation first**: Every package listed above has excellent docs. If you're unsure how a function works, look it up or try:
  ```python
  help(function_name)
  ```
  *or*
  ```python
  function_name?
  ```

- 🧪 **Experiment in small steps**: Try things in small code cells. Print intermediate outputs to check what's happening.

- 💬 **Use error messages as clues**: They may seem scary, but they usually tell you exactly what went wrong and where.

- 🧠 **Google, StackOverflow and AI chatbots are all your friends**: Chances are someone else has had your problem before.

- 💡 **Try `dir(object)`**: It shows you everything that object can do. Great for exploring unknown structures.

- ✅ **Use comments and Markdown cells** to keep track of what your code is doing and why.


In [4]:
# Custom code I have developed
from SpatialBiologyToolkit import population_identification as pop_id
from SpatialBiologyToolkit import backgating, plotting, utils

import scanpy as sc
import anndata as ad
import numpy as np
import pandas as pd

from pathlib import Path
import os

# Matplotlib and seaborn for plotting
import matplotlib.pyplot as plt
import seaborn as sb

# Set up scanpy settings
sc.settings.verbosity = 3 # Gives more complete information 
sc.set_figure_params(dpi=100, dpi_save=200, figsize=(5, 5)) #Increase DPI for better resolution figures

<a href="anndata"></a>

---
# <span style="color:orange">Working with AnnData objects</span>

### <span style="color:orange">*📦 "What Is an AnnData Object?"*</span>

An **AnnData** object is the core data structure used in the `Scanpy` library and ecosystem for single-cell and spatial omics analysis. It's designed to efficiently store and manage large, high-dimensional datasets, like Imaging Mass Cytometry (IMC) data.

You can think of it like a supercharged spreadsheet or table, with:

- `X` → the main data matrix (e.g. marker intensities for each cell)
- `obs` → metadata for **observations** (rows), like cell IDs or cluster labels
- `var` → metadata for **variables** (columns), like marker names or channels
- `uns` → unstructured data (e.g. settings, plots, color maps)
- `obsm` / `varm` → multi-dimensional annotations, like UMAP coordinates or PCA loadings

AnnData makes it easy to run complex analysis workflows while keeping your data and metadata organized in one place. AnnData objects are also easy to store to disk, allowing us to keep the majority of our analyses in one file.

## 💾 <font color=orange>Saving and Loading AnnData Objects

When working with `AnnData` in Scanpy, it's important to keep backups, especially after making substantial changes to clustering, annotations, or filtering steps.

### ⚠️ <font color=orange>Important Warnings

- **`AnnData` files without the code that made them can be hard to interpret**  
  The populations you create (via clustering and merging) should be reproducible. If you only have the saved `AnnData` object, you have no idea how the populations were created!
- **Be careful not to overwrite important files!**  
  If you save to the same filename multiple times, it will overwrite **without warning**.

- **Watch out for accidental overwrites in memory**  
  For example, running:
  ```python
  adata = sc.read("old_file.h5ad")
  ```
  will replace the current `adata` object, losing any unsaved changes from the current session.

- **Use versioned filenames** like:
  ```
  adata_preQC.h5ad
  adata_postQC.h5ad
  adata_clustered_v1.h5ad
  ```

  This helps avoid confusion and keeps a clear record of your workflow.
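
One way to make versioned filenames automatic is a small helper that picks the next unused version number. The function below is a hypothetical convenience written for this example (it is not part of the toolkit), and it uses only the standard library.

```python
from pathlib import Path

def next_version_path(folder, stem, ext=".h5ad"):
    """Return the first '<stem>_v<N><ext>' path in `folder` that does not exist yet."""
    folder = Path(folder)
    version = 1
    while (folder / f"{stem}_v{version}{ext}").exists():
        version += 1
    return folder / f"{stem}_v{version}{ext}"

# Usage (assumes `adata` is already loaded):
# adata.write_h5ad(next_version_path(".", "adata_clustered"))
```

Because the helper never returns a path that already exists, saving through it cannot silently overwrite an earlier version.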

### <font color=orange>Loading AnnData from Disk
<font color='red'>⚠️**WARNING** - This will overwrite any AnnData you are currently working on (unless you give it a different name)!</font>

To load a saved AnnData object:

In [12]:
adata = ad.read_h5ad('anndata_processed.h5ad')

### <font color=orange>Saving AnnData to Disk
<font color='red'>⚠️**WARNING** - This will overwrite your saved AnnData (unless you give it a different name)!</font>

This will write a `.h5ad` file (Hierarchical Data Format) to disk. You can then reload it later without re-running previous steps.

In [None]:
# write_h5ad is a method of the AnnData object; it returns None, so do not assign its result
adata.write_h5ad('anndata_saved.h5ad')

## 🧾 <font color=orange>Adding or Updating Sample-Level Metadata to AnnData

This code attaches **sample-level metadata** to your `AnnData` object using a dictionary file - typically stored at `metadata/dictionary.csv`. This is useful for grouping cells based on metadata like experimental condition, patient ID, batch, or tissue origin.

#### How the Code Works

- The dictionary file is read, where each row represents one sample (e.g., an ROI).
- It matches the `ROI` column in `adata.obs` to the metadata in the dictionary.
- Each column in the dictionary is added as a new column in `adata.obs`.
- If the dictionary file doesn't exist, a template CSV is created automatically from the `adata.obs` values. This file includes example fields you can edit manually in Excel or a text editor.

#### Dictionary File Format

The dictionary CSV should have:
- One row per ROI
- One column for each metadata field you want to include
- The index column named `ROI` (must match what's in `adata.obs['ROI']`)

Example:

| ROI   | ROI_name     | treatment | batch | passed_QC |
|-------|--------------|-----------|--------|-----------|
| ROI_1 | Sample_001   | DrugA     | 1      | TRUE      |
| ROI_2 | Sample_002   | Control   | 1      | FALSE     |
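
The matching step described above can be pictured with plain pandas. This is only a sketch of the idea on stand-in tables, not the implementation of `utils.update_sample_metadata`; the column values are taken from the example table.

```python
import pandas as pd

# Stand-in for adata.obs: one row per cell, with the ROI each cell came from
obs = pd.DataFrame({"ROI": ["ROI_1", "ROI_1", "ROI_2"]},
                   index=["cell_0", "cell_1", "cell_2"])

# Stand-in for metadata/dictionary.csv, indexed by ROI
dictionary = pd.DataFrame(
    {"treatment": ["DrugA", "Control"], "batch": [1, 1]},
    index=pd.Index(["ROI_1", "ROI_2"], name="ROI"),
)

# Look up each cell's ROI in the dictionary and copy every metadata column across
for column in dictionary.columns:
    obs[column] = obs["ROI"].map(dictionary[column])

print(obs)
```

Every cell inherits the metadata of its ROI, which is why the `ROI` values in the dictionary must match those in `adata.obs['ROI']` exactly.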


In [None]:
utils.update_sample_metadata(adata, 
                             dictionary_path="metadata/dictionary.csv")

---
# <span style="color:green">Plotting and visualising populations</span>

##  <font color=green>🗺️ Using UMAP to Explore and Quality Control Cell Populations

### <font color=green>What is UMAP?

**UMAP** (Uniform Manifold Approximation and Projection) is a dimensionality reduction technique. It helps us take high-dimensional data, like dozens of protein markers per cell, and project it down into **2D space**, where similar cells cluster together.

This is incredibly useful for **visualizing complex datasets**, like Imaging Mass Cytometry (IMC) data, and spotting patterns, clusters, or outliers in the data.

In Scanpy, we use:

```python
sc.pl.umap(adata, color='leiden_0.3')
```

This produces a 2D scatter plot of all cells, colored by a categorical observation (e.g. `leiden`, `ROI`, etc.).


### <font color=green>How UMAP Helps With Quality Control

UMAP is not just for pretty pictures; it's a critical **QC tool** for:

- **Assessing cluster separation:**  
  If your labeled populations are biologically distinct, they should generally form well-separated regions in the UMAP.

- **Spotting mislabeled or ambiguous cells:**  
  If a population overlaps heavily with others, it may not be clearly defined, or your panel may lack the markers needed to distinguish it.

- **Detecting batch effects or ROI separation:**  
  By coloring the UMAP by `ROI`, `sample`, or `batch`, you can check whether technical variation is introducing unwanted structure.

- **Finding doublets or outliers:**  
  Small, scattered groups or "stray" cells may indicate low-quality segmentation or rare/unexpected populations.

### <font color=green>⚙️ Key Parameters for `sc.pl.umap`

The `sc.pl.umap()` function has several parameters that affect the output and interpretation:

| Parameter | Description | Example |
|----------|-------------|---------|
| `color` | What to color cells by (can be one or more obs columns or gene names) | `'leiden'`, `'bulk_labels'`, `['marker1', 'marker2']` |
| `size` | Size of the dots (cells) | `size=10` makes bigger dots |
| `palette` | Custom color scheme (useful for population labels) | `palette='Set2'`, or pass a list of colors |
| `groups` | Subset of groups to plot (from the `color` column) | `groups=['T cells', 'Macrophages']` |
| `legend_loc` | Where to place the legend | `'on data'`, `'right margin'`, `'none'` |
| `frameon` | Show/hide border around the plot | `frameon=False` for a cleaner look |
| `title` | Custom title | `title='UMAP by Cell Type'` |

🧪 Example with multiple QC views:

```python
sc.pl.umap(adata, color=['leiden', 'bulk_labels', 'ROI'], size=10, legend_loc='right margin')
```

### <font color=green>Interpreting UMAP Carefully

> ⚠️ **Important note:** UMAP preserves **local similarity**, not exact distances or sizes. This means:

- Cells that are close together in UMAP have similar expression profiles.
- But: cells that are far apart are **not necessarily** very different.
- The shape and spacing of clusters may change depending on preprocessing (e.g. scaling, PCA, neighbors).


📘 [Scanpy UMAP Documentation](https://scanpy.readthedocs.io/en/stable/generated/scanpy.pl.umap.html)


### <font color=green>*Example: Plotting a population*

In [None]:
sc.pl.umap(adata, 
           color='leiden_0.3', 
           size= 20,
           save='leiden_0.3.png')

### <font color=green>*Example: Plotting all markers*
We can also visualise how each marker in our panel maps onto the cells. The example below plots every marker in one large multi-panel figure.

In [None]:
# This creates folders to save our files into
figure_dir=Path('Figures','UMAPs')
os.makedirs(figure_dir, exist_ok=True)

# This will plot a UMAP for each of the individual markers
fig = sc.pl.umap(adata, color=adata.var_names.tolist(), ncols=4, size=10, return_fig=True)
fig.savefig(Path(figure_dir, 'Marker_UMAPS.png'), bbox_inches='tight', dpi=300)

## <font color=green>Using MatrixPlot for Marker Expression and Cluster Annotation

### <font color=green> What is MatrixPlot?

`sc.pl.matrixplot()` is a powerful Scanpy function for summarizing how different **markers** (e.g., protein intensities or gene expression) behave across **groups of cells**, such as Leiden clusters or manually assigned population labels.

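It generates a heatmap-style plot where, by default:
- Rows represent **groups** (e.g., clusters)
- Columns represent **markers**
- Each tile shows the **average expression** of that marker in that group

Setting `swap_axes=True` transposes the plot so that markers are rows and groups are columns.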

This makes it an ideal tool for **quality control**, **cluster interpretation**, and **cell type identification**.

### <font color=green>How It Helps With Population Labeling

After clustering your data (e.g., with Leiden), you get a list of numbered groups. But what do these numbers mean biologically?

By using `sc.pl.matrixplot()`, you can:
- Compare expression of key markers across clusters
- Identify marker combinations characteristic of known cell types
- Spot clusters that look similar and may need merging
- Discover unexpected or rare populations based on unique expression

This is a major step in the *"detective work"* of assigning real biological labels to clusters.
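
The numbers behind such a plot are simple to reproduce. The sketch below computes the mean expression per cluster and then min-max scales each marker, which is my understanding of what `standard_scale='var'` does; the toy values and marker names are hypothetical.

```python
import pandas as pd

# Hypothetical per-cell marker values plus a cluster label
cells = pd.DataFrame({
    "CD3":    [0.9, 0.7, 0.1, 0.1],
    "CD68":   [0.1, 0.1, 0.8, 0.6],
    "leiden": ["0", "0", "1", "1"],
})

# Mean expression of each marker within each cluster
means = cells.groupby("leiden").mean()

# Min-max scale each marker (column) across clusters to the 0-1 range
scaled = (means - means.min()) / (means.max() - means.min())

print(means)
print(scaled)
```

After scaling, every marker spans the full colour range, so a dim but specific marker is as visible as a bright one. The trade-off is that absolute intensities are no longer comparable between markers.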

### <font color=green>⚙️ Key Parameters in `sc.pl.matrixplot`

Here are some of the most useful options:

| Parameter | Description | Example |
|----------|-------------|---------|
| `var_names` | List of markers (genes or proteins) to plot | `['CD3', 'CD68', 'CD8']` or  `adata.var_names` for all in the dataset |
| `groupby` | Observation column to group cells by | `'leiden_0.1'`, `'bulk_labels'`, `'ROI'` |
| `standard_scale` | Standardizes values to a 0 to 1 range along one dimension | `'var'` (scales within each marker), `'group'` (scales within each cluster) |
| `cmap` | Which named [matplotlib colourmap](https://matplotlib.org/stable/gallery/color/colormap_reference.html#colormap-reference) to use | `'viridis'`, `'bwr'`, `'coolwarm'` |
| `swap_axes` | Transpose the plot (markers as rows, clusters as columns) | `True` or `False` |
| `dendrogram` | Show a dendrogram to cluster similar groups | `True` |
| `vmax` | Set the maximum value for visualisation | `0.8` |
| `vmin` | Set the minimum value for visualisation | `0.1` |
| `save` | Name of figure, which will be saved in Scanpy's figure folder (`figures/` by default) | `'populations.png'` |

### <font color=green>Best Practices

- Start with your most informative or canonical markers for each lineage
- Standardize values to avoid skew from large absolute intensities
- Use `dendrogram=True` to automatically group similar clusters

📘 [Scanpy MatrixPlot Documentation](https://scanpy.readthedocs.io/en/stable/generated/scanpy.pl.matrixplot.html)


In [None]:
sc.pl.matrixplot(adata,
                 var_names=adata.var_names, # This will use all the variables in the dataset
                 standard_scale='var', # This will scale within markers
                 groupby='leiden_0.3',
                 dendrogram=True)

## <font color=green>Visualizing Population Composition with `grouped_graph()`

### <font color=green>What is `grouped_graph()`?

The `grouped_graph()` function is a custom utility designed to help visualize cell population distributions from an `AnnData` object. It allows you to create clear, informative bar plots (either grouped or stacked) to show how different populations (like cell types or clusters) are distributed across experimental conditions, ROIs, or other metadata categories.

This function is particularly useful for:
- Quality control across regions or samples
- Comparing cell-type proportions between conditions
- Highlighting enrichment or depletion of specific populations

It can also optionally display or export the underlying data table used for plotting.

### <font color=green>What Does It Do?

Under the hood, the function uses `pandas.crosstab()` to compute a table of cell counts or proportions for each population across categories like `ROI`, `sample`, or `condition`. It then reshapes the data and visualizes it using:
- `seaborn.barplot()` for grouped bars
- `pandas.DataFrame.plot(..., stacked=True)` for stacked bars

It supports flexible customization, normalization, scaling, and error bars.
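
For reference, the counting step can be reproduced directly with `pandas.crosstab`. The sketch below uses a stand-in for `adata.obs` with invented labels; it illustrates the kind of table being plotted, not the internals of `grouped_graph()`.

```python
import pandas as pd

# Stand-in for adata.obs with one row per cell
obs = pd.DataFrame({
    "ROI":        ["ROI_1", "ROI_1", "ROI_1", "ROI_2", "ROI_2"],
    "population": ["T cell", "T cell", "Macrophage", "Macrophage", "Macrophage"],
})

# Raw cell counts per ROI and population
counts = pd.crosstab(obs["ROI"], obs["population"])

# Row-normalised version: each ROI sums to 1
proportions = pd.crosstab(obs["ROI"], obs["population"], normalize="index")

print(counts)
print(proportions)
```

Calling `counts.plot(kind="bar", stacked=True)` on such a table gives a basic stacked bar chart, which is handy for a quick check before producing a polished figure.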

### <font color=green>⚙️ Key Parameters

| Parameter | Description |
|----------|-------------|
| `adata` | The `AnnData` object containing your single-cell or IMC data |
| `group_by_obs` | Observation column used for color/hue (e.g. `'cell_type'`, `'leiden'`) |
| `x_axis` | Column on the x-axis (e.g. `'ROI'`, `'condition'`, `'sample'`) |
| `proportions` | If `True`, bars show proportions instead of raw counts |
| `stacked` | If `True`, creates a stacked bar plot |
| `scale_factor` | Normalizes values (e.g., divide by tissue area to get cells/mm²) |
| `sort_by_population` | Reorders x-axis by the abundance of a specific group |
| `log_scale` | If `True`, y-axis uses a log scale for better visibility |
| `save_graph` | Path to save the figure (e.g. `'my_plot.png'`) |
| `save_table` | Path to save the crosstab data (e.g. `'data.csv'`) |


### <font color=green>Best Practices

- Use `proportions=True` to compare relative abundance across samples
- Try `scale_factor` if you want to normalize by tissue size or area
- Use `log_scale=True` if your groups differ by orders of magnitude
- Turn on `display_tables=True` to inspect the raw data used for plotting
- Save results with `save_graph='filename.png'` and `save_table='data.csv'`

In [None]:
plotting.grouped_graph(adata,
                     group_by_obs='leiden_0.3',
                     x_axis='ROI',
                     proportions=False,
                     stacked=True, 
                     fig_size=(8, 5), 
                     display_tables=False
                     )

## <font color=green>Visualizing Cell Populations and Marker Expression in Tissue

### <font color=green>What is `obs_to_mask()`?

The `obs_to_mask()` function (called as `plotting.obs_to_mask_3` in this version of the toolkit) maps values from your `AnnData` object back onto the tissue layout using a segmentation mask image. This allows you to visualize categorical labels (e.g. cell types or clusters) and quantitative marker expression spatially.

It supports multiple visualization layers, applied in order:

1. **Inner fill**: each cell is colored based on a categorical label (`cat_obs`) or numeric value (`quant_obs`).
2. **Separator**: **optional** uniform border around each cell.
3. **Outline**: **optional** outline colored by a second categorical variable (e.g. tissue compartment).

The layering order can be customized, and the output can be saved as a PNG or SVG (for publication-quality vector graphics).

### <font color=green>⚙️ Key Parameters

| **Parameter** | **Description** |
|---------------|-----------------|
| `roi` | The Region of Interest to plot |
| `cat_obs` | Categorical variable (e.g. cluster) for filling cells |
| `quant_obs` | Quantitative variable (e.g. marker expression) for heatmap |
| `label_obs` | Optional column for cell label index in mask |
| `masks_folder` | Folder containing segmentation masks (default: `Masks/`) |
| `cat_colour_map`, `quant_colour_map` | Colormaps for category or numeric values |
| `background_color` | Set to `'white'`, `'black'`, or a hex code. If `None`, the background is transparent |
| `save_path` | If set, the image will be saved as PNG or SVG |
| `separator_color` | Adds a uniform border around all cells |
| `separator_thickness` | Thickness in pixels for the separator |
| `outline_cat_obs` | A second categorical variable used to color cell outlines |
| `outline_thickness` | Thickness in pixels for outlines |
| `layers_order` | Order in which to draw layers. Default is `['inner','separator','outline']` |
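
Conceptually, painting a label onto the tissue is a lookup from each mask pixel to the cell it belongs to, and from that cell to the colour of its category. The sketch below shows that idea on a tiny hypothetical mask using plain Python lists. It assumes that pixel value `0` is background and every other value is a cell label, which is a common convention for segmentation masks but is an assumption here; it is not a description of how `obs_to_mask()` is implemented.

```python
# Hypothetical 4x4 label mask: 0 = background, 1..3 = cell labels.
mask = [
    [0, 1, 1, 0],
    [0, 1, 1, 2],
    [3, 3, 0, 2],
    [3, 3, 0, 0],
]

# Hypothetical per-cell categories (what a column in adata.obs would hold).
cell_to_cluster = {1: "0", 2: "1", 3: "0"}

# Hypothetical colour per category, as RGB tuples in the 0-1 range.
cluster_to_colour = {"0": (1.0, 0.0, 0.0), "1": (0.0, 0.0, 1.0)}
background = (1.0, 1.0, 1.0)  # white background

def paint(mask, cell_to_cluster, cluster_to_colour, background):
    """Return an image (rows of RGB tuples) coloured by each cell's category."""
    image = []
    for row in mask:
        painted_row = []
        for label in row:
            if label == 0:
                painted_row.append(background)
            else:
                painted_row.append(cluster_to_colour[cell_to_cluster[label]])
        image.append(painted_row)
    return image

image = paint(mask, cell_to_cluster, cluster_to_colour, background)
print(image[1])
```

Cells 1 and 3 share cluster `'0'`, so they come out in the same colour even though they are separate objects in the mask.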

In [23]:
plotting.obs_to_mask_3?

Signature:
plotting.obs_to_mask_3(
    adata,
    roi: str,
    roi_obs: str = 'ROI',
    check_cell_numbers: bool = False,
    cat_obs: str = None,
    cat_colour_map='tab20',
    cat_obs_groups=None,
    quant_obs: str = None,
    quant_colour_map: str = 'viridis',
    quant_exclude_background: bool = True,
    adata_colormap: bool = True,
    masks_folder: str = 'Masks',
    masks_ext: str = 'tif',
    min_val: float = None,
    max_val: float = None,
    quantile: float = None,
    save_path: str = None,
    background_color: str = None,
    hide_axes: bool = False,
    hide_ticks: bool = True,
    svg_smoothing_factor: int = 0,
    dpi: int = 

### <font color=green>Visualizing Populations in Tissue

This colors each segmented cell by its Leiden cluster for all ROIs in your dataset, outlining cells in black.

In [13]:
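# Optional: restrict the plot to a single cluster (here, cluster '2').
# Note that this overwrites `adata` with the subset, so later cells only see these cells.
adata = adata[adata.obs['leiden_0.3'].isin(["2"])]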

In [14]:
# Plot the first two ROIs in the dataset (remove the [:2] slice to plot every ROI)
for roi in adata.obs.ROI.unique().to_list()[:2]:
    
    plotting.obs_to_mask(adata = adata,
                         roi = roi,
                         cat_obs = 'leiden_0.3',
                         save_path = f'Population_images/{roi}.png', # using .svg will save as a vector instead
                         background_color='white',
                         separator_color='black')



### <font color=green>Overlaying Marker Expression
This example generates a heatmap over the tissue based on the expression of a specific marker for all ROIs in your dataset.

In [None]:
# Plot for each ROI in dataset
for roi in adata.obs.ROI.unique().to_list():
    
    plotting.obs_to_mask(adata = adata,
                         roi = roi,
                         quant_obs = 'PanCK', # or any other marker from your data
                         quant_colour_map='Reds',
                         save_path = f'Marker_images/{roi}.png', # using .svg will save as a vector instead
                         background_color='black')

## <font color=green>Performing a Backgating Assessment with `backgating_assessment()`

### <font color=green>Why do we need to do backgating?

When clustering cell populations using single-cell tools like Scanpy, we rely solely on the mean cellular expression of each marker to group cells into types. However, we lose the visual context: what those cells looked like in the original image, and whether their morphology and surrounding tissue make sense for their assigned type.

That’s where backgating comes in. Backgating is a crucial validation step that:

- Samples cells from each assigned population
- Reconstructs their context in the original image
- Overlays optional segmentation masks
- Helps visually confirm if cells look correct to expert users (e.g., a CD3⁺ T cell should “look like” one)

This builds trust in the data and helps catch annotation or segmentation issues early (before reviewers do!).

### <font color=green>What does `backgating_assessment` do?

This function is a high-level utility that orchestrates the full backgating pipeline. It:

1. **Computes population-level marker expression** (if needed)  
   - The function can compute the mean expression of every marker per population and save it to a “mean expression” CSV (e.g., `markers_mean_expression.csv`).  

2. **Selects top markers per population** (or lets you override them) 
   - By default, the function can pick the **top N** markers for each population (where N can be 1, 2, or 3) and automatically assign them to R/G/B channels for easy visualization.
   - You can override any channel with a user-specified marker (e.g., `specify_red='CD3'`).

3. **Assigns markers to RGB channels** for visualization and **clips and rescales intensities** based on global or per-ROI statistics
   - It can create or update a “backgating settings” CSV (e.g., `backgating_settings.csv`) that records which markers are displayed in the Red, Green, and Blue channels, along with optional intensity range settings for each population. You can edit this file on disk to change how the minimum and maximum values are chosen for each channel, for each population. Numeric values (i.e., absolute numbers) correspond to pixel values (i.e., counts for IMC). Alternatively, a quantile setting calculates the value automatically:
   
   - **`"q0.97"`**: Use the *mean* of the 97th-percentile intensities across all ROIs.
   - **`"i0.97"`**: Each ROI is clipped to its own 97th-percentile (so every ROI has potentially different max).
   - **`"m0.97"`**: Use the *minimum* of the 97th-percentile intensities across ROIs.
   - **`"x0.97"`**: Use the *maximum* of the 97th-percentile intensities across ROIs.

   > After clipping, intensities are automatically **rescaled** so the new minimum and maximum become `0` and `1`, respectively.

4. **Samples cells from each population**
   - For each population (in `pop_obs`), you can specify a number of cells to sample (e.g., 50 per population).  
   - The function extracts these cells’ coordinates and uses them to create small “thumbnails” from your raw image data.

5. **Overlays cells on images** (with optional segmentation masks) and **Creates visual thumbnail galleries** and overview images
   - Internally, it calls a helper function (e.g., `backgating`) that loads/creates composite images of each ROI.  
   - **Masks (Optional)**: If provided, the function can look for segmentation masks in a user-specified folder (or from a CSV mapping ROI->mask file) and overlay boundary lines around the center cell in each thumbnail.

6. **Final Output**  
   - A set of **PNG images** showing each selected cell (thumbnails).  
   - An **overview** image per ROI with bounding boxes for each cell, if you choose.  
   - Two **CSV files**: one for mean expression (if computed), and another for the final backgating settings (marker assignments, intensity ranges, etc.).  
   - A **`cells_list.csv`** showing which cells were plotted in the thumbnails.

You can run it in different **modes**:
- `'full'`: Compute mean expression, generate marker settings, and produce images.
- `'save_markers'`: Just compute and save settings (no imaging).
- `'load_markers'`: Load previously saved settings and generate images only.
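
The quantile prefixes described in step 3 above (`q`, `i`, `m`, `x`) are easier to reason about with numbers in front of you. Below is a plain-Python sketch of how such a setting could be resolved into a clipping maximum per ROI, followed by the clip-and-rescale step. The helper names (`percentile`, `resolve_max`, `clip_and_rescale`), the linear-interpolation percentile and the toy pixel values are all assumptions made for illustration; the package's own code may differ in detail.

```python
def percentile(values, q):
    """Linear-interpolation percentile of a list, with q between 0 and 1."""
    ordered = sorted(values)
    position = q * (len(ordered) - 1)
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])

def resolve_max(setting, roi_pixels):
    """Turn a setting such as 200.0 or 'q0.97' into one maximum per ROI."""
    if isinstance(setting, (int, float)):
        # A plain number is used directly as the maximum for every ROI.
        return {roi: float(setting) for roi in roi_pixels}
    mode, q = setting[0], float(setting[1:])
    per_roi = {roi: percentile(px, q) for roi, px in roi_pixels.items()}
    if mode == "i":    # each ROI keeps its own percentile
        return per_roi
    if mode == "q":    # mean of the per-ROI percentiles
        shared = sum(per_roi.values()) / len(per_roi)
    elif mode == "m":  # minimum of the per-ROI percentiles
        shared = min(per_roi.values())
    elif mode == "x":  # maximum of the per-ROI percentiles
        shared = max(per_roi.values())
    else:
        raise ValueError(f"Unknown max_quantile prefix: {mode!r}")
    return {roi: shared for roi in roi_pixels}

def clip_and_rescale(pixels, minimum, maximum):
    """Clip to [minimum, maximum], then rescale so they map to 0 and 1."""
    span = maximum - minimum
    return [(min(max(p, minimum), maximum) - minimum) / span for p in pixels]

# Hypothetical pixel intensities for one marker in two ROIs.
roi_pixels = {"ROI_1": [0, 1, 2, 3, 4], "ROI_2": [0, 2, 4, 6, 8]}

# The 50th percentile keeps the toy numbers easy to check by hand.
for setting in ("q0.5", "i0.5", "m0.5", "x0.5", 200.0):
    print(setting, resolve_max(setting, roi_pixels))

print(clip_and_rescale(roi_pixels["ROI_1"], minimum=1, maximum=3))
```

The practical difference is that `i` makes every ROI look equally bright regardless of true expression, whereas `q`, `m` and `x` apply one shared maximum, so brightness stays comparable between ROIs.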

## <font color=green>Best Practices

- Start with `'save_markers'` mode to check which markers are being used.
- Edit the `backgating_settings.csv` manually if needed (e.g., to fine-tune marker intensity ranges).
- Always inspect the thumbnail and overview images to visually validate the clustering results.
- Use segmentation masks when possible; they’re critical for confirming cell boundaries.
- Adjust quantile settings for `min/max` if the intensity scaling looks off.

## <font color=green>Outputs

- `markers_mean_expression.csv`: Mean expression per population.
- `backgating_settings.csv`: Which marker is used for Red, Green, and Blue channels, and their intensity settings.
- `Backgating/`: A folder containing:
  - `Cells.png`: Thumbnails of sampled cells
  - `*_overview.png`: Whole-ROI images with boxes
  - One subfolder per population with output images
- `cells_list.csv`: List of all sampled cells and metadata.


In [None]:
backgating.backgating_assessment(adata=adata,                          
                                  image_folder='images', # This is the default folder for denoised images
                                  pop_obs='leiden_0.3',
                                  pops_list=None,  #None will do all populations
                                  cells_per_group=5,
                                  use_masks=True,
                                  minimum=0.2, # This is usually a sensible minimum 
                                  max_quantile='q0.99',
                                  number_top_markers=2, # We are setting the DNA to be blue, so automatically calculate only 2 markers for red and green.
                                  specify_blue='DNA1', # Sets blue to always be DNA
                                  output_folder='Backgating_results',
                                  show_gallery_titles=False
                                 )

---
# <span style="color:blue">Clustering</span>

## <font color=blue>Overview

In this section, we cover the key tools and strategies you can use to cluster and assign cells to populations. If you have received this data with preprocessing already done, then some initial clustering will already have been done to get you started. Clustering is a core part of the IMC analysis workflow and allows you to group similar cells together based on their molecular profiles (in this case, protein expression).

There are multiple aspects covered, including:
- Performing a completely new round of clustering from the raw data.
- Subclustering existing populations to reveal finer substructure.
- Merging or editing existing clusters based on prior knowledge or clustering results.

### <font color=blue>How Clustering Works in AnnData

All clustering results are stored in the **AnnData object**, typically in the `.obs` attribute. This is a **table (like a DataFrame)** where:

- **Each row** corresponds to a single cell.
- **Each column** holds metadata about that cell, such as:
  - What patient it came from
  - What Region of Interest (ROI) it came from
  - What **cluster** it belongs to

When you perform clustering, you either generate a **new column in `.obs`** (e.g. `leiden_1.0`, `louvain_clusters`) or **overwrite an existing one**.

Think of `.obs` as your master record of each cell's identity and annotations.

### <font color=blue>When and Why to Rerun Clustering

You might want to:
- Improve resolution by adjusting parameters (e.g., `resolution` in Leiden).
- Refine populations by subsetting specific cell types.
- Compare clustering approaches to assess stability.

### <font color=blue>Best Practices

- Always record which parameters you used (resolution, method, embedding).
- Name your cluster columns clearly (e.g., `leiden_1.0_subclust_immune`) to avoid confusion. After several rounds of subclustering or merging, it is easy to lose track of exactly how you arrived at your final populations.
- Consider saving `adata.obs` to CSV regularly to track changes or debug.
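
One lightweight way to follow the first two points is to build column names from the parameters themselves and to keep a small log of every run. The sketch below is only a suggestion in plain Python: `cluster_key`, `record_run` and `clustering_log` are hypothetical helpers, not part of the package or of Scanpy.

```python
import json

def cluster_key(method, resolution, subset=None):
    """Build a descriptive column name such as 'leiden_1.0_subclust_immune'."""
    name = f"{method}_{resolution}"
    return f"{name}_subclust_{subset}" if subset else name

# Hypothetical log of clustering runs, one entry per new column in adata.obs.
clustering_log = []

def record_run(method, resolution, subset=None, **extra):
    """Append the parameters of a clustering run to the log and return its key."""
    key = cluster_key(method, resolution, subset)
    clustering_log.append(
        {"key": key, "method": method, "resolution": resolution,
         "subset": subset, **extra}
    )
    return key

first = record_run("leiden", 0.3)
second = record_run("leiden", 1.0, subset="immune", restrict_to="5")

# Serialise the log so it can be saved next to the AnnData file.
log_text = json.dumps(clustering_log, indent=2)
print(first, second)
print(log_text)
```

The returned key can be passed straight to `key_added`, so the column name and the logged parameters cannot drift apart.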


### <font color=blue>Where to Learn More

- 📖 [Scanpy Clustering Docs](https://scanpy.readthedocs.io/en/stable/api/scanpy.tl.leiden.html)
- 📖 [AnnData Documentation](https://anndata.readthedocs.io/)
- 📖 [Single-cell best practices eBook](https://www.sc-best-practices.org/cellular_structure/clustering.html)


## <font color=blue> Performing a Completely New Round of Clustering using `sc.tl.leiden`

Leiden clustering is one of the most commonly used community detection algorithms in single-cell analysis. In the context of Imaging Mass Cytometry (IMC), we use it to assign each cell to a group (or population) based on similarity in marker expression.

### <font color=blue>What is `sc.tl.leiden`?

`sc.tl.leiden()` is a Scanpy function that performs **Leiden community detection** on a graph representation of your data. This groups cells that are similar to each other in a low-dimensional space, typically after PCA and neighborhood graph construction.


### <font color=blue>Key Parameters

- `resolution`: Controls the number of clusters. Higher values → more clusters.
- `key_added`: Name for the new column in `adata.obs`. Use clear, descriptive names.
- `n_neighbors`, `n_pcs`: These are arguments to `sc.pp.neighbors` rather than `sc.tl.leiden`; they shape the neighborhood graph that Leiden clusters on. You can tune them for better results depending on your data, but I often leave these alone.


### <font color=blue>Best Practices

- Track your versions: always give new cluster labels a unique name (e.g. `'leiden_0.5'`, `'leiden_subclust_CD8s'`), unless you want to overwrite an old cluster.
- Use all the plotting tools to QC your clusters.
- Experiment with different resolutions to find the one that best captures biological meaning.

### <font color=blue>Troubleshooting

- Too few clusters? Try increasing `resolution` or check your `n_neighbors`.
- Noisy results? Consider filtering cells, scaling properly, or reducing `n_pcs`.
- Bizarrely, although Leiden numbers the populations, the labels are actually stored as strings!
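
The string labels catch people out in two ways: comparisons against integers quietly match nothing, and sorting puts `'10'` before `'2'`. The plain-Python sketch below uses a hypothetical list of labels to show both. The same reasoning applies when filtering `adata.obs`, which is why this notebook writes `isin(["2"])` rather than `isin([2])`.

```python
# Cluster labels as Leiden stores them: strings, not integers.
labels = ["0", "2", "10", "1", "2", "10"]

# Comparing against an integer silently matches nothing.
n_wrong = sum(label == 2 for label in labels)
n_right = sum(label == "2" for label in labels)

# Sorting strings puts '10' before '2'; sort by integer value instead.
alphabetical = sorted(set(labels))
numeric = sorted(set(labels), key=int)

print(n_wrong, n_right)  # 0 2
print(alphabetical)      # ['0', '1', '10', '2']
print(numeric)           # ['0', '1', '2', '10']
```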

### <font color=blue>Learn More

- [Leiden in Scanpy Docs](https://scanpy.readthedocs.io/en/stable/generated/scanpy.tl.leiden.html)
- [Scanpy Tutorial](https://scanpy-tutorials.readthedocs.io/en/latest/)
- [Leiden algorithm paper](https://www.nature.com/articles/s41598-019-41695-z)

In [None]:
sc.tl.leiden(adata,
             resolution=0.5,
             key_added='leiden_0.5')

## <font color=blue>Subclustering Using `sc.tl.leiden`

Sometimes, after your initial clustering, you may notice that one or more populations still appear heterogeneous. Subclustering allows you to focus on a subset of cells (e.g., a single population) and apply the clustering algorithm again, revealing finer substructure. This is commonly used to explore diversity within broad populations, e.g., within T cells or myeloid groups.

### <font color=blue>How to Subcluster

In **Scanpy**, subclustering can be performed without subsetting the entire `AnnData` object. Instead, you can use the `restrict_to` argument in `sc.tl.leiden`, which restricts clustering to specific groups within an `.obs` column.

### <font color=blue>*Example*
In the example below, we are performing subclustering on the population `'5'` from the `'leiden_0.3'` clustering column in `adata.obs`. The results are saved in `adata.obs` under the column indicated by `key_added`, in this case `'myeloid_subclust'`. Helpfully, all the labels from the original clustering (in this case, `'leiden_0.3'`) that we didn't subcluster on get transferred over into the new column. This may sound hard to get your head around, but it's fairly intuitive once you start using it.
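
To picture what ends up in the new column, here is a plain-Python sketch with hypothetical toy labels. To my understanding, Scanpy names the new subclusters by prefixing the parent label (for example `'5,0'` and `'5,1'`), but treat that naming as an assumption and check the categories in your own output.

```python
# Original clustering, one label per cell (hypothetical toy data).
leiden_03 = ["0", "5", "5", "1", "5", "0"]

# Hypothetical subcluster assignments for the cells in population '5',
# listed in the order those cells appear above.
sub_labels = iter(["0", "1", "0"])

parent = "5"

# Cells in the parent population get a combined label; all others keep theirs.
myeloid_subclust = [
    f"{parent},{next(sub_labels)}" if label == parent else label
    for label in leiden_03
]

print(myeloid_subclust)
```

Populations `'0'` and `'1'` carry over unchanged, while the cells that were in `'5'` are split into the new subclusters.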

In [None]:
sc.tl.leiden(
    adata,
    restrict_to=('leiden_0.3', ['5']),
    resolution=0.5,
    key_added='myeloid_subclust'
)

## <font color=blue>Merging and Labelling Clusters

After performing Leiden clustering, the results are typically stored in `adata.obs` as numerical groupings (e.g. '0', '1', '2'...) that don't carry any biological meaning. To make your data interpretable, it's essential to assign meaningful population labels, and optionally merge clusters that represent the same biological group.

### <font color=blue>Why Merging and Labeling is Important

- Leiden clustering is unsupervised and agnostic to biology: it doesn’t know what a “T cell” or “macrophage” is.
- Some clusters may be biologically redundant and should be merged.
- Others may need more descriptive names for clarity and downstream analysis.

#### <font color=blue> 1. `create_remapping()`

This function scans the values in a chosen column of `adata.obs` (e.g. `"leiden_0.3"`) and writes a CSV template that you can edit in Excel.

*Example*:

This creates a CSV file (e.g. `remapping_leiden_0.3.csv`) where:
- Each row is a Leiden cluster
- Each column is a new label (e.g. `"population"`, `"population_broad"`, `"hierarchy"`)

In [None]:
pop_id.create_remapping(adata, 'leiden_0.3')

#### <font color=blue>2. Edit the CSV

Open the file in Excel or any text editor and:
- Assign meaningful labels to clusters (e.g. cluster 2 → "CD8_T_cells")
- To **merge** clusters, assign the same label to multiple rows.

Example:

| leiden_1.0 | population   |
|------------|--------------|
| 2          | CD8_T_cells  |
| 5          | CD8_T_cells  |
| 7          | Macrophages  |

#### <font color=blue>Advice to fill in the spreadsheet to assign names to leiden clusters
By default, three new columns are added (*population, population_broad, hierarchy*), but you can call them whatever you like and add more columns/groups if you wish. As an example, let's say that, based upon the marker expression in our heatmap, we think that Leiden population `1` is exhausted CD8 T cells. We could use the `'population'` column to assign very specific names to the numbered Leiden populations (e.g. 'Exhausted CD8 T cells'), while `population_broad` is slightly less granular (e.g. 'CD8 T cells') and `hierarchy` is very broad (e.g. 'Lymphocytes'). If we think that several of the Leiden clusters are actually the same biological population and want to merge them, all we need to do is give them the same name in that column of the table, and all those cells will be assigned to the same group.
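
The effect of the edited spreadsheet can be sketched in plain Python: each label column becomes a lookup from cluster number to name, and clusters that share a name are merged automatically. The CSV layout below is an assumption based on the example table above, and the code is not the implementation of `read_remapping()`.

```python
import csv
import io

# Hypothetical contents of an edited remapping CSV; clusters 2 and 5 share a label.
edited_csv = """leiden_0.3,population,population_broad
2,CD8_T_cells,T_cells
5,CD8_T_cells,T_cells
7,Macrophages,Myeloid
"""

rows = list(csv.DictReader(io.StringIO(edited_csv)))

# Build one lookup per new column: cluster label -> population name.
remap = {
    column: {row["leiden_0.3"]: row[column] for row in rows}
    for column in ("population", "population_broad")
}

# Per-cell cluster labels (strings, as Leiden stores them).
cell_clusters = ["2", "7", "5", "2"]
population = [remap["population"][c] for c in cell_clusters]
population_broad = [remap["population_broad"][c] for c in cell_clusters]

print(population)
print(population_broad)
```

Cells from clusters `2` and `5` both end up as `CD8_T_cells`, which is all that merging means here.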

#### <font color=blue>3. `read_remapping()`

Once your spreadsheet is edited and saved, this function will:
- Map the new labels back into `adata.obs`
- Create one column per group (e.g. `'population'`, `'population_broad'`)
- Store those columns as `category` types
- Log the changes in `adata.uns['population_obs']` for reproducibility

In [None]:
pop_id.read_remapping(adata, 'leiden_0.3')

## <font color=blue>Simple merging of populations in `AnnData.obs`
The `merge_populations()` function allows you to combine two or more populations from an existing `adata.obs` column into a single group.

In [None]:
pop_id.merge_populations(adata, 
                          source_column='leiden_0.3', 
                          groups_to_merge=['0','1'],
                          new_label='MergedPopulation', 
                          new_column='leiden_0.3_merged')

## <font color=blue>Creating or modifying a colour map

The following function will allow you to view and select the colours for your new populations, saving the colour map in the `AnnData` object.

<font color='red'>**WARNING** - This can be unstable on some machines, so make sure your AnnData is saved</font>


In [None]:
pop_id.recolor_population(adata, 'leiden_0.3', save=False)

---
# Other useful tools...

### Overview: batch-making images with `make_images`

This function creates **composite RGB images** from raw channel images stored in subfolders, specifying a strategy to rescale all the markers so that min/max values are consistent (or not). Each **Region of Interest (ROI)** is in its own subfolder. You can map up to seven different color channels (Red, Green, Blue, Magenta, Cyan, Yellow, White) to any marker of interest.

1. **Loading the Images**  
   - For each channel (e.g., Red, Green, etc.), the function looks for the marker name in the filenames of your `.tif` images.
   - Only the ROIs listed in `samples_list` are used.

2. **Intensity Clipping**  
   Before combining channels into an RGB image, `make_images` clips and rescales each marker image, turning raw intensities into a `[0..1]` range.  
   - **`minimum`**: The lower bound for clipping (all values below are set to this).
   - **`max_quantile`**: A user-specified method for determining the upper bound. It can be:
     - A direct numeric value (e.g., `200.0`), or
     - A string prefix that tells the function how to calculate a max from quantiles:
       - **`"q0.97"`**: Use the *mean* of the 97th-percentile intensities across all ROIs.
       - **`"i0.97"`**: Each ROI is clipped to its own 97th-percentile (so every ROI has potentially different max).
       - **`"m0.97"`**: Use the *minimum* of the 97th-percentile intensities across ROIs.
       - **`"x0.97"`**: Use the *maximum* of the 97th-percentile intensities across ROIs.

   > After clipping, intensities are automatically **rescaled** so the new minimum and maximum become `0` and `1`, respectively.

3. **Combining into an RGB Image**  
   Once each marker is rescaled, the function merges them in an “additive” manner:
   - **Red channel** adds any Red, Magenta, Yellow, White channels.
   - **Green channel** adds Green, Cyan, Yellow, White channels.
   - **Blue channel** adds Blue, Magenta, Cyan, White channels.

4. **Output**  
   - A **`<ROI>.png`** file is saved for each ROI, storing the final composite.
   - You can also specify:
     - **`roi_folder_save=True`** to save each ROI’s `.png` in its own subfolder.
     - **`simple_file_names=True`** to output just `<ROI>.png` without channel info in the filename.
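
The additive merge in step 3 can be sketched for a single pixel in plain Python. The table of contributions follows the bullet points above; capping the summed value at `1.0` is an assumption of this sketch, and `combine_pixel` is a hypothetical helper rather than part of `make_images`.

```python
# Which output channels (R, G, B) each colour slot contributes to.
CONTRIBUTIONS = {
    "red": (1, 0, 0), "green": (0, 1, 0), "blue": (0, 0, 1),
    "magenta": (1, 0, 1), "cyan": (0, 1, 1), "yellow": (1, 1, 0),
    "white": (1, 1, 1),
}

def combine_pixel(intensities):
    """Additively merge rescaled (0-1) marker intensities into one RGB pixel.

    `intensities` maps a colour slot name to that marker's value at this pixel.
    Sums are capped at 1.0 here; that cap is an assumption of this sketch.
    """
    rgb = [0.0, 0.0, 0.0]
    for colour, value in intensities.items():
        for channel, weight in enumerate(CONTRIBUTIONS[colour]):
            rgb[channel] += weight * value
    return tuple(min(v, 1.0) for v in rgb)

print(combine_pixel({"red": 0.5, "yellow": 0.25, "blue": 1.0}))  # (0.75, 0.25, 1.0)
print(combine_pixel({"white": 0.75, "magenta": 0.5}))            # (1.0, 0.75, 1.0)
```

Because the channels add up, stacking several bright markers on the same output channel saturates quickly, which is worth bearing in mind when choosing which marker goes in which colour.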

In [None]:
backgating.make_images?

In [None]:
# This will get a list of all samples, but you can alternatively just specify which samples
all_samples = adata.obs['ROI'].unique().tolist()

backgating.make_images(
    image_folder='images',
    samples_list=all_samples,
    output_folder='Composite_Images',
    minimum=0.2,
    max_quantile='q0.97',
    red='Iba1', 
    green='Cd14',
    blue='DNA1'
)