#### This particular notebook includes a pipeline for generating filtered transcript files (to be used for QC comparisons) and saving a gene list csv file from a Xenium run. This example uses Xenium Dataset 3 slide 1.

#### Required input files:
* Raw transcript csv file
* Filtered cell-based data object

Note: If you have a raw transcripts.parquet file instead of a .csv.gz, you can use the included ConvertXeniumRawTranscriptsParquetToCsv.sh script to generate a CSV version.

Environment: Please create and activate the conda environment provided in default_env.yaml before running this notebook

#### Caution: This notebook is highly memory-intensive and may crash during processing. It is recommended to save intermediate outputs at key steps.

In [None]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
from adjustText import adjust_text

import seaborn as sns

import scanpy as sc
import squidpy as sq

import gzip
import anndata

import sys

from matplotlib.ticker import FuncFormatter

### Filtering Syntax Key

transcripts_raw = No filters applied (just automatic technology filters)

transcripts_initialfilter = Respective technology filters have been applied, as well as our additional QC filtering

transcripts_r = False positive and low quality transcripts have been removed, as well as the respective technology filters and our QC filtering

# Slide1

### Xenium Filtering Note

10x Genomics performs their own filtering to remove low quality transcripts (Q-score < 20) before outputting cell-associated data files (cell feature matrix) for analysis. These low quality transcripts are not removed from the transcript file.

# Data object

In [None]:
## Load in cell-based filtered data object (post-filtering, pre-normalization)

Int_XeniumICI480_Data = sc.read_h5ad('/path/DataObjects_withoutUMAP/XeniumICI480_concat_filtered_pre-normalization_240603.h5ad')

# If applicable, filter based on batch
Slide1_XeniumICI480_Data = Int_XeniumICI480_Data[Int_XeniumICI480_Data.obs['slide'] == 'slide_1'].copy()

display(Slide1_XeniumICI480_Data.obs)
display(Slide1_XeniumICI480_Data.obs['slide'])

In [None]:
## Calculate min, mean, and max values for Slide1_XeniumICI480_Data

total_counts_minvalue = Slide1_XeniumICI480_Data.obs["total_counts"].min()
total_counts_mean = Slide1_XeniumICI480_Data.obs["total_counts"].mean()
total_counts_maxvalue = Slide1_XeniumICI480_Data.obs["total_counts"].max()

n_genes_by_counts_minvalue = Slide1_XeniumICI480_Data.obs["n_genes_by_counts"].min()
n_genes_by_counts_mean = Slide1_XeniumICI480_Data.obs["n_genes_by_counts"].mean()
n_genes_by_counts_maxvalue = Slide1_XeniumICI480_Data.obs["n_genes_by_counts"].max()

cell_area_um2_minvalue = Slide1_XeniumICI480_Data.obs["cell_area"].min()
cell_area_um2_mean = Slide1_XeniumICI480_Data.obs["cell_area"].mean()
cell_area_um2_maxvalue = Slide1_XeniumICI480_Data.obs["cell_area"].max()

nucleus_area_um2_minvalue = Slide1_XeniumICI480_Data.obs["nucleus_area"].min()
nucleus_area_um2_mean = Slide1_XeniumICI480_Data.obs["nucleus_area"].mean()
nucleus_area_um2_maxvalue = Slide1_XeniumICI480_Data.obs["nucleus_area"].max()

print("Slide1_XeniumICI480_Data\n")

print(f"Min value of total_counts : {total_counts_minvalue}")
print(f"Mean of total_counts : {total_counts_mean}")
print(f"Max value of total_counts : {total_counts_maxvalue}\n")

print(f"Min value of n_genes_by_counts : {n_genes_by_counts_minvalue}")
print(f"Mean of n_genes_by_counts : {n_genes_by_counts_mean}")
print(f"Max value of n_genes_by_counts : {n_genes_by_counts_maxvalue}\n")

print(f"Min value of cell area (um2) : {cell_area_um2_minvalue}")
print(f"Mean of cell area (um2) : {cell_area_um2_mean}")
print(f"Max value of cell area (um2) : {cell_area_um2_maxvalue}\n")

print(f"Min value of nucleus area (um2) : {nucleus_area_um2_minvalue}")
print(f"Mean of nucleus area (um2) : {nucleus_area_um2_mean}")
print(f"Max value of nucleus area (um2) : {nucleus_area_um2_maxvalue}\n")
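
The min, mean, and max values printed above can also be collected into a single table. This is a minimal sketch, assuming the same four `.obs` columns used above. The `summarize_obs` helper and the toy `obs_example` frame are hypothetical and only for illustration.

```python
import pandas as pd


def summarize_obs(obs, columns):
    """Return min, mean, and max for each requested metadata column."""
    # One row per metadata column, one column per statistic
    return obs[columns].agg(["min", "mean", "max"]).T


# Toy stand-in for Slide1_XeniumICI480_Data.obs (values are made up)
obs_example = pd.DataFrame({
    "total_counts": [10, 20, 30],
    "n_genes_by_counts": [5, 8, 11],
    "cell_area": [40.0, 50.0, 60.0],
    "nucleus_area": [10.0, 15.0, 20.0],
})

summary = summarize_obs(
    obs_example,
    ["total_counts", "n_genes_by_counts", "cell_area", "nucleus_area"],
)
print(summary)
```

With the real object, `summarize_obs(Slide1_XeniumICI480_Data.obs, [...])` should give the same numbers as the individual print statements.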

In [None]:
## Calculate transcript-associated metrics

# Note: transcript_counts and total_counts should be the same (besides a minimal rounding error)

trc = Slide1_XeniumICI480_Data.obs["transcript_counts"].mean()
toc = Slide1_XeniumICI480_Data.obs["total_counts"].mean()

n_genes_by_counts_mean = Slide1_XeniumICI480_Data.obs["n_genes_by_counts"].mean()

print("Mean number of transcripts per cell for Slide1_XeniumICI480_Data")
print("")
print(f"Mean transcript_counts value of Slide1_XeniumICI480_Data : {trc}")
print(f"Mean total_counts value of Slide1_XeniumICI480_Data : {toc}\n")

print("Mean number of unique genes per cell")
print("")
print(f"Mean n_genes_by_counts of Slide1_XeniumICI480_Data : {n_genes_by_counts_mean}")

In [None]:
## Calculate total cell area

cell_area_um2_sum = Slide1_XeniumICI480_Data.obs['cell_area'].sum()

print(f"Cell area sum (um2) : {cell_area_um2_sum}")

The negative control values included in Xenium cell-based metadata:
* Control probe counts (probes that exist in the panel but target non-biological sequences)
* Control codeword counts (codewords that don't have any probes matching that code)
* Unassigned codeword counts (there is no probe in the panel that will generate the codeword)
* Deprecated codeword counts (assigned to codewords that are not used in the Xenium Onboard Analysis pipeline)

The percentage of these negative control values within the dataset can be calculated from Slide1_XeniumICI480_Data.obs


#### Reminder:

If you ran sc.pp.calculate_qc_metrics() on your Xenium cell-based data, the total_counts metadata column is recalculated as the total number of transcripts per cell from gene features only.

Before running sc.pp.calculate_qc_metrics() (or if you never ran it), the total_counts metadata column is the sum of the gene expression features, control_probe_counts, control_codeword_counts, and unassigned_codeword_counts.

This notebook uses the total_counts value that has been adjusted by sc.pp.calculate_qc_metrics() to include only the sum of the gene expression features.
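
One way to check which definition of total_counts an object carries is to compare the column against the per-cell row sums of the gene expression matrix. This is a minimal sketch that uses NumPy arrays as stand-ins. With a real object, the inputs would be `adata.X` and `adata.obs["total_counts"]`. The function name `total_counts_is_gene_only` and the toy values are hypothetical.

```python
import numpy as np


def total_counts_is_gene_only(gene_matrix, total_counts, atol=1e-6):
    """True if total_counts equals the per-cell sum of the gene features only."""
    # np.asarray(...).ravel() also handles the matrix returned by sparse .sum()
    per_cell_gene_sum = np.asarray(gene_matrix.sum(axis=1)).ravel()
    return bool(np.allclose(per_cell_gene_sum, np.asarray(total_counts), atol=atol))


# Toy example: 3 cells x 2 genes (values are made up)
gene_matrix = np.array([[3, 1], [0, 2], [5, 5]])
control_counts = np.array([1, 0, 2])  # stand-in for control probe/codeword counts

adjusted = gene_matrix.sum(axis=1)                     # gene features only
unadjusted = gene_matrix.sum(axis=1) + control_counts  # gene + control features

is_adjusted = total_counts_is_gene_only(gene_matrix, adjusted)
is_unadjusted = total_counts_is_gene_only(gene_matrix, unadjusted)
print(is_adjusted, is_unadjusted)
```

If the check returns False on real data, total_counts most likely still includes the control features, and the percentages below would use a larger denominator.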

In [None]:
## Calculate negative control-associated metrics

# Percents

cprobes_p = (
    Slide1_XeniumICI480_Data.obs["control_probe_counts"].sum() / Slide1_XeniumICI480_Data.obs["total_counts"].sum() * 100
)
cwords_p = (
    Slide1_XeniumICI480_Data.obs["control_codeword_counts"].sum() / Slide1_XeniumICI480_Data.obs["total_counts"].sum() * 100
)

# False positives
uwords_p = (
    Slide1_XeniumICI480_Data.obs["unassigned_codeword_counts"].sum() / Slide1_XeniumICI480_Data.obs["total_counts"].sum() * 100
)

dwords_p = (
    Slide1_XeniumICI480_Data.obs["deprecated_codeword_counts"].sum() / Slide1_XeniumICI480_Data.obs["total_counts"].sum() * 100
)

print("Percents")
print(f"Negative DNA probe count % : {cprobes_p}")
print(f"Negative decoding count % : {cwords_p}")
print(f"Unassigned codeword count % : {uwords_p}")
print(f"Deprecated codeword count % : {dwords_p}\n")

# Sums

cprobes_s = Slide1_XeniumICI480_Data.obs["control_probe_counts"].sum()

cwords_s = Slide1_XeniumICI480_Data.obs["control_codeword_counts"].sum()

uwords_s = Slide1_XeniumICI480_Data.obs["unassigned_codeword_counts"].sum()

dwords_s = Slide1_XeniumICI480_Data.obs["deprecated_codeword_counts"].sum()

print("Sums")
print(f"Negative DNA probe count # : {cprobes_s}")
print(f"Negative decoding count # : {cwords_s}")
print(f"Unassigned codeword count # : {uwords_s}")
print(f"Deprecated codeword count # : {dwords_s}")

In [None]:
## A metric that we've liked using to assess run quality

Slide1_XeniumICI480_Data.obs["control_probe_counts"].mean()

In [None]:
## 10x Genomics's recommended false positive metric
# Calculated by: Mean negative probe count per cell / number of control probes / number of cells

# Example from our XeniumICI480 dataset: (0.010 / 20 / 539650)

# INSERT HERE
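
The placeholder above can be filled in with a small helper. This is a minimal sketch that simply transcribes the formula as written in the comment (mean negative probe count per cell / number of control probes / number of cells). It is not taken from 10x Genomics documentation, so check the definition against their current guidance. The function name `false_positive_metric` is hypothetical.

```python
def false_positive_metric(mean_control_probe_count, n_control_probes, n_cells):
    """Apply the formula exactly as written in the comment above."""
    if n_control_probes <= 0 or n_cells <= 0:
        raise ValueError("n_control_probes and n_cells must be positive")
    return mean_control_probe_count / n_control_probes / n_cells


# Values quoted in the comment above for the XeniumICI480 dataset
example_value = false_positive_metric(0.010, 20, 539650)
print(f"False positive metric : {example_value:.3e}")
```

With the real object, the first argument would be `Slide1_XeniumICI480_Data.obs["control_probe_counts"].mean()` and the last would be `Slide1_XeniumICI480_Data.n_obs`. The number of control probes has to come from the panel design.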

In [None]:
## Print gene list
# Note: Gene list (.var['gene_ids']) got lost when concatenating the data, but it's still stored as .var_names so we can pull it from there

Slide1_XeniumICI480_Data.var_names

In [None]:
##### Keep when applicable (new panel)

# Convert to a table with a named gene column
# (building from a dict names the column directly, even if var_names carries its own name)
XeniumICI480_GeneList = pd.DataFrame({"unique_genes": Slide1_XeniumICI480_Data.var_names})

# Print
XeniumICI480_GeneList

# Ensure that this count matches your expected number of genes. Otherwise, genes may have been removed during cell or transcript filtering steps

In [None]:
##### Save if desired

#XeniumICI480_GeneList.to_csv('/path/XeniumICI_Custom480Panel_GeneList.csv')

# Transcript File

Read in Xenium transcript file and visualize it

In [None]:
## Read in raw transcript file and visualize

# Xenium (Note: Unzip if necessary. In this case, the file had already been unzipped and saved by using our parquet env)
Slide1_XeniumICI480_transcripts_raw = pd.read_csv('/path/SlideID_0063975_transcripts.csv')

Slide1_XeniumICI480_transcripts_raw
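
If the full transcript file does not fit in memory, one alternative is to read it in chunks and keep only the rows needed downstream. This is a minimal sketch, assuming the `cell_id` and `qv` columns used later in this notebook. The helper name `read_transcripts_chunked`, the chunk size, and the toy CSV are hypothetical. Note that this skips the full transcripts_raw table and returns an already filtered one. pandas can also read a `.csv.gz` file directly, so unzipping first is optional.

```python
import io

import pandas as pd


def read_transcripts_chunked(source, keep_cell_ids, min_qv=20.0, chunksize=1_000_000):
    """Read a transcript CSV in chunks, keeping rows for the given cells with qv >= min_qv."""
    keep_cell_ids = set(keep_cell_ids)
    kept = []
    for chunk in pd.read_csv(source, chunksize=chunksize):
        mask = chunk["cell_id"].isin(keep_cell_ids) & (chunk["qv"] >= min_qv)
        kept.append(chunk[mask])
    if not kept:
        return pd.DataFrame()
    return pd.concat(kept, ignore_index=True)


# Toy CSV standing in for the transcript file (values are made up)
toy_csv = io.StringIO(
    "transcript_id,cell_id,feature_name,qv\n"
    "1,aaaa-1,GeneA,40.0\n"
    "2,UNASSIGNED,GeneA,35.0\n"
    "3,aaaa-1,NegControlProbe_00001,12.0\n"
    "4,bbbb-1,GeneB,25.0\n"
    "5,cccc-1,GeneB,30.0\n"
)

filtered = read_transcripts_chunked(toy_csv, {"aaaa-1", "bbbb-1"}, chunksize=2)
print(filtered)
```

On real data, `source` would be the transcript CSV path and `keep_cell_ids` would be the cell ids from the filtered data object.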

In [None]:
# Quantify the number of FOVs

n_fovs = Slide1_XeniumICI480_transcripts_raw['fov_name'].nunique()

print(f"Number of unique FOVs : {n_fovs}")

### Merge Slide1_XeniumICI480_Data (from above) with Xenium_transcripts_raw. This way, we're working with Xenium_transcripts_initialfilter (transcripts within cells that passed 10x's QC filter and our filtering)

This is also important for Xenium runs with orientation cores: the merge drops transcripts from cells in the orientation cores, which were already removed from the cell-based object.

In [None]:
# View metadata
Slide1_XeniumICI480_Data.obs['cell_id']

# Note: These are all of the cells that passed 10x QC and our QC

In [None]:
## Format for merge

# Need to remove the index name because there's also a column with the same name (cell_id)
Slide1_XeniumICI480_Data.obs.index.name = None

# Rename 'cell_id' to 'cell_id_wSlideName'
Slide1_XeniumICI480_Data.obs.rename(columns={"cell_id": "cell_id_wSlideName"}, inplace=True)

# Duplicate the cell_id_wSlideName column and strip the prefix (so that the cell ids match the transcript ids)

Slide1_XeniumICI480_Data.obs["cell_id"] = (
    Slide1_XeniumICI480_Data.obs["cell_id_wSlideName"].str.replace("^Slide1-", "", regex=True)
)

# View
display(Slide1_XeniumICI480_Data.obs)

In [None]:
# Merge the data frames based on "cell_id"
Slide1_XeniumICI480_transcripts_initialfilter = pd.merge(Slide1_XeniumICI480_transcripts_raw, Slide1_XeniumICI480_Data.obs[['cell_id']], on='cell_id', how='inner')

# View
Slide1_XeniumICI480_transcripts_initialfilter


# Result: End up with 134,672,733 rows (transcripts) instead of 170,805,937


## Consider saving Slide1_XeniumICI480_transcripts_initialfilter here
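
The inner merge above silently drops transcripts whose `cell_id` is absent from the filtered object, and it would duplicate transcripts if a `cell_id` appeared twice in `.obs`. This is a minimal sketch of a merge that guards against the second case and reports the first, using toy data frames with made-up values. A left merge keeps every transcript row until the final selection, so it uses more memory than the inner merge and is best tried on a subset first.

```python
import pandas as pd

# Toy stand-ins for the raw transcript table and the filtered cell ids
transcripts = pd.DataFrame({
    "transcript_id": [1, 2, 3, 4],
    "cell_id": ["aaaa-1", "aaaa-1", "UNASSIGNED", "bbbb-1"],
})
cells = pd.DataFrame({"cell_id": ["aaaa-1", "cccc-1"]})

merged = pd.merge(
    transcripts,
    cells,
    on="cell_id",
    how="left",
    validate="many_to_one",  # raises MergeError if cell_id is duplicated in `cells`
    indicator=True,          # adds a _merge column: "both" or "left_only"
)

n_kept = int((merged["_merge"] == "both").sum())
n_dropped = int((merged["_merge"] == "left_only").sum())
print(f"Transcripts kept : {n_kept}, dropped : {n_dropped}")

# Equivalent to the inner-merge result
initialfilter = merged.loc[merged["_merge"] == "both"].drop(columns="_merge")
```

The kept and dropped counts should add up to the number of rows in the raw table, which gives a quick check on the row counts noted above.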

### Done - Continue with Xenium_transcripts_initialfilter

### Create new objects - Remove false positives / low quality / other undesirable values

In [None]:
## First, we actually want to create an object with just the negative probe transcripts (just to compare later on to the gene probe mean counts)

# Make a copy
Slide1_XeniumICI480_negprobe_transcripts = Slide1_XeniumICI480_transcripts_initialfilter.copy()

# Keep only the rows where 'feature_name' contains 'NegControlProbe'
Slide1_XeniumICI480_negprobe_transcripts = Slide1_XeniumICI480_negprobe_transcripts[Slide1_XeniumICI480_negprobe_transcripts['feature_name'].str.contains('NegControlProbe', case=False, na=False)]

# View
Slide1_XeniumICI480_negprobe_transcripts

In [None]:
# Make new df
Slide1_XeniumICI480_negprobe_transcripts2 = Slide1_XeniumICI480_negprobe_transcripts.copy()

# Remove rows where 'cell_id' matches 'UNASSIGNED'
Slide1_XeniumICI480_negprobe_transcripts2 = Slide1_XeniumICI480_negprobe_transcripts2[Slide1_XeniumICI480_negprobe_transcripts2['cell_id'] != 'UNASSIGNED']

# Remove transcripts that didn't pass QC
Slide1_XeniumICI480_negprobe_transcripts2 = Slide1_XeniumICI480_negprobe_transcripts2[Slide1_XeniumICI480_negprobe_transcripts2['qv'] >= 20.0]

# Group by cell
Slide1_XeniumICI480_negprobe_transcripts2 = Slide1_XeniumICI480_negprobe_transcripts2.groupby('cell_id')['transcript_id'].nunique().reset_index()

# Rename columns for clarity
Slide1_XeniumICI480_negprobe_transcripts2.columns = ['cell_id', 'negprobe_count']

# Display the resulting DataFrame
print(Slide1_XeniumICI480_negprobe_transcripts2)

In [None]:
## Calculate negprobe_count sum

Slide1_XeniumICI480_negprobe_transcripts2['negprobe_count'].sum()

# Note: This matches what we found in the metadata for the negative control probe counts (scroll up)

In [None]:
## Looking for other controls

Xenium_control_transcripts = Slide1_XeniumICI480_transcripts_initialfilter.copy()

# Filter based on feature_name value including an '_' (literal match; missing names are treated as non-matching)
filtered_featurenames_controls = Xenium_control_transcripts['feature_name'][Xenium_control_transcripts['feature_name'].str.contains('_', regex=False, na=False)]

# Get unique values
unique_featurenames_controls = filtered_featurenames_controls.unique()

# Alphabetize
sorted_unique_featurenames_controls = sorted(unique_featurenames_controls)

# Print
for value in sorted_unique_featurenames_controls:
    print(value)
    
    
# Note: There are NegControlCodeword and NegControlProbe -- No instances of DeprecatedCodeword, Intergenic_Region, or UnassignedCodeword (as seen in our Xenium datasets)
# The controls per Xenium dataset may vary based on the chemistry version and 10x updates
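
Instead of reading the printed list by eye, the transcripts can be counted per control category. This is a minimal sketch, assuming the control names start with the prefixes mentioned in the note above. The actual names depend on the Xenium version, and the `label_feature_category` helper and toy feature names are hypothetical.

```python
import pandas as pd

# Prefixes taken from the control names mentioned in this notebook
CONTROL_PREFIXES = [
    "NegControlProbe",
    "NegControlCodeword",
    "UnassignedCodeword",
    "DeprecatedCodeword",
    "Intergenic_Region",
    "BLANK",
]


def label_feature_category(feature_names):
    """Label each feature name with its control prefix, or 'gene' if none matches."""
    categories = pd.Series("gene", index=feature_names.index)
    for prefix in CONTROL_PREFIXES:
        categories[feature_names.str.startswith(prefix, na=False)] = prefix
    return categories


# Toy feature names (made up for illustration)
toy_names = pd.Series(
    ["CD3E", "NegControlProbe_00042", "NegControlCodeword_0500", "CD8A", "BLANK_0006"]
)
category_counts = label_feature_category(toy_names).value_counts()
print(category_counts)
```

On real data, the input would be `Slide1_XeniumICI480_transcripts_initialfilter['feature_name']`.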

In [None]:
## Looking for other controls, continued

featurenames_blanks = Slide1_XeniumICI480_transcripts_initialfilter.copy()

filtered_featurenames_blanks = featurenames_blanks[(featurenames_blanks['feature_name'].str.contains('BLANK', case=False, na=False))]

# print
filtered_featurenames_blanks

Okay, moving forward.

We will copy the initialfilter object into a new object with the suffix r, denoting the removal of false positives and low quality transcripts

In [None]:
### Remove false positives -- Modify this as needed

Slide1_XeniumICI480_transcripts_r = Slide1_XeniumICI480_transcripts_initialfilter.copy()

# Remove rows where 'cell_id' matches 'UNASSIGNED'
Slide1_XeniumICI480_transcripts_r = Slide1_XeniumICI480_transcripts_r[Slide1_XeniumICI480_transcripts_r['cell_id'] != 'UNASSIGNED']

# Remove rows where 'feature_name' contains '_' (All of the controls)
Slide1_XeniumICI480_transcripts_r = Slide1_XeniumICI480_transcripts_r[~(Slide1_XeniumICI480_transcripts_r['feature_name'].str.contains('_', case=False, na=False))]

# Remove rows where 'feature_name' contains 'BLANK'
Slide1_XeniumICI480_transcripts_r = Slide1_XeniumICI480_transcripts_r[~(Slide1_XeniumICI480_transcripts_r['feature_name'].str.contains('BLANK', case=False, na=False))]

# Remove transcripts that didn't pass QC
Slide1_XeniumICI480_transcripts_r = Slide1_XeniumICI480_transcripts_r[Slide1_XeniumICI480_transcripts_r['qv'] >= 20.0]

# View
Slide1_XeniumICI480_transcripts_r


## 117,335,771 transcripts
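
To see how many transcripts each criterion removes, the same filters can be built as boolean masks and applied once. This is a minimal sketch that mirrors the conditions in the cell above (UNASSIGNED cells, names containing '_', names containing 'BLANK', qv below 20). The `filter_report` helper and the toy table are hypothetical.

```python
import pandas as pd


def filter_report(transcripts, min_qv=20.0):
    """Apply all removal criteria at once and count the rows flagged by each."""
    names = transcripts["feature_name"]
    checks = {
        "unassigned_cell": transcripts["cell_id"] == "UNASSIGNED",
        "control_feature": names.str.contains("_", regex=False, na=False),
        "blank_feature": names.str.contains("BLANK", case=False, na=False),
        # Negating >= also flags rows with a missing qv, like the original filter
        "low_qv": ~(transcripts["qv"] >= min_qv),
    }
    remove = pd.DataFrame(checks).any(axis=1)
    counts = {name: int(mask.sum()) for name, mask in checks.items()}
    counts["removed_total"] = int(remove.sum())
    counts["kept"] = int((~remove).sum())
    return transcripts[~remove], counts


# Toy transcript table (values are made up)
toy = pd.DataFrame({
    "cell_id": ["aaaa-1", "UNASSIGNED", "aaaa-1", "bbbb-1", "bbbb-1"],
    "feature_name": ["CD3E", "CD3E", "NegControlProbe_00042", "BLANK_0006", "CD8A"],
    "qv": [40.0, 40.0, 40.0, 40.0, 10.0],
})

toy_r, report = filter_report(toy)
print(report)
```

A row can be flagged by more than one criterion, so the per-criterion counts may add up to more than `removed_total`.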

In [None]:
## Print the number of unique genes

Slide1_XeniumICI480_transcripts_r['feature_name'].nunique()

# Ensure that this value is the expected number of genes in the panel (480) -- YES
# Meaning that all of the controls have been removed. And that all of the genes survived the cell-specific filtering
# If a gene was filtered out during the cell-specific filtering, we need to figure out which gene that is
# Run the code below if that's the case
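
If the gene count does not match the panel size, one way to identify the affected genes is a set comparison between the panel gene list and the feature names left in the filtered transcripts. This is a minimal sketch. The `compare_gene_sets` helper and the toy values are hypothetical.

```python
import pandas as pd


def compare_gene_sets(panel_genes, transcripts, column="feature_name"):
    """Return (genes missing from the transcripts, unexpected names not in the panel)."""
    observed = set(transcripts[column].dropna().unique())
    panel = set(panel_genes)
    return sorted(panel - observed), sorted(observed - panel)


# Toy panel and transcript table (values are made up)
toy_panel = ["CD3E", "CD8A", "FOXP3"]
toy_transcripts = pd.DataFrame({"feature_name": ["CD3E", "CD8A", "CD3E", "NegControlProbe_00042"]})

missing, unexpected = compare_gene_sets(toy_panel, toy_transcripts)
print(f"Missing genes : {missing}")
print(f"Unexpected feature names : {unexpected}")
```

On real data, the panel genes could come from `Slide1_XeniumICI480_Data.var_names` and the transcripts from `Slide1_XeniumICI480_transcripts_r`. Any unexpected names would point to controls that survived the filtering.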

In [None]:
## Save Slide1_XeniumICI480_transcripts_r

#Slide1_XeniumICI480_transcripts_r.to_csv('/path/Slide1_XeniumICI480_transcripts_r.csv')