#### This notebook details the preprocessing pipeline for our CosMx data. It is a re-analysis of previously published data.

#### Required input files:

* Raw CosMx data object (available via FigShare)

Environment: Please create and activate the conda environment provided in default_env.yaml before running this notebook

In [None]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

import scanpy as sc
import squidpy as sq

import gzip
import anndata

import os

### Load Data

In [None]:
# Read in raw object

adata = sc.read_h5ad('/path/22_11_22_CosMx_Raw.h5ad')

adata

In [None]:
adata.obsm['spatial']

In [None]:
# Use the global (x,y) coordinates for spatial info, for visualization of all fovs.
adata.obsm["X_spatial"] = adata.obsm["spatial_fov"]

adata.obsm["spatial_fov"]

In [None]:
# spatial coordinates
global_x = adata.obs["CenterX_global_px"]
global_y = adata.obs["CenterY_global_px"]
global_y

In [None]:
# extract the global coordinates and make a list of (x, y) tuples
# (zip over the underlying arrays so the pairing is positional and does not depend on the obs index labels)
#coords_local = list(zip(local_x.to_numpy(), local_y.to_numpy()))
coords_global = list(zip(global_x.to_numpy(), global_y.to_numpy()))

#coords_local[0:5]

In [None]:
# Add the global spatial coordinates to adata.obsm (the local version is left commented out)
# convert the list of tuples to an (n_cells, 2) array
#adata.obsm["X_spatial_local"] = np.asarray(coords_local)
adata.obsm["X_spatial_global"] = np.asarray(coords_global)
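
The coordinate array can also be built directly from the two columns, without an intermediate list of tuples. The following is a minimal sketch on a toy table standing in for adata.obs; the cell IDs and coordinate values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for adata.obs; cell IDs are strings, as in a typical AnnData object.
obs = pd.DataFrame(
    {
        "CenterX_global_px": [10.0, 20.5, 31.0],
        "CenterY_global_px": [5.0, 7.5, 9.0],
    },
    index=["cell_a", "cell_b", "cell_c"],
)

# Select both columns and convert to an (n_cells, 2) array in one step.
coords_global = obs[["CenterX_global_px", "CenterY_global_px"]].to_numpy()

print(coords_global.shape)
```

The resulting array has one row per cell and one column per axis, which is the layout expected for an entry in adata.obsm.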

In [None]:
adata

### Calculate Quality Control Metrics

Calculate the quality control metrics on the anndata.AnnData using scanpy.pp.calculate_qc_metrics

In [None]:
sc.pp.calculate_qc_metrics(adata, percent_top=(10, 20, 50, 150), inplace=True)
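
To make the per-cell metrics concrete, here is a small NumPy sketch of what total_counts, n_genes_by_counts and a "percent of counts in the top N genes" value represent. It uses a made-up 3 x 4 count matrix and N = 2; it illustrates the definitions and is not scanpy's implementation.

```python
import numpy as np

# Toy count matrix: 3 cells (rows) x 4 genes (columns).
counts = np.array([
    [5, 0, 3, 2],
    [0, 0, 1, 0],
    [4, 4, 4, 4],
])

total_counts = counts.sum(axis=1)             # transcripts per cell
n_genes_by_counts = (counts > 0).sum(axis=1)  # genes with at least one count per cell

# Share of each cell's counts captured by its 2 most highly expressed genes.
top_n = 2
top_sorted = np.sort(counts, axis=1)[:, ::-1][:, :top_n]
pct_counts_in_top_2_genes = 100 * top_sorted.sum(axis=1) / total_counts

print(total_counts, n_genes_by_counts, pct_counts_in_top_2_genes)
```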

The percentage of counts that come from negative control probes can be calculated from adata.obs

In [None]:
ncount_nprobes = (
    adata.obs["nCount_negprobes"].sum() / adata.obs["total_counts"].sum() * 100
)

print(f"Percentage of total counts from negative probes: {ncount_nprobes}")

In [None]:
# Rename the cell area column in the CosMx data to more closely match with the Xenium data and include unit

adata.obs.rename(columns={'Area': 'cell_area_pixels'}, inplace=True)

# Add a metadata column that includes area in um2
# Conversion: 1 pixel = 0.12 um per side, so one pixel of area = 0.12 * 0.12 = 0.0144 um2

adata.obs['cell_area_um2'] = adata.obs['cell_area_pixels'] * 0.0144
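
The conversion factor follows from squaring the pixel edge length. A short check of the arithmetic, with hypothetical cell areas:

```python
import numpy as np

PIXEL_SIZE_UM = 0.12                # edge length of one pixel in um, as stated above
UM2_PER_PIXEL = PIXEL_SIZE_UM ** 2  # area of one pixel in um2 (about 0.0144)

cell_area_pixels = np.array([1000.0, 2500.0, 4000.0])  # hypothetical cell areas
cell_area_um2 = cell_area_pixels * UM2_PER_PIXEL

print(UM2_PER_PIXEL, cell_area_um2)
```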

Next we plot the distribution of total transcripts per cell, unique transcripts per cell, and the area of segmented cells

In [None]:
fig, axs = plt.subplots(1, 3, figsize=(15, 4))

axs[0].set_title("Total transcripts per cell")
sns.histplot(
    adata.obs["total_counts"],
    kde=False,
    ax=axs[0],
)

axs[1].set_title("Unique transcripts per cell")
sns.histplot(
    adata.obs["n_genes_by_counts"],
    kde=False,
    ax=axs[1],
)


axs[2].set_title("Area of segmented cells (um2)")
sns.histplot(
    adata.obs["cell_area_um2"],
    kde=False,
    ax=axs[2],
)

Filter the cells based on the minimum number of counts required using scanpy.pp.filter_cells. Filter the genes based on the minimum number of cells required with scanpy.pp.filter_genes. The thresholds for both were chosen based on the plots above: cells with too few counts and genes detected in too few cells are removed.

Other filter criteria might be cell area, DAPI signal or a minimum of unique transcripts.

Squidpy tutorial filtering examples: 
sc.pp.filter_cells(adata, min_counts=10);
sc.pp.filter_genes(adata, min_cells=5)

In [None]:
## Our filtering schema
# Note: These filters should have already been applied to this data, but I'm running them again just in case
adata_filtered = adata.copy()

# filter out cells with <50 counts or <10 genes
sc.pp.filter_cells(adata_filtered, min_counts=50)
sc.pp.filter_cells(adata_filtered, min_genes=10)

# filter out genes that have <1 count or are detected in <10 cells
sc.pp.filter_genes(adata_filtered, min_counts=1)
sc.pp.filter_genes(adata_filtered, min_cells=10)
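
Because the filters run one after another, a cell or gene is dropped as soon as it fails any single threshold. The sketch below reproduces that logic with boolean masks on a made-up count matrix; the thresholds are toy values chosen to fit the small example, not the ones used in this notebook.

```python
import numpy as np

counts = np.array([
    [30, 25, 0, 0],   # 55 counts, 2 genes detected
    [1, 1, 1, 1],     # 4 counts, 4 genes detected
    [20, 20, 20, 0],  # 60 counts, 3 genes detected
])

min_counts, min_genes = 50, 3  # toy cell thresholds

# A cell is kept only if it passes both cell-level thresholds.
keep_cells = (counts.sum(axis=1) >= min_counts) & ((counts > 0).sum(axis=1) >= min_genes)
filtered = counts[keep_cells]

# Gene thresholds are evaluated on the cells that survived the previous step.
min_cells = 1  # toy gene threshold
keep_genes = (filtered > 0).sum(axis=0) >= min_cells
filtered = filtered[:, keep_genes]

print(keep_cells, keep_genes, filtered.shape)
```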

In [None]:
### Also -- filter based on NanoString's QC flags
# Note: Some of these filters should have already been applied to this data, but I'm running all of them just in case

adata_filtered = adata_filtered[adata_filtered.obs['qcFlagsRNACounts'] == 'Pass']
adata_filtered = adata_filtered[adata_filtered.obs['qcFlagsCellCounts'] == 'Pass']
adata_filtered = adata_filtered[adata_filtered.obs['qcFlagsCellPropNeg'] == 'Pass']
adata_filtered = adata_filtered[adata_filtered.obs['qcFlagsCellComplex'] == 'Pass']
adata_filtered = adata_filtered[adata_filtered.obs['qcFlagsCellArea'] == 'Pass']
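# .copy() turns the final subset into a standalone object rather than a view of the original
adata_filtered = adata_filtered[adata_filtered.obs['qcFlagsFOV'] == 'Pass'].copy()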
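
The flag-by-flag subsetting can also be expressed as one combined mask, which makes it easy to report how many cells pass every flag. This is a minimal sketch on a toy table with two of the flag columns and made-up values; the helper names flag_columns and passes_all are hypothetical.

```python
import pandas as pd

# Toy stand-in for adata.obs with two QC flag columns.
obs = pd.DataFrame(
    {
        "qcFlagsRNACounts": ["Pass", "Pass", "Fail"],
        "qcFlagsCellArea": ["Pass", "Fail", "Pass"],
    },
    index=["cell_a", "cell_b", "cell_c"],
)

flag_columns = ["qcFlagsRNACounts", "qcFlagsCellArea"]

# True only for cells where every listed flag equals "Pass".
passes_all = obs[flag_columns].eq("Pass").all(axis=1)
kept = obs[passes_all]

print(passes_all.sum(), "of", len(obs), "cells pass every flag")
```

With an AnnData object, the same boolean mask could be used for subsetting, for example adata_filtered[passes_all.to_numpy()].copy().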

Visualize genes with the highest expression levels

In [None]:
sc.pl.highest_expr_genes(adata_filtered, n_top=20, )

Make a copy of the original raw counts (post-filtering; pre-normalization)

In [None]:
adata_filtered.layers['raw_counts'] = adata_filtered.X.copy() # Make a copy
adata_filtered

In [None]:
adata_filtered.obs

# 126368 cells -- filtering did not change this number, indicating that the object had already been filtered successfully (in the R notebook from the previous analysis)

# Checked the R notebook from the previous analysis -- it started with 162510 cells

# The previous analysis has already been published

### Save object

This is post-filtering and pre-normalization

In [None]:
adata_filtered.write_h5ad('/path/DataObjects_withoutUMAP/CosMxdata_filtered_pre-normalization.h5ad')

### Continue with analysis

Normalize counts per cell using scanpy.pp.normalize_total.

Log-transform the data with scanpy.pp.log1p, run principal component analysis with scanpy.pp.pca, and compute a neighborhood graph of the observations with scanpy.pp.neighbors.

Use scanpy.tl.umap to embed the neighborhood graph of the data and cluster the cells into subgroups employing scanpy.tl.leiden.

In [None]:
# Note: inplace=True is the default value; it overwrites adata.X with the normalized counts (the 'raw_counts' layer saved above is left unchanged)
sc.pp.normalize_total(adata_filtered, inplace=True)
sc.pp.log1p(adata_filtered)
# Save log-normalized counts as a layer (copy so that later changes to X do not alter the layer)
adata_filtered.layers['log_normalized_counts'] = adata_filtered.X.copy()
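
As an illustration of what these two steps do to the numbers, the sketch below rescales each cell of a made-up count matrix to a common total and then applies log(1 + x). It assumes the target total is the median of the per-cell totals, which is how scanpy.pp.normalize_total behaves when target_sum is not given; the matrix itself is hypothetical.

```python
import numpy as np

# Toy count matrix: 3 cells x 3 genes.
counts = np.array([
    [2.0, 2.0, 0.0],
    [10.0, 0.0, 10.0],
    [1.0, 1.0, 6.0],
])

totals = counts.sum(axis=1)   # per-cell totals: 4, 20, 8
target = np.median(totals)    # common total after normalization: 8

# Rescale every cell so that its counts sum to the target.
normalized = counts / totals[:, None] * target

# Log-transform; zeros stay zero because log1p(0) = 0.
log_normalized = np.log1p(normalized)

print(normalized.sum(axis=1))
```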

Calculate and plot the top highly variable genes

In [None]:
sc.pp.highly_variable_genes(adata_filtered, min_mean=0.0125, max_mean=3, min_disp=0.5)

In [None]:
sc.pl.highly_variable_genes(adata_filtered)

Regress out unwanted sources of variation and scale data:

We make 2 versions of the data object: one following the Xenium pipeline, regressing out total_counts and n_genes_by_counts, and the other following the original CosMx pipeline, regressing out nCount_RNA / total_counts, nFeature_RNA / n_genes_by_counts, nFeaturePerCell, and nCount

In [None]:
# make a copy and regress the Xenium way
adata_filtered_RX = adata_filtered.copy()

sc.pp.regress_out(adata_filtered_RX, ["total_counts","n_genes_by_counts"])
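
Regressing out a covariate means replacing each gene's expression with the residuals of a linear fit on that covariate. The following is a rough sketch of the idea for a single simulated gene, using ordinary least squares with an intercept; scanpy's own implementation may differ in its details, and all values and variable names here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_cells = 200

# Simulated covariates standing in for total_counts and n_genes_by_counts.
total_counts = rng.uniform(50, 500, size=n_cells)
n_genes = rng.uniform(10, 100, size=n_cells)

# One toy gene whose expression depends linearly on both covariates plus noise.
expr = 0.01 * total_counts + 0.05 * n_genes + rng.normal(0, 0.1, size=n_cells)

# Design matrix with an intercept column, then a least-squares fit.
design = np.column_stack([np.ones(n_cells), total_counts, n_genes])
coef, *_ = np.linalg.lstsq(design, expr, rcond=None)

# The residuals are the expression values with the covariate effects removed.
residuals = expr - design @ coef

print(np.corrcoef(residuals, total_counts)[0, 1])
```

After the fit, the residuals are uncorrelated with both covariates up to floating-point error, which is the property the regression step is meant to achieve.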

In [None]:
### Xenium regress method

## Scale data
sc.pp.scale(adata_filtered_RX, max_value=10)

## Run PCA and plot PCA variance ratio

sc.pp.pca(adata_filtered_RX, svd_solver='arpack')

sc.pl.pca(adata_filtered_RX, color=['EPCAM','COL3A1','CD69'])
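
Scaling with a max_value standardizes each gene and then caps extreme values so that a few outlier cells do not dominate the PCA. The sketch below shows the idea on a made-up matrix with a toy cap of 1.2 (the notebook uses 10). The choice of ddof=1 and the fact that only the upper side is clipped are assumptions of this sketch and may not match scanpy exactly.

```python
import numpy as np

# Toy matrix: 4 cells x 2 genes; the second gene has one outlier cell.
X = np.array([
    [1.0, 0.0],
    [2.0, 0.0],
    [3.0, 0.0],
    [4.0, 100.0],
])
max_value = 1.2  # toy cap; the notebook uses 10

# Standardize each gene to zero mean and unit variance.
mean = X.mean(axis=0)
std = X.std(axis=0, ddof=1)  # ddof=1 is an assumption of this sketch
scaled = (X - mean) / std

# Cap values above max_value.
clipped = np.minimum(scaled, max_value)

print(scaled[:, 1], clipped[:, 1])
```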

In [None]:
### Xenium regress method

sc.pl.pca_variance_ratio(adata_filtered_RX, log=True)

In [None]:
### Xenium regress method

sc.pl.pca_variance_ratio(adata_filtered_RX, n_pcs = 50, log=True)
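
The variance ratio plot shows the fraction of total variance captured by each principal component. A minimal NumPy sketch of that quantity, computed via singular value decomposition on simulated data (the matrix and its column scales are made up), looks like this:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: 100 cells x 5 features with decreasing spread per feature.
X = rng.normal(size=(100, 5)) * np.array([5.0, 2.0, 1.0, 0.5, 0.1])

# Center the data and decompose it.
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Variance per component, then the share of the total each component explains.
explained_variance = S**2 / (X.shape[0] - 1)
variance_ratio = explained_variance / explained_variance.sum()

# Cell coordinates in PC space, analogous to adata.obsm["X_pca"].
pcs = X_centered @ Vt.T

print(variance_ratio)
```

The ratios are sorted from largest to smallest and sum to 1, so the point where the curve flattens suggests how many PCs carry most of the signal.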

In [None]:
### Xenium regress method

adata_filtered_RX

### Save objects

This is post-filtering and post-normalization, immediately before computing the neighbor graph and plotting the UMAP.

In [None]:
adata_filtered_RX.write_h5ad('/path/DataObjects_withoutUMAP/CosMxdata_filtered_post-normalization_XeniumRegressMethod.h5ad')

### Compute neighbor graph and plot UMAP

In [None]:
sc.settings.figdir = '/path/UMAP_pngs/'

#### Parameter descriptions:
n_neighbors
* A value between 2 and 100, representing the number of neighboring data points used for manifold approximation. Larger values give a manifold with a more global view of the dataset, while smaller values preserve more of the local structures.
* Default value is 15

n_pcs
* Use this many PCs
* Default value is None

min_dist
* The minimum distance between two points in the UMAP embedding.
* Default value is 0.5 in scanpy.tl.umap

spread
* A scaling factor for distance between embedded points.
* Default value is 1.0

Helpful resource: https://smorabit.github.io/blog/2020/umap/

#### Testing vRX1

In [None]:
adata_filtered_RX_v1 = adata_filtered_RX.copy()

In [None]:
sc.pp.neighbors(adata_filtered_RX_v1, n_neighbors=10, n_pcs=30)
sc.tl.umap(adata_filtered_RX_v1, min_dist=0.02, spread=1.5)
sc.tl.leiden(adata_filtered_RX_v1)
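
To illustrate how n_pcs and n_neighbors interact, the sketch below keeps the first 30 columns of a simulated PCA matrix and finds each cell's 10 nearest neighbors by brute-force Euclidean distance. This is only a conceptual stand-in: scanpy may use approximate neighbor search and builds a weighted connectivity graph on top, and whether a cell counts among its own neighbors is not addressed here. The data are random and the variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
pcs = rng.normal(size=(50, 40))  # toy stand-in for PCA coordinates (50 cells, 40 PCs)
n_pcs, n_neighbors = 30, 10

# Use only the first n_pcs components.
X = pcs[:, :n_pcs]

# Pairwise Euclidean distances between all cells.
dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)

# Exclude each cell from its own neighbor list.
np.fill_diagonal(dist, np.inf)

# Indices of the n_neighbors closest cells for every cell.
knn_idx = np.argsort(dist, axis=1)[:, :n_neighbors]

print(knn_idx.shape)
```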

In [None]:
sc.pl.umap(
    adata_filtered_RX_v1,
    color=[
        "total_counts",
        "n_genes_by_counts",
        "leiden",
    ],
    wspace=0.4,
    save = '_v1_CosMxdata_XeniumRegressMethod.png',
)

In [None]:
## Plot based on Fine_annotation_3 (original CosMx object annotations)

sc.pl.umap(
    adata_filtered_RX_v1,
    color=[
        "Fine_annotation_3"
    ],
    wspace=0.4,
    save = '_v1_CosMxdata_XeniumRegressMethod_ColoredByFineAnnotation3.png',
)

In [None]:
# Save object with UMAP
adata_filtered_RX_v1.write_h5ad('/path/DataObjects_withUMAP/CosMxdata_XeniumRegressMethod_umapv1.h5ad')