#### This particular notebook details the preprocessing pipeline for our Xenium Dataset 1 (rep 1) data

#### Required input files:

* Raw Xenium Dataset 1 290 IntReps1and2 data object, filtered to keep only Rep1 (available via FigShare), OR raw data files for Xenium Dataset 1 290 Rep1 (available via GEO)

This notebook starts with the raw GEO files

Environment: Please create and activate the conda environment provided in default_env.yaml before running this notebook

In [None]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

import scanpy as sc
import squidpy as sq

import gzip
import anndata

### Load Data

In [None]:
adata = sc.read_10x_h5(
    filename="/path/cell_feature_matrix.h5"
)

In [None]:
# Unzip cells.csv.gz to obtain cells.csv
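
The unzip step can be done from Python with the standard library. This is a minimal sketch; the helper name `gunzip_file` is hypothetical and the paths are the same placeholders used elsewhere in this notebook.

```python
import gzip
import shutil
from pathlib import Path


def gunzip_file(src: str, dst: str) -> Path:
    """Decompress the gzip file at `src` into `dst` and return the output path."""
    # Stream the data in chunks so large files are never fully loaded into memory.
    with gzip.open(src, "rb") as f_in, open(dst, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)
    return Path(dst)
```

Usage would look like `gunzip_file("/path/cells.csv.gz", "/path/cells.csv")`. Alternatively, `pd.read_csv` can read a `.gz` file directly, since it infers the compression from the file extension.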

In [None]:
df = pd.read_csv(
    "/path/cells.csv"
)

In [None]:
df.set_index(adata.obs_names, inplace=True)
adata.obs = df.copy()
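
Setting the index this way assumes that the rows of cells.csv are in the same order as the cells in the count matrix. A small check can make that assumption explicit before the metadata is attached. This is a sketch only: the helper `align_cell_metadata` is hypothetical, and the column name `cell_id` is an assumption about the contents of cells.csv.

```python
import pandas as pd


def align_cell_metadata(df: pd.DataFrame, obs_names, id_col: str = "cell_id") -> pd.DataFrame:
    """Return a copy of `df` indexed by `obs_names`, after checking that row order matches.

    `id_col` is an assumed column name; adjust it to the cells.csv at hand.
    """
    ids = df[id_col].astype(str).to_numpy()
    names = pd.Index(obs_names).astype(str).to_numpy()
    if len(ids) != len(names):
        raise ValueError(f"{len(ids)} metadata rows vs {len(names)} cells in the matrix")
    mismatched = ids != names
    if mismatched.any():
        raise ValueError(f"{int(mismatched.sum())} cell IDs are out of order or missing")
    out = df.copy()
    out.index = pd.Index(names)
    return out
```

With this helper, the cell above could be written as `adata.obs = align_cell_metadata(df, adata.obs_names)`, which fails loudly instead of silently mislabeling cells.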

In [None]:
adata.obsm["spatial"] = adata.obs[["x_centroid", "y_centroid"]].copy().to_numpy()

In [None]:
adata.obs

In [None]:
adata

### Calculate Quality Control Metrics

Calculate the quality control metrics on the anndata.AnnData using scanpy.pp.calculate_qc_metrics

In [None]:
sc.pp.calculate_qc_metrics(adata, percent_top=(10, 20, 50, 150), inplace=True)
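
To make the resulting columns easier to interpret, here is a small NumPy sketch of what `total_counts`, `n_genes_by_counts`, and the `pct_counts_in_top_N_genes` columns represent. It works on a dense toy matrix and is not scanpy's implementation; the function name `qc_metrics_sketch` is hypothetical.

```python
import numpy as np


def qc_metrics_sketch(counts, top_n: int = 2) -> dict:
    """Per-cell QC summaries for a dense cells x genes count matrix."""
    counts = np.asarray(counts)
    total_counts = counts.sum(axis=1)
    n_genes_by_counts = (counts > 0).sum(axis=1)
    # Sum of the `top_n` largest counts in each cell, as a percentage of that cell's total.
    top = np.sort(counts, axis=1)[:, ::-1][:, :top_n].sum(axis=1)
    # Guard against division by zero for empty cells.
    pct_top = 100 * top / np.maximum(total_counts, 1)
    return {
        "total_counts": total_counts,
        "n_genes_by_counts": n_genes_by_counts,
        f"pct_counts_in_top_{top_n}_genes": pct_top,
    }
```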

The percentage of control probes and control codewords can be calculated from adata.obs

In [None]:
cprobes = (
    adata.obs["control_probe_counts"].sum() / adata.obs["total_counts"].sum() * 100
)
cwords = (
    adata.obs["control_codeword_counts"].sum() / adata.obs["total_counts"].sum() * 100
)
print(f"Negative DNA probe count % : {cprobes}")
print(f"Negative decoding count % : {cwords}")

Next we plot the distribution of total transcripts per cell, unique transcripts per cell, area of segmented cells and the ratio of nuclei area to their cells

In [None]:
fig, axs = plt.subplots(1, 4, figsize=(15, 4))

axs[0].set_title("Total transcripts per cell")
sns.histplot(
    adata.obs["total_counts"],
    kde=False,
    ax=axs[0],
)

axs[1].set_title("Unique transcripts per cell")
sns.histplot(
    adata.obs["n_genes_by_counts"],
    kde=False,
    ax=axs[1],
)


axs[2].set_title("Area of segmented cells")
sns.histplot(
    adata.obs["cell_area"],
    kde=False,
    ax=axs[2],
)

axs[3].set_title("Nucleus ratio")
sns.histplot(
    adata.obs["nucleus_area"] / adata.obs["cell_area"],
    kde=False,
    ax=axs[3],
)

Filter cells by a minimum number of counts using scanpy.pp.filter_cells, and filter genes by the minimum number of cells in which they are detected using scanpy.pp.filter_genes. The thresholds for both were chosen based on the plots above: they remove cells with too few counts and genes detected in too few cells.

Other filter criteria might be cell area, DAPI signal, or a minimum number of unique transcripts.

Squidpy tutorial filtering examples: 
sc.pp.filter_cells(adata, min_counts=10);
sc.pp.filter_genes(adata, min_cells=5)

In [None]:
## Our filtering schema
# Note: Transcripts that didn't pass 10x Genomics's QC have already been filtered out
# Note: this is a reference, not a copy, so the filters below also modify `adata`
adata_filtered = adata

# filter out cells with <50 counts or <10 genes
sc.pp.filter_cells(adata_filtered, min_counts=50)
sc.pp.filter_cells(adata_filtered, min_genes=10)

# filter out genes that have <1 count or are detected in <10 cells
sc.pp.filter_genes(adata_filtered, min_counts=1)
sc.pp.filter_genes(adata_filtered, min_cells=10)
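
The effect of applying these filters one after another can be sketched with boolean masks on a dense toy matrix. A cell is kept only if it meets both cell thresholds, and the gene thresholds are evaluated on the cells that survive. The function name `filter_sketch` is hypothetical, and this is an illustration rather than scanpy's implementation.

```python
import numpy as np


def filter_sketch(counts, min_counts=50, min_genes=10, min_gene_counts=1, min_cells=10):
    """Return (filtered matrix, kept-cell mask, kept-gene mask) for a cells x genes matrix."""
    counts = np.asarray(counts)
    # Cells must pass BOTH thresholds to be kept.
    keep_cells = (counts.sum(axis=1) >= min_counts) & ((counts > 0).sum(axis=1) >= min_genes)
    counts = counts[keep_cells]
    # Gene statistics are computed on the surviving cells only.
    keep_genes = (counts.sum(axis=0) >= min_gene_counts) & ((counts > 0).sum(axis=0) >= min_cells)
    return counts[:, keep_genes], keep_cells, keep_genes
```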

Visualize genes with the highest expression levels

In [None]:
sc.pl.highest_expr_genes(adata_filtered, n_top=20)

Make a copy of the original raw counts (post-filtering; pre-normalization)

In [None]:
adata_filtered.layers['raw_counts'] = adata_filtered.X.copy() # Make a copy
adata_filtered

In [None]:
adata_filtered.obs

### Save object

This is post-filtering and pre-normalization

In [None]:
adata_filtered.write_h5ad('/path/DataObjects_withoutUMAP/Xeniumdata_filtered_pre-normalization.h5ad')

### Continue with analysis

Normalize counts per cell using scanpy.pp.normalize_total.

Logarithmize the data, run principal component analysis, and compute a neighborhood graph of the observations using scanpy.pp.log1p, scanpy.pp.pca, and scanpy.pp.neighbors, respectively.

Use scanpy.tl.umap to embed the neighborhood graph of the data and cluster the cells into subgroups employing scanpy.tl.leiden.

In [None]:
sc.pp.normalize_total(adata_filtered, inplace=True)
sc.pp.log1p(adata_filtered)
# Save log-normalized counts as a layer (copied so later in-place steps cannot alter it)
adata_filtered.layers['log_normalized_counts'] = adata_filtered.X.copy()
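
As an illustration of what these two steps do numerically, here is a dense NumPy sketch. It assumes that, when no target sum is given, each cell is rescaled to the median of the per-cell totals; the function name `normalize_log1p_sketch` is hypothetical and the code is not scanpy's implementation.

```python
import numpy as np


def normalize_log1p_sketch(counts, target_sum=None):
    """Return (log1p-normalized, normalized) versions of a cells x genes count matrix."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1)
    if target_sum is None:
        # Assumption: use the median total over non-empty cells as the common target.
        target_sum = np.median(totals[totals > 0])
    # Empty cells are divided by 1 so they stay all-zero instead of producing NaNs.
    safe_totals = np.where(totals > 0, totals, 1.0)
    normalized = counts * (target_sum / safe_totals)[:, None]
    return np.log1p(normalized), normalized
```

After normalization every non-empty cell has the same total, so differences between cells reflect composition rather than sequencing depth.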

Calculate and plot the top highly variable genes

In [None]:
# Note: These min and max values are the default values
sc.pp.highly_variable_genes(adata_filtered, min_mean=0.0125, max_mean=3, min_disp=0.5)

In [None]:
sc.pl.highly_variable_genes(adata_filtered)

Regress out unwanted sources of variation and scale data

In [None]:
# Here, we're going to regress out total_counts and n_genes_by_counts

sc.pp.regress_out(adata_filtered, ["total_counts","n_genes_by_counts"])
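
Conceptually, regressing out covariates means fitting a linear model per gene and keeping the residuals. The following is a minimal least-squares sketch of that idea on dense arrays. The function name `regress_out_sketch` is hypothetical, and the use of ordinary least squares with an intercept is an assumption made for illustration.

```python
import numpy as np


def regress_out_sketch(X, covariates):
    """Return residuals of each gene (column of X) after a linear fit on the covariates."""
    X = np.asarray(X, dtype=float)
    covariates = np.asarray(covariates, dtype=float)
    # Design matrix: an intercept column plus one column per covariate.
    design = np.column_stack([np.ones(len(X)), covariates])
    # Solve all per-gene regressions at once.
    coef, *_ = np.linalg.lstsq(design, X, rcond=None)
    return X - design @ coef
```

The residuals have zero mean per gene and are uncorrelated with the covariates, which is why scaling is applied afterwards.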

In [None]:
sc.pp.scale(adata_filtered, max_value=10)
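
Scaling standardizes each gene to zero mean and unit variance, and `max_value` caps extreme values. Below is a dense NumPy sketch of that idea. The function name `scale_sketch` is hypothetical; the use of the sample standard deviation (`ddof=1`) and upper-only clipping are assumptions, and the exact behavior may differ between scanpy versions.

```python
import numpy as np


def scale_sketch(X, max_value=10.0):
    """Z-score each gene (column) and cap values above `max_value`."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0, ddof=1)  # assumption: sample standard deviation
    std[std == 0] = 1.0  # constant genes become all zeros instead of dividing by zero
    z = (X - mean) / std
    return np.minimum(z, max_value)  # assumption: only the upper tail is clipped
```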

Run PCA and plot PCA variance ratio

In [None]:
sc.pp.pca(adata_filtered, svd_solver='arpack')

In [None]:
sc.pl.pca(adata_filtered, color=['EPCAM','COL3A1','CD69'])

In [None]:
sc.pl.pca_variance_ratio(adata_filtered, log=True)

In [None]:
sc.pl.pca_variance_ratio(adata_filtered, n_pcs = 50, log=True)
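
The variance ratio shown in these plots can be sketched with a plain SVD, together with a simple rule for picking the number of PCs from the cumulative ratio. Both function names are hypothetical, and the 0.9 threshold is an illustrative assumption, not how `n_pcs` was chosen in this notebook.

```python
import numpy as np


def pca_variance_ratio_sketch(X):
    """Fraction of total variance explained by each principal component."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variances = singular_values**2 / (len(X) - 1)
    return variances / variances.sum()


def n_pcs_for_variance(ratio, threshold=0.9):
    """Smallest number of leading PCs whose cumulative ratio reaches `threshold`."""
    n = int(np.searchsorted(np.cumsum(ratio), threshold) + 1)
    return min(n, len(ratio))
```

In practice the elbow in the variance ratio plot is often judged by eye, so a rule like this is a starting point at most.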

In [None]:
adata_filtered

### Compute neighbor graph and plot UMAP

In [None]:
sc.settings.figdir = '/path/UMAP_pngs/'

#### Parameter descriptions:
n_neighbors
* A value between 2 and 100, representing the number of neighboring data points used for manifold approximation. Larger values give a manifold with a more global view of the dataset, while smaller values preserve more of the local structure.
* Default value is 15

n_pcs
* Use this many PCs
* Default value is None

min_dist
* The minimum distance between two points in the UMAP embedding.
* Default value is 0.5 in scanpy.tl.umap

spread
* A scaling factor for distance between embedded points.
* Default value is 1.0

Helpful resource: https://smorabit.github.io/blog/2020/umap/

#### Different versions -- Playing around with different n_pcs and spread values

#### v4: 
n_neighbors = 10; 
n_pcs = 30;
min_dist = 0.02;
spread = 1.75
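
When comparing several parameter sets, it can help to keep each version in one small, named record so that the settings and the output file name stay in sync. The sketch below encodes only the v4 values listed above. The names `UmapVersion`, `V4`, and `run_version` are hypothetical, and scanpy is imported lazily inside the function so that the configuration itself has no heavy dependencies.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class UmapVersion:
    """One named set of neighbor-graph and UMAP parameters."""

    name: str
    n_neighbors: int
    n_pcs: int
    min_dist: float
    spread: float

    def save_suffix(self) -> str:
        # Matches the file-name pattern used for the saved UMAP figure.
        return f"_{self.name}_Xeniumdata.png"


V4 = UmapVersion(name="v4", n_neighbors=10, n_pcs=30, min_dist=0.02, spread=1.75)


def run_version(adata, version: UmapVersion):
    """Apply one parameter set to `adata` in place (requires scanpy)."""
    import scanpy as sc  # lazy import: only needed when the pipeline is actually run

    sc.pp.neighbors(adata, n_neighbors=version.n_neighbors, n_pcs=version.n_pcs)
    sc.tl.umap(adata, min_dist=version.min_dist, spread=version.spread)
    sc.tl.leiden(adata)
    return adata
```

Because `run_version` modifies its input, passing `adata_filtered.copy()` would keep different versions independent of each other.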

#### Testing v4

In [None]:
# Note: this is a reference, not a copy; use adata_filtered.copy() to keep versions independent
adata_filtered_v4 = adata_filtered

In [None]:
sc.pp.neighbors(adata_filtered_v4, n_neighbors=10, n_pcs=30)
sc.tl.umap(adata_filtered_v4, min_dist=0.02, spread=1.75)
sc.tl.leiden(adata_filtered_v4)
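
To give some intuition for the `n_neighbors` and `n_pcs` arguments, here is a brute-force sketch of a k-nearest-neighbor search on a matrix of PC coordinates. It is exact and only practical for small inputs; scanpy's own neighbor search works differently, and whether a cell counts as its own neighbor is library-specific. The function name `knn_indices_sketch` is hypothetical.

```python
import numpy as np


def knn_indices_sketch(pcs, n_neighbors=10):
    """Indices of the `n_neighbors` nearest other cells for each cell (Euclidean distance)."""
    pcs = np.asarray(pcs, dtype=float)
    sq_norms = (pcs**2).sum(axis=1)
    # Squared pairwise distances via the identity |a - b|^2 = |a|^2 + |b|^2 - 2 a.b
    d2 = sq_norms[:, None] + sq_norms[None, :] - 2 * pcs @ pcs.T
    np.fill_diagonal(d2, np.inf)  # exclude each cell from its own neighbor list
    return np.argsort(d2, axis=1)[:, :n_neighbors]
```

Restricting `pcs` to the first `n_pcs` columns of the PCA result corresponds to the role of the `n_pcs` argument.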

In [None]:
sc.pl.umap(
    adata_filtered_v4,
    color=[
        "total_counts",
        "n_genes_by_counts",
        "leiden",
    ],
    wspace=0.4,
    save = '_v4_Xeniumdata.png',
)

In [None]:
# Save object with UMAP
adata_filtered_v4.write_h5ad('/path/DataObjects_withUMAP/Xeniumdata_umapv4.h5ad')