# ESFS example workflow - Parigi et al. 2022 mouse gut spatial transcriptomics
In this workflow we apply ESFS to a mouse gut spatial transcriptomics dataset from Parigi et al. 2022. This case study shows that ESFS is able to find a subset of genes that reveal distinct cell types in the adult gut. We then show that our ES combinatorial marker gene identification software finds biologically relevant gene expression profiles that non-negative matrix factorisation (NMF) fails to identify.


In [None]:
### Data path
path = "/Users/radleya/The Francis Crick Dropbox/BriscoeJ/Radleya/New_ES_Packages/GSE169749_RAW/"

### Import ESFS package

In [None]:
import ESFS
from scipy.sparse import csc_matrix

### Set python default discrete class colour palette to one with more colours

In [None]:
import plotly.express as px
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import anndata as ad
import scanpy as sc
# Copy the palette so that re-running this cell does not modify plotly's own list.
Colours = list(px.colors.qualitative.Dark24)
Colours.remove('#222A2A') # Remove black from the colour palette (personal preference).
#Colours = np.concatenate((Colours,Colours))
plt.rcParams['axes.prop_cycle'] = plt.cycler(color=Colours)

### Load data

In [None]:
## Load data
adata = sc.read_10x_h5(path+"GSM5213483_V19S23-097_A1_S1_raw_feature_bc_matrix.h5")

## Extract spatial transcriptomic grid coordinates
Cords = pd.read_csv(path+"GSM5213483_V19S23-097_A1_S1_tissue_positions_list.csv",index_col=0, header=None)
IJ_Cords = Cords[[2,3]]
adata = adata[IJ_Cords.index]

# adata = adata.copy()
# adata.X = adata.X.toarray()
# adata.var_names_make_unique()

## Provide dummy sample labels
adata.obs["Null_ID"] = np.repeat("Null_ID",adata.shape[0])

### Basic Quality Control

In [None]:
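# Keep only genes that are expressed in more than 30 samples.
# np.asarray(...).ravel() gives a flat per-gene count whether adata.X is dense or sparse.
Expressing_Samples = np.asarray((adata.X > 0).sum(axis=0)).ravel()
Keep_Genes = adata.var_names[Expressing_Samples > 30]
adata = adata[:,Keep_Genes]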
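
To make the gene filter above concrete, here is a minimal sketch on a toy sparse counts matrix. The matrix, the names `toy_counts`, `n_expressing` and `keep_mask`, and the threshold of 1 are hypothetical and chosen only for illustration; the workflow itself uses a threshold of 30.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy counts matrix: 4 samples (rows) x 3 genes (columns).
toy_counts = csr_matrix(np.array([[0, 2, 1],
                                  [0, 0, 3],
                                  [1, 0, 4],
                                  [0, 0, 5]]))

# Number of samples with non-zero expression for each gene, as a flat 1D array.
n_expressing = np.asarray((toy_counts > 0).sum(axis=0)).ravel()

# Keep genes expressed in more than `min_samples` samples (toy threshold).
min_samples = 1
keep_mask = n_expressing > min_samples
```

Only the third gene passes this toy threshold, because it is the only one detected in more than one sample.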

### Create and save a scaled version of the counts matrix
Create and save a scaled version of the counts matrix where each gene has its observed expression values clipped if they are above a percentile threshold (default=97.5) and then normalised by the gene's maximum value so that all values lie between 0 and 1, which is a requirement for the Entropy Sort Score (ESS) and Error Potential (EP) calculations.
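
The clipping and normalisation described above can be sketched in plain NumPy. This is an illustration under stated assumptions, not the implementation inside `ESFS.Create_Scaled_Matrix`: the helper name `clip_and_scale` is hypothetical, and the sketch computes the percentile over all values of a gene, which may differ from what ESFS does.

```python
import numpy as np

def clip_and_scale(counts, clip_percentile=97.5):
    """Clip each gene (column) at a percentile, then divide by the column maximum."""
    counts = np.asarray(counts, dtype=float)
    # Per-gene upper bound taken from the chosen percentile (assumption: all values are used).
    upper = np.percentile(counts, clip_percentile, axis=0)
    clipped = np.minimum(counts, upper)
    # Normalise by the per-gene maximum; genes that are all zero are left at zero.
    col_max = clipped.max(axis=0)
    col_max[col_max == 0] = 1.0
    return clipped / col_max

# Toy data: 3 samples x 2 genes, the second gene has one extreme value.
toy_counts = np.array([[0.0, 10.0],
                       [2.0, 20.0],
                       [4.0, 400.0]])
toy_scaled = clip_and_scale(toy_counts)
```

After scaling, every gene has a maximum of 1 and no value falls below 0, which is the property the ESS and EP calculations rely on.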

In [None]:
adata = ESFS.Create_Scaled_Matrix(adata,log_scale=False)

### Calculate ESS and EP matrices
Here we use the theory of Entropy Sorting to generate a pairwise gene correlation matrix (ESS matrix) and a correlation significance matrix (EP matrix). To speed up calculations, parallel processing is implemented. To control the number of cores used for processing, vary the "Use_Cores" parameter, which defaults to "Use_Cores=-1", indicating that the software will use n-1 cores, where n is the number of cores available on your machine.
Please note that "Use_Cores=-1" is a special flag and "Use_Cores=-2" will not use n-2 cores. You should instead set "Use_Cores" to a positive integer equal to the number of cores you would like to use.
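
One way to read the "Use_Cores" convention described above is sketched below. The helper `resolve_cores` is hypothetical and is not part of ESFS. Capping the request at the number of available cores and raising an error for other negative values are assumptions of this sketch, since the text only states that -2 does not mean n-2.

```python
import os

def resolve_cores(use_cores=-1, available=None):
    """Translate a Use_Cores-style value into a concrete number of worker processes."""
    if available is None:
        available = os.cpu_count() or 1
    if use_cores == -1:
        # Special flag: leave one core free, but always use at least one.
        return max(available - 1, 1)
    if use_cores >= 1:
        # Positive integers are taken literally, capped at what the machine has (assumption).
        return min(int(use_cores), available)
    # Assumption of this sketch: any other value is rejected rather than guessed at.
    raise ValueError("Use_Cores must be -1 or a positive integer")
```

For example, on a machine with 8 cores, `resolve_cores(-1, available=8)` gives 7 and `resolve_cores(4, available=8)` gives 4.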

In [None]:
adata = ESFS.Parallel_Calc_ES_Matricies(adata, Secondary_Features_Label="Self", save_matrices=np.array(["ESSs","EPs"]), Use_Cores=-1)

### Calculate gene importance rankings
We now calculate the gene importance weights and hence gene importance rankings for the remaining genes in the dataset. For further details see our manuscript.

In [None]:
### ESS_Threshold is one of two parameters that users should prioritise when trying to optimise the ESFS
# workflow. ESS_Threshold designates the threshold that determines which edges between genes in the ESS
# matrix are retained. An ESS_Threshold at or near 0 is a good starting point for most datasets; here we
# use 0.01 for the mouse gut data. For other datasets, increasing the ESS_Threshold can be beneficial, as
# demonstrated in our Delile et al. 2019 neural tube scRNA-seq data example workflow.
ESS_Threshold = 0.01

### An array of known important genes for this dataset so that we may use them as a
# reference point for how genes are being ranked/grouped. Replace these genes with those you are
# interested in for your data.
Known_Important_Genes = np.array(["Slc51a","Tagln","Pcp4"])

### Run the ES_Rank_Genes function with "Known_Important_Genes=Known_Important_Genes" as reference point genes.
# Possible batch effect genes can be excluded with the "Exclude_Genes" parameter, which is not used here.
adata = ESFS.ES_Rank_Genes(adata,ESS_Threshold,Known_Important_Genes=Known_Important_Genes)

### Visualise clustering of top ranked genes in a UMAP generated from their pairwise ESS scores

In [None]:
### Visualise top ranked genes graph
# Num_Top_Ranked_Genes is the second of the two parameters that users should focus on when optimising
# the ESFS workflow for their data. Values between 3000-4000 are typically a good place to start.
Num_Top_Ranked_Genes = 8000

# The "Clustering" parameter can be set to an integer value if you wish to cluster genes in the UMAP with Kmeans clustering.
# Set "Clustering='hdbscan'" for automated density based clustering, and "Clustering=None" for no clustering.
Top_ESS_Genes, Gene_Clust_Labels, Gene_Embedding = ESFS.Plot_Top_Ranked_Genes_UMAP(adata,Num_Top_Ranked_Genes,Clustering="hdbscan",Known_Important_Genes=Known_Important_Genes)
plt.show()

### Visualise the cell clustering UMAPs for each of the groupings of genes identified in the previous step

This stage of the workflow is a critical step for trying out different input parameters to optimise the final results. The two main parameters to vary are the "Num_Top_Ranked_Genes" and "Clustering" inputs of the Plot_Top_Ranked_Genes_UMAP function.

"Num_Top_Ranked_Genes" will take the top ranked genes according to the cESFW algorithm. We recommend starting at around 3000-4000 genes and tweaking the "Clustering" parameter before trying higher or lower values of "Num_Top_Ranked_Genes".

"Clustering" can be an integer, "hdbscan" or "None". An integer tells the algorithm to use Kmeans clustering with that number of clusters. "hdbscan" uses the hdbscan density based clustering algorithm to automatically identify an optimal number of clusters according to the gene UMAP embedding space.

In brief, you are seeking a gene cluster, or a combination of gene clusters, that reveals biological structure of interest by excluding clusters of genes that contribute a large amount of noise to downstream analysis.
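
The three "Clustering" options described above can be illustrated with scikit-learn. This is a sketch of the general idea rather than the code inside Plot_Top_Ranked_Genes_UMAP: the helper `cluster_gene_embedding` and the toy embedding are hypothetical, and the "hdbscan" branch assumes scikit-learn 1.3 or later, which provides `sklearn.cluster.HDBSCAN`.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_gene_embedding(embedding, clustering=None, random_state=0):
    """Return one cluster label per gene: KMeans for an integer, HDBSCAN for 'hdbscan', zeros for None."""
    embedding = np.asarray(embedding)
    if clustering is None:
        # No clustering: every gene is placed in a single group.
        return np.zeros(embedding.shape[0], dtype=int)
    if isinstance(clustering, (int, np.integer)):
        # Integer: KMeans with that number of clusters.
        model = KMeans(n_clusters=int(clustering), n_init=10, random_state=random_state)
        return model.fit_predict(embedding)
    if clustering == "hdbscan":
        # Imported lazily; requires scikit-learn >= 1.3.
        from sklearn.cluster import HDBSCAN
        return HDBSCAN().fit_predict(embedding)
    raise ValueError("clustering must be an integer, 'hdbscan' or None")

# Toy 2D "gene embedding" with two well separated groups of 20 genes each.
rng = np.random.default_rng(0)
toy_embedding = np.vstack([rng.normal(0, 0.1, (20, 2)), rng.normal(5, 0.1, (20, 2))])
toy_labels = cluster_gene_embedding(toy_embedding, clustering=2)
toy_no_clusters = cluster_gene_embedding(toy_embedding, clustering=None)
```

With an integer the number of gene clusters is fixed in advance, whereas the density based option chooses the number of clusters from the embedding itself.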

In [None]:
Gene_Cluster_Embeddings, Gene_Cluster_Selected_Genes = ESFS.Get_Gene_Cluster_Cell_UMAPs(adata,Gene_Clust_Labels,Top_ESS_Genes,n_neighbors=50,min_dist=0.1,log_transformed=False)

ESFS.Plot_Gene_Cluster_Cell_UMAPs(adata,Gene_Cluster_Embeddings,Gene_Cluster_Selected_Genes,Cell_Label="Null_ID")
plt.show()

### If you'd like to generate the cell UMAP for a specific cluster or a combination of clusters, use the specific_cluster parameter

For this workflow we find the combination of gene clusters 6 and 3 to be most informative and choose to plot the resulting UMAP from these genes.

In [None]:
Gene_Cluster_Embeddings, Gene_Cluster_Selected_Genes = ESFS.Get_Gene_Cluster_Cell_UMAPs(adata,Gene_Clust_Labels,Top_ESS_Genes,specific_cluster=[6,3],n_neighbors=50,min_dist=0.1,log_transformed=False)

ESFS.Plot_Gene_Cluster_Cell_UMAPs(adata,Gene_Cluster_Embeddings,Gene_Cluster_Selected_Genes,Cell_Label="Null_ID")
plt.show()

We can also look at the gene expression of a particular gene.

In [None]:
ESFS.Plot_Gene_Cluster_Cell_UMAPs(adata,Gene_Cluster_Embeddings,Gene_Cluster_Selected_Genes,Cell_Label="Muc2")

### Finding marker genes
An important task in scRNA-seq analysis is to find marker genes for distinct populations. Commonly this is achieved by using statistical tests to perform differentially expressed gene (DEG) analysis. However, as discussed in our manuscript, this process is limited by the requirement to identify a set of discrete non-overlapping clusters. Here we provide code to efficiently identify gene expression profiles enriched in the combinatorial cluster space of an intentionally over-clustered dataset. This is possible because ES provides a mathematically rigorous way to turn the intractable combinatorial cluster problem into a linearly complex problem, solvable in minutes.

We start by intentionally over-clustering the data. To do this we will use the Leiden clustering algorithm provided by the Scanpy package.

In [None]:
# Sub-set the data to the genes identified by ESFS in the previous steps
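# Copy the subset so that Scanpy writes its results to a real AnnData object rather than a view
sub_adata = adata[:, Gene_Cluster_Selected_Genes[0]].copy()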

# Perform Leiden clustering with a high "resolution" value
sc.pp.neighbors(sub_adata, n_neighbors=50, n_pcs=0, metric = "correlation")
sc.tl.umap(sub_adata,min_dist=0.1)
sc.tl.leiden(sub_adata,resolution = 10)

# Visualise the clustering
sc.pl.umap(sub_adata, color=['leiden'],legend_fontsize=6)

In [None]:
# Extract the cluster labels and add them to the adata object
Leiden_Clusts = np.asarray(sub_adata.obs['leiden'])
Unique_Leiden_Clusts = np.unique(Leiden_Clusts)
np.min(np.unique(Leiden_Clusts,return_counts=True)[1])
sample_labels = "Leiden_Clusts"
adata.obs[sample_labels] = Leiden_Clusts

Now we use the Leiden clusters as a set of secondary features to identify sets of samples that maximise the correlation of each feature/gene in adata.
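
Using cluster labels as secondary features amounts to one-hot encoding them, so that each cluster becomes a binary column over the samples. The toy labels and variable names below are hypothetical and only illustrate the resulting data structure.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csc_matrix

# Toy cluster labels for 5 samples (Leiden labels are strings in Scanpy).
toy_clusters = np.array(["0", "1", "0", "2", "1"])

# One column per cluster; a 1 marks membership of that sample in the cluster.
toy_dummies = pd.get_dummies(toy_clusters)
toy_secondary = csc_matrix(toy_dummies.to_numpy(dtype="float32"))

# Each sample belongs to exactly one cluster, so every row sums to 1.
row_sums = np.asarray(toy_secondary.sum(axis=1)).ravel()
cluster_sizes = np.asarray(toy_secondary.sum(axis=0)).ravel()
```

The sparse format keeps memory use low when the data are heavily over-clustered and most entries are zero.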

In [None]:
Secondary_Features_Label = "Leiden_Clusts_Secondary_Features"
Leiden_Clusts_Secondary_Features = csc_matrix(pd.get_dummies(Leiden_Clusts).astype("f"))
adata.obsm[Secondary_Features_Label] = Leiden_Clusts_Secondary_Features

We use Parallel_Calc_ES_Matricies to calculate the required metrics for ES combinatorial marker gene identification (save_matrices=np.array(["ESSs","SGs"])). 
Secondary_Features_Label designates a prefix for saving the ESS and SG outputs.

In [None]:
adata = ESFS.Parallel_Calc_ES_Matricies(adata, Secondary_Features_Label=Secondary_Features_Label, save_matrices=np.array(["ESSs","SGs"]), Use_Cores=-1)

With the ESS and SG matrices saved to adata, we can use the Find_Max_ESSs function to identify which combination of Leiden Clusters maximises the correlation (ESS) of each feature/gene in adata. The most important output of Find_Max_ESSs is a 2D array where each row is a sample of adata and each column is a feature representing the identified combination of Leiden Clusters that maximises the ESS score of the respective feature/gene column in adata.

In [None]:
adata = ESFS.Find_Max_ESSs(adata,Secondary_Features_Label)

The 2D array of combined Leiden Clusters that maximise adata feature/gene ESSs is then used by Parallel_Calc_ES_Matricies as secondary features to obtain the ESS scores of each feature/gene in adata against each identified structure maximising feature. This creates a 2D pairwise matrix of adata features/genes Vs coarse-grained combined Leiden Clusters.

In [None]:
Secondary_Features_Label = "Leiden_Clusts_Secondary_Features_Max_ESS_Features"
adata = ESFS.Parallel_Calc_ES_Matricies(adata, Secondary_Features_Label=Secondary_Features_Label, save_matrices=np.array(["ESSs"]), Use_Cores=-1)

Now that we have a coarse-grained representation of where gene structure is maximised in different regions of the data, we can use Find_Minimal_Combinatorial_Gene_Set to identify a minimal set of clusters/genes that captures the most unique/non-overlapping structure in the data. This minimal gene set may be thought of as an optimised set of unsupervised marker genes.

In [None]:
Secondary_Features_Label
Num_Genes = 50
Chosen_Clusters, Chosen_Genes, Chosen_Pairwise_ESSs = ESFS.Find_Minimal_Combinatorial_Gene_Set(adata,Num_Genes,Secondary_Features_Label,Resolution=0.75,Num_Reheats=3)

We can use knn_Smooth_Gene_Expression to get a knn mean smoothed representation of the data. We only use this for visualisation purposes.
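
The idea of knn mean smoothing can be sketched with scikit-learn. This is not the implementation inside `ESFS.knn_Smooth_Gene_Expression`: the helper `knn_mean_smooth` is hypothetical, and including each sample among its own neighbours is an assumption of this sketch.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_mean_smooth(expression, knn=30, metric="correlation"):
    """Replace each sample's expression by the mean over its knn nearest samples."""
    expression = np.asarray(expression, dtype=float)
    nn = NearestNeighbors(n_neighbors=knn, metric=metric).fit(expression)
    # Indices of the nearest samples for every sample (the sample itself is included).
    _, neighbour_idx = nn.kneighbors(expression)
    # expression[neighbour_idx] has shape (n_samples, knn, n_genes); average over the neighbours.
    return expression[neighbour_idx].mean(axis=1)

# Toy data: 6 samples x 5 genes.
rng = np.random.default_rng(0)
toy_expression = rng.random((6, 5))
# With knn equal to the number of samples, every sample is averaged over the whole dataset.
toy_smoothed = knn_mean_smooth(toy_expression, knn=6)
```

Smaller values of `knn` preserve more local structure, while larger values average over broader neighbourhoods.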

In [None]:
adata = ESFS.knn_Smooth_Gene_Expression(adata, Gene_Cluster_Selected_Genes[0], knn=30, metric='correlation', log_scale=False)

### ES Marker Genes Vs. NMF
To demonstrate the ability of our ES marker gene software to identify distinct gene expression patterns in single cell RNA sequencing data, we compare it against non-negative matrix factorisation (NMF).

In [None]:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from scipy.optimize import linear_sum_assignment
from sklearn.decomposition import NMF


First we provide the same data used to generate the above UMAP to NMF and ask it to find the same number of latent factors as we did with Find_Minimal_Combinatorial_Gene_Set (N=50).
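
For readers unfamiliar with NMF, the following toy example shows the shapes of the two factors returned by scikit-learn. The random data and the `toy_` names are hypothetical; only the `NMF` call mirrors the settings used in this workflow, with fewer components.

```python
import numpy as np
from sklearn.decomposition import NMF

# Toy non-negative data: 20 samples x 8 genes.
rng = np.random.default_rng(0)
toy_X = rng.random((20, 8))

toy_model = NMF(n_components=3, init="nndsvd", max_iter=500)
toy_U = toy_model.fit_transform(toy_X)  # sample loadings, shape (20, 3)
toy_V = toy_model.components_           # gene loadings, shape (3, 8)
```

Each column of the sample loadings is one latent factor over the samples, which is what gets compared against the ES marker gene profiles.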

In [None]:
model = NMF(n_components=50, init='nndsvd', max_iter=500)
U = model.fit_transform(adata[:,Gene_Cluster_Selected_Genes[0]].layers["Scaled_Counts"])
V = model.components_

To obtain the set of latent features from our ES marker gene algorithm we take the scaled expression of the genes identified by Find_Minimal_Combinatorial_Gene_Set (Chosen_Genes).

In [None]:
A = adata[:,Chosen_Genes].layers["Scaled_Counts"]

Now we match the most similar latent features between the ES features and NMF, while only allowing a feature to be paired with another feature once.
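
The one-to-one matching can be illustrated on toy data where the correct pairing is known in advance. The arrays `toy_A` and `toy_U` are hypothetical: each column of `toy_U` is a rescaled copy of one column of `toy_A`, placed in a different position.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from scipy.optimize import linear_sum_assignment

# Toy "ES features": 4 samples x 3 features.
toy_A = np.array([[1.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0],
                  [0.0, 0.0, 1.0],
                  [0.0, 0.0, 1.0]])

# Toy "NMF factors": the same columns, rescaled and reordered.
toy_U = np.column_stack([2.0 * toy_A[:, 2], 3.0 * toy_A[:, 0], 0.5 * toy_A[:, 1]])

# Cosine similarity ignores scale, so each feature has exactly one perfect partner.
toy_similarity = cosine_similarity(toy_A.T, toy_U.T)

# linear_sum_assignment minimises cost, so negate the similarity to maximise it.
toy_rows, toy_cols = linear_sum_assignment(-toy_similarity)
toy_matches = [(int(r), int(c)) for r, c in zip(toy_rows, toy_cols)]
```

Here feature 0 is paired with factor 1, feature 1 with factor 2, and feature 2 with factor 0, and no factor is used twice.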

In [None]:
# A and U both have shape (n_samples, n_features), with n_features = 50 in this workflow
A_T = A.T  # shape (n_features, n_samples)
U_T = U.T  # shape (n_features, n_samples)
# Step 1: Compute the pairwise cosine similarity between ES features and NMF factors
similarity_matrix = cosine_similarity(A_T, U_T)  # shape (n_features, n_features)
# Step 2: Convert to cost matrix (negate similarity since we want to maximize)
cost_matrix = -similarity_matrix
# Step 3: Solve assignment problem (Hungarian algorithm)
row_ind, col_ind = linear_sum_assignment(cost_matrix)
# Now: A[:, row_ind[i]] matches uniquely to U[:, col_ind[i]]
matches = list(zip(row_ind, col_ind))

When we plot each of the pairs, we find that while in some cases both ES marker gene selection and NMF are able to identify the same cell types (Comprisons 1), there are clear cases where NMF fails to find important marker gene expression profiles that ES marker gene identification finds (Comprisons 2). Below we present some key examples. See our manuscript for further details.

In [None]:
from IPython.display import IFrame, display

In [None]:
display(IFrame(src="/Users/radleya/The Francis Crick Dropbox/BriscoeJ/Radleya/New_ES_Packages/ESFS_Git/ESFS/Example_Workflows/Gut_Spatial_Transcriptomics_Example/Comprisons 1.pdf", width=800, height=600))

In [None]:
display(IFrame(src="/Users/radleya/The Francis Crick Dropbox/BriscoeJ/Radleya/New_ES_Packages/ESFS_Git/ESFS/Example_Workflows/Gut_Spatial_Transcriptomics_Example/Comprisons 2.pdf", width=800, height=600))

In [None]:
# i = 0
# # for i in np.arange(len(matches)):
#     #
# Pair = matches[i]
# #
# plt.figure(figsize=(14,7))
# #
# Gene = Chosen_Genes[Pair[0]]
# Exp = np.asarray(adata[:,Gene].layers["Smoothed_Expressions"].A).T[0]
# plt.subplot(1,2,1)
# plt.title(Gene,fontsize=18)
# Grid = np.zeros((np.max(IJ_Cords[2]+1),np.max(IJ_Cords[3]+1)))
# Grid[Cords] = Exp
# plt.imshow(Grid,cmap="magma",aspect="auto")  
# plt.xticks([])
# plt.yticks([])  
# #
# plt.subplot(1,2,2)
# Factor = Pair[1]
# Values = U[:,Factor]
# plt.title(Factor,fontsize=18)
# Grid = np.zeros((np.max(IJ_Cords[2]+1),np.max(IJ_Cords[3]+1)))
# Grid[Cords] = Values
# plt.imshow(Grid,cmap="magma",aspect="auto")  
# plt.xticks([])
# plt.yticks([])  
# #
# plt.subplots_adjust(0.02,0.02,0.98,0.9)
# # plt.savefig("/Users/radleya/The Francis Crick Dropbox/BriscoeJ/Radleya/ESFS_Paper/ESFS Figures/Paragi2020_Plots/Matched_Plots/" + Gene + ".png",dpi=600)
# # plt.close()

# plt.show()