# Spatially-informed Bivariate Metrics

This tutorial provides an overview of the local and global scores implemented in LIANA+. These scores are used to identify spatially co-expressed ligand-receptor pairs. However, they are also applicable to other types of spatially-informed bivariate analyses.
It briefly explains the mathematical formulations of the scores, which include adaptations of bivariate Moran's R, Pearson correlation, Spearman correlation, weighted Jaccard similarity, and Cosine similarity. The tutorial also showcases interaction categories (masks) and significance testing.

### Environment Setup

In [None]:
import os

import pandas as pd
import scanpy as sc
import decoupler as dc
import liana as li
from matplotlib import pyplot as plt

# set dpi to 100, to make the notebook smaller
plt.rcParams['figure.dpi'] = 100

datadir = '../../datasets/Hands_on_2_LIANA_MistY/'

### Load and Normalize Data

In [None]:
adata = sc.read(os.path.join(datadir, "kuppe_heart19.h5ad"))
adata.layers['counts'] = adata.X.copy()
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)

### Available spatial correlation/similarity Functions

In [None]:
li.mt.bivariate.show_functions()

### How do the local functions work?

The local functions are quite simple: they are weighted versions of well-known similarity metrics. For example, the spatially-weighted version of Cosine similarity is defined as:

$$ \text{wCosine}_i = \frac{\sum_{j=1}^n w_{ij} x_j y_j}{\sqrt{\sum_{j=1}^n w_{ij} x_j^2} \sqrt{\sum_{j=1}^n w_{ij} y_j^2}}$$

where, for each spot **i**, we sum over all **n** spots **j**; **w** denotes the spatial connectivity weight from spot **i** to spot **j**, and **x** and **y** are the two variables being compared (e.g. a ligand and a receptor).
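The formula above can be sketched in a few lines of NumPy. This is a minimal illustration, not LIANA+'s implementation: the function name `local_weighted_cosine`, the dense weight matrix, and the choice to return 0 where the denominator is zero are all assumptions made for the example.

```python
import numpy as np

def local_weighted_cosine(x, y, W):
    """Spatially-weighted cosine similarity, one score per spot i.

    x, y : 1D arrays of length n (one value per spot)
    W    : (n, n) array, W[i, j] = spatial weight from spot i to spot j
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    W = np.asarray(W, dtype=float)
    numerator = W @ (x * y)            # sum_j w_ij * x_j * y_j
    x_norm = np.sqrt(W @ (x ** 2))     # sqrt(sum_j w_ij * x_j^2)
    y_norm = np.sqrt(W @ (y ** 2))     # sqrt(sum_j w_ij * y_j^2)
    denominator = x_norm * y_norm
    # Assumption for this sketch: spots with a zero denominator get a score of 0
    return np.divide(numerator, denominator,
                     out=np.zeros_like(numerator), where=denominator > 0)

# Toy example: three spots in a row, y is proportional to x
W = np.array([[1.0, 0.5, 0.0],
              [0.5, 1.0, 0.5],
              [0.0, 0.5, 1.0]])
x = np.array([1.0, 2.0, 0.0])
y = 2.0 * x
scores = local_weighted_cosine(x, y, W)
```

Because `y` is a positive multiple of `x` in this toy example, every spot with a non-empty neighbourhood gets a local score of 1; flipping the sign of `y` would give -1.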

### How do the global functions work?

Global scores can be calculated as the average local score across all spots for each ligand-receptor pair.

In addition, we can also use Global bivariate Moran's R (or [Lee's statistic](https://onlinelibrary.wiley.com/doi/full/10.1111/gean.12106)) - an extension of univariate Moran's I, as proposed by [Anselin 2019](https://onlinelibrary.wiley.com/doi/full/10.1111/gean.12164) and [Lee and Li, 2019](https://onlinelibrary.wiley.com/doi/full/10.1111/gean.12106); implemented in [SEAGAL](https://academic.oup.com/bioinformatics/article/39/7/btad431/7223197) and [SpatialDM](https://github.com/StatBiomed/SpatialDM).
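As a rough illustration of how a global bivariate Moran-type statistic is built, the sketch below z-scores both variables, row-normalizes the weights, and averages the product of each spot's value of `x` with the weighted average of `y` among its neighbours. The function name `global_bivariate_moran`, the row-normalization, and the division by `n` are assumptions chosen for illustration; LIANA+, SEAGAL, and SpatialDM may normalize the statistic differently.

```python
import numpy as np

def global_bivariate_moran(x, y, W):
    """Sketch of a global bivariate Moran-type statistic.

    Assumes z-scored variables and row-normalized weights (illustrative choices).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    W = np.asarray(W, dtype=float)
    # z-score both variables (population standard deviation)
    zx = (x - x.mean()) / x.std()
    zy = (y - y.mean()) / y.std()
    # Row-normalize so that each spot's weights sum to 1
    W = W / W.sum(axis=1, keepdims=True)
    # Average over spots of z_x[i] * (weighted mean of z_y around i)
    return float(zx @ W @ zy) / x.size

# Sanity check with identity weights: each spot is its own only neighbour,
# so the statistic reduces to the Pearson correlation between x and y
x = np.array([1.0, 2.0, 3.0, 4.0])
W_identity = np.eye(4)
r_same = global_bivariate_moran(x, x, W_identity)
r_opposite = global_bivariate_moran(x, -x, W_identity)
```

With real spatial weights, values near zero indicate that high values of one variable are not preferentially surrounded by high (or low) values of the other.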



### Spatial Connectivity

The way that spatially-informed methods usually work is by making use of weights based on the proximity (or spatial connectivity) between spots/cells.
These spatial connectivities are then used to calculate the metric of interest, e.g. Cosine similarity, in a spatially-informed manner.

The spatial weights in LIANA+ are by default defined as a family of radial kernels that use the inverse Euclidean distance between cells/spots to bound the weights between 0 and 1. Spots that are closest to one another have the highest spatial connectivity (1), while those that are thought to be too far apart to be in contact are assigned 0.

Key parameters of `spatial_neighbors` include:
- `bandwidth` controls the radius of the spatial connectivities where higher values will result in a broader area being considered (controls the radius relative to the coordinates stored in `adata.obsm['spatial']`)
- `cutoff` controls the minimum value that will be considered to have a spatial relationship (anything lower than the `cutoff` is set to 0).
- `kernel` controls the distribution (shape) of the weights ('gaussian' by default)
- `set_diag` sets the diagonal (i.e. the weight of each spot to itself) to 1 if True. **NOTE**: Here we set it to True as we expect many cells to be neighbors of themselves within a Visium spot
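To make the roles of `bandwidth`, `cutoff`, and `set_diag` concrete, here is a minimal NumPy sketch of a Gaussian radial kernel. It assumes a kernel of the form exp(-d² / (2·bandwidth²)) and a dense distance matrix; the function name `gaussian_spatial_weights` is hypothetical, zeroing the diagonal when `set_diag=False` is an assumption, and the actual `li.ut.spatial_neighbors` implementation may differ in its details.

```python
import numpy as np

def gaussian_spatial_weights(coords, bandwidth, cutoff=0.1, set_diag=True):
    """Illustrative Gaussian kernel weights in [0, 1] from 2D coordinates."""
    coords = np.asarray(coords, dtype=float)
    # Pairwise Euclidean distances between all spots
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # Assumed kernel form: weight decays smoothly with distance
    weights = np.exp(-(dist ** 2) / (2.0 * bandwidth ** 2))
    # Anything below the cutoff is treated as "no spatial relationship"
    weights[weights < cutoff] = 0.0
    # Self-weight: 1 if set_diag, otherwise 0 (assumption for this sketch)
    np.fill_diagonal(weights, 1.0 if set_diag else 0.0)
    return weights

# Three spots on a line: a close pair and one distant spot
coords = np.array([[0.0, 0.0], [100.0, 0.0], [1000.0, 0.0]])
W = gaussian_spatial_weights(coords, bandwidth=150, cutoff=0.1)
```

In this toy example the two nearby spots receive a weight between 0 and 1, while the distant spot falls below the cutoff and is set to 0.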

As choosing an optimal bandwidth can be tricky, we provide the `query_bandwidth` function, which uses a set of coordinates to estimate how many cell or spot neighbors are considered for each spot over a range of bandwidths.

In [None]:
plot, _ = li.ut.query_bandwidth(coordinates=adata.obsm['spatial'], start=0, end=300, interval_n=50)
plot


Here, we can see that a bandwidth of 130-200 (pixels) roughly includes 6 neighbours i.e. the first ring of neighbours in the hexagonal grid of 10x Visium. So, we will build the spatial graph with a bandwidth of 150.
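To see why the first ring of a hexagonal grid contains six neighbours, the sketch below builds a toy hexagonal lattice and counts the spots within a hard radius of an interior spot. This is a simplification: the `hex_grid` and `count_neighbors` helpers and the spacing of 100 units are hypothetical, and LIANA+ uses kernel weights with a cutoff rather than a hard radius, so the radii here do not translate directly into `bandwidth` values.

```python
import numpy as np

def hex_grid(n_rows, n_cols, spacing):
    """Toy hexagonal lattice: odd rows are shifted by half a spacing."""
    rows, cols = np.meshgrid(np.arange(n_rows), np.arange(n_cols), indexing="ij")
    x = (cols + 0.5 * (rows % 2)) * spacing
    y = rows * spacing * np.sqrt(3) / 2
    return np.column_stack([x.ravel(), y.ravel()])

def count_neighbors(coords, radius):
    """Number of other spots within `radius` of each spot."""
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    return ((dist > 0) & (dist <= radius)).sum(axis=1)

coords = hex_grid(5, 5, spacing=100.0)
center = 2 * 5 + 2  # interior spot at row 2, column 2

# Radius slightly above one spacing: only the first ring is captured
first_ring = count_neighbors(coords, radius=120.0)[center]
# Radius above sqrt(3) spacings: the second ring is captured as well
two_rings = count_neighbors(coords, radius=180.0)[center]
```

For the interior spot, the smaller radius captures the six immediate neighbours, and the larger radius adds the next six, which is the kind of step-wise increase one can look for in the `query_bandwidth` plot.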

In [None]:
li.ut.spatial_neighbors(adata, bandwidth=150, cutoff=0.1, kernel='gaussian', set_diag=True)

Let's visualize the spatial weights for a single spot to all other spots in the dataset:

In [None]:
li.pl.connectivity(adata, idx=0, size=1.3, figure_size=(6, 5)) 

### Bivariate Ligand-Receptor Relationships

Now that we have covered the basics, let's see how these scores look for potential ligand-receptor interactions on our 10X Visium Slide.
Note that LIANA+ will take the presence of heteromeric complexes into account at the individual spot-level!

In [None]:
lrdata = li.mt.bivariate(adata,
                resource_name='consensus', # uses HUMAN gene symbols
                local_name='cosine', # Name of the local function
                global_name="morans", # Name global function (or 'lee')
                n_perms=None, # Number of permutations to calculate a p-value
                mask_negatives=False, # Whether to mask LowLow/NegativeNegative interactions
                add_categories=True, # Whether to add local categories to the results
                nz_prop=0.3, # Minimum expr. proportion for ligands/receptors and their subunits
                use_raw=False,
                verbose=True
                )

In [None]:
lrdata

In [None]:
# save lrdata for future use
# drop the global p-value column (if present) before writing to disk
lrdata.var = lrdata.var.drop(columns=['morans_pvals'], errors='ignore')
lrdata.write("lrdata.h5ad")

#### Global scores

In [None]:
lrdata.var.sort_values("mean", ascending=False).head(5)

The column 'mean' represents the average local Cosine score across all spots for each ligand-receptor pair. Permutation-based p-values are only reported when `n_perms` is set; with `n_perms=None`, as above, none are calculated.

The column 'morans' represents the global Moran's R score for each ligand-receptor pair. Bivariate Moran's R values near zero imply spatial independence, while positive or negative values reflect spatial co-clustering or spatial cross-dispersion, respectively.

In [None]:
lrdata.var.sort_values("morans", ascending=False).head()

From these global summaries, we see that the average Cosine similarity largely represents **coverage** - e.g. *TIMP1 & CD63* is ubiquitously and uniformly distributed across the slide.

On the other hand, **APOE^LRP1** is among the most variable interactions and among those with the highest global Moran's R. This interaction is thus more likely to reflect a biological relationship with a distinct spatial clustering pattern.

So, let's visualize both:

In [None]:
sc.set_figure_params(dpi=80, dpi_save=300, format='png', frameon=False, transparent=True, figsize=[5,5])

In [None]:
sc.pl.spatial(lrdata, color=['APOE^LRP1', 'TIMP1^CD63'], size=1.4, vmax=1, cmap='magma')

As expected, we see that the **TIMP1 & CD63** interaction is uniformly distributed across the slide, while **APOE^LRP1** shows a clear spatial pattern.

We can also see that this is the case when we look at the individual genes:

In [None]:
sc.pl.spatial(adata, color=['APOE', 'LRP1',  'TIMP1', 'CD63'],
              size=1.4, ncols=2)
# Alternatively, use `squidpy.pl.spatial_scatter` (requires `import squidpy as sq`):
# sq.pl.spatial_scatter(adata, color=['APOE', 'LRP1', 'TIMP1', 'CD63'], size=1.2, ncols=2)

### Permutation-based p-values
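In addition to the local scores, LIANA+ can calculate permutation-based p-values from a null distribution generated by shuffling the spot labels. This requires setting `n_perms` to a positive number of permutations in `li.mt.bivariate`; in the call above it was left as `None`, so no permutations were run.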
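The idea of a label-shuffling null distribution can be sketched with NumPy as follows. This is an illustration under stated assumptions, not LIANA+'s procedure: the helper names are hypothetical, both variables are shuffled jointly across spots, the local statistic is a weighted cosine, and the p-value is one-sided with a +1 correction.

```python
import numpy as np

def local_weighted_cosine(x, y, W):
    """Spatially-weighted cosine similarity per spot (0 where undefined)."""
    num = W @ (x * y)
    den = np.sqrt(W @ (x ** 2)) * np.sqrt(W @ (y ** 2))
    return np.divide(num, den, out=np.zeros_like(num), where=den > 0)

def local_permutation_pvals(x, y, W, n_perms=200, seed=0):
    """One-sided permutation p-value per spot, from shuffled spot labels."""
    rng = np.random.default_rng(seed)
    observed = local_weighted_cosine(x, y, W)
    exceed = np.zeros_like(observed)
    for _ in range(n_perms):
        # Shuffle spot labels: values are reassigned to random locations,
        # while the spatial weights stay fixed
        idx = rng.permutation(x.size)
        null = local_weighted_cosine(x[idx], y[idx], W)
        exceed += null >= observed
    # +1 correction so that p-values are never exactly zero
    return (exceed + 1) / (n_perms + 1)

# Toy data: 20 spots on a line, both variables high in the first half
rng = np.random.default_rng(1)
pos = np.arange(20, dtype=float)
W = np.exp(-((pos[:, None] - pos[None, :]) ** 2) / (2 * 2.0 ** 2))
W[W < 0.1] = 0.0
x = np.where(pos < 10, 1.0, 0.0) + rng.normal(0, 0.1, 20)
y = np.where(pos < 10, 1.0, 0.0) + rng.normal(0, 0.1, 20)
pvals = local_permutation_pvals(x, y, W)
```

The smallest attainable p-value is 1 / (`n_perms` + 1), so the number of permutations sets the resolution of the test.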

## Beyond Ligand-Receptors (optional)

While protein-mediated ligand-receptor interactions are interesting, cell-cell communication is not limited to those alone. Rather, it is a complex process involving a variety of mechanisms, such as signalling pathways, metabolite-mediated signalling, and distinct cell types.

So, if such diverse mechanisms are involved in cell-cell communication, why should we limit ourselves to ligand-receptor interactions?
Let's see how we can use LIANA+ to explore other types of cell-cell communication.

One simple approach would be to check relationships e.g. between transcription factors and cell type proportions.

### Extract Cell type Composition
This slide comes with cell type proportions estimated using cell2location; see [Kuppe et al., 2022](https://www.nature.com/articles/s41586-022-05060-x). Let's extract them from `.obsm` into an independent AnnData object.

In [None]:
# let's extract those
comps = li.ut.obsm_to_adata(adata, 'compositions')
# check key cell types
sc.pl.spatial(comps, color=['vSMCs','CM', 'Endo', 'Fib'], size=1.3, ncols=2)

### Estimate Transcription Factor Activity

In [None]:
# Get transcription factor resource
net = dc.op.collectri(organism='human', remove_complexes=True, license='academic', verbose=False)

While multi-omics datasets might be of even more interest, for the sake of simplicity (and because of the general lack of spatial multi-omics data at present), let's instead use enrichment analysis to estimate the activity of transcription factors in each spot. We will use one of [decoupler-py's](https://decoupler-py.readthedocs.io/en/latest/index.html) enrichment methods with [CollecTRI](https://www.biorxiv.org/content/10.1101/2023.03.30.534849v1.abstract) to do so. Refer to this [tutorial](https://decoupler-py.readthedocs.io/en/latest/notebooks/dorothea.html) for more info.

In [None]:
# Estimate activities
dc.mt.ulm(
    data=adata,
    net=net,
    bsize=128,
    tmin=50,
    verbose=True,
    raw=False
)

#### Extract highly-variable TF activities
To reduce the number of TFs for the sake of computational speed, we will only focus on the top 50 most variable TFs.

We will use the simple coefficient of variation to identify the most variable TFs.


In [None]:
est = li.ut.obsm_to_adata(adata, 'score_ulm')
# est.write("acts_tfs.h5ad") # save for future use
est.var['cv'] =  est.X.std(axis=0) / est.X.mean(axis=0)
top_tfs = est.var.sort_values('cv', ascending=False, key=abs).head(50).index


Create MuData object with TF activities and cell type proportions, and transfer spatial connectivities and other information from the original AnnData object.

In [None]:
import mudata as mu
mdata = mu.MuData({"tf":est, "comps":comps})
mdata.obsp = adata.obsp
mdata.uns = adata.uns
mdata.obsm = adata.obsm

Define Interactions of interest:

In [None]:
from itertools import product

In [None]:
interactions = list(product(comps.var.index, top_tfs))

In [None]:
interactions[:5]

### Estimate Cosine Similarity

In [None]:
bdata = li.mt.bivariate(mdata,
                        x_mod="comps",
                        y_mod="tf",
                        x_transform=sc.pp.scale,
                        y_transform=sc.pp.scale,
                        local_name="cosine", 
                        interactions=interactions,
                        mask_negatives=True, 
                        add_categories=True,
                        x_use_raw=False,
                        y_use_raw=False,
                        xy_sep="<->",
                        x_name='celltype',
                        y_name='tf'
                        )

<div class="alert alert-info">

To make the distributions comparable, we simply z-scale the TF activities and cell type proportions via the `x_transform` & `y_transform` parameters.

The type of transformation will affect the interpretation of the results, and different types of transformation might be more appropriate for different types of data. We provide zero-inflated minmax `zi_minmax` & `neg_to_zero` transformation functions via `li.fun.transform`.


</div>
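To illustrate how the choice of transformation changes the interpretation, the sketch below compares plain (unweighted) cosine similarity after z-scaling with cosine similarity after setting negative values to zero. The helper functions and the toy values are hypothetical and only mimic the spirit of the transformations mentioned above; they are not LIANA+'s implementations.

```python
import numpy as np

def zscale(v):
    """Center to mean 0 and scale to unit (population) standard deviation."""
    return (v - v.mean()) / v.std()

def negatives_to_zero(v):
    """Illustrative transform: replace negative values with 0."""
    return np.where(v < 0, 0.0, v)

def cosine(a, b):
    """Plain cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy data: a cell type proportion and a TF activity that oppose each other
proportions = np.array([0.9, 0.8, 0.1, 0.0, 0.05])
activity = np.array([-1.5, -1.0, 2.0, 2.5, 1.8])

# After z-scaling, cosine similarity equals the Pearson correlation
# and can be negative
cos_z = cosine(zscale(proportions), zscale(activity))

# With non-negative inputs, cosine similarity cannot drop below 0
cos_clipped = cosine(negatives_to_zero(proportions), negatives_to_zero(activity))
```

With z-scaled inputs the score distinguishes positive from negative relationships, whereas non-negative inputs only measure the degree of overlap between the two variables.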

In [None]:
bdata.var.sort_values("mean", ascending=False).head(5)

#### Let's plot the results

In [None]:
sc.pl.spatial(bdata, color=['CM<->MEF2C', 'Fib<->CTNNB1'], size=1.4, cmap="coolwarm", vmax=1, vmin=-1)

In [None]:
sc.pl.spatial(mdata.mod['tf'], color=['MEF2C', 'CTNNB1'], cmap='coolwarm', size=1.4, vcenter=0)

In [None]:
sc.pl.spatial(mdata.mod['comps'], color=['CM', 'Fib'], cmap='viridis', size=1.4)