# Scoring Communication Tutorial

In [1]:
import scanpy as sc
import numpy as np
import pandas as pd

import warnings
warnings.filterwarnings('ignore')

from cell2cell.core.cell_composition import CellComposition, normalize_lr_scores_by_max, scale_by_composition

seed = 888
np.random.seed(seed)

for a single context
1) LR databases
2) aggregating expression
3) dealing with negative counts
4) scoring options (internal + LIANA)
5) cell composition 

# 4) Calculating a LR Communication Score

Here, we calculate a communication score from just the gene expression of the ligands and receptors. In the following section, we will further discuss how to account for cell composition.

# 5) Accounting for Cell Composition

To account for cell composition, in each context, we first need a "composition score" associated with each cell type, representative of its relative composition/frequency. 

We have implemented a number of methods to calculate the composition score in the CellComposition class. If you have your own scores, simply format them as the CellComposition.aggregate_composition_score attribute. For each context, this is a dataframe that pairs each cell type with its "Composition Score". Composition scores should lie in [0,1].
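If you are supplying your own scores, the layout can be sketched as below. This is a minimal sketch under an assumption drawn from the outputs shown later in this tutorial: a dictionary keyed by context, where each value is a dataframe indexed by cell type with a single `Composition_Score` column. The frequencies, context names, and the helper `to_composition_frame` are hypothetical and only serve as an illustration.

```python
import pandas as pd

# Hypothetical cell type frequencies per context (illustrative values only)
custom_frequencies = {
    'context_A': {'B': 0.10, 'Macrophages': 0.60, 'T': 0.30},
    'context_B': {'B': 0.05, 'Macrophages': 0.80, 'T': 0.15},
}

def to_composition_frame(frequencies):
    """Format {cell_type: score} as a one-column dataframe indexed by cell type."""
    frame = pd.DataFrame({'Composition_Score': pd.Series(frequencies, dtype=float)})
    frame.index.name = 'cell_type'
    # composition scores are expected to lie in [0, 1]
    if not frame['Composition_Score'].between(0, 1).all():
        raise ValueError('Composition scores must lie in [0, 1]')
    return frame

custom_scores = {context: to_composition_frame(freq)
                 for context, freq in custom_frequencies.items()}
# Assumed usage (attribute name as it appears in the code cells of this tutorial):
# cc.aggregate_composition_score = custom_scores
```

Validating the [0,1] range up front is a design choice of this sketch, made because the later weighting steps rely on it.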

Here, we show how to get composition scores from the expression data:

In [5]:
# load the data
balf_corrected = sc.read_h5ad('/data3/hratch/c2c_general/batch_corrected_balf_covid.h5ad')

# subset for quicker running as an example
np.random.seed(seed)
sub = np.random.choice(range(balf_corrected.n_obs), size=3000, replace=False)
balf_corrected = balf_corrected[sub,]

# format the manifold information
manifold = pd.DataFrame(balf_corrected.obsm['X_pca'])
manifold.index = balf_corrected.obs_names

In [14]:
# preview the metadata
balf_corrected.obs[['Sample_ID', 'Context', 'cell_type']].head()

,Sample_ID,Context,cell_type
ACGAGCCAGATGGCGT-1,C141,Moderate_Covid,Macrophages
TTCATTGTCTCTTCAA-1,C100,Healthy_Control,Macrophages
TCATTTGAGCCGCCTA-1,C52,Healthy_Control,Macrophages
TGGCGCAAGGATGGAA-1,C52,Healthy_Control,Macrophages
GGGAATGTCGGTGTTA-1,C51,Healthy_Control,Macrophages


This tutorial was written to analyze multiple samples/contexts to incorporate into Tensor-cell2cell. If you want to score for just one sample/context, you can use the same code. Set a metadata context column with a single context label. 

In our case, we use 'Sample_ID' as the context column. If we did not have distinct contexts, we could create a single-context column as follows:

In [15]:
# balf_corrected.obs['Sample_ID'] = 'balf_covid'

Initialize the CellComposition object with the expression dataframe, metadata, and PCA manifold:

In [16]:
cc = CellComposition(expr = balf_corrected.to_df().T, metadata = balf_corrected.obs, 
                    manifold = manifold)

If you do not have a manifold and want to use a composition scoring method such as meld that requires one, you can calculate it as follows:

In [7]:
# cc.run_pca(n_pcs = 100)

Score the composition using one of the available methods, such as 'meld'.

Here, we proceed with meld (which requires at least 2 contexts):

In [17]:
cc.score_composition(score_type='meld', context_label='Sample_ID')

Building graph on 3000 samples and 100 features.
Calculating graph and diffusion operator...
  Calculating KNN search...
  Calculated KNN search in 1.12 seconds.
  Calculating affinities...
  Calculated affinities in 0.31 seconds.
Calculated graph and diffusion operator in 1.54 seconds.


Since MELD scores composition at the single-cell resolution, this is stored in the attribute cc.barcode_composition_score. For such methods, we must aggregate to the cell type resolution, since this is the resolution communication scoring is conducted at. 

When aggregating, NaN values may result if a cell type was not present in a context. By default, we fill these with 0 since they are not present. 

In [18]:
cc.get_aggregate_composition_score(aggregation_method='median', fill_na = 0, 
                               context_label='Sample_ID', cellgroup_label='cell_type')
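
For intuition, the aggregation step can be sketched in plain pandas. This is not the library's implementation. It assumes a hypothetical long-format table of per-cell scores with columns `Sample_ID`, `cell_type`, and `score`, filled with toy values.

```python
import pandas as pd

# Hypothetical per-cell composition scores (toy values)
per_cell = pd.DataFrame({
    'Sample_ID': ['C1', 'C1', 'C1', 'C2', 'C2'],
    'cell_type': ['B', 'B', 'T', 'B', 'B'],
    'score':     [0.2, 0.4, 0.6, 0.1, 0.3],
})

# median score per (cell type, context); absent combinations become NaN
aggregated = per_cell.pivot_table(index='cell_type', columns='Sample_ID',
                                  values='score', aggfunc='median')
# T cells are absent from C2, so that entry is NaN until it is filled with 0
aggregated = aggregated.fillna(0)

# one single-column dataframe per context
per_context = {
    context: aggregated[[context]].rename(columns={context: 'Composition_Score'})
    for context in aggregated.columns
}
```

Filling with 0 here mirrors the default described above: a cell type that is missing from a context contributes no composition.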

Each context has an associated aggregate composition score. This is the input you can provide as a user as well, if you have your own composition scores. As an example, let's see what sample C51 looks like:

In [19]:
cc.aggregate_composition_score['C51'].head()

cell_type,Composition_Score
B,0.101433
Epithelial,0.0
Macrophages,0.100636
NK,0.0
Neutrophil,0.0


Going forward, we have two options for incorporating composition into the communication score.

Option 1: Scale gene expression by composition before calculating the communication score

Option 2: Incorporate composition after calculating the communication score

#### Mathematically, this is done as follows:

From the gene expression matrix:
$$L\equiv\text{sender cell ligand expression}, \quad R\equiv\text{target cell receptor expression}$$

Calculated above with CellComposition.get_aggregate_composition_score:
$$S\equiv\text{sender cell composition score}\in[0,1]$$
$$T\equiv\text{target cell composition score}\in[0,1]$$

For either option, the terms being combined must lie between [0,1] so that expression and composition are on equivalent scales. Once they are on equivalent scales, we can introduce a weighting parameter *lambda* to weight composition vs expression. Specifically, we calculate a weighted average of the two terms; note that the formulas in this section additionally divide that average by 2.

##### Option 1:
In this case, L and R must both lie between [0,1]:

$$L'\equiv\text{weighted sender cell ligand expression} = f(S,L) = \frac{\lambda S + (1-\lambda )L}{2}$$
$$R'\equiv\text{weighted target cell receptor expression} = g(T,R) = \frac{\lambda T + (1-\lambda )R}{2}$$
$$W\equiv\text{weighted communication score} = h(L', R')$$

In this case, *h* is the same scoring function used with the internal methods in Part 4, but with modified values for L and R. We write the result as *W* to avoid a clash with the sender composition score *S*.


##### Option 2:

Calculated in Part 4:
$$CS\equiv\text{expression communication score} = {g_\text{S,T}}(L,R)$$

To ensure CS lies between [0,1], we can apply a normalization:
$${g'_\text{S,T}}(L,R) = \frac{CS}{CS_\text{max}}$$
where $$CS_\text{max}\equiv\text{maximum CS across all contexts}$$

We take the maximum across all contexts rather than within a context to ensure that the relative communication scores across contexts are accounted for when re-structuring into a tensor.
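
A minimal sketch of this normalization follows. It assumes the scores are held in a dictionary of dataframes keyed by context and uses toy values. It illustrates the idea rather than the internals of `normalize_lr_scores_by_max`.

```python
import pandas as pd

# Hypothetical raw communication scores: rows are LR pairs, columns are cell pairs
raw_scores = {
    'context_A': pd.DataFrame({'B-B': [2.0, 4.0], 'B-T': [1.0, 0.0]}),
    'context_B': pd.DataFrame({'B-B': [8.0, 2.0], 'B-T': [4.0, 6.0]}),
}

# CS_max: a single maximum taken over every context
cs_max = max(df.to_numpy().max() for df in raw_scores.values())

# dividing every context by the same constant preserves differences between contexts
normalized = {context: df / cs_max for context, df in raw_scores.items()}
```

In this toy case the largest normalized value in `context_A` is 0.5 rather than 1, which is the point of using a global maximum: a context with uniformly weaker communication stays weaker after normalization.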

Since the communication score CS is calculated pairwise between cells already, we must calculate a pairwise composition score:

$$FS\equiv\text{pairwise composition score} = f(S,T) \in[0,1]$$

We use the mean, median, or geometric mean as *f* in this instance. Since S and T both lie in [0,1], FS will also lie in [0,1]. <br>
Note that since *S* and *T* are the composition scores aggregated by cell type, this is a "mean of means".

Now, we have a pairwise LR communication score *CS* and a pairwise composition score *FS* both between [0,1], such that we can weight CS by FS:

$$W\equiv\text{weighted communication score} = h(L, R, S, T) = \frac{\lambda f(S,T) + (1-\lambda ){g'_\text{S,T}}(L,R)}{2}$$

A summation works because we can assume that if *f* is 0, *g'* will also be 0 (i.e., if the sender or receiver cell was not present in a context, there will be no LR expression communication score in that context). We write the result as *W* to avoid a clash with the sender composition score *S*.

## Option 1:

In [33]:
# TODO: when doing this, make sure to check that S, T, L, and R all lie [0,1]

#1) aggregate expression by cell type
#2) weight the expression by composition
#3) run into scoring function
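
Until this section is filled in, the three steps can be sketched on toy data as follows. This is only an illustration under stated assumptions: the expression values, gene names, and composition scores are hypothetical, the expression is assumed to be aggregated by cell type and scaled to [0,1] already, and a simple product stands in for the Part 4 scoring function *h*.

```python
import pandas as pd

composition_weight = 0.25  # lambda

# Hypothetical aggregated expression (genes x cell types), already scaled to [0, 1]
expr = pd.DataFrame({'B': [0.8, 0.2], 'T': [0.4, 1.0]},
                    index=['ligand_gene', 'receptor_gene'])
# Hypothetical composition score per cell type, in [0, 1]
composition = pd.Series({'B': 0.1, 'T': 0.5})

# every term must lie in [0, 1] before weighting
assert expr.to_numpy().min() >= 0 and expr.to_numpy().max() <= 1
assert composition.between(0, 1).all()

# L' = (lambda * S + (1 - lambda) * L) / 2, aligned per cell type (column)
weighted_expr = ((1 - composition_weight) * expr
                 + composition_weight * composition) / 2

# placeholder for h: product of weighted ligand (sender B) and receptor (target T)
score_B_to_T = (weighted_expr.loc['ligand_gene', 'B']
                * weighted_expr.loc['receptor_gene', 'T'])
```

The same composition score weights every gene of a given cell type, so the ligand term picks up the sender's score and the receptor term picks up the target's score.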

## Option 2:

First, we must calculate a pairwise composition score that combines the individual composition scores. We recommend using the geometric mean because this will make any pair of cell composition scores 0 if one of the two is 0. 

In [20]:
pairwise_composition_scores = {}
for context, composition_score in cc.aggregate_composition_score.items():
    pairwise_composition_scores[context] = \
                            CellComposition.get_pairwise_composition_score(composition_score, method='gmean')

The result is a flattened square matrix of pairwise composition scores, with one entry per sender-receiver pair:

In [22]:
pairwise_composition_scores['C51'].head()

,Composition_Score
B-B,0.101433
B-Epithelial,0.0
B-Macrophages,0.101034
B-NK,0.0
B-Neutrophil,0.0
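
To see where these numbers come from, the geometric mean can be recomputed by hand from the C51 composition scores shown earlier. The sketch below uses only the standard library and is not the `get_pairwise_composition_score` implementation. The 'sender-receiver' key format is copied from the output above.

```python
from itertools import product
from math import sqrt

# aggregate composition scores shown earlier for sample C51 (first five cell types)
composition = {'B': 0.101433, 'Epithelial': 0.0, 'Macrophages': 0.100636,
               'NK': 0.0, 'Neutrophil': 0.0}

# geometric mean of every ordered sender-receiver pair, keyed as 'sender-receiver'
pairwise = {
    f'{sender}-{receiver}': sqrt(composition[sender] * composition[receiver])
    for sender, receiver in product(composition, repeat=2)
}

b_macrophages = round(pairwise['B-Macrophages'], 6)  # compare with the table above
b_epithelial = pairwise['B-Epithelial']              # 0 because Epithelial is absent
```

Any pair involving an absent cell type collapses to 0, which is the property that motivates the geometric mean here.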


Next, we need the LR communication scores calculated in Part 4, which we will then scale by the pairwise composition scores. For now, random values stand in for the actual communication scores:

In [24]:
# TODO: replace these random with actual communication scores
lr_communication_scores = {context: pd.DataFrame(np.random.rand(300, len(pairwise_score.index)), 
            columns = pairwise_score.index) for context, pairwise_score in pairwise_composition_scores.items()}

Next, we ensure that our LR scores lie between [0,1] so that when we later weight the composition and LR communication scores, they are on equivalent scales (composition scores should already lie between [0,1]).

In [25]:
lr_communication_scores = normalize_lr_scores_by_max(lr_communication_scores)

In [31]:
lr_communication_scores['C51'].head()

,B-B,B-Epithelial,B-Macrophages,B-NK,B-Neutrophil,B-Plasma,B-T,B-mDC,B-pDC,Epithelial-B,...,mDC-pDC,pDC-B,pDC-Epithelial,pDC-Macrophages,pDC-NK,pDC-Neutrophil,pDC-Plasma,pDC-T,pDC-mDC,pDC-pDC
0,0.251636,0.272324,0.549509,0.235827,0.36207,0.930779,0.972409,0.456784,0.9765,0.335636,...,0.638193,0.685172,0.737051,0.194902,0.087117,0.220078,0.945605,0.399299,0.641672,0.726755
1,0.929658,0.780617,0.491044,0.751686,0.736676,0.97156,0.372635,0.619899,0.979126,0.770571,...,0.631146,0.825354,0.405103,0.546353,0.945089,0.652021,0.834094,0.325539,0.741014,0.31897
2,0.588343,0.594012,0.585338,0.330183,0.313699,0.699193,0.562986,0.366172,0.283625,0.799744,...,0.265261,0.201016,0.979346,0.937663,0.850905,0.475232,0.137722,0.774116,0.447206,0.976485
3,0.090067,0.157863,0.058658,0.909436,0.668232,0.202363,0.605425,0.744067,0.804526,0.812308,...,0.347474,0.980513,0.075422,0.252192,0.650773,0.753796,0.437932,0.25078,0.266627,0.829893
4,0.458214,0.407332,0.798032,0.383857,0.827237,0.868393,0.619853,0.632085,0.006781,0.356945,...,0.984337,0.179892,0.340541,0.703671,0.667272,0.686995,0.100802,0.343237,0.057814,0.475087


Finally, we weight the LR communication scores by the pairwise composition scores (with the weighting parameter "composition_weight"):

In [29]:
final_communication_scores = {}
for context in balf_corrected.obs.Sample_ID.unique():
    pairwise_composition_score = pairwise_composition_scores[context]
    lr_communication_score = lr_communication_scores[context]
    
    final_communication_scores[context] = scale_by_composition(lr_communication_score, 
                                                              pairwise_composition_score, 
                                                              composition_weight=0.25)

In [30]:
final_communication_scores['C51'].head()

,B-B,B-Epithelial,B-Macrophages,B-NK,B-Neutrophil,B-Plasma,B-T,B-mDC,B-pDC,Epithelial-B,...,mDC-pDC,pDC-B,pDC-Epithelial,pDC-Macrophages,pDC-NK,pDC-Neutrophil,pDC-Plasma,pDC-T,pDC-mDC,pDC-pDC
0,0.107043,0.102122,0.218695,0.088435,0.135776,0.349042,0.374344,0.183634,0.366187,0.125864,...,0.239322,0.256939,0.276394,0.073088,0.032669,0.082529,0.354602,0.149737,0.240627,0.272533
1,0.361301,0.292732,0.196771,0.281882,0.276253,0.364335,0.149428,0.244802,0.367172,0.288964,...,0.23668,0.309508,0.151914,0.204882,0.354408,0.244508,0.312785,0.122077,0.27788,0.119614
2,0.233308,0.222754,0.232131,0.123819,0.117637,0.262197,0.22081,0.149654,0.106359,0.299904,...,0.099473,0.075381,0.367255,0.351624,0.31909,0.178212,0.051646,0.290293,0.167702,0.366182
3,0.046454,0.059199,0.034626,0.341038,0.250587,0.075886,0.236724,0.291365,0.301697,0.304615,...,0.130303,0.367692,0.028283,0.094572,0.24404,0.282674,0.164225,0.094042,0.099985,0.31121
4,0.18451,0.15275,0.311891,0.143946,0.310214,0.325647,0.242135,0.249371,0.002543,0.133854,...,0.369126,0.06746,0.127703,0.263877,0.250227,0.257623,0.037801,0.128714,0.02168,0.178157
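
As a quick check on the Option 2 formula, two entries of the first row can be recomputed by hand from the normalized LR scores and pairwise composition scores printed above for C51. The function name below is hypothetical and this is not the `scale_by_composition` source. It only applies the formula as written, with lambda set to the `composition_weight` of 0.25 used above.

```python
def weighted_communication_score(cs_normalized, pairwise_composition, composition_weight):
    """(lambda * FS + (1 - lambda) * CS') / 2, following the Option 2 formula."""
    return (composition_weight * pairwise_composition
            + (1 - composition_weight) * cs_normalized) / 2

# values copied from the C51 outputs above (first LR row)
b_b = weighted_communication_score(0.251636, 0.101433, composition_weight=0.25)
b_macrophages = weighted_communication_score(0.549509, 0.101034, composition_weight=0.25)
```

Both results agree with the final table to the displayed precision (0.107043 for B-B and 0.218695 for B-Macrophages), which suggests the scaling step follows this formula, including the division by 2.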


In [None]:
# to do
# dealing with NaNs
# finalizing score
# new attribute: self.aggregate_composition_score