# Compositional Analysis

This tutorial guides users in weighting their communication scores by cell type composition.

In [8]:
import scanpy as sc
import numpy as np
import pandas as pd

import cell2cell as c2c

import warnings
warnings.filterwarnings('ignore')

#from cell2cell....import CellComposition, normalize_lr_scores_by_max, weight_tensor_by_composition, weight_communication_matrix_by_composition
seed = 888
np.random.seed(seed)

In [2]:
out_path = '/data3/hratch/c2c_general/'

# Accounting for Cell Composition

To account for cell composition, in each context, we first need a "composition score" associated with each cell type, representative of its relative composition/frequency. 

We have implemented a number of methods to calculate the composition score in the CellComposition class. If you have your own, simply format it as the CellComposition.aggregate_composition_score attribute. As used below, this attribute holds, for each context, a dataframe indexed by cell type with a single "Composition_Score" column. Composition scores should lie in [0,1].

Here, we show how to get composition scores from the expression data:

In [3]:
# load the data
balf_corrected = sc.read_h5ad(out_path + 'batch_corrected_balf_covid.h5ad')
log1p = balf_corrected.raw.to_adata()

In [9]:
# # subset for quicker running as an example
# np.random.seed(seed)
# sub = np.random.choice(range(log1p.n_obs), size=3000, replace=False)
# log1p = log1p[sub,]

# format the manifold information
manifold = pd.DataFrame(log1p.obsm['X_pca'])
manifold.index = log1p.obs_names

In [10]:
# preview the metadata
log1p.obs[['Sample_ID', 'Context', 'cell_type']].head()

| | Sample_ID | Context | cell_type |
|---|---|---|---|
| AAACCTGAGGAATCGC-1 | C148 | Severe_Covid | Macrophages |
| AAACCTGTCCAGAAGG-1 | C148 | Severe_Covid | Macrophages |
| AAACCTGTCCAGTAGT-1 | C148 | Severe_Covid | Macrophages |
| AAACCTGTCTGGGCCA-1 | C148 | Severe_Covid | Macrophages |
| AAACGGGCACGAGGTA-1 | C148 | Severe_Covid | T |


This tutorial was written to analyze multiple samples/contexts to incorporate into Tensor-cell2cell. If you want to score for just one sample/context, you can use the same code. Set a metadata context column with a single context label. 

In our case, we use 'Sample_ID' as the context column. If we did not have multiple contexts, we could create a single-context column as follows:

In [15]:
# log1p.obs['Sample_ID'] = 'balf_covid'

Initialize the CellComposition object with the expression dataframe, metadata, and PCA manifold. We specify the context as the sample, and will use the higher-order disease context for downstream analyses and visualizations:

In [13]:
cc = CellComposition(expr = log1p.to_df().T, 
                     metadata = log1p.obs, context_label = 'Sample_ID', cellgroup_label= 'cell_type',
                    manifold = manifold)

If you do not have a manifold and want to use a composition scoring method such as meld that requires one, you can calculate it as follows:

In [163]:
# cc.run_pca(n_pcs = 100)

Score the composition using one of the available methods, such as 'frequency' or 'meld'.



The frequency method is the simplest: the "composition score" is the cell type frequency in a given context. It is already aggregated by cell type and can be run even if you have just one context:

In [14]:
cc.score_composition(score_type='frequency')
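To make the frequency score concrete, here is a minimal sketch in plain pandas. It is not the CellComposition implementation: the helper name `frequency_composition_score` and the toy metadata are assumptions made for illustration only.

```python
import pandas as pd

def frequency_composition_score(metadata, context_label, cellgroup_label):
    """Hypothetical helper: fraction of cells of each type within each context."""
    all_groups = sorted(metadata[cellgroup_label].unique())
    scores = {}
    for context, obs in metadata.groupby(context_label):
        # relative frequency of each cell type among the cells of this context
        freq = obs[cellgroup_label].value_counts(normalize=True)
        # cell types absent from a context get a frequency of 0
        freq = freq.reindex(all_groups, fill_value=0.0)
        scores[context] = freq.rename('Composition_Score').to_frame()
    return scores

# toy metadata (made up for this example)
toy_metadata = pd.DataFrame({
    'Sample_ID': ['C1', 'C1', 'C1', 'C1', 'C2', 'C2'],
    'cell_type': ['T', 'T', 'B', 'Macrophages', 'T', 'B'],
})
toy_scores = frequency_composition_score(toy_metadata, 'Sample_ID', 'cell_type')
print(toy_scores['C1'])
```

Within each context the frequencies sum to 1, so every score falls in [0,1] without further scaling.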

We can view the composition scores for each context. Let's see what it looks like for sample C51:

In [15]:
cc.aggregate_composition_score['C51'].head()

| cell_type | Composition_Score |
|---|---|
| B | 0.001774 |
| Epithelial | 0.001538 |
| Macrophages | 0.936243 |
| Mast | 0.0 |
| NK | 0.000591 |


We can use more complex methods, including those that give composition scores at the single-cell resolution. Here, we show an example of single-cell resolution composition scoring using "meld" (https://doi.org/10.1038/s41587-020-00803-5), a method that requires at least 2 contexts:

In [166]:
cc.score_composition(score_type='meld')

Building graph on 3000 samples and 100 features.
Calculating graph and diffusion operator...
  Calculating KNN search...
  Calculated KNN search in 1.14 seconds.
  Calculating affinities...
  Calculated affinities in 0.32 seconds.
Calculated graph and diffusion operator in 1.55 seconds.


Since MELD scores composition at the single-cell resolution, this is stored as a dictionary of (barcode, composition_score) pairs in the attribute "cc.barcode_composition_score". For such methods, we must aggregate to the cell type resolution, since this is the resolution communication scoring is conducted at. 

When aggregating, NaN values may result if a cell type was not present in a context. By default, we fill these with 0 since they are not present. 

In [167]:
cc.get_aggregate_composition_score(aggregation_method='median', fill_na = 0)
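The aggregation step can be pictured with the following sketch, which collapses per-barcode scores to one value per cell type and context and fills absent cell types with 0. The helper `aggregate_barcode_scores` and the toy inputs are hypothetical; the actual method may differ in its details.

```python
import pandas as pd

def aggregate_barcode_scores(barcode_scores, metadata, context_label,
                             cellgroup_label, aggregation_method='median', fill_na=0):
    """Hypothetical helper: one aggregated score per cell type, per context."""
    df = metadata[[context_label, cellgroup_label]].copy()
    # align the (barcode -> score) mapping to the metadata index
    df['Composition_Score'] = pd.Series(barcode_scores)
    all_groups = sorted(df[cellgroup_label].unique())
    aggregated = {}
    for context, obs in df.groupby(context_label):
        agg = obs.groupby(cellgroup_label)['Composition_Score'].agg(aggregation_method)
        # cell types missing from this context become NaN, then fill_na
        aggregated[context] = agg.reindex(all_groups).fillna(fill_na).to_frame()
    return aggregated

# toy inputs (made up for this example)
toy_metadata = pd.DataFrame(
    {'Sample_ID': ['C1', 'C1', 'C1', 'C2', 'C2'],
     'cell_type': ['T', 'T', 'B', 'T', 'T']},
    index=['bc1', 'bc2', 'bc3', 'bc4', 'bc5'])
toy_barcode_scores = {'bc1': 0.2, 'bc2': 0.4, 'bc3': 0.9, 'bc4': 0.1, 'bc5': 0.3}

toy_aggregated = aggregate_barcode_scores(toy_barcode_scores, toy_metadata,
                                          'Sample_ID', 'cell_type')
print(toy_aggregated['C2'])  # B is absent from C2, so its score is filled with 0
```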

Again, let's see what sample C51 looks like after aggregation:

In [168]:
cc.aggregate_composition_score['C51'].head()

| cell_type | Composition_Score |
|---|---|
| B | 0.101434 |
| Epithelial | 0.0 |
| Macrophages | 0.100636 |
| NK | 0.0 |
| Neutrophil | 0.0 |


To proceed, we must ensure each context has an associated aggregate composition score for each cell type. This is the input you can provide as a user as well, if you have your own composition scores.

Going forward, we have two options for incorporating composition into the communication score. 

Option N: Directly scaling gene expression by composition prior to calculating communication


Option 1: Incorporate composition after calculating the communication score

#### Mathematically, this is done as follows:

from gene expression matrix:
$$L\equiv\text{sender cell ligand expression}, R\equiv\text{target cell receptor expression}$$

calculated above and stored in CellComposition.aggregate_composition_score: 
$$S\equiv\text{sender cell composition score}\in[0,1]$$
$$T\equiv\text{target cell composition score}\in[0,1]$$

For either option, the terms being combined must lie on the same scale. Once they are on equivalent scales, we can introduce a weighting parameter *lambda* to weight composition vs expression. Specifically, we will calculate a weighted average.

##### Option N: Directly scaling gene expression by composition prior to calculating communication
In this case, we modify S and T to lie on the same scale as L and R. We recommend using scaled expression data. To achieve this, we multiply all composition scores by the maximum expression value across all contexts. Since composition scores lie in [0,1], this puts them on the same scale as expression, with the largest composition score being at most the largest expression value.

$$S' = expr_{\text{contexts,max}}*S$$
$$T' = expr_{\text{contexts,max}}*T$$

$$L'\equiv\text{weighted sender cell ligand expression} = f(S',L) = \lambda S' + (1-\lambda )L$$
$$R'\equiv\text{weighted target cell receptor expression} = g(T',R) = \lambda T' + (1-\lambda )R$$
$$WCS\equiv\text{weighted communication score} = h(L', R')$$ 

In this case, *h* is the same scoring function used with the internal methods in Part 4, but applied to the modified values *L'* and *R'*.
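As a small numeric sketch of the rescaling and weighted average defined by the equations above (Option N is not incorporated in the package, so the function name `weight_expression_by_composition` and all values here are assumptions for illustration):

```python
import numpy as np

def weight_expression_by_composition(expr, composition, max_expr, lam=0.25):
    """Hypothetical helper implementing L' = lam * S' + (1 - lam) * L."""
    expr = np.asarray(expr, dtype=float)
    composition = np.asarray(composition, dtype=float)
    scaled_composition = max_expr * composition  # S' = expr_max * S
    return lam * scaled_composition + (1 - lam) * expr

# toy values: ligand expression L and composition score S for three sender cell types
L = [2.0, 0.5, 0.0]
S = [0.5, 0.1, 0.0]
max_expr = 4.0  # assumed maximum expression value across all contexts

L_weighted = weight_expression_by_composition(L, S, max_expr, lam=0.25)
print(L_weighted)
```

The same function would apply to *R* and *T*; the weighted values would then be passed to whichever scoring function *h* is in use.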


##### Option 1: Incorporate composition after calculating the communication score
calculated in Part 4:
$$CS\equiv\text{expression communication score} = {g_\text{S,T}}(L,R)$$

Here, we modify the CS to ensure that it lies between [0,1]. We can apply the following normalization:
$${g'_\text{S,T}}(L,R) = \frac{CS}{CS_\text{max}}$$
where $$CS_\text{max}\equiv\text{maximum CS across all contexts}$$

We take the maximum across all contexts rather than within a context to ensure that the relative communication scores across contexts are accounted for when re-structuring into a tensor.
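The effect of taking the maximum across contexts can be seen in a short sketch. The function `normalize_by_global_max` below is a hypothetical stand-in written for this example (it is not the library's normalize_lr_scores_by_max) and assumes the matrices contain no NaN values.

```python
import pandas as pd

def normalize_by_global_max(lr_scores):
    """Hypothetical helper: divide every context's matrix by the largest score in any context."""
    cs_max = max(cm.to_numpy().max() for cm in lr_scores.values())
    return {context: cm / cs_max for context, cm in lr_scores.items()}

# toy communication matrices (LR pairs x cell pairs) for two contexts
toy_lr_scores = {
    'ctx_A': pd.DataFrame({'B-B': [1.0, 0.5], 'B-T': [2.0, 0.0]}, index=['lr1', 'lr2']),
    'ctx_B': pd.DataFrame({'B-B': [4.0, 1.0], 'B-T': [0.0, 2.0]}, index=['lr1', 'lr2']),
}
toy_normalized = normalize_by_global_max(toy_lr_scores)
print(toy_normalized['ctx_A'])
```

Here the global maximum is 4.0, so the largest score in `ctx_A` becomes 0.5 rather than 1: the context with weaker communication stays weaker after normalization.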

Since the communication score CS is calculated pairwise between cells already, we must calculate a pairwise composition score:

$$FS\equiv\text{pairwise composition score} = f(S,T) \in[0,1]$$

We use the mean, median, or geometric mean as *f* in this instance. Since S and T both lie in [0,1], FS will also lie in [0,1]. <br>
Note that since *S* and *T* are the aggregate composition scores aggregated by cell type, this is a "mean of means"

Now, we have a pairwise LR communication score *CS* and a pairwise composition score *FS* both between [0,1], such that we can weight CS by FS. 

A) We can do this using a weighted average:

$$WCS\equiv\text{weighted communication score} = h(L, R, S, T) = \lambda f(S,T) + (1-\lambda ){g'_\text{S,T}}(L,R)$$

A weighted average (summation) works, because we can assume that if *f* is 0, *g* will also be 0 (i.e., if the sender and receiver cell were not present in a context, there will be no LR expression communication score in that context).

B) Alternatively, we can weight CS by FS using a direct scaling (multiplicative):

$$WCS\equiv\text{weighted communication score} = h(L, R, S, T) = f(S,T) * g'_\text{S,T}(L,R)$$
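The two weighting schemes can be compared side by side with a minimal sketch. The function name `weighted_communication_score` is an assumption for this example; the `method` and `composition_weight` names mirror the parameters used later in this tutorial.

```python
import numpy as np

def weighted_communication_score(cs_norm, fs, method='weighted_average',
                                 composition_weight=0.25):
    """Hypothetical helper: combine normalized CS with pairwise composition FS."""
    cs_norm = np.asarray(cs_norm, dtype=float)
    fs = np.asarray(fs, dtype=float)
    if method == 'weighted_average':
        # WCS = lambda * f(S,T) + (1 - lambda) * g'(L,R)
        return composition_weight * fs + (1 - composition_weight) * cs_norm
    if method == 'product':
        # WCS = f(S,T) * g'(L,R)
        return fs * cs_norm
    raise ValueError(f"Unknown method: {method}")

# toy values for three cell pairs
cs = [0.8, 0.8, 0.0]
fs = [0.5, 0.0, 0.0]
wcs_average = weighted_communication_score(cs, fs, method='weighted_average')
wcs_product = weighted_communication_score(cs, fs, method='product')
print(wcs_average, wcs_product)
```

The second pair shows where the schemes diverge: with FS = 0 but CS > 0, the weighted average still returns a nonzero score while the product returns 0. This is why the weighted average relies on the assumption that *g* is 0 whenever *f* is 0.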

## Option N:

not incorporated

In [193]:
# max_count = self.expr.max().max()
# self.scaled_aggregate_composition_score = {}
# for context, composition_scores in self.aggregate_composition_score.items():
#     scaled_composition_scores = composition_scores.copy()
#     scaled_composition_scores['Composition_Score'] = composition_scores.Composition_Score*max_count
#     self.scaled_aggregate_composition_score[context] = scaled_composition_scores
# self.scaled_barcode_composition_scores = dict(zip(self.barcode_composition_score, np.array(list(self.barcode_composition_score.values()))*max_count))

In [33]:
# TODO: when doing this, make sure to check that S, T, L, and R all lie [0,1]

#1) aggregate expression by cell type
#2) weight the expression by composition
#3) run into scoring function

## Option 1:

First, we must calculate a pairwise composition score that combines the individual composition scores. We recommend using the geometric mean because this will make any pair of cell composition scores 0 if one of the two is 0. 

In [16]:
pairwise_composition_scores = {}
for context, composition_score in cc.aggregate_composition_score.items():
    pairwise_composition_scores[context] = \
                            CellComposition.get_pairwise_composition_score(composition_score, method='gmean')

The result is a flattened matrix of pairwise composition scores, with one row per sender-receiver pair:

In [17]:
pairwise_composition_scores['C51'].head()

| | Composition_Score |
|---|---|
| B-B | 0.001774 |
| B-Epithelial | 0.001652 |
| B-Macrophages | 0.040758 |
| B-Mast | 0.0 |
| B-NK | 0.001024 |
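For intuition, the geometric mean of each sender-receiver pair can be sketched directly. The helper `pairwise_gmean` is hypothetical and only approximates what get_pairwise_composition_score does with method='gmean'; the input values are the rounded frequency scores for sample C51 shown earlier, so results agree with the output above only up to rounding.

```python
import itertools
import numpy as np
import pandas as pd

def pairwise_gmean(composition_score, cell_delim='-'):
    """Hypothetical helper: geometric mean of sender and receiver composition scores."""
    scores = composition_score['Composition_Score']
    pairs = {}
    for sender, receiver in itertools.product(scores.index, repeat=2):
        # sqrt(S * T) is 0 whenever either cell type is absent
        pairs[cell_delim.join((sender, receiver))] = np.sqrt(scores[sender] * scores[receiver])
    return pd.Series(pairs, name='Composition_Score').to_frame()

# rounded frequency scores for sample C51, copied from the output earlier in this tutorial
c51 = pd.DataFrame(
    {'Composition_Score': [0.001774, 0.001538, 0.936243, 0.0, 0.000591]},
    index=['B', 'Epithelial', 'Macrophages', 'Mast', 'NK'])
c51_pairwise = pairwise_gmean(c51)
print(c51_pairwise.head())
```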


Next, we ensure that our LR scores lie between [0,1], so that when we later weight the composition and LR communication scores, they are on equivalent scales (composition scores should already be between [0,1]).

This step is only necessary if your communication scores are not already between [0,1]. For example, if you followed tutorial 2A, communication scores are already between 0 and 1. 

In [180]:
tensor = c2c.io.load_variable_with_pickle(out_path + 'tensor_internal.pkl')
tensor = normalize_lr_scores_by_max(tensor)

Finally, we weight the LR communication scores by the pairwise composition scores using the weighted average (with the weighting parameter "composition_weight"):

In [181]:
tensor = weight_tensor_by_composition(tensor = tensor, 
                                     pairwise_composition_scores = pairwise_composition_scores, 
                                      method = 'weighted_average',
                                     composition_weight = 0.25)

Save the tensor:

In [182]:
c2c.io.export_variable_with_pickle(tensor, out_path + 'tensor_cellcompositionweighted_internal.pkl')

/data3/hratch/c2c_general/tensor_cellcompositionweighted_internal.pkl  was correctly saved.


Note, communication score scaling and weighting by composition can also be done if you do not yet have a tensor built. The format in this case for the LR communication scores would be a dictionary with keys as contexts and values as the communication matrix.

In [157]:
import itertools
from cell2cell.tensor import InteractionTensor

def tensor_to_communication_matrix(tensor: InteractionTensor, cell_delim: str = '-'):
    """Unstack a 4D tensor into one (LR pairs x sender-receiver pairs) matrix per context."""
    contexts, lr_pairs, senders, receivers = tensor.order_names
    # column labels follow the row-major order of the flattened sender x receiver axes
    cell_pairs = [cell_delim.join(cp) for cp in itertools.product(senders, receivers)]
    lr_communication_scores = dict()
    for i, context in enumerate(contexts):
        # slice out this context and flatten the two cell dimensions into one
        cm = pd.DataFrame(tensor.tensor[i, :, :, :].reshape((len(lr_pairs), len(cell_pairs))),
                          index=lr_pairs, columns=cell_pairs)
        lr_communication_scores[context] = cm

    return lr_communication_scores

tensor = c2c.io.load_variable_with_pickle(out_path + 'tensor_internal.pkl')
lr_communication_scores = tensor_to_communication_matrix(tensor)

In [159]:
lr_communication_scores = normalize_lr_scores_by_max(lr_communication_scores)

In [160]:
lr_communication_scores['C51'].head()

Unnamed: 0,B-B,B-Epithelial,B-Macrophages,B-NK,B-T,B-mDC,Epithelial-B,Epithelial-Epithelial,Epithelial-Macrophages,Epithelial-NK,...,T-Macrophages,T-NK,T-T,T-mDC,mDC-B,mDC-Epithelial,mDC-Macrophages,mDC-NK,mDC-T,mDC-mDC
TGFB1^TGFBR1&TGFBR2,0.133333,0.210256,0.18785,0.133333,0.153554,0.165942,0.038462,0.115385,0.092978,0.038462,...,0.151943,0.097426,0.117647,0.130035,0.065217,0.14214,0.119734,0.065217,0.085438,0.097826
TGFB1^ACVR1B&TGFBR2,0.133333,0.248718,0.150895,0.133333,0.142525,0.139855,0.038462,0.153846,0.056023,0.038462,...,0.114988,0.097426,0.106618,0.103948,0.065217,0.180602,0.082779,0.065217,0.074409,0.071739
TGFB1^ACVR1&TGFBR1&TGFBR2,0.133333,0.171795,0.153169,0.133333,0.153554,0.144203,0.038462,0.076923,0.058297,0.038462,...,0.117262,0.097426,0.117647,0.108296,0.065217,0.103679,0.085053,0.065217,0.085438,0.076087
GDF11^TGFBR1&ACVR2A,0.0,0.076923,0.00638,0.0,0.005515,0.0,0.0,0.076923,0.00638,0.0,...,0.019248,0.012868,0.018382,0.012868,0.004348,0.081271,0.010728,0.004348,0.009863,0.004348
GDF11^TGFBR1&ACVR2B,0.0,0.038462,0.005496,0.0,0.005515,0.002174,0.0,0.038462,0.005496,0.0,...,0.018364,0.012868,0.018382,0.015042,0.004348,0.042809,0.009844,0.004348,0.009863,0.006522


Weight the LR communication scores by the pairwise composition scores:

In [178]:
final_communication_scores = {}
for context in balf_corrected.obs.Sample_ID.unique():
    pairwise_composition_score = pairwise_composition_scores[context]
    lr_communication_score = lr_communication_scores[context]
    
    final_communication_scores[context] = weight_communication_matrix_by_composition(lr_communication_score,
                                                                             pairwise_composition_score, 
                                                                             method = 'weighted_average',
                                                                             composition_weight=0.25)

In [179]:
final_communication_scores['C51'].head()

Unnamed: 0,B-B,B-Epithelial,B-Macrophages,B-NK,B-T,B-mDC,Epithelial-B,Epithelial-Epithelial,Epithelial-Macrophages,Epithelial-NK,...,T-Macrophages,T-NK,T-T,T-mDC,mDC-B,mDC-Epithelial,mDC-Macrophages,mDC-NK,mDC-T,mDC-mDC
TGFB1^TGFBR1&TGFBR2,0.100444,0.158105,0.151077,0.100256,0.117054,0.126193,0.029259,0.086923,0.07922,0.029085,...,0.157347,0.07416,0.096279,0.104923,0.05065,0.108222,0.1297,0.049916,0.071475,0.080171
TGFB1^ACVR1B&TGFBR2,0.100444,0.186951,0.123361,0.100256,0.108782,0.106628,0.029259,0.115769,0.051503,0.029085,...,0.129631,0.07416,0.088007,0.085358,0.05065,0.137069,0.101984,0.049916,0.063203,0.060606
TGFB1^ACVR1&TGFBR1&TGFBR2,0.100444,0.129259,0.125066,0.100256,0.117054,0.109889,0.029259,0.058077,0.053209,0.029085,...,0.131336,0.07416,0.096279,0.088619,0.05065,0.079376,0.103689,0.049916,0.071475,0.063867
GDF11^TGFBR1&ACVR2A,0.000444,0.058105,0.014975,0.000256,0.006025,0.001737,0.000413,0.058077,0.014271,0.000238,...,0.057826,0.010741,0.02183,0.017047,0.004998,0.06257,0.047946,0.004264,0.014793,0.010062
GDF11^TGFBR1&ACVR2B,0.000444,0.029259,0.014311,0.000256,0.006025,0.003367,0.000413,0.029231,0.013608,0.000238,...,0.057162,0.010741,0.02183,0.018678,0.004998,0.033724,0.047282,0.004264,0.014793,0.011693
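To see how a communication matrix and the pairwise composition scores line up, the sketch below applies the weighted average column by column. The helper `weight_matrix_by_composition` is a hypothetical stand-in, not the library function, and the inputs are rounded values copied from the C51 outputs shown in this tutorial. Note that the pairwise scores may contain cell pairs (such as B-Mast) that are not columns of the communication matrix, so they are aligned to the matrix columns first.

```python
import pandas as pd

def weight_matrix_by_composition(lr_matrix, pairwise_scores, composition_weight=0.25):
    """Hypothetical helper: weighted average of LR scores and pairwise composition scores."""
    # keep only the cell pairs present in the communication matrix, in column order
    fs = pairwise_scores['Composition_Score'].reindex(lr_matrix.columns)
    # the Series is broadcast across rows, matching on column labels
    return (1 - composition_weight) * lr_matrix + composition_weight * fs

# rounded values copied from the C51 outputs in this tutorial
lr_c51 = pd.DataFrame(
    {'B-B': [0.133333, 0.0], 'B-Epithelial': [0.210256, 0.076923]},
    index=['TGFB1^TGFBR1&TGFBR2', 'GDF11^TGFBR1&ACVR2A'])
pairwise_c51 = pd.DataFrame(
    {'Composition_Score': [0.001774, 0.001652, 0.0]},
    index=['B-B', 'B-Epithelial', 'B-Mast'])

weighted_c51 = weight_matrix_by_composition(lr_c51, pairwise_c51, composition_weight=0.25)
print(weighted_c51.round(6))
```

Up to rounding of the inputs, these values agree with the corresponding entries of the weighted matrix shown above, which suggests that output was produced with the frequency-based composition scores.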


To emulate CellChat's composition incorporation, do the following:

1) Calculate the composition score with score_type = 'frequency' <br>
2) Calculate the pairwise composition score with method = 'product' <br>
3) Scale the communication scores by the composition with method = 'product' <br>