This tutorial will demonstrate how to pre-process single-cell raw UMI counts to generate expression matrices that can be used as input to Tensor-cell2cell. 

At some point in the pipeline, we must account for batch effects. Batch correction is important because Tensor-cell2cell considers multiple samples to extract context-dependent patterns, and we want to capture true biological signals rather than sample-specific differences due to technical variability.

Ideally, we can use single-cell RNAseq batch correction methods. There are a few potential problems with this approach:

1) Batch correction methods often return a matrix in a reduced space, which therefore lacks the original gene features needed for LR scoring.

2) Some cell-cell communication tools expect data in other formats, such as log(1+CPM).

3) Batch correction methods that do return gene counts often return negative counts which can result in negative LR scores. Negative values in the tensor can bias non-negative TCD, the main algorithm used in Tensor-cell2cell.  

In this tutorial, and its companion 01B for R users, we show pre-processing from raw counts to batch-corrected counts. Problem 1 can be handled by using only batch correction methods that return the original gene features. Problems 2-3 will be discussed further in Tutorials XXX. Essentially, both can be dealt with by instead introducing a technical covariate that accounts for batch directly into the decomposition. Problem 3 can also be dealt with either by masking negative values or by using a TCD approach that does not have a non-negative constraint.
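
The masking workaround for Problem 3 can be pictured on a toy matrix. The sketch below is only an illustration with made-up values (the array scores and the other variable names are hypothetical and not part of Tensor-cell2cell): it builds a boolean mask marking the non-negative entries and sets the remaining entries to zero, so that a downstream decomposition could ignore the masked positions.

```python
import numpy as np

# Toy "batch-corrected" values (hypothetical numbers): rows are cells, columns are genes.
scores = np.array([[ 0.8, -0.2, 1.5],
                   [-0.1,  2.0, 0.0]])

# True where the value is usable (non-negative), False where it should be masked.
mask = scores >= 0

# Replace masked entries with 0 so the matrix itself stays non-negative.
scores_masked = np.where(mask, scores, 0.0)

print(mask.astype(int))
print(scores_masked)
```

How the mask is then passed to the decomposition depends on the tensor-building step, which is outside the scope of this sketch.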

There are a number of workflows to achieve this; here we demonstrate a typical workflow using the popular single-cell analysis software scanpy to generate an AnnData object which can be used downstream.

In [2]:
import scanpy as sc
import numpy as np

import warnings
warnings.filterwarnings('ignore')

seed = 888

Load the BALF COVID dataset, which is described here: https://doi.org/10.1038/s41591-020-0901-9.

This dataset contains 12 samples, each associated with "Healthy Control", "Moderate", or "Severe" COVID contexts. 

You can download the 12 scRNAseq .h5 files under the samples section here: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE145926
You can also download the metadata file: TODO

In [14]:
from cell2cell.datasets.load_data import load_balf

# path to the directory holding the downloaded data
expression_data_path = '/data2/eric/Tensor-Revisions/data/COVID-19/'
balf_samples = load_balf(expression_data_path)

balf_samples is a dictionary whose keys are the sample names and whose values are AnnData objects holding the relevant metadata and raw UMI counts.

We begin with basic preprocessing of each sample based on QC metrics. See https://scanpy-tutorials.readthedocs.io/en/latest/pbmc3k.html#Preprocessing for details.

In [16]:
for sample, adata_ in balf_samples.items():
    adata = adata_.copy()
    sc.pp.filter_cells(adata, min_genes=30)
    sc.pp.filter_genes(adata, min_cells=3)

    # filter cells based on QC metrics
    adata.var['mt'] = adata.var_names.str.startswith('MT-')
    sc.pp.calculate_qc_metrics(adata, qc_vars=['mt'], percent_top=None, log1p=False, inplace=True)
    adata = adata[adata.obs.n_genes_by_counts < 6000, :]
    adata = adata[adata.obs.pct_counts_mt < 10, :]
    
    balf_samples[sample] = adata
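
The filtering logic in the cell above can be followed on a small table. This is a sketch with invented gene names and QC values (everything in it is hypothetical except the 'MT-' prefix and the two thresholds, which are taken from the cell above); it uses pandas only, so it runs without the dataset.

```python
import pandas as pd

# Hypothetical gene names and per-cell QC values, for illustration only.
genes = pd.Index(['MT-CO1', 'MT-ND1', 'ACTB', 'CD3E'])
is_mt = genes.str.startswith('MT-')   # flags mitochondrial genes by prefix

obs = pd.DataFrame(
    {'n_genes_by_counts': [606, 2034, 7200, 1288],
     'pct_counts_mt': [4.5, 12.3, 3.1, 4.1]},
    index=['cell_1', 'cell_2', 'cell_3', 'cell_4'])

# Same thresholds as the cell above: fewer than 6000 detected genes
# and less than 10% mitochondrial counts.
keep = (obs['n_genes_by_counts'] < 6000) & (obs['pct_counts_mt'] < 10)
obs_filtered = obs[keep]

print(list(is_mt))
print(obs_filtered.index.tolist())
```

Here cell_2 is dropped for its mitochondrial fraction and cell_3 for its number of detected genes.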

Next, we normalize the raw UMI counts. We recommend log(1+CPM) normalization, as this keeps the values non-negative and is the expected input for many communication scoring functions.

In [26]:
for sample, adata_ in balf_samples.items():
    adata = adata_.copy()
    # CPM normalize
    sc.pp.normalize_total(adata, target_sum=1e6)

    # logarithmize 
    sc.pp.log1p(adata)
    
    balf_samples[sample] = adata
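
To make the log(1+CPM) transformation concrete, here is the same arithmetic written out by hand in NumPy on a toy count matrix (the numbers are hypothetical). It assumes the default behavior of the two scanpy calls above: each cell is scaled to a total of one million, then the natural log of one plus the value is taken.

```python
import numpy as np

# Toy raw UMI counts (hypothetical): 2 cells x 3 genes.
counts = np.array([[1., 0., 3.],
                   [10., 5., 5.]])

# Scale each cell (row) so that its counts sum to one million (CPM).
cpm = counts / counts.sum(axis=1, keepdims=True) * 1e6

# Natural log of (1 + CPM); zeros stay zero and nothing becomes negative.
log_cpm = np.log1p(cpm)

print(log_cpm.round(3))
```

Because log(1+x) maps zero to zero and is non-negative for non-negative x, the normalized matrix keeps the non-negativity of the raw counts.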

Finally, we apply a batch correction. The goal here is to account for sample-to-sample technical variability. We show ComBat since it is built into scanpy.

Note, the final input matrices to Tensor-cell2cell must be non-negative. We will demonstrate workarounds to negative counts in the tensor building tutorial. 

See https://doi.org/10.1186/s13619-020-00041-9 for a benchmark of scanpy's batch correction methods.

In [32]:
batch_var = 'Sample_ID' # the batch variable in the metadata

Batch correction using ComBat:

In [33]:
# merge the balf_samples
balf_corrected = sc.concat(balf_samples.values())
balf_corrected.obs_names_make_unique()

# store log(1+CPM) values in "raw" attribute
balf_corrected.raw = balf_corrected 

# do the batch correction
sc.pp.combat(balf_corrected, key = batch_var) 

The next two cells, which are not run, show examples of other batch correction methods. See https://nbisweden.github.io/workshop-scRNAseq/labs/compiled/scanpy/scanpy_03_integration.html for more tutorials on batch correction.

Batch correction with scanorama:

In [8]:
# import scanorama

# # merge all the balf_samples into a single object
# balf_log = sc.concat(balf_samples.values())
# balf_log.obs_names_make_unique()

# # correct with scanorama
# balf_corrected = scanorama.correct_scanpy(adatas=list(balf_samples.values()), return_dimred=False)

# # aggregate into one object
# balf_corrected = sc.concat(balf_corrected) 
# balf_corrected.obs_names_make_unique()

# # store log(1+CPM) values in "raw" attribute
# balf_corrected.raw = balf_log

Batch correction using a simple linear regression:

In [9]:
# # merge the balf_samples
# balf_corrected = sc.concat(balf_samples.values())
# balf_corrected.obs_names_make_unique()

# # store log(1+CPM) values in "raw" attribute
# balf_corrected.raw = balf_corrected

# # do the batch correction
# sc.pp.regress_out(balf_corrected, keys = batch_var)
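
To see why regression-based correction produces negative values, it can help to work through a single gene by hand. The sketch below is a conceptual illustration on made-up numbers, not scanpy's implementation: it fits an ordinary least-squares model with one indicator column per batch (so each fitted value is the batch mean) and keeps the residuals. Since residuals are centered on zero within each batch, some of them are necessarily negative.

```python
import numpy as np

# Hypothetical log-expression of one gene in six cells from two batches.
expr = np.array([2.0, 3.0, 4.0, 6.0, 7.0, 8.0])
batch = np.array(['A', 'A', 'A', 'B', 'B', 'B'])

# One-hot design matrix: one indicator column per batch.
labels = np.unique(batch)
design = (batch[:, None] == labels[None, :]).astype(float)

# Least-squares fit; with this design the coefficients are the batch means.
coef, *_ = np.linalg.lstsq(design, expr, rcond=None)

# Residuals = expression with the per-batch mean removed.
residuals = expr - design @ coef

print(coef)
print(residuals)
```

This is the reason the final matrix needs a non-negativity workaround before it is used with Tensor-cell2cell.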

Calculate a PCA manifold on the batch-corrected counts.

In [17]:
# get the top 2000 highly variable genes
sc.pp.highly_variable_genes(balf_corrected, n_top_genes = 2000)

# get PCA to 100 PCs
sc.tl.pca(balf_corrected, use_highly_variable = True, svd_solver='arpack', random_state = seed, 
         n_comps = 100)
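
For readers who want to see what the PCA step computes, here is a minimal NumPy sketch on random data. The matrix, its size, and the number of components are hypothetical, and it uses a full SVD rather than the arpack solver chosen above: each gene is centered, the singular value decomposition is taken, and the leading components are kept as cell coordinates, analogous to what is stored in obsm['X_pca'].

```python
import numpy as np

rng = np.random.default_rng(888)
X = rng.normal(size=(50, 10))          # toy matrix: 50 cells x 10 genes
n_comps = 3

# Center each gene (column) before decomposing.
X_centered = X - X.mean(axis=0)

# Thin SVD: X_centered = U @ diag(S) @ Vt
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Cell coordinates on the first n_comps principal components.
X_pca = U[:, :n_comps] * S[:n_comps]

# Fraction of total variance explained by each kept component.
variance_ratio = (S ** 2 / np.sum(S ** 2))[:n_comps]

print(X_pca.shape)
```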

In [None]:
# TODO: make this corrected object, the raw data, and metadata available to download somewhere
out_path = '/data3/hratch/c2c_general/'
balf_corrected.write_h5ad(out_path + 'batch_corrected_balf_covid.h5ad') # 6.7Gb

The final "balf_corrected" AnnData object has the following attributes:
1) X: batch-corrected counts matrix (preferably non-negative) <br>
2) obs: cell metadata that includes the cell group (cluster or type), Sample ID, and Context <br>
3) raw: log(1+CPM) normalized AnnData object <br>
4) obsm['X_pca']: the cell manifold 

Regardless of the preprocessing pipeline used, these four pieces of information will be necessary for some parts of the Tensor-cell2cell analyses. 
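
If you arrive with an object from a different pipeline, a quick check along the following lines may help confirm the four pieces are present. This is a sketch, not part of cell2cell: the helper missing_pieces is hypothetical, the expected column names are taken from the metadata shown in this tutorial, and the demo object is a stand-in used only so the example runs without the data.

```python
from types import SimpleNamespace

import pandas as pd


def missing_pieces(adata, obs_columns=('Sample_ID', 'Context', 'cell_type')):
    """Return a list naming which expected pieces are absent from `adata`."""
    missing = []
    if getattr(adata, 'X', None) is None:
        missing.append('X')
    for col in obs_columns:
        if col not in adata.obs.columns:
            missing.append(f'obs[{col!r}]')
    if getattr(adata, 'raw', None) is None:
        missing.append('raw')
    if 'X_pca' not in adata.obsm:
        missing.append("obsm['X_pca']")
    return missing


# Stand-in object (hypothetical) that lacks a cell type column, raw, and a PCA.
demo = SimpleNamespace(X=[[0.0]],
                       obs=pd.DataFrame(columns=['Sample_ID', 'Context']),
                       raw=None,
                       obsm={})
demo_missing = missing_pieces(demo)
print(demo_missing)
```

On the object built here, the call would be missing_pieces(balf_corrected), which should return an empty list if every step above was run.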

In [20]:
# corrected counts matrix
balf_corrected.to_df().T.head()

Unnamed: 0,AAACCTGAGGAATCGC-1,AAACCTGTCCAGAAGG-1,AAACCTGTCCAGTAGT-1,AAACCTGTCTGGGCCA-1,AAACGGGCACGAGGTA-1,AAACGGGGTACATCCA-1,AAACGGGGTCTCCCTA-1,AAACGGGTCTAGAGTC-1,AAAGATGTCGTGGGAA-1,AAAGCAAAGGGATACC-1,...,TTTGGTTAGCACGCCT-1,TTTGGTTAGTGGTAAT-1,TTTGGTTAGTTGTAGA-1-1,TTTGGTTCATACTACG-1,TTTGTCAAGATTACCC-1,TTTGTCAAGTGGTAAT-1,TTTGTCACAGAAGCAC-1,TTTGTCATCAACCAAC-1,TTTGTCATCCAAACAC-1,TTTGTCATCGCGTTTC-1
LINC00115,-0.026211,3.884047,-0.026211,-0.026211,4.615011,-0.026211,-0.026211,-0.026211,-0.026211,4.423844,...,-0.031741,-0.031741,-0.031741,-0.031741,-0.031741,-0.031741,-0.031741,-0.031741,-0.031741,-0.031741
NOC2L,-0.276392,-0.276392,-0.276392,-0.276392,-0.276392,-0.276392,-0.276392,-0.276392,-0.276392,-0.276392,...,-0.434179,-0.434179,2.241069,-0.434179,-0.434179,-0.434179,-0.434179,-0.434179,-0.434179,-0.434179
KLHL17,-0.045777,-0.045777,-0.045777,-0.045777,-0.045777,-0.045777,-0.045777,-0.045777,-0.045777,-0.045777,...,-0.019993,-0.019993,-0.019993,-0.019993,-0.019993,-0.019993,-0.019993,-0.019993,-0.019993,-0.019993
PLEKHN1,-0.061546,-0.061546,-0.061546,-0.061546,-0.061546,-0.061546,-0.061546,-0.061546,-0.061546,-0.061546,...,0.002228,0.002228,0.002228,0.002228,0.002228,0.002228,0.002228,0.002228,0.002228,0.002228
HES4,-0.444072,4.559903,-0.444072,-0.444072,-0.444072,-0.444072,-0.444072,-0.444072,-0.444072,-0.444072,...,0.562998,0.562998,0.562998,0.562998,0.562998,0.562998,0.562998,0.562998,0.562998,6.038211


In [25]:
# cell metadata
balf_corrected.obs.head()

Unnamed: 0,Sample_ID,Context,cell_type,n_genes,n_genes_by_counts,total_counts,total_counts_mt,pct_counts_mt
AAACCTGAGGAATCGC-1,C148,Severe_Covid,Macrophages,606,606,1342.0,60.0,4.470939
AAACCTGTCCAGAAGG-1,C148,Severe_Covid,Macrophages,2035,2034,7297.0,334.0,4.577224
AAACCTGTCCAGTAGT-1,C148,Severe_Covid,Macrophages,1660,1658,4959.0,324.0,6.533575
AAACCTGTCTGGGCCA-1,C148,Severe_Covid,Macrophages,4965,4964,31956.0,1374.0,4.299662
AAACGGGCACGAGGTA-1,C148,Severe_Covid,T,1290,1288,2892.0,119.0,4.114799


In [28]:
# log(1+CPM) counts matrix
balf_corrected.raw.to_adata().to_df().T.head()

Unnamed: 0,AAACCTGAGGAATCGC-1,AAACCTGTCCAGAAGG-1,AAACCTGTCCAGTAGT-1,AAACCTGTCTGGGCCA-1,AAACGGGCACGAGGTA-1,AAACGGGGTACATCCA-1,AAACGGGGTCTCCCTA-1,AAACGGGTCTAGAGTC-1,AAAGATGTCGTGGGAA-1,AAAGCAAAGGGATACC-1,...,TTTGGTTAGCACGCCT-1,TTTGGTTAGTGGTAAT-1,TTTGGTTAGTTGTAGA-1-1,TTTGGTTCATACTACG-1,TTTGTCAAGATTACCC-1,TTTGTCAAGTGGTAAT-1,TTTGTCACAGAAGCAC-1,TTTGTCATCAACCAAC-1,TTTGTCATCCAAACAC-1,TTTGTCATCGCGTTTC-1
LINC00115,0.0,4.927562,0.0,0.0,5.848695,0.0,0.0,0.0,0.0,5.607794,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
NOC2L,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,3.458022,0.0,0.0,0.0,0.0,0.0,0.0,0.0
KLHL17,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
PLEKHN1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
HES4,0.0,6.021334,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.693111


In [36]:
# cell manifold
import pandas as pd

pd.DataFrame(balf_corrected.obsm['X_pca'], 
             columns = ['PC' + str(i) for i in range(1, 101)], 
             index = balf_corrected.obs.index.tolist()).head()

Unnamed: 0,PC1,PC2,PC3,PC4,PC5,PC6,PC7,PC8,PC9,PC10,...,PC91,PC92,PC93,PC94,PC95,PC96,PC97,PC98,PC99,PC100
AAACCTGAGGAATCGC-1,-12.27951,-11.569539,2.333932,-0.792567,-2.818772,2.152842,1.414543,-2.81945,-0.843046,-0.359107,...,0.636127,-0.555673,-1.174195,0.382731,-1.154741,1.880821,-2.175864,0.964448,2.638466,2.88271
AAACCTGTCCAGAAGG-1,4.993004,-9.634439,-4.017168,6.867777,-3.659773,2.181219,2.381791,5.002475,4.327793,-2.199009,...,1.056841,-0.549429,-1.792231,-1.502628,3.608827,0.246818,2.4785,-0.412751,2.093467,-2.841379
AAACCTGTCCAGTAGT-1,-4.327682,-9.474691,-0.37374,-1.574611,1.026707,0.264881,-0.642445,2.645939,0.76808,1.401344,...,-1.219072,-1.50904,0.535264,-2.544001,-2.009928,0.51789,-0.087034,0.123314,0.31421,-3.137026
AAACCTGTCTGGGCCA-1,21.99651,2.566066,-3.94439,-13.916259,7.445151,3.117134,1.296663,2.814822,-1.560574,-0.914551,...,-1.616594,-0.872157,-2.061459,-0.744164,-1.183877,-4.804202,-1.866552,2.740477,1.292481,-1.16091
AAACGGGCACGAGGTA-1,-18.817038,4.994419,-9.384811,-6.844615,-3.056098,2.77496,-5.017649,0.400225,0.813033,0.663741,...,1.176409,0.047743,-0.064789,2.626716,0.151935,0.872773,-1.538529,-0.425076,1.31526,-0.842731


In [None]:
# from typing import Dict
# def split_adata(adata, sample_col = 'Sample_ID'):
#     """Split an AnnData object with corrected counts into its respective samples.

#     Parameters
#     ----------
#     adata : AnnData
#         merged AnnData object across samples (see sc.concat)
#     sample_col : str, optional
#         the metadata (adata.obs) column specifying the samples, by default 'Sample_ID'

#     Returns
#     -------
#     balf_samples : Dict[str, AnnData]
#         the set of AnnData objects corresponding to each sample
#     """
    
#     balf_samples = {sample: adata[adata.obs[adata.obs[sample_col] == sample].index] for sample in adata.obs[sample_col].unique()}
#     return balf_samples


# balf_corrected_split = split_adata(adata=balf_corrected)