## Scaden is a deep-learning based algorithm for cell type deconvolution of bulk RNA-seq samples.

## PR

### Bulk RNA-seq

MAGNet

### scRNA-seq and snRNA-seq

Processed scRNA-seq data from GEO (accession codes GSE109816 and GSE121893; data re-analysed).

Processed scRNA- and snRNA-seq data from https://www.heartcellatlas.org

Processed snRNA-seq data from https://singlecell.broadinstitute.org/single_cell/study/SCP498/transcriptional-and-cellular-diversity-of-the-human-heart#study-download


scRNA- and snRNA-seq data were pre-processed in `deconvolution_prep_scdata.ipynb`.


In [2]:
# import order is important to avoid ImportError
import sklearn

import scaden

# run scaden as module
from scaden import example as scx
from scaden import process as scp
from scaden import train as sct
from scaden import predict as scpr
from scaden import simulate as scs

In [3]:
import os
import glob
import numpy as np
import pandas as pd

import loompy as lp






In [4]:
# sc data: normalized (no log transformation), raw counts kept in the "counts" layer
scDir = '/prj/MAGE/analysis/deconvolution/scdata/scaden'
# bulk data: raw counts, which must be normalized the same way as the sc data (Scanpy-style)
bulkDir = '/prj/MAGE/analysis/data/stringtie'

locDir = '/prj/MAGE/analysis/deconvolution/scaden'

In [108]:
# to leverage the heterogeneity of multisubject data, training data is generated separately for
# every sample in the dataset

# see wrap_scaden_data.py

In [51]:
# bulk data - genes x samples (loom rows are genes, columns are samples)
lf = lp.connect(glob.glob(os.path.join(bulkDir, '*.loom'))[0], mode='r', validate=False)
gene_mask = lf.ra['GeneFlag'] == 1
sample_mask = lf.ca['SampleFlag'] == 1
# Only one indexing vector or array is currently allowed for advanced selection,
# so subset rows and columns in two steps
X = lf[gene_mask, :]
X = X[:, sample_mask]

# same as scanpy.pp.normalize_total with exclude_highly_expressed=False
# expects X as observations (samples) x genes and one total count per observation
def _normalize_data(X, counts, after=None, copy=True):
    X = X.copy() if copy else X
    if issubclass(X.dtype.type, (int, np.integer)):
        X = X.astype(np.float32)  # TODO: Check if float64 should be used
    counts = np.array(counts, dtype=np.float64)  # float copy, dask doesn't do medians
    after = np.median(counts[counts>0], axis=0) if after is None else after
    counts += (counts == 0)  # avoid division by zero
    counts = counts / after
    # no sparse data
    np.divide(X, counts[:, None], out=X)
    return X

target_sum = 1e4
# X is genes x samples, so the per-sample totals are the column sums
counts_per_sample = X.sum(axis=0)
if not np.all(counts_per_sample > 0):
    print('Some samples have total count of genes equal to zero!')
else:
    # transpose to samples x genes for normalization, then back to genes x samples
    X = _normalize_data(X.T, counts_per_sample, target_sum).T

bulkf = os.path.join(locDir, 'results', 'MAGNet_counts.txt')
pd.DataFrame(X,
             index=lf.ra['Gene'][gene_mask],
             columns=lf.ca['CellID'][sample_mask]).to_csv(bulkf,
                                                          index=True,
                                                          header=True,
                                                          sep='\t',
                                                          float_format='%.5f')
lf.close()
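
As a sanity check of the intended per-sample scaling, the following minimal sketch normalizes a toy genes × samples matrix so that every sample (column) sums to the target. The function name `normalize_per_sample` and the toy values are hypothetical and not part of the notebook or of Scanpy.

```python
import numpy as np

def normalize_per_sample(X, target_sum=1e4):
    """Scale each column (sample) of a genes x samples matrix to target_sum."""
    X = np.asarray(X, dtype=np.float64)
    totals = X.sum(axis=0)                        # one total per sample (column)
    totals = np.where(totals == 0, 1.0, totals)   # avoid division by zero
    return X / totals[None, :] * target_sum

# toy matrix: 3 genes x 2 samples (made-up counts)
toy = np.array([[10, 0],
                [30, 5],
                [60, 15]])
norm = normalize_per_sample(toy)
```

After normalization, `norm.sum(axis=0)` should equal `target_sum` for every sample with a non-zero total.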

## Simulate data for training


In [11]:
# scs.simulation(simulate_dir=os.path.join(locDir, 'training_/'), # trailing / is important!
#                data_dir=os.path.join(locDir, 'input/'),
#                sample_size=1000, # number of cells per sample
#                num_samples=2000, # number of samples
#                pattern="HCASampleH3.h5ad",
#                unknown_celltypes=['unknown'], # must be a list, we don't have any
#                out_prefix='test',
#                fmt='h5ad')

In [12]:
scs.simulation(simulate_dir=os.path.join(locDir, 'training/'), # trailing / is important!
               data_dir=os.path.join(locDir, 'input/'),
               sample_size=1000, # number of cells per sample
               num_samples=2000, # number of samples
               pattern="*.h5ad",
               unknown_celltypes=['unknown'], # must be a list, we don't have any
               out_prefix='trained_3set_by_sample_1000cells_2000samples',
               fmt='h5ad')

Loading SCP498Sample1723 dataset ...
Loading SCP498Sample1681 dataset ...
Loading GSE109816SampleN2 dataset ...
Loading GSE109816SampleD4 dataset ...
Loading GSE109816SampleN9 dataset ...
Loading HCASampleD1 dataset ...
Loading HCASampleD2 dataset ...
Loading HCASampleH4 dataset ...
Loading GSE109816SampleN5 dataset ...
Loading SCP498Sample1702 dataset ...
Loading GSE109816SampleN10 dataset ...
Loading HCASampleH6 dataset ...
Loading GSE109816SampleN12 dataset ...
Loading GSE109816SampleN8 dataset ...
Loading GSE109816SampleN11 dataset ...
Loading HCASampleD6 dataset ...
Loading SCP498Sample1666 dataset ...
Loading HCASampleH3 dataset ...
Loading GSE109816SampleN13 dataset ...
Loading GSE109816SampleN14 dataset ...
Loading HCASampleD4 dataset ...
Loading GSE109816SampleN1 dataset ...
Loading HCASampleH7 dataset ...
Loading HCASampleH2 dataset ...
Loading GSE109816SampleC2 dataset ...
Loading GSE109816SampleC1 dataset ...
Loading HCASampleH5 dataset ...
Loading HCASampleD5 dataset ...

Normal samples: 100%|██████████| 1000/1000 [49:28<00:00,  2.97s/it]
Sparse samples: 100%|██████████| 1000/1000 [26:21<00:00,  1.58s/it]
Normal samples: 100%|██████████| 1000/1000 [36:36<00:00,  2.20s/it]
Sparse samples: 100%|██████████| 1000/1000 [19:45<00:00,  1.19s/it]
Normal samples: 100%|██████████| 1000/1000 [03:48<00:00,  4.37it/s]
Sparse samples: 100%|██████████| 1000/1000 [04:14<00:00,  3.92it/s]
Normal samples: 100%|██████████| 1000/1000 [03:56<00:00,  4.23it/s]
Sparse samples: 100%|██████████| 1000/1000 [04:09<00:00,  4.00it/s]
Normal samples: 100%|██████████| 1000/1000 [03:53<00:00,  4.29it/s]
Sparse samples: 100%|██████████| 1000/1000 [04:14<00:00,  3.92it/s]
Normal samples: 100%|██████████| 1000/1000 [14:28<00:00,  1.15it/s]
Sparse samples: 100%|██████████| 1000/1000 [09:21<00:00,  1.78it/s]
Normal samples: 100%|██████████| 1000/1000 [33:08<00:00,  1.99s/it]
Sparse samples: 100%|██████████| 1000/1000 [18:44<00:00,  1.12s/it]
Normal samples: 100%|██████████| 1000/1000 [21:2

Output

- `celltypes.txt`: all cell types in the training data
- `*_labels.txt`: one per input sc dataset, of size num_samples × num_celltypes (cell type proportions for each simulated sample)
- `*_samples.txt`: simulated bulk data, one per input sc dataset, of size num_samples × num_genes (intersect of all sc datasets)
- `*.h5ad`: combined simulated bulk data, with n_obs × n_vars = (num_scdataset × num_samples) × num_genes (intersect of all sc datasets); `var` holds the genes and `obs` holds the cell type proportions of each sample

**Note:** `sample_size`

For each of the `num_samples` artificial samples, a random fraction is generated for each cell type found in the sc dataset. These fractions are multiplied by `sample_size` (the number of cells per sample) to obtain the number of cells of each type that make up the mock bulk sample.
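
The idea can be illustrated with a minimal NumPy sketch. This is not Scaden's implementation: the function name `simulate_bulk_sample`, sampling with replacement, and the toy data are assumptions made for illustration, and only the "normal" samples are covered, not the "sparse" ones seen in the logs.

```python
import numpy as np

def simulate_bulk_sample(expr, labels, sample_size, rng):
    """Build one mock bulk sample from a cells x genes matrix.

    expr   : cells x genes expression matrix
    labels : cell type label of each cell
    """
    celltypes = np.unique(labels)
    fracs = rng.random(len(celltypes))
    fracs /= fracs.sum()                                 # random fractions summing to 1
    n_cells = np.round(fracs * sample_size).astype(int)  # cells to draw per type
    bulk = np.zeros(expr.shape[1])
    for ct, n in zip(celltypes, n_cells):
        idx = np.flatnonzero(labels == ct)
        chosen = rng.choice(idx, size=n, replace=True)   # assumption: with replacement
        bulk += expr[chosen].sum(axis=0)                 # sum expression of drawn cells
    return fracs, n_cells, bulk

rng = np.random.default_rng(0)
# toy sc dataset: 30 cells x 4 genes, 3 cell types (made-up data)
expr = rng.poisson(5, size=(30, 4)).astype(float)
labels = np.array(["A"] * 10 + ["B"] * 10 + ["C"] * 10)
fracs, n_cells, bulk = simulate_bulk_sample(expr, labels, sample_size=100, rng=rng)
```

Here `fracs` plays the role of one row of `*_labels.txt` and `bulk` one row of `*_samples.txt`.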


## Process input

- genes are restricted to the intersection between the training data and the data to predict, after filtering on variance (default `var_cutoff=0.1`)
- data are log2-transformed and scaled
- scaling uses `sklearn.preprocessing.MinMaxScaler`; the default `scaling_option` cannot be changed
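
The steps above can be sketched with plain NumPy. This is an approximation under stated assumptions, not Scaden's code: the helper names are hypothetical, the variance filter is assumed to keep genes whose variance across bulk samples exceeds `var_cutoff`, the log transform is assumed to be log2(x + 1), and the min-max scaling is assumed to be applied per sample as a stand-in for `MinMaxScaler`.

```python
import numpy as np

def filter_low_variance(bulk, var_cutoff=0.1):
    """bulk: genes x samples. Return a boolean mask of genes to keep."""
    return bulk.var(axis=1) > var_cutoff

def log_minmax_per_sample(X):
    """X: samples x genes. Apply log2(x + 1), then scale each sample to [0, 1]."""
    X = np.log2(np.asarray(X, dtype=np.float64) + 1.0)
    lo = X.min(axis=1, keepdims=True)
    span = X.max(axis=1, keepdims=True) - lo
    span[span == 0] = 1.0                 # constant samples stay at 0
    return (X - lo) / span

# toy bulk data: 3 genes x 3 samples (made-up values); gene 0 is constant
bulk = np.array([[1.0, 1.0, 1.0],
                 [0.0, 10.0, 100.0],
                 [5.0, 50.0, 7.0]])
keep = filter_low_variance(bulk)
scaled = log_minmax_per_sample(bulk[keep].T)   # samples x kept genes
```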

In [13]:
# prediction data (bulk)
data_path = os.path.join(locDir, 'results', 'MAGNet_counts.txt')
# training data (h5ad file) scRNA-seq
training_data = os.path.join(locDir, 'training', 'trained_3set_by_sample_1000cells_2000samples.h5ad')
# name of processed file - output
processed_path = os.path.join(locDir, 'training', 'trained_3set_by_sample_1000cells_2000samples_processed.h5ad')

# default
var_cutoff = 0.1

scp.processing(data_path, training_data, processed_path, var_cutoff)


## Training

Options:

`--train_datasets`: comma-separated list of datasets used for training.

Here, training data were simulated separately for each sc dataset. Open questions: should multiple sc datasets be used to simulate a single training dataset, or should multiple training datasets be simulated?

- uses 3 deep neural networks, each trained for 5,000 steps
- defaults: `--batch_size 128 --learning_rate 0.0001 --steps 5000 --seed 0`

In [5]:
data_path = os.path.join(locDir, 'training', 'trained_3set_by_sample_1000cells_2000samples_processed.h5ad')

train_datasets = '' # ds from processed, uses all by default when called

model_dir = os.path.join(locDir, 'model/')

batch_size = 128
learning_rate = 0.0001 
num_steps = 5000

sct.training(data_path,
             train_datasets,
             model_dir,
             batch_size,
             learning_rate,
             num_steps,
             seed=0)


Step: 4999, Loss: 0.0021: 100%|██████████| 5000/5000 [03:09<00:00, 26.44it/s]


INFO:tensorflow:Assets written to: /prj/MAGE/analysis/deconvolution/scaden/model//m256/assets


Step: 4999, Loss: 0.0014: 100%|██████████| 5000/5000 [03:55<00:00, 21.24it/s]


INFO:tensorflow:Assets written to: /prj/MAGE/analysis/deconvolution/scaden/model//m512/assets


Step: 4999, Loss: 0.0014: 100%|██████████| 5000/5000 [05:12<00:00, 15.98it/s]


INFO:tensorflow:Assets written to: /prj/MAGE/analysis/deconvolution/scaden/model//m1024/assets


## Prediction

In [6]:
model_dir = os.path.join(locDir, 'model/')

data_path = os.path.join(locDir, 'results', 'MAGNet_counts.txt')

out_name = os.path.join(locDir, 'results', 'scaden_deconvolution_MAGNet_counts_trained_3set_by_sample_1000cells_2000samples.txt')

scpr.prediction(model_dir=model_dir,
                data_path=data_path,
                out_name=out_name,
                seed=0)
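
Training produced three models (`m256`, `m512`, `m1024`). To the best of our understanding, Scaden reports the average of the three models' predictions; the sketch below illustrates that averaging step with made-up numbers and does not call Scaden itself.

```python
import numpy as np

# Toy per-model predictions (made-up numbers): 2 samples x 3 cell types,
# each row sums to 1.
preds = {
    "m256":  np.array([[0.50, 0.30, 0.20], [0.10, 0.60, 0.30]]),
    "m512":  np.array([[0.40, 0.40, 0.20], [0.20, 0.50, 0.30]]),
    "m1024": np.array([[0.60, 0.20, 0.20], [0.15, 0.55, 0.30]]),
}

# Ensemble estimate: element-wise mean over the three models.
ensemble = np.mean(np.stack(list(preds.values())), axis=0)
```

Because each model's proportions sum to 1 per sample, the averaged proportions do as well.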

In [None]:
###########################

In [9]:
from anndata import read_h5ad

In [11]:
test_data = read_h5ad(os.path.join(locDir, 'training', 'trained_3set_by_sample_1000cells_2000samples_processed.h5ad'))
test_data

AnnData object with n_obs × n_vars = 82000 × 18257
    obs: 'Endothelial', 'Fibroblast', 'Macrophage', 'Ventricular_cardiomyocyte', 'Mesothelial', 'Smooth_muscle', 'Atrial_cardiomyocyte', 'Adipocyte', 'Lymphocyte', 'Pericyte', 'Lymphoid', 'Neuronal', 'Myeloid', 'ds', 'batch'
    uns: 'cell_types', 'unknown'