# scBoolSeq API demonstration

This notebook demonstrates the basic features of scBoolSeq: scRNA-Seq data binarization and synthetic generation.

scBoolSeq also provides a command-line interface; see https://github.com/bnediction/scBoolSeq.

In [1]:
import pandas as pd
from scboolseq import scBoolSeq

### Retrieve an example dataset

The demonstration will be performed on the scRNA-Seq datasets from XXX.

In [2]:
!test -f data_Nestorowa.tsv.gz || curl -fOL \
    https://github.com/pinellolab/STREAM/raw/master/stream/tests/datasets/Nestorowa_2016/data_Nestorowa.tsv.gz



**Important**: `scBoolSeq` expects the data to be formatted as follows: columns representing genes and rows representing cells/samples (sc/bulk RNA-Seq).

In [3]:
nestorowa = pd.read_csv("data_Nestorowa.tsv.gz", compression="gzip", sep="\t", index_col=0).T
nestorowa.head()

Unnamed: 0,Clec1b,Kdm3a,Coro2b,8430408G22Rik,Clec9a,Phf6,Usp14,Tmem167b,Kbtbd7,Rag2,...,Zfp438,Rab18,Mzb1,B4galt6,Rnf125,Impact,Taf4b,Zfp521,Hrh4,Psma8
HSPC_025,0.0,4.891604,1.426148,0.0,0.0,2.599758,2.954035,6.357369,2.12914,1.426148,...,1.426148,9.660368,1.426148,1.426148,2.12914,8.177546,1.426148,1.426148,0.0,7.869409
HSPC_031,0.0,6.877725,0.0,0.0,0.0,2.423483,1.804914,0.0,0.0,0.0,...,0.0,0.699126,0.0,6.562672,0.0,5.439604,0.699126,0.0,0.0,0.0
HSPC_037,0.0,0.0,6.913384,0.0,0.0,2.051659,8.265465,0.0,1.363402,0.0,...,1.363402,8.885311,0.0,1.363402,0.0,8.068215,0.0,2.051659,0.0,1.363402
LT-HSC_001,0.0,0.0,8.178374,0.0,0.0,6.419817,3.453502,2.579528,2.579528,0.0,...,2.579528,6.501342,4.947883,0.0,0.0,0.0,2.579528,8.178374,0.0,2.579528
HSPC_001,0.0,0.0,9.475577,0.0,0.0,7.73337,1.4789,0.0,10.045601,0.532906,...,0.0,1.693409,7.975432,8.561045,0.0,6.53992,0.532906,0.0,0.0,0.532906
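
To illustrate the orientation requirement stated above, here is a minimal sketch on a toy table. The data and variable names are hypothetical and not part of scBoolSeq. It starts from a genes-by-cells layout, like the downloaded file before transposition, and transposes it so that rows are cells and columns are genes.

```python
import pandas as pd

# Hypothetical toy matrix stored as genes (rows) x cells (columns),
# the same layout as the downloaded file before transposition.
genes_by_cells = pd.DataFrame(
    {"cell_A": [0.0, 4.9, 1.4], "cell_B": [0.0, 6.9, 0.0]},
    index=["Clec1b", "Kdm3a", "Coro2b"],
)

# Transpose so that each row is a cell and each column is a gene.
cells_by_genes = genes_by_cells.T

print(cells_by_genes.shape)          # (n_cells, n_genes)
print(list(cells_by_genes.columns))  # gene names
```

A quick look at the shape and the column names after loading is usually enough to tell whether the `.T` is needed for a given file.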


## Instantiation

In [4]:
scbool = scBoolSeq()
scbool

### Binarization

The binarization requires learning the distribution of RNA pseudocounts for each gene, which is performed by the `fit()` method:

In [5]:
%time scbool.fit(nestorowa)

CPU times: user 5min 50s, sys: 3.38 s, total: 5min 54s
Wall time: 1min 5s


Internally, the learned features are stored in the `criteria_` attribute, a table with one row per gene, which can be accessed like this:

In [6]:
scbool.criteria_[['Category', *scbool.criteria_]].head(10)

Unnamed: 0,Category,Mean,MeanNZ,Median,MedianNZ,GeometricMean,HarmonicMean,Variance,VarianceNZ,DropOutRate,Amplitude,Dip,Kurtosis,Skewness,DenPeak,BI,Category.1
Clec1b,ZeroInf,0.188285,1.520978,0.0,0.968776,1.07712,0.843836,0.57944,2.653752,0.876208,8.852181,0.358107,54.017736,6.716474,-0.000128,0.0,ZeroInf
Kdm3a,Bimodal,2.593177,3.84794,1.26804,2.737412,2.682239,1.747482,8.687337,8.062633,0.326087,10.126676,0.0,-0.784019,0.863438,0.303398,2.401649,Bimodal
Coro2b,ZeroInf,0.814759,2.383819,0.0,1.290666,1.586378,1.14978,3.110739,5.361032,0.658213,9.475577,0.0,7.061604,2.771571,0.003072,0.0,ZeroInf
8430408G22Rik,ZeroInf,0.34591,2.983472,0.0,1.449779,1.845045,1.214593,1.8529,8.112175,0.884058,9.067857,0.684454,21.729044,4.708367,0.003788,0.0,ZeroInf
Clec9a,ZeroInf,0.078488,2.280293,0.0,1.229896,1.525787,1.148,0.372653,5.805785,0.96558,9.614233,1.0,140.089285,11.195517,0.000308,0.0,ZeroInf
Phf6,Bimodal,4.846453,5.025501,4.778527,5.051362,4.104554,3.106348,8.034313,7.431326,0.035628,10.135226,0.0,-1.389024,-0.002268,2.033821,1.989131,Bimodal
Usp14,Bimodal,6.061999,6.109964,7.140887,7.170392,5.283258,4.241799,7.524603,7.291078,0.00785,11.08875,0.0,-1.224987,-0.450551,8.231397,2.208317,Bimodal
Tmem167b,Bimodal,2.090655,3.448331,0.924808,2.02715,2.356898,1.589141,7.572099,7.807721,0.39372,9.486826,0.0,0.093023,1.246701,0.115315,2.426544,Bimodal
Kbtbd7,ZeroInf,1.255786,2.928988,0.0,1.671472,1.961379,1.349735,5.089084,6.96896,0.571256,10.910051,0.0,3.577214,2.131193,0.004064,0.0,ZeroInf
Rag2,ZeroInf,1.041198,1.928663,0.548416,1.274551,1.390515,1.097732,3.036219,3.912518,0.460145,10.348297,0.0,9.080962,2.910103,0.00779,0.0,ZeroInf
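
A common first look at such a table is to count how many genes fall into each category. The sketch below does this with plain pandas on the `Category` labels of the ten rows displayed above, copied by hand, so it runs without fitting anything.

```python
import pandas as pd

# Category labels copied from the ten rows displayed above.
categories = pd.Series(
    {
        "Clec1b": "ZeroInf",
        "Kdm3a": "Bimodal",
        "Coro2b": "ZeroInf",
        "8430408G22Rik": "ZeroInf",
        "Clec9a": "ZeroInf",
        "Phf6": "Bimodal",
        "Usp14": "Bimodal",
        "Tmem167b": "Bimodal",
        "Kbtbd7": "ZeroInf",
        "Rag2": "ZeroInf",
    },
    name="Category",
)

# Number of genes per category.
category_counts = categories.value_counts()
print(category_counts)
```

On the fitted object, the same summary should be available with `scbool.criteria_["Category"].value_counts()`, since `criteria_` is indexed like a pandas dataframe in the cell above.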


The actual binarization is performed by the `binarize()` method, which takes as argument the dataset to binarize. This dataset can be identical to the reference dataset used for learning the criteria:

In [7]:
%time nestorowa_binarized = scbool.binarize(nestorowa)

CPU times: user 2.81 s, sys: 20.8 ms, total: 2.83 s
Wall time: 2.83 s


The method returns a Pandas dataframe replacing the RNA log pseudocounts with `0`, `1`, or `NaN`:

In [8]:
nestorowa_binarized.head()

Unnamed: 0,Clec1b,Kdm3a,Coro2b,8430408G22Rik,Clec9a,Phf6,Usp14,Tmem167b,Kbtbd7,Rag2,...,Zfp438,Rab18,Mzb1,B4galt6,Rnf125,Impact,Taf4b,Zfp521,Hrh4,Psma8
HSPC_025,,1.0,1.0,,,0.0,0.0,1.0,1.0,1.0,...,1.0,1.0,1.0,0.0,0.0,1.0,,1.0,,1.0
HSPC_031,,1.0,,,,0.0,0.0,0.0,,,...,,0.0,,1.0,0.0,,,,,
HSPC_037,,0.0,1.0,,,0.0,1.0,0.0,1.0,,...,1.0,1.0,,0.0,0.0,1.0,,1.0,,1.0
LT-HSC_001,,0.0,1.0,,,1.0,0.0,,1.0,,...,1.0,1.0,1.0,0.0,0.0,0.0,,1.0,,1.0
HSPC_001,,0.0,1.0,,,1.0,0.0,0.0,1.0,1.0,...,,0.0,1.0,1.0,0.0,1.0,,,,1.0
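
Because the result is an ordinary dataframe, the share of `NaN` entries per gene can be inspected with standard pandas calls. The sketch below uses three gene columns copied from the five rows displayed above. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Three gene columns copied from the five binarized rows displayed above.
binarized = pd.DataFrame(
    {
        "Clec1b": [np.nan, np.nan, np.nan, np.nan, np.nan],
        "Kdm3a": [1.0, 1.0, 0.0, 0.0, 0.0],
        "Tmem167b": [1.0, 0.0, 0.0, np.nan, 0.0],
    },
    index=["HSPC_025", "HSPC_031", "HSPC_037", "LT-HSC_001", "HSPC_001"],
)

# Fraction of cells with a NaN entry, per gene.
undetermined_fraction = binarized.isna().mean()

# Genes that received a 0 or 1 in every cell.
fully_determined_genes = list(binarized.columns[binarized.notna().all()])

print(undetermined_fraction)
print(fully_determined_genes)
```

The same two expressions can be applied directly to `nestorowa_binarized`.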


### Synthetic generation

The synthetic generation is performed from fully determined Boolean states, given as a Pandas dataframe.

For this example, we simply reuse the binarized data, where we replace the `NaN` entries with random Boolean values:

In [9]:
from scboolseq.simulation import random_nan_binariser
fully_bin = random_nan_binariser(nestorowa_binarized)
fully_bin.head()

Unnamed: 0,Clec1b,Kdm3a,Coro2b,8430408G22Rik,Clec9a,Phf6,Usp14,Tmem167b,Kbtbd7,Rag2,...,Zfp438,Rab18,Mzb1,B4galt6,Rnf125,Impact,Taf4b,Zfp521,Hrh4,Psma8
HSPC_025,1.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0,...,1.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0
HSPC_031,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,...,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,1.0,1.0
HSPC_037,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,1.0,...,1.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0
LT-HSC_001,0.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,1.0,0.0,...,1.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0
HSPC_001,0.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0,...,0.0,0.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,1.0
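
To make the idea of completing `NaN` entries with random Boolean values concrete, here is a minimal pandas/NumPy sketch. It is not the implementation of `random_nan_binariser`: the helper `fill_nan_with_random_bits` and its `seed` parameter are hypothetical, and the library function may sample differently.

```python
import numpy as np
import pandas as pd


def fill_nan_with_random_bits(binary_df, seed=None):
    """Hypothetical helper: replace NaN with random 0.0/1.0, keep other entries."""
    rng = np.random.default_rng(seed)
    # One random bit per cell/gene entry, drawn uniformly from {0, 1}.
    random_bits = pd.DataFrame(
        rng.integers(0, 2, size=binary_df.shape).astype(float),
        index=binary_df.index,
        columns=binary_df.columns,
    )
    # Keep determined entries; take the random bit only where the entry is NaN.
    return binary_df.where(binary_df.notna(), random_bits)


partial = pd.DataFrame(
    {"Kdm3a": [1.0, 0.0], "Clec1b": [np.nan, np.nan]},
    index=["cell_A", "cell_B"],
)
filled = fill_nan_with_random_bits(partial, seed=0)
print(filled)
```

Entries that were already `0` or `1` are left untouched, so only the previously undetermined positions vary between runs with different seeds.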


We then generate synthetic RNA-Seq data only for a subset of the data, namely the first 100 cells and the first 10 genes:

In [10]:
to_simulate = fully_bin.iloc[:100, :10]
to_simulate

Unnamed: 0,Clec1b,Kdm3a,Coro2b,8430408G22Rik,Clec9a,Phf6,Usp14,Tmem167b,Kbtbd7,Rag2
HSPC_025,1.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0
HSPC_031,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0
HSPC_037,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,1.0
LT-HSC_001,0.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,1.0,0.0
HSPC_001,0.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0
...,...,...,...,...,...,...,...,...,...,...
LT-HSC_014,1.0,0.0,1.0,0.0,1.0,0.0,1.0,1.0,0.0,1.0
HSPC_044,0.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0
HSPC_051,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,1.0,0.0
HSPC_057,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0

Synthetic values are then sampled with the `sample_counts` method, here drawing 3 samples per Boolean state with a fixed `random_state`:


In [11]:
synthetic_rna = scbool.sample_counts(to_simulate, n_samples_per_state=3, random_state=1234)
synthetic_rna.head()

Unnamed: 0,Clec1b,Kdm3a,Coro2b,8430408G22Rik,Clec9a,Phf6,Usp14,Tmem167b,Kbtbd7,Rag2
HSPC_025,13.277362,12.865621,13.322328,6.916974,1.772676,3.024705,6.071398,8.983634,8.995036,8.790327
HSPC_031,2.259741,9.369407,3.673108,5.979309,7.267355,2.112683,0.0,4.471132,8.443967,5.093893
HSPC_037,15.005001,2.580485,9.592068,3.603809,8.360564,3.898696,8.219273,7.117078,7.854001,9.625242
LT-HSC_001,4.510286,2.156347,7.615455,10.921965,9.327481,6.790554,5.802628,2.487239,11.066909,3.450493
HSPC_001,4.284373,2.983465,10.381631,13.410947,8.163133,9.542608,5.384108,2.58313,9.29075,9.566609


In [13]:
synthetic_rna.shape

(300, 10)

The result has 300 rows, that is, 3 synthetic samples for each of the 100 input Boolean states, and one column for each of the 10 selected genes.