# scBoolSeq API demonstration

This notebook demonstrates the basic features of scBoolSeq: scRNA-Seq data binarization and synthetic generation.

Note that scBoolSeq also comes with a command line interface; see https://github.com/bnediction/scBoolSeq.

In [1]:
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))
import warnings
warnings.filterwarnings("ignore")

In [6]:
import pandas as pd
from scboolseq import scBoolSeq

### Retrieve an example dataset

The demonstration is performed on the scRNA-Seq dataset from XXX.

In [3]:
!test -f data_Nestorowa.tsv.gz || curl -fOL \
    https://github.com/pinellolab/STREAM/raw/master/stream/tests/datasets/Nestorowa_2016/data_Nestorowa.tsv.gz

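Where `curl` is not available, the same "download only if the file is missing" logic can be sketched with the Python standard library. The helper name `fetch_if_missing` is hypothetical and not part of scBoolSeq; the network call is left commented out so the block runs offline.

```python
import urllib.request
from pathlib import Path


def fetch_if_missing(url: str, destination: str) -> Path:
    """Download `url` to `destination` unless the file already exists."""
    path = Path(destination)
    if not path.exists():
        # urlretrieve writes the response body to the given file path
        # and raises an exception if the request fails.
        urllib.request.urlretrieve(url, str(path))
    return path


# Usage (requires network access), with the URL from the cell above:
# fetch_if_missing(
#     "https://github.com/pinellolab/STREAM/raw/master/stream/tests/datasets/"
#     "Nestorowa_2016/data_Nestorowa.tsv.gz",
#     "data_Nestorowa.tsv.gz",
# )
```

As with the shell one-liner, an existing local copy is never overwritten.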

**Important**: `scBoolSeq` expects one row per cell or sample and one column per gene, for both single-cell and bulk RNA-Seq data. The example file stores genes as rows, so the table is transposed with `.T` after loading.

In [3]:
nestorowa = pd.read_csv("data_Nestorowa.tsv.gz", compression="gzip", sep="\t", index_col=0).T
nestorowa.head()

| | Clec1b | Kdm3a | Coro2b | 8430408G22Rik | Clec9a | Phf6 | Usp14 | Tmem167b | Kbtbd7 | Rag2 | ... | Zfp438 | Rab18 | Mzb1 | B4galt6 | Rnf125 | Impact | Taf4b | Zfp521 | Hrh4 | Psma8 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| HSPC_025 | 0.0 | 4.891604 | 1.426148 | 0.0 | 0.0 | 2.599758 | 2.954035 | 6.357369 | 2.12914 | 1.426148 | ... | 1.426148 | 9.660368 | 1.426148 | 1.426148 | 2.12914 | 8.177546 | 1.426148 | 1.426148 | 0.0 | 7.869409 |
| HSPC_031 | 0.0 | 6.877725 | 0.0 | 0.0 | 0.0 | 2.423483 | 1.804914 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.699126 | 0.0 | 6.562672 | 0.0 | 5.439604 | 0.699126 | 0.0 | 0.0 | 0.0 |
| HSPC_037 | 0.0 | 0.0 | 6.913384 | 0.0 | 0.0 | 2.051659 | 8.265465 | 0.0 | 1.363402 | 0.0 | ... | 1.363402 | 8.885311 | 0.0 | 1.363402 | 0.0 | 8.068215 | 0.0 | 2.051659 | 0.0 | 1.363402 |
| LT-HSC_001 | 0.0 | 0.0 | 8.178374 | 0.0 | 0.0 | 6.419817 | 3.453502 | 2.579528 | 2.579528 | 0.0 | ... | 2.579528 | 6.501342 | 4.947883 | 0.0 | 0.0 | 0.0 | 2.579528 | 8.178374 | 0.0 | 2.579528 |
| HSPC_001 | 0.0 | 0.0 | 9.475577 | 0.0 | 0.0 | 7.73337 | 1.4789 | 0.0 | 10.045601 | 0.532906 | ... | 0.0 | 1.693409 | 7.975432 | 8.561045 | 0.0 | 6.53992 | 0.532906 | 0.0 | 0.0 | 0.532906 |
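The expected orientation can be illustrated on a toy table. This is a minimal sketch using pandas only; the cell names `cell_A` and `cell_B` and the values are made up for illustration.

```python
import pandas as pd

# Toy expression table stored with genes as rows and cells as columns,
# which is the orientation of the downloaded file before transposition.
genes_by_cells = pd.DataFrame(
    {"cell_A": [0.0, 4.9, 1.4], "cell_B": [0.0, 6.9, 0.0]},
    index=["Clec1b", "Kdm3a", "Coro2b"],
)

# Transpose so that rows are cells and columns are genes.
cells_by_genes = genes_by_cells.T

print(cells_by_genes.shape)          # (2, 3): 2 cells, 3 genes
print(list(cells_by_genes.columns))  # gene names are now the column labels
```

Checking the shape and the column labels after loading is a cheap way to catch a table that was passed in the wrong orientation.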


### Instantiation

In [7]:
scbool = scBoolSeq()
scbool

### Binarization

Binarization requires learning the distribution of RNA pseudocounts for each gene, which is done by the `fit()` method:

In [5]:
%time scbool.fit(nestorowa)

CPU times: user 11.6 s, sys: 132 ms, total: 11.8 s
Wall time: 42 s


scBoolSeq(has_data=True, can_binarize=True, can_simulate=False)

Internally, the learned features are stored in the `criteria` table, which can be accessed as follows:

In [6]:
scbool.criteria[['Category', *scbool.criteria]].head(10)

| | Category | Dip | BI | Kurtosis | DropOutRate | MeanNZ | DenPeak | Amplitude | gaussian_prob1 | gaussian_prob2 | ... | variance | unimodal_margin_quantile | unimodal_low_quantile | unimodal_high_quantile | IQR | q50 | bim_thresh_down | bim_thresh_up | Category.1 | dor_threshold |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Clec1b | ZeroInf | 0.358107 | 1.635698 | 54.017736 | 0.876208 | 1.520978 | -0.007249 | 8.852181 | 0.98614 | 0.01386 | ... | 0.579791 | 0.25 | 0.0 | 0.0 | 0.0 | 0.0 | 2.78574 | 3.094168 | ZeroInf | 0.95 |
| Kdm3a | Bimodal | 0.0 | 2.407548 | -0.784019 | 0.326087 | 3.84794 | 0.209239 | 10.126676 | 0.71452 | 0.28548 | ... | 8.692586 | 0.25 | 0.0 | 5.258984 | 5.258984 | 1.26804 | 3.432251 | 4.748643 | Bimodal | 0.95 |
| Coro2b | ZeroInf | 0.0 | 2.32006 | 7.061604 | 0.658213 | 2.383819 | 0.004597 | 9.475577 | 0.919508 | 0.080492 | ... | 3.112619 | 0.25 | 0.0 | 0.868463 | 0.868463 | 0.0 | 3.183596 | 3.879537 | ZeroInf | 0.95 |
| 8430408G22Rik | ZeroInf | 0.684454 | 3.121069 | 21.729044 | 0.884058 | 2.983472 | 0.005663 | 9.067857 | 0.964962 | 0.035038 | ... | 1.85402 | 0.25 | 0.0 | 0.0 | 0.0 | 0.0 | 3.612061 | 4.175572 | ZeroInf | 0.95 |
| Clec9a | Discarded | 1.0 | 2.081717 | 140.089285 | 0.96558 | 2.280293 | -0.009361 | 9.614233 | 0.993961 | 0.006039 | ... | 0.372878 | 0.25 | 0.0 | 0.0 | 0.0 | 0.0 | 3.11341 | 4.607253 | Discarded | 0.95 |
| Phf6 | Bimodal | 0.0 | 1.988667 | -1.389024 | 0.035628 | 5.025501 | 2.017547 | 10.135226 | 0.505609 | 0.494391 | ... | 8.039168 | 0.25 | 2.197163 | 7.542022 | 5.344859 | 4.778527 | 3.932792 | 5.828662 | Bimodal | 0.95 |
| Usp14 | Bimodal | 0.0 | 2.20808 | -1.224987 | 0.00785 | 6.109964 | 8.24557 | 11.08875 | 0.374024 | 0.625976 | ... | 7.52915 | 0.25 | 3.337786 | 8.354258 | 5.016472 | 7.140887 | 4.553463 | 6.022076 | Bimodal | 0.95 |
| Tmem167b | Bimodal | 0.0 | 2.430813 | 0.093023 | 0.39372 | 3.448331 | 0.072982 | 9.486826 | 0.78825 | 0.21175 | ... | 7.576674 | 0.25 | 0.0 | 2.838078 | 2.838078 | 0.924808 | 3.561157 | 4.654446 | Bimodal | 0.95 |
| Kbtbd7 | ZeroInf | 0.0 | 2.137107 | 3.577214 | 0.571256 | 2.928988 | -0.000556 | 10.910051 | 0.883407 | 0.116593 | ... | 5.092159 | 0.25 | 0.0 | 1.410089 | 1.410089 | 0.0 | 3.553714 | 4.491632 | ZeroInf | 0.95 |
| Rag2 | ZeroInf | 0.0 | 1.772383 | 9.080962 | 0.460145 | 1.928663 | 0.002901 | 10.348297 | 0.943473 | 0.056527 | ... | 3.038054 | 0.25 | 0.0 | 1.329104 | 1.329104 | 0.548416 | 3.895822 | 4.703259 | ZeroInf | 0.95 |
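Since `criteria` is displayed as a table indexed by gene, ordinary pandas operations should apply to it. The sketch below summarises the `Category` column on a toy frame built from the ten rows shown above; in a live session the same calls would presumably be made on `scbool.criteria` directly, assuming it is a regular pandas DataFrame.

```python
import pandas as pd

# Category labels copied from the ten rows displayed above.
criteria_head = pd.DataFrame(
    {
        "Category": [
            "ZeroInf", "Bimodal", "ZeroInf", "ZeroInf", "Discarded",
            "Bimodal", "Bimodal", "Bimodal", "ZeroInf", "ZeroInf",
        ]
    },
    index=[
        "Clec1b", "Kdm3a", "Coro2b", "8430408G22Rik", "Clec9a",
        "Phf6", "Usp14", "Tmem167b", "Kbtbd7", "Rag2",
    ],
)

# Number of genes assigned to each category.
category_counts = criteria_head["Category"].value_counts()

# Names of the genes classified as bimodal.
bimodal_genes = criteria_head.index[criteria_head["Category"] == "Bimodal"].tolist()

print(category_counts)
print(bimodal_genes)
```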


The actual binarization is performed by the `binarize()` method. It takes as argument the dataset to binarize, which can be identical to the reference dataset used to learn the criteria:

In [7]:
%time nestorowa_binarized = scbool.binarize(nestorowa)

CPU times: user 280 ms, sys: 199 ms, total: 479 ms
Wall time: 3.66 s


The method returns a pandas DataFrame in which the RNA counts are replaced with `0`, `1`, or `NaN`:

In [8]:
nestorowa_binarized.head()

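| | Clec1b | Kdm3a | Coro2b | 8430408G22Rik | Phf6 | Usp14 | Tmem167b | Kbtbd7 | Rag2 | Hmgcs1 | ... | Fzd8 | Zfp438 | Rab18 | Mzb1 | B4galt6 | Rnf125 | Impact | Taf4b | Zfp521 | Psma8 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| HSPC_025 | NaN | 1.0 | NaN | NaN | 0.0 | 0.0 | 1.0 | NaN | NaN | 1.0 | ... | NaN | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | NaN | NaN | 1.0 |
| HSPC_031 | NaN | 1.0 | NaN | NaN | 0.0 | 0.0 | 0.0 | NaN | NaN | 0.0 | ... | NaN | NaN | 0.0 | NaN | 1.0 | 0.0 | NaN | NaN | NaN | NaN |
| HSPC_037 | NaN | 0.0 | 1.0 | NaN | 0.0 | 1.0 | 0.0 | NaN | NaN | 1.0 | ... | NaN | 1.0 | 1.0 | NaN | 0.0 | 0.0 | 1.0 | NaN | NaN | NaN |
| LT-HSC_001 | NaN | 0.0 | 1.0 | NaN | 1.0 | 0.0 | 0.0 | NaN | NaN | 1.0 | ... | NaN | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | NaN | 1.0 | NaN |
| HSPC_001 | NaN | 0.0 | 1.0 | NaN | 1.0 | 0.0 | 0.0 | 1.0 | NaN | 0.0 | ... | NaN | NaN | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | NaN | NaN | NaN |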
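Because the result mixes `0`, `1`, and `NaN`, it can be useful to check how many entries are left undetermined per gene. The sketch below does this with plain pandas on a toy frame whose values are copied from four columns of the output above; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

nan = np.nan

# Four columns copied from the head of the binarized table shown above.
binarized_head = pd.DataFrame(
    {
        "Clec1b": [nan, nan, nan, nan, nan],
        "Kdm3a": [1.0, 1.0, 0.0, 0.0, 0.0],
        "Coro2b": [nan, nan, 1.0, 1.0, 1.0],
        "Phf6": [0.0, 0.0, 0.0, 1.0, 1.0],
    },
    index=["HSPC_025", "HSPC_031", "HSPC_037", "LT-HSC_001", "HSPC_001"],
)

# Fraction of cells for which each gene has no Boolean value.
undetermined_fraction = binarized_head.isna().mean()

# Fraction of cells in state 1 among the cells with a determined value
# (NaN entries are skipped; a gene with no determined value yields NaN).
active_fraction = binarized_head.mean()

print(undetermined_fraction)
print(active_fraction)
```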


### Synthetic generation

Synthetic generation starts from fully determined Boolean states, given as a pandas DataFrame.

For this example, we simply reuse the binarized data, where we replace the `NaN` entries with random Boolean values:

In [4]:
from scboolseq.simulation import random_nan_binariser
fully_bin = random_nan_binariser(nestorowa_binarized)
fully_bin.head()

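What replacing `NaN` entries with random Boolean values amounts to can be sketched in plain pandas and NumPy. This is an illustration only, not the implementation of `random_nan_binariser`; the function name `fill_nan_with_random_booleans`, its `seed` parameter, and the toy table are hypothetical.

```python
import numpy as np
import pandas as pd


def fill_nan_with_random_booleans(frame: pd.DataFrame, seed: int = 0) -> pd.DataFrame:
    """Return a copy of `frame` where each NaN is replaced by a random 0.0 or 1.0."""
    rng = np.random.default_rng(seed)
    # One random 0/1 draw per cell of the table, aligned on the same labels.
    random_bits = pd.DataFrame(
        rng.integers(0, 2, size=frame.shape).astype(float),
        index=frame.index,
        columns=frame.columns,
    )
    # Keep determined entries; use the random draw only where the value is missing.
    return frame.where(frame.notna(), random_bits)


# Toy partially determined Boolean table (values are made up).
toy = pd.DataFrame(
    {"geneA": [1.0, np.nan, 0.0], "geneB": [np.nan, np.nan, 1.0]},
    index=["cell_1", "cell_2", "cell_3"],
)
filled = fill_nan_with_random_booleans(toy, seed=42)
print(filled)
```

Fixing the seed makes the filled table reproducible, which matters when the result is used as input for simulation.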

To keep the example small, we generate synthetic RNA-Seq data only for a subset of genes and cells:

In [10]:
to_simulate = fully_bin.iloc[1:15, 4:10]
to_simulate

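| | Phf6 | Usp14 | Tmem167b | Kbtbd7 | Rag2 | Hmgcs1 |
| --- | --- | --- | --- | --- | --- | --- |
| HSPC_031 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| HSPC_037 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 |
| LT-HSC_001 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| HSPC_001 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| HSPC_008 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| HSPC_014 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| HSPC_020 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| HSPC_026 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| HSPC_038 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| LT-HSC_002 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 |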


In [8]:
simulated1 = scbool.sample_counts(to_simulate, n_samples_per_state=3, seed=1234)
simulated1.head()
