# tutorial 1

NSCLC PBMCs Single Cell RNA-Seq (Fig. 2a,b):
* This example builds a signature matrix from single-cell RNA sequencing data of NSCLC PBMCs, then uses it to estimate the proportions of the different cell types in an RNA-seq dataset profiled from whole blood, with S-mode batch correction applied.


# example 1: generate signature matrix

```
docker run \
    -v absolute/path/to/input/dir:/src/data \
    -v absolute/path/to/output/dir:/src/outdir \
    cibersortx/fractions \
    --username email_address_registered_on_CIBERSORTx_website \
    --token token_obtained_from_CIBERSORTx_website \
    --single_cell TRUE \
    --refsample Fig2ab-NSCLC_PBMCs_scRNAseq_refsample.txt \
    --mixture Fig2b-WholeBlood_RNAseq.txt \
    --fraction 0 \
    --rmbatchSmode TRUE 
```
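
The same invocation can be assembled from Python, which makes it easier to swap input files between runs. The sketch below is a minimal illustration, not part of CIBERSORTx: the helper name `build_csx_fractions_command` is hypothetical, the flags are copied from the command above, and the host directories are the ones that appear later in this notebook.

```python
import shlex


def build_csx_fractions_command(input_dir, output_dir, username, token,
                                refsample, mixture):
    """Return the docker argv list for the single-cell, S-mode run shown above."""
    return [
        "docker", "run",
        "-v", f"{input_dir}:/src/data",      # host input dir mounted into the container
        "-v", f"{output_dir}:/src/outdir",   # host output dir mounted into the container
        "cibersortx/fractions",
        "--username", username,
        "--token", token,
        "--single_cell", "TRUE",
        "--refsample", refsample,
        "--mixture", mixture,
        "--fraction", "0",
        "--rmbatchSmode", "TRUE",
    ]


cmd = build_csx_fractions_command(
    input_dir="/home/jupyter/csx/input",
    output_dir="/home/jupyter/csx/output",
    username="email_address_registered_on_CIBERSORTx_website",  # placeholder
    token="token_obtained_from_CIBERSORTx_website",              # placeholder
    refsample="Fig2ab-NSCLC_PBMCs_scRNAseq_refsample.txt",
    mixture="Fig2b-WholeBlood_RNAseq.txt",
)

# Print a shell-quoted version for inspection; nothing is executed here.
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` would launch the container, assuming Docker and the `cibersortx/fractions` image are available on the machine.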

## set up logging

In [1]:
import logging

In [2]:
logging.basicConfig()

## download data

In [3]:
%%bash

pushd /mnt/liulab/csx_example_files/

export BASE_URL="https://cibersortx.stanford.edu/inc/inc.download.page.handler.php"
# Uncomment to download; the URL is quoted so the shell does not interpret `?`.
# curl -O -J -L "${BASE_URL}?file=NSCLC_PBMCs_Single_Cell_RNA-Seq_Fig2ab.zip"
# unzip NSCLC_PBMCs_Single_Cell_RNA-Seq_Fig2ab.zip
# curl -O -J -L "${BASE_URL}?file=RNA-Seq_mixture_melanoma_Tirosh_Fig2b-d.txt"

tree -h

popd

/mnt/liulab/csx_example_files ~/deconv-data-exploration
.
├── [   0]  Expression_datasets
│   ├── [ 52M]  Fig2a-NSCLC_PBMCs_scRNAseq_matrix.txt
│   ├── [4.1M]  Fig2b-WholeBlood_RNAseq.txt
│   ├── [ 835]  Fig2b_ground_truth_whole_blood.txt
│   ├── [1.0M]  Fig3b-f-FL-arrays-groundtruth.RMA.txt
│   ├── [ 67M]  Fig3b-f-FL-arrays-mixture.txt
│   ├── [ 36M]  Fig3g_NSCLC_RNASeq_bulksortedpopulation.txt
│   ├── [1.8M]  Fig3g_groundtruth_NSCLCsubsets_Fig3g.txt
│   ├── [8.2M]  Fig3g_mixture_NSCLCbulk.txt
│   └── [2.1K]  README.txt
├── [   0]  Fig2ab-NSCLC_PBMCs
│   ├── [ 52M]  Fig2ab-NSCLC_PBMCs_scRNAseq_refsample.txt
│   ├── [186K]  Fig2ab-NSCLC_PBMCs_scRNAseq_sigmatrix.txt
│   └── [4.1M]  Fig2b-WholeBlood_RNAseq.txt
├── [ 835]  Fig2b_ground_truth_whole_blood.txt
├── [143K]  LM22.txt
├── [ 12M]  NSCLC_PBMCs_Single_Cell_RNA-Seq_Fig2ab.zip
├── [6.0M]  RNA-Seq_mixture_melanoma_Tirosh_Fig2b-d.txt
├── [   0]  Single_Cell_RNA-Seq_Melanoma_SuppFig_3b-d
│   ├── [6.0M]  mixture_melanoma_Tirosh_SuppFig_3

### read data into dataframes

In [4]:
import pandas as pd

logging.getLogger('pandas').setLevel('DEBUG')

In [5]:
path = (
    "/mnt/liulab/csx_example_files/Fig2ab-NSCLC_PBMCs/"
    "Fig2ab-NSCLC_PBMCs_scRNAseq_refsample.txt"
)

nsclc_pbmc_sc = pd.read_csv(
    path,
    sep='\t',
    index_col=0
)

nsclc_pbmc_sc

gene,T cells CD8,T cells CD8.1,T cells CD8.2,Monocytes,Monocytes.1,T cells CD4,T cells CD8.3,Monocytes.2,Monocytes.3,Monocytes.4,...,T cells CD8.233,T cells CD8.234,NKT cells.80,Monocytes.454,Monocytes.455,Monocytes.456,Monocytes.457,NKT cells.81,T cells CD8.235,Monocytes.458
RP11.34P13.7,0.0,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
AL627309.1,0.0,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
AP006222.2,0.0,0.0,0.0,0.0,0.0,0.0,216.59086,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
RP4.669L17.10,0.0,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
RP5.857K21.3,0.0,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
AC011841.1,0.0,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
AL354822.1,0.0,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
KIR2DL2,0.0,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
PNRC2.1,0.0,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
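
The column headers above are cell-type labels, one per cell. Because the labels repeat, `pd.read_csv` de-duplicates them by appending `.1`, `.2`, and so on. A minimal sketch of recovering the original labels and counting cells per type follows; the helper names are hypothetical, and the approach assumes no genuine label ends in a dot followed by digits.

```python
import re
from collections import Counter


def base_label(column_name):
    """Strip the '.<n>' suffix that pandas appends to duplicated column names."""
    return re.sub(r"\.\d+$", "", column_name)


def count_cell_types(column_names):
    """Count how many columns (cells) carry each cell-type label."""
    return Counter(base_label(name) for name in column_names)


# A few column names copied from the output above.
columns = [
    "T cells CD8", "T cells CD8.1", "Monocytes",
    "Monocytes.1", "Monocytes.2", "NKT cells.80",
]
counts = count_cell_types(columns)
print(counts.most_common())
```

On the real data, `count_cell_types(nsclc_pbmc_sc.columns)` would give the number of single cells available for each cell type.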


In [6]:
nsclc_pbmc_sc.sum(axis=0).sort_values()

T cells CD4.127    1000000.0
Monocytes.175      1000000.0
Monocytes.295      1000000.0
T cells CD4.82     1000000.0
T cells CD8.6      1000000.0
                     ...    
Monocytes.319      1000000.0
Monocytes.205      1000000.0
T cells CD8.81     1000000.0
Monocytes.81       1000000.0
Monocytes.230      1000000.0
Length: 1054, dtype: float64
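
Every column sums to 1,000,000 in the output above, which suggests the reference profiles are already scaled to a per-million total. A small sketch for checking this programmatically, and for rescaling a matrix that is not yet on that scale, is shown below. The helper names and the toy matrix are illustrative assumptions, not part of the dataset.

```python
import numpy as np
import pandas as pd


def is_scaled_to(expr, target=1e6, rtol=1e-6):
    """Return True if every column (cell) of a genes x cells frame sums to target."""
    return bool(np.allclose(expr.sum(axis=0), target, rtol=rtol))


def scale_columns(expr, target=1e6):
    """Rescale each column so that it sums to target (per-million style)."""
    # Dividing a DataFrame by a Series aligns on column labels.
    return expr / expr.sum(axis=0) * target


# Toy genes x cells matrix; the values are made up for illustration.
toy = pd.DataFrame(
    {"cell_a": [1.0, 3.0], "cell_b": [2.0, 2.0]},
    index=["g1", "g2"],
)
scaled = scale_columns(toy)
print(is_scaled_to(toy), is_scaled_to(scaled))  # prints: False True
```

A column whose total is zero would become NaN after rescaling, so such cells would need to be dropped first.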

In [8]:
path = (
    "/mnt/liulab/csx_example_files/Fig2ab-NSCLC_PBMCs/"
    "Fig2b-WholeBlood_RNAseq.txt"
)

nsclc_wholeblood_mixtures = pd.read_csv(
    path,
    sep='\t',
    index_col=0
)

nsclc_wholeblood_mixtures

GeneSym,W070517001156,W070517001157,W070517001159,W070517001160,W070517001161,W070517001162,W070517102034,W070517102035,W070517102036,W070517102037,W070517102038,W070517102051
5_8S_rRNA,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
5S_rRNA,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,2.304710,0.000000,0.000000,0.752697,0.000000
7SK,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.181943,0.000000,0.000000
A1BG,1.524589,1.198209,2.281101,2.510963,1.752686,3.467098,2.523853,1.634724,2.687471,3.385051,2.195180,1.912779
A1BG-AS1,0.210020,0.263073,0.410865,0.571484,0.139725,0.142219,0.348219,0.294046,0.732450,0.595088,0.424970,0.272239
...,...,...,...,...,...,...,...,...,...,...,...,...
ZYG11B,18.753000,10.084024,8.159590,12.489620,5.222887,6.192270,7.825120,12.366960,7.205970,7.896432,9.496550,8.637130
ZYX,200.613353,140.107566,144.816461,134.412477,81.341464,107.785758,62.656594,265.309460,88.768774,94.147450,194.531694,127.203111
ZYXP1,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
ZZEF1,76.731847,61.515861,51.289958,80.723373,48.363048,44.139845,51.296040,63.324805,40.328860,49.242034,73.869547,62.659012
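
Deconvolution only uses genes present in both the reference and the mixture, so it can help to check the overlap of the two gene indexes before running anything. The sketch below uses toy frames with a hypothetical helper name; note that the reference output above shows symbols written with dots (for example `RP11.34P13.7`) while the mixture uses hyphens (for example `A1BG-AS1`), so some symbols may fail to match without harmonization.

```python
import pandas as pd


def shared_genes(reference, mixture):
    """Return the gene symbols present in both row indexes."""
    return reference.index.intersection(mixture.index)


# Toy frames; gene names are taken from the outputs above, values are made up.
ref = pd.DataFrame(
    {"T cells CD8": [0.0, 5.0, 1.0]},
    index=["A1BG", "KIR2DL2", "ZYX"],
)
mix = pd.DataFrame(
    {"W070517001156": [1.5, 200.6, 76.7]},
    index=["A1BG", "ZYX", "ZZEF1"],
)

genes = shared_genes(ref, mix)
print(f"{len(genes)} of {len(ref)} reference genes are also in the mixture")
```

On the real data this would be `shared_genes(nsclc_pbmc_sc, nsclc_wholeblood_mixtures)`.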


## run csx with docker

In [9]:
!ls -l /mnt/liulab/csx_example_files

total 18239
drwxr-xr-x 1 jupyter jupyter        0 Jul 13 14:47 Expression_datasets
drwxr-xr-x 1 jupyter jupyter        0 Jul 13 14:47 Fig2ab-NSCLC_PBMCs
-rw-r--r-- 1 jupyter jupyter      835 Jul  2 21:48 Fig2b_ground_truth_whole_blood.txt
-rw-r--r-- 1 jupyter jupyter   146759 Jul  3 04:39 LM22.txt
-rw-r--r-- 1 jupyter jupyter 12259563 Jul 13 08:06 NSCLC_PBMCs_Single_Cell_RNA-Seq_Fig2ab.zip
-rw-r--r-- 1 jupyter jupyter  6264562 Jul 13 08:39 RNA-Seq_mixture_melanoma_Tirosh_Fig2b-d.txt
drwxr-xr-x 1 jupyter jupyter        0 Jul 13 14:47 Single_Cell_RNA-Seq_Melanoma_SuppFig_3b-d
-rw-r--r-- 1 jupyter jupyter     1974 Jul  2 21:48 groundtruth_HNSCC_Puram_et_al_Fig2cd.txt
-rw-r--r-- 1 jupyter jupyter     1216 Jul  2 21:48 groundtruth_Melanoma_Tirosh_et_al_SuppFig3b-d.txt


In [24]:
!./run_csx_fractions.sh

created directory /home/jupyter/csx/input/mixture.txt
Fig2b-WholeBlood_RNAseq.txt

sent 4,351,886 bytes  received 93 bytes  8,703,958.00 bytes/sec
total size is 8,701,450  speedup is 2.00
Fig2ab-NSCLC_PBMCs_scRNAseq_refsample.txt

sent 54,724,713 bytes  received 35 bytes  109,449,496.00 bytes/sec
total size is 54,711,251  speedup is 1.00
total 53M
drwxr-xr-x 2 jupyter jupyter 4.0K Jul 13 16:58 mixture.txt
-rw-r--r-- 1 jupyter jupyter  53M Jul 13 16:58 refsample.txt
>Running CIBERSORTxFractions...
>[Options] username: email_address_registered_on_CIBERSORTx_website
>[Options] token: token_obtained_from_CIBERSORTx_website
>[Options] single_cell: TRUE
>[Options] refsample: refsample.txt
>[Options] mixture: mixture.txt
>[Options] rmbatchSmode: TRUE
>[Options] verbose: TRUE
>Making reference sample file.
>Making phenotype class file.
>single_cell is set to TRUE, so quantile normalization is set to FALSE, and the default parameters for building the signature matrix have been set to the following values:
	- G.min <- 300
	- 
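
In the listing above, `mixture.txt` appears as a directory (`drwxr-xr-x`) while `refsample.txt` is a regular file. The contents of `run_csx_fractions.sh` are not shown, so the cause cannot be confirmed here. One way to stage inputs so that the destination is always a file is sketched below; the helper name `stage_file` is hypothetical and the example runs against a temporary directory rather than the real paths.

```python
import shutil
import tempfile
from pathlib import Path


def stage_file(source, dest):
    """Copy source to the exact file path dest, creating parent directories."""
    dest = Path(dest)
    dest.parent.mkdir(parents=True, exist_ok=True)
    if dest.is_dir():
        # Refuse to silently copy into a directory that has a file-like name.
        raise IsADirectoryError(f"{dest} is a directory, expected a file path")
    shutil.copyfile(source, dest)
    return dest


with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for the real mixture file; the content is a two-line toy table.
    src = Path(tmp) / "Fig2b-WholeBlood_RNAseq.txt"
    src.write_text("GeneSym\tW070517001156\nA1BG\t1.524589\n")

    staged = stage_file(src, Path(tmp) / "input" / "mixture.txt")
    staged_is_file = staged.is_file()
    print(staged.name, staged_is_file)
```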

In [20]:
!ls -hlt /home/jupyter/csx/output

total 16M
-rw-r--r-- 1 jupyter jupyter 2.5K Jul 13 16:39 CIBERSORTx_Adjusted.txt
-rw-r--r-- 1 jupyter jupyter 234K Jul 13 16:39 CIBERSORTx_sigmatrix_Adjusted.txt
-rw-r--r-- 1 jupyter jupyter 3.2M Jul 13 16:39 CIBERSORTx_Mixtures_Adjusted.txt
-rw-r--r-- 1 jupyter jupyter  84K Jul 13 16:38 CIBERSORTx_Fig2ab-NSCLC_PBMCs_scRNAseq_refsample_inferred_phenoclasses.CIBERSORTx_Fig2ab-NSCLC_PBMCs_scRNAseq_refsample_inferred_refsample.bm.K999.pdf
-rw-r--r-- 1 jupyter jupyter 229K Jul 13 16:38 CIBERSORTx_Fig2ab-NSCLC_PBMCs_scRNAseq_refsample_inferred_phenoclasses.CIBERSORTx_Fig2ab-NSCLC_PBMCs_scRNAseq_refsample_inferred_refsample.bm.K999.txt
-rw-r--r-- 1 jupyter jupyter 2.1M Jul 13 16:38 CIBERSORTx_cell_type_sourceGEP.txt
-rw-r--r-- 1 jupyter jupyter 9.7M Jul 13 16:38 CIBERSORTx_Fig2ab-NSCLC_PBMCs_scRNAseq_refsample_inferred_refsample.txt
-rw-r--r-- 1 jupyter jupyter  421 Jul 13 16:38 CIBERSORTx_Fig2ab-NSCLC_PBMCs_scRNAseq_refsample_inferred_phenoclasses.txt


In [None]:
path = "/home/jupyter/csx/output/CIBERSORTx_sigmatrix_Adjusted.txt"

learned_sigmatrix = pd.read_csv(
    path,
    sep='\t',
    index_col=0
)

In [None]:
learned_sigmatrix

In [None]:
# expression profile of one whole-blood mixture sample
nsclc_wholeblood_mixtures['W070517001156']

In [None]:
pd.merge(learned_sigmatrix, nsclc_wholeblood_mixtures['W070517001156'], left_index=True, right_index=True)

# attempt to infer fractions myself from the signature matrix and a mixture

In [None]:
from sklearn.svm import NuSVR
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

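In [None]:
# keep only the genes shared by the signature matrix and one whole-blood sample
_combined_data = pd.merge(learned_sigmatrix, nsclc_wholeblood_mixtures['W070517001156'], left_index=True, right_index=True)
y = _combined_data.values[:, -1]   # mixture expression (last column)
X = _combined_data.values[:, :-1]  # signature matrix, one column per cell type
y.shape, X.shape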

In [None]:
regr = make_pipeline(StandardScaler(), NuSVR(kernel='linear'))
regr.fit(X, y)

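In [None]:
import numpy as np

# a linear-kernel NuSVR exposes one weight per signature-matrix column (cell type)
weights = regr.named_steps['nusvr'].coef_.ravel()
weights = np.clip(weights, 0, None)  # CIBERSORT convention: zero out negative weights
weights / weights.sum()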
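
To sanity-check the approach without the real files, the same idea can be run on synthetic data where the true fractions are known. The sketch below is an assumption-laden illustration, not the CIBERSORTx implementation: the signature matrix is random, the mixture is noise-free, both inputs are standardized to zero mean and unit variance, and `nu` and `C` are simply the scikit-learn defaults written out explicitly.

```python
import numpy as np
from sklearn.svm import NuSVR

rng = np.random.default_rng(0)

# Synthetic signature matrix: 200 genes x 3 cell types (made-up values).
n_genes, n_types = 200, 3
signature = rng.exponential(scale=100.0, size=(n_genes, n_types))

# Known ground-truth fractions and the resulting noise-free mixture profile.
true_fractions = np.array([0.6, 0.3, 0.1])
mixture = signature @ true_fractions


def zscore(a):
    """Standardize an array to zero mean and unit variance over all entries."""
    return (a - a.mean()) / a.std()


# Regress the mixture on the signature columns with a linear-kernel nu-SVR.
model = NuSVR(kernel="linear", nu=0.5, C=1.0)
model.fit(zscore(signature), zscore(mixture))

# One weight per cell type; drop negative weights, then normalize to sum to 1.
weights = np.clip(model.coef_.ravel(), 0, None)
estimated = weights / weights.sum()
print(np.round(estimated, 2))
```

Comparing `estimated` with `true_fractions` gives a rough sense of how much the regularization distorts the recovered proportions before trying the real signature matrix.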

# check fractions inferred by csx

In [None]:
!find /home/jupyter/csx/output -name '*txt'

In [None]:
path = "/home/jupyter/csx/output/CIBERSORTx_Adjusted.txt"

# rows are indexed by mixture sample name, so select a sample by its label
pd.read_csv(
    path,
    sep='\t',
    index_col=0
).loc['W070517001156']
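
The results table may mix cell-type fractions with per-sample fit statistics. The sketch below assumes the statistics columns are named `P-value`, `Correlation`, and `RMSE`; that naming is an assumption to verify against the actual file. It separates the fractions so they can be compared with other estimates. The helper name and the toy table values are hypothetical.

```python
import io

import pandas as pd

# Assumed names of the non-fraction columns; check them against the real output.
STAT_COLUMNS = ["P-value", "Correlation", "RMSE"]


def cell_type_fractions(results):
    """Return only the cell-type fraction columns of a results table."""
    return results.drop(columns=STAT_COLUMNS, errors="ignore")


# Toy stand-in for the results file; the numbers are made up.
toy_tsv = (
    "Mixture\tMonocytes\tT cells CD8\tP-value\tCorrelation\tRMSE\n"
    "W070517001156\t0.7\t0.3\t0.0\t0.9\t0.5\n"
)
results = pd.read_csv(io.StringIO(toy_tsv), sep="\t", index_col=0)

fractions = cell_type_fractions(results)
print(fractions.loc["W070517001156"])
```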

# extra

In [None]:
pd.read_csv(
    "/mnt/liulab/csx_example_files/Fig2ab-NSCLC_PBMCs/Fig2ab-NSCLC_PBMCs_scRNAseq_sigmatrix.txt",
    sep='\t',
    index_col=0
)

In [None]:
pd.read_csv(
    "/mnt/liulab/csx_example_files/Fig2ab-NSCLC_PBMCs/Fig2b-WholeBlood_RNAseq.txt",
    sep='\t',
    index_col=0
)