# Run *SDePER* on sequencing-based simulated data: Scenario 1 + Spatial data as reference + NO CVAE

In this notebook we run SDePER on simulated data. For how the **sequencing-based** simulated data were generated via the coarse-graining procedure, please refer to [generate_simulated_spatial_data.nb.html](https://rawcdn.githack.com/az7jh2/SDePER_Analysis/c963d08f74f4591c2ef6f132177795297793d878/Simulation_seq_based/Generate_simulation_data/generate_simulated_spatial_data.nb.html) in the [Simulation_seq_based](https://github.com/az7jh2/SDePER_Analysis/tree/main/Simulation_seq_based) folder.

**Scenario 1** means the reference data for deconvolution includes all single cells with the **matched 12 cell types**.

**Spatial data as reference** means the reference data is the [GSE102827](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE102827) scRNA-seq data that was used to generate the simulated data, so it is **free of platform effect**.

**NO CVAE** means we DO NOT use CVAE to remove platform effect, since the reference data is free of platform effect here.

==================================================================================================================

We therefore use the **4 input files** listed below:

1. raw nUMI counts of simulated spatial transcriptomic data (spots × genes): [sim_seq_based_spatial_spot_nUMI.csv](https://github.com/az7jh2/SDePER_Analysis/blob/main/Simulation_seq_based/Generate_simulation_data/sim_seq_based_spatial_spot_nUMI.csv)
2. raw nUMI counts of reference GSE102827 scRNA-seq data (cells × genes): `GSE102827_scRNA_cell_nUMI.csv`. Since the csv file of the raw nUMI matrix of all 65,539 cells and 25,187 genes is up to 3.1 GB, we do not provide it in our repository. It is just a **matrix transpose** of [GSE102827_merged_all_raw.csv.gz](https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE102827&format=file&file=GSE102827%5Fmerged%5Fall%5Fraw%2Ecsv%2Egz) in [GSE102827](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE102827), made to satisfy the file format requirement that rows are cells and columns are genes.
3. cell type annotations for selected 2,002 cells used for simulated data generation in [GSE102827](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE102827) scRNA-seq data (cells × 1): [GSE102827_scRNA_cell_celltype.csv](https://github.com/az7jh2/SDePER_Analysis/blob/main/Simulation_seq_based/Run_SDePER_on_simulation_data/Scenario_1/ref_spatial/GSE102827_scRNA_cell_celltype.csv)
4. adjacency matrix of spots in simulated spatial transcriptomic data (spots × spots): [sim_spatial_spot_adjacency_matrix.csv](https://github.com/az7jh2/SDePER_Analysis/blob/main/Simulation/Generate_simulation_data/sim_spatial_spot_adjacency_matrix.csv)
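
The matrix transpose mentioned for the reference file can be sketched as below. This is a minimal illustration, not the script used to produce `GSE102827_scRNA_cell_nUMI.csv`: the helper name `transpose_csv` and the toy table are hypothetical, and the sketch assumes the first row and first column of the input hold labels.

```python
import csv
import io


def transpose_csv(src, dst):
    """Read a labelled CSV table from `src` and write its transpose to `dst`.

    Both arguments are open text file objects. The first row and first
    column are treated as labels, so a genes x cells table becomes a
    cells x genes table.
    """
    rows = list(csv.reader(src))
    # zip(*rows) turns every input column into an output row; the corner
    # label stays in the top-left position.
    csv.writer(dst, lineterminator="\n").writerows(zip(*rows))


# Toy genes x cells table standing in for the GEO file (values are made up).
genes_by_cells = io.StringIO(",cell1,cell2,cell3\ngeneA,0,2,1\ngeneB,5,0,0\n")
cells_by_genes = io.StringIO()
transpose_csv(genes_by_cells, cells_by_genes)
print(cells_by_genes.getvalue())
```

For the real file, `src` could be opened with `gzip.open(path, "rt", newline="")`. The sketch holds the whole table in memory, so for the full 65,539 × 25,187 matrix it needs a machine with plenty of RAM, and an array-based tool may be more practical.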

==================================================================================================================

SDePER settings are:

* number of selected TOP marker genes for each comparison in differential analysis `n_marker_per_cmp`: 20
* number of used CPU cores `n_core`: 64
* initial learning rate for training CVAE `cvae_init_lr`: 0.003
* number of hidden layers in encoder and decoder of CVAE `num_hidden_layer`: 1
* whether to use Batch Normalization `use_batch_norm`: false
* CVAE training epochs `cvae_train_epoch`: 1000
* for diagnostic purposes set `diagnosis`: true
* **whether to use CVAE to remove platform effect `use_cvae`: false**

ALL other options are left as default.

==================================================================================================================

The `bash` command to start cell type deconvolution is:

`runDeconvolution -q sim_seq_based_spatial_spot_nUMI.csv -r GSE102827_scRNA_cell_nUMI.csv -c GSE102827_scRNA_cell_celltype.csv -a sim_spatial_spot_adjacency_matrix.csv --n_marker_per_cmp 20 -n 64 --cvae_init_lr 0.003 --num_hidden_layer 1 --use_batch_norm false --cvae_train_epoch 1000 --diagnosis true --use_cvae false`
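
As an illustration of how the settings list maps onto this command line, the sketch below assembles the same command from two dictionaries. Only flags and values that appear in the command above are used; the names `inputs`, `settings` and `build_args` are hypothetical helpers for this example.

```python
import shlex

# Input files, keyed by the short flags used in the command above.
inputs = {
    "-q": "sim_seq_based_spatial_spot_nUMI.csv",
    "-r": "GSE102827_scRNA_cell_nUMI.csv",
    "-c": "GSE102827_scRNA_cell_celltype.csv",
    "-a": "sim_spatial_spot_adjacency_matrix.csv",
}

# Non-default settings, in the order they appear in the command above.
settings = {
    "--n_marker_per_cmp": 20,
    "-n": 64,
    "--cvae_init_lr": 0.003,
    "--num_hidden_layer": 1,
    "--use_batch_norm": "false",
    "--cvae_train_epoch": 1000,
    "--diagnosis": "true",
    "--use_cvae": "false",
}


def build_args(inputs, settings):
    """Return the command as an argument list: program name, then flag/value pairs."""
    args = ["runDeconvolution"]
    for flag, value in {**inputs, **settings}.items():
        args += [flag, str(value)]
    return args


args = build_args(inputs, settings)
# shlex.join renders the list as a single shell-safe command string.
print(shlex.join(args))
```

The resulting list can be passed directly to `subprocess.run(args, check=True)`, which avoids the need for `shell=True`.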

Note that this notebook uses **SDePER v1.2.1**. The cell type deconvolution result is renamed to [S1_ref_spatial_SDePER_NO_CVAE_celltype_proportions.csv](https://github.com/az7jh2/SDePER_Analysis/blob/main/Simulation_seq_based/Run_SDePER_on_simulation_data/Scenario_1/ref_spatial/S1_ref_spatial_SDePER_NO_CVAE_celltype_proportions.csv).
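
The renaming step could look like the sketch below. The source name `celltype_proportions.csv` is taken from the run log in this notebook; the helper `rename_result` and the temporary directory used for the demonstration are assumptions made for this example.

```python
import tempfile
from pathlib import Path

DEFAULT_NAME = "celltype_proportions.csv"  # output name reported in the run log
RENAMED = "S1_ref_spatial_SDePER_NO_CVAE_celltype_proportions.csv"


def rename_result(workdir, src=DEFAULT_NAME, dst=RENAMED):
    """Rename the deconvolution result inside `workdir` and return the new path."""
    source = Path(workdir) / src
    target = Path(workdir) / dst
    if not source.is_file():
        raise FileNotFoundError(f"expected SDePER output at {source}")
    if target.exists():
        # Avoid silently replacing an earlier result.
        raise FileExistsError(f"refusing to overwrite {target}")
    source.rename(target)
    return target


# Demonstration on a placeholder file in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / DEFAULT_NAME).write_text("placeholder\n")
    rename_result(tmp)
    demo_names = sorted(p.name for p in Path(tmp).iterdir())
print(demo_names)
```

In a real run, `workdir` would be the directory in which `runDeconvolution` was started.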

In [1]:
import subprocess

cmd = '''runDeconvolution -q sim_seq_based_spatial_spot_nUMI.csv \
                          -r GSE102827_scRNA_cell_nUMI.csv \
                          -c GSE102827_scRNA_cell_celltype.csv \
                          -a sim_spatial_spot_adjacency_matrix.csv \
                          --n_marker_per_cmp 20 \
                          -n 64 \
                          --cvae_init_lr 0.003 \
                          --num_hidden_layer 1 \
                          --use_batch_norm false \
                          --cvae_train_epoch 1000 \
                          --diagnosis true \
                          --use_cvae false
'''

subprocess.run(cmd, check=True, text=True, shell=True)


SDePER (Spatial Deconvolution method with Platform Effect Removal) v1.2.1


running options:
spatial_file: /home/exouser/Spatial/sim_seq_based_spatial_spot_nUMI.csv
ref_file: /home/exouser/Spatial/GSE102827_scRNA_cell_nUMI.csv
ref_celltype_file: /home/exouser/Spatial/GSE102827_scRNA_cell_celltype.csv
marker_file: None
loc_file: None
A_file: /home/exouser/Spatial/sim_spatial_spot_adjacency_matrix.csv
n_cores: 64
threshold: 0
use_cvae: False
use_imputation: False
diagnosis: True
verbose: True
use_fdr: True
p_val_cutoff: 0.05
fc_cutoff: 1.2
pct1_cutoff: 0.3
pct2_cutoff: 0.1
sortby_fc: True
n_marker_per_cmp: 20
filter_cell: True
filter_gene: True
n_hv_gene: 200
n_pseudo_spot: 500000
pseudo_spot_min_cell: 2
pseudo_spot_max_cell: 8
seq_depth_scaler: 10000
cvae_input_scaler: 10
cvae_init_lr: 0.003
num_hidden_layer: 1
use_batch_norm: False
cvae_train_epoch: 1000
use_spatial_pseudo: False
redo_de: True
seed: 383
lambda_r: [0.1, 0.268, 0.72, 1.931, 5.179, 13.895, 37.276, 100.0]
lambda_g: [0.1, 

CompletedProcess(args='runDeconvolution -q sim_seq_based_spatial_spot_nUMI.csv                           -r GSE102827_scRNA_cell_nUMI.csv                           -c GSE102827_scRNA_cell_celltype.csv                           -a sim_spatial_spot_adjacency_matrix.csv                           --n_marker_per_cmp 20                           -n 64                           --cvae_init_lr 0.003                           --num_hidden_layer 1                           --use_batch_norm false                           --cvae_train_epoch 1000                           --diagnosis true                           --use_cvae false\n', returncode=0)

Epoch 981/1000
Epoch 982/1000
Epoch 983/1000
Epoch 984/1000
Epoch 985/1000
Epoch 986/1000
Epoch 987/1000
Epoch 988/1000
Epoch 989/1000
Epoch 990/1000
Epoch 991/1000
Epoch 992/1000
Epoch 993/1000
Epoch 994/1000
Epoch 995/1000
Epoch 996/1000
Epoch 997/1000
Epoch 998/1000
Epoch 999/1000
Epoch 1000/1000

training finished in 1000 epochs (reach max pre-specified epoches), transform data to adjust the platform effect...


re-run DE on CVAE transformed scRNA-seq data!
Differential analysis across cell-types on scRNA-seq data...
finally selected 390 cell-type marker genes


platform effect adjustment by CVAE finished. Elapsed time: 87.01 minutes.


use the marker genes derived from CVAE transformed scRNA-seq for downstream regression!

gene filtering before modeling...
all genes passed filtering

spot filtering before modeling...
all spots passed filtering


######### Start GLRM modeling... #########

GLRM settings:
use SciPy minimize method:  L-BFGS-B
global optimization turned off, local min

    20 |      0.534 |     30.631 |      0.149 |      0.624 |     128.00 |     128.00 |    6.371 |    0.000 |    0.003 |   0.007661 |   0.003831
    21 |      0.472 |     26.639 |      0.145 |      0.661 |     128.00 |     256.00 |    6.270 |    0.000 |    0.003 |   0.006763 |   0.003381
    22 |      0.406 |     33.354 |      0.151 |      0.729 |     256.00 |     256.00 |    6.450 |    0.000 |    0.003 |   0.005786 |   0.002893
    23 |      0.346 |     37.035 |      0.155 |      0.788 |     256.00 |     256.00 |    6.040 |    0.000 |    0.003 |   0.004807 |   0.002404
    24 |      0.307 |     33.043 |      0.151 |      0.836 |     256.00 |     512.00 |    6.161 |    0.000 |    0.003 |   0.004245 |   0.002123
    25 |      0.259 |     43.366 |      0.161 |      0.924 |     512.00 |     512.00 |    6.020 |    0.000 |    0.003 |   0.003569 |   0.001784
    26 |      0.215 |     50.401 |      0.168 |      0.995 |     512.00 |     512.00 |    5.566 |    0.000 |    0.003 |   0.002888 |   0

    23 |      0.014 |     79.796 |      0.198 |      0.321 |     128.00 |     128.00 |    2.524 |    0.000 |    0.004 |   0.000205 |   0.000105
    24 |      0.004 |     79.843 |      0.198 |      0.322 |     128.00 |          / |    1.859 |    0.000 |    0.004 |   0.000065 |   0.000035
early stop!
Terminated (optimal) in 25 iterations.
One optimization by ADMM finished. Elapsed time: 1.76 minutes.


stage 2 finished. Elapsed time: 66.06 minutes.

GLRM fitting finished. Elapsed time: 78.22 minutes.


Post-processing estimated cell-type proportion theta...
hard thresholding small theta values with threshold 0


cell type deconvolution finished. Estimate results saved in /home/exouser/Spatial/celltype_proportions.csv. Elapsed time: 2.75 hours.


######### No imputation #########


whole pipeline finished. Total elapsed time: 2.75 hours.


CompletedProcess(args='runDeconvolution -q sim_seq_based_spatial_spot_nUMI.csv                           -r GSE102827_scRNA_cell_nUMI.csv                           -c GSE102827_scRNA_cell_celltype.csv                           -a sim_spatial_spot_adjacency_matrix.csv                           --n_hv_gene 150                           --n_marker_per_cmp 15                           --seed 2                           -n 64\n', returncode=0)