# Workflow of the simulation process

## Logistics of the workflow
![](cnv-figures/workflow_diagram.png)

## Overview of the workflow
To list the available workflows and their options, run:

`sos run workflow/20190717_workflow.ipynb -h`

```
usage: sos run workflow/20190717_workflow.ipynb
               [workflow_name | -t targets] [options] [workflow_options]
  workflow_name:        Single or combined workflows defined in this script
  targets:              One or more targets to generate
  options:              Single-hyphen sos parameters (see "sos run -h" for details)
  workflow_options:     Double-hyphen workflow-specific parameters

Workflows:
  genome_partition
  susie
  varbvs
  fisher
  mcmc
  mcmc_multichain
  sier
  hybrid
  get_hist
  varbvs_wg
  simulate

Global Workflow Options:
  --cwd output (as path)
  --genotype-file data/deletion.X.gz (as path)
  --phenotype-file data/deletion.y.gz (as path)
                        phenotype
  --name ''
  --blocks-file . (as path)
  --iteration 5000 (as int)
                        MCMC number of iterations
  --tune-prop 0.25 (as float)
                        MCMC ...
  --target-accept 0.98 (as float)
                        MCMC ...
  --n-chain 10 (as int)
                        MCMC ...
  --n-core 1 (as int)
                        MCMC ... For some reason on RCC cluster more than 1 cores will not work
                        (stuck)
  --[no-]reparameterize (default to False)
                        MCMC ...
  --prevalence 0.05 (as float)
                        alpha = log(p/1-p) is uniform lower bound: p = prevalence, when
                        prevalence = 0.05, alpha = -2.94 upper bound: p = case proportion, when
                        case = control, alpha = 0 MCMC ...
  --mcmc-seed 999 (as int)
                        MCMC ...
  --hyperparam-file . (as path)
                        Hyper-parameters for MCMC and for Single Effect Regression
  --L 10 (as int)
                        SuSiE number of effects
  --varbvs-wg-pip . (as path)
                        Whole genome PIPs obtained by varbvs using `varbvs_wg` pipeline, used for
                        hybrid pipeline
  --mcmc-walltime '2.5h'
                        cluster job configurations
  --sier-walltime 2h
  --job-size 80 (as int)

Sections
  genome_partition:
    Workflow Options:
      --input-file VAL (as str, required)
                        For simulation: get real deletion/duplication CNV data and its block
                        n_gene_in_block: get_hist: 1, simulate: 20~50, analyze: 1
      --output-files VAL VAL ... (as type, required)
                        output contain 3 files: 1) input data removing columns with all zeros, 2)
                        file containing block start and end matching index in 1), 3) block start
                        and end without reindex
      --n-gene-in-block VAL (as int, required)
                        minimum number of genes in a block for copy model set it to 1 to use
                        "natural blocks" from input data
      --col-index VAL (required)
                        col_index=None: no row names, col_index=0: use first column as row names
  susie_1, varbvs_1, fisher_1, mcmc_1, mcmc_multichain_1, sier_1, hybrid_1, get_hist_1:
  fisher_2:
  susie_2, varbvs_2, mcmc_2, mcmc_multichain_2, sier_2, hybrid_2:
  hybrid_3:
  mcmc_multichain_3:
  mcmc_3:
  sier_3:
    Workflow Options:
      --expected-effects -9 (as int)
  mcmc_4, mcmc_multichain_4, sier_4, hybrid_4:
  susie_3:
    Workflow Options:
      --estimate-prior-method simple
      --check-null-threshold 0.1 (as float)
  susie_4, varbvs_4:
  varbvs_3:
  varbvs_wg:
    Workflow Options:
      --maximum-prior-inclusion 0.0 (as float)
      --Rseed 999 (as int)
  get_hist_2:
  simulate_1:
    Workflow Options:
      --n-gene-in-block 30 (as int)
  simulate_2:
  simulate_3:
    Workflow Options:
      --shape 1.4 (as float)
      --scale 0.6 (as float)
      --beta-method normal
      --pi0 0.95 (as float)
      --seed 999 (as int)
  simulate_4:
    Workflow Options:
      --sample-size 100000 (as int)
      --n-batch 200 (as int)
  simulate_5:
  simulate_6:
```
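
The `--prevalence` help text above bounds the intercept `alpha = log(p / (1 - p))` between its value at `p = prevalence` and its value at `p = case proportion`. The two quoted numbers can be checked with a standalone sketch; the helper name `logit` is ours and is not part of the workflow:

```python
import math

def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1 - p))

lower = logit(0.05)  # lower bound: p = prevalence = 0.05
upper = logit(0.5)   # upper bound: p = case proportion when cases = controls
print(round(lower, 2), upper)
```

This gives about -2.94 for the lower bound and 0 for the upper bound, matching the values in the help text.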

## Command

### To obtain simulation results, run the shell script below
```
DATE=$(date +%Y-%m-%d)
output="$DATE-OUTPUT"

#simulate data
sos run workflow/20190717_workflow.ipynb simulate \
    --name simulation \
    --genotype-file data/deletion.X.gz \
    --cwd $output \
    --sample-size 200000 \
    --n-batch 200 \
    --seed 999999

#whole genome analysis using varbvs    
sos run workflow/20190717_workflow.ipynb varbvs_wg \
    --name varbvs --cwd $output \
    --genotype-file $output/deletion.X_b30.simulation.X.gz \
    --phenotype-file $output/deletion.X_b30.simulation.y.gz
    
#Fisher's exact test per gene    
sos run workflow/20190717_workflow.ipynb fisher \
    --name fisher --cwd $output \
    --genotype-file $output/deletion.X_b30.simulation.X.gz \
    --phenotype-file $output/deletion.X_b30.simulation.y.gz
    
#MCMC analysis
sos run workflow/20190717_workflow.ipynb mcmc \
    --name mcmc --cwd $output \
    --genotype-file $output/deletion.X_b30.simulation.X.gz \
    --phenotype-file $output/deletion.X_b30.simulation.y.gz \
    --hyperparam-file $output/deletion.X_b30.simulation.varbvs.hyperparams

#SuSiE analysis
sos run workflow/20190717_workflow.ipynb susie \
    --name susie --cwd $output \
    --genotype-file $output/deletion.X_b30.simulation.X.gz \
    --phenotype-file $output/deletion.X_b30.simulation.y.gz \
    --hyperparam-file $output/deletion.X_b30.simulation.varbvs.hyperparams
    
#Single Effect Regression analysis
sos run workflow/20190717_workflow.ipynb sier \
    --name sier --cwd $output \
    --genotype-file $output/deletion.X_b30.simulation.X.gz \
    --phenotype-file $output/deletion.X_b30.simulation.y.gz \
    --hyperparam-file $output/deletion.X_b30.simulation.varbvs.hyperparams
    
#varbvs analysis
sos run workflow/20190717_workflow.ipynb varbvs \
    --name varbvs --cwd $output \
    --genotype-file $output/deletion.X_b30.simulation.X.gz \
    --phenotype-file $output/deletion.X_b30.simulation.y.gz
   
#Hybrid approach
sos run workflow/20190717_workflow.ipynb hybrid \
    --name hybrid --cwd $output \
    --genotype-file $output/deletion.X_b30.simulation.X.gz \
    --phenotype-file $output/deletion.X_b30.simulation.y.gz \
    --varbvs-wg-pip $output/deletion.X_b30.simulation.varbvs.pip \
    --hyperparam-file $output/deletion.X_b30.simulation.varbvs.hyperparams

#Get histogram
sos run workflow/20190717_workflow.ipynb get_hist \
    --name get_hist --cwd $output
```
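
Every analysis step above reads the simulated genotype and phenotype files from `$output`. The sketch below rebuilds those two file names in Python. The naming pattern (`<genotype stem>_b<n-gene-in-block>.<name>.X.gz` and `.y.gz`) is an assumption inferred from the paths in the commands, not taken from the workflow source, and `simulated_paths` is a hypothetical helper:

```python
from pathlib import PurePosixPath

def simulated_paths(output_dir, genotype_file="data/deletion.X.gz",
                    n_gene_in_block=30, name="simulation"):
    """Guess the simulated file names (assumed pattern, see the note above)."""
    stem = PurePosixPath(genotype_file).stem  # "deletion.X.gz" -> "deletion.X"
    prefix = PurePosixPath(output_dir) / f"{stem}_b{n_gene_in_block}.{name}"
    return {
        "genotype": f"{prefix}.X.gz",
        "phenotype": f"{prefix}.y.gz",
    }

paths = simulated_paths("2019-07-17-OUTPUT")
print(paths["genotype"])
print(paths["phenotype"])
```

If the workflow names its outputs differently, check the files actually written to `$output` before relying on this pattern.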

## Results

### Histogram
![](cnv-figures/hist.png)

### ROC curve

```python
import os
from collections import Counter

import numpy as np
import pandas as pd

# Directory holding the workflow output
path = "/Users/bohan/cnv/2022-08-19-OUTPUT"
cwd = os.path.expanduser(path)
n = 30
```

```python
# Method names, used both in file names and in column names
m1 = "varbvs"
m2 = "susie"
m3 = "mcmc"
m4 = "sier"
m5 = "hybrid"
m6 = "fisher"
```

```python
def get_pip_table():
    """Collect per-gene PIPs from each method, Fisher p-values, and the true effects."""
    prefix = "deletion.X_b30.simulation.y"
    stem = prefix[0:-2]  # drop the trailing ".y"
    varbvs = pd.read_csv(f"{cwd}/{prefix}.{m1}_pip.gz", sep="\t", header=None, usecols=[1], names=[f"pip_{m1}"])
    susie = pd.read_csv(f"{cwd}/{prefix}.{m2}_pip.gz", sep="\t", header=None, usecols=[1], names=[f"pip_{m2}"])
    # The MCMC file also supplies the gene names used for merging below
    mcmc = pd.read_csv(f"{cwd}/{prefix}.{m3}_pip.gz", sep="\t", header=None, names=["gene", f"pip_{m3}"])
    sier = pd.read_csv(f"{cwd}/{prefix}.{m4}_pip.gz", sep="\t", header=None, usecols=[1], names=[f"pip_{m4}"])
    hybrid = pd.read_csv(f"{cwd}/{prefix}.{m5}_pip.gz", sep="\t", header=None, usecols=[1], names=[f"pip_{m5}"])
    fisher = pd.read_csv(f"{cwd}/{stem}.X.cleaned.fisher.gz", sep="\t", header=0)
    beta_all = np.loadtxt(f"{cwd}/{stem}.beta")
    index = pd.read_csv(f"{cwd}/{stem}.X.block_index_original.csv", sep="\t", header=None, names=["start", "end"])
    # Keep only the true effects of genes inside the analyzed blocks (block ends are inclusive)
    betas = list()
    for _, block in index.iterrows():
        betas.extend(list(beta_all[block["start"]:(block["end"] + 1)]))
    pips = pd.concat([mcmc, varbvs, susie, sier, hybrid], axis=1).merge(fisher[["gene", "p"]], on="gene")
    pips["beta"] = betas
    # A gene is a true signal when its simulated effect is nonzero
    pips["is_signal"] = pips["beta"].apply(lambda x: 1 if x != 0 else 0)
    return pips
```
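
The loop in `get_pip_table` expands block boundaries into gene-level true effects, treating each block's `end` as inclusive. A toy illustration with made-up numbers (neither the effects nor the block boundaries come from the real data):

```python
import numpy as np
import pandas as pd

beta_all = np.array([0.0, 0.5, 0.0, 0.0, -1.2, 0.0])    # made-up effects for 6 genes
index = pd.DataFrame({"start": [0, 4], "end": [1, 5]})  # made-up blocks: genes 0-1 and 4-5

betas = []
for _, block in index.iterrows():
    # `end` is inclusive, hence the +1 in the slice
    betas.extend(beta_all[block["start"]:block["end"] + 1])

# A gene counts as a signal when its effect is nonzero
is_signal = [int(b != 0) for b in betas]
print(is_signal)  # [0, 1, 1, 0]
```

Genes 2 and 3 fall outside both blocks, so they are dropped; the remaining four genes line up row by row with the PIP table.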

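```python
pips = get_pip_table()
# Renumber rows 0..n-1 after the merge
pips = pips.reset_index(drop=True)
# Fisher's test yields p-values, not PIPs; 1 - p serves as a comparable score for ranking genes
pips["pip_fisher"] = 1 - pips["p"]
```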

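Next, switch to the R kernel and define a function `roc_data`. It takes the PIP table, the names of a score column (the PIP from one method) and the 0/1 `is_signal` column, sweeps a grid of thresholds restricted to the `cutoff` range (0.05 to 0.999 by default), and returns the raw counts together with precision-recall and ROC data.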
```R
roc_data = function(dat, cols, cutoff = c(0.05, 0.999), connect_org = TRUE, grid = 100) {
    d1 = dat[,cols]
    if (connect_org) start = 0
    else start = 1
    ttv = (start:grid)/grid
    ttv = ttv[which(ttv>=cutoff[1] & ttv<=cutoff[2])]
    rst1 = t(sapply(ttv, function(x) c(sum(d1[,2][d1[,1]>=x]), length(d1[,2][d1[,1]>=x]), sum(d1[,2][d1[,1]>=x]==0))))
    rst1 = cbind(rst1, sum(d1[,2]), sum(1-d1[,2]))
    rst1 = as.data.frame(rst1)
    colnames(rst1) = c('true_positive', 'total_positive', 'false_positive', 'total_signal', 'total_null')
    rst2 = as.data.frame(cbind(rst1$true_positive / rst1$total_positive, rst1$true_positive / rst1$total_signal,  ttv))
    rst3 = as.data.frame(cbind(1 - rst1$false_positive / rst1$total_null, rst1$true_positive / rst1$total_signal,  ttv))
    if (connect_org) {
        # make a stair to origin
        rst2 = rbind(rst2, c(max(1 - cutoff[1], rst2[nrow(rst2),1]), max(rst2[nrow(rst2),2]-0.01, 0), rst2[nrow(rst2),3]))
        rst2 = rbind(rst2, c(1, 0, 1))
        rst3 = rbind(rst3, c(1, 0, 1))
    }
    colnames(rst2) = c('Precision', 'Recall', 'Threshold')
    colnames(rst3) = c('TN', 'TP', 'Threshold')
    return(list(counts = rst1, pr = rst2, roc = rst3))
}
```
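
For readers more comfortable with Python, the ROC part of `roc_data` can be sketched as below. This is a simplified illustration, not a port: it skips the `cutoff` filter and the `connect_org` step, and `roc_points` is a hypothetical name. Note that the R function stores `1 - false_positive / total_null` in its `TN` column, which is why the plotting code later uses `1 - TN` as the false positive rate.

```python
import numpy as np

def roc_points(score, is_signal, grid=100):
    """Sweep thresholds 0, 1/grid, ..., 1 and return (thresholds, FPR, TPR)."""
    score = np.asarray(score, dtype=float)
    truth = np.asarray(is_signal, dtype=int)
    thresholds = np.arange(grid + 1) / grid
    total_signal = truth.sum()
    total_null = (1 - truth).sum()
    fpr, tpr = [], []
    for t in thresholds:
        called = score >= t  # genes declared positive at this threshold
        tpr.append((called & (truth == 1)).sum() / total_signal)
        fpr.append((called & (truth == 0)).sum() / total_null)
    return thresholds, np.array(fpr), np.array(tpr)

# Made-up scores: two true signals and two nulls
thresholds, fpr, tpr = roc_points([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
print(fpr[50], tpr[50])  # at threshold 0.5: 0.5 0.5
```

At threshold 0 every gene is called, so both rates equal 1; at threshold 1 nothing is called in this toy data, so both are 0.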

```R
%get pips m1 m2 m3 m4 m5 m6
res = list()
for (m in c(m1,m2,m3,m4,m5,m6)) res[[m]] = roc_data(pips, c(paste0('pip_',m), 'is_signal'), cutoff = c(0, 1), connect_org = TRUE, grid = 200)
```

Finally, switch back to the SoS kernel, use `%get` to fetch the ROC data from the R kernel, and plot one curve per method.

```python
%get res --from R

import matplotlib.pyplot as plt
import matplotlib.font_manager as font_manager
font_prop = font_manager.FontProperties(size=24)

fig, ax = plt.subplots(figsize = (10, 10))
for m, c in zip([m1,m2,m3,m4,m5,m6], ['C0','C8','C3','C4','C2','C7']):
    # The 'TN' column holds the true negative rate, so 1 - TN is the false positive rate
    plt.plot(1 - res[m]['roc']['TN'], res[m]['roc']['TP'], c = c, label = m)
plt.legend(loc = 'lower right', fontsize = 15)
plt.ylabel("Power", fontproperties=font_prop)
plt.xlabel("False positive rate", fontproperties=font_prop)
plt.title('ROC curve', fontproperties=font_prop)
plt.savefig(f"{cwd}/ROC_curve.pdf")
plt.show()
```

$n=30,000$ |$n=80,000$ 
:---------:|:----------:
![](cnv-figures/ROC_4simu_0802.png) | ![](cnv-figures/ROC_large_sample_0802_12.png)

|![alt](cnv-figures/hybrid_11_simu123_0709.png) |![alt](cnv-figures/MCMC_11_simu123_0709.png)|
|-|-|
|![alt](cnv-figures/SuSiE_12_simu123_0709.png) |![alt](cnv-figures/SIER_11_simu123_0709.png)|