# Copy model simulation and analysis workflow

In [1]:
!sos run 20190717_workflow.ipynb -h

usage: sos run 20190717_workflow.ipynb
               [workflow_name | -t targets] [options] [workflow_options]
  workflow_name:        Single or combined workflows defined in this script
  targets:              One or more targets to generate
  options:              Single-hyphen sos parameters (see "sos run -h" for details)
  workflow_options:     Double-hyphen workflow-specific parameters

Workflows:
  get_hist
  simulate
  analyze
  default
  get_data_hist

Global Workflow Options:
  --cnv-type deletion
  --cwd /home/min/GIT/github/cnv-gene-mapping/data (as path)
  --genotype-file  path(f"{cwd:a}/{cnv_type}.X.gz")

  --phenotype-file  path(f"{cwd:a}/{cnv_type}.y") # real CNV data phenotype


Sections
  get_hist_1, simulate_1, analyze_1:
    Workflow Options:
      --n-gene-in-block 1 (as int)
                        For simulation: get real deletion/duplication CNV data
                        and its block n_gene_in_block: get_hist: 1, simulate:
                        20~50, anal

## Run this workflow
### Simulation:
```
sos run dsc/20190717_workflow.ipynb simulate:1-5 --n_gene_in_block 30 --shape 1 --scale 0.5 -s build
```
To test `varbvs`, simulate effects from the spike-and-slab prior $\pi_0 \delta_0 + (1-\pi_0) N(0, 1)$ by setting shape = 0 and scale = 1:
```
sos run dsc/20190717_workflow.ipynb simulate:1-5 --n_gene_in_block 30 --shape 0 --scale 1 -s build
```
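The effect-size distribution above is a spike-and-slab mixture: with probability $\pi_0$ an effect is exactly zero, otherwise it is drawn from a normal slab. Below is a minimal NumPy sketch of that draw, assuming `shape` and `scale` act as the slab mean and standard deviation (as they do in `simulate_5` when `beta_method = 'normal'`). The function name `draw_effects` and the toy sizes are hypothetical, and the sketch uses `numpy.random.default_rng` rather than the global seed used in the workflow.

```python
import numpy as np

def draw_effects(n_genes, pi0, mean=0.0, sd=1.0, seed=999999):
    """Draw effects from pi0 * delta_0 + (1 - pi0) * N(mean, sd^2)."""
    rng = np.random.default_rng(seed)
    slab = rng.normal(mean, sd, n_genes)          # slab draw for every gene
    included = rng.binomial(1, 1 - pi0, n_genes)  # 1 = gene keeps its slab draw
    return included * slab                        # excluded genes get exactly 0

effects = draw_effects(n_genes=10000, pi0=0.95)
print((effects != 0).mean())  # fraction of non-zero effects, close to 1 - pi0
```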
### Get histogram
- For simulation

```
sos run dsc/20190717_workflow.ipynb get_hist:1-2 --genotype_file /home/min/GIT/cnv-gene-mapping/data/deletion_simu/deletion.genes.block30.for_simu.sample.X.gz \
--n_gene_in_block 1 -s build
```
- For real data

```
sos run dsc/20190717_workflow.ipynb get_hist:1-2 --n_gene_in_block 1 -s build
```

### Analyze
```
sos run dsc/20190717_workflow.ipynb susie:1-3 --genotype_file /home/min/GIT/cnv-gene-mapping/data/deletion_simu/deletion.genes.block30.for_simu.sample.X.gz \
--phenotype_file /home/min/GIT/cnv-gene-mapping/data/deletion_simu/deletion.genes.block30.for_simu.sample.y --n_gene_in_block 1 -s build
```
```
sos run dsc/20190717_workflow.ipynb fisher --genotype_file /home/min/GIT/cnv-gene-mapping/data/deletion_simu/deletion.genes.block30.for_simu.sample.X.gz \
--phenotype_file /home/min/GIT/cnv-gene-mapping/data/deletion_simu/deletion.genes.block30.for_simu.sample.y --n_gene_in_block 1 -s build
```
```
sos run dsc/20190717_workflow.ipynb logit --genotype_file /home/min/GIT/cnv-gene-mapping/data/deletion_simu/deletion.genes.block30.for_simu.sample.X.gz \
--phenotype_file /home/min/GIT/cnv-gene-mapping/data/deletion_simu/deletion.genes.block30.for_simu.sample.y --n_gene_in_block 1 -s build
```
```
sos run dsc/20190717_workflow.ipynb pymc3 --genotype_file /home/min/GIT/cnv-gene-mapping/data/deletion_simu/deletion.genes.block30.for_simu.sample.X.gz \
--phenotype_file /home/min/GIT/cnv-gene-mapping/data/deletion_simu/deletion.genes.block30.for_simu.sample.y --n_gene_in_block 1 -s build
```
### Run on the Midway cluster
```
cd /home/gaow/GIT/github/cnv-gene-mapping
JOB_OPT="-q midway2 -c midway2.yml"
sos run dsc/20190717_workflow.ipynb pymc3 --genotype_file /home/gaow/GIT/github/cnv-gene-mapping/data/deletion_simu/deletion.genes.block30.for_simu.sample.X.gz \
--phenotype_file /home/gaow/GIT/github/cnv-gene-mapping/data/deletion_simu/deletion.genes.block30.for_simu.sample.y --n_gene_in_block 1 -s build \
--job_size 10 $JOB_OPT
```
Simulate with `n_gene_in_block` = 20 and `sample_size` = 200000:
```
cd /home/gaow/GIT/github/cnv-gene-mapping
JOB_OPT="-q midway2 -c midway2.yml"
sos run dsc/20190717_workflow.ipynb simulate --n_gene_in_block 20 --shape 0 --scale 1 --sample_size 200000 -s build --job_size 10 $JOB_OPT
```

In [None]:
[global]
parameter: cnv_type = "deletion"
parameter: cwd = path("~/GIT/github/cnv-gene-mapping/data")
parameter: genotype_file = path(f"{cwd:a}/{cnv_type}.X.gz")
parameter: phenotype_file = path(f"{cwd:a}/{cnv_type}.y") # real CNV data phenotype
parameter: n_gene_in_block = 1
parameter: job_size = 80
parameter: mu0 = 0.777072111580423
parameter: s0 = 0.8436501699101251
parameter: pi0 = 0.0437754961218526
def fmtP(x):
    return str(x).replace(".", "p").replace(' ', '_').replace('"', "").replace("'", "").replace("-", '_')
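
The helper `fmtP` turns a parameter value into a string that is safe to embed in a file name. A short illustration, using the default `L` and `pve` values of the `susie_3` step to build the same suffix that step uses:

```python
def fmtP(x):
    # Same helper as in the [global] section: "." becomes "p", spaces and "-"
    # become "_", and quote characters are dropped.
    return str(x).replace(".", "p").replace(' ', '_').replace('"', "").replace("'", "").replace("-", '_')

L, pve = 1, 0.005  # defaults of the susie_3 step
suffix = f"SuSiE.L_{L}.prior_{fmtP(pve)}"
print(suffix)      # SuSiE.L_1.prior_0p005
print(fmtP(-1.5))  # _1p5
```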

In [1]:
[get_hist_1, simulate_1, susie_1, varbvs_1, fisher_1, pymc3_1, logit_1]
# For simulation: get real deletion/duplication CNV data and its block
# n_gene_in_block: get_hist: 1, simulate: 20~50, analyze: 1
input: genotype_file
output: f"{_input:nn}.genes.block{n_gene_in_block}.gz", f"{_input:nn}.block{n_gene_in_block}.forsimu.index.csv", f"{_input:nn}.block{n_gene_in_block}.index.csv"
python: expand = '${ }'
    import pandas as pd
    from operator import itemgetter
    from itertools import *
    data = pd.read_csv(${_input:r}, compression = "gzip", sep = "\t", header = None)
    data_clean = data.loc[:, (data != 0).any(axis = 0)]
    data_clean.to_csv(${_output[0]:r}, compression = "gzip", sep = "\t", header = False, index = False)
    indices = list(data_clean.columns)
    bound = list()
    i = 0; j = 1; n_0 = len(indices)
    while (j < n_0):
        if indices[j] - indices[i] >= ${n_gene_in_block} and indices[j] - indices[j-1] > 1:
            bound.append([indices[i], indices[j-1]])
            i = j
        j += 1
    bound.append([indices[i], indices[j-1]])
    bound = [item for item in bound if item[1] != 0]
    if len(bound) > 1 and bound[-1] == bound[-2]: # guard against an IndexError when there is only one block
        bound = bound[:-1]
    pd.DataFrame(bound).to_csv(${_output[1]:r}, sep = "\t", header = False, index = False)
    span = [item[1] - item[0] for item in bound]
    bound2 = list()
    start = 0
    for i in span:
        end = start + i
        start = end + 1
        bound2.extend([end, start])
    bound2 = [0] + bound2[:-1]
    bound3 = [bound2[x:x+2] for x in range(0, len(bound2), 2)]
    ## bound3: block boundaries re-indexed to the columns of data_clean (index starts from 0)
    pd.DataFrame(bound3).to_csv(${_output[2]:r}, sep = "\t", header = False, index = False)
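
The block-detection loop in this step can be tried on a small list of column indices. The sketch below copies only the `while` loop; it omits the filtering and de-duplication that follow it in the step, and the function name `find_blocks` and the toy indices are hypothetical. A new block starts whenever there is a gap between consecutive non-empty columns and the current block already spans at least `n_gene_in_block` indices.

```python
def find_blocks(indices, n_gene_in_block):
    """Group sorted column indices into [start, end] blocks."""
    bound = []
    i, j, n = 0, 1, len(indices)
    while j < n:
        wide_enough = indices[j] - indices[i] >= n_gene_in_block
        gap = indices[j] - indices[j - 1] > 1
        if wide_enough and gap:
            bound.append([indices[i], indices[j - 1]])  # close the current block
            i = j                                       # open the next one
        j += 1
    bound.append([indices[i], indices[j - 1]])          # close the last block
    return bound

indices = [14, 15, 16, 17, 19, 20, 93, 94]  # columns with at least one CNV
print(find_blocks(indices, 1))   # [[14, 17], [19, 20], [93, 94]]
print(find_blocks(indices, 10))  # [[14, 20], [93, 94]]
```

With `n_gene_in_block = 1` every gap starts a new block, so each block is a run of consecutive columns; with a larger value, short neighbouring runs are merged into one block.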

In [None]:
[fisher_2]
output: f"{_input[0]:n}.fisher.gz"
python: expand = '${ }'
    import pandas as pd
    ## Use scipy.stats.fisher_exact rather than "from fisher import pvalue", because "pvalue" gives inconsistent p-values:
    ## for example, 'pvalue(56,6650,0,6706).two_tail' is not more significant than 'pvalue(24,6682,0,6706).two_tail'
    from scipy import stats
    data = pd.read_csv(${_input[0]:r}, compression = "gzip", sep = "\t", header = None)
    y = pd.read_csv("${phenotype_file}", header = None, names = ["y"])
    xy = pd.concat([y, data], axis = 1, join = 'inner')
    xy1 = xy[xy["y"] == 1]
    n1 = xy1.shape[0]
    xy0 = xy[xy["y"] == 0]
    n0 = xy0.shape[0]
    res = list()
    for i in list(data.columns):
        res.append([f"gene_{i+1}", sum(xy1.loc[:,i]), n1 - sum(xy1.loc[:,i]), sum(xy0.loc[:,i]), n0 - sum(xy0.loc[:,i]), 
                    stats.fisher_exact([[sum(xy1.loc[:,i]), sum(xy0.loc[:,i])], [n1 - sum(xy1.loc[:,i]), n0 - sum(xy0.loc[:,i])]])[1]])
    pd.DataFrame(res).sort_values(by = 5).to_csv(${_output:r}, compression = "gzip", sep = "\t", header = ["gene", "d_c", "d_nc", "nd_c", "nd_nc", "p"], index = False)
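
The per-gene test in `fisher_2` can be illustrated on a toy data set. This is a minimal sketch that assumes the output columns read as d/nd = case/control and c/nc = CNV carrier/non-carrier, which is an interpretation of the header rather than something the notebook states. The genotypes and phenotypes below are made up.

```python
import pandas as pd
from scipy import stats

# Toy data: 8 individuals (rows) x 2 genes (columns); y = 1 for cases.
X = pd.DataFrame({0: [1, 1, 1, 0, 0, 0, 0, 0],
                  1: [0, 1, 0, 1, 0, 1, 0, 1]})
y = pd.Series([1, 1, 1, 1, 0, 0, 0, 0])

is_case = y == 1
n1, n0 = int(is_case.sum()), int((~is_case).sum())
d_c = X[is_case].sum()    # carriers among cases, one entry per gene
nd_c = X[~is_case].sum()  # carriers among controls, one entry per gene

rows = []
for gene in X.columns:
    a, c = int(d_c.loc[gene]), int(nd_c.loc[gene])
    table = [[a, c], [n1 - a, n0 - c]]  # same layout as in fisher_2
    p = stats.fisher_exact(table)[1]    # two-sided p-value
    rows.append([f"gene_{gene + 1}", a, n1 - a, c, n0 - c, p])

res = pd.DataFrame(rows, columns=["gene", "d_c", "d_nc", "nd_c", "nd_nc", "p"]).sort_values(by="p")
print(res)
```

Computing the carrier counts once per gene avoids repeating the column sums inside the table construction.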

In [8]:
from fisher import pvalue

In [27]:
pvalue(603,6103,76,6630).two_tail

4.672523463727042e-06

In [13]:
from scipy import stats

In [30]:
stats.fisher_exact([[603,76], [6103,6630]])

(8.619337340565899, 1.9104901676695082e-107)

In [23]:
import pandas as pd
fisher = pd.read_csv("/home/min/GIT/cnv-gene-mapping/data/deletion_simu/deletion.genes.block30.for_simu.sample.genes.block1.fisher.gz", compression = "gzip", sep = "\t", header = 0)

In [24]:
fisher.shape

(2285, 6)

In [25]:
fisher[fisher["p"] < 0.05].head(10)

| | gene | d_c | d_nc | nd_c | nd_nc | p |
|---|---|---|---|---|---|---|
| 0 | gene_25 | 603 | 6103 | 76 | 6630 | 1.91049e-107 |
| 1 | gene_28 | 603 | 6103 | 76 | 6630 | 1.91049e-107 |
| 2 | gene_27 | 603 | 6103 | 76 | 6630 | 1.91049e-107 |
| 3 | gene_26 | 603 | 6103 | 76 | 6630 | 1.91049e-107 |
| 4 | gene_29 | 603 | 6103 | 76 | 6630 | 1.91049e-107 |
| 5 | gene_24 | 528 | 6178 | 61 | 6645 | 1.710063e-97 |
| 6 | gene_1138 | 502 | 6204 | 79 | 6627 | 2.861096e-79 |
| 7 | gene_1139 | 502 | 6204 | 79 | 6627 | 2.861096e-79 |
| 8 | gene_1137 | 502 | 6204 | 79 | 6627 | 2.861096e-79 |
| 9 | gene_567 | 132 | 6574 | 1 | 6705 | 1.296086e-38 |


In [182]:
[susie_2, varbvs_2, pymc3_2, logit_2]
## R: fread(${_input:[0]}, select = ${_blocks.replace('_', ':')})
## similar to fine mapping, create 527 folders and save results for each of them
blocks = ['_'.join(x.strip().split()) for x in open(f'{_input[2]:a}').readlines()]#[:5]
input: for_each = ['blocks']
output: f"{_input[0]:d}/block_{_blocks}/{_input[0]:bnn}.block_{_blocks}.gz"
task: trunk_workers = 1, trunk_size = job_size, walltime = '10m', mem = '8G', cores = 1, tags = f'pymc3_{_output:bn}'
python: expand = '${ }'
    import pandas as pd
    data = pd.read_csv(${_input[0]:r}, compression = "gzip", sep = "\t", header = None)
    data.loc[:, int('${_blocks}'.split("_")[0]):int('${_blocks}'.split("_")[1])].to_csv(${_output:r}, compression = "gzip", sep = "\t", header = False, index = False)

In [None]:
[pymc3_3]
parameter: iteration = 2000
parameter: pymc3_seed = 999
parameter: intercept = -2.9444389791664403
parameter: sigma_intercept = 0.05
input: group_by = 1
output: f"{_input[0]:n}.pymc3.gz"
task: trunk_workers = 1, trunk_size = job_size, walltime = '10m', mem = '8G', cores = 1, tags = f'pymc3_{_output:bn}'
python: expand = '${ }'
    import numpy as np, pandas as pd, pymc3 as pm, theano.tensor as tt
    from scipy.special import expit
    X = pd.read_csv(${_input:r}, compression = "gzip", sep = "\t", header = None, dtype = float)
    y = np.loadtxt("${phenotype_file}", dtype = int)
    invlogit = lambda x: 1/(1 + tt.exp(-x))
    model = pm.Model()
    with model:
        xi = pm.Bernoulli('xi', ${pi0}, shape = X.shape[1]) # inclusion indicator for each variable, with prior inclusion probability pi0
        alpha = pm.Normal('alpha', mu = ${intercept}, sd = ${sigma_intercept}) # Intercept
        beta = pm.Normal('beta', mu = ${mu0}, sd = ${s0}, shape = X.shape[1]) #Prior for the non-zero coefficients
        p = pm.math.dot(X, xi * beta) #Deterministic function to map the stochastics to the output
        y_obs = pm.Bernoulli('y_obs', invlogit(p + alpha), observed = y)  #Data likelihood
    with model:
        trace = pm.sample(${iteration}, random_seed = ${pymc3_seed}, cores = 1, progressbar = False, chains = 1)
    results = pd.DataFrame({'inclusion_probability': np.apply_along_axis(np.mean, 0, trace['xi']),
                            'beta': np.apply_along_axis(np.mean, 0, np.multiply(trace["beta"], trace["xi"])),
                            'beta_given_inclusion': np.apply_along_axis(np.sum, 0, trace['xi'] * trace['beta']) / np.apply_along_axis(np.sum, 0, trace['xi'])
                            })
    results.to_csv(${_output:r}, compression = "gzip", sep = "\t", header = True, index = False)
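
The three posterior summaries written by this step are simple column-wise averages over the MCMC draws. The NumPy sketch below reproduces them on a made-up "trace" of 4 draws for 3 variables. Unlike the step above, it leaves `beta_given_inclusion` as NaN explicitly when a variable is never included, instead of dividing by zero.

```python
import numpy as np

# Toy trace: 4 posterior draws (rows) for 3 variables (columns).
xi = np.array([[1, 0, 0],
               [1, 0, 1],
               [1, 0, 0],
               [0, 0, 1]])
beta = np.array([[0.8, 0.1, 1.0],
                 [1.0, -0.2, 2.0],
                 [1.2, 0.3, -1.0],
                 [0.5, 0.0, 4.0]])

pip = xi.mean(axis=0)                 # posterior inclusion probability
beta_mean = (xi * beta).mean(axis=0)  # posterior mean of xi * beta
n_included = xi.sum(axis=0)           # number of draws with xi = 1

# Mean of beta over the draws where the variable is included.
beta_given_inclusion = np.full(xi.shape[1], np.nan)
np.divide((xi * beta).sum(axis=0), n_included,
          out=beta_given_inclusion, where=n_included > 0)

print(pip, beta_mean, beta_given_inclusion)
```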

In [None]:
[pymc3_4]
input: group_by = "all"
output: f"{_input[0]:dd}/{_input[0]:bnnn}.pymc3.all.blocks.pip.gz"
python: expand = '${ }'
    import pandas as pd
    files = list([${_input:r,}])
    res = pd.DataFrame(columns = ["inclusion_probability"])
    for f in files:
        tmp = pd.read_csv(f, compression = "gzip", sep = "\t", header = 0, usecols = [0])
        res = pd.concat([res, tmp])
    res.to_csv(${_output:r}, compression = "gzip", sep = "\t", header = False, index = False)

In [None]:
[logit_3]
depends: R_library("data.table")
input: group_by = 1
output: f"{_input[0]:n}.logit.rds"
R: expand = '${ }', stderr = f'{_input[0]:n}.logit.stderr', stdout = f'{_input[0]:n}.logit.stdout'
    source("${cwd:dd}/logistic/code/misc.R")  
    source("${cwd:dd}/logistic/code/bayes.R")
    X <- as.matrix(data.table::fread(${_input:r}, header = F))
    X <- scale(X, center = TRUE, scale = FALSE)
    y <- as.matrix(data.table::fread("${phenotype_file}"))
    p <- dim(X)[2]
    p0 <- rep(${pi0}, 1, p)
    out <- bayes.logistic(X, y, p0, ${mu0}, ${s0})
    saveRDS(out, ${_output:r})

In [None]:
[logit_4]
input: group_by = "all"
output: f"{_input[0]:dd}/{_input[0]:bnnn}.logit.all.blocks.pip.csv"
R: expand = '${ }', stderr = f'{_input[0]:n}.logit.stderr', stdout = f'{_input[0]:n}.logit.stdout'
    files = c(${_input:r,})
    pips = c()
    for (i in 1:length(files)){pips = c(pips, readRDS(files[i])[["p1"]])}
    write(pips, file = ${_output[0]:r}, ncolumns = 1)

In [None]:
[susie_3]
depends: R_library("data.table"), R_library('susieR'), R_library("reticulate")
parameter: L = 1
parameter: pve = 0.005
parameter: method = "optim"
suffix = f"SuSiE.L_{L}.prior_{fmtP(pve)}"
input: group_by = 1
output: f"{_input[0]:n}.{suffix}.susie.rds"
R: expand = '${ }', stderr = f'{_input[0]:n}.susie.stderr', stdout = f'{_input[0]:n}.susie.stdout'
    library(susieR)
    library(data.table)
    library(reticulate)
    X <- as.matrix(data.table::fread(${_input:r}))
    y <- as.matrix(data.table::fread("${phenotype_file}"))
    storage.mode(X) = 'double'
    storage.mode(y) = 'double'
    res <- susie(X, y, L = ${L}, scaled_prior_variance = ${pve}, estimate_prior_method = '${method}', estimate_prior_variance = TRUE)
    saveRDS(res, ${_output:r})

In [None]:
[susie_4]
input: group_by = "all"
output: f"{_input[0]:dd}/{_input[0]:bnnn}.susie.all.blocks.pip.csv", f"{_input[0]:dd}/{_input[0]:bnnn}.susie.all.blocks.pip.sum.csv"
R: expand = '${ }', stderr = f'{_input[0]:n}.susie.stderr', stdout = f'{_input[0]:n}.susie.stdout'
    files = c(${_input:r,})
    pips = c()
    for (i in 1:length(files)){pips = c(pips, readRDS(files[i])$pip)}
    write(pips, file = ${_output[0]:r}, ncolumns = 1)
    count = sapply(1:length(files), function(i) sum(readRDS(files[i])$pip))
    write(count, file = ${_output[1]:r}, ncolumns = 1)

In [None]:
[varbvs_3]
depends: R_library("data.table"), R_library("reticulate"), R_library("varbvs")
input: group_by = 1
output: f"{_input:n}.varbvs.rds"
R: expand = '${ }', stderr = f'{_input[0]:n}.varbvs.stderr', stdout = f'{_input[0]:n}.varbvs.stdout'
    library(varbvs)
    library(data.table)
    X <- as.matrix(data.table::fread(${_input:r}))
    y <- as.matrix(data.table::fread("${phenotype_file}"))
    storage.mode(X) = 'double'
    storage.mode(y) = 'double'
    # logodds <- seq(-log10(ncol(X)), 0, length.out = 20)
    fit <- varbvs::varbvs(X, NULL, y, family = "binomial", update.b0 = TRUE, verbose = FALSE)
    saveRDS(fit, ${_output:r})

In [None]:
[varbvs_4]
input: group_by = "all"
output: f"{_input[0]:dd}/{_input[0]:bnnn}.varbvs.all.blocks.pip.csv", f"{_input[0]:dd}/{_input[0]:bnnn}.varbvs.all.blocks.pip.sum.csv"
R: expand = '${ }', stderr = f'{_input[0]:n}.varbvs.stderr', stdout = f'{_input[0]:n}.varbvs.stdout'
    files = c(${_input:r,})
    pips = c()
    for (i in 1:length(files)){pips = c(pips, readRDS(files[i])$pip)}
    write(pips, file = ${_output[0]:r}, ncolumns = 1)
    count = sapply(1:length(files), function(i) sum(readRDS(files[i])$pip))
    write(count, file = ${_output[1]:r}, ncolumns = 1)

In [None]:
[get_hist_2]
output: f"{_input[0]:n}.histogram.pdf"
python: expand = '${ }'
    import pandas as pd, matplotlib.pyplot as plt
    blocks = pd.read_csv(${_input[1]:r}, sep = "\t", header = None, names = ["start", "end"])
    spans = [j-i+1 for i,j in zip(blocks["start"], blocks["end"])]
    counts = {i: spans.count(i) for i in set(spans) if i != 0}
    fig, ax = plt.subplots(figsize = (8,6))
    plt.bar(list(counts.keys()), list(counts.values()), width = 0.8)
    ax.set_title("Histogram of number of genes in blocks")
    plt.savefig(${_output:r})

In [None]:
[simulate_2]
output: f"{_input[0]:n}.for_simu.gz"
python: expand = '${ }'
    import pandas as pd
    data = pd.read_csv(${genotype_file:r}, compression = "gzip", header = None, sep = "\t")
    bound = pd.read_csv(${_input[1]:r}, header = None, sep = "\t")
    bound2 = [[item[0], item[1]] if item[0] == bound.values[-1][0] else [item[0], bound.values[j+1][0]-1] for j, item in enumerate(bound.values)]
    fill = list()
    for l in range(data.shape[0]):
        fill.append([data.loc[l, k[0]:k[1]].tolist() for k in bound2])
    res = pd.DataFrame(fill)
    res.to_csv(${_output:r}, compression = "gzip", sep = "\t", header = False, index = False)

The output of this step has two files:
- Boundary file

        block_start     block_end
        14      23
        93      97
        164     177
        229     236
- Genotype matrix whose column names come from the boundary file: for each block, the column names are the indices from start to end.

        14      15      16      17      18      19      20      21      22      23      93      94      95      96      97
         0       1       1       0       1       1       1       1       0       0       0       0       1       1       1
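
The relation between the two files can be checked directly from the boundary rows shown above. A small pandas sketch: both ends of a block are inclusive, so a block from `block_start` to `block_end` contributes `block_end - block_start + 1` columns. The variable names are illustrative.

```python
import pandas as pd

# Boundary rows copied from the example above (inclusive column indices).
blocks = pd.DataFrame({"block_start": [14, 93, 164, 229],
                       "block_end":   [23, 97, 177, 236]})

# Number of genes per block.
blocks["n_genes"] = blocks["block_end"] - blocks["block_start"] + 1
print(blocks["n_genes"].tolist())  # [10, 5, 14, 8]

# Column names of the genotype matrix: every index from start to end of each block.
columns = [i for s, e in zip(blocks["block_start"], blocks["block_end"])
           for i in range(s, e + 1)]
print(columns[:15])  # 14..23 followed by 93..97, as in the header row above
```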

In [None]:
[simulate_3]
parameter: sample_size = 100000 # sample size: default 100000, test: 1000
parameter: n_batch = 200 # number of batches (jobs); each simulates sample_size / n_batch samples, default: 200, test: 20
assert sample_size % n_batch == 0
batches = [x+1 for x in range(n_batch)]
input: for_each = ['batches']
output: f"{cwd:a}/{cnv_type}_simu_{n_gene_in_block}/{_input[0]:bn}.sample.{_batches}.gz"
task: trunk_workers = 1, trunk_size = job_size, walltime = '10m', mem = '8G', cores = 1, tags = f'simulate3_{_output:bn}'
python: expand = '${ }'
    import pandas as pd, numpy as np
    import random, itertools, ast
    data = pd.read_csv(${_input:r}, compression = "gzip", header = None, sep = "\t")
    size = int(${sample_size} / ${n_batch})
    random.seed(${_batches})
    samples_genome = list()
    for i in range(size):
        order = random.sample(data.index.tolist(), data.shape[1])
        s = list(itertools.chain(*list(ast.literal_eval(n) for n in np.diag(data.loc[order, :]))))
        samples_genome.append(s)
    samples_genome_df = pd.DataFrame(samples_genome) # row: sample, column: gene
    samples_genome_df.to_csv(${_output:r}, compression = "gzip", sep = "\t", header = False, index = False)
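
The resampling scheme in this step builds each simulated genome block by block: a distinct donor individual is drawn for every block, and the donor's genotypes for that block are copied. A minimal standard-library sketch on made-up data, with the hypothetical helper `simulate_genome` standing in for the loop body:

```python
import itertools
import random

# Toy input: 4 individuals (rows) x 3 blocks (columns); each cell holds the
# genotypes of one individual within one block.
data = [
    [[0, 0], [1, 1, 1], [0]],
    [[1, 0], [0, 0, 0], [1]],
    [[0, 1], [0, 1, 0], [0]],
    [[1, 1], [0, 0, 1], [1]],
]
n_individuals, n_blocks = len(data), len(data[0])

def simulate_genome(rng):
    # One distinct donor per block (sampling without replacement).
    donors = rng.sample(range(n_individuals), n_blocks)
    # Block k is copied from donor k, then all blocks are concatenated.
    return list(itertools.chain.from_iterable(
        data[d][k] for k, d in enumerate(donors)))

rng = random.Random(1)  # arbitrary seed; the workflow seeds with the batch number
samples = [simulate_genome(rng) for _ in range(5)]
print(samples)
```

Because donors are drawn without replacement, this scheme needs at least as many individuals as blocks.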

In [None]:
[simulate_4]
input: group_by = 'all'
output: f'{_input[0]:nn}.combined.gz'
bash: expand = "${ }"
    zcat ${_input} | gzip > ${_output}

In [None]:
[simulate_5]
# shape = 3; scale = 1 for gamma
# shape = 2.191013; scale = 0.2682398 for normal
parameter: shape = 3.0 # mean for normal (1), shape for gamma (3)
parameter: scale = 1.0 # se for normal (0.5), scale for gamma (1)
# 'gamma' or 'normal'
parameter: beta_method = 'normal'
parameter: penetrance = 0.05
parameter: seed = 999999
parameter: ctrl_case_ratio = 1.0
parameter: pi0 = 0.95
output: f'{_input:nn}.shape{shape}.scale{scale}.X.gz', f'{_input:nn}.shape{shape}.scale{scale}.y', f'{_input:nn}.shape{shape}.scale{scale}.beta'
python: expand = "${ }"
    import pandas as pd, numpy as np
    np.random.seed(${seed})
    # For normal distribution the -3*sigma to 3*sigma on x-axis should correspond to
    # log(4) and log(20). The shape and scale parameters are thus:
    # mu = (log(20) + log(4))/2 = 2.191013; sigma = (log(20) - mu) / 3 = 0.2682398
    def logor_gamma(shape, scale, n):
        return np.log(np.random.gamma(shape, scale, n))

    def logor_normal(mean, se, n):
        return np.random.normal(mean, se, n)

    data = pd.read_csv(${_input:r}, compression = "gzip", sep = "\t", header = None)
    beta0 = np.log(${penetrance} / (1-${penetrance}))
    beta1s = [x for x in logor_${beta_method}(${shape}, ${scale}, data.shape[1])]
    beta1s = [np.random.binomial(1, 1-${pi0}) * i for i in beta1s]
    ## FIXME: store sparse beta: non-zero beta's with their indices
    with open(${_output[2]:r}, 'w') as f:
        f.write("\n".join([str(b) for b in beta1s]))
    logit_y = np.matmul(data.values, beta1s) + beta0
    ys_p = np.exp(logit_y) / (1+np.exp(logit_y))
    ys = np.random.binomial(1, ys_p)
    case_index = np.ravel(np.where(ys == 1))
    ctrl_index = sorted(np.random.choice(np.ravel(np.where(ys == 0)), int(len(case_index) * ${ctrl_case_ratio}), replace = False)) # sample controls without replacement so that no individual is duplicated
    genotype = data.iloc[case_index.tolist() + ctrl_index, :]
    genotype.to_csv(${_output[0]:r}, compression = "gzip", sep = "\t", header = False, index = False)
    with open(${_output[1]:r}, 'w') as f:
        f.write('\n'.join(['1'] * len(case_index) + ['0'] * len(ctrl_index)))
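
The constants quoted in this step can be reproduced in a few lines, followed by a toy version of the case-control sampling. This is a sketch under stated assumptions: the genotype matrix, its size, the carrier frequency of 0.1, and the single causal gene are made up, and controls are drawn without replacement here so that no individual appears twice.

```python
import numpy as np

# Normal log odds ratio whose -3 sigma .. +3 sigma range spans log(4) .. log(20).
lo, hi = np.log(4), np.log(20)
mu = (hi + lo) / 2
sigma = (hi - mu) / 3
print(mu, sigma)  # roughly 2.1910 and 0.2682

# Intercept implied by the baseline penetrance.
penetrance = 0.05
beta0 = np.log(penetrance / (1 - penetrance))
print(beta0)

# Toy phenotype simulation with one causal gene, then 1:1 case-control sampling.
rng = np.random.default_rng(999999)
n, p = 20000, 5
X = rng.binomial(1, 0.1, size=(n, p))         # made-up genotypes
beta = np.array([mu, 0.0, 0.0, 0.0, 0.0])     # only the first gene has an effect
prob = 1 / (1 + np.exp(-(X @ beta + beta0)))  # logistic model
ys = rng.binomial(1, prob)
case_index = np.flatnonzero(ys == 1)
ctrl_index = np.sort(rng.choice(np.flatnonzero(ys == 0),
                                size=len(case_index), replace=False))
print(len(case_index), len(ctrl_index))
```

The value of `beta0` is the same number as the default `intercept` parameter of the `pymc3_3` step.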

## Note
```
cd ~/GIT/cnv-gene-mapping
sos run dsc/20190717_workflow.ipynb get_hist:1-2 -s build
sos run dsc/20190717_workflow.ipynb get_hist:1-2 --genotype_file /home/min/GIT/cnv-gene-mapping/data/deletion_simu/deletion.genes.block30.for_simu.sample.X.gz
sos run dsc/20190717_workflow.ipynb analyze:1-2 -s build
sos run dsc/20190717_workflow.ipynb analyze:1-2 --simu_pheno /home/min/GIT/cnv-gene-mapping/data/deletion_simu/deletion.genes.block30.for_simu.sample.y -s build
sos run dsc/20190717_workflow.ipynb simulate:1-5 --n_gene_in_block 30 --shape 1 --scale 0.5 -s build
sos run dsc/20190717_workflow.ipynb -s build -j 6
```
```
sinteractive --time=01:00:00 --partition=bigmem2 --nodes=1 --ntasks-per-node=1 --mem-per-cpu=100G
sos run dsc/20190717_workflow.ipynb simulate:1-5 --n_gene_in_block 30 --shape 1 --scale 0.5 -s build

sos run dsc/20190717_workflow.ipynb get_hist:1-2 --genotype_file /home/gaow/GIT/github/cnv-gene-mapping/data/deletion_simu/deletion.genes.block30.for_simu.sample.X.gz \
--phenotype_file /home/gaow/GIT/github/cnv-gene-mapping/data/deletion_simu/deletion.genes.block30.for_simu.sample.y --n_gene_in_block 1 -s build

sos run dsc/20190717_workflow.ipynb analyze:1-2 --genotype_file /home/gaow/GIT/github/cnv-gene-mapping/data/deletion_simu/deletion.genes.block30.for_simu.sample.X.gz \
--phenotype_file /home/gaow/GIT/github/cnv-gene-mapping/data/deletion_simu/deletion.genes.block30.for_simu.sample.y --n_gene_in_block 1 \
--simu_pheno /home/gaow/GIT/github/cnv-gene-mapping/data/deletion_simu/deletion.genes.block30.for_simu.sample.y --real "FALSE" -s build
```