# Logistic regression with a spike-and-slab prior using PyMC3

Previously we tried the BoomSpikeSlab and LogitBvs R packages to fit a spike-and-slab model for logistic regression, but neither worked the way we wanted (FIXME: add link). Now I'm looking at some customized options. According to [this post](https://www.kaggle.com/melondonkey/bayesian-spike-and-slab-in-pymc3), `stan` cannot handle the spike-and-slab model because the prior needs a discrete inclusion indicator, and `stan` does not sample discrete parameters. The post implements a `PyMC3`-based sampler that looks neat enough, so I'm trying to use it for our problem here.

Indeed, a discrete prior might not be optimal for `PyMC3` either, as pointed out [in this notebook](https://www.kaggle.com/derekpowll/bayesian-lr-w-cauchy-prior-in-pymc3). I think we can also try a mixture of a spiky normal and a slab normal; at least everything is continuous there.
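To make the continuous alternative concrete, here is a minimal sketch of the log density of a two-component normal mixture (narrow "spike" at zero plus wide "slab"). The function names, the spike width `sigma_spike = 0.01`, and the weights are illustrative assumptions, not values used elsewhere in this notebook.

```python
import numpy as np

def normal_logpdf(b, mu, sigma):
    """Log density of N(mu, sigma^2) evaluated at b."""
    return -0.5 * np.log(2 * np.pi) - np.log(sigma) - 0.5 * ((b - mu) / sigma) ** 2

def spike_slab_logpdf(b, pi_slab, mu_slab, sigma_slab, sigma_spike=0.01):
    """Log density of (1 - pi_slab) * N(0, sigma_spike^2) + pi_slab * N(mu_slab, sigma_slab^2)."""
    log_spike = np.log1p(-pi_slab) + normal_logpdf(b, 0.0, sigma_spike)
    log_slab = np.log(pi_slab) + normal_logpdf(b, mu_slab, sigma_slab)
    # logaddexp adds the two weighted densities on the log scale without underflow
    return np.logaddexp(log_spike, log_slab)

# Illustrative (made-up) settings: 5% slab weight, slab roughly N(0.78, 0.84^2)
b = np.array([0.0, 0.005, 0.8, 2.0])
logp = spike_slab_logpdf(b, pi_slab=0.05, mu_slab=0.78, sigma_slab=0.84)
print(logp)
```

Because the indicator is summed out, the density is smooth in `b`, which is what a gradient-based sampler needs. The narrower the spike, the closer this gets to the point-mass prior, but also the harder the geometry becomes for the sampler.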

## Software required

```
pip install pymc3 -U
conda install -c anaconda mkl-service
```

In [1]:
import numpy as np
N = 10
W = np.array([0.35, 0.65])
MU = np.array([0., 2.])
SIGMA = np.array([0.5, 1])

In [2]:
component = np.random.choice(MU.size, size=N, p=W)

In [3]:
component

array([1, 0, 1, 1, 1, 1, 1, 1, 1, 0])

In [4]:
x = np.random.normal(MU[component], SIGMA[component], size=N)

In [5]:
x

array([ 3.4039753 , -0.31527543,  1.46270094,  1.03092931,  2.81120231,
        3.43326324,  2.47595872,  2.66015286,  2.9430338 , -0.51450195])

In [6]:
np.ones_like(W)

array([1., 1.])

## Mixture normal distribution

Mixture normal distribution 1

In [7]:
import numpy as np
from scipy.stats import norm
N = 10
components = np.random.choice(2, N, p = [0.95, 0.05]).tolist()

In [8]:
components

[0, 0, 1, 0, 0, 0, 0, 0, 0, 0]

In [9]:
mus = [0, 0.777]
sds = [1, 0.844]

In [10]:
[norm.rvs(mus[i], sds[i], 1) for i in components]

[array([0.81503456]),
 array([1.59741385]),
 array([0.82577892]),
 array([1.34766705]),
 array([-0.39489492]),
 array([0.69170806]),
 array([-0.23767606]),
 array([1.46909949]),
 array([-0.7560455]),
 array([0.55327881])]

Mixture normal distribution 2

[link](https://stackoverflow.com/questions/47759577/creating-a-mixture-of-probability-distributions-for-sampling/47763145)

In [11]:
distributions = [{"type": np.random.normal, "kwargs": {"loc": 0, "scale": 1}}, {"type": np.random.normal, "kwargs": {"loc": 0.777, "scale": 0.844}}]

In [12]:
coefficients = np.array([0.95, 0.05])

In [13]:
coefficients

array([0.95, 0.05])

In [14]:
sample_size = 10

In [15]:
num_distr = len(distributions)

In [16]:
data = np.zeros((sample_size, num_distr))

In [17]:
for idx, distr in enumerate(distributions):
    data[:, idx] = distr["type"](size=(sample_size,), **distr["kwargs"])
random_idx = np.random.choice(np.arange(num_distr), size=sample_size, p=coefficients)
sample = data[np.arange(sample_size), random_idx]

In [18]:
data

array([[-0.27108568,  0.88964344],
       [ 0.16368067,  0.97006357],
       [ 1.7953499 , -1.53559879],
       [ 0.54886108,  1.4787358 ],
       [-0.33192432,  0.06252585],
       [-0.37984494,  0.65852646],
       [-2.41666066,  1.49447456],
       [-1.41418634,  1.83983833],
       [ 1.25568105, -0.36999306],
       [ 0.03977451, -0.0044227 ]])

In [19]:
random_idx

array([0, 0, 0, 0, 0, 0, 1, 0, 0, 0])

In [20]:
sample

array([-0.27108568,  0.16368067,  1.7953499 ,  0.54886108, -0.33192432,
       -0.37984494,  1.49447456, -1.41418634,  1.25568105,  0.03977451])

## Import X and y

In [1]:
import os
import numpy as np, pandas as pd
cwd = os.path.expanduser("/home/min/GIT/cnv-gene-mapping/data/deletion_simu_30_shape0.777_scale0.843")
start = 1815 #140 #1750
end = 1826 #144 #1772

In [2]:
X_file = f'{cwd}/block_{start}_{end}/deletion.genes.block30.for_simu.sample.combined.genes.block_{start}_{end}.gz'
y_file = f'{cwd}/deletion.genes.block30.for_simu.sample.combined.y.gz'
fisher_f = f'{cwd}/deletion.genes.block30.for_simu.sample.combined.genes.block1.fisher.gz'

In [3]:
fisher = pd.read_csv(fisher_f, compression = "gzip", header = 0, sep = "\t")

In [4]:
fisher_block = fisher.copy()[fisher["gene"].isin([f"gene_{x}" for x in range(start+1, end+2)])]
fisher_block["index"] = [int(x.split("_")[1])-start-1 for x in fisher_block["gene"]]
fisher_block

Unnamed: 0,gene,d_c,d_nc,nd_c,nd_nc,p,index
0,gene_1822,329,11599,31,11897,8.076649e-65,6
1,gene_1823,329,11599,31,11897,8.076649e-65,7
2,gene_1821,310,11618,35,11893,6.837438e-57,5
3,gene_1826,270,11658,25,11903,1.224925e-53,10
4,gene_1824,270,11658,25,11903,1.224925e-53,8
5,gene_1825,270,11658,25,11903,1.224925e-53,9
6,gene_1820,163,11765,11,11917,4.424518e-36,4
7,gene_1819,163,11765,11,11917,4.424518e-36,3
28,gene_1818,113,11815,6,11922,8.67353e-27,2
29,gene_1816,113,11815,6,11922,8.67353e-27,0


In [5]:
y = np.loadtxt(y_file, dtype=int)
y.shape

(23856,)

In [6]:
X = pd.read_csv(X_file, compression = "gzip", sep = "\t", header = None, dtype = float)
X.shape

(23856, 12)

In [7]:
np.sum(X, axis = 0)

0     119.0
1     119.0
2     119.0
3     174.0
4     174.0
5     345.0
6     360.0
7     360.0
8     295.0
9     295.0
10    295.0
11    120.0
dtype: float64

For this block, the CNV problem has 23,856 samples and 12 features (see the shapes of `y` and `X` above).

## Model specification

This step specifies **how the data is generated**. Specifically, it sets up the spike-and-slab prior for the logistic model, $$b\sim (1-\pi_0) \delta_0 + \pi_0 N(\mu, \sigma^2)$$ where $\pi_0$ is the prior inclusion probability (the Bernoulli probability of `xi` in the code below) and, from the `varbvs` analysis, $\pi_0 \approx 0.044, \mu \approx 0.78, \sigma \approx 0.84$.
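As a quick sanity check on this prior, the sketch below draws coefficients from it with plain NumPy. It treats the `varbvs` value as the prior inclusion probability, which is how `pm.Bernoulli('xi', pi0, ...)` uses it in `get_model`. The variable names and the number of draws are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

pi_incl = 0.0438     # prior inclusion probability (rounded varbvs estimate)
mu_slab = 0.777      # slab mean
sigma_slab = 0.844   # slab standard deviation
n_draws = 100_000    # arbitrary number of prior draws

# Draw the inclusion indicator, then the slab value; excluded coefficients are exactly 0
xi = rng.binomial(1, pi_incl, size=n_draws)
slab = rng.normal(mu_slab, sigma_slab, size=n_draws)
b = xi * slab

print("fraction non-zero:", np.mean(b != 0))
print("mean of b:", b.mean())
print("mean of b given inclusion:", b[xi == 1].mean())
```

Under this prior most coefficients are exactly zero, so the marginal prior mean of `b` is close to `pi_incl * mu_slab`, while the mean given inclusion stays near `mu_slab`.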

For the intercept: with centered data it can be interpreted as the baseline log odds, so I'm giving it a normal prior $N(0, 1.5)$ (standard deviation 1.5) that roughly covers a baseline odds of 0.05 (log odds about -3), i.e. a not-so-rare disease.
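The intercept reasoning can be checked with a few lines of arithmetic. This is a sketch using only the standard library; it assumes 1.5 is a standard deviation (as in the `sd = sigma_intercept` argument used later) and treats 0.05 as a baseline probability, as the `prevalence` variable does.

```python
import math
from statistics import NormalDist

def logit(p):
    """Log odds of a probability p."""
    return math.log(p / (1 - p))

baseline_prob = 0.05
log_odds = logit(baseline_prob)          # close to -3

# Central 95% interval of the N(0, sd = 1.5) intercept prior
prior = NormalDist(mu=0.0, sigma=1.5)
lower, upper = prior.inv_cdf(0.025), prior.inv_cdf(0.975)

# Prior mass below the baseline log odds
mass_below = prior.cdf(log_odds)

print(f"log odds at 0.05: {log_odds:.3f}")
print(f"95% prior interval: ({lower:.3f}, {upper:.3f})")
print(f"prior mass below baseline: {mass_below:.4f}")
```

The baseline log odds lands right at the lower edge of the central 95% interval, so the prior allows a prevalence near 0.05 but puts little mass on anything rarer.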

**Question: how should we handle intercept? How did `varbvs` handle intercept?**

In [8]:
pi_varbvs = 0.0437754961218526
mu_varbvs = 0.777072111580423
si_varbvs = np.sqrt(0.711745609189383)

## Spike and Slab model

In [9]:
import pymc3 as pm
import theano.tensor as tt

uniform [link](https://docs.pymc.io/api/distributions/continuous.html#pymc3.distributions.continuous.Uniform)

In [11]:
prevalence = 0.05
case_prop = sum(y) / y.shape[0]
iteration = 2000
seed = 1
n_chain = 10

In [12]:
def get_model(y, X, pi0 = pi_varbvs, mu = mu_varbvs, sigma = si_varbvs):
    invlogit = lambda x: 1/(1 + tt.exp(-x))
    model = pm.Model()
    with model:
        xi = pm.Bernoulli('xi', pi0, shape = X.shape[1]) # inclusion indicator for each variable; pi0 is the prior inclusion probability
        # alpha = pm.Normal('alpha', mu = mu_intercept, sd = sigma_intercept) # Intercept
        alpha = pm.distributions.continuous.Uniform("alpha", lower = np.log(prevalence / (1-prevalence)), upper = np.log(case_prop / (1-case_prop)))
        beta = pm.Normal('beta', mu = mu, sd = sigma, shape = X.shape[1]) # Prior for the non-zero coefficients
        p = pm.math.dot(X, xi * beta) # Deterministic function to map the stochastics to the output
        y_obs = pm.Bernoulli('y_obs', invlogit(p + alpha), observed = y)  # Data likelihood
    return model

In [13]:
model = get_model(y,X)

In [15]:
model

<pymc3.model.Model at 0x7f60a52034d0>

## Sampling

I still need to read more of the `PyMC3` documentation to do proper sampling diagnostics, but in general it helps to run multiple chains.
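One reason multiple chains help is that they allow a between-chain versus within-chain comparison. Below is a minimal NumPy sketch of the classic Gelman-Rubin statistic (R-hat) for a single scalar parameter. The function name and the toy chains are hypothetical, and this is not necessarily the exact variant that `PyMC3` reports.

```python
import numpy as np

def gelman_rubin(chains):
    """Classic potential scale reduction factor for an array of shape (n_chains, n_draws)."""
    chains = np.asarray(chains, dtype=float)
    m, n = chains.shape
    if m < 2:
        raise ValueError("R-hat needs at least two chains")
    within = chains.var(axis=1, ddof=1).mean()        # W: mean within-chain variance
    between = n * chains.mean(axis=1).var(ddof=1)     # B: scaled variance of chain means
    var_hat = (n - 1) / n * within + between / n      # pooled posterior variance estimate
    return np.sqrt(var_hat / within)

rng = np.random.default_rng(0)
# Four chains that agree with each other ...
mixed = rng.normal(0.0, 1.0, size=(4, 2000))
# ... and four chains where two are stuck around a different value
stuck = mixed + np.array([[0.0], [0.0], [3.0], [3.0]])

print("well mixed:", gelman_rubin(mixed))
print("stuck:", gelman_rubin(stuck))
```

Values close to 1 suggest the chains agree; values well above 1 suggest they have not converged to the same distribution. With a single chain this check is not possible, which is what the sampler message about convergence checks refers to.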

In [16]:
# One chain, 2000 draws after 500 tuning steps; with a single chain the sampler runs sequentially regardless of `cores`.
# (A run with 10 chains on 20 cores of my 40-core desktop took about 50 min.)
with model:
    trace1 = pm.sample(iteration, random_seed = seed, cores = 8, progressbar = True, chains = 1, tune = 500)

Sequential sampling (1 chains in 1 job)
CompoundStep
>BinaryGibbsMetropolis: [xi]
>NUTS: [beta, alpha]
Sampling chain 0, 40 divergences: 100%|██████████| 2500/2500 [01:09<00:00, 35.91it/s]
There were 40 divergences after tuning. Increase `target_accept` or reparameterize.
The acceptance probability does not match the target. It is 0.41453061196241514, but should be close to 0.8. Try to increase the number of tuning steps.
Only one chain was sampled, this makes it impossible to run some convergence checks


In [17]:
with model:
    trace2 = pm.sample(iteration, random_seed = 2, cores = 8, progressbar = True, chains = 1, tune = 500)

Sequential sampling (1 chains in 1 job)
CompoundStep
>BinaryGibbsMetropolis: [xi]
>NUTS: [beta, alpha]
Sampling chain 0, 3 divergences: 100%|██████████| 2500/2500 [01:54<00:00, 21.89it/s]
There were 3 divergences after tuning. Increase `target_accept` or reparameterize.
Only one chain was sampled, this makes it impossible to run some convergence checks


In [29]:
type(trace2["xi"])

numpy.ndarray

In [23]:
len(trace1), len(trace2)

(2000, 2000)

In [71]:
import numpy as np
a = np.array([[1, 2, 3], [7,8,9]])

In [72]:
b = np.array([[2, 3, 4], [5,6,7]])
a = np.concatenate((a,b))

In [73]:
a

array([[1, 2, 3],
       [7, 8, 9],
       [2, 3, 4],
       [5, 6, 7]])

In [27]:
pd.DataFrame({'inclusion_probability': np.apply_along_axis(np.mean, 0, trace['xi']),
                        'beta': np.apply_along_axis(np.mean, 0, np.multiply(trace["beta"], trace["xi"])),
                        'beta_given_inclusion': np.apply_along_axis(np.sum, 0, trace['xi'] * trace['beta']) / np.apply_along_axis(np.sum, 0, trace['xi'])
                        })

Unnamed: 0,inclusion_probability,beta,beta_given_inclusion
0,0.470313,0.625786,1.330574
1,0.258188,0.330091,1.278493
2,0.175187,0.217035,1.238875
3,0.15775,0.190972,1.210597
4,0.015437,0.002756,0.178513


In [36]:
(trace["alpha"])

array([-0.01582384, -0.01582384, -0.01482289, ..., -0.02639111,
       -0.01838102, -0.01130768])

    Sequential sampling (1 chains in 1 job)
    CompoundStep
    >BinaryGibbsMetropolis: [xi]
    >NUTS: [beta, alpha]
    Sampling chain 0, 0 divergences: 100%|██████████| 2500/2500 [02:33<00:00, 16.23it/s]
    The acceptance probability does not match the target. It is 0.8961356742718769, but should be close to 0.8. Try to increase the number of tuning steps.
    Only one chain was sampled, this makes it impossible to run some convergence checks

## Results

This summarizes the posterior samples into the following quantities: the PIP (posterior inclusion probability), $\tilde{b}$ (posterior mean of $b$) and $\tilde{\mu}$ (posterior mean of $b$ given inclusion, $\xi=1$).
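The three summaries can be sketched on toy arrays as follows. The helper name `summarize` and the toy draws are made up for illustration; the inputs are assumed to have shape `(n_draws, n_features)`, like `trace['xi']` and `trace['beta']`.

```python
import numpy as np

def summarize(xi, beta):
    """Posterior summaries from indicator draws xi and coefficient draws beta."""
    xi = np.asarray(xi, dtype=float)
    beta = np.asarray(beta, dtype=float)
    pip = xi.mean(axis=0)                      # posterior inclusion probability
    b_tilde = (xi * beta).mean(axis=0)         # posterior mean of the effect
    n_included = xi.sum(axis=0)
    # The conditional mean is undefined for a feature that is never included: keep NaN there
    mu_tilde = np.full(xi.shape[1], np.nan)
    np.divide((xi * beta).sum(axis=0), n_included, out=mu_tilde, where=n_included > 0)
    return {"inclusion_probability": pip, "beta": b_tilde, "beta_given_inclusion": mu_tilde}

# Toy draws: 4 samples, 3 features (made-up numbers)
xi_toy = np.array([[1, 0, 0], [1, 1, 0], [0, 1, 0], [1, 0, 0]])
beta_toy = np.array([[1.0, 0.5, 0.2], [2.0, 1.5, 0.1], [0.5, 0.5, 0.3], [3.0, 0.7, 0.4]])
out = summarize(xi_toy, beta_toy)
print(out)
```

Whenever a feature is included at least once, $\tilde{b}$ equals the PIP times $\tilde{\mu}$, which is a handy consistency check on a results table. Guarding the division avoids a divide-by-zero warning for features with a PIP of exactly 0.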

In [28]:
xi_draws, beta_draws = trace['xi'], trace['beta']
results = pd.DataFrame({'inclusion_probability': xi_draws.mean(axis = 0),
                        'beta': (xi_draws * beta_draws).mean(axis = 0),
                        'beta_given_inclusion': (xi_draws * beta_draws).sum(axis = 0) / xi_draws.sum(axis = 0)
                        })

https://stackoverflow.com/questions/49825216/what-is-a-chain-in-pymc3

https://discourse.pymc.io/t/warning-when-nuts-probability-is-greater-than-acceptance-level/594

https://stats.stackexchange.com/questions/388230/pymc3-acceptance-probabilities-and-divergencies-after-tuning

https://peerj.com/articles/cs-55/

In [29]:
## chain = 10, seed = 1, block_140_144, tune = 1500
results.sort_values('inclusion_probability', ascending = False)

Unnamed: 0,inclusion_probability,beta,beta_given_inclusion
0,0.8037,0.677023,0.842382
1,0.7265,0.567733,0.781463
3,0.71665,0.555855,0.77563
2,0.7104,0.550974,0.775583
4,0.1845,0.142737,0.77364


In [20]:
## chain = 1, seed = 1, block_140_144, tune = 1500
results.sort_values('inclusion_probability', ascending = False)

Unnamed: 0,inclusion_probability,beta,beta_given_inclusion
1,0.3645,0.474191,1.300935
0,0.283,0.368025,1.300441
3,0.255,0.329316,1.291437
2,0.1505,0.180362,1.198416
4,0.008,0.001532,0.191505


In [20]:
## chain = 1, seed = 1, block_140_144, normal, tune = 500
results.sort_values('inclusion_probability', ascending = False)

Unnamed: 0,inclusion_probability,beta,beta_given_inclusion
2,0.5675,1.263044,2.225628
1,0.4285,0.963057,2.247508
0,0.384,0.79049,2.058567
3,0.167,0.311862,1.867439
4,0.019,0.005971,0.314271


In [18]:
## chain = 1, seed = 1, block_140_144, uniform prior
results.sort_values('inclusion_probability', ascending = False)

Unnamed: 0,inclusion_probability,beta,beta_given_inclusion
3,0.373,0.496463,1.331001
2,0.3475,0.44549,1.281986
0,0.223,0.279563,1.253644
1,0.124,0.145075,1.169956
4,0.013,0.001742,0.134025


In [18]:
## chain = 10, seed = 999, block_1750_1772
results.sort_values('inclusion_probability', ascending = False)

Unnamed: 0,inclusion_probability,beta,beta_given_inclusion
3,0.6007,0.46619,0.776077
2,0.6006,0.466211,0.776242
6,0.5853,0.465344,0.795052
7,0.5448,0.422856,0.776168
8,0.52755,0.409165,0.775595
9,0.52145,0.404821,0.776336
5,0.51605,0.402167,0.779319
4,0.50115,0.3886,0.775416
15,0.15485,0.119549,0.772028
17,0.15465,0.119348,0.771728


In [26]:
## chain = 10, seed = 1, block_1750_1772
results.sort_values('inclusion_probability', ascending = False)

Unnamed: 0,inclusion_probability,beta,beta_given_inclusion
6,0.74985,0.588339,0.784609
3,0.60195,0.467173,0.776099
5,0.5499,0.432061,0.785709
8,0.51655,0.400206,0.774768
4,0.5012,0.38871,0.775558
7,0.4876,0.377942,0.775107
2,0.40195,0.31129,0.774449
9,0.3266,0.253161,0.775139
12,0.1621,0.125457,0.773948
15,0.16175,0.124851,0.771877


In [28]:
sum(results["inclusion_probability"])

5.764049999999998

In [15]:
pm.summary(trace)

Unnamed: 0,mean,sd,mc_error,hpd_2.5,hpd_97.5
xi__0,0.022,0.146683,0.003763,0.0,0.0
xi__1,0.0395,0.194781,0.004433,0.0,0.0
xi__2,0.47,0.499099,0.011489,0.0,1.0
xi__3,0.0195,0.138274,0.003154,0.0,0.0
xi__4,0.0235,0.151485,0.003497,0.0,0.0
alpha,-0.001069,0.017164,0.000361,-0.033539,0.031783
beta__0,0.769954,0.828415,0.018496,-0.886062,2.319919
beta__1,0.773492,0.849241,0.018786,-0.832353,2.454229
beta__2,1.002387,0.707638,0.017866,-0.421315,2.34806
beta__3,0.806046,0.824134,0.018227,-0.835216,2.351347


## Mixture normal model
[link](https://docs.pymc.io/notebooks/api_quickstart.html)

[link 2](https://www.ritchievink.com/blog/2018/06/05/clustering-data-with-dirichlet-mixtures-in-edward-and-pymc3/)

In [34]:
import pymc3 as pm
import theano.tensor as tt
mu2 = 0
sigma2 = 1e-8
w_mix = pm.floatX([1-pi_varbvs, pi_varbvs])
mu_mix = pm.floatX([mu2, mu_varbvs])
sigma_mix = pm.floatX([sigma2, si_varbvs])

def get_mix_model(y, X, mu_intercept = 0, sigma_intercept = 1.5):
    invlogit = lambda x: 1/(1 + tt.exp(-x))
    model = pm.Model()
    with model:
        alpha = pm.Normal('alpha', mu = mu_intercept, sd = sigma_intercept)
        beta = pm.NormalMixture("beta", w = w_mix, mu = mu_mix, sigma = sigma_mix, shape = X.shape[1])
        p = pm.math.dot(X, beta)
        y_obs = pm.Bernoulli('y_obs', invlogit(p + alpha), observed = y)
    return model

In [35]:
model1 = get_mix_model(y,X)

In [36]:
model1

<pymc3.model.Model at 0x7f789b50fa50>

## Sampling

In [37]:
with model1:
    trace1 = pm.sample(2000, random_seed = 999, cores = 1, progressbar = True, chains = 1)

Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Sequential sampling (1 chains in 1 job)
NUTS: [beta, alpha]
100%|██████████| 2500/2500 [00:47<00:00, 52.42it/s]
Only one chain was sampled, this makes it impossible to run some convergence checks


## Results

In [38]:
pm.summary(trace1)

Unnamed: 0,mean,sd,mc_error,hpd_2.5,hpd_97.5
alpha,-0.001635,0.01774,0.000338,-0.035965,0.033394
beta__0,0.036425,0.756459,0.014511,-1.28289,1.605263
beta__1,0.05802,0.764307,0.015561,-1.371872,1.564398
beta__2,1.201155,0.464218,0.009858,0.368932,2.162282
beta__3,0.278193,0.639481,0.014458,-0.983561,1.507265
beta__4,0.064904,0.756572,0.016012,-1.443134,1.465203


In [39]:
results1 = pd.DataFrame({#'inclusion_probability': np.apply_along_axis(np.mean, 0, trace1['xi']),
                         'beta': np.apply_along_axis(np.mean, 0, trace1['beta']),})
                         #'beta_given_inclusion': np.apply_along_axis(np.sum, 0, trace1['xi']*trace1['beta']) / np.apply_along_axis(np.sum, 0, trace1['xi'])})

In [40]:
results1

Unnamed: 0,beta
0,0.036425
1,0.05802
2,1.201155
3,0.278193
4,0.064904


In [46]:
trace1["beta"][:20]

array([[ 0.57544786, -0.83577541,  1.00803471,  0.90092193,  0.5007824 ],
       [ 0.53493692, -0.73696289,  0.74450684,  1.08871973,  0.62139148],
       [-0.53655792, -1.81085826,  1.56537819,  0.7579375 ,  0.45450664],
       [ 0.28066971, -0.54830376,  1.47187974,  1.06650231,  0.04784993],
       [ 1.02154957, -0.34109855,  1.11824843,  0.67996458, -0.32189377],
       [ 0.07639977,  1.7270387 ,  1.21956114, -0.15765375,  0.93010689],
       [-0.09135544,  0.93185076,  1.32433051,  0.26501064,  0.613627  ],
       [-0.09202819, -0.78502477,  0.89406565, -0.18790993,  0.15688979],
       [ 0.77443531,  0.23094802,  0.91258328,  0.55355652,  0.49466225],
       [ 1.02393264, -0.116118  ,  1.41482005,  0.82841188, -0.34933298],
       [ 1.6052633 , -0.46625997,  0.60527141, -0.13718251, -0.6633943 ],
       [-1.20039427, -0.3512295 ,  1.37466886,  0.82097067, -0.36497798],
       [ 0.59981736, -0.11575713,  0.95179546, -0.87212803, -0.3211485 ],
       [-1.21500965, -0.29185475,  1.1

In [None]:
trace1["xi"][:10]

## Mixture normal model 2
Difference between multidimensional Gaussian (multivariate Gaussian) and Mixture normal [link](https://stats.stackexchange.com/questions/319954/whats-the-difference-between-multivariate-gaussian-and-mixture-of-gaussians)

In [53]:
import pymc3 as pm
import theano.tensor as tt
from scipy.stats import bernoulli
mu2 = 0
sigma2 = 1e-6

def get_mix_model2(y, X, mu_intercept = 0, sigma_intercept = 1.5):
    invlogit = lambda x: 1/(1 + tt.exp(-x))
    model = pm.Model()
    with model:
        alpha = pm.Normal('alpha', mu = mu_intercept, sd = sigma_intercept)
        xi = pm.Bernoulli('xi', pi_varbvs, shape = X.shape[1])
        beta1 = pm.Normal("beta1", mu2, sigma2, shape = X.shape[1])
        beta2 = pm.Normal("beta2", mu_varbvs, si_varbvs, shape = X.shape[1])
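        # NOTE: as written, xi = 1 picks the spike (beta1) and xi = 0 picks the slab (beta2),
        # so pi_varbvs acts as the prior probability of the spike and mean(xi) is not a PIP.
        p = pm.math.dot(X, xi*beta1 + (1-xi)*beta2)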
        y_obs = pm.Bernoulli('y_obs', invlogit(p + alpha), observed = y)
    return model

In [54]:
model2 = get_mix_model2(y,X)

In [55]:
model2

<pymc3.model.Model at 0x7f78a8159fd0>

In [56]:
with model2:
    trace2 = pm.sample(2000, random_seed = 999, cores = 1, progressbar = True, chains = 1)

Sequential sampling (1 chains in 1 job)
CompoundStep
>NUTS: [beta2, beta1, alpha]
>BinaryGibbsMetropolis: [xi]
100%|██████████| 2500/2500 [01:43<00:00, 24.23it/s]
Only one chain was sampled, this makes it impossible to run some convergence checks


In [57]:
pm.summary(trace2)

Unnamed: 0,mean,sd,mc_error,hpd_2.5,hpd_97.5
alpha,-0.001792694,0.01633489,0.0002424184,-0.036442,0.027095
xi__0,0.0605,0.2384109,0.006793195,0.0,1.0
xi__1,0.0685,0.252602,0.006190921,0.0,1.0
xi__2,0.008,0.08908423,0.002204541,0.0,0.0
xi__3,0.0785,0.2689568,0.007976685,0.0,1.0
xi__4,0.0755,0.2641964,0.00769724,0.0,1.0
beta1__0,-2.850403e-08,1.001414e-06,1.455378e-08,-2e-06,2e-06
beta1__1,-2.006287e-08,9.63056e-07,1.521084e-08,-2e-06,2e-06
beta1__2,-1.04663e-08,9.85171e-07,1.594965e-08,-2e-06,2e-06
beta1__3,1.344149e-08,1.025941e-06,1.511805e-08,-2e-06,2e-06


In [60]:
results2 = pd.DataFrame({'inclusion_probability': np.apply_along_axis(np.mean, 0, trace2['xi'])})
                         #'beta': np.apply_along_axis(np.mean, 0, trace2['beta']),})
                         #'beta_given_inclusion': np.apply_along_axis(np.sum, 0, trace2['xi']*trace2['beta']) / np.apply_along_axis(np.sum, 0, trace2['xi'])})

In [61]:
results2

Unnamed: 0,inclusion_probability
0,0.0605
1,0.0685
2,0.008
3,0.0785
4,0.0755


In [None]:
trace2["beta"]

In [49]:
np.arange(5)

array([0, 1, 2, 3, 4])

In [50]:
np.ones(5)

array([1., 1., 1., 1., 1.])

In [578]:
## 2177_2182
## 1,2,3
## 2.51, 1.78, 0.60
results.sort_values('inclusion_probability', ascending = False)

Unnamed: 0,inclusion_probability,beta,beta_given_inclusion
2,0.466,1.368539,2.058897
1,0.4525,1.36964,2.071297
0,0.413,1.294798,2.016943
3,0.3945,1.264112,1.949095
4,0.069,0.822499,1.056799
5,0.0665,0.817413,1.019666


In [579]:
fisher_block = fisher.copy()[fisher["gene"].isin([f"gene_{x}" for x in range(start+1, end+2)])]
fisher_block["index"] = [int(x.split("_")[1])-start-1 for x in fisher_block["gene"]]
fisher_block

Unnamed: 0,gene,d_c,d_nc,nd_c,nd_nc,p,index
38,gene_2180,56,6650,0,6706,2.4732960000000003e-17,2
39,gene_2178,56,6650,0,6706,2.4732960000000003e-17,0
40,gene_2179,56,6650,0,6706,2.4732960000000003e-17,1
41,gene_2181,56,6650,0,6706,2.4732960000000003e-17,3
71,gene_2182,24,6682,0,6706,1.16777e-07,4
72,gene_2183,24,6682,0,6706,1.16777e-07,5


In [556]:
## 2092_2099
## 0,2,6
## 0.18, 1.71, 0.47
results.sort_values('inclusion_probability', ascending = False)

Unnamed: 0,inclusion_probability,beta,beta_given_inclusion
2,0.839,1.290242,1.388613
4,0.0785,0.789153,0.933225
0,0.068,0.799668,1.035229
1,0.0635,0.784478,0.960614
5,0.0605,0.784214,0.768279
3,0.0565,0.780303,0.741498
7,0.026,0.747943,0.092032
6,0.0235,0.761483,0.111491


In [557]:
fisher_block = fisher.copy()[fisher["gene"].isin([f"gene_{x}" for x in range(start+1, end+2)])]
fisher_block["index"] = [int(x.split("_")[1])-start-1 for x in fisher_block["gene"]]
fisher_block

Unnamed: 0,gene,d_c,d_nc,nd_c,nd_nc,p,index
108,gene_2095,36,6670,8,6698,2.5e-05,2
121,gene_2096,37,6669,12,6694,0.000459,3
122,gene_2097,37,6669,12,6694,0.000459,4
123,gene_2098,37,6669,12,6694,0.000459,5
268,gene_2100,28,6678,11,6695,0.009374,7
269,gene_2099,28,6678,11,6695,0.009374,6
291,gene_2094,9,6697,1,6705,0.021439,1
293,gene_2093,9,6697,1,6705,0.021439,0


In [513]:
## block 1858_1875
## real 14
results.sort_values('inclusion_probability', ascending = False)

Unnamed: 0,inclusion_probability,beta,beta_given_inclusion
14,0.1635,0.885465,1.445334
13,0.1415,0.873469,1.438782
3,0.095,0.808904,1.114097
11,0.091,0.79945,1.168663
0,0.09,0.805384,1.061036
9,0.09,0.806278,1.133716
12,0.0895,0.811432,1.072543
10,0.087,0.809282,1.182747
15,0.083,0.776091,1.053188
8,0.081,0.827899,1.242092


In [514]:
fisher_block = fisher.copy()[fisher["gene"].isin([f"gene_{x}" for x in range(start+1, end+2)])]
fisher_block["index"] = [int(x.split("_")[1])-start-1 for x in fisher_block["gene"]]
fisher_block

Unnamed: 0,gene,d_c,d_nc,nd_c,nd_nc,p,index
283,gene_1873,7,6699,0,6706,0.015601,14
284,gene_1872,7,6699,0,6706,0.015601,13
362,gene_1859,12,6694,3,6703,0.035057,0
365,gene_1874,10,6696,2,6704,0.03849,15
370,gene_1869,8,6698,1,6705,0.039,10
371,gene_1868,8,6698,1,6705,0.039,9
372,gene_1867,8,6698,1,6705,0.039,8
373,gene_1866,8,6698,1,6705,0.039,7
374,gene_1865,8,6698,1,6705,0.039,6
375,gene_1864,8,6698,1,6705,0.039,5


In [408]:
## block 1561_1579
## real 13,16
## effect 0.23,0.86
results.sort_values('inclusion_probability', ascending = False)

Unnamed: 0,inclusion_probability,beta,beta_given_inclusion
12,0.2755,0.718949,0.56004
18,0.219,0.9068,1.37675
16,0.182,0.777204,0.807678
13,0.1285,0.751887,0.491896
14,0.127,0.725241,0.488526
11,0.1125,0.731549,0.467664
10,0.106,0.73943,0.466607
9,0.0915,0.737306,0.426145
17,0.078,0.779866,0.80895
15,0.0355,0.765793,0.475941


In [409]:
fisher_block = fisher.copy()[fisher["gene"].isin([f"gene_{x}" for x in range(start+1, end+2)])]
fisher_block["index"] = [int(x.split("_")[1])-start-1 for x in fisher_block["gene"]]
fisher_block

Unnamed: 0,gene,d_c,d_nc,nd_c,nd_nc,p,index
118,gene_1574,114,6592,65,6641,0.000281,12
119,gene_1575,120,6586,71,6635,0.000439,13
120,gene_1576,120,6586,71,6635,0.000439,14
137,gene_1572,114,6592,68,6638,0.00074,10
138,gene_1573,114,6592,68,6638,0.00074,11
142,gene_1578,43,6663,17,6689,0.001039,16
143,gene_1571,106,6600,63,6643,0.001088,9
191,gene_1577,76,6630,44,6662,0.004281,15
285,gene_1568,54,6652,31,6675,0.01617,6
286,gene_1570,54,6652,31,6675,0.01617,8


In [616]:
tmp = readRDS("/home/min/GIT/cnv-gene-mapping/data/deletion_simu/block_1430_1453/deletion.genes.block30.for_simu.sample.genes.block_1430_1453.SuSiE.L_1.prior_0p005.susie.rds")

In [618]:
head(tmp$pip)

In [380]:
## block 1430_1453
## real 2, 12, 21
## effect 1.23, 0.95, 1.15
results.sort_values('inclusion_probability', ascending = False)

Unnamed: 0,inclusion_probability,beta,beta_given_inclusion
8,0.139,0.859537,1.409517
13,0.1235,0.838317,1.292621
22,0.123,0.849467,1.318009
3,0.1205,0.864911,1.408754
15,0.1185,0.838471,1.309034
1,0.1185,0.848042,1.468499
19,0.1175,0.838416,1.324686
10,0.1145,0.82606,1.336911
14,0.112,0.821494,1.273331
5,0.1115,0.838728,1.289816


In [615]:
fisher_block = fisher.copy()[fisher["gene"].isin([f"gene_{x}" for x in range(start+1, end+2)])]
fisher_block["index"] = [int(x.split("_")[1])-start-1 for x in fisher_block["gene"]]
fisher_block

Unnamed: 0,gene,d_c,d_nc,nd_c,nd_nc,p,index
73,gene_1450,23,6683,0,6706,2.339559e-07,19
74,gene_1441,23,6683,0,6706,2.339559e-07,10
75,gene_1449,23,6683,0,6706,2.339559e-07,18
76,gene_1452,23,6683,0,6706,2.339559e-07,21
77,gene_1448,23,6683,0,6706,2.339559e-07,17
78,gene_1447,23,6683,0,6706,2.339559e-07,16
79,gene_1446,23,6683,0,6706,2.339559e-07,15
80,gene_1445,23,6683,0,6706,2.339559e-07,14
81,gene_1444,23,6683,0,6706,2.339559e-07,13
82,gene_1443,23,6683,0,6706,2.339559e-07,12


In [336]:
## block 1361_1377
## real 1,15
## effect 1.18, 0.74
results.sort_values('inclusion_probability', ascending = False)

Unnamed: 0,inclusion_probability,beta,beta_given_inclusion
15,0.1235,0.82304,1.098096
14,0.117,0.818803,1.128166
16,0.112,0.793767,1.062181
5,0.073,0.805364,0.864683
13,0.073,0.780449,0.877235
12,0.071,0.793114,1.007977
1,0.0705,0.784433,0.833734
6,0.0695,0.798761,0.965133
10,0.0665,0.802419,0.943085
4,0.065,0.784205,0.932707


In [337]:
fisher_block = fisher.copy()[fisher["gene"].isin([f"gene_{x}" for x in range(start+1, end+2)])]
fisher_block["index"] = [int(x.split("_")[1])-start-1 for x in fisher_block["gene"]]
fisher_block

Unnamed: 0,gene,d_c,d_nc,nd_c,nd_nc,p,index
277,gene_1378,16,6690,4,6702,0.011757,16
278,gene_1377,16,6690,4,6702,0.011757,15
279,gene_1376,16,6690,4,6702,0.011757,14
350,gene_1363,12,6694,3,6703,0.035057,1
351,gene_1364,12,6694,3,6703,0.035057,2
352,gene_1366,12,6694,3,6703,0.035057,4
353,gene_1367,12,6694,3,6703,0.035057,5
354,gene_1368,12,6694,3,6703,0.035057,6
355,gene_1365,12,6694,3,6703,0.035057,3
356,gene_1370,12,6694,3,6703,0.035057,8


In [321]:
## block 1320_1337
## real 5,7
## effect 1.71, 0.39
results.sort_values('inclusion_probability', ascending = False)

Unnamed: 0,inclusion_probability,beta,beta_given_inclusion
4,0.272,1.004289,1.613462
5,0.2065,0.937953,1.517386
3,0.1765,0.876835,1.484768
7,0.1695,0.912997,1.485643
8,0.168,0.914467,1.568524
6,0.1645,0.883832,1.499749
1,0.117,0.813945,1.137207
2,0.1005,0.823522,1.178037
0,0.1,0.824287,1.083487
14,0.081,0.804299,1.205732


In [322]:
fisher_block = fisher.copy()[fisher["gene"].isin([f"gene_{x}" for x in range(start+1, end+2)])]
fisher_block["index"] = [int(x.split("_")[1])-start-1 for x in fisher_block["gene"]]
fisher_block

Unnamed: 0,gene,d_c,d_nc,nd_c,nd_nc,p,index
43,gene_1326,58,6648,6,6700,8.171839e-12,5
44,gene_1325,58,6648,6,6700,8.171839e-12,4
45,gene_1324,58,6648,6,6700,8.171839e-12,3
46,gene_1329,58,6648,6,6700,8.171839e-12,8
47,gene_1327,58,6648,6,6700,8.171839e-12,6
48,gene_1328,58,6648,6,6700,8.171839e-12,7
49,gene_1330,47,6659,5,6701,1.204229e-09,9
51,gene_1321,37,6669,2,6704,2.71726e-09,0
52,gene_1322,37,6669,2,6704,2.71726e-09,1
53,gene_1323,37,6669,2,6704,2.71726e-09,2


In [620]:
tmp1 = readRDS("/home/min/GIT/cnv-gene-mapping/data/deletion_simu/block_1109_1141/deletion.genes.block30.for_simu.sample.genes.block_1109_1141.SuSiE.L_1.prior_0p005.susie.rds")

In [621]:
tmp1$pip

In [264]:
## block 1109_1141
## real 29
## effect 2.05
results.sort_values('inclusion_probability', ascending = False)

Unnamed: 0,inclusion_probability,beta,beta_given_inclusion
27,1.0,1.849512,1.849512
0,0.4515,0.794093,0.790931
11,0.145,0.883395,1.379335
12,0.1345,0.828012,1.324679
10,0.1285,0.845456,1.402063
13,0.108,0.815385,1.315768
8,0.1005,0.826173,1.203562
6,0.1,0.793537,1.124176
7,0.0905,0.794219,1.143379
3,0.09,0.745811,0.706926


In [265]:
fisher_block = fisher.copy()[fisher["gene"].isin([f"gene_{x}" for x in range(start+1, end+2)])]
fisher_block["index"] = [int(x.split("_")[1])-start-1 for x in fisher_block["gene"]]
fisher_block

Unnamed: 0,gene,d_c,d_nc,nd_c,nd_nc,p,index
6,gene_1138,502,6204,79,6627,2.861096e-79,28
7,gene_1139,502,6204,79,6627,2.861096e-79,29
8,gene_1137,502,6204,79,6627,2.861096e-79,27
50,gene_1140,91,6615,27,6679,2.307133e-09,30
106,gene_1141,44,6662,11,6695,8.356271e-06,31
107,gene_1142,44,6662,11,6695,8.356271e-06,32
109,gene_1122,15,6691,0,6706,6.055868e-05,12
110,gene_1121,15,6691,0,6706,6.055868e-05,11
111,gene_1123,15,6691,0,6706,6.055868e-05,13
112,gene_1120,15,6691,0,6706,6.055868e-05,10


In [None]:
### selective

In [234]:
## block 1018_1031
## real 2, 13
## effect 0.60, 1.07
results.sort_values('inclusion_probability', ascending = False)

Unnamed: 0,inclusion_probability,beta,beta_given_inclusion
2,0.304,0.926048,1.296915
5,0.065,0.76765,0.80155
8,0.064,0.777537,0.820872
9,0.0635,0.773069,0.86668
10,0.0585,0.7912,0.832734
3,0.058,0.775015,0.843443
12,0.058,0.796682,0.831028
13,0.058,0.780965,0.919135
4,0.0505,0.768463,0.766376
6,0.0495,0.774164,0.727036


In [233]:
fisher_block = fisher.copy()[fisher["gene"].isin([f"gene_{x}" for x in range(start+1, end+2)])]
fisher_block["index"] = [int(x.split("_")[1])-start-1 for x in fisher_block["gene"]]
fisher_block

Unnamed: 0,gene,d_c,d_nc,nd_c,nd_nc,p,index
280,gene_1021,14,6692,3,6703,0.012672,2
556,gene_1023,7,6699,2,6704,0.179541,4
557,gene_1024,7,6699,2,6704,0.179541,5
565,gene_1022,7,6699,2,6704,0.179541,3
566,gene_1025,7,6699,2,6704,0.179541,6
568,gene_1027,7,6699,2,6704,0.179541,8
569,gene_1030,7,6699,2,6704,0.179541,11
570,gene_1031,7,6699,2,6704,0.179541,12
571,gene_1032,7,6699,2,6704,0.179541,13
572,gene_1029,7,6699,2,6704,0.179541,10


In [206]:
## block 930_937
## real 2
## effect 0.96
results.sort_values('inclusion_probability', ascending = False)

Unnamed: 0,inclusion_probability,beta,beta_given_inclusion
2,0.1175,0.848943,1.234252
7,0.113,0.814002,1.313239
4,0.112,0.846596,1.367156
1,0.1105,0.833119,1.239662
6,0.1095,0.815812,1.231239
5,0.102,0.82971,1.340558
0,0.096,0.826704,1.339139
3,0.0925,0.814909,1.330195


In [207]:
fisher_block = fisher.copy()[fisher["gene"].isin([f"gene_{x}" for x in range(start+1, end+2)])]
fisher_block["index"] = [int(x.split("_")[1])-start-1 for x in fisher_block["gene"]]
fisher_block

Unnamed: 0,gene,d_c,d_nc,nd_c,nd_nc,p,index
516,gene_932,4,6702,0,6706,0.124944,1
524,gene_938,4,6702,0,6706,0.124944,7
527,gene_937,4,6702,0,6706,0.124944,6
532,gene_935,4,6702,0,6706,0.124944,4
533,gene_934,4,6702,0,6706,0.124944,3
534,gene_933,4,6702,0,6706,0.124944,2
536,gene_931,4,6702,0,6706,0.124944,0
537,gene_936,4,6702,0,6706,0.124944,5


In [192]:
## block 893_910
## real 11, 15
## effect 1.92, 1.48
results.sort_values('inclusion_probability', ascending = False)

Unnamed: 0,inclusion_probability,beta,beta_given_inclusion
7,0.1615,0.883286,1.481295
13,0.1575,0.90757,1.529887
3,0.154,0.894071,1.511624
14,0.154,0.864708,1.537633
15,0.1535,0.896044,1.491993
8,0.1465,0.896886,1.547793
0,0.1435,0.891985,1.436742
10,0.143,0.871303,1.446187
4,0.1385,0.859715,1.509538
9,0.133,0.862674,1.454693


In [193]:
fisher_block = fisher.copy()[fisher["gene"].isin([f"gene_{x}" for x in range(start+1, end+2)])]
fisher_block["index"] = [int(x.split("_")[1])-start-1 for x in fisher_block["gene"]]
fisher_block

Unnamed: 0,gene,d_c,d_nc,nd_c,nd_nc,p,index
55,gene_895,25,6681,0,6706,5.828382e-08,1
56,gene_894,25,6681,0,6706,5.828382e-08,0
57,gene_896,25,6681,0,6706,5.828382e-08,2
58,gene_898,25,6681,0,6706,5.828382e-08,4
59,gene_899,25,6681,0,6706,5.828382e-08,5
60,gene_900,25,6681,0,6706,5.828382e-08,6
61,gene_901,25,6681,0,6706,5.828382e-08,7
62,gene_897,25,6681,0,6706,5.828382e-08,3
63,gene_903,25,6681,0,6706,5.828382e-08,9
64,gene_904,25,6681,0,6706,5.828382e-08,10


In [178]:
## block 841_870
## real 4, 17
## effect 0.74, 1.45
results.sort_values('inclusion_probability', ascending = False)

Unnamed: 0,inclusion_probability,beta,beta_given_inclusion
22,0.093,0.818293,1.287559
10,0.0915,0.815534,1.254016
19,0.0905,0.812697,1.240249
1,0.0895,0.813472,1.153967
17,0.089,0.815766,1.194412
21,0.087,0.812563,1.171672
20,0.0865,0.787368,1.109925
2,0.086,0.827101,1.223039
4,0.0845,0.824216,1.109448
24,0.083,0.819679,1.168925


In [179]:
fisher_block = fisher.copy()[fisher["gene"].isin([f"gene_{x}" for x in range(start+1, end+2)])]
fisher_block["index"] = [int(x.split("_")[1])-start-1 for x in fisher_block["gene"]]
fisher_block

Unnamed: 0,gene,d_c,d_nc,nd_c,nd_nc,p,index
236,gene_849,8,6698,0,6706,0.007796,7
237,gene_870,8,6698,0,6706,0.007796,28
238,gene_848,8,6698,0,6706,0.007796,6
239,gene_847,8,6698,0,6706,0.007796,5
240,gene_845,8,6698,0,6706,0.007796,3
241,gene_844,8,6698,0,6706,0.007796,2
242,gene_843,8,6698,0,6706,0.007796,1
243,gene_842,8,6698,0,6706,0.007796,0
244,gene_851,8,6698,0,6706,0.007796,9
245,gene_852,8,6698,0,6706,0.007796,10


In [622]:
tmp2 = readRDS("/home/min/GIT/cnv-gene-mapping/data/deletion_simu/block_788_797/deletion.genes.block30.for_simu.sample.genes.block_788_797.SuSiE.L_1.prior_0p005.susie.rds")

In [627]:
tmp2$pip

In [164]:
## block 788_797
## real 7
## effect 1.808
results.sort_values('inclusion_probability', ascending = False)

Unnamed: 0,inclusion_probability,beta,beta_given_inclusion
7,0.132,0.82523,1.093129
2,0.1145,0.810917,1.147125
5,0.1115,0.823439,1.219508
1,0.1105,0.818534,1.139817
4,0.1095,0.83774,1.163126
8,0.1095,0.807095,1.194249
0,0.1065,0.817155,1.143092
9,0.105,0.798071,1.205505
3,0.104,0.825068,1.111863
6,0.0995,0.821848,1.072202


In [165]:
fisher_block = fisher[fisher["gene"].isin([f"gene_{x}" for x in range(start+1, end+2)])].copy()
fisher_block["index"] = [int(x.split("_")[1])-start-1 for x in fisher_block["gene"]]
fisher_block

Unnamed: 0,gene,d_c,d_nc,nd_c,nd_nc,p,index
295,gene_794,11,6695,2,6704,0.022398,5
301,gene_793,11,6695,2,6704,0.022398,4
310,gene_792,11,6695,2,6704,0.022398,3
311,gene_791,11,6695,2,6704,0.022398,2
313,gene_795,11,6695,2,6704,0.022398,6
316,gene_789,11,6695,2,6704,0.022398,0
317,gene_790,11,6695,2,6704,0.022398,1
328,gene_796,11,6695,2,6704,0.022398,7
331,gene_798,11,6695,2,6704,0.022398,9
332,gene_797,11,6695,2,6704,0.022398,8


In [86]:
## block 739_743
## real 2
## effect 0.98
results.sort_values('inclusion_probability', ascending = False)

Unnamed: 0,inclusion_probability,beta,beta_given_inclusion
2,0.493,1.0007,1.232291
0,0.036,0.726913,0.263216
3,0.035,0.753536,0.516642
1,0.0295,0.787234,0.411099
4,0.026,0.768642,0.450941


In [121]:
fisher_block = fisher[fisher["gene"].isin([f"gene_{x}" for x in range(start+1, end+2)])].copy()
fisher_block["index"] = [int(x.split("_")[1])-start-1 for x in fisher_block["gene"]]
fisher_block

Unnamed: 0,gene,d_c,d_nc,nd_c,nd_nc,p,index
274,gene_742,16,6690,4,6702,0.011757,2
1384,gene_743,5,6701,3,6703,0.726481,3
2184,gene_744,3,6703,2,6704,1.0,4
2185,gene_741,3,6703,2,6704,1.0,1
2186,gene_740,3,6703,2,6704,1.0,0


In [None]:
### selective

In [628]:
tmp3 = readRDS("/home/min/GIT/cnv-gene-mapping/data/deletion_simu/block_666_677/deletion.genes.block30.for_simu.sample.genes.block_666_677.SuSiE.L_1.prior_0p005.susie.rds")

In [629]:
tmp3$pip

In [74]:
## block 666_677
## real 4, 5, 11
## effect 1.28, 0.53, 0.98
results.sort_values('inclusion_probability', ascending = False)

Unnamed: 0,inclusion_probability,beta,beta_given_inclusion
4,0.546,1.477352,2.060144
5,0.5185,1.432267,2.04338
10,0.1325,0.84888,1.329551
9,0.127,0.854411,1.342899
11,0.108,0.823739,1.277766
8,0.03,0.750469,0.429703
0,0.024,0.744307,0.225855
7,0.023,0.77012,0.37547
1,0.0195,0.749606,0.191625
6,0.0185,0.788487,0.3676


In [120]:
fisher_block = fisher[fisher["gene"].isin([f"gene_{x}" for x in range(666+1, 677+2)])].copy()
fisher_block["index"] = [int(x.split("_")[1])-666-1 for x in fisher_block["gene"]]
fisher_block

Unnamed: 0,gene,d_c,d_nc,nd_c,nd_nc,p,index
36,gene_671,99,6607,9,6697,1.956916e-20,4
37,gene_672,99,6607,9,6697,1.956916e-20,5
102,gene_676,20,6686,0,6706,1.88048e-06,9
103,gene_678,20,6686,0,6706,1.88048e-06,11
104,gene_677,20,6686,0,6706,1.88048e-06,10
139,gene_673,23,6683,5,6701,0.0009016016,6
140,gene_674,23,6683,5,6701,0.0009016016,7
141,gene_675,23,6683,5,6701,0.0009016016,8
282,gene_667,12,6694,2,6704,0.01289476,0
345,gene_668,17,6689,6,6700,0.03453722,1


In [None]:
### selective

In [663]:
tmp = readRDS("/home/min/GIT/cnv-gene-mapping/data/deletion_simu/block_556_583/deletion.genes.block30.for_simu.sample.genes.block_556_583.SuSiE.L_1.prior_0p005.susie.rds")

In [664]:
tmp$pip

In [62]:
## block 556_583
## real 8, 14, 20
## effect 2.03, 0.84, 0.71
results.sort_values('inclusion_probability', ascending = False)

Unnamed: 0,inclusion_probability,beta,beta_given_inclusion
13,0.1765,0.912673,1.54197
17,0.168,0.899262,1.533234
10,0.1565,0.894481,1.473024
22,0.145,0.899563,1.550085
15,0.14,0.881196,1.520979
18,0.139,0.87787,1.524457
11,0.1365,0.872064,1.475609
16,0.1355,0.876335,1.551596
8,0.1335,0.881722,1.562205
20,0.1325,0.879677,1.423774


In [119]:
fisher_block = fisher[fisher["gene"].isin([f"gene_{x}" for x in range(556+1, 583+2)])].copy()
fisher_block["index"] = [int(x.split("_")[1])-556-1 for x in fisher_block["gene"]]
fisher_block

Unnamed: 0,gene,d_c,d_nc,nd_c,nd_nc,p,index
9,gene_567,132,6574,1,6705,1.296086e-38,10
10,gene_582,132,6574,1,6705,1.296086e-38,25
11,gene_581,132,6574,1,6705,1.296086e-38,24
12,gene_580,132,6574,1,6705,1.296086e-38,23
13,gene_577,132,6574,1,6705,1.296086e-38,20
14,gene_579,132,6574,1,6705,1.296086e-38,22
15,gene_576,132,6574,1,6705,1.296086e-38,19
16,gene_575,132,6574,1,6705,1.296086e-38,18
17,gene_574,132,6574,1,6705,1.296086e-38,17
18,gene_566,132,6574,1,6705,1.296086e-38,9


In [50]:
## block 380_417
## real 17, 37
## effect 0.66, 0.88
results.sort_values('inclusion_probability', ascending = False)

Unnamed: 0,inclusion_probability,beta,beta_given_inclusion
20,0.054,0.803795,0.668929
14,0.0525,0.792524,0.65988
23,0.05,0.788163,0.726686
35,0.047,0.777344,0.73914
7,0.047,0.770187,0.604312
15,0.0465,0.774596,0.735595
29,0.046,0.782624,0.709078
8,0.0455,0.768314,0.681973
33,0.045,0.760441,0.726983
6,0.0445,0.774748,0.715291


In [118]:
fisher_block = fisher[fisher["gene"].isin([f"gene_{x}" for x in range(380+1, 417+2)])].copy()
fisher_block["index"] = [int(x.split("_")[1])-380-1 for x in fisher_block["gene"]]
fisher_block

Unnamed: 0,gene,d_c,d_nc,nd_c,nd_nc,p,index
740,gene_389,6,6700,2,6704,0.288916,8
743,gene_388,6,6700,2,6704,0.288916,7
744,gene_387,6,6700,2,6704,0.288916,6
745,gene_386,6,6700,2,6704,0.288916,5
746,gene_385,6,6700,2,6704,0.288916,4
748,gene_384,6,6700,2,6704,0.288916,3
750,gene_382,6,6700,2,6704,0.288916,1
752,gene_405,6,6700,2,6704,0.288916,24
753,gene_418,6,6700,2,6704,0.288916,37
754,gene_390,6,6700,2,6704,0.288916,9


In [38]:
## block 365_374
## real 4
## effect 1.59
results.sort_values('inclusion_probability', ascending = False)

Unnamed: 0,inclusion_probability,beta,beta_given_inclusion
4,0.0965,0.795459,1.12808
3,0.0925,0.810873,1.156519
1,0.09,0.806027,1.201729
5,0.0885,0.813363,1.141432
7,0.086,0.819544,1.043773
2,0.081,0.816045,1.104243
9,0.081,0.794617,1.111514
8,0.0765,0.820308,1.141249
6,0.076,0.811247,1.101379
0,0.0705,0.802559,1.156571


In [115]:
fisher_block = fisher[fisher["gene"].isin([f"gene_{x}" for x in range(365+1, 374+2)])].copy()
fisher_block["index"] = [int(x.split("_")[1])-365-1 for x in fisher_block["gene"]]
fisher_block

Unnamed: 0,gene,d_c,d_nc,nd_c,nd_nc,p,index
505,gene_375,6,6700,1,6705,0.124902,9
506,gene_374,6,6700,1,6705,0.124902,8
507,gene_373,6,6700,1,6705,0.124902,7
508,gene_372,6,6700,1,6705,0.124902,6
509,gene_371,6,6700,1,6705,0.124902,5
510,gene_370,6,6700,1,6705,0.124902,4
511,gene_369,6,6700,1,6705,0.124902,3
513,gene_367,6,6700,1,6705,0.124902,1
514,gene_366,6,6700,1,6705,0.124902,0
515,gene_368,6,6700,1,6705,0.124902,2


In [26]:
## block 264_293
## real 8,9,29
## effect 0.92, 1.21, 1.66
results.sort_values('inclusion_probability', ascending = False)

Unnamed: 0,inclusion_probability,beta,beta_given_inclusion
29,0.2185,0.91184,1.368975
0,0.1415,0.879138,1.279202
10,0.127,0.830457,1.272394
7,0.126,0.83658,1.248508
8,0.1225,0.845818,1.323052
9,0.1175,0.842276,1.306886
14,0.0425,0.777203,0.572693
5,0.04,0.758333,0.537903
23,0.04,0.78279,0.602011
24,0.037,0.764466,0.552337


In [117]:
fisher_block = fisher[fisher["gene"].isin([f"gene_{x}" for x in range(264+1, 293+2)])].copy()
fisher_block["index"] = [int(x.split("_")[1])-264-1 for x in fisher_block["gene"]]
fisher_block

Unnamed: 0,gene,d_c,d_nc,nd_c,nd_nc,p,index
381,gene_275,8,6698,1,6705,0.039,10
382,gene_274,8,6698,1,6705,0.039,9
383,gene_273,8,6698,1,6705,0.039,8
384,gene_272,8,6698,1,6705,0.039,7
386,gene_265,8,6698,1,6705,0.039,0
439,gene_294,7,6699,1,6705,0.070231,29
706,gene_283,9,6697,4,6702,0.266611,18
707,gene_270,9,6697,4,6702,0.266611,5
708,gene_285,9,6697,4,6702,0.266611,20
709,gene_284,9,6697,4,6702,0.266611,19


In [645]:
### selective

In [647]:
## block 23_36
## real 0,2,5,11
results.sort_values('inclusion_probability', ascending = False)

Unnamed: 0,inclusion_probability,beta,beta_given_inclusion
0,0.097,0.816991,1.289321
1,0.1175,0.838144,1.2557
2,0.099,0.825578,1.192458
3,0.1105,0.835401,1.263729
4,0.1225,0.795763,1.200789
5,0.11,0.801798,1.221597
6,0.1075,0.824616,1.175793
7,0.111,0.84388,1.219808
8,0.0975,0.8273,1.229726
9,0.123,0.838705,1.268431


In [644]:
fisher_block = fisher[fisher["gene"].isin([f"gene_{x}" for x in range(start+1, end+2)])].copy()
fisher_block["index"] = [int(x.split("_")[1])-start-1 for x in fisher_block["gene"]]
fisher_block

Unnamed: 0,gene,d_c,d_nc,nd_c,nd_nc,p,index
0,gene_25,603,6103,76,6630,1.91049e-107,1
1,gene_28,603,6103,76,6630,1.91049e-107,4
2,gene_27,603,6103,76,6630,1.91049e-107,3
3,gene_26,603,6103,76,6630,1.91049e-107,2
4,gene_29,603,6103,76,6630,1.91049e-107,5
5,gene_24,528,6178,61,6645,1.710063e-97,0
29,gene_30,234,6472,41,6665,1.291823e-34,6
35,gene_31,162,6544,26,6680,1.6411380000000001e-25,7
124,gene_32,20,6686,3,6703,0.0004832617,8
125,gene_33,20,6686,3,6703,0.0004832617,9


But the true causal variables are 14 and 31 ... apparently this needs more work.
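To make this kind of comparison less manual, the rank of each known causal variable in the inclusion-probability ordering can be computed directly. The following is only a minimal sketch: `rank_of_truth` is a hypothetical helper, not part of the analysis above, and it assumes the variable index is the row label of the table. The demo numbers are the inclusion probabilities copied from the block 739_743 output above, where the true variable is 2.

```python
import pandas as pd

def rank_of_truth(results, true_idx, col="inclusion_probability"):
    """Return the 1-based rank of each true variable when sorted by `col`, descending."""
    ranks = results[col].rank(ascending=False, method="min").astype(int)
    return ranks.loc[list(true_idx)].sort_values()

# Inclusion probabilities copied from the block 739_743 output;
# the row label is the variable index within the block (an assumption).
results_demo = pd.DataFrame(
    {"inclusion_probability": [0.036, 0.0295, 0.493, 0.035, 0.026]},
    index=[0, 1, 2, 3, 4],
)

truth_ranks = rank_of_truth(results_demo, true_idx=[2])
print(truth_ranks)
```

A rank of 1 for every true variable would mean the sampler ordered them at the top; ranks far down the list flag blocks like the ones above where the signal is spread across correlated genes.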

## Some sanity checks

1. Is the posterior predictive mean $\tilde{y}$ roughly equal to the data mean?
2. What is the posterior expected number of non-zero variables?

In [13]:
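# Effect of each variable in each posterior draw: coefficient times inclusion indicator
estimate = trace['beta'] * trace['xi']
# Predicted probability for each sample, averaged over posterior draws
y_hat = expit(trace['alpha'] + np.dot(X, estimate.T)).mean(axis=1)
# Mean prediction, and the posterior expected number of included variables
print(np.mean(y_hat), np.sum(results.inclusion_probability))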

0.5000008683342422 1.13855


The inclusion probabilities sum to about 1.14, so the posterior suggests roughly one variable is involved, whereas the prior expects $0.043 \times 50 \approx 2$. **Need to check this with simulated truth; also should run `varbvs` on this and compare**.
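The two numbers compared above can be reproduced from an indicator matrix shaped like `trace['xi']`. This is a sketch under stated assumptions: the `xi` array below is a hypothetical stand-in drawn from the prior rather than a real trace, and `n_draws` is assumed. Only `pi = 0.043` and `p = 50` come from the text. It shows that summing the per-variable inclusion probabilities gives the same value as averaging the number of included variables per draw.

```python
import numpy as np

rng = np.random.default_rng(0)
pi, p = 0.043, 50      # prior inclusion probability and number of variables (from the text)
n_draws = 2000         # assumed number of posterior draws

# Hypothetical stand-in for trace['xi']: 0/1 indicators, here drawn from the prior
xi = rng.binomial(1, pi, size=(n_draws, p))

prior_expected = pi * p                       # expected number of included variables a priori
inclusion_probability = xi.mean(axis=0)       # per-variable inclusion frequency
expected_size = inclusion_probability.sum()   # sum of inclusion probabilities
size_per_draw = xi.sum(axis=1)                # number of included variables in each draw

print(prior_expected, expected_size, size_per_draw.mean())
```

With a real trace in place of the simulated `xi`, `expected_size` would correspond to `np.sum(results.inclusion_probability)`, and the spread of `size_per_draw` shows how concentrated the posterior model size is.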