# Basic BVAS demo using S-gene data

In [1]:
import torch
import gzip
from bvas import BVASSelector, map_inference

### Load data

The analysis in this notebook is for demonstration purposes only.
Our data include only alleles from the S gene, so the results are expected to be biased relative to a genome-wide analysis. <br> <br>
Powered by <br>
<img src="https://www.gisaid.org/fileadmin/gisaid/img/schild.png" alt="GISAID" width="80" align="left"/>

In [5]:
data = torch.load(gzip.GzipFile("../data/S_gene.pt.gz", "rb"), map_location='cpu')

In [6]:
# inspect data
for k, v in data.items():
    if hasattr(v, 'shape'):
        print(k, v.shape)
    elif isinstance(v, list):
        print(k, len(v))
    else:
        print(k, v)

Gamma torch.Size([415, 415])
Y torch.Size([415])
num_alleles 415
num_regions 74
mutations 415


### Instantiate BVASSelector object

In [8]:
selector = BVASSelector(data['Y'].double(),  # use 64-bit precision
                        data['Gamma'].double(), 
                        data['mutations'], 
                        nu_eff_multiplier=0.25,
                        S=5.0,
                        tau=100.0)

### Run BVAS MCMC-based inference

In [9]:
selector.run(T=5000, T_burnin=1000, seed=1)

### Inspect results

The results can be found in the `selector.summary` Pandas DataFrame.

In [10]:
print(selector.summary.iloc[:30][['PIP', 'Beta', 'BetaStd', 'Rank']])

              PIP      Beta   BetaStd  Rank
S:L452R  1.000000  0.500928  0.049617     1
S:T478K  1.000000  0.507602  0.074132     2
S:R346K  1.000000  0.481718  0.059620     3
S:T19R   1.000000  0.541961  0.075788     4
S:N440K  1.000000  0.482342  0.072035     5
S:E484K  1.000000  0.325571  0.033246     6
S:P681R  1.000000  0.421493  0.049610     7
S:N501Y  0.999877  0.288173  0.049724     8
S:T95I   0.999205  0.336165  0.078261     9
S:N969K  0.996847  0.432681  0.102006    10
S:Q954H  0.996489  0.432219  0.106664    11
S:G339D  0.996196  0.430900  0.087655    12
S:N679K  0.994718  0.399761  0.095743    13
S:N764K  0.985545  0.386436  0.102249    14
S:S375F  0.979299  0.374481  0.107152    15
S:S373P  0.978723  0.369183  0.113199    16
S:T859N  0.964351  0.255766  0.080352    17
S:S477N  0.934212  0.137568  0.050168    18
S:T716I  0.912222  0.289920  0.119026    19
S:Y145H  0.872734  0.186306  0.094216    20
S:H655Y  0.779720  0.224054  0.135953    21
S:D405N  0.698377  0.220314  0.1
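
One common way to read a table like this is to keep only alleles whose posterior inclusion probability (PIP) exceeds a cutoff. The sketch below is illustrative only: `select_by_pip` is a hypothetical helper (not part of `bvas`), the 0.9 threshold is an arbitrary assumption, and `toy_summary` is a small stand-in built from a few rows of the output above rather than the real `selector.summary`.

```python
import pandas as pd


def select_by_pip(summary: pd.DataFrame, threshold: float = 0.9) -> pd.DataFrame:
    """Hypothetical helper: keep rows with PIP >= threshold, highest PIP first."""
    kept = summary[summary["PIP"] >= threshold]
    return kept.sort_values("PIP", ascending=False)


# Toy stand-in for selector.summary; values copied from the printed output above.
toy_summary = pd.DataFrame(
    {
        "PIP": [1.000000, 0.934212, 0.872734, 0.779720],
        "Beta": [0.500928, 0.137568, 0.186306, 0.224054],
    },
    index=["S:L452R", "S:S477N", "S:Y145H", "S:H655Y"],
)

# The 0.9 cutoff is an assumption chosen for illustration only.
confident = select_by_pip(toy_summary, threshold=0.9)
print(confident)
```

With the real results, the same call would presumably be `select_by_pip(selector.summary)`, assuming the `PIP` column shown above.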

### Compare to MAP inference

In [11]:
map_results = map_inference(data['Y'].double(), data['Gamma'].double(), data['mutations'], tau_reg=2048.0)

In [12]:
map_results.iloc[:30]

             Beta   BetaStd  Rank
S:T478K  0.479279  0.016720     1
S:L452R  0.403808  0.015633     2
S:T19R   0.401982  0.018214     3
S:P681R  0.385613  0.017929     4
S:N440K  0.255411  0.019505     5
S:R346K  0.232116  0.018944     6
S:T95I   0.226314  0.016852     7
S:N969K  0.223328  0.021557     8
S:Q954H  0.222891  0.021557     9
S:G339D  0.222524  0.021557    10
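
To see how the two methods differ allele by allele, the BVAS and MAP tables can be joined on the mutation name. The following is a minimal sketch, assuming both tables are Pandas DataFrames indexed by mutation with `Beta` and `Rank` columns as in the outputs shown; `side_by_side` is a hypothetical helper and the toy frames reuse a few values printed in this notebook.

```python
import pandas as pd


def side_by_side(bvas: pd.DataFrame, map_df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: join Beta and Rank from both methods on the allele index."""
    cols = ["Beta", "Rank"]
    # An inner join keeps only alleles present in both tables.
    return bvas[cols].join(map_df[cols], how="inner", lsuffix="_BVAS", rsuffix="_MAP")


# Toy stand-ins; values copied from the BVAS and MAP outputs in this notebook.
toy_bvas = pd.DataFrame(
    {"Beta": [0.500928, 0.507602, 0.481718], "Rank": [1, 2, 3]},
    index=["S:L452R", "S:T478K", "S:R346K"],
)
toy_map = pd.DataFrame(
    {"Beta": [0.479279, 0.403808, 0.232116], "Rank": [1, 2, 6]},
    index=["S:T478K", "S:L452R", "S:R346K"],
)

comparison = side_by_side(toy_bvas, toy_map)
# Positive values mean the allele ranks lower under MAP than under BVAS.
comparison["RankShift"] = comparison["Rank_MAP"] - comparison["Rank_BVAS"]
print(comparison)
```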


### Compare uncertainty estimates

We note that the MAP uncertainty estimates are much narrower than the
corresponding BVAS uncertainty estimates. This is ultimately because
BVAS considers multiple hypotheses about which alleles are neutral and
which are not.

In [13]:
# BVAS posterior standard deviation of selection coefficient for S:T478K 
selector.summary.loc['S:T478K'].BetaStd

0.07413167570495986

In [14]:
# MAP posterior standard deviation of selection coefficient for S:T478K 
map_results.loc['S:T478K'].BetaStd

0.016719752782775

In [15]:
# compute ratio
selector.summary.loc['S:T478K'].BetaStd / map_results.loc['S:T478K'].BetaStd

4.433778218379622
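
The single-allele ratio above can be extended to every allele that appears in both tables. This is a sketch under stated assumptions: `betastd_ratio` is a hypothetical helper, both inputs are assumed to be DataFrames indexed by mutation with a `BetaStd` column, and the toy frames reuse a few values from the outputs in this notebook.

```python
import pandas as pd


def betastd_ratio(bvas: pd.DataFrame, map_df: pd.DataFrame) -> pd.Series:
    """Hypothetical helper: BVAS BetaStd divided by MAP BetaStd for shared alleles."""
    shared = bvas.index.intersection(map_df.index)
    ratio = bvas.loc[shared, "BetaStd"] / map_df.loc[shared, "BetaStd"]
    return ratio.rename("BetaStdRatio")


# Toy stand-ins; values copied from the BVAS and MAP outputs in this notebook.
toy_bvas = pd.DataFrame(
    {"BetaStd": [0.049617, 0.074132, 0.075788]},
    index=["S:L452R", "S:T478K", "S:T19R"],
)
toy_map = pd.DataFrame(
    {"BetaStd": [0.016720, 0.015633, 0.018214]},
    index=["S:T478K", "S:L452R", "S:T19R"],
)

ratios = betastd_ratio(toy_bvas, toy_map)
print(ratios.round(2))
```

On the real results the call would presumably be `betastd_ratio(selector.summary, map_results)`; a ratio above 1 means the BVAS posterior is wider than the MAP estimate for that allele.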