Precursors!

In [4]:
import os, subprocess

# Fetch the hg19 FASTA and its index only if the FASTA is not already in data/.
if not os.path.isfile('data/hg19.ml.fa'):
    subprocess.call('curl -o data/hg19.ml.fa https://storage.googleapis.com/basenji_tutorial_data/hg19.ml.fa', shell=True)
    subprocess.call('curl -o data/hg19.ml.fa.fai https://storage.googleapis.com/basenji_tutorial_data/hg19.ml.fa.fai', shell=True)

Analyzing noncoding variation associated with disease is a major application of Basenji. I now offer several tools to enable that analysis. If you have a small set of variants and know what datasets are most relevant, [basenji_sat_vcf.py](https://github.com/calico/basenji/blob/master/bin/basenji_sat_vcf.py) lets you perform a saturation mutagenesis of the variant and surrounding region to see the relevant nearby motifs.

If you want scores measuring the influence of those variants on all datasets,
 * [basenji_sad.py](https://github.com/calico/basenji/blob/master/bin/basenji_sad.py) computes my SNP activity difference (SAD) score--the predicted change in aligned fragments to the region.
 * [basenji_sed.py](https://github.com/calico/basenji/blob/master/bin/basenji_sed.py) computes my SNP expression difference (SED) score--the predicted change in aligned fragments to gene TSSs.

Here, I'll demonstrate those two programs. You'll need
 * Trained model
 * Input files: a VCF of variants to score, plus the genome FASTA (for basenji_sad.py) or a gene sequence HDF5 (for basenji_sed.py)

First, you can either train your own model in the [Train/test tutorial](https://github.com/calico/basenji/blob/master/tutorials/train_test.ipynb) or use one that I pre-trained from the models subdirectory.

As an example, we'll study a prostate cancer susceptibility allele of rs339331 that increases RFX6 expression by modulating HOXB13 chromatin binding (http://www.nature.com/ng/journal/v46/n2/full/ng.2862.html).

We'll start with [basenji_sad.py](https://github.com/calico/basenji/blob/master/bin/basenji_sad.py), which predicts across the region for each allele and computes statistics on the mean and maximum differences.

The most relevant options are:

| Option/Argument | Value | Note |
|:---|:---|:---|
| -f | data/hg19.ml.fa | Genome fasta. |
| -g | data/human.hg19.genome | Genome assembly chromosome length to bound gene sequences. |
| -l | 262144 | Length of the sequence given to the model. |
| -o | rfx6_sad | Output directory. |
| --rc | | Predict forward and reverse complement versions and average the results. |
| -t | data/gm12878_wigs.txt | Target labels. |
| params_file | models/params_small.txt | Table of parameters to set up the model architecture and optimization. |
| model_file | models/gm12878_d10/model_best.tf | Trained saved model prefix. |
| vcf_file | data/rs339331.vcf | VCF file specifying variants to score. |
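
The idea behind `--rc` can be sketched in a few lines of plain Python. This is a generic illustration of reverse-complement averaging, not Basenji's internal implementation: `predict` is a hypothetical callable that maps a DNA string to one prediction per position, and `toy_predict` is a stand-in invented for the example.

```python
# Generic sketch of reverse-complement averaging (not Basenji's own code).
COMPLEMENT = str.maketrans('ACGTacgt', 'TGCAtgca')


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def predict_rc_averaged(predict, seq):
    """Average forward-strand and reverse-strand predictions.

    `predict` is a hypothetical model: DNA string -> list of per-position values.
    """
    fwd = predict(seq)
    # Predict on the reverse complement, then flip the result so that
    # index i again refers to forward-strand position i.
    rev = predict(reverse_complement(seq))[::-1]
    return [(f + r) / 2 for f, r in zip(fwd, rev)]


def toy_predict(seq):
    """Toy strand-asymmetric 'model': signal of 1.0 wherever it reads a G."""
    return [1.0 if base == 'G' else 0.0 for base in seq]


avg = predict_rc_averaged(toy_predict, 'GATTC')
print(avg)  # [0.5, 0.0, 0.0, 0.0, 0.5]
```

The toy model only sees a G on the forward strand at position 0, and only sees one on the reverse strand at position 4 (the complement of the C), so averaging spreads the signal evenly over both.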

In [10]:
! basenji_sad.py -f data/hg19.ml.fa -g data/human.hg19.genome -l 262144 -o rfx6_sad --rc -t data/gm12878_wigs.txt models/params_small.txt models/gm12878_d10/model_best.tf data/rs339331.vcf

{'batch_size': 2, 'batch_buffer': 16384, 'cnn_dense': [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0], 'adam_beta2': 0.98, 'cnn_dropout': 0.05, 'learning_rate': 0.002, 'loss': 'poisson', 'adam_beta1': 0.97, 'cnn_pool': [2, 4, 4, 4, 1, 0, 0, 0, 0, 0, 0, 0], 'num_targets': 39, 'link': 'softplus', 'cnn_dilation': [1, 1, 1, 1, 1, 2, 4, 8, 16, 32, 64, 1], 'optimizer': 'adam', 'cnn_filter_sizes': [20, 6, 6, 6, 3, 3, 3, 3, 3, 3, 3, 1], 'cnn_filters': [128, 128, 192, 256, 256, 32, 32, 32, 32, 32, 32, 384], 'target_pool': 128}
Targets pooled by 128 to length 2048
Convolution w/ 39 384x1 filters to final targets
Model building time 9.600011
2018-05-12 09:08:14.611614: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.2 AVX AVX2 FMA


rfx6_sad/sad_table.txt now contains a table describing the results.

The *u* in *upred* and *usad* refers to taking the mean across the sequence, whereas the *x* in *xpred* and *xsad* refers to taking the maximum across positions.
*sad* is the alt allele prediction minus the ref allele prediction, and *sar* is the log2 ratio of alt to ref after adding a pseudocount of 1 to each.
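
As a quick sanity check on those two definitions, here is a minimal sketch that recomputes *xsad* and *xsar* from the *ref_xpred* and *alt_xpred* values in the first DNASE:GM12878 row of the table. The function names are mine, and the formulas (alt minus ref, and log2 of the pseudocounted alt/ref ratio) are inferred from the printed values rather than taken from the Basenji source.

```python
import math


def sad(ref_pred, alt_pred):
    """Difference score: alt prediction minus ref prediction."""
    return alt_pred - ref_pred


def sar(ref_pred, alt_pred, pseudocount=1.0):
    """Ratio score: log2 of (alt + pseudocount) / (ref + pseudocount)."""
    return math.log2((alt_pred + pseudocount) / (ref_pred + pseudocount))


# ref_xpred and alt_xpred for target 0 (DNASE:GM12878) in sad_table.txt
ref_xpred, alt_xpred = 4.873, 4.928

print(f"xsad = {sad(ref_xpred, alt_xpred):.3f}")   # xsad = 0.055
print(f"xsar = {sar(ref_xpred, alt_xpred):.4f}")   # xsar = 0.0134
```

Both values agree with the *xsad* and *xsar* columns of that row to the printed precision.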

In [11]:
! head rfx6_sad/sad_table.txt

rsid ref alt ref_pred alt_pred sad sar geo_sad ref_lpred alt_lpred lsad lsar ref_xpred alt_xpred xsad xsar target_index target_id target_label
rs339331           T      C |  8601.44  8601.41   -0.021 -0.0000  -0.006 |  37.165  37.228   0.063  0.0024 |   4.873   4.928   0.055  0.0134 |    0 ENCSR000EJD_3_1 DNASE:GM12878
rs339331           T      C |  3635.89  3635.82   -0.075 -0.0000  -0.038 |  14.456  14.525   0.069  0.0065 |   1.782   1.819   0.037  0.0190 |    1 ENCSR000EMT_2_1 DNASE:GM12878
rs339331           T      C |  1438.24  1438.12   -0.112 -0.0001  -0.086 |   6.282   6.309   0.027  0.0053 |   0.929   0.952   0.023  0.0167 |    2 ENCSR000EMT_1_1 DNASE:GM12878
rs339331           T      C |  8199.03  8199.07    0.035  0.0000   0.009 |  34.614  34.699   0.085  0.0034 |   4.189   4.241   0.051  0.0142 |    3 ENCSR000EJD_1_1 DNASE:GM12878
rs339331           T      C |  5658.30  5658.24   -0.057 -0.0000  -0.019 |  24.260  24.286   0.026  0.0015 |   2.987   3.012   0.026  0.0093 |   

We can sort by *lsad* (the 13th whitespace-delimited field, because the `|` separators count as fields) to get an idea of the datasets where Basenji sees the largest difference between the two alleles.

In [12]:
! sort -k13 -g rfx6_sad/sad_table.txt | head -n 5

rsid ref alt ref_pred alt_pred sad sar geo_sad ref_lpred alt_lpred lsad lsar ref_xpred alt_xpred xsad xsar target_index target_id target_label
rs339331           T      C |  2750.46  2750.37   -0.098 -0.0001  -0.054 |  13.524  13.504  -0.019 -0.0019 |   1.842   1.836  -0.006 -0.0033 |   21 ENCSR057BWO_1_1 HISTONE:H3K4me3 GM12878
rs339331           T      C |   981.37   980.96   -0.413 -0.0006  -0.334 |   5.194   5.185  -0.009 -0.0022 |   1.442   1.429  -0.013 -0.0077 |   13 ENCSR000AOW_1_1 HISTONE:H3K79me2 GM12878
rs339331           T      C |   245.40   245.35   -0.048 -0.0003  -0.059 |   0.634   0.631  -0.003 -0.0023 |   0.101   0.099  -0.002 -0.0032 |   36    CNhs12332 CAGE:B lymphoblastoid cell line: GM12878 ENCODE, biol_rep2
rs339331           T      C |   247.11   247.06   -0.049 -0.0003  -0.060 |   0.732   0.729  -0.003 -0.0024 |   0.106   0.104  -0.002 -0.0029 |   37    CNhs12333 CAGE:B lymphoblastoid cell line: GM12878 ENCODE, biol_rep3


In [13]:
! sort -k13 -gr rfx6_sad/sad_table.txt | head -n 5

rs339331           T      C |  8199.03  8199.07    0.035  0.0000   0.009 |  34.614  34.699   0.085  0.0034 |   4.189   4.241   0.051  0.0142 |    3 ENCSR000EJD_1_1 DNASE:GM12878
rs339331           T      C |  3635.89  3635.82   -0.075 -0.0000  -0.038 |  14.456  14.525   0.069  0.0065 |   1.782   1.819   0.037  0.0190 |    1 ENCSR000EMT_2_1 DNASE:GM12878
rs339331           T      C |  8601.44  8601.41   -0.021 -0.0000  -0.006 |  37.165  37.228   0.063  0.0024 |   4.873   4.928   0.055  0.0134 |    0 ENCSR000EJD_3_1 DNASE:GM12878
rs339331           T      C |  1102.09  1102.15    0.063  0.0001   0.057 |   4.227   4.277   0.050  0.0138 |   0.620   0.650   0.029  0.0257 |   24 ENCSR000DRX_2_1 HISTONE:H3K27me3 GM12878
rs339331           T      C |   966.69   966.57   -0.115 -0.0002  -0.104 |   3.976   4.020   0.044  0.0127 |   0.627   0.658   0.031  0.0273 |   12 ENCSR000DRW_1_1 HISTONE:H3K36me3 GM12878
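
Field numbers passed to `sort` are easy to get wrong here, because the `|` separators in each row count as fields but do not appear in the header. As an alternative, here is a small sketch that parses the table in Python and sorts by column name. The helper names `read_sad_table` and `sort_by` are hypothetical, and the parser assumes the layout shown above: whitespace-delimited fields, `|` used purely as a visual separator, and a final label column that may contain spaces.

```python
def read_sad_table(lines):
    """Parse sad_table.txt-style lines into a list of dicts keyed by column name."""
    lines = [ln for ln in lines if ln.strip()]
    header = lines[0].split()
    n_fixed = len(header) - 1  # every column except the free-text label
    rows = []
    for ln in lines[1:]:
        # Drop the '|' separators so the fields line up with the header.
        fields = [tok for tok in ln.split() if tok != '|']
        if len(fields) < len(header):
            continue  # skip truncated rows
        row = dict(zip(header[:n_fixed], fields[:n_fixed]))
        row[header[-1]] = ' '.join(fields[n_fixed:])  # label may contain spaces
        rows.append(row)
    return rows


def sort_by(rows, column, reverse=False):
    """Sort parsed rows numerically by the named column."""
    return sorted(rows, key=lambda r: float(r[column]), reverse=reverse)


# Three rows copied from the output above; in practice, pass an open file:
#   with open('rfx6_sad/sad_table.txt') as f: rows = read_sad_table(f)
sample = """\
rsid ref alt ref_pred alt_pred sad sar geo_sad ref_lpred alt_lpred lsad lsar ref_xpred alt_xpred xsad xsar target_index target_id target_label
rs339331 T C | 8601.44 8601.41 -0.021 -0.0000 -0.006 | 37.165 37.228 0.063 0.0024 | 4.873 4.928 0.055 0.0134 | 0 ENCSR000EJD_3_1 DNASE:GM12878
rs339331 T C | 3635.89 3635.82 -0.075 -0.0000 -0.038 | 14.456 14.525 0.069 0.0065 | 1.782 1.819 0.037 0.0190 | 1 ENCSR000EMT_2_1 DNASE:GM12878
rs339331 T C | 1102.09 1102.15 0.063 0.0001 0.057 | 4.227 4.277 0.050 0.0138 | 0.620 0.650 0.029 0.0257 | 24 ENCSR000DRX_2_1 HISTONE:H3K27me3 GM12878
""".splitlines()

rows = read_sad_table(sample)
ranked = sort_by(rows, 'xsar', reverse=True)
for r in ranked:
    print(r['xsar'], r['target_index'], r['target_label'])
```

Sorting by name makes it straightforward to switch between *lsad*, *xsad*, and *xsar* without recounting fields.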


These effect sizes are small and inconclusive, which is not surprising given that the model covers only GM12878 while the variant is a prostate cancer susceptibility allele. Predictions for more relevant cell types would shed more light.

Alternatively, we can directly query the predictions at gene TSSs using [basenji_sed.py](https://github.com/calico/basenji/blob/master/bin/basenji_sed.py).

[basenji_sed.py](https://github.com/calico/basenji/blob/master/bin/basenji_sed.py) takes as input the gene sequence HDF5 format described in [genes.ipynb](https://github.com/calico/basenji/blob/master/tutorials/genes.ipynb). There's no harm in providing an HDF5 that describes all genes, but it's too big to move around easily, so I constructed one that focuses on RFX6.

The most relevant options are:

| Option/Argument | Value | Note |
|:---|:---|:---|
| -g | data/human.hg19.genome | Genome assembly chromosome length to bound gene sequences. |
| -o | rfx6_sed | Output directory. |
| --rc | | Predict forward and reverse complement versions and average the results. |
| -w | 128 | Sequence bin width at which predictions are made. |
| params_file | models/params_small.txt | Table of parameters to set up the model architecture and optimization. |
| model_file | models/gm12878_d10/model_best.tf | Trained saved model prefix. |
| genes_hdf5_file | data/rfx6.h5 | HDF5 file specifying gene sequences to query. |
| vcf_file | data/rs339331.vcf | VCF file specifying variants to score. |
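
To make the bin width concrete, here is a small arithmetic sketch relating the 262144 bp sequence length used in this tutorial to the 128 bp bins. The helper `position_to_bin` is hypothetical, and reading *tss_dist* as a distance in base pairs is my assumption.

```python
seq_len = 262144   # sequence length in bp (the -l value used in this tutorial)
bin_width = 128    # prediction bin width in bp (the -w value)

# Number of prediction bins per sequence; this agrees with the
# "Targets pooled by 128 to length 2048" line in the program output.
num_bins = seq_len // bin_width
print(num_bins)  # 2048


def position_to_bin(offset, bin_width=128):
    """Map a 0-based offset within the sequence to its prediction bin index."""
    return offset // bin_width


# Assuming tss_dist is in bp, a SNP 4100 bp from a TSS sits about 32 bins away.
print(position_to_bin(4100))  # 32
```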

Before running [basenji_sed.py](https://github.com/calico/basenji/blob/master/bin/basenji_sed.py), we need to generate an input data file for RFX6. Using an included GTF file that contains only RFX6, one can use [basenji_hdf5_genes.py](https://github.com/calico/basenji/blob/master/bin/basenji_hdf5_genes.py) to create the required format.

In [17]:
! basenji_hdf5_genes.py -g data/human.hg19.genome -l 262144 -c 0.333 -w 128 data/hg19.ml.fa data/rfx6.gtf data/rfx6.h5



In [29]:
! basenji_sed.py -g data/human.hg19.genome -o rfx6_sed --rc models/params_small.txt models/gm12878_d10/model_best.tf data/rfx6.h5 data/rs339331.vcf

Intersecting gene sequences with SNPs...1 sequences w/ SNPs
{'num_targets': 39, 'cnn_filter_sizes': [20, 6, 6, 6, 3, 3, 3, 3, 3, 3, 3, 1], 'learning_rate': 0.002, 'loss': 'poisson', 'batch_buffer': 16384, 'batch_size': 2, 'link': 'softplus', 'cnn_filters': [128, 128, 192, 256, 256, 32, 32, 32, 32, 32, 32, 384], 'adam_beta1': 0.97, 'adam_beta2': 0.98, 'optimizer': 'adam', 'cnn_dropout': 0.05, 'cnn_dense': [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0], 'cnn_dilation': [1, 1, 1, 1, 1, 2, 4, 8, 16, 32, 64, 1], 'target_pool': 128, 'cnn_pool': [2, 4, 4, 4, 1, 0, 0, 0, 0, 0, 0, 0]}
Targets pooled by 128 to length 2048
Convolution w/ 39 384x1 filters to final targets
2018-05-12 11:28:05.089572: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.2 AVX AVX2 FMA
chr6:117071103-117333247 2 TSSs


In [30]:
! sort -k9 -g rfx6_sed/sed_gene.txt | head -n 5

rsid ref alt gene tss_dist ref_pred alt_pred sed ser target_index target_id target_label
rs339331      T     C ENSG00000185002.9_1  4100  0.7612  0.7559 -0.0054 -0.0088   15          t15 
rs339331      T     C ENSG00000185002.9_1  4100  1.5000  1.4902 -0.0098 -0.0088   31          t31 
rs339331      T     C ENSG00000185002.9_1  4100  1.6543  1.6436 -0.0107 -0.0088   13          t13 
rs339331      T     C ENSG00000185002.9_1  4100  1.0771  1.0723 -0.0049 -0.0059    6           t6 


In [31]:
! sort -k9 -gr rfx6_sed/sed_gene.txt | head -n 5

rs339331      T     C ENSG00000185002.9_1  4100  1.1816  1.1836  0.0020  0.0022   14          t14 
rs339331      T     C ENSG00000185002.9_1  4100  7.9102  7.9141  0.0039  0.0020    3           t3 
rs339331      T     C ENSG00000185002.9_1  4100  5.9766  5.9805  0.0039  0.0000    4           t4 
rs339331      T     C ENSG00000185002.9_1  4100  1.9531  1.9521 -0.0010  0.0000   28          t28 
rs339331      T     C ENSG00000185002.9_1  4100  3.7969  3.7949 -0.0020 -0.0010   38          t38 
