Analyzing noncoding variation associated with disease is a major application of Basenji. I now offer several tools to enable that analysis. If you have a small set of variants and know what datasets are most relevant, [basenji_sat_vcf.py](https://github.com/calico/basenji/blob/master/bin/basenji_sat_vcf.py) lets you perform a saturation mutagenesis of the variant and surrounding region to see the relevant nearby motifs.

If you want scores measuring the influence of those variants on all datasets,
 * [basenji_sad.py](https://github.com/calico/basenji/blob/master/bin/basenji_sad.py) computes my SNP activity difference (SAD) score--the predicted change in aligned fragments to the region.
 * [basenji_sed.py](https://github.com/calico/basenji/blob/master/bin/basenji_sed.py) computes my SNP expression difference (SED) score--the predicted change in aligned fragments to gene TSS's.

Here, I'll demonstrate those two programs. You'll need
 * Trained model
 * Input files (genome FASTA and VCF for basenji_sad.py; gene HDF5 and VCF for basenji_sed.py)

First, you can either train your own model by following the [Train/test tutorial](https://github.com/calico/basenji/blob/master/tutorials/train_test.ipynb) or use one that I pre-trained, found in the models subdirectory.

As an example, we'll study a prostate cancer susceptibility allele of rs339331 that increases RFX6 expression by modulating HOXB13 chromatin binding (http://www.nature.com/ng/journal/v46/n2/full/ng.2862.html).

First, we'll use [basenji_sad.py](https://github.com/calico/basenji/blob/master/bin/basenji_sad.py) to predict across the region for each allele and compute stats about the mean and max differences.

The most relevant options are:

| Option/Argument | Value | Note |
|:---|:---|:---|
| -f | data/hg19.ml.fa | Genome fasta. |
| -g | data/human.hg19.genome | Chromosome lengths file for the genome assembly, used to bound the extracted sequences. |
| -l | 262144 | Length of the input sequence provided to the model. |
| -o | rfx6_sad | Output directory. |
| --rc | | Predict forward and reverse complement versions and average the results. |
| -t | data/gm12878_wigs.txt | Target labels. |
| params_file | models/params_med.txt | Table of parameters to set up the model architecture and optimization. |
| model_file | models/gm12878.tf | Trained saved model prefix. |
| vcf_file | data/rs339331.vcf | VCF file specifying variants to score. |

In [8]:
! basenji_sad.py -f data/hg19.ml.fa -g data/human.hg19.genome -l 262144 -o rfx6_sad --rc -t data/gm12878_wigs.txt models/params_med.txt models/gm12878.tf data/rs339331.vcf

{'batch_buffer': 16384, 'loss': 'poisson', 'cnn_dense': [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0], 'adam_beta1': 0.97, 'cnn_dropout': [0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.1], 'cnn_dilation': [1, 1, 1, 1, 1, 1, 2, 4, 8, 16, 32, 64, 128, 1], 'adam_beta2': 0.98, 'link': 'softplus', 'target_pool': 128, 'cnn_filter_sizes': [22, 1, 6, 6, 6, 3, 3, 3, 3, 3, 3, 3, 3, 3], 'batch_renorm': 1, 'cnn_pool': [1, 2, 4, 4, 4, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'cnn_filters': [196, 196, 235, 282, 338, 384, 64, 64, 64, 64, 64, 64, 64, 512], 'num_targets': 39, 'batch_size': 1, 'learning_rate': 0.002}
Targets pooled by 128 to length 2048
Convolution w/ 196 4x22 filters strided 1, dilated 1
Batch normalization
ReLU
Dropout w/ probability 0.050
Convolution w/ 196 196x1 filters strided 1, dilated 1
Batch normalization
ReLU
Max pool 2
Dropout w/ probability 0.050
Convolution w/ 235 196x6 filters strided 1, dilated 1
Batch normalization
ReLU
Max pool 4
Dropout w/ probability
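
The logged parameters can be checked against the command line with a little arithmetic. This is only a sketch, and it assumes that the 262144 passed to -l is the length of the sequence fed to the model (an inference from the log, where the numbers line up, rather than something the log states). The values are copied from the parameter dictionary above.

```python
from math import prod

# Values copied from the parameter dictionary logged above.
cnn_pool = [1, 2, 4, 4, 4, 1, 1, 1, 1, 1, 1, 1, 1, 1]
target_pool = 128

# Assumption: the value given to -l is the model's input sequence length.
seq_len = 262144

# Total downsampling is the product of the per-layer pooling widths.
total_pool = prod(cnn_pool)

# Number of prediction bins along the input sequence.
num_bins = seq_len // total_pool

print(f"total pooling: {total_pool}, bins: {num_bins}")
```

The product of the pooling widths matches `target_pool`, and the bin count matches the "Targets pooled by 128 to length 2048" line in the log.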

rfx6_sad/sad_table.txt now contains a table describing the results.

The *u* in *upred* and *usad* refers to taking the mean across the sequence, whereas the *x* in *xpred* and *xsad* refers to the maximum position.
The *sad* columns subtract the ref allele prediction from the alt allele prediction (alt minus ref), and the *sar* columns add a pseudocount of 1 to each prediction and take log2 of the alt/ref ratio.
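
To make the two scores concrete, here is a minimal sketch that recomputes them from a pair of predictions. The function names are mine, not part of Basenji, and the sign convention (alt minus ref, alt over ref) is inferred from the values in the table rather than taken from the script's source.

```python
import math


def sad(ref_pred, alt_pred):
    """Difference score: alt prediction minus ref prediction."""
    return alt_pred - ref_pred


def sar(ref_pred, alt_pred, pseudocount=1.0):
    """Log ratio score: log2 of (alt + pseudocount) / (ref + pseudocount)."""
    return math.log2((alt_pred + pseudocount) / (ref_pred + pseudocount))


# ref_xpred and alt_xpred for target 0 (DNASE:GM12878) in sad_table.txt.
ref_xpred, alt_xpred = 4.383, 4.422

print(f"xsad ~ {sad(ref_xpred, alt_xpred):.4f}")
print(f"xsar ~ {sar(ref_xpred, alt_xpred):.4f}")
```

Because the table prints predictions rounded to three decimals, values recomputed this way can differ from the tabulated *xsad* and *xsar* in the last digit.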

In [9]:
! head rfx6_sad/sad_table.txt

rsid index score ref alt ref_upred alt_upred usad usar ref_xpred alt_xpred xsad xsar target_index target_id target_label
rs339331      .                 .      T      C  4.469  4.469  0.0000  0.0000  4.383  4.422  0.0391  0.0104    0 ENCSR000EJD_3_1 DNASE:GM12878
rs339331      .                 .      T      C  1.628  1.628  0.0000  0.0000  1.690  1.758  0.0674  0.0357    1 ENCSR000EMT_2_1 DNASE:GM12878
rs339331      .                 .      T      C  0.658  0.658  0.0000  0.0000  0.694  0.721  0.0273  0.0231    2 ENCSR000EMT_1_1 DNASE:GM12878
rs339331      .                 .      T      C  4.332  4.332  0.0000  0.0000  4.004  4.055  0.0508  0.0146    3 ENCSR000EJD_1_1 DNASE:GM12878
rs339331      .                 .      T      C  2.797  2.799  0.0020  0.0010  2.332  2.342  0.0098  0.0042    4 ENCSR000EJD_2_1 DNASE:GM12878
rs339331      .                 .      T      C  1.744  1.744  0.0000  0.0000  1.424  1.438  0.0137  0.0081    5 ENCSR057BWO_2_1 HISTONE:H3K4me3 GM12878
rs33

We can sort by *xsar* to get an idea of the datasets where Basenji sees the largest difference between the two alleles.

In [13]:
! sort -k13 -g rfx6_sad/sad_table.txt | head -n 5

rsid index score ref alt ref_upred alt_upred usad usar ref_xpred alt_xpred xsad xsar target_index target_id target_label
rs339331      .                 .      T      C  1.391  1.391  0.0000  0.0000  1.733  1.704 -0.0293 -0.0155   21 ENCSR057BWO_1_1 HISTONE:H3K4me3 GM12878
rs339331      .                 .      T      C  0.087  0.087  0.0000  0.0000  0.632  0.629 -0.0029 -0.0026   38    CNhs12331 CAGE:B lymphoblastoid cell line: GM12878 ENCODE, biol_rep1
rs339331      .                 .      T      C  0.073  0.073  0.0000  0.0000  0.544  0.542 -0.0024 -0.0023   36    CNhs12332 CAGE:B lymphoblastoid cell line: GM12878 ENCODE, biol_rep2
rs339331      .                 .      T      C  0.220  0.220  0.0000  0.0000  0.342  0.340 -0.0020 -0.0021   33 ENCSR000AKH_2_1 HISTONE:H3K9ac GM12878


In [14]:
! sort -k13 -gr rfx6_sad/sad_table.txt | head -n 5

rs339331      .                 .      T      C  1.628  1.628  0.0000  0.0000  1.690  1.758  0.0674  0.0357    1 ENCSR000EMT_2_1 DNASE:GM12878
rs339331      .                 .      T      C  0.664  0.664  0.0000  0.0000  0.527  0.562  0.0356  0.0333   24 ENCSR000DRX_2_1 HISTONE:H3K27me3 GM12878
rs339331      .                 .      T      C  0.469  0.469  0.0000  0.0000  0.508  0.541  0.0327  0.0310   29 ENCSR000DRX_1_1 HISTONE:H3K27me3 GM12878
rs339331      .                 .      T      C  0.404  0.404  0.0000  0.0000  0.369  0.398  0.0293  0.0306   16 ENCSR000DRW_2_1 HISTONE:H3K36me3 GM12878
rs339331      .                 .      T      C  0.405  0.405  0.0000  0.0000  0.364  0.392  0.0276  0.0289   12 ENCSR000DRW_1_1 HISTONE:H3K36me3 GM12878
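
The same ranking can be done in Python by column name instead of column number, which also keeps the header out of the sorted rows. The sketch below uses a hypothetical helper, `parse_sad_table`, that is not part of Basenji. It assumes the table is whitespace-delimited with the header shown above and that only the last column (target_label) may contain spaces. The sample lines are copied from the output above.

```python
def parse_sad_table(lines):
    """Parse SAD table lines into a list of dicts keyed by the header names."""
    lines = iter(lines)
    header = next(lines).split()
    rows = []
    for line in lines:
        # Limit the number of splits so a target_label containing spaces
        # stays together in the final field.
        fields = line.strip().split(None, len(header) - 1)
        if len(fields) != len(header):
            continue  # skip blank or truncated lines
        rows.append(dict(zip(header, fields)))
    return rows


# Three rows copied from the sad_table.txt output shown above.
sample = [
    "rsid index score ref alt ref_upred alt_upred usad usar ref_xpred alt_xpred xsad xsar target_index target_id target_label",
    "rs339331 . . T C 4.469 4.469 0.0000 0.0000 4.383 4.422 0.0391 0.0104 0 ENCSR000EJD_3_1 DNASE:GM12878",
    "rs339331 . . T C 1.744 1.744 0.0000 0.0000 1.424 1.438 0.0137 0.0081 5 ENCSR057BWO_2_1 HISTONE:H3K4me3 GM12878",
    "rs339331 . . T C 1.391 1.391 0.0000 0.0000 1.733 1.704 -0.0293 -0.0155 21 ENCSR057BWO_1_1 HISTONE:H3K4me3 GM12878",
]

rows = parse_sad_table(sample)

# Largest positive xsar first.
by_xsar = sorted(rows, key=lambda r: float(r["xsar"]), reverse=True)
for r in by_xsar:
    print(r["target_index"], r["xsar"], r["target_label"])
```

To run this on the real output, pass an open file handle for rfx6_sad/sad_table.txt to `parse_sad_table` in place of `sample`.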


These effect sizes are small and inconclusive, which isn't surprising given that we're only studying GM12878, a lymphoblastoid cell line. Datasets from prostate-relevant cell types would shed more light.

Alternatively, we can directly query the predictions at gene TSS's using [basenji_sed.py](https://github.com/calico/basenji/blob/master/bin/basenji_sed.py).

[basenji_sed.py](https://github.com/calico/basenji/blob/master/bin/basenji_sed.py) takes as input the gene sequence HDF5 format described in [genes.ipynb](https://github.com/calico/basenji/blob/master/tutorials/genes.ipynb). There's no harm in providing an HDF5 that describes all genes, but it's too big to easily move around, so I constructed one that focuses on RFX6.

The most relevant options are:

| Option/Argument | Value | Note |
|:---|:---|:---|
| -g | data/human.hg19.genome | Chromosome lengths file for the genome assembly, used to bound gene sequences. |
| -o | rfx6_sed | Output directory. |
| --rc | | Predict forward and reverse complement versions and average the results. |
| -w | 128 | Sequence bin width at which predictions are made. |
| params_file | models/params_med.txt | Table of parameters to set up the model architecture and optimization. |
| model_file | models/gm12878.tf | Trained saved model prefix. |
| genes_hdf5_file | data/rfx6.h5 | HDF5 file specifying gene sequences to query. |
| vcf_file | data/rs339331.vcf | VCF file specifying variants to score. |

In [1]:
! basenji_sed.py -g data/human.hg19.genome -o rfx6_sed --rc -w 128 models/params_med.txt models/gm12878.tf data/rfx6.h5 data/rs339331.vcf

Intersecting gene sequences with SNPs...done
{'cnn_pool': [1, 2, 4, 4, 4, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'adam_beta1': 0.97, 'cnn_dilation': [1, 1, 1, 1, 1, 1, 2, 4, 8, 16, 32, 64, 128, 1], 'target_pool': 128, 'link': 'softplus', 'cnn_filters': [196, 196, 235, 282, 338, 384, 64, 64, 64, 64, 64, 64, 64, 512], 'num_targets': 39, 'adam_beta2': 0.98, 'cnn_filter_sizes': [22, 1, 6, 6, 6, 3, 3, 3, 3, 3, 3, 3, 3, 3], 'loss': 'poisson', 'batch_size': 1, 'cnn_dropout': [0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.1], 'batch_renorm': 1, 'learning_rate': 0.002, 'cnn_dense': [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0], 'batch_buffer': 16384}
Targets pooled by 128 to length 2048
Convolution w/ 196 4x22 filters strided 1, dilated 1
Batch normalization
ReLU
Dropout w/ probability 0.050
Convolution w/ 196 196x1 filters strided 1, dilated 1
Batch normalization
ReLU
Max pool 2
Dropout w/ probability 0.050
Convolution w/ 235 196x6 filters strided 1, dilated 1
Batch normali

rfx6_sed/sed_gene.txt now contains a table of the predictions at gene TSS's. As before, we can sort by the log ratio column (*ser*) to find the datasets with the largest differences in each direction.

In [2]:
! sort -k9 -g rfx6_sed/sed_gene.txt | head -n 5

rsid ref alt gene tss_dist ref_pred alt_pred sed ser target_index target_id target_label
rs339331      T     C ENSG00000185002.9_1 11565 11.9669 11.9147 -0.0522 -0.0058   38          t38 
rs339331      T     C ENSG00000185002.9_1 11565  9.9741  9.9313 -0.0428 -0.0056   36          t36 
rs339331      T     C ENSG00000185002.9_1 11565 10.3157 10.2733 -0.0424 -0.0054   37          t37 
rs339331      T     C ENSG00000185002.9_1 11565 32.7165 32.5909 -0.1256 -0.0054   22          t22 


In [3]:
! sort -k9 -gr rfx6_sed/sed_gene.txt | head -n 5

rs339331      T     C ENSG00000185002.9_1 11565 168.9414 169.1487  0.2073  0.0018   14          t14 
rs339331      T     C ENSG00000185002.9_1 11565 135.2241 135.3682  0.1440  0.0015   24          t24 
rs339331      T     C ENSG00000185002.9_1 11565 95.2887 95.3773  0.0886  0.0013   29          t29 
rs339331      T     C ENSG00000185002.9_1 11565 65.0739 65.1327  0.0588  0.0013   26          t26 
rs339331      T     C ENSG00000185002.9_1 11565 57.9192 57.9313  0.0121  0.0003   32          t32 
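
The two sorts above show the negative and positive tails separately. To rank targets by effect size regardless of direction, one option is to sort by the absolute value of *ser*. The sketch below recomputes *sed* and *ser* from the printed predictions. It assumes they follow the same convention as *sad* and *sar* (alt minus ref, and log2 of the alt/ref ratio with a pseudocount of 1), which is consistent with the numbers shown. The function names are illustrative, not part of Basenji.

```python
import math

# (target_index, ref_pred, alt_pred) copied from the sed_gene.txt rows above.
rows = [
    (38, 11.9669, 11.9147),
    (22, 32.7165, 32.5909),
    (14, 168.9414, 169.1487),
    (32, 57.9192, 57.9313),
]


def sed(ref_pred, alt_pred):
    """Difference score at the TSS: alt prediction minus ref prediction."""
    return alt_pred - ref_pred


def ser(ref_pred, alt_pred, pseudocount=1.0):
    """Log ratio score at the TSS, with a pseudocount added to both alleles."""
    return math.log2((alt_pred + pseudocount) / (ref_pred + pseudocount))


scored = [(ti, sed(r, a), ser(r, a)) for ti, r, a in rows]

# Rank by magnitude of the log ratio, ignoring its sign.
ranked = sorted(scored, key=lambda s: abs(s[2]), reverse=True)
for ti, diff, log_ratio in ranked:
    print(f"target {ti}: sed={diff:+.4f} ser={log_ratio:+.4f}")
```

Ranking by *ser* rather than *sed* keeps targets with large absolute predictions from dominating purely because of their scale.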
