Precursors!

In [8]:
import os, subprocess

model_dir = 'models/heart'
base_url = 'https://storage.googleapis.com/basenji_tutorial_data'

# Create the model directory, including a missing 'models' parent, if needed.
os.makedirs(model_dir, exist_ok=True)

# Download the three TensorFlow checkpoint files unless they are already present.
if not os.path.isfile('%s/model_best.tf.meta' % model_dir):
    for suffix in ['index', 'meta', 'data-00000-of-00001']:
        file_name = 'model_best.tf.%s' % suffix
        subprocess.call(['curl', '-o', '%s/%s' % (model_dir, file_name), '%s/%s' % (base_url, file_name)])

Saturation mutagenesis is a powerful tool both for dissecting a specific sequence of interest and understanding what the model learned. [basenji_sat.py](https://github.com/calico/basenji/blob/master/bin/basenji_sat.py) enables this analysis from a test set of data. [basenji_sat_vcf.py](https://github.com/calico/basenji/blob/master/bin/basenji_sat_vcf.py) lets you provide a VCF file for variant-centered mutagenesis.
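
To make the idea concrete, here is a minimal sketch of in silico saturation mutagenesis on a toy sequence. It is not the code in basenji_sat.py: `toy_predict` is a hypothetical stand-in for a trained model that simply counts GATA matches, and summarizing each position by its minimum (loss) and maximum (gain) change is an assumption made for illustration.

```python
import numpy as np

NUCLEOTIDES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as a (length, 4) one-hot array."""
    encoded = np.zeros((len(seq), 4), dtype=np.float32)
    for i, nt in enumerate(seq):
        encoded[i, NUCLEOTIDES.index(nt)] = 1.0
    return encoded

def toy_predict(seq_1hot):
    """Hypothetical stand-in for a trained model: counts 'GATA' matches."""
    seq = "".join(NUCLEOTIDES[i] for i in seq_1hot.argmax(axis=1))
    return float(sum(seq[i:i + 4] == "GATA" for i in range(len(seq) - 3)))

def saturation_mutagenesis(seq, predict):
    """Return a (length, 4) array of prediction changes for every substitution."""
    ref_1hot = one_hot(seq)
    ref_score = predict(ref_1hot)
    deltas = np.zeros((len(seq), 4), dtype=np.float32)
    for pos in range(len(seq)):
        for nt_idx in range(4):
            if ref_1hot[pos, nt_idx] == 1.0:
                continue  # reference nucleotide: the change stays 0
            mut_1hot = ref_1hot.copy()
            mut_1hot[pos, :] = 0.0
            mut_1hot[pos, nt_idx] = 1.0
            deltas[pos, nt_idx] = predict(mut_1hot) - ref_score
    return deltas

deltas = saturation_mutagenesis("CCGATACC", toy_predict)
loss = deltas.min(axis=1)  # largest decrease at each position (assumed summary)
gain = deltas.max(axis=1)  # largest increase at each position (assumed summary)
print(loss)
print(gain)
```

In this toy case only the four positions of the GATA motif show a loss, which is the kind of pattern the heatmaps are meant to reveal for a real model.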

To do this, you'll need
 * Trained model
 * Input file (FASTA or HDF5 with test_in/test_out)

First, either train your own model by following the [Train/test tutorial](https://github.com/calico/basenji/blob/master/tutorials/train_test.ipynb) or use one that I pre-trained, stored in the models subdirectory.

We'll bash the GATA4 promoter to see what motifs drive its expression. I placed a 131 kb FASTA file surrounding the GATA4 TSS in data/gata4.fa, so we'll use [basenji_sat.py](https://github.com/calico/basenji/blob/master/bin/basenji_sat.py).

The most relevant options are:

| Option/Argument | Value | Note |
|:---|:---|:---|
| -g | | Plot the nucleotides proportional to the gain score, too. |
| -f | 20 | Figure width, which I usually scale to 1/10 of the saturation mutagenesis region length. |
| -l | 200 | Saturation mutagenesis region in the center of the given sequence(s) |
| -o | output/gata4_sat | Output plot directory. |
| --rc | | Predict forward and reverse complement versions and average the results. |
| -t | 0,1,2 | Target indexes to analyze. |
| params_file | models/params_small.txt | Table of parameters to set up the model architecture and optimization. |
| model_file | models/heart/model_best.tf | Trained saved model prefix. |
| input_file | data/gata4.fa | Either FASTA or HDF5 with test_in/test_out keys. |
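
The `--rc` option averages predictions for the forward sequence and its reverse complement. The sketch below shows one way such averaging could work on one-hot arrays. How Basenji implements it internally is not shown here, so treat `reverse_complement_1hot`, `predict_rc_average`, and the toy model as hypothetical names chosen for illustration.

```python
import numpy as np

NUCLEOTIDES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as a (length, 4) one-hot array with columns A, C, G, T."""
    encoded = np.zeros((len(seq), 4), dtype=np.float32)
    for i, nt in enumerate(seq):
        encoded[i, NUCLEOTIDES.index(nt)] = 1.0
    return encoded

def reverse_complement_1hot(seq_1hot):
    """Reverse rows (sequence order) and columns (A<->T, C<->G)."""
    return seq_1hot[::-1, ::-1]

def predict_rc_average(seq_1hot, predict):
    """Average predictions for a sequence and its reverse complement."""
    fwd = predict(seq_1hot)
    # Flip the reverse-strand predictions back into forward orientation.
    rev = predict(reverse_complement_1hot(seq_1hot))[::-1]
    return (fwd + rev) / 2.0

def toy_predict(seq_1hot):
    """Hypothetical strand-specific model: signal of 1 wherever the base is A."""
    return seq_1hot[:, [0]].copy()

avg = predict_rc_average(one_hot("AACT"), toy_predict)
print(avg.ravel())
```

Averaging both orientations smooths out strand-specific quirks of the model, at the cost of roughly doubling the number of predictions.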

In [9]:
! basenji_sat.py -g -f 20 -l 200 -o output/gata4_sat --rc -t 0,1,2 models/params_small.txt models/heart/model_best.tf data/gata4.fa

  return f(*args, **kwds)
This call to matplotlib.use() has no effect because the backend has already
been chosen; matplotlib.use() must be called *before* pylab, matplotlib.pyplot,
or matplotlib.backends is imported for the first time.

{'cnn_dilation': [1, 1, 1, 1, 1, 2, 4, 8, 16, 32, 64, 1], 'link': 'softplus', 'loss': 'poisson', 'cnn_dense': [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0], 'cnn_dropout': 0.1, 'cnn_filter_sizes': [20, 7, 7, 7, 3, 3, 3, 3, 3, 3, 3, 1], 'optimizer': 'adam', 'target_pool': 128, 'adam_beta1': 0.97, 'adam_beta2': 0.98, 'cnn_pool': [2, 4, 4, 4, 1, 0, 0, 0, 0, 0, 0, 0], 'learning_rate': 0.002, 'num_targets': 3, 'cnn_filters': [128, 128, 192, 256, 256, 32, 32, 32, 32, 32, 32, 384], 'batch_buffer': 4096, 'batch_size': 4}
Targets pooled by 128 to length 1024
Convolution w/ 3 384x1 filters to final targets
Model building time 14.755834
2018-05-16 17:54:20.283767: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow bina
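
The log line "Targets pooled by 128 to length 1024" implies an input length of 128 x 1024 bp, which matches the 131 kb FASTA. The following sketch works out where a 200 bp window sits in the center of such a sequence and how many single-nucleotide substitutions it contains. The exact centering convention (zero-based, half-open, rounding down) is an assumption, and `center_window` is a hypothetical helper, not a Basenji function.

```python
def center_window(seq_length, sat_length):
    """Return (start, end) of a sat_length window centered in the sequence."""
    if sat_length > seq_length:
        raise ValueError("mutagenesis region is longer than the sequence")
    start = (seq_length - sat_length) // 2
    return start, start + sat_length

target_pool = 128     # 'target_pool' from the printed parameters
pooled_length = 1024  # from "Targets pooled by 128 to length 1024"
seq_length = target_pool * pooled_length

sat_start, sat_end = center_window(seq_length, 200)
num_mutations = 3 * (sat_end - sat_start)  # three alternative bases per position

print(seq_length, sat_start, sat_end, num_mutations)
# prints: 131072 65436 65636 600
```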

The saturation mutagenesis heatmaps go into output/gata4_sat.

In [10]:
from IPython.display import IFrame
IFrame('output/gata4_sat/seq0_t0.pdf', width=1200, height=400)
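
To see which heatmaps were written before choosing one to display, a small helper can list the PDFs in the output directory. This is a sketch: `list_heatmaps` is a hypothetical helper, and it makes no assumption about file names beyond the `.pdf` extension.

```python
from pathlib import Path

def list_heatmaps(out_dir):
    """Return the sorted PDF file names found in out_dir (empty list if missing)."""
    out_path = Path(out_dir)
    if not out_path.is_dir():
        return []
    return sorted(p.name for p in out_path.glob("*.pdf"))

for name in list_heatmaps("output/gata4_sat"):
    print(name)
```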