Saturation mutagenesis is a powerful tool both for dissecting a specific sequence of interest and for understanding what the model learned. [basenji_sat.py](https://github.com/calico/basenji/blob/master/bin/basenji_sat.py) runs this analysis on sequences from a test set. [basenji_sat_vcf.py](https://github.com/calico/basenji/blob/master/bin/basenji_sat_vcf.py) lets you provide a VCF file for variant-centered mutagenesis.

To do this, you'll need:
 * Trained model
 * Input file (FASTA or HDF5 with test_in/test_out)

First, you can either train your own model by following the [Train/test tutorial](https://github.com/calico/basenji/blob/master/tutorials/train_test.ipynb) or download one that I pre-trained.

In [9]:
import os, subprocess

# Download each checkpoint file that is not already present.
os.makedirs('models', exist_ok=True)
for suffix in ['index', 'meta', 'data-00000-of-00001']:
    if not os.path.isfile('models/gm12878.tf.%s' % suffix):
        url = 'https://storage.googleapis.com/basenji_tutorial_data/model_gm12878.tf.%s' % suffix
        subprocess.call('curl -o models/gm12878.tf.%s %s' % (suffix, url), shell=True)
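
The model prefix models/gm12878.tf stands for the three checkpoint files downloaded above. As a minimal sketch, the hypothetical helper below (`missing_checkpoint_files` is not part of Basenji) reports any of those files that is absent or empty. It only checks presence and size; it does not validate the file contents.

```python
import os

# Suffixes of the three files that make up the checkpoint prefix.
CHECKPOINT_SUFFIXES = ('.index', '.meta', '.data-00000-of-00001')

def missing_checkpoint_files(prefix, suffixes=CHECKPOINT_SUFFIXES):
    """Return the checkpoint files that are absent or empty (hypothetical helper)."""
    missing = []
    for suffix in suffixes:
        path = prefix + suffix
        if not os.path.isfile(path) or os.path.getsize(path) == 0:
            missing.append(path)
    return missing

# An empty list means all three files are present and non-empty.
print(missing_checkpoint_files('models/gm12878.tf'))
```

If the list is not empty, deleting the listed files and re-running the download cell is a reasonable first step.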

We'll bash the PIM1 promoter to see what motifs drive its expression. I placed a 262 kb FASTA file surrounding the PIM1 TSS in data/pim1.fa, so we'll use [basenji_sat.py](https://github.com/calico/basenji/blob/master/bin/basenji_sat.py).

The most relevant options are:

| Option/Argument | Value | Note |
|:---|:---|:---|
| -g | data/human.hg19.genome | Chromosome lengths file for the genome assembly, used to bound gene sequences. |
| -f | 20 | Figure width, which I usually set to 1/10 of the saturation mutagenesis region length. |
| -l | 200 | Saturation mutagenesis region in the center of the given sequence(s). |
| -o | pim1_sat | Output plot directory. |
| -t | 0,38 | Target indexes. 0 is a DNase target and 38 is a CAGE target, as you can see in data/gm12878_wigs.txt. |
| params_file | models/params_small_sat.txt | Table of parameters to set up the model architecture and optimization. |
| model_file | models/gm12878.tf | Trained saved model prefix. |
| input_file | data/pim1.fa | Either FASTA or HDF5 with test_in/test_out keys. |
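
To make the -l option concrete, here is a minimal pure-Python sketch of saturation mutagenesis over a central window. It is an illustration under stated assumptions, not Basenji's implementation: `saturation_mutagenesis` and `toy_score` are hypothetical names, the window is assumed to be exactly centered, and the toy scoring function (GC count) merely stands in for a trained model's prediction.

```python
NUCLEOTIDES = 'ACGT'

def saturation_mutagenesis(seq, score_fn, sat_len):
    """Score every single-nucleotide substitution in the central sat_len bases.

    Returns one row per position, with one score difference (mutant minus
    reference) per nucleotide in A, C, G, T order. The reference nucleotide
    at each position gets 0.0.
    """
    seq = seq.upper()
    ref_score = score_fn(seq)
    start = (len(seq) - sat_len) // 2  # assumed: window is centered
    deltas = []
    for pos in range(start, start + sat_len):
        row = []
        for nt in NUCLEOTIDES:
            if nt == seq[pos]:
                row.append(0.0)  # reference nucleotide, nothing to mutate
            else:
                mutant = seq[:pos] + nt + seq[pos + 1:]
                row.append(score_fn(mutant) - ref_score)
        deltas.append(row)
    return deltas

def toy_score(seq):
    """Stand-in for a model prediction: counts G and C bases."""
    return float(seq.count('G') + seq.count('C'))

# Mutate the central 2 bases ('C' and 'G') of a 6 bp sequence.
deltas = saturation_mutagenesis('AACGTT', toy_score, sat_len=2)
print(deltas)  # [[-1.0, 0.0, 0.0, -1.0], [-1.0, 0.0, 0.0, -1.0]]
```

Each row corresponds to one column of a mutagenesis heatmap: negative values mark substitutions that lower the score, positive values mark ones that raise it.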

In [10]:
! basenji_sat.py -f 20 -l 200 -o pim1_sat -t 0,38 models/params_small_sat.txt models/gm12878.tf data/pim1.fa

{'target_pool': 128, 'num_targets': 39, 'cnn_filters': [128, 128, 160, 200, 250, 256], 'dense': 1, 'learning_rate': 0.002, 'loss': 'poisson', 'dcnn_filter_sizes': [3, 3, 3, 3, 3, 3], 'full_dropout': 0.05, 'link': 'softplus', 'adam_beta2': 0.98, 'cnn_filter_sizes': [22, 1, 6, 6, 6, 3], 'cnn_dropout': 0.05, 'dcnn_dropout': 0.1, 'batch_renorm': 1, 'adam_beta1': 0.97, 'full_units': 384, 'batch_buffer': 16384, 'dcnn_filters': [32, 32, 32, 32, 32, 32], 'batch_size': 1, 'cnn_pool': [1, 2, 4, 4, 4, 1]}
Targets pooled by 128 to length 2048
Convolution w/ 128 4x22 filters strided by 1
Batch normalization
ReLU
Dropout w/ probability 0.050
Convolution w/ 128 128x1 filters strided by 1
Batch normalization
ReLU
Max pool 2
Dropout w/ probability 0.050
Convolution w/ 160 128x6 filters strided by 1
Batch normalization
ReLU
Max pool 4
Dropout w/ probability 0.050
Convolution w/ 200 160x6 filters strided by 1
Batch normalization
ReLU
Max pool 4
Dropout w/ probability 0.050
Convolution w/ 250 200x6 filter

    feed_dict_string, options, run_metadata)
  File "/Users/davidkelley/anaconda3/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1132, in _do_run
    target_list, options, run_metadata)
  File "/Users/davidkelley/anaconda3/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1152, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.DataLossError: not an sstable (bad magic number)
	 [[Node: save/RestoreV2_206 = RestoreV2[dtypes=[DT_INT64], _device="/job:localhost/replica:0/task:0/cpu:0"](_arg_save/Const_0_0, save/RestoreV2_206/tensor_names, save/RestoreV2_206/shape_and_slices)]]

Caused by op 'save/RestoreV2_206', defined at:
  File "/Users/davidkelley/code/Basenji/bin/basenji_sat.py", line 575, in <module>
    main()
  File "/Users/davidkelley/code/Basenji/bin/basenji_sat.py", line 120, in main
    saver = tf.train.Saver()
  File "/Users/davidkelley/anaconda3/lib/python3.5/site-packages/tens

The saturation mutagenesis heatmaps are written to pim1_sat.
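
Before displaying a plot, it can help to list what was actually written. The sketch below assumes the heatmaps are PDF files directly inside the output directory, as the file names displayed below suggest; `list_heatmaps` is a hypothetical helper, not part of Basenji.

```python
import glob
import os

def list_heatmaps(out_dir):
    """Return the sorted PDF paths found in out_dir (empty list if none)."""
    return sorted(glob.glob(os.path.join(out_dir, '*.pdf')))

# Print one path per heatmap; prints nothing if the run produced no plots.
for pdf_path in list_heatmaps('pim1_sat'):
    print(pdf_path)
```

An empty listing suggests that the script stopped before plotting, in which case the cell output above is the place to look.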

In [11]:
from IPython.display import IFrame
IFrame('pim1_sat/seq0_t0.pdf', width=1200, height=400)

Describe the output...

In [12]:
IFrame('pim1_sat/seq0_t1.pdf', width=1200, height=400)