Saturation mutagenesis is a powerful tool for both dissecting a specific sequence of interest and understanding what the model learned. [basenji_sat.py](https://github.com/calico/basenji/blob/master/bin/basenji_sat.py) enables this analysis from a test set of data. [basenji_sat_vcf.py](https://github.com/calico/basenji/blob/master/bin/basenji_sat_vcf.py) lets you provide a VCF file for variant-centered mutagenesis.

To do this, you'll need:
 * Trained model
 * Input file (FASTA or HDF5 with test_in/test_out)

First, either train your own model by following the [Train/test tutorial](https://github.com/calico/basenji/blob/master/tutorials/train_test.ipynb) or use one that I pre-trained, available in the models subdirectory.

We'll bash the PIM1 promoter to see what motifs drive its expression. I placed a 262 kb FASTA file surrounding the PIM1 TSS in data/pim1.fa, so we'll use [basenji_sat.py](https://github.com/calico/basenji/blob/master/bin/basenji_sat.py).
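
To make the idea concrete, here is a minimal sketch of what saturation mutagenesis computes: every position in a central window is mutated to each alternative nucleotide, and the change in the prediction is recorded. This is not Basenji's implementation. `toy_predict` is a hypothetical stand-in for a trained model, and taking the per-position minimum and maximum as "loss" and "gain" scores is an assumption made for illustration.

```python
import numpy as np

NUCLEOTIDES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as a (length, 4) one-hot array in ACGT order."""
    index = {nt: i for i, nt in enumerate(NUCLEOTIDES)}
    encoded = np.zeros((len(seq), 4), dtype=np.float32)
    for pos, nt in enumerate(seq.upper()):
        if nt in index:  # N and other symbols stay all-zero
            encoded[pos, index[nt]] = 1.0
    return encoded

def toy_predict(x):
    """Hypothetical stand-in for a trained model: returns one scalar score."""
    weights = np.array([0.1, 0.4, 0.3, 0.2], dtype=np.float32)
    return float((x @ weights).sum())

def saturation_mutagenesis(seq, region_len, predict=toy_predict):
    """Score every single-nucleotide substitution in the central region.

    Returns (start, deltas). deltas has shape (region_len, 4), and
    deltas[i, j] is the prediction change from mutating position start + i
    to nucleotide j. The reference nucleotide's entry stays 0.
    """
    x = one_hot(seq)
    start = (len(seq) - region_len) // 2
    ref_score = predict(x)
    deltas = np.zeros((region_len, 4), dtype=np.float32)
    for i in range(region_len):
        pos = start + i
        for j in range(4):
            if x[pos, j] == 1.0:
                continue  # skip the reference nucleotide
            mutant = x.copy()
            mutant[pos] = 0.0
            mutant[pos, j] = 1.0
            deltas[i, j] = predict(mutant) - ref_score
    return start, deltas

start, deltas = saturation_mutagenesis("ACGTACGTAC", region_len=4)
loss = deltas.min(axis=1)  # most damaging substitution per position (assumed definition)
gain = deltas.max(axis=1)  # most activating substitution per position (assumed definition)
```

With a real model, each `predict` call is a full forward pass, so the cost grows with three mutants per position in the window.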

The most relevant options are:

| Option/Argument | Value | Note |
|:---|:---|:---|
| -g | | Plot the nucleotides proportional to the gain score, too. |
| -f | 20 | Figure width, which I usually scale to 10x the saturation mutagenesis region. |
| -l | 200 | Saturation mutagenesis region in the center of the given sequence(s) |
| -o | pim1_sat | Output plot directory. |
| --rc | | Predict forward and reverse complement versions and average the results. |
| -t | 0,38 | Target indexes. 0 is DNase and 38 is CAGE, as you can see in data/gm12878_wigs.txt. |
| params_file | models/params_small_sat.txt | Table of parameters to set up the model architecture and optimization. |
| model_file | models/gm12878.tf | Trained saved model prefix. |
| input_file | data/pim1.fa | Either FASTA or HDF5 with test_in/test_out keys. |
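
The `--rc` option averages predictions over both strands. The sketch below shows one way such averaging can work on a one-hot sequence, assuming ACGT column order and a model that emits one value per position. `toy_predict` is a hypothetical stand-in, and how basenji_sat.py aligns the two strands internally is not taken from its source.

```python
import numpy as np

def reverse_complement(x):
    """Reverse-complement a (length, 4) one-hot array in ACGT order.

    Flipping axis 0 reverses the sequence; flipping axis 1 swaps
    A<->T and C<->G.
    """
    return x[::-1, ::-1]

def toy_predict(x):
    """Hypothetical strand-sensitive model: one prediction per position."""
    weights = np.array([0.1, 0.4, 0.3, 0.2])
    return x @ weights

def predict_rc_averaged(x, predict=toy_predict):
    """Average forward and reverse-complement predictions position by position."""
    fwd = predict(x)
    # Predictions on the reverse strand come out in reversed order,
    # so flip them back before averaging.
    rev = predict(reverse_complement(x))[::-1]
    return (fwd + rev) / 2.0

x = np.eye(4)[[0, 1, 2, 3, 0]]  # one-hot encoding of "ACGTA"
avg = predict_rc_averaged(x)
```

The averaged output no longer depends on which strand was given as input, which is the point of the option, at the price of twice as many forward passes.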

In [25]:
! basenji_sat.py -g -f 20 -l 200 -o pim1_sat --rc -t 0,38 models/params_med.txt models/gm12878.tf data/pim1.fa

{'cnn_pool': [1, 2, 4, 4, 4, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'batch_size': 1, 'adam_beta1': 0.97, 'cnn_dilation': [1, 1, 1, 1, 1, 1, 2, 4, 8, 16, 32, 64, 128, 1], 'adam_beta2': 0.98, 'loss': 'poisson', 'num_targets': 39, 'cnn_filters': [196, 196, 235, 282, 338, 384, 64, 64, 64, 64, 64, 64, 64, 512], 'cnn_dropout': [0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.1], 'link': 'softplus', 'target_pool': 128, 'cnn_dense': [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0], 'batch_buffer': 16384, 'batch_renorm': 1, 'cnn_filter_sizes': [22, 1, 6, 6, 6, 3, 3, 3, 3, 3, 3, 3, 3, 3], 'learning_rate': 0.002}
Targets pooled by 128 to length 2048
Convolution w/ 196 4x22 filters strided 1, dilated 1
Batch normalization
ReLU
Dropout w/ probability 0.050
Convolution w/ 196 196x1 filters strided 1, dilated 1
Batch normalization
ReLU
Max pool 2
Dropout w/ probability 0.050
Convolution w/ 235 196x6 filters strided 1, dilated 1
Batch normalization
ReLU
Max pool 4
Dropout w/ probability

The saturation mutagenesis heatmaps are written to the pim1_sat directory.
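
Before opening the plots, it can help to check what was actually written. This small helper is a hypothetical convenience and not part of Basenji; it lists the PDF files in an output directory and makes no assumption about how they are named.

```python
from pathlib import Path

def list_heatmaps(out_dir):
    """Return the sorted names of PDF files in out_dir (empty list if it is missing)."""
    path = Path(out_dir)
    if not path.is_dir():
        return []
    return sorted(p.name for p in path.glob("*.pdf"))

print(list_heatmaps("pim1_sat"))
```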

First, the DNase:

In [26]:
from IPython.display import IFrame
IFrame('pim1_sat/seq0_t0.pdf', width=1200, height=400)

Second, the CAGE:

In [27]:
IFrame('pim1_sat/seq0_t1.pdf', width=1200, height=400)