To train a model, you first need to convert your sequences and targets into the input HDF5 format. Check out my tutorials for how to do that; they're linked from the [main page](../README.md).

For this tutorial, grab a small example HDF5 that I constructed here with 10% of the training sequences and only GM12878 targets for various DNase-seq, ChIP-seq, and CAGE experiments.

In [1]:
! curl -o data/gm12878_l262k_w128_d10.h5 https://storage.googleapis.com/262k_binned/gm12878_l262k_w128_d10.h5

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  929M  100  929M    0     0  29.5M      0  0:00:31  0:00:31 --:--:-- 32.2M


Next, you need to decide what sort of architecture to use. This grammar probably needs work; my goal was to enable hyperparameter searches to write the parameters to file so that I could run parallel training jobs to explore the hyperparameter space. I included an example set of parameters that will work well with this data in models/params_small.txt.
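
As a rough illustration of that workflow, the sketch below writes one parameter set to a text file, the way a hyperparameter search might before launching a training job. The two-column `name<TAB>value` layout with comma-separated lists is an assumption made for illustration, so check models/params_small.txt for the actual format. `write_params` and `format_value` are hypothetical helpers, not part of Basenji.

```python
import os
import tempfile


def format_value(value):
    """Render a parameter value; lists become comma-separated strings."""
    if isinstance(value, (list, tuple)):
        return ','.join(str(v) for v in value)
    return str(value)


def write_params(params, path):
    """Write one `name<TAB>value` line per parameter, sorted by name.

    The layout is assumed for illustration only.
    """
    with open(path, 'w') as params_out:
        for name in sorted(params):
            params_out.write('%s\t%s\n' % (name, format_value(params[name])))


# A few values copied from the example parameters used in this tutorial.
params = {
    'batch_size': 1,
    'learning_rate': 0.002,
    'cnn_filters': [128, 128, 160, 200, 250, 256],
    'cnn_pool': [1, 2, 4, 4, 4, 1],
    'loss': 'poisson',
}

# Write to a temporary directory so nothing in models/ is overwritten.
params_path = os.path.join(tempfile.mkdtemp(), 'params_example.txt')
write_params(params, params_path)

with open(params_path) as params_in:
    params_text = params_in.read()
print(params_text)
```

A search script could call `write_params` once per sampled configuration and hand each file to its own training job.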

Then, run [basenji_train.py](https://github.com/calico/basenji/blob/master/bin/basenji_train.py) to train a model. The program will offer training feedback via stdout and write the model output files to the prefix given by the *-s* parameter.

The most relevant options here are:

| Option/Argument | Value | Note |
|:---|:---|:---|
| --rc | | Process even-numbered epochs as forward, odd-numbered as reverse complemented. Average the forward and reverse complement to assess validation accuracy. |
| -s | models/gm12878 | File path prefix to save the model. |
| params_file | models/params_small.txt | Table of parameters to set up the model architecture and optimization. |
| data_file | data/gm12878_l262k_w128_d10.h5 | HDF5 file containing the training and validation input and output datasets as generated by [basenji_hdf5_single.py](https://github.com/calico/basenji/blob/master/bin/basenji_hdf5_single.py). |
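
If you would rather launch training from Python than from a shell cell, the invocation can be assembled as an argument list. This is only a sketch: it uses just the options from the table above, `build_train_command` is a hypothetical helper, and the actual call is left commented out because training may take hours.

```python
import subprocess


def build_train_command(params_file, data_file, save_prefix, rc=False):
    """Assemble the basenji_train.py argument list from the options above."""
    cmd = ['basenji_train.py']
    if rc:
        cmd.append('--rc')
    cmd += ['-s', save_prefix, params_file, data_file]
    return cmd


train_cmd = build_train_command(
    'models/params_small.txt',
    'data/gm12878_l262k_w128_d10.h5',
    'models/gm12878',
    rc=True)
print(' '.join(train_cmd))

# Uncomment to launch training (assumes basenji_train.py is on your PATH).
# subprocess.call(train_cmd)
```

Passing a list rather than a single string avoids shell quoting issues if a path contains spaces.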

If you want to train, uncomment the following line and run it. Depending on your hardware, it may require many hours.

In [2]:
# ! basenji_train.py -s models/gm12878 models/params_small.txt data/gm12878_l262k_w128_d10.h5

{'adam_beta2': 0.98, 'link': 'softplus', 'batch_size': 1, 'dense': 1, 'cnn_dropout': 0.05, 'dcnn_filter_sizes': [3, 3, 3, 3, 3, 3], 'cnn_filter_sizes': [22, 1, 6, 6, 6, 3], 'full_units': 384, 'adam_beta1': 0.97, 'full_dropout': 0.05, 'cnn_filters': [128, 128, 160, 200, 250, 256], 'learning_rate': 0.002, 'dcnn_dropout': 0.1, 'batch_buffer': 16384, 'dcnn_filters': [32, 32, 32, 32, 32, 32], 'cnn_pool': [1, 2, 4, 4, 4, 1], 'loss': 'poisson', 'batch_renorm': 1}
Targets pooled by 128 to length 2048
Convolution w/ 128 4x22 filters strided by 1
Batch normalization
ReLU
Dropout w/ probability 0.050
Convolution w/ 128 128x1 filters strided by 1
Batch normalization
ReLU
Max pool 2
Dropout w/ probability 0.050
Convolution w/ 160 128x6 filters strided by 1
Batch normalization
ReLU
Max pool 4
Dropout w/ probability 0.050
Convolution w/ 200 160x6 filters strided by 1
Batch normalization
ReLU
Max pool 4
Dropout w/ probability 0.050
Convolution w/ 250 200x6 filters strided by 1
Batch normalization
ReLU
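
The first lines of that log follow directly from the parameters. The product of `cnn_pool` is the total downsampling factor, and each convolution layer takes as many input channels as the previous layer has filters. A quick check, assuming the `l262k` in the data file name denotes 262,144 bp (2^18) input sequences and that the first layer sees 4 one-hot nucleotide channels:

```python
from functools import reduce
from operator import mul

# Values from the parameter dictionary printed above.
cnn_pool = [1, 2, 4, 4, 4, 1]
cnn_filters = [128, 128, 160, 200, 250, 256]
cnn_filter_sizes = [22, 1, 6, 6, 6, 3]

seq_length = 2 ** 18  # assumption: 'l262k' means 262,144 bp

# Total downsampling is the product of the per-layer pool widths.
pool_factor = reduce(mul, cnn_pool, 1)
target_length = seq_length // pool_factor
print('Targets pooled by %d to length %d' % (pool_factor, target_length))

# Each layer's filter depth equals the previous layer's filter count.
in_channels = 4  # assumption: one-hot A, C, G, T
conv_shapes = []
for filters, width in zip(cnn_filters, cnn_filter_sizes):
    conv_shapes.append((filters, in_channels, width))
    print('Convolution w/ %d %dx%d filters' % (filters, in_channels, width))
    in_channels = filters
```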

Alternatively, you can just download a trained model.

In [None]:
import subprocess

# Set to False to skip the download if you trained your own model above.
if True:
    subprocess.call('curl -o models/gm12878_best.tf.index https://storage.googleapis.com/basenji_tutorial_data/gm12878_best.tf.index', shell=True)
    subprocess.call('curl -o models/gm12878_best.tf.meta https://storage.googleapis.com/basenji_tutorial_data/gm12878_best.tf.meta', shell=True)
    subprocess.call('curl -o models/gm12878_best.tf.data-00000-of-00001 https://storage.googleapis.com/basenji_tutorial_data/gm12878_best.tf.data-00000-of-00001', shell=True)

*models/gm12878_best.tf* is now the saved model prefix to provide to other programs.
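
Since the model is stored as three files sharing the models/gm12878_best.tf prefix, it can be worth confirming that all of them are present before passing the prefix on, because a missing file will make the restore step fail. A small sketch (the helper name `missing_checkpoint_files` is hypothetical, and the suffix list simply mirrors the three downloads above):

```python
import os

# Suffixes of the three files downloaded for the trained model.
CHECKPOINT_SUFFIXES = ['.index', '.meta', '.data-00000-of-00001']


def missing_checkpoint_files(model_prefix):
    """Return the files expected for model_prefix that do not exist."""
    expected = [model_prefix + suffix for suffix in CHECKPOINT_SUFFIXES]
    return [path for path in expected if not os.path.isfile(path)]


missing = missing_checkpoint_files('models/gm12878_best.tf')
if missing:
    print('Missing model files:')
    for path in missing:
        print('  ' + path)
else:
    print('All model files found.')
```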

To further benchmark the accuracy (e.g. computing significant "peak" accuracy), use [basenji_test.py](https://github.com/calico/basenji/blob/master/bin/basenji_test.py).

The most relevant options here are:

| Option/Argument | Value | Note |
|:---|:---|:---|
| --rc | | Average the forward and reverse complement to form prediction. |
| -o | data/gm12878_test | Output directory. |
| --ai | 0,1,2 | Make accuracy scatter plots for targets 0, 1, and 2. |
| --ti | 3,4,5 | Make BigWig tracks for targets 3, 4, and 5. |
| -t | data/gm12878_l262k_w128_d10.bed | BED file describing sequence regions for BigWig track output. |
| params_file | models/params_small.txt | Table of parameters to set up the model architecture and optimization. |
| model_file | models/gm12878_best.tf | Trained saved model prefix. |
| data_file | data/gm12878_l262k_w128_d10.h5 | HDF5 file containing the test input and output datasets as generated by [basenji_hdf5_single.py](https://github.com/calico/basenji/blob/master/bin/basenji_hdf5_single.py). |

In [3]:
! basenji_test.py --rc -o data/gm12878_test --ai 0,1,2 -t data/gm12878_l262k_w128_d10.bed --ti 3,4,5 models/params_small.txt models/gm12878_best.tf data/gm12878_l262k_w128_d10.h5

{'loss': 'poisson', 'adam_beta2': 0.98, 'cnn_dropout': 0.05, 'full_dropout': 0.05, 'full_units': 384, 'dcnn_filters': [32, 32, 32, 32, 32, 32], 'cnn_pool': [1, 2, 4, 4, 4, 1], 'cnn_filter_sizes': [22, 1, 6, 6, 6, 3], 'adam_beta1': 0.97, 'learning_rate': 0.002, 'dcnn_dropout': 0.1, 'link': 'softplus', 'dense': 1, 'batch_buffer': 16384, 'batch_renorm': 1, 'cnn_filters': [128, 128, 160, 200, 250, 256], 'dcnn_filter_sizes': [3, 3, 3, 3, 3, 3], 'batch_size': 1}
Targets pooled by 128 to length 2048
Convolution w/ 128 4x22 filters strided by 1
Batch normalization
ReLU
Dropout w/ probability 0.050
Convolution w/ 128 128x1 filters strided by 1
Batch normalization
ReLU
Max pool 2
Dropout w/ probability 0.050
Convolution w/ 160 128x6 filters strided by 1
Batch normalization
ReLU
Max pool 4
Dropout w/ probability 0.050
Convolution w/ 200 160x6 filters strided by 1
Batch normalization
ReLU
Max pool 4
Dropout w/ probability 0.050
Convolution w/ 250 200x6 filters strided by 1
Batch normalization
ReLU

2017-08-24 18:13:35.851110: W tensorflow/core/framework/op_kernel.cc:1158] Not found: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for models/gm12878_best.tf

    main()
  File "/Users/davidkelley/code/Basenji/bin/basenji_test.py", line 143, in main
    saver.restore(sess, model_file)
  File "/Users/davidkelley/anaconda3/lib/python3.5/site-packages/tensorflow/python/training/saver.py", line 1548, in restore
    {self.saver_def.filename_tensor_name: save_path})
  File "/Users/davidkelley/anaconda3/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 789, in run
    run_metadata_ptr)
  File "/Users/davidkelley/anaconda3/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 997, in _run
    feed_dict_string, options, run_metadata)
  File "/Users/davidkelley/anaconda3/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1132, in _do_run
    target_list, options, run_metadata)
  File "/Users/davidkelley/anaconda3/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1152, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.N

*data/gm12878_test/acc.txt* is a table specifying the loss function value, R2, R2 after log2, and Spearman correlation for each dataset.

In [4]:
! cat data/gm12878_test/acc.txt

cat: data/gm12878_test/acc.txt: No such file or directory
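
For reference, the two R2 columns of acc.txt can be read as the coefficient of determination computed on raw values and on log2-transformed values. The sketch below is a plain-Python illustration, not the code that basenji_test.py runs. In particular, the +1 pseudocount before the log is an assumption, and the toy numbers are invented.

```python
import math


def r_squared(targets, preds):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_target = sum(targets) / len(targets)
    ss_res = sum((t - p) ** 2 for t, p in zip(targets, preds))
    ss_tot = sum((t - mean_target) ** 2 for t in targets)
    return 1.0 - ss_res / ss_tot


def log2_r_squared(targets, preds, pseudocount=1.0):
    """R2 after log2(x + pseudocount); the pseudocount is an assumption."""
    log_targets = [math.log2(t + pseudocount) for t in targets]
    log_preds = [math.log2(p + pseudocount) for p in preds]
    return r_squared(log_targets, log_preds)


# Toy coverage values, invented for illustration.
targets = [0.0, 1.0, 3.0, 7.0, 15.0]
preds = [0.5, 1.0, 2.0, 8.0, 12.0]

print('R2      %.3f' % r_squared(targets, preds))
print('log2 R2 %.3f' % log2_r_squared(targets, preds))
```

The log transform keeps a handful of very high coverage bins from dominating the score.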


*data/gm12878_test/peaks.txt* is a table specifying the number of peaks called, AUROC, and AUPRC for each dataset.

In [5]:
! cat data/gm12878_test/peaks.txt

cat: data/gm12878_test/peaks.txt: No such file or directory
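
The AUROC column of peaks.txt has a convenient interpretation: it is the probability that a randomly chosen peak position receives a higher prediction than a randomly chosen non-peak position. A minimal sketch of that definition, assuming you already have binary peak labels (how basenji_test.py calls peaks is not covered here) and using invented toy values:

```python
def auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half. The pairwise loop is quadratic in the number of
    examples, so this is for illustration only.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))


# Toy peak labels and predictions, invented for illustration.
labels = [1, 0, 1, 0, 0, 1]
scores = [0.9, 0.2, 0.6, 0.7, 0.1, 0.8]
print('AUROC %.3f' % auroc(labels, scores))
```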


The directories *pr*, *roc*, *violin*, and *scatter* in *data/gm12878_test* contain plots for the targets indexed by 0, 1, and 2 as specified by the --ai option above.

E.g.

In [6]:
from IPython.display import IFrame

# Embed the precision-recall plot for target 0. A Markdown image link is not valid in a code cell.
IFrame('data/gm12878_test/pr/t0.pdf', width=600, height=450)
