Commit 5301a7c: training tutorial
davek44 committed Oct 14, 2015
1 parent 93f8639
Showing 5 changed files with 48 additions and 2 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -11,5 +11,6 @@ tutorials/*.bed
tutorials/*.fa
tutorials/*.h5
tutorials/*.txt
tutorials/infl_out
tutorials/motifs_out
tutorials/.ipynb_checkpoints
2 changes: 1 addition & 1 deletion README.md
@@ -74,7 +74,7 @@ These are a work in progress, so forgive incompleteness for the moment. If there
- [Prepare new dataset(s) by adding to a compendium.](tutorials/new_data_many.ipynb)
- [Prepare new dataset(s) in isolation.](tutorials/new_data_iso.ipynb)
- Train
- [Train a model.](tutorials/train.ipynb)
- [Train a model.](tutorials/train.md)
- Test
- [Test a trained model.](tutorials/test.ipynb)
- Visualization
16 changes: 16 additions & 0 deletions data/models/pretrained_params.txt
@@ -0,0 +1,16 @@
conv_filters 300
conv_filters 200
conv_filters 200
conv_filter_sizes 19
conv_filter_sizes 11
conv_filter_sizes 7
pool_width 3
pool_width 4
pool_width 4
hidden_units 1000
hidden_units 1000
hidden_dropouts 0.3
hidden_dropouts 0.3
learning_rate 0.002
weight_norm 7
momentum 0.98
1 change: 0 additions & 1 deletion data/models/tmp

This file was deleted.

30 changes: 30 additions & 0 deletions tutorials/train.md
@@ -0,0 +1,30 @@
### Basset
###### Deep convolutional neural networks for DNA sequence analysis.
--------------------------------------------------------------------------------

To train a model, you first need to convert your sequences into HDF5 format for Torch. Check out my tutorials for how to do that; they're linked from the [main page](../README.md).

We'll use the script *basset_train.lua* to train. I'll walk through that script's options and discuss the decisions you need to make for each one.

###### -cuda
First, you need to decide whether to compute on your CPU or GPU. GPU stands for graphics processing unit; GPUs have proven tremendously useful for certain computations that intersect with, you guessed it, graphics. Here's a [Udacity course](https://www.udacity.com/course/intro-to-parallel-programming--cs344) where you can learn more. Training on the GPU is MUCH faster, and is effectively required for training a complex model on anything beyond 100k sequences. Beyond a million sequences, you'll need either a high-end GPU or patience. Move training to your GPU with the -cuda option.

###### -job
Next, you need to design the model architecture, choose optimization parameters, and write them to a table text file. At this stage of rapid artificial neural network research, this is equal parts art and science. I provide the best architecture and optimization parameters that I found for the DNaseI-seq compendium analyzed in the paper [here](../data/models/pretrained_params.txt), which can also serve as an example of the required format. There's no doubt that better parameters exist, so explore away! [Bayesian optimization](http://papers.nips.cc/paper/4522-practical-bayesian-optimization-of-machine-learning-algorithms.pdf) is a fantastic approach for doing so, and [Spearmint](https://github.com/HIPS/Spearmint) can help you do it. Jasper and I can offer more advice here for ambitious users if you send me an email.
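As a format reference, each line of the job file holds a parameter name and a value, and repeated names appear to give per-layer values (compare pretrained_params.txt above). A minimal sketch with hypothetical values:
```
conv_filters 100
conv_filters 75
conv_filter_sizes 15
conv_filter_sizes 9
pool_width 4
pool_width 4
hidden_units 500
hidden_dropouts 0.5
learning_rate 0.002
momentum 0.98
```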

###### -stagnant_t
Assuming you carefully set aside some validation data, this option specifies how many training epochs to carry out without improvement in the validation loss before halting. If you expect to overfit, a small value here will save you a lot of unnecessary computation.

###### -save
Basset will save checkpoints and the model corresponding to the best validation loss it's seen. (These files are large, so delete them once you know you don't want them.) Specify the output filename prefix with this option.
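For example, to write output files under a prefix of your choosing (my_model here is hypothetical):
```
# save checkpoints and the best model under the prefix my_model (hypothetical prefix)
basset_train.lua -cuda -job pretrained_params.txt -stagnant_t 10 -save my_model all_data_ever.h5
```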

At this point, you know enough to train. Run as
```
basset_train.lua -cuda -job pretrained_params.txt -stagnant_t 10 all_data_ever.h5
```

###### -restart
Sometimes you'll need to stop your training job, but want to restart it. No problem. Basset writes a checkpoint after every epoch, so just provide the latest with this option.
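For instance (the checkpoint filename below is hypothetical; use the latest checkpoint your run wrote):
```
# resume from the most recent checkpoint (hypothetical filename)
basset_train.lua -cuda -job pretrained_params.txt -stagnant_t 10 -restart my_model_check.th all_data_ever.h5
```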

###### -seed
As I demonstrate in the paper, seeding a training run with the parameters from my pretrained model can be a powerful approach to learning on smaller, new datasets. Provide the model with this option.
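For example, to seed a run on a new dataset (both filenames here are hypothetical):
```
# seed training with a pretrained model's parameters (hypothetical filenames)
basset_train.lua -cuda -job pretrained_params.txt -stagnant_t 10 -seed pretrained_model.th new_data.h5
```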
