In [48]:
%matplotlib inline
import seaborn as sns

# Training a network

## Logging

For most purposes, we are going to want to set our logging level to `INFO`, since some commands are going to run for a long time, and we would like periodic updates.

In [49]:
import logging
reload(logging)
logging.basicConfig(format='%(asctime)s %(levelname)s:%(message)s', level=logging.INFO, datefmt='%I:%M:%S')
#logging.debug('This is a debug message')
#logging.basicConfig(level=logging.INFO)
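
As a quick illustration of what the `INFO` threshold means, the sketch below sends three messages through a separate logger and records which ones survive. The `ListHandler` class and the logger name `logging_level_demo` are made up for this example and are not part of `a3`; only standard `logging` calls are used.

```python
import logging


class ListHandler(logging.Handler):
    """Collect formatted messages in a list (illustrative helper, not part of a3)."""

    def __init__(self):
        logging.Handler.__init__(self)
        self.messages = []

    def emit(self, record):
        self.messages.append(self.format(record))


# Use a dedicated logger so the demo does not disturb the root configuration.
demo_log = logging.getLogger('logging_level_demo')
demo_log.setLevel(logging.INFO)
demo_log.propagate = False

handler = ListHandler()
handler.setFormatter(logging.Formatter('%(levelname)s:%(message)s'))
demo_log.addHandler(handler)

demo_log.debug('dropped: below the INFO threshold')
demo_log.info('kept: at the INFO threshold')
demo_log.warning('kept: above the INFO threshold')

print(handler.messages)
```

Only the `INFO` and `WARNING` messages are collected; the `DEBUG` message is filtered out before it reaches the handler.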

## Setting up an `HDF5` database

The first step is to create a dataset. This process is mostly abstracted away for you using `a3.Dataset` objects. Your responsibility is to specify how your data is stored and structured. We'll start by specifying some structure. 

### Keys

For the example dataset, there is only a single subject. For this dataset, a combination of a study ID and frame ID is sufficient to pick out a unique datapoint.  We will encode this database key as a list of the form `['study','frame']`. While you might construct a key of additional fields (ex. `subject`), the last element will usually be frame, since that is usually the minimal unit of analysis. 

Note that these names are arbitrary, as long as you are consistent. Using informative names is still best practice, since `autotres` has some sensible defaults that rely on certain keys; for example, the default code for extracting images from video relies on a `'frame'` key.

In [50]:
keys = ['study', 'frame']
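
One way to picture a hierarchical key is as a path into a nested dict, with one level per key field. The sketch below is only a mental model and makes no claim about how `a3.Dataset` stores things internally. The `insert` helper, the frame numbers, and the file names are hypothetical; the study IDs are taken from the example dataset.

```python
keys = ['study', 'frame']


def insert(tree, key_values, label, value):
    """Store value under tree[study][frame][label], creating levels as needed."""
    node = tree
    for k in key_values:
        node = node.setdefault(k, {})
    node[label] = value


tree = {}
# Hypothetical frame numbers and file names, for illustration only.
insert(tree, ['20110518JF', '0001'], 'image', '20110518JF_0001.jpg')
insert(tree, ['20110518JF', '0002'], 'image', '20110518JF_0002.jpg')
insert(tree, ['20110826JF', '0001'], 'image', '20110826JF_0001.jpg')

print(sorted(tree.keys()))              # top level: one entry per study
print(sorted(tree['20110518JF'].keys()))  # second level: one entry per frame
```

Adding a field such as `subject` to the front of `keys` would simply add one more level above `study`.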

### Types

Next, we need to tell autotres about the types of files we have, the types of data they represent, and which levels of the key hierarchy each piece of data should be associated with. This will mostly be accomplished with regular expressions. We create a dict of data types. The keys of the dict are data type names. Note that we want one entry for each type of data we need, not one for each type of file we have. `autotres` is perfectly happy pulling more than one type of information from a file. 

These are again arbitrary labels, but autotres can provide some sensible default behaviors for certain labels. Each dict contains information about that type:

#### `'conflict'`: How to deal with multiple files of the same type appearing for the same combination of identifiers. 

For example, if I have multiple tracers on my team, then I would expect to have multiple 'trace' files for each combination of 'subject' and 'frame'. In this case, I would like to keep all of the traces for the same image, so that I can do something sensible with them, like interpolate. Here I will set the value for 'conflict' to 'list'. 

Similarly, if I have conducted multiple studies with my dataset, one looking at coronals and one looking at fricatives, I would expect to have multiple copies of the images for coronal fricatives. However, unlike the case of multiple traces, `fricative_frame-00042.png` and `coronal_frame-00042.png` should be identical files. I can use the 'hash' option to specify that duplicates should be ignored as long as they are identical, but that an exception should be raised if they differ.

Finally, there are some situations that I simply expect not to happen. For example, if a single subject is associated with more than one 'audio' file, then most likely somebody mislabeled something. In this case, I would not set 'conflict' to anything, and if there is a conflict, autotres will raise an exception automatically.
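
The three policies can be sketched as a small function. This is an illustration of the behavior described above, not the autotres implementation: the `resolve` and `content_hash` helpers are hypothetical, and the use of SHA-1 over the raw bytes is an assumption made for the example.

```python
import hashlib


def content_hash(data):
    """Return a hex digest of a bytes object."""
    return hashlib.sha1(data).hexdigest()


def resolve(existing, new, conflict=None):
    """Combine two candidate values for the same key under a conflict policy.

    Hypothetical helper for illustration; not the autotres implementation.
    """
    if conflict == 'list':
        # Keep every value, e.g. one trace per tracer.
        kept = existing if isinstance(existing, list) else [existing]
        return kept + [new]
    if conflict == 'hash':
        # Identical content is a harmless duplicate; differing content is an error.
        if content_hash(existing) == content_hash(new):
            return existing
        raise ValueError('duplicate entries differ in content')
    # No policy set: any duplicate is treated as a labeling mistake.
    raise ValueError('unexpected duplicate entry')


print(resolve(b'trace by A', b'trace by B', conflict='list'))
print(resolve(b'same pixels', b'same pixels', conflict='hash'))
```

Calling `resolve` with no policy, or with `'hash'` and two different byte strings, raises a `ValueError`, which mirrors the "fail loudly" behavior described for unexpected duplicates.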

#### `'regex'`: How to associate each file with a combination of hierarchical levels.

This is a regular expression that will match a filename in the dataset. We use the `(?P<label>...)` syntax to capture parts of the filename that are informative. Specifically, we need to be able to infer all the relevant hierarchical information from the file name. The captured key fields should form a prefix of the `keys` list (for example `study` alone, or `study` and `frame`). 

Note that the regex is matched to the entire pathname, relative to whatever path we give it (see below). Here, we have also used the `(?x)` flag to allow us to break the regex over multiple lines, and to include comments.

In [51]:
import os
import re

types = {
    'trace': {
        'regex': r"""(?x)
            (?P<study>\d+\w+)              # in the example dataset, a 'study' is encoded in the image name as the substring preceding an '_'
            _(?P<frame>\d+)\.(?:jpg|png)   # the frame number
            \.(?P<tracer>\w+)              # the tracer id
            \.traced\.txt$""",
        'conflict': 'list'
        },
    'image': {
        'regex': r"""(?x)
            (?P<study>\d+\w+)
            _(?P<frame>\d+)
            \.(?P<ext>jpg|png)$""",
        'conflict': 'hash'
        },
    'name': {
        'regex': r"""(?x)
            (?P<fname>(?P<study>\d+\w+)
                _(?P<frame>\d+)
                \.(?P<ext>jpg|png)
            )$""",
        }
    }
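
Before scanning a large directory, it can help to try a pattern on a sample path. The sketch below repeats the `'trace'` pattern and applies it with `re.search`; whether autotres itself uses `re.search` or `re.match` is an assumption here, and the path and the tracer name `alice` are hypothetical.

```python
import re

trace_regex = r"""(?x)
    (?P<study>\d+\w+)              # study ID: digits followed by word characters
    _(?P<frame>\d+)\.(?:jpg|png)   # the frame number
    \.(?P<tracer>\w+)              # the tracer id
    \.traced\.txt$"""

# Hypothetical path, shaped like the file names the pattern expects.
path = 'example_data/20110518JF_0001.jpg.alice.traced.txt'

match = re.search(trace_regex, path)
print(match.groupdict())

# A file that does not fit the pattern yields no match.
print(re.search(trace_regex, 'example_data/notes.txt'))
```

Note that `\w` also matches `_`, so the `study` group relies on backtracking to stop at the last underscore that is followed by a run of digits and an image extension.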

### Creating the dataset

We will now set up our dataset. The `roi`, `n_points`, and `scale` `kwargs` will be passed down to the default data extraction callbacks (see [`a3/dataset.py`](../a3/dataset.py) documentation). Custom callbacks can be provided by putting a callable in the `ds.callbacks` dict. These should return a numpy array of type `float32`. 

If you don't have CUDA properly installed, importing `a3` will throw some errors about `nvcc` (the NVIDIA CUDA compiler) not being found. This is fine as long as you are content to train on the CPU only (instead of the GPU).

In [53]:
import a3
ds = a3.Dataset('example.hdf5',roi=(140.,320.,250.,580.),n_points=32,scale=.1)

The directory containing our data is `example_data`. You can scan multiple directories if you need to, possibly with different type definitions, but watch out for file conflicts! Your hierarchy should be the same across calls to `scan_directory`. For large datasets, this may take a while to complete, since it does a full walk of the file hierarchy.

In [54]:
d = 'example_data'

In [55]:
ds.scan_directory(d,types,keys)

At this point, you can inspect what data sources you have by looking at the `ds.sources` dict. This `dict` can get very large, so be cautious about printing the whole of it to `stdout`.

In [56]:
ds.sources.keys()
#ds.sources.items()[0]

['20110518JF', '20110826JF', '20110829PB']
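
To look at a large dict without printing all of it, one option is to take only the first few items. The sketch below uses a stand-in dict rather than a real `ds.sources`; the `peek` helper and the contents of `sources` are hypothetical.

```python
from itertools import islice

# Stand-in for a large dict such as ds.sources (contents are made up).
sources = {'study%03d' % i: {'n_files': i} for i in range(1000)}


def peek(mapping, n=3):
    """Return the first n (key, value) pairs without building the full item list."""
    return list(islice(mapping.items(), n))


print(len(sources))
print(peek(sources, 2))
```

The same `peek` call works on any mapping, so it could be applied to `ds.sources` or to one of its nested levels.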

Once you have your data sources figured out, you can extract that data with `ds.read_sources()`. The argument is a set-like object with all of the data types you need. This will take a while, since it opens and processes a lot of files.

In [57]:
ds.read_sources(['trace','image','name'])

## Training a network

The rest is easy. Construct an `Autotracer` from your new dataset. Specifying `None` for the validation set sets aside part of your training data as validation data (no guarantees about randomness). Make sure you use the same ROI as above, or at least the same size.

In [58]:
a = a3.Autotracer('example.hdf5', None, roi=(140.,320.,250.,580.))

04:31:40 INFO:initializing model
04:31:41 INFO:compiling theano functions


To train on your dataset, simply call the `train()` method. In reality, training will require thousands of epochs (passes through the entire dataset), but to save time we will train for only 10 epochs here. Minibatch size can be controlled with the `minibatch` kwarg, which defaults to `512`. If your logging level is set to `INFO`, you will see the training loss and validation loss at the end of each epoch.

In [59]:
a.train(10)

04:31:43 INFO:Training
04:31:43 INFO:Epoch: 1, train_loss=0.123271, valid_loss=0.148709
04:31:43 INFO:Epoch: 2, train_loss=0.121601, valid_loss=0.143208
04:31:44 INFO:Epoch: 3, train_loss=0.118128, valid_loss=0.135538
04:31:44 INFO:Epoch: 4, train_loss=0.114214, valid_loss=0.125975
04:31:44 INFO:Epoch: 5, train_loss=0.107150, valid_loss=0.114751
04:31:44 INFO:Epoch: 6, train_loss=0.098259, valid_loss=0.102859
04:31:44 INFO:Epoch: 7, train_loss=0.087813, valid_loss=0.090623
04:31:44 INFO:Epoch: 8, train_loss=0.080423, valid_loss=0.078934
04:31:44 INFO:Epoch: 9, train_loss=0.070577, valid_loss=0.068480
04:31:44 INFO:Epoch: 10, train_loss=0.062668, valid_loss=0.060005
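
If you want to track these numbers over a long run, the epoch lines are easy to parse. The sketch below assumes the log format shown above and that the lines have already been captured as strings (for example from a log file); the `parse_epochs` helper is hypothetical, and the three sample lines are copied from the output above.

```python
import re

# Sample lines copied from the training output above.
log_lines = [
    '04:31:43 INFO:Training',
    '04:31:43 INFO:Epoch: 1, train_loss=0.123271, valid_loss=0.148709',
    '04:31:44 INFO:Epoch: 5, train_loss=0.107150, valid_loss=0.114751',
    '04:31:44 INFO:Epoch: 10, train_loss=0.062668, valid_loss=0.060005',
]

epoch_re = re.compile(r'Epoch: (\d+), train_loss=([\d.]+), valid_loss=([\d.]+)')


def parse_epochs(lines):
    """Return (epoch, train_loss, valid_loss) tuples for lines that report an epoch."""
    rows = []
    for line in lines:
        m = epoch_re.search(line)
        if m:
            rows.append((int(m.group(1)), float(m.group(2)), float(m.group(3))))
    return rows


history = parse_epochs(log_lines)
best = min(history, key=lambda row: row[2])  # epoch with the lowest validation loss
print(history)
print(best)
```

Watching whether the validation loss keeps falling alongside the training loss is a simple way to decide when further epochs have stopped helping.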


Make sure you save your weights! Note that the resulting file doesn't contain any information about the layout of the NNet -- that's still in the works. To change layouts, change the code in `a3.Autotrace.__init_layers()`.

In [60]:
a.save('example.a3.npy')

## Testing a network

Get the traces for your dataset! This will create a file named `example_test.json` that can be used with the APIL web tracer. The remaining positional arguments are the filenames for the images, the tracer ID, and the subject ID.

In [61]:
import h5py
with h5py.File('example.hdf5','r') as h:
    # trace all images used in training
    a.trace(h['image'], 'example_test.json', h['name'],'autotrace_test','001')

This output can be easily inspected using the `json` module:

In [62]:
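import json
# open the file in a context manager so that it is closed after loading
with open('example_test.json', 'r') as f:
    trace_data = json.load(f)['trace-data']
len(trace_data)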

100

If you want to know your loss, you can train and test with the same dataset. 

In [63]:
b = a3.Autotracer(train='example.hdf5', test='example.hdf5', roi=(140.,320.,250.,580.))
b.train(1)

04:31:45 INFO:initializing model
04:31:45 INFO:compiling theano functions
04:31:47 INFO:Training
04:31:47 INFO:Epoch: 1, train_loss=0.131458, valid_loss=0.136041
