In [1]:
from __future__ import division, print_function  # must be the first statement

import os

import numpy as np
import tensorflow as tf
from tensorflow.python.util import nest

slim = tf.contrib.slim

# Outline
The presented model maps sequences of images to sequences of steering angle measurements. The mapping is causal: there is no "looking into the future", and only past frames are used to predict future steering decisions.  

The model is based on three key components: 1) The input image sequences are processed with a 3D convolution stack, where the discrete time axis is interpreted as the first "depth" dimension. That allows the model to learn motion detectors and understand the dynamics of driving. 2) The model predicts not only the steering angle, but also the vehicle speed and the torque applied to the steering wheel. 3) The model is stateful: the two upper layers are an LSTM and a simple RNN, respectively. The predicted angle, torque and speed serve as input to the next timestep.  

The model is optimized jointly for the autoregressive and ground-truth modes: in the former, the model's own outputs are fed into the next timestep; in the latter, the real targets are used as the context. Naturally, only the autoregressive mode is used at test time.   
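
The difference between the two modes can be illustrated with a toy rollout. This is only a sketch, not the model defined below: `rollout` and `toy_step` are hypothetical names, and the scalar recurrence merely stands in for the real recurrent cell.

```python
def rollout(step_fn, features, targets, use_ground_truth, initial_output=0.0):
    """Run step_fn over a sequence, feeding back either the targets or its own outputs."""
    prev_output = initial_output
    outputs = []
    for feature, target in zip(features, targets):
        output = step_fn(prev_output, feature)
        outputs.append(output)
        # ground-truth mode: the real target becomes the context for the next step;
        # autoregressive mode: the model's own prediction does
        prev_output = target if use_ground_truth else output
    return outputs


def toy_step(prev_output, feature):
    # hypothetical stand-in for the recurrent cell
    return 0.5 * prev_output + feature


features = [1.0, 1.0, 1.0]
targets = [0.0, 0.0, 0.0]
gt_outputs = rollout(toy_step, features, targets, use_ground_truth=True)
ar_outputs = rollout(toy_step, features, targets, use_ground_truth=False)
```

In ground-truth mode a wrong prediction never propagates, because the context is reset to the target at every step. In autoregressive mode errors can accumulate, which is the situation the model faces at test time.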

I used a single GTX 1080 to train the model, so during training the model had to fit into the memory of the card (8 GB). During evaluation the model ran nearly twice as fast as real time in this setup.  

Data extraction from rosbags is performed using Ross Wightman's scripts, because these were also used for the test data in this challenge; for real-life scenarios (and not for the challenge) it would make sense to read data directly into the model from the rosbags. Another real-life concern is that the steering angle sequence to be predicted should probably be delayed by the actuator's latency.  

No data augmentation (except for aggressive regularization via dropout) is used.

In [2]:
# define some constants

# RNNs are typically trained using (truncated) backprop through time. SEQ_LEN here is the length of BPTT. 
# Batch size specifies the number of sequence fragments used in a single optimization step.
# (Actually we can use variable SEQ_LEN and BATCH_SIZE, they are set to constants only for simplicity).
# LEFT_CONTEXT is the number of extra frames from the past that we append to the left of our input sequence.
# We need to do it because 3D convolution with "VALID" padding "eats" frames from the left, 
# decreasing the sequence length.
# One should be careful here to maintain the model's causality.
SEQ_LEN = 10 
BATCH_SIZE = 4 
LEFT_CONTEXT = 5

# These are the input image parameters.
HEIGHT = 480
WIDTH = 640
CHANNELS = 3 # RGB

# The parameters of the LSTM that keeps the model state.
RNN_SIZE = 32
RNN_PROJ = 32

# Our training data follows the "interpolated.csv" format from Ross Wightman's scripts.
CSV_HEADER = "index,timestamp,width,height,frame_id,filename,angle,torque,speed,lat,long,alt".split(",")
OUTPUTS = CSV_HEADER[-6:-3] # angle,torque,speed
OUTPUT_DIM = len(OUTPUTS) # predict all features: steering angle, torque and vehicle speed

# Input/output format
Our data is presented as a long sequence of observations (several concatenated rosbags). We need to chunk it into a number of batches: for this, we will create BATCH_SIZE cursors. Let their starting points be uniformly spaced in our long sequence. We will advance them by SEQ_LEN at each step, creating a BATCH_SIZE x SEQ_LEN matrix of training examples. Boundary effects when one rosbag ends and the next starts are simply ignored.  

(Actually, LEFT_CONTEXT frames are also added to the left of the input sequence; see code below for details).
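
The cursor scheme can be sketched on plain integer indices. The helper names below are illustrative, and the sketch leaves out the LEFT_CONTEXT frames that the real generator prepends to every fragment.

```python
def cursor_starts(num_samples, batch_size):
    """Uniformly spaced starting points, one per batch element."""
    chunk_size = 1 + (num_samples - 1) // batch_size
    starts = [(i * chunk_size) % num_samples for i in range(batch_size)]
    return chunk_size, starts


def advance(starts, num_samples, seq_len):
    """Return one batch of index windows and the cursors moved forward by seq_len."""
    windows = [[(s + t) % num_samples for t in range(seq_len)] for s in starts]
    next_starts = [(s + seq_len) % num_samples for s in starts]
    return windows, next_starts


chunk_size, starts = cursor_starts(num_samples=100, batch_size=5)
first_batch, starts = advance(starts, num_samples=100, seq_len=4)
second_batch, starts = advance(starts, num_samples=100, seq_len=4)
```

Row `i` of the second batch continues exactly where row `i` of the first batch stopped. This is what later allows the recurrent state to be carried from one batch to the next.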

In [3]:
class BatchGenerator(object):
    def __init__(self, sequence, seq_len, batch_size):
        self.sequence = sequence
        self.seq_len = seq_len
        self.batch_size = batch_size
        
        # chunk_size is the number of samples each cursor covers in one epoch,
        # so an epoch takes roughly chunk_size / seq_len batches
        chunk_size = 1 + (len(sequence) - 1) // batch_size
        self.indices = [(i*chunk_size) % len(sequence) for i in range(batch_size)]
        
    def next(self):
        """A simple example:
        - len(sequence) = 100
        - batch_size = 5
        - chunk_size = 1 + (100 - 1) // 5 = 20
        - indices = [0, 20, 40, 60, 80]
        - seq_len = 4
        the first batch is [[0-3], [20-23], [40-43], [60-63], [80-83]] (ignoring left context)"""
        while True:
            output = []
            for i in range(self.batch_size):
                idx = self.indices[i]
                # clamp at 0 so that a negative start index is not interpreted relative to the end
                left_pad = self.sequence[max(0, idx - LEFT_CONTEXT):idx]
                if len(left_pad) < LEFT_CONTEXT:
                    # duplicate inputs for left context if at the beginning
                    left_pad = [self.sequence[0]] * (LEFT_CONTEXT - len(left_pad)) + left_pad
                assert len(left_pad) == LEFT_CONTEXT
                leftover = len(self.sequence) - idx
                if leftover >= self.seq_len:
                    result = self.sequence[idx:idx + self.seq_len]
                else:
                    # wrap around to the beginning when we reach the end of the dataset
                    result = self.sequence[idx:] + self.sequence[:self.seq_len - leftover]
                assert len(result) == self.seq_len
                
                # advance the cursor so that the next call to next() continues right after this fragment;
                # consecutive batches therefore never overlap
                self.indices[i] = (idx + self.seq_len) % len(self.sequence)
                
                images, targets = zip(*result)
                images_left_pad, _ = zip(*left_pad)
                output.append((np.stack(images_left_pad + images), np.stack(targets)))
                
            output = list(zip(*output)) # list() keeps the result mutable (zip is lazy on Python 3)
            output[0] = np.stack(output[0]) # batch_size x (LEFT_CONTEXT + seq_len)
            output[1] = np.stack(output[1]) # batch_size x seq_len x OUTPUT_DIM
            return output
        
def read_csv(filename, prefix, nrows=None):
    """Helper function to read csv file
    - prefix: the prefix path to be inserted in front of the image file name
    - nrows: number of rows of the file to read. Useful for reading a small toy dataset
    """
    with open(filename, 'r') as f:
        f.readline() # skip header row in csv file
        lines = [ln.strip().split(",")[-7:-3] for ln in f.readlines()][:nrows]
        lines = [(os.path.join(prefix, x[0]), np.float32(x[1:])) for x in lines] # (image file, outputs)
        return lines

def process_csv(filename, prefix, nrows=None, val=5):
    sum_f = np.float128([0.0] * OUTPUT_DIM)
    sum_sq_f = np.float128([0.0] * OUTPUT_DIM)
    lines = read_csv(filename, prefix, nrows=nrows)
    # leave val% for validation
    train_seq = []
    valid_seq = []
    cnt = 0
    for ln in lines:
        # only use images from the center camera
        if 'center' not in ln[0]:
            continue
        if cnt < SEQ_LEN * BATCH_SIZE * (100 - val): 
            train_seq.append(ln)
            sum_f += ln[1]
            sum_sq_f += ln[1] * ln[1]
        else:
            valid_seq.append(ln)
        cnt += 1
        cnt %= SEQ_LEN * BATCH_SIZE * 100
    mean = sum_f / len(train_seq)
    var = sum_sq_f / len(train_seq) - mean * mean
    std = np.sqrt(var)
    print(len(train_seq), len(valid_seq))
    print(mean, std) # we will need these statistics to normalize the outputs (and ground truth inputs)
    return (train_seq, valid_seq), (mean, std)

In [4]:
(train_seq, valid_seq), (mean, std) = process_csv(filename="SelfDrivingData/export_ch2_002/interpolated.csv", 
                                                  prefix="SelfDrivingData/export_ch2_002", nrows=12000,
                                                  val=5)

3800 200
[-0.0058325905 -0.082154441  23.879963] [ 0.068750301  0.52096972  2.5421321]
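
The two printed vectors are the per-output mean and standard deviation of the training targets (angle, torque, speed), accumulated in one pass from a running sum and a running sum of squares. Below is a minimal pure-Python sketch of that computation for a single output, together with the normalization round trip that the graph applies later. The function names and the toy values are illustrative only.

```python
import math


def one_pass_mean_std(values):
    """Mean and standard deviation from a running sum and a running sum of squares."""
    total = 0.0
    total_sq = 0.0
    for v in values:
        total += v
        total_sq += v * v
    mean = total / len(values)
    var = total_sq / len(values) - mean * mean  # E[x^2] - (E[x])^2
    return mean, math.sqrt(var)


def normalize(x, mean, std):
    return (x - mean) / std


def denormalize(z, mean, std):
    return z * std + mean


toy_mean, toy_std = one_pass_mean_std([1.0, 2.0, 3.0, 4.0])
restored = denormalize(normalize(4.0, toy_mean, toy_std), toy_mean, toy_std)
```

The model is trained on normalized targets, and its steering output is mapped back to the original units with the inverse transform before it is reported.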


# Key tricks
Now we are ready to build the model. In the next cell we will define the vision module and the recurrent stateful cell.  

The vision module takes a tensor of shape [BATCH_SIZE, LEFT_CONTEXT + SEQ_LEN, HEIGHT, WIDTH, CHANNELS] and outputs a tensor of shape [BATCH_SIZE, SEQ_LEN, 128]. The entire LEFT_CONTEXT is eaten by the 3D convolutions. Well-known tricks like residual connections and layer normalization are used to improve the convergence of the vision module. Dropout between each pair of layers serves as a regularizer.  
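
The claim that the whole left context is consumed can be checked with simple shape arithmetic. The sketch below assumes the standard output-length formula for "VALID" padding and uses the kernel sizes and strides of the four convolutions defined in the next cell; `valid_output_len` is an illustrative helper, not a TensorFlow function.

```python
def valid_output_len(input_len, kernel, stride):
    """Output length of a convolution with "VALID" padding along one axis."""
    return (input_len - kernel) // stride + 1


# (kernel_size, stride) of each 3D convolution, given as [time, height, width]
conv_layers = [
    ([3, 12, 12], [1, 6, 6]),
    ([2, 5, 5], [1, 2, 2]),
    ([2, 5, 5], [1, 1, 1]),
    ([2, 5, 5], [1, 1, 1]),
]

SEQ_LEN, LEFT_CONTEXT, HEIGHT, WIDTH = 10, 5, 480, 640
shape = [LEFT_CONTEXT + SEQ_LEN, HEIGHT, WIDTH]
shapes = [shape]
for kernel, stride in conv_layers:
    shape = [valid_output_len(n, k, s) for n, k, s in zip(shape, kernel, stride)]
    shapes.append(shape)
```

The time axis shrinks as 15, 13, 12, 11, 10, so exactly LEFT_CONTEXT = 5 frames are consumed and SEQ_LEN = 10 timesteps remain. Because the extra frames are added only on the left, each output timestep depends on the current and earlier frames only.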

We also need to define our own recurrent cell because we need to train our model jointly in two conditions: when it uses ground truth history and when it uses its own past predictions as the context for the future predictions.  

In addition, we define two helper functions: a layer normalizer with trainable gain/offset and a gradient-clipping optimizer.
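
Global-norm clipping rescales all gradients by a common factor whenever their joint L2 norm exceeds a threshold, so the direction of the update is preserved. The following pure-Python sketch shows the rule applied by `tf.clip_by_global_norm`; the helper itself is illustrative and works on plain lists rather than tensors.

```python
import math


def clip_by_global_norm_sketch(gradients, clip_norm):
    """Scale a list of gradient vectors so that their joint L2 norm is at most clip_norm."""
    global_norm = math.sqrt(sum(g * g for vector in gradients for g in vector))
    # the factor is 1.0 when global_norm <= clip_norm, so small gradients pass unchanged
    scale = clip_norm / max(global_norm, clip_norm)
    clipped = [[g * scale for g in vector] for vector in gradients]
    return clipped, global_norm


toy_gradients = [[3.0, 4.0], [0.0]]
clipped, norm = clip_by_global_norm_sketch(toy_gradients, clip_norm=2.5)
unchanged, _ = clip_by_global_norm_sketch(toy_gradients, clip_norm=15.0)
```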

In [5]:
layer_norm = lambda x: tf.contrib.layers.layer_norm(inputs=x, center=True, scale=True, 
                                                    activation_fn=None, trainable=True)

def get_optimizer(loss, lrate):
    """Adam optimizer with global norm gradients clipping."""
    
    optimizer = tf.train.AdamOptimizer(learning_rate=lrate)
    gradvars = optimizer.compute_gradients(loss)
    gradients, v = list(zip(*gradvars))
    print([x.name for x in v])
    
    # clip gradients
    gradients, _ = tf.clip_by_global_norm(gradients, 15.0)
    return optimizer.apply_gradients(list(zip(gradients, v)))

def apply_vision_simple(image, keep_prob, batch_size, seq_len, scope=None, reuse=None):
    
    # reshape sequence of images to have proper shape fed into 3D-conv
    video = tf.reshape(image, shape=[batch_size, LEFT_CONTEXT + seq_len, HEIGHT, WIDTH, CHANNELS])
    
    with tf.variable_scope(scope, 'Vision', [image], reuse=reuse):
        
        # 3-D conv layers and auxiliary output
        
        # conv-1
        # in sequence dimension, 3-D conv eats 3-1=2 images, 5-2=3 left context left
        net = slim.convolution(video, num_outputs=64, kernel_size=[3,12,12], stride=[1,6,6], padding="VALID")
        net = tf.nn.dropout(x=net, keep_prob=keep_prob)
        # net has 5 dimensions: batch size, sequence, height, width and channels
        aux1 = slim.fully_connected(tf.reshape(net[:, -seq_len:, :, :, :], [batch_size, seq_len, -1]), 
                                    128, activation_fn=None)
        
        # conv-2
        # in sequence dimension, 3-D conv eats 2-1=1 image, 3-1=2 left context left
        net = slim.convolution(net, num_outputs=64, kernel_size=[2,5,5], stride=[1,2,2], padding="VALID")
        net = tf.nn.dropout(x=net, keep_prob=keep_prob)
        aux2 = slim.fully_connected(tf.reshape(net[:, -seq_len:, :, :, :], [batch_size, seq_len, -1]), 
                                    128, activation_fn=None)
        
        # conv-3
        # in sequence dimension, 3-D conv eats 2-1=1 image, 2-1=1 left context left
        net = slim.convolution(net, num_outputs=64, kernel_size=[2,5,5], stride=[1,1,1], padding="VALID")
        net = tf.nn.dropout(x=net, keep_prob=keep_prob)
        aux3 = slim.fully_connected(tf.reshape(net[:, -seq_len:, :, :, :], [batch_size, seq_len, -1]), 
                                    128, activation_fn=None)
        
        # conv-4
        # in sequence dimension, 3-D conv eats 2-1=1 image, 1-1=0 left context left
        net = slim.convolution(net, num_outputs=64, kernel_size=[2,5,5], stride=[1,1,1], padding="VALID")
        net = tf.nn.dropout(x=net, keep_prob=keep_prob)
        
        # at this point the tensor 'net' is of shape batch_size * seq_len * ... (all left context has been eaten)
        aux4 = slim.fully_connected(tf.reshape(net, [batch_size, seq_len, -1]), 128, activation_fn=None)
        
        # fully connected layers
        net = slim.fully_connected(tf.reshape(net, [batch_size, seq_len, -1]), 1024, activation_fn=tf.nn.relu)
        net = tf.nn.dropout(x=net, keep_prob=keep_prob)
        
        net = slim.fully_connected(net, 512, activation_fn=tf.nn.relu)
        net = tf.nn.dropout(x=net, keep_prob=keep_prob)
        
        net = slim.fully_connected(net, 256, activation_fn=tf.nn.relu)
        net = tf.nn.dropout(x=net, keep_prob=keep_prob)
        
        # final output layer
        net = slim.fully_connected(net, 128, activation_fn=None)
        
        # aux[1-4] are residual connections (shortcuts) (structure like ResNet)
        return layer_norm(tf.nn.elu(net + aux1 + aux2 + aux3 + aux4)) 

# Custom RNN cell that processes a single timestep of shape [batch_size, features].
# It is passed to tf.nn.dynamic_rnn below, which unrolls it over the sequence.
class SamplingRNNCell(tf.contrib.rnn.BasicRNNCell):
    """Simple sampling RNN cell."""

    def __init__(self, num_outputs, use_ground_truth, internal_cell):
        """
        if use_ground_truth then don't sample
        """
        self._num_outputs = num_outputs
        self._use_ground_truth = use_ground_truth # boolean
        self._internal_cell = internal_cell # may be LSTM or GRU or anything

    @property
    def state_size(self):
        return self._num_outputs, self._internal_cell.state_size # previous output and bottleneck state

    @property
    def output_size(self):
        return self._num_outputs # steering angle, torque, vehicle speed

    def __call__(self, inputs, state, scope=None):
        (visual_feats, current_ground_truth) = inputs
        prev_output, prev_state_internal = state
        context = tf.concat([prev_output, visual_feats], 1)
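        # call the internal cell (e.g. an LSTM) that was passed to the constructor
        new_output_internal, new_state_internal = self._internal_cell(context, prev_state_internal) 
        
        # project the internal output, the previous output and the visual features onto the model outputs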
        new_output = tf.contrib.layers.fully_connected(
            inputs=tf.concat([new_output_internal, prev_output, visual_feats], 1),
            num_outputs=self._num_outputs,
            activation_fn=None,
            scope="OutputProjection")
        # if self._use_ground_truth == True, 
        # we pass the ground truth as the state; otherwise, we use the model's predictions
        return new_output, (current_ground_truth if self._use_ground_truth else new_output, new_state_internal)

# Model
Let's build the main graph. Code is mostly self-explanatory.  

A few comments:  

1) PNG images were used as the input only because this was the format of the round 1 test set. In practice, raw images should be fed directly from the rosbags.  

2) We define the get_initial_state and deep_copy_initial_state functions to be able to preserve the state of our recurrent net between batches. (The second batch comes right after the first batch in the time series, so the final state of the first batch should be the initial state for the second batch.) The backpropagation is still truncated by SEQ_LEN.  

3) The loss is composed of two components. The first is the MSE of the steering angle prediction in the autoregressive setting -- that is exactly what interests us at test time. The second component, weighted by the term aux_cost_weight, is the sum of the MSEs for all outputs in both the autoregressive and the ground-truth settings.  
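
The loss from comment 3 can be written out on plain Python lists. This is a sketch with made-up numbers: every row is one timestep of normalized [angle, torque, speed] values, and the helper names are illustrative.

```python
def mse(predictions, targets):
    """Mean squared error over all timesteps and all outputs."""
    squared = [(p - t) ** 2
               for pred_row, target_row in zip(predictions, targets)
               for p, t in zip(pred_row, target_row)]
    return sum(squared) / len(squared)


def steering_only(rows):
    """Keep only the first output (the steering angle) of every row."""
    return [[row[0]] for row in rows]


def total_loss(out_autoregressive, out_gt, targets, aux_cost_weight=0.1):
    mse_steering = mse(steering_only(out_autoregressive), steering_only(targets))
    aux = mse(out_gt, targets) + mse(out_autoregressive, targets)
    return mse_steering + aux_cost_weight * aux


toy_targets = [[0.0, 0.0, 0.0]]
toy_out_autoregressive = [[1.0, 2.0, 3.0]]
toy_out_gt = [[0.0, 0.0, 3.0]]
loss = total_loss(toy_out_autoregressive, toy_out_gt, toy_targets)
loss_without_aux = total_loss(toy_out_autoregressive, toy_out_gt, toy_targets, aux_cost_weight=0.0)
```

With aux_cost_weight set to zero, only the autoregressive steering error remains.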

Note: if the saver definition doesn't work for you, please make sure you are using TensorFlow 0.12rc0 or newer.
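
The state hand-over from comment 2 can be illustrated with a toy recurrence. In this sketch, `run_fragment` is a hypothetical stand-in for one `session.run` call over SEQ_LEN timesteps: it returns the outputs of a fragment together with the final state, which then becomes the initial state of the next fragment.

```python
def run_fragment(state, inputs, decay=0.9):
    """Toy stateful model: process one fragment and return (outputs, final_state)."""
    outputs = []
    for x in inputs:
        state = decay * state + x
        outputs.append(state)
    return outputs, state


sequence = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
fragment_len = 2

# reference: process the whole sequence in one go
full_outputs, _ = run_fragment(0.0, sequence)

# fragments with the final state carried over to the next fragment
state = 0.0
carried_outputs = []
for start in range(0, len(sequence), fragment_len):
    outputs, state = run_fragment(state, sequence[start:start + fragment_len])
    carried_outputs.extend(outputs)

# fragments that always restart from the initial state
reset_outputs = []
for start in range(0, len(sequence), fragment_len):
    outputs, _ = run_fragment(0.0, sequence[start:start + fragment_len])
    reset_outputs.extend(outputs)
```

Carrying the state reproduces the outputs of the unbroken sequence, whereas resetting it at every fragment does not. Only the forward pass is continuous: gradients still stop at fragment boundaries.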

In [6]:
graph = tf.Graph()

with graph.as_default():
    # inputs
    learning_rate = tf.placeholder_with_default(input=1e-4, shape=())
    keep_prob = tf.placeholder_with_default(input=1.0, shape=())
    aux_cost_weight = tf.placeholder_with_default(input=0.1, shape=())
    
    # paths to JPEG files from the center camera
    inputs = tf.placeholder(shape=(BATCH_SIZE,LEFT_CONTEXT+SEQ_LEN), dtype=tf.string) 
    # batch_size * seq_len * OUTPUT_DIM
    targets = tf.placeholder(shape=(BATCH_SIZE,SEQ_LEN,OUTPUT_DIM), dtype=tf.float32) 
    
    # mean and std are calculated above when reading data
    targets_normalized = (targets - mean) / std # (batch_size, seq_len, output_dim)
    
    # read matrix of images based on matrix of image file paths
    # tf.image.decode_jpeg will return a 3-D array of shape [height, width, channels]
    input_images = tf.stack([tf.image.decode_jpeg(tf.read_file(x)) # one 3-D tensor per image file
                            for x in tf.unstack(tf.reshape(inputs, shape=[(LEFT_CONTEXT+SEQ_LEN) * BATCH_SIZE]))])
    
    # could add data augmentation or flipping here
    
    # normalize images (0 to 255 -> -1 to 1)
    input_images = -1.0 + 2.0 * tf.cast(input_images, tf.float32) / 255.0
    
    # declare the static shape of the tensor (unlike tf.reshape, set_shape does not rearrange any data)
    input_images.set_shape([(LEFT_CONTEXT+SEQ_LEN) * BATCH_SIZE, HEIGHT, WIDTH, CHANNELS])
    
    # output has shape [batch_size, seq_len, 128]
    visual_conditions_reshaped = apply_vision_simple(image=input_images, keep_prob=keep_prob, 
                                                     batch_size=BATCH_SIZE, seq_len=SEQ_LEN)
    
    visual_conditions = tf.reshape(visual_conditions_reshaped, [BATCH_SIZE, SEQ_LEN, -1])
    visual_conditions = tf.nn.dropout(x=visual_conditions, keep_prob=keep_prob)
    
    # prepare inputs to RNN based on output from 3-D conv layers
    # - visual_conditions: (batch_size, seq_len, 128)
    # - targets_normalized: (BATCH_SIZE, SEQ_LEN, OUTPUT_DIM)
    rnn_inputs_with_ground_truth = (visual_conditions, targets_normalized)
    rnn_inputs_autoregressive = (visual_conditions, 
                                 tf.zeros(shape=(BATCH_SIZE, SEQ_LEN, OUTPUT_DIM), dtype=tf.float32))
    
    # num_units: size of the hidden state H; an LSTM is simply a way to compute H_t from H_{t-1} and X_t
    # num_proj: size of the output Y_t, a linear projection of H_t
    internal_cell = tf.contrib.rnn.LSTMCell(num_units=RNN_SIZE, num_proj=RNN_PROJ)
    cell_with_ground_truth = SamplingRNNCell(num_outputs=OUTPUT_DIM, use_ground_truth=True, 
                                             internal_cell=internal_cell)
    cell_autoregressive = SamplingRNNCell(num_outputs=OUTPUT_DIM, use_ground_truth=False, 
                                          internal_cell=internal_cell)
    
    def get_initial_state(complex_state_tuple_sizes):
        
        # complex_state_tuple_sizes: (_num_outputs, (c_state_size, m_state_size))
        flat_sizes = nest.flatten(complex_state_tuple_sizes)
        init_state_flat = [tf.tile( # tf.tile: repeat inputs in each dimension by i times
            multiples=[BATCH_SIZE, 1], 
            input=tf.get_variable("controller_initial_state_%d" % i, initializer=tf.zeros_initializer, 
                                  shape=([1, s]), dtype=tf.float32))
         for i,s in enumerate(flat_sizes)]
        
        # each batch element can have a different initial state
        # pack flat state back to a nested tuple: (new_output, (c_state, m_state))
        # - new_output: (batch_size, _num_outputs)
        # - c_state: (batch_size, c_state_size)
        # - m_state: (batch_size, m_state_size)
        init_state = nest.pack_sequence_as(complex_state_tuple_sizes, init_state_flat)
        return init_state
    
    def deep_copy_initial_state(complex_state_tuple):
        flat_state = nest.flatten(complex_state_tuple)
        flat_copy = [tf.identity(s) for s in flat_state] # tf.identity copies shape and content of a tensor
        deep_copy = nest.pack_sequence_as(complex_state_tuple, flat_copy)
        return deep_copy
    
    controller_initial_state_variables = get_initial_state(cell_autoregressive.state_size)
    
    # initial_state for autoregressive case
    controller_initial_state_autoregressive = deep_copy_initial_state(controller_initial_state_variables)
    # initial_state for the ground-truth case
    controller_initial_state_gt = deep_copy_initial_state(controller_initial_state_variables)

    with tf.variable_scope("predictor"):
        # - out_gt: (batch_size, seq_len, num_outputs)
        out_gt, controller_final_state_gt = \
        tf.nn.dynamic_rnn(cell=cell_with_ground_truth, 
                          inputs=rnn_inputs_with_ground_truth, 
                          sequence_length=[SEQ_LEN]*BATCH_SIZE, initial_state=controller_initial_state_gt, 
                          dtype=tf.float32, swap_memory=True, time_major=False)
        
    with tf.variable_scope("predictor", reuse=True):
        # reuse the variable above
        out_autoregressive, controller_final_state_autoregressive = \
        tf.nn.dynamic_rnn(cell=cell_autoregressive, 
                          inputs=rnn_inputs_autoregressive, 
                          sequence_length=[SEQ_LEN]*BATCH_SIZE, 
                          initial_state=controller_initial_state_autoregressive, 
                          dtype=tf.float32, swap_memory=True, time_major=False)
    
    # mse for all outputs (angle, torque, speed) across 3 dimensions (batch_size, seq_len, num_outputs)
    mse_gt = tf.reduce_mean(tf.squared_difference(out_gt, targets_normalized))
    mse_autoregressive = tf.reduce_mean(tf.squared_difference(out_autoregressive, targets_normalized))
    
    # mse for angle, across 2 dimensions (batch_size, seq_len)
    mse_autoregressive_steering = tf.reduce_mean(tf.squared_difference(out_autoregressive[:, :, 0], 
                                                                       targets_normalized[:, :, 0]))
    
    steering_predictions = (out_autoregressive[:, :, 0] * std[0]) + mean[0]
    
    # final scalar loss to be minimized
    total_loss = mse_autoregressive_steering + aux_cost_weight * (mse_gt + mse_autoregressive)
    
    optimizer = get_optimizer(total_loss, learning_rate)
    
    # summary
    tf.summary.scalar("MAIN_TRAIN_METRIC_rmse_autoregressive_steering", tf.sqrt(mse_autoregressive_steering))
    tf.summary.scalar("rmse_gt", tf.sqrt(mse_gt))
    tf.summary.scalar("rmse_autoregressive", tf.sqrt(mse_autoregressive))
    
    summaries = tf.summary.merge_all()
    train_writer = tf.summary.FileWriter('v3/train_summary', graph=graph)
    valid_writer = tf.summary.FileWriter('v3/valid_summary', graph=graph)
    saver = tf.train.Saver(write_version=tf.train.SaverDef.V2)

[u'Vision/Conv/weights:0', u'Vision/Conv/biases:0', u'Vision/fully_connected/weights:0', u'Vision/fully_connected/biases:0', u'Vision/Conv_1/weights:0', u'Vision/Conv_1/biases:0', u'Vision/fully_connected_1/weights:0', u'Vision/fully_connected_1/biases:0', u'Vision/Conv_2/weights:0', u'Vision/Conv_2/biases:0', u'Vision/fully_connected_2/weights:0', u'Vision/fully_connected_2/biases:0', u'Vision/Conv_3/weights:0', u'Vision/Conv_3/biases:0', u'Vision/fully_connected_3/weights:0', u'Vision/fully_connected_3/biases:0', u'Vision/fully_connected_4/weights:0', u'Vision/fully_connected_4/biases:0', u'Vision/fully_connected_5/weights:0', u'Vision/fully_connected_5/biases:0', u'Vision/fully_connected_6/weights:0', u'Vision/fully_connected_6/biases:0', u'Vision/fully_connected_7/weights:0', u'Vision/fully_connected_7/biases:0', u'Vision/LayerNorm/beta:0', u'Vision/LayerNorm/gamma:0', u'controller_initial_state_0:0', u'controller_initial_state_1:0', u'controller_initial_state_2:0', u'predictor/rnn

# Training
At this point we can start the training procedure.  

We will optimize for 100 epochs, validating once per epoch. We keep the version of the model that performs best on the validation set in terms of the primary loss (autoregressive steering MSE). Aggressive regularization is used (keep_prob=0.25 for dropout), and the validation loss is highly non-monotonic.  

For each version of the model that beats the previous best validation score we will overwrite the checkpoint file and obtain predictions for the challenge test set.
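
The checkpointing rule can be summarized in a few lines. The sketch below mirrors the logic of the training loop in the next cell on a made-up list of validation scores; `epochs_that_save` is an illustrative name.

```python
def epochs_that_save(valid_scores):
    """Return the epochs at which a checkpoint would be written, and the best score."""
    best = None
    saved_epochs = []
    for epoch, score in enumerate(valid_scores):
        if best is None:
            best = score  # the first validation only sets the baseline
        if score < best:
            best = score
            saved_epochs.append(epoch)  # the real loop calls saver.save here
    return saved_epochs, best


toy_scores = [2.0, 2.1, 1.5, 1.7, 1.2]
saved_epochs, best_score = epochs_that_save(toy_scores)
```

Because the comparison is strict and the first score only serves as the baseline, no checkpoint is written until a later validation improves on it.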

In [None]:
# gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=1.0)

checkpoint_dir = os.getcwd() + "/v3"

global_train_step = 0
global_valid_step = 0

KEEP_PROB_TRAIN = 0.25

def do_epoch(session, sequences, mode):
    global global_train_step, global_valid_step
    
    test_predictions = {}
    valid_predictions = {}
    batch_generator = BatchGenerator(sequence=sequences, seq_len=SEQ_LEN, batch_size=BATCH_SIZE)
    
    # total steps in one epoch, chunk_size / seq_len
    total_num_steps = 1 + (batch_generator.indices[1] - 1) // SEQ_LEN
    controller_final_state_gt_cur, controller_final_state_autoregressive_cur = None, None
    acc_loss = np.float128(0.0)
    
    for step in range(total_num_steps):
        feed_inputs, feed_targets = batch_generator.next()
        feed_dict = {inputs : feed_inputs, targets : feed_targets}
        
        # feed the final state of the previous batch as the initial state of the current batch
        if controller_final_state_autoregressive_cur is not None:
            feed_dict.update({controller_initial_state_autoregressive : controller_final_state_autoregressive_cur})
        if controller_final_state_gt_cur is not None:
            # feed the *initial* state of the ground-truth branch; the original code fed
            # controller_final_state_gt here, which would not carry the state over
            feed_dict.update({controller_initial_state_gt : controller_final_state_gt_cur})
        
        if mode == "train":
            feed_dict.update({keep_prob : KEEP_PROB_TRAIN})
            
            # the ground-truth branch is only run during training; validation and test use the autoregressive branch
            summary, _, loss, controller_final_state_gt_cur, controller_final_state_autoregressive_cur = \
                session.run([summaries, optimizer, mse_autoregressive_steering, controller_final_state_gt, 
                             controller_final_state_autoregressive], feed_dict = feed_dict)
            
            train_writer.add_summary(summary, global_train_step)
            global_train_step += 1
        
        elif mode == "valid":
            model_predictions, summary, loss, controller_final_state_autoregressive_cur = \
                session.run([steering_predictions, summaries, mse_autoregressive_steering, 
                             controller_final_state_autoregressive],feed_dict = feed_dict)
            
            valid_writer.add_summary(summary, global_valid_step)
            global_valid_step += 1  
            
            # record validation predictions and error
            feed_inputs = feed_inputs[:, LEFT_CONTEXT:].flatten()
            steering_targets = feed_targets[:, :, 0].flatten() # (batch_size * seq_len,)
            model_predictions = model_predictions.flatten()
            
            # record true target, prediction, error
            # stats: (3, batch_size * seq_len)
            stats = np.stack([steering_targets, model_predictions, (steering_targets - model_predictions)**2])
            for i, img in enumerate(feed_inputs):
                valid_predictions[img] = stats[:, i]
                
        elif mode == "test":
            model_predictions, controller_final_state_autoregressive_cur = \
                session.run([steering_predictions, controller_final_state_autoregressive],
                           feed_dict = feed_dict) 
            
            # record test predictions and error
            feed_inputs = feed_inputs[:, LEFT_CONTEXT:].flatten()
            model_predictions = model_predictions.flatten()
            for i, img in enumerate(feed_inputs):
                test_predictions[img] = model_predictions[i]
        
        # print the running RMSE (square root of the mean of mse_autoregressive_steering so far)
        if mode != "test":
            acc_loss += loss
            print('\r', step + 1, "/", total_num_steps, np.sqrt(acc_loss / (step+1)),)
    print()
    return (np.sqrt(acc_loss / total_num_steps), valid_predictions) if mode != "test" else (None, test_predictions)
    

NUM_EPOCHS=100

best_validation_score = None
# add 'gpu_options=gpu_options' to tf.ConfigProto() if using GPU
with tf.Session(graph=graph, config=tf.ConfigProto()) as session:
    session.run(tf.global_variables_initializer())
    print('Initialized')
    ckpt = tf.train.latest_checkpoint(checkpoint_dir)
    if ckpt:
        print("Restoring from", ckpt)
        saver.restore(sess=session, save_path=ckpt)
    for epoch in range(NUM_EPOCHS):
        print("Starting epoch %d" % epoch)
        
        # validate before each epoch
        print("Validation:")
        
        # valid_score is the RMSE of the normalized autoregressive steering prediction returned by do_epoch
        valid_score, valid_predictions = do_epoch(session=session, sequences=valid_seq, mode="valid")
        if best_validation_score is None: 
            best_validation_score = valid_score
        
        # update best valid score and save best model so far
        if valid_score < best_validation_score:
            saver.save(session, 'v3/checkpoint-sdc-ch2')
            best_validation_score = valid_score
            print('\r', "SAVED at epoch %d" % epoch,)
            
            # validation: write image file name and corresponding stats (prediction, error) to a file
            with open("v3/valid-predictions-epoch%d" % epoch, "w") as out:
                result = np.float128(0.0)
                for img_id, stats in valid_predictions.items():
                    print(img_id, stats, file=out)
                    # accumulate the squared error in original (unnormalized) units
                    result += stats[-1]
            print("Validation unnormalized RMSE:", np.sqrt(result / len(valid_predictions)))
            
#             # test: write image file name and corresponding stats (prediction, error) to a file
#             with open("v3/test-predictions-epoch%d" % epoch, "w") as out:
#                 _, test_predictions = do_epoch(session=session, sequences=test_seq, mode="test")
#                 # write header
#                 print("frame_id,steering_angle", file=out)
#                 for img_id, pred in test_predictions.items():
#                     img_id = img_id.replace("challenge_2/Test-final/center/", "")
#                     print("%s,%f" % (img_id, pred), file=out)
        
        # keep training unless this was the last epoch
        if epoch != NUM_EPOCHS - 1:
            print("Training")
            do_epoch(session=session, sequences=train_seq, mode="train")

Initialized
Starting epoch 0
Validation:
 1 / 5 2.0044540638
 2 / 5 2.00480503799
 3 / 5 2.02282143816
 4 / 5 2.0305336496
 5 / 5 2.03089749139

Training


Basically that's it.  

The model can be further fine-tuned for the challenge purposes by subsetting the training set and setting the aux_cost_weight to zero. It improves the result slightly, but the improvement is marginal (it doesn't affect the challenge ranking). For real-life usage it would probably be harmful because of the risk of overfitting to the dev set or even the test set.  

Of course, speaking of realistic models, we don't need to constrain our input only to the central camera -- other cameras and sensors can dramatically improve the performance. It is also worth introducing a realistic delay into the target sequence so that control with non-zero latency becomes possible.  
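
One way such a delay might be introduced is to pair every frame with the steering angle recorded a fixed number of frames later. This is only a sketch: `delay_targets` is a hypothetical helper, and the delay of one frame is an arbitrary illustration, since the actual actuator latency would have to be measured.

```python
def delay_targets(frames, angles, delay):
    """Pair frame t with the angle at t + delay; the last `delay` frames are dropped."""
    if delay == 0:
        return list(zip(frames, angles))
    return list(zip(frames[:-delay], angles[delay:]))


toy_frames = ["f0", "f1", "f2", "f3"]
toy_angles = [0.0, 0.1, 0.2, 0.3]
delayed_pairs = delay_targets(toy_frames, toy_angles, delay=1)
```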

If something in this writeup is unclear, please write me an e-mail so that I can add the necessary comments/clarifications.