# vid2speed
## Ben Penchas, Marco Monteiro, and Toby Bell
Given car dashboard video footage, we aimed to estimate the speed of a car using a deep neural network. We saw this problem as a small but important part of building an autonomous vehicle. The problem is by its nature underdetermined (since we have no absolute reference for scale/distance), so we treated it as a classification task where we bucketed speeds into 4 mph intervals.

<img src="im0.jpeg">

Our dataset was a dashboard video of driving around the Bay Area. We took our data from the <a href="https://twitter.com/comma_ai/status/854488327797448704?lang=en">comma.ai challenge.</a> We separated this video into 20400 image frames (that look like the one above), scaled each image to 224x224, and coupled every pair of frames. We then randomly shuffled these pairs--80% into our train set and 20% into our test set.
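
The pairing and split described above can be sketched roughly as follows. This is a minimal NumPy illustration, not the code we ran: the function name `pair_and_split`, the choice of non-overlapping consecutive pairs, and labeling each pair with the speed of its second frame are all assumptions made for the example.

```python
import numpy as np

def pair_and_split(frames, speeds, test_fraction=0.2, seed=0):
    """Group consecutive frames into non-overlapping pairs and split them.

    frames: array of shape [num_frames, ...]; speeds: array of shape [num_frames].
    Assumption: each pair is labeled with the speed of its second frame.
    """
    num_pairs = len(frames) // 2
    # Drop a trailing odd frame, then view the data as [num_pairs, 2, ...].
    pairs = frames[:2 * num_pairs].reshape((num_pairs, 2) + frames.shape[1:])
    labels = speeds[1:2 * num_pairs:2]

    # Shuffle pair indices once, then cut at the train/test boundary.
    order = np.random.default_rng(seed).permutation(num_pairs)
    num_test = int(round(test_fraction * num_pairs))
    test_idx, train_idx = order[:num_test], order[num_test:]
    return pairs[train_idx], labels[train_idx], pairs[test_idx], labels[test_idx]

# Tiny stand-in for the real video: 10 "frames" of 4 values each.
frames = np.arange(40).reshape(10, 4)
speeds = np.arange(10, dtype=float)
X_train, y_train, X_test, y_test = pair_and_split(frames, speeds)
print(X_train.shape, X_test.shape)
```

Shuffling at the level of pairs (rather than frames) keeps the two frames of each pair together, which is what the network needs in order to see motion.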

See below for how we implemented our final network and trained it. Our resulting accuracy was 91% on the test set. We did some analysis to understand what the network learned, and gained several insights (see bottom).


In [None]:
import numpy as np
import tensorflow as tf

import os
import imageio
import cv2
from moviepy.editor import *
import scipy
from sklearn.model_selection import train_test_split
from PIL import Image

In [None]:
# Function we wrote to turn the video given to us
# into an object of resized frames
def extract_frames(movie, imgdir):
    clip = VideoFileClip(movie)
    frames = int(clip.fps * clip.duration)
    all_frames = np.empty((frames,224*224*3))
    for f in range(frames):
        # Extract each frame and resize to 224x224
        image = clip.to_ImageClip(f / clip.fps).img
        image = cv2.resize(image, (224, 224))
        image = image.astype('float32')
        
        all_frames[f,:] = image.flatten()
        
        if f % 1000 == 0:
            print(f)
    return all_frames

The resulting frames look like this one:
<img src="im0.jpeg">

# Network Architecture
At first, we saw dense optical flow as closely related to, but not the same as, our video-to-speed problem. Rather than use a pretrained FlowNet model, which is a very deep and complicated network, we wanted to design and train our own lightweight optical flow model specific to this problem.

We initially used VGG-19 as a dual-stream feature extractor, mimicking the FlowNet architecture and hoping to extract features like scale. We fed each image through a pretrained VGG-19 model, up to the third block, and then concatenated the results. We then fed the resulting volume through our trainable custom CNN.

However, this did not work for two main reasons:

1) VGG-19 is an image classifier, and so it has been trained to extract semantic understanding. However, since subsequent frames are semantically very similar, the feature extraction with VGG-19 actually got rid of crucial spatial information. Using VGG-19 as a feature extractor led to a model that could not distinguish between images.

2) Additionally, VGG-19 downsamples rapidly through several max-pool layers. That architecture caused a loss of resolution in the images. For optical flow, and for this video-to-speed problem, high spatial resolution is very important.

Thus we decided not to use a pretrained feature extractor and to just let the network learn everything on its own. In the following cell we define our network architecture. Here is an illustration of our architecture:

<img src="architecture.png">


In [None]:
# Misc. config.

# Log messages.
tf.logging.set_verbosity(tf.logging.INFO)

# Batch size to use when training. Note that this is the number of image
# *pairs* that will be fed in at once (so it need not be even).
BATCH_SIZE = 12

# Number of vid2speed speed buckets.
NUM_LABELS = 15

# Build the vid2speed classifier network. Accepts a tensor of 3-channel (RGB)
# uint8 image data with shape [2n, 224, 224, 3] - i.e., n pairs of images.
# Returns the logits layer.
def vid2speed(images):

    reshaped = tf.reshape(images, [-1, 2, 224, 224, 3])
    inputs = tf.cast(reshaped, tf.float32)
    
    even_img = inputs[:, 0, :, :, :]
    odd_img = inputs[:, 1, :, :, :]
    tf.summary.image('even', even_img, max_outputs=12)
    tf.summary.image('odd', odd_img, max_outputs=12)

    concat_vol = tf.concat([even_img, odd_img], 3)

    # Three convolutional layers.
    conv_1 = tf.layers.conv2d(concat_vol,
                              filters=64,
                              kernel_size=3,
                              strides=2,
                              activation=tf.nn.relu,
                              name='conv_1')
    conv_2 = tf.layers.conv2d(conv_1,
                              filters=64,
                              kernel_size=3,
                              strides=2,
                              activation=tf.nn.relu,
                              name='conv_2')
    conv_3 = tf.layers.conv2d(conv_2,
                              filters=64,
                              kernel_size=3,
                              strides=2,
                              activation=tf.nn.relu,
                              name='conv_3')

    # Four fully-connected layers.
    flat = tf.reshape(conv_3, [-1, 46656])
    fc_4 = tf.layers.dense(flat, 4096, activation=tf.nn.relu, name='fc_4')
    fc_5 = tf.layers.dense(fc_4, 4096, activation=tf.nn.relu, name='fc_5')
    fc_6 = tf.layers.dense(fc_5, 4096, activation=tf.nn.relu, name='fc_6')
    fc_7 = tf.layers.dense(fc_6, NUM_LABELS, name='fc_7')

    return fc_7
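
As a sanity check on the hard-coded `46656` in the flatten step, here is a small sketch of the shape arithmetic. It assumes the `tf.layers.conv2d` default of `'valid'` padding (no padding argument is passed in the cell above); the helper name `conv_output_size` is ours, for illustration only.

```python
def conv_output_size(size, kernel_size=3, stride=2):
    # 'valid' padding: count only positions where the kernel fits entirely
    # inside the input.
    return (size - kernel_size) // stride + 1

size = 224
for layer in range(3):
    size = conv_output_size(size)
    print(f'conv_{layer + 1}: {size} x {size} x 64')

# Spatial size after conv_3, times the 64 output filters.
flat_size = size * size * 64
print(flat_size)
```

The spatial size shrinks from 224 to 111, then 55, then 27, and 27 * 27 * 64 = 46656, which matches the reshape in `vid2speed`.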


# Set up training and evaluation using an Estimator

We decided to use an Estimator to make checkpointing easy (we wanted to be able to start and stop on Floydhub without losing training progress). We use the cross-entropy loss between the true bucket (discretized by us) and our predicted bucket. We use the gradient descent optimizer with learning_rate = 0.001. We tried larger values of this hyperparameter but found that the loss went up during training (it overshot the minimum). We apply softmax to our last layer to get probabilities for every bucket.
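
To make the loss concrete, here is a minimal NumPy sketch of the bucketing and the softmax cross-entropy. It mirrors the `labels / 29 * NUM_LABELS` scaling used in the estimator, but it is an illustration rather than the TensorFlow code itself: the names `MAX_SPEED`, `speed_to_bucket`, and `cross_entropy` are ours, and the final clip (so that a speed of exactly 29 lands in the last bucket instead of index 15) is an addition that the notebook code does not perform.

```python
import numpy as np

NUM_LABELS = 15
MAX_SPEED = 29.0  # the constant the estimator divides labels by

def speed_to_bucket(speed):
    # Scale into [0, NUM_LABELS] and truncate to an integer bucket index.
    bucket = np.floor(np.asarray(speed) / MAX_SPEED * NUM_LABELS).astype(np.int32)
    # Assumption (not in the notebook): clip so MAX_SPEED maps to the last bucket.
    return np.clip(bucket, 0, NUM_LABELS - 1)

def softmax(logits):
    # Subtract the row-wise max for numerical stability.
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def cross_entropy(logits, buckets):
    # Mean negative log-probability assigned to the true bucket.
    probs = softmax(logits)
    picked = probs[np.arange(len(buckets)), buckets]
    return float(-np.log(picked).mean())

speeds = np.array([0.0, 10.0, 29.0])
buckets = speed_to_bucket(speeds)
uniform_loss = cross_entropy(np.zeros((3, NUM_LABELS)), buckets)
print(buckets, uniform_loss)
```

A network that outputs identical logits for every bucket gets a loss of ln(15), about 2.71, which is a useful baseline when reading the loss curve.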

In [None]:
# Train the vid2speed network given an `images` batch and `labels` vector. The
# `images` tensor should have shape [2n, 224, 224, 3] and type uint8 - i.e., a
# batch of n pairs of 3-channel (RGB) 224x224 images. This function is written
# to comply with the TensorFlow Estimator specification, and should be used to
# create an Estimator for the vid2speed network.
def vid2speed_estimator(features, labels, mode):
    
    # Build the network itself.
    images = features['images']
    logits = vid2speed(images)
   
    # Given the logits layer, we can form predictions using argmax to find the
    # most likely class, and softmax to find class probabilities.
    predictions = {
        'classes': tf.cast(tf.argmax(logits, axis=1), tf.int32),
        
        # Naming this op `softmax_tensor` makes the class probabilities
        # available for prediction and for logging during training.
        'probabilities': tf.nn.softmax(logits, name='softmax_tensor')
    }
    
    # If we just want to predict new samples, we don't need anything else.
    if mode == tf.estimator.ModeKeys.PREDICT:
        return tf.estimator.EstimatorSpec(mode, predictions)
    
    # Construct the loss function based on a one_hot vector for each label and
    # the logits layer from the vid2speed network. The loss function is
    # necessary for training and evaluation.
    labels = labels / 29 * NUM_LABELS
    labels = tf.cast(labels, tf.int32)
    one_hot = tf.one_hot(labels, NUM_LABELS)
    loss = tf.losses.softmax_cross_entropy(one_hot, logits)
    tf.summary.scalar('cost', loss)
    tf.identity(loss, name='loss')
    tf.identity(labels, name='labels')
    tf.identity(predictions['classes'], 'preds')

    # Configure the training operation using the loss function created above.
    if mode == tf.estimator.ModeKeys.TRAIN:
        optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.001)
        train_op = optimizer.minimize(loss, tf.train.get_global_step())
        
        correct_prediction = tf.equal(labels, predictions['classes'])
        accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
        tf.summary.scalar('accuracy', accuracy)
        return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)

    # Add evaluation metrics for EVAL mode.
    metrics = {'accuracy': tf.metrics.accuracy(labels, predictions['classes'])}
    return tf.estimator.EstimatorSpec(mode, loss=loss, eval_metric_ops=metrics)

# Run the training

We trained for over 9,000 minibatches on Floydhub (about 15 epochs). We used Tensorboard to monitor the progress and output loss/accuracy graphs (included below). We also checkpointed the model every 60 seconds. We attempted to train locally, but doing so took on the order of seconds per minibatch (unacceptably slow). So we used Floydhub to do all the training.

As expected with training on minibatches, our accuracy on each batch varies but tends upward. By the end of the 15 epochs, we were able to completely fit the train set. We then evaluated on our unseen test set. Our trained model achieved 91% accuracy on the test set. Given the varied nature of roads in the train/test sets, we feel confident the network would generalize further to unseen road environments. We did not get to try re-training with regularization (such as dropout); in our next iteration we would like to try regularizing the network.

In [None]:
# Main training cell

# Data import.
# Load training and testing data.
# Note: data tensors should have shape [2n, 224, 224, 3] and type uint8 (i.e.,
# n pairs of 224x224 3-channel (RGB) images), and the label vectors should have
# shape [n], containing the speed bucket for each pair of images.
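train_data = np.load('/data/X_train.npy').reshape((-1, 2, 224, 224, 3))
train_labels = np.load('/data/y_train.npy')
test_data = np.load('/data/X_test.npy').reshape((-1, 2, 224, 224, 3))
test_labels = np.load('/data/y_test.npy')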


# Create the Estimator
config = tf.estimator.RunConfig(save_checkpoints_secs=60)
v2s_classifier = tf.estimator.Estimator(vid2speed_estimator,
                                        './model',
                                        config=config)

# Set up logging during training. Logs will be produced every 50 training
# iterations. Note: alternatively, logs can be produced at a fixed time
# rate. To enable this behavior, use the `every_n_secs` parameter instead
# of the `every_n_iter` parameter below.
log_targets = {
    'loss': 'loss',
    'preds': 'preds',
    'labels': 'labels'
}
logging = tf.train.LoggingTensorHook(log_targets, every_n_iter=50)

'''
# Create a training function using the training data set above.
input_fn = tf.estimator.inputs.numpy_input_fn(
    x={'images': train_data},
    y=train_labels,
    batch_size=BATCH_SIZE,
    num_epochs=None,
    shuffle=False
)

# Run training.
v2s_classifier.train(input_fn, steps=20000, hooks=[logging])
'''

# Create an evaluation function using the test data set above.
input_fn = tf.estimator.inputs.numpy_input_fn(
    x={'images': test_data},
    y=test_labels,
    batch_size=BATCH_SIZE,
    num_epochs=1,
    shuffle=False
)

# Run evaluation on the test set.
v2s_classifier.evaluate(input_fn, hooks=[logging])

## Accuracy as a function of # iterations
<img src="accuracy.png">
## Loss as a function of # iterations
<img src="loss.png">


# Analysis

Once we fit the training data, we wanted to see what we had learned. We will now analyze the network output by creating pixel-level saliency maps. We take the gradient of the loss with respect to each pixel and visualize the output. To do so, we need to restore the graph from a checkpoint (since we are using an Estimator).

In [None]:
# Parameter `pair` should have size [2, 224, 224, 3]. Parameter `label` should be a
# single label value.
def get_gradient_map(pair, label):
    tf.reset_default_graph()
    pair = tf.Variable(pair, name='pair', dtype=tf.float32)
    label = np.array([label])
    label = label / 29 * NUM_LABELS
    logits = vid2speed(pair)
    label = tf.cast(label, tf.int32)
    one_hot = tf.one_hot(label, NUM_LABELS)
    loss = tf.losses.softmax_cross_entropy(one_hot, logits)
    with tf.Session() as sess:
        pair.initializer.run()
        variables = tf.trainable_variables()
        uninit = []
        for vari in variables:
            if vari.name != 'pair:0':
                uninit.append(vari)
        saver = tf.train.Saver(uninit)
        saver.restore(sess, './model/model.ckpt-9014')
        grad = tf.gradients(loss, pair)
        return np.array(sess.run(grad))

In [None]:
# Load test data
test_data = np.load('/data/X_test.npy').reshape((-1, 2, 224, 224, 3))
test_labels = np.load('/data/y_test.npy')

# Create pixel level saliency maps
normy_index = 17
pair = test_data[normy_index, :, :, :, :]
mappy = get_gradient_map(pair, test_labels[normy_index])
mappy = mappy[0, :, :, :, :]
mappy_norms = np.linalg.norm(mappy, ord=2, axis=3)
mappy_norms /= np.max(mappy_norms)
mappy_norms = (mappy_norms * 255).astype('uint8')
mappy_norms = np.expand_dims(mappy_norms, axis=3)
mappy_norms = np.repeat(mappy_norms, 3, axis=3)

im0 = Image.fromarray(pair[0, :, :, :])
im1 = Image.fromarray(pair[1, :, :, :])
map0 = Image.fromarray(mappy_norms[0, :, :, :])
map1 = Image.fromarray(mappy_norms[1, :, :, :])
im0.save('im0.png')
im1.save('im1.png')
map0.save('map0.png')
map1.save('map1.png')

# Peeking inside the "black box"
To better understand our results, we took the gradient of the loss with respect to each pixel of an input image. We then visualized the gradients by coloring the pixels by the magnitude of the norm of their gradients (brighter here means a larger gradient). In other words, we are left with a "heat map" of which pixels the network focuses on to make its decision. Note that we enhanced these maps with color and overlaid them on the original image to make them easier to read.
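
One way such an overlay could be produced is sketched below. This is a hypothetical illustration, not the exact procedure used for the images here: the function name `overlay_saliency`, the red-only coloring, and the `alpha` blending weight are all assumptions. The gradient norm and max-normalization follow the same steps as the saliency cell above.

```python
import numpy as np

def overlay_saliency(image, grad, alpha=0.6):
    """Blend a gradient-magnitude heat map over an RGB image.

    image: uint8 array [H, W, 3]; grad: float array [H, W, 3] holding
    d(loss)/d(pixel) for the same frame.
    """
    # Per-pixel L2 norm across the color channels, scaled to [0, 1].
    norms = np.linalg.norm(grad, ord=2, axis=2)
    peak = norms.max()
    if peak > 0:
        norms = norms / peak

    # Hypothetical coloring: put the heat in the red channel only.
    heat = np.zeros_like(image, dtype=np.float32)
    heat[..., 0] = 255.0 * norms

    # Weight the blend by saliency so low-gradient pixels keep the original image.
    weight = alpha * norms[..., None]
    blended = (1.0 - weight) * image.astype(np.float32) + weight * heat
    return blended.astype(np.uint8)

# Toy example: a flat gray 2x2 image with a single salient pixel.
image = np.full((2, 2, 3), 100, dtype=np.uint8)
grad = np.zeros((2, 2, 3))
grad[0, 0] = [3.0, 4.0, 0.0]
overlay = overlay_saliency(image, grad)
print(overlay[0, 0], overlay[1, 1])
```

Pixels with zero gradient are left untouched, while the most salient pixel is pulled toward red in proportion to `alpha`.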

<img src="im0.jpeg"> <img src="map0.jpeg"> <img src="map0_overlay.jpeg">

As you can see, the network learned to ignore the road and the sky. It learned to focus on the white lines in the road (something we hypothesized it would). Surprisingly, it also focuses on the hood of the car / the top of the car. After investigating this phenomenon more, we believe the car itself shakes more at higher speeds and thus the network has learned to factor in the shaking of the car. How neat!


# Next Steps

Next, we would like to try regularizing the network and testing it on different driving conditions (like different cars and roads). We are somewhat confident the model will generalize because the training set had such variable road conditions in it (neighborhoods, traffic, highways, etc.). We also got a pretrained optical flow network working and would have liked to try using it as a feature extractor before our network. Here is some sample output from that network:

<img src="movement.png">
<img src="flow.png">

# References

E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox. FlowNet 2.0: Evolution of optical flow estimation with deep networks. In CVPR, 2017.

P. Fischer, A. Dosovitskiy, E. Ilg, P. Häusser, C. Hazırbaş, V. Golkov, P. van der Smagt, D. Cremers, and T. Brox. FlowNet: Learning optical flow with convolutional networks. In ICCV, 2015.

K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.

Websites:

https://github.com/pathak22/pyflow

https://comma.ai/
