# Action classification in video: UCF-101

We will design an algorithm for action classification on UCF-101, a dataset with more than 13k videos covering 101 action classes. The pipeline consists of extracting frame-level features with a pre-trained CNN and then modelling their temporal evolution with RNNs.

## Imports

In [None]:
import tensorflow as tf
import tensorflow.contrib.layers as layers

from util.video_loader import generate_batch

import time
import numpy as np

## Hyperparameters

In [None]:
train_data_pattern = 'data/ucf101/*'
num_classes = 101
batch_size = 256
cell_type = 'gru'  # 'lstm' or 'gru'
rnn_layers = 2
rnn_cells = 256
learning_rate = 0.001
grad_clip_norm = 1.
num_iterations = 500
print_freq = 10  # how often to print during training

## Loading the videos: TFRecord

Extracting CNN features for each frame in a video is computationally expensive. For this reason, we have already extracted features for the first 10s of every video in UCF-101 (padding shorter videos with empty frames), so that we can focus on developing and training the RNN model. Since we do not need to allocate memory for the CNN, we can use larger batch sizes and iterate much faster over the data.
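To make the padding step concrete, here is a minimal sketch in plain Python. It assumes that an "empty frame" is an all-zero feature vector and that the number of real frames is kept alongside the padded sequence; the function name `pad_features` and the toy dimensions are hypothetical and are not taken from the actual feature extraction code.

```python
def pad_features(frames, max_frames, feature_dim):
    """Pad (or truncate) a list of per-frame feature vectors to `max_frames`.

    Returns the padded sequence and the number of real (non-padded) frames.
    Assumption: an "empty frame" is represented by an all-zero vector.
    """
    seq_length = min(len(frames), max_frames)
    padded = [list(f) for f in frames[:seq_length]]
    padded += [[0.0] * feature_dim for _ in range(max_frames - seq_length)]
    return padded, seq_length


# Hypothetical example: a 3-frame video with 4-dimensional features,
# padded to a fixed length of 5 frames.
video = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8], [0.9, 1.0, 1.1, 1.2]]
padded, seq_length = pad_features(video, max_frames=5, feature_dim=4)
print(len(padded), seq_length)  # 5 3
```

Keeping the true length next to the padded sequence is what later allows the RNN to ignore the padded steps.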

The features are stored using TFRecord, a data format that enables fast and asynchronous data loading. This means that while the GPU is busy processing the current batch, the CPU loads and preprocesses the next one. This procedure makes the most of the available hardware and avoids starving the GPU.
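The idea behind asynchronous loading can be illustrated with a small producer/consumer sketch that uses only the Python standard library. This is not how `generate_batch` is implemented; the function names, the queue capacity and the `time.sleep` calls standing in for I/O and GPU work are all assumptions made for illustration.

```python
import queue
import threading
import time


def producer(batch_queue, num_batches):
    """Simulate CPU-side loading and preprocessing of batches."""
    for i in range(num_batches):
        time.sleep(0.01)    # stand-in for reading and decoding records
        batch_queue.put(i)  # blocks when the queue is full
    batch_queue.put(None)   # sentinel: no more data


def consume(num_batches, capacity=4):
    """Simulate the training loop, which pulls ready batches from the queue."""
    batch_queue = queue.Queue(maxsize=capacity)
    loader = threading.Thread(target=producer, args=(batch_queue, num_batches), daemon=True)
    loader.start()

    processed = []
    while True:
        batch = batch_queue.get()  # waits only if no batch has been prefetched yet
        if batch is None:
            break
        time.sleep(0.01)           # stand-in for a training step on the GPU
        processed.append(batch)
    loader.join()
    return processed


print(consume(num_batches=5))  # [0, 1, 2, 3, 4]
```

While the consumer is busy with one batch, the producer thread keeps filling the queue, so loading and training overlap instead of alternating.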

In [None]:
# Reset graph in case we already created one and want to change hyperparameters
tf.reset_default_graph()

# Create data loading pipeline
model_input, labels, seq_length, _ = generate_batch(train_data_pattern, batch_size, train=True)

# We will use one label for the whole sequence
labels = labels[:, 0]

## Define model

In [None]:
# Create RNN cells
cell_fn = {'lstm': tf.contrib.rnn.LSTMCell, 'gru': tf.contrib.rnn.GRUCell}
cell_list = [cell_fn[cell_type](rnn_cells) for _ in range(rnn_layers)]
multi_cell = tf.contrib.rnn.MultiRNNCell(cell_list)

# Unroll RNN dynamically
rnn_outputs, rnn_states = tf.nn.dynamic_rnn(multi_cell, model_input, dtype=tf.float32, sequence_length=seq_length)

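# Fully-connected layer on top of the last valid RNN output. dynamic_rnn zeroes the outputs
# past seq_length, so we read the final state of the top layer instead of rnn_outputs[:, -1, :]
top_state = rnn_states[-1]
last_output = top_state.h if cell_type == 'lstm' else top_state
logits = layers.linear(inputs=last_output, num_outputs=num_classes + 1)  # we have the background class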
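When `sequence_length` is passed to `tf.nn.dynamic_rnn`, the outputs past each sequence's length are zeroed out and the state is copied through unchanged. The toy recurrence below imitates that behaviour in plain Python for a single padded sequence. The recurrence itself (`h_t = 0.5 * h_{t-1} + x_t`) and the function name are made up for illustration and have nothing to do with a GRU or LSTM.

```python
def toy_dynamic_rnn(inputs, seq_length):
    """Toy recurrence h_t = 0.5 * h_{t-1} + x_t over one padded sequence.

    Mimics the masking applied when a sequence length is given: outputs after
    the last valid step are zero, and the state stops being updated.
    """
    state = 0.0
    outputs = []
    for t, x in enumerate(inputs):
        if t < seq_length:
            state = 0.5 * state + x
            outputs.append(state)
        else:
            outputs.append(0.0)  # zeroed output; the state is copied through
    return outputs, state


# 2 valid steps followed by 2 padded steps
outputs, final_state = toy_dynamic_rnn([1.0, 2.0, 0.0, 0.0], seq_length=2)
print(outputs)      # [1.0, 2.5, 0.0, 0.0]
print(final_state)  # 2.5
```

For a padded sequence, the output at the last time index and the final state therefore differ: only the final state holds the result of the last valid step.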

In [None]:
# Compute cross-entropy loss
cross_entropy_per_sample = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits, labels=labels)
loss = tf.reduce_mean(cross_entropy_per_sample)

# Compute accuracy (tf.argmax returns int64, so we cast the labels to the same type)
accuracy = tf.reduce_mean(tf.cast(tf.equal(tf.argmax(logits, 1), tf.cast(labels, tf.int64)), tf.float32))

# Define the backwards pass (gradients and parameter update)
opt = tf.train.AdamOptimizer(tf.constant(learning_rate))
vars_to_optimize = tf.trainable_variables()
grads, _ = tf.clip_by_global_norm(tf.gradients(loss, vars_to_optimize), clip_norm=grad_clip_norm)
grads_and_vars = list(zip(grads, vars_to_optimize))
train_fn = opt.apply_gradients(grads_and_vars)
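`tf.clip_by_global_norm` rescales all gradients by the same factor, `clip_norm / max(global_norm, clip_norm)`, where `global_norm` is the L2 norm of all gradients taken together. The sketch below reproduces that formula on plain Python lists so the effect can be checked by hand; the gradient values are hypothetical.

```python
import math


def clip_by_global_norm(grads, clip_norm):
    """Rescale a list of gradient vectors so their joint L2 norm is at most `clip_norm`."""
    global_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    scale = clip_norm / max(global_norm, clip_norm)  # equals 1 when no clipping is needed
    clipped = [[g * scale for g in grad] for grad in grads]
    return clipped, global_norm


# Two hypothetical gradient vectors with global norm sqrt(3^2 + 4^2) = 5
clipped, norm = clip_by_global_norm([[3.0], [4.0]], clip_norm=1.0)
print(norm)     # 5.0
print(clipped)  # approximately [[0.6], [0.8]]
```

Because every gradient is multiplied by the same scalar, the update direction is preserved and only its magnitude is limited.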

## TensorBoard

In [None]:
# Path where the logs will be stored
logdir = 'tensorboard_logs/ucf101'

# Define individual summaries
loss_summary = tf.summary.scalar("cross_entropy_loss", loss)
accuracy_summary = tf.summary.scalar("accuracy", accuracy)

# Merge all summaries into a single op
summary_op = tf.summary.merge([loss_summary, accuracy_summary])

# Create an empty directory for the logs
if tf.gfile.Exists(logdir):
    tf.gfile.DeleteRecursively(logdir)
tf.gfile.MakeDirs(logdir)
    
# Create the writer
summary_writer = tf.summary.FileWriter(logdir)

## Training

In [None]:
# Create session and initialize variables
sess = tf.InteractiveSession()
sess.run(tf.global_variables_initializer())

# Start the queue runners that feed the data loading pipeline
coord = tf.train.Coordinator()
threads = tf.train.start_queue_runners(sess=sess, coord=coord)

# Training loop
try:
    duration = 0.
    steps_since_print = 0  # number of iterations accumulated in `duration`
    for step in range(num_iterations):
        start_time = time.time()
        iter_cross_entropy, iter_acc, _, summary_str = sess.run([loss, accuracy, train_fn, summary_op])
        duration += time.time() - start_time
        steps_since_print += 1

        # Write TensorBoard summaries and print training progress every `print_freq` steps
        if step > 0 and (step % print_freq == 0 or step == (num_iterations - 1)):
            summary_writer.add_summary(summary_str, step)
            print("Step %d, "
                  "%.1f examples/sec (%.2f sec/batch), "
                  "loss: %.7f, "
                  "accuracy: %.2f %%" % (step,
                                         steps_since_print * batch_size / duration,
                                         duration / steps_since_print,
                                         iter_cross_entropy,
                                         100. * iter_acc))
            duration = 0.
            steps_since_print = 0
    print('Done!')
except KeyboardInterrupt:
    print("Training was interrupted")
finally:
    # Stop the writer
    summary_writer.flush()
    summary_writer.close()

    # Stop data loading threads
    coord.request_stop()
    coord.join(threads, stop_grace_period_secs=10)

## Possible extensions

- Consecutive frames in a video are highly redundant. Can we improve performance by decreasing the rate at which frames are sampled?
- Propagating the gradients from the last time step only may become difficult for long sequences. How would you introduce supervision at frame-level?
- The UCF-101 dataset is rather small. How would you reduce overfitting?
- How would you load raw videos in TensorFlow to extract CNN features? Loading video files is not currently supported in TensorFlow (see this [issue on GitHub](https://github.com/tensorflow/tensorflow/issues/6265#issuecomment-268338817) for more information). We used [our own custom Python op](https://github.com/victorcampos7/tensorflow-ffmpeg) that wraps ffmpeg.
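
As a starting point for the first two extensions, the sketch below shows one possible way to subsample frames in time and to average a per-frame loss over the valid frames only. It works on plain Python lists rather than tensors. The function names and the toy values are hypothetical, and this is only one of several reasonable designs.

```python
def subsample_frames(features, seq_length, stride):
    """Keep every `stride`-th frame and return the new number of valid frames."""
    subsampled = features[::stride]
    new_length = (seq_length + stride - 1) // stride  # ceil(seq_length / stride)
    return subsampled, new_length


def masked_mean_loss(frame_losses, seq_length):
    """Average per-frame losses over the valid (non-padded) frames only."""
    valid = frame_losses[:seq_length]
    return sum(valid) / max(len(valid), 1)


frames = list(range(10))  # stand-in for 10 per-frame feature vectors
print(subsample_frames(frames, seq_length=7, stride=3))      # ([0, 3, 6, 9], 3)
print(masked_mean_loss([1.0, 3.0, 5.0, 9.0], seq_length=3))  # 3.0
```

In the TensorFlow graph, the same ideas would amount to slicing `model_input` along the time axis with a stride, updating `seq_length` accordingly, and masking a per-frame cross-entropy before averaging it.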