# Video Classification with a CNN-RNN Architecture

This notebook is forked from [https://keras.io/examples/vision/video_classification/](Keras Tutorial)

In [1]:
%pip install -q git+https://github.com/tensorflow/docs

Note: you may need to restart the kernel to use updated packages.


## Data collection

In order to keep the runtime of this example relatively short, we will be using a
subsampled version of the original UCF101 dataset. You can refer to
[this notebook](https://colab.research.google.com/github/sayakpaul/Action-Recognition-in-TensorFlow/blob/main/Data_Preparation_UCF101.ipynb)
to know how the subsampling was done.

In [2]:
!wget -q https://git.io/JGc31 -O ucf101_top5.tar.gz
!tar xf ucf101_top5.tar.gz

## Setup

In [1]:
!export TF_ENABLE_ONEDNN_OPTS=1
!export TF_GPU_ALLOCATOR=cuda_malloc_async

from tensorflow_docs.vis import embed
from tensorflow import keras
from keras import layers

import matplotlib.pyplot as plt
import tensorflow as tf
import pandas as pd
import numpy as np
import imageio
import cv2
import os

from sklearn.model_selection import train_test_split

2023-01-30 14:34:12.304921: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-01-30 14:34:12.752136: I tensorflow/core/util/port.cc:104] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-01-30 14:34:18.328359: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /kuacc/apps/ffmpeg/4.4.0_x265/lib:/kuacc/apps/x265/lib:/scratch/kuacc/apps/intel/oneapi

In [2]:
!nvidia-smi

Mon Jan 30 14:34:30 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:1C:00.0 Off |                    0 |
| N/A   38C    P0    26W /  70W |      0MiB / 15109MiB |      3%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+---------------------------------------------------------------------------

## Define hyperparameters

In [38]:
IMG_SIZE = 224
BATCH_SIZE = 64
EPOCHS = 50

MAX_SEQ_LENGTH = 20
NUM_FEATURES = 2048

In [39]:
lstm_checkpoint_path = "./tmp/training_lstm/cp-{epoch:04d}.ckpt"
lstm_checkpoint_dir = os.path.dirname(lstm_checkpoint_path)

transformer_checkpoint_path = "./tmp/training_former/cp-{epoch:04d}.ckpt"
transformer_checkpoint_dir = os.path.dirname(transformer_checkpoint_path)

## Data preparation

In [40]:
basedir = "../engagement-slices/%d"
df_vids = pd.read_csv(os.path.join(basedir % 4, "list.csv"))
df_vids.insert(0, 'path', basedir)

for i in range(5,9):
    df = pd.read_csv(os.path.join(basedir % i, "list.csv"))
    df.insert(0, 'path', basedir % i)
    df_vids = pd.concat([df_vids, df], axis=0)

print(f"Total videos: {len(df_vids)}")
df_vids.sample(10)

train_df, test_df = train_test_split(df_vids, test_size=0.2)

Total videos: 2465


One of the many challenges of training video classifiers is figuring out a way to feed
the videos to a network. [This blog post](https://blog.coast.ai/five-video-classification-methods-implemented-in-keras-and-tensorflow-99cad29cc0b5)
discusses five such methods. Since a video is an ordered sequence of frames, we could
just extract the frames and put them in a 3D tensor. But the number of frames may differ
from video to video which would prevent us from stacking them into batches
(unless we use padding). As an alternative, we can **save video frames at a fixed
interval until a maximum frame count is reached**. In this example we will do
the following:

1. Capture the frames of a video.
2. Extract frames from the videos until a maximum frame count is reached.
3. In the case, where a video's frame count is lesser than the maximum frame count we
will pad the video with zeros.

Note that this workflow is identical to [problems involving texts sequences](https://developers.google.com/machine-learning/guides/text-classification/). Videos of the UCF101 dataset is [known](https://www.crcv.ucf.edu/papers/UCF101_CRCV-TR-12-01.pdf)
to not contain extreme variations in objects and actions across frames. Because of this,
it may be okay to only consider a few frames for the learning task. But this approach may
not generalize well to other video classification problems. We will be using
[OpenCV's `VideoCapture()` method](https://docs.opencv.org/master/dd/d43/tutorial_py_video_display.html)
to read frames from videos.

In [41]:
# The following two methods are taken from this tutorial:
# https://www.tensorflow.org/hub/tutorials/action_recognition_with_tf_hub

def crop_center_square(frame):
    y, x = frame.shape[0:2]
    min_dim = min(y, x)
    start_x = (x // 2) - (min_dim // 2)
    start_y = (y // 2) - (min_dim // 2)
    return frame[start_y : start_y + min_dim, start_x : start_x + min_dim]


def load_video(path, max_frames=0, resize=(IMG_SIZE, IMG_SIZE)):
    cap = cv2.VideoCapture(path)
    frames = []
    try:
        while True:
            ret, frame = cap.read()
            if not ret:
                break
            frame = crop_center_square(frame)
            frame = cv2.resize(frame, resize)
            frame = frame[:, :, [2, 1, 0]]
            frames.append(frame)

            if len(frames) == max_frames:
                break
    finally:
        cap.release()
    return np.array(frames)


We can use a pre-trained network to extract meaningful features from the extracted
frames. The [`Keras Applications`](https://keras.io/api/applications/) module provides
a number of state-of-the-art models pre-trained on the [ImageNet-1k dataset](http://image-net.org/).
We will be using the [InceptionV3 model](https://arxiv.org/abs/1512.00567) for this purpose.

In [42]:
def build_feature_extractor():
    feature_extractor = keras.applications.MobileNet(
        weights="imagenet",
        include_top=False,
        pooling="avg",
        input_shape=(IMG_SIZE, IMG_SIZE, 3),
    )
    preprocess_input = keras.applications.inception_v3.preprocess_input

    inputs = keras.Input((IMG_SIZE, IMG_SIZE, 3))
    preprocessed = preprocess_input(inputs)

    outputs = feature_extractor(preprocessed)
    return keras.Model(inputs, outputs, name="feature_extractor")


feature_extractor = build_feature_extractor()

The labels of the videos are strings. Neural networks do not understand string values,
so they must be converted to some numerical form before they are fed to the model. Here
we will use the [`StringLookup`](https://keras.io/api/layers/preprocessing_layers/categorical/string_lookup)
layer encode the class labels as integers.

In [43]:
label_processor = keras.layers.StringLookup(
    num_oov_indices=0, vocabulary=np.unique(df_vids["tag"])
)
print(label_processor.get_vocabulary())

['D', 'E', 'HD', 'HE', 'M']


Finally, we can put all the pieces together to create our data processing utility.

In [44]:
def prepare_all_videos(df):
    num_samples = len(df)
    video_paths = (df['path'].astype(str) + df["video_name"].astype(str)).values.tolist()
    labels = df["tag"].values
    labels = label_processor(labels[..., None]).numpy()

    # `frame_masks` and `frame_features` are what we will feed to our sequence model.
    # `frame_masks` will contain a bunch of booleans denoting if a timestep is
    # masked with padding or not.
    frame_masks = np.zeros(shape=(num_samples, MAX_SEQ_LENGTH), dtype="bool")
    frame_features = np.zeros(
        shape=(num_samples, MAX_SEQ_LENGTH, NUM_FEATURES), dtype="float32"
    )

    # For each video.
    for idx, path in enumerate(video_paths):
        # Gather all its frames and add a batch dimension.
        frames = load_video(path)
        frames = frames[None, ...]

        # Initialize placeholders to store the masks and features of the current video.
        temp_frame_mask = np.zeros(shape=(1, MAX_SEQ_LENGTH,), dtype="bool")
        temp_frame_features = np.zeros(
            shape=(1, MAX_SEQ_LENGTH, NUM_FEATURES), dtype="float32"
        )

        # Extract features from the frames of the current video.
        for i, batch in enumerate(frames):
            video_length = batch.shape[0]
            length = min(MAX_SEQ_LENGTH, video_length)
            for j in range(length):
                temp_frame_features[i, j, :] = feature_extractor.predict(
                    batch[None, j, :]
                )
            temp_frame_mask[i, :length] = 1  # 1 = not masked, 0 = masked

        frame_features[idx,] = temp_frame_features.squeeze()
        frame_masks[idx,] = temp_frame_mask.squeeze()

    return (frame_features, frame_masks), labels


train_data, train_labels = prepare_all_videos(train_df)
test_data, test_labels = prepare_all_videos(test_df)

print(f"Frame features in train set: {train_data[0].shape}")
print(f"Frame masks in train set: {train_data[1].shape}")

Frame features in train set: (1972, 20, 2048)
Frame masks in train set: (1972, 20)


The above code block will take ~20 minutes to execute depending on the machine it's being
executed.

In [45]:
train_data[1]

array([[False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       ...,
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False]])

## The sequence model

Now, we can feed this data to a sequence model consisting of recurrent layers like `GRU`.

In [48]:
# Utility for our sequence model.
def get_sequence_model():
    class_vocab = label_processor.get_vocabulary()

    frame_features_input = keras.Input((MAX_SEQ_LENGTH, NUM_FEATURES))
    mask_input = keras.Input((MAX_SEQ_LENGTH,), dtype="bool")

    # Refer to the following tutorial to understand the significance of using `mask`:
    # https://keras.io/api/layers/recurrent_layers/gru/
    x = keras.layers.GRU(16, return_sequences=True)(
        frame_features_input, mask=mask_input
    )
    x = keras.layers.GRU(8)(x)
    x = keras.layers.Dropout(0.4)(x)
    x = keras.layers.Dense(8, activation="relu")(x)
    output = keras.layers.Dense(len(class_vocab), activation="softmax")(x)

    rnn_model = keras.Model([frame_features_input, mask_input], output)

    rnn_model.compile(
        loss="sparse_categorical_crossentropy", optimizer="adam", metrics=["accuracy"]
    )
    return rnn_model


# Utility for running experiments.
def run_experiment():
    checkpoint = keras.callbacks.ModelCheckpoint(filepath=lstm_checkpoint_path,
                                                     save_weights_only=True,
                                                     save_best_only=True,
                                                     verbose=1)
    seq_model = get_sequence_model()
    history = seq_model.fit(
        [train_data[0], train_data[1]],
        train_labels,
        validation_split=0.2,
        epochs=EPOCHS,
        callbacks=[checkpoint],
    )

    seq_model.load_weights(filepath)
    _, accuracy = seq_model.evaluate([test_data[0], test_data[1]], test_labels)
    print(f"Test accuracy: {round(accuracy * 100, 2)}%")

    return history, seq_model

In [49]:
_, sequence_model = run_experiment()

Epoch 1/50
Epoch 1: val_loss improved from inf to 1.58958, saving model to ./tmp/training_lstm/cp-0001.ckpt
Epoch 2/50
Epoch 2: val_loss improved from 1.58958 to 1.57203, saving model to ./tmp/training_lstm/cp-0002.ckpt
Epoch 3/50
Epoch 3: val_loss improved from 1.57203 to 1.55597, saving model to ./tmp/training_lstm/cp-0003.ckpt
Epoch 4/50
Epoch 4: val_loss improved from 1.55597 to 1.54167, saving model to ./tmp/training_lstm/cp-0004.ckpt
Epoch 5/50
Epoch 5: val_loss improved from 1.54167 to 1.52925, saving model to ./tmp/training_lstm/cp-0005.ckpt
Epoch 6/50
Epoch 6: val_loss improved from 1.52925 to 1.51776, saving model to ./tmp/training_lstm/cp-0006.ckpt
Epoch 7/50
Epoch 7: val_loss improved from 1.51776 to 1.50773, saving model to ./tmp/training_lstm/cp-0007.ckpt
Epoch 8/50
Epoch 8: val_loss improved from 1.50773 to 1.49851, saving model to ./tmp/training_lstm/cp-0008.ckpt
Epoch 9/50
Epoch 9: val_loss improved from 1.49851 to 1.49079, saving model to ./tmp/training_lstm/cp-0009.c

Epoch 26: val_loss improved from 1.43498 to 1.43342, saving model to ./tmp/training_lstm/cp-0026.ckpt
Epoch 27/50
Epoch 27: val_loss improved from 1.43342 to 1.43216, saving model to ./tmp/training_lstm/cp-0027.ckpt
Epoch 28/50
Epoch 28: val_loss improved from 1.43216 to 1.43074, saving model to ./tmp/training_lstm/cp-0028.ckpt
Epoch 29/50
Epoch 29: val_loss improved from 1.43074 to 1.42960, saving model to ./tmp/training_lstm/cp-0029.ckpt
Epoch 30/50
Epoch 30: val_loss improved from 1.42960 to 1.42875, saving model to ./tmp/training_lstm/cp-0030.ckpt
Epoch 31/50
Epoch 31: val_loss improved from 1.42875 to 1.42759, saving model to ./tmp/training_lstm/cp-0031.ckpt
Epoch 32/50
Epoch 32: val_loss improved from 1.42759 to 1.42668, saving model to ./tmp/training_lstm/cp-0032.ckpt
Epoch 33/50
Epoch 33: val_loss improved from 1.42668 to 1.42583, saving model to ./tmp/training_lstm/cp-0033.ckpt
Epoch 34/50
Epoch 34: val_loss improved from 1.42583 to 1.42482, saving model to ./tmp/training_lstm

NameError: name 'filepath' is not defined

**Note**: To keep the runtime of this example relatively short, we just used a few
training examples. This number of training examples is low with respect to the sequence
model being used that has 99,909 trainable parameters. You are encouraged to sample more
data from the UCF101 dataset using [the notebook](https://colab.research.google.com/github/sayakpaul/Action-Recognition-in-TensorFlow/blob/main/Data_Preparation_UCF101.ipynb) mentioned above and train the same model.

# The Transformer Model

In [50]:
class PositionalEmbedding(layers.Layer):
    def __init__(self, sequence_length, output_dim, **kwargs):
        super().__init__(**kwargs)
        self.position_embeddings = layers.Embedding(
            input_dim=sequence_length, output_dim=output_dim
        )
        self.sequence_length = sequence_length
        self.output_dim = output_dim

    def call(self, inputs):
        # The inputs are of shape: `(batch_size, frames, num_features)`
        length = tf.shape(inputs)[1]
        positions = tf.range(start=0, limit=length, delta=1)
        embedded_positions = self.position_embeddings(positions)
        return inputs + embedded_positions

    def compute_mask(self, inputs, mask=None):
        mask = tf.reduce_any(tf.cast(inputs, "bool"), axis=-1)
        return mask

class TransformerEncoder(layers.Layer):
    def __init__(self, embed_dim, dense_dim, num_heads, **kwargs):
        super().__init__(**kwargs)
        self.embed_dim = embed_dim
        self.dense_dim = dense_dim
        self.num_heads = num_heads
        self.attention = layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=embed_dim, dropout=0.3
        )
        self.dense_proj = keras.Sequential(
            [layers.Dense(dense_dim, activation=tf.nn.gelu), layers.Dense(embed_dim),]
        )
        self.layernorm_1 = layers.LayerNormalization()
        self.layernorm_2 = layers.LayerNormalization()

    def call(self, inputs, mask=None):
        if mask is not None:
            mask = mask[:, tf.newaxis, :]

        attention_output = self.attention(inputs, inputs, attention_mask=mask)
        proj_input = self.layernorm_1(inputs + attention_output)
        proj_output = self.dense_proj(proj_input)
        return self.layernorm_2(proj_input + proj_output)

In [51]:
def get_compiled_model():
    sequence_length = MAX_SEQ_LENGTH
    embed_dim = NUM_FEATURES
    dense_dim = 4
    num_heads = 1
    classes = len(label_processor.get_vocabulary())

    inputs = keras.Input((MAX_SEQ_LENGTH, NUM_FEATURES))
    x = PositionalEmbedding(
        sequence_length, embed_dim, name="frame_position_embedding"
    )(inputs)
    x = TransformerEncoder(embed_dim, dense_dim, num_heads, name="transformer_layer")(x)
    x = layers.GlobalMaxPooling1D()(x)
    x = layers.Dropout(0.5)(x)
    outputs = layers.Dense(classes, activation="softmax")(x)
    model = keras.Model(inputs, outputs)

    model.compile(
        optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"]
    )
    return model


def run_experiment():
    checkpoint = keras.callbacks.ModelCheckpoint(filepath=transformer_checkpoint_path,
                                                     save_weights_only=True,
                                                     save_best_only=True,
                                                     verbose=1)

    model = get_compiled_model()
    history = model.fit(
        train_data[0],
        train_labels,
        validation_split=0.15,
        epochs=EPOCHS,
        callbacks=[checkpoint],
    )

    model.load_weights(transformer_checkpoint_path)
    _, accuracy = model.evaluate(test_data[0], test_labels)
    print(f"Test accuracy: {round(accuracy * 100, 2)}%")

    return model

In [52]:
sequence_model = run_experiment()

Epoch 1/50


2023-01-30 15:00:14.138499: I tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:428] Loaded cuDNN version 8101


Epoch 1: val_loss improved from inf to 1.83581, saving model to ./tmp/training_former/cp-0001.ckpt
Epoch 2/50
Epoch 2: val_loss improved from 1.83581 to 1.47224, saving model to ./tmp/training_former/cp-0002.ckpt
Epoch 3/50
Epoch 3: val_loss did not improve from 1.47224
Epoch 4/50
Epoch 4: val_loss improved from 1.47224 to 1.42628, saving model to ./tmp/training_former/cp-0004.ckpt
Epoch 5/50
Epoch 5: val_loss did not improve from 1.42628
Epoch 6/50
Epoch 6: val_loss improved from 1.42628 to 1.41985, saving model to ./tmp/training_former/cp-0006.ckpt
Epoch 7/50
Epoch 7: val_loss did not improve from 1.41985
Epoch 8/50
Epoch 8: val_loss did not improve from 1.41985
Epoch 9/50
Epoch 9: val_loss did not improve from 1.41985
Epoch 10/50
Epoch 10: val_loss did not improve from 1.41985
Epoch 11/50
Epoch 11: val_loss did not improve from 1.41985
Epoch 12/50
Epoch 12: val_loss did not improve from 1.41985
Epoch 13/50
Epoch 13: val_loss did not improve from 1.41985
Epoch 14/50
Epoch 14: val_los

Epoch 30: val_loss did not improve from 1.41234
Epoch 31/50
Epoch 31: val_loss did not improve from 1.41234
Epoch 32/50
Epoch 32: val_loss did not improve from 1.41234
Epoch 33/50
Epoch 33: val_loss did not improve from 1.41234
Epoch 34/50
Epoch 34: val_loss did not improve from 1.41234
Epoch 35/50
Epoch 35: val_loss did not improve from 1.41234
Epoch 36/50
Epoch 36: val_loss did not improve from 1.41234
Epoch 37/50
Epoch 37: val_loss did not improve from 1.41234
Epoch 38/50
Epoch 38: val_loss did not improve from 1.41234
Epoch 39/50
Epoch 39: val_loss did not improve from 1.41234
Epoch 40/50
Epoch 40: val_loss did not improve from 1.41234
Epoch 41/50
Epoch 41: val_loss did not improve from 1.41234
Epoch 42/50
Epoch 42: val_loss did not improve from 1.41234
Epoch 43/50
Epoch 43: val_loss did not improve from 1.41234
Epoch 44/50
Epoch 44: val_loss did not improve from 1.41234
Epoch 45/50
Epoch 45: val_loss did not improve from 1.41234
Epoch 46/50
Epoch 46: val_loss did not improve from 

NotFoundError: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for ./tmp/training_former/cp-{epoch:04d}.ckpt

## Load From Checkpoint

In [None]:
#sequence_model = get_sequence_model().load_weights(lstm_checkpoint_path)
sequence_model = model.load_weights(transformer_checkpoint_path)

## Inference

In [None]:

def prepare_single_video(frames):
    frames = frames[None, ...]
    frame_mask = np.zeros(shape=(1, MAX_SEQ_LENGTH,), dtype="bool")
    frame_features = np.zeros(shape=(1, MAX_SEQ_LENGTH, NUM_FEATURES), dtype="float32")

    for i, batch in enumerate(frames):
        video_length = batch.shape[0]
        length = min(MAX_SEQ_LENGTH, video_length)
        for j in range(length):
            frame_features[i, j, :] = feature_extractor.predict(batch[None, j, :])
        frame_mask[i, :length] = 1  # 1 = not masked, 0 = masked

    return frame_features, frame_mask


def sequence_prediction(path):
    class_vocab = label_processor.get_vocabulary()

    frames = load_video(os.path.join("test", path))
    frame_features, frame_mask = prepare_single_video(frames)
    probabilities = sequence_model.predict([frame_features, frame_mask])[0]

    for i in np.argsort(probabilities)[::-1]:
        print(f"  {class_vocab[i]}: {probabilities[i] * 100:5.2f}%")
    return frames


# This utility is for visualization.
# Referenced from:
# https://www.tensorflow.org/hub/tutorials/action_recognition_with_tf_hub
def to_gif(images):
    converted_images = np.clip(images * 255, 0, 255).astype(np.uint8)
    imageio.mimsave('./animation.gif', converted_images, fps=25)
    return embed.embed_file('./animation.gif')


test_video = np.random.choice(test_df["video_name"].values.tolist())
print(f"Test video path: {test_video}")
test_frames = sequence_prediction(test_video)
to_gif(test_frames[:MAX_SEQ_LENGTH])

## Next steps

* In this example, we made use of transfer learning for extracting meaningful features
from video frames. You could also fine-tune the pre-trained network to notice how that
affects the end results.
* For speed-accuracy trade-offs, you can try out other models present inside
`tf.keras.applications`.
* Try different combinations of `MAX_SEQ_LENGTH` to observe how that affects the
performance.
* Train on a higher number of classes and see if you are able to get good performance.
* Following [this tutorial](https://www.tensorflow.org/hub/tutorials/action_recognition_with_tf_hub), try a
[pre-trained action recognition model](https://arxiv.org/abs/1705.07750) from DeepMind.
* Rolling-averaging can be useful technique for video classification and it can be
combined with a standard image classification model to infer on videos.
[This tutorial](https://www.pyimagesearch.com/2019/07/15/video-classification-with-keras-and-deep-learning/)
will help understand how to use rolling-averaging with an image classifier.
* When there are variations in between the frames of a video not all the frames might be
equally important to decide its category. In those situations, putting a
[self-attention layer](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Attention) in the
sequence model will likely yield better results.
* Following [this book chapter](https://livebook.manning.com/book/deep-learning-with-python-second-edition/chapter-11),
you can implement Transformers-based models for processing videos.