<a href="https://colab.research.google.com/github/akshitadixit/RAKSHA-3.0/blob/main/Copy_of_video_classification_we're_using_currently.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This example demonstrates video classification, an important use-case with
applications in recommendations, security, and so on.
We will be using the [UCF101 dataset](https://www.crcv.ucf.edu/data/UCF101.php)
to build our video classifier. The dataset consists of videos categorized into different
actions, like cricket shot, punching, biking, etc. This dataset is commonly used to
build action recognizers, which are an application of video classification.

A video consists of an ordered sequence of frames. Each frame contains *spatial*
information, and the sequence of those frames contains *temporal* information. To model
both of these aspects, we use a hybrid architecture that consists of convolutions
(for spatial processing) as well as recurrent layers (for temporal processing).
Specifically, we'll use a Convolutional Neural Network (CNN) and a Recurrent Neural
Network (RNN) consisting of [GRU layers](https://keras.io/api/layers/recurrent_layers/gru/).
This kind of hybrid architecture is popularly known as a **CNN-RNN**.

This example requires TensorFlow 2.5 or higher, as well as TensorFlow Docs, which can be
installed using the following command:

In [None]:
!pip install -q git+https://github.com/tensorflow/docs

  Building wheel for tensorflow-docs (setup.py) ... done


In [None]:
import numpy as np
import pandas as pd
from keras import backend as K
import sys
import csv
import os

import cv2
import math
import random
import datetime as dt
import tensorflow as tf
from collections import deque
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.model_selection import train_test_split

from tensorflow import keras
from tensorflow.keras.layers import *
from tensorflow.keras.models import Sequential
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.utils import plot_model

## Data collection

In order to keep the runtime of this example relatively short, we will be using a
subsampled version of the original UCF101 dataset. You can refer to
[this notebook](https://colab.research.google.com/github/sayakpaul/Action-Recognition-in-TensorFlow/blob/main/Data_Preparation_UCF101.ipynb)
to see how the subsampling was done.

In [None]:
from google.colab import drive
drive.mount('/content/gdrive', force_remount=True)
dataset_dir = '/content/gdrive/My Drive/Colab Notebooks/Pose/pose_data/'

Mounted at /content/gdrive


In [None]:
classes = ['Hammer Strike','Groin Kick','Heel Palm Strike','Elbow Strike','Escape Bear Hug Attack','Escape Hands Trapped','Escape Side Headlock','Eye Strike','Knee strike','Ready Stance','Two handed choked']

with open(dataset_dir+'dataset.csv', 'w', newline='') as file:
  writer = csv.writer(file)
  for c in classes:
    path = os.path.join(dataset_dir,c)
    for i in os.listdir(path):
      writer.writerow([classes.index(c), os.path.join(path, i)])

In [None]:
df = pd.read_csv(dataset_dir+'dataset.csv', header=None)
df.columns = ["class", "path"]
df = df.astype({"class": str})

# changing path from mp4 to avi
#df["path"] = df["path"].apply(lambda x: x.replace("mp4", "avi"))

#df = df.append(df, ignore_index=True)
#df = df.append(df, ignore_index=True)
print(len(df))
print(df)

# split the data
train, test = np.split(df.sample(frac=1, random_state=42), [int(.857*len(df))])
print(len(train))

305
    class                                               path
0       0  /content/gdrive/My Drive/Colab Notebooks/Pose/...
1       0  /content/gdrive/My Drive/Colab Notebooks/Pose/...
2       0  /content/gdrive/My Drive/Colab Notebooks/Pose/...
3       0  /content/gdrive/My Drive/Colab Notebooks/Pose/...
4       0  /content/gdrive/My Drive/Colab Notebooks/Pose/...
..    ...                                                ...
300    10  /content/gdrive/My Drive/Colab Notebooks/Pose/...
301    10  /content/gdrive/My Drive/Colab Notebooks/Pose/...
302    10  /content/gdrive/My Drive/Colab Notebooks/Pose/...
303    10  /content/gdrive/My Drive/Colab Notebooks/Pose/...
304    10  /content/gdrive/My Drive/Colab Notebooks/Pose/...

[305 rows x 2 columns]
261
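
The shuffle-then-split step above can be reproduced on plain row indices to see where the 261/44 split comes from. This is a minimal sketch using NumPy only; `n_videos`, `train_fraction`, and the seed are illustrative choices, and the exact rows selected will differ from the `df.sample` call above.

```python
import numpy as np

n_videos = 305          # number of rows in the DataFrame above
train_fraction = 0.857  # same fraction as in the split above

# Shuffle the row indices reproducibly, then cut at the train boundary.
rng = np.random.default_rng(42)
shuffled = rng.permutation(n_videos)
cut = int(train_fraction * n_videos)
train_idx, test_idx = np.split(shuffled, [cut])

print(len(train_idx), len(test_idx))
```

Because the cut is taken on a random permutation, class proportions in the two parts are only approximately equal.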


## Setup

In [None]:
from tensorflow_docs.vis import embed
from tensorflow import keras
from imutils import paths

import matplotlib.pyplot as plt
import tensorflow as tf
import pandas as pd
import numpy as np
import imageio
import cv2
import os

## Define hyperparameters

In [None]:
IMG_SIZE = 224
BATCH_SIZE = 64
EPOCHS = 20

MAX_SEQ_LENGTH = 240
NUM_FEATURES = 2048

## Data preparation

In [None]:
train_df = train
test_df = test

print(f"Total videos for training: {len(train_df)}")
print(f"Total videos for testing: {len(test_df)}")

train_df.sample(10)

Total videos for training: 261
Total videos for testing: 44


    class                                               path
254     8  /content/gdrive/My Drive/Colab Notebooks/Pose/...
236     8  /content/gdrive/My Drive/Colab Notebooks/Pose/...
66      1  /content/gdrive/My Drive/Colab Notebooks/Pose/...
77      2  /content/gdrive/My Drive/Colab Notebooks/Pose/...
300    10  /content/gdrive/My Drive/Colab Notebooks/Pose/...
197     7  /content/gdrive/My Drive/Colab Notebooks/Pose/...
156     6  /content/gdrive/My Drive/Colab Notebooks/Pose/...
5       0  /content/gdrive/My Drive/Colab Notebooks/Pose/...
63      1  /content/gdrive/My Drive/Colab Notebooks/Pose/...
129     4  /content/gdrive/My Drive/Colab Notebooks/Pose/...


One of the many challenges of training video classifiers is figuring out a way to feed
the videos to a network. [This blog post](https://blog.coast.ai/five-video-classification-methods-implemented-in-keras-and-tensorflow-99cad29cc0b5)
discusses five such methods. Since a video is an ordered sequence of frames, we could
just extract the frames and put them in a 4D tensor (frames, height, width, channels).
But the number of frames may differ from video to video, which would prevent us from
stacking them into batches (unless we use padding). As an alternative, we can
**save video frames at a fixed interval until a maximum frame count is reached**.
In this example we will do the following:

1. Capture the frames of a video.
2. Extract frames from the video until a maximum frame count is reached.
3. If a video has fewer frames than the maximum frame count, pad it with zeros.
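
The padding step can be sketched in NumPy as follows. This is a minimal illustration, assuming each video has already been turned into a `(num_frames, num_features)` array; `pad_or_truncate` is a hypothetical helper name, not a function from this notebook.

```python
import numpy as np

def pad_or_truncate(features, max_len):
    """Return (padded, mask) for a (num_frames, num_features) array."""
    num_frames, num_features = features.shape
    length = min(num_frames, max_len)
    padded = np.zeros((max_len, num_features), dtype=features.dtype)
    mask = np.zeros((max_len,), dtype=bool)
    padded[:length] = features[:length]
    mask[:length] = True  # True = real frame, False = zero padding
    return padded, mask

short_video = np.ones((3, 4), dtype="float32")  # 3 frames, 4 features each
padded, mask = pad_or_truncate(short_video, max_len=5)
print(padded.shape, mask)
```

The boolean mask records which timesteps hold real frames, so a sequence model can ignore the zero padding.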

Note that this workflow is identical to [problems involving text sequences](https://developers.google.com/machine-learning/guides/text-classification/). Videos of the UCF101 dataset are [known](https://www.crcv.ucf.edu/papers/UCF101_CRCV-TR-12-01.pdf)
to not contain extreme variations in objects and actions across frames. Because of this,
it may be okay to only consider a few frames for the learning task. But this approach may
not generalize well to other video classification problems. We will be using
[OpenCV's `VideoCapture()` method](https://docs.opencv.org/master/dd/d43/tutorial_py_video_display.html)
to read frames from videos.
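
Considering only a few frames at a fixed interval could look like the sketch below. It only computes which frame indices to keep; `sample_frame_indices` is a hypothetical helper, and the `load_video` function in this notebook reads consecutive frames rather than sampling them this way.

```python
import numpy as np

def sample_frame_indices(total_frames, max_frames):
    """Pick at most `max_frames` frame indices at a fixed interval."""
    if total_frames <= max_frames:
        return np.arange(total_frames)
    step = total_frames // max_frames  # fixed interval between kept frames
    return np.arange(0, total_frames, step)[:max_frames]

indices = sample_frame_indices(total_frames=100, max_frames=10)
print(indices)
```

Videos shorter than `max_frames` keep all of their frames and would still need padding afterwards.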

In [None]:
# The following two methods are taken from this tutorial:
# https://www.tensorflow.org/hub/tutorials/action_recognition_with_tf_hub


def crop_center_square(frame):
    y, x = frame.shape[0:2]
    min_dim = min(y, x)
    start_x = (x // 2) - (min_dim // 2)
    start_y = (y // 2) - (min_dim // 2)
    return frame[start_y : start_y + min_dim, start_x : start_x + min_dim]


def load_video(path, max_frames=0, resize=(IMG_SIZE, IMG_SIZE)):
    cap = cv2.VideoCapture(path)
    frames = []
    try:
        while True:
            ret, frame = cap.read()
            if not ret:
                break
            frame = crop_center_square(frame)
            frame = cv2.resize(frame, resize)
            frame = frame[:, :, [2, 1, 0]]
            frames.append(frame)

            if len(frames) == max_frames:
                break
    finally:
        cap.release()
    return np.array(frames)
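
To see what the cropping and channel reordering in `load_video` do, the sketch below applies them to a small synthetic frame. The helper is repeated so the snippet runs on its own; the 4x6 frame and its pixel values are made up for illustration.

```python
import numpy as np

def crop_center_square(frame):
    # Same logic as the helper above, repeated so this snippet is self-contained.
    y, x = frame.shape[0:2]
    min_dim = min(y, x)
    start_x = (x // 2) - (min_dim // 2)
    start_y = (y // 2) - (min_dim // 2)
    return frame[start_y : start_y + min_dim, start_x : start_x + min_dim]

# A synthetic 4x6 frame whose first channel stores the column index.
frame = np.zeros((4, 6, 3), dtype=np.uint8)
frame[..., 0] = np.arange(6)  # channel 0 is blue in OpenCV's BGR layout
cropped = crop_center_square(frame)
rgb = cropped[:, :, [2, 1, 0]]  # reverse the channel order: BGR -> RGB

print(cropped.shape)     # the crop is square
print(cropped[0, :, 0])  # column indices kept by the crop
```

After the channel reversal, the values that were in the first (blue) channel sit in the last channel, which is the order the ImageNet-trained Keras models expect.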

We can use a pre-trained network to extract meaningful features from the extracted
frames. The [`Keras Applications`](https://keras.io/api/applications/) module provides
a number of state-of-the-art models pre-trained on the [ImageNet-1k dataset](http://image-net.org/).
We will be using the [InceptionV3 model](https://arxiv.org/abs/1512.00567) for this purpose.

In [None]:

def build_feature_extractor():
    feature_extractor = keras.applications.InceptionV3(
        weights="imagenet",
        include_top=False,
        pooling="avg",
        input_shape=(IMG_SIZE, IMG_SIZE, 3),
    )
    preprocess_input = keras.applications.inception_v3.preprocess_input

    inputs = keras.Input((IMG_SIZE, IMG_SIZE, 3))
    preprocessed = preprocess_input(inputs)

    outputs = feature_extractor(preprocessed)
    return keras.Model(inputs, outputs, name="feature_extractor")


feature_extractor = build_feature_extractor()

Downloading data from https://storage.googleapis.com/tensorflow/keras-applications/inception_v3/inception_v3_weights_tf_dim_ordering_tf_kernels_notop.h5
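
With `include_top=False` and `pooling="avg"`, the extractor averages its final feature map over the spatial dimensions, which is why each frame ends up as a single vector of `NUM_FEATURES` values. The NumPy sketch below mimics only that pooling step; the random feature map and its 5x5 spatial size are assumptions made for illustration.

```python
import numpy as np

NUM_FEATURES = 2048

# Stand-in for the final convolutional feature map of one frame.
# The 5x5 spatial size is an assumption made for illustration only.
rng = np.random.default_rng(0)
feature_map = rng.random((5, 5, NUM_FEATURES), dtype=np.float32)

# Global average pooling: average over the two spatial axes.
frame_vector = feature_map.mean(axis=(0, 1))

print(frame_vector.shape)
```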


The labels of the videos are strings. Neural networks do not understand string values,
so they must be converted to some numerical form before they are fed to the model. Here
we will use the [`StringLookup`](https://keras.io/api/layers/preprocessing_layers/categorical/string_lookup)
layer to encode the class labels as integers.

In [None]:
label_processor = keras.layers.StringLookup(
    num_oov_indices=0, vocabulary=np.unique(train_df["class"])
)
print(label_processor.get_vocabulary())

['0', '1', '10', '2', '3', '4', '5', '6', '7', '8', '9']
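
Note that the vocabulary is sorted as strings, so `'10'` comes before `'2'` and the encoded position of a label is not always equal to its original class index. The sketch below reproduces that ordering with NumPy alone and maps each position back to a class name; it assumes the `classes` list defined earlier and does not call `StringLookup` itself.

```python
import numpy as np

classes = ['Hammer Strike', 'Groin Kick', 'Heel Palm Strike', 'Elbow Strike',
           'Escape Bear Hug Attack', 'Escape Hands Trapped', 'Escape Side Headlock',
           'Eye Strike', 'Knee strike', 'Ready Stance', 'Two handed choked']

# The "class" column holds the class indices as strings, as in the DataFrame above.
string_labels = np.array([str(i) for i in range(len(classes))])
vocabulary = np.unique(string_labels)  # sorted as strings, not as numbers

# Map each encoded position back to a readable class name.
for position, label in enumerate(vocabulary):
    print(position, label, classes[int(label)])
```

Keeping this mapping at hand avoids misreading a predicted position as the original class index.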


Finally, we can put all the pieces together to create our data processing utility.

In [None]:

def prepare_all_videos(df, root_dir):
    num_samples = len(df)
    video_paths = df["path"].values.tolist()
    labels = df["class"].values
    labels = label_processor(labels[..., None]).numpy()

    # `frame_masks` and `frame_features` are what we will feed to our sequence model.
    # `frame_masks` will contain a bunch of booleans denoting if a timestep is
    # masked with padding or not.
    frame_masks = np.zeros(shape=(num_samples, MAX_SEQ_LENGTH), dtype="bool")
    frame_features = np.zeros(
        shape=(num_samples, MAX_SEQ_LENGTH, NUM_FEATURES), dtype="float32"
    )

    # For each video.
    for idx, path in enumerate(video_paths):
        # Gather all its frames and add a batch dimension.
        frames = load_video(os.path.join(root_dir, path))
        frames = frames[None, ...]

        # Initialize placeholders to store the masks and features of the current video.
        temp_frame_mask = np.zeros(shape=(1, MAX_SEQ_LENGTH,), dtype="bool")
        temp_frame_features = np.zeros(
            shape=(1, MAX_SEQ_LENGTH, NUM_FEATURES), dtype="float32"
        )

        # Extract features from the frames of the current video.
        for i, batch in enumerate(frames):
            video_length = batch.shape[0]
            length = min(MAX_SEQ_LENGTH, video_length)
            for j in range(length):
                temp_frame_features[i, j, :] = feature_extractor.predict(
                    batch[None, j, :]
                )
            temp_frame_mask[i, :length] = 1  # 1 = not masked, 0 = masked

        frame_features[idx,] = temp_frame_features.squeeze()
        frame_masks[idx,] = temp_frame_mask.squeeze()
        print(idx)
    return (frame_features, frame_masks), labels


train_data, train_labels = prepare_all_videos(train_df, "train")
print("train done.... ")
test_data, test_labels = prepare_all_videos(test_df, "test")

print(f"Frame features in train set: {train_data[0].shape}")
print(f"Frame masks in train set: {train_data[1].shape}")

0
1
2
...
260
train done.... 
0
1
2
...
19


The above code block will take ~20 minutes to execute, depending on the machine it's
being executed on.

## The sequence model

Now, we can feed this data to a sequence model consisting of recurrent layers like `GRU`.

In [None]:
# Utility for our sequence model.
def get_sequence_model():
    class_vocab = label_processor.get_vocabulary()

    frame_features_input = keras.Input((MAX_SEQ_LENGTH, NUM_FEATURES))
    mask_input = keras.Input((MAX_SEQ_LENGTH,), dtype="bool")

    # Refer to the following tutorial to understand the significance of using `mask`:
    # https://keras.io/api/layers/recurrent_layers/gru/
    x = keras.layers.GRU(16, return_sequences=True)(
        frame_features_input, mask=mask_input
    )
    x = keras.layers.GRU(256)(x)
    x = keras.layers.Dropout(0.4)(x)
    x = keras.layers.Dense(2048, activation="relu")(x)
    output = keras.layers.Dense(len(class_vocab), activation="softmax")(x)

    rnn_model = keras.Model([frame_features_input, mask_input], output)

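    # The labels are integer class indices, so use sparse categorical cross-entropy.
    rnn_model.compile(
        loss="sparse_categorical_crossentropy",
        optimizer=keras.optimizers.Adam(learning_rate=1e-4),
        metrics=["accuracy"],
    )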
    return rnn_model


# Utility for running experiments.
def run_experiment():
    filepath = "/tmp/video_classifier"
    checkpoint = keras.callbacks.ModelCheckpoint(
        filepath, save_weights_only=True, save_best_only=True, verbose=1
    )

    seq_model = get_sequence_model()
    history = seq_model.fit(
        [train_data[0], train_data[1]],
        train_labels,
        validation_split=0.3,
        epochs=EPOCHS,
        callbacks=[checkpoint],
    )

    seq_model.load_weights(filepath)
    _, accuracy = seq_model.evaluate([test_data[0], test_data[1]], test_labels)
    print(f"Test accuracy: {round(accuracy * 100, 2)}%")

    return history, seq_model


_, sequence_model = run_experiment()

Epoch 1/20
Epoch 00001: val_loss improved from inf to 0.52345, saving model to /tmp/video_classifier
Epoch 2/20
Epoch 00002: val_loss did not improve from 0.52345
Epoch 3/20
Epoch 00003: val_loss did not improve from 0.52345
Epoch 4/20
Epoch 00004: val_loss did not improve from 0.52345
Epoch 5/20
Epoch 00005: val_loss did not improve from 0.52345
Epoch 6/20
Epoch 00006: val_loss did not improve from 0.52345
Epoch 7/20
Epoch 00007: val_loss did not improve from 0.52345
Epoch 8/20
Epoch 00008: val_loss did not improve from 0.52345
Epoch 9/20
Epoch 00009: val_loss did not improve from 0.52345
Epoch 10/20
Epoch 00010: val_loss did not improve from 0.52345
Epoch 11/20
Epoch 00011: val_loss did not improve from 0.52345
Epoch 12/20
Epoch 00012: val_loss did not improve from 0.52345
Epoch 13/20
Epoch 00013: val_loss did not improve from 0.52345
Epoch 14/20
Epoch 00014: val_loss did not improve from 0.52345
Epoch 15/20
Epoch 00015: val_loss did not improve from 0.52345
Epoch 16/20
Epoch 00016: 
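
The `mask` passed to the first recurrent layer tells it which timesteps are padding so that they are skipped and the state is carried over unchanged. The sketch below illustrates that idea with a toy tanh recurrent cell in NumPy; it is not a GRU, and `masked_rnn` and the random weights are hypothetical stand-ins.

```python
import numpy as np

def masked_rnn(features, mask, w_in, w_state):
    """Run a toy tanh recurrent cell, skipping timesteps where mask is False."""
    state = np.zeros(w_state.shape[0], dtype=features.dtype)
    for x_t, keep in zip(features, mask):
        if keep:
            state = np.tanh(x_t @ w_in + state @ w_state)
        # When keep is False, the state is carried over unchanged.
    return state

rng = np.random.default_rng(0)
w_in = rng.normal(size=(4, 3)).astype("float32")
w_state = rng.normal(size=(3, 3)).astype("float32")

real_frames = rng.normal(size=(2, 4)).astype("float32")
padded = np.concatenate([real_frames, np.zeros((3, 4), dtype="float32")])
mask = np.array([True, True, False, False, False])

with_mask = masked_rnn(padded, mask, w_in, w_state)
unpadded = masked_rnn(real_frames, np.array([True, True]), w_in, w_state)
print(np.allclose(with_mask, unpadded))
```

With the mask applied, the padded sequence produces the same final state as the unpadded one, so the amount of padding does not influence the result.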

**Note**: To keep the runtime of this example relatively short, we just used a few
training examples. This number of training examples is low with respect to the sequence
model being used, which has about 858,000 trainable parameters. You are encouraged to sample more
data from the UCF101 dataset using [the notebook](https://colab.research.google.com/github/sayakpaul/Action-Recognition-in-TensorFlow/blob/main/Data_Preparation_UCF101.ipynb) mentioned above and train the same model.

## Inference

In [None]:
df["path"][63]

'/content/gdrive/My Drive/Colab Notebooks/Pose/pose_data/Groin Kick/623.avi'

In [None]:

def prepare_single_video(frames):
    frames = frames[None, ...]
    frame_mask = np.zeros(shape=(1, MAX_SEQ_LENGTH,), dtype="bool")
    frame_features = np.zeros(shape=(1, MAX_SEQ_LENGTH, NUM_FEATURES), dtype="float32")

    for i, batch in enumerate(frames):
        video_length = batch.shape[0]
        length = min(MAX_SEQ_LENGTH, video_length)
        for j in range(length):
            frame_features[i, j, :] = feature_extractor.predict(batch[None, j, :])
        frame_mask[i, :length] = 1  # 1 = not masked, 0 = masked

    return frame_features, frame_mask


def sequence_prediction(path):
    class_vocab = label_processor.get_vocabulary()

    frames = load_video(os.path.join("test", path))
    frame_features, frame_mask = prepare_single_video(frames)
    probabilities = sequence_model.predict([frame_features, frame_mask])[0]

    for i in np.argsort(probabilities)[::-1]:
        print(f"  {class_vocab[i]}: {probabilities[i] * 100:5.2f}%")
    return frames


# This utility is for visualization.
# Referenced from:
# https://www.tensorflow.org/hub/tutorials/action_recognition_with_tf_hub
def to_gif(images):
    converted_images = images.astype(np.uint8)
    imageio.mimsave("animation.gif", converted_images, fps=5)
    return embed.embed_file("animation.gif")


test_video = np.random.choice(test_df["path"].values.tolist())
print(f"Test video path: {test_video}")
test_frames = sequence_prediction(test_video)
to_gif(test_frames)

Test video path: /content/gdrive/My Drive/Colab Notebooks/Pose/pose_data/Escape Bear Hug Attack/1062.avi
  8: 10.00%
  0:  9.80%
  4:  9.80%
  6:  9.31%
  1:  9.30%
  7:  9.19%
  10:  8.84%
  9:  8.65%
  2:  8.51%
  5:  8.32%
  3:  8.27%


RuntimeError: ignored
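
The ranking printed above shows vocabulary entries, which are the numeric class indices stored as strings. A minimal sketch of ranking predictions and translating them back to readable class names follows; the probabilities are made up for illustration, and `top_k_classes` is a hypothetical helper that is not part of this notebook.

```python
import numpy as np

classes = ['Hammer Strike', 'Groin Kick', 'Heel Palm Strike', 'Elbow Strike',
           'Escape Bear Hug Attack', 'Escape Hands Trapped', 'Escape Side Headlock',
           'Eye Strike', 'Knee strike', 'Ready Stance', 'Two handed choked']
class_vocab = ['0', '1', '10', '2', '3', '4', '5', '6', '7', '8', '9']

# Made-up probabilities for illustration; one entry per vocabulary position.
probabilities = np.array(
    [0.05, 0.30, 0.02, 0.10, 0.08, 0.15, 0.05, 0.05, 0.09, 0.06, 0.05]
)

def top_k_classes(probabilities, class_vocab, classes, k=3):
    """Return the k most probable (class name, probability) pairs."""
    order = np.argsort(probabilities)[::-1][:k]
    return [(classes[int(class_vocab[i])], float(probabilities[i])) for i in order]

top3 = top_k_classes(probabilities, class_vocab, classes)
for name, prob in top3:
    print(f"  {name}: {prob * 100:5.2f}%")
```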

## Next steps

* In this example, we made use of transfer learning for extracting meaningful features
from video frames. You could also fine-tune the pre-trained network to see how that
affects the end results.
* For speed-accuracy trade-offs, you can try out other models present inside
`tf.keras.applications`.
* Try different values of `MAX_SEQ_LENGTH` to observe how that affects the
performance.
* Train on a higher number of classes and see if you are able to get good performance.
* Following [this tutorial](https://www.tensorflow.org/hub/tutorials/action_recognition_with_tf_hub), try a
[pre-trained action recognition model](https://arxiv.org/abs/1705.07750) from DeepMind.
* Rolling averaging can be a useful technique for video classification, and it can be
combined with a standard image classification model to infer on videos.
[This tutorial](https://www.pyimagesearch.com/2019/07/15/video-classification-with-keras-and-deep-learning/)
will help you understand how to use rolling averaging with an image classifier.
* When the frames of a video vary a lot, not all frames may be equally important for
deciding its category. In those situations, putting a
[self-attention layer](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Attention) in the
sequence model will likely yield better results.
* Following [this book chapter](https://livebook.manning.com/book/deep-learning-with-python-second-edition/chapter-11),
you can implement Transformers-based models for processing videos.
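
As an illustration of the rolling-averaging idea from the list above, the sketch below smooths per-frame class probabilities over a sliding window before taking the arg-max. The probabilities are made up, and `rolling_average_predictions` is a hypothetical helper; in practice the per-frame probabilities would come from an image classifier.

```python
from collections import deque

import numpy as np

def rolling_average_predictions(frame_probs, window=3):
    """Average the last `window` per-frame probability vectors at each step."""
    recent = deque(maxlen=window)
    smoothed_labels = []
    for probs in frame_probs:
        recent.append(probs)
        mean_probs = np.mean(list(recent), axis=0)
        smoothed_labels.append(int(np.argmax(mean_probs)))
    return smoothed_labels

# Made-up per-frame probabilities for 2 classes; the third frame is a one-off flicker.
frame_probs = np.array([
    [0.9, 0.1],
    [0.8, 0.2],
    [0.4, 0.6],
    [0.9, 0.1],
])
raw_labels = frame_probs.argmax(axis=1).tolist()
smoothed = rolling_average_predictions(frame_probs, window=3)
print(raw_labels, smoothed)
```

The window size trades responsiveness for stability: a larger window suppresses more flicker but reacts more slowly to a real change of action.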

In [None]:
from tensorflow.keras.layers import *
from tensorflow.keras.models import Model, Sequential
from tensorflow.keras.optimizers import Adam
import tensorflow as tf
import numpy as np

from tensorflow.keras.applications import VGG16

num_class = 11

def create_base():
  # VGG16 convolutional base followed by global average pooling,
  # which yields one 512-dimensional vector per frame.
  conv_base = VGG16(weights='imagenet',
                  include_top=False,
                  input_shape=(IMG_SIZE, IMG_SIZE, 3))
  x = GlobalAveragePooling2D()(conv_base.output)
  base_model = Model(conv_base.input, x)
  return base_model

conv_base = create_base()

ip = Input(shape=(10,IMG_SIZE, IMG_SIZE,3))
t_conv = TimeDistributed(conv_base)(ip) # vgg16 feature extractor

t_lstm = LSTM(10, return_sequences=False)(t_conv)

f_softmax = Dense(num_class, activation='softmax')(t_lstm)

model = Model(ip, f_softmax)

model.summary()

Model: "model_10"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_27 (InputLayer)       [(None, 10, 224, 224, 3)  0         
                             ]                                   
                                                                 
 time_distributed_2 (TimeDis  (None, 10, 512)          14714688  
 tributed)                                                       
                                                                 
 lstm_1 (LSTM)               (None, 10)                20920     
                                                                 
 dense_13 (Dense)            (None, 11)                121       
                                                                 
Total params: 14,735,729
Trainable params: 14,735,729
Non-trainable params: 0
_________________________________________________________________
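
The parameter counts in the summary above can be checked by hand. The sketch below uses the standard formulas for LSTM and dense layers and takes the VGG16 figure directly from the summary; `lstm_params` and `dense_params` are illustrative helper names.

```python
def lstm_params(input_dim, units):
    # Four gates, each with an input kernel, a recurrent kernel, and a bias.
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # One weight per input-output pair plus one bias per output unit.
    return input_dim * units + units

vgg16_conv_params = 14_714_688  # taken from the summary above
lstm = lstm_params(input_dim=512, units=10)
dense = dense_params(input_dim=10, units=11)
total = vgg16_conv_params + lstm + dense
print(lstm, dense, total)
```

Almost all parameters sit in the VGG16 base, so freezing it would leave only the LSTM and dense layers to train.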


In [None]:
train_frames = '/content/gdrive/My Drive/Colab Notebooks/Pose/train_frames'

In [None]:
!mkdir '/content/gdrive/My Drive/Colab Notebooks/Pose/train_frames'

In [None]:
from glob import glob
from tqdm import tqdm

In [None]:
for i in train['path']:
  print(i)

/content/gdrive/My Drive/Colab Notebooks/Pose/pose_data/Groin Kick/686.avi
/content/gdrive/My Drive/Colab Notebooks/Pose/pose_data/Hammer Strike/51.avi
/content/gdrive/My Drive/Colab Notebooks/Pose/pose_data/Ready Stance/2641.avi
/content/gdrive/My Drive/Colab Notebooks/Pose/pose_data/Two handed choked/2964.avi
/content/gdrive/My Drive/Colab Notebooks/Pose/pose_data/Eye Strike/1972.avi
/content/gdrive/My Drive/Colab Notebooks/Pose/pose_data/Escape Bear Hug Attack/1164.avi
/content/gdrive/My Drive/Colab Notebooks/Pose/pose_data/Ready Stance/2743.avi
/content/gdrive/My Drive/Colab Notebooks/Pose/pose_data/Groin Kick/451.avi
/content/gdrive/My Drive/Colab Notebooks/Pose/pose_data/Escape Bear Hug Attack/1359.avi
/content/gdrive/My Drive/Colab Notebooks/Pose/pose_data/Groin Kick/612.avi
/content/gdrive/My Drive/Colab Notebooks/Pose/pose_data/Knee strike/2121.avi
/content/gdrive/My Drive/Colab Notebooks/Pose/pose_data/Hammer Strike/226.avi
/content/gdrive/My Drive/Colab Notebooks/Pose/pose_d

In [None]:
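# Store the frames of the training videos as JPEG files in `train_frames`.
for video_path in train['path']:
    count = 0
    cap = cv2.VideoCapture(video_path)  # open the video at the given path
    # File name without its extension, e.g. '686' for '.../Groin Kick/686.avi'.
    video_name = os.path.splitext(os.path.basename(video_path))[0]
    while cap.isOpened():
        ret, frame = cap.read()
        if not ret:
            break
        filename = os.path.join(train_frames, f"{video_name}_frame{count}.jpg")
        count += 1
        written = cv2.imwrite(filename, frame)  # True if the file was written
        print(written)
    cap.release()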