<a href="https://colab.research.google.com/github/Userfound404/Video-Transformers-keras/blob/main/video_transformers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!wget -q https://git.io/JGc31 -O ucf101_top5.tar.gz
!tar xf ucf101_top5.tar.gz

In [3]:
!pip install -q git+https://github.com/tensorflow/docs

  Preparing metadata (setup.py) ... done
  Building wheel for tensorflow-docs (setup.py) ... done


In [4]:
from tensorflow_docs.vis import embed
from tensorflow.keras import layers
from tensorflow import keras

import matplotlib.pyplot as plt
import tensorflow as tf
import pandas as pd
import numpy as np
import imageio
import cv2
import os

In [5]:
MAX_SEQ_LENGTH = 20
NUM_FEATURES = 1024
IMG_SIZE = 128

EPOCHS = 5

This notebook loads and prepares a dataset for training a video classification model with Keras. It reads the training and test splits with the pd.read_csv function from pandas, then defines two helpers: load_video, which reads the frames of a video from a file path using OpenCV, and crop_center, which crops the center of each frame to IMG_SIZE x IMG_SIZE pixels. Next it builds a feature extractor from the DenseNet121 architecture pre-trained on ImageNet, which is used to extract features from each frame, and encodes the string labels of the training dataset as integers with the Keras StringLookup layer.

The feature extractor and the load_video function are then used to extract features from the frames of every video and store them in an array. Videos with fewer than MAX_SEQ_LENGTH frames are padded so that all videos have the same length. The notebook also prints the unique labels of the training dataset.

The cells below only prepare the data; the resulting features and labels are intended as input for training a video classification model.

In [6]:
train_df = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")

print(f"Total videos for training: {len(train_df)}")
print(f"Total videos for testing: {len(test_df)}")

center_crop_layer = layers.CenterCrop(IMG_SIZE, IMG_SIZE)


def crop_center(frame):
    cropped = center_crop_layer(frame[None, ...])
    cropped = cropped.numpy().squeeze()
    return cropped

Total videos for training: 594
Total videos for testing: 224


In [7]:
def load_video(path, max_frames=0):
    cap = cv2.VideoCapture(path)
    frames = []
    try:
        while True:
            ret, frame = cap.read()
            if not ret:
                break
            frame = crop_center(frame)
            frame = frame[:, :, [2, 1, 0]]
            frames.append(frame)

            if len(frames) == max_frames:
                break
    finally:
        cap.release()
    return np.array(frames)

cap = cv2.VideoCapture(path): This line creates an OpenCV VideoCapture object that reads the video at the specified path.

frames = []: This line initializes an empty list to store the frames of the video.

while True:: This line starts a loop that runs until the video has been read completely or max_frames frames have been collected.

ret, frame = cap.read(): This line reads a frame from the video, which is stored in the frame variable. The ret variable stores a Boolean value, indicating whether a frame was successfully read or not.

if not ret: break: This line checks the value of the ret variable. If ret is False, which means that no frame was read, the loop will break, and the script will proceed to the next step.

frame = crop_center(frame): This line applies the helper function crop_center() to the current frame, which crops the center of the frame to a specified image size.

frame = frame[:, :, [2, 1, 0]]: This line reorders the color channels of the frame from BGR to RGB.

frames.append(frame): This line adds the current frame to the list of frames.

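if len(frames) == max_frames: break: This line checks how many frames have been read so far and stops the loop once that number equals max_frames. With the default max_frames=0 the condition is never true, because the list already holds at least one frame when it is checked, so the whole video is read.

cap.release(): This line releases the video capture object. Because it sits in a finally block, it runs even if reading a frame raises an exception.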

return np.array(frames): This line converts the list of frames to a numpy array and returns it.
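
The two per-frame steps above, center cropping and BGR-to-RGB reordering, can be tried without a video file. The following is a minimal NumPy sketch, not the notebook's implementation: center_crop and bgr_to_rgb are hypothetical helper names, the frame is synthetic, and the crop assumes the frame is at least as large as the target size, so it is not guaranteed to match the Keras CenterCrop layer in every case.

```python
import numpy as np


def center_crop(frame, size):
    """Crop a (height, width, channels) frame to size x size around its center.

    Assumes the frame is at least `size` pixels in both dimensions.
    """
    height, width = frame.shape[:2]
    top = (height - size) // 2
    left = (width - size) // 2
    return frame[top:top + size, left:left + size]


def bgr_to_rgb(frame):
    """Reverse the channel order, turning BGR into RGB (and vice versa)."""
    return frame[:, :, [2, 1, 0]]


# A synthetic 6x8 "BGR" frame: blue channel is 10, green is 20, red is 30.
frame = np.zeros((6, 8, 3), dtype=np.uint8)
frame[..., 0], frame[..., 1], frame[..., 2] = 10, 20, 30

cropped = center_crop(frame, 4)
rgb = bgr_to_rgb(cropped)

print(cropped.shape)        # (4, 4, 3)
print(rgb[0, 0].tolist())   # [30, 20, 10]
```

Applying bgr_to_rgb twice returns the original channel order, which is a quick way to check that the indexing does what is intended.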

In [8]:
def build_feature_extractor():
    feature_extractor = keras.applications.DenseNet121(
        weights="imagenet",
        include_top=False,
        pooling="avg",
        input_shape=(IMG_SIZE, IMG_SIZE, 3),
    )
    preprocess_input = keras.applications.densenet.preprocess_input

    inputs = keras.Input((IMG_SIZE, IMG_SIZE, 3))
    preprocessed = preprocess_input(inputs)

    outputs = feature_extractor(preprocessed)
    return keras.Model(inputs, outputs, name="feature_extractor")


feature_extractor = build_feature_extractor()


# Label preprocessing with StringLookup.
label_processor = keras.layers.StringLookup(
    num_oov_indices=0, vocabulary=np.unique(train_df["tag"]), mask_token=None
)
print(label_processor.get_vocabulary())


Downloading data from https://storage.googleapis.com/tensorflow/keras-applications/densenet/densenet121_weights_tf_dim_ordering_tf_kernels_notop.h5
['CricketShot', 'PlayingCello', 'Punch', 'ShavingBeard', 'TennisSwing']


feature_extractor = keras.applications.DenseNet121(weights="imagenet", include_top=False, pooling="avg", input_shape=(IMG_SIZE, IMG_SIZE, 3)) : This line creates a DenseNet121 model with weights pre-trained on the ImageNet dataset. include_top=False drops the final classification layer, and pooling="avg" applies global average pooling to the last feature map, so the model returns one 1024-dimensional vector per image (hence NUM_FEATURES = 1024). input_shape sets the expected image shape to (IMG_SIZE, IMG_SIZE, 3).

preprocess_input = keras.applications.densenet.preprocess_input : This line gets the preprocess_input function from the densenet module, which prepares input images in the way the pre-trained model expects.

inputs = keras.Input((IMG_SIZE, IMG_SIZE, 3)) : This line creates the input layer of the feature extractor model, with input shape (IMG_SIZE, IMG_SIZE, 3).

preprocessed = preprocess_input(inputs) : This line applies the preprocess_input function to the inputs of the feature extractor model.

outputs = feature_extractor(preprocessed) : This line applies the DenseNet121 model to the preprocessed inputs.

return keras.Model(inputs, outputs, name="feature_extractor"): This line creates a new model, including the input and output layers, and returns it. The model is named "feature_extractor".

feature_extractor = build_feature_extractor() : This line creates an instance of the feature extractor model by calling the build_feature_extractor() function.

label_processor = keras.layers.StringLookup(num_oov_indices=0, vocabulary=np.unique(train_df["tag"]), mask_token=None) : This line creates a Keras StringLookup layer that maps each string label to an integer index. The vocabulary is the set of unique labels in train_df. Setting num_oov_indices=0 and mask_token=None reserves no indices for out-of-vocabulary or mask values, so the indices start at 0 and follow the vocabulary order.

print(label_processor.get_vocabulary()) : This line prints the vocabulary of label_processor, that is, the unique labels in the training dataset.
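
To make the label encoding concrete, here is a plain-Python sketch of the same idea: each label is mapped to its position in the sorted list of unique labels. It is an analogue for illustration only, not the Keras layer; build_label_index and the tags list are hypothetical, and an unseen label simply raises a KeyError in this version.

```python
def build_label_index(labels):
    """Map each distinct label to its position in the sorted vocabulary."""
    vocabulary = sorted(set(labels))
    return vocabulary, {label: index for index, label in enumerate(vocabulary)}


# Hypothetical tags, standing in for the train_df["tag"] column.
tags = ["Punch", "CricketShot", "TennisSwing", "Punch", "PlayingCello"]

vocabulary, label_to_index = build_label_index(tags)
encoded = [label_to_index[tag] for tag in tags]

print(vocabulary)  # ['CricketShot', 'PlayingCello', 'Punch', 'TennisSwing']
print(encoded)     # [2, 0, 3, 2, 1]
```

Sorting the vocabulary mirrors np.unique, which also returns the unique values in sorted order.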


In [9]:
def prepare_all_videos(df, root_dir):
    num_samples = len(df)
    video_paths = df["video_name"].values.tolist()
    labels = df["tag"].values
    labels = label_processor(labels[..., None]).numpy()

    # `frame_features` are what we will feed to our sequence model.
    frame_features = np.zeros(
        shape=(num_samples, MAX_SEQ_LENGTH, NUM_FEATURES), dtype="float32"
    )

    # For each video.
    for idx, path in enumerate(video_paths):
        # Gather all its frames and add a batch dimension.
        frames = load_video(os.path.join(root_dir, path))

        # Pad shorter videos.
        if len(frames) < MAX_SEQ_LENGTH:
            diff = MAX_SEQ_LENGTH - len(frames)
            padding = np.zeros((diff, IMG_SIZE, IMG_SIZE, 3))
            # np.concatenate takes the arrays as a single sequence (tuple).
            frames = np.concatenate((frames, padding))

        frames = frames[None, ...]

        # Initialize placeholder to store the features of the current video.
        temp_frame_features = np.zeros(
            shape=(1, MAX_SEQ_LENGTH, NUM_FEATURES), dtype="float32"
        )

        # Extract features from the frames of the current video.
        for i, batch in enumerate(frames):
            video_length = batch.shape[0]
            length = min(MAX_SEQ_LENGTH, video_length)
            for j in range(length):
                if np.mean(batch[j, :]) > 0.0:
                    temp_frame_features[i, j, :] = feature_extractor.predict(
                        batch[None, j, :]
                    )

                else:
                    temp_frame_features[i, j, :] = 0.0

        frame_features[idx,] = temp_frame_features.squeeze()

    return frame_features, labels


num_samples = len(df): This line gets the number of rows in the dataframe df, which is the number of videos.

video_paths = df["video_name"].values.tolist(): This line extracts the "video_name" column from the dataframe df and converts the resulting numpy array to a list of video file names.

labels = df["tag"].values: This line extracts the "tag" column from the dataframe df, giving the label of each video as a numpy array.

labels = label_processor(labels[..., None]).numpy(): This line applies the label_processor to the labels array, after labels[..., None] has added a trailing axis so that the array has shape (num_samples, 1). The label_processor maps each string label to a unique integer index. The resulting processed labels are then converted back to a numpy array.

frame_features = np.zeros(shape=(num_samples, MAX_SEQ_LENGTH, NUM_FEATURES), dtype="float32"): This line creates an array of zeroes that will store the features of each frame of each video. It has shape (num_samples, MAX_SEQ_LENGTH, NUM_FEATURES), where num_samples is the number of videos, MAX_SEQ_LENGTH is the maximum number of frames per video and NUM_FEATURES is the number of features extracted from each frame.

for idx, path in enumerate(video_paths):: This line starts a for loop that iterates over each video file name in the list video_paths. The idx variable keeps track of the current index of the loop (i.e. the current video), while the path variable contains the current video file name.

frames = load_video(os.path.join(root_dir, path)): This line calls the load_video() function with the full path of the current video file, built by joining root_dir and the current video file name with os.path.join.

if len(frames) < MAX_SEQ_LENGTH:: This line checks whether the number of frames of the current video is less than MAX_SEQ_LENGTH. If so, the video has fewer frames than the fixed sequence length and needs padding.

diff = MAX_SEQ_LENGTH - len(frames): This line calculates the number of frames that need to be added to reach MAX_SEQ_LENGTH.

padding = np.zeros((diff, IMG_SIZE, IMG_SIZE, 3)): This line creates a numpy array of zeros with shape (diff, IMG_SIZE, IMG_SIZE, 3), which stands in for the missing frames.

frames = np.concatenate((frames, padding)): This line appends the zero frames to the frames of the current video. np.concatenate takes the arrays as a single sequence, so they must be passed as a tuple; written as np.concatenate(frames, padding), the second argument would be interpreted as the axis and the call would fail.

frames = frames[None, ...]: This line adds a batch dimension to the video frames.

temp_frame_features = np.zeros(shape=(1, MAX_SEQ_LENGTH, NUM_FEATURES), dtype="float32"): This line creates an array of zeroes that will store the features of each frame of the current video.

for i, batch in enumerate(frames):: This line starts a nested for loop over the batch dimension of frames. Because that dimension has size 1, the loop runs once: i is 0 and batch contains all frames of the current video.

video_length = batch.shape[0]: This line gets the number of frames in the current batch.

length = min(MAX_SEQ_LENGTH, video_length): This line takes the smaller of the number of frames and MAX_SEQ_LENGTH, so at most MAX_SEQ_LENGTH frames are processed.

for j in range(length):: This line iterates over the frames to be processed, with j as the frame index.

if np.mean(batch[j, :]) > 0.0:: This line checks whether the mean of the frame is greater than 0, which distinguishes real frames from the all-zero padding frames.

temp_frame_features[i, j, :] = feature_extractor.predict(batch[None, j, :]): If the mean is greater than 0, this line applies the feature_extractor model to the current frame and stores the extracted features in the temp_frame_features array.

temp_frame_features[i, j, :] = 0.0: If the mean is not greater than 0, this line sets the features of the frame to 0.

frame_features[idx,] = temp_frame_features.squeeze(): This line stores the extracted features for the current video in the frame_features array.

return frame_features, labels : This function returns the extracted features and processed labels of the videos.

The prepare_all_videos function takes two arguments:

df : a DataFrame containing the metadata of the videos, such as their file names and labels.
root_dir: a string giving the path of the directory where the video files are stored.

The function loads the videos, processes their frames and extracts features from them. It first extracts the video file names and labels from the DataFrame. Then, for each entry in the video_paths list, it calls load_video with the path formed by joining root_dir and the video file name.

It then checks whether the current video has fewer than MAX_SEQ_LENGTH frames; if so, it pads the video with all-zero frames up to MAX_SEQ_LENGTH. For longer videos, only the first MAX_SEQ_LENGTH frames are used. A batch dimension is then added to the frames.

It initializes an array temp_frame_features to hold the features of the current video. Each frame is passed to the feature extractor model in turn: if the mean of the frame is greater than 0, the extracted features are written into the array; otherwise the frame is treated as padding and its features stay 0.

Finally, the function returns the extracted features frame_features and the processed labels labels of the videos.
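
The padding and per-frame extraction logic summarized above can be exercised on toy data without loading DenseNet121. The sketch below is an illustration under stated assumptions, not the notebook's code: SEQ_LEN, FRAME_SIZE and FEATURE_DIM are small stand-ins for MAX_SEQ_LENGTH, IMG_SIZE and NUM_FEATURES, and fake_extractor is a hypothetical replacement for feature_extractor.predict that just returns per-channel means.

```python
import numpy as np

# Toy stand-ins for MAX_SEQ_LENGTH, IMG_SIZE and NUM_FEATURES (assumed values).
SEQ_LEN = 4
FRAME_SIZE = 8
FEATURE_DIM = 3


def fake_extractor(frame):
    """Hypothetical stand-in for the real feature extractor: channel means."""
    return frame.mean(axis=(0, 1))


def pad_frames(frames):
    """Append all-zero frames until the video has SEQ_LEN frames."""
    if len(frames) < SEQ_LEN:
        diff = SEQ_LEN - len(frames)
        padding = np.zeros((diff, FRAME_SIZE, FRAME_SIZE, 3), dtype=frames.dtype)
        frames = np.concatenate((frames, padding))
    return frames


def extract_features(frames):
    """Return a (SEQ_LEN, FEATURE_DIM) array; padded frames keep zero features."""
    frames = pad_frames(frames)[:SEQ_LEN]  # longer videos are truncated
    features = np.zeros((SEQ_LEN, FEATURE_DIM), dtype="float32")
    for j, frame in enumerate(frames):
        if frame.mean() > 0.0:  # all-zero frames are treated as padding
            features[j] = fake_extractor(frame)
    return features


# A two-frame video of constant brightness 1.0.
video = np.ones((2, FRAME_SIZE, FRAME_SIZE, 3), dtype="float32")
features = extract_features(video)

print(features.shape)  # (4, 3)
print(features[:, 0])  # [1. 1. 0. 0.]
```

The last two rows stay at zero because they correspond to padding frames, which is the same signal a downstream sequence model could use to tell real frames from padding.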

In [None]:
train_data, train_labels = prepare_all_videos(train_df, "train")



In [None]:
test_data, test_labels = prepare_all_videos(test_df, "test")