📺 Video Classification

Different approaches to video classification on the YouTube Videos Dataset using CLIP embeddings for frames.

Structure

  • datasets ‒ implementations of torch datasets (one returns video embeddings built from multiple frames, another returns a random video frame).
  • models ‒ model implementations.
  • scripts ‒ scripts for preparing data for training and evaluation.
  • video_classification_utils ‒ various useful utilities, e.g. for extracting frames from a video and for training models.
  • experiments.ipynb ‒ notebook with experiment runs, plots, and metrics.

Requirements

Create a virtual environment with venv or conda and install the requirements:

pip install -r requirements.txt

Or build and run the Docker container:

./run_docker.sh

Data

The YouTube Videos Dataset was used for training and testing. The dataset contains information about each video and its subject. The task is to classify videos by subject using frame embeddings.

There are 4 video categories in total: food, art_music, travel, history.

Prepare embeddings

Videos were downloaded with the script prepare_embeddings.py. Since the dataset includes videos that have since been deleted, only videos with a still-valid link were downloaded.

A total of 1728 videos were downloaded.

  • travel: 602
  • food: 491
  • art_music: 317
  • history: 316

As its name suggests, the script also extracts embeddings from the downloaded videos.

The script extracts --frames-count (a script parameter) frames from each video. Frames are sampled evenly, i.e. the distance between consecutive frames is the same. Then, for each frame, an embedding vector is computed with the CLIP model (ViT-B/32). The embedding sequence is saved to a JSON file in the --save-embeddings-to directory.

For further experiments, 100 frames were extracted from each video.
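
Below is a minimal sketch of this sampling-and-embedding step. It is illustrative only: the function name, the OpenCV-based frame grabbing, and the use of the openai CLIP package are assumptions, not the exact interface of prepare_embeddings.py.

import json

import clip
import cv2
import numpy as np
import torch
from PIL import Image


def extract_clip_embeddings(video_path, save_path, frames_count=100):
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load("ViT-B/32", device=device)

    capture = cv2.VideoCapture(video_path)
    total_frames = int(capture.get(cv2.CAP_PROP_FRAME_COUNT))
    # Evenly spaced frame indices: the distance between consecutive frames is the same.
    frame_indices = np.linspace(0, total_frames - 1, frames_count, dtype=int)

    embeddings = []
    for frame_index in frame_indices:
        capture.set(cv2.CAP_PROP_POS_FRAMES, int(frame_index))
        ok, frame = capture.read()
        if not ok:
            continue
        image = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        with torch.no_grad():
            embedding = model.encode_image(preprocess(image).unsqueeze(0).to(device))
        embeddings.append(embedding.squeeze(0).cpu().tolist())
    capture.release()

    # The embedding sequence is stored as a json file.
    with open(save_path, "w") as save_file:
        json.dump(embeddings, save_file)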

Split into train and test

The split into training and test sets is done with the script split_into_train_test.py. The share of the test set in the whole dataset is controlled by the --test-size parameter (default: 0.1).
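
A rough sketch of such a split using scikit-learn is shown below; whether the repository script stratifies by category is not stated, so the stratify argument here is an assumption.

from sklearn.model_selection import train_test_split


def split_videos(video_ids, labels, test_size=0.1, seed=42):
    # Hold out test_size of the videos; stratifying keeps the category
    # proportions roughly the same in the train and test parts.
    train_ids, test_ids, train_labels, test_labels = train_test_split(
        video_ids, labels, test_size=test_size, random_state=seed, stratify=labels
    )
    return (train_ids, train_labels), (test_ids, test_labels)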

The split data statistics are as follows:

  • Train (1553):
    • travel: 540
    • food: 442
    • history: 286
    • art_music: 285
  • Test (173):
    • travel: 62
    • food: 49
    • art_music: 32
    • history: 30

Models

The model ideas are taken from the CLIP4Clip paper. For each model, standard training is compared with training using mixup.

Linear classifier

The video vector is obtained by averaging the frame vectors. This vector is mapped to logits by a single linear layer.
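
A minimal sketch of such a classifier (names and hyperparameters are illustrative; CLIP ViT-B/32 image embeddings have dimension 512):

import torch
from torch import nn


class LinearVideoClassifier(nn.Module):
    def __init__(self, embedding_dim=512, num_classes=4):
        super().__init__()
        self.linear = nn.Linear(embedding_dim, num_classes)

    def forward(self, embeddings):
        # embeddings: (batch, frames, embedding_dim) -> average over the frame axis.
        video_vector = embeddings.mean(dim=1)
        # A single linear layer maps the video vector to class logits.
        return self.linear(video_vector)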

Mixup

In this case, mixup mixes video embeddings rather than images:

$$\lambda \cdot \text{embedding}_1 + (1 - \lambda) \cdot \text{embedding}_2 = \text{embedding}$$

The ground-truth one-hot vectors are mixed with the same coefficients:

$$\lambda \cdot \text{target}_1 + (1 - \lambda) \cdot \text{target}_2 = \text{target}$$

# Two data loaders over the training set are zipped; each pair of batches is mixed.
for batch1, batch2 in zip(train_data_loader1, train_data_loader2):
    optimizer.zero_grad()

    # Mixing coefficient sampled from Beta(alpha, alpha).
    lam = np.random.beta(alpha, alpha)
    batch = {
        "embeddings": lam * batch1["embeddings"]
        + (1 - lam) * batch2["embeddings"],
        # Soft targets: one-hot labels mixed with the same coefficient.
        "labels": lam * batch1["labels"] + (1 - lam) * batch2["labels"],
    }

Transformer Encoder

In this approach, positional embeddings are added to the frame embeddings. The resulting vectors are processed by a single transformer encoder layer. The outputs are averaged and mapped to logits by a linear layer.

An attention mask is also fed into the transformer encoder, since the model is trained by mini-batch gradient descent and the frame sequences within a batch may be padded to the same length.
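
A sketch of this architecture (class name, pooling details, and hyperparameters are illustrative, not necessarily the repository's exact choices):

import torch
from torch import nn


class TransformerVideoClassifier(nn.Module):
    def __init__(self, embedding_dim=512, max_frames=100, num_heads=8, num_classes=4):
        super().__init__()
        self.positional_embeddings = nn.Embedding(max_frames, embedding_dim)
        self.encoder_layer = nn.TransformerEncoderLayer(
            d_model=embedding_dim, nhead=num_heads, batch_first=True
        )
        self.linear = nn.Linear(embedding_dim, num_classes)

    def forward(self, embeddings, attention_mask):
        # embeddings: (batch, frames, dim); attention_mask: (batch, frames), 1 for real frames.
        positions = torch.arange(embeddings.size(1), device=embeddings.device)
        hidden = embeddings + self.positional_embeddings(positions)
        # src_key_padding_mask expects True at the positions that should be ignored.
        hidden = self.encoder_layer(hidden, src_key_padding_mask=~attention_mask.bool())
        # Average only over the non-padded positions.
        mask = attention_mask.unsqueeze(-1).float()
        pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)
        return self.linear(pooled)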

Mixup

When using the mixup technique, the frame embedding matrices are mixed element-wise.

# The same scheme, but whole (frames x dim) embedding matrices are mixed
# element-wise; attention masks are combined with a logical OR.
for batch1, batch2 in zip(train_data_loader1, train_data_loader2):
    optimizer.zero_grad()

    # Mixing coefficient sampled from Beta(alpha, alpha).
    lam = np.random.beta(alpha, alpha)
    batch = {
        "embeddings": lam * batch1["embeddings"]
            + (1 - lam) * batch2["embeddings"],
        "labels": lam * batch1["labels"] + (1 - lam) * batch2["labels"],
        # A position is attended to if it is real in at least one of the two videos.
        "attention_mask": torch.logical_or(
            batch1["attention_mask"].bool(),
            batch2["attention_mask"].bool()
        ).float()
    }

Classification based on one random frame

A random frame is extracted from the video, preprocessed, and converted into an embedding vector using the CLIP model (ViT-B/32). The embedding is mapped to logits by a single linear layer.

The random frame for each video is fixed; it is determined by a hash of the video id:

frame_index = hash(video_id) % total_frame_count
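
A sketch of the single-frame classifier (illustrative; note also that Python's built-in hash for strings is randomized per process, so PYTHONHASHSEED has to be fixed for the frame choice to be reproducible across runs):

import clip
import torch
from PIL import Image
from torch import nn

device = "cuda" if torch.cuda.is_available() else "cpu"
# CLIP is used as a frozen feature extractor; only the linear head is trained.
clip_model, preprocess = clip.load("ViT-B/32", device=device)
classifier = nn.Linear(512, 4).to(device)  # 4 categories


def frame_logits(frame: Image.Image) -> torch.Tensor:
    with torch.no_grad():
        embedding = clip_model.encode_image(preprocess(frame).unsqueeze(0).to(device))
    return classifier(embedding.float())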

Mixup

This approach uses classic mixup, in which the original images themselves are mixed.

# Classic mixup: the preprocessed frame images themselves are mixed.
for (images1, labels1), (images2, labels2) in zip(train_data_loader1, train_data_loader2):
    optimizer.zero_grad()

    # Mixing coefficient sampled from Beta(alpha, alpha).
    lam = np.random.beta(alpha, alpha)
    images = lam * images1 + (1 - lam) * images2
    labels = lam * labels1 + (1 - lam) * labels2

Results

The micro-averaged F1 score is reported for each of the following configurations; training was carried out for 10 epochs, and the exact values and plots can be found in experiments.ipynb.

  • Linear with mixup
  • Linear w/o mixup
  • Transformer Encoder with mixup
  • Transformer Encoder w/o mixup
  • Random frame with mixup
  • Random frame w/o mixup

As the metrics show, mixup improves quality in two of the three cases. It is also interesting that classification from a single random frame, while weaker than the other approaches, still achieves good quality.
