In [1]:
import os
os.environ['CUDA_VISIBLE_DEVICES'] = "0"

import torch
import decord 
import numpy as np
from IPython.display import Video
from transformers import (
    VideoMAEImageProcessor, VideoMAEForVideoClassification
)



In [2]:
torch_dtype = torch.bfloat16
device = 'cuda'

## Model Overview



The model used here is the `large` variant of [VideoMAE](https://huggingface.co/docs/transformers/model_doc/videomae), fine-tuned for multi-label video classification (a video can belong to several classes at once).

In [3]:
# Initialize model and processor
model_ckpt = "ai-forever/kandinsky-videomae-large-camera-motion"
image_processor = VideoMAEImageProcessor.from_pretrained(model_ckpt)

model = VideoMAEForVideoClassification.from_pretrained(
    model_ckpt,
    attn_implementation="sdpa",
    torch_dtype=torch_dtype,
).eval().to(device)

### Model Classes and Their Meanings


The model predicts `21` different camera motion and shot type classes:

In [4]:
model.config.id2label

{0: 'arc_left',
 1: 'arc_right',
 2: 'dolly_in',
 3: 'dolly_out',
 4: 'pan_left',
 5: 'pan_right',
 6: 'pedestal_down',
 7: 'pedestal_up',
 8: 'pov',
 9: 'roll_left',
 10: 'roll_right',
 11: 'shake',
 12: 'static',
 13: 'tilt_down',
 14: 'tilt_up',
 15: 'track',
 16: 'truck_left',
 17: 'truck_right',
 18: 'undefined',
 19: 'zoom_in',
 20: 'zoom_out'}

##### Horizontal Movements
1. `arc_left`/`arc_right` – Camera moves in a leftward/rightward arc around the subject.

2. `pan_left`/`pan_right` – Camera rotates horizontally to the left/to the right (fixed position).

3. `truck_left`/`truck_right` – Camera slides left/right (lateral movement, keeping axis perpendicular).

##### Vertical Movements
4. `pedestal_down`/`pedestal_up` – Camera moves vertically downward/upward (e.g., lowering/raising a tripod).

5. `tilt_down`/`tilt_up` – Camera tilts downward/upward (angle change, fixed position).

##### Forward/Backward Movements
6. `dolly_in`/`dolly_out` – Camera physically moves forward/backward (toward/away from the subject).

7. `zoom_in`/`zoom_out` – Optical zoom-in/out (lens adjustment, no physical movement).

##### Rotational Movements
8. `roll_left`/`roll_right` – Camera rolls left/right (rotates on its axis).

##### Special Shots

9. `shake` – Shaky/unstable movement (handheld or intentional effect).

10. `track` – Camera follows a moving subject (e.g., on rails or with a Steadicam).

11. `pov` – Point-of-view shot (camera mimics a character’s perspective).

12. `static` – Fixed shot (no camera movement).

##### Other
13. `undefined` – Unclassifiable or ambiguous motion.



### Key Features

(a) Multi-label classification: A single video can belong to multiple classes (e.g., `dolly_in` + `pan_right` + `track` + `shake`).

(b) The model was trained to label the entire video, not individual frames(!): [input video] -> one or more labels for the whole video. So, if a camera motion persists throughout all frames of the video, the model should predict that motion; otherwise it should predict `undefined`.


(c) Predictions use `sigmoid` with a `0.5` cutoff for activation.
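As a minimal sketch of point (c), the snippet below applies a sigmoid to each logit independently and keeps every label whose probability exceeds the cutoff. The helper name `labels_above_threshold`, the three-class `id2label` mapping, and the logit values are illustrative assumptions, not outputs of the actual model.

```python
import math


def sigmoid(x):
    """Logistic function mapping a logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def labels_above_threshold(logits, id2label, threshold=0.5):
    """Return every label whose sigmoid probability exceeds the threshold.

    Each class is scored independently, so zero, one, or several labels
    can be selected for the same clip.
    """
    probs = [sigmoid(z) for z in logits]
    return [id2label[i] for i, p in enumerate(probs) if p > threshold]


# Hypothetical mapping and logits, for illustration only.
id2label = {0: "dolly_in", 1: "pan_right", 2: "static"}
logits = [2.0, 0.3, -1.5]

selected = labels_above_threshold(logits, id2label)
print(selected)
```

Because `sigmoid(0) = 0.5`, a `0.5` cutoff on probabilities is the same as keeping every class with a positive logit.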

---

### Technical Notes

1) Input frames are resized to `224x224` pixels.


2) The model is configured to process `config.num_frames=16` frames per input clip. These frames are extracted uniformly from the input video, regardless of its original duration. 

Here the video is loaded using `decord.VideoReader`, which efficiently decodes frames without reading the entire file into memory.
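To make the uniform sampling concrete, here is a small pure-Python sketch that spreads a fixed number of indices evenly between a first and last frame. The helper `sample_frame_indices` is a hypothetical stand-in that approximates an `np.linspace(..., dtype=int)` call with integer arithmetic; it is not part of decord or transformers.

```python
def sample_frame_indices(start_idx, end_idx, num_frames=16):
    """Return num_frames indices spread evenly over [start_idx, end_idx].

    Both endpoints are included. Fractional positions are rounded down.
    """
    if num_frames < 1 or end_idx < start_idx:
        raise ValueError("need num_frames >= 1 and end_idx >= start_idx")
    if num_frames == 1:
        return [start_idx]
    span = end_idx - start_idx
    return [start_idx + (i * span) // (num_frames - 1) for i in range(num_frames)]


# A hypothetical 48-frame clip (indices 0..47) reduced to 16 frames.
indices = sample_frame_indices(0, 47)
print(indices)
```

Note that when the clip has fewer frames than `num_frames`, some indices repeat, so the same frame is fed to the model more than once.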

However, for videos longer than **2 seconds**, processing the entire video as a single clip may miss temporal nuances (e.g., varying camera motions). The recommended workflow for such videos is as follows:

(a) Split the video into non-overlapping 2-second segments (or sliding windows with optional overlap).

(b) Run inference independently on each segment.

(c) Post-process the results.
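Step (a) of this workflow could be sketched as follows. The helper `split_into_segments` and its `stride` parameter are assumptions made for illustration: a stride equal to the segment length gives non-overlapping segments, and a smaller stride gives overlapping sliding windows.

```python
def split_into_segments(duration, segment_len=2.0, stride=None):
    """Return a list of (start, end) times in seconds covering the video.

    stride defaults to segment_len (no overlap). The last segment is
    truncated at the end of the video, so it may be shorter than the rest.
    """
    if stride is None:
        stride = segment_len
    if duration <= 0 or segment_len <= 0 or stride <= 0:
        raise ValueError("duration, segment_len and stride must be positive")

    segments = []
    start = 0.0
    while start < duration:
        end = min(start + segment_len, duration)
        segments.append((start, end))
        if end >= duration:
            break  # the video is fully covered
        start += stride
    return segments


# A hypothetical 5-second video.
no_overlap = split_into_segments(5.0)
sliding = split_into_segments(5.0, segment_len=2.0, stride=1.0)
print(no_overlap)
print(sliding)
```

Each `(start, end)` pair can then be passed as `clip_start` and `clip_end` to the inference code for step (b). Whether to keep, drop, or merge a short trailing segment is a design choice this sketch leaves open.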

 ---

In [5]:
import math
num_frames = model.config.num_frames
height, width = image_processor.crop_size['height'], image_processor.crop_size['width']

def load_video(filepath, num_frames=16, clip_start=0, clip_end=-1):
    """Load a video and select num_frames frames.
    This function loads a video file and extracts a specified number of frames (num_frames),
    evenly distributed over the time interval defined by the clip_start and clip_end parameters.

    Parameters
    ----------
    filepath : str
        Path to the video file.
    num_frames : int, optional
        Number of frames to extract.
    clip_start : float, optional
        Start time of the clip in seconds. Default is 0.
    clip_end : float, optional
        End time of the clip in seconds. If set to -1, the clip continues until the end of the video. Default is -1.

    Returns
    -------
    torch.Tensor
        Tensor containing video frames in the format (C, T, H, W).
    """
    vr = decord.VideoReader(
        filepath,
        num_threads=1,
        # ctx=decord.cpu(0),
        width=width,
        height=height,
    )
    total_frames = len(vr)
    fps = float(vr.get_avg_fps())
    duration = float(total_frames) / fps

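    # Convert clip bounds from seconds to frame indices, clamped to the valid range
    # (math.ceil can otherwise yield an index equal to total_frames).
    last_idx = total_frames - 1
    start_idx = min(math.ceil(fps * clip_start), last_idx)
    if clip_end != -1 and clip_end < duration:
        end_idx = min(math.ceil(fps * clip_end), last_idx)
    else:
        end_idx = last_idx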

    frame_indices = np.linspace(start_idx, end_idx, num=num_frames, dtype=int)
    
    video = vr.get_batch(frame_indices).asnumpy()  # (T, H, W, C)
    video = torch.from_numpy(video).permute(3, 0, 1, 2)  # (C, T, H, W)
    
    return video


Next, the video tensor must be scaled to `[0, 1]`, resized, and normalized using the mean and standard deviation values from the image processor. The final tensor is permuted to `(T, C, H, W)`, the layout the model expects.

In [None]:

def preprocess_video(video, device='cuda'):
    """Scale, resize, and normalize a (C, T, H, W) uint8 video tensor.

    Returns a float tensor of shape (T, C, H, W).
    """
    video = video.to(device) / 255.0
    video = torch.nn.functional.interpolate(
        video, size=(height, width), mode="bilinear"
    )
    mean = torch.tensor(image_processor.image_mean).view(3, 1, 1, 1).to(device)
    std = torch.tensor(image_processor.image_std).view(3, 1, 1, 1).to(device)

    video = (video - mean) / std
    video = video.permute(1, 0, 2, 3)
    return video


In [7]:
def predict_labels(model, filepath, clip_start=0, clip_end=-1):
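    # Use the frame count from the model config rather than the function default.
    video = load_video(filepath, num_frames=num_frames, clip_start=clip_start, clip_end=clip_end)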

    inputs = preprocess_video(video).unsqueeze(0).to(torch_dtype)
    print(inputs.shape)

    with torch.no_grad():
        outputs = model(inputs)

    logits = outputs.logits.float()
    probs = torch.sigmoid(logits).cpu().numpy()[0]  # multi-label

    preds = (probs > 0.5).astype(int)
    predicted_labels = [model.config.id2label[i] for i, p in enumerate(preds) if p == 1]
    return predicted_labels

In [None]:
!mkdir -p "examples/camera_motion"
!wget "https://huggingface.co/datasets/syCen/CameraBench/resolve/main/videos/015a2bdd4aa4b5cfb60aacde6a3069ea3e4ef458c6f263148b2339b546ef8e86.1.mp4" -O "examples/camera_motion/015a2bdd4aa4b5cfb60aacde6a3069ea3e4ef458c6f263148b2339b546ef8e86.1.mp4"


In [None]:
filepath = "examples/camera_motion/015a2bdd4aa4b5cfb60aacde6a3069ea3e4ef458c6f263148b2339b546ef8e86.1.mp4"

In [10]:
predicted_labels = predict_labels(model, filepath)

print(f"Predicted labels: {predicted_labels}")
Video(filepath, width=512)

torch.Size([1, 16, 3, 224, 224])
Predicted labels: ['arc_right']


In [None]:
!wget "https://huggingface.co/datasets/syCen/CameraBench/resolve/main/videos/0Um7WnY72Us.1.0.mp4"  -O "examples/camera_motion/0Um7WnY72Us.1.0.mp4"



In [None]:
filepath = "examples/camera_motion/0Um7WnY72Us.1.0.mp4"

In [13]:
predicted_labels = predict_labels(model, filepath, clip_start=0, clip_end=-1)

print(f"Predicted labels: {predicted_labels}")
Video(filepath, width=512)

torch.Size([1, 16, 3, 224, 224])
Predicted labels: ['shake', 'undefined']


By the definition above, this clip really does fall into the `undefined` class. Let's try predicting for the segment that starts at 1 s instead.

In [14]:
predicted_labels = predict_labels(model, filepath, clip_start=1, clip_end=-1)

print(f"Predicted labels: {predicted_labels}")

torch.Size([1, 16, 3, 224, 224])
Predicted labels: ['tilt_up']
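
Since the predicted labels change with the chosen segment, one possible post-processing step is to run inference per segment and merge the results into a per-label timeline. The sketch below assumes per-segment predictions are already available as `((start, end), labels)` pairs in time order; the helper `merge_segment_labels` and the sample predictions are hypothetical and are not produced by the cells above.

```python
def merge_segment_labels(segment_results):
    """Merge per-segment labels into time intervals for each label.

    segment_results: list of ((start, end), labels) tuples in time order.
    Returns a dict mapping label -> list of (start, end) intervals, where
    touching or overlapping segments with the same label are joined.
    """
    timeline = {}
    for (start, end), labels in segment_results:
        for label in labels:
            intervals = timeline.setdefault(label, [])
            if intervals and intervals[-1][1] >= start:
                # Extend the previous interval instead of starting a new one.
                prev_start, prev_end = intervals[-1]
                intervals[-1] = (prev_start, max(prev_end, end))
            else:
                intervals.append((start, end))
    return timeline


# Hypothetical per-segment predictions for a 6-second video.
results = [
    ((0.0, 2.0), ["shake", "undefined"]),
    ((2.0, 4.0), ["tilt_up"]),
    ((4.0, 6.0), ["tilt_up", "shake"]),
]
timeline = merge_segment_labels(results)
print(timeline)
```

Other aggregation rules (for example majority voting across overlapping windows) may suit some applications better; this is only one simple option.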
