# Video Classification

Run activity recognition on the videos in a FiftyOne dataset using a pre-trained VideoMAE model from Hugging Face Transformers.

In [1]:
import fiftyone as fo
import fiftyone.zoo as foz

In [None]:
# Load up to 100 validation videos, each at most 30 seconds long
dataset = foz.load_zoo_dataset(
    "activitynet-200",
    split="validation",
    max_duration=30,
    max_samples=100,
)

In [31]:
session = fo.launch_app(dataset, auto=False)

Session launched. Run `session.show()` to open the App in a cell output.


Generate a frame object for every frame of every video, so that frame-level fields can be populated:

In [32]:
dataset.ensure_frames()

Computing metadata...
 100% |█████████████████| 100/100 [2.3s elapsed, 0s remaining, 44.0 samples/s]  


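Sample frames from each video at no more than 5 frames per second (`max_fps=5`) and write them as temporary image files to the `/tmp` directory. We pass `size=(224, 224)` to resize the frames to 224x224 pixels for compatibility with the pre-trained model, and `force_sample=True` to resample frames even if they were sampled before. The path of each sampled image is stored in the `filepath` field of the corresponding frame, which we read back below.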

In [82]:
dataset.to_frames(
    sample_frames=True,
    max_fps=5,
    output_dir="/tmp/",
    force_sample=True,
    size=(224, 224),
)

 100% |███████████████| 9172/9172 [142.1ms elapsed, 0s remaining, 65.1K samples/s]  
Sampling video frames...
 100% |█████████████████| 100/100 [23.9s elapsed, 0s remaining, 5.0 samples/s]      


Dataset:     activitynet-200-validation-100
Media type:  image
Num samples: 9172
Sample fields:
    id:           fiftyone.core.fields.ObjectIdField
    sample_id:    fiftyone.core.fields.ObjectIdField
    filepath:     fiftyone.core.fields.StringField
    frame_number: fiftyone.core.fields.FrameNumberField
    tags:         fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)
    metadata:     fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.ImageMetadata)
View stages:
    1. ToFrames(config={'force_sample': True, 'max_fps': 5, 'output_dir': '/tmp/', ...})
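The frame count above (9172 images from 100 videos) reflects the `max_fps=5` cap: only a subset of each video's frames is written to disk. The sketch below illustrates one simple way such a cap can be applied. It is not FiftyOne's implementation; `sampled_frame_numbers` and its parameters are hypothetical names used only for illustration.

```python
def sampled_frame_numbers(total_frames, native_fps, max_fps):
    """Return the 1-based frame numbers kept when capping the sampling rate.

    Illustrative only: NOT FiftyOne's algorithm, just one simple way to keep
    at most `max_fps` frames per second of video.
    """
    # Nothing to drop if the video is already at or below the cap
    if native_fps <= max_fps:
        return list(range(1, total_frames + 1))

    step = native_fps / max_fps  # keep roughly one frame out of every `step`
    kept = []
    next_position = 1.0
    for frame_number in range(1, total_frames + 1):
        if frame_number >= next_position:
            kept.append(frame_number)
            next_position += step
    return kept


# A 2-second clip at 30 fps, capped at 5 fps, keeps 10 of its 60 frames
kept = sampled_frame_numbers(total_frames=60, native_fps=30, max_fps=5)
print(len(kept), kept[:3])
```

Under this kind of scheme, frames that are not selected never get an image on disk, which is consistent with the helper below skipping frames whose filepath is `None`.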

Define a helper function that loads up to 16 sampled frames of a video as a list of channels-first arrays:

In [83]:
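from PIL import Image
import numpy as np

def _construct_frames_array(frame_fps):
    """Load the first 16 sampled frames as (channels, height, width) arrays."""
    frames = []
    for frame_fp in frame_fps:
        # Frames that were not sampled to disk have no filepath
        if frame_fp is None:
            continue

        # Force RGB so every array has 3 channels, then (H, W, C) -> (C, H, W)
        image = Image.open(frame_fp).convert("RGB")
        frames.append(np.transpose(np.array(image), (2, 0, 1)))

        # Stop once we have a 16-frame clip
        if len(frames) == 16:
            break

    return frames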
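The helper above keeps the first 16 sampled frames, which at 5 frames per second covers only about the first three seconds of each video. One alternative worth considering is to spread the 16 frames evenly across the whole clip. The following is a minimal sketch of the index selection only; `uniform_frame_indices` is a hypothetical helper, and padding short clips by repeating the last index is an assumption about how one might handle videos with fewer than 16 sampled frames.

```python
def uniform_frame_indices(num_available, num_frames=16):
    """Pick `num_frames` evenly spaced indices from `range(num_available)`.

    Hypothetical helper for illustration. Short clips are padded by repeating
    the last index (an assumption, not something the original notebook does).
    """
    if num_available <= 0:
        return []

    if num_available <= num_frames:
        padding = [num_available - 1] * (num_frames - num_available)
        return list(range(num_available)) + padding

    # Evenly spaced positions, rounded down to valid integer indices
    step = num_available / num_frames
    return [int(i * step) for i in range(num_frames)]


print(uniform_frame_indices(32))  # every second frame of a 32-frame clip
print(uniform_frame_indices(5))   # short clip, padded to 16 indices
```

The returned indices could then be used to select entries from the list of non-`None` frame paths before the images are loaded.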

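Load the pre-trained VideoMAE model, run it on every video in the dataset, and store each prediction as an `fo.Classification` in a new `predicted_labels` field. Because the checkpoint was fine-tuned on Kinetics, the predicted labels come from the model's own label set (`model.config.id2label`), not from ActivityNet's classes.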

In [None]:
from transformers import VideoMAEImageProcessor, VideoMAEForVideoClassification
import torch
from tqdm.notebook import tqdm

# VideoMAE checkpoint fine-tuned on Kinetics, from the Hugging Face Hub
model_name = "MCG-NJU/videomae-base-finetuned-kinetics"
processor = VideoMAEImageProcessor.from_pretrained(model_name)
model = VideoMAEForVideoClassification.from_pretrained(model_name)

# One list of frame image paths per video (None for frames without an image)
frame_fp_lists = dataset.values("frames.filepath")

predicted_labels = []

for frame_fps in tqdm(frame_fp_lists):
    # Build the clip and preprocess it into model-ready tensors
    video = _construct_frames_array(frame_fps)
    inputs = processor(video, return_tensors="pt")

    # Inference only, so skip gradient tracking
    with torch.no_grad():
        outputs = model(**inputs)
        logits = outputs.logits

    # Map the highest-scoring class index to its label name
    predicted_class_idx = logits.argmax(-1).item()
    predicted_class = model.config.id2label[predicted_class_idx]
    label = fo.Classification(label=predicted_class)
    predicted_labels.append(label)

# Store one prediction per video in a new sample field
dataset.set_values("predicted_labels", predicted_labels)
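The loop above stores only the top label. If a confidence score is also wanted, the logits can be turned into probabilities with a softmax, and the probability of the winning class can be passed to `fo.Classification` through its `confidence` argument. The sketch below shows the arithmetic in plain Python; the logits and the three-class `id2label` mapping are toy values made up for illustration, not model output.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy stand-ins for `model.config.id2label` and `outputs.logits` (hypothetical)
id2label = {0: "archery", 1: "bowling", 2: "juggling balls"}
logits = [0.5, 2.0, -1.0]

probs = softmax(logits)
predicted_class_idx = max(range(len(probs)), key=probs.__getitem__)
predicted_class = id2label[predicted_class_idx]
confidence = probs[predicted_class_idx]

print(predicted_class, round(confidence, 3))

# In the notebook, the label could then be built as:
# fo.Classification(label=predicted_class, confidence=confidence)
```

With real model output, the same result can be obtained with `torch.softmax(logits, dim=-1)` followed by `max()` along the last dimension.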

![Video Classification](../assets/video_classification.jpg)