<a href="https://colab.research.google.com/github/eltonalenca90/video_action_detection/blob/main/pre_trained_PyTorchVideo_detection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction

[PyTorchVideo](https://pytorchvideo.readthedocs.io/en/latest/index.html) provides several pretrained models through Torch Hub. In this tutorial we show how to load a pretrained video action detection model in PyTorchVideo and run it on a test video. The PyTorchVideo Torch Hub detection models were trained on the Kinetics 400 dataset [1] and fine-tuned for detection on the AVA v2.2 dataset. Available models are described in the model zoo documentation.

NOTE: Currently, this tutorial only works when run from a local clone, in the directory pytorchvideo/tutorials/video_detection_example.

[1] W. Kay, et al. The kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017.

https://pytorchvideo.org/docs/tutorial_torchhub_detection_inference


# Prepare data

In this walkthrough, we use a subset of the [Kinetics 400 action recognition dataset](https://deepmind.com/research/open-source/kinetics), which is composed of 400 human activity classes over 600,000 10-second video clips sourced from YouTube.

We will first need to download the labels for Kinetics as well as [youtube-dl](http://ytdl-org.github.io/youtube-dl/download.html) which we will use to download the video data from YouTube.

In [1]:
!pip install youtube-dl

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting youtube-dl
  Downloading youtube_dl-2021.12.17-py2.py3-none-any.whl (1.9 MB)
Installing collected packages: youtube-dl
Successfully installed youtube-dl-2021.12.17


In [2]:
from datetime import timedelta
import json
import os
import subprocess

import youtube_dl
from youtube_dl.utils import (DownloadError, ExtractorError)

def download_video(url, start, dur, output):
    output_tmp = os.path.join("/tmp",os.path.basename(output))
    try:
        # From https://stackoverflow.com/questions/57131049/is-it-possible-to-download-a-specific-part-of-a-file
        with youtube_dl.YoutubeDL({'format': 'best'}) as ydl:
            result = ydl.extract_info(url, download=False)
            video = result['entries'][0] if 'entries' in result else result
        
        url = video['url']
        if start < 5:
            offset = start
        else:
            offset = 5
        start -= offset
        offset_dur = dur + offset
        start_str = str(timedelta(seconds=start)) 
        dur_str = str(timedelta(seconds=offset_dur)) 

        cmd = ['ffmpeg', '-i', url, '-ss', start_str, '-t', dur_str, '-c:v',
                'copy', '-c:a', 'copy', output_tmp]
        subprocess.call(cmd)

        start_str_2 = str(timedelta(seconds=offset)) 
        dur_str_2 = str(timedelta(seconds=dur)) 

        cmd = ['ffmpeg', '-i', output_tmp, '-ss', start_str_2, '-t', dur_str_2, output]
        subprocess.call(cmd)
        return True
        
    except (DownloadError, ExtractorError) as e:
        print("Failed to download %s" % output)
        return False
        
'''with open("./kinetics400/test.json", "r") as f:
    test_data = json.load(f)

target_classes = [
 'springboard diving',
 'surfing water',
 'swimming backstroke',
 'swimming breast stroke',
 'swimming butterfly stroke',
]
data_dir = "./videos"
max_samples = 5
    
classes_count = {c:0 for c in target_classes}

for fn, data in test_data.items():
    label = data["annotations"]["label"]
    segment = data["annotations"]["segment"]
    url = data["url"]
    dur = data["duration"]
    if label in classes_count and classes_count[label] < max_samples:
        c_dir = os.path.join(data_dir, label)
        if not os.path.exists(c_dir):
            os.makedirs(c_dir)
        

        start = segment[0]
        output = os.path.join(c_dir, "%s_%s.mp4" % (label.replace(" ","_"), fn))
        
        result = True
        if not os.path.exists(output):
            result = download_video(url, start, dur, output)
        if result:
            classes_count[label] += 1
'''

#label = data["annotations"]["label"]
#segment = data["annotations"]["segment"]
#url = data["url"]
#dur = data["duration"]
#result = download_video(url, start, dur, output)
#start = segment[0]
#output = os.path.join(c_dir, "%s_%s.mp4" % (label.replace(" ","_"), fn))
print("Finished downloading videos!")

Finished downloading videos!
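
The two-pass trimming in download_video is easier to follow as a pure function. The sketch below only computes the -ss and -t arguments for the two ffmpeg calls; trim_windows is a hypothetical helper name, and whole-second inputs are assumed for readability. The first pass copies a slightly longer window that starts up to 5 seconds early, and the second pass cuts the exact segment out of that temporary file, presumably because stream-copy cuts are only accurate to the nearest keyframe.

```python
from datetime import timedelta


def trim_windows(start, dur, max_offset=5):
    """Return ((ss1, t1), (ss2, t2)) for the two ffmpeg passes.

    Hypothetical helper mirroring download_video: pass 1 seeks up to
    `max_offset` seconds before `start`, pass 2 skips that offset again.
    """
    offset = min(start, max_offset)  # never seek before the start of the video
    first = (str(timedelta(seconds=start - offset)),
             str(timedelta(seconds=dur + offset)))
    second = (str(timedelta(seconds=offset)),
              str(timedelta(seconds=dur)))
    return first, second


print(trim_windows(12, 10))  # -> (('0:00:07', '0:00:15'), ('0:00:05', '0:00:10'))
print(trim_windows(3, 10))   # -> (('0:00:00', '0:00:13'), ('0:00:03', '0:00:10'))
```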


# Setup

This walkthrough requires Python 3.7 or 3.8 for PyTorchVideo.

This tutorial assumes that you have installed Detectron2 and OpenCV (opencv-python) on your machine.

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
import sys
sys.path = [
            '/content/slowfast/slowfast',
            '/content/slowfast/slowfast/utils',
            '/content/slowfast/slowfast/visualization',
            '/content/slowfast',
            '/content/drive/MyDrive/2-NeSy-ViU-Elton/PFC/notebooks',
            '/content',
            '/env/python',
            '/usr/lib/python37.zip',
            '/usr/lib/python3.7',
            '/usr/lib/python3.7/lib-dynload',
            '/usr/local/lib/python3.7/dist-packages',
            '/content/slowfast',
            '/usr/local/lib/python3.7/dist-packages/fairscale-0.4.6-py3.7.egg',
            '/usr/local/lib/python3.7/dist-packages/simplejson-3.17.6-py3.7-linux-x86_64.egg',
            '/usr/lib/python3/dist-packages',
            '/usr/local/lib/python3.7/dist-packages/IPython/extensions',
            '/root/.ipython',
            '.']

In [7]:
!pip install 'git+https://github.com/facebookresearch/detectron2.git'

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting git+https://github.com/facebookresearch/detectron2.git
  Cloning https://github.com/facebookresearch/detectron2.git to /tmp/pip-req-build-ycqnocuq
  Running command git clone -q https://github.com/facebookresearch/detectron2.git /tmp/pip-req-build-ycqnocuq
Collecting omegaconf<=2.2.0,>=2.1
  Downloading omegaconf-2.1.2-py3-none-any.whl (74 kB)
Collecting hydra-core>=1.1
  Downloading hydra_core-1.2.0-py3-none-any.whl (151 kB)
Collecting black==22.3.0
  Downloading black-22.3.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.4 MB)
Collecting scipy>1.5.1
  Downloading scipy-1.7.3-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (38.1 MB)

In [1]:
!pip install torch torchvision

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


Here PyTorchVideo is installed from GitHub. Alternatively, it can be installed from PyPI with: pip install pytorchvideo

In [2]:
!pip install "git+https://github.com/facebookresearch/pytorchvideo.git"

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting git+https://github.com/facebookresearch/pytorchvideo.git
  Cloning https://github.com/facebookresearch/pytorchvideo.git to /tmp/pip-req-build-q5pjqt9o
  Running command git clone -q https://github.com/facebookresearch/pytorchvideo.git /tmp/pip-req-build-q5pjqt9o


In [23]:
!export PYTHONPATH=/path/to/content/slowfast/slowfast:$PYTHONPATH
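
Note that each ! command in a notebook runs in its own subshell, so an export issued there does not persist to later cells or to the running Python process. A minimal sketch of an alternative, assuming the same placeholder path as above: prepend the directory to sys.path for the current interpreter and set PYTHONPATH in os.environ for child processes. The helper name add_to_pythonpath is hypothetical.

```python
import os
import sys


def add_to_pythonpath(path):
    """Make `path` importable now and visible to child processes."""
    # Current interpreter: prepend once, keeping the existing entries.
    if path not in sys.path:
        sys.path.insert(0, path)
    # Child processes (for example later `!python ...` calls) read PYTHONPATH.
    existing = os.environ.get("PYTHONPATH", "")
    parts = [p for p in existing.split(os.pathsep) if p]
    if path not in parts:
        os.environ["PYTHONPATH"] = os.pathsep.join([path] + parts)


# Placeholder path copied from the cell above; adjust it to your checkout.
add_to_pythonpath("/path/to/content/slowfast/slowfast")
print(sys.path[0])
```

Unlike overwriting sys.path with a fixed list, this keeps whatever entries the runtime already provides.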

In [14]:
!git clone --recursive https://github.com/pytorch/pytorch


Cloning into 'pytorch'...
remote: Enumerating objects: 817692, done.
remote: Counting objects: 100% (1168/1168), done.
remote: Compressing objects: 100% (639/639), done.
remote: Total 817692 (delta 748), reused 868 (delta 522), pack-reused 816524
Receiving objects: 100% (817692/817692), 820.85 MiB | 33.05 MiB/s, done.
Resolving deltas: 100% (660564/660564), done.
Checking out files: 100% (10261/10261), done.
Submodule 'android/libs/fbjni' (https://github.com/facebookincubator/fbjni.git) registered for path 'android/libs/fbjni'
Submodule 'third_party/NNPACK_deps/FP16' (https://github.com/Maratyszcza/FP16.git) registered for path 'third_party/FP16'
Submodule 'third_party/NNPACK_deps/FXdiv' (https://github.com/Maratyszcza/FXdiv.git) registered for path 'third_party/FXdiv'
Submodule 'third_party/NNPACK' (https://github.com/Maratyszcza/NNPACK.git) registered for path 'third_party/NNPACK'
Submodule 'third_party/QNNPACK' (https://github.com/pytorch/QNNPACK) registered for path 'th

In [3]:
!git clone https://github.com/facebookresearch/slowfast

fatal: destination path 'slowfast' already exists and is not an empty directory.


In [25]:
%cd /content/slowfast
# Before running this cell, edit setup.py and change the PIL requirement to pillow
%run setup.py build develop

/content/slowfast
running build
running build_py
running develop
running egg_info
writing slowfast.egg-info/PKG-INFO
writing dependency_links to slowfast.egg-info/dependency_links.txt
writing requirements to slowfast.egg-info/requires.txt
writing top-level names to slowfast.egg-info/top_level.txt
adding license file 'LICENSE'
writing manifest file 'slowfast.egg-info/SOURCES.txt'
running build_ext
Creating /usr/local/lib/python3.7/dist-packages/slowfast.egg-link (link to .)
slowfast 1.0 is already the active version in easy-install.pth

Installed /content/slowfast
Processing dependencies for slowfast==1.0
Searching for fairscale==0.4.6
Best match: fairscale 0.4.6
Processing fairscale-0.4.6-py3.7.egg
fairscale 0.4.6 is already the active version in easy-install.pth

Using /usr/local/lib/python3.7/dist-packages/fairscale-0.4.6-py3.7.egg
Searching for tensorboard==2.8.0
Best match: tensorboard 2.8.0
Adding tensorboard 2.8.0 to easy-install.pth file
Installing tensorboard script to /usr/loc

In [28]:
%run /content/slowfast/slowfast/__init__.py

In [None]:
#change the line 13 from /content/slowfast/slowfast/utils/logging.py to /content/slowfast/slowfast/utils/logging.py

In [33]:
%run /content/slowfast/slowfast/visualization/video_visualizer.py

  "The 'torchvision.transforms._transforms_video' module is deprecated since 0.12 and will be removed in 0.14. "


# Running a PyTorchVideo model

In this section, we use PyTorchVideo to download a pretrained video action detection model and run it on an example video.

[Torch Hub](https://pytorch.org/hub/) is a repository of pretrained PyTorch models that allows you to easily download a model and run inference on your dataset. PyTorchVideo provides a number of video classification models through its [Torch Hub-backed model zoo](https://pytorchvideo.readthedocs.io/en/latest/model_zoo.html), including SlowFast, I3D, C2D, R(2+1)D, and X3D. The following downloads the slow branch of SlowFast with a ResNet50 backbone and loads it into Python:

## Imports

In [35]:
from functools import partial
import numpy as np

import cv2
import torch

import detectron2
from detectron2.config import get_cfg
from detectron2 import model_zoo
from detectron2.engine import DefaultPredictor
from detectron2.utils.video_visualizer import VideoVisualizer
from detectron2.utils.visualizer import ColorMode, Visualizer
from detectron2.data import MetadataCatalog

import pytorchvideo
from pytorchvideo.transforms.functional import (
    uniform_temporal_subsample,
    short_side_scale_with_boxes,
    clip_boxes_to_image,
)
from torchvision.transforms._functional_video import normalize
from pytorchvideo.data.ava import AvaLabeledVideoFramePaths
from pytorchvideo.models.hub import slow_r50_detection # Another option is slowfast_r50_detection

from slowfast.visualization import video_visualizer

## Load Model
PyTorchVideo provides several pretrained models through Torch Hub. Available models are described in the model zoo documentation.

Here we select the slow_r50_detection model, which was trained with a 4x16 setting on the Kinetics 400 dataset and fine-tuned on the AVA v2.2 action dataset.

NOTE: to run on a GPU in Google Colab, select in the menu bar: Runtime -> Change runtime type -> Hardware accelerator -> GPU



In [36]:
device = 'cuda' # or 'cpu'
video_model = slow_r50_detection(True) # Another option is slowfast_r50_detection
video_model = video_model.eval().to(device)

Downloading: "https://dl.fbaipublicfiles.com/pytorchvideo/model_zoo/ava/SLOW_4x16_R50_DETECTION.pyth" to /root/.cache/torch/hub/checkpoints/SLOW_4x16_R50_DETECTION.pyth


  0%|          | 0.00/243M [00:00<?, ?B/s]

## Load an off-the-shelf Detectron2 object detector
We use the object detector to detect bounding boxes for the people in each clip. These bounding boxes are later fed into our video action detection model. For more details, refer to Detectron2's object detection tutorials.

To install Detectron2, follow its official installation instructions.

In [37]:
%cd ..

/content


In [38]:
# Initialize predictor
cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file("COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml"))
cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.55  # set threshold for this model
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url("COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml")
predictor = DefaultPredictor(cfg)

# This method takes in an image and generates the bounding boxes for people in the image.
def get_person_bboxes(inp_img, predictor):
    predictions = predictor(inp_img.cpu().detach().numpy())['instances'].to('cpu')
    boxes = predictions.pred_boxes if predictions.has("pred_boxes") else None
    scores = predictions.scores if predictions.has("scores") else None
    classes = np.array(predictions.pred_classes.tolist() if predictions.has("pred_classes") else None)
    predicted_boxes = boxes[np.logical_and(classes==0, scores>0.75 )].tensor.cpu() # only person
    return predicted_boxes

model_final_280758.pkl: 167MB [00:03, 46.8MB/s]                           
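
The filtering step inside get_person_bboxes keeps only detections whose class index is 0 (the index the notebook treats as person) and whose score exceeds 0.75. This is stricter than the detector's own 0.55 test threshold set in the config. A framework-free sketch of that logic, using made-up detections for illustration (keep_person_boxes is a hypothetical helper):

```python
PERSON_CLASS = 0   # class index the notebook treats as "person"
MIN_SCORE = 0.75   # same threshold as in get_person_bboxes


def keep_person_boxes(boxes, classes, scores):
    """Return the boxes whose class is person and whose score is high enough."""
    return [
        box
        for box, cls, score in zip(boxes, classes, scores)
        if cls == PERSON_CLASS and score > MIN_SCORE
    ]


# Illustrative values only, not real detector output.
boxes = [(10, 20, 110, 220), (30, 40, 90, 160), (0, 0, 50, 50)]
classes = [0, 0, 56]
scores = [0.98, 0.60, 0.99]
print(keep_person_boxes(boxes, classes, scores))  # -> [(10, 20, 110, 220)]
```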


## Define the transformations for the input required by the model
Before passing the video and bounding boxes into the model, we need to apply some input transforms and sample the clip at the correct frame rate.

Below we define a function that preprocesses the clip and bounding boxes. It generates inputs for either the Slow (ResNet) or the SlowFast model, depending on the value of the parameter slow_fast_alpha.

In [39]:
def ava_inference_transform(
    clip,
    boxes,
    num_frames = 4, #if using slowfast_r50_detection, change this to 32
    crop_size = 256,
    data_mean = [0.45, 0.45, 0.45],
    data_std = [0.225, 0.225, 0.225],
    slow_fast_alpha = None, #if using slowfast_r50_detection, change this to 4
):

    boxes = np.array(boxes)
    ori_boxes = boxes.copy()

    # Image [0, 255] -> [0, 1].
    clip = uniform_temporal_subsample(clip, num_frames)
    clip = clip.float()
    clip = clip / 255.0

    height, width = clip.shape[2], clip.shape[3]
    # The format of boxes is [x1, y1, x2, y2]. The input boxes are in the
    # range of [0, width] for x and [0,height] for y
    boxes = clip_boxes_to_image(boxes, height, width)

    # Resize short side to crop_size. Non-local and STRG uses 256.
    clip, boxes = short_side_scale_with_boxes(
        clip,
        size=crop_size,
        boxes=boxes,
    )

    # Normalize images by mean and std.
    clip = normalize(
        clip,
        np.array(data_mean, dtype=np.float32),
        np.array(data_std, dtype=np.float32),
    )

    boxes = clip_boxes_to_image(
        boxes, clip.shape[2],  clip.shape[3]
    )

    # In case of SlowFast, generate both pathways.
    if slow_fast_alpha is not None:
        fast_pathway = clip
        # Perform temporal sampling from the fast pathway.
        slow_pathway = torch.index_select(
            clip,
            1,
            torch.linspace(
                0, clip.shape[1] - 1, clip.shape[1] // slow_fast_alpha
            ).long(),
        )
        clip = [slow_pathway, fast_pathway]

    return clip, torch.from_numpy(boxes), ori_boxes
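
Both the temporal subsampling and the slow-pathway selection above rely on evenly spaced frame indices that are truncated to integers. The sketch below reproduces that index arithmetic in plain Python. linspace_indices is a hypothetical helper, and indices obtained from torch.linspace may differ by one in rare floating-point edge cases, so treat this as an illustration rather than the library's implementation.

```python
def linspace_indices(num_frames, num_samples):
    """Evenly spaced indices from 0 to num_frames - 1, truncated to int."""
    if num_samples == 1:
        return [0]
    last = num_frames - 1
    return [int(i * last / (num_samples - 1)) for i in range(num_samples)]


# Slow model: pick 4 frames out of a 30-frame clip.
print(linspace_indices(30, 4))        # -> [0, 9, 19, 29]

# SlowFast: 32 fast-pathway frames and alpha = 4 give 8 slow-pathway frames.
print(linspace_indices(32, 32 // 4))  # -> [0, 4, 8, 13, 17, 22, 26, 31]
```

The first and last frames are always included, which is why the slow pathway is not a plain every-fourth-frame selection.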

## Setup
Download the id to label mapping for the AVA v2.2 dataset, on which the Torch Hub models were fine-tuned. This is used to get the category label names from the predicted class ids.

Create a visualizer to plot the results (labels + bounding boxes).

In [67]:
import json

# Download the action text to id mapping
!wget https://dl.fbaipublicfiles.com/pytorchvideo/data/class_names/ava_action_list.pbtxt

# Read the label map from the downloaded file
label_map, allowed_class_ids = AvaLabeledVideoFramePaths.read_label_map('/content/ava_action_list.pbtxt')
label_map = {y: x for x, y in label_map.items()}  # swap keys and values

# Create a json file with the classnames 
with open("classname.json", "w") as outfile:
    json.dump(label_map, outfile)

# Create a video visualizer that can plot bounding boxes and visualize actions on bboxes.
video_visualizer = video_visualizer.VideoVisualizer(num_classes=81, class_names_path="/content/classname.json", top_k=3, mode="thres",thres=0.5)

--2022-05-25 04:50:37--  https://dl.fbaipublicfiles.com/pytorchvideo/data/class_names/ava_action_list.pbtxt
Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 104.22.75.142, 104.22.74.142, 172.67.9.4, ...
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|104.22.75.142|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2649 (2.6K) [text/plain]
Saving to: ‘ava_action_list.pbtxt’


2022-05-25 04:50:37 (50.7 MB/s) - ‘ava_action_list.pbtxt’ saved [2649/2649]
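
For readers who want to see what this step amounts to, the sketch below shows one way to read such a label file and produce the JSON written above. It assumes the .pbtxt consists of item blocks with name: and id: lines (the inline sample is illustrative, not copied from the downloaded file) and that the visualizer expects a class name to id mapping. parse_label_map is a hypothetical function, not the PyTorchVideo implementation.

```python
import json
import re

# Illustrative sample in the assumed format; not the real file contents.
SAMPLE_PBTXT = '''
item {
  name: "bend/bow (at the waist)"
  id: 1
}
item {
  name: "stand"
  id: 12
}
'''


def parse_label_map(text):
    """Return {class_id: class_name} parsed from pbtxt-style text."""
    id_to_name = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        name_match = re.match(r'name:\s*"(.*)"$', line)
        id_match = re.match(r'id:\s*(\d+)$', line)
        if name_match:
            name = name_match.group(1)
        elif id_match and name is not None:
            id_to_name[int(id_match.group(1))] = name
    return id_to_name


id_to_name = parse_label_map(SAMPLE_PBTXT)
name_to_id = {name: idx for idx, name in id_to_name.items()}  # swap keys and values
print(json.dumps(name_to_id))
```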



## Load an example video
We use an open-source video obtained from Wikimedia and load it from a local path.

In [85]:
videopath = "/content/obama-interview.mp4"

# Load the video
encoded_vid = pytorchvideo.data.encoded_video.EncodedVideo.from_path(videopath)
print('Completed loading encoded video.')

Completed loading encoded video.


## Get model predictions
Generate bounding boxes and action predictions for the first 30 seconds of the video, one 1-second clip at a time.
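
Each inference step uses a short clip centred on its time stamp, and the middle frame of that clip is the one passed to the person detector. A small sketch of the window arithmetic, assuming a 1-second clip duration (clip_window and middle_frame_index are hypothetical helpers). The window for time stamp 0 would start before the video begins; clamping it at 0 is an assumption of this sketch, not documented get_clip behaviour.

```python
def clip_window(time_stamp, clip_duration=1.0):
    """Return (start_sec, end_sec) of the clip centred on time_stamp."""
    half = clip_duration / 2.0
    start = max(0.0, time_stamp - half)  # assumption: clamp at the video start
    end = time_stamp + half
    return start, end


def middle_frame_index(num_frames):
    """Index of the frame used for person detection within a clip."""
    return num_frames // 2


for t in range(3):
    print(t, clip_window(t))
print(middle_frame_index(30))  # -> 15
```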

In [81]:
# Video predictions are generated at an interval of 1 sec, from 0 to 29 seconds in the video.
time_stamp_range = range(0,30) # time stamps in video for which clip is sampled.
clip_duration = 1.0 # Duration of clip used for each inference step.
gif_imgs = []

for time_stamp in time_stamp_range:
    print("Generating predictions for time stamp: {} sec".format(time_stamp))

    # Generate clip around the designated time stamps
    inp_imgs = encoded_vid.get_clip(
        time_stamp - clip_duration/2.0, # start second
        time_stamp + clip_duration/2.0  # end second
    )
    inp_imgs = inp_imgs['video']

    # Generate people bbox predictions using Detectron2's off-the-shelf pre-trained predictor.
    # We use the middle image in each clip to generate the bounding boxes.
    inp_img = inp_imgs[:,inp_imgs.shape[1]//2,:,:]
    inp_img = inp_img.permute(1,2,0)

    # Predicted boxes are of the form List[(x_1, y_1, x_2, y_2)]
    predicted_boxes = get_person_bboxes(inp_img, predictor)
    if len(predicted_boxes) == 0:
        print("Skipping clip, no people detected at time stamp: ", time_stamp)
        continue

    # Preprocess clip and bounding boxes for video action recognition.
    inputs, inp_boxes, _ = ava_inference_transform(inp_imgs, predicted_boxes.numpy())
    # Prepend the data sample id to each bounding box.
    # For more details, refer to RoIAlign in Detectron2.
    inp_boxes = torch.cat([torch.zeros(inp_boxes.shape[0],1), inp_boxes], dim=1)

    # Generate actions predictions for the bounding boxes in the clip.
    # The model here takes in the pre-processed video clip and the detected bounding boxes.
    preds = video_model(inputs.unsqueeze(0).to(device), inp_boxes.to(device))
    preds = preds.to('cpu')
    print("1", preds)
    # The model is trained on AVA and AVA labels are 1 indexed so, prepend 0 to convert to 0 index.
    preds = torch.cat([torch.zeros(preds.shape[0],1), preds], dim=1)
    print("2", preds) 

    # Plot predictions on the video and save for later visualization.
    inp_imgs = inp_imgs.permute(1,2,3,0)
    inp_imgs = inp_imgs/255.0
    out_img_pred = video_visualizer.draw_clip_range(inp_imgs, preds, predicted_boxes)
    gif_imgs += out_img_pred

print("Finished generating predictions.")

Generating predictions for time stamp: 0 sec


  b = self.tensor[item]
  num_text_top = dist_to_top // textbox_width


1 tensor([[1.9481e-04, 2.2972e-06, 4.1288e-03, 1.3692e-05, 3.4511e-04, 3.3256e-03,
         5.8826e-05, 1.4853e-02, 5.8890e-05, 5.4941e-05, 9.9760e-01, 1.0348e-04,
         1.3072e-04, 4.3818e-04, 7.6228e-04, 3.5411e-06, 2.3060e-01, 1.3947e-06,
         2.6195e-06, 3.8213e-05, 1.2981e-06, 2.7378e-05, 3.4456e-06, 7.1849e-05,
         6.2048e-06, 6.0244e-04, 1.4827e-03, 7.8224e-04, 9.7394e-04, 1.5934e-05,
         1.9439e-06, 1.2198e-05, 2.2340e-06, 3.8977e-05, 1.8045e-06, 2.2293e-04,
         6.9790e-04, 1.2652e-04, 3.5435e-06, 3.5120e-06, 3.6226e-04, 3.0927e-06,
         6.5718e-05, 2.3689e-06, 1.7077e-04, 6.3443e-05, 4.1397e-04, 4.0245e-03,
         2.9888e-04, 3.2058e-06, 3.5013e-04, 3.7504e-05, 2.1166e-06, 4.1214e-02,
         1.5426e-06, 3.0356e-05, 2.8747e-04, 1.0013e-04, 3.5778e-03, 5.9940e-05,
         2.8188e-02, 1.1444e-03, 4.7396e-04, 6.3666e-05, 1.2707e-04, 6.5236e-04,
         2.1637e-04, 2.3537e-04, 1.0566e-04, 7.4337e-05, 3.7237e-06, 7.0488e-05,
         2.8112e-04, 1.227
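
Two small bookkeeping steps in the loop above are easy to miss. Each box gets a leading sample index (0 here, since one clip is processed at a time), and each score row gets a leading dummy score so that column k lines up with the 1-indexed AVA class id k. That extra column is consistent with the visualizer being created with 81 classes. A plain-Python sketch with made-up numbers and hypothetical helper names:

```python
def prepend_batch_index(boxes, batch_index=0):
    """[x1, y1, x2, y2] -> [batch_index, x1, y1, x2, y2] for every box."""
    return [[float(batch_index)] + list(box) for box in boxes]


def shift_to_one_indexed(scores):
    """Prepend a 0.0 score so that column k corresponds to class id k."""
    return [[0.0] + list(row) for row in scores]


boxes = [[10.0, 20.0, 110.0, 220.0]]
scores = [[0.1, 0.7, 0.2]]  # illustrative: 3 classes instead of 80

print(prepend_batch_index(boxes))  # -> [[0.0, 10.0, 20.0, 110.0, 220.0]]
shifted = shift_to_one_indexed(scores)
best_class_id = max(range(len(shifted[0])), key=shifted[0].__getitem__)
print(best_class_id)  # -> 2
```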

We now save the predicted video containing bounding boxes and action labels for the bounding boxes.



In [75]:
height, width = gif_imgs[0].shape[0], gif_imgs[0].shape[1]

video_save_path = 'output.mp4'
# Write the frames at 7 frames per second.
video = cv2.VideoWriter(video_save_path, cv2.VideoWriter_fourcc(*'DIVX'), 7, (width, height))

for image in gif_imgs:
    img = (255*image).astype(np.uint8)
    # Swap the channel order (RGB <-> BGR), since OpenCV writes BGR frames.
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    video.write(img)
video.release()

print('Predictions are saved to the video file: ', video_save_path)

Predictions are saved to the video file:  output.mp4
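
The per-frame conversion in the loop above does two things: it scales float values to 8-bit integers and it reverses the channel order. A one-pixel sketch in plain Python, assuming the visualizer returns RGB values in [0, 1] (to_bgr_uint8 is a hypothetical helper). The int() call truncates, which matches how a float-to-uint8 cast behaves for in-range values.

```python
def to_bgr_uint8(pixel_rgb):
    """Convert one (r, g, b) pixel in [0, 1] to an 8-bit (b, g, r) tuple."""
    r, g, b = (int(255 * channel) for channel in pixel_rgb)
    return (b, g, r)


print(to_bgr_uint8((1.0, 0.5, 0.0)))  # -> (0, 127, 255)
```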


code source: https://pytorchvideo.org/docs/tutorial_torchhub_detection_inference