# Video Classification

Video classification is one of the many tasks in the field of _video understanding_, technologies that automatically extract information from video. You can read more about the great, wide world of video understanding in our blog post [An Introduction to Video Understanding: Capabilities and Applications](https://blog.fastforwardlabs.com/2021/12/14/an-introduction-to-video-understanding-capabilities-and-applications.html).

The goal of this notebook is to provide an introduction to video classification, including datasets and models.
The notebook consists of three main parts: 
- Setting up: Installation of the necessary packages, including TensorFlow, and importing the relevant libraries.
- The Data: An exploration of the [Kinetics Human Action Video Dataset](https://deepmind.com/research/open-source/kinetics) for action recognition.
- The Model: Experimentation with a pretrained version of the [I3D video classification model](https://deepmind.com/research/open-source/i3d-model) for action recognition, hosted on the TensorFlow Model Hub.

## Setting up
Install the necessary packages.

In [None]:
%%capture 
#hides cell output. disable for logs.
!pip install -r requirements.txt

In [1]:
from collections import defaultdict
import os
import glob
import numpy as np
import pandas as pd
import tensorflow as tf
from IPython import display

In order to make it easier to work with the dataset and model in this notebook, we created a small library of helper functions and classes. Here, we import the good bits. 

In [2]:
from vidbench.data.load import KineticsLoader
from vidbench.models import I3DLoader
from vidbench.predict import predict, store_results_in_dataframe, compute_accuracy, evaluate
from vidbench.visualize import make_video_table
from vidbench.data.process import resample_video, load_and_resize_video, video_acceptable

## The Data

In this notebook we make use of the DeepMind [Kinetics dataset](https://arxiv.org/abs/1705.06950), which consists of thousands of YouTube videos focused on human actions and interactions. While there are several versions of this dataset, we'll focus primarily on the Kinetics 400 dataset, which contains 400 human action classes, ranging from human-object interactions like playing instruments to human-human interactions like shaking hands. Each video clip is approximately ten seconds long and has been sourced from a unique YouTube video.


As of November 2021, the video files that make up the Kinetics datasets are stored at https://s3.amazonaws.com/kinetics/ as described in this [repository](https://github.com/cvdfoundation/kinetics-dataset). The Kinetics 400 dataset contains a total of 306,245 video clips, with at least 400 clips in each class. These are distributed among three splits: training, test, and validation. Each split consists of thousands of videos in `.mp4` format.

| Dataset split | Clips per class | Total Videos |
|---------------|-----------------|--------------|
| train         |  250-1000       |    246245    |
| test          |   100           |   40000      |
| val           |     50          |   20000      |


For each dataset split, videos are grouped into a series of directories and each directory is packaged as a `tar.gz` file that needs to be unpacked. Each of these `tar.gz` files contains about 1000 video clips.  In this notebook we'll explore a handful of videos from the validation set and we created a `KineticsLoader` class to handle downloading and unpacking these video files.  While this class is designed to handle the full validation (or test, or train) set, we can also use it to explore a small portion of the videos, which we'll demonstrate in the following cells. 

In [3]:
# this class handles the infrastructure to support downloading, unpacking, and pre-processing videos
loader = KineticsLoader(version="400", split="val")

No need to fetch, path already exists /home/cdsw/data/raw/kinetics/400/val/k400_val_path.txt
No need to fetch, path already exists /home/cdsw/data/raw/kinetics/400/val/val.csv


If you're exploring this notebook after performing automatic setup through the CML AMP interface, then we've already downloaded a chunk of videos to explore. If not, running the cell below will initiate a download and unpacking of (at least) 500 videos (recall that they are grouped in chunks of approximately 1000).

In [None]:
loader.download_n_videos(500)

With the code above we've downloaded just 1000 videos from the validation set (out of 20,000 available videos). We could download the entire validation set, but that would require at least 30-50 GB of storage! Since our goal in this notebook is simply to explore the video classification capability, 1000 videos is more than enough. This AMP also includes a benchmarking script that allows the user to evaluate a model on _all_ videos in any of the data splits discussed above (more on this towards the end of the notebook).

We've also downloaded the ground truth labels for _all_ 20K videos in the validation set. These are stored in a Pandas DataFrame which you can see below.

In [4]:
ground_truth_labels_df = pd.read_csv(f"{loader.data_dir}/val.csv")
ground_truth_labels_df

Unnamed: 0,label,youtube_id,time_start,time_end,split,is_cc
0,abseiling,0wR5jVB-WPk,417,427,val,0
1,abseiling,3caPS4FHFF8,36,46,val,0
2,abseiling,3yaoNwz99xM,62,72,val,1
3,abseiling,6IbvOJxXnOo,47,57,val,0
4,abseiling,6_4kjPiQr7w,191,201,val,0
...,...,...,...,...,...,...
19901,zumba,w5hbJLVhZDI,93,103,val,0
19902,zumba,xDd6uIBeMEA,1,11,val,0
19903,zumba,XWvGn7eI04A,12,22,val,0
19904,zumba,yGdQwxP5koA,83,93,val,0


As you can see, this DataFrame has nearly 20K rows, one for each video clip in the validation set. We need to filter this DataFrame to just those videos that we've actually downloaded. Our `KineticsLoader` keeps track of which videos we've downloaded. We can see all available pathnames with the following line. We're only showing the first 20, but there are 1000 video pathnames available.

In [5]:
loader.video_filenames[:20]

['/home/cdsw/data/raw/kinetics/400/val/part_0/0tlGOxUQ0Kw_000074_000084.mp4',
 '/home/cdsw/data/raw/kinetics/400/val/part_0/-05qSkAhM6Y_000205_000215.mp4',
 '/home/cdsw/data/raw/kinetics/400/val/part_0/-LISB_b8rIw_000049_000059.mp4',
 '/home/cdsw/data/raw/kinetics/400/val/part_0/--ILYNHl3e4_000541_000551.mp4',
 '/home/cdsw/data/raw/kinetics/400/val/part_0/-yv8c2CDbR8_000004_000014.mp4',
 '/home/cdsw/data/raw/kinetics/400/val/part_0/-aeOuOI3eN0_000219_000229.mp4',
 '/home/cdsw/data/raw/kinetics/400/val/part_0/-WZgMWx8Elk_000013_000023.mp4',
 '/home/cdsw/data/raw/kinetics/400/val/part_0/-Y-fUYGcb7o_000049_000059.mp4',
 '/home/cdsw/data/raw/kinetics/400/val/part_0/-beyXnxwTao_000040_000050.mp4',
 '/home/cdsw/data/raw/kinetics/400/val/part_0/-whdHn9Mbcc_000000_000010.mp4',
 '/home/cdsw/data/raw/kinetics/400/val/part_0/-CAPalSW0QI_000546_000556.mp4',
 '/home/cdsw/data/raw/kinetics/400/val/part_0/0HydDezkSU0_000018_000028.mp4',
 '/home/cdsw/data/raw/kinetics/400/val/part_0/-4hx9N2OhZo_000029

Next, we'll filter the ground truth DataFrame to include only those videos that have already been downloaded and are available locally.

In [5]:
available_youtube_ids = []
for filename in loader.video_filenames:
    pathname, vidname = os.path.split(filename)
    available_youtube_ids.append(vidname[:11]) #youtube IDs are 11 characters long

available_videos_metadata_df = ground_truth_labels_df[ground_truth_labels_df['youtube_id'].isin(available_youtube_ids)]
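The filter above relies on the clip filenames starting with an 11-character YouTube ID. As a small sketch, assuming every file follows the `<id>_<start>_<end>.mp4` pattern seen in the paths listed earlier, the helper below (a hypothetical `parse_clip_filename`, not part of `vidbench`) recovers the ID along with the clip's start and end times.

```python
import os


def parse_clip_filename(path):
    """Split a Kinetics clip filename into (youtube_id, start_sec, end_sec).

    Assumes the `<id>_<start>_<end>.mp4` naming seen in the paths above.
    """
    stem = os.path.splitext(os.path.basename(path))[0]
    # YouTube IDs can contain "_" themselves, so split from the right
    youtube_id, start, end = stem.rsplit("_", 2)
    return youtube_id, int(start), int(end)


example = "/home/cdsw/data/raw/kinetics/400/val/part_0/-05qSkAhM6Y_000205_000215.mp4"
print(parse_clip_filename(example))
```

Splitting from the right keeps IDs such as `-2csq_1UhMQ` intact, which a plain `split("_")` would break apart.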

In [6]:
available_videos_metadata_df

Unnamed: 0,label,youtube_id,time_start,time_end,split,is_cc
0,abseiling,0wR5jVB-WPk,417,427,val,0
50,air drumming,--nQbRBEz2s,104,114,val,0
99,answering questions,-egPJubR-CE,1,11,val,0
100,answering questions,-ejLPB4J4SM,106,116,val,0
101,answering questions,-emx5qjikEc,1,11,val,0
...,...,...,...,...,...,...
19811,yoga,-Gb1pbOE32g,32,42,val,0
19812,yoga,-IeJ0CF3huY,16,26,val,0
19813,yoga,-kSK2kqnTHA,22,32,val,0
19814,yoga,0wHOYxjRmlw,41,51,val,1


Much better! Now we have a DataFrame with only 1000 rows -- one for each locally available video clip. What did we end up downloading? Let's take a look at what classes we have. 

In [7]:
available_videos_metadata_df.label.value_counts()

massaging legs             10
scrambling eggs             9
eating chips                8
massaging person's head     8
washing hair                7
                           ..
long jump                   1
slacklining                 1
marching                    1
cleaning gutters            1
opening present             1
Name: label, Length: 360, dtype: int64

Out of 400 unique classes, we have videos from 360 of those. Most classes only have one representative video clip, though we do have a handful of video clips from classes like "massaging legs," and "scrambling eggs." 

Video classification is computationally expensive, so in the following cell we take a small, random sample of videos to explore for the remainder of this notebook. Because this is a random sample, each time you run this cell you'll get a new set of 8 videos to play with!

In [8]:
NUM_VIDEOS = 8

video_sample = available_videos_metadata_df.sample(NUM_VIDEOS)
video_sample

Unnamed: 0,label,youtube_id,time_start,time_end,split,is_cc
15778,smoking hookah,-0BveUV52cM,43,53,val,0
3390,climbing tree,-ELsDPpCYkA,2,12,val,0
7859,hugging,-FWfQhqFoYc,61,71,val,0
19262,waxing chest,-h3LsUJLK4Y,3,13,val,0
18164,trimming or shaving beard,-1FlCGo8M4E,574,584,val,0
4834,doing nails,-5hZhtMVn9A,83,93,val,0
8555,jumping into pool,-2csq_1UhMQ,2,12,val,0
10899,playing bagpipes,-Wx7UjNi3uU,10,20,val,0


There we go! We've selected a manageable batch of just 8 videos to examine. And we can see at a glance which classes our model will be attempting to predict. 

###  Visualize some videos

So what kind of videos are we dealing with?  In the cell below we display our video clip sample along with their ground truth class labels. As you play each video, notice that each is only approximately ten seconds long. 

In [9]:
video_html = make_video_table(loader.data_dir, video_sample['youtube_id'].values, video_sample['label'].values)
display.HTML(video_html)

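| Video | Ground truth label        |
|-------|---------------------------|
| 0     | smoking hookah            |
| 1     | climbing tree             |
| 2     | hugging                   |
| 3     | waxing chest              |
| 4     | trimming or shaving beard |
| 5     | doing nails               |
| 6     | jumping into pool         |
| 7     | playing bagpipes          |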


### Pre-processing

Now that we have a sense of what kind of videos we're working with, let's start classifying them! But before we do that, we have another step to perform -- preprocessing.  

The YouTube videos in the Kinetics dataset are all in `.mp4` format, but TensorFlow models cannot consume this format directly! We must convert the videos into a format that our TF model can work with. This requires two steps:
1. Convert the `.mp4` format to a more appropriate data structure, like NumPy arrays
2. Resize the video dimensions to work within model specifications

####  Video resizing 
While the first step is likely self-explanatory, the second step deserves some attention. Those with experience working with pre-trained image classification models are likely already familiar with the idea that these models require image inputs of a specific height and width. These requirements are determined during model training and set limits on how large (or small) an image must be in pixels in order for the model to process that image. Video classification is no different in this respect, but comes with a third dimension of complexity: time. 

Let's take a look at the dimensions of the videos in our sample batch. The following cells read each `.mp4` video clip into a NumPy array and print the shape of that array to the screen. The shape tuple has the following format:

(number of frames, height in pixels, width in pixels, number of color channels)

In [10]:
def load_video(path):
    """Convert video to Numpy array."""
    import cv2
    cap = cv2.VideoCapture(path)
    frames = []
    try:
        while True:
            ret, frame = cap.read()  # frame is in BGR format
            if not ret:
                break
            frames.append(frame)
    finally:
        cap.release()
    return np.array(frames).astype("float32") / 255.0

In [11]:
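# map each YouTube ID (the first 11 characters of the filename) to its local file path
path_by_id = {os.path.basename(f)[:11]: f for f in loader.video_filenames}
video_paths = [path_by_id[yt_id] for yt_id in video_sample['youtube_id'].values]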

for video_path in video_paths: 
    video_np = load_video(video_path)
    print(video_np.shape)

(76, 720, 406, 3)
(300, 320, 568, 3)
(250, 720, 1280, 3)
(300, 360, 204, 3)
(300, 720, 1280, 3)
(300, 720, 1280, 3)
(300, 360, 480, 3)
(300, 240, 320, 3)


As we can see, there's quite a bit of variation among the videos in our batch that wasn't apparent when we viewed the raw clips earlier. Although each clip is approximately 10 seconds long, most have 300 frames while others have fewer (as few as 76 in this sample). Some have small spatial dimensions (240 x 320) while others are quite large (720 x 1280). The only thing common to all videos is that each has three color channels (OpenCV returns them in BGR order rather than the familiar RGB).

This has implications for how we sample and process these video clips for model consumption. The model we'll use in this notebook requires spatial dimensions of (224 x 224) so we'll need to resize and crop each frame of each video to these dimensions. 
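To get a feel for why resizing matters beyond meeting the model's input requirements, here is a rough back-of-the-envelope sketch of how much memory a single clip occupies as a dense `float32` array before and after resizing. The helper `video_mebibytes` is a name made up for this illustration; the shapes come from the output printed above.

```python
def video_mebibytes(num_frames, height, width, channels=3, bytes_per_value=4):
    """Size in MiB of one clip stored as a dense float32 array."""
    return num_frames * height * width * channels * bytes_per_value / 2**20


raw = video_mebibytes(300, 720, 1280)     # largest shape printed above
resized = video_mebibytes(300, 224, 224)  # same clip with 224 x 224 frames

print(f"raw:     {raw:8.1f} MiB")
print(f"resized: {resized:8.1f} MiB")
print(f"ratio:   {raw / resized:8.1f}x")
```

A full-resolution 300-frame clip takes roughly 3 GiB in this representation, so cropping and resizing frames early keeps even a small batch manageable.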

But how do we deal with the temporal dimension, i.e., the number of frames? That depends on how we want to use our video classification model. The model has no specified limit on the number of frames it can accept (note, however, that more frames means longer processing time, which can become very computationally expensive!). If we send the model only one video clip at a time, we can feed it the full number of frames for inference. However, it's usually faster to send a batch to the model rather than sending each video separately. In that case, we need to deal with the variation in frame counts across these videos, because we cannot create a NumPy array in which one of the dimensions varies!

#### Video Resampling
The crux of the issue lies in the fact that different cameras have different frame capture rates. Higher quality videos have more frames per second (FPS) than lower quality videos, leading to a situation in which a collection of 10 second videos can nevertheless have different numbers of frames. Below we show a toy example of two videos that are each 10 seconds long, but the top one has fewer frames than the bottom one over that time span because it has a lower FPS.

![Upsampling](images/video_fps_example.jpg)

In order to create a batch of video clips, we must resample the clips so that each has the same number of frames. This can be accomplished either by _upsampling_ videos with low FPS, or _downsampling_ videos with high FPS. A simple way to perform upsampling is to duplicate certain frames throughout the length of the video. In the figure below, we see Video 1 has frames added to match the number of frames contained in Video 2. These frames are duplicates of existing frames in Video 1 (which is why some colors appear more than once).

<img src="images/upsample.gif">

In contrast, downsampling involves removing frames periodically throughout the video. This time, we remove a sample of unique frames from Video 2 so that it has the same number of frames as Video 1. 
<img src = "images/downsample.gif">

In either case, resampling should be done in such a way that the frames we duplicate or remove are equally dispersed throughout the duration of the 10 second clip in order to capture as much of the original motion as possible. 
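One simple way to spread the duplicated or dropped frames evenly is to compute, for each output frame, which source frame it should copy. The sketch below is only an illustration of that idea in plain Python; `resample_indices` is a hypothetical helper and is not necessarily how `resample_video` in `vidbench` is implemented.

```python
def resample_indices(num_source_frames, num_target_frames):
    """Return evenly spaced source-frame indices for a clip resampled to a fixed length.

    Downsampling skips source frames; upsampling repeats them.
    """
    if num_source_frames < 1 or num_target_frames < 1:
        raise ValueError("frame counts must be positive")
    return [
        (i * num_source_frames) // num_target_frames
        for i in range(num_target_frames)
    ]


print(resample_indices(4, 8))  # upsample:   [0, 0, 1, 1, 2, 2, 3, 3]
print(resample_indices(8, 4))  # downsample: [0, 2, 4, 6]
```

For a clip stored as a NumPy array with frames along the first axis, indexing the array with this list would produce the resampled clip.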

In a pinch we could simply grab a fixed-size chunk of consecutive frames (from the beginning, middle, or end) from each video clip. The problem here is that this chunk may only represent a portion of the full 10 second interval. The model would then attempt to infer on a batch of videos in which one of them has frames representing close to the full 10 seconds, while another has frames representing only a fraction of the time. This could increase the difficulty of the model to properly classify the latter video.  

#### Load videos into NumPy arrays

Luckily, our `KineticsLoader` class has a `load_videos` method, similar to the `load_video` function above, that also handles these pre-processing steps. Here's what we do under the hood:

1. First, a central square proportional to the size of the video is cropped
2. The cropped portion is resized to 224x224 pixels
3. Videos are resampled 
   - If the user provides `num_frames`, all videos are resampled to have this many frames (upsampling those that have lower FPS, downsampling those with higher FPS)
   - If `num_frames` is not provided, the algorithm determines which video in the batch has the lowest FPS (fewest frames) and all videos are downsampled to match this frame rate
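
The first step in the list above can be sketched as a small calculation. We assume here that the "central square" is the largest square that fits in the frame, centered on both axes; the helper `center_square_box` is hypothetical and may differ from what `load_and_resize_video` actually does.

```python
def center_square_box(height, width):
    """Return (top, left, side) of the largest centered square in a frame."""
    side = min(height, width)
    top = (height - side) // 2
    left = (width - side) // 2
    return top, left, side


# frame sizes taken from the shapes printed earlier in the notebook
for height, width in [(720, 1280), (720, 406), (240, 320)]:
    top, left, side = center_square_box(height, width)
    print(f"{height} x {width}: rows {top}:{top + side}, cols {left}:{left + side}")
```

Cropping before resizing preserves the aspect ratio of whatever is in the middle of the frame, at the cost of discarding the edges of wide or tall videos.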
   

Again we note that video clips with more total frames will take longer to process through the model. For the current dataset, `num_frames = 128` is a reasonable choice. For reference, the maximum number of frames that we've seen in the dataset is 300 (30 FPS).

In [12]:
def load_videos(youtube_ids, num_frames):
    """Create a NumPy array holding a batch of videos from a list of YouTube IDs."""
    # look up each clip's local path by its YouTube ID (the first 11 characters of the filename)
    path_by_id = {os.path.basename(f)[:11]: f for f in loader.video_filenames}

    video_batch = []
    for yt_id in youtube_ids:
        # crop a central square, resize it, and add a leading batch dimension
        video_np = load_and_resize_video(path_by_id[yt_id], resize_type="crop")
        video_np = np.expand_dims(video_np, axis=0)
        video_batch.append(video_np)

    # resample every clip to the same number of frames so they can be concatenated
    video_batch_resampled = []
    for video in video_batch:
        resampled_video = resample_video(video, num_frames)
        video_batch_resampled.append(resampled_video)

    return np.concatenate(video_batch_resampled, axis=0).astype("float32")

In [13]:
num_frames = 128   
videos_tensor = load_videos(video_sample['youtube_id'].values, num_frames=num_frames)

In [14]:
videos_tensor.shape

(8, 128, 224, 224, 3)

We now have a batch of 8 video clips in a format that our model will understand. 

It's time to classify!

## The Model

In this notebook we make use of the Inflated 3D ConvNet (I3D) video classification model. This model architecture was introduced in 2017 and provided state-of-the-art results for video action classification for multiple datasets. You can read more about this model in the original paper, [Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset](https://arxiv.org/abs/1705.07750). 

Since its inception, multiple pre-trained versions of the I3D model have become publicly available on the [TensorFlow Model Hub](https://www.tensorflow.org/hub). The original version was pre-trained on the Kinetics 400 dataset, which we explored above. Another version was trained on the Kinetics 600 dataset. We created an `I3DLoader` class to handle model loading from the TF Model Hub. You can choose either the `kinetics-400` or `kinetics-600` version.

The biggest difference between these two models is in the number of classes they predict. I3D trained on Kinetics 400 predicts, you guessed it, 400 different classes, while its counterpart (I3D trained on Kinetics 600) predicts 600 classes. While we didn't discuss the Kinetics 600 dataset in detail, it is essentially a superset of the Kinetics 400 dataset -- it includes all 400 labels from that dataset, plus an additional 200 unique classes. Either can be used to make inferences on our sample of videos, but keep in mind that more classes make it harder to predict the correct one -- there are simply more options for the model to choose from.

Below we load up the I3D model trained on the Kinetics 400 dataset. 

In [15]:
i3d400 = I3DLoader(trained_on='kinetics-400')

Let's look at what kinds of predictions this model can make. 

In [13]:
i3d400.labels[:20]

['abseiling',
 'air drumming',
 'answering questions',
 'applauding',
 'applying cream',
 'archery',
 'arm wrestling',
 'arranging flowers',
 'assembling computer',
 'auctioning',
 'baby waking up',
 'baking cookies',
 'balloon blowing',
 'bandaging',
 'barbequing',
 'bartending',
 'beatboxing',
 'bee keeping',
 'belly dancing',
 'bench pressing']

### Get model predictions

Now that we have data in the proper format, ground truth labels, and a model loaded and ready to go -- it's time to make predictions! This notebook was developed on CPUs, so the following cell can take some time to run. One way to speed up inference is to resample the videos so that each has fewer frames, as we discussed above. The downside is that performance will likely degrade, since the model will have fewer frames on which to base a prediction.

In [16]:
scores, predictions, _, _ = predict(videos_tensor, i3d400, verbose=False)

The output of our `predict` function includes the probabilities associated with each of the 400 labels and the 400 labels themselves for each video clip in our batch, sorted in descending order so that the most likely classes (highest probabilities) are at the top of the list. 
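
To make the sorted output concrete, here is a minimal sketch of ranking labels by probability for a single video. The probabilities are made up for illustration, and `rank_labels` is a hypothetical helper rather than the code inside `predict`.

```python
def rank_labels(probabilities, labels, top_k=None):
    """Pair each label with its probability and sort from most to least likely."""
    ranked = sorted(zip(probabilities, labels), reverse=True)
    return ranked[:top_k]  # top_k=None keeps the full ranking


# made-up probabilities over four labels, for illustration only
labels = ["abseiling", "archery", "applauding", "air drumming"]
probabilities = [0.10, 0.65, 0.05, 0.20]

for score, label in rank_labels(probabilities, labels, top_k=2):
    print(f"{label}: {score:.2f}")
```

With 400 classes the idea is the same: the first entry of each row is the model's top-1 prediction, and the first five entries are its top-5 predictions.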

In [17]:
predictions

array([['playing harmonica', 'eating hotdog', 'eating burger', ...,
        'skiing crosscountry', 'skiing slalom', 'riding mule'],
       ['rock climbing', 'climbing a rope', 'climbing tree', ...,
        'playing controller', 'building cabinet', 'tapping pen'],
       ['headbutting', 'slapping', 'pumping fist', ...,
        'breading or breadcrumbing', 'paragliding', 'golf chipping'],
       ...,
       ['washing hands', 'doing nails', 'making a cake', ...,
        'trapezing', 'shearing sheep', 'riding mule'],
       ['jumping into pool', 'somersaulting', 'springboard diving', ...,
        'making sushi', 'sharpening knives', 'getting a tattoo'],
       ['playing bagpipes', 'singing', 'paragliding', ...,
        'dribbling basketball', 'water sliding',
        'catching or throwing softball']], dtype='<U39')

In [18]:
scores

array([[1.86612219e-01, 1.76524267e-01, 1.00085415e-01, ...,
        1.80657391e-08, 1.39200260e-08, 1.12994956e-08],
       [6.39056742e-01, 1.96823344e-01, 1.34719953e-01, ...,
        6.79193767e-12, 5.81971060e-12, 3.76654307e-12],
       [3.28819394e-01, 1.41055599e-01, 6.81093484e-02, ...,
        1.54618149e-07, 1.00298585e-07, 1.00217790e-07],
       ...,
       [2.26052657e-01, 2.09268406e-01, 1.23839580e-01, ...,
        7.19659166e-10, 6.01304562e-10, 4.23864638e-10],
       [8.40871334e-01, 7.77837187e-02, 4.24317531e-02, ...,
        4.25292288e-13, 3.77804206e-13, 3.25946111e-13],
       [9.98535514e-01, 1.07667851e-03, 1.44116653e-04, ...,
        1.79997996e-11, 1.42494471e-11, 3.49846389e-12]], dtype=float32)

Let's store our results as a Pandas DataFrame to make them easier to work with. This DataFrame keeps only the top five model predictions for each video. The `video_id` index refers to the numbering of the videos in the visualization section above.

In [19]:
# collect metadata and model results
results = defaultdict(list)

results['YouTube_Id'] = video_sample['youtube_id'].values
results['Ground_Truth'] = video_sample['label'].values

for s, p in zip(scores, predictions):
    results['scores'].append(list(s[:5]))
    results['preds'].append(list(p[:5]))

results_df = store_results_in_dataframe(results)

In [20]:
results_df

video_id,YouTube_Id,Ground_Truth,pred_1,pred_2,pred_3,pred_4,pred_5,score_1,score_2,score_3,score_4,score_5
0,-0BveUV52cM,smoking hookah,playing harmonica,eating hotdog,eating burger,playing poker,smoking,0.186612,0.176524,0.100085,0.0887486,0.04868247
1,-ELsDPpCYkA,climbing tree,rock climbing,climbing a rope,climbing tree,abseiling,pull ups,0.639057,0.196823,0.13472,0.01998034,0.00260625
2,-FWfQhqFoYc,hugging,headbutting,slapping,pumping fist,hugging,kissing,0.328819,0.141056,0.068109,0.06260131,0.04924914
3,-h3LsUJLK4Y,waxing chest,waxing chest,waxing legs,waxing back,shaving legs,tickling,0.999979,1.6e-05,4e-06,2.191966e-07,1.845836e-07
4,-1FlCGo8M4E,trimming or shaving beard,trimming or shaving beard,brush painting,brushing teeth,shaving head,doing nails,0.686045,0.241118,0.017078,0.0163433,0.007074527
5,-5hZhtMVn9A,doing nails,washing hands,doing nails,making a cake,applying cream,cleaning shoes,0.226053,0.209268,0.12384,0.1200006,0.08733294
6,-2csq_1UhMQ,jumping into pool,jumping into pool,somersaulting,springboard diving,swimming butterfly stroke,cartwheeling,0.840871,0.077784,0.042432,0.01513089,0.01438927
7,-Wx7UjNi3uU,playing bagpipes,playing bagpipes,singing,paragliding,playing clarinet,playing saxophone,0.998536,0.001077,0.000144,7.023879e-05,4.132017e-05


## Evaluating the model
So how well did our model do? It may seem natural to measure the model's accuracy using only its top prediction for each video, and we'll look at that first. However, when working with hundreds of classes, subtleties arise. For example, some classes are easily confused -- "catching or throwing a softball" and "catching or throwing a baseball" are both classes, but it may be difficult for the model to discern the type of ball in a low quality video, or if the ball only takes up a handful of pixels in a wide-shot video. Additionally, videos can contain more than one action -- "texting" while "driving a car" (don't do that!) or "hula hooping" while "playing ukulele". The Kinetics 400 dataset only provides a single ground-truth label for each video, rather than an exhaustive list of annotations. For this reason, the authors recommend evaluating model performance on top-5 accuracy, rather than top-1.

### Visualize videos again

Let's first examine the model's top-1 accuracy by considering our video visualization. Our visualization helper function can also accept and display the model's top prediction beneath each video. If the model's prediction does not match the ground truth label, the text will display red. 

In [21]:
video_html = make_video_table(loader.data_dir, results_df['YouTube_Id'], results_df['Ground_Truth'], results_df['pred_1'])
display.HTML(video_html)

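| Video | Ground truth label        | Top-1 prediction          |
|-------|---------------------------|---------------------------|
| 0     | smoking hookah            | playing harmonica         |
| 1     | climbing tree             | rock climbing             |
| 2     | hugging                   | headbutting               |
| 3     | waxing chest              | waxing chest              |
| 4     | trimming or shaving beard | trimming or shaving beard |
| 5     | doing nails               | washing hands             |
| 6     | jumping into pool         | jumping into pool         |
| 7     | playing bagpipes          | playing bagpipes          |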


While it may seem alarming that many of these labels are red, consider them in light of the discussion above -- how many of these top-1 labels might be a reasonable description of the video clip, even if they don't match the ground truth label? Are any of these labels possible points of confusion, or are there multiple actions in the scene?

### Model accuracy on our very small sample

Finally, let's consider the top-5 accuracy. In this case, we score the model as being "correct" if the ground truth label is within the model's top-5 predictions for a given video. 

In [22]:
accuracy_top_1 = compute_accuracy(results_df, num_top_classes = 1)
accuracy_top_5 = compute_accuracy(results_df, num_top_classes = 5)

top-1 accuracy: 50.00%
top-5 accuracy: 87.50%


As we can see, the model's accuracy improves when we consider the top-5 predictions. While this approach doesn't make sense in every circumstance, due to the nature of the Kinetics dataset and its annotations, this method is certainly valid in this case.
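
The top-k rule is simple enough to sketch in a few lines. The example below re-creates the calculation from the ground truth labels and top-5 predictions shown in the results table above; `top_k_accuracy` is a hypothetical stand-in for illustration and not the `compute_accuracy` implementation from `vidbench`.

```python
def top_k_accuracy(ground_truth, ranked_predictions, k):
    """Fraction of videos whose true label appears in the first k predictions."""
    hits = sum(
        truth in preds[:k]
        for truth, preds in zip(ground_truth, ranked_predictions)
    )
    return hits / len(ground_truth)


# ground truth labels and top-5 predictions copied from the results table above
ground_truth = [
    "smoking hookah", "climbing tree", "hugging", "waxing chest",
    "trimming or shaving beard", "doing nails", "jumping into pool",
    "playing bagpipes",
]
ranked_predictions = [
    ["playing harmonica", "eating hotdog", "eating burger", "playing poker", "smoking"],
    ["rock climbing", "climbing a rope", "climbing tree", "abseiling", "pull ups"],
    ["headbutting", "slapping", "pumping fist", "hugging", "kissing"],
    ["waxing chest", "waxing legs", "waxing back", "shaving legs", "tickling"],
    ["trimming or shaving beard", "brush painting", "brushing teeth", "shaving head", "doing nails"],
    ["washing hands", "doing nails", "making a cake", "applying cream", "cleaning shoes"],
    ["jumping into pool", "somersaulting", "springboard diving", "swimming butterfly stroke", "cartwheeling"],
    ["playing bagpipes", "singing", "paragliding", "playing clarinet", "playing saxophone"],
]

print(f"top-1 accuracy: {top_k_accuracy(ground_truth, ranked_predictions, 1):.2%}")
print(f"top-5 accuracy: {top_k_accuracy(ground_truth, ranked_predictions, 5):.2%}")
```

Note that the match is on the exact label string, which is why "smoking" in fifth place does not count as a hit for "smoking hookah".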

## Evaluating over many videos

So far, everything we've done has been to explore the capability of video classification with a small, concrete example. However, in practice, there are several models and many datasets that one might consider when building a real-world video classification application. In that case, one will need to evaluate different models over various datasets in order to gauge which is most appropriate for the application in question. This is a model benchmarking job. To that end, we've created some additional utilities to facilitate model evaluation over a much larger portion of the Kinetics datasets. Included in this AMP is a benchmarking script that can be automated via the CML Jobs abstraction (or by a simple bash script, if that's your style). Here we provide a quick example of the core utilities contained within that script -- namely, the ability to load, pre-process, and batch over a larger portion of videos.

We break this task into two parts: loading and caching videos, and evaluation. The first step is handled by our `KineticsLoader` class, and performs essentially the same steps as our `load_videos` function above. Specifically, this will load the raw `.mp4` files into NumPy arrays and perform essential pre-processing, such as cropping and resizing to I3D specifications. Once loaded, these video examples are cached (saved to disk) in their NumPy format so that they can be reused in any downstream evaluation process.

In [5]:
num_videos = 64
loader.load_and_cache_video_examples(num_videos)

Processing /home/cdsw/data/raw/kinetics/400/val/part_0/0tlGOxUQ0Kw_000074_000084.mp4
Processing /home/cdsw/data/raw/kinetics/400/val/part_0/-05qSkAhM6Y_000205_000215.mp4
Processing /home/cdsw/data/raw/kinetics/400/val/part_0/-LISB_b8rIw_000049_000059.mp4
Processing /home/cdsw/data/raw/kinetics/400/val/part_0/--ILYNHl3e4_000541_000551.mp4
Processing /home/cdsw/data/raw/kinetics/400/val/part_0/-yv8c2CDbR8_000004_000014.mp4
Processing /home/cdsw/data/raw/kinetics/400/val/part_0/-aeOuOI3eN0_000219_000229.mp4
Processing /home/cdsw/data/raw/kinetics/400/val/part_0/-WZgMWx8Elk_000013_000023.mp4
Processing /home/cdsw/data/raw/kinetics/400/val/part_0/-Y-fUYGcb7o_000049_000059.mp4
Processing /home/cdsw/data/raw/kinetics/400/val/part_0/-beyXnxwTao_000040_000050.mp4
Processing /home/cdsw/data/raw/kinetics/400/val/part_0/-whdHn9Mbcc_000000_000010.mp4
Processing /home/cdsw/data/raw/kinetics/400/val/part_0/-CAPalSW0QI_000546_000556.mp4
Processing /home/cdsw/data/raw/kinetics/400/val/part_0/0HydDezkSU

The second step is an evaluation function that encapsulates all of the prediction steps we performed above. This function accepts a model and data loader class and infers on the requested number of videos, grouping them into the given `batch_size` after resampling to `num_frames` number of frames for each video. The results are stored in a Pandas DataFrame so that we can consider the overall model performance. 

In [6]:
results_df = evaluate(
    i3d400,
    loader, 
    num_videos=num_videos, 
    batch_size=8, 
    top_n_results=5, 
    num_frames=100, 
    savefile="small_sample_results_i3d.csv"
)
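
The batching that `evaluate` performs can be pictured with a small generator. This is only a sketch of the general pattern under the assumption that videos are processed in consecutive groups; `iter_batches` is a hypothetical helper, not a function from `vidbench`.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items`, each holding at most `batch_size` entries."""
    if batch_size < 1:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# stand-in for 64 cached videos, grouped 8 at a time as in the call above
video_indices = list(range(64))
batches = list(iter_batches(video_indices, batch_size=8))
print(f"{len(batches)} batches of {len(batches[0])} videos")
```

When the number of videos is not a multiple of the batch size, the final batch is simply smaller than the rest.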

In [7]:
results_df

video_id,YouTube_Id,Ground_Truth,pred_1,pred_2,pred_3,pred_4,pred_5,score_1,score_2,score_3,score_4,score_5
0,-1Hub6Ps_cc,washing hands,washing hands,washing hair,washing dishes,taking a shower,shaving legs,0.999894,0.000098,0.000007,7.551486e-07,5.958571e-07
1,-LISB_b8rIw,making sushi,arranging flowers,opening present,setting table,wrapping present,eating spaghetti,0.169844,0.110486,0.062749,4.887363e-02,3.537429e-02
2,-0r6NmrdKCU,making tea,making tea,setting table,cleaning toilet,making a cake,shredding paper,0.527350,0.123318,0.079832,6.867204e-02,3.120951e-02
3,-CAPalSW0QI,arranging flowers,arranging flowers,setting table,decorating the christmas tree,folding napkins,cleaning windows,0.995011,0.004133,0.000237,1.811463e-04,1.031628e-04
4,--ILYNHl3e4,digging,reading newspaper,skiing (not slalom or crosscountry),skiing slalom,writing,skiing crosscountry,0.266331,0.127142,0.117140,1.019598e-01,7.865583e-02
...,...,...,...,...,...,...,...,...,...,...,...,...
59,0xAB67W5GS4,scuba diving,cleaning windows,pumping gas,blasting sand,taking a shower,scuba diving,0.146501,0.103552,0.076591,7.604447e-02,4.983969e-02
60,-aeOuOI3eN0,cutting nails,bandaging,waxing legs,somersaulting,washing feet,cleaning shoes,0.039472,0.035430,0.028659,2.356781e-02,2.087780e-02
61,018EClOtVTM,situp,situp,exercising with an exercise ball,exercising arm,throwing ball,stretching arm,0.999537,0.000294,0.000083,5.420654e-05,1.453274e-05
62,-_hu_Ld-ddk,peeling potatoes,clay pottery making,washing hands,making tea,rock scissors paper,peeling potatoes,0.625126,0.129556,0.057470,5.311957e-02,2.710894e-02


In [8]:
accuracy_top_1 = compute_accuracy(results_df, num_top_classes = 1)
accuracy_top_5 = compute_accuracy(results_df, num_top_classes = 5)

top-1 accuracy: 56.25%
top-5 accuracy: 75.00%


To perform evaluation on larger datasets, please check [scripts/evaluate.py](scripts/evaluate.py). Instructions can be found in the [scripts/README.md](README.md) file.

**If this documentation includes code, including but not limited to, code examples, Cloudera makes this available to you under the terms of the Apache License, Version 2.0, including any required notices. A copy of the Apache License Version 2.0 can be found here.**