In [1]:
%matplotlib inline

Getting Started with Pre-trained I3D Models on Kinetics400
=============================================================

`Kinetics400` is an action recognition dataset
of realistic action videos, collected from YouTube. With about 240K training and 20K validation short trimmed videos
from 400 action categories, it is one of the most widely used datasets in the research
community for benchmarking state-of-the-art video action recognition models.

In this tutorial, we will demonstrate how to load a pre-trained model from `gluoncv-model-zoo`
and classify videos from the Internet or your local disk into one of the 400 action classes.

Step by Step
------------

We will show two examples here. For simplicity, we first try out a pre-trained Kinetics400 model
on a single video clip.

First, please follow the `installation guide <../../index.html#installation>`__
to install ``MXNet`` and ``GluonCV`` if you haven't done so yet.


In [None]:
import matplotlib.pyplot as plt
import numpy as np
import mxnet as mx
from mxnet import gluon, nd, image
from mxnet.gluon.data.vision import transforms
from gluoncv.data.transforms import video
from gluoncv import utils
from gluoncv.model_zoo import get_model

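Then, we download the example video and show its first frame:



In [None]:
# Decoding the .mp4 needs a video reader; this assumes the `decord`
# package is installed (e.g. `pip install decord`).
from decord import VideoReader

url = 'https://github.com/bryanyzhu/tiny-kinetics400/raw/master/abseiling.mp4'
video_fname = utils.download(url)

# image.imread cannot decode a video container, so read frame 0 with decord.
vr = VideoReader(video_fname)
img = vr[0]  # H x W x 3 frame; exposes .asnumpy() like an MXNet NDArray

plt.imshow(img.asnumpy())
plt.show()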

In case you don't recognize it, the frame shows a man abseiling :)

Now we define the transformations for the frame.



In [None]:
transform_fn = transforms.Compose([
    video.VideoCenterCrop(size=224),
    video.VideoToTensor(),
    video.VideoNormalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
])

This transformation function does three things:
it center-crops the image to 224x224,
transposes it to ``num_channels*height*width`` order,
and normalizes it with the mean and standard deviation calculated across all ImageNet images.
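
To make the three steps concrete, here is a minimal NumPy-only sketch of what such a pipeline does to a single frame. It is an illustration, not GluonCV's implementation: the helper names (``center_crop``, ``to_tensor``, ``normalize``) and the random 256x340 dummy frame are assumptions, and the division by 255 assumes 8-bit input that is rescaled to [0, 1] during tensor conversion.

```python
import numpy as np

# ImageNet channel statistics, the same values passed to VideoNormalize.
MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32).reshape(3, 1, 1)
STD = np.array([0.229, 0.224, 0.225], dtype=np.float32).reshape(3, 1, 1)

def center_crop(frame, size):
    """Cut a size x size window from the middle of an H x W x C frame."""
    h, w = frame.shape[:2]
    top = (h - size) // 2
    left = (w - size) // 2
    return frame[top:top + size, left:left + size]

def to_tensor(frame):
    """Reorder H x W x C to C x H x W and rescale 8-bit values to [0, 1]."""
    return np.transpose(frame, (2, 0, 1)).astype(np.float32) / 255.0

def normalize(tensor):
    """Subtract the per-channel mean and divide by the per-channel std."""
    return (tensor - MEAN) / STD

# Hypothetical stand-in for a decoded video frame (height 256, width 340, RGB).
rng = np.random.default_rng(0)
dummy_frame = rng.integers(0, 256, size=(256, 340, 3), dtype=np.uint8)

out = normalize(to_tensor(center_crop(dummy_frame, 224)))
print(out.shape, out.dtype)
```

After normalization the values are no longer confined to [0, 1], which is why the transformed frame looks odd when plotted directly.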

What does the transformed image look like?



In [None]:
img_list = transform_fn([img.asnumpy()])
plt.imshow(np.transpose(img_list[0], (1,2,0)))
plt.show()

Can't recognize anything? *Don't panic!* Neither can I.
The transformation makes the image "model-friendly" rather than "human-friendly".
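
If you would like a human-friendly view again, the normalization can be inverted before plotting. The sketch below uses NumPy only; ``denormalize`` is a hypothetical helper, not a GluonCV function, and it assumes the frame was scaled to [0, 1] before the ImageNet mean and standard deviation were applied.

```python
import numpy as np

MEAN = np.array([0.485, 0.456, 0.406]).reshape(3, 1, 1)
STD = np.array([0.229, 0.224, 0.225]).reshape(3, 1, 1)

def denormalize(tensor):
    """Map a normalized C x H x W array back to an H x W x C array in [0, 1]."""
    restored = tensor * STD + MEAN
    restored = np.clip(restored, 0.0, 1.0)
    return np.transpose(restored, (1, 2, 0))

# Round trip on a hypothetical frame whose values are already in [0, 1].
rng = np.random.default_rng(0)
original = rng.random((3, 224, 224))
normalized = (original - MEAN) / STD
viewable = denormalize(normalized)
print(viewable.shape)
```

Under the same assumption, ``plt.imshow(denormalize(img_list[0]))`` should display the cropped frame with its natural colors.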

Next, we load a pre-trained model.



In [None]:
net = get_model('i3d_resnet50_v1_kinetics400', nclass=400, pretrained=True)

Note that if you want to use an InceptionV3-series model, please resize the image so that
both dimensions are larger than 299 (e.g., 340x450) and change the input size from 224 to 299
in the transform function.

Finally, we prepare the image and feed it to the model.
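
The 3D convolutions in I3D consume a short clip rather than one image, laid out as ``(batch, channels, frames, height, width)``. As a rough sketch of how transformed frames can be assembled into that layout, the NumPy snippet below samples every second frame from a dummy video. The clip length of 32, the stride of 2, the tiny 16x16 frame size, and the helper names are illustrative assumptions, not values taken from this tutorial.

```python
import numpy as np

def sample_frame_ids(num_frames, clip_len=32, stride=2):
    """Pick clip_len evenly strided frame indices, clamped to the video length."""
    ids = np.arange(clip_len) * stride
    return np.minimum(ids, num_frames - 1)

def frames_to_clip(frames):
    """Stack C x H x W frames into a 1 x C x T x H x W batch."""
    clip = np.stack(frames, axis=1)  # C x T x H x W
    return clip[np.newaxis].astype(np.float32)

# Hypothetical video: 70 already-transformed frames, kept small (3 x 16 x 16)
# so the example stays light; real frames would be 3 x 224 x 224.
rng = np.random.default_rng(0)
video_frames = rng.standard_normal((70, 3, 16, 16)).astype(np.float32)

ids = sample_frame_ids(len(video_frames))
clip = frames_to_clip([video_frames[i] for i in ids])
print(clip.shape)
```

Clamping the indices means that a video shorter than the requested span simply repeats its last frame instead of raising an index error.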



In [None]:
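# I3D uses 3D convolutions and expects a 5D batch laid out as
# (batch, channels, frames, height, width). Only one frame was transformed
# above, so as a stand-in we repeat it 32 times to form a static clip
# (32 is an assumed clip length).
clip = np.repeat(img_list[0][:, np.newaxis], 32, axis=1)
pred = net(nd.array(clip).expand_dims(axis=0))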

classes = net.classes
topK = 5
ind = nd.topk(pred, k=topK)[0].astype('int')
print('The input video frame is classified to be')
for i in range(topK):
    print('\t[%s], with probability %.3f.'%
          (classes[ind[i].asscalar()], nd.softmax(pred)[0][ind[i]].asscalar()))

We can see that our pre-trained model classifies this video frame
as the ``abseiling`` action with high confidence.
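
For readers who want to see what the ``topk`` and ``softmax`` calls compute, here is a NumPy-only sketch on made-up scores. The five class names and logits are invented for illustration and are not model output; ``softmax`` and ``top_k`` are hypothetical helpers.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def top_k(logits, k):
    """Return (indices, probabilities) of the k highest-scoring classes."""
    probs = softmax(logits)
    order = np.argsort(probs)[::-1][:k]
    return order, probs[order]

# Made-up class names and raw scores for a single clip.
classes = ['abseiling', 'rock_climbing', 'ice_climbing',
           'bungee_jumping', 'paragliding']
logits = np.array([6.0, 3.5, 2.0, 0.5, -1.0])

indices, probs = top_k(logits, k=3)
for idx, p in zip(indices, probs):
    print('\t[%s], with probability %.3f.' % (classes[idx], p))
```

Because softmax preserves the ordering of the raw scores, ranking by logits or by probabilities selects the same classes; the softmax is only needed to report values that sum to one.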

