# Action Recognition with an Inflated 3D CNN (using TF Hub)


## General Instructions

This assignment is **individual**. The submission format is the **.ipynb file with all cells executed**. All questions must be answered in text cells. The _output_ of a code cell will not be accepted as an answer.

**Name:** FRANCISCO MENA

**Due date: April 28, 2021.**

This assignment has several sections, followed by one or more activities at the end. Some activities involve writing code and others involve answering questions.

**Important.** To make the notebook easier to run, each section can be executed independently.

You are **strongly** encouraged to review the sections that provide code, because some coding activities can reuse that code with changes to only a few lines.

The assignment must be submitted **individually**; otherwise you will receive the minimum grade (1). You must also enter your name where indicated, or the assignment will not be graded.

## 1.0 Introduction

The goal of human activity recognition is to identify the activities performed in video sequences or still images. One of the most important models in this area is described in the paper "[Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset](https://arxiv.org/abs/1705.07750)" by Joao Carreira and Andrew Zisserman, published at CVPR 2017.

The paper introduced a new architecture for video classification, the Inflated 3D ConvNet (I3D). After fine-tuning, I3D models achieved state-of-the-art results on the UCF101 and HMDB51 datasets. I3D models pre-trained on Kinetics also placed first in the CVPR 2017 [Charades challenge](http://vuchallenge.org/charades.html).

The original module was trained on the [Kinetics-400 dataset](https://deepmind.com/research/open-source/open-source-datasets/kinetics/)
and recognizes 400 different actions.

<figure>
<center>
<img src='https://neilyongyangnie.files.wordpress.com/2018/08/model-2.png?w=1088' width="900" />
</center>
</figure>




In this tutorial we will use the I3D model to recognize activities in videos from the UCF101 dataset and in our own data. We will use TensorFlow Hub because it makes the pre-trained model easy to load and run.

Based on: https://tfhub.dev/deepmind/i3d-kinetics-400/1

## 2.0 Setup

In [None]:
!pip install -q imageio
!pip install -q opencv-python
!pip install -q git+https://github.com/tensorflow/docs

In [None]:
# TensorFlow and TF-Hub modules.
from absl import logging

import tensorflow as tf
import tensorflow_hub as hub
from tensorflow_docs.vis import embed

logging.set_verbosity(logging.ERROR)

# Some modules to help with reading the UCF101 dataset.
import random
import re
import os
import tempfile
import ssl
import cv2
import numpy as np

# Some modules to display an animation using imageio.
import imageio
from IPython import display

from urllib import request  # requires python3

### 2.1 Defining utilities to retrieve the UCF101 dataset

The following functions fetch videos of the UCF101 dataset from the original repository.

- `list_ucf_videos` returns the list of available videos.
- `fetch_ucf_video` downloads one video, given its file name, and returns the local path.

In [None]:
# Utilities to fetch videos from UCF101 dataset
UCF_ROOT = "https://www.crcv.ucf.edu/THUMOS14/UCF101/UCF101/"
_VIDEO_LIST = None
_CACHE_DIR = tempfile.mkdtemp()
# As of July 2020, crcv.ucf.edu doesn't use a certificate accepted by the
# default Colab environment anymore.
unverified_context = ssl._create_unverified_context()

def list_ucf_videos():
  """Lists videos available in UCF101 dataset."""
  global _VIDEO_LIST
  if not _VIDEO_LIST:
    index = request.urlopen(UCF_ROOT, context=unverified_context).read().decode("utf-8")
    videos = re.findall(r"(v_[\w_]+\.avi)", index)
    _VIDEO_LIST = sorted(set(videos))
  return list(_VIDEO_LIST)

def fetch_ucf_video(video):
  """Fetches a video and caches it in the local filesystem."""
  cache_path = os.path.join(_CACHE_DIR, video)
  if not os.path.exists(cache_path):
    urlpath = request.urljoin(UCF_ROOT, video)
    print("Fetching %s => %s" % (urlpath, cache_path))
    data = request.urlopen(urlpath, context=unverified_context).read()
    # Use a context manager so the file is closed (and flushed) promptly.
    with open(cache_path, "wb") as f:
      f.write(data)
  return cache_path
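
The pattern used by `list_ucf_videos` can be checked offline. The sketch below applies it to a made-up fragment of a directory listing (the HTML is an assumption for illustration, not the actual page served by the UCF repository) and shows why the matches are deduplicated.

```python
import re

# Made-up fragment of an HTML directory listing (an assumption for illustration).
sample_index = """
<a href="v_ApplyEyeMakeup_g01_c01.avi">v_ApplyEyeMakeup_g01_c01.avi</a>
<a href="v_PlayingViolin_g01_c01.avi">v_PlayingViolin_g01_c01.avi</a>
<a href="readme.txt">readme.txt</a>
"""

# Same pattern as in list_ucf_videos, written as a raw string.
matches = re.findall(r"(v_[\w_]+\.avi)", sample_index)

# Each file name appears twice per row (href and link text), so deduplicate.
sample_videos = sorted(set(matches))
print(sample_videos)
```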

The following functions allow you to load a video to feed the model.

In [None]:
# Utilities to open video files using CV2
def crop_center_square(frame):
  y, x = frame.shape[0:2]
  min_dim = min(y, x)
  start_x = (x // 2) - (min_dim // 2)
  start_y = (y // 2) - (min_dim // 2)
  return frame[start_y:start_y+min_dim,start_x:start_x+min_dim]

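def load_video(path, max_frames=0, resize=(224, 224)):
  """Loads a video as a float array of RGB frames scaled to [0, 1]."""
  cap = cv2.VideoCapture(path)
  frames = []
  try:
    while True:
      ret, frame = cap.read()
      if not ret:
        # No more frames (or the file could not be read).
        break
      frame = crop_center_square(frame)
      frame = cv2.resize(frame, resize)
      # OpenCV decodes frames as BGR; reorder the channels to RGB.
      frame = frame[:, :, [2, 1, 0]]
      frames.append(frame)

      # max_frames=0 never matches here, so it means "read every frame".
      if len(frames) == max_frames:
        break
  finally:
    cap.release()
  return np.array(frames) / 255.0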
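
To see what the preprocessing does without opening a real video, here is a minimal sketch on a synthetic frame. It uses only NumPy; the frame size and colour are made up, and `crop_center_square` is copied from above so the cell runs on its own.

```python
import numpy as np

def crop_center_square(frame):
  # Copy of the helper above so this cell runs on its own.
  y, x = frame.shape[0:2]
  min_dim = min(y, x)
  start_x = (x // 2) - (min_dim // 2)
  start_y = (y // 2) - (min_dim // 2)
  return frame[start_y:start_y+min_dim, start_x:start_x+min_dim]

# Synthetic 240x320 frame in OpenCV's BGR order: pure blue everywhere.
bgr_frame = np.zeros((240, 320, 3), dtype=np.uint8)
bgr_frame[:, :, 0] = 255

square = crop_center_square(bgr_frame)  # keeps the central 240x240 region
rgb = square[:, :, [2, 1, 0]]           # reverse channels: BGR -> RGB
scaled = rgb / 255.0                    # floats in [0, 1]

print(square.shape, rgb[0, 0], scaled.max())
```

After the channel swap the blue value sits in the last position, which is where an RGB consumer expects it.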

The `to_gif` function converts a video to an animated GIF and embeds it in the notebook.

In [None]:
def to_gif(images):
  converted_images = np.clip(images * 255, 0, 255).astype(np.uint8)
  imageio.mimsave('./animation.gif', converted_images, fps=25)
  return embed.embed_file('./animation.gif')

## 3.0 Processing UCF101 Dataset

We fetch the list of videos in the dataset, group them by category, and print a short summary for convenience.

In [None]:
# Get the list of videos in the dataset.
ucf_videos = list_ucf_videos()
  
categories = {}
for video in ucf_videos:
  category = video[2:-12]
  if category not in categories:
    categories[category] = []
  categories[category].append(video)
print("Found %d videos in %d categories." % (len(ucf_videos), len(categories)))

for category, sequences in categories.items():
  summary = ", ".join(sequences[:2])
  print("%-20s %4d videos (%s, ...)" % (category, len(sequences), summary))
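
The slice `video[2:-12]` relies on the UCF101 file-name layout. A small sketch on a single name, assuming the group and clip numbers always have two digits as in the file used later in this tutorial:

```python
name = "v_PlayingViolin_g01_c01.avi"

# Drop the leading "v_" (2 characters) and the trailing
# "_g01_c01.avi" (12 characters) to keep only the category.
category = name[2:-12]
print(category)
```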

The benefit of the `fetch_ucf_video` function is that it downloads only the requested sample, so we don't need to download the entire dataset.

Then, using the `load_video` function, we obtain a sample to feed to the I3D model. Note that we can use the `to_gif` function to visualize the video.

In [None]:
# Get a sample video.
video_path = fetch_ucf_video("v_PlayingViolin_g01_c01.avi")
sample_video = load_video(video_path)

In [None]:
sample_video.shape
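
Given how `load_video` is written, the shape should be `(num_frames, 224, 224, 3)` with values in `[0, 1]`. The sketch below uses a random array as a stand-in for a loaded video (the frame count of 16 is made up) and shows the batch axis that `predict` adds later before calling the model.

```python
import numpy as np

# Stand-in for the output of load_video: 16 frames of 224x224 RGB in [0, 1].
# The frame count is made up for illustration.
fake_video = np.random.rand(16, 224, 224, 3).astype(np.float32)

# Adding a leading batch axis gives (batch, frames, height, width, channels).
batched = fake_video[np.newaxis, ...]
print(fake_video.shape, batched.shape)
```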

In [None]:
to_gif(sample_video)

## 4.0 Loading and using a pre-trained model

First, we need to obtain the labels for our model. The I3D model was trained on Kinetics-400, which is why we have 400 labels.

In [None]:
# Get the kinetics-400 action labels from the GitHub repository.
KINETICS_URL = "https://raw.githubusercontent.com/deepmind/kinetics-i3d/master/data/label_map.txt"
with request.urlopen(KINETICS_URL) as obj:
  labels = [line.decode("utf-8").strip() for line in obj.readlines()]
print("Found %d labels." % len(labels))
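
The parsing step can be tried without a network connection. This sketch replaces the HTTP response with an in-memory byte stream; the three label lines are illustrative and stand in for the real file.

```python
import io

# In-memory stand-in for the HTTP response; the three lines are illustrative.
fake_response = io.BytesIO(b"abseiling\nair drumming\nanswering questions\n")

# Same parsing as above: decode each raw line and strip the trailing newline.
sample_labels = [line.decode("utf-8").strip() for line in fake_response.readlines()]
print(sample_labels)
```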

TensorFlow Hub is a repository of trained machine learning models ready for fine-tuning and deployable anywhere. In this tutorial, we work with an Inflated 3D (I3D) ConvNet model trained for action recognition on Kinetics-400.

The following line loads the model, ready to be used for predictions.


In [None]:
i3d = hub.load("https://tfhub.dev/deepmind/i3d-kinetics-400/1").signatures['default']

We define a function that computes the class probabilities and prints the top-5 predictions.

In [None]:
def predict(sample_video):
  # Add a batch axis to the sample video.
  model_input = tf.constant(sample_video, dtype=tf.float32)[tf.newaxis, ...]

  logits = i3d(model_input)['default'][0]
  probabilities = tf.nn.softmax(logits)

  print("Top 5 actions:")
  k=5
  for i in np.argsort(probabilities)[::-1][:k]:
    print(f"  {labels[i]:22}: {probabilities[i] * 100:5.2f}%")
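
The ranking logic in `predict` can be followed on a toy example. The sketch below uses NumPy in place of `tf.nn.softmax`, with made-up logits and labels, so it runs without loading the model.

```python
import numpy as np

# Made-up logits and labels, for illustration only.
toy_labels = ["playing violin", "playing cello", "abseiling", "juggling"]
toy_logits = np.array([2.0, 1.0, -1.0, 0.5])

# Numerically stable softmax: subtract the max before exponentiating.
exp = np.exp(toy_logits - toy_logits.max())
toy_probs = exp / exp.sum()

# argsort is ascending, so reverse it and keep the first k indices.
k = 2
top_k = np.argsort(toy_probs)[::-1][:k]
for i in top_k:
  print(f"  {toy_labels[i]:22}: {toy_probs[i] * 100:5.2f}%")
```

Softmax preserves the order of the logits, so the top-k labels are the same whether the ranking is done on logits or on probabilities; the conversion only makes the scores readable as percentages.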

Run the I3D model and print the top-5 action predictions.

In [None]:
predict(sample_video)

## 5.0 Predicting on your own data

We can use any video downloaded from the Internet to predict the action with the I3D model.

Let's try with [this video](https://commons.wikimedia.org/wiki/File:End_of_a_jam.ogv) by Patrick Gillett:

In [None]:
!wget https://upload.wikimedia.org/wikipedia/commons/8/86/End_of_a_jam.ogv

In [None]:
video_path = "End_of_a_jam.ogv"

In [None]:
sample_video = load_video(video_path)[:100]
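
The `[:100]` slice keeps at most the first 100 frames and does not fail on shorter clips. A quick check on two hypothetical arrays (tiny 8x8 frames, to keep memory use low):

```python
import numpy as np

# Hypothetical clips with tiny 8x8 frames, standing in for loaded videos.
long_clip = np.zeros((120, 8, 8, 3), dtype=np.float32)
short_clip = np.zeros((60, 8, 8, 3), dtype=np.float32)

# Slicing truncates long clips and leaves short ones untouched.
print(long_clip[:100].shape, short_clip[:100].shape)
```

Note that `load_video` also accepts a `max_frames` argument, which stops reading once that many frames have been collected instead of decoding the whole file first.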

In [None]:
to_gif(sample_video)

In [None]:
predict(sample_video)

## 6.0 Activity

Now it is your turn. Reuse the code from Section 5.0 to predict the action in the following video: https://github.com/bryanyzhu/tiny-ucf101/raw/master/abseiling_k400.mp4

In [None]:
!wget https://github.com/bryanyzhu/tiny-ucf101/raw/master/abseiling_k400.mp4

In [None]:
video_path = 'abseiling_k400.mp4'
sample_video = load_video(video_path)[:100]
to_gif(sample_video)

In [None]:
predict(sample_video)


Based on this tutorial and the class, answer the following questions.


1. Is the I3D model a 3D or 2D model?

**ANSWER** I3D is a 3D model because it uses three-dimensional kernels: two spatial dimensions and one temporal dimension.

2. Is the I3D model a trimmed model approach?



**ANSWER** It is a model obtained by inflating a 2D convolutional model. The inflation is based on the idea of repeating the kernel learned for a single image several times, so that the knowledge already learned by image models can be reused.

3. Mention at least one advantage of I3D over the previous approaches:

**ANSWER** A major advantage is that it leverages the information in 2D image models, for example models pre-trained on ImageNet, whereas C3D has to learn from scratch. As a result, it takes less time to train than C3D, and it also has fewer parameters. In addition, it captures temporal and spatial detail better than the two-stream model.
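
The inflation idea mentioned in the answer to question 2 can be sketched in NumPy. In the "Quo Vadis" paper, a 2D filter is repeated N times along the time dimension and divided by N. The helper name `inflate_2d_kernel`, the Sobel-like kernel, and the toy image below are illustrative assumptions, not code from the I3D release.

```python
import numpy as np

def inflate_2d_kernel(kernel_2d, time_depth):
  """Repeats a (h, w) kernel along a new time axis and rescales by 1/time_depth."""
  kernel_3d = np.repeat(kernel_2d[np.newaxis, ...], time_depth, axis=0)
  return kernel_3d / time_depth

# Illustrative 2D kernel (a Sobel-like edge detector).
kernel_2d = np.array([[1.0, 0.0, -1.0],
                      [2.0, 0.0, -2.0],
                      [1.0, 0.0, -1.0]])
kernel_3d = inflate_2d_kernel(kernel_2d, time_depth=3)

# A "boring" video: the same image repeated in every frame.
image = np.arange(9.0).reshape(3, 3)
boring_video = np.repeat(image[np.newaxis, ...], 3, axis=0)

# The inflated kernel responds to the static video exactly as the
# 2D kernel responds to the single image.
response_2d = np.sum(kernel_2d * image)
response_3d = np.sum(kernel_3d * boring_video)
print(kernel_3d.shape, np.isclose(response_2d, response_3d))
```

The rescaling is what keeps the response unchanged on a static video, which is why weights learned on images remain a sensible starting point for the 3D network.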