# Action Recognition with a Two-Stream Architecture



## General Instructions

This assignment is **individual**. Submit the **.ipynb file with all cells executed**. All questions must be answered in text cells; the _output_ of a code cell will not be accepted as an answer.

**Name:** FRANCISCO MENA

**Due date: April 28, 2021**

This assignment has several sections, followed by one or more activities at the end. Some activities require writing code and others require answering questions.

**Important.** To make it easier to run, each section can be executed independently.

It is **strongly** recommended that you review the sections that provide code, because some coding activities can reuse that code with changes to a few lines.

The assignment must be submitted **individually**; otherwise you will receive the minimum grade (1). You must also write your name where indicated, or the assignment will not be graded.

## 1.0  Introduction

The [two-stream convolutional network architecture proposed by Simonyan and Zisserman](https://arxiv.org/pdf/1406.2199.pdf) adds the temporal component of videos, which provides an additional (and important) cue for recognition, as a number of actions can be reliably recognised based on motion information.

The main idea is to train two CNNs that learn spatial and temporal features separately, and then combine their two scores to obtain the final scores. This architecture was trained and evaluated on the standard video action benchmarks UCF-101 and HMDB-51, obtaining competitive results.
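The score combination described above can be illustrated with a minimal sketch. It assumes the simplest fusion rule, a weighted average of the per-class softmax scores of each stream; the function name `fuse_scores`, the `spatial_weight` parameter, and the toy score vectors are hypothetical and only serve the illustration.

```python
import numpy as np

def fuse_scores(spatial_scores, motion_scores, spatial_weight=0.5):
    """Late fusion: weighted average of two per-class score vectors.

    `spatial_weight` is a hypothetical knob; 0.5 gives a plain average.
    """
    spatial_scores = np.asarray(spatial_scores, dtype=np.float64)
    motion_scores = np.asarray(motion_scores, dtype=np.float64)
    return spatial_weight * spatial_scores + (1.0 - spatial_weight) * motion_scores

# Toy softmax outputs over three made-up classes.
spatial = np.array([0.6, 0.3, 0.1])  # the spatial stream prefers class 0
motion = np.array([0.2, 0.7, 0.1])   # the motion stream prefers class 1

fused = fuse_scores(spatial, motion)
predicted_class = int(np.argmax(fused))
print(fused, predicted_class)
```

Because both inputs sum to 1 and the weights sum to 1, the fused vector is still a valid probability distribution, which keeps it comparable with the scores of the individual streams.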


<figure>
<center>
<img src='https://wushidonguc.github.io/assets/two_stream.png' width="700" />
</center>
</figure>

In this tutorial, we will use the implementation and the pre-trained model of [mohammed-elkomy](https://github.com/mohammed-elkomy/two-stream-action-recognition). The code is based on TensorFlow and the Keras library, and we will focus on how to load and use the model for inference.



## 2.0 Setup


This tutorial uses TensorFlow 1.x. The following line switches the Colab environment to the required version.

In [1]:
%tensorflow_version 1.x

TensorFlow 1.x selected.


The following cell clones the repository and sets the `two-stream-action-recognition` folder as the working directory.

In [2]:
!git clone https://github.com/mohammed-elkomy/two-stream-action-recognition.git

import os
os.chdir("/content/two-stream-action-recognition")

Cloning into 'two-stream-action-recognition'...
remote: Enumerating objects: 370, done.
remote: Counting objects: 100% (3/3), done.
remote: Compressing objects: 100% (3/3), done.
remote: Total 370 (delta 0), reused 0 (delta 0), pack-reused 367
Receiving objects: 100% (370/370), 50.78 MiB | 32.91 MiB/s, done.
Resolving deltas: 100% (145/145), done.


We now need to download and unzip the pre-trained models.



In [3]:
!gdown --id 1djGzpxAYFvNX-UaQ7ONqDHGgnzc8clBK -O "spatial.zip" 
!gdown --id 1kvslNL8zmZYaHRmhgAM6-l_pNDDA0EKZ -O "motion.zip"
!unzip spatial.zip
!unzip motion.zip

Downloading...
From: https://drive.google.com/uc?id=1djGzpxAYFvNX-UaQ7ONqDHGgnzc8clBK
To: /content/two-stream-action-recognition/spatial.zip
234MB [00:01, 171MB/s]
Downloading...
From: https://drive.google.com/uc?id=1kvslNL8zmZYaHRmhgAM6-l_pNDDA0EKZ
To: /content/two-stream-action-recognition/motion.zip
235MB [00:01, 124MB/s] 
Archive:  spatial.zip
  inflating: spatial.log             
  inflating: spatial.preds           
  inflating: spatial.h5              
Archive:  motion.zip
  inflating: motion.log              
  inflating: motion.preds            
  inflating: motion.h5               


Install and initialize the dependencies for this tutorial.

In [4]:
!pip install -U -q PyDrive 2> s.txt >> s.txt
!pip install opencv-python 2> s.txt >> s.txt
!pip install imgaug 2> s.txt >> s.txt
!pip install scikit-video 2> s.txt >> s.txt

In [5]:
import cv2
from imgaug import augmenters as iaa

from evaluation import legacy_load_model
from evaluation.evaluation import *

import random
from frame_dataloader import DataUtil

import matplotlib.pyplot as plt
import numpy as np

import skvideo.io
import io
import base64
from IPython.display import HTML
from matplotlib import gridspec

import warnings
warnings.filterwarnings('ignore')

## 3.0 Loading models

First, we will load the pre-trained model. It consists of two subnetworks: one for spatial information and one for temporal information.

In [6]:
# load spatial network
print("Loading Spatial stream")
spatial_model_restored = legacy_load_model(filepath="spatial.h5", custom_objects={'sparse_categorical_cross_entropy_loss': sparse_categorical_cross_entropy_loss, "acc_top_1": acc_top_1, "acc_top_5": acc_top_5})
spatial_model_restored.summary()

# load temporal network
print("Loading Motion stream")
motion_model_restored = legacy_load_model(filepath="motion.h5", custom_objects={'sparse_categorical_cross_entropy_loss': sparse_categorical_cross_entropy_loss, "acc_top_1": acc_top_1, "acc_top_5": acc_top_5})
motion_model_restored.summary()

Loading Spatial stream
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
Tensor("predictions_target:0", shape=(?, ?), dtype=float32) Tensor("predictions/Softmax:0", shape=(?, 101), dtype=float32)
Tensor("predictions_target:0", shape=(?, ?), dtype=float32) Tensor("predictions/Softmax:0", shape=(?, 101), dtype=float32)
Model: "spatial_xception"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_spatial (InputLayer)   [(None, 299, 299, 3)]     0         
_________________________________________________________________
data_n

## 4.0 Dataset

### 4.1 Setup

The repo provides an extract of the UCF101 dataset (100 videos) that we will use in this tutorial.

To start, we need to define some important variables and functions that preprocess a video so it can be fed to the model.

In [7]:
high_resolution_video = True
stacked_frames = 10

# dictionary of class names
data_util = DataUtil(path= './UCF_list/', split="01")
action_names =  {v:k for k,v in data_util.action_to_label.items()} # class name dictionary

# image resize augmenter to be fed into the network
augmenter = iaa.Sequential([
    iaa.Scale({"height": 299, "width": 299})
])

In [8]:
def convert_to_image(flow_image):
    """
    this is the conversion function of each flow frame
    """
    l, h = -20, 20
    return (255 * (flow_image - l) / (h - l)).astype(np.uint8)


def stack_opticalflow(start_frame_index, stacked_frames):  # returns numpy (h,w,stacked*2) = one sample
    """
    Stacks "stacked_frames" u/v frames on a single numpy array : (h,w,stacked*2)
    """
    first_optical_frame_u = original_u_frames[start_frame_index]  # horizontal
    first_optical_frame_v = original_v_frames[start_frame_index]  # vertical

    stacked_optical_flow_sample = np.zeros(first_optical_frame_u.shape + (2 * stacked_frames,), dtype=np.uint8)  # with channel dimension of  stacked_frames(u)+ stacked_frames(v)

    stacked_optical_flow_sample[:, :, 0] = first_optical_frame_u
    stacked_optical_flow_sample[:, :, 0 + stacked_frames] = first_optical_frame_v

    for index, optical_frame_id in enumerate(range(start_frame_index + 1, start_frame_index + stacked_frames), 1):  # index starts at 1 placed after the first one
        stacked_optical_flow_sample[:, :, index] = original_u_frames[optical_frame_id]
        stacked_optical_flow_sample[:, :, index + stacked_frames] = original_v_frames[optical_frame_id]

    return stacked_optical_flow_sample


def get_image_from_fig(fig):
    """
    converts matplotlib figure into a numpy array for demo video generation
    """
    fig.canvas.draw()

    # np.frombuffer replaces the deprecated binary mode of np.fromstring
    data = np.frombuffer(fig.canvas.tostring_rgb(), dtype=np.uint8)
    data = data.reshape(fig.canvas.get_width_height()[::-1] + (3,))

    return data
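The `convert_to_image` function above maps flow values linearly from the range [-20, 20] to [0, 255]. Displacements outside that range do not fit in a `uint8` after the cast. The sketch below shows a variant that clips the flow first; the name `flow_to_uint8` and the `bound` parameter are hypothetical, and whether the pre-trained model was trained on clipped flow is not stated here, so treat this as an illustration rather than a drop-in replacement.

```python
import numpy as np

def flow_to_uint8(flow_component, bound=20.0):
    """Quantize one optical-flow component to 8 bits.

    Values are clipped to [-bound, bound] before the linear mapping to
    [0, 255], so large displacements saturate instead of overflowing.
    """
    clipped = np.clip(flow_component, -bound, bound)
    return (255.0 * (clipped + bound) / (2.0 * bound)).astype(np.uint8)

# Toy flow values, including two that fall outside the [-20, 20] range.
flow = np.array([[-30.0, -20.0, 0.0, 20.0, 30.0]], dtype=np.float32)
quantized = flow_to_uint8(flow)
print(quantized)
```

Zero motion lands in the middle of the 8-bit range, which is why static regions appear as a uniform mid-tone in the flow images.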

We randomly select a video from our dataset to work with in this tutorial.

In [9]:
# select a random video
video_name = random.choice(os.listdir("testing video samples"))
selected_video=os.path.join("testing video samples",video_name)
print("selected_video:",selected_video)

vidcap = cv2.VideoCapture(selected_video)
print("frame rate for demo:",vidcap.get(cv2.CAP_PROP_FPS))

selected_video: testing video samples/v_SoccerPenalty_g17_c04.avi
frame rate for demo: 25.0


### 4.2 Process a random video

Now, we need to extract the RGB frames and compute the optical flow frames to feed the model.


Optical flow frames are computed with the TV-L1 algorithm from the OpenCV library. This may take a few minutes for long videos (we run it on the CPU, since the GPU version requires building OpenCV from source).

In [10]:
def obtain_frames(vidcap):
  # make the rgb frames
  original_rgb_frames = []

  success, image = vidcap.read()
  while success:
      original_rgb_frames.append(image)
      success, image = vidcap.read()

  print("frames count in video", len(original_rgb_frames))

  # make the optical flow frames
  original_v_frames = []
  original_u_frames = []

  frames = list(map(lambda frame: cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY).astype(np.float32) / 255.0, original_rgb_frames))
  optical_flow = cv2.optflow.DualTVL1OpticalFlow_create()

  for frame_index in range(len(frames) - 1):
      if frame_index % 10 == 0:
          print("processing tvl flow of frame ",frame_index)

      flow = optical_flow.calc(frames[frame_index], frames[frame_index + 1], None)
      u_frame = convert_to_image(flow[..., 0])
      v_frame = convert_to_image(flow[..., 1])

      original_v_frames.append(v_frame)
      original_u_frames.append(u_frame)

  print("original_rgb_frames:", len(original_rgb_frames), "\noriginal_u_frames:", len(original_u_frames), "\noriginal_v_frames:", len(original_v_frames))

  return original_rgb_frames, original_u_frames, original_v_frames

In [13]:
original_rgb_frames, original_u_frames, original_v_frames = obtain_frames(vidcap)

frames count in video 84
processing tvl flow of frame  0
processing tvl flow of frame  10
processing tvl flow of frame  20
processing tvl flow of frame  30
processing tvl flow of frame  40
processing tvl flow of frame  50
processing tvl flow of frame  60
processing tvl flow of frame  70
processing tvl flow of frame  80
original_rgb_frames: 84 
original_u_frames: 83 
original_v_frames: 83


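We now build the input batches for the spatial and motion streams. Note that `get_batch` uses its second argument only to count the available flow frames, because `stack_opticalflow` reads both `original_u_frames` and `original_v_frames` from the global scope. You can therefore pass either `original_u_frames` or `original_v_frames`.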

In [14]:
def get_batch(original_rgb_frames, original_flow_frames):
  # generate spatial batch as done in the dataloader
  spatial_batch = []
  for image in original_rgb_frames:
      spatial_batch.append(
          cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
      )

  spatial_batch = np.array(augmenter.augment_images(spatial_batch), dtype=np.float32) / 255.0

  # generate motion batch as done in the dataloader
  motion_batch = []

  for first_optical_frame_id in range(len(original_flow_frames) - stacked_frames):
      motion_batch.append(  # append one sample which is (h,w,stacked*2)
          stack_opticalflow(start_frame_index=first_optical_frame_id, stacked_frames=stacked_frames)
      )
  motion_batch = np.array(augmenter.augment_images(motion_batch), dtype=np.float32) / 255.0

  return spatial_batch, motion_batch

In [15]:
spatial_batch, motion_batch = get_batch(original_rgb_frames, original_u_frames)
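To make the shape of the motion batch concrete, here is a small self-contained sketch of the same sliding-window stacking on synthetic data. The function `sliding_flow_stacks` and the tiny 4x6 frames are hypothetical; the sketch mirrors the channel layout of `stack_opticalflow` (all u channels first, then all v channels) and the loop bound used in `get_batch`.

```python
import numpy as np

def sliding_flow_stacks(u_frames, v_frames, stacked_frames):
    """Build one (h, w, 2 * stacked_frames) sample per start frame."""
    u = np.asarray(u_frames)
    v = np.asarray(v_frames)
    samples = []
    # Same loop bound as get_batch: len(flow frames) - stacked_frames windows.
    for start in range(len(u) - stacked_frames):
        window_u = u[start:start + stacked_frames]  # (stacked, h, w)
        window_v = v[start:start + stacked_frames]  # (stacked, h, w)
        sample = np.concatenate([window_u, window_v], axis=0)  # (2*stacked, h, w)
        samples.append(np.moveaxis(sample, 0, -1))  # (h, w, 2*stacked)
    return np.stack(samples)

# Synthetic stand-in for the 83 flow frames computed above (tiny 4x6 frames).
rng = np.random.default_rng(0)
u_frames = rng.integers(0, 256, size=(83, 4, 6), dtype=np.uint8)
v_frames = rng.integers(0, 256, size=(83, 4, 6), dtype=np.uint8)

stacks = sliding_flow_stacks(u_frames, v_frames, stacked_frames=10)
print(stacks.shape)
```

With 83 flow frames and a stack of 10, this yields 73 samples, which is why the motion stream produces fewer predictions than the spatial stream has frames.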

## 5.0 Make predictions

### 5.1 Predict the output

The two-stream model makes a prediction for every frame, so we run each stream over its whole batch and keep the five highest-scoring classes per frame.

In [16]:
"""
predict spatial stream output
"""
spatial_pred = spatial_model_restored.predict(spatial_batch)
spatial_classes = np.argsort(spatial_pred,axis=1)[:,:-6:-1]
spatial_scores = np.sort(spatial_pred,axis=1)[:,:-6:-1]
"""
predict motion stream output
"""
motion_pred = motion_model_restored.predict(motion_batch)
motion_classes = np.argsort(motion_pred,axis=1)[:,:-6:-1]
motion_scores = np.sort(motion_pred,axis=1)[:,:-6:-1]
"""
get the average output prediction (mean of the two streams' scores)
"""
average_pred = (motion_pred + spatial_pred[:motion_pred.shape[0]]) / 2.0
average_classes = np.argsort(average_pred,axis=1)[:,:-6:-1]
average_scores = np.sort(average_pred,axis=1)[:,:-6:-1]
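The slice `[:, :-6:-1]` used in the cell above can look cryptic: it walks each ascending-sorted row backwards from the last column and stops after five entries, which gives the top five in descending order. The sketch below spells this out on a toy score matrix; the helper `top_k` is a hypothetical name, not part of the repository.

```python
import numpy as np

def top_k(scores, k=5):
    """Return the indices and values of the k largest scores in each row."""
    order = np.argsort(scores, axis=1)[:, ::-1][:, :k]  # descending, first k
    values = np.take_along_axis(scores, order, axis=1)
    return order, values

# One toy "frame" with six made-up class scores.
scores = np.array([[0.02, 0.40, 0.10, 0.25, 0.15, 0.08]])

classes, values = top_k(scores, k=5)
print(classes, values)

# The compact slicing used in the notebook selects the same columns.
notebook_classes = np.argsort(scores, axis=1)[:, :-6:-1]
```

Sorting the indices and the values separately, as the notebook does, gives consistent results because both come from the same ascending order of each row.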

### 5.2 Visualize predictions

To visualize the predictions, we will create a video that shows the RGB and optical flow frames together with the corresponding prediction at each frame. The video is saved as `demo.mp4`, and then we display it.

In [17]:
def make_bar_chart(classes,scores):
    height = scores.tolist()
    bars = [action_names[class_index] for class_index in classes]
    y_pos = np.arange(len(bars))
    
    bar = plt.bar(y_pos, height, color=['yellow', 'red', 'green', 'blue', 'cyan'])
    # plt.xticks(y_pos, bars, rotation=90) this will draw them below
    # plt.tick_params(axis="x",labelsize=10,direction="in", pad=-15)
    plt.ylim(top=1)  
    plt.ylim(bottom=0) 
    
    for bar_id,rect in enumerate(bar):
        plt.text(rect.get_x() + rect.get_width()/2.0, .5, bars[bar_id], ha='center', va='center', rotation=75,fontdict={'fontsize': 13 if high_resolution_video else 10})    

In [18]:
# Create the FFmpeg video writer. The output is stored in the 'demo.mp4' file.
writer = skvideo.io.FFmpegWriter("demo.mp4", inputdict={
      '-r': '16',
    })

gs = gridspec.GridSpec(2, 3,
                       width_ratios=[1, 1,1],
                       height_ratios=[1.5, 1]
                       )

gs.update(wspace=0.2,hspace=0)

# generating output video
for frame_index in range(motion_classes.shape[0]): 
    if high_resolution_video :
        fig = plt.figure(figsize=(16, 12))
        fig.suptitle("Prediction for {}".format(video_name), fontsize=24)

        fig.text(.125,0.91,"Average Prediction from spatial stream: {}".format(action_names[np.mean(spatial_pred,axis = 0).argmax()]),color='r', fontsize=18)
        fig.text(.125,.87,"Average Prediction from motion stream: {}".format(action_names[np.mean(motion_pred,axis = 0).argmax()]),color='g',fontsize=18)
        fig.text(.125,.83,"Average Prediction from both streams: {}".format(action_names[np.mean(average_pred,axis = 0).argmax()]),color='b', fontsize=18)
    else :
        fig = plt.figure(figsize=(9, 6))
        fig.suptitle("Demo for {}".format(video_name), fontsize=16)

        fig.text(.125,0.91,"Average Prediction from spatial stream: {}".format(action_names[np.mean(spatial_pred,axis = 0).argmax()]),color='r', fontsize=13)
        fig.text(.125,.87,"Average Prediction from motion stream: {}".format(action_names[np.mean(motion_pred,axis = 0).argmax()]),color='g',fontsize=13)
        fig.text(.125,.83,"Average Prediction from both streams: {}".format(action_names[np.mean(average_pred,axis = 0).argmax()]),color='b', fontsize=13)
    

    if frame_index % 10 == 0:
        print("processing frame ",frame_index)
    ##########################################################
    # rgb frame
    ax = plt.subplot(gs[0])
    ax.set_title("RGB frame", fontsize=16 if high_resolution_video else 13)
    ax.get_yaxis().set_visible(False)
    ax.get_xaxis().set_visible(False)
    ax.imshow(cv2.cvtColor(original_rgb_frames[frame_index], cv2.COLOR_BGR2RGB))  # OpenCV frames are BGR; matplotlib expects RGB
    ##########################################################
    # optical flow frame
    ax = plt.subplot(gs[1])
    ax.set_title("TVL1 Optical flow u-frame", fontsize=16 if high_resolution_video else 13)
    ax.get_yaxis().set_visible(False)
    ax.get_xaxis().set_visible(False)
    ax.imshow(original_u_frames[frame_index],cmap="inferno") # viridis,inferno,plasma,magma
    ##########################################################
    # optical flow frame
    ax = plt.subplot(gs[2])
    ax.set_title("TVL1 Optical flow v-frame", fontsize= 16 if high_resolution_video else 13)
    ax.get_yaxis().set_visible(False)
    ax.get_xaxis().set_visible(False)
    ax.imshow(original_v_frames[frame_index],cmap="inferno") # viridis,inferno,plasma,magma
    ##########################################################
    # prediction scores
    ax = plt.subplot(gs[3])
    ax.set_title("Spatial Stream Output scores",fontsize= 16 if high_resolution_video else 13)

    make_bar_chart(spatial_classes[frame_index],spatial_scores[frame_index])
    ##########################################################
    # prediction scores
    ax = plt.subplot(gs[4])
    ax.set_title("Motion Stream Output scores",fontsize= 16 if high_resolution_video else 13)

    make_bar_chart(motion_classes[frame_index],motion_scores[frame_index])
    ##########################################################
    # prediction scores
    ax = plt.subplot(gs[5])
    ax.set_title("Average Output scores",fontsize= 16 if high_resolution_video else 13)

    make_bar_chart(average_classes[frame_index],average_scores[frame_index])
    ##########################################################
    fig.tight_layout( pad=0, h_pad=0, w_pad=0)
    writer.writeFrame(get_image_from_fig(fig))
    
    plt.close(fig)
    
writer.close()

processing frame  0
processing frame  10
processing frame  20
processing frame  30
processing frame  40
processing frame  50
processing frame  60
processing frame  70
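The header of each figure reports a single label per stream for the whole video. In the cell above this is done by averaging the per-frame scores over time and taking the arg-max. A minimal sketch of that aggregation follows; the function name `video_level_prediction` and the toy scores are hypothetical.

```python
import numpy as np

def video_level_prediction(frame_scores):
    """Average per-frame class scores over time and pick the best class."""
    mean_scores = np.mean(frame_scores, axis=0)
    return int(np.argmax(mean_scores)), mean_scores

# Three toy frames over three made-up classes; the middle frame disagrees.
frame_scores = np.array([
    [0.7, 0.2, 0.1],
    [0.4, 0.5, 0.1],
    [0.6, 0.3, 0.1],
])

label, mean_scores = video_level_prediction(frame_scores)
print(label, mean_scores)
```

Averaging scores rather than counting per-frame votes lets confident frames weigh more than uncertain ones, so a few noisy frames do not flip the video-level label.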


Finally, we display the demo video of the prediction for the given random video.

In [19]:
video = io.open("demo.mp4" , 'r+b').read()
encoded = base64.b64encode(video)

HTML(data='''<video controls autoplay loop>
    <source type="video/mp4" src="data:video/mp4;base64,{}" />
</video>'''.format(encoded.decode('ascii')))

## 6.0 Activity

Now it is your turn. Run the following cell (which downloads an external video) and rerun the cells from Section 4.2 onwards to report the prediction for this video: https://github.com/bryanyzhu/tiny-ucf101/raw/master/abseiling_k400.mp4

Note that the final `demo.mp4` visualization should say **Prediction for abseiling_k400.mp4**

In [20]:
# this line downloads the video
!wget https://github.com/bryanyzhu/tiny-ucf101/raw/master/abseiling_k400.mp4

video_name="abseiling_k400.mp4"
selected_video=os.path.join(video_name)
print("selected_video:", selected_video)

vidcap = cv2.VideoCapture(selected_video)
print("frame rate for demo:",vidcap.get(cv2.CAP_PROP_FPS))

--2021-04-17 21:08:14--  https://github.com/bryanyzhu/tiny-ucf101/raw/master/abseiling_k400.mp4
Resolving github.com (github.com)... 192.30.255.113
Connecting to github.com (github.com)|192.30.255.113|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/bryanyzhu/tiny-ucf101/master/abseiling_k400.mp4 [following]
--2021-04-17 21:08:14--  https://raw.githubusercontent.com/bryanyzhu/tiny-ucf101/master/abseiling_k400.mp4
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.111.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 800470 (782K) [application/octet-stream]
Saving to: ‘abseiling_k400.mp4’


2021-04-17 21:08:15 (20.6 MB/s) - ‘abseiling_k400.mp4’ saved [800470/800470]

selected_video: abseiling_k400.mp4
frame rate for demo: 25.0


In [21]:
original_rgb_frames, original_u_frames, original_v_frames = obtain_frames(vidcap)

frames count in video 250
processing tvl flow of frame  0
processing tvl flow of frame  10
processing tvl flow of frame  20
processing tvl flow of frame  30
processing tvl flow of frame  40
processing tvl flow of frame  50
processing tvl flow of frame  60
processing tvl flow of frame  70
processing tvl flow of frame  80
processing tvl flow of frame  90
processing tvl flow of frame  100
processing tvl flow of frame  110
processing tvl flow of frame  120
processing tvl flow of frame  130
processing tvl flow of frame  140
processing tvl flow of frame  150
processing tvl flow of frame  160
processing tvl flow of frame  170
processing tvl flow of frame  180
processing tvl flow of frame  190
processing tvl flow of frame  200
processing tvl flow of frame  210
processing tvl flow of frame  220
processing tvl flow of frame  230
processing tvl flow of frame  240
original_rgb_frames: 250 
original_u_frames: 249 
original_v_frames: 249


In [22]:
spatial_batch, motion_batch = get_batch(original_rgb_frames, original_u_frames)

In [23]:
"""
predict spatial stream output
"""
spatial_pred = spatial_model_restored.predict(spatial_batch)
spatial_classes = np.argsort(spatial_pred,axis=1)[:,:-6:-1]
spatial_scores = np.sort(spatial_pred,axis=1)[:,:-6:-1]
"""
predict motion stream output
"""
motion_pred = motion_model_restored.predict(motion_batch)
motion_classes = np.argsort(motion_pred,axis=1)[:,:-6:-1]
motion_scores = np.sort(motion_pred,axis=1)[:,:-6:-1]
"""
get the average output prediction (mean of the two streams' scores)
"""
average_pred = (motion_pred + spatial_pred[:motion_pred.shape[0]]) / 2.0
average_classes = np.argsort(average_pred,axis=1)[:,:-6:-1]
average_scores = np.sort(average_pred,axis=1)[:,:-6:-1]

In [24]:
# Create the FFmpeg video writer. The output is stored in the 'demo.mp4' file.
writer = skvideo.io.FFmpegWriter("demo.mp4", inputdict={
      '-r': '16',
    })

gs = gridspec.GridSpec(2, 3,
                       width_ratios=[1, 1,1],
                       height_ratios=[1.5, 1]
                       )

gs.update(wspace=0.2,hspace=0)

# generating output video
for frame_index in range(motion_classes.shape[0]): 
    if high_resolution_video :
        fig = plt.figure(figsize=(16, 12))
        fig.suptitle("Prediction for {}".format(video_name), fontsize=24)

        fig.text(.125,0.91,"Average Prediction from spatial stream: {}".format(action_names[np.mean(spatial_pred,axis = 0).argmax()]),color='r', fontsize=18)
        fig.text(.125,.87,"Average Prediction from motion stream: {}".format(action_names[np.mean(motion_pred,axis = 0).argmax()]),color='g',fontsize=18)
        fig.text(.125,.83,"Average Prediction from both streams: {}".format(action_names[np.mean(average_pred,axis = 0).argmax()]),color='b', fontsize=18)
    else :
        fig = plt.figure(figsize=(9, 6))
        fig.suptitle("Demo for {}".format(video_name), fontsize=16)

        fig.text(.125,0.91,"Average Prediction from spatial stream: {}".format(action_names[np.mean(spatial_pred,axis = 0).argmax()]),color='r', fontsize=13)
        fig.text(.125,.87,"Average Prediction from motion stream: {}".format(action_names[np.mean(motion_pred,axis = 0).argmax()]),color='g',fontsize=13)
        fig.text(.125,.83,"Average Prediction from both streams: {}".format(action_names[np.mean(average_pred,axis = 0).argmax()]),color='b', fontsize=13)
    

    if frame_index % 10 == 0:
        print("processing frame ",frame_index)
    ##########################################################
    # rgb frame
    ax = plt.subplot(gs[0])
    ax.set_title("RGB frame", fontsize=16 if high_resolution_video else 13)
    ax.get_yaxis().set_visible(False)
    ax.get_xaxis().set_visible(False)
    ax.imshow(cv2.cvtColor(original_rgb_frames[frame_index], cv2.COLOR_BGR2RGB))  # OpenCV frames are BGR; matplotlib expects RGB
    ##########################################################
    # optical flow frame
    ax = plt.subplot(gs[1])
    ax.set_title("TVL1 Optical flow u-frame", fontsize=16 if high_resolution_video else 13)
    ax.get_yaxis().set_visible(False)
    ax.get_xaxis().set_visible(False)
    ax.imshow(original_u_frames[frame_index],cmap="inferno") # viridis,inferno,plasma,magma
    ##########################################################
    # optical flow frame
    ax = plt.subplot(gs[2])
    ax.set_title("TVL1 Optical flow v-frame", fontsize= 16 if high_resolution_video else 13)
    ax.get_yaxis().set_visible(False)
    ax.get_xaxis().set_visible(False)
    ax.imshow(original_v_frames[frame_index],cmap="inferno") # viridis,inferno,plasma,magma
    ##########################################################
    # prediction scores
    ax = plt.subplot(gs[3])
    ax.set_title("Spatial Stream Output scores",fontsize= 16 if high_resolution_video else 13)

    make_bar_chart(spatial_classes[frame_index],spatial_scores[frame_index])
    ##########################################################
    # prediction scores
    ax = plt.subplot(gs[4])
    ax.set_title("Motion Stream Output scores",fontsize= 16 if high_resolution_video else 13)

    make_bar_chart(motion_classes[frame_index],motion_scores[frame_index])
    ##########################################################
    # prediction scores
    ax = plt.subplot(gs[5])
    ax.set_title("Average Output scores",fontsize= 16 if high_resolution_video else 13)

    make_bar_chart(average_classes[frame_index],average_scores[frame_index])
    ##########################################################
    fig.tight_layout( pad=0, h_pad=0, w_pad=0)
    writer.writeFrame(get_image_from_fig(fig))
    
    plt.close(fig)
    
writer.close()

processing frame  0
processing frame  10
processing frame  20
processing frame  30
processing frame  40
processing frame  50
processing frame  60
processing frame  70
processing frame  80
processing frame  90
processing frame  100
processing frame  110
processing frame  120
processing frame  130
processing frame  140
processing frame  150
processing frame  160
processing frame  170
processing frame  180
processing frame  190
processing frame  200
processing frame  210
processing frame  220
processing frame  230


In [25]:
video = io.open("demo.mp4" , 'r+b').read()
encoded = base64.b64encode(video)

HTML(data='''<video controls autoplay loop>
    <source type="video/mp4" src="data:video/mp4;base64,{}" />
</video>'''.format(encoded.decode('ascii')))


Based on this tutorial and the class, answer the following questions.


1. Is the two-stream model a 3D or a 2D model?



**ANSWER** The two-stream model uses 2D convolutions to analyze images: one image that represents the spatial content, and optical flow images that represent the temporal sequence. Therefore, two-stream is a 2D model.

2. Mention at least one disadvantage of the two-stream model:



**ANSWER** Disadvantages of the two-stream model are:
- The optical flow of the video must be computed before passing it through the neural network, which can take time.
- It obtains the spatial information from a single frame of the video, which may not be a good representation.
- It cannot capture very long temporal sequences.

3. Mention one advantage of the two-stream model over the cnn2d + rnn:



**ANSWER** One advantage of the two-stream model over cnn2d + rnn is that two-stream captures more temporal detail thanks to the temporal (optical flow) stream. It is also less costly in terms of resources, since the RNN requires backpropagating through multiple frames across the network.