# **Diplomado IA: Audio y Video - Parte 1**. <br> Práctico 5: Aplicaciones 2
---
---

**Profesores:**
- Alain Raymond
- Gabriel Sepúlveda
- Álvaro Soto

**Ayudante:**
- Gabriel Molina
---
---

# **Instrucciones Generales**

El siguiente práctico se debe realizar de forma individual. El formato de entregar es el **archivo .ipynb con todas las celdas ejecutadas**. Las secciones donde se planteen preguntas de forma explícita, deben ser respondida en celdas de texto, y no se aceptará solo el _output_ de una celda de código como respuesta.

**Nombre alumno:** FRANCISCO MENA

El siguiente práctico cuenta con secciones que contienen los experimentos presentados durante la sesión de laboratorio, y actividades que deberán ser desarrolladas y luego entregadas como tarea. Algunas actividades correspondrán a escribir código y otras a responder preguntas. 

Antes de responder, se recomienda **fuertemente** revisar las secciones previas donde se desarrollan los ejemplos, dado que algunas de las actividades pueden ser completadas reutilizando el mismo código.

**Fecha de entrega:** viernes 28 de mayo de 2021, 23:59 hrs.

---
**IMPORTANTE:** habrá un bonus de 1 décima para todos aquellos alumnos/as que muestren buen orden en sus respuestas (esto aplica a legibilidad de código, buena redacción, formalidad, organización del jupyter notebook, seguimiento de instrucciones, etc). El criterio lo pondrá cada ayudante corrector. La nota máxima obtenible en el laboratorio es 7.0

#Sources

**End-to-End Audiovisual Speech Recognition**

dataset: https://www.robots.ox.ac.uk/~vgg/data/lip_reading/lrw1.html

paper: https://arxiv.org/pdf/1802.06424.pdf

github: https://github.com/mpc001/end-to-end-lipreading

#Preámbulo

In [1]:
import sys
import os
import os.path
import glob
import math
import random
import numpy as np
import cv2
import librosa

import torch
import torch.nn as nn
from torch.autograd import Variable
import torch.nn.functional as F

In [2]:
!if [ ! -f Audiovisual.zip ]; then wget -q --show-progress https://www.dropbox.com/s/j6het8rq3j2ewng/Audiovisual.zip; fi
!if [ ! -f label_sorted.txt ]; then wget -q --show-progress https://www.dropbox.com/s/r44j8lhhsgjvjzb/label_sorted.txt; fi
!if [ ! -f lipread_testset_mini.tar.gz ]; then wget -q --show-progress https://www.dropbox.com/s/cbr5q72b8cef22i/lipread_testset_mini.tar.gz; fi
#!if [ ! -f lipread_testset.tar.gz ]; then wget -q --show-progress https://www.dropbox.com/s/4e3hkzaoizd491y/lipread_testset.tar.gz; fi
!unzip -q Audiovisual.zip
!tar xzf lipread_testset_mini.tar.gz
#!tar xzf lipread_testset.tar.gz



#Preprocesamiento de datos

In [3]:
def extract_opencv(filename):
  video = []
  cap = cv2.VideoCapture(filename)
  while cap.isOpened():
    ret, frame = cap.read() # BGR
    if ret:
      video.append(frame)
    else:
      break 
  cap.release()
  video = np.array(video)
  return video[...,::-1]

def video_converter(basedir, basedir_to_save):
  if not os.path.isdir( basedir_to_save ):
    os.makedirs( basedir_to_save, exist_ok = True )
  filenames = glob.glob(os.path.join(basedir, '*', '*', '*.mp4')) # <basedir>/<word>/<train, val, test>/<filename.mp4>
  for filename in filenames:
    data = extract_opencv(filename)[:, 115:211, 79:175]
    path_to_save = os.path.join(basedir_to_save,
                  filename.split('/')[-3],
                  filename.split('/')[-2],
                  filename.split('/')[-1][:-4]+'.npz')
    if not os.path.exists(os.path.dirname(path_to_save)):
      try:
        os.makedirs(os.path.dirname(path_to_save))
      except OSError as exc:
        if exc.errno != errno.EEXIST:
          raise   
    np.savez(path_to_save, data=data)

def audio_converter(basedir, basedir_to_save):
  if not os.path.isdir( basedir_to_save ):
    os.makedirs( basedir_to_save, exist_ok = True )
  filenames = glob.glob(os.path.join(basedir, '*', '*', '*.mp4')) # <basedir>/<word>/<train, val, test>/<filename.mp4>
  for filename in filenames:
    data = librosa.load(filename, sr=16000)[0][-19456:]
    path_to_save = os.path.join(basedir_to_save,
                  filename.split('/')[-3],
                  filename.split('/')[-2],
                  filename.split('/')[-1][:-4]+'.npz')
    if not os.path.exists(os.path.dirname(path_to_save)):
      try:
        os.makedirs(os.path.dirname(path_to_save))
      except OSError as exc:
        if exc.errno != errno.EEXIST:
          raise
    np.savez( path_to_save, data=data)

In [None]:
video_converter('lipread_testset_mini', 'preprocessed/video')
audio_converter('lipread_testset_mini', 'preprocessed/audio')

#Dataloader

In [5]:
def load_audio_file(filename):
  return np.load(filename)['data']

def load_video_file(filename):
  cap = np.load(filename)['data']
  arrays = np.stack([cv2.cvtColor(cap[_], cv2.COLOR_RGB2GRAY) for _ in range(29)], axis=0)
  arrays = arrays / 255.
  return arrays

class MyDataset():
  def __init__(self, folds, audio_path, video_path):
    self.folds = folds
    self.audio_path = audio_path
    self.video_path = video_path
    self.clean = 1 / 7.
    with open('label_sorted.txt') as myfile:
      self.data_dir = myfile.read().splitlines()
    self.filenames = glob.glob(os.path.join(self.audio_path, '*', self.folds, '*.npz'))
    self.list = {}
    for i, x in enumerate(self.filenames):
      target = x.split('/')[-3]
      for j, elem in enumerate(self.data_dir):
        if elem == target:
          self.list[i] = [x]
          self.list[i].append(j)

  def normalisation(self, inputs):
    inputs_std = np.std(inputs)
    if inputs_std == 0.:
      inputs_std = 1.
    return (inputs - np.mean(inputs))/inputs_std

  def __getitem__(self, idx):
    video_inputs = load_video_file(os.path.join(self.video_path,
                          self.list[idx][0].split('/')[-3],
                          self.list[idx][0].split('/')[-2],
                          self.list[idx][0].split('/')[-1][:-4]+'.npz'))
    self.list[idx][0] = self.list[idx][0]
    audio_inputs = load_audio_file(self.list[idx][0])
    audio_inputs = self.normalisation(audio_inputs)
    labels = self.list[idx][1]
    return audio_inputs, video_inputs, labels

  def __len__(self):
    return len(self.filenames)

In [6]:
def data_loader(datatype, audio_dataset, video_dataset, batch_size):
  dsets = MyDataset(datatype, audio_dataset, video_dataset)
  dset_loader = torch.utils.data.DataLoader(dsets, batch_size = batch_size, shuffle=True, num_workers=4)
  dset_size = len(dsets)
  print('\nStatistics: {}: {}'.format(datatype, dset_size))
  return dset_loader, dset_size

#Modelo

##Bloques genéricos

In [7]:
class GRU(nn.Module):

  def __init__(self, input_size, hidden_size, num_layers, num_classes, output_layer=False, every_frame=False):
    super(GRU, self).__init__()
    self.hidden_size = hidden_size
    self.num_layers = num_layers
    self.output_layer = output_layer
    self.every_frame = every_frame
    self.gru = nn.GRU(input_size, hidden_size, num_layers, batch_first=True, bidirectional=True)
    self.fc = nn.Linear(hidden_size*2, num_classes)

  def to( self, device ):
    self.device = device
    return super( GRU, self ).to( device )

  def forward(self, x):
    h0 = Variable(torch.zeros(self.num_layers*2, x.size(0), self.hidden_size).to(self.device))
    # Forward propagate RNN
    out, _ = self.gru(x, h0)
    if self.output_layer:
      if self.every_frame:
        out = self.fc(out)  # predictions based on every time step
      else:
        out = self.fc(out[:, -1, :])  # predictions based on last time-step
    return out 

##Modelo de audio


In [8]:
class BasicBlock1D(nn.Module):
  expansion = 1

  def __init__(self, inplanes, planes, stride=1, downsample=None):
    super(BasicBlock1D, self).__init__()
    self.downsample = downsample
    self.stride = stride

    self.conv1 = nn.Conv1d(inplanes, planes, kernel_size=3, stride=stride, padding=1, bias=False)
    self.bn1 = nn.BatchNorm1d(planes)
    self.relu = nn.ReLU(inplace=True)

    self.conv2 = nn.Conv1d(planes, planes, kernel_size=3, stride=1, padding=1, bias=False)
    self.bn2 = nn.BatchNorm1d(planes)

  def forward(self, x):
    residual = x
    out = self.conv1(x)
    out = self.bn1(out)
    out = self.relu(out)
    out = self.conv2(out)
    out = self.bn2(out)
    if self.downsample is not None:
      residual = self.downsample(x)
    out += residual
    out = self.relu(out)
    return out


class ResNet(nn.Module):

  def __init__(self, block, layers, num_classes=1000):
    self.inplanes = 64
    super(ResNet, self).__init__()
    self.layer1 = self._make_layer(block, 64, layers[0])
    self.layer2 = self._make_layer(block, 128, layers[1], stride=2)
    self.layer3 = self._make_layer(block, 256, layers[2], stride=2)
    self.layer4 = self._make_layer(block, 512, layers[3], stride=2)
    self.avgpool = nn.AvgPool1d(kernel_size=21, padding=1)
    self.fc = nn.Linear(512 * block.expansion, num_classes)
    for m in self.modules():
      if isinstance(m, nn.Conv1d):
        n = m.kernel_size[0] * m.out_channels
        m.weight.data.normal_(0, math.sqrt(2. / n))
      elif isinstance(m, nn.BatchNorm1d):
        m.weight.data.fill_(1)
        m.bias.data.zero_()

  def _make_layer(self, block, planes, blocks, stride=1):
    downsample = None
    if stride != 1 or self.inplanes != planes * block.expansion:
      downsample = nn.Sequential(
        nn.Conv1d(self.inplanes, planes * block.expansion,
              kernel_size=1, stride=stride, bias=False),
        nn.BatchNorm1d(planes * block.expansion),
      )

    layers = []
    layers.append(block(self.inplanes, planes, stride, downsample))
    self.inplanes = planes * block.expansion
    for i in range(1, blocks):
      layers.append(block(self.inplanes, planes))

    return nn.Sequential(*layers)

  def forward(self, x):
    x = self.layer1(x)
    x = self.layer2(x)
    x = self.layer3(x)
    x = self.layer4(x)
    x = self.avgpool(x)
    x = x.transpose(1, 2)
    x = x.contiguous()
    x = x.view(-1, x.size(2))
    x = self.fc(x)
    return x


class AudioLipreading(nn.Module):
  def __init__(self, inputDim=256, hiddenDim=512, nClasses=500, frameLen=29):
    super(AudioLipreading, self).__init__()
    self.inputDim = inputDim
    self.hiddenDim = hiddenDim
    self.nClasses = nClasses
    self.frameLen = frameLen
    self.nLayers = 2
    # frontend1D
    self.fronted1D = nn.Sequential(
        nn.Conv1d(1, 64, kernel_size=80, stride=4, padding=38, bias=False),
        nn.BatchNorm1d(64),
        nn.ReLU(True)
        )
    # resnet
    self.resnet18 = ResNet(BasicBlock1D, [2, 2, 2, 2], num_classes=self.inputDim)
    # backend_gru
    self.gru = GRU(self.inputDim, self.hiddenDim, self.nLayers, self.nClasses)

  def to( self, device ):
    self.gru.to(device)
    return super( AudioLipreading, self ).to( device )

  def forward(self, x):
    x = x.view(-1, 1, x.size(1))
    x = self.fronted1D(x)
    x = x.contiguous()
    x = self.resnet18(x)
    x = x.view(-1, self.frameLen, self.inputDim)
    x = self.gru(x)
    return x

##Modelo de video

In [9]:
class BasicBlock2D(nn.Module):
  expansion = 1

  def __init__(self, inplanes, planes, stride=1, downsample=None):
    super(BasicBlock2D, self).__init__()
    self.conv1 = nn.Conv2d(inplanes, planes, kernel_size=3, stride=stride, padding=1, bias=False)
    self.bn1 = nn.BatchNorm2d(planes)
    self.relu = nn.ReLU(inplace=True)
    self.conv2 = nn.Conv2d(planes, planes, kernel_size=3, stride=1, padding=1, bias=False)
    self.bn2 = nn.BatchNorm2d(planes)
    self.downsample = downsample
    self.stride = stride

  def forward(self, x):
    residual = x
    out = self.conv1(x)
    out = self.bn1(out)
    out = self.relu(out)
    out = self.conv2(out)
    out = self.bn2(out)
    if self.downsample is not None:
      residual = self.downsample(x)
    out += residual
    out = self.relu(out)
    return out


class ResNet2D(nn.Module):

  def __init__(self, block, layers, num_classes=1000):
    self.inplanes = 64
    super(ResNet2D, self).__init__()
    self.layer1 = self._make_layer(block, 64, layers[0])
    self.layer2 = self._make_layer(block, 128, layers[1], stride=2)
    self.layer3 = self._make_layer(block, 256, layers[2], stride=2)
    self.layer4 = self._make_layer(block, 512, layers[3], stride=2)
    self.avgpool = nn.AvgPool2d(2)
    self.fc = nn.Linear(512 * block.expansion, num_classes)
    self.bnfc = nn.BatchNorm1d(num_classes)
    for m in self.modules():
      if isinstance(m, nn.Conv2d):
        n = m.kernel_size[0] * m.kernel_size[1] * m.out_channels
        m.weight.data.normal_(0, math.sqrt(2. / n))
      elif isinstance(m, nn.BatchNorm2d):
        m.weight.data.fill_(1)
        m.bias.data.zero_()
      elif isinstance(m, nn.BatchNorm1d):
        m.weight.data.fill_(1)
        m.bias.data.zero_()

  def _make_layer(self, block, planes, blocks, stride=1):
    downsample = None
    if stride != 1 or self.inplanes != planes * block.expansion:
      downsample = nn.Sequential(
        nn.Conv2d(self.inplanes, planes * block.expansion,
              kernel_size=1, stride=stride, bias=False),
        nn.BatchNorm2d(planes * block.expansion),
      )

    layers = []
    layers.append(block(self.inplanes, planes, stride, downsample))
    self.inplanes = planes * block.expansion
    for i in range(1, blocks):
      layers.append(block(self.inplanes, planes))

    return nn.Sequential(*layers)

  def forward(self, x):
    x = self.layer1(x)
    x = self.layer2(x)
    x = self.layer3(x)
    x = self.layer4(x)
    x = self.avgpool(x)
    x = x.view(x.size(0), -1)
    x = self.fc(x)
    x = self.bnfc(x)
    return x


class VideoLipreading(nn.Module):

  def __init__(self, inputDim=256, hiddenDim=512, nClasses=500, frameLen=29):
    super(VideoLipreading, self).__init__()
    self.inputDim = inputDim
    self.hiddenDim = hiddenDim
    self.nClasses = nClasses
    self.frameLen = frameLen
    self.nLayers = 2
    # frontend3D
    self.frontend3D = nn.Sequential(
        nn.Conv3d(1, 64, kernel_size=(5, 7, 7), stride=(1, 2, 2), padding=(2, 3, 3), bias=False),
        nn.BatchNorm3d(64),
        nn.ReLU(True),
        nn.MaxPool3d(kernel_size=(1, 3, 3), stride=(1, 2, 2), padding=(0, 1, 1))
        )
    # resnet
    self.resnet34 = ResNet2D(BasicBlock2D, [3, 4, 6, 3], num_classes=self.inputDim)
    # backend_gru
    self.gru = GRU(self.inputDim, self.hiddenDim, self.nLayers, self.nClasses)

  def to( self, device ):
    self.gru.to(device)
    return super( VideoLipreading, self ).to( device )

  def forward(self, x):
    x = self.frontend3D(x)
    x = x.transpose(1, 2)
    x = x.contiguous()
    x = x.view(-1, 64, x.size(3), x.size(4))
    x = self.resnet34(x)
    x = x.view(-1, self.frameLen, self.inputDim)
    x = self.gru(x)
    return x

#Evaluación

In [10]:
def CenterCrop(batch_img, size):
  w, h = batch_img[0][0].shape[1], batch_img[0][0].shape[0]
  th, tw = size
  img = np.zeros((len(batch_img), len(batch_img[0]), th, tw))
  for i in range(len(batch_img)):
    x1 = int(round((w - tw))/2.)
    y1 = int(round((h - th))/2.)
    img[i] = batch_img[i, :, y1:y1+th, x1:x1+tw]
  return img

def ColorNormalize(batch_img):
  mean = 0.413621
  std = 0.1700239
  batch_img = (batch_img - mean) / std
  return batch_img

In [11]:
def reload_model(model, path=""):
  model_dict = model.state_dict()
  pretrained_dict = torch.load(path)
  pretrained_dict = {k: v for k, v in pretrained_dict.items() if k in model_dict}
  model_dict.update(pretrained_dict)
  model.load_state_dict(model_dict)
  print('*** model has been successfully loaded! ***')
  return model

In [12]:
device = torch.device( 'cuda' if torch.cuda.is_available() else 'cpu' )
print( 'running on: %s' % (device) )

every_frame = True
audio_model = AudioLipreading(inputDim=512, hiddenDim=512, nClasses=500, frameLen=29)
video_model = VideoLipreading(inputDim=256, hiddenDim=512, nClasses=500, frameLen=29)
concat_model = GRU(2048, 512, 2, 500, output_layer=True, every_frame=every_frame)

# reload model
print('reload audio model')
audio_model = reload_model(audio_model, 'Audiovisual/Audiovisual_a_part.pt')
print("reload video model")
video_model = reload_model(video_model, 'Audiovisual/Audiovisual_v_part.pt')
print("reload LSTM model")
concat_model = reload_model(concat_model, 'Audiovisual/Audiovisual_c_part.pt')

audio_model = audio_model.to( device )
video_model = video_model.to( device )
concat_model = concat_model.to( device )

running on: cuda
reload audio model
*** model has been successfully loaded! ***
reload video model
*** model has been successfully loaded! ***
reload LSTM model
*** model has been successfully loaded! ***


In [13]:
dset_loader, dset_size = data_loader('test', 'preprocessed/audio', 'preprocessed/video', batch_size = 1)


Statistics: test: 2500


  cpuset_checked))


In [14]:
audio_model.eval()
video_model.eval()
concat_model.eval()

running_loss = 0.0
running_corrects = 0.0
running_all = 0.0
with torch.no_grad():
  for batch_idx, (audio_inputs, video_inputs, targets) in enumerate(dset_loader):
    batch_img = CenterCrop(video_inputs.numpy(), (88, 88))
    batch_img = ColorNormalize(batch_img)

    batch_img = np.reshape(batch_img, (batch_img.shape[0], batch_img.shape[1], batch_img.shape[2], batch_img.shape[3], 1))
    video_inputs = torch.from_numpy(batch_img)
    video_inputs = video_inputs.float().permute(0, 4, 1, 2, 3)

    audio_inputs = audio_inputs.float()

    audio_inputs = audio_inputs.to( device )
    video_inputs = video_inputs.to( device )
    targets = targets.to( device )

    audio_outputs = audio_model(audio_inputs)
    video_outputs = video_model(video_inputs)
    inputs = torch.cat((audio_outputs, video_outputs), dim=2)
    outputs = concat_model(inputs)

    if every_frame:
      outputs = torch.mean(outputs, 1) # average probability among frames
    _, preds = torch.max(F.softmax(outputs, dim=1).data, 1)

    #running_loss += loss.data[0] * inputs.size(0)
    running_corrects += torch.sum(preds == targets.data)
    running_all += len(inputs)
print('Accuracy: {:.4f}'.format(running_corrects / len(dset_loader.dataset))+'\n')

  cpuset_checked))


Accuracy: 0.9884



##Evaluación cualitativa

In [15]:
def normalisation(inputs):
  inputs_std = np.std(inputs)
  if inputs_std == 0.:
    inputs_std = 1.
  return (inputs - np.mean(inputs))/inputs_std

def predict(filename):
  # data loading
  npy_basename = 'preprocessed'
  with open('label_sorted.txt') as myfile:
    id2label = myfile.read().splitlines()
    label2id = { klass:i for i, klass in enumerate(id2label) }
  audio_input = load_audio_file( os.path.join( npy_basename,
                                               'audio',
                                                filename.split('/')[-3],
                                                filename.split('/')[-2],
                                                filename.split('/')[-1][:-4]+'.npz' ) )
  audio_input = normalisation(audio_input)
  video_input = load_video_file( os.path.join( npy_basename,
                                               'video',
                                                filename.split('/')[-3],
                                                filename.split('/')[-2],
                                                filename.split('/')[-1][:-4]+'.npz' ) )
  label = filename.split('/')[-3]
  label_id = label2id[label]

  audio_input = np.expand_dims( audio_input, 0 ) # add batch dimension
  video_input = np.expand_dims( video_input, 0 ) # add batch dimension

  # prediction
  batch_img = CenterCrop(video_input, (88, 88))
  batch_img = ColorNormalize(batch_img)

  batch_img = np.reshape(batch_img, (batch_img.shape[0], batch_img.shape[1], batch_img.shape[2], batch_img.shape[3], 1))
  video_input = torch.from_numpy(batch_img)
  video_input = video_input.float().permute(0, 4, 1, 2, 3)

  
  audio_input = torch.from_numpy(audio_input).float()

  audio_input = audio_input.to( device )
  video_input = video_input.to( device )

  audio_output = audio_model(audio_input)
  video_output = video_model(video_input)
  input = torch.cat((audio_output, video_output), dim=2)
  output = concat_model(input)

  if every_frame:
    output = torch.mean(output, 1) # average probability among frames
  _, pred = torch.max(F.softmax(output, dim=1).data, 1)
  
  pred_str = id2label[int(pred)]

  return preds, pred_str

In [16]:
video_filename = 'lipread_testset_mini/AMERICAN/test/AMERICAN_00001.mp4'
#video_filename = 'lipread_testset_mini/CHILDREN/test/CHILDREN_00001.mp4'
#video_filename = 'lipread_testset_mini/EXAMPLE/test/EXAMPLE_00001.mp4'
#video_filename = 'lipread_testset_mini/MAKING/test/MAKING_00001.mp4'
#video_filename = 'lipread_testset_mini/YESTERDAY/test/YESTERDAY_00001.mp4'

pred, pred_str = predict(video_filename)
print( 'Model prediction: %s [%d]' % (pred_str, int(pred)) )

Model prediction: AMERICAN [157]


In [17]:
from IPython.display import HTML
from base64 import b64encode
# Convert mp4 to format supported by Colab
os.system(f"ffmpeg -i {video_filename} -vcodec libx264 {os.path.basename(video_filename)}")
mp4 = open(os.path.basename(video_filename),'rb').read()
data_url = "data:video/mp4;base64," + b64encode(mp4).decode()
HTML("""
<h1>Predicted word: %s</h1>
<br>
<video width=400 controls>
      <source src="%s" type="video/mp4">
</video>
""" % (pred_str, data_url))

#Actividades

##Actividad 1

¿ Por qué puede ser necesario utilizar los frames de video para la tarea de speech recognition ?

In [None]:
Respuesta = 'Para dar robustez al modelo cuando el audio viene con ruido ambiente' #@param ["seleccione una opcion", "Los frames de video son utilizados para leer los labios y generar el audio del habla","Para dar robustez al modelo cuando el audio viene con ruido ambiente", "Para localizar la persona que está hablando dentro de la imagen", "Para determinar el intervalo de tiempo donde se produce el habla", "Los frames de video son imprescindibles para reconocer la palabra pronunciada"]

##Actividad 2

¿ Por qué el entrenamiento del modelo se hace por etapas ? ( primero ResNets, luego BiGRUs, luego todo junto ).

In [None]:
Respuesta = 'Para aumentar la estabilidad del entrenamiento y obtener un mayor rendimiento' #@param ["seleccione una opcion", "Porque el modelo es muy grande y no es posible alamcenar el gradiente de todos sus pesos en una GPU","Porque no es posible combinar modelos feedforward con modelos recurrentes en la propagación de gradientes", "Para separar el entrenamiento del stream de video del de audio", "Para aumentar la estabilidad del entrenamiento y obtener un mayor rendimiento", "Para separar cada uno de los 29 instantes de tiempo que componen los videos de entrada"]

##Actividad 3

Como vimos en clases, para la *rama* que se encarga de procesar el video, se comienza por agregar una capa convolucional 3D. ¿ Cuál es el principal objetivo de esta capa ?

**Hint:** Si no se acuerda, puede descargar el paper y leer la sección 3.1, página 2.

In [None]:
Respuesta = 'Capturar las din\xE1micas producidas en peque\xF1os intervalos de tiempo' #@param ["seleccione una opcion", "Reducir la dimensión temporal desde 29 frames a 1 que resuma todo el movimiento","Reducir los 3 canales RGB de entrada a una matriz bidimensional", "Realizar un downsampling de algunos frames que permitan recuperar patrones temporales", "Capturar las dinámicas producidas en pequeños intervalos de tiempo"]