<a href="https://colab.research.google.com/github/gabilodeau/INF6804/blob/master/two_stream.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

INF6804 Vision par ordinateur

Polytechnique Montréal

Author: Soufiane Lamghari




Description: This notebook runs a two-stream architecture for action recognition in inference mode, using ResNet-152 as the backbone network (pre-trained on the 101 classes of the UCF-101 dataset). In this example, we predict the actions in a few sample videos.

Import libraries

In [1]:
import torch
import torch.nn as nn
import torch.utils.model_zoo as model_zoo
import torchvision.transforms as transforms
import numpy as np
import cv2
import math
import os
import sys
import collections
import time
import urllib.request as request
import subprocess
from PIL import Image, ImageDraw, ImageFont
from IPython.display import HTML
from base64 import b64encode
!apt-get install -y subversion > /dev/null

Create a working directory named twostream

In [2]:
os.makedirs('twostream', exist_ok=True)
%cd twostream

/content/twostream


Get the videos (here from the GitHub repository of the INF6804 course) and save them in a destination folder. For simplicity, each zip file already contains both the optical flow (OF) frames and the RGB frames.

In [3]:
source = 'https://github.com/gabilodeau/INF6804/trunk/videos'

examples = ['Violin.zip', 'Teeth.zip']
destination = 'videos'

os.makedirs(destination, exist_ok=True)

for example in examples:
  os.system('cd {} && svn export {}'.format(destination, os.path.join(source,example)))
  os.system('cd {} && unzip {}'.format(destination, example))

Define the image transformations that convert frames to tensors and normalize them. The RGB frames are normalized with the ImageNet mean and standard deviation.

In [4]:
#@title

class Compose(object):

    def __init__(self, video_transforms):
        self.video_transforms = video_transforms

    def __call__(self, clips):
        for t in self.video_transforms:
            clips = t(clips)
        return clips


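class ToTensor(object):
    """Convert an H x W x C array with values in [0, 255] to a C x H x W float tensor in [0, 1]."""

    def __call__(self, clips):
        if not isinstance(clips, np.ndarray):
            raise TypeError('ToTensor expects a numpy.ndarray, got {}'.format(type(clips)))
        # Move the channel axis first: H x W x C -> C x H x W
        clips = torch.from_numpy(clips.transpose((2, 0, 1)))
        # Cast to float32 and rescale from [0, 255] to [0, 1]
        return clips.float().div(255.0)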

class Normalize(object):

    def __init__(self, mean, std):
        self.mean = mean
        self.std = std

    def __call__(self, tensor):
        for t, m, s in zip(tensor, self.mean, self.std):
            t.sub_(m).div_(s)
        return tensor
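
To make the effect of these transformations concrete, here is a minimal NumPy-only sketch of the arithmetic that `ToTensor` followed by `Normalize` performs on one frame. The 2x2 gray frame is a made-up input for illustration; only the ImageNet statistics come from the notebook.

```python
import numpy as np

# ImageNet channel statistics, as used for the RGB stream in this notebook.
mean = np.array([0.485, 0.456, 0.406])
std = np.array([0.229, 0.224, 0.225])

# Hypothetical 2x2 RGB frame where every pixel is mid-gray (127.5).
frame_hwc = np.full((2, 2, 3), 127.5)

# ToTensor step: H x W x C -> C x H x W, then rescale [0, 255] to [0, 1].
frame_chw = frame_hwc.transpose((2, 0, 1)) / 255.0

# Normalize step: subtract the mean and divide by the std, channel by channel.
normalized = (frame_chw - mean[:, None, None]) / std[:, None, None]

print(normalized.shape)  # (3, 2, 2)
```

Each channel ends up centered around zero with roughly unit scale, which is what a network pre-trained with these statistics expects.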

Load the UCF-101 class labels (here from a text file hosted on GitHub).

In [5]:
UCF101_NAMES = []
label_names = request.urlopen('https://raw.githubusercontent.com/sfnlm/DeepExamples/main/utils/UCF101_labels.txt')
for label_name in label_names.readlines():
  UCF101_NAMES.append(label_name.strip().decode('UTF-8').split(' ')[-1])
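
The parsing above keeps only the last space-separated token of each line. The sketch below illustrates it on three made-up byte lines, assuming the label file uses an `<index> <ClassName>` layout; that layout is an assumption, since the file itself is not shown here.

```python
# Hypothetical lines in the assumed "<index> <ClassName>" format.
# They are bytes because urlopen() returns bytes.
sample_lines = [b"1 ApplyEyeMakeup\n", b"2 ApplyLipstick\n", b"3 Archery\n"]

names = [line.strip().decode("UTF-8").split(" ")[-1] for line in sample_lines]
print(names)  # ['ApplyEyeMakeup', 'ApplyLipstick', 'Archery']

# The network output index k (zero-based) maps to names[k].
print(names[2])  # Archery
```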

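Let's define the ResNet-152 model. The first convolution takes 3 input channels for the RGB stream and 20 for the optical flow stream (10 flow frames, each with an x and a y component).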

In [6]:
#@title

def conv3x3(in_planes, out_planes, stride=1):
    "3x3 convolution with padding"
    return nn.Conv2d(in_planes, out_planes, kernel_size=3, stride=stride,
                     padding=1, bias=False)


class BasicBlock(nn.Module):
    expansion = 1

    def __init__(self, inplanes, planes, stride=1, downsample=None):
        super(BasicBlock, self).__init__()
        self.conv1 = conv3x3(inplanes, planes, stride)
        self.bn1 = nn.BatchNorm2d(planes)
        self.relu = nn.ReLU(inplace=True)
        self.conv2 = conv3x3(planes, planes)
        self.bn2 = nn.BatchNorm2d(planes)
        self.downsample = downsample
        self.stride = stride

    def forward(self, x):
        residual = x

        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)

        out = self.conv2(out)
        out = self.bn2(out)

        if self.downsample is not None:
            residual = self.downsample(x)

        out += residual
        out = self.relu(out)

        return out


class Bottleneck(nn.Module):
    expansion = 4

    def __init__(self, inplanes, planes, stride=1, downsample=None):
        super(Bottleneck, self).__init__()
        self.conv1 = nn.Conv2d(inplanes, planes, kernel_size=1, bias=False)
        self.bn1 = nn.BatchNorm2d(planes)
        self.conv2 = nn.Conv2d(planes, planes, kernel_size=3, stride=stride,
                               padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(planes)
        self.conv3 = nn.Conv2d(planes, planes * 4, kernel_size=1, bias=False)
        self.bn3 = nn.BatchNorm2d(planes * 4)
        self.relu = nn.ReLU(inplace=True)
        self.downsample = downsample
        self.stride = stride

    def forward(self, x):
        residual = x

        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)

        out = self.conv2(out)
        out = self.bn2(out)
        out = self.relu(out)

        out = self.conv3(out)
        out = self.bn3(out)

        if self.downsample is not None:
            residual = self.downsample(x)

        out += residual
        out = self.relu(out)

        return out


class ResNet(nn.Module):

    def __init__(self, block, stream, layers, num_classes=1000):
        self.inplanes = 64
        super(ResNet, self).__init__()
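        if stream == 'OF':
            # 10 stacked flow frames x 2 components (x and y)
            self.conv1 = nn.Conv2d(20, 64, kernel_size=7, stride=2, padding=3,
                                   bias=False)
        elif stream == 'RGB':
            self.conv1 = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3,
                                   bias=False)
        else:
            raise ValueError("stream must be 'OF' or 'RGB', got {!r}".format(stream))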
        self.bn1 = nn.BatchNorm2d(64)
        self.relu = nn.ReLU(inplace=True)
        self.maxpool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
        self.layer1 = self._make_layer(block, 64, layers[0])
        self.layer2 = self._make_layer(block, 128, layers[1], stride=2)
        self.layer3 = self._make_layer(block, 256, layers[2], stride=2)
        self.layer4 = self._make_layer(block, 512, layers[3], stride=2)
        self.avgpool = nn.AvgPool2d(7)
        self.dp = nn.Dropout(p=0.8)
        self.fc_action = nn.Linear(512 * block.expansion, num_classes)

        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                n = m.kernel_size[0] * m.kernel_size[1] * m.out_channels
                m.weight.data.normal_(0, math.sqrt(2. / n))
            elif isinstance(m, nn.BatchNorm2d):
                m.weight.data.fill_(1)
                m.bias.data.zero_()

    def _make_layer(self, block, planes, blocks, stride=1):
        downsample = None
        if stride != 1 or self.inplanes != planes * block.expansion:
            downsample = nn.Sequential(
                nn.Conv2d(self.inplanes, planes * block.expansion,
                          kernel_size=1, stride=stride, bias=False),
                nn.BatchNorm2d(planes * block.expansion),
            )

        layers = []
        layers.append(block(self.inplanes, planes, stride, downsample))
        self.inplanes = planes * block.expansion
        for i in range(1, blocks):
            layers.append(block(self.inplanes, planes))

        return nn.Sequential(*layers)

    def forward(self, x):
        x = self.conv1(x)
        x = self.bn1(x)
        x = self.relu(x)
        x = self.maxpool(x)

        x = self.layer1(x)
        x = self.layer2(x)
        x = self.layer3(x)
        x = self.layer4(x)

        x = self.avgpool(x)
        x = x.view(x.size(0), -1)
        x = self.dp(x)
        x = self.fc_action(x)
        return x

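# ImageNet weights published by torchvision; only needed when pretrained=True.
model_urls = {
    'resnet152': 'https://download.pytorch.org/models/resnet152-b121ed2d.pth',
}

def rgb_resnet152(pretrained=False, **kwargs):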

    model = ResNet(Bottleneck,'RGB', [3, 8, 36, 3], **kwargs)
    if pretrained:
        pretrained_dict = model_zoo.load_url(model_urls['resnet152'])
        model_dict = model.state_dict()

        # 1. filter out unnecessary keys
        pretrained_dict = {k: v for k, v in pretrained_dict.items() if k in model_dict}
        # 2. overwrite entries in the existing state dict
        model_dict.update(pretrained_dict) 
        # 3. load the new state dict
        model.load_state_dict(model_dict)

    return model

def change_key_names(old_params, in_channels):
    new_params = collections.OrderedDict()
    layer_count = 0
    allKeyList = old_params.keys()
    for layer_key in allKeyList:
        if layer_count >= len(allKeyList)-2:
            # exclude fc layers
            continue
        else:
            if layer_count == 0:
                rgb_weight = old_params[layer_key]               
                rgb_weight_mean = torch.mean(rgb_weight, dim=1)
                flow_weight = rgb_weight_mean.unsqueeze(1).repeat(1,in_channels,1,1)
                new_params[layer_key] = flow_weight
                layer_count += 1
            else:
                new_params[layer_key] = old_params[layer_key]
                layer_count += 1
    
    return new_params



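def flow_resnet152(pretrained=False, **kwargs):
    """Constructs a ResNet-152 model for the optical flow stream (20 input channels).

    Args:
        pretrained (bool): If True, initializes the model from ImageNet weights;
            the RGB filters of the first convolution are averaged and repeated
            over the 20 flow channels (see change_key_names).
    """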
    model = ResNet(Bottleneck, 'OF',[3, 8, 36, 3], **kwargs)
    if pretrained:
        in_channels = 20
        pretrained_dict = model_zoo.load_url(model_urls['resnet152'])
        model_dict = model.state_dict()

        new_pretrained_dict = change_key_names(pretrained_dict, in_channels)
        # 1. filter out unnecessary keys
        new_pretrained_dict = {k: v for k, v in new_pretrained_dict.items() if k in model_dict}
        # 2. overwrite entries in the existing state dict
        model_dict.update(new_pretrained_dict) 
        # 3. load the new state dict
        model.load_state_dict(model_dict)

    return model
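
Two details of this cell can be checked with a small NumPy sketch: how `change_key_names` adapts a 3-channel first convolution to 20 flow channels, and why the block configuration `[3, 8, 36, 3]` gives 152 layers. The random array is only a stand-in for real pretrained weights.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the pretrained RGB conv1 weight: (out_channels, 3, 7, 7).
rgb_weight = rng.normal(size=(64, 3, 7, 7))

in_channels = 20  # 10 flow frames x 2 components (x and y)

# Average over the RGB channel axis, then replicate for each flow channel.
rgb_weight_mean = rgb_weight.mean(axis=1, keepdims=True)       # (64, 1, 7, 7)
flow_weight = np.repeat(rgb_weight_mean, in_channels, axis=1)  # (64, 20, 7, 7)
print(flow_weight.shape)  # (64, 20, 7, 7)

# Depth check: 3 convolutions per Bottleneck block, plus conv1 and the
# final fully connected layer.
blocks = [3, 8, 36, 3]
depth = 3 * sum(blocks) + 2
print(depth)  # 152
```

In this notebook both networks are built with `pretrained=False` and then loaded from the UCF-101 checkpoints, so this initialization path is not exercised during inference.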

Download the pretrained checkpoints for the RGB and OF streams. Both streams use a ResNet with 152 layers.

In [7]:
os.makedirs('checkpoints', exist_ok=True)

RGB_checkpoint='ucf101_s1_rgb_resnet152.pth.tar'
OF_checkpoint='ucf101_s1_flow_resnet152.pth.tar'
# Pass the full URL: the --id flag is deprecated in recent gdown releases.
!cd checkpoints && gdown 'https://drive.google.com/uc?id=1BU8TyW7u-skmkQFAVlQhA_5ZZvugZXAt'
!cd checkpoints && gdown 'https://drive.google.com/uc?id=1KPoPYAslsdOMXbtqfi2y8TTn7zDEz898'

Downloading...
From: https://drive.google.com/uc?id=1BU8TyW7u-skmkQFAVlQhA_5ZZvugZXAt
To: /content/twostream/checkpoints/ucf101_s1_rgb_resnet152.pth.tar
468MB [00:03, 117MB/s]
Downloading...
From: https://drive.google.com/uc?id=1KPoPYAslsdOMXbtqfi2y8TTn7zDEz898
To: /content/twostream/checkpoints/ucf101_s1_flow_resnet152.pth.tar
468MB [00:03, 149MB/s]


**Spatial stream inference**

This function predicts video actions from the spatial stream. Given the path to a video's RGB frames, *VideoSpatialPrediction()* uniformly samples *num_samples* frames, takes ten 224x224 crops of each (four corners and the center, plus their horizontal flips), and passes them through the network. It returns the class scores for every crop; averaging these scores afterwards gives the video-level spatial prediction.
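
The uniform sampling can be sketched in isolation as follows. The helper name `sampled_rgb_frames` is hypothetical; it reproduces the step formula used in the function and assumes `num_samples` is at least 2, since the formula divides by `num_samples - 1`.

```python
import math

def sampled_rgb_frames(duration, num_samples):
    """Return the 1-based frame numbers read by the spatial stream."""
    step = int(math.floor((duration - 1) / (num_samples - 1)))
    return [i * step + 1 for i in range(num_samples)]

print(sampled_rgb_frames(186, 2))  # [1, 186]
print(sampled_rgb_frames(103, 5))  # [1, 26, 51, 76, 101]
```

With 2 samples the function reads only the first and last frames of the video, which is why the subsampling step printed for a 186-frame video is 185.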



In [8]:
#@title

def VideoSpatialPrediction(
        vid_name,
        net,
        num_categories,
        num_samples,
        start_frame=0,
        num_frames=0
        ):
  
    vid_path_RGB=os.path.join(destination, vid_name.split('.')[0], 'RGB')

    if num_frames == 0:
        imglist = os.listdir(vid_path_RGB)
        duration = len(imglist)
    else:
        duration = num_frames


    clip_mean = [0.485, 0.456, 0.406]
    clip_std = [0.229, 0.224, 0.225]
    normalize = Normalize(mean=clip_mean,
                                     std=clip_std)
    val_transform = Compose([
            ToTensor(),
            normalize,
        ])

    # selection
    step = int(math.floor((duration-1)/(num_samples-1)))
    print('Video length: ', duration)
    print('Samples frames: ', num_samples)
    print('Subsampling step: ', step)
    dims = (256,340,3,num_samples)
    rgb = np.zeros(shape=dims, dtype=np.float64)
    rgb_flip = np.zeros(shape=dims, dtype=np.float64)
    for i in range(num_samples):

        framenumber='{0:06d}'.format(i*step+1)
        img_file = os.path.join(vid_path_RGB, 'frame{}.jpg'.format(framenumber))
        img = cv2.imread(img_file, cv2.IMREAD_UNCHANGED)
        img = cv2.resize(img, dims[1::-1])
        img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
        rgb[:,:,:,i] = img
        rgb_flip[:,:,:,i] = img[:,::-1,:]

    # crop
    rgb_1 = rgb[:224, :224, :,:]
    rgb_2 = rgb[:224, -224:, :,:]
    rgb_3 = rgb[16:240, 60:284, :,:]
    rgb_4 = rgb[-224:, :224, :,:]
    rgb_5 = rgb[-224:, -224:, :,:]
    rgb_f_1 = rgb_flip[:224, :224, :,:]
    rgb_f_2 = rgb_flip[:224, -224:, :,:]
    rgb_f_3 = rgb_flip[16:240, 60:284, :,:]
    rgb_f_4 = rgb_flip[-224:, :224, :,:]
    rgb_f_5 = rgb_flip[-224:, -224:, :,:]

    rgb = np.concatenate((rgb_1,rgb_2,rgb_3,rgb_4,rgb_5,rgb_f_1,rgb_f_2,rgb_f_3,rgb_f_4,rgb_f_5), axis=3)

    _, _, _, c = rgb.shape
    rgb_list = []
    for c_index in range(c):
        cur_img = rgb[:,:,:,c_index].squeeze()
        cur_img_tensor = val_transform(cur_img)
        rgb_list.append(np.expand_dims(cur_img_tensor.numpy(), 0))
        
    rgb_np = np.concatenate(rgb_list,axis=0)
    batch_size = 25
    prediction = np.zeros((num_categories,rgb.shape[3]))
    num_batches = int(math.ceil(float(rgb.shape[3])/batch_size))

    # Inference only: disable gradient tracking to save memory
    with torch.no_grad():
        for bb in range(num_batches):
            span = range(batch_size*bb, min(rgb.shape[3],batch_size*(bb+1)))
            input_data = rgb_np[span,:,:,:]
            imgDataTensor = torch.from_numpy(input_data).type(torch.FloatTensor).cuda()
            output = net(imgDataTensor)
            result = output.cpu().numpy()
            prediction[:, span] = np.transpose(result)

    return prediction
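
The cropping step in the function above is a ten-crop: five 224x224 crops of the frame and the same five crops of its horizontal mirror. The sketch below applies the same slices to a single dummy frame; the helper name `five_crops` is hypothetical.

```python
import numpy as np

# Stand-in for one resized frame: 256 rows x 340 columns x 3 channels.
frame = np.arange(256 * 340 * 3, dtype=np.float64).reshape(256, 340, 3)
flipped = frame[:, ::-1, :]  # horizontal mirror

def five_crops(img):
    """Four corner crops and one central crop, using the notebook's offsets."""
    return [
        img[:224, :224],       # top-left
        img[:224, -224:],      # top-right
        img[16:240, 60:284],   # center
        img[-224:, :224],      # bottom-left
        img[-224:, -224:],     # bottom-right
    ]

crops = five_crops(frame) + five_crops(flipped)
print(len(crops))                 # 10
print({c.shape for c in crops})   # {(224, 224, 3)}
```

Each sampled frame therefore contributes 10 network inputs, so with `num_samples = 2` the returned prediction has 20 columns.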


Load the spatial model and get the predictions for each example. The predicted categories belong to the 101 classes of the UCF-101 dataset. We uniformly sample *2* frames from each video, average the scores over all their crops, and take the highest-scoring class as the predicted action.

In [9]:


RGB_num_samples = 2

RGB_model_path = os.path.join('checkpoints',RGB_checkpoint)
RGB_start_frame = 0
num_categories = 101
model_start_time = time.time()
params = torch.load(RGB_model_path)
spatial_net = rgb_resnet152(pretrained=False, num_classes=101)
spatial_net.load_state_dict(params['state_dict'])
spatial_net.cuda()
spatial_net.eval()
model_end_time = time.time()
model_time = model_end_time - model_start_time
print("Action recognition model is loaded in %4.4f seconds." % (model_time))
print("%d test videos" % len(examples))
line_id = 1
avg_fc8_spatial_result = []
predicted_RGB=[]
for example in examples:
    print("Video %d/%d" % (line_id, len(examples)))

    spatial_prediction = VideoSpatialPrediction(
            example,
            spatial_net,
            num_categories,
            RGB_num_samples,
            RGB_start_frame)
 
    avg_spatial_pred_fc8 = np.mean(spatial_prediction, axis=1)
    avg_fc8_spatial_result.append(avg_spatial_pred_fc8)
    spatial_pred_index = np.argmax(avg_spatial_pred_fc8)
    predicted_RGB.append(UCF101_NAMES[spatial_pred_index])
    line_id += 1

print('==> End of inference')

Action recognition model is loaded in 18.9255 seconds.
2 test videos
Video 1/2
Video length:  186
Samples frames:  2
Subsampling step:  185
Video 2/2
Video length:  103
Samples frames:  2
Subsampling step:  102
==> End of inference
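
The video-level decision made in the loop above is a mean over crops followed by an argmax. Here is a minimal sketch with made-up scores for 3 classes and 4 crops (the real matrix has 101 rows and 10 columns per sampled frame):

```python
import numpy as np

# Hypothetical scores: one row per class, one column per crop.
prediction = np.array([
    [2.0, 1.0, 3.0, 2.0],   # class 0
    [1.0, 4.0, 1.0, 1.0],   # class 1
    [0.0, 0.5, 0.5, 1.0],   # class 2
])

avg_scores = np.mean(prediction, axis=1)
print(avg_scores.tolist())         # [2.0, 1.75, 0.5]
print(int(np.argmax(avg_scores)))  # 0
```

Class 1 has the single highest score (4.0 on one crop), but class 0 wins after averaging. The scores are averaged before the decision, rather than counting one vote per crop.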


**Temporal stream inference**

This function predicts video actions from the temporal stream. Given the optical flow of a video, *VideoTemporalPrediction()* builds *num_samples* stacks of 10 consecutive optical flow frames (x and y components, so 20 channels per stack), takes ten crops of each stack, and passes them through the network. It returns the class scores for every crop; averaging these scores afterwards gives the video-level temporal prediction.
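
The stack selection can be sketched on its own. The helper name `sampled_flow_stacks` is hypothetical; it uses the same step formula as the function, where the step leaves room for a full stack at the end of the video.

```python
import math

def sampled_flow_stacks(duration, num_samples, stack_len=10, start_frame=0):
    """Return, for each sample, the 1-based flow frame numbers in its stack."""
    step = int(math.floor((duration - stack_len + 1) / num_samples))
    return [
        [i * step + j + 1 + start_frame for j in range(stack_len)]
        for i in range(num_samples)
    ]

stacks = sampled_flow_stacks(186, 2)
print(stacks[0][0], stacks[0][-1])  # 1 10
print(stacks[1][0], stacks[1][-1])  # 89 98
```

For a 186-frame video and 2 samples the step is 88, so the stacks cover frames 1 to 10 and 89 to 98.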

In [10]:
#@title

def VideoTemporalPrediction(
        vid_name,
        net,
        num_categories,
        num_samples,
        optical_flow_frames,
        start_frame=0,
        num_frames=0
      
        ):
  
    vid_path_OF=os.path.join(destination, vid_name.split('.')[0], 'OF')

    if num_frames == 0:
        imglist = os.listdir(os.path.join(vid_path_OF,'u'))
        duration = len(imglist)
    else:
        duration = num_frames

    
    clip_mean = [0.5] * 20
    clip_std = [0.226] * 20
    normalize = Normalize(mean=clip_mean,
                                     std=clip_std)
    val_transform = Compose([
            ToTensor(),
            normalize,
        ])
    

    # selection
    step = int(math.floor((duration-optical_flow_frames+1)/num_samples))
    print('Video length: ', duration)
    print('Samples frames: ', num_samples)
    print('Subsampling step: ', step)
    
    dims = (256,340,optical_flow_frames*2,num_samples)
    flow = np.zeros(shape=dims, dtype=np.float64)
    flow_flip = np.zeros(shape=dims, dtype=np.float64)
    

    for i in range(num_samples):
        for j in range(optical_flow_frames):
                   
            framenumber='{0:06d}'.format(i*step+j+1 + start_frame)
            flow_x_file = os.path.join(vid_path_OF,'u','frame{}.jpg'.format(framenumber))
            flow_y_file = os.path.join(vid_path_OF,'v','frame{}.jpg'.format(framenumber))
            img_x = cv2.imread(flow_x_file, cv2.IMREAD_GRAYSCALE)
            img_y = cv2.imread(flow_y_file, cv2.IMREAD_GRAYSCALE)
            img_x = cv2.resize(img_x, dims[1::-1])
            img_y = cv2.resize(img_y, dims[1::-1])

            flow[:,:,j*2  ,i] = img_x
            flow[:,:,j*2+1,i] = img_y

            flow_flip[:,:,j*2  ,i] = 255 - img_x[:, ::-1]
            flow_flip[:,:,j*2+1,i] = img_y[:, ::-1]

    # crop
    flow_1 = flow[:224, :224, :,:]
    flow_2 = flow[:224, -224:, :,:]
    flow_3 = flow[16:240, 60:284, :,:]
    flow_4 = flow[-224:, :224, :,:]
    flow_5 = flow[-224:, -224:, :,:]
    flow_f_1 = flow_flip[:224, :224, :,:]
    flow_f_2 = flow_flip[:224, -224:, :,:]
    flow_f_3 = flow_flip[16:240, 60:284, :,:]
    flow_f_4 = flow_flip[-224:, :224, :,:]
    flow_f_5 = flow_flip[-224:, -224:, :,:]

    flow = np.concatenate((flow_1,flow_2,flow_3,flow_4,flow_5,flow_f_1,flow_f_2,flow_f_3,flow_f_4,flow_f_5), axis=3)
    
    _, _, _, c = flow.shape
    flow_list = []
    for c_index in range(c):
        cur_img = flow[:,:,:,c_index].squeeze()
        cur_img_tensor = val_transform(cur_img)
        flow_list.append(np.expand_dims(cur_img_tensor.numpy(), 0))
        
    flow_np = np.concatenate(flow_list,axis=0)

    batch_size = 25
    prediction = np.zeros((num_categories,flow.shape[3]))
    num_batches = int(math.ceil(float(flow.shape[3])/batch_size))

    # Inference only: disable gradient tracking to save memory
    with torch.no_grad():
        for bb in range(num_batches):
            span = range(batch_size*bb, min(flow.shape[3],batch_size*(bb+1)))

            input_data = flow_np[span,:,:,:]
            imgDataTensor = torch.from_numpy(input_data).type(torch.FloatTensor).cuda()
            output = net(imgDataTensor)
            result = output.cpu().numpy()
            prediction[:, span] = np.transpose(result)

    return prediction
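
One detail of the function above is worth spelling out: when an optical flow image is mirrored horizontally, the horizontal component must also change sign, which is why the code uses `255 - img_x` for the x channel but not for the y channel. The sketch below uses made-up 1x3 flow images and assumes the usual 8-bit encoding in which values around 128 mean no motion; that encoding is an assumption about how the flow frames were saved.

```python
import numpy as np

# Hypothetical 1 x 3 flow images (8-bit encoding, about 128 = no motion).
flow_x = np.array([[200.0, 128.0, 100.0]])  # horizontal component
flow_y = np.array([[50.0, 128.0, 180.0]])   # vertical component

# Mirror both images left-to-right; invert only the horizontal component.
flip_x = 255 - flow_x[:, ::-1]
flip_y = flow_y[:, ::-1]

print(flip_x.tolist())  # [[155.0, 127.0, 55.0]]
print(flip_y.tolist())  # [[180.0, 128.0, 50.0]]
```

A pixel that moved to the right in the original (value above the midpoint) moves to the left in the mirrored video, so its encoded value falls below the midpoint after the flip.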


Load the temporal model and get the predictions for each example. The predicted categories belong to the 101 classes of the UCF-101 dataset. From each video, we uniformly sample *2* stacks of *10* consecutive optical flow frames to predict the corresponding action.

In [11]:
OF_num_samples=2

optical_flow_frames=10
OF_model_path = os.path.join('checkpoints', OF_checkpoint)
OF_start_frame = 0
num_categories = 101
model_start_time = time.time()
params = torch.load(OF_model_path)
temporal_net = flow_resnet152(pretrained=False, num_classes=101)
temporal_net.load_state_dict(params['state_dict'])
temporal_net.cuda()
temporal_net.eval()
model_end_time = time.time()
model_time = model_end_time - model_start_time
print("Action recognition temporal model is loaded in %4.4f seconds." % (model_time))

print("%d test videos" % len(examples))

line_id = 1
avg_fc8_temporal_result = []
predicted_OF=[]
for example in examples:
    print("Video %d/%d" % (line_id, len(examples)))
    temporal_prediction = VideoTemporalPrediction(
            example,
            temporal_net,
            num_categories,
            OF_num_samples,
            optical_flow_frames,
            OF_start_frame)

    avg_temporal_pred_fc8 = np.mean(temporal_prediction, axis=1)
    avg_fc8_temporal_result.append(avg_temporal_pred_fc8)
    temporal_pred_index = np.argmax(avg_temporal_pred_fc8)
    predicted_OF.append(UCF101_NAMES[temporal_pred_index])
    line_id += 1

print('==> End of inference')

Action recognition temporal model is loaded in 1.6643 seconds.
2 test videos
Video 1/2
Video length:  186
Samples frames:  2
Subsampling step:  88
Video 2/2
Video length:  102
Samples frames:  2
Subsampling step:  46
==> End of inference


**Two stream fusion**

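This function predicts the action of each example by fusing the two streams: it adds the averaged class scores of the spatial and temporal streams, with equal weight, and picks the class with the highest sum.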

In [12]:
def two_stream_prediction(avg_spatial_pred_fc8, avg_temporal_pred_fc8):
  # Sum the per-video class scores of both streams (one row per video)
  avg_twostream_pred_fc8 = np.array(avg_spatial_pred_fc8) + np.array(avg_temporal_pred_fc8)
  predicted_two_stream = []
  for pred in avg_twostream_pred_fc8:
    two_stream_pred_index = np.argmax(pred)
    predicted_two_stream.append(UCF101_NAMES[two_stream_pred_index])
  return predicted_two_stream
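
As a small illustration of this fusion, the sketch below uses made-up scores for one video and a hypothetical three-class subset of the labels. The two streams disagree, and the sum decides.

```python
import numpy as np

labels = ["Archery", "Biking", "PlayingViolin"]  # hypothetical subset

# One row per video, one column per class (made-up averaged scores).
spatial = np.array([[1.0, 3.0, 2.5]])
temporal = np.array([[0.5, 1.0, 4.0]])

fused = spatial + temporal

print(labels[int(np.argmax(spatial[0]))])   # Biking
print(labels[int(np.argmax(temporal[0]))])  # PlayingViolin
print(labels[int(np.argmax(fused[0]))])     # PlayingViolin
```

Because the raw scores are added directly, a stream that is very confident can override the other one, as the temporal stream does here.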

**Action recognition visualization**

Functions used to visualize the predicted action. *label_video()* draws the category label on the video. *show_labeled_videos()* shows the labeled video for each stream prediction and for the fused prediction.

In [13]:
#@title

def label_video(labels, examples, dst_directory_path):

    if not os.path.exists(dst_directory_path):
        subprocess.call('mkdir -p {}'.format(dst_directory_path), shell=True)

    for index in range(len(labels)):

        unit_classes = labels[index]
        vid_name=examples[index].split('.')[0]

        if os.path.exists('tmp'):
            subprocess.call('rm -rf tmp', shell=True)
        subprocess.call('mkdir tmp', shell=True)
 

        for j in range(len(os.listdir('videos/{}/RGB'.format(vid_name)))):
            image = Image.open('videos/{}/RGB/frame{:06}.jpg'.format(vid_name,j+1)).convert('RGB')
            min_length = min(image.size)
            font_size = int(min_length * 0.05)
            font =   ImageFont.load_default()
            d = ImageDraw.Draw(image)
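            # ImageDraw.textsize() was removed in Pillow 10; textbbox() replaces it
            # (Pillow 9.2 or later is needed for the default bitmap font)
            left, top, right, bottom = d.textbbox((0, 0), labels[index], font=font)
            textsize = (right - left, bottom - top)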
            x = int(font_size * 0.5)
            y = int(font_size * 0.25)
            x_offset = x
            y_offset = y
            rect_position = (x, y, x + textsize[0] + x_offset * 2,
                              y + textsize[1] + y_offset * 2)
            d.rectangle(rect_position, fill=(30, 30, 30))
            d.text((x + x_offset, y + y_offset), labels[index],
                    font=font, fill=(235, 235, 235))
            image.save('tmp/image_{:06}_pred.jpg'.format(j+1))


        dst_file_path = os.path.join(dst_directory_path, '{}.mp4'.format(vid_name))
        fps = 30
        subprocess.call('ffmpeg -y -r {} -i tmp/image_%06d_pred.jpg -b:v 1000k {}'.format(fps, dst_file_path),
                        shell=True)

        if os.path.exists('tmp'):
            subprocess.call('rm -rf tmp', shell=True)
    print('labeled videos saved in:',dst_directory_path)


def show_labeled_videos(stream, examples):
  for example in examples:
    video_path = os.path.join('pred_vids_{}'.format(stream), '{}.mp4'.format(example.split('.')[0]))
    # Embed the video in the page as a base64 data URI
    with open(video_path, 'rb') as f:
      mp4 = f.read()
    decoded_vid = "data:video/mp4;base64," + b64encode(mp4).decode()
    display(HTML(f'<video width=400 controls><source src="{decoded_vid}" type="video/mp4"></video>'))
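
The embedding trick used by `show_labeled_videos()` is a base64 data URI: the file bytes are encoded as text and placed directly in the `src` attribute. A minimal sketch with placeholder bytes instead of a real MP4 file:

```python
from base64 import b64decode, b64encode

payload = b"not a real video"  # placeholder for the bytes of an MP4 file

data_uri = "data:video/mp4;base64," + b64encode(payload).decode()
html = f'<video width=400 controls><source src="{data_uri}" type="video/mp4"></video>'

# The encoding is lossless: decoding the text part restores the bytes.
encoded = data_uri.split(",", 1)[1]
print(b64decode(encoded) == payload)  # True
print(len(payload), len(encoded))     # 16 24
```

Base64 produces 4 characters for every 3 bytes, so the embedded video is about one third larger than the file on disk. This is fine for short clips like the ones used here.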



Let's see the labeled videos (prediction from the spatial stream)

In [14]:
label_video(predicted_RGB, examples, './pred_vids_RGB')
show_labeled_videos('RGB', examples)

labeled videos saved in: ./pred_vids_RGB


Let's see the labeled videos (prediction from the temporal stream)

In [15]:
label_video(predicted_OF, examples, './pred_vids_OF')
show_labeled_videos('OF', examples)

labeled videos saved in: ./pred_vids_OF


Let's see the labeled videos (prediction from the two-stream fusion)

In [16]:
predicted_two_stream=two_stream_prediction(avg_fc8_spatial_result,avg_fc8_temporal_result)
label_video(predicted_two_stream, examples, './pred_vids_two_stream')
show_labeled_videos('two_stream', examples)

labeled videos saved in: ./pred_vids_two_stream


**References:**

 - [two-stream-pytorch](https://github.com/bryanyzhu/two-stream-pytorch)

 - [Two-stream convolutional networks for action recognition in videos](https://arxiv.org/abs/1406.2199) 

 - [Towards Good Practices for Very Deep Two-Stream ConvNets](https://arxiv.org/abs/1507.02159) 

 