<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Video" data-toc-modified-id="Video-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Video</a></span><ul class="toc-item"><li><span><a href="#Classifying-videos-with-pretrained-nets-in-six-different-ways" data-toc-modified-id="Classifying-videos-with-pretrained-nets-in-six-different-ways-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Classifying videos with pretrained nets in six different ways</a></span></li></ul></li></ul></div>

# Video
## Classifying videos with pretrained nets in six different ways

The **first approach** consists of classifying one video frame at a time by considering
each one of them as a separate image processed with a 2D CNN. This approach
simply reduces the video classification problem to an image classification problem.
Each video frame "emits" a classification output, and the video is assigned the
category chosen most frequently across its frames.
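
The voting step can be sketched as follows. This is a minimal sketch, assuming a 2D CNN has already produced one row of class probabilities per frame; the helper name `classify_video_by_vote` and the example numbers are illustrative and do not come from the original notebook.

```python
import numpy as np

def classify_video_by_vote(frame_probs):
    """Hypothetical helper: pick the class predicted most often across frames."""
    frame_probs = np.asarray(frame_probs)
    # Each frame votes for its most probable class
    frame_labels = frame_probs.argmax(axis=1)
    # Count the votes received by each class
    votes = np.bincount(frame_labels, minlength=frame_probs.shape[1])
    return int(votes.argmax()), votes

# Made-up per-frame probabilities: 4 frames, 3 classes
frame_probs = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.6, 0.3, 0.1],
    [0.5, 0.4, 0.1],
])
video_label, votes = classify_video_by_vote(frame_probs)
print(video_label, votes)
```

Ties are resolved in favor of the lowest class index here, which is a property of `argmax` rather than a deliberate design choice.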

The **second approach** consists of creating one single network where a 2D CNN is
combined with an RNN (see Chapter 9, Autoencoders). The idea is that the CNN will
take into account the image components and the RNN will take into account the
sequence information for each video. This type of network can be very difficult to
train because of the very high number of parameters to optimize.
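
One possible way to wire such a network in Keras is to apply the same small 2D CNN to every frame with `TimeDistributed` and feed the resulting per-frame vectors to an LSTM. The sketch below is only illustrative: the frame count, image size, layer widths, and number of classes are assumptions chosen to keep the model tiny.

```python
from tensorflow.keras import layers, models

# Illustrative sizes (assumptions, not taken from the text)
N_FRAMES, HEIGHT, WIDTH, CHANNELS, N_CLASSES = 8, 32, 32, 3, 5

def build_cnn_rnn():
    return models.Sequential([
        layers.Input(shape=(N_FRAMES, HEIGHT, WIDTH, CHANNELS)),
        # The same 2D convolution is applied to each frame independently
        layers.TimeDistributed(layers.Conv2D(16, 3, activation="relu")),
        # Collapse each frame's feature map into one vector per frame
        layers.TimeDistributed(layers.GlobalAveragePooling2D()),
        # The LSTM reads the per-frame vectors in temporal order
        layers.LSTM(32),
        layers.Dense(N_CLASSES, activation="softmax"),
    ])

cnn_rnn = build_cnn_rnn()
cnn_rnn.summary()
```

Because the CNN and the RNN are trained jointly, every weight in both parts has to be optimized from video data, which is where the training difficulty mentioned above comes from.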

The **third approach** is to use a 3D ConvNet, where 3D ConvNets are an extension
of 2D ConvNets operating on a 3D tensor (time, image_width, image_height). This
approach is another natural extension of image classification. Again, 3D ConvNets
can be hard to train.
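
As a rough illustration, a very small 3D ConvNet can be written with the Keras `Conv3D` layer. Note that Keras expects a trailing channel axis, so the input below has shape `(time, height, width, channels)`; all sizes are assumptions picked for the sketch.

```python
from tensorflow.keras import layers, models

# Illustrative sizes (assumptions, not taken from the text)
N_FRAMES, HEIGHT, WIDTH, CHANNELS, N_CLASSES = 8, 32, 32, 3, 5

def build_conv3d():
    return models.Sequential([
        layers.Input(shape=(N_FRAMES, HEIGHT, WIDTH, CHANNELS)),
        # The 3x3x3 kernel slides over time as well as over the two image axes
        layers.Conv3D(8, kernel_size=3, activation="relu"),
        layers.MaxPooling3D(pool_size=2),
        # Average over time and space, leaving one value per filter
        layers.GlobalAveragePooling3D(),
        layers.Dense(N_CLASSES, activation="softmax"),
    ])

conv3d_net = build_conv3d()
conv3d_net.summary()
```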

The **fourth approach** is based on a clever idea: instead of using CNNs directly for
classification, they can be used to extract features from each frame of the video
offline and store them. Feature extraction can be made very efficient with transfer
learning, as shown in a previous chapter. After all features are extracted, they can be
passed as a sequence of inputs into an RNN, which will learn patterns across multiple
frames and emit the final classification.
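
The two stages can be sketched as below. This is a sketch under stated assumptions: `MobileNetV2` is used as a stand-in feature extractor, and it is built with `weights=None` so that nothing is downloaded, whereas a real pipeline would load pretrained weights (for example `weights="imagenet"`). The videos are random arrays and all sizes are illustrative.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

# Illustrative sizes (assumptions, not taken from the text)
N_VIDEOS, N_FRAMES, SIZE, N_CLASSES = 2, 4, 96, 5

# Stand-in extractor; weights=None keeps the sketch offline (random weights)
extractor = tf.keras.applications.MobileNetV2(
    input_shape=(SIZE, SIZE, 3), include_top=False, weights=None, pooling="avg")
extractor.trainable = False

# Random arrays standing in for decoded video frames
videos = np.random.rand(N_VIDEOS, N_FRAMES, SIZE, SIZE, 3).astype("float32")

# Stage 1 (offline): one feature vector per frame, regrouped per video
frames = videos.reshape(-1, SIZE, SIZE, 3)
features = extractor.predict(frames, verbose=0)
features = features.reshape(N_VIDEOS, N_FRAMES, -1)

# Stage 2: an RNN that only ever sees the stored feature sequences
sequence_model = models.Sequential([
    layers.Input(shape=(N_FRAMES, features.shape[-1])),
    layers.LSTM(32),
    layers.Dense(N_CLASSES, activation="softmax"),
])
probs = sequence_model.predict(features, verbose=0)
print(features.shape, probs.shape)
```

Since the extractor is frozen, `features` can be computed once and saved to disk, so only the small sequence model needs to be trained.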

The **fifth approach** is a simple variant of the fourth, where the final layer is an MLP
instead of an RNN. In certain situations, this approach can be simpler and less
expensive in terms of computational requirements.
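
A minimal sketch of the MLP variant is shown below. It assumes the per-frame features have already been extracted and stored; random numbers stand in for them here, and the feature dimension and layer widths are arbitrary choices for illustration.

```python
import numpy as np
from tensorflow.keras import layers, models

# Illustrative sizes (assumptions, not taken from the text)
N_VIDEOS, N_FRAMES, FEATURE_DIM, N_CLASSES = 2, 4, 64, 5

# Random stand-in for features extracted offline by a pretrained CNN
features = np.random.rand(N_VIDEOS, N_FRAMES, FEATURE_DIM).astype("float32")

mlp_head = models.Sequential([
    layers.Input(shape=(N_FRAMES, FEATURE_DIM)),
    # Concatenate the per-frame vectors into one long vector per video
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dense(N_CLASSES, activation="softmax"),
])
probs = mlp_head.predict(features, verbose=0)
print(probs.shape)
```

Flattening fixes the number of frames per video, which is one trade-off of replacing the RNN with an MLP.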

The **sixth approach** is a variant of the fourth, where the feature extraction phase is
carried out by a 3D CNN that extracts spatial and temporal features. These features are
then passed into either an RNN or an MLP.

In [3]:
import tensorflow as tf
from tensorflow.keras import datasets, layers, models, preprocessing
max_len = 200
n_words = 10000
dim_embedding = 256
EPOCHS = 20
BATCH_SIZE = 500

def load_data():
    # Load the IMDb reviews, keeping only the n_words most frequent words
    (X_train, y_train), (X_test, y_test) = datasets.imdb.load_data(num_words=n_words)
    # Pad or truncate every review to exactly max_len tokens
    X_train = preprocessing.sequence.pad_sequences(X_train, maxlen=max_len)
    X_test = preprocessing.sequence.pad_sequences(X_test, maxlen=max_len)
    return (X_train, y_train), (X_test, y_test)

In [6]:
def build_model():
    model = models.Sequential()
    # Input - Embedding layer.
    # Takes an integer matrix of shape (batch, input_length) and outputs
    # a tensor of shape (batch, input_length, dim_embedding).
    # Every integer in the input must be smaller than n_words
    # (the vocabulary size).
    model.add(layers.Embedding(n_words, dim_embedding,
                               input_length=max_len))
    model.add(layers.Dropout(0.3))
    model.add(layers.Conv1D(256, 3, padding='valid',
                            activation='relu'))
    # Takes the maximum over the sequence dimension for each of the
    # 256 convolutional filters.
    model.add(layers.GlobalMaxPooling1D())
    model.add(layers.Dense(128, activation='relu'))
    model.add(layers.Dropout(0.5))
    model.add(layers.Dense(1, activation='sigmoid'))
    return model

(X_train, y_train), (X_test, y_test) = load_data()
model=build_model()
model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 200, 256)          2560000   
_________________________________________________________________
dropout_2 (Dropout)          (None, 200, 256)          0         
_________________________________________________________________
conv1d_1 (Conv1D)            (None, 198, 256)          196864    
_________________________________________________________________
global_max_pooling1d_1 (Glob (None, 256)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 128)               32896     
_________________________________________________________________
dropout_3 (Dropout)          (None, 128)               0         
_________________________________________________________________
dense_3 (Dense)              (None, 1)                
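
The shapes and parameter counts in this summary can be checked by hand with the standard formulas for embedding, 1D convolution, and dense layers. The short calculation below is only a sanity check; the value for the final dense layer is computed from the formula, since that row's count is cut off in the output above.

```python
# Hyperparameters used by build_model() above
max_len, n_words, dim_embedding = 200, 10000, 256
filters, kernel_size = 256, 3
dense_units = 128

# Embedding: one dim_embedding vector per vocabulary entry
embedding_params = n_words * dim_embedding
# Conv1D: kernel_size * input_channels weights per filter, plus one bias per filter
conv1d_params = kernel_size * dim_embedding * filters + filters
# Dense layers: inputs * units weights, plus one bias per unit
dense_params = filters * dense_units + dense_units
output_params = dense_units * 1 + 1
# 'valid' padding shortens the sequence by kernel_size - 1
conv_output_length = max_len - kernel_size + 1

print(embedding_params, conv1d_params, dense_params, output_params)
print(conv_output_length)
```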

In [8]:
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# fit() returns a History object with per-epoch metrics, not a score
history = model.fit(X_train, y_train,
                    epochs=EPOCHS,
                    batch_size=BATCH_SIZE,
                    validation_data=(X_test, y_test))

Train on 25000 samples, validate on 25000 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


In [11]:
# evaluate() returns [loss, accuracy] for the metrics compiled above
score = model.evaluate(X_test, y_test, batch_size=BATCH_SIZE, verbose=0)
print("\nTest score:", score[0])
print('Test accuracy:', score[1])


Test score: 0.6261743384599686
Test accuracy: 0.87596
