<a href="https://colab.research.google.com/github/bshakhruz/DAN-templates/blob/main/DNN_video_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Training Deep Neural Networks on Videos Using Google Colab


## 1. Introduction to Google Colab

Google Colab is a hosted Jupyter Notebook service that offers free, usage-limited access to GPUs and TPUs. It is well suited to machine learning, data analysis, and education. Because the computation runs in the cloud, you do not need expensive local hardware, which makes deep learning more accessible.



## 2. Setting up the Colab Environment

- **Create a New Notebook**: Open Google Colab, click on 'New Notebook' to start.
- **Enable GPU/TPU**: Go to `Edit` > `Notebook Settings` or `Runtime` > `Change runtime type` and select GPU or TPU from the dropdown to accelerate your computations.
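
After switching the runtime type, it can help to confirm that an accelerator is actually visible to your framework. The following is a minimal sketch that checks for GPUs only, assuming TensorFlow is the framework in use; the helper name `list_gpus` is illustrative and not part of any library.

```python
def list_gpus():
    """Return the names of GPU devices visible to TensorFlow (empty list if none)."""
    try:
        import tensorflow as tf
    except ImportError:
        # TensorFlow is not installed in this environment
        return []
    return [device.name for device in tf.config.list_physical_devices("GPU")]

gpus = list_gpus()
print("GPUs visible to TensorFlow:", gpus if gpus else "none (running on CPU)")
```

If the list is empty on Colab, the runtime is most likely still set to CPU.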


## 3. Installing Dependencies

In [2]:
# The PyTorch package on PyPI is named 'torch' (not 'pytorch')
!pip install tensorflow opencv-python-headless torch torchvision torchaudio

## 4. Accessing and Preparing Video Data

To work with video data in Google Colab, you can upload video files directly to Colab or access them via Google Drive. Once the videos are accessible, you'll use OpenCV to preprocess them, such as extracting frames and normalizing pixel values. This step is crucial for converting raw video files into a structured format that can be used for training deep learning models. The goal is to organize your preprocessed video data into training, validation, and test sets.
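
Videos often contain far more frames than a model needs, and neighboring frames are nearly identical. One common way to keep memory use manageable is to sample a fixed number of evenly spaced frames per video. The sketch below only computes which frame indices to keep; the function name `sample_frame_indices` and the numbers used are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

def sample_frame_indices(total_frames, num_samples):
    """Pick `num_samples` evenly spaced frame indices from a video."""
    if total_frames <= 0 or num_samples <= 0:
        return np.array([], dtype=int)
    if num_samples >= total_frames:
        return np.arange(total_frames)  # short video: keep every frame
    positions = np.linspace(0, total_frames - 1, num_samples)
    return np.floor(positions).astype(int)

# Example: keep 5 frames out of a 100-frame video
indices = sample_frame_indices(total_frames=100, num_samples=5)
print(indices)
```

While reading a video, you could then preprocess only the frames whose position appears in `indices`.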


In [3]:
# Installing Additional Dependencies for Preprocessing Phase
!apt-get update -qq
!apt-get install -y ffmpeg  # -y answers the confirmation prompt, which a notebook cell cannot do interactively
!pip install moviepy

In [4]:
# Mounting Google Drive (Optional)
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# This snippet loads a video, extracts its frames, and preprocesses them.
# NOTE: place your actual path in the 'video_path' variable

# Necessary library imports
import cv2
import numpy as np
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split

# Function to extract and preprocess frames
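def extract_and_preprocess_frames(video_path, size=(224, 224)):
    """Read every frame of a video, convert it to RGB, resize it, and scale pixels to [0, 1]."""
    frames = []
    cap = cv2.VideoCapture(video_path)
    if not cap.isOpened():
        raise IOError(f"Could not open video: {video_path}")
    while True:
        ret, frame = cap.read()
        if not ret:  # End of the video (or a read error)
            break
        # Preprocess steps (e.g., resizing, normalization)
        frame_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)  # OpenCV decodes frames as BGR
        resized_frame = cv2.resize(frame_rgb, size)  # Resize frame to model input size
        frames.append(resized_frame.astype(np.float32) / 255.0)  # Normalize pixel values to [0, 1]
    cap.release()
    return np.array(frames)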

# Example usage for a single video
video_path = '/content/drive/My Drive/path_to_your_video.mp4' # Adjust this path
processed_frames = extract_and_preprocess_frames(video_path)

# Optionally, display a frame to verify preprocessing
plt.imshow(processed_frames[0])
plt.axis('off')
plt.show()

In [None]:
# Organizing the dataset
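# Assumes 'processed_frames' contains all your preprocessed samples
# and 'labels' is an array with exactly one label per sample (same length as 'processed_frames').
# 'labels' is not defined in this notebook; build it from your own annotations.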

from sklearn.model_selection import train_test_split

# Splitting the dataset into training, validation, and test sets
X_train, X_test, y_train, y_test = train_test_split(processed_frames, labels, test_size=0.2, random_state=42)

# Further split the training set to create a validation set.
# test_size=0.25 of the remaining 80% is 20% of the full dataset, giving a 60/20/20 split.
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=42)

# Note: Adjust the test_size values based on how much data you want to allocate for testing and validation.
# Now X_train, X_val, and X_test along with y_train, y_val, and y_test are ready to be used in the model training process.
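
One caveat with a random split over individual frames: frames from the same video can end up in both the training and test sets, and because neighboring frames look almost identical, the test score can turn out overly optimistic. A hedged sketch of splitting by video instead is shown here. It assumes you keep a `video_ids` array (hypothetical, not defined in this notebook) that records which video each sample came from.

```python
import numpy as np

def split_by_video(video_ids, test_fraction=0.2, seed=42):
    """Return boolean masks (train, test) such that no video appears in both sets."""
    video_ids = np.asarray(video_ids)
    unique_ids = np.unique(video_ids)
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(unique_ids)
    n_test = max(1, int(round(len(unique_ids) * test_fraction)))
    test_ids = shuffled[:n_test]
    test_mask = np.isin(video_ids, test_ids)
    return ~test_mask, test_mask

# Hypothetical example: 10 videos with 3 frames each
video_ids = np.repeat(np.arange(10), 3)
train_mask, test_mask = split_by_video(video_ids)
print(train_mask.sum(), "training frames,", test_mask.sum(), "test frames")
```

The masks can then index both the samples and the labels, for example `processed_frames[train_mask]` and `labels[train_mask]`.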

## 5. Overview of Different Architectures for Video Analysis

Selecting the appropriate architecture for video analysis is crucial for achieving good performance. Here's an overview of the popular architectures and their primary applications:

### CNNs (Convolutional Neural Networks)

- **Use**: Best for extracting spatial features from individual video frames.
- **Explanation**: CNNs are adept at recognizing patterns, shapes, and objects within images, making them perfect for analyzing static frames.
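
To make "extracting spatial features" concrete, the sketch below applies a single hand-written 3x3 filter to a toy grayscale frame using plain NumPy. It illustrates the sliding-window operation a convolutional layer performs, with the difference that a real CNN learns its filter values during training; the frame and kernel here are made up for illustration.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide `kernel` over a 2D `image` with no padding and stride 1."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Toy grayscale frame: dark left half, bright right half
frame = np.zeros((6, 6))
frame[:, 3:] = 1.0

# Filter that responds to a dark-to-bright transition from left to right
edge_kernel = np.array([[-1.0, 0.0, 1.0]] * 3)

response = conv2d_valid(frame, edge_kernel)
print(response)
```

The response is largest in the columns where the window straddles the vertical edge and zero in the uniform regions.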

### RNN/LSTM/GRU

- **Use**: Ideal for capturing temporal dependencies in sequential data, like videos.
- **Explanation**: RNNs (Recurrent Neural Networks) and their advanced variants, LSTM (Long Short-Term Memory) and GRU (Gated Recurrent Units), can model time-based dynamics, crucial for understanding activities and events in videos.
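
The core idea of a recurrent network is a hidden state that is updated once per timestep, so later frames are interpreted in the context of earlier ones. Below is a minimal NumPy sketch of a vanilla RNN (not an LSTM or GRU, which add gating on top of this recurrence). The weights are random placeholders rather than trained values, and the dimensions are illustrative.

```python
import numpy as np

def simple_rnn(sequence, w_x, w_h, b):
    """Run a vanilla RNN over `sequence` (timesteps, features); return the final hidden state."""
    hidden = np.zeros(w_h.shape[0])
    for x_t in sequence:
        # New state depends on the current input and the previous state
        hidden = np.tanh(x_t @ w_x + hidden @ w_h + b)
    return hidden

rng = np.random.default_rng(0)
timesteps, features, units = 100, 128, 64
sequence = rng.normal(size=(timesteps, features))    # e.g. one feature vector per frame
w_x = rng.normal(scale=0.1, size=(features, units))  # input-to-hidden weights
w_h = rng.normal(scale=0.1, size=(units, units))     # hidden-to-hidden weights
b = np.zeros(units)

final_state = simple_rnn(sequence, w_x, w_h, b)
print(final_state.shape)
```

The final hidden state summarizes the whole sequence and would typically feed a classification layer.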

### 3D CNNs

- **Use**: For simultaneous spatial and temporal feature extraction.
- **Explanation**: 3D CNNs extend conventional CNNs by adding a time dimension, allowing them to process video clips as volumetric data, capturing motion information directly.
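
The sketch below shows why a kernel that spans time can pick up motion. It applies one hand-written 3D filter, which subtracts a patch in the earlier frame from the same patch in the later frame, to a static clip and to a clip with a moving bar. This is a NumPy illustration of the operation under made-up data, not how a framework implements `Conv3D`.

```python
import numpy as np

def conv3d_valid(clip, kernel):
    """Apply a 3D kernel to a (frames, height, width) clip with no padding and stride 1."""
    kt, kh, kw = kernel.shape
    t, h, w = clip.shape
    out = np.zeros((t - kt + 1, h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            for k in range(out.shape[2]):
                out[i, j, k] = np.sum(clip[i:i + kt, j:j + kh, k:k + kw] * kernel)
    return out

# Kernel of shape (2, 3, 3): later-frame patch minus earlier-frame patch
kernel = np.stack([-np.ones((3, 3)), np.ones((3, 3))])

static_clip = np.ones((4, 8, 8))   # nothing changes between frames
moving_clip = np.zeros((4, 8, 8))
for t in range(4):
    moving_clip[t, :, t:t + 2] = 1.0  # a bright bar that shifts right each frame

static_response = conv3d_valid(static_clip, kernel)
moving_response = conv3d_valid(moving_clip, kernel)
print(np.abs(static_response).max(), np.abs(moving_response).max())
```

The static clip produces an all-zero response, while the moving bar produces non-zero values where the brightness changes over time.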

### Two-Stream Networks

- **Use**: Combines spatial and temporal streams for comprehensive video analysis.
- **Explanation**: This architecture uses one CNN stream to process single frames for spatial features, and a second stream to process the motion between frames. In the classic formulation the second stream is a CNN over stacked optical-flow fields; an RNN or 3D CNN is a common substitute. Fusing the two streams balances appearance and motion information.
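
A simple way to combine the two streams is late fusion: each stream produces class probabilities, and the two are averaged, optionally with unequal weights. The sketch below uses made-up probability arrays in place of real model outputs; `late_fusion` and `spatial_weight` are illustrative names.

```python
import numpy as np

def late_fusion(spatial_probs, temporal_probs, spatial_weight=0.5):
    """Weighted average of the class probabilities from two streams."""
    spatial_probs = np.asarray(spatial_probs, dtype=float)
    temporal_probs = np.asarray(temporal_probs, dtype=float)
    return spatial_weight * spatial_probs + (1.0 - spatial_weight) * temporal_probs

# Made-up softmax outputs for 2 clips and 3 classes
spatial_probs = np.array([[0.7, 0.2, 0.1], [0.3, 0.4, 0.3]])
temporal_probs = np.array([[0.5, 0.4, 0.1], [0.1, 0.2, 0.7]])

fused = late_fusion(spatial_probs, temporal_probs)
predicted_classes = fused.argmax(axis=1)
print(fused)
print(predicted_classes)
```

Because both inputs are probability distributions and the weights sum to one, each fused row still sums to one.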

### Transformers (e.g., Vision Transformers - ViT)

- **Use**: Recently, transformers have been adapted for video processing tasks.
- **Explanation**: Transformers, known for their effectiveness in NLP, were first adapted to images with ViT, which treats image patches as tokens. Video variants such as ViViT and TimeSformer extend the idea to patches across frames, so attention can capture long-range dependencies in both space and time.
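
The attention mechanism at the heart of these models can be sketched in a few lines of NumPy. The example below runs single-head scaled dot-product self-attention over one token per frame. The random projection matrices stand in for learned weights, and real video transformers add patch embedding, positional encodings, multiple heads, and stacked layers on top of this.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over (num_tokens, dim)."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])  # similarity of every token to every other token
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
num_frames, dim = 16, 32
frame_tokens = rng.normal(size=(num_frames, dim))  # e.g. one embedding per frame
w_q, w_k, w_v = (rng.normal(scale=dim ** -0.5, size=(dim, dim)) for _ in range(3))

attended, weights = self_attention(frame_tokens, w_q, w_k, w_v)
print(attended.shape, weights.shape)
```

Every output token is a weighted mix of all frames, which is what lets the first and last frame of a clip influence each other directly.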

## Choose Your Neural Architecture Template

After familiarizing yourself with the architectures available for video analysis, select the one that best fits your project's needs from the options below. Be sure to adjust parameters such as `frame_height`, `frame_width`, and `num_classes` to match your dataset. Once you've made your selection, run the corresponding cell to define and compile your model.

- **CNN for Spatial Features**: Best for analyzing frame-level features.
- **RNN/LSTM for Temporal Features**: Ideal for understanding temporal dynamics.
- **3D CNN for Spatio-Temporal Features**: Captures both spatial and temporal information.
- **Two-Stream Network**: Combines CNN and RNN strengths for comprehensive analysis.
- **Transformers**: Utilizes advanced attention mechanisms for complex patterns.


**Note**: Each architecture requires specific adjustments related to your dataset dimensions and the problem you are solving (e.g., classification, detection). Modify `frame_height`, `frame_width`, `num_classes`, and any other relevant parameters before running your chosen architecture cell.
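
One adjustment that is easy to miss: the templates compile with `categorical_crossentropy`, which expects one-hot encoded labels with `num_classes` columns. If your labels are integer class indices, they need converting first (or you can switch the loss to `sparse_categorical_crossentropy`). A minimal NumPy sketch of the conversion follows, with made-up labels; Keras also ships `tf.keras.utils.to_categorical` for the same job.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Convert integer class indices to one-hot rows of length `num_classes`."""
    labels = np.asarray(labels, dtype=int)
    if labels.size and (labels.min() < 0 or labels.max() >= num_classes):
        raise ValueError("labels must lie in the range [0, num_classes)")
    return np.eye(num_classes, dtype=np.float32)[labels]

labels = np.array([0, 2, 1, 2])  # made-up integer labels
y = one_hot(labels, num_classes=3)
print(y)
```

The resulting array has shape `(num_samples, num_classes)`, matching the softmax output of each template.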

In [None]:
# Basic CNN Model for Spatial Feature Extraction

import tensorflow as tf
from tensorflow.keras import layers, models

# Adjust these parameters to fit your dataset
frame_height = 224  # Height of the video frame
frame_width = 224   # Width of the video frame
num_classes = 10    # Number of output classes

# Basic CNN Model
model_cnn = models.Sequential([
    layers.Conv2D(32, (3, 3), activation='relu', input_shape=(frame_height, frame_width, 3)),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(128, (3, 3), activation='relu'),
    layers.Flatten(),
    layers.Dense(64, activation='relu'),
    layers.Dense(num_classes, activation='softmax'),
])

model_cnn.summary()

# Compile the model (categorical_crossentropy expects one-hot encoded labels)
model_cnn.compile(optimizer='adam',
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])

In [None]:
# RNN/LSTM for Temporal Feature Extraction

from tensorflow.keras import Sequential
from tensorflow.keras.layers import LSTM, Dense

# Adjust these parameters to fit your dataset
timesteps = 100  # Length of your sequences
features = 128   # Features extracted from each frame or timestep
num_classes = 10 # Number of output classes

# RNN Model with LSTM
model_rnn = Sequential([
    LSTM(64, input_shape=(timesteps, features), return_sequences=True),
    LSTM(64),
    Dense(64, activation='relu'),
    Dense(num_classes, activation='softmax'),
])

model_rnn.summary()

# Compile the model
model_rnn.compile(optimizer='adam',
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])

In [None]:
# 3D CNN for Spatio-Temporal Feature Extraction

from tensorflow.keras import models
from tensorflow.keras.layers import Conv3D, MaxPooling3D, Flatten, Dense

# Adjust these parameters to fit your dataset
frames_per_clip = 16    # Number of frames per video clip
frame_height = 112     # Height of the video frame
frame_width = 112      # Width of the video frame
num_channels = 3       # Number of color channels (RGB)
num_classes = 10       # Number of output classes

# 3D CNN Model
model_3dcnn = models.Sequential([
    Conv3D(64, (3, 3, 3), activation='relu',
           input_shape=(frames_per_clip, frame_height, frame_width, num_channels)),
    MaxPooling3D((2, 2, 2)),
    Conv3D(128, (3, 3, 3), activation='relu'),
    MaxPooling3D((2, 2, 2)),
    Flatten(),
    Dense(128, activation='relu'),
    Dense(num_classes, activation='softmax'),
])

model_3dcnn.summary()

# Compile the model
model_3dcnn.compile(optimizer='adam',
                    loss='categorical_crossentropy',
                    metrics=['accuracy'])

In [5]:
# Note: Implementing a two-stream network involves creating and training two separate models:
# one for spatial features (e.g., a CNN) and one for temporal features (e.g., a 3D CNN or an RNN).
# After training, their predictions are typically combined through averaging or a learned fusion layer.

# This cell is meant to guide you conceptually; implementation details will vary based on your specific needs.

# For Spatial Features: Use the CNN model defined above (model_cnn).
# For Temporal Features: Use the 3D CNN model (model_3dcnn) or the RNN model (model_rnn), depending on your preference.

# The final step involves combining these models. One simple approach is to average their predictions:
# predictions = 0.5 * model_cnn.predict(spatial_data) + 0.5 * temporal_model.predict(temporal_data)

In [None]:
# Transformers for Video Processing (Vision Transformers - ViT)

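# EfficientNetB0 is a CNN backbone, not a transformer; it is imported here but not used in this cell.
from tensorflow.keras.applications import EfficientNetB0
# vit_keras is a third-party package; install it first with: !pip install vit-keras
from vit_keras import vit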

# Adjust these parameters

## Training and Monitoring the Model

Now that your model is ready, it's time to train it with your dataset. Use the `.fit()` method for training. To monitor the training progress and ensure the best model is saved, we'll use callbacks such as `ModelCheckpoint` and `TensorBoard`.
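
With `save_best_only=True`, `ModelCheckpoint` keeps the checkpoint from the epoch with the best monitored value, which is `val_loss` by default. The plain-Python sketch below illustrates that bookkeeping and adds a patience rule of the kind Keras's `EarlyStopping` callback provides. It is a conceptual illustration with made-up loss values, not Keras's actual implementation.

```python
def best_epoch_with_patience(val_losses, patience=3):
    """Return (best_epoch, stop_epoch) for a list of per-epoch validation losses.

    Training stops once `patience` consecutive epochs fail to improve on the best loss.
    """
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch  # this is where a checkpoint would be saved
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return best_epoch, epoch
    return best_epoch, len(val_losses) - 1

# Made-up validation losses for seven epochs (epochs are 0-indexed)
val_losses = [0.90, 0.70, 0.65, 0.66, 0.68, 0.67, 0.80]
best, stopped = best_epoch_with_patience(val_losses, patience=3)
print(f"best epoch: {best}, stopped at epoch: {stopped}")
```

The saved model corresponds to the best epoch, not the last one, which is why the checkpoint file is worth reloading before evaluation.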

In [None]:
from tensorflow.keras.callbacks import ModelCheckpoint, TensorBoard

# Setup callbacks
checkpoint_cb = ModelCheckpoint('best_model.h5', save_best_only=True)
tensorboard_cb = TensorBoard(log_dir='./logs')

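# Select the architecture you defined and compiled above
# (model_cnn, model_rnn, or model_3dcnn); model_cnn is used here as an example
model = model_cnn

# Train the model
history = model.fit(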
    X_train, y_train,
    epochs=20,
    batch_size=32,
    validation_data=(X_val, y_val),
    callbacks=[checkpoint_cb, tensorboard_cb]
)

## Evaluating the Model

After training, it's crucial to evaluate your model on the test set to understand its performance on unseen data. This step gives you insights into how well your model has learned and generalized from the training data.
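
A single accuracy number can hide classes that the model rarely gets right. The sketch below computes a confusion matrix and per-class recall with NumPy, using made-up labels. With a real model you would first turn the predicted probabilities and the one-hot targets into class indices, for example with `argmax(axis=1)`.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, num_classes):
    """Rows are true classes, columns are predicted classes."""
    matrix = np.zeros((num_classes, num_classes), dtype=int)
    for true_label, predicted_label in zip(y_true, y_pred):
        matrix[true_label, predicted_label] += 1
    return matrix

def per_class_recall(matrix):
    """Fraction of each true class that was predicted correctly (0 for absent classes)."""
    totals = matrix.sum(axis=1)
    return np.divide(np.diag(matrix), totals, out=np.zeros(len(totals)), where=totals > 0)

# Made-up class indices for six test samples
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])

cm = confusion_matrix(y_true, y_pred, num_classes=3)
recall = per_class_recall(cm)
print(cm)
print(recall)
```

Here the overall accuracy is 4/6, but the per-class view shows that classes 0 and 2 are each recognized only half of the time.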


In [None]:
# Evaluate the model on the test set
test_loss, test_acc = model.evaluate(X_test, y_test)
print(f"Test accuracy: {test_acc:.2f}")

## Preparing for Fine-Tuning (if needed)

To fine-tune your model, decide which layers to keep frozen and which to retrain, then recompile with a lower learning rate. The small learning rate lets the unfrozen layers adapt to your specific dataset without overwriting what the model has already learned.

In [None]:
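# Example for a model with a pre-trained base
# 'layer_to_freeze' is a placeholder value: layers before this index stay frozen.
# Choose it after inspecting model.summary().
layer_to_freeze = 4
for layer in model.layers[:layer_to_freeze]: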
    layer.trainable = False  # Freeze layers not intended for fine-tuning
for layer in model.layers[layer_to_freeze:]:
    layer.trainable = True  # Unfreeze layers for fine-tuning

# Adjust the learning rate for fine-tuning
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-5)  # Lower learning rate
model.compile(optimizer=optimizer, loss='categorical_crossentropy', metrics=['accuracy'])

# Summarize the model post adjustments to verify changes
model.summary()

In [None]:
# Fine-Tuning the Model
# With the model adjusted for fine-tuning, continue training to refine its performance
# on the dataset. Monitor the training process closely to ensure improvements.

# Continue training the model
history_fine = model.fit(
    X_train, y_train,
    epochs=10,  # Adjust epochs based on when performance plateaus
    batch_size=32,
    validation_data=(X_val, y_val),
    callbacks=[checkpoint_cb, tensorboard_cb]  # Reuse callbacks from initial training
)

## Evaluating the Model Post Fine-Tuning

After fine-tuning, evaluate the model again on the test set to assess any improvements. This step helps understand the effectiveness of your fine-tuning efforts.


In [None]:
# Evaluate the model again
test_loss, test_acc = model.evaluate(X_test, y_test)
print(f"Post Fine-Tuning Test Accuracy: {test_acc:.2f}")

## Saving the Trained Model

After training and fine-tuning, save your model to reuse it later without needing to retrain. This step is crucial for deployment.

In [None]:
# Save the entire model to a file
model.save('my_model.h5')

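# Note: With TensorFlow 2.x and Keras 2, model.save('my_model') (no extension) writes the SavedModel format.
# Newer Keras releases recommend the native '.keras' format instead, e.g. model.save('my_model.keras').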