<a href="https://colab.research.google.com/github/carmenbarriga/Violence-Detection-in-Videos-with-Transformers/blob/main/Transformers/DeVTr/ViolenceInMovies.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Data Efficient Video Transformer (DeVTr) for Violence Detection**

@inproceedings{abdali2021data,
  title={Data efficient video transformer for violence detection},
  author={Abdali, Almamon Rasool},
  booktitle={2021 IEEE International Conference on Communication, Networks and Satellite (COMNETSAT)},
  pages={195--199},
  year={2021},
  organization={IEEE}
}

## **1.- Installation of the necessary libraries**

*   **Timm:** Library that provides pre-trained implementations of deep learning models using the PyTorch framework


In [1]:
! pip install timm

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting timm
  Downloading timm-0.9.2-py3-none-any.whl (2.2 MB)
Collecting huggingface-hub (from timm)
  Downloading huggingface_hub-0.15.1-py3-none-any.whl (236 kB)
Collecting safetensors (from timm)
  Downloading safetensors-0.3.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Installing collected packages: safetensors, huggingface-hub, timm
Successfully installed huggingface-hub-0.15.1 safetensors-0.3.1 timm-0.9.2


## **2.- Mount Google Drive**
Mount Google Drive to be able to access Google Drive files and directories

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## **3.- Import the necessary libraries**

In [3]:
import copy
import cv2
import math
import numpy as np
import os
import pandas as pd
import time
import timm
import torch

from skimage.transform import resize
from sklearn.metrics import precision_score, recall_score, f1_score, confusion_matrix
from sklearn import model_selection
from torch import nn
from torch.utils.data import DataLoader, Dataset
from torch.optim import lr_scheduler
from tqdm.notebook import tqdm

## **4.- Make some initial configurations**
The function `seed_everything` is used to set seeds across various libraries and environments in Python to ensure reproducibility of results. Seed 1001 will be used.

In [4]:
def seed_everything(seed):
  os.environ["PYTHONHASHSEED"] = str(seed)
  # Sets the seed for the numpy library's random number generator
  np.random.seed(seed)
  # Sets the seed for the torch library's random number generator (PyTorch) for both the CPU and GPU
  torch.manual_seed(seed)
  torch.cuda.manual_seed(seed)
  # To ensure that calculations performed with the torch library on the GPU are deterministic
  torch.backends.cudnn.deterministic = True
  # Disable cuDNN benchmark mode (automatic algorithm selection) so that execution is more stable and predictable
  torch.backends.cudnn.benchmark = False

seed_everything(1001)
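
`seed_everything` above seeds NumPy and PyTorch but does not touch Python's built-in `random` module. If any code path draws from `random`, it could be seeded in the same way. The following is a minimal sketch; the helper name `seed_python_random` is hypothetical and not part of the notebook.

```python
import random

def seed_python_random(seed):
  # Seed Python's built-in pseudo-random generator,
  # which NumPy and PyTorch seeding do not cover
  random.seed(seed)

seed_python_random(1001)
first_draw = [random.random() for _ in range(3)]

seed_python_random(1001)
second_draw = [random.random() for _ in range(3)]

# Reseeding with the same value reproduces the same sequence
print(first_draw == second_draw)
```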

Release the GPU cache used by PyTorch and display the current PyTorch version

In [5]:
torch.cuda.empty_cache()
torch.__version__

'2.0.1+cu118'

Determine whether the PyTorch computations will run on a GPU (CUDA) or on the CPU

In [6]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
device

'cuda'

## **5.- Prepare the data**

Set the paths for the **Violence in Movies** dataset folder, the model weights file and the dataframes folder


In [7]:
violence_in_movies_folder = '/content/drive/MyDrive/transformers-for-violence-detection-in-videos/Datasets/Violence in Movies/'
violence_in_movies_weights_dir = '/content/drive/MyDrive/transformers-for-violence-detection-in-videos/DeVTr/Weights/violence_in_movies_best_model_weights.pth'
violence_in_movies_dataframes_folder = '/content/drive/MyDrive/transformers-for-violence-detection-in-videos/DeVTr/Dataframes/Violence in Movies/'

Function to get the following information from the data set:


*   Total number of videos
*   Minimum duration
*   Maximum duration
*   Minimum frame rate
*   Maximum frame rate
*   Average number of frames
*   Video widths
*   Video heights

In [8]:
def get_database_info(database_folder):
  # Variables to store the shortest and longest duration of the videos
  minimum_duration = float('inf')
  maximum_duration = float('-inf')

  # Variables to store the minimum and maximum frame rate
  minimum_fps = float('inf')
  maximum_fps = float('-inf')

  # Variables to store the minimum and maximum number of frames
  minimum_frames = float('inf')
  maximum_frames = float('-inf')

  # Variable to store the total number of frames in all videos
  total_frames = 0

  widths = {}
  heights = {}

  videos_counter = 0

  # Loop through the folders (classes) of the dataset folder
  for folder_name in os.listdir(database_folder):
    folder_dir = database_folder + folder_name + '/'
    print(f'Folder name: {folder_name}\nFolder dir: {folder_dir}')
    # Loop through videos within the current folder
    for file_name in os.listdir(folder_dir):
      file_dir = folder_dir + file_name

      # Read the video using OpenCV
      cap = cv2.VideoCapture(file_dir)

      # Get the frame rate per second (FPS)
      fps = cap.get(cv2.CAP_PROP_FPS)

      # Update the minimum and maximum frame rate
      minimum_fps = min(minimum_fps, fps)
      maximum_fps = max(maximum_fps, fps)

      # Get the number of frames in the video
      num_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))

      # Update the minimum and maximum number of frames
      minimum_frames = min(minimum_frames, num_frames)
      maximum_frames = max(maximum_frames, num_frames)      

      # Update the total number of frames
      total_frames += num_frames

      # Get duration in seconds
      duration = num_frames / fps
    
      # Update minimum and maximum duration
      minimum_duration = min(minimum_duration, duration)
      maximum_duration = max(maximum_duration, duration)
    
      # Get the resolution (width and height) of the video
      width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
      height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))

      if width not in widths:
        widths[width]=[file_dir]
      else:
        widths[width].append(file_dir)

      if height not in heights:
        heights[height]=[file_dir]
      else:
        heights[height].append(file_dir)

      # print(f'File: {file_name}')
      # print(f'Duration: {duration} seconds')
      # print(f'Resolution: {width}x{height}')
      # print(f'Frames per second: {fps}')
      # print(f'Number of frames: {num_frames}')

      # Release the video capture object
      cap.release()          

      videos_counter +=1
  
  print(f'Number of videos: {videos_counter}')

  # Calculate the average number of frames in the videos
  average_frames = total_frames / videos_counter

  # Print shortest and longest duration of videos
  print(f'Minimum duration: {minimum_duration} seconds')
  print(f'Maximum duration: {maximum_duration} seconds')
  
  # Print the minimum and maximum frame rate of the videos
  print(f'Minimum frame rate: {minimum_fps} fps')
  print(f'Maximum frame rate: {maximum_fps} fps')

  # Print the minimum and maximum number of frames
  print(f'Minimum number of frames: {minimum_frames} frames')
  print(f'Maximum number of frames: {maximum_frames} frames')
  
  # Print the average number of frames in the videos
  print(f'Average number of frames: {average_frames}')
  
  for key, value in widths.items():
    print(f"Width: {key}")
    print(f"Number of videos: {len(value)}")
    print("------------------------")

  for key, value in heights.items():
    print(f"Height: {key}")
    print(f"Number of videos: {len(value)}")
    print("------------------------")

In [9]:
get_database_info(violence_in_movies_folder)

Folder name: Violence
Folder dir: /content/drive/MyDrive/transformers-for-violence-detection-in-videos/Datasets/Violence in Movies/Violence/
Folder name: Non Violence
Folder dir: /content/drive/MyDrive/transformers-for-violence-detection-in-videos/Datasets/Violence in Movies/Non Violence/
Number of videos: 200
Minimum duration: 1.6683333333333334 seconds
Maximum duration: 2.04 seconds
Minimum frame rate: 25.0 fps
Maximum frame rate: 29.97002997002997 fps
Minimum number of frames: 42 frames
Maximum number of frames: 60 frames
Average number of frames: 48.955
Width: 720
Number of videos: 200
------------------------
Height: 576
Number of videos: 73
------------------------
Height: 480
Number of videos: 127
------------------------
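
The duration reported above is computed as the number of frames divided by the frame rate. The sketch below isolates that arithmetic and adds a guard for a frame rate of zero, which OpenCV may report for files it cannot parse. The helper name `video_duration_seconds` and the frame counts used are illustrative assumptions, not values taken from specific videos.

```python
def video_duration_seconds(num_frames, fps):
  # Return None instead of dividing by zero when the frame rate is unavailable
  if fps <= 0:
    return None
  return num_frames / fps

ntsc_fps = 30000 / 1001  # 29.97002997..., the maximum frame rate reported above

print(video_duration_seconds(50, ntsc_fps))  # illustrative: 50 frames at 29.97 fps
print(video_duration_seconds(51, 25.0))      # illustrative: 51 frames at 25 fps
print(video_duration_seconds(42, 0.0))       # unreadable frame rate
```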


Function to check that a video can be opened and contains at least `min_frames` readable frames

In [10]:
def check_frames(video_dir, min_frames=40):
  # VideoCapture object to open and read the video
  video_capture = cv2.VideoCapture(video_dir)
  # To check if the VideoCapture object was able to open the video
  if not video_capture.isOpened():
    print(f"Can't open '{video_dir}'")
    return False
  # Try to read the first min_frames frames
  for _ in range(min_frames):
    # Read the next frame
    is_frame_read, frame = video_capture.read()
    # Check if there are no more frames available
    if not is_frame_read or frame is None:
      print(f"Something went wrong with '{video_dir}' video")
      video_capture.release()
      return False
  # Release the video capture object
  video_capture.release()
  return True

Function to get the paths where videos are located and their labels

In [11]:
def get_video_labels(main_dir):
  videos = []
  labels = []
  # Loop through the folders (classes) of the dataset folder
  for folder_name in os.listdir(main_dir):
    folder_dir = main_dir + folder_name + '/'
    print(f'Folder name: {folder_name}\nFolder dir: {folder_dir}')
    # Loop through videos within the current folder
    for file_name in os.listdir(folder_dir):
      file_dir = folder_dir + file_name
      # Check if the video can be opened correctly and has at least 40 frames (the default min_frames)
      if check_frames(file_dir):
        videos.append(file_dir)
        # Add the video label according to the folder where it is located
        if folder_name == 'Violence':
          labels.append(1)
        else:
          labels.append(0)
  return videos, labels

In [12]:
videos, labels = get_video_labels(violence_in_movies_folder)

Folder name: Violence
Folder dir: /content/drive/MyDrive/transformers-for-violence-detection-in-videos/Datasets/Violence in Movies/Violence/
Folder name: Non Violence
Folder dir: /content/drive/MyDrive/transformers-for-violence-detection-in-videos/Datasets/Violence in Movies/Non Violence/
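
`os.listdir` returns entries in arbitrary order, so the order of `videos` and `labels` (and therefore the result of a seeded split) may differ between machines. The sketch below shows one way to make the listing deterministic by sorting. The function name `list_labelled_files` and the temporary files are assumptions made for illustration, and the frame check is left out.

```python
import os
import tempfile

def list_labelled_files(main_dir, positive_class='Violence'):
  files, labels = [], []
  # Sorting makes the traversal order independent of the file system
  for folder_name in sorted(os.listdir(main_dir)):
    folder_dir = os.path.join(main_dir, folder_name)
    if not os.path.isdir(folder_dir):
      continue
    for file_name in sorted(os.listdir(folder_dir)):
      files.append(os.path.join(folder_dir, file_name))
      labels.append(1 if folder_name == positive_class else 0)
  return files, labels

# Build a tiny throwaway folder structure to exercise the function
with tempfile.TemporaryDirectory() as root:
  for folder_name, file_name in [('Violence', 'a.avi'), ('Non Violence', 'b.avi')]:
    os.makedirs(os.path.join(root, folder_name))
    open(os.path.join(root, folder_name, file_name), 'w').close()
  files, labels = list_labelled_files(root)
  file_names = [os.path.basename(path) for path in files]

print(file_names, labels)
```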


Create the Pandas `DataFrame`

In [13]:
data = pd.DataFrame(data={"file": videos, "label": labels})
data_rows = data.head()
data_rows

| | file | label |
|---|---|---|
| 0 | /content/drive/MyDrive/transformers-for-violen... | 1 |
| 1 | /content/drive/MyDrive/transformers-for-violen... | 1 |
| 2 | /content/drive/MyDrive/transformers-for-violen... | 1 |
| 3 | /content/drive/MyDrive/transformers-for-violen... | 1 |
| 4 | /content/drive/MyDrive/transformers-for-violen... | 1 |


In [14]:
for index, row in data_rows.iterrows():
  file = row['file']
  label = row['label']
  print(f"File: {file}, Label: {label}")

File: /content/drive/MyDrive/transformers-for-violence-detection-in-videos/Datasets/Violence in Movies/Violence/newfi8.avi, Label: 1
File: /content/drive/MyDrive/transformers-for-violence-detection-in-videos/Datasets/Violence in Movies/Violence/newfi5.avi, Label: 1
File: /content/drive/MyDrive/transformers-for-violence-detection-in-videos/Datasets/Violence in Movies/Violence/newfi9.avi, Label: 1
File: /content/drive/MyDrive/transformers-for-violence-detection-in-videos/Datasets/Violence in Movies/Violence/newfi6.avi, Label: 1
File: /content/drive/MyDrive/transformers-for-violence-detection-in-videos/Datasets/Violence in Movies/Violence/newfi4.avi, Label: 1


Split the data for training and testing:
*   80% train
*   20% test

In [15]:
train_data, test_data = model_selection.train_test_split(
  data, test_size=0.2, random_state=42
)

Show train data information

In [16]:
print('Train data shape: ', train_data.shape)
print('Number of violence videos in train data: ', train_data['label'].value_counts()[1])
print('Number of non violence videos in train data: ', train_data['label'].value_counts()[0])

Train data shape:  (160, 2)
Number of violence videos in train data:  79
Number of non violence videos in train data:  81


Save train dataframe in a csv file

In [17]:
train_data.to_csv(violence_in_movies_dataframes_folder + "train.csv", index=False)

Show test data information

In [18]:
print('Test data shape: ', test_data.shape)
print('Number of violence videos in test data: ', test_data['label'].value_counts()[1])
print('Number of non violence videos in test data: ', test_data['label'].value_counts()[0])

Test data shape:  (40, 2)
Number of violence videos in test data:  21
Number of non violence videos in test data:  19
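
The split above is not stratified, which is why the classes end up slightly uneven (79/81 in train, 21/19 in test). If an exactly balanced split were wanted, `train_test_split` accepts a `stratify` argument. The sketch below uses placeholder file names and 100 labels per class (the per-class totals implied by the counts above); it is an alternative, not what the notebook ran.

```python
from collections import Counter
from sklearn import model_selection

# Placeholder data: 100 violence and 100 non-violence entries
files = [f'video_{i}.avi' for i in range(200)]
labels = [1] * 100 + [0] * 100

train_files, test_files, train_labels, test_labels = model_selection.train_test_split(
  files, labels, test_size=0.2, random_state=42, stratify=labels
)

# Class counts per split
print(Counter(train_labels), Counter(test_labels))
```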


Save test dataframe in a csv file

In [19]:
test_data.to_csv(violence_in_movies_dataframes_folder + "test.csv", index=False)

Defining some video properties

In [20]:
time_steps = 49   # Number of frames of each video
color_channels = 3  # Number of color channels
height = 256  # Height of each frame
width = 256   # Width of each frame

Function (`capture`) and class (`TaskDataset`) to preprocess the videos. Videos with more than `time_steps` frames are truncated to the first `time_steps` frames. Videos with fewer frames are padded with all-zero frames until they reach `time_steps` (49 here, the rounded average number of frames).

In [21]:
def capture(filename, time_steps, color_channels, height, width):
  # Create an array to store the video frames after being processed
  frames = np.zeros((time_steps, color_channels, height, width), dtype=float)
  # VideoCapture object to open and read the video
  video_capture = cv2.VideoCapture(filename)
  # To check if the VideoCapture object was able to open the video
  if video_capture.isOpened():
    # To keep track of how many frames have been stored in the frames array
    frames_counter = 0
    while frames_counter < time_steps:
      # Read the next frame
      is_frame_read, frame = video_capture.read()
      # Check if there are no more frames available
      if not is_frame_read:
        break
      # Resize the frame to (height, width, color_channels); the aspect ratio is not preserved,
      # uint8 pixel values are rescaled to floats in [0, 1] and channels stay in OpenCV's BGR order
      frame = resize(frame, (height, width, color_channels))
      # To add an extra dimension (1, height, width, color_channels)
      frame = np.expand_dims(frame, axis=0)
      # Moves axis -1 (last axis) to index 1 (1, color_channels, height, width)
      frame = np.moveaxis(frame, -1, 1)
      # Normalization of the pixel values of the frame (if necessary)
      if np.max(frame) > 1:
        frame = frame / 255.0
      # Store the processed frame in the corresponding position within the frames array
      frames[frames_counter][:] = frame
      frames_counter += 1

    del frame
    del is_frame_read

  # Release the video capture object
  video_capture.release()

  return frames


class TaskDataset(Dataset):
  def __init__(self, data, time_steps=40, color_channels=3, height=256, width=256):
    # data is a pandas dataframe that contains the paths to the video files with their labels
    self.data_locations = data
    self.time_steps, self.color_channels, self.height, self.width = time_steps, color_channels, height, width

  def __len__(self):
    return len(self.data_locations)

  def __getitem__(self, idx):
    if torch.is_tensor(idx):
      idx = idx.tolist()
    # To process the video and get its frames
    video = capture(self.data_locations.iloc[idx, 0], self.time_steps, self.color_channels, self.height, self.width)
    # Dictionary containing the processed video, its corresponding label and its path
    sample = {
      'video': torch.from_numpy(video),
      'label': torch.from_numpy(np.asarray(self.data_locations.iloc[idx, 1])),
      'path': self.data_locations.iloc[idx, 0]
    }

    return sample
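
The truncate-or-pad behaviour of `capture` can be isolated from the video decoding. The sketch below applies it to arrays that are already in memory; `pad_or_truncate` is a hypothetical helper and the tiny 4x4 frames are placeholders.

```python
import numpy as np

def pad_or_truncate(frames, time_steps):
  # frames has shape (num_frames, color_channels, height, width)
  clip = np.zeros((time_steps,) + frames.shape[1:], dtype=frames.dtype)
  kept = min(time_steps, frames.shape[0])
  # Copy the first `kept` frames; any remaining positions stay zero
  clip[:kept] = frames[:kept]
  return clip

short_video = np.ones((42, 3, 4, 4))  # fewer frames than time_steps
long_video = np.ones((60, 3, 4, 4))   # more frames than time_steps

padded = pad_or_truncate(short_video, 49)
truncated = pad_or_truncate(long_video, 49)

print(padded.shape, truncated.shape)
print(padded[42:].sum(), truncated.sum())
```

With `time_steps = 49`, the shortest video in the dataset (42 frames) would end with 7 all-zero frames.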

Passing the training data to the TaskDataset class

In [22]:
train_dataset = TaskDataset(
  data=train_data, time_steps=time_steps, color_channels=color_channels, height=height, width=width
)

Passing the test data to the TaskDataset class

In [23]:
test_dataset = TaskDataset(
  data=test_data, time_steps=time_steps, color_channels=color_channels, height=height, width=width
)

Defining the train batch size

In [24]:
BATCH_SIZE = 16

Creating a `DataLoader` to load data in batches during training

In [25]:
train_loader = DataLoader(
  dataset=train_dataset,
  batch_size=BATCH_SIZE,
  pin_memory=True,
  drop_last=True,
  num_workers=0,
  shuffle=True
)

Creating a `DataLoader` to load data in batches during testing

In [26]:
TEST_BATCH_SIZE = 10

In [27]:
test_loader = DataLoader(
  dataset=test_dataset,
  batch_size=TEST_BATCH_SIZE,
  pin_memory=True,
  drop_last=True,
  num_workers=0,
  shuffle=False
)

Putting the `DataLoaders` in the `dataloaders` dictionary and their sizes in the `dataset_sizes` dictionary

In [28]:
dataloaders = {'train': train_loader, 'test': test_loader}
dataset_sizes = {'train': len(train_dataset), 'test': len(test_dataset)}
print(dataloaders)
print(dataset_sizes)

{'train': <torch.utils.data.dataloader.DataLoader object at 0x7feda0329660>, 'test': <torch.utils.data.dataloader.DataLoader object at 0x7feda0328b20>}
{'train': 160, 'test': 40}
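
Both loaders use `drop_last=True`, so an incomplete final batch would be discarded. With the sizes above nothing is lost, but the metrics later divide by the full dataset size, so a batch size that does not divide the dataset evenly would skew them. The sketch below (the helper `batches_and_dropped` is hypothetical) works through the arithmetic.

```python
def batches_and_dropped(num_samples, batch_size, drop_last):
  # Returns (number of batches, number of samples that are never seen)
  full_batches, remainder = divmod(num_samples, batch_size)
  if drop_last:
    return full_batches, remainder
  if remainder == 0:
    return full_batches, 0
  return full_batches + 1, 0

print(batches_and_dropped(160, 16, drop_last=True))  # train loader settings
print(batches_and_dropped(40, 10, drop_last=True))   # test loader settings
print(batches_and_dropped(40, 16, drop_last=True))   # hypothetical batch size that drops samples
```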


Release the memory, because `data`, `train_data` and `test_data` are no longer needed

In [29]:
del data
del train_data
del test_data

## **6.- DeVTr**

`TimeWarp` class to apply the model (VGG-19) to each frame of a video sequence and rearrange the results to preserve the temporal order of the frames

In [30]:
class TimeWarp(nn.Module):
  def __init__(self, model):
    super(TimeWarp, self).__init__()
    self.model = model

  def forward(self, x):
    _, time_steps, _, _, _ = x.size()
    output = []
    for frame in range(time_steps):
      x_t = self.model(x[:, frame, :, :, :])
      output.append(x_t)

    x = torch.stack(output, dim=0).transpose_(0, 1)

    output = None
    x_t = None

    return x
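
To make the shapes in `TimeWarp` concrete, the sketch below runs the same frame-by-frame loop on a small stand-in model and compares it with an alternative that folds the time axis into the batch axis. The stand-in `frame_model` and the tensor sizes are assumptions for illustration. The folded variant is not what the notebook uses; it trades memory for fewer forward calls and would behave differently for layers such as batch normalization in training mode.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Stand-in per-frame feature extractor (assumption: replaces VGG-19 for illustration)
frame_model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 16)).eval()

videos = torch.randn(2, 5, 3, 8, 8)  # [batch_size, time_steps, color_channels, height, width]
batch_size, time_steps = videos.shape[:2]

with torch.no_grad():
  # Frame-by-frame loop: one forward pass per time step, stacked along the time axis
  looped = torch.stack([frame_model(videos[:, t]) for t in range(time_steps)], dim=1)

  # Alternative: fold time into the batch axis, run once, then unfold
  folded = frame_model(videos.reshape(batch_size * time_steps, 3, 8, 8))
  folded = folded.reshape(batch_size, time_steps, -1)

print(looped.shape)  # [batch_size, time_steps, features]
print(torch.allclose(looped, folded, atol=1e-5))
```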

`PositionalEncoder` class to preserve the order of the frames in the video sequence

In [31]:
class PositionalEncoder(nn.Module):
  def __init__(self, embedding_dimension, dropout=0.1, time_steps=40):
    super(PositionalEncoder, self).__init__()
    self.dropout = nn.Dropout(p=dropout)
    self.embedding_dimension = embedding_dimension
    self.time_steps = time_steps

  def do_positional_encode(self):
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    positional_encode = torch.zeros(self.time_steps, self.embedding_dimension).to(device)
    for pos in range(self.time_steps):
      for i in range(0, self.embedding_dimension, 2):
        positional_encode[pos, i] = math.sin(pos / (10000 ** ((2 * i) / self.embedding_dimension)))
        positional_encode[pos, i + 1] = math.cos(pos / (10000 ** ((2 * (i + 1)) / self.embedding_dimension)))
    positional_encode = positional_encode.unsqueeze(0)
    return positional_encode

  def forward(self, x):
    x = x * math.sqrt(self.embedding_dimension)
    positional_encode = self.do_positional_encode()
    x += positional_encode[:, :x.size(1)]
    x = self.dropout(x)
    return x
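
`do_positional_encode` rebuilds the table with nested Python loops on every forward pass. The sketch below computes the same values in vectorized form, under the assumption that the formula should stay exactly as written above (exponent `2 * column / embedding_dimension` for every column, which differs from the original Transformer formulation where each sine/cosine pair shares one exponent). The function name `build_positional_table` is hypothetical.

```python
import math
import torch

def build_positional_table(time_steps, embedding_dimension):
  positions = torch.arange(time_steps, dtype=torch.float64).unsqueeze(1)  # [time_steps, 1]
  columns = torch.arange(embedding_dimension, dtype=torch.float64)        # [embedding_dimension]
  # Same exponent as the loop in do_positional_encode
  angles = positions / (10000.0 ** (2 * columns / embedding_dimension))
  table = torch.zeros(time_steps, embedding_dimension, dtype=torch.float64)
  table[:, 0::2] = torch.sin(angles[:, 0::2])  # even columns
  table[:, 1::2] = torch.cos(angles[:, 1::2])  # odd columns
  return table.float().unsqueeze(0)            # [1, time_steps, embedding_dimension]

table = build_positional_table(time_steps=4, embedding_dimension=6)

# Spot check against the scalar formula used in the loop
reference = math.cos(3 / (10000 ** ((2 * 5) / 6)))
print(table.shape, abs(table[0, 3, 5].item() - reference) < 1e-6)
```

Inside an `nn.Module`, such a table could be built once in `__init__` and stored with `register_buffer`, so that it follows the module when `.to(device)` is called.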

`memoTransformer` class containing the Transformer encoder

In [32]:
class memoTransformer(nn.Module):
  def __init__(self, embedding_dimension, heads=8, layers=4, actv='gelu'):
    super(memoTransformer, self).__init__()
    self.encoder_layer = nn.TransformerEncoderLayer(d_model=embedding_dimension, nhead=heads, activation=actv)
    self.transformer_encoder = nn.TransformerEncoder(self.encoder_layer, num_layers=layers)

  def forward(self, x):
    x = self.transformer_encoder(x)
    return x
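
`nn.TransformerEncoderLayer` defaults to `batch_first=False`, meaning it reads its input as [sequence, batch, features], while `TimeWarp` returns [batch_size, time_steps, features]. It may therefore be worth checking which axis self-attention runs over. The sketch below is a small experiment with made-up sizes: it changes one sample in a batch and tests whether the output for a different sample is affected.

```python
import torch
from torch import nn

torch.manual_seed(0)
embedding_dimension, heads = 8, 2

def make_encoder(batch_first):
  layer = nn.TransformerEncoderLayer(
    d_model=embedding_dimension, nhead=heads, dim_feedforward=16, batch_first=batch_first
  )
  # eval() disables dropout so the comparison is deterministic
  return nn.TransformerEncoder(layer, num_layers=1).eval()

x = torch.randn(3, 5, embedding_dimension)        # [batch_size, time_steps, features]
x_changed = x.clone()
x_changed[1] = torch.randn(5, embedding_dimension)  # alter a different sample in the batch

def sample0_depends_on_other_samples(encoder):
  with torch.no_grad():
    return not torch.allclose(encoder(x)[0], encoder(x_changed)[0], atol=1e-5)

mixes_with_default = sample0_depends_on_other_samples(make_encoder(batch_first=False))
mixes_with_batch_first = sample0_depends_on_other_samples(make_encoder(batch_first=True))
print(mixes_with_default, mixes_with_batch_first)
```

If attention over the time axis is intended for [batch_size, time_steps, features] inputs, `batch_first=True` is the setting that matches that layout. Changing it would alter the behaviour of any weights trained with the default.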

DeVTr model

In [33]:
def DeVTr(
  weights=None,                       # The path for pre-trained DeVTr model
  number_of_neurons=1024,             # Number of neurons of the first layer that goes after the output of the encoder
  classification_dropout_rate=0.4,    # Dropout rate of the classification network
  number_of_output_classes=1,         # Number of output classes (Violence or Non Violence)
  embedding_dimension=512,            # Number of output dimensions of the CNN network
  encoder_dropout_rate=0.1,           # Dropout rate of the transformer encoder
  number_of_frames=40,                # Number of frames of the input video
  encoder_layers=4,                   # Number of transformer encoder layers
  encoder_heads=8                     # Number of transformer encoder heads per layer
):
  # If the weights of the pre-trained DeVTr model are passed, default values will be used
  if weights:
    number_of_output_classes = 1
    encoder_dropout_rate = 0.1
    embedding_dimension = 512
    encoder_layers = 4
    encoder_heads = 8
    number_of_frames = 40

  # Creates the VGG-19 pre-trained network with batch normalization
  # The model is used to extract features of dimension 'embedding_dimension'
  vgg_19_model = timm.create_model('vgg19_bn.tv_in1k', pretrained=True, num_classes=embedding_dimension)

  # To freeze the first 40 child modules of the VGG-19 feature extractor
  # This is because the initial layers usually contain more general and reusable
  # features that can be useful in various computer vision tasks
  # (the `features` block of vgg19_bn has 53 child modules: 16 conv, 16 batch norm, 16 ReLU and 5 max pooling)
  i = 0
  for child in vgg_19_model.features.children():
    # To disable the calculation of gradients and freezes the layer parameters,
    # which means they will not be updated during training
    if i < 40:
      for param in child.parameters():
        param.requires_grad = False
    # Enables the calculation of gradients and allows the parameters of these layers
    # to be updated during training
    else:
      for param in child.parameters():
        param.requires_grad = True
    i += 1

  # Combines the VGG-19 network with a non-linear activation layer
  # ReLU(x) = max(0, x)
  embedding_network = nn.Sequential(vgg_19_model, nn.ReLU())

  final_model = nn.Sequential(
    TimeWarp(embedding_network),
    PositionalEncoder(embedding_dimension=embedding_dimension, dropout=encoder_dropout_rate, time_steps=number_of_frames),
    memoTransformer(embedding_dimension=embedding_dimension, heads=encoder_heads, layers=encoder_layers, actv='gelu'),
    nn.Flatten(),
    nn.Linear(number_of_frames * embedding_dimension, number_of_neurons),
    nn.Dropout(classification_dropout_rate),
    nn.ReLU(),
    nn.Linear(number_of_neurons, number_of_output_classes),
  )

  if weights:
    if torch.cuda.is_available():
      final_model.load_state_dict(torch.load(weights))
    else:
      final_model.load_state_dict(torch.load(weights, map_location ='cpu'))

  return final_model
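
The freezing step in `DeVTr` can be verified by counting trainable and frozen parameters. The sketch below applies the same idea to a toy stack of four modules; the toy `features` module and both helper names are assumptions made for illustration.

```python
from torch import nn

# Toy stand-in for vgg_19_model.features (assumption: 4 child modules instead of 53)
features = nn.Sequential(nn.Linear(4, 4), nn.ReLU(), nn.Linear(4, 4), nn.Linear(4, 2))

def freeze_first_children(module, num_frozen):
  # Children with index < num_frozen are frozen, the rest stay trainable
  for index, child in enumerate(module.children()):
    for param in child.parameters():
      param.requires_grad = index >= num_frozen

def count_parameters(module):
  trainable = sum(p.numel() for p in module.parameters() if p.requires_grad)
  frozen = sum(p.numel() for p in module.parameters() if not p.requires_grad)
  return trainable, frozen

freeze_first_children(features, num_frozen=2)
trainable, frozen = count_parameters(features)
print(trainable, frozen)
```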

## **7.- Training**

In [34]:
def train_model(model, criterion, optimizer, scheduler, device='cuda', num_epochs=7):
  model.to(device)

  # Start the training time
  since = time.time()

  # Save the best loss value during model training
  best_loss = float('inf')

  # Create a copy of the current model weights
  best_model_weights = copy.deepcopy(model.state_dict())

  for epoch in range(num_epochs):
    print('Epoch {}/{}'.format(epoch + 1, num_epochs))
    print('-' * 10)

    model.train()
    running_loss = 0.0
    correct_predictions_counter = 0

    # To create a progress bar to iterate over the 'train' dataloader using the tqdm library
    progress_bar = tqdm(dataloaders['train'], total=int(len(dataloaders['train'])))

    for batch, sample in enumerate(progress_bar):
      # Get the videos and labels and move them to the corresponding device memory
      inputs = sample['video'].to(device, dtype=torch.float)  # [batch_size, time_steps, color_channels, height, width]
      labels = sample['label'].view(sample['label'].shape[0], 1).to(device, dtype=torch.float)  # [batch_size] -> [batch_size, 1]

      # To clean up the accumulated gradients and ensure that the gradients are calculated correctly 
      # for the current batch during backpropagation and updating of the weights
      optimizer.zero_grad()

      # Get the outputs predicted by the model
      outputs = model(inputs)

      # Calculate the loss with the function specified in the criterion variable
      loss = criterion(outputs, labels)

      # Computes the gradients of all model parameters with respect to the loss function
      loss.backward()

      # Update model parameters based on gradients computed during backpropagation
      optimizer.step()

      # To get the total loss of the current batch:
      #   - loss.item() is the scalar value of the current batch loss
      #   - inputs.size(0) gets the batch size
      running_loss += loss.item() * inputs.size(0)

      # Apply a sigmoid activation function to the outputs to obtain the predictions
      # and round the predictions to be binary (0 or 1)
      predictions = torch.round(torch.sigmoid(outputs))

      # Adds the number of correct predictions in the current batch to the accumulated correct predictions counter
      correct_predictions_counter += torch.sum(predictions == labels.data)

    # Calculates the average loss for each epoch
    epoch_loss = running_loss / dataset_sizes['train']
    # Calculates the accuracy for each epoch
    epoch_accuracy = correct_predictions_counter.double() / dataset_sizes['train']
    print('Train Loss: {:.4f} Accuracy: {:.4f}'.format(epoch_loss, epoch_accuracy))

    # Lets the scheduler reduce the learning rate if the epoch loss has stopped improving
    scheduler.step(epoch_loss)

    # Stores the model weights that correspond to the best loss achieved so far
    if epoch_loss < best_loss:
      best_loss = epoch_loss
      best_model_weights = copy.deepcopy(model.state_dict())

  # End the training time
  time_elapsed = time.time() - since
  print('Training complete in {:.0f}m {:.0f}s'.format(time_elapsed // 60, time_elapsed % 60))

  # The model is loaded with the weights corresponding to the best saved model
  model.load_state_dict(best_model_weights)
  # Save the weights
  torch.save(best_model_weights, violence_in_movies_weights_dir)

  return model
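
The prediction rule used in `train_model` (sigmoid followed by rounding) amounts to thresholding the raw model output at zero. The sketch below shows this with plain Python; `predict_from_logit` is a hypothetical helper. Python's `round`, like `torch.round`, rounds exact halves to the nearest even integer, so a probability of exactly 0.5 maps to class 0.

```python
import math

def predict_from_logit(logit):
  # Sigmoid turns the raw output (logit) into a probability
  probability = 1.0 / (1.0 + math.exp(-logit))
  # Rounding gives the binary class: 1 = Violence, 0 = Non Violence
  return int(round(probability))

for logit in (-2.0, 0.0, 3.0):
  print(logit, predict_from_logit(logit))
```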

Initialize the model

In [35]:
model = DeVTr(number_of_frames=time_steps)

Downloading model.safetensors:   0%|          | 0.00/575M [00:00<?, ?B/s]

In [36]:
criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.0001)
scheduler = lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', factor=0.5, patience=2, verbose=True)
model = train_model(model, criterion, optimizer, scheduler, device=device, num_epochs=7)

Epoch 1/7
----------
Train Loss: 0.9669 Accuracy: 0.6250
Epoch 2/7
----------
Train Loss: 0.2362 Accuracy: 0.9625
Epoch 3/7
----------
Train Loss: 0.0759 Accuracy: 0.9875
Epoch 4/7
----------
Train Loss: 0.1074 Accuracy: 0.9563
Epoch 5/7
----------
Train Loss: 0.0045 Accuracy: 1.0000
Epoch 6/7
----------
Train Loss: 0.0001 Accuracy: 1.0000
Epoch 7/7
----------
Train Loss: 0.0001 Accuracy: 1.0000
Training complete in 52m 56s
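
To see how `ReduceLROnPlateau` reacts to a loss curve, the sketch below replays the epoch losses printed above through a scheduler with the same `mode`, `factor` and `patience`, attached to a dummy parameter instead of the real model (the `verbose` argument is left out). It then repeats the exercise with a made-up flat loss curve to show when a reduction does happen.

```python
import torch
from torch.optim import lr_scheduler

def replay_losses(epoch_losses):
  # Dummy parameter: only the learning-rate bookkeeping matters here
  parameter = torch.nn.Parameter(torch.zeros(1))
  optimizer = torch.optim.Adam([parameter], lr=0.0001)
  scheduler = lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', factor=0.5, patience=2)
  learning_rates = []
  for epoch_loss in epoch_losses:
    scheduler.step(epoch_loss)
    learning_rates.append(optimizer.param_groups[0]['lr'])
  return learning_rates

# Epoch losses copied from the training log
logged_rates = replay_losses([0.9669, 0.2362, 0.0759, 0.1074, 0.0045, 0.0001, 0.0001])

# Hypothetical flat loss curve: no improvement after the first epoch
plateau_rates = replay_losses([1.0, 1.0, 1.0, 1.0])

print(logged_rates)
print(plateau_rates)
```

With the logged losses the number of consecutive non-improving epochs never exceeds the patience of 2, so the learning rate stays at its initial value in this replay.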


## **8.- Test**

In [37]:
@torch.no_grad()  # Gradients are not needed during evaluation
def test_model(model, criterion, device='cuda'):
  model.to(device)

  # To start the evaluation time
  since = time.time()

  model.eval()

  running_loss = 0.0
  correct_predictions_counter = 0

  pred_vs_real = {}
  pred_vs_real['path']= []
  pred_vs_real['label']= []  
  pred_vs_real['prediction']= []

  # To create a progress bar to iterate over the 'test' dataloader using the tqdm library
  progress_bar = tqdm(dataloaders['test'], total=int(len(dataloaders['test'])))

  processed_batch_counter = 0
  for batch, sample in enumerate(progress_bar):
    # Get the videos and labels and move them to the corresponding device memory
    inputs = sample['video'].to(device , dtype=torch.float)
    labels = sample['label'].view(sample['label'].shape[0], 1).to(device, dtype=torch.float)
    paths = sample['path']

    # Get the outputs predicted by the model
    outputs = model(inputs)

    # Apply a sigmoid activation function to the outputs to obtain the predictions
    # and round the predictions to be binary (0 or 1)
    predictions = torch.round(torch.sigmoid(outputs))

    # Add the predictions and labels to the dictionary pred_vs_real
    # converted to a numpy array and move them to CPU memory
    pred_vs_real['prediction'].extend(predictions.cpu().detach().numpy().flatten())
    pred_vs_real['label'].extend(labels.cpu().detach().numpy().flatten())
    pred_vs_real['path'].extend(list(paths))

    # Calculate the loss with the function specified in the criterion variable
    loss = criterion(outputs, labels)

    # To get the total loss of the current batch:
    #   - loss.item() is the scalar value of the current batch loss
    #   - inputs.size(0) gets the batch size
    running_loss += loss.item() * inputs.size(0)
    # Adds the number of correct predictions in the current batch to the accumulated correct predictions counter
    correct_predictions_counter += torch.sum(predictions == labels.data)

    # Updates the progress message in the progress_bar iterator showing the average loss
    # To do this, divide the accumulated loss by the total number of samples processed so far
    processed_batch_counter += 1
    progress_bar.set_postfix(loss=(running_loss / (processed_batch_counter * dataloaders['test'].batch_size)))

  final_loss = running_loss / dataset_sizes['test']
  accuracy = correct_predictions_counter.double() / dataset_sizes['test']
  precision = precision_score(pred_vs_real['label'], pred_vs_real['prediction'])
  recall = recall_score(pred_vs_real['label'], pred_vs_real['prediction'])
  f1 = f1_score(pred_vs_real['label'], pred_vs_real['prediction'])
  print('{} Loss: {:.4f} Accuracy: {:.4f} Precision: {:.4f} Recall: {:.4f} F1 Score: {:.4f}'.format('Test', final_loss, accuracy, precision, recall, f1))

  # Calculate and print the confusion matrix
  confusion = confusion_matrix(pred_vs_real['label'], pred_vs_real['prediction'])
  print("Confusion Matrix:")
  print(confusion)

  time_elapsed = time.time() - since
  print('Testing complete in {:.0f}m {:.0f}s'.format(time_elapsed // 60, time_elapsed % 60))

  return pred_vs_real

In [38]:
pred_vs_real = test_model(model, criterion, device)

Test Loss: 0.0000 Accuracy: 1.0000 Precision: 1.0000 Recall: 1.0000 F1 Score: 1.0000
Confusion Matrix:
[[19  0]
 [ 0 21]]
Testing complete in 1m 40s
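
The reported metrics follow directly from the confusion matrix, which `confusion_matrix` lays out as [[TN, FP], [FN, TP]] for binary labels 0 and 1. The sketch below recomputes them by hand; `binary_metrics` is a hypothetical helper and the second matrix is invented to show a non-trivial case.

```python
def binary_metrics(confusion):
  # confusion is [[TN, FP], [FN, TP]]
  (tn, fp), (fn, tp) = confusion
  accuracy = (tp + tn) / (tp + tn + fp + fn)
  precision = tp / (tp + fp) if (tp + fp) else 0.0
  recall = tp / (tp + fn) if (tp + fn) else 0.0
  f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
  return accuracy, precision, recall, f1

reported = binary_metrics([[19, 0], [0, 21]])      # matrix from the test output above
hypothetical = binary_metrics([[18, 1], [2, 19]])  # invented matrix with three errors

print(reported)
print(hypothetical)
```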


Save model test results in a CSV file

In [39]:
# Create a DataFrame with the data from pred_vs_real
pred_vs_real_dataframe = pd.DataFrame({'path': pred_vs_real['path'], 'label': pred_vs_real['label'], 'prediction': pred_vs_real['prediction']})

# Save the DataFrame to a CSV file
pred_vs_real_dataframe.to_csv(violence_in_movies_dataframes_folder + 'results.csv', index=False)