<a href="https://colab.research.google.com/github/carmenbarriga/Violence-Detection-in-Videos-with-Transformers/blob/main/DeVTr_Violence_in_Movies.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Data Efficient Video Transformer (DeVTr) for Violence Detection**

@inproceedings{abdali2021data,
  title={Data efficient video transformer for violence detection},
  author={Abdali, Almamon Rasool},
  booktitle={2021 IEEE International Conference on Communication, Networks and Satellite (COMNETSAT)},
  pages={195--199},
  year={2021},
  organization={IEEE}
}

## **1.- Installation of the necessary libraries**

*   **Menovideo:** PyTorch library that provides the DeVTr model
*   **Timm:** Library that provides pre-trained implementations of deep learning models using the PyTorch framework


In [1]:
! pip install menovideo
! pip install timm

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


## **2.- Mount Google Drive**
Mount Google Drive so that its files and directories can be accessed from the notebook

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## **3.- Import the necessary libraries**

In [29]:
import copy
import cv2
import numpy as np
import os
import pandas as pd
import time
import torch

from skimage.transform import resize
from sklearn.metrics import precision_score, recall_score
from sklearn import model_selection
from torch import nn
from torch.utils.data import DataLoader, Dataset
from torch.optim import lr_scheduler
from tqdm.notebook import tqdm

import menovideo.menovideo as menoformer
import menovideo.videopre as vide_reader

## **4.- Make some initial configurations**
The function `seed_everything` is used to set seeds across various libraries and environments in Python to ensure reproducibility of results. Seed 1001 will be used.

In [4]:
def seed_everything(seed):
  # Sets the seed for the numpy library's random number generator
  np.random.seed(seed)
  # Sets the seed for the torch library's random number generator (PyTorch) for both the CPU and GPU
  torch.manual_seed(seed)
  torch.cuda.manual_seed(seed)
  # To ensure that calculations performed with the torch library on the GPU are deterministic
  torch.backends.cudnn.deterministic = True
  # Disable cuDNN auto-tuning so that the same convolution algorithms are selected on every run
  torch.backends.cudnn.benchmark = False

seed_everything(1001)
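
`seed_everything` covers NumPy and PyTorch. If Python's built-in `random` module or more than one GPU is also involved, a broader variant may be useful. The sketch below is only an illustration: the name `seed_everything_extended` is hypothetical and not part of the original notebook, and the PyTorch calls are skipped when PyTorch is not installed.

```python
import os
import random

import numpy as np


def seed_everything_extended(seed):
    """Hypothetical broader variant of seed_everything (not from the original notebook)."""
    # Python's built-in random number generator
    random.seed(seed)
    # Hash seed for child processes; it does not affect the already-running interpreter
    os.environ["PYTHONHASHSEED"] = str(seed)
    # NumPy's global random number generator
    np.random.seed(seed)
    try:
        import torch
    except ImportError:
        return  # PyTorch is not installed; nothing more to seed
    torch.manual_seed(seed)
    # Seeds every visible GPU; the call is ignored when CUDA is unavailable
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False


seed_everything_extended(1001)
first_draw = np.random.rand(3)
seed_everything_extended(1001)
second_draw = np.random.rand(3)
# The same seed should reproduce the same sequence of draws
print(np.array_equal(first_draw, second_draw))
```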

Release the GPU memory cached by PyTorch and display the current PyTorch version

In [5]:
torch.cuda.empty_cache()
torch.__version__

'2.0.1+cu118'

Determine whether the PyTorch computations will run on a GPU (CUDA) or on the CPU

In [6]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
device

'cuda'

## **5.- Prepare the data**

Set the path to the Violence in Movies dataset folder and the path of the file where the best model weights will be saved


In [7]:
violence_in_movies_folder = '/content/drive/MyDrive/transformers-for-violence-detection-in-videos/Datasets/Violence in Movies/'
violence_in_movies_weights_dir = '/content/drive/MyDrive/transformers-for-violence-detection-in-videos/violence_in_movies_best_model_weights.pth'

Function to check that a video can be opened and that at least `min_frames` frames can be read from it

In [8]:
def check_frames(video_dir, min_frames=25):
  # VideoCapture object to open and read the video
  video_capture = cv2.VideoCapture(video_dir)
  # Check whether the VideoCapture object was able to open the video
  if not video_capture.isOpened():
    print(f"Can't open '{video_dir}'")
    return False
  # Try to read the first min_frames frames
  for _ in range(min_frames):
    is_frame_read, frame = video_capture.read()
    # Stop as soon as a frame cannot be read
    if not is_frame_read or frame is None:
      print(f"Something went wrong with '{video_dir}' video")
      video_capture.release()
      return False
  # Release the file handle held by the VideoCapture object
  video_capture.release()
  return True

Function to get the paths where videos are located and their labels

In [9]:
def get_video_labels(main_dir):
  videos = []
  labels = []
  # Loop through the folders (classes) of the dataset folder
  for folder_name in os.listdir(main_dir):
    folder_dir = main_dir + folder_name + '/'
    print(f'Folder name: {folder_name}\nFolder dir: {folder_dir}')
    # Loop through videos within the current folder
    for file_name in os.listdir(folder_dir):
      file_dir = folder_dir + file_name
      # Check if the video can be opened correctly and has at least 25 frames
      if check_frames(file_dir):
        videos.append(file_dir)
        # Add the video label according to the folder where it is located
        if folder_name == 'Violence':
          labels.append(1)
        else:
          labels.append(0)
  return videos, labels
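
The traversal in `get_video_labels` relies on `main_dir` ending with a slash and on every entry being a class folder. As a hedged alternative, the same folder-to-label mapping can be sketched with `pathlib`. The helper name `list_labeled_videos` and the temporary directory tree are made up for illustration, and the frame check performed by `check_frames` is deliberately left out.

```python
import tempfile
from pathlib import Path


def list_labeled_videos(main_dir, positive_folder="Violence"):
    """Hypothetical helper: return (paths, labels) for every file one level below main_dir."""
    videos, labels = [], []
    # sorted() makes the order independent of the file system
    for class_dir in sorted(Path(main_dir).iterdir()):
        if not class_dir.is_dir():
            continue  # Skip stray files next to the class folders
        label = 1 if class_dir.name == positive_folder else 0
        for video_path in sorted(class_dir.iterdir()):
            if video_path.is_file():
                videos.append(str(video_path))
                labels.append(label)
    return videos, labels


# Tiny throwaway directory tree that mimics the dataset layout
with tempfile.TemporaryDirectory() as tmp:
    layout = {"Violence": ["a.avi", "b.avi"], "Non Violence": ["c.avi"]}
    for folder, names in layout.items():
        (Path(tmp) / folder).mkdir()
        for name in names:
            (Path(tmp) / folder / name).touch()
    demo_videos, demo_labels = list_labeled_videos(tmp)

print(demo_labels)
```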

In [10]:
videos, labels = get_video_labels(violence_in_movies_folder)

Folder name: Violence
Folder dir: /content/drive/MyDrive/transformers-for-violence-detection-in-videos/Datasets/Violence in Movies/Violence/
Folder name: Non Violence
Folder dir: /content/drive/MyDrive/transformers-for-violence-detection-in-videos/Datasets/Violence in Movies/Non Violence/


Create the Pandas `DataFrame`

In [11]:
data = pd.DataFrame(data={"file": videos, "label": labels})
data.head()

|   | file | label |
|---|------|-------|
| 0 | /content/drive/MyDrive/transformers-for-violen... | 1 |
| 1 | /content/drive/MyDrive/transformers-for-violen... | 1 |
| 2 | /content/drive/MyDrive/transformers-for-violen... | 1 |
| 3 | /content/drive/MyDrive/transformers-for-violen... | 1 |
| 4 | /content/drive/MyDrive/transformers-for-violen... | 1 |


Split the data for training and testing:
*   80% train
*   20% test

In [12]:
train_data, test_data = model_selection.train_test_split(
  data, test_size=0.2, random_state=42
)
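
A plain random split does not guarantee the same class ratio in both subsets. If an exact balance matters, `train_test_split` accepts a `stratify` argument. The sketch below uses made-up file names and a balanced label list rather than the notebook's `data` frame, so it is an illustration of the option and not the split used here.

```python
from sklearn.model_selection import train_test_split

# Made-up file names and a balanced label list, for illustration only
files = [f"video_{i}.avi" for i in range(200)]
labels_demo = [1] * 100 + [0] * 100

# stratify keeps the class proportions of labels_demo in both subsets
train_files, test_files, train_labels, test_labels = train_test_split(
    files, labels_demo, test_size=0.2, random_state=42, stratify=labels_demo
)

print(len(train_labels), sum(train_labels))
print(len(test_labels), sum(test_labels))
```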

Show train data information

In [13]:
print('Train data shape: ', train_data.shape)
print('Number of violence videos in train data: ', train_data['label'].value_counts()[1])
print('Number of non violence videos in train data: ', train_data['label'].value_counts()[0])

Train data shape:  (160, 2)
Number of violence videos in train data:  79
Number of non violence videos in train data:  81


Show test data information

In [14]:
print('Test data shape: ', test_data.shape)
print('Number of violence videos in test data: ', test_data['label'].value_counts()[1])
print('Number of non violence videos in test data: ', test_data['label'].value_counts()[0])

Test data shape:  (40, 2)
Number of violence videos in test data:  21
Number of non violence videos in test data:  19


Defining some video properties

In [15]:
time_steps = 49   # Number of frames read from the start of each video
color_channels = 3  # Number of color channels
height = 256  # Height of each frame
width = 256   # Width of each frame

Define the `capture` function and the `TaskDataset` class, then pass the training data to `TaskDataset`

In [16]:
def capture(filename, time_steps, color_channels, height, width):
  # Create an array to store the video frames after being processed
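  # np.float was deprecated in NumPy 1.20 and later removed; np.float64 is the equivalent explicit dtype
  frames = np.zeros((time_steps, color_channels, height, width), dtype=np.float64)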
  # VideoCapture object to open and read the video
  video_capture = cv2.VideoCapture(filename)
  # To check if the VideoCapture object was able to open the video
  if video_capture.isOpened():
    # To keep track of how many frames have been stored in the frames array
    frames_counter = 0
    while frames_counter < time_steps:
      # Read the next frame
      is_frame_read, frame = video_capture.read()
      # Check if there are no more frames available
      if not is_frame_read:
        break
      # Resize the frame to (height, width, color_channels); the aspect ratio is not preserved,
      # and skimage's resize returns float values in [0, 1] by default
      frame = resize(frame, (height, width, color_channels))
      # To add an extra dimension (1, height, width, color_channels)
      frame = np.expand_dims(frame, axis=0)
      # Moves axis -1 (last axis) to index 1 (1, color_channels, height, width)
      frame = np.moveaxis(frame, -1, 1)
      # Normalization of the pixel values of the frame (if necessary)
      if np.max(frame) > 1:
        frame = frame / 255.0
      # Store the processed frame in the corresponding position within the frames array
      frames[frames_counter][:] = frame
      frames_counter += 1

    # Release the file handle held by the VideoCapture object
    video_capture.release()
    del frame
    del is_frame_read

  return frames


class TaskDataset(Dataset):
  def __init__(self, data, time_steps=10, color_channels=3, height=90, width=90):
    """
    Args:
      data: pandas dataframe that contains the paths to the video files with their labels
      time_steps: number of frames
      color_channels: number of color channels
      height: height of frames
      width: width of frames
    """
    self.data_locations = data
    self.time_steps, self.color_channels, self.height, self.width = time_steps, color_channels, height, width

  def __len__(self):
    return len(self.data_locations)

  def __getitem__(self, idx):
    if torch.is_tensor(idx):
      idx = idx.tolist()
    # To process the video and get its frames
    video = capture(self.data_locations.iloc[idx, 0], self.time_steps, self.color_channels, self.height, self.width)
    # Dictionary containing the processed video and its corresponding label
    sample = {
      'video': torch.from_numpy(video),
      'label': torch.from_numpy(np.asarray(self.data_locations.iloc[idx, 1]))
    }

    return sample
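
The layout produced by `capture` can be checked without any video file. The following sketch, which assumes synthetic frames and a hypothetical helper named `stack_frames`, shows the channels-last to channels-first conversion and the zero padding that results when a video has fewer than `time_steps` frames.

```python
import numpy as np


def stack_frames(frames_hwc, time_steps):
    """Hypothetical helper: stack (H, W, C) frames into a zero-padded (T, C, H, W) array."""
    height, width, channels = frames_hwc[0].shape
    clip = np.zeros((time_steps, channels, height, width), dtype=np.float32)
    for t, frame in enumerate(frames_hwc[:time_steps]):
        # Channels-last (H, W, C) -> channels-first (C, H, W)
        clip[t] = np.moveaxis(frame, -1, 0)
    return clip


# Three synthetic 4x5 RGB frames standing in for decoded video frames
rng = np.random.default_rng(0)
fake_frames = [rng.random((4, 5, 3)) for _ in range(3)]
clip = stack_frames(fake_frames, time_steps=5)

print(clip.shape)          # (5, 3, 4, 5)
print(bool(clip[3:].any()))  # False, the last two slots stay zero-padded
```

Padding with zeros means that short videos end in black frames, which is the same behavior as the `frames` array pre-allocated in `capture`.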

In [17]:
train_dataset = TaskDataset(
  data=train_data, time_steps=time_steps, color_channels=color_channels, height=height, width=width
)

Passing the test data to the TaskDataset class

In [18]:
test_dataset = TaskDataset(
  data=test_data, time_steps=time_steps, color_channels=color_channels, height=height, width=width
)

Defining the batch size

In [19]:
BATCH_SIZE = 16
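
The batch size interacts with the clip dimensions chosen earlier (49 frames, 3 channels, 256 x 256 pixels). A quick back-of-the-envelope calculation, sketched here with a hypothetical helper `array_mib`, gives a rough idea of the memory needed for the input data alone; activations and gradients come on top of this.

```python
def array_mib(shape, bytes_per_element):
    """Size in MiB of a dense array with the given shape."""
    count = 1
    for dim in shape:
        count *= dim
    return count * bytes_per_element / 2**20


# One clip as allocated by `capture` with a 64-bit float dtype (8 bytes per value)
clip_mib = array_mib((49, 3, 256, 256), 8)
# One training batch after conversion to torch.float (4 bytes per value)
batch_mib = array_mib((16, 49, 3, 256, 256), 4)

print(f"{clip_mib:.1f} MiB per clip, {batch_mib:.1f} MiB per batch")
# 73.5 MiB per clip, 588.0 MiB per batch
```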

Creating a `DataLoader` to load data in batches during training

In [20]:
train_loader = DataLoader(
  dataset=train_dataset,
  batch_size=BATCH_SIZE,
  pin_memory=True,
  drop_last=True,
  num_workers=0,
  shuffle=True
)

Creating a `DataLoader` to load data in batches during testing

In [33]:
BATCH_SIZE_TEST = 10

In [34]:
test_loader = DataLoader(
  dataset=test_dataset,
  batch_size=BATCH_SIZE_TEST,
  pin_memory=True,
  drop_last=True,
  num_workers=0,
  shuffle=False
)

Putting the `DataLoaders` in the `dataloaders` dictionary and their sizes in the `dataset_sizes` dictionary

In [35]:
dataloaders = {'train': train_loader, 'test': test_loader}
dataset_sizes = {'train': len(train_dataset), 'test': len(test_dataset)}
print(dataloaders)
print(dataset_sizes)

{'train': <torch.utils.data.dataloader.DataLoader object at 0x7fc1c1da4880>, 'test': <torch.utils.data.dataloader.DataLoader object at 0x7fc1b8781db0>}
{'train': 160, 'test': 40}
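
Both loaders use `drop_last=True`, which is harmless here because 160 and 40 are exact multiples of the batch sizes. The short sketch below uses two hypothetical helper functions to show the arithmetic, and also what would happen with a dataset size that is not a multiple of the batch size.

```python
import math


def batches_per_epoch(num_samples, batch_size, drop_last):
    """Number of batches a DataLoader yields for a map-style dataset."""
    if drop_last:
        return num_samples // batch_size
    return math.ceil(num_samples / batch_size)


def samples_seen(num_samples, batch_size, drop_last):
    """Number of samples actually processed in one pass over the data."""
    if drop_last:
        return (num_samples // batch_size) * batch_size
    return num_samples


print(batches_per_epoch(160, 16, drop_last=True))  # 10
print(batches_per_epoch(40, 10, drop_last=True))   # 4
# Hypothetical size that is not a multiple of the batch size
print(samples_seen(45, 10, drop_last=True))        # 40
```

In the last case five samples would be skipped, so any metric divided by the full dataset size would be underestimated. Setting `drop_last=False` for the test loader avoids that risk.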


Release memory, since `data`, `train_data` and `test_data` are no longer needed

In [23]:
del data
del train_data
del test_data

## **6.- Training**

In [26]:
def train_model(model, criterion, optimizer, scheduler, device='cuda', num_epochs=7):
  model.to(device)

  # Record the training start time
  since = time.time()

  # Save the best loss value during model training
  best_loss = float('inf')
  # Create a copy of the current model weights
  best_model_weights = copy.deepcopy(model.state_dict())

  for epoch in range(num_epochs):
    print('Epoch {}/{}'.format(epoch + 1, num_epochs))
    print('-' * 10)

    model.train()
    running_loss = 0.0
    correct_predictions_counter = 0

    # To create a progress bar to iterate over the 'train' dataloader using the tqdm library
    progress_bar = tqdm(dataloaders['train'], total=int(len(dataloaders['train'])))

    for batch, sample in enumerate(progress_bar):
      # Get the videos and labels and move them to the corresponding device memory
      inputs = sample['video'].to(device, dtype=torch.float)  # [batch_size, time_steps, color_channels, height, width]
      labels = sample['label'].view(sample['label'].shape[0], 1).to(device, dtype=torch.float)  # [batch_size] -> [batch_size, 1]

      # To clean up the accumulated gradients and ensure that the gradients are calculated correctly 
      # for the current batch during backpropagation and updating of the weights
      optimizer.zero_grad()

      # Get the outputs predicted by the model
      outputs = model(inputs)

      # Calculate the loss with the function specified in the criterion variable
      loss = criterion(outputs, labels)

      # Computes the gradients of all model parameters with respect to the loss function
      loss.backward()
      # Update model parameters based on gradients computed during backpropagation
      optimizer.step()

      # To get the total loss of the current batch:
      #   - loss.item() is the scalar value of the current batch loss
      #   - inputs.size(0) gets the batch size
      running_loss += loss.item() * inputs.size(0)

      # Apply a sigmoid activation function to the outputs to obtain the predictions
      # and round the predictions to be binary (0 or 1)
      predictions = torch.round(torch.sigmoid(outputs))

      # Adds the number of correct predictions in the current batch to the accumulated correct predictions counter
      correct_predictions_counter += torch.sum(predictions == labels.data)

    epoch_loss = running_loss / dataset_sizes['train']
    epoch_accuracy = correct_predictions_counter.double() / dataset_sizes['train']
    print('Train Loss: {:.4f} Acc: {:.4f}'.format(epoch_loss, epoch_accuracy))

    # Let the scheduler adjust the learning rate based on the loss obtained in this training epoch
    scheduler.step(epoch_loss)

    # Stores the model weights that correspond to the best loss achieved so far
    if epoch_loss < best_loss:
      best_loss = epoch_loss
      best_model_weights = copy.deepcopy(model.state_dict())

  # Compute the total training time
  time_elapsed = time.time() - since
  print('Training complete in {:.0f}m {:.0f}s'.format(time_elapsed // 60, time_elapsed % 60))

  # The model is loaded with the weights corresponding to the best saved model
  model.load_state_dict(best_model_weights)
  # Save the weights
  torch.save(best_model_weights, violence_in_movies_weights_dir)
  return model
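
`train_model` expects raw logits from the model: the loss is computed on the logits, and the sigmoid is applied only to obtain the predictions. The NumPy sketch below, with made-up logits and labels, illustrates the numerically stable formula for binary cross-entropy on logits and shows that rounding the sigmoid output is the same decision as checking the sign of the logit. It is meant as an illustration of the idea, not as a copy of PyTorch's implementation.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def bce_with_logits(logits, targets):
    """Mean binary cross-entropy computed directly from logits (stable form)."""
    # max(x, 0) - x * y + log(1 + exp(-|x|))
    per_sample = np.maximum(logits, 0) - logits * targets + np.log1p(np.exp(-np.abs(logits)))
    return per_sample.mean()


logits = np.array([2.0, -1.0, 0.5, -3.0])  # made-up raw model outputs
targets = np.array([1.0, 0.0, 0.0, 0.0])   # made-up labels

# Naive form: cross-entropy on the sigmoid probabilities
probs = sigmoid(logits)
naive = -(targets * np.log(probs) + (1 - targets) * np.log(1 - probs)).mean()

# Both forms agree on well-behaved inputs
print(np.isclose(bce_with_logits(logits, targets), naive))
# Rounding the sigmoid gives the same decision as checking the sign of the logit
print(np.array_equal(np.round(probs), (logits > 0).astype(float)))
```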

Initialize the model

In [27]:
model = menoformer.DeVTr(time_stp=time_steps)

Downloading model.safetensors:   0%|          | 0.00/575M [00:00<?, ?B/s]

In [28]:
criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.0001)
scheduler = lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', factor=0.5, patience=2, verbose=True)
model = train_model(model, criterion, optimizer, scheduler, device=device, num_epochs=7)

Epoch 1/7
----------


  0%|          | 0/10 [00:00<?, ?it/s]

Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  frames = np.zeros((time_steps, color_channels, height, width), dtype=np.float)


Train Loss: 0.9049 Acc: 0.6313
Epoch 2/7
----------


  0%|          | 0/10 [00:00<?, ?it/s]

Train Loss: 0.1288 Acc: 0.9750
Epoch 3/7
----------


  0%|          | 0/10 [00:00<?, ?it/s]

Train Loss: 0.0196 Acc: 0.9938
Epoch 4/7
----------


  0%|          | 0/10 [00:00<?, ?it/s]

Train Loss: 0.0014 Acc: 1.0000
Epoch 5/7
----------


  0%|          | 0/10 [00:00<?, ?it/s]

Train Loss: 0.0442 Acc: 0.9938
Epoch 6/7
----------


  0%|          | 0/10 [00:00<?, ?it/s]

Train Loss: 0.0459 Acc: 0.9938
Epoch 7/7
----------


  0%|          | 0/10 [00:00<?, ?it/s]

Train Loss: 0.0126 Acc: 0.9938
Epoch 00007: reducing learning rate of group 0 to 5.0000e-05.
Training complete in 51m 30s
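
The learning-rate reduction reported at epoch 7 follows from the scheduler settings (`factor=0.5`, `patience=2`). The pure-Python sketch below replays the epoch losses from the log with a simplified plateau rule; it ignores the improvement threshold and cooldown that the real scheduler also supports, so treat it as an approximation of the behavior rather than a reimplementation.

```python
def simulate_reduce_on_plateau(losses, lr, factor=0.5, patience=2):
    """Simplified plateau rule: cut lr after more than `patience` epochs without a new best loss."""
    best = float("inf")
    bad_epochs = 0
    history = []
    for epoch, loss in enumerate(losses, start=1):
        if loss < best:
            best, bad_epochs = loss, 0
        else:
            bad_epochs += 1
        if bad_epochs > patience:
            lr *= factor
            bad_epochs = 0
        history.append((epoch, lr))
    return history


# Epoch losses copied from the training log above
logged_losses = [0.9049, 0.1288, 0.0196, 0.0014, 0.0442, 0.0459, 0.0126]
lr_history = simulate_reduce_on_plateau(logged_losses, lr=1e-4)
for epoch, lr in lr_history:
    print(epoch, lr)
```

The loss last improves at epoch 4 (0.0014), so epochs 5, 6 and 7 count as three epochs without improvement, which exceeds the patience of 2 and halves the learning rate to 5e-05. Epoch 4 is also the checkpoint that `train_model` keeps as the best weights.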


## **7.- Test**

In [36]:
# Gradients are not needed for evaluation, so autograd tracking is disabled
@torch.no_grad()
def test_model(model, criterion, device='cuda'):
  model.to(device)

  # Record the evaluation start time
  since = time.time()

  model.eval()
  total_loss_sum = 0.0
  correct_predictions_counter = 0
  # Predicted and real labels collected over all test batches
  pred_vs_real = {'pred': [], 'real': []}

  # To create a progress bar to iterate over the 'test' dataloader using the tqdm library
  progress_bar = tqdm(dataloaders['test'], total=int(len(dataloaders['test'])))

  processed_batch_counter = 0
  for batch, sample in enumerate(progress_bar):
    # Get the videos and labels and move them to the corresponding device memory
    inputs = sample['video'].to(device , dtype=torch.float)
    labels = sample['label'].view(sample['label'].shape[0], 1)
    labels = labels.to(device, dtype=torch.float)

    # Get the outputs predicted by the model
    outputs = model(inputs)
    # Apply a sigmoid activation function to the outputs to obtain the predictions
    # and round the predictions to be binary (0 or 1)
    preds = torch.round(torch.sigmoid(outputs))

    # Add the predictions and labels to the dictionary pred_vs_real
    # converted to a numpy array and move them to CPU memory
    pred_vs_real['pred'].extend(preds.cpu().detach().numpy().flatten())
    pred_vs_real['real'].extend(labels.cpu().detach().numpy().flatten())

    # Calculate the loss with the function specified in the criterion variable
    loss = criterion(outputs, labels)

    # Statistics
    processed_batch_counter += 1
    # To get the total loss of the current batch:
    #   - loss.item() is the scalar value of the current batch loss
    #   - inputs.size(0) gets the batch size
    total_loss_sum += loss.item() * inputs.size(0)
    # Adds the number of correct predictions in the current batch to the accumulated correct predictions counter
    correct_predictions_counter += torch.sum(preds == labels.data)

    # Updates the progress message in the progress_bar iterator showing the average loss
    # To do this, divide the accumulated loss by the total number of samples processed so far
    progress_bar.set_postfix(loss=(total_loss_sum / (processed_batch_counter * dataloaders['test'].batch_size)))

  final_loss = total_loss_sum / dataset_sizes['test']
  accuracy = correct_predictions_counter.double() / dataset_sizes['test']
  precision = precision_score(pred_vs_real['real'], pred_vs_real['pred'])
  recall = recall_score(pred_vs_real['real'], pred_vs_real['pred'])
  print('{} Loss: {:.4f} Accuracy: {:.4f} Precision: {:.4f} Recall: {:.4f}'.format('test', final_loss, accuracy, precision, recall))

  time_elapsed = time.time() - since
  print('testing complete in {:.0f}m {:.0f}s'.format(time_elapsed // 60, time_elapsed % 60))

  return pred_vs_real

In [37]:
rl_vs_prd = test_model(model, criterion, device)
print(rl_vs_prd)

  0%|          | 0/4 [00:00<?, ?it/s]

Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  frames = np.zeros((time_steps, color_channels, height, width), dtype=np.float)


test Loss: 0.3165 Accuracy: 0.9500 Precision: 1.0000 Recall: 0.9048
testing complete in 1m 37s
{'pred': [0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0], 'real': [1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0]}


In [38]:
print(rl_vs_prd['pred'])
print(rl_vs_prd['real'])

[0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0]
[1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0]
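
Comparing the two lists element by element shows two mismatches, both violent videos predicted as non-violent, and no false positives. The sketch below recomputes the reported metrics from those confusion counts with a hypothetical helper `binary_metrics`; the label lists are rebuilt from the counts and are not in the original sample order.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision and recall for binary labels (1 = positive class)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Same confusion counts as the lists printed above: 19 TP, 2 FN, 0 FP, 19 TN
y_true = [1] * 21 + [0] * 19
y_pred = [1] * 19 + [0] * 2 + [0] * 19
accuracy, precision, recall = binary_metrics(y_true, y_pred)

print(f"{accuracy:.4f} {precision:.4f} {recall:.4f}")  # 0.9500 1.0000 0.9048
```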
