<a href="https://colab.research.google.com/github/carmenbarriga/Violence-Detection-in-Videos-with-Transformers/blob/main/DeVTr.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Data Efficient Video Transformer - DeVTr**

@inproceedings{abdali2021data,
  title={Data efficient video transformer for violence detection},
  author={Abdali, Almamon Rasool},
  booktitle={2021 IEEE International Conference on Communication, Networks and Satellite (COMNETSAT)},
  pages={195--199},
  year={2021},
  organization={IEEE}
}

## **1.- Installation of the necessary libraries**

*   **Menovideo:** PyTorch library where DeVTr can be used
*   **Timm:** Library that provides pre-trained implementations of deep learning models using the PyTorch framework


In [1]:
!pip install menovideo
!pip install timm

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting menovideo
  Downloading menovideo-0.5.1-py3-none-any.whl (7.3 kB)
Installing collected packages: menovideo
Successfully installed menovideo-0.5.1
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting timm
  Downloading timm-0.9.2-py3-none-any.whl (2.2 MB)
Collecting huggingface-hub (from timm)
  Downloading huggingface_hub-0.14.1-py3-none-any.whl (224 kB)
Collecting safetensors (from timm)
  Downloading safetensors-0.3.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)

## **2.- Mount Google Drive**
Mount Google Drive so that the notebook can read the datasets and the model weights stored there.

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## **3.- Import the necessary libraries**

In [3]:
import cv2
import os
import pandas as pd
import time
import torch

from sklearn import model_selection
from sklearn.metrics import precision_score, recall_score
from torch.utils.data import DataLoader
from tqdm.notebook import tqdm

import menovideo.menovideo as menoformer
import menovideo.videopre as vide_reader

## **4.- Make some configurations**
The function `seed_everything` sets the PyTorch random seeds (CPU and GPU) and configures cuDNN for deterministic execution, so that results are reproducible. Seed 1001 will be used.

In [4]:
def seed_everything(seed):
  # Sets the seed for the torch library's random number generator (PyTorch) for both the CPU and GPU
  torch.manual_seed(seed)
  torch.cuda.manual_seed(seed)
  # Force cuDNN to use deterministic algorithms on the GPU
  torch.backends.cudnn.deterministic = True
  # Disable the cuDNN auto-tuner, which may select different algorithms from run to run
  torch.backends.cudnn.benchmark = False

seed_everything(1001)
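
The `seed_everything` function above only touches PyTorch. If other sources of randomness matter (for example Python's `random` module or NumPy), a broader variant could look like the sketch below. The name `seed_everything_extended` is hypothetical, and NumPy and PyTorch are seeded only if they are installed.

```python
import random


def seed_everything_extended(seed):
    """Hypothetical extension of seed_everything that also seeds random and NumPy."""
    # Python's built-in random number generator
    random.seed(seed)

    # NumPy's global generator, if NumPy is available
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass

    # PyTorch (CPU and all GPUs) plus the cuDNN settings, if PyTorch is available
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)  # silently ignored when CUDA is unavailable
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
    except ImportError:
        pass


# Seeding twice with the same value reproduces the same draw
seed_everything_extended(1001)
first_draw = random.random()
seed_everything_extended(1001)
second_draw = random.random()
print(first_draw == second_draw)  # True
```

The train/test split later in the notebook passes its own `random_state`, so it is reproducible regardless of these global seeds.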

Release the GPU memory cached by PyTorch and display the current PyTorch version

In [5]:
torch.cuda.empty_cache()
torch.__version__

'2.0.1+cu118'

Select the device on which the PyTorch computations will run: a GPU (CUDA) if one is available, otherwise the CPU

In [6]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
device

'cuda'

## **5.- Prepare the dataset**

Set the dataset folder paths:
* RLVS
* Hockey Fight
* Violence in Movies
* RWF-2000



In [7]:
# RLVS
rlvs_folder = '/content/drive/MyDrive/transformers-for-violence-detection-in-videos/Datasets/real-life-violence-situations-dataset/Real Life Violence Dataset/'
# Hockey Fight
hockey_fight_folder = '/content/drive/MyDrive/transformers-for-violence-detection-in-videos/Datasets/hockey-fight-dataset/'
# Violence in Movies
violence_in_movies_folder = '/content/drive/MyDrive/transformers-for-violence-detection-in-videos/Datasets/violence-in-movies-dataset/'
# RWF-2000
# rwf_2000_folder = ''

Function to check that a video can be opened and that at least `min_frames` frames can be read from it

In [8]:
def check_frames(video_dir, min_frames=25):
  video = cv2.VideoCapture(video_dir)
  if not video.isOpened():
    print(f"Can't open '{video_dir}'")
    return False

  # Try to read the first `min_frames` frames; fail if any of them is missing
  for _ in range(min_frames):
    is_frame_read, frame = video.read()
    if not is_frame_read or frame is None:
      print(f"Something went wrong with '{video_dir}' video")
      video.release()
      return False

  # Release the file handle held by OpenCV
  video.release()
  return True

Function to collect the paths of the readable videos and their labels (1 for the `Violence` folder, 0 for any other folder)

In [9]:
def get_video_labels(main_dir):
  videos = []
  labels = []
  for folder_name in os.listdir(main_dir):
    folder_dir = main_dir + folder_name + '/'
    print(f'Folder name: {folder_name}\nFolder dir: {folder_dir}')
    for file_name in os.listdir(folder_dir):
      video_dir = os.path.join(folder_dir, file_name)
      # Keep only the videos from which enough frames can be read
      if check_frames(video_dir):
        videos.append(video_dir)
        # Label 1 for the 'Violence' folder, 0 for any other folder
        labels.append(1 if folder_name == 'Violence' else 0)
  return videos, labels
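
The folder-name labeling convention used above can be tried without any video files. The following sketch builds a toy directory tree and labels each file by its parent folder. It skips the `check_frames` validation, and the helper name `list_labeled_files` and the file names are hypothetical. Unlike the function above, it ignores stray files at the top level instead of trying to list them as folders.

```python
import tempfile
from pathlib import Path


def list_labeled_files(main_dir, positive_folder='Violence'):
    """Return (paths, labels): label 1 for files under positive_folder, else 0."""
    paths, labels = [], []
    for folder in sorted(Path(main_dir).iterdir()):
        if not folder.is_dir():
            continue  # skip stray files next to the class folders
        for file_path in sorted(folder.iterdir()):
            paths.append(str(file_path))
            labels.append(1 if folder.name == positive_folder else 0)
    return paths, labels


# Build a toy dataset layout with empty placeholder files
with tempfile.TemporaryDirectory() as root:
    layout = {'Violence': ['V_1.mp4'], 'NonViolence': ['NV_1.mp4', 'NV_2.mp4']}
    for folder_name, file_names in layout.items():
        (Path(root) / folder_name).mkdir()
        for file_name in file_names:
            (Path(root) / folder_name / file_name).touch()
    toy_paths, toy_labels = list_labeled_files(root)

print(toy_labels)  # [0, 0, 1] because 'NonViolence' sorts before 'Violence'
```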

Create the dataframe

In [10]:
videos, labels = get_video_labels(rlvs_folder)
data = pd.DataFrame(data={"file": videos, "label": labels})
data.head()

Folder name: Violence
Folder dir: /content/drive/MyDrive/transformers-for-violence-detection-in-videos/Datasets/real-life-violence-situations-dataset/Real Life Violence Dataset/Violence/
Folder name: NonViolence
Folder dir: /content/drive/MyDrive/transformers-for-violence-detection-in-videos/Datasets/real-life-violence-situations-dataset/Real Life Violence Dataset/NonViolence/


|   | file | label |
|---|------|-------|
| 0 | /content/drive/MyDrive/transformers-for-violen... | 1 |
| 1 | /content/drive/MyDrive/transformers-for-violen... | 1 |
| 2 | /content/drive/MyDrive/transformers-for-violen... | 1 |
| 3 | /content/drive/MyDrive/transformers-for-violen... | 1 |
| 4 | /content/drive/MyDrive/transformers-for-violen... | 1 |


Split the data for training and testing:
*   80% train
*   20% test



In [11]:
train_data, test_data = model_selection.train_test_split(
  data, test_size=0.2, random_state=42
)

Show train data information

In [12]:
print('Train data shape: ', train_data.shape)
print('Number of violence videos in train data: ', train_data['label'].value_counts()[1])
print('Number of non violence videos in train data: ', train_data['label'].value_counts()[0])

Train data shape:  (1600, 2)
Number of violence videos in train data:  801
Number of non violence videos in train data:  799


Show test data information

In [13]:
print('Test data shape: ', test_data.shape)
print('Number of violence videos in test data: ', test_data['label'].value_counts()[1])
print('Number of non violence videos in test data: ', test_data['label'].value_counts()[0])

Test data shape:  (400, 2)
Number of violence videos in test data:  199
Number of non violence videos in test data:  201


`vide_reader.TaskDataset` parameters:
1. Pandas dataframe containing the path and label of each video
2. `timesep`: Number of frames per video
3. `rgb`: Number of color channels
4. `h`: Height of each frame
5. `w`: Width of each frame


In [14]:
time_stp = 40
RGB = 3
H = 200
W = 200
test_dataset = vide_reader.TaskDataset(test_data, timesep=time_stp, rgb=RGB, h=H, w=W)

Release the memory, since `data`, `train_data` and `test_data` are no longer needed

In [15]:
del data
del train_data
del test_data

In [16]:
BATCH_SIZE = 16
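
With 40 frames of 3 x 200 x 200 pixels per video and 16 videos per batch, each input batch is fairly large. The sketch below estimates its size, assuming 4 bytes per value (the evaluation loop casts inputs to `torch.float`, i.e. 32-bit floats). The helper name `batch_tensor_bytes` is hypothetical.

```python
def batch_tensor_bytes(batch_size, frames, channels, height, width, bytes_per_value=4):
    """Size in bytes of a dense [batch, frames, channels, height, width] tensor."""
    return batch_size * frames * channels * height * width * bytes_per_value


n_bytes = batch_tensor_bytes(batch_size=16, frames=40, channels=3, height=200, width=200)
print(n_bytes)                      # 307200000
print(round(n_bytes / 2**20, 1))    # 293.0 (MiB)
```

This covers only the input tensor; the activations of the network need additional GPU memory on top of it.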

Create a DataLoader to load the test dataset in batches during model evaluation

In [18]:
test_loader = DataLoader(
  dataset=test_dataset,
  # Batch size: how many examples are loaded in each iteration
  batch_size=BATCH_SIZE,
  # Place the batches in page-locked (pinned) host memory,
  # which speeds up the later copy from RAM to the GPU
  pin_memory=True,
  # If the total number of examples is not divisible by the batch size,
  # the last incomplete batch is skipped
  drop_last=True,
  # 0 worker subprocesses: the data is loaded in the main process
  num_workers=0,
  # Keep the data in its original order
  shuffle=False,
)
dataloaders = {'test': test_loader}
dataset_sizes = {'test': len(test_dataset)}
print(dataloaders)
print(dataset_sizes)

{'test': <torch.utils.data.dataloader.DataLoader object at 0x7f0880b8f370>}
{'test': 400}
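
The number of batches the loader yields follows directly from the dataset size, the batch size, and `drop_last`. A small sketch of that arithmetic (the helper names `num_batches` and `num_dropped` are hypothetical, and the 410-sample case is an invented illustration):

```python
import math


def num_batches(num_samples, batch_size, drop_last):
    """Number of batches produced for a map-style dataset."""
    if drop_last:
        return num_samples // batch_size
    return math.ceil(num_samples / batch_size)


def num_dropped(num_samples, batch_size, drop_last):
    """Number of samples skipped because they do not fill a whole batch."""
    return num_samples % batch_size if drop_last else 0


# The test set in this notebook: 400 videos, batches of 16
print(num_batches(400, 16, drop_last=True))   # 25
print(num_dropped(400, 16, drop_last=True))   # 0

# Hypothetical dataset whose size is not a multiple of the batch size
print(num_batches(410, 16, drop_last=True))   # 25
print(num_dropped(410, 16, drop_last=True))   # 10
print(num_batches(410, 16, drop_last=False))  # 26
```

Because 400 is a multiple of 16, `drop_last=True` discards nothing here, so dividing the accumulated metrics by the full dataset size is consistent with the samples actually evaluated.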


## **6.- Instantiate the model and specify the necessary parameters**

### **Model Architecture**

#### **Embedding Network**
1. Pre-trained 2D-CNN: Batch normalized VGG-19
2. Feed Forward Neural Network

#### **Transformer Encoder**

#### **Classification Network**
Two layers:
1.   Feed Forward Neural Network with 1024 neurons, 40% dropout, and ReLU as the activation function.
2.   Feed Forward Neural Network with a single neuron and a sigmoid function for binary classification.

#### **Model Parameters**
*   **w:** The path to the pre-trained DeVTr model weights
*   **base:** The base CNN that works as the input embedding layer. If set to default, it uses the VGG-19 from the paper
*   **classifier:** The classifier Deep Neural Network (DNN) that receives the output of the Transformer encoder. If set to default, it uses the same architecture as the paper. To use a different architecture, pass it here; any `nn.Sequential` network will work.
*   **mid_layer:** Feed Forward Network placed directly after the output of the Transformer encoder
*   **mid_drop:** Dropout for the mid_layer
*   **num_classes:** Number of output classes
*   **dim_embd:** Dimensionality of the input after it is transformed by the time-distributed ConvNet
*   **dr_rate:** Dropout of the Transformer encoder
*   **time_stp:** Number of frames of the input

#### **Default values**
  ```
  menoformer.DeVTr(
    w='none',
    base='default',
    classifier='default',
    mid_layer=1024,
    mid_drop=0.4,
    num_classes=1,
    dim_embd=512,
    dr_rate=0.1,
    time_stp=40,
  )
  ```


In [19]:
weights = 'drive/MyDrive/transformers-for-violence-detection-in-videos/vg19bn40convtransformer-ep-0.pth'
model = menoformer.DeVTr(w=weights, base='default')
model.to(device)
model.eval()

Downloading model.safetensors:   0%|          | 0.00/575M [00:00<?, ?B/s]

Sequential(
  (0): TimeWarp(
    (baseModel): Sequential(
      (0): VGG(
        (features): Sequential(
          (0): Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
          (1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (2): ReLU(inplace=True)
          (3): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
          (4): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (5): ReLU(inplace=True)
          (6): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
          (7): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
          (8): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (9): ReLU(inplace=True)
          (10): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
          (11): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stat

In [23]:
def test_model(model, criterion, device='cuda'):
  model.to(device)

  # Record the start time of the evaluation
  since = time.time()
  # Evaluation mode: dropout is disabled and BatchNorm uses its running statistics
  model.eval()
  total_loss_sum = 0.0
  correct_predictions_counter = 0
  processed_samples_counter = 0
  pred_vs_real = {'pred': [], 'real': []}

  # Progress bar over the 'test' dataloader using the tqdm library
  progress_bar = tqdm(dataloaders['test'], total=len(dataloaders['test']))

  # Gradients are not needed for evaluation; disabling them saves memory and time
  with torch.no_grad():
    for sample in progress_bar:
      # Get the videos and labels from the batch and move them to the device
      inputs = sample['video'].to(device, dtype=torch.float)  # [16, 40, 3, 200, 200]
      labels = sample['label'].view(-1, 1).to(device, dtype=torch.float)  # [16] -> [16, 1]

      # Get the raw outputs predicted by the model
      outputs = model(inputs)
      # Apply a sigmoid to the outputs and round the result to get binary predictions (0 or 1)
      preds = torch.round(torch.sigmoid(outputs))

      # Store the predictions and labels as flat numpy arrays in CPU memory
      pred_vs_real['pred'].extend(preds.cpu().numpy().flatten())
      pred_vs_real['real'].extend(labels.cpu().numpy().flatten())

      # Calculate the loss with the function passed as criterion
      loss = criterion(outputs, labels)

      # Statistics:
      #   - loss.item() is the mean loss of the current batch
      #   - inputs.size(0) is the batch size
      processed_samples_counter += inputs.size(0)
      total_loss_sum += loss.item() * inputs.size(0)
      correct_predictions_counter += torch.sum(preds == labels).item()

      # Show the average loss over the samples processed so far
      progress_bar.set_postfix(loss=total_loss_sum / processed_samples_counter)

  # The dataset size equals the number of evaluated samples only when no batch is dropped,
  # which holds here because 400 is divisible by the batch size
  final_loss = total_loss_sum / dataset_sizes['test']
  accuracy = correct_predictions_counter / dataset_sizes['test']
  precision = precision_score(pred_vs_real['real'], pred_vs_real['pred'])
  recall = recall_score(pred_vs_real['real'], pred_vs_real['pred'])
  print('{} Loss: {:.4f} Accuracy: {:.4f} Precision: {:.4f} Recall: {:.4f}'.format('test', final_loss, accuracy, precision, recall))

  time_elapsed = time.time() - since
  print('testing complete in {:.0f}m {:.0f}s'.format(time_elapsed // 60, time_elapsed % 60))

  return pred_vs_real
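
The accuracy, precision, and recall printed by `test_model` can all be derived from the four confusion-matrix counts. The sketch below recomputes them in plain Python on a small invented example; the helper names `metrics_from_counts` and `binary_metrics` and the toy lists are hypothetical, and `scikit-learn` remains the tool actually used in the function above.

```python
def metrics_from_counts(tp, fp, fn, tn):
    """Accuracy, precision and recall from confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        'accuracy': (tp + tn) / total,
        # Fall back to 0.0 when a denominator is zero
        'precision': tp / (tp + fp) if (tp + fp) else 0.0,
        'recall': tp / (tp + fn) if (tp + fn) else 0.0,
    }


def binary_metrics(real, pred):
    """Count TP/FP/FN/TN from two 0/1 sequences and derive the metrics."""
    pairs = list(zip(real, pred))
    tp = sum(1 for r, p in pairs if r == 1 and p == 1)
    fp = sum(1 for r, p in pairs if r == 0 and p == 1)
    fn = sum(1 for r, p in pairs if r == 1 and p == 0)
    tn = sum(1 for r, p in pairs if r == 0 and p == 0)
    return metrics_from_counts(tp, fp, fn, tn)


# Invented toy data: 3 true positives, 1 false positive, 1 false negative, 3 true negatives
toy_real = [1, 1, 1, 0, 0, 0, 1, 0]
toy_pred = [1, 0, 1, 0, 1, 0, 1, 0]
toy_metrics = binary_metrics(toy_real, toy_pred)
print(toy_metrics)  # {'accuracy': 0.75, 'precision': 0.75, 'recall': 0.75}
```

As a consistency check, the counts TP=189, FP=5, FN=10, TN=196 reproduce the metrics printed for the RLVS test set further down (accuracy 0.9625, precision 0.9742, recall 0.9497). These counts are inferred here from the printed values and the 199/201 class split; the notebook itself does not print them.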

**Defining the loss function**

`BCEWithLogitsLoss` is the **Binary Cross Entropy with Logits Loss** used for binary classification tasks.

The `BCEWithLogitsLoss` function combines the sigmoid activation function and the binary cross-entropy loss in a single operation, which is more numerically stable than applying a sigmoid followed by `BCELoss`. In this case, each sample belongs to one of two classes, represented by the labels $0$ (NonViolence) and $1$ (Violence). We denote the raw outputs of the model (before applying an activation function) as $\text{logits}$ and the true labels as $y$.

*   Application of the sigmoid function:

$$
p = \sigma(\text{logits}) = \frac{1}{1 + e^{-\text{logits}}}
$$

Where $\sigma$ is the sigmoid activation function, which transforms the logits into probabilities $p$ between $0$ and $1$.
*   Calculation of the cross-entropy loss:

$$
\text{loss} = - \left(y \cdot \log(p) + (1 - y) \cdot \log(1 - p)\right)
$$

Where $\log$ is the natural logarithm.

The above formula calculates the binary cross-entropy loss for each sample individually. With the default `reduction='mean'`, the losses of all the samples in a batch are then averaged to obtain the batch loss.
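
To make the two steps concrete, here is a minimal plain-Python sketch of the per-sample loss, writing `z` for a logit and `y` for its label. It compares the direct two-step form with an algebraically equivalent rearrangement, `max(z, 0) - z*y + log(1 + exp(-|z|))`, that avoids overflow for extreme logits. The function names and the sample values are illustrative assumptions, not PyTorch's implementation.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def bce_two_step(z, y):
    """Sigmoid first, then binary cross-entropy (the two steps described above)."""
    p = sigmoid(z)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def bce_with_logits(z, y):
    """Equivalent rearrangement that stays finite for large |z|."""
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))


# Invented logits and labels for illustration
logits = [2.0, -1.0, 0.5]
targets = [1.0, 0.0, 1.0]

# Mean over the samples, mirroring the averaging step
two_step_mean = sum(bce_two_step(z, y) for z, y in zip(logits, targets)) / len(logits)
stable_mean = sum(bce_with_logits(z, y) for z, y in zip(logits, targets)) / len(logits)
print(abs(two_step_mean - stable_mean) < 1e-12)  # True

# An extreme logit: the two-step form would overflow in math.exp here,
# while the rearranged form returns a finite loss
print(bce_with_logits(-1000.0, 1.0))  # 1000.0
```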


In [21]:
criterion = torch.nn.BCEWithLogitsLoss()

In [24]:
rl_vs_prd = test_model(model, criterion, device)
print(rl_vs_prd)

  0%|          | 0/25 [00:00<?, ?it/s]

test Loss: 0.1022 Accuracy: 0.9625 Precision: 0.9742 Recall: 0.9497
testing complete in 11m 12s
{'pred': [0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0,

In [25]:
print(rl_vs_prd['pred'])
print(rl_vs_prd['real'])

[0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0,