# Violence Detection using a CNN + LSTM neural network

## Introduction

Public violence has increased dramatically, in high schools as much as in the street. This has led to the ubiquitous use of surveillance cameras, which help the authorities identify violent events and take the necessary measures. However, almost all systems today require a human to inspect the footage in order to identify such events, which is highly inefficient. A practical system that can automatically monitor surveillance videos and identify violence is therefore needed.
The development of deep learning techniques, made possible by the availability of large datasets and computational resources, has brought about a historic change in the computer vision community. Techniques have been developed for problems such as object detection, recognition, tracking, action recognition, and caption generation. Despite these developments, however, very few deep learning techniques have been proposed for detecting violence in videos.

## Flowchart

The method consists of extracting a set of frames from the video, passing them through a pretrained network called VGG16, and taking the output of one of its final layers. These outputs are then used to train a second network built from a special type of neuron called LSTM. These neurons have memory and can analyze the temporal information in the video; if they detect violence at any point, the video is classified as violent.
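
The first step, choosing which frames to extract, can be sketched as follows. The text does not say how the frames are selected, so evenly spaced sampling is an assumption here, and `sample_frame_indices` is a hypothetical helper rather than part of this project's code.

```python
def sample_frame_indices(total_frames, frames_per_video=20):
    """Return evenly spaced frame indices covering the whole video.

    Assumption: frames are sampled uniformly; the original method may
    select frames differently.
    """
    if total_frames <= 0 or frames_per_video <= 0:
        raise ValueError("total_frames and frames_per_video must be positive")
    if frames_per_video == 1:
        return [0]
    # Integer arithmetic keeps the first index at 0 and the last at total_frames - 1
    return [i * (total_frames - 1) // (frames_per_video - 1)
            for i in range(frames_per_video)]


# A 300-frame clip reduced to 20 frame positions
indices = sample_frame_indices(300, 20)
print(indices[0], indices[-1], len(indices))
```

Each selected frame would then be resized and passed through the pretrained CNN to obtain its transfer values.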





## Imports

In [1]:
#!pip install kagglehub

In [2]:
%matplotlib inline
import cv2
import os
import numpy as np
import keras
import matplotlib.pyplot as plt
import kagglehub
import shutil

# import download
from random import shuffle
import tensorflow as tf
from tensorflow.keras.applications.efficientnet import EfficientNetB0
from tensorflow.keras.applications.resnet import ResNet50
from tensorflow.keras import backend as K
from tensorflow.keras.models import Model, Sequential
from tensorflow.keras.layers import Bidirectional, LSTM, Dense, Activation, GlobalAveragePooling2D, Flatten

import sys
import h5py

current_dir = os.getcwd()
print(current_dir)
sys.path.append(current_dir+'/utils')
from utils.datamanager import download_dataset, label_video_names


c:\Users\Cesar\Desktop\maestria\TFM


## Check if gpu enabled

In [3]:
# Check if TensorFlow can see a GPU
gpus = tf.config.list_physical_devices('GPU')
print("Num GPUs Available:", len(gpus))
print("GPU Details:", gpus)

Num GPUs Available: 1
GPU Details: [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]


## Download dataset

In [4]:
current_dir = os.getcwd()

download_dataset(current_dir)

Dataset copied to: c:\Users\Cesar\Desktop\maestria\TFM


## Load Data

First, we define the directory that holds the video dataset.

In [5]:
in_dir = "data"

Define some of the data dimensions for convenience.

In [6]:
# Frame size  
img_size = 224

img_size_touple = (img_size, img_size)

# Number of channels (RGB)
num_channels = 3

img_size_tiplet = (img_size, img_size,num_channels)

# Flat frame size
img_size_flat = img_size * img_size * num_channels

# Number of classes for classification (Violence-No Violence)
num_classes = 2

frames_per_file = 10
_images_per_file = frames_per_file


# Video extension
video_exts = ".avi"

To load the saved transfer values into RAM, we use the following function:

In [7]:
def process_alldata(file):
    # Number of consecutive frames that make up one video
    frames_num = 20

    # Read every frame-level feature vector and its label from the HDF5 file
    with h5py.File(file, 'r') as f:
        X_batch = f['data'][:]
        y_batch = f['labels'][:]

    data = []
    target = []

    # Group the flat arrays into one (frames_num, features) block per video;
    # the label of the first frame is taken as the label of the video
    for i in range(len(X_batch) // frames_num):
        start = i * frames_num
        data.append(X_batch[start:start + frames_num])
        target.append(np.array(y_batch[start]))

    return data, target
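
To make the grouping step concrete, here is a minimal sketch on toy data that uses plain Python lists instead of the HDF5 arrays. The helper name `group_frames` and the toy values are hypothetical; only the grouping logic mirrors `process_alldata`.

```python
def group_frames(features, labels, frames_per_video=20):
    """Group flat per-frame features into one sequence per video."""
    data, target = [], []
    for v in range(len(features) // frames_per_video):
        start = v * frames_per_video
        data.append(features[start:start + frames_per_video])
        # The first frame's label stands for the whole video
        target.append(labels[start])
    return data, target


# Toy example: 2 videos of 3 frames each, 1 feature per frame
features = [[0.1], [0.2], [0.3], [0.7], [0.8], [0.9]]
labels = [[1, 0]] * 3 + [[0, 1]] * 3
data, target = group_frames(features, labels, frames_per_video=3)
print(len(data), target)
```

Trailing frames that do not fill a complete video are dropped, which matches the integer division in the function above.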

## Recurrent Neural Network

The basic building block of a Recurrent Neural Network (RNN) is a Recurrent Unit (RU). There are many variants of recurrent units, such as the LSTM (Long Short-Term Memory), which we use in this tutorial, and the somewhat simpler GRU (Gated Recurrent Unit). Experiments in the literature suggest that the LSTM and GRU have roughly similar performance. Even simpler variants exist, and the literature suggests that they may perform even better than both LSTM and GRU, but they are not implemented in Keras, which we use in this tutorial.

A recurrent neuron has an internal state that is being updated every time the unit receives a new input. This internal state serves as a kind of memory. However, it is not a traditional kind of computer memory which stores bits that are either on or off. Instead the recurrent unit stores floating-point values in its memory-state, which are read and written using matrix-operations so the operations are all differentiable. This means the memory-state can store arbitrary floating-point values (although typically limited between -1.0 and 1.0) and the network can be trained like a normal neural network using Gradient Descent.
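
The idea of a state that is updated at every input can be illustrated with a deliberately simplified sketch. This is a plain single-value recurrent unit with a `tanh` activation, not an LSTM, and the weights `w_x`, `w_h` and `b` are arbitrary illustrative values rather than trained parameters.

```python
import math


def recurrent_step(x, h_prev, w_x=0.5, w_h=0.8, b=0.0):
    """One update of a scalar recurrent unit: new state from input and old state."""
    # tanh keeps the state strictly between -1.0 and 1.0 and is differentiable
    return math.tanh(w_x * x + w_h * h_prev + b)


def run_sequence(inputs, h0=0.0):
    """Feed a sequence through the unit and return the state after each step."""
    h = h0
    states = []
    for x in inputs:
        h = recurrent_step(x, h)
        states.append(h)
    return states


# The second state differs although the second input is the same (0.0),
# because the unit remembers the first input.
print(run_sequence([1.0, 0.0]))
print(run_sequence([0.0, 0.0]))
```

An LSTM adds gates that control what is written to and read from this state, but the principle of carrying a floating-point memory from one time step to the next is the same.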



### Define LSTM architecture

When defining the LSTM architecture we have to take into account the dimensions of the transfer values. From each frame, the VGG16 network outputs a vector of 4096 transfer values. We process 20 frames from each video, so we have 20 x 4096 values per video. Other feature extractors produce vectors of a different length, which is why the training code reads the feature size from the data itself. The classification must take all 20 frames of the video into account: if any of them shows violence, the video is classified as violent.


The first input dimension of the LSTM is the temporal dimension, which in our case is 20. The second is the size of the feature vector (the transfer values).
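
As a worked example of these dimensions, the sketch below computes the input shape, output size and trainable parameter count of a bidirectional LSTM. It assumes the standard LSTM formulation (four weight blocks, one bias per block) and concatenation of the forward and backward outputs; the helper names are hypothetical.

```python
def lstm_param_count(input_dim, units):
    """Trainable parameters of one standard LSTM layer.

    Four blocks (input, forget and output gates plus the cell candidate),
    each with input weights, recurrent weights and a bias.
    """
    return 4 * (units * (input_dim + units) + units)


def bidirectional_lstm_summary(timesteps, input_dim, units):
    """Shapes and parameter count when two LSTMs read the sequence in both directions."""
    return {
        "input_shape": (timesteps, input_dim),
        "output_size": 2 * units,  # forward and backward outputs concatenated
        "params": 2 * lstm_param_count(input_dim, units),
    }


# 20 frames per video, 4096 transfer values per frame, a single LSTM cell
print(bidirectional_lstm_summary(20, 4096, 1))
```

Note that the number of time steps does not affect the parameter count, because the same weights are reused at every frame.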


In [8]:
# Time steps per video; must match frames_num (20) used in process_alldata
n_chunks = 20

from keras import backend as K

def f1_score(y_true, y_pred):
    y_pred = K.round(y_pred)  # Round to 0 or 1
    tp = K.sum(K.cast(y_true * y_pred, 'float'), axis=0)
    fp = K.sum(K.cast((1 - y_true) * y_pred, 'float'), axis=0)
    fn = K.sum(K.cast(y_true * (1 - y_pred), 'float'), axis=0)

    precision = tp / (tp + fp + K.epsilon())
    recall = tp / (tp + fn + K.epsilon())
    
    f1 = 2 * precision * recall / (precision + recall + K.epsilon())
    return K.mean(f1)

def getLSTM(cells,chunk_size):
    model = Sequential()
    model.add(Bidirectional(LSTM(cells, input_shape=(n_chunks, chunk_size))))
    model.add(Dense(1024))
    model.add(Activation('relu'))
    model.add(Dense(50))
    model.add(Activation('relu'))
    model.add(Dense(2))
    model.add(Activation('softmax'))
    model.compile(loss='binary_crossentropy', optimizer='adam',metrics=['accuracy',f1_score])
    return model
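
To see what the custom `f1_score` metric computes, here is a pure-Python sketch of the same arithmetic on a small hand-made batch. The function name `macro_f1` and the example values are hypothetical. Like the Keras version, it rounds the predictions, computes F1 per class (column) and averages over classes.

```python
def macro_f1(y_true, y_pred, eps=1e-7):
    """Per-class F1 on one batch of one-hot labels, averaged over classes."""
    n_classes = len(y_true[0])
    scores = []
    for c in range(n_classes):
        tp = fp = fn = 0
        for t_row, p_row in zip(y_true, y_pred):
            t = t_row[c]
            p = round(p_row[c])  # round the probability to 0 or 1
            tp += t * p
            fp += (1 - t) * p
            fn += t * (1 - p)
        precision = tp / (tp + fp + eps)
        recall = tp / (tp + fn + eps)
        scores.append(2 * precision * recall / (precision + recall + eps))
    return sum(scores) / n_classes


# Four videos, one-hot labels, softmax-like predictions (illustrative values)
y_true = [[1, 0], [1, 0], [0, 1], [0, 1]]
y_pred = [[0.9, 0.1], [0.4, 0.6], [0.2, 0.8], [0.3, 0.7]]
print(round(macro_f1(y_true, y_pred), 4))
```

Because Keras evaluates a function metric on each batch and averages the results, the value reported during training is an approximation of the F1 score over the whole dataset.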

## Model training


In [9]:
from tensorflow.keras.callbacks import EarlyStopping
from sklearn.model_selection import train_test_split

early_stopping = EarlyStopping(
    monitor='val_loss',
    patience=3,
    min_delta=0.005,          # an epoch counts as an improvement only if val_loss drops by more than 0.005
    restore_best_weights=True
)

def train_model(data, target, input_size, cells, batchs = 16, epochs=200):
    model = getLSTM(cells,input_size)
    X_train, X_val, y_train, y_val = train_test_split(data, target, test_size=0.2, random_state=42)
    history = model.fit(np.array(X_train), np.array(y_train), epochs=epochs,
                        validation_data=(np.array(X_val), np.array(y_val)), 
                        batch_size=batchs, verbose=2, callbacks=[early_stopping])
    return model, history
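
The interaction between `patience` and `min_delta` can be sketched in plain Python. This is a simplified model of the stopping rule, not the Keras implementation, and the loss values are made up for illustration.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.005):
    """Return (epoch at which training stops, best validation loss seen).

    Simplified rule: an epoch improves only if the loss beats the best
    value by more than min_delta; training stops after `patience`
    consecutive epochs without improvement.
    """
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best
    return len(val_losses), best


# Epochs 3 to 5 improve by less than 0.005, so training stops at epoch 5
# and the later drop to 0.30 is never reached.
print(early_stopping_epoch([0.70, 0.50, 0.498, 0.497, 0.496, 0.30]))
```

With `restore_best_weights=True`, the model returned after stopping carries the weights from the best epoch rather than the last one.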

## Test the model and print results

We test the model with 20% of the total videos. These videos have not been used to train the network.

In [10]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Directories
script_dir = os.getcwd()
training_dir = os.path.join(script_dir, "processedData/training/")
testing_dir = os.path.join(script_dir, "processedData/testing/")
graphs_dir = os.path.join(script_dir, "graphs/")
os.makedirs(graphs_dir, exist_ok=True)

# Range of LSTM cell counts
cells_start = 1
cells_end = 10
cells_list = list(range(cells_start, cells_end))  # 1 to 9 cells

# Iterate over each CNN model
for filename in os.listdir(training_dir):
    print(f"Processing CNN Model: {filename}")
    cnn_name = filename.split('.')[0]  # Base name for plots/CSV

    # Load data
    data, target = process_alldata(os.path.join(training_dir, filename))
    data_test, target_test = process_alldata(os.path.join(testing_dir, filename))

    model_metrics = []   # Per epoch (for plots)
    final_metrics = []   # Last epoch only (for CSV)

    for cells in cells_list:
        print(f"  → LSTM cells: {cells}")
        model, history = train_model(data, target, data[0].shape[1], cells)

        # Store per-epoch metrics for plotting
        epochs = range(1, len(history.history['loss']) + 1)
        for i, epoch in enumerate(epochs):
            model_metrics.append({
                'cnn_model': cnn_name,
                'lstm_cells': cells,
                'epoch': epoch,
                'train_loss': history.history['loss'][i],
                'train_f1': history.history['f1_score'][i],
                'val_loss': history.history['val_loss'][i],
                'val_f1': history.history['val_f1_score'][i]
            })

        # Store only the last epoch for the final CSV
        train_loss = round(history.history['loss'][-1], 4)
        train_f1 = round(history.history['f1_score'][-1], 4)
        val_loss = round(history.history['val_loss'][-1], 4)
        val_f1 = round(history.history['val_f1_score'][-1], 4)
        # evaluate() returns [loss, accuracy, f1_score], in the order given to compile()
        test_loss, _test_acc, test_f1 = model.evaluate(np.array(data_test), np.array(target_test), verbose=0)
        test_loss, test_f1 = round(test_loss, 4), round(test_f1, 4)

        final_metrics.append({
            'cnn_model': cnn_name,
            'lstm_cells': cells,
            'train_loss': train_loss,
            'train_f1': train_f1,
            'val_loss': val_loss,
            'val_f1': val_f1,
            'test_loss': test_loss,
            'test_f1': test_f1
        })

    # Save the final CSV with the last epoch only
    df_final = pd.DataFrame(final_metrics)
    df_final.to_csv(os.path.join(graphs_dir, f"{cnn_name}_metrics.csv"), index=False)

    # Convert to a DataFrame for plotting
    df_model = pd.DataFrame(model_metrics)

    # Plot per-epoch curves for each cell count - Validation Loss
    plt.figure(figsize=(10, 6))
    for cells in df_model['lstm_cells'].unique():
        subset = df_model[df_model['lstm_cells'] == cells]
        plt.plot(subset['epoch'], subset['val_loss'], label=f'{cells} cells')
    plt.title(f'Validation Loss vs Epochs - {cnn_name}')
    plt.xlabel('Epoch')
    plt.ylabel('Validation Loss')
    plt.legend()
    plt.grid(True)
    plt.savefig(os.path.join(graphs_dir, f"{cnn_name}_val_loss_per_epoch.png"), dpi=300)
    plt.close()

    # Plot per-epoch curves for each cell count - Validation F1 Score
    plt.figure(figsize=(10, 6))
    for cells in df_model['lstm_cells'].unique():
        subset = df_model[df_model['lstm_cells'] == cells]
        plt.plot(subset['epoch'], subset['val_f1'], label=f'{cells} cells')
    plt.title(f'Validation F1 Score vs Epochs - {cnn_name}')
    plt.xlabel('Epoch')
    plt.ylabel('Validation F1 Score')
    plt.legend()
    plt.grid(True)
    plt.savefig(os.path.join(graphs_dir, f"{cnn_name}_val_f1_per_epoch.png"), dpi=300)
    plt.close()


Processing CNN Model: efficientnetb0.h5
  → LSTM cells: 1
Epoch 1/200
20/20 - 7s - loss: 0.6921 - accuracy: 0.5844 - f1_score: 0.5513 - val_loss: 0.6667 - val_accuracy: 0.8375 - val_f1_score: 0.8219 - 7s/epoch - 342ms/step
Epoch 2/200
20/20 - 0s - loss: 0.5210 - accuracy: 0.8750 - f1_score: 0.8664 - val_loss: 0.3452 - val_accuracy: 0.9250 - val_f1_score: 0.9141 - 450ms/epoch - 23ms/step
Epoch 3/200
20/20 - 0s - loss: 0.2793 - accuracy: 0.9156 - f1_score: 0.9035 - val_loss: 0.2712 - val_accuracy: 0.9125 - val_f1_score: 0.9031 - 439ms/epoch - 22ms/step
Epoch 4/200
20/20 - 0s - loss: 0.1793 - accuracy: 0.9563 - f1_score: 0.9552 - val_loss: 0.2682 - val_accuracy: 0.9375 - val_f1_score: 0.9278 - 418ms/epoch - 21ms/step
Epoch 5/200
20/20 - 0s - loss: 0.1479 - accuracy: 0.9563 - f1_score: 0.9540 - val_loss: 0.2523 - val_accuracy: 0.9125 - val_f1_score: 0.9031 - 351ms/epoch - 18ms/step
Epoch 6/200
20/20 - 0s - loss: 0.1060 - accuracy: 0.9750 - f1_score: 0.9745 - val_loss: 0.2465 - val_accuracy