# Imports and Downloads

In [None]:
!pip install keybert

Collecting keybert
  Downloading keybert-0.8.5-py3-none-any.whl.metadata (15 kB)
Collecting sentence-transformers>=0.3.8 (from keybert)
  Downloading sentence_transformers-3.1.1-py3-none-any.whl.metadata (10 kB)
Downloading keybert-0.8.5-py3-none-any.whl (37 kB)
Downloading sentence_transformers-3.1.1-py3-none-any.whl (245 kB)
Installing collected packages: sentence-transformers, keybert
Successfully installed keybert-0.8.5 sentence-transformers-3.1.1


In [None]:
import pandas as pd
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from transformers import RobertaTokenizer, RobertaModel
import torch
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import nltk
from keybert import KeyBERT
import joblib
import os
import zipfile
from tqdm import tqdm
from google.colab import files

In [None]:
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

# Load CSV File

In [None]:
uploaded = files.upload()

Saving capsdata.csv to capsdata.csv


In [None]:
df = pd.read_csv('capsdata.csv', encoding='ISO-8859-1', usecols=['video_name','captions', 'labels'])
print(df.head())

  video_name                                           captions    labels
0       v071  The video depicts a business meeting in a mode...  violence
1       v075  This video captures a chaotic scene on a busy ...  violence
2       v056  The video depicts a physical altercation betwe...  violence
3       v001  The video captures a series of events in a bar...  violence
4       v006  The video takes place in a shopping mall inter...  violence


Class distribution (137 captions in total):
(1) AO (abandoned object) = 35
(2) Vandalism = 34
(3) Violence = 34
(4) Normal = 34

# Preprocessing (add more complex preprocessing later)

This function tokenizes the text, converts it to lowercase, and removes stopwords. The result is stored in a new column processed_caption. The function checks if the input is a valid string; otherwise, it returns an empty string.

In [None]:
STOP_WORDS = set(stopwords.words('english'))  # built once instead of on every call

def preprocess(text):
    """Lowercase, tokenize, and drop English stopwords; non-strings map to ''."""
    if not isinstance(text, str):
        return ''  # missing or non-text captions become an empty string
    tokens = word_tokenize(text.lower())
    return ' '.join(token for token in tokens if token not in STOP_WORDS)

df['processed_caption'] = df['captions'].apply(preprocess)
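
As one possible step toward the more complex preprocessing mentioned in the heading, the sketch below also drops punctuation, which `word_tokenize` keeps as separate tokens. It is a standalone illustration: the regex tokenizer and the tiny `DEMO_STOP_WORDS` set are assumptions standing in for `word_tokenize` and NLTK's English stopword list, and `preprocess_no_punct` is a hypothetical name.

```python
import re

# Hypothetical, deliberately tiny stopword set; the notebook uses NLTK's full list.
DEMO_STOP_WORDS = {"the", "a", "an", "in", "of", "on", "and", "is", "to"}

def preprocess_no_punct(text):
    """Lowercase, keep word tokens only, and drop stopwords."""
    if not isinstance(text, str):
        return ""
    # \w+ matches runs of letters, digits, and underscores, so commas and
    # periods never become tokens.
    tokens = re.findall(r"\w+", text.lower())
    return " ".join(t for t in tokens if t not in DEMO_STOP_WORDS)

example = preprocess_no_punct(
    "The video takes place in a shopping mall interior, near the exit."
)
print(example)
```

Whether removing punctuation helps depends on the downstream model; CountVectorizer already ignores punctuation by default, so the change would mainly affect the text passed to KeyBERT.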

# RoBERTa for Contextual Embeddings

RoBERTa is a robustly optimized variant of BERT (Bidirectional Encoder Representations from Transformers). It outputs a contextualized embedding for each token in the input text. The roberta-base model has 12 layers, 12 attention heads, a hidden size of 768, and roughly 125 million parameters.

In [None]:
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
model = RobertaModel.from_pretrained('roberta-base')

def get_roberta_embeddings(text):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state.mean(dim=1).squeeze().numpy()


Some weights of RobertaModel were not initialized from the model checkpoint at roberta-base and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


The input text is tokenized, padded, and truncated to a maximum length of 512 tokens. The tokenizer converts the text into a format suitable for the RoBERTa model. The model then outputs a hidden state for each token, and the mean of these hidden states across all tokens is taken as the final embedding for the text.
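
The function above embeds one caption at a time, so no padding tokens are present and a plain mean over tokens is exact. If captions were embedded in batches instead, padded positions would also enter `mean(dim=1)`. The following is a minimal sketch of mask-weighted mean pooling, using small NumPy arrays as stand-ins for the model's `last_hidden_state` and the tokenizer's `attention_mask`; `masked_mean_pool` is a hypothetical helper, not part of the notebook.

```python
import numpy as np

def masked_mean_pool(hidden_states, attention_mask):
    """Average token vectors per sequence, ignoring padded positions.

    hidden_states:  (batch, seq_len, hidden) array, like last_hidden_state
    attention_mask: (batch, seq_len) array of 1 (real token) / 0 (padding)
    """
    mask = attention_mask[..., None].astype(hidden_states.dtype)  # (batch, seq_len, 1)
    summed = (hidden_states * mask).sum(axis=1)    # padded positions contribute zero
    counts = np.clip(mask.sum(axis=1), 1.0, None)  # guard against division by zero
    return summed / counts

# Toy batch: 2 sequences, 3 token slots, hidden size 2.
hidden = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],  # last slot is padding
                   [[2.0, 2.0], [4.0, 4.0], [6.0, 6.0]]])
mask = np.array([[1, 1, 0],
                 [1, 1, 1]])

pooled = masked_mean_pool(hidden, mask)
print(pooled)
```

In the toy batch, the padded vector `[100, 100]` is excluded from the first row's average, which a plain mean over the sequence axis would not do.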

# KeyBERT for Key Phrase Extraction

KeyBERT uses BERT-style embeddings to extract key phrases from text, capturing the most relevant information in each caption.

The extract_keywords function extracts key phrases from the processed captions, considering 1- and 2-word n-grams and filtering out English stopwords. The top 5 key phrases are joined into a single string.

RoBERTa embeddings are then generated for these joined key phrases and stored in the roberta_embeddings column.

In [None]:
kw_model = KeyBERT(model='roberta-base')

def extract_keywords(text):
    keywords = kw_model.extract_keywords(text, keyphrase_ngram_range=(1, 2), stop_words='english', top_n=5)
    return ' '.join([kw[0] for kw in keywords])

df['key_phrases'] = df['processed_caption'].apply(extract_keywords)
df['roberta_embeddings'] = df['key_phrases'].apply(get_roberta_embeddings)

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-base and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
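
KeyBERT ranks candidate phrases by the cosine similarity between each candidate's embedding and the embedding of the whole document. The sketch below illustrates only that ranking step, with made-up 3-dimensional vectors in place of real model embeddings; the candidate phrases, the vectors, and the `rank_candidates` helper are all hypothetical.

```python
import numpy as np

def rank_candidates(doc_vec, cand_vecs, candidates, top_n=2):
    """Return the top_n candidates by cosine similarity to the document vector."""
    doc_unit = doc_vec / np.linalg.norm(doc_vec)
    cand_unit = cand_vecs / np.linalg.norm(cand_vecs, axis=1, keepdims=True)
    sims = cand_unit @ doc_unit          # cosine similarity per candidate
    order = np.argsort(-sims)[:top_n]    # indices sorted by descending similarity
    return [(candidates[i], float(sims[i])) for i in order]

candidates = ["physical altercation", "parked car", "shopping mall"]
doc_vec = np.array([1.0, 0.0, 0.0])
cand_vecs = np.array([[0.9, 0.1, 0.0],
                      [0.0, 1.0, 0.0],
                      [0.5, 0.5, 0.0]])

top = rank_candidates(doc_vec, cand_vecs, candidates)
print(top)
```

The function returns (phrase, score) pairs; `extract_keywords` above keeps only the phrase (`kw[0]`) from KeyBERT's similarly shaped output.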


In [None]:
df.head()

Unnamed: 0,video_name,captions,labels,processed_caption,key_phrases,roberta_embeddings
0,v071,The video depicts a business meeting in a mode...,violence,video depicts business meeting modern office s...,situation escalates participants seated discus...,"[-0.0129237175, 0.07156829, 0.0018011759, 0.09..."
1,v075,This video captures a chaotic scene on a busy ...,violence,video captures chaotic scene busy street group...,disrupts normal causing motorbike motorbike co...,"[-0.07501026, 0.2028176, -0.060801685, 0.21468..."
2,v056,The video depicts a physical altercation betwe...,violence,video depicts physical altercation two individ...,struggle intensifies confrontation making anom...,"[0.07223556, -0.04319455, -0.08394647, 0.02764..."
3,v001,The video captures a series of events in a bar...,violence,video captures series events bar pool hall set...,intervention confrontational disruptions safet...,"[0.024885466, -0.104202785, 0.0061696554, 0.22..."
4,v006,The video takes place in a shopping mall inter...,violence,"video takes place shopping mall interior , spe...",environmental hazards maintaining safety poten...,"[0.006564617, -0.06291287, -0.09457886, -0.075..."


In [None]:
df.tail()

Unnamed: 0,video_name,captions,labels,processed_caption,key_phrases,roberta_embeddings
132,Normal_Videos_031.mp4,The video depicts a young man engaged in movin...,normal,video depicts young man engaged moving activit...,diligently moving vandalism scene abandoned ob...,"[-0.02203827, 0.10722544, -0.05024123, 0.00638..."
133,Normal_Videos_033.mp4,The video depicts a nighttime scene on a Europ...,normal,video depicts nighttime scene european street ...,presence uniformed vandalism presence parked v...,"[-0.094127476, 0.01554891, -0.08836173, -0.198..."
134,Normal_Videos_034.mp4,The video depicts a typical urban street scene...,normal,video depicts typical urban street scene dayti...,designated crosswalks parked curbs crosswalks ...,"[-0.0467777, 0.13316399, 0.050161745, -0.08464..."
135,Normal_Videos_828_x264.mp4,The video depicts a sequence of events occurri...,normal,video depicts sequence events occurring parkin...,abandoned objects vandalism garage parking gar...,"[-0.0103572365, 0.005626667, -0.02791793, -0.1..."
136,Normal_Videos_905_x264.mp4,The video captures a typical day at a busy int...,normal,video captures typical day busy intersection s...,surveillance camera vandalism footage restaura...,"[0.03237952, 0.004308843, -0.1355987, -0.04985..."


# LDA for Topic Modelling

Latent Dirichlet Allocation (LDA) is a generative statistical model that discovers topics within a collection of documents. Here it is fitted with four topics, intended to correspond to the four classes: vandalism, violence, abandoned objects, and normal. Because LDA is unsupervised, the learned topics are not guaranteed to align with these labels.

CountVectorizer converts the processed captions into a document-term matrix, dropping terms that appear in more than 95% of the captions (max_df=0.95) or in fewer than two captions (min_df=2).

In [None]:
vectorizer = CountVectorizer(max_df=0.95, min_df=2, stop_words='english')
doc_term_matrix = vectorizer.fit_transform(df['processed_caption'])

n_topics = 4  # 4 topics: vandalism, violence, abandoned object, normal
lda = LatentDirichletAllocation(n_components=n_topics, random_state=42)
lda_output = lda.fit_transform(doc_term_matrix)

In [None]:
lda_output

array([[0.00512395, 0.39379625, 0.59599745, 0.00508235],
       [0.00384428, 0.05386502, 0.49058913, 0.45170156],
       [0.00323247, 0.00318048, 0.03111845, 0.96246861],
       [0.00200957, 0.0020054 , 0.99404123, 0.0019438 ],
       [0.00131847, 0.00133181, 0.99603406, 0.00131566],
       [0.00240943, 0.00243744, 0.99278944, 0.00236369],
       [0.00133764, 0.00130533, 0.9308114 , 0.06654563],
       [0.00129533, 0.0013134 , 0.99613098, 0.00126029],
       [0.00187977, 0.26005906, 0.73622198, 0.00183918],
       [0.00189056, 0.00187162, 0.99436443, 0.00187339],
       [0.0037518 , 0.00365658, 0.98901377, 0.00357784],
       [0.17371396, 0.00171913, 0.82289369, 0.00167323],
       [0.00235295, 0.00222434, 0.99312977, 0.00229294],
       [0.00337284, 0.00331008, 0.48464538, 0.5086717 ],
       [0.00170392, 0.00169091, 0.99486903, 0.00173614],
       [0.00283376, 0.00285348, 0.99139752, 0.00291524],
       [0.00213612, 0.00216122, 0.99357328, 0.00212937],
       [0.1450444 , 0.00209606,
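
Each row of `lda_output` is a probability distribution over the four topics for one caption. As a small sketch of how to read it, the snippet below copies the first three rows printed above, checks that each sums to 1, and takes the most probable topic per caption. The variable names are illustrative only.

```python
import numpy as np

# First three rows of lda_output as printed above
# (one row per caption, one column per topic).
doc_topics = np.array([
    [0.00512395, 0.39379625, 0.59599745, 0.00508235],
    [0.00384428, 0.05386502, 0.49058913, 0.45170156],
    [0.00323247, 0.00318048, 0.03111845, 0.96246861],
])

row_sums = doc_topics.sum(axis=1)           # each row should sum to ~1
dominant_topic = doc_topics.argmax(axis=1)  # index of the most probable topic

print(row_sums.round(6))
print(dominant_topic)
```

Cross-tabulating `dominant_topic` against `df['labels']` would be one way to check how well the unsupervised topics line up with the four classes.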

# Saving Trained Models

In [None]:
torch.save(model.state_dict(), 'roberta_model.pth')
joblib.dump(lda, 'lda_model.pkl')

['lda_model.pkl']

# Combining Features

The RoBERTa embeddings and LDA topic distributions are horizontally stacked to create a single feature vector for each caption, capturing both the semantic content and the topic information.

In [None]:
combined_features = np.hstack([np.vstack(df['roberta_embeddings'].values), lda_output])
combined_features

array([[-0.01292372,  0.07156829,  0.00180118, ...,  0.39379625,
         0.59599745,  0.00508235],
       [-0.07501026,  0.2028176 , -0.06080168, ...,  0.05386502,
         0.49058913,  0.45170156],
       [ 0.07223556, -0.04319455, -0.08394647, ...,  0.00318048,
         0.03111845,  0.96246861],
       ...,
       [-0.0467777 ,  0.13316399,  0.05016175, ...,  0.0056332 ,
         0.00550217,  0.9832556 ],
       [-0.01035724,  0.00562667, -0.02791793, ...,  0.00432278,
         0.00411414,  0.08374999],
       [ 0.03237952,  0.00430884, -0.1355987 , ...,  0.00257575,
         0.0024559 ,  0.65774595]])
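
The shape arithmetic behind the stacking can be checked in isolation. This sketch uses placeholder arrays rather than real features: 768 is the roberta-base hidden size, 4 is the number of LDA topics, and the resulting width of 772 matches the shape the notebook reports when it reloads the saved features.

```python
import numpy as np

n_captions, hidden_size, n_topics = 137, 768, 4

# Stand-ins: a list of 1-D embeddings (like the roberta_embeddings column)
# and a 2-D document-topic matrix (like lda_output).
embeddings = [np.zeros(hidden_size, dtype=np.float32) for _ in range(n_captions)]
topic_probs = np.full((n_captions, n_topics), 1.0 / n_topics)

stacked = np.vstack(embeddings)               # (137, 768): one row per caption
combined = np.hstack([stacked, topic_probs])  # (137, 772): topics are the last 4 columns

print(stacked.shape, combined.shape)
```

Because the topic probabilities occupy the last four columns, they can be recovered later with `combined[:, -n_topics:]`.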

# Downloading the Embeddings

In [None]:
print(df.columns)

Index(['video_name', 'captions', 'labels', 'processed_caption', 'key_phrases',
       'roberta_embeddings'],
      dtype='object')


Individually for local context:

In [None]:
os.makedirs('row_embeddings', exist_ok=True)

# Pair each video name with its feature row by position, so the loop does not
# rely on df having a default 0..n-1 index.
for video_name, row_features in tqdm(zip(df['video_name'], combined_features),
                                     total=len(df), desc="Saving row embeddings"):
    filename = f'row_embeddings/{video_name}_embedding.npy'
    np.save(filename, row_features)

print("All row embeddings have been saved.")

Saving row embeddings: 100%|██████████| 137/137 [00:00<00:00, 4124.52it/s]

All row embeddings have been saved.





In [None]:
zip_filename = 'rowtext_embeddings.zip'
with zipfile.ZipFile(zip_filename, 'w', zipfile.ZIP_DEFLATED) as zipf:
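    # Names other than `files` keep the google.colab `files` module unshadowed.
    for root, _dirs, filenames in os.walk('row_embeddings'):
        for name in filenames:
            file_path = os.path.join(root, name)
            # Archive paths are relative to the parent of row_embeddings,
            # so entries keep the 'row_embeddings/' prefix.
            zipf.write(file_path,
                       os.path.relpath(file_path, os.path.join('row_embeddings', '..')))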

print(f"Created zip file: {zip_filename}")

Created zip file: rowtext_embeddings.zip


All together for global context:

In [None]:
np.save('global_embeddings.npy', combined_features)

In [None]:
loaded_features = np.load('global_embeddings.npy')
print(loaded_features.shape)
loaded_features

(137, 772)


array([[-0.01292372,  0.07156829,  0.00180118, ...,  0.39379625,
         0.59599745,  0.00508235],
       [-0.07501026,  0.2028176 , -0.06080168, ...,  0.05386502,
         0.49058913,  0.45170156],
       [ 0.07223556, -0.04319455, -0.08394647, ...,  0.00318048,
         0.03111845,  0.96246861],
       ...,
       [-0.0467777 ,  0.13316399,  0.05016175, ...,  0.0056332 ,
         0.00550217,  0.9832556 ],
       [-0.01035724,  0.00562667, -0.02791793, ...,  0.00432278,
         0.00411414,  0.08374999],
       [ 0.03237952,  0.00430884, -0.1355987 , ...,  0.00257575,
         0.0024559 ,  0.65774595]])

In [None]:
combined_features_list = combined_features.tolist()
df['combined_features'] = combined_features_list
df.to_csv('updated_capsdata.csv', index=False)
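
Because CSV cells are plain text, a list-valued column such as `combined_features` is written as its string representation and comes back as a string when the file is read again. The sketch below reproduces that round trip with the standard-library `csv` module and a shortened, made-up feature vector, then parses the cell with `ast.literal_eval`. It illustrates the idea rather than the notebook's exact I/O path.

```python
import ast
import csv
import io

# Write one row the way a list-valued cell ends up in a CSV: as its string form.
features = [-0.0129, 0.0716, 0.0018]  # made-up, shortened vector
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["video_name", "combined_features"])
writer.writerow(["v071", str(features)])

# Read it back: the cell is now a string, not a list.
buffer.seek(0)
row = next(csv.DictReader(buffer))
raw_cell = row["combined_features"]
parsed = ast.literal_eval(raw_cell)  # safely evaluates the list literal

print(type(raw_cell).__name__, type(parsed).__name__, parsed)
```

With pandas, the same parser could be passed through the `converters` argument of `read_csv`. For purely numeric use, the `global_embeddings.npy` file saved above avoids the parsing step entirely.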


# Why LDA?

RoBERTa embeddings capture the contextual meaning of the tokens in a caption, while LDA captures its broader thematic content. LDA assigns each document (caption) a probability distribution over topics. This distribution can highlight the context in which an anomaly occurs, giving more nuanced information than the mere presence of specific keywords.

# Why RoBERTa and KeyBERT?

RoBERTa generates embeddings that capture the meaning of words based on their context within a sentence. This is crucial for accurately representing the complex semantics of video captions, especially in the context of anomaly detection, where the meaning of a phrase can vary greatly depending on the surrounding words.

KeyBERT is designed to identify the most important phrases in a text, which can be particularly valuable when dealing with video captions. These key phrases often summarize the main events or objects in a scene, making them crucial for identifying anomalies like vandalism, violence, or abandoned objects.