<a href="https://colab.research.google.com/github/dgromann/SwearWords_SubtitleAnalysis/blob/main/SwearWord_Detector_MovieSubtitles.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Swear word detector in movie subtitles

This notebook automatically aligns SRT subtitles to sentences and detects swear words in English, German, and Polish. It also represents the results as Linked Data.

The following cell installs the packages that the rest of the notebook requires.

In [None]:
!pip install pysrt
!pip install simalign
!pip install multi-rake
!pip3 install --upgrade nltk
!pip install transformers datasets

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pysrt
  Downloading pysrt-1.1.2.tar.gz (104 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 104.4/104.4 KB 6.0 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pysrt
  Building wheel for pysrt (setup.py) ... done
  Created wheel for pysrt: filename=pysrt-1.1.2-py3-none-any.whl size=13443 sha256=43535e77565290b7d55f51a8098450bde9130df368d13b70df7b5ef54e2b97c8
  Stored in directory: /root/.cache/pip/wheels/c3/34/f1/ae1d86b7f454100c10f7ab8dc411303b7834e7f40e343ca2c0
Successfully built pysrt
Installing collected packages: pysrt
Successfully installed pysrt-1.1.2
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting simalign
  Downloading simalign-0.3-py3-none-any.whl (8.1 kB)
Collecting transformers
  Downloa

In [None]:
import nltk
from simalign import SentenceAligner

nltk.download('averaged_perceptron_tagger')
nltk.download('punkt')
nltk.download('stopwords')

#SimAlign needed? 
simalign_bert = SentenceAligner(model="microsoft/xlm-align-base", token_type="bpe", matching_methods="mai")

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


Downloading (…)lve/main/config.json:   0%|          | 0.00/512 [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/942M [00:00<?, ?B/s]

Some weights of the model checkpoint at microsoft/xlm-align-base were not used when initializing XLMRobertaModel: ['lm_head.dense.bias', 'lm_head.layer_norm.bias', 'lm_head.layer_norm.weight', 'lm_head.bias', 'lm_head.dense.weight', 'lm_head.decoder.weight']
- This IS expected if you are initializing XLMRobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing XLMRobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Downloading (…)tencepiece.bpe.model:   0%|          | 0.00/5.07M [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/9.10M [00:00<?, ?B/s]

2023-03-21 23:23:41,454 - simalign.simalign - INFO - Initialized the EmbeddingLoader with model: microsoft/xlm-align-base
INFO:simalign.simalign:Initialized the EmbeddingLoader with model: microsoft/xlm-align-base


# Setup data access
Connect this notebook to your Google Drive or load the data from another location. 

In [None]:
#Mount Google Drive for access to files there
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


Change into the folder in which your data can be found. 

In [None]:
%cd drive/My Drive/UniWien/Projects/NexusLinguarum/Annas_STSM/Research/BridgetJones_Diary

/content/drive/My Drive/UniWien/Projects/NexusLinguarum/Annas_STSM/Research/BridgetJones_Diary


# Preprocessing subtitles 

To give swear words proper context for manual inspection, we first align the subtitles to sentences. The input is a timestamp-aligned SubRip Subtitle (SRT) file; the output is a list of sentences.

In [None]:
import re

def time_stamp_fragments_to_sentences(subs):
  '''
  Merge time-stamp-aligned subtitle fragments into a list of sentences.
  Markup tags are stripped; a fragment that follows a trailing "..." or ","
  is appended to the previous entry.
  '''
  clean = re.compile('<.*?>')
  final_text = []
  for item in subs:
      text = re.sub(clean, "", item.text).replace("\n", " ").strip()
      if len(final_text) > 0 and final_text[-1].endswith("..."):
        #Previous fragment trails off: drop the ellipsis and continue the sentence
        final_text[-1] = final_text[-1].replace("...", "")
        final_text[-1] += " "+text
      elif len(final_text) > 0 and final_text[-1].endswith(","):
        #Previous fragment ends mid-sentence: continue it
        final_text[-1] += " "+text
      elif ". " in text or "?" in text or "!" in text:
        #Split at whitespace that follows a sentence-final "." or "?"
        for sentence in re.split(r'(?<!\w\.\w.)(?<!\d\.)(?<![A-Z][a-z]\.)(?<=\b\.|\?)\s', text):
          if len(sentence) > 0:
            final_text.append(sentence)
      else:
        if len(text) > 0: 
          final_text.append(text)
      
  return final_text
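
To see what the sentence-splitting pattern in `time_stamp_fragments_to_sentences` does on its own, the following minimal sketch applies the same regular expression to a made-up line. The sample text and the name `SENTENCE_BOUNDARY` are illustrative and not part of the notebook. The pattern splits at whitespace that follows a sentence-final `.` or `?`, but not after two-letter abbreviations such as `Mr.`.

```python
import re

# Same pattern as in the function above; the constant name is hypothetical.
SENTENCE_BOUNDARY = re.compile(
    r'(?<!\w\.\w.)(?<!\d\.)(?<![A-Z][a-z]\.)(?<=\b\.|\?)\s'
)

# Made-up example text, not taken from the subtitle files
sample = "I met Mr. Darcy today. Did he smile? No."

sentences = SENTENCE_BOUNDARY.split(sample)
for sentence in sentences:
    print(sentence)
```

Note that the pattern does not split after `!`, so an exclamation stays attached to the text that follows it within the same fragment.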

Load the subtitle files, convert the time-stamped fragments into sentences, and print sentence and word counts.

In [None]:
import pysrt
import pandas as pd
from multi_rake import Rake

#Loading the English subtitles from its SRT file 
subs_en = pysrt.open('EnglishSubtitles/OfficialSubtitles_en_BridgetJones_Diary.srt')

#Loading the German official and fan-based subtitles from their SRT files
subs_de_official = pysrt.open('GermanSubtitles/OfficialSubtitles_de_BridgetJones_Diary.srt')
#subs_de_fan = pysrt.open('GermanSubtitles/Fansubbed2_kidwao_German_Bridget_jones.srt')

#Loading the official Polish subtitles from their SRT file
#With Polish the encoding is often an issue and might need to be changed (cp1250 worked best here);
#if decoding fails, pass it explicitly, e.g. pysrt.open(path, encoding='cp1250')
subs_pl = pysrt.open('PolishSubtitles/Polish_subtitles_official_final.srt')
print(subs_pl)

#Changing the time stamp aligned sentence fragments to sentences
en_text = time_stamp_fragments_to_sentences(subs_en)
de_text_official = time_stamp_fragments_to_sentences(subs_de_official)
#de_text_fan = time_stamp_fragments_to_sentences(subs_de_fan)
pl_text_official = time_stamp_fragments_to_sentences(subs_pl)

print("Cleaned Subs En: ", en_text, "\nNumber of sentences: ", len(en_text), "\nNumber of words: ", sum([len(line.split()) for line in en_text]))
print("Cleaned Subs De professional translation: ", de_text_official, "\nNumber of sentences: ", len(de_text_official), "\nNumber of words: ",sum([len(line.split()) for line in de_text_official]))
print("Cleaned Subs Pl professional translation: ", pl_text_official, "\nNumber of sentences: ", len(pl_text_official), "\nNumber of words: ",sum([len(line.split()) for line in pl_text_official]))
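
Because the right encoding for a subtitle file is often found by trial and error, a small helper can try a few candidates before the file is opened. The sketch below uses only the standard library; the function name `first_working_encoding`, the candidate list, and the demo cue are assumptions made for illustration and are not part of the notebook.

```python
import os
import tempfile

def first_working_encoding(path, candidates=("utf-8", "cp1250", "iso-8859-2")):
    """Return the first candidate encoding that decodes the whole file, or None."""
    with open(path, "rb") as handle:
        raw = handle.read()
    for encoding in candidates:
        try:
            raw.decode(encoding)
            return encoding
        except UnicodeDecodeError:
            continue
    return None

# Demo with a made-up one-cue SRT file written in cp1250
demo_cue = "1\n00:00:01,000 --> 00:00:02,000\nCześć!\n"
with tempfile.NamedTemporaryFile("wb", suffix=".srt", delete=False) as tmp:
    tmp.write(demo_cue.encode("cp1250"))

detected = first_working_encoding(tmp.name)
os.remove(tmp.name)
print(detected)
```

The detected name can then be passed to `pysrt.open` through its `encoding` argument. A successful decode does not prove that the encoding is correct, since single-byte code pages accept almost any byte sequence, so the resulting text should still be checked by eye.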


Methods to align subtitles across languages.

In [None]:
import re
from pysrt import SubRipFile
from pysrt import SubRipItem
from pysrt import SubRipTime


def join_lines(txtsub1, txtsub2):
    # Join two subtitle texts with a line break if both are non-empty
    if len(txtsub1) > 0 and len(txtsub2) > 0:
        return txtsub1 + '\n' + txtsub2
    else:
        return txtsub1 + txtsub2

def find_subtitle(subtitle, from_t, to_t, lo=0):
    # Starting at index lo, return the text and index of the subtitle covering [from_t, to_t]
    i = lo
    while i < len(subtitle):
        if subtitle[i].start >= to_t:
            break
        if subtitle[i].start <= from_t and to_t <= subtitle[i].end:
            return subtitle[i].text, i
        i += 1
    return "", i

def merge_subtitle(sub_a, sub_b, delta):
    clean = re.compile('<.*?>')
    out = SubRipFile()
    intervals = [item.start.ordinal for item in sub_a]
    intervals.extend([item.end.ordinal for item in sub_a])
    intervals.extend([item.start.ordinal for item in sub_b])
    intervals.extend([item.end.ordinal for item in sub_b])
    intervals.sort()

    j = k = 0
    for i in range(1, len(intervals)):
        start = SubRipTime.from_ordinal(intervals[i-1])
        end = SubRipTime.from_ordinal(intervals[i])

        if (end-start) > delta:
            text_a, j = find_subtitle(sub_a, start, end, j)
            text_b, k = find_subtitle(sub_b, start, end, k)

            text_a = re.sub(clean, "", text_a)
            text_b = re.sub(clean, "", text_b)

            text = join_lines("en: "+text_a, "de: "+text_b)
            if len(text_a) > 0 and len(text_b) > 0:
                item = SubRipItem(0, start, end, text)
                out.append(item)

    out.clean_indexes()
    return out

Instead of first merging the subtitle fragments into sentences, the fragments can be aligned directly across languages based on their time stamps, which is what `merge_subtitle` above does.
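
The idea behind time-stamp alignment can be sketched without `pysrt`: collect all start and end times from both tracks, then pair the texts that cover each resulting interval. This is a simplified illustration, not the notebook's implementation. The tuple format `(start_ms, end_ms, text)`, the name `align_by_time`, the 500 ms threshold, and the sample cues (including the German lines) are assumptions.

```python
def align_by_time(track_a, track_b, min_overlap_ms=500):
    """Pair texts from two tracks over every interval that both tracks cover.

    Each track is a list of (start_ms, end_ms, text) tuples.
    """
    # All cue boundaries from both tracks, in chronological order
    boundaries = sorted({t for s, e, _ in track_a + track_b for t in (s, e)})
    pairs = []
    for start, end in zip(boundaries, boundaries[1:]):
        # Skip slivers caused by small timing differences between the tracks
        if end - start <= min_overlap_ms:
            continue
        text_a = next((txt for s, e, txt in track_a if s <= start and end <= e), "")
        text_b = next((txt for s, e, txt in track_b if s <= start and end <= e), "")
        if text_a and text_b:
            pairs.append((start, end, text_a, text_b))
    return pairs

# Made-up cues for illustration
en = [(1000, 4000, "Happy New Year."), (5000, 8000, "Torture.")]
de = [(1200, 4100, "Frohes neues Jahr."), (5100, 7900, "Folter.")]

pairs = align_by_time(en, de)
for start, end, text_en, text_de in pairs:
    print(start, end, "en:", text_en, "| de:", text_de)
```

The threshold plays the role of `delta` in `merge_subtitle`: intervals shorter than it are treated as timing noise and dropped.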



In [None]:
from multi_rake import Rake

r = Rake()

def rake_extract_profane_phrases(sentence_en, sentence_language_b): 
  rake_keywords_a = [a[0] for a in r.apply(sentence_en)]
  #rake_keywords_b = [b[0] for b in r.apply(sentence_language_b)]

  print(rake_keywords_a)

for line in en_text:
  rake_extract_profane_phrases(line, None)




# Detecting offensive language

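Offensive sentences are detected with a [DistilRoBERTa-based](https://huggingface.co/valurank/distilroberta-offensive) classifier, applied here to the English sentences. The RAKE keyphrases of each offensive sentence are then classified individually to isolate the swear words.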
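
The detection cell in this section works in two stages: a sentence is classified first, and only if it is flagged are its keyphrases classified one by one. The control flow can be sketched independently of the model. In the sketch below the classifier and the phrase extractor are passed in as functions; `find_offensive_phrases`, the toy lexicon, and both stand-in functions are hypothetical and only replace the transformer model and RAKE for demonstration.

```python
def find_offensive_phrases(sentences, is_offensive, extract_phrases):
    """Map each offensive phrase to the sentence in which it was found."""
    hits = {}
    for sentence in sentences:
        # Stage 1: skip sentences the classifier does not flag
        if not is_offensive(sentence):
            continue
        # Stage 2: classify each candidate phrase of a flagged sentence
        for phrase in extract_phrases(sentence):
            if is_offensive(phrase):
                hits[phrase] = sentence
    return hits

# Toy stand-ins for the model and the keyphrase extractor (assumptions)
TOY_LEXICON = {"bloody", "damn"}

def toy_is_offensive(text):
    return any(tok.strip(".,!?").lower() in TOY_LEXICON for tok in text.split())

def toy_extract_phrases(sentence):
    return [tok.strip(".,!?").lower() for tok in sentence.split()]

hits = find_offensive_phrases(
    ["What a bloody mess.", "Lovely weather today."],
    toy_is_offensive,
    toy_extract_phrases,
)
print(hits)
```

Running the sentence-level check first keeps the number of classifier calls low, since keyphrases are only extracted and scored for sentences that were already flagged.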

In [None]:
import torch
from multi_rake import Rake
from transformers import AutoTokenizer, AutoModelForSequenceClassification

r = Rake()

tokenizer = AutoTokenizer.from_pretrained("valurank/distilroberta-offensive")
model = AutoModelForSequenceClassification.from_pretrained("valurank/distilroberta-offensive")

#Maps each offensive keyphrase to the sentence it was found in
swearwords_en = dict()
with torch.no_grad():
  for sentence_en in en_text: 
      #First classify the whole sentence
      inputs = tokenizer(sentence_en, return_tensors="pt")
      logits = model(**inputs).logits
      prediction = model.config.id2label[logits.argmax().item()]
      if prediction == "OFFENSIVE":
        #Then classify each RAKE keyphrase of the offensive sentence on its own
        for word in [a[0] for a in r.apply(sentence_en)]:
          input_word = tokenizer(word, return_tensors="pt")
          logits_en = model(**input_word).logits
          if model.config.id2label[logits_en.argmax().item()] == "OFFENSIVE":
            swearwords_en[word] = sentence_en
            print(word, sentence_en)

      #rake_keywords_a = [a[0] for a in r.apply(sentence_en)]
      #for keyword in rake_keywords_a:
      #  inputs = tokenizer(keyword, return_tensors="pt")
      #  logits = model(**inputs).logits

#predicted_class_id = logits.argmax().item()
        #print(keyword, model.config.id2label[logits.argmax().item()])


# To train a model on `num_labels` classes, you can pass `num_labels=num_labels` to `.from_pretrained(...)`
#num_labels = len(model.config.id2label)
#model = RobertaForSequenceClassification.from_pretrained("cardiffnlp/twitter-roberta-base-emotion", num_labels=num_labels)

#labels = torch.tensor([1])
#loss = model(**inputs, labels=labels).loss
#round(loss.item(), 2)

middle-aged bore Every year, she tries to fix me up with some bushy-haired, middle-aged bore and I feared this year would be no exception.
feared Every year, she tries to fix me up with some bushy-haired, middle-aged bore and I feared this year would be no exception.
torture Torture.
pretty nasty beast Pretty nasty beast, apparently.
hoo Hoo.
verbally incontinent spinster Particularly not with some verbally incontinent spinster who smokes like a chimney, drinks like a fish and dresses like her mother.
smokes Particularly not with some verbally incontinent spinster who smokes like a chimney, drinks like a fish and dresses like her mother.
finally die fat I suddenly realized that unless some thing changed soon I was going to live a life where my major relationship was with a bottle of wine and I'd finally die fat and alone and be found three weeks later, half-eaten by Alsatians.
shit-faced I had to make sure that next year I wouldn't end up shit-faced and listening to sad FM easy-listeni