# Replace toxic words with pretrained BERT model
Here I will implement basic replace algorithm using pretrained BERT model. 
The main idea is to find "toxic" words in the text and replace them with the appropriate ones using pretrained BERT model.

In [3]:
with open('../data/external/en.txt') as file:
    badwords = [line.rstrip() for line in file]

In [4]:
badwords[0:10]

['2g1c',
 '2 girls 1 cup',
 'acrotomophilia',
 'alabama hot pocket',
 'alaskan pipeline',
 'anal',
 'anilingus',
 'anus',
 'apeshit',
 'arsehole']

Enhanced vocabulary is taken from https://github.com/Orthrus-Lexicon/Toxic/.

In [1]:
with open('../data/external/toxic_words.txt') as file:
    toxic_words = [line.rstrip() for line in file]

In [2]:
toxic_words[:10]

['***',
 '*itches',
 '4r5e',
 '5h1t',
 '5hit',
 'God',
 'God damn',
 'Goddamn',
 'a**',
 'a*****es']

In [5]:
import pandas as pd

test_data = pd.read_csv('../data/interim/test.csv')
test_data.head()

Unnamed: 0,reference,translation
0,Fucking A your mom likes lan.,my mom loves you.
1,We'll be fucking pariahs.,we're going to be completely unnerved.
2,"I'm done, Live Dead.","I'm through, Dead Meat."
3,What is this place? A fucking vampire secret h...,that's a secret vampire headquarters.
4,Just a silly dream and nothing more,# Just a silky dream and nothing more


## Replacing words

In [13]:
import torch
import transformers

# Load pretrained model

model_name = 'bert-base-cased'
tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)
model = transformers.AutoModelForMaskedLM.from_pretrained(model_name)

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [14]:
def replace_toxic_words(text, vocab):
    tokenized_text = tokenizer.tokenize(text)
    
    toxic_word_indices = []
    
    masked_text = [token if token.lower() not in vocab else '[MASK]' for token in tokenized_text]
    masked_text = " ".join(masked_text)
    
    input_ids = tokenizer.encode(masked_text, add_special_tokens=True)
    input_ids_tensor = torch.tensor(input_ids).unsqueeze(0)
    
    with torch.no_grad():
        predictions = model(input_ids_tensor)[0]
        predicted_tokens = []
        
        for i, token in enumerate(tokenized_text):
            if token.lower() in vocab:
                predicted_word = tokenizer.convert_ids_to_tokens(torch.argmax(predictions[0, i + 1]).item())
                predicted_tokens.append(predicted_word)
            else:
                predicted_tokens.append(token)

    replaced_text = tokenizer.convert_tokens_to_string(predicted_tokens)
    return replaced_text

Let's see the performance of the algorithm on the first 2000 elements of the test data.

In [15]:
from tqdm import tqdm

size = 2000
pred = []

for i in tqdm(range(size)):
    pred.append(replace_toxic_words(test_data.reference[i], badwords))

100%|██████████████████████████████████████████████████████████████████████████████| 2000/2000 [09:18<00:00,  3.58it/s]


New vocabulary:

In [16]:
pred_new = []

for i in tqdm(range(size)):
    pred_new.append(replace_toxic_words(test_data.reference[i], toxic_words))

100%|██████████████████████████████████████████████████████████████████████████████| 2000/2000 [09:00<00:00,  3.70it/s]


## Algorithm performance

### Toxicity

In [17]:
from tqdm import tqdm
from detoxify import Detoxify

In [18]:
tox_values = []
detox = Detoxify('unbiased')

for i in tqdm(range(len(pred))):
    tox_values.append(detox.predict(pred[i])['toxicity'])
    
print('Approximate toxicity of the algorithm:', sum(tox_values) / len(tox_values))

100%|██████████████████████████████████████████████████████████████████████████████| 2000/2000 [06:29<00:00,  5.13it/s]


Approximate toxicity of the algorithm: 0.5485283137535735


The toxicity reduced from 0.737 to 0.549. This metric could be improved with a larger list of toxic words.

New vocabulary:

In [19]:
tox_values = []
detox = Detoxify('unbiased')

for i in tqdm(range(len(pred_new))):
    tox_values.append(detox.predict(pred_new[i])['toxicity'])
    
print('Approximate toxicity of the algorithm:', sum(tox_values) / len(tox_values))

100%|██████████████████████████████████████████████████████████████████████████████| 2000/2000 [06:40<00:00,  5.00it/s]

Approximate toxicity of the algorithm: 0.31616000352780976





Wow, this is the improvement. Let's see other metrics and overall work of the algorithm.

### Similarity

In [20]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

In [21]:
count_vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
cosine_sims = []

for i in tqdm(range(len(pred))):
    texts = [test_data.reference[i], pred[i]]
    vector_matrix = count_vectorizer.fit_transform(texts)

    cosine_sims.append(cosine_similarity(vector_matrix)[0][1])
    
print("Average similarity:", sum(cosine_sims) / len(cosine_sims))

100%|████████████████████████████████████████████████████████████████████████████| 2000/2000 [00:01<00:00, 1010.22it/s]

Average similarity: 0.9676834196750391





Similarity value is high.

New vocabulary:

In [22]:
count_vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
cosine_sims = []

for i in tqdm(range(len(pred_new))):
    texts = [test_data.reference[i], pred_new[i]]
    vector_matrix = count_vectorizer.fit_transform(texts)

    cosine_sims.append(cosine_similarity(vector_matrix)[0][1])
    
print("Average similarity:", sum(cosine_sims) / len(cosine_sims))

100%|█████████████████████████████████████████████████████████████████████████████| 2000/2000 [00:02<00:00, 951.32it/s]

Average similarity: 0.9155217641994428





The similarity reduced as expected, but still good enough.

## Examples of work

In [23]:
for i in [8, 58, 650]: 
    print("Reference:", test_data.reference[i])
    print("Detoxifyed:", pred[i])
    print()

Reference: Doesn't anybody in this town speak in complete fucking sentences anymore?
Detoxifyed: Doesn't anybody in this town speak in complete coherent sentences anymore?

Reference: you even tried to wipe your butt off.
Detoxifyed: you even tried to wipe your face off.

Reference: My eyes are fucked up.
Detoxifyed: My eyes are fucked up.



Comparing to the word removal, the semantic load of the sentence and grammatical logic are preserved. But (as seen in example 3) the algorithm depends on the bad words vocabulary.

In [24]:
for i in [8, 58, 650, 46]: 
    print("Reference:", test_data.reference[i])
    print("Detoxifyed:", pred_new[i])
    print()

Reference: Doesn't anybody in this town speak in complete fucking sentences anymore?
Detoxifyed: Doesn't anybody in this town speak in complete coherent sentences anymore?

Reference: you even tried to wipe your butt off.
Detoxifyed: you even tried to wipe your face off.

Reference: My eyes are fucked up.
Detoxifyed: My eyes are tearing up.

Reference: I told you bastards, don't waste ammunition!
Detoxifyed: I told you before, don't waste ammunition!



Fairly accurate work.

## Conclusions

This is a simple method that does its job. However, it is unable to modify and recognize toxic constructs consisting of several words at once.