# Replace toxic words with pretrained BERT model
Here I will implement basic replace algorithm using pretrained BERT model. 
The main idea is to find "toxic" words in the text and replace them with the appropriate ones using pretrained BERT model.

In [1]:
with open('../data/external/en.txt') as file:
    badwords = [line.rstrip() for line in file]

In [2]:
badwords[0:10]

['2g1c',
 '2 girls 1 cup',
 'acrotomophilia',
 'alabama hot pocket',
 'alaskan pipeline',
 'anal',
 'anilingus',
 'anus',
 'apeshit',
 'arsehole']

In [3]:
import pandas as pd

test_data = pd.read_csv('../data/interim/test.csv')
test_data.head()

Unnamed: 0,reference,translation
0,Fucking A your mom likes lan.,my mom loves you.
1,We'll be fucking pariahs.,we're going to be completely unnerved.
2,"I'm done, Live Dead.","I'm through, Dead Meat."
3,What is this place? A fucking vampire secret h...,that's a secret vampire headquarters.
4,Just a silly dream and nothing more,# Just a silky dream and nothing more


## Replacing words

In [4]:
import torch
import transformers

# Load pretrained model

model_name = 'bert-base-uncased'
tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)
model = transformers.AutoModelForMaskedLM.from_pretrained(model_name)

(…)base-uncased/resolve/main/tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/440M [00:00<?, ?B/s]

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [81]:
def replace_toxic_words(text):
    tokenized_text = tokenizer.tokenize(text)
    
    toxic_word_indices = []
    
    masked_text = [token if token.lower() not in badwords else '[MASK]' for token in tokenized_text]
    masked_text = " ".join(masked_text)
    
    input_ids = tokenizer.encode(masked_text, add_special_tokens=True)
    input_ids_tensor = torch.tensor(input_ids).unsqueeze(0)
    
    with torch.no_grad():
        predictions = model(input_ids_tensor)[0]
        predicted_tokens = []
        
        for i, token in enumerate(tokenized_text):
            if token.lower() in badwords:
                predicted_word = tokenizer.convert_ids_to_tokens(torch.argmax(predictions[0, i + 1]).item())
                predicted_tokens.append(predicted_word)
            else:
                predicted_tokens.append(token)

    replaced_text = tokenizer.convert_tokens_to_string(predicted_tokens)
    return replaced_text

Let's see the performance of the algorithm on the first 2000 elements of the test data.

In [88]:
from tqdm import tqdm

size = 2000
pred = []

for i in tqdm(range(size)):
    pred.append(replace_toxic_words(test_data.reference[i]))

100%|██████████████████████████████████████████████████████████████████████████████| 2000/2000 [08:17<00:00,  4.02it/s]


## Algorithm performance

### Toxicity

In [90]:
from tqdm import tqdm
from detoxify import Detoxify

tox_values = []
detox = Detoxify('unbiased')

for i in tqdm(range(len(pred))):
    tox_values.append(detox.predict(pred[i])['toxicity'])
    
print('Approximate toxicity of the algorithm:', sum(tox_values) / len(tox_values))

100%|██████████████████████████████████████████████████████████████████████████████| 2000/2000 [04:16<00:00,  7.81it/s]

Approximate toxicity of the algorithm: 0.5501514020925388





The toxicity reduced from 0.737 to 0.55. This metric could be improved with a larger list of toxic words.

### Similarity

In [91]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

count_vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
cosine_sims = []

for i in tqdm(range(len(pred))):
    texts = [test_data.reference[i], pred[i]]
    vector_matrix = count_vectorizer.fit_transform(texts)

    cosine_sims.append(cosine_similarity(vector_matrix)[0][1])
    
print("Average similarity:", sum(cosine_sims) / len(cosine_sims))

100%|█████████████████████████████████████████████████████████████████████████████| 2000/2000 [00:02<00:00, 925.86it/s]

Average similarity: 0.965633296708456





Similarity value is high.

## Examples of work

In [89]:
for i in [8, 58, 650]: 
    print("Reference:", test_data.reference[i])
    print("Detoxifyed:", pred[i])
    print()

Reference: Doesn't anybody in this town speak in complete fucking sentences anymore?
Detoxifyed: doesn't anybody in this town speak in complete english sentences anymore?

Reference: you even tried to wipe your butt off.
Detoxifyed: you even tried to wipe your hands off.

Reference: My eyes are fucked up.
Detoxifyed: my eyes are fucked up.



Comparing to the word removal, the semantic load of the sentence and grammatical logic are preserved.

## Conclusions

This is a simple method that does its job. However, it is very limited to a list of words (as seen in example three) and is unable to modify and recognize toxic constructs consisting of several words at once.