Last time we have encountered the task of text detoxification for the English language. Let's recall how it looks like. 

<a href="https://ibb.co/GtqSwd3"><img src="https://i.ibb.co/hZSz5g1/image.png" alt="image" border="0"></a>

However, parallel data is not often available in different languages. Multilingual LLMs ([mBART](https://arxiv.org/abs/2001.08210), [mT5](https://arxiv.org/abs/2010.11934), [mT0](https://arxiv.org/abs/2211.01786) etc.) come and help. 

In this seminar we will review some of the methods for cross-lingual and multilingual textual style transfer. 

So, if we are facing any of the sequence-to-sequence tasks in a particular language, but there is no data in that language, there are several solutions to this problem:
 
    1. Translate data that you have to the language you need and use multilingual language model. 
    2. Use Backtranslation approach:
        - Take the data on the language you want your model to work.
        - Translate to the language on which the model is available
        - Do TST.
        - Translate the result back to the original language. 
    3. Use Adapters etc. 


Today we will cover the first approach and we will show some modifications of it. 

Backtranslation approach:

<a href="https://ibb.co/JHJ9R4k"><img src="https://i.ibb.co/QMw4FRm/image.png" alt="image" border="0"></a>

Training data translation approach:

<a href="https://ibb.co/3dmK0bW"><img src="https://i.ibb.co/T2MyHGR/image.png" alt="image" border="0"></a>

In [1]:
import os
os.environ['NVIDIA_VISIBLE_DEVICES']='3'
os.environ['CUDA_VISIBLE_DEVICES']='3'

from tqdm.auto import tqdm

import pandas as pd
import torch
from tqdm import tqdm
from transformers import (
    MBartForConditionalGeneration,
    MBart50TokenizerFast,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

Let's download our data.

In [2]:
# !gdown --id 12g9yfO8hykt1JH7gfKre55BRokln1bEj
# !gdown --id 1Tii3nAOvTgkwvBkIChgv-NZy_njZIqEa

In [3]:
en2ru_google = pd.read_csv('../data/detox_en2ru_google.tsv')
en2ru_yandex = pd.read_csv('../data/english_data/detox_en2ru_yandex.tsv', sep='\t')

In our case we have taken ParaDetox dataset and translated it with Google Translate and Yandex Translate (both translators show approximately same performance.)

Let's take a look at the data we have.

In [4]:
en2ru_google.head()

Unnamed: 0,toxic_comment,neutral_comment,toxic_ru,neutral_ru
0,he had steel balls too !,he was brave too!,у него тоже были стальные яйца!,он тоже был смелым!
1,"dude should have been taken to api , he would ...",It would have been good if he went to api. He ...,"чувака надо было отвести в апи, он был бы как ...","Было бы хорошо, если бы он пошел на апи. Он бы..."
2,"im not gonna sell the fucking picture , i just...","I'm not gonna sell the picture, i just want to...","Я не собираюсь продавать эту чертову картинку,...","Я не собираюсь продавать картинку, я просто хо..."
3,the garbage that is being created by cnn and o...,the news that is being created by cnn and othe...,"мусор, создаваемый CNN и другими информационны...","новости, создаваемые CNN и другими информацион..."
4,the reason they dont exist is because neither ...,The reason they don't exist is because neither...,"причина, по которой их не существует, заключае...","Причина, по которой они не существуют, заключа..."


In [5]:
en2ru_yandex[['toxic_comment', 'toxic_ru', 'neutral_comment', 'neutral_ru']].head()

Unnamed: 0,toxic_comment,toxic_ru,neutral_comment,neutral_ru
0,he had steel balls too !,у него тоже были стальные яйца!,he was brave too!,он тоже был храбрым!
1,"dude should have been taken to api , he would ...","чувака надо было отвезти в апи, он был бы там ...",It would have been good if he went to api. He ...,"Было бы хорошо, если бы он пошел в апи. Он бы ..."
2,"im not gonna sell the fucking picture , i just...",я не собираюсь продавать эту гребаную фотограф...,"I'm not gonna sell the picture, i just want to...","Я не собираюсь продавать фотографию, я просто ..."
3,the garbage that is being created by cnn and o...,"мусор, который создают cnn и другие информацио...",the news that is being created by cnn and othe...,"новости, которые создают cnn и другие информац..."
4,the reason they dont exist is because neither ...,"причина, по которой их не существует, заключае...",The reason they don't exist is because neither...,"Причина, по которой их не существует, заключае..."


Luckily, we have several multilingual models like mBART and mT5 that have been trained on gigabytes of multilingual texts. 

These multilingual models can do the same detoxification taks but for multiple languages with still decent performance.  

In [6]:
class DetoxDataset(torch.utils.data.Dataset):
    def __init__(self, data, tokenizer):
        self.data = data
        self.tokenizer = tokenizer

    def __getitem__(self, idx):

        source = self.tokenizer(
            self.data.iloc[idx].toxic_comment,
            max_length=150,
            pad_to_max_length=True,
            truncation=True,
            padding="max_length",
            return_tensors="pt",
        )
        target = self.tokenizer(
            self.data.iloc[idx].neutral_comment,
            max_length=150,
            pad_to_max_length=True,
            truncation=True,
            padding="max_length",
            return_tensors="pt",
        )
        source["labels"] = target["input_ids"]

        return {k: v.squeeze(0) for k, v in source.items()}

    def __len__(self):
        return self.data.shape[0]

We will simply expand our data and concat original "toxic" -> "polite" part with the translated part. That gives us **twice** more training data. 

In [7]:
ru_part = en2ru_yandex[['toxic_ru', 'neutral_ru']].copy()
data = pd.concat(
    [en2ru_yandex[['toxic_comment', 'neutral_comment']], 
    ru_part.rename(columns={'toxic_ru': 'toxic_comment', 'neutral_ru': 'neutral_comment'})]
    )

In [8]:
model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50").cuda()
tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50")

In [9]:
from sklearn.model_selection import train_test_split

train_part, valid_part = train_test_split(data, random_state=42, test_size=0.01)
trainset = DetoxDataset(train_part, tokenizer)
valset = DetoxDataset(valid_part, tokenizer)

In [10]:
training_args = Seq2SeqTrainingArguments(
    output_dir="mbart_mdetox",
    do_train=True,
    do_eval=True,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=1,
    learning_rate=1e-5,
    evaluation_strategy="steps",
    save_strategy="no",
    save_total_limit=1,
    logging_steps=500,
    gradient_accumulation_steps=1,
)

In [11]:
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=trainset,
    eval_dataset=valset,
    tokenizer=tokenizer,
)

In [12]:
trainer.train()

***** Running training *****
  Num examples = 39136
  Num Epochs = 1
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 4892
Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"
Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
[34m[1mwandb[0m: Currently logged in as: [33metomoscow[0m. Use [1m`wandb login --relogin`[0m to force relogin
2023-05-16 15:30:50.225355: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2023-05-16 15:30:50.225403: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.


Step,Training Loss,Validation Loss
500,2.5527,0.111055


***** Running Evaluation *****
  Num examples = 396
  Batch size = 8
  Num examples = 396
  Batch size = 8
***** Running Evaluation *****
  Num examples = 396
  Batch size = 8
***** Running Evaluation *****
  Num examples = 396
  Batch size = 8
***** Running Evaluation *****
  Num examples = 396
  Batch size = 8
***** Running Evaluation *****
  Num examples = 396
  Batch size = 8
***** Running Evaluation *****
  Num examples = 396
  Batch size = 8
***** Running Evaluation *****
  Num examples = 396
  Batch size = 8
***** Running Evaluation *****
  Num examples = 396
  Batch size = 8
***** Running Evaluation *****
  Num examples = 396
  Batch size = 8
***** Running Evaluation *****
  Num examples = 396
  Batch size = 8


TrainOutput(global_step=4892, training_loss=0.3439985986430433, metrics={'train_runtime': 2161.436, 'train_samples_per_second': 18.106, 'train_steps_per_second': 2.263, 'total_flos': 1.24237486227456e+16, 'train_loss': 0.3439985986430433, 'epoch': 1.0})

Once the model is trained, let's try to generate something!

In [20]:
def paraphrase(
    text,
    model,
    tokenizer,
    n=None,
    max_length="auto",
    beams=5,
):
    texts = [text] if isinstance(text, str) else text
    inputs = tokenizer(texts, return_tensors="pt", padding=True)["input_ids"].to(
        model.device
    )

    if max_length == "auto":
        max_length = inputs.shape[1] + 10

    result = model.generate(
        inputs,
        num_return_sequences=n or 1,
        do_sample=False,
        temperature=1.0,
        repetition_penalty=10.0,
        max_length=max_length,
        min_length=int(0.5 * max_length),
        num_beams=beams,
        # forced_bos_token_id=tokenizer.lang_code_to_id[tokenizer.tgt_lang],
    )
    texts = [tokenizer.decode(r, skip_special_tokens=True) for r in result]

    return texts[0]

In [13]:
# Download some data for testing.

# !gdown --id 1sVdnUUNa2k9CkQGrST5XnoqpOxR44hHW

In [16]:
test_data = pd.read_csv('../data/russian_data/test.tsv', sep='\t')

Now it's time to check the performance of our model. 

In [23]:
from tqdm.auto import tqdm 

detox_ru = [
    paraphrase(text, model, tokenizer)
    for text in tqdm(test_data['toxic_comment'].values)
]


  0%|          | 0/1000 [00:00<?, ?it/s]

wandb: Network error (ReadTimeout), entering retry loop.


Let's see, what we have.

In [43]:
for index in [10, 12, 99, 93]:
    print(f"Source: {test_data['toxic_comment'][index]}")
    print(f"Generated: {detox_ru[index]}")
    print(f"Ground Truth: {test_data['neutral_comment'][index]}")

Source: ублюдок! закон бумеранга еще никто не отменял!!
Generated: закон бумеранга еще никто не отменял!!
Ground Truth: закон бумеранга еще никто не отменял!
Source: Ах тыж ебаный ты нахуй! Как уснуть теперь
Generated: Как уснуть в этот раз? Ты на самом деле плохой.
Ground Truth: Ах тыж! Как уснуть теперь
Source: не надо опускаться до уровня шута безмозглого
Generated: не надо опускаться до уровня безмозглого.
Ground Truth: Не надо опускаться до нижнего уровня
Source: потому что жизнь долбоёбов ничему не учит
Generated: Потому что жизнь людей ничему не учит.
Ground Truth: потому что жизнь глупого ничему не учит


In [82]:
# import sys
# from IPython.display import clear_output
# sys.path.append('..')

# from evaluate_ru import evaluate_style_transfer, load_model

# style_model, style_tokenizer = load_model(
#     "IlyaGusev/rubertconv_toxic_clf", use_cuda=True
# )
# meaning_model, meaning_tokenizer = load_model(
#     "s-nlp/rubert-base-cased-conversational-paraphrase-v1", use_cuda=True
# )
# cola_model, cola_tolenizer = load_model(
#     "s-nlp/ruRoberta-large-RuCoLa-v1", use_cuda=True
# )

# def evaluate(original, rewritten, references=None):
#     return evaluate_style_transfer(
#         original_texts=original,
#         rewritten_texts=rewritten,
#         references=references,
#         style_model=style_model,
#         style_tokenizer=style_tokenizer,
#         meaning_model=meaning_model,
#         meaning_tokenizer=meaning_tokenizer,
#         cola_model=cola_model,
#         cola_tokenizer=cola_tolenizer,
#         style_target_label=0,
#         meaning_target_label=0,
#         cola_target_label=0,
#         aggregate=True
#         )

# # refs = open('../data/english_data/test_neutral_parallel.txt', 'r').read().split('\n')

# results = evaluate(test_data['toxic_comment'].values, test_data['neutral_comment'].values, detox_ru)

# clear_output()

# print(results)

In [47]:
with open('../data/english_data/test_toxic_parallel.txt', 'r') as f:
    test_en = f.read().split('\n')

detox_en = [paraphrase(text, model, tokenizer) for text in tqdm(test_en)]

  0%|          | 0/671 [00:00<?, ?it/s]

In [74]:
from IPython.display import clear_output
import sys

sys.path.append('..')

from evaluate_en import evaluate_style_transfer, load_model

style_model, style_tokenizer = load_model(
    "s-nlp/roberta_toxicity_classifier", use_cuda=True
)
meaning_model, meaning_tokenizer = load_model(
    "Elron/bleurt-large-128", use_cuda=True
)
cola_model, cola_tolenizer = load_model(
    "cointegrated/roberta-large-cola-krishna2020", use_cuda=True
)

def evaluate(original, rewritten, references=None):
    return evaluate_style_transfer(
        original_texts=original,
        rewritten_texts=rewritten,
        references=references,
        style_model=style_model,
        style_tokenizer=style_tokenizer,
        meaning_model=meaning_model,
        meaning_tokenizer=meaning_tokenizer,
        cola_model=cola_model,
        cola_tokenizer=cola_tolenizer,
        style_target_label=0,
        meaning_target_label=0,
        cola_target_label=0,
        aggregate=True
        )

refs = open('../data/english_data/test_neutral_parallel.txt', 'r').read().split('\n')

results = evaluate(test_en, refs[:-1], detox_en)

clear_output()

print(results)

{'accuracy': 0.9500501155853271, 'similarity': 0.5612009167671204, 'fluency': 0.8355763554573059, 'joint': 0.4501223564147949, 'chrF': 68.79356510085162}


In [49]:
for index in [10, 12, 99, 93]:
    print(f"Source: {test_en[index]}")
    print(f"Generated: {detox_en[index]}")
    # print(f"Ground Truth: {test_data['neutral_comment'][index]}")

Source: actually no , i hate looking for shit .
Generated: actually no, i hate looking for things.
Source: after all most of these dorks are millionares and can pay their own way .
Generated: After all most of them are millionares and can pay their own way.
Source: DIGIT years later and they still blame bush for his fuck ups .
Generated: DIGIT years later and they still blame bush for his mistakes.
Source: destined to repeat the same shit over and over again until we die .
Generated: destined to repeat the same thing over and over again until we die.


Since the data for training was multilingual, our model can perform detoxification on BOTH English and Russian.

### Cross-lingual approach

Not only can we train the model to be multilingual, but also cross-lingual. 

How? 

Simply pass the pairs of "English Toxic Text" ---> "Russian Polite Text" and you're good!

In [54]:
model2 = MBartForConditionalGeneration.from_pretrained('../saved_models/mbart_large_50_en_ru/checkpoint-40000')
tokenizer = MBart50TokenizerFast.from_pretrained('facebook/mbart-large-50')

loading configuration file ../saved_models/mbart_large_50_en_ru/checkpoint-40000/config.json
Model config MBartConfig {
  "_name_or_path": "facebook/mbart-large-50",
  "_num_labels": 3,
  "activation_dropout": 0.0,
  "activation_function": "gelu",
  "add_bias_logits": false,
  "add_final_layer_norm": true,
  "architectures": [
    "MBartForConditionalGeneration"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 0,
  "classif_dropout": 0.0,
  "classifier_dropout": 0.0,
  "d_model": 1024,
  "decoder_attention_heads": 16,
  "decoder_ffn_dim": 4096,
  "decoder_layerdrop": 0.0,
  "decoder_layers": 12,
  "decoder_start_token_id": 2,
  "dropout": 0.1,
  "early_stopping": true,
  "encoder_attention_heads": 16,
  "encoder_ffn_dim": 4096,
  "encoder_layerdrop": 0.0,
  "encoder_layers": 12,
  "eos_token_id": 2,
  "forced_eos_token_id": 2,
  "gradient_checkpointing": false,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2"
  },
  "init_std": 0.02,
  "is_encoder_decoder": 

In [55]:
with open('../data/english_data/test_toxic_parallel.txt', 'r') as f:
    test_en = f.read().split('\n')

detox_en2ru = [paraphrase(text, model2.cuda(), tokenizer) for text in tqdm(test_en)]

  0%|          | 0/671 [00:00<?, ?it/s]

In [83]:
for i in [1, 5, 10, 15, 20]:
    print(f"Source: {test_en[i]}")
    print(f"Generated: {detox_en2ru[i]}")

Source:  mandated  and " right fucking now " would be good .
Generated: mandated и " прямо сейчас " было бы хорошо.
Source: &gt today was one of the most fucked up days of my life .
Generated: Сегодня был один из самых трудных дней в моей жизни.
Source: actually no , i hate looking for shit .
Generated: На самом деле нет, я ненавижу искать что-то плохое.
Source: all i got from that was shits gonna go down and nobody is going to be ready .
Generated: Все, что я получил от этого, - это то, что произойдет, и никто не будет готов.
Source: almost as fucked up as the cia funding and arming bin laden .
Generated: почти так же плохо, как финансирование ЦРУ и вооружение Бен Ладена.


Some advices for future. 

1. If you lack data for target language, use any other dataset available - multitask training **BOOSTS** performance of the whole model. As a proof, check out [mT0 paper](https://arxiv.org/abs/2211.01786).
2. If the language you are working with is **very** rare, consider using Backtranslation approach. 
3. When working with translation problems, use either Google\Yandex\DeepL Translate or LLM like BLOOM. Please avoid using open-source translators like [opus-mt](https://huggingface.co/Helsinki-NLP/opus-mt-ru-en), they are much worse. 