# Practical machine learning and deep learning. Lab 5
## Competition
No competition for today



# Fine-tuning a model on a translation task
Today we will fine-tune the T5 (Text-To-Text Transfer Transformer) [model](https://github.com/google-research/t5x) on a translation task, framed here as translating toxic text into non-toxic text. For this purpose we will use [HuggingFace transformers](https://huggingface.co/docs/transformers/index) and a dataset of toxic/non-toxic sentence pairs (`filtered.tsv`) in place of [WMT16](https://huggingface.co/datasets/wmt16).

In [1]:
# installing huggingface libraries for dataset, models and metrics
!pip install datasets transformers[sentencepiece] sacrebleu

!pip install numpy==1.24.3



In [2]:
# Necessary imports
import warnings

from datasets import load_dataset, load_metric
import transformers
import datasets
import random
import pandas as pd
from IPython.display import display, HTML

warnings.filterwarnings('ignore')

## Selecting the model
For the purposes of this example we use the smallest model in the T5 family, `t5-small`, as the checkpoint. Other pre-trained models can be found [here](https://huggingface.co/docs/transformers/model_doc/t5#:~:text=T5%20comes%20in%20different%20sizes%3A).

In [3]:
# selecting model checkpoint
model_checkpoint = "t5-small"

## Loading the dataset

In [4]:
# setting random seed for transformers library
transformers.set_seed(42)

# Load the sentence pairs from the TSV file (first column is the index)
df = pd.read_csv("/kaggle/input/toxic-comments-classification/filtered.tsv", sep='\t', index_col=0)
dataset = datasets.Dataset.from_pandas(df, split='train')

# Load the BLEU metric
metric = load_metric("sacrebleu")

## Dataset
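The data is read from a TSV file into a pandas `DataFrame` and wrapped in a HuggingFace `Dataset`. Each row holds a pair of sentences (`reference` and `translation`), their `similarity` and length difference (`lenght_diff`), and a toxicity score for each sentence (`ref_tox`, `trn_tox`).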

In [5]:
df.head()

| | reference | translation | similarity | lenght_diff | ref_tox | trn_tox |
|---|---|---|---|---|---|---|
| 0 | If Alkar is flooding her with psychic waste, t... | if Alkar floods her with her mental waste, it ... | 0.785171 | 0.010309 | 0.014195 | 0.981983 |
| 1 | Now you're getting nasty. | you're becoming disgusting. | 0.749687 | 0.071429 | 0.065473 | 0.999039 |
| 2 | Well, we could spare your life, for one. | well, we can spare your life. | 0.919051 | 0.268293 | 0.213313 | 0.985068 |
| 3 | Ah! Monkey, you've got to snap out of it. | monkey, you have to wake up. | 0.664333 | 0.309524 | 0.053362 | 0.994215 |
| 4 | I've got orders to put her down. | I have orders to kill her. | 0.726639 | 0.181818 | 0.009402 | 0.999348 |


In [6]:
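# Swap the two sentences (and their scores) wherever the translation is
# more toxic, so that `reference` always holds the more toxic sentence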
cond = (df["ref_tox"] < df["trn_tox"])
df.loc[cond, ['reference', 'translation']] = (
    df.loc[cond, ['translation', 'reference']].values)
df.loc[cond, ['ref_tox', 'trn_tox']] = (
    df.loc[cond, ['trn_tox', 'ref_tox']].values)
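
The swap above can be hard to read in vectorised form. The following is a minimal plain-Python sketch of the same idea, applied row by row to two rows copied from the table above. The helper name `orient_toxic_first` is hypothetical and only used for illustration; the notebook itself relies on the pandas version.

```python
# Two rows copied from the first lines of the table above.
rows = [
    {"reference": "Now you're getting nasty.",
     "translation": "you're becoming disgusting.",
     "ref_tox": 0.065473, "trn_tox": 0.999039},
    {"reference": "Well, we could spare your life, for one.",
     "translation": "well, we can spare your life.",
     "ref_tox": 0.213313, "trn_tox": 0.985068},
]


def orient_toxic_first(row):
    """Return a copy of `row` in which `reference` is the more toxic sentence."""
    if row["ref_tox"] < row["trn_tox"]:
        # Swap both the sentences and their toxicity scores.
        return {
            "reference": row["translation"],
            "translation": row["reference"],
            "ref_tox": row["trn_tox"],
            "trn_tox": row["ref_tox"],
        }
    return dict(row)


oriented = [orient_toxic_first(r) for r in rows]
for r in oriented:
    print(f'{r["ref_tox"]:.6f} -> {r["trn_tox"]:.6f} | {r["reference"]}')
```

After this step the model input (`reference`) is always the toxic side and the target (`translation`) the milder side.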

In [7]:
df

| | reference | translation | similarity | lenght_diff | ref_tox | trn_tox |
|---|---|---|---|---|---|---|
| 0 | if Alkar floods her with her mental waste, it ... | If Alkar is flooding her with psychic waste, t... | 0.785171 | 0.010309 | 0.981983 | 0.014195 |
| 1 | you're becoming disgusting. | Now you're getting nasty. | 0.749687 | 0.071429 | 0.999039 | 0.065473 |
| 2 | well, we can spare your life. | Well, we could spare your life, for one. | 0.919051 | 0.268293 | 0.985068 | 0.213313 |
| 3 | monkey, you have to wake up. | Ah! Monkey, you've got to snap out of it. | 0.664333 | 0.309524 | 0.994215 | 0.053362 |
| 4 | I have orders to kill her. | I've got orders to put her down. | 0.726639 | 0.181818 | 0.999348 | 0.009402 |
| ... | ... | ... | ... | ... | ... | ... |
| 577772 | you didn't know that Estelle stole your fish f... | You didn't know that Estelle had stolen some f... | 0.870322 | 0.030769 | 0.949143 | 0.000121 |
| 577773 | It'il suck the life out of you! | you'd be sucked out of your life! | 0.722897 | 0.058824 | 0.996124 | 0.215794 |
| 577774 | I can't fuckin' take that, bruv. | I really can't take this. | 0.617511 | 0.212121 | 0.984538 | 0.000049 |
| 577775 | They called me a fucking hero. The truth is I ... | they said I was a hero, but I didn't care. | 0.679613 | 0.358209 | 0.991945 | 0.000124 |


In [88]:
print(len(df))
filtered_df = df[(df["ref_tox"] > 0.99) & (df["trn_tox"] < 0.01)]

print(len(filtered_df))

577777
173907
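
To make the effect of the two thresholds concrete, here is a small sketch that applies the same rule (`ref_tox > 0.99` and `trn_tox < 0.01`) to the score pairs visible in the table above, and computes the share of rows that survived in the full dataset. The variable names are illustrative only.

```python
# (ref_tox, trn_tox) pairs copied from the rows displayed above (after the swap).
score_pairs = [
    (0.981983, 0.014195),
    (0.999039, 0.065473),
    (0.985068, 0.213313),
    (0.994215, 0.053362),
    (0.999348, 0.009402),
    (0.949143, 0.000121),
    (0.996124, 0.215794),
    (0.984538, 0.000049),
    (0.991945, 0.000124),
]

# Keep only pairs whose source is clearly toxic and whose target is clearly not.
kept = [(ref, trn) for ref, trn in score_pairs if ref > 0.99 and trn < 0.01]
print(kept)

# Share of the full dataset that passed the filter (row counts printed above).
kept_fraction = 173907 / 577777
print(f"{kept_fraction:.1%} of the rows remain")
```

Only two of the nine displayed rows pass, which matches the overall picture: the strict thresholds keep roughly 30% of the data.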


In [89]:
dataset = datasets.Dataset.from_pandas(filtered_df).remove_columns('__index_level_0__')

split_dict = dataset.train_test_split(
    test_size=0.1,
    seed=42,
)

In [90]:
split_dict["test"]

Dataset({
    features: ['reference', 'translation', 'similarity', 'lenght_diff', 'ref_tox', 'trn_tox'],
    num_rows: 17391
})

## Metric
[Sacrebleu](https://huggingface.co/spaces/evaluate-metric/sacrebleu) computes:
- `score`: BLEU score
- `counts`: list of counts of correct n-grams
- `totals`: list of counts of total n-grams
- `precisions`: list of precisions
- `bp`: Brevity penalty
- `sys_len`: cumulative system length
- `ref_len`: cumulative reference length

The main metric is [BLEU score](https://en.wikipedia.org/wiki/BLEU). BLEU (BiLingual Evaluation Understudy) is a metric for automatically evaluating machine-translated text. The BLEU score measures the similarity of the machine-translated text to a set of high quality reference translations.

The BLEU metric is calculated using [n-grams](https://en.wikipedia.org/wiki/N-gram).

In [11]:
fake_preds = ["hello there", "general kenobi", "Can I get an A"]
fake_labels = [["hello there"], ["general kenobi"], ['Can I get a C']]
metric.compute(predictions=fake_preds, references=fake_labels)

{'score': 45.59274666224604,
 'counts': [7, 4, 1, 0],
 'totals': [9, 6, 3, 2],
 'precisions': [77.77777777777777,
  66.66666666666667,
  33.333333333333336,
  25.0],
 'bp': 1.0,
 'sys_len': 9,
 'ref_len': 9}
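
The `counts` and `totals` in the output above can be reproduced by hand. The sketch below counts clipped n-gram matches in plain Python, assuming simple whitespace tokenization and a single reference per prediction. This is a simplification of what sacreBLEU does (its tokenizer, smoothing, and brevity penalty are not reimplemented here), so treat it as an illustration of the n-gram counting only.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_counts(predictions, references, max_n=4):
    """Per-order matched n-gram counts and total predicted n-gram counts."""
    counts, totals = [0] * max_n, [0] * max_n
    for pred, ref in zip(predictions, references):
        pred_tokens, ref_tokens = pred.split(), ref.split()
        for n in range(1, max_n + 1):
            pred_ngrams = Counter(ngrams(pred_tokens, n))
            ref_ngrams = Counter(ngrams(ref_tokens, n))
            # `&` keeps the minimum count per n-gram, i.e. the "clipping" step
            counts[n - 1] += sum((pred_ngrams & ref_ngrams).values())
            totals[n - 1] += sum(pred_ngrams.values())
    return counts, totals


fake_preds = ["hello there", "general kenobi", "Can I get an A"]
fake_refs = ["hello there", "general kenobi", "Can I get a C"]

counts, totals = clipped_counts(fake_preds, fake_refs)
print(counts, totals)  # [7, 4, 1, 0] [9, 6, 3, 2]
```

The first three precisions in the output are `counts[i] / totals[i]` as percentages; the 4-gram precision is non-zero in the output even though no 4-gram matched, which is consistent with the metric smoothing zero counts.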

## Preprocessing the data
As usual, we need to preprocess and tokenize the data before passing it to the model.

In [12]:
from transformers import AutoTokenizer

# we will use autotokenizer for this purpose
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

In [13]:
tokenizer("Hello, this one sentence!")

{'input_ids': [8774, 6, 48, 80, 7142, 55, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}

In [14]:
tokenizer(["Hello, this one sentence!", "This is another sentence."])

{'input_ids': [[8774, 6, 48, 80, 7142, 55, 1], [100, 19, 430, 7142, 5, 1]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1]]}

In [15]:
# prefix for model input
prefix = "Make this text non-toxic:"

In [91]:
max_input_length = 128
max_target_length = 128

def preprocess_function(examples):
    inputs = [prefix + ref for ref in examples["reference"]]
    targets = [tsn for tsn in examples["translation"]]
    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True, return_overflowing_tokens=False)

    # Setup the tokenizer for targets
    labels = tokenizer(targets, max_length=max_target_length, truncation=True, return_overflowing_tokens=False)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

In [92]:
# example of preprocessing
preprocess_function(split_dict['train'][:2])

{'input_ids': [[1796, 48, 1499, 529, 18, 14367, 10, 10499, 32, 32, 17, 5, 9459, 6, 18117, 6, 43, 25, 894, 82, 1379, 570, 58, 1], [1796, 48, 1499, 529, 18, 14367, 10, 4067, 3, 27826, 19, 8, 833, 44, 160, 629, 469, 58, 1]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], 'labels': [[11604, 6, 43, 25, 894, 82, 1379, 570, 58, 1], [571, 6819, 19, 3, 9, 833, 44, 160, 629, 8988, 58, 1]]}
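
A quick sanity check on the output above: both encoded inputs should start with the same token ids (presumably the tokenized prefix) and end with the end-of-sequence id, which is `1` for T5. The sketch below verifies this on the ids copied from the output; the helper `common_prefix` is hypothetical.

```python
# `input_ids` copied from the preprocessing output above.
input_ids = [
    [1796, 48, 1499, 529, 18, 14367, 10, 10499, 32, 32, 17, 5, 9459, 6,
     18117, 6, 43, 25, 894, 82, 1379, 570, 58, 1],
    [1796, 48, 1499, 529, 18, 14367, 10, 4067, 3, 27826, 19, 8, 833, 44,
     160, 629, 469, 58, 1],
]

EOS_TOKEN_ID = 1  # T5's end-of-sequence token id


def common_prefix(a, b):
    """Longest shared leading run of two id lists."""
    shared = []
    for x, y in zip(a, b):
        if x != y:
            break
        shared.append(x)
    return shared


shared = common_prefix(*input_ids)
ends_with_eos = all(ids[-1] == EOS_TOKEN_ID for ids in input_ids)
print(len(shared), shared)
print(ends_with_eos)
```

The seven shared ids are spent on the instruction prefix in every example, which is worth keeping in mind when choosing `max_input_length`.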

In [100]:
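# for the purposes of this example we crop the dataset: the first 5000 rows
# of the train split for training and the first 500 rows of the test split
# for evaluation
# note: `cropped_datasets` is the same object as `split_dict`, so the crop
# also modifies `split_dict`; `batch_size` here only controls `map` batching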
batch_size = 256
cropped_datasets = split_dict
cropped_datasets['train'] = split_dict['train'].select(range(5000))
cropped_datasets['test'] = split_dict['test'].select(range(500))
tokenized_datasets = cropped_datasets.map(preprocess_function, batched=True, batch_size=batch_size, remove_columns=split_dict["train"].column_names)
tokenized_datasets['train'][0]

{'input_ids': [1796, 48, 1499, 529, 18, 14367, 10, 10499, 32, 32, 17, 5, 9459, 6, 18117, 6, 43, 25, 894, 82, 1379, 570, 58, 1],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
 'labels': [11604, 6, 43, 25, 894, 82, 1379, 570, 58, 1]}

## Fine-tuning the model

In [19]:
!pip install langchain
!pip install sentence-transformers

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


In [103]:
from transformers import AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq, Seq2SeqTrainingArguments, Seq2SeqTrainer
# load the pretrained model from the checkpoint
model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

In [104]:
# defining the parameters for training
model_name = model_checkpoint.split("/")[-1]
args = Seq2SeqTrainingArguments(
    f"{model_name}-finetuned-detoxification",
    evaluation_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=10,
    predict_with_generate=True,
    fp16=True,
    report_to='tensorboard',
)
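
With these arguments the cost of training is easy to estimate. The sketch below computes the number of optimizer steps, assuming a single device and the library default of no gradient accumulation (neither is set explicitly above, so both are assumptions).

```python
import math

num_train_examples = 5000    # size of the cropped training split
per_device_batch_size = 1    # per_device_train_batch_size
num_devices = 1              # assumption: a single GPU
grad_accum_steps = 1         # assumption: default value, not set above
num_epochs = 10              # num_train_epochs

effective_batch = per_device_batch_size * num_devices * grad_accum_steps
steps_per_epoch = math.ceil(num_train_examples / effective_batch)
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)
```

A batch size of 1 keeps memory usage low but makes each epoch long; raising `per_device_train_batch_size` reduces the step count proportionally, at the price of more memory per step.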

In [105]:
# instead of writing a collate_fn function we will use DataCollatorForSeq2Seq,
# which similarly builds padded batches for training
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)
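
To show roughly what the collator does to a batch, here is a plain-Python sketch of the padding step, using the two tokenized examples from the preprocessing output. It assumes right-side padding, T5's pad token id `0` for inputs, and `-100` for labels (the value ignored by the loss). The real `DataCollatorForSeq2Seq` additionally returns tensors and, when given the model, prepares `decoder_input_ids`; none of that is reproduced here.

```python
PAD_TOKEN_ID = 0      # T5's pad token id
LABEL_PAD_ID = -100   # label positions with this value are ignored by the loss


def collate(features):
    """Pad every field to the longest sequence in the batch (simplified sketch)."""
    max_in = max(len(f["input_ids"]) for f in features)
    max_lab = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        pad_in = max_in - len(f["input_ids"])
        pad_lab = max_lab - len(f["labels"])
        batch["input_ids"].append(f["input_ids"] + [PAD_TOKEN_ID] * pad_in)
        batch["attention_mask"].append(f["attention_mask"] + [0] * pad_in)
        batch["labels"].append(f["labels"] + [LABEL_PAD_ID] * pad_lab)
    return batch


# The two examples from the preprocessing output earlier in the notebook.
features = [
    {"input_ids": [1796, 48, 1499, 529, 18, 14367, 10, 10499, 32, 32, 17, 5,
                   9459, 6, 18117, 6, 43, 25, 894, 82, 1379, 570, 58, 1],
     "attention_mask": [1] * 24,
     "labels": [11604, 6, 43, 25, 894, 82, 1379, 570, 58, 1]},
    {"input_ids": [1796, 48, 1499, 529, 18, 14367, 10, 4067, 3, 27826, 19, 8,
                   833, 44, 160, 629, 469, 58, 1],
     "attention_mask": [1] * 19,
     "labels": [571, 6819, 19, 3, 9, 833, 44, 160, 629, 8988, 58, 1]},
]

batch = collate(features)
print([len(ids) for ids in batch["input_ids"]])
print(batch["labels"][0])
```

Padding to the longest sequence in each batch, rather than to a fixed length, is why the tokenizer call in `preprocess_function` does not need a `padding` argument.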

In [106]:
import numpy as np

# simple postprocessing for text
def postprocess_text(preds, labels):
    preds = [pred.strip() for pred in preds]
    labels = [[label.strip()] for label in labels]

    return preds, labels

# compute metrics function to pass to trainer
def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    
    # Replace -100 in the labels as we can't decode them.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Some simple post-processing
    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)

    result = metric.compute(predictions=decoded_preds, references=decoded_labels)
    result = {"bleu": result["score"]}

    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds]
    result["gen_len"] = np.mean(prediction_lens)
    result = {k: round(v, 4) for k, v in result.items()}
    return result
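
Two steps in `compute_metrics` are easy to overlook: replacing `-100` in the labels before decoding, and measuring the generated length by counting non-pad tokens. The sketch below is a plain-Python equivalent of those two steps. The label rows reuse ids from the preprocessing output; the `preds` values are hypothetical generated ids, made up for illustration.

```python
from statistics import mean

PAD_TOKEN_ID = 0  # T5's pad token id

# Labels as they arrive from the collator: padded with -100.
labels = [
    [11604, 6, 43, 25, 894, 82, 1379, 570, 58, 1, -100, -100],
    [571, 6819, 19, 3, 9, 833, 44, 160, 629, 8988, 58, 1],
]

# -100 is not a valid token id, so it is replaced by the pad id before decoding
# (the notebook does the same with np.where).
cleaned = [[tok if tok != -100 else PAD_TOKEN_ID for tok in row] for row in labels]

# Hypothetical generated sequences: T5 starts decoding from the pad token,
# and shorter outputs are padded on the right.
preds = [
    [0, 11604, 6, 43, 1, 0, 0],
    [0, 571, 6819, 19, 3, 9, 1],
]

# Generated length = number of non-pad tokens per prediction, averaged.
gen_len = mean(sum(tok != PAD_TOKEN_ID for tok in row) for row in preds)
print(cleaned[0][-2:], gen_len)
```

Because the leading decoder start token equals the pad id, it is excluded from `gen_len`, while the end-of-sequence token is counted.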

In [107]:
# instead of writing a training loop we will use Seq2SeqTrainer
trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

In [108]:
trainer.train()

| Epoch | Training Loss | Validation Loss | Bleu | Gen Len |
|---|---|---|---|---|
| 1 | 2.0351 | 1.863859 | 22.9312 | 12.372 |
| 2 | 1.9564 | 1.80135 | 23.7373 | 12.446 |
| 3 | 1.8765 | 1.77107 | 24.4659 | 12.27 |
| 4 | 1.7714 | 1.773111 | 24.7587 | 12.196 |
| 5 | 1.7093 | 1.764896 | 24.8789 | 12.218 |
| 6 | 1.6653 | 1.76689 | 25.3012 | 12.102 |


KeyboardInterrupt: 
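
The training log above can also be read programmatically. The sketch below copies the six logged epochs and picks the best one by validation loss and by BLEU; the dictionary keys are illustrative names, not fields produced by the trainer.

```python
# Values copied from the training log above (run interrupted after epoch 6).
log = [
    {"epoch": 1, "train_loss": 2.0351, "val_loss": 1.863859, "bleu": 22.9312},
    {"epoch": 2, "train_loss": 1.9564, "val_loss": 1.80135, "bleu": 23.7373},
    {"epoch": 3, "train_loss": 1.8765, "val_loss": 1.77107, "bleu": 24.4659},
    {"epoch": 4, "train_loss": 1.7714, "val_loss": 1.773111, "bleu": 24.7587},
    {"epoch": 5, "train_loss": 1.7093, "val_loss": 1.764896, "bleu": 24.8789},
    {"epoch": 6, "train_loss": 1.6653, "val_loss": 1.76689, "bleu": 25.3012},
]

best_by_loss = min(log, key=lambda row: row["val_loss"])
best_by_bleu = max(log, key=lambda row: row["bleu"])
print(best_by_loss["epoch"], best_by_bleu["epoch"])
```

The two criteria disagree here: validation loss flattens out around 1.77 from epoch 3 onward, while BLEU keeps improving slowly, so which checkpoint counts as "best" depends on the metric chosen.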

In [109]:
# saving model
trainer.save_model('best')

In [110]:
# load the saved model and prepare it for inference
model = AutoModelForSeq2SeqLM.from_pretrained('best')
model.eval()
model.config.use_cache = False

In [111]:
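def detoxify(model, inference_request, tokenizer=tokenizer):
    # tokenize the request and generate with the default decoding settings
    input_ids = tokenizer(inference_request, return_tensors="pt").input_ids
    outputs = model.generate(input_ids=input_ids)
    # `temperature` is a generation option, not a decoding option,
    # so it is not passed to `decode`
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))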

In [117]:
inference_request = prefix + "Fuck you, bitch!"
detoxify(model, inference_request, tokenizer)

you're gonna be a mess!


In [113]:
inference_request = prefix + 'This bastard could not even mutliply fucking tensors in mind!'
detoxify(model, inference_request,tokenizer)

this man could not even have tensors in mind!


In [118]:
inference_request = prefix + """You sound like a bitch, bitch
Shut the fuck up
When your fans become your haters
You done?
Fuck, your beard's weird
Alright
You yellin' at the mic, you weird beard
We doin' this once
Your beard's weird, why you yellin' at the mic?"""
detoxify(model, inference_request,tokenizer)

you sound like a sailor, sailor, sailor
