<a href="https://colab.research.google.com/github/ethansimrm/medical-mt/blob/main/T5_Translation_Example.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [22]:
#Install required dependencies
!pip install transformers==4.28.0 datasets evaluate sacrebleu torch git+https://github.com/huggingface/accelerate

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting git+https://github.com/huggingface/accelerate
  Cloning https://github.com/huggingface/accelerate to /tmp/pip-req-build-xmdokjii
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/accelerate /tmp/pip-req-build-xmdokjii
  Resolved https://github.com/huggingface/accelerate to commit 0226f750257b3bf2cadc4f189f9eef0c764a0467
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done


In [None]:
from huggingface_hub import notebook_login
notebook_login()

In [23]:
#Load in EN-FR subset of OPUS books
from datasets import load_dataset, load_dataset_builder
books = load_dataset("opus_books", "en-fr")



  0%|          | 0/1 [00:00<?, ?it/s]

In [24]:
#ds_builder = load_dataset_builder("rotten_tomatoes")
#ds_builder = load_dataset_builder("wmt14", 'fr-en')
#ds_builder.info.description
#ds_builder.info.features

In [25]:
#OPUS Books ships only a "train" split, so we split it 80:20 into train and test sets ourselves.
books = books["train"].train_test_split(test_size=0.2)
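
Because the split is random, the rows in each part change between runs unless a seed is fixed (`train_test_split` accepts a `seed` argument). As a rough illustration of what an 80:20 shuffled split does, here is a minimal pure-Python sketch. The function `split_80_20` and the toy `rows` list are hypothetical names used only for this example; they are not part of the `datasets` library, and the library's own rounding may differ.

```python
import random

def split_80_20(rows, test_size=0.2, seed=42):
    """Shuffle a copy of rows and cut it into train and test parts."""
    rng = random.Random(seed)        # a fixed seed makes the split reproducible
    shuffled = list(rows)            # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_size))
    return {"train": shuffled[n_test:], "test": shuffled[:n_test]}

rows = [{"id": str(i)} for i in range(10)]   # toy stand-in for the dataset
parts = split_80_20(rows)
print(len(parts["train"]), len(parts["test"]))
```

Calling `split_80_20(rows)` twice with the same seed returns the same partition, which is what makes later BLEU scores comparable across runs.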

In [26]:
#Inspect data; the split is random.
books["train"][0]

{'id': '6091',
 'translation': {'en': 'What a stroke was this for poor Jane! who would willingly have gone through the world without believing that so much wickedness existed in the whole race of mankind, as was here collected in one individual.',
  'fr': 'Quel coup pour la pauvre Jane qui aurait parcouru le monde entier sans s’imaginer qu’il existât dans toute l’humanité autant de noirceur qu’elle en découvrait en ce moment dans un seul homme !'}}

In [27]:
#Tokenise data. We must load the tokeniser that matches our model checkpoint, because it defines the vocabulary the model was trained with. T5's tokeniser is based on SentencePiece.
from transformers import AutoTokenizer
checkpoint = "t5-small"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
#NB: As long as data looks like {"input" : "XXXX" , "target" : "YYYY"}, it can be processed.

In [28]:
#Specify a preprocessing function which allows us to tokenise source and target languages correctly AND prime T5 with the correct prompt before each sentence
source_lang = "en"
target_lang = "fr"
prefix = "translate English to French: "


def preprocess_function(examples):
    inputs = [prefix + example[source_lang] for example in examples["translation"]] 
    #We are essentially querying the dictionary here, whereby books["train"]["translation"][0]["en"] references the first English sentence above.
    targets = [example[target_lang] for example in examples["translation"]]
    model_inputs = tokenizer(inputs, text_target=targets, max_length=128, truncation=True) #Truncate inputs and targets to at most 128 tokens (subword pieces, not words)
    return model_inputs

#books["train"]["translation"][0]["en"] will give us the first English sentence
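
To make the indexing in `preprocess_function` concrete: with `batched=True`, the function receives a dictionary of columns, so `examples["translation"]` is a list with one `{"en": ..., "fr": ...}` dictionary per row. The sketch below reproduces only the string-building step on a made-up two-row batch. The `batch` contents and the helper name `build_pairs` are illustrative assumptions, and the tokeniser call is left out.

```python
prefix = "translate English to French: "

# Hypothetical two-row batch, shaped like what map(..., batched=True) passes in:
# each column name maps to a list with one entry per row.
batch = {
    "id": ["0", "1"],
    "translation": [
        {"en": "Hello.", "fr": "Bonjour."},
        {"en": "Thank you.", "fr": "Merci."},
    ],
}

def build_pairs(examples, source_lang="en", target_lang="fr"):
    """Return prefixed source strings and raw target strings for a batch."""
    inputs = [prefix + ex[source_lang] for ex in examples["translation"]]
    targets = [ex[target_lang] for ex in examples["translation"]]
    return inputs, targets

inputs, targets = build_pairs(batch)
print(inputs[0])
print(targets[0])
```

Only the source side gets the task prefix; the targets are left untouched so that the model learns to emit plain French text.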

In [29]:
#Apply this function to both the train and test splits, processing many examples per call via batched=True.
tokenized_books = books.map(preprocess_function, batched=True)

Map:   0%|          | 0/101668 [00:00<?, ? examples/s]

Map:   0%|          | 0/25417 [00:00<?, ? examples/s]

In [30]:
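#Create batches of examples, dynamically padding each batch to the length of its longest sequence
from transformers import DataCollatorForSeq2Seq
#NB: model=checkpoint passes a string rather than a model object, so the collator cannot build decoder_input_ids from the labels.
#T5 derives them from the labels itself during the forward pass, so training still works.
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=checkpoint)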
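
Dynamic padding means each batch is padded only to the length of its own longest sequence, rather than to a global maximum. The following is a minimal sketch of that idea in plain Python, not the collator's actual implementation. It assumes T5's pad token ID of 0 and the collator's default label padding value of -100; the function `pad_batch` and the token IDs in `features` are made up for illustration.

```python
PAD_ID = 0            # T5's pad token id
LABEL_PAD_ID = -100   # label positions with this value are skipped by the loss

def pad_batch(features):
    """Pad every sequence to the longest one in this batch only."""
    max_in = max(len(f["input_ids"]) for f in features)
    max_lab = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_pad = max_in - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [PAD_ID] * n_pad)
        # 1 marks real tokens, 0 marks padding the model should not attend to
        batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * n_pad)
        batch["labels"].append(
            f["labels"] + [LABEL_PAD_ID] * (max_lab - len(f["labels"]))
        )
    return batch

# Toy token ids, not real tokeniser output
features = [
    {"input_ids": [13959, 1566, 1], "labels": [7, 1]},
    {"input_ids": [13959, 1566, 12, 2379, 10, 1], "labels": [5, 6, 7, 1]},
]
batch = pad_batch(features)
print(batch["input_ids"][0])
print(batch["labels"][0])
```

Padding labels with -100 instead of the pad token ID is what later forces `compute_metrics` to swap the value back before decoding.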

In [31]:
#Load evaluation method
import evaluate
metric = evaluate.load("sacrebleu")
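
BLEU is built from clipped n-gram precisions: each n-gram in the hypothesis counts as matched at most as many times as it occurs in the reference. The sketch below shows only that ingredient for a single sentence pair, using whitespace tokenisation. It is a simplification: sacreBLEU applies its own tokenisation, aggregates counts over the whole corpus, combines several n-gram orders, and adds a brevity penalty. The helper names `ngrams` and `clipped_precision` are hypothetical.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(hypothesis, reference, n):
    """Fraction of hypothesis n-grams found in the reference, with clipping."""
    hyp_counts = Counter(ngrams(hypothesis.split(), n))
    ref_counts = Counter(ngrams(reference.split(), n))
    total = sum(hyp_counts.values())
    if total == 0:
        return 0.0
    # Clip each n-gram's count by how often it appears in the reference
    matched = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    return matched / total

hyp = "le chat est sur le tapis"
ref = "le chat est sur le lit"
p1 = clipped_precision(hyp, ref, 1)   # unigram precision
p2 = clipped_precision(hyp, ref, 2)   # bigram precision
print(p1, p2)
```

Clipping is what stops a degenerate output such as "the the the" from scoring perfectly against a reference that contains "the" only once.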

In [32]:
import numpy as np

def postprocess_text(preds, labels): 
    preds = [pred.strip() for pred in preds]
    labels = [[label.strip()] for label in labels]
    return preds, labels

def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True) #Convert back into words
    print(decoded_preds[0])

    labels = np.where(labels != -100, labels, tokenizer.pad_token_id) #Replace -100 (the label padding value ignored by the loss) with the pad token ID so the labels can be decoded
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    print(decoded_labels[0])

    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels) #Remove leading and trailing spaces

    result = metric.compute(predictions=decoded_preds, references=decoded_labels) #BLEU score for provided input and references
    result = {"bleu": result["score"]}

    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds]
    result["gen_len"] = np.mean(prediction_lens) #Compute mean prediction length
    result = {k: round(v, 4) for k, v in result.items()} #Round score to 4dp
    return result
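
The two array operations in `compute_metrics` can be read as simple list operations. The sketch below mirrors them with plain Python lists: replacing -100 with the pad token ID, and averaging the number of non-pad tokens per prediction. It assumes T5's pad token ID of 0; the helper names and the toy `labels` and `preds` values are made up for illustration.

```python
PAD_ID = 0  # T5's pad token id

def restore_pad(labels, ignore_id=-100, pad_id=PAD_ID):
    """Swap the loss-ignore value back to the pad id so rows can be decoded."""
    return [[pad_id if tok == ignore_id else tok for tok in row] for row in labels]

def mean_generated_length(preds, pad_id=PAD_ID):
    """Average count of non-pad tokens per predicted sequence."""
    lengths = [sum(1 for tok in row if tok != pad_id) for row in preds]
    return sum(lengths) / len(lengths)

# Toy token ids, not real model output
labels = [[5, 6, 1, -100, -100], [7, 1, -100, -100, -100]]
preds = [[0, 5, 6, 1, 0], [0, 7, 8, 9, 1]]

clean_labels = restore_pad(labels)
gen_len = mean_generated_length(preds)
print(clean_labels)
print(gen_len)
```

The swap is needed because -100 is not a valid vocabulary index, so the tokeniser cannot decode it.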

In [33]:
#Ready to download model
from transformers import AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint) #Model is 242MB in size

In [40]:
training_args = Seq2SeqTrainingArguments( #Collects hyperparameters
    output_dir="test_t5_small_example",
    evaluation_strategy="epoch", #Evaluates at the end of each epoch
    learning_rate=2e-5, #Initial learning rate for AdamW
    per_device_train_batch_size=16, #Training minibatch size per device
    per_device_eval_batch_size=16, #Batch size for evaluation
    weight_decay=0.01, #Weight decay coefficient for AdamW; applied directly to the weights (decoupled), not added to the loss as an L2 penalty
    save_total_limit=3, #Keep at most 3 checkpoints in output_dir; older ones are deleted
    num_train_epochs=2,
    predict_with_generate=True, #Use with ROUGE/BLEU and other translation metrics (see below)
    fp16=True, #Remove fp16 = True if not using CUDA
    push_to_hub=True,
)

trainer = Seq2SeqTrainer( #Saves us from writing our own training loops
    model=model,
    args=training_args,
    train_dataset=tokenized_books["train"],
    eval_dataset=tokenized_books["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

#BLEU and similar metrics score generated text, so evaluation needs the model to generate full output sequences rather than run a single forward pass as in e.g. classification.
#Setting predict_with_generate=True makes the Seq2SeqTrainer call the model's generate method for each sample in the evaluation set.
#The compute_metrics function therefore receives generated token IDs; we only need to decode the predictions and labels before scoring.

/content/test_t5_small_example is already a clone of https://huggingface.co/ethansimrm/test_t5_small_example. Make sure you pull the latest changes with `repo.git_pull()`.
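
To illustrate why generation differs from a single forward pass, here is a toy sketch of greedy decoding: the model is queried repeatedly, and each chosen token is fed back in before the next one is picked. Everything in it is hypothetical, including the lookup table `TOY_SCORES` standing in for a model and the function names. The real `generate` method may also use a different decoding strategy depending on the generation configuration.

```python
EOS = "</s>"

# Hypothetical "model": maps the tokens generated so far to next-token scores.
TOY_SCORES = {
    (): {"Bonjour": 0.9, "Salut": 0.1},
    ("Bonjour",): {"le": 0.6, "monde": 0.3, EOS: 0.1},
    ("Bonjour", "le"): {"monde": 0.8, EOS: 0.2},
    ("Bonjour", "le", "monde"): {EOS: 1.0},
}

def toy_score_fn(generated):
    """Look up next-token scores; unknown prefixes simply end the sequence."""
    return TOY_SCORES.get(generated, {EOS: 1.0})

def greedy_generate(score_fn, max_new_tokens=10):
    """Pick the highest-scoring token at each step until EOS or the limit."""
    generated = []
    for _ in range(max_new_tokens):
        scores = score_fn(tuple(generated))   # one "forward pass" per new token
        token = max(scores, key=scores.get)
        if token == EOS:
            break
        generated.append(token)
    return generated

output = greedy_generate(toy_score_fn)
print(" ".join(output))
```

Each loop iteration corresponds to one decoder step, which is why evaluating with generation is noticeably slower than computing the validation loss alone.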


In [36]:
trainer.train()

You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Epoch,Training Loss,Validation Loss
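
For a sense of scale, the number of optimiser steps in this run can be estimated from the settings above: 101668 training examples, a batch size of 16, and 2 epochs. The sketch below does that arithmetic and also models a linear learning-rate decay from the initial rate to zero. It assumes a single device, no gradient accumulation, and a linear schedule without warmup; the function names are hypothetical.

```python
import math

def total_training_steps(num_examples, batch_size, num_epochs):
    """Optimiser steps, assuming one device and no gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / batch_size)  # last batch may be partial
    return steps_per_epoch * num_epochs

def linear_lr(step, total_steps, initial_lr):
    """Linear decay from initial_lr at step 0 to zero at total_steps (no warmup)."""
    return initial_lr * max(0.0, (total_steps - step) / total_steps)

total = total_training_steps(num_examples=101668, batch_size=16, num_epochs=2)
print(total)
print(linear_lr(total // 2, total, 2e-5))
```

Under these assumptions the learning rate has dropped to about half of `2e-5` by the end of the first epoch.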


In [None]:
trainer.push_to_hub()