In [91]:
!pip install transformers[sentencepiece]==4.28.0 datasets sacrebleu evaluate

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

In [92]:
from huggingface_hub import login
from kaggle_secrets import UserSecretsClient
user_secrets = UserSecretsClient()
secret_value = user_secrets.get_secret("HF_KEY")
login(secret_value)

Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid.
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [93]:
import transformers

print(transformers.__version__)

4.29.2


In [94]:
from datasets import load_dataset

#wmt16_train = load_dataset("ethansimrm/wmt16_biomed", use_auth_token=True) #We won't need to use this for now
wmt16_test = load_dataset("ethansimrm/wmt16_biomed_test", use_auth_token=True)
wmt16_gold = load_dataset("ethansimrm/wmt16_biomed_gold", use_auth_token=True)

Let's aim low for now. Given a small pre-trained language model (Marian, T5-Small, etc.), we want to measure baseline performance on the WMT16 Biomedical Test Set.
- We have converted the test set and the gold-standard answers to CSV files and uploaded both to the HuggingFace Hub.
- We must extract source sentences from the test set and target sentences from the gold set, but only where the two correspond directly.
- We know that "passage/0" (the scientific paper titles) is exactly one sentence long in both languages, so the titles correspond directly. We cannot guarantee the same for the remaining sentences.

Additional Information:
- We could simply truncate the gold-standard answers to the length of the source sentences, but a quick-and-dirty evaluation on the titles is all we need today.
- Furthermore, some French abstracts in the test set contain information that cannot be inferred from the English source, so we cannot assume a faithful one-to-one translation.
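
The pairing rule described above can be sketched as follows. This is only an illustration on in-memory lists, not the notebook's actual pipeline: `pair_titles` and the sample strings are hypothetical, and it assumes a missing title shows up as `None` or an empty string.

```python
def pair_titles(sources, targets):
    """Keep only index-aligned (source, target) pairs where both sides are non-empty."""
    if len(sources) != len(targets):
        raise ValueError("source and target lists must be the same length")
    pairs = []
    for src, tgt in zip(sources, targets):
        # Treat None and whitespace-only strings as missing (an assumption).
        if src and src.strip() and tgt and tgt.strip():
            pairs.append((src.strip(), tgt.strip()))
    return pairs


# Hypothetical example data, not taken from the WMT16 files.
sources = ["Malaria vaccines", None, "Stroke care "]
targets = ["Vaccins contre le paludisme", "Titre orphelin", "Prise en charge de l'AVC"]
pairs = pair_titles(sources, targets)
```

Dropping a row when either side is missing keeps the two lists aligned by index, which is what a corpus-level BLEU computation needs.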

In [95]:
#The relevant columns for the abstract titles are:
source_sentences = wmt16_test["test"]["passage/0/sentence/0/text"]
target_sentences = wmt16_gold["train"]["passage/0/sentence/0/text"]

In [96]:
#load_metric is deprecated in datasets; the evaluate package installed above provides the same metric
import evaluate
metric = evaluate.load("sacrebleu")

In [97]:
#This uses a pretrained model from an earlier tutorial, and it translates quite badly. We need to train our base models
#on at least some out-of-domain EN-FR data and upload them to the Hub before evaluating on the test set.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline
tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("ethansimrm/test_t5_small_example_kaggle3")
translator = pipeline("translation", model=model, tokenizer=tokenizer, use_auth_token=True)
#for s in source_sentences:
    #source = "translate English to French: " + s
    #if(source[len(source) - 1] == '.'):
        #source = source[:-1]
    #print(source)
    #result = translator(source)
    #print(result)
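
The prompt construction in the commented-out loop can be isolated into a small helper. This is a minimal sketch that mirrors that loop; `build_prompt` is a hypothetical name, the sample sentences are made up, and the call to the translator is left out so the snippet runs without a model.

```python
PREFIX = "translate English to French: "


def build_prompt(sentence, prefix=PREFIX):
    """Prepend the T5 task prefix and drop a single trailing full stop."""
    source = prefix + sentence
    if source.endswith("."):
        source = source[:-1]
    return source


# Hypothetical titles, one with and one without a trailing full stop.
prompts = [build_prompt(s) for s in ["Malaria vaccines.", "Stroke care"]]
```

Each prompt could then be passed to `translator(...)` exactly as in the loop above.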

The corpora normally used in research are massive. Let's use a much smaller one: the News Commentary corpus on HuggingFace (from OPUS, already sentence-aligned) should work.
We could have obtained EMEA (and many other) corpora from OPUS, but we don't have time to convert them into HuggingFace format right now.

In [119]:
training_data = load_dataset("news_commentary", "en-fr") 

Downloading and preparing dataset news_commentary/en-fr (download: 23.83 MiB, generated: 67.08 MiB, post-processed: Unknown size, total: 90.91 MiB) to /root/.cache/huggingface/datasets/news_commentary/en-fr/11.0.0/cfab724ce975dc2da51cdae45302389860badc88b74db8570d561ced6004f8b4...


Generating train split:   0%|          | 0/209479 [00:00<?, ? examples/s]

Dataset news_commentary downloaded and prepared to /root/.cache/huggingface/datasets/news_commentary/en-fr/11.0.0/cfab724ce975dc2da51cdae45302389860badc88b74db8570d561ced6004f8b4. Subsequent calls will reuse this data.


In [122]:
training_data = training_data['train'].train_test_split(test_size=0.2)

In [123]:
training_data['train'][0]

{'id': '128044',
 'translation': {'en': 'Europe’s major competitors are turning climate change into an opportunity to encourage growth and create high-quality jobs in rapidly innovating economic sectors.',
  'fr': 'Les principaux concurrents de l’Europe font du changement climatique une occasion pour&#160; encourager la croissance et créer des emplois de haut niveau dans des secteurs économiques en pleine mutation.'}}

In [124]:
from transformers import AutoTokenizer
checkpoint = "t5-small"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

In [125]:
source_lang = "en"
target_lang = "fr"
prefix = "translate English to French: "
def preprocess_function(examples):
    inputs = [prefix + example[source_lang] for example in examples["translation"]] 
    targets = [example[target_lang] for example in examples["translation"]]
    model_inputs = tokenizer(inputs, text_target=targets, max_length=128, truncation=True)
    return model_inputs

In [126]:
tokenized_train = training_data.map(preprocess_function, batched=True)

In [127]:
from transformers import DataCollatorForSeq2Seq
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=checkpoint)

In [128]:
import numpy as np

def postprocess_text(preds, labels): 
    preds = [pred.strip() for pred in preds]
    labels = [[label.strip()] for label in labels]
    return preds, labels

def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True) #Convert back into words

    labels = np.where(labels != -100, labels, tokenizer.pad_token_id) #The data collator pads labels with -100; swap in the pad token so they can be decoded
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True) 

    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels) #Remove leading and trailing spaces

    result = metric.compute(predictions=decoded_preds, references=decoded_labels) #BLEU score for provided input and references
    result = {"bleu": result["score"]}

    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds]
    result["gen_len"] = np.mean(prediction_lens) #Compute mean prediction length
    result = {k: round(v, 4) for k, v in result.items()} #Round score to 4dp
    return result
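
The two NumPy steps inside `compute_metrics` can be checked on toy arrays without loading a tokenizer. This is a small sketch with made-up token ids; it assumes a pad token id of 0, which is what T5 tokenizers use, and `PAD_ID` is a stand-in for `tokenizer.pad_token_id`.

```python
import numpy as np

PAD_ID = 0  # assumption: stands in for tokenizer.pad_token_id (0 for T5)

# Labels as the data collator produces them: -100 marks padded positions.
labels = np.array([[37, 12, 1, -100, -100],
                   [5, 1, -100, -100, -100]])
# Replace the -100 sentinel with the pad id so the ids can be decoded.
decodable = np.where(labels != -100, labels, PAD_ID)

# Generated sequences, padded to a common length with PAD_ID.
preds = np.array([[0, 37, 12, 1, 0],
                  [0, 5, 1, 0, 0]])
# Count the non-pad tokens in each prediction, as done for gen_len.
gen_lens = [int(np.count_nonzero(p != PAD_ID)) for p in preds]
mean_len = float(np.mean(gen_lens))
```

Without the replacement step, `batch_decode` would be handed negative ids that do not exist in the vocabulary.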

In [129]:
from transformers import AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

In [130]:
import torch

training_args = Seq2SeqTrainingArguments( #Collects hyperparameters
    output_dir="t5_small_prelim_news",
    evaluation_strategy="epoch", #Evaluates at the end of each epoch
    learning_rate=2e-5, #Initial learning rate for AdamW
    per_device_train_batch_size=16, #Minibatch learning
    per_device_eval_batch_size=16, #Batch size for evaluation
    weight_decay=0.01, #Decoupled weight decay that AdamW applies to the weights at each step
    save_total_limit=3, #Maximum number of checkpoints kept on disk
    num_train_epochs=2,
    predict_with_generate=True, #Generate text during evaluation so that compute_metrics can score it with BLEU
    fp16=torch.cuda.is_available(), #fp16=True raises a ValueError without a CUDA device, so enable it only when one is present
    push_to_hub=True,
)

trainer = Seq2SeqTrainer( #Saves us from writing our own training loops
    model=model,
    args=training_args,
    train_dataset=tokenized_train["train"], #80% split produced by train_test_split
    eval_dataset=tokenized_train["test"], #Held-out 20% split
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

In [None]:
trainer.train()
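
For a rough sense of how long `trainer.train()` will run, the number of optimizer steps can be estimated from figures that appear earlier in the notebook: 209,479 examples, a 20% held-out split, a batch size of 16, and 2 epochs. This is a back-of-the-envelope sketch that assumes a single device, no gradient accumulation, and that the split rounds the test portion up.

```python
import math

total_examples = 209_479  # size of the news_commentary en-fr train split
test_fraction = 0.2       # test_size passed to train_test_split
batch_size = 16           # per_device_train_batch_size
epochs = 2                # num_train_epochs

# Assumption: the test portion is rounded up and the rest goes to training.
n_test = math.ceil(total_examples * test_fraction)
n_train = total_examples - n_test

# Assumption: one device, no gradient accumulation, last partial batch kept.
steps_per_epoch = math.ceil(n_train / batch_size)
total_steps = steps_per_epoch * epochs

print(n_train, n_test, steps_per_epoch, total_steps)
```

Under these assumptions the run takes roughly 21,000 optimizer steps, with a full generation-based evaluation over the held-out split after each epoch.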

In [None]:
trainer.push_to_hub()