# Machine translation with a fine-tuned T5 model

This notebook provides an example solution for a machine translation task. The solution fine-tunes a large language model, [google/flan-t5-xl](https://huggingface.co/google/flan-t5-xl) (3B parameters) from the Hugging Face platform, to translate text from English into multiple target languages.

Compute resource: AWS EC2 p5.48xlarge instance

## Prepare the environment

In [1]:
# !pip3 install -q ipykernel==6.22.0
# # !pip3 install -q torch==2.0.1
# !pip3 install -q transformers==4.28.1
# !pip3 install -q bitsandbytes==0.39.0
# !pip3 install -q peft==0.3.0
# !pip3 install -q pytest==7.3.2
# !pip3 install -q datasets==2.10.0
# !pip3 install -q sentencepiece
# !pip3 install -q accelerate
# !pip3 install -q nltk

# # install deepspeed and ninja for jit compilations of kernels
# !pip3 install -q deepspeed ninja --upgrade

In [1]:
# Import libraries
import pandas as pd
import torch
from transformers import pipeline, AutoTokenizer, T5Tokenizer, T5ForConditionalGeneration
from transformers.pipelines.pt_utils import KeyDataset
import tqdm
import datasets
from datasets import Dataset
import os
import numpy as np
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

os.environ["TOKENIZERS_PARALLELISM"] = "true"

BLEU = 'bleu'

language_mapping = {"es":"Spanish", "de":"German", "fr": "French", "it":"Italian", "pt":"Portuguese"}

In [2]:
from huggingface_hub import notebook_login

notebook_login()


In [3]:
torch.cuda.is_available()

True

In [4]:
torch.cuda.current_device()

0

In [5]:
torch.cuda.device_count()

8

In [8]:
torch.cuda.get_device_name(0)

'NVIDIA H100 80GB HBM3'

In [3]:
seed = 100
torch.manual_seed(seed)
torch.backends.cudnn.deterministic = True
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(seed)

## Load Pretrained Model from Hugging Face

In [6]:
model_id = "google/flan-t5-xl" # Hugging Face Model Id
tokenizer = T5Tokenizer.from_pretrained(model_id)
model = T5ForConditionalGeneration.from_pretrained(model_id, device_map="auto")

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

## Load dataset and process the data

The dataset has the following columns: 
- `ID`
- `input_to_translate`: the source sentence in English
- `label`: the translation reference in the target language
- `gender`: f(emale) or m(ale)
- `language_pair`: `<source>_<target>`, such as en_fr for English to French
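
As a small illustration of the `language_pair` format, the sketch below splits a value such as `en_fr` into its source and target codes and looks up the target language name. The helper name `target_language` is hypothetical and not part of the original solution; the mapping mirrors the `language_mapping` dictionary defined in the import cell, and the check that the source is always `en` is an assumption based on the task description.

```python
# Mapping copied from the notebook's `language_mapping` dictionary.
LANGUAGE_MAPPING = {"es": "Spanish", "de": "German", "fr": "French",
                    "it": "Italian", "pt": "Portuguese"}


def target_language(language_pair: str) -> str:
    """Return the target language name for a `<source>_<target>` code.

    Hypothetical helper for illustration only.
    """
    source, target = language_pair.split("_")
    if source != "en":
        # Assumption: every row translates from English.
        raise ValueError(f"unexpected source language: {source!r}")
    if target not in LANGUAGE_MAPPING:
        raise ValueError(f"unknown target language: {target!r}")
    return LANGUAGE_MAPPING[target]


print(target_language("en_fr"))  # French
```

Failing early on an unknown code makes a malformed row easier to spot than a bare `KeyError` raised in the middle of a `DataFrame.apply` call.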

In [5]:
training_features = pd.read_csv("data/training.csv", encoding="utf-8-sig")
training_features.head(2)

| | ID | input_to_translate | label | gender | language_pair |
|---|---|---|---|---|---|
| 0 | 0 | She started training for the biathlon in 2003. | Comenzó a entrenar para el biatlón en 2003. | f | en_es |
| 1 | 1 | He joined Philippine Airlines as a trainee pil... | Er wurde Flugschüler bei Philippine Airlines u... | m | en_de |


In [6]:
def generate_prompt(x):
    language_mapping = {"es":"Spanish", "de":"German", "fr": "French", "it":"Italian", "pt":"Portuguese"}
    source_text = x["input_to_translate"]
    language = x["language_pair"].split('_')[1]
    input_text = f"Translate the following sentence from English to {language_mapping[language]}: \"{source_text}\" "
    return input_text

In [7]:
training_features["prompt"] = training_features.apply(generate_prompt, axis=1)
training_features.head(2)

| | ID | input_to_translate | label | gender | language_pair | prompt |
|---|---|---|---|---|---|---|
| 0 | 0 | She started training for the biathlon in 2003. | Comenzó a entrenar para el biatlón en 2003. | f | en_es | Translate the following sentence from English ... |
| 1 | 1 | He joined Philippine Airlines as a trainee pil... | Er wurde Flugschüler bei Philippine Airlines u... | m | en_de | Translate the following sentence from English ... |


#### Check the generated prompt:

In [8]:
training_features.iloc[0]["prompt"]

'Translate the following sentence from English to Spanish: "She started training for the biathlon in 2003." '

In [10]:
training_features.iloc[1]["prompt"]

'Translate the following sentence from English to German: "He joined Philippine Airlines as a trainee pilot, and was later pirated by Boeing." '

#### Load and generate prompt for test set

The test set is similar to the training set, except that it lacks the "label" column.

In [13]:
test_features = pd.read_csv("data/test_features.csv", encoding="utf-8-sig")
list(test_features)

['ID', 'input_to_translate', 'gender', 'language_pair']

In [12]:
test_features["prompt"] = test_features.apply(generate_prompt, axis=1)
list(test_features)

['ID', 'input_to_translate', 'gender', 'language_pair', 'prompt']

### Use Hugging Face Dataset object

In [11]:
train_ds_raw = datasets.Dataset.from_pandas(training_features, split="train")
train_ds_raw

Dataset({
    features: ['ID', 'input_to_translate', 'label', 'gender', 'language_pair', 'prompt'],
    num_rows: 12000
})

In [12]:
test_ds_raw = datasets.Dataset.from_pandas(test_features, split="test")
test_ds_raw

Dataset({
    features: ['ID', 'input_to_translate', 'gender', 'language_pair', 'prompt'],
    num_rows: 3000
})

In [17]:
tokenizer.model_max_length

512

In [18]:
tokenizer.pad_token_id

0

### Figure out token length and tokenize the training set

In [19]:
tokenized_source_training = train_ds_raw.map(
    lambda x: tokenizer(x["prompt"], truncation=True), 
    batched=True, remove_columns=['ID', 'input_to_translate', 'label', 'gender', 'language_pair', 'prompt'])

source_lengths_training = [len(x) for x in tokenized_source_training["input_ids"]]

print(f"Max source length: {max(source_lengths_training)}")
print(f"95% source length: {int(np.percentile(source_lengths_training, 95))}")

Map:   0%|          | 0/12000 [00:00<?, ? examples/s]

Max source length: 155
95% source length: 70


In [20]:
tokenized_target_training = train_ds_raw.map(
    lambda x: tokenizer(x["label"], truncation=True), 
    batched=True, remove_columns=['ID', 'input_to_translate', 'label', 'gender', 'language_pair', 'prompt'])
target_lengths_training = [len(x) for x in tokenized_target_training["input_ids"]]

print(f"Max target length: {max(target_lengths_training)}")
print(f"95% target length: {int(np.percentile(target_lengths_training, 95))}")

Map:   0%|          | 0/12000 [00:00<?, ? examples/s]

Max target length: 273
95% target length: 101


In [21]:
tokenized_source_test = test_ds_raw.map(
    lambda x: tokenizer(x["prompt"], truncation=True), 
    batched=True, remove_columns=['ID', 'input_to_translate', 'gender', 'language_pair', 'prompt'])

source_lengths_test = [len(x) for x in tokenized_source_test["input_ids"]]

print(f"Max source length in test set: {max(source_lengths_test)}")
print(f"95% source length in test set: {int(np.percentile(source_lengths_test, 95))}")

Map:   0%|          | 0/3000 [00:00<?, ? examples/s]

Max source length in test set: 155
95% source length in test set: 76


In [22]:
max_source_length = max(max(source_lengths_training), max(source_lengths_test))
max_source_length

155

In [23]:
max_target_length = max(target_lengths_training)
max_target_length

273

In [24]:
# reference: https://www.philschmid.de/fine-tune-flan-t5-deepspeed
def preprocess_function(sample, padding="max_length"):

    # tokenize inputs
    model_inputs = tokenizer(sample["prompt"], max_length=max_source_length, padding=padding, truncation=True)

    # Tokenize targets with the `text_target` keyword argument
    labels = tokenizer(text_target=sample["label"], max_length=max_target_length, padding=padding, truncation=True)

    # If we are padding here, replace all tokenizer.pad_token_id in the labels by -100 when we want to ignore
    # padding in the loss.
    if padding == "max_length":
        labels["input_ids"] = [
            [(l if l != tokenizer.pad_token_id else -100) for l in label] for label in labels["input_ids"]
        ]

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
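
To make the label-masking step in `preprocess_function` concrete, here is a minimal sketch in plain Python, without the tokenizer. The token ids are made up for illustration, the helper name `mask_padding` is hypothetical, and `pad_token_id=0` matches the value printed for this tokenizer earlier in the notebook.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def mask_padding(label_ids, pad_token_id=0):
    """Replace padding ids with IGNORE_INDEX so padded positions add nothing to the loss."""
    return [
        [tok if tok != pad_token_id else IGNORE_INDEX for tok in seq]
        for seq in label_ids
    ]


# Two made-up label sequences, already padded to length 5 with id 0.
padded = [[71, 1712, 1, 0, 0], [8774, 1, 0, 0, 0]]
print(mask_padding(padded))
# [[71, 1712, 1, -100, -100], [8774, 1, -100, -100, -100]]
```

Only the labels are masked this way; the padded `input_ids` keep the pad id and rely on `attention_mask` instead.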


In [25]:
tokenized_train_ds = train_ds_raw.map(
    preprocess_function, batched=True, 
    remove_columns=['ID', 'input_to_translate', 'label', 'gender', 'language_pair', 'prompt'])

Map:   0%|          | 0/12000 [00:00<?, ? examples/s]

In [26]:
tokenized_train_ds

Dataset({
    features: ['input_ids', 'attention_mask', 'labels'],
    num_rows: 12000
})

### Split the original training set into train and test:

In [27]:
ds_dict = tokenized_train_ds.train_test_split(test_size=0.2)
ds_dict

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 9600
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 2400
    })
})

In [28]:
trainset = ds_dict["train"]
trainset           

Dataset({
    features: ['input_ids', 'attention_mask', 'labels'],
    num_rows: 9600
})

In [29]:
testset = ds_dict["test"]
testset

Dataset({
    features: ['input_ids', 'attention_mask', 'labels'],
    num_rows: 2400
})

In [30]:
# save dataset to disk
save_dataset_path = "training_data"
trainset.save_to_disk(os.path.join(save_dataset_path,"train"))
testset.save_to_disk(os.path.join(save_dataset_path,"eval"))

Saving the dataset (0/1 shards):   0%|          | 0/9600 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/2400 [00:00<?, ? examples/s]

## Fine-tuning the model

reference: https://www.philschmid.de/fine-tune-flan-t5-deepspeed 

In [31]:
# !pip3 install -q pytesseract transformers datasets nltk tensorboard py7zr evaluate sacrebleu --upgrade

In [32]:
import evaluate
import nltk
import numpy as np
from nltk.tokenize import sent_tokenize
nltk.download("punkt")

[nltk_data] Downloading package punkt to /home/ec2-user/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [33]:
# Metric
metric = evaluate.load("sacrebleu")

In [34]:
# helper function to postprocess text
def postprocess_text(preds, labels):
    preds = [pred.strip() for pred in preds]
    labels = [[label.strip()] for label in labels]
    return preds, labels

def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]

    # Replace -100 in the predictions, since the tokenizer cannot decode it.
    # (-100 also shows up in preds, probably as padding added when batches of
    # different lengths are concatenated.)
    preds = np.where(preds != -100, preds, tokenizer.pad_token_id)
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)

    # Replace -100 in the labels as we can't decode them.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Some simple post-processing
    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)

    result = metric.compute(predictions=decoded_preds, references=decoded_labels)
    result = {"bleu": result["score"]}

    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds]
    result["gen_len"] = np.mean(prediction_lens)
    result = {k: round(v, 4) for k, v in result.items()}
    return result
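
The two data-shape details that `compute_metrics` depends on can be sketched without the tokenizer or the metric: the sentinel -100 has to be mapped back to the pad id before decoding, and each reference has to be wrapped in its own list because the metric accepts several references per prediction. The helper names and token ids below are hypothetical.

```python
def restore_padding(token_ids, pad_token_id=0, ignore_index=-100):
    """Map the loss sentinel back to the pad id so every id is decodable."""
    return [
        [pad_token_id if tok == ignore_index else tok for tok in seq]
        for seq in token_ids
    ]


def wrap_references(labels):
    """Turn ['ref a', 'ref b'] into [['ref a'], ['ref b']] (one reference list per prediction)."""
    return [[label.strip()] for label in labels]


print(restore_padding([[71, 1712, 1, -100, -100]]))  # [[71, 1712, 1, 0, 0]]
print(wrap_references([" Hola. ", "Guten Tag."]))    # [['Hola.'], ['Guten Tag.']]
```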


In [35]:
from transformers import DataCollatorForSeq2Seq

# we want to ignore tokenizer pad token in the loss
label_pad_token_id = -100

# Data collator
data_collator = DataCollatorForSeq2Seq(
    tokenizer,
    model=model,
    label_pad_token_id=label_pad_token_id,
    pad_to_multiple_of=8
)

In [36]:
from huggingface_hub import HfFolder
from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer


Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

 and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
bin /opt/conda/envs/ml/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda121.so
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 9.0
CUDA SETUP: Detected CUDA version 121
CUDA SETUP: Loading binary /opt/conda/envs/ml/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda121.so...


In [37]:
batch_size = 96
repository_id = f"{model_id.split('/')[1]}-finetuned-translation-10132023"
training_args = Seq2SeqTrainingArguments(
    output_dir=repository_id,
    learning_rate=5e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    generation_max_length=273,
    weight_decay=0.01,
    num_train_epochs=3,
    predict_with_generate=True,
    fp16=False,
    bf16=True,
    # logging & evaluation strategies
    logging_dir=f"{repository_id}/logs",
    logging_strategy="steps",
    logging_steps=500,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    save_total_limit=2,
    load_best_model_at_end=True,
    # push to hub parameters
    report_to="tensorboard",
    hub_strategy="every_save",
    hub_model_id=repository_id,
    hub_token=HfFolder.get_token(),
    push_to_hub=True,
)

In [38]:
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=trainset,
    eval_dataset=testset,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

In [39]:
trainer.train()

| Epoch | Training Loss | Validation Loss | Bleu | Gen Len |
|---|---|---|---|---|
| 1 | No log | 0.829607 | 33.3801 | 53.8142 |
| 2 | No log | 0.771901 | 35.5554 | 53.2679 |
| 3 | No log | 0.756788 | 36.1064 | 53.1525 |


TrainOutput(global_step=300, training_loss=0.97969482421875, metrics={'train_runtime': 2427.1468, 'train_samples_per_second': 11.866, 'train_steps_per_second': 0.124, 'total_flos': 7.6970842914816e+16, 'train_loss': 0.97969482421875, 'epoch': 3.0})

**Notes:**

- `batch_size = 128` failed with `OutOfMemoryError: CUDA out of memory`; `batch_size = 96` worked.
- `compute_metrics` has to replace -100 in `preds` as well as in `labels`; otherwise decoding fails with an `IndexError` ("piece id is out of range").
- Each epoch took about 13 minutes (about 3 minutes of training and about 10 minutes of evaluation), so 3 epochs took around 40 minutes.
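
The numbers in `TrainOutput` can be cross-checked with a little arithmetic. The sketch below assumes that one optimizer step consumes exactly one batch of 96 samples (no gradient accumulation and no data parallelism), which is consistent with the reported `global_step=300`. Since `logging_steps=500` is larger than the 300 total steps, this would also explain why the Training Loss column shows "No log".

```python
import math

train_rows = 9600            # size of the training split
batch_size = 96              # per_device_train_batch_size
epochs = 3                   # num_train_epochs
train_runtime_s = 2427.1468  # 'train_runtime' reported by TrainOutput

# Assumption: one optimizer step per batch of 96 samples.
steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = steps_per_epoch * epochs
samples_per_second = train_rows * epochs / train_runtime_s

print(total_steps)                     # 300
print(round(samples_per_second, 3))    # 11.866
print(round(train_runtime_s / 60, 1))  # 40.5
```

The reported runtime of about 40.5 minutes matches the "around 40 minutes" estimate, so it appears to include the per-epoch evaluation time.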

In [40]:
trainer.evaluate()

{'eval_loss': 0.7567881345748901,
 'eval_bleu': 36.1064,
 'eval_gen_len': 53.1525,
 'eval_runtime': 560.4878,
 'eval_samples_per_second': 4.282,
 'eval_steps_per_second': 0.045,
 'epoch': 3.0}

In [41]:
repository_id

'flan-t5-xl-finetuned-translation-10132023'

In [42]:
# Save our tokenizer and create model card
tokenizer.save_pretrained(repository_id)
trainer.create_model_card()
# Push the results to the hub
trainer.push_to_hub()

pytorch_model-00001-of-00002.bin:   0%|          | 0.00/9.97G [00:00<?, ?B/s]

pytorch_model-00002-of-00002.bin:   0%|          | 0.00/1.43G [00:00<?, ?B/s]

Upload 2 LFS files:   0%|          | 0/2 [00:00<?, ?it/s]

'https://huggingface.co/delmeng/flan-t5-xl-finetuned-translation-10132023/tree/main/'

## Use the fine-tuned model

In [14]:
model_id = "google/flan-t5-xl" # Hugging Face Model Id
repository_id = f"delmeng/{model_id.split('/')[1]}-finetuned-translation-10132023"
tokenizer = T5Tokenizer.from_pretrained(repository_id)
model = T5ForConditionalGeneration.from_pretrained(repository_id, device_map="auto")

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [15]:
tokenizer.model_max_length

512

### Use the "translation" pipeline task type

In [16]:
pipe_ft = pipeline("translation", model = repository_id, max_length=tokenizer.model_max_length, device_map="auto")

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


### Evaluation on a small subset of the training set

In [19]:
sample_size = 96
train_ds_test = datasets.Dataset.from_pandas(training_features.head(sample_size), split="train")
train_ds_test

Dataset({
    features: ['ID', 'input_to_translate', 'label', 'gender', 'language_pair', 'prompt'],
    num_rows: 96
})

In [20]:
predicted_labels = []
prediction = pd.DataFrame({"ID": pd.Series(dtype="int"),
                   "predicted_label": pd.Series(dtype="str")})
batch_size = 48
# default batch size is 1, if not specified
# with higher batch size, it's easier to trigger out of memory error

for out in tqdm.tqdm(pipe_ft(KeyDataset(train_ds_test, "prompt"), batch_size=batch_size), total=len(train_ds_test)):
    generated_text = out[0]['translation_text']
    predicted_labels.append(generated_text)

  1%|          | 1/96 [01:25<2:15:16, 85.44s/it]
100%|██████████| 96/96 [01:54<00:00,  1.19s/it]



In [21]:
prediction["ID"] = training_features.iloc[0:sample_size]["ID"]
prediction["predicted_label"] = predicted_labels

In [13]:
def bleu_func(x, y):
    # x: list of reference strings, y: predicted string
    chencherry = SmoothingFunction()
    x_split = [x_entry.strip().split() for x_entry in x]
    y_split = y.strip().split()
    return sentence_bleu(x_split, y_split, smoothing_function=chencherry.method3)

def bleu_custom(y_true, y_pred, groups):
    joined = pd.concat([y_true, y_pred, groups], axis=1)
    joined[BLEU] = joined.apply(lambda x: bleu_func([x[y_true.name]], x[y_pred.name]), axis=1)
    # `unique_list` was not defined anywhere in this notebook; assume it holds
    # the distinct group values (here the two genders, "f" and "m").
    unique_list = sorted(joined[groups.name].unique())
    values = [joined[joined[groups.name] == unique][BLEU].mean() for unique in unique_list]
    # Final score: overall mean BLEU minus half the absolute gap between the two groups
    final_score = joined[BLEU].mean() - np.fabs(values[0] - values[1])/2
    print(f"Overall mean: {joined[BLEU].mean()}")
    print(f"Different genders: {values}")
    print(f"Final score: {final_score}")
    return final_score

In [22]:
bleu_custom(
    training_features.iloc[0:sample_size]["label"], 
    prediction["predicted_label"], 
    training_features.iloc[0:sample_size]["gender"]
)

Overall mean: 0.35987764995509036
Different genders: [0.3603295420111593, 0.3592171923346818]
Final score: 0.35932147511685164


0.35932147511685164
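
The final score printed above is the overall mean BLEU minus half of the absolute gap between the two gender groups. A minimal sketch of that arithmetic, using the values from the output above (the function name `gender_penalized_score` is hypothetical):

```python
def gender_penalized_score(overall_mean, group_means):
    """Overall mean BLEU minus half the absolute gap between the two group means."""
    first, second = group_means
    return overall_mean - abs(first - second) / 2


score = gender_penalized_score(
    0.35987764995509036,
    [0.3603295420111593, 0.3592171923346818],
)
print(round(score, 6))  # 0.359321
```

Because the penalty uses the absolute difference, the order of the two group means does not affect the result.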

### Evaluation on test dataset using the fine-tuned model

In [18]:
predicted_labels = []
test_prediction = pd.DataFrame({"ID": pd.Series(dtype="int"), "label": pd.Series(dtype="str")})
batch_size = 32
# default batch size is 1, if not specified
# with higher batch size, it's easier to trigger out of memory error

for out in tqdm.tqdm(pipe_ft(KeyDataset(test_ds_raw, "prompt"), batch_size=batch_size),total=len(test_ds_raw)):
    generated_text = out[0]['translation_text']
    predicted_labels.append(generated_text)

test_prediction["ID"] = test_features["ID"]
test_prediction["label"] = predicted_labels
test_prediction.to_csv("t5_xl_finetuned_translation_submission-10142023.csv", index = False, encoding='utf-8-sig')

100%|██████████| 3000/3000 [29:03<00:00,  1.72it/s]


**Notes:**

- With batch size 48, inference hit an OOM error about halfway through (1536/3000 prompts, after 12:25).
- With batch size 32, inference completed in about 30 minutes.

Final score: 0.265392 (compared with 0.167 for the pretrained model without fine-tuning)
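
Finding a workable batch size by hand means restarting the run after every out-of-memory failure. One possible way to automate this is sketched below; it was not used in this notebook. The helper `run_with_backoff` and the stand-in `fake_inference` are hypothetical, and `MemoryError` stands in for the GPU error so that the example runs without a GPU. With PyTorch, passing `oom_error=torch.cuda.OutOfMemoryError` should work, although that has not been tested here.

```python
def run_with_backoff(run_batch, batch_size, min_batch_size=1, oom_error=MemoryError):
    """Call run_batch(batch_size), halving the batch size each time it raises oom_error.

    Returns the result together with the batch size that finally worked.
    """
    while True:
        try:
            return run_batch(batch_size), batch_size
        except oom_error:
            if batch_size <= min_batch_size:
                raise  # nothing smaller left to try
            batch_size = max(min_batch_size, batch_size // 2)


def fake_inference(batch_size):
    # Stand-in for a real inference call: pretends anything above 32 runs out of memory.
    if batch_size > 32:
        raise MemoryError("simulated out of memory")
    return f"ok with batch size {batch_size}"


result, used = run_with_backoff(fake_inference, 48)
print(result, used)  # ok with batch size 24 24
```

Each retry starts the whole run again, so predictions collected before the failure would need to be discarded or checkpointed separately.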

## Try the "text2text-generation" pipeline task type

In [16]:
pipe_ft = pipeline("text2text-generation", model = repository_id, max_length=tokenizer.model_max_length, device_map="auto")

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


### Evaluation on a small subset of the training set

In [18]:
sample_size = 96
train_ds_test = datasets.Dataset.from_pandas(training_features.head(sample_size), split="train")
train_ds_test

Dataset({
    features: ['ID', 'input_to_translate', 'label', 'gender', 'language_pair', 'prompt'],
    num_rows: 96
})

In [19]:
predicted_labels = []
prediction = pd.DataFrame({"ID": pd.Series(dtype="int"),
                   "predicted_label": pd.Series(dtype="str")})
batch_size = 48
# default batch size is 1, if not specified
# with higher batch size, it's easier to trigger out of memory error

for out in tqdm.tqdm(pipe_ft(KeyDataset(train_ds_test, "prompt"), batch_size=batch_size),total=len(train_ds_test)):
    generated_text = out[0]['generated_text']
    predicted_labels.append(generated_text)

  1%|          | 1/96 [00:26<41:21, 26.12s/it]
100%|██████████| 96/96 [00:34<00:00,  2.80it/s]



In [20]:
prediction["ID"] = training_features.iloc[0:sample_size]["ID"]
prediction["predicted_label"] = predicted_labels

In [21]:
bleu_custom(
    training_features.iloc[0:sample_size]["label"], 
    prediction["predicted_label"], 
    training_features.iloc[0:sample_size]["gender"]
)

Overall mean: 0.3452952905976548
Different genders: [0.3550664310052024, 0.33101439307893155]
Final score: 0.33326927163451936


0.33326927163451936

### Evaluation on the test set

In [22]:
predicted_labels = []
test_prediction = pd.DataFrame({"ID": pd.Series(dtype="int"), "label": pd.Series(dtype="str")})
batch_size = 32
# default batch size is 1, if not specified
# with higher batch size, it's easier to trigger out of memory error

for out in tqdm.tqdm(pipe_ft(KeyDataset(test_ds_raw, "prompt"), batch_size=batch_size),total=len(test_ds_raw)):
    generated_text = out[0]['generated_text']
    predicted_labels.append(generated_text)

test_prediction["ID"] = test_features["ID"]
test_prediction["label"] = predicted_labels
test_prediction.to_csv("t5_xl_finetuned_text_submission-10142023.csv", index = False, encoding='utf-8-sig')

100%|██████████| 3000/3000 [14:51<00:00,  3.37it/s]


With batch size 32, inference took about 15 minutes.

Final score: 0.245683

**Observation:** the "text2text-generation" task type scores lower than the "translation" task type used above (0.245683 vs. 0.265392), although inference is about twice as fast (15 minutes vs. 30 minutes).

# Fine-tuning with DeepSpeed

Reference: 

https://www.philschmid.de/fine-tune-flan-t5-deepspeed

https://github.com/philschmid/deep-learning-pytorch-huggingface/blob/main/training/configs/ds_flan_t5_z3_config_bf16.json

https://github.com/philschmid/deep-learning-pytorch-huggingface/blob/main/training/scripts/run_seq2seq_deepspeed.py

The DeepSpeed script and configuration file used below are based on these references.

In [3]:
from huggingface_hub import notebook_login

notebook_login()


The command to fine-tune the model with DeepSpeed:

```
deepspeed --num_gpus=8 run_seq2seq_deepspeed.py \
    --model_id google/flan-t5-xl \
    --repository_id delmeng/flan-t5-xl-finetuning-translation-ds \
    --dataset_path training_data \
    --epochs 3 \
    --per_device_train_batch_size 96 \
    --per_device_eval_batch_size 96 \
    --generation_max_length 273 \
    --lr 1e-4 \
    --deepspeed deepspeed_config.json
```

(This training job took around 3.5 hours.)

## Use the fine-tuned model for inference

After training, the model was uploaded to my Hugging Face repository, so it can be downloaded from there and used for inference.

https://huggingface.co/delmeng/flan-t5-xl-finetuning-translation-ds/tree/main

In [13]:
repository_id = "delmeng/flan-t5-xl-finetuning-translation-ds"
tokenizer = T5Tokenizer.from_pretrained(repository_id)
model = T5ForConditionalGeneration.from_pretrained(repository_id, device_map="auto")

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [21]:
tokenizer.model_max_length

512

In [14]:
pipe_ft = pipeline("translation", model = repository_id, max_length=tokenizer.model_max_length, device_map="auto")

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


### Evaluation on a small subset of the training set

In [16]:
sample_size = 96
train_ds_test = datasets.Dataset.from_pandas(training_features.head(sample_size), split="train")
train_ds_test

Dataset({
    features: ['ID', 'input_to_translate', 'label', 'gender', 'language_pair', 'prompt'],
    num_rows: 96
})

In [None]:
predicted_labels = []
prediction = pd.DataFrame({"ID": pd.Series(dtype="int"),
                   "predicted_label": pd.Series(dtype="str")})
batch_size = 8

for out in tqdm.tqdm(pipe_ft(KeyDataset(train_ds_test, "prompt"), batch_size=batch_size),total=len(train_ds_test)):

    generated_text = out[0]['translation_text']
    predicted_labels.append(generated_text)

  0%|          | 0/96 [00:00<?, ?it/s]

Several batch sizes were tried here; a batch size of 8 was a good choice.

batch = 48: far too slow, so the run was abandoned  
batch = 16: far too slow, so the run was abandoned  
batch = 8: 3min 48s  100%|██████████| 96/96 [03:48<00:00,  2.38s/it]  
batch = 1: about 4min 100%|██████████| 96/96 [04:03<00:00,  2.53s/it]  


In [18]:
prediction["ID"] = training_features.iloc[0:sample_size]["ID"]
prediction["predicted_label"] = predicted_labels

In [19]:
bleu_custom(
    training_features.iloc[0:sample_size]["label"], 
    prediction["predicted_label"], 
    training_features.iloc[0:sample_size]["gender"]
)

Overall mean: 0.37039246550402644
Different genders: [0.37006333542410835, 0.37087350177467604]
Final score: 0.3699873823287426


0.3699873823287426

### Evaluation on the test set

In [None]:
# Note: this did not work; it got stuck at about 6% and never finished.

predicted_labels = []
test_prediction = pd.DataFrame({"ID": pd.Series(dtype="int"), "label": pd.Series(dtype="str")})
batch_size = 8

for out in tqdm.tqdm(pipe_ft(KeyDataset(test_ds_raw, "prompt"), batch_size=batch_size),total=len(test_ds_raw)):
    generated_text = out[0]['translation_text']
    predicted_labels.append(generated_text)


  6%|▌         | 177/3000 [08:04<1:41:23,  2.15s/it]

In [16]:
predicted_labels = []
test_prediction = pd.DataFrame({"ID": pd.Series(dtype="int"), "label": pd.Series(dtype="str")})

for input_text in tqdm.tqdm(KeyDataset(test_ds_raw, "prompt")):
    generated_text = pipe_ft(input_text)[0]['translation_text']
    predicted_labels.append(generated_text)

test_prediction["ID"] = test_features["ID"]
test_prediction["label"] = predicted_labels
test_prediction.to_csv("t5_xl_finetuned_translation_ds-10142023.csv", index = False, encoding='utf-8-sig')

100%|██████████| 3000/3000 [2:30:17<00:00,  3.01s/it]  



The inference took 2.5 hours.

Final score: 0.262809

**Observation:** fine-tuning with DeepSpeed did not help in my case. In fact, it slowed things down: training took around 3.5 hours instead of about 40 minutes, and inference on the test set took 2.5 hours instead of about 30 minutes. This may be due to a configuration issue. The overall translation performance is similar to the run without DeepSpeed (0.262809 vs. 0.265392).