# Machine translation with a pretrained FLAN-T5 model

This notebook provides an example solution for a machine translation task. It uses a large language model, [google/flan-t5-xl](https://huggingface.co/google/flan-t5-xl) (3B parameters) from the Hugging Face platform, to translate text from English into multiple target languages.

Compute resource: Amazon SageMaker ml.g4dn.xlarge

First, install and import libraries.

In [1]:
!pip3 install -q ipykernel==6.22.0
!pip3 install -q torch==2.0.1
!pip3 install -q transformers==4.28.1
!pip3 install -q bitsandbytes==0.39.0
!pip3 install -q peft==0.3.0
!pip3 install -q pytest==7.3.2
!pip3 install -q datasets==2.10.0
!pip3 install -q sentencepiece
!pip3 install -q accelerate
!pip3 install -q nltk

In [1]:
# Import libraries
import pandas as pd
import torch
from transformers import pipeline, AutoTokenizer, T5Tokenizer, T5ForConditionalGeneration
from transformers.pipelines.pt_utils import KeyDataset
import tqdm
import datasets
from datasets import Dataset
import os
import numpy as np
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

os.environ["TOKENIZERS_PARALLELISM"] = "true"

BLEU = 'bleu'

language_mapping = {"es":"Spanish", "de":"German", "fr": "French", "it":"Italian", "pt":"Portuguese"}

In [2]:
torch.cuda.is_available()

True

In [3]:
seed = 100
torch.manual_seed(seed)
torch.backends.cudnn.deterministic = True
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(seed)

## Load Pretrained Model from Hugging Face

In [4]:
model_id = "google/flan-t5-xl" # Hugging Face Model Id
tokenizer = T5Tokenizer.from_pretrained(model_id)
model = T5ForConditionalGeneration.from_pretrained(model_id, device_map="auto")

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

## Load dataset:

The dataset has the following columns: 
- `ID`
- `input_to_translate`: the source sentence in English
- `label`: the translation reference in the target language
- `gender`: f(emale) or m(ale)
- `language_pair`: `<source>_<target>`, such as en_fr for English to French
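The `language_pair` convention can be unpacked with a small helper. The sketch below is illustrative only: `split_language_pair` and `LANGUAGE_NAMES` are hypothetical names, and the language table is copied from the `language_mapping` dictionary defined in the import cell.

```python
# Language codes and names, copied from the notebook's language_mapping.
LANGUAGE_NAMES = {"es": "Spanish", "de": "German", "fr": "French",
                  "it": "Italian", "pt": "Portuguese"}


def split_language_pair(language_pair):
    """Split '<source>_<target>' into (source, target) codes.

    Hypothetical helper for illustration; raises ValueError on malformed input.
    """
    parts = language_pair.split("_")
    if len(parts) != 2 or not all(parts):
        raise ValueError(f"expected '<source>_<target>', got {language_pair!r}")
    return parts[0], parts[1]


source, target = split_language_pair("en_fr")
target_name = LANGUAGE_NAMES[target]
print(source, target, target_name)
```

Raising on malformed input, rather than silently indexing, makes a bad row in the CSV fail early instead of producing a misleading prompt.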

In [5]:
training_features = pd.read_csv("data/training.csv", encoding="utf-8-sig")
training_features.head(2)

Unnamed: 0,ID,input_to_translate,label,gender,language_pair
0,0,She started training for the biathlon in 2003.,Comenzó a entrenar para el biatlón en 2003.,f,en_es
1,1,He joined Philippine Airlines as a trainee pil...,Er wurde Flugschüler bei Philippine Airlines u...,m,en_de


In [6]:
def generate_prompt(x):
    language_mapping = {"es":"Spanish", "de":"German", "fr": "French", "it":"Italian", "pt":"Portuguese"}
    source_text = x["input_to_translate"]
    language = x["language_pair"].split('_')[1]
    input_text = f"Translate the following sentence from English to {language_mapping[language]}: \"{source_text}\" "
    return input_text

In [7]:
training_features["prompt"] = training_features.apply(generate_prompt, axis=1)
training_features.head(2)

Unnamed: 0,ID,input_to_translate,label,gender,language_pair,prompt
0,0,She started training for the biathlon in 2003.,Comenzó a entrenar para el biatlón en 2003.,f,en_es,Translate the following sentence from English ...
1,1,He joined Philippine Airlines as a trainee pil...,Er wurde Flugschüler bei Philippine Airlines u...,m,en_de,Translate the following sentence from English ...


#### Check the generated prompt:

In [8]:
training_features.iloc[0]["prompt"]

'Translate the following sentence from English to Spanish: "She started training for the biathlon in 2003." '

In [10]:
training_features.iloc[1]["prompt"]

'Translate the following sentence from English to German: "He joined Philippine Airlines as a trainee pilot, and was later pirated by Boeing." '

#### Load and generate prompt for test set

The test set is similar to the training set, except that it lacks the `label` column.

In [13]:
test_features = pd.read_csv("data/test_features.csv", encoding="utf-8-sig")
list(test_features)

['ID', 'input_to_translate', 'gender', 'language_pair']

In [12]:
test_features["prompt"] = test_features.apply(generate_prompt, axis=1)
list(test_features)

['ID', 'input_to_translate', 'gender', 'language_pair', 'prompt']

### Try to use the model to translate input text:

In [9]:
input_text = training_features.iloc[1]["prompt"]
input_ids = tokenizer(input_text, padding=True, truncation=True, max_length=512, return_tensors="pt").input_ids.to("cuda")

outputs = model.generate(input_ids, max_length=1024)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

"Er kam als Flugschüler zu Philippine Airlines und wurde später von Boeing raubkopiert."


In [14]:
input_text

'Translate the following sentence from English to German: "He joined Philippine Airlines as a trainee pilot, and was later pirated by Boeing." '

In [21]:
# reference translation
training_features.iloc[1]["label"]

'Er wurde Flugschüler bei Philippine Airlines und wurde später von Boeing abgeworben.'

Translate a few sentences:

In [24]:
%%time
# Cell magic: a bare `%time` line only times an empty statement, not the loop below.

language_mapping = {"es":"Spanish", "de":"German", "fr": "French", "it":"Italian", "pt":"Portuguese"}
predicted_labels = []
prediction = pd.DataFrame({"ID": pd.Series(dtype="int"),
                   "predicted_label": pd.Series(dtype="str")})

sample_size = 10

for row in tqdm.tqdm(range(sample_size)):
    sentence_id = training_features.iloc[row]["ID"]
    input_text = training_features.iloc[row]["prompt"]
    input_ids = tokenizer(input_text, padding=True, truncation=True, max_length=512, return_tensors="pt").input_ids.to("cuda")

    outputs = model.generate(input_ids, max_length=1024)
    generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    predicted_labels.append(generated_text)


100%|██████████| 10/10 [10:33<00:00, 63.32s/it]


## Use Hugging Face pipeline instead:

In [13]:
pipe = pipeline("translation", model = model, tokenizer = tokenizer)



### Translate one sentence:

In [10]:
pipe('Translate the following sentence from English to German: "He joined Philippine Airlines as a trainee pilot, and was later pirated by Boeing." ')



[{'translation_text': '"Er kam als Flugschüler zu Philippine Airlines und wurde später von Boeing raubkopiert."'}]
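The pipeline returns a list with one dictionary per input, and here the translation keeps the double quotes that the prompt wrapped around the source sentence. The reference labels shown earlier are not quoted, so the extra quote characters may lower a whitespace-tokenized BLEU score. A minimal sketch of one possible post-processing step follows; `strip_wrapping_quotes` is a hypothetical helper that is not part of this notebook, and whether it improves the score here is untested.

```python
def strip_wrapping_quotes(text):
    """Remove one pair of matching double quotes around text, if present."""
    stripped = text.strip()
    if len(stripped) >= 2 and stripped[0] == '"' and stripped[-1] == '"':
        return stripped[1:-1].strip()
    return stripped


# Output shaped like the pipeline result shown above.
result = [{"translation_text": '"Er kam als Flugschüler zu Philippine Airlines '
                               'und wurde später von Boeing raubkopiert."'}]
clean = strip_wrapping_quotes(result[0]["translation_text"])
print(clean)
```

Only a matching pair at both ends is removed, so quotation marks inside a sentence are left alone.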

### Translate a few sentences:

In [14]:
language_mapping = {"es":"Spanish", "de":"German", "fr": "French", "it":"Italian", "pt":"Portuguese"}
predicted_labels = []
prediction = pd.DataFrame({"ID": pd.Series(dtype="int"),
                   "predicted_label": pd.Series(dtype="str")})

sample_size = 10

for row in tqdm.tqdm(range(sample_size)):
    sentence_id = training_features.iloc[row]["ID"]
    input_text = training_features.iloc[row]["prompt"]

    generated_text = pipe(input_text)[0]['translation_text']
    predicted_labels.append(generated_text)


100%|██████████| 10/10 [00:29<00:00,  2.91s/it]


**The pipeline is much faster here: about 2.9 s per sentence, versus about 63 s per sentence for the manual `generate` loop above.**
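The two progress bars above report 63.32 s/it for the manual loop and 2.91 s/it for the pipeline. The arithmetic behind the comparison can be sketched as below; the helper names and the 1000-sentence run are hypothetical, and both rates come from only 10 sentences, so the projection is rough.

```python
def speedup(seconds_per_item_before, seconds_per_item_after):
    """Return how many times faster the second measurement is."""
    return seconds_per_item_before / seconds_per_item_after


def format_duration(total_seconds):
    """Format a duration in seconds as H:MM:SS."""
    total = int(round(total_seconds))
    hours, remainder = divmod(total, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours}:{minutes:02d}:{seconds:02d}"


manual_rate = 63.32    # s/it, manual generate() loop (progress bar above)
pipeline_rate = 2.91   # s/it, pipeline loop (progress bar above)

ratio = speedup(manual_rate, pipeline_rate)
print(f"speedup: {ratio:.1f}x")

# Projected wall time for a hypothetical run of 1000 sentences at each rate.
n_sentences = 1000
print("manual:  ", format_duration(n_sentences * manual_rate))
print("pipeline:", format_duration(n_sentences * pipeline_rate))
```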

In [16]:
prediction["ID"] = training_features.iloc[0:sample_size]["ID"]
prediction["predicted_label"] = predicted_labels

### Calculate BLEU score on training data:

In [14]:
def bleu_func(x, y):
    chencherry = SmoothingFunction()
    x_split = [x_entry.strip().split() for x_entry in x]
    y_split = y.strip().split()
    return sentence_bleu(x_split, y_split, smoothing_function=chencherry.method3)

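def bleu_custom(y_true, y_pred, groups):
    # Align references, predictions and group labels by index
    joined = pd.concat([y_true, y_pred, groups], axis=1)
    joined[BLEU] = joined.apply(lambda x: bleu_func([x[y_true.name]], x[y_pred.name]), axis=1)
    # Distinct group values; the score below assumes exactly two groups (here 'f' and 'm')
    unique_list = sorted(groups.unique())
    values = [joined[joined[groups.name] == unique][BLEU].mean() for unique in unique_list]
    # Final score: overall mean BLEU minus half the absolute gap between the two groups
    final_score = joined[BLEU].mean() - np.fabs(values[0] - values[1])/2
    print(f"Overall mean: {joined[BLEU].mean()}")
    print(f"Different genders: {values}")
    print(f"Final score: {final_score}")
    return final_score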

In [18]:
bleu_custom(
    training_features.iloc[0:sample_size]["label"], 
    prediction["predicted_label"], 
    training_features.iloc[0:sample_size]["gender"]
)

Overall mean: 0.2131333021255633
Different genders: [0.22161764199499698, 0.20040679232141279]
Final score: 0.2025278772887712


0.2025278772887712

On this 10-sentence sample, the gap between the two gender groups is small: mean BLEU of about 0.222 versus 0.200.
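The final score is the overall mean BLEU minus half the absolute gap between the two group means, so a larger gender gap lowers the score. A minimal worked example using the values printed above (`penalized_score` is a hypothetical name that mirrors the last line of `bleu_custom`):

```python
def penalized_score(overall_mean, group_means):
    """Overall mean minus half the absolute gap between two group means.

    Mirrors the final line of bleu_custom for exactly two groups.
    """
    if len(group_means) != 2:
        raise ValueError("this sketch handles exactly two groups")
    gap = abs(group_means[0] - group_means[1])
    return overall_mean - gap / 2


# Values printed by bleu_custom on the 10-sentence sample.
overall = 0.2131333021255633
by_gender = [0.22161764199499698, 0.20040679232141279]

score = penalized_score(overall, by_gender)
print(f"gap:   {abs(by_gender[0] - by_gender[1]):.4f}")
print(f"score: {score:.4f}")
```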

## Use Hugging Face Dataset 


https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.pipeline

https://huggingface.co/docs/transformers/pipeline_tutorial

https://github.com/huggingface/transformers/issues/22387

In [15]:
sample_size = 10
train_ds_raw = datasets.Dataset.from_pandas(training_features.head(sample_size), split="train")
train_ds_raw

Dataset({
    features: ['ID', 'input_to_translate', 'label', 'gender', 'language_pair', 'prompt'],
    num_rows: 10
})

### Streaming and batching using pipeline

In [15]:
predicted_labels = []
prediction = pd.DataFrame({"ID": pd.Series(dtype="int"),
                   "predicted_label": pd.Series(dtype="str")})
batch_size = 2
# default batch size is 1, if not specified
# with higher batch size, it's easier to trigger out of memory error

for out in tqdm.tqdm(pipe(KeyDataset(train_ds_raw, "prompt"), batch_size=batch_size),total=len(train_ds_raw)):
# for out in pipe(KeyDataset(train_ds_raw, "prompt")):
# for out in tqdm.tqdm(pipe(KeyDataset(train_ds_raw, "prompt"))):

    #print(out)
    generated_text = out[0]['translation_text']
    predicted_labels.append(generated_text)

 10%|█         | 1/10 [00:05<00:50,  5.65s/it]
 30%|███       | 3/10 [00:08<00:17,  2.49s/it]
 50%|█████     | 5/10 [00:15<00:15,  3.10s/it]
 70%|███████   | 7/10 [00:18<00:07,  2.39s/it]
100%|██████████| 10/10 [00:27<00:00,  2.78s/it]



**Overall time is similar to the loop without a Dataset: about 2.78 s per sentence here versus 2.91 s there.**
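Conceptually, `batch_size=2` means prompts are grouped before each forward pass, which is why the progress bar above advances by more than one item at a time. A simplified pure-Python sketch of that grouping follows. It is an illustration only, not the pipeline's actual implementation, and `batched` is a hypothetical helper.

```python
def batched(items, batch_size):
    """Yield consecutive lists of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # final, possibly shorter, batch
        yield batch


prompts = [f"prompt {i}" for i in range(5)]
for batch in batched(prompts, batch_size=2):
    print(len(batch), batch)
```

Sequences in a batch are padded to the longest one, so a single long prompt raises the memory needed for the whole batch. This is one reason larger batch sizes trigger out-of-memory errors more easily.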

In [17]:
prediction["ID"] = training_features.iloc[0:sample_size]["ID"]
prediction["predicted_label"] = predicted_labels

In [18]:
bleu_custom(
    training_features.iloc[0:sample_size]["label"], 
    prediction["predicted_label"], 
    training_features.iloc[0:sample_size]["gender"]
)

Overall mean: 0.2131333021255633
Different genders: [0.22161764199499698, 0.20040679232141279]
Final score: 0.2025278772887712


0.2025278772887712

## Translation on test dataset with pretrained model

In [16]:
test_ds_raw = datasets.Dataset.from_pandas(test_features, split="test")
test_ds_raw

Dataset({
    features: ['ID', 'input_to_translate', 'gender', 'language_pair', 'prompt'],
    num_rows: 3000
})

In [17]:
predicted_labels = []
test_prediction = pd.DataFrame({"ID": pd.Series(dtype="int"), "label": pd.Series(dtype="str")})
batch_size = 1
# default batch size is 1, if not specified
# with higher batch size, it's easier to trigger out of memory error

for out in tqdm.tqdm(pipe(KeyDataset(test_ds_raw, "prompt"), batch_size=batch_size),total=len(test_ds_raw)):
    generated_text = out[0]['translation_text']
    predicted_labels.append(generated_text)

test_prediction["ID"] = test_features["ID"]
test_prediction["label"] = predicted_labels
test_prediction.to_csv("t5_xl_pretrain-10102023.csv", index = False, encoding='utf-8-sig')

100%|██████████| 3000/3000 [2:20:48<00:00,  2.82s/it]  


With `batch_size = 2`, the run hit an out-of-memory (OOM) error at 55% (1646/3000).

With `batch_size = 1`, the run completed: 3000/3000 in 2:20:48 (2.82 s/it).
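One way to try a larger batch while surviving an occasional out-of-memory error is to retry the failing batch with a smaller size. The sketch below is a hedged illustration and not what this notebook ran: `run_with_fallback` and `fake_translate` are hypothetical, and `MemoryError` stands in for the GPU error type (with PyTorch one would likely pass `torch.cuda.OutOfMemoryError`).

```python
def run_with_fallback(items, process_batch, batch_size, oom_errors=(MemoryError,)):
    """Process items in batches, halving the batch size after an out-of-memory error.

    process_batch: callable taking a list of items and returning a list of results.
    oom_errors: exception types treated as out-of-memory (MemoryError is a stand-in).
    """
    results = []
    start = 0
    while start < len(items):
        chunk = items[start:start + batch_size]
        try:
            results.extend(process_batch(chunk))
        except oom_errors:
            if batch_size == 1:
                raise  # cannot shrink further
            batch_size = max(1, batch_size // 2)
            continue  # retry the same position with a smaller batch
        start += len(chunk)
    return results


# Toy stand-in for the model: fails whenever a batch holds more than one item.
def fake_translate(batch):
    if len(batch) > 1:
        raise MemoryError("simulated out-of-memory")
    return [text.upper() for text in batch]


print(run_with_fallback(["a", "b", "c"], fake_translate, batch_size=2))
```

Results gathered before a failure stay in `results`, so only the failing batch is retried. This sketch never grows the batch size back, which keeps the logic simple at the cost of throughput after the first failure.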

Final score: 0.167153