In [1]:
from IPython.display import clear_output

In [2]:
import wandb

wandb.init(mode="offline")

In this homework, you will build a bidirectional translation model (i.e., it should translate English to Arabic as well as Arabic to English) using the T5-small architecture. You will use the Hugging Face transformers library for this.

TODO:
1. Create a combined tokenization model for English and Arabic using [sentencepiece](https://github.com/google/sentencepiece/blob/master/python/README.md). Don't forget to set pad_id, unk_id, bos_id, and eos_id. Choose an appropriate vocabulary size.
2. Split the data into a 90:10 train and test set. Make it compatible with the Hugging Face transformers Trainer and with bidirectional training.
3. Load the sentencepiece tokenizer into the T5 tokenizer in Hugging Face. Study how you can do this.
4. Load an untrained T5-small model from Hugging Face transformers. You will need to specify your own vocab size for the embedding layer. You can also make more changes if you want.
5. Train this model on your training data using the Hugging Face transformers Trainer. Because you will be training it in both directions, you can use task descriptors such as "Translate from English to Arabic: ..." and "Translate from Arabic to English: ..." before your input sentence.
6. Evaluate your model on the test data and calculate the [BLEU score](https://huggingface.co/spaces/evaluate-metric/bleu). You can use the evaluate module as specified in the link.
7. Show some example inputs, true translations, and generated translations from the test data. Do this for both English to Arabic and Arabic to English.
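
Steps 2 and 5 can be sketched in plain Python before any library is involved. The snippet below is only an illustration under stated assumptions: the helper names (`make_bidirectional`, `split_pairs`), the `source`/`target` field names, and the exact prefix strings are hypothetical choices, not part of the assignment. It splits the sentence pairs 90:10 first and only then expands each pair into both directions, so the same pair never appears in both train and test.

```python
import random

# Hypothetical task descriptors, following the wording suggested in step 5.
PREFIXES = {
    "en2ar": "Translate from English to Arabic: ",
    "ar2en": "Translate from Arabic to English: ",
}

def make_bidirectional(pairs):
    """Expand (english, arabic) pairs into one example per translation direction."""
    examples = []
    for english, arabic in pairs:
        examples.append({"source": PREFIXES["en2ar"] + english, "target": arabic})
        examples.append({"source": PREFIXES["ar2en"] + arabic, "target": english})
    return examples

def split_pairs(pairs, test_fraction=0.1, seed=42):
    """Shuffle and split the pairs before expansion so train and test stay disjoint."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# A few sentence pairs in the tab-separated format of arabic_english.txt.
pairs = [
    ("Hi.", "مرحبًا."), ("Run!", "اركض!"), ("Help!", "النجدة!"),
    ("Jump!", "اقفز!"), ("Stop!", "قف!"), ("Go on.", "داوم."),
    ("Go on.", "استمر."), ("Hello!", "مرحباً."), ("Hurry!", "تعجّل!"),
    ("Hurry!", "استعجل!"),
]

train_pairs, test_pairs = split_pairs(pairs)
train_examples = make_bidirectional(train_pairs)
test_examples = make_bidirectional(test_pairs)
print(len(train_examples), len(test_examples))
```

The resulting lists of dicts could then be handed to `Dataset.from_list` or a DataFrame; that step is left out here to keep the sketch dependency-free.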



If the code below fails to download the data due to usage restrictions, download it directly from [here](https://drive.google.com/file/d/1APYsNu_geYk8d9vkI1e3EkLSTi4YPWDq/view?usp=sharing) and upload it to your runtime.

In [3]:
%pip install gdown

clear_output()

In [4]:
import gdown
gdown.download('https://drive.google.com/uc?export=download&id=1APYsNu_geYk8d9vkI1e3EkLSTi4YPWDq')

Downloading...
From: https://drive.google.com/uc?export=download&id=1APYsNu_geYk8d9vkI1e3EkLSTi4YPWDq
To: /kaggle/working/arabic_english.txt
100%|██████████| 6.50M/6.50M [00:00<00:00, 61.2MB/s]


'arabic_english.txt'

In [5]:
# !gdown "1APYsNu_geYk8d9vkI1e3EkLSTi4YPWDq"

In [6]:
# %pip install datasets
# %pip install sentencepiece
# %pip install transformers
%pip install evaluate

clear_output()

In [7]:
import torch
from datasets import Dataset
import sentencepiece as spm
import numpy as np

In [8]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'

In [9]:
from sklearn.model_selection import train_test_split
import pandas as pd
import random

In [10]:
from transformers import T5Tokenizer, T5Config, T5ForConditionalGeneration, Trainer, TrainingArguments, DataCollatorForSeq2Seq

2024-08-03 22:24:15.582137: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-08-03 22:24:15.582245: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-08-03 22:24:15.754935: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered


In [11]:
file_path = "arabic_english.txt"

with open(file_path, 'r', encoding='utf-8') as file:  # explicit encoding: the file contains Arabic text
    lines = file.readlines()

data = [line.strip().split('\t') for line in lines]
df = pd.DataFrame(data, columns=['English', 'Arabic'])

df.head(10)

Unnamed: 0,English,Arabic
0,Hi.,مرحبًا.
1,Run!,اركض!
2,Help!,النجدة!
3,Jump!,اقفز!
4,Stop!,قف!
5,Go on.,داوم.
6,Go on.,استمر.
7,Hello!,مرحباً.
8,Hurry!,تعجّل!
9,Hurry!,استعجل!


In [12]:
len(df)

24638

In [72]:
train_dataset, test_dataset = train_test_split(df.head(10000), test_size=0.1, random_state=42)  # first 10,000 rows only, to stay within CUDA memory
train_dataset, val_dataset = train_test_split(train_dataset, test_size=0.1, random_state=42)

In [73]:
train_dataset = Dataset.from_pandas(train_dataset)
val_dataset = Dataset.from_pandas(val_dataset)
test_dataset = Dataset.from_pandas(test_dataset)

In [74]:
print(train_dataset.shape)

(8100, 3)


In [75]:
with open('combined_text.txt', 'w', encoding='utf-8') as f:
    for data in train_dataset:
        f.write(data['English'] + '\n')
        f.write(data['Arabic'] + '\n')

In [76]:
spm.SentencePieceTrainer.train(input='combined_text.txt', model_prefix='m', vocab_size=3000, pad_id=0, unk_id=1, bos_id=2, eos_id=3)

sentencepiece_trainer.cc(78) LOG(INFO) Starts training with : 
trainer_spec {
  input: combined_text.txt
  input_format: 
  model_prefix: m
  model_type: UNIGRAM
  vocab_size: 3000
  self_test_sample_size: 0
  character_coverage: 0.9995
  input_sentence_size: 0
  shuffle_input_sentence: 1
  seed_sentencepiece_size: 1000000
  shrinking_factor: 0.75
  max_sentence_length: 4192
  num_threads: 16
  num_sub_iterations: 2
  max_sentencepiece_length: 16
  split_by_unicode_script: 1
  split_by_number: 1
  split_by_whitespace: 1
  split_digits: 0
  pretokenization_delimiter: 
  treat_whitespace_as_suffix: 0
  allow_whitespace_only_pieces: 0
  required_chars: 
  byte_fallback: 0
  vocabulary_output_piece_score: 1
  train_extremely_large_corpus: 0
  seed_sentencepieces_file: 
  hard_vocab_limit: 1
  use_all_vocab: 0
  unk_id: 1
  bos_id: 2
  eos_id: 3
  pad_id: 0
  unk_piece: <unk>
  bos_piece: <s>
  eos_piece: </s>
  pad_piece: <pad>
  unk_surface:  ⁇ 
  enable_differential_privacy: 0
  differen

In [77]:
tokenizer = T5Tokenizer.from_pretrained('m.model')



In [78]:
def find_max_token_length(dataset, tokenizer):
    max_length = 0
    for data in dataset:
        for key in ('English', 'Arabic'):
            text = data[key]
            tokens = tokenizer.encode(text)
            max_length = max(max_length, len(tokens))
    return max_length

max_length = find_max_token_length(train_dataset, tokenizer)
print(max_length)

31


In [79]:
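def tokenize_function(batch):
    # NOTE: with batched=True this function runs once per batch, so the random
    # direction chosen below is shared by every example in that batch.
    src_language = 'English'
    tgt_language = 'Arabic'

    if random.randint(0, 1) == 1:
        src_language, tgt_language = tgt_language, src_language

    # Prepend the task descriptor to every source sentence
    prompts = [f'Translate from {src_language} to {tgt_language} ' + text for text in batch[src_language]]
    targets = list(batch[tgt_language])

#     batch['src_language'] = [src_language] * len(prompts)
#     batch['prompt'] = prompts

    # NOTE: max_length was measured on the raw sentences, so a prompt that exceeds it
    # after the descriptor is added gets truncated.
    inputs = tokenizer(prompts, max_length=max_length, padding='max_length', truncation=True, return_tensors='pt')
    targets = tokenizer(targets, max_length=max_length, padding='max_length', truncation=True, return_tensors='pt')

    batch['input_ids'] = inputs['input_ids']
    batch['attention_mask'] = inputs['attention_mask']
    # NOTE: padded label positions keep pad_token_id (0), so the loss also counts them;
    # replacing them with -100 would make the loss ignore padding.
    batch['labels'] = targets['input_ids']

    return batch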
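
The labels produced by this cell are padded with the pad id (0, as set when training the sentencepiece model). PyTorch's cross-entropy loss skips positions whose label is -100 by default, so a common adjustment is to swap the pad id for -100 in the labels only. The following is a minimal sketch of that idea on plain lists; the helper name `mask_padding` is hypothetical and the token ids are just sample values.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss ignores by default

def mask_padding(label_ids, pad_token_id=0):
    """Replace padding positions in each label sequence with IGNORE_INDEX."""
    return [
        [tok if tok != pad_token_id else IGNORE_INDEX for tok in seq]
        for seq in label_ids
    ]

# Sample padded label sequences (pad id 0, </s> id 3).
labels = [
    [146, 65, 35, 1269, 54, 2552, 4, 3, 0, 0, 0],
    [45, 591, 1245, 18, 10, 3, 0, 0, 0, 0, 0],
]

masked = mask_padding(labels)
print(masked[0])
```

Only the labels should be masked this way; `input_ids` keep the pad id, and the attention mask already tells the encoder which input positions to skip.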


In [80]:
tokenized_train_dataset = train_dataset.map(tokenize_function, batched=True, remove_columns=train_dataset.column_names)
tokenized_val_dataset = val_dataset.map(tokenize_function, batched=True, remove_columns=val_dataset.column_names)
tokenized_test_dataset = test_dataset.map(tokenize_function, batched=True, remove_columns=test_dataset.column_names)

Map:   0%|          | 0/8100 [00:00<?, ? examples/s]

Map:   0%|          | 0/900 [00:00<?, ? examples/s]

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

In [81]:
print(tokenized_train_dataset[0])

{'input_ids': [5, 2998, 2601, 206, 325, 13, 2899, 86, 178, 2532, 109, 1342, 4, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'labels': [146, 65, 35, 1269, 54, 2552, 4, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}


In [82]:
pd.DataFrame(tokenized_train_dataset[:10]).head()

Unnamed: 0,input_ids,attention_mask,labels
0,"[5, 2998, 2601, 206, 325, 13, 2899, 86, 178, 2...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, ...","[146, 65, 35, 1269, 54, 2552, 4, 3, 0, 0, 0, 0..."
1,"[5, 2998, 2601, 206, 325, 13, 2899, 94, 246, 8...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[1642, 117, 127, 531, 817, 4, 3, 0, 0, 0, 0, 0..."
2,"[5, 2998, 2601, 206, 325, 13, 2899, 91, 7, 8, ...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[45, 591, 1245, 18, 10, 3, 0, 0, 0, 0, 0, 0, 0..."
3,"[5, 2998, 2601, 206, 325, 13, 2899, 90, 38, 21...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[208, 51, 369, 14, 23, 20, 396, 406, 710, 37, ..."
4,"[5, 2998, 2601, 206, 325, 13, 2899, 24, 17, 5,...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[28, 81, 44, 18, 154, 574, 4, 3, 0, 0, 0, 0, 0..."


In [83]:
torch.tensor(tokenized_train_dataset[0]['input_ids']).shape

torch.Size([31])

In [84]:
idx = 0
row = tokenized_test_dataset[idx]

# print(f'{row["src_language"]=}')
# print(row['Arabic'])
# print(row['English'])

print(tokenizer.decode(row['input_ids'], skip_special_tokens=True))
print(tokenizer.decode(row['labels'], skip_special_tokens=True))

Translate from English to Arabic I don't come here very often.
لا آتي إلى هنا كثيرا.


In [85]:
custom_config = T5Config(
    vocab_size=tokenizer.vocab_size,
    pad_token_id=tokenizer.pad_token_id,
    eos_token_id=tokenizer.eos_token_id,
    bos_token_id=tokenizer.pad_token_id,
    decoder_start_token_id=tokenizer.pad_token_id
)

In [86]:
print(tokenizer.tokenize("Test"))
print(tokenizer.tokenize("اختبار"))

['▁', 'T', 'est']
['▁اخت', 'بار']


In [87]:
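# NOTE: this loads the pretrained t5-small weights (only shared.weight is re-initialized); T5ForConditionalGeneration(custom_config) would give an untrained model.
model = T5ForConditionalGeneration.from_pretrained('t5-small', config=custom_config, ignore_mismatched_sizes=True).to(device)
# NOTE: Trainer handles multi-GPU itself; this manual DataParallel wrapper hides the model's forward signature, a likely cause of the missing validation loss below.
model = torch.nn.DataParallel(model, device_ids=[0, 1]).to(device)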

Some weights of T5ForConditionalGeneration were not initialized from the model checkpoint at t5-small and are newly initialized because the shapes did not match:
- shared.weight: found shape torch.Size([32128, 512]) in the checkpoint and torch.Size([3000, 512]) in the model instantiated
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [88]:
training_args = TrainingArguments(
    output_dir='./results',
    eval_strategy="epoch",
#     per_device_train_batch_size=4,
#     per_device_eval_batch_size=4,
    learning_rate=1e-4,
    num_train_epochs=3,
    weight_decay=0.01,
#     save_total_limit=2,
    logging_dir='./logs',
    remove_unused_columns=False,
)

In [89]:
from transformers import TrainerCallback

train_losses=[]
val_losses=[]

class CustomCallback(TrainerCallback):
    def on_log(self, args, state, control, logs=None, **kwargs):
        # `logs` is the dict that was just logged; `state.log_history` is a list of
        # such dicts, so key lookups must be done on `logs`, not on the list.
        logs = logs or {}

        # Intermediate training loss is logged under "loss"
        if "loss" in logs:
            train_losses.append(logs["loss"])

        # Validation loss is logged under "eval_loss"
        if "eval_loss" in logs:
            val_losses.append(logs["eval_loss"])
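
Because the Trainer keeps every logged dict in `trainer.state.log_history`, the loss curves can also be recovered after training without a callback. The sketch below assumes the usual entry layout (training entries carry `loss`, evaluation entries carry `eval_loss`, and each entry carries `step`); the helper name `split_losses` is hypothetical and the numbers are made-up placeholders, not results from this notebook.

```python
def split_losses(log_history):
    """Separate (step, loss) points for training and validation from a log history."""
    train_points, val_points = [], []
    for entry in log_history:
        step = entry.get("step")
        if "loss" in entry:
            train_points.append((step, entry["loss"]))
        if "eval_loss" in entry:
            val_points.append((step, entry["eval_loss"]))
    return train_points, val_points

# Made-up entries in the assumed layout; the last one mimics the end-of-training
# summary, which uses "train_loss" and is therefore skipped.
example_history = [
    {"loss": 3.4, "epoch": 1.0, "step": 500},
    {"eval_loss": 3.1, "epoch": 1.0, "step": 507},
    {"loss": 2.2, "epoch": 2.0, "step": 1000},
    {"eval_loss": 2.4, "epoch": 2.0, "step": 1014},
    {"train_loss": 2.5, "epoch": 3.0, "step": 1521},
]

train_points, val_points = split_losses(example_history)
print(train_points)
print(val_points)
```

After a real run, `split_losses(trainer.state.log_history)` would give the points to plot, with the step on the x-axis rather than the epoch.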

In [90]:
import evaluate

bleu_metric = evaluate.load('bleu')

In [91]:
def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
    # The plain Trainer passes logits of shape (batch, seq_len, vocab);
    # take the argmax over the vocabulary axis to get token ids before decoding
    if preds.ndim == 3:
        preds = np.argmax(preds, axis=-1)
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    # Replace -100 in the labels as we can't decode them
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    decoded_preds = [pred.strip() for pred in decoded_preds]
    # BLEU expects a list of reference translations per prediction
    decoded_labels = [[label.strip()] for label in decoded_labels]

    result = bleu_metric.compute(predictions=decoded_preds, references=decoded_labels)
    return {"bleu": result["bleu"]}
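
Two array manipulations sit between the raw evaluation output and `batch_decode`: turning per-position vocabulary scores into token ids, and turning -100 label entries back into the pad id. The toy example below shows both on nested lists so the shapes are easy to follow; the helper names are hypothetical and the scores are arbitrary. Note that ids obtained this way come from a teacher-forced forward pass, not from `generate`, so BLEU computed on them is likely optimistic; `Seq2SeqTrainer` with `predict_with_generate=True` is the usual way to score generated text instead.

```python
def logits_to_ids(logits):
    """Greedy choice: index of the highest score at each position of each sequence."""
    return [
        [max(range(len(scores)), key=scores.__getitem__) for scores in seq]
        for seq in logits
    ]

def unmask_labels(labels, pad_token_id=0, ignore_index=-100):
    """Put the pad id back where labels were masked, so they can be decoded."""
    return [
        [pad_token_id if tok == ignore_index else tok for tok in seq]
        for seq in labels
    ]

# One sequence, two positions, vocabulary of three tokens (arbitrary scores).
toy_logits = [[[0.1, 2.0, 0.3], [1.5, 0.2, 0.1]]]
toy_labels = [[1, -100]]

pred_ids = logits_to_ids(toy_logits)
label_ids = unmask_labels(toy_labels)
print(pred_ids, label_ids)
```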

In [92]:
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model, padding='max_length', max_length=max_length)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train_dataset,
    eval_dataset=tokenized_val_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator,
    callbacks=[CustomCallback()],
    compute_metrics=compute_metrics
)

In [93]:
trainer.train()



Epoch,Training Loss,Validation Loss
1,3.3775,No log
2,2.209,No log
3,2.0883,No log




TrainOutput(global_step=1521, training_loss=2.550937461978428, metrics={'train_runtime': 228.8892, 'train_samples_per_second': 106.165, 'train_steps_per_second': 6.645, 'total_flos': 0.0, 'train_loss': 2.550937461978428, 'epoch': 3.0})

In [97]:
sample_english = 'Translate from English to Arabic Test'
with torch.no_grad():
    encoding = tokenizer(sample_english, return_tensors='pt')
    input_ids, attn_mask = encoding.input_ids.to(device), encoding.attention_mask.to(device)
    res = model.module.generate(input_ids=input_ids, attention_mask=attn_mask)
print(tokenizer.batch_decode(res)[0])

<pad>.</s><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad>


In [98]:
import matplotlib.pyplot as plt

def plot_losses(train_losses, val_losses):
    plt.title('Loss vs Epochs')
    plt.xlabel('Epochs')
    plt.ylabel('Loss')
    plt.plot(train_losses)
    plt.plot(val_losses)
    plt.legend(['Train', 'Validation'])
    plt.show()

In [99]:
results = trainer.evaluate()
print(results)

{'eval_runtime': 5.5474, 'eval_samples_per_second': 162.239, 'eval_steps_per_second': 10.275, 'epoch': 3.0}
