<a href="https://colab.research.google.com/github/abhaysrivastav/finetuning-methods/blob/main/Chapter3_Demo1_TransferLearning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Translation from English to Spanish using Flan-T5 and Helsinki-NLP/opus-100 Dataset


## Introduction
In this notebook, we will use the Flan-T5 model to perform translation from English to Spanish. We will use the "Helsinki-NLP/opus-100" dataset from Hugging Face, specifically the en-es subset, to train and evaluate our translation model.


In [None]:
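# This notebook trains with PyTorch (Seq2SeqTrainer), so TensorFlow is not needed;
# accelerate is required by the PyTorch Trainer.
!pip install --upgrade transformers datasets accelerate sentencepiece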

In [None]:

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")


## Loading the Dataset

In [None]:

from datasets import load_dataset

# Load the Helsinki-NLP/opus-100 dataset
dataset = load_dataset('Helsinki-NLP/opus-100', 'en-es')
print(dataset['train'][0])

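Each record in this subset stores one sentence pair under a `translation` key. When the records are processed in batches, the processing function receives a dictionary of columns, so `examples['translation']` is a list of `{'en': ..., 'es': ...}` dictionaries. The sketch below illustrates that shape with two made-up sentence pairs (not taken from the dataset); `to_batch` and `build_pairs` are hypothetical helpers used only for illustration.

```python
# Two made-up records in the same shape as the en-es subset (illustrative only)
records = [
    {"translation": {"en": "Good morning.", "es": "Buenos días."}},
    {"translation": {"en": "Thank you.", "es": "Gracias."}},
]

def to_batch(rows):
    """Turn a list of records into the column-oriented batch that map(batched=True) passes."""
    return {"translation": [row["translation"] for row in rows]}

def build_pairs(batch, prefix="Translate from English to Spanish: "):
    """Return (prompt, target) pairs from a column-oriented batch."""
    return [(prefix + pair["en"], pair["es"]) for pair in batch["translation"]]

batch = to_batch(records)
pairs = build_pairs(batch)
print(pairs[0])
```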

## Data Preprocessing

In [None]:
# Preprocess the dataset for input into the model
def preprocess_data(examples):
    inputs = [f'Translate from English to Spanish: {example["en"]}' for example in examples['translation']]
    targets = [example['es'] for example in examples['translation']]

    # Tokenize inputs and targets in one call; `text_target` replaces the
    # deprecated `as_target_tokenizer()` context manager and stores the
    # target token ids under "labels"
    model_inputs = tokenizer(inputs, text_target=targets, max_length=128, truncation=True, padding="max_length")

    # Replace padding token ids in the labels with -100 so the loss ignores them
    model_inputs["labels"] = [
        [(token if token != tokenizer.pad_token_id else -100) for token in label]
        for label in model_inputs["labels"]
    ]

    # Do not set decoder_input_ids here: they must be the labels shifted one
    # position to the right, which the model (or DataCollatorForSeq2Seq)
    # derives from the labels automatically
    return model_inputs

# Apply preprocessing to the dataset
train_dataset = dataset['train'].select(range(30000)).map(preprocess_data, batched=True)
test_dataset = dataset['test'].map(preprocess_data, batched=True)

print(train_dataset[0])

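For T5-style models, the decoder input during training is the label sequence shifted one position to the right, so that at each step the decoder sees only the tokens before the one it must predict. That is why decoder inputs are normally derived from the labels rather than tokenized separately. The following is a pure-Python sketch of the idea, not the library's own code; it assumes T5's conventions (pad id 0, decoder start id 0, EOS id 1), and the token ids are made up.

```python
PAD_ID = 0            # T5's pad token id
DECODER_START_ID = 0  # T5 starts decoding from the pad token
IGNORE_INDEX = -100   # label value that the loss skips

def shift_right(labels):
    """Build decoder inputs: prepend the start token, drop the last label, unmask -100."""
    shifted = [DECODER_START_ID] + labels[:-1]
    return [PAD_ID if token == IGNORE_INDEX else token for token in shifted]

# Hypothetical token ids for a short target, followed by EOS (1) and ignored padding
labels = [8774, 6, 149, 1, -100, -100]
decoder_input_ids = shift_right(labels)
print(decoder_input_ids)  # [0, 8774, 6, 149, 1, 0]
```

At position `i` the decoder input holds the label from position `i - 1`, so the target token is never visible at the step where it is predicted.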

In [None]:
print(type(train_dataset))
print(type(test_dataset))

## Freezing the Model

In [None]:
print(model)

In [None]:
# Freeze the shared token embeddings
for param in model.shared.parameters():
    param.requires_grad = False

# Freeze the encoder
for param in model.encoder.parameters():
    param.requires_grad = False

# Freeze the decoder
for param in model.decoder.parameters():
    param.requires_grad = False

In [None]:
def params_info(model):
    total_params = 0
    trainable_params = 0
    for param in model.parameters():
        num = param.numel()
        total_params += num
        if param.requires_grad:
            trainable_params += num
    non_trainable_params = total_params - trainable_params

    def size_in_mb(param_count):
        return param_count * 4 / 1024 / 1024  # 4 bytes per param for float32

    print(f"Total params: {total_params:,} ({size_in_mb(total_params):.2f} MB)")
    print(f"Trainable params: {trainable_params:,} ({size_in_mb(trainable_params):.2f} MB)")
    print(f"Non-trainable params: {non_trainable_params:,} ({size_in_mb(non_trainable_params):.2f} MB)")

# Usage
params_info(model)

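With the shared embeddings, encoder, and decoder frozen, the output projection `lm_head` should be the only trainable part, assuming the checkpoint does not tie `lm_head` to the shared embeddings. The quick check below estimates what the trainable count should then be. The vocabulary size (32,128) and hidden size (768) are assumptions for `flan-t5-base`; confirm them with `model.config.vocab_size` and `model.config.d_model`.

```python
def size_in_mb(param_count, bytes_per_param=4):
    """Memory footprint in MB; 4 bytes per parameter corresponds to float32."""
    return param_count * bytes_per_param / 1024 / 1024

vocab_size = 32128  # assumed value of model.config.vocab_size
d_model = 768       # assumed value of model.config.d_model

# lm_head is a linear layer without bias: one weight per (vocabulary entry, hidden unit)
lm_head_params = vocab_size * d_model
print(f"Expected trainable params: {lm_head_params:,} ({size_in_mb(lm_head_params):.2f} MB)")
```

If `params_info(model)` reports a very different trainable count, some parameters were probably left unfrozen, or the head shares weights with the frozen embeddings.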

### Important Considerations in Transfer Learning

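1. **Freezing the Pre-trained Layers:** In transfer learning, the pre-trained layers are frozen so that the knowledge they already hold is not overwritten. Freezing also cuts the number of trainable parameters, which lowers memory and compute cost and reduces the risk of overfitting on a small dataset. In this notebook the shared embeddings, the encoder, and the decoder are frozen, so only the remaining parameters (the `lm_head` output projection) are updated for the new task.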

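2. **Loss Computed from Logits:** Hugging Face models return raw logits, not softmax probabilities. With a Keras loss, this means passing `from_logits=True` so that the loss is computed correctly. This notebook trains with the PyTorch `Seq2SeqTrainer`, where the model computes the cross-entropy loss internally from the logits whenever `labels` are supplied, so no loss function has to be configured. Label positions set to `-100` are ignored by that loss.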

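To make the second point concrete, here is a minimal pure-Python sketch of cross-entropy computed directly from logits, with the `-100` ignore convention. It illustrates the arithmetic only and is not the library's implementation; the logits and labels are made up, and the function assumes at least one label is not ignored.

```python
import math

IGNORE_INDEX = -100

def cross_entropy_from_logits(logits, labels, ignore_index=IGNORE_INDEX):
    """Mean negative log-likelihood over positions whose label is not ignored.

    logits: one list of raw scores per position; labels: one class id per position.
    """
    losses = []
    for scores, label in zip(logits, labels):
        if label == ignore_index:
            continue  # padded positions do not contribute to the loss
        # log-softmax via the log-sum-exp trick, for numerical stability
        max_score = max(scores)
        log_sum_exp = max_score + math.log(sum(math.exp(s - max_score) for s in scores))
        losses.append(log_sum_exp - scores[label])
    return sum(losses) / len(losses)

logits = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0], [5.0, 5.0, 5.0]]
labels = [0, 2, -100]  # the last position is padding and is skipped
loss = cross_entropy_from_logits(logits, labels)
print(f"loss = {loss:.4f}")
```

Applying softmax first and then feeding the probabilities to a loss that expects logits would apply softmax twice, which is the mistake `from_logits=True` guards against in Keras.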

## Model Training

In [None]:
import os
os.environ["WANDB_DISABLED"] = "true"

In [None]:
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments, DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

training_args = Seq2SeqTrainingArguments(
    output_dir="./results",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_dir='./logs',
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,      # should be Hugging Face Dataset, not tf.data.Dataset
    eval_dataset=test_dataset,        # should be Hugging Face Dataset, not tf.data.Dataset
    tokenizer=tokenizer,
    data_collator=data_collator,
)

trainer.train()

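The training arguments above determine how many optimizer steps the run takes. A rough estimate, assuming a single device and no gradient accumulation (the trainer's own rounding may differ slightly when gradient accumulation is used); `total_training_steps` is a hypothetical helper, not part of `transformers`:

```python
import math

def total_training_steps(num_examples, batch_size, epochs, grad_accum_steps=1, num_devices=1):
    """Approximate number of optimizer steps for a full training run."""
    effective_batch = batch_size * grad_accum_steps * num_devices
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return steps_per_epoch * epochs

# 30,000 training examples, batch size 8, 3 epochs, as configured in this notebook
steps = total_training_steps(num_examples=30000, batch_size=8, epochs=3)
print(steps)  # 11250
```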

## Performing Translation

In [None]:
metrics = trainer.evaluate()
print(metrics)

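When labels are present, the returned metrics normally include an `eval_loss` entry, the mean per-token cross-entropy. Perplexity, its exponential, is sometimes easier to interpret. A small sketch with a made-up loss value (not a result from this notebook):

```python
import math

def perplexity(eval_loss):
    """Convert a mean per-token cross-entropy (in nats) into perplexity."""
    return math.exp(eval_loss)

example_metrics = {"eval_loss": 1.5}  # hypothetical value, for illustration only
print(f"perplexity = {perplexity(example_metrics['eval_loss']):.2f}")
```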
In [None]:
model.save_pretrained("./finetuned-flan-t5")
tokenizer.save_pretrained("./finetuned-flan-t5")

In [None]:
input_text = "Translate from English to Spanish: How are you?"
inputs = tokenizer(input_text, return_tensors="pt")

# Move input tensors to the same device as the model
device = next(model.parameters()).device  # Get model device
for k in inputs:
    inputs[k] = inputs[k].to(device)

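# Set the output length explicitly; the default generation limit can cut longer sentences short
outputs = model.generate(**inputs, max_new_tokens=64)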
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

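To check more than one sentence at a time, the same steps can be wrapped in a small helper. This is a sketch under the assumption that `model` and `tokenizer` are the objects loaded earlier in the notebook; `build_prompts` and `translate_batch` are hypothetical names, and `max_new_tokens=64` is an arbitrary limit.

```python
PROMPT_PREFIX = "Translate from English to Spanish: "

def build_prompts(sentences):
    """Prefix each sentence with the instruction used during fine-tuning."""
    return [PROMPT_PREFIX + sentence.strip() for sentence in sentences]

def translate_batch(sentences, model, tokenizer, max_new_tokens=64):
    """Translate a list of English sentences in a single generate() call."""
    prompts = build_prompts(sentences)
    # Pad to the longest prompt so the batch forms a rectangular tensor
    inputs = tokenizer(prompts, return_tensors="pt", padding=True, truncation=True, max_length=128)
    inputs = inputs.to(model.device)  # move tensors to the model's device
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)

print(build_prompts(["How are you?", " Where is the library? "]))
```

A call such as `translate_batch(["How are you?", "Where is the library?"], model, tokenizer)` would then return one decoded string per input sentence.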

## Conclusion
In this notebook, we used the Flan-T5 model to translate from English to Spanish using the Helsinki-NLP/opus-100 dataset. We preprocessed the dataset, fine-tuned the model with the shared embeddings, encoder, and decoder frozen, evaluated it on the test split, and manually inspected a sample translation. The evaluation loss and the sample output give a first indication of how well Flan-T5 adapts to the translation task; a fuller assessment would require checking more translations.
