### Changing to the main directory

In [1]:
%cd ..

/home/isham/Desktop/machine-learning-projects/fine-tuning-q-and-a


### Importing Necessary Libraries

In [2]:
import os

from datasets import load_from_disk
from transformers import AutoModelForSeq2SeqLM, TrainingArguments, Trainer

# Project-specific paths, hyperparameters, and helpers
from utils import BASE_MODEL_PATH, TRAINING_PATH, FINAL_MODEL_PATH, PROCESSED_DATA_DIR
from utils import EPOCHS, LR, BATCH_SIZE, SAVE_TOTAL_LIMIT, EVALUATION_STRATEGY
from utils import clear_gpu_memory

### Loading Tokenized Data

In [3]:
train_tokenized_data = load_from_disk(os.path.join(PROCESSED_DATA_DIR, "train_tokenized_data"))
val_tokenized_data = load_from_disk(os.path.join(PROCESSED_DATA_DIR, "val_tokenized_data"))
test_tokenized_data = load_from_disk(os.path.join(PROCESSED_DATA_DIR, "test_tokenized_data"))

### Initializing Training Arguments

In [4]:
training_args = TrainingArguments(
    output_dir=TRAINING_PATH,
    save_total_limit=SAVE_TOTAL_LIMIT,
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    learning_rate=LR,
    num_train_epochs=EPOCHS,
    evaluation_strategy=EVALUATION_STRATEGY,
)

### Evaluation

Evaluation Results Before Fine-Tuning

In [5]:
loaded_original_model = AutoModelForSeq2SeqLM.from_pretrained(BASE_MODEL_PATH)

In [6]:
trainer = Trainer(
    model=loaded_original_model,
    args=training_args,
    train_dataset=train_tokenized_data,
    eval_dataset=val_tokenized_data
)

In [7]:
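# Passing eval_dataset here overrides the validation set given to the Trainer,
# so the reported loss is computed on the test set.
eval_results = trainer.evaluate(eval_dataset=test_tokenized_data)
print(eval_results)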

{'eval_loss': 42.816688537597656, 'eval_runtime': 12.8824, 'eval_samples_per_second': 82.826, 'eval_steps_per_second': 20.726}


In [8]:
del loaded_original_model
clear_gpu_memory()

Evaluation Results After Fine-Tuning

In [9]:
trained_model = AutoModelForSeq2SeqLM.from_pretrained(FINAL_MODEL_PATH)

In [10]:
trainer = Trainer(
    model=trained_model,
    args=training_args,
    train_dataset=train_tokenized_data,
    eval_dataset=val_tokenized_data
)

In [11]:
# Evaluate on the same test set so the two losses are directly comparable.
eval_results = trainer.evaluate(eval_dataset=test_tokenized_data)
print(eval_results)

{'eval_loss': 0.16271285712718964, 'eval_runtime': 12.6429, 'eval_samples_per_second': 84.395, 'eval_steps_per_second': 21.118}


### Clearing GPU Memory

In [12]:
del trained_model
clear_gpu_memory()
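
`clear_gpu_memory` is imported from the project's `utils` module, and its body is not shown in this notebook. The sketch below is only a guess at what such a helper commonly does, assuming a PyTorch backend: it runs Python's garbage collector and then asks PyTorch to release its cached CUDA memory. The name `clear_gpu_memory_sketch` is hypothetical and is not the project's implementation.

```python
import gc


def clear_gpu_memory_sketch() -> int:
    """Hypothetical stand-in for utils.clear_gpu_memory (assumption, not the original).

    Returns the number of unreachable objects found by the garbage collector.
    """
    # Free Python objects that are no longer referenced (for example, a deleted model).
    collected = gc.collect()

    try:
        import torch  # Optional: only needed for the CUDA cache step.
    except ImportError:
        return collected

    # Release cached, unused GPU memory held by PyTorch's allocator.
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
    return collected


clear_gpu_memory_sketch()
```

Note that `del trained_model` removes only one reference to the model. The `trainer` object still holds another, so it may also need to be deleted before the memory can actually be released.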

### Conclusion

Before fine-tuning, the original flan-t5 model has a high evaluation loss of 42.8167 on the test dataset, which indicates that its predictions were far from the expected outputs. The evaluation took 12.8824 seconds, with the model processing approximately 82.826 samples per second.

After fine-tuning, the trained flan-t5 model shows a much lower evaluation loss of 0.1627, suggesting a considerable improvement in prediction accuracy and a closer match to the expected outputs in the test dataset. The evaluation runtime is slightly lower at 12.6429 seconds, and the model processes a slightly higher rate of 84.395 samples per second.

The drop in loss after fine-tuning reflects a better-fitted model, likely because it learned the domain-specific patterns in the fine-tuning data. The processing speed is nearly identical before and after fine-tuning, which is expected, since fine-tuning updates the model's weights rather than its architecture. This comparison underlines the importance of fine-tuning in tailoring a language model to a specific context, resulting in more accurate and reliable predictions.
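
To put the comparison above in relative terms, the short sketch below computes the percentage change in loss and in throughput. The numbers are copied from the two evaluation outputs in this notebook; the helper `relative_change` is a name introduced here for illustration.

```python
# Values copied from the evaluation outputs before and after fine-tuning.
loss_before = 42.816688537597656
loss_after = 0.16271285712718964
samples_per_sec_before = 82.826
samples_per_sec_after = 84.395


def relative_change(before: float, after: float) -> float:
    """Return the change from `before` to `after` as a fraction of `before`."""
    return (after - before) / before


loss_change = relative_change(loss_before, loss_after)
throughput_change = relative_change(samples_per_sec_before, samples_per_sec_after)

print(f"Loss change: {loss_change:.2%}")
print(f"Throughput change: {throughput_change:.2%}")
```

A negative value means the quantity went down. The loss falls by well over 99%, while throughput moves by only about 2%, which is small enough that it may simply be run-to-run variation.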

In the following section, we will compare actual answers from the test dataset with those generated by the model to assess its performance more comprehensively. While evaluation loss provides a quantitative measure of the model's accuracy, it does not capture the full scope of the model's capabilities. A qualitative comparison will give us better insight into the model's practical effectiveness and its ability to produce coherent and contextually relevant answers.
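
As a rough illustration of what such a side-by-side comparison could look like, the sketch below pairs each reference answer with a generated answer and flags exact matches after light normalization. The function names and the sample questions are made up for this example and are not taken from the project's data. In practice, the predictions would come from the fine-tuned model, for instance by decoding the output of `model.generate` with the tokenizer.

```python
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(text.lower().split())


def compare_answers(questions, references, predictions):
    """Pair each reference answer with its prediction and flag exact matches."""
    rows = []
    for question, reference, prediction in zip(questions, references, predictions):
        rows.append(
            {
                "question": question,
                "reference": reference,
                "prediction": prediction,
                "exact_match": normalize(reference) == normalize(prediction),
            }
        )
    return rows


# Made-up examples for illustration only.
sample_rows = compare_answers(
    questions=["What is the capital of France?", "Who wrote Hamlet?"],
    references=["Paris", "William Shakespeare"],
    predictions=["paris", "Christopher Marlowe"],
)

for row in sample_rows:
    print(row)
```

Exact match is a strict criterion, so reading the paired answers directly remains the main point of a qualitative check; the flag only helps to sort through them quickly.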