# Evaluation

## Exercise 1: Base model

Use the `share.prompt` function to test the base model's capacity for classifying spam, either by copying a text from the `SetFit/enron_spam` dataset or by writing your own. The source code below wraps your text in an appropriate instruction. Feel free to change both the text and the instruction to experiment with the model.
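
When testing by hand, the model may not answer with a clean `1` or `0`, especially once you start varying the instruction. The sketch below shows one way to map a raw completion to a label. It assumes that `share.prompt` returns the completion as a plain string, which this notebook does not confirm; `parse_label` is a hypothetical helper and not part of the `share` module.

```python
def parse_label(output):
    """Map a raw completion to 1 (spam), 0 (not spam) or None if unclear.

    Assumption: `output` is the completion as a plain string.
    """
    stripped = output.strip()
    if stripped.startswith("1"):
        return 1
    if stripped.startswith("0"):
        return 0
    # Anything else counts as "no usable answer".
    return None


# Hand-written example completions, not real model output.
for completion in (" 1\n", "0", "This e-mail looks like spam."):
    print(repr(completion), "->", parse_label(completion))
```

Returning `None` for unclear answers keeps them visible instead of silently counting them as one of the two classes.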

In [None]:
import gc, os, share, evaluation, torch

# load model
model = share.load_model(share.LLAMA2_MODEL_DIR)
tokenizer = share.load_tokenizer(share.LLAMA2_MODEL_DIR)

In [None]:
text = ... # your text here
text_template = f"Input:{text}{2 * os.linesep}Instruction: Output '1' if the following e-mail is spam and '0' if not. Answer in 1 token only.{2 * os.linesep}Output:"
print(share.prompt(model, tokenizer, text_template))

In [None]:
# unload model
del model
gc.collect()
torch.cuda.empty_cache()

## Exercise 2: Fine-tuned model

The Llama 2 model at `share.LLAMA2_ENRON_SPAM_LORA_MODEL_DIR` has been fine-tuned on the `SetFit/enron_spam` dataset.

As before, use the `share.prompt` function to test the fine-tuned model's capabilities by hand.

In [None]:
import gc, os, share, evaluation, torch

# load model
model = share.load_model(share.LLAMA2_ENRON_SPAM_LORA_MODEL_DIR)
tokenizer = share.load_tokenizer(share.LLAMA2_ENRON_SPAM_LORA_MODEL_DIR)

In [None]:
text = ... # your text here
text_template = f"Input:{text}{2 * os.linesep}Instruction: Output '1' if the following e-mail is spam and '0' if not. Answer in 1 token only.{2 * os.linesep}Output:"
print(share.prompt(model, tokenizer, text_template))

In [None]:
# unload model
del model
gc.collect()
torch.cuda.empty_cache()

## Exercise 3: Automated evaluation of the base model

The `evaluation.eval_precision_recall_f1` function from the `evaluation` module computes precision, recall and the F1 score for the given model on the `SetFit/enron_spam` dataset. The keyword argument `test_size` sets the size of the test dataset.

Run the function for the Llama 2 base model (`share.LLAMA2_MODEL_DIR`) and experiment with different values for the `test_size` argument.
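
To make the three metrics concrete, here is a minimal sketch that computes them from scratch for a handful of labels. It assumes that spam (`1`) is the positive class; how the `evaluation` module computes the scores internally is not shown in this notebook, so `precision_recall_f1` is only an illustration, and the labels and predictions are toy values.

```python
def precision_recall_f1(labels, predictions, positive=1):
    """Compute precision, recall and F1 for one positive class."""
    pairs = list(zip(labels, predictions))
    # True positives: spam that was flagged as spam.
    tp = sum(1 for y, p in pairs if y == positive and p == positive)
    # False positives: legitimate mail that was flagged as spam.
    fp = sum(1 for y, p in pairs if y != positive and p == positive)
    # False negatives: spam that was missed.
    fn = sum(1 for y, p in pairs if y == positive and p != positive)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy data: 4 spam and 4 non-spam e-mails.
labels = [1, 1, 1, 0, 0, 0, 1, 0]
predictions = [1, 1, 0, 0, 1, 0, 1, 0]
toy_metrics = precision_recall_f1(labels, predictions)
print(toy_metrics)
```

With 3 true positives, 1 false positive and 1 false negative, all three scores come out at 0.75. On small test sets a single misclassified e-mail shifts the scores noticeably, which is one reason to try several values for `test_size`.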

In [None]:
import share, evaluation

# load the model
model = share.load_model(share.LLAMA2_MODEL_DIR)
tokenizer = share.load_tokenizer(share.LLAMA2_MODEL_DIR)

In [None]:
metrics_base_model = {}
for test_size in (...): # add test sizes here
    metrics_base_model[test_size] = evaluation.eval_precision_recall_f1(model, tokenizer, test_size=test_size)

In [None]:
import numpy, matplotlib.pyplot

test_sizes = sorted(metrics_base_model)

precision = [metrics_base_model[k]["precision"] for k in test_sizes]
recall = [metrics_base_model[k]["recall"] for k in test_sizes]
f1 = [metrics_base_model[k]["f1"] for k in test_sizes]

metrics = {
    "Precision": precision,
    "Recall": recall,
    "F1 score": f1,
}
x = numpy.arange(len(test_sizes))
width = 0.25
multiplier = 0
fig, ax = matplotlib.pyplot.subplots(layout="constrained")
for metric, measurement in metrics.items():
    offset = width * multiplier
    rects = ax.bar(x + offset, measurement, width, label=metric)
    ax.bar_label(rects, padding=3, fmt="%.2f")  # round the labels to two decimals
    multiplier += 1
ax.set_xticks(x + width, test_sizes)
ax.legend(loc="upper left", ncols=3)
ax.set_ylim(0, 1)

In [None]:
import gc, torch

# unload the model
del model
gc.collect()
torch.cuda.empty_cache()

## Exercise 4: Automated evaluation of the fine-tuned model

Run the `evaluation.eval_precision_recall_f1` function again for the fine-tuned model with the same `test_size` values. Then compare the results of the base model and the fine-tuned model for one of the `test_size` values.
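
Besides plotting, the comparison can also be expressed as a per-metric difference. The sketch below assumes that both result dictionaries use the keys `"precision"`, `"recall"` and `"f1"`, as the plotting cells in this notebook do. `metric_deltas` is a hypothetical helper, and the numbers are placeholders rather than measured results.

```python
def metric_deltas(base, fine_tuned, keys=("precision", "recall", "f1")):
    """Return fine-tuned minus base for each metric."""
    return {k: fine_tuned[k] - base[k] for k in keys}


# Placeholder values for illustration only, not real evaluation results.
example_base = {"precision": 0.5, "recall": 0.5, "f1": 0.5}
example_fine_tuned = {"precision": 0.75, "recall": 0.75, "f1": 0.75}

deltas = metric_deltas(example_base, example_fine_tuned)
for name, delta in deltas.items():
    print(f"{name}: {delta:+.2f}")
```

A positive difference means the fine-tuned model scored higher on that metric for the chosen test size.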

In [None]:
import share, evaluation

# load the model
model = share.load_model(share.LLAMA2_ENRON_SPAM_LORA_MODEL_DIR)
tokenizer = share.load_tokenizer(share.LLAMA2_ENRON_SPAM_LORA_MODEL_DIR)

In [None]:
metrics_fine_tuned_model = {}
for test_size in (...): # add test sizes here
    metrics_fine_tuned_model[test_size] = evaluation.eval_precision_recall_f1(model, tokenizer, test_size=test_size)

In [None]:
import numpy, matplotlib.pyplot

test_sizes = sorted(metrics_fine_tuned_model)

precision = [metrics_fine_tuned_model[k]["precision"] for k in test_sizes]
recall = [metrics_fine_tuned_model[k]["recall"] for k in test_sizes]
f1 = [metrics_fine_tuned_model[k]["f1"] for k in test_sizes]

metrics = {
    "Precision": precision,
    "Recall": recall,
    "F1 score": f1,
}
x = numpy.arange(len(test_sizes))
width = 0.25
multiplier = 0
fig, ax = matplotlib.pyplot.subplots(layout="constrained")
for metric, measurement in metrics.items():
    offset = width * multiplier
    rects = ax.bar(x + offset, measurement, width, label=metric)
    ax.bar_label(rects, padding=3, fmt="%.2f")  # round the labels to two decimals
    multiplier += 1
ax.set_xticks(x + width, test_sizes)
ax.legend(loc="upper left", ncols=3)
ax.set_ylim(0, 1)



In [None]:
import numpy, matplotlib.pyplot

test_size = 32 # test size to compare here

metrics = {
    "Base model": [metrics_base_model[test_size][k] for k in ("precision", "recall", "f1")],
    "Fine-tuned model": [metrics_fine_tuned_model[test_size][k] for k in ("precision", "recall", "f1")]
}

x = numpy.arange(len(("precision", "recall", "f1")))
width = 0.25
multiplier = 0
fig, ax = matplotlib.pyplot.subplots(layout="constrained")
for metric, measurement in metrics.items():
    offset = width * multiplier
    rects = ax.bar(x + offset, measurement, width, label=metric)
    ax.bar_label(rects, padding=3, fmt="%.2f")  # round the labels to two decimals
    multiplier += 1
ax.set_xticks(x + width, ("precision", "recall", "f1"))
ax.legend(loc="upper left", ncols=3)
ax.set_ylim(0, 1)

In [None]:
import gc, torch

# unload the model
del model
gc.collect()
torch.cuda.empty_cache()

## Exercise 5: Perplexity of the base model

The `evaluation.eval_perplexity` function measures a model's perplexity for the `iamtarun/python_code_instructions_18k_alpaca` dataset. Run the function for the Llama 2 base model with different values for the `test_size` argument and compare the results.
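
As a reminder of what the number means: perplexity is the exponential of the average negative log-likelihood per token, so lower values indicate that the model finds the text less surprising. The sketch below computes it from a list of per-token log-probabilities. These inputs are toy values, and whether `evaluation.eval_perplexity` aggregates in exactly this way is an assumption.

```python
import math


def perplexity(token_log_probs):
    """Exponential of the mean negative log-likelihood per token."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# A model that assigns probability 0.25 to every token is as uncertain
# as a uniform choice between four options: perplexity of about 4.
uniform_guess = [math.log(0.25)] * 4
print(perplexity(uniform_guess))

# A more confident model (probability 0.5 per token): perplexity of about 2.
confident_guess = [math.log(0.5)] * 3
print(perplexity(confident_guess))
```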

In [None]:
import share, evaluation

# load the model
model = share.load_model(share.LLAMA2_MODEL_DIR)
tokenizer = share.load_tokenizer(share.LLAMA2_MODEL_DIR)

In [None]:
test_size = ... # enter the size here
metrics_base_model = evaluation.eval_perplexity(model, tokenizer, test_size=test_size)
print(f"Test size {test_size}:")
print(f"Perplexity: {metrics_base_model['perplexity']}")

In [None]:
import gc, torch

# unload the model
del model
gc.collect()
torch.cuda.empty_cache()

## Exercise 6: Perplexity of different fine-tuned models

The models at

* `share.LLAMA2_PYTHON_CODE_LORA_MODEL_DIR`
* `share.LLAMA2_PYTHON_CODE_ADAPTER_MODEL_DIR`
* `share.LLAMA2_PYTHON_CODE_FULL_MODEL_DIR`

have been fine-tuned on the `iamtarun/python_code_instructions_18k_alpaca` dataset using LoRA, Llama-Adapter and full parameter fine-tuning, respectively. Evaluate the perplexity of each one with the same value for the `test_size` argument and compare the results. The cell below uses `evaluation.eval_perplexity_load`, which takes a model directory rather than a loaded model and tokenizer.

In [None]:
import share, evaluation

test_size = 32 # use the same test size for all three models

print(f"LoRA fine-tuning: {evaluation.eval_perplexity_load(share.LLAMA2_PYTHON_CODE_LORA_MODEL_DIR, test_size=test_size)['perplexity']}")
print(f"Llama-Adapter fine-tuning: {evaluation.eval_perplexity_load(share.LLAMA2_PYTHON_CODE_ADAPTER_MODEL_DIR, test_size=test_size)['perplexity']}")
print(f"Full parameter fine-tuning: {evaluation.eval_perplexity_load(share.LLAMA2_PYTHON_CODE_FULL_MODEL_DIR, test_size=test_size)['perplexity']}")