# Part 3: Evaluation ✒️ 📑

***Full notebook contents viewable on [Kaggle](https://www.kaggle.com/code/chuhuayang/prompt-recovery-pt-3-evaluation).***

Previously, we created a LoRA Adapter that modifies the Gemma-7B model for cleaner and more accurate generation on our Prompt Recovery task. Now, we will test our model's performance on the previously unused data. As outlined in the competition description, Kaggle calculates performance by using Google's [sentence-T5 model](https://www.kaggle.com/models/google/sentence-t5/tensorFlow2/st5-base) to transform the predicted and actual rewrite prompts into dense vectors, or embeddings. Then, Sharpened Cosine Similarity, with an exponent of 3, is used to measure how similar the two vectors are.
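
To make the metric concrete, here is a minimal sketch of Sharpened Cosine Similarity in plain Python. The helper names and the toy 2-D vectors are illustrative assumptions, not part of the competition code; real sentence-T5 embeddings have many more dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def sharpened_cosine_similarity(u, v, exponent=3):
    """Raise the cosine similarity to a power (3 in this competition)."""
    return cosine_similarity(u, v) ** exponent

# Toy 2-D vectors standing in for real sentence embeddings (illustrative values only).
actual = [1.0, 0.0]
close_pred = [0.9, math.sqrt(1 - 0.9 ** 2)]  # cosine similarity 0.9 with `actual`
far_pred = [0.5, math.sqrt(1 - 0.5 ** 2)]    # cosine similarity 0.5 with `actual`

print(sharpened_cosine_similarity(actual, close_pred))  # roughly 0.729
print(sharpened_cosine_similarity(actual, far_pred))    # roughly 0.125
```

Cubing leaves a perfect match at 1 but shrinks every lower similarity, so mediocre predictions are penalized more heavily than under plain cosine similarity.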

We will first obtain the predicted rewrite prompts by loading our Adapter and performing inference. Then, we will try to recreate Kaggle's scoring method as closely as possible, and observe the performance.

### Set-up

In addition to the previously used Transformers libraries, we will need to install the [SentenceTransformers](https://www.sbert.net/) library.

In [1]:
%%capture

!pip install -q -U transformers
!pip install -q -U accelerate
!pip install -q -U bitsandbytes
!pip install -q -U peft
!pip install -q -U sentence-transformers

import numpy as np
import pandas as pd
import os

import torch
import torch.nn as nn

import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, pipeline
from peft import PeftModel
import bitsandbytes as bnb


### Adapter

We load in the base model, using the same configurations as before. Then, we apply our Adapter, using PEFT's API.

In [2]:
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

base_model = AutoModelForCausalLM.from_pretrained(
    "/kaggle/input/gemma/transformers/7b-it/3",
    device_map="auto",
    quantization_config=bnb_config,
)

tokenizer = AutoTokenizer.from_pretrained("/kaggle/input/gemma/transformers/7b-it/3")

In [3]:
%%capture

model = PeftModel.from_pretrained(base_model, "/kaggle/input/prompt-recovery-pt-2-fine-tuning/adapter")

### Test Data

We will apply the same template, which our model has learned to recognize. This time, the rewrite prompt is left blank, and our model should fill it in appropriately. To reduce bias in the test results, we test on the 600 examples that our model did not see during training.

In [4]:
import random
random.seed(0)

TEST_TEMPLATE = """### Instruction:
Below, the `Original Text` passage has been rewritten/transformed/improved into `Rewritten Text` by the `Gemma 7b-it` LLM with a certain prompt/instruction. Your task is to carefully analyze the differences between the "Original Text" and "Rewritten Text", and try to infer the specific prompt or instruction that was likely given to the LLM to rewrite/transform/improve the text in this way.

### Original Text: 
{original_text}

### Rewritten Text:
{rewritten_text}

### Response:

"""

df = pd.read_csv("/kaggle/input/prompt-recovery-pt-1-generate-training-data/training_data.csv")
df_test = df[4400:5000].copy().reset_index(drop=True)
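
As a quick illustration of how one row becomes a prompt, the sketch below fills a shortened stand-in for the template with a made-up example. `DEMO_TEMPLATE` and `demo_row` are hypothetical names, and the strings are not taken from the dataset.

```python
# Shortened stand-in for TEST_TEMPLATE; the instruction text is abbreviated for illustration.
DEMO_TEMPLATE = """### Instruction:
Infer the prompt that turned `Original Text` into `Rewritten Text`.

### Original Text:
{original_text}

### Rewritten Text:
{rewritten_text}

### Response:

"""

# Hypothetical example row; these strings are not taken from the dataset.
demo_row = {
    "original_text": "The meeting is at noon.",
    "rewritten_text": "Hark! The council doth gather at midday.",
}

demo_prompt = DEMO_TEMPLATE.format(**demo_row)
print(demo_prompt)
```

The formatted prompt ends right after the `### Response:` header, so whatever the model generates next is read as its predicted rewrite prompt.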

### Inference

Hugging Face Transformers offers [pipelines](https://huggingface.co/docs/transformers/en/main_classes/pipelines), which simplify the process of encoding inputs, generating new tokens, and decoding outputs. We use `max_new_tokens` to cut off generation at 35 new tokens, preventing excessively long outputs. We also pass `return_full_text=False` when calling the pipeline, so that only the newly generated text is returned, without the input prompt repeated in front of it.

In [5]:
predictor = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_new_tokens=35)

def predict(row):
    text_inputs = TEST_TEMPLATE.format(original_text=row["original_text"], rewritten_text=row["rewritten_text"])
    generated_text = predictor(text_inputs=text_inputs, return_full_text=False)[0].get("generated_text")
    return generated_text.strip()

df_test["prediction"] = df_test.apply(predict, axis=1)
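
The effect of `return_full_text=False` followed by `.strip()` can be pictured at the string level. The sketch below is only an illustration under that assumption: `emulate_text_generation` is a hypothetical helper rather than a library function, and the continuation is a made-up model output.

```python
def emulate_text_generation(prompt, continuation, return_full_text=True):
    """String-level stand-in for a text-generation call (hypothetical helper)."""
    return prompt + continuation if return_full_text else continuation

toy_prompt = "### Response:\n\n"
toy_continuation = " Rewrite this text as a sea shanty.\n"  # made-up model output

full = emulate_text_generation(toy_prompt, toy_continuation)
new_only = emulate_text_generation(toy_prompt, toy_continuation, return_full_text=False)

print(repr(full))              # prompt followed by the continuation
print(repr(new_only.strip()))  # continuation only, surrounding whitespace removed
```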

In [6]:
import gc

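# The pipeline keeps its own references to the model and tokenizer,
# so it must be deleted too before the GPU memory can be released.
del predictor
del bnb_config
del base_model
del model
del tokenizer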

gc.collect()

torch.cuda.empty_cache()

### Scoring

SentenceTransformers is integrated into the Hugging Face infrastructure. All of its pre-trained models, including the `sentence-t5-base` variant used in this competition, are hosted on the Hugging Face Hub. We can easily load a model from the Hub by specifying its name.

In [7]:
from sentence_transformers import SentenceTransformer

st_model = SentenceTransformer('sentence-t5-base')

if torch.cuda.is_available():
    st_model = st_model.to(torch.device("cuda"))

PyTorch comes with a cosine similarity function. We encode the predicted and actual rewrite prompts into two vectors, apply the function, and raise the result to the third power to get the score for each input. The mean of these scores is our model's overall performance.

In [8]:
def sharpened_cosine_similarity(actual, pred):
    cosine_similarity = nn.functional.cosine_similarity(actual, pred, dim=0)
    return cosine_similarity ** 3

def scoring(row):
    actual_embeddings = st_model.encode(row["rewrite_prompt"], convert_to_tensor=True, show_progress_bar=False)
    pred_embeddings = st_model.encode(row["prediction"], convert_to_tensor=True, show_progress_bar=False)
    score = sharpened_cosine_similarity(actual_embeddings, pred_embeddings).item()
    return score

df_test["score"] = df_test.apply(scoring, axis=1)
print(f"The overall score is {df_test['score'].mean()}")
df_test.to_csv("scored_rewrite_prompts.csv")

The overall score is 0.7817590881884098
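
Scoring row by row is simple but calls the encoder twice per example. As a possible alternative, `SentenceTransformer.encode` also accepts a list of sentences and by default returns a 2-D NumPy array with one row per sentence, so all scores can be computed in one vectorized step. The sketch below assumes that setup; `sharpened_cosine_similarity_batch` is a hypothetical helper, and toy 2-D arrays stand in for real embeddings so the snippet runs on its own.

```python
import numpy as np

def sharpened_cosine_similarity_batch(actual, pred, exponent=3):
    """Row-wise cosine similarity between two (n, d) arrays, raised to `exponent`."""
    actual = np.asarray(actual, dtype=np.float64)
    pred = np.asarray(pred, dtype=np.float64)
    dot = np.sum(actual * pred, axis=1)
    norms = np.linalg.norm(actual, axis=1) * np.linalg.norm(pred, axis=1)
    return (dot / norms) ** exponent

# With the real model, the two arrays would come from calls such as
#   st_model.encode(df_test["rewrite_prompt"].tolist())
#   st_model.encode(df_test["prediction"].tolist())
# Toy 2-D embeddings are used here instead (illustrative values only).
toy_actual = np.array([[1.0, 0.0], [0.0, 2.0]])
toy_pred = np.array([[1.0, 0.0], [3.0, 0.0]])

scores = sharpened_cosine_similarity_batch(toy_actual, toy_pred)
print(scores)         # identical rows score 1.0, orthogonal rows score 0.0
print(scores.mean())  # 0.5
```

This should give the same per-row values as the loop above up to floating-point differences, since only the order of computation changes.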
