## Model Evaluation

### 1. BERTScore
- compares semantic similarity (via contextual embeddings) rather than exact word overlap
- good at capturing meaning, not just matching words
- *how similar in meaning is our model's output to the actual output (reference), even if words differ*

**Metrics**
  - Precision(P)
  - Recall(R)
  - F1 score(F1): harmonic mean of P and R
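
To make the three metrics concrete, here is a minimal sketch of the greedy matching that BERTScore is built on. It assumes token embeddings are already available; the 2-d vectors below are hypothetical toy values rather than real BERT outputs, `greedy_bertscore` is an illustrative helper name, and the optional IDF weighting is left out.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def greedy_bertscore(cand_vecs, ref_vecs):
    """Greedy-matching P, R, F1 over token embeddings (no IDF weighting)."""
    # Precision: match each candidate token to its most similar reference token.
    precision = sum(max(cosine(c, r) for r in ref_vecs) for c in cand_vecs) / len(cand_vecs)
    # Recall: match each reference token to its most similar candidate token.
    recall = sum(max(cosine(r, c) for c in cand_vecs) for r in ref_vecs) / len(ref_vecs)
    # F1: harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Toy "embeddings" (hypothetical values, not produced by a real model).
cand = [[1.0, 0.0], [0.0, 1.0]]
ref = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

p, r, f1 = greedy_bertscore(cand, ref)
print(f"P={p:.4f} R={r:.4f} F1={f1:.4f}")
```

In this toy case every candidate token has an exact match in the reference, so precision is perfect, while the third reference token is only partially covered, which pulls recall (and therefore F1) down.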

In [3]:
import json
import urllib.request
from bert_score import score

def load_data(url, file_name):
    urllib.request.urlretrieve(url, file_name)
    # load the JSON
    with open(file_name, 'r', encoding='utf-8') as f:
        data = json.load(f)
        print('Data Loaded Successfully')
        print(f'Number of entries: {len(data)}')
    return data

file_id = "1s3hQ4d2soSFyVerNrJK9ihnGxmSga_96"
url = f"https://drive.google.com/uc?export=download&id={file_id}"
file_name = "eval.json"
data = load_data(url, file_name)

Data Loaded Successfully
Number of entries: 2588


In [4]:
# Evaluation function
def evaluate_bertscore(data, threshold=0.7, show_bad=False):
    # Keep only examples whose model_response exists and is not blank/whitespace.
    inputs = [ex for ex in data if (ex.get("model_response") or "").strip()]
    references = [ex["output"] for ex in inputs]
    predictions = [ex["model_response"] for ex in inputs]

    print(f"Evaluating {len(predictions)} examples with BERTScore....\n")

    # Precision, recall, and F1 come back as tensors with one score per example.
    P, R, F1 = score(predictions, references, lang="en", model_type="bert-base-uncased")

    print(f"Average BERTScore Precision: {P.mean().item():.4f}")
    print(f"Average BERTScore Recall: {R.mean().item():.4f}")
    print(f"Average BERTScore F1: {F1.mean().item():.4f}")

    # Optionally print the low-scoring examples.
    if show_bad:
        print(f"Examples with an F1 score below {threshold}:")
        # F1.tolist() yields plain floats, which compare and format cleanly.
        for i, (ex, ref, pred, f1_score) in enumerate(zip(inputs, references, predictions, F1.tolist())):
            if f1_score < threshold:
                print(f"\nExample {i+1}")
                print(f"Instruction: {ex['instruction']}")
                print(f"Input: {ex['input']}")
                print(f"Expected: {ref}")
                print(f"Model: {pred}")
                print(f"F1 Score: {f1_score:.4f}")
                print("-" * 50)
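
The filtering step at the top of the function can be checked in isolation. The sketch below uses hypothetical toy records that only mirror the keys the notebook reads (`instruction`, `input`, `output`, `model_response`); `keep_scored` is an illustrative helper name, not part of the notebook.

```python
# Hypothetical records with the same keys the evaluation function reads.
toy_data = [
    {"instruction": "Answer briefly", "input": "2 + 2", "output": "4", "model_response": "4"},
    {"instruction": "Answer briefly", "input": "3 + 3", "output": "6", "model_response": "   "},
    {"instruction": "Answer briefly", "input": "4 + 4", "output": "8"},
]

def keep_scored(records):
    """Keep only records whose model_response is present and not blank."""
    return [ex for ex in records if (ex.get("model_response") or "").strip()]

kept = keep_scored(toy_data)
references = [ex["output"] for ex in kept]
predictions = [ex["model_response"] for ex in kept]
print(len(kept), references, predictions)
```

Filtering once and deriving both lists from the same filtered records keeps references and predictions aligned by construction.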

In [6]:
evaluate_bertscore(data)

Evaluating 2588 examples with BERTScore....

Average BERTScore Precision: 0.5568
Average BERTScore Recall: 0.5571
Average BERTScore F1: 0.5512


The metrics are not great (average F1 of about 0.55) and can be improved further.

### 2. BLEU (Bilingual Evaluation Understudy)
- precision-based metric
- checks how many n-grams in our model's response match the reference (actual output)
- does not measure recall: it does not check how much of the reference the response covers (a brevity penalty only discourages responses that are too short)
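
The n-gram matching described above can be sketched in a few lines. This is a minimal illustration of clipped (modified) n-gram precision for a single reference, assuming whitespace tokenization; `ngrams` and `modified_precision` are illustrative helper names, not NLTK functions.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(reference, hypothesis, n):
    """Clipped n-gram precision of a hypothesis against one reference."""
    hyp_counts = Counter(ngrams(hypothesis, n))
    ref_counts = Counter(ngrams(reference, n))
    if not hyp_counts:
        return 0.0
    # Each hypothesis n-gram is credited at most as often as it occurs in the reference.
    clipped = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    return clipped / sum(hyp_counts.values())

reference = "the cat is on the mat".split()
hypothesis = "the the the the the the the".split()

p1 = modified_precision(reference, hypothesis, 1)  # 2 clipped matches out of 7 unigrams
p2 = modified_precision(reference, hypothesis, 2)  # no bigram of the hypothesis is in the reference
print(p1, p2)
```

Clipping is what stops a response from scoring well by repeating a common reference word.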

In [8]:
# BLEU-4
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
import numpy as np

def compute_bleu(data, weights=(0.25, 0.25, 0.25, 0.25)):
    smoothie = SmoothingFunction().method4
    bleu_scores = []

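    for ex in data:
        # Simple whitespace tokenization of the reference and the model response.
        reference = ex['output'].strip().split()
        hypothesis = ex['model_response'].strip().split()

        # sentence_bleu expects a list of references; smoothing avoids zero scores
        # when a higher-order n-gram has no match.
        bleu = sentence_bleu([reference], hypothesis, weights=weights, smoothing_function=smoothie)
        bleu_scores.append(bleu)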

    return np.mean(bleu_scores)

In [9]:
compute_bleu(data)

0.09139332085559292

Not good, as expected: a BLEU-4 of about 0.09 indicates very little n-gram overlap with the references.
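
To see why BLEU-4 tends to be low, it helps to look at how the four weights combine the per-order precisions. The sketch below implements the standard unsmoothed formula (brevity penalty times a weighted geometric mean). The precision values are hypothetical, `combine_bleu` is an illustrative helper name, and NLTK's `sentence_bleu` with `method4` smoothing will not produce exactly these numbers.

```python
import math

def combine_bleu(precisions, weights, hyp_len, ref_len):
    """Unsmoothed BLEU: brevity penalty times the weighted geometric mean of precisions."""
    if hyp_len == 0 or min(precisions) == 0:
        return 0.0  # log(0) is undefined, so the unsmoothed score collapses to 0
    log_mean = sum(w * math.log(p) for w, p in zip(weights, precisions))
    # The penalty only applies when the hypothesis is not longer than the reference.
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity_penalty * math.exp(log_mean)

weights = (0.25, 0.25, 0.25, 0.25)

# Hypothetical 1- to 4-gram precisions for one response.
typical = combine_bleu((0.6, 0.3, 0.1, 0.05), weights, hyp_len=20, ref_len=20)
no_4gram = combine_bleu((0.6, 0.3, 0.1, 0.0), weights, hyp_len=20, ref_len=20)
too_short = combine_bleu((1.0, 1.0, 1.0, 1.0), weights, hyp_len=10, ref_len=20)
print(f"{typical:.4f} {no_4gram:.4f} {too_short:.4f}")
```

Because the mean is geometric, a weak 3-gram or 4-gram precision drags the whole score down even when unigram overlap is decent, and a single zero precision wipes it out unless smoothing is applied.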