# Model Evaluation

#### 1. Translation:
Metrics:
- BLEU Score: Measures n-gram precision of the generated translation against one or more reference translations, with a brevity penalty for output that is too short.

Standard Range: 0 to 1

Interpretation: Higher values indicate better overlap with reference translations.


- ROUGE Score: Evaluates overlap in n-grams (ROUGE-N) and longest common subsequences (ROUGE-L) between generated and reference translations.

Standard Range: 0 to 1

Interpretation: Higher values suggest better n-gram and subsequence overlap with the references.

- METEOR Score: Considers precision, recall, stemming, synonymy, and word order.

Standard Range: 0 to 1

Interpretation: Higher values indicate better overall performance considering precision, recall, stemming, synonymy, and word order.



#### 2. Summarization:
Metrics:
- ROUGE Score: Measures overlap in n-grams (ROUGE-N) and longest common subsequences (ROUGE-L) between generated and reference summaries.

Standard Range: 0 to 1

Interpretation: Higher values suggest better n-gram and subsequence overlap with the references.

- BLEU Score: Can also be used for evaluating summarization tasks.

Standard Range: 0 to 1

Interpretation: Higher values indicate better overlap with reference summaries.
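The subsequence part of ROUGE can be made concrete with ROUGE-L, which scores the longest common subsequence (LCS) shared by a candidate and a reference. The sketch below is a minimal illustration, assuming lowercase whitespace tokenization and a plain F1 instead of the weighted F-measure some implementations use; the function names `lcs_length` and `rouge_l` are illustrative and not taken from any library.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    # dp[i][j] holds the LCS length of a[:i] and b[:j].
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, tok_a in enumerate(a, start=1):
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]


def rouge_l(candidate, reference):
    """Return (precision, recall, F1) for ROUGE-L on whitespace tokens."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0, 0.0, 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return precision, recall, 2 * precision * recall / (precision + recall)


p, r, f = rouge_l("the cat sat on the mat", "the cat is on the mat")
print(f"ROUGE-L precision={p:.3f} recall={r:.3f} F1={f:.3f}")
```

Unlike n-gram overlap, the LCS rewards words that appear in the same order even when other words sit between them.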


#### 3. Text Generation:
Metrics:
- Perplexity: Measures the uncertainty of a language model on a given text.

Standard Range: 1 to infinity (1 means every token is predicted with certainty); lower is better.

Interpretation: Lower values indicate better language model performance on the given text.

- Diversity Metrics: Evaluate the diversity of generated text (e.g., uniqueness of generated responses).

Standard Range: Context-dependent; aim for diversity.

Interpretation: Higher diversity indicates more unique and varied generated responses.
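One common way to quantify diversity is distinct-n: the number of unique n-grams divided by the total number of n-grams across a set of generated responses. The notebook does not name a specific diversity metric, so treat this as one possible choice; the function name `distinct_n` and the sample responses are made up for illustration.

```python
def distinct_n(texts, n=1):
    """Ratio of unique n-grams to total n-grams across all texts."""
    total = 0
    unique = set()
    for text in texts:
        tokens = text.lower().split()
        ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(ngrams)
        unique.update(ngrams)
    # An empty input has no n-grams; report 0.0 rather than dividing by zero.
    return len(unique) / total if total else 0.0


# Two identical responses pull the score down.
responses = [
    "i do not know",
    "i do not know",
    "the weather is nice today",
]
print(f"distinct-1: {distinct_n(responses, 1):.3f}")
print(f"distinct-2: {distinct_n(responses, 2):.3f}")
```

Values near 1 mean almost every n-gram is new, while values near 0 mean the model keeps repeating itself.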


### Translation:

In [None]:
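import nltk
from nltk.translate.meteor_score import meteor_score

# METEOR's synonym matching needs the WordNet corpus (no-op if already present)
nltk.download("wordnet", quiet=True)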

# Sample Input
reference_sentences = ["The cat is on the mat."]
hypothesis_sentence = "A cat is lying on the carpet."  # generated by a model

# Convert to METEOR's expected format
reference_sentences = [reference_sentence.split() for reference_sentence in reference_sentences]
hypothesis_sentence = hypothesis_sentence.split()

# METEOR Score Calculation
score = meteor_score(reference_sentences, hypothesis_sentence)
print(f"METEOR Score: {score:.4f}")

METEOR Score: 0.6148


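A METEOR score of 0.6148 indicates moderate agreement between the hypothesis and its reference. Four of the seven hypothesis tokens ("cat", "is", "on", "the") align with the reference, while "A", "lying", and "carpet." do not, and the matched words fall into two separate chunks, which adds a small word-order penalty. A single sentence pair says little about a translation system as a whole; averaging the score over a test set gives a more reliable picture.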
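BLEU, the first translation metric listed earlier, can be sketched on the same sentence pair. The version below is a simplified, sentence-level illustration: it uses lowercase whitespace tokens, a single reference, and only unigrams and bigrams (`max_n=2`), whereas standard BLEU is usually computed over a corpus with up to 4-grams. The names `ngram_counts` and `bleu` are illustrative, not a library API.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(hypothesis, reference, max_n=2):
    """Simplified single-reference BLEU with uniform n-gram weights."""
    hyp = hypothesis.lower().split()
    ref = reference.lower().split()
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = ngram_counts(hyp, n)
        ref_counts = ngram_counts(ref, n)
        # Clip each hypothesis n-gram count by its count in the reference.
        matched = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = sum(hyp_counts.values())
        if matched == 0 or total == 0:
            return 0.0  # no smoothing in this sketch
        log_precisions.append(math.log(matched / total))
    # Brevity penalty: only hypotheses shorter than the reference are penalized.
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / len(hyp))
    # Geometric mean of the n-gram precisions, scaled by the brevity penalty.
    return bp * math.exp(sum(log_precisions) / max_n)


score = bleu("A cat is lying on the carpet.", "The cat is on the mat.")
print(f"BLEU-2: {score:.4f}")
```

Here 4 of 7 unigrams and 2 of 6 bigrams match, and the hypothesis is longer than the reference, so no brevity penalty applies.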

### Summarization

In [None]:
pip install rouge



In [None]:
from rouge import Rouge

# Generated Summary
generated_summary = "Scientists excited about discovery of new planet in distant galaxy."

# Reference Summaries
reference_summaries = ["Scientists are excited about the discovery of a new planet in a distant galaxy."]

# ROUGE Score
rouge = Rouge()
rouge_scores = rouge.get_scores(generated_summary, reference_summaries[0])
print(f"ROUGE-1 Score: {rouge_scores[0]['rouge-1']['f']:.4f}")

ROUGE-1 Score: 0.8696


ROUGE stands for Recall-Oriented Understudy for Gisting Evaluation.

The ROUGE-1 F1 score of 0.8696 indicates high unigram overlap between the generated summary and the reference summary. Every word in the generated summary also appears in the reference, so precision is 1.0; the reference additionally contains function words ("are", "the", "a") that the summary omits, which lowers recall. The higher the ROUGE-1 score, the more single words the two texts share.
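The 0.8696 reported above can be reproduced by hand. The sketch below assumes whitespace tokenization and counts each distinct word once, which is the interpretation that matches the reported value; the function name `rouge_1` is illustrative.

```python
def rouge_1(candidate, reference):
    """Return (precision, recall, F1) over unique whitespace-separated unigrams."""
    cand = set(candidate.split())
    ref = set(reference.split())
    overlap = len(cand & ref)
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / len(cand)  # share of candidate words found in the reference
    recall = overlap / len(ref)      # share of reference words found in the candidate
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


generated_summary = "Scientists excited about discovery of new planet in distant galaxy."
reference_summary = "Scientists are excited about the discovery of a new planet in a distant galaxy."

p, r, f = rouge_1(generated_summary, reference_summary)
print(f"precision={p:.4f} recall={r:.4f} f1={f:.4f}")
```

With 10 shared words, 10 unique candidate words, and 13 unique reference words, the F1 works out to 20/23, or about 0.8696.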

### Text Generation

Perplexity is a measure of how well the model predicts the input sequence, and lower values are indicative of better performance.
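Concretely, perplexity is the exponential of the average negative log-probability the model assigns to each observed token. The sketch below works from a hand-written list of per-token probabilities, so it needs no model; the function name `perplexity` and the probability values are made up for illustration.

```python
import math


def perplexity(token_probs):
    """Exponential of the mean negative log-probability of the observed tokens."""
    mean_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(mean_nll)


# A model that spreads probability evenly over 4 options at every step
# is exactly as uncertain as a fair 4-way choice.
print(f"uniform over 4 choices: {perplexity([0.25, 0.25, 0.25, 0.25]):.2f}")

# A confident model that is usually right scores close to 1.
print(f"confident model: {perplexity([0.9, 0.8, 0.95]):.2f}")
```

This is why perplexity is often read as an effective branching factor: a value of k means the model is, on average, as unsure as if it were picking among k equally likely tokens.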

In [None]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer
import torch

# Initialize GPT-2 model and tokenizer
model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# Generated Text
generated_text = "In a world where robots have emotions, they express joy and sorrow just like humans."

# Tokenize the text and run the model
input_ids = tokenizer.encode(generated_text, return_tensors="pt")
with torch.no_grad():
    logits = model(input_ids).logits

# Calculate Perplexity
# The logits at position i predict token i + 1, so drop the last row of
# logits and the first token before comparing them.
shift_logits = logits[0, :-1, :]
shift_labels = input_ids[0, 1:]
perplexity = torch.exp(torch.nn.functional.cross_entropy(shift_logits, shift_labels))

print(f"Perplexity: {perplexity.item():.4f}")


Perplexity: 20329.5020


Perplexity measures how well a probability model predicts a sample. For a language model such as GPT-2, it is the exponential of the average negative log-likelihood per token (word or subword).


Here is how to read the recorded perplexity value of 20329.5020:

- A perplexity of k means the model is, on average, as uncertain as if it were choosing uniformly among k tokens at each step.

- Higher perplexity values indicate higher uncertainty or poorer performance. The ideal value is 1, where the model assigns probability 1 to every observed token; uniform guessing over GPT-2's 50,257-token vocabulary would give a perplexity of 50,257.

- A lower perplexity means that the model assigns higher probabilities to the observed sequence of tokens.

Overall, 20329.5020 is far higher than GPT-2 typically scores on fluent English. The value was obtained by comparing the logits at each position with the token at that same position. Because those logits predict the next token, the targets must be shifted by one position, as in the cell above; with the shift applied, the result should be much lower.

In [None]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer
import torch

# Initialize GPT-2 model and tokenizer
model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# Generated Text
generated_text = "I enjoy learning new things."

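# Tokenize the text and run the model
input_ids = tokenizer.encode(generated_text, return_tensors="pt")
with torch.no_grad():
    logits = model(input_ids).logits

# Calculate Perplexity
# The logits at position i predict token i + 1, so drop the last row of
# logits and the first token before comparing them.
shift_logits = logits[0, :-1, :]
shift_labels = input_ids[0, 1:]
perplexity = torch.exp(torch.nn.functional.cross_entropy(shift_logits, shift_labels))

print(f"Generated Text: {generated_text}")
print(f"Perplexity: {perplexity.item():.4f}")


Generated Text: I enjoy learning new things.
Perplexity: 7027.2896

The recorded 7027.2896 came from comparing each position's logits with the token at that same position. With the one-position shift used in the cell above, the value for a short, fluent sentence like this one should be much lower.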
