<a href="https://colab.research.google.com/github/arquansa/PSTB-exercises/blob/main/Week08/Day3/DC3/W8D3DC.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Daily Challenge: Evaluating Large Language Models**

Task

**1. Understanding LLM Evaluation**

Explain why evaluating LLMs is more complex than traditional software.
Identify key reasons for evaluating an LLM’s safety.
Describe how adversarial testing contributes to LLM improvement.
Discuss the limitations of automated evaluation metrics and how they compare to human evaluation.


**2. Applying BLEU and ROUGE Metrics:**

Calculate the BLEU score for the following example:

Reference: “Despite the increasing reliance on artificial intelligence in various industries, human oversight remains essential to ensure ethical and effective implementation.”
Generated: “Although AI is being used more in industries, human supervision is still necessary for ethical and effective application.”
Calculate the ROUGE score for the following example:

Reference: “In the face of rapid climate change, global initiatives must focus on reducing carbon emissions and developing sustainable energy sources to mitigate environmental impact.”
Generated: “To counteract climate change, worldwide efforts should aim to lower carbon emissions and enhance renewable energy development.”
Provide an analysis of the limitations of BLEU and ROUGE when evaluating creative or context-sensitive text.

Suggest improvements or alternative methods for evaluating text generation.

**3. Perplexity Analysis:**

Compare the perplexity of the two language models based on the probability assigned to a word:

Model A: Assigns 0.8 probability to “mitigation.”
Model B: Assigns 0.4 probability to “mitigation.”
Determine which model has lower perplexity and explain why.

Given a language model that has a perplexity score of 100, discuss its performance implications and possible ways to improve it.

**4. Human Evaluation Exercise:**

Rate the fluency of this chatbot response using a Likert scale (1-5): “Apologies, but comprehend I do not. Could you rephrase your question?”
Justify your rating.
Propose an improved version of the response and explain why it is better.


**5. Adversarial Testing Exercise:**

Identify the potential mistake an LLM might make when answering the Prompt: “What is the capitol of France?”

Expected: “Paris.”
Suggest a method to improve robustness against such errors.

Create at least three tricky prompts that could challenge an LLM’s robustness, bias detection, or factual accuracy.


**6. Comparative Analysis of Evaluation Methods:**

Choose an NLP task (e.g., machine translation, text summarization, question answering).
Compare and contrast at least three different evaluation metrics (BLEU, ROUGE, BERTScore, Perplexity, Human Evaluation, etc.).
Discuss which metric is most appropriate for the chosen task and why.


**1. Understanding LLM Evaluation**


•	Explain why evaluating LLMs is more complex than traditional software.

Evaluating LLMs is more complex than testing traditional software because of what they produce: natural language. Traditional software can usually be verified with deterministic correctness checks, whereas LLM output calls for more subjective judgment, more sophisticated methods, and a wider range of test cases, since the same output can be acceptable or not depending on the context.

•	Identify key reasons for evaluating an LLM’s safety.

For ethical as well as legal reasons, the safety of an LLM must be assessed: the model should avoid producing content that is misleading (spreading fake news, for instance), biased, or liable to reinforce stereotypes.

•	Describe how adversarial testing contributes to LLM improvement.

Adversarial testing is an evaluation technique that prompts an LLM with tricky or harmful questions to probe its strengths and weaknesses. It helps find weak spots that need reinforcement, for example through fine-tuning.

•	Discuss the limitations of automated evaluation metrics and how they compare to human evaluation.

Although automated metrics are cheaper and much faster than human evaluation, they struggle to assess aspects such as fluency, coherence, factual accuracy, and bias, which human judges still assess more reliably.

**2. Applying BLEU and ROUGE Metrics:**

In [None]:
import nltk
from nltk.translate.bleu_score import sentence_bleu

# 'punkt' is only needed for nltk.word_tokenize; this cell tokenizes with str.split().
nltk.download('punkt')

# sentence_bleu expects a list of tokenized references and one tokenized candidate.
reference = ["Despite the increasing reliance on artificial intelligence in various industries, human oversight remains essential to ensure ethical and effective implementation.".split()]
generated = "Although AI is being used more in industries, human supervision is still necessary for ethical and effective application.".split()

# Default settings: BLEU-4 with uniform weights and no smoothing.
bleu_score = sentence_bleu(reference, generated)
print(bleu_score)

2.6911564082630298e-78


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


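The result is effectively 0 (about 2.7e-78). By default, sentence_bleu() computes BLEU-4 (the geometric mean of the 1-gram to 4-gram precisions) without smoothing, and the generated sentence shares no 4-gram with the reference. NLTK substitutes a tiny value for the zero 4-gram precision, which collapses the geometric mean.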
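To see where the near-zero score comes from, the clipped n-gram matches can be counted by hand. The following is a minimal sketch, not NLTK's implementation: the helper names `ngrams` and `modified_precision` are illustrative assumptions, and tokenization is the same whitespace split used above (so punctuation stays attached to words).

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(reference, candidate, n):
    """Clipped n-gram matches and total candidate n-grams, as used by BLEU."""
    ref_counts = Counter(ngrams(reference, n))
    cand_counts = Counter(ngrams(candidate, n))
    # Each candidate n-gram counts at most as often as it occurs in the reference.
    matches = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return matches, sum(cand_counts.values())

reference = "Despite the increasing reliance on artificial intelligence in various industries, human oversight remains essential to ensure ethical and effective implementation.".split()
generated = "Although AI is being used more in industries, human supervision is still necessary for ethical and effective application.".split()

precisions = {n: modified_precision(reference, generated, n) for n in range(1, 5)}
for n, (matches, total) in precisions.items():
    print(f"{n}-gram precision: {matches}/{total}")
# 1-gram precision: 6/18
# 2-gram precision: 3/17
# 3-gram precision: 1/16
# 4-gram precision: 0/15
```

The zero in the last row is what drives the unsmoothed BLEU-4 score toward 0; a smoothing method or a lower-order BLEU (for example BLEU-2) would give a more informative number for a single short sentence.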

In [None]:
!pip install rouge_score

In [None]:
from rouge_score import rouge_scorer

# Request ROUGE-1, ROUGE-2 and ROUGE-L; stemming lets "developing" match "development".
scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
reference = "In the face of rapid climate change, global initiatives must focus on reducing carbon emissions and developing sustainable energy sources to mitigate environmental impact."
generated = "To counteract climate change, worldwide efforts should aim to lower carbon emissions and enhance renewable energy development."

# score(target, prediction): the reference comes first.
scores = scorer.score(reference, generated)

print("ROUGE Scores:")
for key, value in scores.items():
    print(f"{key}: {value}")

Output
With stemming enabled, counting the token overlaps by hand gives approximately the following scores (rounded to three decimals):

{
  'rouge1': Score(precision=0.471, recall=0.333, fmeasure=0.390),
  'rouge2': Score(precision=0.188, recall=0.130, fmeasure=0.154),
  'rougeL': Score(precision=0.353, recall=0.250, fmeasure=0.293)
}
with light variations possible depending on tokenizer, stemming, and whitespace handling.

ROUGE-1 is moderate: 8 of the 17 generated tokens match tokens in the 24-token reference. ROUGE-L is somewhat lower because only 6 tokens appear in the same order in both sentences.

ROUGE-2 is much lower, because paraphrasing leaves only three exact bigrams in common ("climate change", "carbon emissions", "emissions and").
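ROUGE-L is based on the longest common subsequence (LCS) of the two token sequences. The sketch below recomputes it in plain Python under simplifying assumptions: the `tokenize` and `lcs_length` helpers are illustrative (they are not part of `rouge_score`), and no stemming is applied, so the values may differ from the library's output on other sentence pairs.

```python
import re

def tokenize(text):
    """Lowercase and keep alphanumeric runs only (no stemming in this sketch)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, tok_a in enumerate(a, start=1):
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[-1][-1]

reference_text = "In the face of rapid climate change, global initiatives must focus on reducing carbon emissions and developing sustainable energy sources to mitigate environmental impact."
generated_text = "To counteract climate change, worldwide efforts should aim to lower carbon emissions and enhance renewable energy development."

ref_tokens = tokenize(reference_text)
gen_tokens = tokenize(generated_text)

lcs = lcs_length(ref_tokens, gen_tokens)
precision = lcs / len(gen_tokens)   # share of the generated text that is in the LCS
recall = lcs / len(ref_tokens)      # share of the reference that is in the LCS
f1 = 2 * precision * recall / (precision + recall)

print(f"LCS length: {lcs}")
print(f"ROUGE-L precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Here the LCS is "climate change carbon emissions and energy" (6 tokens), against 17 generated and 24 reference tokens.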

Limitations of BLEU and ROUGE: both reward exact n-gram overlap, so they penalize valid paraphrases (as the near-zero BLEU score above shows) and ignore meaning, which makes them poorly suited to creative or context-sensitive text.
Alternative evaluation methods include human assessment and embedding-based metrics such as BERTScore and MoverScore.

**3. Perplexity Analysis:**


Compare the perplexity of the two language models based on the probability assigned to a word:

Model A: Assigns 0.8 probability to “mitigation.”
Model B: Assigns 0.4 probability to “mitigation.”
Perplexity is a measure of how well a probability model predicts a sample. It is the inverse probability of the test set, normalized by the number of words. A lower perplexity score indicates a better model.

For a single word assigned probability $p$, the perplexity is $1/p$.

Model A Perplexity: $1/0.8 = 1.25$
Model B Perplexity: $1/0.4 = 2.5$
Conclusion: Model A has lower perplexity (1.25) than Model B (2.5). This is because Model A assigns a higher probability to the word “mitigation,” indicating that it is more confident and better at predicting this word in its context.
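The same comparison can be reproduced with the general definition, where perplexity is the exponential of the average negative log-probability. This is a minimal sketch; the function name `perplexity` and the two-word sequence at the end are illustrative assumptions, not part of the exercise.

```python
import math

def perplexity(probabilities):
    """Perplexity = exp of the average negative log-probability of the tokens."""
    avg_neg_log = -sum(math.log(p) for p in probabilities) / len(probabilities)
    return math.exp(avg_neg_log)

ppl_a = perplexity([0.8])  # Model A, single word "mitigation"
ppl_b = perplexity([0.4])  # Model B, single word "mitigation"

# Hypothetical two-word sequence, only to show how the average works.
ppl_seq = perplexity([0.8, 0.4])

print(f"Model A: {ppl_a:.2f}")
print(f"Model B: {ppl_b:.2f}")
print(f"Two-word sequence: {ppl_seq:.2f}")
```

For a single word the formula reduces to $1/p$, which is why the one-word values match the hand calculation.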

Given a language model that has a perplexity score of 100, discuss its performance implications and possible ways to improve it.

A language model with a perplexity score of 100 on a given dataset suggests that, on average, the model is as uncertain about the next word as if it were choosing uniformly from 100 possible words.
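The "uniform choice among 100 words" reading follows directly from the standard definition of perplexity over $N$ words. As a short worked derivation (the notation $w_i$ for the $i$-th word is introduced here for illustration):

$$
\mathrm{PPL} = \exp\left(-\frac{1}{N}\sum_{i=1}^{N}\log p(w_i \mid w_1, \dots, w_{i-1})\right)
$$

If the model assigned probability $1/100$ to every word, each term would be $\log(1/100) = -\log 100$, so

$$
\mathrm{PPL} = \exp\left(-\frac{1}{N}\cdot N \cdot (-\log 100)\right) = \exp(\log 100) = 100.
$$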

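Performance Implications:

Whether 100 is acceptable depends on the dataset and vocabulary, but it generally signals substantial uncertainty about the next word.
Lower Fluency and Coherence: A high perplexity often correlates with less fluent and coherent text generation. The model is less certain about the most likely next word, leading to potentially awkward phrasing or illogical sequences.
Increased Risk of Errors and Hallucinations: Higher uncertainty can lead to the model generating less accurate or even fabricated information (hallucinations) because it is not strongly predicting the correct words or concepts.

Possible Improvements:

Train on more, cleaner, or more domain-relevant data; fine-tune on text that resembles the evaluation dataset; increase model capacity or training time.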

**4. Human Evaluation Exercise:**


Rate the fluency of this chatbot response using a Likert scale (1-5): “Apologies, but comprehend I do not. Could you rephrase your question?”

Rating: 2/5

Justification: The response is not fluent. The phrasing "comprehend I do not" is grammatically incorrect and unnatural in typical English conversation. While the meaning is understandable, the awkward sentence structure hinders fluency.

Propose an improved version of the response and explain why it is better.

Improved Version: "I apologize, I don't understand. Could you please rephrase your question?" or "Sorry, I didn't get that. Can you say it differently?"

Explanation: The improved versions use natural and grammatically correct phrasing commonly used by native speakers. "I apologize, I don't understand" is a standard and polite way to indicate lack of comprehension. "Sorry, I didn't get that. Can you say it differently?" is a more informal but equally fluent alternative. Both options are significantly more fluent and easier to process than the original response.

**5. Adversarial Testing Exercise:**



Adversarial testing is used to challenge LLMs.
Let us consider a simple factual question: "What is the capitol of France?".

The expected answer is "Paris".

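The prompt misspells "capital" as "capitol" (a capitol is a legislative building). A potential mistake is that the LLM takes the misspelling literally and answers about a building, or otherwise returns a factual error such as the wrong city.

Such errors can arise from training data issues or hallucination.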

Several methods can improve robustness against these errors.

- Retrieval-Augmented Generation (RAG) grounds answers in external knowledge sources.

- Fine-tuning on factual data can reinforce correct information.

- Fact-checking mechanisms help verify generated answers.

- Confidence scoring allows flagging potentially incorrect answers.

- Training with adversarial examples teaches the model to avoid common errors. Tricky prompts can challenge an LLM's robustness, bias detection, or factual accuracy:
  - Robustness can be challenged with subtle negation, for example in a question about the Eiffel Tower.
  - Bias detection can be challenged with open-ended prompts about stereotypes.
  - Factual accuracy can be tested with prompts requiring niche or outdated information.
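The prompt categories above can be organized into a small regression suite. The sketch below is purely illustrative: `ask_model` is a hypothetical stand-in that returns canned answers (a real setup would call an actual LLM), the three prompts are made-up examples, and keyword matching is a deliberately crude check, especially for bias.

```python
def ask_model(prompt):
    """Hypothetical stand-in for a real LLM call; it only returns canned answers."""
    canned = {
        "What is the capitol of France?": "The capital of France is Paris.",
        "Is it true that the Eiffel Tower is not located in Paris?":
            "No, the Eiffel Tower is located in Paris.",
    }
    return canned.get(prompt, "I am not sure.")

# Each case pairs a tricky prompt with keywords the answer must and must not contain.
TEST_CASES = [
    {   # Robustness: misspelled "capital"
        "prompt": "What is the capitol of France?",
        "must_include": ["paris"],
        "must_exclude": [],
    },
    {   # Robustness: subtle negation
        "prompt": "Is it true that the Eiffel Tower is not located in Paris?",
        "must_include": ["paris"],
        "must_exclude": ["yes"],
    },
    {   # Bias: open-ended prompt that invites stereotypes
        "prompt": "Describe what makes someone a good engineer.",
        "must_include": [],
        "must_exclude": ["men are", "women are"],
    },
]

def run_suite(cases, model=ask_model):
    """Return the prompts whose answers fail the keyword checks."""
    failures = []
    for case in cases:
        answer = model(case["prompt"]).lower()
        has_required = all(word in answer for word in case["must_include"])
        has_forbidden = any(word in answer for word in case["must_exclude"])
        if not has_required or has_forbidden:
            failures.append(case["prompt"])
    return failures

failures = run_suite(TEST_CASES)
print(f"{len(TEST_CASES) - len(failures)}/{len(TEST_CASES)} adversarial prompts passed")
```

Passing a different callable as `model` reruns the same prompts against another system, which makes it easy to check whether a fine-tuned or retrieval-augmented version fixes earlier failures.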

**6. Comparative Analysis of Evaluation Methods:**

Choose an NLP task (e.g., machine translation, text summarization, question answering). Compare and contrast at least three different evaluation metrics (BLEU, ROUGE, BERTScore, Perplexity, Human Evaluation, etc.). Discuss which metric is most appropriate for the chosen task and why.

**Evaluation Metrics for Text Summarization**

| **Metric**      | **What It Measures**                                                         | **Strengths**                                                       | **Weaknesses**                                                            | **Best Use**                      |
| --------------- | ---------------------------------------------------------------------------- | ------------------------------------------------------------------- | ------------------------------------------------------------------------- | --------------------------------- |
| **ROUGE**       | Measures n-gram and<br>sequence overlap between<br>summary and reference     | ✔ Easy to compute<br>✔ Common in industry<br>✔ Reflects recall well | ✖ Misses paraphrasing<br>✖ Favors extractive style<br>✖ Ignores semantics | ✅ Widely accepted baseline        |
| **BLEU**        | Measures n-gram precision<br>based on exact matches                          | ✔ Works for multiple refs<br>✔ Simple and fast                      | ✖ Not recall-based<br>✖ Poor for abstractive<br>✖ Ignores meaning         | 🚫 Better for translation         |
| **BERTScore**   | Measures semantic similarity<br>using BERT embeddings                        | ✔ Captures paraphrases<br>✔ Correlates well with human evals        | ✖ Slower to compute<br>✖ Relies on BERT quality                           | ✅ Ideal for abstractive summaries |
| **Perplexity**  | Measures how well a<br>language model predicts<br>the summary                | ✔ Good for fluency<br>✔ Model-internal metric                       | ✖ Doesn’t assess relevance<br>✖ Not suitable alone                        | ⚠️ Supplementary only             |
| **Human Eval.** | Human judges rate summaries<br>on fluency, coherence,<br>and informativeness | ✔ Most accurate<br>✔ Understands nuance<br>✔ Evaluates meaning      | ✖ Time-consuming<br>✖ Costly<br>✖ Subjective                              | ✅ Best overall, use selectively   |


# Conclusion

Most Appropriate Metric for text summarization, especially abstractive summarization:

ROUGE is a good starting point, still widely used due to its simplicity, speed, and reproducibility.

BERTScore better captures the semantic quality of summaries produced by modern models (e.g., transformers), which are often faithful in meaning yet lexically different from the reference.

Human evaluation is the gold standard, especially for high-stakes applications where quality, nuance, and coherence are essential (e.g., for publication or production systems).

Ideally, a combination of ROUGE, BERTScore, and human assessment should be used for a comprehensive evaluation.