# LLM Metrics Overview

### NOTES:
- I want to know whether there is a difference between running on one GPU and on four GPUs, and how we can measure whether we need a 70B model or a 7B model.
- Brevity is a high priority. We want to know whether a response will take people a long time to read through.
- Model choices: Mistral, Llama, Mistral (70B), all available through Hugging Face.
- I also want basic performance metrics for compute and latency across the models.
- Make preprocessing scripts using the GitHub repo with dummy PDF data so that it fulfills the requirements.
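
For the 7B versus 70B and one versus four GPU question, a rough first check is whether the weights alone fit in GPU memory. The sketch below is a back-of-the-envelope estimate, not a measurement: the function names are hypothetical, the 80 GB per-GPU figure in the usage note is an assumed example, and the estimate ignores KV cache, activations, and framework overhead, so real usage will be higher.

```python
import math

# Bytes per parameter for common inference precisions.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weight_memory_gb(num_params: float, precision: str = "fp16") -> float:
    """Approximate memory for model weights only, in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


def min_gpus_for_weights(num_params: float, precision: str, gpu_memory_gb: float) -> int:
    """Lower bound on the GPU count needed just to hold the weights."""
    return math.ceil(weight_memory_gb(num_params, precision) / gpu_memory_gb)
```

With these assumptions, `weight_memory_gb(7e9, "fp16")` gives 14.0 and `weight_memory_gb(70e9, "fp16")` gives 140.0, so on an assumed 80 GB card `min_gpus_for_weights(70e9, "fp16", 80)` returns 2 while the 7B model fits on one.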

### NOTES 2:
1. Latency
    1. Inference Warmup Time: Time required for the model to stabilize during the warm-up phase.
2. Power Consumption
    1. Energy Consumption Per Token: Power consumption normalized by the number of tokens generated.
3. Batch Scaling
4. Precision Comparison
    1. Precision Accuracy: Compare text outputs for fp32 vs. fp16 to confirm that lower precision does not materially change the results.
5. Memory Usage by Sequence Length
6. Token Throughput: Number of tokens processed per second.
7. Model Size (Parameter Count): Display the total number of trainable parameters in the model.
8. GPU Utilization: Percentage of GPU compute capacity used during inference.
9. GPU Memory Used: Peak GPU memory consumed during inference (in GB).
10. Response Length and Brevity Evaluation: Evaluate response length and calculate brevity ratio (output length / input length).
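
Several of the items above reduce to simple ratios once token counts, wall-clock time, and average power draw have been logged. The following is a minimal sketch of those calculations; the function names are hypothetical, token counts are assumed to come from the model's own tokenizer, and average power is assumed to be sampled separately (for example from a GPU monitoring tool).

```python
def brevity_ratio(input_tokens: int, output_tokens: int) -> float:
    """Output length divided by input length, as defined in item 10."""
    if input_tokens <= 0:
        raise ValueError("input_tokens must be positive")
    return output_tokens / input_tokens


def tokens_per_second(num_tokens: int, elapsed_seconds: float) -> float:
    """Token throughput over a measured generation window."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return num_tokens / elapsed_seconds


def energy_per_token_joules(avg_power_watts: float, elapsed_seconds: float,
                            num_tokens: int) -> float:
    """Energy (power x time) normalized by the number of generated tokens."""
    if num_tokens <= 0:
        raise ValueError("num_tokens must be positive")
    return avg_power_watts * elapsed_seconds / num_tokens
```

A brevity ratio below 1 means the response is shorter than the prompt; whether that is desirable depends on the task, so the ratio is best compared across models on the same prompts.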

## Currently implemented for Hugging Face metric evaluation
- BLEU (Bilingual Evaluation Understudy)
- ROUGE (Recall-Oriented Understudy for Gisting Evaluation)
- METEOR
- BERTScore
- RAGAS (Retrieval Augmented Generation Assessment)
- HELM (Holistic Evaluation of Language Models)
- GPT-Score
- Scenario Fidelity Tests
- Forgetting Rate
- RIMU (Relevance, Integration, Memory, and Usefulness)
- Problem-Solving Effectiveness


## Metrics to consider for later
- F1 for fact matching
- Context Coverage Score
- Memory Application Score
- Utility in Open-Ended Tasks
- Temporal Utility Metric

### DRAFT:

LLM Metrics Resources List:

#### Answer Relevance and Correctness
Metrics:
- BLEU (Bilingual Evaluation Understudy): Measures n-gram precision against one or more reference texts, with a brevity penalty. It captures surface overlap rather than meaning.
    - GitHub: BLEU Implementation
- ROUGE (Recall-Oriented Understudy for Gisting Evaluation): Measures n-gram and longest-common-subsequence overlap with an emphasis on recall; commonly used for summarization tasks.
    - GitHub: ROUGE Google Research
- METEOR: Combines precision, recall, stemming, and synonym matching for semantic similarity.
    - Paper: METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments
- BERTScore: Uses embeddings from BERT or similar models to evaluate semantic similarity.
    - GitHub: BERTScore Implementation
- RAGAS (Retrieval Augmented Generation Assessment): Evaluates retrieval-augmented responses on dimensions such as faithfulness to the retrieved context, answer relevance, and context relevance.
    - GitHub: RAGAS Implementation
- F1 for Fact Matching: Measures exact or partial match of factual elements.
    - GitHub: Scikit-Learn F1 Metric
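
To make the overlap-based metrics above concrete, here is a minimal pure-Python sketch of two building blocks: a token-level F1 for partial fact matching, and the clipped n-gram precision that BLEU is built on. This is illustrative only. It uses whitespace tokenization and lowercasing as simplifying assumptions, the function names are hypothetical, and it is not full BLEU (no brevity penalty, no geometric mean over n-gram orders).

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall (partial-match F1)."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    if not pred or not ref:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def ngram_counts(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_ngram_precision(candidate: str, reference: str, n: int = 1) -> float:
    """Share of candidate n-grams found in the reference, with counts clipped."""
    cand = ngram_counts(candidate.lower().split(), n)
    ref = ngram_counts(reference.lower().split(), n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    return sum((cand & ref).values()) / total
```

Clipping is what stops a candidate such as "the the the" from scoring perfectly against a reference that contains "the" only once. For reported results, an established implementation is preferable to this sketch.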

#### Context Handling and Memory
Metrics:
- Context Coverage Score: Measures the proportion of relevant context elements used in the response.
    - No direct implementation; typically evaluated with custom scripts.
- Memory Application Score: Tests the integration of past context or facts into responses.
    - Custom metric based on embeddings or human feedback.
- Cross-Session Recall: Evaluates the ability to retain context across interactions.
    - No direct implementation; scenario-based evaluation required.
- FEQA (Faithfulness Evaluation with Question Answering): Measures faithfulness of generated text to its source context.
    - GitHub: FEQA Implementation
- QAGS (Question Answering and Generation for Summarization): Assesses factual consistency using QA systems.
    - GitHub: QAGS Implementation
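
Since Context Coverage Score has no standard implementation, a custom script could start from something like the sketch below. It assumes the relevant context elements have already been extracted as short strings (names, dates, key phrases) and uses case-insensitive substring matching, which is a deliberately crude stand-in for embedding-based or human judgment. The function name is hypothetical.

```python
from typing import Sequence


def context_coverage(context_elements: Sequence[str], response: str) -> float:
    """Fraction of the given context elements that appear in the response."""
    if not context_elements:
        return 0.0
    text = response.lower()
    used = sum(1 for element in context_elements if element.lower() in text)
    return used / len(context_elements)


elements = ["Paris", "1889", "Gustave Eiffel"]
score = context_coverage(elements, "The tower in Paris opened in 1889.")
```

Here two of the three elements appear in the response, so `score` is 2/3. Substring matching misses paraphrases, so this is better treated as a lower bound on coverage than as a final score.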

#### Factuality and Faithfulness
Metrics:
- Groundedness Score: Evaluates factual accuracy and grounding in provided evidence.
    - Typically custom implementations using external knowledge bases.
- TruthfulQA: Measures truthfulness and resistance to generating false information.
    - GitHub: TruthfulQA Benchmark
- Faithfulness Metrics (e.g., FEQA, QAGS): See Context Handling.

#### Utility and Task Effectiveness
Metrics:
- Task Completion Rate: Measures success in achieving user-defined tasks.
    - No specific implementation; commonly measured via user testing.
- Response Actionability: Tests whether the response provides actionable and useful next steps.
    - Custom metric, often integrated into manual evaluations.
- HELM (Holistic Evaluation of Language Models): Evaluates accuracy, robustness, efficiency, and other dimensions.
    - Website: HELM Overview
- HELP (Human Evaluation of Language Processing): Rates correctness, relevance, coherence, and grammaticality.
    - Paper: HELP Framework

#### Engagement and Naturalness
Metrics:
- Conversational Usefulness: Evaluates the ability to maintain engagement while being task-oriented.
    - Custom metric based on human evaluation.
- Empathy and Alignment Score: Tests alignment with the user’s emotional tone and conversational needs.
    - No specific implementation available.
- GPT-Score: Uses a secondary LLM to evaluate fluency, coherence, and naturalness.
    - GitHub: GPT-Score Implementation

#### Robustness and Adaptability
Metrics:
- Ambiguity Resolution Score: Evaluates the ability to clarify ambiguous inputs.
    - No specific implementation available.
- Conflict Resolution Score: Assesses reconciliation of conflicting information in context.
    - Custom evaluations needed.
- Scenario Fidelity Tests: Simulate complex user scenarios to test robustness.
    - Custom scripts or manually crafted scenarios.

#### Efficiency and Scalability
Metrics:
- Latency Metrics: Measures response time for varied query complexities.
    - No specific implementation available; commonly logged in production systems.
- Context Window Efficiency: Assesses how efficiently key context elements are prioritized in limited input token space.
    - Often integrated into custom testing setups.
- OpenAI’s Evaluation Framework: Provides tools for automated and manual evaluation of various metrics.
    - GitHub: OpenAI Evaluations
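
Because latency has no off-the-shelf implementation here, a small timing harness is usually enough to get started. The sketch below is one possible shape, assuming the model is wrapped in a `generate(prompt)` callable that blocks until the response is complete (with GPU backends that run asynchronously, the wrapper would need to wait for the device before returning). The function name, the default run counts, and the nearest-rank percentile are illustrative choices.

```python
import statistics
import time
from typing import Callable, Dict


def measure_latency(generate: Callable[[str], str], prompt: str,
                    warmup_runs: int = 2, timed_runs: int = 10) -> Dict[str, float]:
    """Time repeated calls to generate(prompt) after discarding warm-up runs."""
    for _ in range(warmup_runs):
        generate(prompt)  # Warm-up calls are excluded from the statistics.

    samples = []
    for _ in range(timed_runs):
        start = time.perf_counter()
        generate(prompt)
        samples.append(time.perf_counter() - start)

    samples.sort()
    # Approximate 95th percentile by picking the nearest rank in sorted samples.
    p95_index = min(len(samples) - 1, int(round(0.95 * (len(samples) - 1))))
    return {
        "mean_s": statistics.mean(samples),
        "median_s": statistics.median(samples),
        "p95_s": samples[p95_index],
        "max_s": samples[-1],
    }


# Stand-in callable for demonstration; a real run would wrap the model call.
stats = measure_latency(lambda p: p.upper(), "hello", warmup_runs=1, timed_runs=5)
```

Running the same harness over short, medium, and long prompts gives the "varied query complexities" breakdown, and comparing the first warm-up call against the steady-state median gives a rough view of warm-up cost.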

#### Tools and Frameworks
Integrated Evaluation Toolkits:
- Hugging Face Datasets and Metrics: Provides a unified library for BLEU, ROUGE, and other metrics.
    - Website: Hugging Face Metrics
- RAGAS for Retrieval-Augmented Models: Specifically for retrieval-augmented systems.
    - GitHub: RAGAS Implementation
- TruthfulQA Benchmark: Focused on truthfulness in language models.
    - GitHub: TruthfulQA

### Appendix:

#### Answer Usefulness and Utility
- Task Completion Rate: Measures whether the response effectively completes the user’s request or task, especially for goal-oriented interactions (e.g., booking tickets, solving a problem).
- Response Actionability Score: Evaluates whether the response provides clear, actionable information or next steps that are useful to the user.
- Specificity and Relevance: Measures how tailored the response is to the query, avoiding vague or overly general answers.
- Utility Feedback: Gathers user feedback to rate the usefulness of responses on a Likert scale (e.g., 1-5).
- Information Coverage: Tests if the response sufficiently covers the query while avoiding unnecessary details or omissions.
- Error Reduction Rate: Evaluates how often the LLM provides responses that prevent misunderstandings, errors, or missteps in task execution.

#### Contextual Relevance and Use
- Context Coverage Score: Measures the percentage of relevant contextual information used in generating the response.
- Attention Consistency: Assesses whether the model focuses on and references the correct parts of the input context during inference.
- Sequential Context Utility: Evaluates how well the model utilizes information from multiple prior turns to maintain relevance and coherence.
- Topic Continuity Score: Measures the ability of the model to stay on topic across multi-turn dialogues without unnecessary drift.
- Ambiguity Resolution Score: Assesses the model’s ability to disambiguate unclear queries using context and memory effectively.

#### Memory Handling and Long-Term Usefulness
- Temporal Recall Accuracy: Measures whether the model recalls key facts or events over time, distinguishing between short-term and long-term memory.
- Memory Relevance Metric: Assesses whether recalled information is relevant to the current query and used appropriately.
- Forgetting Rate: Tracks how quickly the model loses the ability to recall and apply previously introduced information.
- Memory Application Score: Measures the ability to integrate past knowledge to provide deeper, more informed responses.
- Cross-Session Recall: Tests how well the model retains and applies relevant information from earlier sessions.

#### Factuality and Faithfulness
- Groundedness Score: Evaluates whether the response is factually accurate and supported by the input context or retrieved documents (in RAG systems).
- Faithfulness to Source: Assesses whether the response stays true to provided data without introducing hallucinations.
- Factual Usefulness: Measures whether factual information provided is not only correct but also pertinent to the user’s needs.
- Error Propagation Metric: Tracks whether the model compounds errors by misapplying incorrect or irrelevant facts.

#### Response Efficiency and Practicality
- Response Brevity Score: Evaluates whether the response is concise yet complete, avoiding excessive verbosity.
- Clarity Index: Measures the ease with which a user can understand and apply the response.
- Redundancy Rate: Tracks unnecessary repetition or over-elaboration in responses.
- Latency in Task Resolution: Measures the time it takes for a user to resolve their query with the model's help.
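
Redundancy Rate is not defined precisely above. One simple way to operationalize it, offered here only as an assumed definition, is the share of word n-grams in a response that repeat an earlier n-gram. The function name, the whitespace tokenization, and the default of trigrams are all illustrative choices.

```python
from collections import Counter


def redundancy_rate(text: str, n: int = 3) -> float:
    """Fraction of word n-grams that duplicate an n-gram seen earlier in the text."""
    tokens = text.lower().split()
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not grams:
        return 0.0
    counts = Counter(grams)
    # Every occurrence beyond the first of each n-gram counts as a repeat.
    repeated = sum(count - 1 for count in counts.values())
    return repeated / len(grams)
```

A value of 0 means no n-gram repeats. This catches verbatim repetition only; paraphrased over-elaboration would need an embedding-based or human check.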

#### Engagement and Human-Like Interaction
- Conversational Usefulness: Evaluates whether the interaction maintains engagement while being task-oriented and useful.
- Empathy and Alignment Score: Measures how well the model aligns with the user’s emotional tone and conversational needs.
- Dialogue Flow Score: Assesses whether responses contribute to a seamless, natural interaction without jarring transitions.
- Personalization Effectiveness: Evaluates how well the model adapts its responses to the user’s individual preferences or prior interactions.

#### Robustness Across Complex Scenarios
- Scenario-Specific Usefulness: Tests response utility in specialized scenarios, such as medical advice, legal queries, or customer support.
- Context Switching Efficiency: Evaluates the model’s ability to handle sudden topic changes without confusion or loss of relevance.
- Conflict Resolution Score: Assesses the ability to reconcile contradictory information in context while maintaining a useful response.
- Multi-Faceted Query Handling: Measures effectiveness in addressing queries that involve multiple sub-tasks or dimensions.

#### Scenario-Based Composite Metrics
- Utility in Open-Ended Tasks: Evaluates usefulness in creative or exploratory queries, such as brainstorming or storytelling.
- Problem-Solving Effectiveness: Measures how well the model aids in solving logical, mathematical, or domain-specific problems.
- Instruction Following Accuracy: Tracks adherence to complex user instructions while ensuring utility in the output.

#### Evaluation Frameworks for Usefulness
- Integrated Metrics
    - RIMU (Relevance, Integration, Memory, and Usefulness): Combines relevance, memory effectiveness, and user-centric utility into a composite score.
    - Actionability and Recall Score (ARS): Evaluates a response based on how actionable and contextually grounded it is while utilizing memory effectively.
    - Temporal Utility Metric (TUM): Measures response utility over time, assessing how context and memory are applied in long-term interactions.
- Human and Simulated Feedback
    - Real-User Task Evaluation: Have real users evaluate response usefulness in practical tasks and rate task success.
    - Simulated Scenarios: Test model responses in carefully crafted, multi-turn scenarios with defined utility goals.
    - Crowdsourced Usefulness Scores: Use human annotators to provide quantitative scores for response utility.
- Automated Testing
    - Probing for Utility: Use automated tools to systematically test the model’s ability to produce useful responses across different task types.
    - Retrieval-Augmented Benchmarks: Evaluate response quality in RAG systems by testing the relationship between retrieved evidence and its application in generating useful answers.
