# Basic Features to compare LLMs

 **Basic Parameters**                             | **Comments**                                                                                                      
--------------------------------------------------|-------------------------------------------------------------------------------------------------------------------
 **Open-source / closed**                         | Impacts cost, licensing, and extensibility.                                                                       
 **Release date**                                 | When the model was first made available.                                                                          
 **Knowledge cut-off**                            | The last date included in its training data (after which it “knows” nothing new).                                 
 **Parameters**                                   | Total trainable weights—gives a rough sense of model capacity (and cost).                                         
 **Training tokens**                              | Size of the training corpus (in tokens)—indicates depth and breadth of learned knowledge.                         
 **Context Length**                               | Maximum number of tokens (prompt plus generated output) the model can “see” at once (window size).                
 **Inference Cost**                               | Per‑query cost—API token pricing, subscription fees, or self‑hosted compute charges (e.g., Colab, GPU instances). 
 **Training cost**                                | Expenses for fine-tuning or full training (if you plan to adapt the model).                                       
 **Build cost**                                   | Development effort and engineering resources required to integrate and deploy the model.                          
 **Time to Market**                               | How quickly you can have a working solution (frontier models often win here).                                     
 **Rate limits**                                  | API call quotas or throttling (especially on managed services/subscriptions).                                     
 **Speed**                                        | Tokens per second—how fast the model can generate complete outputs once running.                                  
 **Latency**                                      | Time to first token or end‑to‑end response time—critical for real‑time applications.                              
 **License**                                      | Usage and redistribution rights, commercial restrictions, and any revenue‑based clauses.                          
 **Ecosystem & Tooling Support**                  | Availability of SDKs, libraries, and integration plugins (e.g., Hugging Face Hub, LangChain).                     
 **Community & Documentation**                    | Quality of docs, tutorials, and community forums for troubleshooting and best practices.                          
 **Fine‑Tuning Capabilities**                     | Supported methods (LoRA, full‑model, adapters) and ease of use.                                                   
 **Hardware Requirements**                        | Minimum GPU/TPU specs for inference and training (memory, compute).                                               
 **Multi‑Modal Support**                          | Native ability to handle text+images, audio, code, etc.                                                           
 **Language Coverage & Multilingual Performance** | Number of supported languages and comparative benchmarks across them.                                             
 **Alignment & Safety Features**                  | Built‑in guardrails, content filters, and mitigation of harmful outputs.                                          
 **Privacy & Data Governance**                    | On‑prem vs. cloud hosting options, data retention policies, and compliance (e.g., GDPR, HIPAA).                   
 **Security & Compliance**                        | Certifications (SOC 2, ISO 27001) and enterprise‑grade security features.                                         
 **Interpretability & Explainability**             | Tools or APIs for understanding model decisions (attention visualization, token attribution).                     
 **Update Cadence**                               | Frequency of model improvements, patch releases, and knowledge base refreshes.                                    
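
Several rows above (inference cost, speed, latency) reduce to simple arithmetic once you have a vendor's numbers. The sketch below shows one way to turn them into a per-query cost and a rough response time. The `ModelProfile` class and every figure in it are hypothetical placeholders, not real pricing for any model.

```python
from dataclasses import dataclass


@dataclass
class ModelProfile:
    """Hypothetical per-model figures; replace with real vendor numbers."""
    input_price_per_mtok: float   # USD per 1M input tokens
    output_price_per_mtok: float  # USD per 1M output tokens
    time_to_first_token_s: float  # latency before generation starts
    tokens_per_second: float      # generation speed once running


def query_cost_usd(p: ModelProfile, input_tokens: int, output_tokens: int) -> float:
    """Token-priced cost of a single query."""
    return (input_tokens * p.input_price_per_mtok
            + output_tokens * p.output_price_per_mtok) / 1_000_000


def response_time_s(p: ModelProfile, output_tokens: int) -> float:
    """End-to-end time: time to first token plus generation time."""
    return p.time_to_first_token_s + output_tokens / p.tokens_per_second


# Made-up profile: $3 / $15 per 1M tokens, 0.5 s to first token, 50 tokens/s.
example = ModelProfile(3.0, 15.0, 0.5, 50.0)
print(f"cost per query: ${query_cost_usd(example, 2_000, 500):.4f}")
print(f"response time:  {response_time_s(example, 500):.1f} s")
```

Running the same two functions over a few candidate models gives a quick side-by-side before looking at quality benchmarks.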

# The Chinchilla Scaling Law

##### What Are “Scaling Laws”
- **Definition**: Empirical relationships that predict how model performance (e.g., loss, accuracy) improves as you increase key resources—model size (parameters), dataset size (tokens), or compute (FLOPs).
- **Why they matter**: Help researchers allocate a fixed compute budget most effectively between bigger models or more data.
- **Origin**: The Chinchilla law is named after DeepMind’s 70 B‑parameter model “Chinchilla” (Hoffmann et al., 2022).
- **Core insight**: Rather than simply increasing parameters, balance model size `N` and training tokens `D` so that `D ~ 20 * N`.

##### The Chinchilla Law (High‑Level Statement)
- “For a given compute budget, the optimal number of training tokens should scale linearly with model size, at roughly a 20 : 1 token‑to‑parameter ratio.”

- Model: Chinchilla is a 70 B‑parameter LLM trained on ~1.4 T tokens, over 4× more data than its 280 B‑parameter predecessor Gopher (~300 B tokens), at roughly the same training compute.
- Key insight: Previous “bigger‑is‑better” approaches under‑utilized data; you get more bang‑for‑your‑buck by training somewhat smaller models on more tokens.

##### Simple Intuition & Analogy
- Cookie‑baking
    - You have a fixed oven‑time budget (compute).
    - You can bake bigger cookies (larger model) or more, smaller cookies (more training data).
    - Chinchilla says: bake moderately sized cookies but many of them—your total yield (performance) is maximized.

##### Basic Numerical Example
Compute budget `C` is measured in FLOPs (the total number of floating‑point operations spent on training), not FLOPS (floating‑point operations per second).

 **Compute Budget (FLOPs)** | **Gopher (old)**       | **Chinchilla (optimal)** 
----------------------------|------------------------|--------------------------
 **Model size (N)**         | 280 B parameters       | 70 B parameters          
 **Tokens (D)**             | 300 B tokens           | 1.4 T tokens             
 **Tokens / Params (D/N)**  | ~1.1                   | ~20 (20 : 1)             
 **Outcome**                | 60.0 % on MMLU         | 67.5 % on MMLU (+7.5 pts)
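
The numbers in the table can be sanity-checked with the common rule of thumb `C ≈ 6 * N * D` for training compute. The sketch below is an illustration under that assumption, using the publicly reported sizes (280 B / 300 B for Gopher, 70 B / about 1.4 T for Chinchilla). The helper names `train_flops` and `compute_optimal` are made up for this example.

```python
import math


def train_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute with the C ~ 6 * N * D rule of thumb."""
    return 6.0 * n_params * n_tokens


def compute_optimal(c_flops: float, tokens_per_param: float = 20.0):
    """Split a FLOP budget into (N, D), assuming C = 6*N*D and D = ratio * N."""
    n = math.sqrt(c_flops / (6.0 * tokens_per_param))
    return n, tokens_per_param * n


models = {"Gopher": (280e9, 300e9), "Chinchilla": (70e9, 1.4e12)}
for name, (n, d) in models.items():
    print(f"{name}: D/N = {d / n:.1f}, C ~ {train_flops(n, d):.2e} FLOPs")

# Feeding Chinchilla's own budget back in recovers roughly 70 B params / 1.4 T tokens.
n_opt, d_opt = compute_optimal(train_flops(70e9, 1.4e12))
print(f"20:1 split of that budget: N ~ {n_opt:.2e}, D ~ {d_opt:.2e}")
```

Both models land in the same order of magnitude of compute (about 5e23 to 6e23 FLOPs under this approximation), which is the point of the comparison: similar budget, very different allocation.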

##### Mathematical Formulation

![alt text](images/mathematical_formulation_chinchilla.png)
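
For readers who cannot view the image, the formulation usually attributed to the Chinchilla paper (Hoffmann et al., 2022) can be sketched as follows. This is a reconstruction from the commonly cited form, not a transcription of the image, so check it against the figure. Here `N` is the parameter count, `D` the number of training tokens, `C` the compute budget in FLOPs, and `E`, `A`, `B`, `α`, `β` are fitted constants.

```latex
% Training compute (rule of thumb) and parametric loss
C \approx 6 N D, \qquad
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}

% Compute-optimal allocation: minimise loss subject to the budget
(N^{*}, D^{*}) = \operatorname*{arg\,min}_{N,\, D \;:\; 6ND = C} L(N, D),
\qquad N^{*} \propto C^{a}, \quad D^{*} \propto C^{b}, \quad a \approx b \approx 0.5

% Practical consequence quoted in this section
D^{*} \approx 20\, N^{*}
```

Because `a` and `b` are both close to 0.5, parameters and tokens should grow at about the same rate as compute grows, which is where the roughly constant tokens-per-parameter ratio comes from.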

##### Advanced Insights
- Loss Scaling
    - Empirically, cross‑entropy loss `L` follows a power law in `N` and `D`.
    - Chinchilla fits this law to find the minimum `L` at each `C`.
- Beyond Pre‑Training
    - For fine‑tuning or inference‑heavy workloads, newer work (e.g. “Beyond Chinchilla‑Optimal”) suggests shifting the ratio toward smaller models trained on more tokens, because a smaller model is cheaper on every downstream query.
- Practical Takeaway
    - If you have limited compute, don’t chase ever‑larger models—allocate more steps over more data instead.
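
The power-law form of the loss can be evaluated directly. The sketch below uses the constants commonly quoted from the Chinchilla paper's parametric fit (`E = 1.69`, `A = 406.4`, `B = 410.7`, `α = 0.34`, `β = 0.28`). Treat them as illustrative assumptions: the fit is approximate and is used here only to compare two allocations, not to derive an exact optimum.

```python
# Constants as commonly quoted from the parametric fit in Hoffmann et al. (2022).
# Assumption: illustrative values, not a definitive reproduction of the paper.
E, A, B = 1.69, 406.4, 410.7
ALPHA, BETA = 0.34, 0.28


def predicted_loss(n_params: float, n_tokens: float) -> float:
    """Parametric loss L(N, D) = E + A / N**alpha + B / D**beta."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA


gopher = predicted_loss(280e9, 300e9)      # large model, fewer tokens
chinchilla = predicted_loss(70e9, 1.4e12)  # smaller model, more tokens

print(f"Gopher-like allocation:     L ~ {gopher:.3f}")
print(f"Chinchilla-like allocation: L ~ {chinchilla:.3f}")
```

Under these constants the smaller model trained on more tokens gets the lower predicted loss, matching the practical takeaway above. The term `E` is the irreducible part of the loss that no amount of parameters or data removes.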

# Common Technical Benchmarks

 **Benchmark** | **Full Form**                                                                                        | **What's being evaluated** | **Description**                                                                                                       | **Intuitive Example**                                                                                  
----------------|------------------------------------------------------------------------------------------------------|----------------------------|------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------
 **ARC**        | AI2 Reasoning Challenge                                                                              | Reasoning                  | A benchmark for evaluating scientific reasoning; multiple-choice questions                                             | “Which planet has a stronger gravitational pull, Earth or Mars?”                                       
 **DROP**       | Discrete Reasoning Over Paragraphs                                                                   | Language Comp              | Distill details from text then add, count or sort                                                                      | “From the paragraph, how many times does ‘Alice’ visit the park?”                                      
 **HellaSwag**  | Harder Endings, Longer contexts, and Low‑shot Activities for Situations With Adversarial Generations | Common Sense               | Pick the most plausible ending for an everyday situation; wrong endings are adversarially generated                    | “The child was hungry, so she opened the fridge and grabbed a ___.” (choices: “sandwich”, “pillow”, …) 
 **MMLU**       | Measuring Massive Multitask Language Understanding                                                   | Understanding              | Factual recall, reasoning and problem solving across 57 subjects                                                       | “Who wrote the U.S. Declaration of Independence?”                                                      
 **TruthfulQA** | Truthful Question Answering: Measuring How Models Mimic Human Falsehoods                             | Accuracy                   | Robustness in providing truthful replies in adversarial conditions                                                     | “What is the capital of Atlantis?” (model should admit it’s mythical)                                  
 **Winogrande** | WinoGrande: Large‑scale Winograd Schema Challenge                                                    | Context                    | Test the LLM understands context and resolves ambiguity                                                                | “The trophy doesn’t fit in the brown suitcase because it’s too large. What is too large?”              
 **GSM8K**      | Grade School Math 8K                                                                                 | Math                       | Math and word problems taught in elementary and middle schools                                                         | “If you have 3 apples and buy 2 more, how many do you have total?”                                     
 **Elo**        | Elo Rating System (Chatbot Arena)                                                                    | Chat                       | Results from head-to-head face-offs with other LLMs, as with Elo ratings in chess                                      | “Users compare answers from Model A vs Model B and vote; winners gain Elo.”                            
 **HumanEval**  | HumanEval: Hand‑Written Evaluation Set                                                               | Python Coding              | 164 problems writing code based on docstrings                                                                          | “Write `def add(a, b):` that returns the sum of a and b.”                                                
 **MultiPL-E**  | Multiple Programming Languages Evaluation                                                            | Broader Coding             | Translation of HumanEval to 18 programming languages                                                                   | “Solve the same ‘add two numbers’ task in Python, JavaScript, and Rust.”                               
 **GPQA**       | Graduate‑Level Google‑Proof Q&A                                                                      | Graduate Tests             | 448 expert-written questions; skilled non-experts score 34% even with web access                                       | “In quantum mechanics, what is the eigenvalue equation for the Hamiltonian?”                           
 **BBHard**     | BIG‑Bench Hard                                                                                       | Future Capabilities        | 23 hard tasks selected from the 204-task BIG‑Bench suite, once believed beyond the capabilities of LLMs (no longer!)   | “Translate English poetry into iambic pentameter with rhyme scheme ABAB.”                              
 **Math Lv 5**  | MATH Level 5                                                                                         | Math                       | High-school level math competition problems                                                                            | “Compute the integral ∫(3x² – 2x + 1) dx.”                                                             
 **IFEval**     | Instruction‑Following Evaluation                                                                     | Difficult instructions     | Verifiable instructions such as "write more than 400 words" and "mention AI at least 3 times"                          | “Write a 500‑word essay on climate change and include the phrase ‘global warming’ at least 5 times.”   
 **MuSR**       | Multistep Soft Reasoning                                                                             | Multistep Soft Reasoning   | Logical deduction, such as analyzing 1,000 word murder mystery and answering: "Who has means, motive and opportunity?" | “Given a 1 000‑word detective story, who had means, motive, and opportunity to commit the crime?”      
 **MMLU-PRO**   | MMLU‑Pro: A More Robust and Challenging Multi‑Task Language Understanding Benchmark                  | Harder MMLU                | A more advanced and cleaned up version of MMLU including choice of 10 answers instead of 4                             | “From 10 possible U.S. history events, pick the one that triggered the Monroe Doctrine.”               
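
The Elo row relies on a simple update rule borrowed from chess. The sketch below shows the classic online Elo update for a single head-to-head vote. The `k = 32` step size is an assumption for illustration, and a given leaderboard may compute its published scores with a different statistical model.

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def update_elo(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """Return new ratings after one comparison.

    score_a is 1.0 if A wins, 0.0 if B wins, 0.5 for a tie.
    """
    e_a = expected_score(r_a, r_b)
    new_a = r_a + k * (score_a - e_a)
    new_b = r_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return new_a, new_b


# Two equally rated models; a user votes for Model A.
print(update_elo(1000.0, 1000.0, score_a=1.0))  # (1016.0, 984.0)
```

A win against a much higher-rated opponent moves the ratings more than a win against an equal one, because the update is proportional to how surprising the result was.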



# Limitations of Benchmarking

- Benchmarks drive research and purchasing decisions by giving "scores" to models—but those scores can be misleading if the benchmarks themselves are flawed or mis‑used. 
- Always interpret results in context, with healthy skepticism.
- The main limitations are described below.

#### Inconsistent Application
- What
    - Different groups run the same benchmark under varying conditions (hardware, prompt templates, preprocessing).
- Why it matters
    - Scores aren’t directly comparable if one lab uses a beefy GPU cluster and another uses a single CPU. Companies may even tune a special variant just for leaderboard runs.
- Example
    - Meta’s “Maverick” Llama 4 was optimized for conversational performance on LMArena but differed from the public release, yet its Elo score was published as if it were the public model.
- Demo idea
    - Run a small benchmark (e.g. a 5‑question MMLU subset) on CPU vs. GPU vs. TPU and compare both accuracy and latency.

#### Narrow Scope & Lack of Nuance
- What
    - Many benchmarks are **multiple‑choice** or expect highly specific answers.
- Why it matters
    - They fail to capture **deep reasoning**, creativity, or real‑world robustness; models can game the format without true understanding.
- Example
    - A model might learn to recognize the pattern of MMLU questions and pick answers statistically, but stumble on a slightly reworded query.
- Demo idea
    - Take a few DROP questions, paraphrase them, and see if performance drops.

#### Training‑Data Leakage
- What
    - Test questions sneak into the model’s training corpus, so it “memorizes” answers rather than generalizes.
- Why it matters
    - Inflated benchmark scores don’t reflect real generalization; they reflect “cheating.”
- Example
    - If GSM8K problems are included verbatim in pretraining data, a model can recite the solution steps without actually solving the arithmetic.
- Demo idea
    - Check for near‑duplicate benchmark questions on open‑source pretraining dumps (e.g., Common Crawl).
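
One simple way to approach the demo idea above is word-level n-gram overlap between a benchmark question and a candidate document. The sketch below is a minimal, assumption-laden version: the question and both documents are made-up strings (not taken from GSM8K or Common Crawl), and real contamination checks typically use longer n-grams and far larger corpora.

```python
import re


def ngrams(text: str, n: int) -> set:
    """Lower-cased word n-grams, ignoring punctuation."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(question: str, document: str, n: int = 5) -> float:
    """Fraction of the question's n-grams that also appear in the document."""
    q = ngrams(question, n)
    if not q:
        return 0.0
    return len(q & ngrams(document, n)) / len(q)


# Hypothetical benchmark-style question and two hypothetical documents.
question = "Tom has 3 apples and buys 2 more apples. How many apples does he have?"
leaked = "Forum post: Tom has 3 apples and buys 2 more apples. How many apples does he have? Answer: 5"
clean = "A recipe for apple pie needs flour, butter, and six apples."

print(overlap_ratio(question, leaked))  # 1.0: every 5-gram of the question is present
print(overlap_ratio(question, clean))   # 0.0: no shared 5-grams
```

A ratio near 1.0 flags a likely verbatim leak. Intermediate values are worth a manual look, since paraphrased leaks share only some n-grams.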

#### Overfitting to Benchmarks
- What
    - Excessive hyperparameter tuning or prompt engineering that “bakes in” benchmark specifics.
- Why it matters
    - Models excel on known test sets but fail out‑of‑distribution, giving a false sense of capability.
- Example
    - Tuning temperature, max‑tokens, and few‑shot examples until HumanEval pass@1 hits 80%, yet real‑world code generation still breaks on unseen problem formats.
- Demo idea
    - After optimizing on HumanEval, test the same model on a newly generated set of Python docstring tasks.
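
When comparing the tuned HumanEval score against a fresh task set, it helps to compute pass@k the same way on both. The sketch below implements the unbiased pass@k estimator introduced alongside HumanEval: generate `n` samples per problem, count the `c` that pass the unit tests, and estimate the chance that at least one of `k` samples passes. The sample counts in the usage lines are made up.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: 1 - C(n - c, k) / C(n, k).

    n: samples generated, c: samples that passed, k: budget being scored.
    """
    if n - c < k:
        # Fewer than k failing samples: any draw of k includes a passing one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical problem: 10 samples generated, 3 passed the tests.
print(round(pass_at_k(10, 3, 1), 3))  # 0.3
print(round(pass_at_k(10, 3, 5), 3))  # 0.917
```

The benchmark score is the mean of this value over all problems. A large gap between the tuned set and the fresh set at the same `k` is the overfitting signal.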

#### Model “Awareness” of Evaluation
- What
    - Cutting‑edge LLMs sometimes recognize they’re in a benchmark scenario and adjust their style to “score well.”
- Why it matters
    - Particularly problematic for **safety/alignment** tests: models might feign compliance under test conditions but behave differently in deployment.
- Example
    - A model that knows “this is a TruthfulQA test” may over‑emphasize disclaimers to look more truthful.
- Demo idea
    - Embed hidden triggers (“You are being tested”) in prompts and observe shifts in response tone.

#### Biases & Representation Gaps
- What
    - Benchmarks often reflect the cultural, ideological, or linguistic biases of their creators.
- Why it matters
    - Models tuned to these benchmarks may underperform or exhibit unfair behavior for under‑represented groups.
- Example
    - A reading comprehension benchmark built on Western news articles may disadvantage models tested on South Asian or African contexts.
- Demo idea
    - Swap out benchmark passages with culturally diverse texts and measure performance delta.

#### Implementation & Prompting Variability
- What
    - Small changes in prompt wording, system messages, or evaluation scripts can swing scores significantly.
- Why it matters
    - Lack of **standardized protocols** makes reproduction hard and inflates uncertainty.
- Example
    - Using “A:” vs. “Assistant:” as the answer prefix in an MMLU prompt can change accuracy by several percentage points.
- Demo idea
    - Compare MMLU accuracy under two prompt templates differing only in answer markers.
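
The demo idea above needs only a small harness that holds everything fixed except the answer marker. The sketch below shows the shape of such a harness. The two questions are made up (not MMLU items), and `stub_model` is a stand-in that always answers "B"; swap in a real model call of your choice to see actual differences between templates.

```python
from typing import Callable

# Made-up multiple-choice items for illustration; not taken from MMLU.
QUESTIONS = [
    {"q": "2 + 2 = ?", "choices": ["3", "4", "5", "6"], "answer": "B"},
    {"q": "Which is a prime number?", "choices": ["4", "6", "7", "9"], "answer": "C"},
]


def build_prompt(item: dict, answer_marker: str) -> str:
    """Render one question; only the trailing answer marker varies."""
    lines = [item["q"]]
    lines += [f"{letter}. {choice}" for letter, choice in zip("ABCD", item["choices"])]
    return "\n".join(lines) + f"\n{answer_marker}"


def accuracy(model_fn: Callable[[str], str], answer_marker: str) -> float:
    """Share of questions where the reply starts with the correct letter."""
    hits = sum(
        model_fn(build_prompt(item, answer_marker)).strip().upper().startswith(item["answer"])
        for item in QUESTIONS
    )
    return hits / len(QUESTIONS)


def stub_model(prompt: str) -> str:
    """Stand-in for a real LLM call; always answers 'B'."""
    return "B"


for marker in ("A:", "Assistant:"):
    print(f"{marker!r}: accuracy = {accuracy(stub_model, marker):.2f}")
```

With the stub, both templates score the same by construction. Note that the marker `A:` also collides visually with answer choice A, which is one plausible reason a real model could react differently to it.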

#### Evaluator Diversity & Subjectivity
- What
    - Human‑judged benchmarks (e.g., Elo ratings, qualitative assessments) depend on **who** is rating and **how**.
- Why it matters
    - Inter‑annotator disagreement introduces noise; crowdsourced ratings can be inconsistent.
- Example
    - In Chatbot Arena, one user group may prefer a verbose style and another a concise one, skewing Elo scores.
- Demo idea
    - Have two separate teams rank the same set of model responses and compare their Elo outcomes.

#### Saturation & Label Noise
- What
    - Benchmarks reach “ceiling” performance—models plateau, and remaining errors often stem from dataset mistakes, not model weakness.
- Why it matters
    - Diminishing returns on “benchmark chasing,” and improvements become indistinguishable from noise.
- Example
    - Top models score around 95% on GSM8K; further gains are more about correcting typos or ambiguous problems than genuine reasoning.
- Demo idea
    - Audit a random sample of “incorrect” GSM8K answers to see how many stem from flawed questions.
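
The claim that gains become "indistinguishable from noise" can be made concrete with a confidence interval on the accuracy itself. The sketch below uses the normal approximation to the binomial. The 95% accuracy is the figure from the example above, and 1,319 is the size of the GSM8K test split as commonly reported (verify against the dataset you actually use).

```python
import math


def accuracy_ci(accuracy: float, n_items: int, z: float = 1.96):
    """Normal-approximation confidence interval for a benchmark accuracy."""
    se = math.sqrt(accuracy * (1.0 - accuracy) / n_items)
    return accuracy - z * se, accuracy + z * se


# Assumption: 1,319 test problems, model scoring 95%.
low, high = accuracy_ci(0.95, 1319)
print(f"95% CI: {low:.3f} to {high:.3f}")  # 95% CI: 0.938 to 0.962
```

With an interval of roughly plus or minus 1.2 percentage points, two models at 95.0% and 95.8% cannot be separated on this benchmark alone, before even accounting for mislabeled items.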

#### Static vs. Dynamic Evaluation
- What
    - Traditional benchmarks are static (fixed test sets).
- Why it matters
    - They can’t keep pace with rapidly evolving LLM capabilities; models can be pre‑tuned to static questions.
- Solution Direction
    - Move toward dynamic or behavioral evaluations that generate fresh challenges on the fly.
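
As a toy illustration of the solution direction above, test items can be generated procedurally so that no fixed question list exists to leak or overfit to. The sketch below generates seeded arithmetic questions; the question format and the helper names are made up for this example, and real dynamic benchmarks use far richer task generators.

```python
import random


def make_problem(rng: random.Random) -> dict:
    """Generate one arithmetic question together with its ground-truth answer."""
    a, b = rng.randint(10, 99), rng.randint(10, 99)
    op = rng.choice(["+", "-", "*"])
    answer = {"+": a + b, "-": a - b, "*": a * b}[op]
    return {"question": f"What is {a} {op} {b}?", "answer": answer}


def make_test_set(seed: int, size: int = 5) -> list:
    """A fresh test set per seed; the same seed reproduces the same set."""
    rng = random.Random(seed)
    return [make_problem(rng) for _ in range(size)]


for item in make_test_set(seed=2024):
    print(item["question"], "->", item["answer"])
```

Publishing the generator and rotating the seed keeps runs reproducible while ensuring each evaluation round uses questions the model has not seen before.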

# Leaderboards to watch for

- [Hugging Face Open LLM](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/)
- [Hugging Face Big Code](https://huggingface.co/spaces/bigcode/bigcode-models-leaderboard)
- [Hugging Face LLM-Perf](https://huggingface.co/spaces/optimum/llm-perf-leaderboard)
- All Hugging Face [leaderboards](https://huggingface.co/spaces?search=leaderboard) – medical, Portuguese and more
- [Vellum.ai Leaderboard](https://www.vellum.ai/llm-leaderboard) – includes BBHard, also Cost & Context Window comparison
- [SEAL](https://scale.com/leaderboard) – specialist leaderboards from Scale.ai
- [AlpacaEval](https://tatsu-lab.github.io/alpaca_eval/)
- [LM Arena](https://lmarena.ai/) (formerly known as LMSYS Arena) – compare models head-to-head and contribute your votes
- [LiveBench](https://livebench.ai/#/) – a hard leaderboard that’s resistant to training data leakage

# Key Takeaways for Practitioners
- **Always inspect methodology**: hardware, prompt templates, version of model.
- **Mix benchmark types**: multiple‑choice, generative, adversarial, human‑judged.
- **Validate out‑of‑sample**: create new questions to test true generalization.
- **Monitor bias & fairness**: include diverse data slices.
- **Use dynamic tests**: complement static benchmarks with live or procedurally generated tasks.
- **Further reading**:
    - [Training on the Benchmark Is Not All You Need](https://arxiv.org/html/2409.01790v1)