### (<span style="color: #e74c3c;">Rubric + non-math</span>) **SteLLA: A Structured Grading System Using LLMs with RAG**

https://arxiv.org/html/2501.09092v1

**Goal:** Make LLMs reliable for automated grading of short-answer questions by using instructor-provided reference answers and rubrics as external knowledge

**Key Innovation:** <span style="color: #2ecc71;">**Reference-based Retrieval Augmented Generation (R-RAG)**</span> - converts rubric points into specific yes/no questions for structured evaluation rather than holistic scoring

**Dataset & Methodology**
- **Data Source**: 175 student responses from a college-level Biology exam
- **Evaluation Approach**: Binary scoring (0/1) for each rubric point, with 4 total points possible
- **Human Evaluation**: Two trained undergraduate RAs labeled data under instructor supervision
- Provides a **breakdown grade** for each concept and justifies it, not just a final score.

**System Architecture**
The system consists of three main modules:
1. **R-RAG Module**: Generates evaluation questions from rubric points using question-generation techniques
2. **LLM-based Evaluation Module**: Uses GPT-4 with zero-shot/few-shot learning to grade student responses
3. **Scoring Module**: Consolidates individual question grades into final scores and unified feedback
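
The consolidation step of the third module can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `RubricCheck` class, the `consolidate` function, and the placeholder questions are all hypothetical names introduced here, and the yes/no verdicts are hard-coded where a real system would obtain them from the LLM-based evaluation module.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class RubricCheck:
    question: str       # yes/no question derived from one rubric point (hypothetical)
    answer: bool        # grader's verdict for this question
    justification: str  # short explanation attached to the verdict


def consolidate(checks: List[RubricCheck]) -> Dict:
    """Sum binary per-question grades and collect per-concept feedback."""
    breakdown = [1 if c.answer else 0 for c in checks]
    feedback = [
        f"[{point}] {c.question} {c.justification}"
        for point, c in zip(breakdown, checks)
    ]
    return {
        "score": sum(breakdown),
        "max_score": len(checks),
        "breakdown": breakdown,
        "feedback": feedback,
    }


# Placeholder rubric checks; the content is invented for illustration only.
checks = [
    RubricCheck("Does the response mention concept A?", True, "Stated in sentence 1."),
    RubricCheck("Does the response mention concept B?", True, "Stated in sentence 2."),
    RubricCheck("Does the response explain mechanism C?", False, "Not addressed."),
    RubricCheck("Does the response state outcome D?", True, "Stated in the last sentence."),
]

result = consolidate(checks)
print(f"{result['score']}/{result['max_score']}", result["breakdown"])
```

Keeping the per-question breakdown next to the total is what makes a later error analysis per rubric point possible.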

**Results:**
- **Cohen's Kappa**: 0.6720 (substantial agreement with human graders)
- **Human baseline**: κ=0.8315, 91.6% accuracy vs **GPT-4**: 83.6% accuracy
- **High relevance**: Only 1/676 GPT-4 justifications deemed irrelevant by humans
- **Structured feedback**: Provides breakdown grades for each concept with justification
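
For reference, Cohen's kappa corrects raw agreement for the agreement expected by chance: κ = (p_o - p_e) / (1 - p_e). The sketch below computes it for two lists of binary rubric labels. The label lists are invented toy data, not the paper's annotations, and `cohens_kappa` is a helper written for this note.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the raters' marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    categories = set(labels_a) | set(labels_b)
    expected = sum(count_a[k] * count_b[k] for k in categories) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Toy 0/1 rubric-point labels (assumed data).
human = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]
model = [1, 1, 0, 1, 1, 0, 1, 1, 0, 0]

kappa = cohens_kappa(human, model)
print(round(kappa, 4))
```

On this toy data the raw agreement is 0.8 but kappa is lower, which is why the notes report both accuracy and κ.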

**Key Finding:** <span style="color: #e74c3c;">**"Infer too much" problem**</span> - GPT-4 over-interprets student responses, reading implications not explicitly stated

**Bottom Line:** QA-based evaluation enables structured, reliable automated grading but LLMs struggle with appropriate inference boundaries in educational contexts

**Relevance:** <span style="color: #3498db;">**Direct application**</span> to automated grading systems - provides structured methodology for systematic error analysis and demonstrates the critical challenge of grounding general LLMs to specific educational domains

**Note**: This paper's finding about GPT-4 being "prone to inferring too much implication from the given text" aligns with our potential research direction for comparative analysis between different LLM approaches in math education contexts.


### (<span style="color: #e74c3c;">LLMs are not capable of genuine logical reasoning</span>) **GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models**

https://arxiv.org/abs/2410.05229v1

**Goal:** Test whether LLMs truly understand math or just pattern-match; question reliability of GSM8K benchmark

**Key Innovation:** Symbolic templates automatically generate multiple variants of the same math problem with different names, numbers, and complexity levels
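
A minimal sketch of the template idea, assuming a single made-up question template. The template text, the name pool, the number ranges, and the `instantiate` helper are illustrative assumptions, not taken from the benchmark itself.

```python
import random

# Hypothetical template: placeholders stand for the name and the numbers.
TEMPLATE = (
    "{name} has {x} apples and buys {y} more. "
    "How many apples does {name} have now?"
)
NAMES = ["Ava", "Liam", "Noor", "Mateo"]


def instantiate(rng):
    """Sample one variant of the template together with its ground-truth answer."""
    name = rng.choice(NAMES)
    x = rng.randint(2, 50)
    y = rng.randint(2, 50)
    return {
        "question": TEMPLATE.format(name=name, x=x, y=y),
        "answer": x + y,  # the answer is derived from the sampled values
        "name": name,
        "x": x,
        "y": y,
    }


rng = random.Random(0)  # fixed seed so the variants are reproducible
variants = [instantiate(rng) for _ in range(3)]
for v in variants:
    print(v["question"], "->", v["answer"])
```

Because every variant shares the same reasoning structure, accuracy can be reported as a distribution over instantiations rather than a single number.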

**Results:** 
- Noticeable variance when responding to different instantiations of the same question
- **Numerical changes hurt more than name changes**
- **Performance degrades with problem complexity (more clauses)**
- **Up to 65% performance drop** when a single clause that seems relevant but does not affect the answer is added to the question (GSM-NoOp)
- Pattern-matching, not genuine reasoning

**Bottom Line:** Current LLMs cannot reliably reason about mathematics; GSM8K benchmark scores are misleading due to data contamination and single-point metrics that hide model fragility

![image.png](attachment:image.png)


### **Chain-of-Thought Prompting Elicits Reasoning in Large Language Models**

https://arxiv.org/abs/2201.11903

**Goal:** Improve LLM reasoning capabilities by prompting models to generate step-by-step intermediate reasoning steps before final answers

**Key Innovation:** Chain-of-thought prompting - provide few-shot examples that include the reasoning process, not just input-output pairs
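
The difference between the two prompt styles can be sketched as plain string construction. The exemplar below is adapted from the tennis-ball example commonly associated with this paper; the `build_prompt` helper and the new question are assumptions made for this note, and no model is called.

```python
EXEMPLAR_QUESTION = (
    "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?"
)
EXEMPLAR_REASONING = (
    "Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. "
    "5 + 6 = 11."
)
EXEMPLAR_ANSWER = "The answer is 11."


def build_prompt(new_question, chain_of_thought=True):
    """Build a one-shot prompt, with or without the intermediate reasoning."""
    if chain_of_thought:
        exemplar_reply = f"{EXEMPLAR_REASONING} {EXEMPLAR_ANSWER}"
    else:
        exemplar_reply = EXEMPLAR_ANSWER  # standard prompting: answer only
    return (
        f"Q: {EXEMPLAR_QUESTION}\n"
        f"A: {exemplar_reply}\n\n"
        f"Q: {new_question}\n"
        f"A:"
    )


# Hypothetical new question to be answered by the model.
question = "A class has 4 rows of 6 desks. 5 desks are empty. How many desks are occupied?"
cot_prompt = build_prompt(question, chain_of_thought=True)
standard_prompt = build_prompt(question, chain_of_thought=False)
print(cot_prompt)
```

The only change between the two prompts is the exemplar's answer text, so any accuracy difference comes from showing the reasoning rather than from extra task information.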

**Results:**
- **Emergent ability at scale**: Gains appear only at roughly 100B parameters and above; smaller models see little or no benefit
- **Striking improvements**: PaLM 540B achieves state-of-the-art on GSM8K math problems (57% accuracy vs 18% with standard prompting) - more than tripling performance
- **Broad applicability**: Works across arithmetic, commonsense, and symbolic reasoning tasks
- **Length generalization**: On symbolic reasoning tasks, enables solving inputs longer than those shown in the few-shot exemplars

**Bottom Line:** Chain-of-thought prompting significantly improves reasoning performance in large language models by mimicking human step-by-step thinking, but requires massive model scale to emerge

**Relevance:** Provides a method to improve mathematical reasoning in LLMs, directly applicable to grading systems that need to evaluate reasoning processes

![image.png](attachment:image.png)

### (<span style="color: #e74c3c;">Very similar, but not successful in calculation errors</span>) **Evaluating Mathematical Reasoning of Large Language Models: A Focus on Error Identification and Correction**

https://arxiv.org/abs/2406.00755v1

**Goal:** Evaluate LLMs' mathematical reasoning from an "examiner perspective" - testing their ability to identify and correct errors rather than just solve problems

**Key Innovation:** Four comprehensive evaluation tasks for error analysis:
- **Error-Presence Identification (EP):** Detect if any error exists in a solution
- **Error-Step Identification (ES):** Find the first wrong step in a solution  
- **Error-Type Identification (ET):** Classify the type of error (9 categories defined)
- **Error Correction (EC):** Fix the wrong steps and provide correct answers
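
One way to picture the four tasks is as four checks against the same annotated case. The sketch below uses hypothetical `Case` and `Prediction` records and a toy two-step solution; the field names and the exact-match scoring are assumptions for illustration, not the benchmark's actual format or metric code.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Case:
    steps: List[str]
    first_wrong_step: Optional[int]  # 1-based; None if the solution is correct
    error_type: Optional[str]
    correct_answer: float


@dataclass
class Prediction:
    has_error: bool
    wrong_step: Optional[int]
    error_type: Optional[str]
    corrected_answer: Optional[float]


def score(case: Case, pred: Prediction) -> dict:
    """Exact-match scoring of one prediction on the four tasks."""
    gold_has_error = case.first_wrong_step is not None
    return {
        "EP": pred.has_error == gold_has_error,
        "ES": gold_has_error and pred.wrong_step == case.first_wrong_step,
        "ET": gold_has_error and pred.error_type == case.error_type,
        "EC": pred.corrected_answer == case.correct_answer,
    }


# Toy case: step 2 contains a calculation error (12 + 5 is 17, not 18).
case = Case(
    steps=["3 * 4 = 12", "12 + 5 = 18"],
    first_wrong_step=2,
    error_type="calculation",
    correct_answer=17,
)
# Toy prediction: finds and fixes the error but mislabels its type.
pred = Prediction(has_error=True, wrong_step=2, error_type="operator", corrected_answer=17)

scores = score(case, pred)
print(scores)
```

Scoring the tasks separately shows where a grader fails: here the error is located and corrected, yet its type is misclassified.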

**Dataset:** EIC-Math - 1,800 cases with single-step, single-type errors across 9 error types (calculation, counting, context value, hallucination, unit conversion, operator, formula confusion, missing step, contradictory step)

**Results:**
- **GPT-4 dominates** across all tasks, but only achieves 76.2% average accuracy
- <span style="color: #e74c3c;">**Calculation errors**</span> are most challenging for all models (26.3% average accuracy)
- **Missing step errors** hardest to classify (2.9% accuracy in error-type identification)
- **Error type hints improve performance significantly:** 47.9% average improvement in correction accuracy
- **Open-source models** highly sensitive to prompt variations vs. closed-source models

**Bottom Line:** Current LLMs struggle significantly with error identification and correction in mathematical reasoning, revealing fundamental limitations in their mathematical understanding and suggesting new directions for evaluation beyond simple problem-solving metrics

**Relevance:** This is **directly applicable** to automated grading systems - combines SteLLA's error analysis focus with mathematical reasoning evaluation, providing a framework for assessing LLM reliability in educational contexts

![image.png](attachment:image.png)

### (<span style="color: #e74c3c;">ChatGPT behaves like a “careless student”, prone to slip and occasionally guessing the questions</span>) **Efficiently Measuring the Cognitive Ability of LLMs: An Adaptive Testing Perspective**

https://openreview.net/pdf?id=s6X3s3rBPW Submitted to ICLR 2024 (**Rejected** - scores: 3, 3, 5, 5)

**Goal:** Apply Computerized Adaptive Testing (CAT) from psychometrics to efficiently evaluate LLM cognitive abilities with fewer questions while enabling direct human-LLM comparisons

**Key Innovation:** <span style="color: #3498db;">**Two-stage adaptive framework**</span> using Item Response Theory (IRT) to calibrate question difficulty/discrimination, then Fisher Information selection for optimal question sequencing
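
The selection step can be sketched with a two-parameter logistic (2PL) IRT model, where an item with discrimination a and difficulty b has Fisher information a² · P · (1 - P) at ability θ. The 2PL choice is a simplifying assumption made here; the paper's exact IRT variant may differ, and the item parameters below are invented.

```python
import math


def p_correct(theta, a, b):
    """2PL probability that a test-taker with ability theta answers correctly."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def fisher_information(theta, a, b):
    """Fisher information of a 2PL item at ability theta."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)


def select_next_item(theta, items, asked):
    """Pick the unasked item that is most informative at the current ability estimate."""
    candidates = [name for name in items if name not in asked]
    return max(candidates, key=lambda name: fisher_information(theta, *items[name]))


# Hypothetical calibrated item bank: name -> (discrimination a, difficulty b).
items = {
    "q1": (1.0, -2.0),
    "q2": (1.5, 0.1),
    "q3": (0.8, 0.0),
    "q4": (1.2, 2.5),
}

next_item = select_next_item(0.0, items, asked=set())
print(next_item)
```

Items whose difficulty sits near the current ability estimate and whose discrimination is high carry the most information, which is why an adaptive test can stop after far fewer questions than a fixed one.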

**Dataset:** Three educational domains - MOOC (computer science), MATH (high school mathematics), CODE (programming) with human response data for calibration

**Results:**
- <span style="color: #2ecc71;">**Efficiency gain**</span>: Only 20% of questions needed vs. fixed test sets for same accuracy
- <span style="color: #e74c3c;">**GPT-4 dominates**</span> but only reaches middle-level student ability in math reasoning
- <span style="color: #f39c12;">**ChatGPT behaves like a "careless student"**</span>: 10% guessing, 30% slip rate
- **Adaptive selection works**: 60% question overlap between models with 20-30% model-specific questions

**Key Limitations (Why Rejected):**
- <span style="color: #e74c3c;">**Weak efficiency motivation**</span> - unclear why evaluation efficiency matters for LLMs
- <span style="color: #e74c3c;">**Fairness concerns**</span> - different models get different question sets
- <span style="color: #e74c3c;">**Questionable assumptions**</span> - uses human-calibrated IRT models for LLMs without validation

**Bottom Line:** Provides systematic methodology for LLM evaluation but fundamental assumptions about human-LLM behavioral similarity remain unvalidated

**Relevance:** <span style="color: #9b59b6;">**Methodological framework**</span> for efficient evaluation in educational contexts, though implementation needs refinement for fairness and theoretical grounding


### (<span style="color: #e74c3c;">To fill in</span>) **Evaluating LLMs at Detecting Errors in LLM Responses**

https://openreview.net/pdf?id=dnwRScljXr Published at COLM 2024

**Goal:** Introduce the first benchmark for evaluating error detection methods in LLM responses, consisting of objective, realistic, and diverse errors

**Key Innovation:** **ReaLMistake benchmark** - 900 instances with objective error annotations across 4 categories: reasoning correctness, instruction-following, context-faithfulness, and parameterized knowledge

**Dataset:** Three tasks designed to elicit natural errors:
- **MathGen**: Math word problem generation 
- **FgFactV**: Fine-grained fact verification
- **AnsCls**: Answerability classification
- 900 total instances from GPT-4 and Llama 2 70B responses

**Error Types Detected:**
- **Reasoning correctness**: Logical errors in problem-solving steps
- **Instruction-following**: Failures to follow task specifications
- **Context-faithfulness**: Inconsistencies with provided context
- **Parameterized knowledge**: Factual errors about world knowledge

**Results by Error Type & Model:**
- **Binary error detection**: All models struggle with recall (detecting errors when they exist)
- **GPT-4**: High precision (~80%) but very low recall (~20-30%) across error types
- **Claude-3**: Similar pattern - good at avoiding false positives, poor at catching actual errors
- **Reasoning errors**: Particularly challenging across all models
- **Context-faithfulness errors**: Slightly better detection rates
- **Human performance**: 95.7 F1 score vs LLM performance in 40-60 F1 range
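
The high-precision, low-recall pattern is easy to see on a toy example. The labels below are invented for illustration and are not the benchmark's data; `precision_recall_f1` is a helper written for this note, with "error present" as the positive class.

```python
def precision_recall_f1(gold, pred):
    """Precision, recall, and F1 for binary error detection (True = error present)."""
    tp = sum(g and p for g, p in zip(gold, pred))
    fp = sum((not g) and p for g, p in zip(gold, pred))
    fn = sum(g and (not p) for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1


# Toy data: 10 responses contain an error, 10 do not.
gold = [True] * 10 + [False] * 10
# A cautious detector: flags only 3 of the real errors and raises 1 false alarm.
pred = [True] * 3 + [False] * 7 + [True] * 1 + [False] * 9

precision, recall, f1 = precision_recall_f1(gold, pred)
print(precision, recall, round(f1, 3))
```

A detector that rarely flags anything keeps precision high while missing most errors, so F1 stays low even though few of its flags are wrong.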

**Key Finding:** **Error detection fundamentally challenging** - popular improvement techniques (self-consistency, majority vote) don't help

**Bottom Line:** Current LLMs struggle significantly with detecting errors in LLM-generated responses, revealing critical limitations for automated grading and self-correction applications

**Relevance:** Highly relevant - directly addresses reliability concerns in automated grading by demonstrating LLM limitations in error detection, providing systematic evaluation framework for educational applications

![image.png](attachment:image.png)

### (<span style="color: #e74c3c;">Key reference; They generate solutions, instead of detecting errors</span>) **PAL: Program-aided Language Models**

https://dl.acm.org/doi/10.5555/3618408.3618843 ICML'23

**Goal:** Improve LLM reasoning by generating programs as intermediate steps rather than natural language, offloading execution to a Python interpreter

**Key Innovation:** **Program-Aided Language Models (PAL)** - LLMs decompose problems into programmatic steps, but delegate solution execution to a Python interpreter rather than performing calculations themselves

**Methodology:**
- **Problem decomposition**: LLM reads natural language problems and generates Python code as reasoning steps
- **Execution offloading**: Python interpreter handles all calculations and logic execution
- **Meaningful variable names**: Uses descriptive variable names to maintain grounding between code and problem entities
- **Few-shot prompting**: Augments existing chain-of-thought prompts with corresponding Python code
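
The division of labour described above can be sketched as follows. The program string is a hard-coded stand-in for what a model might generate for the familiar tennis-ball word problem, and `run_generated_program` is a hypothetical helper; no LLM is called, and `exec` here is only an illustration, not a safe sandbox for untrusted code.

```python
# Stand-in for model output: reasoning steps written as Python with
# meaningful variable names tied to the entities in the problem.
GENERATED_PROGRAM = """
tennis_balls_initial = 5
cans_bought = 2
balls_per_can = 3
tennis_balls_bought = cans_bought * balls_per_can
answer = tennis_balls_initial + tennis_balls_bought
"""


def run_generated_program(source):
    """Execute the generated program and read back its `answer` variable."""
    namespace = {}
    # Builtins are removed to limit what the snippet can reach; this is NOT a
    # real security boundary and should not be relied on for untrusted code.
    exec(source, {"__builtins__": {}}, namespace)
    return namespace["answer"]


result = run_generated_program(GENERATED_PROGRAM)
print(result)
```

The model only has to set up the computation correctly; the arithmetic itself is done by the interpreter, which is the step the notes identify as error-prone for LLMs.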

**Dataset & Evaluation:**
- **Mathematical reasoning**: 8 datasets (GSM8K, SVAMP, ASDIV, etc.)
- **Symbolic reasoning**: 3 tasks from BIG-Bench Hard (colored objects, penguins, date understanding)
- **Algorithmic tasks**: Object counting and repeat copy
- **GSM-HARD**: New challenging dataset with larger numbers to test arithmetic robustness

**Results:**
- **Mathematical tasks**: PAL achieves 72.0% on GSM8K vs 65.6% for chain-of-thought with Codex
- **Robustness**: On GSM-HARD, PAL maintains 61.2% accuracy while chain-of-thought drops to 23.1%
- **Symbolic reasoning**: Consistent improvements across all tasks (8.8% to 21.8% absolute gains)
- **Scalability**: Works with weaker models and even text-based LMs with sufficient coding ability

**Key Findings:**
- **Arithmetic errors are the primary failure mode**: Analysis shows LLMs struggle with calculations, not problem understanding
- **Code structure matters**: Meaningful variable names critical for performance; random variable names hurt significantly
- **Interpreter is essential**: Having LLMs execute their own generated code performs poorly (23.2% vs 72.0%)

**Bottom Line:** PAL demonstrates that hybrid neuro-symbolic approaches can significantly improve reasoning by leveraging LLM strengths (problem decomposition) while avoiding weaknesses (arithmetic calculation)

**Relevance:** Highly relevant for automated grading - shows how to improve mathematical reasoning reliability by combining LLMs with external solvers, directly applicable to grading systems that need accurate problem-solving

![image.png](attachment:image.png)