Skip to content

abhaygupta1266/EnDive-Code

Repository files navigation

      EnDive Logo

Evaluating AI fairness across dialects of English.


📝 About

EnDive (English Diversity) is a large-scale benchmark designed to evaluate the fairness and performance of large language models (LLMs) across underrepresented English dialects. Spanning five diverse dialects—African American Vernacular English (AAVE), Chicano English (ChcE), Colloquial Singapore English (CollSgE), Indian English (IndE), and Jamaican English (JamE)—EnDive features 60 dialect-specific datasets covering 12 NLU tasks.

The benchmark captures how LLMs perform when prompts are written in different dialects rather than Standard American English (SAE). By comparing these dialect translations against SAE counterparts, EnDive identifies disparities in model accuracy, fluency, and robustness—highlighting critical issues of bias and inclusivity in today's AI systems.


📊 Performance Table

This table illustrates how well different language models understand and respond to English dialects by comparing their accuracy on SAE prompts versus dialect-specific prompts using Chain-of-Thought (CoT) reasoning. Each column represents a dialect—such as AAVE, JamE, or IndE—and each cell shows how a model’s performance changes when the same question is rephrased in that dialect. A significant drop in accuracy from SAE to CoT suggests the model struggles with that dialect, signaling potential bias and a lack of linguistic inclusivity. By highlighting these gaps, the table underscores the critical need to evaluate and improve AI fairness across diverse language varieties.

Model AAVE (Δ) ChcE (Δ) CollSgE (Δ) IndE (Δ) JamE (Δ)
🥇 o1 89.13 → 93.15 88.54 → 93.39 89.14 → 93.50 90.34 → 94.07 89.40 → 93.14
🥈 Gemini 2.5 Pro 88.89 → 92.06 88.70 → 92.31 89.94 → 92.14 89.72 → 90.69 89.42 → 92.44
🥉 GPT-4o 82.20 → 87.36 80.37 → 87.35 87.31 → 89.06 83.74 → 87.52 82.53 → 87.40
DeepSeek-v3 82.06 → 87.36 81.55 → 87.27 81.65 → 87.37 89.02 → 87.44 81.40 → 87.38
Claude 3.5 Sonnet 79.78 → 83.10 81.15 → 88.78 81.18 → 88.93 81.60 → 89.64 81.06 → 88.37
LLaMa-3-8B Instruct 82.69 → 87.49 78.08 → 82.94 78.41 → 83.00 81.52 → 86.26 79.14 → 83.20
GPT-4o-mini 74.53 → 78.27 75.01 → 77.70 76.50 → 86.61 74.66 → 82.56 80.56 → 86.60
Average 82.75 → 86.97 81.91 → 87.11 83.20 → 88.39 86.95 → 88.95 83.20 → 88.39

🧪 Tasks

EnDive spans 12 NLP tasks across four core reasoning categories:

  • 🧠 Language Understanding: BoolQ, MultiRC, WSC, SST-2, COPA
  • 💻 Algorithmic Reasoning: HumanEval, MBPP
  • 🧮 Math: GSM8K, SVAMP
  • 🔎 Logic: LogicBench, FOLIO

🧠 Simplified Methodology

We generate dialectal translations of SAE prompts using few-shot prompting with GPT-4o. To ensure quality, we apply BLEU score filtering and exclude examples with significant similarity (translations that get >0.7 BLEU score). Translations are further validated by native speakers for fluency and authenticity. We then evaluate seven LLMs across 12 NLU tasks by comparing performance on SAE versus dialect variants.


📁 Directory Structure

EnDive-Code/
├── BLEU Score Code/                  # BLEU filtering thresholds
├── Eval Code Multi-AAVENUE/         # Evaluation code for different dialects
├── GPT 4o Translation Code/         # Few-shot translation prompts with GPT-4o
├── Multi-VALUE Translation code/    # Rule-based translation baseline
├── Translation Alignment Code/      # Alignment validation
├── ChatGPT Image May 26, 2025, 02_49_32 PM.png  # EnDive logo
├── Fluency Calculations.ipynb       # Fluency metric notebooks
├── Metrics (Rouge, BARTScore...).ipynb  # Metric analysis
├── Preference Scores.ipynb          # Human eval preference scores
└── README.md                        # ← you are here

📚 Citation

If you use EnDive in your research, please cite us:

@misc{gupta2025endivecrossdialectbenchmarkfairness,
  title={EnDive: A Cross-Dialect Benchmark for Fairness and Performance in Large Language Models}, 
  author={Abhay Gupta and Jacob Cheung and Philip Meng and Shayan Sayyed and Austen Liao and Kevin Zhu and Sean O'Brien},
  year={2025},
  eprint={2504.07100},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2504.07100}, 
}

📬 Contact

For questions or collaborations, feel free to email us at abhaygupta1266@gmail.com.

      EnDive Logo

Pushing AI fairness forward, across diverse English dialects.

🔝 Back to Top

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published