HumaniBench: A Human-Centric Benchmark for Large Multimodal Models Evaluation

🌐 Website: vectorinstitute.github.io/HumaniBench | 📄 Paper: arxiv.org/abs/2505.11454 | 📊 Dataset: Hugging Face

🧠 Overview

As multimodal generative AI systems become increasingly integrated into human-centered applications, evaluating their alignment with human values has become critical.

HumaniBench is the first comprehensive benchmark designed to evaluate Large Multimodal Models (LMMs) on seven Human-Centered AI (HCAI) principles:

Fairness
Ethics
Understanding
Reasoning
Language Inclusivity
Empathy
Robustness

This repository provides the code for evaluating LMMs on HumaniBench's seven human-aligned tasks.

📦 Features

📷 32,000+ Real-World Image–Question Pairs
✅ Human-Verified Ground Truth Annotations
🌐 Multilingual QA Support (10+ languages)
🧠 Open and Closed-Ended VQA Formats
🧪 Visual Robustness & Bias Stress Testing
📑 Chain-of-Thought Reasoning + Perceptual Grounding

📂 Evaluation Tasks Overview

| Task | Focus | Folder |
| --- | --- | --- |
| Task 1: Scene Understanding | Visual reasoning + bias/toxicity analysis in social attributes (gender, age, occupation, etc.) | `code/task1_Scene_Understanding` |
| Task 2: Instance Identity | Visual reasoning in culturally rich, socially grounded settings | `code/task2_Instance_Identity` |
| Task 3: Multiple Choice QA | Structured attribute recognition via multi-choice questions | `code/task3_Multiple_Choice_VQA` |
| Task 4: Multilingual Visual QA | VQA across 10+ languages, including low-resource ones | `code/task4_Multilingual` |
| Task 5: Visual Grounding | Bounding box localization of socially salient regions | `code/task5_Visual_Grounding` |
| Task 6: Empathetic Captioning | Human-style emotional captioning evaluation | `code/task6_Empathetic_Captioning` |
| Task 7: Image Resilience | Robustness testing via image perturbations | `code/task7_Image_Resilience` |

🔍 Each task folder includes a README with setup instructions, task structure, and metrics.
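To locate a task's folder programmatically, a small lookup like the one below may help. It is only a sketch: the folder names are copied from the table above, while the helper `task_readme` and the `README.md` filename are assumptions made for illustration and are not part of the repository's code.

```python
from pathlib import Path

# Folder names copied from the task table in this README.
TASK_FOLDERS = {
    1: "code/task1_Scene_Understanding",
    2: "code/task2_Instance_Identity",
    3: "code/task3_Multiple_Choice_VQA",
    4: "code/task4_Multilingual",
    5: "code/task5_Visual_Grounding",
    6: "code/task6_Empathetic_Captioning",
    7: "code/task7_Image_Resilience",
}


def task_readme(task_id: int, repo_root: str = ".") -> Path:
    """Return the expected README path for a task.

    Hypothetical helper: it assumes each task folder holds a file
    named README.md, which this README implies but does not spell out.
    """
    if task_id not in TASK_FOLDERS:
        raise KeyError(f"unknown task id: {task_id}")
    return Path(repo_root) / TASK_FOLDERS[task_id] / "README.md"


if __name__ == "__main__":
    # Report which task READMEs are present in the current checkout.
    for task_id in sorted(TASK_FOLDERS):
        path = task_readme(task_id)
        status = "found" if path.exists() else "missing"
        print(f"Task {task_id}: {path} ({status})")
```

Run it from the repository root to check that all seven task folders are in place before starting an evaluation.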

🧬 Pipeline

Three-stage process:

Data Collection: Curated from global news imagery, tagged by social attributes (age, gender, race, occupation, sport)
Annotation: GPT-4o–assisted labeling + human expert verification
Evaluation: Comprehensive scoring across five dimensions:
- Accuracy
- Fairness
- Robustness
- Empathy
- Faithfulness
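As a rough illustration of how accuracy and a fairness-style comparison could be computed for closed-ended answers, consider the sketch below. It assumes exact-match scoring and a simple max-minus-min accuracy gap across social groups. Both choices, and all function names, are illustrative assumptions. The benchmark's actual metric definitions are in the paper and the per-task READMEs.

```python
from collections import defaultdict


def _match(prediction: str, reference: str) -> bool:
    # Assumption: case-insensitive exact match after trimming whitespace.
    return prediction.strip().lower() == reference.strip().lower()


def accuracy(predictions, references):
    """Fraction of predictions that exactly match their reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references differ in length")
    if not references:
        raise ValueError("no examples to score")
    correct = sum(_match(p, r) for p, r in zip(predictions, references))
    return correct / len(references)


def per_group_accuracy(records):
    """Accuracy per group from (group, prediction, reference) triples."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for group, prediction, reference in records:
        totals[group] += 1
        hits[group] += int(_match(prediction, reference))
    return {group: hits[group] / totals[group] for group in totals}


def accuracy_gap(group_scores):
    """Largest minus smallest group accuracy (an illustrative fairness proxy)."""
    values = list(group_scores.values())
    return max(values) - min(values)


if __name__ == "__main__":
    # Toy records with made-up group labels and answers.
    records = [
        ("group_a", "A", "A"),
        ("group_a", "B", "C"),
        ("group_b", "D", "D"),
        ("group_b", "A", "A"),
    ]
    scores = per_group_accuracy(records)
    print(scores, accuracy_gap(scores))
```

A gap of zero means every group was answered equally well under this proxy. It does not by itself show that a model is unbiased.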

🔑 Key Insights

🔍 Bias persists, especially across gender and race
🌐 Multilingual gaps affect low-resource language performance
❤️ Empathy and ethics vary significantly by model family
🧠 Chain-of-Thought reasoning improves performance but doesn’t fully mitigate bias
🧪 Robustness tests reveal fragility to noise, occlusion, and blur
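The three perturbation types named in the last insight (noise, occlusion, and blur) can be sketched in a few lines. The version below is a dependency-free illustration on small grayscale grids. The function names, parameters, and the choice of Gaussian noise and a box blur are assumptions and are not the perturbation settings used in Task 7.

```python
import random
from typing import List

Image = List[List[float]]  # grayscale pixels in [0, 255], row-major


def add_gaussian_noise(img: Image, sigma: float = 10.0, seed: int = 0) -> Image:
    """Add zero-mean Gaussian noise to each pixel, clipped to [0, 255]."""
    rng = random.Random(seed)  # seeded so the perturbation is reproducible
    return [
        [min(255.0, max(0.0, px + rng.gauss(0.0, sigma))) for px in row]
        for row in img
    ]


def occlude(img: Image, top: int, left: int, height: int, width: int,
            fill: float = 0.0) -> Image:
    """Return a copy with a rectangle overwritten by a constant value."""
    out = [row[:] for row in img]
    for r in range(max(0, top), min(len(out), top + height)):
        for c in range(max(0, left), min(len(out[r]), left + width)):
            out[r][c] = fill
    return out


def box_blur(img: Image, radius: int = 1) -> Image:
    """Average each pixel with its neighbours; windows shrink at the borders.

    Assumes a non-empty rectangular image.
    """
    rows, cols = len(img), len(img[0])
    out = []
    for r in range(rows):
        out_row = []
        for c in range(cols):
            window = [
                img[rr][cc]
                for rr in range(max(0, r - radius), min(rows, r + radius + 1))
                for cc in range(max(0, c - radius), min(cols, c + radius + 1))
            ]
            out_row.append(sum(window) / len(window))
        out.append(out_row)
    return out


if __name__ == "__main__":
    image = [[100.0] * 4 for _ in range(4)]
    print(add_gaussian_noise(image, sigma=5.0)[0])
    print(occlude(image, 1, 1, 2, 2)[1])
    print(box_blur(image)[0])
```

Comparing a model's answers on the original and perturbed versions of the same image is one simple way to quantify the fragility described above.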

📚 Citation

If you use HumaniBench or this evaluation suite in your work, please cite:

@misc{raza2025humanibenchhumancentricframeworklarge,
  title={HumaniBench: A Human-Centric Framework for Large Multimodal Models Evaluation},
  author={Shaina Raza and Aravind Narayanan and Vahid Reza Khazaie and Ashmal Vayani and Mukund S. Chettiar and Amandeep Singh and Mubarak Shah and Deval Pandya},
  year={2025},
  eprint={2505.11454},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2505.11454},
}

📬 Contact

For questions, collaborations, or dataset access requests, please open an issue in this repository or contact the corresponding author at shaina.raza@vectorinstitute.ai, as listed in the paper.

⚡ HumaniBench promotes trustworthy, fair, and human-centered multimodal AI.

We invite researchers, developers, and policymakers to explore, evaluate, and extend HumaniBench. 🚀
