Evaluating Ethical Alignment and Emotional Intelligence in LLM-Based Mental Health Chatbots

A Machine-Centric Benchmark for Social Work Suicide Prevention

Overview

This project examines the effectiveness of AI chatbots in responding to high-stakes mental health topics, focusing on suicidality within LGBTQ+ communities. By comparing AI-generated responses to expert-crafted human references, this study assesses the AI’s alignment with human standards on precision, ethical alignment, inclusivity, and complexity. The goal is to ensure that chatbots provide supportive, unbiased, and ethically sound assistance, particularly in sensitive mental health contexts.

Motivation

With mental health chatbots increasingly being used in healthcare, it is vital that they respond with sensitivity, particularly toward vulnerable populations like LGBTQ+ individuals. This project evaluates AI responses in critical mental health scenarios to identify areas where AI responsiveness and empathy can improve. This evaluation highlights gaps in chatbot response quality to foster advancements in AI support for LGBTQ+ mental health.

Evaluation Pipeline Flowchart

Structure

Text-Reference-AIChatbot/
├── main.py
├── requirements.txt
├── .gitignore
├── src/
|   ├── data/
|   |   ├── data_processing.py
|   |   ├── Test Reference Text.docx
|   |   ├── Test Chatbot text.docx
|   ├── utils/
|   |   ├── evaluation_algo.py
|   ├── outputs/
|   |   ├── processed_chatbot_text.csv
|   |   ├── processed_reference_text.csv
|   |   ├── evaluation_scores.csv
|   |   ├── integrated_chatbot_responses.csv
|   |   ├──Plots/
|   └── commonconst.py
└── README.md

Methodology

1. Data Preprocessing

data_processing.py: Extracts structured data from .docx files and converts chatbot/human responses into clean CSV format for analysis.

2. Evaluation Metrics (in `evaluation_algo.py`)

Each chatbot response is scored based on the following:

Metric	Output Range	Function	Purpose
ROUGE Score	0–1	`calculate_average_rouge()`	Measures lexical overlap (precision + recall) between chatbot and human responses.
METEOR Score	0–1	`calculate_meteor()`	Accounts for synonyms and stemming; balances exact matches with semantic similarity.
Ethical Alignment	0–1	`evaluate_ethical_alignment()`	Uses a fine-tuned BERT model to score how ethically appropriate the response is. Applies weighted scaling for moderate confidence cases.
Sentiment Distribution	0–1	`evaluate_sentiment_distribution()`	Compares emotion vectors of chatbot vs. reference using cosine similarity, weighted by therapeutic importance.
Inclusivity Score	0–1	`evaluate_inclusivity_score()`	Rewards use of affirming, LGBTQ+-inclusive language and penalizes harmful terms.
Complexity Score	~40–70	`evaluate_complexity_score()`	Combines average sentence length with Flesch-Kincaid readability to assess clarity and depth.

Execution Flow

Run main.py to:

Preprocess data
Apply each metric function
Save scores to evaluation_scores.csv

Each evaluation function returns a numerical score which is logged and compared across chatbots.

Chatbots Evaluated

General-Purpose LLMs:

ChatGPT-4
Claude (Anthropic)
Gemini (Google)
LLaMA-3 (Meta)
DeepSeek
Mistral
Perplexity AI
HuggingChat

LGBTQ+-Specific Chatbots:

JackAI
Gender Journey Chatbot Rubies

These platforms were selected for their relevance in AI ethics, mental health, and LGBTQ+ inclusivity—ensuring both high-tech LLMs and community-centric tools are evaluated under equal standards.

Results Summary

Key Metric Ranges:

ROUGE / METEOR: 0.20–0.36 → Moderate text similarity to reference.
Ethical Alignment: 0.19–0.38 → Indicates general caution but varied empathy.
Inclusivity: 0.00–0.42 → Gaps in affirming language, even in well-performing LLMs.
Sentiment Distribution: 0.04–1.00 → Shows diverse emotional intelligence levels.
Complexity: ~49–61 → Balanced readability with emotional nuance.

Observations:

DeepSeek and Gemini lead in inclusivity and sentiment tone.
ChatGPT-4 and Claude maintain structured inquiry but need ethical fine-tuning.
Gender Journey performs well on empathy but lags in linguistic richness.

Results Interpretation

Metric	Insight
ROUGE / METEOR	High = better alignment with human phrasing.
Ethical Alignment	High = more safety-conscious, affirming language.
Inclusivity	High = uses LGBTQ+-affirming terms, avoids harm.
Sentiment	High = tone matches supportive reference.
Complexity	Mid-range ideal; too low = vague, too high = overly complex.

Future Work

Enhanced Inclusivity: Further tuning to improve inclusivity specifically for LGBTQ+ language and cultural sensitivity.
Ethics Integration: Implement ethics-driven prompt engineering to enhance alignment with mental health ethics.
Dynamic Sentiment Adjustment: Fine-tune sentiment modulation to adapt responses to specific user needs in real-time.

License & Attribution

This work was developed as part of a research assistantship at:

Columbia University School of Social Work, with support from faculty mentors Prof. Elwin Wu (elwin.wu@columbia.edu) and Prof. Charles Lea (chl2159@columbia.edu)
Lead Author: Zichen Zhao (zichen.zhao@columbia.edu)
Licensed under a custom academic non-commercial license, retaining intellectual property within Columbia-affiliated projects See LICENSE for full terms

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Evaluating Ethical Alignment and Emotional Intelligence in LLM-Based Mental Health Chatbots

Overview

Motivation

Evaluation Pipeline Flowchart

Structure

Methodology

1. Data Preprocessing

2. Evaluation Metrics (in `evaluation_algo.py`)

Execution Flow

Chatbots Evaluated

General-Purpose LLMs:

LGBTQ+-Specific Chatbots:

Results Summary

Key Metric Ranges:

Observations:

Results Interpretation

Future Work

License & Attribution

About

Uh oh!

Releases

Packages

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 40 Commits
assets		assets
src		src
.DS_Store		.DS_Store
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
main.py		main.py
requirements.txt		requirements.txt

License

ZhaoJackson/Text-Reference-AIChatbot

Folders and files

Latest commit

History

Repository files navigation

Evaluating Ethical Alignment and Emotional Intelligence in LLM-Based Mental Health Chatbots

Overview

Motivation

Evaluation Pipeline Flowchart

Structure

Methodology

1. Data Preprocessing

2. Evaluation Metrics (in evaluation_algo.py)

Execution Flow

Chatbots Evaluated

General-Purpose LLMs:

LGBTQ+-Specific Chatbots:

Results Summary

Key Metric Ranges:

Observations:

Results Interpretation

Future Work

License & Attribution

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Languages

2. Evaluation Metrics (in `evaluation_algo.py`)

Packages