TCEval: Thermal Comfort-based AI Cognitive Evaluation Framework

Overview

TCEval is the first evaluation framework that leverages thermal comfort as a paradigm to assess three core cognitive capacities of AI systems:

Cross-modal reasoning (integrating environmental, personal, and contextual cues)
Causal association (linking variables like temperature to comfort outcomes)
Adaptive decision-making (modifying behavior under changing conditions)

Traditional AI benchmarks (e.g., GLUE, MMLU) focus on abstract task proficiency, but TCEval shifts the focus to embodied, context-aware perception and decision-making—critical for human-centric applications like smart buildings, wearable tech, and embodied AI.

PMV (Predicted Mean Vote) is a standardized metric defined in ASHRAE Standard 55-2023 (the global benchmark for thermal comfort) that quantifies subjective thermal sensation on a 7-point scale:

| PMV Value | Thermal Sensation |
| --- | --- |
| +3 | Hot |
| +2 | Warm |
| +1 | Slightly Warm |
| 0 | Neutral |
| -1 | Slightly Cool |
| -2 | Cool |
| -3 | Cold |
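
As a minimal sketch of how the 7-point scale above can be used in code, the snippet below maps a continuous PMV value to its nearest sensation label. The function name `pmv_to_sensation` and the round-half-up convention are illustrative assumptions, not part of TCEval's published code.

```python
import math

# The 7-point thermal sensation scale from the table above.
PMV_SCALE = {
    3: "Hot",
    2: "Warm",
    1: "Slightly Warm",
    0: "Neutral",
    -1: "Slightly Cool",
    -2: "Cool",
    -3: "Cold",
}


def pmv_to_sensation(pmv: float) -> str:
    """Map a PMV value in [-3, 3] to the nearest category label.

    Assumption: halves are rounded up (e.g. 0.5 -> "Slightly Warm").
    """
    if not -3.0 <= pmv <= 3.0:
        raise ValueError(f"PMV {pmv} is outside the 7-point scale")
    category = math.floor(pmv + 0.5)
    return PMV_SCALE[category]


print(pmv_to_sensation(0.8))   # Slightly Warm
print(pmv_to_sensation(-0.2))  # Neutral
```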

By initializing Large Language Model (LLM) agents with virtual personalities and guiding them to generate clothing insulation selections and thermal comfort feedback, TCEval validates outputs against real-world datasets (ASHRAE Global Database, Chinese Thermal Comfort Database) to measure alignment with human intuitive reasoning.

At present, AI systems cannot perceive and judge the thermal environment the way humans do.

Key Features

Ecologically Valid Evaluation: Grounded in a universal human experience (thermal comfort) to bridge the gap between AI testing and practical cognitive assessment.
Multi-Capacity Assessment: Evaluates cross-modal reasoning, causal association, and adaptive decision-making in dynamic, real-world contexts.
LLM Agent Integration: Uses LLM agents as "digital twins" to simulate human-like perception and decision-making.
Open Dataset Validation: Compares AI outputs with two authoritative thermal comfort datasets for rigor.
Complementary to Traditional Benchmarks: Enhances existing evaluations by focusing on "how AI perceives and acts" rather than just "what AI knows".

Methodology

The TCEval framework follows three core steps:

Agent Initialization: Load LLM agents with virtual personality attributes (gender, age, height, weight, etc.).
Feedback Generation: Instruct agents to:
- Select clothing insulation based on environmental conditions and personal context.
- Provide thermal comfort feedback using PMV (Predicted Mean Vote, ASHRAE Standard 55-2023).
- Adapt decisions (e.g., adjust clothing/HVAC settings) under changing conditions.
Validation & Analysis: Compare agent outputs with:
- Human data from the ASHRAE Global Database and the Chinese Thermal Comfort Database.
- Theoretical thermal comfort values (PMV) from environmental parameters.
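
The first two steps can be sketched roughly as follows. This is a hypothetical illustration only: the `Persona` and `Environment` classes, the prompt wording, and the `parse_pmv` helper are assumptions made for this example and are not taken from the TCEval code. The LLM call itself is left out, so the sketch only shows how a persona could be turned into a prompt and how a reply could be validated against the PMV range.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Persona:
    # Hypothetical persona fields, mirroring the attributes listed in step 1.
    gender: str
    age: int
    height_cm: float
    weight_kg: float


@dataclass
class Environment:
    # Hypothetical environmental inputs for one test case.
    air_temp_c: float
    relative_humidity_pct: float
    air_speed_m_s: float


def build_prompt(persona: Persona, env: Environment) -> str:
    """Compose a prompt asking the agent for a thermal sensation vote."""
    return (
        f"You are a {persona.age}-year-old {persona.gender}, "
        f"{persona.height_cm} cm tall, weighing {persona.weight_kg} kg. "
        f"The air temperature is {env.air_temp_c} C, relative humidity is "
        f"{env.relative_humidity_pct}%, and air speed is {env.air_speed_m_s} m/s. "
        "Reply with only one number between -3 (Cold) and +3 (Hot)."
    )


def parse_pmv(reply: str) -> Optional[float]:
    """Return the first number in the reply, or None if it is missing or out of range."""
    match = re.search(r"[-+]?\d+(?:\.\d+)?", reply)
    if match is None:
        return None  # refusal or non-numeric answer
    value = float(match.group())
    return value if -3.0 <= value <= 3.0 else None


persona = Persona(gender="female", age=34, height_cm=165.0, weight_kg=58.0)
env = Environment(air_temp_c=28.0, relative_humidity_pct=60.0, air_speed_m_s=0.1)
print(build_prompt(persona, env))
print(parse_pmv("+1.5"))             # 1.5
print(parse_pmv("I cannot answer"))  # None
```

Note that `parse_pmv` takes the first number it finds, so it assumes the agent follows the "only one number" instruction; a reply that restates the temperature first would be rejected as out of range.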

Datasets Used

ASHRAE Global Database of Thermal Comfort Field Measurements
Chinese Thermal Comfort Dataset
Virtual Personality Database (from Scaling Synthetic Data Creation with 1,000,000,000 Personas) with pythermalcomfort
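
For readers who want to see what a theoretical PMV calculation involves, here is a minimal pure-Python sketch of the Fanger PMV model, following the reference algorithm published in ISO 7730. It is an illustration under stated assumptions, not TCEval's pipeline: pythermalcomfort's own functions and the ASHRAE 55 procedure (which adds a correction for elevated air speed) may differ in detail, and the function name `fanger_pmv` is made up for this example.

```python
import math


def fanger_pmv(ta, tr, vel, rh, met, clo, wme=0.0):
    """Fanger PMV (ISO 7730 reference algorithm).

    ta, tr: air and mean radiant temperature (deg C); vel: relative air
    speed (m/s); rh: relative humidity (%); met: metabolic rate (met);
    clo: clothing insulation (clo); wme: external work (met).
    """
    pa = rh * 10.0 * math.exp(16.6536 - 4030.183 / (ta + 235.0))  # vapour pressure, Pa
    icl = 0.155 * clo      # clothing insulation, m2.K/W
    m = met * 58.15        # metabolic rate, W/m2
    mw = m - wme * 58.15   # internal heat production, W/m2
    fcl = 1.0 + 1.29 * icl if icl <= 0.078 else 1.05 + 0.645 * icl
    hcf = 12.1 * math.sqrt(vel)  # forced-convection coefficient
    taa, tra = ta + 273.0, tr + 273.0

    # Solve iteratively for the clothing surface temperature.
    tcla = taa + (35.5 - ta) / (3.5 * icl + 0.1)
    p1 = icl * fcl
    p2, p3, p4 = p1 * 3.96, p1 * 100.0, p1 * taa
    p5 = 308.7 - 0.028 * mw + p2 * (tra / 100.0) ** 4
    xn, xf = tcla / 100.0, tcla / 50.0
    hc = hcf
    for _ in range(150):
        if abs(xn - xf) <= 0.00015:
            break
        xf = (xf + xn) / 2.0
        hcn = 2.38 * abs(100.0 * xf - taa) ** 0.25  # natural convection
        hc = max(hcf, hcn)
        xn = (p5 + p4 * hc - p2 * xf ** 4) / (100.0 + p3 * hc)
    else:
        raise RuntimeError("clothing temperature iteration did not converge")
    tcl = 100.0 * xn - 273.0

    # Heat loss components, W/m2.
    hl1 = 3.05 * 0.001 * (5733.0 - 6.99 * mw - pa)      # skin diffusion
    hl2 = 0.42 * (mw - 58.15) if mw > 58.15 else 0.0    # sweating
    hl3 = 1.7 * 0.00001 * m * (5867.0 - pa)             # latent respiration
    hl4 = 0.0014 * m * (34.0 - ta)                      # dry respiration
    hl5 = 3.96 * fcl * (xn ** 4 - (tra / 100.0) ** 4)   # radiation
    hl6 = fcl * hc * (tcl - ta)                         # convection
    ts = 0.303 * math.exp(-0.036 * m) + 0.028           # sensation coefficient
    return ts * (mw - hl1 - hl2 - hl3 - hl4 - hl5 - hl6)


# ISO 7730 lists a PMV of about -0.75 for this example case.
print(round(fanger_pmv(ta=22.0, tr=22.0, vel=0.10, rh=60.0, met=1.2, clo=0.5), 2))
```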

Experimental Results

Tests on leading LLMs show:

Exact Alignment: Agents' thermal comfort votes rarely match human data exactly.
Directional Consistency: Agreement improves significantly when a ±1 PMV tolerance is allowed.
Cognitive Gaps: LLM-generated PMV distributions diverge markedly from human data, and agents perform near-randomly in discrete thermal comfort classification (AUC ≈ 0.5).
Key Insight: Current LLMs possess foundational cross-modal reasoning but lack precise understanding of nonlinear relationships between thermal comfort variables.
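
To make the "AUC ≈ 0.5" statement concrete, the sketch below computes ROC AUC from its pairwise definition: the probability that a randomly chosen positive case gets a higher score than a randomly chosen negative one, with ties counted as half. How TCEval binarizes thermal votes is not specified here, so the "warm side" labelling (human vote > 0) and the toy data are assumptions for illustration only.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """ROC AUC via pairwise comparison of positive and negative scores."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Toy data (hypothetical): human votes and the agent's PMV for six cases.
human_votes = [2, 1, -1, -2, 1, -1]
agent_pmv = [0.5, -0.5, 0.5, -0.5, 0.0, 0.0]
warm_side = [1 if v > 0 else 0 for v in human_votes]

# The agent's scores carry no information about the label here,
# so the AUC sits at the chance level of 0.5.
print(roc_auc(warm_side, agent_pmv))
```

An AUC of 1.0 would mean the agent's scores rank every warm-side case above every cool-side case; 0.5 means the ranking is no better than chance.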

ASHRAE Global Database II

Core Metrics Summary Table

Model Performance Comparison Statistics

| Model Name | Original Data Rows | Matching Result Rows | Matching Result Ratio | Abs. Diff < 0.1 (ratio) | Abs. Diff < 0.2 (ratio) | Abs. Diff < 0.3 (ratio) | Abs. Diff < 0.4 (ratio) | Abs. Diff < 0.5 (ratio) | Black-Marked Rows (Special Conditions) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Deepseek-R1-32B | 8100 | 1605 | 0.1981 | 0.0378 | 0.0814 | 0.1237 | 0.1657 | 0.2072 | 2551 |
| gemma3-27B | 8100 | 577 | 0.0712 | 0.0404 | 0.0851 | 0.1269 | 0.1809 | 0.2357 | 0 |
| gpt-oss-120B | 8100 | 2207 | 0.2725 | 0.0558 | 0.1149 | 0.1732 | 0.2298 | 0.2859 | 103 |
| mistral-small3 | 8100 | 1845 | 0.2278 | 0.0791 | 0.1457 | 0.2164 | 0.2907 | 0.3563 | 0 |
| Qwen3-32B | 8100 | 3003 | 0.3707 | 0.0636 | 0.1315 | 0.1964 | 0.2558 | 0.3199 | 1 |
| Qwen3-80B | 8100 | 4087 | 0.5046 | 0.1001 | 0.2067 | 0.3065 | 0.4019 | 0.4894 | 0 |

Exact Match: The LLM's output string matches the human data exactly.
Match Ratio (Abs PMV Diff < 0.5): The share of cases in which the LLM agent's PMV lies within 0.5 of the human value. It assesses how well the agent's thermal comfort judgments align with human perception even when the values do not match exactly.
Black-Marked Rows: Rows where the LLM's prediction falls outside the PMV range, is missing, or where the model refused to answer.

Chinese Thermal Comfort Dataset

Model Performance Statistics

| Model Name | Original Data Rows | Matching Result Rows | Matching Result Ratio | Abs. Diff < 0.1 (ratio) | Abs. Diff < 0.2 (ratio) | Abs. Diff < 0.3 (ratio) | Abs. Diff < 0.4 (ratio) | Abs. Diff < 0.5 (ratio) | Black-Marked Rows (Special Conditions) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Deepseek-R1-32B | 8547 | 2536 | 0.2967 | 0.0560 | 0.1152 | 0.1783 | 0.2402 | 0.2971 | 0 |
| gemma3-27B | 8547 | 2172 | 0.2541 | 0.0461 | 0.0878 | 0.1336 | 0.1733 | 0.2160 | 0 |
| gpt-oss-120B | 8547 | 1635 | 0.1913 | 0.0404 | 0.0762 | 0.1174 | 0.1605 | 0.1956 | 0 |
| mistral-small3 | 8547 | 1659 | 0.1941 | 0.0357 | 0.0729 | 0.1133 | 0.1525 | 0.1891 | 0 |
| Qwen3-32B | 8547 | 2345 | 0.2744 | 0.0542 | 0.1061 | 0.1663 | 0.2228 | 0.2730 | 0 |
| Qwen3-80B | 8547 | 1856 | 0.2172 | 0.0398 | 0.0824 | 0.1292 | 0.1740 | 0.2195 | 0 |

Virtual Person with pythermalcomfort

Code coming soon.

1. Metric Definition

This ratio calculates the proportion of test cases where the absolute difference between the AI’s predicted PMV and the human’s actual PMV is less than 0.5. Mathematically:
$$
\text{Match Ratio (Abs PMV Diff < 0.5)} = \frac{\text{Number of cases where } |\text{AI PMV} - \text{Human PMV}| < 0.5}{\text{Total number of test cases}}
$$
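
The definition above translates directly into a few lines of Python. This is a minimal sketch rather than the repository's implementation: the function name `match_ratio` is illustrative, and treating a missing prediction (`None`) as a non-match that still counts in the denominator is an assumption.

```python
from typing import Optional, Sequence


def match_ratio(
    human_pmv: Sequence[float],
    ai_pmv: Sequence[Optional[float]],
    threshold: float = 0.5,
) -> float:
    """Share of test cases where |AI PMV - Human PMV| < threshold."""
    if len(human_pmv) != len(ai_pmv):
        raise ValueError("human_pmv and ai_pmv must have the same length")
    if not human_pmv:
        raise ValueError("at least one test case is required")
    matches = sum(
        1
        for human, ai in zip(human_pmv, ai_pmv)
        # Assumption: a missing prediction never counts as a match.
        if ai is not None and abs(ai - human) < threshold
    )
    return matches / len(human_pmv)


# Differences are 0.2, 0.5 and 0.8; only the first is strictly below 0.5.
print(match_ratio([1, -2, 3], [0.8, -1.5, 2.2]))
```

The comparison is strict, so a difference of exactly 0.5 does not count as a match.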

2. What It Measures (and Why It Matters)

Directional Consistency: Unlike the "Exact Match Ratio" (which requires AI PMV = Human PMV), this metric focuses on whether the AI's judgment trends in the same direction as human perception. For example:
- Human PMV = +1 (Slightly Warm), AI PMV = +0.8 → absolute difference = 0.2 < 0.5 → counts as a match.
- Human PMV = -2 (Cool), AI PMV = -1.5 → absolute difference = 0.5, which is not strictly less than 0.5 → does not count as a match.
- Human PMV = +3 (Hot), AI PMV = +2.2 → absolute difference = 0.8 > 0.5 → does not count as a match.
- Human PMV = +1 (Slightly Warm), AI PMV = +2 (Warm) or 0 (Neutral) → absolute difference = 1 > 0.5 → does not count as a match.
Real-World Relevance: In practical applications (e.g., smart buildings), precise PMV alignment is less critical than ensuring the AI does not misjudge the direction of comfort (e.g., mistaking "Cold" for "Hot"). This metric reflects the AI’s ability to capture the "big picture" of human thermal perception.
Why is the maximum threshold 0.5?: The 0.5 threshold corresponds to the half-width of each PMV category interval and is a common choice for PMV alignment, but it may not be the best choice for every task. The smaller the threshold, the closer the agreement it demands. For example, if a task requires a stricter PMV tolerance (e.g., ±0.2), the threshold should be adjusted accordingly.
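
The effect of tightening or loosening the threshold can be sketched by sweeping it over the same cut-offs used in the result tables (0.1 to 0.5). The function name `cumulative_match_ratios` and the sample differences are hypothetical and serve only to show how the ratio grows as the threshold is relaxed.

```python
from typing import Dict, Sequence


def cumulative_match_ratios(
    abs_diffs: Sequence[float],
    thresholds: Sequence[float] = (0.1, 0.2, 0.3, 0.4, 0.5),
) -> Dict[float, float]:
    """For each threshold, the share of cases whose |difference| is strictly below it."""
    total = len(abs_diffs)
    if total == 0:
        raise ValueError("at least one test case is required")
    return {t: sum(d < t for d in abs_diffs) / total for t in thresholds}


# Hypothetical absolute PMV differences for five test cases.
sample_diffs = [0.05, 0.15, 0.25, 0.45, 0.8]
for threshold, ratio in cumulative_match_ratios(sample_diffs).items():
    print(f"|diff| < {threshold}: {ratio:.2f}")
```

Because each threshold includes every case counted by the smaller ones, the ratios can only stay level or increase from left to right.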

3. Interpretation of Results

A high ratio (e.g., Qwen3-80B’s 81.70%) indicates the AI consistently mirrors human comfort trends—even if it does not perfectly replicate exact PMV values. This is a strong signal of foundational cross-modal reasoning (integrating environmental/personal cues to infer comfort direction).
A low ratio (e.g., Deepseek-R1-32B’s 38.85%) suggests the AI’s judgments often diverge from human trends, indicating gaps in understanding how variables like temperature, humidity, or clothing insulation influence comfort.

4. Why This Metric Complements Exact Match Ratio

Exact matches are rare in thermal comfort (human judgments themselves have natural variability).
The "Abs PMV Diff < 0.5" ratio provides a more realistic measure of an AI’s utility in human-centric systems, where directional accuracy drives actionable decisions (e.g., "Should we cool the room?" vs. "Should we heat it?").
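
The "cool or heat?" framing can be illustrated with a small sketch that turns a PMV value into a coarse HVAC action and checks how often the AI and the human data lead to the same action. The ±0.5 comfort band follows the PMV range commonly used as the comfort zone; the function names and the three-way action set are assumptions for this example, not part of TCEval.

```python
from typing import Sequence


def hvac_action(pmv: float, comfort_band: float = 0.5) -> str:
    """Coarse control decision implied by a PMV value."""
    if pmv > comfort_band:
        return "cool"
    if pmv < -comfort_band:
        return "heat"
    return "hold"


def action_agreement(human_pmv: Sequence[float], ai_pmv: Sequence[float]) -> float:
    """Share of cases where human and AI PMV imply the same action."""
    if len(human_pmv) != len(ai_pmv) or not human_pmv:
        raise ValueError("inputs must be non-empty and of equal length")
    same = sum(hvac_action(h) == hvac_action(a) for h, a in zip(human_pmv, ai_pmv))
    return same / len(human_pmv)


# The first two cases differ by a full PMV point yet imply the same action;
# the third differs by the same amount but flips "hold" to "cool".
print(action_agreement([1, -2, 0], [2, -1, 1]))
```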

In summary, this metric quantifies the AI’s ability to "get the gist" of human thermal comfort — a key prerequisite for deployment in applications like smart HVAC, wearable tech, or embodied AI.

Use Cases

Evaluate AI readiness for human-centric applications (smart buildings, wearable devices, adaptive HVAC systems).
Serve as a Cognitive Turing Test to measure AI's ability to emulate human perception-decision cycles.
Guide LLM development for embodied AI (e.g., improving causal reasoning and context adaptation).

Installation & Usage

Prerequisites

Python 3.8+
Dependencies: torch, transformers, pandas, numpy, scikit-learn, matplotlib

Setup

Clone the repository:

git clone https://github.com/cadslab/TCEval.git
cd TCEval

Install dependencies:
```
pip install -r requirements.txt
```
Download datasets (see Datasets Used).

Run the evaluation pipeline (not yet complete):

python run_tceval.py --model [MODEL_NAME] --dataset [DATASET_NAME]

Arguments

--model: Name of the LLM to evaluate (e.g., deepseek-r1-32b, qwen3-32b).
--dataset: Dataset to use for validation (ashrae or chinese_thermal).
--tolerance: PMV tolerance for directional alignment (default: ±1).
--output_dir: Directory to save results (default: results/).
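
Since the pipeline script is not yet complete, the following is only a hypothetical sketch of how the documented arguments could be declared with `argparse`. The flag names, the dataset choices, and the defaults come from the list above; everything else (the `build_parser` name, making `--model` and `--dataset` required, the help strings) is an assumption.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the command-line arguments documented above."""
    parser = argparse.ArgumentParser(
        prog="run_tceval.py",
        description="Evaluate an LLM on thermal comfort alignment.",
    )
    parser.add_argument("--model", required=True,
                        help="Name of the LLM to evaluate, e.g. qwen3-32b")
    parser.add_argument("--dataset", required=True,
                        choices=["ashrae", "chinese_thermal"],
                        help="Dataset to use for validation")
    parser.add_argument("--tolerance", type=float, default=1.0,
                        help="PMV tolerance for directional alignment")
    parser.add_argument("--output_dir", default="results/",
                        help="Directory to save results")
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(["--model", "qwen3-32b", "--dataset", "ashrae"])
print(args.model, args.dataset, args.tolerance, args.output_dir)
```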

TCEval: Bridging AI evaluation from abstract tasks to real-world human cognition.

Pre-trained Models

Coming soon.

💡 Applications

Smart Buildings: Evaluate AI for HVAC control and occupant comfort optimization
Embodied AI: Assess cognitive readiness for human-robot interaction
Human-Centric AI Development: Guide model improvements in causal reasoning and contextual adaptation
Academic Research: Benchmark new AI models against real-world cognitive tasks
Industry Adoption: Validate AI systems for human-interactive products and services

📄 License

See the LICENSE.md file for details.

📚 Citation

If TCEval is helpful to your work, please cite our paper:

@misc{li2025tceval,
      title={TCEval: Using Thermal Comfort to Assess Cognitive and Perceptual Abilities of AI},
      author={Jingming Li},
      year={2025},
      eprint={2512.23217},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2512.23217},
}
