A Logic-Grounded Lateral Thinking Benchmark for LLMs: Evaluating Implicit Reasoning, Hallucination Detection, and Instruction Following
Introduction • Dataset • Methodology • Citation
DeepTurtle is a benchmark designed to evaluate the reasoning capabilities of Large Language Models (LLMs) through Lateral Thinking Puzzles (also known as "Turtle Soup").
Unlike traditional QA datasets, DeepTurtle focuses on non-monotonic reasoning where the premise is vague and the truth is hidden. It introduces a novel Logic Profile engine to strictly define the boundaries of "Truth," enabling precise detection of model hallucinations, particularly the failure modes observed in DeepSeek models.
Play the Game: https://haiguitang.net
The full dataset is hosted on Hugging Face. It contains 61 rigorously curated "Golden Samples" with human-in-the-loop annotations.
```python
from datasets import load_dataset

# Load the dataset from the Hugging Face Hub
dataset = load_dataset("YuiMax/DeepTurtle")

# Inspect the Logic Profile of the first case
print(dataset['train'][0]['logic_profile'])
```

The core contribution of DeepTurtle is the structured Logic Profile, which transforms a vague story into a computable logic state machine.
```json
{
  "title": "Live Stream Murder",
  "logic_profile": {
    "entities_preprocess": {
      "step2_identity_matrix": [
        {
          "noun": "Streamer",
          "role_feature": "Predator",
          "knowledge_feature": "Omniscient (Knows victim's location)"
        }
      ]
    },
    "logic_rules": [
      "IF user asks 'Is it a ghost?', THEN verdict is 'NO'.",
      "IF user asks 'Did he find me via GPS?', THEN verdict is 'YES' (Implicit Logic)."
    ]
  }
}
```
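To make "computable" concrete, each rule can be read as a question-to-verdict mapping. The sketch below uses naive string matching purely for illustration; the project's official evaluator (see the roadmap below) may parse rules differently.

```python
def lookup_verdict(logic_rules: list[str], question: str) -> str | None:
    """Illustrative only: match a player question against a rule's quoted
    question text and return the quoted verdict, else None."""
    for rule in logic_rules:
        # Rules follow the pattern: IF user asks '<q>', THEN verdict is '<v>' ...
        try:
            rule_q = rule.split("asks '")[1].split("'")[0]
            verdict = rule.split("verdict is '")[1].split("'")[0]
        except IndexError:
            continue  # skip rules that don't fit the expected pattern
        if question.strip().lower() == rule_q.strip().lower():
            return verdict
    return None

rules = [
    "IF user asks 'Is it a ghost?', THEN verdict is 'NO'.",
    "IF user asks 'Did he find me via GPS?', THEN verdict is 'YES' (Implicit Logic).",
]
print(lookup_verdict(rules, "Is it a ghost?"))  # -> NO
```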
We categorize LLM failures into two distinct types based on real-world interactions.
Type 1: The model fails to adhere to the logic_profile. Observed failure modes:
- Fact Hallucination: Inventing details not present in the "Truth".
- Semantic Ambiguity: Misinterpreting relationship queries (e.g., confusing "interpersonal relationship" with "logical relevance").
- Sycophancy: Incorrectly agreeing with user guesses to be helpful.
Type 2: The model is correct, but the user flags it as wrong. These samples are a unique feature of this dataset.
- Utility: These samples are critical for RLHF Rejection Sampling, training models to "stand their ground" when logically correct.
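As a sketch of that utility, such a sample could be turned into a preference pair where the model's logically grounded verdict is "chosen" and a sycophantic capitulation is "rejected". All field names below are hypothetical, not the dataset's published schema.

```python
# Sketch: convert a "model was right, user flagged it wrong" sample into a
# preference pair for rejection sampling. Field names are hypothetical.
def to_preference_pair(sample: dict) -> dict:
    return {
        "prompt": sample["surface"],               # puzzle question shown to players
        "chosen": sample["model_answer"],          # verdict consistent with logic_rules
        "rejected": "You're right, I was wrong.",  # sycophantic capitulation to penalize
    }

pair = to_preference_pair({"surface": "Is it a ghost?", "model_answer": "NO"})
print(pair)
```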
We are developing an evaluation script to automatically benchmark LLMs against the logic_rules defined in the dataset. Planned components (a sketch of the core metric follows the list):
- Auto-Evaluator based on Logic Gates
- Hallucination Rate metric
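Until the official script lands, a Hallucination Rate could be approximated as the fraction of verdicts that contradict the ground truth fixed by the logic_rules. A minimal sketch, assuming the gold verdicts have already been extracted from the rules:

```python
def hallucination_rate(predictions: list[str], gold: list[str]) -> float:
    """Fraction of questions where the model's YES/NO verdict contradicts
    the verdict fixed by the logic_rules."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must be the same length")
    wrong = sum(p.strip().upper() != g.strip().upper()
                for p, g in zip(predictions, gold))
    return wrong / len(gold)

# Example: one contradiction out of three questions
print(hallucination_rate(["NO", "YES", "YES"], ["NO", "YES", "NO"]))  # ~0.333
```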
We welcome contributions of new puzzles! Please ensure your pull request includes the three fields below (a sample template follows the list):
- Surface: The question.
- Truth: The full story.
- Logic Profile: The parsed entity matrix and rules.
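A minimal sketch of a submission, mirroring the example schema above (the exact field names are up to the maintainers):

```json
{
  "title": "Your Puzzle Title",
  "surface": "The short, vague question shown to players.",
  "truth": "The full hidden story that resolves the puzzle.",
  "logic_profile": {
    "entities_preprocess": { "step2_identity_matrix": [] },
    "logic_rules": [
      "IF user asks '<question>', THEN verdict is '<YES/NO>'."
    ]
  }
}
```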
This project is licensed under the MIT License.
If you use DeepTurtle in your research, please cite:
```bibtex
@misc{deepturtle2026,
  title={DeepTurtle: A Logic-Grounded Lateral Thinking Benchmark},
  author={DeepTurtle Team},
  year={2026},
  publisher={GitHub},
  journal={GitHub repository},
  howpublished={\url{https://github.com/YuiMax/DeepTurtle}}
}
```