A Logic-Grounded Lateral Thinking Benchmark for LLMs: Evaluating Implicit Reasoning, Hallucination Detection, and Instruction Following
Introduction • Dataset • Methodology • Citation
DeepTurtle is a benchmark designed to evaluate the reasoning capabilities of Large Language Models (LLMs) through Lateral Thinking Puzzles (also known as "Turtle Soup").
Unlike traditional QA datasets, DeepTurtle focuses on non-monotonic reasoning where the premise is vague and the truth is hidden. It introduces a novel Logic Profile engine to strictly define the boundaries of "Truth," enabling precise detection of model hallucinations, particularly the failure modes observed in DeepSeek models.
Play the Game: https://haiguitang.net
The full dataset is hosted on Hugging Face. It contains 61 rigorously curated "Golden Samples" with human-in-the-loop annotations.
```python
from datasets import load_dataset

# Load the dataset from the Hugging Face Hub
dataset = load_dataset("YuiMax/DeepTurtle")

# Inspect the Logic Profile of the first case
print(dataset['train'][0]['logic_profile'])
```

The core contribution of DeepTurtle is the structured Logic Profile, which transforms a vague story into a computable logic state machine.
```json
{
  "title": "Live Stream Murder",
  "logic_profile": {
    "entities_preprocess": {
      "step2_identity_matrix": [
        {
          "noun": "Streamer",
          "role_feature": "Predator",
          "knowledge_feature": "Omniscient (Knows victim's location)"
        }
      ]
    },
    "logic_rules": [
      "IF user asks 'Is it a ghost?', THEN verdict is 'NO'.",
      "IF user asks 'Did he find me via GPS?', THEN verdict is 'YES' (Implicit Logic)."
    ]
  }
}
```
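To make "computable" concrete, each rule can be read as a question-to-verdict mapping. The sketch below uses naive string matching purely for illustration; the project's official evaluator (see the roadmap below) may parse rules differently.

```python
def lookup_verdict(logic_rules: list[str], question: str) -> str | None:
    """Illustrative only: match a player question against a rule's quoted
    question text and return the quoted verdict, else None."""
    for rule in logic_rules:
        # Rules follow the pattern: IF user asks '<q>', THEN verdict is '<v>' ...
        try:
            rule_q = rule.split("asks '")[1].split("'")[0]
            verdict = rule.split("verdict is '")[1].split("'")[0]
        except IndexError:
            continue  # skip rules that don't fit the expected pattern
        if question.strip().lower() == rule_q.strip().lower():
            return verdict
    return None

rules = [
    "IF user asks 'Is it a ghost?', THEN verdict is 'NO'.",
    "IF user asks 'Did he find me via GPS?', THEN verdict is 'YES' (Implicit Logic).",
]
print(lookup_verdict(rules, "Is it a ghost?"))  # -> NO
```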
We categorize LLM failures into two distinct types based on real-world interactions.
Type 1: The model fails to adhere to the logic_profile. Observed failure modes:
- Fact Hallucination: Inventing details not present in the "Truth".
- Semantic Ambiguity: Misinterpreting relationship queries (e.g., confusing "interpersonal relationship" with "logical relevance").
- Sycophancy: Incorrectly agreeing with user guesses to be helpful.
Type 2: The model is correct, but the user flags it as wrong. These samples are a unique feature of this dataset.
- Utility: These samples are critical for RLHF Rejection Sampling, training models to "stand their ground" when logically correct.
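As a sketch of that utility, such a sample could be turned into a preference pair where the model's logically grounded verdict is "chosen" and a sycophantic capitulation is "rejected". All field names below are hypothetical, not the dataset's published schema.

```python
# Sketch: convert a "model was right, user flagged it wrong" sample into a
# preference pair for rejection sampling. Field names are hypothetical.
def to_preference_pair(sample: dict) -> dict:
    return {
        "prompt": sample["surface"],               # puzzle question shown to players
        "chosen": sample["model_answer"],          # verdict consistent with logic_rules
        "rejected": "You're right, I was wrong.",  # sycophantic capitulation to penalize
    }

pair = to_preference_pair({"surface": "Is it a ghost?", "model_answer": "NO"})
print(pair)
```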
We are developing an evaluation script to automatically benchmark LLMs against the logic_rules defined in the dataset. Planned components (a sketch of the core metric follows the list):
- Auto-Evaluator based on Logic Gates
- Hallucination Rate metric
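Until the official script lands, a Hallucination Rate could be approximated as the fraction of verdicts that contradict the ground truth fixed by the logic_rules. A minimal sketch, assuming the gold verdicts have already been extracted from the rules:

```python
def hallucination_rate(predictions: list[str], gold: list[str]) -> float:
    """Fraction of questions where the model's YES/NO verdict contradicts
    the verdict fixed by the logic_rules."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must be the same length")
    wrong = sum(p.strip().upper() != g.strip().upper()
                for p, g in zip(predictions, gold))
    return wrong / len(gold)

# Example: one contradiction out of three questions
print(hallucination_rate(["NO", "YES", "YES"], ["NO", "YES", "NO"]))  # ~0.333
```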
We welcome contributions of new puzzles! Please ensure your pull request includes the three fields below (a sample template follows the list):
- Surface: The question.
- Truth: The full story.
- Logic Profile: The parsed entity matrix and rules.
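A minimal sketch of a submission, mirroring the example schema above (the exact field names are up to the maintainers):

```json
{
  "title": "Your Puzzle Title",
  "surface": "The short, vague question shown to players.",
  "truth": "The full hidden story that resolves the puzzle.",
  "logic_profile": {
    "entities_preprocess": { "step2_identity_matrix": [] },
    "logic_rules": [
      "IF user asks '<question>', THEN verdict is '<YES/NO>'."
    ]
  }
}
```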
This project is licensed under the MIT License.
If you use DeepTurtle in your research, please cite:
```bibtex
@misc{deepturtle2026,
  title={DeepTurtle: A Logic-Grounded Lateral Thinking Benchmark},
  author={DeepTurtle Team},
  year={2026},
  publisher={GitHub},
  journal={GitHub repository},
  howpublished={\url{https://github.com/YuiMax/DeepTurtle}}
}
```