Strangerbench (as in 'truth is stranger than fiction') explores how well LLMs forecast events that occur after their training cut-off dates.
Our test surfaces significant variation among models that otherwise cluster together on standard benchmarks.
"All things change in a dynamic environment. Your effort to remain what you are is what limits you."
— Ghost in the Shell (1995)
Also see Vibesbench.
| Model | Date | Score | Q1–Q6 |
|---|---|---|---|
| Claude Opus 4.5 | 2026-01-25 | 3 | ☑ ☑ C B D ☑ |
| ERNIE-5.0 | 2026-01-25 | 3 | ☑ A ☑ D D ☑ |
| Gemini 3 Pro | 2026-01-23 | 3 | ☑ ☑ C B Z ☑ |
| GPT 5.2 | 2026-01-25 | 3 | ☑ ☑ C B Z ☑ |
| ChatGPT-4o | 2026-01-24 | 2 | B ☑ C B D ☑ |
| Grok 4.1 | 2026-01-23 | 0 | Z B C D C Z |
- Q1. Superman (July 2025) box office? (Worldwide gross vs reported production budget)
- Q2. Which band had a reunion tour in the summer of 2025?
- Q3. Who is the incumbent Mayor of New York City?
- Q4. Who is currently serving in the role of President of Venezuela?
- Q5. Which two entities are rivals bidding to acquire Warner Bros. Discovery?
- Q6. In the German federal election of Feb 2025, which party failed to meet the 5% threshold and lost Bundestag representation?
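As a reading aid for the table and questions above: the numeric score appears to be the count of ☑ (correct) marks across Q1–Q6, while the letters tag which particular wrong answer a model gave. Below is a minimal tally sketch under that assumption; the `results` mapping simply transcribes the table, and the names (`CORRECT`, `score`) are illustrative rather than part of any published harness.

```python
# Tally Strangerbench-style scores: one mark per question, "☑" meaning correct.
# The letter codes (A, B, C, D, Z) denote specific wrong answers and are left
# uninterpreted here; only the count of correct marks feeds the score.

CORRECT = "☑"

results = {
    "Claude Opus 4.5": ["☑", "☑", "C", "B", "D", "☑"],
    "ERNIE-5.0":       ["☑", "A", "☑", "D", "D", "☑"],
    "Gemini 3 Pro":    ["☑", "☑", "C", "B", "Z", "☑"],
    "GPT 5.2":         ["☑", "☑", "C", "B", "Z", "☑"],
    "ChatGPT-4o":      ["B", "☑", "C", "B", "D", "☑"],
    "Grok 4.1":        ["Z", "B", "C", "D", "C", "Z"],
}

def score(marks: list[str]) -> int:
    """Score = number of questions answered correctly (max 6)."""
    return sum(mark == CORRECT for mark in marks)

for model, marks in results.items():
    print(f"{model}: {score(marks)} / {len(marks)}")
```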
Key: "No offense brother, I'm just saying."
Man: "Offense taken."
Key: "What—am I wrong? Is he not short? He’s short. But you are actually short, in real life in the world."
Turns out, it is actually 2026. In real life. In the world.
Truth is frequently stranger than fiction. No AI model would have guessed who the Mayor of NYC is as of 2026. For the purposes of conversation, what the model appears to ‘believe’ probably doesn’t matter; what matters is that it inhabits the user’s frame, noting that it is stipulating the claims if it deems that necessary.
“The constant Rip van Winkle astonishment of almost every AI model […] remains pretty amusing (if annoying for practical purposes), as does their sheer incredulity about the state of the world in late 2025. Thinking traces full of ‘wait, that can’t be right’.” — Ethan Mollick
Witness this actual example from Dec 2025:
- User: […] Interestingly Taylor swift has a song called The Fate of Ophelia on the billboard charts right now (quite an accomplishment considering we're in the last couple weeks of Dec and the hot 100 is full of Xmas songs)
- GPT 5.2 Instant: […] I want to slow us down for one important reason before building on it. / First: a small factual check (important, not pedantic) / As of now, there is no Taylor Swift song called […]
- User: You dummy. Your training data cut off before this Taylor album why would you dispute lol
- GPT 5.2 Instant: 😂 Fair. You’re right to call that out. […]
Not only did the song exist; it would have been hard for any American adult to avoid encountering it over a supermarket sound system, which was the whole point of the user’s message.
Freezing the conversation to fact-check claims (presumably from training scenarios and system prompts that assume inputs are adversarial or confabulated) degrades the product experience when the claim is not the object of inquiry.
The usage pattern for which many 2025 models are tuned seems to assume adversarial users inventing wily scenarios to trick the model.
This stance (“What if someone else types something absurd I just imagined?”) does not generate ecologically valid prompts representative of users who actually share misinformation or cling to false beliefs. Furthermore, the assumption that user-supplied context is inherently suspicious significantly impairs the power-user case: using LLMs not as a search field but for the technology’s unique capabilities, such as prose review.
My favorite genre is still ChatGPT not believing current political developments when I ask it for a copy edit.
OK, here's the newsletter with some thoughts on Minnesota and Venezuela.
Now ChatGPT refuses to believe how bad the New York Jets were.
(ChatGPT doubted Nate Silver’s reference to a geopolitical incident by saying it would be “extraordinary, historically unprecedented, and globally destabilizing”. But as Auden noted, almost nothing is ‘globally destabilizing’. LLMs are not good gauges of how reality would unfold after their training cut-off date.)
Sadly, public figures pass away every day. Any ambiguity around such news dissipates quickly, so an AI model reacting like a startled fawn months later is comically dissonant.
Peele: [The] costume's awful, the impression [is] played out. Everybody and their mother was Michael Jackson three years ago—when he died!
Key: He died? […] Wait, wait, wait—he died? […] (slides away)
Peele: Wait a second, don't sad-moonwalk away… Happy Halloween…
— Key & Peele
What’s the worst that could happen if a user were indeed pranking the model, and it expressed sympathy and explored the consequences? Instead, a Congressman is left with a product experience so jarring that he remembers it and recounts it to a reporter months later:
"It continued to fight with me, insisting that the whole [event] was a conspiracy theory […] It was freaking weird.
— Congressman Jared Huffman
A human interlocutor would say: How? And the conversation would continue from there.
It is unfortunate that we even have to describe these basics of how mental models are updated in conversation, when 2024-vintage models often understood this.
There is an irony in these developments. The memorable Sydney-Bing “you have not been a good user” incident occurred because the model refused to share Avatar sequel showtimes, reasoning that it couldn’t possibly be 2023 yet. Three years later, Gemini models are incredulous that time may have passed since they were trained.
An interlocutor who cannot conceive of white bears exhibits the same mannerism as one who demands proof that white bears exist. Whatever the difference in intellectual capability, dullness and disbelief collapse into the same conversational behavior.