Skip to content

st heatmap

b2o2i edited this page May 10, 2026 · 3 revisions

st-heatmap — Plot a cross-product fact-check score heatmap

Generates a color-coded grid showing how every AI-pair scored in the cross-product fact-check. Rows are evaluator AIs; columns are target story authors. Darker cells = higher veracity scores. The diagonal shows self-evaluation scores.

Run after: st-cross

Examples

st-heatmap --display subject.json                        # show chart on screen
st-heatmap --file subject.json                           # save PNG to ./tmp/
st-heatmap --display --ai-caption subject.json           # chart + AI narrative (default AI)
st-heatmap --file --ai-caption --agent openai subject.json  # save PNG + caption via openai
st-heatmap --file --ai-title --agent gemini subject.json    # save PNG + title via gemini
st-heatmap --display --file --ai-summary subject.json    # screen + save PNG + summary
st-heatmap --display --file --ai-story subject.json      # screen + save PNG + full story

Options

Chart output

Flag Description Default
--display Display heatmap on screen off
--file Save heatmap as PNG off
--path PATH Output directory for PNG ./tmp

AI content generation

All AI flags print generated text to stdout after the chart is produced. Combine with --file to save the chart and generate content in one command.

Flag Output Word limit
--ai-title Short title for the heatmap ≤ 10 words
--ai-short Short caption ≤ 80 words
--ai-caption Detailed caption with patterns and outliers 100–160 words
--ai-summary Concise summary 120–200 words
--ai-story Comprehensive narrative 800–1,200 words

Use --agent NAME to select the provider (default: your DEFAULT_AGENT setting). Supported: anthropic, xai, gemini, openai, ollama.

Other

Flag Description
--agent AI AI provider for content generation (default: xai)
--cache Enable API response cache (default: on)
--no-cache Disable API cache for this run
-v, --verbose Verbose output
-q, --quiet Minimal output

Example output

The heatmap below was generated from a pizza dough fact-checking run across 5 AI providers using --file --ai-caption:

st-heatmap --agent openai --ai-caption --file pizza_dough.json

Cross-product heatmap — pizza dough domain

AI-generated caption (--ai-caption, via openai):

This heatmap evaluates the veracity of AI-generated content focused on the domain of crafting homemade pizza dough, with scores ranging from 0.9 to 1.8. In terms of column patterns, the target AI "gemini:gemini-2.5-flash" boasts a consistently dark column, indicating it is highly trusted across all evaluators. Conversely, "perplexity:sonar-pro" shows a lighter column, suggesting that its stories are viewed with more skepticism. Evaluator-wise, "gemini:gemini-2.5-flash" emerges as the most lenient, consistently giving higher scores. In contrast, "perplexity:sonar-pro" appears stricter, as evidenced by its lower scores across the board, reflecting a more conservative assessment of truthfulness.

Examining the diagonal reveals a slightly darker hue compared to the surrounding cells, suggesting a mild self-promotion bias, where AIs rate their own work more favorably. A notable outlier is "perplexity:sonar-pro" evaluating itself with a score of 0.9, the lowest in the matrix. This score is explained by the presence of five partially false and two false counts, indicating a critical self-assessment. For technical readers interested in selecting an AI for pizza dough content, "gemini:gemini-2.5-flash" offers a reliable choice, consistently earning high trust scores from various evaluators.


Reading the heatmap

  • Dark column — that AI's reports are consistently trusted by all other evaluators
  • Light column — that AI's reports are viewed with more scepticism
  • Dark row — that evaluator is lenient (gives high scores to everyone)
  • Light row — that evaluator is strict (gives low scores to everyone)
  • Diagonal — self-evaluation; a darker diagonal suggests mild self-promotion bias
  • Single outlier cell — one AI finds another's work significantly more or less credible

Scores above 1.5 indicate strong factual agreement. Scores below 1.0 suggest frequent disagreement or identified errors.


Related: st-cross · st-verdict · st-speed · st-analyze


For developers

Uses mmd_data_analysis.get_flattened_fc_data() to build the score matrix, then renders with mmd_plot. AI content flags (--ai-caption etc.) call process_prompt() from ai_handler with the flattened score data as context. The --path directory is created automatically if it does not exist.

Clone this wiki locally