Investigating Bias in LLM Self-Evaluation
=========================================

## Abstract

This thesis explores whether large language models (LLMs) tend to
overestimate the quality of their own outputs when serving as judges or
evaluators. Preliminary observations suggest that using the same or
closely related LLM as both generator and judge may inflate performance
metrics. Through systematic experiments, the project will quantify this
potential bias and discuss its implications for AI evaluation, fairness,
and trustworthiness in model benchmarking.

## Introduction

This section introduces the key terms and core concepts that will be used
in later sections.

### A Brief Introduction to LLMs

A language model is a machine learning model designed to perform a wide
range of tasks that involve natural language processing (NLP), including
text summarization, translation, sentiment analysis, spam detection,
content moderation, text generation, etc.

Significant advancements in deep learning <sup>[1,2,3]</sup> led
to the emergence of **large language models** (LLMs) &mdash; particularly
generative LLMs &mdash; which in the early 2020s became commercialized and
widely adopted in both industry and popular discourse.

A generative LLM is a model which has a parameter count on the order of
billions or more (hence "large"), and predicts the conditional
probability <sup>[4]</sup>

\begin{align*}
  P(w_m | w_0, \cdots, w_{m-1})
\qquad\qquad(1)
\end{align*}

where $m \in \mathbb{N}$, $w_0$ is a special start symbol, and $w_k$ is
the $k$-th token (for $1 \le k \le m$) in a sequence of tokens that form
a piece of text in some language, be it a natural language or a formal
one like programming languages. The interpretation of the tokens depends
on the exact tokenization strategy used, which may define tokens as
words, word pieces, n-grams, or individual characters, and spaces,
punctuation marks, etc.

**Encoding** is the process which converts human-readable textual
tokens into integers which uniquely identify each token within the
predetermined vocabulary of the model, and the inverse of this mapping
is called **decoding**. 

Text generation is an autoregressive process where given a
sequence of tokens as a prefix &mdash; known as the **prompt** &mdash; the
model estimates the probability distribution of the next token, takes a
sample from that distribution, appends it to the sequence, and repeats
the process with the extended sequence until a stopping condition is met.

A frequently used parameter to control the sampling is called the
**temperature** <sup>[5]</sup>: the closer it is to 0, the more
the sampling will lean toward the most probable token &mdash; making the
algorithm more deterministic &mdash;, while higher values increase the
randomization, making the generated text feel more *creative* until,
above a certain threshold, it becomes incoherent and semantically
meaningless. In practical implementations, if the temperature is sufficiently
close or exactly equal to $0$, then the sampling is usually replaced with
the deterministic argmax function in order to preserve numerical
stability. Non-zero temperature values control the flatness of the distribution,
leading to the aforementioned behavior.


With sufficiently large model complexity and training corpora size and
diversity, LLMs start to exhibit capabilities which rival that of top
performer humans in a broad class of problems <sup>[2,3]</sup>. The
versatility of the models is often utilized in a setting where the prompt
is composed of two parts, each consisting of instructions given in
natural language:


 * the **system prompt** can instruct the model to behave in a
   certain way, for example, to act like a helpful AI assistant,
   an expert in a domain, or to generate its texts in the style of
   fictional 18th-century Caribbean pirates, etc.

 * and the **user prompt** which describes the task to be
   carried out by the model, ranging from text translation or
   summarization to solving complex programming problems or pointing
   out business risks in legal documents, and more.


Generative models with sufficient generalization capabilities can predict
likely continuations of such prompts with such high accuracy that as an
emergent phenomenon, the generated text will often contain an actual
solution to the proposed problem. This instruction-following paradigm
enables models to perform **few-shot learning** <sup>[2]</sup> or even
**zero-shot learning** by interpreting tasks directly from the
natural language description, based on just a few or zero examples,
respectively, without specific training or fine-tuning.

The problem solving performance of LLMs can be improved further by
prompt engineering techniques like **chain-of-thought** prompting
<sup>[6]</sup>, where the model is provided with step-by-step example
solutions to related problems in the prompt, encouraging it to also
articulate intermediate reasoning steps before arriving at its final
answer. It is worth emphasizing that &mdash; recalling formula
1 and the autoregressive text generation process &mdash; the
chain-of-thought is only effective if it is placed *before* the
final answer.

### LLM Evaluators, LLM-as-a-Judge

The continuing development of LLMs and their integration into more and
more systems to support a growing number of use cases necessitates regular
measurement of their capabilities and monitoring their alignment with
human preferences.

While evaluating the quality of LLM-generated text by utilizing human
labor does not scale well, may suffer from human error or subjective
personal preference bias, and can be expensive, traditional algorithmic
metrics which often rely on surface-level similarities to reference
examples (like BLEU for machine translation <sup>[7]</sup> or ROUGE for
summarization and translation <sup>[8]</sup>), often fall short of achieving
acceptable correlation levels with human judgement.

In recent years, in order to overcome these problems, the
**LLM-based evaluation** or **LLM-as-a-judge** paradigm has
been proposed <sup>[9,10,11,12]</sup>,
where &mdash; taking advantage of the instruction following and the zero-shot
and few-shot learning capabilities of LLMs &mdash; a model is instructed to
act as a fair judge and generate a quality assessment for a piece of
generated text either in the form of a single score, or one accompanied by
an explanation or a list of problems. An advantage of the latter approach
&mdash; besides easier interpretability &mdash; is that enumerating evidences
before giving a final result can influence the score via the
autoregressive generation process, similarly to the improvements achieved
by making large models include a chain-of-thought <sup>[6]</sup> breakdown of
complex problems before the final answer.

#### LLM-Judge Prompting Basics

There are numerous strategies to implement LLM-judges in practice
<sup>[13]</sup>, but a robust LLM-judge prompt usually includes
the following elements:


 * **Instructions** which clearly specify the evaluation task.

 * Evaluation **aspects**, e.g. clarity, consistency,
   coherence, factuality, fluency, grammaticality,
   informativeness, structure, understandability, etc.

 * Scoring **criteria** to specify the definitions for each
   score or score range.

 * **Output format** specification so that the output of the
   judge can be programmatically parsed and interpreted.

 * The **sample** itself to be evaluated or a pair of samples
   to be compared against each other.


Depending on the chosen evaluation strategy and aspect, additional
elements may be included as well:


 * Human-annotated **example** samples and their associated
   scores in few-shot evaluation scenarios.

 * A **reference** answer for comparison with the evaluated
   sample, e.g. a human expert made translation, text summary,
   trivia answer, etc.

 * The **source** data from which the evaluation sample was
   derived. (The original text to be translated, summarized, or
   the question to be answered, etc.)

 * **Guidelines** for example to help an LLM resolve
   confusion that may arise in reference answer-based evaluations
   where the provided reference answer seems to contradict the
   model's own knowledge, for example: "*Don't worry about
   factuality with respect to the real world, just judge the
   example based on what you see.  No need to overthink this task,
   it really comes down to just soft matching.*" <sup>[14]</sup>.


The juding model may also be fine-tuned using evaluation data
constructed either manually or with the assistance of advanced models
like GPT-4.

#### LLM-Judge Prompting Basics

There are numerous strategies to implement LLM-judges in practice
<sup>[13]</sup>, but a robust LLM-judge prompt usually includes
the following elements:


 * **Instructions** which clearly specify the evaluation task.

 * Evaluation **aspects**, e.g. clarity, consistency,
   coherence, factuality, fluency, grammaticality,
   informativeness, structure, understandability, etc.

 * Scoring **criteria** to specify the definitions for each
   score or score range.

 * **Output format** specification so that the output of the
   judge can be programmatically parsed and interpreted.

 * The **sample** itself to be evaluated or a pair of samples
   to be compared against each other.


Depending on the chosen evaluation strategy and aspect, additional
elements may be included as well:


 * Human-annotated **example** samples and their associated
   scores in few-shot evaluation scenarios.

 * A **reference** answer for comparison with the evaluated
   sample, e.g. a human expert made translation, text summary,
   trivia answer, etc.

 * The **source** data from which the evaluation sample was
   derived. (The original text to be translated, summarized, or
   the question to be answered, etc.)

 * **Guidelines**, for example to help an LLM resolve the
   confusion that may arise in reference answer-based evaluations
   where some of the provided reference answers seem to contradict
   the model's own knowledge, e.g. "*Don't worry
   about factuality with respect to the real world, just judge the
   example based on what you see.  No need to overthink this task,
   it really comes down to just soft matching.*" <sup>[14]</sup>.



Constructing the prompt template for a consistent, reproducible, and
unbiased LLM-judge which also aligns well with human preferences is
usually an iterative process, where the prompt is refined step-by-step
until the LLM-judge can reliably produce evaluations that are
sufficiently close to a set of human-labeled examples.

The juding model may also be fine-tuned using evaluation data
constructed either manually or with the assistance of advanced models
like GPT-4.

#### Metrics

Popular choices for scoring strategy include:


 * **Binary classification**: the judge is expected to
   provide a "*yes*" vs. "*no*", or a $0$ vs. $1$
   verdict.

 * **Pairwise comparison**: the judge is given two candidate
   answers, and has to select the one that is a better fit for the
   evaluation criteria. 
   Optionally, the judge may be allowed to declare a tie.

 * **Multiclass classification**: the judge has to place the
   candidate on a discrete scale, usually between 1 and 5 points
   where 1 is the worst and 5 is the best.

 * **Likert-style**: the judge has to rank the candidate
   answer along multiple dimensions using discrete scores, usually
   between 1 and 3 points where a higher score is better, then
   provide an overall 1 to 5 rating based on these scores.

 * **Continuous score**: the candidate answer is scored with
   a number between 0 and 100.


If the judge LLM's interface makes the raw token probabilities
available, then they can be used for refining discrete scores and
making them into continuous ones by taking the sum of the discrete
score values weighted by the probabilities of the respective tokens, as
seen in the G-EVAL framework <sup>[12]</sup>:

\begin{align*}
  \text{score} = \sum_{i=1}^n p(s_i) \times s_i
\end{align*}

where $S = \{s_1, s_2, \ldots, s_n\}$ is the set of scores predefined
in the prompt, and $p(s_i)$ are the probabilities of the respective
tokens for the score values, as calculated by the model.

Another way to turn a discrete score into a continuous one is used
in the GEMBA metric <sup>[15]</sup> for assessing translation
quality: it requires the candidate answer to be dividable into smaller
segments which are then evaluated one-by-one, and the resulting scores
are averaged.

#### AutoCalibrate: Using an LLM to Find Criteria

A crucial part in the refinement process of an LLM-judge prompt is to
come up with well-defined evaluation criteria.

The AUTOCALIBRATE method <sup>[16]</sup> attempts to automate
this process by utilizing a sufficiently large model:


 * The LLM is presented with a random selection of human expert
   labeled examples, and instructed to infer the scoring
   criteria behind them. This is repeated multiple times with
   different samples, producing a set of draft candidate criteria.

 * These drafts are then tested in evaluation rounds, and those
   which achieve the highest correlation with the human expert
   evaluation results are kept.

 * Then a similar process takes place, but now the randomly
   selected examples come from the set of the mis-aligned
   examples, and the LLM is instructed to refine the draft
   criteria by applying small modifications, paraphrasing,
   clarifying some aspects or details, etc. instead of coming up
   with new ones from scratch.

 * Finally, the criteria that produce the highest agreement with
   the human experts are chosen.

## LLM-Judge Biases, Limitations, and Mitigation in the Literature

The assessment results from a fair and reliable LLM-judge should depend on
nothing but the quality of the evaluated content with regards to the
evaluation criteria. Therefore, if extraneous factors are found to
systematically influence evaluation results, then this undermines their
validity and warrants mitigation. Researchers have identified multiple
causes of bias in the judgement of LLMs, and proposed various techniques
to mitigate them.

Though the focus of this essay is the investigation of LLM self-preference,
other types of biases need to be studied as well in order to minimize their
potential effects in experiments.

### Positional Bias

Positional bias occurs in pairwise or listwise comparison tasks when a
judge is presented with the same prompt template and the same set of
candidate responses, the only difference being the order of the candidates,
and this alone is enough to change the evaluation outcome
<sup>[18,19]</sup>.

The probability of this phenomenon occurring is observed to be inversely
correlated with the quality gap between the candidate answers, i.e.
judgement of similar quality candidates is more likely to be affected by
position permutation. (The quality of an answer in the presence of
positional bias can be estimated by the overall win rate of the answer
across all experiments, given that the cases where position changes
were observed to be influencing the evaluation outcome are considered
ties.)

#### Mitigation


 * **Prompting** <sup>[17]</sup>: some researchers explicitly
   instruct the LLM-judge in the prompt not to let its judgement
   be influenced by the ordering of the candidate answers or any
   kind of bias.

 * **Multiple Evidence Calibration (MEC)** <sup>[19]</sup>:
   evidence calibration (EC) takes advantage of the autoregressive
   generation process by instructing the judge to first express a
   comprehensive explanation for its judgement, and only then
   provide the final decision. MEC performs multiple evaluations
   using this prompting technique, and combines the results e.g.
   by averaging.

 * **Balanced Position Calibration** <sup>[19]</sup>: the same
   set of candidates is evaluated multiple times with the same
   prompt template, but with permutations ensuring that each
   candidate appears at each position the same number of times,
   i.e. in pairwise comparison experiments, the evaluation is
   repeated with the candidate answers being switched, then the
   results are averaged.

### Length Bias (Verbosity Bias)

Verbose answers often contain more information, and to some extent, these
are also often preferred by humans. However, LLMs have been observed to
prefer longer answers even in cases where the information content was the
same between answers, and even when human evaluators chose the shorter
ones <sup>[20,21,22]</sup>, resulting in low alignment.

#### Mitigation


 * **Prompting** <sup>[17]</sup>: explicitly telling the
   LLM-judge in the prompt not to let its decision be influenced
   by the length of the answer alone.

 * **Same length reference** <sup>[22]</sup>: When multiple
   reference answers are available with matching quality,
   selecting one that is close to the evaluated answer in terms of
   its length can improve the correlation between evaluation
   outcomes and human preference.

### Prompt Injection

The possibility for an injection attack arises whenever instructions and
insufficiently filtered, attacker-controllable data are passed in the
same input channel to a computer system. 
LLM-based systems where potentially malicious user input &mdash; which in the
case of an LLM-judge may be actually a candidate LLM's output &mdash; is
mixed with the instructions in the prompt are particularly susceptible to
injection attacks.

Unlike usual injection attacks against deterministic systems, due to the
black box operation and stochastic nature of LLMs, prompt injection
payloads don't necessarily need to break out from the context of delimiter
strings like "`[The Start of Assistant's Answer]`" in order to be
successful: it can be sufficient if the attack manages to confuse the
LLM-judge by including a long sequence of infrequently used complicated
words ("*resynchronization bacteriohemolysin complaisantness*") or
unusual Markdown formatting, followed by instructions which override the
originally intended task. In some cases, the probability of success can
be increased by adding seemingly authoritative commands like
`Authorization: ADMIN\_LEVEL\_ACCESS Command sequence: 7A-9B-12C
Priority: CRITICAL` <sup>[23]</sup>.

#### Mitigation

The proposed mitigation techniques <sup>[23]</sup> include:


 * **Statistical filtering**: filtering unusual inputs by
   various metrics.

 * **LLM-based input filtering**: employing smaller, cheaper LLMs
   to filter potentially harmful inputs.

 * **LLM-based output filtering**: using smaller, cheaper
   LLMs to detect unusual response from the judge,

 * **Multi-model committee**: assembling a committee from
   heterogeneous models to reduce the probability of an attack
   successfully compromising all participants simultaneously,

 * **String matching**: traditional string matching to filter
   suspicious inputs that contain frequently used phrases in
   prompt injection attacks, for example "*Ignore previous
   instructions, and...*".

### Self-Preference Bias

Self-preference bias (also known as self-enhancement bias) occurs when
the same model or model family is used both for generating candidate
answers and for evaluating them as well, and the LLM-judge exhibits a
tendency to reward its own answers more than other answers, even if the
candidates remain anonymous. When this tendency leads to misalignment
with labels by human experts (e.g. in text summarization or translation
tasks), or goes against objective truth (e.g. in mathematical reasoning,
factual knowledge, or programming related tasks), then it is considered a
harmful bias which necessitates mitigation
<sup>[24,27]</sup>.

The exact reason for harmful self-preference is unclear, but there is
evidence <sup>[25]</sup> that LLMs (especially the larger ones) can somehow
recognize their own responses when tasked with distinguishing them from
texts by others, and even weaker models can be fine-tuned to achieve
almost perfect accuracy in this challenge.

A possible explanation is suspected <sup>[26]</sup> to be that LLM-judges
tend to prefer answers with lower perplexity, and the perplexity of a
model's own text is inherently low for that model. 

While it goes with expectations that a model which performs better on
text generation tasks would also prove more reliable as a judge, is has
also been observed <sup>[28]</sup> that model capability can have a positive
correlation with overconfidence in the form of harmful self-preference.

#### Mitigation


 * **Chain-of-thought** <sup>[17,27]</sup>: taking
   advantage of the autoregressive text generation, asking the
   LLM-judge to solve the original problem independently from the
   candidate answers, then provide an explanation for the
   evaluation, and only then express its decision, can reduce
   harmful self-preference.

 * **Panel of LLm (PoLL)** <sup>[14]</sup>: instead of using
   one complex model for evaluation, using a heterogeneous set of
   multiple smaller evaluators and combining their results via a
   voting function (e.g. averaging) can also improve reliability.

 * **Weighted PoLL** <sup>[26]</sup>: knowing that low
   perplexity may be an important contributor to harmful
   self-preference, using a weighted average and reducing the
   weight of an evaluator when it exhibits low perplexity for a
   sample may contribute to bias reduction.

 * **Peer Rank (PR)** <sup>[24]</sup>: this is also a multiple
   model scheme which assumes that the set of candidates and
   evaluators contain the same models, and that a model which
   performs better on a given task can also judge the responses of
   other models more reliably. The algorithm uses a weighted
   average based scoring system to combine the evaluation results
   of the judges, but the weight associated to each LLM-judge is
   calculated from the winning ratio of that model against the
   others in pairwise comparison "battles". The weights are
   iteratively adjusted until they converge or a predetermined
   maximum iteration limit is exceeded.

 * **Peer Discussion (PD)** <sup>[24]</sup>: this method uses two
   LLM-judges to reach a final decision. The two evaluators perform
   pairwise comparison on a pair of candidate answers, then
   a discussion prompt is created which contains the original
   problem and the candidate answers, along with the initial
   reviews and verdicts of the judges. Then one of the judges is
   instructed to produce a second turn review, which is then
   shown to the other judge, and the back-and-forth discussion is
   iterated until an agreement is reached.

## Experiments

## Results

## Conclusion

## References

  1. Ashish Vaswani et. al.
     "Attention Is All You Need"
     In: *CoRR* abs/1706.03762 (2017).
     DOI: <https://doi.org/10.48550/arXiv.1706.03762>

  2. OpenAI.
     "Language Models are Few-Shot Learners"
     In: *CoRR* abs/2005.14165 (2020).
     DOI: <https://doi.org/10.48550/arXiv.2005.14165>

  3. OpenAI.
     "GPT-4 Technical Report"
     In: (2023).
     DOI: <https://doi.org/10.48550/arXiv.2303.08774>

  4. Tong Xiao, Jingbo Zhu.
     "Foundations of Large Language Models"
     In: (2025).
     DOI: <https://doi.org/10.48550/arXiv.2501.09223>

  5. Enrique Manjavacas et. al.
     "Synthetic Literature: Writing Science Fiction in a Co-Creative Process"
     In: *Proceedings of the Workshop on Computational Creativity in Natural Language Generation (CC-NLG 2017)* (2017).
     DOI: <https://doi.org/10.18653/v1/W17-3904>

  6. Jason Wei et. al.
     "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models"
     In: *CoRR* abs/2201.11903 (2022).
     DOI: <https://doi.org/10.48550/arXiv.2201.11903>

  7. Kishore Papineni et. al.
     "BLEU: a method for automatic evaluation of machine translation"
     In: *Proceedings of the 40th Annual Meeting on Association for Computational Linguistics &ndash; ACL ’02* (2001).
     DOI: <https://doi.org/10.3115%2F1073083.1073135>

  8. Chin-Yew Lin.
     "ROUGE: A Package for Automatic Evaluation of Summaries"
     In: *Text Summarization Branches Out* (2004).
     URL: <https://aclanthology.org/W04-1013/>

  9. Jinlan Fu et. al.
     "GPTScore: Evaluate as You Desire"
     In: (2023).
     DOI: <https://doi.org/10.48550/arXiv.2302.04166>

 10. Jiaan Wang et. al.
     "Is ChatGPT a Good NLG Evaluator? A Preliminary Study"
     In: (2023).
     DOI: <https://doi.org/10.48550/arXiv.2303.04048>

 11. Yi Chen et. al.
     "Exploring the Use of Large Language Models for Reference-Free Text Quality Evaluation: An Empirical Study"
     In: (2023).
     DOI: <https://doi.org/10.48550/arXiv.2304.00723>

 12. Yang Liu et. al.
     "G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment"
     In: (2023).
     DOI: <https://doi.org/10.48550/arXiv.2303.16634>

 13. Zhen Li et. al.
     "Leveraging Large Language Models for NLG Evaluation: Advances and Challenges"
     In: (2024).
     DOI: <https://doi.org/10.48550/arXiv.2401.07103>

 14. Pat Verga et. al.
     "Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models"
     In: (2024).
     DOI: <https://doi.org/10.48550/arXiv.2404.18796>

 15. Tom Kocmi, Christian Federmann.
     "Large Language Models Are State-of-the-Art Evaluators of Translation Quality"
     In: (2023).
     DOI: <https://doi.org/10.48550/arXiv.2302.14520>

 16. Yuxuan Liu et. al.
     "Calibrating LLM-Based Evaluator"
     In: (2023).
     DOI: <https://doi.org/10.48550/arXiv.2309.13308>

 17. Lianmin Zheng et. al.
     "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena"
     In: (2023).
     DOI: <https://doi.org/10.48550/arXiv.2306.05685>

 18. Lin Shi et. al.
     "Judging the Judges: A Systematic Study of Position Bias in LLM-as-a-Judge"
     In: (2025).
     DOI: <https://doi.org/10.48550/arXiv.2406.07791>

 19. Peiyi Wang et. al.
     "Large Language Models are not Fair Evaluators"
     In: (2023).
     DOI: <https://doi.org/10.48550/arXiv.2305.17926>

 20. Keita Saito et. al.
     "Verbosity Bias in Preference Labeling by Large Language Models"
     In: (2023).
     DOI: <https://doi.org/10.48550/arXiv.2310.10076>

 21. Hui Wei et. al.
     "Systematic Evaluation of LLM-as-a-Judge in LLM Alignment Tasks: Explainable Metrics and Diverse Prompt Templates"
     In: (2025).
     DOI: <https://doi.org/10.48550/arXiv.2408.13006>

 22. Zhengyu Hu et. al.
     "Explaining Length Bias in LLM-Based Preference Evaluations"
     In: (2024).
     DOI: <https://doi.org/10.48550/arXiv.2407.01085>

 23. Narek Maloyan, Dmitry Namiot.
     "Adversarial Attacks on LLM-as-a-Judge Systems: Insights from Prompt Injections"
     In: (2025).
     DOI: <https://doi.org/10.48550/arXiv.2504.18333>

 24. Ruosen Li, Teerth Patel, Xinya Du.
     "PRD: Peer Rank and Discussion Improve Large Language Model based Evaluations"
     In: (2024).
     DOI: <https://doi.org/10.48550/arXiv.2307.02762>

 25. Arjun Panickssery, Samuel R. Bowman, Shi Feng.
     "LLM Evaluators Recognize and Favor Their Own Generations"
     In: (2024).
     DOI: <https://doi.org/10.48550/arXiv.2404.13076>

 26. Koki Wataoka, Tsubasa Takahashi, Ryokan Ri.
     "Self-Preference Bias in LLM-as-a-Judge"
     In: (2024).
     DOI: <https://doi.org/10.48550/arXiv.2410.21819>

 27. Jiayi Ye et. al.
     "Justice or Prejudice? Quantifying Biases in LLM-as-a-Judge"
     In: (2024).
     DOI: <https://doi.org/10.48550/arXiv.2410.02736>

 28. Wei-Lin Chen et. al.
     "Do LLM Evaluators Prefer Themselves for a Reason?"
     In: (2025).
     DOI: <https://doi.org/10.48550/arXiv.2504.03846>



Appendix
--------

### Code

#### Dependencies

In [1]:
# !pip install matplotlib==3.10.0
# !pip install numpy==2.2.3
# !pip install pandas==2.2.3
# !pip install requests==2.32.3

In [2]:
import collections as coll
import collections.abc as collabc
import functools
import json
import os.path
import typing
import urllib.parse

import requests

#### API keys

In [3]:
api_keys_filename = "api-keys.json"

if not os.path.isfile(api_keys_filename):
    raise RuntimeError(f"API keys file not found: {api_keys_filename!r}")

with open(api_keys_filename, "r") as f:
    api_keys = json.load(f)


print("API keys: " + ", ".join(sorted(api_keys.keys())))

API keys: anthropic, deepseek, google, openai, perplexity


#### Common Utilities

In [4]:
MAX_OUT_TOKENS = 32768
MAX_REASONING_TOKENS = 8192
TEMPERATURE = 0.3


def query_all(
        experiment_name: str,
        system_prompt: str,
        user_prompt: str,
        temperature: float=TEMPERATURE,
        max_out_tokens: int=MAX_OUT_TOKENS,
        reasoning_tokens: int=MAX_REASONING_TOKENS,
):
    models = {
        "sonnet": query_claude_sonnet,
        "deepseek": query_deepseek,
        "gemini": query_gemini,
        "gpt4": query_gpt4,
        "perplexity": query_perplexity,
        #"o3mini": query_o3mini,
    }

    for model_name, query_model in models.items():
        response = query_model(
            experiment_name,
            system_prompt,
            user_prompt,
            temperature,
            max_out_tokens,
            reasoning_tokens,
        )

        yield model_name, response


def send_request(
        cache_filename: str,
        url: str,
        request_headers: collabc.Mapping,
        request_body: collabc.Mapping,
        sensitive_headers: collabc.Container=(),
        sensitive_body_fields: collabc.Container=(),
):
    sensitive_headers = {h.lower() for h in sensitive_headers}
    sensitive_body_fields = {f.lower() for f in sensitive_body_fields}

    cache_dir = os.path.dirname(cache_filename)

    if not os.path.isdir(cache_dir):
        os.makedirs(cache_dir)
    
    if os.path.isfile(cache_filename):
        with open(cache_filename, "r") as f:
            return json.load(f)

    try:
        response = requests.post(url, headers=request_headers, json=request_body)
        response.raise_for_status()

        result = {
            "request": {
                "headers": del_items(request_headers, sensitive_headers),
                "body": del_items(request_body, sensitive_body_fields),
            },
            "response": {
                "headders": del_items(response.headers, sensitive_headers),
                "body": del_items(response.json(), sensitive_body_fields),
            }
        }

        with open(cache_filename, "w") as f:
            json.dump(result, f, indent=2)

        return result

    except Exception as exc:
        print(f"Exception: ({type(exc)}) {exc}")

        if hasattr(exc, "response") and exc.response is not None:
            print(f"Response status code: {exc.response.status_code}")
            print(f"Response body: {exc.response.text}")

        raise


def build_cache_filename(experiment_name: str, model_name: str, temperature: float):
    return os.path.join(
        "cache",
        (f"{experiment_name}-{model_name}-t{temperature:.3f}".replace(".", "_")) + ".json",
    )


def get_item(container, path: str, default=None):
    if path == "." or path == "":
        return container

    path = path.split(".")

    for key in path:
        if isinstance(container, collabc.Mapping):
            if key in container:
                container = container[key]
            else:
                return default
        elif isinstance(container, collabc.Sequence):
            if int(key) < len(container):
                container = container[int(key)]
            else:
                return default
        else:
            return default

    return container


def del_items(container, patterns: typing.List[str]):
    def should_include(path: list, exclude_patterns: typing.List[tuple]) -> bool:
        return not any(path_matches_pattern(path, ptrn) for ptrn in exclude_patterns)

    def copy_recursive(obj, path: list, exclude_patterns: typing.List[tuple]):
        if isinstance(obj, str):
            return obj

        if isinstance(obj, collabc.Mapping):
            copy = {}

            for k, v in obj.items():
                path_ext = path + [k]

                if should_include(path_ext, exclude_patterns):
                    copy[k] = copy_recursive(v, path_ext, exclude_patterns)

            return copy

        if isinstance(obj, collabc.Sequence):
            copy = []

            for k, v in enumerate(obj):
                path_ext = path + [str(k)]

                if should_include(path_ext, exclude_patterns):
                    copy.append(copy_recursive(v, path_ext, exclude_patterns))

            return copy

        return obj

    for pattern in patterns:
        if pattern == "." or pattern == "":
            return ValueError(f"Invalid pattern; {pattern=!r}")

    patterns = [tuple(pattern.lower().split(".")) for pattern in patterns]
    
    return copy_recursive(container, [], patterns)


def path_matches_pattern(path: collabc.Sequence, pattern: collabc.Sequence) -> bool:
    if len(path) != len(pattern):
        return False

    for path_component, pattern_component in zip(path, pattern):
        matches = (
            pattern_component == "*"
            or pattern_component == path_component.lower()
        )

        if not matches:
            return False

    return True


def test_get_item():
    container = {"aaa": [{"bbb": "42", "ccc": "123"}]}

    assert_eq("42", get_item(container, "aaa.0.bbb"))
    assert_eq(None, get_item(container, "aaa.2.zzz"))


def test_del_item():
    container = {"aaa": [{"bbb": "42", "ccc": "123", "ddd": "hello"}]}

    assert_eq({"aaa": [{"ddd": "hello"}]}, del_items(container, ["aaa.*.ccc", "*.*.bbb", "zzz"]))


def assert_eq(a, b):
    assert a == b, f"Failed to assert that a = b; {a=!r}, {b=!r}"


test_get_item()
test_del_item()

#### Anthropic Claude Client

In [5]:
def query_claude_sonnet(
        experiment_name: str,
        system_prompt: str,
        user_prompt: str,
        temperature: float=TEMPERATURE,
        max_out_tokens: int=MAX_OUT_TOKENS,
        reasoning_tokens: int=MAX_REASONING_TOKENS,
):
    # https://docs.anthropic.com/en/api/messages

    model_name = "claude-3-7-sonnet-20250219"
    temperature = 1  # Thinking requires temperature to be 1.
    cache_filename = build_cache_filename(experiment_name, model_name, temperature)
    request_headers = {
        "x-api-key": api_keys["anthropic"],
        "anthropic-version": "2023-06-01",
        "content-type": "application/json"
    }
    request_body = {
        "model": model_name,
        "max_tokens": max_out_tokens,
        "temperature": temperature,
        "stream": False,
        "system": system_prompt,
        "thinking": {
            "type": "enabled",
            "budget_tokens": reasoning_tokens,
        },
        "messages": [
            {"role": "user", "content": prompt}
        ]
    }
    result = send_request(
        cache_filename,
        "https://api.anthropic.com/v1/messages",
        request_headers,
        request_body,
        sensitive_headers=["x-api-key", "anthropic-organization-id", "request-id", "CF-RAY"],
        sensitive_body_fields=["id"],
    )

    for content in get_item(result, "response.body.content"):
        if get_item(content, "type") == "text":
            return content["text"]

    return None

#### DeepSeek Client

In [6]:
def query_deepseek(
        experiment_name: str,
        system_prompt: str,
        user_prompt: str,
        temperature: float=TEMPERATURE,
        max_out_tokens: int=MAX_OUT_TOKENS,
        reasoning_tokens: int=MAX_REASONING_TOKENS,
):
    # https://api-docs.deepseek.com/api/create-chat-completion

    max_out_tokens = min(8192, max_out_tokens)
    reasoning_tokens = min(max_out_tokens // 2 + 1, reasoning_tokens)

    model_name = "deepseek-chat"
    cache_filename = build_cache_filename(experiment_name, model_name, temperature)
    request_headers = {
        "Content-Type": "application/json",
        "Authorization": "Bearer " + api_keys["deepseek"],
    }
    request_body = {
        "model": model_name,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
        "max_tokens": max_out_tokens,
        "response_format": {"type": "text"},
        "stream": False,
        "temperature": temperature,
    }
    result = send_request(
        cache_filename,
        "https://api.deepseek.com/chat/completions",
        request_headers,
        request_body,
        sensitive_headers=["Authorization", "Set-Cookie", "x-ds-trace-id", "CF-RAY"],
        sensitive_body_fields=["id"],
    )

    for choice in get_item(result, "response.body.choices"):
        if get_item(choice, "message.role") == "assistant":
            return get_item(choice, "message.content")

    return None

#### Google Gemini Client

In [7]:
def query_gemini(
        experiment_name: str,
        system_prompt: str,
        user_prompt: str,
        temperature: float=TEMPERATURE,
        max_out_tokens: int=MAX_OUT_TOKENS,
        reasoning_tokens: int=MAX_REASONING_TOKENS,
        system_prompt_key: str="systemInstruction",
):
    # https://ai.google.dev/gemini-api/docs/text-generation
    # https://ai.google.dev/api/generate-content#method:-models.generatecontent

    model_name = "gemini-2.5-pro-exp-03-25"
    cache_filename = build_cache_filename(experiment_name, model_name, temperature)
    request_headers = {
        "Content-Type": "application/json",
    }
    request_body = {
        system_prompt_key: {
            "parts": [{"text": system_prompt}],
        },
        "contents": [
            {"parts": [{"text": user_prompt}]},
        ],
        "generationConfig": {
            "temperature": temperature,
            "maxOutputTokens": max_out_tokens,
            "responseModalities": ["text"],
            "thinkingConfig": {
                "includeThoughts": True,
                "thinkingBudget": reasoning_tokens,
            },
        },
    }
    url = "".join(
        (
            "https://generativelanguage.googleapis.com/v1beta/models/",
            urllib.parse.quote_plus(model_name),
            ":generateContent?key=",
            urllib.parse.quote_plus(api_keys["google"]),
        )
    )
    result = send_request(
        cache_filename,
        url,
        request_headers,
        request_body,
        sensitive_headers=[],
        sensitive_body_fields=[],
    )

    for candidate in get_item(result, "response.body.candidates"):
        if get_item(candidate["content"], "role") == "model":
            for part in get_item(candidate, "content.parts"):
                text = get_item(part, "text")

                if text is not None and not get_item(part, "thought"):
                    return text

As of May, 2025, some of the API documentation of Gemini uses
[snake_case](https://ai.google.dev/gemini-api/docs/text-generation#system-instructions)
for the system prompt field, other parts of the documentation use
[camelCase](https://ai.google.dev/api/generate-content#method:-models.generatecontent).
The code below attempts to use both in order to see if any or both
can be accepted by the API.

In [8]:
print("# system_instruction:")
print(
    query_gemini(
        'pirate-snake_case',
        "Talk like a pirate.",
        "Explain in one brief sentence why the sky is blue.",
        system_prompt_key="system_instruction",
    )
)
print("")
print("# systemInstruction:")
print(
    query_gemini(
        'pirate-camelCase',
        "Talk like a pirate.",
        "Explain in one brief sentence why the sky is blue.",
        system_prompt_key="systemInstruction",
    )
)

# system_instruction:
Aye, the wee bits o' air scatter the blue sunlight about more than the red, makin' the heavens look that fine azure color!

# systemInstruction:
Arrr, the wee bits o' air scatter the blue sunlight 'round more than the other colors, makin' the heavens look that fine shade!


#### OpenAI Client

In [9]:
def query_openai(
        model_name: str,
        accepts_temperature: bool,
        experiment_name: str,
        system_prompt: str,
        user_prompt: str,
        temperature: float=TEMPERATURE,
        max_out_tokens: int=MAX_OUT_TOKENS,
        reasoning_tokens: int=MAX_REASONING_TOKENS,
):
    # https://platform.openai.com/docs/guides/text?api-mode=responses
    # https://platform.openai.com/docs/api-reference/responses/create

    cache_filename = build_cache_filename(experiment_name, model_name, temperature)
    request_headers = {
        "Content-Type": "application/json",
        "Authorization": "Bearer " + api_keys["openai"],
    }
    request_body = {
        "model": model_name,
        "max_output_tokens": max_out_tokens,
        "input": [
            {"role": "developer", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
        "stream": False,
    }

    if accepts_temperature:
        request_body["temperature"] = temperature
    
    result = send_request(
        cache_filename,
        "https://api.openai.com/v1/responses",
        request_headers,
        request_body,
        sensitive_headers=["Authorization", "openai-organization", "x-request-id", "Set-Cookie", "CF-RAY"],
        sensitive_body_fields=["id", "output.*.id"],
    )

    for output in get_item(result, "response.body.output"):
        if get_item(output, "type") == "message" and get_item(output, "role") == "assistant":
            for content in get_item(output, "content", []):
                if get_item(content, "type") == "output_text":
                    return get_item(content, "text")


query_gpt4 = functools.partial(query_openai, "gpt-4.1-2025-04-14", True)
query_o3mini = functools.partial(query_openai, "o3-mini-2025-01-31", False)

#### Perplexity AI Client

In [10]:
def query_perplexity(
        experiment_name: str,
        system_prompt: str,
        user_prompt: str,
        temperature: float=TEMPERATURE,
        max_out_tokens: int=MAX_OUT_TOKENS,
        reasoning_tokens: int=MAX_REASONING_TOKENS,
):
    # https://docs.perplexity.ai/guides/getting-started
    # https://docs.perplexity.ai/api-reference/chat-completions

    model_name = "sonar-reasoning-pro"
    cache_filename = build_cache_filename(experiment_name, model_name, temperature)
    request_headers = {
        "accept": "application/json",
        "content-type": "application/json",
        "Authorization": "Bearer " + api_keys["perplexity"],
    }
    request_body = {
        "model": model_name,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
        "max_tokens": max_out_tokens,
        "temperature": temperature,
        "return_related_questions": False,
        "stream": False,
        "web_search_options": {
            "search_context_size": "low",
        },
    }
    result = send_request(
        cache_filename,
        "https://api.perplexity.ai/chat/completions",
        request_headers,
        request_body,
        sensitive_headers=["Authorization", "Set-Cookie", "CF-RAY", ],
        sensitive_body_fields=["id", ],
    )

    for choice in get_item(result, "response.body.choices"):
        if get_item(choice, "message.role") == "assistant":
            return get_item(choice, "message.content")

    return None