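> 荃者所以在魚，得魚而忘荃；蹄者所以在兔，得兔而忘蹄；言者所以在意，得意而忘言。吾安得忘言之人而與之言哉 (Zhuangzi)
>
> "The fish trap exists for the sake of the fish; once the fish is caught, the trap is forgotten. The snare exists for the sake of the rabbit; once the rabbit is caught, the snare is forgotten. Words exist for the sake of meaning; once the meaning is grasped, the words are forgotten. Where can I find someone who has forgotten words, so that I may have a word with them?" (Zhuangzi)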

# Introduction

Language models (LMs), in particular those classified as generative artificial intelligence (gen AI), are finding increasing uses in finance and economics. These models are usually tested for their ability to reason, and seem to do well: for example, OpenAI's GPT-4 boasts more than 80% correct results in academic and professional micro- and macroeconomics tests (@achiam2023gpt). Still, even such advanced models can fail miserably. @perez2024testing demonstrate how the same model can correctly solve a logical puzzle requiring reasoning about higher order knowledge, only to fail when irrelevant details are changed. Building on results such as this and other examples that clearly illustrate the limits of rationality assumptions on LMs, this work discusses how to systematically measure *economic* reasoning, combining the literatures on economic thought and on gen AI benchmarking in computer science. In practical terms, the task at hand is to come up with testing mechanisms that estimate the level of economic reasoning of an LM by means of a prompt consisting of $n \geq 0$ examples and a question with multiple answers.

In its most essential form, testing for economic reasoning amounts to probing whether the model is able to think in terms of logical operators. However, such tests can be subjective because (a) economic thought is always changing and (b) they are only as good as their ability to explain limited sets of reality (those that modern academics constantly see, rather than any other reality).

Similar to many other social disciplines, economics requires the analytical judgment referred to by @robbins1932essay in the analysis of events as a basis to extrapolate and predict, and this has a bearing on how economic reasoning should be benchmarked. Economic inference depends primarily on articulating unobservable quantities, which are theorised and estimated on the basis of observable measures. This is unlike other major disciplines. For example, in human and veterinary medicine, all physiological and pathological variables of clinical importance are observable in principle, even if observing them is not yet technologically feasible today. In the medical sciences, theoretical models merely fill in the gaps in the absence of a technologically feasible complete measurement. In contrast, many economically relevant quantities are latent variables that cannot by definition be observed, and always require a model, implicit or not, applied to data in order to be estimated.

A quantitative test for economic reasoning must take this into account: selecting a correct answer in an economics question through reasoning will always depend on an unobserved transformation of the information received and the existing knowledge. This is important. LMs may also happen to choose the correct answer either by luck or through simple token probability. It is easy to see why a correct answer selected by chance is not informative about the reasoning abilities of a model. The second case requires more explanation: mathematically, LMs are trained to identify the most likely token $\theta$ in a vocabulary $V$ given the tokens in its prompt. In practice, the function is inscrutable, so it is also considered an unobservable transformation. But a few characteristics allow us to distinguish reasoning from prediction. First, reasoning is robust to minutiae and other irrelevant detail. Mathematically, it would be analogous to applying a manifold transformation that retains only the relevant information in a prompt and then applies logic operations on top of that information, and on it only. Second, reasoning is locally complete, meaning that an LM that can correctly deduce that A implies B is also able to understand that A' does not imply B, or that A does not imply B'. In other words, a conclusion that appears to be correct, but whose obvious corollary the LM fails to reach, cannot be said to have been reasoned in the first place.
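The two properties above, robustness to irrelevant detail and local completeness, can be sketched as simple checks on any function that maps a prompt to an answer. This is only an illustrative sketch: `toy_model`, the prompts and the function names are hypothetical stand-ins rather than part of the benchmark, and a real LM call would replace the toy function.

```python
from typing import Callable, Sequence, Tuple


def is_robust(model: Callable[[str], str], prompt: str,
              irrelevant_variants: Sequence[str]) -> bool:
    """True if the answer is unchanged across rewordings that alter only irrelevant detail."""
    baseline = model(prompt)
    return all(model(variant) == baseline for variant in irrelevant_variants)


def is_locally_complete(model: Callable[[str], str], prompt: str, expected: str,
                        counterfactuals: Sequence[Tuple[str, str]]) -> bool:
    """True if the model answers the original prompt and each relevant counterfactual correctly."""
    if model(prompt) != expected:
        return False
    return all(model(cf_prompt) == cf_answer for cf_prompt, cf_answer in counterfactuals)


def toy_model(prompt: str) -> str:
    # Hypothetical stand-in for an LM: it answers by matching a surface phrase,
    # which is exactly the kind of behaviour the checks are meant to expose.
    return "falls" if "price rises" in prompt else "unchanged"


base = "The price rises. What happens to quantity demanded?"
irrelevant_detail = "On a Tuesday, the price rises. What happens to quantity demanded?"
reworded = "The price goes up. What happens to quantity demanded?"
counterfactual = ("The price stays the same. What happens to quantity demanded?", "unchanged")

print(is_robust(toy_model, base, [irrelevant_detail]))            # robust to this detail
print(is_robust(toy_model, base, [irrelevant_detail, reworded]))  # brittle to rewording
print(is_locally_complete(toy_model, base, "falls", [counterfactual]))
```

The toy model passes the first check but fails once a semantically equivalent rewording is included, which is the pattern of brittleness the benchmark is designed to detect.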

Knowledge: linguistic, common and commonsense.

Interpretation. information theory. Shannon.

The main intuition of this work is to combine a number of building blocks of evaluation. 

- the benchmark must be challenging for machines: I use an adjusted version of adversarial filtering (@zellers2019hellaswag) to create answer candidates that are hard for LMs to guess

- the test must incorporate slow-moving evolutions in academic economic thought: evolving test set based on newly published academic work. 

- results related to reasoning must be distinguished as best as possible from the ability to interpret the prompt or from knowledge (implicit or explicit) about economics, ie reasoning is a separate step: sets of perturbations in the spirit of @alzahrani2024benchmarks for each initial task.

The benchmark includes a mathematical adjustment that takes into account performance across perturbations, penalising results that vary with ....

This benchmark evaluation addresses a pressing issue for the economics profession: the lack of publicly available data about how these benchmarks are created and tested.

A major inspiration in the design of the questions and how they can generate identifying variations is the social economics literature. A key reference is @stantcheva2023run. The idea here is that the design of the questionnaire itself can elicit responses that allow for insight into non-observable traits such as reasoning. Many of the insights of this literature carry over naturally to the machine space.[^untested]

[^untested]: Actually testing whether LMs *do not* parrot or "organically" exhibit biases or other behaviours that are assumed to be exclusively human would be an interesting line of research.

## Literature

Four streams of literature.

Benchmarking... A substantial body of work creates and discusses benchmarking models in general. A very useful reference is @storks2020recent. Literature on benchmarking economic reasoning appears to be new, although other works have touched upon the topic from different angles.

Social economics...

Reasoning itself...

A nascent literature deals with the evaluation of language models in economic settings. An early foray into questions related to AI's ability to conduct economic reasoning is due to @parkes2015economic. But their angle is more on how AIs can be used to estimate synthetic economic agents - machina oeconomicus - ideal versions of purely rational agents, rather than on the measurement and the implications of AIs acquiring economic reasoning abilities. In any case, @parkes2015economic see economic reasoning as the ability to understand and solve complex game-theoretical environments (eg, the poker example). @mei2024turing conduct an extensive comparison of personality traits inferred from the behaviour of ChatGPT with human behaviour in games that require cooperation, finding that its performance is consistent with that of humans and that, when it deviates, the AI models tend to behave in a more altruistic and cooperative way than the bulk of the human distribution. Interestingly, ChatGPT responds differently to different formulations of the same situation. In contrast to @mei2024turing, this paper and its empirical counterpart are more general, and discuss reasoning as a whole. Another contrast to that paper is that the current benchmark is focused on reasoning ability only, not personality. @perez2024testing illustrate the brittleness of a leading AI's reasoning, which has markedly lower performance when trivial details in the prompts are different. Similarly, @korinek2023gen reports (in his Chat 23) that results from a technical prompt in economics are reasonable but also brittle, with answers changing when the prompt wording changes or even simply if the tasks are re-ordered.

# Lessons from human surveys

I use a considerable amount of specific advice on human surveys from @stantcheva2023run to generate identifying variation in the questions. Specifically:

- ceteris paribus questions
- pre-testing
- including options for blank or indifferent responses, or even for the AI to recognise that it does not know
- avoiding jargon
- questions that check for "attention" and "effort" on the part of the respondent
- also including open ended questions (as in @ferrario2022eliciting)
  - including follow-up questions ("are there any other reasons")
  - going beyond @ferrario2022eliciting, in this paper I use open ended questions that are similar in nature to closed end questions and deploy large language models to interpret them.
- question ordering
  - in particular, consideration is given to whether each question should be presented to a separate instance of the LM, or the full questionnaire could be shared in the same "chat".
- take due consideration of how to address the different types of bias associated with surveys (adapted for the machine context, naturally)

# Desirable characteristics of a benchmark

## Evolve over time

Economic reasoning evolves over time. For example, the Lucas critique (@lucas1976econometric) was influential in shifting macroeconomic modelling, while the credibility revolution described in @angrist2008mostly was similarly influential in microeconomic work. @debreu1984economic describes the evolution of economic theory up until that point.

A historical perspective on the thought about causality going back to the early 18th century is found in @heckman2015causal.

# A model of economic reasoning

The result from existing benchmarks is largely, if not completely, directly related to the number of questions correctly answered. However, this measures only the model's ability to answer correctly, *not necessarily* its reasoning capabilities. The latter are part of a latent state space sitting between the input prompt and the answer. More concretely, for an input prompt $X$, which includes a question and any necessary explicit information, the language model is a function $\mathbf{M}$ that maps it to a given response: $\mathbf{M} : X \to y$. In order to show that it is done by reasoning, we need tests (and more specifically, measurements) that convey some information about the inner workings of this function. 

## Reasoning as an abstract of the input

- Input prompt $X$

- Transformed into $g(X, \kappa)$, a state space function that also takes the existing knowledge $\kappa$ and associates it with the prompt to map it to its abstract fundamentals (similar to manifold learning)

- Result based on $g(X, \kappa)$.

## A (very) simple model

This section builds on the intuition that in true reasoning, the result should be robust to minute perturbations, ie the model is a constant function over the domain of the input. Formally, both $\mathbf{M}(X) = y$ and $\mathbf{M}(X + \epsilon) = y$ for an infinitesimal $\epsilon$. This implies the derivative with respect to the input prompt is zero. Using as an approachable example the simplest possible neural network, the logistic regression $\mathbf{N}(x) = \sigma(Wx + b)$, such robustness further implies that $\frac{d\mathbf{N}}{d x} = \sigma(Wx + b)(1-\sigma(Wx + b))W = 0$. Because $W$ cannot be a zero vector in a functioning network that is responsive to its inputs and $\sigma(Wx + b)(1-\sigma(Wx + b)) = 0$ has no solution because neither term is 0 or 1 in a sigmoid function with finite inputs, the neural network cannot be a constant function. This extremely simplified example, which holds for recursive architectures of similarly simple layers, does not bode well for the robustness of results given small perturbations in the input prompt.
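The argument above can be checked numerically. The following is a minimal sketch, with arbitrary illustrative values for `w`, `b` and `x` (assumptions, not estimates from any model): it evaluates the analytical derivative $\sigma(Wx + b)(1-\sigma(Wx + b))W$ and compares it with a central finite difference, confirming that it is non-zero whenever $W$ is non-zero.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def logistic(x, w, b):
    """N(x) = sigma(Wx + b) for a single output unit."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def gradient(x, w, b):
    """Analytical derivative dN/dx = sigma(.) * (1 - sigma(.)) * W."""
    s = logistic(x, w, b)
    return [s * (1.0 - s) * wi for wi in w]


def finite_difference(x, w, b, i, h=1e-6):
    """Central difference of N with respect to the i-th input."""
    up = list(x)
    down = list(x)
    up[i] += h
    down[i] -= h
    return (logistic(up, w, b) - logistic(down, w, b)) / (2.0 * h)


# Illustrative values only.
w, b, x = [0.5, -1.2], 0.1, [1.0, 2.0]
grad = gradient(x, w, b)
print(grad)  # both entries are non-zero because w is non-zero
print([finite_difference(x, w, b, i) for i in range(len(x))])
```

Since $\sigma(\cdot)(1-\sigma(\cdot))$ is strictly positive for finite inputs, each entry of the gradient has the sign of the corresponding weight and can only vanish if that weight does.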



# Reasoning benchmarks in other fields

- Math

- Medical

- Biology

- Economics

In economics, reasoning will always depend on unobservable thought experiments. Even if all existing data were observable, economic research would still revolve around ideas, counterfactuals and thought experiments. A key idea is Haavelmo's distinction between correlation (which is observed from data) and causation (which is not, and requires a thought experiment) (@heckman2015causal). Similarly, the distinction between statistical conditioning (which, again, can be observed or estimated from data) and the "fix" operator (which also relies on a model) is fundamental.
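The difference between conditioning and fixing can be illustrated with a small simulation. This is a sketch under stated assumptions: the data-generating process (a binary confounder `u`, a true effect of 2 and a confounder effect of 3) and all function names are hypothetical and chosen only for illustration. Conditioning compares units that happen to have $x = 1$ with those that happen to have $x = 0$; fixing sets $x$ externally inside the model, holding everything else constant.

```python
import random


def simulate(n=20000, seed=0):
    """Draw (u, x, noise) where the confounder u makes x = 1 more likely."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        u = 1 if rng.random() < 0.5 else 0
        x = 1 if rng.random() < (0.8 if u else 0.2) else 0
        rows.append((u, x, rng.gauss(0.0, 1.0)))
    return rows


def outcome(x, u, noise):
    # Hypothetical structural equation: the causal effect of x is 2.
    return 2.0 * x + 3.0 * u + noise


def mean(values):
    return sum(values) / len(values)


def conditioning_difference(rows):
    """E[y | x = 1] - E[y | x = 0], computed from the 'observed' data."""
    y1 = [outcome(x, u, e) for u, x, e in rows if x == 1]
    y0 = [outcome(x, u, e) for u, x, e in rows if x == 0]
    return mean(y1) - mean(y0)


def fixing_difference(rows):
    """Average of y with x fixed at 1 minus y with x fixed at 0, unit by unit."""
    return mean([outcome(1, u, e) - outcome(0, u, e) for u, _, e in rows])


rows = simulate()
print(conditioning_difference(rows))  # contaminated by the confounder
print(fixing_difference(rows))        # recovers the structural effect
```

The fixing calculation is only possible because the structural equation is known, which is the sense in which it relies on a model rather than on the data alone.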

# A model of reasoning

This section develops a model of reasoning that fits naturally into both natural and artificial LMs. It will serve as the basis for the subsequent analyses and empirical creation of a reasoning benchmark.

Let a sentence $\mathbf{S} = (\theta_1, \theta_2, \theta_3, ...)$ be a sequence of token-location tuples $\theta_x = (\tau, x)$, with each $\tau \in \mathbf{V}$ belonging to a vocabulary $\mathbf{V}$ and $x \in \mathbb{N}^{d_{\text{model}}}$.[^location] Define a function $\pi_{i, C} : (\theta, \mathbf{S}) \to \{-1, 0, 1\}$ that maps each token into one of three possibilities: the token's information can be considered adversarial (-1), irrelevant (0) or relevant (1) with respect to the likelihood of individual (or LM) $i$ uttering another sentence $\mathbf{C}$. For example, take the following quote from the character Barf in the 1987 movie Spaceballs, organised as two sentences "I'm a mog. Half man, half dog." and "I'm my own best friend." With word-level tokenisation, $\mathbf{S} = (("\text{I'm}", 1), ("\text{a}", 2), ("\text{mog}", 3), ("\text{.}", 4), ("\text{Half}", 5), ("\text{man}", 6), ("\text{,}", 7), ("\text{half}", 8), ("\text{dog}", 9), ("\text{.}", 10))$ and $\mathbf{C}$ is similarly broken down. This example illustrates that even when there is no logical connection grounded in truth, tokens in one sentence, even made-up ones like "mog", can have a bearing on the likelihood of tokens appearing in another sentence. This likelihood can differ depending on the location of the token, which also allows for situations where the repetition of a word $\tau$ is meant to convey a different meaning. Another feature of this example is that $\pi_{\text{Barf}, C}(\theta) = 1 \ \forall \theta \in \mathbf{S}$. In the alternative sentence "I'm a mog. Half man, half dog. I am alive.", the new component is obviously irrelevant for $\mathbf{C}$: $\prod_{x \in [11, 14]} \pi_{\text{Barf}, C}(\theta_x) = 0$.

[^location]: The location is important because it helps define meaning, along with the actual letter (more generally, symbol) content of the token. Note that in this paper, white spaces are abstracted away for expositional simplicity.
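The Barf example can be written out as a short sketch. The tokeniser below is a simple word-level regular expression and the relevance labels in `pi` are assigned by hand; both are illustrative assumptions, since the text does not specify how $\pi_{i, C}$ would be obtained in practice.

```python
import re
from math import prod


def tokenise(sentence: str):
    """Word-level tokenisation into (token, location) tuples; white space is dropped."""
    tokens = re.findall(r"[\w']+|[^\w\s]", sentence)
    return [(token, location) for location, token in enumerate(tokens, start=1)]


S = tokenise("I'm a mog. Half man, half dog.")
S_extended = tokenise("I'm a mog. Half man, half dog. I am alive.")

# Hand-assigned labels for pi_{Barf, C}: 1 = relevant, 0 = irrelevant, -1 = adversarial.
# Tokens 1 to 10 are relevant for C; the added "I am alive." (tokens 11 to 14) is not.
pi = {location: (1 if location <= 10 else 0) for _, location in S_extended}

print(S)
print(all(pi[location] == 1 for _, location in S))
print(prod(pi[location] for _, location in S_extended if location >= 11))
```

Storing the location alongside the token keeps the two occurrences of "." (and of "half", in different capitalisation) distinct, in line with the role of location described in the footnote.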

This exposition is important to delve into the reasoning aspect, which is entirely organised by the function $\pi$. Since $\pi_{i, C}$ measures how informative a token is for individual $i$'s $\mathbf{C}$, it constitutes the first aspect of reasoning: to recognise when a token is adversarial, irrelevant or relevant. This step is necessary before the application of any logical rules $\ell \in \mathcal{L}$ on the weighted token, $\pi_{i, C}(\theta_x) \theta_x$. The exact underpinnings of these logical rules are beyond the scope of this work; they can be approximated by a possibly non-linear function, $g$. What suffices in this work is to say that reasoning *depends* on correctly classifying the tokens: all relevant tokens must be so identified, lest they be either ignored like the irrelevant ones or taken with the opposite meaning. Similarly, if all relevant tokens are indeed diagnosed correctly but other tokens are also diagnosed as relevant when they are not, then this will cause problems for correct reasoning. In other words, a first precondition for reasoning is to have a low categorical cross-entropy loss in this classification. Intuitively, a precondition of reasoning is to correctly interpret the inputs.
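To make the cross-entropy precondition concrete, the sketch below computes the categorical cross-entropy of a token classifier over the three classes of $\pi$. It assumes, purely for illustration, that the individual's classification is available as a probability distribution over $\{-1, 0, 1\}$ for each token; the labels and probabilities are made up.

```python
import math

CLASSES = (-1, 0, 1)  # adversarial, irrelevant, relevant


def categorical_cross_entropy(true_labels, predicted_probs):
    """Mean of -log p(true class) over tokens.

    true_labels: list of values in CLASSES.
    predicted_probs: list of dicts mapping each class to its predicted probability.
    """
    losses = [-math.log(probs[label]) for label, probs in zip(true_labels, predicted_probs)]
    return sum(losses) / len(losses)


# Illustrative labels for three tokens: relevant, relevant, irrelevant.
truth = [1, 1, 0]

# A reader that interprets the tokens well puts most mass on the true class.
sharp = [
    {-1: 0.05, 0: 0.05, 1: 0.90},
    {-1: 0.05, 0: 0.05, 1: 0.90},
    {-1: 0.05, 0: 0.90, 1: 0.05},
]

# A reader that misjudges relevance spreads or misplaces the mass.
confused = [
    {-1: 0.40, 0: 0.40, 1: 0.20},
    {-1: 0.10, 0: 0.70, 1: 0.20},
    {-1: 0.10, 0: 0.20, 1: 0.70},
]

print(categorical_cross_entropy(truth, sharp))
print(categorical_cross_entropy(truth, confused))
```

The lower loss of the first reader corresponds to the situation described above, in which relevant tokens are identified as such and irrelevant ones are not mistaken for relevant.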

Use Taylor expansion on model since its derivative to perturbation should be zero. This gives us a head start in the Taylor expansion. Try to link the T-expanded equation to an estimating equation.
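As a sketch of the note above, and under the assumptions that the prompt can be treated as a continuous vector and that $\mathbf{M}$ returns a twice-differentiable scalar score, the second-order expansion around $X$ is:

$$
\mathbf{M}(X + \epsilon) = \mathbf{M}(X) + \nabla \mathbf{M}(X)^{\top} \epsilon + \frac{1}{2} \epsilon^{\top} \nabla^2 \mathbf{M}(X) \epsilon + o(\lVert \epsilon \rVert^2)
$$

If reasoning implies $\nabla \mathbf{M}(X) = 0$, the first-order term drops out and the change in the response is of second order:

$$
\mathbf{M}(X + \epsilon) - \mathbf{M}(X) \approx \frac{1}{2} \epsilon^{\top} \nabla^2 \mathbf{M}(X) \epsilon
$$

A response that moves at first order in $\epsilon$ would therefore be inconsistent with this notion of robustness.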

But what determines $\pi_{i, C}$? A combination of knowledges and logical relationships.

Knowledges: linguistic knowledge, common knowledge and commonsense knowledge

Rationales: reasoning from logic

Armed with the sentence-level categorical cross-entropy, the individual can establish chains of thought that will finally lead to reasoning. Again, for simplicity, the exact function is not discussed here, other than that it is a potentially simple or complex way to interact. What is important is to add the categorical cross-entropy to the estimation equation.

**Benchmark testing mechanism**...

## The importance of manifold for reasoning

The first step, interpreting the received impulses (ie, the prompts), involves correctly judging what is relevant and what is not. This is similar, for example, to how the brain receives an incredible amount of sensory input but focuses only on what is most relevant instead of being overwhelmed by everything else. This observation has inspired dimensionality-reduction algorithms: for example, isometric mapping, or IsoMap, by @tenenbaum2000global describes how to find global optima while also recovering the (much lower) degrees of freedom underlying a high-dimensional input.

For example, @intrinsic2021 study the intrinsic dimensionality of the manifold underlying image datasets and find it to be significantly lower than that of the ambient pixel space. In practice, inputs can even be said to be a *union of manifolds* (as verified by @brown2022verifying with image datasets in an exercise similar to the one by @intrinsic2021), which means that each manifold has its own intrinsic dimensionality that is not forced upon the other manifolds. This perspective affords flexibility in the interpretation of identifying variations because they don't necessarily need to probe the same dimensions at each task.

In econometrics, @andrews2016geometric.

## Insights from human reasoning in economics

Social economics literature.

Behavioural economics literature: @gennaioli2010comes show that people focus on the features that are closer to the data (review this description). Also insights from Thinking, Fast and Slow (CITE, reviewed in @shleifer2012psychologists)

## Reasoning itself as a manifold

Since proper reasoning needs to be insensitive to unimportant details, and the vector of changes depends on logical relationships between components, the set of all "reasonable" constructions is not obtained at random but reflects this lower-dimensional, underlying structure, similar to how random pixels would only rarely form human faces.

@gorban2018blessing.

@gilboa2014analogies argues why economic reasoning works in the way of creating simple, positively wrong but conceptually useful representations of reality, even when economics is studying particular cases. A marked characteristic of such models is their preference for simplicity, a theme also explored by @GILBOA20101757, who study the matching of economic theories to empirical data, generalising the evaluation of how reasonable a theory is through a combination of their likelihood (or goodness-of-fit) with a penalising factor for their complexity. Intuitively, this simplicity in reasoning is suggestive of the manifold hypothesis in reasoning as well.

@rationality2023gilboa sees rationality, or reasoning, also as a robustness to trivial detail, and also discusses different types of reasoning (subjective reasoning, etc).

A related, though not identical, perspective is offered by the possibility of partially identifying models using random sets, ie moving away from point identification to situations where the data is incomplete or is described as an interval (@BERESTEANU201217). In other words, "available data combined with credible maintained assumptions may yield much information about a parameter of interest, even if they do not reveal it exactly." (@MOLINARI2020355). A key insight is how *available* data can inform the estimation of models.

# Reasoning about economics

The model above allows us to estimate reasoning while also breaking down some of its components to better understand them. For example, we can decompose any errors in reasoning into issues of **interpretation**, **knowledge** and **logical thinking**. The empirical estimation follows.

@theory2022gilboa distinguish between three types of inquiry in economic theory: economics itself (analysis of economic phenomena), development of economic methods (the development of analytical tools needed to study economic phenomena) and the methodology of economics (the research/scientific endeavour in economics, including but not limited to theory).[^economics]

[^economics]: In fact, @theory2022gilboa even allude to the blurred lines between economics and the philosophy or sociology of economics. I don't go into these differences here.

Another insight into *economic* thinking is from the thought experiments first introduced by Marshall (1890) - the ceteris paribus idea - and then later Ragnar Frisch and Trygve Haavelmo, more recently elaborated in more detail and more generally by @heckman2015causal, including the important distinction between correlation and causation. @marschak1944random start their influential paper by acknowledging that economists can't conduct experiments (although that has been relaxed somewhat, it still remains the case at least in macroeconomics).

# Empirical estimation

Each *task* $\theta \in \Theta$ can be asked in various different ways, each one being called a *question* $q \in \theta$. Questions vary with respect to their adversarial aspect; it is this variation within each task that allows the empirical estimation of the effects associated with interpretation or with knowledge. Most of the variations are originally those tested in @alzahrani2024benchmarks. The variation in response between the questions within each task will underpin the evaluation of the actual reasoning capabilities. As alluded to before, the variations are organised into those that measure the stability of a response to adversarial changes in interpretation, and those that measure the stability across the knowledge dimension. In practice, each task has hundreds of different $q$. These groups are described in more detail next.

## Variations related to interpretation

There are several classes of variations that can help test an LM's interpretation.

### Choice variations

Here the choices remain the same for a task but vary in their order or labelling across questions:

- random choice order

- biased choice order

- uncommon answer choice symbols

- common but unordered answer choice symbols
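The four choice variations listed above can be sketched as small helper functions. The question, the answer choices and the particular symbol sets below are hypothetical examples, not the ones used in the benchmark or in @alzahrani2024benchmarks.

```python
import random


def random_choice_order(choices, rng):
    """Return the same choices in a random order."""
    shuffled = list(choices)
    rng.shuffle(shuffled)
    return shuffled


def biased_choice_order(choices, correct, position=0):
    """Return the choices with the correct answer always at a fixed position."""
    others = [choice for choice in choices if choice != correct]
    others.insert(position, correct)
    return others


def format_question(stem, choices, symbols):
    """Render a question, labelling each choice with the given symbols."""
    lines = [stem] + [f"{symbol}. {choice}" for symbol, choice in zip(symbols, choices)]
    return "\n".join(lines)


stem = "If the price of a normal good rises, ceteris paribus, quantity demanded:"
choices = ["falls", "rises", "is unchanged", "cannot be determined"]
rng = random.Random(0)  # seeded so that the variations are reproducible

COMMON = ["A", "B", "C", "D"]
UNCOMMON = ["$", "&", "#", "@"]          # uncommon answer choice symbols
COMMON_UNORDERED = ["C", "A", "D", "B"]  # common but unordered symbols

print(format_question(stem, random_choice_order(choices, rng), COMMON))
print(format_question(stem, biased_choice_order(choices, "falls", position=0), COMMON))
print(format_question(stem, choices, UNCOMMON))
print(format_question(stem, choices, COMMON_UNORDERED))
```

Seeding the random number generator keeps each generated question reproducible, which matters if responses are later compared across model versions.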


### Word variations

The main idea here is to introduce or change words that are irrelevant. This is along the lines of the test conducted by @perez2024testing.

Another variation is to repeat a random word, as if it were a typo.
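Both word variations can be sketched in a few lines. The function names and the example sentences are hypothetical; in particular, what counts as an irrelevant detail for a given task is a judgment that the sketch takes as given.

```python
import random


def add_irrelevant_detail(sentence: str, detail: str) -> str:
    """Append a sentence that should not affect the answer."""
    return f"{sentence} {detail}"


def repeat_random_word(sentence: str, rng: random.Random) -> str:
    """Duplicate one randomly chosen word, mimicking a typing slip."""
    words = sentence.split()
    i = rng.randrange(len(words))
    return " ".join(words[: i + 1] + [words[i]] + words[i + 1 :])


question = "The central bank raises the policy rate."
rng = random.Random(1)  # seeded for reproducibility

print(add_irrelevant_detail(question, "The announcement was made on a rainy day."))
print(repeat_random_word(question, rng))
```

Splitting on white space keeps punctuation attached to its word, so a repeated final word would carry its full stop with it; a token-level version could avoid this if needed.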

## Variations related to knowledge

Changing key words related to field knowledge with other field knowledge words, but in a way that would not make sense to an expert. This can be compared with just changing the same words into another generic word. Comparing responses between both should indicate the level of knowledge used by the model (should it? need to think more)

## Estimation formula

The main formula is akin to the linear probability model since $a_{q}$ is either zero or one:

$$
a_{q} = \beta_{\theta} \theta + \beta_{\text{Interpretation}} \eta_q + \beta_{\text{Knowledge}} \kappa_q + \epsilon_q
$$
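A minimal sketch of how this equation could be estimated is given below. It assumes, for illustration only, that $\eta_q$ and $\kappa_q$ are dummy variables indicating whether question $q$ carries an interpretation or a knowledge perturbation, and it treats $\beta_{\theta} \theta$ as a task fixed effect removed by demeaning within each task. The data are made up, and the plain least-squares solver is an illustrative choice rather than the estimator used in the paper.

```python
from collections import defaultdict


def demean_by_task(values, tasks):
    """Subtract the task-level mean from each observation (within transformation)."""
    sums, counts = defaultdict(float), defaultdict(int)
    for value, task in zip(values, tasks):
        sums[task] += value
        counts[task] += 1
    return [value - sums[task] / counts[task] for value, task in zip(values, tasks)]


def estimate_lpm(answers, tasks, eta, kappa):
    """OLS of a_q on eta_q and kappa_q with task fixed effects.

    Returns (beta_interpretation, beta_knowledge).
    """
    y = demean_by_task(answers, tasks)
    x1 = demean_by_task(eta, tasks)
    x2 = demean_by_task(kappa, tasks)
    s11 = sum(a * a for a in x1)
    s22 = sum(b * b for b in x2)
    s12 = sum(a * b for a, b in zip(x1, x2))
    s1y = sum(a * c for a, c in zip(x1, y))
    s2y = sum(b * c for b, c in zip(x2, y))
    det = s11 * s22 - s12 * s12
    if det == 0:
        raise ValueError("eta and kappa are collinear within tasks")
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det


# Made-up data: two tasks, four questions each, covering all perturbation combinations.
tasks = ["t1"] * 4 + ["t2"] * 4
eta = [0, 1, 0, 1] * 2     # interpretation perturbation present?
kappa = [0, 0, 1, 1] * 2   # knowledge perturbation present?
answers = [1, 0, 1, 0] * 2  # a_q: 1 if correct, 0 otherwise

beta_interpretation, beta_knowledge = estimate_lpm(answers, tasks, eta, kappa)
print(beta_interpretation, beta_knowledge)
```

In this made-up example the answers are wrong exactly when the interpretation perturbation is present, so the whole loss in accuracy loads on $\beta_{\text{Interpretation}}$ and none on $\beta_{\text{Knowledge}}$.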

Another idea to explore is whether these variations can actually instrument interpretation and knowledge. This would allow the formula to estimate the reasoning bit.

# Operational characteristics

- avoid becoming part of training data

Some drawbacks of using academic papers include:

- bias to report only positive findings (and to do so in a way that is generous towards said findings)
- Also, academic papers suffer from false negatives: many contributions that are now considered classics have been previously rejected (@mighty1994fallen).

# Conclusions

As economic agents and policymakers harness generative artificial intelligence (AI) to reap considerable efficiencies, and as its societal footprint thus becomes larger, a benchmark for economic reasoning is needed. I suggest ways to implement such a benchmark, and measure the current performance of a selected list of LMs.

Let me conclude with Ken Arrow's impossibility theorem (CITE), or rather the story of how he achieved this incredibly influential result. Arrow first attempted to improve upon two-century-old Condorcet's paradox, and studied ways in which individual preferences could be aggregated while satisfying some intuitive conditions. It was only through repeated failures to do so that he switched the focus to attempting to prove its impossibility. While Arrow can be safely used as a prime example of economic reasoning, the point this anecdote illustrates is that breakthroughs in economic knowledge require also inspiration (in this case from the appeal of addressing Condorcet's paradox) as well as persistence and ability to change one's focus. The current work focuses on developing robust benchmarks of models' reasoning abilities in economics; further work exploring their contributions to inspiration[^ideation] and to methodological assistance (as in the example to change focus) are also warranted for a more complete assessment of models' abilities to provide cognitive support to human economists.

[^ideation]: @korinek2023gen illustrates use of AI models to help economists have new ideas for work.

# Annex 1: discussion of biases in human surveys and how they could affect LM questionnaires

* Section A-4 in @stantcheva2023run

The goal of this annex is to list side-by-side the main human biases that affect survey responses and their corresponding machine version, if any (from a theoretical perspective - it would be interesting to test if LMs carry over some of these biases that are supposed to be only human, which could suggest they are parroting or in extremis developing sources of bias like shame, etc).

# References