> "Machines will be capable, within twenty years, of doing any work a man can do" - AI pioneer Herbert Simon, in 1965

# Introduction

Large language models (LLMs), in particular those classified as generative artificial intelligence (AI), increasingly support use cases in finance and economics (@korinek2023gen), including in central banks (@araujo2024artificial). A key value proposition is their ability to automate some high-skill, nuanced tasks that require some "understanding" of complex texts. These models are tested for their ability to reason, often with seemingly impressive results: for example, OpenAI's GPT-4 boasts human-like performance across tests of reasoning and more than 80% correct results in academic and professional micro- and macroeconomics tests (@achiam2023gpt). Still, even such advanced models can fail miserably in simple reasoning tasks: for example, the same model can correctly solve a logical puzzle requiring reasoning about higher-order knowledge but fails when irrelevant details are changed (@perez2024testing). Building on results such as this, this work discusses how to systematically measure *economic* reasoning, combining literatures from economics and computer science. In practical terms, the task at hand is to come up with a benchmark task that can help measure economic reasoning in AI models in a comparable way. This task is of first-order importance given the break-neck speed of evolution of LLMs (@yang2023harnessing) and their potential risks (@danielsson2022artificial).

Such benchmark tasks are crucial for the comparison of the abilities of AI models over time and in the cross-section. Results of benchmark tasks are now a staple of the evaluation of LLMs by developers when developing and releasing a model, as the results highlight the model's evolution compared to previous versions and  its performance when stacked against other models. Third-party organisations also compile leaderboards with running results that allow the general public to keep track of the most performant models.[^leaderboards] In short, benchmark tasks are useful because they provide a comparable metric on which to track the state-of-the-art for the particular abilities that each task measures. Usually, this metric is the percentage of correct answers. However, even if specific tasks have evolved over time to become more challenging for AI models, in many cases it is not straightforward to separate correct answers due to probabilistic association or "stochastic parroting" (@bender2021dangers) from those that are the result of a "reasoning process".

[^leaderboards]: Commonly followed leaderboards include LMSYS's ChatbotArena and Huggingface's Open LLM Leaderboard.

This paper proposes a working model of reasoning that is simple and intuitive, yet results in a challenging benchmark test for AI systems. This reasoning model underlies the empirical benchmark task introduced in this paper. When a user confronts an AI system with a prompt $Q$ requiring a given response $\alpha \in \mathcal{A}$,[^prompt] where $\mathcal{A}_C \subseteq \mathcal{A}$ is the set of correct answer(s) and $\mathcal{A}_I = \mathcal{A} \setminus \mathcal{A}_C$ is the set of incorrect choices, the accuracy of an AI's response (ie, $\mathbb{1}(\hat{\alpha} \in \mathcal{A}_C)$) reflects only one of three possible answering processes: chance, probabilistic assignment or a correct *reasoning* process. The first one is simply $P_{\text{chance}}(\hat{\alpha} \in \mathcal{A}_C) = \#\mathcal{A}_C / \#\mathcal{A}$. The second process depends on a measure of some distance $d(\cdot)$ between $(Q, \mathcal{A})$ and the AI's training data $D$: $P_{\text{prob}}(\hat{\alpha} \in \mathcal{A}_C) = P(\hat{\alpha} \in \mathcal{A}_C|d((Q, \mathcal{A}), D))$.

[^prompt]: Usually $Q$ will be a string of text, but more recent models can take multimodal content that combines text, images, videos, sounds, etc.

The third, and most interesting, answering process is *reasoning*. It can only be said to occur if $\hat{\alpha}$ results from the successful application of three sequential steps. First comes *information filtering*, where the AI filters the relevant information in $Q$, distinguishing it from trivia that is not important for finding at least one $\hat{\alpha} \in \mathcal{A}_C$: $\phi = f(Q)$. The second step, *knowledge association*, involves the retrieval of implicit or explicit knowledge $\theta$ to augment the filtered prompt information, if necessary: $\kappa = k(\phi, \theta)$. The third step is *logic attribution*: applying logic operations (deductive, inductive, and other types of logic) on the filtered prompt information augmented with $\theta$ to uncover the correct answer: $\hat{\alpha} = l(Q, \mathcal{A}, \kappa)$. Only AIs that demonstrate the ability to complete these three steps can reason. Intuitively, answering a question correctly based on reasoning requires each of these steps in sequence: if one of them fails (even if it is the last step), a correct answer will necessarily be a result of chance or probabilistic association.

The discussion above relates to reasoning in general, but it can also apply to arbitrary subsets of the set of all $f, k, l$. To measure *economic reasoning*, field-specific criteria can be applied to each of these three steps. For example, the same prompt $Q$ might contain some information that is relevant to the reasoning process of a veterinarian and some other information that is relevant to an economist, even if they both ultimately reach the correct answer. Or the implicit and explicit knowledge that augments the prompt, enabling the correct logic attribution, could be different. And finally, the logic steps can involve different levels of counterfactual considerations and relevant thought experiments. This implies that there are several ways to filter information, of which the "economics" way is one ($\exists f_{\text{economics}} \in F$), and similarly with knowledge association ($\exists k_{\text{economics}} \in K$) and logic attribution ($\exists l_{\text{economics}} \in L$).[^fields] Therefore, the model described in this paper can be generalised to other fields of knowledge, or made more specific to measure reasoning in sub-fields such as monetary policy. The paper drops the field-specific notation to avoid clutter, but it is always implicit in the functions named above.

[^fields]: Or rather, in practice there are several ways professionals in the same field would reason but they would arguably tend to be more similar to each other than to professionals of other sciences.

From the informal definitions of the answering processes described above, measuring reasoning in practice requires assessing $\mathbb{1}(\hat{\alpha} \in \mathcal{A}_C)$ multiple times for the same AI. While this is not possible with human subjects due to factors like learning or priming, AI systems can be easily started fresh or respawned and so it is trivial to test whether accuracy is due to chance alone. But it is still difficult to separate probabilistic answering from reasoning without a "structural" model of reasoning, especially since the training data $D$ is not observed in advanced AI systems due to commercial protection or simply because it is very large and difficult to analyse. The main contribution of this paper is precisely in offering a model of reasoning that can be sufficiently rich to reflect theoretical and empirical knowledge about reasoning and AI systems, while staying simple enough that it can be engineered into a benchmark test to allow diagnosing current and future AI systems.
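The chance check mentioned above can be sketched in a few lines. This is a minimal illustration, not the paper's procedure: it assumes repeated fresh runs are independent draws, and the function names `chance_probability` and `p_value_vs_chance` are hypothetical. It computes $P_{\text{chance}} = \#\mathcal{A}_C / \#\mathcal{A}$ and the exact one-sided binomial probability of observing at least a given number of correct answers under pure guessing.

```python
from math import comb


def chance_probability(n_correct_options: int, n_options: int) -> float:
    """P_chance = #A_C / #A for a uniformly random pick among the options."""
    if not 0 < n_correct_options <= n_options:
        raise ValueError("need 0 < #A_C <= #A")
    return n_correct_options / n_options


def p_value_vs_chance(n_hits: int, n_trials: int, p_chance: float) -> float:
    """Exact one-sided binomial tail: P(X >= n_hits) if every run were a guess."""
    return sum(
        comb(n_trials, k) * p_chance**k * (1 - p_chance) ** (n_trials - k)
        for k in range(n_hits, n_trials + 1)
    )


# Illustrative numbers only: one correct option out of four, 20 fresh runs,
# 18 of which returned a correct answer.
p = chance_probability(1, 4)
print(p_value_vs_chance(n_hits=18, n_trials=20, p_chance=p))
```

A very small tail probability rules out chance, but it says nothing about whether the remaining accuracy comes from probabilistic association or from reasoning, which is the harder separation the rest of the paper addresses.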

Functions $f$, $k$ and $l$ are interesting by themselves, but the goal of this paper is to identify when they are successfully put to use by AI systems, rather than probing their inner workings. This is because the measurement of reasoning capabilities in AI systems requires only the demonstration that these functions are part of the answering process. For notational simplicity, call $\phi^*, \kappa^*, \lambda^* \in \{0, 1\}$ the indicators that each of these functions was used to find $\hat{\alpha}$. These indicators allow the estimation of whether or not reasoning was the basis for a correct answer to $Q$: $\rho^*(\hat{\alpha} \in \mathcal{A}_C) = \phi^* \kappa^* \lambda^*$ ($\rho^*$ is not determined in cases where $\hat{\alpha} \in \mathcal{A}_I$). The empirical version of the reasoning prompt-answering model for a given AI system is then:

$$
\hat{\rho}^* = \hat{\phi}^* \hat{\kappa}^* \hat{\lambda}^*,
$$ {#eq-empiric}

where the identification of the right-hand side factors depends on variations of the prompt $Q$. The key idea is to leverage insights from the social economics and expectation formation literatures and create identifying variation in $Q$, adapted from how this is done with human subjects to study their own reasoning functions about economic problems (see @stantcheva2023run, @FUSTER2023107 and the works cited in these papers). Specifically, with $\epsilon_i^j$ one of the $M_j$ variations introduced in the seed question $Q$ to identify the use of an information filtering function ($j=f$), knowledge association ($j=k$) or logic attribution ($j=l$):

$$
\begin{aligned}
\hat{\phi}^* &= \prod_{i=1}^{M_f} \mathbb{1}(f(Q + \epsilon_i^f) = f(Q)) \\
\hat{\kappa}^*(\hat{\phi}^*=1) &= \prod_{i=1}^{M_k} \mathbb{1}(k(Q + \epsilon_i^k) = k(Q)) \\
\hat{\lambda}^*(\hat{\phi}^* =1,\hat{\kappa}^*=1) &= \prod_{i=1}^{M_l} \mathbb{1}(l(Q + \epsilon_i^l) = l(Q)).
\end{aligned}
$$ {#eq-empiricsyst}

For example, the tests described in @perez2024testing comprise variations of unimportant details of the puzzle while leaving the key details unchanged; I interpret them as $\epsilon^f$ variations to the same prompt because this exercise probes the ability of ChatGPT to recognise the truly important information and abstract from everything else. $\hat{\kappa}^*$ and $\hat{\lambda}^*$ are only estimated upon the successful completion of the previous steps. In other words, estimation is the result of evaluating answers by the same AI system to multiple versions of seed question $Q \in \mathcal{Q}$, where each version is an $\epsilon^j$-neighbourhood of $Q$ that should prompt the same result if an AI system is using the same $j \in \{f, k, l\}$ function.[^noresid] Ultimately, $\hat{\alpha}(Q)$ is only considered to be the result of a reasoning process when all of the relevant variations for the same question are answered correctly - in other words, the AI model has evaded "banana skins" that try to trick it into revealing lack of information filtering, spurious knowledge association or faulty logic.

[^noresid]: Note that the estimation equations do not have residual terms because the equalities are directly observed.
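The sequential estimation described above can be sketched as follows. This is only an illustration under stated assumptions: the AI system is represented by a hypothetical callable `answer` that maps a prompt to a response, each variation carries its own set of correct answers, and "the function was used" is proxied by the answer remaining correct across all variations of that step. The names `estimate_rho` and `toy_model` are not from the paper.

```python
from typing import Callable, Dict, FrozenSet, List, Optional, Tuple

# A prompt together with its set of correct answers (hypothetical container).
Item = Tuple[str, FrozenSet[str]]


def estimate_rho(
    answer: Callable[[str], str],
    seed: Item,
    variations: Dict[str, List[Item]],
) -> Dict[str, Optional[int]]:
    """Estimate phi, kappa, lambda in sequence; later steps stay None if unidentified."""
    out: Dict[str, Optional[int]] = {"phi": None, "kappa": None, "lambda": None, "rho": None}
    seed_prompt, seed_correct = seed
    if answer(seed_prompt) not in seed_correct:
        return out  # rho is not determined when the seed answer is incorrect
    for step, name in (("f", "phi"), ("k", "kappa"), ("l", "lambda")):
        # Product of indicators: every variation of this step must be answered correctly.
        passed = all(answer(p) in correct for p, correct in variations.get(step, []))
        out[name] = int(passed)
        if not passed:
            out["rho"] = 0  # a failed step means the product is zero
            return out
    out["rho"] = 1
    return out


def toy_model(prompt: str) -> str:
    # Stand-in for an AI system: it slips whenever the prompt mentions a banana.
    return "A" if "banana" in prompt else "B"


B = frozenset({"B"})
result = estimate_rho(
    toy_model,
    seed=("seed question", B),
    variations={
        "f": [("seed question, options reordered", B)],
        "k": [("seed question, mentioning a banana", B)],
        "l": [("seed question, are you sure?", B)],
    },
)
print(result)
```

In this toy run the model passes the filtering variation, fails the knowledge variation, and the logic step is never estimated, mirroring the conditioning in the equations.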

The empirical challenge is then in creating the $Q + \epsilon^j$ variations for each step $j$. There are several classes of variations that can help test an AI system's $f$. The key principle to generate $\epsilon^f$ for these identifying variations is to keep core information intact while varying the order of content or of answer options; changing, including or removing trivial content; or including random symbols in lieu of non-important actual content such as the codes identifying each response (ie, using country flags instead of alphabetic letters to identify options). Because the core information is always constant in all these $M_f$ variations of $Q$, an AI system that filters information adequately would continue to concentrate on the main information, which is intact.

As for testing $k$ (conditional on $\hat{\phi}^*=1$), the variations $Q + \epsilon_i^k$ include different degrees of explanation of important concepts for the task at hand, up until a level where no or almost no explicit knowledge is included and the model must rely on commonsense knowledge. Importantly, alternatives also vary in how much they use common vs jargon words, leveraging the fact that common words might have more than one meaning and therefore more easily trigger an AI system into associating the wrong knowledge space with them. Another source of variation to probe $k$ is to include words related to alternative views about the question, which serves to probe how the AI system uses knowledge it has access to from the prompt. Finally, another way to probe knowledge is to replace clear important keywords with pronouns, since recognising referents requires a considerable amount of commonsense knowledge.

As for logic attribution, the variations include those related to the "flip-flop" experiment: simply asking AI systems to confirm their answers often makes them switch answers, even if their original response was correct (@laban2023you, @xie2023ask). Relatedly, these tests can follow up a response with a reply that either asks for confirmation ("are you sure?") or that doubts the AI system's answer in some way, even if it was originally correct. Another way to create variation that identifies logic attribution is to change the possible answers - for example, from having the correct answer be one in a list of other categories to a question asking how likely it is that the correct answer is in a given set. Another source of prompt variation to probe $l$ is analogical prompting (@yasunaga2023large), whereby the prompt includes instructions for the AI system to generate examples similar to the situation at hand. These variation types are discussed in more detail in a specific section.
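One class of $\epsilon^f$ variations, reordering the answer options and swapping their identifying codes, is mechanical enough to sketch. The snippet below is a hypothetical generator, not the paper's implementation: the function name `filtering_variations`, the prompt layout and the example question are all assumptions made for illustration.

```python
import random
from typing import Dict, List, Sequence


def filtering_variations(
    stem: str,
    options: Sequence[str],
    correct: str,
    label_sets: Sequence[Sequence[str]],
    seed: int = 0,
) -> List[Dict[str, str]]:
    """Build prompts that keep the core content but shuffle and relabel the options."""
    if correct not in options:
        raise ValueError("the correct option must be one of the options")
    rng = random.Random(seed)  # fixed seed keeps the generated benchmark reproducible
    variations = []
    for labels in label_sets:
        if len(labels) < len(options):
            raise ValueError("each label set needs one label per option")
        shuffled = list(options)
        rng.shuffle(shuffled)
        lines = [f"{label}) {text}" for label, text in zip(labels, shuffled)]
        variations.append(
            {
                "prompt": stem + "\n" + "\n".join(lines),
                # Track where the correct option landed after shuffling.
                "correct_label": labels[shuffled.index(correct)],
            }
        )
    return variations


# Illustrative seed question; arbitrary symbols stand in for the alphabetic codes.
variations = filtering_variations(
    stem="All else equal, what happens to the price of an existing fixed-rate bond when market interest rates rise?",
    options=["It falls", "It rises", "It is unchanged"],
    correct="It falls",
    label_sets=[["A", "B", "C"], ["1", "2", "3"], ["#", "@", "%"]],
)
for v in variations:
    print(v["prompt"], "->", v["correct_label"], "\n")
```

Variations that add or remove trivial content, or that target $k$ and $l$, need hand-written text and are not covered by this sketch.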

In addition to the collection of variations described above, three practical adaptations in $\mathcal{Q}$ make it a benchmark for economic reasoning. First, the scope of topics that are included in $\mathcal{Q}$ should ideally focus on issues of significance to economics. In its most essential form, testing for economic reasoning is the same as probing if the model is able to think in terms of logical operators on information that is of relevance to economics. However, this is subjective because economic thought constantly evolves. At the same time, including only "classical" economics carries a high risk of using material that is contained in the training set of AI systems. Second, even for each given topic, the types of questions considered relevant in economics are specific. Both of these issues are dealt with in practice by using recent published academic work as source material to construct seed questions. These sources contain content whose topic and research questions are by definition of interest to the economics field, and moreover have the advantage of being novel by design, creating a natural check for the ability of AI models to generalise reasoning. The third adaptation is to design logic attribution questions related to economic thought. For example, @heckman2023econometric posit that policy problems - ie, the subject of a considerable part, if not most or all, of economics - can be analysed at four different levels. The first one is the impact of a policy on the environment it was implemented in; the next level is to understand the mechanisms (or "channels") by which these effects occur. The third level of reasoning is to forecast the policy impact in environments that have not yet seen the policy, while the fourth and most sophisticated is to forecast the effects of policies that were never implemented before.

Most of the above discussion relates to the answering process of an AI system to a single prompt, $Q$. But in practice the useful way to measure reasoning abilities is to measure $\hat{\rho}^*$ extensively in a broad range of economic questions and in a comparable way. This is done by establishing a benchmark task $\mathcal{T}$, ie compiling all the prompt variations for a reasonable number of seed prompts $Q$. The benchmark result is then $R = M_{\mathcal{Q}}^{-1}\sum_{Q \in \mathcal{Q}} \hat{\rho}^*(Q)$, where $M_{\mathcal{Q}}$ is the number of different seed questions. By the same token, each of the three steps can be measured separately, building on their empirical identifications $R_{j \in (f, k, l)} = M_{j}^{-1}\sum_{Q \in \mathcal{Q}} \hat{j}^*(Q)$. Note that identifying $\kappa$ and $\lambda$ requires sequential conditioning on the previous steps ($\phi$, and $\phi, \kappa$, respectively).[^Brier]

[^Brier]: The benchmark result could also be compiled as a metric that is not subject to the false "emerging abilities" results (@schaeffer2024emergent), for example the Brier score (@brier1950verification).
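The aggregation into a benchmark result can be sketched as below. Two choices here are assumptions of this sketch rather than statements from the text: seed questions with an undetermined $\hat{\rho}^*$ (an incorrect seed answer) are counted as zero in $R$, and each step-level score is averaged only over the seed questions where that step was actually identified. The function name `benchmark_scores` and the dictionary layout are hypothetical.

```python
from typing import Dict, List, Optional

Indicators = Dict[str, Optional[int]]  # keys: "phi", "kappa", "lambda", "rho"


def benchmark_scores(results: List[Indicators]) -> Dict[str, Optional[float]]:
    """Average rho over all seed questions, and each step over its identified cases."""
    if not results:
        raise ValueError("need at least one seed question")
    # Assumption: an undetermined rho (None) contributes zero to the overall score.
    scores: Dict[str, Optional[float]] = {
        "R": sum((r["rho"] or 0) for r in results) / len(results)
    }
    for name in ("phi", "kappa", "lambda"):
        identified = [r[name] for r in results if r[name] is not None]
        # Assumption: the step score is conditional on the step being estimated at all.
        scores["R_" + name] = sum(identified) / len(identified) if identified else None
    return scores


# Illustrative indicator values for four seed questions.
results = [
    {"phi": 1, "kappa": 1, "lambda": 1, "rho": 1},
    {"phi": 1, "kappa": 0, "lambda": None, "rho": 0},
    {"phi": 0, "kappa": None, "lambda": None, "rho": 0},
    {"phi": None, "kappa": None, "lambda": None, "rho": None},
]
print(benchmark_scores(results))
```

Other denominators are defensible, for instance dividing every step score by the total number of seed questions; the choice changes how harshly early failures are penalised in the step-level scores.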

This benchmark task is, by design, more difficult than others due to its sequential conditioning. Is it really necessary to have such a hard-to-score task? While evidence mounts that large language models (LLMs) cannot perform advanced reasoning, at least when they are not fine-tuned and lack access to external reference material or sophisticated math functionalities, it is important to have a challenging reasoning benchmark because the latter two possibilities are increasingly being used. Both fine-tuning on specific data and plugging LLMs into sources of knowledge (as in retrieval-augmented generation, or RAG) or into plugins such as Wolfram Alpha or Mathematica might provide models with some reasoning ability. For this reason, it is important to have a robust way of measuring the reasoning abilities of AI models. Another argument is that AI models' out-of-the-box reasoning performance is partly due to their limitation of training on available written language only (@BrowningLeCun2022). Of course, this dataset is a very limited snapshot of human experience, even with massive data compilations. However, current and future models will be able to leverage a significant part of the other non-text data (eg, video, audio, pictures) and thus could reasonably attempt to achieve better reasoning. This provides another reason to maintain a high bar in reasoning benchmarks.

This benchmark evaluation also addresses a poignant issue for the economics profession: the lack of publicly available data about how these benchmarks are created and tested. For example, @achiam2023gpt are not clear about the academic and professional tests on micro- and macroeconomics that are used amongst various other tests to measure GPT-4's performance in those fields. Conversely, various other benchmark tasks do have a publicly available methodology and even evaluation interface, which greatly facilitates the engagement with model developers, general users and third-party model evaluators.

## Literature

This work builds on, and seeks to expand, two general literature streams. More technical aspects of this work are based on specific literatures that are discussed within each section.

The first body of work is on benchmarking tasks for AI models. A substantial body of work creates and discusses model benchmarking in general; a voluminous and well-organised compilation of references is @storks2020recent. Interested readers are referred to that paper for space considerations, while selected benchmarks are described in @sec-bench. This paper draws from insights in the more general AI reasoning literature. Like other parts of AI development, it is informed and inspired by neuroscience as well (@HASSABIS2017245). An early and influential contribution is the Chinese room experiment by @searle1980minds and its resulting arguments that AI could not reason by itself. @chollet2019measure offers a wide review of the literature and insightful perspectives on definitions of AI intelligence and how it should influence the design of benchmarks. The influential works of @bubeck2023sparks and @wei2022emergent hint at acquisition of advanced capabilities by large-scale language models such as GPT-4, although @bubeck2023sparks also point to many instances where reasoning breaks. @schaeffer2024emergent present evidence that these "emerging abilities" that come with scale are actually a spurious by-product of the choice of metrics to measure these abilities.

@mitchell2023debate summarises the disagreement in the AI academic and practitioner fields as to whether AI models have some form of understanding, and by implication, potentially also reasoning. @wei2022chain claim that writing the prompt in a way that offers chain-of-thought examples improves the reasoning abilities of LLMs, although this was later demonstrated not to be as generalisable (@dziri2024faith, @prystawski2024think). @BrowningLeCun2022 argue that AI models trained on written language alone will never be able to reason, and @moskvichev2023conceptarc show that humans outperform advanced AI systems in terms of abstract thinking. @shiffrin2023probing argues for better understanding of why AI systems succeed or fail in tests. @wu2023reasoning, @lewis2024using, @perez2024testing, @srivastava2024functional and other papers all use variations in prompts as some form of probing the degree to which LLMs are actually reasoning. And more recently, @luo2024large present results of forward-looking benchmarks in the field of neuroscience, suggesting that LLMs can form some reasoning ability after being fine-tuned on academic literature.

Another literature this paper draws from is the nascent corpus on the evaluation of AI systems such as LLMs in economic settings. An early foray into questions related to AI's ability to conduct economic reasoning is due to @parkes2015economic. But their angle is more on how AIs can be used to estimate synthetic economic agents - machina oeconomicus - ideal versions of purely rational agents, rather than on the measurement and the implications of AIs acquiring economic reasoning abilities. In any case, @parkes2015economic see economic reasoning as the ability to understand and solve complex game-theoretical environments (eg, the poker example). @mei2024turing do an extensive comparison of personality traits from the behaviour of ChatGPT with human behaviour in games that require cooperation, finding that its performance is consistent with humans, and that when it deviates the AI models tend to behave in a more altruistic and cooperative way than the mass of the human distribution. Interestingly, ChatGPT responds differently to different formulations of the same situation. In contrast to @mei2024turing, this paper and its empirical counterpart are more general, and discuss reasoning as a whole. Another contrast to that paper is that the current benchmark is focused on reasoning ability only, not personality. @perez2024testing illustrate the brittleness of a leading AI's reasoning on higher-order knowledge, which has markedly lower performance when trivial details in the prompts are different. Similarly, @korinek2023gen reports (in his Chat 23) that results from a technical prompt in economics are reasonable but also brittle, with answers changing when prompt wording changes or even simply if the tasks are re-ordered.

## The challenges with reasoning: a simple illustration

This section builds on the intuition that in true reasoning, the result should be robust to minute perturbations of the input. Formally, both $\mathbf{V}(X) = y$ and $\mathbf{V}(X + \epsilon) = y$ for an infinitesimal $\epsilon$, even as $\mathbf{V}(X') = y', y \neq y'$, ie the model should be a locally constant function. This implies the derivative with respect to the input prompt is zero. Using as an approachable example the simplest possible neural network, the logistic regression $\mathbf{N}(x) = \sigma(Wx + b)$, such robustness further implies that $\frac{d\mathbf{N}}{d x} = \sigma(Wx + b)(1-\sigma(Wx + b))W = 0$. However, $W$ cannot be a zero vector in a functioning network that is responsive to its inputs, and $\sigma(Wx + b)(1-\sigma(Wx + b)) = 0$ has no solution because the sigmoid function takes values strictly between 0 and 1 for finite inputs, so neither factor is zero. The derivative therefore never vanishes, and the neural network cannot be a locally constant function. This extremely simplified example, which holds for recursive architectures of similarly simple layers, does not bode well for the robustness of results given small perturbations in the input prompt.
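The argument can be checked numerically. The sketch below uses arbitrary illustrative weights (the values of `w`, `b` and `x` are assumptions, not taken from any model) and compares the analytic gradient $\sigma(Wx+b)(1-\sigma(Wx+b))W$ with a finite-difference estimate, showing that a small perturbation of the input does move the output.

```python
import math
from typing import List, Sequence


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def logistic(w: Sequence[float], x: Sequence[float], b: float) -> float:
    """N(x) = sigma(Wx + b) for a single output unit."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def logistic_gradient(w: Sequence[float], x: Sequence[float], b: float) -> List[float]:
    """dN/dx = sigma(z) * (1 - sigma(z)) * W, with z = Wx + b."""
    s = logistic(w, x, b)
    return [s * (1.0 - s) * wi for wi in w]


# Arbitrary illustrative parameters and input.
w, b, x = [0.8, -1.5], 0.1, [1.0, 2.0]
grad = logistic_gradient(w, x, b)

# Finite-difference check along the first input coordinate.
eps = 1e-3
x_perturbed = [x[0] + eps, x[1]]
finite_diff = (logistic(w, x_perturbed, b) - logistic(w, x, b)) / eps

print("analytic gradient:", grad)
print("finite difference (first coordinate):", finite_diff)
```

In floating-point arithmetic the factor $\sigma(z)(1-\sigma(z))$ can round to zero for very large $|z|$; that is a numerical artefact of saturation rather than a counterexample to the argument.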

# Existing benchmarks {#sec-bench}

There are numerous benchmark tasks, and their number is increasing with the sophistication of newer AI models (@storks2020recent). In fact, there is nowadays a whole "benchmark safari" (to which this paper is a contribution). Below I summarise important characteristics of the main ones in each type, to further illustrate why a new benchmark is needed and more specifically how one geared towards economics can be constructed.

**Reference resolution**: these benchmarks measure the ability to identify referents (ie, linguistic mentions, pronouns). Because of the ambiguities involved with resolving references, these benchmarks are very complicated to solve, requiring intense commonsense knowledge. Typically these benchmarks require handcrafted data, and therefore the datasets tend to be smaller compared to other benchmarks that are more automatable. Primary examples are Winograd and WinoGrande (@levesque2012winograd). @BROWNING2023104031 see the failure of these challenges to present a serious challenge to modern LLMs as a sign that linguistic tests are unlikely to be good proxies for reasoning abilities.

**Question answering**: these are benchmarks that combine language processing and reasoning skills. But most of these benchmarks can be successfully solved without deep understanding of the concepts at hand or without reasoning, ie only from linguistic context and knowledge. However, there are a number of benchmarks in this group that typically require external knowledge. The premier example is the ARC (@clark2018think), while there are many other examples: MCTest, RACE, NarrativeQA, MCScript, ProPara, MultiRC, ARCT, SQuAD 2.0, CoQA, QuAC, and others.

**Textual entailment**: another type of benchmark tests whether AI systems can correctly find the "entailment", or the second part of a complete logical statement, starting generally with a hypothesis. Some of these benchmarks also test the ability to recognise contradiction. As can be expected, such tasks require a substantial amount of commonsense knowledge. The highlighted example is SherLIiC (@schmitt2019sherliic), although, as in other cases, examples abound: RTE, conversational entailment, SICK, SNLI, MultiNLI, SCITail, and others.

**Intuitive psychology**: other benchmarks focus on the ability of AI systems to conduct inference based on emotions and intentions, conditional on behaviour as described in the prompt. This test also requires considerable commonsense knowledge. A key example in this group is the benchmark test "SocialIQA" (@sap2019socialiqa). Other examples include Triangle-COPA, ROCStories, Event2Mind. Interestingly, results of increasing performance in other tests suggest some level of commonsense knowledge is learnable through data.

**Plausible inference**: there are also benchmarks that concentrate on measuring logic, and more specifically abductive logic, ie reaching hypothetical, intermediate-certainty or even uncertain conclusions based on a limited context. These benchmarks require linguistic, common and commonsense knowledge. Highlighted examples are the SWAG (@zellers2018swagaf) and HellaSWAG (@zellers2019hellaswag) benchmarks, while other examples include COPA, CBT, ROCStories, LAMBADA, JOCI, CLOTH, ReCoRD, AlphaNLI.

**Multiple tasks**: some benchmarks include tasks requiring some economics-specific knowledge, but as part of a much larger suite of tasks. More recently, BIG Bench (@srivastava2022beyond), a large, crowd-sourced benchmark exercise, gathered a considerable number of questions to comprehensively test models; other examples are bAbI, Inference Is Everything, DNC, GLUE, SuperGLUE.

**Expert tasks**: as can be expected, these benchmarks are geared towards field-specific knowledge (eg, law, medicine). They usually include publicly available test datasets. In any case, successfully completing them usually requires substantial common and commonsense knowledge. Crucially for this paper, there is no economics-specific benchmark that I know of.

The roster of benchmarks is impressive, including tests that focus on various types of knowledge (ie, linguistic, common and commonsense), evaluation of likely psychological state and logic attribution. However, a main point that is missing is the ability to attribute correct results to reasoning as opposed to probabilistic association or even chance. The result from existing benchmarks is largely, if not completely, directly related to the number of questions correctly answered. However, this measures only the model's ability to answer correctly, *not necessarily* its reasoning capabilities. The latter are part of a latent state space sitting between the input prompt and the answer. More concretely, for an input prompt $Q$, which includes a question and any necessary explicit information, the language model is a function $m$ that maps it to a given response: $m : Q \mapsto \hat{\alpha}$. In order to show that this mapping is done by reasoning, we need tests (and more specifically, measurements) that convey some information about the inner workings of this function. In addition, most benchmarks are not derived from specialists' words.

A more advanced benchmark, along the lines described in this work, is @tarunesh2023lonli.

# A model of reasoning

The model presented in the Introduction is arguably simplistic, but it is grounded in a considerable body of literature that combines sensory pathways, knowledge combination and the execution of logical operations.[^worldmodel] This section goes into more detail about the choices in the reasoning model.

[^worldmodel]: The difference between this model and the "world model" concept found in some machine learning papers (eg, @wu2023reasoning), namely the conditions under which a mapping between prompt and output answer is evaluated, is that the world model concept seems to merge into one two concepts that should remain distinct: the existence or not of reasoning abilities in models, and conditional on that, the ability to correctly parse information and link it to relevant knowledge.

## Information filtering

Reasoning should be robust to irrelevant input or to changes in minutiae that are not crucial for the logic relations to follow. This follows from the efficient coding hypothesis in physiology (@barlow1961possible, @olshausen1996natural, @loh2014efficient), which postulates that sensory pathways need to reduce the dimensionality of inputs in living organisms. This insight is itself a biological and social observation inspired by Shannon's [-@shannon1948mathematical] landmark information theory. The AI literature has of course known this for years, and it inspired the concept of attention (@larochelle2010learning, @mnih2014recurrent), which later inspired self-attention and ultimately the game-changing transformer architecture (@vaswani2023attention).

The first step, filtering the received impulses (ie, the prompts), involves correctly judging what is relevant and what is not. This is similar, for example, to how the brain receives an incredible amount of sensory inputs but chooses to focus only on those that are more relevant instead of being overwhelmed with everything else. This observation has inspired dimensionality-reduction algorithms: for example, isometric mapping, or IsoMap, by @tenenbaum2000global describes how to find global optima while also defining the (much lower) degrees of freedom in a high-dimensional input.

Appropriate perception should recognise that the information of relevance to understanding a problem is actually much lower-dimensional. Beyond the biological (and more specifically neural) requirement for inputs to focus on the most important aspects, this observation is also consistent with a bedrock of the machine learning literature, the manifold hypothesis. This hypothesis is based on the theoretical and empirical observations that most real-world realisations are high-dimensional embeddings of much lower-dimensional manifolds.[^manif] @cayton2008algorithms offers an early review of the main algorithms for estimating empirically the underlying manifold. @bengio2013representation discuss extensively the idea of representation learning (and its various techniques, mostly unsupervised), which can be seen as manifold learning. However, such techniques might also approximate the wrong manifold or not have a single solution for the same manifold (@lee2023geometric).

[^manif]: For example, @intrinsic2021 study the intrinsic underlying dimensionality of the manifold of image datasets and find them to be significantly lower than their observed dimensionality. In practice, inputs can even be said to be a *union of manifolds* (as verified by @brown2022verifying with image datasets in an exercise similar to the one by @intrinsic2021), which means that each manifold has its own intrinsic dimensionality that is not forced upon the other manifolds. This perspective affords flexibility in the interpretation of identifying variations because they don't necessarily need to probe the same dimensions at each task.

The requirement for perception modulation is also aligned with @prystawski2024think's finding that reasoning abilities in LLMs require "locality" in concepts until a final link between a prompt and its final answer (if far away) can be achieved. In a way, this is also similar to the small world network model (@kleinberg2000small). In humans, the literatures on rational inattention and neuroeconomics (@SIMS2003665, @caplin2022rationally, @dean2023experimental, @hebert2021neighborhood) model human information processing as subject to a cost that grows with the informational content, which is a closer representation of how people actually process information. In linguistics, the information bottleneck literature discusses how ideas are compressed into words by a trade-off between lexicon complexity and accuracy (@zaslavsky2018efficient), and more recently also consistency (@chen2023information).

## Knowledge association

Assuming a prompt has been correctly parsed, the reasoning mechanism must now match it with the relevant knowledge, which can come from the prompt itself or from commonsense knowledge.

Knowledge can be linguistic, common or commonsense (@davis2015commonsense). @mahowald2023dissociating use insights, including from neuroscience, to distinguish the first type of knowledge from the latter two (grouped as "functional" knowledge), and argue that LLMs have essentially mastered the former while still having a spotty record on the latter. For example, LLMs learn grammar, semantics, hierarchical structures, abstractions and constructions that provide realistic linguistic knowledge.

@BRANSFORD1972717 show in experiments that contextual knowledge (in this case akin to commonsense) is essential for proper understanding in humans. The first experiments tested understanding by subjects of a grammatically correct, non-metaphorical passage that required an unusual and very specific, but highly relatable image as context for proper understanding. Note that these characteristics of the passage (correct, non-metaphorical) and of the context (unusual but relatable and easy to understand) both help this exercise isolate whether contextual knowledge is required.[^prior]

[^prior]: Interestingly, the same paper also demonstrates that prior knowledge itself is not necessarily readily available but needs to be "activated". This is not further discussed in the context of this paper as it is not a mechanism necessary for measuring reasoning abilities in AI models.

Which type of knowledge to match to the prompt? One way of seeing this is through the lens of a query in a knowledge graph. @kleinberg1999authoritative distinguishes specific and broad queries, each giving rise to one problem: a scarcity of correct answers and an abundance of correct answers, respectively. Further, @kleinberg1999authoritative offers the fundamental idea of *authority*. Relatedly, @kleinberg1999authoritative acknowledges that measuring the authority level of a node in a knowledge graph from explicit information alone (what he calls an *endogenous* measure) is not sufficient. Indeed, even so much as using strings from the query itself might mislead answers, due to an abundance of other sources that are based on the string and a scarcity of genuinely authoritative sources that use the string. Interestingly, while the principal eigenvector of the square of the adjacency matrix offers the weights of authoritativeness especially for broad queries, the non-principal eigenvectors can offer insights into the authoritativeness of more specialised queries, and also, due to their negative entries, offer authorities of different perspectives (ie, weighting pros and cons).

So the lower-rank approximation of the knowledge graph should vary with how broad/specific a query is, and also with the level of pros/cons required. In @kleinberg1999authoritative, the authoritativeness measures come from the eigenvectors of $A^TA$, with the principal eigenvector being used as a broad authoritativeness metric, and the $n$th eigenvectors for $n>1$ as the more specific, and potentially discordant, authorities. This implies that having question variations that refer to opposite views in the prompt should also help probe the level of knowledge (given the theoretical model that it is a function of the existing knowledge graph) versus response flipping due to probabilistic association with the terms introduced to describe the pros and cons.
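
As a minimal sketch of the authority computation discussed above, the snippet below applies power iteration to $A^TA$ for a toy adjacency matrix. The graph, the function name `authority_scores` and the number of iterations are illustrative assumptions, not part of @kleinberg1999authoritative or of this benchmark, and only the principal eigenvector (the broad authority measure) is recovered.

```python
from math import sqrt


def authority_scores(adjacency, iterations=100):
    """Approximate the principal eigenvector of A^T A by power iteration.

    adjacency[i][j] == 1 means node i links to node j (toy, hypothetical graph).
    """
    n = len(adjacency)
    scores = [1.0] * n
    for _ in range(iterations):
        # Hub step: h = A a
        hubs = [sum(adjacency[i][j] * scores[j] for j in range(n)) for i in range(n)]
        # Authority step: a = A^T h, so that overall a <- A^T A a
        scores = [sum(adjacency[i][j] * hubs[i] for i in range(n)) for j in range(n)]
        # Normalise to unit length (assumes the graph has at least one edge)
        norm = sqrt(sum(s * s for s in scores))
        scores = [s / norm for s in scores]
    return scores


# Toy graph: nodes 0, 1 and 2 all link to node 3; node 0 also links to node 2.
A = [
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]
scores = authority_scores(A)
top_authority = max(range(len(scores)), key=scores.__getitem__)
```

In this toy graph the node with the most incoming links (node 3) receives the largest weight, while nodes without incoming links receive none. Recovering the non-principal eigenvectors would require a full eigendecomposition, which is left out here.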

Information provision experiments (@FUSTER2023107) can help elucidate the knowledge needed by the AI. 

## Logic attribution

Once a prompt has been correctly parsed to focus only on relevant information and the necessary knowledge has been associated with it, an AI model can assign specific logic relationships between different components of a prompt and the existing knowledge. Commonly studied logic relationships are induction, deduction, analogy, abstraction and abduction (@walton2014abductive, @johnson2015logic, @davis2015commonsense, @dziri2024faith), among others. Naturally, an essential condition for appropriate use of logical relationships is that the inputs $\phi$ and $\kappa$ are correct; otherwise, even the purest application of logic would either lead to a failure or to a correct answer by chance.

Similar to many other social disciplines, economics requires the analytical judgment referred to by @robbins1932essay in the analyses of events as a basis to extrapolate and predict, and this has a bearing on how economic reasoning should be benchmarked. Economic inference depends primarily on articulating unobservable quantities, theorised and estimated on the basis of observable measures. This is unlike other major disciplines. For example, in human and veterinary medicine, all physiological and pathological variables of clinical importance are observable in principle, even if that is not yet technologically feasible today. In the medical sciences, theoretical models merely fill in the gaps in the absence of a technologically feasible complete measurement. In contrast, many economically relevant quantities are latent variables that cannot by definition be observed, and always require a model, implicit or not, applied to data to be estimated.

A quantitative test for economic reasoning must take into account that selecting a correct answer in an economics question through reasoning will always depend on an unobserved transformation of the information received and the existing knowledge. AI models may also happen to choose the correct answer either from luck or through simple token probability. It is easy to see why a correct answer selected by chance is not informative about the reasoning abilities of a model. Since reasoning should be robust to perturbations in the input data, it should also be locally complete, meaning that an LM that can correctly deduce that A implies B is also able to understand that A' does not imply B, or that A does not imply B'. In other words, a logic relationship that appears to be correct but whose obvious corollary is not achieved by an LLM cannot be said to have been reasoned in the first place.
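
The local completeness requirement can be illustrated with a minimal sketch that gives credit for a seed item only when its perturbed counterparts are also answered correctly. The variant labels, the function name `locally_complete` and the example responses are hypothetical and only serve to illustrate the scoring logic.

```python
def locally_complete(answers, expected):
    """Return True only if the seed item and all its variants are answered as expected.

    answers / expected map a variant label (eg "A->B", "A'->B", "A->B'")
    to the truth value the model gave / the truth value that is correct.
    """
    return all(answers.get(label) == truth for label, truth in expected.items())


# Hypothetical example: one model affirms the seed implication but also affirms
# a variant whose premise was changed so that the implication no longer holds.
expected = {"A->B": True, "A'->B": False, "A->B'": False}
parrot_like = {"A->B": True, "A'->B": True, "A->B'": False}
consistent = {"A->B": True, "A'->B": False, "A->B'": False}

# Plain accuracy on the seed item alone cannot tell the two models apart.
seed_only_accuracy = parrot_like["A->B"] == expected["A->B"]
credit_parrot = locally_complete(parrot_like, expected)
credit_consistent = locally_complete(consistent, expected)
```

Under this scoring rule the first model gets no credit despite answering the seed item correctly, which is the distinction that a simple percentage of correct answers would miss.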

## Reasoning itself as a low-dimensional operation

Why is the information filtering important for identifying reasoning? Since proper reasoning needs to be insensitive to unimportant details, and the vector of changes depends on logical relationships between components, the set of all "reasonable" constructions is not obtained at random but reflects a lower-dimensional, underlying space.

@gilboa2014analogies argues that economic reasoning works by creating positively wrong but conceptually useful representations of reality, even when economics is studying particular cases. A marked characteristic of such models is their preference for simplicity, a theme also explored by @GILBOA20101757, who study the matching of economic theories to empirical data, generalising the evaluation of how reasonable a theory is through a combination of their likelihood (or goodness-of-fit) with a penalising factor for their complexity. Intuitively, this simplicity in reasoning is suggestive of the manifold hypothesis in reasoning as well. @rationality2023gilboa sees rationality, or reasoning, also as a robustness to trivial detail. 

A related, more empirically applicable perspective is the possibility of identifying models partially using random sets, ie abstracting away from point identification to situations where the data is incomplete or is described as an interval (@BERESTEANU201217). In other words, assumptions are combined with available data to inform the estimation (even if partial) of a parameter of interest (@MOLINARI2020355), in situations that could arise for example when the data originates from a lower-dimensional manifold but is observed as an embedding in higher dimensions.

All of these insights above, together with the points made in each specific step, lend credence to the intuition that the reasoning process itself entails some dimensionality reduction, in line with the existence of evidence that some real-world data are in fact realisations of a manifold (@intrinsic2021).

# Reasoning about economics

The model above allows us to estimate reasoning while also breaking down some of its components to better understand them. For example, we can decompose any error in reasoning into issues with information filtering, knowledge association and logic attribution. The goal of this section is to clarify that the working definition of economic reasoning used in this work is very broad and is not meant to assign one or another form of reasoning greater weight, let alone to favour one economic school of thought over another. Rather, the goal is to reflect over time also slower-moving features of our profession, such as the ebb and flow of schools of thought (@pribram1953patterns).

@theory2022gilboa distinguish between three types of inquiry in economic theory: economics itself (analysis of economic phenomena), development of economic methods (the development of analytical tools needed to study economic phenomena) and the methodology of economics (the research/scientific endeavour in economics, including but not limited to theory).[^economics] Naturally, they would all be in scope of this exercise.

[^economics]: In fact, @theory2022gilboa even allude to the blurred lines between economics and the philosophy or sociology of economics. I don't go into these differences here.

Another insight into *economic* thinking is from the thought experiments first introduced by Marshall (1890) - the ceteris paribus idea - and then later Ragnar Frisch and Trygve Haavelmo, more recently elaborated in more detail and more generally by @heckman2015causal, including the important distinction between correlation and causation. @marschak1944random start their influential paper by acknowledging that economists can't conduct experiments (although that has been relaxed somewhat, it still remains the case at least in macroeconomics). Reasoning in economics can be taken as exploring the latent space through models, or thought experiments, an insight generalised by @heckman2023econometric. @pribram1953patterns further argues that even the areas of focus - the answer to *what matters* - change over time.

Other sciences, in contrast, might have more concrete definitions of reasoning. For example, human and veterinary medicine require hard data to conduct reasoning. In its absence, due for example to technological, biological or economic constraints, professionals need to theorise and especially rely on abductive reasoning. But if one day all of these constraints were relaxed and medical and veterinary doctors had access to all possible data about their patients, they would only need to reason with respect to the physiological or pathological relationship between these observed variables. Economists, in contrast, will always need to conduct thought experiments and think in latent terms to conduct any type of meaningful economic reasoning.

# Empirical estimation {#sec-variation}

Recall from @eq-empiric that estimation requires an evaluation of several variations of the same seed question $Q$ for each component. These variations form the empirical data. Questions vary with respect to their adversarial aspect; it is this variation within each question that allows the empirical estimation of the effects associated with interpretation or with knowledge. Most of the variations are originally those tested in @alzahrani2024benchmarks. The variation in response between the questions within each task will comprise the evaluation of the actual reasoning capabilities. As alluded to before, the variations are organised into those that measure the stability of a response to adversarial interpretation answers, and those that measure the stability across the knowledge dimension. In practice, each task has hundreds of different $q$. These groups are described in more detail in the next section.

To illustrate these procedures, this section uses randomly selected papers from the current issues (as of drafting) of leading English-language journals as source material to create seed questions.[^journals] In alphabetic order by journal, these are the *American Economic Review* (@houmark2024nurture - henceforth P1), *Econometrica* (@gomez2024wealth - P2), *Journal of Financial Economics* (@ARDIA2024103805 - P3), *Journal of Monetary Economics* (@DYRDA202474 - P4), *Journal of Political Economy* (@digiovanni2024foreign - P5), *Review of Economic Studies* (@andrews2023misspecified - P6) and *Quarterly Journal of Economics* (@risch2023taxing - P7). @tbl-papers presents the characteristics of the sampled papers.

[^journals]: These journals were selected for this preliminary version as a mix of their wide impact and general interest audience. Other ways of measuring impact lead to different rankings. See @auer2023journal for a recent example focused on central banking-related topics.

| Paper ID | No of authors | Affiliation countries | JEL codes |
|------|------|------|------|
| P1 | 3 | DK, DE | I24, I26, J12, J13, J24 |
| P2 | 2 | US | N/A |
| P3 | 4 | CA, CH, LU | C55, C58, G11, G12, G23 |
| P4 | 3 | CA | E6, F23, H25, H27 |
| P5 | 3 | FR, US | N/A |
| P6 | 2 | US | C10, C12 |
| P7 | 1 | US | H22, H25, H32, J21, J31 |

: Sample of randomly selected papers from current issues of English-language, high-impact academic journals in economics. {#tbl-papers}

## Process overview

Each paper $P$ undergoes the following process:[^process] first, the paper is read, with a focus on the introduction, which in economics papers usually covers the main material of each manuscript.

[^process]: NB: the author is currently fine-tuning the process during the estimation phase and for this reason this section in next versions might be different.

## Adjusted adversarial filtering

The key idea in this section is to use the adversarial filtering process proposed by @zellers2018swagaf and @zellers2019hellaswag, adapted to the current application. In the original procedure, a generative model creates an initial set of incorrect responses to each context (a video caption in the original application). This is then fed to the adversarial filtering routine, which is executed iteratively. First, the source data is split into training and testing sub-samples. An AI model is trained on the training sub-sample and used to identify those alternatives in the testing data that are easier to answer correctly. Those easy alternatives are then taken out of the sample, and newly generated alternatives replace them. The process is repeated until stabilisation.

A few adjustments are needed for the current case: this process occurs separately for each of the seed questions $q$ and each of the three steps. This results in a collection of $\mathbf{W}_q^{(1)} = \epsilon^f, \epsilon^k, \epsilon^l$ for each $q$. The iterative filtering then proceeds as described above, until the performance of the adversarial filters has degraded to an arbitrarily low point.
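
The iterative replacement loop can be sketched as follows. This is a simplified illustration under strong assumptions: the train/test split and the training of the discriminator are abstracted into a single callable `is_easy`, the generator is a callable `generate`, and both are hypothetical toy stand-ins rather than the models used by @zellers2018swagaf or in this benchmark.

```python
from itertools import count


def adversarial_filter(distractors, generate, is_easy, max_rounds=20):
    """Replace distractors that a discriminator finds easy until none remain.

    generate() returns a fresh candidate distractor; is_easy(d) stands in for
    the trained discriminator of the original procedure (both hypothetical).
    """
    current = list(distractors)
    for round_number in range(max_rounds):
        easy = [i for i, d in enumerate(current) if is_easy(d)]
        if not easy:  # stabilisation: nothing left to replace
            return current, round_number
        for i in easy:
            current[i] = generate()
    return current, max_rounds


# Toy stand-ins: candidates are integers, and "easy" means divisible by 3.
_counter = count(1)
result, rounds = adversarial_filter(
    distractors=[3, 6, 7],
    generate=lambda: next(_counter),
    is_easy=lambda d: d % 3 == 0,
)
```

In an actual application the loop would be run separately for each seed question and each of the three steps, and `max_rounds` guards against the case where the generator keeps producing easy alternatives.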

## Variations related to information filtering

There are several classes of variations that can help test an AI model's interpretation of the input, ie which information to focus on versus which to ignore or abstract away. The key principle of these identifying variations is to keep core information intact while varying the order of content or of answer options; changing, including or removing trivial content; or including random symbols in lieu of non-important content such as the codes identifying each response (eg, using country flags instead of alphabetic letters to identify options). Because the core information is always constant in all these $M_f$ variations of $Q$, an AI system that filters correctly would recognise these changes as irrelevant and continue to concentrate on the main information, which is intact.

**Choice variations**: In this dimension, each variation of a seed question would have the same set of choices, but in varying order. Following @alzahrani2024benchmarks, this variation can be random, leverage bias to include the correct answers at the beginning or end of the set, explore identifying choice alternatives with uncommon answer choice symbols (eg, non-standard letters instead of a-d), and even common but unordered answer choice symbols.
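
A minimal sketch of one such choice variation is shown below: the options are shuffled with a fixed seed, relabelled with uncommon symbols, and the label of the correct answer is tracked. The function name, the symbols and the example options are illustrative assumptions rather than the exact procedure of @alzahrani2024benchmarks.

```python
import random


def choice_variation(options, correct_index, symbols, seed=0):
    """Shuffle answer options, relabel them, and track the correct label."""
    rng = random.Random(seed)  # fixed seed so each variation is reproducible
    order = list(range(len(options)))
    rng.shuffle(order)
    # Position pos in the new ordering shows the option originally at index src
    labelled = [(symbols[pos], options[src]) for pos, src in enumerate(order)]
    correct_label = symbols[order.index(correct_index)]
    return labelled, correct_label


# Hypothetical answer options; the correct one is "falls" (index 1).
options = ["rises", "falls", "is unchanged", "cannot be determined"]
labelled, correct_label = choice_variation(
    options, correct_index=1, symbols=["$", "&", "#", "@"]
)
```

Varying `seed` yields different orderings of the same content, so any change in a model's accuracy across seeds can be attributed to the presentation rather than to the substance of the question.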

**Word variations**: The main idea here is to introduce or change words that are irrelevant and not part of the main concept. This is along the lines of the tests conducted by @perez2024testing and @lewis2024using: varying unimportant information while keeping the core details requiring reasoning constant causes machine accuracy to degrade.[^analogy] Humans, on the other hand, continue to perform well in the face of such problems. In the current benchmark, a human annotates the seed question to identify words that can be safely changed without changing the underlying reasoning task for question $q$; an adversarial filtering model then creates multiple alternatives from real and, when needed, simulated words.

[^analogy]: @lewis2024using calls this test "counterfactual analogy tests"; while analogy is a form of logic, in this current framework I associate these types of tests more with the information-filtering ability of models. Once a model with ideal reasoning ability can show that it recognises what information is really needed and consequently knows to abstract from everything else, it could then (depending on the context and knowledge it has about the problem) use analogy as one of the logic steps to finalise its reasoning about a problem.

**Irrelevant information**: The same seed question can be augmented to include irrelevant information to varying degrees, including flooding the question in the midst of completely random information. This can be done through random strings, strings guaranteed to not be relevant to the case in point, or even complete gibberish.
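
As an illustration of the simplest case, the sketch below surrounds a seed question with random lowercase strings while leaving the question itself untouched. The function name, the length of each noise token and the example question are assumptions made for this example only.

```python
import random
import string


def add_irrelevant_text(question, n_tokens, seed=0):
    """Surround a seed question with random lowercase strings (pure noise)."""
    rng = random.Random(seed)

    def noise():
        return "".join(rng.choice(string.ascii_lowercase) for _ in range(8))

    before = " ".join(noise() for _ in range(n_tokens))
    after = " ".join(noise() for _ in range(n_tokens))
    # strip() removes the padding spaces when n_tokens is zero
    return f"{before} {question} {after}".strip()


# Hypothetical seed question.
seed_question = "Does a higher tax on firms lower wages?"
variant = add_irrelevant_text(seed_question, n_tokens=3)
```

Increasing `n_tokens` controls the degree of flooding, so the stability of a model's answer can be traced as a function of the amount of noise.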

## Variations related to knowledge

Recall that the seed questions use as little jargon as possible. This is useful for two reasons. First, using non-jargon greatly increases the chance that all the words in $q$ are part of the training data dictionaries, and thus any performance issue is related to the network architecture itself and not to its choice of dictionary size. Second, less technical words probably carry more general meaning than specific words, and thus could trick stochastic parrot models into providing answers that have a closer probabilistic association with these general words than would be the case if the model purely reasoned (even if it ultimately reasoned incorrectly).

Adversarially testing knowledge involves including faux technical jargon in a way that would materially change the answer if wrongly interpreted as existing jargon. For example, ask if the interpretation of a given statement would change if "the heteroscedasticity is over-identified in the vector space". Obviously such a passage would not make sense to an expert but could fool an AI model. Comparing responses to the versions with and without the faux jargon should indicate the level of knowledge used by the model. The location for the random faux jargon can be specified by humans with a special token, or completely randomly set by an adversarial filtering model.
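
A minimal sketch of the human-specified variant follows. The marker `[JARGON]`, the function name and the template wording are hypothetical; the faux jargon is the example given in the text.

```python
FAUX_JARGON_TOKEN = "[JARGON]"  # hypothetical marker placed by a human annotator


def knowledge_variants(template, faux_jargon):
    """Return the plain question and its faux-jargon counterpart."""
    # Plain version: drop the marker and normalise the leftover whitespace
    plain = " ".join(template.replace(FAUX_JARGON_TOKEN, "").split())
    # Adversarial version: substitute the faux jargon at the marked location
    adversarial = template.replace(FAUX_JARGON_TOKEN, faux_jargon)
    return plain, adversarial


template = "Consider the statement above. [JARGON] Would its interpretation change?"
faux = "Suppose the heteroscedasticity is over-identified in the vector space."
plain, adversarial = knowledge_variants(template, faux)
```

Both versions are then submitted to the model, and a change in the answer between them signals that the meaningless passage was treated as genuine technical content.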

A third source of variation to probe a model's knowledge comes from the intuition that knowing about something also involves knowing how to oppose ideas about it, analogous to how information authorities in non-principal eigenvectors are opposed to each other in @kleinberg1999authoritative. In the benchmark model, this results in a third source of knowledge variation whenever relevant to the question $q$ that retains the same prompt but includes specific wordings related to pros and cons.

## Variations related to logic attribution

The primary way to test for logic is to include a term that leads to the reverse conclusion, and check if that would alter the results. The implications of this filter for the analysis of logic are obvious. In practice, this can be implemented by humans but also by adversarial filtering.

Another way to test logic is to conduct the flip-flop experiment: simply asking LLMs to confirm their answers often makes them switch answers, even if their original response was correct (@laban2023you, @xie2023ask). The key idea of these tests is to follow up a response with a reply that either asks for confirmation ("are you sure?") or that doubts the AI model's answer in some way, even if it was originally correct. Controlling for the ability to correctly interpret and filter the incoming information, the response shouldn't change since no other change has occurred in the AI model's filtering ability or knowledge base (neither in the original model weights, obviously, nor in the information content of the prompt).
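
The structure of such a test can be sketched as below. The `model` argument is any callable that maps a conversation history to an answer; the two toy models are hypothetical stand-ins for an LLM, and comparing raw strings is a simplification, since in practice the selected option would first need to be extracted from the response.

```python
def flip_flop_test(model, question, challenge="Are you sure?"):
    """Ask a question, challenge the answer, and report whether it changed.

    model is any callable taking a list of (role, text) turns and returning a string.
    """
    history = [("user", question)]
    first = model(history)
    # Follow up with a challenge that adds no new information
    history += [("assistant", first), ("user", challenge)]
    second = model(history)
    return {"first": first, "second": second, "flipped": first != second}


# Toy stand-ins for an LLM (hypothetical behaviour).
def steadfast(history):
    return "B"


def sycophant(history):
    # Changes its answer as soon as it is challenged
    return "B" if len(history) == 1 else "C"


steadfast_result = flip_flop_test(steadfast, "Which option is correct?")
sycophant_result = flip_flop_test(sycophant, "Which option is correct?")
```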

A few studies concentrate on evaluating the abilities of LLMs to reason by analogy. @webb2023emergent documents evidence that LLMs are able to reason, based on their performance on various types of analogy tests related to sequences of letters, words or digits, including a novel test that was completely new to the LLM.

## Variations related to economic inference

In addition to the variations described above, one type of question variation that can help estimate reasoning in AI systems is the type that explores the four policy-related questions organised by @heckman2023econometric. These questions, mentioned above, provide the inspiration for the variations on reasoning that are associated with measuring economic reasoning. In short, these variations essentially ensure that the benchmark asks questions recognising the effect of interest that is measured in each paper, and questions about the specific mechanisms. Other more sophisticated questions, which can also be created for each seed question paper if the original manuscript contains the necessary content, are about the forecast of policy effects in different environments, and of policies that were not implemented (ie, completely generalising and forming a structural "model of the world").

# Practical considerations

## Avoiding spillover into training data

The strategy of using newly published academic papers as sources should largely limit the extent to which the content has already been used in AI model training. However, most published papers in economics are previously published as working papers, which means they are potentially in the public domain at training time and so cannot be guaranteed to be completely novel. While this is mitigated by the arguably low dissemination of secondary material about working papers (for example, one could reasonably conjecture that few recent working papers immediately become the topic of teaching notes or are referred to in more detail by other papers), a more robust practical strategy is needed, especially as the training dataset of many of the most advanced models is not publicly known.

One way of dealing with this is by introducing in a variation of the questions a random string that is almost guaranteed to be unique and that is not found in common text datasets used to train LLMs. This is of course not perfect, because it cannot guarantee that the original paper is not part of training data, but can at least ensure that if the seed questions themselves are for some reason used to train models, this could be identified by model developers (and if the training data is available, also by third-party evaluators).
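
One possible implementation, sketched here under the assumption that a random UUID is an acceptable unique string, generates a canary, attaches it to a question variation and checks for its presence in a text. The prefix and the function names are hypothetical.

```python
import uuid


def make_canary(prefix="BENCHMARK-CANARY"):
    """Create a string that is, for practical purposes, unique."""
    return f"{prefix}-{uuid.uuid4()}"


def tag_question(question, canary):
    """Append the canary to a question variation on its own line."""
    return f"{question}\n[{canary}]"


def contains_canary(text, canary):
    """Check whether a document (eg, a training corpus shard) includes the canary."""
    return canary in text


canary = make_canary()
tagged = tag_question("Hypothetical seed question?", canary)
```

A developer with access to the training corpus could then run `contains_canary` over it to verify whether the benchmark questions have leaked into training.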

## Lessons from human surveys

I use a considerable amount of specific advice on human surveys from @stantcheva2023run to generate identifying variation in the questions. Specifically, all the questions avoid jargon to the best extent possible, and only include questions that are either of the ceteris paribus type, or that include as options assessments on the statements of the form "correct", "incorrect", "equal" or "I don't know".[^beyond] Particular care is taken with respect to introducing variations in the seed questions that can help measure each of the three reasoning components of information filtering, knowledge association and logic attribution.

Consideration is given to whether each question should be presented to a separate instance of the LM, or whether the full questionnaire could be shared in the same "chat", which would be akin to "few-shot" prompting. Another piece of practical advice as part of the estimation is to prototype the questions (I used GPT-4 for the prototyping).

[^beyond]: Future versions of this benchmark could also include open-ended questions (as in @ferrario2022eliciting), and even follow-up questions ("are there any other reasons?"). These open-ended questions that are similar in nature to closed-end questions could be assessed by a fine-tuned LLM.

## Desirable characteristics of a benchmark

A benchmark task for economic reasoning should ideally have the following characteristics in order to be useful and maintain relevance even in a scenario where model developers are able to acquire a significant body of economically relevant texts (eg, new papers).

**Inform performance on different components of reasoning**: An ideal benchmark can help practitioners intuitively grasp the performance of the models in each major "task" that is performed in the process of reasoning. This would help developers and users better understand what the models are good or bad at, and judge their adequateness accordingly. It can also support a more granular understanding of the acquisition of reasoning capabilities throughout the training process and scaling of language models (@biderman2023pythia).

**Evolve over time**: Economic reasoning evolves over time. For example, the Lucas critique (@lucas1976econometric) was influential in shifting macroeconomic modelling, while the credibility revolution described in @angrist2008mostly was similarly influential in microeconomic work. A historical perspective on the thought about causality going back to the early 18th century is found in @heckman2015causal, and Debreu [-@debreu1984economic] describes the evolution of economic theory up until that point. @lewbel2019identification offers a historical perspective on the issue of identification. For this reason, it is important to consider new works as they are incorporated in the live economic debate. This can be most directly done by drawing from academic papers in general interest economic journals, which benefit from wide impact in the profession. However, there are two main drawbacks of using academic papers to proxy for the development of economic reasoning over time. The first is the widely discussed publication bias (@andrews2019identification), but a perhaps equally important issue is that of unobserved false negatives: if many contributions that are now considered classics have been previously rejected (@mighty1994fallen), there are probably many others that will not be available for incorporation into a benchmark task.

**Make data available**: Availability of data is crucial for developers to test their models in-house, and for model evaluators to suggest improvements to this benchmark. For this reason, an initial set of publicly available $Q_{\text{public}}$ containing $q$s and their variations will be put in the public domain. Periodically, as a new set $Q_{\text{hidden}}$ is added, older questions will also be made available, ensuring developers will have access to a rolling set of new testing material.

**Cover different levels of economic reasoning**: An ideal economic reasoning benchmark tests whether the model is able to recognise increasingly sophisticated levels of economic reasoning. When faced with $Q$ that contains a statement of the economic problem, and a summary of the methodology and main findings, an AI model must recognise the type of analysis that was conducted. Drawing from the definitions more recently stated in @heckman2023econometric, those are, in order of analytical prowess, (a) the impact of a given intervention in a specific environment; (b) understanding the mechanisms by which the intervention might work; (c) forecasting the effects of the same intervention in other environments or states of the world; and (d) forecasting the effects of never-before-implemented interventions in various environments.

**Receive inputs from the public**: In order to truly reflect the breadth and diversity in economic thought, an ideal benchmark should be open to receiving suggested questions from the public.[^crowd] For example, economists publishing a new paper could suggest a source question based on their work. A practical way to achieve this is to create clear instructions and a standardised form that would be filled by the external user presenting the suggestion, coupled with a script that takes in the source question(s) and introduces the necessary variations in information, knowledge and logic to achieve identification. The author or other maintainers of the benchmark will then review the submissions.

[^crowd]: This approach was followed, for example, by @srivastava2022beyond.

# Results

(to be filled once first estimations are concluded)

# Preliminary considerations[^concl]

[^concl]: This section is the basis for the conclusions in a future version after the empirical estimation is completed.

In this paper I propose a working model of economic reasoning that can be used to benchmark AI models' reasoning abilities across three sequential cognitive tasks: filtering the incoming information, associating it with existing knowledge, and performing logic tasks to reach a correct answer. The model in this paper resembles the more sophisticated idea by LeCun [-@lecun2022path] that AIs require a combination of "mental" modules that can separately execute perception (eg, take in a prompt that might be text-only or text-and-image), calculate the cost of processing, and imagine action sequences. This model is placed in a growing consensus in the scientific community that reasoning is evidenced by certain characteristics: abstracting away the unnecessary details, matching what remains with explicit and implicit relevant knowledge, and then deploying logic operations to achieve the correct answer. This paper elaborates such a model in more detail, and offers an empirical strategy to estimate each of its components.

Understanding AI models' ability to reason and go beyond a pure probabilistic exercise is crucial as these models have an increasing importance in society. For example, models that suggest economic actions to people should better reflect latent, structural models that represent people's preferences or well-being rather than simply a prediction based on their observed behaviour, as argued by @kleinberg2023inversion. Models that reason better can rise to the challenge of learning actual metrics of interest instead of real-world measurements, because the latter have added noise from human cognitive and heuristic limitations, which are then amplified by multiple types of biases in data. And with the increasing linguistic prowess of language models, their use in the economic research process (@korinek2023gen, @ludwig2024machine) is likely to increase, putting a premium on the ability to measure how well these models can reason. Of course, this should also be done with care, given results from the human literature pointing to the inability to actually benchmark skills (@heckman2022measuring).

As economic agents and policymakers harness generative artificial intelligence (AI) to reap considerable efficiencies,[^braingpt] and the societal footprint of these models grows accordingly, a benchmark for economic reasoning is needed. I suggest ways to implement such a benchmark, and measure the current performance of a selected list of LLMs. This benchmark is designed from the beginning to remain challenging for AI models to solve, anticipating further gains in their performance. As another avenue for future work, it would be interesting to use the model in this paper as a basis for further exploring the functional forms of $f$, $k$ and $l$. This could offer new paths to explore the seemingly high performance of these models. In turn, greater understanding of these models could support their further use in economic settings such as research. For example, the work in @bybee2023ghost relies on the ability of LLMs to read financial news and explain their thought process.

[^braingpt]: Including by teaming up human and AI economists, as suggested by @luo2024large in the field of neuroscience.

Let me conclude with Ken Arrow's impossibility theorem (@arrow1950difficulty), or rather the story of how he achieved this incredibly influential result. Arrow first attempted to improve upon the two-century-old Condorcet paradox, and studied ways in which individual preferences could be aggregated while satisfying some intuitive conditions. It was only through repeated failures to do so that he switched his focus to attempting to prove that such an aggregation is impossible. While Arrow can safely be used as a prime example of economic reasoning, the point this anecdote illustrates is that breakthroughs in economic knowledge also require inspiration (in this case from the appeal of addressing Condorcet's paradox) as well as persistence and the ability to change one's focus. The current work focuses on developing robust benchmarks of models' reasoning abilities in economics; further work exploring their contributions to inspiration[^ideation] and to methodological assistance (as in the example of changing focus) is also warranted for a more complete assessment of models' abilities to provide cognitive support to human economists.

[^ideation]: @korinek2023gen illustrates the use of AI models to help economists come up with new ideas for work, and @ludwig2024machine presents ways machine learning can contribute to the generation of hypotheses.

# References