# Preemptive Analysis: A Case Study on Exact Matching Rigidity

ReAct underperforms on HotpotQA compared to Standard prompting and CoT prompting. This result is plausible, since ReAct also underperforms CoT in the original paper. This notebook contains my error analysis of the ReAct trajectories from our 500-sample HotpotQA benchmark test.

This preemptive error analysis on ReAct is motivated by its underperformance in the EM metric for the HotpotQA benchmark. Its purpose is to identify the types of errors that ReAct makes and the limitations of the evaluation, and to propose a remedy for the shortcomings of the EM metric.

- ethereal-firebrand-4: https://wandb.ai/agential/hotpotqa/runs/852rl2lq
- generous-sound-15: https://wandb.ai/agential/hotpotqa/runs/not8mo5r

These 2 ReAct runs were run before LLM-as-a-Judge evaluation was implemented in the EM metric. After inspecting the results via the wandb UI, we broadly classify the errors into 3 categories: no answers, unknown/unable to find, and false negatives.

These errors can be due to:
- the benchmark itself (for the sake of generality and comparisons, we use a larger sample size of 500 as opposed to 100, the distractor sample size)
- our implementation of ReAct
  - the underlying method's *core* logic
  - the search/lookup tool's effectiveness (we use LangChain's search and `DocStoreExplorer` tools as opposed to the original paper's custom-defined BeautifulSoup4 web scraper for Wikipedia queries)
- the evaluation strategy (EM is strict)
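
To make "EM is strict" concrete, here is a minimal sketch of a common SQuAD-style exact match: normalize both strings, then require equality. The names `normalize_answer` and `exact_match` are illustrative assumptions, and the EM metric used in these runs may normalize or compare differently.

```python
import re
import string


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, ground_truth: str) -> bool:
    """Assumed textbook EM: the normalized strings must be identical."""
    return normalize_answer(prediction) == normalize_answer(ground_truth)


# (prediction, ground truth) pairs; the answers are taken from this notebook.
pairs = [
    ("George Washington Bridge", "the George Washington Bridge"),
    ("Armenia", "Armenian"),
    ("1843", "15 August 1843"),
]
for prediction, truth in pairs:
    print(f"{truth!r} vs {prediction!r}: {exact_match(prediction, truth)}")
```

Under this sketch only a dropped article is forgiven. A one-letter difference or a partially specified date still scores 0, even when a human reader would accept the answer.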

## Ethereal Firebrand

I manually inspect the predicted answers versus ground truth answers to categorize all the errors.

In [1]:
import wandb
run = wandb.init()
artifact = run.use_artifact('agential/hotpotqa/run-852rl2lq-etherealfirebrand4_eval:v0', type='run_table')
artifact_dir = artifact.download()

Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
[34m[1mwandb[0m: Currently logged in as: [33mvincenttu[0m. Use [1m`wandb login --relogin`[0m to force relogin


[34m[1mwandb[0m:   1 of 1 files downloaded.  


In [4]:
import json

# Forward slashes avoid the invalid "\e" escape sequence and work on every OS.
with open('artifacts/run-852rl2lq-etherealfirebrand4_eval-v0/ethereal-firebrand-4_eval.table.json', 'r') as file:
    file_input = json.load(file)

In [6]:
import pandas as pd

df = pd.DataFrame(file_input['data'])
df.columns = file_input['columns']

In [7]:
subset_EM0 = df[df['EM'] == 0]
subset_EM1 = df[df['EM'] == 1]

In [8]:
subset_EM1

| | question | answer | predicted_answer | EM | precision | recall | f1 |
|---|---|---|---|---|---|---|---|
| 7 | Tommy's Honour was a drama film that included ... | War & Peace | War & Peace | 1 | 1.000000 | 1.0 | 1.000000 |
| 9 | "Tunak", is a bhangra/pop love song by an arti... | 1967 | 1967 | 1 | 1.000000 | 1.0 | 1.000000 |
| 15 | Jason Regler, stated that he had the idea for... | an organ | funereal organ | 1 | 0.500000 | 1.0 | 0.666667 |
| 16 | The Soul of Buddha is a 1918 American silent ... | the George Washington Bridge | George Washington Bridge | 1 | 1.000000 | 1.0 | 1.000000 |
| 22 | Who directed the second film in a British seri... | Oliver Parker | Oliver Parker | 1 | 1.000000 | 1.0 | 1.000000 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 485 | Not Richard, But Dick was an album by the band... | Philadelphia | Philadelphia | 1 | 1.000000 | 1.0 | 1.000000 |
| 486 | Van Andel Institute was founded in part by wha... | Jay Van Andel | Jay Van Andel | 1 | 1.000000 | 1.0 | 1.000000 |
| 492 | Which documentary Hacking Democracy or The Mem... | The Memory of Justice | The Memory of Justice | 1 | 1.000000 | 1.0 | 1.000000 |
| 495 | To Catch a Predator was devoted to impersonati... | jurisdiction | varies by jurisdiction | 1 | 0.333333 | 1.0 | 0.500000 |


In [10]:
for index, row in subset_EM1.iterrows():
    print(f"Row {index}:")
    print(row.answer, "<=====>", row.predicted_answer)
    print()  # Add a blank line for better readability

Row 7:
War & Peace <=====> War & Peace

Row 9:
1967 <=====> 1967

Row 15:
an organ <=====> funereal organ

Row 16:
the George Washington Bridge <=====> George Washington Bridge

Row 22:
Oliver Parker <=====> Oliver Parker

Row 24:
Rafael Márquez <=====> Rafael Márquez

Row 25:
South Korea <=====> South Korea

Row 27:
1954 <=====> 1954

Row 28:
France. <=====> Roissy-en-France, France

Row 29:
director <=====> film director

Row 32:
Harry F. Sinclair <=====> Harry F. Sinclair

Row 33:
bass <=====> bass guitar

Row 39:
Argentinian <=====> Argentinian

Row 42:
Dundee Canal <=====> Dundee Canal

Row 44:
The Most Dangerous Game <=====> The Most Dangerous Game

Row 46:
no <=====> no

Row 47:
family <=====> grass family

Row 51:
Manchester Orchestra <=====> Manchester Orchestra

Row 52:
Perfect Strangers <=====> Perfect Strangers

Row 53:
World War II <=====> World War II

Row 55:
BMW X6 <=====> BMW X6

Row 56:
no <=====> no

Row 57:
Yakuza <=====> Yakuza series

Row 61:
no <=====> no

Row 63

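The EM = 0 rows are sorted into three error categories:

- no answer (the predicted answer is empty)
- unknown (the model reports that it could not find the answer)
- false negatives (EM marks the prediction wrong, but it should count as right)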
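
The categorization in this notebook was done by hand. As a rough aid, the first two categories could be pre-sorted with simple string checks, leaving everything else for manual review. The sketch below is an assumption, not the procedure used here: `triage_prediction` and the `UNKNOWN_MARKERS` phrase list are hypothetical.

```python
# Hypothetical phrase list; it was not derived from the full run.
UNKNOWN_MARKERS = (
    "not found",
    "unable to",
    "not available",
    "unknown",
)


def triage_prediction(predicted_answer: str) -> str:
    """Pre-sort a prediction from an EM = 0 row into a coarse bucket."""
    text = predicted_answer.strip().lower()
    if not text:
        return "no_answer"
    if any(marker in text for marker in UNKNOWN_MARKERS):
        return "unknown"
    # False negatives and genuinely wrong answers both land here,
    # because telling them apart requires reading the ground truth.
    return "needs_manual_review"


for prediction in ["", "Information not found", "Armenia"]:
    print(repr(prediction), "->", triage_prediction(prediction))
```

A phrase list like this misses free-form refusals such as the long sentence predicted for row 8, so it can only narrow down the manual pass, not replace it.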

In [11]:
for index, row in subset_EM0.iterrows():
    print(f"Row {index}:")
    print(row.answer, "<=====>", row.predicted_answer)
    print()  # Add a blank line for better readability

Row 0:
flotilla <=====> 

Row 1:
15 August 1843 <=====> 1843

Row 2:
Armenian <=====> Armenia

Row 3:
Centre of Excellence <=====> unable to provide an answer

Row 4:
Saoirse Ronan <=====> Information not found

Row 5:
The Lorax <=====> Not found

Row 6:
Scott Adkins <=====> 

Row 8:
Midnight Oil <=====> Midnight Oil did not choose their band name by drawing it out of a hat

Row 10:
Captain John Underhill <=====> Willem Kieft

Row 11:
The Outsiders <=====> The Way

Row 12:
to ensure wide visibility and understanding of cases in a region <=====> answer not available

Row 13:
Keelung <=====> New Taipei City

Row 14:
yes <=====> Édouard Roger-Vasselin

Row 17:
Vitor Belfort <=====> 

Row 18:
tip of the Baja California <=====> Sinaloa

Row 19:
The Tugurt language <=====> Berber language

Row 20:
Q <=====> 

Row 21:
2002 <=====> 

Row 23:
a British composer <=====> America, Britain

Row 26:
Bury St Edmunds <=====> 

Row 30:
Noel Gallagher <=====> 

Row 31:
Grammar, logic, and rhetoric <====

In [12]:
no_answer = [
0,
6,
17,
20,
21,
26,
30,
34,
36,
38,
40,
41,
45,
49,
50,
58,
62,
72,
86,
90,
92,
94,
97,
98,
99,
100,
102,
107,
110,
113,
114,
118,
126,
128,
130,
131,
136,
143,
144,
145,
153,
156,
157,
168,
175,
176,
180,
181,
182,
183,
186,
189,
195,
197,
203,
204,
206,
212,
213,
215,
216,
222,
223,
229,
231,
233,
234,
239,
241,
245,
247,
251,
257,
258,
264,
270,
272,
274,
275,
281,
284,
289,
294,
298,
306,
309,
310,
311,
315,
316,
317,
321,
325,
326,
329,
333,
338,
339,
341,
342,
343,
344,
346,
347,
349,
351,
360,
361,
367,
374,
378,
379,
382,
388,
392,
395,
398,
402,
407,
416,
431,
435,
437,
444,
455,
462,
465,
466,
468,
479,
482,
483,
489,
490,
491,
493,
497,
499,
]

unknown = [
3,
4,
5,
8,
12,
43,
48,
54,
80,
82,
85,
95,
101,
111,
115,
116,
124,
133,
135,
149,
154,
155,
160,
162,
170,
177,
196,
200,
218,
221,
225,
226,
242,
249,
260,
277,
287,
290,
292,
301,
302,
304,
353,
358,
362,
364,
370,
386,
404,
420,
423,
424,
428,
448,
449,
458,
467,
481,
488,
]

false_negatives = [
2,
68,
78,
83,
91,
93,
103,
138,
151,
165,
202,
227,
232,
236,
237,
244,
246,
262,
266,
269,
286,
307,
327,
328,
330,
332,
368,
371,
380,
387,
393,
396,
397,
410,
415,
425,
427,
434,
440,
445,
447,
453,
460,
487,
]



In [13]:
len(no_answer), len(unknown), len(false_negatives)  # 138 no answers, 59 unknowns, and 44 false negatives due to the strictness of EM evaluation.

(138, 59, 44)
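
To put these counts in proportion, the short sketch below expresses each category as a share of the 500 benchmark samples. The helper name `category_shares` is an assumption introduced for illustration.

```python
from typing import Dict


def category_shares(counts: Dict[str, int], n_samples: int) -> Dict[str, float]:
    """Return each category's count as a percentage of all samples."""
    return {name: 100 * count / n_samples for name, count in counts.items()}


n_samples = 500  # size of the HotpotQA benchmark sample used in this run
counts = {"no_answer": 138, "unknown": 59, "false_negatives": 44}

for name, share in category_shares(counts, n_samples).items():
    print(f"{name}: {counts[name]} of {n_samples} ({share:.1f}%)")

total = sum(counts.values())
print(f"all categorized errors: {total} of {n_samples}")
```

The false negatives alone are 44 of 500 samples (8.8%), so rescoring every one of them as correct would raise EM by at most 8.8 percentage points.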

We've categorized all suspicious errors (we do not count errors where the predicted answer is obviously wrong). This gives us a clue as to the possible issues.

1. Manual inspection shows that the false negatives are due to the strictness of exact matching.
2. For a substantial number of questions, ReAct does not finish its trajectory with an answer. The original ReAct implementation has no safeguard for reaching the final step of the trajectory without an answer. I stay true to this implementation.
3. The unknown/unable to find cases indicate limitations in the search and lookup tools.

Items 1 and 3 stem from design choices that can be changed. To verify that item 2 is independent of our core ReAct implementation (that is, not caused by a bug in it), we manually inspect the input prompts and output completions of an example trajectory from the Ethereal Firebrand run.

In [15]:
import wandb
run = wandb.init()
artifact = run.use_artifact('agential/hotpotqa/ethereal-firebrand-4:v0', type='output')
artifact_dir = artifact.download()

[34m[1mwandb[0m:   1 of 1 files downloaded.  


In [2]:
import pickle 

# Forward slashes avoid the invalid "\e" and "\o" escape sequences.
with open("artifacts/ethereal-firebrand-4-v0/outputs.pkl", 'rb') as f:
    outputs = pickle.load(f)

In [5]:
len(outputs)

500

In [6]:
outputs[4]

ReActOutput(answer='Information not found', total_prompt_tokens=25188, total_completion_tokens=1896, total_tokens=27084, total_prompt_cost=0.012594, total_completion_cost=0.002844, total_cost=0.015437999999999999, total_prompt_time=23.883926153182983, total_time=31.59857416152954, additional_info=[ReActStepOutput(thought='I need to search for information about Billy Howle and the upcoming British drama film directed by Dominic Cooke to find out who he will be seen opposite.', action_type='Search', query='Billy Howle', observation="Could not find [Billy Howle]. Similar: ['Billy Howle', 'On Chesil Beach (film)', 'MotherFatherSon', 'Howle', 'The Perfect Couple (TV series)', 'Infinite Storm', 'The Serpent (TV series)', 'Outlaw King', 'The Sense of an Ending (film)', 'Chloe (TV series)']", answer='', external_tool_info={'search_result': "Could not find [Billy Howle]. Similar: ['Billy Howle', 'On Chesil Beach (film)', 'MotherFatherSon', 'Howle', 'The Perfect Couple (TV series)', 'Infinite St

In [10]:
for step in outputs[4].additional_info:
    print(f"thought: {step.thought} \naction: {step.action_type} \nquery: {step.query} \nobservation: {step.observation}\n\n")

thought: I need to search for information about Billy Howle and the upcoming British drama film directed by Dominic Cooke to find out who he will be seen opposite. 
action: Search 
query: Billy Howle 
observation: Could not find [Billy Howle]. Similar: ['Billy Howle', 'On Chesil Beach (film)', 'MotherFatherSon', 'Howle', 'The Perfect Couple (TV series)', 'Infinite Storm', 'The Serpent (TV series)', 'Outlaw King', 'The Sense of an Ending (film)', 'Chloe (TV series)']


thought: I should search for the upcoming British drama film directed by Dominic Cooke instead. 
action: Search 
query: Dominic Cooke upcoming British drama film 
observation: The Gold is a British drama television series written by Neil Forsyth and co-produced by his Tannadice Pictures production label. It stars Hugh Bonneville, Dominic Cooper, Charlotte Spencer, Sean Harris, Jack Lowden and Tom Cullen and is a dramatisation of events around the Brink's-Mat robbery in 1983. It is directed by Aneil Karia and Lawrence Goug

Thoughts, actions, and observations all seem to be correct. The LLM does occasionally struggle with generating a plausible search query and tends to finish when it cannot find the answer directly from search.

In [13]:
for step in outputs[4].additional_info:
    print("<THOUGHT PROMPT =========================================>")
    print(step.thought_response.input_text)
    print("<THOUGHT PROMPT =========================================>")
    print("<THOUGHT OUTPUT =========================================>")
    print(repr(step.thought_response.output_text))
    print("<THOUGHT OUTPUT =========================================>")

    print("<ACTION PROMPT =========================================>")
    print(step.action_response.input_text)
    print("<ACTION PROMPT =========================================>")
    print("<ACTION OUTPUT =========================================>")
    print(repr(step.action_response.output_text))
    print("<ACTION OUTPUT =========================================>")

Solve a question answering task with interleaving Thought, Action, Observation steps. Thought can reason about the current situation, and Action can be three types: 
(1) Search[entity], which searches the exact entity on Wikipedia and returns the first paragraph if it exists. If not, it will return some similar entities to search.
(2) Lookup[keyword], which returns the next sentence containing keyword in the last passage successfully found by Search.
(3) Finish[answer], which returns the answer and finishes the task.
You have a maximum of 6 steps.

Here are some examples:
Question: What is the elevation range for the area that the eastern sector of the Colorado orogeny extends into?
Thought 1: I need to search Colorado orogeny, find the area that the eastern sector of the Colorado orogeny extends into, then find the elevation range of the area.
Action 1: Search[Colorado orogeny]
Observation 1: The Colorado orogeny was an episode of mountain building (an orogeny) in Colorado and surroun

The core implementation of ReAct is correct. A ReAct trajectory is composed of `n` steps, and each step is a triplet of thought, action, and observation strings. Having ruled out the core logic, we conclude that the underperformance of ReAct is due to the evaluation, the benchmark dataset and its size, and/or the effectiveness of the search/lookup tool.
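
The trajectory structure described above can be sketched as a list of step records. The `Step` dataclass below is a simplified, hypothetical stand-in for the run's `ReActStepOutput`, and `final_answer` assumes that a `Finish` action carries the answer in its `query` field.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    """Hypothetical stand-in for one thought-action-observation triplet."""
    thought: str
    action_type: str  # "Search", "Lookup", or "Finish"
    query: str        # action argument; assumed to hold the answer for Finish
    observation: str


def final_answer(trajectory: List[Step]) -> str:
    """Return the Finish argument of the last step, or '' if it never finished."""
    if trajectory and trajectory[-1].action_type.lower() == "finish":
        return trajectory[-1].query
    return ""


trajectory = [
    Step("I need to search for Billy Howle.", "Search", "Billy Howle",
         "Could not find [Billy Howle]."),
    Step("I cannot find the answer.", "Finish", "Information not found", ""),
]
print(repr(final_answer(trajectory)))      # the trajectory ended with Finish
print(repr(final_answer(trajectory[:1])))  # ran out of steps: empty answer
```

Under this assumption, a trajectory that uses up its step budget without a `Finish` action yields an empty answer, which matches the "no answer" category.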

Refer to `react_error_analysis_1.ipynb` for the analysis on `generous-sound-15`. The analysis on `generous-sound-15` is brief and the errors and error type counts are similar to `ethereal-firebrand-4`. 

Thus, the possible areas of improvement are:
- improved evaluation
  - We opt for LLM-as-a-judge for false negatives (for cases where EM returns False, we ask the LLM to judge whether the output is semantically similar to the ground truth answer; this judge outputs a score of 0 or 1)
- possible changes to the search tool
  - Our LangChain [GoogleSearchAPIWrapper](https://api.python.langchain.com/en/latest/utilities/langchain_community.utilities.google_search.GoogleSearchAPIWrapper.html) can be replaced with other search tools, possibly [GoogleSerperAPIWrapper](https://api.python.langchain.com/en/latest/utilities/langchain_community.utilities.google_serper.GoogleSerperAPIWrapper.html) or a search tool specifically for Wikipedia.
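
The improved evaluation described above can be sketched as EM first, with a judge consulted only when EM fails. This is a minimal sketch under stated assumptions, not the project's implementation: `em_with_judge`, the simplified `exact_match`, and `toy_judge` (a substring check standing in for a real LLM call) are all hypothetical.

```python
from typing import Callable


def exact_match(prediction: str, ground_truth: str) -> bool:
    """Simplified placeholder EM: case-insensitive string equality."""
    return prediction.strip().lower() == ground_truth.strip().lower()


def em_with_judge(
    prediction: str,
    ground_truth: str,
    judge: Callable[[str, str], int],
) -> int:
    """Score 1 if EM passes; otherwise defer to a judge that returns 0 or 1."""
    if exact_match(prediction, ground_truth):
        return 1
    return 1 if judge(prediction, ground_truth) == 1 else 0


def toy_judge(prediction: str, ground_truth: str) -> int:
    """Stand-in for an LLM judge: accept when one answer contains the other."""
    pred, truth = prediction.strip().lower(), ground_truth.strip().lower()
    if not pred:
        return 0  # an empty prediction is never accepted
    return int(pred in truth or truth in pred)


print(em_with_judge("Armenia", "Armenian", toy_judge))
print(em_with_judge("Willem Kieft", "Captain John Underhill", toy_judge))
```

Passing the judge in as a callable keeps the scoring logic testable without network access. A real judge would replace `toy_judge` with a prompt to an LLM that returns 0 or 1.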

Ablations of the search tool in our ReAct implementation will be covered later in our ablation studies.

# Inspecting ReAct Errors after Implementing LLM-as-a-Judge