In [1]:
import sys
sys.path.append('../')
from utils.notebook_imports import *
from tactic_and_execution.thought_action_observation_data_structure import *
from traj_runner import TrajRunner
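from collections import Counter
from functools import partial  # used below to build the request func; explicit in case the star import does not provide it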
from utils.llm_requests import LLMRequestManager

# In this notebook, we show how to run trajectories and evaluate them

### first make sure you download the standalone and hybrid data from our HF [repo](https://huggingface.co/datasets/yuan-yang/ReWild)
### then put them under the `data` folder

## Now let's load it

In [2]:
strain = jload('../data/standalone_train.json')
stest = jload('../data/standalone_test.json')

### as you can see, each `train` and `test` entry contains the following information about the problem: `input`, `label`, `src`, `dataset`, `orig_data` (the data in its original form), and `gt_traj` (the ground-truth trajectories)

In [3]:
print('=== train ===')
display(Counter([k for e in strain for k in e]))
display(Counter([e['dataset'] for e in strain]))
print('=== test ===')
display(Counter([k for e in stest for k in e]))
display(Counter([e['dataset'] for e in stest]))

=== train ===


Counter({'input': 4303,
         'label': 4303,
         'src': 4303,
         'dataset': 4303,
         'orig_data': 4303,
         'gt_traj': 4303})

Counter({'folio': 928, 'gsm8k': 1213, 'reclor': 1099, 'proscript': 1063})

=== test ===


Counter({'input': 938,
         'label': 938,
         'src': 938,
         'dataset': 938,
         'orig_data': 938,
         'gt_traj': 938})

Counter({'folio': 192, 'gsm8k': 234, 'reclor': 235, 'proscript': 277})

## let's run a mini experiment that generates 5 trajectories with gpt4o

### first you need to init the `LLMRequestManager`, which handles requests to all API LLMs
### you need to pass it the path to the API key file, where every line has the form `<type>: <key>` and `<type>` is one of the following: `openai`, `gemini`, `mistral`, `anthropic`, `cohere`
### the order does not matter, and it's ok to skip the ones you don't have
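To make the key file format above concrete, here is a minimal sketch of a parser for `<type>: <key>` lines. The helper `parse_api_keys` is hypothetical and written only for illustration; it is not the code `LLMRequestManager` uses, and the key values shown are placeholders.

```python
from typing import Dict, Iterable

# Provider types listed in the text above.
KNOWN_TYPES = {"openai", "gemini", "mistral", "anthropic", "cohere"}


def parse_api_keys(lines: Iterable[str]) -> Dict[str, str]:
    """Parse `<type>: <key>` lines into a {type: key} dict (hypothetical helper)."""
    keys: Dict[str, str] = {}
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # blank lines are ignored
        # Split on the first colon only, so a key containing ':' stays intact.
        provider, sep, key = line.partition(":")
        provider, key = provider.strip().lower(), key.strip()
        if not sep or provider not in KNOWN_TYPES or not key:
            raise ValueError(f"unrecognized key line for type {provider!r}")
        keys[provider] = key
    return keys


# Order does not matter, and providers you do not have can simply be left out.
example_lines = ["openai: sk-placeholder", "", "cohere: co-placeholder"]
print(parse_api_keys(example_lines))
```

A file with only one or two of the five types parses fine under this sketch, which matches the note that missing providers can be skipped.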

In [4]:
manager = LLMRequestManager('../llm-api-keys.txt')

In [5]:
# you can check which request functions are available (one per provider with a key) with
manager.available_request_funcs

{<bound method LLMRequestManager.claude_request of <utils.llm_requests.LLMRequestManager object at 0x7f5047a9ca90>>,
 <bound method LLMRequestManager.cohere_request of <utils.llm_requests.LLMRequestManager object at 0x7f5047a9ca90>>,
 <bound method LLMRequestManager.gemini_request of <utils.llm_requests.LLMRequestManager object at 0x7f5047a9ca90>>,
 <bound method LLMRequestManager.gpt_request of <utils.llm_requests.LLMRequestManager object at 0x7f5047a9ca90>>,
 <bound method LLMRequestManager.mistral_request of <utils.llm_requests.LLMRequestManager object at 0x7f5047a9ca90>>}

### the manager provides a unified interface for all models through the function `default_request`
### for example, you can send an ordinary request like this

In [6]:
resp = manager.default_request(
    model=manager.model_gpt35, # check all supported models in the LLMRequestManager class!
    messages=[
        {'role': 'user', 'content': 'hello, what type of llm model are you?'}
    ],
    return_messages=False
)
resp

"I am an AI digital assistant, so I don't have a specific model type like an LLM (Large Language Model). I am trained on a variety of language models and algorithms to assist with various tasks and provide information. How can I assist you today?"

## All trajectory-running functionality is handled by the `TrajRunner` class

In [7]:
runner = TrajRunner(tactics_dir='../data/tactics')

### the `datasetname2dataset_traj_info` object in the runner stores the key information about how to observe trajectories for each dataset

In [8]:
runner.datasetname2dataset_traj_info

{'reclor': DatasetTrajInfo(tactic_name='any_program', prog_action_name='Write program', observer_class=<class 'tactic_and_execution.any_program_observer.AnyProgramObserver'>, obs_feedback_kwargs={'gt_answer': 'SAMPLE_LABEL', 'answer_set': ['1', '2', '3', '4']}),
 'gsm8k': DatasetTrajInfo(tactic_name='math', prog_action_name='Build math model', observer_class=<class 'tactic_and_execution.math_observer.MathObserver'>, obs_feedback_kwargs={'gt_answer': [<class 'functools.partial'>, <function MathObserver.check_answer at 0x7f504aa40ee0>, 'SAMPLE_LABEL']}),
 'proscript': DatasetTrajInfo(tactic_name='graph', prog_action_name='Build graph model', observer_class=<class 'tactic_and_execution.graph_observer.GraphObserver'>, obs_feedback_kwargs={'gt_answer': [<class 'functools.partial'>, <function GraphObserver.check_answer at 0x7f504aa5e0d0>, 'SAMPLE_LABEL', 0.6]}),
 'folio': DatasetTrajInfo(tactic_name='predicate_logic_z3', prog_action_name='Build FOL model', observer_class=<class 'tactic_and_e

### when running, it reads the dataset name from `sample['dataset']` and loads the corresponding observer and answer-checking functions
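The lookup described above can be pictured as a plain dictionary dispatch keyed on `sample['dataset']`. The sketch below is a toy stand-in, not the real `DatasetTrajInfo` or observer classes: `ToyTrajInfo`, `TOY_REGISTRY`, `exact_match`, `numeric_match`, and `check_sample` are all hypothetical names, and only the tactic names are taken from the output shown earlier.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class ToyTrajInfo:
    """Toy stand-in for per-dataset trajectory info (hypothetical)."""
    tactic_name: str
    check_answer: Callable[[str, str], bool]


def exact_match(pred: str, gold: str) -> bool:
    # Suits multiple-choice labels such as '1'..'4'.
    return pred.strip() == gold.strip()


def numeric_match(pred: str, gold: str) -> bool:
    # Suits numeric answers; non-numeric predictions count as wrong.
    try:
        return abs(float(pred) - float(gold)) < 1e-6
    except ValueError:
        return False


# Assumed registry layout: dataset name -> info object.
TOY_REGISTRY: Dict[str, ToyTrajInfo] = {
    "reclor": ToyTrajInfo("any_program", exact_match),
    "gsm8k": ToyTrajInfo("math", numeric_match),
}


def check_sample(sample: dict, prediction: str) -> bool:
    """Pick the checker from the sample's dataset name and apply it."""
    info = TOY_REGISTRY[sample["dataset"]]
    return info.check_answer(prediction, str(sample["label"]))


print(check_sample({"dataset": "gsm8k", "label": "8"}, "8.0"))
print(check_sample({"dataset": "reclor", "label": "2"}, "3"))
```

Keeping the per-dataset behavior in one registry means the running loop never needs dataset-specific branches; supporting a new dataset only means adding one entry.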

### you run the trajs with the `run_all_trajs` function; to call it you need to prepare two things
1. a `request_func`
2. an ICL prompt dict

In [9]:
# prepare a request func using model gpt4o
partial_request_func = partial(manager.default_request, model=manager.model_gpt4o, return_messages=True)

In [10]:
# load the icl prompt
icl_prompt_dict = jload('../data/icl_prompt_dict.json')
[k for k in icl_prompt_dict]

['folio',
 'gsm8k',
 'reclor',
 'proscript',
 'hyb',
 'folio_baby',
 'gsm8k_baby',
 'proscript_baby',
 'reclor_baby']

### the `icl_prompt_dict` contains the ICL examples for running standalone and routing trajectories; they are included in the init prompt to guide the LLMs (you can construct your own, of course). To avoid blowing up your budget and the context window, I use two-stage prompting: for the first `use_icl_before_round` rounds we use the ICL prompt, and after that we use a simplified prompt (`baby`) to guide the model to follow the format; see the details in our paper's appendix.
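As a rough illustration of the two-stage scheme, the sketch below picks which `icl_prompt_dict` key to use in a given round. The function `select_prompt_key` is hypothetical, and two details are assumptions rather than facts about `TrajRunner`: rounds are counted from 0, and no extra prompt is used when the baby prompt is disabled.

```python
from typing import Optional


def select_prompt_key(
    dataset: str,
    round_idx: int,
    use_icl_before_round: int = 2,
    use_baby_prompt_after_icl: bool = True,
) -> Optional[str]:
    """Return the prompt-dict key for this round (hypothetical helper)."""
    # Stage 1: full ICL prompt for the first `use_icl_before_round` rounds
    # (assumption: rounds are 0-indexed).
    if round_idx < use_icl_before_round:
        return dataset
    # Stage 2: the shorter '<dataset>_baby' prompt that only reminds the
    # model of the expected format.
    if use_baby_prompt_after_icl:
        return f"{dataset}_baby"
    # Assumption: with the baby prompt disabled, no guidance prompt is added.
    return None


for r in range(4):
    print(r, select_prompt_key("gsm8k", r))
```

With `use_icl_before_round=2`, rounds 0 and 1 map to `gsm8k` and later rounds to `gsm8k_baby`; both keys appear in the dict listed above.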

## now let's run a mini experiment on 5 gsm8k questions

In [11]:
mini_gtest = [e for e in stest if e['dataset'] == 'gsm8k'][:5]

In [12]:
runner.run_all_trajs(
        data=mini_gtest,
        save_key='traj_eval',
        request_func=partial_request_func,
        filter_func=None, # you can pass additional func to not generate for some examples
        use_icl_before_round=2,
        max_n_requests=7,
        max_n_consecutive_wrong_actions=3,
        max_n_wrong_actions=10, # above max_n_requests, so effectively we don't terminate by this condition
        verbose=True,
        give_traj_to_request_func=False, # you will set it True when running Tiger model
        save_path='../logs/mini_traj_eval.json',
        tqdm_func=nbtqdm,
        fuzzy_parse=True,
        stop_on_answer=True,
        use_baby_prompt_after_icl=True, # whether to use baby prompt after icl prompt round
        hyb_run=False, # set it to True if you are running hybrid data
        icl_prompt_dict=icl_prompt_dict,
        sub_traj_init_kwargs={'use_icl_before_round': 2}, # hyb_run subtraj init kwargs, not used here
        sub_traj_run_kwargs={'give_traj_to_request_func': False}, # hyb_run subtraj run kwargs, not used here
    )

  0%|          | 0/5 [00:00<?, ?it/s]

=== response === status: exec wrong

### Thought
Given that this is a simple arithmetic problem involving the calculation of the candle's length reduction over a specific time period, I'll create a plan to solve it using basic arithmetic operations with Python. The candle melts by 2 centimeters every hour, and we need to find out how much shorter it will be after burning from 1:00 PM to 5:00 PM, which is a duration of 4 hours.

### Action
## Name
Plan
## Input
the problem given
## Output
I'll calculate the total reduction in the candle's length by multiplying the rate of melting per hour by the number of hours the candle burns. Below is the Python script to do this:

```python
# Constants
melt_rate_per_hour = 2  # centimeters per hour
burn_duration_hours = 5 - 1  # from 1:00 PM to 5:00 PM, which is 4 hours

# Calculate total reduction in candle length
total_reduction = melt_rate_per_hour * burn_duration_hours

print(total_reduction)
```

Next, I'll implement this math model accord

## now that the run has finished, you can inspect the trajs following the examples in `inspect-traj.ipynb`

## to evaluate the trajs you will use the functions from `metrics/eval_trajs.py`

In [19]:
from metrics.eval_trajs import calc_standalone_metric, calc_hyb_metric

In [20]:
calc_standalone_metric(mini_gtest, runner)

gsm8k 0 5 0.00
gsm8k 0 5 acc_withprog 0.00
gsm8k 0 5 acc_withprogwithbleu 0.00
gsm8k codebleu 0.40
gsm8k 5 5 acc_soft 100.00
gsm8k 4 5 acc_soft_withprog 80.00
gsm8k 4 5 acc_soft_withprog_withbleu 80.00
all 5 5 100.00
all 4 5 acc_withprog 80.00
all 4 5 acc_withprogwithbleu 80.00
all codebleu 0.40
0.00 0.00 0.00 0.40 100.00 80.00 80.00 0.40 100.00 80.00 80.00 0.40
gsm8k Wrong Ans. 0
gsm8k Runtime error 0
gsm8k Wrong Format 5
folio Wrong Ans. 0
folio Runtime error 0
folio Wrong Format 0
proscript Wrong Ans. 0
proscript Runtime error 0
proscript Wrong Format 0
reclor Wrong Ans. 0
reclor Runtime error 0
reclor Wrong Format 0
all Wrong Ans. 0
all Runtime error 0
all Wrong Format 5
0 0 5 0 0 0 0 0 0 0 0 0 0 0 5
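The count-style lines in the output above appear to follow the layout `<name> <n_hit> <n_total> [<metric>] <percent>`, for example `gsm8k 4 5 acc_soft_withprog 80.00`. This reading is inferred from the printed numbers only, not from the source of `calc_standalone_metric`. Under that assumption, a minimal sketch of how such a line could be produced is below; `format_metric_line` is a hypothetical helper.

```python
def format_metric_line(name: str, n_hit: int, n_total: int, metric: str = "") -> str:
    """Format one count-style metric line (hypothetical helper).

    Assumed layout: `<name> <n_hit> <n_total> [<metric>] <percent>`.
    """
    # Percentage of hits, guarded against an empty dataset.
    pct = 100.0 * n_hit / n_total if n_total else 0.0
    parts = [name, str(n_hit), str(n_total)]
    if metric:
        parts.append(metric)
    parts.append(f"{pct:.2f}")
    return " ".join(parts)


print(format_metric_line("gsm8k", 4, 5, "acc_soft_withprog"))
print(format_metric_line("gsm8k", 0, 5))
```

Read this way, 4 of the 5 gsm8k samples count as correct under the soft, with-program criterion, while none count under the strict one.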


## Alternatively, the above process is implemented in `traj_infer.py`, and you can refer to the scripts in the `scripts` folder to see how to run the full inference