## Gantry Setup

In [1]:
import gantry
import os
from gantry.llm import EvaluationRun
import gantry.dataset as gdataset

You haven't set dataset working directory yet, by default Gantry will use /Users/abhishyant/dtparse. You can overwrite the default settingusing set_working_directory method.


First, let's list out all our runs in the dataset.

In [2]:
eval_dataset = gdataset.get_dataset("evaluation_dataset")
eval_dataset.list_runs()

['run_2023_03_21_15_33_25',
 'run_2023_03_21_15_33_40',
 'run_2023_03_21_17_16_38',
 'run_2023_03_21_17_25_54 (bug)',
 'run_2023_03_21_17_26_11 (latest, fix)',
 'run_2023_03_21_17_17_05']

In [3]:
# TODO: change the run ids here from the output above
fix_run_id = 'run_2023_03_21_17_26_11'
bug_run_id = 'run_2023_03_21_17_25_54'
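Rather than copying ids by hand, the labels printed by `list_runs()` can be split into a run id and its tags. The following is a minimal sketch, assuming the labels always follow the `run_id (tag1, tag2)` shape shown above; `parse_run_label` and `run_id_for_tag` are hypothetical helpers written for this illustration and are not part of Gantry.

```python
import re

# Labels copied from the list_runs() output above.
run_labels = [
    "run_2023_03_21_15_33_25",
    "run_2023_03_21_15_33_40",
    "run_2023_03_21_17_16_38",
    "run_2023_03_21_17_25_54 (bug)",
    "run_2023_03_21_17_26_11 (latest, fix)",
    "run_2023_03_21_17_17_05",
]

# A run id, optionally followed by a parenthesised, comma-separated tag list.
LABEL_RE = re.compile(r"^(?P<run_id>\S+)(?:\s+\((?P<tags>[^)]*)\))?$")


def parse_run_label(label):
    """Split 'run_id (tag1, tag2)' into (run_id, [tags]). Hypothetical helper."""
    match = LABEL_RE.match(label.strip())
    if match is None:
        raise ValueError(f"unrecognised run label: {label!r}")
    raw_tags = match.group("tags") or ""
    tags = [tag.strip() for tag in raw_tags.split(",") if tag.strip()]
    return match.group("run_id"), tags


def run_id_for_tag(labels, tag):
    """Return the id of the first run carrying the given tag."""
    for label in labels:
        run_id, tags = parse_run_label(label)
        if tag in tags:
            return run_id
    raise KeyError(tag)


fix_run_id = run_id_for_tag(run_labels, "fix")
bug_run_id = run_id_for_tag(run_labels, "bug")
print(fix_run_id, bug_run_id)
```

If the label format differs in your Gantry version, the regular expression would need adjusting.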

The run listing shows one run tagged `bug` and one run tagged both `latest` and `fix`. We can instantiate a run using either its tag or its run identifier:

In [4]:
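# Option 1: look the run up by its tag.
fix_run = EvaluationRun.from_tag(eval_dataset, "fix")
# Option 2: look up the same run by its run id (this overwrites the variable above).
fix_run = EvaluationRun(eval_dataset, fix_run_id)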

Now we can compare this run's outputs against those of the run tagged `bug`.

In [5]:
fix_run.compare_to_runs(tags=["bug"])

Found cached dataset evaluation_dataset (/Users/abhishyant/.cache/huggingface/datasets/evaluation_dataset/default/1.0.0/29c7befd9cc42ed58a3d38362835c4a5e355c60dd7c20cfc3bb7e6ed0c93f998)


| | inputs.date | inputs.request | inputs.time | label | run_2023_03_21_17_26_11_output (latest, fix) | run_2023_03_21_17_25_54_output (bug) |
|---|---|---|---|---|---|---|
| 0 | 2023-02-23 | noon Sunday | 11:48:05 | 2023-02-26 12:00:00 | 2023-02-26 12:00:00 | 2023-02-23 12:00:00 |
| 1 | 2023-02-23 | noon last Sunday | 12:47:57 | 2023-02-19 12:00:00 | 2023-02-19 12:00:00 | 2023-02-24 12:00:00 |
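To see why the `label` values above are the expected answers, it helps to work through the date arithmetic: 2023-02-23 is a Thursday, so the upcoming Sunday is 2023-02-26 and the previous one is 2023-02-19. The sketch below reproduces both labels with the standard library. It is an illustration only, not the parser under evaluation; the function names are hypothetical, and reading "Sunday" as the upcoming Sunday is an assumption inferred from the labels.

```python
from datetime import date, datetime, time, timedelta

SUNDAY = 6  # date.weekday() numbers Monday as 0 and Sunday as 6


def upcoming_sunday_noon(today):
    """Noon on the next Sunday on or after `today` (hypothetical helper)."""
    days_ahead = (SUNDAY - today.weekday()) % 7
    return datetime.combine(today + timedelta(days=days_ahead), time(12, 0))


def previous_sunday_noon(today):
    """Noon on the most recent Sunday strictly before `today` (hypothetical helper)."""
    days_back = (today.weekday() - SUNDAY) % 7 or 7
    return datetime.combine(today - timedelta(days=days_back), time(12, 0))


request_date = date(2023, 2, 23)  # the inputs.date value in both rows
noon_sunday = str(upcoming_sunday_noon(request_date))
noon_last_sunday = str(previous_sunday_noon(request_date))
print(noon_sunday)
print(noon_last_sunday)
```

Against these values, the `bug` run returns the request date itself for the first row and the following day for the second, neither of which is a Sunday.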


It can be difficult to spot the differences between outputs in a side-by-side comparison. To get a clearer diff, we can use the `diff_outputs` method.

In [6]:
fix_run.diff_outputs(tag="bug")

Found cached dataset evaluation_dataset (/Users/abhishyant/.cache/huggingface/datasets/evaluation_dataset/default/1.0.0/29c7befd9cc42ed58a3d38362835c4a5e355c60dd7c20cfc3bb7e6ed0c93f998)


Inputs:
{'date': '2023-02-23', 'request': 'noon Sunday', 'time': '11:48:05'}
-----------------
*** run_2023_03_21_17_26_11_output (latest, fix)
--- run_2023_03_21_17_25_54_output (bug)
***************
*** 1 ****
! 2023-02-26 12:00:00
--- 1 ----
! 2023-02-23 12:00:00

Inputs:
{'date': '2023-02-23', 'request': 'noon last Sunday', 'time': '12:47:57'}
-----------------
*** run_2023_03_21_17_26_11_output (latest, fix)
--- run_2023_03_21_17_25_54_output (bug)
***************
*** 1 ****
! 2023-02-19 12:00:00
--- 1 ----
! 2023-02-24 12:00:00
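The layout above (`***` / `---` headers, `! ` markers on changed lines) matches the context-diff format produced by Python's standard `difflib` module. Whether `diff_outputs` uses `difflib` internally is an assumption, but the sketch below shows how the same text can be generated for the first row, which may help when reading or post-processing these diffs. `context_diff_lines` is a hypothetical helper.

```python
import difflib


def context_diff_lines(left, right, left_name, right_name):
    """Return a context diff of two output strings as a list of lines."""
    return list(
        difflib.context_diff(
            left.splitlines(),
            right.splitlines(),
            fromfile=left_name,
            tofile=right_name,
            lineterm="",  # inputs carry no newlines, so do not add any to headers
        )
    )


lines = context_diff_lines(
    "2023-02-26 12:00:00",
    "2023-02-23 12:00:00",
    "run_2023_03_21_17_26_11_output (latest, fix)",
    "run_2023_03_21_17_25_54_output (bug)",
)
print("\n".join(lines))
```

Lines prefixed with `! ` are those that changed between the two runs; the `*** 1 ****` and `--- 1 ----` markers give the line range in each output.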



Finally, we can view the metrics available for our run:

In [7]:
fix_run.metrics

Found cached dataset evaluation_dataset (/Users/abhishyant/.cache/huggingface/datasets/evaluation_dataset/default/1.0.0/29c7befd9cc42ed58a3d38362835c4a5e355c60dd7c20cfc3bb7e6ed0c93f998)


{'exact_match': 1.0}
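A score of 1.0 is consistent with both `fix` outputs matching their labels. As a rough sketch, assuming `exact_match` is the fraction of outputs that are string-identical to the label (Gantry's exact definition is not shown here), the value can be recomputed from the comparison table. The `exact_match` function below is a hypothetical stand-in, not Gantry's implementation.

```python
# Values copied from the comparison table above.
labels = ["2023-02-26 12:00:00", "2023-02-19 12:00:00"]
fix_outputs = ["2023-02-26 12:00:00", "2023-02-19 12:00:00"]
bug_outputs = ["2023-02-23 12:00:00", "2023-02-24 12:00:00"]


def exact_match(outputs, expected):
    """Fraction of outputs identical to the expected label (assumed definition)."""
    if len(outputs) != len(expected):
        raise ValueError("outputs and expected must have the same length")
    if not expected:
        return 0.0
    hits = sum(out == exp for out, exp in zip(outputs, expected))
    return hits / len(expected)


fix_score = exact_match(fix_outputs, labels)
bug_score = exact_match(bug_outputs, labels)
print({"fix": fix_score, "bug": bug_score})
```

Under this assumed definition, the `bug` run would score 0.0 on the same two rows.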

### Misc Commands

We also support a couple of other ways of comparing runs. First, we can compare by run id directly:

In [8]:
fix_run.compare_to_runs(run_ids=[bug_run_id])

Found cached dataset evaluation_dataset (/Users/abhishyant/.cache/huggingface/datasets/evaluation_dataset/default/1.0.0/29c7befd9cc42ed58a3d38362835c4a5e355c60dd7c20cfc3bb7e6ed0c93f998)


| | inputs.date | inputs.request | inputs.time | label | run_2023_03_21_17_26_11_output (latest, fix) | run_2023_03_21_17_25_54_output (bug) |
|---|---|---|---|---|---|---|
| 0 | 2023-02-23 | noon Sunday | 11:48:05 | 2023-02-26 12:00:00 | 2023-02-26 12:00:00 | 2023-02-23 12:00:00 |
| 1 | 2023-02-23 | noon last Sunday | 12:47:57 | 2023-02-19 12:00:00 | 2023-02-19 12:00:00 | 2023-02-24 12:00:00 |


We can also compare against every run in the dataset by calling `compare_to_runs` without any parameters.

In [9]:
fix_run.compare_to_runs()

Found cached dataset evaluation_dataset (/Users/abhishyant/.cache/huggingface/datasets/evaluation_dataset/default/1.0.0/29c7befd9cc42ed58a3d38362835c4a5e355c60dd7c20cfc3bb7e6ed0c93f998)

| | inputs.date | inputs.request | inputs.time | label | run_2023_03_21_17_26_11_output (latest, fix) | run_2023_03_21_15_33_25 | run_2023_03_21_15_33_40 | run_2023_03_21_17_16_38 | run_2023_03_21_17_25_54_output (bug) | run_2023_03_21_17_17_05 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2023-02-23 | noon Sunday | 11:48:05 | 2023-02-26 12:00:00 | 2023-02-26 12:00:00 | False | False | 2023-02-23 12:00:00 | 2023-02-23 12:00:00 | 2023-02-26 12:00:00 |
| 1 | 2023-02-23 | noon last Sunday | 12:47:57 | 2023-02-19 12:00:00 | 2023-02-19 12:00:00 | False | False | 2023-02-24 12:00:00 | 2023-02-24 12:00:00 | 2023-02-19 12:00:00 |


Finally, we can pass a run id to the `diff_outputs` method instead of a tag:

In [10]:
fix_run.diff_outputs(run_id=bug_run_id)

Found cached dataset evaluation_dataset (/Users/abhishyant/.cache/huggingface/datasets/evaluation_dataset/default/1.0.0/29c7befd9cc42ed58a3d38362835c4a5e355c60dd7c20cfc3bb7e6ed0c93f998)


Inputs:
{'date': '2023-02-23', 'request': 'noon Sunday', 'time': '11:48:05'}
-----------------
*** run_2023_03_21_17_26_11_output (latest, fix)
--- run_2023_03_21_17_25_54_output (bug)
***************
*** 1 ****
! 2023-02-26 12:00:00
--- 1 ----
! 2023-02-23 12:00:00

Inputs:
{'date': '2023-02-23', 'request': 'noon last Sunday', 'time': '12:47:57'}
-----------------
*** run_2023_03_21_17_26_11_output (latest, fix)
--- run_2023_03_21_17_25_54_output (bug)
***************
*** 1 ****
! 2023-02-19 12:00:00
--- 1 ----
! 2023-02-24 12:00:00

