
Refactor & standardize evaluation with Evaluator #287

Merged · 107 commits into main from evaluator · Jun 6, 2024
Conversation

@vict0rsch (Collaborator) commented Feb 15, 2024

  • remove test logic from the main GFlowNetAgent
  • rename the test logic to eval() so it is semantically distinct from (unit, integration, etc.) tests
  • clean up evaluation return values (a dict instead of a messy tuple)
  • fix the logger vs. evaluator roles
  • isolate the plotting logic

Check out tutorial and docs @ https://gflownet.readthedocs.io/en/evaluator

Questions / Need help

  1. clean up BaseEvaluator.compute_density_metrics
  2. define / discuss the behaviour of .eval() in Scenario 1
  3. prevent the buffer from systematically writing files (very annoying when evaluating a trained gfn)
  4. update all existing experiment configs?
  5. use utils in test__gflownet_minimal_runs? (for instance gflownet_for_tests in conftest.py, or at least common.py:gflownet_from_config())

Scenario 1

$ python main.py user=$USER +experiments=scrabble/jay.yaml logger.do.online=False evaluator.metrics=\'l1,kl,jsd\' evaluator.checkpoints_period=5
from gflownet.evaluator.base import BaseEvaluator

# then, in Python: point to the output directory of the run launched above
gfn_run_dir = "path to previous dir"

gfne = BaseEvaluator.from_dir(gfn_run_dir)

results = gfne.eval()                # compute the configured metrics
figs = gfne.plot(**results["data"])  # build figures from the returned eval data

Scenario 2

$ python main.py user=$USER +experiments=icml23/ctorus device=cpu logger.do.online=False evaluator.checkpoints_period=20 
from gflownet.evaluator.base import BaseEvaluator
gfn_run_dir = "path to previous dir"

gfne = BaseEvaluator.from_dir(gfn_run_dir)

results = gfne.eval()
figs = gfne.plot(**results["data"])

for f, fig in figs.items():
    fig.savefig(f"{f}.pdf")

@vict0rsch (Collaborator, Author) commented:

@alexhernandezgarcia what was the reason for storing metrics as attributes of the GFlowNetAgent? Can I safely remove this procedure from the evaluation / logging logic? Or would there be side effects somewhere, in your opinion? It looks ok from the code, but I'm checking with you.

@alexhernandezgarcia (Owner) commented:

> @alexhernandezgarcia what was the reason for storing metrics as attributes of the GFlowNetAgent? Can I safely remove this procedure from the evaluation / logging logic? Or would there be side effects somewhere, in your opinion? It looks ok from the code, but I'm checking with you.

No important reason. I think Nikita started it that way at some point and we just never followed up on it. Like I said, we should not feel restricted by the way things are currently done.

@vict0rsch (Collaborator, Author) commented Feb 21, 2024

Controversial change @alexhernandezgarcia @michalkoziarski @carriepl @josephdviviano @AlexandraVolokhova thoughts?

5581e71 (#287)

If you agree this should be used in the eval_gflownet.py script and in tests.env.common::BaseTestsCommon.test__gflownet_minimal_runs

@alexhernandezgarcia (Owner) commented:

> Controversial change @alexhernandezgarcia @michalkoziarski @carriepl @josephdviviano @AlexandraVolokhova thoughts?
>
> 5581e71 (#287)
>
> If you agree this should be used in the eval_gflownet.py script and in tests.env.common::BaseTestsCommon.test__gflownet_minimal_runs

I am not against using a method to hide all the complexity in the tests or in the evaluation script. However, I think it is good that main.py shows explicitly the components that are needed and what depends on what.

Also, as a separate comment: I would try, if possible, not to include too many changes in a single PR, especially if they are out of the scope of the PR. In other words, I would try to spin off changes like this one into a separate PR. It's just to make the review process a tad easier.

@vict0rsch (Collaborator, Author) commented:

>> Controversial change @alexhernandezgarcia @michalkoziarski @carriepl @josephdviviano @AlexandraVolokhova thoughts?
>> 5581e71 (#287)
>> If you agree this should be used in the eval_gflownet.py script and in tests.env.common::BaseTestsCommon.test__gflownet_minimal_runs
>
> I am not against using a method to hide all the complexity in the tests or in the evaluation script. However, I think it is good that main.py shows explicitly the components that are needed and what depends on what.
>
> Also, as a separate comment: I would try, if possible, not to include too many changes in a single PR, especially if they are out of the scope of the PR. In other words, I would try to spin off changes like this one into a separate PR. It's just to make the review process a tad easier.

Sure, I'll move that to another PR to merge after this one, then.

@vict0rsch (Collaborator, Author) commented:

@alexhernandezgarcia I'm following these steps: https://stackoverflow.com/a/30893291/3867406 -> have you pulled or fetched from THIS branch (evaluator)?

If I edit the commit history on MY machine and then FORCE-push to GitHub while you have a local version of evaluator that has diverged, we may end up blowing everything up. In that case, I suggest we just keep the offending commit (5581e71 (#287)).

@alexhernandezgarcia (Owner) commented:

> @alexhernandezgarcia I'm following these steps: https://stackoverflow.com/a/30893291/3867406 -> have you pulled or fetched from THIS branch (evaluator)?
>
> If I edit the commit history on MY machine and then FORCE-push to GitHub while you have a local version of evaluator that has diverged, we may end up blowing everything up. In that case, I suggest we just keep the offending commit (5581e71 (#287)).

You can go ahead without impact on my work / local copies.

@josephdviviano (Collaborator) commented:

> Controversial change @alexhernandezgarcia @michalkoziarski @carriepl @josephdviviano @AlexandraVolokhova thoughts?
>
> 5581e71 (#287)
>
> If you agree this should be used in the eval_gflownet.py script and in tests.env.common::BaseTestsCommon.test__gflownet_minimal_runs

I can't see what this controversial change is

@vict0rsch (Collaborator, Author) commented:

>> Controversial change @alexhernandezgarcia @michalkoziarski @carriepl @josephdviviano @AlexandraVolokhova thoughts?
>> 5581e71 (#287)
>> If you agree this should be used in the eval_gflownet.py script and in tests.env.common::BaseTestsCommon.test__gflownet_minimal_runs
>
> I can't see what this controversial change is

Great, that means I reverted the commit appropriately :p

I think we should not have multiple places in the code that instantiate a GFlowNetAgent from a Hydra config. That's why I created gflownet.utils.common::gflownet_from_config(). I was suggesting to use it in main.py too, but Alex suggested we not do that in this PR, to avoid mixing things up (and he seems against it for main.py anyway, though I think I can convince him 😄).
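For illustration, a rough sketch of how such a single entry point could be used, assuming gflownet_from_config() takes a loaded Hydra/OmegaConf config (the exact signature lives in gflownet/utils/common.py, and the config path below is hypothetical):

from omegaconf import OmegaConf
from gflownet.utils.common import gflownet_from_config

# Hypothetical: load the Hydra config saved alongside a previous run
config = OmegaConf.load("path to previous dir/.hydra/config.yaml")

# Build the GFlowNetAgent and its dependencies from that single config
gflownet = gflownet_from_config(config)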

@josephdviviano (Collaborator) commented:

Without full context, I agree with @vict0rsch - this sounds like something that should be one and done.

for name, metric in results["metrics"].items():
    print(f"{name:20}: {metric:.4f}")

data = results.get("data", {})


Note to self: understand why the argument {} is needed.
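A minimal illustration of why the {} default matters, assuming the results dict may not always contain a "data" key:

results = {"metrics": {"l1": 0.01}}  # hypothetical: eval() returned no "data" entry
data = results.get("data", {})       # {} instead of a KeyError from results["data"]
print(data)                          # -> {}
# A later call like gfne.plot(**data) then receives no keyword arguments
# instead of crashing before it is even reached.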

carriepl and others added 3 commits on May 30, 2024, co-authored by Alex <alexhg15@gmail.com>.
@@ -51,21 +52,21 @@ policy:
shared_weights: False
checkpoint: backward

# Evaluator
@carriepl (Collaborator) commented May 29, 2024:


The format for these evaluator arguments is different than in the icml23/ctorus.yaml config file. Is that a problem?


I am unsure what this comment refers to exactly, but I would just say that the icml23/ctorus.yaml file is really old (January 2023) so it would be fine to deprecate it / adapt it if needed. Yes, it contains the experiments of a paper, but I believe it's ok to adapt it to the new state of the repo.



# def setup(sphinx):
# sphinx.connect("autoapi-skip-member", skip_util_classes)

Not sure what this part is meant to do. Is that something outdated that should be removed from the PR? Or is this a work in progress that should be finished and then uncommented?

@carriepl (Collaborator) commented Jun 4, 2024

Alright, at this point:

  • I think that the conflicts should be sorted out
  • the tests, black and isort are happy
  • I've done the first 3 out of 5 sanity check runs and they look great
  • I have not done the changes to the logger outlined by @vict0rsch. That will be the next step.

@alexhernandezgarcia (Owner) commented:

Thanks Pierre Luc!

> • I have not done the changes to the logger outlined by @vict0rsch. That will be the next step.

Could this be done in a new PR or should it be done before merging?

@carriepl (Collaborator) commented Jun 5, 2024

Interesting... the CI is currently failing because of a test that was passing before my last commit, which only changes a comment. I guess this is a test that fails very infrequently. At this point, I don't think it's related to this PR, but I could be wrong.

@alexhernandezgarcia (Owner) commented:

> Interesting... the CI is currently failing because of a test that was passing before my last commit, which only changes a comment. I guess this is a test that fails very infrequently. At this point, I don't think it's related to this PR, but I could be wrong.

Don't worry. I am pretty sure this is related to tests of the Batch class that, in this branch, still use torch.equal but would be fine with torch.isclose. All this has been changed in the famous big PR.
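Not the actual Batch tests, but a minimal sketch of the difference between the two comparisons:

import torch

a = torch.tensor([0.1, 0.2, 0.3])
b = (a / 7.0) * 7.0  # a floating-point round trip may introduce tiny errors

print(torch.equal(a, b))          # may be False: requires bit-exact equality
print(torch.isclose(a, b).all())  # True: tolerates small numerical differences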

@alexhernandezgarcia (Owner) left a review:


I had to make a few additional changes after checking the sanity runs (Tetris topK figures were missing). The solution is a bit of a quick fix.

I have realised that a bunch of things will need more work, but this is a great step forward, since the Evaluator will give us the flexibility to extend the evaluation without having to make ugly additions to the former test() function of the GFN.

I have added a couple of quick issues about things that are needed as a reminder.

Great work everyone!!! I will merge.

@alexhernandezgarcia merged commit 2321aa4 into main on Jun 6, 2024 (1 check failed).
@alexhernandezgarcia deleted the evaluator branch on June 6, 2024 at 01:15.