
Example Guide: Benchmark on Saved Models #92

Merged: 3 commits merged into main on Mar 7, 2024

Conversation

@maxmynter (Collaborator) commented Feb 26, 2024

This PR introduces artifact handling, alongside a guide and an example for these artifacts. The example concerns a bigger model than the other examples we have written, so I believe it can serve as a nice case study as well. If you see any clunkiness w.r.t. the nnbench usage, that can be a pointer to (a) improve the example, or (b) add nnbench functionality.

Artifact Class

  • Extend the Artifact class and move the loading logic to an ArtifactLoader, to separate the concerns of loading and deserializing.
  • Add an ArtifactCollection wrapper around an iterable to iterate through artifacts (a rough sketch follows below).
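
A rough sketch of how this separation could look; the three class names come from this PR, while the concrete method names and signatures are assumptions rather than the merged code:

# Illustrative sketch of the separation of concerns described above; the
# class layout follows the PR description, but method names and signatures
# beyond the three class names are assumptions, not the merged implementation.
import os
from abc import ABC, abstractmethod
from typing import Any, Generic, Iterable, Iterator, TypeVar

T = TypeVar("T")


class ArtifactLoader(ABC):
    """Obtains an artifact (e.g. by downloading it) and returns a local filepath."""

    @abstractmethod
    def load(self) -> str | os.PathLike[str]: ...


class Artifact(ABC, Generic[T]):
    """Deserializes an artifact from local disk, given a loader."""

    def __init__(self, loader: ArtifactLoader) -> None:
        self.loader = loader
        self._value: T | None = None

    @abstractmethod
    def deserialize(self) -> None:
        """Populate self._value from the path returned by self.loader.load()."""

    def value(self) -> T:
        # lazy: only deserialize on first access
        if self._value is None:
            self.deserialize()
        return self._value


class ArtifactCollection:
    """Thin wrapper around an iterable of artifacts for convenient iteration."""

    def __init__(self, artifacts: Iterable[Artifact[Any]]) -> None:
        self._artifacts = list(artifacts)

    def __iter__(self) -> Iterator[Artifact[Any]]:
        return iter(self._artifacts)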

Other nnbench changes

  • nnbench.runner can now take every context type that the Context.update method can handle. This is useful if someone wants to add ad-hoc values, either as a dict or as a new Context, to a runner's context creation (see the usage sketch after this list).
  • (bugfix) The base reporter can now also handle the case where no custom formatters are supplied.
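
A hypothetical usage sketch of the relaxed context argument (the exact accepted types were revisited later in this review, so treat this as an illustration, not the final API):

# Hypothetical usage sketch of the relaxed `context` argument shown in the
# review diff further below; the accepted types were revisited later in this
# review, so this is an illustration rather than the final API.
from typing import Any

from nnbench.runner import BenchmarkRunner


def dataset_info() -> dict[str, Any]:
    # a plain ContextProvider: a callable returning a dict of context values
    return {"dataset": "conll2003"}


runner = BenchmarkRunner()
record = runner.run(
    "benchmark.py",                         # path to the benchmark module
    params={"model_path": "models/run-1"},  # illustrative parameter only
    # mix of a provider and an ad-hoc dict, as enabled by this PR
    context=[dataset_info, {"valdata": "conll2003-validation"}],
)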

Example

The example is a token classification task. It is mostly an adaptation from here. The logic is in examples/on_saved/src/training/training.py.

When people want to recreate the model training themselves, they can copy it into a free GPU runtime on e.g. Google Colab, install the dependencies (!pip install transformers[torch] datasets), and run !python training.py. After about 5 minutes, they have the necessary files ready for download. Alternatively, they can get a model directly from Hugging Face (e.g. this).

Evaluation is executed by running python runner.py <path-to-model-folder-1> <path-2> ....

The benchmarks are in benchmark.py: accuracy, precision, recall, and F1, both averaged and per label, plus some model metadata. We cache the evaluation processing where appropriate and use our nnbench.parametrize for the label-specific metrics.
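
For illustration, a minimal sketch of such a label-specific benchmark, assuming nnbench.parametrize accepts an iterable of parameter dicts and stubbing out the actual (cached) evaluation:

# Illustrative sketch only; assumes nnbench.parametrize accepts an iterable
# of parameter dicts, and stubs out the real (cached) evaluation step.
from typing import Any

import nnbench

# CoNLL-2003 token classification label set
CLASS_LABELS = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "B-MISC", "I-MISC"]


def evaluate(model: Any, valdata: Any) -> dict[str, dict[str, float]]:
    # stand-in for the cached evaluation in the example
    return {"precision": {label: 0.0 for label in CLASS_LABELS}}


@nnbench.parametrize([{"class_label": label} for label in CLASS_LABELS])
def precision(model: Any, valdata: Any, class_label: str) -> float:
    # real code would run the cached evaluation and look up the per-label value
    return evaluate(model, valdata)["precision"][class_label]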

Questions

  • Around the dataset, the Artifact wrapper feels a little artificial, especially w.r.t. the necessary preprocessing of the data into a dataloader. But maybe that is because Hugging Face already provides a wrapper that is not available for custom datasets.
  • I did not add merging of benchmark records. I think that is more on the side of whatever backend we use; there, I would want the runs for specific models to be separate benchmark records. An exception could be when I want deltas in the metrics between the current and a baseline model. In that case, supplying the baseline to the reporter (and having it cached) feels like the most natural implementation. What do you think?

Remaining ToDos

Closes #63

@maxmynter force-pushed the saved-artifacts branch 2 times, most recently from 87ccf66 to f08f0d9 on February 27, 2024 16:09
@maxmynter marked this pull request as ready for review on February 29, 2024 08:26
@maxmynter force-pushed the saved-artifacts branch 2 times, most recently from 963f752 to 7348387 on March 5, 2024 16:12
@nicholasjng (Collaborator) left a comment:


Additional comments:

  • The lazy artifact loading is really sweet!
  • We could define a functools.partial alias for the parametrization decorator with all the class labels to refactor the raw injections a little bit.
  • Could you share a screenshot of how the results look in a table?

@@ -187,7 +187,7 @@ def run(
     path_or_module: str | os.PathLike[str],
     params: dict[str, Any] | Parameters | None = None,
     tags: tuple[str, ...] = (),
-    context: Sequence[ContextProvider] = (),
+    context: Sequence[ContextProvider | dict[str, Any] | Context] = (),
Collaborator:

A sequence of contexts? I'm not sure that's correct. We take one Context object, and add/update it throughout our benchmarks.

Can you elaborate on why this was necessary?

@maxmynter (Collaborator, Author):

I wanted to pass information to the context parameter of the runner, specifically the name of the validation dataset that I used (there are multiple versions of the CoNLL dataset).

Since Context.update also handles ContextProvider | dict[str, Any] | Context, I thought it made sense for the runner to accept all of these types as well, given that the runner uses Context.update() to process the provided context.

Collaborator:

Good thinking. I'm not a fan of this typing, but that is because the Context.update typing is itself too permissive.

Since the context is supposed to group different value sets on top-level keys (e.g. git info under "git"), I think the following should be true:

  • Context.update merges two Context objects, so "other" needs to be a Context.
  • Context.add(provider) inserts a context provider's result into the context under a single top-level context key.
  • We drop raw dict support entirely, outside of Context.make() and the constructor.

Then, the typing of context in BenchmarkRunner.run() becomes Context | Sequence[ContextProvider] - the former if you want to supply your own hand-built context, the latter if you want it assembled from atomic providers. How does that sound?
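
For concreteness, a minimal sketch of what these proposed semantics could look like; it renders the suggestion above as code and is not the merged implementation:

# Sketch of the proposed semantics only, not the merged implementation.
from typing import Any, Callable

ContextProvider = Callable[[], dict[str, Any]]


class Context:
    def __init__(self, data: dict[str, Any] | None = None) -> None:
        # raw dicts are only accepted here (and in a Context.make() classmethod)
        self._data: dict[str, Any] = dict(data or {})

    def update(self, other: "Context") -> None:
        """Merge another Context into this one; `other` must be a Context."""
        self._data.update(other._data)

    def add(self, provider: ContextProvider) -> None:
        """Insert a single provider's result, e.g. git info under a "git" key."""
        self.update(Context(provider()))


# BenchmarkRunner.run() would then be typed as
#   context: Context | Sequence[ContextProvider]
# i.e. either a hand-built context, or a sequence of atomic providers.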

Resolved review threads (now outdated) on: examples/on_saved/pyproject.toml, .gitignore, docs/tutorials/on_saved.md (6 threads), examples/on_saved/src/benchmark.py, src/nnbench/types.py.
@maxmynter (Collaborator, Author):

"We could define a functools.partial alias for the parametrization decorator with all the class labels to refactor the raw injections a little bit."

Agree.

"Could you share a screenshot of how the results look in a table?"

There you go :)

[Screenshot: benchmark results in a table, 2024-03-06]

@maxmynter (Collaborator, Author) commented Mar 6, 2024:

Using functools.partial does not work straightforwardly: the collection then no longer recognizes the parametrized benchmarks. Therefore, this is out of scope for this PR.

Edit: Opened a new issue for this: #96

@nicholasjng (Collaborator) left a comment:

In general:

If you use the English word for a type (say, Context), please use lower-case spelling, since the word, not the type abstraction, is meant.

Example: in "The context to update", context is lowercase, since it means the English word "context", represented by the Context object.

Resolved review threads (now outdated) on: src/nnbench/runner.py, src/nnbench/types.py (6 threads), src/nnbench/context.py (3 threads).
Commit messages:

Now `update` handles the context provided by `ContextProvider`s, and `add` merges another context into this one. Also adapt the logic and accepted types of `runner.run` to reflect these changes; adapt the tests as well.

We allow None to be supplied for custom formatters, but it throws an error when calling .get. Therefore, patch in an empty dict if `custom_formatter` is None.

The artifact logic is separated into three classes. The loader is for loading the artifact; its load() method returns a local filepath. The Artifact class handles the artifact deserialization from the local disk. The ArtifactCollection is a wrapper around the artifacts for convenient iteration.
Comment on lines +247 to +252
# we do not allow multiple values for a context key.
duplicates = set(ctx.keys()) & set(ctx_cand.keys())
if duplicates:
dupe, *_ = duplicates
raise ValueError(f"got multiple values for context key {dupe!r}")
ctx.update(ctx_cand)
Collaborator:

This method is suboptimal given the unique-key requirement, but we could implement a key check under the hood in Context.update(), and add raw providers here in the loop with Context.add(). That's at most a follow-up, though.

@maxmynter (Collaborator, Author) commented Mar 7, 2024:

Nice, I can open an issue for that. #99
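
For reference, a minimal sketch of that suggested follow-up, building on the Context sketch earlier in this review (illustrative only; the actual change is tracked in the linked issue):

# Sketch of the suggested follow-up: a duplicate-key check inside
# Context.update(), with raw providers added via Context.add() in the runner
# loop. Illustrative only, not the merged implementation.
from typing import Any, Callable, Sequence

ContextProvider = Callable[[], dict[str, Any]]


class Context:
    def __init__(self, data: dict[str, Any] | None = None) -> None:
        self._data: dict[str, Any] = dict(data or {})

    def update(self, other: "Context") -> None:
        # reject duplicate top-level keys instead of silently overwriting
        duplicates = set(self._data) & set(other._data)
        if duplicates:
            dupe, *_ = duplicates
            raise ValueError(f"got multiple values for context key {dupe!r}")
        self._data.update(other._data)

    def add(self, provider: ContextProvider) -> None:
        self.update(Context(provider()))


def collect_context(providers: Sequence[ContextProvider]) -> Context:
    # stand-in for the loop inside BenchmarkRunner.run() shown above
    ctx = Context()
    for provider in providers:
        ctx.add(provider)  # duplicate keys now raise inside update()
    return ctx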

@maxmynter merged commit 521c11a into main on Mar 7, 2024 (5 checks passed).
@maxmynter deleted the saved-artifacts branch on March 7, 2024 11:03.
Merging this pull request closed the issue: Add example: Implement benchmarking on saved artifacts.