
Example Guide: Benchmark on Saved Models #92

Merged: 3 commits merged into main on Mar 7, 2024

Conversation

@maxmynter (Collaborator) commented Feb 26, 2024

This PR introduces artifact handling, alongside a guide and an example for these artifacts. The example concerns a bigger model than the other examples we have written, so I believe it can serve as a nice case study as well. If you see any clunkiness w.r.t. the nnbench usage, that can be a pointer to (a) improve the example, or (b) add nnbench functionality.

Artifact Class

  • Extend the Artifact class and move the loading logic to an ArtifactLoader, to separate the concerns of loading and deserializing.
  • Add an ArtifactCollection wrapper around an iterable to iterate through artifacts (a rough sketch follows below).
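
A rough sketch of how this separation could look; the three class names come from this PR, while the concrete method names and signatures are assumptions rather than the merged code:

# Illustrative sketch of the separation of concerns described above; the
# class layout follows the PR description, but method names and signatures
# beyond the three class names are assumptions, not the merged implementation.
import os
from abc import ABC, abstractmethod
from typing import Any, Generic, Iterable, Iterator, TypeVar

T = TypeVar("T")


class ArtifactLoader(ABC):
    """Obtains an artifact (e.g. by downloading it) and returns a local filepath."""

    @abstractmethod
    def load(self) -> str | os.PathLike[str]: ...


class Artifact(ABC, Generic[T]):
    """Deserializes an artifact from local disk, given a loader."""

    def __init__(self, loader: ArtifactLoader) -> None:
        self.loader = loader
        self._value: T | None = None

    @abstractmethod
    def deserialize(self) -> None:
        """Populate self._value from the path returned by self.loader.load()."""

    def value(self) -> T:
        # lazy: only deserialize on first access
        if self._value is None:
            self.deserialize()
        return self._value


class ArtifactCollection:
    """Thin wrapper around an iterable of artifacts for convenient iteration."""

    def __init__(self, artifacts: Iterable[Artifact[Any]]) -> None:
        self._artifacts = list(artifacts)

    def __iter__(self) -> Iterator[Artifact[Any]]:
        return iter(self._artifacts)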

Other nnbench changes

  • nnbench.runner can now take every context type that the Context.update method can handle. This is useful if someone wants to add ad-hoc values, either as a dict or as a new Context, to a runner's context creation (see the usage sketch after this list).
  • (bugfix) The base reporter can now also handle the case where no custom formatters are supplied.
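
A hypothetical usage sketch of the relaxed context argument (the exact accepted types were revisited later in this review, so treat this as an illustration, not the final API):

# Hypothetical usage sketch of the relaxed `context` argument shown in the
# review diff further below; the accepted types were revisited later in this
# review, so this is an illustration rather than the final API.
from typing import Any

from nnbench.runner import BenchmarkRunner


def dataset_info() -> dict[str, Any]:
    # a plain ContextProvider: a callable returning a dict of context values
    return {"dataset": "conll2003"}


runner = BenchmarkRunner()
record = runner.run(
    "benchmark.py",                         # path to the benchmark module
    params={"model_path": "models/run-1"},  # illustrative parameter only
    # mix of a provider and an ad-hoc dict, as enabled by this PR
    context=[dataset_info, {"valdata": "conll2003-validation"}],
)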

Example

The example is a token classification task. It is mostly an adaptation from here. The logic is in examples/on_saved/src/training/training.py.

When people want to recreate the model training themselves, they can copy it into a free GPU runtime on e.g. Google Colab, install the dependencies (!pip install transformers[torch] datasets), and run !python training.py. After about 5 minutes, they have the necessary files ready for download. Alternatively, they can get a model directly from Hugging Face (e.g. this).

Evaluation is executed by running python runner.py <path-to-model-folder-1> <path-2> ....

The benchmarks are in benchmark.py: accuracy, precision, recall, and F1, both averaged and per label, plus some model metadata. We cache the evaluation processing where appropriate and use our nnbench.parametrize for the label-specific metrics.
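
For illustration, a minimal sketch of such a label-specific benchmark, assuming nnbench.parametrize accepts an iterable of parameter dicts and stubbing out the actual (cached) evaluation:

# Illustrative sketch only; assumes nnbench.parametrize accepts an iterable
# of parameter dicts, and stubs out the real (cached) evaluation step.
from typing import Any

import nnbench

# CoNLL-2003 token classification label set
CLASS_LABELS = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "B-MISC", "I-MISC"]


def evaluate(model: Any, valdata: Any) -> dict[str, dict[str, float]]:
    # stand-in for the cached evaluation in the example
    return {"precision": {label: 0.0 for label in CLASS_LABELS}}


@nnbench.parametrize([{"class_label": label} for label in CLASS_LABELS])
def precision(model: Any, valdata: Any, class_label: str) -> float:
    # real code would run the cached evaluation and look up the per-label value
    return evaluate(model, valdata)["precision"][class_label]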

Questions

  • Around the dataset, the Artifact wrapper feels a little artificial, especially w.r.t. the necessary preprocessing of the data into a dataloader. But maybe that is because Hugging Face already provides a wrapper that is not available for custom datasets.
  • I did not add merging of benchmark records. I think that is more on the side of whatever backend we use; there, I would want the runs for specific models to be separate benchmark records. An exception could be when I want deltas in the metrics between the current and a baseline model. In that case, supplying the baseline to the reporter (and having it cached) feels like the most natural implementation. What do you think?

Remaining ToDos

Closes #63

@maxmynter force-pushed the saved-artifacts branch 2 times, most recently from 87ccf66 to f08f0d9 on February 27, 2024 16:09
@maxmynter marked this pull request as ready for review on February 29, 2024 08:26
@maxmynter force-pushed the saved-artifacts branch 2 times, most recently from 963f752 to 7348387 on March 5, 2024 16:12
@nicholasjng (Collaborator) left a comment:


Additional comments:

  • The lazy artifact loading is really sweet!
  • We could define a functools.partial alias for the parametrization decorator with all the class labels to refactor the raw injections a little bit.
  • Could you share a screenshot of how the results look in a table?

@@ -187,7 +187,7 @@ def run(
     path_or_module: str | os.PathLike[str],
     params: dict[str, Any] | Parameters | None = None,
     tags: tuple[str, ...] = (),
-    context: Sequence[ContextProvider] = (),
+    context: Sequence[ContextProvider | dict[str, Any] | Context] = (),
Collaborator:

A sequence of contexts? I'm not sure that's correct. We take one Context object, and add/update it throughout our benchmarks.

Can you elaborate on why this was necessary?

@maxmynter (Collaborator, Author):

I wanted to pass information to the context parameter of the runner, specifically the name of the validation dataset that I used (there are multiple versions of the CoNLL dataset).

Since Context.update also handles ContextProvider | dict[str, Any] | Context, I thought it made sense for the runner to accept all of these types as well, given that the runner uses Context.update() to process the provided context.

Collaborator:

Good thinking. I'm not a fan of this typing, but that is because the Context.update typing is itself too permissive.

Since the context is supposed to group different value sets on top-level keys (e.g. git info under "git"), I think the following should be true:

  • Context.update merges two Context objects, so "other" needs to be a Context.
  • Context.add(provider) inserts a context provider's result into the context under a single top-level context key.
  • We drop raw dict support entirely, outside of Context.make() and the constructor.

Then, the typing of context in BenchmarkRunner.run() becomes Context | Sequence[ContextProvider] - the former if you want to supply your own hand-built context, the latter if you want it assembled from atomic providers. How does that sound?
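
For concreteness, a minimal sketch of what these proposed semantics could look like; it renders the suggestion above as code and is not the merged implementation:

# Sketch of the proposed semantics only, not the merged implementation.
from typing import Any, Callable

ContextProvider = Callable[[], dict[str, Any]]


class Context:
    def __init__(self, data: dict[str, Any] | None = None) -> None:
        # raw dicts are only accepted here (and in a Context.make() classmethod)
        self._data: dict[str, Any] = dict(data or {})

    def update(self, other: "Context") -> None:
        """Merge another Context into this one; `other` must be a Context."""
        self._data.update(other._data)

    def add(self, provider: ContextProvider) -> None:
        """Insert a single provider's result, e.g. git info under a "git" key."""
        self.update(Context(provider()))


# BenchmarkRunner.run() would then be typed as
#   context: Context | Sequence[ContextProvider]
# i.e. either a hand-built context, or a sequence of atomic providers.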

Resolved review threads (now outdated) on: examples/on_saved/pyproject.toml, .gitignore, docs/tutorials/on_saved.md (6 threads), examples/on_saved/src/benchmark.py, src/nnbench/types.py.
@maxmynter (Collaborator, Author):

"We could define a functools.partial alias for the parametrization decorator with all the class labels to refactor the raw injections a little bit."

Agree.

"Could you share a screenshot of how the results look in a table?"

There you go :)

[Screenshot: benchmark results in a table, 2024-03-06]

@maxmynter (Collaborator, Author) commented Mar 6, 2024:

Using functools.partial does not work straightforwardly: the collection then no longer recognizes the parametrized benchmarks. Therefore, this is out of scope for this PR.

Edit: Opened a new issue for this: #96

@nicholasjng (Collaborator) left a comment:

In general:

If you use the English word for a type (say, Context), please use lower-case spelling, since the word, not the type abstraction, is meant.

Example: in "The context to update", context is lowercase, since it means the English word "context", represented by the Context object.

Resolved review threads (now outdated) on: src/nnbench/runner.py, src/nnbench/types.py (6 threads), src/nnbench/context.py (3 threads).
Commit messages:

Now `update` handles the context provided by `ContextProvider`s, and `add` merges another context into this one. Also adapt the logic and accepted types of `runner.run` to reflect these changes; adapt the tests as well.

We allow None to be supplied for custom formatters, but it throws an error when calling .get. Therefore, patch in an empty dict if `custom_formatter` is None.

The artifact logic is separated into three classes. The loader is for loading the artifact; its load() method returns a local filepath. The Artifact class handles the artifact deserialization from the local disk. The ArtifactCollection is a wrapper around the artifacts for convenient iteration.
Comment on lines +247 to +252
# we do not allow multiple values for a context key.
duplicates = set(ctx.keys()) & set(ctx_cand.keys())
if duplicates:
dupe, *_ = duplicates
raise ValueError(f"got multiple values for context key {dupe!r}")
ctx.update(ctx_cand)
Collaborator:

This method is suboptimal given the unique-key requirement, but we could implement a key check under the hood in Context.update(), and add raw providers here in the loop with Context.add(). That's at most a follow-up, though.

@maxmynter (Collaborator, Author) commented Mar 7, 2024:

Nice, I can open an issue for that. #99
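
For reference, a minimal sketch of that suggested follow-up, building on the Context sketch earlier in this review (illustrative only; the actual change is tracked in the linked issue):

# Sketch of the suggested follow-up: a duplicate-key check inside
# Context.update(), with raw providers added via Context.add() in the runner
# loop. Illustrative only, not the merged implementation.
from typing import Any, Callable, Sequence

ContextProvider = Callable[[], dict[str, Any]]


class Context:
    def __init__(self, data: dict[str, Any] | None = None) -> None:
        self._data: dict[str, Any] = dict(data or {})

    def update(self, other: "Context") -> None:
        # reject duplicate top-level keys instead of silently overwriting
        duplicates = set(self._data) & set(other._data)
        if duplicates:
            dupe, *_ = duplicates
            raise ValueError(f"got multiple values for context key {dupe!r}")
        self._data.update(other._data)

    def add(self, provider: ContextProvider) -> None:
        self.update(Context(provider()))


def collect_context(providers: Sequence[ContextProvider]) -> Context:
    # stand-in for the loop inside BenchmarkRunner.run() shown above
    ctx = Context()
    for provider in providers:
        ctx.add(provider)  # duplicate keys now raise inside update()
    return ctx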

@maxmynter merged commit 521c11a into main on Mar 7, 2024 (5 checks passed).
@maxmynter deleted the saved-artifacts branch on March 7, 2024 11:03.
Merging this pull request closed the issue: Add example: Implement benchmarking on saved artifacts.