Example Guide: Benchmark on Saved Models #92
Conversation
Force-pushed 87ccf66 to f08f0d9
Force-pushed 963f752 to 7348387
Additional comments:
- The lazy artifact loading is really sweet!
- We could define a `functools.partial` alias for the parametrization decorator with all the class labels, to refactor the raw injections a little bit.
- Could you share a screenshot of how the results look in a table?
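For illustration, a rough sketch of such an alias, assuming `nnbench.parametrize` takes an iterable of parameter dicts as its first argument (the label set and benchmark stub here are made up):

```python
import functools

import nnbench

# Hypothetical label set; substitute the actual CoNLL class labels.
CLASS_LABELS = ("PER", "ORG", "LOC", "MISC")

# Pre-bind the per-label parameters once, so individual benchmarks
# don't have to repeat the raw injection list.
parametrize_labels = functools.partial(
    nnbench.parametrize,
    [{"class_label": label} for label in CLASS_LABELS],
)


@parametrize_labels()
def precision(model, valdata, class_label: str) -> float:
    ...
```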
src/nnbench/runner.py (Outdated)
@@ -187,7 +187,7 @@ def run(
     path_or_module: str | os.PathLike[str],
     params: dict[str, Any] | Parameters | None = None,
     tags: tuple[str, ...] = (),
-    context: Sequence[ContextProvider] = (),
+    context: Sequence[ContextProvider | dict[str, Any] | Context] = (),
A sequence of contexts? I'm not sure that's correct. We take one `Context` object, and add/update it throughout our benchmarks.
Can you elaborate on why this was necessary?
I wanted to pass information to the context parameter of the runner, specifically the name of the validation dataset that I used (there are multiple versions of the CoNLL dataset).
As `Context.update` also handles `ContextProvider | dict[str, Any] | "Context"`, I thought it makes sense that the runner can handle all these types as well, since the runner uses `Context.update()` to process the provided context.
Good thinking. I'm not a fan of this typing, but that's because the `Context.update` typing is too permissive.
Since the context is supposed to group different value sets on top-level keys (e.g. git info under "git"), I think the following should be true:
- `Context.update` merges two `Context` objects, so "other" needs to be a `Context`.
- `Context.add(provider)` inserts a context provider's result into the context under a single top-level context key.
- We drop raw dict support entirely, outside of `Context.make()` and the constructor.
Then, the typing of `context` in `BenchmarkRunner.run()` becomes `Context | Sequence[ContextProvider]` - the former if you want to supply your own hand-built context, the latter if you want it assembled from atomic providers. How does that sound?
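A rough sketch of what that proposed shape could look like, purely illustrative (stand-in definitions, not the actual nnbench code):

```python
from typing import Any, Callable, Sequence

# Stand-in: a context provider returns a dict under a single top-level key.
ContextProvider = Callable[[], dict[str, Any]]


class Context:
    def __init__(self, data: dict[str, Any] | None = None) -> None:
        self._data: dict[str, Any] = dict(data or {})

    def add(self, provider: ContextProvider) -> None:
        # Insert one provider's result into the context.
        self._data.update(provider())

    def update(self, other: "Context") -> None:
        # Merge two Context objects; raw dicts are no longer accepted here.
        self._data.update(other._data)


def resolve_context(context: Context | Sequence[ContextProvider]) -> Context:
    # Either take a hand-built context as-is, or assemble one from atomic providers.
    if isinstance(context, Context):
        return context
    ctx = Context()
    for provider in context:
        ctx.add(provider)
    return ctx
```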
Edit: Opened a new issue with this: #96
Force-pushed fd3ca00 to 5f30973
In general:
If you use the English word for a type (say, Context), please use lower-case spelling, since the word and not the type abstraction is meant.
Example: "The context to update" -> context is lowercase, since it means the English word "context", represented by the `Context` object.
Now `update` handles the context provided by `ContextProvider`s and `add` merges another context into this one. Also adapt the logic and accepted types of `runner.run` to reflect these changes. Adapt tests as well.
We allow None to be supplied for custom formatters, but that throws an error when calling `.get()`. Therefore, patch in an empty dict if `custom_formatter` is None.
The artifact logic is separated into three classes. The loader is for loading the artifact; its `load()` method returns a local filepath. The `Artifact` class handles the artifact deserialization from the local disk. The `ArtifactCollection` is a wrapper around the artifacts for convenient iteration.
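A minimal sketch of that split, assuming the responsibilities described above (everything beyond the `load()` method name is illustrative, not the actual nnbench implementation):

```python
import os
from typing import Any, Generic, Iterable, Iterator, TypeVar

T = TypeVar("T")


class ArtifactLoader:
    """Fetches an artifact (e.g. from remote storage) and returns a local filepath."""

    def load(self) -> str | os.PathLike[str]:
        raise NotImplementedError


class Artifact(Generic[T]):
    """Deserializes an artifact from local disk, lazily on first access."""

    def __init__(self, loader: ArtifactLoader) -> None:
        self._loader = loader
        self._value: T | None = None

    def deserialize(self, path: str | os.PathLike[str]) -> T:
        raise NotImplementedError

    def value(self) -> T:
        # Lazy loading: fetch and deserialize only when first requested.
        if self._value is None:
            self._value = self.deserialize(self._loader.load())
        return self._value


class ArtifactCollection:
    """Thin wrapper around multiple artifacts for convenient iteration."""

    def __init__(self, artifacts: Iterable[Artifact[Any]]) -> None:
        self._artifacts = list(artifacts)

    def __iter__(self) -> Iterator[Artifact[Any]]:
        return iter(self._artifacts)
```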
Force-pushed 5f30973 to 781088d
# we do not allow multiple values for a context key.
duplicates = set(ctx.keys()) & set(ctx_cand.keys())
if duplicates:
    dupe, *_ = duplicates
    raise ValueError(f"got multiple values for context key {dupe!r}")
ctx.update(ctx_cand)
This method is suboptimal with the unique key requirement, but we could implement a key check under the hood in `Context.update()`, and add raw providers here in the loop with `Context.add()`. That's at most a follow-up, though.
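As a rough illustration of that follow-up idea (hypothetical, not the actual implementation):

```python
from typing import Any


class Context:
    def __init__(self, data: dict[str, Any] | None = None) -> None:
        self._data: dict[str, Any] = dict(data or {})

    def update(self, other: "Context") -> None:
        # Hypothetical: the duplicate-key check moves under the hood of update().
        duplicates = self._data.keys() & other._data.keys()
        if duplicates:
            dupe, *_ = duplicates
            raise ValueError(f"got multiple values for context key {dupe!r}")
        self._data.update(other._data)
```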
Nice, I can open an issue for that. #99
This PR introduces artifact handling, alongside a guide and example for these artifacts. The example concerns a bigger model than the other examples we have written, so I believe it can serve as a nice case study as well. So if you see any clunkiness w.r.t. the nnbench usage, that can be a pointer to (a) improve the example, or (b) add nnbench functionality.
Artifact Class
- `Artifact` class, with the loading logic moved to an `ArtifactLoader` to separate the concerns between loading and deserializing.
- `ArtifactCollection` wrapper around an iterable to iterate through artifacts.
Other nnbench changes
- `nnbench.runner` can take all contexts that the `Context.update` method can handle. This is useful if someone wants to ad-hoc add values via a dict or a new `Context` to the context creation of a runner (see the sketch below).
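For instance, something along these lines (a sketch of the intended usage; the import path and the provider are assumptions, while the `run()` parameters follow the diff above):

```python
import platform

from nnbench.runner import BenchmarkRunner  # import path assumed from src/nnbench/runner.py


def python_info() -> dict:
    # Assumed example of a context provider: a callable returning a dict.
    return {"python": {"version": platform.python_version()}}


runner = BenchmarkRunner()
runner.run(
    "benchmark.py",
    # Mixing a context provider and a raw dict, as the typing of `context`
    # in this PR would permit.
    context=[python_info, {"dataset": "conll2003"}],
)
```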
Example
The example is a token classification task. It is mostly an adaptation from here. The logic is in `examples/on_saved/src/training/training.py`.
When people want to recreate the model training themselves, they can copy it into a free GPU runtime on e.g. Google Colab, install the dependencies (`!pip install transformers[torch] datasets`) and run `!python training.py`. After about 5 minutes they have the necessary files ready for download. Alternatively, they can get a model directly from Huggingface (e.g. this).
Evaluation is executed by running `python runner.py <path-to-model-folder-1> <path-2> ...`.
The benchmarks are in `benchmark.py`: accuracy, precision, recall, and F1, averaged and per label, plus some model metadata. We cache the evaluation processing where appropriate and use our `nnbench.parametrize` for label-specific metrics (a sketch follows below).
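A rough sketch of what such a label-parametrized benchmark could look like, assuming `nnbench.parametrize` takes an iterable of parameter dicts; the labels, metric body, and data shapes are illustrative, not the example's actual code:

```python
import nnbench

LABELS = ("PER", "ORG", "LOC", "MISC")  # illustrative label set


@nnbench.parametrize([{"class_label": label} for label in LABELS])
def precision(predictions: list[str], references: list[str], class_label: str) -> float:
    # Illustrative per-label precision: true positives over predicted positives.
    tp = sum(p == r == class_label for p, r in zip(predictions, references))
    pp = sum(p == class_label for p in predictions)
    return tp / pp if pp else 0.0
```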
Questions
- The `Artifact` wrapper feels a little artificial, especially w.r.t. the necessary preprocessing of the data into a dataloader. But maybe that is because Huggingface already provides a wrapper that is not available for custom datasets.
Remaining ToDo's
- Add common loaders (in a follow-up PR?) -> follow-up issue: Add common ArtifactLoaders to make nnbench provide them out of the box (#97)
- `training.py` script

Closes #63