
[WIP] [ENH] Reimplement benchmarking with kotsu #1

Closed
wants to merge 8 commits

Conversation


@DBCerigo DBCerigo commented Jun 22, 2022

(Note: Dan added the first three sections of this PR message for us; the rest are from sktime's PR message template.)

Usage instructions:

  • See the reference issues below for getting context
  • See https://www.sktime.org/en/latest/developer_guide.html#developer-guide to set up the sktime dev env (I used pyenv, made a venv, pip install -e .[dev], pip install pre-commit, pre-commit install).
  • Dependencies to install specifically for this PR: autoreload, kotsu, jupyter, typing_extensions.
  • Try running sktime/benchmarking/run.ipynb.

To do / decide now:

  • @alex-hh @ali-tny review and share thoughts on implementation etc.
  • Dan was getting a break from kotsu: a typing_extensions module-not-found error, since we import Literal from it in kotsu. I "fixed" it by pip install typing_extensions in my sktime venv. I think this might be an issue in kotsu, as we didn't include typing_extensions in its requirements?! Weird that it doesn't break in kotsu CI.
  • Should we use one file to define the main validation funcs and a separate file to register them?
  • What should be unit tested, and what should be considered "benchmarking infra"?
    • Write those unit tests.

To discuss with the sktime team on Tuesday 28th:

  • What type of dependency will kotsu be? (Dan thinks soft.)
  • What interface is wanted for running the benchmarking? A Python script? A Jupyter notebook?
  • Register estimators in estimator modules or in benchmarking module? (Dan thinks within benchmarking.)
  • Save copies of the folds? Save estimator predictions? Or just save end scores?
  • Do we want to implement visualisations/explorations/further analysis? The existing benchmarking code has some analysis modules (e.g. t-tests). Or should we just store the base benchmark results and let people do analysis on those as they see fit?
  • We're trying to write as little code as possible and rely on sktime functionality as much as possible. evaluate only takes a single scorer; we want to avoid re-running most of the training and prediction for every scoring metric.

Reference Issues/PRs

See discussion: sktime#2777
See discussion: sktime#2804
See spike: sktime#2805

What does this implement/fix? Explain your changes.

Implement benchmarking using kotsu.

  • Instantiation of a validation_registry and estimator_registry in benchmarking's __init__.
  • A validations.py module which defines and registers validation runs.
  • An estimators.py module which imports and registers estimators and their configs (i.e. param specifications).
  • A run.py module and/or a run.ipynb which runs the registry of estimators.

Likely we will eventually want separate registries corresponding to different types of predictions (classification, regression, etc.); for now this implements only regression (point prediction).

Does your contribution introduce a new dependency? If yes, which one?

What should a reviewer concentrate their feedback on?

The understandability, flexibility, extendability, and maintainability of this benchmarking system implementation.

Any other comments?

PR checklist

For all contributions
  • I've added myself to the list of contributors.
  • Optionally, I've updated sktime's CODEOWNERS to receive notifications about future changes to these files.
  • I've added unit tests and made sure they pass locally.
  • The PR title starts with either [ENH], [MNT], [DOC], or [BUG] indicating whether the PR topic is related to enhancement, maintenance, documentation, or bug.

@DBCerigo DBCerigo changed the title [WIP] [ENH] Implement benchmarking with kotsu [WIP] [ENH] Reimplement benchmarking with kotsu Jun 22, 2022
@DBCerigo DBCerigo force-pushed the benchmarking-kotsu branch 2 times, most recently from 96f4b9d to 0a78ab8, June 23, 2022 14:42
@DBCerigo DBCerigo requested review from alex-hh and ali-tny June 23, 2022 14:48

@ali-tny ali-tny left a comment

killer work dude! couple of small comments throughout.

Re your description todo list:

  • Dan was getting a break from kotsu: a typing_extensions module-not-found error, since we import Literal from it in kotsu. I "fixed" it by pip install typing_extensions in my sktime venv. I think this might be an issue in kotsu, as we didn't include typing_extensions in its requirements?! Weird that it doesn't break in kotsu CI.

It's because typing_extensions is provided as a dev dependency by mypy & black, so it wasn't caught in CI. We should add it to the dependencies (it's just backports for old Python versions, but we may as well support old Python versions rather than only using from typing import ...).
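
For reference, the usual version-guarded import looks roughly like this (a sketch only, not a claim about what kotsu currently does; Literal landed in the stdlib typing module in Python 3.8):

import sys

if sys.version_info >= (3, 8):
    from typing import Literal
else:
    # Backport of Literal for Python < 3.8
    from typing_extensions import Literal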

  • Should we use one file to define the main validation funcs and a separate file to register them?

I'm in favour of registering them close to where they're defined. If it gets unwieldy, we can revisit.

  • What should be unit tested, and what should be considered "benchmarking infra"?

The main validation function (forecasting_validation) should be tested here; the rest is OK, I think.
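
For example, a minimal test might look something like the sketch below (the forecasting_validation signature is taken from the diff; the dataset, splitter, and metric choices are illustrative assumptions):

from sktime.benchmarking.validations import forecasting_validation
from sktime.datasets import load_airline
from sktime.forecasting.model_selection import ExpandingWindowSplitter
from sktime.forecasting.naive import NaiveForecaster
from sktime.performance_metrics.forecasting import MeanAbsolutePercentageError


def test_forecasting_validation_returns_results():
    # Run the validation end-to-end on a small, well-known dataset.
    results = forecasting_validation(
        dataset_loader=load_airline,
        cv_splitter=ExpandingWindowSplitter(initial_window=24, fh=[1, 2, 3]),
        scorers=[MeanAbsolutePercentageError()],
        estimator=NaiveForecaster(strategy="mean", sp=12),
    )
    # The signature declares a List return; check we got something back.
    assert len(results) > 0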

My thoughts on the things to discuss with sktime team:

  • What type of dependency will kotsu be? (Dan thinks soft.)

Just to check: the point of the benchmarking module is to provide a way for users of sktime to test against standard benchmarks, is that right? If that's right, then I agree that it should be soft.

  • What interface is wanted for running the benchmarking? A Python script? A Jupyter notebook?

If it's for users, then I agree that a Jupyter notebook (as you have written) in the examples folder is appropriate.

  • Register estimators in estimator modules or in benchmarking module? (Dan thinks within benchmarking.)

Agree

  • Save copies of the folds? Save estimator predictions? Or just save end scores?
  • Do we want to implement visualisations/explorations/further analysis? The existing benchmarking code has some analysis modules (e.g. t-tests). Or should we just store the base benchmark results and let people do analysis on those as they see fit?

I left a comment in the PR, but I think a quick example plot is a nice idea (and the dataframe is self-explanatory, so they can build on top of it).

  • We're trying to write as little code as possible and rely on sktime functionality as much as possible. evaluate only takes a single scorer; we want to avoid re-running most of the training and prediction for every scoring metric.

Agree

results = {}
for scorer in scorers:
    scorer_name = scorer.name
    scores_df = evaluate(forecaster=estimator, y=y, cv=cv_splitter, scoring=scorer)

Totally agree this is a bummer, looking at the source code, but I agree that changing sktime should be considered out of scope for this PR.

That being said: the evaluate function already returns a dataframe with a column for the scorer, so it would be trivial to allow it to accept a list of scorers. So if kotsu gets used and we have validations with more than one scorer, we should consider implementing that later for an easy speedup.
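
For illustration, one interim way to avoid re-fitting per scorer, assuming evaluate's return_data=True option (which returns each fold's y_test/y_pred alongside the scores), would be roughly:

from sktime.forecasting.model_evaluation import evaluate

# Fit and predict once, scoring with the first scorer and keeping the predictions.
scores_df = evaluate(
    forecaster=estimator,
    y=y,
    cv=cv_splitter,
    scoring=scorers[0],
    return_data=True,
)

# Score the stored per-fold predictions with the remaining scorers,
# mirroring evaluate's "test_<metric name>" column convention.
for scorer in scorers[1:]:
    scores_df[f"test_{scorer.name}"] = [
        scorer(y_test, y_pred)
        for y_test, y_pred in zip(scores_df["y_test"], scores_df["y_pred"])
    ]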

@alex-hh alex-hh Jun 26, 2022

Agree with all of this. To stick closely to sktime convention and avoid the unpleasantness, we could accept only a single scorer in forecasting_validation and view implementing multi-scoring as something for the future, but I'm also fine with leaving it as is.

Comment on lines 11 to 16
def forecasting_validation(
    dataset_loader: Callable,
    cv_splitter: BaseSplitter,
    scorers: List[BaseMetric],
    estimator: BaseForecaster,
) -> List:

This should really have **kwargs, otherwise it will fail if artefacts_store_dir is passed to run in the notebook.

This seems like an easy footgun: is there a way we can inspect the signature of the function in Python at registration time and fail if **kwargs isn't present? (In kotsu, I mean.)
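
Something like the sketch below could work at registration time (a hypothetical check_accepts_kwargs helper, not existing kotsu code):

import inspect
from typing import Callable


def check_accepts_kwargs(func: Callable) -> None:
    """Raise at registration time if `func` cannot absorb extra keyword args."""
    params = inspect.signature(func).parameters.values()
    if not any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params):
        raise TypeError(
            f"{func.__name__} must accept **kwargs so run-time arguments "
            "such as artefacts_store_dir can be passed through."
        )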

Comment on lines +8 to +12
estimator_registry.register(
    id="Naive-v1",
    entry_point=NaiveForecaster,
    kwargs={"strategy": "mean", "sp": 12},
)
@ali-tny ali-tny Jun 26, 2022

If the intention is for sktime users to use the benchmarking, I think we should do this in the run notebook. If benchmarking isn't for that reason, then ignore.

Comment on lines 5 to 9
estimator_registry = registration.ModelRegistry()
validation_registry = registration.ValidationRegistry()

from sktime.benchmarking import estimators # noqa
from sktime.benchmarking import validations # noqa

Could put the registries in their own file (registries.py?) to avoid noqa-ing (but I'm also open to the view that it's nicer having them in the init).
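
E.g. a hypothetical sktime/benchmarking/registries.py, with the __init__ then importing estimators and validations (which in turn import from registries), would avoid the ordering constraint and the noqa comments:

"""Registries for the benchmarking module (hypothetical registries.py sketch)."""
from kotsu import registration

estimator_registry = registration.ModelRegistry()
validation_registry = registration.ValidationRegistry()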

"metadata": {},
"outputs": [],
"source": [
"df = pd.read_csv(\"./results.csv\")"

matplotlib is a soft dependency; we could add a plot of the output as well when that dependency is present (taking inspiration from our other projects: we wrote some plotting for them, right?).
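
A rough sketch of the optional plot for the notebook (the results column names here are illustrative assumptions about the results.csv layout, not the actual kotsu output schema):

import pandas as pd

df = pd.read_csv("./results.csv")

try:
    import matplotlib.pyplot as plt  # soft dependency
except ImportError:
    plt = None

if plt is not None:
    # Assumed columns: one row per (estimator, validation) with a mean score.
    df.plot.barh(x="model_id", y="test_MeanAbsolutePercentageError_mean")
    plt.tight_layout()
    plt.show()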


There are some examples of sktime evaluation-relevant plot functions in the sktime tutorial.

@alex-hh alex-hh left a comment

Looking good!

What I'm not 100% clear on is whether we're trying to provide an example of how a user would extend or construct their own benchmark, or whether we're trying to demonstrate how a standardised benchmark could be baked into the benchmarking module.

It seems like the latter? In which case it's worth being completely explicit about this in the PR description (and maybe comments/docstrings). I'm finding myself having to make a bit of mental effort to keep these usages separate, but that might just be me!

One example of this is that I think the registries should be named to reflect their identities as, basically, components of an (extensible) standard forecasting benchmark (e.g. estimator_registry -> forecasting_estimators).

I think questions of where registration of validations and estimators should be done also depend on how exactly we envision the benchmarking module being used. Another alternative would be to have a single module per benchmark, e.g. forecast_benchmark.py, where all instantiation and registration of the relevant validations and estimators is done. This would make clear that they are (in my interpretation) basically benchmark-specific, but it may have other downsides.
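
For concreteness, a hypothetical forecast_benchmark.py in that alternative layout might look roughly like this (the registration call follows the pattern already in the diff; the validation registration details are left out as assumptions):

"""Self-contained definition of the standard forecasting benchmark (sketch)."""
from kotsu import registration

from sktime.forecasting.naive import NaiveForecaster

forecasting_estimators = registration.ModelRegistry()
forecasting_validations = registration.ValidationRegistry()

forecasting_estimators.register(
    id="Naive-v1",
    entry_point=NaiveForecaster,
    kwargs={"strategy": "mean", "sp": 12},
)

# ...validations for this benchmark would be registered here in the same way.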

results = {}
for scorer in scorers:
    scorer_name = scorer.name
    scores_df = evaluate(forecaster=estimator, y=y, cv=cv_splitter, scoring=scorer)

Haven't investigated fully, but we might want to add a cv_splitter.clone() or cv_splitter.reset(), because cv_splitter seems to have state which could get updated by evaluate, though this might not be an issue in practice.
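
E.g. a defensive sketch, assuming the splitter follows the scikit-learn get_params convention (as sktime objects do), so a fresh copy can be handed to each evaluate call:

from sklearn.base import clone

# Give evaluate its own copy of the splitter so any state it mutates
# doesn't leak into subsequent scorers/estimators.
scores_df = evaluate(
    forecaster=estimator,
    y=y,
    cv=clone(cv_splitter),
    scoring=scorer,
)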

"""Implements running validations on estimators storing results for benchmarking."""
from kotsu import registration

estimator_registry = registration.ModelRegistry()

If these registries are forecasting-specific, should they be named as such? E.g. forecasting_estimators or something. Otherwise they give a misleading impression of being global.

"metadata": {},
"outputs": [],
"source": [
"df = pd.read_csv(\"./results.csv\")"
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there are some examples of sktime evaluation-relevant plot functions in the sktime tutorial


_tags = {
    "requires-fh-in-fit": False,
    # "scitype:y": "univariate",
@DBCerigo DBCerigo (Author)

Wha dis?


oops!

@DBCerigo

@alex-hh @ali-tny I'm gonna rebase this, and then open a PR from this branch into turing.../sktime repo, and share with the people we're meeting with tomorrow.

Let's leave this PR open for now though, to refer back to; it also raises a few kotsu-specific things to continue thinking about.

@DBCerigo

FYI @alex-hh @ali-tny sktime#2884

See you at the meeting tomorrow 🤙

@DBCerigo

DBCerigo commented Aug 2, 2022

Closing as replaced by sktime#2977.
