
[WIP] [ENH] Reimplement benchmarking with kotsu #1

Closed
wants to merge 8 commits

Conversation


@DBCerigo DBCerigo commented Jun 22, 2022

(Note: Dan added the first three sections of this PR message for us; the rest are from sktime's PR message template.)

Usage instructions:

  • See the reference issues below for getting context
  • See https://www.sktime.org/en/latest/developer_guide.html#developer-guide to set up the sktime dev env (I used pyenv, made a venv, pip install -e .[dev], pip install pre-commit, pre-commit install).
  • Dependencies to install specifically for this PR: autoreload, kotsu, jupyter, typing_extensions.
  • Try running sktime/benchmarking/run.ipynb.

To do / decide now:

  • @alex-hh @ali-tny review and share thoughts on implementation etc.
  • Dan was getting a break from kotsu: a typing_extensions module-not-found error, since we import Literal from it in kotsu. I "fixed" it by pip install typing_extensions in my sktime venv. I think this might be an issue in kotsu, as we didn't include typing_extensions in its requirements?! Weird that it doesn't break in kotsu CI.
  • Should we use one file to define the main validation funcs and a separate file to register them?
  • What should be unit tested, and what should be considered "benchmarking infra"?
    • Write those unit tests.

To discuss with the sktime team on Tuesday 28th:

  • What type of dependency will kotsu be? (Dan thinks soft.)
  • What interface is wanted for running the benchmarking? A Python script? A Jupyter notebook?
  • Register estimators in estimator modules or in benchmarking module? (Dan thinks within benchmarking.)
  • Save copies of the folds? Save estimator predictions? Or just save end scores?
  • Do we want to implement visualisations/explorations/further analysis? The existing benchmarking code has some analysis modules (e.g. t-tests). Or should we just store the base benchmark results and let people do analysis on those as they see fit?
  • We're trying to write as little code as possible and rely on sktime functionality as much as possible. evaluate only takes a single scorer; we want to avoid re-running most of the training and prediction for every scoring metric.

Reference Issues/PRs

See discussion: sktime#2777
See discussion: sktime#2804
See spike: sktime#2805

What does this implement/fix? Explain your changes.

Implement benchmarking using kotsu.

  • Instantiation of a validation_registry and estimator_registry in benchmarking's __init__.
  • A validations.py module which defines and registers validation runs.
  • An estimators.py module which imports and registers estimators and their configs (i.e. param specifications).
  • A run.py module and/or a run.ipynb which runs the registry of estimators.

Likely we will eventually want separate registries corresponding to different types of predictions (classification, regression, etc.); for now this implements only regression (point prediction).

Does your contribution introduce a new dependency? If yes, which one?

What should a reviewer concentrate their feedback on?

The understandability, flexibility, extendability, and maintainability of this benchmarking system implementation.

Any other comments?

PR checklist

For all contributions
  • I've added myself to the list of contributors.
  • Optionally, I've updated sktime's CODEOWNERS to receive notifications about future changes to these files.
  • I've added unit tests and made sure they pass locally.
  • The PR title starts with either [ENH], [MNT], [DOC], or [BUG] indicating whether the PR topic is related to enhancement, maintenance, documentation, or bug.

@DBCerigo DBCerigo changed the title [WIP] [ENH] Implement benchmarking with kotsu [WIP] [ENH] Reimplement benchmarking with kotsu Jun 22, 2022
@DBCerigo DBCerigo force-pushed the benchmarking-kotsu branch 2 times, most recently from 96f4b9d to 0a78ab8, June 23, 2022 14:42
@DBCerigo DBCerigo requested review from alex-hh and ali-tny June 23, 2022 14:48

@ali-tny ali-tny left a comment

killer work dude! couple of small comments throughout.

Re your description todo list:

  • Dan was getting a break from kotsu: a typing_extensions module-not-found error, since we import Literal from it in kotsu. I "fixed" it by pip install typing_extensions in my sktime venv. I think this might be an issue in kotsu, as we didn't include typing_extensions in its requirements?! Weird that it doesn't break in kotsu CI.

It's because typing_extensions is provided as a dev dependency by mypy & black, so it wasn't caught in CI. We should add it to the dependencies (it's just backports for old Python versions, but we may as well support old Python versions rather than only using from typing import ...).
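
For reference, the usual version-guarded import looks roughly like this (a sketch only, not a claim about what kotsu currently does; Literal landed in the stdlib typing module in Python 3.8):

import sys

if sys.version_info >= (3, 8):
    from typing import Literal
else:
    # Backport of Literal for Python < 3.8
    from typing_extensions import Literal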

  • Should we use one file to define the main validation funcs and a separate file to register them?

I'm in favour of registering them close to where they're defined. If it gets unwieldy, we can revisit.

  • What should be unit tested, and what should be considered "benchmarking infra"?

The main validation function (forecasting_validation) should be tested here; the rest is OK, I think.
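
For example, a minimal test might look something like the sketch below (the forecasting_validation signature is taken from the diff; the dataset, splitter, and metric choices are illustrative assumptions):

from sktime.benchmarking.validations import forecasting_validation
from sktime.datasets import load_airline
from sktime.forecasting.model_selection import ExpandingWindowSplitter
from sktime.forecasting.naive import NaiveForecaster
from sktime.performance_metrics.forecasting import MeanAbsolutePercentageError


def test_forecasting_validation_returns_results():
    # Run the validation end-to-end on a small, well-known dataset.
    results = forecasting_validation(
        dataset_loader=load_airline,
        cv_splitter=ExpandingWindowSplitter(initial_window=24, fh=[1, 2, 3]),
        scorers=[MeanAbsolutePercentageError()],
        estimator=NaiveForecaster(strategy="mean", sp=12),
    )
    # The signature declares a List return; check we got something back.
    assert len(results) > 0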

My thoughts on the things to discuss with sktime team:

  • What type of dependency will kotsu be? (Dan thinks soft.)

Just to check: the point of the benchmarking module is to provide a way for users of sktime to test against standard benchmarks, is that right? If that's right, then I agree that it should be soft.

  • What interface is wanted for running the benchmarking? A Python script? A Jupyter notebook?

If it's for users, then I agree that a Jupyter notebook (as you have written) in the examples folder is appropriate.

  • Register estimators in estimator modules or in benchmarking module? (Dan thinks within benchmarking.)

Agree

  • Save copies of the folds? Save estimator predictions? Or just save end scores?
  • Do we want to implement visualisations/explorations/further analysis? The existing benchmarking code has some analysis modules (e.g. t-tests). Or should we just store the base benchmark results and let people do analysis on those as they see fit?

I left a comment in the PR, but I think a quick example plot is a nice idea (and the dataframe is self-explanatory, so they can build on top of it).

  • We're trying to write as little code as possible and rely on sktime functionality as much as possible. evaluate only takes a single scorer; we want to avoid re-running most of the training and prediction for every scoring metric.

Agree

results = {}
for scorer in scorers:
    scorer_name = scorer.name
    scores_df = evaluate(forecaster=estimator, y=y, cv=cv_splitter, scoring=scorer)

Totally agree this is a bummer, looking at the source code, but I agree that changing sktime should be considered out of scope for this PR.

That being said: the evaluate function already returns a dataframe with a column for the scorer, so it would be trivial to allow it to accept a list of scorers. So if kotsu gets used and we have validations with more than one scorer, we should consider implementing that later for an easy speedup.
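
For illustration, one interim way to avoid re-fitting per scorer, assuming evaluate's return_data=True option (which returns each fold's y_test/y_pred alongside the scores), would be roughly:

from sktime.forecasting.model_evaluation import evaluate

# Fit and predict once, scoring with the first scorer and keeping the predictions.
scores_df = evaluate(
    forecaster=estimator,
    y=y,
    cv=cv_splitter,
    scoring=scorers[0],
    return_data=True,
)

# Score the stored per-fold predictions with the remaining scorers,
# mirroring evaluate's "test_<metric name>" column convention.
for scorer in scorers[1:]:
    scores_df[f"test_{scorer.name}"] = [
        scorer(y_test, y_pred)
        for y_test, y_pred in zip(scores_df["y_test"], scores_df["y_pred"])
    ]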

@alex-hh alex-hh Jun 26, 2022

Agree with all of this. To stick closely to sktime convention and avoid the unpleasantness, we could accept only a single scorer in forecasting_validation and view implementing multi-scoring as something for the future, but I'm also fine with leaving it as is.

Comment on lines 11 to 16
def forecasting_validation(
    dataset_loader: Callable,
    cv_splitter: BaseSplitter,
    scorers: List[BaseMetric],
    estimator: BaseForecaster,
) -> List:

This should really have **kwargs, otherwise it will fail if artefacts_store_dir is passed to run in the notebook.

This seems like an easy footgun: is there a way we can inspect the signature of the function in Python at registration time and fail if **kwargs isn't present? (In kotsu, I mean.)
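
Something like the sketch below could work at registration time (a hypothetical check_accepts_kwargs helper, not existing kotsu code):

import inspect
from typing import Callable


def check_accepts_kwargs(func: Callable) -> None:
    """Raise at registration time if `func` cannot absorb extra keyword args."""
    params = inspect.signature(func).parameters.values()
    if not any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params):
        raise TypeError(
            f"{func.__name__} must accept **kwargs so run-time arguments "
            "such as artefacts_store_dir can be passed through."
        )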

Comment on lines +8 to +12
estimator_registry.register(
    id="Naive-v1",
    entry_point=NaiveForecaster,
    kwargs={"strategy": "mean", "sp": 12},
)
@ali-tny ali-tny Jun 26, 2022

If the intention is for sktime users to use the benchmarking, I think we should do this in the run notebook. If benchmarking isn't for that reason, then ignore.

Comment on lines 5 to 9
estimator_registry = registration.ModelRegistry()
validation_registry = registration.ValidationRegistry()

from sktime.benchmarking import estimators # noqa
from sktime.benchmarking import validations # noqa

Could put the registries in their own file (registries.py?) to avoid noqa-ing (but I'm also open to the view that it's nicer having them in the init).
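
E.g. a hypothetical sktime/benchmarking/registries.py, with the __init__ then importing estimators and validations (which in turn import from registries), would avoid the ordering constraint and the noqa comments:

"""Registries for the benchmarking module (hypothetical registries.py sketch)."""
from kotsu import registration

estimator_registry = registration.ModelRegistry()
validation_registry = registration.ValidationRegistry()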

"metadata": {},
"outputs": [],
"source": [
"df = pd.read_csv(\"./results.csv\")"

matplotlib is a soft dependency; we could add a plot of the output as well when that dependency is present (taking inspiration from our other projects: we wrote some plotting for them, right?).
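
A rough sketch of the optional plot for the notebook (the results column names here are illustrative assumptions about the results.csv layout, not the actual kotsu output schema):

import pandas as pd

df = pd.read_csv("./results.csv")

try:
    import matplotlib.pyplot as plt  # soft dependency
except ImportError:
    plt = None

if plt is not None:
    # Assumed columns: one row per (estimator, validation) with a mean score.
    df.plot.barh(x="model_id", y="test_MeanAbsolutePercentageError_mean")
    plt.tight_layout()
    plt.show()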


There are some examples of sktime evaluation-relevant plot functions in the sktime tutorial.

@alex-hh alex-hh left a comment

Looking good!

What I'm not 100% clear on is whether we're trying to provide an example of how a user would extend or construct their own benchmark, or whether we're trying to demonstrate how a standardised benchmark could be baked into the benchmarking module.

It seems like the latter? In which case it's worth being completely explicit about this in the PR description (and maybe comments/docstrings). I'm finding myself having to make a bit of mental effort to keep these usages separate, but that might just be me!

One example of this is that I think the registries should be named to reflect their identities as, basically, components of an (extensible) standard forecasting benchmark (e.g. estimator_registry -> forecasting_estimators).

I think questions of where registration of validations and estimators should be done also depend on how exactly we envision the benchmarking module being used. Another alternative would be to have a single module per benchmark, e.g. forecast_benchmark.py, where all instantiation and registration of the relevant validations and estimators is done. This would make clear that they are (in my interpretation) basically benchmark-specific, but it may have other downsides.
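
For concreteness, a hypothetical forecast_benchmark.py in that alternative layout might look roughly like this (the registration call follows the pattern already in the diff; the validation registration details are left out as assumptions):

"""Self-contained definition of the standard forecasting benchmark (sketch)."""
from kotsu import registration

from sktime.forecasting.naive import NaiveForecaster

forecasting_estimators = registration.ModelRegistry()
forecasting_validations = registration.ValidationRegistry()

forecasting_estimators.register(
    id="Naive-v1",
    entry_point=NaiveForecaster,
    kwargs={"strategy": "mean", "sp": 12},
)

# ...validations for this benchmark would be registered here in the same way.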

results = {}
for scorer in scorers:
    scorer_name = scorer.name
    scores_df = evaluate(forecaster=estimator, y=y, cv=cv_splitter, scoring=scorer)

Haven't investigated fully, but we might want to add a cv_splitter.clone() or cv_splitter.reset(), because cv_splitter seems to have state which could get updated by evaluate, though this might not be an issue in practice.
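
E.g. a defensive sketch, assuming the splitter follows the scikit-learn get_params convention (as sktime objects do), so a fresh copy can be handed to each evaluate call:

from sklearn.base import clone

# Give evaluate its own copy of the splitter so any state it mutates
# doesn't leak into subsequent scorers/estimators.
scores_df = evaluate(
    forecaster=estimator,
    y=y,
    cv=clone(cv_splitter),
    scoring=scorer,
)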

"""Implements running validations on estimators storing results for benchmarking."""
from kotsu import registration

estimator_registry = registration.ModelRegistry()

If these registries are forecasting-specific, should they be named as such? E.g. forecasting_estimators or something. Otherwise they give a misleading impression of being global.

"metadata": {},
"outputs": [],
"source": [
"df = pd.read_csv(\"./results.csv\")"
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there are some examples of sktime evaluation-relevant plot functions in the sktime tutorial


_tags = {
    "requires-fh-in-fit": False,
    # "scitype:y": "univariate",
@DBCerigo DBCerigo (Author)

Wha dis?


oops!

@DBCerigo

@alex-hh @ali-tny I'm gonna rebase this, and then open a PR from this branch into turing.../sktime repo, and share with the people we're meeting with tomorrow.

Let's leave this PR open for now though, to refer back to; it also raises a few kotsu-specific things to continue thinking about.

@DBCerigo

FYI @alex-hh @ali-tny sktime#2884

See you at the meeting tomorrow 🤙

@DBCerigo

DBCerigo commented Aug 2, 2022

Closing as replaced by sktime#2977.
