BUG/ENH: Data processing refactor #88

Merged: adriangb merged 72 commits into master on Nov 3, 2020

Conversation

@adriangb (Owner) commented Oct 1, 2020

Closes #78, Closes #100, Closes #83.

This PR implements an sklearn transformer-based interface for data processing that replaces {pre,post}process_{X,y}.
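
To make the shape of that interface concrete, here is a minimal sketch of the kind of sklearn transformer a user could plug in for target processing under this design; the class and its wiring are illustrative assumptions, not the final API:

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class DoublingTargetEncoder(BaseEstimator, TransformerMixin):
    # Doubles targets on the way into the model and halves
    # predictions on the way back out.
    def fit(self, y):
        return self

    def transform(self, y):
        return np.asarray(y) * 2

    def inverse_transform(self, y):
        return np.asarray(y) / 2

The idea is that the wrapper calls fit_transform on y before training and inverse_transform on predictions, so a single transformer replaces the paired preprocess_y/postprocess_y hooks.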

@codecov-commenter commented Oct 1, 2020

Codecov Report

Merging #88 into master will not change coverage.
The diff coverage is 100.00%.

@@            Coverage Diff            @@
##            master       #88   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files            3         5    +2     
  Lines          526       552   +26     
=========================================
+ Hits           526       552   +26     
Impacted Files Coverage Δ
scikeras/_utils.py 100.00% <100.00%> (ø)
scikeras/utils/__init__.py 100.00% <100.00%> (ø)
scikeras/utils/transformers.py 100.00% <100.00%> (ø)
scikeras/wrappers.py 100.00% <100.00%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 71ccf67...843192f.

@adriangb (Owner, Author) commented Oct 2, 2020

@stsievert please feel free to wait until I at least try to split these changes into two PRs before reviewing the diff, but it would be helpful to get your input on the general approach before I spend more time on it.

@adriangb (Owner, Author) commented Oct 2, 2020

I successfully split up the PR; the diff is now very reasonable and all tests are passing (the one failure is due to lack of coverage on the new ValueErrors). Take a look whenever you get a chance @stsievert. Thanks!

@adriangb (Owner, Author) commented Oct 3, 2020

Got all of the tests for new errors implemented 😄

@stsievert (Collaborator) left a comment

Why are attributes set inside preprocess_y? As a user I wouldn't think about setting attributes inside preprocess_y; I'd only be concerned with preprocessing the targets.

The same goes for the reset parameter – I don't understand why any preprocessing function needs to receive a reset parameter. Why does preprocessing need to be concerned with state?

By that measure, I think the interface you had earlier is cleaner (where preprocess_y also returned a dictionary of meta attributes), and I think I'm coming around to the idea of having BaseWrapper.__init__ accept feature_encoder and target_encoder keywords.
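
For concreteness, that keyword-based interface might look roughly like this. This is a sketch under the assumption that the keywords accept ordinary sklearn transformers; build_model and the exact constructor signature are hypothetical, not final API:

from sklearn.preprocessing import OneHotEncoder, StandardScaler
from scikeras.wrappers import KerasClassifier

def build_model():  # hypothetical user-supplied model-building function
    ...

est = KerasClassifier(
    build_model,
    loss="categorical_crossentropy",
    feature_encoder=StandardScaler(),            # sklearn transformer applied to X
    target_encoder=OneHotEncoder(sparse=False),  # sklearn transformer applied to y
)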

@adriangb (Owner, Author) commented Oct 4, 2020

Why are attributes set inside preprocess_y? As a user I wouldn't think about setting attributes inside preprocess_y, I'd only be concerned with preprocessing the targets.

We need to set those attributes somewhere; do you think it would be better if we returned a dictionary and set them in fit, like before? If you're a user writing your own preprocess_y, you don't have to set any attributes. If you need attributes in postprocess_y, or you want to pass attributes to your model-building function, then you would set attributes (or put them in the self.meta_ dict, whichever we settle on). That is, these attributes should be an implementation detail of the functions, since SciKeras itself does not check them. They are produced by preprocess_y and consumed by postprocess_y and the model-building function, all of which are user defined.*
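
As a sketch of that producer/consumer contract (the hook signatures here are assumptions for illustration, not the exact API):

import numpy as np
from scikeras.wrappers import KerasClassifier

class MyClassifier(KerasClassifier):
    def preprocess_y(self, y):
        # Producer: stash whatever postprocess_y or the model-building
        # function will need; SciKeras itself never validates these.
        self.classes_ = np.unique(y)
        self.n_outputs_ = 1
        return y

    def postprocess_y(self, y_pred):
        # Consumer: map probability columns back to the original labels
        # using the attribute set during preprocessing.
        return self.classes_[np.argmax(y_pred, axis=1)]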

The same goes for the reset parameter – I don't understand why any preprocessing function needs to receive a reset parameter. Why does preprocessing need to be concerned with state?

It's for re-use of transformers: reset is what tells preprocess_y to re-use fitted, already-instantiated transformers instead of creating new ones. This enables the following behavior, which would not work otherwise:

import numpy as np
from scikeras.wrappers import KerasClassifier

X1, y1 = np.array([[1, 2, 3]]).T, np.array([1, 2, 3])
est = KerasClassifier(loss="categorical_crossentropy")  # model-building function omitted for brevity
est.fit(X1, y1)  # y1 gets passed through a OneHotEncoder to match categorical_crossentropy
X2, y2 = X1, np.array([1, 2, 2])
est.partial_fit(X2, y2)  # y2 re-uses the same fitted OneHotEncoder, keeping 3 columns; a fresh instance would produce only 2

By that measure, I think the interface you had earlier is cleaner (where preprocess_y also returned a dictionary of meta attributes), and

I'm not opposed to going back to returning a dictionary and storing it in meta_, but as mentioned in #88 (comment), we'd need to figure out how to handle parameters that are typically base-level attributes of estimators, like n_outputs_.

I think I'm coming around to the idea of having BaseWrapper.__init__ accept feature_encoder and target_encoder keywords

Let's work with the interface we have for now. It should be possible in the future to package it all up into transformers that could be passed as __init__ params. That alternative interface has its own complications (e.g., requiring access to the loss parameter).

* That said, some of these parameters are checked by other things within the Scikit-Learn ecosystem, but I think that if a user overrides preprocess_y, doesn't set n_outputs_, and then runs into a Scikit-Learn error, they should be able to figure out what went wrong and fix it easily; I don't think we should validate that they set any particular parameters.

@adriangb (Owner, Author) commented Oct 5, 2020

Collecting the pending topics from the last round of reviews:

Adding the "multiclass-onehot" target type (comment)

Requirement: all basic use cases that work with Keras OR sklearn should work with SciKeras.
This boils down to:

  • How do we define "basic"?
    • Number of inputs/outputs?
    • Loss function?
  • For "advanced" uses, do we pass data through and let Keras fail if it is not formatted correctly, or do we raise an error in SciKeras to tell the user that SciKeras does not know how to handle this type of data?
  • The original wrappers would automatically one-hot encode targets if the loss function is "categorical_crossentropy". Do we continue to support this API?

The current solution addresses these issues as follows:

  • Define "basic" as:
    • Single input/output
  • Fail with a clear ValueError if sklearn's type_of_target reports a multi-output target ("multiclass-multioutput" or "multilabel-indicator"). Do not fail based on the loss function.
  • Continue to support automatic one-hot encoding of targets for "categorical_crossentropy", but only if the user explicitly passes it as the loss function. No other loss-function-based data manipulation.
  • Create a new type of target ("multiclass-onehot") to represent the subset of "multilabel-indicator" where there is only 1 label per sample (i.e. one-hot encoded), which maps directly to the common use case in Keras where you one-hot encode your target and use the "categorical_crossentropy" loss. Unlike the general "multilabel-indicator" case, SciKeras does not fail with a ValueError and instead passes this data through to Keras; a sketch of the implied check appears after the alternative below. This is a con, since we generally don't want to modify the sklearn API.

One alternative solution is:

  • Do not fail if sklearn's type_of_target reports a multi-output target.
  • Continue to support automatic one-hot encoding of targets for "categorical_crossentropy", but only if the user explicitly passes it as the loss function. No other loss-function-based data manipulation.
  • No need to create a new type of target. This is a huge plus; we don't want to modify the sklearn API.
  • Users may be confused when they try to implement a multi-output model and run into TF errors (example).
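
To make the distinction concrete, here is a sketch of the check the proposed target type implies (the helper name is hypothetical):

import numpy as np
from sklearn.utils.multiclass import type_of_target

def is_multiclass_onehot(y) -> bool:
    # One-hot targets are the subset of "multilabel-indicator"
    # with exactly one hot entry per row.
    y = np.asarray(y)
    return type_of_target(y) == "multilabel-indicator" and bool(np.all(y.sum(axis=1) == 1))

is_multiclass_onehot(np.array([[0, 1, 0], [1, 0, 0]]))  # True: one label per sample
is_multiclass_onehot(np.array([[1, 1, 0], [1, 0, 1]]))  # False: genuinely multilabel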

@stsievert (Collaborator) commented

Let's for now work on the interface we have. It should be possible in the future

The future is less than 12 hours away. I'll make a PR to this branch that cleans up the code. It simplifies the logic around initialization and makes only internal changes.

@adriangb (Owner, Author) commented Oct 6, 2020

The future is less than 12 hours away. I'll make a PR to this branch that cleans up the code. It simplifies the logic around initialization and makes only internal changes.

That is great, I look forward to it!

@stsievert (Collaborator) commented

The original wrappers would automatically one-hot encode targets if the loss function is "categorical_crossentropy". Do we continue to support this API?

Yes, that API should be supported. I don't think SciKeras should break the API that TF implemented.

@adriangb (Owner, Author) commented Oct 9, 2020

Thanks for your work over in #93 @stsievert. We should be up to date here now.

I think the only pending thing is type_of_target(y)=="multiclass-onehot".

@codecov-io commented Oct 9, 2020

Codecov Report

Merging #88 into master will decrease coverage by 0.35%.
The diff coverage is 99.09%.

@@            Coverage Diff             @@
##           master      #88      +/-   ##
==========================================
- Coverage   99.80%   99.45%   -0.36%     
==========================================
  Files           3        5       +2     
  Lines         526      548      +22     
==========================================
+ Hits          525      545      +20     
- Misses          1        3       +2     
Impacted Files Coverage Δ
scikeras/wrappers.py 99.00% <97.87%> (-0.74%) ⬇️
scikeras/_utils.py 100.00% <100.00%> (ø)
scikeras/utils/__init__.py 100.00% <100.00%> (ø)
scikeras/utils/transformers.py 100.00% <100.00%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 6897d83...aebbf89.

adriangb mentioned this pull request Oct 9, 2020
@adriangb (Owner, Author) commented

I did a quick once-over of the diff and fixed a couple of typos and other minor issues.

@stsievert , when you have a chance, can you take a look at my responses/fixes to the last couple of comments, and let me know if we can resolve those or if there is more work needed? Thanks!

@adriangb adriangb merged commit b0bb39f into master Nov 3, 2020
@adriangb adriangb deleted the data-processing-ref branch December 5, 2020 20:46