BUG/ENH: Data processing refactor #88

Merged: adriangb merged 72 commits into master on Nov 3, 2020

Conversation

@adriangb (Owner) commented Oct 1, 2020

Closes #78, Closes #100, Closes #83.

This PR implements an sklearn transformer-based interface for data processing that replaces {pre,post}process_{X,y}.
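
To make the shape of that interface concrete, here is a minimal sketch of the kind of sklearn transformer a user could plug in for target processing under this design; the class and its wiring are illustrative assumptions, not the final API:

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class DoublingTargetEncoder(BaseEstimator, TransformerMixin):
    # Doubles targets on the way into the model and halves
    # predictions on the way back out.
    def fit(self, y):
        return self

    def transform(self, y):
        return np.asarray(y) * 2

    def inverse_transform(self, y):
        return np.asarray(y) / 2

The idea is that the wrapper calls fit_transform on y before training and inverse_transform on predictions, so a single transformer replaces the paired preprocess_y/postprocess_y hooks.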

@codecov-commenter commented Oct 1, 2020

Codecov Report

Merging #88 into master will not change coverage.
The diff coverage is 100.00%.

@@            Coverage Diff            @@
##            master       #88   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files            3         5    +2     
  Lines          526       552   +26     
=========================================
+ Hits           526       552   +26     
Impacted Files Coverage Δ
scikeras/_utils.py 100.00% <100.00%> (ø)
scikeras/utils/__init__.py 100.00% <100.00%> (ø)
scikeras/utils/transformers.py 100.00% <100.00%> (ø)
scikeras/wrappers.py 100.00% <100.00%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 71ccf67...843192f.

@adriangb (Owner, Author) commented Oct 2, 2020

@stsievert please feel free to wait until I at least try to split these changes into two PRs before reviewing the diff, but it would be helpful to get your input on the general approach before I spend more time on it.

@adriangb (Owner, Author) commented Oct 2, 2020

I successfully split up the PR; the diff is now very reasonable and all tests are passing (the one failure is due to lack of coverage on the new ValueErrors). Take a look whenever you get a chance @stsievert. Thanks!

@adriangb (Owner, Author) commented Oct 3, 2020

Got all of the tests for new errors implemented 😄

@stsievert (Collaborator) left a comment

Why are attributes set inside preprocess_y? As a user I wouldn't think about setting attributes inside preprocess_y; I'd only be concerned with preprocessing the targets.

The same goes for the reset parameter – I don't understand why any preprocessing function needs to receive a reset parameter. Why does preprocessing need to be concerned with state?

By that measure, I think the interface you had earlier is cleaner (where preprocess_y also returned a dictionary of meta attributes), and I think I'm coming around to the idea of having BaseWrapper.__init__ accept feature_encoder and target_encoder keywords.
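
For concreteness, that keyword-based interface might look roughly like this. This is a sketch under the assumption that the keywords accept ordinary sklearn transformers; build_model and the exact constructor signature are hypothetical, not final API:

from sklearn.preprocessing import OneHotEncoder, StandardScaler
from scikeras.wrappers import KerasClassifier

def build_model():  # hypothetical user-supplied model-building function
    ...

est = KerasClassifier(
    build_model,
    loss="categorical_crossentropy",
    feature_encoder=StandardScaler(),            # sklearn transformer applied to X
    target_encoder=OneHotEncoder(sparse=False),  # sklearn transformer applied to y
)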

@adriangb (Owner, Author) commented Oct 4, 2020

Why are attributes set inside preprocess_y? As a user I wouldn't think about setting attributes inside preprocess_y, I'd only be concerned with preprocessing the targets.

We need to set those attributes somewhere; do you think it would be better if we returned a dictionary and set them in fit, like before? If you're a user writing your own preprocess_y, you don't have to set any attributes. If you need attributes in postprocess_y, or you want to pass attributes to your model-building function, then you would set attributes (or put them in the self.meta_ dict, whichever we settle on). That is, these attributes should be an implementation detail of the functions, since SciKeras itself does not check them. They are produced by preprocess_y and consumed by postprocess_y and the model-building function, all of which are user defined.*
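
As a sketch of that producer/consumer contract (the hook signatures here are assumptions for illustration, not the exact API):

import numpy as np
from scikeras.wrappers import KerasClassifier

class MyClassifier(KerasClassifier):
    def preprocess_y(self, y):
        # Producer: stash whatever postprocess_y or the model-building
        # function will need; SciKeras itself never validates these.
        self.classes_ = np.unique(y)
        self.n_outputs_ = 1
        return y

    def postprocess_y(self, y_pred):
        # Consumer: map probability columns back to the original labels
        # using the attribute set during preprocessing.
        return self.classes_[np.argmax(y_pred, axis=1)]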

The same goes for the reset parameter – I don't understand why any preprocessing function needs to receive a reset parameter. Why does preprocessing need to be concerned with state?

It's for re-use of transformers: reset is what tells preprocess_y to re-use fitted, already-instantiated transformers instead of creating new ones. This enables the following behavior, which would not work otherwise:

import numpy as np
from scikeras.wrappers import KerasClassifier

X1, y1 = np.array([[1, 2, 3]]).T, np.array([1, 2, 3])
est = KerasClassifier(loss="categorical_crossentropy")  # model-building function omitted for brevity
est.fit(X1, y1)  # y1 gets passed through a OneHotEncoder to match categorical_crossentropy
X2, y2 = X1, np.array([1, 2, 2])
est.partial_fit(X2, y2)  # y2 re-uses the same fitted OneHotEncoder, keeping 3 columns; a fresh instance would produce only 2

By that measure, I think the interface you had earlier is cleaner (where preprocess_y also returned a dictionary of meta attributes), and

I'm not opposed to going back to returning a dictionary and storing it in meta_, but as mentioned in #88 (comment), we'd need to figure out how to handle parameters that are typically base-level attributes of estimators, like n_outputs_.

I think I'm coming around to the idea of having BaseWrapper.__init__ accept feature_encoder and target_encoder keywords

Let's work with the interface we have for now. It should be possible in the future to package it all up into transformers that could be passed as __init__ params. That alternative interface has its own complications (e.g., requiring access to the loss parameter).

* That said, some of these parameters are checked by other things within the Scikit-Learn ecosystem, but I think that if a user overrides preprocess_y, doesn't set n_outputs_, and then runs into a Scikit-Learn error, they should be able to figure out what went wrong and fix it easily; I don't think we should validate that they set any particular parameters.

@adriangb (Owner, Author) commented Oct 5, 2020

Collecting the pending topics from the last round of reviews:

Adding the "multiclass-onehot" target type (comment)

Requirement: all basic use cases that work with Keras OR sklearn should work with SciKeras.
This boils down to:

  • How do we define "basic"?
    • Number of inputs/outputs?
    • Loss function?
  • For "advanced" uses, do we pass data through and let Keras fail if it is not formatted correctly, or do we raise an error in SciKeras to tell the user that SciKeras does not know how to handle this type of data?
  • The original wrappers would automatically one-hot encode targets if the loss function is "categorical_crossentropy". Do we continue to support this API?

The current solution addresses these issues as follows:

  • Define "basic" as:
    • Single input/output
  • Fail with a clear ValueError if sklearn's type_of_target reports a multi-output target ("multiclass-multioutput" or "multilabel-indicator"). Do not fail based on the loss function.
  • Continue to support automatic one-hot encoding of targets for "categorical_crossentropy", but only if the user explicitly passes it as the loss function. No other loss-function-based data manipulation.
  • Create a new type of target ("multiclass-onehot") to represent the subset of "multilabel-indicator" where there is only 1 label per sample (i.e. one-hot encoded), which maps directly to the common use case in Keras where you one-hot encode your target and use the "categorical_crossentropy" loss. Unlike the general "multilabel-indicator" case, SciKeras does not fail with a ValueError and instead passes this data through to Keras; a sketch of the implied check appears after the alternative below. This is a con, since we generally don't want to modify the sklearn API.

One alternative solution is:

  • Do not fail if sklearn's type_of_target reports a multi-output target.
  • Continue to support automatic one-hot encoding of targets for "categorical_crossentropy", but only if the user explicitly passes it as the loss function. No other loss-function-based data manipulation.
  • No need to create a new type of target. This is a huge plus; we don't want to modify the sklearn API.
  • Users may be confused when they try to implement a multi-output model and run into TF errors (example).
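
To make the distinction concrete, here is a sketch of the check the proposed target type implies (the helper name is hypothetical):

import numpy as np
from sklearn.utils.multiclass import type_of_target

def is_multiclass_onehot(y) -> bool:
    # One-hot targets are the subset of "multilabel-indicator"
    # with exactly one hot entry per row.
    y = np.asarray(y)
    return type_of_target(y) == "multilabel-indicator" and bool(np.all(y.sum(axis=1) == 1))

is_multiclass_onehot(np.array([[0, 1, 0], [1, 0, 0]]))  # True: one label per sample
is_multiclass_onehot(np.array([[1, 1, 0], [1, 0, 1]]))  # False: genuinely multilabel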

@stsievert (Collaborator) commented

Let's for now work on the interface we have. It should be possible in the future

The future is less than 12 hours away. I'll make a PR to this branch that cleans up the code. It simplifies the logic around initialization and makes only internal changes.

@adriangb (Owner, Author) commented Oct 6, 2020

The future is less than 12 hours away. I'll make a PR to this branch that cleans up the code. It simplifies the logic around initialization and makes only internal changes.

That is great, I look forward to it!

@stsievert (Collaborator) commented

The original wrappers would automatically one-hot encode targets if the loss function is "categorical_crossentropy". Do we continue to support this API?

Yes, that API should be supported. I don't think SciKeras should break the API that TF implemented.

@adriangb (Owner, Author) commented Oct 9, 2020

Thanks for your work over in #93 @stsievert. We should be up to date here now.

I think the only pending thing is type_of_target(y)=="multiclass-onehot".

@codecov-io commented Oct 9, 2020

Codecov Report

Merging #88 into master will decrease coverage by 0.35%.
The diff coverage is 99.09%.

@@            Coverage Diff             @@
##           master      #88      +/-   ##
==========================================
- Coverage   99.80%   99.45%   -0.36%     
==========================================
  Files           3        5       +2     
  Lines         526      548      +22     
==========================================
+ Hits          525      545      +20     
- Misses          1        3       +2     
Impacted Files Coverage Δ
scikeras/wrappers.py 99.00% <97.87%> (-0.74%) ⬇️
scikeras/_utils.py 100.00% <100.00%> (ø)
scikeras/utils/__init__.py 100.00% <100.00%> (ø)
scikeras/utils/transformers.py 100.00% <100.00%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 6897d83...aebbf89.

adriangb mentioned this pull request Oct 9, 2020
@adriangb (Owner, Author) commented

I did a quick once-over of the diff and fixed a couple of typos and other minor issues.

@stsievert , when you have a chance, can you take a look at my responses/fixes to the last couple of comments, and let me know if we can resolve those or if there is more work needed? Thanks!

@adriangb adriangb merged commit b0bb39f into master Nov 3, 2020
@adriangb adriangb deleted the data-processing-ref branch December 5, 2020 20:46