
MAINT: remove input dtype checks #59

Merged (36 commits, Aug 17, 2020)

Conversation

@stsievert (Collaborator) commented Aug 15, 2020

What does this PR implement?
It removes dtype checks on the input data. What if float32 data is passed? That's common in ML.

TODO:

  • Can we use OrdinalEncoder instead of LabelEncoder to avoid one more dtype conversion? (See the sketch after this list.)
  • Test for KerasRegressor handling of integer X and/or y
  • Test for KerasClassifier handling of object/string y
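
A minimal sketch of the dtype question in the first TODO item (my example, not code from this PR; assumes scikit-learn >= 0.20 for OrdinalEncoder):

import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

y = np.array(["a", "b", "a", "c"])
# LabelEncoder offers no dtype control; it returns platform-default ints
print(LabelEncoder().fit_transform(y).dtype)  # e.g. int64
# OrdinalEncoder takes a dtype parameter (but expects 2D input)
print(OrdinalEncoder(dtype=np.int64).fit_transform(y.reshape(-1, 1)).dtype)  # int64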

@codecov-commenter commented Aug 15, 2020

Codecov Report

Merging #59 into master will increase coverage by 0.25%.
The diff coverage is 100.00%.


@@            Coverage Diff             @@
##           master      #59      +/-   ##
==========================================
+ Coverage   99.51%   99.77%   +0.25%     
==========================================
  Files           3        3              
  Lines         413      443      +30     
==========================================
+ Hits          411      442      +31     
+ Misses          2        1       -1     
Impacted Files Coverage Δ
scikeras/_utils.py 98.70% <100.00%> (+0.03%) ⬆️
scikeras/wrappers.py 100.00% <100.00%> (+0.29%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@adriangb (Owner) commented:

Thank you for this! I agree that we should allow any data type. Will we need to handle data conversions to make sure X and y have the same type? Some tests are failing now because of that. I realize that in general users should not be feeding X and y of different types, but since most Scikit-Learn estimators have no problem with that, I believe we should figure out how to allow it here as well.

@stsievert (Collaborator, Author) commented Aug 16, 2020

> Will we need to handle data conversions to make sure X and y have the same type?

Do types always have to be the same for the input and output? It'd probably be fair to use np.promote_types for typecasting.
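
For reference, NumPy's promotion rules (a NumPy fact, not specific to this PR) pick a common dtype that both inputs can be cast to:

import numpy as np

np.promote_types("float32", "uint8")  # dtype('float32')
np.promote_types("float32", "int64")  # dtype('float64'); int64 values don't fit in float32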

Does the input/output type matching change with run_eagerly?

@adriangb (Owner) commented:

> Do types always have to be the same for the input and output? It'd probably be fair to use np.promote_types for typecasting.
> Does the input/output type matching change with run_eagerly?

I am not 100% sure, but I think the answers are yes and no, respectively.

@adriangb (Owner) commented:

@stsievert I implemented an initial prototype for a type conversion-deconversion system. It could probably benefit from some more thought before being finalized.

@stsievert (Collaborator, Author) commented Aug 16, 2020

Why do we need an implementation of type-casting? The error message is clear and easy to resolve:

TypeError: Input 'y' of 'Sub' Op has type float32 that does not match type int64 of argument 'x'.

Plus, this test passes on 4f0329d:

@pytest.mark.parametrize("X_dtype", ["float32", "float64"])
@pytest.mark.parametrize("y_dtype", ["int64", "int32", "uint8", "uint16"])
@pytest.mark.parametrize("run_eagerly", [True, False])
def test_classifier_handles_types(X_dtype, y_dtype, run_eagerly):
    clf = KerasClassifier(build_fn=dynamic_classifier, run_eagerly=run_eagerly)
    n, d = 100, 20
    n_classes = 10
    X = np.random.uniform(size=(n, d)).astype(X_dtype)
    y = np.random.choice(n_classes, size=n).astype(y_dtype)
    clf.fit(X, y)
    assert clf.score(X, y) >= 0

It looks like y is being type-cast to a float with LabelEncoder: self.model_.fit is being called with (X.dtype, y.dtype) == ("float32", "float64") when the input dtypes are (X.dtype, y.dtype) == ("float32", "uint8").

@adriangb (Owner) commented Aug 16, 2020

Thank you for the input. I went back and did some more testing and determined a couple of things:

  • Scikit-Learn only really expects the output dtype for y to match its input dtype.
  • The TF failures were due to our custom R2 loss function, which could not handle different dtypes. Because of how TF builds the graphs, the error doesn't show up in wrappers.py and instead shows up inside TF, which confused me and would likely confuse a user as well.

With this knowledge, I was able to remove the type-casting system and keep only a single cast back to the input dtype for y in KerasClassifier.postprocess_y. Now all tests are passing, including the one you propose in #59 (comment).
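
A minimal sketch of that single cast (the method name comes from the comment above, but the attribute holding the input dtype, here y_dtype_, is my assumption, not necessarily the name used in the PR):

def postprocess_y(self, y):
    # cast predictions back to the dtype the user originally passed for y
    return y.astype(self.y_dtype_, copy=False)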

@adriangb (Owner) commented Aug 16, 2020

I'm seeing some failures on Windows now; it looks like we're running into tensorflow/probability#886 or something similar.

Comment on lines 258 to 274

if OS_IS_WINDOWS:
    # see tensorflow/probability#886
    if not isinstance(X, np.ndarray):  # list, tuple, etc.
        X = [
            X_.astype(np.int64) if X_.dtype == np.int32 else X_
            for X_ in X
        ]
    else:
        X = X.astype(np.int64) if X.dtype == np.int32 else X
    if not isinstance(y, np.ndarray):  # list, tuple, etc.
        y = [
            y_.astype(np.int64) if y_.dtype == np.int32 else y_
            for y_ in y
        ]
    else:
        y = y.astype(np.int64) if y.dtype == np.int32 else y

@adriangb (Owner) commented Aug 16, 2020

The Windows failures are now passing with this hack... I think this is an actual bug in TF that they should fix, but I don't have a Windows device to really test on, so this will have to do for now...

@stsievert (Collaborator, Author) commented:

Is there a minimal working example to reproduce the bug on Windows? There's no MWE in tensorflow/probability#886.

@adriangb (Owner) commented Aug 17, 2020

Unfortunately, I did not see one. It might be as easy as:

import numpy as np
from tensorflow.python.framework.constant_op import convert_to_eager_tensor

# on Windows, this call is suspected to fail for int32 input (tensorflow/probability#886)
convert_to_eager_tensor(np.array([1], dtype=np.int32))

@adriangb (Owner) commented:

@stsievert I think this is ready for you to take another look

@stsievert (Collaborator, Author) left a review comment:

This looks pretty good. I certainly like the reduction in type-casting. Now no special processing is required except in special cases (np.int32 dtypes on Windows).

I have some suggestions and nits below.

@adriangb (Owner) commented:

I think this is looking great! Thank you @stsievert for all of your work.

@adriangb adriangb merged commit 6e616c1 into adriangb:master Aug 17, 2020
@adriangb adriangb deleted the dtypes branch August 17, 2020 19:33
- X = check_array(X, allow_nd=True, dtype=["float64", "int"])
+ X = check_array(X, allow_nd=True, dtype=get_dtype(X))
@stsievert (Collaborator, Author) commented:

What's the function get_dtype for? Why not pass dtype="numeric" to check_array? At first glance it looks like check_array does the same thing as get_dtype:

> dtype: string, type, list of types or None (default="numeric")
> Data type of result. If None, the dtype of the input is preserved. If "numeric", dtype is preserved unless array.dtype is object. If dtype is a list of types, conversion on the first type is only performed if the dtype of the input is not in the list.

(from the sklearn.utils.check_array docs)

It looks like the outputs of check_array(X, dtype=get_dtype(X)) and check_array(X, dtype="numeric") only differ when X has an object dtype (in which case they output objects of backend.floatx() and float64 respectively). Is that accurate?
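
A quick sketch of that comparison (get_dtype is this PR's helper; TF's default floatx of "float32" is assumed):

import numpy as np
from sklearn.utils import check_array

X_obj = np.array([[1, 2], [3, 4]], dtype=object)
check_array(X_obj, dtype="numeric").dtype  # float64
# check_array(X_obj, dtype=get_dtype(X_obj)).dtype would instead be float32,
# i.e. np.dtype(tf.keras.backend.floatx())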

@adriangb (Owner) commented Aug 17, 2020
Exactly. The outcome we want is:

  • object input: float32 output
  • numeric input: output with the same dtype as the input

The complication comes from the fact that the inputs can be arrays, dataframes, lists of lists, etc.

# instead of always float64 (sklearn's default)
tf_backend_dtype = np.dtype(tf.keras.backend.floatx())

def get_dtype(arr) -> np.dtype:
@stsievert (Collaborator, Author) commented:

I think there's a typo in this function: it looks like a np.ndarray with arr.dtype.kind == "O" will needlessly create another array. Maybe this function instead?

def get_dtype(arr):
    if isinstance(arr, np.ndarray):
        if arr.dtype.kind != "O":
            return arr.dtype
        else:
            return output_dtype

    # arr is not an ndarray
    arr_dtype = np.asarray(arr).dtype
    if arr_dtype.kind != "O":
        return arr_dtype
    return output_dtype
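
Hypothetical usage of the version above (output_dtype stands in for the tf_backend_dtype defined earlier):

get_dtype(np.array([1.0], dtype="float32"))  # dtype('float32'): numeric dtype preserved
get_dtype(np.asarray([1, None]))             # object dtype, so output_dtype is returned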

@adriangb (Owner) commented:
No, that's on purpose. This is for when we get an iterable of numpy arrays, for example a list of numpy arrays. I tried doing arr[0].dtype, but that fails some of the Scikit-Learn estimator checks, which specifically verify that you do not index the inputs before converting them to an array.

@adriangb (Owner) commented:
I misread your comment. You're right. There is a needless conversion. Let me make another PR to fix that.

dtype_y_true = np.dtype(y_true.dtype.as_numpy_dtype())  # assumed by symmetry; elided from the excerpt
dtype_y_pred = np.dtype(y_pred.dtype.as_numpy_dtype())
# cast both tensors to a common dtype before computing the metric
dest_dtype = np.promote_types(dtype_y_pred, dtype_y_true)
y_true = tf.cast(y_true, dtype=dest_dtype)
y_pred = tf.cast(y_pred, dtype=dest_dtype)
@stsievert (Collaborator, Author) commented:
Why is this batch of type-casting necessary? Does the backend calculation of R^2 require it?

@adriangb (Owner) commented:
It requires y_true and y_pred to match dtypes. We could just cast to float32 (since I think y_pred will always be float32) instead of trying to figure out the destination dtype using np.promote_types.
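
That simpler alternative, as a sketch:

y_true = tf.cast(y_true, dtype=tf.float32)
y_pred = tf.cast(y_pred, dtype=tf.float32)  # no-op if y_pred is already float32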
