MAINT: remove input dtype checks #59
Conversation
Codecov Report
```diff
@@            Coverage Diff             @@
##           master      #59      +/-   ##
==========================================
+ Coverage   99.51%   99.77%   +0.25%
==========================================
  Files           3        3
  Lines         413      443      +30
==========================================
+ Hits          411      442      +31
+ Misses          2        1       -1
```
Continue to review full report at Codecov.
Thank you for this! I agree that we should allow any data type. Will we need to handle data conversions to make sure …
Do types always have to be the same for the input and output? It'd probably be fair to use `np.promote_types` for typecasting. Does the input/output type matching change with …
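The `np.promote_types` suggestion can be sanity-checked with plain NumPy (nothing SciKeras-specific):

```python
import numpy as np

# np.promote_types returns the smallest dtype that can safely
# hold values of both input dtypes
assert np.promote_types("float32", "int64") == np.dtype("float64")
assert np.promote_types("int32", "int64") == np.dtype("int64")
assert np.promote_types("uint8", "float32") == np.dtype("float32")
```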
I am not 100% sure, but I think the answers are yes and no, respectively.
@stsievert I implemented an initial prototype for a type conversion-deconversion system. It could probably benefit from some more thought before being implemented.
Why do we need an implementation of type-casting? The error message is clear and easy to resolve:
Plus, this test passes on 4f0329d:

```python
@pytest.mark.parametrize("X_dtype", ["float32", "float64"])
@pytest.mark.parametrize("y_dtype", ["int64", "int32", "uint8", "uint16"])
@pytest.mark.parametrize("run_eagerly", [True, False])
def test_classifier_handles_types(X_dtype, y_dtype, run_eagerly):
    clf = KerasClassifier(build_fn=dynamic_classifier, run_eagerly=run_eagerly)
    n, d = 100, 20
    n_classes = 10
    X = np.random.uniform(size=(n, d)).astype(X_dtype)
    y = np.random.choice(n_classes, size=n).astype(y_dtype)
    clf.fit(X, y)
    assert clf.score(X, y) >= 0
```

It looks like …
Thank you for the input. I went back and did some more testing and determined a couple of things:
With this knowledge, I was able to remove this type casting system and have only a single cast back to the input dtype for …
I'm seeing some failures on Windows now; it looks like we're running into tensorflow/probability#886 or something similar.
This reverts commit 476055b.
`scikeras/wrappers.py` (Outdated)
```python
if OS_IS_WINDOWS:
    # see tensorflow/probability#886
    if not isinstance(X, np.ndarray):  # list, tuple, etc.
        X = [
            X_.astype(np.int64) if X_.dtype == np.int32 else X_
            for X_ in X
        ]
    else:
        X = X.astype(np.int64) if X.dtype == np.int32 else X
    if not isinstance(y, np.ndarray):  # list, tuple, etc.
        y = [
            y_.astype(np.int64) if y_.dtype == np.int32 else y_
            for y_ in y
        ]
    else:
        y = y.astype(np.int64) if y.dtype == np.int32 else y
```
The Windows failures are now passing with this hack. I think this is an actual bug in TF that they should fix, but I don't have a Windows device to test on, so this will have to do for now.
Is there a minimal working example to reproduce the bug on Windows? There's not a MWE in tensorflow/probability#886.
Unfortunately, I did not see one. It might be as easy as:
```python
import numpy as np
from tensorflow.python.framework.constant_op import convert_to_eager_tensor

convert_to_eager_tensor(np.array([1], dtype=np.int32))
```
@stsievert I think this is ready for you to take another look.
This looks pretty good. I certainly like the reduction in type-casting. Now, no special processing is required except in one special case (Windows, when inputs with dtype `np.int32` are passed).
I have some suggestions and nits below.
I think this is looking great! Thank you @stsievert for all of your work.
```diff
 )
-X = check_array(X, allow_nd=True, dtype=["float64", "int"])
+X = check_array(X, allow_nd=True, dtype=get_dtype(X))
```
What's the function `get_dtype` for? Why not pass `dtype="numeric"` to `check_array`? At first glance it looks like `check_array` does the same thing as `get_dtype`:

> dtype: string, type, list of types or None (default="numeric")
> Data type of result. If None, the dtype of the input is preserved. If "numeric", dtype is preserved unless array.dtype is object. If dtype is a list of types, conversion on the first type is only performed if the dtype of the input is not in the list.

It looks like the outputs of `check_array(X, dtype=get_dtype(X))` and `check_array(X, dtype="numeric")` only differ when `X` has an object dtype (in which case they output arrays of `backend.floatx()` and `float64`, respectively). Is that accurate?
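The scikit-learn side of this comparison is easy to verify directly (a sketch assuming scikit-learn is installed; this only exercises `check_array`, not SciKeras' `get_dtype`):

```python
import numpy as np
from sklearn.utils import check_array

X_obj = np.array([[1, 2], [3, 4]], dtype=object)

# dtype="numeric" converts object arrays to float64
assert check_array(X_obj, dtype="numeric").dtype == np.float64

# an explicit dtype list converts to its first entry instead,
# which is how a backend.floatx() of float32 would differ
assert check_array(X_obj, dtype=[np.float32]).dtype == np.float32

# numeric input dtypes are preserved by dtype="numeric"
X_num = np.array([[1, 2]], dtype=np.int32)
assert check_array(X_num, dtype="numeric").dtype == np.int32
```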
Exactly. The outcome we want is:

- object input: `float32` output
- numeric input: output has the same dtype as the input

The complication comes from the fact that the inputs can be arrays, dataframes, lists of lists, etc.
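A minimal sketch of that policy (hypothetical helper, not SciKeras' actual code; a hard-coded `float32` stands in for `tf.keras.backend.floatx()`):

```python
import numpy as np

# stand-in for tf.keras.backend.floatx(), which is "float32" by default
FALLBACK_DTYPE = np.dtype("float32")

def resolve_dtype(arr) -> np.dtype:
    """Object input -> float32; numeric input -> dtype preserved."""
    dtype = np.asarray(arr).dtype  # also handles lists of lists, etc.
    return FALLBACK_DTYPE if dtype.kind == "O" else dtype

assert resolve_dtype([None, 1]) == np.float32                  # object input
assert resolve_dtype(np.zeros(2, dtype=np.int32)) == np.int32  # preserved
```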
```python
# instead of always float64 (sklearn's default)
tf_backend_dtype = np.dtype(tf.keras.backend.floatx())


def get_dtype(arr) -> np.dtype:
```
I think there's a typo in this function: it looks like a `np.ndarray` with `arr.dtype.kind == "O"` will needlessly create another array. Maybe this function instead?

```python
def get_dtype(arr):
    if isinstance(arr, np.ndarray):
        if arr.dtype.kind != "O":
            return arr.dtype
        return output_dtype
    # arr is not an ndarray
    arr_dtype = np.asarray(arr).dtype
    if arr_dtype.kind != "O":
        return arr_dtype
    return output_dtype
```
No, that's on purpose. This is for when we get an iterable of numpy arrays, for example a list of numpy arrays. I tried doing `arr[0].dtype`, but that would then fail some tests from the Scikit-Learn checks that specifically check that you do not try to index the inputs before converting them to an array.
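The behavior being relied on here is plain NumPy: `np.asarray` infers a promoted common dtype for an iterable of arrays without the caller ever indexing it:

```python
import numpy as np

# a list of same-shape arrays stacks into one array; np.asarray
# promotes to a common dtype without indexing the list
parts = [np.zeros(3, dtype=np.float32), np.zeros(3, dtype=np.float64)]
assert np.asarray(parts).dtype == np.float64

# a single array passes through with its dtype intact
assert np.asarray(np.zeros(3, dtype=np.float32)).dtype == np.float32
```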
I misread your comment. You're right. There is a needless conversion. Let me make another PR to fix that.
```python
dtype_y_pred = np.dtype(y_pred.dtype.as_numpy_dtype())
dest_dtype = np.promote_types(dtype_y_pred, dtype_y_true)
y_true = tf.cast(y_true, dtype=dest_dtype)
y_pred = tf.cast(y_pred, dtype=dest_dtype)
```
Why is this batch of type-casting necessary? Does the backend calculation of R^2 require it?
It requires `y_true` and `y_pred` to match dtypes. We could just cast to `float32` (since I think `y_pred` will always be `float32`) instead of trying to figure out the destination dtype using `np.promote_types`.
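For illustration, the same promotion can be sketched without TensorFlow (hypothetical helper, not the code in this PR):

```python
import numpy as np

def to_common_dtype(y_true, y_pred):
    """Cast both arrays to the promoted common dtype, mirroring
    the pair of tf.cast calls in the diff above."""
    dest = np.promote_types(y_true.dtype, y_pred.dtype)
    return y_true.astype(dest), y_pred.astype(dest)

y_true = np.array([1, 2, 3], dtype=np.int64)
y_pred = np.array([1.1, 1.9, 3.2], dtype=np.float32)
yt, yp = to_common_dtype(y_true, y_pred)
assert yt.dtype == yp.dtype == np.float64  # int64 + float32 -> float64
```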
What does this PR implement?

It removes checks on the input data. What if `float32` data is passed? That's common in ML.

TODO:

- … `X` and/or `y`