
MAINT: remove input dtype checks #59

Merged (36 commits, Aug 17, 2020)

Conversation

@stsievert (Collaborator) commented Aug 15, 2020

What does this PR implement?
It removes dtype checks on the input data. What if float32 data is passed? That's common in ML.

TODO:

  • Can we use OrdinalEncoder instead of LabelEncoder to avoid one more dtype conversion? (See the sketch after this list.)
  • Test for KerasRegressor handling of integer X and/or y
  • Test for KerasClassifier handling of object/string y
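
A minimal sketch of the dtype question in the first TODO item (my example, not code from this PR; assumes scikit-learn >= 0.20 for OrdinalEncoder):

import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

y = np.array(["a", "b", "a", "c"])
# LabelEncoder offers no dtype control; it returns platform-default ints
print(LabelEncoder().fit_transform(y).dtype)  # e.g. int64
# OrdinalEncoder takes a dtype parameter (but expects 2D input)
print(OrdinalEncoder(dtype=np.int64).fit_transform(y.reshape(-1, 1)).dtype)  # int64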

@codecov-commenter commented Aug 15, 2020

Codecov Report

Merging #59 into master will increase coverage by 0.25%.
The diff coverage is 100.00%.


@@            Coverage Diff             @@
##           master      #59      +/-   ##
==========================================
+ Coverage   99.51%   99.77%   +0.25%     
==========================================
  Files           3        3              
  Lines         413      443      +30     
==========================================
+ Hits          411      442      +31     
+ Misses          2        1       -1     
Impacted Files Coverage Δ
scikeras/_utils.py 98.70% <100.00%> (+0.03%) ⬆️
scikeras/wrappers.py 100.00% <100.00%> (+0.29%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@adriangb (Owner) commented:

Thank you for this! I agree that we should allow any data type. Will we need to handle data conversions to make sure X and y have the same type? Some tests are failing now because of that. I realize that in general users should not be feeding X and y of different types, but since most Scikit-Learn estimators have no problem with that, I believe we should figure out how to allow it here as well.

@stsievert (Collaborator, Author) commented Aug 16, 2020

> Will we need to handle data conversions to make sure X and y have the same type?

Do types always have to be the same for the input and output? It'd probably be fair to use np.promote_types for typecasting.
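
For reference, NumPy's promotion rules (a NumPy fact, not specific to this PR) pick a common dtype that both inputs can be cast to:

import numpy as np

np.promote_types("float32", "uint8")  # dtype('float32')
np.promote_types("float32", "int64")  # dtype('float64'); int64 values don't fit in float32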

Does the input/output type matching change with run_eagerly?

@adriangb (Owner) commented:

> Do types always have to be the same for the input and output? It'd probably be fair to use np.promote_types for typecasting.
> Does the input/output type matching change with run_eagerly?

I am not 100% sure, but I think the answers are yes and no, respectively.

@adriangb (Owner) commented:

@stsievert I implemented an initial prototype for a type conversion-deconversion system. It could probably benefit from some more thought before being finalized.

@stsievert (Collaborator, Author) commented Aug 16, 2020

Why do we need an implementation of type-casting? The error message is clear and easy to resolve:

TypeError: Input 'y' of 'Sub' Op has type float32 that does not match type int64 of argument 'x'.

Plus, this test passes on 4f0329d:

@pytest.mark.parametrize("X_dtype", ["float32", "float64"])
@pytest.mark.parametrize("y_dtype", ["int64", "int32", "uint8", "uint16"])
@pytest.mark.parametrize("run_eagerly", [True, False])
def test_classifier_handles_types(X_dtype, y_dtype, run_eagerly):
    clf = KerasClassifier(build_fn=dynamic_classifier, run_eagerly=run_eagerly)
    n, d = 100, 20
    n_classes = 10
    X = np.random.uniform(size=(n, d)).astype(X_dtype)
    y = np.random.choice(n_classes, size=n).astype(y_dtype)
    clf.fit(X, y)
    assert clf.score(X, y) >= 0

It looks like y is being type-cast to a float with LabelEncoder: self.model_.fit is being called with (X.dtype, y.dtype) == ("float32", "float64") when the input dtypes are (X.dtype, y.dtype) == ("float32", "uint8").

@adriangb (Owner) commented Aug 16, 2020

Thank you for the input. I went back and did some more testing and determined a couple of things:

  • Scikit-Learn only really expects the output dtype for y to match its input dtype.
  • The TF failures were due to our custom R2 loss function, which could not handle different dtypes. Because of how TF builds the graphs, the error doesn't show up in wrappers.py and instead shows up inside TF, which confused me and would likely confuse a user as well.

With this knowledge, I was able to remove the type-casting system and keep only a single cast back to the input dtype for y in KerasClassifier.postprocess_y. Now all tests are passing, including the one you propose in #59 (comment).
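
A minimal sketch of that single cast (the method name comes from the comment above, but the attribute holding the input dtype, here y_dtype_, is my assumption, not necessarily the name used in the PR):

def postprocess_y(self, y):
    # cast predictions back to the dtype the user originally passed for y
    return y.astype(self.y_dtype_, copy=False)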

@adriangb (Owner) commented Aug 16, 2020

I'm seeing some failures on Windows now; it looks like we're running into tensorflow/probability#886 or something similar.

Comment on lines 258 to 274

if OS_IS_WINDOWS:
    # see tensorflow/probability#886
    if not isinstance(X, np.ndarray):  # list, tuple, etc.
        X = [
            X_.astype(np.int64) if X_.dtype == np.int32 else X_
            for X_ in X
        ]
    else:
        X = X.astype(np.int64) if X.dtype == np.int32 else X
    if not isinstance(y, np.ndarray):  # list, tuple, etc.
        y = [
            y_.astype(np.int64) if y_.dtype == np.int32 else y_
            for y_ in y
        ]
    else:
        y = y.astype(np.int64) if y.dtype == np.int32 else y

@adriangb (Owner) commented Aug 16, 2020

The Windows failures are now passing with this hack... I think this is an actual bug in TF that they should fix, but I don't have a Windows device to really test on, so this will have to do for now...

@stsievert (Collaborator, Author) commented:

Is there a minimal working example to reproduce the bug on Windows? There's no MWE in tensorflow/probability#886.

@adriangb (Owner) commented Aug 17, 2020

Unfortunately, I did not see one. It might be as easy as:

import numpy as np
from tensorflow.python.framework.constant_op import convert_to_eager_tensor

# on Windows, this call is suspected to fail for int32 input (tensorflow/probability#886)
convert_to_eager_tensor(np.array([1], dtype=np.int32))

@adriangb (Owner) commented:

@stsievert I think this is ready for you to take another look

@stsievert (Collaborator, Author) left a review comment:

This looks pretty good. I certainly like the reduction in type-casting. Now no special processing is required except in special cases (np.int32 dtypes on Windows).

I have some suggestions and nits below.

@adriangb (Owner) commented:

I think this is looking great! Thank you @stsievert for all of your work.

@adriangb adriangb merged commit 6e616c1 into adriangb:master Aug 17, 2020
@adriangb adriangb deleted the dtypes branch August 17, 2020 19:33
- X = check_array(X, allow_nd=True, dtype=["float64", "int"])
+ X = check_array(X, allow_nd=True, dtype=get_dtype(X))
@stsievert (Collaborator, Author) commented:

What's the function get_dtype for? Why not pass dtype="numeric" to check_array? At first glance it looks like check_array does the same thing as get_dtype:

> dtype: string, type, list of types or None (default="numeric")
> Data type of result. If None, the dtype of the input is preserved. If "numeric", dtype is preserved unless array.dtype is object. If dtype is a list of types, conversion on the first type is only performed if the dtype of the input is not in the list.

(from the sklearn.utils.check_array docs)

It looks like the outputs of check_array(X, dtype=get_dtype(X)) and check_array(X, dtype="numeric") only differ when X has an object dtype (in which case they output objects of backend.floatx() and float64 respectively). Is that accurate?
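
A quick sketch of that comparison (get_dtype is this PR's helper; TF's default floatx of "float32" is assumed):

import numpy as np
from sklearn.utils import check_array

X_obj = np.array([[1, 2], [3, 4]], dtype=object)
check_array(X_obj, dtype="numeric").dtype  # float64
# check_array(X_obj, dtype=get_dtype(X_obj)).dtype would instead be float32,
# i.e. np.dtype(tf.keras.backend.floatx())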

@adriangb (Owner) commented Aug 17, 2020
Exactly. The outcome we want is:

  • object input: float32 output
  • numeric input: output with the same dtype as the input

The complication comes from the fact that the inputs can be arrays, dataframes, lists of lists, etc.

# instead of always float64 (sklearn's default)
tf_backend_dtype = np.dtype(tf.keras.backend.floatx())

def get_dtype(arr) -> np.dtype:
@stsievert (Collaborator, Author) commented:

I think there's a typo in this function: it looks like a np.ndarray with arr.dtype.kind == "O" will needlessly create another array. Maybe this function instead?

def get_dtype(arr):
    if isinstance(arr, np.ndarray):
        if arr.dtype.kind != "O":
            return arr.dtype
        else:
            return output_dtype

    # arr is not an ndarray
    arr_dtype = np.asarray(arr).dtype
    if arr_dtype.kind != "O":
        return arr_dtype
    return output_dtype
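
Hypothetical usage of the version above (output_dtype stands in for the tf_backend_dtype defined earlier):

get_dtype(np.array([1.0], dtype="float32"))  # dtype('float32'): numeric dtype preserved
get_dtype(np.asarray([1, None]))             # object dtype, so output_dtype is returned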

@adriangb (Owner) commented:
No, that's on purpose. This is for when we get an iterable of numpy arrays, for example a list of numpy arrays. I tried doing arr[0].dtype, but that fails some of the Scikit-Learn estimator checks, which specifically verify that you do not index the inputs before converting them to an array.

@adriangb (Owner) commented:
I misread your comment. You're right. There is a needless conversion. Let me make another PR to fix that.

dtype_y_true = np.dtype(y_true.dtype.as_numpy_dtype())  # assumed by symmetry; elided from the excerpt
dtype_y_pred = np.dtype(y_pred.dtype.as_numpy_dtype())
# cast both tensors to a common dtype before computing the metric
dest_dtype = np.promote_types(dtype_y_pred, dtype_y_true)
y_true = tf.cast(y_true, dtype=dest_dtype)
y_pred = tf.cast(y_pred, dtype=dest_dtype)
@stsievert (Collaborator, Author) commented:
Why is this batch of type-casting necessary? Does the backend calculation of R^2 require it?

@adriangb (Owner) commented:
It requires y_true and y_pred to match dtypes. We could just cast to float32 (since I think y_pred will always be float32) instead of trying to figure out the destination dtype using np.promote_types.
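
That simpler alternative, as a sketch:

y_true = tf.cast(y_true, dtype=tf.float32)
y_pred = tf.cast(y_pred, dtype=tf.float32)  # no-op if y_pred is already float32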
