
MAINT: fix needless ndarray creation #63

Merged: 4 commits into master from hotfix/fix-needless-ndarray-creation, Aug 18, 2020

Conversation

adriangb
Owner

As per:

@stsievert let me know if there's anything else you catch from #59. I should have waited for you to merge, sorry about that!

@codecov-commenter

codecov-commenter commented Aug 17, 2020

Codecov Report

Merging #63 into master will increase coverage by 0.25%.
The diff coverage is 100.00%.


@@            Coverage Diff             @@
##           master      #63      +/-   ##
==========================================
+ Coverage   99.51%   99.76%   +0.25%     
==========================================
  Files           3        3              
  Lines         413      432      +19     
==========================================
+ Hits          411      431      +20     
+ Misses          2        1       -1     
Impacted Files          Coverage Δ
scikeras/_utils.py      98.70% <100.00%> (+0.03%) ⬆️
scikeras/wrappers.py   100.00% <100.00%> (+0.29%) ⬆️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6e616c1...637fc04.

@stsievert
Collaborator

Why not this implementation of get_dtype? If I'm correct, it has the same features expressed in #59 (comment):

dtype = (
    None
    if isinstance(X, np.ndarray) and X.dtype.kind != "O"
    else tf.keras.backend.floatx()
)
X = check_array(X, allow_nd=True, dtype=dtype)

@adriangb
Owner Author

adriangb commented Aug 17, 2020

Why not this implementation of get_dtype? If I'm correct, it has the same features expressed in #59 (comment):

dtype = (
    None
    if isinstance(X, np.ndarray) and X.dtype.kind != "O"
    else tf.keras.backend.floatx()
)
X = check_array(X, allow_nd=True, dtype=dtype)

I think this would not work for X = [np.array([0], dtype=np.int32)] (i.e., a list of arrays): dtype would be tf.keras.backend.floatx() when it should be int32.
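
To illustrate, a minimal sketch of that failure mode (assuming the standard np/tf import aliases; the print is only for demonstration):

import numpy as np
import tensorflow as tf

X = [np.array([0], dtype=np.int32)]  # a list of arrays, not an ndarray

# isinstance(X, np.ndarray) is False, so the one-liner falls through to floatx()
dtype = (
    None
    if isinstance(X, np.ndarray) and X.dtype.kind != "O"
    else tf.keras.backend.floatx()
)
print(dtype)  # "float32", even though the underlying data is int32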

@stsievert
Collaborator

stsievert commented Aug 18, 2020

Why not this implementation?

import numpy as np
import tensorflow as tf


def _check_array_dtype(arr):
    if not isinstance(arr, np.ndarray):
        # Coerce lists (including lists of arrays) to an ndarray first
        return _check_array_dtype(np.asarray(arr))

    if arr.dtype.kind == "O":
        return tf.keras.backend.floatx()
    else:
        return None  # check_array won't do any casting with dtype=None
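
For context, a sketch of how this helper would pair with check_array (the combined call is my assumption, not necessarily the exact PR code; check_array comes from sklearn.utils.validation):

from sklearn.utils.validation import check_array

# dtype=None preserves numeric dtypes; object-dtype input gets cast to floatx()
X = [np.array([0], dtype=np.int32)]
X = check_array(X, allow_nd=True, dtype=_check_array_dtype(X))
print(X.dtype)  # int32, preserved rather than cast to float32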

@adriangb
Owner Author

Why not this implementation?

That is a bit cleaner; the single recursion helps. Thank you.

Comment on lines -990 to -995

def _validate_data(self, X, y=None, reset=True):
    """Convert y to float, regressors cannot accept int."""
    if y is not None:
        y = check_array(y, ensure_2d=False)
    return super()._validate_data(X=X, y=y, reset=reset)

adriangb (Owner Author)

This can actually be removed now, thanks to the changes in BaseWrapper._validate_data!

Comment on lines -1064 to +1052

- dtype_y_true = np.dtype(y_true.dtype.as_numpy_dtype())
- dtype_y_pred = np.dtype(y_pred.dtype.as_numpy_dtype())
- dest_dtype = np.promote_types(dtype_y_pred, dtype_y_true)
- y_true = tf.cast(y_true, dtype=dest_dtype)
- y_pred = tf.cast(y_pred, dtype=dest_dtype)
+ # y_pred will always be float32 so we cast y_true to float32
+ y_true = tf.cast(y_true, dtype=y_pred.dtype)
adriangb (Owner Author)

Always casting to float32 works. I think that trying to match the dtypes here was overkill and probably less efficient.
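
A toy illustration of the simplified cast (the tensor values are made up; assumes eager TF 2.x):

import tensorflow as tf

y_true = tf.constant([0, 1, 2], dtype=tf.int64)
y_pred = tf.constant([[0.1, 0.8, 0.1]] * 3)  # Keras model outputs default to float32

# A single cast to y_pred's dtype replaces the old promote_types dance
y_true = tf.cast(y_true, dtype=y_pred.dtype)
print(y_true.dtype)  # <dtype: 'float32'>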

Comment on lines -299 to +308
"X_dtype", ["float32", "float64", "int64", "int32", "uint8", "uint16"]
"X_dtype",
["float32", "float64", "int64", "int32", "uint8", "uint16", "object"],
)
@pytest.mark.parametrize(
"y_dtype", ["float32", "float64", "int64", "int32", "uint8", "uint16"]
"y_dtype",
["float32", "float64", "int64", "int32", "uint8", "uint16", "object"],
)
@pytest.mark.parametrize(
"s_w_dtype", ["float32", "float64", "int64", "int32", "uint8", "uint16"]
"s_w_dtype",
["float32", "float64", "int64", "int32", "uint8", "uint16", "object"],
adriangb (Owner Author) commented Aug 18, 2020

These tests now take a long time with all of these dtypes. I think we need to split this up into:

  1. Test a single dtype at a time for all inputs (i.e. y, X, and sw all float32, etc.).
  2. Test that mixed dtypes work (we can do 1 float and 1 int for each input, giving 6 tests).

I'm not sure where/how we need to be testing for run_eagerly, but if we can reduce that parametrization as well I think that would be nice.

I'll leave this for now, but I'll probably make a separate PR to try to address this.

stsievert (Collaborator)

Yeah, I've noticed that these tests take a long time too.

I think run_eagerly should be tested, certainly with different input types. run_eagerly controls whether TensorFlow compiles the program ahead of time, and whether TF can do dynamic type inference.
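
For reference, run_eagerly is the standard flag on Model.compile; a minimal sketch (the tiny model here is made up for illustration):

import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(4, activation="softmax")])
# run_eagerly=True bypasses tf.function graph compilation, so each
# batch executes op-by-op and dtypes are resolved dynamically
model.compile(loss="sparse_categorical_crossentropy", run_eagerly=True)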

adriangb (Owner Author)

Do you think it needs to be tested against both of the tests proposed above, or can we pick the higher-risk one?

stsievert (Collaborator)

I think run_eagerly only needs to be tested on (2) above. But you could include one set of parameters with the same dtype just to make sure it passes:

@pytest.mark.parametrize("y_dtype", ["float32", "float64", "uint8", "int16"])
@pytest.mark.parametrize("run_eagerly", [True, False])
def test_mixed_dtypes(y_dtype, run_eagerly):
    n, d = 100, 10
    n_classes = 4
    X = np.random.uniform(size=(n, d)).astype("float32")
    y = np.random.choice(n_classes, size=n).astype(y_dtype)
    est = KerasClassifier(..., run_eagerly=run_eagerly)
    est.fit(X, y)
    est.score(X, y)

adriangb (Owner Author)

I like that. Let's do that next PR.

adriangb (Owner Author)

By the way @stsievert, if you want tests to run faster, you can do:

pip install pytest-xdist
pytest -n auto

@adriangb
Owner Author

I think this is ready to merge, @stsievert?

@stsievert
Collaborator

Yup, this looks good to me.

@adriangb merged commit 6156c29 into master Aug 18, 2020
@adriangb deleted the hotfix/fix-needless-ndarray-creation branch Aug 18, 2020 14:13
@stsievert mentioned this pull request Aug 27, 2020