-
Notifications
You must be signed in to change notification settings - Fork 83
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Explicitly round grid values for integer data in partial dependence #4096
Conversation
2b44920
to
f90a8d0
Compare
Codecov Report
@@ Coverage Diff @@
## main #4096 +/- ##
=======================================
+ Coverage 99.7% 99.7% +0.1%
=======================================
Files 349 349
Lines 37548 37588 +40
=======================================
+ Hits 37429 37469 +40
Misses 119 119
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. |
f90a8d0
to
8781d03
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great! Just a few small comments, the only important one is with respect to test_partial_dependence_grid_values_for_numeric_data
pipeline = linear_regression_pipeline | ||
if nullable_y_ltype == "BooleanNullable": | ||
pipeline = logistic_regression_binary_pipeline |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the reasoning behind this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I want to be able to test partial dependence with various y logical types, so setting different pipelines here lets me test boolean nullable (the linear regression pipeline errors with Regression pipeline can only handle numeric target data
otherwise)
) | ||
assert len(part_dep["partial_dependence"]) == min( | ||
grid_resolution, | ||
len(X[int_col].dropna().unique()), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need dropna()
here? Are we ever expected to have nulls?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oops this is a holdover from when I first implemented this in my nullable types branch when I was testing this with data that actually contains null values, but because of how we currently handle nullable logical types in our imputer components, these pipelines can't actually take in these nullable types with nans on main.
I can get rid of the dropna!
assert len(part_dep["partial_dependence"]) == min( | ||
grid_resolution, | ||
len(X[int_col].dropna().unique()), | ||
) | ||
assert len(part_dep["feature_values"]) == min( | ||
grid_resolution, | ||
len(X[int_col].dropna().unique()), | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any reason we need to check the length these columns individually? We should just be able to check the length of the part_dep
data frame
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good call! This is just logic used from other tests in this file, but I agree there's no need.
y = ww.init_series(pd.Series([True, False] * 25), logical_type="Boolean") | ||
X = pd.DataFrame({"col": pd.Series(range(len(y)))}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this test fail on main? It looks like with a dataset length of 50, none of the parameterized grid resolution values would result in fractions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The age fraction and double tests pass on main, but all the remaining tests currently fail. Age and Integer fail on the check for fractional values for all of the grid resolution parameters (more on that in a sec), and the AgeNullable and IntegerNullable types fail because of the type conversion error.
Regarding your point about the grid_resolution parameterization as it relates to the length/number of unique values in X being 50 - you're right that that's not what I want, but they're all currently resulting in fractions. When the grid resolution is strictly lower than the number of unique values, we calculate our own grid values, which is where the fractional values come from.
I'm dealing with an off by one error here - I want the final number to be greater than 50. And in that case, I would expect those tests to pass on main, which I'm fine with. This test is both about recording the existing behavior and fixing broken behavior.
I'm going to change the final grid resolution number to be 60.
closes #4095
This PR fixes the above issue by choosing to round the fractional grid values explicitly in
_grid_from_X
rather than leave it up to the integer dtypes to truncate values or raise an error at woodwork initalization.