Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Explicitly round grid values for integer data in partial dependence #4096

Merged
merged 3 commits into from
Mar 22, 2023

Conversation

tamargrey
Copy link
Contributor

@tamargrey tamargrey commented Mar 21, 2023

closes #4095

This PR fixes the above issue by choosing to round the fractional grid values explicitly in _grid_from_X rather than leave it up to the integer dtypes to truncate values or raise an error at woodwork initalization.

@tamargrey tamargrey force-pushed the fix-pd-with-int-nullable branch 2 times, most recently from 2b44920 to f90a8d0 Compare March 21, 2023 18:53
@codecov
Copy link

codecov bot commented Mar 21, 2023

Codecov Report

Merging #4096 (9f0c54a) into main (be8b543) will increase coverage by 0.1%.
The diff coverage is 100.0%.

@@           Coverage Diff           @@
##            main   #4096     +/-   ##
=======================================
+ Coverage   99.7%   99.7%   +0.1%     
=======================================
  Files        349     349             
  Lines      37548   37588     +40     
=======================================
+ Hits       37429   37469     +40     
  Misses       119     119             
Impacted Files Coverage Δ
...l/model_understanding/_partial_dependence_utils.py 99.4% <100.0%> (+0.1%) ⬆️
...del_understanding_tests/test_partial_dependence.py 100.0% <100.0%> (ø)

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

@tamargrey tamargrey marked this pull request as ready for review March 21, 2023 19:08
Copy link
Contributor

@eccabay eccabay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! Just a few small comments, the only important one is with respect to test_partial_dependence_grid_values_for_numeric_data

Comment on lines +3045 to +3047
pipeline = linear_regression_pipeline
if nullable_y_ltype == "BooleanNullable":
pipeline = logistic_regression_binary_pipeline
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the reasoning behind this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want to be able to test partial dependence with various y logical types, so setting different pipelines here lets me test boolean nullable (the linear regression pipeline errors with Regression pipeline can only handle numeric target data otherwise)

)
assert len(part_dep["partial_dependence"]) == min(
grid_resolution,
len(X[int_col].dropna().unique()),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need dropna() here? Are we ever expected to have nulls?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oops this is a holdover from when I first implemented this in my nullable types branch when I was testing this with data that actually contains null values, but because of how we currently handle nullable logical types in our imputer components, these pipelines can't actually take in these nullable types with nans on main.

I can get rid of the dropna!

Comment on lines 3056 to 3063
assert len(part_dep["partial_dependence"]) == min(
grid_resolution,
len(X[int_col].dropna().unique()),
)
assert len(part_dep["feature_values"]) == min(
grid_resolution,
len(X[int_col].dropna().unique()),
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason we need to check the length these columns individually? We should just be able to check the length of the part_dep data frame

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call! This is just logic used from other tests in this file, but I agree there's no need.

Comment on lines +3112 to +3113
y = ww.init_series(pd.Series([True, False] * 25), logical_type="Boolean")
X = pd.DataFrame({"col": pd.Series(range(len(y)))})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this test fail on main? It looks like with a dataset length of 50, none of the parameterized grid resolution values would result in fractions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The age fraction and double tests pass on main, but all the remaining tests currently fail. Age and Integer fail on the check for fractional values for all of the grid resolution parameters (more on that in a sec), and the AgeNullable and IntegerNullable types fail because of the type conversion error.

Regarding your point about the grid_resolution parameterization as it relates to the length/number of unique values in X being 50 - you're right that that's not what I want, but they're all currently resulting in fractions. When the grid resolution is strictly lower than the number of unique values, we calculate our own grid values, which is where the fractional values come from.

I'm dealing with an off by one error here - I want the final number to be greater than 50. And in that case, I would expect those tests to pass on main, which I'm fine with. This test is both about recording the existing behavior and fixing broken behavior.

I'm going to change the final grid resolution number to be 60.

@tamargrey tamargrey merged commit 6fd5daf into main Mar 22, 2023
@tamargrey tamargrey deleted the fix-pd-with-int-nullable branch March 22, 2023 21:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants