Explicitly round grid values for integer data in partial dependence #4096

tamargrey · 2023-03-21T18:44:22Z

This PR fixes the above issue by explicitly rounding the fractional grid values in _grid_from_X, rather than leaving it up to the integer dtypes to truncate values or raise an error at woodwork initialization.
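To make the difference concrete, here is a minimal sketch using plain pandas and NumPy rather than evalml or woodwork. The grid values are made up for illustration, and the snippet only assumes that woodwork's integer logical types are backed by pandas' `int64` and nullable `Int64` dtypes.

```python
import numpy as np
import pandas as pd

# Illustrative fractional grid, like an evenly spaced grid over integer data.
grid = np.linspace(0, 10, num=4)  # 0, 3.33..., 6.66..., 10

# A plain integer dtype silently truncates the fractional part.
truncated = pd.Series(grid).astype("int64")

# A nullable integer dtype refuses the lossy cast instead.
try:
    pd.Series(grid).astype("Int64")
    nullable_cast_failed = False
except (TypeError, ValueError):
    nullable_cast_failed = True

# Rounding explicitly first gives the nearest integers and a safe cast.
rounded = pd.Series(np.round(grid)).astype("Int64")

print(truncated.tolist(), nullable_cast_failed, rounded.tolist())
```

Rounding before the cast means both kinds of integer dtype see the same values, instead of one truncating and the other raising.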

codecov · 2023-03-21T18:55:33Z

Codecov Report

Merging #4096 (9f0c54a) into main (be8b543) will increase coverage by 0.1%.
The diff coverage is 100.0%.

@@           Coverage Diff           @@
##            main   #4096     +/-   ##
=======================================
+ Coverage   99.7%   99.7%   +0.1%     
=======================================
  Files        349     349             
  Lines      37548   37588     +40     
=======================================
+ Hits       37429   37469     +40     
  Misses       119     119

Impacted Files	Coverage Δ
...l/model_understanding/_partial_dependence_utils.py	`99.4% <100.0%> (+0.1%)`	⬆️
...del_understanding_tests/test_partial_dependence.py	`100.0% <100.0%> (ø)`


eccabay

Looks great! Just a few small comments; the only important one concerns test_partial_dependence_grid_values_for_numeric_data.

eccabay · 2023-03-21T20:25:25Z

evalml/tests/model_understanding_tests/test_partial_dependence.py

+    pipeline = linear_regression_pipeline
+    if nullable_y_ltype == "BooleanNullable":
+        pipeline = logistic_regression_binary_pipeline


What's the reasoning behind this?

I want to be able to test partial dependence with various y logical types, so setting different pipelines here lets me test BooleanNullable. Otherwise, the linear regression pipeline errors with "Regression pipeline can only handle numeric target data".

eccabay · 2023-03-21T20:26:55Z

evalml/tests/model_understanding_tests/test_partial_dependence.py

+    )
+    assert len(part_dep["partial_dependence"]) == min(
+        grid_resolution,
+        len(X[int_col].dropna().unique()),


Why do we need dropna() here? Are we ever expected to have nulls?

Oops, this is a holdover from when I first implemented this in my nullable types branch, where I was testing with data that actually contained null values. Because of how we currently handle nullable logical types in our imputer components, these pipelines can't actually take in nullable types with NaNs on main.

I can get rid of the dropna!

eccabay · 2023-03-21T20:28:52Z

evalml/tests/model_understanding_tests/test_partial_dependence.py

+    assert len(part_dep["partial_dependence"]) == min(
+        grid_resolution,
+        len(X[int_col].dropna().unique()),
+    )
+    assert len(part_dep["feature_values"]) == min(
+        grid_resolution,
+        len(X[int_col].dropna().unique()),
+    )


Any reason we need to check the length of these columns individually? We should be able to just check the length of the part_dep data frame.

Good call! This is just logic used from other tests in this file, but I agree there's no need.
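A small sketch of the simplified check follows. The `part_dep` frame here is a hand-built stand-in with made-up values, not real output from evalml's partial dependence code; only the column names come from the diff above.

```python
import pandas as pd

grid_resolution = 5
int_col = "col"
X = pd.DataFrame({int_col: range(50)})

# Stand-in for a partial dependence result; the numbers are invented.
part_dep = pd.DataFrame(
    {
        "feature_values": [2, 13, 24, 36, 47],
        "partial_dependence": [0.1, 0.2, 0.3, 0.4, 0.5],
    }
)

# All columns of a DataFrame share one length, so a single check is enough.
expected_rows = min(grid_resolution, X[int_col].nunique())
assert len(part_dep) == expected_rows
```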

eccabay · 2023-03-21T20:32:47Z

evalml/tests/model_understanding_tests/test_partial_dependence.py

+    y = ww.init_series(pd.Series([True, False] * 25), logical_type="Boolean")
+    X = pd.DataFrame({"col": pd.Series(range(len(y)))})


Does this test fail on main? It looks like with a dataset length of 50, none of the parameterized grid resolution values would result in fractions.

The AgeFractional and Double tests pass on main, but all the remaining tests currently fail. Age and Integer fail the check for fractional values for every grid resolution parameter (more on that in a sec), and the AgeNullable and IntegerNullable types fail because of the type conversion error.

Regarding your point about the grid_resolution parameterization as it relates to the length/number of unique values in X being 50 - you're right that that's not what I want, but they're all currently resulting in fractions. When the grid resolution is strictly lower than the number of unique values, we calculate our own grid values, which is where the fractional values come from.

I'm dealing with an off-by-one error here: I want the final number to be greater than 50. In that case, I would expect those tests to pass on main, which I'm fine with. This test is about both recording the existing behavior and fixing broken behavior.

I'm going to change the final grid resolution number to be 60.
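The interaction between grid resolution and the number of unique values can be sketched roughly as below. This is a simplified illustration, not evalml's actual _grid_from_X: `sketch_grid`, its `percentiles` default, and the exact comparison used are assumptions made for the example.

```python
import numpy as np


def sketch_grid(values, grid_resolution, percentiles=(0.05, 0.95)):
    """Hypothetical helper: choose grid points for one numeric feature."""
    uniques = np.unique(values)
    if len(uniques) < grid_resolution:
        # Fewer unique values than requested points: use them directly,
        # so integer data yields an integer grid.
        return uniques
    # Otherwise space points evenly between two percentiles. These are
    # generally fractional even when every input value is an integer.
    low, high = np.quantile(values, percentiles)
    return np.linspace(low, high, num=grid_resolution)


col = np.arange(50)  # 50 unique integer values, as in the test data

coarse = sketch_grid(col, grid_resolution=5)   # computed grid
edge = sketch_grid(col, grid_resolution=50)    # still a computed grid
fine = sketch_grid(col, grid_resolution=60)    # falls back to the uniques
```

Under these assumptions, a resolution of exactly 50 still produces a computed, fractional grid for 50 unique values, which is why a value above 50, such as 60, is needed to exercise the branch that reuses the unique values.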

tamargrey force-pushed the fix-pd-with-int-nullable branch 2 times, most recently from 2b44920 to f90a8d0 Compare March 21, 2023 18:53

tamargrey marked this pull request as ready for review March 21, 2023 19:08

auto-assign bot assigned tamargrey Mar 21, 2023

tamargrey requested review from eccabay, jeremyliweishih, chukarsten and christopherbunn March 21, 2023 19:08

Tamar Grey added 2 commits March 21, 2023 16:01

explicitly round grid values for integer data in partial dependence

15e318a

Add release note

8781d03

tamargrey force-pushed the fix-pd-with-int-nullable branch from f90a8d0 to 8781d03 Compare March 21, 2023 20:02

eccabay approved these changes Mar 21, 2023


PR Comments

9f0c54a

christopherbunn approved these changes Mar 22, 2023


tamargrey merged commit 6fd5daf into main Mar 22, 2023

tamargrey deleted the fix-pd-with-int-nullable branch March 22, 2023 21:02
