Datetime support for 1-way Partial Dependence #2454
Conversation
Codecov Report

```
@@            Coverage Diff            @@
##              main    #2454    +/-  ##
========================================
+ Coverage     99.7%    99.7%    +0.1%
========================================
  Files          283      283
  Lines        25589    25635      +46
========================================
+ Hits         25488    25534      +46
  Misses         101      101
```
```python
part_dep = partial_dependence(pipeline, X, features=20)
# keeps the test from running too long. The part below still runs for 3 other tests
if grid == 100 or size == 100:
    return
```
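For context, a guard like this usually sits inside a parametrized test; the following is a hedged reconstruction (the test name, parameter values, and elided body are assumptions, not the PR's actual test):

```python
import pytest

@pytest.mark.parametrize("size", [10, 100])
@pytest.mark.parametrize("grid", [5, 100])
def test_partial_dependence_datetime_scaling(grid, size):
    # ... build a `size`-row dataset, fit `pipeline`, then call
    # partial_dependence(pipeline, X, features=20) with grid_resolution=grid ...
    # keeps the test from running too long; the part above still runs
    # for the 3 other (grid, size) combinations
    if grid == 100 or size == 100:
        return
    # ... expensive follow-up assertions run only for the smallest case ...
```

Returning early (rather than `pytest.skip`) keeps the cheap smoke check above the guard running for every parameter combination.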
This test takes about 1 min to run locally.
Dang. Does just the test case where grid=100 take 1 minute to run, or do all the test cases?
@freddyaboulton The entire test takes about 1 minute locally! Leaving the grid and size unconstrained makes it run much longer.
chukarsten left a comment
Nice job overcoming the issue and getting the performance increase. I think you're definitely on the right track, but I do have some concerns about using the sklearn private functions, as well as duplicating some of the code in sklearn's partial_dependence(). I'd like to explore modifying the data so that you can send it into the stock part_dep function and then perhaps fix it up after it comes out. Not sure if that's a good solution, though, so I'd be happy to pair on it.
evalml/tests/model_understanding_tests/test_partial_dependence.py
```python
predictions = predictions.reshape(
    -1, X.shape[0], *[val.shape[0] for val in values]
)

averaged_predictions = averaged_predictions.reshape(
    -1, *[val.shape[0] for val in values]
)
preds = {
    "average": averaged_predictions,
    "individual": predictions,
    "values": [value_dates],
}
```
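For readers following the reshape bookkeeping: with one feature and a single-output model, the flattened per-row predictions become shape (n_outputs, n_rows, grid_len), and the row-averaged values become (n_outputs, grid_len). A minimal sketch of those shapes, with all names illustrative:

```python
import numpy as np

n_rows, grid_len = 4, 5
values = [np.arange(grid_len)]           # the grid for the single feature

# Flattened per-row predictions over every grid point.
raw = np.arange(n_rows * grid_len, dtype=float)

# Same reshape pattern as the quoted snippet.
predictions = raw.reshape(-1, n_rows, *[val.shape[0] for val in values])
averaged_predictions = predictions.mean(axis=1).reshape(
    -1, *[val.shape[0] for val in values]
)
```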
OK, so like you said, this is straight from sklearn... so is there any way that we can modify X before it gets passed into sklearn_partial_dependence() so that it works? I'm a little leery of utilizing sklearn's private functions and somewhat duplicating their partial_dependence code.
```python
timestamps = np.array(
    [X_dt - pd.Timestamp("1970-01-01")] // np.timedelta64(1, "s")
).reshape(-1, 1)
grid, values = _grid_from_X(
    timestamps, percentiles=percentiles, grid_resolution=grid_resolution
)
grid_dates = pd.to_datetime(
    pd.Series(grid.squeeze()), unit="s"
).values.reshape(-1, 1)
```
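The epoch-seconds conversion in the snippet round-trips cleanly. A standalone sketch of the same idea, with illustrative values and a plain `np.linspace` standing in for sklearn's `_grid_from_X`:

```python
import numpy as np
import pandas as pd

# Datetimes -> integer seconds since the Unix epoch.
dates = pd.Series(pd.to_datetime(["2021-01-01", "2021-06-15", "2021-12-31"]))
seconds = (dates - pd.Timestamp("1970-01-01")) // np.timedelta64(1, "s")

# Build an evenly spaced numeric grid over the range, standing in for
# what _grid_from_X computes from percentiles.
grid = np.linspace(seconds.min(), seconds.max(), num=5)

# Seconds -> datetimes again, for grid labels.
grid_dates = pd.to_datetime(pd.Series(grid), unit="s")
```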
I like this; I think it's clever. Converting to seconds since the Unix epoch, yes? So ultimately you're using a large integer to compute the grid. Perhaps we can use this calculation to replace the dates with seconds since the Unix epoch, run the data through sklearn's stock partial dependence, then convert the datetime feature back after the fact? To me, that would mitigate some of the code-duplication and private-function issues.
@chukarsten filed an issue to track fixing/cleaning this up here
freddyaboulton left a comment
@bchen1116 Thanks for doing this! The speed-up looks fantastic. I left some suggestions for adding test coverage. Definitely agree that the next step is to clean up our partial dependence, possibly rolling our own implementation!
```python
)
if any(is_datetime):
    timestamps = np.array(
        [X_dt - pd.Timestamp("1970-01-01")] // np.timedelta64(1, "s")
```
Can we add coverage for the following cases:
- There's more than one date feature in X
- The date features are not all equally spaced
- The date features have hours/minutes/seconds

We can do this in a separate test from the one you've already modified. That test is great, but I'm just thinking of possible corner cases to guard against. IMO, it's OK to only try a small grid size and a simple pipeline to keep that test fast.
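A sketch of test data covering the three corner cases listed above; the helper name and values are hypothetical, and a real test would feed this frame through evalml's partial_dependence with a small grid size:

```python
import numpy as np
import pandas as pd

def make_corner_case_dates(n_rows=20, seed=0):
    """Two datetime features, unevenly spaced, with sub-day precision."""
    rng = np.random.RandomState(seed)
    gaps = rng.randint(1, 72, size=n_rows).cumsum()   # uneven gaps, in hours
    start = pd.Timestamp("2021-01-01 09:30:15")       # has hours/minutes/seconds
    return pd.DataFrame({
        "date_a": start + pd.to_timedelta(gaps, unit="h"),
        "date_b": start + pd.to_timedelta(gaps[::-1] * 37, unit="m"),
        "value": rng.rand(n_rows),
    })
```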
@freddyaboulton updated the tests!
Thanks @bchen1116! Looks good to me! Is the runtime reasonable?
@freddyaboulton It takes less than 20 seconds locally to run.
evalml/pipelines/components/transformers/preprocessing/datetime_featurizer.py
Fixes #2385

We test partial dependence with the fraud dataset (1,000 rows only), using the Elastic Net Classifier estimator.
Testing the runtime for the new implementation:


Output:
Comparing the runtime on the main branch:



We cut all of these tests short since each iteration takes about 5-7 minutes here.
Overall, we see a drastic improvement in runtime.
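The runtime comparison above can be reproduced with a small timing helper; this is a generic sketch, not the benchmark script used for the PR:

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label, results):
    """Record wall-clock time for the wrapped block into `results`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start

# usage (pipeline and X are placeholders for a fitted pipeline and dataset):
# results = {}
# with timed("partial_dependence", results):
#     part_dep = partial_dependence(pipeline, X, features=20)
# print(results)
```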