
Datetime support for 1-way Partial Dependence #2454

Merged

merged 24 commits into main from bc_2385_datetime on Jul 15, 2021
Conversation

Contributor

@bchen1116 bchen1116 commented Jun 28, 2021

Fixes #2385

We test partial dependence with the fraud dataset (1,000 rows only), using the Elastic Net Classifier estimator.

Testing the runtime for the new implementation:
[screenshot: timing code for the new implementation]
Output:
[screenshot: timing output]

Comparing the runtime on the main branch:
[screenshots: three timing runs on main]
We cut all of the main-branch runs short since each iteration takes about 5-7 minutes.

Overall, we see a drastic improvement in runtime.

@bchen1116 bchen1116 self-assigned this Jun 28, 2021

codecov bot commented Jun 28, 2021

Codecov Report

Merging #2454 (815d453) into main (86e7a02) will increase coverage by 0.1%.
The diff coverage is 100.0%.


@@           Coverage Diff           @@
##            main   #2454     +/-   ##
=======================================
+ Coverage   99.7%   99.7%   +0.1%     
=======================================
  Files        283     283             
  Lines      25589   25635     +46     
=======================================
+ Hits       25488   25534     +46     
  Misses       101     101             
Impacted Files Coverage Δ
evalml/model_understanding/graphs.py 100.0% <100.0%> (ø)
...del_understanding_tests/test_partial_dependence.py 100.0% <100.0%> (ø)

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

part_dep = partial_dependence(pipeline, X, features=20)
# keeps the test from running too long. The part below still runs for 3 other tests
if grid == 100 or size == 100:
return
Contributor Author
This test takes about 1 min to run locally.

Contributor

Dang. Is it just the test case where grid=100 that takes 1 minute, or all test cases?

Contributor Author

@freddyaboulton The entire test takes about 1 minute locally! Leaving the grid and size unconstrained makes it run much longer.

Contributor

@chukarsten chukarsten left a comment

Nice job overcoming the issue and getting the performance increase. I think you're definitely on the right track, but I do have some concerns about using the sklearn private functions as well as duplicating some of the code in sklearn's partial_dependence(). I'd like to explore modifying the data so that you can send it into the stock partial_dependence function and then perhaps fix it up after it comes out. I'm not sure that's a good solution, though, so I'd be happy to pair on it.

Comment on lines +718 to +729
predictions = predictions.reshape(
-1, X.shape[0], *[val.shape[0] for val in values]
)

averaged_predictions = averaged_predictions.reshape(
-1, *[val.shape[0] for val in values]
)
preds = {
"average": averaged_predictions,
"individual": predictions,
"values": [value_dates],
}
Contributor

OK, so like you said, this is straight from sklearn... so is there any way we can modify X before it gets passed into sklearn_partial_dependence() so that it works? I'm a little leery of utilizing sklearn's private functions and somewhat duplicating their partial_dependence code.
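For context, the reshape in the quoted diff above arranges flat predictions into shape (n_outputs, n_samples, per-feature grid lengths). A minimal NumPy sketch with made-up sizes (sklearn actually returns the averaged array pre-computed; the mean over the samples axis here is just the conceptual equivalent):

```python
import numpy as np

# Hypothetical sizes: one output, 2 samples, a 3-point grid for one feature.
n_samples = 2
values = [np.array([0.0, 0.5, 1.0])]  # grid values for the single feature

# Flat predictions: one value per (grid point, sample) pair.
predictions = np.arange(n_samples * values[0].shape[0], dtype=float)

# Same reshape pattern as the quoted diff:
# (n_outputs, n_samples, *per-feature grid lengths).
individual = predictions.reshape(-1, n_samples, *[v.shape[0] for v in values])
average = individual.mean(axis=1)  # averaging over samples gives "average"

print(individual.shape)  # (1, 2, 3)
print(average.shape)     # (1, 3)
```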

Comment on lines +697 to +705
timestamps = np.array(
[(X_dt - pd.Timestamp("1970-01-01")) // np.timedelta64(1, "s")]
).reshape(-1, 1)
grid, values = _grid_from_X(
timestamps, percentiles=percentiles, grid_resolution=grid_resolution
)
grid_dates = pd.to_datetime(
pd.Series(grid.squeeze()), unit="s"
).values.reshape(-1, 1)
Contributor
I like this; I think it's clever. Converting to seconds after the Unix epoch, yes? So ultimately you're using a large integer to compute the grid. Perhaps we can use this calculation to replace the dates with seconds after the Unix epoch, run the data through sklearn's stock partial dependence, then convert the datetime feature back after the fact? I think that would mitigate some of the code-duplication and private-function issues.
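The round trip this comment describes (datetimes → epoch seconds → grid → datetimes) can be sketched as follows; the helper names are hypothetical, not evalml API:

```python
import numpy as np
import pandas as pd

def to_epoch_seconds(dates: pd.Series) -> pd.Series:
    # Integer seconds since the Unix epoch; floor division drops sub-second parts.
    return (dates - pd.Timestamp("1970-01-01")) // pd.Timedelta(seconds=1)

def from_epoch_seconds(seconds) -> pd.Series:
    # Inverse mapping: epoch seconds back to datetimes.
    return pd.to_datetime(pd.Series(np.asarray(seconds).ravel()), unit="s")

dates = pd.Series(pd.date_range("2021-01-01", periods=4, freq="12h"))
secs = to_epoch_seconds(dates)       # large integers sklearn can grid over
restored = from_epoch_seconds(secs)  # map the computed grid back to datetimes
assert (restored.values == dates.values).all()
```

Because the grid points sklearn produces are just interpolated integers, the same `from_epoch_seconds` mapping would convert them back for display.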

@bchen1116
Contributor Author

@chukarsten filed an issue to track fixing/cleaning this up here

Contributor

@freddyaboulton freddyaboulton left a comment

@bchen1116 Thanks for doing this! The speed-up looks fantastic. I left some suggestions for adding test coverage. Definitely agree that the next step is to clean up our partial dependence, possibly rolling our own implementation!

)
if any(is_datetime):
timestamps = np.array(
[(X_dt - pd.Timestamp("1970-01-01")) // np.timedelta64(1, "s")]
Contributor

Can we add coverage for the following cases:

  • There's more than one date feature in X
  • The date features are all not equally spaced
  • The date features have hours/minutes/seconds

We can do this in a separate test from the one you've already modified. That test is great; I'm just thinking of possible corner cases to guard against. IMO, it's OK to try only a small grid size and a simple pipeline to keep that test fast.
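A minimal sketch of data hitting those corner cases (column names are made up here; this is not the evalml test code):

```python
import numpy as np
import pandas as pd

# Hypothetical frame: two datetime features, one unevenly spaced with
# hours/minutes/seconds, plus a numeric column.
X = pd.DataFrame({
    "signup": pd.to_datetime([
        "2021-01-01 00:00:00", "2021-01-03 05:30:00",
        "2021-01-10 12:15:45", "2021-02-01 23:59:59",
    ]),
    "last_login": pd.to_datetime([
        "2021-03-01", "2021-03-02", "2021-03-15", "2021-04-20",
    ]),
    "amount": np.array([10.0, 250.5, 33.3, 7.75]),
})

# Spacing between consecutive signup dates is not constant.
gaps = X["signup"].diff().dropna()
print(gaps.nunique())  # 3 distinct gaps => unevenly spaced
```

A test could then fit a small, fast pipeline on such a frame and compute partial dependence on each date column with a small grid size, asserting the returned grid values are valid datetimes.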

Contributor Author

@freddyaboulton updated the tests!

Contributor

Thanks @bchen1116 Looks good to me! Is the runtime reasonable?

Contributor Author

@freddyaboulton It takes less than 20 seconds to run locally.


@bchen1116 bchen1116 merged commit 3404a82 into main Jul 15, 2021
@chukarsten chukarsten mentioned this pull request Jul 22, 2021
@freddyaboulton freddyaboulton deleted the bc_2385_datetime branch May 13, 2022 15:02

Successfully merging this pull request may close these issues.

Speed up partial dependence for date-time features
3 participants