Get histogram bins for decision boundaries #2972
Conversation
Codecov Report
@@ Coverage Diff @@
## main #2972 +/- ##
=======================================
+ Coverage 99.7% 99.7% +0.1%
=======================================
Files 310 312 +2
Lines 29495 29775 +280
=======================================
+ Hits 29404 29684 +280
Misses 91 91
Continue to review full report at Codecov.
@@ -379,13 +379,13 @@
 "from evalml.pipelines import RegressionPipeline\n",
 "\n",
 "X_regress, y_regress = evalml.demos.load_diabetes()\n",
-"X_train, X_test, y_train, y_test = evalml.preprocessing.split_data(X_regress, y_regress, problem_type='regression')\n",
+"X_train_reg, X_test_reg, y_train_reg, y_test_reg = evalml.preprocessing.split_data(X_regress, y_regress, problem_type='regression')\n",
This was a bug in the doc previously: since we named these X_train, y_train, etc., we ended up passing this regression dataset to the binary classification pipeline, which is incorrect. An issue for that is filed here.
if n_bins is not None:
    bins = [i / n_bins for i in range(n_bins + 1)]
else:
    bins = np.histogram_bin_edges(pos_preds, bins="fd", range=(0, 1))
This uses the Freedman-Diaconis ("fd") rule to find a suitable number of histogram bins.
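To make the two branches concrete, here is a small standalone sketch of both binning paths (the data in `pos_preds` is hypothetical; only the binning calls mirror the snippet above):

```python
import numpy as np

# Hypothetical predicted probabilities for the positive class
rng = np.random.default_rng(0)
pos_preds = rng.uniform(0, 1, 500)

# Caller-specified bin count: n_bins + 1 evenly spaced edges on [0, 1]
n_bins = 10
fixed_edges = [i / n_bins for i in range(n_bins + 1)]

# No bin count given: the Freedman-Diaconis rule picks the bin width from
# the data's interquartile range and sample size, clipped to [0, 1]
fd_edges = np.histogram_bin_edges(pos_preds, bins="fd", range=(0, 1))
```

Either way, the edges span exactly [0, 1], so every predicted probability falls in some bin.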
@bchen1116 Thank you so much for this! I think this is awesome. I like how we're iteratively building the confusion matrix based on the histogram counts, although it took me maybe a bit too long to figure out what was happening 😂
I left some comments I want to resolve before merge, mainly about making the return types a bit more intuitive.
I have one non-blocking broader comment that I want your thoughts on: it seems you've reimplemented optimize_thresholds, except you can do it for more objectives at a time and you get some confusion matrices along the way. I wonder if we can just move this into the pipeline method and refactor optimize_thresholds in this case? It seems weird that we have two ways to optimize thresholds in separate parts of the codebase.
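For readers following along, the "confusion matrix from histogram counts" idea described above can be sketched roughly as follows. This is a minimal sketch with hypothetical names, not the PR's actual implementation: once positives and negatives are binned, the confusion matrix at each candidate threshold falls out of cumulative sums, with no re-thresholding of the full prediction array.

```python
import numpy as np

def confusion_matrices_from_bins(y_true, pos_preds, bin_edges):
    # Count how many actual positives/negatives land in each probability bin
    pos_counts, _ = np.histogram(pos_preds[y_true == 1], bins=bin_edges)
    neg_counts, _ = np.histogram(pos_preds[y_true == 0], bins=bin_edges)
    total_pos, total_neg = pos_counts.sum(), neg_counts.sum()

    # Walking the edges left to right, the cumulative counts below each edge
    # give FN (positives predicted negative) and TN directly, so each
    # threshold's matrix costs O(1) instead of a pass over all predictions
    matrices = []
    for threshold, fn, tn in zip(bin_edges[1:],
                                 np.cumsum(pos_counts),
                                 np.cumsum(neg_counts)):
        tp, fp = total_pos - fn, total_neg - tn
        matrices.append((threshold, int(tp), int(fp), int(tn), int(fn)))
    return matrices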
The testing looks really good. I did have a few minor comments about moving the helper functions into standard metrics, as they seem to be a better fit there.
# let's iterate through the list to find the vals
for k, v in objective_dict.items():
    obj_val = v[1](val_list)
I kept this call rather than moving towards using the input y/y_pred, because it's much faster on larger datasets/larger n_bins (see the PR description for a time comparison!).
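The speedup described here comes from computing each objective from confusion-matrix counts rather than from the full y/y_pred arrays. A hedged sketch of what such count-based callables might look like — `objective_dict`, `val_list`, and the function names here are illustrative, not the PR's actual code; only the `v[1](val_list)` call shape mirrors the snippet above:

```python
def f1_from_counts(val_list):
    # val_list holds confusion-matrix counts: (tp, fp, tn, fn)
    tp, fp, tn, fn = val_list
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def accuracy_from_counts(val_list):
    tp, fp, tn, fn = val_list
    total = tp + fp + tn + fn
    return (tp + tn) / total if total else 0.0

# Each value pairs some metadata (here, the optimization direction) with a
# count-based callable, mirroring the v[1](val_list) call in the snippet above
objective_dict = {
    "F1": ("maximize", f1_from_counts),
    "Accuracy": ("maximize", accuracy_from_counts),
}

val_list = (8, 2, 85, 5)  # hypothetical (tp, fp, tn, fn) at one threshold
results = {k: v[1](val_list) for k, v in objective_dict.items()}
```

Because each call touches four integers instead of two full arrays, evaluating many objectives at many thresholds stays cheap regardless of dataset size.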
Awesome, Brian, I think the docstring changes are nice, thanks for doing them. I think also the dictionary key values are definitely a lot cleaner, good suggestion by Freddy.
Thank you @bchen1116 ! This looks great! Thanks for renaming the dataframe columns and adding the json option. I think this will be super helpful. The timing difference is crazy!
Documentation is here
![image](https://user-images.githubusercontent.com/22552445/139880793-4f7fba6f-4998-414d-839f-c76caf546925.png)
Doc write up is in conf
Decided not to include recall as an objective to optimize since the optimal will always be 0.
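A quick toy illustration of why recall is excluded: at a threshold of 0, every example is predicted positive, so false negatives are impossible and recall is always maximal (the data below is made up; only the reasoning matters):

```python
import numpy as np

y_true = np.array([1, 0, 1, 0, 1])
preds = np.array([0.10, 0.40, 0.35, 0.80, 0.70])

# At threshold 0, every example is predicted positive
y_pred = (preds >= 0.0).astype(int)
fn = int(((y_pred == 0) & (y_true == 1)).sum())  # no false negatives possible
tp = int(((y_pred == 1) & (y_true == 1)).sum())
recall = tp / (tp + fn)  # always 1.0 at threshold 0, so nothing to optimize
```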
Timing comparison of the current binning method vs. using y_pred/y with the original objective functions:
![image](https://user-images.githubusercontent.com/22552445/139781127-51dd46c2-679f-407a-bcf9-cb4832759510.png)