Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Get histogram bins for decision boundaries #2972

Merged
merged 18 commits into from
Nov 3, 2021
Merged

Conversation

bchen1116
Copy link
Contributor

@bchen1116 bchen1116 commented Oct 27, 2021

Documentation is here
image

Doc write up is in conf

Decided not to include recall as an objective to optimize since the optimal will always be 0.

Using the current binning method vs using y_pred/y for the original objective functions
image

@bchen1116 bchen1116 self-assigned this Oct 27, 2021
@codecov
Copy link

codecov bot commented Oct 27, 2021

Codecov Report

Merging #2972 (8d3e936) into main (74564cf) will increase coverage by 0.1%.
The diff coverage is 100.0%.

Impacted file tree graph

@@           Coverage Diff           @@
##            main   #2972     +/-   ##
=======================================
+ Coverage   99.7%   99.7%   +0.1%     
=======================================
  Files        310     312      +2     
  Lines      29495   29775    +280     
=======================================
+ Hits       29404   29684    +280     
  Misses        91      91             
Impacted Files Coverage Δ
evalml/model_understanding/__init__.py 100.0% <100.0%> (ø)
evalml/model_understanding/decision_boundary.py 100.0% <100.0%> (ø)
...odel_understanding_tests/test_decision_boundary.py 100.0% <100.0%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 74564cf...8d3e936. Read the comment docs.

@@ -379,13 +379,13 @@
"from evalml.pipelines import RegressionPipeline\n",
"\n",
"X_regress, y_regress = evalml.demos.load_diabetes()\n",
"X_train, X_test, y_train, y_test = evalml.preprocessing.split_data(X_regress, y_regress, problem_type='regression')\n",
"X_train_reg, X_test_reg, y_train_reg, y_test_reg = evalml.preprocessing.split_data(X_regress, y_regress, problem_type='regression')\n",
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was a bug in the doc previously. Since we named this X_train, y_train, etc, we ended up passing this regression dataset to the binary classification pipeline, which is incorrect. Issue for that filed here

if n_bins is not None:
bins = [i / n_bins for i in range(n_bins + 1)]
else:
bins = np.histogram_bin_edges(pos_preds, bins="fd", range=(0, 1))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This uses the Freedman-Diaconis (fd) rule to find the ideal number of histogram bins

Copy link
Contributor

@freddyaboulton freddyaboulton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bchen1116 Thank you so much for this! I think this is awesome. I like how we're iteratively building the confusion matrix based on the histogram counts although it took me maybe a bit too long to figure out what was happening 😂

I left some comments I want to resolve before merge, mainly about making the return types a bit more intuitive.

I have one non-blocking broader comment that I want your thoughts on: It seems you've reimplemented optimize_thresholds except you can do it for more objectives at a time and you get some confusion matrices along the way. I wonder if we can just move this into the pipeline method and refactor optimize_thresholds in this case? It seems weird that we have two ways to optimize thresholds in separate parts of the codebase.

evalml/model_understanding/decision_boundary.py Outdated Show resolved Hide resolved
evalml/model_understanding/decision_boundary.py Outdated Show resolved Hide resolved
evalml/model_understanding/decision_boundary.py Outdated Show resolved Hide resolved
evalml/model_understanding/decision_boundary.py Outdated Show resolved Hide resolved
evalml/model_understanding/decision_boundary.py Outdated Show resolved Hide resolved
evalml/model_understanding/decision_boundary.py Outdated Show resolved Hide resolved
Copy link
Contributor

@chukarsten chukarsten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The testing looks really good. I did have a few minor comments about moving around the helper functions into standard metrics as they seem to be a better fit there.

evalml/model_understanding/decision_boundary.py Outdated Show resolved Hide resolved

# let's iterate through the list to find the vals
for k, v in objective_dict.items():
obj_val = v[1](val_list)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I kept this call rather than moving towards using the input y/y_pred because it's much faster on larger datasets/larger n_bins (see the PR description for a time comparison!)

Copy link
Contributor

@chukarsten chukarsten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome, Brian, I think the docstring changes are nice, thanks for doing them. I think also the dictionary key values are definitely a lot cleaner, good suggestion by Freddy.

Copy link
Contributor

@freddyaboulton freddyaboulton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @bchen1116 ! This looks great! Thanks for renaming the dataframe columns and adding the json option. Think this will be super helpful. The timing difference is crazy!

docs/source/user_guide/model_understanding.ipynb Outdated Show resolved Hide resolved
@bchen1116 bchen1116 merged commit 9790d79 into main Nov 3, 2021
@chukarsten chukarsten mentioned this pull request Nov 9, 2021
@freddyaboulton freddyaboulton deleted the bc_decision_boundary_cm branch May 13, 2022 15:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants