Get histogram bins for decision boundaries #2972
Conversation
Codecov Report
@@ Coverage Diff @@
## main #2972 +/- ##
=======================================
+ Coverage 99.7% 99.7% +0.1%
=======================================
Files 310 312 +2
Lines 29495 29775 +280
=======================================
+ Hits 29404 29684 +280
Misses 91 91
Continue to review full report at Codecov.
@@ -379,13 +379,13 @@
 "from evalml.pipelines import RegressionPipeline\n",
 "\n",
 "X_regress, y_regress = evalml.demos.load_diabetes()\n",
-"X_train, X_test, y_train, y_test = evalml.preprocessing.split_data(X_regress, y_regress, problem_type='regression')\n",
+"X_train_reg, X_test_reg, y_train_reg, y_test_reg = evalml.preprocessing.split_data(X_regress, y_regress, problem_type='regression')\n",
This was a bug in the doc previously: since we named these X_train, y_train, etc., we ended up passing this regression dataset to the binary classification pipeline, which is incorrect. An issue for that is filed here.
if n_bins is not None:
    bins = [i / n_bins for i in range(n_bins + 1)]
else:
    bins = np.histogram_bin_edges(pos_preds, bins="fd", range=(0, 1))
This uses the Freedman-Diaconis ("fd") rule to find a suitable number of histogram bins.
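To make the two branches concrete, here is a small standalone sketch of both binning paths (the data in `pos_preds` is hypothetical; only the binning calls mirror the snippet above):

```python
import numpy as np

# Hypothetical predicted probabilities for the positive class
rng = np.random.default_rng(0)
pos_preds = rng.uniform(0, 1, 500)

# Caller-specified bin count: n_bins + 1 evenly spaced edges on [0, 1]
n_bins = 10
fixed_edges = [i / n_bins for i in range(n_bins + 1)]

# No bin count given: the Freedman-Diaconis rule picks the bin width from
# the data's interquartile range and sample size, clipped to [0, 1]
fd_edges = np.histogram_bin_edges(pos_preds, bins="fd", range=(0, 1))
```

Either way, the edges span exactly [0, 1], so every predicted probability falls in some bin.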
@bchen1116 Thank you so much for this! I think this is awesome. I like how we're iteratively building the confusion matrix based on the histogram counts, although it took me maybe a bit too long to figure out what was happening 😂
I left some comments I want to resolve before merge, mainly about making the return types a bit more intuitive.
I have one non-blocking broader comment that I want your thoughts on: it seems you've reimplemented optimize_thresholds, except you can do it for more objectives at a time and you get some confusion matrices along the way. I wonder if we can just move this into the pipeline method and refactor optimize_thresholds in this case? It seems weird that we have two ways to optimize thresholds in separate parts of the codebase.
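For readers following along, the "confusion matrix from histogram counts" idea described above can be sketched roughly as follows. This is a minimal sketch with hypothetical names, not the PR's actual implementation: once positives and negatives are binned, the confusion matrix at each candidate threshold falls out of cumulative sums, with no re-thresholding of the full prediction array.

```python
import numpy as np

def confusion_matrices_from_bins(y_true, pos_preds, bin_edges):
    # Count how many actual positives/negatives land in each probability bin
    pos_counts, _ = np.histogram(pos_preds[y_true == 1], bins=bin_edges)
    neg_counts, _ = np.histogram(pos_preds[y_true == 0], bins=bin_edges)
    total_pos, total_neg = pos_counts.sum(), neg_counts.sum()

    # Walking the edges left to right, the cumulative counts below each edge
    # give FN (positives predicted negative) and TN directly, so each
    # threshold's matrix costs O(1) instead of a pass over all predictions
    matrices = []
    for threshold, fn, tn in zip(bin_edges[1:],
                                 np.cumsum(pos_counts),
                                 np.cumsum(neg_counts)):
        tp, fp = total_pos - fn, total_neg - tn
        matrices.append((threshold, int(tp), int(fp), int(tn), int(fn)))
    return matrices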
The testing looks really good. I did have a few minor comments about moving the helper functions into standard metrics, as they seem to be a better fit there.
# let's iterate through the list to find the vals
for k, v in objective_dict.items():
    obj_val = v[1](val_list)
I kept this call rather than moving towards using the input y/y_pred, because it's much faster on larger datasets/larger n_bins (see the PR description for a time comparison!).
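The speedup described here comes from computing each objective from confusion-matrix counts rather than from the full y/y_pred arrays. A hedged sketch of what such count-based callables might look like — `objective_dict`, `val_list`, and the function names here are illustrative, not the PR's actual code; only the `v[1](val_list)` call shape mirrors the snippet above:

```python
def f1_from_counts(val_list):
    # val_list holds confusion-matrix counts: (tp, fp, tn, fn)
    tp, fp, tn, fn = val_list
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def accuracy_from_counts(val_list):
    tp, fp, tn, fn = val_list
    total = tp + fp + tn + fn
    return (tp + tn) / total if total else 0.0

# Each value pairs some metadata (here, the optimization direction) with a
# count-based callable, mirroring the v[1](val_list) call in the snippet above
objective_dict = {
    "F1": ("maximize", f1_from_counts),
    "Accuracy": ("maximize", accuracy_from_counts),
}

val_list = (8, 2, 85, 5)  # hypothetical (tp, fp, tn, fn) at one threshold
results = {k: v[1](val_list) for k, v in objective_dict.items()}
```

Because each call touches four integers instead of two full arrays, evaluating many objectives at many thresholds stays cheap regardless of dataset size.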
Awesome, Brian, I think the docstring changes are nice, thanks for doing them. I think also the dictionary key values are definitely a lot cleaner, good suggestion by Freddy.
Thank you @bchen1116 ! This looks great! Thanks for renaming the dataframe columns and adding the json option. I think this will be super helpful. The timing difference is crazy!
Documentation is here
![image](https://user-images.githubusercontent.com/22552445/139880793-4f7fba6f-4998-414d-839f-c76caf546925.png)
Doc write up is in conf
Decided not to include recall as an objective to optimize since the optimal will always be 0.
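A quick toy illustration of why recall is excluded: at a threshold of 0, every example is predicted positive, so false negatives are impossible and recall is always maximal (the data below is made up; only the reasoning matters):

```python
import numpy as np

y_true = np.array([1, 0, 1, 0, 1])
preds = np.array([0.10, 0.40, 0.35, 0.80, 0.70])

# At threshold 0, every example is predicted positive
y_pred = (preds >= 0.0).astype(int)
fn = int(((y_pred == 0) & (y_true == 1)).sum())  # no false negatives possible
tp = int(((y_pred == 1) & (y_true == 1)).sum())
recall = tp / (tp + fn)  # always 1.0 at threshold 0, so nothing to optimize
```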
Timing comparison of the current binning method vs. using y_pred/y with the original objective functions:
![image](https://user-images.githubusercontent.com/22552445/139781127-51dd46c2-679f-407a-bcf9-cb4832759510.png)