Implement Cost-Benefit Matrix objective for binary classification #1038
Conversation
Codecov Report
@@            Coverage Diff             @@
##             main    #1038      +/-   ##
==========================================
+ Coverage   94.32%   99.91%    +5.58%
==========================================
  Files         183      187        +4
  Lines       10167    10278      +111
==========================================
+ Hits         9590    10269      +679
+ Misses        577        9      -568
Continue to review full report at Codecov.
@angela97lin I think this looks good! I left some minor questions and comments.
greater_is_better = True
score_needs_proba = False

def __init__(self, true_positive, true_negative, false_positive, false_negative):
I'm not familiar with this objective, but why would someone apply non-zero costs to true positives and true negatives?
I think I understand better now after looking at the example linked in the notes. Maybe using "payoff" instead of "cost" in the docstring (and maybe in the parameter names too) would be less confusing?
Got it. To put in my two cents: I was confused because costs are typically non-negative and something to minimize, but here the objective has `greater_is_better`, so I vote to change either the docstring or the `greater_is_better` flag.
@freddyaboulton so your confusion was that when you saw the word "cost," you thought it was referring to currency, rather than the ML cost function?
Regardless, I think the current naming works and that we should just do whatever is necessary to make the docs clear.
I thought it was weird that I would incur a cost for correctly identifying a true positive or true negative and that in this case we're trying to maximize the cost (I typically think of cost as being minimized in ML).
I think using "payoff" or "reward" would be clearer but I agree that the parameter names work and we should just change the docs! And maybe I'm alone in my confusion! 🤣
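To make the "payoff" framing concrete, here is an invented illustration; none of these figures come from the PR, they are purely hypothetical:

```python
# Hypothetical payoff matrix for a customer-retention campaign
# (all figures invented for illustration):
payoffs = {
    "true_positive": 100.0,    # correctly flagged churner, retained
    "true_negative": 0.0,      # loyal customer correctly left alone
    "false_positive": -10.0,   # wasted incentive on a loyal customer
    "false_negative": -150.0,  # missed churner who leaves
}
# Read this way, the entries are payoffs to maximize rather than costs
# to minimize, which is consistent with greater_is_better = True.
```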
LGTM. Is the plan to reimplement FraudCost after this?
cost_matrix = np.array([[self.true_negative, self.false_positive],
                        [self.false_negative, self.true_positive]])

total_cost = np.multiply(conf_matrix.values, cost_matrix).sum()
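For context, a minimal standalone sketch of the scoring step under discussion. It substitutes sklearn's `confusion_matrix` for evalml's own helper (an assumption made so the snippet runs on its own; the diff's `conf_matrix` is a DataFrame, hence the `.values` access there):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 0, 0, 1])

# Per-outcome payoffs/costs (illustrative values)
true_negative, false_positive = 0.0, -5.0
false_negative, true_positive = -10.0, 1.0

# sklearn's confusion matrix: rows are actual labels, columns predicted
conf_matrix = confusion_matrix(y_true, y_pred)  # [[TN, FP], [FN, TP]]
cost_matrix = np.array([[true_negative, false_positive],
                        [false_negative, true_positive]])

# Element-wise product, then sum over all four cells
total_cost = np.multiply(conf_matrix, cost_matrix).sum()
print(total_cost)  # 2*0.0 + 0*(-5.0) + 1*(-10.0) + 2*1.0 = -8.0
```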
very clean 🧼
Sweet, matrix math for the win!
It always bothered me that they called this method "multiply" when what it's really doing is element-wise multiplication, and to get true matrix multiplication you have to use `matmul`, lol 🤷♂️
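For anyone following along, a quick plain-numpy illustration of the distinction:

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[10, 20], [30, 40]])

# Element-wise (Hadamard) product: what np.multiply (and *) computes
print(np.multiply(a, b))  # [[ 10  40]
                          #  [ 90 160]]

# True matrix multiplication: np.matmul (or the @ operator)
print(np.matmul(a, b))    # [[ 70 100]
                          #  [150 220]]
```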
@jeremyliweishih Haha good question! I think there are a lot of similarities between the two, but they're also slightly different, since FraudCost takes in features (X) and uses them to calculate the score, while this is feature-agnostic :D
@angela97lin this is really cool!! I agree with what you mentioned the other day: I was expecting this would take more code. I think that's a sign that our APIs are working well ✨😂
This is good to merge IMO. But let's resolve a couple of points before calling this work complete:
- If we're going to move some of the pipeline graph utils, we should move all of them. The question is, where? Gen utils is fine. Another option is that we could create a new namespace, `evalml/graph` or `evalml/understanding`. We can circle back on this after this PR is merged, but we should address it before the August release because it's a breaking change.
- Your PR adds API documentation for the new objective. We should also add something to the user guide and/or tutorial. Fine to do that after this PR, and we can discuss ideas elsewhere. One idea would be to replace the "Example: Fraud Detection" section in the objectives guide with a cost-benefit example. Another would be just to add a short section mentioning the objective in the objectives guide. And another idea would be to add a tutorial for it.
true_positive (float): Cost associated with true positive predictions
true_negative (float): Cost associated with true negative predictions
false_positive (float): Cost associated with false positive predictions
false_negative (float): Cost associated with false negative predictions
Could we call these `true_positive_cost`, etc.?

There's an argument to be made for having default values (0) for each. I think I prefer the way it is now, where users have to specify each of the four costs in order to use the objective. So, no change needed there IMO, just sharing that thought.
Yup, that sounds good to me! I was also wondering about using default values or not, but I liked the idea that the user should really consider each of these parameters, so I decided against it. tl;dr: I think we're in agreement :D
y_true = pd.Series([0, 1, 2])
y_predicted = pd.Series([1, 0, 1])
with pytest.raises(ValueError, match="y_true contains more than two unique values"):
    cbm.score(y_true, y_predicted)
I forgot to comment on your unit tests. They look good! A couple of minor suggestions (see the sketch below for one way these could look):
- Try a `float` cost instead of an `int`, and make sure the math still works.
- What happens if a cost is `None`? You could just raise an error in `__init__`, e.g. `if true_positive_cost is None or true_negative_cost is None or ...: raise InvalidParameterException('...')`.
- Can `confusion_matrix` return anything weird/invalid? Incorrect dimensions?
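A hedged sketch of what those additions might look like; the import path, the exact exception type, and the assumption that `score` returns the raw (unnormalized) total from the diff are all guesses, not confirmed against the PR:

```python
import pandas as pd
import pytest

# Import path is an assumption; adjust to wherever the objective lives.
from evalml.objectives import CostBenefitMatrix


def test_cbm_float_costs():
    # Float costs instead of ints; the math should be unchanged.
    cbm = CostBenefitMatrix(true_positive=1.5, true_negative=0.0,
                            false_positive=-2.5, false_negative=-10.0)
    y_true = pd.Series([0, 1, 1, 0])
    y_predicted = pd.Series([0, 1, 0, 0])
    # 2 TN * 0.0 + 1 TP * 1.5 + 1 FN * -10.0 = -8.5, assuming score
    # returns the raw total with no normalization.
    assert cbm.score(y_true, y_predicted) == pytest.approx(-8.5)


def test_cbm_none_cost_raises():
    # A None cost should fail fast in __init__; the exact exception type
    # is an assumption (the review suggests an InvalidParameterException).
    with pytest.raises(Exception):
        CostBenefitMatrix(true_positive=None, true_negative=0.0,
                          false_positive=-1.0, false_negative=-1.0)
```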
@dsherry Thanks for your comments!! Re: consolidating graph utils, I filed #1053 since I didn't want to introduce too many unrelated line changes in this PR. I've assigned the issue to myself and will put up a PR for it shortly after this one. I'll address your test comment and update this PR accordingly.
Closes #1025.
Notes here: https://alteryx.quip.com/u4ioAV4ztaya/Custom-Objectives-classification-cost-benefit