Compute percent-better-than-baseline for all objectives #1244
Conversation
Codecov Report (main vs. #1244):
- Coverage: 99.93% → 99.93%
- Files: 207 → 207
- Lines: 12927 → 12997 (+70)
- Hits: 12918 → 12988 (+70)
- Misses: 9 → 9
Continue to review the full report at Codecov.
assert automl.results == loaded_automl.results
pd.testing.assert_frame_equal(automl.rankings, loaded_automl.rankings)

for id_, pipeline_results in automl.results['pipeline_results'].items():
Had to modify this check because `np.nan != np.nan`.
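For context, a quick illustration of the NaN issue: plain equality never matches NaN against itself, so a NaN-safe comparison is needed (shown here with numpy as a minimal sketch, not the exact test code):

```python
import numpy as np

# NaN never compares equal to itself, so a plain == check on results fails.
assert not (np.nan == np.nan)

# A NaN-safe comparison treats NaNs in the same positions as equal.
a = np.array([98.93, np.nan, 53.50])
b = np.array([98.93, np.nan, 53.50])
assert np.allclose(a, b, equal_nan=True)
```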
@patch("evalml.pipelines.ModeBaselineBinaryPipeline.score", return_value={'Log Loss Binary': 1, 'F1': 1}) | ||
@patch('evalml.pipelines.BinaryClassificationPipeline.score') | ||
@patch('evalml.pipelines.BinaryClassificationPipeline.fit') | ||
def test_percent_better_than_baseline_scores_different_folds(mock_fit, |
I noticed that all our other unit tests assume the same score is returned on each fold, so I decided to add one where the score is different for each fold.
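Roughly, per-fold scores can be driven through the mock's `side_effect`; here is a minimal sketch (the objective names and values are made up, not the test's actual fixtures):

```python
from unittest.mock import MagicMock

# side_effect with a list makes each call return the next item,
# so every CV fold sees a different score.
mock_score = MagicMock(side_effect=[
    {'Log Loss Binary': 0.3, 'F1': 0.9},
    {'Log Loss Binary': 0.5, 'F1': 0.8},
    {'Log Loss Binary': 0.7, 'F1': 0.7},
])

print(mock_score())  # first fold:  {'Log Loss Binary': 0.3, 'F1': 0.9}
print(mock_score())  # second fold: {'Log Loss Binary': 0.5, 'F1': 0.8}
```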
Great!
objective_class = get_objective(obj_name)
percent_better = objective_class.calculate_percent_difference(mean_cv_all_objectives[obj_name],
                                                               self._baseline_cv_scores[obj_name])
percent_better_than_baseline[obj_name] = percent_better
@gsheni This is what it would look like in `automl.results`:

'percent_better_than_baseline_all_objectives': {'Log Loss Binary': 98.9276200826952,
 'MCC Binary': nan,
 'AUC': 98.17222663713376,
 'Precision': nan,
 'F1': nan,
 'Balanced Accuracy Binary': 91.62898962401862,
 'Accuracy Binary': 53.49893478518168}
This is slightly different from what you posted in the original issue, but I think it's more human-readable. Happy to change it, though.
- The reasoning for a list of dictionaries is that it would allow for adding more information in the future.
- For example, EvalML could calculate a boolean that looks at the baseline and pipeline score to determine if it's actually better (even if `baselinePercentChange` is NaN).
- However, this is not a major sticking point, and the end user could get the scores, make the comparison, and calculate the boolean.

For example:

'percent_better_than_baseline_all_objectives': [
    {'objective': 'Log Loss Binary',
     'baselinePercentChange': 98.9276200826952,
     'isBetterThanBaseline': True},
    {'objective': 'MCC Binary',
     'baselinePercentChange': nan,
     'isBetterThanBaseline': True},
    ...]
Looks good to me!
Yep. Users can certainly easily compute that boolean if they want to.
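For anyone who does want that boolean, a hypothetical helper (not part of EvalML) built on the dict format above, assuming the convention that a positive percent difference means the pipeline beat the baseline:

```python
import numpy as np

def is_better_than_baseline(percent_better_all_objectives):
    # Positive percent difference is assumed to mean "better than baseline";
    # NaN is reported as None ("unknown") rather than True/False.
    return {name: (None if np.isnan(value) else value > 0)
            for name, value in percent_better_all_objectives.items()}

percent_better = {'Log Loss Binary': 98.9276200826952, 'MCC Binary': float('nan')}
print(is_better_than_baseline(percent_better))
# {'Log Loss Binary': True, 'MCC Binary': None}
```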
Force-pushed from b22ac9f to 1a3f4ea
@freddyaboulton looks awesome!!
I left a discussion about whether we still keep `percent_better_than_baseline` for the primary objective separate in `results`. Let's resolve that and then merge. I didn't have a great conclusion there; thinking about it now, let's leave it the way you have it. I just wanted to make you aware of that topic.
evalml/automl/automl_search.py (outdated)
for field, value in fold_data['all_objective_scores'].items():
    if field.lower() in objective_names:
        scores[field] += value
return {objective_name: score / n_folds for objective_name, score in scores.items()}
Cool. If `score` is an `int` this won't carry the remainder, so for safety please do `float(score) / n_folds`.
Good catch!
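For illustration, a self-contained sketch of the averaging step being discussed, using `float(score) / n_folds` as suggested; the fold-data layout and helper name are assumptions, not the actual code in `automl_search.py`:

```python
from collections import defaultdict

def mean_cv_scores(cv_data, objective_names):
    # Sum each objective's score across folds, then average.
    scores = defaultdict(int)
    for fold_data in cv_data:
        for field, value in fold_data['all_objective_scores'].items():
            if field.lower() in objective_names:
                scores[field] += value
    n_folds = len(cv_data)
    # Explicit float(), per the review suggestion above.
    return {name: float(score) / n_folds for name, score in scores.items()}

cv_data = [{'all_objective_scores': {'AUC': 1, 'F1': 0}},
           {'all_objective_scores': {'AUC': 0, 'F1': 1}}]
print(mean_cv_scores(cv_data, {'auc', 'f1'}))  # {'AUC': 0.5, 'F1': 0.5}
```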
@@ -686,7 +706,8 @@ def _add_result(self, trained_pipeline, parameters, training_time, cv_data, cv_s
     "high_variance_cv": high_variance_cv,
     "training_time": training_time,
     "cv_data": cv_data,
-    "percent_better_than_baseline": percent_better,
+    "percent_better_than_baseline_all_objectives": percent_better_than_baseline,
+    "percent_better_than_baseline": percent_better_than_baseline[self.objective.name],
@freddyaboulton so in `cv_data`, do we have separate fields for the primary objective vs the additional ones? Let's just make sure we're consistent across our code. Relatedly, right now, does `percent_better_than_baseline_all_objectives` also include the primary objective?
I think this would be cleanest if `percent_better_than_baseline` were simply a dict here in `results`, and then in `full_rankings`, when we make the rankings leaderboard, we extract that to make the column. But if that's not what we do in `cv_data`, let's punt on that. What do you think?
@dsherry I just checked and `cv_data` has a field for the primary objective (`score`) and a field for all objectives (`all_objective_scores`, which includes the primary objective as well). I think that means `percent_better_than_baseline` and `percent_better_than_baseline_all_objectives` follow the same pattern we use for scores.
Maybe we keep it like this in this PR and I file an issue for consolidating scores and percent-better into a single field in the results?
That sounds good!
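For readers following the thread, a rough, abbreviated sketch of how one entry of `automl.results['pipeline_results']` would be shaped under this layout; the field names come from the diff and discussion above, but the values are purely illustrative:

```python
pipeline_result = {
    "high_variance_cv": False,
    "training_time": 1.23,
    "cv_data": [
        # one entry per fold: `score` is the primary objective,
        # `all_objective_scores` covers every objective (primary included)
        {"score": 0.31,
         "all_objective_scores": {"Log Loss Binary": 0.31, "AUC": 0.97}},
    ],
    "percent_better_than_baseline_all_objectives": {"Log Loss Binary": 98.9, "AUC": 98.2},
    "percent_better_than_baseline": 98.9,  # primary objective only
}
```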
for id_, pipeline_results in automl.results['pipeline_results'].items():
    loaded_ = loaded_automl.results['pipeline_results'][id_]
    for name in pipeline_results:
        # Use np to check percent_better_than_baseline because of (possible) nans
Thanks for the comment, helpful here!
mock_scores = {obj.name: i for i, obj in enumerate(core_objectives)}
mock_baseline_scores = {obj.name: i + 1 for i, obj in enumerate(core_objectives)}
answer = {obj.name: obj.calculate_percent_difference(mock_scores[obj.name],
                                                     mock_baseline_scores[obj.name]) for obj in core_objectives}
Nice
                     allowed_pipelines=[DummyPipeline], objective="auto")

with patch(baseline_pipeline_class + ".score", return_value=mock_baseline_scores):
    automl.search(X, y, data_checks=None)
Should we mock pipeline fit and score here too to reduce runtime?
Good suggestion! I'll mock fit for the baseline pipelines, since that's the only mock missing.
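Something along these lines, adapting the snippet above; `baseline_pipeline_class`, `mock_baseline_scores`, `automl`, `X`, and `y` are the test's own names, and the extra `patch` on `fit` is the addition:

```python
from unittest.mock import patch

with patch(baseline_pipeline_class + ".score", return_value=mock_baseline_scores), \
     patch(baseline_pipeline_class + ".fit"):
    automl.search(X, y, data_checks=None)
```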
@patch("evalml.pipelines.ModeBaselineBinaryPipeline.score", return_value={'Log Loss Binary': 1, 'F1': 1}) | ||
@patch('evalml.pipelines.BinaryClassificationPipeline.score') | ||
@patch('evalml.pipelines.BinaryClassificationPipeline.fit') | ||
def test_percent_better_than_baseline_scores_different_folds(mock_fit, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great!
Force-pushed from 1a3f4ea to d6ac162 (commit message: "…ly when scores differ across folds.")
Force-pushed from d6ac162 to 624796b (commit message: "…etter_than_baseline is nan")
Pull Request Description
Fix #1184
After creating the pull request: in order to pass the release_notes_updated check, you will need to update the "Future Release" section of `docs/source/release_notes.rst` to include this pull request by adding :pr:`123`.