
Compute percent-better-than-baseline for all objectives #1244

Merged

merged 8 commits into main on Oct 2, 2020

Conversation

freddyaboulton (Contributor):

Pull Request Description

Fix #1184



codecov bot commented Sep 29, 2020:

Codecov Report

Merging #1244 into main will increase coverage by 0.00%.
The diff coverage is 100.00%.


@@           Coverage Diff           @@
##             main    #1244   +/-   ##
=======================================
  Coverage   99.93%   99.93%           
=======================================
  Files         207      207           
  Lines       12927    12997   +70     
=======================================
+ Hits        12918    12988   +70     
  Misses          9        9           
Impacted Files                               Coverage Δ
evalml/automl/automl_search.py               99.59% <100.00%> (+0.01%) ⬆️
evalml/tests/automl_tests/test_automl.py     100.00% <100.00%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update e81b19d...40ed0d3.

assert automl.results == loaded_automl.results
pd.testing.assert_frame_equal(automl.rankings, loaded_automl.rankings)

for id_, pipeline_results in automl.results['pipeline_results'].items():
freddyaboulton (author):

Had to modify this check because np.nan != np.nan
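The equality quirk behind this change is worth spelling out: under IEEE-754, NaN compares unequal to everything, including itself, so a plain `==` assert on results containing NaN always fails. A minimal sketch of the behavior and the nan-aware alternatives:

```python
import numpy as np

nan = float("nan")
assert nan != nan          # IEEE-754: NaN is unequal to everything, itself included
assert np.nan != np.nan    # numpy's nan behaves the same way

# nan-aware comparisons that treat aligned NaNs as matching:
assert np.isclose(np.nan, np.nan, equal_nan=True)
np.testing.assert_array_equal([1.0, np.nan], [1.0, np.nan])  # passes: nans align
```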

@patch("evalml.pipelines.ModeBaselineBinaryPipeline.score", return_value={'Log Loss Binary': 1, 'F1': 1})
@patch('evalml.pipelines.BinaryClassificationPipeline.score')
@patch('evalml.pipelines.BinaryClassificationPipeline.fit')
def test_percent_better_than_baseline_scores_different_folds(mock_fit,
freddyaboulton (author):

I noticed that all our other unit tests assume the same score is returned on each fold, so I decided to add one where the score is different for each fold.
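With unittest.mock, per-fold scores can be produced by giving the patched score a `side_effect` list, one return value per call; a sketch of the pattern (the objective names and fold count are illustrative, not the actual test):

```python
from unittest.mock import MagicMock

# A side_effect list makes the mocked score() return a different value
# on every call, i.e. on every CV fold.
mock_score = MagicMock(side_effect=[
    {"Log Loss Binary": 0.5, "F1": 0.6},
    {"Log Loss Binary": 0.4, "F1": 0.7},
    {"Log Loss Binary": 0.3, "F1": 0.8},
])

fold_scores = [mock_score(None, None) for _ in range(3)]  # three folds
mean_log_loss = sum(s["Log Loss Binary"] for s in fold_scores) / 3
```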

Collaborator:

Great!

objective_class = get_objective(obj_name)
percent_better = objective_class.calculate_percent_difference(mean_cv_all_objectives[obj_name],
                                                              self._baseline_cv_scores[obj_name])
percent_better_than_baseline[obj_name] = percent_better
freddyaboulton (author):

@gsheni This is what it would look like in automl.results:

'percent_better_than_baseline_all_objectives': {'Log Loss Binary': 98.9276200826952,
    'MCC Binary': nan,
    'AUC': 98.17222663713376,
    'Precision': nan,
    'F1': nan,
    'Balanced Accuracy Binary': 91.62898962401862,
    'Accuracy Binary': 53.49893478518168}

This is slightly different from what you posted in the original issue but I think it's more human-readable. Happy to change though.
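The nan entries above come from objectives whose percent difference is undefined against the baseline. A simplified stand-in for an objective's calculate_percent_difference (the real EvalML method also accounts for cost-style objectives, which is glossed over here; assume greater-is-better):

```python
import numpy as np

def percent_difference(score, baseline_score):
    """Hypothetical stand-in for an objective's calculate_percent_difference.

    Returns nan when the baseline is zero or either value is nan,
    which is where the nan entries in the dict above come from.
    """
    if np.isnan(score) or np.isnan(baseline_score) or baseline_score == 0:
        return np.nan
    return (score - baseline_score) / abs(baseline_score) * 100

auc_gain = percent_difference(0.99, 0.5)      # roughly 98, like the AUC row
undefined = percent_difference(0.7, np.nan)   # nan, like the MCC/Precision rows
```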

gsheni (Member) commented Sep 30, 2020:

  • The reasoning for a list of dictionaries is that it would allow for adding more information in the future.
  • For example, EvalML could calculate a boolean that looks at the baseline and pipeline score to determine if it's actually better (even if baselinePercentChange is NaN).
  • However, this is not a major sticking point, and the end-user could get the scores, make the comparison, and calculate the boolean.

'percent_better_than_baseline_all_objectives': [
   {'objective': 'Log Loss Binary',
    'baselinePercentChange': 98.9276200826952,
    'isBetterThanBaseline': True},
   {'objective': 'MCC Binary',
    'baselinePercentChange': nan,
    'isBetterThanBaseline': True},
   ...]
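gsheni's proposed shape can be derived from the flat dict; a hypothetical converter sketch (all helper and argument names are made up here, and `isBetterThanBaseline` is computed from the raw scores so it stays defined even when the percent change is nan):

```python
import math

def to_records(percent_better, scores, baseline_scores, greater_is_better):
    """Convert {objective: percent} into the proposed list-of-dicts shape."""
    return [
        {
            "objective": name,
            "baselinePercentChange": pct,
            "isBetterThanBaseline": (
                scores[name] > baseline_scores[name]
                if greater_is_better[name]
                else scores[name] < baseline_scores[name]
            ),
        }
        for name, pct in percent_better.items()
    ]

records = to_records(
    {"Log Loss Binary": 98.93, "MCC Binary": math.nan},
    {"Log Loss Binary": 0.01, "MCC Binary": 0.4},
    {"Log Loss Binary": 1.0, "MCC Binary": 0.0},
    {"Log Loss Binary": False, "MCC Binary": True},  # log loss: lower is better
)
```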

Collaborator:

Looks good to me!

Yep. Users can certainly easily compute that boolean if they want to.

dsherry approved these changes Oct 1, 2020

dsherry (Collaborator):

@freddyaboulton looks awesome!!

I left a discussion about whether we should still keep percent_better_than_baseline for the primary objective as a separate entry in results. Let's resolve that and then merge. I didn't reach a firm conclusion there; thinking about it now, let's leave it the way you have it, I just wanted to make you aware of the topic.

for field, value in fold_data['all_objective_scores'].items():
    if field.lower() in objective_names:
        scores[field] += value
return {objective_name: score / n_folds for objective_name, score in scores.items()}
Collaborator:

Cool. If score is int this won't carry the remainder, so for safety please do float(score) / n_folds
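For context, Python 3's `/` already performs true division even on ints, so the explicit float() is belt-and-braces against an accidental floor division (`//`) or Python 2 semantics. A self-contained sketch of the fold-averaging helper with the suggested fix applied (names mirror the snippet above but this is an illustration, not the actual EvalML code):

```python
from collections import defaultdict

def mean_scores_per_objective(cv_data, objective_names):
    """Average each objective's score across CV folds."""
    scores = defaultdict(float)
    n_folds = len(cv_data)
    for fold_data in cv_data:
        for field, value in fold_data["all_objective_scores"].items():
            if field.lower() in objective_names:
                scores[field] += value
    # float() is redundant under Python 3 true division, but guards
    # against an accidental integer (//) division.
    return {name: float(score) / n_folds for name, score in scores.items()}

cv_data = [
    {"all_objective_scores": {"AUC": 0.9, "F1": 0.8}},
    {"all_objective_scores": {"AUC": 0.7, "F1": 0.6}},
]
means = mean_scores_per_objective(cv_data, {"auc", "f1"})
```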

freddyaboulton (author):

Good catch!


@@ -686,7 +706,8 @@ def _add_result(self, trained_pipeline, parameters, training_time, cv_data, cv_s
         "high_variance_cv": high_variance_cv,
         "training_time": training_time,
         "cv_data": cv_data,
-        "percent_better_than_baseline": percent_better,
+        "percent_better_than_baseline_all_objectives": percent_better_than_baseline,
+        "percent_better_than_baseline": percent_better_than_baseline[self.objective.name],
Collaborator:

@freddyaboulton so in cv_data, do we have separate fields for the primary objective vs the additional ones? Let's just make sure we're consistent across our code. Relatedly, right now, does percent_better_than_baseline_all_objectives also include the primary objective?

I think this would be cleanest if percent_better_than_baseline were simply a dict here in results, and then in full_rankings when we make the rankings leaderboard, we extract that to make the column. But if that's not what we do in cv_data, let's punt on that. What do you think?

freddyaboulton (author):

@dsherry I just checked and cv_data has a field for the primary objective (score) and for all objectives (all_objective_scores - this includes the primary objective as well). I think that means that percent_better_than_baseline and percent_better_than_baseline_all_objectives follow the same pattern we use for scores.

Maybe we keep it like this in this PR and I file an issue for consolidating scores and percent better into a single field in the results?

Collaborator:

That sounds good!

for id_, pipeline_results in automl.results['pipeline_results'].items():
    loaded_ = loaded_automl.results['pipeline_results'][id_]
    for name in pipeline_results:
        # Use np to check percent_better_than_baseline because of (possible) nans
Collaborator:

Thanks for the comment, helpful here!

mock_scores = {obj.name: i for i, obj in enumerate(core_objectives)}
mock_baseline_scores = {obj.name: i + 1 for i, obj in enumerate(core_objectives)}
answer = {obj.name: obj.calculate_percent_difference(mock_scores[obj.name],
                                                     mock_baseline_scores[obj.name]) for obj in core_objectives}
Collaborator:

Nice

                    allowed_pipelines=[DummyPipeline], objective="auto")

with patch(baseline_pipeline_class + ".score", return_value=mock_baseline_scores):
    automl.search(X, y, data_checks=None)
Collaborator:

Should we mock pipeline fit and score here too to reduce runtime?

freddyaboulton (author):

Good suggestion! I'll mock fit for the baseline pipelines since that's the only mock missing
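The pattern being suggested, patching both fit and score so no real training or scoring happens, can be sketched with unittest.mock; the Pipeline class below is a stand-in (real tests patch the EvalML pipeline classes by dotted path, as in the decorators above):

```python
from unittest.mock import MagicMock, patch

class Pipeline:
    """Stand-in for a pipeline class; illustrative only."""
    def fit(self, X, y):
        raise RuntimeError("expensive training")
    def score(self, X, y, objectives):
        raise RuntimeError("expensive scoring")

with patch.object(Pipeline, "fit", MagicMock()) as mock_fit, \
     patch.object(Pipeline, "score", return_value={"F1": 0.9}) as mock_score:
    p = Pipeline()
    p.fit(None, None)                     # no-op: training is mocked out
    result = p.score(None, None, ["F1"])  # canned result, no computation
```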


freddyaboulton force-pushed the 1184-percent-better-than-baseline-all-objs branch from 1a3f4ea to d6ac162 (October 1, 2020 16:08) and from d6ac162 to 624796b (October 1, 2020 19:19).
freddyaboulton merged commit 6fba86a into main on Oct 2, 2020 and deleted the branch on October 2, 2020 14:31.
dsherry mentioned this pull request Oct 29, 2020.
Successfully merging this pull request may close these issues.

Add percent_better_than_baseline on a per metric basis