Compute percent-better-than-baseline for all objectives #1244
Conversation
Codecov Report (main vs. #1244):
- Coverage: 99.93% → 99.93%
- Files: 207 → 207
- Lines: 12927 → 12997 (+70)
- Hits: 12918 → 12988 (+70)
- Misses: 9 → 9
Continue to review the full report at Codecov.
assert automl.results == loaded_automl.results
pd.testing.assert_frame_equal(automl.rankings, loaded_automl.rankings)

for id_, pipeline_results in automl.results['pipeline_results'].items():
Had to modify this check because `np.nan != np.nan`.
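For context, a quick illustration of the NaN issue: plain equality never matches NaN against itself, so a NaN-safe comparison is needed (shown here with numpy as a minimal sketch, not the exact test code):

```python
import numpy as np

# NaN never compares equal to itself, so a plain == check on results fails.
assert not (np.nan == np.nan)

# A NaN-safe comparison treats NaNs in the same positions as equal.
a = np.array([98.93, np.nan, 53.50])
b = np.array([98.93, np.nan, 53.50])
assert np.allclose(a, b, equal_nan=True)
```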
@patch("evalml.pipelines.ModeBaselineBinaryPipeline.score", return_value={'Log Loss Binary': 1, 'F1': 1}) | ||
@patch('evalml.pipelines.BinaryClassificationPipeline.score') | ||
@patch('evalml.pipelines.BinaryClassificationPipeline.fit') | ||
def test_percent_better_than_baseline_scores_different_folds(mock_fit, |
I noticed that all our other unit tests assume the same score is returned on each fold, so I decided to add one where the score is different for each fold.
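Roughly, per-fold scores can be driven through the mock's `side_effect`; here is a minimal sketch (the objective names and values are made up, not the test's actual fixtures):

```python
from unittest.mock import MagicMock

# side_effect with a list makes each call return the next item,
# so every CV fold sees a different score.
mock_score = MagicMock(side_effect=[
    {'Log Loss Binary': 0.3, 'F1': 0.9},
    {'Log Loss Binary': 0.5, 'F1': 0.8},
    {'Log Loss Binary': 0.7, 'F1': 0.7},
])

print(mock_score())  # first fold:  {'Log Loss Binary': 0.3, 'F1': 0.9}
print(mock_score())  # second fold: {'Log Loss Binary': 0.5, 'F1': 0.8}
```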
Great!
objective_class = get_objective(obj_name)
percent_better = objective_class.calculate_percent_difference(mean_cv_all_objectives[obj_name],
                                                               self._baseline_cv_scores[obj_name])
percent_better_than_baseline[obj_name] = percent_better
@gsheni This is what it would look like in `automl.results`:

'percent_better_than_baseline_all_objectives': {'Log Loss Binary': 98.9276200826952,
 'MCC Binary': nan,
 'AUC': 98.17222663713376,
 'Precision': nan,
 'F1': nan,
 'Balanced Accuracy Binary': 91.62898962401862,
 'Accuracy Binary': 53.49893478518168}
This is slightly different from what you posted in the original issue, but I think it's more human-readable. Happy to change it, though.
- The reasoning for a list of dictionaries is that it would allow for adding more information in the future.
- For example, EvalML could calculate a boolean that looks at the baseline and pipeline score to determine if it's actually better (even if `baselinePercentChange` is NaN).
- However, this is not a major sticking point, and the end user could get the scores, make the comparison, and calculate the boolean.

For example:

'percent_better_than_baseline_all_objectives': [
    {'objective': 'Log Loss Binary',
     'baselinePercentChange': 98.9276200826952,
     'isBetterThanBaseline': True},
    {'objective': 'MCC Binary',
     'baselinePercentChange': nan,
     'isBetterThanBaseline': True},
    ...]
Looks good to me!
Yep. Users can certainly easily compute that boolean if they want to.
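For anyone who does want that boolean, a hypothetical helper (not part of EvalML) built on the dict format above, assuming the convention that a positive percent difference means the pipeline beat the baseline:

```python
import numpy as np

def is_better_than_baseline(percent_better_all_objectives):
    # Positive percent difference is assumed to mean "better than baseline";
    # NaN is reported as None ("unknown") rather than True/False.
    return {name: (None if np.isnan(value) else value > 0)
            for name, value in percent_better_all_objectives.items()}

percent_better = {'Log Loss Binary': 98.9276200826952, 'MCC Binary': float('nan')}
print(is_better_than_baseline(percent_better))
# {'Log Loss Binary': True, 'MCC Binary': None}
```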
Force-pushed from b22ac9f to 1a3f4ea
@freddyaboulton looks awesome!!
I left a discussion about whether we still keep `percent_better_than_baseline` for the primary objective separate in `results`. Let's resolve that and then merge. I didn't have a great conclusion there; thinking about it now, let's leave it the way you have it. I just wanted to make you aware of that topic.
evalml/automl/automl_search.py (outdated)
for field, value in fold_data['all_objective_scores'].items():
    if field.lower() in objective_names:
        scores[field] += value
return {objective_name: score / n_folds for objective_name, score in scores.items()}
Cool. If `score` is an `int` this won't carry the remainder, so for safety please do `float(score) / n_folds`.
Good catch!
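For illustration, a self-contained sketch of the averaging step being discussed, using `float(score) / n_folds` as suggested; the fold-data layout and helper name are assumptions, not the actual code in `automl_search.py`:

```python
from collections import defaultdict

def mean_cv_scores(cv_data, objective_names):
    # Sum each objective's score across folds, then average.
    scores = defaultdict(int)
    for fold_data in cv_data:
        for field, value in fold_data['all_objective_scores'].items():
            if field.lower() in objective_names:
                scores[field] += value
    n_folds = len(cv_data)
    # Explicit float(), per the review suggestion above.
    return {name: float(score) / n_folds for name, score in scores.items()}

cv_data = [{'all_objective_scores': {'AUC': 1, 'F1': 0}},
           {'all_objective_scores': {'AUC': 0, 'F1': 1}}]
print(mean_cv_scores(cv_data, {'auc', 'f1'}))  # {'AUC': 0.5, 'F1': 0.5}
```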
@@ -686,7 +706,8 @@ def _add_result(self, trained_pipeline, parameters, training_time, cv_data, cv_s
     "high_variance_cv": high_variance_cv,
     "training_time": training_time,
     "cv_data": cv_data,
-    "percent_better_than_baseline": percent_better,
+    "percent_better_than_baseline_all_objectives": percent_better_than_baseline,
+    "percent_better_than_baseline": percent_better_than_baseline[self.objective.name],
@freddyaboulton so in `cv_data`, do we have separate fields for the primary objective vs the additional ones? Let's just make sure we're consistent across our code. Relatedly, right now, does `percent_better_than_baseline_all_objectives` also include the primary objective?
I think this would be cleanest if `percent_better_than_baseline` were simply a dict here in `results`, and then in `full_rankings`, when we make the rankings leaderboard, we extract that to make the column. But if that's not what we do in `cv_data`, let's punt on that. What do you think?
@dsherry I just checked and `cv_data` has a field for the primary objective (`score`) and a field for all objectives (`all_objective_scores`, which includes the primary objective as well). I think that means `percent_better_than_baseline` and `percent_better_than_baseline_all_objectives` follow the same pattern we use for scores.
Maybe we keep it like this in this PR and I file an issue for consolidating scores and percent-better into a single field in the results?
That sounds good!
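For readers following the thread, a rough, abbreviated sketch of how one entry of `automl.results['pipeline_results']` would be shaped under this layout; the field names come from the diff and discussion above, but the values are purely illustrative:

```python
pipeline_result = {
    "high_variance_cv": False,
    "training_time": 1.23,
    "cv_data": [
        # one entry per fold: `score` is the primary objective,
        # `all_objective_scores` covers every objective (primary included)
        {"score": 0.31,
         "all_objective_scores": {"Log Loss Binary": 0.31, "AUC": 0.97}},
    ],
    "percent_better_than_baseline_all_objectives": {"Log Loss Binary": 98.9, "AUC": 98.2},
    "percent_better_than_baseline": 98.9,  # primary objective only
}
```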
for id_, pipeline_results in automl.results['pipeline_results'].items():
    loaded_ = loaded_automl.results['pipeline_results'][id_]
    for name in pipeline_results:
        # Use np to check percent_better_than_baseline because of (possible) nans
Thanks for the comment, helpful here!
mock_scores = {obj.name: i for i, obj in enumerate(core_objectives)}
mock_baseline_scores = {obj.name: i + 1 for i, obj in enumerate(core_objectives)}
answer = {obj.name: obj.calculate_percent_difference(mock_scores[obj.name],
                                                     mock_baseline_scores[obj.name]) for obj in core_objectives}
Nice
                     allowed_pipelines=[DummyPipeline], objective="auto")

with patch(baseline_pipeline_class + ".score", return_value=mock_baseline_scores):
    automl.search(X, y, data_checks=None)
Should we mock pipeline fit and score here too to reduce runtime?
Good suggestion! I'll mock fit for the baseline pipelines, since that's the only mock missing.
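Something along these lines, adapting the snippet above; `baseline_pipeline_class`, `mock_baseline_scores`, `automl`, `X`, and `y` are the test's own names, and the extra `patch` on `fit` is the addition:

```python
from unittest.mock import patch

with patch(baseline_pipeline_class + ".score", return_value=mock_baseline_scores), \
     patch(baseline_pipeline_class + ".fit"):
    automl.search(X, y, data_checks=None)
```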
@patch("evalml.pipelines.ModeBaselineBinaryPipeline.score", return_value={'Log Loss Binary': 1, 'F1': 1}) | ||
@patch('evalml.pipelines.BinaryClassificationPipeline.score') | ||
@patch('evalml.pipelines.BinaryClassificationPipeline.fit') | ||
def test_percent_better_than_baseline_scores_different_folds(mock_fit, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great!
Force-pushed from 1a3f4ea to d6ac162 (commit message: "…ly when scores differ across folds.")
Force-pushed from d6ac162 to 624796b (commit message: "…etter_than_baseline is nan")
Pull Request Description
Fix #1184
After creating the pull request: in order to pass the release_notes_updated check, you will need to update the "Future Release" section of `docs/source/release_notes.rst` to include this pull request by adding :pr:`123`.