
Fix data splitting errors if target is float for classification problems #2050

Merged: 3 commits into main, Mar 30, 2021

Conversation

angela97lin (Contributor) commented:

Closes #1954

@angela97lin angela97lin self-assigned this Mar 29, 2021
codecov bot commented Mar 29, 2021

Codecov Report

Merging #2050 (337e215) into main (66da502) will increase coverage by 0.1%.
The diff coverage is 100.0%.

Impacted file tree graph

@@            Coverage Diff            @@
##             main    #2050     +/-   ##
=========================================
+ Coverage   100.0%   100.0%   +0.1%     
=========================================
  Files         280      280             
  Lines       22907    22934     +27     
=========================================
+ Hits        22898    22925     +27     
  Misses          9        9             
Impacted Files                                Coverage Δ
evalml/automl/engine/engine_base.py           100.0% <100.0%> (ø)
evalml/tests/automl_tests/test_automl.py      100.0% <100.0%> (ø)


@patch('evalml.pipelines.ClassificationPipeline.score', return_value={"Log Loss Multiclass": 0.3})
@patch('evalml.pipelines.BinaryClassificationPipeline.score', return_value={"Log Loss Binary": 0.3})
@patch('evalml.automl.engine.EngineBase.train_pipeline')
def test_automl_supports_float_targets_for_classification(mock_train, mock_binary_score, mock_multi_score, mock_regression_score,
angela97lin (Contributor, author) commented:

Testing for regression too, just to make sure we're not messing things up regarding passing the right target values c:

A reviewer (Contributor) replied:
Nice, I really like this test. Very clear. 👍

chukarsten (Contributor) left a review:

I think this is good to go, and it's nice and concise. Love the tests. My only question, which may be made obsolete by the upcoming engine redesign, is where the test belongs: with the engine tests or the automl tests. It's not a blocking concern, though, so whichever way you decide to go is good.


automl = AutoMLSearch(X_train=X, y_train=y, problem_type=automl_type, random_seed=0, n_jobs=1)
automl.search()

# Assert that we train the pipeline on the original target, not the encoded one used in EngineBase for data splitting
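The assertion described in the comment above can be sketched in isolation. The harness below is hypothetical (a stand-in `run_search` replaces AutoMLSearch/EngineBase, and the data is made up); it shows the pattern the PR test relies on: patch the training call with a mock, then check that the mock received the original float target rather than the internally encoded one.

```python
from unittest import mock
import pandas as pd

# Original float-valued classification target.
y_original = pd.Series([0.0, 1.0, 1.0, 0.0])

def run_search(train_pipeline, y):
    # Stand-in for the engine logic in this PR: encode the target only to
    # derive split indices, then hand the *original* target to training.
    y_mapping = {v: i for i, v in enumerate(y.value_counts().index)}
    y_encoded = y.map(y_mapping)  # used only for data splitting
    assert y_encoded.notna().all()
    train_pipeline(y)             # training sees the original float values

train_mock = mock.Mock()
run_search(train_mock, y_original)

# Mirrors the PR test's check: train_pipeline received the original target.
passed_y = train_mock.call_args[0][0]
assert passed_y.equals(y_original)
```

In the real test, `@patch('evalml.automl.engine.EngineBase.train_pipeline')` plays the role of `train_mock` here.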
A reviewer (Contributor) asked:
Should this be an EngineBase test? @freddyaboulton

angela97lin (author) replied:
I hear ya! I think I want to keep this as an AutoML test, though, because this is something that should work for a user interfacing with AutoMLSearch. Right now we mock and check EngineBase because that's where the data is passed, but that doesn't have to stay the case; if it changes, we would just update what we mock.

bchen1116 (Contributor) left a review:
LGTM!

ParthivNaresh (Contributor) left a review:
Looks excellent to me!

y_pd_encoded = y_pd
# Encode target for classification problems so that we can support float targets.
# This is okay because we only use split to get the indices to split on.
if is_classification(automl.problem_type):
    y_mapping = {original_target: encoded_target
                 for (encoded_target, original_target) in enumerate(y_pd.value_counts().index)}
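The trick in the snippet above can be sketched standalone. The example below is illustrative (made-up data, and scikit-learn's StratifiedKFold standing in for evalml's data splitters): the float target is mapped to integer codes purely so the splitter can stratify on it, while training continues to use the original values.

```python
import pandas as pd
from sklearn.model_selection import StratifiedKFold

# A float-valued binary target, as in the linked issue.
y_pd = pd.Series([0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0])
X = pd.DataFrame({"feature": range(len(y_pd))})

# Map each distinct target value to an integer code, mirroring the PR's approach.
y_mapping = {original_target: encoded_target
             for (encoded_target, original_target) in enumerate(y_pd.value_counts().index)}
y_pd_encoded = y_pd.map(y_mapping)

# Split on the encoded target; only the resulting indices are used downstream.
splitter = StratifiedKFold(n_splits=2, shuffle=True, random_state=0)
for train_idx, test_idx in splitter.split(X, y_pd_encoded):
    # Training still receives the original float target, not the encoded one.
    y_train = y_pd.iloc[train_idx]
```

Because only the split indices are consumed, the encoding never leaks into the trained pipeline, which is why the inline comment calls it "okay".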
A reviewer (Contributor) commented:
Nice and clean!

@angela97lin angela97lin merged commit 8556450 into main Mar 30, 2021
@angela97lin angela97lin deleted the 1954_data_splitting_indices branch March 30, 2021 06:38
@chukarsten chukarsten mentioned this pull request Apr 6, 2021
Linked issue closed by this pull request: Classification: data splitting errors if target is float
4 participants