Make usage of data types consistent across codebase #1039

Merged
angela97lin merged 8 commits into main from 1006_consistent_dtypes on Aug 12, 2020

Conversation

angela97lin
Contributor

@angela97lin angela97lin commented Aug 10, 2020

Closes #1006 by cleaning up some of our dtype-selection calls so that they use what we already have in place in gen_utils instead.
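As a rough illustration of the kind of cleanup described above, the sketch below keeps dtype lists in one place and reuses them in `select_dtypes` calls instead of repeating string literals at each call site. The list contents and the helper `select_categorical` are assumptions made for this example; the actual definitions in gen_utils may differ.

```python
import pandas as pd

# Hypothetical shared dtype lists. The names mirror those mentioned in this PR,
# but the contents here are assumed for illustration.
numeric_dtypes = ["int16", "int32", "int64", "float16", "float32", "float64"]
boolean_dtypes = ["bool"]
numeric_and_boolean_dtypes = numeric_dtypes + boolean_dtypes
categorical_dtypes = ["object", "category"]


def select_categorical(X: pd.DataFrame) -> pd.DataFrame:
    """Return only the columns whose dtype is in the shared categorical list."""
    return X.select_dtypes(include=categorical_dtypes)


X = pd.DataFrame({
    "a": [1, 2, 3],
    "b": [0.5, 1.5, 2.5],
    "c": pd.Series(["x", "y", "z"], dtype="object"),
    "d": pd.Series(["u", "v", "u"], dtype="category"),
    "e": [True, False, True],
})

# Every call site refers to the same lists rather than spelling out dtypes.
categorical_columns = list(select_categorical(X).columns)
numeric_columns = list(X.select_dtypes(include=numeric_and_boolean_dtypes).columns)
```

With this layout, adding or removing a dtype only requires touching the shared list.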

@angela97lin angela97lin self-assigned this Aug 10, 2020
@codecov

codecov bot commented Aug 11, 2020

Codecov Report

Merging #1039 into main will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##             main    #1039   +/-   ##
=======================================
  Coverage   99.91%   99.91%           
=======================================
  Files         183      183           
  Lines       10143    10147    +4     
=======================================
+ Hits        10134    10138    +4     
  Misses          9        9           

| Impacted Files | Coverage Δ |
|---|---|
| evalml/data_checks/invalid_targets_data_check.py | 100.00% <100.00%> (ø) |
| ...ents/estimators/classifiers/catboost_classifier.py | 100.00% <100.00%> (ø) |
| ...onents/estimators/regressors/catboost_regressor.py | 100.00% <100.00%> (ø) |
| .../transformers/preprocessing/datetime_featurizer.py | 100.00% <100.00%> (ø) |
| evalml/pipelines/regression_pipeline.py | 100.00% <100.00%> (ø) |
| evalml/pipelines/utils.py | 100.00% <100.00%> (ø) |
| evalml/tests/automl_tests/test_automl.py | 100.00% <100.00%> (ø) |
| ...ta_checks_tests/test_invalid_targets_data_check.py | 100.00% <100.00%> (ø) |
| evalml/tests/pipeline_tests/test_pipelines.py | 100.00% <100.00%> (ø) |
| evalml/utils/gen_utils.py | 100.00% <100.00%> (ø) |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update dd784a2...8e62f38.

@angela97lin angela97lin added this to the August 2020 milestone Aug 11, 2020
@angela97lin angela97lin marked this pull request as ready for review August 11, 2020 15:53
Contributor

@freddyaboulton freddyaboulton left a comment

@angela97lin Looks good! I wonder if we should create a datetime_dtypes = [np.datetime64] too. This might be useful for future time series work.
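A minimal sketch of how such a list could be used, assuming it sits alongside the other shared dtype lists (its location and the example column names are assumptions):

```python
import numpy as np
import pandas as pd

# The list proposed in the comment above.
datetime_dtypes = [np.datetime64]

X = pd.DataFrame({
    "timestamp": pd.to_datetime(["2020-08-10", "2020-08-11", "2020-08-12"]),
    "value": [1.0, 2.0, 3.0],
})

# select_dtypes accepts np.datetime64 to match datetime columns.
datetime_columns = list(X.select_dtypes(include=datetime_dtypes).columns)
```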

@angela97lin
Contributor Author

@freddyaboulton I think that's a great suggestion! Will update to include :D

Contributor

@jeremyliweishih jeremyliweishih left a comment

LGTM. I still see 'category' showing up in SimpleImputer and DateTimeFeaturizer, and the numeric types in IDColumnsDataCheck. Should those be included as well? Otherwise looks great!

@angela97lin
Contributor Author

@jeremyliweishih Yeah, those are intentional: SimpleImputer needs special handling of category types, DateTimeFeaturizer needs to convert to the category type, and IDColumnsDataCheck should exclude only floats and bools, not ints (and hence not all numerics!).
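To make the IDColumnsDataCheck point concrete, here is a small sketch of why a shared "all numerics" list does not fit: the check wants to drop floats and bools while keeping ints. The function `id_candidate_columns` and the lists below are hypothetical and are not the data check's real implementation.

```python
import pandas as pd

# Hypothetical lists for illustration only.
float_dtypes = ["float16", "float32", "float64"]
boolean_dtypes = ["bool"]


def id_candidate_columns(X: pd.DataFrame) -> list:
    """Keep columns that could be IDs: everything except floats and bools."""
    return list(X.select_dtypes(exclude=float_dtypes + boolean_dtypes).columns)


X = pd.DataFrame({
    "id": [0, 1, 2],
    "price": [1.5, 2.5, 3.5],
    "flag": [True, False, True],
    "name": pd.Series(["a", "b", "c"], dtype="object"),
})

# Excluding a combined numeric list would also drop the integer "id" column,
# which is exactly the column the check needs to inspect.
candidates = id_candidate_columns(X)
```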

Contributor

@eccabay eccabay left a comment

LGTM!

@@ -815,20 +819,20 @@ def test_results_getter(mock_fit, mock_score, caplog, X_y_binary):


 @pytest.mark.parametrize("automl_type", [ProblemTypes.BINARY, ProblemTypes.MULTICLASS])
-@pytest.mark.parametrize("target_type", ["categorical", "string", "bool", "float", "int"])
+@pytest.mark.parametrize("target_type", numeric_and_boolean_dtypes + categorical_dtypes)
Contributor

So "string" is not included here?

Contributor Author

Ah, this is a little confusing. Previously, the "string" parametrized case covered the default case (i.e. no conversion to another type), where the default target type is object. So yup, the new parametrized list contains "object" instead of "string", but since there are no checks to convert data types if target_type == object, this still covers the same case :D
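The default-case behavior described here can be sketched as follows. The helper `make_target` is hypothetical and only mimics the idea that an object-typed target is left unconverted; it is not the code from the test in question.

```python
import pandas as pd


def make_target(values, target_type):
    """Build a target series, converting only when target_type is not the default."""
    y = pd.Series(values, dtype="object")
    if target_type == "object":
        # Default case: no conversion, same path the old "string" case exercised.
        return y
    return y.astype(target_type)


target_types = ["object", "category", "int64", "float64", "bool"]
targets = {t: make_target([0, 1, 1, 0], t) for t in target_types}
```

In this sketch, the "object" entry plays the role of the old "string" case because it goes through the no-conversion branch.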

Contributor

@dsherry dsherry left a comment

@angela97lin thank you for cleaning this up!! I left one question on a test

@angela97lin angela97lin merged commit 2cbb312 into main Aug 12, 2020
@angela97lin angela97lin deleted the 1006_consistent_dtypes branch August 12, 2020 15:28
@dsherry dsherry mentioned this pull request Aug 25, 2020
Successfully merging this pull request may close these issues.

Make usage of data types consistent across codebase
5 participants