Make usage of data types consistent across codebase #1039

angela97lin · 2020-08-10T22:33:10Z

Closes #1006 by replacing our ad hoc dtype selection calls with the shared dtype lists we already have in gen_utils.

codecov · 2020-08-11T01:30:25Z

Codecov Report

Merging #1039 into main will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##             main    #1039   +/-   ##
=======================================
  Coverage   99.91%   99.91%           
=======================================
  Files         183      183           
  Lines       10143    10147    +4     
=======================================
+ Hits        10134    10138    +4     
  Misses          9        9

| Impacted Files | Coverage Δ |
|---|---|
| evalml/data_checks/invalid_targets_data_check.py | `100.00% <100.00%> (ø)` |
| ...ents/estimators/classifiers/catboost_classifier.py | `100.00% <100.00%> (ø)` |
| ...onents/estimators/regressors/catboost_regressor.py | `100.00% <100.00%> (ø)` |
| .../transformers/preprocessing/datetime_featurizer.py | `100.00% <100.00%> (ø)` |
| evalml/pipelines/regression_pipeline.py | `100.00% <100.00%> (ø)` |
| evalml/pipelines/utils.py | `100.00% <100.00%> (ø)` |
| evalml/tests/automl_tests/test_automl.py | `100.00% <100.00%> (ø)` |
| ...ta_checks_tests/test_invalid_targets_data_check.py | `100.00% <100.00%> (ø)` |
| evalml/tests/pipeline_tests/test_pipelines.py | `100.00% <100.00%> (ø)` |
| evalml/utils/gen_utils.py | `100.00% <100.00%> (ø)` |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

freddyaboulton

@angela97lin Looks good! I wonder if we should create a datetime_dtypes = [np.datetime64] too. This might be useful for future time series work.

angela97lin · 2020-08-11T16:18:32Z

@freddyaboulton I think that's a great suggestion! Will update to include :D
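
For readers following along, here is a minimal sketch of how shared dtype lists, including the suggested datetime_dtypes, might be used with pandas `select_dtypes`. The list contents other than `datetime_dtypes = [np.datetime64]` are assumptions for illustration and may not match what gen_utils actually defines; the DataFrame is made up.

```python
import numpy as np
import pandas as pd

# Assumed contents: only datetime_dtypes is quoted from the thread above.
numeric_dtypes = ["int16", "int32", "int64", "float16", "float32", "float64"]
boolean_dtypes = ["bool"]
numeric_and_boolean_dtypes = numeric_dtypes + boolean_dtypes
categorical_dtypes = ["object", "category"]
datetime_dtypes = [np.datetime64]

# Illustrative data, not taken from the PR.
X = pd.DataFrame({
    "amount": [1.5, 2.0, 3.25],
    "count": [1, 2, 3],
    "is_new": [True, False, True],
    "city": pd.Series(["NYC", "SF", "NYC"], dtype=object),
    "signup": pd.to_datetime(["2020-08-10", "2020-08-11", "2020-08-12"]),
})

# Selecting columns through the shared lists keeps every call site consistent.
numeric_cols = X.select_dtypes(include=numeric_and_boolean_dtypes).columns.tolist()
categorical_cols = X.select_dtypes(include=categorical_dtypes).columns.tolist()
datetime_cols = X.select_dtypes(include=datetime_dtypes).columns.tolist()
```

Keeping the lists in one module means a future change (for example, adding a new datetime type for time series work) only has to happen in one place.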

jeremyliweishih

LGTM. I still see 'category' showing up in some SimpleImputer and DateTimeFeaturizer code, and the numeric types listed explicitly in IDColumnsDataCheck. Should those be updated as well? Otherwise looks great!

angela97lin · 2020-08-11T22:08:10Z

@jeremyliweishih Yeah, those are intentional: for SimpleImputer we need special handling of category types; for DateTimeFeaturizer we want to convert to the category type; and for IDColumnsDataCheck we just want to exclude floats and bools but not ints (and hence not all numerics!).
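
The IDColumnsDataCheck case can be illustrated with a small sketch. This is not the data check's actual code; the DataFrame and the list names below are hypothetical and only show why the full numeric list would be the wrong thing to exclude.

```python
import pandas as pd

# Illustrative data: an integer ID column alongside float, bool, and object columns.
X = pd.DataFrame({
    "user_id": [0, 1, 2],
    "price": [9.99, 4.5, 12.0],
    "in_stock": [True, False, True],
    "name": pd.Series(["a", "b", "c"], dtype=object),
})

# Hypothetical lists for this example only.
float_and_bool_dtypes = ["float16", "float32", "float64", "bool"]
numeric_and_boolean_dtypes = ["int16", "int32", "int64"] + float_and_bool_dtypes

# Excluding only floats and bools keeps integer columns as ID candidates.
id_candidates = X.select_dtypes(exclude=float_and_bool_dtypes).columns.tolist()

# Excluding all numerics would also drop the integer ID column.
non_numeric = X.select_dtypes(exclude=numeric_and_boolean_dtypes).columns.tolist()
```

Here `id_candidates` still contains the integer column, while `non_numeric` does not, which is why that check keeps its own narrower list.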

eccabay

LGTM!

dsherry · 2020-08-12T14:50:01Z

evalml/tests/automl_tests/test_automl.py

@@ -815,20 +819,20 @@ def test_results_getter(mock_fit, mock_score, caplog, X_y_binary):


 @pytest.mark.parametrize("automl_type", [ProblemTypes.BINARY, ProblemTypes.MULTICLASS])
-@pytest.mark.parametrize("target_type", ["categorical", "string", "bool", "float", "int"])
+@pytest.mark.parametrize("target_type", numeric_and_boolean_dtypes + categorical_dtypes)


So "string" is not included here?

angela97lin

Ah, this is a little confusing, but previously the "string" parametrized case was covering the default case (i.e. no conversion to another type), where the default target type is object. So yup, the new parametrized list contains "object" instead of "string", but since there are no checks to convert data types if target_type == object, this still covers the same case :D
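
A rough sketch of the point being made: when the parametrized type is "object", the target is left untouched, so that case plays the role the old "string" case did. The `cast_target` helper and the list contents are hypothetical and simplified; the real test in test_automl.py handles conversions differently.

```python
import pandas as pd

# Assumed list contents, for illustration only.
numeric_and_boolean_dtypes = ["int16", "int32", "int64",
                              "float16", "float32", "float64", "bool"]
categorical_dtypes = ["object", "category"]
target_types = numeric_and_boolean_dtypes + categorical_dtypes


def cast_target(y, target_type):
    """Hypothetical helper: convert a string-valued target to target_type."""
    if target_type == "object":
        # Default case: no conversion, the target stays as object-dtype strings.
        return y
    if target_type == "category":
        return y.astype("category")
    # Numeric and boolean cases: encode labels as 0, 1, ... and then cast.
    codes = pd.Series(pd.factorize(y)[0], index=y.index)
    return codes.astype(target_type)


y = pd.Series(["benign", "malignant", "benign"], dtype=object)

# In a real test this loop would be a pytest.mark.parametrize over target_types.
resulting_dtypes = {t: str(cast_target(y, t).dtype) for t in target_types}
```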

dsherry

@angela97lin thank you for cleaning this up!! I left one question on a test

init

58f4d58

angela97lin self-assigned this Aug 10, 2020

angela97lin added 2 commits August 10, 2020 21:22

update more

f80e3e6

release notes

914963d

angela97lin added 2 commits August 11, 2020 11:24

lint

99cd602

past tense release note

8d32dc0

angela97lin added this to the August 2020 milestone Aug 11, 2020

Merge branch 'main' into 1006_consistent_dtypes

ef1f9df

angela97lin marked this pull request as ready for review August 11, 2020 15:53

angela97lin requested review from freddyaboulton, dsherry, eccabay and jeremyliweishih and removed request for freddyaboulton August 11, 2020 15:53

freddyaboulton approved these changes Aug 11, 2020

jeremyliweishih approved these changes Aug 11, 2020

add datetime dtypes list

cf771f7

eccabay approved these changes Aug 12, 2020

fix tets

8e62f38

dsherry reviewed Aug 12, 2020

dsherry approved these changes Aug 12, 2020

angela97lin merged commit 2cbb312 into main Aug 12, 2020

angela97lin deleted the 1006_consistent_dtypes branch August 12, 2020 15:28

dsherry mentioned this pull request Aug 25, 2020

Release v0.13.1 #1101

Merged
