Handle incoming Object dtype data #1645

Merged
merged 12 commits into from
Mar 7, 2023

Conversation

ParthivNaresh
Collaborator

@ParthivNaresh ParthivNaresh commented Feb 14, 2023

Fixes: #1646, #1647

Changes made:

  • Handled numeric inference for incoming object dtype data. Perf tests for Woodwork and EvalML here.
  • Expanded null string representations to include a blank space.
  • Made medcouple compliant with the Int64 dtype, which also fixes the current Woodwork perf test blocker.

To prevent holding up this PR any further, I'll be including perf tests for potential inference sampling changes at another time.
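As a rough illustration of the first two changes listed above, the sketch below coerces an object-dtype column to a numeric dtype after mapping null-like strings (including a single blank space) to missing values. This is not Woodwork's implementation: the helper name `coerce_object_to_numeric` and the contents of `NULL_STRINGS` are assumptions made for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical set of strings treated as missing. The PR mentions adding a
# blank space; the other entries are assumptions for illustration only.
NULL_STRINGS = {"", " ", "nan", "NaN", "None", "<NA>"}


def coerce_object_to_numeric(series: pd.Series) -> pd.Series:
    """Return a numeric copy of an object-dtype series, or the input unchanged."""
    if not pd.api.types.is_object_dtype(series):
        return series
    # Replace null-like strings with NaN so they do not block numeric parsing.
    cleaned = series.where(~series.isin(NULL_STRINGS), other=np.nan)
    try:
        return pd.to_numeric(cleaned)
    except (ValueError, TypeError):
        # At least one value is not numeric, so leave the column as it was.
        return series


numeric_like = pd.Series(["1", "2", " ", "3.5"], dtype="object")
converted = coerce_object_to_numeric(numeric_like)

text_like = pd.Series(["a", "b"], dtype="object")
unchanged = coerce_object_to_numeric(text_like)
```

Here `converted` ends up as a float column with one missing value, while `text_like` is returned untouched because it cannot be parsed as numbers.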


codecov bot commented Feb 14, 2023

Codecov Report

Merging #1645 (4827ca0) into main (215abfe) will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##             main    #1645   +/-   ##
=======================================
  Coverage   98.78%   98.78%           
=======================================
  Files          98       98           
  Lines       11653    11728   +75     
=======================================
+ Hits        11511    11586   +75     
  Misses        142      142           
Impacted Files Coverage Δ
woodwork/config.py 100.00% <ø> (ø)
woodwork/utils.py 100.00% <ø> (ø)
.../statistics_utils/_get_box_plot_info_for_column.py 98.80% <100.00%> (+0.04%) ⬆️
woodwork/tests/accessor/test_statistics.py 100.00% <100.00%> (ø)
woodwork/tests/conftest.py 100.00% <100.00%> (ø)
woodwork/tests/logical_types/test_logical_types.py 100.00% <100.00%> (ø)
woodwork/tests/type_system/test_ltype_inference.py 100.00% <100.00%> (ø)
woodwork/tests/utils/test_read_file.py 100.00% <100.00%> (ø)
woodwork/tests/utils/test_utils.py 100.00% <100.00%> (ø)
woodwork/type_sys/inference_functions.py 100.00% <100.00%> (ø)
... and 1 more


@@ -93,7 +93,7 @@

DEFAULT_TYPE = Unknown

-INFERENCE_SAMPLE_SIZE = 100000
+INFERENCE_SAMPLE_SIZE = 10_000
Contributor


Shouldn't this be an integer with no underscore?

Suggested change
-INFERENCE_SAMPLE_SIZE = 10_000
+INFERENCE_SAMPLE_SIZE = 10000

Collaborator Author


Decided to follow PEP 515 for this since it makes it easier to read
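For reference, a short sketch of what PEP 515 allows: underscores in a numeric literal are purely visual, so both spellings produce the same integer. The variable names below are only for illustration.

```python
# Underscores in numeric literals (PEP 515, Python 3.6+) do not change the value.
with_underscore = 10_000
without_underscore = 10000
same_value = with_underscore == without_underscore

# int() accepts the same grouping in strings, and the "_" format
# specifier writes it back out.
parsed = int("10_000")
formatted = f"{with_underscore:_}"
```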

@ParthivNaresh ParthivNaresh self-assigned this Feb 14, 2023
try:
    coeff = np.abs(skew(series))
except ValueError:
    # skew can't handle Int64 dtype
    coeff = np.abs(skew(series.astype("float64")))
Collaborator Author

@ParthivNaresh ParthivNaresh Feb 15, 2023


This was the cause of the current LG ww perf issue

@@ -93,7 +93,7 @@

DEFAULT_TYPE = Unknown

-INFERENCE_SAMPLE_SIZE = 100000
+INFERENCE_SAMPLE_SIZE = 100_000
Collaborator Author

@ParthivNaresh ParthivNaresh Feb 15, 2023


We might have to keep this at 100,000 for the time being. Reducing it exposes issues with larger datasets like zillow, which has a column that gets inferred as IntegerNullable but actually contains a float in one of its >90,000 observations. Attempting to cast that column to Int64 throws an error.
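A minimal sketch of the failure mode described here, using made-up data rather than the zillow dataset: a check on a small sample sees only whole numbers, but the full column holds one fractional value, so the cast to `Int64` fails. The helper `looks_integer` and the use of `head()` as the sampling step are assumptions for illustration and do not reflect how Woodwork samples or infers types.

```python
import numpy as np
import pandas as pd


def looks_integer(values: pd.Series) -> bool:
    """Hypothetical check: True when every non-null value is a whole number."""
    non_null = values.dropna()
    return bool((non_null % 1 == 0).all())


# 1,000 whole-number floats with one null and one fractional value late on.
column = pd.Series(np.arange(1000, dtype="float64"))
column.iloc[10] = np.nan
column.iloc[950] = 0.5

sample = column.head(100)                # stand-in for a small inference sample
inferred_integer = looks_integer(sample)
full_integer = looks_integer(column)

try:
    column.astype("Int64")
    cast_failed = False
except (TypeError, ValueError):
    # pandas refuses to cast 0.5 to a nullable integer without losing data.
    cast_failed = True
```

The sample-based check reports an integer column while the full-column check does not, which is why a smaller sample size makes this kind of mismatch more likely.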

@ParthivNaresh ParthivNaresh changed the title from "[DO NOT MERGE] Handle incoming Object dtype data" to "Handle incoming Object dtype data" Feb 17, 2023
@ParthivNaresh ParthivNaresh marked this pull request as ready for review February 27, 2023 15:12
Development

Successfully merging this pull request may close these issues.

Handle incoming object dtype data for numeric data
5 participants