Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make sampling for inference in woodwork more consistent #1083

Merged
merged 7 commits into from
Jul 30, 2021

Conversation

davesque
Copy link
Contributor

  • Removes any sampling that occurs in inference functions defined in woodwork.type_sys.inference_functions.
  • Updates TypeSystem.infer_logical_type to standardize inference sampling. Inference sampling is now done via a .head(100000) call that uses the appropriate API depending on the collection type (pandas, dask, or koalas).
  • Changes the special case for all-null series to apply before sampling for pandas series and after sampling for dask and koalas series. The thinking behind this is that we would ideally use the entire series to determine if there are any valid records and it is reasonably performant to do that when the series is already in-memory, but not otherwise.

@davesque davesque changed the base branch from main to fix-release-notes July 28, 2021 22:31
@codecov
Copy link

codecov bot commented Jul 28, 2021

Codecov Report

Merging #1083 (a534d7a) into main (b0a10dc) will not change coverage.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff            @@
##              main     #1083   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           46        46           
  Lines         8183      8184    +1     
=========================================
+ Hits          8183      8184    +1     
Impacted Files Coverage Δ
woodwork/type_sys/inference_functions.py 100.00% <100.00%> (ø)
woodwork/type_sys/type_system.py 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b0a10dc...a534d7a. Read the comment docs.

@davesque davesque changed the base branch from fix-release-notes to main July 28, 2021 22:33
@davesque davesque requested review from thehomebrewnerd, gsheni, tamargrey and rwedge and removed request for thehomebrewnerd July 28, 2021 22:35
@gsheni gsheni requested a review from tuethan1999 July 29, 2021 01:02
Copy link
Contributor

@thehomebrewnerd thehomebrewnerd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall I think this looks pretty good. Just a few relatively minor comments that we can discuss further as necessary.

woodwork/type_sys/type_system.py Outdated Show resolved Hide resolved
woodwork/type_sys/inference_functions.py Outdated Show resolved Hide resolved
Comment on lines 268 to 269
else:
raise ValueError(f"Unexpected arg type `{type(series)}`") # pragma: no cover
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure we need to include this else clause, since we don't claim to support any other dataframe/series types beyond pandas, Dask or Koalas.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I often include final else arms like this in cases where it's clear that every other conditional arm should have handled whatever value was input to a function. It's my way of saying, "This collection of if conditionals is trying to identify what the type is of some argument and handle it accordingly. If any of the if conditionals don't match the value, then it's some kind of value with the wrong type." Also, static code analyzers such as mypy will often flag if statements that lack such a final else arm as potentially failing to handle all cases.

Copy link
Contributor

@rwedge rwedge Jul 29, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For user interpretability, would "Unexpected series type" or "Unsupported series type" be more descriptive

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I can make that change.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@davesque Makes sense, and I can see the benefit of leaving this if we do add support for new types, this could help alert us to a new condition we need to handle.

Copy link
Contributor

@thehomebrewnerd thehomebrewnerd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me!

@@ -11,6 +11,7 @@ Future Release
* The criteria for categorical type inference have changed (:pr:`1065`)
* The meaning of both the ``categorical_threshold`` and
``numeric_categorical_threshold`` settings have changed (:pr:`1065`)
* Make sampling for type inference more consistent (:pr:`1083`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would the change in sampling size be considered a breaking change because the logical types inferred at init could be different now than they were before for the same dataframe (both because we're now using the head and also taking more entries)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I suppose it could be. I'm not sure actually. You'll potentially get a different result but you'll also probably get a more correct one. I suppose it's worth a note at least.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Recommended update made here: 3b6661b

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually I made some edits to improve the wording. Probably should just look at the overall changes tab.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

breaking changes section looks good to go!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants