Make sampling for inference in woodwork more consistent #1083
Codecov Report
@@ Coverage Diff @@
## main #1083 +/- ##
=========================================
Coverage 100.00% 100.00%
=========================================
Files 46 46
Lines 8183 8184 +1
=========================================
+ Hits 8183 8184 +1
3264ecd to dd68385
Overall I think this looks pretty good. Just a few relatively minor comments that we can discuss further as necessary.
woodwork/type_sys/type_system.py
    else:
        raise ValueError(f"Unexpected arg type `{type(series)}`")  # pragma: no cover
I'm not sure we need to include this else clause, since we don't claim to support any other dataframe/series types beyond pandas, Dask or Koalas.
I often include final else arms like this in cases where it's clear that every other conditional arm should have handled whatever value was input to a function. It's my way of saying, "This collection of if conditionals is trying to identify what the type is of some argument and handle it accordingly. If any of the if conditionals don't match the value, then it's some kind of value with the wrong type." Also, static code analyzers such as mypy will often flag if statements that lack such a final else arm as potentially failing to handle all cases.
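The pattern described in this comment can be sketched in a few lines. This is a minimal illustration only, not code from the pull request: `first_entries` and the container types it accepts are hypothetical stand-ins for a function that dispatches on its argument's type.

```python
def first_entries(values, n=3):
    """Return up to ``n`` leading entries of a supported container as a list."""
    if isinstance(values, list):
        return values[:n]
    elif isinstance(values, tuple):
        return list(values[:n])
    elif isinstance(values, dict):
        return list(values.values())[:n]
    else:
        # Every supported type was handled above, so anything that reaches
        # this arm has the wrong type. Fail loudly rather than fall through
        # and implicitly return None.
        raise ValueError(f"Unsupported container type `{type(values)}`")
```

Without the final arm, calling `first_entries("abc")` would silently return `None`; with it, the caller gets an error that names the offending type.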
For user interpretability, would "Unexpected series type" or "Unsupported series type" be more descriptive?
Sure, I can make that change.
@davesque Makes sense, and I can see the benefit of leaving this in: if we do add support for new types, it could help alert us to a new condition we need to handle.
Looks good to me!
e5aeb2b to df0b91f
@@ -11,6 +11,7 @@ Future Release
    * The criteria for categorical type inference have changed (:pr:`1065`)
    * The meaning of both the ``categorical_threshold`` and
      ``numeric_categorical_threshold`` settings have changed (:pr:`1065`)
    * Make sampling for type inference more consistent (:pr:`1083`)
Would the change in sampling size be considered a breaking change, since the logical types inferred at init could now differ from before for the same dataframe (both because we're now using the head and because we're taking more entries)?
Yeah, I suppose it could be. I'm not sure actually. You'll potentially get a different result but you'll also probably get a more correct one. I suppose it's worth a note at least.
Recommended update made here: 3b6661b
Actually I made some edits to improve the wording. Probably should just look at the overall changes tab.
Breaking changes section looks good to go!
1034a60 to a534d7a
woodwork.type_sys.inference_functions
.TypeSystem.infer_logical_type
to standardize inference sampling. Inference sampling is now done via a .head(100000)
call that uses the appropriate API depending on the collection type (pandas, Dask, or Koalas).
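
As a rough sketch of what "the appropriate API depending on the collection type" could look like, the helper below takes the first rows of a series using the `head` call of each library. The names `sample_for_inference` and `INFERENCE_SAMPLE_SIZE` are hypothetical, and dispatching on the module name (rather than `isinstance`) is an assumption made here so the sketch runs without importing the optional Dask and Koalas dependencies; it is not necessarily how woodwork implements this.

```python
INFERENCE_SAMPLE_SIZE = 100_000  # the head(100000) size quoted in the note above


def sample_for_inference(series, n=INFERENCE_SAMPLE_SIZE):
    """Return the first ``n`` rows of ``series`` for type inference."""
    # Hypothetical dispatch: identify the library from the type's module name.
    library = type(series).__module__.split(".")[0]
    if library == "pandas":
        return series.head(n)
    elif library == "dask":
        # Dask reads only the first partition by default; npartitions=-1
        # lets head() draw rows from all partitions if needed.
        return series.head(n, npartitions=-1)
    elif library == "databricks":
        # Koalas lives in the databricks.koalas package; head() returns a
        # Koalas series, so convert the (now small) result to pandas.
        return series.head(n).to_pandas()
    else:
        raise ValueError(f"Unsupported series type `{type(series)}`")
```

With pandas installed, `sample_for_inference(pd.Series(range(10)), n=3)` would return the first three rows. Because the sample is always the leading rows, repeated calls on the same data return the same sample.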