-
Notifications
You must be signed in to change notification settings - Fork 871
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix get_unused_primitives only recognizes lowercase primitive strings #1733
Conversation
Codecov Report
@@ Coverage Diff @@
## main #1733 +/- ##
=======================================
Coverage 98.69% 98.69%
=======================================
Files 138 138
Lines 15368 15373 +5
=======================================
+ Hits 15168 15173 +5
Misses 200 200
Continue to review full report at Codecov.
|
Thanks for the contribution @HenryRocha! This fix makes sense to me. Would you be able to add a test for the behavior being fixed here in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this solution might lead to surprising behavior. I see the original motivator for this was described in issue #1729 with the example of "Count". That string ends up getting compared with the Count primitive's name which, by convention and with all primitives, is the snake case version of the class name ("count"). But what about names like "NumUnique"? After this change, that will be converted to "numunique" which will then be compared with the NumUnique primitive's name which is "num_unique" (with an underscore).
It seems like the real issue here is with the overly flexible behavior of the agg_primitives
kwarg on the dfs
function. Perhaps it should only accept class objects as arguments? That seems way less ambiguous and more buttoned down. It seems like accepting either class names or primitive names creates an ambiguity that is difficult to resolve. After all, it seems natural that if you can provide class objects as args, you could also provide the name of those objects as args. But then you'd need to do some kind of parsing to convert the names to snake case. But then that codifies the requirement that primitive names be the snake case version of their corresponding class name. And that begs the question of why we have the name
class property at all to begin with.
What are everyone's thoughts on this?
CC: @tamargrey
@davesque You raise some interesting things to think about. Here are a couple of my thoughts:
|
Yep, that was my understanding as well. I guess what I was pointing out was that, based on the description in the original ticket, it seemed that it was expected that the method would also accept the names of primitive class objects as string inputs. Having it work that way causes issues for the reason you mention: because you expect to be able to provide "CumCount" as a meaningful input but it doesn't get recognized. You could give "Cum_Count" like you say (which isn't really the name of a primitive or the class name of a primitive). An input like that might even indicate some issue on the user's end but we'd end up silently accepting it. Of course, "cum_count" would also be accepted (and correctly found) but that already works anyway without this update. The way that this change leads to this ambiguity makes me think that we shouldn't even really have primitive names as a separate concept from just the class name of primitives. It just means that we have this thing that people need to use to look up primitives which seems obscure. As a developer, I'd expect to just provide the name of the class or the class itself or an instance of it. The fact that the primitive classes have a class property that defines their name (which can be literally anything; it could be "☃") is sort of like a hidden detail that only serves to complicate things. That seems to have been proven by #1729.
I agree that this would be a fix. I think that's sort of the de facto way that things are being done now since doing otherwise causes the error. I guess it would amount to doing a bit of extra validation on string inputs and emitting a more meaningful error. I still have some concerns though about what I mentioned above; that having a separate primitive name feels too complicated.
Yeah, I agree. I did mention that idea of only accepting class objects but I'm not crazy about for this reason that you mention. We should probably continue to provide the convenience of string inputs. CC: @thehomebrewnerd @tamargrey @jeff-hernandez @tuethan1999 @gsheni @rwedge I guess a big question I have from all this is something I should pose to the entire ML tools team: do we remember why we decided to have primitive names separate from primitive class names? I want to default to assuming that there was a specific reason for it that I'm overlooking. |
The "variable types" typing system used before we started using Woodwork originally had a "name" attribute that was defined separately from the class name, but we changed it to be an automatically generated snake_case version of the class name Switching to snake case for the string name probably added some unnecessary potential for user error |
@rwedge Cool. Useful context. Does having the separate snake case name seem like something that might still be required for some reason? |
@davesque Thanks for bringing up these points about string inputs for primitives. I agree that using the primitive name unnecessarily obfuscates for users what they should be passing in. I wanted to point out that we have a similar string input api in Woodwork for logical types, but there we ultimately use the class name. I agree that the ease of use of passing in strings is something we should keep, and whether we allow snake case or upper/lower case variations can be up for further discussions. But that conversation is probably one that should happen on its own issue, and it should probably happen for woodwork logical types as well. But for the purposes of this PR: While we're allowing strings like |
@tamargrey I agree. Let's punt on allowing snake case or upper/lower case variations. We can get in this MR once a unit test has been added. We originally used to only allow primitive class objects to be passed to DFS. We changed this a long time ago to make it easier to run DFS (without having to import all primitives). I would like to keep this behavior going forward.
@HenryRocha once a unit test has been added, a member on our team will review the MR and we can get it merged soon. Thanks for the contribution! |
@tamargrey I've added a unit test that should account for different cased strings, please check if this test is enough. Also, I think the merge conflict has to with the v1.0.0 release, which ended up resetting the future releases section of the |
@HenryRocha yep, let's merge in the changes from main. You'll only need the one release note in the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just one minor update to the release notes, but otherwise looks fine to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
Fixes #1729
Cast specified primitives to lowercase strings in
get_unused_primitives
.