Fix get_unused_primitives only recognizes lowercase primitive strings #1733

Merged
6 commits merged into alteryx:main on Oct 22, 2021

Conversation

HenryRocha
Contributor

Fixes #1729

Cast specified primitives to lowercase strings in get_unused_primitives.
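
A minimal sketch of the idea behind this change, not the actual featuretools code: `find_unused_primitives` and its arguments are hypothetical names chosen for illustration. It lowercases string inputs before comparing them with the names of the primitives that were actually used.

```python
def find_unused_primitives(specified, used_names):
    """Return the specified primitive strings that do not appear in used_names.

    Hypothetical helper for illustration; not the featuretools implementation.
    """
    # Normalize the names of the primitives that were used.
    used = {name.lower() for name in used_names}
    unused = []
    for primitive in specified:
        # Only plain strings are compared here; other objects are ignored.
        if isinstance(primitive, str) and primitive.lower() not in used:
            unused.append(primitive)
    return unused


# "Count" matches the used name "count" once both sides are lowercased.
print(find_unused_primitives(["Count", "Mean", "max"], ["count", "max"]))  # ['Mean']
```

Without the `.lower()` call on the input, "Count" would be reported as unused even though the `count` primitive was applied.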

@codecov

codecov bot commented Oct 11, 2021

Codecov Report

Merging #1733 (066b333) into main (bbad3f7) will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##             main    #1733   +/-   ##
=======================================
  Coverage   98.69%   98.69%           
=======================================
  Files         138      138           
  Lines       15368    15373    +5     
=======================================
+ Hits        15168    15173    +5     
  Misses        200      200           
Impacted Files                                     Coverage Δ
featuretools/synthesis/utils.py                    100.00% <100.00%> (ø)
featuretools/tests/synthesis/test_dfs_method.py    99.53% <100.00%> (+0.01%) ⬆️

@tamargrey
Contributor

Thanks for the contribution @HenryRocha! This fix makes sense to me. Would you be able to add a test for the behavior being fixed here in test_dfs_method.py? There's a couple of tests there that test the unused primitives warnings like test_warns_with_unused_primitives that will be good templates for this test.

Contributor

@davesque davesque left a comment

I think this solution might lead to surprising behavior. I see the original motivator for this was described in issue #1729 with the example of "Count". That string ends up getting compared with the Count primitive's name which, by convention and with all primitives, is the snake case version of the class name ("count"). But what about names like "NumUnique"? After this change, that will be converted to "numunique" which will then be compared with the NumUnique primitive's name which is "num_unique" (with an underscore).
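
To make the mismatch concrete, here is a small sketch. `camel_to_snake` is an illustrative helper written for this example, not a featuretools function; it only shows why lowercasing a class name is not the same as converting it to snake case.

```python
import re


def camel_to_snake(name):
    """Convert a CamelCase class name to snake_case (illustrative helper)."""
    # Insert an underscore before every capital letter except the first one.
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()


print("Count".lower())              # count
print(camel_to_snake("Count"))      # count
print("NumUnique".lower())          # numunique
print(camel_to_snake("NumUnique"))  # num_unique
```

For single-word names both transformations agree, which is why "Count" works after the change while "NumUnique" still would not.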

It seems like the real issue here is with the overly flexible behavior of the agg_primitives kwarg on the dfs function. Perhaps it should only accept class objects as arguments? That seems way less ambiguous and more buttoned down. It seems like accepting either class names or primitive names creates an ambiguity that is difficult to resolve. After all, it seems natural that if you can provide class objects as args, you could also provide the name of those objects as args. But then you'd need to do some kind of parsing to convert the names to snake case. But then that codifies the requirement that primitive names be the snake case version of their corresponding class name. And that begs the question of why we have the name class property at all to begin with.

What are everyone's thoughts on this?

CC: @tamargrey

@thehomebrewnerd
Contributor

@davesque You raise some interesting things to think about. Here are a couple of my thoughts:

  • I could be wrong, but I think DFS expects snake case strings as input to the primitives lists, so if you supplied something like "CumCount" you would get an unknown primitive error since that string doesn't match any primitive names. The input doesn't seem to be strictly snake case though because "Cum_Count" does get recognized as a valid primitive.
  • An alternate solution could be to require that the primitive names supplied as strings match the primitive name exactly and we don't do any lowercasing of the DFS inputs like we seem to be doing now before we match them to primitive names.
  • I do think this problem stems from the flexibility we allow in the inputs. However, as a user I definitely like having the ability to specify primitives by string rather than by class objects or instances. If we only allowed class inputs we would have to add extra imports for those classes, which is inconvenient from a usability point of view in my opinion.
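
The two matching strategies from the bullets above can be compared in a short sketch. The set of names and both helper functions are assumptions made for illustration; they do not reproduce how DFS looks up primitives.

```python
# Assumed sample of primitive names, for illustration only.
PRIMITIVE_NAMES = {"count", "cum_count", "num_unique"}


def matches_lowercased(text):
    """Lowercase the input before lookup (the lenient behavior described above)."""
    return text.lower() in PRIMITIVE_NAMES


def matches_exactly(text):
    """Require the input to equal a primitive name exactly (the stricter alternative)."""
    return text in PRIMITIVE_NAMES


for candidate in ["cum_count", "Cum_Count", "CumCount"]:
    print(candidate, matches_lowercased(candidate), matches_exactly(candidate))
```

Under these assumptions, "Cum_Count" passes the lenient check but not the exact one, and "CumCount" fails both.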

@davesque
Contributor

davesque commented Oct 14, 2021

@thehomebrewnerd

  • I could be wrong, but I think DFS expects snake case strings as input to the primitives lists, so if you supplied something like "CumCount" you would get an unknown primitive error since that string doesn't match any primitive names.

Yep, that was my understanding as well. I guess what I was pointing out was that, based on the description in the original ticket, it seemed that it was expected that the method would also accept the names of primitive class objects as string inputs. Having it work that way causes issues for the reason you mention: because you expect to be able to provide "CumCount" as a meaningful input but it doesn't get recognized. You could give "Cum_Count" like you say (which isn't really the name of a primitive or the class name of a primitive). An input like that might even indicate some issue on the user's end but we'd end up silently accepting it. Of course, "cum_count" would also be accepted (and correctly found) but that already works anyway without this update.

The way that this change leads to this ambiguity makes me think that we shouldn't even really have primitive names as a separate concept from just the class name of primitives. It just means that we have this thing that people need to use to look up primitives which seems obscure. As a developer, I'd expect to just provide the name of the class or the class itself or an instance of it. The fact that the primitive classes have a class property that defines their name (which can be literally anything; it could be "☃") is sort of like a hidden detail that only serves to complicate things. That seems to have been proven by #1729.

  • An alternate solution could be to require that the primitive names supplied as strings match the primitive name exactly and we don't do any lowercasing of the DFS inputs like we seem to be doing now before we match them to primitive names.

I agree that this would be a fix. I think that's sort of the de facto way that things are being done now since doing otherwise causes the error. I guess it would amount to doing a bit of extra validation on string inputs and emitting a more meaningful error. I still have some concerns though about what I mentioned above; that having a separate primitive name feels too complicated.

  • I do think this problem stems from the flexibility we allow in the inputs. However, as a user I definitely like having the ability to specify primitives by string rather than by class objects or instances. If we only allowed class inputs we would have to add extra imports for those classes, which is inconvenient from a usability point of view in my opinion.

Yeah, I agree. I did mention the idea of only accepting class objects, but I'm not crazy about it for the reason you mention. We should probably continue to provide the convenience of string inputs.

CC: @thehomebrewnerd @tamargrey @jeff-hernandez @tuethan1999 @gsheni @rwedge

I guess a big question I have from all this is something I should pose to the entire ML tools team: do we remember why we decided to have primitive names separate from primitive class names? I want to default to assuming that there was a specific reason for it that I'm overlooking.

@rwedge
Contributor

rwedge commented Oct 14, 2021

The "variable types" typing system used before we started using Woodwork originally had a "name" attribute that was defined separately from the class name, but we changed it to be an automatically generated snake_case version of the class name

Switching to snake case for the string name probably added some unnecessary potential for user error

@davesque
Contributor

davesque commented Oct 14, 2021

@rwedge Cool. Useful context. Does having the separate snake case name seem like something that might still be required for some reason?

@rwedge
Contributor

rwedge commented Oct 14, 2021

@rwedge Cool. Useful context. Does having the separate snake case name seem like something that might still be required for some reason?

@davesque - I don't think it'd be required but it could break existing user workflows if we don't recognize the snake case names that were used previously

@tamargrey
Contributor

@davesque Thanks for bringing up these points about string inputs for primitives. I agree that using the primitive name unnecessarily obfuscates for users what they should be passing in.

I wanted to point out that we have a similar string input api in Woodwork for logical types, but there we ultimately use the class name. _parse_logical_type is where we handle the logic for whether the logical type is a string or class, and type_system.str_to_logical_type is where we actually parse the inputted string. In the end, woodwork's string matching doesn't care about upper or lower case and allows snake and camel case. So the following strings all are understood to be LogicalType.NaturalLanguage: 'NaturalLanguage', 'natural_language', 'natuRAl_lAnguage', 'NaTuRalLanGuagE'.
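
One way such lenient matching could work is sketched below. `normalize_type_name` is a hypothetical helper and is not Woodwork's `str_to_logical_type`; it simply drops underscores and lowercases, so snake case, camel case, and mixed case all collapse to the same key.

```python
def normalize_type_name(text):
    """Collapse case and underscores so different spellings compare equal.

    Hypothetical helper; Woodwork's actual parsing may differ.
    """
    return text.replace("_", "").lower()


variants = ["NaturalLanguage", "natural_language", "natuRAl_lAnguage", "NaTuRalLanGuagE"]
print({normalize_type_name(v) for v in variants})  # {'naturallanguage'}
```

The same normalization applied to primitive strings would make "NumUnique" and "num_unique" equivalent, at the cost of also accepting spellings such as "Num_Unique".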

I agree that the ease of use of passing in strings is something we should keep, and whether we allow snake case or upper/lower case variations can be up for further discussions. But that conversation is probably one that should happen on its own issue, and it should probably happen for woodwork logical types as well.

But for the purposes of this PR: while we allow strings like 'Cum_Count' to be passed into DFS, I feel like we need to recognize them and not have them show up in the unused primitive warnings. More likely than the multi-word example is the single-word one (which is what I ran into): with primitive strings like 'Min' or 'Day', nothing makes it clear that capitalized strings aren't expected, so seeing them show up as unused primitives is confusing. The solution @HenryRocha implemented here makes sense to me to have in featuretools until we decide, long term, how we want to handle string inputs.

@gsheni
Contributor

gsheni commented Oct 15, 2021

@tamargrey I agree. Let's punt on allowing snake case or upper/lower case variations. We can get this MR in once a unit test has been added.

We originally only allowed primitive class objects to be passed to DFS. We changed this a long time ago to make it easier to run DFS (without having to import all primitives). I would like to keep this behavior going forward.
For example, if you want to use the "Count" primitive, passing the string into DFS is much easier than having to import the class.

Note: There are certain situations where you have to use the primitive class object, namely when the primitive requires a parameter or you want to change a default parameter value.
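
The three input forms can be illustrated with a toy resolver. Everything here is a stand-in written for this example: the `Count` and `NMostCommon` classes, the `REGISTRY` dict, and `resolve_primitive` are hypothetical and are not the featuretools classes or lookup code.

```python
class Count:
    """Toy stand-in for a primitive without parameters."""
    name = "count"


class NMostCommon:
    """Toy stand-in for a primitive with a configurable parameter."""
    name = "n_most_common"

    def __init__(self, n=3):
        self.n = n


# Assumed registry mapping primitive names to classes.
REGISTRY = {cls.name: cls for cls in (Count, NMostCommon)}


def resolve_primitive(spec):
    """Accept a name string, a class, or an instance and return an instance."""
    if isinstance(spec, str):
        # A string can only ever yield the default configuration.
        return REGISTRY[spec.lower()]()
    if isinstance(spec, type):
        return spec()
    # Already an instance, possibly with non-default parameters.
    return spec


print(resolve_primitive("Count").name)           # count
print(resolve_primitive("n_most_common").n)      # 3
print(resolve_primitive(NMostCommon(n=5)).n)     # 5
```

In this sketch a string always produces the default parameters, which is why a non-default value such as `n=5` needs an instance.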

@HenryRocha once a unit test has been added, a member on our team will review the MR and we can get it merged soon. Thanks for the contribution!

@davesque davesque self-requested a review October 16, 2021 01:17
davesque
davesque previously approved these changes Oct 16, 2021
@HenryRocha
Contributor Author

@tamargrey I've added a unit test that should account for different cased strings, please check if this test is enough.

Also, I think the merge conflict has to do with the v1.0.0 release, which ended up resetting the future releases section of the docs/source/release_notes.rst file. Should I merge with the main branch?

@tamargrey
Contributor

Also, I think the merge conflict has to do with the v1.0.0 release, which ended up resetting the future releases section of the docs/source/release_notes.rst file. Should I merge with the main branch?

@HenryRocha yep, let's merge in the changes from main. You'll only need the one release note in the Fixes section (the one under the Testing Changes section is covered by the Fixes one, so it's not necessary). You can look at the current release notes to make sure that the lines that were part of the 1.0.0 release don't end up duplicated in the Future Release section.

tamargrey
tamargrey previously approved these changes Oct 20, 2021
Contributor

@tamargrey tamargrey left a comment

Looks good to me!

@gsheni @davesque @rwedge I created an issue to look into different handling of string primitive inputs: #1750

Contributor

@thehomebrewnerd thehomebrewnerd left a comment

Just one minor update to the release notes, but otherwise looks fine to me.

docs/source/release_notes.rst (outdated, resolved)
Contributor

@thehomebrewnerd thehomebrewnerd left a comment

LGTM!

@thehomebrewnerd thehomebrewnerd merged commit 8b73669 into alteryx:main Oct 22, 2021
@rwedge rwedge mentioned this pull request Nov 2, 2021
Development

Successfully merging this pull request may close these issues.

get_unused_primitives only recognizes lowercase primitive strings
6 participants