-
Notifications
You must be signed in to change notification settings - Fork 909
Add NumericLag transform primitve #1797
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
96dcfd7 to
b0208cc
Compare
Codecov Report
@@ Coverage Diff @@
## main #1797 +/- ##
=======================================
Coverage 98.72% 98.73%
=======================================
Files 141 141
Lines 15644 15722 +78
=======================================
+ Hits 15445 15523 +78
Misses 199 199
Continue to review full report at Codecov.
|
|
This primitive should have the |
| """ | ||
| name = "numeric_lag" | ||
| input_types = [ColumnSchema(semantic_tags={'time_index'}), ColumnSchema(semantic_tags={'numeric'})] | ||
| return_type = ColumnSchema(logical_type=Double, semantic_tags={'numeric'}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure how important it is, but should we just leave this as return_type = ColumnSchema(semantic_tags={'numeric'}) instead? I suspect the logical type is specified to try and avoid type inference, but this will force all columns to get converted to Double, which theoretically could cause a loss of information, especially if used with very large number inputs.
I think we should try to be specific with our return logical types when possible, but if we truly don't know that the output will always be of a certain type, maybe we should let inference happen so don't end up with an unnecessary type conversion that alters our data?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I definitely see your point. Happy to change to just the tag.
Though if I'm reading the way woodwork types are determined in the feature matrix, we may be automatically using Double when the numeric tag is present and no logical type has been specified ( I think that this is what keeps the nullability problem from showing up here): https://github.com/alteryx/featuretools/blob/main/featuretools/computational_backends/utils.py#L318
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, think you are right, and we will end up with Double regardless. I forgot about that code in get_ww_types_from_features.
I'm still thinking it would be better to leave the logical type off here, so if we ever change the logic in get_ww_types_from_features we won't be forcing an unnecessary conversion based on the primitive return type.
| assert feature_with_name(features, rolling_transform_name) | ||
|
|
||
|
|
||
| def test_numeric_lag_works_with_non_nullable(pd_es): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just curious about the intended purpose of this test. Were you intending to make sure we can calculate feature values correctly and that we don't get errors if we have a non-nullable input that gets lagged and introduces null values?
If so, I think you actually need to go to the step of computing the feature matrix instead of just the features. Without computing the features, I don't think that type of error will show up.
Feel free to disregard if the intention was something else here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep, you're right. I can add the calculate feature matrix step here. Though maybe that means this should be a more targeted test in test_calculate_feature_matrix or test_feature_set_calculator
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, probably would be best to relocate the test in that case.
| features = ft.dfs(target_dataframe_name='new_log', | ||
| entityset=pd_es, | ||
| agg_primitives=[], | ||
| trans_primitives=[lag_primitive], | ||
| features_only=True) | ||
|
|
||
| fm = calculate_feature_matrix(features=features, entityset=pd_es) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could remove `feautres_only=True) and combine these into a single call:
fm, features = ft.dfs(target_dataframe_name='new_log',
entityset=pd_es,
agg_primitives=[],
trans_primitives=[lag_primitive],)
Also, should we do any checks on the features values or logical types? Not sure it's critical since those are covered in the primitive tests, but you could check that we end up with null values as expected. If we don't check anything in the feature matrix you could change the assignment of fm, features to _, features since we are just throwing away the feature matrix data in that case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good call. Added checks for the null values in the feature matrix, and it means now we're not using the features list, so changed that to _.
Also moved the test to test_transform_features, because it seems that that's the better location for how a specific primitive's feature ends up looking in the feature matrix
thehomebrewnerd
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just one more non-critical, non-blocking suggestion for the cfm test.
|
|
||
| assert isinstance(pd_es['new_log'].ww.logical_types['value'], Integer) | ||
|
|
||
| lag_primitive = NumericLag(periods=5) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not a big deal, but you could use a variable for periods here and below instead of hardcoding 5 everywhere. That way it might be more obvious that the assertions are checking that the number of nulls introduced is equal to the periods used for lagging.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done!
thehomebrewnerd
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
|
For the CFM test can we use unique cutoff times as well to test that scenario? |
rwedge
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
closes #1790
The numeric lag primitive requires that the DataFrame also have a time index, though whether the time index is numeric or datetime in nature shouldn't matter.