@tamargrey
Contributor

closes #1790

The numeric lag primitive requires that the DataFrame also have a time index, though whether the time index is numeric or datetime in nature shouldn't matter.

@codecov

codecov bot commented Dec 1, 2021

Codecov Report

Merging #1797 (b55b1e7) into main (9f8ffb1) will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##             main    #1797   +/-   ##
=======================================
  Coverage   98.72%   98.73%           
=======================================
  Files         141      141           
  Lines       15644    15722   +78     
=======================================
+ Hits        15445    15523   +78     
  Misses        199      199           
Impacted Files Coverage Δ
...ols/tests/synthesis/test_deep_feature_synthesis.py 99.33% <ø> (-0.01%) ⬇️
...retools/primitives/standard/transform_primitive.py 100.00% <100.00%> (ø)
...s/tests/primitive_tests/test_transform_features.py 99.39% <100.00%> (+0.01%) ⬆️
.../tests/primitive_tests/test_transform_primitive.py 100.00% <100.00%> (ø)

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9f8ffb1...b55b1e7. Read the comment docs.

@rwedge
Contributor

rwedge commented Dec 1, 2021

This primitive should have the uses_full_dataframe flag; otherwise, with unique cutoff times, the other rows won't be present to shift correctly.
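To illustrate the concern, here is a minimal pure-Python sketch. It is not Featuretools code: the `lag` helper and the sample values are hypothetical and only mimic what a lag does. The point is that a lag needs the preceding rows, so a calculation that only sees the row being computed at its cutoff has nothing to shift from.

```python
def lag(values, periods=1):
    """Shift values forward by `periods` (assumed non-negative), padding with None."""
    values = list(values)
    padding = [None] * min(periods, len(values))
    return padding + values[:len(values) - len(padding)]


# Hypothetical column, already ordered by the time index.
rows = [10, 20, 30, 40]
target = 3  # position of the row being calculated at its own cutoff time

# Only the target row is visible: nothing earlier to shift from.
only_target_row = lag(rows[target:target + 1])[0]

# All rows up to the cutoff are visible: the previous value is available.
full_then_select = lag(rows[:target + 1])[target]

print(only_target_row, full_then_select)
```

In the first case the lagged value is missing, while in the second it is the previous row's value, which is the behavior the flag is meant to guarantee.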

"""
name = "numeric_lag"
input_types = [ColumnSchema(semantic_tags={'time_index'}), ColumnSchema(semantic_tags={'numeric'})]
return_type = ColumnSchema(logical_type=Double, semantic_tags={'numeric'})
Contributor

Not sure how important it is, but should we just leave this as return_type = ColumnSchema(semantic_tags={'numeric'}) instead? I suspect the logical type is specified to try and avoid type inference, but this will force all columns to get converted to Double, which theoretically could cause a loss of information, especially if used with very large numeric inputs.

I think we should try to be specific with our return logical types when possible, but if we truly don't know that the output will always be of a certain type, maybe we should let inference happen so we don't end up with an unnecessary type conversion that alters our data?
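As a small illustration of the information-loss concern, the following sketch uses only built-in Python types (no Featuretools or Woodwork calls) and assumes that Double corresponds to a 64-bit IEEE 754 float. Integers above 2**53 cannot all be represented exactly in that format, so a round trip through a float can change the value.

```python
# A 64-bit float has a 53-bit significand, so not every integer
# above 2**53 has an exact floating-point representation.
big = 2**53 + 1

as_double = float(big)          # forced conversion to a double
round_tripped = int(as_double)  # convert back to compare exactly

print(big, round_tripped, big == round_tripped)

# Small integers are unaffected by the same round trip.
small = 12345
assert int(float(small)) == small
```

For typical values the conversion is harmless; the loss only appears for very large integer inputs, which matches the "theoretically" in the comment above.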

Contributor Author

I definitely see your point. Happy to change to just the tag.

Though if I'm reading correctly how woodwork types are determined in the feature matrix, we may automatically be using Double when the numeric tag is present and no logical type has been specified (I think this is what keeps the nullability problem from showing up here): https://github.com/alteryx/featuretools/blob/main/featuretools/computational_backends/utils.py#L318

Contributor

Yeah, think you are right, and we will end up with Double regardless. I forgot about that code in get_ww_types_from_features.

I'm still thinking it would be better to leave the logical type off here, so if we ever change the logic in get_ww_types_from_features we won't be forcing an unnecessary conversion based on the primitive return type.

assert feature_with_name(features, rolling_transform_name)


def test_numeric_lag_works_with_non_nullable(pd_es):
Contributor

Just curious about the intended purpose of this test. Were you intending to make sure we can calculate feature values correctly and that we don't get errors if we have a non-nullable input that gets lagged and introduces null values?

If so, I think you actually need to go to the step of computing the feature matrix instead of just the features. Without actually calculating the feature values, I don't think that type of error will show up.

Feel free to disregard if the intention was something else here.

Contributor Author

Yep, you're right. I can add the calculate feature matrix step here. Though maybe that means this should be a more targeted test in test_calculate_feature_matrix or test_feature_set_calculator.

Contributor

Yeah, probably would be best to relocate the test in that case.

Comment on lines 1891 to 1897
features = ft.dfs(target_dataframe_name='new_log',
                  entityset=pd_es,
                  agg_primitives=[],
                  trans_primitives=[lag_primitive],
                  features_only=True)

fm = calculate_feature_matrix(features=features, entityset=pd_es)
Contributor

@thehomebrewnerd thehomebrewnerd Dec 2, 2021

Could remove `features_only=True` and combine these into a single call:

    fm, features = ft.dfs(target_dataframe_name='new_log',
                          entityset=pd_es,
                          agg_primitives=[],
                          trans_primitives=[lag_primitive],)

Also, should we do any checks on the feature values or logical types? Not sure it's critical since those are covered in the primitive tests, but you could check that we end up with null values as expected. If we don't check anything in the feature matrix, you could change the assignment of fm, features to _, features since we are just throwing away the feature matrix data in that case.

Contributor Author

good call. Added checks for the null values in the feature matrix, and it means now we're not using the features list, so changed that to _.

Also moved the test to test_transform_features, because that seems like the better location for testing how a specific primitive's feature ends up looking in the feature matrix.

thehomebrewnerd previously approved these changes Dec 2, 2021
Contributor

@thehomebrewnerd thehomebrewnerd left a comment

Just one more non-critical, non-blocking suggestion for the cfm test.


assert isinstance(pd_es['new_log'].ww.logical_types['value'], Integer)

lag_primitive = NumericLag(periods=5)
Contributor

Not a big deal, but you could use a variable for periods here and below instead of hardcoding 5 everywhere. That way it might be more obvious that the assertions are checking that the number of nulls introduced is equal to the periods used for lagging.
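A rough sketch of that suggestion, written in plain Python rather than against the actual test (the `lag_values` helper and the sample data are hypothetical): defining `periods` once and reusing it makes it clear that the number of nulls introduced is being compared to the lag size.

```python
def lag_values(values, periods):
    """Shift values forward by `periods` (assumed non-negative), padding with None."""
    values = list(values)
    padding = [None] * min(periods, len(values))
    return padding + values[:len(values) - len(padding)]


periods = 5  # single source of truth for the lag size
values = list(range(20))

lagged = lag_values(values, periods)
num_nulls = sum(v is None for v in lagged)

# The assertion now reads as "nulls introduced == periods used for lagging".
assert num_nulls == periods
assert lagged[periods:] == values[:-periods]
```

With a hardcoded 5 in both places, the same assertions would pass but the relationship between them would be left for the reader to infer.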

Contributor Author

done!

thehomebrewnerd previously approved these changes Dec 2, 2021
Contributor

@thehomebrewnerd thehomebrewnerd left a comment

LGTM!

@rwedge
Contributor

rwedge commented Dec 2, 2021

For the CFM test can we use unique cutoff times as well to test that scenario?

Contributor

@rwedge rwedge left a comment

LGTM

Development

Successfully merging this pull request may close these issues.

Add NumericLag primitive

4 participants