Add include_cutoff_time arg to control whether data at cutoff times a… #959
…re included in feature calculations and prevent training_window overlapping
Codecov Report
@@ Coverage Diff @@
## master #959 +/- ##
=======================================
Coverage 98.24% 98.25%
=======================================
Files 119 119
Lines 10945 10985 +40
=======================================
+ Hits 10753 10793 +40
Misses 192 192
@jeff-hernandez Hi, Jeff. This is my first attempt. Could you please review the changes? |
Hi @rightx2, thanks for the follow-up with this PR. I will provide a code review and keep you updated. Thanks! |
Nice work! The parameter works as expected and all of the test cases are included. I left a few comments, but other than that, this looks ready to merge. I will pass to @rwedge for the final approval.
featuretools/computational_backends/calculate_feature_matrix.py
featuretools/tests/computational_backend/test_calculate_feature_matrix.py
I think we should also update the Handling time docs page to mention this new option |
Reflect suggestion
Reflect suggestion
Reflect suggestion
Reflect suggestion
@jeff-hernandez I reflected your suggestions. Thanks. @rwedge I've tried to add |
Include data at cutoff times
-----------------------------------------------

There are situations where data falls exactly on the cutoff time. For example, say you have to predict one month of revenue for each store using sales data. One of the targets is the revenue from ``2020-01-01`` to ``2020-01-31``, and there is a lot of sales history before that time, including a sale that occurred at ``2020-01-01 00:00:00``. You might want to include (or exclude) that sale in the feature calculation for this cutoff time. This can be controlled by using the ``include_cutoff_time`` parameter to :func:`featuretools.dfs` or :func:`featuretools.calculate_feature_matrix`::
I think the key difference to highlight is that include_cutoff_time=True
means "use data from this time and older" and include_cutoff_time=False
means "do not use data from this point in time or any data newer than it"
We should also explain how this impacts training window -- I think expanding the example to show how using a training window would work in both cases would be helpful to the reader
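One way to picture the two settings, including a training window, is the plain pandas sketch below. It is not featuretools code: `select_rows`, the `sales` table, and its column names are hypothetical, and the treatment of the window's oldest edge (excluded when the cutoff is included, included when the cutoff is excluded) is an assumption chosen so that back-to-back windows never share a row.

```python
import pandas as pd


def select_rows(df, cutoff, include_cutoff_time=True,
                training_window=None, time_col="time"):
    """Hypothetical helper: rows of `df` usable for features at `cutoff`."""
    if include_cutoff_time:
        # "use data from this time and older"
        mask = df[time_col] <= cutoff
    else:
        # "do not use data from this point in time or any data newer than it"
        mask = df[time_col] < cutoff

    if training_window is not None:
        start = cutoff - training_window
        # Assumption: exactly one edge of the window is closed, so two
        # adjacent windows never contain the same row.
        if include_cutoff_time:
            mask &= df[time_col] > start
        else:
            mask &= df[time_col] >= start
    return df[mask]


# Hypothetical sales data with one sale exactly at the cutoff time.
sales = pd.DataFrame({
    "sale_id": [0, 1, 2, 3],
    "time": pd.to_datetime(
        ["2019-12-01", "2019-12-31", "2020-01-01", "2020-01-02"]
    ),
})
cutoff = pd.Timestamp("2020-01-01")
window = pd.Timedelta(days=1)

included = select_rows(sales, cutoff, include_cutoff_time=True)
excluded = select_rows(sales, cutoff, include_cutoff_time=False)
window_included = select_rows(sales, cutoff, True, window)
window_excluded = select_rows(sales, cutoff, False, window)
```

With a one-day window, the two settings select disjoint rows here: the sale at the cutoff in one case, the sale one day earlier in the other.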
featuretools/computational_backends/calculate_feature_matrix.py
Include data at cutoff times -> Excluding data at cutoff times
Fix docstring
Fix docstring
@rwedge I've done some corrections. Please check them out |
@rwedge Thanks for reviewing my work! I think your suggestions are clearer than mine, so I applied them all. P.S. I'm not very good at English, especially writing and grammar, so please feel free to correct any mistakes or wrong expressions :) |
featuretools/tests/computational_backend/test_calculate_feature_matrix.py
Change test func name: -> `test_include_cutoff_time`
Those docs updates look great, I'll edit as I see fit.
I think we should also add a test for how this would work with the approximate option
@rwedge I'll try that |
I updated the training window example a bit, you might need to pull |
@rwedge Hi, I've tried to find a good |
Sure, @rightx2, I'll try to explain. The gist of what approximate does:
So for how I ended up writing a test case to double check my logic:

    def test_approximate_dfeat_of_agg_on_target_include_cutoff_time(es):
        agg_feat = ft.Feature(es['log']['id'], parent_entity=es['sessions'], primitive=Count)
        agg_feat2 = ft.Feature(agg_feat, parent_entity=es['customers'], primitive=Sum)
        dfeat = DirectFeature(agg_feat2, es['sessions'])

        cutoff_time = pd.DataFrame({'time': [datetime(2011, 4, 9, 10, 31, 19)], 'instance_id': [0]})
        feature_matrix = calculate_feature_matrix([dfeat, agg_feat],
                                                  es,
                                                  approximate=Timedelta(20, 's'),
                                                  cutoff_time=cutoff_time,
                                                  include_cutoff_time=False)

        # log event 5 excluded due to approximate cutoff time point
        assert feature_matrix[dfeat.get_name()].tolist() == [5]
        assert feature_matrix[agg_feat.get_name()].tolist() == [5]

        feature_matrix = calculate_feature_matrix([dfeat, agg_feat],
                                                  es,
                                                  approximate=Timedelta(20, 's'),
                                                  cutoff_time=cutoff_time,
                                                  include_cutoff_time=True)

        # log event 5 included due to approximate cutoff time point
        assert feature_matrix[dfeat.get_name()].tolist() == [6]
        assert feature_matrix[agg_feat.get_name()].tolist() == [5]

Based on this test it looks like approximate is working with |
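To make the comments in that test easier to follow, here is a small pandas-only sketch of the time arithmetic. It rests on two assumptions that are not confirmed in this thread: that `approximate` rounds the cutoff time down to a multiple of the window, and that log event 5 in the test fixture is timestamped `10:31:00` (the `event_5_time` value below is hypothetical).

```python
import pandas as pd

# Cutoff time and approximate window taken from the test above.
cutoff = pd.Timestamp("2011-04-09 10:31:19")

# Assumption: the approximate cutoff is the cutoff floored to the window size.
approx_cutoff = cutoff.floor("20s")

# Hypothetical timestamp for log event 5, assumed to sit exactly on the
# approximate cutoff time.
event_5_time = pd.Timestamp("2011-04-09 10:31:00")

# include_cutoff_time=True behaves like "<=", False behaves like "<".
used_when_included = event_5_time <= approx_cutoff
used_when_excluded = event_5_time < approx_cutoff
```

Under these assumptions the event falls exactly on the approximate cutoff, so it is counted only in the `include_cutoff_time=True` run, which would match the `[6]` versus `[5]` values asserted for `dfeat`.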
@rwedge Seeing your explanation and example, I think I was almost there except understanding |
I've solved the conflict issue in |
We don't need to move the Breaking Change part. This PR will add new functionality, if users don't use I like the extra comments in the test case, these can be hard to puzzle out what they are doing. |
Looks good!
This looks good and ready to merge once the conflicts are resolved. Thank you for the contribution @rightx2!
Add `include_cutoff_time` arg to control whether data at cutoff times are included in feature calculations and prevent `training_window` overlapping
Pull Request Description
There was a data overlap problem when calculating the feature matrix: data at the cutoff time could be used both in calculating features and in calculating target values (#918). This could leak target information into the features ("data cheating") and affect the results. There was an attempt to solve the issue (#930), but it still didn't solve the cheating problem. So we decided to add a parameter that controls whether data at cutoff times is included in feature calculations (#942), and this PR implements it.