Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add include_cutoff_time arg to control whether data at cutoff times a… #959

Merged
merged 21 commits into from May 22, 2020

Conversation

rightx2
Copy link
Contributor

@rightx2 rightx2 commented May 9, 2020

Add include_cutoff_time arg to control whether data at cutoff times are included in feature calculations and prevent traininig_window overlapping

Pull Request Description

There was a data overlapping problem when calculating the feature matrix: The data at cutoff time might be used both in calculating features and in calculating target values(#918 ). This could cause data cheating and affect the result as well. There was a trial to solve the issue (#930 ), but It still didn't solve the cheating problem. So, we decided to parameterize it to control whether data at cutoff times are included in feature calculations or not(#942 ) and this PR solves it.

@codecov
Copy link

codecov bot commented May 9, 2020

Codecov Report

Merging #959 into master will increase coverage by 0.00%.
The diff coverage is 100.00%.

Impacted file tree graph

@@           Coverage Diff           @@
##           master     #959   +/-   ##
=======================================
  Coverage   98.24%   98.25%           
=======================================
  Files         119      119           
  Lines       10945    10985   +40     
=======================================
+ Hits        10753    10793   +40     
  Misses        192      192           
Impacted Files Coverage Δ
featuretools/synthesis/dfs.py 100.00% <ø> (ø)
...computational_backends/calculate_feature_matrix.py 98.61% <100.00%> (ø)
...s/computational_backends/feature_set_calculator.py 98.55% <100.00%> (ø)
featuretools/entityset/entity.py 95.78% <100.00%> (+0.09%) ⬆️
...utational_backend/test_calculate_feature_matrix.py 99.42% <100.00%> (+0.02%) ⬆️
featuretools/tests/synthesis/test_dfs_method.py 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ec3b8ce...7aa6ffc. Read the comment docs.

@rightx2
Copy link
Contributor Author

rightx2 commented May 9, 2020

@jeff-hernandez Hi, jeff. This is my first trial. Could you please review them?

@jeff-hernandez jeff-hernandez self-requested a review May 11, 2020 15:35
@jeff-hernandez
Copy link
Contributor

Hi @rightx2, thanks for the follow-up with this PR. I will provide a code review and keep you updated. Thanks!

Copy link
Contributor

@jeff-hernandez jeff-hernandez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work! The parameter works as expected and each of the test cases are included. I left a few comments, but other than that, this looks ready to merge. I will pass to @rwedge for the final approval.

docs/source/changelog.rst Outdated Show resolved Hide resolved
featuretools/synthesis/dfs.py Outdated Show resolved Hide resolved
@rwedge
Copy link
Contributor

rwedge commented May 12, 2020

I think we should also update the Handling time docs page to mention this new option

@rightx2
Copy link
Contributor Author

rightx2 commented May 13, 2020

@jeff-hernandez I reflected your suggestions. Thanks. @rwedge I've tried to add include_cutoff_time on the Handling time section. Could you review that, please?

Include data at cutoff times
-----------------------------------------------

There are some situations where data is right just on the cutoff time. For example, let say you have to predict one month revenue for each store using sales data. One of the them is the revenue from ``2020-01-01`` to ``2020-01-31`` and there are bunch of sale history data before that time, including the one occured at ``2020-01-01 00:00:00``. You might want to include(or exclude) the data in feature calculation for this cutoff time. This can be controlled by using the ``include_cutoff_time`` parameter to :func:`featuretools.dfs` or :func:`featuretools.calculate_feature_matrix`::
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the key difference to highlight is that include_cutoff_time=True means "use data from this time and older" and include_cutoff_time=False means "do not use data from this point in time or any data newer than it"

We should also explain how this impacts training window -- I think expanding the example to show how using a training window would work in both cases would be helpful to the reader

Include data at cutoff times -> Excluding data at cutoff times
Fix docstring
@rightx2
Copy link
Contributor Author

rightx2 commented May 14, 2020

@rwedge I've done some corrections. Please check them out

@rightx2
Copy link
Contributor Author

rightx2 commented May 14, 2020

@rwedge Thanks for reviewing my poor work!

I thought your suggestion is more clear than mine. I reflected them all.

p.s. Since I'm not damn good at English, especially writing and grammar, so feel free to correct any mistake or wrong expression, please :)

@rwedge rwedge mentioned this pull request May 18, 2020
Change test func name: -> `test_include_cutoff_time`
Copy link
Contributor

@rwedge rwedge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those docs updates looks great, I'll edit as I see fit.

I think we should also add a test for how this would work with the approximate option

@rightx2
Copy link
Contributor Author

rightx2 commented May 19, 2020

@rwedge I'll try that

@rwedge
Copy link
Contributor

rwedge commented May 19, 2020

I updated the training window example a bit, you might need to pull

@rightx2
Copy link
Contributor Author

rightx2 commented May 20, 2020

@rwedge Hi, I've tried to find a good approximate example for this for a couple of days, but I'm still stuck at it (have some trouble in understanding how approximate works). If you help me out, I would really appreciate it. Could you please?

@rwedge
Copy link
Contributor

rwedge commented May 21, 2020

Sure, @rightx2, I'll try to explain.

The gist of what approximate does:

  • Divide the cutoff time dataframe into buckets based on the approximate window size
  • For an instance in a bucket, don't calculate it's aggregation features using it's cutoff time
  • Instead, calculate the aggregation features for EVERY instance in the bucket using the oldest time point in the bucket
  • This means any data between the oldest time point in the bucket and the actual cutoff time for an instance won't be included in the aggregation feature values
  • The idea is if there is enough older data available for instances, losing the data within the bucket won't change the aggregation values too much (not always the case)
  • This can save computational time by only needing to perform the filtering necessary to calculatie aggregation features once per bucket instead of filtering at every unique cutoff time point in the bucket

So for how include_cutoff_time would impact approximate, I think it will change whether the oldest time point in the bucket is included or excluded.

I ended up writing a test case to double check my logic:

def test_approximate_dfeat_of_agg_on_target_include_cutoff_time(es):
    agg_feat = ft.Feature(es['log']['id'], parent_entity=es['sessions'], primitive=Count)
    agg_feat2 = ft.Feature(agg_feat, parent_entity=es['customers'], primitive=Sum)
    dfeat = DirectFeature(agg_feat2, es['sessions'])
    cutoff_time = pd.DataFrame({'time': [datetime(2011, 4, 9, 10, 31, 19)], 'instance_id': [0]})
    feature_matrix = calculate_feature_matrix([dfeat, agg_feat],
                                              es,
                                              approximate=Timedelta(20, 's'),
                                              cutoff_time=cutoff_time,
                                              include_cutoff_time=False)
    # log event 5 excluded due to approximate cutoff time point
    assert feature_matrix[dfeat.get_name()].tolist() == [5]
    assert feature_matrix[agg_feat.get_name()].tolist() == [5]

    feature_matrix = calculate_feature_matrix([dfeat, agg_feat],
                                              es,
                                              approximate=Timedelta(20, 's'),
                                              cutoff_time=cutoff_time,
                                              include_cutoff_time=True)
    # log event 5 included due to approximate cutoff time point
    assert feature_matrix[dfeat.get_name()].tolist() == [6]
    assert feature_matrix[agg_feat.get_name()].tolist() == [5]

Based on this test it looks like approximate is working with include_cutoff_time as I expected

@rightx2
Copy link
Contributor Author

rightx2 commented May 21, 2020

@rwedge Seeing your explanation and example, I think I was almost there except understanding DriectFeature(I was struggling with test_approximate_multiple_instances_per_cutoff_time to understand it and I'm still). I should have spent more time. Anyway, thanks for your help!. I've added a piece of comment so that it can help someone like me in the future. Anything else to do to be merged?

@rightx2
Copy link
Contributor Author

rightx2 commented May 21, 2020

I've solved the conflict issue in changelog.rst with master branch. Should I move the "Breaking change" part from v0.14.0 Apr 30, 2020 to the "Future Release"?

@rwedge
Copy link
Contributor

rwedge commented May 21, 2020

We don't need to move the the Breaking Change part. This PR will add new functionality, if users don't use include_cutoff_time=False their experience will not change.

I like the extra comments in the test case, these can be hard to puzzle out what they are doing.

docs/source/changelog.rst Outdated Show resolved Hide resolved
rwedge
rwedge previously approved these changes May 21, 2020
Copy link
Contributor

@rwedge rwedge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

@jeff-hernandez jeff-hernandez self-requested a review May 22, 2020 15:24
jeff-hernandez
jeff-hernandez previously approved these changes May 22, 2020
Copy link
Contributor

@jeff-hernandez jeff-hernandez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good and ready to merge once the conflicts are resolved. Thank you for the contribution @rightx2!

@rwedge rwedge dismissed stale reviews from jeff-hernandez and themself via 7aa6ffc May 22, 2020 15:45
@rwedge rwedge merged commit a71d57b into alteryx:master May 22, 2020
@rwedge rwedge mentioned this pull request May 29, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants