Lookahead(cheating) issue #918

Closed
rightx2 opened this issue Apr 19, 2020 · 0 comments · Fixed by #930

Comments

@rightx2
Contributor

rightx2 commented Apr 19, 2020

Duplicate counting issue

Bug/Feature Request Description

I will quote my Stack Overflow question and its answer: https://stackoverflow.com/a/61261300/3595632.
The problem is that a feature generated by dfs() counts a value that the label in the cutoff_time DataFrame has already counted. When Jeff calculated lt, the label for cutoff_time = "2012-03-01" (store 0) already included the store 0 amount dated "2012-03-01", and I think that makes sense. So the one-month aggregation period for the label is 2012-03-01 through 2012-03-31, that is, March 2012.
Now look at the SUM(sales.amount) value for store_id = 0 at cutoff time 2012-03-01. The value is 44, obtained by adding up the amounts on 2012-02-05, 2012-02-10, and 2012-03-01. The 2012-03-01 amount is counted a second time here, which is a form of cheating (look-ahead bias).

2012-03-01 is in March, not February. We must predict the one-month total for March using ONLY February information.

Expected Output

I think the SUM(sales.amount) value for store_id = 0 at cutoff time 2012-03-01 should be 16 + 15 = 31, excluding the 2012-03-01 amount.
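
To make the arithmetic concrete, here is a minimal plain-Python sketch. It is not featuretools code and does not claim to show how dfs() is implemented. It only contrasts an inclusive date comparison with a strict one for store 0. The helper name sum_amount is hypothetical. Which February date carries 16 and which carries 15 is an assumption, and the 2012-03-01 amount of 13 is inferred from 44 - 31.

```python
from datetime import date

# (store_id, sale date, amount) rows for store 0.
# Assumption: the 16/15 split across the two February dates is illustrative;
# the 2012-03-01 amount (13) is inferred from 44 - 31.
sales = [
    (0, date(2012, 2, 5), 16),
    (0, date(2012, 2, 10), 15),
    (0, date(2012, 3, 1), 13),
]

cutoff = date(2012, 3, 1)


def sum_amount(rows, store_id, cutoff, inclusive):
    """Hypothetical helper: sum amounts for one store up to the cutoff date.

    inclusive=True keeps rows dated on the cutoff itself (date <= cutoff);
    inclusive=False keeps only rows strictly before it (date < cutoff).
    """
    total = 0
    for row_store, row_date, amount in rows:
        if row_store != store_id:
            continue
        if row_date < cutoff or (inclusive and row_date == cutoff):
            total += amount
    return total


# Matches the value reported in the issue: the 2012-03-01 sale leaks in.
inclusive_total = sum_amount(sales, 0, cutoff, inclusive=True)

# Matches the expected value: only February sales are used.
exclusive_total = sum_amount(sales, 0, cutoff, inclusive=False)

print(inclusive_total, exclusive_total)
```

With these assumed amounts, the inclusive comparison gives the reported 44 and the strict comparison gives the expected 31, so the whole discrepancy comes from the single sale dated on the cutoff.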
