duplicated counting issue
Bug/Feature Request Description
I am quoting my Stack Overflow question and its answer: https://stackoverflow.com/a/61261300/3595632.

The problem is that the feature generated by `dfs()` counts a value that the `cutoff_time` DataFrame already counts. When Jeff calculated `lt`, he already included the "2012-03-01" amount of store 0 in the aggregated amount for `cutoff_time = "2012-03-01"` (store 0), and I think this makes sense. So the one-month aggregation period for that amount is 2012-03-01 ~ 2012-03-31, i.e. March 2012.

Now look at the `SUM(sales.amount)` value for (`store_id = 0` & `2012-03-01` cutoff time). The value is 44, obtained by adding up the amounts of 2012-02-05, 2012-02-10, and 2012-03-01. It counts 2012-03-01 here again, which is a form of cheating and look-ahead bias.

2012-03-01 is in March, not February. We must predict the one-month total amount for March using ONLY February information.
Expected Output
I think the `SUM(sales.amount)` value for (`store_id = 0` & `2012-03-01` cutoff time) must be 16 + 15 = 31, excluding the 2012-03-01 amount.
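
The difference between the reported and the expected value can be reproduced with a minimal plain-Python sketch. It does not call Featuretools; it only mirrors the arithmetic. The helper `sum_amount` and its `include_cutoff` flag are hypothetical names for illustration, the pairing of 16 and 15 with the two February dates is an assumption, and the 2012-03-01 amount of 13 is inferred from 44 - 31.

```python
from datetime import date

# Store 0 sales from the example above.
# Assumption: which February date carries 16 and which carries 15.
# Inferred: the 2012-03-01 amount is 44 - (16 + 15) = 13.
sales = [
    (date(2012, 2, 5), 16),
    (date(2012, 2, 10), 15),
    (date(2012, 3, 1), 13),
]
cutoff = date(2012, 3, 1)


def sum_amount(rows, cutoff, include_cutoff):
    """Sum amounts dated before the cutoff, optionally including the cutoff day."""
    return sum(
        amount
        for day, amount in rows
        if day < cutoff or (include_cutoff and day == cutoff)
    )


inclusive = sum_amount(sales, cutoff, include_cutoff=True)   # value reported in this issue
exclusive = sum_amount(sales, cutoff, include_cutoff=False)  # value expected in this issue
print(inclusive, exclusive)  # 44 31
```

With the cutoff day included, the feature overlaps with the March amount that is being predicted; with it excluded, the feature uses February data only.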