**Pandas Aggregate’s cousin - Transform:**

https://towardsdatascience.com/pandas-transform-more-than-meets-the-eye-928542b40b56

**Lag Features & Rolling Window Statistics:**

https://machinelearningmastery.com/basic-feature-engineering-time-series-data-python/

These notes describe the lag and rolling-window features used below. I found the idea of using lags and rolling windows fascinating and wanted to build some "intuition" for it. So here are the notes:

**First off, what each feature does mathematically (indices are 0-based)**

*   lag_7: sales shifted 7 steps down within each group (each `id`). That is why the first value appears at index 7.

*   lag_28: sales shifted 28 steps down within each group. That is why the first value appears at index 28.

*   rmean_7_7: rolling mean with a window of 7 over column *lag_7*. The first value (0.2857) appears at index 13, because a window that contains any NaN yields NaN, and index 13 is the first row whose window (indices 7-13) holds 7 non-NaN values.

*   rmean_7_28: rolling mean with a window of 28 over column *lag_7*. The first value (0.357) appears at index 34, because that is the first time the window holds 28 non-NaN values.

*   rmean_28_7: rolling mean with a window of 7 over column *lag_28*. The first value (0.2857) appears at index 34, because that is the first time the window holds 7 non-NaN values.

*   rmean_28_28: rolling mean with a window of 28 over column *lag_28*. The first value appears at index 55, because that is the first time the window holds 28 non-NaN values.
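The first-valid indices listed above can be checked on a toy series. This is a minimal sketch, assuming a single made-up `id` (`item_A`) with 60 days of random sales; the column names follow the notes, everything else is hypothetical.

```python
import numpy as np
import pandas as pd

# Toy data: one id, 60 days of made-up sales (hypothetical values)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "id": "item_A",
    "d": np.arange(60),  # 0-based day index, as in the notes above
    "sales": rng.integers(0, 3, size=60).astype(float),
})

# Lags: shift sales down by 7 and 28 rows within each id
for lag in (7, 28):
    df[f"lag_{lag}"] = df.groupby("id")["sales"].shift(lag)

# Rolling means over the lagged columns (default min_periods == window,
# so any NaN inside the window makes the result NaN)
for lag in (7, 28):
    for window in (7, 28):
        df[f"rmean_{lag}_{window}"] = (
            df.groupby("id")[f"lag_{lag}"]
              .transform(lambda s: s.rolling(window).mean())
        )

# Index of the first non-NaN value in each feature column
feature_cols = [c for c in df.columns if c.startswith(("lag_", "rmean_"))]
first_valid = {c: int(df[c].first_valid_index()) for c in feature_cols}
print(first_valid)
```

In general the first valid index is `lag + window - 1`, which is why rmean_7_28 and rmean_28_7 both start at index 34.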

**The intuition, as far as I can understand, is the following (one item per feature, in the order above):**

1. *lag_7*: captures the week-on-week similarity, looking back only one week. In other words, people are likely to shop this Monday much as they did last Monday (unless it is some special occasion).

2. *lag_28*: captures the weekly similarity from a month-to-month perspective. Example: people shop more on the first weekend of a month, so that weekend looks more like the first weekends of other months than like the previous weekend. (Though 28 is arguable here, since a month is generally 30 days. A variable window that depends on when the comparison week starts would be interesting, but edge cases such as a week split across two months would be tricky.)

**Since individual data points are prone to erratic spikes or troughs, mean provides a more "representative" picture.**

3. *rmean_7_7*: captures the sales of the whole week ending 7 days in the past, i.e. if we are at day 14, the average covers days 1-7, NOT days 8-14. This describes the whole week rather than a single day's sales like lag_7, putting the lag_7 value into a "better weekly context".

4. *rmean_7_28*: captures the sales of the entire 4 weeks ending 7 days in the past, i.e. if we are at day 35, the average covers days 1-28.

5. *rmean_28_7*: captures the sales of the whole week ending 4 weeks ago, i.e. if we are at day 35, the average covers days 1-7. Assuming for simplicity that a month is 28 days, this gives not just a month-to-month comparison of the same day (day 7 of month one vs day 7 of month two) but the entire week leading up to day 7. Again, I believe the idea is to capture the whole week rather than a single day's sales like lag_28, putting the lag_28 value into a "better weekly context".

6. *rmean_28_28*: captures the sales of the entire 4 weeks ending 4 weeks in the past, i.e. if we are at day 56, the average covers days 1-28. Assuming again that a month is 28 days, the idea is to put the point value of lag_28 (day 28 when we are at day 56) into a "better monthly context".
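The day ranges quoted in the list can be verified numerically. The sketch below assumes a hypothetical series in which sales on day k equal k (1-based days), and a small helper `rmean` that is not part of the notebook; it only mirrors the shift-then-roll pattern.

```python
import numpy as np
import pandas as pd

# Hypothetical series in which sales on day k equal k, so each window is easy to read off
days = np.arange(1, 57)  # 1-based days 1..56
sales = pd.Series(days.astype(float), index=days)

def rmean(series, lag, window):
    """Rolling mean over `window` days of the series shifted by `lag` days."""
    return series.shift(lag).rolling(window).mean()

checks = {
    # (current day, lag, window): (first day, last day) the average should cover
    (14, 7, 7): (1, 7),
    (35, 7, 28): (1, 28),
    (35, 28, 7): (1, 7),
    (56, 28, 28): (1, 28),
}

results = {}
for (day, lag, window), (start, end) in checks.items():
    feature_value = rmean(sales, lag, window).loc[day]
    expected = sales.loc[start:end].mean()  # label-based slice, both ends inclusive
    results[(day, lag, window)] = (feature_value, expected)
    print(f"day {day}, rmean_{lag}_{window}: {feature_value} "
          f"(mean of days {start}-{end}: {expected})")
```

Each feature value matches the plain mean of the stated day range, which confirms that the window ends `lag` days before the current day.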

How would you "talk" about these features?

Hey, let's see how the sales were last Friday compared to this Friday.
Hey, let's see how the sales were on the first weekend of last month compared to the first weekend of this month.
Maybe comparing last Saturday to this Saturday is too specific. Same-day week-on-week trends are more likely to be similar if the prior week went similarly too. It would make sense to have not just last Saturday but also the mean of the whole week leading up to that day, to give the model a "hint" about how normal the whole week was.

In [1]:
from google.colab import drive
drive.mount('/content/gdrive')
%cd /content/gdrive/My Drive/M5-Evaluation

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).
/content/gdrive/.shortcut-targets-by-id/1IRMYDLHp5HGU8lqmV_ly3a9orQWGxVGB/M5-Evaluation


In [2]:
# General imports
import numpy as np
import pandas as pd
import os, sys, gc, time, warnings, pickle, psutil, random

warnings.filterwarnings('ignore')

In [3]:
## Simple "Memory profilers" to see memory usage
def get_memory_usage():
    return np.round(psutil.Process(os.getpid()).memory_info()[0]/2.**30, 2) 
        
def sizeof_fmt(num, suffix='B'):
    for unit in ['','Ki','Mi','Gi','Ti','Pi','Ei','Zi']:
        if abs(num) < 1024.0:
            return "%3.1f%s%s" % (num, unit, suffix)
        num /= 1024.0
    return "%.1f%s%s" % (num, 'Yi', suffix)

In [4]:
########################### Vars
#################################################################################
TARGET = 'sales'         # Our main target
END_TRAIN = 1941         # Last day in train set
MAIN_INDEX = ['id','d']  # We can identify item by these columns

DATA_PATH = './input/m5-forecasting-accuracy/'
OUTPUT_PATH = './output/KY/'

In [5]:
########################### Load Data
#################################################################################
print('Load Main Data')

# We will need only train dataset
# to show lags concept
train_df = pd.read_csv(f"{DATA_PATH}/sales_train_evaluation.csv")

# To make all calculations faster
# we will limit dataset by 'CA' state
train_df = train_df[train_df['state_id']=='CA']

Load Main Data


In [6]:
########################### Data Representation
#################################################################################

# Let's check our shape
print('Shape', train_df.shape)

Shape (12196, 1947)


In [7]:
## Horizontal representation

# If we feed this data directly to a model,
# our label will be the values in the last day column
# ('d_1941' in this file; 'd_1913' in the validation file)
# and all other columns will be our "features"

# In lag terminology, all earlier day columns (d_1->d_1940 here)
# are our lag features 
# (target values in previous time periods)

# The good thing is that we have a lot of features here
# The bad thing is that we have just 12196 "train rows"
# Note: from here on all numbers are limited to the 'CA' state
train_df.iloc[:10]

Unnamed: 0,id,item_id,dept_id,cat_id,store_id,state_id,d_1,d_2,d_3,d_4,d_5,d_6,d_7,d_8,d_9,d_10,d_11,d_12,d_13,d_14,d_15,d_16,d_17,d_18,d_19,d_20,d_21,d_22,d_23,d_24,d_25,d_26,d_27,d_28,d_29,d_30,d_31,d_32,d_33,d_34,...,d_1902,d_1903,d_1904,d_1905,d_1906,d_1907,d_1908,d_1909,d_1910,d_1911,d_1912,d_1913,d_1914,d_1915,d_1916,d_1917,d_1918,d_1919,d_1920,d_1921,d_1922,d_1923,d_1924,d_1925,d_1926,d_1927,d_1928,d_1929,d_1930,d_1931,d_1932,d_1933,d_1934,d_1935,d_1936,d_1937,d_1938,d_1939,d_1940,d_1941
0,HOBBIES_1_001_CA_1_evaluation,HOBBIES_1_001,HOBBIES_1,HOBBIES,CA_1,CA,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,1,1,3,0,1,1,1,3,0,1,1,0,0,0,2,0,3,5,0,0,1,1,0,2,1,2,2,1,0,2,4,0,0,0,0,3,3,0,1
1,HOBBIES_1_002_CA_1_evaluation,HOBBIES_1_002,HOBBIES_1,HOBBIES,CA_1,CA,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,2,1,1,0,0,0,0,0
2,HOBBIES_1_003_CA_1_evaluation,HOBBIES_1_003,HOBBIES_1,HOBBIES,CA_1,CA,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,1,2,2,1,2,1,1,1,0,1,1,1,0,0,1,1,0,2,1,0,0,0,0,2,1,3,0,0,1,0,1,0,2,0,0,0,2,3,0,1
3,HOBBIES_1_004_CA_1_evaluation,HOBBIES_1_004,HOBBIES_1,HOBBIES,CA_1,CA,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,3,0,1,0,5,4,1,0,1,3,7,2,0,0,1,2,4,1,6,4,0,0,0,2,2,4,2,1,1,1,1,1,0,4,0,1,3,0,2,6
4,HOBBIES_1_005_CA_1_evaluation,HOBBIES_1_005,HOBBIES_1,HOBBIES,CA_1,CA,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,1,1,2,1,1,0,1,1,2,2,2,4,1,0,2,3,1,0,3,2,3,1,1,3,2,3,2,2,2,2,0,0,0,2,1,0,0,2,1,0
5,HOBBIES_1_006_CA_1_evaluation,HOBBIES_1_006,HOBBIES_1,HOBBIES,CA_1,CA,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,1,0,0,0,2,0,0,0,0,1,0,0,2,4,1,0,0,0,2,1,0,0,3,0,5,2,1,0,0,1,0,0,5,2,0
6,HOBBIES_1_007_CA_1_evaluation,HOBBIES_1_007,HOBBIES_1,HOBBIES,CA_1,CA,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,1,0,0,1,1,0,0,0,0,1,0,1,0,1,0,0,0,1,0,0,0,0,0,0,1,0,0,0,1,0,1,1,0
7,HOBBIES_1_008_CA_1_evaluation,HOBBIES_1_008,HOBBIES_1,HOBBIES,CA_1,CA,12,15,0,0,0,4,6,5,7,0,13,2,2,20,9,3,0,1,0,0,0,10,4,4,2,6,2,0,0,0,0,0,0,0,...,8,4,0,0,1,37,3,4,6,3,2,1,19,3,2,8,8,23,26,9,4,8,5,8,5,0,3,1,18,47,7,0,6,0,15,5,4,1,40,32
8,HOBBIES_1_009_CA_1_evaluation,HOBBIES_1_009,HOBBIES_1,HOBBIES,CA_1,CA,2,0,7,3,0,2,3,9,0,0,4,3,1,0,4,4,1,2,0,1,1,5,2,2,2,3,0,0,8,0,0,0,0,0,...,0,0,0,0,1,1,6,0,0,0,0,0,0,2,6,0,0,0,0,0,0,0,8,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0
9,HOBBIES_1_010_CA_1_evaluation,HOBBIES_1_010,HOBBIES_1,HOBBIES,CA_1,CA,0,0,1,0,0,0,0,0,0,0,0,0,0,2,0,0,0,0,2,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,2,0,2,0,2,1,0,0,0,0,0,0,0,0,2,0,2,1,0,0,1,0,0,1,0,2,1,1,0,0,1


In [8]:
## Vertical representation

# On the other hand, we can treat the d_ columns
# as additional labels and significantly 
# scale up our training set to 23672436 rows
# (12196 ids x 1941 days)

# The good thing is that our model will have 
# more data for training
# The bad thing is that we lose the lags we had
# in the horizontal representation, and
# the new dataset also consumes much more memory

index_columns = ['id','item_id','dept_id','cat_id','store_id','state_id']
train_df = pd.melt(train_df, 
                  id_vars = index_columns, 
                  var_name = 'd', 
                  value_name = TARGET)

train_df[train_df['id']=='HOBBIES_1_001_CA_1_evaluation'].iloc[:10]

Unnamed: 0,id,item_id,dept_id,cat_id,store_id,state_id,d,sales
0,HOBBIES_1_001_CA_1_evaluation,HOBBIES_1_001,HOBBIES_1,HOBBIES,CA_1,CA,d_1,0
12196,HOBBIES_1_001_CA_1_evaluation,HOBBIES_1_001,HOBBIES_1,HOBBIES,CA_1,CA,d_2,0
24392,HOBBIES_1_001_CA_1_evaluation,HOBBIES_1_001,HOBBIES_1,HOBBIES,CA_1,CA,d_3,0
36588,HOBBIES_1_001_CA_1_evaluation,HOBBIES_1_001,HOBBIES_1,HOBBIES,CA_1,CA,d_4,0
48784,HOBBIES_1_001_CA_1_evaluation,HOBBIES_1_001,HOBBIES_1,HOBBIES,CA_1,CA,d_5,0
60980,HOBBIES_1_001_CA_1_evaluation,HOBBIES_1_001,HOBBIES_1,HOBBIES,CA_1,CA,d_6,0
73176,HOBBIES_1_001_CA_1_evaluation,HOBBIES_1_001,HOBBIES_1,HOBBIES,CA_1,CA,d_7,0
85372,HOBBIES_1_001_CA_1_evaluation,HOBBIES_1_001,HOBBIES_1,HOBBIES,CA_1,CA,d_8,0
97568,HOBBIES_1_001_CA_1_evaluation,HOBBIES_1_001,HOBBIES_1,HOBBIES,CA_1,CA,d_9,0
109764,HOBBIES_1_001_CA_1_evaluation,HOBBIES_1_001,HOBBIES_1,HOBBIES,CA_1,CA,d_10,0
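As a small illustration of the reshape above, here is a sketch on a made-up wide frame with two ids and three day columns (all names and values are hypothetical). It also shows why the row index of a single id jumps by the number of ids (12196 in the output above): `pd.melt` stacks the day columns one after another.

```python
import pandas as pd

# Hypothetical wide frame: one row per id, one column per day
wide = pd.DataFrame({
    "id": ["A", "B"],
    "d_1": [0, 2],
    "d_2": [1, 0],
    "d_3": [3, 5],
})

# Wide -> long: one row per (id, day) pair
long_df = pd.melt(wide, id_vars=["id"], var_name="d", value_name="sales")
print(long_df)

# Rows of a single id are spaced len(wide) apart in the melted frame
rows_of_a = long_df[long_df["id"] == "A"]
print(rows_of_a.index.tolist())
```

Because the melted frame is ordered by day first and by id second, a plain `shift()` would mix different ids; the lag code later in the notebook groups by `id` for that reason.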


In [9]:
## Some minification
train_df['d'] = train_df['d'].apply(lambda x: x[2:]).astype(np.int16)

icols = ['id','item_id','dept_id','cat_id','store_id','state_id']
for col in icols:
    train_df[col] = train_df[col].astype('category')

In [10]:
########################### Lags creation
#################################################################################

# We have several "code" solutions here
# As our dataset is already sorted by d values
# we can simply shift() values
# also we have to keep in mind that 
# we need to aggregate values on 'id' level

# group and shift in loop
temp_df = train_df[['id','d',TARGET]].copy()  # copy so that adding columns does not trigger SettingWithCopyWarning

start_time = time.time()
for i in range(1,8):
    print('Shifting:', i)
    temp_df['lag_'+str(i)] = temp_df.groupby(['id'])[TARGET].shift(i)
    
print('%0.2f min: Time for loops' % ((time.time() - start_time) / 60))

Shifting: 1
Shifting: 2
Shifting: 3
Shifting: 4
Shifting: 5
Shifting: 6
Shifting: 7
0.11 min: Time for loops



```
# Or the same in a "compact" manner
LAG_DAYS = list(range(1,8))
temp_df = train_df[['id','d',TARGET]]

start_time = time.time()
temp_df = temp_df.assign(**{
        '{}_lag_{}'.format(col, l): temp_df.groupby(['id'])[col].transform(lambda x: x.shift(l))
        for l in LAG_DAYS
        for col in [TARGET]
    })

print('%0.2f min: Time for bulk shift' % ((time.time() - start_time) / 60))
```



In [11]:
# The result
temp_df[temp_df['id']=='HOBBIES_1_001_CA_1_evaluation'].iloc[:10]

# You will notice many NaN values - that is normal,
# because there is no data for days 0, -1, -2
# (outside the dataset's time range)

# The same applies to the test set,
# so be careful when building lag features.
# The example below uses 1913 as the last day with a label
# (the validation file; the same logic holds for END_TRAIN):
# for day 1920 there is no data about day 1919,
# so if you want to predict day 1915 your 
# lag features have to start from 2 
# (1915 (day to predict) - 1913 (last day with a label))
# and so on.

# There are a few options for handling 
# NaNs in the train set
## 1. Drop them: train_df[train_df['d']>MAX_LAG_DAY] 
## 1.1 In our case we already dropped some rows by release date,
##     so you have to find d.min() for each id
##     and drop train_df[train_df['d']>(train_df['d_min']+MAX_LAG_DAY)] 
## 2. If you want to keep them, you can 
##    fill with '-1' to be able to convert to int
## 3. Leave them as they are
## 4. Fill with the mean -> not recommended

Unnamed: 0,id,d,sales,lag_1,lag_2,lag_3,lag_4,lag_5,lag_6,lag_7
0,HOBBIES_1_001_CA_1_evaluation,1,0,,,,,,,
12196,HOBBIES_1_001_CA_1_evaluation,2,0,0.0,,,,,,
24392,HOBBIES_1_001_CA_1_evaluation,3,0,0.0,0.0,,,,,
36588,HOBBIES_1_001_CA_1_evaluation,4,0,0.0,0.0,0.0,,,,
48784,HOBBIES_1_001_CA_1_evaluation,5,0,0.0,0.0,0.0,0.0,,,
60980,HOBBIES_1_001_CA_1_evaluation,6,0,0.0,0.0,0.0,0.0,0.0,,
73176,HOBBIES_1_001_CA_1_evaluation,7,0,0.0,0.0,0.0,0.0,0.0,0.0,
85372,HOBBIES_1_001_CA_1_evaluation,8,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
97568,HOBBIES_1_001_CA_1_evaluation,9,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
109764,HOBBIES_1_001_CA_1_evaluation,10,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
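The rule in the comments above (the smallest usable lag equals the day to predict minus the last day with a label) can be sketched on a toy frame. The constants `LAST_KNOWN_DAY` and `HORIZON` and the data below are hypothetical, chosen only to keep the example small.

```python
import numpy as np
import pandas as pd

LAST_KNOWN_DAY = 10  # hypothetical last day with a label
HORIZON = 3          # hypothetical number of days to forecast

# Days 1..10 have sales; days 11..13 are the rows to predict (sales unknown)
df = pd.DataFrame({
    "id": "item_A",
    "d": np.arange(1, LAST_KNOWN_DAY + HORIZON + 1),
    "sales": [float(v) for v in range(1, LAST_KNOWN_DAY + 1)] + [np.nan] * HORIZON,
})

# A lag shorter than the horizon versus a lag equal to the horizon
for lag in (1, HORIZON):
    df[f"lag_{lag}"] = df.groupby("id")["sales"].shift(lag)

# Only the rows we actually need to predict
future = df[df["d"] > LAST_KNOWN_DAY]
print(future[["d", "lag_1", f"lag_{HORIZON}"]])
```

With `lag_1`, only the first future day has a value; the remaining future days would need sales that are not known yet. A lag at least as long as the horizon is filled for every future row.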


In [12]:
########################### Rolling lags
#################################################################################

# We restored some daily sales values from the horizontal representation
# as lag features, but only a few of them (the last 7 days or fewer)
# Because of memory limits we can't have many lag features
# How can we get additional information from other days?

## Rolling aggregations

temp_df = train_df[['id','d','sales']].copy()  # copy so that adding columns does not trigger SettingWithCopyWarning

start_time = time.time()

for i in [14,30,60]:
    print('Rolling period:', i)
    # transform keeps the original row index, so the result aligns with temp_df
    temp_df['rolling_mean_'+str(i)] = temp_df.groupby(['id'])[TARGET].transform(lambda x: x.shift(1).rolling(i).mean())
    temp_df['rolling_std_'+str(i)]  = temp_df.groupby(['id'])[TARGET].transform(lambda x: x.shift(1).rolling(i).std())

# lambda x: x.shift(1)
# 1 day shift will serve only to predict day 1914
# for other days you have to shift PREDICT_DAY-1913

# Such aggregations will help us to restore
# at least part of the information for our model
# and out of 14+30+60->104 columns we can have just 6
# with valuable information (hope it is sufficient)
# you can also aggregate by max/skew/median etc 
# also you can try other rolling periods 180,365 etc
print('%0.2f min: Time for loop' % ((time.time() - start_time) / 60))


Rolling period: 14
Rolling period: 30
Rolling period: 60
1.67 min: Time for loop
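To make the shift-then-roll pattern concrete, here is a minimal sketch on two made-up ids with five days each (hypothetical data). It uses `groupby(...).transform(...)`, which returns a result aligned with the original row index; treat that choice as an assumption of this sketch.

```python
import pandas as pd

# Two hypothetical ids, five days each
df = pd.DataFrame({
    "id": ["A"] * 5 + ["B"] * 5,
    "d": list(range(1, 6)) * 2,
    "sales": [1, 2, 3, 4, 5, 10, 20, 30, 40, 50],
})

# shift(1) drops the current day from the window,
# rolling(3).mean() then averages the three days before it, per id
df["rolling_mean_3"] = (
    df.groupby("id")["sales"]
      .transform(lambda x: x.shift(1).rolling(3).mean())
)
print(df)
```

For id A, day 4 gets the mean of days 1-3 (2.0), and the first three rows of each id stay NaN. Grouping by `id` keeps the first rows of B from being filled with values from the end of A.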


In [13]:
# The result
temp_df[temp_df['id']=='HOBBIES_1_002_CA_1_evaluation'].iloc[:20]

# NaN values are normal here too:
# with shift(1).rolling(period), the first `period` rows of each id
# do not yet have a full window of past values

Unnamed: 0,id,d,sales,rolling_mean_14,rolling_std_14,rolling_mean_30,rolling_std_30,rolling_mean_60,rolling_std_60
1,HOBBIES_1_002_CA_1_evaluation,1,0,,,,,,
12197,HOBBIES_1_002_CA_1_evaluation,2,0,,,,,,
24393,HOBBIES_1_002_CA_1_evaluation,3,0,,,,,,
36589,HOBBIES_1_002_CA_1_evaluation,4,0,,,,,,
48785,HOBBIES_1_002_CA_1_evaluation,5,0,,,,,,
60981,HOBBIES_1_002_CA_1_evaluation,6,0,,,,,,
73177,HOBBIES_1_002_CA_1_evaluation,7,0,,,,,,
85373,HOBBIES_1_002_CA_1_evaluation,8,0,,,,,,
97569,HOBBIES_1_002_CA_1_evaluation,9,0,,,,,,
109765,HOBBIES_1_002_CA_1_evaluation,10,0,,,,,,


In [14]:
########################### Memory usage
#################################################################################
# Let's check our memory usage
print("{:>20}: {:>8}".format('Original rolling df',sizeof_fmt(temp_df.memory_usage(index=True).sum())))

# Can we minify it?
# 1. If our datasets are aligned by index,
#    we don't need the 'id', 'd' and 'sales' columns
temp_df = temp_df.iloc[:,3:]
print("{:>20}: {:>8}".format('Values rolling df',sizeof_fmt(temp_df.memory_usage(index=True).sum())))

# Can we make it even smaller?
# Carefully change the dtype and/or
# use a sparse matrix to minify the 0s
# Also note that lgbm accepts matrices as input,
# which is good for memory reduction 
from scipy import sparse 
temp_matrix = sparse.csr_matrix(temp_df)

# restore to df
temp_matrix_restored = pd.DataFrame(temp_matrix.todense())
restored_cols = ['roll_' + str(i) for i in list(temp_matrix_restored)]
temp_matrix_restored.columns = restored_cols

 Original rolling df:   1.3GiB
   Values rolling df:   1.1GiB


In [15]:
########################### Remove old objects
#################################################################################
del temp_df, train_df, temp_matrix, temp_matrix_restored

In [16]:
########################### Apply on grid_df
#################################################################################
# lets read grid from 
# https://www.kaggle.com/kyakovlev/m5-simple-fe
# to be sure that our grids are aligned by index
grid_df = pd.read_pickle(f"{OUTPUT_PATH}/grid_part_1.pkl")

# We need only 'id','d','sales'
# to make lags and rollings
grid_df = grid_df[['id','d','sales']]
SHIFT_DAY = 28

# Lags
# with 28 day shift
start_time = time.time()
print('Create lags')

LAG_DAYS = [col for col in range(SHIFT_DAY,SHIFT_DAY+15)]
grid_df = grid_df.assign(**{
        '{}_lag_{}'.format(col, l): grid_df.groupby(['id'])[col].transform(lambda x: x.shift(l))
        for l in LAG_DAYS
        for col in [TARGET]
    })

# Minify lag columns
for col in list(grid_df):
    if 'lag' in col:
        grid_df[col] = grid_df[col].astype(np.float16)

print('%0.2f min: Lags' % ((time.time() - start_time) / 60))

# Rollings
# with 28 day shift
start_time = time.time()
print('Create rolling aggs')
for i in [7,14,30,60,180]:
    print('Rolling period:', i)
    # transform keeps the original row index, so the result aligns with grid_df
    grid_df['rolling_mean_'+str(i)] = grid_df.groupby(['id'])[TARGET].transform(lambda x: x.shift(SHIFT_DAY).rolling(i).mean()).astype(np.float16)
    grid_df['rolling_std_'+str(i)]  = grid_df.groupby(['id'])[TARGET].transform(lambda x: x.shift(SHIFT_DAY).rolling(i).std()).astype(np.float16)

# Rollings
# with sliding shift
for d_shift in [1,7,14]: 
    print('Shifting period:', d_shift)
    for d_window in [7,14,30,60]:
        col_name = 'rolling_mean_tmp_'+str(d_shift)+'_'+str(d_window)
        grid_df[col_name] = grid_df.groupby(['id'])[TARGET].transform(lambda x: x.shift(d_shift).rolling(d_window).mean()).astype(np.float16)
    
print('%0.2f min: Lags' % ((time.time() - start_time) / 60))  # the label says 'Lags', but this times the rolling aggregations

Create lags
8.16 min: Lags
Create rolling aggs
Rolling period: 7
Rolling period: 14
Rolling period: 30
Rolling period: 60
Rolling period: 180
Shifting period: 1
Shifting period: 7
Shifting period: 14
16.07 min: Lags


In [17]:
########################### Export
#################################################################################
print('Save lags and rollings')
FILE_NAME = 'lags_df_'+str(SHIFT_DAY)+'.pkl'
grid_df.to_pickle(f"{OUTPUT_PATH}/{FILE_NAME}")

Save lags and rollings


In [18]:
########################### Final Output
#################################################################################
pd.set_option('display.max_rows', None)
grid_df[grid_df['id']=='HOBBIES_1_002_CA_1_evaluation'].reset_index(drop=True)

Unnamed: 0,id,d,sales,sales_lag_28,sales_lag_29,sales_lag_30,sales_lag_31,sales_lag_32,sales_lag_33,sales_lag_34,sales_lag_35,sales_lag_36,sales_lag_37,sales_lag_38,sales_lag_39,sales_lag_40,sales_lag_41,sales_lag_42,rolling_mean_7,rolling_std_7,rolling_mean_14,rolling_std_14,rolling_mean_30,rolling_std_30,rolling_mean_60,rolling_std_60,rolling_mean_180,rolling_std_180,rolling_mean_tmp_1_7,rolling_mean_tmp_1_14,rolling_mean_tmp_1_30,rolling_mean_tmp_1_60,rolling_mean_tmp_7_7,rolling_mean_tmp_7_14,rolling_mean_tmp_7_30,rolling_mean_tmp_7_60,rolling_mean_tmp_14_7,rolling_mean_tmp_14_14,rolling_mean_tmp_14_30,rolling_mean_tmp_14_60
0,HOBBIES_1_002_CA_1_evaluation,141,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1,HOBBIES_1_002_CA_1_evaluation,142,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2,HOBBIES_1_002_CA_1_evaluation,143,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
3,HOBBIES_1_002_CA_1_evaluation,144,1.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
4,HOBBIES_1_002_CA_1_evaluation,145,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
5,HOBBIES_1_002_CA_1_evaluation,146,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
6,HOBBIES_1_002_CA_1_evaluation,147,1.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
7,HOBBIES_1_002_CA_1_evaluation,148,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,0.285645,,,,,,,,,,,
8,HOBBIES_1_002_CA_1_evaluation,149,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,0.285645,,,,,,,,,,,
9,HOBBIES_1_002_CA_1_evaluation,150,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,0.285645,,,,,,,,,,,
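The value 0.285645 in the table is 2/7 (two sales over a 7-day window) after the cast to `float16`. The sketch below, on a made-up column of 1,000 identical values, shows both effects of that cast: a quarter of the memory of `float64`, and only about three significant decimal digits.

```python
import numpy as np
import pandas as pd

# Hypothetical feature column: 1,000 rows, each holding the mean 2/7
col64 = pd.Series(np.full(1_000, 2 / 7), dtype="float64")
col16 = col64.astype(np.float16)

# Bytes used by the values only (8 bytes vs 2 bytes per element)
bytes64 = col64.memory_usage(index=False)
bytes16 = col16.memory_usage(index=False)
print(bytes64, bytes16)

# The stored value is the nearest float16 to 2/7
stored = float(col16.iloc[0])
print(2 / 7, stored)
```

The trade-off is precision and range: `float16` cannot represent values above 65504, so it suits small counts and means like these but should be applied with care to larger quantities.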


In [19]:
########################### Final list of new features
#################################################################################
grid_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 47735397 entries, 0 to 47735396
Data columns (total 40 columns):
 #   Column                  Dtype   
---  ------                  -----   
 0   id                      category
 1   d                       int16   
 2   sales                   float64 
 3   sales_lag_28            float16 
 4   sales_lag_29            float16 
 5   sales_lag_30            float16 
 6   sales_lag_31            float16 
 7   sales_lag_32            float16 
 8   sales_lag_33            float16 
 9   sales_lag_34            float16 
 10  sales_lag_35            float16 
 11  sales_lag_36            float16 
 12  sales_lag_37            float16 
 13  sales_lag_38            float16 
 14  sales_lag_39            float16 
 15  sales_lag_40            float16 
 16  sales_lag_41            float16 
 17  sales_lag_42            float16 
 18  rolling_mean_7          float16 
 19  rolling_std_7           float16 
 20  rolling_mean_14         float16 
 21  rollin