# Fast creation of common time series features

I created three functions that help generate time series features. The first, `seasonal_features`, extracts seasonal data from date columns. The second, `lagging_features`, creates lagging and differencing features from the target. The last, `moving_statistics_features`, generates moving statistics variables, also from the target.

The Python module containing the functions can be found [here](https://github.com/abreukuse/ml_utilities/blob/master/feature_engineering_time_series.py).
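
To make the three feature families concrete, here is a minimal plain-pandas sketch on a toy frame. It does not call the module: the toy data and the column names are assumptions that only mimic the naming seen later in this notebook, and the module's actual implementation may differ.

```python
import pandas as pd

# Toy daily series (made-up values, for illustration only).
toy = pd.DataFrame({
    'date': pd.date_range('2020-01-01', periods=8, freq='D'),
    'y': [3.0, 5.0, 4.0, 6.0, 8.0, 7.0, 9.0, 10.0],
})

# Seasonal features: calendar attributes read from the date column.
toy['date_month'] = toy['date'].dt.month
toy['date_weekday'] = toy['date'].dt.weekday  # Monday=0 ... Sunday=6

# Lagging and differencing features: past values of the target.
toy['lag_y_1'] = toy['y'].shift(1)
toy['lag_diff_y_1'] = toy['y'].shift(1).diff(1)

# Moving statistics: summaries over a window of past values.
toy['mean_y_3'] = toy['y'].shift(1).rolling(3).mean()

print(toy)
```

Every target-derived column is built from `shift(1)`, so a row only sees values that come strictly before it.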

I'm all about automating boring stuff, and creating these "traditional" time series features by hand every time is unproductive. These functions let you generate a lot of features at once and start fitting and assessing algorithms faster.

I see these functions as most useful in the stacked time series case, but they can also be applied to a single time series, although methods such as ARIMA, LSTM, or FB-Prophet are usually better suited to that situation.

In this demonstration I will show both scenarios: a single time series and several stacked time series.

The datasets will be downloaded from Kaggle using its API. The first dataset is [Electric Production](https://www.kaggle.com/shenba/time-series-datasets) and the second is from the [Rossmann Sales Competition](https://www.kaggle.com/c/rossmann-store-sales/data). You need to accept the terms of the competition in order to download the data.

In [1]:
import pandas as pd
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer

# That's the module in question.
from feature_engineering_time_series import seasonal_features, lagging_features, moving_statistics_features

import os

os.environ['KAGGLE_USERNAME'] = "username" # kaggle username from the json file
os.environ['KAGGLE_KEY'] = "api_key" # api key from the json file
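
Rather than pasting the credentials into the notebook, they can be read from the `kaggle.json` token file, which stores a `username` and a `key` field. The helper name `load_kaggle_credentials` and the file location are hypothetical; this is only a sketch of one way to do it.

```python
import json
import os
from pathlib import Path


def load_kaggle_credentials(path):
    """Copy 'username' and 'key' from a kaggle.json file into the environment."""
    creds = json.loads(Path(path).read_text())
    os.environ['KAGGLE_USERNAME'] = creds['username']
    os.environ['KAGGLE_KEY'] = creds['key']


# Hypothetical location; adjust to wherever the token file was saved.
# load_kaggle_credentials(Path.home() / '.kaggle' / 'kaggle.json')
```

This keeps the API key out of the notebook itself, which helps when the notebook is shared.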

In [2]:
!kaggle datasets download -d shenba/time-series-datasets
!unzip time-series-datasets.zip

Downloading time-series-datasets.zip to /content
  0% 0.00/19.2k [00:00<?, ?B/s]
100% 19.2k/19.2k [00:00<00:00, 26.0MB/s]
Archive:  time-series-datasets.zip
  inflating: Electric_Production.csv  
  inflating: daily-minimum-temperatures-in-me.csv  
  inflating: monthly-beer-production-in-austr.csv  
  inflating: sales-of-shampoo-over-a-three-ye.csv  


# Electric Production

In [3]:
electric = pd.read_csv('Electric_Production.csv', parse_dates=['DATE'])
electric.head()

Unnamed: 0,DATE,IPG2211A2N
0,1985-01-01,72.5052
1,1985-02-01,70.672
2,1985-03-01,62.4502
3,1985-04-01,57.4714
4,1985-05-01,55.3151


In [4]:
train = electric[electric['DATE'].dt.year < 2005].copy()
validation = electric[electric['DATE'].dt.year >= 2005].copy()

In [5]:
train.tail()

Unnamed: 0,DATE,IPG2211A2N
235,2004-08-01,100.2025
236,2004-09-01,94.024
237,2004-10-01,87.5262
238,2004-11-01,89.6144
239,2004-12-01,105.7263


In [6]:
validation.head()

Unnamed: 0,DATE,IPG2211A2N
240,2005-01-01,111.1614
241,2005-02-01,101.7795
242,2005-03-01,98.9565
243,2005-04-01,86.4776
244,2005-05-01,87.2234


Here we create the features:

In [7]:
for dataset in [train, validation]:
    seasonal_features(df = dataset, 
                      date_column = 'DATE',
                      which_ones = ['day','month','weekday','dayofyear','week'])
    
    lagging_features(df = dataset, 
                     target = 'IPG2211A2N',
                     lags=[1,3,5],
                     lags_diff=[1,3,5])
    
    moving_statistics_features(df = dataset, 
                               target = 'IPG2211A2N',
                               windows = [4,5,6],
                               which_ones = 'all',
                               delta_roll_mean=True)

In [8]:
train.tail()

Unnamed: 0,DATE,IPG2211A2N,DATE_day,DATE_month,DATE_weekday,DATE_dayofyear,DATE_week,lag_IPG2211A2N_1,lag_IPG2211A2N_3,lag_IPG2211A2N_5,lag_diff_IPG2211A2N_1,lag_diff_IPG2211A2N_3,lag_diff_IPG2211A2N_5,mean_IPG2211A2N_4,mean_IPG2211A2N_5,mean_IPG2211A2N_6,median_IPG2211A2N_4,median_IPG2211A2N_5,median_IPG2211A2N_6,std_IPG2211A2N_4,std_IPG2211A2N_5,std_IPG2211A2N_6,min_IPG2211A2N_4,min_IPG2211A2N_5,min_IPG2211A2N_6,max_IPG2211A2N_4,max_IPG2211A2N_5,max_IPG2211A2N_6,skew_IPG2211A2N_4,skew_IPG2211A2N_5,skew_IPG2211A2N_6,kurt_IPG2211A2N_4,kurt_IPG2211A2N_5,kurt_IPG2211A2N_6,sum_IPG2211A2N_4,sum_IPG2211A2N_5,sum_IPG2211A2N_6,delta_roll_mean_IPG2211A2N_4,delta_roll_mean_IPG2211A2N_5,delta_roll_mean_IPG2211A2N_6
235,2004-08-01,100.2025,1,8,6,214,31,101.7948,89.0302,95.4029,6.2903,15.0715,-4.3642,93.2632,93.69114,95.769117,92.26735,95.4029,95.4537,6.794611,5.961603,7.3716,86.7233,86.7233,86.7233,101.7948,101.7948,106.159,0.588058,0.203713,0.220241,-1.811542,-0.913523,-1.098903,373.0528,468.4557,574.6147,8.5316,8.10366,6.025683
236,2004-09-01,94.024,1,9,2,245,36,100.2025,95.5045,86.7233,-1.5923,11.1723,4.7996,96.633,94.65106,94.776367,97.8535,95.5045,95.4537,5.72892,6.652505,5.958093,89.0302,86.7233,86.7233,101.7948,101.7948,101.7948,-0.914363,-0.19752,-0.292015,-0.456732,-2.573291,-1.478591,386.532,473.2553,568.6582,3.5695,5.55144,5.426133
237,2004-10-01,87.5262,1,10,4,275,40,94.024,101.7948,89.0302,-6.1785,-1.4805,7.3007,97.88145,96.1112,94.54655,97.8535,95.5045,94.76425,3.707263,5.096741,5.955686,94.024,89.0302,86.7233,101.7948,101.7948,101.7948,0.021017,-0.353354,-0.119575,-4.378666,-0.855439,-1.545124,391.5258,480.556,567.2793,-3.85745,-2.0872,-0.52255
238,2004-11-01,89.6144,1,11,0,306,45,87.5262,100.2025,95.5045,-6.4978,-14.2686,-1.504,95.886875,95.8104,94.680367,97.11325,95.5045,94.76425,6.503829,5.635077,5.750228,87.5262,87.5262,87.5262,101.7948,101.7948,101.7948,-0.737397,-0.64085,-0.028049,-1.459143,-0.097404,-1.684166,383.5475,479.052,568.0822,-8.360675,-8.2842,-7.154167
239,2004-12-01,105.7263,1,12,2,336,49,89.6144,94.024,101.7948,2.0882,-10.5881,-5.8901,92.841775,94.63238,94.777733,91.8192,94.024,94.76425,5.605034,6.292347,5.639297,87.5262,87.5262,87.5262,100.2025,101.7948,101.7948,0.82022,0.091698,-0.023191,-0.516723,-2.594998,-1.566226,371.3671,473.1619,568.6664,-3.227375,-5.01798,-5.163333
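
The printed rows make it possible to check by hand what the target-based columns contain. The sketch below rebuilds part of the row with index 235 (2004-08-01) from the February to August 2004 values that can be read off the output above. The formulas are inferred from those printed numbers, not taken from the module's source, so treat them as an assumption.

```python
import pandas as pd

# Feb..Aug 2004 values of IPG2211A2N, read off the printed output.
y = pd.Series(
    [106.159, 95.4029, 86.7233, 89.0302, 95.5045, 101.7948, 100.2025],
    index=['Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug'],
)

past = y.shift(1)  # only values strictly before the current row

row = pd.DataFrame({
    'lag_1': past,
    'lag_diff_3': past.diff(3),
    'mean_4': past.rolling(4).mean(),
    'sum_6': past.rolling(6).sum(),
    'delta_roll_mean_4': past - past.rolling(4).mean(),
}).loc['Aug'].round(4)

print(row)
```

Rounded to four decimals, these agree with the printed `lag_IPG2211A2N_1`, `lag_diff_IPG2211A2N_3`, `mean_IPG2211A2N_4`, `sum_IPG2211A2N_6` and `delta_roll_mean_IPG2211A2N_4` for that row, which suggests that every statistic is computed on past values only and that the August value itself is never used.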


In [9]:
validation.tail()

Unnamed: 0,DATE,IPG2211A2N,DATE_day,DATE_month,DATE_weekday,DATE_dayofyear,DATE_week,lag_IPG2211A2N_1,lag_IPG2211A2N_3,lag_IPG2211A2N_5,lag_diff_IPG2211A2N_1,lag_diff_IPG2211A2N_3,lag_diff_IPG2211A2N_5,mean_IPG2211A2N_4,mean_IPG2211A2N_5,mean_IPG2211A2N_6,median_IPG2211A2N_4,median_IPG2211A2N_5,median_IPG2211A2N_6,std_IPG2211A2N_4,std_IPG2211A2N_5,std_IPG2211A2N_6,min_IPG2211A2N_4,min_IPG2211A2N_5,min_IPG2211A2N_6,max_IPG2211A2N_4,max_IPG2211A2N_5,max_IPG2211A2N_6,skew_IPG2211A2N_4,skew_IPG2211A2N_5,skew_IPG2211A2N_6,kurt_IPG2211A2N_4,kurt_IPG2211A2N_5,kurt_IPG2211A2N_6,sum_IPG2211A2N_4,sum_IPG2211A2N_5,sum_IPG2211A2N_6,delta_roll_mean_IPG2211A2N_4,delta_roll_mean_IPG2211A2N_5,delta_roll_mean_IPG2211A2N_6
392,2017-09-01,98.6154,1,9,4,244,35,108.9312,102.1532,88.353,-3.2226,16.8507,7.8916,103.829675,100.73434,100.785217,105.5422,102.1532,101.5964,8.872623,10.34157,9.250621,92.0805,88.353,88.353,112.1538,112.1538,112.1538,-0.875275,-0.189635,-0.215503,-0.299591,-2.466628,-1.399629,415.3187,503.6717,604.7113,5.101525,8.19686,8.145983
393,2017-10-01,93.6137,1,10,6,274,39,98.6154,112.1538,92.0805,-10.3158,-3.5378,10.2624,105.4634,102.78682,100.381183,105.5422,102.1532,100.3843,6.181683,8.029963,9.290144,98.6154,92.0805,88.353,112.1538,112.1538,112.1538,-0.045817,-0.198141,-0.020737,-3.302152,-1.215741,-1.527714,421.8536,513.9341,602.2871,-6.848,-4.17142,-1.765783
394,2017-11-01,97.3359,1,11,2,305,44,93.6137,108.9312,102.1532,-5.0017,-18.5401,1.5332,103.328525,103.09346,101.257967,103.7733,102.1532,100.3843,8.67692,7.532794,8.099914,93.6137,93.6137,92.0805,112.1538,112.1538,112.1538,-0.161576,0.000166,0.286735,-3.825299,-1.64127,-1.677268,413.3141,515.4673,607.5478,-9.714825,-9.47976,-7.644267
395,2017-12-01,114.7212,1,12,4,335,48,97.3359,98.6154,112.1538,3.7222,-11.5953,-4.8173,99.62405,102.13,102.133867,97.97565,98.6154,100.3843,6.557448,7.978033,7.135776,93.6137,93.6137,93.6137,108.9312,112.1538,112.1538,1.36205,0.451702,0.452077,2.430981,-2.391712,-1.345058,398.4962,510.65,612.8032,-2.28815,-4.7941,-4.797967
396,2018-01-01,129.4048,1,1,0,1,1,114.7212,93.6137,108.9312,17.3853,16.1058,2.5674,101.07155,102.64348,104.228533,97.97565,98.6154,103.7733,9.343808,8.822408,8.794445,93.6137,93.6137,93.6137,114.7212,114.7212,114.7212,1.684574,0.642724,0.024986,3.133802,-1.697333,-2.446739,404.2862,513.2174,625.3712,13.64965,12.07772,10.492667


In [10]:
train = train.dropna()
validation = validation.dropna()

train.shape, validation.shape

((234, 40), (151, 40))
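
The training split has 240 rows (indices 0 to 239) before `dropna` and 234 after. A small sketch shows where the six missing rows can come from, assuming the features are built on the target shifted by one step, as the printed values suggest. The series below is a placeholder with the same length as the training split, not the real data.

```python
import numpy as np
import pandas as pd

y = pd.Series(np.arange(240, dtype=float))  # placeholder target, 240 rows
past = y.shift(1)

features = pd.DataFrame({
    'lag_5': y.shift(5),               # 5 leading NaN
    'lag_diff_5': past.diff(5),        # 1 + 5 = 6 leading NaN
    'mean_6': past.rolling(6).mean(),  # 1 + (6 - 1) = 6 leading NaN
})

n_before = len(features)
n_after = len(features.dropna())
print(n_before, n_after)
```

Under that assumption, the longest differencing lag and the widest window each cost six rows at the start of a series.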

In [11]:
X_train = train.drop(columns=['DATE','IPG2211A2N'])
X_validation = validation.drop(columns=['DATE','IPG2211A2N'])

y_train = np.log(train['IPG2211A2N'])
y_validation = np.log(validation['IPG2211A2N'])
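
Because the target is log-transformed, a model trained on `y_train` predicts on the log scale. A minimal sketch of the round trip back to the original units, using the first three values of the series as stand-ins for predictions:

```python
import numpy as np

y = np.array([72.5052, 70.672, 62.4502])  # first values of IPG2211A2N

y_log = np.log(y)       # scale the model would be trained on
y_back = np.exp(y_log)  # inverse transform, back to original units

print(np.allclose(y, y_back))
```

`np.log` needs strictly positive values, which holds for this series.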

In [12]:
X_train.shape, y_train.shape

((234, 38), (234,))

In [13]:
X_validation.shape, y_validation.shape

((151, 38), (151,))

## Pipeline

In [14]:
# Creating a training pipeline

train = electric[electric['DATE'].dt.year < 2005].copy()
validation = electric[electric['DATE'].dt.year >= 2005].copy()

def drop_columns(X, columns):
    X = X.drop(columns=columns)
    return X

def drop_NaN(X):
    X = X.dropna()
    return X

pipeline = make_pipeline(
    FunctionTransformer(seasonal_features, kw_args={'date_column': 'DATE', 
                                                    'which_ones': ['day',
                                                                   'month',
                                                                   'weekday',
                                                                   'dayofyear',
                                                                   'week'],
                                                    'copy': True}
                        ),
                         
    FunctionTransformer(lagging_features, kw_args={'target': 'IPG2211A2N',
                                                   'lags': [1,3,5],
                                                   'lags_diff': [1,3,5],
                                                   'copy': True}
                        ),
    
    FunctionTransformer(moving_statistics_features, kw_args={'target': 'IPG2211A2N',
                                                             'windows': [4,5,6],
                                                             'which_ones': 'all',
                                                             'delta_roll_mean': True,
                                                             'copy': True}
                        ),

    FunctionTransformer(drop_columns, kw_args={'columns': ['DATE',
                                                           'IPG2211A2N']}
                        ),
                         
    FunctionTransformer(drop_NaN)                         
                         
)
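
`FunctionTransformer` turns a plain function into a pipeline step, and `kw_args` supplies its keyword arguments on every call. The toy functions `add_lag` and `drop_nan` below are hypothetical stand-ins for the module's functions. The sketch relies on the default `validate=False`, under which the DataFrame is handed to the function unchanged.

```python
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer


def add_lag(X, target, lag):
    """Return a copy of X with one lagged column of the target."""
    X = X.copy()
    X[f'lag_{target}_{lag}'] = X[target].shift(lag)
    return X


def drop_nan(X):
    """Drop the rows that the lag left without a value."""
    return X.dropna()


toy = pd.DataFrame({'y': [1.0, 2.0, 4.0, 8.0]})

toy_pipeline = make_pipeline(
    FunctionTransformer(add_lag, kw_args={'target': 'y', 'lag': 1}),
    FunctionTransformer(drop_nan),
)

out = toy_pipeline.fit_transform(toy)
print(out)
```

The output keeps the original index labels of the surviving rows, which is useful later for lining the features up with the target.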

In [15]:
train.head()

Unnamed: 0,DATE,IPG2211A2N
0,1985-01-01,72.5052
1,1985-02-01,70.672
2,1985-03-01,62.4502
3,1985-04-01,57.4714
4,1985-05-01,55.3151


In [16]:
validation.head()

Unnamed: 0,DATE,IPG2211A2N
240,2005-01-01,111.1614
241,2005-02-01,101.7795
242,2005-03-01,98.9565
243,2005-04-01,86.4776
244,2005-05-01,87.2234


In [17]:
X_train = pipeline.fit_transform(train)
X_validation = pipeline.transform(validation)

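# The pipeline's final step drops rows with NaN, so align the target with the rows that remain.
y_train = np.log(train.loc[X_train.index, 'IPG2211A2N'])
y_validation = np.log(validation.loc[X_validation.index, 'IPG2211A2N'])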
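
Since the last pipeline step drops rows with missing values, the feature matrix ends up shorter than the frame it came from, and the target has to be selected with the surviving index. A toy illustration with made-up values and hypothetical names:

```python
import pandas as pd

raw = pd.DataFrame({'y': [10.0, 12.0, 11.0, 15.0, 14.0]})

# Two lag features, then drop the rows that have no history yet.
X = pd.DataFrame({
    'lag_1': raw['y'].shift(1),
    'lag_2': raw['y'].shift(2),
}).dropna()

y_misaligned = raw['y']           # still has all 5 rows
y_aligned = raw.loc[X.index, 'y']  # only the rows that survived

print(len(X), len(y_misaligned), len(y_aligned))
```

Only `y_aligned` has one target value per feature row, so it is the one that can be passed to an estimator together with `X`.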

In [18]:
X_train.head()

Unnamed: 0,DATE_day,DATE_month,DATE_weekday,DATE_dayofyear,DATE_week,lag_IPG2211A2N_1,lag_IPG2211A2N_3,lag_IPG2211A2N_5,lag_diff_IPG2211A2N_1,lag_diff_IPG2211A2N_3,lag_diff_IPG2211A2N_5,mean_IPG2211A2N_4,mean_IPG2211A2N_5,mean_IPG2211A2N_6,median_IPG2211A2N_4,median_IPG2211A2N_5,median_IPG2211A2N_6,std_IPG2211A2N_4,std_IPG2211A2N_5,std_IPG2211A2N_6,min_IPG2211A2N_4,min_IPG2211A2N_5,min_IPG2211A2N_6,max_IPG2211A2N_4,max_IPG2211A2N_5,max_IPG2211A2N_6,skew_IPG2211A2N_4,skew_IPG2211A2N_5,skew_IPG2211A2N_6,kurt_IPG2211A2N_4,kurt_IPG2211A2N_5,kurt_IPG2211A2N_6,sum_IPG2211A2N_4,sum_IPG2211A2N_5,sum_IPG2211A2N_6,delta_roll_mean_IPG2211A2N_4,delta_roll_mean_IPG2211A2N_5,delta_roll_mean_IPG2211A2N_6
6,1,7,0,182,27,58.0904,57.4714,70.672,2.7753,-4.3598,-14.4148,58.331775,60.79982,62.750717,57.7809,58.0904,60.2703,2.992227,6.096827,7.250726,55.3151,55.3151,55.3151,62.4502,70.672,72.5052,1.038936,1.369508,0.602598,1.929142,1.578214,-1.869565,233.3271,303.9991,376.5043,-0.241375,-2.70942,-4.660317
7,1,8,3,213,31,62.6202,55.3151,62.4502,4.5298,5.1488,-8.0518,58.374275,59.18946,61.103217,57.7809,58.0904,60.2703,3.070407,3.223846,5.503575,55.3151,55.3151,55.3151,62.6202,62.6202,70.672,1.086086,0.130496,1.103005,1.999443,-2.363537,1.205089,233.4971,295.9473,366.6193,4.245925,3.43074,1.516983
8,1,9,6,244,35,63.2485,58.0904,57.4714,0.6283,7.9334,0.7983,59.81855,59.34912,59.865967,60.3553,58.0904,60.2703,3.780713,3.438337,3.325735,55.3151,55.3151,55.3151,63.2485,63.2485,63.2485,-0.421327,0.205311,-0.299521,-3.380592,-2.406188,-2.217197,239.2742,296.7456,359.1958,3.42995,3.89938,3.382533
9,1,10,1,274,40,60.5846,62.6202,55.3151,-2.6639,2.4942,3.1132,61.135925,59.97176,59.555033,61.6024,60.5846,59.3375,2.327032,3.292067,3.116429,58.0904,55.3151,55.3151,63.2485,63.2485,63.2485,-0.831527,-0.62651,-0.080631,-0.89401,-1.110508,-1.633221,244.5437,299.8588,357.3302,-0.551325,0.61284,1.029567
10,1,11,4,305,44,56.3154,63.2485,58.0904,-4.2692,-6.3048,1.0003,60.692175,60.17182,59.362367,61.6024,60.5846,59.3375,3.13155,2.951067,3.301261,56.3154,56.3154,55.3151,63.2485,63.2485,63.2485,-1.301892,-0.364765,-0.012928,1.178267,-1.975846,-2.124056,242.7687,300.8591,356.1742,-4.376775,-3.85642,-3.046967


In [19]:
X_validation.head()

Unnamed: 0,DATE_day,DATE_month,DATE_weekday,DATE_dayofyear,DATE_week,lag_IPG2211A2N_1,lag_IPG2211A2N_3,lag_IPG2211A2N_5,lag_diff_IPG2211A2N_1,lag_diff_IPG2211A2N_3,lag_diff_IPG2211A2N_5,mean_IPG2211A2N_4,mean_IPG2211A2N_5,mean_IPG2211A2N_6,median_IPG2211A2N_4,median_IPG2211A2N_5,median_IPG2211A2N_6,std_IPG2211A2N_4,std_IPG2211A2N_5,std_IPG2211A2N_6,min_IPG2211A2N_4,min_IPG2211A2N_5,min_IPG2211A2N_6,max_IPG2211A2N_4,max_IPG2211A2N_5,max_IPG2211A2N_6,skew_IPG2211A2N_4,skew_IPG2211A2N_5,skew_IPG2211A2N_6,kurt_IPG2211A2N_4,kurt_IPG2211A2N_5,kurt_IPG2211A2N_6,sum_IPG2211A2N_4,sum_IPG2211A2N_5,sum_IPG2211A2N_6,delta_roll_mean_IPG2211A2N_4,delta_roll_mean_IPG2211A2N_5,delta_roll_mean_IPG2211A2N_6
246,1,7,4,182,26,99.5076,86.4776,101.7795,12.2842,0.5511,-11.6538,93.041275,94.78892,97.517667,93.08995,98.9565,99.23205,7.158509,7.328336,9.361621,86.4776,86.4776,86.4776,99.5076,101.7795,111.1614,-0.004261,-0.51674,0.070047,-5.916325,-3.131444,-0.682656,372.1651,473.9446,585.106,6.466325,4.71868,1.989933
247,1,8,0,213,31,108.3501,87.2234,98.9565,8.8425,21.8725,6.5706,95.389675,96.10304,97.049117,93.3655,98.9565,99.23205,10.504651,9.236082,8.579891,86.4776,86.4776,86.4776,108.3501,108.3501,108.3501,0.571901,0.162729,-0.26558,-2.753171,-1.558684,-1.254711,381.5587,480.5152,582.2947,12.960425,12.24706,11.300983
248,1,9,3,244,35,109.4862,99.5076,86.4776,1.1361,22.2628,10.5297,101.141825,98.20898,98.333567,103.92885,99.5076,99.23205,10.295324,11.068115,9.904326,87.2234,86.4776,86.4776,109.4862,109.4862,109.4862,-1.066148,-0.141665,-0.198825,-0.1237,-2.980839,-1.846527,404.5673,491.0449,590.0014,8.344375,11.27722,11.152633
249,1,10,5,274,39,99.1155,108.3501,87.2234,-10.3707,-0.3921,12.6379,104.11485,100.73656,98.360067,103.92885,99.5076,99.31155,5.568034,8.961945,9.906538,99.1155,87.2234,86.4776,109.4862,109.4862,109.4862,0.031636,-0.788728,-0.210656,-5.768573,0.250002,-1.845334,416.4594,503.6828,590.1604,-4.99935,-1.62106,0.755433
250,1,11,1,305,44,89.7567,109.4862,99.5076,-9.3588,-18.5934,2.5333,101.677125,101.24322,98.906583,103.7328,99.5076,99.31155,9.204504,8.030165,9.184011,89.7567,89.7567,87.2234,109.4862,109.4862,109.4862,-0.801293,-0.491939,-0.129425,-1.455721,-0.603173,-1.711912,406.7085,506.2161,593.4395,-11.920425,-11.48652,-9.149883


# Rossmann Sales

In [20]:
!kaggle competitions download -c rossmann-store-sales
!unzip train.csv.zip

Downloading store.csv to /content
  0% 0.00/44.0k [00:00<?, ?B/s]
100% 44.0k/44.0k [00:00<00:00, 15.2MB/s]
Downloading sample_submission.csv to /content
  0% 0.00/310k [00:00<?, ?B/s]
100% 310k/310k [00:00<00:00, 40.7MB/s]
Downloading train.csv.zip to /content
 75% 5.00M/6.71M [00:00<00:00, 23.3MB/s]
100% 6.71M/6.71M [00:00<00:00, 26.6MB/s]
Downloading test.csv.zip to /content
  0% 0.00/192k [00:00<?, ?B/s]
100% 192k/192k [00:00<00:00, 62.1MB/s]
Archive:  train.csv.zip
  inflating: train.csv               


In [21]:
data = pd.read_csv('train.csv')



In [22]:
data.head()

Unnamed: 0,Store,DayOfWeek,Date,Sales,Customers,Open,Promo,StateHoliday,SchoolHoliday
0,1,5,2015-07-31,5263,555,1,1,0,1
1,2,5,2015-07-31,6064,625,1,1,0,1
2,3,5,2015-07-31,8314,821,1,1,0,1
3,4,5,2015-07-31,13995,1498,1,1,0,1
4,5,5,2015-07-31,4822,559,1,1,0,1


In [23]:
data['Date'] = pd.to_datetime(data['Date'])
data = data.loc[(data['Open']==1) & (data['Sales']!=0), ['Store','Date','Sales']].sort_values(by=['Store','Date'])

In [24]:
data.head(20)

Unnamed: 0,Store,Date,Sales
1014980,1,2013-01-02,5530
1013865,1,2013-01-03,4327
1012750,1,2013-01-04,4486
1011635,1,2013-01-05,4997
1009405,1,2013-01-07,7176
1008290,1,2013-01-08,5580
1007175,1,2013-01-09,5471
1006060,1,2013-01-10,4892
1004945,1,2013-01-11,4881
1003830,1,2013-01-12,4952


In [25]:
data.tail(20)

Unnamed: 0,Store,Date,Sales
25644,1115,2015-07-09,5686
24529,1115,2015-07-10,5844
23414,1115,2015-07-11,7164
21184,1115,2015-07-13,10598
20069,1115,2015-07-14,7562
18954,1115,2015-07-15,6039
17839,1115,2015-07-16,6590
16724,1115,2015-07-17,7874
15609,1115,2015-07-18,7264
13379,1115,2015-07-20,6083


In [26]:
cut_date = '2015-06-01'
train = data.loc[data['Date'] < cut_date].copy()
validation = data.loc[data['Date'] >= cut_date].copy()

In [27]:
data.shape

(844338, 3)

In this case the features must be computed separately for each store, so the `group_by` argument is passed to `lagging_features` and `moving_statistics_features`.

In [28]:
for dataset in [train, validation]:
    seasonal_features(df = dataset, 
                      date_column = 'Date',
                      which_ones = ['day','month','weekday','dayofyear','week'])
    
    lagging_features(df = dataset,
                     target = 'Sales',
                     lags=[1,3,5],
                     lags_diff=[1,3,5],
                     group_by='Store')
    
    moving_statistics_features(df = dataset,
                               target = 'Sales', 
                               windows = [4,5,6],
                               which_ones = 'all',
                               group_by = 'Store',
                               delta_roll_mean = True)
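
To see why grouping matters in the stacked case, compare a naive lag with a per-store lag on a toy frame. The module's `group_by` implementation is not shown here, so `groupby(...).shift()` is just one plausible way to get the same effect. The Store 1 values come from the output above and the Store 2 values are made up.

```python
import pandas as pd

toy = pd.DataFrame({
    'Store': [1, 1, 1, 2, 2, 2],
    'Sales': [5530, 4327, 4486, 9000, 9500, 9200],
})

# Naive lag: the first row of store 2 receives the last value of store 1.
toy['lag_naive'] = toy['Sales'].shift(1)

# Grouped lag: each store starts with its own missing value.
toy['lag_by_store'] = toy.groupby('Store')['Sales'].shift(1)

print(toy)
```

Without grouping, the history of one store leaks into the first rows of the next one.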

In [29]:
train.tail()

Unnamed: 0,Store,Date,Sales,Date_day,Date_month,Date_weekday,Date_dayofyear,Date_week,lag_Sales_1,lag_Sales_3,lag_Sales_5,lag_diff_Sales_1,lag_diff_Sales_3,lag_diff_Sales_5,mean_Sales_4,mean_Sales_5,mean_Sales_6,median_Sales_4,median_Sales_5,median_Sales_6,std_Sales_4,std_Sales_5,std_Sales_6,min_Sales_4,min_Sales_5,min_Sales_6,max_Sales_4,max_Sales_5,max_Sales_6,skew_Sales_4,skew_Sales_5,skew_Sales_6,kurt_Sales_4,kurt_Sales_5,kurt_Sales_6,sum_Sales_4,sum_Sales_5,sum_Sales_6,delta_roll_mean_Sales_4,delta_roll_mean_Sales_5,delta_roll_mean_Sales_6
74704,1115,2015-05-26,6726,26,5,1,146,22,8005.0,6244.0,7583.0,-444.0,360.0,-698.0,7585.75,7585.2,7771.5,7825.0,7645.0,7825.0,953.025839,825.345503,867.872283,6244.0,6244.0,6244.0,8449.0,8449.0,8703.0,-1.308042,-1.255328,-1.120179,1.965039,2.299929,1.66186,30343.0,37926.0,46629.0,419.25,419.8,233.5
73589,1115,2015-05-27,6156,27,5,2,147,22,6726.0,8449.0,7645.0,-1279.0,482.0,-857.0,7356.0,7413.8,7442.0,7365.5,7645.0,7614.0,1040.719943,910.509583,817.308754,6244.0,6244.0,6244.0,8449.0,8449.0,8449.0,-0.02719,-0.329991,-0.481173,-4.148244,-1.871136,-0.832349,29424.0,37069.0,44652.0,-630.0,-687.8,-716.0
72474,1115,2015-05-28,6364,28,5,3,148,22,6156.0,8005.0,6244.0,-570.0,-2293.0,-1489.0,7334.0,7116.0,7204.166667,7365.5,6726.0,7185.5,1072.507032,1048.963059,962.755923,6156.0,6156.0,6156.0,8449.0,8449.0,8449.0,-0.092482,0.528534,0.128481,-3.899884,-2.597593,-2.209067,29336.0,35580.0,43225.0,-1178.0,-960.0,-1048.166667
71359,1115,2015-05-29,8037,29,5,4,149,22,6364.0,6726.0,8449.0,208.0,-1641.0,120.0,6812.75,7140.0,6990.666667,6545.0,6726.0,6545.0,828.991908,1025.12609,987.17246,6156.0,6156.0,6156.0,8005.0,8449.0,8449.0,1.545177,0.541155,0.898966,2.360556,-2.538254,-1.431672,27251.0,35700.0,41944.0,-448.75,-776.0,-626.666667
70244,1115,2015-05-30,9228,30,5,5,150,22,8037.0,6156.0,8005.0,1673.0,1311.0,-412.0,6820.75,7057.6,7289.5,6545.0,6726.0,7365.5,844.344864,902.87225,987.323807,6156.0,6156.0,6156.0,8037.0,8037.0,8449.0,1.560785,0.388803,-0.026548,2.417906,-3.052703,-2.664676,27283.0,35288.0,43737.0,1216.25,979.4,747.5


In [30]:
validation.tail()

Unnamed: 0,Store,Date,Sales,Date_day,Date_month,Date_weekday,Date_dayofyear,Date_week,lag_Sales_1,lag_Sales_3,lag_Sales_5,lag_diff_Sales_1,lag_diff_Sales_3,lag_diff_Sales_5,mean_Sales_4,mean_Sales_5,mean_Sales_6,median_Sales_4,median_Sales_5,median_Sales_6,std_Sales_4,std_Sales_5,std_Sales_6,min_Sales_4,min_Sales_5,min_Sales_6,max_Sales_4,max_Sales_5,max_Sales_6,skew_Sales_4,skew_Sales_5,skew_Sales_6,kurt_Sales_4,kurt_Sales_5,kurt_Sales_6,sum_Sales_4,sum_Sales_5,sum_Sales_6,delta_roll_mean_Sales_4,delta_roll_mean_Sales_5,delta_roll_mean_Sales_6
5574,1115,2015-07-27,10712,27,7,0,208,31,6897.0,6150.0,5074.0,1081.0,1555.0,814.0,6051.25,5855.8,5893.666667,5983.0,5816.0,5949.5,654.070014,715.443359,646.599309,5342.0,5074.0,5074.0,6897.0,6897.0,6897.0,0.562571,0.615182,0.349127,0.484804,-0.242212,0.009346,24205.0,29279.0,35362.0,845.75,1041.2,1003.333333
4459,1115,2015-07-28,8093,28,7,1,209,31,10712.0,5816.0,5342.0,3815.0,4562.0,5638.0,7393.75,6983.4,6665.166667,6523.5,6150.0,5983.0,2257.856856,2159.947638,2083.250769,5816.0,5342.0,5074.0,10712.0,10712.0,10712.0,1.770143,1.870769,1.964107,3.159478,3.651167,4.091853,29575.0,34917.0,39991.0,3318.25,3728.6,4046.833333
3344,1115,2015-07-29,7661,29,7,2,210,31,8093.0,6897.0,6150.0,-2619.0,2277.0,2751.0,7879.5,7533.6,7168.333333,7495.0,6897.0,6523.5,2104.913062,1980.209156,1984.313752,5816.0,5816.0,5342.0,10712.0,10712.0,10712.0,0.929299,1.304495,1.368815,0.667265,1.323379,1.60493,31518.0,37668.0,43010.0,213.5,559.4,924.666667
2229,1115,2015-07-30,8405,30,7,3,211,31,7661.0,10712.0,5816.0,-432.0,764.0,1511.0,8340.75,7835.8,7554.833333,7877.0,7661.0,7279.0,1656.369599,1825.525322,1771.916411,6897.0,5816.0,5816.0,10712.0,10712.0,10712.0,1.466211,1.009384,1.257111,2.521273,1.663816,1.750308,33363.0,39179.0,45329.0,-679.75,-174.8,106.166667
1114,1115,2015-07-31,8680,31,7,4,212,31,8405.0,8093.0,6897.0,744.0,-2307.0,2589.0,8717.75,8353.6,7930.666667,8249.0,8093.0,7877.0,1364.047745,1434.745901,1649.252032,7661.0,6897.0,5816.0,10712.0,10712.0,10712.0,1.701263,1.365268,0.738724,3.0859,2.523149,1.417251,34871.0,41768.0,47584.0,-312.75,51.4,474.333333


In [31]:
train.shape, validation.shape

((785727, 41), (58611, 41))

In [32]:
train = train.dropna()
validation = validation.dropna()

train.shape, validation.shape

((779037, 41), (51921, 41))
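
The number of rows removed by `dropna` can be checked against the number of stores. Assuming store ids run from 1 to 1115, as the printed tail suggests, and that each store loses its first six rows in each split (the same count as in the single-series example), the arithmetic works out:

```python
n_stores = 1115          # assumption: store ids run from 1 to 1115
rows_lost_per_store = 6  # assumption: six leading rows per store per split

train_before, train_after = 785727, 779037
validation_before, validation_after = 58611, 51921

train_lost = train_before - train_after
validation_lost = validation_before - validation_after

print(train_lost, validation_lost, n_stores * rows_lost_per_store)
```

Both splits lose 6690 rows, which is consistent with the features being computed per store rather than over the stacked frame as a whole.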

In [33]:
X_train = train.drop(columns=['Date','Sales'])
X_validation = validation.drop(columns=['Date','Sales'])

y_train = np.log(train['Sales'])
y_validation = np.log(validation['Sales'])

In [34]:
X_train.head()

Unnamed: 0,Store,Date_day,Date_month,Date_weekday,Date_dayofyear,Date_week,lag_Sales_1,lag_Sales_3,lag_Sales_5,lag_diff_Sales_1,lag_diff_Sales_3,lag_diff_Sales_5,mean_Sales_4,mean_Sales_5,mean_Sales_6,median_Sales_4,median_Sales_5,median_Sales_6,std_Sales_4,std_Sales_5,std_Sales_6,min_Sales_4,min_Sales_5,min_Sales_6,max_Sales_4,max_Sales_5,max_Sales_6,skew_Sales_4,skew_Sales_5,skew_Sales_6,kurt_Sales_4,kurt_Sales_5,kurt_Sales_6,sum_Sales_4,sum_Sales_5,sum_Sales_6,delta_roll_mean_Sales_4,delta_roll_mean_Sales_5,delta_roll_mean_Sales_6
1007175,1,9,1,2,9,2,5580.0,4997.0,4327.0,-1596.0,1094.0,50.0,5559.75,5313.2,5349.333333,5288.5,4997.0,5263.5,1166.519145,1150.873451,1033.170589,4486.0,4327.0,4327.0,7176.0,7176.0,7176.0,1.178432,1.346522,1.18534,1.357596,1.597898,1.614099,22239.0,26566.0,32096.0,20.25,266.8,230.666667
1006060,1,10,1,3,10,2,5471.0,7176.0,4486.0,-109.0,474.0,1144.0,5806.0,5542.0,5339.5,5525.5,5471.0,5234.0,947.74856,1011.014589,1031.386397,4997.0,4486.0,4327.0,7176.0,7176.0,7176.0,1.560558,1.218839,1.233287,2.886719,2.122167,1.775968,23224.0,27710.0,32037.0,-335.0,-71.0,131.5
1004945,1,11,1,4,11,2,4892.0,5580.0,4997.0,-579.0,-2284.0,406.0,5779.75,5623.2,5433.666667,5525.5,5471.0,5234.0,978.577326,916.924043,942.410243,4892.0,4892.0,4486.0,7176.0,7176.0,7176.0,1.412139,1.677684,1.505097,2.611491,3.059446,2.753118,23119.0,28116.0,32602.0,-887.75,-731.2,-541.666667
1003830,1,12,1,5,12,2,4881.0,5471.0,7176.0,-11.0,-699.0,-116.0,5206.0,5600.0,5499.5,5181.5,5471.0,5234.0,371.62795,937.955489,874.305381,4881.0,4881.0,4881.0,5580.0,7176.0,7176.0,0.073207,1.608589,1.851578,-5.570266,2.801227,3.62858,20824.0,28000.0,32997.0,-325.0,-719.0,-618.5
1001600,1,14,1,0,14,3,4952.0,4892.0,5580.0,71.0,-519.0,-2224.0,5049.0,5155.2,5492.0,4922.0,4952.0,5211.5,283.058298,341.297085,879.654705,4881.0,4881.0,4881.0,5471.0,5580.0,7176.0,1.929147,0.638411,1.839962,3.742479,-2.963804,3.565868,20196.0,25776.0,32952.0,-97.0,-203.2,-540.0


In [35]:
X_validation.head()

Unnamed: 0,Store,Date_day,Date_month,Date_weekday,Date_dayofyear,Date_week,lag_Sales_1,lag_Sales_3,lag_Sales_5,lag_diff_Sales_1,lag_diff_Sales_3,lag_diff_Sales_5,mean_Sales_4,mean_Sales_5,mean_Sales_6,median_Sales_4,median_Sales_5,median_Sales_6,std_Sales_4,std_Sales_5,std_Sales_6,min_Sales_4,min_Sales_5,min_Sales_6,max_Sales_4,max_Sales_5,max_Sales_6,skew_Sales_4,skew_Sales_5,skew_Sales_6,kurt_Sales_4,kurt_Sales_5,kurt_Sales_6,sum_Sales_4,sum_Sales_5,sum_Sales_6,delta_roll_mean_Sales_4,delta_roll_mean_Sales_5,delta_roll_mean_Sales_6
57980,1,9,6,1,160,24,4071.0,5384.0,5450.0,-112.0,-1738.0,-1703.0,4861.75,4979.4,5111.833333,4783.5,5384.0,5417.0,867.18217,795.745751,782.176813,4071.0,4071.0,4071.0,5809.0,5809.0,5809.0,0.189377,-0.424466,-0.777924,-4.760218,-2.909892,-1.827132,19447.0,24897.0,30671.0,-790.75,-908.4,-1040.833333
56865,1,10,6,2,161,24,4102.0,4183.0,5809.0,31.0,-1282.0,-1348.0,4435.0,4709.8,4833.166667,4142.5,4183.0,4783.5,634.42625,824.286176,796.789914,4071.0,4071.0,4071.0,5384.0,5809.0,5809.0,1.967234,0.746312,0.132014,3.885759,-2.458826,-2.871188,17740.0,23549.0,28999.0,-333.0,-607.8,-731.166667
55750,1,11,6,3,162,24,3591.0,4071.0,5384.0,-511.0,-592.0,-2218.0,3986.75,4266.2,4523.333333,4086.5,4102.0,4142.5,268.025341,666.588104,867.281423,3591.0,3591.0,3591.0,4183.0,5384.0,5809.0,-1.810657,1.514229,0.802702,3.448394,3.214036,-1.14845,15947.0,21331.0,27140.0,-395.75,-675.2,-932.333333
54635,1,12,6,4,163,24,3627.0,4102.0,4183.0,36.0,-444.0,-1757.0,3847.75,3914.8,4159.666667,3849.0,4071.0,4086.5,276.366152,282.422025,650.820918,3591.0,3591.0,3591.0,4102.0,4183.0,5384.0,-0.003789,-0.520204,1.631707,-5.852972,-3.117334,3.239406,15391.0,19574.0,24958.0,-220.75,-287.8,-532.666667
53520,1,13,6,5,164,24,3695.0,3591.0,4071.0,68.0,-407.0,-488.0,3753.75,3817.2,3878.166667,3661.0,3695.0,3883.0,236.137495,248.897971,268.070451,3591.0,3591.0,3591.0,4102.0,4102.0,4183.0,1.806764,0.518006,0.016193,3.309868,-3.154185,-2.938044,15015.0,19086.0,23269.0,-58.75,-122.2,-183.166667


In [36]:
X_train.shape, y_train.shape

((779037, 39), (779037,))

In [37]:
X_validation.shape, y_validation.shape

((51921, 39), (51921,))

## Pipeline

In [38]:
data = pd.read_csv('train.csv')



In [39]:
cut_date = '2015-06-01'
train = data.loc[data['Date'] < cut_date].copy()
validation = data.loc[data['Date'] >= cut_date].copy()

In [40]:
# create a training pipeline

def select(X, variables):
    X = X[variables].copy()
    return X

def preprocessing(X):
    X['Date'] = pd.to_datetime(X['Date'])
    X = X.loc[(X['Open']==1) & (X['Sales']!=0), :].sort_values(by=['Store','Date'])
    return X

def drop_columns(X, columns):
    X = X.drop(columns=columns)
    return X

def drop_NaN(X):
    X = X.dropna()
    return X

pipeline = make_pipeline(
    FunctionTransformer(select, kw_args={'variables': ['Open',
                                                       'Store', 
                                                       'Date',
                                                       'Sales']}
                        ),
                         
    FunctionTransformer(preprocessing),

    FunctionTransformer(seasonal_features, kw_args={'date_column': 'Date', 
                                                    'which_ones': ['day',
                                                                   'month',
                                                                   'weekday',
                                                                   'dayofyear',
                                                                   'week'],
                                                    'copy': True}
                        ),
                         
    FunctionTransformer(lagging_features, kw_args={'target': 'Sales',
                                                             'lags': [1,3,5],
                                                             'lags_diff': [1,3,5],
                                                             'group_by': 'Store',
                                                             'copy': True}
                        ),
    
    FunctionTransformer(moving_statistics_features, kw_args={'target': 'Sales',
                                                             'windows': [4,5,6],
                                                             'which_ones': 'all',
                                                             'delta_roll_mean': True,
                                                             'group_by': 'Store',
                                                             'copy': True}
                        ),

    FunctionTransformer(drop_columns, kw_args={'columns': ['Date',
                                                           'Sales',
                                                           'Open']}
                        ),
                         
    FunctionTransformer(drop_NaN)                         
                         
)

In [41]:
train.head()

Unnamed: 0,Store,DayOfWeek,Date,Sales,Customers,Open,Promo,StateHoliday,SchoolHoliday
68015,1,7,2015-05-31,0,0,0,0,0,0
68016,2,7,2015-05-31,0,0,0,0,0,0
68017,3,7,2015-05-31,0,0,0,0,0,0
68018,4,7,2015-05-31,0,0,0,0,0,0
68019,5,7,2015-05-31,0,0,0,0,0,0


In [42]:
validation.head()

Unnamed: 0,Store,DayOfWeek,Date,Sales,Customers,Open,Promo,StateHoliday,SchoolHoliday
0,1,5,2015-07-31,5263,555,1,1,0,1
1,2,5,2015-07-31,6064,625,1,1,0,1
2,3,5,2015-07-31,8314,821,1,1,0,1
3,4,5,2015-07-31,13995,1498,1,1,0,1
4,5,5,2015-07-31,4822,559,1,1,0,1


In [43]:
X_train = pipeline.fit_transform(train)
X_validation = pipeline.transform(validation)

y_train = np.log(train.loc[X_train.index, 'Sales'])
y_validation = np.log(validation.loc[X_validation.index, 'Sales'])

In [44]:
X_train.head()

Unnamed: 0,Store,Date_day,Date_month,Date_weekday,Date_dayofyear,Date_week,lag_Sales_1,lag_Sales_3,lag_Sales_5,lag_diff_Sales_1,lag_diff_Sales_3,lag_diff_Sales_5,mean_Sales_4,mean_Sales_5,mean_Sales_6,median_Sales_4,median_Sales_5,median_Sales_6,std_Sales_4,std_Sales_5,std_Sales_6,min_Sales_4,min_Sales_5,min_Sales_6,max_Sales_4,max_Sales_5,max_Sales_6,skew_Sales_4,skew_Sales_5,skew_Sales_6,kurt_Sales_4,kurt_Sales_5,kurt_Sales_6,sum_Sales_4,sum_Sales_5,sum_Sales_6,delta_roll_mean_Sales_4,delta_roll_mean_Sales_5,delta_roll_mean_Sales_6
1007175,1,9,1,2,9,2,5580.0,4997.0,4327.0,-1596.0,1094.0,50.0,5559.75,5313.2,5349.333333,5288.5,4997.0,5263.5,1166.519145,1150.873451,1033.170589,4486.0,4327.0,4327.0,7176.0,7176.0,7176.0,1.178432,1.346522,1.18534,1.357596,1.597898,1.614099,22239.0,26566.0,32096.0,20.25,266.8,230.666667
1006060,1,10,1,3,10,2,5471.0,7176.0,4486.0,-109.0,474.0,1144.0,5806.0,5542.0,5339.5,5525.5,5471.0,5234.0,947.74856,1011.014589,1031.386397,4997.0,4486.0,4327.0,7176.0,7176.0,7176.0,1.560558,1.218839,1.233287,2.886719,2.122167,1.775968,23224.0,27710.0,32037.0,-335.0,-71.0,131.5
1004945,1,11,1,4,11,2,4892.0,5580.0,4997.0,-579.0,-2284.0,406.0,5779.75,5623.2,5433.666667,5525.5,5471.0,5234.0,978.577326,916.924043,942.410243,4892.0,4892.0,4486.0,7176.0,7176.0,7176.0,1.412139,1.677684,1.505097,2.611491,3.059446,2.753118,23119.0,28116.0,32602.0,-887.75,-731.2,-541.666667
1003830,1,12,1,5,12,2,4881.0,5471.0,7176.0,-11.0,-699.0,-116.0,5206.0,5600.0,5499.5,5181.5,5471.0,5234.0,371.62795,937.955489,874.305381,4881.0,4881.0,4881.0,5580.0,7176.0,7176.0,0.073207,1.608589,1.851578,-5.570266,2.801227,3.62858,20824.0,28000.0,32997.0,-325.0,-719.0,-618.5
1001600,1,14,1,0,14,3,4952.0,4892.0,5580.0,71.0,-519.0,-2224.0,5049.0,5155.2,5492.0,4922.0,4952.0,5211.5,283.058298,341.297085,879.654705,4881.0,4881.0,4881.0,5471.0,5580.0,7176.0,1.929147,0.638411,1.839962,3.742479,-2.963804,3.565868,20196.0,25776.0,32952.0,-97.0,-203.2,-540.0


In [45]:
X_validation.head()

Unnamed: 0,Store,Date_day,Date_month,Date_weekday,Date_dayofyear,Date_week,lag_Sales_1,lag_Sales_3,lag_Sales_5,lag_diff_Sales_1,lag_diff_Sales_3,lag_diff_Sales_5,mean_Sales_4,mean_Sales_5,mean_Sales_6,median_Sales_4,median_Sales_5,median_Sales_6,std_Sales_4,std_Sales_5,std_Sales_6,min_Sales_4,min_Sales_5,min_Sales_6,max_Sales_4,max_Sales_5,max_Sales_6,skew_Sales_4,skew_Sales_5,skew_Sales_6,kurt_Sales_4,kurt_Sales_5,kurt_Sales_6,sum_Sales_4,sum_Sales_5,sum_Sales_6,delta_roll_mean_Sales_4,delta_roll_mean_Sales_5,delta_roll_mean_Sales_6
57980,1,9,6,1,160,24,4071.0,5384.0,5450.0,-112.0,-1738.0,-1703.0,4861.75,4979.4,5111.833333,4783.5,5384.0,5417.0,867.18217,795.745751,782.176813,4071.0,4071.0,4071.0,5809.0,5809.0,5809.0,0.189377,-0.424466,-0.777924,-4.760218,-2.909892,-1.827132,19447.0,24897.0,30671.0,-790.75,-908.4,-1040.833333
56865,1,10,6,2,161,24,4102.0,4183.0,5809.0,31.0,-1282.0,-1348.0,4435.0,4709.8,4833.166667,4142.5,4183.0,4783.5,634.42625,824.286176,796.789914,4071.0,4071.0,4071.0,5384.0,5809.0,5809.0,1.967234,0.746312,0.132014,3.885759,-2.458826,-2.871188,17740.0,23549.0,28999.0,-333.0,-607.8,-731.166667
55750,1,11,6,3,162,24,3591.0,4071.0,5384.0,-511.0,-592.0,-2218.0,3986.75,4266.2,4523.333333,4086.5,4102.0,4142.5,268.025341,666.588104,867.281423,3591.0,3591.0,3591.0,4183.0,5384.0,5809.0,-1.810657,1.514229,0.802702,3.448394,3.214036,-1.14845,15947.0,21331.0,27140.0,-395.75,-675.2,-932.333333
54635,1,12,6,4,163,24,3627.0,4102.0,4183.0,36.0,-444.0,-1757.0,3847.75,3914.8,4159.666667,3849.0,4071.0,4086.5,276.366152,282.422025,650.820918,3591.0,3591.0,3591.0,4102.0,4183.0,5384.0,-0.003789,-0.520204,1.631707,-5.852972,-3.117334,3.239406,15391.0,19574.0,24958.0,-220.75,-287.8,-532.666667
53520,1,13,6,5,164,24,3695.0,3591.0,4071.0,68.0,-407.0,-488.0,3753.75,3817.2,3878.166667,3661.0,3695.0,3883.0,236.137495,248.897971,268.070451,3591.0,3591.0,3591.0,4102.0,4102.0,4183.0,1.806764,0.518006,0.016193,3.309868,-3.154185,-2.938044,15015.0,19086.0,23269.0,-58.75,-122.2,-183.166667


In [46]:
X_train.shape, y_train.shape

((779037, 39), (779037,))

In [47]:
X_validation.shape, y_validation.shape

((51921, 39), (51921,))