# <img src="https://kaggle2.blob.core.windows.net/competitions/kaggle/4594/logos/front_page.png"/><span style="color:blue;text-align:center;">v6 Feature Engineering</span>

Rossmann operates over 3,000 drug stores in 7 European countries. Currently, Rossmann store managers are tasked with predicting their daily sales for up to six weeks in advance. Store sales are influenced by many factors, including promotions, competition, school and state holidays, seasonality, and locality. With thousands of individual managers predicting sales based on their unique circumstances, the accuracy of results can be quite varied. Reliable sales forecasts enable store managers to create effective staff schedules that increase productivity and motivation. By helping Rossmann create a robust prediction model, you will help store managers stay focused on what’s most important to them: their customers and their teams!
<img src="https://kaggle2.blob.core.windows.net/competitions/kaggle/4594/media/rossmann_banner2.png"/>

## Import Packages

In [1]:
import pandas as pd
import numpy as np

## Feature Engineering Function

In [2]:
# Gather some features
def build_features(features, data):
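    # Treat a missing Open flag as "open" first; the blanket fillna(0) below
    # would otherwise mark those rows as closed before this line could act
    data.loc[data.Open.isnull(), 'Open'] = 1
    # Replace all remaining NaNs with 0
    data.fillna(0, inplace=True)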
    # Use some properties directly
    features.extend(['Store', 'CompetitionDistance', 'Promo', 'Promo2', 'SchoolHoliday'])

    # Label encode some features
    features.extend(['StoreType', 'Assortment', 'StateHoliday'])
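    mappings = {'0': 0, 'a': 1, 'b': 2, 'c': 3, 'd': 4}
    # Assign the result back rather than calling replace(inplace=True) on a
    # column accessed via attribute, which may not update the frame
    data['StoreType'] = data['StoreType'].replace(mappings)
    data['Assortment'] = data['Assortment'].replace(mappings)
    data['StateHoliday'] = data['StateHoliday'].replace(mappings)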

    features.extend(['DayOfWeek', 'Month', 'Day', 'Year', 'WeekOfYear'])
    data['Year'] = data.Date.dt.year
    data['Month'] = data.Date.dt.month
    data['Day'] = data.Date.dt.day
    data['DayOfWeek'] = data.Date.dt.dayofweek
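    # Series.dt.weekofyear was removed in pandas 2.0; isocalendar().week is the replacement
    data['WeekOfYear'] = data.Date.dt.isocalendar().week.astype(int)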

    # CompetitionOpen and PromoOpen from https://www.kaggle.com/ananya77041/rossmann-store-sales/randomforestpython/code
    # Number of months the competitor has been open
    features.append('CompetitionOpen')
    data['CompetitionOpen'] = 12 * (data.Year - data.CompetitionOpenSinceYear) + \
        (data.Month - data.CompetitionOpenSinceMonth)
    # Promo open time in months
    features.append('PromoOpen')
    data['PromoOpen'] = 12 * (data.Year - data.Promo2SinceYear) + \
        (data.WeekOfYear - data.Promo2SinceWeek) / 4.0
    data['PromoOpen'] = data.PromoOpen.apply(lambda x: x if x > 0 else 0)
    data.loc[data.Promo2SinceYear == 0, 'PromoOpen'] = 0

    # Indicate that sales on that day are in promo interval
    features.append('IsPromoMonth')
    month2str = {1: 'Jan', 2: 'Feb', 3: 'Mar', 4: 'Apr', 5: 'May', 6: 'Jun',
                 7: 'Jul', 8: 'Aug', 9: 'Sept', 10: 'Oct', 11: 'Nov', 12: 'Dec'}
    data['monthStr'] = data.Month.map(month2str)
    data.loc[data.PromoInterval == 0, 'PromoInterval'] = ''
    data['IsPromoMonth'] = 0
    for interval in data.PromoInterval.unique():
        if interval != '':
            for month in interval.split(','):
                data.loc[(data.monthStr == month) & (data.PromoInterval == interval), 'IsPromoMonth'] = 1

    return data
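To make the `IsPromoMonth` logic in `build_features` concrete, here is a minimal sketch on a made-up four-row frame. The rows are illustrative and not taken from the Rossmann data. A row is flagged when its month abbreviation appears in that row's own `PromoInterval` string, which is the same condition the nested loop applies one interval at a time. The list-comprehension form is an alternative phrasing for illustration, not the notebook's implementation.

```python
import pandas as pd

# Toy rows: column names mirror the merged frame, values are invented
toy = pd.DataFrame({
    "Month": [1, 3, 9, 9],
    "PromoInterval": ["Jan,Apr,Jul,Oct", "Feb,May,Aug,Nov", "Mar,Jun,Sept,Dec", ""],
})

# Same month abbreviations as build_features (note "Sept", not "Sep")
month2str = {1: "Jan", 2: "Feb", 3: "Mar", 4: "Apr", 5: "May", 6: "Jun",
             7: "Jul", 8: "Aug", 9: "Sept", 10: "Oct", 11: "Nov", 12: "Dec"}
toy["monthStr"] = toy["Month"].map(month2str)

# Flag a row when its month abbreviation is listed in its own interval;
# an empty interval (store without Promo2) never matches
toy["IsPromoMonth"] = [
    int(month in interval.split(","))
    for month, interval in zip(toy["monthStr"], toy["PromoInterval"])
]

print(toy[["Month", "PromoInterval", "IsPromoMonth"]])
```

Only the first and third rows are flagged: January is in the first interval, March is not in the second, September is in the third, and the last row has no interval at all.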

## Load Data

In [3]:
print("Load the training, test and store data using pandas")
types = {'CompetitionOpenSinceYear': np.dtype(int),
         'CompetitionOpenSinceMonth': np.dtype(int),
         'StateHoliday': np.dtype(str),
         'Promo2SinceWeek': np.dtype(int),
         'SchoolHoliday': np.dtype(int),
         'PromoInterval': np.dtype(str)}

train = pd.read_csv("data/train.csv", parse_dates=[2], dtype=types)
test = pd.read_csv("data/test.csv", parse_dates=[3], dtype=types)
store = pd.read_csv("data/store.csv")

Load the training, test and store data using pandas


## Handle Missing Values and Build Features

In [4]:
print("Assume store open, if not provided")
test.fillna(1, inplace=True)

print("Consider only open stores for training. Closed stores won't count toward the score.")
train = train[train["Open"] != 0]
print("Use only Sales greater than zero")
train = train[train["Sales"] > 0]

print("Join with store")
train = pd.merge(train, store, on='Store')
test = pd.merge(test, store, on='Store')

features = []

print("augment features")
train = build_features(features, train)
test = build_features([], test)
print(features)

Assume store open, if not provided
Consider only open stores for training. Closed stores won't count toward the score.
Use only Sales greater than zero
Join with store
augment features
['Store', 'CompetitionDistance', 'Promo', 'Promo2', 'SchoolHoliday', 'StoreType', 'Assortment', 'StateHoliday', 'DayOfWeek', 'Month', 'Day', 'Year', 'WeekOfYear', 'CompetitionOpen', 'PromoOpen', 'IsPromoMonth']
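Two of the listed features, `CompetitionOpen` and `PromoOpen`, are elapsed-time counts in months. The sketch below reproduces their arithmetic on two hypothetical rows: one with known start dates, and one where the start fields were missing and were filled with 0. The numbers are invented for illustration. The use of `clip` and `where` is a compact stand-in for the `apply` and `loc` steps in `build_features`.

```python
import pandas as pd

# Hypothetical rows for a sale in July 2015 (ISO week 31):
# row 0 has known start dates, row 1 had missing values filled with 0
df = pd.DataFrame({
    "Year": [2015, 2015],
    "Month": [7, 7],
    "WeekOfYear": [31, 31],
    "CompetitionOpenSinceYear": [2014, 0],
    "CompetitionOpenSinceMonth": [9, 0],
    "Promo2SinceYear": [2014, 0],
    "Promo2SinceWeek": [40, 0],
})

# Months since the competitor opened
competition_open = (12 * (df.Year - df.CompetitionOpenSinceYear)
                    + (df.Month - df.CompetitionOpenSinceMonth))

# Months since Promo2 started, approximating one month as four weeks
promo_open = (12 * (df.Year - df.Promo2SinceYear)
              + (df.WeekOfYear - df.Promo2SinceWeek) / 4.0)
# Negative values become 0, and so do rows without a Promo2 start year
promo_open = promo_open.clip(lower=0).where(df.Promo2SinceYear != 0, 0)

print(competition_open.tolist())  # [10, 24187]
print(promo_open.tolist())        # [9.75, 0.0]
```

The second row shows an asymmetry in the feature code: `PromoOpen` is reset to 0 when the start year is missing, while `CompetitionOpen` is not, so a filled-in year of 0 yields a value above 24,000 months. Tree-based models can split such values off, but it is worth keeping in mind when reading feature importances.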


## Save Engineered Features

In [5]:
train.to_csv("data/train_featured.csv", index=False)
test.to_csv("data/test_featured.csv", index=False)

### Run Train Script from Kaggle

In [8]:
# sklearn.cross_validation was removed in scikit-learn 0.20; model_selection replaces it
from sklearn.model_selection import train_test_split
import xgboost as xgb

print('training data processed')

def rmspe(y, yhat):
    return np.sqrt(np.mean(((y - yhat)/y) ** 2))

def rmspe_xg(yhat, y):
    y = np.expm1(y.get_label())
    yhat = np.expm1(yhat)
    return "rmspe", rmspe(y, yhat)

print("Train xgboost model")

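params = {"objective": "reg:squarederror",  # current name of the deprecated "reg:linear"
          "booster": "gbtree",
          "eta": 0.1,
          "max_depth": 10,
          "subsample": 0.85,
          "colsample_bytree": 0.4,
          "min_child_weight": 6,
          "verbosity": 0,  # replaces the deprecated "silent" flag
          "nthread": 1,  # the original key "thread" is not an XGBoost parameter
          "seed": 1301
          }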
num_boost_round = 1200

print("Train a XGBoost model")
X_train, X_valid = train_test_split(train, test_size=0.012, random_state=10)
y_train = np.log1p(X_train.Sales)
y_valid = np.log1p(X_valid.Sales)
dtrain = xgb.DMatrix(X_train[features], y_train)
dvalid = xgb.DMatrix(X_valid[features], y_valid)

watchlist = [(dtrain, 'train'), (dvalid, 'eval')]
gbm = xgb.train(params, dtrain, num_boost_round, evals=watchlist, early_stopping_rounds=200, \
  feval=rmspe_xg, verbose_eval=True)

print("Validating")
yhat = gbm.predict(xgb.DMatrix(X_valid[features]))
error = rmspe(X_valid.Sales.values, np.expm1(yhat))
print('RMSPE: {:.6f}'.format(error))

print("Make predictions on the test set")
dtest = xgb.DMatrix(test[features])
test_probs = gbm.predict(dtest)

# Make Submission
result = pd.DataFrame({"Id": test["Id"], 'Sales': np.expm1(test_probs)})
result.to_csv("data/submission_v6_kaggle_script_for_test.csv", index=False)

Will train until eval error hasn't decreased in 200 rounds.
[0]	train-rmspe:0.999980	eval-rmspe:0.999520
[1]	train-rmspe:0.999879	eval-rmspe:0.998814
[2]	train-rmspe:0.999483	eval-rmspe:0.997535
[3]	train-rmspe:0.998193	eval-rmspe:0.995371
[4]	train-rmspe:0.994792	eval-rmspe:0.991943
[5]	train-rmspe:0.988319	eval-rmspe:0.986808
[6]	train-rmspe:0.978789	eval-rmspe:0.979530
[7]	train-rmspe:0.968873	eval-rmspe:0.969590
[8]	train-rmspe:0.956882	eval-rmspe:0.956738
[9]	train-rmspe:0.940610	eval-rmspe:0.940531
[10]	train-rmspe:0.920978	eval-rmspe:0.920970
[11]	train-rmspe:0.898074	eval-rmspe:0.898019
[12]	train-rmspe:0.871714	eval-rmspe:0.871621
[13]	train-rmspe:0.842560	eval-rmspe:0.842368
[14]	train-rmspe:0.810569	eval-rmspe:0.810272
[15]	train-rmspe:0.776518	eval-rmspe:0.776044
[16]	train-rmspe:0.740919	eval-rmspe:0.740225
[17]	train-rmspe:0.704385	eval-rmspe:0.703363
[18]	train-rmspe:0.667680	eval-rmspe:0.666228
[19]	train-rmspe:0.631038	eval-rmspe:0.629169
[20]	train-rmspe:0.595041	eval

training data processed
Train xgboost model
Train a XGBoost model
Validating
RMSPE: 0.094526
Make predictions on the test set


[1199]	train-rmspe:0.103954	eval-rmspe:0.094526
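The metric in the log is RMSPE, the root mean square of the relative errors, and the model is trained on `log1p(Sales)` with predictions mapped back through `expm1`. As a small worked check, the sketch below evaluates the same `rmspe` function on three invented values and confirms that the two transforms invert each other. The arrays are illustrative only.

```python
import numpy as np

def rmspe(y, yhat):
    # Root mean square percentage error; y must not contain zeros
    return np.sqrt(np.mean(((y - yhat) / y) ** 2))

# Invented sales and predictions: relative errors of -10%, +10% and 0%
y_true = np.array([100.0, 200.0, 400.0])
y_pred = np.array([110.0, 180.0, 400.0])

score = rmspe(y_true, y_pred)
print("RMSPE: {:.6f}".format(score))  # sqrt(0.02 / 3), about 0.081650

# The training target is log1p(Sales); expm1 recovers the original scale
y_log = np.log1p(y_true)
print(np.allclose(np.expm1(y_log), y_true))  # True
```

Because each error is divided by the true value, a row with `Sales == 0` would make the metric undefined, which is consistent with the notebook keeping only rows with positive sales for training.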


kaggle-results: 0.11597  
train-rmspe: 0.103954  
eval-rmspe: 0.094526