# Training and Testing
This notebook trains a LightGBM model on the prepared training file and then generates the predictions to submit to Kaggle.

In [126]:
import numpy as np
import pandas as pd

training_data = pd.read_csv('data/training_data.csv')
print('number of training records:', len(training_data))
training_data.head()

number of training records: 889869


Unnamed: 0,year,month,shop_id,shop_t,shop_t-1,shop_t-11,categ_id,categ_t,categ_t-1,categ_t-11,item_id,t,t-1,t-2,t-5,t-11,t+1
0,2014,0,2,890.0,1322.0,488.0,40,76.0,93.0,40.0,32,1.0,0.0,0.0,0.0,0.0,0.0
1,2014,0,2,890.0,1322.0,488.0,37,44.0,55.0,21.0,33,1.0,1.0,2.0,0.0,0.0,0.0
2,2014,0,2,890.0,1322.0,488.0,37,44.0,55.0,21.0,99,1.0,0.0,0.0,0.0,0.0,0.0
3,2014,0,2,890.0,1322.0,488.0,73,4.0,3.0,10.0,482,2.0,1.0,2.0,0.0,1.0,1.0
4,2014,0,2,890.0,1322.0,488.0,73,4.0,3.0,10.0,485,1.0,1.0,0.0,0.0,0.0,1.0


## Deciding Which Features and Records To Use
The first step is to pick which features to train the model with.  We experiment with different subsets of the features available in the training file; the uncommented set below (set E) is the one used in this run.

In [127]:
# Training set A
#features_to_use = ['year','month','shop_id','categ_id','item_id','t']
#categ_features = ['year','month','shop_id','categ_id','item_id']

# Training set B
#features_to_use = ['year','month','shop_id','categ_id','t','t-1','t-2','t-5','t-11']
#categ_features = ['year','month','shop_id','categ_id']

# Training set C
#features_to_use = ['year','month','shop_t','shop_t-1','shop_t-11','categ_id','t','t-1','t-2','t-5','t-11']
#categ_features = ['year','month','categ_id']

# Training set D
#features_to_use = ['year','month','shop_t','shop_t-1','shop_t-11','categ_t','categ_t-1','categ_t-11','t','t-1','t-2','t-5','t-11']
#categ_features = ['year','month']

# Training set E
features_to_use = ['year','month','shop_id','categ_id','t','t-1','t-2','t-11']
categ_features = ['year','month','shop_id','categ_id']
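
Switching between feature sets by commenting and uncommenting lines makes it easy to leave the two lists out of step. As a minimal sketch, the sets could instead be kept in a dictionary and checked before use. The names `FEATURE_SETS` and `select_feature_set` are hypothetical and not part of the notebook; only set E is copied from the cell above.

```python
# Hypothetical registry of feature sets; only set E is copied from the notebook.
FEATURE_SETS = {
    "E": {
        "features": ['year', 'month', 'shop_id', 'categ_id', 't', 't-1', 't-2', 't-11'],
        "categorical": ['year', 'month', 'shop_id', 'categ_id'],
    },
}

def select_feature_set(name, available_columns):
    """Return (features, categorical) after checking that they are consistent."""
    chosen = FEATURE_SETS[name]
    features, categorical = chosen["features"], chosen["categorical"]
    # every selected feature must exist in the training file
    missing = [f for f in features if f not in available_columns]
    if missing:
        raise ValueError(f"features not in the training file: {missing}")
    # every categorical feature must also be a selected feature
    stray = [c for c in categorical if c not in features]
    if stray:
        raise ValueError(f"categorical features not selected: {stray}")
    return features, categorical

# Column names as shown in the training file preview
training_columns = ['year', 'month', 'shop_id', 'shop_t', 'shop_t-1', 'shop_t-11',
                    'categ_id', 'categ_t', 'categ_t-1', 'categ_t-11',
                    'item_id', 't', 't-1', 't-2', 't-5', 't-11', 't+1']

features_to_use, categ_features = select_feature_set("E", training_columns)
print(features_to_use)
print(categ_features)
```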

We will also experiment with including and excluding records with negative targets.  In this run they are excluded.

In [128]:
print('number of training records with negative targets:', len(training_data.loc[training_data['t+1'] < 0.0]))
training_data = training_data.loc[training_data['t+1'] >= 0.0]

number of training records with negative targets: 397
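
To make the include/exclude experiment a one-line change, the filter could be driven by a flag. The sketch below uses toy data in place of `training_data`; the flag name `EXCLUDE_NEGATIVE_TARGETS` is an assumption, and only the `t+1` column name comes from the notebook.

```python
import pandas as pd

# Toy stand-in for training_data; the values are made up for illustration.
toy = pd.DataFrame({'t': [1.0, 2.0, 0.0, 3.0],
                    't+1': [0.0, -1.0, 2.0, -2.0]})

EXCLUDE_NEGATIVE_TARGETS = True  # hypothetical switch for the experiment

# boolean mask marking the records whose target is negative
negative_mask = toy['t+1'] < 0.0
print('records with negative targets:', int(negative_mask.sum()))

# keep everything, or drop the masked records, depending on the switch
filtered = toy.loc[~negative_mask] if EXCLUDE_NEGATIVE_TARGETS else toy
print('records kept:', len(filtered))
```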


Separate the target variable from the features.

In [129]:
# separate the target variable
y = training_data['t+1']

# extract the intended features (input variables) 
X = training_data[features_to_use]
X.head()

Unnamed: 0,year,month,shop_id,categ_id,t,t-1,t-2,t-11
0,2014,0,2,40,1.0,0.0,0.0,0.0
1,2014,0,2,37,1.0,1.0,2.0,0.0
2,2014,0,2,37,1.0,0.0,0.0,0.0
3,2014,0,2,73,2.0,1.0,2.0,1.0
4,2014,0,2,73,1.0,1.0,0.0,0.0


## Training the Model
We will split the training data into two sets: one for training and one for validation.  It will be an 80/20 training/validation split.

In [130]:
from sklearn.model_selection import train_test_split

# Split the inputs and targets into training and validation sets
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size = 0.2, random_state = 42)

# Show the results of the split
print("Training set has {} samples.".format(X_train.shape[0]))
print("Validation set has {} samples.".format(X_valid.shape[0]))

Training set has 711577 samples.
Validation set has 177895 samples.
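
The two sample counts can be reproduced by hand from the numbers printed earlier. The short check below assumes that the validation share is rounded up to a whole record and that the training set receives the remainder, which is consistent with the output above.

```python
import math

total_records = 889869    # printed when the training file was loaded
negative_targets = 397    # records removed before the split
test_size = 0.2

n_samples = total_records - negative_targets
# assumption: the validation share is rounded up to a whole record
n_valid = math.ceil(test_size * n_samples)
n_train = n_samples - n_valid

print(n_samples, n_train, n_valid)
```

The result matches the 711577 training samples and 177895 validation samples reported by the cell.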


To evaluate the performance of the model, we need to calculate the root mean squared error (RMSE) of the predictions.

In [131]:
from sklearn.metrics import mean_squared_error

# calculates the root mean squared error
def calc_RMSE(actuals, predictions):
    return np.sqrt(mean_squared_error(actuals, predictions))

# custom evaluation metric used for early stopping;
# returns (metric name, value, is_higher_better)
def calc_RMSE_cb(actuals, predictions):
    return 'RMSE', np.sqrt(mean_squared_error(actuals, predictions)), False
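
As a small worked example of the metric, the snippet below computes the RMSE by hand with NumPy on four made-up values. It is only an illustration of the formula, not part of the notebook's pipeline.

```python
import numpy as np

# Made-up values for illustration
actuals = np.array([0.0, 1.0, 2.0, 4.0])
predictions = np.array([0.0, 2.0, 2.0, 2.0])

# errors are 0, -1, 0, 2, so the squared errors are 0, 1, 0, 4
mse = np.mean((actuals - predictions) ** 2)   # mean squared error
rmse = np.sqrt(mse)                           # root mean squared error

print('MSE:', mse)
print('RMSE:', rmse)
```

Because the errors are squared before averaging, the single error of 2 contributes four times as much as the error of 1.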

To find good hyperparameters for LightGBM, we can use GridSearchCV.  It was used initially to tune the learning rate (learning_rate), the number of estimators (n_estimators), the maximum depth of the trees (max_depth) and the maximum number of leaves for the base learners (num_leaves).  However, each run took so long that the search was abandoned in favour of manually tweaking the hyperparameters.

In [132]:
from lightgbm.sklearn import LGBMRegressor
from sklearn.metrics import make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import ShuffleSplit

# run grid search to find better hyperparameters
def search_for_best_model(inputs, targets):
    cv_sets = ShuffleSplit(n_splits=10, test_size=0.2, train_size=None, random_state=42)
    regressor = LGBMRegressor(learning_rate=0.1, n_estimators=150, reg_lambda=0.0001, n_jobs=4)
    #params = {'max_depth':[50, 75, 100],'num_leaves':[1250, 1500, 1750], 'learning_rate':[0.1, 0.01, 0.001]}
    #params = {'max_depth':[50, 75, 100],'num_leaves':[1250, 1500, 1750], 'n_estimators':[100, 150, 200]}
    params = {'max_depth':[70, 75, 80],'num_leaves':[1200, 1250, 1300]}
    scoring_fnc = make_scorer(calc_RMSE, greater_is_better=False)
    grid = GridSearchCV(regressor, param_grid=params, scoring=scoring_fnc, cv=cv_sets)
    grid = grid.fit(inputs, targets, categorical_feature=categ_features)
    return grid.best_estimator_

#best_model = search_for_best_model(X_train, y_train)
#print("Parameter 'learning_rate' is {} for the optimal model.".format(best_model.get_params()['learning_rate']))
#print("Parameter 'n_estimators' is {} for the optimal model.".format(best_model.get_params()['n_estimators']))
#print("Parameter 'max_depth' is {} for the optimal model.".format(best_model.get_params()['max_depth']))
#print("Parameter 'num_leaves' is {} for the optimal model.".format(best_model.get_params()['num_leaves']))
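
A rough count of the model fits shows why the grid search is slow. The sketch below multiplies the number of parameter combinations in the active grid by the number of cross-validation splits; both values are taken from the cell above, and the variable names are only illustrative.

```python
from itertools import product

# Active parameter grid and split count from search_for_best_model
param_grid = {'max_depth': [70, 75, 80], 'num_leaves': [1200, 1250, 1300]}
n_splits = 10

# every combination of values is fitted once per cross-validation split
combinations = list(product(*param_grid.values()))
n_fits = len(combinations) * n_splits

print(len(combinations), 'combinations x', n_splits, 'splits =', n_fits, 'fits')
```

Each of those fits trains a 150-estimator model on 80% of the data passed in, and by default GridSearchCV refits the best combination once more on all of it. The commented-out three-parameter grids would need 27 combinations, or 270 fits.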


It's time to initialize a regressor with the right hyperparameters and then train the model.  The training will stop when the validation RMSE does not improve in 5 rounds.  We should also look at the training RMSE of the trained model.

In [133]:
# initialize the regressor with good hyperparameters
regressor = LGBMRegressor(max_depth=75, num_leaves=1250, \
    learning_rate=0.1, n_estimators=150, reg_lambda=0.0001, n_jobs=4)

# create the validation set for early stopping
eval_set=[(X_valid, y_valid)]

# train the model on the training set with early stopping
regressor_model = regressor.fit(X_train, y_train, categorical_feature=categ_features, \
    eval_set=eval_set, eval_metric=calc_RMSE_cb, early_stopping_rounds=5)

# get the predictions on the trained model
predictions_train = regressor_model.predict(X_train)

# display the RMSE of the predictions
print('Training RMSE:', calc_RMSE(y_train, predictions_train))



[1]	valid_0's l2: 40.7462	valid_0's RMSE: 6.38327
Training until validation scores don't improve for 5 rounds.
[2]	valid_0's l2: 36.0927	valid_0's RMSE: 6.00772
[3]	valid_0's l2: 31.9585	valid_0's RMSE: 5.65319
[4]	valid_0's l2: 28.9799	valid_0's RMSE: 5.3833
[5]	valid_0's l2: 26.4463	valid_0's RMSE: 5.14259
[6]	valid_0's l2: 24.0377	valid_0's RMSE: 4.90282
[7]	valid_0's l2: 22.0894	valid_0's RMSE: 4.69994
[8]	valid_0's l2: 20.494	valid_0's RMSE: 4.52703
[9]	valid_0's l2: 19.1642	valid_0's RMSE: 4.3777
[10]	valid_0's l2: 18.228	valid_0's RMSE: 4.26942
[11]	valid_0's l2: 17.2472	valid_0's RMSE: 4.15297
[12]	valid_0's l2: 16.5446	valid_0's RMSE: 4.0675
[13]	valid_0's l2: 16.0544	valid_0's RMSE: 4.0068
[14]	valid_0's l2: 15.5956	valid_0's RMSE: 3.94912
[15]	valid_0's l2: 15.3475	valid_0's RMSE: 3.91758
[16]	valid_0's l2: 15.091	valid_0's RMSE: 3.88472
[17]	valid_0's l2: 14.6751	valid_0's RMSE: 3.83081
[18]	valid_0's l2: 14.3955	valid_0's RMSE: 3.79414
[19]	valid_0's l2: 14.3042	valid_0's 
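
The stopping rule described above ("stop when the validation RMSE does not improve in 5 rounds") can be illustrated with a few lines of plain Python. This is a sketch of the rule only, not LightGBM's implementation; the function name `best_round_with_patience` and the scores are hypothetical and are not taken from the log.

```python
def best_round_with_patience(scores, patience=5):
    """Return (best_round, best_score, stopped_at) using 1-based round numbers.

    Lower scores are better. Training stops once `patience` rounds have
    passed without improving on the best score seen so far.
    """
    best_round, best_score = 0, float('inf')
    for round_number, score in enumerate(scores, start=1):
        if score < best_score:
            # new best: remember it and keep going
            best_round, best_score = round_number, score
        elif round_number - best_round >= patience:
            # no improvement for `patience` rounds: stop here
            return best_round, best_score, round_number
    # ran out of rounds without triggering the rule
    return best_round, best_score, len(scores)

# Hypothetical validation RMSE values, one per boosting round
scores = [6.4, 6.0, 5.7, 5.5, 5.6, 5.6, 5.7, 5.8, 5.9, 6.0]
print(best_round_with_patience(scores, patience=5))
```

Here the best score occurs at round 4 and training stops at round 9, five rounds later. In LightGBM 4.0 and later, the same setting is passed through the `lightgbm.early_stopping` callback rather than the `early_stopping_rounds` argument of `fit`.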

## Generating the Submission File
After training the model, it's time to generate the submission file so that Kaggle can evaluate the performance of the model.  The first step is to load the provided test file, which lists the shop and item pairs the competition requires predictions for.

In [134]:
test_file = pd.read_csv('data/provided/test.csv')
print(test_file.shape)
test_file.head()

(214200, 3)


Unnamed: 0,ID,shop_id,item_id
0,0,5,5037
1,1,5,5320
2,2,5,5233
3,3,5,5232
4,4,5,5268


Using a saved file from the data preparation process, we merge in the features from the records that match date block number 33 (since we want to predict for date block number 34) and the shop and item pairings in the test file.  Pairings with no record in date block 33 come out of the left merge with NaN features.

In [135]:
# this file has the same data as the training file but includes records for date block number 33
monthly_totals_all = pd.read_csv('data/td_all.csv')

test_file['date_block_num'] = 33
test_data = test_file.merge(monthly_totals_all, on=['date_block_num','shop_id','item_id'], how='left')
print(test_data.shape)
test_data.head()

(214200, 19)


Unnamed: 0,ID,shop_id,item_id,date_block_num,t,t-1,t-2,t-5,t-11,t+1,categ_id,categ_t,categ_t-1,categ_t-11,shop_t,shop_t-1,shop_t-11,year,month
0,0,5,5037,33,,,,,,,,,,,,,,,
1,1,5,5320,33,,,,,,,,,,,,,,,
2,2,5,5233,33,1.0,3.0,1.0,3.0,0.0,0.0,19.0,75.0,110.0,77.0,1052.0,1092.0,1445.0,2015.0,9.0
3,3,5,5232,33,,,,,,,,,,,,,,,
4,4,5,5268,33,,,,,,,,,,,,,,,
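
The NaN rows in the output come from the left merge: every row of the test file is kept, and pairings with no matching record in date block 33 get missing values. The toy example below reproduces this behaviour. The frame names `test_pairs` and `history` are made up, and the history values are illustrative apart from item 5233's `t` of 1.0.

```python
import pandas as pd

# Three shop/item pairs in the style of the test file
test_pairs = pd.DataFrame({'ID': [0, 1, 2],
                           'shop_id': [5, 5, 5],
                           'item_id': [5037, 5320, 5233]})

# Toy history: item 5233 has a record in block 33, item 5037 only in block 32
history = pd.DataFrame({'date_block_num': [33, 32],
                        'shop_id': [5, 5],
                        'item_id': [5233, 5037],
                        't': [1.0, 4.0]})

test_pairs['date_block_num'] = 33
merged = test_pairs.merge(history, on=['date_block_num', 'shop_id', 'item_id'], how='left')

print(merged)
print('pairs without a block-33 record:', int(merged['t'].isna().sum()))
```

The merged frame keeps all three rows, and only item 5233 receives a value for `t`.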


We add the year and month fields, replace any NaN values, and extract the same features as the ones we used for training the model.

In [136]:
test_data['year'] = 2015
test_data['month'] = 9
test_data.fillna(0.0, inplace=True)
test_data = test_data[features_to_use]
test_data.head()

Unnamed: 0,year,month,shop_id,categ_id,t,t-1,t-2,t-11
0,2015,9,5,0.0,0.0,0.0,0.0,0.0
1,2015,9,5,0.0,0.0,0.0,0.0,0.0
2,2015,9,5,19.0,1.0,3.0,1.0,0.0
3,2015,9,5,0.0,0.0,0.0,0.0,0.0
4,2015,9,5,0.0,0.0,0.0,0.0,0.0


Finally, we generate the predictions and create the submission file.

In [137]:
# make predictions
preds_submissions = regressor_model.predict(test_data)

# format the data according to the format in the provided sample submission file
# note: according to the rules of the competition, the true targets are clipped
# to the range [0, 20], so the predictions are clipped the same way
submissions = pd.DataFrame({
    "ID": test_file["ID"],
    "item_cnt_month": preds_submissions.clip(0.0, 20.0)
})

# write out submission file
submissions.to_csv("data/submission.csv", index=False)
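
Before uploading, it can be worth checking that the submission has the expected shape and value range. The helper below is a hypothetical sketch run on three made-up predictions; for the real file the expected row count would be the 214200 rows of the test file.

```python
import numpy as np
import pandas as pd

# Made-up raw predictions, including values outside the allowed range
raw_predictions = np.array([-0.3, 0.7, 25.0])
submission = pd.DataFrame({'ID': [0, 1, 2],
                           'item_cnt_month': raw_predictions.clip(0.0, 20.0)})

def check_submission(frame, expected_rows):
    """Raise AssertionError if the frame does not look like a valid submission."""
    assert list(frame.columns) == ['ID', 'item_cnt_month'], 'unexpected columns'
    assert len(frame) == expected_rows, 'unexpected number of rows'
    assert frame['item_cnt_month'].notna().all(), 'missing predictions'
    assert frame['item_cnt_month'].between(0.0, 20.0).all(), 'values outside [0, 20]'

check_submission(submission, expected_rows=3)
print(submission)
```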