<h2>Kaggle Challenge : Predict Future Sales - Notebook</h2>
link : https://www.kaggle.com/c/competitive-data-science-predict-future-sales

In [96]:
import os
import pandas as pd
import numpy as np
from scipy.stats import uniform, randint
from sklearn.metrics import auc, accuracy_score, confusion_matrix, mean_squared_error
from sklearn.model_selection import cross_val_score, GridSearchCV, KFold, RandomizedSearchCV, train_test_split
import xgboost as xgb

<h4>Common methods used for model evaluation / hyperparameter optimisation</h4>
This function takes the results of a RandomizedSearchCV/GridSearchCV run (its cv_results_) and reports the scores of the 'n_top' best-ranked models, along with their parameters.

In [97]:
def report_best_scores(results, n_top=3):
    for i in range(1, n_top + 1):
        candidates = np.flatnonzero(results['rank_test_score'] == i)
        for candidate in candidates:
            print("Model with rank: {0}".format(i))
            print("Mean validation score: {0:.3f} (std: {1:.3f})".format(
                  results['mean_test_score'][candidate],
                  results['std_test_score'][candidate]))
            print("Parameters: {0}".format(results['params'][candidate]))
            print("")

<h4>Initializing dataframes</h4>


In [98]:
# Data repositories
data_folder = "data"
results_folder = "results"

if not os.path.exists(data_folder):
    os.makedirs(data_folder)
    
if not os.path.exists(results_folder):
    os.makedirs(results_folder)

# Filenames
shops_filename = "shops.csv"
items_filename = "items.csv"
item_categories_filename = "item_categories.csv"
train_filename = "sales_train.csv.gz"
eval_filname = "test.csv.gz"

In [99]:
# Reading files
shops = pd.read_csv(data_folder+"/"+shops_filename)
items = pd.read_csv(data_folder+"/"+items_filename)
item_categories = pd.read_csv(data_folder+"/"+item_categories_filename)
train = pd.read_csv(data_folder+"/"+train_filename, compression='gzip')
eval_df = pd.read_csv(data_folder+"/"+eval_filname, compression='gzip')

<h2>Monthly prediction</h2>
<h5>In this first method, I aggregate the sales of every product on a monthly basis and predict the number of items sold per shop, per month</h5>
<h3>Preprocessing of the train / test dataset</h3>
<h4>Training dataset</h4>
Shape : [date, date_block_num, shop_id, item_id, item_price, item_cnt_day]

In [113]:
# Aggregate the daily rows by (date_block_num, shop_id, item_id): mean price and total units sold
grouper = train.groupby(['date_block_num', 'shop_id', 'item_id'], as_index=False).agg({'item_price': 'mean', 'item_cnt_day': 'sum'})
# Features
X = grouper.loc[:, grouper.columns != 'item_cnt_day']
# Target: Series.rename takes the new name directly (a dict would relabel the index instead)
Y = grouper['item_cnt_day'].rename('item_cnt_month')

In [114]:
X.head()

   date_block_num  shop_id  item_id  item_price
0               0        0       32       221.0
1               0        0       33       347.0
2               0        0       35       247.0
3               0        0       43       221.0
4               0        0       51       128.5
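To make the aggregation step concrete, here is a minimal sketch on made-up data (the `toy` frame and its values are invented for illustration, not taken from the competition files). It assumes the monthly target is the total number of units sold in the month, so daily counts are summed while prices are averaged.

```python
import pandas as pd

# Toy daily sales rows (values are made up for illustration)
toy = pd.DataFrame({
    "date_block_num": [0, 0, 0, 1],
    "shop_id":        [5, 5, 5, 5],
    "item_id":        [32, 32, 33, 32],
    "item_price":     [200.0, 240.0, 350.0, 220.0],
    "item_cnt_day":   [2.0, 1.0, 1.0, 4.0],
})

# One row per (month, shop, item): mean price, summed daily counts
monthly = toy.groupby(["date_block_num", "shop_id", "item_id"], as_index=False).agg(
    {"item_price": "mean", "item_cnt_day": "sum"}
)
monthly = monthly.rename(columns={"item_cnt_day": "item_cnt_month"})
print(monthly)
```

The two daily rows for item 32 in month 0 collapse into a single monthly row.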


In [119]:
# Take the month number modulo 12 to capture seasonality
# (work on an explicit copy so that pandas does not raise SettingWithCopyWarning)
X = X.copy()
X['date_block_num'] = X['date_block_num'] % 12
  


<h4>Testing dataset</h4>
Shape : [shop_id, item_id]
We need to attach a price to each item and add a 'date_block_num' column holding the month number of the predictions: 34.

In [85]:
# Add month number column (taken modulo 12, like the training data)
eval_df['date_block_num'] = 34 % 12
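A short sketch of what the modulo does to the month index. It assumes that block 0 corresponds to January 2013, as described on the competition's data page; `block_to_month` is a hypothetical helper introduced only for this illustration.

```python
def block_to_month(date_block_num):
    """Return (year, month) for a consecutive month index.

    Assumption: block 0 is January 2013.
    """
    return 2013 + date_block_num // 12, date_block_num % 12 + 1


# Blocks that share the same value modulo 12 fall in the same calendar month
for block in (0, 11, 12, 33, 34):
    print(block, block % 12, block_to_month(block))
```

Under that assumption, block 34 maps to November 2015, and `34 % 12` gives 10, the same value as the two earlier Novembers (blocks 10 and 22).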

In [86]:
# Create a ["shop_id", "item_id", "item_price"] dataframe to join with test dataset
prices = train[['shop_id', 'item_id', 'item_price']].drop_duplicates()
prices = prices.groupby(['shop_id', 'item_id'], as_index=False).agg({'item_price':'mean'})

In [87]:
# Add prices 
eval_df_with_prices = pd.merge(eval_df, prices, on=['shop_id', 'item_id'], how = 'left')

For the ('shop_id', 'item_id') pairs that don't appear in the training dataset, we fall back to the average price of the same 'item_id' across the other rows.

In [88]:
eval_df_with_prices['item_price'] = eval_df_with_prices.groupby('item_id')['item_price'].transform(lambda x: x.fillna(x.mean()))
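The group-wise fill can be checked on a tiny made-up frame (the `toy` data below is invented for illustration). Note that an item with no known price at all keeps its NaN, because the mean of an all-NaN group is itself NaN.

```python
import numpy as np
import pandas as pd

# Toy test rows: item 10 has two known prices, item 20 has none
toy = pd.DataFrame({
    "shop_id":    [1, 2, 3, 1],
    "item_id":    [10, 10, 10, 20],
    "item_price": [100.0, np.nan, 200.0, np.nan],
})

# Replace missing prices by the mean price of the same item_id
toy["item_price"] = toy.groupby("item_id")["item_price"].transform(
    lambda s: s.fillna(s.mean())
)
print(toy)
```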

For the remaining rows without 'item_price', the only options left are the mean price of all items or the mean price of all items in the same category.
<h4>Method 1 : Average price</h4>

In [89]:
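# Fill the remaining gaps with the mean of the known prices (fillna('mean') would insert the string 'mean')
eval_df_with_prices['item_price'] = eval_df_with_prices['item_price'].fillna(eval_df_with_prices['item_price'].mean())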
# We make sure that no rows are left with an empty 'item_price'
eval_df_with_prices.count()

Unnamed: 0,ID,shop_id,item_id,date_block_num,item_price
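The other option mentioned above, the mean price per category, is not implemented in this notebook. A possible sketch follows, assuming the items table exposes 'item_id' and 'item_category_id' columns; `items_toy` and `eval_toy` are made-up stand-ins for `items` and `eval_df_with_prices`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for `items` and `eval_df_with_prices`
items_toy = pd.DataFrame({
    "item_id":          [10, 11, 12, 13],
    "item_category_id": [40, 40, 55, 60],
})
eval_toy = pd.DataFrame({
    "item_id":    [10, 11, 12, 13],
    "item_price": [300.0, np.nan, np.nan, 100.0],
})

# Attach the category of each item
merged = eval_toy.merge(items_toy, on="item_id", how="left")

# Fill with the mean price of the item's category when one is known
merged["item_price"] = merged.groupby("item_category_id")["item_price"].transform(
    lambda s: s.fillna(s.mean())
)

# Categories with no known price at all fall back to the overall mean
merged["item_price"] = merged["item_price"].fillna(merged["item_price"].mean())
print(merged)
```

Item 11 takes the price of its category (40), while item 12, whose category has no known price, ends up with the overall mean.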


<h3>Model Validation / Optimization</h3>

In [91]:
# Splitting training df into 'train' and 'test' df
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.30, random_state=0)

In [92]:
xgb_model = xgb.XGBRegressor()

# Hyperparameter tuning
params = {
    "colsample_bytree": uniform(0.7, 0.3),
    "gamma": uniform(0, 0.5),
    "learning_rate": uniform(0.03, 0.3), # default 0.1 
    "max_depth": randint(2, 6), # default 3
    "n_estimators": randint(100, 150), # default 100
    "subsample": uniform(0.6, 0.4)
}

search = RandomizedSearchCV(xgb_model, param_distributions=params, random_state=42, n_iter=200, cv=3, verbose=1, n_jobs=-1, return_train_score=True)

search.fit(X_train, y_train)

Fitting 3 folds for each of 200 candidates, totalling 600 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed: 16.9min


KeyboardInterrupt: 
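When reading the search space, keep in mind that scipy's `uniform(loc, scale)` samples from [loc, loc + scale], not from [loc, scale], and that `randint(low, high)` excludes `high`. A small sketch to check the ranges actually explored by three of the distributions used above:

```python
from scipy.stats import randint, uniform

colsample = uniform(0.7, 0.3)        # loc=0.7, scale=0.3 -> values in [0.7, 1.0]
learning_rate = uniform(0.03, 0.3)   # values in [0.03, 0.33]
max_depth = randint(2, 6)            # integers 2, 3, 4, 5 (6 is excluded)

# ppf(0) and ppf(1) give the lower and upper bounds of a continuous distribution
print(colsample.ppf([0, 1]))
print(learning_rate.ppf([0, 1]))

# Draw samples to see which integer depths can occur
depth_samples = max_depth.rvs(size=1000, random_state=0)
print(sorted(set(depth_samples)))
```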

In [None]:
report_best_scores(search.cv_results_)
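Since no `scoring` argument is passed to the search, the scores reported here are the estimator's default score, which for a regressor is R². The competition itself is evaluated with RMSE, so it can be useful to compute that metric on the hold-out split as well. A minimal sketch with made-up arrays; in the notebook the inputs would be `y_test` and, for instance, `search.best_estimator_.predict(X_test)`:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up targets and predictions, for illustration only
y_true = np.array([1.0, 0.0, 2.0, 3.0])
y_pred = np.array([1.0, 1.0, 2.0, 1.0])

# RMSE is the square root of the mean squared error
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
print("RMSE: {0:.3f}".format(rmse))
```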

<h3>Evaluation</h3>
We can now pick the model with the best results from the list given by 'report_best_scores', train it with the entire training data and make the predictions for the submission.

In [None]:
xgb_model = xgb.XGBRegressor(colsample_bytree=0.8045997961875188, gamma=0.04808827554571038, learning_rate=0.31215697934688114, max_depth=5, n_estimators=138, subsample=0.9746919954946938)
xgb_model.fit(X,Y)

In [None]:
# Make the predictions - the columns must be in the same order as the ones used for training
predictions = xgb_model.predict(eval_df_with_prices[['date_block_num', 'shop_id', 'item_id', 'item_price']])

In [None]:
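# predict() returns a NumPy array, which has no to_csv(): wrap it in a DataFrame first
pd.DataFrame({'ID': eval_df_with_prices['ID'], 'item_cnt_month': predictions}).to_csv(results_folder+"/"+"results", index=False)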