Version 0.1-26.11.2017

Update 14.11.2017

Update 15.12.2017

Update 17.12.2017

Update 20.12.2017 v0.1 - changed lgb parameters

Update 20.12.2017 v0.2 - introduced early stopping

# Final project: predict future sales

This challenge serves as the final project for the "How to win a data science competition" Coursera course.
In this competition you will work with a challenging time-series dataset consisting of daily sales data, kindly provided by one of the largest Russian software firms - 1C Company. 

We are asking you to predict total sales for every product and store in the next month. By solving this competition you will be able to apply and enhance your data science skills.

In [None]:
import pandas as pd
import numpy as np
import os
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import r2_score, mean_squared_error
%matplotlib inline 

In [None]:
DATA_FOLDER = '../readonly/final_project_data/'

sales    = pd.read_csv(os.path.join(DATA_FOLDER, 'sales_train.csv.gz'))
items           = pd.read_csv(os.path.join(DATA_FOLDER, 'items.csv'))
item_categories = pd.read_csv(os.path.join(DATA_FOLDER, 'item_categories.csv'))
shops           = pd.read_csv(os.path.join(DATA_FOLDER, 'shops.csv'))
train           = pd.read_csv(os.path.join(DATA_FOLDER, 'sales_train.csv.gz'), compression='gzip')
test           = pd.read_csv(os.path.join(DATA_FOLDER, 'test.csv.gz'), compression='gzip')

# EDA

<ol start="0">
  <li><b>Print the shape of the loaded dataframes </b></li>
</ol>

In [None]:
# np.str was removed from recent NumPy releases; the built-in str does the same job
print ('sales shape %s' % str(sales.shape))
print ('items shape %s' % str(items.shape))
print ('item_categories shape %s' % str(item_categories.shape))
print ('shops shape %s' % str(shops.shape))
print ('train shape %s' % str(train.shape))
print ('test shape %s' % str(test.shape))

In [None]:
sales.head()

In [None]:
items.head()

In [None]:
item_categories.head()

In [None]:
shops.head()

In [None]:
train.head()

OK, `train` and `sales` hold the same data: both are loaded from `sales_train.csv.gz`.

In [None]:
test.head()

In [None]:
len(train.shop_id.unique())

In [None]:
len(test.shop_id.unique())

In [None]:
len(train.item_id.unique())

In [None]:
len(test.item_id.unique())

GAIO NOTE - The test set includes only a subset of the shops & items. I need to find out how that selection was made before I can create the predictions.
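
A minimal sketch of how that overlap could be checked. The two toy frames and the `coverage` helper are hypothetical stand-ins for the real `train` and `test` data, so the numbers say nothing about the actual competition files.

```python
import pandas as pd

# Toy stand-ins for the real `train` and `test` frames (hypothetical values).
train_toy = pd.DataFrame({"shop_id": [0, 0, 1, 2], "item_id": [10, 11, 10, 12]})
test_toy = pd.DataFrame({"shop_id": [0, 1, 1], "item_id": [10, 10, 13]})


def coverage(train_df, test_df, col):
    """Return the sorted test values of `col` that never appear in train."""
    return sorted(set(test_df[col].tolist()) - set(train_df[col].tolist()))


unseen_shops = coverage(train_toy, test_toy, "shop_id")
unseen_items = coverage(train_toy, test_toy, "item_id")

# (shop_id, item_id) pairs requested in test that never occur together in train.
pairs = test_toy.merge(
    train_toy.drop_duplicates(),
    on=["shop_id", "item_id"],
    how="left",
    indicator=True,
)
unseen_pairs = pairs.loc[pairs["_merge"] == "left_only", ["shop_id", "item_id"]]

print("unseen shops:", unseen_shops)
print("unseen items:", unseen_items)
print("unseen pairs:", len(unseen_pairs))
```

Running the same helper on the real frames would show whether the test set asks for items or shop/item pairs with no sales history at all.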

### Let's start manipulating & enriching the data 

In [None]:
#Add the revenues/transaction
sales['revenues'] = sales['item_price'] * sales['item_cnt_day']

In [None]:
# transform the dates splitting day/month/year
sales['f_date'] = pd.to_datetime(sales['date'],format='%d.%m.%Y')
sales['year'] = sales['f_date'].dt.year
sales['month'] = sales['f_date'].dt.month
sales['day'] = sales['f_date'].dt.day
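# .dt.weekofyear was removed in pandas 2.0; isocalendar().week returns the same ISO week number
sales['week'] = sales['f_date'].dt.isocalendar().week.astype(int)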

In [None]:
#add category description to the items df
items_merge = pd.merge(left = items, right = item_categories, on = 'item_category_id')

In [None]:
print (items_merge.head())

In [None]:
#add the category to the items sold
sales_merge = pd.merge(left = sales, right = items_merge, on = 'item_id')

In [None]:
print (sales_merge.head())

In [None]:
#06.12.2017 - date_block_num is the month number
print (sales_merge.date_block_num.unique())

In [None]:
#check date range
print ('Sale from %s to %s' % (str(sales_merge.f_date.min()),str(sales_merge.f_date.max())))

2 years & 10 months of sales data. I need to predict the sales in November 2015
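
A small sketch of how `date_block_num` relates to calendar months. It assumes block 0 is January 2013, which matches the date range above but should be checked against the real data. The helper name `block_to_year_month` is made up for this example.

```python
import pandas as pd

# A few dates in the same dd.mm.yyyy format as the `date` column (toy values).
dates = pd.Series(["02.01.2013", "15.02.2013", "31.10.2015"])
parsed = pd.to_datetime(dates, format="%d.%m.%Y")

# Assumption: block 0 is January 2013, and each later month adds 1.
block = (parsed.dt.year - 2013) * 12 + (parsed.dt.month - 1)


def block_to_year_month(block_num):
    """Map a month index back to (year, month) under the same assumption."""
    return 2013 + block_num // 12, block_num % 12 + 1


print(block.tolist())
print(block_to_year_month(34))
```

Under that assumption the last training block is 33 (October 2015), and the month to predict is block 34 (November 2015).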

### Missing Data 

In [None]:
# missing values?
sales_merge.isnull().sum()

No missing data, all columns have been populated

Now I want to visualize the number of items sold over the training period.
I start by grouping by month, shop_id & item_id to reproduce the target grouping.

In [None]:
aggrYearMonthShopItem = sales_merge.groupby(['year','month','shop_id','item_id'])[['item_cnt_day']].sum()

In [None]:
aggrYearMonthShopItem.head()

In [None]:
aggrYearMonth = sales_merge.groupby(['year','month'])[['item_cnt_day']].sum()

In [None]:
aggrYearMonth.head()

In [None]:
# plot the monthly item counts; the index is the (year, month) pair
df_RevMonth = aggrYearMonth.item_cnt_day
# pass figsize to plot(): changing rcParams after plotting only affects later figures
df_RevMonth.plot(title = "Items/Month", figsize = (30,10))
plt.xlabel("month")
plt.ylabel("items")
plt.show()

In [None]:
aggrMonth = sales_merge.groupby(['date_block_num'])[['item_cnt_day']].sum()

In [None]:
# create the figure first so that the size applies to this plot
plt.figure(figsize = (20,10))
plt.plot(aggrMonth.item_cnt_day)
plt.title ("Items/Month")
plt.xlabel("month")
plt.ylabel("items")
plt.show()

GAIO NOTE - there seems to be a pattern in the behaviour across the years

I drill down to the weeks to verify the seasonality of the shop sales

In [None]:
aggrYearWeek = sales_merge.groupby(['year','week'])[['item_cnt_day']].sum()

In [None]:
aggrYearWeek.head()

In [None]:
# plot the weekly item counts; the index is the (year, week) pair
df_RevWeek = aggrYearWeek.item_cnt_day
# pass figsize to plot(): changing rcParams after plotting only affects later figures
df_RevWeek.plot(title = "items/Week", figsize = (20,10))
plt.xlabel("week")
plt.ylabel("items")
plt.show()

GAIO NOTES: 

* There seems to be some seasonality in the sales, with an indication of symmetry.
* The symmetry may indicate strong sales seasonality around festivities.
* We can notice a sharp increase followed by a sharp decrease around the spikes.
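
One way to make the year-over-year comparison explicit is to pivot the monthly totals so that each year becomes a column. The sketch below uses invented toy numbers, not the real sales, just to show the shape of the check.

```python
import pandas as pd

# Invented monthly totals for two years (toy values, not real sales).
monthly = pd.DataFrame({
    "year": [2013, 2013, 2013, 2014, 2014, 2014],
    "month": [10, 11, 12, 10, 11, 12],
    "item_cnt_day": [100, 120, 180, 90, 110, 170],
})

# One row per calendar month, one column per year.
by_year = monthly.pivot(index="month", columns="year", values="item_cnt_day")

# For each year, the month with the highest total.
peak_month = by_year.idxmax()

print(by_year)
print(peak_month)
```

If the peak falls in the same month for every year, that supports the seasonality reading; plotting `by_year` would overlay the years on a common month axis.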

GAIO NOTES: 

Let's first get the Kaggle process right.
I will create the shop grouping with the code from Wk3, fit an unoptimized model & make a submission to verify that everything works.
Afterwards I will go back to EDA etc...
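
To see what the grid step does before running it on the full data, here is a sketch on a toy frame. The `toy_sales` values are invented. For each month it builds every (shop, item) combination seen in that month and fills the combinations with no sales with 0.

```python
from itertools import product

import pandas as pd

# Invented daily sales: two shops and two items in month 0, one of each in month 1.
toy_sales = pd.DataFrame({
    "date_block_num": [0, 0, 1],
    "shop_id": [1, 2, 1],
    "item_id": [10, 11, 10],
    "item_cnt_day": [3, 1, 2],
})
keys = ["shop_id", "item_id", "date_block_num"]

# For each month, pair every shop active in that month with every item sold in it.
rows = []
for block, month in toy_sales.groupby("date_block_num"):
    rows.extend(product(month["shop_id"].unique(), month["item_id"].unique(), [block]))
toy_grid = pd.DataFrame(rows, columns=keys)

# Monthly totals per (shop, item, month), named `target` as in the notebook.
monthly = (
    toy_sales.groupby(keys, as_index=False)["item_cnt_day"].sum()
    .rename(columns={"item_cnt_day": "target"})
)

# Left join: grid rows without any sale get target 0.
toy_all = toy_grid.merge(monthly, how="left", on=keys).fillna({"target": 0})
print(toy_all)
```

The zero rows matter because the model also has to learn when a shop sells nothing of an item, which the raw transaction table never shows.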

In [None]:
from itertools import product
index_cols = ['shop_id', 'item_id', 'date_block_num']

# For every month we create a grid from all shops/items combinations from that month
grid = [] 
for block_num in sales['date_block_num'].unique():
    cur_shops = sales[sales['date_block_num']==block_num]['shop_id'].unique()
    cur_items = sales[sales['date_block_num']==block_num]['item_id'].unique()
    grid.append(np.array(list(product(*[cur_shops, cur_items, [block_num]])),dtype='int32'))

#turn the grid into pandas dataframe
grid = pd.DataFrame(np.vstack(grid), columns = index_cols,dtype=np.int32)

#get aggregated values for (shop_id, item_id, month)
#nested-dict renaming in .agg() was removed in pandas 1.0, so sum and rename explicitly
gb = sales.groupby(index_cols,as_index=False)['item_cnt_day'].sum()
gb = gb.rename(columns={'item_cnt_day':'target'})
#join aggregated data to the grid
all_data = pd.merge(grid,gb,how='left',on=index_cols).fillna(0)
#sort the data
all_data.sort_values(['date_block_num','shop_id','item_id'],inplace=True)

In [None]:
all_data.head()

# Train/Validation split

I will train on the first 33 months (date_block_num 0-32) & validate on the last month (date_block_num 33).

In [None]:
dates = all_data['date_block_num']

last_block = dates.max() 

dates_train = dates[dates <  last_block]
dates_val  = dates[dates == last_block]

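# keep only the index columns as features: leaving 'target' in X would leak the label
X_train = all_data.loc[dates <  last_block, index_cols]
X_val =  all_data.loc[dates == last_block, index_cols]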

y_train = all_data.loc[dates <  last_block, 'target'].values
y_val =  all_data.loc[dates == last_block, 'target'].values

# Modelling

In [None]:
def rmse(X,y):
    return np.sqrt(mean_squared_error(X, y))

First, clip the target to the range 0-20, the same range used for the predictions in the submission.
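
A quick sketch of why the clipping matters for RMSE. The arrays are invented and `rmse_toy` is a hypothetical helper written with plain NumPy; it only illustrates how one extreme monthly count can dominate the metric.

```python
import numpy as np


def rmse_toy(y_true, y_pred):
    """Root mean squared error computed with plain NumPy."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))


# Invented targets: three ordinary months and one extreme outlier.
y_true = np.array([0, 1, 2, 500])
# Predictions that are already limited to the 0-20 range.
y_pred = np.array([0, 1, 2, 20])

raw_error = rmse_toy(y_true, y_pred)
clipped_error = rmse_toy(np.clip(y_true, 0, 20), y_pred)

print("RMSE against raw target:     %.1f" % raw_error)
print("RMSE against clipped target: %.1f" % clipped_error)
```

Against the raw target the single outlier accounts for the whole error, while against the clipped target the same predictions are exact.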

In [None]:
y_train=np.clip(y_train,0, 20)
y_val=np.clip(y_val,0, 20)

In [None]:
import lightgbm as lgb
#model_lgb = lgb.LGBMRegressor(objective='regression',num_leaves=5,
#                              learning_rate=0.05, n_estimators=720,
#                              max_bin = 55, bagging_fraction = 0.8,
#                              bagging_freq = 5, feature_fraction = 0.2319,
#                              feature_fraction_seed=9, bagging_seed=9,
#                              min_data_in_leaf =6, min_sum_hessian_in_leaf = 11)
lgb_params = {
               'feature_fraction': 0.75,
               'metric': 'rmse',
               'nthread':1, 
               'min_data_in_leaf': 2**7, 
               'bagging_fraction': 0.75, 
               'learning_rate': 0.03, 
               'objective': 'mse', 
               'bagging_seed': 2**7, 
               'num_leaves': 2**7,
               'bagging_freq':1,
               'verbose':0
              }
lgb_train = lgb.Dataset(X_train, y_train)
lgb_eval = lgb.Dataset(X_val, y_val, reference=lgb_train)

model_lgb = lgb.train(lgb_params,                     
                      lgb_train,
                      num_boost_round=300)#,
#                      valid_sets=lgb_eval,
#                      early_stopping_rounds=5)

In [None]:
train_preds = model_lgb.predict(X_train)
rmse_train = rmse(y_train, train_preds)

val_preds = model_lgb.predict(X_val)
rmse_val = rmse(y_val, val_preds)

print('Train RMSE is %f' % rmse_train)
print('Validation RMSE is %f' % rmse_val)

In [None]:
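# retrain on all the months, with the index columns as features and the clipped target
X_all = all_data[index_cols]
y_all = np.clip(all_data['target'].values, 0, 20)
model_lgb = lgb.train(lgb_params, lgb.Dataset(X_all, label=y_all), 100)
lgb_RMSE = rmse(y_all, model_lgb.predict(X_all))
print ("RMSE: %.5f" % (lgb_RMSE))

In [None]:
# build the test features with the same columns, in the same order, as the training data;
# the month to predict is the one right after the last training month
X_test = test[['shop_id', 'item_id']].copy()
X_test['date_block_num'] = all_data['date_block_num'].max() + 1
y_test = model_lgb.predict(X_test[index_cols])
#clip the predictions to the range 0-20
out_df = pd.DataFrame({'ID': test.ID, 'item_cnt_month': np.clip(y_test,0,20)})
# any filename works for the submission file
out_df.to_csv('predict_future_prices_2117.v0.2.csv', index=False)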