# Predict future sales
In this competition you will work with a challenging time-series dataset consisting of daily sales data, kindly provided by one of the largest Russian software firms - 1C Company. 

We are asking you to predict total sales for every product and store in the next month. By solving this competition you will be able to apply and enhance your data science skills.

In [None]:
import pandas as pd
import numpy as np
import pickle
import os
import gc
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.sparse import save_npz, load_npz, hstack, vstack
from sklearn.feature_extraction.text import TfidfVectorizer
%matplotlib inline 

pd.set_option('display.max_rows', 600)
pd.set_option('display.max_columns', 50)

import lightgbm as lgb
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score, mean_squared_error
from tqdm import tqdm_notebook

from itertools import product
import warnings

In [None]:
warnings.filterwarnings('ignore', category=DeprecationWarning)

In [None]:
DATA_FOLDER = '../readonly/final_project_data/'

sales    = pd.read_csv(os.path.join(DATA_FOLDER, 'sales_train.csv.gz'))
items           = pd.read_csv(os.path.join(DATA_FOLDER, 'items.csv'))
item_categories = pd.read_csv(os.path.join(DATA_FOLDER, 'item_categories.csv'))
shops           = pd.read_csv(os.path.join(DATA_FOLDER, 'shops.csv'))
train           = pd.read_csv(os.path.join(DATA_FOLDER, 'sales_train.csv.gz'), compression='gzip')
test           = pd.read_csv(os.path.join(DATA_FOLDER, 'test.csv'))

# EDA

I check the size of the data

In [None]:
# np.str was removed from recent NumPy releases; the built-in str does the same job
print('sales shape %s' % str(sales.shape))
print('items shape %s' % str(items.shape))
print('item_categories shape %s' % str(item_categories.shape))
print('shops shape %s' % str(shops.shape))
print('train shape %s' % str(train.shape))
print('test shape %s' % str(test.shape))

A first look at the data.
sales & train have the same shape. Are they the same df?

In [None]:
sales.equals(train)

In [None]:
sales.head()

In [None]:
shops.head()

In [None]:
items.head()

In [None]:
item_categories.head()

In [None]:
test.head()

So I need to predict the shop sales, which in this case means the sales of each shop & product combination, not just the total sales of each shop.

First I add the shop & category descriptions to the sales df

In [None]:
items_merge = pd.merge(left = items, right = item_categories , left_on = 'item_category_id', right_on = 'item_category_id')

In [None]:
sales_merge = pd.merge(left = sales,right = shops, left_on ='shop_id', right_on = 'shop_id' )
sales_merge = pd.merge(left = sales_merge,right = items_merge, left_on ='item_id', right_on = 'item_id' )

In [None]:
sales_merge.head()

I check the column types

In [None]:
sales_merge.dtypes

Daily sales are stored as floats. Are there any fractional sales?

In [None]:
(sales_merge.item_cnt_day%1 != 0).any()

Are there any NaNs?

In [None]:
sales_merge.isnull().values.any()

Are there zero sales?

In [None]:
# Count the rows with zero sales (summing the zero values themselves would always give 0)
(sales_merge['item_cnt_day'] == 0).sum()

There are no occurrences of zero sales in the df

In [None]:
test.isnull().values.any()

So all cells have been populated with some values

Does each item belong just to one category?

In [None]:
len(items_merge.groupby(['item_name','item_category_name']).nunique()) == len(items_merge)

Now let's have a look how train & test are constructed

In [None]:
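# Shops that appear in test but not in train (an empty set means every test shop was seen in train)
set(test.shop_id) - set(sales_merge.shop_id)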

All shops in the test set are also in the train set. 

In [None]:
len(set(test.item_id) - set(sales_merge.item_id))

Damn! 363 items appear in the test set but have never been observed in the train set...
This may be a hint that the test set has been artificially constructed.

In [None]:
len(test.groupby(['shop_id','item_id']))

In [None]:
len(set(test.item_id)) * len(set(test.shop_id))

Hypothesis confirmed. The test set has been built by combining a set of items with a set of shops.
I can expect that there will be zero sales for many shop-item combinations.

I will have to add the zero lines with all combinations of shop & item in the train set too to align it with test set. 
Why? Because any model trained on the original train set would never predict zero sales as there are no zero sales in the training set.
The accuracy on the test set will then be very low as in this set zero sales are expected.
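
To make the idea concrete, here is a minimal sketch on made-up data (the `toy_sales` frame and its values are hypothetical, not taken from the competition files). It builds every shop & item combination for one month and fills the pairs that never sold with explicit zeros.

```python
from itertools import product

import pandas as pd

# Toy daily sales: only combinations that actually sold appear.
toy_sales = pd.DataFrame({
    'date_block_num': [0, 0, 0],
    'shop_id':        [1, 1, 2],
    'item_id':        [10, 11, 10],
    'item_cnt_day':   [2.0, 1.0, 3.0],
})

index_cols = ['shop_id', 'item_id', 'date_block_num']

# All shop x item combinations for month 0, mirroring how the test set is built.
shops_in_month = toy_sales['shop_id'].unique()
items_in_month = toy_sales['item_id'].unique()
grid = pd.DataFrame(list(product(shops_in_month, items_in_month, [0])),
                    columns=index_cols)

# Left-join the observed totals; combinations without sales become explicit zeros.
monthly = toy_sales.groupby(index_cols, as_index=False)['item_cnt_day'].sum()
full = grid.merge(monthly, on=index_cols, how='left').fillna(0)
print(full)
```

Shop 2 never sold item 11, so that pair only exists in `full`, with a count of 0.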

Now what I would like to do is try to understand whether there was a logic in the selection of the items/shops

In [None]:
add_items_test = set(test.item_id) - set(sales_merge.item_id)

In [None]:
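# Select by item_id rather than by positional index label (.loc with a set is not supported in recent pandas)
items_merge[items_merge.item_id.isin(add_items_test)].sort_values(['item_category_name'])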

In [None]:
# Select by item_id rather than by positional index label
items_merge[items_merge.item_id.isin(add_items_test)].sort_values(['item_name'])

* Items belong to one of the following categories: console games, PC games, console/PC accessories, movies, music, eBooks, software & merchandise related to games/video

* I will need to extract text features to help the model exploit the information in the product names

In [None]:
# Select by item_id rather than by positional index label
items_merge[items_merge.item_id.isin(add_items_test)].sort_values(['item_id'])

Items are ordered by their 1st letter & there seem to be 2 main groups

TO DO: Count Items by 1st letter & category & investigate further
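
A possible starting point for this TO DO, sketched on a hypothetical `toy_items` frame (the names and categories below are invented). With the real data the same two steps could be applied to the unseen test items.

```python
import pandas as pd

# Hypothetical stand-in for the unseen test items joined with their categories.
toy_items = pd.DataFrame({
    'item_name': ['Assassin Creed', 'Anno 2070', 'Batman', 'battlefield 4'],
    'item_category_name': ['PC games', 'PC games', 'Movies', 'Console games'],
})

# First character of the name, upper-cased so 'b' and 'B' fall in the same bucket.
toy_items['first_letter'] = toy_items['item_name'].str.strip().str[0].str.upper()

# Rows: first letter, columns: category, values: number of items.
letter_by_cat = pd.crosstab(toy_items['first_letter'],
                            toy_items['item_category_name'])
print(letter_by_cat)
```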

Now let's start visualizing the data

In [None]:
sales_eda = sales_merge.copy()

1. Add details about weeks, days of the week etc.

In [None]:
import datetime as dt
sales_eda['date'] = pd.to_datetime(sales_eda['date'],format='%d.%m.%Y')
sales_eda['nr_dow'] = sales_eda.date.dt.weekday
sales_eda['month'] = sales_eda.date.dt.month
sales_eda['year'] = sales_eda.date.dt.year

In [None]:
sales_eda.sort_values(['date'], inplace=True)
sales_eda.reset_index(drop=True, inplace=True)

In [None]:
sales_eda['date_block_num'].max()

2. Plot raw sales over time

In [None]:
plt.figure(figsize=(15,10))
ax = plt.subplot(211)
sales_eda.groupby('date').item_cnt_day.sum().plot(ax=ax)
ax = plt.subplot(212)
sales_eda.groupby('date_block_num').item_cnt_day.sum().plot(ax=ax)
plt.grid(True)
plt.xticks(np.arange(34))

Observations:
* spikes in December = shopping for New Year & Christmas, though Orthodox Christmas is celebrated on 7 January
* there seems to be a recurring yearly pattern
* total sales are going down over the years

Now I check the dates span

In [None]:
dates_df = pd.to_datetime(sales_eda['date'],format='%d.%m.%Y')
print ('Sale from %s to %s' % (str(dates_df.min()),str(dates_df.max())))

I want to predict the sales in November 2015.

Though I know that the test set has been artificially generated, I am going to plot Nov 2013 & Nov 2014 to check whether they can be good candidates as validation sets.

In [None]:
plt.figure(figsize=(15,10))
ax = plt.subplot(211)
sales_eda[sales_eda['date_block_num'] == 10].groupby('date').item_cnt_day.sum().plot(ax=ax)
ax = plt.subplot(212)
sales_eda[sales_eda['date_block_num'] == 22].groupby('date').item_cnt_day.sum().plot(ax=ax)
plt.grid(True)
#plt.xticks(np.arange(31))

Hmm... not really helpful

Now I am going to plot the sales per month across the years

In [None]:
plt.figure(figsize=(20,10))
ax = plt.subplot(211)
sales_eda.groupby(['year', 'month']).item_cnt_day.sum().unstack(level=0).plot(ax=ax)
plt.xlim(1, 12)
plt.xticks(range(1, 13))
plt.title('Sales per month at different years')
plt.ylabel('Sales');

* Sales are decreasing
* We can see the trend in the years

Now I look at the sales (top) and the returns (bottom) by day of the week, for each year.

In [None]:
plt.figure(figsize=(12,12))
ax = plt.subplot(211)
sales_eda[sales_eda['item_cnt_day'] >= 0].groupby(['year', 'nr_dow']).item_cnt_day.sum().unstack(level=0).plot(ax=ax)
plt.grid(True)
ax = plt.subplot(212)
sales_eda[sales_eda['item_cnt_day'] < 0].groupby(['year', 'nr_dow']).item_cnt_day.sum().unstack(level=0).plot(ax=ax)
plt.grid(True)

* Most sales happen on Saturdays
* Most returns happen on Mondays

In [None]:
plt.figure(figsize=(20,10))
ax = plt.subplot(211)
sales_eda.groupby(['year','shop_id']).item_cnt_day.sum().unstack(level=0).plot(ax=ax)
#plt.grid(True)

Interesting - shops open & close over time. I need then to add the zero sales for these cases.

Shops seem to maintain a similar sales performance over the years.

In [None]:
plt.figure(figsize=(20,10))
ax = plt.subplot(211)
sales_eda.groupby(['date_block_num','shop_id']).item_cnt_day.sum().unstack(level=0).plot(ax=ax)
#plt.grid(True)

In [None]:
plt.figure(figsize=(20,10))
ax = plt.subplot(211)
sales_eda.groupby(['year','item_category_id']).item_cnt_day.sum().unstack(level=0).plot(ax=ax)
#plt.grid(True)

There seem to be a bunch of shops & categories with very high sales

Now I look for outliers.
Shop & Category sales spikes are consistent over the years, I will then not look into them further for the time being.
I check the product sales per product id

In [None]:
plt.figure(figsize=(20,10))
ax = plt.subplot(211)
sales_eda.groupby(['date_block_num','item_category_id']).item_cnt_day.sum().unstack(level=0).plot(ax=ax)
#plt.grid(True)

same for categories - there is regularity over time

In [None]:
plt.figure(figsize=(20,10))
ax = plt.subplot(211)
item_sales = sales_eda.groupby(['year','item_id']).item_cnt_day.sum()
item_sales.unstack(level=0).plot(ax=ax)

Let's look at the product with the highest sales

In [None]:
plt.figure(figsize=(20,10))
ax = plt.subplot(211)
item_sales = sales_eda.groupby(['date_block_num','item_id']).item_cnt_day.sum()
item_sales[item_sales.values > 100].unstack(level=0).plot(ax=ax)

In [None]:
item_sales.head()

In [None]:
top_prod = item_sales[item_sales.values > 20000]

In [None]:
top_prod

In [None]:
items_merge[items_merge.item_id.isin(list(top_prod.index.get_level_values(level=1)))]

It is a shopping bag.

# Feature Generation

In [None]:
sales.head()

### Aggregate data

Since the competition task is to make a monthly prediction, we need to aggregate the data to a monthly level before doing any encodings.

In [None]:
def aggr_train(sales):
    index_cols = ['shop_id', 'item_id', 'date_block_num']
    # For every month we create a grid from all shops/items combinations from that month
    grid = [] 
    for block_num in sales['date_block_num'].unique():
        cur_shops = sales[sales['date_block_num']==block_num]['shop_id'].unique()
        cur_items = sales[sales['date_block_num']==block_num]['item_id'].unique()
        grid.append(np.array(list(product(*[cur_shops, cur_items, [block_num]])),dtype='int32'))
 
    #turn the grid into pandas dataframe
    grid = pd.DataFrame(np.vstack(grid), columns = index_cols,dtype=np.int32)
    
    #get aggregated values for (shop_id, item_id, month)
    #named aggregation gives a flat 'target' column, so no column renaming is needed
    #(the nested-dict form of agg is no longer supported by recent pandas)
    gb = sales.groupby(index_cols, as_index=False).agg(target=('item_cnt_day', 'sum'))
    #join aggregated data to the grid
    all_data = pd.merge(grid,gb,how='left',on=index_cols).fillna(0)
    #sort the data
    #all_data.sort_values(['date_block_num','shop_id','item_id'],inplace=True) 
    all_data.reset_index(inplace=True, drop=True)
    all_data.target = all_data.target.astype(np.float32)
       
    return all_data 

In [None]:
train_df = aggr_train(sales)

In [None]:
train_df.head()

Add item category ID

In [None]:
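# Map explicitly by item_id instead of relying on the row index of items matching item_id
item_to_category = items.set_index('item_id')['item_category_id']
train_df['item_category_id'] = train_df['item_id'].map(item_to_category).astype(np.int32)
test['item_category_id'] = test['item_id'].map(item_to_category).astype(np.int32)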

In [None]:
train_df.head()

### Lagged features

In [None]:
def downcast_dtypes(df):
    '''
        Changes column types in the dataframe: 
                
                `float64` type to `float32`
                `int64`   type to `int32`
    '''
    
    # Select columns to downcast
    float_cols = [c for c in df if df[c].dtype == "float64"]
    int_cols =   [c for c in df if df[c].dtype == "int64"]
    
    # Downcast
    df[float_cols] = df[float_cols].astype(np.float32)
    df[int_cols]   = df[int_cols].astype(np.int32)
    
    return df
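
A quick check of what the downcast buys, on a hypothetical `toy` frame. The casts are written inline so the snippet runs on its own. One assumption to keep in mind: every integer value must fit in the int32 range, which I believe holds for the ids and counts used here.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'shop_id': np.arange(1000, dtype=np.int64),
    'target': np.linspace(0, 20, 1000, dtype=np.float64),
})

# Same casts as downcast_dtypes: int64 -> int32 and float64 -> float32.
small = toy.astype({'shop_id': np.int32, 'target': np.float32})

# Bytes used by the columns before and after the casts.
before = toy.memory_usage(index=False).sum()
after = small.memory_usage(index=False).sum()
print(before, after)
```

Each column goes from 8 to 4 bytes per value, so the frame takes half the memory.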

In [None]:
gc.collect();
from tqdm import tqdm_notebook

index_cols = ['shop_id', 'item_id', 'date_block_num']
test_index_cols = ['shop_id', 'item_id']
cols_to_rename = ['target']

shift_range = [1, 2, 3, 6, 12]

train_df = downcast_dtypes(train_df)
test = downcast_dtypes(test)

for month_shift in tqdm_notebook(shift_range):
    print("pass %s\n" % str(month_shift))
    train_shift = train_df[index_cols + cols_to_rename].copy()
    
    train_shift['date_block_num'] = train_shift['date_block_num'] + month_shift
    foo = lambda x: '{}_lag_{}'.format(x, month_shift) if x in cols_to_rename else x
    train_shift = train_shift.rename(columns=foo)
    
    train_df = pd.merge(train_df, train_shift, on=index_cols, how='left').fillna(0)
    
    # Test
    test_month_shift = 34 - month_shift
    test_shift = train_df.loc[train_df.date_block_num == test_month_shift, test_index_cols + cols_to_rename].copy()
    test_shift = test_shift.rename(columns=foo)
    
    test = pd.merge(test, test_shift, on=test_index_cols, how='left').fillna(0)

# The lag columns live in train_df (train is the raw daily sales and has none)
lagged_features = [col for col in train_df.columns if 'lag' in col]
del train_shift, test_shift   
gc.collect();
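
The shift-and-merge trick used in the loop above is easier to see on a tiny example. This is a sketch on a hypothetical `toy` frame with one shop and one item over three months.

```python
import pandas as pd

toy = pd.DataFrame({
    'shop_id':        [1, 1, 1],
    'item_id':        [10, 10, 10],
    'date_block_num': [0, 1, 2],
    'target':         [5.0, 3.0, 4.0],
})
keys = ['shop_id', 'item_id', 'date_block_num']

# Move every row one month forward, so the value of month m is keyed as month m + 1.
lag = toy[keys + ['target']].copy()
lag['date_block_num'] += 1
lag = lag.rename(columns={'target': 'target_lag_1'})

# Month 0 has no predecessor, so its lag is filled with 0.
toy = toy.merge(lag, on=keys, how='left').fillna(0)
print(toy)
```

After the merge, each row carries the target of the previous month in `target_lag_1`.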

### Text features

In [None]:
#create tfidfs
train_tf_idf_features = {}
test_tf_idf_features = {}
col_vals = [shops.shop_name, items.item_name]
col_names = ['shop_id', 'item_id']

fName = "shop_id_tf_idf_train.npz"

if os.path.exists(fName):
    print('Loading text features...')
    for name in col_names:
        train_tf_idf_features[name] = load_npz(name + '_tf_idf_train.npz')
        test_tf_idf_features[name] = load_npz(name + '_tf_idf_test.npz')
else:
    print('Generate text features...')
    #load russian + relevant english stopwords
    stop_words_ru= ['для','тц', 'тк', 'трк', 'трц', 'ii', 'ул', 'пав','the', 'для', 'of', 'на']
    
    for name, text in zip(col_names, col_vals):
        print(f'Tfidf from feature {name}')
        tfidf_vectorizer = TfidfVectorizer(max_features=30, stop_words=stop_words_ru)
        tf_idf_feats = tfidf_vectorizer.fit_transform(text)

        means = np.array(tf_idf_feats.mean(axis=0)).squeeze()
        argsort = means.argsort()
        print('Top frequency words:')
        # get_feature_names() was replaced by get_feature_names_out() in recent scikit-learn
        print(np.array(tfidf_vectorizer.get_feature_names_out())[argsort[::-1][:30]])
        print()

        # Create and save
        print('Transform data')
        sparse_text_train = tfidf_vectorizer.transform(train_df[name].map(text))
        sparse_text_test = tfidf_vectorizer.transform(test[name].map(text))
        train_tf_idf_features[name] = sparse_text_train
        test_tf_idf_features[name] = sparse_text_test
        
        print('Save to file')
        save_npz(name + '_tf_idf_train', sparse_text_train)
        save_npz(name + '_tf_idf_test', sparse_text_test)
        print()
print ("...done")
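
For reference, the fit-on-names, transform-per-row pattern of the cell above looks like this on a toy example. The names in `names` are invented and the series index plays the role of the id.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical names indexed by id, like shops.shop_name indexed by shop_id.
names = pd.Series(['Moscow mall Atrium', 'Moscow mall Mega', 'Kazan mall Park'])

# Learn the vocabulary once, on the unique names only.
vec = TfidfVectorizer(max_features=3)
vec.fit(names)

# One row per training row: look up the name for each id, then vectorize it.
row_ids = pd.Series([2, 0, 0, 1])
features = vec.transform(row_ids.map(names))
print(features.shape)
```

Rows that share an id get identical feature vectors, so these features describe the shop or item, not the individual sale.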

In [None]:
# Sparse matrices expose nonzero() directly, so there is no need to densify the whole matrix
train_tf_idf_features['item_id'].nonzero()

In [None]:
train_df.head()

In [None]:
test.head()

### Mean Encodings

In [None]:
def expand_mean_enc(tr, test, col,global_mean):
    new_col_name = col + '_enc'
    tr[new_col_name] = np.nan

    cumsum = tr.groupby(col)['target'].cumsum() - tr['target']
    cumcount = tr.groupby(col)['target'].cumcount()

    expanding_mean = pd.DataFrame({'expand_mean': cumsum / cumcount,
                                   'date_block_num': tr.date_block_num,
                                   col: tr[col]})
    for block_num in np.unique(tr.date_block_num)[1:]:
        cur_mask = tr.date_block_num == block_num
        prev_mask = tr.date_block_num <= block_num - 1
        
        mapping = expanding_mean[prev_mask].groupby(col).expand_mean.last()
        tr.loc[cur_mask, new_col_name] = tr.loc[cur_mask, col].map(mapping)
        
    
    # Fill test with last seen values from train
    prev_mask = tr.date_block_num <= tr.date_block_num.max()
    mapping = expanding_mean[prev_mask].groupby(col).expand_mean.last()
    test[new_col_name] = test[col].map(mapping)
    
    # Downcast
    tr[new_col_name] = tr[new_col_name].astype(np.float32)
    test[new_col_name] = test[new_col_name].astype(np.float32)
    
    # Fill NaNs and infs with the global mean
    # (plain assignment instead of inplace=True on a column selection, which may not update the df)
    tr[new_col_name] = tr[new_col_name].replace(np.inf, np.nan).fillna(global_mean)

    test[new_col_name] = test[new_col_name].replace(np.inf, np.nan).fillna(global_mean)
    return tr,test

In [None]:
col_4_mean = ['item_id','shop_id','item_category_id']
global_mean = train_df.target.mean()

In [None]:
for col in col_4_mean:
    train_df, test = expand_mean_enc(train_df, test, col,global_mean)
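
The core of the expanding mean is the `cumsum` / `cumcount` pair. Below is a simplified row-wise sketch on a hypothetical `toy` frame. It is not the exact per-month mapping done in `expand_mean_enc`, only the building block, and it assumes the rows are already in chronological order.

```python
import numpy as np
import pandas as pd

# Rows must be in chronological order for the encoding to use past rows only.
toy = pd.DataFrame({
    'shop_id': [1, 1, 2, 1, 2],
    'target':  [4.0, 2.0, 6.0, 0.0, 2.0],
})

# Sum and count of the target over the earlier rows of the same shop.
prev_sum = toy.groupby('shop_id')['target'].cumsum() - toy['target']
prev_cnt = toy.groupby('shop_id')['target'].cumcount()

# 0 / 0 gives NaN on the first row of each shop; fall back to the global mean.
global_mean = toy['target'].mean()
toy['shop_id_enc'] = (prev_sum / prev_cnt).fillna(global_mean)
print(toy)
```

The fourth row (shop 1) is encoded as the mean of the two earlier shop 1 targets, (4 + 2) / 2 = 3, and never sees its own target.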

In [None]:
train_df.loc[np.random.choice(train_df.index,10,replace=False)]

### Reduce training size

In [None]:
start_month = 6
train_sel_mask = train_df.date_block_num > start_month
train_df = train_df[train_sel_mask]
train_df.reset_index(inplace=True, drop=True)
for name in sorted(train_tf_idf_features):
    train_tf_idf_features[name]=train_tf_idf_features[name][train_sel_mask.values]

# Misc Utils

### Clip Target Values

In [None]:
def clip_target(counts):
    return np.clip(counts, 0, 20)
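
A sketch of how the clipping could be used when scoring. `clipped_rmse` is a hypothetical helper, and I am assuming here that the true values are clipped to the same [0, 20] range as the predictions at evaluation time.

```python
import numpy as np

def clipped_rmse(y_true, y_pred, low=0, high=20):
    """RMSE after clipping both arrays to [low, high] (hypothetical helper)."""
    y_true = np.clip(y_true, low, high)
    y_pred = np.clip(y_pred, low, high)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

y_true = np.array([0.0, 3.0, 50.0])   # 50 is clipped to 20
y_pred = np.array([-1.0, 3.0, 24.0])  # -1 is clipped to 0, 24 to 20
score = clipped_rmse(y_true, y_pred)
print(score)
```

Under that assumption, errors outside the range cost nothing, which is why extreme monthly counts matter less than they first appear.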

### Validation set
I have noticed that the spikes in the data have a 3 month span (start - spike - end). I will then use the last 2 months as the validation set.

In [None]:
print ('1st month: %s , last month: %s' % (str(train_df['date_block_num'].min()),str(train_df['date_block_num'].max())))

In [None]:
train_mask = train_df.date_block_num < 32
val_mask = train_df.date_block_num >= 32

x_train= train_df.loc[train_mask].copy()
x_val = train_df.loc[val_mask].copy()

y_train = x_train['target']
y_val = x_val['target']

#drop target column
to_drop_cols = ['target']
x_train.drop(to_drop_cols, axis=1, inplace=True)
x_val.drop(to_drop_cols, axis=1, inplace=True)

# Modeling

### XGBoost

1) Tuning with Hyperopt

2) Plot the results

3) Plot the features by importance

4) Retrain with tuned parameters and most important features

### Tuning 
https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/
 
https://machinelearningmastery.com/tune-number-size-decision-trees-xgboost-python/

https://www.quora.com/How-do-I-tune-hyperparameters-like-eta-num_rounds-max_depth-for-xgboost

http://mlwhiz.com/blog/2017/12/28/hyperopt_tuning_ml_model/

https://www.dataiku.com/learn/guide/code/python/advanced-xgboost-tuning.html

1) First tune 'max_depth', 'min_child_weight', 'subsample', 'colsample_bytree'

2) Tune eta - learning rate

3) Find the optimal n_estimators 

In [None]:
from sklearn.metrics import mean_squared_error
import xgboost as xgb
from hyperopt import hp, fmin, tpe, STATUS_OK, Trials
import pickle
import os

In [None]:
def objective(space):
    print(space)
    
    xgb_params = {
        'colsample_bytree' : space['colsample_bytree'],
        'learning_rate' : .3,
        'max_depth' : int(space['max_depth']),
        'min_child_weight' : space['min_child_weight'],
        'subsample' : space['subsample'],
        'gamma' : space['gamma'],
        'reg_lambda' : space['reg_lambda']
    }
    
    # xgboost >= 1.6: eval_metric and early_stopping_rounds are set on the estimator, not in fit()
    clf = xgb.XGBRegressor(**xgb_params, n_estimators=1000,
                           eval_metric="rmse", early_stopping_rounds=10)

    eval_set = [(x_train, y_train), (x_val, y_val)]

    clf.fit(x_train, y_train, eval_set=eval_set, verbose=True)

    pred = clf.predict(x_val)
    mse_scr = mean_squared_error(y_val, pred)
    print ("SCORE: %s" % str(np.sqrt(mse_scr)))
    return {'loss':mse_scr, 'status': STATUS_OK }


space ={'max_depth': hp.quniform("x_max_depth", 6, 16, 1),
        'min_child_weight': hp.loguniform ('x_min_child', -0.1,3),
        'subsample': hp.uniform ('x_subsample', 0.7, 1),
        'gamma' : hp.uniform ('x_gamma', 0.1,0.5),
        'colsample_bytree' : hp.uniform ('x_colsample_bytree', 0.3,1),
        'reg_lambda' : hp.uniform ('x_reg_lambda', 0,1),
        'tree_method': 'gpu_hist'
       }


fName = "trials_xgb_cg.p"

if os.path.exists(fName):
    with open(fName, "rb") as f:
        trials = pickle.load(f)
    hyperparam_history = []
    for i, loss in enumerate(trials.losses()):
        param_vals = {k:v[i] for k,v in trials.vals.items()}
        hyperparam_history.append((loss, param_vals))
    hyperparam_history.sort()
    best = hyperparam_history[0][1]
    print ("Parameters file loaded")
    print ("BEST PARAMETERS-> ", best)
else:#run Hyperopt optimization
    trials = Trials()
    best = fmin(fn=objective,
            space=space,
            algo=tpe.suggest,
            max_evals=100,            
            trials=trials)
    print ("BEST PARAMETERS-> ", best)
    with open(fName, "wb") as f:
        pickle.dump(trials, f)
    print ("Parameters dumped to file")
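
One detail to watch before step 4: `best` is keyed by the labels passed to `hp.*` ('x_max_depth', ...), not by the keys of `space`, and `quniform` values come back as floats. A minimal sketch of the conversion, with made-up values in `best`:

```python
# Hypothetical example of what fmin might return for the labels used above.
best = {'x_max_depth': 9.0, 'x_min_child': 2.4, 'x_subsample': 0.85,
        'x_gamma': 0.2, 'x_colsample_bytree': 0.6, 'x_reg_lambda': 0.3}

# hyperopt label -> XGBoost parameter name.
label_to_param = {
    'x_max_depth': 'max_depth',
    'x_min_child': 'min_child_weight',
    'x_subsample': 'subsample',
    'x_gamma': 'gamma',
    'x_colsample_bytree': 'colsample_bytree',
    'x_reg_lambda': 'reg_lambda',
}

tuned_params = {label_to_param[label]: value for label, value in best.items()}
tuned_params['max_depth'] = int(tuned_params['max_depth'])  # quniform yields floats
print(tuned_params)
```

`tuned_params` can then be unpacked into the regressor the same way `xgb_params` is in `objective`.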

#### Add text features to the df

In [None]:
print(train_tf_idf_features['shop_id'].shape)
print(train_df.shape)

In [None]:
#get the features
train_text_features = [train_tf_idf_features[name] for name in sorted(train_tf_idf_features)]
test_text_features = [test_tf_idf_features[name] for name in sorted(test_tf_idf_features)]
# Stack to sparse format
sparse_train = hstack(train_text_features, format='csr').astype(np.float32)
sparse_test = hstack(test_text_features, format='csr').astype(np.float32)
print(sparse_train.shape)
gc.collect();

In [None]:
#1st limit to 30 features
from sklearn.decomposition import TruncatedSVD
svd_model = TruncatedSVD(30)
# svd holds the reduced features for train and test stacked together
svd = svd_model.fit_transform(vstack([sparse_train, sparse_test]))
svd_train = svd[:sparse_train.shape[0]]
svd_test = svd[sparse_train.shape[0]:]
del svd
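
The same fit-on-both, split-by-row-count pattern on small random matrices. `toy_sparse` is a hypothetical helper, and the shapes and the 5 components are arbitrary choices for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix, vstack
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(0)

def toy_sparse(n_rows, n_cols=20, density=0.2):
    """Random sparse block standing in for the stacked TF-IDF features."""
    values = rng.random((n_rows, n_cols))
    keep = rng.random((n_rows, n_cols)) < density
    return csr_matrix(values * keep)

toy_train = toy_sparse(50)
toy_test = toy_sparse(10)

# Fit one projection on both blocks so train and test share the same axes.
reducer = TruncatedSVD(n_components=5, random_state=0)
reduced = reducer.fit_transform(vstack([toy_train, toy_test]))

# Split back by row count.
toy_svd_train = reduced[:toy_train.shape[0]]
toy_svd_test = reduced[toy_train.shape[0]:]
print(toy_svd_train.shape, toy_svd_test.shape)
```

Fitting on the stacked matrix only uses the text features, never the target, so it does not leak label information into the test rows.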

In [None]:
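#IF NOT SVD (train_df is wrapped in csr_matrix because hstack expects sparse or array blocks)
from scipy.sparse import csr_matrix
train_fold = hstack([csr_matrix(train_df.values), sparse_train], format='csr')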

In [None]:
train_fold.shape

In [None]:
type(train_fold)

#### TO DO GAIO: Create train/validation split
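
A possible sketch for this TO DO, on hypothetical stand-ins (`toy_fold`, `toy_blocks`, `toy_target`). It assumes the row order of the sparse matrix matches the row order of the month column, and it reuses the cut at month 32 from the validation section.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical stand-ins: 6 feature rows, the month of each row, and the target.
toy_fold = csr_matrix(np.arange(12, dtype=np.float32).reshape(6, 2))
toy_blocks = np.array([30, 31, 31, 32, 33, 33])
toy_target = np.array([0.0, 1.0, 0.0, 2.0, 1.0, 0.0])

# CSR matrices support row selection with a boolean mask.
toy_val_mask = toy_blocks >= 32
fold_train, fold_val = toy_fold[~toy_val_mask], toy_fold[toy_val_mask]
y_fold_train, y_fold_val = toy_target[~toy_val_mask], toy_target[toy_val_mask]
print(fold_train.shape, fold_val.shape)
```

Row selection like this needs the CSR format; a COO matrix, the default output of `hstack`, would first need `.tocsr()`.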