### Summary

##### Multiple validations
In this notebook, we validate the model on multiple validation sets.

To start with, we use month 32 as the validation set and train on data from before that month. We then use month 31 as the validation set, again training only on earlier data, and so on. The benchmark used here only needs the month immediately preceding each validation month.
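The splitting scheme described above can be sketched as a small generator. This is only an illustration: `time_based_splits`, its `lookback` parameter, and the toy frame are hypothetical names that do not appear elsewhere in this notebook.

```python
import pandas as pd

def time_based_splits(df, validation_months, lookback=None,
                      month_col="date_block_num"):
    """Yield (month, train, valid), where train is strictly earlier than valid.

    lookback=None keeps every earlier month; lookback=1 keeps only the
    month immediately before the validation month.
    """
    for month in validation_months:
        valid = df[df[month_col] == month]
        earlier = df[month_col] < month
        if lookback is not None:
            earlier = earlier & (df[month_col] >= month - lookback)
        yield month, df[earlier], valid

# Toy data (made up for illustration only).
toy = pd.DataFrame({
    "date_block_num": [0, 0, 1, 1, 2, 2, 3],
    "shop_id":        [1, 2, 1, 2, 1, 2, 1],
    "item_id":        [10, 10, 10, 10, 10, 10, 10],
    "item_cnt_month": [1.0, 2.0, 3.0, 0.0, 1.0, 4.0, 2.0],
})

splits = list(time_based_splits(toy, validation_months=[3, 2], lookback=1))
for month, train, valid in splits:
    print(month, sorted(train["date_block_num"].unique()), len(valid))
```

Filtering with a strict `<` on the month column keeps rows from the validation month or later out of the training set.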

##### Taking a mean of the validation scores

Finally, we take the mean of all these validation scores, which should give a more stable estimate of the model's predictive ability than any single month does.

In [15]:
# Various imports.
import pandas as pd
import numpy as np
from statistics import mean


In [16]:
sales_train = pd.read_csv("../input/sales_train.csv")

In [17]:
monthly_sales_data = sales_train[['date_block_num', 'shop_id', 'item_id', 'item_cnt_day' ]].groupby(
    ['date_block_num', 'shop_id', 'item_id']).sum()
monthly_sales_data.rename(columns={'item_cnt_day':'item_cnt_month'}, inplace=True)
monthly_sales_data.reset_index(['date_block_num', 'shop_id', 'item_id'], inplace=True)


In [18]:
monthly_sales_data.head()

,date_block_num,shop_id,item_id,item_cnt_month
0,0,0,32,6.0
1,0,0,33,3.0
2,0,0,35,1.0
3,0,0,43,1.0
4,0,0,51,2.0


In [31]:
# Put the clipping logic into a helper for easy reuse.
def clip_predictions_and_actual_vals(df, 
                                     actual_value_col='item_cnt_month', 
                                     prediction_col='item_cnt_month_prediction'):
    # Clip the actual and predicted counts to [0, 20], in place, for whichever columns exist.
    if actual_value_col in df.columns:
        df[actual_value_col] = df[actual_value_col].clip(lower=0, upper=20)
    if prediction_col in df.columns:
        df[prediction_col] = df[prediction_col].clip(lower=0, upper=20)
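As a quick illustration of what the clipping does, here is a minimal sketch on a made-up frame. The bounds 0 and 20 are taken from the helper above; the values in `toy` are invented for the example.

```python
import pandas as pd

# Made-up values that fall below, inside, and above the [0, 20] range.
toy = pd.DataFrame({
    "item_cnt_month":            [-1.0, 3.0, 25.0],
    "item_cnt_month_prediction": [0.5, 40.0, -2.0],
})

# DataFrame.clip bounds every value: below 0 becomes 0, above 20 becomes 20.
clipped = toy.clip(lower=0, upper=20)
print(clipped)
```

Values already inside the range are left unchanged, so only negative counts (for example, months where returns exceed sales) and very large counts are affected.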
    


In [20]:
from sklearn.metrics import mean_squared_error
def get_validation_rmse(monthly_validation_data=monthly_sales_data):
    # Expects both the item_cnt_month and item_cnt_month_prediction columns.
    rmse_validation_data = np.sqrt(mean_squared_error(monthly_validation_data.item_cnt_month, 
                                                      monthly_validation_data.item_cnt_month_prediction))
    return rmse_validation_data
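For reference, the RMSE computed above is the square root of the mean squared difference between actual and predicted values. A minimal NumPy-only sketch follows; the helper name `rmse` and the sample numbers are made up for illustration.

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean squared error: sqrt(mean((actual - predicted) ** 2))."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

# Three perfect predictions and one miss of 4 units:
# squared errors are 0, 0, 0, 16, their mean is 4, and the square root is 2.
score = rmse([1, 2, 3, 4], [1, 2, 3, 8])
print(score)  # 2.0
```

Because the errors are squared before averaging, a single large miss dominates the score, which is one reason clipping the counts changes the RMSE noticeably.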


In [21]:
def get_predictions_on_test_data(train_data, test_data):
    # Predict, for each (shop_id, item_id) pair in test_data, the count seen
    # in train_data, or 0 if the pair is absent. Both frames are modified in place.
    train_data['shop_id_and_item_id'] = list(zip(train_data['shop_id'], 
                                                 train_data['item_id']))

    # Map (shop_id, item_id) -> item_cnt_month from the training month.
    shop_id_item_id_item_cnt = dict(zip(train_data.shop_id_and_item_id, 
                                        train_data.item_cnt_month))

    test_data['item_cnt_month_prediction'] = test_data.apply(
        lambda x : shop_id_item_id_item_cnt.get((x.shop_id, x.item_id), 0), axis=1)
    clip_predictions_and_actual_vals(train_data)
    clip_predictions_and_actual_vals(test_data)
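The row-wise `apply` above can be slow on large frames. One possible alternative, shown here only as a sketch, is a left merge on `(shop_id, item_id)`. The function name `previous_month_predictions` is hypothetical, and the sketch assumes each pair appears at most once in the previous month's data (which holds after the `groupby` earlier). Unlike the function above, it returns a new frame and does not clip the actual values.

```python
import pandas as pd

def previous_month_predictions(prev_month, target):
    """Attach last month's count to each (shop_id, item_id) row of target."""
    lookup = prev_month[["shop_id", "item_id", "item_cnt_month"]].rename(
        columns={"item_cnt_month": "item_cnt_month_prediction"})
    # A left merge keeps every target row; unseen pairs get NaN.
    merged = target.merge(lookup, on=["shop_id", "item_id"], how="left")
    # Unseen pairs are predicted as 0, then predictions are clipped to [0, 20].
    merged["item_cnt_month_prediction"] = (
        merged["item_cnt_month_prediction"].fillna(0).clip(lower=0, upper=20))
    return merged

# Made-up data for illustration.
prev = pd.DataFrame({"shop_id": [5, 5], "item_id": [1, 2],
                     "item_cnt_month": [3.0, 30.0]})
target = pd.DataFrame({"ID": [0, 1, 2], "shop_id": [5, 5, 6],
                       "item_id": [2, 1, 1]})

result = previous_month_predictions(prev, target)
print(result)
```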


In [22]:
def get_validation_score(validation_month=32, monthly_sales_data=monthly_sales_data):
    # Validation set: the sales of the validation month.
    monthly_validation_data = monthly_sales_data[monthly_sales_data['date_block_num'] == validation_month].copy()

    # Training set: the sales of the month immediately before it.
    prev_month_data = monthly_sales_data[monthly_sales_data['date_block_num'] == validation_month-1].copy()

    get_predictions_on_test_data(prev_month_data, monthly_validation_data)
    return get_validation_rmse(monthly_validation_data)
    
    

In [23]:
get_validation_score()

2.6196034545160316

##### Now, let us compute validation scores iteratively for months 1 to 32.

In [24]:
validation_scores = list()
for i in range(1, 33):
    validation_score = get_validation_score(validation_month=i)
    print(i, validation_score)
    validation_scores.append(validation_score)
    
print("Mean validation score is ", mean(validation_scores))

1 2.710395130217868
2 3.0223156762889514
3 2.8016959770366463
4 2.4699370766101896
5 2.5482356397695782
6 2.210297259267594
7 2.426496930365454
8 3.0129113018549636
9 3.010404740220371
10 3.045767612570966
11 3.0568427380053245
12 2.5833516846301863
13 2.5346996530242834
14 2.8093730151311207
15 2.719196974469364
16 2.5575318724316403
17 2.2361738107033937
18 2.323811638937932
19 2.395326960716407
20 2.9929949798835107
21 3.033475245069365
22 3.079804031930541
23 2.9627064809246684
24 2.713172334269662
25 2.4361476089816563
26 2.290570480844994
27 2.7652927704790824
28 2.646595789640537
29 2.298305632682825
30 2.0275932308184674
31 2.11499708886291
32 2.6196034545160316
Mean validation score is  2.6392507756611403
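Since the goal is a stable estimate, it can help to look at the spread of the scores as well as their mean. The sketch below uses a hypothetical helper, `summarize_scores`, and for brevity only the first four scores printed above (months 1 to 4); in the notebook it could be called on the full `validation_scores` list.

```python
from statistics import mean, stdev

def summarize_scores(scores):
    """Return the mean, sample standard deviation, min and max of the scores."""
    return {
        "mean": mean(scores),
        "stdev": stdev(scores),  # needs at least two scores
        "min": min(scores),
        "max": max(scores),
    }

# The scores for months 1 to 4, copied from the output above.
first_four = [2.710395130217868, 3.0223156762889514,
              2.8016959770366463, 2.4699370766101896]

summary = summarize_scores(first_four)
print(summary)
```

A large standard deviation relative to the mean would indicate that any single month's score is a noisy estimate.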


##### One final time, let us verify the prediction score on the test entries by uploading them to Kaggle.


In [25]:
test = pd.read_csv('../input/test.csv')

In [26]:
sales_train.date_block_num.describe()

count    2.935849e+06
mean     1.456991e+01
std      9.422988e+00
min      0.000000e+00
25%      7.000000e+00
50%      1.400000e+01
75%      2.300000e+01
max      3.300000e+01
Name: date_block_num, dtype: float64

In [27]:
get_predictions_on_test_data(monthly_sales_data[monthly_sales_data.date_block_num==33].copy(), test)

In [28]:
test.head()

,ID,shop_id,item_id,item_cnt_month_prediction
0,0,5,5037,0.0
1,1,5,5320,0.0
2,2,5,5233,1.0
3,3,5,5232,0.0
4,4,5,5268,0.0


In [29]:
test.rename(columns={'item_cnt_month_prediction': 'item_cnt_month'}, inplace=True)

In [30]:
test[['ID', 'item_cnt_month']].to_csv('test_submission_20190805.csv', index=False)
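Before uploading, a few sanity checks on the submission frame can catch simple mistakes. This is a minimal sketch: `find_submission_problems` is a hypothetical helper, the expected columns are those written out above, and the [0, 20] range is assumed from the clipping helper used in this notebook.

```python
import pandas as pd

def find_submission_problems(submission, expected_ids=None):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    if list(submission.columns) != ["ID", "item_cnt_month"]:
        problems.append("columns should be exactly ['ID', 'item_cnt_month']")
        return problems
    if submission["item_cnt_month"].isna().any():
        problems.append("missing predictions")
    if not submission["item_cnt_month"].between(0, 20).all():
        problems.append("predictions outside [0, 20]")
    if submission["ID"].duplicated().any():
        problems.append("duplicate IDs")
    if expected_ids is not None and set(submission["ID"]) != set(expected_ids):
        problems.append("IDs do not match the test set")
    return problems

# Made-up frames for illustration.
good = pd.DataFrame({"ID": [0, 1, 2], "item_cnt_month": [0.0, 1.0, 20.0]})
bad = pd.DataFrame({"ID": [0, 1, 2], "item_cnt_month": [0.0, 25.0, 1.0]})

print(find_submission_problems(good, expected_ids=[0, 1, 2]))
print(find_submission_problems(bad))
```

In this notebook, the check could be run on `test[['ID', 'item_cnt_month']]` with `expected_ids` set to the `ID` column read from `test.csv`.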

#### Conclusion

The submission matches the score received earlier. This supports the point that although the submission scores very well on Kaggle, where only 15% of the test data is visible, it scores relatively poorly on our validation sets (a mean RMSE of about 2.64). Hence, we should take the results of a Kaggle submission with a pinch of salt.
