## Previous Value Benchmark

https://www.kaggle.com/szhou42/predict-future-sales-top-11-solution

A good exercise is to reproduce the previous_value_benchmark. As the name suggests, in this benchmark the prediction for each shop/item pair is simply its monthly sales from the previous month, i.e. October 2015.

The most important step in reproducing this score is correctly aggregating the daily data and constructing the monthly sales data frame. You need to get lagged values, fill NaNs with zeros, and clip the values into the [0,20] range. If you do it correctly, you'll get precisely 1.16777 on the public leaderboard.
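As a rough illustration of the aggregate, lag, fill and clip sequence described above, here is a minimal sketch on made-up data. The column names `date`, `shop_id`, `item_id` and `item_cnt_day` follow this notebook; the toy values, the `month_idx` helper column and the `item_cnt_prev_month` name are assumptions introduced only for this example.

```python
import pandas as pd

# Toy daily sales; the values are invented for illustration.
daily = pd.DataFrame({
    "date": pd.to_datetime(
        ["2015-09-03", "2015-09-17", "2015-10-02", "2015-10-20", "2015-10-21"]
    ),
    "shop_id": [5, 5, 5, 5, 6],
    "item_id": [100, 100, 100, 200, 100],
    "item_cnt_day": [1.0, 2.0, 4.0, 25.0, -1.0],
})

# 1. Aggregate daily rows into one row per (month, shop, item).
#    month_idx is a hypothetical running month counter (year * 12 + month).
daily["month_idx"] = daily["date"].dt.year * 12 + daily["date"].dt.month
monthly = (
    daily.groupby(["month_idx", "shop_id", "item_id"], as_index=False)["item_cnt_day"]
    .sum()
    .rename(columns={"item_cnt_day": "item_cnt_month"})
)

# 2. Lag by one month: relabel every row with the following month, so a join
#    on month_idx brings in the previous month's sales.
lagged = monthly.rename(columns={"item_cnt_month": "item_cnt_prev_month"})
lagged["month_idx"] = lagged["month_idx"] + 1

# 3. The (shop, item) pairs we want to predict for November 2015.
target = pd.DataFrame({
    "month_idx": [2015 * 12 + 11] * 4,
    "shop_id": [5, 5, 6, 6],
    "item_id": [100, 200, 100, 300],
})

# 4. Left-join the lag, fill pairs with no previous-month sales with 0,
#    and clip into [0, 20].
features = target.merge(lagged, how="left", on=["month_idx", "shop_id", "item_id"])
features["item_cnt_prev_month"] = (
    features["item_cnt_prev_month"].fillna(0).clip(lower=0, upper=20)
)

print(features)
```

Relabelling the month and joining, rather than calling `shift` inside a group, keeps the lag correct even when a pair has no row at all for some month.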

Generating features like this is a necessary basis for more complex models. Also, if you decide to fit a model, don't forget to clip the target into the [0,20] range; it makes a big difference.
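To see why clipping can matter so much, here is a small sketch. It assumes the leaderboard metric is RMSE computed against true values that have themselves been clipped into [0, 20]; that is how I understand the competition's evaluation, but treat it as an assumption. The numbers and the `rmse` helper are made up for illustration.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two 1-D arrays."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

# Invented monthly counts: one pair has a very large outlier.
y_true_raw = np.array([0.0, 1.0, 3.0, 500.0])
y_pred_raw = np.array([0.0, 2.0, 2.0, 450.0])

# Assumed scoring: the true target is clipped into [0, 20] before RMSE.
y_true = np.clip(y_true_raw, 0, 20)

# Clip our own predictions into the same range.
y_pred_clipped = np.clip(y_pred_raw, 0, 20)

score_unclipped = rmse(y_true, y_pred_raw)
score_clipped = rmse(y_true, y_pred_clipped)

print(f"RMSE without clipping predictions: {score_unclipped:.4f}")
print(f"RMSE with clipped predictions:     {score_clipped:.4f}")
```

Under that assumption, a single unclipped outlier dominates the error, which is consistent with the advice to clip both the training target and the final predictions.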

Comments

Simply put: use October 2015 sales (number of items sold) as our predictions for November 2015 sales.

In [1]:
import os
import sys
import pandas as pd

sys.path.insert(0, os.path.abspath('/home/jupyter/kaggle/predict_future_sales/src/'))

import munging.process_data as process_data

In [2]:
DATA_DIR = '/home/jupyter/kaggle/predict_future_sales/data/processed'

In [3]:
train_df = pd.read_feather(f'{DATA_DIR}/train_processed.feather')
test_df = pd.read_feather(f'{DATA_DIR}/test_processed.feather')
shops_df = pd.read_feather(f'{DATA_DIR}/shops_processed.feather')
item_categories_df = pd.read_feather(f'{DATA_DIR}/item_categories_processed.feather')
items_df = pd.read_feather(f'{DATA_DIR}/items_processed.feather')
sample_submission_df = pd.read_feather(f'{DATA_DIR}/submission_processed.feather')



1. Pick the data for October 2015.
2. Sum the sales per shop per item for the month.
3. Left-merge the result of step 2 into the test data, so that each (shop, item) combination in the test data gets a monthly items-sold value.
4. Not every (shop, item) combination in the test data appears in the October 2015 train data. Hence, for a large number of (shop, item) pairs in the test data, 'item_cnt_month' will be NaN after the merge.

In [4]:
train_df_sel_month = train_df[(train_df.date.dt.year == 2015) & (train_df.date.dt.month == 10)]
train_sel_month_summarized = train_df_sel_month.groupby(['shop_id', 'item_id'])['item_cnt_day'].sum()
train_sel_month_summarized = train_sel_month_summarized.reset_index()
train_sel_month_summarized.rename(columns={'item_cnt_day': 'item_cnt_month'}, inplace=True)

test_df_filled_oct_2015 = process_data.merge_df(test_df, train_sel_month_summarized, how='left', on=['shop_id', 'item_id'])

Before merge missing values on left_df
ID         0
shop_id    0
item_id    0
dtype: int64
Before merge missing values on right_df
shop_id           0
item_id           0
item_cnt_month    0
dtype: int64
Before merge shape of left_df: (214200, 3)
Before merge shape of right_df: (31531, 3)
After merge missing values in merged_df
ID                     0
shop_id                0
item_id                0
item_cnt_month    185520
dtype: int64
After merge shape of merged_df (214200, 4)
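The source of `process_data.merge_df` is not shown in this notebook. Judging only from the printed output above, it might look roughly like the sketch below; `merge_with_report` is a hypothetical stand-in, not the actual helper, and the toy frames are made up.

```python
import pandas as pd

def merge_with_report(left_df, right_df, how, on):
    """Hypothetical stand-in for process_data.merge_df.

    Prints missing-value counts and shapes before and after the merge,
    then returns the merged frame.
    """
    print("Before merge missing values on left_df")
    print(left_df.isna().sum())
    print("Before merge missing values on right_df")
    print(right_df.isna().sum())
    print(f"Before merge shape of left_df: {left_df.shape}")
    print(f"Before merge shape of right_df: {right_df.shape}")

    merged_df = left_df.merge(right_df, how=how, on=on)

    print("After merge missing values in merged_df")
    print(merged_df.isna().sum())
    print(f"After merge shape of merged_df {merged_df.shape}")
    return merged_df

# Toy frames shaped like the test data and the October summary.
left = pd.DataFrame({
    "ID": [0, 1, 2],
    "shop_id": [5, 5, 5],
    "item_id": [5037, 5320, 5233],
})
right = pd.DataFrame({"shop_id": [5], "item_id": [5233], "item_cnt_month": [1.0]})

merged = merge_with_report(left, right, how="left", on=["shop_id", "item_id"])
```

With `how='left'`, every row of the test data is kept, and pairs with no match on the right side get NaN in `item_cnt_month`, which is what the missing-value count after the merge reflects.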


In [5]:
percent_of_null = test_df_filled_oct_2015.item_cnt_month.isna().sum()*100/len(test_df_filled_oct_2015)
print(f'Percent of missing values for item_cnt_month in test data: {percent_of_null}')

Percent of missing values for item_cnt_month in test data: 86.61064425770309


In [6]:
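# Assign the result back instead of using inplace=True on a column selection,
# which may operate on a temporary copy in newer pandas versions.
test_df_filled_oct_2015['item_cnt_month'] = test_df_filled_oct_2015['item_cnt_month'].fillna(0)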

In [7]:
test_df_filled_oct_2015.head()

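|   | ID | shop_id | item_id | item_cnt_month |
|---|----|---------|---------|----------------|
| 0 | 0  | 5       | 5037    | 0.0            |
| 1 | 1  | 5       | 5320    | 0.0            |
| 2 | 2  | 5       | 5233    | 1.0            |
| 3 | 3  | 5       | 5232    | 0.0            |
| 4 | 4  | 5       | 5268    | 0.0            |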


In [68]:
sample_submission_df.head()

|   | ID | item_cnt_month |
|---|----|----------------|
| 0 | 0  | 0.5            |
| 1 | 1  | 0.5            |
| 2 | 2  | 0.5            |
| 3 | 3  | 0.5            |
| 4 | 4  | 0.5            |


In [15]:
test_df_filled_oct_2015.ID.nunique()

214200


In [19]:
test_df_filled_oct_2015[['ID', 'item_cnt_month']].to_csv('submission_1.csv', index=False)

In [20]:
! kaggle --help

/bin/sh: 1: kaggle: not found


In [26]:
test_df_filled_oct_2015.item_cnt_month.describe()

count    214200.000000
mean          0.293413
std           5.545637
min          -1.000000
25%           0.000000
50%           0.000000
75%           0.000000
max        2253.000000
Name: item_cnt_month, dtype: float64

In [27]:
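# Assign the clipped column back rather than relying on inplace=True.
test_df_filled_oct_2015['item_cnt_month'] = test_df_filled_oct_2015['item_cnt_month'].clip(lower=0, upper=20)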

In [28]:
test_df_filled_oct_2015.item_cnt_month.describe()

count    214200.000000
mean          0.255649
std           1.088794
min           0.000000
25%           0.000000
50%           0.000000
75%           0.000000
max          20.000000
Name: item_cnt_month, dtype: float64

In [29]:
test_df_filled_oct_2015[['ID', 'item_cnt_month']].to_csv('submission_2.csv', index=False)
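Before uploading a file like this, a quick sanity check can catch the issues seen earlier in this notebook (NaNs, values outside [0,20]). The sketch below is one possible check; `find_submission_problems` is a hypothetical helper, and the toy frames are made up. For the real submission, `expected_rows` would be 214200, the number of unique IDs found above.

```python
import pandas as pd

def find_submission_problems(submission, expected_rows, lower=0, upper=20):
    """Return a list of human-readable problems found in a submission frame."""
    problems = []
    if list(submission.columns) != ["ID", "item_cnt_month"]:
        # Without the expected columns the remaining checks cannot run.
        problems.append(f"unexpected columns: {list(submission.columns)}")
        return problems
    if len(submission) != expected_rows:
        problems.append(f"expected {expected_rows} rows, got {len(submission)}")
    if submission["ID"].duplicated().any():
        problems.append("duplicate IDs")
    if submission["item_cnt_month"].isna().any():
        problems.append("NaN predictions")
    # between() is False for NaN, so only test the non-missing values here.
    values = submission["item_cnt_month"].dropna()
    if not values.between(lower, upper).all():
        problems.append(f"predictions outside [{lower}, {upper}]")
    return problems

# Toy examples: one clean frame and one with several issues.
good = pd.DataFrame({"ID": [0, 1, 2], "item_cnt_month": [0.0, 1.0, 20.0]})
bad = pd.DataFrame({"ID": [0, 1, 1], "item_cnt_month": [-1.0, None, 2253.0]})

good_problems = find_submission_problems(good, expected_rows=3)
bad_problems = find_submission_problems(bad, expected_rows=3)

print(good_problems)
print(bad_problems)
```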