# Import libraries

In [1]:
# you have to install ipython-autotime using 'pip install ipython-autotime'
%load_ext autotime

import gc
import IPython.display
import os
import datetime
from tqdm import tqdm_notebook

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# in this project, the metric is rmse, not mse
from sklearn.metrics import mean_squared_error

# models
from sklearn.linear_model import LinearRegression
#SVR is too slow
#from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import ExtraTreesRegressor

# Load datasets

In [2]:
sales = pd.read_csv('./dataset/sales_train.csv.gz')
#shops = pd.read_csv('./dataset/shops.csv')
#items = pd.read_csv('./dataset/items.csv')
#item_cats = pd.read_csv('./dataset/item_categories.csv')
test = pd.read_csv("./dataset/test.csv.gz")

time: 1.76 s


# Analyze raw datasets

Let's start by analyzing basic information about the given datasets.

In [3]:
# display float values without decimals so they read like integers
pd.options.display.float_format = '{:,.0f}'.format

sales.describe()

Unnamed: 0,date_block_num,shop_id,item_id,item_price,item_cnt_day
count,2935849,2935849,2935849,2935849,2935849
mean,15,33,10197,891,1
std,9,16,6324,1730,3
min,0,0,0,-1,-22
25%,7,22,4476,249,1
50%,14,31,9343,399,1
75%,23,47,15684,999,1
max,33,59,22169,307980,2169


time: 786 ms


In [4]:
test.describe()

Unnamed: 0,ID,shop_id,item_id
count,214200,214200,214200
mean,107100,32,11019
std,61834,18,6253
min,0,2,30
25%,53550,16,5382
50%,107100,34,11203
75%,160649,47,16072
max,214199,59,22167


time: 44.9 ms


A simple calculation is useful here. shop_id takes up to 60 values (0 to 59) and item_id up to 22,170 values (0 to 22,169), so there are 1,330,200 possible combinations. However, the test set contains only 214,200 IDs. This means the competition asks for only 16.1% of all shop_id and item_id combinations.
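The arithmetic can be checked in a few lines. This is only a sketch: the counts are read off the `describe()` outputs above, and the variable names are mine.

```python
# Counts read off the describe() outputs above (ids start at 0, so max + 1).
n_shops = 59 + 1        # shop_id ranges from 0 to 59
n_items = 22169 + 1     # item_id ranges from 0 to 22169
n_test_ids = 214200     # number of rows in test

n_combinations = n_shops * n_items
coverage = n_test_ids / n_combinations

print(n_combinations)              # 1330200
print('{:.1%}'.format(coverage))   # 16.1%
```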

We can use this fact in 3 ways.
1. Train and validate on the full data, and predict only the test IDs for the submission.
2. Train on the full data, and predict only the test IDs in both the validation and the submission.
3. Reduce the data to the test IDs before training to make training shorter.

I think we should take option 2 or 3. With option 1, the validation score may not match the test score. My strategy is to use option 3 up to the validation and option 2 only for the submission. The full data contains other shops and other items, but it can still give some information about how prices move, especially if I use RNN algorithms.

# Plan EDA (exploratory data analysis)

I think item_price and item_cnt_day have interesting quantiles and min-max values. First of all, item_cnt_day should never be zero, because each row of sales data records something that actually happened. However, the target is item_cnt_month, so it would be better to analyze item_cnt aggregated by month. For item_price, the maximum is far higher than the other values. I'm not sure, but the extreme prices might be useful for prediction.

My plan is as follows.

1. reduce data using test id combinations
2. aggregate the total item_cnt_month of shops month by month
3. aggregate the total item_cnt_month of items month by month
4. aggregate the total item_cnt_month month by month

The purpose is to find out whether these aggregates are correlated with each other and whether they show patterns over time.
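Steps 2 to 4 of the plan could be sketched with one small helper. `toy_sales` and `monthly_total` are hypothetical names for illustration; only the column names come from the real data.

```python
import pandas as pd

# Toy stand-in for the sales data; the real frame has the same column names.
toy_sales = pd.DataFrame({
    'date_block_num': [0, 0, 0, 1, 1],
    'shop_id':        [2, 2, 3, 2, 3],
    'item_id':        [30, 31, 30, 30, 31],
    'item_cnt_day':   [1, 2, 1, 3, 1],
})

def monthly_total(df, keys):
    """Sum item_cnt_day per month for the given grouping keys."""
    return (df.groupby(['date_block_num'] + keys, as_index=False)['item_cnt_day']
              .sum()
              .rename(columns={'item_cnt_day': 'item_cnt_month'}))

per_shop = monthly_total(toy_sales, ['shop_id'])   # plan step 2
per_item = monthly_total(toy_sales, ['item_id'])   # plan step 3
overall = monthly_total(toy_sales, [])             # plan step 4

print(overall['item_cnt_month'].tolist())  # [4, 4]
```

The three frames can then be plotted against date_block_num to look for seasonality or trends.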

# Make utilities to submit

Utility functions keep the code simple, so it is worth defining them up front.

In [5]:
def make_submission_df(all_prediction):
    df = test.merge(all_prediction, on=["shop_id", "item_id"], how="left")[["ID", "item_cnt_month"]]
    df["item_cnt_month"] = df["item_cnt_month"].fillna(0).clip(0, 20)
    
    return df

def make_submission_file(df, comment="", add_time_stamp=True):
    name = comment
    
    if add_time_stamp:
        if len(name) > 0:
            name += '_'
            
        name += datetime.datetime.now().strftime('%Y%m%d%H%M')
    
    df.to_csv("./submission/%s.csv" % name, sep=",", index=False)
    
def make_submission(all_prediction, comment="", add_time_stamp=True):
    make_submission_file(make_submission_df(all_prediction), comment, add_time_stamp)

time: 14 ms
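To see what `make_submission_df` does to missing and out-of-range predictions, here is a small sketch that repeats its steps on made-up data. `toy_test` and `toy_pred` are hypothetical stand-ins, not the real files.

```python
import pandas as pd

# Toy stand-ins; the values are invented for illustration.
toy_test = pd.DataFrame({'ID': [0, 1, 2], 'shop_id': [5, 5, 6], 'item_id': [10, 11, 10]})
toy_pred = pd.DataFrame({'shop_id': [5, 6], 'item_id': [10, 10], 'item_cnt_month': [35.0, -2.0]})

# Same steps as make_submission_df: left merge, fill missing pairs with 0, clip to [0, 20].
out = toy_test.merge(toy_pred, on=['shop_id', 'item_id'], how='left')[['ID', 'item_cnt_month']]
out['item_cnt_month'] = out['item_cnt_month'].fillna(0).clip(0, 20)

print(out['item_cnt_month'].tolist())  # [20.0, 0.0, 0.0]
```

The pair without a prediction becomes 0, and the values 35 and -2 are pulled back into the [0, 20] range.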


# Make benchmarks

I need benchmarks to measure the quality of my predictions, so I made some very simple ones. I think this should be done in the first phase.

In [6]:
sample = pd.read_csv('./dataset/sample_submission.csv.gz')
make_submission_file(sample, 'sample_value', False)

sample['item_cnt_month'] = 0
make_submission_file(sample, 'zero_value', False)

previous_month = (
    sales[sales["date_block_num"] == 33]
    .groupby(["shop_id", "item_id"], as_index=False)
    .item_cnt_day.sum()
    .rename(columns={"item_cnt_day": "item_cnt_month"})
)
make_submission(previous_month, "previous_month_value", False)

del sample, previous_month

time: 1.65 s


# Benchmark results

* sample value (all 0.5): 1.23646
* zero value: 1.25011
* previous month value: 1.16777

# Reduce data using test id combinations

In [7]:
reduced_sales = sales.merge(test)
reduced_sales = reduced_sales.drop('ID', axis=1)
reduced_sales.describe()

Unnamed: 0,date_block_num,shop_id,item_id,item_price,item_cnt_day
count,1224439,1224439,1224439,1224439,1224439
mean,19,32,9615,1031,1
std,9,16,6300,1827,3
min,0,2,30,0,-16
25%,12,19,4181,299,1
50%,21,31,7856,549,1
75%,27,46,15229,1199,1
max,33,59,22167,59200,2169


time: 856 ms


Before the reduction there were 2,935,849 rows. The reduced data keeps 41.7% of them, even though the test pairs cover only 16.1% of all possible combinations.
This means the test data was not picked randomly from all combinations of shop_id and item_id.
One possible scenario is that the host of this competition chose the test targets from combinations that appeared in the sales data, not from all combinations.
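The reduction works because `merge` without an `on=` argument joins on the columns the two frames share and defaults to an inner join. A minimal sketch on made-up data (`toy_sales` and `toy_test` are hypothetical):

```python
import pandas as pd

toy_sales = pd.DataFrame({
    'shop_id': [2, 2, 3, 4],
    'item_id': [30, 31, 30, 99],
    'item_cnt_day': [1, 2, 1, 5],
})
toy_test = pd.DataFrame({'ID': [0, 1], 'shop_id': [2, 3], 'item_id': [30, 30]})

# With no `on=` argument, merge joins on the shared columns
# (here shop_id and item_id) and keeps only matching rows.
reduced = toy_sales.merge(toy_test).drop('ID', axis=1)

kept_ratio = len(reduced) / len(toy_sales)
print(len(reduced), kept_ratio)  # 2 0.5
```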

# Analyze combinations of shop_id and item_id

In [8]:
full_comb = sales[['shop_id', 'item_id']]
full_comb = full_comb.drop_duplicates()
display(full_comb.describe())
display('unique value of shop_id: ' + str(len(full_comb.shop_id.unique())))
display('unique value of item_id: ' + str(len(full_comb.item_id.unique())))

Unnamed: 0,shop_id,item_id
count,424124,424124
mean,31,11458
std,17,6133
min,0,0
25%,18,6244
50%,30,11614
75%,46,16662
max,59,22169


'unique value of shop_id: 60'

'unique value of item_id: 21807'

time: 275 ms


In [9]:
reduced_comb = reduced_sales[['shop_id', 'item_id']]
reduced_comb = reduced_comb.drop_duplicates()
display(reduced_comb.describe())
display('unique value of shop_id: ' + str(len(reduced_comb.shop_id.unique())))
display('unique value of item_id: ' + str(len(reduced_comb.item_id.unique())))

Unnamed: 0,shop_id,item_id
count,111404,111404
mean,31,10884
std,17,6154
min,2,30
25%,16,5241
50%,31,10889
75%,47,16028
max,59,22167


'unique value of shop_id: 42'

'unique value of item_id: 4716'

time: 110 ms


In [10]:
display(test.describe())
display('unique value of shop_id: ' + str(len(test.shop_id.unique())))
display('unique value of item_id: ' + str(len(test.item_id.unique())))

Unnamed: 0,ID,shop_id,item_id
count,214200,214200,214200
mean,107100,32,11019
std,61834,18,6253
min,0,2,30
25%,53550,16,5382
50%,107100,34,11203
75%,160649,47,16072
max,214199,59,22167


'unique value of shop_id: 42'

'unique value of item_id: 5100'

time: 52.9 ms


| data | full | reduced | test |
|------|------|---------|------|
| unique shop_id | 60 | 42 | 42 |
| unique item_id | 21,807 | 4,716 | 5,100 |
| unique pairs (total) | 424,124 | 111,404 | 214,200 |
| possible pairs (shop_id × item_id) | 1,308,420 | 198,072 | 214,200 |
| ratio (total / possible) | 32.4% | 56.2% | 100% |

Our target is 214,200 combinations. However, the reduced data contains only 4,716 unique item_ids, which means that 384 of the 5,100 test items were never sold in that period. Counted by combination, the share with zero sales is larger still: almost half of the test combinations. In the full dataset, only about 1/3 of the possible combinations have sales, so about 2/3 have none. The gap between the two is not that big. I think the important number is the count of item_ids: the test set has almost 1/4 as many unique item_ids as the full data. If we use one-hot encoding for item_id, we need only about 1/4 of the memory.

Something else caught my attention. The test set has 214,200 combinations, but the reduced dataset has only 111,404. This means only about half of the test combinations exist in the sales data. One more data selection option is to use only the data for those 111,404 combinations. I'm going to use the test combinations first, and then try the smaller and the bigger sets.
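One way to find which test combinations have no sales history is a left merge with `indicator=True`. This is a sketch on made-up frames, not the code I ran above.

```python
import pandas as pd

toy_sales = pd.DataFrame({'shop_id': [2, 2, 3, 3], 'item_id': [30, 30, 30, 31]})
toy_test = pd.DataFrame({'shop_id': [2, 3, 3, 4], 'item_id': [30, 31, 32, 30]})

sold_pairs = toy_sales[['shop_id', 'item_id']].drop_duplicates()

# indicator=True adds a `_merge` column: 'both' if the pair has sales history,
# 'left_only' if it appears only in the test set.
flagged = toy_test.merge(sold_pairs, on=['shop_id', 'item_id'], how='left', indicator=True)
has_history = flagged['_merge'] == 'both'

print(int(has_history.sum()), len(toy_test))  # 2 4
```

On the real data, the same mask would separate the 111,404 pairs with history from the rest of the 214,200 test pairs.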

In [11]:
del full_comb, reduced_comb, reduced_sales

time: 997 µs


# Get applicable dataset to models

First, define a helper function that reduces memory usage.

In [12]:
def downcast_dtypes(df):
    '''
        Changes column types in the dataframe: 
                
                `float64` type to `float32`
                `int64`   type to `int32`
    '''
    
    # Select columns to downcast
    float_cols = [c for c in df if df[c].dtype == "float64"]
    int_cols =   [c for c in df if df[c].dtype == "int64"]
    
    # Downcast
    df[float_cols] = df[float_cols].astype(np.float32)
    df[int_cols]   = df[int_cols].astype(np.int32)
    
    return df

time: 5.96 ms


### 1. Get base data form

The base table needs the columns 'shop_id', 'item_id' and 'date_block_num', because the competition asks for 'item_cnt_month' per 'ID', and each 'ID' stands for one ('shop_id', 'item_id') pair.
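The idea is to build every (shop, item, month) combination and fill in 0 where nothing was sold. The sketch below uses `itertools.product` on a toy frame instead of the dummy-key merge in the cell that follows, so treat it as an alternative illustration; `toy_monthly` is a hypothetical name.

```python
import itertools
import pandas as pd

toy_monthly = pd.DataFrame({
    'shop_id': [2, 3],
    'item_id': [30, 31],
    'date_block_num': [0, 1],
    'item_cnt_month': [4, 1],
})
index_cols = ['shop_id', 'item_id', 'date_block_num']

# Cartesian product of every shop, item and month seen in the data.
grid = pd.DataFrame(
    list(itertools.product(*(sorted(toy_monthly[c].unique()) for c in index_cols))),
    columns=index_cols,
)

# Left merge keeps every grid row; combinations without sales get a count of 0.
base = grid.merge(toy_monthly, on=index_cols, how='left').fillna({'item_cnt_month': 0})

print(len(base))                          # 8
print(int(base['item_cnt_month'].sum()))  # 5
```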

In [13]:
# index_cols = ['shop_id', 'item_id', 'date_block_num']
# gb = sales.groupby(index_cols, as_index=False).sum().rename(columns={'item_cnt_day':'item_cnt_month'})
# gb = gb.drop('item_price', axis=1)

# df1 = pd.DataFrame({'shop_id': gb.shop_id.unique(), 'key':np.zeros(len(gb.shop_id.unique()))})
# df2 = pd.DataFrame({'item_id': gb.item_id.unique(), 'key':np.zeros(len(gb.item_id.unique()))})
# df3 = pd.DataFrame({'date_block_num': gb.date_block_num.unique(), 'key':np.zeros(len(gb.date_block_num.unique()))})

# df = df1.merge(df2).merge(df3)

# df = df.drop('key', axis=1)
# df = df.sort_values(by=index_cols)

# df = df.merge(gb, how='outer').fillna(0)
# del df1, df2, df3, gb

# df.head()

# df = downcast_dtypes(df)

# gc.collect()

time: 32.1 s


### 2. Make lag features

In [14]:
# # List of columns that we will use to create lags
# cols_to_rename = list(df.columns.difference(index_cols)) 

# shift_range = [i for i in range(1, 13)]

# lag_df = df

# for month_shift in tqdm_notebook(shift_range):
#     train_shift = lag_df[index_cols + cols_to_rename].copy()
#     train_shift['date_block_num'] = train_shift['date_block_num'] + month_shift
    
#     foo = lambda x: '{}_lag_{}'.format(x, month_shift) if x in cols_to_rename else x
#     train_shift = train_shift.rename(columns=foo)

#     lag_df = lag_df.merge(train_shift, how='outer')
#     del train_shift
#     lag_df = downcast_dtypes(lag_df)
#     gc.collect()


time: 13min 57s
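The lag trick is easier to see on a tiny frame: copy the table, move date_block_num forward by k months, and merge it back so that each row also carries the value from k months earlier. This sketch uses a left merge, so unlike the outer merge in the cell above it does not create rows for future months. `add_lag` and `toy` are hypothetical names.

```python
import pandas as pd

index_cols = ['shop_id', 'item_id', 'date_block_num']
toy = pd.DataFrame({
    'shop_id': [2, 2, 2],
    'item_id': [30, 30, 30],
    'date_block_num': [0, 1, 2],
    'item_cnt_month': [5, 7, 9],
})

def add_lag(df, value_col, month_shift):
    """Attach the value from `month_shift` months earlier as a new column."""
    shifted = df[index_cols + [value_col]].copy()
    # Moving the month forward lines month t-k up with month t in the merge.
    shifted['date_block_num'] += month_shift
    shifted = shifted.rename(columns={value_col: '{}_lag_{}'.format(value_col, month_shift)})
    return df.merge(shifted, on=index_cols, how='left')

lagged = toy.copy()
for k in (1, 2):
    lagged = add_lag(lagged, 'item_cnt_month', k)
lagged = lagged.fillna(0)

print(lagged['item_cnt_month_lag_1'].tolist())  # [0.0, 5.0, 7.0]
print(lagged['item_cnt_month_lag_2'].tolist())  # [0.0, 0.0, 5.0]
```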


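- Building the lag features takes a long time, so how about saving the result to a csv or npy file and loading it later?
- Separate the training set for validation from the training set for test, then build a pipeline to keep the code clean.
- Once that is done, think about how to apply KNN features and mean encoding.
- Finally, try an ensemble.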
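The first note could be handled with a small caching helper. This is a sketch under my own assumptions: `load_or_build` and `build_toy` are hypothetical names, and the csv format is just one of the two options mentioned.

```python
import os
import tempfile
import pandas as pd

def load_or_build(path, build_fn):
    """Return the cached frame at `path`, building and saving it on first use."""
    if os.path.exists(path):
        return pd.read_csv(path)
    df = build_fn()
    df.to_csv(path, index=False)
    return df

# Usage on a toy builder: the expensive function runs only once.
calls = []

def build_toy():
    calls.append(1)
    return pd.DataFrame({'a': [1, 2]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'toy_lag_df.csv')
    first = load_or_build(path, build_toy)
    second = load_or_build(path, build_toy)

print(len(calls))  # 1
```

Note that a csv round trip resets the dtypes, so a downcast step would have to run again after loading.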

### 3. Trim lag_df

In [15]:
# # Don't use old data from year 2013(because we use 12 months lag data in the target)
# # to make submission 33 -> 34

# valid_last = 33
# test_last = 34

# lag_df = lag_df[12 <= lag_df.date_block_num]
# lag_df = lag_df[lag_df.date_block_num <= test_last]

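# # List of all lagged features (match on the '_lag_' substring; checking only the last
# # character would miss item_cnt_month_lag_10, whose last character is '0')
# fit_cols = [col for col in lag_df.columns if '_lag_' in col]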
# # We will drop these at fitting stage
# to_drop_cols = list(set(list(lag_df.columns)) - (set(fit_cols)|set(index_cols))) + ['date_block_num'] 

# lag_df = downcast_dtypes(lag_df)

# lag_df = lag_df.fillna(0)

time: 43.5 s


### 4. Save lag_df

In [25]:
# lag_df.to_csv("full_lag_df.csv", sep=",", index=False)

time: 4min 42s


### 5. Load lag_df

In [28]:
lag_df = pd.read_csv("full_lag_df.csv")
lag_df = downcast_dtypes(lag_df)

time: 1min
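When lag_df is loaded from the file in a fresh session, the names `test_last` and `to_drop_cols` used in the cells below exist only in the commented-out cell above, so they have to be defined again. A possible sketch on a toy frame with the same column layout is shown here. It matches lag columns on the `_lag_` substring, which is my own choice and also catches `item_cnt_month_lag_10`; `toy_lag_df` is a hypothetical name.

```python
import pandas as pd

# Toy frame with the same column layout as lag_df (two lags only).
toy_lag_df = pd.DataFrame({
    'shop_id': [2], 'item_id': [30], 'date_block_num': [34],
    'item_cnt_month': [0.0],
    'item_cnt_month_lag_1': [3.0], 'item_cnt_month_lag_10': [1.0],
})

index_cols = ['shop_id', 'item_id', 'date_block_num']
valid_last, test_last = 33, 34

# Lag features are the columns fed to the models.
fit_cols = [c for c in toy_lag_df.columns if '_lag_' in c]
# Everything that is neither an index column nor a lag feature is dropped,
# plus date_block_num, which is only used for splitting.
to_drop_cols = [c for c in toy_lag_df.columns if c not in fit_cols + index_cols] + ['date_block_num']

print(to_drop_cols)  # ['item_cnt_month', 'date_block_num']
```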


In [29]:
lag_df.describe()

Unnamed: 0,shop_id,item_id,date_block_num,item_cnt_month,item_cnt_month_lag_1,item_cnt_month_lag_2,item_cnt_month_lag_3,item_cnt_month_lag_4,item_cnt_month_lag_5,item_cnt_month_lag_6,item_cnt_month_lag_7,item_cnt_month_lag_8,item_cnt_month_lag_9,item_cnt_month_lag_10,item_cnt_month_lag_11,item_cnt_month_lag_12
count,30093660,30093660,30093660,30093660,30093660,30093660,30093660,30093660,30093660,30093660,30093660,30093660,30093660,30093660,30093660,30093660
mean,30,11099,23,0,0,0,0,0,0,0,0,0,0,0,0,0
std,17,6397,7,2,2,2,2,2,2,2,2,2,2,2,2,2
min,0,0,12,-5,-5,-22,-22,-22,-22,-22,-22,-22,-22,-22,-22,-22
25%,15,5551,17,0,0,0,0,0,0,0,0,0,0,0,0,0
50%,30,11105,23,0,0,0,0,0,0,0,0,0,0,0,0,0
75%,44,16648,29,0,0,0,0,0,0,0,0,0,0,0,0,0
max,59,22169,34,2253,2253,1644,1305,1305,1305,1305,1305,1305,1305,1305,1305,1305


time: 18 s


# Train/test split

In [16]:
# Save `date_block_num`, as we can't use them as features, but will need them to split the dataset into parts 

dates = lag_df['date_block_num']

# to make submission file, change 33 to 34
last_block = test_last
print('Test `date_block_num` is %d' % last_block)

Test `date_block_num` is 34
time: 1.56 s


In [17]:
dates_train = dates[dates <  last_block]
dates_valid  = dates[dates == last_block]

X_train = lag_df.loc[dates <  last_block].drop(to_drop_cols, axis=1)
X_valid =  lag_df.loc[dates == last_block].drop(to_drop_cols, axis=1)

y_train = lag_df.loc[dates <  last_block, 'item_cnt_month'].values
y_valid =  lag_df.loc[dates == last_block, 'item_cnt_month'].values

time: 8.21 s
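The split could also be wrapped in a function so that switching between validation (33) and submission (34) is a single argument. `split_by_month` is a hypothetical helper, shown on toy data. As far as the cells above show, month 34 has no real sales and its target column holds only the zeros written by `fillna(0)`, so a score computed with `last_block = 34` is not a real validation score.

```python
import pandas as pd

toy = pd.DataFrame({
    'date_block_num': [32, 33, 33, 34],
    'item_cnt_month_lag_1': [1.0, 2.0, 0.0, 4.0],
    'item_cnt_month': [2.0, 0.0, 4.0, 0.0],
})

def split_by_month(df, last_block, target='item_cnt_month'):
    """Train on months before `last_block`, hold out `last_block` itself."""
    is_train = df['date_block_num'] < last_block
    is_hold = df['date_block_num'] == last_block
    drop_cols = [target, 'date_block_num']
    X_train = df.loc[is_train].drop(drop_cols, axis=1)
    X_hold = df.loc[is_hold].drop(drop_cols, axis=1)
    return X_train, df.loc[is_train, target].values, X_hold, df.loc[is_hold, target].values

# last_block=33 holds out a month that has real targets.
X_tr, y_tr, X_va, y_va = split_by_month(toy, last_block=33)
print(len(X_tr), len(X_va))  # 1 2
```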


# Define this competition metric as a function

In [18]:
def rmse(pred, valid):
    return np.sqrt(mean_squared_error(np.clip(pred, 0, 20), np.clip(valid, 0, 20)))

time: 8.98 ms
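A quick check of what the clipping does to the score, using a numpy-only version of the same formula so the sketch runs without scikit-learn. `clipped_rmse` is a hypothetical name and the numbers are made up.

```python
import numpy as np

def clipped_rmse(pred, valid, low=0, high=20):
    """RMSE after clipping both arrays to [low, high]."""
    diff = np.clip(pred, low, high) - np.clip(valid, low, high)
    return float(np.sqrt(np.mean(diff ** 2)))

pred = np.array([25.0, -3.0, 1.0])
valid = np.array([20.0, 0.0, 3.0])

score = clipped_rmse(pred, valid)
print(round(score, 4))  # 1.1547
```

After clipping, 25 becomes 20 and -3 becomes 0, so only the last pair contributes an error.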


# First level models 

In [19]:
lr = LinearRegression()
lr.fit(X_train.values, y_train)
pred_lr = lr.predict(X_valid.values)

print('Clipped rmse for Linear Regression is %f' % rmse(pred_lr, y_valid))

Clipped rmse for Linear Regression is 0.366558
time: 17.8 s


In [20]:
from sklearn.linear_model import ElasticNet
enet = ElasticNet()
enet.fit(X_train.values, y_train)
pred_enet = enet.predict(X_valid.values)

print('Clipped rmse for ElasticNet is %f' % rmse(pred_enet, y_valid))

Clipped rmse for ElasticNet is 0.257933
time: 31.7 s


In [21]:
# copy the key columns first so the assignment does not write into a slice of lag_df
dd = lag_df.loc[dates == last_block, ['shop_id', 'item_id']].copy()
dd['item_cnt_month'] = pred_enet
make_submission(dd, 'enet_with_12month_lag')


time: 1.82 s


# Submit to Kaggle

This cell submits the submission file to Kaggle. Run it carefully, because the number of submissions is limited.
- remove '#' before submitting
- add a meaningful message to a submission

In [22]:
!kaggle competitions submit -c competitive-data-science-final-project -f ./submission/enet_with_12month_lag_201806280804.csv -m "enet with 1 ~ 12 months lagged data traind with all data !!!"

Successfully submitted to Final project: predict future sales
time: 12.7 s


# Check public score

In [23]:
!kaggle competitions submissions -c competitive-data-science-final-project

fileName                                            date                 description                                                                             status    publicScore  privateScore  
--------------------------------------------------  -------------------  --------------------------------------------------------------------------------------  --------  -----------  ------------  
enet_with_12month_lag_201806280804.csv              2018-06-27 23:16:52  enet with 1 ~ 12 months lagged data traind with all data !!!                            complete  1.04670      None          
lr_with_12month_lag_201806280217.csv                2018-06-27 17:18:10  lr with 1 ~ 12 months lagged data traind with all data                                  complete  1.09279      None          
enet_with_12month_lag_201806280213.csv              2018-06-27 17:16:43  Enet with 1 ~ 12 months lagged data traind with all data                                complete  1.07174      None          

As expected, the model scores better with more data than with less. One more thing I want to try is training on the full data and validating only on the test ID combinations.
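Restricting the validation rows to the test ID combinations could look like the sketch below. The frames are made up, and `valid_on_test_ids` is a hypothetical name.

```python
import pandas as pd

toy_valid = pd.DataFrame({
    'shop_id': [2, 2, 3, 4],
    'item_id': [30, 31, 30, 30],
    'item_cnt_month': [1.0, 0.0, 2.0, 5.0],
})
toy_test = pd.DataFrame({'ID': [0, 1], 'shop_id': [2, 3], 'item_id': [30, 30]})

# Inner merge on the pair keys keeps only validation rows that the test set asks for.
valid_on_test_ids = toy_valid.merge(toy_test[['shop_id', 'item_id']], on=['shop_id', 'item_id'])

print(len(valid_on_test_ids))  # 2
```

The model would still be fitted on all rows before the validation month; only the rows used for scoring change.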