In [1]:
import pandas as pd
import numpy as np
import datetime

In [2]:
# Import train and test data sets
sales_train = pd.read_csv('../Data/sales_train_merge.csv', index_col=0, parse_dates=['date'])
sales_test = pd.read_csv('../Data/sales_test_merge.csv', index_col=0, parse_dates=['date'])


In [3]:
sales_train.head()

Unnamed: 0,date,date_block_num,shop_id,item_id,item_price,item_cnt_day,profits,item_category_id
0,2013-01-31,0,0,32,884.0,6.0,1326.0,40
1,2013-01-31,0,0,33,1041.0,3.0,1041.0,37
2,2013-01-31,0,0,35,247.0,1.0,247.0,40
3,2013-01-31,0,0,43,221.0,1.0,221.0,40
4,2013-01-31,0,0,51,257.0,2.0,257.0,57


In [4]:
sales_test.head()

Unnamed: 0,ID,shop_id,item_id,item_category_id,date,date_block_num
0,0,5,5037,19,2015-11-30,34
1,1,5,5320,55,2015-11-30,34
2,2,5,5233,19,2015-11-30,34
3,3,5,5232,23,2015-11-30,34
4,4,5,5268,20,2015-11-30,34


# Preprocessing

Before a machine learning model can be fit, the data needs to be preprocessed: redundant columns removed, values standardized, and extra columns derived where useful. In this instance, ``item_price``, ``profits``, and ``item_category_id`` can be removed. Each item belongs to a specific category, so ``item_category_id`` is determined by ``item_id`` and adds no information beyond it. If kept, the redundant column could give a misleading impression of the model's performance. The other two columns can be removed because ``item_price`` may depend on the date through depreciation or inflation, and ``profits`` in turn depends on ``item_price``.
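
The one-category-per-item claim can be checked before the column is dropped. The sketch below runs on a small toy frame whose values are illustrative, not taken from the full data set; the names `toy`, `categories_per_item`, and `category_is_redundant` are hypothetical.

```python
import pandas as pd

# Toy rows mirroring two columns of sales_train (illustrative values only).
toy = pd.DataFrame({
    "item_id": [32, 32, 33, 35, 35],
    "item_category_id": [40, 40, 37, 40, 40],
})

# Count the distinct categories observed for each item.
categories_per_item = toy.groupby("item_id")["item_category_id"].nunique()

# True only if every item maps to exactly one category.
category_is_redundant = bool((categories_per_item == 1).all())
print(category_is_redundant)
```

Running the same `groupby` on the real `sales_train` would confirm or refute the assumption for the actual data.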

In [5]:
# Remove unnecessary columns from sales_train
sales_train.drop(['item_price', 'profits', 'item_category_id'], axis=1, inplace=True)

Now that those columns have been removed, the next issue to handle is the ``date`` and ``date_block_num`` columns. As with ``profits`` and ``item_category_id``, which duplicated information held in other columns, these two are correlated with each other and redundant. It suits the problem better to remove ``date_block_num`` and split ``date`` into two new columns, ``month`` and ``year``.

In [6]:
# Split date column into month and year columns
sales_train['month'] = sales_train['date'].dt.month
sales_train['year'] = sales_train['date'].dt.year

In [7]:
# Remove date and date_block_num from training data
sales_train.drop(['date', 'date_block_num'], axis=1, inplace=True)
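
As a sanity check on the redundancy, ``date_block_num`` can be rebuilt from ``month`` and ``year``. The sketch below assumes the block number counts consecutive months starting at 0 in January 2013, which is consistent with the rows shown in the `head()` outputs (2013-01 is block 0, 2015-11 is block 34). The middle row is an illustrative value derived from that assumption.

```python
import pandas as pd

toy = pd.DataFrame({
    "date": pd.to_datetime(["2013-01-31", "2014-06-30", "2015-11-30"]),
    "date_block_num": [0, 17, 34],
})

# Same split as applied to sales_train.
toy["month"] = toy["date"].dt.month
toy["year"] = toy["date"].dt.year

# Months elapsed since January 2013 (assumed definition of date_block_num).
rebuilt = (toy["year"] - 2013) * 12 + (toy["month"] - 1)

blocks_match = bool((rebuilt == toy["date_block_num"]).all())
print(blocks_match)
```

If the check holds on the full data, dropping ``date_block_num`` loses nothing that ``month`` and ``year`` do not already carry.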

Lastly, the columns ``shop_id`` and ``item_id`` should be converted into category types. Each ID number is a label for an actual shop or item, so the numeric values carry no order or magnitude and shouldn't be treated as related to one another.

In [13]:
# Convert columns to categories
sales_train['shop_id'] = sales_train['shop_id'].astype('category')
sales_train['item_id'] = sales_train['item_id'].astype('category')

In [14]:
sales_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1609124 entries, 0 to 1609123
Data columns (total 5 columns):
shop_id         1609124 non-null category
item_id         1609124 non-null category
item_cnt_day    1609124 non-null float64
month           1609124 non-null int64
year            1609124 non-null int64
dtypes: category(2), float64(1), int64(2)
memory usage: 54.5 MB
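
A short sketch of what the category conversion does, on a toy frame with illustrative values. It also shows one-hot encoding with `pd.get_dummies` as one possible way to hand categorical IDs to a model that expects numeric input; that step is an assumption and not part of the notebook above.

```python
import pandas as pd

toy = pd.DataFrame({
    "shop_id": [0, 5, 5, 59],
    "item_cnt_day": [6.0, 1.0, 2.0, 3.0],
})
toy["shop_id"] = toy["shop_id"].astype("category")

# The original IDs are kept as category labels.
labels = toy["shop_id"].cat.categories.tolist()

# Internal integer codes are positions into `labels`, not the IDs themselves.
codes = toy["shop_id"].cat.codes.tolist()

# One indicator column per shop, so no ordering between IDs is implied.
dummies = pd.get_dummies(toy["shop_id"], prefix="shop")
dummy_columns = list(dummies.columns)

print(labels, codes, dummy_columns)
```

With as many distinct items as the real data has, one-hot encoding ``item_id`` would produce a very wide frame, so a different encoding may be more practical there.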


Finally, the test set needs to be preprocessed as well.

In [10]:
# Preprocess test set
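
One way to mirror the training steps on the test set is sketched below on a toy frame with the columns shown in `sales_test.head()`. The helper name `preprocess_sales` is hypothetical, and the choices to keep ``ID`` and to drop ``item_category_id`` from the test set are assumptions.

```python
import pandas as pd

def preprocess_sales(df, drop_cols):
    """Hypothetical helper: apply the training-set steps to another frame."""
    out = df.copy()
    # Split the date into month and year, as done for sales_train.
    out["month"] = out["date"].dt.month
    out["year"] = out["date"].dt.year
    # errors="ignore" skips columns the frame does not have
    # (the test set has no item_price or profits).
    to_drop = list(drop_cols) + ["date", "date_block_num"]
    out = out.drop(columns=to_drop, errors="ignore")
    # Treat the IDs as labels rather than numbers.
    for col in ("shop_id", "item_id"):
        out[col] = out[col].astype("category")
    return out

toy_test = pd.DataFrame({
    "ID": [0, 1],
    "shop_id": [5, 5],
    "item_id": [5037, 5320],
    "item_category_id": [19, 55],
    "date": pd.to_datetime(["2015-11-30", "2015-11-30"]),
    "date_block_num": [34, 34],
})

processed = preprocess_sales(
    toy_test, drop_cols=["item_price", "profits", "item_category_id"]
)
print(list(processed.columns))
```

Routing both frames through a single function keeps the train and test columns aligned, apart from the target ``item_cnt_day`` and the test-only ``ID``.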