# 02 Linear regression

In this notebook we slightly refine our previous approach to predicting sales by fitting a linear regression to the monthly item counts.

## Reading data

We read the datasets from the CSV files.

In [26]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt

from numpy.polynomial.polynomial import polyfit

train_data = pd.read_csv('data/sales_train.csv')
test_data = pd.read_csv('data/test.csv')

train_data.head()

Unnamed: 0,date,date_block_num,shop_id,item_id,item_price,item_cnt_day
0,02.01.2013,0,59,22154,999.0,1.0
1,03.01.2013,0,25,2552,899.0,1.0
2,05.01.2013,0,25,2552,899.0,-1.0
3,06.01.2013,0,25,2554,1709.05,1.0
4,15.01.2013,0,25,2555,1099.0,1.0


We convert the date into a datetime object.

In [27]:
train_data['date'] = pd.to_datetime(train_data['date'], format="%d.%m.%Y")

train_data.head()

Unnamed: 0,date,date_block_num,shop_id,item_id,item_price,item_cnt_day
0,2013-01-02,0,59,22154,999.0,1.0
1,2013-01-03,0,25,2552,899.0,1.0
2,2013-01-05,0,25,2552,899.0,-1.0
3,2013-01-06,0,25,2554,1709.05,1.0
4,2013-01-15,0,25,2555,1099.0,1.0
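
The explicit `format` matters because the raw dates are written day first. As a minimal sketch, using three made-up date strings (not taken from the dataset), this is how `%d.%m.%Y` is interpreted:

```python
import pandas as pd

# Hypothetical day-first date strings in the same layout as the `date` column
raw_dates = pd.Series(["02.01.2013", "15.01.2013", "31.12.2014"])

# With an explicit format there is no ambiguity between day and month
parsed = pd.to_datetime(raw_dates, format="%d.%m.%Y")

print(parsed.dt.day.tolist())    # [2, 15, 31]
print(parsed.dt.month.tolist())  # [1, 1, 12]
```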


## Linear regression

We fit a separate linear regression to the monthly item counts of every shop and item pair. To validate our predictions, we split the training dataset into the first 33 months (`date_block_num` 0 to 32), used for fitting, and the last month (`date_block_num` 33), used for validation.

In [55]:
# Index of the last month (date_block_num runs from 0 to 33);
# this is also the number of months used for fitting
n_months = 33

# Split dataset into training and validation
last_month = train_data[train_data['date_block_num'] == n_months]
first_months = train_data[train_data['date_block_num'] < n_months]

first_months.shape, last_month.shape

((2882335, 6), (53514, 6))

Now we aggregate the data by adding up all the daily item counts over each month.

In [141]:
# Aggregate datasets per month
first_agg_month = first_months.groupby(['shop_id', 'item_id', 'date_block_num'], as_index=False)['item_cnt_day'].sum().rename(columns={'item_cnt_day': 'item_cnt_month'})
last_agg_month = last_month.groupby(['shop_id', 'item_id', 'date_block_num'], as_index=False)['item_cnt_day'].sum().rename(columns={'item_cnt_day': 'item_cnt_month'})

first_agg_month.head()

Unnamed: 0,shop_id,item_id,date_block_num,item_cnt_month
0,0,30,1,31.0
1,0,31,1,11.0
2,0,32,0,6.0
3,0,32,1,10.0
4,0,33,0,3.0
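
Because the aggregation is a plain sum, negative daily counts (such as the -1.0 in the first preview of the training data) are netted against positive ones. The sketch below uses a made-up toy frame; reading the negative count as a return is an assumption, not something the dataset states here.

```python
import pandas as pd

# Hypothetical daily records: one shop, one item, two months.
# The -1.0 is assumed to represent a return.
daily = pd.DataFrame({
    "shop_id": [25, 25, 25, 25],
    "item_id": [2552, 2552, 2552, 2552],
    "date_block_num": [0, 0, 1, 1],
    "item_cnt_day": [1.0, -1.0, 2.0, 3.0],
})

# Same aggregation as in the cell above, applied to the toy frame
monthly = (
    daily.groupby(["shop_id", "item_id", "date_block_num"], as_index=False)["item_cnt_day"]
    .sum()
    .rename(columns={"item_cnt_day": "item_cnt_month"})
)

print(monthly["item_cnt_month"].tolist())  # [0.0, 5.0]
```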


We now create a new dataframe containing, for each shop and item, the coefficients of the linear regression: the intercept (`lr_b`) and the slope (`lr_m`) of the fitted line.

In [182]:
# Create constant series with all date_block_nums
date_block_nums = pd.Series(range(n_months), name='date_block_num')

# Rows of the result, one per (shop, item) pair. Collecting them in a list
# replaces DataFrame.append, which was removed in pandas 2.0
rows = []

# Iterate over shops
shop_ids = first_agg_month['shop_id'].unique()
for shop_id in shop_ids:
    first_agg_month_shop = first_agg_month[first_agg_month['shop_id'] == shop_id]

    print(shop_id, end=' ')

    # Iterate over the items of the given shop
    item_ids = first_agg_month_shop['item_id'].unique()
    for item_id in item_ids:
        first_agg_month_shop_item = first_agg_month_shop[first_agg_month_shop['item_id'] == item_id]

        # Left-merge on date_block_num so months without sales get a count of zero
        x = pd.merge(date_block_nums, first_agg_month_shop_item, how='left', on='date_block_num').fillna(0)
        # Fit the line; polyfit returns coefficients in increasing degree (intercept b, slope m)
        b, m = polyfit(x['date_block_num'], x['item_cnt_month'], deg=1)

        rows.append([shop_id, item_id, b, m])

# Dataframe with the results of the linear regression
first_agg_all_time = pd.DataFrame(rows, columns=['shop_id', 'item_id', 'lr_b', 'lr_m'])

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 
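
The nested loop calls `polyfit` once per shop and item pair. As a possible alternative (not what this notebook ran), the same least-squares line can be computed for all pairs at once from the closed-form solution of simple linear regression. The function name `fit_lines` and the toy data are hypothetical and only serve the illustration.

```python
import numpy as np
import pandas as pd
from numpy.polynomial.polynomial import polyfit

def fit_lines(agg_month, n_months):
    """Least-squares intercept and slope per (shop_id, item_id), without a Python loop."""
    # One row per (shop, item), one column per month; missing months become 0
    wide = (
        agg_month.pivot_table(index=["shop_id", "item_id"], columns="date_block_num",
                              values="item_cnt_month", aggfunc="sum", fill_value=0)
        .reindex(columns=range(n_months), fill_value=0)
    )
    x = np.arange(n_months)
    x_centered = x - x.mean()
    y = wide.to_numpy(dtype=float)

    # Closed form: slope = sum((x - mean(x)) * y) / sum((x - mean(x))^2)
    m = y @ x_centered / (x_centered ** 2).sum()
    # Intercept = mean(y) - slope * mean(x)
    b = y.mean(axis=1) - m * x.mean()

    return pd.DataFrame({"lr_b": b, "lr_m": m}, index=wide.index).reset_index()

# Made-up monthly counts for three (shop, item) pairs over four months
toy = pd.DataFrame({
    "shop_id": [0, 0, 0, 1],
    "item_id": [30, 30, 31, 30],
    "date_block_num": [0, 2, 1, 3],
    "item_cnt_month": [4.0, 2.0, 5.0, 1.0],
})
coeffs = fit_lines(toy, n_months=4)

# Cross-check pair (0, 30) against polyfit on its zero-filled series
b_ref, m_ref = polyfit(np.arange(4), [4.0, 0.0, 2.0, 0.0], deg=1)
first = coeffs.loc[0, ["lr_b", "lr_m"]].astype(float)
print(np.allclose(first, [b_ref, m_ref]))  # True
```

The pivot materialises one column per month for every pair, so this sketch trades memory for speed; whether that trade-off pays off on the full dataset has not been measured here.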

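We reset the indices and compute our predictions for month 33 by evaluating each fitted line at that month. Negative predictions are clipped to zero.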

In [217]:
first_agg_all_time = first_agg_all_time.reset_index(drop=True)

first_agg_all_time['pred_33'] = first_agg_all_time['lr_b'] + first_agg_all_time['lr_m'] * 33
first_agg_all_time['pred_33'] = first_agg_all_time['pred_33'].clip(lower=0)

first_agg_all_time.head()

Unnamed: 0,shop_id,item_id,lr_b,lr_m,pred_33
0,0,30,3.426025,-0.155414,0.0
1,0,31,1.215686,-0.055147,0.0
2,0,32,1.800357,-0.082219,0.0
3,0,33,0.679144,-0.031083,0.0
4,0,35,1.663102,-0.075535,0.0
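
As a worked check of the first row, the sketch below evaluates the fitted line by hand, using the coefficients of shop 0 and item 30 copied from the table above. The raw extrapolation is negative, which is why the clipped prediction is zero.

```python
import numpy as np

# Coefficients of shop 0, item 30, copied from the table above
lr_b, lr_m = 3.426025, -0.155414

# Extrapolate the fitted line to month 33
raw_pred = lr_b + lr_m * 33
# Negative sales are not meaningful, so clip at 0
pred_33 = np.clip(raw_pred, 0, None)

print(round(raw_pred, 6))  # -1.702637
print(pred_33)             # 0.0
```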


Finally, we combine our predictions and the ground truth in a single dataframe. Shop and item pairs with no sales in the last month get a true count of zero.

In [218]:
predictions = pd.merge(first_agg_all_time[['shop_id', 'item_id', 'pred_33']], last_agg_month[['shop_id', 'item_id','item_cnt_month']], on=['shop_id', 'item_id'], how='left').rename(columns={'item_cnt_month': 'true_item_cnt'}).fillna(0)

# Compute the squared error
predictions['se'] = (predictions['pred_33'] - predictions['true_item_cnt'])**2

predictions.head()

Unnamed: 0,shop_id,item_id,pred_33,true_item_cnt,se
0,0,30,0.0,0.0,0.0
1,0,31,0.0,0.0,0.0
2,0,32,0.0,0.0,0.0
3,0,33,0.0,0.0,0.0
4,0,35,0.0,0.0,0.0


We use the root mean squared error (RMSE) as our error metric. Its value is rather high, which is due to the naive approach used for the predictions.

In [219]:
# Compute the root mean squared error
rmse = np.sqrt(predictions['se'].mean())

rmse

3.447228314471813
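
Because the errors are squared before averaging, a few large misses can dominate the RMSE. The sketch below illustrates this on made-up numbers with a hypothetical helper `rmse`; it does not establish which pairs drive the value reported above.

```python
import numpy as np

def rmse(y_pred, y_true):
    """Root mean squared error between two equally sized arrays."""
    y_pred = np.asarray(y_pred, dtype=float)
    y_true = np.asarray(y_true, dtype=float)
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))

# Made-up values: three perfect predictions and one miss of 10 units
y_true = [0.0, 0.0, 0.0, 10.0]
y_pred = [0.0, 0.0, 0.0, 0.0]

print(rmse(y_pred, y_true))  # 5.0
```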

## Submission

We are now ready to submit our predictions. To avoid repeating the fit on the whole dataset, we take a shortcut and reuse the linear regression coefficients fitted on the first 33 months instead of refitting on all 34.

In [220]:
first_agg_all_time['pred_34'] = first_agg_all_time['lr_b'] + first_agg_all_time['lr_m'] * 34
first_agg_all_time['pred_34'] = first_agg_all_time['pred_34'].clip(lower=0)

first_agg_all_time.head()

Unnamed: 0,shop_id,item_id,lr_b,lr_m,pred_33,pred_34
0,0,30,3.426025,-0.155414,0.0,0.0
1,0,31,1.215686,-0.055147,0.0,0.0
2,0,32,1.800357,-0.082219,0.0,0.0
3,0,33,0.679144,-0.031083,0.0,0.0
4,0,35,1.663102,-0.075535,0.0,0.0


Finally, we generate the CSV file ready for submission.

In [221]:
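def generate_submission_file(data, pred_field, filename='submission.csv'):
    # Attach a prediction to every test row; (shop, item) pairs that are
    # missing from `data` have no fitted line and get a prediction of 0
    predictions = pd.merge(test_data, data, on=['shop_id', 'item_id'], how='left').fillna(0)

    # Keep only the columns required in the submission file
    output = pd.DataFrame({'ID': predictions['ID'], 'item_cnt_month': predictions[pred_field]})
    # Note: the 'predictions' directory must already exist
    output.to_csv('predictions/' + filename, index=False)
    print("Submission successfully saved!")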

In [222]:
generate_submission_file(first_agg_all_time, 'pred_34')

Submission successfully saved!
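
The left merge followed by `fillna(0)` in `generate_submission_file` means that test pairs without fitted coefficients are predicted as zero. A minimal sketch with made-up frames (`toy_test` and `toy_fit` are hypothetical names) shows the effect:

```python
import pandas as pd

# Hypothetical test pairs; (5, 999) has no fitted coefficients
toy_test = pd.DataFrame({"ID": [0, 1], "shop_id": [5, 5], "item_id": [100, 999]})
toy_fit = pd.DataFrame({"shop_id": [5], "item_id": [100], "pred_34": [2.5]})

# Same merge as in generate_submission_file, applied to the toy frames
merged = pd.merge(toy_test, toy_fit, on=["shop_id", "item_id"], how="left").fillna(0)

print(merged["pred_34"].tolist())  # [2.5, 0.0]
```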
