# M5 Forecasting Accuracy 

Note: This is one of the two complementary competitions that together make up the M5 forecasting challenge. Can you estimate, as precisely as possible, the point forecasts of the unit sales of various products sold in the USA by Walmart? If you are interested in estimating the uncertainty distribution of the realized values of the same series, be sure to check out its companion competition.

How much camping gear will one store sell each month in a year? To the uninitiated, calculating sales at this level may seem as difficult as predicting the weather. Both types of forecasting rely on science and historical data. While a wrong weather forecast may result in you carrying around an umbrella on a sunny day, inaccurate business forecasts could result in actual or opportunity losses. In this competition, in addition to traditional forecasting methods, you’re also challenged to use machine learning to improve forecast accuracy.

The Makridakis Open Forecasting Center (MOFC) at the University of Nicosia conducts cutting-edge forecasting research and provides business forecast training. It helps companies achieve accurate predictions, estimate levels of uncertainty, avoid costly mistakes, and apply best forecasting practices. The MOFC is well known for its Makridakis Competitions, the first of which ran in the 1980s.

In this competition, the fifth iteration, you will use hierarchical sales data from Walmart, the world’s largest company by revenue, to forecast daily sales for the next 28 days. The data covers stores in three US states (California, Texas, and Wisconsin) and includes item-level, department, product category, and store details. In addition, it has explanatory variables such as price, promotions, day of the week, and special events. Together, this robust dataset can be used to improve forecasting accuracy.

If successful, your work will continue to advance the theory and practice of forecasting. The methods used can be applied in various business areas, such as setting up appropriate inventory or service levels. Through its business support and training, the MOFC will help distribute the tools and knowledge so others can achieve more accurate and better calibrated forecasts, reduce waste and be able to appreciate uncertainty and its risk implications.

Acknowledgements
Additional thanks go to other partner organizations and prize sponsors, National Technical University of Athens (NTUA), INSEAD, Google, Uber and IIF.



In [1]:
# Load packages
import numpy as np
import pandas as pd
import warnings, gc, sys, json
from m5_utils import *

pd.set_option('display.float_format', lambda x: '%.5f' % x)
pd.set_option('display.max_columns', 100)
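
Later cells call `reduce_mem_usage`, which is not defined in this notebook and presumably arrives through `from m5_utils import *`. Since that module is not shown, the following is only a minimal sketch of what such a helper commonly does (downcasting numeric columns); the name `reduce_mem_usage_sketch` and its behavior are assumptions, not the actual `m5_utils` code.

```python
import numpy as np
import pandas as pd

def reduce_mem_usage_sketch(df):
    """Hypothetical stand-in: downcast numeric columns to smaller dtypes."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_integer_dtype(out[col]):
            # Smallest integer dtype that still holds every value
            out[col] = pd.to_numeric(out[col], downcast='integer')
        elif pd.api.types.is_float_dtype(out[col]):
            # float64 -> float32 where possible (may lose some precision)
            out[col] = pd.to_numeric(out[col], downcast='float')
    return out

# Toy frame with made-up values
toy = pd.DataFrame({
    'demand': np.array([0, 3, 12], dtype='int64'),
    'sell_price': np.array([1.5, 2.25, 9.0], dtype='float64'),
})
small = reduce_mem_usage_sketch(toy)
print(small.dtypes)
```

Non-numeric columns such as categoricals are left untouched in this sketch. Downcasting floats trades precision for memory, which may or may not be acceptable for prices.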

In [2]:
# Load datasets
calendar = pd.read_csv('input/calendar.csv'); del calendar['weekday']
data = pd.read_csv('input/sales_train_validation.csv')
submission = pd.read_csv('input/sample_submission.csv')
sell_prices = pd.read_csv('input/sell_prices.csv')

In [3]:
print(calendar.shape)
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same preview
pd.concat([calendar.head(), calendar.tail()])

(1969, 13)


Unnamed: 0,date,wm_yr_wk,wday,month,year,d,event_name_1,event_type_1,event_name_2,event_type_2,snap_CA,snap_TX,snap_WI
0,2011-01-29,11101,1,1,2011,d_1,,,,,0,0,0
1,2011-01-30,11101,2,1,2011,d_2,,,,,0,0,0
2,2011-01-31,11101,3,1,2011,d_3,,,,,0,0,0
3,2011-02-01,11101,4,2,2011,d_4,,,,,1,1,0
4,2011-02-02,11101,5,2,2011,d_5,,,,,1,0,1
1964,2016-06-15,11620,5,6,2016,d_1965,,,,,0,1,1
1965,2016-06-16,11620,6,6,2016,d_1966,,,,,0,0,0
1966,2016-06-17,11620,7,6,2016,d_1967,,,,,0,0,0
1967,2016-06-18,11621,1,6,2016,d_1968,,,,,0,0,0
1968,2016-06-19,11621,2,6,2016,d_1969,NBAFinalsEnd,Sporting,Father's day,Cultural,0,0,0


In [None]:
def preprocess_sub(submission,generate_dict = None):
    
    # Split the submission ids by suffix. Note the naming used throughout:
    # '_validation' ids go to `test`, '_evaluation' ids go to `val`.
    test_rows = [row for row in submission['id'] if 'validation' in row]
    val_rows = [row for row in submission['id'] if 'evaluation' in row]
    
    # Create template for validation and evaluation
    test = submission[submission['id'].isin(test_rows)]
    val = submission[submission['id'].isin(val_rows)]
    
    # Identify which forecasting days belong to validation and evaluation
    n = 1914
    test_days = np.arange(n, n+28, 1)
    val_days = np.arange(n+28, n+28+28, 1)
    test_columns = ['id']+['d_'+ str(value) for value in test_days]
    val_columns = ['id']+['d_'+ str(value) for value in val_days]
    
    # Creates a dict to later be used as reference when submitting
    sub_dict_1 = {}
    if generate_dict is not None:
        sub_dict_1['test'] = (dict(zip(test_columns[1::],submission.columns[1::])))
        sub_dict_1['val'] = (dict(zip(val_columns[1::],submission.columns[1::])))

    # Replace columns name
    test.columns = test_columns
    val.columns = val_columns
    

    # Reshape from wide (one column per day) to long (one row per id and day)
    test = pd.melt(test, id_vars= 'id', var_name= 'day', value_name= 'demand')
    val = pd.melt(val, id_vars= 'id', var_name= 'day', value_name= 'demand')
    
    return test, val, sub_dict_1

def create_dict(train, calendar):
    # 1:'Saturday', 7:'Friday'
    
    sub_dict1 = {}
    # Define columns that will be in the dictionary
    cols = ['id', 'item_id', 'dept_id', 'cat_id', 'store_id', 'state_id']

    # Loop through the train dataframe to get the values for the dictionary
    for col in train.columns:
        if col in cols:
            # Set the column as category and generate a dictionary out of it
            train[col] = train[col].astype('category')
            tmp_dict = dict(enumerate(train[col].cat.categories))
            # Compile the information into a nested dictionary
            sub_dict1[col] = tmp_dict

    # Define other columns that will be in the dictionary
    cols = [ 'wm_yr_wk', 'd','event_name_1', 'event_type_1', 'event_name_2', 'event_type_2']

    # Loop through the calendar dataframe to get the values for the dictionary
    for col in calendar.columns:
        if col in cols:
            # Set the column as category and generate a dictionary out of it
            calendar[col] = calendar[col].astype('category')
            tmp_dict = dict(enumerate(calendar[col].cat.categories))
            # Compile the information into a nested dictionary
            sub_dict1[col] = tmp_dict
    sub_dict1['year'] = {1: 2011, 2: 2012, 3: 2013, 4: 2014, 5: 2015, 6: 2016}
    sub_dict1['source'] = {0:'train', 1:'val', 2:'test'}

    return train, calendar, sub_dict1

def create_inv_dict(source_dict):
    
    inv_dict = {}

    for key, value in source_dict.items():
        tmp_dict = {}
        for i, item in value.items():
            if key != 'year':
                tmp_dict[item] = i
        inv_dict[key] = tmp_dict
    
    inv_dict['year'] = {2011:1, 2012:2, 2013:3, 2014:4, 2015:5, 2016:6}

    return inv_dict

def preprocess(data, submission, calendar = calendar, generate_dict = None):
    
    if generate_dict is not None:
        data, calendar, dict_m5_1 = create_dict(data, calendar)
    else:
        dict_m5_1 = {}
    
    product = data[['id', 'item_id', 'dept_id', 'cat_id', 'store_id', 'state_id']]
    train = pd.melt(data, id_vars= data.iloc[:,0:6].columns, var_name= 'day', value_name= 'demand')
    train_columns = train.columns
    train['source'] = 'train'
    
    test, val, dict_m5_2 = preprocess_sub(submission, generate_dict)
    
    # Merge both dictionaries
    dict_m5 = dict_m5_1
    dict_m5.update(dict_m5_2)
        
    # Merge products to test and validation
    test = test.merge(product, how = 'left', on = 'id')
    # product only carries '_validation' ids, so rename the evaluation ids
    # for the merge and restore them afterwards
    val['id'] = val['id'].str.replace('_evaluation', '_validation', regex=False)
    val = val.merge(product, how = 'left', on = 'id')
    val['id'] = val['id'].str.replace('_validation', '_evaluation', regex=False)

    # Copy so that adding the 'source' column does not trigger SettingWithCopyWarning
    test = test[train_columns].copy()
    val = val[train_columns].copy()
    test['source'] = 'test'
    val['source'] = 'val'
    
    del product, submission, data
    
    dict_m5_inv = create_inv_dict(dict_m5)
    
    return train, test, val, calendar, dict_m5, dict_m5_inv
    

def merge_df(df, calendar, sell_prices = None):
    
    df = pd.merge(df, calendar, how = 'left', left_on = ['day'], right_on = ['d'])
    del df['d'], df['day']
    df = reduce_mem_usage(df)
    
    if sell_prices is not None:
        df = df.merge(sell_prices, on = ['store_id', 'item_id', 'wm_yr_wk'], how = 'left')
        df = reduce_mem_usage(df)
        
    return df


In [None]:
%%time
train, test, val, calendar, dict_m5, dict_m5_inv = preprocess(data, submission, calendar, generate_dict = True)

In [None]:
with open('input/dict_m5.json') as json_file:
    dict_m5 = json.load(json_file)
with open('input/dict_m5_inv.json') as json_file:
    dict_m5_inv = json.load(json_file)

In [None]:
with open('input/dict_m5.json', 'w') as json_file:
    json.dump(dict_m5, json_file)
    
with open('input/dict_m5_inv.json', 'w') as json_file:
    json.dump(dict_m5_inv, json_file)
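
One thing to watch when round-tripping these dictionaries through JSON: JSON object keys are always strings, so the integer codes used as keys come back as `'0'`, `'1'`, and so on. The sketch below shows the effect and one possible fix; `restore_int_keys` is a hypothetical helper, and the dictionary content is a toy example rather than the real `dict_m5`.

```python
import json

def restore_int_keys(nested):
    """Turn digit-only string keys of the inner dicts back into ints."""
    return {
        name: {int(k) if k.lstrip('-').isdigit() else k: v
               for k, v in inner.items()}
        for name, inner in nested.items()
    }

# Toy stand-in for dict_m5: integer codes and day-column names as keys
original = {
    'cat_id': {0: 'FOODS', 1: 'HOBBIES', 2: 'HOUSEHOLD'},
    'test': {'d_1914': 'F1'},
}

loaded = json.loads(json.dumps(original))   # int keys are now strings
restored = restore_int_keys(loaded)         # int keys recovered

print(list(loaded['cat_id']))
print(list(restored['cat_id']))
```

The same applies to the inverse dictionary wherever its keys are integers, for example the `year` mapping.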

In [None]:
%%time
train = merge_df(train, calendar, sell_prices)
print(train.shape); train.head()

In [None]:
%%time
test = merge_df(test, calendar, sell_prices)
print(test.shape); test.head()

In [None]:
%%time
val = merge_df(val, calendar, sell_prices)
print(val.shape); val.head()

In [None]:
%%time
test.to_pickle('input/test_v0.pkl')

In [None]:
%%time
val.to_pickle('input/val_v0.pkl')

In [None]:
del test, val, data, calendar, sell_prices, submission
gc.collect()

In [None]:
train_2011 = train.loc[train['year']==2011,:]
train_2011.to_pickle('input/train_2011.pkl')
del train_2011

In [None]:
train_2012 = train.loc[train['year']==2012,:]
train_2012.to_pickle('input/train_2012.pkl')
del train_2012

In [None]:
train_2013 = train.loc[train['year']==2013,:]
train_2013.to_pickle('input/train_2013.pkl')
del train_2013

In [None]:
train_2014 = train.loc[train['year']==2014,:]
train_2014.to_pickle('input/train_2014.pkl')
del train_2014

In [None]:
train_2015 = train.loc[train['year']==2015,:]
train_2015.to_pickle('input/train_2015.pkl')
del train_2015

In [None]:
train_2016 = train.loc[train['year']==2016,:]
train_2016.to_pickle('input/train_2016.pkl')
del train_2016
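
The six per-year cells above could also be written as a single loop. A minimal sketch, assuming the frame has a `year` column as `train` does; `split_by_year` is a hypothetical helper, and it writes to a temporary directory here instead of `input/`.

```python
import os
import tempfile
import pandas as pd

def split_by_year(df, out_dir, prefix='train'):
    """Write one pickle per distinct year and return {year: path}."""
    paths = {}
    for year, part in df.groupby('year'):
        path = os.path.join(out_dir, f'{prefix}_{year}.pkl')
        part.to_pickle(path)
        paths[int(year)] = path
    return paths

# Toy frame with made-up values
toy = pd.DataFrame({'year': [2011, 2011, 2012], 'demand': [0, 1, 2]})
out_dir = tempfile.mkdtemp()
paths = split_by_year(toy, out_dir)
print(sorted(paths))
```

Each chunk is written and released within its own loop iteration, much like the explicit `del` statements in the cells above.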