# Store Sales - Time Series Forecasting

Use machine learning to predict grocery sales. [source](https://www.kaggle.com/competitions/store-sales-time-series-forecasting/overview/description)

## Objective

In this Kaggle competition, the goal is to 

> build a model that more accurately predicts the unit sales for thousands of items sold at different Favorita stores.

The evaluation metric for this competition is ***Root Mean Squared Logarithmic Error***.

The `RMSLE` is calculated as:

$$\sqrt{ \frac{1}{n} \sum_{i=1}^n \left(\log (1 + \hat{y}_i) - \log (1 + y_i)\right)^2}$$

where:

- $n$ is the total number of instances,
- $\hat{y}_i$ is the predicted value of the target for instance $i$,
- $y_i$ is the actual value of the target for instance $i$, and
- $\log$ is the natural logarithm.
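
As a quick sanity check on the formula, here is a minimal NumPy sketch. The function name `rmsle` and the sample values are illustrative assumptions rather than competition material; `np.log1p(x)` computes $\log(1 + x)$.

```python
import numpy as np


def rmsle(y_true, y_pred):
    """Root Mean Squared Logarithmic Error, following the formula above."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # log1p(x) = log(1 + x); both inputs are expected to be non-negative
    log_diff = np.log1p(y_pred) - np.log1p(y_true)
    return float(np.sqrt(np.mean(log_diff ** 2)))


# illustrative values only
actual = [0.0, 3.0, 10.0]
perfect = rmsle(actual, actual)        # identical predictions
shifted = rmsle([0.0], [np.e - 1.0])   # log(1 + (e - 1)) - log(1 + 0) = 1
print(perfect, shifted)
```

Because the error is taken on the log scale, an under-prediction is penalised more heavily than an over-prediction of the same absolute size.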

For each id in the test set, you must predict a value for the sales variable. The file should contain a header and have the following format:

```
id,sales
3000888,0.0
3000889,0.0
3000890,0.0
3000891,0.0
3000892,0.0
etc.
```
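
The submission layout above could be produced with the standard `csv` module, as in the sketch below. The helper name `write_submission` and the sample predictions are hypothetical. Clipping negative predictions to zero is also an assumption made here, on the grounds that sales cannot be negative and $\log(1 + \hat{y}_i)$ is undefined for $\hat{y}_i \le -1$.

```python
import csv
import io


def write_submission(predictions, handle):
    """Write (id, sales) pairs as CSV with the required header."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["id", "sales"])
    for row_id, sales in predictions:
        # assumption: clip negative predictions to zero before writing
        writer.writerow([row_id, max(float(sales), 0.0)])


# illustrative predictions; in practice these come from the fitted model
preds = [(3000888, 0.0), (3000889, 1.5), (3000890, -0.2)]
buffer = io.StringIO()
write_submission(preds, buffer)
submission_text = buffer.getvalue()
print(submission_text)
```

Passing an open file handle instead of the `io.StringIO` buffer would write the same rows to disk.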


## Libraries for this research notebook

In [None]:
import pandas as pd
from tqdm.auto import tqdm

# to overcome path issue for src
%reload_ext autoreload
%autoreload 2

from pathlib import Path
import sys

# set the path to the current file
current_file_path = Path().resolve()
print(f"current_file_path is {current_file_path}")

# set the path to the src folder
src_folder_path = current_file_path.parent / 'src'
print(f"src_folder_path is {src_folder_path}")

# add the src folder to the system path
sys.path.append(str(src_folder_path))

from data_loader import DBDataLoader
import data_cleaner as dc
import data_exploration as eda
import helper as helper

import polars as pl

## Data Ingestion

Query data from local MySQL using SQLAlchemy and Polars

In [None]:
# define connection string for local database
conn = DBDataLoader().get_connection_string()

In [None]:
# query all records from full_df table
query = 'SELECT * FROM full_df'

# load the table into a Polars DataFrame (read_database_uri with the ConnectorX engine)
results = pl.read_database_uri(query=query, uri=conn, engine="connectorx")

print(f'Size of dataframe ingested: {round(results.estimated_size("mb"),2)} MB')

In [None]:
# make a copy of queried results - reduce the need to query and wait.
df = results.clone()

## Exploratory Data Analysis (Polars)

In [None]:
# 7 verbs get most of the job done
# select/slice columns -> select
# create/transform/assign columns -> with_columns
# filter/slice/query rows -> filter
# join/merge another dataframe -> join
# group dataframe rows -> group_by (spelled groupby in older Polars releases)
# aggregate groups -> agg
# sort dataframe -> sort

In [None]:
# minor clean-up: tidy the query output left over from the SQL join statements
df = dc.query_clean_up(df)

In [None]:
# downcast columns datatypes to reduce dataframe memory
df = dc.shrink_dataframe(df)

In [None]:
# plot temporal average of sales
eda.plot_sales_averages(dataframe=df, select_hierarchy='city', name='Babahoyo')

In [None]:
# plot an inventory heat map for all stores
eda.plot_stores_inventory(dataframe=df)

In [None]:
# plot the count of products sold for all stores
eda.plot_prod_count_per_store(dataframe=df)

In [None]:
# categorize each family into groups
departments = ['grocery', 'household', 'apparel', 'alcohol', 'entertainment', 'kids']

grocery = [
    'GROCERY I', 'GROCERY II', 'BEVERAGES', 'BREAD/BAKERY', 'CANNED GOODS',
    'CEREAL', 'DELI', 'EGGS', 'FROZEN FOODS', 'MEATS',
    'POULTRY', 'PREPARED FOODS', 'DAIRY', 'SEAFOOD'
]

household = [
    'CLEANING', 'HOME AND KITCHEN I', 'HOME AND KITCHEN II', 'HOME APPLIANCES', 'LAWN AND GARDEN',
    'PET SUPPLIES', 'HARDWARE'
]

apparel = ['LADIESWEAR', 'LINGERIE', 'BEAUTY', 'PERSONAL CARE']

alcohol = ['LIQUOR,WINE,BEER']

entertainment = ['BOOKS', 'MAGAZINES', 'PLAYERS AND ELECTRONICS']

kids = ['BABY CARE', 'TOYS', 'SCHOOL AND OFFICE SUPPLIES', 'CELEBRATION']

# store the groupings in a dictionary
inventory = {
    'grocery': grocery,
    'household': household,
    'apparel': apparel,
    'alcohol': alcohol,
    'entertainment': entertainment,
    'kids': kids
}
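
When a department label is needed per row, the grouping can be inverted into a family-to-department lookup. The sketch below uses a shortened copy of the dictionary so that it runs on its own; the names `inventory_sample` and `family_to_department` are illustrative and do not appear elsewhere in the notebook.

```python
# a shortened copy of the `inventory` dictionary, for illustration only
inventory_sample = {
    'grocery': ['GROCERY I', 'DAIRY'],
    'alcohol': ['LIQUOR,WINE,BEER'],
    'kids': ['BABY CARE', 'TOYS'],
}

# invert department -> families into family -> department
family_to_department = {
    family: department
    for department, families in inventory_sample.items()
    for family in families
}

print(family_to_department['TOYS'])
```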

In [None]:
dataframe = df
store = 1

# filter dataframe by store
store_df = dataframe.filter(pl.col('store_nbr') == store)

In [None]:
from plotly.subplots import make_subplots
import plotly.graph_objects as go

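# create a figure with 6 subplots: two full-width panels, each spanning
# two rows, followed by a 2x2 grid of smaller panels
departments_fig = make_subplots(
    rows=6,
    cols=2,
    specs=[
        # rows 1-2: subplot (1, 1) spans both columns and two rows
        [{'colspan': 2, 'rowspan': 2}, None],
        [None, None],
        # rows 3-4: subplot (3, 1) spans both columns and two rows
        [{'colspan': 2, 'rowspan': 2}, None],
        [None, None],
        # rows 5-6: four single-cell subplots
        [{}, {}],
        [{}, {}]
    ],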
    subplot_titles=[subplot_title.upper() for subplot_title in inventory.keys()]
)

# full-width panels: (department, subplot row)
for department, row in [('grocery', 1), ('household', 3)]:
    # loop through each product family in the department
    for product in inventory[department]:
        # keep only this product's rows for the store, ordered by date
        product_df = (
            store_df
            .filter(pl.col('family') == product)
            .sort('date')
        )

        # build a line trace for each product
        trace = go.Scatter(
            x=product_df['date'],
            y=product_df['sales'],
            name=product.upper(),
            legendgroup=department,
            hovertemplate=f'{product.upper()}: %{{y:.2f}}'
        )

        # add each product trace to the relevant subplot
        departments_fig.add_trace(trace, row=row, col=1)

departments_fig.update_layout(
    title=f"Products' Sales for store {store}",
    template='plotly_dark',
    height=1500,
)

departments_fig.show();

In [None]:
# get latitude and longitude values for each city
eda.get_geo_coordinates('ggjx22', 'Quito', 'EC')

## Exploratory Data Analysis (Time Series)

Using a single family (product) of a store as practice

In [None]:
df_eda = (
    df
    # filter out a sub set of dataset
    .filter(pl.col('store_nbr') == 1)
    .sort('family')
    # further filter a single time series by a random family
    .filter(pl.col('family') == "AUTOMOTIVE")
    # convert to pandas to have a datetime index
    .to_pandas()
    # set column 'date' as index with a daily freq
    .set_index('date').asfreq('D')
    # for simplicity of eda, consider target variable only
    [['sales']]
)
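
A side effect of `.asfreq('D')` is worth keeping in mind: any calendar day with no row in the source data is added with `NaN` values, which is why the target is imputed later in the PyCaret setup (`numeric_imputation_target='bfill'`). The sketch below shows the effect on a made-up two-row frame; the variable names and values are illustrative.

```python
import pandas as pd

# illustrative series with one calendar day (2013-01-02) missing
toy = pd.DataFrame(
    {'sales': [2.0, 3.0]},
    index=pd.to_datetime(['2013-01-01', '2013-01-03']),
)

# asfreq('D') reindexes to every calendar day; absent days become NaN
toy_daily = toy.asfreq('D')
print(toy_daily)
```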

In [None]:
import plotly.express as px

px.line(data_frame=df_eda)

## Data cleaning

In [None]:
df.dtypes

In [None]:
# df['date'] = pd.to_datetime(df['date'])

In [None]:
df.set_index('date', drop=True, inplace=True)

In [None]:
df['weekday'] = df.index.day_name()

In [None]:
df.isnull().sum()

In [None]:
groupby_store = df.groupby(by=['store_nbr', 'family'], group_keys=True).agg('sum', 'mean')

In [None]:
groupby_store.info()

## Data profile

In [None]:
from ydata_profiling import ProfileReport

In [None]:
profile = ProfileReport(df.to_pandas(), title="Time-Series EDA full df")
profile.to_notebook_iframe()
profile.to_file("../artifacts/reports/full_df_ProfileReport.html") # AttributeError: 'float' object has no attribute 'shape'

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
# Use seaborn style defaults and set the default figure size
sns.set(rc={'figure.figsize':(11, 4)})

In [None]:
df['sales_sum'].plot();

In [None]:
df.columns

In [None]:
cols_plot = ['onpromotion_sum', 'transactions_sum', 'sales_sum']
axes = df[cols_plot].plot(marker='.', alpha=0.5, linestyle='None', figsize=(11, 9), subplots=True)
for ax in axes:
    ax.set_ylabel('Daily Totals')

In [None]:
df_2013 = df.loc['2013']

In [None]:
df_2013

In [None]:
ax = df_2013.loc['2013', 'sales_sum'].plot()
ax.set_ylabel('Daily Sales for 2013');

### Seasonality

In [None]:
fig, axes = plt.subplots(3, 1, figsize=(11, 10), sharex=True)
for name, ax in zip(cols_plot, axes):
    sns.boxplot(data=df, x='weekday', y=name, ax=ax)
    ax.set_ylabel('Sum')
    ax.set_title(name)
    # Remove the automatic x-axis label from all but the bottom subplot
    if ax != axes[-1]:
        ax.set_xlabel('')

In [None]:
fig, axes = plt.subplots(3, 1, figsize=(11, 10), sharex=True)
for name, ax in zip(cols_plot, axes):
    sns.boxplot(data=df, x='month', y=name, ax=ax)
    ax.set_ylabel('Sum')
    ax.set_title(name)
    # Remove the automatic x-axis label from all but the bottom subplot
    if ax != axes[-1]:
        ax.set_xlabel('')

In [None]:
fig, axes = plt.subplots(3, 1, figsize=(11, 10), sharex=True)
for name, ax in zip(cols_plot, axes):
    sns.boxplot(data=df, x='store_nbr', y=name, ax=ax)
    ax.set_ylabel('Sum')
    ax.set_title(name)
    # Remove the automatic x-axis label from all but the bottom subplot
    if ax != axes[-1]:
        ax.set_xlabel('')

## Data preprocessing

In [None]:
df.store_nbr.value_counts()

Notes:

- The data contain outliers, which need to be treated before seasonality can be analysed by `weekday` and `month`.
- Store 44 is selected because it has the widest spread of `transactions_sum`, indicating the most activity. This assumes that a day with 0 transactions means the store was closed, since no sales were recorded that day.
- That assumption is consistent with 0-transaction days occurring on holidays (holiday data is not included, to keep the dataset small).
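
One way to act on the closed-store assumption is to set zero-transaction days aside before looking at distributions. The sketch below runs on made-up daily totals. The column names follow the notebook's `*_sum` convention, and filtering closed days out is an illustrative choice, not a treatment the notes prescribe.

```python
import pandas as pd

# illustrative daily totals for a single store
toy = pd.DataFrame(
    {
        'transactions_sum': [0, 120, 95, 110],
        'sales_sum': [0.0, 900.0, 750.0, 820.0],
    },
    index=pd.date_range('2013-01-01', periods=4, freq='D'),
)

# assumption from the notes: zero transactions means the store was closed
closed = toy['transactions_sum'] == 0

# keep open days only, so closed days do not pull the distribution toward zero
open_days = toy.loc[~closed]
print(open_days['sales_sum'].median())
```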

In [None]:
df_str44 = df[(df.store_nbr == 44)]

In [None]:
grp_df_str44 = df_str44.groupby(by=['date'], group_keys=True).agg('sum')

In [None]:
grp_df_str44.drop(columns=['id','store_nbr','year','month','day_of_month'], inplace=True)

In [None]:
grp_df_str44 = grp_df_str44.asfreq('D')

## AutoML with PyCaret

EDA and ML


In [None]:
grp_df_str44.info()

In [None]:
# check installed version
import pycaret
pycaret.__version__

In [None]:
from sktime.transformations.series.summarize import WindowSummarizer

# feature engineering on target variable (reduced regression models only)
kwargs_target = {
    'lag_feature':{
        'lag':[1,2,3],
        'mean':[[3]]
    }
}

fe_target_rr = [WindowSummarizer(n_jobs=-1, truncate='bfill', **kwargs_target)]
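
To make the idea of lagged target features concrete, the sketch below builds comparable columns with plain pandas. It does not reproduce `WindowSummarizer`'s exact windowing or column naming. The column names are hypothetical, and the trailing mean over the three preceding observations is only an assumed reading of the `mean` entry.

```python
import pandas as pd

# illustrative target series
sales = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0], name='sales')

features = pd.DataFrame({
    'sales_lag_1': sales.shift(1),   # value one step back
    'sales_lag_2': sales.shift(2),   # value two steps back
    'sales_lag_3': sales.shift(3),   # value three steps back
    # mean of the three preceding observations (current value excluded)
    'sales_mean_prev_3': sales.shift(1).rolling(window=3).mean(),
})

# leading rows have no history; back-fill them, similar in spirit to truncate='bfill'
features = features.bfill()
print(features)
```

Every feature uses only past values, so no information from the step being predicted leaks into its own inputs.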

In [None]:
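# SlidingWindowSplitter is exported by sktime.forecasting.model_selection, not by
# transformations.series.summarize (newer sktime releases expose it under sktime.split)
from sktime.forecasting.model_selection import SlidingWindowSplitter
import numpy as np

# custom sliding-window cross-validation: 365-day training window,
# 30-day forecast horizon, window advanced by 140 days per fold
splitter = SlidingWindowSplitter(
    fh=np.arange(1, 30+1),
    window_length=365,
    step_length=140
)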
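
The fold layout these settings imply can be sketched in plain Python. The helper `sliding_windows` is hypothetical and only mirrors the idea of a sliding window; it is not sktime's implementation, and `n_obs=700` is an illustrative series length.

```python
def sliding_windows(n_obs, window_length, step_length, horizon):
    """Return (train_start, train_end, test_end) index triples, end-exclusive."""
    folds = []
    train_start = 0
    # stop once the forecast horizon no longer fits inside the series
    while train_start + window_length + horizon <= n_obs:
        train_end = train_start + window_length
        folds.append((train_start, train_end, train_end + horizon))
        train_start += step_length
    return folds


# n_obs=700 is an illustrative length, not the size of the real series
folds = sliding_windows(n_obs=700, window_length=365, step_length=140, horizon=30)
for fold in folds:
    print(fold)
```

Each fold trains on a fixed 365 observations and is scored on the 30 that follow, so a longer series yields more folds without lengthening the training window.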

In [None]:
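import numpy as np
from sklearn.metrics import mean_squared_log_error
from pycaret.time_series import add_metric

# custom metric: RMSLE is the square root of scikit-learn's MSLE
# (np.sqrt avoids the `squared=False` argument, which newer scikit-learn releases no longer accept)
def rmsle(y_true, y_pred):
    return np.sqrt(mean_squared_log_error(y_true, y_pred))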

In [None]:
# import pycaret time series and init setup
from pycaret.time_series import TSForecastingExperiment

exp = TSForecastingExperiment()

expt = exp.setup(
  data=df_eda,
  target='sales',
  numeric_imputation_target='bfill',
  scale_target='minmax',
  fe_target_rr=fe_target_rr,
  fold_strategy=splitter,
  seasonal_period=[7,52,12,24,4,1],
  sp_detection='auto',
  num_sps_to_use=-1,
  seasonality_type='auto',
  enforce_exogenous=False,
  n_jobs=-1,
  session_id=42,
  system_log=False,
  log_experiment=False,
  experiment_name='store_1_AUTOMOTIVE',
)
expt.add_metric('rmsle', 'RMSLE', rmsle, greater_is_better=False)

In [None]:
expt.plot_model(plot='cv')

In [None]:
expt.check_stats()

In [None]:
# begin autoML and return the top 3 models with lowest RMSLE
best_models = expt.compare_models(
    sort='RMSLE',
    n_select=3,
    turbo=False,
    engine={'auto_arima':'statsforecast'}
)

In [None]:
expt.plot_model(estimator=best_models, plot='forecast')

In [None]:
# check statistical tests on original data
expt.check_stats()

Sources for `add_metric()`:

- https://towardsdatascience.com/predict-customer-churn-the-right-way-using-pycaret-8ba6541608ac
- https://github.com/pycaret/pycaret/issues/3491
- https://github.com/pycaret/pycaret/issues/1063

In [None]:
import numpy as np
from sklearn.metrics import mean_squared_log_error

# create a custom function
def rmsle(y_true, y_pred):
    # RMSLE = sqrt(MSLE); avoids the `squared=False` argument dropped by newer scikit-learn
    return np.sqrt(mean_squared_log_error(y_true, y_pred))

add_metric('msle', 'MSLE', mean_squared_log_error, greater_is_better=False) # MSLE (no square root)
add_metric('rmsle', 'RMSLE', rmsle, greater_is_better=False) # RMSLE, the competition metric

In [None]:
# compare baseline models using the experiment object created above
best = expt.compare_models()