# Store Sales - Time Series Forecasting

Use machine learning to predict grocery sales. [source](https://www.kaggle.com/competitions/store-sales-time-series-forecasting/overview/description)

## Objective

In this Kaggle competition, the goal is to 

> build a model that more accurately predicts the unit sales for thousands of items sold at different Favorita stores.

The evaluation metric for this competition is ***Root Mean Squared Logarithmic Error***.

The `RMSLE` is calculated as:

$$\sqrt{ \frac{1}{n} \sum_{i=1}^n \left(\log (1 + \hat{y}_i) - \log (1 + y_i)\right)^2}$$

where:

- $n$ is the total number of instances,
- $\hat{y}_i$ is the predicted value of the target for instance $i$,
- $y_i$ is the actual value of the target for instance $i$, and
- $\log$ is the natural logarithm.
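
As a quick sanity check of the formula, the metric can be sketched in plain Python. This is a minimal illustration using only the standard library; the function name `rmsle` and the sample values are made up for the example and are not part of the competition code.

```python
import math


def rmsle(y_true, y_pred):
    """Root Mean Squared Logarithmic Error for non-negative values."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    # log1p(x) computes log(1 + x), matching the terms in the formula
    squared_log_errors = [
        (math.log1p(pred) - math.log1p(actual)) ** 2
        for actual, pred in zip(y_true, y_pred)
    ]
    return math.sqrt(sum(squared_log_errors) / len(squared_log_errors))


# made-up values, for illustration only
actual = [0.0, 10.0, 100.0]
predicted = [0.0, 12.0, 90.0]
score = rmsle(actual, predicted)
print(f"RMSLE: {score:.4f}")
```

Because the error is taken on the log scale, under-predicting a small value is penalised as heavily as a proportionally similar miss on a large value.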

For each id in the test set, you must predict a value for the sales variable. The file should contain a header and have the following format:

```
id,sales
3000888,0.0
3000889,0.0
3000890,0.0
3000891,0.0
3000892,0.0
etc.
```
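
A file in this format could be produced along the following lines. This is a hedged sketch using only the standard library: `write_submission` is a hypothetical helper, the sample predictions are invented, and clipping negative predictions to zero is an assumption made here (negative sales have no meaning for the metric), not a rule stated by the competition.

```python
import csv
import io


def write_submission(rows, handle):
    """Write (id, sales) pairs as CSV with the required `id,sales` header."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["id", "sales"])
    for row_id, sales in rows:
        # assumption: clip negative predictions to zero before writing
        writer.writerow([int(row_id), max(float(sales), 0.0)])


# invented predictions, written to an in-memory buffer instead of a file
buffer = io.StringIO()
write_submission([(3000888, 0.0), (3000889, 4.25), (3000890, -1.5)], buffer)
submission_text = buffer.getvalue()
print(submission_text)
```

Passing an open file handle (for example `open("submission.csv", "w", newline="")`) instead of the buffer writes the same content to disk.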


## Libraries for this research notebook

In [2]:
import pandas as pd
from tqdm.auto import tqdm

# to overcome path issue for src
%reload_ext autoreload
%autoreload 2

from pathlib import Path
import sys

# set the path to the current file
current_file_path = Path().resolve()
print(f"current_file_path is {current_file_path}")

# set the path to the src folder
src_folder_path = current_file_path.parent / 'src'
print(f"src_folder_path is {src_folder_path}")

# add the src folder to the system path
sys.path.append(str(src_folder_path))

from data_loader import DBDataLoader
import data_cleaner as dc
import data_exploration as eda
import helper

import polars as pl
import plotly.graph_objects as go
from datetime import datetime

current_file_path is /Users/galvangoh/Desktop/Tech/ai_practitioners/time-series-forecasting/time-series-forecasting/notebooks
src_folder_path is /Users/galvangoh/Desktop/Tech/ai_practitioners/time-series-forecasting/time-series-forecasting/src


## Data Ingestion

Query data from a local MySQL database using Polars with the ConnectorX engine.

In [3]:
# define connection string for local database
conn = DBDataLoader().get_connection_string("local", "connectorx")  # no need to swap to "remote" as ConnectorX does not support remote connections

In [4]:
# query all records from full_df table
query = 'SELECT * FROM full_df'

# load the table into a Polars dataframe via the ConnectorX engine
results = pl.read_database_uri(query=query, uri=conn, engine="connectorx")

print(f'Size of dataframe ingested: {round(results.estimated_size("mb"),2)} MB')

Size of dataframe ingested: 500.24 MB


In [5]:
# make a copy of queried results - reduce the need to query and wait.
df = results.clone()

## Exploratory Data Analysis (Polars)

In [6]:
# 7 verbs get most of the job done
# select/slice columns -> select
# create/transform/assign columns -> with_columns
# filter/slice/query rows -> filter
# join/merge another dataframe -> join
# group dataframe rows -> groupby (renamed to group_by in newer Polars releases)
# aggregate groups -> agg
# sort dataframe -> sort

In [7]:
# minor clean-up of the SQL query output (artifacts left by the SQL join statements)
df = dc.query_clean_up(df)

In [8]:
# downcast columns datatypes to reduce dataframe memory
df = dc.shrink_dataframe(df)

In [9]:
agg_df = helper.get_aggregate_data(dataframe=df, select_hierarchy='state')

In [20]:
# ensure the date column has a Datetime dtype, then sort by date
# (the rolling windows computed below expect sorted dates)
if agg_df['date'].dtype != pl.Datetime:
    agg_df = agg_df.with_columns(pl.col('date').cast(pl.Datetime))
agg_df = agg_df.sort('date')

In [43]:
# subset the dataframe to a single state
state = agg_df.filter(pl.col('state') == 'Azuay')

In [54]:
# create weekly, monthly, quarterly and yearly rolling averages of sales
state_temporal_avg_df = (
    state
    .with_columns(
        weekly_avg=pl.col('sales').rolling_mean(window_size='1w', by='date', closed='left')
    )
    .with_columns(
        monthly_avg=pl.col('sales').rolling_mean(window_size='1mo_saturating', by='date', closed='left')
    )
    .with_columns(
        quarterly_avg=pl.col('sales').rolling_mean(window_size='1q_saturating', by='date', closed='left')
    )
    .with_columns(
        yearly_avg=pl.col('sales').rolling_mean(window_size='1y_saturating', by='date', closed='left')
    )    
)
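
To make the windowing explicit, here is a plain-Python sketch of a trailing mean, assuming that `closed='left'` makes each window cover `[date - window, date)`, so the current day is excluded and only earlier days contribute. `trailing_mean` and the sample series are hypothetical and only illustrate the idea; they are not how Polars computes the result internally.

```python
from datetime import date, timedelta


def trailing_mean(dates, values, days=7):
    """Mean of the values dated in [d - days, d) for each d; None if the window is empty."""
    averages = []
    for current in dates:
        start = current - timedelta(days=days)
        # left-closed, right-open window: includes `start`, excludes `current`
        window = [v for d, v in zip(dates, values) if start <= d < current]
        averages.append(sum(window) / len(window) if window else None)
    return averages


# made-up daily series: sales of 1, 2, ..., 9 on nine consecutive days
sample_dates = [date(2013, 1, 1) + timedelta(days=i) for i in range(9)]
sample_sales = [float(i) for i in range(1, 10)]
weekly = trailing_mean(sample_dates, sample_sales, days=7)
print(weekly)
```

Excluding the current day means each average only uses information available before that day, which keeps the feature free of look-ahead when it is later used for forecasting.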

In [129]:
sales_fig = go.Figure()

sales_fig.add_trace(
    go.Scatter(
        x=state_temporal_avg_df['date'],
        y=state_temporal_avg_df['sales'],
        name='Actual Sales',
    )
)

sales_fig.add_trace(
    go.Scatter(
        x=state_temporal_avg_df['date'],
        y=state_temporal_avg_df['weekly_avg'],
        name='Weekly Sales',
    )
)

sales_fig.add_trace(
    go.Scatter(
        x=state_temporal_avg_df['date'],
        y=state_temporal_avg_df['monthly_avg'],
        name='Monthly Sales',
    )
)

sales_fig.add_trace(
    go.Scatter(
        x=state_temporal_avg_df['date'],
        y=state_temporal_avg_df['quarterly_avg'],
        name='Quarterly Sales',
    )
)

sales_fig.add_trace(
    go.Scatter(
        x=state_temporal_avg_df['date'],
        y=state_temporal_avg_df['yearly_avg'],
        name='Yearly Sales',
    )
)

sales_fig.update_layout(
    title=f'Temporal Average Sales by State<br>State: {state_temporal_avg_df.item(row=0, column="state")}',
    xaxis_title='Date',
    yaxis_title='Values',
    showlegend=True,
    legend_orientation='h',
    legend_y=1.12,
    legend_x=0.5,
    legend_xanchor='center',
    autosize=True,
    # paper_bgcolor='rgba(0,0,0,0)',
    plot_bgcolor='rgba(0,0,0,0)',
)

sales_fig.show()

In [None]:
# get latitude and longitude values for a city (here: Quito, Ecuador)
eda.get_geo_coordinates('ggjx22', 'Quito', 'EC')

## Data cleaning

In [None]:
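# NOTE: the cells from here on are unexecuted (`In [None]`) and use the pandas API,
# so they assume `df` is a pandas DataFrame rather than the Polars one built above
df.dtypes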

In [None]:
# df['date'] = pd.to_datetime(df['date'])

In [None]:
df.set_index('date', drop=True, inplace=True)

In [None]:
df['weekday'] = df.index.day_name()

In [None]:
df.isnull().sum()

In [None]:
# pass the aggregations as a list so that both the sum and the mean are computed
groupby_store = df.groupby(by=['store_nbr', 'family'], group_keys=True).agg(['sum', 'mean'])

In [None]:
groupby_store.info()

## Data profile

In [None]:
# from ydata_profiling import ProfileReport

In [None]:
# profile = ProfileReport(df, tsmode=True, title="Time-Series EDA Quito City")
# profile.to_notebook_iframe()
# profile.to_file("../artifacts/reports/quito_ProfileReport.html") # AttributeError: 'float' object has no attribute 'shape'

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
# Use seaborn style defaults and set the default figure size
sns.set(rc={'figure.figsize':(11, 4)})

In [None]:
df['sales_sum'].plot();

In [None]:
df.columns

In [None]:
cols_plot = ['onpromotion_sum', 'transactions_sum', 'sales_sum']
axes = df[cols_plot].plot(marker='.', alpha=0.5, linestyle='None', figsize=(11, 9), subplots=True)
for ax in axes:
    ax.set_ylabel('Daily Totals')

In [None]:
df_2013 = df.loc['2013']

In [None]:
df_2013

In [None]:
ax = df_2013.loc['2013', 'sales_sum'].plot()
ax.set_ylabel('Daily Sales for 2013');

### Seasonality

In [None]:
fig, axes = plt.subplots(3, 1, figsize=(11, 10), sharex=True)
for name, ax in zip(cols_plot, axes):
    sns.boxplot(data=df, x='weekday', y=name, ax=ax)
    ax.set_ylabel('Sum')
    ax.set_title(name)
    # Remove the automatic x-axis label from all but the bottom subplot
    if ax != axes[-1]:
        ax.set_xlabel('')

In [None]:
fig, axes = plt.subplots(3, 1, figsize=(11, 10), sharex=True)
for name, ax in zip(cols_plot, axes):
    sns.boxplot(data=df, x='month', y=name, ax=ax)
    ax.set_ylabel('Sum')
    ax.set_title(name)
    # Remove the automatic x-axis label from all but the bottom subplot
    if ax != axes[-1]:
        ax.set_xlabel('')

In [None]:
fig, axes = plt.subplots(3, 1, figsize=(11, 10), sharex=True)
for name, ax in zip(cols_plot, axes):
    sns.boxplot(data=df, x='store_nbr', y=name, ax=ax)
    ax.set_ylabel('Sum')
    ax.set_title(name)
    # Remove the automatic x-axis label from all but the bottom subplot
    if ax != axes[-1]:
        ax.set_xlabel('')

## Data preprocessing

In [None]:
df.store_nbr.value_counts()

Notes:

- The data contains outliers that need treatment before seasonality by `weekday` and `month` can be analysed.
- Store 44 was selected because it has the widest spread of `transactions_sum`, i.e. the most activity. This assumes that a day with 0 transactions means the store was closed, since no sales were recorded on that day.
- That assumption tracks with the 0-transaction days occurring on holidays (holiday data is not included, to keep the dataset small).
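
The notes do not say which outlier treatment is intended. One common option, shown here purely as an assumption, is to clip values that fall outside the interquartile-range fences. `clip_outliers_iqr`, the multiplier `k=1.5`, and the sample values are all illustrative choices.

```python
import statistics


def clip_outliers_iqr(values, k=1.5):
    """Clip values outside [Q1 - k*IQR, Q3 + k*IQR] to the nearest fence."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [min(max(v, lower), upper) for v in values]


# made-up daily totals with one obvious spike
daily_totals = [10, 12, 11, 13, 12, 11, 200]
clipped = clip_outliers_iqr(daily_totals)
print(clipped)
```

Clipping keeps every date in the series, which matters for time-series models that expect a regular daily index; dropping rows instead would leave gaps.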

In [None]:
df_str44 = df[(df.store_nbr == 44)]

In [None]:
grp_df_str44 = df_str44.groupby(by=['date'], group_keys=True).agg('sum')

In [None]:
grp_df_str44.drop(columns=['id','store_nbr','year','month','day_of_month'], inplace=True)

In [None]:
grp_df_str44 = grp_df_str44.asfreq('D')

## AutoML with PyCaret

EDA and ML


In [None]:
grp_df_str44.info()

In [None]:
# check installed version
import pycaret
pycaret.__version__

In [None]:
# import pycaret time series and init setup
from pycaret.time_series import *
s = setup(
    grp_df_str44,
    target='sales_sum',
    fh=28,  # forecast horizon: 28 periods (days, given the daily frequency)
    session_id=123,
    profile=True,
    numeric_imputation_exogenous='mean',
    numeric_imputation_target='median',
    # ignore_features=['id', 'family', 'store_nbr'],
)


In [None]:
# check statistical tests on original data
check_stats()

Sources for `add_metric()`:

- https://towardsdatascience.com/predict-customer-churn-the-right-way-using-pycaret-8ba6541608ac
- https://github.com/pycaret/pycaret/issues/3491
- https://github.com/pycaret/pycaret/issues/1063

In [None]:
import numpy as np
from sklearn.metrics import mean_squared_log_error

# custom RMSLE metric; the square root is taken explicitly because the
# `squared=False` argument is deprecated in newer scikit-learn releases
def rmsle(y_true, y_pred):
    return np.sqrt(mean_squared_log_error(y_true, y_pred))

add_metric('msle', 'MSLE', mean_squared_log_error, greater_is_better=False)  # mean squared log error
add_metric('rmsle', 'RMSLE', rmsle, greater_is_better=False)  # competition metric

In [None]:
# compare baseline models
best = compare_models()