# Store Sales - Time Series Forecasting

Use machine learning to predict grocery sales. [source](https://www.kaggle.com/competitions/store-sales-time-series-forecasting/overview/description)

## Objective

In this Kaggle competition, the goal is to 

> build a model that more accurately predicts the unit sales for thousands of items sold at different Favorita stores.

The evaluation metric for this competition is ***Root Mean Squared Logarithmic Error***.

The `RMSLE` is calculated as:

$$\sqrt{ \frac{1}{n} \sum_{i=1}^n \left(\log (1 + \hat{y}_i) - \log (1 + y_i)\right)^2}$$

where:

- $n$ is the total number of instances,
- $\hat{y}_i$ is the predicted value of the target for instance $i$,
- $y_i$ is the actual value of the target for instance $i$, and
- $\log$ is the natural logarithm.
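The formula above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the competition's official scorer; the function name `rmsle` and the toy values are assumptions made for the example.

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root Mean Squared Logarithmic Error for targets greater than -1."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # np.log1p(x) computes log(1 + x) and stays accurate for small x
    log_diff = np.log1p(y_pred) - np.log1p(y_true)
    return float(np.sqrt(np.mean(log_diff ** 2)))

# Toy values, not competition data: a perfect forecast scores 0
perfect = rmsle([0.0, 10.0, 100.0], [0.0, 10.0, 100.0])
print(perfect)
```

Because the error is measured on the log scale, being off by a factor of two costs roughly the same at 10 units as at 1,000 units, which suits sales data with many zeros and a few large values.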

For each id in the test set, you must predict a value for the sales variable. The file should contain a header and have the following format:

```
id,sales
3000888,0.0
3000889,0.0
3000890,0.0
3000891,0.0
3000892,0.0
etc.
```
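A file in this format can be produced with pandas. The sketch below uses hypothetical ids and predictions and writes to an in-memory buffer in place of a real path such as `submission.csv`; both are assumptions for illustration.

```python
import io
import pandas as pd

# Hypothetical ids and predictions; real ids come from the competition test set
submission = pd.DataFrame(
    {"id": [3000888, 3000889, 3000890], "sales": [0.0, 0.0, -0.5]}
)

# Sales cannot be negative, so clip any negative predictions to zero
submission["sales"] = submission["sales"].clip(lower=0)

# index=False keeps the DataFrame index out of the file
buffer = io.StringIO()  # stand-in for a file path
submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```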


## Libraries for this research notebook

In [1]:
import pandas as pd
from dotenv import dotenv_values
from tqdm.auto import tqdm

# to overcome path issue for src
%reload_ext autoreload
%autoreload 2

from pathlib import Path
import sys

# resolve the current working directory (the notebooks folder)
current_file_path = Path().resolve()
print(f"current_file_path is {current_file_path}")

# set the path to the src folder
src_folder_path = current_file_path.parent / 'src'
print(f"src_folder_path is {src_folder_path}")

# add the src folder to the system path
sys.path.append(str(src_folder_path))

from data_loader import DBDataLoader
from logger import logging

current_file_path is /home/ubuntu/repos/time-series-forecasting/notebooks
src_folder_path is /home/ubuntu/repos/time-series-forecasting/src


## Data Ingestion

Query data from MySQL

Connected to the database


In [10]:
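import connectorx as cx  # assumption: `cx` is the connectorx alias; `conn` must hold the MySQL connection string
df = cx.read_sql(conn, 'SELECT * FROM quito_frm_view')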

In [7]:

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 648648 entries, 0 to 648647
Data columns (total 9 columns):
 #   Column            Non-Null Count   Dtype         
---  ------            --------------   -----         
 0   date              648648 non-null  datetime64[ns]
 1   family            648648 non-null  object        
 2   store_nbr         648648 non-null  Int64         
 3   year              648648 non-null  Int64         
 4   month             648648 non-null  Int64         
 5   day_of_month      648648 non-null  Int64         
 6   onpromotion_sum   648648 non-null  float64       
 7   transactions_sum  648648 non-null  float64       
 8   sales_sum         648648 non-null  float64       
dtypes: Int64(4), datetime64[ns](1), float64(3), object(1)
memory usage: 47.0+ MB


~~DF loaded confirm: 1972674 rows × 14 columns~~

DF loaded confirm: 3,054,348 rows × 20 columns

quito df: 648649 rows × 10 columns

In [None]:
df['date'] = pd.to_datetime(df['date'], format="%Y-%m-%d")

In [33]:
df.columns = ['date', 'family', 'sales_sum', 'onpromotion_sum', 
	'store_nbr', 'city', 'year', 'month', 'day_of_month', 'transactions_sum']

In [34]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 648648 entries, 0 to 8647
Data columns (total 10 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   date              648648 non-null  object 
 1   family            648648 non-null  object 
 2   sales_sum         648648 non-null  float64
 3   onpromotion_sum   648648 non-null  object 
 4   store_nbr         648648 non-null  int64  
 5   city              648648 non-null  object 
 6   year              648648 non-null  int64  
 7   month             648648 non-null  int64  
 8   day_of_month      648648 non-null  int64  
 9   transactions_sum  621456 non-null  object 
dtypes: float64(1), int64(4), object(5)
memory usage: 54.4+ MB


## Data Analysis

In [37]:
df.drop(columns=['city'], inplace=True)

In [38]:
df.date = pd.to_datetime(df.date, format="%Y-%m-%d")

In [48]:
# keep only the two columns still stored as object dtype
df = df[['onpromotion_sum', 'transactions_sum']]

In [52]:
df

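|      | onpromotion_sum | transactions_sum |
|------|-----------------|------------------|
| 0    | 0               | 0                |
| 1    | 0               | 0                |
| 2    | 0               | 0                |
| 3    | 0               | 0                |
| 4    | 0               | 0                |
| ...  | ...             | ...              |
| 8643 | 0               | 4402             |
| 8644 | 0               | 5180             |
| 8645 | 0               | 5296             |
| 8646 | 0               | 4456             |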


In [43]:
# replace missing transaction counts with 0
df['transactions_sum'] = df['transactions_sum'].fillna(0)

In [51]:
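# df['a', 'b'] is read as one tuple key and raises
# KeyError: ('onpromotion_sum', 'transactions_sum'); select with a list instead.
# pd.to_numeric accepts a single Series, so apply it column by column.
num_cols = ['onpromotion_sum', 'transactions_sum']
df[num_cols] = df[num_cols].apply(pd.to_numeric)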
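
Indexing a DataFrame with two comma-separated labels asks pandas for one column whose label is a tuple, which is why the lookup fails. The toy frame below (made-up values, with the same column names) reproduces the error and shows one way to convert several object columns at once; `errors="coerce"` is an optional choice here that turns unparseable entries into `NaN`.

```python
import pandas as pd

# Toy frame with numeric values stored as strings (object dtype)
toy = pd.DataFrame(
    {"onpromotion_sum": ["0", "3"], "transactions_sum": ["4402", None]}
)

raised = False
try:
    toy["onpromotion_sum", "transactions_sum"]  # looked up as one tuple key
except KeyError as err:
    raised = True
    print("KeyError:", err)

cols = ["onpromotion_sum", "transactions_sum"]
# pd.to_numeric works on one Series at a time, so apply it per column
toy[cols] = toy[cols].apply(pd.to_numeric, errors="coerce")
print(toy.dtypes)
```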

## Data cleaning

In [None]:
# sum the numeric columns per day ('state' is categorical, so it is left out of the sum)
groupby_date = df.groupby(by=['date'], as_index=False, group_keys=False)[['sales', 'transactions']].agg('sum')

In [None]:
groupby_date.info()

In [None]:
# groupby_date_sales = df[['sales']].sum()
groupby_date = groupby_date.set_index('date').asfreq('D')
groupby_date
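
The effect of `set_index('date').asfreq('D')` is easiest to see on a toy series with a gap. The frame below uses made-up dates and values: any calendar day missing from the data appears as a new row filled with `NaN`, which is what the imputation settings in the later modelling step have to handle.

```python
import pandas as pd

# Toy daily sales with 2017-01-03 missing
toy = pd.DataFrame(
    {
        "date": pd.to_datetime(["2017-01-01", "2017-01-02", "2017-01-04"]),
        "sales": [10.0, 12.0, 9.0],
    }
)

# asfreq("D") reindexes to a regular daily frequency and inserts the missing day
daily = toy.set_index("date").asfreq("D")
print(daily)
```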

In [None]:
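# pass the aggregations as a list; limited to the numeric columns (assumed to be 'sales' and 'transactions')
groupby_city = df.groupby(by=['city', 'state', 'family'], group_keys=True)[['sales', 'transactions']].agg(['sum', 'mean'])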

In [None]:
groupby_city.info()

## Data profile

In [None]:
from ydata_profiling import ProfileReport

In [None]:
# profile = ProfileReport(groupby_city, title="ProfileReport groupby_city")
# profile.to_notebook_iframe()
# profile.to_file("../artifacts/reports/groupby_city_ProfileReport.html")

## Data preprocessing

In [None]:
groupby_date.info()

## AutoML with PyCaret

EDA and ML


In [None]:
# check installed version
import pycaret
pycaret.__version__

In [None]:
# df.columns = ['id', 'family', 'sales', 'onpromotion', 'city', 'state', 'cluster',
#        'locale', 'locale_name', 'description', 'transferred', 'type',
#        'hol_type', 'store_nbr', 'year', 'month',
#        'day_of_month', 'transactions', 'dcoilwtico']

In [None]:
import matplotlib
import pycaret
# plot the dataset
groupby_date.plot()

In [None]:
# import pycaret time series and init setup
from pycaret.time_series import *
s = setup(groupby_date,  target='sales', fh = 3, session_id = 123,  
          numeric_imputation_exogenous='mean',
          numeric_imputation_target="median")  # fh = 3: forecast horizon of 3 periods; the number of CV folds is set separately via `fold`
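
The forecast horizon `fh` counts future periods, not folds. The plain-pandas sketch below illustrates the idea on a toy daily series by holding out the last `fh` observations; it is not PyCaret's implementation, and the series values are invented for the example.

```python
import pandas as pd

fh = 3  # forecast horizon: number of future periods to predict

# Toy daily series of 10 observations
series = pd.Series(
    range(10),
    index=pd.date_range("2017-01-01", periods=10, freq="D"),
    dtype=float,
)

# Temporal split: never shuffle, always hold out the most recent fh points
train, test = series.iloc[:-fh], series.iloc[-fh:]
print(len(train), len(test))
```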

In [None]:
# check statistical tests on original data
check_stats()

In [None]:
# compare baseline models
best = compare_models()

In [None]:
# plot forecast
plot_model(best, plot = 'forecast')

In [None]:
# plot forecast for 36 periods into the future (days, given the daily index)
plot_model(best, plot = 'forecast', data_kwargs = {'fh' : 36})

In [None]:
# residuals plot
plot_model(best, plot = 'residuals')

In [None]:
# predict on test set
holdout_pred = predict_model(best)

In [None]:
# show predictions df
holdout_pred.head()

In [None]:
# generate forecast for 36 periods into the future
predict_model(best, fh = 36)

In [None]:
# save pipeline
save_model(best, 'my_first_pipeline')

In [None]:
# load pipeline
loaded_best_pipeline = load_model('my_first_pipeline')
loaded_best_pipeline