# Store Sales - Time Series Forecasting

Use machine learning to predict grocery sales. [source](https://www.kaggle.com/competitions/store-sales-time-series-forecasting/overview/description)

## Objective

In this Kaggle competition, the goal is to 

> build a model that more accurately predicts the unit sales for thousands of items sold at different Favorita stores.

The evaluation metric for this competition is ***Root Mean Squared Logarithmic Error***.

The `RMSLE` is calculated as:

$$\sqrt{ \frac{1}{n} \sum_{i=1}^n \left(\log (1 + \hat{y}_i) - \log (1 + y_i)\right)^2}$$

where:

- $n$ is the total number of instances,
- $\hat{y}_i$ is the predicted value of the target for instance $i$,
- $y_i$ is the actual value of the target for instance $i$, and
- $\log$ is the natural logarithm.
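
As a minimal sketch, the metric defined above can be computed with NumPy. The function name `rmsle` and the sample values are illustrative assumptions, not part of the competition code.

```python
import numpy as np


def rmsle(y_true, y_pred):
    """Root Mean Squared Logarithmic Error, using the natural logarithm."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # np.log1p(x) computes log(1 + x), matching the terms in the formula
    squared_log_diff = (np.log1p(y_pred) - np.log1p(y_true)) ** 2
    return float(np.sqrt(np.mean(squared_log_diff)))


# Identical predictions and targets give an error of zero
print(rmsle([0.0, 2.0, 810.0], [0.0, 2.0, 810.0]))
```

Because the metric takes $\log(1 + \hat{y}_i)$, predictions must stay above $-1$; clipping negative predictions to zero before scoring is a common precaution.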

For each id in the test set, you must predict a value for the sales variable. The file should contain a header and have the following format:

    ```
    id,sales
    3000888,0.0
    3000889,0.0
    3000890,0.0
    3000891,0.0
    3000892,0.0
    etc.
    ```
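
A file in this format could be produced with pandas along the following lines. This is only a sketch: the `submission` frame and its placeholder predictions are hypothetical, and the CSV is written to an in-memory buffer rather than to a real path.

```python
import io

import pandas as pd

# Hypothetical predictions keyed by test-set id (the values are placeholders)
submission = pd.DataFrame({
    "id": [3000888, 3000889, 3000890],
    "sales": [0.0, 0.0, 0.0],
})

# Keep predictions non-negative so that log(1 + prediction) is defined
submission["sales"] = submission["sales"].clip(lower=0.0)

# index=False keeps the file to the two required columns: id,sales
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```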


## Libraries for this research notebook

In [1]:

import pandas as pd
from tqdm.auto import tqdm

# to overcome path issue for src
%reload_ext autoreload
%autoreload 2

from pathlib import Path
import sys

# resolve the current working directory (the notebooks folder)
current_file_path = Path().resolve()
print(f"current_file_path is {current_file_path}")

# set the path to the src folder
src_folder_path = current_file_path.parent / 'src'
print(f"src_folder_path is {src_folder_path}")

# add the src folder to the system path
sys.path.append(str(src_folder_path))

from data_loader import DBDataLoader
from logger import logging

current_file_path is /home/ubuntu/repos/time-series-forecasting/notebooks
src_folder_path is /home/ubuntu/repos/time-series-forecasting/src


## Data Ingestion

Query data from MySQL

In [4]:
# load in data set using .sql file
query_file_path = '../src/scripts/train_store_hols.sql'

db = DBDataLoader()

In [5]:
with open(query_file_path, 'r') as query:
    chunks = db.load(query=query.read())

# note: sys.getsizeof reports the byte size of the `chunks` object, not the number of chunks
print(f'chunks size: {sys.getsizeof(chunks)}')
logging.info(f"chunks loaded {sys.getsizeof(chunks)}")

# iterate over the chunks once and concatenate them in a single call
df = pd.concat(list(tqdm(chunks, desc='Reading from DB')))

Loading dataset from database...
chunks size: 112


Reading from DB:   0%|          | 0/112 [00:00<?, ?it/s]
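
`DBDataLoader` is project-specific and its implementation is not shown in this notebook. The sketch below assumes that `db.load` behaves like `pd.read_sql` with a `chunksize`, that is, it yields DataFrames one chunk at a time. An in-memory SQLite database stands in for MySQL, and the table name `train` and its contents are made up for illustration.

```python
import sqlite3

import pandas as pd

# In-memory SQLite stands in for the MySQL database (assumption for illustration)
conn = sqlite3.connect(":memory:")
pd.DataFrame({"id": range(10), "sales": [float(i) for i in range(10)]}).to_sql(
    "train", conn, index=False
)

# With chunksize set, read_sql returns an iterator of DataFrames
chunks = pd.read_sql("SELECT * FROM train", conn, chunksize=4)

# Materialise the chunks once, then concatenate in a single call
demo_df = pd.concat(list(chunks), ignore_index=True)
conn.close()

print(demo_df.shape)
```

Concatenating once avoids repeatedly copying a growing frame. `ignore_index=True` gives the result a fresh 0-based index, whereas the notebook's frames keep the per-chunk index labels.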

In [6]:
df.shape

(1972674, 15)

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1972674 entries, 0 to 2673
Data columns (total 15 columns):
 #   Column       Dtype         
---  ------       -----         
 0   id           int64         
 1   family       object        
 2   sales        float64       
 3   onpromotion  int64         
 4   city         object        
 5   state        object        
 6   cluster      int64         
 7   locale       object        
 8   locale_name  object        
 9   description  object        
 10  transferred  object        
 11  type         object        
 12  hol_type     object        
 13  store_nbr    int64         
 14  date         datetime64[ns]
dtypes: datetime64[ns](1), float64(1), int64(4), object(9)
memory usage: 240.8+ MB


DF loaded confirm: 1972674 rows × 15 columns

In [8]:
stmt = 'select * from VwDump1'

chunks = db.load(query=stmt)

# note: sys.getsizeof reports the byte size of the `chunks` object, not the number of chunks
print(f'VwDump1 chunks size: {sys.getsizeof(chunks)}')
logging.info(f"VwDump1 chunks loaded {sys.getsizeof(chunks)}")

# iterate over the chunks once and concatenate them in a single call
view_df = pd.concat(list(tqdm(chunks, desc='Reading from View')))

Loading dataset from database...
VwDump1 chunks size: 112


Reading from View:   0%|          | 0/112 [00:00<?, ?it/s]

In [9]:
view_df.head()

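|   | id | date | store_nbr | family | sales | onpromotion | city | state | type | cluster | transactions | is_city_holiday | is_regional_holiday | is_national_holiday |
|---|----|------|-----------|--------|-------|-------------|------|-------|------|---------|--------------|-----------------|---------------------|---------------------|
| 0 | 561 | 2013-01-01 | 25 | AUTOMOTIVE | 0.0 | 0 | Salinas | Santa Elena | D | 1 | 770.0 | 0 | 0 | 1 |
| 1 | 562 | 2013-01-01 | 25 | BABY CARE | 0.0 | 0 | Salinas | Santa Elena | D | 1 | 770.0 | 0 | 0 | 1 |
| 2 | 563 | 2013-01-01 | 25 | BEAUTY | 2.0 | 0 | Salinas | Santa Elena | D | 1 | 770.0 | 0 | 0 | 1 |
| 3 | 564 | 2013-01-01 | 25 | BEVERAGES | 810.0 | 0 | Salinas | Santa Elena | D | 1 | 770.0 | 0 | 0 | 1 |
| 4 | 565 | 2013-01-01 | 25 | BOOKS | 0.0 | 0 | Salinas | Santa Elena | D | 1 | 770.0 | 0 | 0 | 1 |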


DF loaded confirm: 3000888 rows × 14 columns

In [10]:
view_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3000888 entries, 0 to 887
Data columns (total 14 columns):
 #   Column               Dtype         
---  ------               -----         
 0   id                   int64         
 1   date                 datetime64[ns]
 2   store_nbr            int64         
 3   family               object        
 4   sales                float64       
 5   onpromotion          int64         
 6   city                 object        
 7   state                object        
 8   type                 object        
 9   cluster              int64         
 10  transactions         float64       
 11  is_city_holiday      int64         
 12  is_regional_holiday  int64         
 13  is_national_holiday  int64         
dtypes: datetime64[ns](1), float64(2), int64(7), object(4)
memory usage: 343.4+ MB


## Data cleaning

In [11]:
df['date'] = pd.to_datetime(df['date'])

In [12]:
df.set_index('date', drop=True, inplace=True)

In [13]:
df.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1972674 entries, 2013-01-01 to 2015-12-31
Data columns (total 14 columns):
 #   Column       Dtype  
---  ------       -----  
 0   id           int64  
 1   family       object 
 2   sales        float64
 3   onpromotion  int64  
 4   city         object 
 5   state        object 
 6   cluster      int64  
 7   locale       object 
 8   locale_name  object 
 9   description  object 
 10  transferred  object 
 11  type         object 
 12  hol_type     object 
 13  store_nbr    int64  
dtypes: float64(1), int64(4), object(9)
memory usage: 225.8+ MB
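
With `date` as a `DatetimeIndex`, rows can be selected by partial date strings and aggregated by calendar period. The sketch below uses a small hypothetical frame named `toy` rather than the real data.

```python
import pandas as pd

# Hypothetical toy data; several rows may share a date, as in the real table
toy = pd.DataFrame({
    "date": pd.to_datetime(["2013-01-01", "2013-01-01", "2013-01-02", "2013-02-01"]),
    "sales": [2.0, 810.0, 5.0, 7.0],
}).set_index("date")

# Partial-string indexing selects every row in January 2013
january = toy.loc["2013-01"]

# Resample to month-start frequency and total the sales per month
monthly = toy["sales"].resample("MS").sum()

print(january)
print(monthly)
```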


In [14]:
view_df.drop('id', axis=1, inplace=True)

In [15]:
# sum every numeric column per (store_nbr, family) pair; non-numeric columns are dropped
groupby_store = view_df.groupby(by=['store_nbr', 'family']).sum(numeric_only=True)

In [16]:
groupby_store.info()

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 1782 entries, (1, 'AUTOMOTIVE') to (54, 'SEAFOOD')
Data columns (total 7 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   sales                1782 non-null   float64
 1   onpromotion          1782 non-null   int64  
 2   cluster              1782 non-null   int64  
 3   transactions         1782 non-null   float64
 4   is_city_holiday      1782 non-null   int64  
 5   is_regional_holiday  1782 non-null   int64  
 6   is_national_holiday  1782 non-null   int64  
dtypes: float64(2), int64(5)
memory usage: 101.7+ KB
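
The grouped frame above holds one total per numeric column. If both a sum and a mean per group are wanted, pandas named aggregation is one way to request them explicitly. The sketch below uses a hypothetical frame named `toy`, and the output column names are illustrative.

```python
import pandas as pd

# Hypothetical toy data with the same grouping keys as view_df
toy = pd.DataFrame({
    "store_nbr": [25, 25, 25, 25],
    "family": ["BEAUTY", "BEAUTY", "BEVERAGES", "BEVERAGES"],
    "sales": [2.0, 4.0, 810.0, 790.0],
    "onpromotion": [0, 1, 0, 0],
})

# Named aggregation: output_column=(input_column, aggregation)
summary = toy.groupby(["store_nbr", "family"]).agg(
    total_sales=("sales", "sum"),
    mean_sales=("sales", "mean"),
    promo_items=("onpromotion", "sum"),
)

print(summary)
```

Choosing the statistic per column also avoids summing identifiers such as `cluster`, whose total has no obvious meaning.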


## Data profile

In [18]:
from ydata_profiling import ProfileReport

In [20]:
profile = ProfileReport(view_df, title="ProfileReport view_df")
# profile.to_notebook_iframe()
profile.to_file("../artifacts/reports/view_df_ProfileReport.html")

Summarize dataset:   0%|          | 0/5 [00:00<?, ?it/s]

  return _cramers_corrected_stat(pd.crosstab(col_1, col_2), correction=True)
(using `df.profile_report(correlations={"auto": {"calculate": False}})`
If this is problematic for your use case, please report this as an issue:
https://github.com/ydataai/ydata-profiling/issues
(include the error message: 'cannot reindex on an axis with duplicate labels')


Generate report structure:   0%|          | 0/1 [00:00<?, ?it/s]

Render HTML:   0%|          | 0/1 [00:00<?, ?it/s]

Export report to file:   0%|          | 0/1 [00:00<?, ?it/s]
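
The warning above quotes the error `cannot reindex on an axis with duplicate labels`. The earlier `view_df.info()` output lists 3000888 entries with index labels running only from 0 to 887, so labels repeat across chunks. Whether that is the actual cause of the warning has not been verified here. The sketch below only shows, on hypothetical toy chunks, how concatenation produces duplicate labels and how `reset_index(drop=True)` removes them.

```python
import pandas as pd

# Two toy chunks, each carrying its own 0-based index
chunk_a = pd.DataFrame({"sales": [0.0, 2.0]})
chunk_b = pd.DataFrame({"sales": [810.0, 0.0]})

# Plain concatenation keeps the per-chunk labels, so 0 and 1 appear twice
stacked = pd.concat([chunk_a, chunk_b])
print(stacked.index.is_unique)

# reset_index(drop=True) replaces them with a fresh, unique 0..n-1 index
clean = stacked.reset_index(drop=True)
print(clean.index.is_unique)
```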

In [17]:
profile = ProfileReport(df, title="ProfileReport defaults")
# profile.to_notebook_iframe()
profile.to_file("../artifacts/reports/df_ProfileReport.html")


Summarize dataset:   0%|          | 0/5 [00:00<?, ?it/s]

  return _cramers_corrected_stat(pd.crosstab(col_1, col_2), correction=True)
(using `df.profile_report(correlations={"auto": {"calculate": False}})`
If this is problematic for your use case, please report this as an issue:
https://github.com/ydataai/ydata-profiling/issues
(include the error message: 'cannot reindex on an axis with duplicate labels')


Generate report structure:   0%|          | 0/1 [00:00<?, ?it/s]

Render HTML:   0%|          | 0/1 [00:00<?, ?it/s]

Export report to file:   0%|          | 0/1 [00:00<?, ?it/s]

## Data preprocessing