# Store Sales - Time Series Forecasting

Use machine learning to predict grocery sales. [source](https://www.kaggle.com/competitions/store-sales-time-series-forecasting/overview/description)

## Objective

In this Kaggle competition, the goal is to 

> build a model that more accurately predicts the unit sales for thousands of items sold at different Favorita stores.

The evaluation metric for this competition is ***Root Mean Squared Logarithmic Error***.

The `RMSLE` is calculated as:

$$\sqrt{ \frac{1}{n} \sum_{i=1}^n \left(\log (1 + \hat{y}_i) - \log (1 + y_i)\right)^2}$$

where:

- $n$ is the total number of instances,
- $\hat{y}_i$ is the predicted value of the target for instance $i$,
- $y_i$ is the actual value of the target for instance $i$, and
- $\log$ is the natural logarithm.
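
The metric above can be sketched in a few lines of NumPy. This is a minimal illustration, not part of the competition tooling: the function name `rmsle` and the check for negative inputs are assumptions made here. `np.log1p(x)` computes $\log(1 + x)$ directly.

```python
import numpy as np


def rmsle(y_true, y_pred):
    """Root Mean Squared Logarithmic Error, following the formula above."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # assumption: sales are non-negative, so negative inputs are rejected
    if (y_true < 0).any() or (y_pred < 0).any():
        raise ValueError("RMSLE expects non-negative values")
    squared_log_error = (np.log1p(y_pred) - np.log1p(y_true)) ** 2
    return float(np.sqrt(np.mean(squared_log_error)))
```

Because the error is taken on the log scale, the metric penalises relative differences rather than absolute ones: identical predictions and actuals give 0, whatever their magnitude.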

For each id in the test set, you must predict a value for the sales variable. The file should contain a header and have the following format:

```
id,sales
3000888,0.0
3000889,0.0
3000890,0.0
3000891,0.0
3000892,0.0
etc.
```
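
A file in this format can be produced with pandas. The sketch below is illustrative only: the helper name `make_submission`, the two sample rows, and the clipping of negative predictions to zero are assumptions made here, not requirements stated by the competition.

```python
import pandas as pd


def make_submission(ids, predictions):
    """Build a two-column frame matching the `id,sales` header."""
    submission = pd.DataFrame({"id": ids, "sales": predictions})
    # assumption: unit sales cannot be negative, so clip predictions at zero
    submission["sales"] = submission["sales"].clip(lower=0.0)
    return submission


# placeholder ids and predictions, for illustration only
submission = make_submission([3000888, 3000889], [0.0, -0.2])

# with no path argument, to_csv returns the CSV text instead of writing a file
csv_text = submission.to_csv(index=False)
```

Passing a file name as the first argument of `to_csv` writes the file to disk; `index=False` keeps the pandas row index out of the output.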


## Libraries for this research notebook

In [None]:
import pandas as pd
import os

# reload edited modules automatically, so changes under src/ are picked up
%reload_ext autoreload
%autoreload 2

from pathlib import Path
import sys

# Path().resolve() returns the current working directory (the notebook's folder)
current_file_path = Path().resolve()

# the parent of the notebook's folder is the folder that contains src/
src_folder_path = current_file_path.parent

# add that folder to the system path so that `import src...` resolves
sys.path.append(str(src_folder_path))

import src.data_loader as DB


## Data Ingestion

Query the data from a local MySQL database.

In [None]:
# instantiate the DataLoader object
load_data = DB.DataLoader()

# create a connection
conn = load_data.initiate_local_connection()

In [None]:
# load in data set using string query
query = '''
    SELECT *
    FROM time_series.oil
'''

results = load_data.query_from_string(conn, query)

In [None]:
results.shape

In [None]:
# load in data set using .sql file
query_file_path = '../src/scripts/train_store_hols.sql'

results2 = load_data.query_from_file(conn, query_file_path)

In [None]:
results2.shape

DataFrame load confirmed: 1,972,674 rows × 17 columns.
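
The internals of `DataLoader` are not shown in this notebook. As a rough sketch of how string-based and file-based query helpers could work, the example below reads a `.sql` file and hands its contents to `pandas.read_sql_query`. Everything in it is hypothetical: the two standalone functions only mirror the method names used above, the `demo` table and its values are placeholders, and an in-memory SQLite database stands in for MySQL so that the example runs without a server.

```python
import sqlite3
import tempfile
from pathlib import Path

import pandas as pd


def query_from_string(conn, query):
    """Run a SQL string and return the result as a DataFrame."""
    return pd.read_sql_query(query, conn)


def query_from_file(conn, path):
    """Read a .sql file and run its contents as a single query."""
    query = Path(path).read_text(encoding="utf-8")
    return query_from_string(conn, query)


# stand-in database: in-memory SQLite with a tiny placeholder table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (id INTEGER, value REAL)")
conn.executemany("INSERT INTO demo VALUES (?, ?)", [(1, 1.0), (2, 2.0)])

# write a query to a temporary .sql file, then load it through the helper
with tempfile.TemporaryDirectory() as tmp:
    sql_path = Path(tmp) / "demo.sql"
    sql_path.write_text("SELECT * FROM demo", encoding="utf-8")
    demo_df = query_from_file(conn, sql_path)

conn.close()
```

Keeping long queries in `.sql` files, as with `train_store_hols.sql` above, lets them be versioned and edited separately from the notebook.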

In [None]:
results2.info()

In [None]:
results3 = load_data.query_from_string(conn, 'select * from VwDump1')

In [None]:
results3.shape

DataFrame load confirmed: 3,000,888 rows × 14 columns.

In [None]:
results3.info()

## Data cleaning