# Store Sales - Time Series Forecasting

Use machine learning to predict grocery sales. [source](https://www.kaggle.com/competitions/store-sales-time-series-forecasting/overview/description)

## Objective

In this Kaggle competition, the goal is to 

> build a model that more accurately predicts the unit sales for thousands of items sold at different Favorita stores.

The evaluation metric for this competition is ***Root Mean Squared Logarithmic Error***.

The `RMSLE` is calculated as:

$$\sqrt{ \frac{1}{n} \sum_{i=1}^n \left(\log (1 + \hat{y}_i) - \log (1 + y_i)\right)^2}$$

where:

- $n$ is the total number of instances,
- $\hat{y}_i$ is the predicted value of the target for instance $i$,
- $y_i$ is the actual value of the target for instance $i$, and
- $\log$ is the natural logarithm.
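
The metric can be checked locally before submitting. The following is a minimal sketch of the formula, assuming NumPy is available; the function name `rmsle` and the rejection of negative inputs are illustrative choices, not part of the competition material.

```python
import numpy as np


def rmsle(y_true, y_pred):
    """Root Mean Squared Logarithmic Error with the natural log of (1 + value)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    if y_true.shape != y_pred.shape:
        raise ValueError("y_true and y_pred must have the same shape")
    # Assumption: sales are non-negative, so negative inputs are rejected here.
    if (y_true < 0).any() or (y_pred < 0).any():
        raise ValueError("negative values are not supported in this sketch")
    # np.log1p(x) computes log(1 + x) accurately for small x.
    squared_log_errors = (np.log1p(y_pred) - np.log1p(y_true)) ** 2
    return float(np.sqrt(np.mean(squared_log_errors)))
```

Because the error is taken on a log scale, the metric penalizes relative differences rather than absolute ones, so a miss of 10 units matters more for a slow-selling item than for a fast-selling one.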

For each `id` in the test set, you must predict a value for the `sales` variable. The submission file should contain a header and have the following format:

```
id,sales
3000888,0.0
3000889,0.0
3000890,0.0
3000891,0.0
3000892,0.0
etc.
```
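
A file in this layout can be written with pandas. The sketch below assumes the predictions are already available as a sequence aligned with the test ids; `make_submission` is a hypothetical helper, and clipping negative predictions to zero is an assumption made here because the metric takes the logarithm of one plus the prediction.

```python
import pandas as pd


def make_submission(ids, predictions):
    """Return a two-column frame in the id,sales layout."""
    if len(ids) != len(predictions):
        raise ValueError("ids and predictions must have the same length")
    submission = pd.DataFrame({"id": ids, "sales": predictions})
    # Assumption: negative predictions are clipped to zero.
    submission["sales"] = submission["sales"].clip(lower=0.0)
    return submission


# Illustrative values only; real predictions would come from a fitted model.
submission = make_submission([3000888, 3000889], [0.0, 12.5])
csv_text = submission.to_csv(index=False)
```

Passing a file path to `to_csv` instead of capturing the text writes the submission to disk.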


## Libraries for this research notebook

In [None]:

import pandas as pd
from tqdm.auto import tqdm

# reload edited modules automatically so changes under src/ are picked up
%reload_ext autoreload
%autoreload 2

from pathlib import Path
import sys

# resolve the current working directory (the folder the notebook runs from)
current_file_path = Path().resolve()
print(f"current_file_path is {current_file_path}")

# set the path to the src folder
src_folder_path = current_file_path.parent / 'src'
print(f"src_folder_path is {src_folder_path}")

# add the src folder to the system path
sys.path.append(str(src_folder_path))

from data_loader import DBDataLoader
from logger import logging

## Data Ingestion

Query data from MySQL
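
The `DBDataLoader` class is not shown in this notebook. The sketch below illustrates the pattern it is assumed to follow: read the query result in chunks, then concatenate once. An in-memory SQLite table stands in for the MySQL source so the example runs on its own; the table contents and the `chunksize` value are made up for illustration.

```python
import sqlite3

import pandas as pd

# Hypothetical stand-in for the MySQL database: a small in-memory SQLite table.
conn = sqlite3.connect(":memory:")
pd.DataFrame(
    {"id": range(10), "sales": [float(i) for i in range(10)]}
).to_sql("train", conn, index=False)

# With chunksize set, read_sql returns an iterator of DataFrames
# instead of a single frame.
chunks = pd.read_sql("SELECT * FROM train ORDER BY id", conn, chunksize=4)

# Collect the chunks and concatenate once; calling pd.concat inside the loop
# would copy the accumulated frame on every iteration.
frames = list(chunks)
df_demo = pd.concat(frames, ignore_index=True)
conn.close()
```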

In [None]:
# load in data set using .sql file
query_file_path = '../src/scripts/train_store_hols.sql'

db = DBDataLoader()

In [None]:
with open(query_file_path, 'r') as query:
    # db.load is assumed to return an iterable of DataFrame chunks
    chunks = db.load(query=query.read())

# sys.getsizeof reports the object's size in bytes, not the number of chunks,
# so iterate over the chunks directly and concatenate once at the end
frames = [chunk for chunk in tqdm(chunks, desc='Reading from DB')]
df = pd.concat(frames, ignore_index=True)
logging.info(f"loaded {len(frames)} chunks, {len(df)} rows")

In [None]:
df.shape

In [None]:
df.info()

Load confirmed: `df` has 1972674 rows × 14 columns.

In [None]:
stmt = 'select * from VwDump1'

# db.load is assumed to return an iterable of DataFrame chunks
chunks = db.load(query=stmt)

# iterate over the chunks directly; sys.getsizeof gives bytes, not a chunk count
frames = [chunk for chunk in tqdm(chunks, desc='Reading from View')]
view_df = pd.concat(frames, ignore_index=True)
logging.info(f"VwDump1: loaded {len(frames)} chunks, {len(view_df)} rows")

In [None]:
view_df.head()

Load confirmed: `view_df` has 3000888 rows × 14 columns.

In [None]:
view_df.info()

In [None]:
# trying out data ingestion through a SQLAlchemy engine
from sqlalchemy import create_engine, text
from dotenv import dotenv_values
config = dotenv_values()

In [None]:
# create engine to talk to database
engine = create_engine(
    f'mysql+pymysql://'             # dialect + driver
    f'{config.get("USERNAME")}'     # username
    f':{config.get("PASSWORD")}'    # password
    f'@{config.get("ENDPOINT")}'    # host
    f':{config.get("PORT")}'        # port
    f'/{config.get("DBNAME")}'      # database
)
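
A connection URL assembled from raw strings breaks when the password contains characters such as `@` or `/`, which have their own meaning inside a URL. The sketch below shows one way to guard against that by percent-encoding the credentials with the standard library. `build_mysql_url` is a hypothetical helper and the credentials are placeholders.

```python
from urllib.parse import quote_plus


def build_mysql_url(username, password, host, port, database):
    """Assemble a mysql+pymysql URL, percent-encoding the credentials."""
    return (
        f"mysql+pymysql://{quote_plus(username)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}"
    )


# Placeholder values; real ones would come from the .env configuration.
url = build_mysql_url("analyst", "p@ss/word", "db.example.com", 3306, "sales")
```

Note that `config.get(...)` returns `None` for a missing key, which an f-string renders as the literal text `None`, so it may be worth checking that every key is present before building the URL.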

In [None]:
# establish connection and make the query
with engine.connect() as cnxn:
    with open('../src/scripts/query_data.sql') as f:
        query = text(f.read())
        results = pd.read_sql(query, cnxn)

# runtime 1 min 1.4 secs

In [None]:
results.shape

In [None]:
from dotenv import load_dotenv
import os
import mysql.connector

load_dotenv()

In [None]:
DB_HOST = os.getenv("PS_HOST")
DB_USERNAME = os.getenv("PS_USERNAME")
DB_PASSWORD = os.getenv("PS_PASSWORD")
DB_DATABASE = os.getenv("PS_DATABASE")

In [None]:
connection = mysql.connector.connect(
    host=DB_HOST,
    user=DB_USERNAME,
    password=DB_PASSWORD,
    database=DB_DATABASE,
    # ssl_verify_identity=True,
    # ssl_ca="/etc/ssl/certs/ca-certificates.crt"
)

In [None]:
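cursor = connection.cursor()

batch_size = 100000
max_rows = 3000000
start_id = 0
rows = []

# keyset pagination: fetch batch_size rows at a time, resuming after the last id seen
while len(rows) < max_rows:
    # pass values as query parameters instead of interpolating them into the SQL string
    cursor.execute(
        """
        SELECT *
        FROM train
        WHERE id >= %s
        ORDER BY id
        LIMIT %s
        """,
        (start_id, batch_size),
    )

    batch = cursor.fetchall()
    if not batch:
        break

    rows.extend(batch)
    # assumes id is the first column returned by SELECT *
    start_id = batch[-1][0] + 1

df = pd.DataFrame.from_records(rows, columns=[desc[0] for desc in cursor.description])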
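
The paging loop above can also be packaged as a generator, so that the caller decides what to do with each batch. This is a minimal sketch run against an in-memory SQLite table rather than the MySQL `train` table. `fetch_in_batches` and the sample rows are hypothetical, and SQLite uses `?` placeholders where `mysql.connector` uses `%s`.

```python
import sqlite3


def fetch_in_batches(connection, batch_size):
    """Yield lists of rows from `train`, paging on the id column."""
    last_id = -1  # assumes ids are non-negative integers
    while True:
        batch = connection.execute(
            "SELECT id, sales FROM train WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, batch_size),
        ).fetchall()
        if not batch:
            return
        yield batch
        # Resume after the largest id seen so far.
        last_id = batch[-1][0]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE train (id INTEGER PRIMARY KEY, sales REAL)")
conn.executemany(
    "INSERT INTO train VALUES (?, ?)", [(i, i * 1.5) for i in range(7)]
)
batches = list(fetch_in_batches(conn, batch_size=3))
conn.close()
```

Paging on an indexed key keeps each query cheap regardless of how far into the table it starts, which is not the case for `OFFSET`-based paging.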

In [None]:
connection = mysql.connector.connect(
    host=DB_HOST,
    user=DB_USERNAME,
    password=DB_PASSWORD,
    database=DB_DATABASE,
    # ssl_verify_identity=True,
    # ssl_ca="/etc/ssl/certs/ca-certificates.crt"
    connect_timeout=1000
)

cursor = connection.cursor()

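# server-side handshake timeout; changing a GLOBAL variable needs elevated privileges
cursor.execute("SET GLOBAL connect_timeout=60")

# Vitess/PlanetScale session setting intended for large analytical reads
cursor.execute("SET WORKLOAD = 'olap'")

cursor.execute("SELECT * FROM full_df")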

rows = cursor.fetchall()

print(len(rows))

df = pd.DataFrame.from_records(rows, columns=[desc[0] for desc in cursor.description])

cursor.close()
connection.close()

In [None]:
df.shape

## Data cleaning

In [None]:
df['date'] = pd.to_datetime(df['date'])

In [None]:
df.set_index('date', drop=True, inplace=True)

In [None]:
df.info()

In [None]:
view_df.drop('id', axis=1, inplace=True)

In [None]:
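# in agg('sum', 'mean') the second argument is not treated as another statistic; pass a list
numeric_cols = view_df.select_dtypes('number').columns.difference(['store_nbr'])
groupby_store = view_df.groupby(by=['store_nbr', 'family'])[list(numeric_cols)].agg(['sum', 'mean'])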
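
When only a few statistics are needed, named aggregation gives flat, self-describing column names instead of a two-level column index. The sketch below uses a made-up four-row frame; the `sales` column name is taken from the competition's submission format, and the output names `total_sales` and `mean_sales` are illustrative.

```python
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame(
    {
        "store_nbr": [1, 1, 1, 2],
        "family": ["DAIRY", "DAIRY", "BREAD", "DAIRY"],
        "sales": [10.0, 20.0, 5.0, 8.0],
    }
)

# Each keyword names an output column and maps it to (input column, statistic).
summary = toy.groupby(["store_nbr", "family"]).agg(
    total_sales=("sales", "sum"),
    mean_sales=("sales", "mean"),
)
```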

In [None]:
groupby_store.info()

## Data profile

In [None]:
from ydata_profiling import ProfileReport

In [None]:
profile = ProfileReport(view_df, title="ProfileReport view_df")
# profile.to_notebook_iframe()
profile.to_file("../artifacts/reports/view_df_ProfileReport.html")

In [None]:
profile = ProfileReport(df, title="ProfileReport defaults")
# profile.to_notebook_iframe()
profile.to_file("../artifacts/reports/df_ProfileReport.html")

## Data preprocessing