<h1 align = "center">Boilerplate/Template Design</h1>

---

**Objective:** This file provides a simple *boilerplate* for *time series modelling*, so that you can concentrate on what is necessary and stop repeating the same setup tasks. The boilerplate is also configured with certain [**nbextensions**](https://gitlab.com/ZenithClown/computer-configurations-and-setups) that I personally use. Install them if required; otherwise ignore them, as they play no part in code optimization. For any new project, *edit* this file or use `File > Make a Copy` to get started. Some settings and configurations are already provided, as mentioned below.

<h5 align = "center">🛠🚧 Development In Progress 🚧🛠</h5>

In [1]:
# use the code release version for tracking and code modifications. use the
# CHANGELOG.md file to keep track of version features, and/or release notes.
# the version file is available at the project root directory, check the
# global configuration setting for root directory information.
# the file is read here and made available as `__version__`
with open("../VERSION", "rt") as version_file: # close the file after reading
    __version__ = version_file.read().strip() # bump codecov
print(f"Current Code Version: {__version__}") # TODO : author, contact

Current Code Version: v0.1.2


## Code Imports

Code must be written such that it is always _production ready_. The guidelines provided under [**PEP8**](https://peps.python.org/pep-0008/#imports) define the conventional ways of structuring code. The necessary guidelines w.r.t. code imports are listed below, followed by the basic libraries and import settings.

 1. Imports should be on separate lines,
 2. Import order should be:
    * standard library/modules,
    * related third party imports,
    * local application/user defined imports
 3. Wildcard imports (`*`) should be avoided; if unavoidable, tag them explicitly with **`# noqa: F403`** as per `flake8` or **`# pylint: disable=wildcard-import`** as per `pylint`
 4. Avoid implicit relative imports; use absolute (explicit) imports instead.

In [2]:
import os     # miscellaneous os interfaces
import sys    # configuring python runtime environment
import time   # library for time manipulation, and logging
import pickle # load/save a `pmdarima` model as a pickle file

In [3]:
# use `datetime` to control and perceive the environment
# in addition, `pandas` also provides date time functionalities
import datetime as dt

In [4]:
# from copy import deepcopy      # dataframe is mutable
# from tqdm import tqdm as TQ    # progress bar for loops
# from uuid import uuid4 as UUID # unique identifier for objs

### Data Analysis and AI/ML Libraries

Import of data analysis and AI/ML libraries required at different stages. Check settings and configurations [here](https://gitlab.com/ZenithClown/computer-configurations-and-setups) and code snippets [here](https://gitlab.com/ZenithClown/computer-configurations-and-setups/-/tree/master/template/snippets/vscode) to understand the settings that are used in this notebook. The code uses `matplotlib.styles`, a custom `.mplstyle` file recognised by `matplotlib`, downloadable from [this link](https://gitlab.com/ZenithClown/computer-configurations-and-setups/-/tree/master/settings/python/matplotlib).

In [5]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

%precision 3
%matplotlib inline
sns.set_style('whitegrid');
# plt.style.use('default-style'); # http://tinyurl.com/mpl-default-style

pd.set_option('display.max_rows', 50) # max. rows to show
pd.set_option('display.max_columns', 17) # max. cols to show
np.set_printoptions(precision = 3, threshold = 15) # set np options
pd.options.display.float_format = '{:,.3f}'.format # float precisions

In [6]:
# sklearn metrics for analysis can be imported as below
# considering a `regression` problem, the error metrics are imported
# for rmse, use `squared = False` : https://stackoverflow.com/a/18623635/
# (recent scikit-learn releases deprecate `squared` in favour of `root_mean_squared_error`)
from sklearn.metrics import (
    mean_squared_error as MSE,
    mean_absolute_error as MAE
)
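As a quick sanity check on what the imported metrics compute, the sketch below evaluates MAE, MSE and RMSE by hand using only the standard library. The sample values are made up for illustration, and the variable names are not part of the template.

```python
import math

# made-up actual and predicted values, for illustration only
actual = [3.0, 5.0, 7.0]
predicted = [2.0, 5.0, 9.0]

errors = [a - p for a, p in zip(actual, predicted)]

mae = sum(abs(e) for e in errors) / len(errors)   # mean absolute error
mse = sum(e ** 2 for e in errors) / len(errors)   # mean squared error
rmse = math.sqrt(mse)                             # root of the mean squared error

print(f"MAE = {mae:.3f}, MSE = {mse:.3f}, RMSE = {rmse:.3f}")
```

RMSE is expressed in the same unit as the series itself, which usually makes it easier to read than MSE when comparing forecasts.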

In [7]:
from statsmodels.tsa.seasonal import seasonal_decompose

In [8]:
import pmdarima as pm # let `auto_arima()` find the coefficients
import statsmodels.api as sm # statsmodels for statistical models and tests

### User Defined Function(s)

It is recommended that any UDFs are defined outside the scope of the *jupyter notebook*, so that functions can be developed and edited more practically. As per *programming guidelines*, a [`src`](https://fileinfo.com/extension/src) file/directory is beneficial in code development and/or production release. However, a *jupyter notebook* does not pick up changes to an already imported code file on disk until the module is reloaded or the *kernel is restarted*; for this reason, frequently changing functions can be defined in this section.
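For functions that do live in an external file, the standard library offers `importlib.reload()` as a way to pick up edits without restarting the kernel. The following is a minimal sketch, not part of the template: the module name `helpers_demo` and the function `scale` are hypothetical, and the module is written to a temporary directory purely for the demonstration.

```python
import importlib
import sys
import tempfile
from pathlib import Path

sys.dont_write_bytecode = True  # keep this short demo free of cached bytecode

with tempfile.TemporaryDirectory() as tmp:
    module_path = Path(tmp) / "helpers_demo.py"  # hypothetical module name
    module_path.write_text("def scale(x):\n    return x * 2\n")

    sys.path.insert(0, tmp)          # make the temporary directory importable
    importlib.invalidate_caches()    # the file was created after interpreter start
    import helpers_demo

    first = helpers_demo.scale(10)   # uses the original definition

    # simulate editing the file on disk, then reload the module in place
    module_path.write_text("def scale(x):\n    return x * 100\n")
    helpers_demo = importlib.reload(helpers_demo)
    second = helpers_demo.scale(10)  # uses the edited definition

    sys.path.remove(tmp)

print(first, second)
```

Inside IPython/Jupyter, the `autoreload` extension (`%load_ext autoreload` followed by `%autoreload 2`) automates the same idea.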

**Getting Started** with **`PYTHONPATH`**

One must know what [environment variables](https://medium.com/chingu/an-introduction-to-environment-variables-and-how-to-use-them-f602f66d15fa) are and how to call/use them in the programming language of choice. Note that environment variable names are *case sensitive* on all operating systems except Windows (since DOS is not case sensitive). Generally, environment variables can be accessed from the terminal/shell/command prompt as:

```shell
# macOS/*nix
echo $VARNAME

# windows
echo %VARNAME%
```
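The same lookup can be done from Python through `os.environ`. This is a small sketch; `DEMO_VARNAME` and `DEMO_NOT_SET` are hypothetical names used only for the demonstration.

```python
import os

# set a hypothetical variable for this process only (it is not persisted)
os.environ["DEMO_VARNAME"] = "demo-value"

# `.get()` returns None (or the given default) instead of raising KeyError
value = os.environ.get("DEMO_VARNAME")
missing = os.environ.get("DEMO_NOT_SET", "fallback")

print(value, missing)
```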

[`PYTHONPATH`](https://bic-berkeley.github.io/psych-214-fall-2016/using_pythonpath.html), as per the [*python documentation*](https://docs.python.org/3/using/cmdline.html#envvar-PYTHONPATH), lists additional directories that `import` statements search, in order of precedence, once it is set up on your system. If a source code/module is not available, check the necessary environment variables and/or ask the administrator for the source files. For testing purposes, the template uses `src`, `utils` and `config` directories. However, these directories are available at `ROOT` level, and thus `sys.path.append()` is used to add them before importing.
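A minimal sketch of how the search path can be inspected and extended from within a notebook is given below. It assumes the `ROOT/src` layout mentioned above; the directory does not need to exist for the path entry to be added.

```python
import os
import sys

# directories supplied through PYTHONPATH (empty list if the variable is unset)
pythonpath_entries = [
    entry for entry in os.environ.get("PYTHONPATH", "").split(os.pathsep) if entry
]

# assumed layout: the notebook sits one level below ROOT, next to `src`
src_dir = os.path.abspath(os.path.join("..", "src"))
if src_dir not in sys.path:  # avoid duplicate entries on repeated cell runs
    sys.path.append(src_dir)

print(pythonpath_entries)
print(src_dir in sys.path)
```

Appending (rather than inserting at position 0) keeps installed packages ahead of project files when names collide.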

**Getting Started** with **`submodules`**

A [`submodule`](https://git-scm.com/book/en/v2/Git-Tools-Submodules) provides functionality to integrate a separate project into the current repository. This is typically useful to remove code duplication and to keep a central repository that controls dependent modules. More information on initializing and using submodules is available [here](https://www.youtube.com/watch?v=gSlXo2iLBro). Check [Github-GISTS/ZenithClown](https://gist.github.com/ZenithClown) for more information.

In [9]:
# get udf for `date_range()` function
# always check if `datetime` is imported
import datetime_ as dt_ # https://gist.github.com/ZenithClown/d2dd294c5f528459e16b139c04c0b182

In [10]:
# from stationarity import checkStationarity # https://gist.github.com/ZenithClown/f99d7e1e3f4b4304dd7d43603cef344d

In [11]:
# append `src` and sub-modules to call additional files; these directories are
# project specific and are not to be added under environment or $PATH variable
# sys.path.append(os.path.join("..", "src", "agents")) # agents for reinforcement modelling
# sys.path.append(os.path.join("..", "src", "engine")) # derivative engines for model control
# sys.path.append(os.path.join("..", "src", "models")) # actual models for decision making tools

## Global Argument(s)

The global arguments are *notebook* specific; however, they may also be extended to external libraries and functions on import. The *boilerplate* provides a basic ML directory structure which contains a directory for `data` and a separate directory for `output`. In addition, a separate directory (`data/processed`) is created to save processed datasets, so that preprocessing does not have to be repeated.

In [12]:
ROOT = ".." # the document root is one level up, and contains the whole code structure
DATA = os.path.join(ROOT, "data") # the directory contains all data files, subdirectories (if any) can also be used/defined

# processed data directory can be used, such that preprocessing steps are not
# required to run again and again on each kernel restart
PROCESSED_DATA = os.path.join(DATA, "processed")

In [13]:
OUTPUT_DIR = os.path.join(ROOT, "output")
IMAGES_DIR = os.path.join(OUTPUT_DIR, "images")
MODELS_DIR = os.path.join(OUTPUT_DIR, "savedmodels")
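The paths above are only strings, so the directories still have to exist before anything is written to them. A minimal sketch of creating the layout is shown below; a temporary directory stands in for `ROOT` so that the example does not touch a real project.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as root:  # stand-in for ROOT in this demo
    layout = [
        os.path.join(root, "data", "processed"),
        os.path.join(root, "output", "images"),
        os.path.join(root, "output", "savedmodels"),
    ]

    for directory in layout:
        # `exist_ok = True` makes the call safe to repeat on kernel restart
        os.makedirs(directory, exist_ok=True)

    created = sorted(os.listdir(os.path.join(root, "output")))

print(created)
```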

In [14]:
N_LOOKBACK = 12 # no. of periods to lookback in the history
N_FORECAST = 24 # no. of periods to forecast forward
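One common way such lookback/forecast settings are used is to slice a series into (history, future) pairs. The sketch below illustrates that idea only; `make_windows` is a hypothetical helper and is not defined anywhere in the template.

```python
def make_windows(series, n_lookback, n_forecast):
    """Return (history, future) pairs taken with a sliding window (hypothetical helper)."""
    pairs = []
    last_start = len(series) - n_lookback - n_forecast
    for start in range(last_start + 1):
        split = start + n_lookback
        history = series[start:split]              # the `n_lookback` most recent periods
        future = series[split:split + n_forecast]  # the `n_forecast` periods that follow
        pairs.append((history, future))
    return pairs


series = list(range(40))  # made-up series of 40 periods
pairs = make_windows(series, n_lookback=12, n_forecast=24)

print(len(pairs), len(pairs[0][0]), len(pairs[0][1]))
```

A series of length `n` yields `n - n_lookback - n_forecast + 1` pairs, so the series has to be at least `n_lookback + n_forecast` periods long.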

## Historic Price Data

The historic data, i.e., any data in time series format, can be read using `pd.read_*().set_index("date")` or use a custom function to read and process from the file.
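A minimal sketch of such a read statement is given below. It assumes a CSV source with a `date` column and a `values` column (the column names used later in this notebook); the rows are made up and read from an in-memory buffer instead of a real file.

```python
import io

import pandas as pd

# made-up rows standing in for a real file under the data directory
raw = io.StringIO(
    "date,values\n"
    "2023-01-01,100.0\n"
    "2023-01-08,102.5\n"
    "2023-01-15,101.0\n"
)

# parse the date column on read, then use it as a sorted index
data = pd.read_csv(raw, parse_dates=["date"]).set_index("date").sort_index()

print(data.index.dtype)
print(data.head())
```

Parsing dates at read time gives a `DatetimeIndex`, which methods such as `resample()` require.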

In [None]:
data = # write read statement as a dataframe

### Exponential Smoothing

Exponential smoothing or exponential moving average (EMA) is a rule-of-thumb technique for smoothing time series data using the exponential window function. Unlike the "Simple Moving Average", the EMA model gives a higher weight to the most recent prices. In addition, models like DEMA and TEMA are also built on top of EMA.
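The recursion behind EMA can be sketched in a few lines of plain Python. The example uses the recursive form (the first smoothed value equals the first observation), which corresponds to `adjust = False` in `pandas`, and also derives DEMA as `2 * EMA - EMA(EMA)`. The function names and sample prices are illustrative only.

```python
def ema(values, span):
    """Recursive EMA: y[0] = x[0], y[t] = alpha * x[t] + (1 - alpha) * y[t - 1]."""
    alpha = 2 / (span + 1)
    smoothed = [values[0]]
    for value in values[1:]:
        smoothed.append(alpha * value + (1 - alpha) * smoothed[-1])
    return smoothed


def dema(values, span):
    """Double EMA: 2 * EMA - EMA(EMA), which reduces the lag of a single EMA."""
    first = ema(values, span)
    second = ema(first, span)
    return [2 * a - b for a, b in zip(first, second)]


prices = [10.0, 12.0, 11.0, 15.0]  # made-up prices
ema_values = ema(prices, span=3)   # span = 3 gives alpha = 0.5
dema_values = dema(prices, span=3)

print(ema_values)
print(dema_values)
```

A larger `span` lowers `alpha`, so the curve reacts more slowly to new prices.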

In [None]:
# exponential smoothing is a good first step, we can do so by setting
span = 
alpha = 2 / (span + 1) # rule of thumb for alpha selection: https://help.sumologic.com/docs/metrics/metrics-operators/ewma/

In [None]:
plt.plot(data["values"], label = "Historic Price", color = "#5d8bd4")
plt.plot(data["values"].ewm(alpha = alpha, adjust = False).mean(), label = f"EWMA (span = {span}, α = {alpha:.3f})", color = "#b09666")

plt.legend(loc = "upper right")
plt.title(commodity)
plt.show()

## Model Development

An exploratory data analysis on a time series dataset is typically different from that of a conventional/non-timeseries dataset. The [`template`](https://github.com/ZenithClown/ai-ml-project-template) provides some basic analysis techniques that are applicable to any type of dataset. The following methods are implemented:

  * [**Seasonal Decomposition**](https://machinelearningmastery.com/decompose-time-series-data-trend-seasonality/): The process of breaking time series data into three main components - (I) trend, (II) seasonality, and (III) residuals/error terms.
  * [**Check Data Stationarity**](https://www.analyticsvidhya.com/blog/2021/06/statistical-tests-to-check-stationarity-in-time-series-part-1/): Basic statistical models like `ARIMA` work on the principle that the time series is stationary.

But first, we copy the actual dataset into a new variable (to preserve the original data), and understand the data that will be applied to the master series. In addition, we also set the date time indexing (with frequency, if not already available), since many functions (like ETS Decomposition) are dependent on it.

In [None]:
frame_ = data[["values"]].copy() # frame with missing dates, and imputed with frequency
all_dates = list(map(lambda x : pd.Timestamp(str(x)), list(dt_.date_range(frame_.index.min().date(), frame_.index.max().date()))[::7])) # weekly frequency data
missing_dates = [date for date in all_dates if date not in frame_.index.values] # ? this function is not authorized

# insert missing dates
frame_.reset_index(inplace = True)
for missing_date in missing_dates:
    frame_.loc[len(frame_)] = [missing_date, np.nan] # `np.nan` keeps the column numeric for interpolation
    
# interpolate on missing dates
frame_ = frame_.sort_values("date").set_index("date") # don't sort descending
frame_["values"] = frame_["values"].interpolate() # fill missing values
frame_ = frame_.resample("???").first() # https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.resample.html
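As an alternative to inserting the missing dates in a loop, `pandas` can rebuild a regular index directly with `asfreq()`. The sketch below is not the template's method; it assumes that observations are exactly seven days apart and uses made-up values.

```python
import pandas as pd

# made-up weekly observations with two weeks missing (15th and 22nd of January)
index = pd.to_datetime(["2023-01-01", "2023-01-08", "2023-01-29", "2023-02-05"])
frame = pd.DataFrame({"values": [10.0, 20.0, 50.0, 60.0]}, index=index)
frame.index.name = "date"

# rebuild a regular 7-day index; dates without an observation become NaN
regular = frame.asfreq("7D")
n_missing = int(regular["values"].isna().sum())

# fill the gaps linearly between the neighbouring observations
regular["values"] = regular["values"].interpolate()

print(n_missing)
print(regular)
```

The resulting index carries an explicit frequency, which is what frequency-dependent functions such as `seasonal_decompose` look for.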

In [None]:
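fig = seasonal_decompose(frame_).plot() # `.plot()` creates and returns its own figure
fig.set_size_inches(25, 19) # resize the returned figure instead of opening an empty one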
plt.show()

In [None]:
# get yearly average sales
yearly_sales = data.resample("Y")[commodity].mean().reset_index()

# get monthly average sales, grouped by year
monthly_sales = data.resample("M")[commodity].mean().reset_index()
monthly_sales["year"] = monthly_sales["date"].dt.year # can be used as a hue parameter to distinguish years

plt.figure(figsize = (25, 5))
sns.lineplot(yearly_sales, x = "date", y = commodity, label = "Yearly Average Sales")
sns.lineplot(monthly_sales, x = "date", y = commodity, hue = "year", palette = "viridis", label = "Monthly Average Sales")

# disable/set xy label
plt.xlabel("")
plt.ylabel("Price Values")

plt.legend([]) # yearly diff color is understood
plt.title("Average Sales Historic Price Trend")

plt.show()

In [None]:
window = 12
*_, rolling = checkStationarity(data, feature = "values", window = window)

plt.plot(rolling)
plt.suptitle("Rolling Mean & Standard Deviation")
plt.title(f"Rolling Window Size = {window}")
plt.show()