<h1 align = "center">Boilerplate/Template Design</h1>

---


**Objective:** This file provides a simple *boilerplate* so that you can concentrate on what is necessary and stop repeating the same setup tasks. The boilerplate is also configured with certain [**nbextensions**](https://gitlab.com/ZenithClown/computer-configurations-and-setups) that I personally use. Install them if required; otherwise ignore them, as they play no part in code optimization. For any new project, *edit* this file or use `File > Make a Copy` to get started. Some settings and configurations are already provided, as described below.

## Code Imports

**PEP8 Style Guide** lists out the following [*guidelines*](https://peps.python.org/pep-0008/#imports) for imports:
 1. Imports should be on separate lines,
 2. Import order should be:
    * standard library/modules,
    * related third party imports,
    * local application/user defined imports
 3. Wildcard imports (`*`) should be avoided; if one is unavoidable, tag it explicitly with **`# noqa: F403`** so that `flake8` skips it [(ignoring errors)](https://flake8.pycqa.org/en/3.1.1/user/ignoring-errors.html).
 4. Prefer absolute imports; explicit relative imports are an acceptable alternative, while implicit relative imports should never be used.

In [1]:
import sys # append additional directories to list
import warnings # display specified user warnings as required

In [4]:
from os import walk # get list of files in a directory and subdirectory
from time import time # get the epoch time, helpful in saving files
from time import ctime # print computer time in human readable format
from os import makedirs # create directories dynamically
from os.path import join # join multiple path components in an os-independent way
# from copy import deepcopy # make an independent copy of mutable objects like `list` or `dict`
# from tqdm import tqdm as TQ # for displaying a nice progress bar while running
# from uuid import uuid1 as UUID # for generating unique identifier
from datetime import datetime as dt # formatting/defining datetime objects

[**`logging`**](https://docs.python.org/3/howto/logging.html) is a standard python module meant for tracking events that happen while software/code runs. The module is powerful and helpful for debugging and other purposes. The next section defines a `logging` configuration that writes to the **`../logs/`** directory. Modify the **`LOGS_DIR`** variable under *Global Arguments* to change the default directory. The module is configured with a simple approach, such that any `print()` statement can be updated to `logging.LEVEL_NAME()` and the code will still work. Use logging operations like:

```python
 >>> logging.debug("This is a Debug Message.")
 >>> logging.info("This is an Information Message.")
 >>> logging.warning("This is a Warning Message.")
 >>> logging.error("This is an ERROR Message.")
 >>> logging.critical("This is a CRITICAL Message.")
```

Note: some directories related to logging are created by default. This can be changed in the configuration section that follows.

In [2]:
import logging # configure logging on `global arguments` section, as file path is required

### Data Frame, NumPy and Visualization Libraries

Daily-use libraries like `pandas` and `numpy` are available for import, along with *display* settings for a python notebook. Generally, I prefer to use `matplotlib` and `seaborn`, which are imported below with the configurations shown. The stylesheet is available [here](https://gitlab.com/ZenithClown/computer-configurations-and-setups/-/blob/master/default-style.mplstyle).

In [4]:
# import numpy as np
import pandas as pd
# import seaborn as sns
# import matplotlib.pyplot as plt

%precision 3
# %matplotlib inline
# sns.set_style('whitegrid');
# plt.style.use('default-style'); # https://gitlab.com/ZenithClown/computer-configurations-and-setups
pd.set_option('display.max_rows', 50)
pd.set_option('display.max_columns', 15)
# np.set_printoptions(precision = 3, threshold = 15)
pd.options.display.float_format = '{:,.2f}'.format

### Machine Learning Libraries

Uncomment and import any of the `sklearn` modules below. Boilerplate for importing `tensorflow` is also provided.

In [39]:
# from sklearn.metrics import 
# from sklearn.model_selection import 
# from sklearn.preprocessing import 

In [10]:
import tensorflow as tf
print(f"Tensorflow Version: {tf.__version__}")

# check physical devices
# the given function is available only in tensorflow 2.x
tf.config.list_physical_devices()

Tensorflow Version: 2.8.0


[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU'),
 PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]

In [11]:
if len(tf.config.list_physical_devices(device_type = "GPU")):
    # https://stackoverflow.com/q/38009682/6623589
    # https://stackoverflow.com/a/59179238/6623589
    print("GPU Computing Available.")
else:
    print("GPU Computing Not Available. If `GPU` is present, check configuration.")

GPU Computing Available.


## Global Argument(s)

The global arguments are *notebook* specific; however, they may also be extended to external libraries and functions on import. The *boilerplate* provides a basic ML directory structure, which contains a directory for `data` and a separate directory for `output`. In addition, a separate directory (`data/processed`) is created to save processed datasets, so that preprocessing need not be repeated.

In [6]:
ROOT = ".." # the document root is one level up, and contains the whole code structure
DATA = join(ROOT, "data") # contains all data files; subdirectories (if any) can also be used/defined

# processed data directory can be used, such that preprocessing steps are not
# required to run again-and-again each time on kernel restart
PROCESSED_DATA = join(DATA, "processed")
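
One way the processed-data directory could be used is sketched below: load a cached result if it exists, otherwise preprocess and save it. The helper `load_or_process`, the file name `clean.json` and the whitespace-stripping step are all hypothetical, and a temporary directory stands in for `../data/processed`.

```python
import json
import tempfile
from os import makedirs
from os.path import exists, join


def load_or_process(raw_rows, processed_dir, name="clean.json"):
    """Return (rows, from_cache): cached rows if present, else freshly processed rows."""
    cache = join(processed_dir, name)
    if exists(cache):
        with open(cache) as fh:
            return json.load(fh), True

    # hypothetical preprocessing step: strip whitespace and drop empty rows
    rows = [row.strip() for row in raw_rows if row.strip()]
    makedirs(processed_dir, exist_ok=True)
    with open(cache, "w") as fh:
        json.dump(rows, fh)
    return rows, False


# a temporary directory stands in for `../data/processed`
demo_dir = join(tempfile.mkdtemp(), "data", "processed")
first, cached_first = load_or_process([" a ", "", "b"], demo_dir)
second, cached_second = load_or_process([" a ", "", "b"], demo_dir)
print(first, cached_first, cached_second)
```

The second call skips the preprocessing step entirely, which is the behaviour one would want after a kernel restart.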

In [7]:
# long projects can be overwhelming, and keeping track of files, outputs and
# saved models can be tricky! to help with this, `today` can be used. for
# instance, output can be stored at `output/<today>/` etc.
# `today` is formatted such that it is a valid windows/*nix file/directory name
today = dt.now().strftime("%a, %b %d %Y")
print(f"Code Execution Started on: {today}") # only date, name of the sub-directory

Code Execution Started on: Wed, May 25 2022
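
The claim that the date stamp is safe as a directory name can be checked with a short sketch. The set of characters below is the list Windows reserves in file names; a temporary directory is assumed in place of `ROOT`.

```python
import tempfile
from datetime import datetime as dt
from os import makedirs
from os.path import isdir, join

today = dt.now().strftime("%a, %b %d %Y")  # same format as the cell above

# characters that windows does not allow in file/directory names
FORBIDDEN = set('<>:"/\\|?*')
assert not FORBIDDEN & set(today)

root = tempfile.mkdtemp()  # stand-in for `ROOT`
output_dir = join(root, "output", today)
makedirs(output_dir, exist_ok=True)
makedirs(output_dir, exist_ok=True)  # re-running the cell is harmless
print(isdir(output_dir))
```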


In [1]:
OUTPUT_DIR = join(ROOT, "output", today)
makedirs(OUTPUT_DIR, exist_ok = True) # create dir if not exist

# in addition create directory for images, saved models
IMAGE_DIR = join(ROOT, "output", "images", today)
MODEL_DIR = join(ROOT, "output", "savedmodels", today)

makedirs(IMAGE_DIR, exist_ok = True) # create dir if not exist
makedirs(MODEL_DIR, exist_ok = True) # create dir if not exist

# also create directory for `logs`
LOGS_DIR = join(ROOT, "logs", "BOILERPLATE") # enter project code to distinguish
makedirs(LOGS_DIR, exist_ok = True)


In [2]:
print(LOGS_DIR) # logs file will be generated here


In [9]:
logging.captureWarnings(True) # send warnings to log file automatically https://stackoverflow.com/a/37979724/6623589
logging.basicConfig(
    filename = join(LOGS_DIR, f"{today}.log"), # one log file per day; change the file name if required
    filemode = "a", # append logs to existing file, if file exists
    format = "%(asctime)s - %(name)s - CLASS:%(levelname)s:%(levelno)s:L#%(lineno)d - %(message)s",
    level = logging.DEBUG
)
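
To see what a log line looks like with this format string, the sketch below writes two messages to a temporary file and reads them back. It deliberately uses a named logger with its own `FileHandler` instead of `logging.basicConfig`, so that it can be run without touching the root configuration; the logger name `boilerplate_demo` and the file name `demo.log` are assumptions made for the example.

```python
import logging
import tempfile
from os.path import join

log_file = join(tempfile.mkdtemp(), "demo.log")  # stand-in for a file in `LOGS_DIR`

# a dedicated logger and handler, so the root logger is left untouched
logger = logging.getLogger("boilerplate_demo")
logger.setLevel(logging.DEBUG)
logger.propagate = False

handler = logging.FileHandler(log_file, mode="a")
handler.setFormatter(logging.Formatter(
    "%(asctime)s - %(name)s - CLASS:%(levelname)s:%(levelno)s:L#%(lineno)d - %(message)s"
))
logger.addHandler(handler)

logger.info("rows loaded: %d", 42)      # instead of print(...)
logger.warning("missing values found")

# flush and detach the handler before reading the file back
handler.close()
logger.removeHandler(handler)

with open(log_file) as fh:
    lines = fh.read().splitlines()
print(lines)
```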

## User Defined Function(s)

It is recommended that any UDFs are defined outside the scope of the *jupyter notebook*, such that development/editing of functions can be done more practically. As per common *programming guidelines*, a [`src`](https://fileinfo.com/extension/src) file/directory is beneficial for code development and/or production release. However, by default a *jupyter notebook* does not pick up changes to an already-imported code file until the *kernel is restarted*; for this reason, frequently changing functions can be defined in this section.

**Getting Started** with **`PYTHONPATH`**

One must know what [environment variables](https://medium.com/chingu/an-introduction-to-environment-variables-and-how-to-use-them-f602f66d15fa) are and how to access them in the programming language of choice. Note that environment variable names are *case sensitive* on macOS/*nix systems, but not on windows. Generally, an environment variable can be accessed from the terminal/shell/command prompt as:

```shell
# macOS/*nix
echo $VARNAME

# windows
echo %VARNAME%
```
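
From within python, the same lookup goes through `os.environ`. The sketch below is only illustrative: `VARNAME` is the placeholder name from the shell example, and `PYTHONPATH` may well be unset on a given machine, in which case the list is simply empty.

```python
import os

# `.get` returns the given default instead of raising `KeyError` when unset
python_path = os.environ.get("PYTHONPATH", "")

# several directories are joined by `os.pathsep` (":" on macOS/*nix, ";" on windows)
directories = [path for path in python_path.split(os.pathsep) if path]
print(directories)

# set a variable for this process (and its children) only
os.environ["VARNAME"] = "demo"
print(os.environ["VARNAME"])
```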

Next, set up your system with [`PYTHONPATH`](https://bic-berkeley.github.io/psych-214-fall-2016/using_pythonpath.html). As per the [*python documentation*](https://docs.python.org/3/using/cmdline.html#envvar-PYTHONPATH), it lists additional directories in which an `import` statement looks for modules, searched in the order they are listed. If a source code/module is not available, check the necessary environment variables and/or ask the administrator for the source files.

For testing purposes, this boilerplate uses the `src`, `utilities` and `config` directories. These directories are available at the `ROOT` level, so `sys.path.append()` is used to add them to the import search path.

In [21]:
logging.info("Appending `src/*` to Notebook-Path. Import any modules from `../src/*` to be used here.")
logging.info("Appending `utilities/*` to Notebook-Path. Import any modules from `../utilities/*` to be used here.")

In [15]:
sys.path.append(join(ROOT, "src")) # source files
sys.path.append(join(ROOT, "src", "agents")) # agents for an efficient rl model
sys.path.append(join(ROOT, "src", "engine")) # ai/ml code engines for modelling
sys.path.append(join(ROOT, "src", "models")) # definition of actual ai/ml model
sys.path.append(join(ROOT, "utilities")) # provide a list of utility functions
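
The effect of `sys.path.append()` can be seen in isolation with the sketch below. A temporary directory stands in for `../src`, and `demo_helpers` is a hypothetical module written only for this demonstration.

```python
import importlib
import sys
import tempfile
from os.path import join

src_dir = tempfile.mkdtemp()  # stand-in for `../src`

# hypothetical module, created only for this demonstration
with open(join(src_dir, "demo_helpers.py"), "w") as fh:
    fh.write("def double(x):\n    return 2 * x\n")

if src_dir not in sys.path:  # avoid duplicate entries when the cell is re-run
    sys.path.append(src_dir)

# the file was created after the interpreter started, so refresh the finder caches
importlib.invalidate_caches()

import demo_helpers

print(demo_helpers.double(21))
```

The membership check is optional, but it keeps `sys.path` from growing each time the cell is executed.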

In [35]:
def get_files(path : str) -> list:
    """Get a list of file names in a given directory and its sub-directories."""

    files = []
    for (_, _, filenames) in walk(path):
        files.extend(filenames)
        
    return files

In [36]:
# get a list of python code files available in directory
# this can be used for logging purpose, or can also be used to quickly find a file
__src_files__ = [file for file in get_files(join(ROOT, "src")) if file.endswith(".py")]
__utl_files__ = [file for file in get_files(join(ROOT, "utilities")) if file.endswith(".py")]
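
A variant that keeps the directory of each file may be handy when the goal is to locate a module quickly. The sketch below is an assumption about how that could look, not part of the boilerplate: `get_python_files` is a hypothetical helper, and the temporary tree exists only for the demonstration. A suffix test with `str.endswith(".py")` matches only python files, whereas a bare two-character comparison would also match a name such as `copy`.

```python
import tempfile
from os import makedirs, walk
from os.path import join


def get_python_files(path):
    """Return the path of every `*.py` file under `path`, sorted."""
    found = []
    for dirpath, _, filenames in walk(path):
        found.extend(join(dirpath, name) for name in filenames if name.endswith(".py"))
    return sorted(found)


# build a small temporary tree: two python files, one text file, one file named `copy`
root = tempfile.mkdtemp()
makedirs(join(root, "engine"))
for name in ("a.py", "notes.txt", "copy", join("engine", "b.py")):
    open(join(root, name), "w").close()

files = get_python_files(root)
print([path[len(root) + 1:] for path in files])
```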

In [38]:
logging.debug(f"`src/*` is now available to import. Available modules = {len(__src_files__)}")
logging.debug(f"`utilities/*` is now available to import. Available modules = {len(__utl_files__)}")

## Read Input File(s)

A typical machine learning project revolves around six important stages (as described in the [Amazon ML Life Cycle Documentation](https://docs.aws.amazon.com/wellarchitected/latest/machine-learning-lens/well-architected-machine-learning-lifecycle.html)). The notebook boilerplate is provided to address two of them:

 1. **Data Processing:** An integral part of any machine learning project, and often the most time-consuming step! A brief introduction and best practices are available [here](https://towardsdatascience.com/introduction-to-data-preprocessing-in-machine-learning-a9fa83a5dc9d).
 2. **Model Development:** From understanding to deployment, this section addresses the development (training, validating and testing) of a machine learning model.

![ML Life Cycle](https://docs.aws.amazon.com/wellarchitected/latest/machine-learning-lens/images/ml-lifecycle.png)