<h1 align = "center">TPS May 2022</h1>

---


**Objective:** The May 2022 edition of the Tabular Playground Series (TPS) is a binary classification problem that includes a number of different feature interactions. The competition is an opportunity to explore various methods for identifying and exploiting these feature interactions ([competition](https://www.kaggle.com/competitions/tabular-playground-series-may-2022/overview)). This notebook provides the code I used for TPS May 2022. My initial plan is to use a powerful *neural network* for binary classification, *predicting states of manufacturing control data*.

## Code Imports

**PEP8 Style Guide** lists out the following [*guidelines*](https://peps.python.org/pep-0008/#imports) for imports:
 1. Imports should be on separate lines,
 2. Import order should be:
    * standard library/modules,
    * related third party imports,
    * local application/user defined imports
 3. Wildcard import (`*`) should be avoided, else specifically tagged with **`# noqa: F403`** as per `flake8` [(ignoring errors)](https://flake8.pycqa.org/en/3.1.1/user/ignoring-errors.html).
 4. Prefer absolute imports; explicit relative imports are an acceptable alternative, while implicit relative imports should never be used.
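
As a minimal sketch of the grouping described above, the block below uses only standard-library modules so that it runs anywhere; the third-party and local lines are commented out and are shown purely to illustrate the ordering.

```python
# Group 1: standard library, one import statement per line.
import sys
from os.path import basename, join  # explicit names instead of `from os.path import *`

# Group 2: related third-party packages (commented out so the sketch runs anywhere).
# import pandas as pd

# Group 3: local application imports (module name taken from this notebook).
# from dfutils import reduce_mem_usage

csv_path = join("data", "processed", "train.csv")
print(basename(csv_path))            # train.csv
print(sys.version_info.major >= 3)   # True
```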

In [1]:
import sys # append additional directories to list

In [2]:
from time import ctime # print computer time in human readable format
from os import makedirs # create directories dynamically
from os.path import join # join multiple path components in an os-independent way
# from copy import deepcopy # make an independent copy of mutable objects like `list`
# from tqdm import tqdm as TQ # for displaying a nice progress bar while running
# from uuid import uuid1 as UUID # for generating unique identifier
from datetime import datetime as dt # formatting/defining datetime objects

[**`logging`**](https://docs.python.org/3/howto/logging.html) is a standard Python module for tracking events that occur while code runs, which makes it useful for debugging and other purposes. The next section configures `logging` to write to the **`../logs/`** directory. Modify the **`LOGS_DIR`** variable under *Global Arguments* to change the default directory. The configuration is deliberately simple: any `print()` statement can be replaced with `logging.LEVEL_NAME()` and the code will keep working. Use logging operations like:

```python
logging.debug("This is a Debug Message.")
logging.info("This is an Information Message.")
logging.warning("This is a Warning Message.")
logging.error("This is an ERROR Message.")
logging.critical("This is a CRITICAL Message.")
```

Note: some directories related to logging are created by default. This can be changed in the following configuration section.
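
The `print()` to `logging` swap can be sketched as follows. This is an illustration under stated assumptions rather than the notebook's own configuration: the log file lives in a temporary directory (the notebook uses `LOGS_DIR`), the format string is shortened, and `force=True` (Python 3.8+) is added so that the example works even if logging was already configured in the session.

```python
import logging
import tempfile
from os.path import join

# Hypothetical log location for this demo; the notebook writes to LOGS_DIR.
log_file = join(tempfile.mkdtemp(), "example.log")

logging.basicConfig(
    filename=log_file,
    filemode="a",  # append if the file already exists
    format="%(asctime)s - %(levelname)s - %(message)s",
    level=logging.DEBUG,
    force=True,  # replace any handlers configured earlier (Python 3.8+)
)

rows = 5
# before: print(f"Loaded {rows} rows")
logging.info("Loaded %d rows", rows)
logging.warning("This is a Warning Message.")

logging.shutdown()  # flush and close the file handler before reading
with open(log_file) as fh:
    log_text = fh.read()
print(log_text)
```

Each call appends one timestamped line to the file, so the record survives a kernel restart, unlike cell output.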

In [3]:
import logging # configure logging on `global arguments` section, as file path is required

In [4]:
# import numpy as np
import pandas as pd
# import seaborn as sns
# import matplotlib.pyplot as plt

%precision 3
# %matplotlib inline
# sns.set_style('whitegrid');
# plt.style.use('default-style');
pd.set_option('display.max_rows', 50)
pd.set_option('display.max_columns', 35)
# np.set_printoptions(precision = 3, threshold = 15)
pd.options.display.float_format = '{:,.2f}'.format

## Global Argument(s)

The global arguments are *notebook* specific; however, they may also be extended to external libraries and functions on import. The *boilerplate* provides a basic ML directory structure with a directory for `data` and a separate directory for `output`. In addition, a separate directory (`data/processed`) is created to save the processed dataset so that preprocessing does not have to be repeated.

In [5]:
ROOT = ".." # the document root is one level up, that contains all code structure
DATA = join(ROOT, "data") # the directory contains all data files, subdirectory (if any) can also be used/defined

# processed data directory can be used so that the preprocessing steps do not
# have to be run again on every kernel restart
PROCESSED_DATA = join(DATA, "processed")

In [6]:
# long projects can be overwhelming, and keeping track of files, outputs and
# saved models can be challenging! to help with this, `today` can be used. for
# instance, output can be stored at `output/<today>/` etc.
# `today` is formatted so that it is a valid windows/*nix file/directory name
today = dt.strftime(dt.strptime(ctime(), "%a %b %d %H:%M:%S %Y"), "%a, %b %d %Y")
print(f"Code Execution Started on: {today}") # only date, name of the sub-directory

Code Execution Started on: Sat, May 07 2022


In [7]:
OUTPUT_DIR = join(ROOT, "output", today)
makedirs(OUTPUT_DIR, exist_ok = True) # create dir if not exist

# also create directory for `logs`
LOGS_DIR = join(ROOT, "logs", "TPS May 2022")
makedirs(LOGS_DIR, exist_ok = True)

In [8]:
print(LOGS_DIR) # logs file will be generated here

..\logs\TPS May 2022


In [9]:
logging.basicConfig(
    filename = join(LOGS_DIR, f"{today}.log"), # one log file per day, named after `today`
    filemode = "a", # append logs to existing file, if file exists
    format = "%(asctime)s - %(name)s - CLASS:%(levelname)s:%(levelno)s:L#%(lineno)d - %(message)s",
    level = logging.DEBUG
)

In [10]:
logging.info("Initializing Kernel, Global Arguments Configured....")
logging.info(f"Data Directory: {DATA}")
logging.info(f"Output Data Directory: {OUTPUT_DIR}")

## User Defined Function(s)

It is recommended that UDFs are defined outside the *jupyter notebook* so that functions can be developed and edited more practically. As per *programming guidelines*, a [`src`](https://fileinfo.com/extension/src) file/directory is beneficial for code development and/or production release. However, a running *jupyter notebook* kernel does not pick up changes to an imported code file on disk unless the module is reloaded or the kernel is restarted; for this reason, frequently changing functions can be defined in this section.
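
As one possible alternative to a full kernel restart, the standard library's `importlib.reload()` re-executes a module that has already been imported. The sketch below is illustrative only: `demo_udf` is a hypothetical module written to a temporary directory, not part of this project.

```python
import importlib
import sys
import tempfile
from os.path import join

# Hypothetical module created only for this demonstration.
src_dir = tempfile.mkdtemp()
module_path = join(src_dir, "demo_udf.py")

with open(module_path, "w") as fh:
    fh.write("def scale(x):\n    return x * 2\n")

sys.path.append(src_dir)       # make the temporary directory importable
importlib.invalidate_caches()  # the file was created after the interpreter started
import demo_udf

first = demo_udf.scale(10)

# Edit the file on disk, as one would in an external editor.
with open(module_path, "w") as fh:
    fh.write("def scale(x):\n    return x * 100\n")

importlib.reload(demo_udf)     # re-run the module source in the same kernel
second = demo_udf.scale(10)
print(first, second)           # 20 1000
```

Note that names bound with `from module import name` before the reload still point at the old objects and have to be imported again.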

**Getting Started** with **`PYTHONPATH`**

It helps to know what an [Environment Variable](https://medium.com/chingu/an-introduction-to-environment-variables-and-how-to-use-them-f602f66d15fa) is and how to read one in your choice of programming language. Note that environment variable names are *case sensitive* on all operating systems except Windows. Generally, we can access environment variables from the terminal/shell/command prompt as:

```shell
# macOS/*nix
echo $VARNAME

# windows
echo %VARNAME%
```
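
From Python, the same lookup goes through `os.environ`. In the short sketch below, `DEMO_DATA_DIR` and `DEMO_MISSING` are hypothetical variable names used only for illustration.

```python
import os

# Hypothetical variable, set here only so that the lookup succeeds.
os.environ["DEMO_DATA_DIR"] = "/tmp/data"

data_dir = os.environ["DEMO_DATA_DIR"]            # raises KeyError if the variable is unset
fallback = os.environ.get("DEMO_MISSING", "n/a")  # returns a default instead of raising
print(data_dir, fallback)
```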

[`PYTHONPATH`](https://bic-berkeley.github.io/psych-214-fall-2016/using_pythonpath.html) is, as per the [*python documentation*](https://docs.python.org/3/using/cmdline.html#envvar-PYTHONPATH), an environment variable that lists additional directories in which `import` statements look for modules, searched in the order they are listed. If a source file/module cannot be found, check the necessary environment variables and/or ask the administrator for the source files.
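
To see the effect of `PYTHONPATH`, one can start a child interpreter with the variable set and inspect its `sys.path`. This is a sketch under the assumption that a temporary directory stands in for a real source directory.

```python
import json
import os
import subprocess
import sys
import tempfile

# Hypothetical source directory; any existing directory would do.
extra_dir = os.path.realpath(tempfile.mkdtemp())

# Copy the current environment and add PYTHONPATH for the child process only.
child_env = dict(os.environ, PYTHONPATH=extra_dir)
result = subprocess.run(
    [sys.executable, "-c", "import json, sys; print(json.dumps(sys.path))"],
    env=child_env,
    capture_output=True,
    text=True,
    check=True,
)

# Normalise the entries so that symlinked temp directories compare equal.
child_sys_path = [os.path.realpath(p) for p in json.loads(result.stdout) if p]
print(extra_dir in child_sys_path)
```

Unlike `sys.path.append()`, which only affects the running interpreter, the variable applies to every Python process started from that environment.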

For testing purposes, the project uses `src`, `utils` and `config` directories. These directories live at the `ROOT` level rather than on the default search path, so `sys.path.append()` is used to make them importable.

In [11]:
sys.path.append(join(ROOT, "src")) # source files
sys.path.append(join(ROOT, "src", "agents")) # agents for an efficient rl model
sys.path.append(join(ROOT, "src", "engine")) # ai/ml code engines for modelling
sys.path.append(join(ROOT, "src", "models")) # definition of actual ai/ml model
sys.path.append(join(ROOT, "utilities")) # provide a list of utility functions

In [12]:
from dfutils import reduce_mem_usage

In [13]:
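def read_file(filepath : str, reduce_memory : bool = True) -> pd.DataFrame:
    """Read a competition CSV file, index it by `id` and drop `f_27`.

    If `reduce_memory` is set, memory usage is reduced with
    `reduce_mem_usage` from the local `dfutils` module.
    """
    data = pd.read_csv(filepath) # read csv file as is
    
    # do some pre-processing on the data-frame
    data.set_index("id", inplace = True) # use the `id` column as index
    data.drop(columns = ["f_27"], inplace = True) # drop the object (non-numeric) feature
    
    if reduce_memory:
        data = reduce_mem_usage(data, subset = None, copy = False)
    
    return data.copy() # `DataFrame.copy()` is a deep copy by default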
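
The source of `reduce_mem_usage` is not shown in this notebook. As a rough sketch of what such a helper might do, the hypothetical `downcast_numeric` below shrinks numeric columns with `pd.to_numeric(..., downcast=...)`; it is an assumption for illustration, not the `dfutils` implementation, and it runs on a small synthetic frame rather than the competition data.

```python
import numpy as np
import pandas as pd


def downcast_numeric(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of `df` with numeric columns stored in smaller dtypes."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_integer_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast="integer")
        elif pd.api.types.is_float_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast="float")
    return out


# Synthetic stand-in for the competition data.
demo = pd.DataFrame({
    "f_int": np.arange(1000, dtype="int64") % 10,
    "f_float": np.linspace(0.0, 1.0, 1000),
})
small = downcast_numeric(demo)

before = demo.memory_usage(deep=True).sum()
after = small.memory_usage(deep=True).sum()
print(f"{before} bytes -> {after} bytes")
print(small.dtypes.to_dict())
```

Downcasting floats trades precision for memory, so it is worth checking that the model is not sensitive to the rounding.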

## Read Input File(s)

A typical machine learning project revolves around six important stages (as described in the [Amazon ML Life Cycle Documentation](https://docs.aws.amazon.com/wellarchitected/latest/machine-learning-lens/well-architected-machine-learning-lifecycle.html)). The notebook boilerplate addresses two of them:

 1. **Data Processing:** An integral part of any machine learning project, and often the most time-consuming step! A brief introduction and best practices are available [here](https://towardsdatascience.com/introduction-to-data-preprocessing-in-machine-learning-a9fa83a5dc9d).
 2. **Model Development:** From understanding to deployment, this section addresses the development (training, validating and testing) of a machine learning model.

![ML Life Cycle](https://docs.aws.amazon.com/wellarchitected/latest/machine-learning-lens/images/ml-lifecycle.png)

In [20]:
trn_data = read_file(join(DATA, "train.csv"))
tst_data = read_file(join(DATA, "test.csv"))

Initial Memory: 219.73 MB


Reducing Memory Usage: 100%|██████████| 31/31 [00:01<00:00, 20.77it/s]

Final Memory (after optimization): 57.95 MB
  > Decreased by 73.63%





### Basic Data Information

Training and testing data are read into two variables, `trn_data` and `tst_data`, respectively. Let's use basic `pandas` functionality to inspect the data. The target is a binary column named `target`, and the data has a total of 31 features named `f_00, f_01, ..., f_30`. Among these, `f_27` holds object data and is simply dropped in the `read_file()` function. The `id` column is set as the index for both the training and testing sets, since the IDs are required when submitting the result.

In [16]:
trn_data.sample(5)

| id | f_00 | f_01 | f_02 | f_03 | f_04 | f_05 | f_06 | f_07 | f_08 | f_09 | f_10 | f_11 | f_12 | f_13 | f_14 | f_15 | f_16 | f_17 | f_18 | f_19 | f_20 | f_21 | f_22 | f_23 | f_24 | f_25 | f_26 | f_28 | f_29 | f_30 | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 403900 | 1.7 | 0.52 | 0.11 | 0.7 | -1.04 | 0.48 | -2.29 | 4 | 3 | 5 | 1 | 0 | 8 | 1 | 4 | 3 | 5 | 1 | 2 | 0.62 | -3.33 | 2.82 | -4.77 | -4.14 | -1.87 | 0.31 | 0.34 | 175.5 | 1 | 2 | 0 |
| 141698 | 1.21 | 0.44 | 1.47 | -0.98 | -0.32 | -0.98 | -0.59 | 1 | 2 | 3 | 4 | 1 | 0 | 3 | 0 | 4 | 0 | 1 | 1 | 4.08 | -2.38 | -0.54 | -2.9 | -0.26 | -1.2 | -1.52 | -1.61 | 87.94 | 0 | 1 | 0 |
| 365948 | -0.54 | -1.13 | 2.83 | 1.71 | 0.82 | 0.88 | -0.79 | 3 | 3 | 0 | 0 | 2 | 1 | 1 | 1 | 1 | 5 | 3 | 1 | 0.41 | -0.79 | 0.48 | -2.69 | 2.19 | 1.47 | 2.34 | 1.36 | 473.5 | 0 | 1 | 0 |
| 204430 | -1.39 | 0.75 | -1.31 | 1.28 | 0.05 | -0.43 | 0.75 | 1 | 1 | 2 | 2 | 1 | 3 | 4 | 0 | 3 | 1 | 1 | 0 | -1.37 | -0.86 | 4.35 | -0.58 | 2.32 | 2.29 | -0.94 | -0.55 | -284.75 | 1 | 1 | 1 |
| 695186 | -1.39 | 0.63 | -1.25 | -0.54 | 0.01 | -1.59 | 0.29 | 0 | 2 | 1 | 3 | 2 | 2 | 0 | 5 | 1 | 3 | 0 | 4 | 1.35 | -2.91 | -5.34 | 4.21 | -3.92 | 0.09 | 1.62 | -5.99 | -378.25 | 1 | 2 | 0 |
