# Template ML Notebook

This notebook is a template for pre-processing data and for training and evaluating models. The goal is to standardize the model development process so that it is more scalable and robust.

## Setup & Prerequisites
* Create and copy a [GitHub Personal Access Token](https://docs.github.com/en/enterprise-server@3.8/authentication/keeping-your-account-and-data-secure/managing-your-personal-access-tokens#creating-a-personal-access-token) and use it to set the `GITHUB_PERSONAL_ACCESS_TOKEN` variable
* If using Google Colab, create secrets for the following variables:
    * GITHUB_PERSONAL_ACCESS_TOKEN
    * CLIENT_SECRET
* Ensure that a Google Cloud account is set up (needed to store models and run queries)
* Ensure all of the libraries are installed by running `!pip install git+https://{GITHUB_PERSONAL_ACCESS_TOKEN}@github.com/autocase/joulesAI.git@{LIBS_VERSION}`
    **RECOMMENDED**: If running locally, create and use a virtual environment to prevent dependency conflicts:
    ```
    python -m venv <venv>
    source <venv>/bin/activate
    ``` 
    Select the virtual environment as the kernel for the template notebook.
* To reinstall the library, run:
    ```
    pip install pef
    pef libs -y
    ```
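
The environment cell in this notebook reads secrets through `google.colab.userdata`, which is only importable inside Colab. When running locally, one option is a small fallback that reads the same names from environment variables. The sketch below is an assumption about how this could be done; `get_secret` is a hypothetical helper and is not part of the template's `libs` package.

```python
import os

try:
    # Only importable inside Google Colab.
    from google.colab import userdata
except ImportError:
    userdata = None


def get_secret(name):
    """Return a secret from Colab if available, else from the environment.

    Hypothetical helper for illustration; not part of the template's libs.
    """
    if userdata is not None:
        try:
            return userdata.get(name)
        except Exception:
            # Secret missing or notebook access not granted; fall back below.
            pass
    value = os.environ.get(name)
    if value is None:
        raise KeyError(f"Secret {name!r} not found in Colab secrets or environment")
    return value
```

With this in place, `GITHUB_PERSONAL_ACCESS_TOKEN = get_secret('GITHUB_PERSONAL_ACCESS_TOKEN')` would behave the same in Colab and in a local kernel, provided the variable is exported in the local shell.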

## High-Level Steps in the Template

1. Import required libraries & set appropriate environment variables
2. Get training data
3. Perform feature engineering
4. Balance the dataset
5. Train & Evaluate a model
6. Perform grid search to pick best params
7. Retrain the model with best params
8. Perform error analysis

Below are more in-depth explanations of some of the steps.

## Feature Engineering

The template applies several feature engineering steps:
* Date columns get additional features such as `Month`, `year`, `Is_year_end`, etc.
* Numerical features are standardized
* Categorical features are one-hot encoded
* Building-specific features such as CFA, TWR, etc. are added
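
The first three steps can be illustrated with a dependency-free sketch. This is not the code in `libs/preprocess.py`; the function names and output column names below are illustrative assumptions, and the real library may use different names and a different standardization convention.

```python
import calendar
from datetime import date
from statistics import mean, pstdev


def date_features(d):
    """Expand a date into simple calendar features."""
    last_day = calendar.monthrange(d.year, d.month)[1]
    return {
        "Year": d.year,
        "Month": d.month,
        "Is_month_end": d.day == last_day,
        "Is_year_end": d.month == 12 and d.day == 31,
    }


def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def one_hot(values):
    """One-hot encode categories, one 0/1 column per distinct category."""
    categories = sorted(set(values))
    return [{f"is_{c}": int(v == c) for c in categories} for v in values]
```

For example, `one_hot(["5A", "5B", "5A"])` yields one `is_5A` and one `is_5B` indicator per row. In practice the statistics used for standardization should be computed on the training split only and then reused on the test split.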


## Balancing a dataset

If the target variable is unbalanced, the notebook downsamples the dataset so that over-represented target values do not dominate training. This is intended to improve model performance and robustness.
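
For a continuous target, one common way to downsample is to bin the target values and keep the same number of rows from every bin. The sketch below shows that idea under stated assumptions: `downsample_by_bin` and its `bin_width` parameter are hypothetical, and the template's own balancing function may work differently.

```python
import random
from collections import defaultdict


def downsample_by_bin(rows, target, bin_width, seed=0):
    """Bin rows by target value and keep an equal number of rows per bin."""
    bins = defaultdict(list)
    for row in rows:
        bins[int(row[target] // bin_width)].append(row)
    # Every bin is reduced to the size of the smallest bin.
    smallest = min(len(members) for members in bins.values())
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    balanced = []
    for key in sorted(bins):
        balanced.extend(rng.sample(bins[key], smallest))
    return balanced
```

The trade-off is that rows are discarded: a very sparse bin forces every other bin down to its size, so the bin width has to be chosen with the data volume in mind.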

## Training & Evaluating Models

The notebook offers a standardized way to train a model. By default, it trains a random forest regressor with a set of default parameters.

Model evaluation is also standardized. By default, the model is evaluated against a set of default metrics (for regression, these include RMSE, etc.).
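
As a reference for what such metrics measure, here is a minimal sketch of three common regression metrics. Only RMSE is named in this notebook; including MAE and R² is an assumption, and `regression_metrics` is a hypothetical helper rather than the `evaluate` function from `libs/train.py`.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute RMSE, MAE and R^2 for two equal-length sequences."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")
    errors = [p - t for t, p in zip(y_true, y_pred)]
    n = len(errors)
    ss_res = sum(e * e for e in errors)  # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    return {
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(e) for e in errors) / n,
        # R^2 is undefined when the true values are constant.
        "r2": 1 - ss_res / ss_tot if ss_tot else float("nan"),
    }
```

RMSE penalizes large errors more heavily than MAE, so comparing the two gives a quick indication of whether a few outlying predictions drive the error.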

## Grid Search

To find the best parameters for the model, the notebook also performs a grid search. A set of default parameters is defined, but these can be customized as shown in the notebook example.
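
To get a sense of how many candidate models a search space contains, the grid can be expanded into every combination of values. The sketch below does this with the standard library only; `expand_grid` is a hypothetical helper, and the parameter values are a subset of the ones used in the grid search cell of this notebook.

```python
from itertools import product


def expand_grid(param_grid):
    """Return every combination of parameter values as a list of dicts."""
    names = list(param_grid)
    value_lists = [param_grid[name] for name in names]
    return [dict(zip(names, combo)) for combo in product(*value_lists)]


param_grid = {
    "n_estimators": [10, 100],
    "max_depth": [100],
    "bootstrap": [True, False],
}
candidates = expand_grid(param_grid)
print(len(candidates))  # 2 * 1 * 2 = 4 combinations
```

Each candidate is trained once per cross-validation fold, so the total number of fits is the number of candidates multiplied by `cv`. The search parameters in this notebook also include `param_distributions` and `n_iter`, names that resemble the arguments of scikit-learn's `RandomizedSearchCV`; whether `grid_search` samples candidates rather than trying all of them is an assumption to check in `libs/train.py`.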

## Notebook Configuration

If a different model needs to be run, a different transformation applied, or the grid search parameters updated, all of the configuration is stored in `libs/vars.py`.

`CONFIG` sets a few basic parameters and can be updated if a different type of model needs to be trained or the target variable name changes.

`DEFAULT_GRID_SEARCH` is the set of default parameters for grid search. It can be changed if the defaults need to be updated.

`TRANSFORM_FUNCTIONS` is the set of default functions that transform a given variable. It can be updated if additional transformation functions are needed.

`DEFAULT_RandomForestRegressor_HP` is the set of default parameters for the random forest regressor. It can be updated if different defaults are needed.
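
When overriding these settings inside a notebook, note that assigning a dictionary to a new name does not copy it, so edits made through the new name also change the shared defaults. The sketch below works on a copy instead. The `CONFIG` dictionary here is a hypothetical stand-in with only two keys, and `GradientBoostingRegressor` is used purely as an example value; whether the template's `train_model` supports it is an assumption.

```python
import copy

# Hypothetical stand-in for the CONFIG dictionary defined in libs/vars.py.
CONFIG = {
    "target": "total_eui_elec_gj_m2",
    "model": {"algorithm": "RandomForestRegressor"},
}

# deepcopy also copies nested dictionaries such as CONFIG["model"].
config = copy.deepcopy(CONFIG)
config["target"] = "total_eui_ng_gj_m2"
config["model"]["algorithm"] = "GradientBoostingRegressor"

print(CONFIG["model"]["algorithm"])  # the defaults are left untouched
```

Working on a copy keeps the defaults intact, which matters when several experiments run in the same kernel session.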

# Libraries & Environment Variables

In [None]:
'''
Setting environment variables that are required to run this template. Update in case params change
'''
import os
from google.colab import userdata

os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "sa_key_file.json"
os.environ["GOOGLE_PROJECT"] = "autocase-201317"
os.environ["BQ_DATASET"] = "eplus_simulations"
os.environ["MLFLOW_TRACKING_URL"] = "https://mlflow.autocase.dev"

# GITHUB_PERSONAL_ACCESS_TOKEN = 'ghp_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'
# CLIENT_SECRET = 'xxxxxxxxxxxxxxxxxxxxx'

GITHUB_PERSONAL_ACCESS_TOKEN = userdata.get('GITHUB_PERSONAL_ACCESS_TOKEN')
CLIENT_SECRET = userdata.get('CLIENT_SECRET')

LIBS_VERSION = 'ds-libs-and-template'

In [None]:
'''
Install packages that are required to run this template.
'''

!pip install git+https://{GITHUB_PERSONAL_ACCESS_TOKEN}@github.com/autocase/joulesAI.git@{LIBS_VERSION}


In [None]:
'''
Importing all of the internal libraries from libs folder
'''

import libs.connect 
import libs.preprocess 
import libs.train 
import libs.predict 
import libs.vars 
import libs.mlflow_token

from libs.connect import * # contains functions to connect to Google Cloud & get simulation data/jobs
from libs.preprocess import * # contains functions for data preprocessing (creating additional features, combining datasets etc.)
from libs.train import * # contains functions for training & evaluating models
from libs.predict import * # contains functions for running inference on models
from libs.vars import * # contains common parameter configurations for grid search, random forest etc.
from libs.mlflow_token import * # contains library to fetch MLFLOW Auth token and save to env var.

config = CONFIG

In [None]:
# Get token for the MLflow server
# Follow link generated by this cell, and manually authenticate to MLFlow
# Token will be saved to env

tracking_url = os.getenv('MLFLOW_TRACKING_URL')
client_secret = CLIENT_SECRET

setup_mlflow_environment(tracking_url, client_secret)

# Baseline Model

### List Simulation Jobs

In [None]:
'''
Get all of the current simulation jobs from Google's task manager
'''

get_simulation_jobs()

### Load Data

In [None]:
'''
Configure target variable, training data id and climate zones, get the simulation data and plot the target variable
'''

# configuring variables
config['target'] = "total_eui_elec_gj_m2"
config['training_data_simid'] = ['20240124-150716-L1H9L', '20240122-175354-ZTPMH']
config['climate_zones'] = ["5A", "5B", "5C", "6A", "6B"]

# getting simulation data
df = get_simulation_data(config['training_data_simid'], config['target'], config['climate_zones'])

# plotting distribution of the target variable
df[config['target']].hist(bins=30, rwidth=0.75)

### Add Combined Features

In [None]:
'''
Create new features to add to the training dataset (CFA, TWWR etc.) and plot the histogram of the target variable
'''

# create new features
df = getCombinedFeatures(df, toEnergyUsage = True, target=config['target'])

# plotting distribution of the target variable
df[config['target']].hist(bins=30, rwidth=0.75)

### Select Features

In [None]:
'''
select which columns to exclude from the feature set
'''

# select columns to remove
config['drop_cols'] = ["ESA", "AR", "HD", "ahu_burner_efficiency", "supply_air_temp_heating", "temp_setpoint_heating_occupied", "temp_setpoint_heating_setback", "total_eui_ng_gj_m2", "total_eui_elec_gj_m2", "total_eui_elec_gj_m2_ln", "id", "job_id", "sample_id"]

# select a list of features (without the removed columns), preserving column order;
# a set difference does not guarantee a stable order between runs
features = [col for col in df.columns if col not in config['drop_cols']]
features

### Split Dataset

In [None]:
'''
Splits dataset into training and test sets
'''

config['random_seed'] = 100 # set a random seed so the data splits are reproducible
X_train, X_test, y_train, y_test, feature_info = split_dataset(df, features, config['target'], config['random_seed']) # split data into training and testing
feature_info

### Handle NULLs

In [None]:
# replace missing values with 0 in both the features and the targets
X_train = X_train.fillna(0)
X_test = X_test.fillna(0)
y_train = y_train.fillna(0)
y_test = y_test.fillna(0)

### Train Model

In [None]:
# train the selected model

config['model']['algorithm'] = 'RandomForestRegressor' # specify a model 
model = train_model(config['model']['algorithm'], X_train, y_train) # train the model

### Evaluate Model

In [None]:
'''
Evaluate a model
'''

config['transformation'] = '' # specify transformation if needed
model_metrics, importances_df, y_pred = evaluate(model, X_test, y_test, transform_fx=config['transformation'], title=config['experiment_name']) # evaluate the model

### Grid Search

In [None]:
'''
Perform grid search to find the best hyperparameters for the model
'''

grid_search_params = {
    "param_distributions": {
        "n_estimators": [10,100],   # num of trees in forest
        "max_features": ['sqrt'],   # num features @ split
        "max_depth": [100],         # Max num of levels in tree
        "min_samples_split": [2],   # Min num samples to split node
        "min_samples_leaf": [1],    # Min num samples required @ leaf node
        "bootstrap": [True, False]  # sample selection method
    },
    "n_iter": 10, # number of iterations
    "cv": 2, # number of folds
    "verbose": 0, 
    "random_state": 42, # setting the seed
    "n_jobs": -1 # number of jobs to run in parallel. -1 means using all processors.
}

best_hyperparameters = grid_search(config['model']['algorithm'], X_train, y_train, grid_search_params) # perform grid search

best_hyperparameters


In [None]:
'''
Retrain & evaluate model with best hyperparameters
'''

model = train_model(config['model']['algorithm'], X_train, y_train, best_hyperparameters) # retrain model with best hyperparameters
model_metrics, importances_df, y_pred = evaluate(model, X_test, y_test, transform_fx=config['transformation'], title=config['experiment_name']) # evaluate model with best hyperparameters

### Error Analysis

Sample Plan
https://docs.google.com/spreadsheets/d/120glSDNi1COUMeu-B9KnbfzSkzKM7syY/edit?usp=sharing&ouid=103559666706832096026&rtpof=true&sd=true

Error Analysis
https://docs.google.com/spreadsheets/d/18IqPK4H9RoaX-aohsHqFegzqfQcKOjWkRWE_L_IG30I/edit?usp=sharing

In [None]:
'''
Perform error analysis
'''

config['error_analysis_simid'] = ['20240119-221157-QW06O']

error_analysis(config['error_analysis_simid'], config['target'], model, feature_info)

### Save Experiment to MLflow

In [None]:
'''
Save the selected model to MLflow
'''

save_experiment(model, feature_info, model_metrics, config)