<a href="https://colab.research.google.com/github/WideSu/Python-for-DS/blob/main/HyperParam_Tuning_Methods(Main).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

TO-DO
- [x] Measure the average time usage and RMSE for each epoch using scikit-learn random search
- [ ] Test TPE hyperparameter tuning with HyperOpt, Ray, and Optuna
- [ ] Plot the RMSE over time
- [ ] Try the different samplers in Optuna: Random, TPE, CMA-ES, NSGA-II

The outcome:
- A table summarizing the average RMSE and execution time for each hyperparameter tuning method

|HPO Package                                  |Avg RMSE                        |Avg Time Elapsed (s)                                         |
|---------------------------------------------|--------------------------------|-------------------------------------------------------------|
|Scikit-learn                                 |                                |                                                             |
|HyperOpt                                     |                                |                                                             |
|Ray                                          |                                |                                                             |
|Optuna                                       |                                |                                                             |

- One time series plot

<img src="https://user-images.githubusercontent.com/44923423/171923215-292e776a-79aa-4a08-8e81-a2ef627bd42a.png" data-canonical-src="https://user-images.githubusercontent.com/44923423/171923215-292e776a-79aa-4a08-8e81-a2ef627bd42a.png" width="500" height="300" />


|Library|Pros|Cons|Scenario|
|-|-|-|-|
|Scikit-learn|Flexible and basic|Only two basic methods (grid/random); newer methods are not yet stable|Traditional tuning|
|HyperOpt|Fast and flexible; newer search methods: TPE/ATPE|Outdated interface|Time-limited tuning|
|Ray|Systematic and well wrapped|Heavily customized and not flexible; slow to initialize|Fast development and deployment with various tuning methods|
|Optuna|Performs well and is lightweight; includes all popular and stable tuning methods|Not all methods are well wrapped|When accuracy and flexibility are required|


In [None]:
# @title Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')
%cd /content/drive/MyDrive/HPO/

Mounted at /content/drive
/content/drive/MyDrive/HPO


In [None]:
# @title Install and import packages
# The PyPI package is named python-dateutil; `pip install dateutil` fails
! pip install python-dateutil
! pip install lightgbm
! pip install optuna
import pandas as pd
import dateutil.relativedelta  # `import dateutil` alone does not guarantee the submodule is loaded
import datetime
import optuna
from tqdm import tqdm, trange
from lightgbm import LGBMRegressor
import sklearn.metrics  # the cells below call sklearn.metrics.mean_squared_error
import math
import time

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting optuna
  Downloading optuna-2.10.0-py3-none-any.whl (308 kB)
     |████████████████████████████████| 308 kB 5.1 MB/s 
Collecting cmaes>=0.8.2
  Downloading cmaes-0.8.2-py3-none-any.whl (15 kB)
Collecting cliff
  Downloading cliff-3.10.1-py3-none-any.whl (81 kB)
     |████████████████████████████████| 81 kB 6.0 MB/s 
Collecting alembic
  Downloading alembic-1.8.0-py3-none-any.whl (209 kB)
     |████████████████████████████████| 209 kB 46.1 MB/s 
Collecting colorlog
  Downloading colorlog-6.6.0-py2.py3-none-any

In [None]:
# @title Read-in data and check data type and volume
df = pd.read_csv('./exp_data.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 74458 entries, 0 to 74457
Data columns (total 98 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   ticker           74458 non-null  object 
 1   permno           74458 non-null  int64  
 2   date             74458 non-null  object 
 3   ret              74458 non-null  float64
 4   absacc           74458 non-null  float64
 5   acc              74458 non-null  float64
 6   age              74458 non-null  float64
 7   agr              74458 non-null  float64
 8   baspread         74458 non-null  float64
 9   bm               74458 non-null  float64
 10  bm_ia            74458 non-null  float64
 11  cash             74458 non-null  float64
 12  cashdebt         74458 non-null  float64
 13  cashpr           74458 non-null  float64
 14  cfp              74458 non-null  float64
 15  cfp_ia           74458 non-null  float64
 16  chatoia          74458 non-null  float64
 17  chcsho      

In [None]:
# @title Convert the date column to datetime type
df[["date"]] = df[["date"]].apply(pd.to_datetime)

In [27]:
library_evaluation_df = {
    'Library' : [],
    'Train Start Date': [],
    'Train End Date': [],
    'Test Start Date': [],
    'Test End Date': [],
    'Smallest RMSE': [],
    'Time Ellipsed': []
}

n_trials = 15
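
Each library section below times its search and appends one row per backtest window to a dictionary shaped like `library_evaluation_df`. A minimal, library-free sketch of that bookkeeping follows. The names `record_run` and `fake_search` and the value `0.1` are hypothetical and only serve the illustration; `time.perf_counter` is swapped in for the notebook's `time.time` because it is monotonic.

```python
import time

def record_run(results, library, search_fn):
    """Time search_fn() and append its best RMSE and duration to results."""
    start = time.perf_counter()
    best_rmse = search_fn()  # search_fn is assumed to return the best RMSE found
    elapsed = time.perf_counter() - start
    results['Library'].append(library)
    results['Smallest RMSE'].append(best_rmse)
    results['Time Ellipsed'].append(elapsed)  # key spelled as in the notebook
    return elapsed

# Hypothetical stand-in for a real tuning run
def fake_search():
    return 0.1

results = {'Library': [], 'Smallest RMSE': [], 'Time Ellipsed': []}
record_run(results, 'Demo', fake_search)
print(results)
```

Passing the search as a callable keeps the timing code identical across Optuna, scikit-learn, and HyperOpt, so only the search itself differs between sections.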

In [None]:
# @title Optuna hyperparameter tuning
# Configuration 
train_timespan_months = 180
whole_period_months = 60
test_timespan_months = 1
first_end_time = datetime.datetime(2015, 12, 1)
feat_cols = ['absacc', 'acc', 'age', 'agr', 'baspread','bm', 'bm_ia',
             'cash', 'cashdebt', 'cashpr', 'cfp', 'cfp_ia', 'chatoia', 'chcsho', 'chempia', 'chinv', 'chmom',
             'chpmia', 'chtx', 'cinvest', 'convind', 'currat', 'depr', 'divi', 'divo', 'dolvol', 'dy', 
             'egr', 'ep', 'gma', 'grcapx', 'grltnoa', 'herf', 'hire', 'ill', 'indmom', 'invest', 'lev', 'lgr',
             'maxret', 'mom12m', 'mom1m', 'mom36m', 'mom6m', 'ms', 'mve_ia', 'mvel1', 'nincr', 'operprof',
             'orgcap', 'pchcapx_ia', 'pchcurrat', 'pchdepr', 'pchgm_pchsale', 'pchquick', 'pchsale_pchinvt',
             'pchsale_pchrect', 'pchsale_pchxsga', 'pchsaleinv', 'pctacc', 'ps', 'quick', 'rd', 'rd_mve',
             'rd_sale', 'realestate', 'retvol', 'roaq', 'roavol', 'roeq', 'roic', 'rsup', 'salecash', 'pricedelay',
             'saleinv', 'salerec', 'secured', 'securedind', 'sgr', 'sin', 'sp', 'std_dolvol', 'std_turn',
             'stdacc', 'stdcf', 'tang', 'tb', 'turn', 'zerotrade','aeavol','ear','beta','betasq','idiovol']
y_col = 'ret'
train_end_date = first_end_time
time_usage = []
score_list = []
timeline = []

# Evaluation details for each train and test timespan
evaluate_detail_df = {
    'Train Start Date': [],
    'Train End Date': [],
    'Test Start Date': [],
    'Test End Date': [],
    'Smallest RMSE': [],
    'Time Ellipsed': []
}
predict_times = 60
for period_time in trange(predict_times):
    train_start_date = train_end_date - dateutil.relativedelta.relativedelta(months=train_timespan_months)
    test_end_date = train_end_date + dateutil.relativedelta.relativedelta(months=test_timespan_months)
    print(train_start_date, train_end_date, test_end_date)
    train_data = df.query(f'"{train_start_date}" < date <= "{train_end_date}"')
    test_data = df.query(f'"{train_end_date}" < date <= "{test_end_date}"')
    X_train = train_data[feat_cols].values
    y_train = train_data[y_col].values
    X_test = test_data[feat_cols].values
    y_test = test_data[y_col].values.ravel()
    study = optuna.create_study(sampler=optuna.samplers.TPESampler())  # Create a new study.
    def objective(trial):
        param = {
        'n_estimators': trial.suggest_int('n_estimators', 50, 500),   
        'num_leaves': trial.suggest_int('num_leaves', 10, 512),
        'min_data_in_leaf': trial.suggest_int('min_data_in_leaf', 10, 80),
        'bagging_fraction': trial.suggest_float('bagging_fraction', 0.0, 1.0), # subsample
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.1),  # eta
        'lambda_l1': trial.suggest_float('lambda_l1', 0.01, 1),  # reg_alpha
        'lambda_l2': trial.suggest_float('lambda_l2', 0.01, 1), # reg_lambda
        }
        model = LGBMRegressor(seed=42, **param)
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        mse = sklearn.metrics.mean_squared_error(y_test, y_pred)
        rmse = math.sqrt(mse)
        return rmse  # An objective value linked with the Trial object.
    ts = time.time()
    study.optimize(objective, n_trials=n_trials)  # Invoke optimization of the objective function.
    te = time.time()
    exc_time = te-ts
    evaluate_detail_df['Smallest RMSE'].append(study.best_value)
    evaluate_detail_df['Time Ellipsed'].append(exc_time)
    evaluate_detail_df['Train Start Date'].append(train_start_date)
    evaluate_detail_df['Train End Date'].append(train_end_date)
    evaluate_detail_df['Test Start Date'].append(train_end_date+dateutil.relativedelta.relativedelta(months=1))
    evaluate_detail_df['Test End Date'].append(test_end_date)
    train_end_date += dateutil.relativedelta.relativedelta(months=1)
evaluate_detail_df = pd.DataFrame(evaluate_detail_df)

NameError: ignored
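
The loop above slides a 180-month training window forward one month at a time and tests on the following month. The window arithmetic can be sketched on its own with the standard library. `add_months` and `rolling_windows` are hypothetical helper names, and the sketch assumes every date falls on the first day of a month, as `first_end_time` does.

```python
import datetime

def add_months(d, months):
    """Shift a first-of-month datetime by a number of months (may be negative)."""
    total = d.year * 12 + (d.month - 1) + months
    return d.replace(year=total // 12, month=total % 12 + 1)

def rolling_windows(first_end, n_periods, train_months=180, test_months=1):
    """Yield (train_start, train_end, test_end) for each monthly step."""
    train_end = first_end
    for _ in range(n_periods):
        train_start = add_months(train_end, -train_months)
        test_end = add_months(train_end, test_months)
        yield train_start, train_end, test_end
        train_end = add_months(train_end, 1)

windows = list(rolling_windows(datetime.datetime(2015, 12, 1), 3))
for train_start, train_end, test_end in windows:
    print(train_start.date(), train_end.date(), test_end.date())
```

In the notebook, training rows satisfy `train_start < date <= train_end` and test rows satisfy `train_end < date <= test_end`, so the two sets never overlap.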

In [None]:
evaluate_detail_df

NameError: ignored

In [None]:
library_evaluation_df['Library'].extend(['Optuna' for _ in range(len(evaluate_detail_df))])
library_evaluation_df['Train Start Date'].extend(evaluate_detail_df['Train Start Date'])
library_evaluation_df['Train End Date'].extend(evaluate_detail_df['Train End Date'])
library_evaluation_df['Test Start Date'].extend(evaluate_detail_df['Test Start Date'])
library_evaluation_df['Test End Date'].extend(evaluate_detail_df['Test End Date'])
library_evaluation_df['Smallest RMSE'].extend(evaluate_detail_df['Smallest RMSE'])
library_evaluation_df['Time Ellipsed'].extend(evaluate_detail_df['Time Ellipsed'])

In [None]:
pd.DataFrame(library_evaluation_df)

| |Library|Train Start Date|Train End Date|Test Start Date|Test End Date|Smallest RMSE|Time Ellipsed|
|-|-|-|-|-|-|-|-|
|0|Optuna|2000-12-01|2015-12-01|2016-01-01|2016-01-01|0.120144|58.760128|
|1|Optuna|2001-01-01|2016-01-01|2016-02-01|2016-02-01|0.080194|92.129778|
|2|Optuna|2001-02-01|2016-02-01|2016-03-01|2016-03-01|0.103784|71.986675|
|3|Optuna|2001-03-01|2016-03-01|2016-04-01|2016-04-01|0.074915|27.287619|
|4|Optuna|2001-04-01|2016-04-01|2016-05-01|2016-05-01|0.058224|17.13049|
|5|Optuna|2001-05-01|2016-05-01|2016-06-01|2016-06-01|0.067073|10.279614|
|6|Optuna|2001-06-01|2016-06-01|2016-07-01|2016-07-01|0.068623|66.076798|
|7|Optuna|2001-07-01|2016-07-01|2016-08-01|2016-08-01|0.066195|18.987839|
|8|Optuna|2001-08-01|2016-08-01|2016-09-01|2016-09-01|0.053899|13.782289|
|9|Optuna|2001-09-01|2016-09-01|2016-10-01|2016-10-01|0.077368|8.68783|
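
The summary table at the top of the notebook needs one average RMSE and one average time per library. A minimal sketch of that aggregation, using only the standard library, is shown here. The `summarize` helper is a hypothetical name, and the three rows are copied from the Optuna output above purely as sample input.

```python
from collections import defaultdict
from statistics import mean

# First three Optuna rows from the output above: (library, rmse, seconds)
rows = [
    ('Optuna', 0.120144, 58.760128),
    ('Optuna', 0.080194, 92.129778),
    ('Optuna', 0.103784, 71.986675),
]

def summarize(rows):
    """Return {library: (avg_rmse, avg_seconds)} from (library, rmse, seconds) rows."""
    grouped = defaultdict(list)
    for library, rmse, seconds in rows:
        grouped[library].append((rmse, seconds))
    return {
        library: (mean(r for r, _ in vals), mean(s for _, s in vals))
        for library, vals in grouped.items()
    }

summary = summarize(rows)
for library, (avg_rmse, avg_seconds) in summary.items():
    # Emit one row in the same pipe-table layout as the summary table
    print(f'|{library}|{avg_rmse:.6f}|{avg_seconds:.2f}|')
```

With the full results in a DataFrame, a pandas `groupby('Library')` followed by `mean()` on the two numeric columns would likely give the same figures in one line.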


# Scikit-learn

Refer to [scikit-learn RandomizedSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html)

In [15]:
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import make_scorer
from scipy.stats import uniform,randint
# Configuration 
train_timespan_months = 180
whole_period_months = 60
test_timespan_months = 1
first_end_time = datetime.datetime(2015, 12, 1)
feat_cols = ['absacc', 'acc', 'age', 'agr', 'baspread','bm', 'bm_ia',
             'cash', 'cashdebt', 'cashpr', 'cfp', 'cfp_ia', 'chatoia', 'chcsho', 'chempia', 'chinv', 'chmom',
             'chpmia', 'chtx', 'cinvest', 'convind', 'currat', 'depr', 'divi', 'divo', 'dolvol', 'dy', 
             'egr', 'ep', 'gma', 'grcapx', 'grltnoa', 'herf', 'hire', 'ill', 'indmom', 'invest', 'lev', 'lgr',
             'maxret', 'mom12m', 'mom1m', 'mom36m', 'mom6m', 'ms', 'mve_ia', 'mvel1', 'nincr', 'operprof',
             'orgcap', 'pchcapx_ia', 'pchcurrat', 'pchdepr', 'pchgm_pchsale', 'pchquick', 'pchsale_pchinvt',
             'pchsale_pchrect', 'pchsale_pchxsga', 'pchsaleinv', 'pctacc', 'ps', 'quick', 'rd', 'rd_mve',
             'rd_sale', 'realestate', 'retvol', 'roaq', 'roavol', 'roeq', 'roic', 'rsup', 'salecash', 'pricedelay',
             'saleinv', 'salerec', 'secured', 'securedind', 'sgr', 'sin', 'sp', 'std_dolvol', 'std_turn',
             'stdacc', 'stdcf', 'tang', 'tb', 'turn', 'zerotrade','aeavol','ear','beta','betasq','idiovol']
y_col = 'ret'
train_end_date = first_end_time
time_usage = []
score_list = []
timeline = []

# Evaluation details for each train and test timespan
evaluate_detail_df = {
    'Train Start Date': [],
    'Train End Date': [],
    'Test Start Date': [],
    'Test End Date': [],
    'Smallest RMSE': [],
    'Time Ellipsed': []
}
predict_times = 60
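def rmse_score(y_true, y_pred):
    # Take the square root explicitly; the `squared=False` argument is deprecated in newer scikit-learn releases
    return math.sqrt(sklearn.metrics.mean_squared_error(y_true, y_pred))
# greater_is_better=False makes the scorer return the negated RMSE, so a larger score is better
my_scorer = make_scorer(rmse_score, greater_is_better=False)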
for period_time in trange(predict_times):
    train_start_date = train_end_date - dateutil.relativedelta.relativedelta(months=train_timespan_months)
    test_end_date = train_end_date + dateutil.relativedelta.relativedelta(months=test_timespan_months)
    # print(train_start_date, train_end_date, test_end_date)
    train_data = df.query(f'"{train_start_date}" < date <= "{train_end_date}"')
    test_data = df.query(f'"{train_end_date}" < date <= "{test_end_date}"')
    X_train = train_data[feat_cols].values
    y_train = train_data[y_col].values
    X_test = test_data[feat_cols].values
    y_test = test_data[y_col].values.ravel()
    # print(X_train.shape, y_train.shape)
    model = LGBMRegressor(seed=42)
    # scipy's randint excludes `high`, so add 1 to match the inclusive ranges used for Optuna
    param_distribution = dict(
        n_estimators = randint(low=50, high=501),
        num_leaves = randint(low=10, high=513),
        min_data_in_leaf = randint(low=10, high=81),
        bagging_fraction = uniform(loc=0, scale=1.0),  # subsample, same 0.0-1.0 range as the other libraries
        learning_rate = uniform(loc=0.01, scale=0.09),  # eta, covers 0.01-0.1
        lambda_l1 = uniform(loc=0.01, scale=0.99),  # reg_alpha, covers 0.01-1
        lambda_l2 = uniform(loc=0.01, scale=0.99),  # reg_lambda, covers 0.01-1
    )
    search_cv = RandomizedSearchCV(model, 
                                   param_distribution,
                                   scoring=my_scorer,
                                   random_state=0,
                                   n_iter = n_trials)
    # Time the search
    ts = time.time()
    search_cv.fit(X_train, y_train)
    te = time.time()
    exc_time = te-ts
    # best_score_ is the negated cross-validated RMSE on the training folds, so flip the sign back
    evaluate_detail_df['Smallest RMSE'].append(-search_cv.best_score_)
    evaluate_detail_df['Time Ellipsed'].append(exc_time)
    evaluate_detail_df['Train Start Date'].append(train_start_date)
    evaluate_detail_df['Train End Date'].append(train_end_date)
    evaluate_detail_df['Test Start Date'].append(train_end_date+dateutil.relativedelta.relativedelta(months=1))
    evaluate_detail_df['Test End Date'].append(test_end_date)
    train_end_date += dateutil.relativedelta.relativedelta(months=1)
evaluate_detail_df = pd.DataFrame(evaluate_detail_df)

  2%|▏         | 1/60 [08:44<8:35:39, 524.40s/it]


KeyboardInterrupt: ignored
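
Random search itself is simple: draw each hyperparameter independently from its range, evaluate, and keep the best configuration. The sketch below shows that idea in plain Python under stated assumptions. `SPACE` is a reduced copy of the notebook's ranges, and `toy_objective` is a hypothetical stand-in for fitting LightGBM and returning an RMSE. It is not how `RandomizedSearchCV` is implemented, and it does no cross-validation.

```python
import random

# Reduced search space mirroring the notebook's ranges; integer bounds are inclusive
SPACE = {
    'n_estimators': ('int', 50, 500),
    'num_leaves': ('int', 10, 512),
    'learning_rate': ('float', 0.01, 0.1),
}

def sample(space, rng):
    """Draw one configuration uniformly at random from the space."""
    params = {}
    for name, (kind, low, high) in space.items():
        if kind == 'int':
            params[name] = rng.randint(low, high)  # random.randint includes both ends
        else:
            params[name] = rng.uniform(low, high)
    return params

def random_search(objective, space, n_trials, seed=0):
    """Evaluate n_trials random configurations and return (best_loss, best_params)."""
    rng = random.Random(seed)
    best_loss, best_params = float('inf'), None
    for _ in range(n_trials):
        params = sample(space, rng)
        loss = objective(params)
        if loss < best_loss:
            best_loss, best_params = loss, params
    return best_loss, best_params

# Hypothetical toy objective standing in for "fit the model, return RMSE"
def toy_objective(params):
    return abs(params['learning_rate'] - 0.05) + abs(params['num_leaves'] - 64) / 1000

best_loss, best_params = random_search(toy_objective, SPACE, n_trials=15)
print(best_loss, best_params)
```

Because every trial is independent of the previous ones, random search makes a useful baseline against samplers such as TPE that use earlier results to choose the next point.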

# HyperOpt

In [32]:
import pickle
import time
import hyperopt
from hyperopt import fmin, hp, Trials
# Configuration 
train_timespan_months = 180
whole_period_months = 60
test_timespan_months = 1
first_end_time = datetime.datetime(2015, 12, 1)
feat_cols = ['absacc', 'acc', 'age', 'agr', 'baspread','bm', 'bm_ia',
             'cash', 'cashdebt', 'cashpr', 'cfp', 'cfp_ia', 'chatoia', 'chcsho', 'chempia', 'chinv', 'chmom',
             'chpmia', 'chtx', 'cinvest', 'convind', 'currat', 'depr', 'divi', 'divo', 'dolvol', 'dy', 
             'egr', 'ep', 'gma', 'grcapx', 'grltnoa', 'herf', 'hire', 'ill', 'indmom', 'invest', 'lev', 'lgr',
             'maxret', 'mom12m', 'mom1m', 'mom36m', 'mom6m', 'ms', 'mve_ia', 'mvel1', 'nincr', 'operprof',
             'orgcap', 'pchcapx_ia', 'pchcurrat', 'pchdepr', 'pchgm_pchsale', 'pchquick', 'pchsale_pchinvt',
             'pchsale_pchrect', 'pchsale_pchxsga', 'pchsaleinv', 'pctacc', 'ps', 'quick', 'rd', 'rd_mve',
             'rd_sale', 'realestate', 'retvol', 'roaq', 'roavol', 'roeq', 'roic', 'rsup', 'salecash', 'pricedelay',
             'saleinv', 'salerec', 'secured', 'securedind', 'sgr', 'sin', 'sp', 'std_dolvol', 'std_turn',
             'stdacc', 'stdcf', 'tang', 'tb', 'turn', 'zerotrade','aeavol','ear','beta','betasq','idiovol']
y_col = 'ret'
train_end_date = first_end_time
time_usage = []
score_list = []
timeline = []

# Define the search space
space = {
        'n_estimators': hp.quniform('n_estimators', 50, 500, 1), 
        'num_leaves': hp.quniform('num_leaves', 10, 512, 1),
        'min_data_in_leaf': hp.quniform('min_data_in_leaf', 10, 80, 1),
        'bagging_fraction':  hp.uniform('bagging_fraction', 0.0, 1.0), # subsample
        'learning_rate': hp.uniform('learning_rate', 0.01, 0.1),  # eta
        'lambda_l1': hp.uniform('lambda_l1', 0.01, 1),  # reg_alpha
        'lambda_l2': hp.uniform('lambda_l2', 0.01, 1), # reg_lambda
}

# Evaluation details for each train and test timespan
evaluate_detail_df = {
    'Train Start Date': [],
    'Train End Date': [],
    'Test Start Date': [],
    'Test End Date': [],
    'Smallest RMSE': [],
    'Time Ellipsed': []
}

# Run the backtest for 5 years
predict_times = 60
for period_time in trange(predict_times):
    train_start_date = train_end_date - dateutil.relativedelta.relativedelta(months=train_timespan_months)
    test_end_date = train_end_date + dateutil.relativedelta.relativedelta(months=test_timespan_months)
    train_data = df.query(f'"{train_start_date}" < date <= "{train_end_date}"')
    test_data = df.query(f'"{train_end_date}" < date <= "{test_end_date}"')
    X_train = train_data[feat_cols].values
    y_train = train_data[y_col].values
    X_test = test_data[feat_cols].values
    y_test = test_data[y_col].values.ravel()
    def objective(params):
        param_dict = dict(
            n_estimators = int(params['n_estimators']),
            num_leaves = int(params['num_leaves']),
            min_data_in_leaf = int(params['min_data_in_leaf']),
            bagging_fraction = params['bagging_fraction'],
            learning_rate = params['learning_rate'],
            lambda_l1 = params['lambda_l1'],
            lambda_l2 = params['lambda_l2']
        )
        model = LGBMRegressor(seed=42,**param_dict)
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        mse = sklearn.metrics.mean_squared_error(y_test, y_pred)
        rmse = math.sqrt(mse)
        return rmse  # The loss value that fmin minimizes.
    ts = time.time()
    
    trials = Trials()
    best = fmin(objective,
        space=space,
        algo=hyperopt.rand.suggest, # random search
        max_evals=n_trials,
        trials=trials)

    te = time.time()
    exc_time = te-ts

    best_result = min(trials.losses())
    evaluate_detail_df['Smallest RMSE'].append(best_result)
    evaluate_detail_df['Time Ellipsed'].append(exc_time)
    evaluate_detail_df['Train Start Date'].append(train_start_date)
    evaluate_detail_df['Train End Date'].append(train_end_date)
    evaluate_detail_df['Test Start Date'].append(train_end_date+dateutil.relativedelta.relativedelta(months=1))
    evaluate_detail_df['Test End Date'].append(test_end_date)
    train_end_date += dateutil.relativedelta.relativedelta(months=1)
evaluate_detail_df = pd.DataFrame(evaluate_detail_df)

100%|██████████| 1/1 [00:47<00:00, 47.18s/it, best loss: 0.12049285163336404]
  2%|▏         | 1/60 [00:47<46:29, 47.28s/it]
100%|██████████| 1/1 [00:29<00:00, 29.20s/it, best loss: 0.07272014708904327]
  3%|▎         | 2/60 [01:16<35:27, 36.68s/it]
100%|██████████| 1/1 [00:11<00:00, 11.58s/it, best loss: 0.10312527684121987]
  5%|▌         | 3/60 [01:28<23:59, 25.25s/it]
100%|██████████| 1/1 [00:09<00:00,  9.49s/it, best loss: 0.07497910387883841]
  7%|▋         | 4/60 [01:37<17:47, 19.06s/it]
  7%|▋         | 4/60 [02:18<32:12, 34.51s/it]

KeyboardInterrupt: ignored
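
The `int(...)` casts in the HyperOpt objective are needed because `hp.quniform` yields floats even when the step is 1. The HyperOpt documentation describes its value as `round(uniform(low, high) / q) * q`. The sketch below mimics that formula with the standard library only. The `quniform` function here is a hypothetical re-implementation for illustration, not HyperOpt's own code, and `INT_PARAMS` is an assumed helper name.

```python
import random

def quniform(rng, low, high, q):
    """Mimic the documented form round(uniform(low, high) / q) * q, returned as a float."""
    return float(round(rng.uniform(low, high) / q) * q)

rng = random.Random(0)

# Raw sample, as an objective function would receive it
raw = {
    'n_estimators': quniform(rng, 50, 500, 1),
    'learning_rate': rng.uniform(0.01, 0.1),
}

# Cast the integer-valued hyperparameters before passing them to the model
INT_PARAMS = ('n_estimators',)
clean = {k: int(v) if k in INT_PARAMS else v for k, v in raw.items()}
print(raw, clean)
```

Doing the cast in one place, keyed by parameter name, avoids repeating `int(...)` for every integer hyperparameter in the objective.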