# Predicting $PM_{2.5}$ with non-CNN Models
## (Ridge Regression, Random Forest, XGBoost)

The following tutorial explains how to use our alternative, non-CNN models. Here, we implement ridge regression and random forest; XGBoost is implemented similarly. For transparency, each cell contains code copied from our top-level Python scripts. In practice, however, you should only need to run the associated $\texttt{.sh}$ files. To make the tutorial runnable on a PC, we use a curated data subset containing only 100 days of each sensor's output (found in the '../data' folder).

The tutorial is broken into the following steps:
- 1) Load the data
- 2) Make the train/test split
- 3) Impute data with ridge regression or random forest
- 4) Determine optimal prediction model hyperparameters with cross-validation
- 5) Train and test your prediction model

Each step (in bold markdown) gives a description of the code, the name of the Python script, and the name of the shell script in $\texttt{.../src}$, and is followed by 3 cells:
- 1) Arguments for the associated script. Each argument should be passed to the associated shell script on the command line. Here, the arguments are all Python variables named in the format "$\texttt{args_{argument name}}$".
- 2) The imported dependencies required for that step
- 3) A copy of the top-level Python script found in the directory $\texttt{src}$

In [1]:
import numpy as np
import pandas as pd
import configparser

In [2]:
config = configparser.RawConfigParser()
config.read("./config/py_config.ini")

['./config/py_config.ini']
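
For orientation, the sketch below shows the kind of INI layout that the `config[...][...]` lookups in this tutorial rely on. The section and key names (`data`, `Ridge_Imputation`, `train`, `test`, `model`) appear in the code that follows; the file paths are placeholders chosen for illustration, not the repository's actual values.

```python
import configparser

# Hypothetical contents of py_config.ini. Only the section and key names come
# from the tutorial code; the paths are made up for illustration.
example_ini = """
[data]
data_to_impute = ../data/data_to_impute.csv
train = ../data/train.csv
test = ../data/test.csv

[Ridge_Imputation]
model = ../models/ridge_imputer.pkl
"""

example_config = configparser.RawConfigParser()
example_config.read_string(example_ini)

# Values are looked up by section, then key, and always come back as strings.
train_path = example_config["data"]["train"]
print(train_path)
print(example_config.sections())
```

Note that `read` returns the list of files it managed to parse, which is why the cell above echoes the config path. An empty list would mean the file was not found.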

# Import 100-Day Data Subset
 - to produce the subset:
  - execute $\texttt{Rscript data_setup.R}$ in the command line
  - note: the data path variables are set in $\texttt{/src/config/data_setup_config.R}$

In [3]:
#Import preprocessed data
dat = pd.read_csv('../data/data_to_impute.csv')

# Splitting the Data

### Make Train/Test Split
A train/test split is advised for non-CNN prediction models, whereas CNN prediction models require a train/validation/test split.
With our script, each site falls entirely in train or entirely in test, so that the test set gives a fair measure of predictive capability.
 - set argument val_split to 0 and argument val to False to do only a train/test split
 - train/test split is executed by $\texttt{train_val_test_split.py}$
 - to execute, use $\texttt{tvt-split.sh}$ with necessary arguments
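
To illustrate what "a given site is all in train or all in test" means, here is a minimal sketch of a site-level split. The helper `site_level_split` is hypothetical and is not the `train_test_split` from `data_split_tune_utils`; in particular, it assumes the proportion is applied to the number of sites rather than the number of rows, which may differ from the package's behavior.

```python
import numpy as np
import pandas as pd

def site_level_split(df, train_prop, site_var_name="site", seed=1):
    """Hypothetical helper: assign whole sites to either train or test."""
    rng = np.random.RandomState(seed)
    shuffled_sites = rng.permutation(df[site_var_name].unique())
    n_train_sites = int(round(train_prop * len(shuffled_sites)))
    in_train = df[site_var_name].isin(shuffled_sites[:n_train_sites])
    return df[in_train], df[~in_train]

# Toy data: 5 sites with 4 observations each.
toy = pd.DataFrame({
    "site": np.repeat(["A", "B", "C", "D", "E"], 4),
    "MonitorData": np.arange(20, dtype=float),
})
toy_train, toy_test = site_level_split(toy, train_prop=0.6)

# No site appears on both sides of the split.
overlap = set(toy_train["site"]) & set(toy_test["site"])
print(len(toy_train), len(toy_test), overlap)
```

Splitting by site rather than by row keeps observations from the same monitor out of both sets, so the test score reflects performance at locations the model has not seen.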

In [4]:
#Shell arguments for tvt-split.sh: 
args_val = False
args_train_split = 0.7
args_val_split = 0

In [5]:
#CODE FROM train_val_test_split.py
import sys
from data_split_tune_utils import train_test_split, train_val_test_split

In [6]:
#CODE FROM train_val_test_split.py
if (not args_val) and args_val_split != 0:
    print("Validation split specified without validation flag!")
    sys.exit()


if args_train_split < 0 or args_train_split > 1 or \
   args_val_split < 0 or args_val_split > 1:
    print("Split value out of range!")
    sys.exit()


if args_train_split + args_val_split > 1:
    print("Invalid train/validation split ratio! Must fall under a total of 1.")
    sys.exit()


test_split = 1. - args_train_split - args_val_split


data = pd.read_csv(config["data"]["data_to_impute"])

# set seed for reproducibility
np.random.seed(1)

if args_val: # split data into train, validation, and test sets and save
    train, val, test = train_val_test_split(data, train_prop=args_train_split, test_prop=test_split, site_var_name="site")
    train.to_csv(config["data"]["trainV"], index=False)
    val.to_csv(  config["data"]["valV"], index=False)
    test.to_csv( config["data"]["testV"], index=False)


else: # split data into train and test sets and save
    train, test = train_test_split(data, train_prop=args_train_split, site_var_name="site")
    train.to_csv(config["data"]["train"], index=False)
    test.to_csv(config["data"]["test"], index=False)

In [7]:
print('Training Size: ', train.shape)
print('Test Size: ', test.shape)

Training Size:  (145500, 62)
Test Size:  (62400, 62)


# Imputing Data

## Impute data with Ridge Regression

### First, train a ridge regression imputation model 
- The code 'pickles' the trained ridge imputation model, so you don't need to retrain every time you wish to impute. You can also train the imputer on a subset of the data to save time. The next section uses the trained imputer to actually impute the data.
- the following is implemented in $\texttt{ridge_imputer_fit.py}$
- to run in shell, use $\texttt{ridge-imputer-fit.sh}$ with appropriate args
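
The general idea behind this kind of iterative, model-based imputation can be sketched in a few lines of NumPy. This is an illustrative stand-in under stated assumptions, not the `PredictiveImputer` implementation: the function names, the closed-form ridge solver, and the convergence measure are all hypothetical choices made for the example.

```python
import numpy as np

def ridge_fit_predict(X_obs, y_obs, X_new, alpha):
    """Closed-form ridge regression with an unpenalized intercept."""
    x_mean, y_mean = X_obs.mean(axis=0), y_obs.mean()
    Xc = X_obs - x_mean
    coef = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(Xc.shape[1]),
                           Xc.T @ (y_obs - y_mean))
    return (X_new - x_mean) @ coef + y_mean

def iterative_ridge_impute(X, alpha=1e-4, max_iter=10, tol=1e-6):
    """Hypothetical sketch: mean-fill, then re-predict each column's gaps."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    filled = np.where(missing, np.nanmean(X, axis=0), X)  # initial strategy: mean
    for _ in range(max_iter):
        previous = filled.copy()
        for j in range(X.shape[1]):
            rows = missing[:, j]
            if not rows.any() or rows.all():
                continue  # nothing to impute, or nothing to learn from
            others = np.delete(filled, j, axis=1)
            filled[rows, j] = ridge_fit_predict(others[~rows], X[~rows, j],
                                                others[rows], alpha)
        # Assumed stopping rule: relative squared change between iterations.
        change = np.sum((filled - previous) ** 2) / np.sum(filled ** 2)
        if change < tol:
            break
    return filled

# Toy data: the second column is an exact linear function of the first.
rng = np.random.RandomState(0)
x1 = rng.normal(size=50)
x3 = rng.normal(size=50)
truth = np.column_stack([x1, 2.0 * x1 + 1.0, x3])
with_gaps = truth.copy()
with_gaps[[3, 10, 27], 1] = np.nan

imputed = iterative_ridge_impute(with_gaps)
print(np.abs(imputed - truth).max())
```

Because the missing column here is perfectly predictable from the others, the imputed values land very close to the true ones; real sensor data will be noisier.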

In [8]:
#shell arguments: 
args_val = False 
args_impute_split = 0.5
args_max_iter = 10
args_initial_strategy = 'mean'
args_alpha = 0.0001

In [9]:
import pickle 
from data_split_tune_utils import train_test_split, X_y_site_split
from predictiveImputer_mod import PredictiveImputer

In [10]:
if args_impute_split < 0 or args_impute_split > 1:
    print("Impute split value out of range!")
    sys.exit()


train = None
if args_val:
    train = pd.read_csv(config["data"]["trainV"])

else:
    train = pd.read_csv(config["data"]["train"])


# set seed for reproducibility
np.random.seed(1)

# split train up; only fit imputer on part of train set due to memory/time
train1, train2 = train_test_split(train, train_prop=args_impute_split, site_var_name="site")

# split part of train set to fit imputer on into x, y, and site
train1_x, train1_y, train1_sites = X_y_site_split(train1, y_var_name="MonitorData", site_var_name="site")

# create imputer and fit on part of train set
ridge_imputer = PredictiveImputer(max_iter=args_max_iter, initial_strategy=args_initial_strategy, f_model="Ridge")
ridge_imputer.fit(train1_x, alpha=args_alpha, fit_intercept=True, normalize=True, random_state=1)

# save fitted imputer
pickle.dump(ridge_imputer, open(config["Ridge_Imputation"]["model"], "wb"))

Number of variables: 60
Iteration 1
Variable # 0
Variable # 1
Variable # 2
Variable # 3
Variable # 4
Variable # 5
Variable # 6
Variable # 7
Variable # 8
Variable # 9
Variable # 10
Variable # 11
Variable # 12
Variable # 13
Variable # 14
Variable # 15
Variable # 16
Variable # 17
Variable # 18
Variable # 19
Variable # 20
Variable # 21
Variable # 22
Variable # 23
Variable # 24
Variable # 25
Variable # 26
Variable # 27
Variable # 28
Variable # 29
Variable # 30
Variable # 31
Variable # 32
Variable # 33
Variable # 34
Variable # 35
Variable # 36
Variable # 37
Variable # 38
Variable # 39
Variable # 40
Variable # 41
Variable # 42
Variable # 43
Variable # 44
Variable # 45
Variable # 46
Variable # 47
Variable # 48
Variable # 49
Variable # 50
Variable # 51
Variable # 52
Variable # 53
Variable # 54
Variable # 55
Variable # 56
Variable # 57
Variable # 58
Variable # 59
Difference: 0.00168568382991

Iteration 2
Variable # 0
Variable # 1
Variable # 2
Variable # 3
Variable # 4
Variable # 5
Variable # 6
...
Variable # 23
Variable # 24
Variable # 25
Variable # 26
Variable # 27
Variable # 28
Variable # 29
Variable # 30
Variable # 31
Variable # 32
Variable # 33
Variable # 34
Variable # 35
Variable # 36
Variable # 37
Variable # 38
Variable # 39
Variable # 40
Variable # 41
Variable # 42
Variable # 43
Variable # 44
Variable # 45
Variable # 46
Variable # 47
Variable # 48
Variable # 49
Variable # 50
Variable # 51
Variable # 52
Variable # 53
Variable # 54
Variable # 55
Variable # 56
Variable # 57
Variable # 58
Variable # 59
Difference: 3.62899334253e-08



### Then, run the trained ridge regression model to impute the data
- the following is implemented in $\texttt{ridge_impute_eval_train_val_test.py}$
- to run in shell, use $\texttt{ridge-impute-tt.sh}$ with appropriate arguments
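
The fit script and the impute script communicate through a pickled file. The sketch below shows that save-and-reload round trip with a plain dictionary standing in for the fitted imputer; the file name and the dictionary contents are placeholders for illustration.

```python
import os
import pickle
import tempfile

# Stand-in for a fitted model; any picklable Python object works the same way.
fitted_state = {"f_model": "Ridge", "alpha": 0.0001, "max_iter": 10}

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "imputer.pkl")  # hypothetical file name

    # Write in binary mode; the context manager closes the file handle.
    with open(model_path, "wb") as f:
        pickle.dump(fitted_state, f)

    # Read it back in a later step (or in a separate process).
    with open(model_path, "rb") as f:
        restored_state = pickle.load(f)

print(restored_state == fitted_state)
```

Two practical points: the class of the pickled object (here, `PredictiveImputer`) must be importable when the file is loaded, and pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.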

In [11]:
#Shell arguments: 
args_val = False
args_backup_strategy = 'mean'

In [12]:
import pickle
# this imported function was created for this package to split datasets (see data_split_tune_utils.py)
from data_split_tune_utils import X_y_site_split

In [13]:
train, val, test = None, None, None

if args_val:
    train = pd.read_csv(config["data"]["trainV"])
    val   = pd.read_csv(config["data"]["valV"])
    test  = pd.read_csv(config["data"]["testV"])

else:
    train = pd.read_csv(config["data"]["train"])
    test  = pd.read_csv(config["data"]["test"])

# set seed for reproducibility
np.random.seed(1)

# load fitted ridge imputer
ridge_imputer = pickle.load(open(config["Ridge_Imputation"]["model"], "rb"))

# split train and test datasets into x, y, and site
train_x, train_y, train_sites = X_y_site_split(train, y_var_name="MonitorData", site_var_name="site")
test_x, test_y, test_sites = X_y_site_split(test, y_var_name="MonitorData", site_var_name="site")

### make imputations on train and test data matrices and create dataframes with imputation R^2 evaluations; computed weighted R^2 values
train_x_imp, train_r2_scores_df = ridge_imputer.transform(train_x, evaluate=True, backup_impute_strategy=args_backup_strategy)
train_r2_scores_df.columns = ["Train_R2", "Train_num_missing"]
train_r2_scores_df.loc[max(train_r2_scores_df.index)+1, :] = [np.average(train_r2_scores_df.loc[:, "Train_R2"].values,
                                                                   weights=train_r2_scores_df.loc[:, "Train_num_missing"].values,
                                                                   axis=0), np.mean(train_r2_scores_df.loc[:, "Train_num_missing"].values)]

test_x_imp, test_r2_scores_df = ridge_imputer.transform(test_x, evaluate=True, backup_impute_strategy=args_backup_strategy)
test_r2_scores_df.columns = ["Test_R2", "Test_num_missing"]
test_r2_scores_df.loc[max(test_r2_scores_df.index)+1, :] = [np.average(test_r2_scores_df.loc[:, "Test_R2"].values,
                                                                   weights=test_r2_scores_df.loc[:, "Test_num_missing"].values,
                                                                   axis=0), np.mean(test_r2_scores_df.loc[:, "Test_num_missing"].values)]

### convert imputed train and test data matrices back into pandas dataframes with column names
cols = ["site", "MonitorData"] + list(train_x.columns)
train_imp_df = pd.DataFrame(np.concatenate([train_sites.values.reshape(len(train_sites), -1),
                                              train_y.values.reshape(len(train_y), -1),
                                              train_x_imp], axis=1),
                                              columns=cols)

test_imp_df = pd.DataFrame(np.concatenate([test_sites.values.reshape(len(test_sites), -1),
                                              test_y.values.reshape(len(test_y), -1),
                                              test_x_imp], axis=1),
                                              columns=cols)

var_df = pd.DataFrame(np.array(cols[2:] + ["Weighted_Mean_R2"]).reshape(len(cols)-2+1, -1), columns=["Variable"])



if args_val:
    # split val into x, y, and site
    val_x, val_y, val_sites = X_y_site_split(val, y_var_name="MonitorData", site_var_name="site")

    ### make imputations on val data matrix and create dataframe with imputation R^2 evaluations; computed weighted R^2 values
    val_x_imp, val_r2_scores_df = ridge_imputer.transform(val_x, evaluate=True, backup_impute_strategy=args_backup_strategy)
    val_r2_scores_df.columns = ["Val_R2", "Val_num_missing"]
    val_r2_scores_df.loc[max(val_r2_scores_df.index)+1, :] = [np.average(val_r2_scores_df.loc[:, "Val_R2"].values,
                                                                       weights=val_r2_scores_df.loc[:, "Val_num_missing"].values,
                                                                       axis=0), np.mean(val_r2_scores_df.loc[:, "Val_num_missing"].values)]

    ### convert imputed val data matrix back into pandas dataframes with column names
    val_imp_df = pd.DataFrame(np.concatenate([val_sites.values.reshape(len(val_sites), -1),
                                                  val_y.values.reshape(len(val_y), -1),
                                                  val_x_imp], axis=1),
                                                  columns=cols)
    # save imputed datasets
    train_imp_df.to_csv(config["Ridge_Imputation"]["trainV"], index=False)
    test_imp_df.to_csv(config["Ridge_Imputation"]["testV"], index=False)
    val_imp_df.to_csv(config["Ridge_Imputation"]["valV"], index=False)

    # put R^2 evaluations for train, val, and test datasets into same pandas dataframe
    r2_scores_df = pd.concat([var_df, train_r2_scores_df, val_r2_scores_df, test_r2_scores_df], axis=1)

    # save evaluations
    r2_scores_df.to_csv(config["Ridge_Imputation"]["r2_scores"], index=False)
    
else:
    # save imputed train and test datasets
    train_imp_df.to_csv(config["Ridge_Imputation"]["train"], index=False)
    test_imp_df.to_csv(config["Ridge_Imputation"]["test"], index=False)
    # put R^2 evaluations for train and test datasets into same pandas dataframe and save
    r2_scores_df = pd.concat([var_df, train_r2_scores_df, test_r2_scores_df], axis=1)
    r2_scores_df.to_csv(config["Ridge_Imputation"]["r2_scores"], index=False)

In [14]:
nrows = 5
r2_scores_df.head(nrows)

| | Variable | Train_R2 | Train_num_missing | Test_R2 | Test_num_missing |
|---|---|---|---|---|---|
| 0 | Unnamed: 0 | 0.059603 | 0.0 | 0.026848 | 0.0 |
| 1 | month | 0.999993 | 0.0 | 0.999993 | 0.0 |
| 2 | cumulative_month | 0.999993 | 0.0 | 0.999993 | 0.0 |
| 3 | sin_time | 0.951986 | 0.0 | 0.952512 | 0.0 |
| 4 | cos_time | 0.998624 | 0.0 | 0.998614 | 0.0 |
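
The final row appended to each score table is a weighted mean of the per-variable $R^2$ values, with weights equal to the number of missing entries per variable. A small worked example with made-up numbers shows the calculation; the column names here are simplified for illustration.

```python
import numpy as np
import pandas as pd

# Made-up per-variable scores and missing-value counts.
scores = pd.DataFrame({
    "R2": [0.90, 0.50, 0.70],
    "num_missing": [100.0, 300.0, 0.0],
})

# Variables with more missing entries contribute more to the summary.
weighted_r2 = np.average(scores["R2"].values, weights=scores["num_missing"].values)
mean_missing = scores["num_missing"].mean()

# Append the summary as one extra row, as the tutorial code does with .loc.
scores.loc[max(scores.index) + 1, :] = [weighted_r2, mean_missing]
print(scores)
```

Here the weighted mean is (0.90 × 100 + 0.50 × 300 + 0.70 × 0) / 400 = 0.6, so a variable with no missing values does not affect the summary. If every weight is zero, `np.average` raises a `ZeroDivisionError`, so a split with no missing values at all would need separate handling.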


## Impute data with Random Forest
- This random forest implementation, while effective, is memory-intensive. The random forest imputation script therefore does not save the model (the model object itself uses excessive memory), but instead trains and imputes in one step.
- We use a small number of iterations, trees, and max_features to speed up training
- The following is implemented in $\texttt{rf_fit_impute_eval_train_val_test.py}$
- To run, use $\texttt{rf-impute-tvt.sh}$ with appropriate args
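
The script below fits the imputer on one part of the training set, imputes both parts, and then stitches them back together in their original order. The reordering step can be seen in isolation on a toy dataframe (the data and the choice of which sites go in which part are invented for the example):

```python
import pandas as pd

toy = pd.DataFrame({
    "site": ["A", "A", "B", "B", "C", "C"],
    "value": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

# Pretend sites A and C were used to fit the imputer and site B was held back.
part1 = toy[toy["site"].isin(["A", "C"])]
part2 = toy[toy["site"] == "B"]

# Concatenating puts the rows in the order A, A, C, C, B, B.
combined = pd.concat([part1, part2])

# Expose the original row labels as a column, sort by site and original
# position, then discard the helper column and renumber the rows.
combined = combined.reset_index().sort_values(["site", "index"])
combined = combined.drop("index", axis=1).reset_index(drop=True)
print(combined)
```

After these steps, `combined` matches the original `toy` dataframe row for row.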

In [17]:
#Shell arguments for rf-impute-tvt.sh: 
args_val = True 
args_impute_split = 0.3
args_initial_strategy = 'mean'
args_backup_strategy = 'median'
args_max_iter = 2
args_max_features = 5 
args_n_estimators = 5

In [19]:
#CODE FROM rf_fit_impute_eval_train_val_test.py
# these are imported functions created for this package that split datasets (see data_split_tune_utils.py)
from data_split_tune_utils import train_test_split, X_y_site_split, train_val_test_split
# this is the PredictiveImputer class inspired by the MissForest algorithm (see predictiveImputer_mod.py)
from predictiveImputer_mod import PredictiveImputer

In [20]:
#CODE FROM rf_fit_impute_eval_train_val_test.py
train, val, test = None, None, None

if args_val:
    train = pd.read_csv(config["data"]["trainV"])
    val   = pd.read_csv(config["data"]["valV"])
    test  = pd.read_csv(config["data"]["testV"])

else:
    train = pd.read_csv(config["data"]["train"])
    test  = pd.read_csv(config["data"]["test"])

# set seed for reproducibility
np.random.seed(1)

# split train up; only fit imputer on part of train set due to memory/time
train1, train2 = train_test_split(train, train_prop=args_impute_split, site_var_name="site")

# split train and test datasets into x, y, and site
train1_x, train1_y, train1_sites = X_y_site_split(train1, y_var_name="MonitorData", site_var_name="site")
train2_x, train2_y, train2_sites = X_y_site_split(train2, y_var_name="MonitorData", site_var_name="site")
test_x, test_y, test_sites = X_y_site_split(test, y_var_name="MonitorData", site_var_name="site")

# create imputer and fit on part of train set
rf_imputer = PredictiveImputer(max_iter=args_max_iter, initial_strategy=args_initial_strategy, f_model="RandomForest")
rf_imputer.fit(train1_x, max_features=args_max_features, n_estimators=args_n_estimators, n_jobs=-1, verbose=0, random_state=1)

### make imputations on train and test data matrices and create dataframes with imputation R^2 evaluations; computed weighted R^2 values
train1_x_imp, train1_r2_scores_df = rf_imputer.transform(train1_x, evaluate=True, backup_impute_strategy=args_backup_strategy)
train1_r2_scores_df.columns = ["Train1_R2", "Train1_num_missing"]
train1_r2_scores_df.loc[max(train1_r2_scores_df.index)+1, :] = [np.average(train1_r2_scores_df.loc[:, "Train1_R2"].values,
                                                                   weights=train1_r2_scores_df.loc[:, "Train1_num_missing"].values,
                                                                   axis=0), np.mean(train1_r2_scores_df.loc[:, "Train1_num_missing"].values)]

train2_x_imp, train2_r2_scores_df = rf_imputer.transform(train2_x, evaluate=True, backup_impute_strategy=args_backup_strategy)
train2_r2_scores_df.columns = ["Train2_R2", "Train2_num_missing"]
train2_r2_scores_df.loc[max(train2_r2_scores_df.index)+1, :] = [np.average(train2_r2_scores_df.loc[:, "Train2_R2"].values,
                                                                   weights=train2_r2_scores_df.loc[:, "Train2_num_missing"].values,
                                                                   axis=0), np.mean(train2_r2_scores_df.loc[:, "Train2_num_missing"].values)]

test_x_imp, test_r2_scores_df = rf_imputer.transform(test_x, evaluate=True, backup_impute_strategy=args_backup_strategy)
test_r2_scores_df.columns = ["Test_R2", "Test_num_missing"]
test_r2_scores_df.loc[max(test_r2_scores_df.index)+1, :] = [np.average(test_r2_scores_df.loc[:, "Test_R2"].values,
                                                                   weights = test_r2_scores_df.loc[:, "Test_num_missing"].values,
                                                                   axis=0), np.mean(test_r2_scores_df.loc[:, "Test_num_missing"].values)]

### convert imputed train and test data matrices back into pandas dataframes with column names
cols = ["site", "MonitorData"] + list(train1_x.columns)
train1_imp_df = pd.DataFrame(np.concatenate([train1_sites.values.reshape(len(train1_sites), -1),
                                              train1_y.values.reshape(len(train1_y), -1),
                                              train1_x_imp], axis=1),
                                              columns=cols)

train2_imp_df = pd.DataFrame(np.concatenate([train2_sites.values.reshape(len(train2_sites), -1),
                                              train2_y.values.reshape(len(train2_y), -1),
                                              train2_x_imp], axis=1),
                                              columns=cols)

test_imp_df = pd.DataFrame(np.concatenate([test_sites.values.reshape(len(test_sites), -1),
                                              test_y.values.reshape(len(test_y), -1),
                                              test_x_imp], axis=1),
                                              columns=cols)

# build the "Variable" column for the R^2 evaluation dataframe
var_df = pd.DataFrame(np.array(cols[2:] + ["Weighted_Mean_R2"]).reshape(len(cols)-2+1, -1), columns=["Variable"])

# recombine partial train sets (both imputed) into single train set, restoring the original row order within each site
train_imp_df = pd.concat([train1_imp_df, train2_imp_df])
train_imp_df = train_imp_df.reset_index().sort_values(["site", "index"])
train_imp_df.drop("index", axis=1, inplace=True)
train_imp_df.reset_index(inplace=True, drop=True)

# the fitted imputer is not saved due to memory
#pickle.dump(rf_imputer, open("rfV_imputer.pkl", "wb"))

if args_val:
    # split val into x, y, and site
    val_x, val_y, val_sites = X_y_site_split(val, y_var_name="MonitorData", site_var_name="site")

    ### make imputations on val data matrix and create dataframe with imputation R^2 evaluations; computed weighted R^2 values
    val_x_imp, val_r2_scores_df = rf_imputer.transform(val_x, evaluate=True, backup_impute_strategy=args_backup_strategy)
    val_r2_scores_df.columns = ["Val_R2", "Val_num_missing"]
    val_r2_scores_df.loc[max(val_r2_scores_df.index)+1, :] = [np.average(val_r2_scores_df.loc[:, "Val_R2"].values,
                                                                       weights = val_r2_scores_df.loc[:, "Val_num_missing"].values,
                                                                       axis=0), np.mean(val_r2_scores_df.loc[:, "Val_num_missing"].values)]

    ### convert imputed val data matrix back into pandas dataframes with column names
    val_imp_df = pd.DataFrame(np.concatenate([val_sites.values.reshape(len(val_sites), -1),
                                                  val_y.values.reshape(len(val_y), -1),
                                                  val_x_imp], axis=1),
                                                  columns=cols)

    # save imputed datasets
    train_imp_df.to_csv(config["RF_Imputation"]["trainV"], index=False)
    val_imp_df.to_csv(config["RF_Imputation"]["valV"], index=False)
    test_imp_df.to_csv(config["RF_Imputation"]["testV"], index=False)

    # put R^2 evaluations for train, val, and test datasets into same pandas dataframe
    r2_scores_df = pd.concat([var_df, train1_r2_scores_df, train2_r2_scores_df, val_r2_scores_df, test_r2_scores_df], axis=1)

    # save evaluations
    r2_scores_df.to_csv(config["RF_Imputation"]["r2_scores"], index=False)


else:
    train_imp_df.to_csv(config["RF_Imputation"]["train"], index=False)
    test_imp_df.to_csv(config["RF_Imputation"]["test"], index=False)

    # put R^2 evaluations for train and test datasets into same pandas dataframe and save
    r2_scores_df = pd.concat([var_df, train1_r2_scores_df, train2_r2_scores_df, test_r2_scores_df], axis=1)
    r2_scores_df.to_csv(config["RF_Imputation"]["r2_scores"], index=False)
    
    

Number of variables: 60
Iteration 1
Variable # 0
Variable # 1
Variable # 2
Variable # 3
Variable # 4
Variable # 5
Variable # 6
Variable # 7
Variable # 8
Variable # 9
Variable # 10
Variable # 11
Variable # 12
Variable # 13
Variable # 14
Variable # 15
Variable # 16
Variable # 17
Variable # 18
Variable # 19
Variable # 20
Variable # 21
Variable # 22
Variable # 23
Variable # 24
Variable # 25
Variable # 26
Variable # 27
Variable # 28
Variable # 29
Variable # 30
Variable # 31
Variable # 32
Variable # 33
Variable # 34
Variable # 35
Variable # 36
Variable # 37
Variable # 38
Variable # 39
Variable # 40
Variable # 41
Variable # 42
Variable # 43
Variable # 44
Variable # 45
Variable # 46
Variable # 47
Variable # 48
Variable # 49
Variable # 50
Variable # 51
Variable # 52
Variable # 53
Variable # 54
Variable # 55
Variable # 56
Variable # 57
Variable # 58
Variable # 59
Difference: 0.000576077413529

Iteration 2
Variable # 0
Variable # 1
Variable # 2
Variable # 3
Variable # 4
Variable # 5
Variable # 6


# Making Pollution Predictions with Ridge Regression 
Note: the process is identical for random forest and XGBoost; simply change the argument 'model'.

### Cross-validation
We use cross-validation to tune the ridge regression, random forest, or XGBoost models. Below, we use cross-validation to tune a ridge regression predictor. The optimal model hyperparameters are stored ('pickled') as a dict and then used when training the model in the next section.
- the following is implemented in $\texttt{model_cross_validation.py}$
- to run in the shell, use $\texttt{model-cross-val.sh}$
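
As a rough picture of what site-aware cross-validation over a hyperparameter grid involves, consider the sketch below. It is not the `cross_validation` function from `data_split_tune_utils`: the function names, the fold assignment, and the closed-form ridge without feature normalization are all assumptions made to keep the example self-contained.

```python
import itertools
import numpy as np

def ridge_predict(X_tr, y_tr, X_te, alpha):
    """Closed-form ridge regression with an unpenalized intercept."""
    x_mean, y_mean = X_tr.mean(axis=0), y_tr.mean()
    Xc = X_tr - x_mean
    coef = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(Xc.shape[1]),
                           Xc.T @ (y_tr - y_mean))
    return (X_te - x_mean) @ coef + y_mean

def r2(y_true, y_pred):
    return 1.0 - np.sum((y_true - y_pred) ** 2) / np.sum((y_true - y_true.mean()) ** 2)

def site_grouped_cv(X, y, sites, hyperparam_dict, num_folds, seed=1):
    """Hypothetical sketch: every fold holds out whole sites."""
    rng = np.random.RandomState(seed)
    site_folds = np.array_split(rng.permutation(np.unique(sites)), num_folds)
    names = sorted(hyperparam_dict)
    best_score, best_params = -np.inf, None
    # Try every combination of hyperparameter values in the grid.
    for values in itertools.product(*(hyperparam_dict[n] for n in names)):
        params = dict(zip(names, values))
        fold_scores = []
        for held_out_sites in site_folds:
            test_mask = np.isin(sites, held_out_sites)
            pred = ridge_predict(X[~test_mask], y[~test_mask], X[test_mask], **params)
            fold_scores.append(r2(y[test_mask], pred))
        score = float(np.mean(fold_scores))
        if score > best_score:
            best_score, best_params = score, params
    return best_score, best_params

# Toy data: 6 sites with 20 rows each and a nearly linear response.
rng = np.random.RandomState(0)
sites = np.repeat(["s%d" % i for i in range(6)], 20)
X = rng.normal(size=(120, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=120)

grid = {"alpha": [0.1, 0.01, 0.001]}
cv_r2, best_hyperparams = site_grouped_cv(X, y, sites, grid, num_folds=3)
print(cv_r2, best_hyperparams)
```

The returned dictionary has the same shape as the `best_hyperparams` object that the tutorial code pickles, which is what lets the training script apply it key by key.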

In [21]:
args_model = "ridge"
args_n_folds = 3
args_dataset = 'ridgeImp' #Dataset imputed with ridge regression

In [22]:
import sklearn.linear_model
import sklearn.ensemble
import sklearn.metrics
import sys
import xgboost as xgb
# these are imported functions created for this package that involve splitting datasets or performing cross-validation
# see data_split_tune_utils.py
from data_split_tune_utils import cross_validation_splits, X_y_site_split, cross_validation

In [23]:
if args_n_folds < 1:
    print("n_folds must be at least 1!")
    sys.exit()

train = None

if args_dataset == "ridgeImp":
    train = pd.read_csv(config["Ridge_Imputation"]["train"])


elif args_dataset == "rfImp":
    train = pd.read_csv(config["RF_Imputation"]["train"])


if train is None or train.empty: # failsafe (train stays None if args_dataset is not recognized)
    print("Invalid dataset!")
    sys.exit()


# set seed for reproducibility
np.random.seed(1)

# drop rows with no monitor data response value
train = train.dropna(axis=0)

print("Training model \"{}\" on dataset \"{}\"...)".format(args_model, args_dataset))

if args_model == "ridge":
    # instantiate ridge
    ridge = sklearn.linear_model.Ridge(random_state=1, normalize=True, fit_intercept=True)

    # ridge hyper-parameters to test in cross-validation
    parameter_grid_ridge = {"alpha" : [0.1, 0.01, 0.001, 0.0001, 0.00001]}

    # run cross-validation
    cv_r2, best_hyperparams = cross_validation(data=train, model=ridge, hyperparam_dict=parameter_grid_ridge, num_folds=args_n_folds, y_var_name="MonitorData", site_var_name="site")
    print("Cross-validation R^2: " + str(cv_r2))
    print("Best hyper-parameters: " + str(best_hyperparams))
    
    # save dictionary with best ridge hyper-parameter combination
    pickle.dump(best_hyperparams, open(config["Reg_Best_Hyperparams"]["ridge"], "wb"))

elif args_model == "rf":
    # instantiate random forest
    rf = sklearn.ensemble.RandomForestRegressor(n_estimators=200, random_state=1, n_jobs=-1)

    # random forest hyper-parameters to test in cross-validation
    parameter_grid_rf = {"max_features" : [10, 15, 20, 25]}

    # run cross-validation
    cv_r2, best_hyperparams = cross_validation(data=train, model=rf, hyperparam_dict=parameter_grid_rf, num_folds=args_n_folds, y_var_name="MonitorData", site_var_name="site")
    print("Cross-validation R^2: " + str(cv_r2))
    print("Best hyper-parameters: " + str(best_hyperparams))

    # save dictionary with best random forest hyper-parameter combination
    pickle.dump(best_hyperparams, open(config["Reg_Best_Hyperparams"]["rf"], "wb"))

elif args_model == "xgb":
    # instantiate xgboost
    xgboost = xgb.XGBRegressor(random_state=1, n_jobs=-1)

    #Information on gradient boosting parameter tuning
    #https://machinelearningmastery.com/configure-gradient-boosting-algorithm
    # xgboost hyper-parameters to test in cross-validation
    parameter_grid_xgboost = {"learning_rate": [0.001, 0.01, 0.05, 0.1], "max_depth": [4, 6, 8, 10], "n_estimators": [100, 250, 500, 750, 1000]}

    # run cross-validation
    cv_r2, best_hyperparams = cross_validation(data=train, model=xgboost, hyperparam_dict=parameter_grid_xgboost, num_folds=args_n_folds, y_var_name="MonitorData", site_var_name="site")
    print("Cross-validation R^2: " + str(cv_r2))
    print("Best hyper-parameters: " + str(best_hyperparams))

    # save dictionary with best xgboost hyper-parameter combination
    pickle.dump(best_hyperparams, open(config["Reg_Best_Hyperparams"]["xgb"], "wb"))

Training model "ridge" on dataset "ridgeImp"...)
Cross-validation R^2: 0.737384561994
Best hyper-parameters: {'alpha': 0.01}


## Feed the cross-validation hyper-parameters to our train/test script
- optimal hyper-parameters from cross-validation are pickled in a dict and loaded into the following training code
- the following is implemented in $\texttt{final_train_test.py}$
- if using random forest or XGBoost, feature importances are also saved
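
The score reported at the end of training is the coefficient of determination, $R^2 = 1 - SS_{res}/SS_{tot}$, which is the quantity `sklearn.metrics.r2_score` computes. A hand-written version, with a hypothetical function name and toy numbers, makes its behavior at the extremes easy to check:

```python
import numpy as np

def r2_score_manual(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

y_true = [3.0, 5.0, 7.0, 9.0]

perfect = r2_score_manual(y_true, y_true)                  # exact predictions
mean_only = r2_score_manual(y_true, [6.0] * 4)             # always predict the mean
reversed_fit = r2_score_manual(y_true, [9.0, 7.0, 5.0, 3.0])  # worse than the mean
print(perfect, mean_only, reversed_fit)
```

The three cases give 1.0, 0.0, and -3.0: a perfect fit scores 1, predicting the mean scores 0, and a model worse than the mean scores below 0. On held-out sites, some drop from the cross-validation $R^2$ to the test $R^2$ is therefore not unusual.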

In [24]:
#Command Line Arguments: 
args_model = 'ridge'
args_dataset = 'ridgeImp'

In [25]:
import sklearn.linear_model
import sklearn.ensemble
import sklearn.metrics
import sys
import xgboost as xgb
# this is an imported function created for this package that splits datasets (see data_split_tune_utils.py)
from data_split_tune_utils import X_y_site_split

models = ["ridge", "rf", "xgb"]
datasets = ["ridgeImp", "rfImp"]

if args_model not in models:
    print("Invalid regression model!")
    sys.exit()


if args_dataset not in datasets:
    print("Invalid dataset!")
    sys.exit()


train = None
test  = None


if args_dataset == "ridgeImp":
    train = pd.read_csv(config["Ridge_Imputation"]["train"])
    test  = pd.read_csv(config["Ridge_Imputation"]["test"])


elif args_dataset == "rfImp":
    train = pd.read_csv(config["RF_Imputation"]["train"])
    test  = pd.read_csv(config["RF_Imputation"]["test"])

if train.empty or test.empty: # failsafe
    print("Invalid dataset!")
    sys.exit()


# drop rows with no monitor data response value
train = train.dropna(axis=0)
test = test.dropna(axis=0)

# split train and test datasets into x, y, and site
train_x, train_y, train_sites = X_y_site_split(train, y_var_name="MonitorData", site_var_name="site")
test_x, test_y, test_sites = X_y_site_split(test, y_var_name="MonitorData", site_var_name="site")


print("Training model \"{}\" on dataset \"{}\"...)".format(args_model, args_dataset))

if args_model == "ridge":
    # load dictionary with best ridge hyper-parameters from cross-validation
    best_hyperparams = pickle.load(open(config["Reg_Best_Hyperparams"]["ridge"], "rb"))

    # instantiate ridge
    ridge = sklearn.linear_model.Ridge(random_state=1, normalize=True, fit_intercept=True)

    # set ridge attributes to best combination of hyper-parameters from cross-validation
    for key in list(best_hyperparams.keys()):
        setattr(ridge, key, best_hyperparams[key])

    # fit ridge on train data and make predictions on test data; compute test R^2
    ridge.fit(train_x, train_y)
    test_pred_ridge = ridge.predict(test_x)
    test_r2_ridge = sklearn.metrics.r2_score(test_y, test_pred_ridge)
    print("Test R^2: " + str(test_r2_ridge))

    # put ridge predictions into test dataframe (note that these predictions do not include those for rows where there is no response value)
    # save ridge predictions and fitted ridge model
    test["MonitorData_pred"] = pd.Series(test_pred_ridge, index=test.index)
    test.to_csv(config["Regression"]["ridge_pred"], index=False)
    pickle.dump(ridge, open(config["Regression"]["ridge_final"], "wb"))


elif args_model == "rf":
    # load dictionary with best random forest hyper-parameters from cross-validation
    best_hyperparams = pickle.load(open(config["Reg_Best_Hyperparams"]["rf"], "rb"))

    # instantiate random forest
    rf = sklearn.ensemble.RandomForestRegressor(n_estimators=500, random_state=1, n_jobs=-1)

    # set random forest attributes to best combination of hyper-parameters from cross-validation
    for key in list(best_hyperparams.keys()):
        setattr(rf, key, best_hyperparams[key])

    # fit random forest on train data and make predictions on test data; compute test R^2
    rf.fit(train_x, train_y)
    test_pred_rf = rf.predict(test_x)
    test_r2_rf = sklearn.metrics.r2_score(test_y, test_pred_rf)
    print("Test R^2: " + str(test_r2_rf))

    # put random forest predictions into test dataframe (note that these predictions do not include those for rows where there is no response value)
    # save random forest predictions; don't save fitted random forest model due to memory
    
    test["MonitorData_pred"] = pd.Series(test_pred_rf, index=test.index)
    test.to_csv("../data/test_rfPred.csv", index=False)
    #pickle.dump(rf, open("rf_final.pkl", "wb"))

    # create dataframe of random forest feature importances and save
    feature_importance_df = pd.DataFrame(rf.feature_importances_.reshape(len(rf.feature_importances_), -1), columns=["RF_Feature_Importance"])
    feature_importance_df["Variable"] = pd.Series(train_x.columns, index=feature_importance_df.index)
    feature_importance_df.to_csv(config["Regression"]["rf_ftImp"], index=False)

elif args_model == "xgb":
    # load dictionary with best xgboost hyper-parameters from cross-validation
    best_hyperparams = pickle.load(open(config["Reg_Best_Hyperparams"]["xgb"], "rb"))

    # instantiate xgboost
    xgboost = xgb.XGBRegressor(random_state=1, n_jobs=-1)

    # set xgboost attributes to best combination of hyper-parameters from cross-validation
    for key in list(best_hyperparams.keys()):
        setattr(xgboost, key, best_hyperparams[key])

    # fit xgboost on train data and make predictions on test data; compute test R^2
    xgboost.fit(train_x, train_y)
    test_pred_xgboost = xgboost.predict(test_x)
    test_r2_xgboost = sklearn.metrics.r2_score(test_y, test_pred_xgboost)
    print("Test R^2: " + str(test_r2_xgboost))

    # put xgboost predictions into test dataframe (note that these predictions do not include those for rows where there is no response value)
    # save xgboost predictions; don't save fitted xgboost model due to memory
    test["MonitorData_pred"] = pd.Series(test_pred_xgboost, index=test.index)
    test.to_csv(config["Regression"]["xgb_pred"], index=False)
    #pickle.dump(xgboost, open("xgboost_final.pkl", "wb"))

    # create dataframe of xgboost feature importances and save
    feature_importance_df = pd.DataFrame(xgboost.feature_importances_.reshape(len(xgboost.feature_importances_), -1), columns=["XGBoost_Feature_Importance"])
    feature_importance_df["Variable"] = pd.Series(train_x.columns, index=feature_importance_df.index)
    feature_importance_df.to_csv(config["Regression"]["xgb_ftImp"], index=False)

Training model "ridge" on dataset "ridgeImp"...)
Test R^2: 0.682429611604
