# Predicting $PM_{2.5}$ with our CNN Architecture 

The following tutorial explains how to use our CNN architecture. For transparency, each code cell contains code copied from our top-level Python scripts. In practice, however, you should only need to run the associated $\texttt{.sh}$ files. To make the tutorial runnable on a PC, we use a curated data subset that contains only 100 days of data from each sensor (found in the '../data' folder).

The tutorial is broken into the following steps:
- 1) Load the data
- 2) Make the train/validation/test split
- 3) Impute the data with the ridge regression or random forest variant of MissForest
- 4) Determine the optimal CNN hyper-parameters
- 5) Train and test one of the two CNN architectures

Each step (in bold markdown) contains a description of the code, the name of the python script, the name of the shell script in $\texttt{.../src}$, and is followed by 3 cells: 
- 1) The arguments for the associated script. Each argument should be passed to the associated shell script on the command line. Here, the arguments are all Python variables with the format "$\texttt{args_{argument name}}$".
- 2) The imported dependencies required for that step
- 3) A copy of the top-level Python script found in the directory $\texttt{src}$

In [9]:
import numpy as np
import pandas as pd
import configparser

In [10]:
config = configparser.RawConfigParser()
config.read("./config/py_config.ini")

['./config/py_config.ini']
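
The list printed above is the return value of `config.read`, which reports the files that were parsed successfully. The scripts then look up file paths with `config[section][key]`. As a rough illustration of that lookup, the sketch below parses a small INI string in memory. The section and key names are the ones used later in this tutorial, but every path is a placeholder assumption and not the value written by $\texttt{generate_py_config.py}$.

```python
import configparser

# Hypothetical excerpt of py_config.ini. The section and key names appear in
# the tutorial code; the paths are placeholders chosen for illustration only.
example_ini = """
[data]
data_to_impute = ../data/data_to_impute.csv
trainV = ../data/trainV.csv
valV = ../data/valV.csv
testV = ../data/testV.csv

[Ridge_Imputation]
model = ../models/ridge_imputer.pkl
"""

example_config = configparser.RawConfigParser()
example_config.read_string(example_ini)

# Sections behave like nested dictionaries of strings.
train_path = example_config["data"]["trainV"]
section_names = example_config.sections()
print(train_path)
print(section_names)
```

Key lookups are case-insensitive by default, so `trainV` and `trainv` refer to the same entry.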

# Import the 100-Day Data Subset
 - To produce the subset:
   - Execute $\texttt{Rscript data_setup.R}$ on the command line
   - Note: the data path variables are set in $\texttt{/src/config/data_setup_config.R}$

In [12]:
#Import preprocessed data
dat = pd.read_csv('../data/data_to_impute.csv')
#Drop useless index column 
dat = dat.iloc[:, 1:]

# Splitting the Data

### Make Train/Test/Val Split

A train/validation/test split is necessary for the CNN prediction models, whereas the other prediction models need only train and test sets because they can use cross-validation for tuning.
With our script, all observations from a given site fall into a single split (for example, all in train or all in test), so that the test set fairly measures predictive capability on unseen sites.

 - train/test/val split is executed by $\texttt{train_val_test_split.py}$
 - to execute, use $\texttt{tvt-split.sh}$ with necessary arguments
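
To illustrate what splitting by site means, here is a minimal sketch. The helper `split_by_site` and the toy data are hypothetical and are not the implementation in $\texttt{data_split_tune_utils.py}$; the sketch only shows one way to assign whole sites to either train or test so that no site appears in both.

```python
import numpy as np
import pandas as pd


def split_by_site(df, train_prop, site_var_name="site", seed=1):
    """Hypothetical helper: assign entire sites to train or test."""
    rng = np.random.RandomState(seed)
    sites = df[site_var_name].unique()
    shuffled = rng.permutation(sites)
    n_train = int(round(train_prop * len(shuffled)))
    train_sites = set(shuffled[:n_train])
    in_train = df[site_var_name].isin(train_sites)
    return df[in_train], df[~in_train]


# Toy data: 5 sites with 4 observations each.
toy = pd.DataFrame({
    "site": np.repeat(["s1", "s2", "s3", "s4", "s5"], 4),
    "MonitorData": np.arange(20, dtype=float),
})

train_toy, test_toy = split_by_site(toy, train_prop=0.6)
overlap = set(train_toy["site"]) & set(test_toy["site"])
print(len(train_toy), len(test_toy), overlap)
```

Because the split proportion applies to sites rather than rows, the resulting row counts only approximate the requested proportion when sites have different numbers of observations.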

In [14]:
#Shell arguments for tvt-split.sh, which calls train_val_test_split.py: 
args_val = True
args_train_split = 0.7
args_val_split = 0.1

In [15]:
#CODE FROM /src/train_val_test_split.py
#Get utils 
import sys
from data_split_tune_utils import train_test_split, train_val_test_split

In [16]:
#CODE FROM /src/train_val_test_split.py
if (not args_val) and args_val_split != 0:
    print("Validation split specified without validation flag!")
    sys.exit()


if args_train_split < 0 or args_train_split > 1 or \
   args_val_split < 0 or args_val_split > 1:
    print("Split value out of range!")
    sys.exit()


if args_train_split + args_val_split > 1:
    print("Invalid train/validation split ratio! Must fall under a total of 1.")
    sys.exit()


test_split = 1. - args_train_split - args_val_split


data = pd.read_csv(config["data"]["data_to_impute"])

# set seed for reproducibility
np.random.seed(1)

if args_val: # split data into train, validation, and test sets and save
    train, val, test = train_val_test_split(data, train_prop=args_train_split, test_prop=test_split, site_var_name="site")
    train.to_csv(config["data"]["trainV"], index=False)
    val.to_csv(  config["data"]["valV"], index=False)
    test.to_csv( config["data"]["testV"], index=False)


else: # split data into train and test sets and save
    train, test = train_test_split(data, train_prop=args_train_split, site_var_name="site")
    train.to_csv(config["data"]["train"], index=False)
    test.to_csv(config["data"]["test"], index=False)

In [17]:
print('Training Size: ', train.shape)
print('Test Size: ', test.shape)

Training Size:  (145500, 62)
Test Size:  (41600, 62)


# Imputing Data

## Impute data with ridge regression variant of MissForest

### First, fit ridge regression imputation models
- The code pickles the trained ridge imputation model, so you do not need to retrain it every time you wish to impute. You can also train the imputer on a subset of the data to save time. The next section uses the trained imputer to impute the data.
- The following is implemented in $\texttt{ridge_imputer_fit.py}$
- To run in shell, use $\texttt{ridge-imputer-fit.sh}$ with appropriate arguments
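
For readers unfamiliar with MissForest-style imputation, the sketch below shows the general idea with ridge regression: fill missing entries with column means, then repeatedly re-predict each incomplete column from the others until the imputed matrix stops changing. This is an illustration under stated assumptions and not the `PredictiveImputer` class: the functions `ridge_fit_predict` and `iterative_ridge_impute`, the closed-form ridge solver, and the relative-change stopping rule are all hypothetical choices made for this example.

```python
import numpy as np


def ridge_fit_predict(X_obs, y_obs, X_new, alpha):
    """Closed-form ridge regression with an unpenalized intercept."""
    x_mean = X_obs.mean(axis=0)
    y_mean = y_obs.mean()
    X_centered = X_obs - x_mean
    gram = X_centered.T @ X_centered + alpha * np.eye(X_centered.shape[1])
    weights = np.linalg.solve(gram, X_centered.T @ (y_obs - y_mean))
    return (X_new - x_mean) @ weights + y_mean


def iterative_ridge_impute(X, alpha=1e-4, max_iter=10, tol=1e-6):
    """Hypothetical sketch of iterative, column-by-column imputation."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    # Initial strategy: replace each missing entry with its column mean.
    X_imp = np.where(missing, np.nanmean(X, axis=0), X)
    for _ in range(max_iter):
        X_prev = X_imp.copy()
        for j in range(X.shape[1]):
            rows_missing = missing[:, j]
            if not rows_missing.any():
                continue
            others = np.delete(X_imp, j, axis=1)
            # Fit on rows where column j is observed, predict where it is missing.
            X_imp[rows_missing, j] = ridge_fit_predict(
                others[~rows_missing], X_imp[~rows_missing, j],
                others[rows_missing], alpha)
        # Assumed stopping rule: relative squared change between iterations.
        difference = np.sum((X_imp - X_prev) ** 2) / np.sum(X_imp ** 2)
        if difference < tol:
            break
    return X_imp


# Toy data in which the third column is a linear function of the first two.
rng = np.random.RandomState(0)
X_true = rng.normal(size=(200, 3))
X_true[:, 2] = 2 * X_true[:, 0] - X_true[:, 1]
X_missing = X_true.copy()
X_missing[:20, 2] = np.nan

X_filled = iterative_ridge_impute(X_missing)
max_error = np.max(np.abs(X_filled - X_true))
print(max_error)
```

In this toy case the missing column is exactly linear in the others, so the imputed values land very close to the truth; real covariates will be imputed less accurately.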

In [18]:
#shell arguments to ridge-imputer-fit.sh, which calls ridge_imputer_fit.py: 
args_val = True 
args_impute_split = 0.5
args_max_iter = 10
args_initial_strategy = 'mean'
args_alpha = 0.0001

In [19]:
#CODE FROM ridge_imputer_fit.py
import pickle 
from data_split_tune_utils import train_test_split, X_y_site_split
from predictiveImputer_mod import PredictiveImputer

In [20]:
#CODE FROM ridge_imputer_fit.py
if args_impute_split < 0 or args_impute_split > 1:
    print("Impute split value out of range!")
    sys.exit()


train = None
if args_val:
    train = pd.read_csv(config["data"]["trainV"])

else:
    train = pd.read_csv(config["data"]["train"])


# set seed for reproducibility
np.random.seed(1)

# split train up; only fit imputer on part of train set due to memory/time
train1, train2 = train_test_split(train, train_prop=args_impute_split, site_var_name="site")

# split part of train set to fit imputer on into x, y, and site
train1_x, train1_y, train1_sites = X_y_site_split(train1, y_var_name="MonitorData", site_var_name="site")

# create imputer and fit on part of train set
ridge_imputer = PredictiveImputer(max_iter=args_max_iter, initial_strategy=args_initial_strategy, f_model="Ridge")
ridge_imputer.fit(train1_x, alpha=args_alpha, fit_intercept=True, normalize=True, random_state=1)

# save fitted imputer (the with-block closes the file handle once the dump finishes)
with open(config["Ridge_Imputation"]["model"], "wb") as model_file:
    pickle.dump(ridge_imputer, model_file)

Number of variables: 60
Iteration 1
Variable # 0
...
Variable # 59
Difference: 0.000688761218348

Iteration 2
Variable # 0
...
Variable # 59
Difference: 9.35447205916e-08



### Then, run the fitted ridge regression imputer to impute the data
- the following is implemented in $\texttt{ridge_impute_eval_train_val_test.py}$
- to run in shell, use $\texttt{ridge-impute-tt.sh}$ with appropriate arguments
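
The fit and impute steps communicate through a pickled file. A minimal sketch of that save-then-load round trip follows. The dictionary `fitted_state` is a hypothetical stand-in for a fitted imputer, and the temporary file path is an assumption made so the example cleans up after itself.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for a fitted model object.
fitted_state = {"alpha": 0.0001, "column_means": {"temp": 12.5, "wind": 3.2}}

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "imputer.pkl")

    # Save: the with-block closes the file even if pickling raises an error.
    with open(model_path, "wb") as model_file:
        pickle.dump(fitted_state, model_file)

    # Load: a separate script can restore the object without refitting.
    with open(model_path, "rb") as model_file:
        restored_state = pickle.load(model_file)

print(restored_state == fitted_state)
```

Note that unpickling executes code embedded in the file, so only load pickle files that you created or otherwise trust.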

In [21]:
#Shell arguments: 
args_val = True
args_backup_strategy = 'mean'

In [22]:
#CODE FROM ridge_impute_eval_train_val_test.py
import pickle
# this imported function was created for this package to split datasets (see data_split_tune_utils.py)
from data_split_tune_utils import X_y_site_split

In [23]:
#CODE FROM ridge_impute_eval_train_val_test.py
train, val, test = None, None, None

if args_val:
    train = pd.read_csv(config["data"]["trainV"])
    val   = pd.read_csv(config["data"]["valV"])
    test  = pd.read_csv(config["data"]["testV"])

else:
    train = pd.read_csv(config["data"]["train"])
    test  = pd.read_csv(config["data"]["test"])

# set seed for reproducibility
np.random.seed(1)

# load fitted ridge imputer (the with-block closes the file handle after loading)
with open(config["Ridge_Imputation"]["model"], "rb") as model_file:
    ridge_imputer = pickle.load(model_file)

# split train and test datasets into x, y, and site
train_x, train_y, train_sites = X_y_site_split(train, y_var_name="MonitorData", site_var_name="site")
test_x, test_y, test_sites = X_y_site_split(test, y_var_name="MonitorData", site_var_name="site")

### make imputations on train and test data matrices and create dataframes with imputation R^2 evaluations; computed weighted R^2 values
train_x_imp, train_r2_scores_df = ridge_imputer.transform(train_x, evaluate=True, backup_impute_strategy=args_backup_strategy)
train_r2_scores_df.columns = ["Train_R2", "Train_num_missing"]
train_r2_scores_df.loc[max(train_r2_scores_df.index)+1, :] = [np.average(train_r2_scores_df.loc[:, "Train_R2"].values,
                                                                   weights=train_r2_scores_df.loc[:, "Train_num_missing"].values,
                                                                   axis=0), np.mean(train_r2_scores_df.loc[:, "Train_num_missing"].values)]

test_x_imp, test_r2_scores_df = ridge_imputer.transform(test_x, evaluate=True, backup_impute_strategy=args_backup_strategy)
test_r2_scores_df.columns = ["Test_R2", "Test_num_missing"]
test_r2_scores_df.loc[max(test_r2_scores_df.index)+1, :] = [np.average(test_r2_scores_df.loc[:, "Test_R2"].values,
                                                                   weights=test_r2_scores_df.loc[:, "Test_num_missing"].values,
                                                                   axis=0), np.mean(test_r2_scores_df.loc[:, "Test_num_missing"].values)]

### convert imputed train and test data matrices back into pandas dataframes with column names
cols = ["site", "MonitorData"] + list(train_x.columns)
train_imp_df = pd.DataFrame(np.concatenate([train_sites.values.reshape(len(train_sites), -1),
                                              train_y.values.reshape(len(train_y), -1),
                                              train_x_imp], axis=1),
                                              columns=cols)

test_imp_df = pd.DataFrame(np.concatenate([test_sites.values.reshape(len(test_sites), -1),
                                              test_y.values.reshape(len(test_y), -1),
                                              test_x_imp], axis=1),
                                              columns=cols)

var_df = pd.DataFrame(np.array(cols[2:] + ["Weighted_Mean_R2"]).reshape(len(cols)-2+1, -1), columns=["Variable"])



if args_val:
    # split val into x, y, and site
    val_x, val_y, val_sites = X_y_site_split(val, y_var_name="MonitorData", site_var_name="site")

    ### make imputations on val data matrix and create dataframe with imputation R^2 evaluations; computed weighted R^2 values
    val_x_imp, val_r2_scores_df = ridge_imputer.transform(val_x, evaluate=True, backup_impute_strategy=args_backup_strategy)
    val_r2_scores_df.columns = ["Val_R2", "Val_num_missing"]
    val_r2_scores_df.loc[max(val_r2_scores_df.index)+1, :] = [np.average(val_r2_scores_df.loc[:, "Val_R2"].values,
                                                                       weights=val_r2_scores_df.loc[:, "Val_num_missing"].values,
                                                                       axis=0), np.mean(val_r2_scores_df.loc[:, "Val_num_missing"].values)]

    ### convert imputed val data matrix back into pandas dataframes with column names
    val_imp_df = pd.DataFrame(np.concatenate([val_sites.values.reshape(len(val_sites), -1),
                                                  val_y.values.reshape(len(val_y), -1),
                                                  val_x_imp], axis=1),
                                                  columns=cols)
    # save imputed datasets
    train_imp_df.to_csv(config["Ridge_Imputation"]["trainV"], index=False)
    test_imp_df.to_csv(config["Ridge_Imputation"]["testV"], index=False)
    val_imp_df.to_csv(config["Ridge_Imputation"]["valV"], index=False)

    # put R^2 evaluations for train, val, and test datasets into same pandas dataframe
    r2_scores_df = pd.concat([var_df, train_r2_scores_df, val_r2_scores_df, test_r2_scores_df], axis=1)

    # save evaluations
    r2_scores_df.to_csv(config["Ridge_Imputation"]["r2_scores"], index=False)
    
else:
    # save imputed train and test datasets
    train_imp_df.to_csv(config["Ridge_Imputation"]["train"], index=False)
    test_imp_df.to_csv(config["Ridge_Imputation"]["test"], index=False)
    # put R^2 evaluations for train and test datasets into same pandas dataframe and save
    r2_scores_df = pd.concat([var_df, train_r2_scores_df, test_r2_scores_df], axis=1)
    r2_scores_df.to_csv(config["Ridge_Imputation"]["r2_scores"], index=False)

In [25]:
nrows = 5
r2_scores_df.head(nrows)

| | Variable | Train_R2 | Train_num_missing | Val_R2 | Val_num_missing | Test_R2 | Test_num_missing |
|---|---|---|---|---|---|---|---|
| 0 | Unnamed: 0 | 0.078775 | 0.0 | 0.035756 | 0.0 | 0.007915 | 0.0 |
| 1 | month | 0.999993 | 0.0 | 0.999993 | 0.0 | 0.999993 | 0.0 |
| 2 | cumulative_month | 0.999993 | 0.0 | 0.999993 | 0.0 | 0.999993 | 0.0 |
| 3 | sin_time | 0.957092 | 0.0 | 0.954576 | 0.0 | 0.95532 | 0.0 |
| 4 | cos_time | 0.998618 | 0.0 | 0.998628 | 0.0 | 0.998622 | 0.0 |
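
The final row appended to each R^2 dataframe is a mean of the per-variable R^2 values weighted by the number of missing entries, computed with `np.average`. The sketch below works through that arithmetic on made-up numbers; the three R^2 values and missing counts are assumptions chosen for illustration and are not results from this dataset.

```python
import numpy as np

# Hypothetical per-variable imputation R^2 values and missing-value counts.
r2_per_variable = np.array([0.90, 0.50, 0.80])
num_missing = np.array([10, 30, 60])

# Weighted mean: (0.90*10 + 0.50*30 + 0.80*60) / (10 + 30 + 60) = 0.72
weighted_r2 = np.average(r2_per_variable, weights=num_missing)

# The unweighted mean treats every variable equally, regardless of how many
# values had to be imputed for it.
unweighted_r2 = r2_per_variable.mean()

print(weighted_r2, unweighted_r2)
```

Weighting by the missing count means variables with many imputed values dominate the summary. Be aware that `np.average` raises a `ZeroDivisionError` when all weights are zero, which would happen for a dataset with no missing values at all.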


### Impute with Random Forest variant of MissForest
- The random forest variant of MissForest, while effective, is memory-intensive. As such, the random forest imputation script does not save the fitted imputer (since saving the model object itself requires excessive memory). The same script both fits the imputer and makes the imputations.
- The following is implemented in $\texttt{rf_fit_impute_eval_train_val_test.py}$
- To run, use $\texttt{rf-impute-tvt.sh}$ with appropriate arguments
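
Because the imputer is fit on one part of the train set and applied to both parts, the script has to put the two imputed parts back together in their original order. The sketch below illustrates the index-based recombination pattern used in the script on a toy dataframe. The toy data and the particular site assignment are assumptions; the pattern restores the original order here because the original rows are already grouped by site.

```python
import pandas as pd

# Toy train set: rows are grouped by site, as in a per-site time series.
toy = pd.DataFrame({
    "site": ["a", "a", "b", "b", "c", "c"],
    "x": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

# Split by site into two parts (hypothetical assignment).
part1 = toy[toy["site"].isin(["a", "c"])]
part2 = toy[toy["site"] == "b"]

# Concatenating puts site "b" last, so the row order is no longer the original.
combined = pd.concat([part1, part2])

# Recover the order: expose the original row index as a column, sort by site
# and then by that index, and finally discard the helper column.
restored = combined.reset_index().sort_values(["site", "index"])
restored = restored.drop("index", axis=1).reset_index(drop=True)

print(restored.equals(toy))
```

Keeping the within-site row order matters for the later CNN step, which treats each site's rows as a time sequence.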

In [26]:
#Shell arguments for rf-impute-tvt.sh: 
args_val = True 
args_impute_split = 0.5 
args_initial_strategy = 'mean'
args_backup_strategy = 'median'
args_max_iter = 2
args_max_features = 5 
args_n_estimators = 5 

In [27]:
#CODE FROM rf_fit_impute_eval_train_val_test.py
# these are imported functions created for this package that split datasets (see data_split_tune_utils.py)
from data_split_tune_utils import train_test_split, X_y_site_split, train_val_test_split
# this is the PredictiveImputer class inspired by the MissForest algorithm (see predictiveImputer_mod.py)
from predictiveImputer_mod import PredictiveImputer

In [28]:
#CODE FROM rf_fit_impute_eval_train_val_test.py
train, val, test = None, None, None

if args_val:
    train = pd.read_csv(config["data"]["trainV"])
    val   = pd.read_csv(config["data"]["valV"])
    test  = pd.read_csv(config["data"]["testV"])

else:
    train = pd.read_csv(config["data"]["train"])
    test  = pd.read_csv(config["data"]["test"])

# set seed for reproducibility
np.random.seed(1)

# split train up; only fit imputer on part of train set due to memory/time
train1, train2 = train_test_split(train, train_prop=args_impute_split, site_var_name="site")

# split train and test datasets into x, y, and site
train1_x, train1_y, train1_sites = X_y_site_split(train1, y_var_name="MonitorData", site_var_name="site")
train2_x, train2_y, train2_sites = X_y_site_split(train2, y_var_name="MonitorData", site_var_name="site")
test_x, test_y, test_sites = X_y_site_split(test, y_var_name="MonitorData", site_var_name="site")

# create imputer and fit on part of train set
rf_imputer = PredictiveImputer(max_iter=args_max_iter, initial_strategy=args_initial_strategy, f_model="RandomForest")
rf_imputer.fit(train1_x, max_features=args_max_features, n_estimators=args_n_estimators, n_jobs=-1, verbose=0, random_state=1)

### make imputations on train and test data matrices and create dataframes with imputation R^2 evaluations; computed weighted R^2 values
train1_x_imp, train1_r2_scores_df = rf_imputer.transform(train1_x, evaluate=True, backup_impute_strategy=args_backup_strategy)
train1_r2_scores_df.columns = ["Train1_R2", "Train1_num_missing"]
train1_r2_scores_df.loc[max(train1_r2_scores_df.index)+1, :] = [np.average(train1_r2_scores_df.loc[:, "Train1_R2"].values,
                                                                   weights=train1_r2_scores_df.loc[:, "Train1_num_missing"].values,
                                                                   axis=0), np.mean(train1_r2_scores_df.loc[:, "Train1_num_missing"].values)]

train2_x_imp, train2_r2_scores_df = rf_imputer.transform(train2_x, evaluate=True, backup_impute_strategy=args_backup_strategy)
train2_r2_scores_df.columns = ["Train2_R2", "Train2_num_missing"]
train2_r2_scores_df.loc[max(train2_r2_scores_df.index)+1, :] = [np.average(train2_r2_scores_df.loc[:, "Train2_R2"].values,
                                                                   weights=train2_r2_scores_df.loc[:, "Train2_num_missing"].values,
                                                                   axis=0), np.mean(train2_r2_scores_df.loc[:, "Train2_num_missing"].values)]

test_x_imp, test_r2_scores_df = rf_imputer.transform(test_x, evaluate=True, backup_impute_strategy=args_backup_strategy)
test_r2_scores_df.columns = ["Test_R2", "Test_num_missing"]
test_r2_scores_df.loc[max(test_r2_scores_df.index)+1, :] = [np.average(test_r2_scores_df.loc[:, "Test_R2"].values,
                                                                   weights = test_r2_scores_df.loc[:, "Test_num_missing"].values,
                                                                   axis=0), np.mean(test_r2_scores_df.loc[:, "Test_num_missing"].values)]

### convert imputed train and test data matrices back into pandas dataframes with column names
cols = ["site", "MonitorData"] + list(train1_x.columns)
train1_imp_df = pd.DataFrame(np.concatenate([train1_sites.values.reshape(len(train1_sites), -1),
                                              train1_y.values.reshape(len(train1_y), -1),
                                              train1_x_imp], axis=1),
                                              columns=cols)

train2_imp_df = pd.DataFrame(np.concatenate([train2_sites.values.reshape(len(train2_sites), -1),
                                              train2_y.values.reshape(len(train2_y), -1),
                                              train2_x_imp], axis=1),
                                              columns=cols)

test_imp_df = pd.DataFrame(np.concatenate([test_sites.values.reshape(len(test_sites), -1),
                                              test_y.values.reshape(len(test_y), -1),
                                              test_x_imp], axis=1),
                                              columns=cols)

# build the variable-name column for the R^2 evaluation dataframe
var_df = pd.DataFrame(np.array(cols[2:] + ["Weighted_Mean_R2"]).reshape(len(cols)-2+1, -1), columns=["Variable"])

# recombine partial train sets (both imputed) into single train set,
# restoring the original row order within each site
train_imp_df = pd.concat([train1_imp_df, train2_imp_df])
train_imp_df = train_imp_df.reset_index().sort_values(["site", "index"])
train_imp_df.drop("index", axis=1, inplace=True)
train_imp_df.reset_index(inplace=True, drop=True)

# the fitted imputer is not saved because the model object requires excessive memory
#pickle.dump(rf_imputer, open("rfV_imputer.pkl", "wb"))

if args_val:
    # split val into x, y, and site
    val_x, val_y, val_sites = X_y_site_split(val, y_var_name="MonitorData", site_var_name="site")

    ### make imputations on val data matrix and create dataframe with imputation R^2 evaluations; computed weighted R^2 values
    val_x_imp, val_r2_scores_df = rf_imputer.transform(val_x, evaluate=True, backup_impute_strategy=args_backup_strategy)
    val_r2_scores_df.columns = ["Val_R2", "Val_num_missing"]
    val_r2_scores_df.loc[max(val_r2_scores_df.index)+1, :] = [np.average(val_r2_scores_df.loc[:, "Val_R2"].values,
                                                                       weights = val_r2_scores_df.loc[:, "Val_num_missing"].values,
                                                                       axis=0), np.mean(val_r2_scores_df.loc[:, "Val_num_missing"].values)]

    ### convert imputed val data matrix back into pandas dataframes with column names
    val_imp_df = pd.DataFrame(np.concatenate([val_sites.values.reshape(len(val_sites), -1),
                                                  val_y.values.reshape(len(val_y), -1),
                                                  val_x_imp], axis=1),
                                                  columns=cols)

    # save imputed datasets
    train_imp_df.to_csv(config["RF_Imputation"]["trainV"], index=False)
    val_imp_df.to_csv(config["RF_Imputation"]["valV"], index=False)
    test_imp_df.to_csv(config["RF_Imputation"]["testV"], index=False)

    # put R^2 evaluations for train, val, and test datasets into same pandas dataframe
    r2_scores_df = pd.concat([var_df, train1_r2_scores_df, train2_r2_scores_df, val_r2_scores_df, test_r2_scores_df], axis=1)

    # save evaluations
    r2_scores_df.to_csv(config["RF_Imputation"]["r2_scores"], index=False)


else:
    train_imp_df.to_csv(config["RF_Imputation"]["train"], index=False)
    test_imp_df.to_csv(config["RF_Imputation"]["test"], index=False)

    # put R^2 evaluations for train and test datasets into same pandas dataframe and save
    r2_scores_df = pd.concat([var_df, train1_r2_scores_df, train2_r2_scores_df, test_r2_scores_df], axis=1)
    r2_scores_df.to_csv(config["RF_Imputation"]["r2_scores"], index=False)
    
    

Number of variables: 60
Iteration 1
Variable # 0
...
Variable # 59
Difference: 0.000556477402734

Iteration 2
Variable # 0
Variable # 1
Variable # 2
Variable # 3
Variable # 4
Variable # 5
Variable # 6


# Making Pollution Predictions with time-domain CNNs 
There are two CNN types. Refer to $\texttt{/src/CNN_architecture.py}$ for a description of the differences. Here, we implement CNN type 2 in two steps: 
- 1) Tune hyper-parameters 
- 2) Train and test the CNN to compute an $R^2$ value
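
Before a time-domain CNN can process the data, each site's rows must be arranged as a sequence, and sequences of different lengths must be brought to a common shape. The sketch below shows one way to do this with zero-padding in NumPy. The helpers `split_sizes_by_site` and `pad_and_stack` are hypothetical and only approximate the role of the utilities in $\texttt{CNN_utils.py}$, whose exact behavior is not shown here.

```python
import numpy as np


def split_sizes_by_site(sites):
    """Hypothetical helper: run lengths of consecutive identical site labels."""
    sizes = []
    previous = None
    for site in sites:
        if sizes and site == previous:
            sizes[-1] += 1
        else:
            sizes.append(1)
        previous = site
    return sizes


def pad_and_stack(x, sizes):
    """Hypothetical helper: zero-pad each site's sequence to the longest length."""
    max_len = max(sizes)
    stacked = np.zeros((len(sizes), max_len, x.shape[1]), dtype=x.dtype)
    start = 0
    for i, n_rows in enumerate(sizes):
        stacked[i, :n_rows, :] = x[start:start + n_rows]
        start += n_rows
    return stacked


# Toy data: 3 sites with 3, 2, and 1 time steps, and 2 features per row.
sites = np.array(["a", "a", "a", "b", "b", "c"])
x = np.arange(12, dtype=float).reshape(6, 2)

sizes = split_sizes_by_site(sites)
stacked = pad_and_stack(x, sizes)            # shape: (sites, time, features)
stacked_cnn = np.transpose(stacked, (0, 2, 1))  # shape: (sites, features, time)
print(sizes, stacked_cnn.shape)
```

The final transpose mirrors the `torch.transpose(..., 1, 2)` call in the tutorial code, since 1-D convolutions in PyTorch expect the feature (channel) axis before the time axis.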

## First, validate the CNN to tune hyper-parameters (analogous to cross-validation with other models)
- The validation script performs a grid search over several combinations of hyper-parameters using our validation dataset and prints the best hyper-parameters it finds. To use them, modify the corresponding variables in $\texttt{/src/config/generate_py_config.py}$, then rerun the $\texttt{generate_py_config.py}$ script.
- The following is implemented in $\texttt{CNN_validate.py}$
- To run, use shell script $\texttt{CNN-validate.sh}$
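
The grid search can be pictured as enumerating every combination of candidate values and keeping the one with the best validation score. The sketch below expresses that with `itertools.product` instead of nested loops, using the CNN type 2 candidate values from this tutorial. The scoring function `toy_validation_score` is a hypothetical stand-in for training a CNN and computing its validation $R^2$, so its output says nothing about which hyper-parameters are actually best.

```python
import itertools


def conv1d_output_length(length_in, kernel_size, padding):
    """Output length of a 1-D convolution with stride 1 and dilation 1."""
    return length_in + 2 * padding - kernel_size + 1


search_space = {
    "hidden_size_conv": [25, 50],
    # Kernel size and padding are paired so the sequence length is preserved.
    "kernel_and_padding": list(zip([3, 5], [1, 2])),
    "hidden_size_full": [50, 100],
    "dropout_full": [0.1, 0.4],
    "hidden_size2_full": [50, 100],
    "dropout2_full": [0.1, 0.4],
    "lr": [0.01],
    "weight_decay": [0.00001],
}


def toy_validation_score(params):
    """Hypothetical stand-in for training a CNN and returning validation R^2."""
    kernel_size, _ = params["kernel_and_padding"]
    return (-abs(params["dropout_full"] - 0.1)
            - 0.001 * params["hidden_size_conv"]
            - 0.01 * kernel_size)


names = list(search_space)
best_score, best_params = float("-inf"), None
n_evaluated = 0
for values in itertools.product(*search_space.values()):
    params = dict(zip(names, values))
    score = toy_validation_score(params)
    n_evaluated += 1
    # Keep the first combination that achieves the highest score.
    if score > best_score:
        best_score, best_params = score, params

print(n_evaluated)
print(best_params)
```

Pairing kernel sizes 3 and 5 with paddings 1 and 2 keeps the convolution output the same length as its input, which is why the two lists are zipped rather than crossed. With these candidate lists the search evaluates 64 combinations.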

In [1]:
#Arguments for CNN-validate.sh
args_cnn_type = 'cnn_2'
args_dataset = 'ridgeImp'

In [7]:
#CODE FROM CNN_validate.py

import torch
from torch.autograd import Variable
import sklearn.preprocessing
import sklearn.metrics
# this imported function was created for this package to split datasets (see data_split_tune_utils.py)
from data_split_tune_utils import X_y_site_split
# these imported functions were created for this package for the purposes of data pre-processing for CNNs, training CNNs,
# and evaluation of CNNs (see CNN_utils.py)
from CNN_utils import split_sizes_site, split_data, pad_stack_splits, get_monitorData_indices, r2, get_nonConst_vars, train_CNN
# these imported classes are the CNN architectures (see CNN_architecture.py)
from CNN_architecture import CNN1, CNN2

In [11]:
#CODE FROM CNN_validate.py

# set seeds for reproducibility
np.random.seed(1)
torch.manual_seed(1)

train = None
val = None

if args_dataset == "ridgeImp":
    # read in train and val sets
    train = pd.read_csv(config["Ridge_Imputation"]["trainV"])
    val = pd.read_csv(config["Ridge_Imputation"]["valV"])

elif args_dataset == "rfImp":
    # read in train and val sets
    train = pd.read_csv(config["RF_Imputation"]["trainV"])
    val = pd.read_csv(config["RF_Imputation"]["valV"])


### delete sites from datasets where all monitor outputs are nan
train_sites_all_nan_df = train.loc[:, ['site', 'MonitorData']].groupby('site').any()
train_sites_to_delete = list(train_sites_all_nan_df[train_sites_all_nan_df['MonitorData']==False].index)
train = train[~train['site'].isin(train_sites_to_delete)]

val_sites_all_nan_df = val.loc[:, ['site', 'MonitorData']].groupby('site').any()
val_sites_to_delete = list(val_sites_all_nan_df[val_sites_all_nan_df['MonitorData']==False].index)
val = val[~val['site'].isin(val_sites_to_delete)]

# split train, val into x, y, and sites
train_x, train_y, train_sites = X_y_site_split(train, y_var_name='MonitorData', site_var_name='site')
val_x, val_y, val_sites = X_y_site_split(val, y_var_name='MonitorData', site_var_name='site')

# get dataframes with non-constant features only
nonConst_vars = get_nonConst_vars(train, site_var_name='site', y_var_name='MonitorData', cutoff=20)
train_x_nonConst = train_x.loc[:, nonConst_vars]
val_x_nonConst = val_x.loc[:, nonConst_vars]

# standardize all features
standardizer_all = sklearn.preprocessing.StandardScaler(with_mean = True, with_std = True)
train_x_std_all = standardizer_all.fit_transform(train_x)
val_x_std_all = standardizer_all.transform(val_x)

# standardize non-constant features
standardizer_nonConst = sklearn.preprocessing.StandardScaler(with_mean = True, with_std = True)
train_x_std_nonConst = standardizer_nonConst.fit_transform(train_x_nonConst)
val_x_std_nonConst = standardizer_nonConst.transform(val_x_nonConst)




# get split sizes for TRAIN data (splitting by site)
train_split_sizes = split_sizes_site(train_sites.values)

# get tuples by site
train_x_std_tuple_nonConst = split_data(torch.from_numpy(train_x_std_nonConst).float(), train_split_sizes, dim = 0)
train_x_std_tuple = split_data(torch.from_numpy(train_x_std_all).float(), train_split_sizes, dim = 0)
train_y_tuple = split_data(torch.from_numpy(train_y.values), train_split_sizes, dim = 0)

# get site sequences stacked into matrix to go through CNN
train_x_std_stack_nonConst = pad_stack_splits(train_x_std_tuple_nonConst, np.array(train_split_sizes), 'x')
train_x_std_stack_nonConst = Variable(torch.transpose(train_x_std_stack_nonConst, 1, 2))


# get split sizes for VALIDATION data (splitting by site)
val_split_sizes = split_sizes_site(val_sites.values)

# get tuples by site
val_x_std_tuple_nonConst = split_data(torch.from_numpy(val_x_std_nonConst).float(), val_split_sizes, dim = 0)
val_x_std_tuple = split_data(torch.from_numpy(val_x_std_all).float(), val_split_sizes, dim = 0)
val_y_tuple = split_data(torch.from_numpy(val_y.values), val_split_sizes, dim = 0)

# get site sequences stacked into matrix to go through CNN
val_x_std_stack_nonConst = pad_stack_splits(val_x_std_tuple_nonConst, np.array(val_split_sizes), 'x')
val_x_std_stack_nonConst = Variable(torch.transpose(val_x_std_stack_nonConst, 1, 2))


# training parameters and model input sizes
num_epochs = 10
batch_size = 128
input_size_conv = train_x_std_nonConst.shape[1]
input_size_full = train_x_std_all.shape[1]
print('Total number of variables: ' + str(input_size_full))
print('Total number of non-constant variables: ' + str(input_size_conv))
print()

### tune CNN1
if args_cnn_type == "cnn_1":
    # CNN and optimizer hyper-parameters to test
    hidden_size_conv_list = [25, 50]
    kernel_size_list = [3, 5]
    padding_list = [1, 2]
    hidden_size_full_list = [50, 100]
    dropout_full_list = [0.1, 0.4]
    hidden_size_combo_list = [50, 100]
    dropout_combo_list = [0.1, 0.4]
    lr_list = [0.1]
    weight_decay_list = [0.00001]

    # Loss function
    mse_loss = torch.nn.MSELoss(size_average=True)

    # grid search
    best_val_r2 = -np.inf
    for hidden_size_conv in hidden_size_conv_list:
        for kernel_size, padding in zip(kernel_size_list, padding_list):
            for hidden_size_full in hidden_size_full_list:
                for dropout_full in dropout_full_list:
                    for hidden_size_combo in hidden_size_combo_list:
                        for dropout_combo in dropout_combo_list:
                            for lr in lr_list:
                                for weight_decay in weight_decay_list:
                                    # instantiate CNN
                                    cnn = CNN1(input_size_conv, hidden_size_conv, kernel_size, padding, input_size_full, hidden_size_full,
                                              dropout_full, hidden_size_combo, dropout_combo)

                                    # instantiate optimizer
                                    optimizer = torch.optim.Adam(cnn.parameters(), lr=lr, weight_decay=weight_decay)

                                    print('Hidden size conv: ' + str(hidden_size_conv))
                                    print('Kernel size: ' + str(kernel_size))
                                    print('Hidden size full: ' + str(hidden_size_full))
                                    print('Dropout full: ' + str(dropout_full))
                                    print('Hidden size combo: ' + str(hidden_size_combo))
                                    print('Dropout combo: ' + str(dropout_combo))
                                    print('Learning rate: ' + str(lr))
                                    print('Weight decay: ' + str(weight_decay))

                                    # train
                                    train_CNN(train_x_std_stack_nonConst, train_x_std_tuple, train_y_tuple, cnn, optimizer, mse_loss, num_epochs, batch_size)

                                    # evaluate
                                    val_r2 = r2(cnn, batch_size, val_x_std_stack_nonConst, val_x_std_tuple, val_y_tuple, get_pred=False)
                                    print('Validation R^2: ' + str(val_r2))
                                    print()
                                    print()

                                    # keep track of best validation R^2 and hyper-parameters
                                    if val_r2 > best_val_r2:
                                        best_val_r2 = val_r2
                                        best_hidden_size_conv = hidden_size_conv
                                        best_kernel_size = kernel_size
                                        best_hidden_size_full = hidden_size_full
                                        best_dropout_full = dropout_full
                                        best_hidden_size_combo = hidden_size_combo
                                        best_dropout_combo = dropout_combo
                                        best_lr = lr
                                        best_weight_decay = weight_decay
                                        
    print('Best validation R^2: ' + str(best_val_r2))
    print('Best hidden size conv: ' + str(best_hidden_size_conv))
    print('Best kernel size: ' + str(best_kernel_size))
    print('Best hidden size full: ' + str(best_hidden_size_full))
    print('Best dropout full: ' + str(best_dropout_full))
    print('Best hidden size combo: ' + str(best_hidden_size_combo))
    print('Best dropout combo: ' + str(best_dropout_combo))
    print('Best learning rate: ' + str(best_lr))
    print('Best weight decay: ' + str(best_weight_decay))
    
### tune CNN2
else:
    hidden_size_conv_list = [25, 50]
    kernel_size_list = [3, 5]
    padding_list = [1, 2]
    hidden_size_full_list = [50, 100]
    dropout_full_list = [0.1, 0.4]
    hidden_size2_full_list = [50, 100]
    dropout2_full_list = [0.1, 0.4]
    lr_list = [0.01]
    weight_decay_list = [0.00001]

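    # Loss function (the default reduction averages the squared errors;
    # the size_average argument is deprecated in recent PyTorch releases)
    mse_loss = torch.nn.MSELoss()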

    # grid search
    best_val_r2 = -np.inf
    for hidden_size_conv in hidden_size_conv_list:
        for kernel_size, padding in zip(kernel_size_list, padding_list):
            for hidden_size_full in hidden_size_full_list:
                for dropout_full in dropout_full_list:
                    for hidden_size2_full in hidden_size2_full_list:
                        for dropout2_full in dropout2_full_list:
                            for lr in lr_list:
                                for weight_decay in weight_decay_list:
                                    # instantiate CNN
                                    cnn = CNN2(input_size_conv, hidden_size_conv, kernel_size, padding, input_size_full, hidden_size_full,
                                              dropout_full, hidden_size2_full, dropout2_full)

                                    # instantiate optimizer
                                    optimizer = torch.optim.Adam(cnn.parameters(), lr=lr, weight_decay=weight_decay)

                                    print('Hidden size conv: ' + str(hidden_size_conv))
                                    print('Kernel size: ' + str(kernel_size))
                                    print('Hidden size full: ' + str(hidden_size_full))
                                    print('Dropout full: ' + str(dropout_full))
                                    print('Hidden size 2 full: ' + str(hidden_size2_full))
                                    print('Dropout 2 full: ' + str(dropout2_full))
                                    print('Learning rate: ' + str(lr))
                                    print('Weight decay: ' + str(weight_decay))

                                    # train
                                    train_CNN(train_x_std_stack_nonConst, train_x_std_tuple, train_y_tuple, cnn, optimizer, mse_loss, num_epochs, batch_size)

                                    # evaluate
                                    val_r2 = r2(cnn, batch_size, val_x_std_stack_nonConst, val_x_std_tuple, val_y_tuple, get_pred=False)
                                    print('Validation R^2: ' + str(val_r2))
                                    print()
                                    print()

                                    # keep track of best validation R^2 and hyper-parameters
                                    if val_r2 > best_val_r2:
                                        best_val_r2 = val_r2
                                        best_hidden_size_conv = hidden_size_conv
                                        best_kernel_size = kernel_size
                                        best_hidden_size_full = hidden_size_full
                                        best_dropout_full = dropout_full
                                        best_hidden_size2_full = hidden_size2_full
                                        best_dropout2_full = dropout2_full
                                        best_lr = lr
                                        best_weight_decay = weight_decay
                                        
    print('Best validation R^2: ' + str(best_val_r2))
    print('Best hidden size conv: ' + str(best_hidden_size_conv))
    print('Best kernel size: ' + str(best_kernel_size))
    print('Best hidden size full: ' + str(best_hidden_size_full))
    print('Best dropout full: ' + str(best_dropout_full))
    print('Best hidden size 2 full: ' + str(best_hidden_size2_full))
    print('Best dropout 2 full: ' + str(best_dropout2_full))
    print('Best learning rate: ' + str(best_lr))
    print('Best weight decay: ' + str(best_weight_decay))                               

Total number of variables: 60
Total number of non-constant variables: 21

Hidden size conv: 25
Kernel size: 3
Hidden size full: 50
Dropout full: 0.1
Hidden size 2 full: 50
Dropout 2 full: 0.1
Learning rate: 0.01
Weight decay: 1e-05
Epoch loss after epoch 5: 183.62091827392578

Epoch loss after epoch 10: 132.90976905822754

Validation R^2: 0.667679489866


Hidden size conv: 25
Kernel size: 3
Hidden size full: 50
Dropout full: 0.1
Hidden size 2 full: 50
Dropout 2 full: 0.4
Learning rate: 0.01
Weight decay: 1e-05
Epoch loss after epoch 5: 201.7007999420166

Epoch loss after epoch 10: 149.17006874084473

Validation R^2: 0.63532073318


Hidden size conv: 25
Kernel size: 3
Hidden size full: 50
Dropout full: 0.1
Hidden size 2 full: 100
Dropout 2 full: 0.1
Learning rate: 0.01
Weight decay: 1e-05
Epoch loss after epoch 5: 183.10474586486816

Epoch loss after epoch 10: 131.5680980682373

Validation R^2: 0.675813953056


Hidden size conv: 25
Kernel size: 3
Hidden size full: 50
Dropout full: 0.1
[... output truncated ...]

Epoch loss after epoch 10: 137.98619842529297

Validation R^2: 0.645463039066


Hidden size conv: 25
Kernel size: 5
Hidden size full: 100
Dropout full: 0.4
Hidden size 2 full: 50
Dropout 2 full: 0.4
Learning rate: 0.01
Weight decay: 1e-05
Epoch loss after epoch 5: 217.55944061279297

Epoch loss after epoch 10: 163.14635848999023

Validation R^2: 0.599544309799


Hidden size conv: 25
Kernel size: 5
Hidden size full: 100
Dropout full: 0.4
Hidden size 2 full: 100
Dropout 2 full: 0.1
Learning rate: 0.01
Weight decay: 1e-05
Epoch loss after epoch 5: 175.58785247802734

Epoch loss after epoch 10: 137.55784225463867

Validation R^2: 0.660688938901


Hidden size conv: 25
Kernel size: 5
Hidden size full: 100
Dropout full: 0.4
Hidden size 2 full: 100
Dropout 2 full: 0.4
Learning rate: 0.01
Weight decay: 1e-05
Epoch loss after epoch 5: 181.59077262878418

Epoch loss after epoch 10: 152.2690143585205

Validation R^2: 0.628058482613


Hidden size conv: 50
Kernel size: 3
Hidden size full: 50
Dropout

KeyboardInterrupt: 
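
The eight nested loops in the tuning cell above can be hard to read and extend. As a rough sketch only (not the code in our tuning script), the same kind of grid search can be flattened with $\texttt{itertools.product}$. The hyper-parameter names come from the CNN2 branch above, but the reduced grid and the $\texttt{evaluate}$ function are hypothetical placeholders: in practice $\texttt{evaluate}$ would instantiate the CNN, call $\texttt{train\_CNN}$, and return the validation $R^2$.

```python
import itertools

# Hypothetical, reduced grid using hyper-parameter names from the CNN2 branch above.
grid = {
    "hidden_size_conv": [25, 50],
    "kernel_padding": [(3, 1), (5, 2)],  # kernel size and padding vary together
    "hidden_size_full": [50, 100],
    "dropout_full": [0.1, 0.4],
}


def evaluate(params):
    """Placeholder score (an assumption for illustration only).

    In the tutorial this would train a CNN with `params` and
    return its validation R^2.
    """
    kernel_size, _padding = params["kernel_padding"]
    return -abs(params["hidden_size_conv"] - 50) - kernel_size - params["dropout_full"]


best_score, best_params = float("-inf"), None
for values in itertools.product(*grid.values()):
    # Pair each hyper-parameter name with the value chosen for this combination.
    params = dict(zip(grid.keys(), values))
    score = evaluate(params)
    # Keep the first combination that strictly improves on the best score so far.
    if score > best_score:
        best_score, best_params = score, params

n_combinations = len(list(itertools.product(*grid.values())))
print("Combinations tried:", n_combinations)
print("Best hyper-parameters:", best_params)
```

Storing the best combination in a single dictionary also avoids maintaining one $\texttt{best\_...}$ variable per hyper-parameter.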

## Then, train the type 2 CNN
- This step reads the hyper-parameters from the config file ($\texttt{/src/config/py_config.ini}$). As described above, you may set these to the values chosen by validation.
- The following code is executed in $\texttt{CNN_train_test.py}$.
- To run it, use $\texttt{CNN-tt.sh}$.
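
To make the config step concrete, here is a minimal sketch of how a hyper-parameter section can be read with $\texttt{configparser}$. The section and key names match those used by the training script, but the values in $\texttt{example\_ini}$ are illustrative assumptions only, and $\texttt{config\_demo}$ and $\texttt{hparams}$ are names introduced for this example. Because $\texttt{configparser}$ stores every value as a string, each one must be converted explicitly.

```python
import configparser

# Illustrative config text: section and key names follow the training script,
# but these values are example settings, not recommended ones.
example_ini = """
[CNN_hyperparam_2]
hidden_size_conv = 25
kernel_size = 3
padding = 1
hidden_size_full = 100
dropout_full = 0.1
hidden_size2_full = 100
dropout2_full = 0.1
lr = 0.01
weight_decay = 0.00001
"""

config_demo = configparser.RawConfigParser()
config_demo.read_string(example_ini)
section = config_demo["CNN_hyperparam_2"]

# Convert the stored strings to the numeric types the CNN constructor expects.
hparams = {
    "hidden_size_conv": section.getint("hidden_size_conv"),
    "kernel_size": section.getint("kernel_size"),
    "padding": section.getint("padding"),
    "hidden_size_full": section.getint("hidden_size_full"),
    "dropout_full": section.getfloat("dropout_full"),
    "hidden_size2_full": section.getint("hidden_size2_full"),
    "dropout2_full": section.getfloat("dropout2_full"),
    "lr": section.getfloat("lr"),
    "weight_decay": section.getfloat("weight_decay"),
}
print(hparams)
```

The script below achieves the same conversion by wrapping each lookup in $\texttt{int(...)}$ or $\texttt{np.float64(...)}$.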

In [45]:
#Command Line Args for CNN-tt.sh: 
args_cnn_type = 'cnn_2'
args_dataset = 'ridgeImp'

In [46]:
#CODE FROM CNN_train_test.py

import pickle  # needed to save the trained model at the end of the script
import torch
from torch.autograd import Variable
import sklearn.preprocessing
import sklearn.metrics
# this imported function was created for this package to split datasets (see data_split_tune_utils.py)
from data_split_tune_utils import X_y_site_split
# these imported functions were created for this package for the purposes of data pre-processing for CNNs, training CNNs,
# and evaluation of CNNs (see CNN_utils.py)
from CNN_utils import split_sizes_site, split_data, pad_stack_splits, get_monitorData_indices, r2, get_nonConst_vars, train_CNN
# these imported classes are the CNN architectures (see CNN_architecture.py)
from CNN_architecture import CNN1, CNN2

In [58]:
#CODE FROM CNN_train_test.py

# set seeds for reproducibility
np.random.seed(1)
torch.manual_seed(1)

# read in train, val, and test
train, val, test = None, None, None


if args_dataset == "ridgeImp":
    train = pd.read_csv(config["Ridge_Imputation"]["trainV"])
    val   = pd.read_csv(config["Ridge_Imputation"]["valV"])
    test  = pd.read_csv(config["Ridge_Imputation"]["testV"])


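elif args_dataset == "rfImp":
    train = pd.read_csv(config["RF_Imputation"]["trainV"])
    val   = pd.read_csv(config["RF_Imputation"]["valV"])
    test  = pd.read_csv(config["RF_Imputation"]["testV"])

# fail early on an unrecognized dataset name instead of crashing later on None
else:
    raise ValueError("args_dataset must be 'ridgeImp' or 'rfImp', got: " + str(args_dataset))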


# combine train and validation sets into train set
train = pd.concat([train, val], axis=0, ignore_index=True)

### delete sites from datasets where all monitor outputs are nan
train_sites_all_nan_df = train.loc[:, ['site', 'MonitorData']].groupby('site').any()
train_sites_to_delete = list(train_sites_all_nan_df[train_sites_all_nan_df['MonitorData']==False].index)
train = train[~train['site'].isin(train_sites_to_delete)]


# test_sites_all_nan_df = pd.DataFrame(np.isnan(test.groupby('site').sum()['MonitorData']))
# test_sites_to_delete = list(test_sites_all_nan_df[test_sites_all_nan_df['MonitorData'] == True].index)
# test = test[~test['site'].isin(test_sites_to_delete)]

test_sites_all_nan_df = test.loc[:, ['site', 'MonitorData']].groupby('site').any()
test_sites_to_delete = list(test_sites_all_nan_df[test_sites_all_nan_df['MonitorData']==False].index)
test = test[~test['site'].isin(test_sites_to_delete)]

# split train, test into x, y, and sites
train_x, train_y, train_sites = X_y_site_split(train, y_var_name='MonitorData', site_var_name='site')
test_x, test_y, test_sites = X_y_site_split(test, y_var_name='MonitorData', site_var_name='site')

# get dataframes with non-constant features only
nonConst_vars = get_nonConst_vars(train, site_var_name='site', y_var_name='MonitorData', cutoff=20)
train_x_nonConst = train_x.loc[:, nonConst_vars]
test_x_nonConst = test_x.loc[:, nonConst_vars]

# standardize all features
standardizer_all = sklearn.preprocessing.StandardScaler(with_mean = True, with_std = True)
train_x_std_all = standardizer_all.fit_transform(train_x)
test_x_std_all = standardizer_all.transform(test_x)

# standardize non-constant features
standardizer_nonConst = sklearn.preprocessing.StandardScaler(with_mean = True, with_std = True)
train_x_std_nonConst = standardizer_nonConst.fit_transform(train_x_nonConst)
test_x_std_nonConst = standardizer_nonConst.transform(test_x_nonConst)




# get split sizes for TRAIN data (splitting by site)
train_split_sizes = split_sizes_site(train_sites.values)

# get tuples by site
train_x_std_tuple_nonConst = split_data(torch.from_numpy(train_x_std_nonConst).float(), train_split_sizes, dim = 0)
train_x_std_tuple = split_data(torch.from_numpy(train_x_std_all).float(), train_split_sizes, dim = 0)
train_y_tuple = split_data(torch.from_numpy(train_y.values), train_split_sizes, dim = 0)

# get site sequences stacked into matrix to go through CNN
train_x_std_stack_nonConst = pad_stack_splits(train_x_std_tuple_nonConst, np.array(train_split_sizes), 'x')
train_x_std_stack_nonConst = Variable(torch.transpose(train_x_std_stack_nonConst, 1, 2))


# get split sizes for TEST data (splitting by site)
test_split_sizes = split_sizes_site(test_sites.values)

# get tuples by site
test_x_std_tuple_nonConst = split_data(torch.from_numpy(test_x_std_nonConst).float(), test_split_sizes, dim = 0)
test_x_std_tuple = split_data(torch.from_numpy(test_x_std_all).float(), test_split_sizes, dim = 0)
test_y_tuple = split_data(torch.from_numpy(test_y.values), test_split_sizes, dim = 0)

# get site sequences stacked into matrix to go through CNN
test_x_std_stack_nonConst = pad_stack_splits(test_x_std_tuple_nonConst, np.array(test_split_sizes), 'x')
test_x_std_stack_nonConst = Variable(torch.transpose(test_x_std_stack_nonConst, 1, 2))


# training parameters and model input sizes
num_epochs = 50
batch_size = 128
input_size_conv = train_x_std_nonConst.shape[1]
input_size_full = train_x_std_all.shape[1]

# train/test CNN1
if args_cnn_type == "cnn_1":
    # config values are stored as strings, so convert each one explicitly
    hidden_size_conv  = int(config["CNN_hyperparam_1"]["hidden_size_conv"])
    kernel_size       = int(config["CNN_hyperparam_1"]["kernel_size"])
    padding           = int(config["CNN_hyperparam_1"]["padding"])
    hidden_size_full  = int(config["CNN_hyperparam_1"]["hidden_size_full"])
    dropout_full      = np.float64(config["CNN_hyperparam_1"]["dropout_full"])
    hidden_size_combo = int(config["CNN_hyperparam_1"]["hidden_size_combo"])
    dropout_combo     = np.float64(config["CNN_hyperparam_1"]["dropout_combo"])
    lr                = np.float64(config["CNN_hyperparam_1"]["lr"])
    weight_decay      = np.float64(config["CNN_hyperparam_1"]["weight_decay"])

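    # Loss function (the default reduction averages the squared errors;
    # the size_average argument is deprecated in recent PyTorch releases)
    mse_loss = torch.nn.MSELoss()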

    # instantiate CNN
    cnn = CNN1(input_size_conv, hidden_size_conv, kernel_size, padding, input_size_full, hidden_size_full,
               dropout_full, hidden_size_combo, dropout_combo)

    # instantiate optimizer
    optimizer = torch.optim.Adam(cnn.parameters(), lr=lr, weight_decay=weight_decay)

    print('Total number of variables: ' + str(input_size_full))
    print('Total number of non-constant variables: ' + str(input_size_conv))
    print('Hidden size conv: ' + str(hidden_size_conv))
    print('Kernel size: ' + str(kernel_size))
    print('Hidden size full: ' + str(hidden_size_full))
    print('Dropout full: ' + str(dropout_full))
    print('Hidden size combo: ' + str(hidden_size_combo))
    print('Dropout combo: ' + str(dropout_combo))
    print('Learning rate: ' + str(lr))
    print('Weight decay: ' + str(weight_decay))
    print()

    # train
    train_CNN(train_x_std_stack_nonConst, train_x_std_tuple, train_y_tuple, cnn, optimizer, mse_loss, num_epochs, batch_size)

    # evaluate
    test_r2, test_pred_cnn = r2(cnn, batch_size, test_x_std_stack_nonConst, test_x_std_tuple, test_y_tuple, get_pred=True)

    print()
    print('Test R^2: ' + str(test_r2))

    # put model predictions into test dataframe (note that these predictions do not include those for rows where there is no response value)
    test = test.dropna(axis=0)
    test['MonitorData_pred'] = pd.Series(test_pred_cnn, index=test.index)

    # save test dataframe with predictions and final model
    
    # use a context manager so the model file is closed after writing
    with open(config["Regression"]["cnn_1_model"], 'wb') as model_file:
        pickle.dump(cnn, model_file)
    test.to_csv(config["Regression"]["cnn_1_pred"], index=False)


# train/test CNN2
else:
    hidden_size_conv  = int(config["CNN_hyperparam_2"]["hidden_size_conv"])
    kernel_size       = int(config["CNN_hyperparam_2"]["kernel_size"])
    padding           = int(config["CNN_hyperparam_2"]["padding"])
    hidden_size_full  = int(config["CNN_hyperparam_2"]["hidden_size_full"])
    dropout_full      = np.float64(config["CNN_hyperparam_2"]["dropout_full"])
    hidden_size2_full = int(config["CNN_hyperparam_2"]["hidden_size2_full"])
    dropout2_full     = np.float64(config["CNN_hyperparam_2"]["dropout2_full"])
    lr                = np.float64(config["CNN_hyperparam_2"]["lr"])
    weight_decay      = np.float64(config["CNN_hyperparam_2"]["weight_decay"])

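    # Loss function (the default reduction averages the squared errors;
    # the size_average argument is deprecated in recent PyTorch releases)
    mse_loss = torch.nn.MSELoss()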

    # instantiate CNN
    cnn = CNN2(input_size_conv, hidden_size_conv, kernel_size, padding, input_size_full, hidden_size_full,
               dropout_full, hidden_size2_full, dropout2_full)

    # instantiate optimizer
    optimizer = torch.optim.Adam(cnn.parameters(), lr=lr, weight_decay=weight_decay)

    print('Total number of variables: ' + str(input_size_full))
    print('Total number of non-constant variables: ' + str(input_size_conv))
    print('Hidden size conv: ' + str(hidden_size_conv))
    print('Kernel size: ' + str(kernel_size))
    print('Hidden size full: ' + str(hidden_size_full))
    print('Dropout full: ' + str(dropout_full))
    print('Hidden size 2 full: ' + str(hidden_size2_full))
    print('Dropout 2 full: ' + str(dropout2_full))
    print('Learning rate: ' + str(lr))
    print('Weight decay: ' + str(weight_decay))
    print()

    # train
    train_CNN(train_x_std_stack_nonConst, train_x_std_tuple, train_y_tuple, cnn, optimizer, mse_loss, num_epochs, batch_size)

    # evaluate
    test_r2, test_pred_cnn = r2(cnn, batch_size, test_x_std_stack_nonConst, test_x_std_tuple, test_y_tuple, get_pred=True)
    
    print()
    print('Test R^2: ' + str(test_r2))

    # put model predictions into test dataframe (note that these predictions do not include those for rows where there is no response value)
    test = test.dropna(axis=0)
    test['MonitorData_pred'] = pd.Series(test_pred_cnn, index=test.index)

    # save test dataframe with predictions and final model
    # use a context manager so the model file is closed after writing
    with open(config["Regression"]["cnn_2_model"], 'wb') as model_file:
        pickle.dump(cnn, model_file)
    test.to_csv(config["Regression"]["cnn_2_pred"], index=False)

Total number of variables: 60
Total number of non-constant variables: 21
Hidden size conv: 25
Kernel size: 3
Hidden size full: 100
Dropout full: 0.1
Hidden size 2 full: 100
Dropout 2 full: 0.1
Learning rate: 0.1
Weight decay: 1e-05

Epoch loss after epoch 5: 121.91483116149902

Epoch loss after epoch 10: 126.83883666992188

Epoch loss after epoch 15: 123.53434371948242

Epoch loss after epoch 20: 114.59788227081299

Epoch loss after epoch 25: 114.38522338867188

Epoch loss after epoch 30: 110.7976655960083

Epoch loss after epoch 35: 110.20826053619385

Epoch loss after epoch 40: 114.17048645019531

Epoch loss after epoch 45: 102.69075775146484

Epoch loss after epoch 50: 108.55745220184326

Epoch loss after epoch 55: 97.1395320892334

Epoch loss after epoch 60: 97.53401756286621

Epoch loss after epoch 65: 98.39294147491455

Epoch loss after epoch 70: 91.31104850769043

Epoch loss after epoch 75: 93.83342361450195

Epoch loss after epoch 80: 94.49025440216064

Epoch loss after epoch 8