# Prediction Intervals

This notebook generates 95% prediction intervals (PIs). Because the XGBoost algorithm does not support interval prediction, we use a Stochastic Gradient Boosting regressor (scikit-learn's `GradientBoostingRegressor`) to generate the interval predictions.


For each time step, we train three models: one for point predictions (used only for comparison with the XGB predictions), one for the lower quantile, and one for the upper quantile. The latter two use the `quantile` loss (`alpha` = 0.025 and 0.975, respectively), whereas the first keeps the regressor's default loss, with its hyperparameters tuned using `neg_root_mean_squared_error` scoring.
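
As a rough illustration of the three-model setup described above, the following minimal sketch fits a point model and two quantile models on synthetic data. The data, the hyperparameter values in `common`, and the in-sample coverage check are assumptions made for this example only; they are not the tuned settings or the evaluation used in this notebook.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic data, for illustration only
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(500, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=500)

alpha = 0.975
# Hypothetical hyperparameters; the notebook tunes these with RandomizedSearchCV
common = dict(n_estimators=100, max_depth=2, subsample=0.8, random_state=7)

point_model = GradientBoostingRegressor(**common)  # default loss
lower_model = GradientBoostingRegressor(loss='quantile', alpha=1 - alpha, **common)
upper_model = GradientBoostingRegressor(loss='quantile', alpha=alpha, **common)

for m in (point_model, lower_model, upper_model):
    m.fit(X, y)

lower = lower_model.predict(X)
upper = upper_model.predict(X)

# Share of observations that fall inside the [lower, upper] band (in-sample)
coverage = np.mean((y >= lower) & (y <= upper))
print('In-sample coverage: {:.3f}'.format(coverage))
```

An in-sample coverage close to 0.95 suggests that the two quantile models bracket the data roughly as intended; coverage on held-out weeks is the more meaningful check.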


Sections 1 and 2 of this notebook are very similar to the main notebook (`STXGB model`).

In [None]:
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
import sklearn
from scipy.stats import norm
import geopandas as gpd
import os 
import time

In [None]:
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split, TimeSeriesSplit, RandomizedSearchCV
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn.preprocessing import MinMaxScaler

__SET ALPHA VALUE__

In [None]:
alpha = 0.975

In [None]:
# Set output directory
# You can change it if you want to
output_dir = './output/'

In [None]:
if not os.path.exists(output_dir):
    os.makedirs(output_dir)

## 1. Loading Data

The `all_features_v1.csv` file contains all of the features and target variables used in the different variants of the STXGB model (STXGB-FB, STXGB-SG, and STXGB-SGR) for 1- to 4-week prediction horizons.


This file is published alongside the code, so you can download the CSV file and run the STXGB models yourself.

In [None]:
# here we load data from Zenodo URL

df_url = 'https://zenodo.org/record/5533027/files/all_features_v1_0.csv?download=1'
covid_df = pd.read_csv(df_url,index_col=0, dtype={'GEOID':str})

#### Load a base geojson file 

This file contains county FIPS codes and is used to store model outputs in a georeferenced format.

In [None]:
url='https://drive.google.com/file/d/1MVyLzzHl3hzno4o1rLZtI0peqIi23zsr/view?usp=sharing'
url_counties='https://drive.google.com/uc?id=' + url.split('/')[-2]
counties_shp = gpd.read_file(url_counties)
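
The cell above turns a Google Drive sharing link into a direct-download link by taking the second-to-last path segment as the file ID. A minimal sketch of that string manipulation is shown below; the helper name `drive_direct_url` and the placeholder file ID are hypothetical and only serve the example.

```python
def drive_direct_url(share_url):
    """Build a direct-download URL from a Drive sharing link (hypothetical helper)."""
    # For a link of the form .../file/d/<FILE_ID>/view?usp=sharing,
    # the file ID is the second-to-last segment after splitting on '/'.
    file_id = share_url.split('/')[-2]
    return 'https://drive.google.com/uc?id=' + file_id

# Placeholder ID, not a real file
example = 'https://drive.google.com/file/d/FILE_ID_PLACEHOLDER/view?usp=sharing'
print(drive_direct_url(example))
# → https://drive.google.com/uc?id=FILE_ID_PLACEHOLDER
```

This relies on the sharing link ending in `/<FILE_ID>/view?...`; a link in a different format would need a different parsing rule.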

### 1.1. The list of features in the STXGB-FB model

In [None]:
temp_cols = [col for col in covid_df.columns if 'TEMP' in col]

In [None]:
socio_cols = ['POP_DENSITY',
'PCT_MALE',
'PCT_65_OVE',
'PCT_BLACK',
'PCT_HISPAN', 
'PCT_AMIND',
'PCT_RURAL',
'PCT_COL_DE' ,
'PCT_TRUMP_',
'MED_HOS_IN']

In [None]:
inc_cols = [col for col in covid_df.columns if 'DELTA_INC' in col]
inc_cols.pop(0)

In [None]:
spc_cols = [col for col in covid_df.columns if 'DELTA_SPC_T' in col]

In [None]:
rel_cols = [col for col in covid_df.columns if 'REL_' in col]
rel_cols_non_delta = [col for col in rel_cols if 'DELTA' in col]
rel_cols = list(set(rel_cols)^set(rel_cols_non_delta))

In [None]:
ratio_cols = [col for col in covid_df.columns if 'RATIO_' in col]
ratio_cols_non_delta = [col for col in ratio_cols if 'DELTA' in col]
ratio_cols = list(set(ratio_cols)^set(ratio_cols_non_delta))
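
The two cells above drop the `DELTA` columns by taking the symmetric difference (`^`) of two sets. Because the `DELTA` subset is contained in the full list, this equals a plain set difference. The sketch below, which uses hypothetical column names, shows an order-preserving alternative; the order of `list(set(...))` over strings may differ between Python sessions, whereas a list comprehension keeps the DataFrame's column order.

```python
# Hypothetical column names, for illustration only
columns = ['REL_T_1', 'DELTA_REL_T_1', 'REL_T_2', 'DELTA_REL_T_2', 'POP_DENSITY']

rel_all = [c for c in columns if 'REL_' in c]
rel_delta = [c for c in rel_all if 'DELTA' in c]

# Approach used in the notebook: symmetric difference of two sets
via_sets = set(rel_all) ^ set(rel_delta)

# Order-preserving alternative: filter directly
ordered = [c for c in rel_all if 'DELTA' not in c]

print(ordered)  # → ['REL_T_1', 'REL_T_2']
print(set(ordered) == via_sets)  # → True
```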

In [None]:
facebook_features = socio_cols + temp_cols + rel_cols + ratio_cols + spc_cols  + inc_cols  

In [None]:
facebook_features.extend(('LOG_MEAN_INC_RATE_T_4', 'MEAN_SPC_T_4'))

### 1.3. The list of features in the STXGB-SG model

In [None]:
fpc_cols = [col for col in covid_df.columns if 'DELTA_FPC_T' in col]

In [None]:
pct_home_cols = [col for col in covid_df.columns if 'completely_home_' in col]
pct_home_cols_non_base = [col for col in pct_home_cols if 'baselined' in col]
pct_home_cols = list(set(pct_home_cols)^set(pct_home_cols_non_base))

In [None]:
dist_traveled_cols = [col for col in covid_df.columns if 'distance_traveled_' in col]
dist_traveled_cols_non_current = [col for col in dist_traveled_cols if 'current' in col]
dist_traveled_cols = list(set(dist_traveled_cols)^set(dist_traveled_cols_non_current))

In [None]:
safegraph_features = socio_cols + temp_cols + pct_home_cols + dist_traveled_cols + \
                     fpc_cols + inc_cols

In [None]:
safegraph_features.extend(('LOG_MEAN_INC_RATE_T_4','MEAN_FPC_T_4'))

## 2. Set training and testing size

The dataset is initially divided into a training subset of `training_size` weeks and a test subset of `testing_size` week (30 and 1, respectively, in the cell below).

At each time step, the training set grows by one week and the test week shifts forward by one week, towards the end of November.

In [None]:
training_size = 30 # week
testing_size = 1 # week
num_counties = 3103
time_steps = 14
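
The expanding-window scheme described above can be sketched as row boundaries. This assumes, as the `iloc` slicing in the following sections implies, that rows are ordered by week with exactly `num_counties` rows per week; the helper `window_bounds` is a hypothetical name introduced for this example.

```python
training_size = 30  # weeks
testing_size = 1    # week
num_counties = 3103
time_steps = 14

def window_bounds(i):
    """Return (train_end, test_end) row positions for time step i (hypothetical helper)."""
    train_end = (i + training_size) * num_counties
    test_end = (i + training_size + testing_size) * num_counties
    return train_end, test_end

for i in (0, 1, time_steps - 1):
    train_end, test_end = window_bounds(i)
    # Training rows are [0, train_end); test rows are [train_end, test_end)
    print('step {}: train rows [0, {}), test rows [{}, {})'.format(
        i, train_end, train_end, test_end))
```

Each step adds one week (`num_counties` rows) to the training set, and the test week always starts where the training rows end.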

## 3. Generate PIs for one-week (7-day) prediction horizon

In [None]:
counties_sgb_interval_7 = counties_shp.copy()

In [None]:
train_r2 = dict()
train_rmse = dict()
train_mae = dict()
test_rmse = dict()
test_mae = dict()
tuned_params_sgb_7 = dict()
tuned_params_sgb_lower_7 = dict()
tuned_params_sgb_upper_7 = dict()

models=['safegraph', 'facebook']
features = [safegraph_features, facebook_features]

gb_params = dict(learning_rate=np.arange(0.05,0.3,0.05), 
                     n_estimators=np.arange(100,1000,100), 
                     subsample = np.arange(0.1,0.9,0.05),
                     max_depth=[int(i) for i in np.arange(1,10,1)],
                     max_features = ['sqrt', 'log2']) 

for i in range(time_steps):
    
    training_df = covid_df.iloc[:(i+training_size)*num_counties,:]
    # copy the slice so that new columns can be added without modifying a view of covid_df
    testing_df = covid_df.iloc[(i+training_size)*num_counties:(i+training_size+testing_size)*num_counties,:].copy()

    for model,feature in zip(models, features):
    
        X_train = training_df[feature]
        y_train = training_df['LOG_DELTA_INC_RATE_T']
        X_test = testing_df[feature]
        y_test = testing_df['LOG_DELTA_INC_RATE_T'] 

        #scaling X
        scaler = MinMaxScaler()
        X_train = scaler.fit_transform(X_train)
        X_test = scaler.transform(X_test)


        #cross validation
        sgb_regressor_point = GradientBoostingRegressor(random_state=7)

        sgb_cv = RandomizedSearchCV(sgb_regressor_point, gb_params, random_state=7, 
                                    scoring='neg_root_mean_squared_error', n_jobs=-1)
        
        sgb_optimized = sgb_cv.fit(X_train, y_train)
        best_sgb = sgb_optimized.best_estimator_
        tuned_params_sgb_7[model, i] = sgb_optimized.best_params_
        
        
        # model evaluation for training set
        train_r2_sgb = round(best_sgb.score(X_train, y_train),2)
        train_r2[model, i] = train_r2_sgb

        y_train_predicted_sgb = best_sgb.predict(X_train)
        rmse_train_sgb = (np.sqrt(mean_squared_error(y_train, y_train_predicted_sgb)))
        train_rmse[model, i] = rmse_train_sgb
        train_mae[model, i] =  mean_absolute_error(y_train, y_train_predicted_sgb)

        # model evaluation for test set
        y_test_predicted_sgb = best_sgb.predict(X_test)
        rmse_test_gbr = (np.sqrt(mean_squared_error(y_test, y_test_predicted_sgb)))
        test_rmse[model, i] = rmse_test_gbr
        test_mae[model, i] = mean_absolute_error(y_test, y_test_predicted_sgb)
        
        
        # lower and upper interval predictions
        sgb_regressor_lower = best_sgb.set_params(loss='quantile', alpha=1-alpha)
        sgb_regressor_lower = sgb_regressor_lower.fit(X_train, y_train)
        
        y_test_predicted_sgb_lower = sgb_regressor_lower.predict(X_test)
        
        sgb_regressor_upper = best_sgb.set_params(loss='quantile', alpha=alpha)
        sgb_regressor_upper = sgb_regressor_upper.fit(X_train, y_train)
        
        
        y_test_predicted_sgb_upper = sgb_regressor_upper.predict(X_test)
        
        
        # add labels and predictions to a county data frame
        col_suffix = model +'_' + str(i)
        
        testing_df.loc[:,'y_test_'+ col_suffix] = y_test
        testing_df.loc[:,'y_predicted_'+ col_suffix] = y_test_predicted_sgb
        testing_df.loc[:,'y_predicted_lower_'+ col_suffix] = y_test_predicted_sgb_lower
        testing_df.loc[:,'y_predicted_upper_'+ col_suffix] = y_test_predicted_sgb_upper
        
        testing_df['delta_inc_test_'+ col_suffix] = np.exp(testing_df['y_test_'+ col_suffix]) - 1
        testing_df['delta_inc_pred_'+ col_suffix] = np.exp(testing_df['y_predicted_'+ col_suffix]) - 1
        testing_df['delta_inc_pred_lower_'+ col_suffix] = np.exp(testing_df['y_predicted_lower_'+ col_suffix]) - 1
        testing_df['delta_inc_pred_upper_'+ col_suffix] = np.exp(testing_df['y_predicted_upper_'+ col_suffix]) - 1
        
        testing_df['delta_case_test_'+ col_suffix] = (testing_df['delta_inc_test_'+ col_suffix] * 
                                                          testing_df['POPULATION']) / 10000
        
        testing_df['delta_case_pred_'+ col_suffix] = (testing_df['delta_inc_pred_'+ col_suffix] * 
                                                          testing_df['POPULATION']) / 10000
        
        testing_df['delta_case_pred_lower_'+ col_suffix] = (testing_df['delta_inc_pred_lower_'+ col_suffix] * 
                                                            testing_df['POPULATION']) / 10000
        
        testing_df['delta_case_pred_upper_'+ col_suffix] = (testing_df['delta_inc_pred_upper_'+ col_suffix] * 
                                                            testing_df['POPULATION']) / 10000
        
        testing_df['error_y_'+ col_suffix] = testing_df['y_test_'+ col_suffix] - testing_df['y_predicted_'+ col_suffix]
        
        testing_df['error_delta_inc_'+ col_suffix] = testing_df['delta_inc_test_'+ col_suffix] - \
                                                        testing_df['delta_inc_pred_'+ col_suffix]
        
        testing_df['error_delta_case_'+ col_suffix] = testing_df['delta_case_test_'+ col_suffix] - \
                                                        testing_df['delta_case_pred_'+ col_suffix]

        
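        # columns to merge into the county data frame (same list as in the two- and three-week sections)
        test_cols = ['GEOID',  
                     'y_test_'+ col_suffix, 'y_predicted_'+ col_suffix, 
                     'delta_inc_test_'+ col_suffix,  'delta_inc_pred_'+ col_suffix,
                     'delta_case_test_'+ col_suffix, 'delta_case_pred_'+ col_suffix,
                     'error_y_'+ col_suffix, 'error_delta_inc_'+ col_suffix, 'error_delta_case_'+ col_suffix,
                     'y_predicted_lower_'+ col_suffix, 'y_predicted_upper_'+ col_suffix,
                     'delta_case_pred_lower_'+ col_suffix, 'delta_case_pred_upper_'+ col_suffix]

        counties_sgb_interval_7 = counties_sgb_interval_7.merge(testing_df[test_cols], how='left', on='GEOID')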
        
        print('Model {} in time step {} done!'.format(model, i))

In [None]:
counties_sgb_interval_7.to_file(output_dir + 'counties_sgb_interval_7.geojson', driver='GeoJSON')
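
The loop above converts predictions from the log scale back to incidence and then to case counts. The sketch below reproduces that arithmetic under the assumption, inferred from the `np.exp(...) - 1` and `/ 10000` steps, that the target is the natural log of one plus an incidence rate per 10,000 people. The function name `log_to_cases` and the sample values are hypothetical.

```python
import numpy as np

def log_to_cases(y_log, population, per=10000):
    """Back-transform log(1 + rate per `per` people) to case counts (hypothetical helper)."""
    incidence = np.expm1(y_log)  # equivalent to np.exp(y_log) - 1
    return incidence * population / per

# Made-up values: rates of 0, 5, and 20 per 10,000 in three counties
y_log = np.log1p(np.array([0.0, 5.0, 20.0]))
population = np.array([10000, 20000, 50000])

cases = log_to_cases(y_log, population)
print(cases)
```

Because the back-transformation is monotonically increasing, a lower bound that sits below the upper bound on the log scale stays below it on the case scale.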

## 4. Generate PIs for two-week prediction horizon

In [None]:
counties_sgb_interval_14 = counties_shp.copy()

In [None]:
train_r2 = dict()
train_rmse = dict()
train_mae = dict()
test_rmse = dict()
test_mae = dict()
tuned_params_sgb_14 = dict()
tuned_params_sgb_lower_14 = dict()
tuned_params_sgb_upper_14 = dict()

models=['safegraph', 'facebook']
features = [safegraph_features, facebook_features]

gb_params = dict(learning_rate=np.arange(0.05,0.3,0.05), 
                     n_estimators=np.arange(100,1000,100), 
                     subsample = np.arange(0.1,0.9,0.05),
                     max_depth=[int(i) for i in np.arange(1,10,1)],
                     max_features = ['sqrt', 'log2']) 

for i in range(time_steps):
    
    training_df = covid_df.iloc[:(i+training_size)*num_counties,:]
    # copy the slice so that new columns can be added without modifying a view of covid_df
    testing_df = covid_df.iloc[(i+training_size)*num_counties:(i+training_size+testing_size)*num_counties,:].copy()
    
    for model,feature in zip(models, features):
    
        X_train = training_df[feature]
        y_train = training_df['LOG_DELTA_INC_RATE_T_14']
        X_test = testing_df[feature]
        y_test = testing_df['LOG_DELTA_INC_RATE_T_14'] 

        #scaling X
        scaler = MinMaxScaler()
        X_train = scaler.fit_transform(X_train)
        X_test = scaler.transform(X_test)


        #cross validation
        sgb_regressor_point = GradientBoostingRegressor(random_state=14)

        sgb_cv = RandomizedSearchCV(sgb_regressor_point, gb_params, random_state=14, 
                                    scoring='neg_root_mean_squared_error', n_jobs=-1)
        sgb_optimized = sgb_cv.fit(X_train, y_train)
        best_sgb = sgb_optimized.best_estimator_
        tuned_params_sgb_14[model, i] = sgb_optimized.best_params_
        
        
        # model evaluation for training set
        train_r2_sgb = round(best_sgb.score(X_train, y_train),2)
        train_r2[model, i] = train_r2_sgb

        y_train_predicted_sgb = best_sgb.predict(X_train)
        rmse_train_sgb = (np.sqrt(mean_squared_error(y_train, y_train_predicted_sgb)))
        train_rmse[model, i] = rmse_train_sgb
        train_mae[model, i] =  mean_absolute_error(y_train, y_train_predicted_sgb)

        # model evaluation for test set
        y_test_predicted_sgb = best_sgb.predict(X_test)
        rmse_test_gbr = (np.sqrt(mean_squared_error(y_test, y_test_predicted_sgb)))
        test_rmse[model, i] = rmse_test_gbr
        test_mae[model, i] = mean_absolute_error(y_test, y_test_predicted_sgb)
        
        
        # lower and upper interval predictions
        sgb_regressor_lower = best_sgb.set_params(loss='quantile', alpha=1-alpha)
        sgb_regressor_lower = sgb_regressor_lower.fit(X_train, y_train)
        
        y_test_predicted_sgb_lower = sgb_regressor_lower.predict(X_test)
        
        sgb_regressor_upper = best_sgb.set_params(loss='quantile', alpha=alpha)
        sgb_regressor_upper = sgb_regressor_upper.fit(X_train, y_train)
        
        
        y_test_predicted_sgb_upper = sgb_regressor_upper.predict(X_test)
    

        
        # add labels and predictions to a county data frame
        col_suffix = model +'_' + str(i)
        
        testing_df.loc[:,'y_test_'+ col_suffix] = y_test
        testing_df.loc[:,'y_predicted_'+ col_suffix] = y_test_predicted_sgb
        testing_df.loc[:,'y_predicted_lower_'+ col_suffix] = y_test_predicted_sgb_lower
        testing_df.loc[:,'y_predicted_upper_'+ col_suffix] = y_test_predicted_sgb_upper
        
        testing_df['delta_inc_test_'+ col_suffix] = np.exp(testing_df['y_test_'+ col_suffix]) - 1
        testing_df['delta_inc_pred_'+ col_suffix] = np.exp(testing_df['y_predicted_'+ col_suffix]) - 1
        testing_df['delta_inc_pred_lower_'+ col_suffix] = np.exp(testing_df['y_predicted_lower_'+ col_suffix]) - 1
        testing_df['delta_inc_pred_upper_'+ col_suffix] = np.exp(testing_df['y_predicted_upper_'+ col_suffix]) - 1
        
        testing_df['delta_case_test_'+ col_suffix] = (testing_df['delta_inc_test_'+ col_suffix] * 
                                                      testing_df['POPULATION']) / 10000
        
        testing_df['delta_case_pred_'+ col_suffix] = (testing_df['delta_inc_pred_'+ col_suffix] * 
                                                      testing_df['POPULATION']) / 10000
        
        testing_df['delta_case_pred_lower_'+ col_suffix] = (testing_df['delta_inc_pred_lower_'+ col_suffix] * 
                                                            testing_df['POPULATION']) / 10000
        
        testing_df['delta_case_pred_upper_'+ col_suffix] = (testing_df['delta_inc_pred_upper_'+ col_suffix] * 
                                                            testing_df['POPULATION']) / 10000
        
        testing_df['error_y_'+ col_suffix] = testing_df['y_test_'+ col_suffix] - \
                                                testing_df['y_predicted_'+ col_suffix]
        
        testing_df['error_delta_inc_'+ col_suffix] = testing_df['delta_inc_test_'+ col_suffix] - \
                                                        testing_df['delta_inc_pred_'+ col_suffix]
        
        testing_df['error_delta_case_'+ col_suffix] = testing_df['delta_case_test_'+ col_suffix] - \
                                                        testing_df['delta_case_pred_'+ col_suffix]
        
        test_cols = ['GEOID',  
                     'y_test_'+ col_suffix, 'y_predicted_'+ col_suffix, 
                     'delta_inc_test_'+ col_suffix,  'delta_inc_pred_'+ col_suffix,
                     'delta_case_test_'+ col_suffix, 'delta_case_pred_'+ col_suffix,
                     'error_y_'+ col_suffix, 'error_delta_inc_'+ col_suffix, 'error_delta_case_'+ col_suffix,
                    'y_predicted_lower_'+ col_suffix, 'y_predicted_upper_'+ col_suffix,
                    'delta_case_pred_lower_'+ col_suffix, 'delta_case_pred_upper_'+ col_suffix]

        
        counties_sgb_interval_14 = counties_sgb_interval_14.merge(testing_df[test_cols], how='left', on='GEOID')
        
        print('Model {} in time step {} done!'.format(model, i))

In [None]:
counties_sgb_interval_14.to_file(output_dir + 'counties_sgb_interval_14.geojson', driver='GeoJSON')
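
The loops in this notebook derive the lower and upper quantile models from `best_sgb` with `set_params`. In scikit-learn, `set_params` returns the estimator itself, so `sgb_regressor_lower`, `sgb_regressor_upper`, and `best_sgb` all refer to one object that is refitted in turn. The predictions stay correct because each one is computed before the next refit, but the point model is no longer available afterwards. The sketch below shows one possible alternative using `sklearn.base.clone` to keep three independent estimators; the hyperparameter values are placeholders, not tuned settings.

```python
from sklearn.base import clone
from sklearn.ensemble import GradientBoostingRegressor

# set_params returns the same object, not a copy
point = GradientBoostingRegressor(n_estimators=50, max_depth=2, random_state=7)
alias = point.set_params(n_estimators=60)
print(alias is point)  # → True

# clone() creates a new unfitted estimator with the same hyperparameters,
# so the point model keeps its own loss setting
lower = clone(point).set_params(loss='quantile', alpha=0.025)
upper = clone(point).set_params(loss='quantile', alpha=0.975)

print(lower is upper)  # → False
print(lower.alpha, upper.alpha)  # → 0.025 0.975
```

With clones, the fitted point, lower, and upper models can all be inspected or reused after the loop iteration finishes.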

## 5. Generate PIs for three-week prediction horizon

In [None]:
counties_sgb_interval_21 = counties_shp.copy()

In [None]:
train_r2 = dict()
train_rmse = dict()
train_mae = dict()
test_rmse = dict()
test_mae = dict()
tuned_params_sgb_21 = dict()
tuned_params_sgb_lower_21 = dict()
tuned_params_sgb_upper_21 = dict()

models=['safegraph', 'facebook']
features = [safegraph_features, facebook_features]

gb_params = dict(learning_rate=np.arange(0.05,0.3,0.05), 
                     n_estimators=np.arange(100,1000,100), 
                     subsample = np.arange(0.1,0.9,0.05),
                     max_depth=[int(i) for i in np.arange(1,10,1)],
                     max_features = ['sqrt', 'log2']) 

for i in range(time_steps):
    
    training_df = covid_df.iloc[:(i+training_size)*num_counties,:]
    # copy the slice so that new columns can be added without modifying a view of covid_df
    testing_df = covid_df.iloc[(i+training_size)*num_counties:(i+training_size+testing_size)*num_counties,:].copy()

    
    for model,feature in zip(models, features):
    
        X_train = training_df[feature]
        y_train = training_df['LOG_DELTA_INC_RATE_T_21']
        X_test = testing_df[feature]
        y_test = testing_df['LOG_DELTA_INC_RATE_T_21'] 

        #scaling X
        scaler = MinMaxScaler()
        X_train = scaler.fit_transform(X_train)
        X_test = scaler.transform(X_test)


        #cross validation
        sgb_regressor_point = GradientBoostingRegressor(random_state=21)

        sgb_cv = RandomizedSearchCV(sgb_regressor_point, gb_params, random_state=21, 
                                    scoring='neg_root_mean_squared_error', n_jobs=-1)
        
        sgb_optimized = sgb_cv.fit(X_train, y_train)
        best_sgb = sgb_optimized.best_estimator_
        tuned_params_sgb_21[model, i] = sgb_optimized.best_params_
        
        
        # model evaluation for training set
        train_r2_sgb = round(best_sgb.score(X_train, y_train),2)
        train_r2[model, i] = train_r2_sgb

        y_train_predicted_sgb = best_sgb.predict(X_train)
        rmse_train_sgb = (np.sqrt(mean_squared_error(y_train, y_train_predicted_sgb)))
        train_rmse[model, i] = rmse_train_sgb
        train_mae[model, i] =  mean_absolute_error(y_train, y_train_predicted_sgb)

        # model evaluation for test set
        y_test_predicted_sgb = best_sgb.predict(X_test)
        rmse_test_gbr = (np.sqrt(mean_squared_error(y_test, y_test_predicted_sgb)))
        test_rmse[model, i] = rmse_test_gbr
        test_mae[model, i] = mean_absolute_error(y_test, y_test_predicted_sgb)
        
        
        # lower and upper interval predictions
        sgb_regressor_lower = best_sgb.set_params(loss='quantile', alpha=1-alpha)
        sgb_regressor_lower = sgb_regressor_lower.fit(X_train, y_train)
        
        y_test_predicted_sgb_lower = sgb_regressor_lower.predict(X_test)
        
        sgb_regressor_upper = best_sgb.set_params(loss='quantile', alpha=alpha)
        sgb_regressor_upper = sgb_regressor_upper.fit(X_train, y_train)
        
        
        y_test_predicted_sgb_upper = sgb_regressor_upper.predict(X_test)
        
        
        # add labels and predictions to a county data frame
        col_suffix = model +'_' + str(i)
        
        testing_df.loc[:,'y_test_'+ col_suffix] = y_test
        testing_df.loc[:,'y_predicted_'+ col_suffix] = y_test_predicted_sgb
        testing_df.loc[:,'y_predicted_lower_'+ col_suffix] = y_test_predicted_sgb_lower
        testing_df.loc[:,'y_predicted_upper_'+ col_suffix] = y_test_predicted_sgb_upper
        
        testing_df['delta_inc_test_'+ col_suffix] = np.exp(testing_df['y_test_'+ col_suffix]) - 1
        testing_df['delta_inc_pred_'+ col_suffix] = np.exp(testing_df['y_predicted_'+ col_suffix]) - 1
        testing_df['delta_inc_pred_lower_'+ col_suffix] = np.exp(testing_df['y_predicted_lower_'+ col_suffix]) - 1
        testing_df['delta_inc_pred_upper_'+ col_suffix] = np.exp(testing_df['y_predicted_upper_'+ col_suffix]) - 1
        
        testing_df['delta_case_test_'+ col_suffix] = (testing_df['delta_inc_test_'+ col_suffix] * 
                                                          testing_df['POPULATION']) / 10000
        
        testing_df['delta_case_pred_'+ col_suffix] = (testing_df['delta_inc_pred_'+ col_suffix] * 
                                                          testing_df['POPULATION']) / 10000
        
        testing_df['delta_case_pred_lower_'+ col_suffix] = (testing_df['delta_inc_pred_lower_'+ col_suffix] * 
                                                            testing_df['POPULATION']) / 10000
        
        testing_df['delta_case_pred_upper_'+ col_suffix] = (testing_df['delta_inc_pred_upper_'+ col_suffix] * 
                                                            testing_df['POPULATION']) / 10000
        
        testing_df['error_y_'+ col_suffix] = testing_df['y_test_'+ col_suffix] - testing_df['y_predicted_'+ col_suffix]
        
        testing_df['error_delta_inc_'+ col_suffix] = testing_df['delta_inc_test_'+ col_suffix] - \
                                                        testing_df['delta_inc_pred_'+ col_suffix]
        
        testing_df['error_delta_case_'+ col_suffix] = testing_df['delta_case_test_'+ col_suffix] - \
                                                        testing_df['delta_case_pred_'+ col_suffix]
        
        test_cols = ['GEOID',  
                     'y_test_'+ col_suffix, 'y_predicted_'+ col_suffix, 
                     'delta_inc_test_'+ col_suffix,  'delta_inc_pred_'+ col_suffix,
                     'delta_case_test_'+ col_suffix, 'delta_case_pred_'+ col_suffix,
                     'error_y_'+ col_suffix, 'error_delta_inc_'+ col_suffix, 'error_delta_case_'+ col_suffix,
                    'y_predicted_lower_'+ col_suffix, 'y_predicted_upper_'+ col_suffix,
                    'delta_case_pred_lower_'+ col_suffix, 'delta_case_pred_upper_'+ col_suffix]

        
        counties_sgb_interval_21 = counties_sgb_interval_21.merge(testing_df[test_cols], how='left', on='GEOID')
        
        print('Model {} in time step {} done!'.format(model, i))

In [None]:
counties_sgb_interval_21.to_file(output_dir + 'counties_sgb_interval_21.geojson', driver='GeoJSON')

## 6. Generate PIs for four-week prediction horizon

In [None]:
counties_sgb_interval_28 = counties_shp.copy()

In [None]:
train_r2 = dict()
train_rmse = dict()
train_mae = dict()
test_rmse = dict()
test_mae = dict()
tuned_params_sgb_28 = dict()
tuned_params_sgb_lower_28 = dict()
tuned_params_sgb_upper_28 = dict()

models=['safegraph', 'facebook']
features = [safegraph_features, facebook_features]

gb_params = dict(learning_rate=np.arange(0.05,0.3,0.05), 
                     n_estimators=np.arange(100,1000,100), 
                     subsample = np.arange(0.1,0.9,0.05),
                     max_depth=[int(i) for i in np.arange(1,10,1)],
                     max_features = ['sqrt', 'log2']) 

for i in range(time_steps):
    
    training_df = covid_df.iloc[:(i+training_size)*num_counties,:]
    # copy the slice so that new columns can be added without modifying a view of covid_df
    testing_df = covid_df.iloc[(i+training_size)*num_counties:(i+training_size+testing_size)*num_counties,:].copy()
    
    for model,feature in zip(models, features):
    
        X_train = training_df[feature]
        y_train = training_df['LOG_DELTA_INC_RATE_T_28']
        X_test = testing_df[feature]
        y_test = testing_df['LOG_DELTA_INC_RATE_T_28'] 

        #scaling X
        scaler = MinMaxScaler()
        X_train = scaler.fit_transform(X_train)
        X_test = scaler.transform(X_test)


        #cross validation
        sgb_regressor_point = GradientBoostingRegressor(random_state=28)

        sgb_cv = RandomizedSearchCV(sgb_regressor_point, gb_params, random_state=28, 
                                    scoring='neg_root_mean_squared_error', n_jobs=-1)
        sgb_optimized = sgb_cv.fit(X_train, y_train)
        best_sgb = sgb_optimized.best_estimator_
        tuned_params_sgb_28[model, i] = sgb_optimized.best_params_
        
        
        # model evaluation for training set
        train_r2_sgb = round(best_sgb.score(X_train, y_train),2)
        train_r2[model, i] = train_r2_sgb

        y_train_predicted_sgb = best_sgb.predict(X_train)
        rmse_train_sgb = (np.sqrt(mean_squared_error(y_train, y_train_predicted_sgb)))
        train_rmse[model, i] = rmse_train_sgb
        train_mae[model, i] =  mean_absolute_error(y_train, y_train_predicted_sgb)

        # model evaluation for test set
        y_test_predicted_sgb = best_sgb.predict(X_test)
        rmse_test_gbr = (np.sqrt(mean_squared_error(y_test, y_test_predicted_sgb)))
        test_rmse[model, i] = rmse_test_gbr
        test_mae[model, i] = mean_absolute_error(y_test, y_test_predicted_sgb)
        
        
        # lower and upper interval predictions
        sgb_regressor_lower = best_sgb.set_params(loss='quantile', alpha=1-alpha)
        sgb_regressor_lower = sgb_regressor_lower.fit(X_train, y_train)
        
        y_test_predicted_sgb_lower = sgb_regressor_lower.predict(X_test)
        
        sgb_regressor_upper = best_sgb.set_params(loss='quantile', alpha=alpha)
        sgb_regressor_upper = sgb_regressor_upper.fit(X_train, y_train)
        
        
        y_test_predicted_sgb_upper = sgb_regressor_upper.predict(X_test)

        
        # add labels and predictions to a county data frame
        col_suffix = model +'_' + str(i)
        
        testing_df.loc[:,'y_test_'+ col_suffix] = y_test
        testing_df.loc[:,'y_predicted_'+ col_suffix] = y_test_predicted_sgb
        testing_df.loc[:,'y_predicted_lower_'+ col_suffix] = y_test_predicted_sgb_lower
        testing_df.loc[:,'y_predicted_upper_'+ col_suffix] = y_test_predicted_sgb_upper
        
        testing_df['delta_inc_test_'+ col_suffix] = np.exp(testing_df['y_test_'+ col_suffix]) - 1
        testing_df['delta_inc_pred_'+ col_suffix] = np.exp(testing_df['y_predicted_'+ col_suffix]) - 1
        testing_df['delta_inc_pred_lower_'+ col_suffix] = np.exp(testing_df['y_predicted_lower_'+ col_suffix]) - 1
        testing_df['delta_inc_pred_upper_'+ col_suffix] = np.exp(testing_df['y_predicted_upper_'+ col_suffix]) - 1
        
        testing_df['delta_case_test_'+ col_suffix] = (testing_df['delta_inc_test_'+ col_suffix] * 
                                                          testing_df['POPULATION']) / 10000
        
        testing_df['delta_case_pred_'+ col_suffix] = (testing_df['delta_inc_pred_'+ col_suffix] * 
                                                          testing_df['POPULATION']) / 10000
        
        testing_df['delta_case_pred_lower_'+ col_suffix] = (testing_df['delta_inc_pred_lower_'+ col_suffix] * 
                                                            testing_df['POPULATION']) / 10000
        
        testing_df['delta_case_pred_upper_'+ col_suffix] = (testing_df['delta_inc_pred_upper_'+ col_suffix] * 
                                                            testing_df['POPULATION']) / 10000
        
        testing_df['error_y_'+ col_suffix] = testing_df['y_test_'+ col_suffix] - testing_df['y_predicted_'+ col_suffix]
        
        testing_df['error_delta_inc_'+ col_suffix] = testing_df['delta_inc_test_'+ col_suffix] - \
                                                        testing_df['delta_inc_pred_'+ col_suffix]
        
        testing_df['error_delta_case_'+ col_suffix] = testing_df['delta_case_test_'+ col_suffix] - \
                                                        testing_df['delta_case_pred_'+ col_suffix]

        
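        # columns to merge into the county data frame (same list as in the two- and three-week sections)
        test_cols = ['GEOID',  
                     'y_test_'+ col_suffix, 'y_predicted_'+ col_suffix, 
                     'delta_inc_test_'+ col_suffix,  'delta_inc_pred_'+ col_suffix,
                     'delta_case_test_'+ col_suffix, 'delta_case_pred_'+ col_suffix,
                     'error_y_'+ col_suffix, 'error_delta_inc_'+ col_suffix, 'error_delta_case_'+ col_suffix,
                     'y_predicted_lower_'+ col_suffix, 'y_predicted_upper_'+ col_suffix,
                     'delta_case_pred_lower_'+ col_suffix, 'delta_case_pred_upper_'+ col_suffix]

        counties_sgb_interval_28 = counties_sgb_interval_28.merge(testing_df[test_cols], how='left', on='GEOID')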
        
        print('Model {} in time step {} done!'.format(model, i))

In [None]:
counties_sgb_interval_28.to_file(output_dir + 'counties_sgb_interval_28.geojson', driver='GeoJSON')