# Positive breaks XGBoost model - local run with loop

### Peter R.
#### 2024-03-14

Note: Here I use h10p, that is Bfast breaks with h=0.1 or ~ 10%. The filtering of records with h10p is the samet than with h5p.

Today (2024-03-04) I meet with MJF to discuss XGB model improvements. An important number of Bfast breaks have very wide confidence intervals (CIs) associated with the time of break. These CIs can range from about 1 month to 80+ months. These won't allow for high quality matching with yearly climate or disturbance data. For this reason, we decided to run XGB models with subsets of data, each subset has narrow CIs. We will run the following XGB models with the follwoing dataframe subsets:

- Dataframe3 (df3): Records with CIs shorter than 3 16-days data points (48 days or about 1.5 months)

- Dataframe6 (df6): Records with CIs shorter than 6 16-days data points (96 days or about 3 months)

- Dataframe9 (df9): Records with CIs shorter than 9 16-days data points (144 days or about 5 months)

- Dataframe23 (df23): Records with CIS shorter than 23 16-days data points (368 days or about 1 year)

Some questions to have in mind:

- How many matches with disturbance data do the above have?
- Why does forest age become the top ranking variable with VIFplust variable set?  This variable was number 10 in other previous XGB model.
- I am assuming that Hansen is best and that it only includes stand-replacing disturbances


In [1]:
# 2024-03-12
# Peter R.
# XGBoost script
# Positive breaks, n_estimators (number of trees)=1000 and with optimal parameter from DRAC model_bp1 & early stopping

#Here I am using a loop to run several models at a time

import os
import time

import pandas as pd
from numpy import nan
import xgboost as xgb
from numpy import absolute
from pandas import read_csv
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedKFold
from xgboost import XGBRegressor
from sklearn.model_selection import train_test_split

from sklearn.model_selection import RandomizedSearchCV

from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score

# for feature importance plots
from matplotlib import pyplot
import matplotlib.pyplot as plt
import numpy as np

#for dependency plots
from sklearn.inspection import PartialDependenceDisplay

#start = time.time()

# Get the current working directory
cwd = os.getcwd()

#print(cwd)

# DRAC directory
#os.chdir("/home/georod/projects/def-mfortin/georod/scripts/github/forc_trends/models/xgboost")
# Win directory
os.chdir(r'C:\Users\Peter R\github\forc_trends\models\xgboost')


print("XGB version:", xgb.__version__)
print("negative breaks")


# Windows
df1 = pd.read_csv(r'.\data\forest_evi_breaks_positive_h10p_v3.csv', skipinitialspace=True)
# DRAC
#df1 = pd.read_csv(r'./data/forest_evi_breaks_positive_v2.csv', skipinitialspace=True)
#df1.head()


df11 = pd.get_dummies(df1, columns=['for_pro'], dtype=float)

#Df0: all rows
#df2 = df11 # N=843
# Df3: 1.5 months, version4/df3
#df2 = df11.loc[(df11['brkdate95']-df11['brkdate25'] <= 0.1315068) & (df11['magnitude'] < -700)] #N= 171
# Df6: 3 months, version4/df6
#df2 = df11.loc[(df11['brkdate95']-df11['brkdate25'] <= 0.2630137) & (df11['magnitude'] < -700)] #N= 450
# Df9: 5 months, version4/df9
# df2 = df11.loc[(df11['brkdate95']-df11['brkdate25'] <= 0.3945205) & (df11['magnitude']< -700)] #N=523
# Df23: 12 months, 1 year, version4/df23
#df2 = df11.loc[(df11['brkdate95']-df11['brkdate25'] <= 1.008219) & (df11['magnitude'] < -700)] #N=649

# Df23v2: 12 months, 1 year, bounded by same year, version4/df23v2
#df2 = df11.loc[(df11['brkdate95']-df11['brkdate25'] <= 1.008219) & (df11['magnitude'] < -700)] # N=2291
#df2 = df2.loc[(np.floor(df2['brkdate95'])==np.floor(df2['brkdate25']))]


XGB version: 1.7.6
negative breaks


In [14]:
#Df0: all rows
df0 = df11 # N=5073
# Df3: 1.5 months, version4/df3
df3 = df11.loc[(df11['brkdate95']-df11['brkdate25'] <= 0.09) & (df11['magnitude'] > 500)] #N= 0, 0.1315068
#Df6: 3 months, version4/df6
df6 = df11.loc[(df11['brkdate95']-df11['brkdate25'] <= 0.2630137) & (df11['magnitude'] > 500)] #N= 205
# Df9: 5 months, version4/df9
df9 = df11.loc[(df11['brkdate95']-df11['brkdate25'] <= 0.3945205) & (df11['magnitude'] > 500)] #N=487
# Df23: 12 months, 1 year, version4/df23
df23 = df11.loc[(df11['brkdate95']-df11['brkdate25'] <= 1.008219) & (df11['magnitude'] > 500)] #N=5347

#dfall = [df0,df3, df6, df9, df23]
dfall = [df0, df6, df9, df23]


In [16]:
print(dfall[3].describe()) 

                 pix         year     brk    brkdate25      brkdate  \
count    1185.000000  1185.000000  1185.0  1185.000000  1185.000000   
mean   228208.130802  2013.899578     0.0  2014.206620  2014.561610   
std    139166.964025     5.658276     0.0     5.601179     5.751487   
min       608.000000  2005.000000     0.0  2004.957000  2005.000000   
25%    117074.000000  2009.000000     0.0  2008.913000  2009.000000   
50%    220618.000000  2014.000000     0.0  2014.696000  2014.957000   
75%    331498.000000  2020.000000     0.0  2020.217000  2020.826000   
max    493282.000000  2020.000000     0.0  2020.826000  2020.957000   

         brkdate95    magnitude  no_brk  fire_year    harv_year  ...  \
count  1185.000000  1185.000000     0.0        0.0    30.000000  ...   
mean   2014.923141   765.512148     NaN        NaN  2010.200000  ...   
std       5.617967   181.518861     NaN        NaN     5.040525  ...   
min    2005.304000   500.471000     NaN        NaN  2005.000000  ...   


In [17]:
(dfall[1]).head()
#range(len(dfall))

Unnamed: 0,pix,year,brk,brkdate25,brkdate,brkdate95,magnitude,no_brk,fire_year,harv_year,...,bffp,bffp_lag1,bffp_lag2,bffp_lag3,effp,effp_lag1,effp_lag2,effp_lag3,for_pro_0,for_pro_1
123,195275,2006,0,2006.696,2006.739,2006.957,527.986,,,,...,140.0,135.0,151.0,145.0,265.0,273.0,268.0,267.0,1.0,0.0
441,44203,2007,0,2007.609,2007.783,2007.826,661.731,,,,...,137.0,137.0,129.98,147.087,276.0,267.913,276.913,270.913,1.0,0.0
1152,346407,2014,0,2014.304,2014.348,2014.522,1310.899,,,,...,137.0,140.0,136.0,137.0,275.0,268.0,272.0,275.0,1.0,0.0
1170,1873,2008,0,2008.522,2008.565,2008.739,770.015,,,,...,145.0,143.156,143.0,138.0,262.0,272.0,263.0,272.0,1.0,0.0
1171,1875,2008,0,2008.565,2008.609,2008.783,725.291,,,,...,145.0,143.0,143.0,137.62,262.0,272.0,263.0,272.0,1.0,0.0


In [18]:
# loop version

cols1 = ['for_age', 'for_con', 'map', 'map_lag1', 'map_lag2', 'map_lag3', 'mat', 'mat_lag1', 'mat_lag2', 'mat_lag3', 'rh', 'rh_lag1', 'rh_lag2', 'rh_lag3', 'for_pro_0']
cols2 = ['for_con', 'cmi_sm', 'cmi_sm_lag1', 'cmi_sm_lag2', 'cmi_sm_lag3', 'dd5_wt_lag1', 'dd5_wt_lag3']
cols3 = ['for_con', 'cmi_sm', 'cmi_sm_lag1', 'cmi_sm_lag2', 'cmi_sm_lag3', 'dd5_wt_lag1', 'dd5_wt_lag3', 'for_age', 'for_pro_0']

df_labs = ['df0','df3', 'df6', 'df9', 'df23']
model_labs = ['First variable set', 'VIF variable set', 'VIFplus variable set']

for z in range(len(dfall)):
    list_of_vars = [[cols1], [cols2], [cols3]]
    for index, list in enumerate(list_of_vars):
        for x in list:
            #print(x)
            X1 = dfall[z][x]
            #print(X1.describe())
            y1 = dfall[z].iloc[:,6].abs()
            seed = 7 # random seed to help with replication
            testsize1 = 0.33 # percent of records to test after training
            x1_train, x1_test, y1_train, y1_test = train_test_split(X1, y1, test_size=testsize1, random_state=seed) # Split data set. Note the 'stratify' option
            model_bp2 = XGBRegressor(base_score=None, booster=None, callbacks=None,
                 colsample_bylevel=None, colsample_bynode=None,
                 colsample_bytree=None, early_stopping_rounds=50,
                 enable_categorical=False, eval_metric=None, feature_types=None,
                 gamma=0.2, gpu_id=None, grow_policy=None, importance_type=None,
                 interaction_constraints=None, learning_rate=0.01, max_bin=None,
                 max_cat_threshold=None, max_cat_to_onehot=None,
                 max_delta_step=None, max_depth=8, max_leaves=None,
                 min_child_weight=None, missing=nan, monotone_constraints=None,
                 n_estimators=1000, n_jobs=None, num_parallel_tree=None,
                 predictor=None, random_state=42, reg_lambda=10, reg_alpha=1)
               # EVALUATION (with test)
            eval_set = [(x1_train, y1_train), (x1_test, y1_test)]
                #UserWarning: `eval_metric` in `fit` method is deprecated for better compatibility with scikit-learn, use `eval_metric` in constructor or`set_params` instead.
            model_bp2.fit(x1_train, y1_train, eval_set=eval_set, verbose=False)
                # make predictions for test data
            y_pred = model_bp2.predict(x1_test)
            predictions = [round(value) for value in y_pred]
                # retrieve performance metrics
            results = model_bp2.evals_result()
            mse = mean_squared_error(y1_test, y_pred)
                #r2 = explained_variance_score(y1_test, ypred)
            r2 = r2_score(y1_test, y_pred)
                # adjusted R-squared
            adj_r2 = 1 - (((1-r2) * (len(y1_test)-1))/(len(y1_test)-x1_test.shape[1]-1))

            #print("MSE: %.2f" % mse)
            var1 = "%.2f" % mse

            #print("RMSE: %.2f" % (mse**(1/2.0)))
            var2 = "%.2f" % (mse**(1/2.0))

            #print("R-sq: %.3f" % r2)
            var3 = "%.3f" % r2

            #print("R-sq-adj: %.3f" % adj_r2)
            var4 = "%.3f" % adj_r2

            var5 = X1.shape[0]
            var6 = X1.shape[1]

            # row for table
            #print("| %.2f" % mse, "| %.2f" % (mse**(1/2.0)), "| %.3f" % r2, "| %.3f" % adj_r2, "|", X1.shape[0], "|", X1.shape[1],"|")
            #print("|", z+1, "|", df_labs[z], "|", model_labs[index], "| %.2f" % mse, "| %.2f" % (mse**(1/2.0)), "| %.3f" % r2, "| %.3f" % adj_r2, "|", X1.shape[0], "|", X1.shape[1],"|")
            print("|", (z+1), "|", df_labs[z], "|", model_labs[index], "|", var1, "|", var2, "|", var3, "|", var4, "|",var5, "|",var6, "|")
            # Feature importance plot
            #xgb.plot_importance(model_bp2, ax=None, height=0.2, xlim=None, ylim=None, title='Feature importance, gain', 
            #            xlabel='F score - Gain', ylabel='Features', 
            #            importance_type='gain', max_num_features=15, grid=True, show_values=False) #, values_format='{v:.2f}' )

            #pyplot.savefig(r'.\figs\version4\h2p\df23\neg_gain_m{y}_v1.png'.format(y=len(x)),  dpi=300, bbox_inches='tight')
            #pyplot.show()
            # create lis of feature names to be used in dependency plot so that high ranking vars are plotted
            #features_names1 = pd.DataFrame()
           # features_names1['columns'] = X1.columns
           # features_names1['importances'] = model_bp2.feature_importances_
           # features_names1.sort_values(by='importances',ascending=False,inplace=True)
           # features_names2 = features_names1['columns'].tolist()[0:10]

           # _, ax1 = plt.subplots(figsize=(9, 8), constrained_layout=True)

           # display = PartialDependenceDisplay.from_estimator(model_bp2, x1_train, features_names2, ax=ax1)

           # _ = display.figure_.suptitle(("Partial dependence plots"), fontsize=12, )

           # pyplot.savefig(r'.\figs\version4\h2p\df23\neg_partial_dep_m{y}_v1.png'.format(y=len(x)),  dpi=300, bbox_inches='tight')

           # pyplot.show()
        

| 1 | df0 | First variable set | 24790.41 | 157.45 | 0.243 | 0.236 | 5073 | 15 |
| 1 | df0 | VIF variable set | 24217.77 | 155.62 | 0.260 | 0.257 | 5073 | 7 |
| 1 | df0 | VIFplus variable set | 24872.98 | 157.71 | 0.240 | 0.236 | 5073 | 9 |
| 2 | df3 | First variable set | 60622.99 | 246.22 | -0.337 | 2.528 | 25 | 15 |
| 2 | df3 | VIF variable set | 55962.67 | 236.56 | -0.234 | -8.875 | 25 | 7 |
| 2 | df3 | VIFplus variable set | 60842.01 | 246.66 | -0.342 | 11.736 | 25 | 9 |
| 3 | df6 | First variable set | 46846.05 | 216.44 | 0.139 | -1.036 | 80 | 15 |
| 3 | df6 | VIF variable set | 66782.91 | 258.42 | -0.228 | -0.681 | 80 | 7 |
| 3 | df6 | VIFplus variable set | 63568.00 | 252.13 | -0.169 | -0.788 | 80 | 9 |
| 4 | df9 | First variable set | 25140.81 | 158.56 | 0.176 | 0.143 | 1185 | 15 |
| 4 | df9 | VIF variable set | 26888.44 | 163.98 | 0.119 | 0.103 | 1185 | 7 |
| 4 | df9 | VIFplus variable set | 25794.95 | 160.61 | 0.154 | 0.135 | 1185 | 9 |


**Table 1**: Model comparison for negative breaks. Standard data set with all records (including NAs for for_age and for_con).


|ID|Data frame| Model   | MSE| RMSE| R-sq | R-sq-adj |N rows| N vars|
| --------| --------| --------| --------| -------- | ------- |-------- | ------- |------- |
| 1 | df0 | First variable set | 24790.41 | 157.45 | 0.243 | 0.236 | 5073 | 15 |
| 2 | df0 | VIF variable set | 24217.77 | 155.62 | 0.260 | 0.257 | 5073 | 7 |
| 3 | df0 | VIFplus variable set | 24872.98 | 157.71 | 0.240 | 0.236 | 5073 | 9 |
| 4 | df3 | First variable set | 60622.99 | 246.22 | -0.337 | 2.528 | 25 | 15 |
| 5 | df3 | VIF variable set | 55962.67 | 236.56 | -0.234 | -8.875 | 25 | 7 |
| 6 | df3 | VIFplus variable set | 60842.01 | 246.66 | -0.342 | 11.736 | 25 | 9 |
| 7 | df6 | First variable set | 46846.05 | 216.44 | 0.139 | -1.036 | 80 | 15 |
| 8 | df6 | VIF variable set | 66782.91 | 258.42 | -0.228 | -0.681 | 80 | 7 |
| 9 | df6 | VIFplus variable set | 63568.00 | 252.13 | -0.169 | -0.788 | 80 | 9 |
| 10 | df9 | First variable set | 25140.81 | 158.56 | 0.176 | 0.143 | 1185 | 15 |
| 11 | df9 | VIF variable set | 26888.44 | 163.98 | 0.119 | 0.103 | 1185 | 7 |
| 12 | df9 | VIFplus variable set | 25794.95 | 160.61 | 0.154 | 0.135 | 1185 | 9 |









In [137]:
# Misc cmds
#Describe the data
#X1 = df2[x]
#print(X1.shape)
#x
#print(len(cols1))
# Count NAs per columns to check that step above worked #mat 607 before, now 0
#X1.isna().sum()

In [138]:
#X1.describe()

### Models without records that have disturbance matches

When dealing with positive forest EVI breaks, I can't remove records matched to Hansen et al.'s disturbance data as there are no such matched records. This makes sense as positive breaks should not be matched to disturbances.

In [139]:
# How many records are matched to disturbance data?
#print(df2[['hansen_year']].describe()) # N=2775
#print(df2[['magnitude', 'fire_year', 'harv_year', 'canlad_year', 'hansen_year']].describe()) # Hansen=? with df4; Hansen=648 with df5

In [168]:
# Df0: all records
#df0 = df11 # N=687
df0 = df0.loc[df0['hansen_year'].isnull()]
# Df3: 1.5 months
#df3 = df11.loc[(df11['brkdate95']-df11['brkdate25'] <= 0.1315068) & (df11['magnitude']> 700)] #N=40
df3 = df3.loc[df3['hansen_year'].isnull()]
# Df6: 3 months
#df6 = df11.loc[(df11['brkdate95']-df11['brkdate25'] <= 0.2630137) & (df11['magnitude']> 700)] #N=168
df6 = df6.loc[df6['hansen_year'].isnull()] #N=168
# Df9: 5 months, version 4
#df9 = df11.loc[(df11['brkdate95']-df11['brkdate25'] <= 0.3945205) & (df11['magnitude']> 700)] #N=649
df9 = df9.loc[df9['hansen_year'].isnull()] 
# Df23: 12 months, 1 year, version 5
#df23 = df11.loc[(df11['brkdate95']-df11['brkdate25'] <= 1.008219) & (df11['magnitude']> 700)] #N=2216
df23 = df23.loc[df23['hansen_year'].isnull()] 
#dfall = [[df0],[df3], [df6], [df9], [df23]]
#dfall = [df0, df3, df6, df9, df23]
dfall = [df0, df6, df9, df23]

In [141]:
#print(df2[['canlad_year']].describe()) # 2483; 247
#print(df2[['harv_year']].describe()) # 1187; 204
#print(df2[['fire_year']].describe()) # 139; 107

In [142]:
# This produces an empty df as there are no records
#df3 = df2.drop(df2[df2.hansen_year > 0].index)
#X3.tail
#X3.shape
#df3.describe()
#df2.drop(df2[df2.hansen_year > 0].index, inplace=True) # gives a warning
#df2.shape

#df2 = df2.loc[df2['hansen_year'] > 0] # 2775

#df2 = df2.loc[df2['hansen_year'].isnull()]

#df2.shape

In [143]:
#df2.drop(df2[df2.hansen_year > 0].index).describe()

In [169]:
dfall[1].shape

(15, 158)

In [171]:
# loop version

cols1 = ['for_age', 'for_con', 'map', 'map_lag1', 'map_lag2', 'map_lag3', 'mat', 'mat_lag1', 'mat_lag2', 'mat_lag3', 'rh', 'rh_lag1', 'rh_lag2', 'rh_lag3', 'for_pro_0']
cols2 = ['for_con', 'cmi_sm', 'cmi_sm_lag1', 'cmi_sm_lag2', 'cmi_sm_lag3', 'dd5_wt_lag1', 'dd5_wt_lag3']
cols3 = ['for_con', 'cmi_sm', 'cmi_sm_lag1', 'cmi_sm_lag2', 'cmi_sm_lag3', 'dd5_wt_lag1', 'dd5_wt_lag3', 'for_age', 'for_pro_0']

#df_labs = ['df0','df3', 'df6', 'df9', 'df23']
df_labs = ['df0', 'df6','df9', 'df23']
model_labs = ['First variable set', 'VIF variable set', 'VIFplus variable set']

for z in range(len(dfall)):
    list_of_vars = [ [cols1], [cols2], [cols3]]
    for index, list in enumerate(list_of_vars):
        for x in list:
            #print(x)
            X1 = dfall[z][x]
            #print(X1.describe())
            y1 = dfall[z].iloc[:,6].abs()
            seed = 7 # random seed to help with replication
            testsize1 = 0.33 # percent of records to test after training
            x1_train, x1_test, y1_train, y1_test = train_test_split(X1, y1, test_size=testsize1, random_state=seed) # Split data set. Note the 'stratify' option
            model_bp2 = XGBRegressor(base_score=None, booster=None, callbacks=None,
                 colsample_bylevel=None, colsample_bynode=None,
                 colsample_bytree=None, early_stopping_rounds=50,
                 enable_categorical=False, eval_metric=None, feature_types=None,
                 gamma=0.2, gpu_id=None, grow_policy=None, importance_type=None,
                 interaction_constraints=None, learning_rate=0.01, max_bin=None,
                 max_cat_threshold=None, max_cat_to_onehot=None,
                 max_delta_step=None, max_depth=8, max_leaves=None,
                 min_child_weight=None, missing=nan, monotone_constraints=None,
                 n_estimators=1000, n_jobs=None, num_parallel_tree=None,
                 predictor=None, random_state=42, reg_lambda=10, reg_alpha=1)
               # EVALUATION (with test)
            eval_set = [(x1_train, y1_train), (x1_test, y1_test)]
                #UserWarning: `eval_metric` in `fit` method is deprecated for better compatibility with scikit-learn, use `eval_metric` in constructor or`set_params` instead.
            model_bp2.fit(x1_train, y1_train, eval_set=eval_set, verbose=False)
                # make predictions for test data
            y_pred = model_bp2.predict(x1_test)
            predictions = [round(value) for value in y_pred]
                # retrieve performance metrics
            results = model_bp2.evals_result()
            mse = mean_squared_error(y1_test, y_pred)
                #r2 = explained_variance_score(y1_test, ypred)
            r2 = r2_score(y1_test, y_pred)
                # adjusted R-squared
            adj_r2 = 1 - (((1-r2) * (len(y1_test)-1))/(len(y1_test)-x1_test.shape[1]-1))

            #print("MSE: %.2f" % mse)
            var1 = "%.2f" % mse

            #print("RMSE: %.2f" % (mse**(1/2.0)))
            var2 = "%.2f" % (mse**(1/2.0))

            #print("R-sq: %.3f" % r2)
            var3 = "%.3f" % r2

            #print("R-sq-adj: %.3f" % adj_r2)
            var4 = "%.3f" % adj_r2

            var5 = X1.shape[0]
            var6 = X1.shape[1]

             # row for table
            #print("| %.2f" % mse, "| %.2f" % (mse**(1/2.0)), "| %.3f" % r2, "| %.3f" % adj_r2, "|", X1.shape[0], "|", X1.shape[1],"|")
            print("|", (z+1), "|", df_labs[z], "|", model_labs[index], "|", var1, "|", var2, "|", var3, "|", var4, "|",var5, "|",var6, "|")
            # Feature importance plot
            #xgb.plot_importance(model_bp2, ax=None, height=0.2, xlim=None, ylim=None, title='Feature importance, gain', 
            #            xlabel='F score - Gain', ylabel='Features', 
            #            importance_type='gain', max_num_features=15, grid=True, show_values=False) #, values_format='{v:.2f}' )

            #pyplot.savefig(r'.\figs\version4\h2p\df23\neg_gain_m{y}_v2.png'.format(y=len(x)),  dpi=300, bbox_inches='tight')
            #pyplot.show()
            # create lis of feature names to be used in dependency plot so that high ranking vars are plotted
            #features_names1 = pd.DataFrame()
            #features_names1['columns'] = X1.columns
            #features_names1['importances'] = model_bp2.feature_importances_
            #features_names1.sort_values(by='importances',ascending=False,inplace=True)
            #features_names2 = features_names1['columns'].tolist()[0:10]

            #_, ax1 = plt.subplots(figsize=(9, 8), constrained_layout=True)

            #display = PartialDependenceDisplay.from_estimator(model_bp2, x1_train, features_names2, ax=ax1)

            #_ = display.figure_.suptitle(("Partial dependence plots"), fontsize=12, )

            #pyplot.savefig(r'.\figs\version4\h2p\df23\neg_partial_dep_m{y}_v2.png'.format(y=len(x)),  dpi=300, bbox_inches='tight')

            #pyplot.show()

| 1 | df0 | First variable set | 23664.83 | 153.83 | 0.386 | 0.383 | 11118 | 15 |
| 1 | df0 | VIF variable set | 23734.41 | 154.06 | 0.384 | 0.383 | 11118 | 7 |
| 1 | df0 | VIFplus variable set | 24011.27 | 154.96 | 0.377 | 0.375 | 11118 | 9 |
| 2 | df6 | First variable set | 25736.86 | 160.43 | 0.652 | 1.126 | 15 | 15 |
| 2 | df6 | VIF variable set | 32191.19 | 179.42 | 0.565 | 1.580 | 15 | 7 |
| 2 | df6 | VIFplus variable set | 32191.19 | 179.42 | 0.565 | 1.348 | 15 | 9 |
| 3 | df9 | First variable set | 45051.81 | 212.25 | 0.606 | 0.070 | 81 | 15 |
| 3 | df9 | VIF variable set | 55276.32 | 235.11 | 0.517 | 0.339 | 81 | 7 |
| 3 | df9 | VIFplus variable set | 55457.17 | 235.49 | 0.515 | 0.259 | 81 | 9 |
| 4 | df23 | First variable set | 28267.67 | 168.13 | 0.290 | 0.276 | 2425 | 15 |
| 4 | df23 | VIF variable set | 28986.44 | 170.25 | 0.272 | 0.265 | 2425 | 7 |
| 4 | df23 | VIFplus variable set | 28023.82 | 167.40 | 0.296 | 0.288 | 2425 | 9 |


**Table 2**: Model comparison for negative breaks. Subset of data records was used, excluding records that had a match with disturbance data. (Some NAs for for_age and for_con.)

|ID| Data frame| Model   | MSE| RMSE| R-sq | R-sq-adj | N rows| N vars|
| --------| --------| -------- | ------- |-------- | ------- |------- |------- |------- |
| 1 | df0 | First variable set | 23664.83 | 153.83 | 0.386 | 0.383 | 11118 | 15 |
| 2 | df0 | VIF variable set | 23734.41 | 154.06 | 0.384 | 0.383 | 11118 | 7 |
| 3 | df0 | VIFplus variable set | 24011.27 | 154.96 | 0.377 | 0.375 | 11118 | 9 |
| 4 | df6 | First variable set | 25736.86 | 160.43 | 0.652 | 1.126 | 15 | 15 |
| 5 | df6 | VIF variable set | 32191.19 | 179.42 | 0.565 | 1.580 | 15 | 7 |
| 6 | df6 | VIFplus variable set | 32191.19 | 179.42 | 0.565 | 1.348 | 15 | 9 |
| 7 | df9 | First variable set | 45051.81 | 212.25 | 0.606 | 0.070 | 81 | 15 |
| 8 | df9 | VIF variable set | 55276.32 | 235.11 | 0.517 | 0.339 | 81 | 7 |
| 9 | df9 | VIFplus variable set | 55457.17 | 235.49 | 0.515 | 0.259 | 81 | 9 |
| 10 | df23 | First variable set | 28267.67 | 168.13 | 0.290 | 0.276 | 2425 | 15 |
| 11 | df23 | VIF variable set | 28986.44 | 170.25 | 0.272 | 0.265 | 2425 | 7 |
| 12 | df23 | VIFplus variable set | 28023.82 | 167.40 | 0.296 | 0.288 | 2425 | 9 |






In [172]:
# check one more
# df2.shape
X1.shape

(2425, 9)