<a id="top"></a>

# Initial Exploratory Modeling

<ol>
    <li><a href="#loaddata">Load data</a></li>
    <li><a href="#preprocessing">Preprocessing</a></li>
    <li><a href="#modeldefinition">Model Definition</a></li>
    <li><a href="#modeltraining">Model Training and Evaluation</a></li>
    <li><a href="#summary">Summary</a></li>
</ol>




<a id="loaddata"></a>
## Load Data
[Top](#top)

In [3]:
import pandas as pd
import re
from datetime import datetime, timedelta
# Connect to Database
import getpass
from sqlalchemy.engine.url import URL
from sqlalchemy import create_engine
import pandas as pd
import re
from datetime import datetime, timedelta
%reload_ext sql

mypasswd = getpass.getpass()
username = 'dgyw5' # Replace with your pawprint
host = 'pgsql.dsa.lan'
database = 'caponl_22g2'

postgres_db = {'drivername': 'postgres',
               'username': username,
               'password': mypasswd,
               'host': host,
               'database': database}
engine = create_engine(URL(**postgres_db), echo=False)

connection_string = f'postgres://{username}:{mypasswd}@{host}/{database}'
%sql $connection_string
del mypasswd, connection_string

········


In [4]:
ghcn = %sql select * from ghcn_data where month in (1,2,3,11,12)
ghcn = ghcn.DataFrame()
ghcn['date'] = pd.to_datetime(ghcn['date'])

 * postgres://dgyw5:***@pgsql.dsa.lan/caponl_22g2
46355 rows affected.


<a id="preprocessing"></a>
## Preprocessing
[Top](#top)

We'll preprocess the data to fill in the missing values in a similar way as we did <a href="./Predict%20Missing%20Values.ipynb">here.</a>  Note that we will not be actually predicting missing values at this point.

In [5]:
# If the min temp doesn't get above freezing, we can assume total snowfall is 0
ghcn.loc[(ghcn.total_snowfall.isnull()) & (ghcn.min_temp >= 33), 'total_snowfall'] = 0

# If the snow depth is 0, the ground snow water equivalent should also be 0
ghcn.loc[(ghcn.ground_snow_water_equivalent_inches.isnull()) & (ghcn.snow_depth_inches == 0), 'ground_snow_water_equivalent_inches'] = 0

# For the remaining missing values, we're going to first try filling them in using the mean value for a given station's year/month
group_cols = ['station_id','year','month']
for col in ghcn.columns[8:]:
    ghcn[col] = ghcn[[col, *group_cols]].groupby(group_cols).transform(lambda x: x.fillna(x.mean()))[col]

# Failing that, we'll fill in the median value for a given year/month across all stations
group_cols = ['year','month']
for col in ghcn.columns[8:]:
    ghcn[col] = ghcn[[col, *group_cols]].groupby(group_cols).transform(lambda x: x.fillna(x.mean()))[col]
    
# Finally, failing that, we'll just fill in the mean by column
ghcn = ghcn.fillna(ghcn.mean())

In [6]:
# Confirming that we've filled in all the rows
print (len(ghcn))
print (len(ghcn.dropna()))

46355
46355


In [7]:
# Shape data by week
def getWeek(date, startOrEnd):
    start = date - timedelta(days=date.weekday()+1)
    if startOrEnd == 'start':
        return start
    elif startOrEnd == 'end':
        return start + timedelta(days=6)

def getWeekdayType(date):
    if date.weekday() in [4,5,6]:
        return 'weekend'
    else:
        return 'weekday'

ghcn['week_start'] = ghcn['date'].apply(lambda date: getWeek(date, 'start'))
ghcn['week_end'] = ghcn['date'].apply(lambda date: getWeek(date, 'end'))

In [8]:
# The columns will need to be aggregated differently depending on if they're purely quantitative vs binary.  
ghcn_quant_cols = ['total_precip', 'total_snowfall', 'avg_temp', 'max_temp', 'min_temp', 'snow_depth_inches',
                   'observed_temp_f']
ghcn_binary_cols = [c for c in ghcn.columns if 'indicator' in c]

# Aggregate quantitative columns using mean and max
aggMethods = dict(zip(ghcn_quant_cols, [['mean','max']] * len(ghcn_quant_cols)))

# Aggregate binary columns using sum to get a count of occurences per week
for c in ghcn_binary_cols:
    aggMethods[c] = 'sum'
    
ghcn_weekly = ghcn.groupby(['station_id','week_start','year','month']).agg(aggMethods).reset_index()
ghcn_weekly.columns = [re.sub('_$','','_'.join(col).strip()) for col in ghcn_weekly.columns.values]

Here we use SQL to load the daily traffic data for the winter months, aggregating by start of the week and splitting the traffic into weekday vs weekend.  We also select only the direction that is headed towards the nearest resorts.

In [9]:
query = """
select station_id traffic_station_id
      ,case when extract(dow from date) = 0 then date else date_trunc('week', date) - '1 day'::interval end::date week_start
      ,cast(case when extract(dow from date) = 0 then date else date_trunc('week', date) - '1 day'::interval end - '7 day'::interval as date) prior_week_start
      ,avg (case when extract(dow from date) in (5,6,0) then dma_adj_traffic else 0 end) avg_weekend_traffic
      ,avg (case when extract(dow from date) in (1,2,3,4) then dma_adj_traffic else 0 end) avg_weekday_traffic
from daily_traffic_data
where month in (1,2,3,11,12) 
and relative_resort_direction = 'towards'
group by station_id
, case when extract(dow from date) = 0 then date else date_trunc('week', date) - '1 day'::interval end
,cast(case when extract(dow from date) = 0 then date else date_trunc('week', date) - '1 day'::interval end - '7 day'::interval as date)
order by traffic_station_id, week_start
"""

traffic = %sql $query
traffic = traffic.DataFrame()

 * postgres://dgyw5:***@pgsql.dsa.lan/caponl_22g2
4279 rows affected.


In [10]:
traffic_weekly_same = traffic.set_index(['week_start'])[['traffic_station_id','avg_weekend_traffic','avg_weekday_traffic']]
traffic_weekly_prior = traffic.set_index(['prior_week_start'])[['traffic_station_id','avg_weekend_traffic','avg_weekday_traffic']]
traffic_weekly_same.index = pd.to_datetime(traffic_weekly_same.index)
traffic_weekly_prior.index = pd.to_datetime(traffic_weekly_prior.index)

ghcn_join = ghcn_weekly.set_index('week_start')

combined_same = traffic_weekly_same.join(ghcn_join).reset_index()
combined_prior = traffic_weekly_prior.join(ghcn_join).reset_index()
combined_prior.rename({'index':'week_start', 'week_end': 'prior_week_end'}, axis=1, inplace=True)

In [11]:
# Mapping of traffic stations to their corresponding weather stations
traffic_weather_map = {106: ['USC00050372', 'USC00058575','USS0006K24S','US1COSU0067'], 
                       308: ['USS0006L11S','USC00051959'],
                       105: ['USC00058575','USC00050372'],
                       219: ['USC00050372','USS0006L11S','USC00051959'], 
                       240: ['US1COSU0067'], 
                       120: ['USS0005K05S'],
                       119: ['USS0006K24S'],
                       236: ['USC00050372'],
                       126: ['USC00058575']}

<a id="modeldefinition"></a>
## Model Definition
[Top](#top)

In [23]:
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn import preprocessing
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split, GridSearchCV
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.compose import TransformedTargetRegressor
from sklearn.metrics import mean_absolute_error
from time import time


# We're trying a Gradient Boost model, since we have several variables that are weekly correlated. The hope is that this model
# would be able to leverage the combined correlations of all variables to predict the traffic.
gboost_param_grid = {'n_estimators': range(100,200,20),
                     'learning_rate': [0.05, 0.1, 0.15]}
gboost = GridSearchCV(GradientBoostingRegressor(), gboost_param_grid, cv=3, n_jobs=5)


# We're also trynig a Random Forest model as a sort of "all-purpose" ML regression model.
rforest_param_grid = {'n_estimators': range(10,200,20)}
rforest = GridSearchCV(RandomForestRegressor(), rforest_param_grid, cv=3, n_jobs=5)

# Finally, we're trying a lasso model, in the hopes that its built-in feature selection capabilities will be able to identify
# the most relevant variables
lasso_pipe = Pipeline([
    ('Lasso', TransformedTargetRegressor(regressor=Lasso(), transformer=preprocessing.StandardScaler()))
])
lasso_param_grid = {'Lasso__regressor__alpha': [0.5, 1.0, 1.5]}
lasso = GridSearchCV(lasso_pipe, lasso_param_grid, cv=3, n_jobs=5)

In [15]:
tStation = 106
X_train, X_test, y_train, y_test = get_train_test_split (tStation, 'same')

In [24]:
# Utility function to display the results of grid searches more easily
def GetModelParamsAndScores(model_grid):

    rowList = []
    rowDict = {}
    for params in model_grid.cv_results_['params']:
        for k,v in params.items():
            rowDict[k] = v
        rowList.append(rowDict.copy())
    
    models = pd.DataFrame(rowList)
    
    attribs = pd.DataFrame({k:v for k,v in model_grid.cv_results_.items() if not(k.startswith('param'))})
    models = pd.concat ([models,attribs], axis=1)
    return models

# Split the data into training and testing sets based on current or prior week's weather data
def get_train_test_split (tStation, sameprior, y_col='avg_weekend_traffic', test_size=0.25):
    
    if sameprior == 'prior':
        t = combined_prior[(combined_prior.traffic_station_id == tStation) & 
                           (combined_prior.station_id.isin(traffic_weather_map[tStation]))]
    else:
        t = combined_same[(combined_same.traffic_station_id == tStation) & 
                          (combined_same.station_id.isin(traffic_weather_map[tStation]))]
    
    
    X = t[t.columns[7:]]
    y = t[y_col]
    
    return train_test_split (X, y, test_size=test_size)  

# Fit and evaluate the model
def fit_and_evaluate (model, name):
    startTime = time()
    model_grid = model.fit(X_train, y_train)
    print (f'{name}:')
    print ('-'*50)
    print (f'fit done after {time() - startTime} secs')
    
    models = GetModelParamsAndScores(model_grid)
    print ('Best Model performance:\n',models[models.rank_test_score==1][['mean_train_score','mean_test_score']])
    preds = model_grid.predict(X_test)
    mae = mean_absolute_error(y_test, preds)   
    print ('Mean absolute error:', mae)
    print ('-'*50)

<a id="modeltraining"></a>
## Model training and evaluation
[Top](#top)

### Station 106 (I-70 near the Eisenhower Tunnel)

#### Same week's weather:

In [25]:
tStation = 106
X_train, X_test, y_train, y_test = get_train_test_split (tStation, 'same')
fit_and_evaluate(gboost, 'GradientBoost')
fit_and_evaluate(rforest, 'RandomForest')
fit_and_evaluate(lasso, 'Lasso')

GradientBoost:
--------------------------------------------------
fit done after 3.880502223968506 secs
Best Model performance:
    mean_train_score  mean_test_score
1          0.447755         0.071641
Mean absolute error: 1309.2157916467652
--------------------------------------------------
RandomForest:
--------------------------------------------------
fit done after 11.212236642837524 secs
Best Model performance:
    mean_train_score  mean_test_score
6          0.873058         0.027533
Mean absolute error: 1371.3985036168306
--------------------------------------------------
Lasso:
--------------------------------------------------
fit done after 0.992621898651123 secs
Best Model performance:
    mean_train_score  mean_test_score
0          0.060025          0.04924
Mean absolute error: 1320.417095192397
--------------------------------------------------


#### Prior week's weather

In [106]:
X_train, X_test, y_train, y_test = get_train_test_split (tStation, 'prior')
fit_and_evaluate(gboost, 'GradientBoost')
fit_and_evaluate(rforest, 'RandomForest')
fit_and_evaluate(lasso, 'Lasso')

GradientBoost:
--------------------------------------------------
fit done after 2.0260908603668213 secs
Best Model performance:
    mean_train_score  mean_test_score
0          0.464563         0.085671
Mean absolute error: 1243.8051775633946
--------------------------------------------------
RandomForest:
--------------------------------------------------
fit done after 10.666655540466309 secs
Best Model performance:
    mean_train_score  mean_test_score
8          0.868866         0.075573
Mean absolute error: 1277.36954473447
--------------------------------------------------
Lasso:
--------------------------------------------------
fit done after 1.232893705368042 secs
Best Model performance:
    mean_train_score  mean_test_score
2          0.123945         0.102886
Mean absolute error: 1331.7211931017484
--------------------------------------------------


### Station 219 (SH285 near Johnson Village)

#### Same week's weather:

In [107]:
tStation = 219
X_train, X_test, y_train, y_test = get_train_test_split (tStation, 'same')
fit_and_evaluate(gboost, 'GradientBoost')
fit_and_evaluate(rforest, 'RandomForest')
fit_and_evaluate(lasso, 'Lasso')

GradientBoost:
--------------------------------------------------
fit done after 1.96108078956604 secs
Best Model performance:
    mean_train_score  mean_test_score
0          0.470333         0.043111
Mean absolute error: 228.51337680835093
--------------------------------------------------
RandomForest:
--------------------------------------------------
fit done after 10.13311505317688 secs
Best Model performance:
    mean_train_score  mean_test_score
7          0.856047         0.020615
Mean absolute error: 230.14388927967613
--------------------------------------------------
Lasso:
--------------------------------------------------
fit done after 0.9815986156463623 secs
Best Model performance:
    mean_train_score  mean_test_score
2          0.104556          0.07443
Mean absolute error: 237.9580516522873
--------------------------------------------------


#### Prior week's weather

In [108]:
X_train, X_test, y_train, y_test = get_train_test_split (tStation, 'prior')
fit_and_evaluate(gboost, 'GradientBoost')
fit_and_evaluate(rforest, 'RandomForest')
fit_and_evaluate(lasso, 'Lasso')

GradientBoost:
--------------------------------------------------
fit done after 1.9278576374053955 secs
Best Model performance:
    mean_train_score  mean_test_score
0          0.416675        -0.034381
Mean absolute error: 216.8723395037638
--------------------------------------------------
RandomForest:
--------------------------------------------------
fit done after 8.91223692893982 secs
Best Model performance:
    mean_train_score  mean_test_score
6          0.851098        -0.027117
Mean absolute error: 220.7549276695378
--------------------------------------------------
Lasso:
--------------------------------------------------
fit done after 0.8671145439147949 secs
Best Model performance:
    mean_train_score  mean_test_score
2          0.045057         0.012794
Mean absolute error: 210.57328896159942
--------------------------------------------------


### Station 119 (I-70 near Copper Mountain

#### Same week's weather:

In [109]:
tStation = 119
X_train, X_test, y_train, y_test = get_train_test_split (tStation, 'same')
fit_and_evaluate(gboost, 'GradientBoost')
fit_and_evaluate(rforest, 'RandomForest')
fit_and_evaluate(lasso, 'Lasso')



GradientBoost:
--------------------------------------------------
fit done after 0.9956090450286865 secs
Best Model performance:
    mean_train_score  mean_test_score
5          0.849769         0.010841
Mean absolute error: 973.9886475701976
--------------------------------------------------
RandomForest:
--------------------------------------------------
fit done after 5.284533500671387 secs
Best Model performance:
    mean_train_score  mean_test_score
0           0.78117        -0.035026
Mean absolute error: 1082.0803904339289
--------------------------------------------------
Lasso:
--------------------------------------------------
fit done after 0.04161977767944336 secs
Best Model performance:
    mean_train_score  mean_test_score
2          0.125892         0.030113
Mean absolute error: 871.2791286796469
--------------------------------------------------




#### Prior week's weather

In [110]:
X_train, X_test, y_train, y_test = get_train_test_split (tStation, 'prior')
fit_and_evaluate(gboost, 'GradientBoost')
fit_and_evaluate(rforest, 'RandomForest')
fit_and_evaluate(lasso, 'Lasso')

GradientBoost:
--------------------------------------------------
fit done after 0.9860982894897461 secs
Best Model performance:
    mean_train_score  mean_test_score
0          0.677942        -0.331997
Mean absolute error: 1062.3959742671393
--------------------------------------------------
RandomForest:
--------------------------------------------------
fit done after 5.494194746017456 secs
Best Model performance:
    mean_train_score  mean_test_score
8          0.845508        -0.248694
Mean absolute error: 1099.2274291916333
--------------------------------------------------
Lasso:
--------------------------------------------------
fit done after 0.03727889060974121 secs
Best Model performance:
    mean_train_score  mean_test_score
2          0.123542        -0.136516
Mean absolute error: 1028.5206354895563
--------------------------------------------------


### Station 120 (I-70 near Idaho Springs)

#### Same week's weather

In [111]:
tStation = 120
X_train, X_test, y_train, y_test = get_train_test_split (tStation, 'same')
fit_and_evaluate(gboost, 'GradientBoost')
fit_and_evaluate(rforest, 'RandomForest')
fit_and_evaluate(lasso, 'Lasso')



GradientBoost:
--------------------------------------------------
fit done after 1.0104551315307617 secs
Best Model performance:
    mean_train_score  mean_test_score
0          0.601572        -0.113858
Mean absolute error: 2394.417957101593
--------------------------------------------------




RandomForest:
--------------------------------------------------
fit done after 4.430756092071533 secs
Best Model performance:
    mean_train_score  mean_test_score
5          0.853039        -0.077651
Mean absolute error: 2390.7774092265645
--------------------------------------------------
Lasso:
--------------------------------------------------
fit done after 0.04976797103881836 secs
Best Model performance:
    mean_train_score  mean_test_score
2          0.065746        -0.044591
Mean absolute error: 2278.2509703192427
--------------------------------------------------




#### Prior week's weather

In [112]:
X_train, X_test, y_train, y_test = get_train_test_split (tStation, 'prior')
fit_and_evaluate(gboost, 'GradientBoost')
fit_and_evaluate(rforest, 'RandomForest')
fit_and_evaluate(lasso, 'Lasso')

GradientBoost:
--------------------------------------------------
fit done after 1.0267963409423828 secs
Best Model performance:
    mean_train_score  mean_test_score
0          0.665461         0.114243
Mean absolute error: 2021.3468168328627
--------------------------------------------------
RandomForest:
--------------------------------------------------
fit done after 5.6809046268463135 secs
Best Model performance:
    mean_train_score  mean_test_score
9          0.876152         0.100024
Mean absolute error: 1953.2263044278259
--------------------------------------------------
Lasso:
--------------------------------------------------
fit done after 0.03663349151611328 secs
Best Model performance:
    mean_train_score  mean_test_score
2           0.16368           0.0789
Mean absolute error: 1978.9625650338585
--------------------------------------------------


### Station 308 (SH135 near Castle Mountain Rd)

#### Same week's weather

In [113]:
tStation = 308
X_train, X_test, y_train, y_test = get_train_test_split (tStation, 'same')
fit_and_evaluate(gboost, 'GradientBoost')
fit_and_evaluate(rforest, 'RandomForest')
fit_and_evaluate(lasso, 'Lasso')

GradientBoost:
--------------------------------------------------
fit done after 1.4862909317016602 secs
Best Model performance:
    mean_train_score  mean_test_score
0          0.411547        -0.084648
Mean absolute error: 152.8233674394701
--------------------------------------------------
RandomForest:
--------------------------------------------------
fit done after 6.837780714035034 secs
Best Model performance:
    mean_train_score  mean_test_score
5          0.852756        -0.043278
Mean absolute error: 148.01012852315867
--------------------------------------------------
Lasso:
--------------------------------------------------
fit done after 0.7034006118774414 secs
Best Model performance:
    mean_train_score  mean_test_score
2          0.019717        -0.032282
Mean absolute error: 145.51837180472424
--------------------------------------------------


#### Prior week's weather

In [114]:
X_train, X_test, y_train, y_test = get_train_test_split (tStation, 'prior')
fit_and_evaluate(gboost, 'GradientBoost')
fit_and_evaluate(rforest, 'RandomForest')
fit_and_evaluate(lasso, 'Lasso')



GradientBoost:
--------------------------------------------------
fit done after 1.5480408668518066 secs
Best Model performance:
    mean_train_score  mean_test_score
0          0.438843        -0.007268
Mean absolute error: 135.03302002855833
--------------------------------------------------
RandomForest:
--------------------------------------------------
fit done after 6.580195426940918 secs
Best Model performance:
    mean_train_score  mean_test_score
7          0.866447         0.012143
Mean absolute error: 129.38391563332937
--------------------------------------------------
Lasso:
--------------------------------------------------
fit done after 0.6516950130462646 secs
Best Model performance:
    mean_train_score  mean_test_score
1          0.066549          0.01742
Mean absolute error: 133.7144464339247
--------------------------------------------------


### Station 236 (SH82 near Snowmass Creek Rd)

#### Same week's weather

In [115]:
tStation = 236
X_train, X_test, y_train, y_test = get_train_test_split (tStation, 'same')
fit_and_evaluate(gboost, 'GradientBoost')
fit_and_evaluate(rforest, 'RandomForest')
fit_and_evaluate(lasso, 'Lasso')



GradientBoost:
--------------------------------------------------
fit done after 1.0113682746887207 secs
Best Model performance:
    mean_train_score  mean_test_score
0          0.663975        -0.166847
Mean absolute error: 603.5032790808024
--------------------------------------------------




RandomForest:
--------------------------------------------------
fit done after 5.409968614578247 secs
Best Model performance:
    mean_train_score  mean_test_score
5          0.851163         -0.09708
Mean absolute error: 588.6898417253682
--------------------------------------------------
Lasso:
--------------------------------------------------
fit done after 0.036058902740478516 secs
Best Model performance:
    mean_train_score  mean_test_score
2          0.114446         0.037279
Mean absolute error: 605.5449242167062
--------------------------------------------------


#### Prior week's weather

In [116]:
X_train, X_test, y_train, y_test = get_train_test_split (tStation, 'prior')
fit_and_evaluate(gboost, 'GradientBoost')
fit_and_evaluate(rforest, 'RandomForest')
fit_and_evaluate(lasso, 'Lasso')



GradientBoost:
--------------------------------------------------
fit done after 1.026543378829956 secs
Best Model performance:
    mean_train_score  mean_test_score
0           0.79811         0.007115
Mean absolute error: 490.72201784117755
--------------------------------------------------




RandomForest:
--------------------------------------------------
fit done after 5.3650062084198 secs
Best Model performance:
    mean_train_score  mean_test_score
9          0.869561        -0.040686
Mean absolute error: 495.7981842464929
--------------------------------------------------
Lasso:
--------------------------------------------------
fit done after 0.03937697410583496 secs
Best Model performance:
    mean_train_score  mean_test_score
2          0.151381        -0.004015
Mean absolute error: 483.22294528138565
--------------------------------------------------




### Station 126 (I-70 near West Vail)

#### Same week's weather

In [117]:
tStation = 126
X_train, X_test, y_train, y_test = get_train_test_split (tStation, 'same')
fit_and_evaluate(gboost, 'GradientBoost')
fit_and_evaluate(rforest, 'RandomForest')
fit_and_evaluate(lasso, 'Lasso')



GradientBoost:
--------------------------------------------------
fit done after 0.9452989101409912 secs
Best Model performance:
    mean_train_score  mean_test_score
0          0.712651        -0.175752
Mean absolute error: 1737.5152660206377
--------------------------------------------------




RandomForest:
--------------------------------------------------
fit done after 3.9301793575286865 secs
Best Model performance:
    mean_train_score  mean_test_score
3          0.833144        -0.099183
Mean absolute error: 1735.209608596519
--------------------------------------------------
Lasso:
--------------------------------------------------
fit done after 0.03977775573730469 secs
Best Model performance:
    mean_train_score  mean_test_score
2          0.056012        -0.054422
Mean absolute error: 1634.5520694955476
--------------------------------------------------




#### Prior week's weather

In [118]:
X_train, X_test, y_train, y_test = get_train_test_split (tStation, 'prior')
fit_and_evaluate(gboost, 'GradientBoost')
fit_and_evaluate(rforest, 'RandomForest')
fit_and_evaluate(lasso, 'Lasso')

GradientBoost:
--------------------------------------------------
fit done after 0.9817690849304199 secs
Best Model performance:
    mean_train_score  mean_test_score
0          0.681455        -0.192678
Mean absolute error: 1449.8981161501529
--------------------------------------------------
RandomForest:
--------------------------------------------------
fit done after 5.1246771812438965 secs
Best Model performance:
    mean_train_score  mean_test_score
7          0.838246        -0.127033
Mean absolute error: 1496.343236326029
--------------------------------------------------
Lasso:
--------------------------------------------------
fit done after 0.04642224311828613 secs
Best Model performance:
    mean_train_score  mean_test_score
2          0.070079        -0.080467
Mean absolute error: 1327.2365546192723
--------------------------------------------------


### Station 240 (SH9 near Breckenridge)

#### Same week's weather

In [119]:
tStation = 240
X_train, X_test, y_train, y_test = get_train_test_split (tStation, 'same')
fit_and_evaluate(gboost, 'GradientBoost')
fit_and_evaluate(rforest, 'RandomForest')
fit_and_evaluate(lasso, 'Lasso')

GradientBoost:
--------------------------------------------------
fit done after 0.5219249725341797 secs
Best Model performance:
    mean_train_score  mean_test_score
0          0.942045        -1.115859
Mean absolute error: 894.6615809187017
--------------------------------------------------
RandomForest:
--------------------------------------------------
fit done after 2.799424886703491 secs
Best Model performance:
    mean_train_score  mean_test_score
4          0.865634        -0.395594
Mean absolute error: 790.5417731374614
--------------------------------------------------
Lasso:
--------------------------------------------------
fit done after 0.0315854549407959 secs
Best Model performance:
    mean_train_score  mean_test_score
2          0.274969         -0.07299
Mean absolute error: 730.8378743426111
--------------------------------------------------


#### Prior week's weather

In [120]:
X_train, X_test, y_train, y_test = get_train_test_split (tStation, 'prior')
fit_and_evaluate(gboost, 'GradientBoost')
fit_and_evaluate(rforest, 'RandomForest')
fit_and_evaluate(lasso, 'Lasso')

GradientBoost:
--------------------------------------------------
fit done after 0.5004353523254395 secs
Best Model performance:
    mean_train_score  mean_test_score
0          0.913212         0.054996
Mean absolute error: 999.0315998461014
--------------------------------------------------
RandomForest:
--------------------------------------------------
fit done after 2.738407611846924 secs
Best Model performance:
    mean_train_score  mean_test_score
3          0.842349         0.123376
Mean absolute error: 816.710498995143
--------------------------------------------------
Lasso:
--------------------------------------------------
fit done after 0.031545400619506836 secs
Best Model performance:
    mean_train_score  mean_test_score
2          0.259411         0.165359
Mean absolute error: 746.318077825374
--------------------------------------------------




<a id="summary"></a>
## Summary
[Top](#top)

In general, we can see that these models did not perform very well with our data.  In some cases, the R-squared values for teh training data was very high, but that of the test data was very low.  This may indicate overfitting.  This indicated to us that attempting to use daily weather patterns to predict daily traffic patterns wasn't a viable strategy.  At this point, we then turned to using rolling averages.