Load in the new hourly counts dataset:

In [1]:
import numpy as np
import pandas as pd

In [4]:
df = pd.read_csv('../data/updated_calls_weather_tfk.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25465 entries, 0 to 25464
Data columns (total 23 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Unnamed: 0                25465 non-null  int64  
 1   year                      25465 non-null  int64  
 2   month                     25465 non-null  int64  
 3   day                       25465 non-null  int64  
 4   hour                      25465 non-null  int64  
 5   num_calls                 25465 non-null  int64  
 6   BRONX                     25465 non-null  int64  
 7   BROOKLYN                  25465 non-null  int64  
 8   MANHATTAN                 25465 non-null  int64  
 9   QUEENS                    25465 non-null  int64  
 10  RICHMOND / STATEN ISLAND  25465 non-null  int64  
 11  UNKNOWN                   25465 non-null  int64  
 12  STATION                   25465 non-null  object 
 13  NAME                      25465 non-null  object 
 14  DATE  

In [5]:
import matplotlib.pyplot as plt
import seaborn as sns

# sklearn tools:
from sklearn.model_selection import train_test_split  
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.metrics import mean_squared_error, r2_score
# sklearn models:
from sklearn.linear_model import LinearRegression
# tensorflow
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Input


Set up X and y, do train test split, and scale:

In [8]:
X = df[['year', 'month', 'day', 'hour', 'PRCP', 'SNOW', 'SNWD', 'TMAX', 'TMIN', 'TAVG_CALC']]
y = df['num_calls']
X_train, X_test, y_train, y_test = train_test_split(X,y,random_state=42)
sc = StandardScaler()
Z_train = sc.fit_transform(X_train)
Z_test = sc.transform(X_test)
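
The scaling step above fits the scaler on the training rows only and reuses those statistics for the test rows. As a rough illustration of what that means numerically, here is a NumPy-only sketch on made-up data (the arrays `X_tr` and `X_te` are hypothetical stand-ins, not the call dataset). It assumes the usual standardization, subtracting the per-column mean and dividing by the population standard deviation (ddof=0).

```python
import numpy as np

# Hypothetical stand-in data: 3 numeric columns, split 750 / 250
rng = np.random.default_rng(42)
X_tr = rng.normal(loc=50, scale=10, size=(750, 3))
X_te = rng.normal(loc=50, scale=10, size=(250, 3))

# Statistics come from the training rows only
mu = X_tr.mean(axis=0)
sigma = X_tr.std(axis=0)  # population std (ddof=0)

# The same mu and sigma are applied to both sets
Z_tr = (X_tr - mu) / sigma
Z_te = (X_te - mu) / sigma

print('train mean/std:', Z_tr.mean(axis=0), Z_tr.std(axis=0))
print('test  mean/std:', Z_te.mean(axis=0), Z_te.std(axis=0))
```

The training columns come out with mean 0 and standard deviation 1 by construction. The test columns are only approximately standardized, which is expected because no test information was used to compute the statistics.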

Baseline:

In [10]:
m = y_train.mean()
# Constant prediction: the training-set mean for every row
baseline_train = np.full(len(y_train), m)
baseline_test  = np.full(len(y_test), m)
R20 = r2_score(y_train,baseline_train)
R21 = r2_score(y_test,baseline_test)
RMS0= mean_squared_error(y_train,baseline_train,squared=False)
RMS1= mean_squared_error(y_test, baseline_test, squared=False)
print(f'BASELINE, call volume = {m}')
print(f'R2:  Train: {R20}, Test: {R21}')
print(f'RMS: Train: {RMS0}, Test: {RMS1}')

BASELINE, call volume = 136.38548539114043
R2:  Train: 0.0, Test: -1.7549351112977618e-05
RMS: Train: 54.20031558657461, Test: 54.340338222301554
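
The baseline scores follow from the definition R2 = 1 - SS_res / SS_tot. Predicting the training mean makes SS_res equal to SS_tot on the training set, so the train R2 is exactly 0. On the test set the test mean differs slightly from the training mean, so R2 comes out slightly negative. A minimal sketch on synthetic counts (the Poisson data and the helper name `r2` are assumptions for illustration, not the call data):

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Synthetic stand-in for hourly call counts
rng = np.random.default_rng(0)
y_tr = rng.poisson(lam=136, size=1000)
y_te = rng.poisson(lam=136, size=300)

# Constant prediction: the training mean
m = y_tr.mean()
r2_train = r2(y_tr, np.full(len(y_tr), m))
r2_test = r2(y_te, np.full(len(y_te), m))
print(f'R2:  Train: {r2_train}, Test: {r2_test}')
```

Any constant other than the test set's own mean gives a test R2 below zero, which is why a tiny negative value like the one above is not a sign of a bug.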


Linear Modeling:

In [39]:
lr1 = LinearRegression()
lr1.fit(X_train,y_train)
R20 = lr1.score(X_train,y_train)
R21 = lr1.score(X_test,y_test)
RMS0= mean_squared_error(y_train,lr1.predict(X_train),squared=False)
RMS1= mean_squared_error(y_test, lr1.predict(X_test), squared=False)
print(f'R2:  Train: {R20}, Test: {R21}')
print(f'RMS: Train: {RMS0}, Test: {RMS1}')

R2:  Train: 0.4032671580422418, Test: 0.4052229110923856
RMS: Train: 40.255567545763256, Test: 40.11893624015415


We know several features have non-linear relationships with num_calls. Let's try Polynomial Features:

In [40]:
for n in range(2,5):
    print()
    print(f'Polynomial Features of Degree {n}:')
    pf = PolynomialFeatures(degree=n)
    PF_train = pf.fit_transform(Z_train)
    PF_test  = pf.transform(Z_test)
    lr2 = LinearRegression()
    lr2.fit(PF_train,y_train)
    R20 = lr2.score(PF_train,y_train)
    R21 = lr2.score(PF_test,y_test)
    RMS0= mean_squared_error(y_train,lr2.predict(PF_train),squared=False)
    RMS1= mean_squared_error(y_test, lr2.predict(PF_test), squared=False)
    print(f'R2:  Train: {R20}, Test: {R21}')
    print(f'RMS: Train: {RMS0}, Test: {RMS1}')


Polynomial Features of Degree 2:
R2:  Train: 0.5116451501308716, Test: 0.5154334226943383
RMS: Train: 36.416963952985625, Test: 36.211708980596406

Polynomial Features of Degree 3:
R2:  Train: 0.6730675026092716, Test: 0.6749179787498814
RMS: Train: 29.79648465778076, Test: 29.659837639626755

Polynomial Features of Degree 4:
R2:  Train: 0.7600247774298331, Test: 0.7575773505001402
RMS: Train: 25.52814072703261, Test: 25.612908048162943


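Polynomial Features definitely helped! Test R2 rose from 0.405 for the plain linear model to 0.758 at degree 4, and test RMS fell from 40.1 to 25.6, with train and test scores staying close at every degree.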
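
It can help to see how quickly the feature count grows with degree. With n input columns and degree d, a full polynomial expansion that includes the bias column has C(n + d, d) terms. The sketch below computes that for the 10 input features used here; the helper name `n_poly_features` is made up for illustration, and it assumes the expansion keeps the bias column and all interaction terms (the PolynomialFeatures defaults).

```python
from math import comb

def n_poly_features(n_inputs, degree):
    """Number of monomials of total degree <= `degree` in `n_inputs` variables."""
    return comb(n_inputs + degree, degree)

n_inputs = 10  # year, month, day, hour, PRCP, SNOW, SNWD, TMAX, TMIN, TAVG_CALC
counts = {d: n_poly_features(n_inputs, d) for d in range(2, 5)}
print(counts)  # {2: 66, 3: 286, 4: 1001}
```

So the degree 4 model is a linear regression on 1001 columns, which is worth keeping in mind when comparing it with other models.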

Neural Net, first attempt:

In [41]:
# Set random seeds for reproducibility (NumPy's seed alone does not seed TensorFlow):
import tensorflow as tf
np.random.seed(42)
tf.random.set_seed(42)

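model = Sequential()
# Explicit Input layer (already imported above) instead of input_dim on the first Dense
model.add(Input(shape=(Z_train.shape[1],)))
model.add(Dense(32, activation='relu'))
model.add(Dense(16, activation='relu'))
model.add(Dense(8, activation='relu'))
model.add(Dense(1, activation=None))  # linear output for regression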

model.compile(loss='mse', optimizer='adam', metrics=['mae'])


results = model.fit(Z_train, y_train, epochs=100, batch_size=256,
                    validation_data=(Z_test, y_test))



Epoch 1/100
...
Epoch 77/100
[training log: per-epoch metrics not preserved; output cut off at epoch 78]
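
For a sense of the network's size, the trainable parameter count of a stack of Dense layers can be worked out by hand: each layer has (inputs x units) weights plus one bias per unit. The sketch below does that arithmetic for the layer widths used above, assuming 10 input features; the helper name `dense_param_count` is hypothetical.

```python
def dense_param_count(layer_sizes):
    """Weights plus biases for consecutive fully connected layers."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

# 10 inputs -> 32 -> 16 -> 8 -> 1 output
layer_sizes = [10, 32, 16, 8, 1]
total_params = dense_param_count(layer_sizes)
print(total_params)  # 1025
```

The per-layer counts are 352, 528, 136 and 9, so the whole network has 1025 trainable parameters.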

In [42]:
preds_nn0 = model.predict(Z_train)
preds_nn1 = model.predict(Z_test)

In [43]:
# lr2, PF_train and PF_test still hold the last loop iteration above (degree 4)
R20 = lr2.score(PF_train,y_train)
R21 = lr2.score(PF_test,y_test)
RMS0= mean_squared_error(y_train,lr2.predict(PF_train),squared=False)
RMS1= mean_squared_error(y_test, lr2.predict(PF_test), squared=False)
print('Best Linear Model:')
print(f'R2:  Train: {R20}, Test: {R21}')
print(f'RMS: Train: {RMS0}, Test: {RMS1}')

Best Linear Model:
R2:  Train: 0.7600247774298331, Test: 0.7575773505001402
RMS: Train: 25.52814072703261, Test: 25.612908048162943


In [44]:
R20 = r2_score(y_train,preds_nn0)
R21 = r2_score(y_test, preds_nn1)
RMS0= mean_squared_error(y_train,preds_nn0,squared=False)
RMS1= mean_squared_error(y_test, preds_nn1, squared=False)
print('Neural Net Model:')
print(f'R2:  Train: {R20}, Test: {R21}')
print(f'RMS: Train: {RMS0}, Test: {RMS1}')

Neural Net Model:
R2:  Train: 0.7941486260633767, Test: 0.7953976735485776
RMS: Train: 23.64356000382835, Test: 23.530306132286523


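The untuned neural net already edges out the best linear model on the test set (R2 0.795 vs. 0.758, RMS 23.5 vs. 25.6). So there is definitely some room for improvement, and we should be able to get a good model from this data.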