# Training a Model

Now that we've got data in Hopsworks and the architecture for updating it, we can go ahead and start writing our model training code. Since we're working with time series data that has strong seasonality, I'm going to use Meta's Prophet algorithm.

Since our data is hosted on Hopsworks, we first need to log in and get a handle on the feature store:

In [1]:
import hopsworks

project = hopsworks.login()

fs = project.get_feature_store()

Connected. Call `.close()` to terminate connection gracefully.

Logged in to project, explore it here https://c.app.hopsworks.ai:443/p/14486
Connected. Call `.close()` to terminate connection gracefully.




In [2]:
# Load feature groups.
zip_code = '60603'  # Chicago
country_code = 'US'
city = 'Chicago'

fg_name = f'aqi_{city}_{zip_code}'.lower()

aqi_online_fg = fs.get_feature_group(fg_name, version=1)

not_features = ['date', 'lat', 'lon', 'id']

ds_query = aqi_online_fg.select_except(not_features)

In [3]:
ds_query.show(5, online=True)

Unnamed: 0,co,no,no2,o3,so2,pm2_5,pm10,nh3,datetime,aqi
0,343.8,0.47,31.19,13.77,5.13,9.09,12.33,0.68,2020-11-27 05:00:00,1
1,240.33,0.05,12.85,57.22,5.19,1.07,2.38,0.76,2020-11-30 17:00:00,1
2,417.23,11.06,40.44,33.98,13.23,12.01,17.45,3.04,2020-12-02 16:00:00,2
3,487.33,14.08,41.13,15.02,9.66,19.76,25.85,5.26,2020-12-03 10:00:00,2
4,460.63,4.08,55.52,3.89,7.87,21.9,27.35,1.84,2020-12-07 07:00:00,3


Notice that the rows are not in chronological order. This is ok for now; we will sort them by `datetime` once the training data has been loaded.

Next, we map each pollutant feature to Hopsworks' built-in `standard_scaler` transformation function, which standardizes the values. The transformations are attached to the feature view we create below and applied when data is read through it.

In [8]:
# Load the transformation function we want.
standard_scaler = fs.get_transformation_function(name="standard_scaler")

# Map features to transformation function
transformation_functions = {
    'co': standard_scaler, 
    'no': standard_scaler, 
    'no2': standard_scaler, 
    'o3': standard_scaler,
    'so2': standard_scaler, 
    'pm2_5': standard_scaler, 
    'pm10': standard_scaler, 
    'nh3': standard_scaler
}
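
To make the effect of this mapping concrete, here is a minimal sketch of standard scaling in plain Python. It is not the Hopsworks implementation: the helper `standard_scale` is hypothetical, and it assumes the population standard deviation, which may differ from what `standard_scaler` actually computes. The inputs are the five `co` readings from the sample rows shown earlier.

```python
from statistics import fmean, pstdev


def standard_scale(values):
    """Return z-scores: (x - mean) / std, using the population std (an assumption)."""
    mean = fmean(values)
    std = pstdev(values)
    if std == 0:
        # A constant column has no spread to scale by; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


# The five `co` readings from the sample rows shown earlier.
co = [343.8, 240.33, 417.23, 487.33, 460.63]
co_scaled = standard_scale(co)

print([round(z, 3) for z in co_scaled])
```

After scaling, the values have mean 0 and standard deviation 1, which puts pollutants measured on very different scales (such as `co` and `nh3`) on an equal footing.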

Training data in Hopsworks is created from feature views, which are logical views over sets of features. Normally they are created by joining together different feature groups. Since we only have one feature group here, our feature view is built from a single query.

In [10]:
fv_name = f'{fg_name}_fv'

# Reuse the feature view if it already exists; otherwise create it.
try:
    feature_view = fs.get_feature_view(name=fv_name, version=1)
except Exception:
    feature_view = fs.create_feature_view(
        name=fv_name,
        version=1,
        description='feature view for creating training data',
        query=ds_query,
        labels=['aqi'],
        transformation_functions=transformation_functions,
    )

Feature view created successfully, explore it at 
https://c.app.hopsworks.ai:443/p/14486/fs/14406/fv/aqi_chicago_60603_fv/version/1


Now let's get the earliest and latest dates in the dataset to split our data into a training and testing set:

In [22]:
import datetime
import pandas as pd

newest_date = pd.to_datetime(fs.sql(f"SELECT MAX(`datetime`) FROM `{fg_name}_1`", online=True).values[0][0])
oldest_date = pd.to_datetime(fs.sql(f"SELECT MIN(`datetime`) FROM `{fg_name}_1`", online=True).values[0][0])

print(newest_date, oldest_date)

2023-01-13 12:00:00 2020-11-27 00:00:00


In [34]:
train_start = oldest_date
train_end = newest_date - datetime.timedelta(days=30)

test_start = train_end + datetime.timedelta(hours=1)
test_end = newest_date

print(train_start, train_end, test_start, test_end)

2020-11-27 00:00:00 2022-12-14 12:00:00 2022-12-14 13:00:00 2023-01-13 12:00:00
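
The split logic above can be wrapped in a small helper. This is a minimal sketch using only the standard library. The function name `split_bounds` and its parameters are illustrative and not part of the notebook, and it assumes hourly observations, so the test window starts one hour after the training window ends.

```python
from datetime import datetime, timedelta


def split_bounds(oldest, newest, test_days=30, step=timedelta(hours=1)):
    """Return (train_start, train_end, test_start, test_end) for a time-based split."""
    train_end = newest - timedelta(days=test_days)
    if train_end <= oldest:
        raise ValueError("not enough history for the requested test window")
    # Start the test window one step later so the two windows never overlap.
    test_start = train_end + step
    return oldest, train_end, test_start, newest


bounds = split_bounds(datetime(2020, 11, 27, 0), datetime(2023, 1, 13, 12))
for name, value in zip(("train_start", "train_end", "test_start", "test_end"), bounds):
    print(name, value)
```

Splitting by time rather than at random keeps every test observation strictly later than the training data, which mirrors how a forecasting model is used in practice.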


This gives us roughly two years of training data and one month (30 days) of testing data. Now we convert the boundaries to `%Y%m%d%H%M%S` strings, a format Hopsworks can understand:

In [35]:
train_start_str = train_start.strftime("%Y%m%d%H%M%S")
train_end_str = train_end.strftime("%Y%m%d%H%M%S")
test_start_str = test_start.strftime("%Y%m%d%H%M%S")
test_end_str = test_end.strftime("%Y%m%d%H%M%S")

print(train_start_str, train_end_str, test_start_str, test_end_str)

20201127000000 20221214120000 20221214130000 20230113120000
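
As a quick sanity check on the format string, the sketch below uses only the standard library to show that `%Y%m%d%H%M%S` produces a 14-digit string, and that parsing it back recovers the original timestamp to the second. The variable names are illustrative.

```python
from datetime import datetime

FMT = "%Y%m%d%H%M%S"  # year, month, day, hour, minute, second, with no separators

train_end = datetime(2022, 12, 14, 12, 0, 0)
encoded = train_end.strftime(FMT)
decoded = datetime.strptime(encoded, FMT)

print(encoded)
print(decoded == train_end)
```

Because every field is zero-padded, these strings also sort in chronological order when compared as plain text.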


In [40]:
print(f'aqi data for training {train_start} to {train_end}')

aqi data for training 2020-11-27 00:00:00 to 2022-12-14 12:00:00


In [41]:
# Create the training dataset based on an event-time filter.
# train_d, train_d_job = feature_view.create_training_data(
#     start_time=train_start_str,
#     end_time=train_end_str,
#     description=f'aqi data for training {train_start} to {train_end}',
#     data_format="csv",
#     coalesce=True,
#     write_options={'wait_for_job': False},
# )

Training dataset job started successfully, you can follow the progress at 
https://c.app.hopsworks.ai/p/14486/jobs/named/aqi_chicago_60603_fv_1_1_create_fv_td_17012023215333/executions





In [42]:
# Create the testing dataset based on an event-time filter.
# test_d, test_d_job = feature_view.create_training_data(
#     start_time=test_start_str,
#     end_time=test_end_str,
#     description=f'aqi data for testing {test_start} to {test_end}',
#     data_format="csv",
#     coalesce=True,
#     write_options={'wait_for_job': False},
# )

Training dataset job started successfully, you can follow the progress at 
https://c.app.hopsworks.ai/p/14486/jobs/named/aqi_chicago_60603_fv_1_2_create_fv_td_17012023215418/executions




Now that the training and testing datasets have been created (versions 1 and 2, respectively), we can access them like so:

In [71]:
train_x, train_y = feature_view.get_training_data(1)
test_x, test_y = feature_view.get_training_data(2)

Now we have a features dataframe (x) and a labels dataframe (y) for both the train and test sets. Simple!

In [72]:
train_x.head()

Unnamed: 0,co,no,no2,o3,so2,pm2_5,pm10,nh3,datetime
0,0.161839,-0.30381,0.411643,-0.80792,-0.412964,-0.434698,-0.351544,0.110326,2021-11-02T04:00:00.000Z
1,0.66773,-0.143022,1.524082,-1.386293,0.040377,-0.371978,-0.225976,0.723985,2021-11-02T20:00:00.000Z
2,-0.082379,-0.255417,-0.08466,-0.314349,-0.22783,-0.686437,-0.626309,0.001044,2021-11-02T18:00:00.000Z
3,1.295705,0.528229,1.866102,-1.583492,-0.213589,0.18477,0.439669,1.404895,2021-11-02T23:00:00.000Z
4,1.225936,0.458502,1.83215,-1.580905,-0.073552,0.051597,0.287772,1.350254,2021-11-02T22:00:00.000Z


In [75]:
train_x.datetime = pd.to_datetime(train_x.datetime)
test_x.datetime = pd.to_datetime(test_x.datetime)

In [76]:
# Rows are not in chronological order: sort the features by time,
# then realign the labels to the same index order.
train_x = train_x.sort_values("datetime")
train_y = train_y.reindex(train_x.index)

test_x = test_x.sort_values("datetime")
test_y = test_y.reindex(test_x.index)
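
Why reindex the labels? `sort_values` reorders the rows of the features frame but keeps their original index labels, so passing that index to `reindex` puts the labels in the same order. The toy example below, with made-up values, illustrates the idea. It assumes, as in the notebook, that features and labels share the same index.

```python
import pandas as pd

# Toy features and labels that share an index but are not in time order.
x = pd.DataFrame({
    "datetime": pd.to_datetime(
        ["2020-11-27 14:00", "2020-11-27 12:00", "2020-11-27 13:00"]
    ),
    "co": [3.0, 1.0, 2.0],
})
y = pd.DataFrame({"aqi": [3, 1, 2]})

x_sorted = x.sort_values("datetime")  # rows move, index labels travel with them
y_sorted = y.reindex(x_sorted.index)  # pick the labels in that same index order

print(x_sorted.index.tolist())
print(y_sorted["aqi"].tolist())
```

Sorting the features without realigning the labels would silently pair each timestamp with the wrong target value, so the two steps belong together.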

In [77]:
train_x.iloc[1]

co                           0.877038
no                           1.121428
no2                          1.079206
o3                          -1.490354
so2                          2.587156
pm2_5                        0.762138
pm10                         0.954092
nh3                          0.354108
datetime    2020-11-27 13:00:00+00:00
Name: 15561, dtype: object

In [78]:
train_y.iloc[1]

aqi    2
Name: 15561, dtype: int64

In [81]:
# Prophet does not accept timezone-aware timestamps, so drop the time zone information.
train_x['datetime'] = train_x['datetime'].dt.tz_localize(None)
test_x['datetime'] = test_x['datetime'].dt.tz_localize(None)
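
To see what `tz_localize(None)` does, here is a small self-contained sketch using two timestamps in the same ISO 8601 form as the training data. The series names are illustrative. On timezone-aware data, the call removes the time zone while keeping the wall-clock times unchanged.

```python
import pandas as pd

# Timestamps in the same form as the training data; the trailing Z means UTC.
raw = pd.Series(["2021-11-02T04:00:00.000Z", "2021-11-02T20:00:00.000Z"])

aware = pd.to_datetime(raw)         # timezone-aware (UTC)
naive = aware.dt.tz_localize(None)  # same wall-clock times, no time zone

print(aware.dt.tz, naive.dt.tz)
```

Since the source timestamps are already in UTC, the naive values are simply UTC times without the offset attached.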

Now we can bring in Prophet and train a model:

In [82]:
from prophet import Prophet

m = Prophet()

In [83]:
df = pd.concat([train_x.datetime, train_y], axis=1)
df.columns = ['ds', 'y']
df.head()

Unnamed: 0,ds,y
15567,2020-11-27 12:00:00,2
15561,2020-11-27 13:00:00,2
15568,2020-11-27 14:00:00,2
15560,2020-11-27 15:00:00,2
15564,2020-11-27 16:00:00,2


In [84]:
m.fit(df)

2023-01-17 17:58:18,460 DEBUG: input tempfile: /var/folders/0r/wbzfff2x7xn34__m82hhzyj40000gn/T/tmpujrqgvrq/_kc4o751.json
2023-01-17 17:58:18,989 DEBUG: input tempfile: /var/folders/0r/wbzfff2x7xn34__m82hhzyj40000gn/T/tmpujrqgvrq/pupuyzz8.json
2023-01-17 17:58:18,991 DEBUG: idx 0
2023-01-17 17:58:18,991 DEBUG: running CmdStan, num_threads: None
2023-01-17 17:58:18,991 DEBUG: CmdStan args: ['/Users/giorgio/.virtualenvs/MLBook/lib/python3.9/site-packages/prophet/stan_model/prophet_model.bin', 'random', 'seed=78265', 'data', 'file=/var/folders/0r/wbzfff2x7xn34__m82hhzyj40000gn/T/tmpujrqgvrq/_kc4o751.json', 'init=/var/folders/0r/wbzfff2x7xn34__m82hhzyj40000gn/T/tmpujrqgvrq/pupuyzz8.json', 'output', 'file=/var/folders/0r/wbzfff2x7xn34__m82hhzyj40000gn/T/tmpujrqgvrq/prophet_modelgo9rjbw1/prophet_model-20230117175818.csv', 'method=optimize', 'algorithm=lbfgs', 'iter=10000']


17:58:18 - cmdstanpy - INFO - Chain [1] start processing


2023-01-17 17:58:18,992 INFO: Chain [1] start processing


17:58:19 - cmdstanpy - INFO - Chain [1] done processing


2023-01-17 17:58:19,147 INFO: Chain [1] done processing


17:58:19 - cmdstanpy - ERROR - Chain [1] error: terminated by signal 6 Unknown error: -6


2023-01-17 17:58:19,149 ERROR: Chain [1] error: terminated by signal 6 Unknown error: -6
2023-01-17 17:58:19,151 DEBUG: input tempfile: /var/folders/0r/wbzfff2x7xn34__m82hhzyj40000gn/T/tmpujrqgvrq/dw_i37th.json
2023-01-17 17:58:19,694 DEBUG: input tempfile: /var/folders/0r/wbzfff2x7xn34__m82hhzyj40000gn/T/tmpujrqgvrq/blxfq0n0.json
2023-01-17 17:58:19,695 DEBUG: idx 0
2023-01-17 17:58:19,696 DEBUG: running CmdStan, num_threads: None
2023-01-17 17:58:19,696 DEBUG: CmdStan args: ['/Users/giorgio/.virtualenvs/MLBook/lib/python3.9/site-packages/prophet/stan_model/prophet_model.bin', 'random', 'seed=19013', 'data', 'file=/var/folders/0r/wbzfff2x7xn34__m82hhzyj40000gn/T/tmpujrqgvrq/dw_i37th.json', 'init=/var/folders/0r/wbzfff2x7xn34__m82hhzyj40000gn/T/tmpujrqgvrq/blxfq0n0.json', 'output', 'file=/var/folders/0r/wbzfff2x7xn34__m82hhzyj40000gn/T/tmpujrqgvrq/prophet_modely4gij80l/prophet_model-20230117175819.csv', 'method=optimize', 'algorithm=newton', 'iter=10000']


17:58:19 - cmdstanpy - INFO - Chain [1] start processing


2023-01-17 17:58:19,696 INFO: Chain [1] start processing


17:58:19 - cmdstanpy - INFO - Chain [1] done processing


2023-01-17 17:58:19,702 INFO: Chain [1] done processing


17:58:19 - cmdstanpy - ERROR - Chain [1] error: terminated by signal 6 Unknown error: -6


2023-01-17 17:58:19,703 ERROR: Chain [1] error: terminated by signal 6 Unknown error: -6


RuntimeError: Error during optimization! Command '/Users/giorgio/.virtualenvs/MLBook/lib/python3.9/site-packages/prophet/stan_model/prophet_model.bin random seed=19013 data file=/var/folders/0r/wbzfff2x7xn34__m82hhzyj40000gn/T/tmpujrqgvrq/dw_i37th.json init=/var/folders/0r/wbzfff2x7xn34__m82hhzyj40000gn/T/tmpujrqgvrq/blxfq0n0.json output file=/var/folders/0r/wbzfff2x7xn34__m82hhzyj40000gn/T/tmpujrqgvrq/prophet_modely4gij80l/prophet_model-20230117175819.csv method=optimize algorithm=newton iter=10000' failed: console log output:

dyld[9347]: Library not loaded: '@rpath/libtbb.dylib'
  Referenced from: '/Users/giorgio/.virtualenvs/MLBook/lib/python3.9/site-packages/prophet/stan_model/prophet_model.bin'
  Reason: tried: '/private/var/folders/0r/wbzfff2x7xn34__m82hhzyj40000gn/T/pip-install-5jja258v/prophet_71a13987954041aeb4d92b3bccc5056e/build/lib.macosx-12.1-arm64-cpython-39/prophet/stan_model/cmdstan-2.26.1/stan/lib/stan_math/lib/tbb/libtbb.dylib' (no such file), '/private/var/folders/0r/wbzfff2x7xn34__m82hhzyj40000gn/T/pip-install-5jja258v/prophet_71a13987954041aeb4d92b3bccc5056e/build/lib.macosx-12.1-arm64-cpython-39/prophet/stan_model/cmdstan-2.26.1/stan/lib/stan_math/lib/tbb/libtbb.dylib' (no such file), '/usr/local/lib/libtbb.dylib' (no such file), '/usr/lib/libtbb.dylib' (no such file)
