### Model Training Pipeline

This notebook retrains the model, then saves the model and its performance metrics (accuracy and AUC) to the Hopsworks.ai Model Registry.

It executes Notebook 07 as part of the process and uses the parameters set there (GPU or no GPU, whether to retune hyperparameters, and so on).

Notebook 07 is run with the `%run` magic, and its output is displayed in this notebook.

Notebook 07 includes experiment tracking with Neptune.ai.

In [1]:
import os

import pandas as pd
import numpy as np

import hopsworks

from hsml.schema import Schema
from hsml.model_schema import ModelSchema
from hsfs.client.exceptions import RestAPIError

from pathlib import Path  #for Windows/Linux compatibility
DATAPATH = Path(r'data')

import json

from datetime import datetime, timedelta

from src.hopsworks_utils import (
    convert_feature_names,
)

from dotenv import load_dotenv

# load variables from a local .env file, if one exists, into os.environ
load_dotenv()

**Connect to Hopsworks FeatureStore**

In [2]:
try:
    HOPSWORKS_API_KEY = os.environ['HOPSWORKS_API_KEY']
except KeyError:
    raise Exception('Set environment variable HOPSWORKS_API_KEY')

In [3]:
project = hopsworks.login(api_key_value=HOPSWORKS_API_KEY)
fs = project.get_feature_store()

Connected. Call `.close()` to terminate connection gracefully.


ConnectionError: HTTPSConnectionPool(host='c.app.hopsworks.ai', port=443): Max retries exceeded with url: /hopsworks-api/api/project (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x00000279E8F9CE20>: Failed to establish a new connection: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond'))

In [None]:
rolling_stats_fg = fs.get_or_create_feature_group(
    name="rolling_stats",
    version=2,
)

**Delete Old FeatureView**

In [None]:
try:
    feature_view = fs.get_feature_view(
        name = 'rolling_stats_fv',
        version = 2,
    )
    feature_view.delete()
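except RestAPIError as e:
    # error code 270009 is treated as "feature view does not exist"
    if e.response.json().get("errorCode", "") == 270009:
        print("Feature view does not exist. No need to delete it.")
    else:
        raise  # do not silently swallow other API errors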


**Create New FeatureView**

In [None]:
query = rolling_stats_fg.select_all()

feature_view = fs.create_feature_view(
    name = 'rolling_stats_fv',
    version = 2,
    query = query
)

**Create Training and Test Set**

Use a time filter: the most recent 45 days form the test set, and all earlier data (back to the 2003 season) form the training set.
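
The date arithmetic behind this split can be sketched on a plain pandas DataFrame. This is only an illustration under assumptions: `split_by_date` and the toy frame are hypothetical, and the real split is performed by Hopsworks through `create_training_data`. How Hopsworks treats rows that fall exactly on the cutoff is not shown in this notebook; the sketch assigns them to the test set.

```python
from datetime import datetime, timedelta

import pandas as pd


def split_by_date(df, date_col, today, test_days=45):
    """Split df into (train, test); rows on or after the cutoff go to test."""
    cutoff = today - timedelta(days=test_days)
    dates = pd.to_datetime(df[date_col])
    train_part = df[dates < cutoff]
    test_part = df[(dates >= cutoff) & (dates <= today)]
    return train_part, test_part


# Toy data (hypothetical): one row per day for 100 days ending on a fixed date.
today = datetime(2023, 3, 1)
toy = pd.DataFrame({
    "GAME_DATE_EST": pd.date_range(end=today, periods=100, freq="D"),
    "value": range(100),
})

train_toy, test_toy = split_by_date(toy, "GAME_DATE_EST", today)
print(len(train_toy), len(test_toy))
```

Fixing `today` in the sketch keeps the result reproducible; the notebook itself uses `datetime.now()`, so its cutoff moves with every run.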

In [None]:
STARTDATE = "2003-01-01" #data goes back to 2003 season
TODAY = datetime.now()
# despite its name, LASTYEAR is the cutoff date 45 days before today
LASTYEAR = (TODAY - timedelta(days=45)).strftime('%Y-%m-%d')
TODAY = TODAY.strftime('%Y-%m-%d')

td_train, td_job = feature_view.create_training_data(
        start_time=STARTDATE,
        end_time=LASTYEAR,    
        description='All data except last 45 days',
        data_format="csv",
        coalesce=True,
        write_options={'wait_for_job': False},
    )

td_test, td_job = feature_view.create_training_data(
        start_time=LASTYEAR,
        end_time=TODAY,    
        description='Last 45 days',
        data_format="csv",
        coalesce=True,
        write_options={'wait_for_job': False},
    )


In [None]:
train = feature_view.get_training_data(td_train)[0]
test = feature_view.get_training_data(td_test)[0]


**Re-Convert Feature Names**

- Hopsworks.ai converts all feature names to lowercase.
- To reuse the existing codebase, the names in the train and test dataframes must be converted back to their original mixed case.
- The original mixed-case feature names are read from a JSON file and then mapped back onto train and test.
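
The implementation of `convert_feature_names` lives in `src.hopsworks_utils` and is not shown here. A minimal sketch of the idea, assuming the JSON file holds a flat list of the original column names, might look like the following. The helper `restore_feature_names`, the file name `feature_names.json`, and the column `Example_Feature` are hypothetical.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd


def restore_feature_names(df, names_file):
    """Rename lowercase columns back to the original mixed-case names."""
    with open(names_file, "r") as fp:
        original_names = json.load(fp)
    # Map each lowercase name to its original spelling.
    mapping = {name.lower(): name for name in original_names}
    # Columns without a match in the mapping are left unchanged.
    return df.rename(columns=mapping)


# Usage with a temporary file standing in for the real JSON file.
with tempfile.TemporaryDirectory() as tmp:
    names_file = Path(tmp) / "feature_names.json"
    names_file.write_text(json.dumps(["GAME_DATE_EST", "TARGET", "Example_Feature"]))
    lowered = pd.DataFrame(
        {"game_date_est": ["2023-01-01"], "target": [1], "example_feature": [0.5]}
    )
    restored = restore_feature_names(lowered, names_file)

print(list(restored.columns))
```

A lookup keyed on the lowercase name only works if no two original names differ by case alone, which this sketch assumes.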

In [None]:

train = convert_feature_names(train)
test = convert_feature_names(test)

#fix date format
train["GAME_DATE_EST"] = train["GAME_DATE_EST"].str[:10]
test["GAME_DATE_EST"] = test["GAME_DATE_EST"].str[:10]


In [None]:
train

**Save data**

As a convenience, so that the existing model training notebook can be reused, the data is first saved to files (currently under 100 MB in total).

In [None]:
train.to_csv(DATAPATH / "train_selected.csv",index=False)
test.to_csv(DATAPATH / "test_selected.csv",index=False)

**Model Training**

The existing model training notebook is reused. It includes Neptune.ai experiment tracking for both the training run and hyperparameter tuning.


In [None]:
%run 07_model_testing.ipynb


**Save to Model Registry**



In [None]:
# read in train_predictions to create model schema
train = pd.read_csv(DATAPATH / "train_predictions.csv")
target = train['TARGET']
drop_columns = ['TARGET', 'PredictionPct', 'Prediction']
train = train.drop(columns=drop_columns)

input_schema = Schema(train)
output_schema = Schema(target)
model_schema = ModelSchema(input_schema=input_schema, output_schema=output_schema)

# read in model meta_data from training run
with open('model_data.json', 'rb') as fp:
    model_data = json.load(fp)
    

# log back in to hopsworks.ai, since hyperparameter tuning may take hours
# and the earlier session may no longer be valid
project = hopsworks.login(api_key_value=HOPSWORKS_API_KEY)
mr = project.get_model_registry()

model = mr.python.create_model(
    name=model_data['model_name'],
    metrics=model_data['metrics'],
    description=model_data['model_name'],
    model_schema=model_schema
)
model.save('model.pkl')
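
The cell above expects `model_data.json` to provide at least the keys `model_name` and `metrics`. The file is written by Notebook 07, whose code is not shown here, so the following is only a sketch of a compatible layout. The model name, the metric key names, and the metric values are placeholders, not results.

```python
import json
import tempfile
from pathlib import Path

# Placeholder metadata (hypothetical); the real values come from the training run.
example_model_data = {
    "model_name": "example_model",
    "metrics": {"Accuracy": 0.0, "AUC": 0.0},
}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "model_data.json"
    # Write the metadata as the training notebook might.
    with open(path, "w") as fp:
        json.dump(example_model_data, fp)
    # Read it back the same way this notebook does.
    with open(path, "rb") as fp:
        loaded = json.load(fp)

print(loaded["model_name"], loaded["metrics"])
```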








