### Production Features Pipeline

This notebook is run daily from a GitHub Action.

1. It scrapes the results of the previous day's games, performs feature engineering, and saves the results to the Feature Store at Hopsworks.ai.

2. It scrapes today's upcoming games and saves blank records for them to the Feature Store at Hopsworks.ai, so that the model behind the Streamlit app's prediction service can access them.

**Note:**
There are two options for webscraping in this notebook. 
Set the 'WEBSCRAPER' variable to either 'SCRAPINGANT' or 'SELENIUM' to choose which version to run.

1. SCRAPINGANT: Uses ScrapingAnt, a webscraping service with a Python API. It handles all the proxy server issues but requires an account. The free account allows for 1000 page requests, which is more than enough for this project. Proxies are required when running this notebook from a GitHub Action; without them, key data fails to be scraped from NBA.com.

2. SELENIUM: This option does not currently integrate proxy servers into the webscraping process, which can cause issues when scraping from certain locations, in particular GitHub Actions.

In [1]:
# select web scraper; 'SCRAPINGANT' or 'SELENIUM'
# SCRAPINGANT requires an account (the free tier is enough) but includes a proxy server

WEBSCRAPER = 'SCRAPINGANT'
#WEBSCRAPER = 'SELENIUM'

In [2]:
import os

import pandas as pd
import numpy as np

import hopsworks

from datetime import datetime, timedelta
from pytz import timezone

from src.webscraping import (
    get_new_games,
    activate_web_driver,
    get_todays_matchups,
)

from src.data_processing import (
    process_games,
    add_TARGET,
)

from src.feature_engineering import (
    process_features,
)

from src.hopsworks_utils import (
    save_feature_names,
    convert_feature_names,
)

import json

import time


from pathlib import Path  #for Windows/Linux compatibility
DATAPATH = Path(r'data')

**Load API keys**

In [3]:
from dotenv import load_dotenv

load_dotenv()

try:
    HOPSWORKS_API_KEY = os.environ['HOPSWORKS_API_KEY']
except KeyError:
    raise Exception('Set environment variable HOPSWORKS_API_KEY')

# if ScrapingAnt is chosen, set its API key; otherwise start the Selenium webdriver
if WEBSCRAPER == 'SCRAPINGANT':
    try:
        SCRAPINGANT_API_KEY = os.environ['SCRAPINGANT_API_KEY']
    except KeyError:
        raise Exception('Set environment variable SCRAPINGANT_API_KEY')
    driver = None

elif WEBSCRAPER == 'SELENIUM':
    driver = activate_web_driver('chromium')
    SCRAPINGANT_API_KEY = ""

else:
    # fail early instead of hitting a NameError on `driver` later
    raise ValueError("WEBSCRAPER must be 'SCRAPINGANT' or 'SELENIUM'")
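
The key-loading pattern above repeats for each secret, so it could be wrapped in a small helper. The following is only a sketch: `require_env` is a hypothetical name that does not exist in this repo, and a plain dict stands in for `os.environ` so the example runs without real keys.

```python
import os


def require_env(name, environ=os.environ):
    """Return a required environment variable or fail with a clear message.

    `require_env` is a hypothetical helper for illustration only.
    """
    value = environ.get(name)
    if not value:
        raise RuntimeError(f"Set environment variable {name}")
    return value


# A plain dict stands in for os.environ so the sketch runs anywhere
fake_env = {"HOPSWORKS_API_KEY": "abc123"}
hopsworks_key = require_env("HOPSWORKS_API_KEY", fake_env)
```

Treating an empty string the same as a missing variable is a design choice here: an unset GitHub Actions secret typically arrives as an empty value rather than as a missing key.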
    



**Scrape New Completed Games and Format Them**

In [4]:


df_new = get_new_games(SCRAPINGANT_API_KEY, driver)

# get the SEASON of the most recent game among the newly scraped games
# this will be used when constructing rows for prediction
SEASON = df_new['SEASON'].max()

df_new




<class 'pandas.core.frame.DataFrame'>
Int64Index: 50 entries, 0 to 49
Data columns (total 11 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   HOME            50 non-null     int64         
 1   GAME_DATE_EST   50 non-null     datetime64[ns]
 2   HOME_TEAM_WINS  50 non-null     int64         
 3   PTS             50 non-null     int64         
 4   FG_PCT          50 non-null     float64       
 5   FG3_PCT         50 non-null     float64       
 6   FT_PCT          50 non-null     float64       
 7   REB             50 non-null     int64         
 8   AST             50 non-null     int64         
 9   TEAM_ID         50 non-null     object        
 10  GAME_ID         50 non-null     object        
dtypes: datetime64[ns](1), float64(3), int64(5), object(2)
memory usage: 4.7+ KB
None


Unnamed: 0,GAME_DATE_EST,HOME_TEAM_WINS,PTS_home,FG_PCT_home,FG3_PCT_home,FT_PCT_home,REB_home,AST_home,HOME_TEAM_ID,GAME_ID,PTS_away,FG_PCT_away,FG3_PCT_away,FT_PCT_away,REB_away,AST_away,VISITOR_TEAM_ID,SEASON
0,2023-02-12,0,118,50.6,38.2,78.1,32,25,1610612765,22200856,119,48.3,36.7,66.7,40,21,1610612761,2022
1,2023-02-12,0,109,50.0,35.3,61.9,34,24,1610612763,22200855,119,44.3,41.2,76.9,54,28,1610612738,2022
2,2023-02-11,1,119,50.5,21.7,82.4,51,35,1610612743,22200847,105,47.1,25.0,58.6,37,25,1610612766,2022
3,2023-02-11,0,128,51.6,39.2,73.7,43,27,1610612742,22200854,133,51.0,25.0,92.0,50,22,1610612758,2022
4,2023-02-11,1,109,44.9,32.0,80.8,52,19,1610612747,22200853,103,41.7,27.3,91.7,43,31,1610612744,2022
5,2023-02-11,0,89,48.6,23.8,80.0,44,27,1610612741,22200852,97,41.9,25.7,84.2,39,26,1610612739,2022
6,2023-02-11,0,120,51.1,28.1,75.0,38,23,1610612762,22200851,126,45.3,33.3,73.7,52,18,1610612752,2022
7,2023-02-11,1,107,43.5,30.8,91.3,46,24,1610612748,22200848,103,42.5,37.9,78.3,44,22,1610612753,2022
8,2023-02-11,0,106,42.6,29.7,88.2,49,25,1610612759,22200850,125,48.5,38.5,78.9,48,29,1610612737,2022
9,2023-02-11,1,101,44.0,26.1,96.7,41,17,1610612755,22200846,98,40.2,32.5,84.6,42,23,1610612751,2022


**Retrieve Today's Games**

In [5]:
#retrieve list of teams playing today

# get today's games on NBA schedule
matchups, game_ids = get_todays_matchups(SCRAPINGANT_API_KEY, driver)


print(matchups)
print(game_ids)


[['1610612737', '1610612766'], ['1610612759', '1610612739'], ['1610612762', '1610612754'], ['1610612745', '1610612755'], ['1610612743', '1610612748'], ['1610612751', '1610612752'], ['1610612753', '1610612741'], ['1610612740', '1610612760'], ['1610612750', '1610612742'], ['1610612764', '1610612744'], ['1610612747', '1610612757']]
['22200857', '22200858', '22200859', '22200860', '22200861', '22200862', '22200863', '22200864', '22200865', '22200866', '22200867']


**Close Webdriver**

In [7]:
if WEBSCRAPER == 'SELENIUM':
    driver.close() 

**Create Rows for Today's Games with Empty Stats**

In [8]:
# build placeholder rows for today's matchups, using the same columns as df_new


df_today = df_new.drop(df_new.index) #empty copy of df_new with same columns
for i, matchup in enumerate(matchups):
    game_details = {'HOME_TEAM_ID': matchup[1], 
                    'VISITOR_TEAM_ID': matchup[0], 
                    'GAME_DATE_EST': datetime.now(timezone('EST')).strftime("%Y-%m-%d"), 
                    'GAME_ID': int(game_ids[i]),                       
                    'SEASON': SEASON,
                    } 
    game_details_df = pd.DataFrame(game_details, index=[i])
    # append to today's games dataframe
    df_today = pd.concat([df_today, game_details_df], ignore_index = True)

#blank rows will be filled with 0 to prevent issues with feature engineering
df_today = df_today.fillna(0) 

df_today


Unnamed: 0,GAME_DATE_EST,HOME_TEAM_WINS,PTS_home,FG_PCT_home,FG3_PCT_home,FT_PCT_home,REB_home,AST_home,HOME_TEAM_ID,GAME_ID,PTS_away,FG_PCT_away,FG3_PCT_away,FT_PCT_away,REB_away,AST_away,VISITOR_TEAM_ID,SEASON
0,2023-02-13,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1610612766,22200857,0.0,0.0,0.0,0.0,0.0,0.0,1610612737,2022
1,2023-02-13,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1610612739,22200858,0.0,0.0,0.0,0.0,0.0,0.0,1610612759,2022
2,2023-02-13,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1610612754,22200859,0.0,0.0,0.0,0.0,0.0,0.0,1610612762,2022
3,2023-02-13,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1610612755,22200860,0.0,0.0,0.0,0.0,0.0,0.0,1610612745,2022
4,2023-02-13,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1610612748,22200861,0.0,0.0,0.0,0.0,0.0,0.0,1610612743,2022
5,2023-02-13,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1610612752,22200862,0.0,0.0,0.0,0.0,0.0,0.0,1610612751,2022
6,2023-02-13,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1610612741,22200863,0.0,0.0,0.0,0.0,0.0,0.0,1610612753,2022
7,2023-02-13,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1610612760,22200864,0.0,0.0,0.0,0.0,0.0,0.0,1610612740,2022
8,2023-02-13,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1610612742,22200865,0.0,0.0,0.0,0.0,0.0,0.0,1610612750,2022
9,2023-02-13,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1610612744,22200866,0.0,0.0,0.0,0.0,0.0,0.0,1610612764,2022
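
As an alternative to concatenating one row at a time, the placeholder rows could be collected in a list and turned into a DataFrame in a single step. This is a sketch rather than the notebook's method: the column list is a shortened stand-in for the real `df_new` columns, and the date and season are hard-coded sample values.

```python
import pandas as pd

# Sample inputs in the same shape as the output of get_todays_matchups
matchups = [["1610612737", "1610612766"], ["1610612759", "1610612739"]]
game_ids = ["22200857", "22200858"]

# Shortened stand-in for df_new.columns (assumption for illustration)
columns = ["GAME_DATE_EST", "HOME_TEAM_WINS", "PTS_home", "HOME_TEAM_ID",
           "GAME_ID", "PTS_away", "VISITOR_TEAM_ID", "SEASON"]

rows = [
    {
        "HOME_TEAM_ID": home,        # second entry of each matchup, as in the cell above
        "VISITOR_TEAM_ID": visitor,  # first entry of each matchup
        "GAME_DATE_EST": "2023-02-13",
        "GAME_ID": int(game_id),
        "SEASON": 2022,
    }
    for (visitor, home), game_id in zip(matchups, game_ids)
]

# Columns missing from the dicts become NaN, then are filled with 0
df_today = pd.DataFrame(rows, columns=columns).fillna(0)
```

Building the frame once avoids the repeated copying that `pd.concat` inside a loop does, although with roughly a dozen games per day the difference is negligible.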


**Access Feature Store**

In [9]:
project = hopsworks.login(api_key_value=HOPSWORKS_API_KEY)

# HOPSWORKS can be kinda buggy and has been throwing a lot of errors recently or even just failing to return data
# so I'm adding a try/except block to retry the query if it fails
tries = 5

for i in range(tries):
    
    try:
        fs = project.get_feature_store()
    except KeyError as e:
        if i < tries - 1: # i is zero indexed
            time.sleep(30)
            continue
        else:
            raise ValueError('HOPSWORKS failed to connect')
    break



Connected. Call `.close()` to terminate connection gracefully.

Logged in to project, explore it here https://c.app.hopsworks.ai:443/p/3350




Connected. Call `.close()` to terminate connection gracefully.
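
The same retry loop appears in several cells of this notebook, so it could be factored into one function. The sketch below is an assumption about how that might look: `retry` is a hypothetical helper, and `flaky` is a stand-in that fails twice before succeeding so the example runs without a Hopsworks connection.

```python
import time


def retry(func, tries=5, delay=30.0, exceptions=(KeyError,), sleep=time.sleep):
    """Call func() until it succeeds or `tries` attempts have failed.

    Hypothetical helper for illustration; not part of this repo or of hopsworks.
    """
    for attempt in range(1, tries + 1):
        try:
            return func()
        except exceptions as exc:
            if attempt == tries:
                # keep the original error attached for easier debugging
                raise ValueError(f"gave up after {tries} attempts") from exc
            sleep(delay)


# Stand-in for a flaky call: fails twice, then succeeds
calls = {"count": 0}


def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise KeyError("transient failure")
    return "feature store handle"


# sleep is replaced by a no-op so the demonstration finishes instantly
result = retry(flaky, tries=5, delay=0, sleep=lambda seconds: None)
```

With the notebook's own objects, a call might look like `fs = retry(project.get_feature_store)`, which would behave like the loop in the cell above.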


**Access Feature Group**

In [10]:
# HOPSWORKS can be kinda buggy and has been throwing a lot of errors recently or even just failing to return data
# so I'm adding a try/except block to retry the query if it fails
tries = 5

for i in range(tries):
    
    try:
        rolling_stats_fg = fs.get_feature_group(
        name="rolling_stats",
        version=2,
        )
    except KeyError as e:
        if i < tries - 1: # i is zero indexed
            time.sleep(30)
            continue
        else:
            raise ValueError('HOPSWORKS failed to connect')
    break



**Query Old Data Needed for Feature Engineering of New Data**

To generate features such as rolling averages for the new games, data from earlier games is needed, because some rolling windows reach back roughly 15 to 20 games.
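
To illustrate why the history is needed, here is a minimal sketch of a per-team rolling average on toy data. It is not the code in `src.feature_engineering`, which is not shown in this notebook; the one-row-per-team layout and the column name `PTS_AVG_LAST_3` are assumptions for illustration.

```python
import pandas as pd

games = pd.DataFrame({
    "TEAM_ID": [1, 1, 1, 1, 2, 2, 2, 2],
    "GAME_DATE_EST": pd.to_datetime([
        "2023-02-01", "2023-02-03", "2023-02-05", "2023-02-07",
        "2023-02-01", "2023-02-03", "2023-02-05", "2023-02-07",
    ]),
    # the last game of each team is a blank placeholder, like today's rows
    "PTS": [100, 110, 120, 0, 90, 95, 100, 0],
})

games = games.sort_values(["TEAM_ID", "GAME_DATE_EST"]).reset_index(drop=True)

# shift(1) excludes the current game, so the feature only sees earlier results
games["PTS_AVG_LAST_3"] = (
    games.groupby("TEAM_ID")["PTS"]
    .transform(lambda s: s.shift(1).rolling(3).mean())
)
```

Only the fourth game of each team has three earlier games to average, so the first three rows per team stay `NaN`. The same thing happens for the oldest games in the full dataset, and it is why today's blank rows can only get features if the older games are loaded alongside them.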

In [11]:
BASE_FEATURES = ['game_date_est',
 'game_id',
 'home_team_id',
 'visitor_team_id',
 'season',
 'pts_home',
 'fg_pct_home',
 'ft_pct_home',
 'fg3_pct_home',
 'ast_home',
 'reb_home',
 'pts_away',
 'fg_pct_away',
 'ft_pct_away',
 'fg3_pct_away',
 'ast_away',
 'reb_away',
 'home_team_wins',
]

ds_query = rolling_stats_fg.select(BASE_FEATURES)

# HOPSWORKS can be kinda buggy and has been throwing a lot of errors recently or even just failing to return data
# so I'm adding a try/except block to retry the query if it fails
tries = 5

for i in range(tries):
    for j in range(tries):
        try:
            df_old = ds_query.read()
        except KeyError as e:
            if j < tries - 1: 
                time.sleep(10)
                continue
            else:
                raise ValueError('HOPSWORKS failed to connect')
        break

    if df_old.empty:
        if i < tries - 1: 
            time.sleep(10)
        else:
            raise ValueError('HOPSWORKS failed to return data')
    else:
        break



df_old


2023-02-13 07:59:49,775 INFO: USE `nba_predictor_featurestore`
2023-02-13 07:59:50,278 INFO: SELECT `fg0`.`game_date_est` `game_date_est`, `fg0`.`game_id` `game_id`, `fg0`.`home_team_id` `home_team_id`, `fg0`.`visitor_team_id` `visitor_team_id`, `fg0`.`season` `season`, `fg0`.`pts_home` `pts_home`, `fg0`.`fg_pct_home` `fg_pct_home`, `fg0`.`ft_pct_home` `ft_pct_home`, `fg0`.`fg3_pct_home` `fg3_pct_home`, `fg0`.`ast_home` `ast_home`, `fg0`.`reb_home` `reb_home`, `fg0`.`pts_away` `pts_away`, `fg0`.`fg_pct_away` `fg_pct_away`, `fg0`.`ft_pct_away` `ft_pct_away`, `fg0`.`fg3_pct_away` `fg3_pct_away`, `fg0`.`ast_away` `ast_away`, `fg0`.`reb_away` `reb_away`, `fg0`.`home_team_wins` `home_team_wins`
FROM `nba_predictor_featurestore`.`rolling_stats_2` `fg0`




Unnamed: 0,game_date_est,game_id,home_team_id,visitor_team_id,season,pts_home,fg_pct_home,ft_pct_home,fg3_pct_home,ast_home,reb_home,pts_away,fg_pct_away,ft_pct_away,fg3_pct_away,ast_away,reb_away,home_team_wins
0,2017-12-08,21700374,1610612759,1610612738,2017,105,0.468994,0.875000,0.295898,16,46,102,0.458008,0.881836,0.289062,14,39,1
1,2015-12-04,21500287,1610612742,1610612745,2015,96,0.457031,0.700195,0.275879,23,42,100,0.464111,0.556152,0.461914,20,45,0
2,2013-03-01,21200874,1610612756,1610612737,2012,92,0.444092,0.833008,0.455078,16,38,87,0.425049,0.772949,0.347900,21,43,1
3,2005-11-30,20500210,1610612738,1610612755,2005,110,0.447998,0.784180,0.250000,24,59,103,0.408936,0.770996,0.308105,21,40,1
4,2018-12-10,21800395,1610612749,1610612739,2018,108,0.437988,0.817871,0.416992,22,58,92,0.375000,0.666992,0.333008,24,46,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
23628,2023-02-10,22200852,1610612741,1610612739,2022,89,48.593750,80.000000,23.796875,27,44,97,41.906250,84.187500,25.703125,26,39,0
23629,2015-04-10,21401175,1610612753,1610612761,2014,99,0.460938,1.000000,0.350098,21,46,101,0.415039,0.933105,0.300049,22,46,0
23630,2005-01-15,20400537,1610612745,1610612759,2004,73,0.333008,0.730957,0.399902,15,49,67,0.353027,0.629883,0.125000,10,39,1
23631,2012-03-07,21100575,1610612749,1610612741,2011,104,0.477051,0.789062,0.333008,29,35,106,0.518066,0.881836,0.312988,26,43,0


**Convert Feature Names back to original mixed case**

In [12]:
#hopsworks converts all feature names to lowercase, and for code reuse, we need to convert them back
df_old = convert_feature_names(df_old)
df_old
df_old[df_old['PTS_home'] == 0]

Unnamed: 0,GAME_DATE_EST,GAME_ID,HOME_TEAM_ID,VISITOR_TEAM_ID,SEASON,PTS_home,FG_PCT_home,FT_PCT_home,FG3_PCT_home,AST_home,REB_home,PTS_away,FG_PCT_away,FT_PCT_away,FG3_PCT_away,AST_away,REB_away,HOME_TEAM_WINS
1801,2023-02-13,22200865,1610612742,1610612750,2022,0,0.0,0.0,0.0,0,0,0,0.0,0.0,0.0,0,0,0
5065,2023-02-13,22200858,1610612739,1610612759,2022,0,0.0,0.0,0.0,0,0,0,0.0,0.0,0.0,0,0,0
7576,2023-02-13,22200861,1610612748,1610612743,2022,0,0.0,0.0,0.0,0,0,0,0.0,0.0,0.0,0,0,0
7687,2023-02-13,22200864,1610612760,1610612740,2022,0,0.0,0.0,0.0,0,0,0,0.0,0.0,0.0,0,0,0
8284,2023-02-13,22200859,1610612754,1610612762,2022,0,0.0,0.0,0.0,0,0,0,0.0,0.0,0.0,0,0,0
9702,2023-02-13,22200862,1610612752,1610612751,2022,0,0.0,0.0,0.0,0,0,0,0.0,0.0,0.0,0,0,0
10634,2023-02-13,22200860,1610612755,1610612745,2022,0,0.0,0.0,0.0,0,0,0,0.0,0.0,0.0,0,0,0
15929,2023-02-13,22200857,1610612766,1610612737,2022,0,0.0,0.0,0.0,0,0,0,0.0,0.0,0.0,0,0,0
17528,2023-02-13,22200863,1610612741,1610612753,2022,0,0.0,0.0,0.0,0,0,0,0.0,0.0,0.0,0,0,0
17802,2023-02-13,22200867,1610612757,1610612747,2022,0,0.0,0.0,0.0,0,0,0,0.0,0.0,0.0,0,0,0
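
The implementation of `convert_feature_names` lives in `src.hopsworks_utils` and is not shown here. One plausible way to restore the original case, assuming the mixed-case names were saved beforehand, is a lookup keyed on the lowercase form. `restore_case` below is a hypothetical function written only for this sketch.

```python
def restore_case(lowercase_names, original_names):
    """Map lowercase column names back to their original mixed-case spelling.

    Hypothetical helper; names without a saved original are left unchanged.
    """
    lookup = {name.lower(): name for name in original_names}
    return [lookup.get(name, name) for name in lowercase_names]


original = ["GAME_DATE_EST", "PTS_home", "FG_PCT_away", "HOME_TEAM_WINS"]
from_store = ["game_date_est", "pts_home", "fg_pct_away", "home_team_wins", "unknown_col"]

restored = restore_case(from_store, original)
```

Applied to a DataFrame, this could be used as `df.columns = restore_case(df.columns, original)`.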


**Update Yesterday's Matchup Predictions with New Final Results**

In [13]:
# filter out games that are pending final results
# (these were the rows used for prediction yesterday)
# and then update these with the new results


# one approach is to simply drop the rows that were used for prediction yesterday
# which are games that have 0 points for home team
# and then append the new rows to the dataframe
df_old = df_old[df_old['PTS_home'] != 0]
df_old = pd.concat([df_old, df_new], ignore_index = True)

df_old

Unnamed: 0,GAME_DATE_EST,GAME_ID,HOME_TEAM_ID,VISITOR_TEAM_ID,SEASON,PTS_home,FG_PCT_home,FT_PCT_home,FG3_PCT_home,AST_home,REB_home,PTS_away,FG_PCT_away,FT_PCT_away,FG3_PCT_away,AST_away,REB_away,HOME_TEAM_WINS
0,2017-12-08,21700374,1610612759,1610612738,2017,105,0.468994,0.875000,0.295898,16,46,102,0.458008,0.881836,0.289062,14,39,1
1,2015-12-04,21500287,1610612742,1610612745,2015,96,0.457031,0.700195,0.275879,23,42,100,0.464111,0.556152,0.461914,20,45,0
2,2013-03-01,21200874,1610612756,1610612737,2012,92,0.444092,0.833008,0.455078,16,38,87,0.425049,0.772949,0.347900,21,43,1
3,2005-11-30,20500210,1610612738,1610612755,2005,110,0.447998,0.784180,0.250000,24,59,103,0.408936,0.770996,0.308105,21,40,1
4,2018-12-10,21800395,1610612749,1610612739,2018,108,0.437988,0.817871,0.416992,22,58,92,0.375000,0.666992,0.333008,24,46,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
23642,2023-02-10,22200836,1610612756,1610612754,2022,117,46.200000,87.500000,30.300000,25,53,104,42.400000,75.000000,28.200000,23,38,1
23643,2023-02-10,22200837,1610612752,1610612755,2022,108,53.100000,68.800000,34.400000,23,35,119,52.400000,77.300000,36.800000,27,40,0
23644,2023-02-09,22200834,1610612749,1610612747,2022,115,45.500000,72.200000,25.500000,24,51,106,47.200000,71.400000,38.900000,23,43,1
23645,2023-02-09,22200833,1610612741,1610612751,2022,105,46.200000,76.200000,19.200000,15,46,116,42.900000,84.000000,38.600000,26,49,0
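
The drop-and-append step can be seen on a toy example. The first part mirrors the cell above; the final de-duplication on `GAME_ID` is an optional addition that is not in the notebook, included as a possible safeguard in case a scraped batch overlaps with games already stored.

```python
import pandas as pd

df_old = pd.DataFrame({
    "GAME_ID": [22200850, 22200856, 22200855],
    "PTS_home": [106, 0, 0],  # zeros mark yesterday's prediction placeholders
    "PTS_away": [125, 0, 0],
})
df_new = pd.DataFrame({
    "GAME_ID": [22200856, 22200855],
    "PTS_home": [118, 109],
    "PTS_away": [119, 119],
})

# Same approach as the cell above: drop placeholders, then append final results
df_old = df_old[df_old["PTS_home"] != 0]
df_old = pd.concat([df_old, df_new], ignore_index=True)

# Optional safety net (not in the notebook): keep only the latest row per game
df_old = df_old.drop_duplicates(subset="GAME_ID", keep="last").reset_index(drop=True)
```

Filtering on `PTS_home != 0` relies on a finished NBA game never ending with zero points for the home team, which is a safe assumption in practice.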


**Add Today's Matchups for Feature Engineering**

In [14]:
df_combined = pd.concat([df_old, df_today], ignore_index = True)
df_combined

Unnamed: 0,GAME_DATE_EST,GAME_ID,HOME_TEAM_ID,VISITOR_TEAM_ID,SEASON,PTS_home,FG_PCT_home,FT_PCT_home,FG3_PCT_home,AST_home,REB_home,PTS_away,FG_PCT_away,FT_PCT_away,FG3_PCT_away,AST_away,REB_away,HOME_TEAM_WINS
0,2017-12-08 00:00:00,21700374,1610612759,1610612738,2017,105.0,0.468994,0.875000,0.295898,16.0,46.0,102.0,0.458008,0.881836,0.289062,14.0,39.0,1.0
1,2015-12-04 00:00:00,21500287,1610612742,1610612745,2015,96.0,0.457031,0.700195,0.275879,23.0,42.0,100.0,0.464111,0.556152,0.461914,20.0,45.0,0.0
2,2013-03-01 00:00:00,21200874,1610612756,1610612737,2012,92.0,0.444092,0.833008,0.455078,16.0,38.0,87.0,0.425049,0.772949,0.347900,21.0,43.0,1.0
3,2005-11-30 00:00:00,20500210,1610612738,1610612755,2005,110.0,0.447998,0.784180,0.250000,24.0,59.0,103.0,0.408936,0.770996,0.308105,21.0,40.0,1.0
4,2018-12-10 00:00:00,21800395,1610612749,1610612739,2018,108.0,0.437988,0.817871,0.416992,22.0,58.0,92.0,0.375000,0.666992,0.333008,24.0,46.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
23653,2023-02-13,22200863,1610612741,1610612753,2022,0.0,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.000000,0.000000,0.000000,0.0,0.0,0.0
23654,2023-02-13,22200864,1610612760,1610612740,2022,0.0,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.000000,0.000000,0.000000,0.0,0.0,0.0
23655,2023-02-13,22200865,1610612742,1610612750,2022,0.0,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.000000,0.000000,0.000000,0.0,0.0,0.0
23656,2023-02-13,22200866,1610612744,1610612764,2022,0.0,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.000000,0.000000,0.000000,0.0,0.0,0.0


**Data Processing**

In [15]:
df_combined = process_games(df_combined) 
df_combined = add_TARGET(df_combined)
df_combined

Unnamed: 0,GAME_DATE_EST,GAME_ID,HOME_TEAM_ID,VISITOR_TEAM_ID,SEASON,PTS_home,FG_PCT_home,FT_PCT_home,FG3_PCT_home,AST_home,REB_home,PTS_away,FG_PCT_away,FT_PCT_away,FG3_PCT_away,AST_away,REB_away,HOME_TEAM_WINS,PLAYOFF,TARGET
0,2017-12-08 00:00:00,21700374,1610612759,1610612738,2017,105.0,0.468994,0.875000,0.295898,16.0,46.0,102.0,0.458008,0.881836,0.289062,14.0,39.0,1.0,0,1.0
1,2015-12-04 00:00:00,21500287,1610612742,1610612745,2015,96.0,0.457031,0.700195,0.275879,23.0,42.0,100.0,0.464111,0.556152,0.461914,20.0,45.0,0.0,0,0.0
2,2013-03-01 00:00:00,21200874,1610612756,1610612737,2012,92.0,0.444092,0.833008,0.455078,16.0,38.0,87.0,0.425049,0.772949,0.347900,21.0,43.0,1.0,0,1.0
3,2005-11-30 00:00:00,20500210,1610612738,1610612755,2005,110.0,0.447998,0.784180,0.250000,24.0,59.0,103.0,0.408936,0.770996,0.308105,21.0,40.0,1.0,0,1.0
4,2018-12-10 00:00:00,21800395,1610612749,1610612739,2018,108.0,0.437988,0.817871,0.416992,22.0,58.0,92.0,0.375000,0.666992,0.333008,24.0,46.0,1.0,0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
23653,2023-02-13,22200863,1610612741,1610612753,2022,0.0,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.000000,0.000000,0.000000,0.0,0.0,0.0,0,0.0
23654,2023-02-13,22200864,1610612760,1610612740,2022,0.0,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.000000,0.000000,0.000000,0.0,0.0,0.0,0,0.0
23655,2023-02-13,22200865,1610612742,1610612750,2022,0.0,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.000000,0.000000,0.000000,0.0,0.0,0.0,0,0.0
23656,2023-02-13,22200866,1610612744,1610612764,2022,0.0,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.000000,0.000000,0.000000,0.0,0.0,0.0,0,0.0


**Feature Engineering**

In [16]:
# Feature engineering to add: 
    # rolling averages of key stats, 
    # win/lose streaks, 
    # home/away streaks, 
    # specific matchup (team X vs team Y) rolling averages and streaks

df_combined = process_features(df_combined)



#fix type conversion issues with hopsworks
df_combined['TARGET'] = df_combined['TARGET'].astype('int16')
df_combined['HOME_TEAM_WINS'] = df_combined['HOME_TEAM_WINS'].astype('int16')

df_combined


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


['HOME_PTS_home_AVG_LAST_3_HOME', 'HOME_PTS_home_AVG_LAST_7_HOME', 'HOME_PTS_home_AVG_LAST_10_HOME', 'HOME_FG_PCT_home_AVG_LAST_3_HOME', 'HOME_FG_PCT_home_AVG_LAST_7_HOME', 'HOME_FG_PCT_home_AVG_LAST_10_HOME', 'HOME_FT_PCT_home_AVG_LAST_3_HOME', 'HOME_FT_PCT_home_AVG_LAST_7_HOME', 'HOME_FT_PCT_home_AVG_LAST_10_HOME', 'HOME_FG3_PCT_home_AVG_LAST_3_HOME', 'HOME_FG3_PCT_home_AVG_LAST_7_HOME', 'HOME_FG3_PCT_home_AVG_LAST_10_HOME', 'HOME_AST_home_AVG_LAST_3_HOME', 'HOME_AST_home_AVG_LAST_7_HOME', 'HOME_AST_home_AVG_LAST_10_HOME', 'HOME_REB_home_AVG_LAST_3_HOME', 'HOME_REB_home_AVG_LAST_7_HOME', 'HOME_REB_home_AVG_LAST_10_HOME', 'HOME_TEAM_ID', 'GAME_DATE_EST']


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


['VISITOR_TEAM_WINS_AVG_LAST_3_VISITOR', 'VISITOR_TEAM_WINS_AVG_LAST_7_VISITOR', 'VISITOR_TEAM_WINS_AVG_LAST_10_VISITOR', 'VISITOR_PTS_away_AVG_LAST_3_VISITOR', 'VISITOR_PTS_away_AVG_LAST_7_VISITOR', 'VISITOR_PTS_away_AVG_LAST_10_VISITOR', 'VISITOR_FG_PCT_away_AVG_LAST_3_VISITOR', 'VISITOR_FG_PCT_away_AVG_LAST_7_VISITOR', 'VISITOR_FG_PCT_away_AVG_LAST_10_VISITOR', 'VISITOR_FT_PCT_away_AVG_LAST_3_VISITOR', 'VISITOR_FT_PCT_away_AVG_LAST_7_VISITOR', 'VISITOR_FT_PCT_away_AVG_LAST_10_VISITOR', 'VISITOR_FG3_PCT_away_AVG_LAST_3_VISITOR', 'VISITOR_FG3_PCT_away_AVG_LAST_7_VISITOR', 'VISITOR_FG3_PCT_away_AVG_LAST_10_VISITOR', 'VISITOR_AST_away_AVG_LAST_3_VISITOR', 'VISITOR_AST_away_AVG_LAST_7_VISITOR', 'VISITOR_AST_away_AVG_LAST_10_VISITOR', 'VISITOR_REB_away_AVG_LAST_3_VISITOR', 'VISITOR_REB_away_AVG_LAST_7_VISITOR', 'VISITOR_REB_away_AVG_LAST_10_VISITOR', 'VISITOR_TEAM_ID', 'GAME_DATE_EST']


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


['PTS_AVG_LAST_3_ALL', 'PTS_AVG_LAST_7_ALL', 'PTS_AVG_LAST_10_ALL', 'PTS_AVG_LAST_15_ALL', 'FG_PCT_AVG_LAST_3_ALL', 'FG_PCT_AVG_LAST_7_ALL', 'FG_PCT_AVG_LAST_10_ALL', 'FG_PCT_AVG_LAST_15_ALL', 'FT_PCT_AVG_LAST_3_ALL', 'FT_PCT_AVG_LAST_7_ALL', 'FT_PCT_AVG_LAST_10_ALL', 'FT_PCT_AVG_LAST_15_ALL', 'FG3_PCT_AVG_LAST_3_ALL', 'FG3_PCT_AVG_LAST_7_ALL', 'FG3_PCT_AVG_LAST_10_ALL', 'FG3_PCT_AVG_LAST_15_ALL', 'AST_AVG_LAST_3_ALL', 'AST_AVG_LAST_7_ALL', 'AST_AVG_LAST_10_ALL', 'AST_AVG_LAST_15_ALL', 'REB_AVG_LAST_3_ALL', 'REB_AVG_LAST_7_ALL', 'REB_AVG_LAST_10_ALL', 'REB_AVG_LAST_15_ALL', 'TEAM1', 'GAME_DATE_EST']


Unnamed: 0,GAME_DATE_EST,GAME_ID,HOME_TEAM_ID,VISITOR_TEAM_ID,SEASON,PTS_home,FG_PCT_home,FT_PCT_home,FG3_PCT_home,AST_home,...,FG3_PCT_AVG_LAST_10_ALL_x_minus_y,FG3_PCT_AVG_LAST_15_ALL_x_minus_y,AST_AVG_LAST_3_ALL_x_minus_y,AST_AVG_LAST_7_ALL_x_minus_y,AST_AVG_LAST_10_ALL_x_minus_y,AST_AVG_LAST_15_ALL_x_minus_y,REB_AVG_LAST_3_ALL_x_minus_y,REB_AVG_LAST_7_ALL_x_minus_y,REB_AVG_LAST_10_ALL_x_minus_y,REB_AVG_LAST_15_ALL_x_minus_y
0,2003-10-28,20300003,1610612747,1610612742,2003,109,0.505859,0.600098,0.350098,32,...,,,,,,,,,,
1,2003-10-28,20300001,1610612755,1610612748,2003,89,0.439941,0.533203,0.350098,25,...,,,,,,,,,,
2,2003-10-28,20300002,1610612759,1610612756,2003,83,0.425049,0.769043,0.099976,20,...,,,,,,,,,,
3,2003-10-29,20300004,1610612738,1610612748,2003,98,0.506836,0.730957,0.312988,28,...,,,,,,,,,,
4,2003-10-29,20300012,1610612762,1610612757,2003,99,0.575195,0.713867,0.556152,25,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
23628,2023-02-13,22200858,1610612739,1610612759,2022,0,0.000000,0.000000,0.000000,0,...,6.814063,3.941667,4.666667,3.571429,3.8,2.333333,-3.000000,0.571429,-3.1,-2.000000
23629,2023-02-13,22200862,1610612752,1610612751,2022,0,0.000000,0.000000,0.000000,0,...,-4.606250,-2.890625,-4.000000,0.857143,-1.7,-2.733333,1.000000,3.428571,5.5,4.866667
23630,2023-02-13,22200863,1610612741,1610612753,2022,0,0.000000,0.000000,0.000000,0,...,-5.223438,-5.546875,-1.000000,3.285714,0.4,0.600000,2.333333,-0.285714,1.4,1.733333
23631,2023-02-13,22200859,1610612754,1610612762,2022,0,0.000000,0.000000,0.000000,0,...,1.220313,-0.479167,-2.333333,0.571429,-0.6,0.666667,-10.666667,-6.285714,-5.6,-6.000000


**Insert New Data into Feature Group**

In [17]:
# HOPSWORKS can be kinda buggy and has been throwing a lot of errors recently
# so I'm adding a try/except block to retry the insert if it fails
tries = 5

for i in range(tries):
    
    try:
        rolling_stats_fg.insert(df_combined, overwrite = True, write_options={"wait_for_job" : False})
    except KeyError as e:
        if i < tries - 1: 
            time.sleep(30)
            continue
        else:
            raise ValueError('HOPSWORKS failed to connect')
    break





Uploading Dataframe: 0.00% |          | Rows 0/23633 | Elapsed Time: 00:00 | Remaining Time: ?

Launching offline feature group backfill job...
Backfill Job started successfully, you can follow the progress at 
https://c.app.hopsworks.ai/p/3350/jobs/named/rolling_stats_2_offline_fg_backfill/executions


In [18]:
df_combined

Unnamed: 0,GAME_DATE_EST,GAME_ID,HOME_TEAM_ID,VISITOR_TEAM_ID,SEASON,PTS_home,FG_PCT_home,FT_PCT_home,FG3_PCT_home,AST_home,...,FG3_PCT_AVG_LAST_10_ALL_x_minus_y,FG3_PCT_AVG_LAST_15_ALL_x_minus_y,AST_AVG_LAST_3_ALL_x_minus_y,AST_AVG_LAST_7_ALL_x_minus_y,AST_AVG_LAST_10_ALL_x_minus_y,AST_AVG_LAST_15_ALL_x_minus_y,REB_AVG_LAST_3_ALL_x_minus_y,REB_AVG_LAST_7_ALL_x_minus_y,REB_AVG_LAST_10_ALL_x_minus_y,REB_AVG_LAST_15_ALL_x_minus_y
0,2003-10-28,20300003,1610612747,1610612742,2003,109,0.505859,0.600098,0.350098,32,...,,,,,,,,,,
1,2003-10-28,20300001,1610612755,1610612748,2003,89,0.439941,0.533203,0.350098,25,...,,,,,,,,,,
2,2003-10-28,20300002,1610612759,1610612756,2003,83,0.425049,0.769043,0.099976,20,...,,,,,,,,,,
3,2003-10-29,20300004,1610612738,1610612748,2003,98,0.506836,0.730957,0.312988,28,...,,,,,,,,,,
4,2003-10-29,20300012,1610612762,1610612757,2003,99,0.575195,0.713867,0.556152,25,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
23628,2023-02-13,22200858,1610612739,1610612759,2022,0,0.000000,0.000000,0.000000,0,...,6.814063,3.941667,4.666667,3.571429,3.8,2.333333,-3.000000,0.571429,-3.1,-2.000000
23629,2023-02-13,22200862,1610612752,1610612751,2022,0,0.000000,0.000000,0.000000,0,...,-4.606250,-2.890625,-4.000000,0.857143,-1.7,-2.733333,1.000000,3.428571,5.5,4.866667
23630,2023-02-13,22200863,1610612741,1610612753,2022,0,0.000000,0.000000,0.000000,0,...,-5.223438,-5.546875,-1.000000,3.285714,0.4,0.600000,2.333333,-0.285714,1.4,1.733333
23631,2023-02-13,22200859,1610612754,1610612762,2022,0,0.000000,0.000000,0.000000,0,...,1.220313,-0.479167,-2.333333,0.571429,-0.6,0.666667,-10.666667,-6.285714,-5.6,-6.000000
