### Production Features Pipeline - CSV Version

This notebook is run daily from a GitHub Action.

1. It scrapes the results from the previous day's games, performs feature engineering, and saves the results back to a CSV file. This is an alternative version of the pipeline that does NOT use the Hopsworks.ai Feature Store and is less dependent on other platforms.

2. It scrapes the upcoming games for today and saves blank records for them into the CSV file so that the model can use them for prediction.

**Note:**
There are two options for webscraping in this notebook. 
Set the 'WEBSCRAPER' variable to either 'SCRAPINGANT' or 'SELENIUM' to choose which version to run.

1. SCRAPINGANT: Uses ScrapingAnt, a web scraping service with a Python API that handles all proxy server issues but requires an account. The free account allows for 1000 page requests, which is more than enough for this project. Proxies are required when running this notebook from a GitHub Action; otherwise key data will fail to be scraped from NBA.com.

2. SELENIUM: This option does not currently integrate proxy servers into the web scraping process, which can cause issues when scraping from certain locations, in particular GitHub Actions. For occasional use from local machines, this option may work fine, but you may need to set up a proxy server.

In [1]:
# select web scraper; 'SCRAPINGANT' or 'SELENIUM'
# SCRAPINGANT requires an account (a free tier is available) and includes a proxy server

WEBSCRAPER = 'SCRAPINGANT'
#WEBSCRAPER = 'SELENIUM'

In [2]:
import os

import pandas as pd
import numpy as np

from datetime import datetime, timedelta
from pytz import timezone

import json

import time

from pathlib import Path  #for Windows/Linux compatibility

# change working directory to project root when running from notebooks folder to make it easier to import modules
# and to access sibling folders
os.chdir('..') 

 
from src.webscraping import (
    get_new_games,
    activate_web_driver,
    get_todays_matchups,
)

from src.data_processing import (
    process_games,
    add_TARGET,
)

from src.feature_engineering import (
    process_features,
)


DATAPATH = Path(r'data')

**Load API keys**

In [3]:
from dotenv import load_dotenv

load_dotenv()

# HOPSWORKS_API_KEY is not needed here: the CSV version does not use the Hopsworks Feature Store

# if scrapingant is chosen then set the api key, otherwise load the selenium webdriver
if WEBSCRAPER == 'SCRAPINGANT':
    try:
        SCRAPINGANT_API_KEY = os.environ['SCRAPINGANT_API_KEY']
    except KeyError:
        raise Exception('Set environment variable SCRAPINGANT_API_KEY')
    driver = None
    
elif WEBSCRAPER == 'SELENIUM':
    driver = activate_web_driver('chromium')
    SCRAPINGANT_API_KEY = ""

else:
    # fail early instead of hitting an undefined `driver` later
    raise ValueError("WEBSCRAPER must be 'SCRAPINGANT' or 'SELENIUM'")
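
The key-loading pattern above can be factored into a small helper. This is only a sketch: `require_env` is a hypothetical name and not part of the project's `src` package. It also treats an empty value as missing, which the cell above does not do.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or fail with a clear message."""
    # hypothetical helper, not part of the project's src package
    value = os.environ.get(name)
    if not value:
        # covers both "variable not set" and "variable set to an empty string"
        raise RuntimeError(f"Set environment variable {name}")
    return value
```

With this helper, the ScrapingAnt branch could read `SCRAPINGANT_API_KEY = require_env('SCRAPINGANT_API_KEY')`.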
    



**Scrape New Completed Games and Format Them**

In [None]:


df_new = get_new_games(SCRAPINGANT_API_KEY, driver)

if df_new.empty:
    print('No new games to process')

    # determine what season we are in currently
    today = datetime.now(timezone('America/New_York')) # NBA.com uses US Eastern Time; the fixed 'EST' zone ignores daylight saving
    if today.month >= 10:
        SEASON = today.year
    else:
        SEASON = today.year - 1
else:

    # get the SEASON of the last game in the database
    # this will be used when constructing rows for prediction
    SEASON = df_new['SEASON'].max()

    df_new
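
The season rule used in the fallback branch above (October or later counts as the season that starts in the current year, earlier months belong to the season that started the previous year) can be written as a small function. This is a sketch; `season_for_date` is a hypothetical name, and the October cutoff is taken directly from the cell above.

```python
from datetime import date


def season_for_date(d: date) -> int:
    """Return the season label: the calendar year in which the season started."""
    # hypothetical helper mirroring the `today.month >= 10` rule in the cell above
    return d.year if d.month >= 10 else d.year - 1


print(season_for_date(date(2024, 10, 26)))  # 2024
print(season_for_date(date(2025, 3, 1)))    # 2024
```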




**Retrieve Today's Games**

In [None]:
#retrieve list of teams playing today

# get today's games on NBA schedule
matchups, game_ids = get_todays_matchups(SCRAPINGANT_API_KEY, driver)

if matchups is None:
    print('No games today')
else:
    print(matchups)
    print(game_ids)


**Close Webdriver**

In [6]:
if WEBSCRAPER == 'SELENIUM':
    driver.quit()  # quit() ends the whole browser session, not just the current window

**Check if anything is going on in the season**

In [7]:
if (df_new.empty) and (matchups is None):
    print('No new games to process')
    exit()
    

**Create Rows for Today's Games with Empty Stats**

In [8]:
# reformat today's matchups to match the new games dataframe

if matchups is None:
    # do not exit here: yesterday's completed games still need to be saved below
    print('No games today. No prediction rows to create.')

else:

    game_date = datetime.now(timezone('America/New_York')).strftime("%Y-%m-%d")

    # each matchup is (visitor, home); game_ids is in the same order
    rows = []
    for matchup, game_id in zip(matchups, game_ids):
        rows.append({'HOME_TEAM_ID': matchup[1], 
                     'VISITOR_TEAM_ID': matchup[0], 
                     'GAME_DATE_EST': game_date, 
                     'GAME_ID': int(game_id),                       
                     'SEASON': SEASON,
                     })

    # empty copy of df_new (same columns) plus one row per game today
    df_today = pd.concat([df_new.iloc[0:0], pd.DataFrame(rows)], ignore_index = True)

    #blank rows will be filled with 0 to prevent issues with feature engineering
    df_today = df_today.fillna(0) 

    df_today



**Query Old Data Needed for Feature Engineering of New Data**

To generate features such as rolling averages for the new games, data from previous games is needed, because some of the rolling windows extend back 15 to 20 games.
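
To illustrate why the history is needed, here is a minimal sketch of a per-team rolling average in pandas. The column names (`TEAM_ID`, `PTS`, `PTS_AVG_LAST_3`) and the 3-game window are assumptions for the example, and the real logic lives in `process_features`, which may be implemented differently. Each row is averaged over the team's previous games only, so a team's first games have no value until enough history exists.

```python
import pandas as pd

# toy data: two teams, four games each (values are made up for illustration)
games = pd.DataFrame({
    "TEAM_ID": [1, 1, 1, 1, 2, 2, 2, 2],
    "GAME_DATE": pd.to_datetime([
        "2024-10-22", "2024-10-24", "2024-10-26", "2024-10-28",
        "2024-10-22", "2024-10-24", "2024-10-26", "2024-10-28",
    ]),
    "PTS": [100, 110, 90, 120, 95, 105, 85, 115],
})

games = games.sort_values(["TEAM_ID", "GAME_DATE"])

# shift(1) excludes the current game, so each row only sees games played before it
games["PTS_AVG_LAST_3"] = (
    games.groupby("TEAM_ID")["PTS"]
    .transform(lambda s: s.shift(1).rolling(3).mean())
)

print(games)
```

Only the fourth game of each team gets a value here; the first three rows per team are `NaN` because fewer than three earlier games exist.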

In [None]:


df_old = pd.read_csv(DATAPATH / 'games.csv')

df_old


**Update Yesterday's Matchup Predictions with New Final Results**

In [None]:
# filter out games that are pending final results
# (these were the rows used for prediction yesterday)
# and then update these with the new results


# one approach is to simply drop the rows that were used for prediction yesterday
# which are games that have 0 points for home team
# and then append the new rows to the dataframe
df_old = df_old[df_old['PTS_home'] != 0]
df_old = pd.concat([df_old, df_new], ignore_index = True)


# save the updated games back to the csv file
df_old.to_csv(DATAPATH / 'games.csv', index=False)

df_old
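
The cell above describes dropping placeholder rows as one approach. A different approach, sketched here and not what the notebook does, is to key the update on `GAME_ID` and keep the most recent row for each game. `upsert_games` is a hypothetical name.

```python
import pandas as pd


def upsert_games(df_old: pd.DataFrame, df_new: pd.DataFrame) -> pd.DataFrame:
    """Append new rows and keep only the latest row per GAME_ID."""
    # hypothetical helper; the notebook itself filters on PTS_home != 0 instead
    combined = pd.concat([df_old, df_new], ignore_index=True)
    # keep="last" lets a completed game replace yesterday's placeholder row
    return combined.drop_duplicates(subset="GAME_ID", keep="last").reset_index(drop=True)


df_old_demo = pd.DataFrame({"GAME_ID": [1, 2, 3], "PTS_home": [101, 98, 0]})
df_new_demo = pd.DataFrame({"GAME_ID": [3], "PTS_home": [110]})

updated = upsert_games(df_old_demo, df_new_demo)
print(updated)
```

Unlike the filter on `PTS_home`, this variant leaves a placeholder in place when no final result arrives for that game (for example a postponed game), so the two approaches are not interchangeable.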

**Add Today's Matchups for Feature Engineering**

In [11]:
if matchups is None:
    print('No games today')
    df_combined = df_old
else:
    df_combined = pd.concat([df_old, df_today], ignore_index = True)
    df_combined

**Data Processing**

In [None]:
df_combined = process_games(df_combined) 
df_combined = add_TARGET(df_combined)
df_combined

**Feature Engineering**

In [None]:
# Feature engineering to add: 
    # rolling averages of key stats, 
    # win/lose streaks, 
    # home/away streaks, 
    # specific matchup (team X vs team Y) rolling averages and streaks

df_combined = process_features(df_combined)



#fix type conversion issues with hopsworks
df_combined['TARGET'] = df_combined['TARGET'].astype('int16')
df_combined['HOME_TEAM_WINS'] = df_combined['HOME_TEAM_WINS'].astype('int16')

# save file
df_combined.to_csv(DATAPATH / 'games_engineered.csv', index=False)


df_combined


In [None]:
# check that no duplicate games were inadvertently added
df_combined[df_combined.duplicated(subset=['GAME_ID'], keep=False)]
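
Because this notebook runs unattended, the visual check above could also be turned into a hard failure. The following is a sketch under that assumption; `assert_unique_game_ids` is a hypothetical helper, not part of the project.

```python
import pandas as pd


def assert_unique_game_ids(df: pd.DataFrame) -> None:
    """Raise if any GAME_ID appears more than once."""
    # hypothetical helper built on the same duplicated() check as the cell above
    dupes = df[df.duplicated(subset=["GAME_ID"], keep=False)]
    if not dupes.empty:
        ids = sorted(dupes["GAME_ID"].unique().tolist())
        raise ValueError(f"Duplicate GAME_IDs found: {ids}")


# passes silently when every GAME_ID is unique
assert_unique_game_ids(pd.DataFrame({"GAME_ID": [22400087, 22400090]}))
```

Calling it as `assert_unique_game_ids(df_combined)` before saving would stop the run instead of writing duplicate rows to the CSV file.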

In [1]:
import os

import pandas as pd
import numpy as np

import joblib

from datetime import datetime, timedelta
from pytz import timezone

import json

import time

from pathlib import Path  #for Windows/Linux compatibility

# change working directory to project root when running from notebooks folder to make it easier to import modules
# and to access sibling folders
os.chdir('..') 


from src.feature_engineering import (
    fix_datatypes,
    remove_non_rolling,
)

from src.constants import (
    LONG_INTEGER_FIELDS,    
    SHORT_INTEGER_FIELDS,   
    DATE_FIELDS,            
    DROP_COLUMNS,
    NBA_TEAMS_NAMES,
)

DATAPATH = Path(r'data')

def remove_unused_features(df: pd.DataFrame) -> pd.DataFrame:
    """
    Remove features that are not used in the model.

    """
    
    # remove stats from today's games - these are blank (the game hasn't been played) and are not used by the model
    use_columns = remove_non_rolling(df)
    X = df[use_columns]

    # drop columns not used in model
    X = X.drop(DROP_COLUMNS, axis=1)

    # MATCHUP is just for informational display, not used by model
    X = X.drop('MATCHUP', axis=1) 
    
    return X



df = pd.read_csv(DATAPATH / 'games_engineered.csv')

# TODO: consider adding MATCHUP to the dataframe earlier in the pipeline
df['MATCHUP'] = df['VISITOR_TEAM_ID'].map(NBA_TEAMS_NAMES) + " @ " + df['HOME_TEAM_ID'].map(NBA_TEAMS_NAMES)



model_dir  = Path.cwd() / "models"

with open(model_dir / "model.pkl", 'rb') as f:
    model = joblib.load(f)

X = remove_unused_features(df)

preds = model.predict_proba(X)[:,1]

df['HOME_TEAM_WIN_PROBABILITY'] = preds

df = df.reset_index(drop=True)

df.to_csv(DATAPATH / 'games_predictions.csv', index=False)
    
df


Unnamed: 0,GAME_DATE_EST,GAME_ID,HOME_TEAM_ID,VISITOR_TEAM_ID,SEASON,PTS_home,FG_PCT_home,FT_PCT_home,FG3_PCT_home,AST_home,...,AST_AVG_LAST_3_ALL_x_minus_y,AST_AVG_LAST_7_ALL_x_minus_y,AST_AVG_LAST_10_ALL_x_minus_y,AST_AVG_LAST_15_ALL_x_minus_y,REB_AVG_LAST_3_ALL_x_minus_y,REB_AVG_LAST_7_ALL_x_minus_y,REB_AVG_LAST_10_ALL_x_minus_y,REB_AVG_LAST_15_ALL_x_minus_y,MATCHUP,HOME_TEAM_WIN_PROBABILITY
0,2003-10-28 00:00:00+00:00,20300003,1610612747,1610612742,2003,109,0.506,0.600,0.350,32,...,,,,,,,,,Dallas Mavericks @ Los Angeles Lakers,0.594546
1,2003-10-28 00:00:00+00:00,20300002,1610612759,1610612756,2003,83,0.425,0.769,0.100,20,...,,,,,,,,,Phoenix Suns @ San Antonio Spurs,0.606008
2,2003-10-28 00:00:00+00:00,20300001,1610612755,1610612748,2003,89,0.440,0.533,0.350,25,...,,,,,,,,,Miami Heat @ Philadelphia 76ers,0.589503
3,2003-10-29 00:00:00+00:00,20300006,1610612740,1610612737,2003,88,0.324,0.700,0.160,24,...,,,,,,,,,Atlanta Hawks @ New Orleans Pelicans,0.627729
4,2003-10-29 00:00:00+00:00,20300008,1610612765,1610612754,2003,87,0.392,0.742,0.333,15,...,,,,,,,,,Indiana Pacers @ Detroit Pistons,0.594546
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
27033,2024-10-26 00:00:00+00:00,22400093,1610612750,1610612761,2024,0,0.000,0.000,0.000,0,...,-3.000000,-1.714286,-0.3,-1.666667,-1.333333,-2.714286,-1.7,1.800000,Toronto Raptors @ Minnesota Timberwolves,0.579480
27034,2024-10-26 00:00:00+00:00,22400096,1610612747,1610612758,2024,0,0.000,0.000,0.000,0,...,5.000000,0.428571,0.6,0.933333,-1.000000,-2.142857,-3.9,-3.266667,Sacramento Kings @ Los Angeles Lakers,0.522538
27035,2024-10-26 00:00:00+00:00,22400090,1610612764,1610612739,2024,0,0.000,0.000,0.000,0,...,3.333333,4.142857,4.7,4.000000,4.000000,4.428571,3.9,4.866667,Cleveland Cavaliers @ Washington Wizards,0.287596
27036,2024-10-26 00:00:00+00:00,22400087,1610612743,1610612746,2024,0,0.000,0.000,0.000,0,...,-1.000000,2.571429,3.1,3.200000,1.000000,0.000000,-2.0,-2.400000,LA Clippers @ Denver Nuggets,0.552375


In [2]:
df.sort_values(by=['GAME_DATE_EST'], ascending=False)

Unnamed: 0,GAME_DATE_EST,GAME_ID,HOME_TEAM_ID,VISITOR_TEAM_ID,SEASON,PTS_home,FG_PCT_home,FT_PCT_home,FG3_PCT_home,AST_home,...,AST_AVG_LAST_3_ALL_x_minus_y,AST_AVG_LAST_7_ALL_x_minus_y,AST_AVG_LAST_10_ALL_x_minus_y,AST_AVG_LAST_15_ALL_x_minus_y,REB_AVG_LAST_3_ALL_x_minus_y,REB_AVG_LAST_7_ALL_x_minus_y,REB_AVG_LAST_10_ALL_x_minus_y,REB_AVG_LAST_15_ALL_x_minus_y,MATCHUP,HOME_TEAM_WIN_PROBABILITY
27037,2024-10-26 00:00:00+00:00,22400089,1610612765,1610612738,2024,0,0.000,0.000,0.000,0,...,-1.000000,-1.857143,-1.1,-1.600000,-3.000000,-4.714286,-4.7,-4.133333,Boston Celtics @ Detroit Pistons,0.126832
27036,2024-10-26 00:00:00+00:00,22400087,1610612743,1610612746,2024,0,0.000,0.000,0.000,0,...,-1.000000,2.571429,3.1,3.200000,1.000000,0.000000,-2.0,-2.400000,LA Clippers @ Denver Nuggets,0.552375
27035,2024-10-26 00:00:00+00:00,22400090,1610612764,1610612739,2024,0,0.000,0.000,0.000,0,...,3.333333,4.142857,4.7,4.000000,4.000000,4.428571,3.9,4.866667,Cleveland Cavaliers @ Washington Wizards,0.287596
27034,2024-10-26 00:00:00+00:00,22400096,1610612747,1610612758,2024,0,0.000,0.000,0.000,0,...,5.000000,0.428571,0.6,0.933333,-1.000000,-2.142857,-3.9,-3.266667,Sacramento Kings @ Los Angeles Lakers,0.522538
27033,2024-10-26 00:00:00+00:00,22400093,1610612750,1610612761,2024,0,0.000,0.000,0.000,0,...,-3.000000,-1.714286,-0.3,-1.666667,-1.333333,-2.714286,-1.7,1.800000,Toronto Raptors @ Minnesota Timberwolves,0.579480
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4,2003-10-29 00:00:00+00:00,20300008,1610612765,1610612754,2003,87,0.392,0.742,0.333,15,...,,,,,,,,,Indiana Pacers @ Detroit Pistons,0.594546
3,2003-10-29 00:00:00+00:00,20300006,1610612740,1610612737,2003,88,0.324,0.700,0.160,24,...,,,,,,,,,Atlanta Hawks @ New Orleans Pelicans,0.627729
2,2003-10-28 00:00:00+00:00,20300001,1610612755,1610612748,2003,89,0.440,0.533,0.350,25,...,,,,,,,,,Miami Heat @ Philadelphia 76ers,0.589503
1,2003-10-28 00:00:00+00:00,20300002,1610612759,1610612756,2003,83,0.425,0.769,0.100,20,...,,,,,,,,,Phoenix Suns @ San Antonio Spurs,0.606008
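
The output above covers the full history. To look only at games still awaiting a result, the rows can be filtered on the placeholder marker the pipeline uses (0 points for the home team). This is a sketch with a hypothetical helper name, using a three-row sample taken from the output above.

```python
import pandas as pd


def pending_predictions(df: pd.DataFrame) -> pd.DataFrame:
    """Return predictions for games that have not been played yet."""
    # hypothetical helper; PTS_home == 0 marks the blank rows created for today's games
    pending = df[df["PTS_home"] == 0]
    cols = ["GAME_DATE_EST", "MATCHUP", "HOME_TEAM_WIN_PROBABILITY"]
    return pending[cols].sort_values("HOME_TEAM_WIN_PROBABILITY", ascending=False)


sample = pd.DataFrame({
    "GAME_DATE_EST": ["2003-10-28", "2024-10-26", "2024-10-26"],
    "PTS_home": [109, 0, 0],
    "MATCHUP": [
        "Dallas Mavericks @ Los Angeles Lakers",
        "Cleveland Cavaliers @ Washington Wizards",
        "Sacramento Kings @ Los Angeles Lakers",
    ],
    "HOME_TEAM_WIN_PROBABILITY": [0.594546, 0.287596, 0.522538],
})

result = pending_predictions(sample)
print(result)
```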
