# Over/Under Model (Non-Scaled Features)

This notebook provides an analysis pipeline for predicting whether the Over or the Under will hit in a given NFL game. It covers feature selection, model training, and evaluation, and it is specifically for non-scaled features and the models that can work with them.

It applies several feature selection techniques, then trains and evaluates each model on the feature subset chosen by each technique.

By following this pipeline, one can see how each combination of model and feature selection technique performs at predicting whether the Over or the Under will hit.

In [15]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Sklearn packages
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest, mutual_info_classif, f_classif, VarianceThreshold, RFE
from sklearn.linear_model import LogisticRegression, Lasso

# Sklearn Model packages
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import xgboost as xgb
import lightgbm as lgb

In [16]:
pd.set_option('display.max_columns', None)

In [17]:
df = pd.read_csv('/Users/epainter/Desktop/bet_model_v2/data/processing/fc_3.csv')
df.columns

Index(['spread', 'ou_value', 'fav_ml_result', 'fav_sp_result', 'ou_result',
       'fav_ppg', 'und_ppg', 'fav_papg', 'und_papg', 'fav_ypg', 'und_ypg',
       'fav_yapg', 'und_yapg', 'fav_topg', 'und_topg', 'fav_tofpg',
       'und_tofpg', 'fav_avg_mov', 'und_avg_mov', 'fav_win_pct', 'und_win_pct',
       'fav_last_5_win_pct', 'und_last_5_win_pct', 'fav_home_win_pct',
       'und_home_win_pct', 'fav_away_win_pct', 'und_away_win_pct', 'ppg_diff',
       'ypg_diff', 'topg_diff', 'avg_mov_diff', 'win_pct_diff',
       'last_5_win_pct_diff', 'team_ovr_diff', 'ypg_sum', 'ppg_ratio',
       'ypg_ratio', 'avg_mov_ratio'],
      dtype='object')

My **df** has many features that are redundant or of no use to the model. I only want to work with the features that I believe to be of some importance.

In [18]:
cols = ['ou_result',
        'ou_value', 'fav_avg_mov', 'fav_win_pct', 'und_win_pct',
       'fav_last_5_win_pct', 'und_last_5_win_pct', 'ppg_diff',
       'ypg_diff', 'topg_diff', 'avg_mov_diff', 'win_pct_diff',
       'last_5_win_pct_diff', 'team_ovr_diff', 'ypg_sum', 'ppg_ratio','ypg_ratio']

ou_df = df[cols]

ou_df

Unnamed: 0,ou_result,ou_value,fav_avg_mov,fav_win_pct,und_win_pct,fav_last_5_win_pct,und_last_5_win_pct,ppg_diff,ypg_diff,topg_diff,avg_mov_diff,win_pct_diff,last_5_win_pct_diff,team_ovr_diff,ypg_sum,ppg_ratio,ypg_ratio
0,1,53.5,14.00,1.00,0.00,1.0,0.0,14.00,9.00,-1.00,28.00,1.00,1.0,5.35,729.00,1.70,1.02
1,1,49.5,-13.00,0.00,1.00,0.0,1.0,-13.00,123.00,2.00,-26.00,-1.00,-1.0,-1.00,889.00,0.66,1.32
2,0,47.0,32.00,1.00,0.00,1.0,0.0,32.00,75.00,-2.00,64.00,1.00,1.0,4.31,687.00,6.33,1.25
3,1,39.5,10.00,1.00,0.00,1.0,0.0,10.00,150.00,0.00,20.00,1.00,1.0,-10.92,658.00,1.59,1.59
4,1,48.0,4.00,1.00,0.00,1.0,0.0,4.00,-16.00,0.00,8.00,1.00,1.0,-0.89,760.00,1.13,0.96
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1066,0,35.0,-17.00,0.31,0.62,0.2,0.4,-1.50,-28.62,-0.44,-21.00,-0.31,-0.2,-8.25,684.38,0.93,0.92
1067,0,42.5,0.33,0.69,0.31,0.2,0.4,11.50,86.25,0.38,9.33,0.38,-0.2,12.55,629.37,1.77,1.32
1068,1,40.0,6.33,0.75,0.56,0.8,0.8,5.50,38.94,0.00,0.66,0.19,0.0,14.26,770.18,1.23,1.11
1069,1,47.5,-7.33,0.69,0.25,0.6,0.0,9.50,46.25,-0.87,1.67,0.44,0.6,7.50,688.37,1.48,1.14


# Feature Importance Process

In the following cells, I run 5 different feature importance techniques:

    1. Recursive Feature Elimination
    2. Lasso 
    3. Random Forest
    4. Mutual Information
    5. ANOVA F-Value

I group the first 3 together: the function *select_features* applies them and sorts the features in descending order of their importance.

The next 2 (4-5) select features based on mutual information and F-values.

In [19]:
X = ou_df.drop(columns=['ou_result'])
y = ou_df['ou_result']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.15, random_state=0)

X_train.shape, X_test.shape, y_train.shape, y_test.shape

((910, 16), (161, 16), (910,), (161,))

### RFE, Lasso, & Random Forest

In [20]:
def select_features(X, y):
    
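    # Split the data. Note: random_state=42 here, while the notebook-level split
    # above uses random_state=0, so the two train/test partitions are not the same.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=42)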
    
    # 1.) Recursive Feature Elimination
    rfe_selector = RFE(estimator=LogisticRegression(), n_features_to_select=10, step=1)
    rfe_selector = rfe_selector.fit(X_train, y_train)
    rfe_support = rfe_selector.get_support()
    rfe_features = X.loc[:,rfe_support].columns.tolist()
    
    # 2.) Lasso
    lasso = Lasso(alpha=0.1)
    lasso.fit(X_train, y_train)
    lasso_support = lasso.coef_ != 0
    lasso_features = X.loc[:,lasso_support].columns.tolist()
    
    # 3.) Random Forest
    rf = RandomForestClassifier(n_estimators=100, random_state=42)
    rf.fit(X_train, y_train)
    rf_features = X.columns[rf.feature_importances_.argsort()[::-1][:10]].tolist()
    
    # Combine results
    feature_selection_df = pd.DataFrame({'Feature': X.columns,
                                         'RFE': rfe_support,
                                         'Lasso': lasso_support,
                                         'RF': rf.feature_importances_})
    
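    # Sum the RFE and Lasso votes (booleans) plus the raw RF importance (a float),
    # so 'Total' is a two-method vote count with the RF importance as a fractional tie-breaker
    feature_selection_df['Total'] = np.sum(feature_selection_df.iloc[:, 1:], axis=1)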
    
    # Sort with the most important features (selected by most methods) on top
    feature_selection_df = feature_selection_df.sort_values('Total', ascending=False)
    
    return feature_selection_df

# Usage
feature_importance = select_features(X, y)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
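
The warning above is raised by the `LogisticRegression` estimator inside `RFE` when its solver reaches the default iteration limit, which is common on unscaled inputs. A minimal sketch of the first remedy the warning suggests (a larger `max_iter`) follows. The data, the column names, and the value `5000` are assumptions for illustration, not the notebook's own. Note that `RFE` ranks features by coefficient magnitude here, so features measured on large scales can be eliminated early regardless of their usefulness.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for X_train / y_train (hypothetical data, not the notebook's).
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "small_scale": rng.normal(0, 1, 300),
    "large_scale": rng.normal(700, 60, 300),  # values on a much larger scale
    "noise": rng.normal(0, 1, 300),
})
y_demo = (X_demo["small_scale"] + rng.normal(0, 1, 300) > 0).astype(int)

# A higher max_iter gives the solver more room to converge on unscaled inputs.
rfe_demo = RFE(
    estimator=LogisticRegression(max_iter=5000),
    n_features_to_select=2,
    step=1,
)
rfe_demo.fit(X_demo, y_demo)

# Names of the features that survived elimination.
selected_demo = X_demo.columns[rfe_demo.get_support()].tolist()
print(selected_demo)
```

Changing `max_iter` in `select_features` could change which features RFE keeps, so the downstream results would need to be re-run.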

### Mutual Information

In [21]:
mi_sel = SelectKBest(mutual_info_classif, k=5)
mi_sel.fit(X_train, y_train)

# Get the indices of the selected features
selected_feature_indices = mi_sel.get_support(indices=True)

# Get the names of the selected features
selected_feature_names = X_train.columns[selected_feature_indices]

# Create a new DataFrame with only the selected features
X_train_mi = pd.DataFrame(mi_sel.transform(X_train), 
                                columns=selected_feature_names, 
                                index=X_train.index)

X_test_mi = pd.DataFrame(mi_sel.transform(X_test),
                        columns=selected_feature_names,
                        index=X_test.index)

X_train_mi.columns

Index(['fav_avg_mov', 'fav_last_5_win_pct', 'und_last_5_win_pct', 'ppg_diff',
       'avg_mov_diff'],
      dtype='object')
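
`mutual_info_classif` adds a small amount of random noise to continuous features, so without a fixed seed the selected columns can vary between runs. A minimal sketch of pinning the seed with `functools.partial` is shown below. The synthetic data and the names `mi_score`, `first`, and `second` are assumptions for illustration.

```python
from functools import partial

import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Synthetic stand-in data (hypothetical): f0 and f3 drive the label.
rng = np.random.default_rng(1)
X_demo = pd.DataFrame(rng.normal(size=(400, 6)), columns=[f"f{i}" for i in range(6)])
y_demo = (X_demo["f0"] + 0.5 * X_demo["f3"] > 0).astype(int)

# Fixing random_state makes the mutual information estimate repeatable.
mi_score = partial(mutual_info_classif, random_state=0)

first = SelectKBest(mi_score, k=2).fit(X_demo, y_demo)
second = SelectKBest(mi_score, k=2).fit(X_demo, y_demo)

# Inspect the scores rather than only the selected column names.
mi_scores = pd.Series(first.scores_, index=X_demo.columns).sort_values(ascending=False)
print(mi_scores)
```

Looking at `scores_` also shows how close the fifth and sixth ranked features are, which helps judge whether `k=5` is a meaningful cut-off.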

### ANOVA

In [22]:
f_sel = SelectKBest(f_classif, k=5)
f_sel.fit(X_train, y_train)

# Get the indices of the selected features
selected_feature_indices = f_sel.get_support(indices=True)

# Get the names of the selected features
selected_feature_names = X_train.columns[selected_feature_indices]

# Create a new DataFrame with only the selected features
X_train_f = pd.DataFrame(f_sel.transform(X_train), 
                                columns=selected_feature_names, 
                                index=X_train.index)

X_test_f = pd.DataFrame(f_sel.transform(X_test), 
                                columns=selected_feature_names, 
                                index=X_test.index)

X_train_f.columns

Index(['und_last_5_win_pct', 'ypg_diff', 'topg_diff', 'avg_mov_diff',
       'ypg_sum'],
      dtype='object')
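
`SelectKBest` with `f_classif` also stores the F-value and p-value for every feature, which can be more informative than the list of selected names alone. The sketch below tabulates them on synthetic data. The data and the name `anova_table` are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic stand-in data (hypothetical): only f1 is related to the label.
rng = np.random.default_rng(2)
X_demo = pd.DataFrame(rng.normal(size=(400, 5)), columns=[f"f{i}" for i in range(5)])
y_demo = (X_demo["f1"] + rng.normal(0, 1, 400) > 0).astype(int)

f_demo = SelectKBest(f_classif, k=2).fit(X_demo, y_demo)

# One row per feature: F-value, p-value, and whether it made the top k.
anova_table = pd.DataFrame(
    {
        "f_value": f_demo.scores_,
        "p_value": f_demo.pvalues_,
        "selected": f_demo.get_support(),
    },
    index=X_demo.columns,
).sort_values("f_value", ascending=False)
print(anova_table)
```

A feature can land in the top `k` even when its p-value is large, because `SelectKBest` only ranks the scores and applies no significance threshold.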

# Feature Selection

Some of this step overlaps with the last one. For Mutual Information and ANOVA F-value, the previous cells already found the most important features and created data frames that can be used in the models.

For the first 3 methods (RFE, Lasso, RF), we still need to select the n most important features (I chose 5).

In [23]:
# Sort features by 'Total' and select top n
top_5_features = feature_importance.sort_values('Total', ascending=False)['Feature'].head(5).tolist()

# Create a mask for the top 5 features
top_5_mask = X_train.columns.isin(top_5_features)

# Selecting only top n features from training and test sets
X_train_top5 = X_train.loc[:, top_5_mask]
X_test_top5 = X_test.loc[:, top_5_mask]
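
In `select_features`, the `RFE` and `Lasso` columns are booleans while the `RF` column holds raw importances, so `Total` mixes votes with a fractional score. One possible variant, sketched below, converts the Random Forest importances into a top-k vote so that `Total` is a plain count of methods. This is an alternative for comparison, not the method used to produce the results in this notebook. All arrays and names here are hypothetical stand-ins.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for rfe_support, lasso_support, and rf.feature_importances_.
features = ["a", "b", "c", "d"]
rfe_support_demo = np.array([True, True, False, False])
lasso_support_demo = np.array([True, False, False, True])
rf_importances_demo = np.array([0.40, 0.10, 0.30, 0.20])

# Turn RF importances into a vote: True for the top_k most important features.
top_k = 2
rf_support_demo = np.zeros(len(features), dtype=bool)
rf_support_demo[np.argsort(rf_importances_demo)[::-1][:top_k]] = True

votes = pd.DataFrame(
    {
        "Feature": features,
        "RFE": rfe_support_demo,
        "Lasso": lasso_support_demo,
        "RF": rf_support_demo,
    }
)

# Total is now the number of methods (0 to 3) that selected each feature.
votes["Total"] = votes[["RFE", "Lasso", "RF"]].sum(axis=1)
votes = votes.sort_values("Total", ascending=False)
print(votes)
```

With pure vote counts, ties are common, so a secondary sort key (for example the RF importance) may still be needed to pick exactly n features.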

# Models

Here I define the models to be used and evaluated. These models do not depend on feature scaling.

The models include a Decision Tree, a Random Forest, Gradient Boosting, Gaussian Naive Bayes (GaussianNB), XGBoost, and LightGBM.

In [24]:
models = [
    DecisionTreeClassifier(random_state=42),
    RandomForestClassifier(random_state=42),
    GradientBoostingClassifier(random_state=42),
    GaussianNB(),
    xgb.XGBClassifier(random_state=42),
    lgb.LGBMClassifier(random_state=42)
]

We need to evaluate each model on each feature subset produced by the different feature selection techniques.
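
The model-by-subset evaluation described here can also be collected into a single table, which makes the combinations easier to compare than separate printed reports. The sketch below is one way to do that under stated assumptions: the data is synthetic, only two scikit-learn models are used, and the names `subsets`, `demo_models`, and `results` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (hypothetical): the label depends on columns a and c.
rng = np.random.default_rng(3)
X_demo = pd.DataFrame(rng.normal(size=(300, 4)), columns=["a", "b", "c", "d"])
y_demo = (X_demo["a"] - X_demo["c"] > 0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.15, random_state=0)

# Hypothetical feature subsets; in the notebook these would be the
# top-5, mutual information, and ANOVA column lists.
subsets = {"subset_1": ["a", "c"], "subset_2": ["b", "d"]}
demo_models = [DecisionTreeClassifier(random_state=42), GaussianNB()]

rows = []
for subset_name, subset_cols in subsets.items():
    for model in demo_models:
        # Refit each model on the current subset and score it on the held-out rows.
        model.fit(X_tr[subset_cols], y_tr)
        acc = accuracy_score(y_te, model.predict(X_te[subset_cols]))
        rows.append({"features": subset_name, "model": type(model).__name__, "accuracy": acc})

# One row per model, one column per feature subset.
results = pd.DataFrame(rows).pivot(index="model", columns="features", values="accuracy")
print(results.round(3))
```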

### RFE, Lasso, & Random Forest

In [25]:
for model in models:
    
    # Fit the model
    model.fit(X_train_top5, y_train)
    
    # Predictions using test data
    y_pred = model.predict(X_test_top5)
    
    # Evaluate the model
    print(f"\n{model.__class__.__name__} Results (RFE, Lasso, Random Forest):")
    print("Accuracy:", accuracy_score(y_test, y_pred))
    print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
    print("Classification Report:\n", classification_report(y_test, y_pred))



DecisionTreeClassifier Results (RFE, Lasso, Random Forest):
Accuracy: 0.5403726708074534
Confusion Matrix:
 [[50 43]
 [31 37]]
Classification Report:
               precision    recall  f1-score   support

           0       0.62      0.54      0.57        93
           1       0.46      0.54      0.50        68

    accuracy                           0.54       161
   macro avg       0.54      0.54      0.54       161
weighted avg       0.55      0.54      0.54       161


RandomForestClassifier Results (RFE, Lasso, Random Forest):
Accuracy: 0.5031055900621118
Confusion Matrix:
 [[54 39]
 [41 27]]
Classification Report:
               precision    recall  f1-score   support

           0       0.57      0.58      0.57        93
           1       0.41      0.40      0.40        68

    accuracy                           0.50       161
   macro avg       0.49      0.49      0.49       161
weighted avg       0.50      0.50      0.50       161


GradientBoostingClassifier Results (RFE, 

### Mutual Information

In [26]:
for model in models:

    model.fit(X_train_mi, y_train)
    
    y_pred = model.predict(X_test_mi)
    
    print(f"\n{model.__class__.__name__} Results (Mutual Information):")
    print("Accuracy:", accuracy_score(y_test, y_pred))
    print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
    print("Classification Report:\n", classification_report(y_test, y_pred))


DecisionTreeClassifier Results (Mutual Information):
Accuracy: 0.484472049689441
Confusion Matrix:
 [[47 46]
 [37 31]]
Classification Report:
               precision    recall  f1-score   support

           0       0.56      0.51      0.53        93
           1       0.40      0.46      0.43        68

    accuracy                           0.48       161
   macro avg       0.48      0.48      0.48       161
weighted avg       0.49      0.48      0.49       161


RandomForestClassifier Results (Mutual Information):
Accuracy: 0.4968944099378882
Confusion Matrix:
 [[49 44]
 [37 31]]
Classification Report:
               precision    recall  f1-score   support

           0       0.57      0.53      0.55        93
           1       0.41      0.46      0.43        68

    accuracy                           0.50       161
   macro avg       0.49      0.49      0.49       161
weighted avg       0.50      0.50      0.50       161


GradientBoostingClassifier Results (Mutual Information):

### ANOVA

In [27]:
for model in models:

    model.fit(X_train_f, y_train)

    y_pred = model.predict(X_test_f)
    
    print(f"\n{model.__class__.__name__} Results (ANOVA):")
    print("Accuracy:", accuracy_score(y_test, y_pred))
    print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
    print("Classification Report:\n", classification_report(y_test, y_pred))


DecisionTreeClassifier Results (ANOVA):
Accuracy: 0.5031055900621118
Confusion Matrix:
 [[49 44]
 [36 32]]
Classification Report:
               precision    recall  f1-score   support

           0       0.58      0.53      0.55        93
           1       0.42      0.47      0.44        68

    accuracy                           0.50       161
   macro avg       0.50      0.50      0.50       161
weighted avg       0.51      0.50      0.51       161


RandomForestClassifier Results (ANOVA):
Accuracy: 0.5341614906832298
Confusion Matrix:
 [[60 33]
 [42 26]]
Classification Report:
               precision    recall  f1-score   support

           0       0.59      0.65      0.62        93
           1       0.44      0.38      0.41        68

    accuracy                           0.53       161
   macro avg       0.51      0.51      0.51       161
weighted avg       0.53      0.53      0.53       161


GradientBoostingClassifier Results (ANOVA):
Accuracy: 0.577639751552795
Confusion

# Model Selection

From the models above, the DecisionTree using RFE, Lasso, and RF feature selection showed promising results, with an accuracy of ~54%, which is better than the historical ~50% rate at which the Over hits in a game.

In addition, GradientBoosting using the ANOVA feature selection technique was promising, with an accuracy of ~58%. However, some of the performance metrics were not as strong as they could be, especially the F1 score and recall, although the precision showed some promise.
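
A useful reference point for these accuracies is the majority-class baseline on the same test split. The classification reports above show 93 test games in class 0 and 68 in class 1, so always predicting class 0 would score 93/161 on this split, which is the same value as the GradientBoosting (ANOVA) accuracy printed above. The sketch below computes that baseline with `DummyClassifier`. It fits on labels reconstructed from the reported test support only because the training label counts are not shown; in the notebook the baseline would be fit on `y_train` and scored on `y_test`.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Labels shaped like the reported test split: 93 zeros and 68 ones.
y_test_demo = np.array([0] * 93 + [1] * 68)
X_placeholder = np.zeros((len(y_test_demo), 1))  # DummyClassifier ignores the features

# Always predict the most frequent class seen during fit.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_placeholder, y_test_demo)

baseline_acc = accuracy_score(y_test_demo, baseline.predict(X_placeholder))
print(round(baseline_acc, 3))  # 0.578
```

Comparing each model against this number, in addition to the ~50% historical rate, helps separate real signal from the class balance of a single 161-game split.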