# LSTM Statistic Predictor Using OPS (On-base Plus Slugging)

In this notebook, we will build and train a custom LSTM RNN that uses 12 years of batting statistics to predict the value of OPS.

## Statistics Used in Moneyball

According to Lewis (2003), Billy Beane (the inspiration for Moneyball) decided to base his drafting of position players/hitters on certain statistics. His two main statistics were `on-base percentage (OBP)` and `slugging percentage (SLG)`. These two stats combine to form a new statistic called `on-base plus slugging (OPS)`.

OPS is the sum of OBP and SLG, which are computed as follows:

        OBP = (H + BB + HBP) / (AB + BB + HBP + SF)
        
        SLG = ((1B) + (2∗2B) + (3∗3B) + (4∗HR)) / AB
        
        OPS = OBP + SLG
        

H = Hits - when the batter strikes the ball into fair territory and reaches base without the aid of an error or a fielder's choice

BB = Walks - when a pitcher throws four pitches out of the strike zone, none of which are swung at by the batter

HBP = Hit by pitch - when a batter is struck by a pitched ball without swinging at it and is awarded first base

AB = At bat - when a batter reaches base via fielder's choice, hit, or error (not including catcher's interference), or is put out on a non-sacrifice play

SF = Sacrifice fly - when a batter hits a fly-ball to the outfield or foul territory that allows a runner to score

1B = Single - when the batter hits the ball and safely reaches first base

2B = Double - when the batter hits the ball and safely reaches second base

3B = Triple - when the batter hits the ball and safely reaches third base

HR = Home run - when the batter hits the ball and circles all four bases
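
As a worked example, the formulas above can be evaluated directly from counting statistics. The following is a minimal sketch: the helper names `obp`, `slg`, and `ops` are hypothetical and not part of this notebook, and the stat line is invented to keep the arithmetic easy to follow. Singles are derived as hits minus extra-base hits, since batting tables usually do not list 1B separately.

```python
def obp(h, bb, hbp, ab, sf):
    """On-base percentage: times on base divided by qualifying plate appearances."""
    return (h + bb + hbp) / (ab + bb + hbp + sf)


def slg(h, doubles, triples, hr, ab):
    """Slugging percentage: total bases divided by at bats."""
    singles = h - doubles - triples - hr  # 1B is not listed separately in most tables
    total_bases = singles + 2 * doubles + 3 * triples + 4 * hr
    return total_bases / ab


def ops(h, bb, hbp, ab, sf, doubles, triples, hr):
    """On-base plus slugging."""
    return obp(h, bb, hbp, ab, sf) + slg(h, doubles, triples, hr, ab)


# Hypothetical stat line (not taken from the dataset).
line = dict(h=150, bb=50, hbp=5, ab=500, sf=5, doubles=30, triples=5, hr=20)
print(round(ops(**line), 3))
```

Here OBP is 205 / 560 and SLG is 250 / 500, so the two are on different denominators even though they are simply added together.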

In [1]:
import os
import numpy as np
import pandas as pd
import seaborn as sns
import datetime
from pathlib import Path
import glob

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
import tensorflow as tf

import warnings
warnings.filterwarnings("ignore")  # assumed argument: suppress library warnings

%matplotlib inline




In [3]:
# Import dataset
path = "Resources/batting_stats_*.csv"
all_files = glob.glob(path)
all_files

# Read & combine into dataframe
data = []

for file in all_files:
    df = pd.read_csv(file, index_col=None, header=0)    
    year = file[-8:-4]
    df['Year'] = year
    data.append(df)

stats_df = pd.concat(data, axis=0, ignore_index=True)
stats_df = stats_df[['Year', 'Player', 'Team', 'Pos', 'Age', 'G', 'AB', 'R', 'H', '2B', '3B', 'HR','RBI', 
             'SB', 'CS', 'BB', 'SO', 'SH', 'SF', 'HBP', 'AVG', 'OBP', 'SLG','OPS']]

# Drop players with an OPS of 0 and index the rows by player name
stats_df = stats_df[stats_df.OPS != 0].set_index('Player')

stats_df.head(100)

| Player | Year | Team | Pos | Age | G | AB | R | H | 2B | 3B | ... | CS | BB | SO | SH | SF | HBP | AVG | OBP | SLG | OPS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Ichiro Suzuki | 2010 | SEA | OF | 47 | 162 | 680 | 74 | 214 | 30 | 3 | ... | 9 | 45 | 86 | 3 | 1 | 3 | 0.315 | 0.359 | 0.394 | 0.753 |
| Derek Jeter | 2010 | NYY | SS | 47 | 157 | 663 | 111 | 179 | 30 | 3 | ... | 5 | 63 | 106 | 1 | 3 | 9 | 0.270 | 0.340 | 0.370 | 0.710 |
| Michael Young | 2010 | TEX | 3B | 44 | 157 | 656 | 99 | 186 | 36 | 3 | ... | 2 | 50 | 115 | 0 | 11 | 1 | 0.284 | 0.330 | 0.444 | 0.774 |
| Juan Pierre | 2010 | CWS | OF | 43 | 160 | 651 | 96 | 179 | 18 | 3 | ... | 18 | 45 | 47 | 15 | 2 | 21 | 0.275 | 0.341 | 0.316 | 0.657 |
| Rickie Weeks | 2010 | MIL | DH | 38 | 160 | 651 | 112 | 175 | 32 | 4 | ... | 4 | 76 | 184 | 0 | 2 | 25 | 0.269 | 0.366 | 0.464 | 0.830 |
| Marco Scutaro | 2010 | BOS | 2B | 45 | 150 | 632 | 92 | 174 | 38 | 0 | ... | 4 | 53 | 71 | 4 | 3 | 3 | 0.275 | 0.333 | 0.388 | 0.721 |
| Nick Markakis | 2010 | BAL | OF | 37 | 160 | 629 | 79 | 187 | 45 | 3 | ... | 2 | 73 | 93 | 0 | 5 | 2 | 0.297 | 0.370 | 0.436 | 0.806 |
| Denard Span | 2010 | MIN | OF | 37 | 153 | 629 | 85 | 166 | 24 | 10 | ... | 4 | 60 | 74 | 10 | 2 | 4 | 0.264 | 0.331 | 0.348 | 0.679 |
| Brandon Phillips | 2010 | CIN | 2B | 40 | 155 | 626 | 100 | 172 | 33 | 5 | ... | 12 | 46 | 83 | 6 | 1 | 8 | 0.275 | 0.332 | 0.430 | 0.762 |
| Robinson Cano | 2010 | NYY | 2B | 38 | 160 | 626 | 103 | 200 | 41 | 3 | ... | 2 | 57 | 77 | 0 | 5 | 8 | 0.319 | 0.381 | 0.534 | 0.915 |
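
The loading loop reads the season from each filename with the fixed slice `file[-8:-4]`, which works as long as the name ends in four digits followed by `.csv`. As a possible alternative, the sketch below pulls the year out with a regular expression and returns `None` when no year is present. The helper name `year_from_path` and the second example path are hypothetical.

```python
import re
from pathlib import Path


def year_from_path(path):
    """Return the four-digit year that ends the file stem, or None if absent."""
    match = re.search(r"(\d{4})$", Path(path).stem)
    return match.group(1) if match else None


print(year_from_path("Resources/batting_stats_2010.csv"))
print(year_from_path("Resources/readme.csv"))
```

Like the slice, this returns the year as a string, so the `Year` column would keep the same type.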


In [4]:
# Creating X & y variables
X = stats_df.iloc[:, 4:-2]
y = stats_df["OPS"].values
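
To make the positional slice `iloc[:, 4:-2]` easier to read, the sketch below applies the same slice to a plain list of the column names, in the order selected in the loading cell (`Player` is the index by this point, so it is not a column). The variable names are illustrative only.

```python
# Column order after set_index('Player'), as selected in the loading cell.
columns = ['Year', 'Team', 'Pos', 'Age', 'G', 'AB', 'R', 'H', '2B', '3B', 'HR',
           'RBI', 'SB', 'CS', 'BB', 'SO', 'SH', 'SF', 'HBP', 'AVG', 'OBP',
           'SLG', 'OPS']

feature_columns = columns[4:-2]  # same slice as stats_df.iloc[:, 4:-2]
print(len(feature_columns))
print(feature_columns)
```

The slice keeps the 17 columns from `G` through `OBP`. Note that `OBP`, one of the two terms of OPS, stays among the features, while `SLG` and `OPS` are excluded.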

In [5]:
# Create training and testing sets (train_test_split defaults to a 75/25 split)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=78)

X_train.shape

(5914, 17)
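
The training cell further down has a commented-out `validation_data=(X_val, y_val)` argument, which would need a third split. The sketch below shows one way to cut row positions into train, validation, and test groups in plain Python. The helper name `three_way_split` and the 15% fractions are assumptions for illustration; calling `train_test_split` twice would be another route.

```python
import random


def three_way_split(n_rows, val_frac=0.15, test_frac=0.15, seed=78):
    """Shuffle row positions and cut them into train / validation / test lists."""
    positions = list(range(n_rows))
    random.Random(seed).shuffle(positions)
    n_test = round(n_rows * test_frac)
    n_val = round(n_rows * val_frac)
    test = positions[:n_test]
    val = positions[n_test:n_test + n_val]
    train = positions[n_test + n_val:]
    return train, val, test


train_idx, val_idx, test_idx = three_way_split(1000)
print(len(train_idx), len(val_idx), len(test_idx))
```

The returned positions could then be used with `iloc` to select the matching rows of `X` and entries of `y`.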

In [6]:
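# Scale the data
# Note: the scalers are fit on the full X and y, and only X and y are
# overwritten here. The X_train/X_test and y_train/y_test objects created by
# the split above are not changed by these transforms.
scaler = StandardScaler().fit(X)
X = scaler.transform(X)

y = y.reshape(-1, 1)
scaler_y = StandardScaler().fit(y)
y = scaler_y.transform(y)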
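
A common convention is to fit the scaler on the training split only and then apply the same statistics to the test split, so that inverse-transforming a prediction maps it back to the original OPS scale. The sketch below mirrors what `StandardScaler` computes (mean and population standard deviation) in plain Python. The function names and the demo values are hypothetical.

```python
from statistics import fmean, pstdev


def fit_standardizer(values):
    """Return (mean, std) using the population standard deviation."""
    return fmean(values), pstdev(values)


def transform(values, mean, std):
    """Standardize values with previously fitted statistics."""
    return [(v - mean) / std for v in values]


def inverse_transform(values, mean, std):
    """Map standardized values back to the original scale."""
    return [v * std + mean for v in values]


# Hypothetical OPS values standing in for a train/test split.
y_train_demo = [0.650, 0.710, 0.774, 0.830, 0.915]
y_test_demo = [0.700, 0.806]

mean, std = fit_standardizer(y_train_demo)          # statistics come from training data only
y_train_scaled = transform(y_train_demo, mean, std)
y_test_scaled = transform(y_test_demo, mean, std)   # reuse the training statistics
restored = inverse_transform(y_test_scaled, mean, std)
print(restored)
```

With scikit-learn the same pattern would be `fit` on the training arrays followed by `transform` on both splits.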

In [7]:
y_train.shape

(5914,)

In [8]:
#X_train = X_train.reshape((X_train.shape[0], X_train.shape[1], 1))
#X_test = X_test.reshape((X_test.shape[0], X_test.shape[1], 1))
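
The commented-out lines reshape each 2-D feature matrix to 3-D, `(samples, timesteps, features)`, which is the input layout Keras recurrent layers such as LSTM expect. The sketch below shows the same transformation on nested lists so the change in shape is visible without NumPy. The helper names and demo rows are hypothetical. With this layout each statistic becomes one timestep carrying a single feature.

```python
def add_feature_axis(matrix):
    """Turn rows of shape (features,) into (timesteps, 1): one value per timestep."""
    return [[[value] for value in row] for row in matrix]


def shape(nested):
    """Report the dimensions of a regular, non-empty nested list."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return tuple(dims)


# Two hypothetical rows with three features each.
X_demo = [[0.1, 0.2, 0.3],
          [0.4, 0.5, 0.6]]
X_3d = add_feature_axis(X_demo)
print(shape(X_demo), "->", shape(X_3d))
```

Because `X_train` is a pandas DataFrame at this point, the NumPy version would need to go through `X_train.values` before calling `reshape`.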

In [9]:
# Model set-up
number_input_features = 17
hidden_nodes_layer1 = 34
hidden_nodes_layer2 = 5

In [10]:
# Define the model (a feed-forward stack of Dense layers; no LSTM layer is added here)
model = Sequential()

# Layer 1
model.add(
    Dense(units=hidden_nodes_layer1, input_dim=number_input_features, activation="relu")
)

# Layer 2
model.add(Dense(units=hidden_nodes_layer2, activation="relu"))

# Output layer
model.add(Dense(1, activation="sigmoid"))
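
For a sense of the model's size, the number of trainable parameters in a stack of fully connected layers can be counted by hand: each layer has one weight per input-output pair plus one bias per output. The sketch below does this for the layer widths configured above. The helper name `dense_param_count` is hypothetical, and `model.summary()` should report the same total.

```python
def dense_param_count(layer_sizes):
    """Weights plus biases for consecutive fully connected layers."""
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out + n_out  # weight matrix + bias vector
    return total


# 17 inputs -> 34 hidden -> 5 hidden -> 1 output, as configured above.
print(dense_param_count([17, 34, 5, 1]))
```

The three layers contribute 612, 175, and 6 parameters respectively.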



In [11]:
# Compile the model
model.compile(
    loss="binary_crossentropy",
    optimizer="adam",
    metrics=[
        "accuracy",
        tf.keras.metrics.TruePositives(name="tp"),
        tf.keras.metrics.TrueNegatives(name="tn"),
        tf.keras.metrics.FalsePositives(name="fp"),
        tf.keras.metrics.FalseNegatives(name="fn"),
        tf.keras.metrics.Precision(name="precision"),
        tf.keras.metrics.Recall(name="recall"),
        tf.keras.metrics.AUC(name="auc"),
    ],
)



In [12]:
# Training the model
batch_size = 1000
epochs = 100
training_history = model.fit(
    X_train,
    y_train,
    #validation_data=(X_val, y_val),
    epochs=epochs,
    batch_size=batch_size,
    verbose=1,
)

Epoch 1/100
Epoch 2/100
...
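
With a batch size of 1000 and 5914 training rows, each epoch consists of only a handful of weight updates. The short calculation below is a sketch using the numbers from the cells above; the variable names are illustrative.

```python
import math

n_train = 5914      # rows in X_train, from X_train.shape above
batch_size = 1000
epochs = 100

steps_per_epoch = math.ceil(n_train / batch_size)  # the last batch is smaller
last_batch = n_train - (steps_per_epoch - 1) * batch_size
print(steps_per_epoch, last_batch, steps_per_epoch * epochs)
```

A smaller batch size would give more updates per epoch at the cost of noisier gradient estimates.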

In [13]:
# Make predictions using the testing data X_test
predicted = model.predict(X_test)

# Evaluate the model
model.evaluate(X_test, y_test)



[0.5084630828841946, 0.005578093, 1814.0, 0.0, 0.0, 158.0, 1.0, 0.9198783, 0.0]
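
`model.evaluate` returns the loss followed by the metrics in the order they were passed to `compile`. The sketch below pairs the values printed above with those names and recomputes precision and recall from the confusion-matrix counts as a consistency check. The `names` list is written out by hand here, following the `compile` call.

```python
names = ["loss", "accuracy", "tp", "tn", "fp", "fn", "precision", "recall", "auc"]
values = [0.5084630828841946, 0.005578093, 1814.0, 0.0, 0.0, 158.0, 1.0, 0.9198783, 0.0]

results = dict(zip(names, values))

# Precision and recall follow from the confusion-matrix counts.
precision = results["tp"] / (results["tp"] + results["fp"])
recall = results["tp"] / (results["tp"] + results["fn"])
print(results)
print(precision, recall)
```

These are classification metrics that treat the target as a binary label, so they say little about how close a predicted OPS is to the actual value. Error measures such as mean absolute error are usually more informative for a continuous target.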

In [14]:
predicted_OPS = scaler_y.inverse_transform(predicted)

In [15]:
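# Note: this index takes the last len(predicted_OPS) player names from stats_df.
# train_test_split shuffles rows, so X_test.index holds the players that
# actually belong to the test rows.
moneyball = pd.DataFrame({
    "Real": y_test.ravel(),
    "Predicted": predicted_OPS.ravel()
    }, index = stats_df.index[-len(predicted_OPS): ])
moneyball.head(20)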

| Player | Real | Predicted |
|---|---|---|
| Jacob Nottingham | 0.95 | 0.852732 |
| Taylor Motter | 0.651 | 0.786296 |
| Ildemaro Vargas | 0.616 | 0.769089 |
| Adam Rosales | 0.614 | 0.7831 |
| Brandon Barnes | 0.661 | 0.830018 |
| Anthony Alford | 0.646 | 0.795635 |
| Luis Sardinas | 1.185 | 0.858469 |
| Matt den Dekker | 0.419 | 0.763675 |
| Corban Joseph | 0.486 | 0.768718 |
| Gabriel Guerrero | 0.444 | 0.749129 |
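
To put a number on how far the predictions are from the real values, the sketch below computes the mean absolute error (MAE) and root mean squared error (RMSE) over the ten rows displayed above. It covers only those ten rows, not the full test set, and the variable names are illustrative.

```python
import math

# Real and predicted OPS for the ten rows shown above.
real = [0.95, 0.651, 0.616, 0.614, 0.661, 0.646, 1.185, 0.419, 0.486, 0.444]
predicted = [0.852732, 0.786296, 0.769089, 0.7831, 0.830018,
             0.795635, 0.858469, 0.763675, 0.768718, 0.749129]

errors = [p - r for p, r in zip(predicted, real)]
mae = sum(abs(e) for e in errors) / len(errors)
rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
print(round(mae, 3), round(rmse, 3))
```

Both measures are in OPS units. In these rows the predictions fall in a narrow band of roughly 0.75 to 0.86, while the real values range from 0.419 to 1.185.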


In [16]:
moneyball = moneyball.groupby('Player').mean()
moneyball.head(20)

| Player | Real | Predicted |
|---|---|---|
| AJ Pollock | 0.642333 | 0.784162 |
| AJ Reed | 0.696 | 0.807914 |
| Aaron Altherr | 0.757 | 0.823085 |
| Aaron Hicks | 0.586 | 0.775682 |
| Aaron Judge | 0.637 | 0.791791 |
| Abiatal Avelino | 0.546 | 0.784224 |
| Abraham Almonte | 0.637333 | 0.794459 |
| Abraham Toro | 0.802667 | 0.822177 |
| Adalberto Mondesi | 0.845333 | 0.831775 |
| Adam Duvall | 0.712 | 0.804174 |


In [17]:
moneyball.to_csv(r'Resources/Result/Moneyball.csv')