<a href="https://colab.research.google.com/github/bdg221/ml_project/blob/main/MLB.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **MLB Position Player Career of 7 Years**

# **Project Background and Description**

## Background

After 3 years of a rookie contract, the player's salary is determined by arbitration for the next 3 years. After the first 6 years of the player's career, the player is considered a free agent and can get a new contract, including changing teams. Usually, the player's subsequent contract includes an enormous salary increase (at least 2x the league minimum).

The goal of the model is to predict whether a player's career will last at least 7 years based on 2 years of stats. With this information, an MLB team could offer the player a subsequent contract early, locking in the player for a lower salary than if they waited until the player became a free agent.

### Example:
A player completes two years in the majors and is given a contract above the league minimum. If the player has a career of at least 7 years, then the team would save money in the last few years of the contract by locking the player into an early contract.  

## Project Description
When a minor league player makes it to the major leagues, the MLB team has the rookie player under contract for 6 years. The first 3 years of the rookie contract are considered pre-arbitration, during which the MLB team determines the player's salary (usually the league minimum, which was \$700,000 in 2022). After 3 years in the majors, the player becomes eligible for arbitration, where the team and the player each submit a salary and a third party chooses one. Recently, players have won arbitration more often than teams. After the 6 years of the rookie contract, if no agreement has been reached beforehand, the player is considered a free agent and can sign with any team for any amount.

Citation: https://franchisesports.co.uk/mlb-contract-extension-rules-explained/

In 2019, close to two thirds of MLB players had fewer than 3 years of experience, which meant that the majority made close to the league minimum. Only a little more than one third of all MLB players make meaningfully above the minimum salary. Therefore, knowing whether a player will have a longer career, and being able to sign that player earlier, saves the MLB team money.

Citation: https://www.thescore.com/mlb/news/2241177/manfreds-letter-ought-to-be-a-call-for-players-to-dig-in

## Performance Metric

Unfortunately, it is difficult to find median values for the first contracts rookies receive from MLB clubs. However, the current median salary for an MLB player is \$1.2 million and the current league minimum is \$700,000.

If a team is able to offer a contract for multiple years at or below the current median, the team would save significant money in the later years of the contract when the model is correct in the prediction.

If the prediction is wrong and the team gives the player a contract but his career lasts fewer than 7 years, the team would not only be paying a higher salary than the player is worth, but could also be paying after the player is no longer playing.

If the player is predicted to have a shorter career and the prediction is incorrect, there are 2 possible results. If the team wishes to keep the player, the subsequent contract will be much higher and will potentially cover later years in the player's career, when performance decline is more expected and injuries are more common. However, if the team decides NOT to keep the player, the team would save money since it paid the minimum amount during the first 6 years.

Please note that this model and metric are for average players and not outliers and superstars. Since we are referring to average players, the teams would be able to offer the player closer to the league median (still above league minimum.)

While a 51% success rate would count as a success, to further mitigate against losses I would recommend a 70% success rate to ensure financial success.

In [1]:
#tables and visualizations
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import datetime as dt

#machine learning
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline 
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, LabelBinarizer, StandardScaler
from sklearn import config_context
from sklearn.metrics import classification_report, confusion_matrix, ConfusionMatrixDisplay

# **Load Data**

We will need to load data from 2 CSV files. 

The PLAYERS csv file includes the players' information, including the date they debuted and the date of their last game.

The BATTING csv file includes the batting stats for players which will be used for feature engineering.

In [4]:

# Read in Players so we can calculate the lengths of careers
players = pd.read_csv('https://raw.githubusercontent.com/chadwickbureau/baseballdatabank/master/core/People.csv')
players = players.drop(columns=['birthYear','birthMonth', 'birthDay','birthCountry','birthState','birthCity','deathYear','deathMonth','deathDay','deathCity','deathState','deathCountry','nameGiven','weight','height','bats','throws', 'retroID', 'bbrefID'])
display(players.head())
players.info()

# Read in Batting statistics 
batting = pd.read_csv('https://raw.githubusercontent.com/chadwickbureau/baseballdatabank/master/core/Batting.csv')
#batting = pd.read_csv('https://raw.githubusercontent.com/bdg221/ml_project/main/Batting.csv')
batting = batting.drop(columns=['teamID','stint','lgID','CS','IBB','HBP','SH','SF','GIDP'])
display(batting.head())
batting.info()

Unnamed: 0,playerID,nameFirst,nameLast,debut,finalGame
0,aardsda01,David,Aardsma,2004-04-06,2015-08-23
1,aaronha01,Hank,Aaron,1954-04-13,1976-10-03
2,aaronto01,Tommie,Aaron,1962-04-10,1971-09-26
3,aasedo01,Don,Aase,1977-07-26,1990-10-03
4,abadan01,Andy,Abad,2001-09-10,2006-04-13


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20514 entries, 0 to 20513
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   playerID   20514 non-null  object
 1   nameFirst  20477 non-null  object
 2   nameLast   20514 non-null  object
 3   debut      20304 non-null  object
 4   finalGame  20304 non-null  object
dtypes: object(5)
memory usage: 801.5+ KB


Unnamed: 0,playerID,yearID,G,AB,R,H,2B,3B,HR,RBI,SB,BB,SO
0,abercda01,1871,1,4,0,0,0,0,0,0.0,0.0,0,0.0
1,addybo01,1871,25,118,30,32,6,0,0,13.0,8.0,4,0.0
2,allisar01,1871,29,137,28,40,4,5,0,19.0,3.0,2,5.0
3,allisdo01,1871,27,133,28,44,10,2,2,27.0,1.0,0,2.0
4,ansonca01,1871,25,120,29,39,11,3,0,16.0,6.0,2,1.0


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 110495 entries, 0 to 110494
Data columns (total 13 columns):
 #   Column    Non-Null Count   Dtype  
---  ------    --------------   -----  
 0   playerID  110495 non-null  object 
 1   yearID    110495 non-null  int64  
 2   G         110495 non-null  int64  
 3   AB        110495 non-null  int64  
 4   R         110495 non-null  int64  
 5   H         110495 non-null  int64  
 6   2B        110495 non-null  int64  
 7   3B        110495 non-null  int64  
 8   HR        110495 non-null  int64  
 9   RBI       109739 non-null  float64
 10  SB        108127 non-null  float64
 11  BB        110495 non-null  int64  
 12  SO        108395 non-null  float64
dtypes: float64(3), int64(9), object(1)
memory usage: 11.0+ MB


# Feature Engineering

1. Only take players from 1990 - 2015.
2. Only count a row if the player has > 130 ABs in it.
3. Find the first 2 years (may be more than 2 rows) for a unique player and sum the values of the rows.

*A row does not count for a player if it has 130 or fewer at bats, so a short rookie year is skipped.



RUNS - add the 2 years' worth of runs and divide by 2 years' worth of games to give average runs per game over 2 years

HITS - average hits per game over 2 years

DOUBLES - average doubles per game over 2 years

TRIPLES - average triples per game over 2 years

HOMERUNS - average homeruns per game over 2 years

AVERAGE - total hits over 2 years divided by total at bats over 2 years

RBI - average RBIs per game over 2 years

WALKS - average walks per game over 2 years

STRIKEOUTS - average strikeouts per game over 2 years

STOLEN BASES - average stolen bases per game over 2 years




In [5]:
# Change debut and finalGame to datetime objects for calculations
players['finalGame'] = pd.to_datetime(players['finalGame'])
players['debut'] = pd.to_datetime(players['debut'])

# Keep players whose final game was in 1992 or later and who debuted before 2015
players = players[players['finalGame'].dt.year > 1991]
players = players[players['debut'].dt.year < 2015]

#Calculate the length of the careers in years (365.2425 days per year on average)
players['length'] = (players['finalGame'] - players['debut']).dt.days / 365.2425

#Set a column called result with the answer to our prediction:
#1 if the career spans more than 6 years, otherwise 0
players['result'] = (players['length'] > 6).astype(int)
players['name'] = players['nameFirst']+' '+players['nameLast']
players = players.drop(columns=['nameFirst', 'nameLast','debut','finalGame','length'])
display(players.head())
players.info()

Unnamed: 0,playerID,result,name
0,aardsda01,1,David Aardsma
4,abadan01,0,Andy Abad
5,abadfe01,1,Fernando Abad
14,abbotje01,0,Jeff Abbott
15,abbotji01,1,Jim Abbott


<class 'pandas.core.frame.DataFrame'>
Int64Index: 5663 entries, 0 to 20508
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   playerID  5663 non-null   object
 1   result    5663 non-null   int64 
 2   name      5663 non-null   object
dtypes: int64(1), object(2)
memory usage: 177.0+ KB
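
As a sanity check on the labeling rule, the sketch below recomputes `result` for the first two players in the table above, using their `debut` and `finalGame` dates from the People.csv preview. It uses only the standard library; the helper names `career_years` and `label` are hypothetical and not part of the notebook, and the 6-year threshold mirrors the comparison used in the cell above.

```python
from datetime import date

DAYS_PER_YEAR = 365.2425  # average Gregorian year length


def career_years(debut, final_game):
    """Career length in fractional years between two dates."""
    return (final_game - debut).days / DAYS_PER_YEAR


def label(debut, final_game, threshold_years=6):
    """1 if the career spans more than threshold_years, otherwise 0."""
    return int(career_years(debut, final_game) > threshold_years)


# Dates copied from the People.csv preview earlier in the notebook
aardsma = label(date(2004, 4, 6), date(2015, 8, 23))
abad = label(date(2001, 9, 10), date(2006, 4, 13))

print(aardsma, abad)  # → 1 0
```

Both values agree with the `result` column shown above for `aardsda01` and `abadan01`.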


In [6]:


# This query will return only stats for the years 1990 through 2015.
# Also the rows will only be included if ABs is greater than 130
batting = batting.query("yearID > 1989 and yearID < 2016 and AB > 130")
aggregations = {
    'G':'sum',
    'AB':'sum',
    'R':'sum',
    'H':'sum',
    '2B':'sum',
    '3B':'sum',
    'HR':'sum',
    'RBI':'sum',
    'SB':'sum',
    'BB':'sum',
    'SO':'sum',
}

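# For each player, sort by year, keep the first 2 qualifying rows, and sum them.
# Note: head(2) selects rows, so 2 rows from the same season count as both rows.
batting = batting.groupby('playerID').apply(lambda x:x.sort_values('yearID').head(2).agg(aggregations))
# Keep players with more than 200 total at bats across those rows
batting = batting.query('AB > 200')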


finalset = pd.merge(players, batting, on='playerID')
display(finalset.head())

Unnamed: 0,playerID,result,name,G,AB,R,H,2B,3B,HR,RBI,SB,BB,SO
0,abbotje01,0,Jeff Abbott,169.0,459.0,64.0,127.0,29.0,2.0,15.0,70.0,5.0,30.0,66.0
1,abbotku01,1,Kurt Abbott,221.0,765.0,101.0,193.0,35.0,10.0,26.0,93.0,7.0,52.0,208.0
2,abercre01,0,Reggie Abercrombie,111.0,255.0,39.0,54.0,12.0,2.0,5.0,24.0,6.0,18.0,78.0
3,abernbr01,0,Brent Abernathy,196.0,767.0,89.0,194.0,35.0,5.0,7.0,73.0,18.0,52.0,81.0
4,abnersh01,0,Shawn Abner,188.0,392.0,38.0,103.0,19.0,1.0,2.0,31.0,3.0,21.0,63.0


Note on the step above: group by playerID, filter to the player's first 2 years from the start date, then aggregate (a "group by with summarize" operation).
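
One way to express "first 2 years, possibly more than 2 rows" is to collapse rows to one per player-season before picking the first two seasons. The following is a minimal sketch on made-up toy data, not the notebook's actual cell: the frame `batting_toy` and its values are hypothetical, and only three stat columns are shown.

```python
import pandas as pd

# Hypothetical toy data: player "a" has two rows in 2000 (e.g. a mid-season trade)
batting_toy = pd.DataFrame({
    "playerID": ["a", "a", "a", "a", "b", "b"],
    "yearID":   [2000, 2000, 2001, 2002, 2005, 2006],
    "G":        [60, 50, 140, 150, 100, 120],
    "AB":       [200, 180, 500, 550, 300, 400],
    "H":        [50, 45, 140, 160, 80, 110],
})

# Step 1: one row per player-season, summing rows that share a season
season = batting_toy.groupby(["playerID", "yearID"], as_index=False).sum()

# Step 2: rank each player's seasons chronologically and keep the first two
season["season_rank"] = season.groupby("playerID")["yearID"].rank(method="dense")
first_two = season[season["season_rank"] <= 2]

# Step 3: summarize the first two seasons into one row per player
totals = first_two.groupby("playerID")[["G", "AB", "H"]].sum()
print(totals)
```

For player `a`, this sums both 2000 rows plus the 2001 row and leaves out 2002, which a plain `head(2)` on the raw rows would not do.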



In [7]:
finalset['AVG'] = finalset['H'] / finalset['AB']
finalset['R'] = finalset['R'] / finalset['G']
finalset['H'] = finalset['H'] / finalset['G']
finalset['2B'] = finalset['2B'] / finalset['G']
finalset['3B'] = finalset['3B'] / finalset['G']
finalset['HR'] = finalset['HR'] / finalset['G']
finalset['RBI'] = finalset['RBI'] / finalset['G']
finalset['SB'] = finalset['SB'] / finalset['G']
finalset['BB'] = finalset['BB'] / finalset['G']
finalset['SO'] = finalset['SO'] / finalset['G']
backup = finalset.copy()  # independent copy that still has playerID, name, G, and AB
finalset = finalset.drop(columns=['playerID','name','G','AB'])
display(finalset.head())

Unnamed: 0,result,R,H,2B,3B,HR,RBI,SB,BB,SO,AVG
0,0,0.378698,0.751479,0.171598,0.011834,0.088757,0.414201,0.029586,0.177515,0.390533,0.276688
1,1,0.457014,0.873303,0.158371,0.045249,0.117647,0.420814,0.031674,0.235294,0.941176,0.252288
2,0,0.351351,0.486486,0.108108,0.018018,0.045045,0.216216,0.054054,0.162162,0.702703,0.211765
3,0,0.454082,0.989796,0.178571,0.02551,0.035714,0.372449,0.091837,0.265306,0.413265,0.252934
4,0,0.202128,0.547872,0.101064,0.005319,0.010638,0.164894,0.015957,0.111702,0.335106,0.262755
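
The first row of this table can be checked by hand against the two-season totals shown earlier for Jeff Abbott (G=169, AB=459, R=64, H=127). The short sketch below repeats that arithmetic in plain Python; the variable names are illustrative only.

```python
# Two-season totals for Jeff Abbott, copied from the merged table above
games, at_bats, runs, hits = 169, 459, 64, 127

avg = hits / at_bats          # batting average: hits per at bat
hits_per_game = hits / games  # H feature: hits per game
runs_per_game = runs / games  # R feature: runs per game

print(round(avg, 6), round(hits_per_game, 6), round(runs_per_game, 6))
# → 0.276688 0.751479 0.378698
```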


# Split Data

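Split the data into training (75%) and test (25%) sets, stratifying on the target so both sets keep the same class balance.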

In [8]:
class_column = 'result'
random_seed = 86753

X_train, X_test, y_train, y_test = train_test_split(finalset.drop(columns=class_column), finalset[class_column],
                                                   test_size=0.25, random_state=random_seed, stratify=finalset[class_column])
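
To illustrate what `stratify` does, here is a small sketch on hypothetical toy labels (80 positives, 20 negatives) rather than the real `finalset`. It reuses the same `test_size` and seed as the cell above and checks that the share of positives is the same in both halves.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 80% of labels are 1, 20% are 0
y_toy = [1] * 80 + [0] * 20
X_toy = [[i] for i in range(100)]

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=86753, stratify=y_toy
)

# Share of the positive class in each split
share_train = sum(y_tr) / len(y_tr)
share_test = sum(y_te) / len(y_te)
print(len(y_tr), len(y_te), share_train, share_test)
```

Without `stratify`, the two shares would generally differ from split to split, which matters when one class is much rarer than the other.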

# Pipeline

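The values have already been cleaned, so the first test is a pipeline containing only a LogisticRegression model. A num_pipe with a StandardScaler followed by the LogisticRegression is a natural next step, since the saga solver converges more reliably on features with similar scales.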

In [15]:


#generate the modeling pipeline (model step only, no preprocessing yet)
pipe = Pipeline(steps=[('mdl', LogisticRegression(penalty='elasticnet', solver='saga', tol=0.01, l1_ratio=0.1))])

pipe.fit(X_train, y_train)
prediction = pipe.predict(X_test)
# classification_report is already imported in the first cell
print(classification_report(y_test, prediction))

              precision    recall  f1-score   support

           0       1.00      0.01      0.03        76
           1       0.81      1.00      0.89       319

    accuracy                           0.81       395
   macro avg       0.90      0.51      0.46       395
weighted avg       0.85      0.81      0.73       395
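
The recall of 0.01 for class 0 suggests the model is predicting almost every player as class 1. A possible variant, sketched below on synthetic data, adds a `StandardScaler` step and `class_weight="balanced"`. Both additions are assumptions for illustration and were not run on the notebook's data, so no results are claimed for them; `X_toy` and `y_toy` are made up.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales (hypothetical data)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3)) * np.array([1.0, 10.0, 100.0])
y_toy = (X_toy[:, 0] + X_toy[:, 1] / 10.0 > 0).astype(int)

scaled_pipe = Pipeline(steps=[
    # Put every feature on a comparable scale before the saga solver sees it
    ("scale", StandardScaler()),
    # class_weight="balanced" is an assumption added here to counter class imbalance
    ("mdl", LogisticRegression(penalty="elasticnet", solver="saga", tol=0.01,
                               l1_ratio=0.1, class_weight="balanced",
                               max_iter=1000)),
])

scaled_pipe.fit(X_toy, y_toy)
toy_prediction = scaled_pipe.predict(X_toy)
print(toy_prediction[:10])
```

On the real data, the same pipeline would be fit with `X_train` and `y_train` and scored on `X_test` exactly as in the cell above.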



Next, try a different model, a RandomForestClassifier, to see the difference in results.


In [20]:
# max_features="sqrt" is what the older "auto" option meant for classifiers
pipe = Pipeline(steps=[('mdl', RandomForestClassifier(n_estimators=250, criterion="gini", max_depth=10, min_samples_split=2, min_samples_leaf=2, min_weight_fraction_leaf=0.0, max_features="sqrt", max_leaf_nodes=None))])
pipe.fit(X_train, y_train)
prediction = pipe.predict(X_test)
# classification_report is already imported in the first cell
print(classification_report(y_test, prediction))


              precision    recall  f1-score   support

           0       0.36      0.05      0.09        76
           1       0.81      0.98      0.89       319

    accuracy                           0.80       395
   macro avg       0.59      0.52      0.49       395
weighted avg       0.73      0.80      0.73       395
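
For context on these accuracy figures, it helps to compare them with a trivial baseline that always predicts the majority class. The sketch below computes that baseline from the `support` column of the report above (76 players in class 0, 319 in class 1); the variable names are illustrative.

```python
# Test-set class counts, taken from the support column of the report
support = {0: 76, 1: 319}

total = sum(support.values())
majority_class = max(support, key=support.get)

# Accuracy of a model that always predicts the majority class
baseline_accuracy = support[majority_class] / total

print(majority_class, total, round(baseline_accuracy, 4))
# → 1 395 0.8076
```

An accuracy of about 0.80 is therefore roughly what always predicting class 1 would achieve on this test set, so the precision and recall for class 0 are the more informative numbers here.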

