### Will Yadier Molina be inducted into the MLB Hall of Fame?
This is a machine learning exercise that tries to answer a controversial question: will Yadier Molina be inducted into the Hall of Fame? This is the second version of the exercise. The improvement over the previous version is that several machine learning models are evaluated, not just two. The best model is selected and then fed Yadier Molina's data.

In [1]:
# Import libraries
import numpy as np
import pandas as pd
import pandas_profiling

# Import libraries for model validation
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn import preprocessing
from sklearn.model_selection import cross_val_score

# Import machine learning model libraries
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
import xgboost

### Importing and checking the data

In [2]:
# Import the dataframe with the catchers' information. This dataframe was gathered in the previous version of the algorithm
stats_df = pd.read_csv('molina_stats_df.csv', index_col=0)
stats_df.head(2)

Unnamed: 0,playerID,nameFirst,nameLast,birthYear,birthCountry,G_c,G_all,AB,R,RBI,H,2B,3B,HR,AVG,PB,SB,inducted
0,allisdo01,Doug,Allison,1846.0,USA,277.0,316.0,1407.0,236.0,139.0,382.0,44.0,10.0,2.0,0.2715,166.0,0.0,False
1,alomasa02,Sandy,Alomar,1966.0,P.R.,1324.0,1377.0,4530.0,520.0,588.0,1236.0,249.0,10.0,112.0,0.272848,71.0,758.0,False


In [3]:
# Collect Yadier Molina's data. This data has the X (features only) format, which is different from the above format for all the other catchers
# This is a manual input with the numbers updated to Sep 21, 2022
molinaDict = [{'G_c':2180 , # Games as catcher
              'G_all':2221, # Games played
              'AB':7806,    # At-bat
              'R':777,      # Runs scored
              'RBI':1020,   # Runs Batted In
              'H':2167,     # Hits
              '2B':408,     # Doubles
              '3B':7,       # Triples
              'HR':176,     # Home Runs
              'AVG':278,    # Batting average (.278 entered as 278; stats_df stores AVG as a fraction)
              'PB':96,      # Passed balls
              'SB':562}]    # Defensive stolen bases
molina = pd.DataFrame(molinaDict)
molina

Unnamed: 0,G_c,G_all,AB,R,RBI,H,2B,3B,HR,AVG,PB,SB
0,2180,2221,7806,777,1020,2167,408,7,176,278,96,562
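Before modeling, it can help to confirm that Molina's row uses the same columns and units as the other catchers. The two rows shown by `stats_df.head(2)` are consistent with `AVG = H / AB` stored as a fraction (for example 382 / 1407 ≈ 0.2715), so the sketch below recomputes AVG that way. The names `feature_cols`, `molina_raw` and `molina_aligned` are introduced here for illustration and are not part of the original notebook.

```python
import pandas as pd

# Feature columns in the order shown by stats_df.head() above
feature_cols = ['G_c', 'G_all', 'AB', 'R', 'RBI', 'H',
                '2B', '3B', 'HR', 'AVG', 'PB', 'SB']

# Molina's numbers as entered in the notebook (AVG entered as 278)
molina_raw = pd.DataFrame([{
    'G_c': 2180, 'G_all': 2221, 'AB': 7806, 'R': 777, 'RBI': 1020,
    'H': 2167, '2B': 408, '3B': 7, 'HR': 176, 'AVG': 278,
    'PB': 96, 'SB': 562,
}])

# Reorder the columns and use floats, like the columns of stats_df
molina_aligned = molina_raw[feature_cols].astype(float)

# Assumption: AVG in stats_df is hits divided by at-bats, stored as a fraction
molina_aligned['AVG'] = molina_aligned['H'] / molina_aligned['AB']

print(molina_aligned[['H', 'AB', 'AVG']].round(3))
```

With this adjustment the AVG value falls inside the 0.19 to 0.32 range reported by `stats_df.describe()`, instead of being roughly a thousand times larger than every training value.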


In [4]:
# Checking the datatypes
stats_df.dtypes

playerID         object
nameFirst        object
nameLast         object
birthYear       float64
birthCountry     object
G_c             float64
G_all           float64
AB              float64
R               float64
RBI             float64
H               float64
2B              float64
3B              float64
HR              float64
AVG             float64
PB              float64
SB              float64
inducted           bool
dtype: object

In [5]:
# Profile report on the stats dataframe. This code takes about 1 minute to run; comment it out if not needed
pandas_profiling.ProfileReport(stats_df)



In [6]:
# Check descriptive statistics values for the stats dataframe
stats_df.describe()

Unnamed: 0,birthYear,G_c,G_all,AB,R,RBI,H,2B,3B,HR,AVG,PB,SB
count,116.0,116.0,116.0,116.0,116.0,116.0,116.0,116.0,116.0,116.0,116.0,116.0,116.0
mean,1919.327586,1093.327586,1384.474138,4546.267241,567.543103,615.724138,1225.948276,212.396552,34.163793,103.491379,0.263997,103.974138,433.215517
std,36.110164,529.067001,555.743817,2070.56384,340.851756,348.163544,612.252503,118.757545,27.013988,104.234476,0.025724,100.738789,427.542763
min,1846.0,67.0,120.0,343.0,38.0,39.0,82.0,8.0,3.0,2.0,0.192708,6.0,0.0
25%,1889.75,699.0,939.5,2892.0,313.25,334.0,733.25,120.5,15.75,20.0,0.244849,56.0,0.0
50%,1922.0,1113.0,1435.0,4493.0,528.5,586.5,1209.5,204.5,27.5,67.0,0.264832,75.5,310.5
75%,1949.0,1456.5,1776.0,5864.0,712.75,815.5,1582.0,272.25,47.25,174.5,0.282194,121.5,753.5
max,1974.0,2427.0,2850.0,10876.0,1844.0,1430.0,3060.0,668.0,178.0,427.0,0.319598,639.0,1586.0


In [7]:
# Check for null values
# stats_df.isnull().sum() provides the null by feature
# stats_df.isnull().sum().sum() provides the total of null values
stats_df.isnull().sum().sum()

0

In [8]:
# See all of the different values for a particular column
stats_df['3B'].value_counts()

10.0     5
12.0     5
27.0     4
34.0     4
6.0      4
        ..
45.0     1
178.0    1
40.0     1
72.0     1
76.0     1
Name: 3B, Length: 61, dtype: int64

In [9]:
# Normalize the data. StandardScaler replaces every cell with its z-value, computed per column.
# The z-value is a descriptive statistic that indicates how many standard deviations a value is away from the mean
scaler = preprocessing.StandardScaler()
stats_df_normalized = scaler.fit_transform(stats_df.drop(['playerID', 'nameFirst', 'nameLast', 'birthYear','birthCountry', 'inducted'], axis=1))
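As a small check of what the z-value means, the sketch below compares `StandardScaler` with the manual formula `(x - mean) / std` on a made-up array. The numbers in `toy` are invented for illustration and are not taken from `stats_df`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up values: one column on a "games" scale, one on a "batting average" scale
toy = np.array([[10.0, 0.25],
                [20.0, 0.27],
                [30.0, 0.29]])

toy_scaler = StandardScaler()
z_sklearn = toy_scaler.fit_transform(toy)

# Manual z-value per column; np.std uses the population formula (ddof=0) by default,
# which is also what StandardScaler uses
z_manual = (toy - toy.mean(axis=0)) / toy.std(axis=0)

print(np.allclose(z_sklearn, z_manual))
print(z_sklearn.mean(axis=0), z_sklearn.std(axis=0))
```

After scaling, every column has mean 0 and standard deviation 1, so features measured in thousands (such as AB) and features measured in fractions (such as AVG) contribute on a comparable scale.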

In [10]:
# Split the data into training and test data and create a baseline model using Dummy Classifier
X_train, X_test, y_train, y_test = train_test_split(stats_df_normalized, stats_df['inducted'], test_size=0.3, random_state=42)
bm = DummyClassifier()
bm.fit(X_train, y_train)
bm.score(X_test, y_test)

0.8285714285714286
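The baseline score is high because the classes are imbalanced: a dummy model that always predicts the most common class is right whenever a test sample belongs to that class. The sketch below shows this on made-up labels; the arrays are invented for illustration, and `strategy="prior"` is set explicitly because the default strategy has changed between scikit-learn versions.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Made-up labels: 80% False (not inducted), 20% True (inducted)
y_train_toy = np.array([False] * 8 + [True] * 2)
y_test_toy = np.array([False] * 4 + [True] * 1)

# The dummy model ignores the features, so zeros are enough
X_train_toy = np.zeros((len(y_train_toy), 1))
X_test_toy = np.zeros((len(y_test_toy), 1))

baseline = DummyClassifier(strategy="prior")  # always predicts the most frequent training class
baseline.fit(X_train_toy, y_train_toy)

baseline_accuracy = baseline.score(X_test_toy, y_test_toy)
majority_share = (~y_test_toy).mean()  # share of False labels in the test set

print(baseline_accuracy, majority_share)
```

Any real model has to beat this share of the majority class before its accuracy says anything about the features.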

In [11]:
# Create a models list. It will be used in the next steps
models_list = [
    DummyClassifier,
    LogisticRegression,
    DecisionTreeClassifier,
    KNeighborsClassifier,
    GaussianNB,
    SVC,
    RandomForestClassifier,
    xgboost.XGBClassifier
]

In [12]:
# Observe the different scores for the different models for normalized data
X = stats_df_normalized
y = stats_df['inducted'].values

for model in models_list:
    cls = model()
    kfold = KFold(n_splits=10, random_state=42, shuffle=True)
    s = cross_val_score(cls, X, y, cv=kfold)
    print(
        f"{model.__name__:22} Accuracy: {s.mean():.3f} STD: {s.std():.2f}"
    )

DummyClassifier        Accuracy: 0.791 STD: 0.09
LogisticRegression     Accuracy: 0.834 STD: 0.10
DecisionTreeClassifier Accuracy: 0.783 STD: 0.09
KNeighborsClassifier   Accuracy: 0.844 STD: 0.08
GaussianNB             Accuracy: 0.801 STD: 0.09
SVC                    Accuracy: 0.858 STD: 0.13
RandomForestClassifier Accuracy: 0.843 STD: 0.09
XGBClassifier          Accuracy: 0.819 STD: 0.08
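When `scoring` is not set, `cross_val_score` uses the estimator's own `score` method, which for scikit-learn classifiers is accuracy. The sketch below shows one way to request ROC AUC explicitly. It is only an illustration under stated assumptions: the data comes from `make_classification` rather than `stats_df`, `StratifiedKFold` replaces `KFold` so that every fold contains both classes, and the scaler is placed inside a pipeline so that it is fitted on the training folds only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic, imbalanced stand-in for the catchers data (not the real stats)
X_toy, y_toy = make_classification(
    n_samples=120, n_features=12, weights=[0.8, 0.2], random_state=0
)

# Scaling inside the pipeline is refitted on each training split
pipe = make_pipeline(StandardScaler(), SVC())

# Stratified folds keep the class ratio similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

acc_scores = cross_val_score(pipe, X_toy, y_toy, cv=cv, scoring="accuracy")
auc_scores = cross_val_score(pipe, X_toy, y_toy, cv=cv, scoring="roc_auc")

print(f"Accuracy: {acc_scores.mean():.3f} STD: {acc_scores.std():.2f}")
print(f"ROC AUC:  {auc_scores.mean():.3f} STD: {auc_scores.std():.2f}")
```

With a majority class near 80%, accuracy and ROC AUC can rank models differently, so it is worth reporting the metric by its actual name.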


In [13]:
# Predict for Yadier Molina with the best model (SVC had the highest mean cross-validation score)
best_model = SVC()
best_model.fit(X,y)
best_model.predict(molina)



array([False])
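A model trained on z-values expects new rows to be on the same scale, so a new player's raw numbers would normally go through the same fitted scaler before `predict` is called. The sketch below illustrates the idea on made-up data; `toy_scaler`, `X_raw` and `new_player_raw` are names invented here, and the result says nothing about the real prediction for Molina.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Made-up training data: a "games" column and a "batting average" column
rng = np.random.default_rng(0)
X_raw = rng.normal(loc=[1000.0, 0.26], scale=[500.0, 0.02], size=(40, 2))
y_toy = X_raw[:, 0] > 1000.0  # invented label rule, for illustration only

# Fit the scaler on the training data and train on the scaled values
toy_scaler = StandardScaler().fit(X_raw)
model = SVC().fit(toy_scaler.transform(X_raw), y_toy)

# A new player in raw units, with the same columns and units as X_raw
new_player_raw = np.array([[2180.0, 0.278]])

# Reuse the fitted scaler (transform, not fit_transform) before predicting
new_player_scaled = toy_scaler.transform(new_player_raw)
prediction = model.predict(new_player_scaled)

print(new_player_scaled, prediction)
```

In this notebook the equivalent step would presumably be `scaler.transform(molina)`, provided Molina's columns use the same order and units as the training features.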

In [14]:
# Observe the different predictions for the available models
for model in models_list:
    print(
        f'{model.__name__:22} Prediction: {model().fit(X,y).predict(molina)}'
    )

DummyClassifier        Prediction: [False]
LogisticRegression     Prediction: [ True]
DecisionTreeClassifier Prediction: [ True]
KNeighborsClassifier   Prediction: [ True]
GaussianNB             Prediction: [ True]
SVC                    Prediction: [False]
RandomForestClassifier Prediction: [ True]
XGBClassifier          Prediction: [1]


