# Machine Learning: Hall of Fame Classification

## Table of Contents

> #### 1. [Project Overview](#Project-Overview)
> #### 2. [Installing Necessary Packages and Reading Data](#Installing-Necessary-Packages-and-Reading-Data)
> #### 3. [Data Munging and Cleaning](#Data-Munging-and-Cleaning)
> #### 4. [Machine Learning Implementation](#Maching-Learning-Implementation)
> #### 5. [Model Evaluation and Summary](#Model-Evaluation-and-Summary)

## 1. <a name ="Project-Overview"></a> Project Overview

This project attempts to build a classification model that predicts whether an MLB position player will be inducted into the Hall of Fame based on several inputs: ```H``` (career hits), ```HR``` (home runs), ```2B``` (doubles), ```batting_avg``` (batting average), ```SB``` (stolen bases), and ```career_all_star_games``` (number of All-Star Game selections).

This is a classification problem because the desired prediction ('Inducted into the Hall of Fame?') is a categorical output: the player is either inducted ('Y') or not inducted ('N'). In this project, I fit and test two classification models (a decision tree classifier and a random forest classifier) and evaluate which is more accurate.

## 2. <a name ="Installing-Necessary-Packages-and-Reading-Data"></a> Installing Necessary Packages and Reading Data

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [2]:
#load necessary datasets
#players
#awardsPlayers
#HallofFame
#All Star full
#Batting

players = pd.read_csv('/Users/chriskucewicz/Documents/DataSciencefiles/Sabermetrics/baseballdatabank-2022.2/core/People.csv')
batting = pd.read_csv('/Users/chriskucewicz/Documents/DataSciencefiles/Sabermetrics/baseballdatabank-2022.2/core/Batting.csv')
allStar = pd.read_csv('/Users/chriskucewicz/Documents/DataSciencefiles/Sabermetrics/baseballdatabank-2022.2/core/AllstarFull.csv')
HoF = pd.read_csv('/Users/chriskucewicz/Documents/DataSciencefiles/Sabermetrics/baseballdatabank-2022.2/contrib/HallOfFame.csv')
awards = pd.read_csv('/Users/chriskucewicz/Documents/DataSciencefiles/Sabermetrics/baseballdatabank-2022.2/contrib/AwardsPlayers.csv')
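The absolute paths above only resolve on the author's machine. A minimal sketch of one way to make the loading step portable, assuming the same `core/` and `contrib/` folder layout as in the cell above; `TABLES`, `load_lahman_tables`, and `base_dir` are hypothetical names that are not part of the original notebook:

```python
from pathlib import Path

import pandas as pd

# Relative locations of each table inside the baseballdatabank folder,
# taken from the paths used in the cell above.
TABLES = {
    'players': 'core/People.csv',
    'batting': 'core/Batting.csv',
    'allStar': 'core/AllstarFull.csv',
    'HoF': 'contrib/HallOfFame.csv',
    'awards': 'contrib/AwardsPlayers.csv',
}


def load_lahman_tables(base_dir):
    """Read every table listed in TABLES from base_dir into a dict of DataFrames."""
    base = Path(base_dir)
    return {name: pd.read_csv(base / rel_path) for name, rel_path in TABLES.items()}
```

With this helper, only the single `base_dir` argument would need to change from one machine to another, for example `tables = load_lahman_tables('baseballdatabank-2022.2')` followed by `batting = tables['batting']`.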

## 3. <a name ="Data-Munging-and-Cleaning"></a> Data Munging and Cleaning

In [3]:
#Get a preview of batting 
batting.head()

Unnamed: 0,playerID,yearID,stint,teamID,lgID,G,AB,R,H,2B,...,RBI,SB,CS,BB,SO,IBB,HBP,SH,SF,GIDP
0,abercda01,1871,1,TRO,,1,4,0,0,0,...,0.0,0.0,0.0,0,0.0,,,,,0.0
1,addybo01,1871,1,RC1,,25,118,30,32,6,...,13.0,8.0,1.0,4,0.0,,,,,0.0
2,allisar01,1871,1,CL1,,29,137,28,40,4,...,19.0,3.0,1.0,2,5.0,,,,,1.0
3,allisdo01,1871,1,WS3,,27,133,28,44,10,...,27.0,1.0,1.0,0,2.0,,,,,0.0
4,ansonca01,1871,1,RC1,,25,120,29,39,11,...,16.0,6.0,2.0,2,1.0,,,,,0.0


From the output above, each row in the ```batting``` data set represents one season of statistics for a player (more precisely, one ```stint``` with one team within a season). A player who played from 2000 to 2009 without changing teams mid-season will therefore have 10 rows in this dataframe. In order to have a useful data set for classification, the yearly stats for each player will need to be aggregated into career totals.
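The aggregation described here can be sketched on a toy table. The player IDs and numbers below are made up for illustration; only the column names follow the ```batting``` preview above:

```python
import pandas as pd

# Toy season-level rows: playera01 has two stints in 2001.
seasons = pd.DataFrame({
    'playerID': ['playera01', 'playera01', 'playera01', 'playerb01'],
    'yearID':   [2000, 2001, 2001, 2000],
    'stint':    [1, 1, 2, 1],
    'teamID':   ['AAA', 'AAA', 'BBB', 'CCC'],
    'AB':       [500, 250, 260, 40],
    'H':        [150, 70, 80, 8],
})

# Sum only the counting stats. yearID and stint are identifiers, so
# adding them up (as a blanket .sum() does) gives meaningless totals.
career = seasons.groupby('playerID')[['AB', 'H']].sum()
print(career)
```

Selecting the stat columns explicitly is one design choice; summing every numeric column also works, but the summed ```yearID``` and ```stint``` values should then be ignored.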

In [4]:
#Get a preview of allStar 
allStar.head()

Unnamed: 0,playerID,yearID,gameNum,gameID,teamID,lgID,GP,startingPos
0,gomezle01,1933,0,ALS193307060,NYA,AL,1,1.0
1,ferreri01,1933,0,ALS193307060,BOS,AL,1,2.0
2,gehrilo01,1933,0,ALS193307060,NYA,AL,1,3.0
3,gehrich01,1933,0,ALS193307060,DET,AL,1,4.0
4,dykesji01,1933,0,ALS193307060,CHA,AL,1,5.0


Similar to ```batting```, ```allStar``` has several rows per player: each row corresponds to one All-Star Game selection, and ```gameNum``` distinguishes games held in the same year. To prepare for classification, the total number of All-Star selections for each player will need to be compiled from ```allStar```.
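A minimal sketch of that count on toy data (made-up player IDs). It uses `reindex` with `fill_value=0` so that players with no selections get 0 directly, which is an alternative to filling `NaN` afterwards as the notebook does:

```python
import pandas as pd

# Toy career table indexed by playerID, and toy All-Star rows:
# playera01 has three selections, playerb01 has none.
career = pd.DataFrame(
    {'H': [300, 8]},
    index=pd.Index(['playera01', 'playerb01'], name='playerID'),
)
appearances = pd.DataFrame({
    'playerID': ['playera01', 'playera01', 'playera01'],
    'yearID':   [2000, 2001, 2001],
    'gameNum':  [0, 1, 2],
})

# size() counts rows per player; reindex aligns the counts to the career
# table and gives players without any rows a count of 0 instead of NaN.
counts = appearances.groupby('playerID').size()
career['career_all_star_games'] = counts.reindex(career.index, fill_value=0)
print(career)
```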

In [5]:
#Get a preview of HoF 
HoF.head()

Unnamed: 0,playerID,yearID,votedBy,ballots,needed,votes,inducted,category,needed_note
0,cobbty01,1936,BBWAA,226.0,170.0,222.0,Y,Player,
1,ruthba01,1936,BBWAA,226.0,170.0,215.0,Y,Player,
2,wagneho01,1936,BBWAA,226.0,170.0,215.0,Y,Player,
3,mathech01,1936,BBWAA,226.0,170.0,205.0,Y,Player,
4,johnswa01,1936,BBWAA,226.0,170.0,189.0,Y,Player,


From ```HoF```, the ```inducted``` column will need to be extracted, as it is the target my models will predict.
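One thing to watch when extracting the label: ```HoF``` has a ```yearID``` column, so a player may appear on more than one ballot. The sketch below assumes that can happen and collapses the ballot rows to a single 'Y'/'N' label per player before any join. The data is made up, and the notebook's own outputs do not show whether this step was needed:

```python
import pandas as pd

# Toy ballot history: playera01 is rejected once, then inducted.
ballots = pd.DataFrame({
    'playerID': ['playera01', 'playera01', 'playerb01'],
    'yearID':   [1990, 1991, 1990],
    'inducted': ['N', 'Y', 'N'],
})

# A player counts as inducted if any of their ballot rows says 'Y'.
was_inducted = ballots['inducted'].eq('Y').groupby(ballots['playerID']).any()
labels = was_inducted.map({True: 'Y', False: 'N'}).rename('inducted')
print(labels)
```

Because `labels` has exactly one entry per ```playerID```, joining it onto a career table cannot duplicate player rows.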

In [6]:
#Get a preview of awards 
awards.head()

Unnamed: 0,playerID,awardID,yearID,lgID,tie,notes
0,bondto01,Pitching Triple Crown,1877,NL,,
1,hinespa01,Triple Crown,1878,NL,,
2,heckegu01,Pitching Triple Crown,1884,AA,,
3,radboch01,Pitching Triple Crown,1884,NL,,
4,oneilti01,Triple Crown,1887,AA,,


In [7]:
#Get a preview of players 
players.head()

Unnamed: 0,playerID,birthYear,birthMonth,birthDay,birthCountry,birthState,birthCity,deathYear,deathMonth,deathDay,...,nameLast,nameGiven,weight,height,bats,throws,debut,finalGame,retroID,bbrefID
0,aardsda01,1981.0,12.0,27.0,USA,CO,Denver,,,,...,Aardsma,David Allan,215.0,75.0,R,R,2004-04-06,2015-08-23,aardd001,aardsda01
1,aaronha01,1934.0,2.0,5.0,USA,AL,Mobile,2021.0,1.0,22.0,...,Aaron,Henry Louis,180.0,72.0,R,R,1954-04-13,1976-10-03,aaroh101,aaronha01
2,aaronto01,1939.0,8.0,5.0,USA,AL,Mobile,1984.0,8.0,16.0,...,Aaron,Tommie Lee,190.0,75.0,R,R,1962-04-10,1971-09-26,aarot101,aaronto01
3,aasedo01,1954.0,9.0,8.0,USA,CA,Orange,,,,...,Aase,Donald William,190.0,75.0,R,R,1977-07-26,1990-10-03,aased001,aasedo01
4,abadan01,1972.0,8.0,25.0,USA,FL,Palm Beach,,,,...,Abad,Fausto Andres,184.0,73.0,L,L,2001-09-10,2006-04-13,abada001,abadan01


In [8]:
batting.columns

Index(['playerID', 'yearID', 'stint', 'teamID', 'lgID', 'G', 'AB', 'R', 'H',
       '2B', '3B', 'HR', 'RBI', 'SB', 'CS', 'BB', 'SO', 'IBB', 'HBP', 'SH',
       'SF', 'GIDP'],
      dtype='object')

In [9]:
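#compile career AB, R, H, 2b, 3b, HR, RBI, SB, CS, BB, SO
#numeric_only=True skips text columns such as teamID and lgID
#note: yearID and stint are summed too, but those totals are not meaningful
career_batting = batting.groupby('playerID').sum(numeric_only=True)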
career_batting.head()

playerID,yearID,stint,G,AB,R,H,2B,3B,HR,RBI,SB,CS,BB,SO,IBB,HBP,SH,SF,GIDP
aardsda01,18084,9,331,4,0,0,0,0,0,0.0,0.0,0.0,0,2.0,0.0,0.0,1.0,0.0,0.0
aaronha01,45195,23,3298,12364,2174,3771,624,98,755,2297.0,240.0,73.0,1402,1383.0,293.0,32.0,21.0,121.0,328.0
aaronto01,13768,7,437,944,102,216,42,6,13,94.0,9.0,8.0,86,145.0,3.0,0.0,9.0,6.0,36.0
aasedo01,25786,13,448,5,0,0,0,0,0,0.0,0.0,0.0,0,3.0,0.0,0.0,0.0,0.0,0.0
abadan01,6010,3,15,21,1,2,0,0,0,0.0,0.0,1.0,4,5.0,0.0,0.0,0.0,0.0,1.0


In [10]:
#create function to determine career batting average
#(OBP and slugging pct are left commented out for a future iteration)
def slash_line(df):
    #column arithmetic is vectorized across all players, so no loop is needed
    singles = df['H'] - df['2B'] - df['3B'] - df['HR']

    #players with 0 AB get NaN here (0/0)
    df['batting_avg'] = round(df['H']/ df['AB'], 3)

    #df['OBP'] = round((df['H'] + df['BB'] + df['HBP']) / \
     #           (df['AB'] + df['BB'] + df['HBP'] + df['SF']), 3)

    #df['SLG'] = round((singles + 2*df['2B'] + 3*df['3B'] + 4*df['HR']) / \
     #           df['AB'], 3)
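The commented-out OBP and SLG formulas can be applied the same way as the batting average. The sketch below is one possible version rather than the notebook's code: `add_slash_line` is a hypothetical name, it returns a copy instead of modifying the frame in place, and it turns zero denominators into `NaN` explicitly. The first toy row uses Hank Aaron's career totals from the ```career_batting``` preview; the second row is a made-up player with no at-bats.

```python
import numpy as np
import pandas as pd


def add_slash_line(df):
    """Return a copy of df with batting_avg, OBP and SLG columns added."""
    out = df.copy()
    # Treat zero denominators as missing so the ratio is NaN, not inf.
    at_bats = out['AB'].replace(0, np.nan)
    plate = (out['AB'] + out['BB'] + out['HBP'] + out['SF']).replace(0, np.nan)
    singles = out['H'] - out['2B'] - out['3B'] - out['HR']
    total_bases = singles + 2 * out['2B'] + 3 * out['3B'] + 4 * out['HR']

    out['batting_avg'] = (out['H'] / at_bats).round(3)
    out['OBP'] = ((out['H'] + out['BB'] + out['HBP']) / plate).round(3)
    out['SLG'] = (total_bases / at_bats).round(3)
    return out


toy = pd.DataFrame(
    {'AB': [12364, 0], 'H': [3771, 0], '2B': [624, 0], '3B': [98, 0],
     'HR': [755, 0], 'BB': [1402, 0], 'HBP': [32, 0], 'SF': [121, 0]},
    index=['aaronha01', 'noatbats01'],
)
result = add_slash_line(toy)
print(result[['batting_avg', 'OBP', 'SLG']])
```

For the first row this gives a slash line of .305/.374/.555, and the player with no at-bats gets `NaN` in all three columns.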

In [11]:
slash_line(career_batting)

In [12]:
career_batting.head()

playerID,yearID,stint,G,AB,R,H,2B,3B,HR,RBI,SB,CS,BB,SO,IBB,HBP,SH,SF,GIDP,batting_avg
aardsda01,18084,9,331,4,0,0,0,0,0,0.0,0.0,0.0,0,2.0,0.0,0.0,1.0,0.0,0.0,0.0
aaronha01,45195,23,3298,12364,2174,3771,624,98,755,2297.0,240.0,73.0,1402,1383.0,293.0,32.0,21.0,121.0,328.0,0.305
aaronto01,13768,7,437,944,102,216,42,6,13,94.0,9.0,8.0,86,145.0,3.0,0.0,9.0,6.0,36.0,0.229
aasedo01,25786,13,448,5,0,0,0,0,0,0.0,0.0,0.0,0,3.0,0.0,0.0,0.0,0.0,0.0,0.0
abadan01,6010,3,15,21,1,2,0,0,0,0.0,0.0,1.0,4,5.0,0.0,0.0,0.0,0.0,1.0,0.095


In [13]:
#want to compile total number of all star games per player
allStar.groupby('playerID').count()['gameNum']

playerID
aaronha01    24
aasedo01      1
abreubo01     2
abreujo02     3
acunaro01     2
             ..
zimmery01     2
ziskri01      2
zitoba01      3
zobribe01     3
zuninmi01     1
Name: gameNum, Length: 1907, dtype: int64

In [14]:
#add total all star games variable to career batting df
career_batting['career_all_star_games'] = allStar.groupby('playerID').count()['gameNum']

In [15]:
career_batting.head()

playerID,yearID,stint,G,AB,R,H,2B,3B,HR,RBI,...,CS,BB,SO,IBB,HBP,SH,SF,GIDP,batting_avg,career_all_star_games
aardsda01,18084,9,331,4,0,0,0,0,0,0.0,...,0.0,0,2.0,0.0,0.0,1.0,0.0,0.0,0.0,
aaronha01,45195,23,3298,12364,2174,3771,624,98,755,2297.0,...,73.0,1402,1383.0,293.0,32.0,21.0,121.0,328.0,0.305,24.0
aaronto01,13768,7,437,944,102,216,42,6,13,94.0,...,8.0,86,145.0,3.0,0.0,9.0,6.0,36.0,0.229,
aasedo01,25786,13,448,5,0,0,0,0,0,0.0,...,0.0,0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
abadan01,6010,3,15,21,1,2,0,0,0,0.0,...,1.0,4,5.0,0.0,0.0,0.0,0.0,1.0,0.095,


In [16]:
#Cleaning NaN values in career_all_star_games column
career_batting['career_all_star_games'] = career_batting['career_all_star_games'].fillna(value = 0)

In [17]:
career_batting.head()

playerID,yearID,stint,G,AB,R,H,2B,3B,HR,RBI,...,CS,BB,SO,IBB,HBP,SH,SF,GIDP,batting_avg,career_all_star_games
aardsda01,18084,9,331,4,0,0,0,0,0,0.0,...,0.0,0,2.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
aaronha01,45195,23,3298,12364,2174,3771,624,98,755,2297.0,...,73.0,1402,1383.0,293.0,32.0,21.0,121.0,328.0,0.305,24.0
aaronto01,13768,7,437,944,102,216,42,6,13,94.0,...,8.0,86,145.0,3.0,0.0,9.0,6.0,36.0,0.229,0.0
aasedo01,25786,13,448,5,0,0,0,0,0,0.0,...,0.0,0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
abadan01,6010,3,15,21,1,2,0,0,0,0.0,...,1.0,4,5.0,0.0,0.0,0.0,0.0,1.0,0.095,0.0


### Aggregating Career Awards

The code below was my attempt to compile the career awards (Silver Slugger, Gold Glove, MVP, Rookie of the Year, and Triple Crown) for each player. I was able to aggregate the awards into a pivot table successfully. In a future iteration of this project, I would merge the career awards onto ```career_batting``` and use them as additional input features for my models. To prevent these cells from running, they are marked as Raw NBConvert cells.
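Since the raw cells are not shown, here is a minimal sketch of what such an aggregation and merge could look like. It uses `pd.crosstab` on made-up rows (the player IDs are invented and the award names are only examples), so it is an illustration of the idea and not the author's original code:

```python
import pandas as pd

# Toy award rows with the same columns as the awards preview.
awards_toy = pd.DataFrame({
    'playerID': ['playera01', 'playera01', 'playera01', 'playerb01'],
    'awardID':  ['Gold Glove', 'Gold Glove', 'Silver Slugger', 'Rookie of the Year'],
    'yearID':   [2000, 2001, 2001, 2000],
})

# One row per player, one column per award; cells count how often it was won.
career_awards = pd.crosstab(awards_toy['playerID'], awards_toy['awardID'])

# Join onto a career table indexed by playerID. Players without any award
# come through as NaN, which is filled with 0.
career = pd.DataFrame(
    {'H': [300, 8, 50]},
    index=pd.Index(['playera01', 'playerb01', 'playerc01'], name='playerID'),
)
career = career.join(career_awards).fillna(0)
print(career)
```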

In [18]:
#award_groups.get_level_values()

### Joining Dataframes

In [19]:
#create dataframe from HoF with results of HoF voting
HoF.head()

Unnamed: 0,playerID,yearID,votedBy,ballots,needed,votes,inducted,category,needed_note
0,cobbty01,1936,BBWAA,226.0,170.0,222.0,Y,Player,
1,ruthba01,1936,BBWAA,226.0,170.0,215.0,Y,Player,
2,wagneho01,1936,BBWAA,226.0,170.0,215.0,Y,Player,
3,mathech01,1936,BBWAA,226.0,170.0,205.0,Y,Player,
4,johnswa01,1936,BBWAA,226.0,170.0,189.0,Y,Player,


In [20]:
#select the label column and index it by playerID in one step
#(avoids calling an inplace method on a slice of HoF)
HoF_selected = HoF[['playerID','inducted']].set_index('playerID')

In [21]:
career_stats = career_batting.join(HoF_selected, how = 'outer')
career_stats.head()

playerID,yearID,stint,G,AB,R,H,2B,3B,HR,RBI,...,BB,SO,IBB,HBP,SH,SF,GIDP,batting_avg,career_all_star_games,inducted
aardsda01,18084.0,9.0,331.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,2.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,
aaronha01,45195.0,23.0,3298.0,12364.0,2174.0,3771.0,624.0,98.0,755.0,2297.0,...,1402.0,1383.0,293.0,32.0,21.0,121.0,328.0,0.305,24.0,Y
aaronto01,13768.0,7.0,437.0,944.0,102.0,216.0,42.0,6.0,13.0,94.0,...,86.0,145.0,3.0,0.0,9.0,6.0,36.0,0.229,0.0,
aasedo01,25786.0,13.0,448.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,
abadan01,6010.0,3.0,15.0,21.0,1.0,2.0,0.0,0.0,0.0,0.0,...,4.0,5.0,0.0,0.0,0.0,0.0,1.0,0.095,0.0,


In [22]:
career_stats.info()

<class 'pandas.core.frame.DataFrame'>
Index: 23144 entries, aardsda01 to zychto01
Data columns (total 22 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   yearID                 23073 non-null  float64
 1   stint                  23073 non-null  float64
 2   G                      23073 non-null  float64
 3   AB                     23073 non-null  float64
 4   R                      23073 non-null  float64
 5   H                      23073 non-null  float64
 6   2B                     23073 non-null  float64
 7   3B                     23073 non-null  float64
 8   HR                     23073 non-null  float64
 9   RBI                    23073 non-null  float64
 10  SB                     23073 non-null  float64
 11  CS                     23073 non-null  float64
 12  BB                     23073 non-null  float64
 13  SO                     23073 non-null  float64
 14  IBB                    23073 non-null  float64
 

In [23]:
career_stats[career_stats['inducted'] == 'Y']

playerID,yearID,stint,G,AB,R,H,2B,3B,HR,RBI,...,BB,SO,IBB,HBP,SH,SF,GIDP,batting_avg,career_all_star_games,inducted
aaronha01,45195.0,23.0,3298.0,12364.0,2174.0,3771.0,624.0,98.0,755.0,2297.0,...,1402.0,1383.0,293.0,32.0,21.0,121.0,328.0,0.305,24.0,Y
alexape01,40336.0,22.0,703.0,1810.0,154.0,378.0,60.0,13.0,11.0,163.0,...,77.0,276.0,0.0,2.0,88.0,0.0,0.0,0.209,0.0,Y
alomaro01,37939.0,21.0,2379.0,9073.0,1508.0,2724.0,504.0,80.0,210.0,1134.0,...,1032.0,1140.0,62.0,50.0,148.0,97.0,206.0,0.300,12.0,Y
alstowa01,1936.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.000,0.0,Y
andersp01,1959.0,1.0,152.0,477.0,42.0,104.0,9.0,3.0,0.0,34.0,...,42.0,53.0,1.0,1.0,5.0,2.0,15.0,0.218,0.0,Y
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
yastrca01,45356.0,23.0,3308.0,11988.0,1816.0,3419.0,646.0,59.0,452.0,1844.0,...,1845.0,1393.0,190.0,40.0,13.0,105.0,323.0,0.285,18.0,Y
yawketo99,,,,,,,,,,,...,,,,,,,,,,Y
youngcy01,43722.0,24.0,918.0,2960.0,325.0,623.0,87.0,35.0,18.0,290.0,...,81.0,381.0,0.0,10.0,50.0,0.0,0.0,0.210,0.0,Y
youngro01,19215.0,10.0,1211.0,4627.0,812.0,1491.0,236.0,93.0,42.0,592.0,...,550.0,390.0,0.0,37.0,119.0,0.0,0.0,0.322,0.0,Y


In [24]:
player_names = players[['playerID', 'nameFirst', 'nameLast']]
#player_names.set_index('playerID', inplace = True)
player_names.head()

Unnamed: 0,playerID,nameFirst,nameLast
0,aardsda01,David,Aardsma
1,aaronha01,Hank,Aaron
2,aaronto01,Tommie,Aaron
3,aasedo01,Don,Aase
4,abadan01,Andy,Abad


In [25]:
career_stats = career_stats.merge(player_names, on = 'playerID')

In [26]:
career_stats

Unnamed: 0,playerID,yearID,stint,G,AB,R,H,2B,3B,HR,...,IBB,HBP,SH,SF,GIDP,batting_avg,career_all_star_games,inducted,nameFirst,nameLast
0,aardsda01,18084.0,9.0,331.0,4.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.000,0.0,,David,Aardsma
1,aaronha01,45195.0,23.0,3298.0,12364.0,2174.0,3771.0,624.0,98.0,755.0,...,293.0,32.0,21.0,121.0,328.0,0.305,24.0,Y,Hank,Aaron
2,aaronto01,13768.0,7.0,437.0,944.0,102.0,216.0,42.0,6.0,13.0,...,3.0,0.0,9.0,6.0,36.0,0.229,0.0,,Tommie,Aaron
3,aasedo01,25786.0,13.0,448.0,5.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.000,1.0,,Don,Aase
4,abadan01,6010.0,3.0,15.0,21.0,1.0,2.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.095,0.0,,Andy,Abad
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
23139,zupofr01,5876.0,3.0,16.0,18.0,3.0,3.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.167,0.0,,Frank,Zupo
23140,zuvelpa01,17875.0,9.0,209.0,491.0,41.0,109.0,17.0,2.0,2.0,...,1.0,2.0,18.0,0.0,8.0,0.222,0.0,,Paul,Zuvella
23141,zuverge01,19551.0,12.0,266.0,142.0,5.0,21.0,2.0,1.0,0.0,...,0.0,0.0,16.0,0.0,3.0,0.148,0.0,,George,Zuverink
23142,zwilldu01,7655.0,4.0,366.0,1280.0,167.0,364.0,76.0,15.0,30.0,...,0.0,4.0,31.0,0.0,0.0,0.284,0.0,,Dutch,Zwilling


In [27]:
#moving first and last name columns to the front of the dataframe
first_name = career_stats.pop('nameFirst')
career_stats.insert(1, 'FirstName', first_name)

In [28]:
#Moving last names column to index #2
last_name = career_stats.pop('nameLast')
career_stats.insert(2, 'LastName', last_name)

In [29]:
career_stats

Unnamed: 0,playerID,FirstName,LastName,yearID,stint,G,AB,R,H,2B,...,BB,SO,IBB,HBP,SH,SF,GIDP,batting_avg,career_all_star_games,inducted
0,aardsda01,David,Aardsma,18084.0,9.0,331.0,4.0,0.0,0.0,0.0,...,0.0,2.0,0.0,0.0,1.0,0.0,0.0,0.000,0.0,
1,aaronha01,Hank,Aaron,45195.0,23.0,3298.0,12364.0,2174.0,3771.0,624.0,...,1402.0,1383.0,293.0,32.0,21.0,121.0,328.0,0.305,24.0,Y
2,aaronto01,Tommie,Aaron,13768.0,7.0,437.0,944.0,102.0,216.0,42.0,...,86.0,145.0,3.0,0.0,9.0,6.0,36.0,0.229,0.0,
3,aasedo01,Don,Aase,25786.0,13.0,448.0,5.0,0.0,0.0,0.0,...,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.000,1.0,
4,abadan01,Andy,Abad,6010.0,3.0,15.0,21.0,1.0,2.0,0.0,...,4.0,5.0,0.0,0.0,0.0,0.0,1.0,0.095,0.0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
23139,zupofr01,Frank,Zupo,5876.0,3.0,16.0,18.0,3.0,3.0,1.0,...,2.0,6.0,0.0,0.0,0.0,0.0,0.0,0.167,0.0,
23140,zuvelpa01,Paul,Zuvella,17875.0,9.0,209.0,491.0,41.0,109.0,17.0,...,34.0,50.0,1.0,2.0,18.0,0.0,8.0,0.222,0.0,
23141,zuverge01,George,Zuverink,19551.0,12.0,266.0,142.0,5.0,21.0,2.0,...,9.0,39.0,0.0,0.0,16.0,0.0,3.0,0.148,0.0,
23142,zwilldu01,Dutch,Zwilling,7655.0,4.0,366.0,1280.0,167.0,364.0,76.0,...,128.0,155.0,0.0,4.0,31.0,0.0,0.0,0.284,0.0,


In [30]:
career_stats.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 23144 entries, 0 to 23143
Data columns (total 25 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   playerID               23144 non-null  object 
 1   FirstName              23107 non-null  object 
 2   LastName               23144 non-null  object 
 3   yearID                 23073 non-null  float64
 4   stint                  23073 non-null  float64
 5   G                      23073 non-null  float64
 6   AB                     23073 non-null  float64
 7   R                      23073 non-null  float64
 8   H                      23073 non-null  float64
 9   2B                     23073 non-null  float64
 10  3B                     23073 non-null  float64
 11  HR                     23073 non-null  float64
 12  RBI                    23073 non-null  float64
 13  SB                     23073 non-null  float64
 14  CS                     23073 non-null  float64
 15  BB

In [31]:
#Cleaning NaN All-Star values by converting NaN to 0
career_stats['career_all_star_games'] = career_stats['career_all_star_games'].fillna(value = 0)

In [32]:
#Cleaning NaN Inducted values by converting NaN to 'N'
career_stats['inducted'] = career_stats['inducted'].fillna(value = 'N')

In [33]:
career_stats.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 23144 entries, 0 to 23143
Data columns (total 25 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   playerID               23144 non-null  object 
 1   FirstName              23107 non-null  object 
 2   LastName               23144 non-null  object 
 3   yearID                 23073 non-null  float64
 4   stint                  23073 non-null  float64
 5   G                      23073 non-null  float64
 6   AB                     23073 non-null  float64
 7   R                      23073 non-null  float64
 8   H                      23073 non-null  float64
 9   2B                     23073 non-null  float64
 10  3B                     23073 non-null  float64
 11  HR                     23073 non-null  float64
 12  RBI                    23073 non-null  float64
 13  SB                     23073 non-null  float64
 14  CS                     23073 non-null  float64
 15  BB

In [34]:
career_stats['inducted'].value_counts()

N    22821
Y      323
Name: inducted, dtype: int64

In [35]:
career_stats.isna().sum()

playerID                    0
FirstName                  37
LastName                    0
yearID                     71
stint                      71
G                          71
AB                         71
R                          71
H                          71
2B                         71
3B                         71
HR                         71
RBI                        71
SB                         71
CS                         71
BB                         71
SO                         71
IBB                        71
HBP                        71
SH                         71
SF                         71
GIDP                       71
batting_avg              2404
career_all_star_games       0
inducted                    0
dtype: int64

In [36]:
career_stats[career_stats['batting_avg'].isnull()]

Unnamed: 0,playerID,FirstName,LastName,yearID,stint,G,AB,R,H,2B,...,BB,SO,IBB,HBP,SH,SF,GIDP,batting_avg,career_all_star_games,inducted
13,abbotgl01,Glenn,Abbott,23743.0,13.0,248.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,0.0,N
34,abreual01,Albert,Abreu,4041.0,2.0,30.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,0.0,N
36,abreubr01,Bryan,Abreu,6060.0,3.0,42.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,0.0,N
39,abreuju01,Juan,Abreu,2011.0,1.0,7.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,0.0,N
45,acevedo01,Domingo,Acevedo,2021.0,1.0,10.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,0.0,N
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
23113,zimmeky01,Kyle,Zimmer,6060.0,3.0,83.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,0.0,N
23116,zinkch01,Charlie,Zink,2008.0,1.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,0.0,N
23121,zinsebi01,Bill,Zinser,1944.0,1.0,2.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,0.0,N
23136,zumayjo01,Joel,Zumaya,10040.0,5.0,171.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,0.0,N


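For this analysis I am considering only position players, not pitchers, so the code below removes any players with fewer than 100 career at-bats, since anyone below that threshold was likely not a regular position player. This is a rough heuristic: pitchers with long careers, such as Cy Young (2,960 at-bats in the inducted table above), still pass the filter.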

In [37]:
career_stats = career_stats[career_stats['AB'] >= 100]
career_stats

Unnamed: 0,playerID,FirstName,LastName,yearID,stint,G,AB,R,H,2B,...,BB,SO,IBB,HBP,SH,SF,GIDP,batting_avg,career_all_star_games,inducted
1,aaronha01,Hank,Aaron,45195.0,23.0,3298.0,12364.0,2174.0,3771.0,624.0,...,1402.0,1383.0,293.0,32.0,21.0,121.0,328.0,0.305,24.0,Y
2,aaronto01,Tommie,Aaron,13768.0,7.0,437.0,944.0,102.0,216.0,42.0,...,86.0,145.0,3.0,0.0,9.0,6.0,36.0,0.229,0.0,N
7,abbated01,Ed,Abbaticchio,19051.0,11.0,855.0,3044.0,355.0,772.0,99.0,...,289.0,283.0,0.0,33.0,93.0,0.0,0.0,0.254,0.0,N
8,abbeybe01,Bert,Abbey,11365.0,7.0,79.0,225.0,21.0,38.0,3.0,...,21.0,54.0,0.0,0.0,6.0,0.0,0.0,0.169,0.0,N
9,abbeych01,Charlie,Abbey,9475.0,5.0,452.0,1756.0,307.0,493.0,67.0,...,167.0,122.0,0.0,23.0,19.0,0.0,0.0,0.281,0.0,N
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
23137,zuninmi01,Mike,Zunino,18153.0,9.0,814.0,2559.0,308.0,518.0,111.0,...,198.0,981.0,1.0,58.0,8.0,12.0,49.0,0.202,1.0,N
23138,zupcibo01,Bob,Zupcic,9964.0,6.0,319.0,795.0,99.0,199.0,47.0,...,57.0,137.0,3.0,6.0,20.0,8.0,15.0,0.250,0.0,N
23140,zuvelpa01,Paul,Zuvella,17875.0,9.0,209.0,491.0,41.0,109.0,17.0,...,34.0,50.0,1.0,2.0,18.0,0.0,8.0,0.222,0.0,N
23141,zuverge01,George,Zuverink,19551.0,12.0,266.0,142.0,5.0,21.0,2.0,...,9.0,39.0,0.0,0.0,16.0,0.0,3.0,0.148,0.0,N
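A side effect of this filter is worth noting: rows that came only from ```HoF``` and have no batting record (for example ```yawketo99``` above) carry `NaN` in ```AB```, and a comparison with `NaN` is always `False`, so those rows are dropped as well. A small sketch on made-up rows:

```python
import numpy as np
import pandas as pd

# Toy rows: a regular hitter, a pitcher with few at-bats, and a
# Hall of Fame entry with no batting record (NaN after the outer join).
toy = pd.DataFrame({
    'playerID': ['hitter01', 'pitcher01', 'nonplayer99'],
    'AB':       [4500.0, 12.0, np.nan],
    'inducted': ['N', 'N', 'Y'],
})

# NaN >= 100 evaluates to False, so the row without batting data is
# removed together with the low-AB player.
kept = toy[toy['AB'] >= 100]
print(kept)
```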


In [38]:
career_stats.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 12046 entries, 1 to 23142
Data columns (total 25 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   playerID               12046 non-null  object 
 1   FirstName              12046 non-null  object 
 2   LastName               12046 non-null  object 
 3   yearID                 12046 non-null  float64
 4   stint                  12046 non-null  float64
 5   G                      12046 non-null  float64
 6   AB                     12046 non-null  float64
 7   R                      12046 non-null  float64
 8   H                      12046 non-null  float64
 9   2B                     12046 non-null  float64
 10  3B                     12046 non-null  float64
 11  HR                     12046 non-null  float64
 12  RBI                    12046 non-null  float64
 13  SB                     12046 non-null  float64
 14  CS                     12046 non-null  float64
 15  BB

## 4. <a name ="Maching-Learning-Implementation"></a> Machine Learning Implementation

In [39]:
#importing necessary packages and models

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

In [40]:
#Defining variable for target feature
y = career_stats['inducted']

In [41]:
#defining input features: H, 2B, HR, SB, batting_avg, career_all_star_games
features = ['H', '2B' ,'HR', 'SB', 'batting_avg', 'career_all_star_games']

x = career_stats[features]

In [42]:
#splitting data into training and testing data
train_x, test_x, train_y, test_y = train_test_split(x, y, train_size = 0.75, test_size = 0.25, random_state = 1)
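Because inducted players are rare, a plain random split can leave the training and test sets with different shares of 'Y'. The sketch below shows the `stratify` argument of `train_test_split` on synthetic data. It is an optional variation, not what the notebook ran, and using it would change the reported scores:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic, heavily imbalanced labels: 30 'Y' out of 1000 (3%).
rng = np.random.default_rng(1)
x_toy = pd.DataFrame({'H': rng.integers(0, 3500, size=1000)})
y_toy = pd.Series(['Y'] * 30 + ['N'] * 970)

# stratify=y_toy keeps the share of 'Y' (nearly) identical in both splits.
x_tr, x_te, y_tr, y_te = train_test_split(
    x_toy, y_toy, test_size=0.25, random_state=1, stratify=y_toy)

print(y_tr.value_counts(normalize=True))
print(y_te.value_counts(normalize=True))
```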

In [43]:
#creating variables for models
hall_of_fame_decision_tree = DecisionTreeClassifier(random_state = 1)
hall_of_fame_random_forest = RandomForestClassifier(random_state = 1)

#hall_of_fame_KNeighbors = KNeighborsClassifier()  #KNeighborsClassifier takes no random_state argument

In [44]:
#fitting decision tree model
hall_of_fame_decision_tree.fit(train_x, train_y)

DecisionTreeClassifier(random_state=1)

In [45]:
#fitting random forest model
hall_of_fame_random_forest.fit(train_x, train_y)

RandomForestClassifier(random_state=1)

In [46]:
#getting decision tree predictions
test_predictions_decision_tree = hall_of_fame_decision_tree.predict(test_x)

In [47]:
#getting random forest predictions
test_predictions_random_forest = hall_of_fame_random_forest.predict(test_x)

## 5. <a name ="Model-Evaluation-and-Summary"></a> Model Evaluation and Summary

In [48]:
decision_tree_accuracy = accuracy_score(test_y, test_predictions_decision_tree)
random_forest_accuracy = accuracy_score(test_y, test_predictions_random_forest)

#accuracy is the share of ALL test-set players (inducted or not) that were classified correctly
print(f'The decision tree model correctly classified {round(decision_tree_accuracy*100, 2)}% of players in the test data set.')
print()
print(f'The random forest model correctly classified {round(random_forest_accuracy*100, 2)}% of players in the test data set.')

The decision tree model correctly classified 97.51% of players in the test data set.

The random forest model correctly classified 97.58% of players in the test data set.
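Accuracy counts correct 'N' predictions as well as correct 'Y' predictions, so it can look strong even when no Hall of Famer is identified. The sketch below illustrates this on synthetic labels (3% 'Y', chosen only for illustration) with a baseline that always predicts the majority class. It is not a result from the notebook's data:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Synthetic labels: 3 'Y' and 97 'N'.
y_true = np.array(['Y'] * 3 + ['N'] * 97)
x_dummy = np.zeros((100, 1))  # the baseline ignores the features

# Baseline that always predicts the most frequent class ('N').
baseline = DummyClassifier(strategy='most_frequent').fit(x_dummy, y_true)
y_pred = baseline.predict(x_dummy)

baseline_accuracy = accuracy_score(y_true, y_pred)
baseline_recall = recall_score(y_true, y_pred, pos_label='Y')
# Rows are true classes, columns are predicted classes, ordered ['N', 'Y'].
matrix = confusion_matrix(y_true, y_pred, labels=['N', 'Y'])

print('accuracy:', baseline_accuracy)
print('recall for Y:', baseline_recall)
print(matrix)
```

The baseline reaches 97% accuracy while finding none of the 'Y' players, which is why recall for the 'Y' class and the confusion matrix are useful companions to the accuracy score.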


### Summary



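For this project, both the decision tree and random forest models classified a high share of players in the test data set correctly (97.51% and 97.58%, respectively). These figures should be read with caution because the classes are heavily imbalanced: at most 323 of the 12,046 players who passed the at-bat filter are inducted (under 3%), so a model that predicted 'N' for every player would also score roughly 97%. The high accuracy also leads me to wonder if there might be a problem with overfitting my models. In future iterations of this project, I hope to explore these potential issues further, for example by looking at precision and recall for the 'Y' class. Additionally, I am interested in trying other classification models such as K-Nearest Neighbors.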
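One way to start looking into the overfitting question is to compare training and held-out scores and to cross-validate a metric that focuses on the rare class. The sketch below runs on synthetic data, so the numbers it prints say nothing about the Hall of Fame models. It only shows the mechanics, under the assumption that a random forest with default settings is being checked:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic data: the label depends on the first feature plus noise.
rng = np.random.default_rng(1)
x_toy = rng.normal(size=(600, 4))
noisy_signal = x_toy[:, 0] + rng.normal(scale=1.0, size=600)
y_toy = (noisy_signal > 1.5).astype(int)  # 1 plays the role of 'Y'

x_tr, x_te, y_tr, y_te = train_test_split(
    x_toy, y_toy, test_size=0.25, random_state=1, stratify=y_toy)

model = RandomForestClassifier(random_state=1).fit(x_tr, y_tr)

# A training score far above the held-out score suggests overfitting.
train_acc = model.score(x_tr, y_tr)
test_acc = model.score(x_te, y_te)

# 5-fold cross-validated recall: the share of true positives found in
# each fold, which accuracy alone does not reveal.
cv_recall = cross_val_score(
    RandomForestClassifier(random_state=1), x_toy, y_toy,
    cv=5, scoring='recall')

print(f'train accuracy: {train_acc:.3f}, test accuracy: {test_acc:.3f}')
print(f'cross-validated recall: {cv_recall.mean():.3f}')
```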