# HW4: Baseball Modeling

For this problem, please install the <i>Lahman</i> package, a comprehensive database of baseball statistics,
and use it to answer a few questions.

Important information:
<ul><li>
project home page (with links to impressive graphics):  http://lahman.r-forge.r-project.org/
</li><li>
package documentation (html):  http://lahman.r-forge.r-project.org/doc/
</li></ul>

The documentation includes descriptions of the many tables in this package, such as the
Salaries table: http://lahman.r-forge.r-project.org/doc/Salaries.html


#  The Goal

There are two problems for you to solve:
<ul><li>
Problem 1: construct a model that predicts a player's salary based on his baseball statistics.
Your model should have better performance (higher R-squared) than the baseline model given.
</li><li>
Problem 2: construct a model that predicts whether a player will be inducted into the Hall of Fame.
Your model should have better performance (higher Hall-of-Fame-Accuracy-Rate) than the baseline model given.
</li></ul>
Here, <i>Hall-of-Fame-Accuracy-Rate</i> is a weighted percentage of correct predictions
for players in the Hall of Fame:  <u>correct prediction for players in the Hall of Fame
is worth 100 times more than for players who are not in the Hall of Fame.</u>

Then, as in HW3, upload a .csv file containing your models to CCLE.


## Step 1: build the models

Using the 'RelevantInformation' table, one model should predict a player's maximum salary,
the other should predict whether or not they will get into the Hall of Fame.

<b>YOU CAN USE ANY MODEL YOU LIKE.</b>
The baseline models are a linear regression model and a logistic regression model,
but you can choose <i>any</i> model.
Please produce the most accurate models you can:
more accurate models will get a higher score.

<hr style="border-width:20px;">

## Step 2: generate a CSV file "HW4_Baseball_Models.csv" including your 2 models

If these were your two models, then to complete the assignment you would create
a CSV file <tt>HW4_Baseball_Models.csv</tt> containing two lines:

<code>
      0.8999,"lm( log10(max_salary) ~ AB+R+H+X2B+X3B+HR+RBI+SB, data = RelevantInformation)"
      0.7888,"glm( HallOfFame ~ AB+R+H+X2B+X3B+HR+SlugPct, data = RelevantInformation, family=binomial)"
</code>

<b>Each line gives the accuracy of a model</b>,
as well as <b>the exact command you used to generate the model</b>.
There is no length restriction on the lines.

<hr style="border-width:20px;">

## Step 3: upload your CSV file and notebook to CCLE

Finally, go to CCLE and upload:
<ul><li>
your output CSV file <tt>HW4_Baseball_Models.csv</tt>
</li><li>
your notebook file <tt>HW4_Baseball_Modeling.ipynb</tt>
</li></ul>

We are not planning to run any of the uploaded notebooks.
However, your notebook should contain the commands you used in developing your models,
in order to show your work.
As announced, all assignment grading in this course will be automated,
and the notebook is needed in order to check the results of the grading program.

# Get the Lahman package for R -- a database of Baseball Statistics

<hr style="border-width:20px;">

### The safe way to install it, so it will work with Jupyter -- execute the command:

<pre>
         sudo conda install -c https://conda.anaconda.org/asmeurer r-lahman
</pre>
### (The 'sudo' is not necessary if your conda installation is not write-protected.)

<hr style="border-width:20px;">

### Another way to install the Lahman package (if this works from within your Jupyter session):

In [3]:
if (!(is.element("Lahman", installed.packages()))) install.packages("Lahman", repos="http://cran.us.r-project.org")

### Load the Lahman baseball data

In [4]:
library(Lahman)

<hr style="border-width:20px;">

### Another way to get the data, if you cannot load the Lahman package:

The files
<tt>PlayersAndStats.csv</tt>
and
<tt>PlayersAndStatsAndSalary.csv</tt>
are distributed with the homework assignment, and are used in the notebook below.

You can use these files rather than recomputing the tables using the Lahman package.

# Extract Tables of Relevant Information for your Models

### Player information -- from the Master table
http://lahman.r-forge.r-project.org/doc/Master.html

In [5]:
SelectedColumns = c("playerID","nameFirst","nameLast","birthYear", "weight","height","bats","throws")
Players = na.omit( Master[, SelectedColumns] )
summary(Players)

   playerID          nameFirst           nameLast           birthYear   
 Length:17071       Length:17071       Length:17071       Min.   :1835  
 Class :character   Class :character   Class :character   1st Qu.:1902  
 Mode  :character   Mode  :character   Mode  :character   Median :1941  
                                                          Mean   :1935  
                                                          3rd Qu.:1969  
                                                          Max.   :1994  
     weight          height      bats      throws   
 Min.   : 65.0   Min.   :43.00   B: 1131   L: 3430  
 1st Qu.:170.0   1st Qu.:71.00   L: 4721   R:13641  
 Median :185.0   Median :72.00   R:11219            
 Mean   :186.2   Mean   :72.34                      
 3rd Qu.:200.0   3rd Qu.:74.00                      
 Max.   :320.0   Max.   :83.00                      

### Player Maximum Salary -- from the Salaries table
http://lahman.r-forge.r-project.org/doc/Salaries.html

In [6]:
summary(Salaries)

# example(Salaries)  # see demos of results from the Salaries table

PlayerMaxSalary = aggregate( salary ~ playerID, Salaries, max )
colnames(PlayerMaxSalary) = gsub( "salary", "max_salary", colnames(PlayerMaxSalary) )

head(PlayerMaxSalary)

     yearID         teamID      lgID         playerID        
 Min.   :1985   CLE    :  893   AL:12123   Length:24758      
 1st Qu.:1993   LAN    :  893   NL:12635   Class :character  
 Median :2000   PHI    :  893              Mode  :character  
 Mean   :2000   SLN    :  886                                
 3rd Qu.:2007   BAL    :  883                                
 Max.   :2014   BOS    :  883                                
                (Other):19427                                
     salary        
 Min.   :       0  
 1st Qu.:  260000  
 Median :  525000  
 Mean   : 1932905  
 3rd Qu.: 2199643  
 Max.   :33000000  
                   

Unnamed: 0,playerID,max_salary
1,aardsda01,4500000
2,aasedo01,675000
3,abadan01,327000
4,abadfe01,525900
5,abbotje01,300000
6,abbotji01,2775000


In [7]:
PlayerStartYear = aggregate( yearID ~ playerID, Salaries, min )
colnames(PlayerStartYear) = gsub( "yearID", "startYear", colnames(PlayerStartYear) )

PlayerEndYear = aggregate( yearID ~ playerID, Salaries, max )
colnames(PlayerEndYear) = gsub( "yearID", "endYear", colnames(PlayerEndYear) )

head(PlayerStartYear)

Unnamed: 0,playerID,startYear
1,aardsda01,2004
2,aasedo01,1986
3,abadan01,2006
4,abadfe01,2011
5,abbotje01,1998
6,abbotji01,1989


### Batting Statistics -- from the BattingStats table
http://lahman.r-forge.r-project.org/doc/battingStats.html
   
(See also the Batting table:
http://lahman.r-forge.r-project.org/doc/Batting.html )

A glossary for Baseball Statistics Acronyms is in
   http://en.wikipedia.org/wiki/Baseball_statistics

In [8]:
BattingStats = battingStats()

### Aggregate Batting Stats -- cumulative, over a player's career

In [9]:
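# NOTE: BA is a per-season rate, so summing it gives a sum of season averages,
# not a career batting average (a career BA would be sum(H) / sum(AB)).
TotalBattingCounts = aggregate( cbind(AB,R,H,X2B,X3B,HR,RBI,SB,CS,BB,BA,PA,TB) ~ playerID,
                                     BattingStats, sum)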
nrow(TotalBattingCounts)
MaxBattingPcts = aggregate( cbind(SlugPct,OBP,OPS,BABIP) ~ playerID,
                                 BattingStats, max )
nrow(MaxBattingPcts)

AggregateBattingStats = merge(TotalBattingCounts,MaxBattingPcts, by="playerID")
summary(AggregateBattingStats)
nrow(AggregateBattingStats)

   playerID               AB                R                H         
 Length:11532       Min.   :    1.0   Min.   :   0.0   Min.   :   0.0  
 Class :character   1st Qu.:   19.0   1st Qu.:   1.0   1st Qu.:   3.0  
 Mode  :character   Median :  136.5   Median :  12.0   Median :  25.0  
                    Mean   :  896.7   Mean   : 117.6   Mean   : 234.8  
                    3rd Qu.:  834.5   3rd Qu.:  95.0   3rd Qu.: 199.0  
                    Max.   :14053.0   Max.   :2295.0   Max.   :4256.0  
      X2B              X3B                HR             RBI        
 Min.   :  0.00   Min.   :  0.000   Min.   :  0.0   Min.   :   0.0  
 1st Qu.:  0.00   1st Qu.:  0.000   1st Qu.:  0.0   1st Qu.:   1.0  
 Median :  4.00   Median :  0.000   Median :  1.0   Median :  10.0  
 Mean   : 41.29   Mean   :  6.723   Mean   : 21.4   Mean   : 109.6  
 3rd Qu.: 33.00   3rd Qu.:  5.000   3rd Qu.: 10.0   3rd Qu.:  85.0  
 Max.   :746.00   Max.   :173.000   Max.   :762.0   Max.   :2297.0  
       SB    
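The aggregation above sums every column, including the rate statistic BA. The following is a minimal sketch, on made-up data, of how a summed rate differs from a rate recomputed from career totals. The `seasons` data frame and the `careerBA` column are hypothetical names used only for illustration; they are not part of the Lahman package.

```r
# Toy per-season batting lines for two hypothetical players.
seasons = data.frame(
    playerID = c("x01", "x01", "y01"),
    AB = c(400, 100, 50),
    H  = c(120, 20, 10),
    stringsAsFactors = FALSE)
seasons$BA = seasons$H / seasons$AB        # per-season batting average

# Same pattern as the notebook: sum every column per player.
totals = aggregate(cbind(AB, H, BA) ~ playerID, seasons, sum)

# A career rate recomputed from the summed counts.
totals$careerBA = totals$H / totals$AB
totals
```

For player `x01` the summed `BA` is 0.3 + 0.2 = 0.5, while the career rate from totals is 140 / 500 = 0.28. Either column can serve as a model feature, but only the second is a batting average.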

### Inducted into the Hall of Fame?  -- from the HallOfFame table
http://lahman.r-forge.r-project.org/doc/HallOfFame.html

In [10]:
data(HallOfFame)
head(HallOfFame)

InductedIntoHallOfFame = subset(HallOfFame, inducted == 'Y')[ , 1:2]

head(InductedIntoHallOfFame)
nrow(InductedIntoHallOfFame)

Unnamed: 0,playerID,yearID,votedBy,ballots,needed,votes,inducted,category,needed_note
1,cobbty01,1936,BBWAA,226,170,222,Y,Player,
2,ruthba01,1936,BBWAA,226,170,215,Y,Player,
3,wagneho01,1936,BBWAA,226,170,215,Y,Player,
4,mathech01,1936,BBWAA,226,170,205,Y,Player,
5,johnswa01,1936,BBWAA,226,170,189,Y,Player,
6,lajoina01,1936,BBWAA,226,170,146,N,Player,


Unnamed: 0,playerID,yearID
1,cobbty01,1936
2,ruthba01,1936
3,wagneho01,1936
4,mathech01,1936
5,johnswa01,1936
111,lajoina01,1937


### Include HallOfFame information in the Players table

In [11]:
PlayersWithHallOfFame = transform( merge( Players, InductedIntoHallOfFame, all.x=TRUE, by="playerID"),
                                        HallOfFame = ifelse( is.na(yearID), 0, 1 ),
                                        yearID = ifelse( is.na(yearID), 0, yearID )
                                        )
colnames(PlayersWithHallOfFame) = gsub( "yearID", "HallOfFameYear", colnames(PlayersWithHallOfFame) )
head(PlayersWithHallOfFame, 20)

Unnamed: 0,playerID,nameFirst,nameLast,birthYear,weight,height,bats,throws,HallOfFameYear,HallOfFame
1,aardsda01,David,Aardsma,1981,205,75,R,R,0,0
2,aaronha01,Hank,Aaron,1934,180,72,R,R,1982,1
3,aaronto01,Tommie,Aaron,1939,190,75,R,R,0,0
4,aasedo01,Don,Aase,1954,190,75,R,R,0,0
5,abadan01,Andy,Abad,1972,184,73,L,L,0,0
6,abadfe01,Fernando,Abad,1985,220,73,L,L,0,0
7,abadijo01,John,Abadie,1854,192,72,R,R,0,0
8,abbated01,Ed,Abbaticchio,1877,170,71,R,R,0,0
9,abbeybe01,Bert,Abbey,1869,175,71,R,R,0,0
10,abbeych01,Charlie,Abbey,1866,169,68,L,L,0,0
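The cell above uses a left join (`all.x=TRUE`) so that players without a Hall of Fame record are kept, and then turns the resulting `NA` values into a 0/1 flag. A minimal sketch of the same pattern on made-up data follows; the `players` and `inducted` data frames are hypothetical stand-ins for the notebook's tables.

```r
# Three hypothetical players, one of whom has an induction record.
players  = data.frame(playerID = c("a01", "b01", "c01"), stringsAsFactors = FALSE)
inducted = data.frame(playerID = "b01", yearID = 1936, stringsAsFactors = FALSE)

# Left join: every row of `players` survives; unmatched rows get yearID = NA.
joined = merge(players, inducted, all.x = TRUE, by = "playerID")

# Derive the flag from the NA pattern before overwriting yearID.
joined$HallOfFame = ifelse(is.na(joined$yearID), 0, 1)
joined$yearID     = ifelse(is.na(joined$yearID), 0, joined$yearID)
joined
```

Without `all.x = TRUE`, `merge` performs an inner join and would keep only the inducted player.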


In [12]:
nrow(PlayersWithHallOfFame)
nrow(subset(PlayersWithHallOfFame, HallOfFame == 1))

In [13]:
PlayersAndStats = merge( PlayersWithHallOfFame, AggregateBattingStats )

nrow(PlayersAndStats)
nrow(subset(PlayersAndStats, HallOfFame == 1))

# write.csv(PlayersAndStats, file="PlayersAndStats.csv", quote=FALSE, row.names=FALSE)

# Join Information for your Baseball Salary model into one Table

### Merge Aggregate Batting Statistics and Maximum Salary into the Relevant Information table

In [14]:
PlayersAndStatsAndSalary = transform(
    merge( merge( merge( PlayersAndStats, PlayerMaxSalary ), PlayerStartYear), PlayerEndYear ),
    totalYears = endYear - startYear + 1
    )
head(PlayersAndStatsAndSalary)
nrow(PlayersAndStatsAndSalary)

# write.csv(PlayersAndStatsAndSalary, file="PlayersAndStatsAndSalary.csv", quote=FALSE, row.names=FALSE)

Unnamed: 0,playerID,nameFirst,nameLast,birthYear,weight,height,bats,throws,HallOfFameYear,HallOfFame,AB,R,H,X2B,X3B,HR,RBI,SB,CS,BB,BA,PA,TB,SlugPct,OBP,OPS,BABIP,max_salary,startYear,endYear,totalYears
1,aardsda01,David,Aardsma,1981,205,75,R,R,0,0,3,0,0,0,0,0,0,0,0,0,0.0,4,0,0.0,0.0,0.0,0.0,4500000,2004,2012,9
2,aasedo01,Don,Aase,1954,190,75,R,R,0,0,5,0,0,0,0,0,0,0,0,0,0.0,5,0,0.0,0.0,0.0,0.0,675000,1986,1989,4
3,abadan01,Andy,Abad,1972,184,73,L,L,0,0,21,1,2,0,0,0,0,0,1,4,0.118,25,2,0.118,0.4,0.4,0.167,327000,2006,2006,1
4,abadfe01,Fernando,Abad,1985,220,73,L,L,0,0,8,0,1,0,0,0,0,0,0,0,0.143,8,1,0.143,0.143,0.286,0.25,525900,2011,2014,4
5,abbotje01,Jeff,Abbott,1972,190,74,R,L,0,0,596,82,157,33,2,18,83,6,5,38,1.236,649,248,0.492,0.343,0.79,0.32,300000,1998,2001,4
6,abbotji01,Jim,Abbott,1967,200,75,L,L,0,0,21,0,2,0,0,0,3,0,0,0,0.095,24,2,0.095,0.095,0.19,0.182,2775000,1989,1999,11


# Problem 1: construct a model with better performance  (higher R-squared) than this Baseline Salary Model

### For this salary model, consider only those players whose first recorded salary year is 2000 or later:

In [15]:
RecentPlayersAndStatsAndSalary = subset( PlayersAndStatsAndSalary, startYear >= 2000 )
nrow(RecentPlayersAndStatsAndSalary)

In [16]:
summary(RecentPlayersAndStatsAndSalary)

   playerID          nameFirst           nameLast           birthYear   
 Length:1720        Length:1720        Length:1720        Min.   :1925  
 Class :character   Class :character   Class :character   1st Qu.:1978  
 Mode  :character   Mode  :character   Mode  :character   Median :1982  
                                                          Mean   :1982  
                                                          3rd Qu.:1985  
                                                          Max.   :1993  
     weight          height      bats     throws   HallOfFameYear   HallOfFame
 Min.   :150.0   Min.   :66.00   B: 153   L: 338   Min.   :0      Min.   :0   
 1st Qu.:190.0   1st Qu.:72.00   L: 507   R:1382   1st Qu.:0      1st Qu.:0   
 Median :205.0   Median :74.00   R:1060            Median :0      Median :0   
 Mean   :207.4   Mean   :73.53                     Mean   :0      Mean   :0   
 3rd Qu.:220.0   3rd Qu.:75.00                     3rd Qu.:0      3rd Qu.:0   
 Max.   :295.0 

In [17]:
head(RecentPlayersAndStatsAndSalary)

Unnamed: 0,playerID,nameFirst,nameLast,birthYear,weight,height,bats,throws,HallOfFameYear,HallOfFame,AB,R,H,X2B,X3B,HR,RBI,SB,CS,BB,BA,PA,TB,SlugPct,OBP,OPS,BABIP,max_salary,startYear,endYear,totalYears
1,aardsda01,David,Aardsma,1981,205,75,R,R,0,0,3,0,0,0,0,0,0,0,0,0,0.0,4,0,0.0,0.0,0.0,0.0,4500000,2004,2012,9
3,abadan01,Andy,Abad,1972,184,73,L,L,0,0,21,1,2,0,0,0,0,0,1,4,0.118,25,2,0.118,0.4,0.4,0.167,327000,2006,2006,1
4,abadfe01,Fernando,Abad,1985,220,73,L,L,0,0,8,0,1,0,0,0,0,0,0,0,0.143,8,1,0.143,0.143,0.286,0.25,525900,2011,2014,4
10,abercre01,Reggie,Abercrombie,1980,215,75,R,R,0,0,386,65,86,20,2,9,34,18,8,21,0.718,421,137,0.509,0.339,0.848,0.484,327000,2006,2006,1
11,abernbr01,Brent,Abernathy,1977,185,73,R,R,0,0,868,97,212,36,5,8,79,21,7,60,0.825,955,282,0.382,0.328,0.71,0.291,300000,2002,2003,2
14,abreujo02,Jose,Abreu,1987,255,75,R,R,0,0,556,80,176,35,2,36,107,3,1,51,0.317,622,323,0.581,0.383,0.964,0.356,7000000,2014,2014,1


In [18]:
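# Ratio of the variance of the predictions to the variance of the observed labels.
# NOTE: this equals R-squared only for a least-squares fit with an intercept,
# evaluated on its own training data; for other models it is only a rough proxy.
linear_accuracy = function(model, test_features, test_labels){
    pd = predict(model, test_features)
    return (var(pd) / var(test_labels))
}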
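The function above reports a variance ratio. As a point of comparison, here is a minimal sketch of R-squared computed from residuals, 1 - SSE/SST, on simulated data. The helper names `r_squared` and `variance_ratio` and the `toy` data frame are hypothetical and are not part of the assignment's grading code.

```r
# R-squared from residuals: 1 - (sum of squared errors) / (total sum of squares).
r_squared = function(predicted, observed) {
    sse = sum((observed - predicted)^2)
    sst = sum((observed - mean(observed))^2)
    1 - sse / sst
}

# The variance ratio used by linear_accuracy, written on plain vectors.
variance_ratio = function(predicted, observed) var(predicted) / var(observed)

set.seed(1)
toy = data.frame(x = rnorm(200))
toy$y = 2 * toy$x + rnorm(200)

fit  = lm(y ~ x, data = toy)
pred = predict(fit, toy)

# For an lm fit on its own training data, both measures agree.
c(r_squared(pred, toy$y), variance_ratio(pred, toy$y), summary(fit)$r.squared)

# A constant bias leaves the variance ratio unchanged but lowers R-squared.
shifted = pred + 1
c(r_squared(shifted, toy$y), variance_ratio(shifted, toy$y))
```

The two measures can disagree for models that are not least-squares fits, or for predictions on data the model was not trained on, so a higher variance ratio does not by itself imply a better fit.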

In [19]:
BaselineSalaryModel = lm( log10(max_salary) ~
                         AB+R+H+X2B+X3B+HR+RBI+SB+CS+BB+BA+PA+SlugPct+OBP+BABIP + startYear + totalYears,
                         data = RecentPlayersAndStatsAndSalary)
summary(BaselineSalaryModel)


Call:
lm(formula = log10(max_salary) ~ AB + R + H + X2B + X3B + HR + 
    RBI + SB + CS + BB + BA + PA + SlugPct + OBP + BABIP + startYear + 
    totalYears, data = RecentPlayersAndStatsAndSalary)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.11262 -0.16561 -0.03839  0.14415  1.53555 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -6.371e+01  4.032e+00 -15.803  < 2e-16 ***
AB          -3.093e-03  5.457e-04  -5.668 1.69e-08 ***
R           -1.767e-03  5.068e-04  -3.487 0.000500 ***
H            8.609e-04  3.506e-04   2.456 0.014160 *  
X2B          1.408e-03  7.283e-04   1.933 0.053370 .  
X3B          3.012e-03  1.658e-03   1.817 0.069368 .  
HR           2.309e-03  1.020e-03   2.263 0.023764 *  
RBI          9.654e-05  4.981e-04   0.194 0.846337    
SB           8.782e-04  5.399e-04   1.627 0.103968    
CS          -7.303e-04  1.786e-03  -0.409 0.682640    
BB          -2.381e-03  5.068e-04  -4.698 2.84e-06 ***
BA          -1.403e-01 

In [20]:
p1_base_accu = round(linear_accuracy(BaselineSalaryModel, RecentPlayersAndStatsAndSalary, 
                                     log10(RecentPlayersAndStatsAndSalary$max_salary)), digits=4)
print(p1_base_accu)

[1] 0.7349


In [21]:
RelevantInformation = RecentPlayersAndStatsAndSalary

In [22]:
if (!(is.element("leaps", installed.packages())))  install.packages("leaps",repos="http://cran.rstudio.com/")
library(leaps)
if (!(is.element("MASS", installed.packages())))  install.packages("MASS",repos="http://cran.rstudio.com/")
library(MASS) 

In [23]:
colnames(RelevantInformation)

In [24]:
rs = regsubsets( log10(max_salary) ~ AB+R+H+X2B+X3B+HR+RBI+SB+CS+BB+BA+PA+exp(SlugPct)+OBP+BABIP + startYear + totalYears
                +weight+height+bats+throws+OPS, 
                data=RelevantInformation, nbest=1, nvmax=NULL , method = "exhaustive" )

summary(rs)

Subset selection object
Call: regsubsets.formula(log10(max_salary) ~ AB + R + H + X2B + X3B + 
    HR + RBI + SB + CS + BB + BA + PA + exp(SlugPct) + OBP + 
    BABIP + startYear + totalYears + weight + height + bats + 
    throws + OPS, data = RelevantInformation, nbest = 1, nvmax = NULL, 
    method = "exhaustive")
23 Variables  (and intercept)
             Forced in Forced out
AB               FALSE      FALSE
R                FALSE      FALSE
H                FALSE      FALSE
X2B              FALSE      FALSE
X3B              FALSE      FALSE
HR               FALSE      FALSE
RBI              FALSE      FALSE
SB               FALSE      FALSE
CS               FALSE      FALSE
BB               FALSE      FALSE
BA               FALSE      FALSE
PA               FALSE      FALSE
exp(SlugPct)     FALSE      FALSE
OBP              FALSE      FALSE
BABIP            FALSE      FALSE
startYear        FALSE      FALSE
totalYears       FALSE      FALSE
weight           FALSE      FALSE
heigh

In [25]:
p1_model = rlm(log10(max_salary) ~ AB+R+H+log10(X2B+5)+X3B+HR+RBI+log10(SB+1)+CS+BB+BA+PA+exp(SlugPct)+BABIP+log10(startYear)
                         +totalYears+height+throws+log10(OPS+0.01),data = RelevantInformation)
summary(p1_model)


Call: rlm(formula = log10(max_salary) ~ AB + R + H + log10(X2B + 5) + 
    X3B + HR + RBI + log10(SB + 1) + CS + BB + BA + PA + exp(SlugPct) + 
    BABIP + log10(startYear) + totalYears + height + throws + 
    log10(OPS + 0.01), data = RelevantInformation)
Residuals:
     Min       1Q   Median       3Q      Max 
-1.09019 -0.14659 -0.00997  0.14532  1.58091 

Coefficients:
                  Value     Std. Error t value  
(Intercept)       -475.5310   28.0915   -16.9279
AB                  -0.0029    0.0005    -6.0046
R                   -0.0011    0.0004    -2.5726
H                    0.0009    0.0003     2.6926
log10(X2B + 5)       0.0135    0.0377     0.3590
X3B                  0.0017    0.0015     1.1813
HR                   0.0021    0.0009     2.3543
RBI                  0.0000    0.0004    -0.0767
log10(SB + 1)       -0.0329    0.0231    -1.4258
CS                   0.0026    0.0012     2.2647
BB                  -0.0023    0.0005    -5.1002
BA                  -0.1275    0.01

In [26]:
p1_accu = round(linear_accuracy(p1_model, RelevantInformation, 
                                     log10(RelevantInformation$max_salary)), digits=4)
print(p1_accu)

[1] 0.7915


In [27]:
p1 = c(p1_accu,
      "\"rlm(log10(max_salary) ~ AB+R+H+log10(X2B+5)+X3B+HR+RBI+log10(SB+1)+CS+BB+BA+PA+exp(SlugPct)+BABIP+log10(startYear)+totalYears+height+throws+log10(OPS+0.01),data = RelevantInformation)\"")

In [30]:
salary_test = read.csv(file="HW4_Baseball_Salary_test.csv",header=TRUE,sep=",")

In [37]:
salary_predictions = predict(p1_model, salary_test)
salary_predictions = 10^salary_predictions

In [38]:
write.table(salary_predictions, file="HW4_Baseball_Salary_predictions.csv", quote=FALSE, row.names=FALSE, col.names=c("max_salary"))

# Problem 2: construct a model with better performance  (higher accuracy) than this Baseline Hall of Fame Model

###  Hall of Fame election rules:


A. A baseball player must have been active as a player in the Major Leagues at some time during a period beginning fifteen (15) years before and ending five (5) years prior to election.

B. Player must have played in each of ten (10) Major League championship seasons, some part of which must have been within the period described in 3(A).

C. Player shall have ceased to be an active player in the Major Leagues at least five (5) calendar years preceding the election but may be otherwise connected with baseball.

### Consequently:   only consider players born before 1970
(They must start around 20 years of age, play at least 10 years, have stopped playing at least 5 years earlier, and take perhaps 10 years to win the ballot -- so born at least 45 years ago.)
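The arithmetic behind the birth-year cutoff can be written out as a short sketch. The reference year of 2015 is an assumption, taken from the latest induction year that appears in this notebook's data; the variable names are hypothetical.

```r
# Assumption: the data ends around 2015 (latest induction year in the tables).
reference_year  = 2015

start_age       = 20   # approximate age at a player's first season
min_seasons     = 10   # rule B: at least ten championship seasons
retirement_wait = 5    # rule C: retired at least five years before election
ballot_wait     = 10   # rough allowance for time spent on the ballot

min_age           = start_age + min_seasons + retirement_wait + ballot_wait
latest_birth_year = reference_year - min_age
c(min_age = min_age, latest_birth_year = latest_birth_year)
```

Under these assumptions the minimum age is 45 and the cutoff year is 1970, which matches the filter used in the next cell.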

In [40]:
HallOfFameContenders = subset( PlayersAndStats, birthYear < 1970 )
head(HallOfFameContenders)
nrow(HallOfFameContenders)

Unnamed: 0,playerID,nameFirst,nameLast,birthYear,weight,height,bats,throws,HallOfFameYear,HallOfFame,AB,R,H,X2B,X3B,HR,RBI,SB,CS,BB,BA,PA,TB,SlugPct,OBP,OPS,BABIP
2,aaronha01,Hank,Aaron,1934,180,72,R,R,1982,1,12364,2174,3771,624,98,755,2297,240,73,1402,6.927,13940,6856,0.669,0.41,1.079,0.338
3,aaronto01,Tommie,Aaron,1939,190,75,R,R,0,0,944,102,216,42,6,13,94,9,8,86,1.545,1045,309,0.374,0.318,0.686,0.276
4,aasedo01,Don,Aase,1954,190,75,R,R,0,0,5,0,0,0,0,0,0,0,0,0,0.0,5,0,0.0,0.0,0.0,0.0
7,abadijo01,John,Abadie,1854,192,72,R,R,0,0,49,4,11,0,0,0,5,1,0,0,0.472,49,11,0.25,0.25,0.5,0.25
9,abbotji01,Jim,Abbott,1967,200,75,L,L,0,0,21,0,2,0,0,0,3,0,0,0,0.095,24,2,0.095,0.095,0.19,0.182
10,abbotku01,Kurt,Abbott,1969,180,71,R,R,0,0,2044,273,523,109,23,62,242,22,11,133,2.511,2227,864,0.465,0.326,0.77,0.354


In [41]:
summary(HallOfFameContenders)

   playerID          nameFirst           nameLast           birthYear   
 Length:8111        Length:8111        Length:8111        Min.   :1835  
 Class :character   Class :character   Class :character   1st Qu.:1911  
 Mode  :character   Mode  :character   Mode  :character   Median :1936  
                                                          Mean   :1932  
                                                          3rd Qu.:1954  
                                                          Max.   :1969  
     weight          height      bats     throws   HallOfFameYear   
 Min.   :120.0   Min.   :63.00   B: 556   L:1567   Min.   :   0.00  
 1st Qu.:170.0   1st Qu.:71.00   L:2289   R:6544   1st Qu.:   0.00  
 Median :180.0   Median :72.00   R:5266            Median :   0.00  
 Mean   :182.6   Mean   :72.25                     Mean   :  47.05  
 3rd Qu.:193.0   3rd Qu.:74.00                     3rd Qu.:   0.00  
 Max.   :295.0   Max.   :82.00                     Max.   :2015.00  
   Hal

In [42]:
BaselineHallOfFameModel = glm( HallOfFame ~ AB+R+H+X2B+X3B+HR+RBI+SB+CS+BB+BA+PA+SlugPct+OBP+BABIP,
                         data = HallOfFameContenders, family=binomial)

summary(BaselineHallOfFameModel)


Call:
glm(formula = HallOfFame ~ AB + R + H + X2B + X3B + HR + RBI + 
    SB + CS + BB + BA + PA + SlugPct + OBP + BABIP, family = binomial, 
    data = HallOfFameContenders)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.0236  -0.1609  -0.1354  -0.1225   3.2096  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -5.177921   0.245464 -21.094  < 2e-16 ***
AB          -0.015193   0.002452  -6.195 5.82e-10 ***
R            0.004717   0.001993   2.366  0.01797 *  
H            0.004591   0.001633   2.812  0.00493 ** 
X2B         -0.017811   0.003509  -5.076 3.86e-07 ***
X3B          0.021059   0.006559   3.210  0.00133 ** 
HR          -0.007845   0.003394  -2.312  0.02080 *  
RBI          0.006136   0.001578   3.890  0.00010 ***
SB           0.005979   0.002021   2.958  0.00309 ** 
CS          -0.034386   0.007499  -4.586 4.53e-06 ***
BB          -0.013913   0.002313  -6.015 1.80e-09 ***
BA           0.065975   0.135318   0.488  0.625

In [43]:
confusionMatrix = table( round(predict(BaselineHallOfFameModel, type="response")), HallOfFameContenders$HallOfFame )
confusionMatrix
# terrible prediction accuracy:  only 38 of the 193 Hall-of-Fame players were identified correctly:

   
       0    1
  0 7899  155
  1   19   38

##  Warning!  This dataset is severely imbalanced.  Read Ch.16 of [APM]

Only about 2.4% of the contenders (193 of 8,111) are inducted into the Hall of Fame:

In [44]:
( FameTally = table( HallOfFameContenders$HallOfFame ) )


   0    1 
7918  193 

In [45]:
data.frame( percentageOfHallOfFamers = FameTally[2] / sum(FameTally) )

Unnamed: 0,percentageOfHallOfFamers
1,0.02379485
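One common response to this kind of imbalance is to lower the decision threshold below the 0.5 that `round()` implies, which trades extra false positives for fewer missed positives. The sketch below illustrates the idea on simulated data only; the threshold of 0.1, the simulated variables, and the helper names `classify` and `confusion` are all assumptions chosen for illustration, not a recommended setting for this assignment.

```r
set.seed(42)
n = 2000
x = rnorm(n)
y = rbinom(n, 1, plogis(-4 + 1.5 * x))    # simulated rare positive class

fit  = glm(y ~ x, family = binomial)
prob = predict(fit, type = "response")    # fitted probabilities

# Turn probabilities into 0/1 labels at a chosen threshold.
classify = function(prob, threshold) as.integer(prob >= threshold)

# Confusion table with rows = predicted and columns = actual, as in the notebook.
confusion = function(pred, actual) {
    table(predicted = factor(pred, levels = 0:1),
          actual    = factor(actual, levels = 0:1))
}

pred_default = classify(prob, 0.5)
pred_lower   = classify(prob, 0.1)

confusion(pred_default, y)
confusion(pred_lower, y)
```

Every case flagged at the higher threshold is also flagged at the lower one, so the number of missed positives can only stay the same or shrink, at the cost of more false alarms.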


##  The measure of accuracy will heavily emphasize correct prediction of Hall-of-Fame players

Classifying everybody as a non-Hall-of-Fame player is right for about 98% of the players,
but predictions for Hall-of-Fame players are weighted heavily in this assignment.
A model that ignores these players will get a very low score.

Specifically, your model will be evaluated by its <b>Hall-of-Fame-Accuracy-Rate</b>:
<blockquote>
This rate is a weighted percentage of correct predictions
for players in the Hall of Fame:  <u>correct prediction for players in the Hall of Fame
is worth 100 times more than for players who are not in the Hall of Fame.</u>
</blockquote>
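A minimal sketch of this weighted rate follows. It mirrors the scoring arithmetic used in this notebook's own scoring cell (correct predictions over all predictions, with each Hall-of-Fame player counted 100 times); the function name `hall_of_fame_accuracy_rate` is hypothetical, and the actual grading program may differ in detail.

```r
# Weighted accuracy: each actual Hall-of-Fame player counts `weight` times.
hall_of_fame_accuracy_rate = function(predicted, actual, weight = 100) {
    tn = sum(predicted == 0 & actual == 0)   # correct non-members
    fp = sum(predicted == 1 & actual == 0)   # non-members predicted as members
    fn = sum(predicted == 0 & actual == 1)   # members that were missed
    tp = sum(predicted == 1 & actual == 1)   # correct members
    correct = tn + weight * tp
    total   = correct + fp + weight * fn
    correct / total
}

# Label vectors rebuilt from the baseline confusion matrix counts shown earlier
# (7899 true negatives, 19 false positives, 155 false negatives, 38 true positives).
actual    = c(rep(0, 7899), rep(0, 19), rep(1, 155), rep(1, 38))
predicted = c(rep(0, 7899), rep(1, 19), rep(0, 155), rep(1, 38))

baseline_rate     = hall_of_fame_accuracy_rate(predicted, actual)
all_negative_rate = hall_of_fame_accuracy_rate(rep(0, length(actual)), actual)
c(baseline = baseline_rate, all_negative = all_negative_rate)
```

With these counts the baseline scores about 0.43, and a classifier that predicts nobody for the Hall of Fame scores about 0.29, even though its unweighted accuracy is about 98%.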


In [46]:
if (!(is.element("ada", installed.packages())))  install.packages("ada",repos="http://cran.rstudio.com/")
library(ada)
if (!(is.element("randomForest", installed.packages())))  install.packages("randomForest",repos="http://cran.rstudio.com/")
library(randomForest)
library(rpart)

Loading required package: rpart
randomForest 4.6-12
Type rfNews() to see new features/changes/bug fixes.


In [47]:
rs = regsubsets( HallOfFame ~ AB+R+H+X2B+X3B+HR+RBI+SB+CS+BB+BA+PA+SlugPct+OBP+BABIP+weight+log10(height)+bats+throws+OPS, 
                data=HallOfFameContenders, nbest=1, nvmax=NULL , method = "exhaustive" )

summary(rs)

Subset selection object
Call: regsubsets.formula(HallOfFame ~ AB + R + H + X2B + X3B + HR + 
    RBI + SB + CS + BB + BA + PA + SlugPct + OBP + BABIP + weight + 
    log10(height) + bats + throws + OPS, data = HallOfFameContenders, 
    nbest = 1, nvmax = NULL, method = "exhaustive")
21 Variables  (and intercept)
              Forced in Forced out
AB                FALSE      FALSE
R                 FALSE      FALSE
H                 FALSE      FALSE
X2B               FALSE      FALSE
X3B               FALSE      FALSE
HR                FALSE      FALSE
RBI               FALSE      FALSE
SB                FALSE      FALSE
CS                FALSE      FALSE
BB                FALSE      FALSE
BA                FALSE      FALSE
PA                FALSE      FALSE
SlugPct           FALSE      FALSE
OBP               FALSE      FALSE
BABIP             FALSE      FALSE
weight            FALSE      FALSE
log10(height)     FALSE      FALSE
batsL             FALSE      FALSE
batsR             FA

In [None]:
#p2_model = ada( HallOfFame ~ AB+R+H+X2B+X3B+HR+RBI+SB+CS+BB+BA+PA+SlugPct+OBP+BABIP+weight+height+throws+OPS,
 #                        data = HallOfFameContenders)

In [49]:
RelevantInformation = HallOfFameContenders

In [50]:
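# takes a while to run
# NOTE: HallOfFame is numeric 0/1, so randomForest fits a regression forest (hence the warning below);
# predictions are rounded to 0/1 afterwards. Using factor(HallOfFame) would fit a classification forest.
p2_model = randomForest( HallOfFame ~ AB+R+H+X2B+X3B+HR+RBI+SB+CS+BB+BA+PA+SlugPct+OBP+BABIP+weight+height+throws+OPS,
                         data = RelevantInformation)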

In randomForest.default(m, y, ...): The response has five or fewer unique values.  Are you sure you want to do regression?

In [51]:
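if (!(is.element("caret", installed.packages())))  install.packages("caret",repos="http://cran.rstudio.com/")
library(caret)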
if (!(is.element("e1071", installed.packages())))  install.packages("e1071",repos="http://cran.rstudio.com/")
library(e1071)

Loading required package: lattice
Loading required package: ggplot2

Attaching package: ‘ggplot2’

The following object is masked from ‘package:randomForest’:

    margin



In [52]:
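# NOTE: these are predictions on the training data, so the counts below are optimistic;
# predict(p2_model) with no new data returns out-of-bag predictions instead.
confusion_matrix = table( round(predict(p2_model, HallOfFameContenders, type="response")), HallOfFameContenders$HallOfFame )
confusion_matrix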

   
       0    1
  0 7918   17
  1    0  176

In [53]:
confusionMatrix(confusion_matrix)

Confusion Matrix and Statistics

   
       0    1
  0 7918   17
  1    0  176
                                          
               Accuracy : 0.9979          
                 95% CI : (0.9966, 0.9988)
    No Information Rate : 0.9762          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.9529          
 Mcnemar's Test P-Value : 0.0001042       
                                          
            Sensitivity : 1.0000          
            Specificity : 0.9119          
         Pos Pred Value : 0.9979          
         Neg Pred Value : 1.0000          
             Prevalence : 0.9762          
         Detection Rate : 0.9762          
   Detection Prevalence : 0.9783          
      Balanced Accuracy : 0.9560          
                                          
       'Positive' Class : 0               
                                          

In [54]:
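# table() is indexed column-major, with rows = predicted and columns = actual:
# [1] true negatives, [2] false positives, [3] false negatives, [4] true positives
correct_score = confusion_matrix[1] + 100 * confusion_matrix[4]
total_score = correct_score + 100 * confusion_matrix[3] + confusion_matrix[2]
p2_accu = round(correct_score / total_score, digits=4)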

In [55]:
p2 = c(p2_accu,
      "\"randomForest(HallOfFame ~ AB+R+H+X2B+X3B+HR+RBI+SB+CS+BB+BA+PA+SlugPct+OBP+BABIP+weight+height+throws+OPS,data = RelevantInformation)\"")

In [56]:
halloffame_test = read.csv(file="HW4_Baseball_HallOfFame_test.csv",header=TRUE,sep=",")

In [63]:
halloffame_predictions = round(predict(p2_model, halloffame_test, type="response"))

In [64]:
write.table(halloffame_predictions, file="HW4_Baseball_HallOfFame_predictions.csv", quote=FALSE, row.names=FALSE, col.names=c("HallOfFame"))

In [59]:
retval = c()
retval = append(retval, paste(p1, collapse=","))
retval = append(retval, paste(p2, collapse=","))

In [60]:
write(retval, "HW4_Baseball_Models.csv")
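Because each model command contains commas, the surrounding double quotes are what keep every line a two-field CSV record. A minimal sketch of a read-back check follows. It writes two example lines to a temporary file rather than touching the real output; the shortened formulas and the column names `accuracy` and `command` are assumptions for illustration only.

```r
# Two example records in the same "accuracy,quoted command" layout.
example_lines = c(
    '0.8999,"lm( log10(max_salary) ~ AB+R+H, data = RelevantInformation)"',
    '0.7888,"glm( HallOfFame ~ AB+R+H, data = RelevantInformation, family=binomial)"')

path = tempfile(fileext = ".csv")
write(example_lines, path)

# Read the file back without a header row and name the two fields.
models = read.csv(path, header = FALSE, stringsAsFactors = FALSE,
                  col.names = c("accuracy", "command"))
str(models)
```

If the quotes were missing, the commas inside each command would split the line into more than two fields, so a check like this is a quick way to confirm the file has the expected shape before uploading.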