inst/chapters/04_Over_Fitting.Rout


R version 3.0.1 (2013-05-16) -- "Good Sport"
Copyright (C) 2013 The R Foundation for Statistical Computing
Platform: x86_64-apple-darwin10.8.0 (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> ################################################################################
> ### R code from Applied Predictive Modeling (2013) by Kuhn and Johnson.
> ### Copyright 2013 Kuhn and Johnson
> ### Web Page: http://www.appliedpredictivemodeling.com
> ### Contact: Max Kuhn (mxkuhn@gmail.com)
> ###
> ### Chapter 4: Over-Fitting and Model Tuning
> ###
> ### Required packages: caret, doMC (optional), kernlab
> ###
> ### Data used: 
> ###
> ### Notes: 
> ### 1) This code is provided without warranty.
> ###
> ### 2) This code should help the user reproduce the results in the
> ### text. There will be differences between this code and what is is
> ### the computing section. For example, the computing sections show
> ### how the source functions work (e.g. randomForest() or plsr()),
> ### which were not directly used when creating the book. Also, there may be 
> ### syntax differences that occur over time as packages evolve. These files 
> ### will reflect those changes.
> ###
> ### 3) In some cases, the calculations in the book were run in 
> ### parallel. The sub-processes may reset the random number seed.
> ### Your results may slightly vary.
> ###
> ################################################################################
> 
> ################################################################################
> ### Section 4.6 Choosing Final Tuning Parameters
> 
> library(caret)
Loading required package: lattice
Loading required package: ggplot2
> data(GermanCredit)
> 
> ## First, remove near-zero variance predictors then get rid of a few predictors 
> ## that duplicate values. For example, there are two possible values for the 
> ## housing variable: "Rent", "Own" and "ForFree". So that we don't have linear
> ## dependencies, we get rid of one of the levels (e.g. "ForFree")
> 
> GermanCredit <- GermanCredit[, -nearZeroVar(GermanCredit)]
> GermanCredit$CheckingAccountStatus.lt.0 <- NULL
> GermanCredit$SavingsAccountBonds.lt.100 <- NULL
> GermanCredit$EmploymentDuration.lt.1 <- NULL
> GermanCredit$EmploymentDuration.Unemployed <- NULL
> GermanCredit$Personal.Male.Married.Widowed <- NULL
> GermanCredit$Property.Unknown <- NULL
> GermanCredit$Housing.ForFree <- NULL
> 
> ## Split the data into training (80%) and test sets (20%)
> set.seed(100)
> inTrain <- createDataPartition(GermanCredit$Class, p = .8)[[1]]
> GermanCreditTrain <- GermanCredit[ inTrain, ]
> GermanCreditTest  <- GermanCredit[-inTrain, ]
> 
> ## The model fitting code shown in the computing section is fairly
> ## simplistic.  For the text we estimate the tuning parameter grid
> ## up-front and pass it in explicitly. This generally is not needed,
> ## but was used here so that we could trim the cost values to a
> ## presentable range and to re-use later with different resampling
> ## methods.
> 
> library(kernlab)
> set.seed(231)
> sigDist <- sigest(Class ~ ., data = GermanCreditTrain, frac = 1)
> svmTuneGrid <- data.frame(sigma = as.vector(sigDist)[1], C = 2^(-2:7))
> 
> ### Optional: parallel processing can be used via the 'do' packages,
> ### such as doMC, doMPI etc. We used doMC (not on Windows) to speed
> ### up the computations.
> 
> ### WARNING: Be aware of how much memory is needed to parallel
> ### process. It can very quickly overwhelm the available hardware. We
> ### estimate the memory usage (VSIZE = total memory size) to be 
> ### 2566M/core.
> 
> library(doMC)
Loading required package: foreach
Loading required package: iterators
Loading required package: parallel
> registerDoMC(4)
> 
> set.seed(1056)
> svmFit <- train(Class ~ .,
+                 data = GermanCreditTrain,
+                 method = "svmRadial",
+                 preProc = c("center", "scale"),
+                 tuneGrid = svmTuneGrid,
+                 trControl = trainControl(method = "repeatedcv", 
+                                          repeats = 5,
+                                          classProbs = TRUE))
Loading required package: class
> ## classProbs = TRUE was added since the text was written
> 
> ## Print the results
> svmFit
Support Vector Machines with Radial Basis Function Kernel 

800 samples
 41 predictors
  2 classes: 'Bad', 'Good' 

Pre-processing: centered, scaled 
Resampling: Cross-Validated (10 fold, repeated 5 times) 

Summary of sample sizes: 720, 720, 720, 720, 720, 720, ... 

Resampling results across tuning parameters:

  C     Accuracy  Kappa  Accuracy SD  Kappa SD
  0.25  0.744     0.362  0.0499       0.113   
  0.5   0.74      0.35   0.0516       0.117   
  1     0.746     0.348  0.0522       0.125   
  2     0.743     0.325  0.0467       0.116   
  4     0.744     0.322  0.0477       0.12    
  8     0.75      0.323  0.0464       0.13    
  16    0.745     0.302  0.0457       0.13    
  32    0.739     0.28   0.0451       0.126   
  64    0.743     0.284  0.0444       0.135   
  128   0.734     0.265  0.0445       0.124   

Tuning parameter 'sigma' was held constant at a value of 0.008918477
Accuracy was used to select the optimal model using  the largest value.
The final values used for the model were sigma = 0.00892 and C = 8. 
> 
> ## A line plot of the average performance. The 'scales' argument is actually an 
> ## argument to xyplot that converts the x-axis to log-2 units.
> 
> plot(svmFit, scales = list(x = list(log = 2)))
> 
> ## Test set predictions
> 
> predictedClasses <- predict(svmFit, GermanCreditTest)
> str(predictedClasses)
 Factor w/ 2 levels "Bad","Good": 1 2 2 2 1 2 2 2 1 2 ...
> 
> ## Use the "type" option to get class probabilities
> 
> predictedProbs <- predict(svmFit, newdata = GermanCreditTest, type = "prob")
> head(predictedProbs)
         Bad      Good
1 0.56094118 0.4390588
2 0.47684821 0.5231518
3 0.30632069 0.6936793
4 0.09835051 0.9016495
5 0.57531768 0.4246823
6 0.14280149 0.8571985
> 
> 
> ## Fit the same model using different resampling methods. The main syntax change
> ## is the control object.
> 
> set.seed(1056)
> svmFit10CV <- train(Class ~ .,
+                     data = GermanCreditTrain,
+                     method = "svmRadial",
+                     preProc = c("center", "scale"),
+                     tuneGrid = svmTuneGrid,
+                     trControl = trainControl(method = "cv", number = 10))
> svmFit10CV
Support Vector Machines with Radial Basis Function Kernel 

800 samples
 41 predictors
  2 classes: 'Bad', 'Good' 

Pre-processing: centered, scaled 
Resampling: Cross-Validated (10 fold) 

Summary of sample sizes: 720, 720, 720, 720, 720, 720, ... 

Resampling results across tuning parameters:

  C     Accuracy  Kappa   Accuracy SD  Kappa SD
  0.25  0.7       0       0            0       
  0.5   0.719     0.0934  0.0189       0.0709  
  1     0.744     0.277   0.0222       0.0795  
  2     0.759     0.361   0.0323       0.0763  
  4     0.755     0.368   0.0422       0.119   
  8     0.761     0.395   0.0365       0.104   
  16    0.766     0.419   0.0417       0.113   
  32    0.749     0.388   0.0443       0.103   
  64    0.729     0.349   0.0472       0.108   
  128   0.729     0.352   0.0468       0.108   

Tuning parameter 'sigma' was held constant at a value of 0.008918477
Accuracy was used to select the optimal model using  the largest value.
The final values used for the model were sigma = 0.00892 and C = 16. 
> 
> set.seed(1056)
> svmFitLOO <- train(Class ~ .,
+                    data = GermanCreditTrain,
+                    method = "svmRadial",
+                    preProc = c("center", "scale"),
+                    tuneGrid = svmTuneGrid,
+                    trControl = trainControl(method = "LOOCV"))
> svmFitLOO
Support Vector Machines with Radial Basis Function Kernel 

800 samples
 41 predictors
  2 classes: 'Bad', 'Good' 

Pre-processing: centered, scaled 
Resampling: 

Summary of sample sizes: 799, 799, 799, 799, 799, 799, ... 

Resampling results across tuning parameters:

  C     Accuracy  Kappa
  0.25  0.7       0    
  0.5   0.718     0.1  
  1     0.749     0.305
  2     0.74      0.316
  4     0.749     0.358
  8     0.761     0.407
  16    0.761     0.417
  32    0.722     0.335
  64    0.716     0.327
  128   0.72      0.333

Tuning parameter 'sigma' was held constant at a value of 0.008918477
Accuracy was used to select the optimal model using  the largest value.
The final values used for the model were sigma = 0.00892 and C = 8. 
> 
> set.seed(1056)
> svmFitLGO <- train(Class ~ .,
+                    data = GermanCreditTrain,
+                    method = "svmRadial",
+                    preProc = c("center", "scale"),
+                    tuneGrid = svmTuneGrid,
+                    trControl = trainControl(method = "LGOCV", 
+                                             number = 50, 
+                                             p = .8))
> svmFitLGO 
Support Vector Machines with Radial Basis Function Kernel 

800 samples
 41 predictors
  2 classes: 'Bad', 'Good' 

Pre-processing: centered, scaled 
Resampling: Repeated Train/Test Splits Estimated (50 reps, 0.8%) 

Summary of sample sizes: 640, 640, 640, 640, 640, 640, ... 

Resampling results across tuning parameters:

  C     Accuracy  Kappa   Accuracy SD  Kappa SD
  0.25  0.7       0       0            0       
  0.5   0.711     0.0669  0.00956      0.0388  
  1     0.737     0.259   0.0224       0.0632  
  2     0.741     0.318   0.0238       0.0607  
  4     0.743     0.351   0.0281       0.068   
  8     0.745     0.37    0.0252       0.0617  
  16    0.738     0.365   0.0304       0.0763  
  32    0.729     0.349   0.0296       0.0723  
  64    0.722     0.335   0.0293       0.0713  
  128   0.714     0.321   0.0304       0.0749  

Tuning parameter 'sigma' was held constant at a value of 0.008918477
Accuracy was used to select the optimal model using  the largest value.
The final values used for the model were sigma = 0.00892 and C = 8. 
> 
> set.seed(1056)
> svmFitBoot <- train(Class ~ .,
+                     data = GermanCreditTrain,
+                     method = "svmRadial",
+                     preProc = c("center", "scale"),
+                     tuneGrid = svmTuneGrid,
+                     trControl = trainControl(method = "boot", number = 50))
> svmFitBoot
Support Vector Machines with Radial Basis Function Kernel 

800 samples
 41 predictors
  2 classes: 'Bad', 'Good' 

Pre-processing: centered, scaled 
Resampling: Bootstrapped (50 reps) 

Summary of sample sizes: 800, 800, 800, 800, 800, 800, ... 

Resampling results across tuning parameters:

  C     Accuracy  Kappa  Accuracy SD  Kappa SD
  0.25  0.704     0.019  0.0264       0.0327  
  0.5   0.728     0.186  0.0306       0.0879  
  1     0.739     0.29   0.0245       0.0677  
  2     0.742     0.329  0.0177       0.0504  
  4     0.742     0.345  0.0183       0.0475  
  8     0.741     0.354  0.0191       0.0502  
  16    0.735     0.347  0.0192       0.045   
  32    0.729     0.341  0.0217       0.049   
  64    0.723     0.33   0.023        0.0509  
  128   0.721     0.324  0.0232       0.0509  

Tuning parameter 'sigma' was held constant at a value of 0.008918477
Accuracy was used to select the optimal model using  the largest value.
The final values used for the model were sigma = 0.00892 and C = 4. 
> 
> set.seed(1056)
> svmFitBoot632 <- train(Class ~ .,
+                        data = GermanCreditTrain,
+                        method = "svmRadial",
+                        preProc = c("center", "scale"),
+                        tuneGrid = svmTuneGrid,
+                        trControl = trainControl(method = "boot632", 
+                                                 number = 50))
> svmFitBoot632
Support Vector Machines with Radial Basis Function Kernel 

800 samples
 41 predictors
  2 classes: 'Bad', 'Good' 

Pre-processing: centered, scaled 
Resampling: Bootstrapped (50 reps) 

Summary of sample sizes: 800, 800, 800, 800, 800, 800, ... 

Resampling results across tuning parameters:

  C     Accuracy  Kappa  Accuracy SD  Kappa SD  AccuracyApparent  KappaApparent
  0.25  0.703     0.012  0.0264       0.0327    0.7               0            
  0.5   0.733     0.189  0.0306       0.0879    0.742             0.193        
  1     0.766     0.36   0.0245       0.0677    0.811             0.479        
  2     0.783     0.435  0.0177       0.0504    0.852             0.616        
  4     0.798     0.488  0.0183       0.0475    0.894             0.733        
  8     0.81      0.528  0.0191       0.0502    0.93              0.827        
  16    0.818     0.552  0.0192       0.045     0.96              0.903        
  32    0.823     0.569  0.0217       0.049     0.984             0.961        
  64    0.822     0.569  0.023        0.0509    0.991             0.979        
  128   0.823     0.571  0.0232       0.0509    0.998             0.994        

Tuning parameter 'sigma' was held constant at a value of 0.008918477
Accuracy was used to select the optimal model using  the largest value.
The final values used for the model were sigma = 0.00892 and C = 32. 
> 
> ################################################################################
> ### Section 4.8 Choosing Between Models
> 
> set.seed(1056)
> glmProfile <- train(Class ~ .,
+                     data = GermanCreditTrain,
+                     method = "glm",
+                     trControl = trainControl(method = "repeatedcv", 
+                                              repeats = 5))
> glmProfile
Generalized Linear Model 

800 samples
 41 predictors
  2 classes: 'Bad', 'Good' 

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 5 times) 

Summary of sample sizes: 720, 720, 720, 720, 720, 720, ... 

Resampling results

  Accuracy  Kappa  Accuracy SD  Kappa SD
  0.749     0.365  0.0516       0.122   

 
> 
> resamp <- resamples(list(SVM = svmFit, Logistic = glmProfile))
> summary(resamp)

Call:
summary.resamples(object = resamp)

Models: SVM, Logistic 
Number of resamples: 50 

Accuracy 
           Min. 1st Qu. Median   Mean 3rd Qu.  Max. NA's
SVM      0.6500   0.725 0.7625 0.7495  0.7875 0.825    0
Logistic 0.6125   0.725 0.7562 0.7490  0.7844 0.850    0

Kappa 
            Min. 1st Qu. Median   Mean 3rd Qu.   Max. NA's
SVM      0.02778  0.2369 0.3488 0.3232  0.4272 0.5139    0
Logistic 0.07534  0.2831 0.3750 0.3648  0.4504 0.6250    0

> 
> ## These results are slightly different from those shown in the text.
> ## There are some differences in the train() function since the 
> ## original results were produced. This is due to a difference in
> ## predictions from the ksvm() function when class probs are requested
> ## and when they are not. See, for example, 
> ## https://stat.ethz.ch/pipermail/r-help/2013-November/363188.html
> 
> modelDifferences <- diff(resamp)
> summary(modelDifferences)

Call:
summary.diff.resamples(object = modelDifferences)

p-value adjustment: bonferroni 
Upper diagonal: estimates of the difference
Lower diagonal: p-value for H0: difference = 0

Accuracy 
         SVM    Logistic
SVM             5e-04   
Logistic 0.9299         

Kappa 
         SVM      Logistic
SVM               -0.04154
Logistic 0.007088         

> 
> ## The actual paired t-test:
> modelDifferences$statistics$Accuracy
$SVM.diff.Logistic

	One Sample t-test

data:  x
t = 0.0884, df = 49, p-value = 0.9299
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 -0.01086969  0.01186969
sample estimates:
mean of x 
    5e-04 


> 
> ################################################################################
> ### Session Information
> 
> sessionInfo()
R version 3.0.1 (2013-05-16)
Platform: x86_64-apple-darwin10.8.0 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
[1] e1071_1.6-1     class_7.3-7     doMC_1.3.0      iterators_1.0.6
[5] foreach_1.4.0   kernlab_0.9-16  caret_6.0-22    ggplot2_0.9.3.1
[9] lattice_0.20-15

loaded via a namespace (and not attached):
 [1] car_2.0-16         codetools_0.2-8    colorspace_1.2-1   compiler_3.0.1    
 [5] dichromat_2.0-0    digest_0.6.3       grid_3.0.1         gtable_0.1.2      
 [9] labeling_0.1       MASS_7.3-26        munsell_0.4        plyr_1.8          
[13] proto_0.3-10       RColorBrewer_1.0-5 reshape2_1.2.2     scales_0.2.3      
[17] stringr_0.6.2      tools_3.0.1       
> 
> q("no")
> proc.time()
    user   system  elapsed 
3260.432  211.968  906.933