KDD2009vtreat
John Mount
Practical Data Science with R, Chapter 6, built a number of single-variable models. In Listing 6.11 it used an ad-hoc entropy-based out-of-sample effect-size estimate for variable selection. This likely (though it isn't completely rigorous) picked variables conservatively.
We show here how to repeat this work on the KDD2009 dataset using more standard techniques, and more quickly. For vtreat details see: http://www.win-vector.com/blog/2014/08/vtreat-designing-a-package-for-variable-treatment/ and Chapter 6 of Practical Data Science with R: http://www.amazon.com/Practical-Data-Science/dp/1617291560 For details on the data see: https://github.com/WinVector/zmPDSwR/tree/master/KDD2009
There is an issue: any data row used to build the single-variable models isn't exchangeable with future unseen rows for the purposes of scoring and training. So the most hygienic way to work is to use one subset of the data to build the single-variable models, another to build the composite model, and a third for scoring. In particular, models trained on rows that were also used to build the sub-models over-estimate the effects the sub-models will have on future data, and under-estimate the degrees of freedom of complicated sub-models.
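As an illustration of that three-way discipline, here is a minimal sketch (not part of the original workflow, and assuming the raw data frame d that is read in below); the rest of this note instead uses vtreat's cross-frame method on a single training set to get a similar effect without the third split.
# Hypothetical three-way split: one piece to design the single-variable models,
# one to fit the composite model, and one held out for final scoring.
set.seed(2009)
grp = sample(c('calibrate','train','test'), size=nrow(d),
             replace=TRUE, prob=c(0.4,0.5,0.1))
dCalibrate = d[grp=='calibrate', , drop=FALSE]  # build single-variable models here
dModel     = d[grp=='train', , drop=FALSE]      # fit the composite model here
dHoldout   = d[grp=='test', , drop=FALSE]       # score and evaluate here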
date()## [1] "Sat Jul 6 18:02:57 2019"
#load some libraries
library('vtreat')
packageVersion("vtreat")## [1] '1.4.3'
library('WVPlots')
library('parallel')
library('xgboost')
# load the data as in the book
# change this path to match your directory structure
#dir = '~/Documents/work/PracticalDataScienceWithR/zmPDSwR/KDD2009/'
#dir = '~/Documents/work/zmPDSwR/KDD2009/'
dir = "./"
d = read.table(paste(dir,'orange_small_train.data.gz',sep=''),
               header=T,sep='\t',na.strings=c('NA',''),
               stringsAsFactors=FALSE)
churn = read.table(paste(dir,'orange_small_train_churn.labels.txt',sep=''),
                   header=F,sep='\t')
d$churn = churn$V1
appetency = read.table(paste(dir,'orange_small_train_appetency.labels.txt',sep=''),
                       header=F,sep='\t')
d$appetency = appetency$V1
upselling = read.table(paste(dir,'orange_small_train_upselling.labels.txt',sep=''),
                       header=F,sep='\t')
d$upselling = upselling$V1
set.seed(729375)
d$rgroup = runif(dim(d)[[1]])
dTrain = subset(d,rgroup<=0.9) # set for building models and impact coding
dTest = subset(d,rgroup>0.9) # set for evaluation
rm(list=c('d','churn','appetency','upselling','dir'))
dim(dTrain)
## [1] 45028 234
dim(dTest)
## [1] 4972 234
outcomes = c('churn','appetency','upselling')
vars = setdiff(colnames(dTrain),
               c(outcomes,'rgroup'))
yName = 'churn'
yTarget = 1
set.seed(239525)
ncore <- parallel::detectCores()
cl = parallel::makeCluster(ncore)
date()## [1] "Sat Jul 6 18:03:05 2019"
date()## [1] "Sat Jul 6 18:03:05 2019"
var_values <- vtreat::value_variables_C(dTrain,
                                        vars,yName,yTarget,
                                        smFactor=2.0,
                                        parallelCluster=cl
)
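Before filtering, it can be useful to glance at the strongest candidate variables. A small sketch (not part of the original run; it relies on the rsq, sig, and var columns shown in the table below):
# Sort the per-variable evaluations by explanatory power (pseudo R-squared).
head(var_values[order(-var_values$rsq), c('var','rsq','sig')])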
knitr::kable(var_values)

|   | rsq | count | sig | var |
|---|---|---|---|---|
| Var1 | 2.376463e-05 | 2 | 0.9059967 | Var1 |
| Var10 | 9.813199e-04 | 2 | 0.0000028 | Var10 |
| Var100 | 3.502919e-05 | 2 | 0.7245050 | Var100 |
| Var101 | 6.009046e-04 | 2 | 0.0003219 | Var101 |
| Var102 | 1.742866e-04 | 2 | 0.0842595 | Var102 |
| Var103 | 9.813199e-04 | 2 | 0.0000028 | Var103 |
| Var104 | 2.278508e-04 | 2 | 0.0402904 | Var104 |
| Var105 | 2.278508e-04 | 2 | 0.0402904 | Var105 |
| Var106 | 9.593148e-04 | 2 | 0.0000037 | Var106 |
| Var107 | 9.813199e-04 | 2 | 0.0000028 | Var107 |
| Var108 | 2.411380e-05 | 2 | 0.8993961 | Var108 |
| Var109 | 1.922021e-04 | 2 | 0.0656641 | Var109 |
| Var11 | 8.812072e-04 | 2 | 0.0000098 | Var11 |
| Var110 | 1.029613e-04 | 2 | 0.2365791 | Var110 |
| Var111 | 8.364368e-04 | 2 | 0.0000170 | Var111 |
| Var112 | 2.752908e-03 | 2 | 0.0000000 | Var112 |
| Var113 | 6.466721e-03 | 3 | 0.0000000 | Var113 |
| Var114 | 8.842007e-04 | 2 | 0.0000094 | Var114 |
| Var115 | 2.278508e-04 | 2 | 0.0402904 | Var115 |
| Var116 | 2.376463e-05 | 2 | 0.9059967 | Var116 |
| Var117 | 9.593148e-04 | 2 | 0.0000037 | Var117 |
| Var118 | 5.017689e-07 | 1 | 0.9131691 | Var118 |
| Var119 | 2.669092e-03 | 2 | 0.0000000 | Var119 |
| Var12 | 3.286502e-05 | 2 | 0.7550253 | Var12 |
| Var120 | 9.813199e-04 | 2 | 0.0000028 | Var120 |
| Var121 | 2.376463e-05 | 2 | 0.9059967 | Var121 |
| Var122 | 8.842007e-04 | 2 | 0.0000094 | Var122 |
| Var123 | 2.752908e-03 | 2 | 0.0000000 | Var123 |
| Var124 | 9.593148e-04 | 2 | 0.0000037 | Var124 |
| Var125 | 2.013427e-03 | 2 | 0.0000000 | Var125 |
| Var126 | 1.382784e-02 | 2 | 0.0000000 | Var126 |
| Var127 | 5.822901e-04 | 2 | 0.0004071 | Var127 |
| Var128 | 5.822901e-04 | 2 | 0.0004071 | Var128 |
| Var129 | 2.376463e-05 | 2 | 0.9059967 | Var129 |
| Var13 | 5.312771e-03 | 2 | 0.0000000 | Var13 |
| Var130 | 8.812072e-04 | 2 | 0.0000098 | Var130 |
| Var131 | 8.857586e-05 | 2 | 0.2948029 | Var131 |
| Var132 | 2.752908e-03 | 2 | 0.0000000 | Var132 |
| Var133 | 2.752908e-03 | 2 | 0.0000000 | Var133 |
| Var134 | 2.752908e-03 | 2 | 0.0000000 | Var134 |
| Var135 | 9.593148e-04 | 2 | 0.0000037 | Var135 |
| Var136 | 6.724954e-05 | 2 | 0.4136311 | Var136 |
| Var137 | 2.376463e-05 | 2 | 0.9059967 | Var137 |
| Var138 | 9.593148e-04 | 2 | 0.0000037 | Var138 |
| Var139 | 9.813199e-04 | 2 | 0.0000028 | Var139 |
| Var14 | 8.812072e-04 | 2 | 0.0000098 | Var14 |
| Var140 | 2.972593e-03 | 2 | 0.0000000 | Var140 |
| Var142 | 8.540294e-05 | 2 | 0.3097124 | Var142 |
| Var143 | 2.752908e-03 | 2 | 0.0000000 | Var143 |
| Var144 | 5.421069e-03 | 2 | 0.0000000 | Var144 |
| Var145 | 9.593148e-04 | 2 | 0.0000037 | Var145 |
| Var146 | 9.813199e-04 | 2 | 0.0000028 | Var146 |
| Var147 | 9.813199e-04 | 2 | 0.0000028 | Var147 |
| Var148 | 9.813199e-04 | 2 | 0.0000028 | Var148 |
| Var149 | 3.035956e-04 | 2 | 0.0146283 | Var149 |
| Var150 | 9.593148e-04 | 2 | 0.0000037 | Var150 |
| Var151 | 6.334417e-04 | 2 | 0.0002138 | Var151 |
| Var152 | 9.593148e-04 | 2 | 0.0000037 | Var152 |
| Var153 | 2.752908e-03 | 2 | 0.0000000 | Var153 |
| Var154 | 2.376463e-05 | 2 | 0.9059967 | Var154 |
| Var155 | 9.593148e-04 | 2 | 0.0000037 | Var155 |
| Var156 | 1.905424e-05 | 2 | 1.0000000 | Var156 |
| Var157 | 8.364368e-04 | 2 | 0.0000170 | Var157 |
| Var158 | 6.009046e-04 | 2 | 0.0003219 | Var158 |
| Var159 | 8.842007e-04 | 2 | 0.0000094 | Var159 |
| Var16 | 9.813199e-04 | 2 | 0.0000028 | Var16 |
| Var160 | 2.752908e-03 | 2 | 0.0000000 | Var160 |
| Var161 | 9.593148e-04 | 2 | 0.0000037 | Var161 |
| Var162 | 8.842007e-04 | 2 | 0.0000094 | Var162 |
| Var163 | 2.752908e-03 | 2 | 0.0000000 | Var163 |
| Var164 | 9.593148e-04 | 2 | 0.0000037 | Var164 |
| Var165 | 6.009046e-04 | 2 | 0.0003219 | Var165 |
| Var166 | 9.813199e-04 | 2 | 0.0000028 | Var166 |
| Var168 | 2.632379e-04 | 2 | 0.0250101 | Var168 |
| Var17 | 9.593148e-04 | 2 | 0.0000037 | Var17 |
| Var170 | 8.842007e-04 | 2 | 0.0000094 | Var170 |
| Var171 | 5.822901e-04 | 2 | 0.0004071 | Var171 |
| Var172 | 9.813199e-04 | 2 | 0.0000028 | Var172 |
| Var173 | 2.752908e-03 | 2 | 0.0000000 | Var173 |
| Var174 | 9.593148e-04 | 2 | 0.0000037 | Var174 |
| Var176 | 8.812072e-04 | 2 | 0.0000098 | Var176 |
| Var177 | 8.842007e-04 | 2 | 0.0000094 | Var177 |
| Var178 | 1.729184e-04 | 2 | 0.0858903 | Var178 |
| Var179 | 9.593148e-04 | 2 | 0.0000037 | Var179 |
| Var18 | 9.593148e-04 | 2 | 0.0000037 | Var18 |
| Var180 | 2.376463e-05 | 2 | 0.9059967 | Var180 |
| Var181 | 2.752908e-03 | 2 | 0.0000000 | Var181 |
| Var182 | 9.593148e-04 | 2 | 0.0000037 | Var182 |
| Var183 | 8.842007e-04 | 2 | 0.0000094 | Var183 |
| Var184 | 8.842007e-04 | 2 | 0.0000094 | Var184 |
| Var186 | 2.376463e-05 | 2 | 0.9059967 | Var186 |
| Var187 | 2.376463e-05 | 2 | 0.9059967 | Var187 |
| Var188 | 8.842007e-04 | 2 | 0.0000094 | Var188 |
| Var189 | 1.215778e-02 | 2 | 0.0000000 | Var189 |
| Var19 | 9.593148e-04 | 2 | 0.0000037 | Var19 |
| Var190 | 7.148827e-05 | 2 | 0.3861432 | Var190 |
| Var191 | 5.822881e-04 | 2 | 0.0004071 | Var191 |
| Var192 | 5.621971e-03 | 2 | 0.0000000 | Var192 |
| Var193 | 7.309619e-03 | 2 | 0.0000000 | Var193 |
| Var194 | 6.818729e-04 | 2 | 0.0001165 | Var194 |
| Var195 | 8.627090e-04 | 2 | 0.0000123 | Var195 |
| Var196 | 1.182542e-04 | 2 | 0.1882689 | Var196 |
| Var197 | 9.702002e-04 | 2 | 0.0000033 | Var197 |
| Var198 | 4.062008e-03 | 2 | 0.0000000 | Var198 |
| Var199 | 8.519979e-03 | 2 | 0.0000000 | Var199 |
| Var2 | 8.842007e-04 | 2 | 0.0000094 | Var2 |
| Var200 | 5.225536e-03 | 2 | 0.0000000 | Var200 |
| Var201 | 6.739672e-04 | 2 | 0.0001287 | Var201 |
| Var202 | 3.203968e-03 | 2 | 0.0000000 | Var202 |
| Var203 | 2.129657e-04 | 2 | 0.0493503 | Var203 |
| Var204 | 1.715598e-03 | 2 | 0.0000000 | Var204 |
| Var205 | 7.834535e-03 | 2 | 0.0000000 | Var205 |
| Var206 | 1.260427e-02 | 2 | 0.0000000 | Var206 |
| Var207 | 5.925474e-03 | 2 | 0.0000000 | Var207 |
| Var208 | 8.690949e-05 | 2 | 0.3025293 | Var208 |
| Var21 | 2.669092e-03 | 2 | 0.0000000 | Var21 |
| Var210 | 4.061590e-03 | 2 | 0.0000000 | Var210 |
| Var211 | 1.982178e-03 | 2 | 0.0000000 | Var211 |
| Var212 | 9.900362e-03 | 2 | 0.0000000 | Var212 |
| Var213 | 8.362860e-04 | 2 | 0.0000170 | Var213 |
| Var214 | 5.225536e-03 | 2 | 0.0000000 | Var214 |
| Var215 | 1.905564e-05 | 2 | 1.0000000 | Var215 |
| Var216 | 4.452981e-03 | 2 | 0.0000000 | Var216 |
| Var217 | 1.197942e-02 | 2 | 0.0000000 | Var217 |
| Var218 | 1.218248e-02 | 2 | 0.0000000 | Var218 |
| Var219 | 2.882058e-04 | 2 | 0.0179332 | Var219 |
| Var22 | 2.752908e-03 | 2 | 0.0000000 | Var22 |
| Var220 | 4.062008e-03 | 2 | 0.0000000 | Var220 |
| Var221 | 3.638509e-03 | 2 | 0.0000000 | Var221 |
| Var222 | 4.062008e-03 | 2 | 0.0000000 | Var222 |
| Var223 | 1.066714e-04 | 2 | 0.2237208 | Var223 |
| Var224 | 2.279107e-04 | 2 | 0.0402577 | Var224 |
| Var225 | 5.896221e-03 | 2 | 0.0000000 | Var225 |
| Var226 | 2.463084e-03 | 2 | 0.0000000 | Var226 |
| Var227 | 5.584397e-03 | 2 | 0.0000000 | Var227 |
| Var228 | 9.281733e-03 | 2 | 0.0000000 | Var228 |
| Var229 | 7.058449e-03 | 2 | 0.0000000 | Var229 |
| Var23 | 9.813199e-04 | 2 | 0.0000028 | Var23 |
| Var24 | 2.381645e-04 | 2 | 0.0350371 | Var24 |
| Var25 | 2.752908e-03 | 2 | 0.0000000 | Var25 |
| Var26 | 9.813199e-04 | 2 | 0.0000028 | Var26 |
| Var27 | 9.813199e-04 | 2 | 0.0000028 | Var27 |
| Var28 | 2.758522e-03 | 2 | 0.0000000 | Var28 |
| Var29 | 2.992132e-05 | 2 | 0.7995316 | Var29 |
| Var3 | 8.812072e-04 | 2 | 0.0000098 | Var3 |
| Var30 | 2.376463e-05 | 2 | 0.9059967 | Var30 |
| Var33 | 6.334417e-04 | 2 | 0.0002138 | Var33 |
| Var34 | 8.842007e-04 | 2 | 0.0000094 | Var34 |
| Var35 | 2.752908e-03 | 2 | 0.0000000 | Var35 |
| Var36 | 8.842007e-04 | 2 | 0.0000094 | Var36 |
| Var37 | 9.593148e-04 | 2 | 0.0000037 | Var37 |
| Var38 | 2.752908e-03 | 2 | 0.0000000 | Var38 |
| Var4 | 9.593148e-04 | 2 | 0.0000037 | Var4 |
| Var40 | 8.842007e-04 | 2 | 0.0000094 | Var40 |
| Var41 | 2.376463e-05 | 2 | 0.9059967 | Var41 |
| Var43 | 8.842007e-04 | 2 | 0.0000094 | Var43 |
| Var44 | 2.752908e-03 | 2 | 0.0000000 | Var44 |
| Var45 | 6.815674e-05 | 2 | 0.4075594 | Var45 |
| Var46 | 8.842007e-04 | 2 | 0.0000094 | Var46 |
| Var47 | 2.376463e-05 | 2 | 0.9059967 | Var47 |
| Var49 | 8.842007e-04 | 2 | 0.0000094 | Var49 |
| Var5 | 9.813199e-04 | 2 | 0.0000028 | Var5 |
| Var50 | 2.376463e-05 | 2 | 0.9059967 | Var50 |
| Var51 | 6.809351e-04 | 2 | 0.0001179 | Var51 |
| Var53 | 9.100583e-05 | 2 | 0.2839318 | Var53 |
| Var54 | 8.842007e-04 | 2 | 0.0000094 | Var54 |
| Var56 | 1.729184e-04 | 2 | 0.0858903 | Var56 |
| Var57 | 2.044753e-04 | 3 | 0.0831623 | Var57 |
| Var58 | 2.376463e-05 | 2 | 0.9059967 | Var58 |
| Var59 | 2.278508e-04 | 2 | 0.0402904 | Var59 |
| Var6 | 2.669092e-03 | 2 | 0.0000000 | Var6 |
| Var60 | 9.813199e-04 | 2 | 0.0000028 | Var60 |
| Var61 | 6.334417e-04 | 2 | 0.0002138 | Var61 |
| Var62 | 1.707981e-05 | 2 | 1.0000000 | Var62 |
| Var63 | 1.905424e-05 | 2 | 1.0000000 | Var63 |
| Var64 | 7.138463e-05 | 2 | 0.3867893 | Var64 |
| Var65 | 2.974265e-03 | 2 | 0.0000000 | Var65 |
| Var66 | 1.905424e-05 | 2 | 1.0000000 | Var66 |
| Var67 | 9.813199e-04 | 2 | 0.0000028 | Var67 |
| Var68 | 8.842007e-04 | 2 | 0.0000094 | Var68 |
| Var69 | 9.813199e-04 | 2 | 0.0000028 | Var69 |
| Var7 | 9.401229e-03 | 2 | 0.0000000 | Var7 |
| Var70 | 9.813199e-04 | 2 | 0.0000028 | Var70 |
| Var71 | 8.364368e-04 | 2 | 0.0000170 | Var71 |
| Var72 | 2.007878e-03 | 2 | 0.0000000 | Var72 |
| Var73 | 2.011271e-02 | 3 | 0.0000000 | Var73 |
| Var74 | 6.365026e-03 | 2 | 0.0000000 | Var74 |
| Var75 | 8.842007e-04 | 2 | 0.0000094 | Var75 |
| Var76 | 2.752908e-03 | 2 | 0.0000000 | Var76 |
| Var77 | 3.158199e-05 | 2 | 0.7739775 | Var77 |
| Var78 | 2.752908e-03 | 2 | 0.0000000 | Var78 |
| Var80 | 9.813199e-04 | 2 | 0.0000028 | Var80 |
| Var81 | 2.669092e-03 | 2 | 0.0000000 | Var81 |
| Var82 | 9.593148e-04 | 2 | 0.0000037 | Var82 |
| Var83 | 2.752908e-03 | 2 | 0.0000000 | Var83 |
| Var84 | 8.812072e-04 | 2 | 0.0000098 | Var84 |
| Var85 | 2.752908e-03 | 2 | 0.0000000 | Var85 |
| Var86 | 2.376463e-05 | 2 | 0.9059967 | Var86 |
| Var87 | 2.376463e-05 | 2 | 0.9059967 | Var87 |
| Var88 | 5.822901e-04 | 2 | 0.0004071 | Var88 |
| Var89 | 1.729184e-04 | 2 | 0.0858903 | Var89 |
| Var9 | 2.376463e-05 | 2 | 0.9059967 | Var9 |
| Var90 | 2.376463e-05 | 2 | 0.9059967 | Var90 |
| Var91 | 8.364368e-04 | 2 | 0.0000170 | Var91 |
| Var92 | 1.829928e-05 | 2 | 1.0000000 | Var92 |
| Var93 | 9.813199e-04 | 2 | 0.0000028 | Var93 |
| Var94 | 3.706455e-04 | 2 | 0.0060810 | Var94 |
| Var95 | 8.842007e-04 | 2 | 0.0000094 | Var95 |
| Var96 | 8.842007e-04 | 2 | 0.0000094 | Var96 |
| Var97 | 9.813199e-04 | 2 | 0.0000028 | Var97 |
| Var98 | 1.749498e-05 | 2 | 1.0000000 | Var98 |
| Var99 | 9.593148e-04 | 2 | 0.0000037 | Var99 |
# Keep variables whose significance is below 1/(number of candidate variables),
# a Bonferroni-style threshold (here about 1/212, or roughly 0.0047).
summary(var_values$sig < 1/nrow(var_values))
##    Mode   FALSE    TRUE
## logical      59     153
vars <- var_values$var[var_values$sig < 1/nrow(var_values)]
date()## [1] "Sat Jul 6 18:08:22 2019"
date()## [1] "Sat Jul 6 18:08:22 2019"
# Run other models (with proper coding/training separation).
#
# This gets us back to the AUC 0.74 range.
customCoders = list('c.PiecewiseV.num' = vtreat::solve_piecewise,
                    'n.PiecewiseV.num' = vtreat::solve_piecewise,
                    'c.knearest.num' = vtreat::square_window,
                    'n.knearest.num' = vtreat::square_window,
                    'c.spline.num' = vtreat::spline_variable,
                    'n.spline.num' = vtreat::spline_variable)
# 'n.poolN.center' = vtreat::ppCoderN,
# 'c.poolC.center' = vtreat::ppCoderC)
# 'n.NonDecreasingV.num' = vtreat::solveNonDecreasing,
# 'n.NonIncreasingV.num' = vtreat::solveNonIncreasing,
# 'c.NonDecreasingV.num' = vtreat::solveNonDecreasing,
# 'c.NonIncreasingV.num' = vtreat::solveNonIncreasing)
cfe = mkCrossFrameCExperiment(dTrain,
                              vars,yName,yTarget,
                              customCoders=customCoders,
                              smFactor=2.0,
                              parallelCluster=cl)
## [1] "vtreat 1.4.3 start initial treatment design Sat Jul 6 18:08:22 2019"
## [1] " start cross frame work Sat Jul 6 18:15:08 2019"
## [1] " vtreat::mkCrossFrameCExperiment done Sat Jul 6 18:17:50 2019"
treatmentsC = cfe$treatments
scoreFrame = treatmentsC$scoreFrame
table(scoreFrame$code)
##
##       catB       catP      clean      isBAD   knearest        lev
##         28         28        122        120          2        121
## PiecewiseV     spline
##        118         83
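A quick way to see which derived variables carry the most signal is to sort the score frame. This is a sketch, not part of the original output; it assumes the usual vtreat score-frame columns varName, origName, code, rsq, and sig:
# Top treated variables by cross-validated pseudo R-squared.
head(scoreFrame[order(-scoreFrame$rsq),
                c('varName','origName','code','rsq','sig')])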
selvars <- scoreFrame$varName
treatedTrainM <- cfe$crossFrame[,c(yName,selvars),drop=FALSE]
treatedTrainM[[yName]] = treatedTrainM[[yName]]==yTarget
treatedTest = prepare(treatmentsC,
                      dTest,
                      pruneSig=NULL,
                      varRestriction = selvars,
                      parallelCluster=cl)
treatedTest[[yName]] = treatedTest[[yName]]==yTarget
# prepare plotting frames
treatedTrainP = treatedTrainM[, yName, drop=FALSE]
treatedTestP = treatedTest[, yName, drop=FALSE]
date()## [1] "Sat Jul 6 18:17:51 2019"
date()## [1] "Sat Jul 6 18:17:51 2019"
mname = 'xgbPred'
print(paste(mname,length(selvars)))## [1] "xgbPred 622"
params <- list(max_depth = 5,
               objective = "binary:logistic",
               nthread = ncore)
model <- xgb.cv(data = as.matrix(treatedTrainM[, selvars, drop = FALSE]),
                label = treatedTrainM[[yName]],
                nrounds = 400,
                params = params,
                nfold = 5,
                early_stopping_rounds = 10,
                eval_metric = "logloss")
## [1] train-logloss:0.503118+0.000619 test-logloss:0.504132+0.001167
## Multiple eval metrics are present. Will use test_logloss for early stopping.
## Will train until test_logloss hasn't improved in 10 rounds.
##
## [2] train-logloss:0.400317+0.001025 test-logloss:0.402179+0.002146
## [3] train-logloss:0.338216+0.001284 test-logloss:0.341292+0.002991
## [4] train-logloss:0.299239+0.001445 test-logloss:0.303358+0.003886
## [5] train-logloss:0.274050+0.001584 test-logloss:0.279312+0.004833
## [6] train-logloss:0.257435+0.001738 test-logloss:0.263867+0.005338
## [7] train-logloss:0.246297+0.001758 test-logloss:0.254254+0.005848
## [8] train-logloss:0.238483+0.001759 test-logloss:0.247876+0.006400
## [9] train-logloss:0.232978+0.001624 test-logloss:0.243947+0.006830
## [10] train-logloss:0.228714+0.001697 test-logloss:0.241427+0.006993
## [11] train-logloss:0.225146+0.001731 test-logloss:0.239994+0.007271
## [12] train-logloss:0.222296+0.001564 test-logloss:0.238938+0.007755
## [13] train-logloss:0.219822+0.001521 test-logloss:0.238190+0.007910
## [14] train-logloss:0.217801+0.001439 test-logloss:0.237995+0.007935
## [15] train-logloss:0.215918+0.001480 test-logloss:0.237707+0.007992
## [16] train-logloss:0.213984+0.001646 test-logloss:0.237540+0.008098
## [17] train-logloss:0.212287+0.001626 test-logloss:0.237529+0.008226
## [18] train-logloss:0.210559+0.001695 test-logloss:0.237715+0.008316
## [19] train-logloss:0.209040+0.001769 test-logloss:0.237638+0.008203
## [20] train-logloss:0.207600+0.001686 test-logloss:0.237583+0.008416
## [21] train-logloss:0.206182+0.001846 test-logloss:0.237564+0.008456
## [22] train-logloss:0.204802+0.001708 test-logloss:0.237705+0.008422
## [23] train-logloss:0.203413+0.001602 test-logloss:0.237697+0.008590
## [24] train-logloss:0.202214+0.001850 test-logloss:0.237877+0.008622
## [25] train-logloss:0.201347+0.001721 test-logloss:0.237981+0.008500
## [26] train-logloss:0.199966+0.001484 test-logloss:0.238009+0.008654
## [27] train-logloss:0.198762+0.001469 test-logloss:0.238275+0.008497
## Stopping. Best iteration:
## [17] train-logloss:0.212287+0.001626 test-logloss:0.237529+0.008226
nrounds <- model$best_iteration
print(paste("nrounds", nrounds))## [1] "nrounds 17"
model <- xgboost(data = as.matrix(treatedTrainM[, selvars, drop = FALSE]),
                 label = treatedTrainM[[yName]],
                 nrounds = nrounds,
                 params = params)
## [1] train-error:0.071733
## [2] train-error:0.071800
## [3] train-error:0.071644
## [4] train-error:0.071800
## [5] train-error:0.071711
## [6] train-error:0.071578
## [7] train-error:0.071733
## [8] train-error:0.071644
## [9] train-error:0.071689
## [10] train-error:0.071667
## [11] train-error:0.071600
## [12] train-error:0.070956
## [13] train-error:0.070845
## [14] train-error:0.070734
## [15] train-error:0.070245
## [16] train-error:0.070134
## [17] train-error:0.070023
# Score with the fitted xgboost model; with the binary:logistic objective,
# predict() returns class-1 probabilities.
treatedTrainP[[mname]] = predict(model,
                                 newdata=as.matrix(treatedTrainM[, selvars, drop = FALSE]))
treatedTestP[[mname]] = predict(model,
                                newdata=as.matrix(treatedTest[, selvars, drop = FALSE]))
date()## [1] "Sat Jul 6 18:22:34 2019"
date()## [1] "Sat Jul 6 18:22:34 2019"
t1 = paste(mname,'trainingM data')
print(DoubleDensityPlot(treatedTrainP, mname, yName,
                        title=t1))
print(ROCPlot(treatedTrainP, mname, yName, yTarget,
              title=t1))
print(WVPlots::PRPlot(treatedTrainP, mname, yName, yTarget,
                      title=t1))
t2 = paste(mname,'test data')
print(DoubleDensityPlot(treatedTestP, mname, yName,
                        title=t2))
print(ROCPlot(treatedTestP, mname, yName, yTarget,
              title=t2))
print(WVPlots::PRPlot(treatedTestP, mname, yName, yTarget,
                      title=t2))
print(date())
## [1] "Sat Jul 6 18:22:42 2019"
print("*****************************")
## [1] "*****************************"
date()
## [1] "Sat Jul 6 18:22:42 2019"
if(!is.null(cl)) {
parallel::stopCluster(cl)
cl = NULL
}




