<h1> <center> Predicting the NHL Playoffs Using a Bagged Elastic Net Regression </center></h1> <br>

This is a bootstrap aggregated (bagged) elastic net GLM. It maximizes the penalized log likelihood:

$$\log(L(p)) - \lambda\left[\frac{1-\alpha}{2} \sum_{j = 1}^p B_{j}^2 + \alpha \sum_{j = 1}^p \lvert B_{j}\rvert \right]$$ <br>

where $\log(L(p))$ is the log likelihood of the binomial distribution (this is a binary classification problem) and $B_{j}$ are the fitted coefficients for each feature. The upper limit $p$ of the sums is the number of features, not a probability. We tune over values of $\lambda$ and $\alpha$ to get the best out of sample AUROC. Note that $\log[L(p)] = \sum_{i=1}^n [y_{i} \log(p_{i}) + (1-y_{i}) \log(1-p_{i})]$ with $p_{i} = \frac{\exp(X_{i}B)}{1 + \exp(X_{i}B)}$, so the maximization has no closed form solution and has to be solved numerically.
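
To make the objective concrete, here is a minimal sketch that evaluates it for a given coefficient vector. It assumes the standard logistic link $p_i = 1/(1 + \exp(-X_iB))$ and leaves out the intercept for brevity; the function name `penalized_loglik` and the toy data are made up for illustration. glmnet itself works with the log likelihood averaged over observations and minimizes its negative, so the scale of $\lambda$ here is not directly comparable to glmnet's.

```r
# Penalized binomial log likelihood (to be maximized), following the formula in the text.
# X: numeric feature matrix, y: 0/1 outcomes, B: coefficient vector (no intercept, for brevity).
penalized_loglik <- function(B, X, y, lambda, alpha) {
  eta <- as.vector(X %*% B)            # linear predictor X_i B
  p <- 1 / (1 + exp(-eta))             # assumed standard logistic link
  loglik <- sum(y * log(p) + (1 - y) * log(1 - p))
  ridge <- sum(B^2)                    # penalizes large coefficients
  lasso <- sum(abs(B))                 # can push coefficients to exactly zero
  loglik - lambda * ((1 - alpha) / 2 * ridge + alpha * lasso)
}

# Toy data, invented for illustration only.
X <- matrix(c(1, 0, -1, 2,
              0.5, -1, 1, 0), ncol = 2)
y <- c(1, 0, 0, 1)

# With all coefficients at zero every p_i is 0.5 and there is no penalty.
zero_fit <- penalized_loglik(c(0, 0), X, y, lambda = 0.1, alpha = 0.5)

# A nonzero coefficient has to buy enough log likelihood to pay for its penalty.
some_fit <- penalized_loglik(c(1, 0), X, y, lambda = 0.1, alpha = 0.5)
```

Here `zero_fit` equals `4 * log(0.5)`, and `some_fit` comes out higher because the first feature lines up with the outcome well enough to offset the penalty on its coefficient.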

I chose this model mostly because I wanted a framework with embedded feature selection that still represents predictions as a linear combination of features. The sample size is low, so a "simple" linear function is probably the best we can get right now. The penalty explicitly controls for both collinearity in the predictors (there is a lot of it in this problem) through the ridge term $\sum_{j = 1}^p B_{j}^2$ and noisy predictors through the LASSO term $\sum_{j = 1}^p \lvert B_{j}\rvert$. $\alpha$ controls the mix of the two penalties and $\lambda$ controls the overall size of the penalty. The model penalizes large coefficients, which are a symptom of bad collinearity or perfect separation in any linear model, while also allowing the coefficient of a predictor to be exactly zero if the predictor does not help out of sample performance. A model that maximizes the log likelihood is not necessarily the best model if it is overfit; in some ways this is similar to the idea behind Akaike's Information Criterion for finding a parsimonious model, but the penalty term is very different.

One easy way to interpret this model is as follows: if the log likelihood does not increase enough (or at all) to warrant a relatively "large" $B_{j}$, then the influence the variable has on the predictions is lowered by shrinking its coefficient in magnitude. If the variable is completely non-predictive, then its coefficient is shrunk to zero, meaning that the variable is no longer included in the model. Furthermore, coefficients that are arbitrarily (and not meaningfully) large due to collinearity or perfect separation are shrunk to smaller values, which reduces the variance of out of sample predictions and therefore tends to improve out of sample predictive performance.
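
The difference between "shrunk" and "shrunk to exactly zero" is easiest to see in the coordinate-wise update for a single standardized predictor. The sketch below uses the update for the linear (Gaussian) case rather than the binomial one used in this notebook, and the names `soft_threshold` and `enet_update` as well as the input values are made up for illustration.

```r
# Soft-thresholding: moves z toward zero by gamma and returns exactly 0 when |z| <= gamma.
soft_threshold <- function(z, gamma) sign(z) * pmax(abs(z) - gamma, 0)

# Elastic net update for one standardized predictor in the Gaussian case.
# z is the unpenalized coefficient for that predictor.
enet_update <- function(z, lambda, alpha) {
  soft_threshold(z, lambda * alpha) / (1 + lambda * (1 - alpha))
}

z <- c(-2, -0.3, 0.1, 0.8, 5)   # hypothetical unpenalized coefficients

ridge_only <- enet_update(z, lambda = 0.5, alpha = 0)    # shrinks everything, zeroes nothing
lasso_only <- enet_update(z, lambda = 0.5, alpha = 1)    # zeroes every coefficient with |z| <= 0.5
mixed      <- enet_update(z, lambda = 0.5, alpha = 0.5)  # a bit of both
```

With `alpha = 0` every coefficient is divided by 1.5 and none is dropped; with `alpha = 1` the two small coefficients become exactly zero, which is the embedded feature selection described above.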

This is the best of all the models I tried, giving an AUROC of about 0.62 with about 140 features as input (though many aren't actually used in the final predictions). It predicts slightly better than the bagged gradient boosted GLM, which sits around 0.603 AUROC. In other words, if we randomly picked a true winner and a true loser from the dataset 1000 times, the bagged elastic net would be expected to give the true winner the higher win probability about 614-620 times. Not bad for how hard NHL games are to predict, and the log loss is finally getting close to what the most complex NHL machine learning models I've seen evaluate at (0.6709 log loss; since $e^{-0.6709} \approx 0.511$, that corresponds to a typical (geometric mean) probability of about 0.511 being assigned to the outcome that actually happened, which is clearly not a very confident prediction. My model sits at about 0.506-0.508). This is better than randomly guessing (AUROC = 0.5, or a log loss of $\ln 2 \approx 0.6931$) or defaulting to the higher seed in the playoffs (which should give an AUROC of around 0.547).
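
Both interpretations above can be checked with a few lines of base R. This is only a sketch: `pairwise_auc` and `implied_prob` are made-up helper names, and the scores and labels are invented. The conversion from log loss to a probability uses the fact that `exp(-log loss)` is the geometric mean of the probabilities assigned to the outcomes that actually happened.

```r
# AUROC as the share of (winner, loser) pairs where the winner gets the higher score; ties count half.
pairwise_auc <- function(scores, labels) {
  winners <- scores[labels == "W"]
  losers <- scores[labels == "L"]
  diffs <- outer(winners, losers, "-")
  mean((diffs > 0) + 0.5 * (diffs == 0))
}

# Hypothetical predictions, invented for illustration.
scores <- c(0.62, 0.55, 0.48, 0.51, 0.40, 0.58)
labels <- c("W", "W", "W", "L", "L", "L")
auc <- pairwise_auc(scores, labels)   # 6 of the 9 pairs are ranked correctly

# Geometric mean probability assigned to the realized outcome.
implied_prob <- function(log_loss) exp(-log_loss)
benchmark <- implied_prob(0.6709)   # log loss quoted for the most complex models
this_model <- implied_prob(0.6797)  # log loss reported for this model
coin_flip <- implied_prob(log(2))   # random guessing
```

The three probabilities come out at roughly 0.511, 0.507 and exactly 0.5, which shows how narrow the gap between a good NHL model and a coin flip is on this scale.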

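Be warned: this script takes a long time to run. The cost comes from the sheer number of fits (150 repeats of 3 outer folds, each tuning 146 hyperparameter pairs with 15 bagged models per pair) rather than from glmnet itself, whose solver is compiled Fortran called from R. The outer loop already runs in parallel, so shrinking the random grid or the number of repeats is the easiest way to cut the run time.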
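
Much of the run time follows from how many models get fit. A rough count, using the settings that appear in the code cells of this notebook (the variable names here are only for the arithmetic):

```r
# Number of glmnet fits implied by the settings used in this notebook.
repeats <- 150        # length of the seeds vector
outer_folds <- 3      # createFolds(k = 3)
inner_splits <- 1     # createDataPartition(times = 1)
grid_points <- 146    # rows of randomGrid
bags <- 15            # createResample(times = 15)

# Tuning fits plus the final bagged model for each outer fold.
fits_per_outer_fold <- inner_splits * grid_points * bags + bags
total_fits <- repeats * outer_folds * fits_per_outer_fold
total_fits
```

That is close to a million fits, each of which computes a path of up to 120 values of $\lambda$.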

<h3><center>Reported Validation Scores (150 Repeats of Nested Cross Validation): </center></h3>
Final AUROC: 0.617137444516268 <br>
Final Log Loss: 0.67967740764237 <br>
A 95% CI for the AUROC is: [0.61360512004268, 0.620669768989856] <br>
A 95% CI for the Log Loss is: [0.677274842965873, 0.682079972318866] <br>



In [1]:
#Set the directory for parallel computation status checks.
setwd("C:/Users/Brayden/Documents/NHLModel/Status")

In [2]:
#Dependencies

require(glmnet)
require(caret)
require(pROC)
require(tidyverse)
require(recipes)
require(moments)
require(doParallel)
require(foreach)
require(fastknn)

Loading required package: glmnet
Loading required package: Matrix
"package 'Matrix' was built under R version 3.5.2"
Loading required package: foreach
Loaded glmnet 2.0-16

Loading required package: caret
"package 'caret' was built under R version 3.5.2"
Loading required package: lattice
"package 'lattice' was built under R version 3.5.2"
Loading required package: ggplot2
"package 'ggplot2' was built under R version 3.5.1"
Loading required package: pROC
"package 'pROC' was built under R version 3.5.2"
Type 'citation("pROC")' for a citation.

Attaching package: 'pROC'

The following object is masked from 'package:glmnet':

    auc

The following objects are masked from 'package:stats':

    cov, smooth, var

Loading required package: tidyverse
"package 'tidyverse' was built under R version 3.5.2"
-- Attaching packages --------------------------------------- tidyverse 1.2.1 --
v tibble  1.4.2     v purrr   0.2.5
v tidyr   0.8.2     v dplyr   0.7.8
v readr   1.3.1     v stringr 1.3.1

In [3]:
#Bagging Function
#Fits 15 elastic net models on bootstrap resamples and averages their predictions and variable importances.
baggedModel = function(train, test, label_train, alpha.a, s_lambda.a){
  
  set.seed(40689)
  samples = caret::createResample(y = label_train, times = 15)
  pred = list()
  imp = list()
  newx = data.matrix(test[, !names(test) %in% c("ResultProper")])
  
  for (g in seq_along(samples)){
    train_temp = train[samples[[g]], ]
    a = label_train[samples[[g]]]
    modelX = glmnet(x = data.matrix(train_temp), y = a, family = "binomial", alpha = alpha.a, nlambda = 120, standardize = FALSE)
    #s_lambda.a is a position on the lambda path. glmnet can stop before fitting all 120 values,
    #so cap the position at the length of the fitted path to avoid an out of bounds error.
    s = min(s_lambda.a, length(modelX$lambda))
    pred[[g]] = predict(modelX, newx = newx, type = "response")[, s]
    imp[[g]] = caret::varImp(modelX, lambda = modelX$lambda[s])
    remove(modelX, train_temp, a, s)
  }
  
  #Average over the bags. The row names of the importance frames carry the variable names.
  pred = unname(rowMeans(do.call(cbind, pred)))
  imp = do.call(cbind, imp)
  importance = data.frame(Variable = rownames(imp), meanImportance = rowMeans(imp), stringsAsFactors = FALSE)
  rownames(importance) = NULL
  
  list(Predictions = pred, VariableImportance = importance)
}

In [None]:
#LogLoss Function
logLoss = function(scores, label){
  
  if (is.factor(label)){
    u = ifelse(label ==  "W", 1,0)
  } else{
    u = label
  }
  
  tmp = data.frame(scores = scores, target = u)
  tmp = tmp %>% mutate(scores = ifelse(scores == 1, 0.9999999999999999, ifelse(scores == 0 , 0.0000000000000001, scores))) %>%
    mutate(logLoss = -(target * log(scores) + (1-target) * log(1-scores)))
  
  out = mean(tmp$logLoss)
  
}

In [None]:
#PCA Function

addPCA_variables = function(traindata, testdata){
    
    traindata_tmp = traindata %>% select_if(., is.numeric)
    testdata_tmp = testdata %>% select_if(., is.numeric)
    
    pca_parameters = prcomp(traindata_tmp, center = FALSE, scale. = FALSE)
    pca_newdata = predict(pca_parameters, newdata = testdata_tmp)[,1:5]
    pca_traindata = predict(pca_parameters, newdata = traindata_tmp)[,1:5]
    out = list(train = cbind(traindata, pca_traindata), test = cbind(testdata, pca_newdata))

}

The kNN function is not used because it does not improve results (in fact, it makes performance slightly worse, although the confidence intervals overlap with those of the model without it):

Using kNN with the same outer folds:
Final ROC: 0.615624937154349 
Final Log Loss: 0.68043590826959 

A 95% CI for the AUROC is: [0.6114276275119, 0.619822246796798] 
A 95% CI for the Log Loss is: [0.67794196509003, 0.68292985144915] 

In [None]:
#kNN Function

addKNN_variables = function(traindata, testdata, include_PCA = FALSE){
    
    y = traindata$ResultProper

    if(include_PCA == TRUE){
        
    traindata_tmp = traindata %>% select_if(., is.numeric)
    testdata_tmp = testdata %>% select_if(., is.numeric)
        
        }else{
        
    traindata_tmp = traindata %>% select_if(., is.numeric) %>% as_tibble(.) %>% select(-starts_with("PC"))
    testdata_tmp = testdata %>% select_if(., is.numeric) %>% as_tibble(.) %>% select(-starts_with("PC"))
    }
    
    newframeswithKNN = fastknn::knnExtract(xtr = data.matrix(traindata_tmp), ytr = y, xte = data.matrix(testdata_tmp), k = 1)
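    #Standardize both sets with the training means and standard deviations so the test features stay on the training scale.
    KNN_center = colMeans(newframeswithKNN$new.tr, na.rm = TRUE)
    KNN_scale = apply(newframeswithKNN$new.tr, 2, sd, na.rm = TRUE)
    KNN_train = scale(newframeswithKNN$new.tr, center = KNN_center, scale = KNN_scale) %>% as_tibble(.)
    KNN_test = scale(newframeswithKNN$new.te, center = KNN_center, scale = KNN_scale) %>% as_tibble(.)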
    out = list(train = cbind(traindata, KNN_train), test = cbind(testdata, KNN_test))
}

Each feature is the difference in a stat between the two teams in a series, taken from the higher seed's perspective. This way we get a single prediction per series. There could be better ways to do this (perhaps a ratio, product, sum, etc.), but I thought a difference would make the most sense.
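
As a small illustration of that layout, here is how one row of the dataset could be built for a single series. The column names are borrowed from the feature engineering code in this notebook, but the stat values are invented, and reading `ResultProper = "W"` as "the higher seed won the series" is an assumption.

```r
# Hypothetical regular season stats for the two teams in one series (values invented).
higher_seed <- data.frame(Points = 109, Goals = 270, GoalsAgainst = 215)
lower_seed  <- data.frame(Points = 96,  Goals = 244, GoalsAgainst = 236)

# One row per series: higher seed minus lower seed, stat by stat.
series_row <- higher_seed - lower_seed

# Assumed label convention: "W" means the higher seed won the series.
series_row$ResultProper <- factor("W", levels = c("L", "W"))
series_row
```

A positive `Points` difference and a negative `GoalsAgainst` difference both favour the higher seed, so the sign of each feature has a direct reading.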

In [None]:
#Read Data In

cat("Reading in Data..... \n")
allData = read.csv("C:/Users/Brayden/Documents/NHLModel/FullDataSet_Dec29.csv", na.strings = "#N/A")
allData = allData[, 3:ncol(allData)]
indx = colnames(allData[, grepl(paste(c("X", "X."), collapse = "|"), colnames(allData))])
allData = allData[, !names(allData) %in% indx]
remove(indx)

In [None]:
colnames(allData)

In [None]:
#...................................Do Some Engineering of Features..................#
allData = allData %>% mutate(Round = as.factor(rep(c(1,1,1,1,1,1,1,1,2,2,2,2,3,3,4),12))) %>%
            mutate(Ratio_of_GoalstoGoalsAgainst = Goals/GoalsAgainst) %>%
            mutate(Ratio_of_HitstoBlocks = HitsatES/BlocksatES) %>%
            mutate(PenaltiestoPowerPlay = Penalties/PowerPlay) %>%
            mutate(PenaltiestoPowerPlaylog = sign(PenaltiestoPowerPlay) * log(abs(PenaltiestoPowerPlay) + 1)) %>%
            mutate(logofPoints = sign(Points) * log(abs(Points) + 1)) %>%
            mutate(sqrtofPoints = abs(Points)^0.5) %>%
            mutate(Ratio_of_SRStoPoints = SRS/Points) %>%
            mutate(AverageGoalDiff_PerGame = Goals/82) %>%
            mutate(AveragePenaltyDiff_PerGame = Penalties/82) %>%
            mutate(logofSOG = sign(SOG) * log(abs(SOG) + 1)) %>%
            mutate(sqrtofRPI = abs(RPI)^0.5) %>%
            mutate(PowerPlaytoPenaltyKill = PowerPlay/PenaltyKill) %>%
            mutate(PowerPlaytoPenaltyKilllog = sign(PowerPlaytoPenaltyKill) * log(abs(PowerPlaytoPenaltyKill) + 1)) %>%
            mutate(SCFtoGoalsAgainst = SCF/GoalsAgainst) %>%
            mutate(PointsPercentage = Points/164) %>%
            mutate(PlusMinus = Goals - GoalsAgainst) %>%
            mutate(GS_max_log = sign(Average__GS_max) * log(abs(Average__GS_max) + 1)) %>%
            mutate(CA_Per60Team_log = sign(CA_Per60Team) * log(abs(CA_Per60Team) + 1)) %>%
            mutate(Ratio_of_GoalstoGoalsAgainstlog = sign(Ratio_of_GoalstoGoalsAgainst) * log(abs(Ratio_of_GoalstoGoalsAgainst) +1)) %>%
            mutate(Ratio_of_HitstoBlockslog = sign(Ratio_of_HitstoBlocks) * log(abs(Ratio_of_HitstoBlocks) + 1)) %>%
            mutate(SCFtoGoalsAgainstlog = sign(SCFtoGoalsAgainst) * log(abs(SCFtoGoalsAgainst) + 1)) %>%
            mutate(PointsPercentagesqrt = abs(PointsPercentage)^0.5) %>%
            mutate(CorsiDifftoSOSlog = sign((CF_Per60Team - CA_Per60Team)/SOS) * log(abs((CF_Per60Team - CA_Per60Team)/SOS) + 1)) %>%
            mutate(xGDifftoSOS = (xGF.60 - xGA.60)/SOS) %>% 
            mutate(GStoSOS = Average__GS_mean / SOS) %>%
            mutate(SRStoSOS = SRS/SOS) %>%
            
            mutate_if(is.numeric, funs(ifelse(is.nan(.), 0,.))) %>%
            mutate_if(is.numeric, funs(ifelse(is.infinite(.), 0,.)))

options(repr.matrix.max.rows=600, repr.matrix.max.cols=200, scipen = 999)

kurt = allData %>% select_if(., is.numeric) %>% summarize_all(., funs(moments::kurtosis(., na.rm=TRUE))) %>%
                                         gather(., Variable, Kurtosis)
allData %>% select_if(., is.numeric) %>% summarize_all(., funs(moments::skewness(., na.rm=TRUE))) %>%
                                         gather(., Variable, Skewness) %>%
                                         left_join(., kurt, by = "Variable")
rm(kurt)
allData %>% select_if(., is.numeric) %>% summarize_all(., funs(moments::skewness(., na.rm=TRUE))) %>%
                                          gather(., Variable, Skew) %>%
                                          filter(., Skew >= 1)

In [None]:
#...................................Data Splitting....................................#
set.seed(40689)
seeds = sample(1:1000000000, 150, replace = FALSE)
ROC_rep = numeric()
LogLoss_rep = numeric()

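There is a bit of a data leak in the tuning of the hyperparameters because we aggregate and then split: the per-round means and standard deviations are computed on the whole outer training fold before the inner train/test partition is made, so the inner test rows contribute to them. However, the final metric reported on the outer fold of the nested cross validation is still unbiased, since the aggregates are computed only from the outer training fold and are then joined onto the outer test fold.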
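
The size of this kind of leak can be seen on a toy example. The sketch below is not the notebook's pipeline: the data are simulated, `round_means` is a made-up helper, and only one aggregated column is used.

```r
set.seed(1)

# Toy stand-in for an outer training fold (values simulated).
train <- data.frame(Round = factor(rep(1:2, each = 10)),
                    Points = rnorm(20, mean = 5, sd = 10))

# An 80/20 inner split: rows 9, 10, 19 and 20 are held out for tuning.
inner_idx <- c(1:8, 11:18)
inner_train <- train[inner_idx, ]

round_means <- function(d) tapply(d$Points, d$Round, mean)

leaky <- round_means(train)        # aggregate, then split (what the tuning loop does)
clean <- round_means(inner_train)  # split, then aggregate

leaky - clean   # nonzero: the held-out rows moved the aggregates used for training
```

Computing the aggregates inside the inner loop instead would remove the leak, at the cost of redoing the join for every inner partition.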

In [None]:
cluster = makeCluster(detectCores())
registerDoParallel(cluster)

results = foreach(p = 1:length(seeds), .packages = c("tidyverse", "caret", "recipes", "glmnet", "pROC", "fastknn"),
                  .combine = "c") %dopar% {

set.seed(seeds[p])
allFolds = caret::createFolds(y = allData$ResultProper, k = 3)

ROC = numeric()
LogLoss = numeric()
finalParam = list()
VarImp = list()

for(j in 1:length(allFolds)) {

  #..................................Generate Random Grid and Define Training Set..............................#
  cat("Generating Random Grid.....\n")
  set.seed(346002)
  randomGrid = data.frame(alpha = runif(146,0,1), lambda = sample(1:70,146, replace = TRUE))
  
  trainX = allData[-allFolds[[j]],]
  
  #...........................Create Inner Data Partition for Hyper Parameter Tuning.....#
  set.seed(40689)
  innerPartition = caret::createDataPartition(y=trainX$ResultProper, times = 1, p = 0.80)
  
  #..................................Aggregate...........................................#
    RoundLookup = trainX %>% group_by(Round) %>%
      summarise_at(., funs(mean(.,na.rm = TRUE), sd(., na.rm=TRUE)), .vars = c("Points", "Goals", "GoalsAgainst", "SRS", "VegasOddsOpening", "ELORating", "HitsatES", "BlocksatES", "RPI", "SCF",
                                                                              "Fenwick_Last20", "ZSR_mean", "Q4Record")) 
    
    trainX = trainX %>% left_join(., RoundLookup, by = "Round")
    
  #...................................Tune Model.........................................#
  
  ROCFinal = list()
  LogLossFinal = list()
 for (k in 1:length(innerPartition)){
    
    innerTrainX = trainX[innerPartition[[k]],]
    
    #...................................Define Recipe, Do More Engineering.................#

    innermainRecipe = recipe(ResultProper ~., data=innerTrainX) %>%
      step_zv(all_numeric()) %>%
      step_center(all_numeric()) %>%
      step_scale(all_numeric()) %>%
      step_dummy(all_predictors(), -all_numeric()) %>%
      step_zv(all_predictors()) %>%
      step_knnimpute(K = 15, all_numeric(), all_predictors()) %>%
      step_interact(terms = ~ SRS:Fenwick:ELORating) %>%
      step_interact(terms = ~ RegularSeasonWinPercentage:contains("Points")) %>%
      step_interact(terms = ~ FaceoffWinPercentage:ShotPercentage) %>%
      step_interact(terms = ~ contains("Round"):VegasOddsOpening) %>%
      step_interact(terms = ~ SDRecord:SOS) 

    innerPreProcessing = prep(innermainRecipe, training = innerTrainX)
    innerTrainX = bake(innerPreProcessing, newdata=innerTrainX)
    y = innerTrainX$ResultProper
     
    innerTestX = trainX[-innerPartition[[k]],]
    innerTestX = bake(innerPreProcessing, newdata=innerTestX)
    
    frameswithPCA = addPCA_variables(traindata = innerTrainX, testdata = innerTestX)
    innerTrainX = frameswithPCA$train
    innerTestX = frameswithPCA$test
    rm(frameswithPCA)
     
    #....................................Training Model...................................#
    ROCtemp = numeric()
    logLosstemp = numeric()

for (m in 1:nrow(randomGrid)){
        
      alpha_val = as.numeric(randomGrid[m, 1])
      s.lambda_val = as.integer(randomGrid[m,2])
    
      modelX = baggedModel(train = innerTrainX[, !names(innerTrainX) %in% c("ResultProper")], test = innerTestX, 
                            label_train = y, alpha.a = alpha_val, s_lambda.a = s.lambda_val)
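      #Fix the direction so that higher predictions mean "W"; pROC's default ("auto") can flip it and inflate the AUC.
      ROCtemp[m] = as.numeric(roc(response = innerTestX$ResultProper, predictor = modelX$Predictions,
                       levels = c("L", "W"), direction = "<")$auc)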
      logLosstemp[m] = logLoss(scores = modelX$Predictions, label = innerTestX$ResultProper)
      remove(modelX, alpha_val, s.lambda_val)
    }
    ROCFinal[[k]] = ROCtemp
    LogLossFinal[[k]] = logLosstemp
    remove(innerTrainX, innermainRecipe, innerTestX, innerPreProcessing, ROCtemp, logLosstemp, m)
    gc()
  }
  
  remove(innerPartition, k)
    
#...................................Get the Best Parameters...........................#
    
  #For ROC:
  ROCFinal = ROCFinal %>% Reduce(function(x,y) cbind(x,y), .) %>% as_tibble(.) %>% mutate(., AverageROC = rowMeans(.))
  indx = which.max(ROCFinal$AverageROC)
  alpha_final = as.numeric(randomGrid[indx, 1])
  s.lambda_final = as.integer(randomGrid[indx, 2])
  remove(indx)
  finalParam[[j]] = data.frame(alpha = alpha_final, lambda = s.lambda_final)
  
  #For LogLoss:
  #LogLossFinal = LogLossFinal %>% Reduce(function(x,y) cbind(x,y), .) %>% data.table(.)
  #LogLossFinal = LogLossFinal[, .(AverageLogLoss = rowMeans(.SD)), ]
  #indx = which.min(LogLossFinal$AverageLogLoss)
  #mstop_final = as.integer(randomGrid[indx, 1])
  #nu_final = as.numeric(randomGrid[indx, 2])
  #remove(indx)
  #finalParam[[j]] = data.table(mstop = mstop_final, nu= nu_final)
  
  
  #...................................Define Recipe, Do More Engineering................#
  
   mainRecipe = recipe(ResultProper ~., data=trainX) %>%
      step_zv(all_numeric()) %>%
      step_center(all_numeric()) %>%
      step_scale(all_numeric()) %>%
      step_dummy(all_predictors(), -all_numeric()) %>%
      step_zv(all_predictors()) %>%
      step_knnimpute(K = 15, all_numeric(), all_predictors()) %>%
      step_interact(terms = ~ SRS:Fenwick:ELORating) %>%
      step_interact(terms = ~ RegularSeasonWinPercentage:contains("Points")) %>%
      step_interact(terms = ~ FaceoffWinPercentage:ShotPercentage) %>%
      step_interact(terms = ~ contains("Round"):VegasOddsOpening) %>%
      step_interact(terms = ~ SDRecord:SOS) 
  
  #.......Join Aggregations with Test Data and Pre Process Training and Test Data..................#
  
  trainXparam = prep(mainRecipe, training = trainX)
  
  trainX = bake(trainXparam, newdata=trainX)
  y = trainX$ResultProper
                                 
  testX = allData[allFolds[[j]],] %>% left_join(., RoundLookup, by = "Round")
  testX = bake(trainXparam, newdata = testX)
                                 
  frameswithPCA = addPCA_variables(traindata = trainX, testdata = testX)
  trainX = frameswithPCA$train
  testX = frameswithPCA$test
  
  rm(frameswithPCA)

  modelX = baggedModel(train = trainX[, !names(trainX) %in% c("ResultProper")], test=testX, label_train = y, 
                       alpha.a = alpha_final, s_lambda.a = s.lambda_final)
                                 
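  #direction = "<" means higher predictions indicate "W", so a worse than chance fold is not flipped above 0.5.
  ROC[j] = as.numeric(roc(response = testX$ResultProper, predictor = modelX$Predictions, levels = c("L", "W"), direction = "<")$auc) 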
  LogLoss[j] = logLoss(scores = modelX$Predictions, label = testX$ResultProper) 
  VarImp[[j]] = modelX$VariableImportance
  remove(modelX, alpha_final, s.lambda_final, trainXparam, mainRecipe, trainX, y, ROCFinal, LogLossFinal, 
         RoundLookup)
  gc()
                            
}                                 

ROC_rep = mean(ROC)
LogLoss_rep = mean(LogLoss)

rm(ROC, LogLoss)

                                 
#............................Extract Variable Importance for Outer Fold..............................#
#Variable Importance...if we use ncomp as a tuning variable this no longer has any meaning. Either select
#ncomp using the embedded feature selection or don't run this part of the code.
                                 
finalVarImp = VarImp %>% Reduce(function(x,y) cbind(x, y$meanImportance),.)
finalVarImp = finalVarImp %>% set_names(., c("Variable", seq(1, length(allFolds),1))) %>%
                              as_tibble(.) %>%
                              mutate(AverageImp = rowMeans(select(., -Variable))) %>%
                              select(., c(Variable, AverageImp)) %>%
                              arrange(desc(AverageImp)) %>%
                              mutate(RelativeImportance = round(AverageImp/sum(AverageImp), 5))                           
remove(VarImp)

VarImp_rep = finalVarImp
remove(finalVarImp)

write.table(ROC_rep, file = paste("Iteration_", p, ".txt", sep=""), row.names = FALSE)
list(ROC_rep, LogLoss_rep, VarImp_rep)  
                      
}
stopCluster(cluster)
ROC_rep = results[seq(1,length(results), 3)] %>% unlist(.)
LogLoss_rep = results[seq(2, length(results), 3)] %>% unlist(.)
VarImp = results[seq(3, length(results),3)]
 
writeLines(paste("Final AUROC:", mean(ROC_rep), sep = " "))
writeLines(paste("Final Log Loss:", mean(LogLoss_rep), sep = " "))

writeLines(paste("A 95% CI for the AUROC is: ", "[", mean(ROC_rep) - 1.96 * (sd(ROC_rep))/(length(ROC_rep)^0.5), ", ",
                mean(ROC_rep) + 1.96 * (sd(ROC_rep))/(length(ROC_rep)^0.5), "]", sep = ""))
writeLines(paste("A 95% CI for the Log Loss is: ", "[", mean(LogLoss_rep) - 1.96 * (sd(LogLoss_rep))/(length(LogLoss_rep)^0.5), ", ",
                mean(LogLoss_rep) + 1.96 * (sd(LogLoss_rep))/(length(LogLoss_rep)^0.5), "]", sep = ""))
rm(cluster)

In [None]:
#Extract Variable Importances
#all.x belongs to merge(), not left_join(); a left join already keeps every row of the left table.
VarImp_final = VarImp %>% Reduce(function(...) left_join(..., by="Variable"), .)
indx = colnames(VarImp_final[, grepl(paste(c("Variable", "RelativeImportance"), collapse = "|"), colnames(VarImp_final))])
VarImp_final = VarImp_final[, names(VarImp_final) %in% indx]
rm(indx)
                                 
options(repr.matrix.max.rows=600, repr.matrix.max.cols=200, scipen = 999)
Variable = as.data.frame(VarImp_final$Variable) %>% set_names("Variable")
VarImp_final = VarImp_final[,2:ncol(VarImp_final)] %>% mutate(AverageRelativeImportance = rowMeans(.)) %>%
                                                       select(., c(AverageRelativeImportance)) %>%
                                                       bind_cols(Variable,.) %>%
                                                       arrange(desc(AverageRelativeImportance)) 
VarImp_final