In [1]:
source("../Data Generator.r")
library(randomForest)

randomForest 4.6-14
Type rfNews() to see new features/changes/bug fixes.


Note: the results below are for the setting a1 = 5, a2 = -5.

We use RF only to select features from V1 to V400. We therefore count how many times each of V1 to V400 is selected across the n_run runs. If RF does not select "treatment", "time" or "time2", that is not a concern, because we will eventually use these three variables together with the features selected from V1 to V400.
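
The counting idea described above can be sketched on toy data. This is only an illustration in base R: the importance scores are simulated stand-ins rather than output from randomForest, and `n_runs`, `n_feat`, `k` and the assumption that V1 and V2 carry signal are hypothetical choices made for the example.

```r
# Toy illustration (base R only): tally how often each feature ranks in the top k.
# The scores are simulated stand-ins, NOT importances from randomForest.
set.seed(1)
n_runs <- 20   # hypothetical number of repeated fits
n_feat <- 10   # hypothetical number of candidate features
k      <- 3    # hypothetical size of the selected set
feat   <- paste0("V", 1:n_feat)

# one row per run, one column per feature; 1 = selected in that run
selected <- matrix(0, n_runs, n_feat, dimnames = list(NULL, feat))
for (r in 1:n_runs) {
  # assumption for the toy: V1 and V2 carry signal, so their scores are shifted up
  score <- rnorm(n_feat) + c(3, 3, rep(0, n_feat - 2))
  names(score) <- feat
  top <- names(sort(score, decreasing = TRUE))[1:k]  # names of the k highest scores
  selected[r, top] <- 1
}

# share of runs in which each feature was selected
selection_freq <- colMeans(selected)
print(round(selection_freq, 2))
```

Working with feature names rather than positions keeps the tally correct even when the columns are not ordered the same way as the importance scores.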

In [29]:
### training and test set ###
set.seed(100)
n = 100 # change n here each time
p = 400
imp_mod = c(1,4)
var_noise = 1
data = sim_time(n=n,p=p,imp_mod=imp_mod, var_noise=var_noise)
data$time2 = (data$time)^2

# test set
set.seed(101)
n_test = 100
data_test = sim_time(n=n_test,p=p,imp_mod=imp_mod, var_noise=var_noise)
data_test$time2 = (data_test$time)^2

In [3]:
n_run = 50 # the number of times RF will run on the data set
n_top = 12 # the top n_top variables will be selected
# create empty data frame to save simulation results in
result_rf = data.frame(matrix(0,n_run+1,p+4)) # the last row is for the average
names(result_rf) = c(paste("V",1:p,sep=""),"time","time2","treatment","error")

In [30]:
mtry=300 # change mtry for each n

system.time({
for(Repeat in 1:n_run){
    set.seed(Repeat+32) # change seed each loop
    
    var = c(paste("V",1:p,sep=""),"time","time2","treatment")
    Formula = as.formula(paste("y~",paste(var,collapse = "+")))
    rf <- randomForest(formula = Formula, data = data, mtry=mtry) 
    
    # error on the test set
    preds <- predict(rf, newdata=data_test)
    error = mean((data_test$y-preds)^2)


    # this is a quicker way to get the ranking (not just the selection) of variables
    # rf$importance is a one-column matrix whose row names are the variable names
    importance_order <- sort(rf$importance[,1], decreasing = TRUE) # named vector, sorted by importance
    top_variables = names(importance_order)[1:n_top] # names of the top n_top variables

    # If a variable was selected as important, indicate with 1 (otherwise 0).
    # Compare by name: matching a name against integer positions is always FALSE.
    for (i in 1:p){
      result_rf[Repeat,i] <- as.numeric(paste("V",i,sep="") %in% top_variables)
    }
    result_rf[Repeat,p+1] <- as.numeric("time" %in% top_variables)
    result_rf[Repeat,p+2] <- as.numeric("time2" %in% top_variables)
    result_rf[Repeat,p+3] <- as.numeric("treatment" %in% top_variables)
    result_rf[Repeat,p+4] <- error

    cat(Repeat,"\n") # progress indicator
    flush.console() # flush after printing so the progress shows up immediately
}
})
result_rf[n_run+1,] = colMeans(result_rf[1:n_run,])
name = paste("rf_n",n,".csv",sep="")
write.csv(result_rf,file = name)

1 


   user  system elapsed 
 210.39    0.37  214.75 

In [32]:
# result_rf

In [18]:
# sort(result_rf[n_run+1,][1:(p+3)],index.return=TRUE,decreasing = TRUE)[1:20]

In [12]:
# plot(1:p,result_rf[n_run+1,][1:p])

In [13]:
# imp_var = c(1,2,3,301,302,303)
# plot(1:6,result_rf[n_run+1,][imp_var])
# axis(1, at=1:6, labels=imp_var)
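
As noted at the top, the features selected from V1 to V400 are eventually used together with "time", "time2" and "treatment". A minimal sketch of that step follows. It assumes a selection-frequency cut-off is used to decide which features to keep; the `threshold` value and the numbers in `avg_freq` are made up for illustration and are not results from this notebook.

```r
# Sketch only: toy stand-in for the average row of result_rf (selection frequencies).
avg_freq <- c(V1 = 0.96, V2 = 0.10, V3 = 0.02, V4 = 0.88, V5 = 0.00)  # made-up values
threshold <- 0.5  # hypothetical cut-off, not taken from the notebook

# keep the features selected in at least `threshold` of the runs
keep <- names(avg_freq)[avg_freq >= threshold]

# always add the design variables, whether or not RF ranked them highly
final_vars <- c(keep, "time", "time2", "treatment")
final_formula <- as.formula(paste("y ~", paste(final_vars, collapse = " + ")))
print(final_formula)
```

With the real results, `avg_freq` would be replaced by the first p entries of the last row of `result_rf`.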