In [21]:
source("SimData.r")
library("glmertree")
library("WGCNA")
library("pre")

Problems
* When fixed_regress = NULL, we could let the user decide whether to use the PCs (module eigengenes) as regressors. Without PCs, the method reduces to WGCNA plus a regular RE-EM tree.
* Right now only a random intercept is supported.
* Users may want to tune other WGCNA parameters (in addition to power). How can we do this in an elegant way? (Not urgent, since the current parameters cluster correctly.)
* Right now the same alpha is used at every step of the algorithm.
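
One possible answer to the WGCNA tuning question, sketched under assumptions: keep the currently hard-coded settings in a list of defaults, let the caller pass overrides in a list, merge the two with `modifyList`, and forward them with `do.call`. The names `wgcna_defaults`, `wgcna_args`, `cluster_features` and `fake_blockwise` are hypothetical; `fake_blockwise` only stands in for `blockwiseModules` so that the snippet runs without WGCNA installed.

```r
# Defaults mirroring the values currently hard-coded in Longtree_time / Longtree_PC.
wgcna_defaults = list(power = 6, TOMType = "unsigned", minModuleSize = 30,
                      reassignThreshold = 0, mergeCutHeight = 0.25,
                      numericLabels = FALSE, pamRespectsDendro = FALSE,
                      verbose = 0)

# Stand-in for WGCNA::blockwiseModules (assumption: used only so the sketch
# runs without WGCNA); it simply echoes the named arguments it received.
fake_blockwise = function(datExpr, ...) list(...)

# Hypothetical helper: user-supplied overrides win over the defaults.
cluster_features = function(datExpr, wgcna_args = list(), fit = fake_blockwise){
    args = modifyList(wgcna_defaults, wgcna_args)
    do.call(fit, c(list(datExpr), args))
}

used = cluster_features(data.frame(V1 = rnorm(5)),
                        wgcna_args = list(minModuleSize = 20))
used$minModuleSize   # 20 (overridden)
used$power           # 6  (default kept)
```

With this shape, `Longtree` would need only one extra argument, and passing `fit = blockwiseModules` would swap in the real clustering call.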

Output: a fitted lmertree object from the glmertree package (the trained tree)

Parameters:
* data: training data
* fixed_regress: a char vector of regressors that are always used, such as time and time^2; if fixed_regress = NULL, the PCs (module eigengenes) are used as regressors at the screening step
* fixed_split: a char vector of features that are always used in splitting
* var_select: a char vector of candidate features to select from. These features are clustered by WGCNA, and the chosen ones are used in both regression and splitting
* power: the soft-thresholding power passed to WGCNA. Default is 6.
* cluster: the name of the grouping variable for the random effect (e.g. "patient")
* alpha: the alpha passed to every lmertree fit. Default is 0.2.
* Fuzzy: if TRUE, screen as in Fuzzy Forest: screen within every module (including grey), then select among all screened features. If FALSE, first screen within each non-grey module and select the final non-grey features among the screened ones; then use these final non-grey features as regressors (plus fixed_regress) and the grey features as split_var to select grey features. Finally, use the final non-grey features together with the selected grey features as splitting and regression variables for the final prediction. Fuzzy = FALSE is useful when there are many non-grey features and you want to protect the grey features. Default is TRUE.
* maxdepth_factor_screen: when selecting features from one module, the maxdepth of the glmertree is set to the ceiling of maxdepth_factor_screen*(#features in that module, plus any fixed_split variables). Default is 0.04.
* maxdepth_factor_select: given the screened features (from each module; if Fuzzy = FALSE, only from the non-grey modules), we select again among them. The maxdepth of that glmertree is set to the ceiling of maxdepth_factor_select*(#screened features, plus any fixed_split variables). Default is 0.5.
* For the prediction tree (final tree), maxdepth is set to the length of split_var (fixed + chosen features)
* The most important parameters are alpha and maxdepth.
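
To make the depth rules concrete, here is a small sketch of the arithmetic. `stage_maxdepth` is a hypothetical helper, not part of the code below; the counts are illustrative and include one fixed_split variable, as the implementation does.

```r
# Depth rule used at each stage: ceiling(factor * number of split candidates).
# The code counts fixed_split variables among the candidates.
stage_maxdepth = function(n_candidates, factor) ceiling(factor * n_candidates)

# Screening a module of 100 features plus 1 fixed_split variable, factor 0.04
stage_maxdepth(100 + 1, 0.04)   # 5

# Selecting among 12 screened features plus 1 fixed_split variable, factor 0.5
stage_maxdepth(12 + 1, 0.5)     # 7

# Final tree: depth equals the number of split variables, e.g. 6 chosen + 1 fixed
length(c(paste("V", 1:6, sep = ""), "treatment"))   # 7
```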

In [12]:
Longtree = function(data,fixed_regress=NULL,fixed_split=NULL,var_select=NULL,
                    power=6,cluster,alpha=0.2,maxdepth_factor_screen=0.04,
                    maxdepth_factor_select=0.5,Fuzzy=TRUE){
    ### if there are no features to select, just use fixed_regress and fixed_split
    if(length(var_select)==0){
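        # fixed_split is then the only source of split variables; without it
        # the formula below would end in an empty split part and fail to parse
        if (length(fixed_split)==0){
            stop("no features to split on")
        }
        if (length(fixed_regress)==0){
            fixed_regress = "1"
        }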
        maxdepth = length(fixed_split)
        Formula = as.formula(paste("y~",paste(fixed_regress,collapse = "+"),
                                       "|",cluster,"|",
                                     paste(fixed_split,collapse = "+")))
        mytree = lmertree(Formula, data = data,alpha = alpha,maxdepth=maxdepth)
        return (mytree)
    } ###
    # Now var_select is not empty
    # If don't specify fixed_regress: use PC as regressors at screening step
    if (length(fixed_regress)==0){
        cat("Use Longtree_PC\n")
        return(Longtree_PC(data=data,fixed_split=fixed_split,
                    var_select=var_select,
                    power=power,cluster=cluster,alpha=alpha,
                    maxdepth_factor_screen=maxdepth_factor_screen, 
                    maxdepth_factor_select=maxdepth_factor_select,Fuzzy=Fuzzy))
    }###
    # Now we have non-empty var_select,fixed_regress
    cat("Use Longtree_time\n")
    return(Longtree_time(data=data,fixed_regress=fixed_regress,
                        fixed_split=fixed_split, var_select=var_select,
                    power=power,cluster=cluster,alpha=alpha,
                    maxdepth_factor_screen=maxdepth_factor_screen, 
                    maxdepth_factor_select=maxdepth_factor_select,Fuzzy=Fuzzy))
                        
}

# Longtree_time: used when var_select and fixed_regress are non-empty
# Longtree is equivalent to this Longtree_time in this case
Longtree_time = function(data,fixed_regress,fixed_split,var_select,power,cluster,
                         alpha,maxdepth_factor_screen,maxdepth_factor_select,Fuzzy){
    # Cluster var_select
    data_WGCNA = data[var_select]
    # Must set numericLabels = FALSE so that it uses actual colors like "grey"
    net = blockwiseModules(data_WGCNA, power = power,TOMType = "unsigned", 
                           minModuleSize = 30,reassignThreshold = 0, 
                           mergeCutHeight = 0.25,numericLabels = FALSE, 
                           pamRespectsDendro = FALSE,verbose = 0)
    # the correspondence between feature names and colors
    colors = net$colors # a named character vector (names are features, e.g. V1)
    module_names = unique(colors) # all color names
    # "dictionary" with keys = color names, values = names of features of that color
    module_dic = list() 
    for (i in 1:length(module_names)){
        module_dic[[module_names[i]]] = names(colors[colors==module_names[i]])
    }
    
    imp_var = list() # used to store the names of important features
    
    if(Fuzzy==TRUE){
        # Do the selection step just like Fuzzy Forest:
        # for each module (including grey), use them as split_var and select
        # finally, use all selected ones as split_var and select
        
        for (name in module_names){
        # in the formula, add fixed_split as split_var, also include the module features
        split_var = c(module_dic[[name]],fixed_split)
        maxdepth = ceiling(maxdepth_factor_screen*length(split_var))
        # use fixed_regress as regressor
        regress_var = fixed_regress

        # Formula for lmtree
        Formula = as.formula(paste("y~",paste(regress_var,collapse = "+"),
                               "|",cluster,"|",
                    paste(split_var,collapse = "+")))

        # fit the tree
        mytree = lmertree(Formula, data = data,alpha = alpha,maxdepth=maxdepth) 

        #extract important features
        imp_var[[length(imp_var)+1]] = get_split_names(mytree$tree,data)
        }
        
        # the variables selected from all the modules
        final_var = imp_var[[1]]
        if (length(imp_var)>1){
            # There are at least 2 modules
            for (i in 2:length(imp_var)){
            final_var = c(final_var,imp_var[[i]])
         }
            cat("after screening within modules",final_var,"\n")
            
            # the final selection among all the chosen features 
            regress_var = fixed_regress
            split_var = c(final_var,fixed_split)
            maxdepth = ceiling(maxdepth_factor_select*length(split_var))
            Formula = as.formula(paste("y~",paste(regress_var,collapse = "+"),
                                       "|",cluster,"|",
                                     paste(split_var,collapse = "+")))
            mytree = lmertree(Formula, data = data,alpha = alpha,maxdepth=maxdepth) 
            final_var = get_split_names(mytree$tree,data)
            cat("final features",final_var,"\n")
        }else{
            # only grey module
            cat("There is only one module, final features",final_var,"\n")
        }
        # use the final features as split&regression variables
        split_var = c(final_var,fixed_split)
        maxdepth = length(split_var)
        regress_var = c(fixed_regress,final_var)
        Formula = as.formula(paste("y~",paste(regress_var,collapse = "+"),
                                   "|",cluster,"|",
                                 paste(split_var,collapse = "+")))
        mytree = lmertree(Formula, data = data,alpha=alpha,maxdepth=maxdepth) 
        return(mytree)           
    }
    if(Fuzzy==FALSE){
        # first do the screening and selecting in non-grey modules
        # Then use those non-grey estimated true features as regressors
        # and grey features as split_var, choose grey features and keep them
        for (name in module_names){
        if(name=="grey"){
            next
        }
        split_var = c(module_dic[[name]],fixed_split)
        maxdepth = ceiling(maxdepth_factor_screen*length(split_var))
        regress_var = fixed_regress

        # Formula for lmtree
        Formula = as.formula(paste("y~",paste(regress_var,collapse = "+"),
                               "|",cluster,"|",
                    paste(split_var,collapse = "+")))

        # fit the tree
        mytree = lmertree(Formula, data = data,alpha = alpha,maxdepth=maxdepth) 

        #extract important features
        imp_var[[length(imp_var)+1]] = get_split_names(mytree$tree,data)
        }# Now imp_var contains important features from modules that are not grey
        if(length(imp_var)==0){
            # only grey module, no other modules
            split_var = c(module_dic[["grey"]],fixed_split)
            maxdepth = ceiling(maxdepth_factor_screen*length(split_var))
            regress_var = fixed_regress
            Formula = as.formula(paste("y~",paste(regress_var,collapse = "+"),
                               "|",cluster,"|",
                    paste(split_var,collapse = "+")))
            mytree = lmertree(Formula, data = data,alpha = alpha,maxdepth=maxdepth) 
            final_var = get_split_names(mytree$tree,data)
            cat("There is only one module which is grey, final features",final_var,"\n")
        }else{
            # at least one non-grey module
            final_var = imp_var[[1]]
            # if only one non-grey module: final_var is the chosen non-grey features
            # if at least two non-grey modules:
            if (length(imp_var)>1){
                # There are at least 2 modules
                for (i in 2:length(imp_var)){
                final_var = c(final_var,imp_var[[i]])
                }
            
            cat("After screening within non-grey modules",final_var,"\n")
            # select from selected non-grey features
            regress_var = fixed_regress
            split_var = c(final_var,fixed_split)
            maxdepth = ceiling(maxdepth_factor_select*length(split_var))
            Formula = as.formula(paste("y~",paste(regress_var,collapse = "+"),
                                       "|",cluster,"|",
                                     paste(split_var,collapse = "+")))
            mytree = lmertree(Formula, data = data,alpha = alpha,maxdepth=maxdepth) 
            final_var = get_split_names(mytree$tree,data)
            }
            cat("The chosen non-grey features are",final_var,"\n")
            
            # use final_var (chosen non-grey features) to select grey features
            regress_var = c(fixed_regress,final_var)
            split_var = c(module_dic[["grey"]],fixed_split)
            maxdepth = ceiling(maxdepth_factor_screen*length(split_var))
            Formula = as.formula(paste("y~",paste(regress_var,collapse = "+"),
                                       "|",cluster,"|",
                                     paste(split_var,collapse = "+")))
            mytree = lmertree(Formula, data = data,alpha = alpha,maxdepth=maxdepth) 
            grey_var = get_split_names(mytree$tree,data)
            cat("The chosen grey features are",grey_var,"\n")
            # use final_var and grey_var to fit the final model tree
            final_var = c(final_var,grey_var)    
            cat("final features",final_var,"\n")  
        }
        # use the final features as split&regression variables
        split_var = c(final_var,fixed_split)
        maxdepth = length(split_var)
        regress_var = c(fixed_regress,final_var)
        Formula = as.formula(paste("y~",paste(regress_var,collapse = "+"),
                                   "|",cluster,"|",
                                 paste(split_var,collapse = "+")))
        mytree = lmertree(Formula, data = data,alpha=alpha,maxdepth=maxdepth) 
        return(mytree)
    }
    
}

# Methods for extracting names of splitting features used in a tree
# tree: a tree object; data: the train or test set
get_split_names = function(tree,data){
    # paths: the rules (paths) extracted from the tree
    paths <- pre:::list.rules(tree, removecomplements = FALSE)
    vnames = names(data)
    # a variable counts as used if "<var> <=" or "<var> >" occurs in some path;
    # only numeric splits (<=, >) are detected by this pattern
    is_used = vapply(vnames, function(var){
        pattern = paste(paste(var,"<="),"|",paste(var,">"),sep="")
        length(grep(pattern, paths)) > 0
    }, logical(1))
    return (vnames[is_used])
}

# Longtree_PC: used when fixed_regress is NULL; uses PCs as regressors for non-grey modules
# Longtree is equivalent to this Longtree_PC in this case
# (power has no default here: a default of power=power would refer to itself)
Longtree_PC = function(data,fixed_split, var_select, power,cluster,
                    alpha, maxdepth_factor_screen,maxdepth_factor_select,Fuzzy){
    # Cluster var_select
    data_WGCNA = data[var_select]
    # Must set numericLabels = FALSE so that it uses actual colors like "grey"
    net = blockwiseModules(data_WGCNA, power = power,TOMType = "unsigned", 
                           minModuleSize = 30,reassignThreshold = 0, 
                           mergeCutHeight = 0.25,numericLabels = FALSE, 
                           pamRespectsDendro = FALSE,verbose = 0)
    # the correspondence between feature names and colors
    colors = net$colors # a named character vector (names are features, e.g. V1)
    module_names = unique(colors) # all color names
    # "dictionary" with keys = color names, values = names of features of that color
    module_dic = list() 
    for (i in 1:length(module_names)){
        module_dic[[module_names[i]]] = names(colors[colors==module_names[i]])
    }
    
    # extract eigengenes and rename the column
    # The eigengene (1st principal component) is L2 normalized
    eigengene = net$MEs
    # eigengene
    # add eigengenes to training data (for the grey group, the eigengene is meaningless)
    for (name in module_names){
        if (name == "grey"){
            next
        }
        eigen_name = paste("ME",name,sep="")
        data[[eigen_name]] = eigengene[[eigen_name]]
    }
    imp_var = list() # used to store the names of important features
    
    if (Fuzzy==TRUE){
        # first screen then select, just like Fuzzy Forest
        # For each module that is not grey, use model tree as following:
        # use its eigengene as regression variables and all features as splitting ones
        # For grey module, use regular REEM (set regressor = "1")
        # Then select by using all chosen features as split_var and regress="1"
        # Finally, use all the selected features for splitting and regression variables
       
        for (name in module_names){
            split_var = c(module_dic[[name]],fixed_split)
            maxdepth = ceiling(maxdepth_factor_screen*length(split_var))
            # use eigengene as regressor
            if (name == "grey"){
                regress_var = "1"
            }else{
                eigen_name = paste("ME",name,sep="")
                regress_var = eigen_name
            }
            # Formula for lmtree: use PC as regressors
            Formula = as.formula(paste("y~",paste(regress_var,collapse = "+"),
                                   "|",cluster,"|",
                                 paste(split_var,collapse = "+")))

            # fit the tree
            mytree = lmertree(Formula, data=data,alpha=alpha,maxdepth=maxdepth) 
            #extract important features
            imp_var[[length(imp_var)+1]] = get_split_names(mytree$tree,data)
        }        
        # the variables selected from all the modules
        final_var = imp_var[[1]]      
        if (length(imp_var)>1){
            for (i in 2:length(imp_var)){
            final_var = c(final_var,imp_var[[i]])
         }
            cat("After screening within modules ",final_var,"\n")
            # select features again
            # use all selected features as split_var with no regressors
            split_var = c(final_var,fixed_split)
            maxdepth = ceiling(maxdepth_factor_select*length(split_var))
            Formula = as.formula(paste("y~","1",
                                       "|",cluster,"|",
                                     paste(split_var,collapse = "+")))
            mytree = lmertree(Formula, data = data,alpha=alpha,maxdepth=maxdepth)
            final_var = get_split_names(mytree$tree,data)
            cat("Final features ",final_var,"\n")
            
        }else{
            # length(imp_var) now is 1, only one module
            cat("There is only one module ",final_var,"\n")
        }
        
        # use the final features as split&regression variables
        split_var = c(final_var,fixed_split)
        maxdepth = length(split_var)
        Formula = as.formula(paste("y~",paste(final_var,collapse = "+"),
                                   "|",cluster,"|",
                                 paste(split_var,collapse = "+")))
        mytree = lmertree(Formula, data = data,alpha=alpha,maxdepth=maxdepth)
        return (mytree)
    }
    
    if (Fuzzy== FALSE){
        # select features from non-grey modules and use them as regressors 
        # to select features from grey module. Then use all of them as split and regressor
        
        # for non-grey groups
        for (name in module_names){
            if (name == "grey"){
                next
            }
            split_var = c(module_dic[[name]],fixed_split)
            maxdepth = ceiling(maxdepth_factor_screen*length(split_var))
            # use eigengene as regressor
            eigen_name = paste("ME",name,sep="")
            regress_var = eigen_name
            # Formula for lmtree: use PC as regressors
            Formula = as.formula(paste("y~",paste(regress_var,collapse = "+"),
                                   "|",cluster,"|",
                                 paste(split_var,collapse = "+")))

            # fit the tree
            mytree = lmertree(Formula, data=data,alpha=alpha,maxdepth=maxdepth) 
            #extract important features
            imp_var[[length(imp_var)+1]] = get_split_names(mytree$tree,data)
        }
        # Now imp_var contains all the non-grey screened features
        if(length(imp_var)==0){
            # there is only one module which is grey
            # just select from the grey module
            split_var = c(module_dic[["grey"]],fixed_split)
            maxdepth = ceiling(maxdepth_factor_screen*length(split_var))
            regress_var = "1"
            Formula = as.formula(paste("y~",paste(regress_var,collapse = "+"),
                                   "|",cluster,"|",
                                 paste(split_var,collapse = "+")))
            mytree = lmertree(Formula, data=data,alpha=alpha,maxdepth=maxdepth) 
            final_var = get_split_names(mytree$tree,data)
            cat("There is only grey module ",final_var,"\n")
        }
        if(length(imp_var)==1){
            # only one non-grey module
            final_var = imp_var[[1]]
            cat("There is only one non-grey module",final_var,"\n")
            # use final_var as regressors
            split_var = c(module_dic[["grey"]],fixed_split)
            maxdepth = ceiling(maxdepth_factor_screen*length(split_var))
            regress_var = final_var
            Formula = as.formula(paste("y~",paste(regress_var,collapse = "+"),
                                   "|",cluster,"|",
                                 paste(split_var,collapse = "+")))
            mytree = lmertree(Formula, data=data,alpha=alpha,maxdepth=maxdepth) 
            grey_var = get_split_names(mytree$tree,data)
            final_var = c(final_var,grey_var)
            cat("The final features ",final_var,"\n")
        }
        if(length(imp_var)>1){
            # at least two non-grey modules, screen among non-grey modules
            final_var = imp_var[[1]]
            for (i in 2:length(imp_var)){
                final_var = c(final_var,imp_var[[i]])
                }
            cat("After screening from non-grey modules ",final_var,"\n")
            # now final_var contains all the non-grey screened features
            split_var = c(final_var,fixed_split)  
            maxdepth = ceiling(maxdepth_factor_select*length(split_var))
            regress_var = "1"
            Formula = as.formula(paste("y~",paste(regress_var,collapse = "+"),
                                   "|",cluster,"|",
                                 paste(split_var,collapse = "+")))
            mytree = lmertree(Formula, data=data,alpha=alpha,maxdepth=maxdepth)
            final_var = get_split_names(mytree$tree,data)
            # Now final_var contains final non-grey features
            cat("Final non-grey features ",final_var,"\n")
            # use final_var as regressors and select features from grey features
            split_var = c(module_dic[["grey"]],fixed_split)
            maxdepth = ceiling(maxdepth_factor_screen*length(split_var))
            regress_var = final_var
            Formula = as.formula(paste("y~",paste(regress_var,collapse = "+"),
                                   "|",cluster,"|",
                                 paste(split_var,collapse = "+")))
            mytree = lmertree(Formula, data=data,alpha=alpha,maxdepth=maxdepth) 
            grey_var = get_split_names(mytree$tree,data)
            cat("Final grey features ",grey_var,"\n")
            final_var = c(final_var,grey_var)
            cat("The final features ",final_var,"\n")
        }
        # use the final features as split&regression variables
        split_var = c(final_var,fixed_split)
        maxdepth = length(split_var)
        Formula = as.formula(paste("y~",paste(final_var,collapse = "+"),
                                   "|",cluster,"|",
                                 paste(split_var,collapse = "+")))
        mytree = lmertree(Formula, data = data,alpha=alpha,maxdepth=maxdepth)
        return (mytree)
    }
    
    
}
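
Every lmertree call above is driven by a three-part formula of the form `y ~ regressors | cluster | split variables`. The following sketch isolates that construction. `build_formula` is a hypothetical helper that does not appear in the functions above; it also guards against an empty split part, which would otherwise produce a string that cannot be parsed.

```r
# Build the three-part formula  response ~ regressors | cluster | split variables.
# build_formula is a hypothetical helper, not part of the notebook's code.
build_formula = function(regress_var, cluster, split_var, response = "y"){
    if (length(split_var) == 0) stop("need at least one split variable")
    if (length(regress_var) == 0) regress_var = "1"   # intercept-only node models
    as.formula(paste(response, "~", paste(regress_var, collapse = "+"),
                     "|", cluster, "|", paste(split_var, collapse = "+")))
}

f = build_formula(c("time", "time2"), "patient", c("V1", "V2", "treatment"))
print(f)
all.vars(f)   # variables in order: response, regressors, cluster, split variables
```

Factoring the construction out like this would remove the repeated `as.formula(paste(...))` blocks, at the cost of one more function to maintain.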

# Longtree_PC Sample Run

In [13]:
n <- 300 # number of patients
T <-  5 # number of observations per patients
set.seed(100)

data <- as.data.frame(sim_3(n, T)) 
colnames(data)[401] <- "y"
for (i in 1:n){
    data$patient[(1+(i-1)*T):(i*T)] = rep(i,T)
}
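
The loop above assigns patient IDs block by block, assuming the rows are ordered patient by patient with T rows each. Under that same assumption, a vectorized form gives the same IDs; the sketch below checks this on a small plain vector rather than on the simulated data.

```r
n = 4   # small values just for illustration
T = 3   # note: naming a variable T masks the shorthand for TRUE

# Loop form used in the notebook, written against a plain vector
id_loop = integer(n * T)
for (i in 1:n){
    id_loop[(1 + (i - 1) * T):(i * T)] = rep(i, T)
}

# Vectorized equivalent
id_vec = rep(seq_len(n), each = T)

identical(id_loop, id_vec)   # TRUE
```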

In [14]:
n_test <- 100 
T <-  5 
set.seed(101)
data_test <- as.data.frame(sim_3(n_test, T)) 
colnames(data_test)[401] <- "y"
for (i in 1:n_test){
    data_test$patient[(1+(i-1)*T):(i*T)] = rep(i,T)
}

In [15]:
cluster = "patient"
fixed_regress = NULL
fixed_split = NULL
var_select = paste("V",1:400,sep="")
# fixed_split = "V303"
# var_select = paste("V",setdiff(1:400,303),sep="")

In [20]:
system.time({
    mytree = Longtree(data,fixed_regress=fixed_regress,fixed_split=fixed_split,
                  var_select=var_select,cluster=cluster,Fuzzy=FALSE)
})
mean((predict(mytree,newdata=data_test,re.form=NA)-data_test$y)**2)

Use Longtree_PC
After screening from non-grey modules  V1 V2 V3 V60 V124 V144 V169 V212 V264 
Final non-grey features  V1 V2 V3 
Final grey features  V301 V302 V306 
The final features  V1 V2 V3 V301 V302 V306 


   user  system elapsed 
  60.65    3.05   79.74 
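
Longtree_PC regresses on each non-grey module's eigengene during screening, described in the code as the L2-normalized first principal component. As a rough illustration on toy data, assuming the eigengene can be approximated by the first left singular vector of the column-scaled module data (WGCNA's own computation may differ in details such as sign alignment and missing-data handling):

```r
set.seed(1)
# Toy "module": 50 observations of 5 correlated features
z = rnorm(50)
X = sapply(1:5, function(j) z + rnorm(50, sd = 0.5))
colnames(X) = paste("V", 1:5, sep = "")

# Center and scale each feature, then take the first left singular vector
Xs = scale(X)
me = svd(Xs)$u[, 1]

sum(me^2)                                       # unit L2 norm, up to rounding
abs(cor(me, prcomp(X, scale. = TRUE)$x[, 1]))   # proportional to the PC1 scores
```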

# Longtree_time: Sample Run

In [4]:
# training data
n <- 300 # number of patients
T <-  5 # number of observations per patients

set.seed(100)

data <- sim_quad(n,T,a1=1,a2=-1)
# add time_squared
data$time2 = (data$time)^2

# testing data
n_test <- 100 # number of patients
T <-  5 # number of observations per patients
set.seed(101)
# NOTE: this call uses n (300 patients), not n_test, so n_test above is unused here
data_test <- sim_quad(n,T,a1=1,a2=-1)
# data_test <- sim_quad(n_test, T)
data_test$time2 = (data_test$time)^2

In [5]:
fixed_regress = c("time","time2")
fixed_split = c("treatment")
cluster = "patient"
var_select = paste("V",1:400,sep="")

In [6]:
# Fuzzy=TRUE 
# alpha = 0.2, maxdepth_factor_select = 0.5 (all default)
system.time({
    mytree = Longtree(data,fixed_regress=fixed_regress,fixed_split=fixed_split,
                  var_select=var_select,cluster=cluster,Fuzzy=TRUE)
})
mean((predict(mytree,newdata=data_test,re.form=NA)-data_test$y)**2)
coef(mytree)

Use Longtree_time

boundary (singular) fit: see ?isSingular
boundary (singular) fit: see ?isSingular
boundary (singular) fit: see ?isSingular
boundary (singular) fit: see ?isSingular


after screening within modules V1 V2 V3 V45 V82 V106 V154 V172 V301 V302 V303 V337 
final features V1 V2 V3 V301 V302 V303

   user  system elapsed 
  36.06    2.16   45.45 

node,(Intercept),time,time2,V1,V2,V3,V301,V302,V303
5,4.983755,-6.301702,1.022215,5.716317,-2.192455,-2.544785,4.695668,-2.8369292,-1.2793611
6,7.755519,-4.363281,0.7542165,3.612934,-2.240745,-1.407468,5.350652,5.9533509,-2.3847875
8,-14.574241,7.139503,-1.1745854,5.25427,-1.877526,-3.1589,4.711909,0.1069969,-3.5295269
9,-8.837379,3.451156,-0.5560499,5.055151,-3.992844,-1.49885,5.373518,2.2775681,0.610018
11,8.107291,-6.571144,1.0143115,5.296218,-4.037329,-1.980469,4.234784,1.8434044,6.8752239
13,-15.224294,7.015788,-1.2149732,6.393265,-5.632512,-4.990573,4.926471,-0.3142398,6.8114503
14,-6.134286,4.618203,-0.7749549,4.848217,2.769662,0.375647,5.470692,0.7132261,6.6111945
18,2.751689,-5.557215,0.9636823,4.912952,5.374369,7.160465,5.079374,-1.4286037,-2.5019284
19,9.478673,-6.295303,1.0039252,4.484499,6.492562,6.016471,5.063083,-0.7972708,4.7254974
21,-11.638915,4.652561,-0.7546413,5.134555,7.125025,5.947514,4.978068,-0.1967351,-0.4333253


In [7]:
# Fuzzy=FALSE 
# alpha = 0.1, maxdepth_factor_select = 0.5
system.time({
    mytree = Longtree(data,fixed_regress=fixed_regress,fixed_split=fixed_split,
                  var_select=var_select,alpha=0.1,cluster=cluster,Fuzzy=FALSE)
})
mean((predict(mytree,newdata=data_test,re.form=NA)-data_test$y)**2)
coef(mytree)

Use Longtree_time

boundary (singular) fit: see ?isSingular
boundary (singular) fit: see ?isSingular
boundary (singular) fit: see ?isSingular
boundary (singular) fit: see ?isSingular


After screening within non-grey modules V1 V2 V3 V82 
The chosen non-grey features are V1 V2 V3 
The chosen grey features are V301 V302 V303 
final features V1 V2 V3 V301 V302 V303

   user  system elapsed 
  32.72    2.01   42.21 

node,(Intercept),time,time2,V1,V2,V3,V301,V302,V303
5,4.983755,-6.301702,1.022215,5.716317,-2.192455,-2.544785,4.695668,-2.8369292,-1.2793611
6,7.755519,-4.363281,0.7542165,3.612934,-2.240745,-1.407468,5.350652,5.9533509,-2.3847875
8,-14.574241,7.139503,-1.1745854,5.25427,-1.877526,-3.1589,4.711909,0.1069969,-3.5295269
9,-8.837379,3.451156,-0.5560499,5.055151,-3.992844,-1.49885,5.373518,2.2775681,0.610018
11,8.107291,-6.571144,1.0143115,5.296218,-4.037329,-1.980469,4.234784,1.8434044,6.8752239
13,-15.224294,7.015788,-1.2149732,6.393265,-5.632512,-4.990573,4.926471,-0.3142398,6.8114503
14,-6.134286,4.618203,-0.7749549,4.848217,2.769662,0.375647,5.470692,0.7132261,6.6111945
18,2.751689,-5.557215,0.9636823,4.912952,5.374369,7.160465,5.079374,-1.4286037,-2.5019284
19,9.478673,-6.295303,1.0039252,4.484499,6.492562,6.016471,5.063083,-0.7972708,4.7254974
21,-11.638915,4.652561,-0.7546413,5.134555,7.125025,5.947514,4.978068,-0.1967351,-0.4333253


## Benchmark

In [13]:
# random forest
library("randomForest")
var = c(paste("V",1:400,sep=""),"time","time2","treatment")
Formula = as.formula(paste("y~",paste(var,collapse = "+")))
system.time({
#     set.seed(20)
    rf <- randomForest(Formula,data)
})
mean((predict(rf,newdata=data_test)-data_test$y)**2)
# sorts features by importance
importance_order <- sort(rf$importance, decreasing = TRUE,index.return=TRUE) 
# the ranking; 15 here should be a parameter
var[importance_order$ix[1:15]] 

randomForest 4.6-14
Type rfNews() to see new features/changes/bug fixes.

Attaching package: 'randomForest'

The following object is masked from 'package:pre':

    importance



   user  system elapsed 
  87.72    0.18   90.22 
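
The ranking step in the random forest cell hard-codes the number of features to report. A minimal sketch of the same idea with the cutoff as a parameter, run on a toy importance matrix shaped like `rf$importance`. `top_features` is a hypothetical helper, and reading the names from the row names avoids relying on `var` being in the same order as the importance rows.

```r
# Toy importance matrix shaped like rf$importance (one column, rows named by feature)
imp = matrix(c(0.2, 3.1, 0.7, 5.4), ncol = 1,
             dimnames = list(c("V1", "V2", "time", "treatment"), "IncNodePurity"))

# top_features is a hypothetical helper; k replaces the hard-coded 15
top_features = function(imp, k){
    ord = order(imp[, 1], decreasing = TRUE)
    rownames(imp)[head(ord, k)]
}

top_features(imp, 2)   # "treatment" "V2"
```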

In [14]:
# Fuzzy Forest
library("fuzzyforest")
# since treatment is categorical, we cannot include it in WGCNA
system.time({
data_WGCNA = data[,1:400] # only the covariates

net = blockwiseModules(data_WGCNA, power = 6,TOMType = "unsigned", 
                       minModuleSize = 30,reassignThreshold = 0, 
                       mergeCutHeight = 0.25,numericLabels = FALSE, 
                       pamRespectsDendro = FALSE,verbose = 0)

var = c(paste("V",1:400,sep=""),"time","time2","treatment")
Formula = as.formula(paste("y~",paste(var,collapse = "+")))
    
net$colors[["time"]] = "grey"
net$colors[["time2"]] = "grey"
net$colors[["treatment"]] = "grey"

ff_fit = ff(Formula,data = data,module_membership=net$colors,
        screen_params = screen_control(min_ntree = 500),
        select_params = select_control(min_ntree = 500,number_selected = 15), 
        final_ntree = 5000, num_processors = 1)        
})
mean((predict(ff_fit,new_data=data_test)-data_test$y)**2)
ff_fit$feature_list[,1]

"package 'fuzzyforest' was built under R version 3.6.1"

   user  system elapsed 
 659.29    8.44  711.25 
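
The Fuzzy Forest cell extends the WGCNA labels with "grey" entries for variables that were not clustered, and the Longtree functions group feature names by module in a loop. Both steps can be sketched on a toy label vector; the labels below are made up for illustration, and `split` is shown as one possible replacement for the `module_dic` loop.

```r
# Toy module labels shaped like net$colors: a character vector named by feature
colors = c(V1 = "blue", V2 = "blue", V3 = "grey", V4 = "turquoise")

# Variables that WGCNA never saw are appended to the grey module
for (v in c("time", "time2", "treatment")){
    colors[[v]] = "grey"
}

# Group feature names by module; same result as the module_dic loop
module_dic = split(names(colors), colors)
module_dic[["grey"]]   # "V3" "time" "time2" "treatment"
```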