In [23]:
source("SimData.r")
library("glmertree")
library("WGCNA")
library("pre")

Problems
* When fixed_regress = NULL, we could let the user decide whether or not to use the PCs as regressors. Without the PCs, the method is WGCNA plus a regular RE-EM tree
* Right now we can only do random intercept
* The user may want to tune other WGCNA parameters (in addition to power). How can this be done in an elegant way? (Not urgent, since the current parameters cluster correctly)
* Right now all the alpha in the algorithm are the same.

Output: a glmertree object (trained tree)

Parameters:
* data: training data
* fixed_regress: the regressors that are always included, such as time and time^2; if fixed_regress = NULL, the module PCs (eigengenes) are used as regressors at the screening step
* fixed_split: a character vector of features that are always used as splitting variables
* var_select: a character vector of candidate features. These features are clustered by WGCNA, and the chosen ones are used as both regression and splitting variables
* power: the soft-thresholding power passed to WGCNA
* cluster: the name of the grouping variable (for example patient) that defines the random intercept
* Fuzzy: if Fuzzy = TRUE, screen like Fuzzy Forest. If Fuzzy = FALSE, first screen within each non-grey module, then select the final non-grey features from the screened ones. Next, use these final non-grey features as regressors (plus fixed_regress) and the grey features as splitting variables to select grey features. Then use the final non-grey features and the selected grey features together as splitting and regression variables for the final prediction. Fuzzy = FALSE is useful when there are many non-grey features and you want to protect the grey features.
* maxdepth_factor_screen: when selecting features from one module, the maxdepth of the glmertree is set to the ceiling of maxdepth_factor_screen*(#splitting variables for that module, i.e. the module's features plus fixed_split). Default is 0.04.
* maxdepth_factor_select: given the screened features (one set per module; if Fuzzy = FALSE, only the screened features from the non-grey modules), we select again among them. The maxdepth of that glmertree is set to the ceiling of maxdepth_factor_select*(#screened features plus fixed_split). Default is 0.5.
* For the prediction tree (final tree), maxdepth is set to the length of split_var (fixed_split plus the chosen features)
* minsize_multiplier: for the final prediction tree, minsize = minsize_multiplier times the number of final regressors. Default is 5. minsize is set only for the final prediction tree, not for the trees in the feature selection steps, since feature selection does not need to be as careful. When tuning, note that a larger alpha and a smaller minsize_multiplier both produce a deeper tree and may therefore cause overfitting. It is better to decrease alpha and minsize_multiplier together.
* alpha_screen, alpha_select and alpha_predict are the alpha values used by the trees at the screening, selection and prediction steps respectively. Defaults are 0.2, 0.2 and 0.05.
* The most important parameters are alpha, maxdepth and minsize_multiplier.
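
As a quick illustration of the depth and minsize rules above, here is a minimal sketch in base R. The feature counts are made up for the example and do not come from the simulated data; only the three factors match the defaults in the function signature.

```r
# Sketch: how the documented factors translate into tree-control values.
# The three counts below are hypothetical and chosen only for illustration.
n_split_screen     <- 90  # module features + fixed_split at the screening step
n_split_select     <- 8   # screened features + fixed_split at the selection step
n_final_regressors <- 7   # fixed_regress + finally selected features

maxdepth_factor_screen <- 0.04
maxdepth_factor_select <- 0.5
minsize_multiplier     <- 5

# Same arithmetic as in Longtree_time / Longtree_PC
maxdepth_screen <- ceiling(maxdepth_factor_screen * n_split_screen)
maxdepth_select <- ceiling(maxdepth_factor_select * n_split_select)
minsize_final   <- round(minsize_multiplier * n_final_regressors)

print(c(screen = maxdepth_screen, select = maxdepth_select, minsize = minsize_final))
```

With these assumed counts the screening tree is capped at depth 4, the selection tree at depth 4, and every terminal node of the final tree needs at least 35 observations.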

In [26]:
Longtree = function(data,fixed_regress=NULL,fixed_split=NULL,var_select=NULL,
                    power=6,cluster,maxdepth_factor_screen=0.04,
                    maxdepth_factor_select=0.5,Fuzzy=TRUE,minsize_multiplier = 5,
                    alpha_screen=0.2, alpha_select=0.2, alpha_predict=0.05){
    ### if there are no features to select, just use fixed_regress and fixed_split
    if(length(var_select)==0){
        if (length(fixed_regress)==0){
            if (length(fixed_split)==0){
                stop("no features to split and regress on")
            }
            fixed_regress = "1"
        }
        maxdepth = length(fixed_split)
        Formula = as.formula(paste("y~",paste(fixed_regress,collapse = "+"),
                                       "|",cluster,"|",
                                     paste(fixed_split,collapse = "+")))
        mytree = lmertree(Formula,data=data,alpha=alpha_predict,maxdepth=maxdepth)
        mytree$final_selection = NULL
        return (mytree)
    } ###
    # Now var_select is not empty
    # If fixed_regress is not specified: use PCs as regressors at the screening step
    if (length(fixed_regress)==0){
        cat("Use Longtree_PC\n")
        return(Longtree_PC(data=data,fixed_split=fixed_split,
                    var_select=var_select,
                    power=power,cluster=cluster,
                    maxdepth_factor_screen=maxdepth_factor_screen, 
                    maxdepth_factor_select=maxdepth_factor_select,Fuzzy=Fuzzy,
                    minsize_multiplier=minsize_multiplier,
                    alpha_screen=alpha_screen, alpha_select=alpha_select, 
                    alpha_predict=alpha_predict))
    }###
    # Now we have non-empty var_select,fixed_regress
    cat("Use Longtree_time\n")
    return(Longtree_time(data=data,fixed_regress=fixed_regress,
                        fixed_split=fixed_split, var_select=var_select,
                    power=power,cluster=cluster,
                    maxdepth_factor_screen=maxdepth_factor_screen, 
                    maxdepth_factor_select=maxdepth_factor_select,Fuzzy=Fuzzy,
                    minsize_multiplier=minsize_multiplier,
                    alpha_screen=alpha_screen, alpha_select=alpha_select, 
                    alpha_predict=alpha_predict))
                        
}

# Longtree_time: used when var_select and fixed_regress are non-empty
# Longtree is equivalent to this Longtree_time in this case
Longtree_time = function(data,fixed_regress,fixed_split,var_select,power,cluster,
                         maxdepth_factor_screen,maxdepth_factor_select,
                         Fuzzy,minsize_multiplier,alpha_screen,alpha_select,
                         alpha_predict){
    # Cluster var_select
    data_WGCNA = data[var_select]
    # Must set numericLabels = FALSE so that it uses actual colors like "grey"
    net = blockwiseModules(data_WGCNA, power = power,TOMType = "unsigned", 
                           minModuleSize = 30,reassignThreshold = 0, 
                           mergeCutHeight = 0.25,numericLabels = FALSE, 
                           pamRespectsDendro = FALSE,verbose = 0)
    # the correspondence between feature names and colors
    colors = net$colors # a named character vector: names are features (e.g. V1), values are colors
    module_names = unique(colors) # all color names
    #"dictionary"with keys=name of color,values=names of features of that color
    module_dic = list() 
    for (i in 1:length(module_names)){
        module_dic[[module_names[i]]] = names(colors[colors==module_names[i]])
    }
    
    imp_var = list() # used to store the names of important features
    
    if(Fuzzy==TRUE){
        # Do the selection step just like Fuzzy Forest:
        # for each module (including grey), use them as split_var and select
        # finally, use all selected ones as split_var and select
        
        for (name in module_names){
        # in the formula, add fixed_split as split_var, also include the module features
        split_var = c(module_dic[[name]],fixed_split)
        maxdepth = ceiling(maxdepth_factor_screen*length(split_var))
        # use fixed_regress as regressor
        regress_var = fixed_regress

        # Formula for lmtree
        Formula = as.formula(paste("y~",paste(regress_var,collapse = "+"),
                               "|",cluster,"|",
                    paste(split_var,collapse = "+")))

        # fit the tree
        mytree = lmertree(Formula, data = data,alpha=alpha_screen,maxdepth=maxdepth) 

        #extract important features
        imp_var[[length(imp_var)+1]] = get_split_names(mytree$tree,data)
        }
        
        # the variables selected from all the modules
        final_var = imp_var[[1]]
        if (length(imp_var)>1){
            # There are at least 2 modules
            for (i in 2:length(imp_var)){
            final_var = c(final_var,imp_var[[i]])
         }
            cat("after screening within modules",final_var,"\n")
            
            # the final selection among all the chosen features 
            regress_var = fixed_regress
            split_var = c(final_var,fixed_split)
            maxdepth = ceiling(maxdepth_factor_select*length(split_var))
            Formula = as.formula(paste("y~",paste(regress_var,collapse = "+"),
                                       "|",cluster,"|",
                                     paste(split_var,collapse = "+")))
            mytree = lmertree(Formula, data = data,alpha=alpha_select,maxdepth=maxdepth) 
            final_var = get_split_names(mytree$tree,data)
            cat("final features",final_var,"\n")
        }else{
            # only grey module
            cat("There is only one module, final features",final_var,"\n")
        }
        # use the final features as split&regression variables
        split_var = c(final_var,fixed_split)
        maxdepth = length(split_var)
        regress_var = c(fixed_regress,final_var)
        Formula = as.formula(paste("y~",paste(regress_var,collapse = "+"),
                                   "|",cluster,"|",
                                 paste(split_var,collapse = "+")))
        minsize = round(minsize_multiplier*length(regress_var))
        mytree = lmertree(Formula, data = data,alpha=alpha_predict,maxdepth=maxdepth,
                         minsize = minsize)
        mytree$final_selection = final_var
        return(mytree)           
    }
    if(Fuzzy==FALSE){
        # first do the screening and selecting in non-grey modules
        # Then use those non-grey estimated true features as regressors
        # and grey features as split_var, choose grey features and keep them
        for (name in module_names){
        if(name=="grey"){
            next
        }
        split_var = c(module_dic[[name]],fixed_split)
        maxdepth = ceiling(maxdepth_factor_screen*length(split_var))
        regress_var = fixed_regress

        # Formula for lmtree
        Formula = as.formula(paste("y~",paste(regress_var,collapse = "+"),
                               "|",cluster,"|",
                    paste(split_var,collapse = "+")))

        # fit the tree
        mytree = lmertree(Formula, data = data,alpha=alpha_screen,maxdepth=maxdepth) 

        #extract important features
        imp_var[[length(imp_var)+1]] = get_split_names(mytree$tree,data)
        }# Now imp_var contains important features from modules that are not grey
        if(length(imp_var)==0){
            # only grey module, no other modules
            split_var = c(module_dic[["grey"]],fixed_split)
            maxdepth = ceiling(maxdepth_factor_screen*length(split_var))
            regress_var = fixed_regress
            Formula = as.formula(paste("y~",paste(regress_var,collapse = "+"),
                               "|",cluster,"|",
                    paste(split_var,collapse = "+")))
            mytree = lmertree(Formula, data = data,alpha = alpha_screen,maxdepth=maxdepth) 
            final_var = get_split_names(mytree$tree,data)
            cat("There is only one module which is grey, final features",final_var,"\n")
        }else{
            # at least one non-grey module
            final_var = imp_var[[1]]
            # if only one non-grey module: final_var is the chosen non-grey features
            # if at least two non-grey modules:
            if (length(imp_var)>1){
                # There are at least 2 non-grey modules: pool their screened features
                for (i in 2:length(imp_var)){
                    final_var = c(final_var,imp_var[[i]])
                }

                cat("After screening within non-grey modules",final_var,"\n")
                # select from selected non-grey features
                regress_var = fixed_regress
                split_var = c(final_var,fixed_split)
                maxdepth = ceiling(maxdepth_factor_select*length(split_var))
                Formula = as.formula(paste("y~",paste(regress_var,collapse = "+"),
                                           "|",cluster,"|",
                                         paste(split_var,collapse = "+")))
                mytree = lmertree(Formula, data = data,alpha = alpha_select,maxdepth=maxdepth) 
                final_var = get_split_names(mytree$tree,data)
            }
            cat("The chosen non-grey features are",final_var,"\n")
            
            # use final_var (chosen non-grey features) to select grey features
            regress_var = c(fixed_regress,final_var)
            split_var = c(module_dic[["grey"]],fixed_split)
            maxdepth = ceiling(maxdepth_factor_screen*length(split_var))
            Formula = as.formula(paste("y~",paste(regress_var,collapse = "+"),
                                       "|",cluster,"|",
                                     paste(split_var,collapse = "+")))
            mytree = lmertree(Formula, data = data,alpha = alpha_screen,maxdepth=maxdepth) 
            grey_var = get_split_names(mytree$tree,data)
            cat("The chosen grey features are",grey_var,"\n")
            # use final_var and grey_var to build the final model tree
            final_var = c(final_var,grey_var)    
            cat("final features",final_var,"\n")  
        }
        # use the final features as split&regression variables
        split_var = c(final_var,fixed_split)
        maxdepth = length(split_var)
        regress_var = c(fixed_regress,final_var)
        Formula = as.formula(paste("y~",paste(regress_var,collapse = "+"),
                                   "|",cluster,"|",
                                 paste(split_var,collapse = "+")))
        minsize = round(minsize_multiplier*length(regress_var))
        mytree = lmertree(Formula, data = data,alpha=alpha_predict,maxdepth=maxdepth,
                         minsize=minsize) 
        mytree$final_selection = final_var
        return(mytree)
    }
    
}

# Methods for extracting names of splitting features used in a tree
# tree: a tree object; data: the train or test set
get_split_names = function(tree,data){
    # paths: the rule strings that contain all the node information
    paths <- pre:::list.rules(tree, removecomplements = FALSE)
    vnames = names(data)
    # A variable counts as a splitting variable if "<var> <=" or "<var> >"
    # appears in at least one rule string
    used = vapply(vnames, FUN = function(var) {
        tomatch = paste(paste(var,"<="),"|",paste(var,">"),sep="")
        length(grep(tomatch, paths)) > 0
    }, FUN.VALUE = logical(1))
    tmp = vnames[used]
    return (tmp)
}

# Longtree_PC: used when fixed_regress is NULL; uses PCs as regressors for non-grey modules
# Longtree is equivalent to this Longtree_PC in this case
Longtree_PC = function(data,fixed_split, var_select, power,cluster,
                    maxdepth_factor_screen,maxdepth_factor_select,Fuzzy,
                    minsize_multiplier,alpha_screen,alpha_select,
                    alpha_predict){
    # Cluster var_select
    data_WGCNA = data[var_select]
    # Must set numericLabels = FALSE so that it uses actual colors like "grey"
    net = blockwiseModules(data_WGCNA, power = power,TOMType = "unsigned", 
                           minModuleSize = 30,reassignThreshold = 0, 
                           mergeCutHeight = 0.25,numericLabels = FALSE, 
                           pamRespectsDendro = FALSE,verbose = 0)
    # the correspondence between feature names and colors
    colors = net$colors # a named character vector: names are features (e.g. V1), values are colors
    module_names = unique(colors) # all color names
    #"dictionary"with keys=name of color,values=names of features of that color
    module_dic = list() 
    for (i in 1:length(module_names)){
        module_dic[[module_names[i]]] = names(colors[colors==module_names[i]])
    }
    
    # extract eigengenes and rename the column
    # The eigengene (1st principal component) is L2 normalized
    eigengene = net$MEs
    # eigengene
    # add eigengenes to the training data (for the grey group, the eigengene is meaningless)
    for (name in module_names){
        if (name == "grey"){
            next
        }
        eigen_name = paste("ME",name,sep="")
        data[[eigen_name]] = eigengene[[eigen_name]]
    }
    imp_var = list() # used to store the names of important features
    
    if (Fuzzy==TRUE){
        # first screen then select, just like Fuzzy Forest
        # For each module that is not grey, use model tree as following:
        # use its eigengene as regression variables and all features as splitting ones
        # For grey module, use regular REEM (set regressor = "1")
        # Then select by using all chosen features as split_var and regress="1"
        # Finally, use all the selected features for splitting and regression variables
       
        for (name in module_names){
            split_var = c(module_dic[[name]],fixed_split)
            maxdepth = ceiling(maxdepth_factor_screen*length(split_var))
            # use eigengene as regressor
            if (name == "grey"){
                regress_var = "1"
            }else{
                eigen_name = paste("ME",name,sep="")
                regress_var = eigen_name
            }
            # Formula for lmtree: use PC as regressors
            Formula = as.formula(paste("y~",paste(regress_var,collapse = "+"),
                                   "|",cluster,"|",
                                 paste(split_var,collapse = "+")))

            # fit the tree
            mytree = lmertree(Formula, data=data,alpha=alpha_screen,maxdepth=maxdepth) 
            #extract important features
            imp_var[[length(imp_var)+1]] = get_split_names(mytree$tree,data)
        }        
        # the variables selected from all the modules
        final_var = imp_var[[1]]      
        if (length(imp_var)>1){
            for (i in 2:length(imp_var)){
            final_var = c(final_var,imp_var[[i]])
         }
            cat("After screening within modules ",final_var,"\n")
            # select features again
            # use all selected features as split_var with no regressors
            split_var = c(final_var,fixed_split)
            maxdepth = ceiling(maxdepth_factor_select*length(split_var))
            Formula = as.formula(paste("y~","1",
                                       "|",cluster,"|",
                                     paste(split_var,collapse = "+")))
            mytree = lmertree(Formula, data = data,alpha=alpha_select,maxdepth=maxdepth)
            final_var = get_split_names(mytree$tree,data)
            cat("Final features ",final_var,"\n")
            
        }else{
            # length(imp_var) now is 1, only one module
            cat("There is only one module ",final_var,"\n")
        }
        
        # use the final features as split&regression variables
        split_var = c(final_var,fixed_split)
        maxdepth = length(split_var)
        Formula = as.formula(paste("y~",paste(final_var,collapse = "+"),
                                   "|",cluster,"|",
                                 paste(split_var,collapse = "+")))
        minsize = round(minsize_multiplier*length(final_var))
        mytree = lmertree(Formula, data = data,alpha=alpha_predict,maxdepth=maxdepth,
                         minsize = minsize)
        mytree$final_selection = final_var
        return (mytree)
    }
    
    if (Fuzzy== FALSE){
        # select features from non-grey modules and use them as regressors 
        # to select features from grey module. Then use all of them as split and regressor
        
        # for non-grey groups
        for (name in module_names){
            if (name == "grey"){
                next
            }
            split_var = c(module_dic[[name]],fixed_split)
            maxdepth = ceiling(maxdepth_factor_screen*length(split_var))
            # use eigengene as regressor
            eigen_name = paste("ME",name,sep="")
            regress_var = eigen_name
            # Formula for lmtree: use PC as regressors
            Formula = as.formula(paste("y~",paste(regress_var,collapse = "+"),
                                   "|",cluster,"|",
                                 paste(split_var,collapse = "+")))

            # fit the tree
            mytree = lmertree(Formula, data=data,alpha=alpha_screen,maxdepth=maxdepth) 
            #extract important features
            imp_var[[length(imp_var)+1]] = get_split_names(mytree$tree,data)
        }
        # Now imp_var contains all the non-grey screened features
        if(length(imp_var)==0){
            # there is only one module which is grey
            # just select from the grey module
            split_var = c(module_dic[["grey"]],fixed_split)
            maxdepth = ceiling(maxdepth_factor_screen*length(split_var))
            regress_var = "1"
            Formula = as.formula(paste("y~",paste(regress_var,collapse = "+"),
                                   "|",cluster,"|",
                                 paste(split_var,collapse = "+")))
            mytree = lmertree(Formula, data=data,alpha=alpha_screen,maxdepth=maxdepth) 
            final_var = get_split_names(mytree$tree,data)
            cat("There is only grey module ",final_var,"\n")
        }
        if(length(imp_var)==1){
            # only one non-grey module
            final_var = imp_var[[1]]
            cat("There is only one non-grey module",final_var,"\n")
            # use final_var as regressors
            split_var = c(module_dic[["grey"]],fixed_split)
            maxdepth = ceiling(maxdepth_factor_screen*length(split_var))
            regress_var = final_var
            Formula = as.formula(paste("y~",paste(regress_var,collapse = "+"),
                                   "|",cluster,"|",
                                 paste(split_var,collapse = "+")))
            mytree = lmertree(Formula, data=data,alpha=alpha_screen,maxdepth=maxdepth) 
            grey_var = get_split_names(mytree$tree,data)
            final_var = c(final_var,grey_var)
            cat("The final features ",final_var,"\n")
        }
        if(length(imp_var)>1){
            # at least two non-grey modules, select among non-grey modules
            final_var = imp_var[[1]]
            for (i in 2:length(imp_var)){
                final_var = c(final_var,imp_var[[i]])
                }
            cat("After screening from non-grey modules ",final_var,"\n")
            # now final_var contains all the non-grey screened features
            split_var = c(final_var,fixed_split)  
            maxdepth = ceiling(maxdepth_factor_select*length(split_var))
            regress_var = "1"
            Formula = as.formula(paste("y~",paste(regress_var,collapse = "+"),
                                   "|",cluster,"|",
                                 paste(split_var,collapse = "+")))
            mytree = lmertree(Formula, data=data,alpha=alpha_select,maxdepth=maxdepth)
            final_var = get_split_names(mytree$tree,data)
            # Now final_var contains final non-grey features
            cat("Final non-grey features ",final_var,"\n")
            # use final_var as regressors and select features from grey features
            split_var = c(module_dic[["grey"]],fixed_split)
            maxdepth = ceiling(maxdepth_factor_screen*length(split_var))
            regress_var = final_var
            Formula = as.formula(paste("y~",paste(regress_var,collapse = "+"),
                                   "|",cluster,"|",
                                 paste(split_var,collapse = "+")))
            mytree = lmertree(Formula, data=data,alpha=alpha_screen,maxdepth=maxdepth) 
            grey_var = get_split_names(mytree$tree,data)
            cat("Final grey features ",grey_var,"\n")
            final_var = c(final_var,grey_var)
            cat("The final features ",final_var,"\n")
        }
        # use the final features as split&regression variables
        split_var = c(final_var,fixed_split)
        maxdepth = length(split_var)
        Formula = as.formula(paste("y~",paste(final_var,collapse = "+"),
                                   "|",cluster,"|",
                                 paste(split_var,collapse = "+")))
        minsize = round(minsize_multiplier*length(final_var))
        mytree = lmertree(Formula, data = data,alpha=alpha_predict,maxdepth=maxdepth,
                         minsize = minsize)
        mytree$final_selection = final_var
        return (mytree)
    }
    
    
}
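
The functions above repeat the same pattern many times: paste together a three-part formula of the form `y ~ regressors | cluster | split variables` and hand it to `lmertree`. A minimal sketch of that pattern as a helper is shown below. `build_tree_formula` is a hypothetical name that does not exist in this notebook or in glmertree; only base R is used, so the block runs without fitting any tree.

```r
# Hypothetical helper (not part of the notebook): assemble the three-part
# formula "response ~ regressors | cluster | split variables".
build_tree_formula <- function(regress_var, cluster, split_var, response = "y") {
    # Fall back to an intercept-only node model when no regressors are given
    if (length(regress_var) == 0) regress_var <- "1"
    # An empty split part would produce a formula string that cannot be parsed
    stopifnot(length(split_var) > 0, length(cluster) == 1)
    as.formula(paste(response, "~", paste(regress_var, collapse = " + "),
                     "|", cluster, "|", paste(split_var, collapse = " + ")))
}

f <- build_tree_formula(c("time", "time2"), "patient", c("V1", "V2", "treatment"))
print(f)
all.vars(f)

# Intercept-only variant, as used for the grey module in Longtree_PC
f0 <- build_tree_formula(NULL, "patient", "V1")
all.vars(f0)
```

The explicit check on `split_var` is a design choice of this sketch: it surfaces the empty-split case early instead of leaving it to the formula parser.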

# Longtree_PC Sample Run

In [4]:
n <- 1000 # number of patients
T <-  5 # number of observations per patient
set.seed(100)

data <- as.data.frame(sim_3(n, T)) 
colnames(data)[401] <- "y"
# patient id: rows are ordered by patient, T consecutive rows each
data$patient <- rep(seq_len(n), each = T)

ERROR: Error in sim_3(n, T): could not find function "sim_3"


In [9]:
n_test <- 100 
T <-  5 
set.seed(101)
data_test <- as.data.frame(sim_3(n_test, T)) 
colnames(data_test)[401] <- "y"
# patient id: rows are ordered by patient, T consecutive rows each
data_test$patient <- rep(seq_len(n_test), each = T)

In [10]:
cluster = "patient"
fixed_regress = NULL
fixed_split = NULL
var_select = paste("V",1:400,sep="")
# fixed_split = "V303"
# var_select = paste("V",setdiff(1:400,303),sep="")

In [7]:
system.time({
    mytree = Longtree(data,fixed_regress=fixed_regress,fixed_split=fixed_split,
                  var_select=var_select,cluster=cluster,Fuzzy=FALSE)
})
mean((predict(mytree,newdata=data_test,re.form=NA)-data_test$y)^2)

Use Longtree_PC
After screening from non-grey modules  V1 V2 V3 V141 V150 
Final non-grey features  V1 V2 V3 
Final grey features  V301 V302 V303 
The final features  V1 V2 V3 V301 V302 V303 


   user  system elapsed 
 162.48    1.25  174.78 
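
The feature lists printed above are produced by `get_split_names`, which looks for `<name> <=` or `<name> >` in the rule strings of a tree. The sketch below reproduces that matching idea on made-up rule strings, so it does not need a fitted tree. The rule format is an assumption modelled on the regex in the notebook, and the `\\b` word-boundary anchor is an addition of this sketch, not something the original function uses.

```r
# Made-up rule strings in the style the notebook's regex expects
paths  <- c("V1 <= 0.12 & V12 > -0.40", "V1 > 0.12 & V3 <= 1.05")
vnames <- c("V1", "V2", "V3", "V12", "V123")

# A variable counts as used if "<name> <=" or "<name> >" occurs in any rule.
# The leading \\b (an addition here) stops a name from matching inside a
# longer name that merely ends with it.
used <- vapply(vnames, function(v) {
    pattern <- paste0("\\b", v, " (<=|>)")
    any(grepl(pattern, paths, perl = TRUE))
}, logical(1))

split_names <- vnames[used]
split_names
```

Here `V2` and `V123` never appear in a rule, so only `V1`, `V3` and `V12` are returned.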

# Longtree_time: Sample Run

In [27]:
# training data
n <- 1000 # number of patients
T <-  5 # number of observations per patient

set.seed(100)

data <- sim_quad(n,T,a1=1,a2=-1)
# add time_squared
data$time2 = (data$time)^2

# testing data
n_test <- 100 # number of patients
T <-  5 # number of observations per patient
set.seed(101)
data_test <- sim_quad(n_test,T,a1=1,a2=-1)
# data_test <- sim_quad(n_test, T)
data_test$time2 = (data_test$time)^2

In [28]:
fixed_regress = c("time","time2")
fixed_split = c("treatment")
cluster = "patient"
var_select = paste("V",1:400,sep="")

In [29]:
# Fuzzy=TRUE 
# alpha = 0.2, maxdepth_factor_select = 0.8 
system.time({
    mytree = Longtree(data,fixed_regress=fixed_regress,fixed_split=fixed_split,
                  var_select=var_select,cluster=cluster,Fuzzy=TRUE,
                     maxdepth_factor_select = 0.8,minsize_multiplier=5)
})
mean((predict(mytree,newdata=data_test,re.form=NA)-data_test$y)^2)
coef(mytree)

Use Longtree_time
after screening within modules V1 V2 V3 V45 V301 V302 V303 V365 
final features V1 V2 V3 V45 V301 V302 V303

   user  system elapsed 
 284.23    4.03  291.34 

node,(Intercept),time,time2,V1,V2,V3,V45,V301,V302,V303
7,-6.349571,-7.350222,1.2114900,4.171255,-6.8839144,-7.3718101,1.541683559,4.494505,-0.7512000,-3.9651174
8,-3.658046,-6.883360,1.1881768,5.108242,-7.1688688,-5.0943271,0.057436244,4.526475,-2.8828999,0.9714997
9,-3.698400,-5.602412,0.9491308,5.115616,-8.6631469,-4.8501337,-0.564776744,5.377504,-1.5535279,6.3739251
12,2.665192,-4.347118,0.7527754,5.103950,-1.2328158,1.2561532,-0.003304848,4.863811,-1.4622593,-3.7378360
14,6.238134,-6.017547,1.0210227,5.133300,-1.7202024,-0.8393982,0.049126577,5.055169,-1.9300952,0.5422217
15,8.367578,-5.629857,0.9427324,4.275650,-1.0969766,-0.2755758,0.054170399,4.870142,-2.1582460,3.5697818
16,14.394620,-6.824156,1.1617445,5.539400,-1.7106410,-0.9579946,0.793770034,5.224937,-2.4107370,8.8672637
19,-19.405325,4.318409,-0.7500138,4.424966,-6.7462368,-6.0823057,1.608117424,5.017769,-4.8821657,1.5511117
21,-6.724431,4.362079,-0.7458413,5.397429,-0.8061076,-0.7187922,-0.296876866,5.236399,-7.0547208,2.7592528
22,-12.449150,5.888207,-0.9973595,5.003771,-0.4268434,-0.5087410,0.239971645,5.043480,-3.7509750,-0.5490829


In [30]:
# Fuzzy=TRUE 
# alpha = 0.2, maxdepth_factor_select = 0.5 (all default)
system.time({
    mytree = Longtree(data,fixed_regress=fixed_regress,fixed_split=fixed_split,
                  var_select=var_select,cluster=cluster,Fuzzy=TRUE)
})
mean((predict(mytree,newdata=data_test,re.form=NA)-data_test$y)^2)
coef(mytree)

Use Longtree_time
after screening within modules V1 V2 V3 V45 V301 V302 V303 V365 
final features V1 V2 V3 V45 V301

"Some predictor variables are on very different scales: consider rescaling"

   user  system elapsed 
 132.88    0.88  135.89 

node,(Intercept),time,time2,V1,V2,V3,V45,V301
4,-6.0281412,-6.4612153,1.138401,6.245828,-7.25967013,-8.472994007,0.72959766,5.08056
5,-23.0382615,6.6997398,-1.1139051,5.432328,-5.50408537,-7.271320455,-0.49546859,5.241933
7,7.770932,-6.2607119,1.0863676,4.919968,-0.25198985,-0.680108175,-0.14906637,4.603007
9,-14.7016229,5.2435583,-0.9248665,4.705778,-4.09335889,-5.297219656,2.11906574,5.107379
10,-6.4238948,3.8370777,-0.7122208,5.280119,0.05253092,0.005806421,-0.6753463,5.113255
14,19.9524455,-8.9498873,1.4947665,11.27602,1.49114124,114.244400242,-5.72667504,5.203355
15,6.90276,-5.4595861,0.942931,4.930984,4.33239491,4.582721437,0.09611803,4.877859
17,-0.2449789,-3.209318,0.5496439,7.93451,9.70144399,4.494394216,-0.09366256,5.852352
18,-10.7662559,-7.015785,1.1615518,4.430759,12.03460505,11.515776179,2.1965132,3.906054
21,-7.0867582,5.0626319,-0.903098,5.848162,2.0507536,1.12287955,0.30167057,5.209441
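
The run above printed the warning "Some predictor variables are on very different scales: consider rescaling". One common way to act on it is to standardize the numeric predictors with the training means and standard deviations and to apply the same transformation to the test set. The sketch below does this on small made-up data frames. `train` and `test` are stand-ins for `data` and `data_test`, and whether rescaling changes the selected features here has not been checked.

```r
# Toy stand-ins for the notebook's data and data_test (values are made up)
set.seed(1)
train <- data.frame(V1 = rnorm(20, mean = 50, sd = 10),
                    V2 = rnorm(20, mean = 0,  sd = 0.01))
test  <- data.frame(V1 = rnorm(5,  mean = 50, sd = 10),
                    V2 = rnorm(5,  mean = 0,  sd = 0.01))

vars   <- c("V1", "V2")
center <- colMeans(train[vars])                 # training means
spread <- vapply(train[vars], sd, numeric(1))   # training standard deviations

# scale() accepts per-column center and scale vectors; the test set reuses
# the training statistics so both sets live on the same scale
train[vars] <- scale(train[vars], center = center, scale = spread)
test[vars]  <- scale(test[vars],  center = center, scale = spread)

round(colMeans(train[vars]), 10)
```

Reusing the training statistics for the test set keeps the evaluation free of information from the test data.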


## Benchmark

In [31]:
# random forest
library("randomForest")
var = c(paste("V",1:400,sep=""),"time","time2","treatment")
Formula = as.formula(paste("y~",paste(var,collapse = "+")))
system.time({
#     set.seed(20)
    rf <- randomForest(Formula,data)
})
mean((predict(rf,newdata=data_test)-data_test$y)^2)
# sorts features by importance
importance_order <- sort(rf$importance, decreasing = TRUE,index.return=TRUE) 
# the top-ranked features; the cutoff (15 here) should be a parameter
var[importance_order$ix[1:15]] 

randomForest 4.6-14
Type rfNews() to see new features/changes/bug fixes.

Attaching package: 'randomForest'

The following object is masked from 'package:pre':

    importance



   user  system elapsed 
 524.88    1.44  540.55 

In [32]:
var[importance_order$ix] 
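
The ranking step above hard-codes the cutoff. A minimal sketch of the same ranking with the cutoff as an argument follows. `top_features` is a hypothetical helper, and the importance scores are made up; with a fitted forest one would presumably pass something like `rf$importance[, 1]`, which is an assumption about the shape of that object rather than something verified here.

```r
# Made-up importance scores standing in for a named importance vector
imp <- c(V1 = 12.5, V2 = 30.1, time = 4.2, V3 = 18.9, treatment = 0.7)

# Hypothetical helper: names of the k most important features, best first
top_features <- function(importance, k = 15) {
    ranked <- names(importance)[order(importance, decreasing = TRUE)]
    head(ranked, k)   # head() copes with k larger than the number of features
}

top3 <- top_features(imp, k = 3)
top3
```

Working with the names of the importance vector avoids relying on the order of a separately maintained `var` vector.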

In [9]:
# Fuzzy Forest
library("fuzzyforest")
# since treatment is categorical, we cannot include it in WGCNA
system.time({
data_WGCNA = data[,1:400] # only the covariates

net = blockwiseModules(data_WGCNA, power = 6,TOMType = "unsigned", 
                       minModuleSize = 30,reassignThreshold = 0, 
                       mergeCutHeight = 0.25,numericLabels = FALSE, 
                       pamRespectsDendro = FALSE,verbose = 0)

var = c(paste("V",1:400,sep=""),"time","time2","treatment")
Formula = as.formula(paste("y~",paste(var,collapse = "+")))
    
net$colors[["time"]] = "grey"
net$colors[["time2"]] = "grey"
net$colors[["treatment"]] = "grey"

ff_fit = ff(Formula,data = data,module_membership=net$colors,
        screen_params = screen_control(min_ntree = 500),
        select_params = select_control(min_ntree = 500,number_selected = 15), 
        final_ntree = 5000, num_processors = 1)        
})
mean((predict(ff_fit,new_data=data_test)-data_test$y)^2)
ff_fit$feature_list[,1]

"package 'fuzzyforest' was built under R version 3.6.1"

   user  system elapsed 
3384.45    8.99 3405.07 
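
The Fuzzy Forest cell above extends the WGCNA module labels by assigning `time`, `time2` and `treatment` to the grey module. The sketch below shows that bookkeeping on a made-up label vector, together with a compact way to list the features of each module. `colors` here is a stand-in for `net$colors`, not real WGCNA output.

```r
# Made-up module labels standing in for net$colors
colors <- c(V1 = "blue", V2 = "blue", V3 = "turquoise", V4 = "grey")

# Variables that were not clustered are assigned to the grey module
extra      <- c("time", "time2", "treatment")
membership <- c(colors, setNames(rep("grey", length(extra)), extra))

# One character vector of feature names per module, analogous to module_dic
modules <- split(names(membership), membership)
modules$grey
```

`split()` builds the same name-by-module lookup that the `module_dic` loop in `Longtree_time` constructs by hand.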