In [5]:
library(MASS)

# Readme

* ${\large \textbf{Settings}}$:
    * There are $p = 4\times 100 = 400$ features. Each measurement of an individual is denoted by $x = (X^{(1)},X^{(2)},...,X^{(400)})$. The features are split into 4 groups of 100. The groups are always independent of each other; features in the first three groups may be correlated within their group, while features in the 4th group are always independent.
    * There are n units and each unit is measured T times, so the data vector of the i-th unit at time t is $x_{it} = (X^{(1)}_{it},X^{(2)}_{it},...,X^{(400)}_{it})$. Different units are assumed to be independent.
    * Define $$\text{f_sim} (x_{t}) = 5X^{(1)}_{t}+2X^{(2)}_{t}+2X^{(3)}_{t}+5X^{(2)}_{t}X^{(3)}_{t}
+5X_{t}^{(301)}+2X^{(302)}_{t}+2X^{(303)}_{t}+5X^{(302)}_{t}X^{(303)}_{t}$$ 
    * Our model: 
    $$y_{it} = \text{f_sim}(x_{it}) + \epsilon_{it} $$ where $\epsilon_{it}\sim N(0,\alpha)$ and $\alpha$ is the noise level (called var_noise in the code). Also denote $\epsilon_{i} = (\epsilon_{i1},...,\epsilon_{iT})$, so that $\epsilon_{i}\sim N(0,\alpha I)$ when the noise has no time structure.
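
As a quick check of the definition of f_sim, the following sketch evaluates it on two hand-built rows. The name `f_sim_demo` and the toy matrix `X_toy` are illustrative only and are not part of the notebook's function cell.

```r
# Illustrative restatement of f_sim (hypothetical name f_sim_demo)
f_sim_demo <- function(X) {
  5 * X[, 1] + 2 * X[, 2] + 2 * X[, 3] + 5 * X[, 2] * X[, 3] +
    5 * X[, 301] + 2 * X[, 302] + 2 * X[, 303] + 5 * X[, 302] * X[, 303]
}

X_toy <- matrix(0, nrow = 2, ncol = 400)
X_toy[1, c(1, 2, 3)] <- 1          # row 1: only the group-1 signal features are set
X_toy[2, c(301, 302, 303)] <- 1    # row 2: only the group-4 signal features are set
y_toy <- f_sim_demo(X_toy)         # each row gives 5 + 2 + 2 + 5 = 14
```

Only 6 of the 400 features enter the response; the other 394 are pure noise features.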
    
* ${\large \textbf{Meaning of time structure}}$: for unit i, the (p+1)-dimensional vectors $(x_{it},y_{it})$, t=1,2,...,T, are not independent across t.
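
The simulating functions in this notebook stack the T rows of each unit contiguously, so unit i at time t sits in row (i-1)T + t. A minimal sketch of that layout follows; `row_of`, `unit`, `time` and `T_len` are hypothetical names introduced here (`T_len` is used instead of `T` only to avoid masking R's shorthand for `TRUE`).

```r
n <- 3       # number of units
T_len <- 5   # measurements per unit

unit <- rep(1:n, each = T_len)       # 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3
time <- rep(1:T_len, times = n)      # 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5

# Row index of unit i at time t in the stacked (n*T_len)-row matrix
row_of <- function(i, t, T_len) (i - 1) * T_len + t

r <- row_of(2, 3, T_len)             # unit 2, time 3
```

Here `unit[r]` and `time[r]` recover the unit and time point of row `r`, which is useful when attaching identifiers to the returned matrix.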

* ${\large \textbf{Different simulating functions}}$: Input is n, T, var_noise, cor_feature. Each function returns an $nT\times (p+1)$ matrix whose last column is the response y (sim_quad instead returns a data frame with additional columns). All numerical data are Gaussian with mean 0.
    * sim_1: within each of the first three groups, every pair of features has covariance cor_feature, specified by the user (if cor_feature = 0, all p features are independent). That is, the covariance matrix is given by 
    $$
   \text{cov_feature} = \Sigma=
  \begin{bmatrix}
    \Sigma^{*} & & &\\
    & \Sigma^{*} & & \\
    & & \Sigma^{*} &\\
    & & & I
  \end{bmatrix}
$$ where $$\Sigma^{*} = \begin{bmatrix}
                         1 & \text{cor_feature} & \dots & \text{cor_feature}\\
                         \text{cor_feature} & 1 & \dots & \text{cor_feature}\\
                         \vdots & \vdots & \vdots & \vdots\\
                         \text{cor_feature} & \text{cor_feature}&\dots &1
                         \end{bmatrix}$$
Also, the covariance matrix of $\epsilon_{i}$ is var_noise $\cdot\, I$ (no time structure), and the x vectors within a unit are iid.
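
The block-diagonal matrix above can be built with a short loop. The sketch below uses a hypothetical helper `make_cov_feature` (not defined in the notebook) and a small group size so the result is easy to inspect; the notebook's functions build the same structure inline with p0 = 100.

```r
# Hypothetical helper: block-diagonal feature covariance with
# n_corr_groups correlated groups followed by one independent group
make_cov_feature <- function(p0 = 100, cor_feature = 0.8, n_corr_groups = 3) {
  cov_star <- matrix(cor_feature, nrow = p0, ncol = p0)
  diag(cov_star) <- 1
  p <- (n_corr_groups + 1) * p0
  cov_feature <- diag(p)                      # identity: last group stays independent
  for (g in seq_len(n_corr_groups)) {
    idx <- ((g - 1) * p0 + 1):(g * p0)        # indices of group g
    cov_feature[idx, idx] <- cov_star
  }
  cov_feature
}

Sigma <- make_cov_feature(p0 = 3, cor_feature = 0.5)   # 12 x 12 toy version
```

With p0 = 3, entries such as `Sigma[1, 2]` equal cor_feature, while entries across groups (for example `Sigma[1, 4]`) and within the last group (for example `Sigma[10, 11]`) are 0.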
    
    * sim_2: (CS structure on y) The features are generated as in sim_1, but $\epsilon_{i}$ now has a compound-symmetry (CS) covariance structure. The covariance matrix of $\epsilon_{i}$ is given by 
$$ \text{cov_noise} = \begin{bmatrix}
                         \text{var_noise} & \text{cor_noise} & \dots & \text{cor_noise}\\
                         \text{cor_noise} & \text{var_noise} & \dots & \text{cor_noise}\\
                         \vdots & \vdots & \vdots & \vdots\\
                         \text{cor_noise} & \text{cor_noise}&\dots&\text{var_noise}
                         \end{bmatrix}$$
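
Because units are independent, the full $nT\times nT$ noise covariance is block diagonal with one copy of cov_noise per unit. A minimal sketch, assuming small illustrative values for n and T (`T_len` is a hypothetical name used to avoid masking `T`), builds it with `kronecker` rather than a loop:

```r
n <- 3
T_len <- 4
var_noise <- 1
cor_noise <- 0.8

# Within-unit CS covariance: var_noise on the diagonal, cor_noise elsewhere
cov_noise_star <- matrix(cor_noise, nrow = T_len, ncol = T_len)
diag(cov_noise_star) <- var_noise

# One block per unit on the diagonal, zeros between different units
cov_noise <- kronecker(diag(n), cov_noise_star)

# Smallest eigenvalue; it must be positive for a valid covariance matrix
min_eig <- min(eigen(cov_noise, symmetric = TRUE)$values)
```

The eigenvalues of a CS block are var_noise - cor_noise and var_noise + (T-1) cor_noise, so the matrix is positive definite only when cor_noise lies strictly between -var_noise/(T-1) and var_noise.
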
    * sim_3: In the previous two functions, the X matrix has no time structure; that is, the rows of X are iid. In sim_3, we follow the simulation_AR pdf file to construct an X matrix with AR structure. We still use $y_{it} = \text{f_sim}(x_{it}) + \epsilon_{it}$, and $\epsilon_{i}$ has a diagonal covariance (no time structure). Since X has AR structure and y is a function of X, y also has time structure. In the pdf, $x_{t+1} = \alpha x_{t} + \sqrt{1-\alpha^{2}}\,\text{std_normal}$ with $\alpha = 0.8$; in the code, the innovation term is drawn from $N(0,\Sigma)$ with $\Sigma$ = cov_feature. Here $\alpha$ can be chosen by the user. Note that this AR coefficient (argument alpha in the code) is distinct from the noise level $\alpha$ defined above (var_noise).
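
To see why the factor $\sqrt{1-\alpha^{2}}$ appears, here is a short worked derivation. It assumes, as the sim_3 code does, that $x_{1}\sim N(0,\Sigma)$ and that each innovation $z_{t+1}\sim N(0,\Sigma)$ is independent of $x_{1},\dots,x_{t}$; the symbol $z_{t}$ is introduced here for illustration.

$$\text{Cov}(x_{t+1}) = \alpha^{2}\,\text{Cov}(x_{t}) + (1-\alpha^{2})\,\text{Cov}(z_{t+1}) = \alpha^{2}\Sigma + (1-\alpha^{2})\Sigma = \Sigma$$

$$\text{Cov}(x_{t}, x_{t+k}) = \alpha^{k}\,\Sigma,\qquad k = 0,1,2,\dots$$

So every time point has the same feature covariance $\Sigma$, and each feature has lag-k autocorrelation $\alpha^{k}$; with $\alpha = 0.8$ this is 0.8 at lag 1 and 0.64 at lag 2.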
    
    * sim_4: CS structure on X, constructed as follows for a given patient i:
        * Generate $\gamma_{i} = (\gamma_{i}^{1},...,\gamma_{i}^{400})$ from $N(0,\Sigma)$, where $\Sigma$ is cov_feature.
        * For each t, generate $e_{it} = (e_{it}^{1},...,e_{it}^{400})$ from $N(0,\Sigma)$, independently of $\gamma_{i}$ and of the other time points.
        * Generate $X_{it} = \beta \gamma_{i} + \sqrt{1-\beta^{2}}e_{it}$
        * Then for any feature k, the vector within a unit has a covariance matrix
        $$cov(X_{i1}^{k},...,X_{iT}^{k}) = \begin{bmatrix}
                         1 & \beta^{2} & \dots & \beta^{2}\\
                         \beta^{2} & 1 & \dots & \beta^{2}\\
                         \vdots & \vdots & \vdots & \vdots\\
                         \beta^{2} & \beta^{2}&\dots &1
                         \end{bmatrix}$$
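
The entries of this matrix follow directly from the construction. As a worked step, assume $\gamma_{i}$ and the $e_{it}$ are mutually independent with unit variance for each feature k (as implied by the unit diagonal of $\Sigma$). Then for $s\neq t$:

$$\text{Cov}(X_{is}^{k}, X_{it}^{k}) = \beta^{2}\,\text{Var}(\gamma_{i}^{k}) + (1-\beta^{2})\,\text{Cov}(e_{is}^{k}, e_{it}^{k}) = \beta^{2}\cdot 1 + (1-\beta^{2})\cdot 0 = \beta^{2}$$

$$\text{Var}(X_{it}^{k}) = \beta^{2}\,\text{Var}(\gamma_{i}^{k}) + (1-\beta^{2})\,\text{Var}(e_{it}^{k}) = \beta^{2} + (1-\beta^{2}) = 1$$

With the default $\beta = 0.8$, any two time points of the same feature within a unit have correlation 0.64.
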
    * sim_quad: CS structure on y, time dependence, and a group/treatment effect. There are two groups of patients: the first half is group 1 and the second half is group 2. For group 1, $y_{it} = \text{f_sim}(x_{it}) + a_{1}t^{2} + b_{1}t+c_{1}+\epsilon_{it}$, and for group 2, $y_{it} = \text{f_sim}(x_{it}) + a_{2}t^{2} + b_{2}t+c_{2}+\epsilon_{it}$, where the coefficients a, b and c are fixed effects. In the code, the quadratic is parameterized as $a_{g}(t-\text{med})^{2}$ with med = median(1:T), and the within-patient correlation comes from a per-patient random intercept drawn from $N(0,1)$. Patients are still independent of each other.
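
The sim_quad code writes the time trend as a1*(t-med)^2, which expands to a1*t^2 - 2*a1*med*t + a1*med^2. The sketch below checks this on the noise-free treatment-1 trend by regressing it on time and time^2. The variable names (`T_len`, `trend`, `fit`, `coefs`) are illustrative and not taken from the notebook.

```r
T_len <- 5                        # number of time points (T in the notebook)
a1 <- 5                           # default quadratic coefficient for treatment 1
time <- 1:T_len
med <- median(time)               # 3 when T_len = 5

trend <- a1 * (time - med)^2      # noise-free treatment-1 trend

# Regress on time and time^2: the centered quadratic has a linear term too
fit <- lm(trend ~ time + I(time^2))
coefs <- unname(coef(fit))        # intercept, linear, quadratic
```

Up to floating-point error, the fitted coefficients are a1*med^2 = 45, -2*a1*med = -30 and a1 = 5, which is why both time and time^2 are needed when estimating the trend by linear regression.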
    
* ${\large \textbf{Simulated data in csv file} }$
    * data1: independent features and no time structure
    * data2: grouped features, no time structure (as in the Fuzzy Forest paper)
    * data3: grouped features, CS time structure only on y (rows of X are iid)
    * data4: grouped features, AR structure on X as in pdf simulation_AR
    * data5: grouped features, CS structure on X; the magnitude of the CS structure is controlled by $\beta$
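
When reading these files back, note that `write.csv` stores row names as an unnamed first column by default. The following sketch uses a small stand-in matrix `m` and a temporary file (both illustrative, not files from this notebook) to show the difference between a plain `read.csv` and one with `row.names = 1`.

```r
m <- matrix(seq(0.25, 3, by = 0.25), nrow = 4, ncol = 3)  # stand-in for a simulated matrix
path <- tempfile(fileext = ".csv")

write.csv(m, file = path)               # same call style as in the notebook

raw  <- read.csv(path)                  # row names come back as an extra first column
back <- read.csv(path, row.names = 1)   # first column is used as row names instead

unlink(path)                            # remove the temporary file
```

Here `raw` has one more column than `m`, while `back` has the original shape, so the response y is again the last column.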

# Functions

In [18]:
# Input: a matrix X_data; output: a vector with f_sim applied to each row of X_data
f_sim = function (X_data){
    y = (5*X_data[,1]+2*X_data[,2]+2*X_data[,3]+5*X_data[,2]*X_data[,3]
         +5*X_data[,301]+2*X_data[,302]+2*X_data[,303]+5*X_data[,302]*X_data[,303])
    return (y)
}

f_sim_linear = function (X_data){
    y = (5*X_data[,1]+2*X_data[,2]+2*X_data[,3]
         +5*X_data[,301]+2*X_data[,302]+2*X_data[,303])
    return (y)
}

## For all of the following functions, the last column of the returned matrix is the label
# No time structure; The features are grouped
sim_1 = function (n,T,cor_feature=0.8,var_noise=1){
    p = 400
    p0 = 100
    data = matrix(0,nrow = n*T, ncol = p+1)
    
    #### covariance matrix between features: it is either 0 (independent) or cor_feature ####
    cov_feature = matrix(0,nrow = p, ncol = p)
    # cov within the first three modules
    cov_star = matrix(cor_feature,nrow = p0,ncol = p0)
    diag(cov_star)=1
    # put cov_star into cov_feature
    cov_feature[1:p0,1:p0] = cov_star
    cov_feature[(p0+1):(2*p0),(p0+1):(2*p0)] = cov_star
    cov_feature[(2*p0+1):(3*p0),(2*p0+1):(3*p0)] = cov_star
    cov_feature[(3*p0+1):(4*p0),(3*p0+1):(4*p0)] = diag(p0)
    ####
    
    # Create X matrix
    data[1:(n*T),1:p] = mvrnorm(n=n*T,rep(0,p),cov_feature)
    
    # create label y
    data[1:(n*T),p+1] = (f_sim(data[1:(n*T),1:p]) 
                         + mvrnorm(n=1,rep(0,n*T),diag(x=var_noise,n*T)))
    return (data)
}

# CS time structured on y and grouped features
sim_2 = function(n,T,cor_feature=0.8,var_noise=1,cor_noise=0.8){
    p = 400
    p0 = 100
    data = matrix(0,nrow = n*T, ncol = p+1)
    
    #### covariance matrix between features: it is either 0 (independent) or cor_feature ####
    cov_feature = matrix(0,nrow = p, ncol = p)
    # cov within the first three modules
    cov_star = matrix(cor_feature,nrow = p0,ncol = p0)
    diag(cov_star)=1
    # put cov_star into cov_feature
    cov_feature[1:p0,1:p0] = cov_star
    cov_feature[(p0+1):(2*p0),(p0+1):(2*p0)] = cov_star
    cov_feature[(2*p0+1):(3*p0),(2*p0+1):(3*p0)] = cov_star
    cov_feature[(3*p0+1):(4*p0),(3*p0+1):(4*p0)] = diag(p0)
    ####
    
    #### covariance matrix of noise #####
    # the covariance matrix of noise within a unit
    cov_noise_star = matrix(cor_noise,nrow = T, ncol = T)
    diag(cov_noise_star) = var_noise
    # the overall cov matrix is block diagonal with cov_noise_star as each block
    cov_noise = matrix(0,nrow = n*T, ncol = n*T)
    for (i in 1:n){
        cov_noise[(1+(i-1)*T):(i*T),(1+(i-1)*T):(i*T)] = cov_noise_star
    }
    ###

    # Create X matrix
    data[1:(n*T),1:p] = mvrnorm(n=n*T,rep(0,p),cov_feature)
    
    # create label y
    data[1:(n*T),p+1] = (f_sim(data[1:(n*T),1:p]) 
                         + mvrnorm(n=1,rep(0,n*T),cov_noise))
    
    return(data)
    
}

# Following pdf file: simulation_AR
# x(i+1) = alpha*x(i) + (1-alpha^2)^0.5*std_normal; in the pdf, alpha = 0.8
sim_3 = function(n,T,cor_feature=0.8,var_noise=1,alpha=0.8){
    p = 400
    p0 = 100
    data = matrix(0,nrow = n*T, ncol = p+1)
    
    #### covariance matrix between features: it is either 0 (independent) or cor_feature ####
    cov_feature = matrix(0,nrow = p, ncol = p)
    # cov within the first three modules
    cov_star = matrix(cor_feature,nrow = p0,ncol = p0)
    diag(cov_star)=1
    # put cov_star into cov_feature
    cov_feature[1:p0,1:p0] = cov_star
    cov_feature[(p0+1):(2*p0),(p0+1):(2*p0)] = cov_star
    cov_feature[(2*p0+1):(3*p0),(2*p0+1):(3*p0)] = cov_star
    cov_feature[(3*p0+1):(4*p0),(3*p0+1):(4*p0)] = diag(p0)
    ####
    
    # create x matrix
    tmp = (1-alpha**2)**0.5
    for (i in 1:n){
        data[1+(i-1)*T,1:p] = mvrnorm(n = 1, rep(0, p), cov_feature)
        for (j in seq_len(T-1)+1){  # time points 2..T; empty when T = 1 (2:T would count down)
            data[j+(i-1)*T,1:p] = (alpha*data[j-1+(i-1)*T,1:p]+
                                    tmp*mvrnorm(n = 1, rep(0, p), cov_feature))
        }
    }
    
    # create y 
    data[1:(n*T),p+1] = ( f_sim(data[1:(n*T),1:p])+ 
                          mvrnorm(n = 1, rep(0,n*T), diag(x=var_noise,n*T)) )
    
    return (data)
}

# CS structure on X
sim_4= function(n,T,cor_feature=0.8,var_noise=1,beta=0.8){
    p = 400
    p0 = 100
    data = matrix(0,nrow = n*T, ncol = p+1)
    
    #### covariance matrix between features: it is either 0 (independent) or cor_feature ####
    cov_feature = matrix(0,nrow = p, ncol = p)
    # cov within the first three modules
    cov_star = matrix(cor_feature,nrow = p0,ncol = p0)
    diag(cov_star)=1
    # put cov_star into cov_feature
    cov_feature[1:p0,1:p0] = cov_star
    cov_feature[(p0+1):(2*p0),(p0+1):(2*p0)] = cov_star
    cov_feature[(2*p0+1):(3*p0),(2*p0+1):(3*p0)] = cov_star
    cov_feature[(3*p0+1):(4*p0),(3*p0+1):(4*p0)] = diag(p0)
    ####
    
    # create X matrix
    tmp = (1-beta**2)**0.5
    for (i in 1:n){
        # note that I put the coefficient beta and tmp in gamma and e directly
        gamma = mvrnorm(n = 1, rep(0, p), cov_feature)*beta
        e = mvrnorm(n = T, rep(0, p), cov_feature)*tmp
        
        for (j in 1:T){
            data[j+(i-1)*T,1:p] = gamma + e[j,] 
        }
    }
    
    # create y 
    data[1:(n*T),p+1] = ( f_sim(data[1:(n*T),1:p])+ 
                          mvrnorm(n = 1, rep(0,n*T), diag(x=var_noise,n*T)) )
    
    return (data)
}


# CS structure on X, but with f_sim_linear
sim_4_linear= function(n,T,cor_feature=0.8,var_noise=1,beta=0.8){
    p = 400
    p0 = 100
    data = matrix(0,nrow = n*T, ncol = p+1)
    
    #### covariance matrix between features: it is either 0 (independent) or cor_feature ####
    cov_feature = matrix(0,nrow = p, ncol = p)
    # cov within the first three modules
    cov_star = matrix(cor_feature,nrow = p0,ncol = p0)
    diag(cov_star)=1
    # put cov_star into cov_feature
    cov_feature[1:p0,1:p0] = cov_star
    cov_feature[(p0+1):(2*p0),(p0+1):(2*p0)] = cov_star
    cov_feature[(2*p0+1):(3*p0),(2*p0+1):(3*p0)] = cov_star
    cov_feature[(3*p0+1):(4*p0),(3*p0+1):(4*p0)] = diag(p0)
    ####
    
    # create X matrix
    tmp = (1-beta**2)**0.5
    for (i in 1:n){
        # note that I put the coefficient beta and tmp in gamma and e directly
        gamma = mvrnorm(n = 1, rep(0, p), cov_feature)*beta
        e = mvrnorm(n = T, rep(0, p), cov_feature)*tmp
        
        for (j in 1:T){
            data[j+(i-1)*T,1:p] = gamma + e[j,] 
        }
    }
    
    # create y 
    data[1:(n*T),p+1] = ( f_sim_linear(data[1:(n*T),1:p])+ 
                          mvrnorm(n = 1, rep(0,n*T), diag(x=var_noise,n*T)) )
    
    return (data)
}

##### sim_quad #####
# still grouped features; there is no time structure on X (as in sim_2)
# instead of an (un)structured error for each patient, every patient now has
# a random intercept
# The first half of the patients are assigned to treatment1 and the others to treatment2
# treatment1 corresponds to a convex quadratic function of time and treatment2 to a concave one
# The response is given by (T1, T2 are the indicators of treatment1, treatment2; med = median(1:T))
# y = f_sim(x) + a1*(t-med)^2*T1 + a2*(t-med)^2*T2 + b (a1>0, a2<0; b is the random intercept)
# In the quadratic term, subtract median(1:T) so that the slope of the 
# quadratic function of time changes sign within t in [1,T] (e.g. [1,5] for T = 5)
# Note: if linear regression is used to estimate the quadratic function of t,
# use time^2 and time (since (t-med)^2 contains a linear term)
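# Note: var_noise and cor_noise are accepted but not used in the body below;
# the within-patient correlation comes from the random intercept only.
sim_quad = function(n,T,cor_feature=0.8,var_noise=1,cor_noise=0.8,a1=5,a2=-5){
    p = 400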
    p0 = 100

    #### covariance matrix between features #####
    cov_feature = matrix(0,nrow = p, ncol = p)
    # cov within the first three modules
    cov_star = matrix(cor_feature,nrow = p0,ncol = p0)
    diag(cov_star)=1
    # put cov_star into cov_feature
    cov_feature[1:p0,1:p0] = cov_star
    cov_feature[(p0+1):(2*p0),(p0+1):(2*p0)] = cov_star
    cov_feature[(2*p0+1):(3*p0),(2*p0+1):(3*p0)] = cov_star
    cov_feature[(3*p0+1):(4*p0),(3*p0+1):(4*p0)] = diag(p0)
    ####
    
    # Create X matrix
    data = mvrnorm(n=n*T,rep(0,p),cov_feature)
    data <- data.frame(data)
    names(data) = paste("V",1:p,sep="")

    #### random intercept for each patient ####
    # random intercepts drawn independently from N(0,1), one per patient
    b = mvrnorm(n = 1, rep(0,n), diag(n))
    data$rand_int = rep(b,each = T)

    data$time <- rep(1:T, n) # time
    # treatment 1 or 2 ,categorical type
    data$treatment[1:(n*T/2)] <- 1 
    data$treatment[((n*T/2)+1):(n*T)] <- 2
    data$treatment = factor(data$treatment)

    # patient information
    data$patient = rep(1:n,each = T)

    # response y
    med = median(1:T)
    data$y = (f_sim(data[1:(n*T),1:p])+ 
        (data$treatment==1)*a1*(data$time-med)^2 + 
        (data$treatment==2)*a2*(data$time-med)^2 + data$rand_int)
    
    return(data)

}
# use the following code to see the result of sim_quad
# data = sim_quad(n=100,T=5)
# plot(data$time[251:500],data$y[251:500])
# plot(data$time[1:250],data$y[1:250])
# plot(data$time,data$y)

#### end sim_quad ####

# Save in a CSV file

In [29]:
n = 100
T = 5

In [30]:
# independent features and no time structure
data1 = sim_1(n,T,cor_feature=0,var_noise=1)
write.csv(data1, file = 'data1.csv')
# grouped features, no time structure (like in paper of Fuzzy Forest)
data2 = sim_1(n,T,cor_feature = 0.8,var_noise=1)
write.csv(data2, file = 'data2.csv')
# grouped features, CS time structure only on y (rows of X are iid)
data3 = sim_2(n,T,cor_feature=0.8,var_noise=1,cor_noise=0.8)
write.csv(data3, file = 'data3.csv')
# grouped features, AR structure on X as in pdf simulation_AR
data4 = sim_3(n,T,cor_feature=0.8,var_noise=1,alpha=0.8)
write.csv(data4, file = 'data4.csv')
# CS on X, grouped features
data5 = sim_4(n,T,cor_feature=0.8,var_noise=1,beta=0.8)
write.csv(data5, file = 'data5.csv')

## Test data for WGCNA with different time-series measures; the data have AR structure (WGCNA_TS test)

In [24]:
# AR on X
data4 = sim_3(n,T,cor_feature=0.3,var_noise=1,alpha=0.8)
write.csv(data4, file = 'data4_hard.csv')

In [31]:
# CS on X, grouped features
data5 = sim_4(n,T,cor_feature=0.3,var_noise=1,beta=0.8)
write.csv(data5, file = 'data5_hard.csv')