In [1]:
library(MASS)

# Create data X and label y

The data set X: see simulation_AR (PDF file) for how X is constructed.
X (T*p) satisfies the following:
* Each row of X has covariance matrix cov_feature, which is block diagonal: three correlated blocks (modules 1 to 3) followed by an identity block (module 4).
* Each column of X has the same longitudinal covariance structure (AR).
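
As a sketch of why both properties can hold at once, assume the rows follow the recursion used in Data_AR. The symbols $\rho$ (for cor_within), $\Sigma$ (for cov_feature) and $Z_t$ (the fresh draw at time $t$, independent of $X_{t-1}$) are introduced here only for illustration:

$$X_{1} \sim N(0,\Sigma), \qquad X_{t} = \rho X_{t-1} + \sqrt{1-\rho^{2}}\,Z_{t}, \qquad Z_{t} \sim N(0,\Sigma)$$

$$\mathrm{Cov}(X_{t}) = \rho^{2}\Sigma + (1-\rho^{2})\Sigma = \Sigma, \qquad \mathrm{Cov}(X_{t},X_{t+h}) = \rho^{h}\Sigma \quad (h \ge 0)$$

Under these assumptions every row keeps the feature covariance $\Sigma$, while each column has lag-$h$ correlation $\rho^{h}$.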

The label y: suppose there are 100*4 features, denoted by $X^{(1)},X^{(2)},...,X^{(400)}$.
Our model is $$y_{t} = 5X^{(1)}_{t}+2X^{(2)}_{t}+2X^{(3)}_{t}+5X^{(2)}_{t}X^{(3)}_{t}
+5X_{t}^{(301)}+2X^{(302)}_{t}+2X^{(303)}_{t}+5X^{(302)}_{t}X^{(303)}_{t}
+\epsilon_{t}$$ where $\epsilon_{t} \sim N(0,\alpha)$ independently across $t$ (equivalently, the noise vector is $N(0,\alpha I)$). $\alpha$ is the noise variance (called var_noise in the code).

Therefore, the expected results of feature selection are:
* Modules 1 and 4 are chosen
* Features 1, 2, 3, 301, 302, and 303 are chosen
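
The mapping from features to modules can be sketched as follows. This is only an illustration: `selected` is a made-up selection result, and all variable names are hypothetical rather than part of the notebook.

```r
# Ground truth from the label model, with p0 = 100 features per module
p0 <- 100
true_features <- c(1, 2, 3, 301, 302, 303)

# Feature j belongs to module ceiling(j / p0)
true_modules <- unique(ceiling(true_features / p0))
print(true_modules)  # modules 1 and 4

# Hypothetical output of some feature-selection method (made up for this sketch)
selected <- c(1, 2, 150, 301, 303)
recovered <- intersect(selected, true_features)  # correctly selected features
false_pos <- setdiff(selected, true_features)    # selected but not in the model
print(recovered)
print(false_pos)
```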

Do not change p0, since the label model depends on it (feature indices 301 to 303 assume p0 = 100).

In [2]:
# Generate X for one sample; X is T*p
# p features and T observations
# p = 4*p0: the first 3*p0 features form 3 modules, independent between modules and
# correlated (cov = 0.8) within each. The last module is independent within itself and
# of the first three
# var_noise: noise variance (used in generating the label y)
# cor_within: the correlation between consecutive time points within each feature
# (AR(1) coefficient). Default is 0.8. If set to 0, observations are independent over
# time (no time structure), but features within modules 1, 2 and 3 are still
# correlated with cov = 0.8

# Returns a T*(p+1) matrix; the last column is the label y
Data_AR = function(T,var_noise,cor_within=0.8){
    
    p0 = 100
    p = 4*p0
    
    #### covariance matrix between features: it is either 0 (independent) or 0.8 ####
    cov_feature = matrix(0,nrow = p, ncol = p)
    # cov within the first three modules
    cov_star = matrix(0.8,nrow = p0,ncol = p0)
    diag(cov_star)=1
    # put cov_star into cov_feature
    cov_feature[1:p0,1:p0] = cov_star
    cov_feature[(p0+1):(2*p0),(p0+1):(2*p0)] = cov_star
    cov_feature[(2*p0+1):(3*p0),(2*p0+1):(3*p0)] = cov_star
    cov_feature[(3*p0+1):(4*p0),(3*p0+1):(4*p0)] = diag(p0)
    ####
    
    #### X_data ####
    # T*p Data matrix, pre-allocate memory
    X_data = matrix(0,nrow = T,ncol = p)
    # the first row
    X_data[1,] = mvrnorm(n = 1, rep(0, p), cov_feature)
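    # the next row depends on the previous one (AR(1) recursion); the scaling
    # keeps the covariance of every row equal to cov_feature
    innovation_scale = sqrt(1-cor_within^2)
    # seq_len(T-1)+1 is 2:T for T >= 2 and empty for T = 1 (2:T would count down)
    for (i in seq_len(T-1)+1){
        X_data[i,] = cor_within*X_data[i-1,]+innovation_scale*mvrnorm(n = 1, rep(0, p), cov_feature)
    }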
    ###
    
    ### create labels y ###
    # pre-allocate a T*1 vector for labels (overwritten below)
    y = matrix(data=0,nrow = T)
    # build y according to our model
    y = (5*X_data[,1]+2*X_data[,2]+2*X_data[,3]+5*X_data[,2]*X_data[,3]
         +5*X_data[,301]+2*X_data[,302]+2*X_data[,303]+5*X_data[,302]*X_data[,303]
         +mvrnorm(n = 1, rep(0, T), diag(x=var_noise,T)))
    ###
    
    # return a T*(p+1) matrix
    return (cbind(X_data,y))
}

# Test

In [3]:
# set.seed(0)
X = Data_AR(T=4,var_noise = 0.3)
X

Unnamed: 0,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,...,Unnamed: 11,Unnamed: 12,Unnamed: 13,Unnamed: 14,Unnamed: 15,Unnamed: 16,Unnamed: 17,Unnamed: 18,Unnamed: 19,y
-0.2914104,-0.63026629,-0.402194,0.1281363,0.5739251,-0.586226,-0.33185775,-0.1390336,-0.3448947,0.1600565,...,0.2527466,0.2805203,-0.3472624,-0.44977872,-0.2505663,1.8829548,1.26308183,-0.03061479,0.7075738,0.04708205
0.4662102,-0.22289197,0.1556311,1.0564471,1.0861187,0.6039013,0.46372418,0.9367936,0.7372861,0.7891657,...,0.2017965,1.1395966,-1.2612957,-0.08603207,-0.8454392,1.7373665,-0.09971447,0.75459947,1.5027533,2.75049244
0.4462323,-0.06042924,0.2413065,0.8997249,0.4897849,0.953736,0.3328894,0.7151182,0.2548791,0.5152065,...,-0.6963344,0.8045154,-0.2101597,-0.75951663,-1.1865856,0.1997048,0.36016685,0.18381273,0.6806605,2.59089045
0.4329178,0.42105181,0.0769341,0.3319665,0.6022696,0.821322,0.04763099,0.8677284,-0.1608225,0.4832928,...,-0.4485637,0.6588786,0.1092213,-1.24946185,-0.4789227,0.594518,0.47763373,0.81424835,-0.2612834,-1.2376035


In [20]:
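# Test whether the covariance over time is correct: with cor_within = 0 the rows are
# independent, so the sample covariances computed below should be close to 0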
k = 1000
x_22 = 1:k
x_32 = 1:k
x_42 = 1:k
for(i in 1:k){
    X = Data_AR(T=4,var_noise = 0.1,cor_within = 0)
    
    x_22[i] = X[2,2]
    x_32[i] = X[3,2]
    x_42[i] = X[4,2]
    
}


In [21]:
cov(x_22,x_32)

In [22]:
cov(x_22,x_42)
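
The same kind of check can be sketched for a non-zero time correlation. The block below does not call Data_AR. It simulates a single unit-variance feature with the same recursion, under the assumption that the lag-h covariance should then be rho^h. All names (`rho`, `n_rep`, `sims`) are chosen for this illustration only.

```r
# Self-contained sketch: empirical vs. theoretical lag covariance of a scalar AR(1)
set.seed(1)
rho <- 0.8       # plays the role of cor_within
n_rep <- 20000   # number of independent replications
n_time <- 4      # number of time points per replication

# Each row of sims is one replication, each column one time point
sims <- matrix(0, nrow = n_rep, ncol = n_time)
sims[, 1] <- rnorm(n_rep)
for (t in 2:n_time) {
  sims[, t] <- rho * sims[, t - 1] + sqrt(1 - rho^2) * rnorm(n_rep)
}

# Covariance between time 2 and time 3 (lag 1), and time 2 and time 4 (lag 2)
empirical <- c(cov(sims[, 2], sims[, 3]), cov(sims[, 2], sims[, 4]))
theoretical <- c(rho^1, rho^2)
print(round(rbind(empirical, theoretical), 3))
```

With this many replications, the empirical values should land close to 0.8 and 0.64, up to sampling noise.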

# Save in a CSV file

In [7]:
p = 400
k = 100 # The number of patients
T = 7 # The number of time points per patient
X_all = matrix(0,nrow = k*T,ncol=p+1) # all the data

for (count in 1:k){
    X_all[(T*(count-1)+1):(T*count),] = Data_AR(T,var_noise = 0.1)
}

In [8]:
write.csv(X_all, file = 'CS_noise0.1.csv')

In [10]:
dim(X_all)
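
A minimal sketch of reading such a file back and recovering one patient's block of rows, using a small stand-in matrix and a temporary file instead of the real output. The names `k_demo`, `T_demo`, `X_demo` and `csv_path` are hypothetical.

```r
# Small stand-in for X_all: 2 patients, 3 time points each, 2 features plus y
k_demo <- 2
T_demo <- 3
X_demo <- matrix(seq_len(k_demo * T_demo * 3), nrow = k_demo * T_demo, ncol = 3)

csv_path <- tempfile(fileext = ".csv")
write.csv(X_demo, file = csv_path)

# write.csv stores row names in the first column; row.names = 1 reads that
# column back as row names instead of as data
X_back <- as.matrix(read.csv(csv_path, row.names = 1))

# Rows of patient `count` are (T*(count-1)+1):(T*count), as in the loop above
count <- 2
patient_2 <- X_back[(T_demo * (count - 1) + 1):(T_demo * count), ]
print(dim(X_back))
print(dim(patient_2))
```

Without `row.names = 1`, the row-name column written by `write.csv` is read back as an extra data column.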

In [None]:
# No time structure
# set cor_within = 0
X_all = matrix(0,nrow = k*T,ncol=p+1) # all the data

for (count in 1:k){
    X_all[(T*(count-1)+1):(T*count),] = Data_AR(T,var_noise = 0.1,cor_within = 0)
}
write.csv(X_all, file = 'NoTime_noise0.1.csv')