In [1]:
library(MASS)

# Create data X and label y

The data set X: see simulation_AR (PDF file) for how X is constructed.
X ($T \times p$) satisfies the following:
* Each row of X has covariance matrix cov_feature, which is block diagonal: three blocks with pairwise covariance 0.8, followed by an identity block.
* Each column of X has the same longitudinal covariance structure, AR(1).
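
As a sketch of what these two properties imply, the recursion used in `Data_AR` can be written out as follows. The symbols $\rho$, $\Sigma$ and $Z_t$ are notation introduced here for illustration: $\rho$ plays the role of `cor_within`, $\Sigma$ the role of `cov_feature`, and $X_t$ denotes row $t$ of X.

$$X_{1} \sim N(0,\Sigma),\qquad X_{t} = \rho X_{t-1} + \sqrt{1-\rho^{2}}\,Z_{t},\qquad Z_{t}\stackrel{\text{i.i.d.}}{\sim} N(0,\Sigma)$$

$$\operatorname{Cov}(X_{t}) = \rho^{2}\Sigma + (1-\rho^{2})\Sigma = \Sigma,\qquad \operatorname{Cov}(X_{t},X_{t+h}) = \rho^{h}\Sigma$$

Under these assumptions every row keeps covariance $\Sigma$, and a single feature (which has unit variance) has covariance $\rho^{h}$ with itself $h$ time steps later: 0.8 at lag 1 and 0.64 at lag 2 for the default $\rho = 0.8$.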

The label y: there are $4 \times 100 = 400$ features, denoted by $X^{(1)},X^{(2)},\dots,X^{(400)}$.
Our model is $$y_{t} = 5X^{(1)}_{t}+2X^{(2)}_{t}+2X^{(3)}_{t}+5X^{(2)}_{t}X^{(3)}_{t}
+5X^{(301)}_{t}+2X^{(302)}_{t}+2X^{(303)}_{t}+5X^{(302)}_{t}X^{(303)}_{t}
+\epsilon_{t}$$ where $\epsilon_{t}\stackrel{\text{i.i.d.}}{\sim} N(0,\alpha)$ and $\alpha$ is the noise variance (called var_noise in the code).

Therefore, the expected results of feature selection are:
* Modules 1 and 4 are chosen
* Features 1, 2, 3, 301, 302 and 303 are chosen
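
The link between the two bullets can be checked with a small sketch. It assumes the module layout described above (consecutive blocks of p0 = 100 features); the names `true_features` and `true_modules` are introduced here for illustration only.

```r
p0 <- 100                                   # features per module, as in Data_AR
true_features <- c(1, 2, 3, 301, 302, 303)  # features that enter the label model

# Feature j sits in module ceiling(j / p0) when modules are consecutive blocks
true_modules <- unique(ceiling(true_features / p0))
print(true_modules)  # prints: [1] 1 4
```

A feature-selection result can then be compared against `true_features` and `true_modules`, for example with `intersect()`.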

Do not change p0: the label model hard-codes feature indices (301, 302, 303) that assume p0 = 100.

In [2]:
# Generate X and y for one sample; X is T*p
# p features and T observations (time points)
# p = 4*p0: the first 3*p0 features form 3 modules, independent between modules and
# correlated (cov = 0.8) within each. The last module is independent within itself
# and independent of the first three
# var_noise: noise variance (used in generating the label y)
# cor_within: the AR(1) correlation between consecutive time points of the same
# feature. Default is 0.8. If set to 0, the rows are independent over time (no time
# structure), but features within modules 1, 2 and 3 are still correlated with cov = 0.8

# Returns a T*(p+1) matrix; the last column is the label y
Data_AR = function(T,var_noise,cor_within=0.8){
    
    p0 = 100
    p = 4*p0
    
    #### covariance matrix between features: it is either 0 (independent) or 0.8 ####
    cov_feature = matrix(0,nrow = p, ncol = p)
    # cov within the first three modules
    cov_star = matrix(0.8,nrow = p0,ncol = p0)
    diag(cov_star)=1
    # put cov_star into cov_feature
    cov_feature[1:p0,1:p0] = cov_star
    cov_feature[(p0+1):(2*p0),(p0+1):(2*p0)] = cov_star
    cov_feature[(2*p0+1):(3*p0),(2*p0+1):(3*p0)] = cov_star
    cov_feature[(3*p0+1):(4*p0),(3*p0+1):(4*p0)] = diag(p0)
    ####
    
    #### X_data ####
    # T*p Data matrix, pre-allocate memory
    X_data = matrix(0,nrow = T,ncol = p)
    # the first row
    X_data[1,] = mvrnorm(n = 1, rep(0, p), cov_feature)
    # the next row depends on the previous one
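    # innovation scale that keeps the marginal variance of each feature at 1
    innov_sd = sqrt(1-cor_within^2)
    # seq_len(T)[-1] is 2..T and is empty when T = 1 (2:T would count down: 2, 1)
    for (i in seq_len(T)[-1]){
        X_data[i,] = cor_within*X_data[i-1,]+innov_sd*mvrnorm(n = 1, rep(0, p), cov_feature)
    }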
    ###
    
    ### create labels y ###
    # build y according to our model: main effects, two interactions and
    # independent Gaussian noise with variance var_noise
    y = (5*X_data[,1]+2*X_data[,2]+2*X_data[,3]+5*X_data[,2]*X_data[,3]
         +5*X_data[,301]+2*X_data[,302]+2*X_data[,303]+5*X_data[,302]*X_data[,303]
         +rnorm(T, mean = 0, sd = sqrt(var_noise)))
    ###
    
    # return a T*(p+1) matrix
    return (cbind(X_data,y))
}

# Test

In [3]:
# set.seed(0)
X = Data_AR(T=4,var_noise = 0.3)
X

Unnamed: 0,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,...,Unnamed: 11,Unnamed: 12,Unnamed: 13,Unnamed: 14,Unnamed: 15,Unnamed: 16,Unnamed: 17,Unnamed: 18,Unnamed: 19,y
1.0204328,0.2051415,-1.151525,-0.7715811,-1.11874,-0.6981551,0.4361992,-0.7538781,-0.5186276,-0.9985378,...,0.9985381,-0.621768,0.44637086,1.1073235,-0.5938726,-0.7702208,0.2100299,0.2734708,1.0537839,-10.63664
-0.3921831,-0.1964851,-1.471789,-1.2750637,-1.424061,-1.410451,-0.3241805,-1.5499908,-1.3279774,-1.5568053,...,1.2992969,0.605447,0.3347643,0.3347911,-0.6475179,-0.4928394,-0.5458133,1.6622275,1.5158797,-27.15445
-0.4674783,-0.6608124,-1.378136,-1.2613017,-0.991504,-1.3055952,-0.4667863,-1.4567742,-1.1083493,-1.6496095,...,0.538674,0.5227297,0.04814782,1.2561174,0.1060752,0.3302202,-1.1478602,2.4326741,0.9570827,-22.37686
-1.3768253,-1.3983573,-2.228893,-2.2788361,-2.270154,-2.5187522,-1.1220127,-2.5325497,-1.771832,-1.9207698,...,1.1346498,0.8853738,-1.54615766,1.1811409,0.506976,-0.2838566,-1.794427,3.0324935,1.0813412,-18.43324


In [20]:
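# Test whether the longitudinal covariance is correct: each feature has unit
# variance, so the lag-1 covariance should be close to cor_within = 0.8 and the
# lag-2 covariance close to cor_within^2 = 0.64
k = 1000
x_22 = numeric(k)
x_32 = numeric(k)
x_42 = numeric(k)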
for(i in 1:k){
    X = Data_AR(T=4,var_noise = 0.1)
    
    x_22[i] = X[2,2]
    x_32[i] = X[3,2]
    x_42[i] = X[4,2]
    
}


In [21]:
cov(x_22,x_32)

In [22]:
cov(x_22,x_42)
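
The check above draws the full 400-feature data set 1000 times. A lighter sanity check of the same AR(1) property can be sketched on a single standalone feature. This is only an illustration under the assumption that the feature starts from a stationary N(0, 1) draw, as row 1 of X does; the names `rho` and `n_rep` are introduced here and are not part of the notebook.

```r
set.seed(1)      # fixed seed so the check is repeatable
rho <- 0.8       # AR coefficient, same role as cor_within
n_rep <- 20000   # number of independent replicates (assumed value)

# One scalar feature at three consecutive time points, for n_rep replicates
x2 <- rnorm(n_rep)
x3 <- rho * x2 + sqrt(1 - rho^2) * rnorm(n_rep)
x4 <- rho * x3 + sqrt(1 - rho^2) * rnorm(n_rep)

lag1 <- cov(x2, x3)   # should be close to rho = 0.8
lag2 <- cov(x2, x4)   # should be close to rho^2 = 0.64
print(c(lag1 = lag1, lag2 = lag2))
```

The estimates only approximate 0.8 and 0.64; their spread shrinks as `n_rep` grows.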

# Save in a CSV file

In [5]:
p = 400
k = 100 # number of patients
T = 7 # number of time points per patient

In [7]:
# AR structure
X_all = matrix(0,nrow = k*T,ncol=p+1) # all the data

for (count in 1:k){
    X_all[(T*(count-1)+1):(T*count),] = Data_AR(T,var_noise = 0.1)
}

write.csv(X_all, file = 'CS_noise0.1.csv')

In [6]:
# No time structure
# set cor_within = 0
X_all = matrix(0,nrow = k*T,ncol=p+1) # all the data

for (count in 1:k){
    X_all[(T*(count-1)+1):(T*count),] = Data_AR(T,var_noise = 0.1,cor_within = 0)
}
write.csv(X_all, file = 'NoTime_noise0.1.csv')
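
When the saved files are read back, note that `write.csv` writes row names as an extra first column by default. The sketch below illustrates this on a small stand-in matrix written to a temporary file; `m`, `path`, `with_index` and `clean` are names introduced here, not part of the notebook.

```r
# Small stand-in for X_all
m <- matrix(as.numeric(1:12), nrow = 3, ncol = 4)
path <- tempfile(fileext = ".csv")
write.csv(m, file = path)   # default row.names = TRUE adds a leading index column

with_index <- read.csv(path)             # index column comes back as data
clean <- read.csv(path, row.names = 1)   # first column is used as row names instead

print(dim(with_index))  # prints: [1] 3 5
print(dim(clean))       # prints: [1] 3 4
```

Reading with `row.names = 1`, or writing with `row.names = FALSE`, keeps the column count at p + 1 so that the last column is still y.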