## Function to generate simulated data
#### Two domains (X and Y), each with the same number of observations but a different number of variables. Both X and Y contain the same latent variables, which underlie their structure in different ways

I tried to extend the example from Hao-Ting's tutorial so that we can generate larger datasets.
https://github.com/htwangtw/cca_primer/blob/master/cca_notebook.ipynb

In [2]:
import numpy as np

np.random.seed(42)

The `generate_data()` function generates simulated X and Y data with common latent variables underlying their structure in different ways.
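
The two layouts come down to the difference between `np.repeat` and `np.tile`. The following is a minimal sketch on a toy latent matrix (the values are made up for illustration and are not part of the simulation itself):

```python
import numpy as np

# Toy latent matrix: 3 observations, 2 latent variables (columns l1, l2)
l_array = np.array([[1, 10],
                    [2, 20],
                    [3, 30]])

# np.repeat along axis=1 repeats each column in place: [l1, l1, l2, l2]
blocked = np.repeat(l_array, 2, axis=1)

# np.tile repeats the whole matrix side by side: [l1, l2, l1, l2]
alternated = np.tile(l_array, 2)

print(blocked[0])     # [ 1  1 10 10]
print(alternated[0])  # [ 1 10  1 10]
```

X uses the blocked layout and Y the alternated one, so the same latent variables end up in different column positions in the two datasets.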

In [51]:
def generate_data(n_obs, x_nvars, y_nvars, n_latent):
    """Generate simulated X and Y datasets with common latent variables hidden in the data.
    
    X will contain the latent variables as repeated (blocked) columns.
    Structure underlying X: [l1, l1, l1, l2, l2, l2, (l2)]; if x_nvars % n_latent != 0,
    the last column is repeated to fill the remaining columns.
    
    Y will contain the latent variables as alternating columns.
    Structure underlying Y: [l1, l2, l1, l2, l1, l2, (l2)]; if y_nvars % n_latent != 0,
    the last column is repeated to fill the remaining columns.
    
    To add:
    - assertions for:
        n_obs > n_vars 
        n_vars > n_latent
    
    - assertions for expected X and Y shapes
        
    
    Parameters
    ----------
    n_obs : int
        Number of observations (this is common to X and Y)
    x_nvars : int
        Number of (observed) variables in X dataset
    y_nvars : int
        Number of (observed) variables in Y dataset
    n_latent : int
        Number of latent variables hidden in both X and Y datasets
    
    Returns
    -------
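    X : array
        Dataset X of shape (n_obs, x_nvars) containing n_latent hidden latent variables
    Y : array
        Dataset Y of shape (n_obs, y_nvars) containing n_latent hidden latent variables
    lx : array
        Noise-free latent structure underlying X, of shape (n_obs, x_nvars); returned for checking only
    ly : array
        Noise-free latent structure underlying Y, of shape (n_obs, y_nvars); returned for checking only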
    """
    
    # Create latent variables (as columns in l-array)
    l_array = np.ones(shape=(n_obs, n_latent))
    for i in range(n_latent):
        l_array[:,i] = np.random.normal(size=n_obs)    
    
    lx = np.repeat(l_array, x_nvars//n_latent, axis=1)
    if x_nvars % n_latent:  # if x_nvars is not a multiple of n_latent, repeat the last column as many times as necessary
        lx = np.hstack((lx, np.tile(lx[:, [-1]], x_nvars%n_latent)))
    
    ly = np.tile(l_array, y_nvars//n_latent)
    if y_nvars%n_latent:  # same for Y dataset
        ly = np.hstack((ly, np.tile(ly[:, [-1]], y_nvars%n_latent)))
    
    # TODO: check here that the shapes of lx and ly are as expected
    
    X = lx + np.random.normal(size=(n_obs,x_nvars))
    Y = ly + np.random.normal(size=(n_obs,y_nvars))
    
    return X, Y, lx, ly  #returning lx and ly for checkup only
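
The docstring lists assertions that are still to be added. One possible shape for them is sketched below; the helper names `check_generate_data_args` and `check_shapes` are hypothetical and not part of the notebook, and the conditions are simply the ones named in the "To add" list:

```python
import numpy as np


def check_generate_data_args(n_obs, x_nvars, y_nvars, n_latent):
    """Hypothetical helper: validate the inputs before calling generate_data."""
    assert n_latent > 0, "n_latent must be positive"
    for name, n_vars in (("x_nvars", x_nvars), ("y_nvars", y_nvars)):
        # More observations than variables
        assert n_obs > n_vars, f"n_obs ({n_obs}) must exceed {name} ({n_vars})"
        # More observed variables than latent variables
        assert n_vars > n_latent, f"{name} ({n_vars}) must exceed n_latent ({n_latent})"


def check_shapes(X, Y, n_obs, x_nvars, y_nvars):
    """Hypothetical helper: confirm the generated arrays have the expected shapes."""
    assert X.shape == (n_obs, x_nvars), f"unexpected X shape {X.shape}"
    assert Y.shape == (n_obs, y_nvars), f"unexpected Y shape {Y.shape}"


# Usage with the parameter values from the example cell
check_generate_data_args(n_obs=100, x_nvars=7, y_nvars=5, n_latent=2)
check_shapes(np.zeros((100, 7)), np.zeros((100, 5)), n_obs=100, x_nvars=7, y_nvars=5)
```

The second check also guards against the case where `x_nvars` or `y_nvars` is smaller than `n_latent`, in which the integer division yields zero repeats and there is no last column to copy.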

In [53]:
n_obs = 100
n_latent = 2
x_nvars = 7
y_nvars = 5

X, Y, lx, ly = generate_data(n_obs, x_nvars, y_nvars, n_latent)

# lx and ly are the latent structures, returned so that I can check they look OK
print(lx[:5,:])

[[-1.4220585  -1.4220585  -1.4220585   0.54325696  0.54325696  0.54325696
   0.54325696]
 [ 0.21503194  0.21503194  0.21503194 -1.80758965 -1.80758965 -1.80758965
  -1.80758965]
 [ 1.07336658  1.07336658  1.07336658 -0.6896555  -0.6896555  -0.6896555
  -0.6896555 ]
 [ 0.86957252  0.86957252  0.86957252 -0.05321244 -0.05321244 -0.05321244
  -0.05321244]
 [-0.80973547 -0.80973547 -0.80973547  0.25368355  0.25368355  0.25368355
   0.25368355]]


In [54]:
print(ly[:5,:])

[[-1.4220585   0.54325696 -1.4220585   0.54325696  0.54325696]
 [ 0.21503194 -1.80758965  0.21503194 -1.80758965 -1.80758965]
 [ 1.07336658 -0.6896555   1.07336658 -0.6896555  -0.6896555 ]
 [ 0.86957252 -0.05321244  0.86957252 -0.05321244 -0.05321244]
 [-0.80973547  0.25368355 -0.80973547  0.25368355  0.25368355]]
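
To see what the shared latent structure means for the noisy data, here is a minimal sketch that rebuilds single columns inline rather than calling `generate_data`. It assumes the same recipe as above (latent variable plus unit-variance Gaussian noise); the variable names, the sample size and the local random generator are choices made for this illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)  # local generator, separate from np.random.seed above
n_obs = 5000                    # larger than in the example so the estimates are stable

l1 = rng.normal(size=n_obs)
l2 = rng.normal(size=n_obs)

# One column per case, each built as latent variable + unit-variance noise
x_from_l1 = l1 + rng.normal(size=n_obs)
y_from_l1 = l1 + rng.normal(size=n_obs)
y_from_l2 = l2 + rng.normal(size=n_obs)

same_latent = np.corrcoef(x_from_l1, y_from_l1)[0, 1]
diff_latent = np.corrcoef(x_from_l1, y_from_l2)[0, 1]
print(f"same latent: {same_latent:.2f}, different latent: {diff_latent:.2f}")
```

Columns of X and Y driven by the same latent variable should correlate at roughly 0.5 (a covariance of 1 over a variance of 2 each), while columns driven by different latent variables should be close to uncorrelated.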
