# Data Generation
We assume the generator is composed of $k$ parts, each of which is generated by a diffeomorphic function $g_i:[0,1]^l \mapsto \mathbb{R}^m$. The final observation is simply the stacking of the individual parts, i.e., $g:[0,1)^{k\times l} \mapsto \mathbb{R}^{k\times m}$ with $g({\bf z}_1,{\bf z}_2,{\bf z}_3)=[g_1({\bf z}_1), g_2({\bf z}_2), g_3({\bf z}_3)]$.

We consider two different sampling strategies from the latent space:

1. **Random**: Sample uniformly from the full latent distribution $[0,1)^{k \times l}$.
2. **Diagonal**: Sample $\bf v$ uniformly from $[0,1)^l$ and then generate samples according to $g({\bf v},{\bf v},{\bf v})$.

We follow [1,2,3] and design the mixing function as an MLP with
- 2 layers (hidden layer dimension $D$)
- leaky ReLU (with 0.2 negative slope) to ensure invertability
- $L_2$-normalized weight matrices with minimum condition number of 7500 uniformely distributed samples
- same number of units in all layers?
- what about bias?

---
- [1]: A. Hyvarinen and H. Morioka, “Nonlinear ICA of Temporally Dependent Stationary Sources,” in Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, Apr. 2017, pp. 460–469. Accessed: Jul. 06, 2022. [Online]. Available: https://proceedings.mlr.press/v54/hyvarinen17a.html
- [2] A. Hyvarinen and H. Morioka, “Unsupervised Feature Extraction by Time-Contrastive Learning and Nonlinear ICA,” in Advances in Neural Information Processing Systems, 2016, vol. 29. Accessed: Jul. 06, 2022. [Online]. Available: https://proceedings.neurips.cc/paper/2016/hash/d305281faf947ca7acade9ad5c8c818c-Abstract.html
- [3] J. Brady et al., "Provably Learning Object-Centric Representations"

In [1]:
import math
import torch
import torch.nn as nn

def get_generators(k: int, l: int, m: int, D: int=50) -> list[torch.nn.Module]:
    return [build_generator(l, m, D) for _ in range(k)]


@torch.no_grad()
def get_regression_targets(n:int, k: int, l: int, generators: list[torch.nn.Module], sample_mode: str='random') -> tuple[torch.Tensor, torch.Tensor]:
    if sample_mode == 'random':
        z = torch.rand(n, k, l)
    elif sample_mode == 'diagonal':
        z = torch.repeat_interleave(torch.rand(n, l), k, dim=0)
        z = torch.reshape(z, (n, k, l))
    
    x = [torch.stack([generators[j](z[i][j]) for j in range(k)]) for i in range(n)]
    x = torch.stack(x)

    return z, x


def build_generator(l: int, m: int, D: int, slope: float=0.2) -> nn.Sequential:
    g = nn.Sequential(
        nn.Linear(l, D),
        nn.LeakyReLU(slope),
        nn.Linear(D, m),
        nn.LeakyReLU(slope)
    )
    g.apply(init_min_cond)
    return g


# class Generator(torch.nn.Module):
#     def __init__(self, l: int, m: int, D: int):
#         super(Generator, self).__init__()
#         self.fc1 = nn.Linear(l, D)
#         self.relu1 = nn.LeakyReLU(0.2)
#         self.fc2 = nn.Linear(D, m)
#         self.relu2 = nn.LeakyReLU(0.2)
#         self.apply(init_min_cond)
    
#     def forward(self, x):
#         x = self.relu1(self.fc1(x))
#         x = self.relu2(self.fc2(x))
#         return x


def init_min_cond(m: torch.nn.Module, n_samples: int=7500) -> torch.Tensor:
    if isinstance(m, nn.Linear):
        w = m.weight.data
        k = 1 / w.size(0)

        w = torch.nn.functional.normalize(w, p=2)
        cond = condition_number(w)

        for _ in range(n_samples):
            _w = 2 * math.sqrt(k) * torch.rand(w.size()) - math.sqrt(k)
            _w = nn.functional.normalize(_w, p=2)
            _cond = condition_number(_w)

            if _cond < cond:
                w = _w
                cond = _cond
        
        m.weight.data = w


def condition_number(t: torch.Tensor) -> float:
    return torch.norm(t, p=2) / torch.norm(torch.pinverse(t), p=2)


# Models
We consider two feedforward models:
1. An appropriately sized MLP that maps the full $\mathbb{R}^{k \times m}$ to the full latent space $\mathbb [0, 1)^{k\times l}$.
2. A “compositional” model consisting of $k$ MLPs $f_i({\bf x}): \mathbb{R}^{k \times m}\mapsto [0,1)^{l}$ that each map to a subpart of the latents. Most importantly, the first MLP $f_1$ receives $[g_1({\bf z}_1),{\bf 0},{\bf 0}]$ as an input, the second MLP receives $[{\bf 0},g_2({\bf z}_2),{\bf 0}]$, and so forth. Doing that ensures that the model is compositional by design, but the input dimension is as close as possible to 1 (to avoid confounders).

We follow [1] and design the model with
- 2 layers
- hidden layer of size $D = 120$ for the MLP and $D_i = \frac{D}{k}$ for the models in the compositional MLP to roughly match the number of parameters
- LeakyReLU with slope 0.2

---
[1] J. Brady et al., "Provably Learning Object-Centric Representations"

In [26]:
def build_MLP(d_in: int, d_out: int, D: int=120, slope: float=0.2) -> nn.Sequential:
    return nn.Sequential(
        nn.Linear(d_in, D),
        nn.LeakyReLU(slope),
        nn.Linear(D, d_out),
        nn.LeakyReLU(slope)
    )


def MLP(k: int, l: int, m: int, D: int=120):
    return build_MLP(k * m, k * l, D)


class CompositionalMLP(torch.nn.Module):
    def __init__(self, k: int, l: int, m: int, D: int=120):
        super(CompositionalMLP, self).__init__()
        self.k = k
        self.models = nn.ModuleList([build_MLP(k * m, l, round(D / k)) for _ in range(k)])
    
    def forward(self, x):
        x = x.reshape(x.size(0), self.k, -1)
        out = []
        for i in range(len(self.models)):
            x_i = torch.zeros_like(x)
            x_i[:, i, :] = x[:, i, :]
            x_i = torch.flatten(x_i, start_dim = 1)
            out.append(self.models[i](x_i))
        return torch.cat(out, dim=1)
    
    
        

# Train & Test
We do simple supervised regression and evaluate the $R^2$ distance on random samples.

In [16]:
import torch
import torch.nn as nn
import torch.optim as optim
from torchmetrics import R2Score
from tqdm import tqdm


class Dataset(torch.utils.data.Dataset):
    def __init__(self, n: int, k: int, l: int, generators: list[torch.nn.Module], sample_mode: str='random'):
        super(Dataset, self).__init__()
        self.n = n
        self.z, self.x = get_regression_targets(n, k, l, generators, sample_mode)
    
    def __len__(self):
        return self.n
    
    def __getitem__(self, idx):
        return self.x[idx], self.z[idx]


def train(model: torch.nn.Module, trainloader: torch.utils.data.DataLoader, lr: float=0.001, epochs: int=10):
    criterion = nn.MSELoss()
    optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.9)

    # for epoch in tqdm(range(epochs)):
    for epoch in range(epochs):
        cum_loss = 0

        for batch, data in enumerate(trainloader, 0):
            x, z = data

            optimizer.zero_grad()

            out = model(torch.flatten(x, start_dim=1))
            loss = criterion(out, torch.flatten(z, start_dim=1))
            cum_loss += loss
            loss.backward()
            optimizer.step()
        
        cum_loss /= (batch + 1)
    
    return cum_loss


@torch.no_grad()
def test(model: torch.nn.Module, testloader: torch.utils.data.DataLoader):
    cum_score = 0

    for batch, data in enumerate(testloader, 0):
        x, z = data
        out = model(torch.flatten(x, start_dim=1))
        r2score = R2Score(out.size(1))
        score = r2score(out, torch.flatten(z, start_dim=1))
        cum_score += score
    
    cum_score /= (batch + 1)
    return cum_score

In [5]:
import copy

k = 4
l = 2
m = 10

torch.manual_seed(0)

print('Build generators...')
g = get_generators(k, l, m)

Build generators...


In [6]:
print('Build test data...')
te_ds = Dataset(1000, k, l, g, 'random')
te_ldr = torch.utils.data.DataLoader(te_ds, batch_size=1000, shuffle=True)

Build test data...


In [27]:
n = 1000
bs = 4
e = 100

print('Build train data...')
tr_ds_rand = Dataset(n, k, l, g, 'random')
tr_ds_diag = Dataset(n, k, l, g, 'diagonal')
tr_ldr_rand = torch.utils.data.DataLoader(tr_ds_rand, batch_size=bs, shuffle=True)
tr_ldr_diag = torch.utils.data.DataLoader(tr_ds_diag, batch_size=bs, shuffle=True)

print('Build models...')
mlp_rand = MLP(k, l, m)
mlp_diag = copy.deepcopy(mlp_rand)
cmlp_rand = CompositionalMLP(k, l, m)
cmlp_diag = copy.deepcopy(cmlp_rand)

print('Train models...')
for i in range(50):
    loss = train(cmlp_rand, tr_ldr_rand, epochs=10)
    score = test(cmlp_rand, te_ldr)
    print(f'  {i:3d}\tloss: {loss:2.4f}\tR²: {score:2.4f}')
    # train(mlp_diag, tr_ldr_diag, epochs=e)
    # train(cmlp_rand, tr_ldr_rand, epochs=e)
    # train(cmlp_diag, tr_ldr_diag, epochs=e)

Build train data...
Build models...
Train models...
    0	loss: 0.0556	R²: 0.3451
    1	loss: 0.0398	R²: 0.5313
    2	loss: 0.0285	R²: 0.6628
    3	loss: 0.0212	R²: 0.7460
    4	loss: 0.0167	R²: 0.7986
    5	loss: 0.0137	R²: 0.8341
    6	loss: 0.0112	R²: 0.8629
    7	loss: 0.0091	R²: 0.8877
    8	loss: 0.0073	R²: 0.9091
    9	loss: 0.0059	R²: 0.9254
   10	loss: 0.0049	R²: 0.9381
   11	loss: 0.0042	R²: 0.9470
   12	loss: 0.0037	R²: 0.9530
   13	loss: 0.0033	R²: 0.9581
   14	loss: 0.0031	R²: 0.9611
   15	loss: 0.0029	R²: 0.9637
   16	loss: 0.0028	R²: 0.9652
   17	loss: 0.0027	R²: 0.9670
   18	loss: 0.0026	R²: 0.9680
   19	loss: 0.0025	R²: 0.9689
   20	loss: 0.0024	R²: 0.9699
   21	loss: 0.0024	R²: 0.9709
   22	loss: 0.0023	R²: 0.9716
   23	loss: 0.0022	R²: 0.9725
   24	loss: 0.0021	R²: 0.9735
   25	loss: 0.0020	R²: 0.9746
   26	loss: 0.0020	R²: 0.9753
   27	loss: 0.0019	R²: 0.9759
   28	loss: 0.0019	R²: 0.9765
   29	loss: 0.0018	R²: 0.9768
   30	loss: 0.0018	R²: 0.9772
   31	loss: 0.0018