# Data Generation
We assume the generator is composed of $k$ parts, each of which is generated by a diffeomorphic function $g_i:[0,1)^l \to \mathbb{R}^m$. The final observation is simply the stacking of the individual parts, i.e., $g:[0,1)^{k\times l} \to \mathbb{R}^{k\times m}$ with $g({\bf z}_1,\dots,{\bf z}_k)=[g_1({\bf z}_1),\dots,g_k({\bf z}_k)]$.

We consider two different sampling strategies from the latent space:

1. **Random**: Sample uniformly from the full latent distribution $[0,1)^{k \times l}$.
2. **Diagonal**: Sample ${\bf v}$ uniformly from $[0,1)^l$ and then generate samples according to $g({\bf v},\dots,{\bf v})$, i.e., all $k$ parts share the same latent.

We follow [1,2,3] and design the mixing function as an MLP with
- 2 layers (hidden layer dimension $D$)
- leaky ReLU (with 0.2 negative slope) to ensure invertibility
- $L_2$-normalized weight matrices, chosen as the candidate with the smallest condition number among 7500 uniformly distributed samples
- open question: should all layers have the same number of units?
- open question: how should the bias be handled?
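
The invertibility argument for the leaky ReLU can be checked directly: for any positive slope the activation is strictly increasing, so it has a closed-form inverse. The following is a minimal sketch in plain Python; the helper names `leaky_relu` and `leaky_relu_inverse` are illustrative and not part of this notebook. It covers only the activation. The linear layers additionally need full column rank for the whole network to be injective.

```python
import math

SLOPE = 0.2  # negative slope used for the mixing MLP


def leaky_relu(x: float, slope: float = SLOPE) -> float:
    """Identity for x >= 0, scaled by `slope` otherwise."""
    return x if x >= 0 else slope * x


def leaky_relu_inverse(y: float, slope: float = SLOPE) -> float:
    """Undo leaky_relu; requires slope > 0 so the map is strictly increasing."""
    return y if y >= 0 else y / slope


# round trip over a few negative, zero, and positive inputs
values = [-3.0, -0.5, 0.0, 0.7, 2.0]
recovered = [leaky_relu_inverse(leaky_relu(v)) for v in values]
round_trip_ok = all(math.isclose(r, v, abs_tol=1e-12) for r, v in zip(recovered, values))
```

With a slope of 0 (a plain ReLU) the inverse would not exist, since all negative inputs collapse to zero.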

---
- [1] A. Hyvarinen and H. Morioka, “Nonlinear ICA of Temporally Dependent Stationary Sources,” in Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, Apr. 2017, pp. 460–469. Accessed: Jul. 06, 2022. [Online]. Available: https://proceedings.mlr.press/v54/hyvarinen17a.html
- [2] A. Hyvarinen and H. Morioka, “Unsupervised Feature Extraction by Time-Contrastive Learning and Nonlinear ICA,” in Advances in Neural Information Processing Systems, 2016, vol. 29. Accessed: Jul. 06, 2022. [Online]. Available: https://proceedings.neurips.cc/paper/2016/hash/d305281faf947ca7acade9ad5c8c818c-Abstract.html
- [3] J. Brady et al., "Provably Learning Object-Centric Representations"

In [72]:
import math
import torch
import torch.nn as nn

torch.manual_seed(0)

def get_generators(k: int, l: int, m: int, D: int=50) -> list[torch.nn.Module]:
    return [build_generator(l, m, D) for _ in range(k)]


@torch.no_grad()
def get_regression_targets(n:int, k: int, l: int, generators: list[torch.nn.Module], sample_mode: str='random') -> tuple[torch.Tensor, torch.Tensor]:
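    if sample_mode == 'random':
        # independent latents for each of the k parts
        z = torch.rand(n, k, l)
    elif sample_mode == 'diagonal':
        # one latent v per sample, shared by all k parts
        z = torch.repeat_interleave(torch.rand(n, l), k, dim=0)
        z = torch.reshape(z, (n, k, l))
    else:
        raise ValueError(f"unknown sample_mode: {sample_mode!r}")

    # apply generator j to the j-th latent block of all n samples at once
    x = torch.stack([generators[j](z[:, j]) for j in range(k)], dim=1)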

    return z, x


def build_generator(l: int, m: int, D: int, slope: float=0.2) -> nn.Sequential:
    g = nn.Sequential(
        nn.Linear(l, D),
        nn.LeakyReLU(slope),
        nn.Linear(D, m),
        nn.LeakyReLU(slope)
    )
    g.apply(init_min_cond)
    return g


# class Generator(torch.nn.Module):
#     def __init__(self, l: int, m: int, D: int):
#         super(Generator, self).__init__()
#         self.fc1 = nn.Linear(l, D)
#         self.relu1 = nn.LeakyReLU(0.2)
#         self.fc2 = nn.Linear(D, m)
#         self.relu2 = nn.LeakyReLU(0.2)
#         self.apply(init_min_cond)
    
#     def forward(self, x):
#         x = self.relu1(self.fc1(x))
#         x = self.relu2(self.fc2(x))
#         return x


def init_min_cond(m: torch.nn.Module, n_samples: int=7500) -> None:
    if isinstance(m, nn.Linear):
        w = m.weight.data
        k = 1 / w.size(0)

        w = torch.nn.functional.normalize(w, p=2)
        cond = condition_number(w)

        for _ in range(n_samples):
            _w = 2 * math.sqrt(k) * torch.rand(w.size()) - math.sqrt(k)
            _w = nn.functional.normalize(_w, p=2)
            _cond = condition_number(_w)

            if _cond < cond:
                w = _w
                cond = _cond
        
        m.weight.data = w


def condition_number(t: torch.Tensor) -> float:
    # 2-norm condition number: largest singular value divided by the smallest
    return torch.linalg.cond(t).item()


In [73]:
g = get_generators(4, 2, 10)

# Models
We consider two feedforward models:
1. An appropriately sized MLP that maps the full observation space $\mathbb{R}^{k \times m}$ to the full latent space $[0, 1)^{k\times l}$.
2. A “compositional” model consisting of $k$ MLPs $f_i: \mathbb{R}^{k \times m}\to [0,1)^{l}$ that each map to one part of the latents. Most importantly, the first MLP $f_1$ receives $[g_1({\bf z}_1),{\bf 0},\dots,{\bf 0}]$ as input, the second MLP $f_2$ receives $[{\bf 0},g_2({\bf z}_2),{\bf 0},\dots,{\bf 0}]$, and so forth. This makes the model compositional by design while keeping the input dimension as close as possible to that of model 1 (to avoid confounders).

We follow [1] and design the model with
- 2 layers
- hidden layer of size $D = 120$ for the MLP and $D_i = \frac{D}{k}$ for the models in the compositional MLP to roughly match the number of parameters
- LeakyReLU with slope 0.2
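
To see how closely $D_i = \frac{D}{k}$ matches the parameter budgets, the counts can be worked out by hand. The sketch below is plain Python and assumes two `Linear` layers with biases per model (the activations add no parameters); the helper `two_layer_params` is a hypothetical name used only for this illustration.

```python
def two_layer_params(d_in: int, d_hidden: int, d_out: int) -> int:
    """Weights plus biases of Linear(d_in, d_hidden) followed by Linear(d_hidden, d_out)."""
    return (d_in + 1) * d_hidden + (d_hidden + 1) * d_out


k, l, m, D = 4, 2, 10, 120  # example sizes, same as the calls in this notebook

# model 1: one MLP from all k*m inputs to all k*l latents
mlp_params = two_layer_params(k * m, D, k * l)

# model 2: k MLPs, each from k*m inputs to l latents with hidden size D / k
cmlp_params = k * two_layer_params(k * m, round(D / k), l)

print(mlp_params, cmlp_params)  # 5888 5168
```

For these sizes the compositional model has somewhat fewer parameters, so the match is approximate rather than exact.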

---
[1] J. Brady et al., "Provably Learning Object-Centric Representations"

In [86]:
def build_MLP(d_in: int, d_out: int, D: int=120, slope: float=0.2) -> nn.Sequential:
    return nn.Sequential(
        nn.Linear(d_in, D),
        nn.LeakyReLU(slope),
        nn.Linear(D, d_out),
        nn.LeakyReLU(slope)
    )


def MLP(k: int, l: int, m: int, D: int=120):
    return build_MLP(k * m, k * l, D)


class CompositionalMLP(torch.nn.Module):
    def __init__(self, k: int, l: int, m: int, D: int=120):
        super().__init__()
        # nn.ModuleList registers the sub-models so their parameters are visible to optimizers
        self.models = nn.ModuleList([build_MLP(k * m, l, round(D / k)) for _ in range(k)])
    
    def forward(self, x):
        # x: (n, k, m); returns (n, k * l), the same layout as the plain MLP output
        out = []
        for i, model in enumerate(self.models):
            # keep only part i and zero out all other parts
            x_i = torch.zeros_like(x)
            x_i[:, i, :] = x[:, i, :]
            x_i = torch.flatten(x_i, start_dim=1)
            out.append(model(x_i))
        return torch.cat(out, dim=1)
    
    
        

In [87]:
mlp = MLP(4, 2, 10)
cmlp = CompositionalMLP(4, 2, 10)

# Train

In [88]:
z, x = get_regression_targets(100, 4, 2, g)