# Recap and background
Here we recap our findings so far and review some background on continual learning from the point of view of Richard Sutton.

*Outline* We proved in the draft that if we hyperclone a model in a super-symmetric way (more precisely, so that both forward and backward symmetry hold), then the forward and backward vectors of the network are cloned. More concretely, the forward and backward vectors of the larger model are essentially cloned (duplicated) versions of those of the smaller model from which it was cloned. This has a dramatic consequence: we can perfectly predict the training dynamics of the larger model with a smaller model, with the only caveat that the learning rates for different layers must be set in a layer- and module-dependent manner. This suggests that this particular way of cloning may catastrophically limit the model's ability to learn, because the larger model is at best as good as the smaller one. This brings the result close to the notion of *loss of plasticity* in the continual learning literature, which we briefly review here.
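To make the cloning claim concrete, here is a minimal sketch under stated assumptions. It does not reproduce the construction from the draft: it assumes one simple symmetric cloning scheme for a one-hidden-layer MLP, in which the hidden units are duplicated and the outgoing weights are halved. The names `small`, `large`, and the sizes `d`, `h`, `o` are hypothetical and chosen only for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d, h, o = 8, 4, 3  # hypothetical input, hidden, and output sizes

small = nn.Sequential(nn.Linear(d, h), nn.Tanh(), nn.Linear(h, o))
large = nn.Sequential(nn.Linear(d, 2 * h), nn.Tanh(), nn.Linear(2 * h, o))

# Assumed cloning scheme: duplicate the hidden units and halve the outgoing
# weights, so the wide model computes the same function as the small one.
with torch.no_grad():
    large[0].weight.copy_(torch.cat([small[0].weight, small[0].weight], dim=0))
    large[0].bias.copy_(torch.cat([small[0].bias, small[0].bias], dim=0))
    large[2].weight.copy_(torch.cat([small[2].weight, small[2].weight], dim=1) / 2)
    large[2].bias.copy_(small[2].bias)

x = torch.randn(5, d)
y = torch.randn(5, o)

nn.functional.mse_loss(small(x), y).backward()
nn.functional.mse_loss(large(x), y).backward()

# Forward symmetry: both models produce the same outputs.
outputs_match = torch.allclose(small(x), large(x), atol=1e-5)

# Backward symmetry: the two halves of the first-layer gradient are identical,
# and each equals the small model's gradient up to a layer-dependent factor (1/2 here).
g1 = large[0].weight.grad
halves_match = torch.allclose(g1[:h], g1[h:], atol=1e-5)
scaled_match = torch.allclose(g1[:h], small[0].weight.grad / 2, atol=1e-5)

print(outputs_match, halves_match, scaled_match)
```

Because the two halves receive identical gradients, a plain gradient step keeps them identical, so in this toy setting the wide model never leaves the subspace spanned by the small model. The factor of 1/2 on the first-layer gradient illustrates why the matching learning rates end up layer-dependent.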

*Richard Sutton's view on Continual Learning (CL)*: In a series of works, Richard Sutton proposes that one of the most fundamental p

In [6]:
import torch
import torch.nn as nn
import torch.func as functorch

# Define a customizable MLP with a Sequential container
class CustomizableMLP(nn.Module):
    def __init__(self, input_size, hidden_sizes, output_size, activation=nn.ReLU):
        """
        Parameters:
            input_size (int): Number of input features.
            hidden_sizes (list of int): Sizes of the hidden layers.
            output_size (int): Number of output features.
            activation (nn.Module): Activation function (e.g., nn.ReLU, nn.Tanh).
        """
        super().__init__()
        layers = []
        in_features = input_size
        for hidden_size in hidden_sizes:
            layers.append(nn.Linear(in_features, hidden_size))
            layers.append(activation())
            in_features = hidden_size
        # Output layer (no activation here, but you can add one if desired)
        layers.append(nn.Linear(in_features, output_size))
        self.layers = nn.Sequential(*layers)
        
    def forward(self, x):
        return self.layers(x)

# Instantiate the model
input_size = 784
hidden_sizes = [256, 128, 64]
output_size = 10
model = CustomizableMLP(input_size, hidden_sizes, output_size, activation=nn.ReLU)

# --- Slicing the model ---
# Let's say we want the Jacobian of the network output with respect to the
# (pre-activation) output of the first Linear layer.
# Since our Sequential container is a list of modules, you can slice it:
#   slice1 = the first Linear layer, Linear(input_size, 256)
#   slice2 = everything after it (ReLU, the remaining hidden blocks, and the output layer)
slice1 = model.layers[0:1]
slice2 = model.layers[1:]

# For a single input vector (shape: [input_size]), h_single has shape [256],
# so the Jacobian of slice2 at h_single has shape [output_size, 256]
x_single = torch.randn(input_size)
h_single = slice1(x_single) 
jacobian_single = functorch.jacrev(slice2)(h_single)
print("Jacobian for single input shape:", jacobian_single.shape)
# slice2 maps the 256 first-layer features to 10 outputs, so jacobian_single.shape is [10, 256]

# For a batch of inputs, we can use vmap to compute a Jacobian per input.
# Assume a batch of 5 inputs (shape: [5, input_size])
x_batch = torch.randn(5, input_size)
h_batch = slice1(x_batch)
jacobian_batch = functorch.vmap(functorch.jacrev(slice2))(h_batch)
print("Jacobian for batch input shape:", jacobian_batch.shape)
# Expected shape: [batch, output_size, 256], i.e., [5, 10, 256]


Jacobian for single input shape: torch.Size([10, 256])
Jacobian for batch input shape: torch.Size([5, 10, 256])
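
As a sanity check on the slicing above, the following sketch assumes the same layer sizes as the cell above but rebuilds the network as a plain `nn.Sequential` so that it is self-contained. It verifies the chain rule: the Jacobian of the full network with respect to the input should equal the Jacobian of `slice2` multiplied by the weight matrix of the first Linear layer. The names `layers`, `J_tail`, `J_full`, and `J_chain` are introduced here only for illustration.

```python
import torch
import torch.nn as nn
from torch.func import jacrev

torch.manual_seed(0)

# Same architecture as above: 784 -> 256 -> 128 -> 64 -> 10 with ReLU activations.
layers = nn.Sequential(
    nn.Linear(784, 256), nn.ReLU(),
    nn.Linear(256, 128), nn.ReLU(),
    nn.Linear(128, 64), nn.ReLU(),
    nn.Linear(64, 10),
)
slice1, slice2 = layers[0:1], layers[1:]

x = torch.randn(784)
h = slice1(x).detach()  # output of the first Linear layer, shape [256]

J_tail = jacrev(slice2)(h)   # d(output)/d(h), shape [10, 256]
J_full = jacrev(layers)(x)   # d(output)/d(x), shape [10, 784]

# Chain rule: d(output)/d(x) = d(output)/d(h) @ d(h)/d(x), and d(h)/d(x) is the
# weight matrix of the first Linear layer.
J_chain = J_tail @ slice1[0].weight.detach()

chain_rule_holds = torch.allclose(J_full, J_chain, atol=1e-5)
print(J_tail.shape, J_full.shape, chain_rule_holds)
```

Slicing after the first Linear layer, rather than after its activation, means `J_tail` captures the sensitivity of the output to the pre-activation features, which is why its last dimension is 256 rather than 784.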
