# Condensation Phenomenon 
The condensation phenomenon describes how, during the nonlinear training of a neural network, neurons in the same layer tend to condense into groups with similar outputs. The accompanying code demonstrates this effect with a fully connected network trained to fit a scalar-valued function of a five-dimensional input.

# Related Papers
[1] Tao Luo#, Zhi-Qin John Xu#, Zheng Ma, Yaoyu Zhang*, Phase diagram for two-layer ReLU neural networks at infinite-width limit. arXiv:2007.07497 (2020), Journal of Machine Learning Research (2021), [pdf](https://ins.sjtu.edu.cn/people/xuzhiqin/pub/phasediagram2020.pdf), and in [arXiv](https://arxiv.org/abs/2007.07497).

[2] Hanxu Zhou, Qixuan Zhou, Tao Luo, Yaoyu Zhang*, Zhi-Qin John Xu*, Towards Understanding the Condensation of Neural Networks at Initial Training. arXiv:2105.11686 (2021), NeurIPS 2022, [pdf](https://ins.sjtu.edu.cn/people/xuzhiqin/pub/initial2105.11686.pdf), and in [arXiv](https://arxiv.org/abs/2105.11686), see slides and [video talk in Chinese](https://www.bilibili.com/video/BV1tb4y1d7CZ/?spm_id_from%253D333.999.0.0).

[3] Zhi-Qin John Xu*, Yaoyu Zhang, Zhangchen Zhou, An overview of condensation phenomenon in deep learning. arXiv:2504.09484 (2025), [pdf](https://ins.sjtu.edu.cn/people/xuzhiqin/pub/condensationoverview2025.pdf), and in [arXiv](https://arxiv.org/abs/2504.09484).

For more details, see [xuzhiqin condense](https://ins.sjtu.edu.cn/people/xuzhiqin/pubcondense.html).

In [37]:
# Import the commonly used libraries
import os
import time
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from typing import List
import argparse
import matplotlib.pyplot as plt
import datetime
import re
import copy


# Condense 
![Condense](./pic/condense.png)
This is an idealized illustration of condensation. At initialization, the input weights of the neurons differ greatly because they are drawn at random; this is represented by different colors. After a period of training, the hidden neurons split into two groups: the first two neurons form one group, and the last three form another. Within each group, the input weights of the neurons are exactly the same (same color), so their outputs are also identical.
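
As a small illustration of the last point, the sketch below builds a hypothetical hidden layer that is already condensed and checks that neurons in the same group produce identical outputs. The weights, biases, and input samples are made up for the example and are not taken from a trained network.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=(4, 5))        # 4 samples, 5 input features

# Hypothetical condensed layer: neurons 0-1 share one input weight vector and
# bias, neurons 2-4 share another (illustrative values only).
w_a, w_b = rng.normal(size=5), rng.normal(size=5)
W = np.stack([w_a, w_a, w_b, w_b, w_b])        # shape (5 neurons, 5 inputs)
b = np.array([0.1, 0.1, -0.2, -0.2, -0.2])

hidden = np.tanh(x @ W.T + b)                  # shape (4 samples, 5 neurons)

# Neurons within a group give the same output on every sample.
same_group_a = np.allclose(hidden[:, 0], hidden[:, 1])
same_group_b = (np.allclose(hidden[:, 2], hidden[:, 3])
                and np.allclose(hidden[:, 3], hidden[:, 4]))
print(same_group_a, same_group_b)
```

In a real training run the weights inside a group are only approximately aligned, which is why the notebook later measures cosine similarity instead of testing for exact equality.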

#   Default configuration parameter settings.
`argparse` is a Python package that provides a convenient way to parse command line arguments. It allows us to define the arguments our program expects and will parse them for us. This makes it easy to write user-friendly command-line interfaces for our programs.

To use `argparse`, we first create an `ArgumentParser` object, which will hold all the information necessary to parse the command-line arguments. We then define the arguments we expect using the `add_argument` method. This method takes several parameters, such as the name of the argument, its type, and a help message.

In this code, we are using `argparse` to parse the command-line arguments that are passed to the program. We define several arguments, such as the learning rate, optimizer, and number of epochs, and then parse them using `args, unknown = parser.parse_known_args()`. This allows us to easily customize the behavior of our program without having to modify the code itself.

Please note that you should specify the path to your own directory for saving experiment results in `ini_path`.
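
The following minimal sketch shows how `parse_known_args` differs from `parse_args`: unrecognized arguments are returned in a second list instead of raising an error, which is convenient when extra arguments are passed by the environment that launches the code. The parser name, the argument list, and the `-f kernel.json` pair are hypothetical and only simulate such extra arguments.

```python
import argparse

demo_parser = argparse.ArgumentParser(description='parse_known_args demo')
demo_parser.add_argument('--lr', default=0.001, type=float, help='learning rate')
demo_parser.add_argument('--epochs', default=100, type=int, help='number of epochs')

# Simulated command line: '--lr' is declared above, while '-f kernel.json'
# stands in for arguments the parser does not know about.
demo_args, demo_unknown = demo_parser.parse_known_args(
    ['--lr', '0.01', '-f', 'kernel.json'])

print(demo_args.lr)      # value taken from the command line
print(demo_args.epochs)  # default value, since '--epochs' was not passed
print(demo_unknown)      # leftover arguments that were not recognized
```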

In [38]:
parser = argparse.ArgumentParser(description='PyTorch 1D dataset Training')



parser.add_argument('--lr', default=0.00001, type=float, help='learning rate')
parser.add_argument('--optimizer', default='adam', help='optimizer: sgd | adam')
parser.add_argument('--epochs', default=2000, type=int, metavar='N', help='number of total epochs to run')
parser.add_argument('--test_size', default=10000, type=int, help='the test size for model (default: 10000)')
parser.add_argument('--save', default='trained_nets', help='path to save trained nets')
parser.add_argument('--save_epoch', default=10, type=int, help='save every save_epochs')
parser.add_argument('--rand_seed', default=0, type=int, help='seed for random num generator')
parser.add_argument('--gamma', type=float, default=2, help='parameter initialization distribution variance power (we first assume that each layer has the same width)')
parser.add_argument('--boundary', nargs='+', type=str, default=['-1', '1'], help='the boundary of 1D data')
parser.add_argument('--training_size', default=80, type=int, help='the training size for model (default: 80)')
parser.add_argument('--act_func_name', default='Tanh', help='activation function')
parser.add_argument('--hidden_layers_width', nargs='+', type=int, default=[50], help='widths of the hidden layers')
parser.add_argument('--input_dim', default=5, type=int, help='the input dimension for model (default: 5)')
parser.add_argument('--output_dim', default=1, type=int, help='the output dimension for model (default: 1)')
# Fall back to the CPU when no CUDA device is available
parser.add_argument('--device', default='cuda' if torch.cuda.is_available() else 'cpu', type=str, help='device used to train (cpu or cuda)')
parser.add_argument('--plot_epoch', default=100, type=int, help='step size of plotting interval (default: 100)')
parser.add_argument('--ini_path', default='/home/zhouzhangchen/condensation', type=str, help='the path to save experiment results')

args, unknown = parser.parse_known_args()


#   Make the Directories of the Experiments

In [39]:
def mkdirs(fn):  # Create the directory (and any missing parents) if needed
    if not os.path.isdir(fn):
        os.makedirs(fn)
    return fn


def create_save_dir(path_ini): 
    subFolderName = re.sub(r'[^0-9]', '', str(datetime.datetime.now()))
    path = os.path.join(path_ini, subFolderName)
    mkdirs(path)
    # mkdirs(os.path.join(path, 'output'))
    return path


args.path = create_save_dir(args.ini_path)


# Generate data for training and testing

The target function is:

\begin{equation}
f(\boldsymbol{x})=\sum_{k=1}^{5} 3.5\sin(5x_k+1)
\end{equation}
where the data $\boldsymbol{x} = (x_1,x_2,x_3,x_4,x_5)$ are sampled uniformly from $[-4,2]^5$.
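
As a quick sanity check of this definition, the NumPy sketch below evaluates the target on uniformly sampled points and at the origin, where every term equals $3.5\sin(1)$. The names `target`, `x_demo`, and `f_origin` are introduced only for this illustration; the notebook itself uses the PyTorch function `get_y`.

```python
import numpy as np

def target(x):
    """f(x) = sum_k 3.5 * sin(5 * x_k + 1), applied row-wise to an (n, 5) array."""
    return 3.5 * np.sin(5.0 * x + 1.0).sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
x_demo = rng.uniform(-4.0, 2.0, size=(1000, 5))   # uniform on [-4, 2) per coordinate
y_demo = target(x_demo)                           # shape (1000, 1)

# At the origin each of the five terms is 3.5 * sin(1), so f(0) = 17.5 * sin(1).
f_origin = target(np.zeros((1, 5)))[0, 0]
print(y_demo.shape, f_origin)
```

Since each term is bounded by 3.5 in absolute value, the target always lies in $[-17.5, 17.5]$.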


In [None]:
def get_y(x):  # Function to fit
    y = 3.5 * torch.sum(torch.sin(5*x+1),dim=1)
    y = torch.unsqueeze(y, 1)
    return y

for i in range(2):
    if isinstance(args.boundary[i], str):
        args.boundary[i] = eval(args.boundary[i])

args.test_input = torch.rand((args.test_size, args.input_dim)) * 6 - 4

args.training_input = torch.rand((args.training_size, args.input_dim)) * 6 - 4
args.test_target = get_y(args.test_input)
args.training_target = get_y(args.training_input)



# Define activation functions
In addition to the built-in activations, we define the custom activation function $\sigma(x)= x\tanh(x)$.

In [41]:
class xtanh(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, x):
        # x * tanh(x): an even, smooth activation
        return x * torch.tanh(x)


# Network model and parameter initialization.

Given $\theta\in \mathbb{R}^M$, the FNN function $f_{\theta}(\cdot)$ is defined recursively. First, we denote $f^{[0]}_{\theta}(x)=x$ for all $x\in\mathbb{R}^d$. Then, for $l\in[L-1]$, $f^{[l]}_{\theta}$ is defined recursively as 
$f^{[l]}_{\theta}(x)=\sigma (W^{[l]} f^{[l-1]}_{\theta}(x)+b^{[l]})$, where $\sigma$ is a non-linear activation function.
Finally, we denote
\begin{equation*}
    f_{\theta}(x)=f(x,\theta)=f^{[L]}_{\theta}(x)=W^{[L]} f^{[L-1]}_{\theta}(x)+b^{[L]}.
\end{equation*}

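The parameters are initialized from a Gaussian distribution,

\begin{equation*}
    \theta_{l} \sim N\left(0, \frac{1}{m_{l}^{\gamma}}\right),
\end{equation*}
where the $l$-th layer parameters of $\theta$ are the ordered pair $\theta_{l}=\Big(W^{[l]},b^{[l]}\Big),\quad l\in[L]$, and $m_{l}$ is the width of the $l$-th layer. In the code below, $1/m^{\gamma}$ is passed to `nn.init.normal_` as the standard deviation, and $m$ is taken to be the width of the first hidden layer for every layer.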
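
To make the role of the scale concrete, the sketch below initializes a single `nn.Linear` layer with standard deviation $1/m^{\gamma}$ and compares it with the empirical standard deviation of the weights. The width `m = 50`, the exponent `gamma = 1.0`, and the input size of 200 are illustrative choices, not the settings used later in the notebook.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
m, gamma = 50, 1.0                       # illustrative width and exponent
scale = 1.0 / m ** gamma                 # 1 / m^gamma = 0.02

layer = nn.Linear(200, m)                # weight shape (50, 200): 10,000 entries
# nn.init.normal_(tensor, mean, std): the third argument is a standard deviation
nn.init.normal_(layer.weight, mean=0.0, std=scale)

empirical_std = layer.weight.detach().std().item()
print(scale, empirical_std)              # the two values should be close
```

A larger `gamma` gives a smaller initialization scale for the same width.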

In [42]:
class Linear(nn.Module):
    def __init__(self, gamma, hidden_layers_width=[100],  input_size=20, num_classes: int = 1000, act_layer: nn.Module = nn.ReLU()):
        super(Linear, self).__init__()
        self.num_classes = num_classes
        self.input_size = input_size
        self.hidden_layers_width = hidden_layers_width
        self.gamma = gamma
        layers: List[nn.Module] = []
        self.layers_width = [self.input_size]+self.hidden_layers_width
        for i in range(len(self.layers_width)-1):
            layers += [nn.Linear(self.layers_width[i],
                                    self.layers_width[i+1]), act_layer]
        layers += [nn.Linear(self.layers_width[-1], num_classes, bias=False)]
        self.features = nn.Sequential(*layers)
        self._initialize_weights()


    def forward(self, x):

        x = x.view(x.size(0), -1)
        x = self.features(x)
        return x

    def _initialize_weights(self) -> None:

        for obj in self.modules():
            if isinstance(obj, (nn.Linear, nn.Conv2d)):
                nn.init.normal_(obj.weight.data, 0, 1 /
                                self.hidden_layers_width[0]**(self.gamma))
                if obj.bias is not None:
                    nn.init.normal_(obj.bias.data, 0, 1 /
                                    self.hidden_layers_width[0]**(self.gamma))


def get_act_func(act_func):
    if act_func == 'Tanh':
        return nn.Tanh()
    elif act_func == 'ReLU':
        return nn.ReLU()
    elif act_func == 'Sigmoid':
        return nn.Sigmoid()
    elif act_func == 'xTanh':
        return xtanh()
    else:
        raise NameError('No such act func!')


act_func = get_act_func(args.act_func_name)

model = Linear(args.gamma, args.hidden_layers_width, args.input_dim,
               args.output_dim, act_func).to(args.device)

para_init = copy.deepcopy(model.state_dict())


# One-step training function.

The training data set is denoted as  $S=\{(x_i,y_i)\}_{i=1}^n$, where $x_i\in\mathbb{R}^d$ and $y_i\in \mathbb{R}^{d'}$. For simplicity, we assume an unknown function $y$ satisfying $y(x_i)=y_i$ for $i\in[n]$. The empirical risk reads as
\begin{equation*}
    R_S(\theta)=\frac{1}{n}\sum_{i=1}^n\ell(f(x_i,\theta),y(x_i)),
\end{equation*}
where the loss function $\ell(\cdot,\cdot)$ is differentiable and the derivative of $\ell$ with respect to its first argument is denoted by $\nabla\ell(y,y^*)$. 

For one step of gradient descent with learning rate $\eta$, we have

\begin{equation*}
    \theta_{t+1}=\theta_t-\eta\nabla R_S(\theta_t).
\end{equation*}
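
The update rule can be written out by hand for a toy problem. The sketch below assumes a linear model $f(x,\theta)=\theta^{T}x$ with squared loss, so the gradient of the empirical risk has a closed form; the data, `theta_true`, and the step size `eta` are made up for the illustration and are unrelated to the network trained in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))                 # n = 20 samples, d = 3
theta_true = np.array([1.0, -2.0, 0.5])      # hypothetical ground truth
y = X @ theta_true

def risk(theta):
    """Empirical risk R_S(theta) with squared loss."""
    return np.mean((X @ theta - y) ** 2)

def grad_risk(theta):
    """Gradient of R_S for the linear model: (2/n) X^T (X theta - y)."""
    return 2.0 / len(X) * X.T @ (X @ theta - y)

eta = 0.05                                   # learning rate
theta_t = np.zeros(3)
theta_next = theta_t - eta * grad_risk(theta_t)   # one gradient-descent step

print(risk(theta_t), risk(theta_next))       # the risk decreases after the step
```

In the notebook, the same step is carried out by `loss.backward()` followed by `optimizer.step()`; with the Adam optimizer the update direction is rescaled, so it is no longer plain gradient descent.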

In [43]:
def train_one_step(model, optimizer, loss_fn,  args):

    model.train()
    device = args.device
    data, target = args.training_input.to(
        device), args.training_target.to(device).to(torch.float)

    optimizer.zero_grad()
    outputs = model(data)
    loss = loss_fn(outputs, target)
    loss.backward()
    optimizer.step()

    return loss.item()


# One-step test function.

In [44]:
def test(model, loss_fn, args):
    model.eval()
    device = args.device
    with torch.no_grad():
        data, target = args.test_input.to(
            device), args.test_target.to(device).to(torch.float)
        outputs = model(data)
        loss = loss_fn(outputs, target)

    return loss.item(), outputs


# Plot the loss value during the training process

In [45]:
def plot_loss(path, loss_train, x_log=False):

    plt.figure()
    ax = plt.gca()
    y2 = np.asarray(loss_train)
    plt.plot(y2, 'k-', label='Train')
    plt.xlabel('epoch', fontsize=18)
    ax.tick_params(labelsize=18)
    plt.yscale('log')
    if not x_log:
        fntmp = os.path.join(path, 'loss.jpg')

    else:
        plt.xscale('log')
        fntmp = os.path.join(path, 'loss_log.jpg')
    plt.tight_layout()
    plt.savefig(fntmp,dpi=300)


    plt.close()


# Plot the figure of the model output.


In [46]:
def plot_model_output(path, args, output, epoch):

    plt.figure()
    ax = plt.gca()

    plt.plot(args.training_input.detach().cpu().numpy(),
             args.training_target.detach().cpu().numpy(), 'b*', label='True')
    plt.plot(args.test_input.detach().cpu().numpy(),
             output.detach().cpu().numpy(), 'r-', label='Test')

    ax.tick_params(labelsize=18)
    plt.legend(fontsize=18)
    fn = mkdirs(os.path.join('%s'%path,'output'))
    fntmp = os.path.join(fn, str(epoch)+'.jpg')

    plt.savefig(fntmp, dpi=300)


    plt.close()


# Defining Functions to Visualize Features of High-Dimensional Neurons ($d>1$) During Training

For high-dimensional data, polar coordinates cannot be used. In this case, we can utilize cosine similarity to measure the size of the angle between two vectors.

**Cosine Similarity**: The cosine similarity between two vectors $\boldsymbol{u}$ and $\boldsymbol{v}$ is defined as
\begin{equation}
D(\boldsymbol{u},\boldsymbol{v}) = \frac{\boldsymbol{u}^T\boldsymbol{v}}{(\boldsymbol{u}^T\boldsymbol{u})^{1/2}(\boldsymbol{v}^{T}\boldsymbol{v})^{1/2}}.
\end{equation}
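
A minimal NumPy sketch of this definition, applied to a whole set of vectors at once, is given below. The helper name `cosine_similarity_matrix` and the four example vectors are hypothetical; the sketch assumes that no vector has zero norm.

```python
import numpy as np

def cosine_similarity_matrix(vectors):
    """Return the matrix of pairwise cosine similarities of the rows."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / norms            # assumes every row has non-zero norm
    return unit @ unit.T

vectors = np.array([[1.0, 0.0],       # reference direction
                    [2.0, 0.0],       # same direction, different length
                    [0.0, 3.0],       # orthogonal direction
                    [-1.0, 0.0]])     # opposite direction
D = cosine_similarity_matrix(vectors)
print(D[0])   # similarities to the first vector: parallel, orthogonal, opposite
```

Because the similarity depends only on direction, two neurons whose weight vectors are parallel but of different length still count as condensed.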

In [47]:
def get_parameter(checkpoint):
    wei1 = checkpoint['features.0.weight']
    bias = checkpoint['features.0.bias']
    wei2 = checkpoint['features.2.weight']

    return wei1, bias, wei2

def normalize_vectorgroup(checkpoint):
    wei1, bias, wei2 = get_parameter(checkpoint)
    bias = torch.unsqueeze(bias,dim=1)
    vector_group = torch.cat((wei1,bias),dim=1)
    vector_group = vector_group.detach().cpu().numpy()
    norms = np.linalg.norm(vector_group,axis=1)
    mask = norms > 0
    vector_masked = vector_group[mask]
    norms = norms[mask]
    norms = norms[:, np.newaxis]
    vector_normalized = vector_masked / norms
    return vector_normalized,vector_masked.shape[0]


def seperate_vectors_by_eigenvector(vector_group):
    mask = np.linalg.norm(vector_group,axis=1) > 0
    vector_group = vector_group[mask]
    similarity_matrix = np.dot(vector_group,vector_group.transpose())
    # The similarity matrix is symmetric, so eigh returns real eigenpairs
    w,v = np.linalg.eigh(similarity_matrix)
    index = np.argmax(w)
    tmpeig = v[:,index]
    order_mask = np.argsort(tmpeig)
    
    similarity_matrix = similarity_matrix[order_mask,:]
    similarity_matrix = similarity_matrix[:,order_mask]
    return similarity_matrix,order_mask

def plot_weight_heatmap_eigen(weight, path, args, nota=''):

    weight_normalized,masked_shape = normalize_vectorgroup(weight)
    similarity_matrix,order = seperate_vectors_by_eigenvector(weight_normalized)
    fn = mkdirs(os.path.join('%s'%path,'cosine_similarity'))
    plt.figure()
    plt.pcolormesh(similarity_matrix,vmin=-1,vmax=1,cmap='YlGnBu')
    plt.colorbar()
    plt.xlabel('index',fontsize=18)
    plt.xticks(fontsize=18)
    plt.ylabel('index',fontsize=18)
    plt.yticks(fontsize=18)
    plt.tight_layout()
    plt.savefig(os.path.join(fn,'%s'%nota))
    plt.close()
    return order
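
To see why sorting by the leading eigenvector is useful, the sketch below applies the same idea to a toy set of five unit vectors that point in only two directions and are stored in interleaved order. The directions `u1` and `u2` and the group labels are made up for the illustration, and the sketch uses `np.linalg.eigh`, which is intended for symmetric matrices.

```python
import numpy as np

u1 = np.array([1.0, 0.0, 0.0])            # direction of hypothetical group 0
u2 = np.array([-0.6, 0.8, 0.0])           # direction of hypothetical group 1
labels = np.array([0, 1, 0, 1, 1])        # interleaved group membership
vectors = np.stack([u1, u2, u1, u2, u2])  # rows are unit vectors

similarity = vectors @ vectors.T          # cosine similarities (rows are normalized)
eigvals, eigvecs = np.linalg.eigh(similarity)
leading = eigvecs[:, np.argmax(eigvals)]  # eigenvector of the largest eigenvalue
order = np.argsort(leading)               # sort neurons by their eigenvector entry

reordered = similarity[np.ix_(order, order)]
sorted_labels = labels[order]
print(sorted_labels)                      # members of each group are now adjacent
```

After reordering, the similarity matrix has a block structure, so each condensed group shows up as a uniformly colored square on the diagonal of the heatmap.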

# Training Process
With the definitions of functions related to the training process, we can now train the neural network and visualize the features of the neurons.

In [None]:

args.gamma = 4
args.lr = 0.001
args.epochs = 100
args.save_epoch = 10
args.plot_epoch = 10
args.optimizer = 'adam'
args.act_func_name = 'Tanh'
args.savepath = os.path.join(args.path, 't=%s'%args.gamma)
os.makedirs(args.savepath, exist_ok=True)
act_func = get_act_func(args.act_func_name)

model = Linear(args.gamma, args.hidden_layers_width, args.input_dim,
               args.output_dim, act_func).to(args.device)

para_init = copy.deepcopy(model.state_dict())
if args.optimizer == 'sgd':
    optimizer = torch.optim.SGD(model.parameters(), lr=args.lr)
else:
    optimizer = torch.optim.Adam(model.parameters(), lr=args.lr)
loss_fn = nn.MSELoss(reduction='mean')
t0 = time.time()
loss_training_lst = []
loss_test_lst = []
for epoch in range(args.epochs + 1):

    # One full-batch optimization step, followed by an evaluation on the test set
    model.train()
    loss = train_one_step(model, optimizer, loss_fn, args)
    loss_test, output = test(model, loss_fn, args)
    loss_training_lst.append(loss)
    loss_test_lst.append(loss_test)
    if epoch % args.plot_epoch == 0:
        print("[%d] loss: %.6f valloss: %.6f time: %.2f s" %
              (epoch + 1, loss, loss_test, (time.time() - t0)))

    if (epoch + 1) % args.plot_epoch == 0:
        plot_loss(path=args.savepath,
                  loss_train=loss_training_lst, x_log=True)
        plot_loss(path=args.savepath,
                  loss_train=loss_training_lst, x_log=False)

        # Snapshot the current parameters and plot their cosine-similarity heatmap
        para_now = copy.deepcopy(model.state_dict())
        plot_weight_heatmap_eigen(para_now, args.savepath, args, nota='%s' % epoch)
          
