### <span style="color:#3bbfa0; font-size:2em;">*Peering Behind the Iron Curtain of the LLM Cold War:*</span> <span style="color:lightgray; font-size:2em;">*Minor Leaks Reveal Some Insights into GPT-4's Evolution*</span>

<img src="https://i.ibb.co/t3Z8PxB/Picture1.png" width="500">

<span style="color:gray; font-size:0.8em;">The comparison of OpenAI to the USSR is sensational, but it's intended to underscore a particular aspect: the quiet yet intense competition among major corporations. This rivalry, aimed at protecting their respective advancements, mirrors the silent standoff that characterized the nuclear arms race during the Cold War.</span>


### <span style="color:#3bbfa0; font-size:1.2em;">*1. Background of the MoE Leak:*</span>
___

Yam Peleg posted unconfirmed third-party information about GPT-4 on Twitter. In the now-removed post, he claimed <span style="color:red">**“it is over. Everything is here:”**</span>, which is likely to set off a skeptic’s “hyperbolic alt-news salesman” alarm. Nevertheless, let's examine these leaks.


<img src="https://i.ibb.co/LhzQSDq/Picture2.png" width="600">

<div style="position: relative; width: 700px;">
    <img src="./assets/img/stan.gif" style="position: absolute; bottom: 70px; right: 0px; width: 130px;">
</div>

#### 1.1 Leak Overview
First order of business, let's get the mundane filler points out of the way without belaboring them excessively:

- **Model Size**: GPT-4 is a behemoth, boasting about 1.8 trillion parameters across 120 to 128 layers. It's more than 10 times the size of its predecessor, GPT-3, marking a significant leap in the evolution of AI models.
- **Training Data**: GPT-4 was trained on approximately 13 trillion tokens (not unique), with two epochs for text-based data and four for code-based data. Fine-tuning data was sourced from Scale AI and OpenAI's internal datasets.
- **Cost**: The estimated cost of training GPT-4 in the cloud is around 63 million dollars, considering that an A100 costs about 1 dollar per hour.
- <span style="color:blue;">**Legality**</span>: There is also some buzz about OpenAI using a lot of textbooks as training data. I believe they are adopting a legal version of “better to ask for forgiveness than permission”. Or, perhaps more fittingly, “it's better to pay the fine than the licensing fee, because by the time the lawsuits roll in, we'll have achieved AGI.” I admire the audacious approach in an era where clear-cut laws are still few and far between.
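
To put the leaked figures in perspective, here is a quick back-of-the-envelope check. It only rearranges the numbers quoted in the list above; the GPT-3 parameter count (175 billion) is my own assumption taken from the publicly reported figure, not from the leak.

```python
# Figures quoted in the leak overview (all unconfirmed)
training_cost_usd = 63e6      # estimated cloud training cost
a100_usd_per_hour = 1.0       # assumed A100 rental price
gpt4_params = 1.8e12          # leaked GPT-4 parameter count

# Assumption: publicly reported GPT-3 size, not part of the leak
gpt3_params = 175e9

# Cost divided by hourly price gives the implied compute budget
gpu_hours = training_cost_usd / a100_usd_per_hour
gpu_years = gpu_hours / (24 * 365)
size_ratio = gpt4_params / gpt3_params

print(f"{gpu_hours:,.0f} A100-hours (about {gpu_years:,.0f} A100-years)")
print(f"GPT-4 / GPT-3 parameter ratio: {size_ratio:.1f}x")
```

So the cost estimate is really a statement about tens of millions of A100-hours, and the "more than 10 times" claim holds under the assumed GPT-3 size.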


Does any of that strike you as groundbreaking or significant enough to declare 'it's over'? I do not think so!


However, there is one really interesting revelation here which is OpenAI’s use of Mixture of Experts (MoE). This approach has been used by Meta AI and Google AI for both language and vision models. MoE is a type of conditional computation where parts of the network are activated on a per-example basis. This approach increases model capacity without a proportional increase in computation. However, a poor expert routing strategy can lead to under-training of certain experts.

The idea is quite simple. Rather than combining multiple models in the usual manner, a router is used to decide which model should receive a particular sample for prediction. And this isn't exactly a new concept:

<img src="https://i.ibb.co/nmDnSwZ/Picture3.png" width="600">
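
As a rough illustration of the routing idea described above, here is a minimal sketch of a top-1 routed layer. `TinyTop1MoE`, its dimensions, and the use of plain linear experts are hypothetical choices of mine for illustration (the 16 experts simply echo the leaked number); nothing here reflects how GPT-4 is actually built.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

class TinyTop1MoE(nn.Module):
    """Hypothetical toy layer: a linear router sends each sample to one linear expert."""
    def __init__(self, dim, num_experts):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_experts)])

    def forward(self, x):
        probs = torch.softmax(self.router(x), dim=1)  # (batch, num_experts)
        top_prob, top_idx = probs.max(dim=1)          # best expert per sample
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = top_idx == e                       # samples routed to expert e
            if mask.any():
                # Only the chosen expert runs; scaling by the router probability
                # keeps the router in the gradient path
                out[mask] = top_prob[mask].unsqueeze(1) * expert(x[mask])
        return out, top_idx

layer = TinyTop1MoE(dim=8, num_experts=16)
x = torch.randn(4, 8)
y, chosen = layer(x)

# Capacity grows with the number of experts, per-sample compute does not
total = sum(p.numel() for p in layer.experts.parameters())
active = sum(p.numel() for p in layer.experts[0].parameters())
print(f"expert parameters: {total} total, {active} used per sample")

# Expert utilization: an expert that rarely wins the routing rarely gets trained
counts = torch.bincount(chosen, minlength=16)
print("samples per expert:", counts.tolist())
```

The utilization count at the end is the thing to watch: if a few experts absorb almost all samples, the rest are under-trained, which is the routing failure mode mentioned earlier.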

However, the report indicates that 16 different experts were used. This is likely a tuned parameter (yikes).
If the reports are accurate and GPT-4 is built this way, it's certainly intriguing, and it is undoubtedly information that OpenAI would prefer to keep confidential. Still, it is laughable to claim “It is over. Everything is here”. What rival companies would most want to know is the routing methodology. Moreover, most major competitors had probably already assumed this technique was being used, given its ability to accelerate the training of “Outrageously Large Networks”. Regardless, the MoE approach is quite fascinating. I've provided more details on the subject below.

### <span style="color:#3bbfa0; font-size:1.2em;">*2. Literature Review (With Code):*</span>
___


https://doi.org/10.48550/arXiv.2208.02813

<img src="./assets/img/PaperAbs.png" width="900">


In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision
import torchvision.transforms as transforms

<img src="./assets/img/MoEDemo.png" width="400">

### <span style="color:#3bbfa0; font-size:1.2em;">*2.1 Basic Mixture of Experts (MoE) Models*</span>

This paper covers the evolution of the Mixture of Experts (MoE) methods:

- **SimpleMoE**: Jordan, M. I. and Jacobs, R. A. (1994). Hierarchical mixtures of experts and the EM algorithm. Neural Computation 6 181–214. https://www.cs.toronto.edu/~hinton/absps/hme.pdf
- **SparseMoE**: Shazeer, N., Mirhoseini, A., Maziarz, K., Davis, A., Le, Q., Hinton, G. and Dean, J. (2017). Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538 
https://doi.org/10.48550/arXiv.1701.06538
- **SingleMoE**: Fedus, W., Zoph, B. and Shazeer, N. (2021). Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. arXiv preprint arXiv:2101.03961 
https://doi.org/10.48550/arXiv.2101.03961


In [2]:
"""
Where each expert is chosen to be a two-layer CNN
"""
class Expert(nn.Module):
    def __init__(self, input_channels, num_classes):
        super(Expert, self).__init__()
        self.conv1 = nn.Conv2d(input_channels, 64, kernel_size=3, stride=1, padding=1)
        self.conv2 = nn.Conv2d(64, 128, kernel_size=3, stride=1, padding=1)
        self.fc = nn.Linear(128*32*32, num_classes)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = F.relu(self.conv2(x))
        x = x.view(x.size(0), -1)  # Flatten the output
        x = self.fc(x)
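        # Returns class probabilities rather than logits. Note that the training
        # loop below passes the mixed probabilities to nn.CrossEntropyLoss, which
        # applies its own log-softmax and is normally fed raw logits.
        return F.softmax(x, dim=1)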

In [3]:
"""
SimpleMoE: Jordan, M. I. and Jacobs, R. A. (1994). Hierarchical mixtures of experts and the em algorithm. 
Neural computation 6 181–214. https://www.cs.toronto.edu/~hinton/absps/hme.pdf
"""
class SimpleMoE(nn.Module):
    def __init__(self, num_experts, input_channels, output_channels):
        super(SimpleMoE, self).__init__()
        self.experts = nn.ModuleList([Expert(input_channels, output_channels) for _ in range(num_experts)])
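        self.gating_network = nn.Sequential(
            nn.Flatten(),  # (batch, C, 32, 32) -> (batch, C*32*32)
            nn.Linear(input_channels*32*32, num_experts),
            nn.Softmax(dim=1))

    def forward(self, x):
        weights = self.gating_network(x)  # (batch, num_experts)
        # Run every expert on the full batch: (batch, num_experts, num_classes)
        expert_outputs = torch.stack([expert(x) for expert in self.experts], dim=1)
        # Weight each expert's output by its per-sample gate value, then mix
        return (weights.unsqueeze(-1) * expert_outputs).sum(dim=1)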

In [4]:
"""
SparseMoE: Shazeer, N., Mirhoseini, A., Maziarz, K., Davis, A., Le, Q., Hinton, G. and Dean, J. (2017). 
Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint 
arXiv:1701.06538 https://doi.org/10.48550/arXiv.1701.06538
"""
class SparseMoE(nn.Module):
    def __init__(self, num_experts, input_channels, output_channels, fraction_experts=0.1):
        super(SparseMoE, self).__init__()
        self.experts = nn.ModuleList([Expert(input_channels, output_channels) for _ in range(num_experts)])
        self.gating_network = nn.Sequential(
            nn.Flatten(),
            nn.Linear(input_channels*32*32, num_experts),
            nn.Softmax(dim=1)
        )
        self.num_experts = num_experts
        self.num_classes = output_channels
        self.fraction_experts = fraction_experts
        # Number of experts consulted per sample (at least one)
        self.k = max(1, int(num_experts * fraction_experts))

    def forward(self, x):
        weights = self.gating_network(x)  # (batch, num_experts)
        # Get the top k experts for each sample: both tensors are (batch, k)
        top_weights, top_indices = torch.topk(weights, k=self.k, dim=1)
        out = x.new_zeros(x.size(0), self.num_classes)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = top_indices[:, slot] == e  # Samples routed to expert e in this slot
                if mask.any():
                    # Only the selected expert runs on these samples
                    out[mask] += top_weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

___
<img src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhbsjq4iseHi-Azxcj0irBjGkma0yd4geSPPombnJSdd5dyzTguUU2pdFfZu4G38G4F4TiymUOaIkQnXGVAix5x8wF3-9Ov3NJwWaEZNvJY84CWCgU5MbUYI_DjKa_BvalTHu3eyfCJGR89UqwskKngsppDy94Gahz3HAoKLh2vmh-Jzb7ZedRI91OwFw/w640-h339/image1.jpg" width="500">

In [18]:
"""
SingleMoE: Fedus, W., Zoph, B. and Shazeer, N. (2021). Switch transformers: Scaling to trillion parameter models
with simple and efficient sparsity. arXiv preprint arXiv:2101.03961 https://doi.org/10.48550/arXiv.2101.03961
"""
class SingleMoE(nn.Module):
    def __init__(self, num_experts, input_channels, output_channels, topk=1):
        super(SingleMoE, self).__init__()
        self.experts = nn.ModuleList([Expert(input_channels, output_channels) for _ in range(num_experts)])
        self.gating_network = nn.Sequential(
            nn.Flatten(),
            nn.Linear(input_channels*32*32, num_experts),
            nn.Softmax(dim=1)
        )
        self.num_experts = num_experts
        self.topk = topk
        self.expert_inputs = [[] for _ in range(num_experts)]  # Store inputs for each expert

    def forward(self, x):
        weights = self.gating_network(x)
        # Get the top k experts
        top_weights, top_indices = torch.topk(weights, k=self.topk, dim=1)
        
        outputs = []
        for i in range(x.shape[0]):  # For each item in the batch
            expert_outputs = []
            for idx in top_indices[i]:
                self.expert_inputs[idx].append(x[i].detach().cpu().numpy())  # Log inputs
                expert_outputs.append(self.experts[idx](x[i].unsqueeze(0)))
            expert_outputs = torch.stack(expert_outputs)
            # Weighted sum of top-k expert outputs; weights are reshaped to (k, 1, 1) to match (k, 1, num_classes)
            outputs.append((top_weights[i].view(-1, 1, 1) * expert_outputs).sum(dim=0))
        
        return torch.cat(outputs, dim=0)
    
"""
The softmax operation in the gating network is used to convert the raw scores from the gating network into a 
probability distribution over the experts. This distribution represents the "confidence" of the gating network 
in each expert for the given input.

When the output of the selected expert is multiplied by the corresponding softmax score, it's a way of weighting
the output of the expert by the confidence of the gating network in that expert. This can be interpreted as a 
form of "attention" mechanism, where the model pays more attention to the outputs of the experts that the gating 
network is more confident in.
"""
class SingleAttnMoE(nn.Module):
    def __init__(self, num_experts, input_channels, output_channels):
        super(SingleAttnMoE, self).__init__()
        self.experts = nn.ModuleList([Expert(input_channels, output_channels) for _ in range(num_experts)])
        self.gating_network = nn.Sequential(
            nn.Flatten(),
            nn.Linear(input_channels*32*32, num_experts),  # Raw scores; softmax is applied in forward
        )
        self.num_experts = num_experts

    def forward(self, x):
        weights = self.gating_network(x)
        weights = F.softmax(weights, dim=1)  # Convert raw scores to probabilities
        top_weights, top_indices = torch.topk(weights, k=1, dim=1)  # Top-1 expert per sample, shape (batch, 1)
        outputs = []
        for i in range(x.shape[0]):  # Route each sample to its own expert
            expert = self.experts[int(top_indices[i, 0])]
            outputs.append(expert(x[i].unsqueeze(0)) * top_weights[i, 0])  # Weight output by softmax score
        return torch.cat(outputs, dim=0)

Load the data
<img src="https://production-media.paperswithcode.com/datasets/4fdf2b82-2bc3-4f97-ba51-400322b228b1.png" width="500">

In [19]:
# Define the device
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Load the CIFAR-10 dataset
transform = transforms.Compose(
    [transforms.ToTensor(),
     transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])

trainset = torchvision.datasets.CIFAR10(root='./data', train=True,
                                        download=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=64,
                                          shuffle=True, num_workers=2)

testset = torchvision.datasets.CIFAR10(root='./data', train=False,
                                       download=True, transform=transform)
testloader = torch.utils.data.DataLoader(testset, batch_size=64,
                                         shuffle=False, num_workers=2)

Files already downloaded and verified
Files already downloaded and verified


Init Model

In [20]:
# Define the model
num_experts = 10
input_channels = 3  # CIFAR-10 images have 3 color channels
output_channels = 10  # CIFAR-10 has 10 classes
model = SingleMoE(num_experts, input_channels, output_channels).to(device)

# Define the loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)

Train

In [24]:
# Train the model
for epoch in range(30):  # loop over the dataset multiple times
    running_loss = 0.0
    for i, data in enumerate(trainloader, 0):
        # get the inputs; data is a list of [inputs, labels]
        inputs, labels = data[0].to(device), data[1].to(device)

        # zero the parameter gradients
        optimizer.zero_grad()

        # forward + backward + optimize
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        # print statistics
        running_loss += loss.item()
        # With batch_size=64 an epoch has about 782 mini-batches, so report every 200
        if i % 200 == 199:
            print('[%d, %5d] loss: %.3f' % (epoch + 1, i + 1, running_loss / 200))
            running_loss = 0.0

print('Finished Training')

<img src="https://i.ibb.co/xDXVkH8/Capture-Mo-E.png" width="800">
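
The `testloader` defined earlier is never used by the training loop, so here is a minimal sketch of how held-out accuracy could be measured. The `evaluate` helper is a hypothetical name of mine, and the tiny identity "model" and one-batch loader below are stand-ins so the snippet runs on its own; in the notebook you would pass the trained model and `testloader` instead.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def evaluate(model, loader, device):
    """Return top-1 accuracy of `model` over all batches in `loader`."""
    model.eval()
    correct, total = 0, 0
    for inputs, labels in loader:
        inputs, labels = inputs.to(device), labels.to(device)
        outputs = model(inputs)
        if isinstance(outputs, tuple):  # Some MoE variants also return routing info
            outputs = outputs[0]
        correct += (outputs.argmax(dim=1) == labels).sum().item()
        total += labels.size(0)
    return correct / total

# Stand-in data: 4 samples whose "scores" are one-hot rows, one label deliberately wrong
toy_loader = [(torch.eye(4), torch.tensor([0, 1, 2, 0]))]
toy_accuracy = evaluate(nn.Identity(), toy_loader, torch.device("cpu"))
print(f"toy accuracy: {toy_accuracy:.2f}")
```

Because the helper only looks at the argmax, it works whether the model returns logits or probabilities.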

<br><br><br><br><br><br><br><br><br><br>

### <span style="color:#3bbfa0; font-size:1.2em;">*2.2 Novel Mixture of Experts (MoE) Improvements*</span>
___
<span style="color:#3bbfa0; font-size:1.3em;">**2.2.1. Expert Diversification and Dispatching (EDD)**</span><br>
<span style="color:gray; font-size:1.2em;">*The authors propose the Expert Diversification and Dispatching (EDD) method, which fosters diversity among experts and optimizes data routing by introducing additional loss terms, thereby improving the overall performance of the Mixture of Experts model.*</span>
<br><br>
<span style="color:#3bbfa0; font-size:1.3em;">**2.2.2. Stability by Smoothing**</span><br>
<span style="color:gray; font-size:1.2em;">*The authors introduce a noise term to the gating network output, which ensures a smooth transition between different routing behaviors and makes the router more stable.*</span>
___
<br><br><br>
<span style="color:#3bbfa0; font-size:1.4em;">**2.2.1 Expert Diversification and Dispatching (EDD)**</span><br>

- **Introduction**: The section opens with an explanation about the behavior of Mixture of Experts (MoE) models. Unlike a single model, experts in a MoE model tend to diversify. The reason behind this behavior lies in the MoE model's unique design and training process. Each expert in the model is designed to specialize in a particular segment of the input space, fostering diversity and preventing the experts from collapsing into a single model.

- **Model Architecture and Training**: The text then describes how a component in the MoE model called the router functions. The router's job is to allocate data to the appropriate expert through a learning process. It employs a gating network that uses a softmax function to assign probabilities to each expert. The gating network is then trained via backpropagation and gradient descent, akin to other elements of the neural network.

- **Expert Diversification and Dispatching (EDD)**: The document introduces a novel training methodology for MoE models. This technique, termed "Expert Diversification and Dispatching (EDD)", promotes expert diversification. It accomplishes this by introducing an additional loss term, which penalizes the similarity between outputs from different experts, thereby encouraging each expert to specialize in unique parts of the input space.

- **Advantages of EDD**: The EDD method also enhances the router's efficiency in dispatching data to the correct expert. It incorporates a dispatching loss that compels the router to allocate each data point to the expert most likely to yield the correct output. This is accomplished by comparing each expert's output with the target output and penalizing the router if it does not give the highest probability to the expert with the closest output to the target. This technique improves the router's proficiency at mapping inputs to experts, thereby enhancing the MoE model's overall performance.

- **Conclusion**: The document concludes that despite the potential complexity, the EDD method aids in creating more diverse experts within a MoE model and improves its overall performance. The text implies that future research could further optimize this model by making the dispatching process more efficient.
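
To make the two loss terms described above concrete, here is a minimal sketch of one way to write them. This is my own reading of the bullets, not the paper's formulation: I assume cosine similarity as the similarity measure and cross-entropy toward the lowest-loss expert as the dispatching signal, and the names `diversification_penalty` and `dispatching_loss` are hypothetical.

```python
import torch
import torch.nn.functional as F

def diversification_penalty(expert_probs):
    """Mean pairwise cosine similarity between experts.

    expert_probs: (batch, num_experts, num_classes)
    """
    normed = F.normalize(expert_probs, dim=-1)
    sim = normed @ normed.transpose(1, 2)  # (batch, E, E)
    num_experts = expert_probs.size(1)
    off_diag = ~torch.eye(num_experts, dtype=torch.bool, device=expert_probs.device)
    return sim[:, off_diag].mean()         # Ignore each expert's similarity to itself

def dispatching_loss(router_logits, expert_probs, targets):
    """Push the router toward the expert that best predicts the target."""
    batch, num_experts, _ = expert_probs.shape
    log_probs = torch.log(expert_probs.clamp_min(1e-9))
    index = targets.view(batch, 1, 1).expand(batch, num_experts, 1)
    per_expert_nll = -log_probs.gather(2, index).squeeze(-1)  # (batch, E)
    best_expert = per_expert_nll.argmin(dim=1).detach()       # Treated as a fixed label
    return F.cross_entropy(router_logits, best_expert)

torch.manual_seed(0)
probs = torch.softmax(torch.randn(4, 3, 10), dim=-1)  # 4 samples, 3 experts, 10 classes
logits = torch.randn(4, 3)
targets = torch.tensor([1, 0, 7, 3])
alpha = 0.5  # Assumed trade-off between the two terms
loss = alpha * diversification_penalty(probs) + (1 - alpha) * dispatching_loss(logits, probs, targets)

identical = diversification_penalty(torch.full((2, 3, 4), 0.25))  # All experts agree
distinct = diversification_penalty(torch.eye(3).unsqueeze(0))     # Experts fully disagree
print(f"combined: {loss.item():.3f}, identical: {identical.item():.3f}, distinct: {distinct.item():.3f}")
```

The penalty is highest when every expert outputs the same distribution and drops to zero when they disagree completely, which is the behaviour the diversification term is meant to reward.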

In [5]:
class SparseMoeEddLoss(nn.Module):
    def __init__(self, num_experts, input_channels, output_channels, topk=1, num_samples=10, noise_stddev=0.1):
        super(SparseMoeEddLoss, self).__init__()
        self.experts = nn.ModuleList([Expert(input_channels, output_channels) for _ in range(num_experts)])
        self.gating_network = nn.Sequential(
            nn.Linear(input_channels*32*32, num_experts),
            nn.Softmax(dim=1)
        )
        self.num_experts = num_experts
        self.topk = topk
        self.expert_inputs = [[] for _ in range(num_experts)]  # Store inputs for each expert
        self.num_samples = num_samples  # Number of noisy samples to use for smoothing
        self.noise_stddev = noise_stddev  # Standard deviation of the Gaussian noise

    def forward(self, x):
        x_flat = x.view(x.size(0), -1)  # Flatten the input
        weights = self.gating_network(x_flat)
        top_weights, top_indices = torch.topk(weights, k=self.topk, dim=1)  # Get the top k experts

        outputs = []
        averaged_noisy_outputs_list = []
        for i in range(x.shape[0]):  # For each item in the batch
            averaged_noisy_outputs = []
            for idx in top_indices[i]:
                self.expert_inputs[idx].append(x[i].detach().cpu().numpy())  # Log inputs

                # Add noise to the input and average over multiple noisy samples
                noisy_outputs = []
                for _ in range(self.num_samples):
                    noise = torch.randn_like(x[i]) * self.noise_stddev
                    noisy_output = self.experts[idx]((x[i] + noise).unsqueeze(0))
                    noisy_outputs.append(noisy_output)
                averaged_noisy_outputs.append(torch.stack(noisy_outputs).mean(dim=0))

            averaged_noisy_outputs = torch.stack(averaged_noisy_outputs)
            averaged_noisy_outputs_list.append(averaged_noisy_outputs)
            # Weighted sum of top-k expert outputs; weights are reshaped to (k, 1, 1) to match (k, 1, num_classes)
            outputs.append((top_weights[i].view(-1, 1, 1) * averaged_noisy_outputs).sum(dim=0))

        return torch.cat(outputs, dim=0), averaged_noisy_outputs_list, top_indices


    def edd_loss(self, x, target, alpha=0.5):
        x_flat = x.view(x.size(0), -1)  # Flatten the input
        weights = self.gating_network(x_flat)
        top_weights, top_indices = torch.topk(weights, k=self.topk, dim=1)  # Get the top k experts

        # Diversification loss
        outputs = []
        for i in range(x.shape[0]):  # For each item in the batch
            expert_outputs = []
            for idx in top_indices[i]:
                expert_outputs.append(self.experts[idx](x[i].unsqueeze(0)))
            expert_outputs = torch.stack(expert_outputs)
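            # Weighted sum of top-k expert outputs; weights are reshaped to (k, 1, 1) to match (k, 1, num_classes)
            outputs.append((top_weights[i].view(-1, 1, 1) * expert_outputs).sum(dim=0))

        outputs = torch.cat(outputs, dim=0)
        # Note: this term measures the spread of the mixed outputs across the batch,
        # not the pairwise similarity between experts described in the text.
        diversification_loss = ((outputs - outputs.mean(dim=0, keepdim=True))**2).sum()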

        # Dispatching loss
        dispatching_loss = 0
        for i in range(x.shape[0]):  # For each item in the batch
            expert_outputs = []
            for idx in top_indices[i]:
                expert_output = self.experts[idx](x[i].unsqueeze(0))
                expert_outputs.append(expert_output)
            expert_outputs = torch.stack(expert_outputs)
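            # Compare against a one-hot target; subtracting the raw class index from probabilities would be meaningless
            target_output = F.one_hot(target[i], num_classes=expert_outputs.size(-1)).to(expert_outputs.dtype).expand_as(expert_outputs)
            dispatching_loss += (top_weights[i].view(-1, 1, 1) * (expert_outputs - target_output)**2).sum()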

        dispatching_loss = dispatching_loss / x.shape[0]  # Average over the batch size

        return alpha * diversification_loss + (1-alpha) * dispatching_loss

<br><br><br>
<span style="color:#3bbfa0; font-size:1.4em;">**2.2.2 Stability by Smoothing**</span><br>

- **Introduction**: The section begins by stating that the stability of a model can be improved by smoothing the model's predictions. The idea is to make the model's output less sensitive to small changes in the input. This is particularly important in machine learning models that are used for decision-making, where a small change in the input should not drastically change the model's decision.

- **Smoothing Method**: The document then describes a specific method of smoothing. This method involves adding a small amount of random noise to the input data and then averaging the model's predictions over many noisy versions of the input. This process is known as Monte Carlo integration. The noise is added according to a Gaussian distribution, which is a common choice for this kind of operation.

- **Advantages**: The document mentions that this smoothing method has several advantages. It can improve the model's stability and robustness, making it less likely to make drastically different predictions for similar inputs. It can also help to prevent overfitting, where the model learns to perform very well on the training data but poorly on new, unseen data.

- **Disadvantages**: However, the document also notes that this smoothing method can be computationally expensive. This is because it requires the model to make predictions for many different versions of each input. This can be mitigated by using a smaller number of noisy versions, but this may also reduce the effectiveness of the smoothing.

- **Conclusion**: The document concludes by stating that despite the computational cost, smoothing can be a valuable tool for improving the stability of machine learning models. It suggests that future research could explore ways to make the smoothing process more efficient.
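
The smoothing procedure described above can be sketched in a few lines. This is a minimal, batched version under my own assumptions: `smoothed_predict` is a hypothetical helper, and the small linear classifier is a stand-in for an expert so the snippet runs on its own.

```python
import torch
import torch.nn as nn

def smoothed_predict(model, x, num_samples=10, noise_stddev=0.1):
    """Average model predictions over Gaussian-perturbed copies of x (Monte Carlo estimate)."""
    # One independent noise draw per sample and per copy: (num_samples, batch, ...)
    noise = torch.randn(num_samples, *x.shape, device=x.device, dtype=x.dtype) * noise_stddev
    noisy = (x.unsqueeze(0) + noise).flatten(0, 1)  # (num_samples * batch, ...)
    preds = model(noisy)                            # One forward pass for all copies
    return preds.reshape(num_samples, x.size(0), -1).mean(dim=0)

torch.manual_seed(0)
# Stand-in classifier: 3x4x4 inputs, 10 classes
toy_model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 4 * 4, 10), nn.Softmax(dim=1))
x = torch.randn(5, 3, 4, 4)

smoothed = smoothed_predict(toy_model, x, num_samples=10, noise_stddev=0.05)
unsmoothed = smoothed_predict(toy_model, x, num_samples=10, noise_stddev=0.0)
print(smoothed.shape)
```

The cost scales linearly with `num_samples`, which is the trade-off noted in the disadvantages: fewer noisy copies are cheaper but give a noisier average. With `noise_stddev=0.0` the function reduces to the plain prediction, which makes for an easy sanity check.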

In [6]:
class SparseMoeEddLossStabSmoothing(nn.Module):
    def __init__(self, num_experts, input_channels, output_channels, topk=1, num_samples=10, noise_stddev=0.1):
        super().__init__()  # Must initialise this class, not SparseMoeEddLoss
        self.experts = nn.ModuleList([Expert(input_channels, output_channels) for _ in range(num_experts)])
        self.gating_network = nn.Sequential(
            nn.Linear(input_channels*32*32, num_experts),
            nn.Softmax(dim=1)
        )
        self.num_experts = num_experts
        self.topk = topk
        self.expert_inputs = [[] for _ in range(num_experts)]  # Store inputs for each expert
        self.num_samples = num_samples  # Number of noisy samples to use for smoothing
        self.noise_stddev = noise_stddev  # Standard deviation of the Gaussian noise

    def forward(self, x):
        x_flat = x.view(x.size(0), -1)  # Flatten the input
        weights = self.gating_network(x_flat)
        top_weights, top_indices = torch.topk(weights, k=self.topk, dim=1)  # Get the top k experts

        outputs = []
        averaged_noisy_outputs_list = []
        for i in range(x.shape[0]):  # For each item in the batch
            averaged_noisy_outputs = []
            for idx in top_indices[i]:
                self.expert_inputs[idx].append(x[i].detach().cpu().numpy())  # Log inputs

                # Add noise to the input and average over multiple noisy samples
                noisy_outputs = []
                for _ in range(self.num_samples):
                    noise = torch.randn_like(x[i]) * self.noise_stddev
                    noisy_output = self.experts[idx]((x[i] + noise).unsqueeze(0))
                    noisy_outputs.append(noisy_output)
                averaged_noisy_outputs.append(torch.stack(noisy_outputs).mean(dim=0))

            averaged_noisy_outputs = torch.stack(averaged_noisy_outputs)
            averaged_noisy_outputs_list.append(averaged_noisy_outputs)
            # Weighted sum of top-k expert outputs; weights are reshaped to (k, 1, 1) to match (k, 1, num_classes)
            outputs.append((top_weights[i].view(-1, 1, 1) * averaged_noisy_outputs).sum(dim=0))

        return torch.cat(outputs, dim=0), averaged_noisy_outputs_list, top_indices


    def edd_loss(self, x, target, alpha=0.5):
        x_flat = x.view(x.size(0), -1)  # Flatten the input
        weights = self.gating_network(x_flat)
        top_weights, top_indices = torch.topk(weights, k=self.topk, dim=1)  # Get the top k experts

        # Diversification loss
        outputs = []
        for i in range(x.shape[0]):  # For each item in the batch
            expert_outputs = []
            for idx in top_indices[i]:
                expert_outputs.append(self.experts[idx](x[i].unsqueeze(0)))
            expert_outputs = torch.stack(expert_outputs)
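            # Weighted sum of top-k expert outputs; weights are reshaped to (k, 1, 1) to match (k, 1, num_classes)
            outputs.append((top_weights[i].view(-1, 1, 1) * expert_outputs).sum(dim=0))

        outputs = torch.cat(outputs, dim=0)
        # Note: this term measures the spread of the mixed outputs across the batch,
        # not the pairwise similarity between experts described in the text.
        diversification_loss = ((outputs - outputs.mean(dim=0, keepdim=True))**2).sum()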

        # Dispatching loss
        dispatching_loss = 0
        for i in range(x.shape[0]):  # For each item in the batch
            expert_outputs = []
            for idx in top_indices[i]:
                expert_output = self.experts[idx](x[i].unsqueeze(0))
                expert_outputs.append(expert_output)
            expert_outputs = torch.stack(expert_outputs)
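            # Compare against a one-hot target; subtracting the raw class index from probabilities would be meaningless
            target_output = F.one_hot(target[i], num_classes=expert_outputs.size(-1)).to(expert_outputs.dtype).expand_as(expert_outputs)
            dispatching_loss += (top_weights[i].view(-1, 1, 1) * (expert_outputs - target_output)**2).sum()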

        dispatching_loss = dispatching_loss / x.shape[0]  # Average over the batch size

        return alpha * diversification_loss + (1-alpha) * dispatching_loss

<br><br><br>
<span style="color:#3bbfa0; font-size:1.4em;">**3. Sample Results:**</span><br>

Weight (EDD): 0.1, Alpha (EDD): 0.1, Samples (Smoothing): 10, Noise (Smoothing): 0.05
<img src="./assets/img/Weight (EDD) 0.1 Alpha (EDD) 0.1 Samples (Smoothing) n-smples10 Noise (Smoothing) 0.05.gif">

<br><br>
Weight (EDD): 0.5, Alpha (EDD): 0.7, Samples (Smoothing): 30, Noise (Smoothing): 0.05
<img src="./assets/img/Weight (EDD) 0.5 Alpha (EDD) 0.7 Samples (Smoothing) n-smples30 Noise (Smoothing) 0.05.gif">