# Explanation

This paper introduces the sparsely-gated mixture-of-experts (MoE) layer, a method for massively scaling the number of parameters in a model, and with it the model's capacity, without a proportional increase in computation.

GPT-4 is rumored to be (and very likely is) an MoE model, and Mistral AI's open-source Mixtral 8x7B is also a mixture of experts model.

### Intuition

Generally, scaling up the number of parameters is one of the most effective ways to increase the capability of a network. However, the compute required to train a model also grows with its parameter count.

MoE provides a way to train models with far more parameters without a proportional increase in the compute required to train them.

It accomplishes this by using a large number of individual neural networks that are connected together with a "sparsely-gated" layer to make them work together to produce the model output.

Ordinarily, training a model with 10x the parameters takes roughly 10x the compute, since each training step performs 10x more computation.

MoE works around this by "gating" which sub-networks are active for each input, so that only a small subset of the experts contributes to any specific output. Each training step therefore still computes with roughly the same number of parameters per example, and the compute cost grows only slightly.

However, the model is able to use a far larger number of parameters to contribute to its representations. The idea behind MoE is that with the constraint of the sparsely-gated layer, the model will develop different "experts" in its subnetworks that become good at different tasks.
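
To make the scaling argument concrete, here is a minimal sketch of the parameter arithmetic for a single MoE layer. The function name and all sizes (1,000 experts, 4 active per example, 1M parameters per expert) are hypothetical values chosen for illustration, not figures from the paper.

```python
def moe_param_counts(num_experts, active_experts, params_per_expert):
    """Return (total, active) expert parameter counts for one MoE layer."""
    total = num_experts * params_per_expert
    active = active_experts * params_per_expert
    return total, active


# Hypothetical sizes: 1,000 experts of 1M parameters each, 4 active per example.
total, active = moe_param_counts(1000, 4, 1_000_000)
print(f"total={total:,} active={active:,} ratio={total // active}x")
```

Under these assumed sizes, the layer stores 250 times more expert parameters than it evaluates for any single example.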

### Implementation

The sparsely-gated MoE layer is the main contribution of this paper. Its gating network has its own weights $W_g$, which it learns in order to choose which experts to select for different inputs. The gating network introduces sparsity through the gating function $G(x)$, which limits which experts actually contribute to the output:

$$
y = \sum_{i=1}^n G(x)_i E_i(x)
$$

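The gating function is computed from the weight matrix $W_g$ plus tunable Gaussian noise, whose per-expert scale is set by a second weight matrix $W_{noise}$. Only the top $k$ values are kept before the softmax: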

$$
G(x) = Softmax(KeepTopK(H(x), k)) \\
H(x)_i = (x \cdot W_g)_i + Gaussian() \cdot Softplus((x \cdot W_{noise})_i)
$$
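
The two gating equations above can be sketched in plain Python for a single input vector. This is a minimal illustration rather than the paper's implementation: the names `softplus` and `noisy_top_k_gating`, the optional `rng` argument (pass `None` to drop the noise term and make the result deterministic), and the toy weights are all assumptions made here.

```python
import math
import random


def softplus(z):
    # Numerically stable log(1 + exp(z)).
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def noisy_top_k_gating(x, w_g, w_noise, k, rng=None):
    """Return the sparse gate vector G(x) for one input vector x.

    w_g and w_noise are lists of rows with shape [len(x)][n_experts].
    rng is a random.Random instance; None disables the noise term
    (an assumption of this sketch, useful for deterministic checks).
    """
    n_experts = len(w_g[0])
    h = []
    for i in range(n_experts):
        clean = sum(x[d] * w_g[d][i] for d in range(len(x)))
        raw_noise = sum(x[d] * w_noise[d][i] for d in range(len(x)))
        noise = rng.gauss(0.0, 1.0) * softplus(raw_noise) if rng else 0.0
        h.append(clean + noise)

    # KeepTopK: keep the k largest values of H(x), set the rest to -inf.
    top = sorted(range(n_experts), key=lambda i: h[i], reverse=True)[:k]
    kept = [h[i] if i in top else -math.inf for i in range(n_experts)]

    # Softmax over the kept values; exp(-inf) = 0 gives exact zeros.
    m = max(kept)
    exps = [math.exp(v - m) for v in kept]
    total = sum(exps)
    return [e / total for e in exps]


# Toy example: 2 input features, 4 experts, 2 active experts.
x = [1.0, -0.5]
w_g = [[0.2, 1.0, -0.3, 0.5], [0.1, -0.4, 0.8, 0.0]]
w_noise = [[0.0] * 4, [0.0] * 4]
gates = noisy_top_k_gating(x, w_g, w_noise, k=2, rng=random.Random(0))
print(gates)
```

Whatever the noise draw, exactly $k$ entries of the result are nonzero and they sum to 1, which is what lets the layer skip the remaining experts.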


Left to itself, the gating network may converge to favoring the same few experts for almost every input, which would defeat the purpose of the mixture-of-experts architecture.

To incentivize the network to spread computation across its experts, we add a term to the loss function that penalizes uneven expert usage:

$$
Importance(X) = \sum_{x \in X} G(x) \\
L_{importance}(X) = w_{importance} \cdot CV(Importance(X))^2
$$

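Here $CV$ is the coefficient of variation (standard deviation divided by mean) of the per-expert importance values. When a few experts receive most of the gate weight across a training batch $X$, the importance values are uneven, $CV$ is high, and the loss term is large. When all experts are used equally, the term is zero.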
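
As a rough sketch of the importance loss defined above, the snippet below computes it from a batch of gate vectors. The function name, the value `w_importance=0.1`, and the use of the population standard deviation are assumptions made for this illustration.

```python
import statistics


def importance_loss(gates, w_importance=0.1):
    """gates: one gate vector G(x) per example, each of length n_experts."""
    n_experts = len(gates[0])
    # Importance(X): sum of the gate values each expert receives over the batch.
    importance = [sum(row[i] for row in gates) for i in range(n_experts)]
    mean = statistics.fmean(importance)
    std = statistics.pstdev(importance)  # population std, an assumption here
    cv = std / mean
    return w_importance * cv ** 2


# Balanced routing: every expert gets the same total gate weight.
balanced = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]
# Collapsed routing: every example is sent to expert 0 only.
collapsed = [[1.0, 0.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0]]

print(importance_loss(balanced))
print(importance_loss(collapsed))
```

The balanced batch incurs no penalty, while the collapsed batch is penalized, so minimizing this term pushes the gating network toward spreading weight across experts.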

This work has since been adapted to many model types, including transformers.

# My Notes

## 📜 [Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer](https://arxiv.org/pdf/1701.06538)

> The capacity of a neural network to absorb information is limited by its number of parameters.

> Conditional computation, where parts of the network are active on a per-example basis, has been proposed in theory as a way of dramatically increasing model capacity without a proportional increase in computation.

> In this work, we […] finally realize the promise of conditional
> computation, achieving greater than 1000x improvements in model capacity with only minor losses in computational efficiency.

> We introduce a Sparsely-Gated Mixture-of-Experts layer (MoE), consisting of up to thousands of feed-forward sub-networks. A trainable gating network determines a sparse combination of these experts to use for each example.

> Various forms of conditional computation have been proposed as a way to increase model capacity without a proportional increase in computational costs.

> In these schemes, large parts of a network are active or inactive on a per-example basis.

> Our approach to conditional computation is to introduce a new type of general purpose neural network component: a Sparsely-Gated Mixture-of-Experts Layer (MoE). The MoE consists of a number of experts, each a simple feed-forward neural network, and a trainable gating network which selects a sparse combination of the experts to process each input.

### The Structure of the Mixture-of-Experts Layer

> The Mixture-of-Experts (MoE) layer consists of a set of $n$ “expert networks” $E_1, …, E_n$ and a “gating network” $G$ whose output is a sparse $n$-dimensional vector.

![Screenshot 2024-05-17 at 3.30.44 PM.png](../../images/Screenshot_2024-05-17_at_3.30.44_PM.png)

> Let us denote by $G(x)$ and $E_i(x)$ the output of the gating network and the output of the $i$-th expert network for a given input $x$. The output $y$ of the MoE module can be written as follows:

$$
y = \sum_{i=1}^n G(x)_i E_i(x)
$$

> We save computation based on the sparsity of the output of $G(x)$. Wherever $G(x)_i$ = 0, we need not compute $E_i(x)$. In our experiments, we have up to thousands of experts, but only need to evaluate a handful of them for every example.

Because of the gating mechanism, only a fraction of the experts is evaluated for each example. If each expert has $P$ parameters, the layer holds $n \times P$ parameters in total, but only $k \times P$ of them are used for any given example, where $k$ is the number of active experts. Choosing $k \ll n$ therefore saves significant compute while keeping the full model capacity.
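
The saving can be sketched as a forward pass that never calls an expert whose gate is zero. The function name, the toy "experts" (simple scaling functions), and the `calls` list used to record which experts ran are hypothetical, introduced only for this illustration.

```python
def moe_forward(x, gates, experts):
    """Compute y = sum_i G(x)_i * E_i(x), skipping experts with a zero gate."""
    y = None
    for gate, expert in zip(gates, experts):
        if gate == 0.0:
            continue  # G(x)_i = 0, so E_i(x) is never computed
        out = expert(x)
        if y is None:
            y = [gate * v for v in out]
        else:
            y = [acc + gate * v for acc, v in zip(y, out)]
    return y


calls = []  # records which toy experts were actually evaluated


def make_expert(scale, index):
    def expert(x):
        calls.append(index)
        return [scale * v for v in x]
    return expert


experts = [make_expert(s, i) for i, s in enumerate([1.0, 2.0, 3.0, 4.0])]
y = moe_forward([1.0, 2.0], [0.0, 0.75, 0.0, 0.25], experts)
print(y, calls)
```

Only the two experts with nonzero gates are evaluated, so the cost of the layer depends on the number of active experts rather than on the total number of experts.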

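The paper also describes a hierarchical MoE, in which a primary gating network selects among groups of experts and each group is itself an MoE with its own gating network. This keeps the branching factor manageable when the number of experts is very large.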

**1. Gating Network**

A simple approach to gating is to multiply the input by a trainable weight matrix $W_g$ and apply the softmax function:

$$
G_\sigma(x) = Softmax(x \cdot W_g)
$$

> We add two components to the Softmax gating network: sparsity and noise. Before taking the softmax function, we add tunable Gaussian noise, then keep only the top k values, setting the rest to $-\infty$.

$$
G(x) = Softmax(KeepTopK(H(x), k)) \\
H(x)_i = (x \cdot W_g)_i + Gaussian() \cdot Softplus((x \cdot W_{noise})_i)
$$

### Addressing Performance Challenges

**1. The Shrinking Batch Problem**

> If the gating network chooses $k$ out of $n$ experts for each example, then for a batch of $b$ examples, each expert receives a much smaller batch of approximately $\frac{kb}{n} \ll b$ examples. This causes a naive MoE implementation to become very inefficient as the number of experts increases.

> The solution to this shrinking batch problem is to make the original batch size as large as possible. However, batch size tends to be limited by the memory necessary to store activations between the forwards and backwards passes.

To counter this, the paper combines data parallelism with model parallelism. The standard layers and the gating network are replicated on every device, while each expert exists as a single shared copy that receives the relevant examples from the batches on all devices, synchronously.

With $d$ devices each processing a batch of $b$ examples, each expert then receives a batch of approximately $\frac{kbd}{n}$ examples.
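
A quick worked example of the two batch-size expressions above, assuming perfectly balanced routing. The function name and the specific values of $k$, $b$, $n$, and $d$ are hypothetical.

```python
def examples_per_expert(k, b, n, d=1):
    """Approximate per-expert batch size k*b*d/n under balanced routing."""
    return k * b * d / n


# Hypothetical setup: 4 active experts out of 256, batch of 1024 per device.
single = examples_per_expert(k=4, b=1024, n=256)          # one device
combined = examples_per_expert(k=4, b=1024, n=256, d=16)  # 16 devices
print(single, combined)
```

With these assumed numbers, each expert sees only 16 examples per step on a single device, but 256 once the batches from 16 devices are combined.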

**2. Network Bandwidth**

> Another major performance concern in distributed computing is network bandwidth.

> To maintain computational efficiency, the ratio of an expert’s computation to the size of its input and output must exceed the ratio of computational to network capacity of the computing device.

### Balancing Expert Utilization

> We have observed that the gating network tends to converge to a state where it always produces large weights for the same few experts.

If left unattended, the gating network will naturally favor certain experts, and the imbalance reinforces itself: the favored experts receive more training, improve faster, and are selected even more often.

To encourage all experts to have equal importance across a batch, the paper adds an importance term to the loss function:

$$
Importance(X) = \sum_{x \in X} G(x) \\
L_{importance}(X) = w_{importance} \cdot CV(Importance(X))^2
$$

The importance of an expert is the sum of its gate values over all training examples in the batch. The loss is the squared coefficient of variation of these importance values across experts, multiplied by a hand-tuned scaling factor $w_{importance}$.

### Experiments

![Screenshot 2024-05-17 at 3.59.23 PM.png](../../images/Screenshot_2024-05-17_at_3.59.23_PM.png)

![Screenshot 2024-05-17 at 3.59.51 PM.png](../../images/Screenshot_2024-05-17_at_3.59.51_PM.png)

### Conclusion

> This work is the first to demonstrate major wins from conditional computation in deep networks.
