# Covariance Matrix Adaptation - Evolution Strategy

CMA-ES is a multivariate estimation of distribution algorithm (EDA) that uses a parametric distribution. It models the joint distribution as a multivariate Gaussian, parameterized by a mean vector $m$ and a covariance matrix $C$.
At each iteration, CMA-ES samples from the distribution and then uses the samples to update $m$ and $C$.



## $(\frac{\mu}{\mu_W}, \lambda)$ CMA-ES
Generate $\lambda$ individuals from the current distribution with mean vector $m$ and covariance matrix $C$. Keep the $\mu$ fittest individuals (truncation selection, since CMA-ES is an evolution strategy). Use the $\mu$ selected individuals to update $m$ and $C$.

Each individual is obtained by sampling a deviation from a zero-mean multivariate Gaussian with covariance matrix $C$, scaling it by a mutation factor $\sigma$ (also called the step size), and adding it to the mean $m$. Equivalently, $x \sim \mathcal{N}(m, \sigma^2 C)$.
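
The sampling and selection step can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: the function names `sample_population` and `select_best` are hypothetical, fitness is assumed to be minimized, and the step size is assumed to scale the deviation from the mean, i.e. $x = m + \sigma y$ with $y \sim \mathcal{N}(0, C)$.

```python
import numpy as np

def sample_population(m, C, sigma, lam, rng):
    """Draw lam candidates x = m + sigma * y with y ~ N(0, C)."""
    n = m.shape[0]
    # Zero-mean Gaussian deviations with covariance C, one per row.
    y = rng.multivariate_normal(np.zeros(n), C, size=lam)
    # Scale the deviations by the step size and shift them to the mean.
    return m + sigma * y

def select_best(X, fitness, mu):
    """Truncation selection: keep the mu rows of X with the lowest fitness."""
    order = np.argsort([fitness(x) for x in X])
    return X[order[:mu]]

# Usage on a toy problem (sphere function, assumed to be minimized).
rng = np.random.default_rng(0)
m = np.zeros(2)
C = np.eye(2)
X = sample_population(m, C, sigma=0.5, lam=10, rng=rng)
best = select_best(X, lambda x: float(np.sum(x**2)), mu=5)
```

The rows of `best` come out sorted from fittest to least fit, which matches the ordering assumed by the update rules.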


### Update
Assume the individuals are ordered w.r.t. fitness, with the best one being $x^{(1)}$.
A simple way to update $m$ and $C$ is
$$
    z^{(i)} = \frac{x^{(i)}-m}{\sigma}
$$
$$
    C \leftarrow \frac{1}{\mu-1} \sum_{i=1}^{\mu}
    z^{(i)} z^{(i)T}
$$
$$
    m \leftarrow \frac{1}{\mu} \sum_{i=1}^{\mu} x^{(i)}
$$
Note that the algorithm does not use the updated mean vector to recompute $C$: first we update $C$ using the old mean, and only then do we update $m$. Measuring the steps from the old mean makes $C$ reflect the direction in which the selected individuals moved, which limits the risk of premature convergence.
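
A minimal NumPy sketch of this simple update is given below. The function name `simple_update` and the array layout (one selected individual per row) are assumptions made for illustration.

```python
import numpy as np

def simple_update(X_sel, m, sigma):
    """X_sel: (mu, n) array of selected individuals; m: the OLD mean."""
    mu = X_sel.shape[0]
    # Normalized steps z = (x - m) / sigma, measured from the old mean.
    Z = (X_sel - m) / sigma
    # Z.T @ Z equals the sum of the outer products z z^T over all rows.
    C_new = Z.T @ Z / (mu - 1)
    # The mean is updated only after C has been computed.
    m_new = X_sel.mean(axis=0)
    return m_new, C_new

# Usage with three selected individuals in two dimensions.
X_sel = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
m_new, C_new = simple_update(X_sel, m=np.zeros(2), sigma=1.0)
```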



### Weighted Update
We can add weights to the update as follows:
$$
    C \leftarrow \sum_{i=1}^{\mu} w_i z^{(i)} z^{(i)T}
$$
$$
    m \leftarrow \sum_{i=1}^{\mu} w_i x^{(i)}
$$
where the weights decrease with the rank $i$ and are normalized to sum to one:
$$
    w_i = \frac{\ln \left( \frac{\lambda+1}{2i} \right)}{\sum_{j=1}^{\mu} \ln\left( \frac{\lambda+1}{2j} \right)}
$$
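
The weights and the weighted update can be sketched as follows. The names `log_weights` and `weighted_update` are hypothetical. Because the weights are normalized to sum to one, the sketch takes the new mean as the weighted sum $\sum_i w_i x^{(i)}$ with no further division.

```python
import numpy as np

def log_weights(lam, mu):
    """Weights w_i proportional to ln((lam + 1) / (2 i)), normalized to sum to 1."""
    i = np.arange(1, mu + 1)
    raw = np.log((lam + 1) / (2 * i))
    return raw / raw.sum()

def weighted_update(X_sel, m, sigma, w):
    """X_sel: (mu, n) selected individuals sorted best first; m: the OLD mean."""
    Z = (X_sel - m) / sigma
    # Weighted sum of outer products: sum_i w_i z_i z_i^T.
    C_new = (Z * w[:, None]).T @ Z
    # Weighted mean of the selected individuals.
    m_new = w @ X_sel
    return m_new, C_new

# Usage: lambda = 10 offspring, mu = 5 parents.
w = log_weights(lam=10, mu=5)
```

With these settings the best individual receives the largest weight and every weight is positive, since $i < (\lambda+1)/2$ for all selected ranks.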



### Rank $\mu$ Update
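Instead of replacing $C$ at every iteration, the covariance matrix can be updated gradually, blending the old matrix with the new estimate through a learning rate $c_{\mu}$: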
$$
    C \leftarrow (1-c_{\mu})C+c_{\mu} \underbrace{\sum_{i=1}^{\mu} w_i z^{(i)} z^{(i)T}}_{\text{rank $\mu$ matrix}}
$$
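
As a sketch, the rank-$\mu$ update is a convex combination of the old matrix and the weighted estimate. The function name `rank_mu_update` is hypothetical, and `Z` is assumed to hold the normalized steps $z^{(i)}$ as rows.

```python
import numpy as np

def rank_mu_update(C, Z, w, c_mu):
    """Blend the old covariance C with the weighted rank-mu estimate."""
    # sum_i w_i z_i z_i^T, a matrix of rank at most mu.
    rank_mu = (Z * w[:, None]).T @ Z
    return (1 - c_mu) * C + c_mu * rank_mu

# Usage with two normalized steps and learning rate 0.5.
C = np.eye(2)
Z = np.array([[1.0, 1.0], [1.0, -1.0]])
w = np.array([0.75, 0.25])
C_new = rank_mu_update(C, Z, w, c_mu=0.5)
```

Setting `c_mu=0` leaves $C$ unchanged, while `c_mu=1` discards the old matrix and recovers the weighted update.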



### Evolution Path

We can keep track of the direction in which $m$ has historically been heading with the evolution path vector $p$, which is updated at each time step as
$$
    p \leftarrow (1-c_c)p + c_c \frac{m-m_{\text{old}}}{\sigma}
$$
for a learning rate $c_c$.

The covariance matrix is updated as

$$
    C \leftarrow (1-c_1-c_{\mu})C + c_1(pp^T) + c_{\mu} \sum_{i=1}^{\mu} w_i z^{(i)} z^{(i)T}
$$

Here $pp^T$ is a rank-1 matrix (hence the name $c_1$ for its coefficient) that captures the average direction in which the distribution has moved in the past.
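
The two updates of this section can be sketched together as below. The function names `update_path` and `update_covariance` are hypothetical, and the code follows the simplified formulas given here rather than any particular library's implementation.

```python
import numpy as np

def update_path(p, m_new, m_old, sigma, c_c):
    """Exponentially smoothed record of the normalized mean shifts."""
    return (1 - c_c) * p + c_c * (m_new - m_old) / sigma

def update_covariance(C, p, Z, w, c_1, c_mu):
    """Combine the old C with the rank-one and rank-mu terms."""
    rank_one = np.outer(p, p)            # p p^T, rank 1 for nonzero p
    rank_mu = (Z * w[:, None]).T @ Z     # sum_i w_i z_i z_i^T
    return (1 - c_1 - c_mu) * C + c_1 * rank_one + c_mu * rank_mu

# Usage: one step in which the mean moved along the first axis.
p = update_path(np.zeros(2), m_new=np.array([1.0, 0.0]),
                m_old=np.zeros(2), sigma=0.5, c_c=0.5)
C_new = update_covariance(np.eye(2), p, Z=np.eye(2),
                          w=np.array([0.5, 0.5]), c_1=0.2, c_mu=0.3)
```

In this toy step the variance along the first axis ends up larger than along the second, because the rank-one term adds variance only in the direction of the evolution path.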


