# Evolution Strategies

This is a set of notes based on [@hardmaru](https://twitter.com/hardmaru?ref_src=twsrc%5Egoogle%7Ctwcamp%5Eserp%7Ctwgr%5Eauthor)'s blog posts that can be found [here](http://blog.otoro.net/2017/10/29/visual-evolution-strategies/) and [here](http://blog.otoro.net/2017/11/12/evolving-stable-strategies/).

Two simple toy problems for testing continuous black-box optimization algorithms:

* Schaffer function
* Rastrigin function
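
For reference, here is a minimal sketch of the two test functions in their common textbook forms. The notes do not say which variants are meant, so treat the exact formulas (Schaffer N. 2, unshifted Rastrigin) as assumptions; both are written as minimization problems, so negate them if a higher fitness should be better.

```python
import numpy as np


def rastrigin(x):
    """Textbook Rastrigin function; global minimum of 0 at the origin."""
    x = np.asarray(x, dtype=float)
    return 10.0 * x.size + np.sum(x**2 - 10.0 * np.cos(2.0 * np.pi * x))


def schaffer_n2(x):
    """Schaffer function N. 2 (two inputs); global minimum of 0 at the origin."""
    x0, x1 = np.asarray(x, dtype=float)
    numerator = np.sin(x0**2 - x1**2) ** 2 - 0.5
    denominator = (1.0 + 0.001 * (x0**2 + x1**2)) ** 2
    return 0.5 + numerator / denominator
```

Both functions have many local optima around the global one, which is what makes them useful for stress-testing a black-box optimizer.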

General outline of an evolution strategy:

* **Objective function** that takes a given **solution** and returns a single **fitness** score.
* Based on the current fitness scores, the algorithm produces the next generation of candidate solutions that is likely to produce even better results.
* Iterate the above steps until a satisfactory solution is found.

## Simple Evolution Strategy

We draw solutions from a Normal distribution with mean $\mu$ and standard deviation $\sigma$.

Run through the solutions and produce a fitness score for each. Keep the best solution and use it as the new mean $\mu$ of the Normal distribution from which the next generation of solutions will be drawn; $\sigma$ stays fixed.

This algorithm is **greedy**, so it is prone to getting stuck at a local optimum on more complicated problems. We need a more **diverse** set of ideas!
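
The loop described in this section can be sketched as follows. This is a rough illustration, assuming a fixed $\sigma$ and a fitness function where higher is better; the name `simple_es` and all default values are made up for the example.

```python
import numpy as np


def simple_es(fitness_fn, mu0, sigma=0.5, pop_size=50, n_generations=100, seed=0):
    """Greedy ES sketch: the best sample of each generation becomes the next mean."""
    rng = np.random.default_rng(seed)
    mu = np.asarray(mu0, dtype=float)
    best_fitness = fitness_fn(mu)
    for _ in range(n_generations):
        # Draw a population from N(mu, sigma^2 I); sigma never changes.
        population = mu + sigma * rng.standard_normal((pop_size, mu.size))
        scores = np.array([fitness_fn(s) for s in population])
        best = np.argmax(scores)
        # Greedy step: move the mean to the best solution of this generation.
        mu = population[best]
        best_fitness = scores[best]
    return mu, best_fitness
```

Because only a single solution survives each generation, everything the rest of the population learned is thrown away, which is where the greediness comes from.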

## Simple Genetic Algorithm

Instead of keeping only the best solution, in GA we keep the top 10% of the solutions in the current generation. Let the rest of the population die. 

For the next generation, randomly select two solutions from the survivors and recombine their parameters to form a new solution. This **crossover** recombination process uses a coin toss to determine which parent to take each parameter from.

Gaussian noise with a fixed standard deviation is then injected into each new solution after this recombination process.

GA helps to diversify. However, in practice, most of the solutions in the elite surviving population tend to converge to a **local optimum** over time.
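
A minimal sketch of the selection, crossover and mutation steps described above, assuming higher fitness is better. The function name, the random initial population and the default values are illustrative choices, and the two parents are drawn independently, so they can occasionally be the same survivor.

```python
import numpy as np


def simple_ga(fitness_fn, n_params, pop_size=100, elite_frac=0.1, sigma=0.1,
              n_generations=100, seed=0):
    """Simple GA sketch: keep the elite, recombine by coin toss, add fixed noise."""
    rng = np.random.default_rng(seed)
    population = rng.standard_normal((pop_size, n_params))
    n_elite = max(2, int(pop_size * elite_frac))
    for _ in range(n_generations):
        scores = np.array([fitness_fn(s) for s in population])
        # Keep the top fraction of the population; the rest die.
        elite = population[np.argsort(scores)[::-1][:n_elite]]
        # Pick two random parents from the survivors for every child.
        parents_a = elite[rng.integers(n_elite, size=pop_size)]
        parents_b = elite[rng.integers(n_elite, size=pop_size)]
        # Crossover: a coin toss per parameter decides which parent it comes from.
        coin = rng.random((pop_size, n_params)) < 0.5
        children = np.where(coin, parents_a, parents_b)
        # Mutation: Gaussian noise with a fixed standard deviation.
        population = children + sigma * rng.standard_normal(children.shape)
    scores = np.array([fitness_fn(s) for s in population])
    best = np.argmax(scores)
    return population[best], scores[best]
```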

Other, more sophisticated GAs exist, such as CoSyNe, ESP, and NEAT.

## Covariance-Matrix Adaptation Evolution Strategy (CMA-ES)

Drawback of simple ES and simple GA: the standard deviation of the noise is a fixed parameter. But there are times when we want to explore more and increase the standard deviation of the search space, and times when we want to narrow it down.

CMA-ES can adaptively increase or decrease the search space for the next generation. It will calculate the entire **covariance matrix** of the parameter space. At each generation, CMA-ES provides the parameters of a multi-variate normal distribution to sample solutions from. 

**Algorithm**

1. Calculate the fitness scores for all candidate solutions in generation (g).
2. Take the top 25% of generation (g).
3. Calculate the mean for the next generation, $\mu^{(g+1)}$, from the top 25% of the current generation (g).
4. Calculate the covariance matrix $C^{(g+1)}$ for generation (g+1) from the same top 25%, but centered on the current generation's mean $\mu^{(g)}$.
5. Sample a new set of solutions using $(\mu^{(g+1)}, C^{(g+1)})$.
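
Written out, steps 3 and 4 correspond to the estimates below. The symbols $N_{best}$ (the number of selected solutions) and $x_{k,i}$ (parameter $i$ of the $k$-th selected solution) are introduced here for illustration and do not appear in the original notes.

$$\mu_i^{(g+1)} = \frac{1}{N_{best}} \sum_{k=1}^{N_{best}} x_{k,i}$$

$$C_{ij}^{(g+1)} = \frac{1}{N_{best}} \sum_{k=1}^{N_{best}} \left(x_{k,i} - \mu_i^{(g)}\right)\left(x_{k,j} - \mu_j^{(g)}\right)$$

Centering on $\mu^{(g)}$ rather than $\mu^{(g+1)}$ means that a large shift of the mean inflates the covariance along the direction of the shift, which is what lets the search space grow when the optimum is far away.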

Complexity is $O(N^2)$, although recent approximations can bring it down to $O(N)$.

**Ok to use when search space is less than 1k parameters, up to 10k if patient.**

For details, see [this tutorial](https://arxiv.org/abs/1604.00772).

```python
import numpy as np


def mu(x):
    '''
    Parameters:
    x:
        parameter matrix for current generation, shape (n_solutions, n_params).

    Returns:
    mean for next generation.
    '''
    return np.mean(x, axis=0)


def vcov(x, mu_prev):
    '''
    Parameters:
    x:
        parameter matrix for current generation, shape (n_solutions, n_params).
    mu_prev:
        mean the current generation was sampled from, shape (n_params,).

    Returns:
    covariance matrix for next generation.
    '''
    _, D = x.shape
    (L,) = mu_prev.shape  # .shape is a tuple, so unpack it before comparing
    assert D == L

    # Mean outer product of the deviations from mu_prev: variances end up on
    # the diagonal, covariances on the off-diagonal entries.
    centered = x - mu_prev
    return centered.T @ centered / x.shape[0]


def compute_next(x, mu_prev):
    # x is expected to hold only the selected (top 25%) solutions.
    mu_next = mu(x)
    vcov_next = vcov(x, mu_prev)
    return (mu_next, vcov_next)
```
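
To see the five steps working together, here is a self-contained sketch of one generation, including the sampling step. It covers only the simplified scheme from these notes, not full CMA-ES (which also tracks evolution paths and adapts a global step size); the function name and defaults are hypothetical, and higher fitness is assumed to be better.

```python
import numpy as np


def simplified_cma_step(fitness_fn, mu_g, cov_g, pop_size=100, top_frac=0.25, rng=None):
    """Sample generation g from N(mu_g, cov_g) and return (mu, C) for generation g+1."""
    rng = np.random.default_rng() if rng is None else rng
    # Sample the current generation from the multivariate normal.
    population = rng.multivariate_normal(mu_g, cov_g, size=pop_size)
    # Step 1: fitness of every candidate.
    scores = np.array([fitness_fn(s) for s in population])
    # Step 2: keep the top fraction.
    n_top = max(2, int(pop_size * top_frac))
    top = population[np.argsort(scores)[::-1][:n_top]]
    # Step 3: next mean from the selected solutions.
    mu_next = top.mean(axis=0)
    # Step 4: next covariance, centered on the *current* mean mu_g.
    centered = top - mu_g
    cov_next = centered.T @ centered / n_top
    return mu_next, cov_next
```

Calling the function repeatedly, feeding each result back in as `mu_g` and `cov_g`, performs step 5 at the start of the next call.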

## Natural Evolution Strategy

[paper](http://www.jmlr.org/papers/volume15/wierstra14a/wierstra14a.pdf)

Weakness of CMA-ES and the other simple strategies so far: weak solutions, which contain information on what **not** to do, are discarded.

**REINFORCE-ES** Idea: maximize the **expected value** of the fitness score of a sampled solution. This is **almost** the same as maximizing the total fitness score of the entire population.

The estimated gradient can be fed to standard gradient-based optimizers, such as momentum SGD, RMSProp or Adam.

Unlike CMA-ES, there is no correlation structure in this implementation. Complexity is $O(N)$.
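
As a rough sketch of the idea, the gradient of the expected fitness with respect to $\mu$ and $\sigma$ can be estimated from samples with the log-likelihood trick, assuming each parameter has its own independent Normal distribution. The function name is hypothetical, and subtracting the mean fitness as a baseline is an addition here to reduce the variance of the estimate; it is not taken from the notes.

```python
import numpy as np


def reinforce_es_gradients(fitness_fn, mu, sigma, pop_size=200, rng=None):
    """Monte Carlo estimate of d E[F(z)] / d mu and d E[F(z)] / d sigma, z ~ N(mu, sigma^2)."""
    rng = np.random.default_rng() if rng is None else rng
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    z = mu + sigma * rng.standard_normal((pop_size, mu.size))
    fitness = np.array([fitness_fn(s) for s in z])
    # Assumption: mean-fitness baseline, added only for variance reduction.
    fitness = fitness - fitness.mean()
    # Gradients of log N(z | mu, sigma^2) for a diagonal Gaussian.
    grad_log_mu = (z - mu) / sigma**2
    grad_log_sigma = ((z - mu) ** 2 - sigma**2) / sigma**3
    # Every sample contributes, weighted by its fitness, including the bad ones.
    grad_mu = (fitness[:, None] * grad_log_mu).mean(axis=0)
    grad_sigma = (fitness[:, None] * grad_log_sigma).mean(axis=0)
    return grad_mu, grad_sigma
```

Since fitness is being maximized, an optimizer would take a step along these gradients (gradient ascent), or equivalently descend on the negated fitness.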

## OpenAI ES

In their [paper](https://blog.openai.com/evolution-strategies/) they implemented a special case of the REINFORCE-ES algorithm:

* keep $\sigma$ constant, only update $\mu$ at each generation.
* modified update rule suitable for parallel computation across multiple machines.

The paper discusses many practical aspects of deploying ES on RL-style tasks and is worth a read.
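
A single-machine sketch of the constant-$\sigma$ update might look like the following. The function name, the learning rate `alpha` and the other defaults are illustrative assumptions, and standardizing the fitness scores before the update is a common practical choice rather than something stated in these notes. The parallel machinery is left out entirely.

```python
import numpy as np


def openai_es(fitness_fn, mu0, sigma=0.1, alpha=0.02, pop_size=50,
              n_generations=200, seed=0):
    """ES with fixed sigma: only the mean mu is updated each generation."""
    rng = np.random.default_rng(seed)
    mu = np.asarray(mu0, dtype=float)
    for _ in range(n_generations):
        # Perturb the mean with Gaussian noise scaled by the constant sigma.
        eps = rng.standard_normal((pop_size, mu.size))
        fitness = np.array([fitness_fn(mu + sigma * e) for e in eps])
        # Assumption: standardize fitness so the step size is scale-free.
        fitness = (fitness - fitness.mean()) / (fitness.std() + 1e-8)
        # Move mu along the fitness-weighted sum of the perturbations.
        mu = mu + alpha / (pop_size * sigma) * (eps.T @ fitness)
    return mu
```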

## Fitness Shaping

**Fitness shaping** allows us to avoid outliers in the population dominating the approximate gradient calculation mentioned above.

The idea is to apply a **rank transformation** to the raw fitness scores and normalize the ranks to [-0.5, 0.5]. This is similar in spirit to batch norm.
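
A minimal sketch of such a rank transformation, assuming at least two fitness scores and no special handling of ties (the function name is made up for the example):

```python
import numpy as np


def centered_ranks(fitness):
    """Replace raw fitness scores by their ranks, rescaled to [-0.5, 0.5]."""
    fitness = np.asarray(fitness, dtype=float)
    ranks = np.empty(fitness.size, dtype=float)
    # The worst score gets rank 0, the best gets rank n - 1.
    ranks[np.argsort(fitness)] = np.arange(fitness.size)
    return ranks / (fitness.size - 1) - 0.5
```

With this transformation a score of 1000 among scores of 10, 3 and -5 simply maps to 0.5, so one extreme outlier has no more influence than any other best-in-generation solution.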

## Discussion

ES is good at problems where it is difficult to calculate accurate gradients.

# Evolution Strategies for Reinforcement Learning

General steps in RL:

```
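obs = env.reset()  # the first observation is needed to pick the first action
done = False
total_reward = 0

while not done:
    a = agent.get_action(obs)
    # The exact return tuple depends on the environment API in use.
    obs, reward, done = env.step(a)
    total_reward += reward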
```

## Deterministic Policies

The agent can be modelled in many ways: hard-coded rules, decision trees, linear functions, or an RNN.

The example uses a 2-layer FC network with `tanh` activation. Weights: $W_1$ and $W_2$.
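
For ES, the whole policy has to be expressed as one flat parameter vector that the optimizer can perturb. Below is a sketch of how $W_1$ and $W_2$ could be unpacked from such a vector. The layer sizes, the absence of bias terms and the `tanh` on the output layer are assumptions for illustration, as is the name `make_policy`.

```python
import numpy as np


def make_policy(params, obs_dim, hidden_dim, act_dim):
    """Build a deterministic 2-layer tanh policy from a flat parameter vector."""
    params = np.asarray(params, dtype=float)
    n_w1 = obs_dim * hidden_dim
    assert params.size == n_w1 + hidden_dim * act_dim
    # Slice the flat vector into the two weight matrices W1 and W2.
    w1 = params[:n_w1].reshape(obs_dim, hidden_dim)
    w2 = params[n_w1:].reshape(hidden_dim, act_dim)

    def get_action(obs):
        hidden = np.tanh(np.asarray(obs, dtype=float) @ w1)
        return np.tanh(hidden @ w2)

    return get_action
```

The fitness of a parameter vector is then the total reward collected when the resulting `get_action` is used in the environment loop.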

## Stochastic Policies

### Bayesian Neural Networks

Instead of having weights $W$ explicitly, we have $N(\mu, \sigma)$. During each forward pass, a new $W$ is sampled from $N(\mu, \sigma)$.
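
A small sketch of this sampling step for a single layer, assuming an independent Normal distribution per weight and a `tanh` activation (both function names are hypothetical):

```python
import numpy as np


def sample_weights(mu, sigma, rng):
    """Draw one weight matrix W ~ N(mu, sigma^2), element-wise."""
    return mu + sigma * rng.standard_normal(mu.shape)


def stochastic_forward(obs, mu_w, sigma_w, rng):
    """Forward pass of one tanh layer with freshly sampled weights."""
    w = sample_weights(mu_w, sigma_w, rng)
    return np.tanh(obs @ w)
```

Here the parameters being optimized are `mu_w` and `sigma_w` rather than the weights themselves, so the same observation can lead to different actions on different passes.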

A stochastic policy network, as in **Proximal Policy Optimization (PPO)**, samples the action from $N(\mu, \sigma)$ at the final layer.

**Adding noise to parameters** is also known to encourage the agent to explore the environment and **escape from local optima**.