# Natural Evolution Strategies (NES) toy example that optimizes a quadratic function

A bare-bones example of optimizing a black-box function `f` using
Natural Evolution Strategies (NES), where the parameter distribution is a
Gaussian with a fixed standard deviation.

Adapted from: https://gist.github.com/karpathy/77fbb6a8dac5395f1b73e7a89300318d
Originally linked from https://blog.openai.com/evolution-strategies/
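
For reference, the search-gradient estimate behind this kind of update can be written as below. This is a sketch of the fixed-variance Gaussian case, not a formula taken from the gist: the notation is chosen here for illustration, with $\epsilon_j$ standing for row `j` of the noise matrix `N`, $n$ for the population size `npop`, and $A_j$ for the standardized reward of sample `j`.

$$
\nabla_w \, \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)}\big[f(w + \sigma\epsilon)\big]
= \frac{1}{\sigma}\, \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)}\big[f(w + \sigma\epsilon)\,\epsilon\big]
\approx \frac{1}{n\sigma} \sum_{j=1}^{n} f(w + \sigma\epsilon_j)\,\epsilon_j
$$

$$
w \leftarrow w + \frac{\alpha}{n\sigma} \sum_{j=1}^{n} A_j\,\epsilon_j
$$

The code uses the standardized rewards $A_j$ in place of the raw values $f(w + \sigma\epsilon_j)$. Subtracting the mean acts as a baseline and leaves the expected direction unchanged, while dividing by the standard deviation rescales the step, so the update is best read as a normalized version of the estimate rather than the exact gradient.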

In [1]:
import numpy as np
np.random.seed(0)

In [2]:
# the function we want to optimize
def f(w):
  # here we would normally:
  # ... 1) create a neural network with weights w
  # ... 2) run the neural network on the environment for some time
  # ... 3) sum up and return the total reward

  # but for the purposes of an example, let's try to minimize
  # the squared L2 distance to a specific solution vector. So the highest reward
  # we can achieve is 0, when the vector w is exactly equal to solution
  reward = -np.sum(np.square(solution - w))
  return reward
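
As a quick sanity check of the reward described in the comments above, the sketch below evaluates the same expression with the target passed in explicitly. The names `reward_fn` and `target` are illustrative and do not appear in the original notebook, where `f` reads the global `solution` instead.

```python
import numpy as np

def reward_fn(w, target):
    # negative squared L2 distance between w and target
    return -np.sum(np.square(target - w))

target = np.array([0.5, 0.1, -0.3])
r_at_target = reward_fn(target, target)       # best possible reward, reached only at the target
r_at_origin = reward_fn(np.zeros(3), target)  # strictly negative anywhere else
print(r_at_target, r_at_origin)
```

Passing the target as an argument avoids relying on a global that is defined in a later cell, at the cost of a slightly longer call.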

In [3]:
# hyperparameters
npop = 50 # population size
sigma = 0.1 # noise standard deviation
alpha = 0.001 # learning rate

In [10]:
# start the optimization

solution = np.array([0.5, 0.1, -0.3])
w = np.random.randn(3) # our initial guess is random

for i in range(300):
    
    # print current fitness of the most likely parameter setting
    if i % 20 == 0:
        print("iter {} w: {}, solution: {}, reward: {}"
              .format(i, str(w), str(solution), f(w)))
        
    # sample a noise matrix for the population and allocate memory for the rewards
    N = np.random.randn(npop,3)
    R = np.zeros(npop)
    for j in range(npop):
        w_try = w + sigma*N[j] # jitter w using Gaussian noise with standard deviation sigma
        R[j] = f(w_try) # evaluate the jittered version
    
    # standardize the rewards to have zero mean and unit standard deviation
    A = (R - np.mean(R)) / np.std(R)
    # perform parameter update. The matrix multiply below
    # is just an efficient way to sum up all the rows of the noise matrix N,
    # where each row N[j] is weighted by A[j]
    w = w + alpha/(npop*sigma) * np.dot(N.T, A)    

iter 0 w: [-0.69843176  1.28072549  0.69720699], solution: [ 0.5  0.1 -0.3], reward: -3.824773158987239
iter 20 w: [-0.58066891  1.15975134  0.5881584 ], solution: [ 0.5  0.1 -0.3], reward: -3.079743534636482
iter 40 w: [-0.46523764  1.04138043  0.48773192], solution: [ 0.5  0.1 -0.3], reward: -2.438402393417708
iter 60 w: [-0.35270921  0.91490314  0.39706625], solution: [ 0.5  0.1 -0.3], reward: -1.87708147977002
iter 80 w: [-0.22040744  0.79679601  0.3051493 ], solution: [ 0.5  0.1 -0.3], reward: -1.3707172273391777
iter 100 w: [-0.10843605  0.66833215  0.20260919], solution: [ 0.5  0.1 -0.3], reward: -0.9458118655214777
iter 120 w: [0.01204189 0.56321217 0.08864005], solution: [ 0.5  0.1 -0.3], reward: -0.6037097110784341
iter 140 w: [ 0.13462939  0.44321302 -0.00830926], solution: [ 0.5  0.1 -0.3], reward: -0.33637435307150726
iter 160 w: [ 0.2602041   0.32826352 -0.09911975], solution: [ 0.5  0.1 -0.3], reward: -0.14995918222059737
iter 180 w: [ 0.36213197  0.22241824 -0.18699382]

In [6]:
w

array([ 1.77651471,  0.36221492, -0.71565331])
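
The optimization loop above evaluates the population one sample at a time and then combines the noise rows with a matrix multiply. A possible vectorized variant is sketched below; it is not the gist's own code. The names `nes_step` and `f_batch`, the use of `np.random.default_rng`, and the small `1e-8` term in the denominator (a guard against a zero standard deviation) are assumptions introduced here. The last few lines also check that `N.T @ A` matches the explicit weighted sum of rows that the comments in the loop describe.

```python
import numpy as np

solution = np.array([0.5, 0.1, -0.3])

def f_batch(W):
    # same reward as f, applied to every row of a (npop, 3) array
    return -np.sum(np.square(solution - W), axis=1)

def nes_step(w, f_batch, rng, npop=50, sigma=0.1, alpha=0.001):
    # hypothetical helper: one NES update with the whole population evaluated at once
    N = rng.standard_normal((npop, w.size))  # one noise vector per row
    R = f_batch(w + sigma * N)               # rewards of all jittered copies, shape (npop,)
    A = (R - R.mean()) / (R.std() + 1e-8)    # standardized rewards; 1e-8 is an added safeguard
    return w + alpha / (npop * sigma) * (N.T @ A)

rng = np.random.default_rng(0)
w0 = rng.standard_normal(3)  # random initial guess
w = w0.copy()
for _ in range(300):
    w = nes_step(w, f_batch, rng)

reward_start = f_batch(w0[None, :])[0]
reward_end = f_batch(w[None, :])[0]
print("start:", reward_start, "end:", reward_end)

# the matrix multiply equals the sum of noise rows N[j] weighted by A[j]
N = rng.standard_normal((5, 3))
A = rng.standard_normal(5)
explicit = sum(A[j] * N[j] for j in range(5))
print(np.allclose(N.T @ A, explicit))
```

Evaluating all perturbations in one call only pays off when the objective itself can be batched, as this toy reward can. For an objective that runs a simulation per sample, the per-sample loop from the notebook is the more natural shape.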