# Neuroevolution with Gymnasium
## Lunar Lander experiments

Name: Ana Madrid Serrano

This notebook explores different ways of interacting with the **LunarLander-v2** environment from the Gymnasium library.

The experiments are structured progressively:

1. **Manual control** of the environment by a human user, to understand the task.
2. **A rule-based agent**, using a handcrafted policy based on observations.
3. **A neural model**, a perceptron controller whose parameters were evolved with a genetic algorithm (neuroevolution).




In [3]:
pip install gymnasium[box2d]




## 1. Human control of Lunar Lander


Before implementing any automated agent, the environment is tested using **manual control**.
This allows direct observation of the state variables, the effect of actions, and the difficulty of the landing task.

In [4]:
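import gymnasium as gym
import pygame
from gymnasium.utils.play import play

# play() draws the frames itself, so the environment must render to arrays
env = gym.make("LunarLander-v2", render_mode="rgb_array")

# Map keyboard keys to discrete actions; with no key pressed, action 0 (do nothing) is sent
lunar_lander_keys = {
    (pygame.K_UP,): 2,     # fire main engine
    (pygame.K_LEFT,): 1,   # fire left engine
    (pygame.K_RIGHT,): 3,  # fire right engine
}
play(env, zoom=3, keys_to_action=lunar_lander_keys, noop=0)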



## 2. Rule-based agent (heuristic policy)


The Lunar Lander environment provides an **observation vector** with the following components:

observation = [x, y, vx, vy, angle, angular_velocity, left_leg, right_leg]

The **action space** is discrete and defined as:

actions = [do nothing, fire left engine, fire main engine, fire right engine]
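
To keep index-based code readable, the two lists above can be given names. The following is a small sketch; the names `Obs` and `Action` are illustrative assumptions and are not part of Gymnasium, and the sample observation is made up.

```python
from enum import IntEnum


class Obs(IntEnum):
    """Position of each component in the observation vector (illustrative names)."""
    X = 0
    Y = 1
    VX = 2
    VY = 3
    ANGLE = 4
    ANGULAR_VELOCITY = 5
    LEFT_LEG = 6
    RIGHT_LEG = 7


class Action(IntEnum):
    """Discrete actions, in the order listed above (illustrative names)."""
    NOTHING = 0
    FIRE_LEFT = 1
    FIRE_MAIN = 2
    FIRE_RIGHT = 3


# A made-up observation, only to show the indexing.
observation = [0.1, 1.4, -0.3, -0.5, 0.05, 0.0, 0.0, 0.0]

vertical_speed = observation[Obs.VY]   # same as observation[3]
falling_fast = vertical_speed < -0.2
print(vertical_speed, falling_fast, int(Action.FIRE_MAIN))
```

Since `IntEnum` members behave as integers, they can be used directly as list indices and returned as actions.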



Based on these variables, a simple handcrafted policy is defined.

In [None]:
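def policy(observation):
    """Handcrafted policy: the first rule that matches decides the action."""
    # observation = [x, y, vx, vy, angle, angular_velocity, left_leg, right_leg]
    if observation[3] < -0.2:   # falling too fast: fire main engine
        print('⬆︎', end='')
        return 2

    if observation[4] < -0.1:   # angle too negative: fire left engine
        print('⬅︎', end='')
        return 1

    if observation[5] < -0.1:   # angular velocity too negative: fire left engine
        print('⬅︎', end='')
        return 1

    if observation[2] < -0.1:   # horizontal velocity too negative: fire right engine
        print('➡︎', end='')
        return 3

    if observation[0] < -0.1:   # x position too negative: fire right engine
        print('➡︎', end='')
        return 3

    if observation[4] > 0.1:    # angle too positive: fire right engine
        print('➡︎', end='')
        return 3

    if observation[5] > 0.1:    # angular velocity too positive: fire right engine
        print('➡︎', end='')
        return 3

    if observation[2] > 0.1:    # horizontal velocity too positive: fire left engine
        print('⬅︎', end='')
        return 1

    if observation[0] > 0.1:    # x position too positive: fire left engine
        print('⬅︎', end='')
        return 1

    if observation[3] > 0.1:    # moving upwards: do nothing and let gravity act
        print('⬇︎', end='')
        return 0

    return 0                    # no rule matched: do nothing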

The environment is now executed using the heuristic policy.
At each time step, the policy receives the current observation and selects an action.

The accumulated reward is used as a performance measure for the episode.


In [None]:
import gymnasium as gym

env = gym.make("LunarLander-v2", render_mode="human")

def run():
    observation, info = env.reset()
    racum = 0  # accumulated reward of the episode

    while True:
        action = policy(observation)
        observation, reward, terminated, truncated, info = env.step(action)
        racum += reward

        if terminated or truncated:
            # Rescale the accumulated reward from [-1000, 1000] to [0, 1]
            r = (racum + 1000) / 2000
            print(racum, r)
            return racum

run()


➡︎➡︎➡︎➡︎➡︎➡︎⬅︎➡︎⬅︎➡︎⬅︎➡︎⬅︎➡︎⬅︎➡︎⬅︎➡︎⬅︎➡︎⬅︎➡︎⬅︎➡︎⬅︎⬆︎⬅︎⬆︎⬆︎⬅︎⬆︎⬅︎⬆︎⬅︎⬆︎⬅︎⬆︎⬅︎⬆︎➡︎⬆︎➡︎⬆︎⬆︎⬆︎➡︎⬆︎➡︎⬆︎⬆︎⬆︎➡︎➡︎⬆︎⬆︎➡︎⬆︎⬅︎⬆︎➡︎⬆︎⬅︎⬆︎⬅︎⬆︎⬅︎⬆︎⬅︎⬆︎⬅︎⬆︎⬅︎⬆︎⬆︎⬅︎⬆︎⬆︎➡︎⬆︎⬆︎➡︎➡︎⬆︎➡︎⬆︎⬆︎➡︎⬆︎⬅︎⬆︎⬆︎⬅︎⬆︎⬅︎⬆︎⬆︎⬅︎⬆︎⬆︎➡︎⬆︎➡︎⬆︎⬆︎➡︎⬆︎➡︎➡︎⬆︎➡︎⬆︎⬆︎⬅︎⬆︎⬅︎⬆︎⬅︎⬆︎⬅︎⬆︎⬅︎⬅︎⬆︎⬅︎⬆︎⬅︎⬆︎➡︎⬆︎⬆︎⬆︎⬆︎➡︎⬆︎➡︎⬆︎⬆︎➡︎⬆︎⬆︎➡︎⬆︎⬅︎➡︎⬆︎⬆︎⬆︎➡︎⬆︎⬅︎⬆︎⬆︎⬅︎⬆︎⬅︎⬆︎⬅︎⬆︎⬅︎⬆︎⬅︎⬆︎⬆︎⬅︎⬆︎⬅︎⬆︎⬆︎➡︎⬆︎➡︎⬆︎➡︎⬆︎➡︎⬆︎➡︎⬆︎➡︎⬆︎➡︎⬆︎⬆︎⬆︎⬅︎⬅︎⬆︎⬅︎⬆︎⬆︎⬆︎⬆︎⬅︎⬆︎⬆︎⬅︎⬆︎⬆︎⬅︎⬆︎⬅︎⬆︎⬅︎⬆︎⬆︎⬅︎⬆︎➡︎⬆︎⬆︎➡︎⬆︎➡︎⬆︎➡︎⬆︎➡︎➡︎⬆︎➡︎⬆︎⬆︎⬅︎⬆︎⬆︎⬅︎⬆︎⬅︎⬆︎⬅︎⬆︎⬆︎⬆︎⬆︎⬅︎⬆︎⬆︎⬅︎⬅︎⬆︎⬆︎⬆︎➡︎⬆︎➡︎⬆︎➡︎⬆︎➡︎⬆︎⬆︎⬆︎➡︎⬆︎⬆︎➡︎⬅︎⬆︎⬅︎⬆︎⬅︎⬆︎⬆︎⬅︎⬆︎➡︎⬆︎➡︎⬆︎⬆︎⬆︎➡︎⬆︎⬆︎➡︎⬆︎⬅︎⬆︎⬅︎⬆︎⬆︎⬅︎⬆︎⬅︎⬆︎⬅︎⬆︎⬆︎➡︎⬆︎⬆︎➡︎⬆︎⬆︎➡︎⬆︎➡︎⬆︎➡︎⬆︎➡︎⬆︎⬆︎⬅︎➡︎⬆︎⬅︎⬆︎⬆︎⬆︎⬆︎⬅︎⬆︎⬆︎⬅︎⬆︎⬅︎⬆︎⬅︎⬆︎⬆︎⬅︎⬆︎⬅︎⬅︎⬆︎⬅︎⬆︎➡︎⬆︎➡︎⬆︎⬆︎➡︎⬆︎➡︎⬆︎➡︎⬆︎⬅︎⬆︎⬅︎⬆︎⬅︎⬅︎⬆︎⬆︎➡︎⬆︎⬅︎➡︎➡︎➡︎➡︎➡︎➡︎➡︎➡︎⬅︎➡︎➡︎➡︎⬅︎⬅︎⬅︎⬅︎⬅︎⬅︎⬅︎⬅︎⬅︎⬅︎⬅︎⬅︎⬅︎⬅︎⬅︎⬅︎⬅︎⬅︎⬅︎⬅︎⬅︎⬅︎⬅︎⬅︎260.5318062615744 0.6302659031307871


260.5318062615744
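
The second number printed by `run` is the accumulated reward rescaled with `(racum + 1000) / 2000`. The sketch below isolates that mapping. The helper name `normalize_reward` and the reading of ±1000 as nominal bounds are assumptions, since the notebook does not state where the bounds come from.

```python
def normalize_reward(racum, low=-1000.0, high=1000.0):
    """Linearly map an accumulated reward from [low, high] to [0, 1].

    With the default bounds this equals (racum + 1000) / 2000, the expression
    used in run(). Values outside the bounds map outside [0, 1]; nothing is clipped.
    """
    return (racum - low) / (high - low)


print(normalize_reward(-1000.0))  # lower bound -> 0.0
print(normalize_reward(0.0))      # midpoint    -> 0.5
print(normalize_reward(1000.0))   # upper bound -> 1.0
```

A rescaled value of this kind is convenient when a fitness in a fixed, non-negative range is needed.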

## 3. Neural model for neuroevolution

In this section, a **neuroevolutionary approach** is applied to the Lunar Lander problem.

Instead of designing rules manually or training neural networks using gradient-based methods,
the agent's behavior is optimized using **SALGA (Simple Adaptive Learning Genetic Algorithm)**.

SALGA evolves a population of candidate solutions, called chromosomes, by applying genetic operators
such as selection, mutation, and recombination.
Each chromosome encodes the parameters of a neural controller.
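
The loop of selection, recombination, and mutation described above can be sketched with a generic genetic algorithm. This is not SALGA itself, whose details are not given here. The tournament selection, one-point crossover, Gaussian mutation, elitism, the toy fitness function, and every name and parameter value below are assumptions chosen for illustration.

```python
import random


def toy_fitness(chromosome):
    """Toy objective (an assumption): higher is better, maximum 0 at all zeros."""
    return -sum(g * g for g in chromosome)


def tournament(population, fitnesses, rng, k=3):
    """Return the fittest of k randomly drawn individuals."""
    contenders = rng.sample(range(len(population)), k)
    return population[max(contenders, key=lambda i: fitnesses[i])]


def crossover(a, b, rng):
    """One-point recombination of two parents."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]


def mutate(chromosome, rng, rate=0.1, sigma=0.3):
    """Add Gaussian noise to each gene with probability `rate`."""
    return [g + rng.gauss(0.0, sigma) if rng.random() < rate else g
            for g in chromosome]


def evolve(fitness, length, pop_size=30, generations=40, seed=0):
    """Evolve real-valued chromosomes; return the best one and the best fitness per generation."""
    rng = random.Random(seed)
    population = [[rng.uniform(-1.0, 1.0) for _ in range(length)]
                  for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        fitnesses = [fitness(c) for c in population]
        best_index = max(range(pop_size), key=lambda i: fitnesses[i])
        history.append(fitnesses[best_index])
        # Elitism: the best individual survives unchanged.
        next_population = [population[best_index]]
        while len(next_population) < pop_size:
            parent_a = tournament(population, fitnesses, rng)
            parent_b = tournament(population, fitnesses, rng)
            next_population.append(mutate(crossover(parent_a, parent_b, rng), rng))
        population = next_population
    best = max(population, key=fitness)
    return best, history


best, history = evolve(toy_fitness, length=36)
print(len(best), history[0], history[-1])
```

For Lunar Lander, the toy objective would be replaced by a function that decodes the chromosome into the controller, runs an episode, and returns the accumulated reward.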



The agent is represented by a **single-layer perceptron** that maps environment observations to actions.

- **Inputs (8):** Lunar Lander observation vector  
- **Outputs (4):** Discrete action space  

The perceptron parameters (weights and biases) are encoded as a chromosome,
making them suitable for evolutionary optimization using SALGA.
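
For the 8-input, 4-output perceptron this encoding needs 8 × 4 weights plus 4 biases, that is, 36 genes. The sketch below shows the layout with plain Python lists: weights first, in row-major order, followed by the biases, which matches what the `from_chromosome` method of the perceptron class expects. The helper names `encode` and `decode` and the dummy gene values are illustrative assumptions.

```python
def encode(weights, biases):
    """Flatten a weight matrix (list of rows) and a bias list into one chromosome."""
    return [w for row in weights for w in row] + list(biases)


def decode(chromosome, ninput, noutput):
    """Split a chromosome into an ninput x noutput weight matrix and noutput biases."""
    w_size = ninput * noutput
    if len(chromosome) != w_size + noutput:
        raise ValueError("chromosome length does not match the network shape")
    weights = [chromosome[i * noutput:(i + 1) * noutput] for i in range(ninput)]
    biases = chromosome[w_size:]
    return weights, biases


ninput, noutput = 8, 4
chromosome = [float(i) for i in range(ninput * noutput + noutput)]  # dummy genes 0..35

weights, biases = decode(chromosome, ninput, noutput)
print(len(chromosome))                        # 36
print(weights[0])                             # [0.0, 1.0, 2.0, 3.0]
print(biases)                                 # [32.0, 33.0, 34.0, 35.0]
print(encode(weights, biases) == chromosome)  # True
```

The length check is a design choice of this sketch: a chromosome of the wrong size fails loudly instead of being silently truncated.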


A perceptron model is instantiated to serve as the neural controller of the agent.
The structure of the network remains fixed during evolution; only its parameters are evolved.


In [1]:
import numpy as np
class Perceptron:
    def __init__(self, ninput, noutput):
        self.ninput = ninput
        self.noutput = noutput
        self.w = np.random.rand(ninput, noutput) - 0.5
        self.b = np.random.rand(noutput) - 0.5

    def forward(self, x):  # propagate an input vector x and return the output
        u = np.dot(x, self.w) + self.b
        # Step activation: 0 where u < 0, 1 where u >= 0
        return np.piecewise(u, [u < 0, u >= 0], [0, 1])

    def update(self, x, d, alpha):  # perform one training iteration
        s = self.forward(x)  # propagate
        # Compute the update of the weights and the bias
        error = d - s
        self.w += alpha * np.outer(x, error)
        self.b += error * alpha

    def RMS(self, X, D):  # compute the RMS error
        S = self.forward(X)
        return np.mean(np.sqrt(np.mean(np.square(S - D), axis=1)))

    def accuracy(self, X, D):  # compute the ratio of correct outputs
        S = self.forward(X)
        errors = np.mean(np.abs(D - S))
        return 1.0 - errors

    def info(self, X, D):  # trace of how training is progressing
        print('     RMS: %6.5f' % self.RMS(X, D))
        print('Accuracy: %6.5f' % self.accuracy(X, D))

    def train(self, X, D, alpha, epochs, trace=0):  # train using update
        for e in range(1, epochs + 1):
            for i in range(len(X)):
                self.update(X[i], D[i], alpha)
            if trace != 0 and e % trace == 0:
                print('\n   Epoch: %d' % e)
                self.info(X, D)

    def from_chromosome(self, chromosome):
        # Extract the weights and biases from the chromosome list
        w_size = self.ninput * self.noutput
        w = np.array(chromosome[:w_size]).reshape(self.ninput, self.noutput)
        b = np.array(chromosome[w_size:w_size + self.noutput])

        # Update the weights and biases of the network
        self.w = w
        self.b = b

The following chromosome corresponds to the **best individual obtained after applying SALGA**.

This chromosome was selected based on its fitness, defined as the accumulated reward obtained
by the agent during an episode of the Lunar Lander environment.




The chromosome is decoded and loaded into the perceptron.
This process assigns the evolved weights and biases to the neural network,
fully defining the behavior of the agent.


In [None]:
model = Perceptron(8,4)
ch=[-0.061264477472171563, 1.2334191493374835, 0.40603380952043866, 0.9553591142153932, 0.14556935003108212, 0.17990814841861139, -0.8292220744876733, 0.3686721742648328, 0.1958597704484823, 2.3265527364414105, 1.1800661108170107, 1.1311726379626674, 2.4006362522310893, 0.754321877904867, -2.00412648105665, -0.22351230687257362, -0.24490131279372235, -2.5545128379705715, -1.7445478055667343, -1.1100652947300667, 1.9377778639443637, -1.0145382491395563, -2.1039384504087746, 0.5057075365908111, 0.47659236811333505, -0.7576359187542151, -0.3814027318230223, 1.1234334031505608, 1.425999767768399, -0.6016325392277657, 1.2577655010588522, -1.4565977437520088, -0.1549568560654263, 0.025309412884416155, -0.13978695682664993, 2.0723721822560996]
model.from_chromosome(ch)

A policy function is defined using the evolved neural controller.

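For each observation:
1. The perceptron computes the output activations. Because `forward` applies a step function, each activation is either 0 or 1.
2. The action corresponding to the maximum activation is selected. `np.argmax` returns the first maximal index, so ties are resolved in favour of the lowest-numbered action.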

This policy directly reflects the solution found by the SALGA algorithm.
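
The two steps can be traced by hand on a tiny network. The sketch below re-implements the step-activated forward pass with plain Python lists and made-up weights (both are assumptions for illustration; the notebook uses NumPy and the evolved chromosome). Taking the first maximal index mirrors the tie-breaking of `np.argmax`.

```python
def forward(x, w, b):
    """Step-activated single layer: output j is 1 if sum_i x[i]*w[i][j] + b[j] >= 0, else 0."""
    outputs = []
    for j in range(len(b)):
        u = sum(x[i] * w[i][j] for i in range(len(x))) + b[j]
        outputs.append(1 if u >= 0 else 0)
    return outputs


def argmax_first(values):
    """Index of the first maximal value (same tie-breaking as np.argmax)."""
    return max(range(len(values)), key=lambda i: values[i])


# Made-up 2-input, 4-output network, only for illustration.
w = [[1.0, -1.0, 0.5, 0.0],
     [0.0, 2.0, 0.5, -1.0]]
b = [-0.5, 0.0, 0.1, -0.2]

x = [0.2, 0.3]
s = forward(x, w, b)
print(s, argmax_first(s))          # [0, 1, 1, 0] 1
print(argmax_first([0, 0, 0, 0]))  # 0
```

Outputs 1 and 2 both fire, and the lower index wins. When no output fires, index 0 is returned, which in Lunar Lander is the "do nothing" action.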


In [None]:
def policy (observation):
    s = model.forward(observation)
    action = np.argmax(s)
    return action

The evolved agent is now evaluated in the **LunarLander-v2** environment using the policy defined above.


In [None]:
env = gym.make("LunarLander-v2", render_mode="human")

def run():
    observation, info = env.reset()
    racum = 0  # accumulated reward of the episode
    while True:
        action = policy(observation)
        observation, reward, terminated, truncated, info = env.step(action)

        racum += reward

        if terminated or truncated:
            # Rescale the accumulated reward from [-1000, 1000] to [0, 1]
            r = (racum + 1000) / 2000
            print(racum, r)
            return racum

run()


The obtained reward provides an indication of the agent's performance.

Higher accumulated rewards generally correspond to smoother landings,
lower fuel consumption, and successful touchdowns.
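
A single episode gives a noisy estimate, because the starting conditions differ between resets. One option is to average the accumulated reward over several seeded episodes. The sketch below assumes only the `reset`/`step` interface already used in `run`. `StubEnv`, `evaluate`, the toy reward, and the constant policy are hypothetical stand-ins so that the example runs without Box2D; in the notebook, the real `env` and `policy` would be passed instead.

```python
import random
from statistics import mean


class StubEnv:
    """Hypothetical stand-in with the same reset/step signatures as a Gymnasium environment."""

    def __init__(self, horizon=5):
        self.horizon = horizon
        self.rng = random.Random()
        self.t = 0

    def reset(self, seed=None):
        if seed is not None:
            self.rng.seed(seed)
        self.t = 0
        observation = [self.rng.uniform(-1.0, 1.0) for _ in range(8)]
        return observation, {}

    def step(self, action):
        self.t += 1
        observation = [self.rng.uniform(-1.0, 1.0) for _ in range(8)]
        reward = 1.0 if action == 0 else -1.0  # arbitrary toy reward
        terminated = False
        truncated = self.t >= self.horizon     # end the episode after `horizon` steps
        return observation, reward, terminated, truncated, {}


def evaluate(env, policy, episodes=10, seed=0):
    """Return the accumulated reward of each episode, using one seed per episode."""
    returns = []
    for episode in range(episodes):
        observation, info = env.reset(seed=seed + episode)
        racum = 0.0
        while True:
            action = policy(observation)
            observation, reward, terminated, truncated, info = env.step(action)
            racum += reward
            if terminated or truncated:
                break
        returns.append(racum)
    return returns


returns = evaluate(StubEnv(), lambda observation: 0, episodes=3)
print(returns, mean(returns))
```

Fixing one seed per episode keeps the comparison between two policies fair, since both face the same sequence of starting conditions.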