# Use a Fixed Deterministic Policy to Control LunarLanderContinuous-v2

### Policy

The policy is adapted from `gym.envs.box2d.lunar_lander.demo_heuristic_lander`.

Let the observation be $(x,y,v_x,v_y,\theta,v_\theta,i_\text{left},i_\text{right})$: position, linear velocity, angle, angular velocity, and the two leg-contact indicators. The action is
$\left(f_y,f_\theta\right)$, where

$f_y=\begin{cases}5.5\left|x\right|-10y-10v_y-1,&i_\text{left}=0,i_\text{right}=0\\-10v_y-1,&\text{otherwise}\end{cases}$

$f_\theta=\begin{cases}-\mathrm{clip}(5x+10v_x,-4,4)+10\theta+20v_\theta,&i_\text{left}=0,i_\text{right}=0\\0,&\text{otherwise}.\end{cases}$


     
### Test

In [1]:
import numpy as np
import gym
np.random.seed(0)
env = gym.make('LunarLanderContinuous-v2')
env.seed(0)

[0]

In [2]:
class Agent:
    def decide(self, observation):
        x, y, v_x, v_y, angle, v_angle, contact_left, contact_right = observation

        if contact_left or contact_right: # legs have contact
            f_y = -10. * v_y - 1.
            f_angle = 0.
        else:
            f_y = 5.5 * np.abs(x) - 10. * y - 10. * v_y - 1.
            f_angle = -np.clip(5. * x + 10. * v_x, -4, 4) + 10. * angle + 20. * v_angle

        action = np.array([f_y, f_angle])
        return action

agent = Agent()
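
As a quick sanity check of the control law, the sketch below re-implements the same two branches as a standalone function and evaluates it on hand-picked observations. The name `heuristic_action` and the sample observations are hypothetical, chosen only for illustration. The final clip to $[-1, 1]$ rests on the assumption that the continuous environment clips actions to that range internally, which would explain why the agent itself never clips.

```python
import numpy as np

def heuristic_action(observation):
    """Standalone copy of the control law in Agent.decide (hypothetical helper)."""
    x, y, v_x, v_y, angle, v_angle, contact_left, contact_right = observation
    if contact_left or contact_right:  # at least one leg touches the ground
        f_y = -10. * v_y - 1.
        f_angle = 0.
    else:  # still airborne
        f_y = 5.5 * np.abs(x) - 10. * y - 10. * v_y - 1.
        f_angle = -np.clip(5. * x + 10. * v_x, -4, 4) + 10. * angle + 20. * v_angle
    return np.array([f_y, f_angle])

# Made-up airborne state: right of the pad, descending, drifting left.
airborne = np.array([0.1, 1.0, -0.2, -0.5, 0.05, 0.1, 0., 0.])
raw = heuristic_action(airborne)   # approximately [-5.45, 4.0]
clipped = np.clip(raw, -1., 1.)    # assumed to mirror the environment's own clipping
print(raw, clipped)

# Made-up landed state: both legs in contact, sinking slowly.
landed = np.array([0.0, 0.0, 0.0, -0.1, 0.0, 0.0, 1., 1.])
print(heuristic_action(landed))    # approximately [0.0, 0.0]
```

Because the gains are large, the raw outputs often fall well outside $[-1, 1]$, so under the clipping assumption the policy behaves much like a saturating controller whenever the lander is far from its target state.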

In [3]:
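def play_once(env, agent):
    # Run one episode and return its total (undiscounted) reward.
    # Uses the older Gym API: reset() returns the observation and
    # step() returns (observation, reward, done, info).
    observation = env.reset()
    episode_reward = 0.
    while True:
        action = agent.decide(observation)
        observation, reward, done, _ = env.step(action)
        episode_reward += reward
        if done:
            break
    return episode_reward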
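
The rollout loop above assumes that `reset()` returns only the observation and that `step()` returns four values. Gym 0.26 and later, and Gymnasium, instead return `(observation, info)` from `reset()` and five values from `step()`. The sketch below is one way to tolerate both conventions. `play_once_compat` and its `max_steps` parameter are hypothetical names, not part of Gym, and the environment id or the seeding calls may also need adjusting on newer releases.

```python
def play_once_compat(env, agent, max_steps=None):
    """Run one episode under either reset/step convention (illustrative helper)."""
    reset_result = env.reset()
    # Newer APIs return (observation, info); older ones return the observation alone.
    observation = reset_result[0] if isinstance(reset_result, tuple) else reset_result
    episode_reward = 0.
    steps = 0
    while True:
        result = env.step(agent.decide(observation))
        if len(result) == 5:
            # Newer API: termination and truncation are reported separately.
            observation, reward, terminated, truncated, _ = result
            done = terminated or truncated
        else:
            # Older API: a single done flag.
            observation, reward, done, _ = result
        episode_reward += reward
        steps += 1
        if done or (max_steps is not None and steps >= max_steps):
            break
    return episode_reward
```

The optional `max_steps` cap is a safety net for environments created without a time limit. Leaving it at `None` reproduces the behavior of the original loop.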

Test over 100 episodes:

In [4]:
episode_rewards = [play_once(env, agent) for _ in range(100)]
print('average episode rewards = {:.2f}'.format(np.mean(episode_rewards)))

average episode rewards = 282.56


Test over 1000 episodes:

In [5]:
episode_rewards = [play_once(env, agent) for _ in range(1000)]
print('average episode rewards = {:.2f}'.format(np.mean(episode_rewards)))

average episode rewards = 282.89
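
A mean alone hides how much individual episodes vary. The sketch below shows one way to summarize a list such as `episode_rewards` with its spread, standard error, and the fraction of episodes at or above a threshold. `summarize_rewards` is a hypothetical helper, the default threshold of 200 is an assumption based on the score commonly quoted as "solved" for LunarLander, and the numbers in the demo call are made up for illustration rather than taken from the runs above.

```python
import numpy as np

def summarize_rewards(episode_rewards, solved_threshold=200.):
    """Summarize per-episode rewards (hypothetical helper, threshold is an assumption)."""
    rewards = np.asarray(episode_rewards, dtype=float)
    n = rewards.size
    # Sample standard deviation; undefined for a single episode, so fall back to 0.
    std = rewards.std(ddof=1) if n > 1 else 0.
    return {
        'mean': rewards.mean(),
        'std': std,
        'stderr': std / np.sqrt(n),  # standard error of the mean
        'min': rewards.min(),
        'frac_above_threshold': np.mean(rewards >= solved_threshold),
    }

# Made-up rewards, only to show the call; not results from this notebook.
demo = summarize_rewards([150., 250., 300., 300.])
for name, value in demo.items():
    print('{} = {:.2f}'.format(name, value))
```

Applied to the real `episode_rewards`, the standard error indicates whether the small difference between the 100-episode and 1000-episode averages is within sampling noise.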


In [6]:
env.close()