<img src="header.png" align="left"/>

# Exercise: Reinforcement Learning Moon Lander (10 points)


The goal of this exercise is to work with reinforcement learning (RL) models and get a basic understanding of the topic. We will first develop controllers for the simple cart pole model and later for the lunar lander.
Neil Armstrong was the first to control a lunar lander in 1969. See a [video](https://youtu.be/xc1SzgGhMKc?t=520) about this masterpiece.
Luckily, we do not have to go to the moon but can run our experiments in simulation using the [OpenAI Gym](https://gym.openai.com/) software.


**Note**: OpenAI Gym is not well supported in Anaconda. Please install gym in your conda environment with pip, using the following commands:

```
pip install gym
pip install box2d box2d-kengz
```

**Note**: It can happen that the rendering window does not show up or does not close properly. In this case, please check your environment, look for a solution, and post it in the forum.


# Module imports

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import gym
import time

In [None]:
# suppress some warnings
from warnings import simplefilter
# ignore all future warnings
simplefilter(action='ignore', category=FutureWarning)
simplefilter(action='ignore', category=Warning)

np.set_printoptions(precision=4)
np.set_printoptions(suppress=True)

In [None]:
# GPU support
import tensorflow as tf
print ( tf.__version__ ) 

tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.ERROR )
physical_devices = tf.config.experimental.list_physical_devices('GPU')
if len(physical_devices) > 0:
    tf.config.experimental.set_memory_growth(physical_devices[0], True)
config = tf.compat.v1.ConfigProto()
config.gpu_options.allow_growth = True
tf.compat.v1.keras.backend.set_session(tf.compat.v1.Session(config=config))

# Task: Very basic RL example (1 point)

Run this basic cart pole example and find out how it works and what the basic functions of gym are. Document the code with python comments. Find out what the observation and action values mean.

In [None]:
#
# Result: documented code...
#
env = gym.make('CartPole-v0')
env.reset()
cumulated_reward = 0
for i in range(50):

    env.render(mode='close')
    
    action = env.action_space.sample()
    observation, reward, done, info = env.step( action )
    cumulated_reward += reward
    
    print( '\r', 'o:{} r:{} cr:{} d:{}   a:{}'.format(observation,reward,cumulated_reward,done,action), end='' )
    
    if done:
        env.reset()

    # some delay important for display to catch up
    time.sleep(0.1)
      
env.close()

# Task: Implement a basic on-off control strategy (1 point)

Before we go into advanced control strategies, let's attempt to control the cart pole with a simple on-off control strategy. Reading the [documentation](https://github.com/openai/gym/wiki/CartPole-v0) of this gym, we find that it has two actions (push cart left = 0 and push cart right = 1). So, one idea could be to just look at the pole's angle and push the cart left if the pole leans to the left and vice versa. Give it a try.
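
The rule described above can be sketched as a small function that needs no simulator. This is only a sketch under assumptions: the name `on_off_action` is hypothetical, and it assumes that the third observation entry is the pole angle and that a negative angle means the pole leans to the left.

```python
# Minimal on-off (bang-bang) controller sketch for the cart pole.
# Assumption: observation[2] is the pole angle, negative = leaning left.
LEFT, RIGHT = 0, 1  # action encoding as stated in the text above


def on_off_action(observation):
    """Push the cart toward the side the pole is leaning to."""
    pole_angle = observation[2]
    return LEFT if pole_angle < 0 else RIGHT


# Hand-written observations, only to illustrate the decision rule
print(on_off_action([0.0, 0.0, -0.05, 0.0]))  # pole leans left, push left
print(on_off_action([0.0, 0.0, 0.03, 0.0]))   # pole leans right, push right
```

In the loop below, such a function would be called with the latest `observation` to choose the `action` for the next step.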

In [None]:
env = gym.make('CartPole-v0')
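env.reset()

# start with a random action; your control strategy below replaces it in every step
action = env.action_space.sample()
cumulated_reward = 0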

for i in range(100):

    env.render(mode='close')
    
    observation, reward, done, info = env.step( action )
    cumulated_reward += reward

    #
    # Result: implement your control strategy here
    # (set `action` for the next step based on `observation`)
    #
        
    print( '\r', 'angle:{:.2f} pos:{:.2f} r:{} cr:{} d:{}   a:{}'.format(observation[2],observation[0],reward,cumulated_reward,done,action), end='' )
    
    if done:
        env.reset()
        cumulated_reward = 0

    # some delay important for display to catch up
    time.sleep(0.1)
      
env.close()

# Task: DQN Solution to cart pole balancing (2 points)

Now let's build a first version based on an advanced RL technique, the Deep Q-Network (DQN). Here a neural network is trained to estimate the value of each action in a given state, based on the Q-learning concept.
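
As a small numeric illustration of the Q-learning concept, the sketch below computes the one-step training target for a single transition. The helper name `q_target` and the example Q-values are made up for illustration; `GAMMA` uses the same value as the DQN cell further down.

```python
import numpy as np

GAMMA = 0.99  # discount factor, same value as in the DQN cell


def q_target(reward, next_q_values, done, gamma=GAMMA):
    """One-step Q-learning target for a single transition."""
    if done:
        # no future reward after a terminal step
        return reward
    # immediate reward plus discounted value of the best next action
    return reward + gamma * np.max(next_q_values)


# Hypothetical network outputs for the next state (two actions)
next_q = np.array([1.0, 2.0])
print(q_target(1.0, next_q, done=False))  # 1.0 + 0.99 * 2.0, about 2.98
print(q_target(1.0, next_q, done=True))   # terminal step: just the reward
```

The network is then fitted so that its output for the action actually taken moves toward this target, while the outputs for the other actions are left unchanged.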

The code is based on the work by Greg Surma and it can be found [here](https://github.com/gsurma/cartpole).

Please go through the code and answer the questions in the comments of the code (marked by Task). 

**Note**: Place your answer as a comment below each question.

In [None]:
import numpy as np
import random
import pandas as pd

from collections import deque
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam
from keras.models import model_from_json
import gym

prefix = 'results/04_dqn_'

# hyperparameters from https://towardsdatascience.com/ai-learning-to-land-a-rocket-reinforcement-learning-84d61f97d055

GAMMA = 0.99
LEARNING_RATE = 0.001
LEARNING_RATE_DECAY = 0.0001
MEMORY_SIZE = 1000000
BATCH_SIZE = 40
EXPLORATION_MAX = 0.5
EXPLORATION_MIN = 0.1
EXPLORATION_DECAY = 0.995

class DQNControl:

    def __init__(self, observation_space, action_space,layout=[24,24],name='nona'):
        
        print ('building DQN model with observation space {} and action space {} layer {} name {}'.format(observation_space, action_space,layout,name) )
        
        self.exploration_rate = EXPLORATION_MAX
        self.action_space = action_space
        self.memory = deque(maxlen=MEMORY_SIZE)
        self.name = name
        
        self.model = Sequential()
        self.model.add(Dense(layout[0], input_shape=(observation_space,), activation="relu"))
        self.model.add(Dense(layout[1], activation="relu"))
        self.model.add(Dense(self.action_space, activation="linear"))
        self.model.compile(loss="mse", optimizer=Adam(lr=LEARNING_RATE,decay=LEARNING_RATE_DECAY ))

        
    def save(self):
        modelName = prefix + self.name + "model.json"
        weightName = prefix + self.name + "model.h5"
        model_json = self.model.to_json()
        with open( modelName , "w") as json_file:
            json_file.write(model_json)
        # serialize weights to HDF5
        self.model.save_weights( weightName )
        print("saved model to disk as {} {}".format(modelName,weightName))

        
    def load(self):    
        modelName = prefix + self.name + "model.json"
        weightName = prefix + self.name + "model.h5"
        with open(modelName, 'r') as json_file:
            loaded_model_json = json_file.read()
        self.model = model_from_json(loaded_model_json)
        self.model.load_weights(weightName)
        print("loaded model from disk")
        
        
    def remember(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))
        
        
    def action(self,state):
        q_values = self.model.predict(state)
        return np.argmax(q_values[0])
        
        
        
    def act(self, state):
        #
        # Task: what is the purpose of this if statement
        # Result: ....
        #
        if np.random.rand() < self.exploration_rate:
            return random.randrange(self.action_space)

        q_values = self.model.predict(state)
        
        #
        # Task: what is the idea behind this step (to come from value to action)?
        # Result: ....
        #
        return np.argmax(q_values[0])

    
    def experience_replay(self):
        
        if len(self.memory) < BATCH_SIZE:
            return
        
        batch = random.sample(self.memory, BATCH_SIZE)
        
        for state, action, reward, state_next, done in batch:
            
            q_update = reward
            if not done:
                #
                # Task: give an explanation for the formula of the update of the Q-value
                # Result: ...
                #
                q_update = (reward + GAMMA * np.amax( self.model.predict(state_next)[0] ) )
            
            q_values = self.model.predict(state)
            
            q_values[0][action] = q_update
            
            self.model.fit(state, q_values, verbose=0)
            
            
            
    def close_episode(self):
        #
        # Task: what is going on here?
        # Result: ...
        #
        self.exploration_rate *= EXPLORATION_DECAY
        self.exploration_rate = max(EXPLORATION_MIN, self.exploration_rate)
        
            
            
            
            


def trainDQN(env,episodes=50,layout=[24,24], name='nona', termination_reward=None, termination_runs=None, termination_runs_reward=None ):
    
    observation_space = env.observation_space.shape[0]
    action_space = env.action_space.n

    dqn_solver = DQNControl(observation_space, action_space,layout,name)
    
    history = []
    run = 0
    
    accumulated_reward = 0
    sliding_accumulated_reward = 0
    
    while run < episodes:
        
        state = env.reset()
        state = np.reshape(state, [1, observation_space])
        step = 0
        while True:
            
            step += 1
            
            env.render(mode='close')
            
            action = dqn_solver.act(state)
            
            state_next, reward, terminal, info = env.step(action)
            
            accumulated_reward += reward
            
            if not (termination_runs is None) and step > termination_runs:
                terminal = True
                if not (termination_runs_reward is None):
                    reward = termination_runs_reward
            else:
                if terminal and not (termination_reward is None):
                    reward = termination_reward
            
            state_next = np.reshape(state_next, [1, observation_space])
            
            dqn_solver.remember(state, action, reward, state_next, terminal)
            
            state = state_next
            
            if terminal:
                
                sliding_accumulated_reward = sliding_accumulated_reward * 0.9 + accumulated_reward * 0.1
                
                print ( '\r', 'episode: {}, exploration: {:.3f}, score: {} sliding score {}'.format(run,dqn_solver.exploration_rate,accumulated_reward,sliding_accumulated_reward), end='' )
                
                history.append([run,dqn_solver.exploration_rate,accumulated_reward,sliding_accumulated_reward,step])
                
                accumulated_reward = 0
                break
            
            dqn_solver.experience_replay()
        
        
        dqn_solver.close_episode()
        
        
        run += 1

    env.close()
    return dqn_solver,history
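
To get a feeling for the exploration settings, the following sketch evaluates the schedule that results from multiplying the exploration rate by `EXPLORATION_DECAY` once per episode and clipping it at `EXPLORATION_MIN`. The helper `exploration_after` is hypothetical and uses a closed form instead of the repeated multiplication in `close_episode`, so tiny floating point differences are possible.

```python
import math

EXPLORATION_MAX = 0.5
EXPLORATION_MIN = 0.1
EXPLORATION_DECAY = 0.995


def exploration_after(episodes):
    """Exploration rate after the given number of finished episodes."""
    return max(EXPLORATION_MIN, EXPLORATION_MAX * EXPLORATION_DECAY ** episodes)


# Number of episodes until the lower bound EXPLORATION_MIN is reached
episodes_to_floor = math.ceil(
    math.log(EXPLORATION_MIN / EXPLORATION_MAX) / math.log(EXPLORATION_DECAY)
)

print(round(exploration_after(60), 3))  # about 0.37 after 60 episodes
print(episodes_to_floor)                # 322
```

With these values, a short run of 60 episodes still takes a random action in roughly a third of all steps, which is worth keeping in mind when reading the score plots.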

In [None]:
env = gym.make("CartPole-v1")
control,history = trainDQN(env=env,episodes=60,layout=[24,24],name='cartdqn',termination_reward=-200,termination_runs=100,termination_runs_reward=None)

In [None]:
# Save model for later
control.save()

In [None]:
df = pd.DataFrame(history)

In [None]:
df[1].plot()

In [None]:
df[2].plot()
df[3].plot()

# Test the DQN control

In [None]:
env = gym.make('CartPole-v1')

env.reset()

observation_space = env.observation_space.shape[0]
action_space = env.action_space.n
# use the same name as during training so that load() finds the saved files
control = DQNControl(observation_space, action_space, name='cartdqn')
control.load()

state = env.reset()
cumulated_reward = 0

for i in range(100):
    env.render(mode='close')

    # choose the greedy action from the trained network
    action = control.action( np.reshape(state, [1, observation_space]) )
    observation, reward, done, _ = env.step( action )
    # use the new observation for the next prediction
    state = observation
    
    cumulated_reward += reward
        
    print( '\r', 'angle:{:.2f} pos:{:.2f} r:{} cr:{} d:{}   a:{}'.format(observation[2],observation[0],reward,cumulated_reward,done,action), end='' )
    
    if done:
        state = env.reset()
        cumulated_reward = 0

    # some delay important for display to catch up
    time.sleep(0.05)
      
env.close()

# Actor critic model (2 points)

Implement the actor-critic model [1] for the cart pole gym. Look up some tutorials and implement it in a structure similar to `DQNControl`. Compare the results with those of the DQN.

- [1] https://papers.nips.cc/paper/1786-actor-critic-algorithms.pdf
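
As a starting point, the sketch below shows the two quantities at the core of a one-step (temporal difference) actor-critic update. This is one common variant found in tutorials and not necessarily the formulation of [1]; the function names and the example numbers are hypothetical.

```python
import numpy as np


def td_error(reward, value_state, value_next, done, gamma=0.99):
    """One-step TD error, used here as the advantage estimate."""
    target = reward if done else reward + gamma * value_next
    return target - value_state


def actor_loss(action_probs, action, advantage):
    """Policy-gradient loss for one transition: -log pi(a|s) * advantage."""
    return -np.log(action_probs[action]) * advantage


# Hypothetical critic outputs V(s) = 10 and V(s') = 10, reward 1
delta = td_error(1.0, 10.0, 10.0, done=False)
print(delta)  # 1 + 0.99 * 10 - 10, about 0.9

# Hypothetical actor output: both actions equally likely, action 0 was taken
print(actor_loss(np.array([0.5, 0.5]), 0, delta))
```

In this variant the critic is trained to reduce the squared TD error, while the actor is trained with the loss above, so that actions with a positive advantage become more likely.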


In [None]:
#
# Result: actor critic solution for cart
#
class ACControl:
    pass

def trainACModel(episodes=500): 
    pass


In [None]:
accontrol, history = trainACModel(episodes=300)

In [None]:
accontrol.save()

In [None]:
df = pd.DataFrame(history)

In [None]:
#
# Plot scores
#
df[2].plot()

# Lunar lander problem

Now we look into the lunar lander problem. We reuse the DQN controller from above with different parameters. Play with this problem and get an understanding of the rewards. The configuration is taken from [2]. A general discussion of this approach was published in [1].

- [1] https://www.researchgate.net/publication/333145451_Deep_Q-Learning_on_Lunar_Lander_Game
- [2] https://towardsdatascience.com/ai-learning-to-land-a-rocket-reinforcement-learning-84d61f97d055

In [None]:
env = gym.make('LunarLander-v2')
control,history = trainDQN(env=env,episodes=150,layout=[64,32],name='lunar',termination_reward=None,termination_runs=150,termination_runs_reward=-200)

In [None]:
# Save model for later
control.save()

In [None]:
# rebuild the data frame from the lunar lander history before plotting
df = pd.DataFrame(history)
df[2].plot()
df[3].plot()

# Task: Implement an improved controller for the lunar lander (4 points)

Search the internet for leaderboards for the lunar lander and try to implement one of the best solutions. Select your solution based on simplicity and clarity of code. Comment the code.

In [None]:
#
# Result: implementation of improved controller
#
