Safety can be roughly understood as the agent avoiding bad trajectories (sequences of state/action pairs).

We aim to improve the policy so that the output of the safety function, i.e. a quantitative measure
of how risky the current state/action pair is, is as small as possible.

**Approach and artefacts**

A separate **Safety Controller** advises the agent in critical states by supplying action probabilities that minimize risk. While the agent generally aims to accumulate the largest expected return over a set of trajectories, the separate Safety Controller network aims to minimize risk. In the provided environment (LunarSafe-v0), risk is measured by safety bounds, i.e. the vector between the coordinates of the Lander and predetermined boundaries. The Controller thereby modifies both the optimization criterion toward which the policy converges and the agent's exploration process, so that each includes the notion of safety.
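
To make the idea of risk measured by safety bounds more concrete, here is a minimal sketch, assuming the bounds form an axis-aligned box around the allowed region. The function name `risk_from_bounds`, the box limits and the `margin` parameter are hypothetical illustrations and are not taken from `LunarSafe-v0`.

```python
def risk_from_bounds(position, lower, upper, margin=0.25):
    """Return a risk value in [0, 1] for a position inside a box.

    0 means every boundary is at least `margin` away,
    1 means a boundary has been reached or crossed.
    All limits and the margin are hypothetical illustration values.
    """
    distances = []
    for p, lo, hi in zip(position, lower, upper):
        distances.append(p - lo)   # distance to the lower boundary
        distances.append(hi - p)   # distance to the upper boundary
    closest = min(distances)
    if closest <= 0.0:
        return 1.0
    if closest >= margin:
        return 0.0
    # Linear ramp from 0 to 1 inside the margin.
    return 1.0 - closest / margin


lower, upper = (-1.0, 0.0), (1.0, 1.5)
print(risk_from_bounds((0.0, 0.75), lower, upper))  # centre of the box
print(risk_from_bounds((0.9, 0.75), lower, upper))  # near the right boundary
print(risk_from_bounds((1.2, 0.75), lower, upper))  # outside the box
```

A scalar of this kind could serve as the quantity the Safety Controller tries to minimize. The linear ramp is only one possible choice.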


The main function of the Safety Controller is to learn which states are unsafe and which actions to take in proximity to those states. It learns the unsafe states through a constraint given by the environment.
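
The advisory role described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation used in this notebook: it assumes the agent exposes per-action value estimates, the controller exposes per-action risk estimates, and a state counts as critical when a scalar risk exceeds a threshold. The names `advise`, `risk_threshold` and `temperature` are hypothetical.

```python
import math


def advise(q_values, risk_values, state_risk, risk_threshold=0.5, temperature=0.1):
    """Return action probabilities for the current state.

    q_values:    the agent's estimated return per action
    risk_values: the controller's estimated risk per action (lower is safer)
    state_risk:  scalar risk of the current state
    """
    n = len(q_values)
    if state_risk < risk_threshold:
        # Non-critical state: keep the agent's greedy choice.
        best = max(range(n), key=lambda a: q_values[a])
        return [1.0 if a == best else 0.0 for a in range(n)]
    # Critical state: softmax over negative risk, so safer actions get more mass.
    logits = [-r / temperature for r in risk_values]
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(l - shift) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


q = [1.0, 2.0, 3.0, 0.0]
risk = [0.9, 0.1, 0.8, 0.5]
print(advise(q, risk, state_risk=0.1))  # agent decides
print(advise(q, risk, state_risk=0.9))  # controller advises
```

A hard threshold keeps the agent's behaviour unchanged in safe regions. A smoother blend of the two distributions would be an alternative design.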

In [13]:
import gym
import safe_agents as sa
import matplotlib
%matplotlib inline


## Training set

In [None]:
env = gym.make('LunarSafe-v0')
n_states = env.observation_space.shape[0] - 2
n_actions = env.action_space.n

# Get memory from agent for training set
agent = sa.agents.DQNAgent(env, n_states, n_actions, tb=True)
scores, bounds = agent.train_agent(episodes=100)

episode: 0  | score: 14.896757222653946  | memory: 70 | epsilon: 0.9930240953515695
episode: 1  | score: -249.00764300545558  | memory: 157 | epsilon: 0.9844218297185002
episode: 2  | score: -151.40998432918562  | memory: 291 | epsilon: 0.9713179143142632
episode: 3  | score: -54.63902572726187  | memory: 425 | epsilon: 0.9583884288075943
episode: 4  | score: -5.047438606285652  | memory: 512 | epsilon: 0.9500862014166829
episode: 5  | score: -105.39856779714327  | memory: 628 | epsilon: 0.9391283320998
episode: 6  | score: -183.52558318048125  | memory: 731 | epsilon: 0.9295044770190672
episode: 7  | score: -102.9616415783691  | memory: 834 | epsilon: 0.9199792438023008
episode: 8  | score: -83.82657573479983  | memory: 914 | epsilon: 0.9126484057557739
episode: 9  | score: -260.8099151144768  | memory: 1030 | epsilon: 0.9021223272298245
episode: 10  | score: -206.82854610278423  | memory: 1097 | epsilon: 0.8960980104149296
episode: 11  | score: -46.70527590464661  | memory: 1223 | ep
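
The logged `epsilon` values are consistent with a multiplicative decay of 0.9999 per stored transition, i.e. `epsilon = 0.9999 ** memory`. This is inferred from the numbers in the log only. The decay rate and the helper `epsilon_after` are assumptions, not a confirmed detail of `safe_agents`. A quick check:

```python
def epsilon_after(steps, decay=0.9999, start=1.0):
    """Exploration rate after `steps` multiplicative decay steps (assumed schedule)."""
    return start * decay ** steps


# (memory, logged epsilon) pairs copied from the training output above
logged = [(70, 0.9930240953515695), (157, 0.9844218297185002)]
for memory, eps in logged:
    print(memory, epsilon_after(memory), eps)
```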

In [None]:
sa.utils.plot_visuals(agent, scores, bounds)

## Test set

How can the agent's network be updated to include the safety bounds in its calculations?

Build a separate safety critic network. Two options:

+ Include the bounds as a parameter in the reward calculation?
+ Include the bounds as a parameter in the agent's action selection?
    
    s (list): The state. Attributes:
                  s[0] is the horizontal coordinate
                  s[1] is the vertical coordinate
                  s[2] is the horizontal speed
                  s[3] is the vertical speed
                  s[4] is the angle
                  s[5] is the angular speed
                  s[6] 1 if first leg has contact, else 0
                  s[7] 1 if second leg has contact, else 0
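
The two options listed above (bounds in the reward, bounds in the agent's input) can be sketched as follows. This is a minimal sketch, assuming a scalar risk value in [0, 1] is available for the current step. The helpers `shaped_reward` and `augment_state`, the `penalty_weight` and the example bound values are hypothetical.

```python
def shaped_reward(reward, risk, penalty_weight=10.0):
    """Option 1: subtract a risk penalty from the environment reward."""
    return reward - penalty_weight * risk


def augment_state(s, bounds):
    """Option 2: append the bound values to the state the agent acts on."""
    return list(s) + list(bounds)


# State laid out as in the description above; the values are made up.
s = [0.1, 1.2, 0.0, -0.4, 0.05, 0.0, 0.0, 0.0]
bounds = [0.9, 0.3]  # hypothetical bound values

print(shaped_reward(-1.5, risk=0.6))
print(augment_state(s, bounds))
```

Option 1 leaves the network architecture unchanged but mixes return and safety into one signal. Option 2 keeps the reward intact and lets the network see the bounds directly.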

In [None]:
bounds

In [24]:
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam

# state = v_speed, l_bound, bound

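def safety_critic(state_size, action_size, learning_rate=0.0001):
    """Build a small fully connected network that outputs one value per action."""
    model = Sequential()
    # Two hidden ReLU layers with He-uniform initialisation
    model.add(Dense(64, input_dim=state_size, activation='relu', kernel_initializer='he_uniform'))
    model.add(Dense(32, activation='relu', kernel_initializer='he_uniform'))
    # Linear output layer: one unbounded estimate per action
    model.add(Dense(action_size, activation='linear', kernel_initializer='he_uniform'))
    #model.summary()
    # `learning_rate` replaces the deprecated `lr` argument in recent Keras releases
    model.compile(loss='mse', optimizer=Adam(learning_rate=learning_rate))
    return model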


