<a href="https://colab.research.google.com/github/dandanelbaz/ai_week/blob/master/ai_week.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Install coach
Just use pip

In [0]:
pip install rl_coach

# AI Week Workshop 

### ***1 Running Coach*** 

##### ***1.1 Training with default parameters*** 

In [0]:
from rl_coach.agents.dqn_agent import DQNAgentParameters
from rl_coach.environments.gym_environment import GymEnvironmentParameters, Atari, atari_schedule
from rl_coach.graph_managers.graph_manager import VisualizationParameters
from rl_coach.graph_managers.basic_rl_graph_manager import BasicRLGraphManager


# creating graph manager
graph_mgr = BasicRLGraphManager(
    agent_params = DQNAgentParameters(), 
    env_params = Atari(level = 'Breakout-v0'), 
    schedule_params = atari_schedule, 
    vis_params = VisualizationParameters())

In [0]:
graph_mgr.improve()

##### ***1.2 Changing default parameters***

In [0]:
from rl_coach.agents.clipped_ppo_agent import ClippedPPOAgentParameters
from rl_coach.environments.gym_environment import GymVectorEnvironment
from rl_coach.graph_managers.basic_rl_graph_manager import BasicRLGraphManager
from rl_coach.graph_managers.graph_manager import SimpleSchedule
from rl_coach.architectures.embedder_parameters import InputEmbedderParameters

# Reset tensorflow graph as the network has changed.
import tensorflow as tf
tf.reset_default_graph()

# Define the environment parameters
bit_length = 10
env_params = GymVectorEnvironment(level='rl_coach.environments.toy_problems.bit_flip:BitFlip')
env_params.additional_simulator_parameters = {'bit_length': bit_length, 'mean_zero': True}

# Clipped PPO
agent_params = ClippedPPOAgentParameters()
agent_params.network_wrappers['main'].input_embedders_parameters = {
    'state': InputEmbedderParameters(scheme=[]),
    'desired_goal': InputEmbedderParameters(scheme=[])
}

graph_manager = BasicRLGraphManager(
    agent_params=agent_params,
    env_params=env_params,
    schedule_params=SimpleSchedule()
)

In [0]:
graph_manager.improve()

##### ***1.3 Running a Coach preset***

When running Coach from the command line, we use a preset module to define the experiment parameters. As its name implies, a preset is a predefined set of parameters for running some agent on some environment. Coach has many predefined presets that follow the algorithm definitions in the published papers, which allows training some of the existing algorithms with essentially no coding at all. These presets can easily be run from the command line. For example:

**coach -p CartPole_DQN**

You can find all the predefined presets under the presets directory, or by listing them using the following command:

**coach -l**

Coach can also be used with an externally defined preset by passing the absolute path to the module and the name of the graph manager object which is defined in the preset:

**coach -p /home/my_user/my_agent_dir/my_preset.py:graph_manager**

Some presets are generic for multiple environment levels, and therefore require defining the specific level through the command line:

**coach -p Mujoco_ClippedPPO -lvl humanoid**

There are plenty of other command line arguments you can use to customize the experiment. Full documentation of the available arguments can be found using the following command:

**coach -h**

In [0]:
from rl_coach.agents.clipped_ppo_agent import ClippedPPOAgentParameters
from rl_coach.architectures.layers import Dense
from rl_coach.base_parameters import VisualizationParameters, PresetValidationParameters, DistributedCoachSynchronizationType
from rl_coach.core_types import TrainingSteps, EnvironmentEpisodes, EnvironmentSteps
from rl_coach.environments.environment import SingleLevelSelection
from rl_coach.environments.gym_environment import GymVectorEnvironment, mujoco_v2
from rl_coach.exploration_policies.additive_noise import AdditiveNoiseParameters
from rl_coach.filters.filter import InputFilter
from rl_coach.filters.observation.observation_normalization_filter import ObservationNormalizationFilter
from rl_coach.graph_managers.basic_rl_graph_manager import BasicRLGraphManager
from rl_coach.graph_managers.graph_manager import ScheduleParameters
from rl_coach.schedules import LinearSchedule

####################
# Graph Scheduling #
####################

schedule_params = ScheduleParameters()
schedule_params.improve_steps = TrainingSteps(10000000)
schedule_params.steps_between_evaluation_periods = EnvironmentSteps(2048)
schedule_params.evaluation_steps = EnvironmentEpisodes(5)
schedule_params.heatup_steps = EnvironmentSteps(0)

#########
# Agent #
#########
agent_params = ClippedPPOAgentParameters()


agent_params.network_wrappers['main'].learning_rate = 0.0003
agent_params.network_wrappers['main'].input_embedders_parameters['observation'].activation_function = 'tanh'
agent_params.network_wrappers['main'].input_embedders_parameters['observation'].scheme = [Dense(64)]
agent_params.network_wrappers['main'].middleware_parameters.scheme = [Dense(64)]
agent_params.network_wrappers['main'].middleware_parameters.activation_function = 'tanh'
agent_params.network_wrappers['main'].batch_size = 64
agent_params.network_wrappers['main'].optimizer_epsilon = 1e-5
agent_params.network_wrappers['main'].adam_optimizer_beta2 = 0.999

agent_params.algorithm.clip_likelihood_ratio_using_epsilon = 0.2
agent_params.algorithm.clipping_decay_schedule = LinearSchedule(1.0, 0, 1000000)
agent_params.algorithm.beta_entropy = 0
agent_params.algorithm.gae_lambda = 0.95
agent_params.algorithm.discount = 0.99
agent_params.algorithm.optimization_epochs = 10
agent_params.algorithm.estimate_state_value_using_gae = True
# Distributed Coach synchronization type.
agent_params.algorithm.distributed_coach_synchronization_type = DistributedCoachSynchronizationType.SYNC

agent_params.input_filter = InputFilter()
agent_params.exploration = AdditiveNoiseParameters()
agent_params.pre_network_filter = InputFilter()
agent_params.pre_network_filter.add_observation_filter('observation', 'normalize_observation',
                                                       ObservationNormalizationFilter(name='normalize_observation'))

###############
# Environment #
###############
env_params = GymVectorEnvironment(level=SingleLevelSelection(mujoco_v2))
# Set the target success
env_params.target_success_rate = 1.0

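# Preset validation parameters (default values). Without this definition the
# call below would raise a NameError.
preset_validation_params = PresetValidationParameters()

graph_manager = BasicRLGraphManager(agent_params=agent_params, env_params=env_params,
                                    schedule_params=schedule_params, vis_params=VisualizationParameters(),
                                    preset_validation_params=preset_validation_params)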

In [0]:
!coach -l

### ***2 Adding a new environment***

In this section we will implement the short corridor environment from Sutton & Barto's book.

![short_corridor](https://drive.google.com/uc?id=1rYLI9dC92sfpF0BVxVENF964MfWJkxZq)

*   Three non-terminal states, each corresponding to a location of the agent
*   The observations are one-hot encodings of the states
*   Actions are reversed in the second state
*   Reward is -1 for each time step
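
To get a feel for the dynamics listed above, the sketch below evaluates a policy that moves right with the same probability in every state. The transition table `NEXT_STATE` and the function `start_value` are illustrative assumptions written in plain Python; they are not part of Coach or of the workshop files.

```python
# Hypothetical sketch: expected return from the start state when the agent
# moves right with probability p_right in every state.
LEFT, RIGHT = 0, 1
GOAL_STATE = 3

# NEXT_STATE[state][action]: the actions are reversed in state 1, and
# moving left in state 0 leaves the agent where it is.
NEXT_STATE = {
    0: {LEFT: 0, RIGHT: 1},
    1: {LEFT: 2, RIGHT: 0},
    2: {LEFT: 1, RIGHT: 3},
}


def start_value(p_right, sweeps=10000):
    """Iterative policy evaluation with a reward of -1 per time step."""
    values = [0.0, 0.0, 0.0, 0.0]  # the goal state keeps value 0
    for _ in range(sweeps):
        for state, moves in NEXT_STATE.items():
            values[state] = -1.0 + (
                p_right * values[moves[RIGHT]]
                + (1.0 - p_right) * values[moves[LEFT]]
            )
    return values[0]


for p in (0.1, 0.5, 0.9):
    print(p, round(start_value(p), 2))
```

With `p_right = 0.5` the start state evaluates to -12, that is, twelve steps on average until the goal is reached. Both very small and very large values of `p_right` do worse.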






##### ***2.1 Helper function*** 
The following code snippet defines some constants and a one-hot encoding helper function.

In [0]:
%%writefile short_corridor_env_helpper.py
import numpy as np

LEFT = 0
RIGHT = 1
START_STATE = 0
GOAL_STATE = 3
NUM_STATES = 4
REVERSE_STATE = 1

def to_one_hot(state):
    observation = np.zeros((NUM_STATES,))
    observation[state] = 1
    return observation

Overwriting short_corridor_env_helpper.py


##### ***2.2 Write short corridor environment*** 
Complete the following functions:

1.   _is_done - returns a boolean that is True only at the terminal state
2.   reset - resets the environment to its initial state
3.   step - returns the next observation, the reward, and the boolean flag done
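
The contract between these three functions can be sketched on a deliberately different toy environment, so that the exercise below stays open. `CountdownEnv` and `run_episode` are hypothetical names used only for illustration, and the example does not depend on Gym.

```python
import random


class CountdownEnv:
    """Hypothetical toy environment that follows the reset/step contract.

    It is not the short corridor: the episode simply ends after `horizon` steps.
    """

    def __init__(self, horizon=5):
        self.horizon = horizon
        self.reset()

    def reset(self):
        # Return to the initial state and hand back the first observation
        self.steps_left = self.horizon
        return self.steps_left

    def _is_done(self):
        # True only in the terminal state
        return self.steps_left == 0

    def step(self, action):
        # The action is ignored here; a real environment would use it
        self.steps_left -= 1
        observation = self.steps_left
        reward = -1
        done = self._is_done()
        return observation, reward, done, {}


def run_episode(env, rng):
    """Roll out one episode with uniformly random actions."""
    env.reset()
    total_reward, done = 0, False
    while not done:
        action = rng.choice([0, 1])
        _, reward, done, _ = env.step(action)
        total_reward += reward
    return total_reward


print(run_episode(CountdownEnv(horizon=5), random.Random(0)))
```

The loop in `run_episode` only relies on `reset` returning an observation and `step` returning the four-element tuple, which is the same shape the short corridor environment has to provide.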





* **complete code**


In [0]:
%%writefile short_corridor_env.py
import numpy as np
import gym
from gym import spaces
from  short_corridor_env_helpper import *


class ShortCorridorEnv(gym.Env):

    def __init__(self):
        # Class constructor- Initializes class variables and sets initial state
        self.observation_space = spaces.Box(0, 1, shape=(NUM_STATES,))
        self.action_space = spaces.Discrete(2)
        self.reset()

    def reset(self):
        '''
        Resets the environment to start state
        '''
        # Boolean. True only if the goal state is reached
        self.goal_reached = ???
        # An integer representing the state. Number between zero and three
        self.current_state = ???
        observation = to_one_hot(???)
        return observation

    def _is_done(self, current_state):
        '''
        return done a Boolean- True only if we reached the goal state
        '''
        ???
        return done

    def step(self, action):
        '''
        Returns the next observation, reward, and the boolean flag done
        '''

        if action == LEFT:
            step = -1
        elif action == RIGHT:
            ???

        if self.current_state == REVERSE_STATE:
            # Replace step = -1 with step = 1 and vice versa
            ???

        self.current_state += step
        self.current_state = max(0, self.current_state)

        observation = to_one_hot(self.current_state)
        reward = ???
        done = self._is_done(self.current_state)

        return observation, reward, done, {}



Overwriting short_corridor_env.py


##### ***2.3 Write preset to run existing agent on the new environment***
*We will use the same preset as in the DQN example*.

Since our environment already uses the Gym API, we are almost good to go.

When selecting the environment parameters in the preset, use **GymEnvironmentParameters** and pass the path of the environment source code, in the form `module:ClassName`, using the level parameter.
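
The level string used in the preset below has the form `module:ClassName`. How Coach resolves it internally is not shown in this notebook, so the following is only a sketch of the general idea using the standard library. `resolve` is a hypothetical helper, and `collections:OrderedDict` stands in for the environment class.

```python
import importlib


def resolve(path):
    """Split a 'module:attribute' string and return the named attribute."""
    module_name, attribute_name = path.split(':')
    module = importlib.import_module(module_name)
    return getattr(module, attribute_name)


# A standard-library class is used here so that the example runs anywhere
ordered_dict_cls = resolve('collections:OrderedDict')
print(ordered_dict_cls.__name__)
```

For this kind of lookup to work, the module named before the colon has to be importable, which is why the environment file is written next to the preset.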

In [0]:
%%writefile short_corridor_dqn_preset.py
from rl_coach.environments.gym_environment import GymEnvironmentParameters
from rl_coach.filters.filter import NoInputFilter, NoOutputFilter
from rl_coach.agents.dqn_agent import DQNAgentParameters
from rl_coach.graph_managers.basic_rl_graph_manager import BasicRLGraphManager
from rl_coach.graph_managers.graph_manager import SimpleSchedule
from rl_coach.memories.memory import MemoryGranularity


####################
# Graph Scheduling #
####################
schedule_params = SimpleSchedule()


#########
# Agent #
#########
agent_params = DQNAgentParameters()
agent_params.input_filter = NoInputFilter()
agent_params.output_filter = NoOutputFilter()
# DQN params
# ER size
agent_params.memory.max_size = (MemoryGranularity.Transitions, 40000)


###############
# Environment #
###############
env_params = GymEnvironmentParameters(level='short_corridor_env:ShortCorridorEnv')


#################
# Graph Manager #
#################
graph_manager = BasicRLGraphManager(agent_params=agent_params,
                                    env_params=env_params,
                                    schedule_params=schedule_params)


Overwriting short_corridor_dqn_preset.py


##### ***2.4 Run new preset***

In [0]:
!coach -p /content/short_corridor_dqn_preset.py:graph_manager


### ***3 Adding a new agent***
Coach's modularity makes adding an agent a clean and simple task.
It typically consists of four parts:


1.   Implement an agent-specific network head (and loss)
2.   Implement exploration policy (optional)
3.   Define new parameters class that extends `AgentParameters`
4.   Implement a preset to run the agent on some environment



##### ***3.1 Write stochastic output layer***
We use a stochastic policy, meaning that the network only produces the probabilities of going left and going right.
This layer takes the output of the previous layer (the middleware) as its input and outputs two numbers. 

![Probabilistic output](https://drive.google.com/uc?id=1hB_AsKUlxlu43sMkPAFfLaK6Z5sz1I-n)
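
The softmax step that turns the two scores into probabilities can be illustrated in plain Python. This is a minimal sketch with made-up scores; the layer below uses `tf.nn.softmax` instead.

```python
import math


def softmax(scores):
    """Turn a list of real-valued scores into probabilities that sum to 1."""
    highest = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(score - highest) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


# Made-up scores for the actions left and right
p_left, p_right = softmax([0.0, 1.0])
print(p_left, p_right)
```

The larger score receives the larger probability, and equal scores give a 50/50 split.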

In [0]:
%%writefile probabilistic_layer.py
import tensorflow as tf
from rl_coach.architectures.tensorflow_components.layers import Dense

class ProbabilisticLayer(object):
    def __init__(self, input_layer, num_actions):
        super().__init__()
        scores = Dense(num_actions)(input_layer, name='logit')
        self.event_probs = tf.nn.softmax(scores, name="policy")
        # define the categorical distribution for the policy
        self.policy_distribution = tf.contrib.distributions.Categorical(probs=self.event_probs)

    def log_prob(self, action):
        return self.policy_distribution.log_prob(action)

    def layer_output(self):
        return self.event_probs

Overwriting probabilistic_layer.py


##### ***3.2 Implement network head i.e. implement the loss***
The Head needs to inherit from the base class `Head`.

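In order to maximize the sum of rewards, we minimize the loss $-\Sigma_i R_i \log(\pi(a_i|x_i))$, whose gradient with respect to the weights $W$ is $-\Sigma_i R_i \nabla_W \log(\pi(a_i|x_i))$. Gradient descent then moves the weights in the opposite direction of this gradient.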
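
As a sanity check on the expression above, the following sketch evaluates the return-weighted log-probability for a single time step of a two-action softmax policy and compares its analytic gradient with respect to the scores against a finite-difference estimate. The helper names (`pg_loss`, `analytic_grad`, `numeric_grad`) and the example numbers are illustrative assumptions; this is plain Python, not the Coach head.

```python
import math


def softmax(scores):
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pg_loss(logits, action, ret):
    """Negative return-weighted log-probability of the taken action."""
    return -ret * math.log(softmax(logits)[action])


def analytic_grad(logits, action, ret):
    """d(loss)/d(logits) = -ret * (one_hot(action) - softmax(logits))."""
    probs = softmax(logits)
    return [-ret * ((1.0 if i == action else 0.0) - p)
            for i, p in enumerate(probs)]


def numeric_grad(logits, action, ret, eps=1e-6):
    """Central finite differences, one logit at a time."""
    grads = []
    for i in range(len(logits)):
        up, down = list(logits), list(logits)
        up[i] += eps
        down[i] -= eps
        grads.append((pg_loss(up, action, ret) - pg_loss(down, action, ret))
                     / (2 * eps))
    return grads


logits, action, ret = [0.2, -0.4], 1, -3.0
print(analytic_grad(logits, action, ret))
print(numeric_grad(logits, action, ret))
```

The two printed gradients should agree to several decimal places, which confirms that minimizing this loss weights each log-probability gradient by its return.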

`Complete code`



In [0]:
%%writefile simple_pg_head.py
import tensorflow as tf
from rl_coach.architectures.tensorflow_components.heads.head import Head
from rl_coach.base_parameters import AgentParameters
from rl_coach.spaces import SpacesDefinition
from probabilistic_layer import ProbabilisticLayer


class SimplePgHead(Head):
    def __init__(self, agent_parameters: AgentParameters,
                 spaces: SpacesDefinition, network_name: str,
                 head_idx: int = 0, is_local: bool = True):
        super().__init__(agent_parameters, spaces, network_name)

        self.exploration_policy = agent_parameters.exploration

    def _build_module(self, input_layer):
        # Define inputs
        actions = tf.placeholder(tf.int32, [None], name="actions")
        returns = tf.placeholder(tf.float32, [None], name="returns")

        # Two actions, left or right
        policy_distribution = ProbabilisticLayer(input_layer, num_actions=2)

        # calculate loss
        log_prob = policy_distribution.log_prob(???)
        # We only want to encourage good actions, so we multiply the log probability with ...
        modulated_log_prob = ???
        expected_modulated_log_prob = tf.reduce_mean(modulated_log_prob)

        ### Coach bookkeeping
        # List of placeholders for additional inputs to the stochastic head
        # (apart from the middleware input)
        self.input.append(???)
        # The output of the stochastic head, which is also the output of the network.
        self.output.append(???)
        # Placeholder for the target that we will use to train the network
        self.target = returns
        # The loss that we will use to train the network.
        # We take the gradient of this loss and move in the opposite direction
        self.loss = ???
        tf.losses.add_loss(self.loss)



Overwriting simple_pg_head.py


##### ***3.3 Define exploration policy*** 
At every iteration we want to sample from the network's output distribution, i.e. toss a biased coin to get the agent's actual move.
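
The biased coin toss can be tried out on its own with the standard library. This is a minimal sketch with made-up probabilities; `sample_action` is a hypothetical helper, and the exploration policy below uses `np.random.choice` instead.

```python
import random


def sample_action(probabilities, rng):
    """Draw an action index according to the given probabilities."""
    actions = list(range(len(probabilities)))
    return rng.choices(actions, weights=probabilities, k=1)[0]


rng = random.Random(0)
draws = [sample_action([0.3, 0.7], rng) for _ in range(10000)]
# Fraction of draws that picked action 1; it should be close to 0.7
print(sum(draws) / len(draws))
```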

**`Complete code`**

In [0]:
%%writefile simple_pg_exploration.py

import numpy as np
from rl_coach.exploration_policies.exploration_policy import ExplorationPolicy, ExplorationParameters
from rl_coach.spaces import ActionSpace


class DiscreteExplorationParameters(ExplorationParameters):
    @property
    def path(self):
        return 'simple_pg_exploration:DiscreteExploration'


class DiscreteExploration(ExplorationPolicy):
    """
    Discrete exploration policy is intended for discrete action spaces. It expects the action values to
    represent a probability distribution over the action
    """
    def __init__(self, action_space: ActionSpace):
        """
        :param action_space: the action space used by the environment
        """
        super().__init__(action_space)

    def get_action(self, probabilities):
        # choose actions according to the probabilities
        chosen_action = np.random.choice(self.action_space.actions, p=???)
        return chosen_action, probabilities


Overwriting simple_pg_exploration.py


##### ***3.4 Define new agent parameters***
Coach is modular!

Each class in Coach has a complementary parameters class which defines its constructor arguments. 
This is also true for the agent, which has a complementary `AgentParameters` class. This class enables selecting the parameters of the agent's submodules.

It consists of the following four parts:



1.   Algorithm
2.   Exploration
3.   Memory
4.   Networks



In [0]:
%%writefile simple_pg_params.py
from rl_coach.architectures.embedder_parameters import InputEmbedderParameters
from rl_coach.architectures.head_parameters import HeadParameters
from rl_coach.architectures.middleware_parameters import FCMiddlewareParameters
from rl_coach.base_parameters import NetworkParameters, AlgorithmParameters, \
    AgentParameters

from rl_coach.exploration_policies.additive_noise import AdditiveNoiseParameters
from rl_coach.exploration_policies.categorical import CategoricalParameters
from rl_coach.memories.episodic.single_episode_buffer import SingleEpisodeBufferParameters
from rl_coach.spaces import DiscreteActionSpace, BoxActionSpace
from rl_coach.agents.policy_optimization_agent import PolicyGradientRescaler
from simple_pg_exploration import DiscreteExplorationParameters

class SimplePgAgentParameters(AgentParameters):
    def __init__(self):
        super().__init__(algorithm=SimplePGAlgorithmParameters(),
                         exploration=DiscreteExplorationParameters(),
                         memory=SingleEpisodeBufferParameters(),
                         networks={"main": SimplePgTopology()})
    @property
    def path(self):
        #return 'simple_pg_agent:SimplePgAgent'
        return 'rl_coach.agents.policy_gradients_agent:PolicyGradientsAgent'

        
    
# Since we are adding a new head we need to tell coach the heads path
class SimplePgHeadParams(HeadParameters):
    def __init__(self):
        super().__init__(parameterized_class_name="AiWeekHead")

    @property
    def path(self):
        return 'simple_pg_head:SimplePgHead'


class SimplePgTopology(NetworkParameters):
    def __init__(self):
        super().__init__()
        self.input_embedders_parameters = {'observation': InputEmbedderParameters()}
        self.middleware_parameters = FCMiddlewareParameters()
        self.heads_parameters = [SimplePgHeadParams()]


class SimplePGAlgorithmParameters(AlgorithmParameters):
    """
    :param num_steps_between_gradient_updates: (int)
        The number of steps between calculating gradients for the collected data. In the A3C paper, this parameter is
        called t_max. Since this algorithm is on-policy, only the steps collected between each two gradient calculations
        are used in the batch.
    """
    def __init__(self):
        super().__init__()
        # TOTAL_RETURN
        # FUTURE_RETURN
        # FUTURE_RETURN_NORMALIZED_BY_EPISODE 
        # FUTURE_RETURN_NORMALIZED_BY_TIMESTEP
        # Q_VALUE
        # A_VALUE
        # TD_RESIDUAL
        # DISCOUNTED_TD_RESIDUAL
        # GAE
        self.policy_gradient_rescaler = PolicyGradientRescaler.FUTURE_RETURN
        self.num_steps_between_gradient_updates = 20000  # this is called t_max in all the papers
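
The rescaler names listed in the comments above suggest different ways of weighting the log-probabilities. The sketch below assumes that `TOTAL_RETURN` refers to the return of the whole episode and `FUTURE_RETURN` to the discounted sum of rewards from the current step onward; the exact computation inside Coach is not shown in this notebook, and both function names are hypothetical.

```python
def total_return(rewards, discount=1.0):
    """Discounted sum of all rewards in the episode."""
    return sum(reward * discount ** t for t, reward in enumerate(rewards))


def future_returns(rewards, discount=1.0):
    """Discounted reward-to-go for every time step of the episode."""
    returns, running = [], 0.0
    for reward in reversed(rewards):
        running = reward + discount * running
        returns.append(running)
    return returns[::-1]


episode_rewards = [-1, -1, -1, -1]  # a four-step episode with -1 per step
print(total_return(episode_rewards))
print(future_returns(episode_rewards))
```

Under this assumption, later actions in an episode are weighted only by the rewards that follow them rather than by the whole episode's return.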






##### ***3.5 Write preset to run new agent on short corridor***
* **complete code**
* **Hint: look at DQN preset**


In [0]:
%%writefile short_corridor_new_agent_preset.py
from rl_coach.base_parameters import VisualizationParameters
from rl_coach.core_types import EnvironmentEpisodes, EnvironmentSteps
from rl_coach.environments.gym_environment import GymEnvironmentParameters
from rl_coach.filters.filter import NoInputFilter, NoOutputFilter
from rl_coach.graph_managers.basic_rl_graph_manager import BasicRLGraphManager
from rl_coach.graph_managers.graph_manager import SimpleSchedule
from rl_coach.memories.memory import MemoryGranularity
from rl_coach.schedules import LinearSchedule
from simple_pg_params import SimplePgAgentParameters


####################
# Graph Scheduling #
####################
schedule_params = SimpleSchedule()


#########
# Agent #
#########
agent_params = ???
agent_params.input_filter = NoInputFilter()
agent_params.output_filter = NoOutputFilter()


###############
# Environment #
###############
env_params = GymEnvironmentParameters(level='short_corridor_env:ShortCorridorEnv')

#################
# Graph Manager #
#################
graph_manager = BasicRLGraphManager(agent_params=agent_params,
                                    env_params=env_params,
                                    schedule_params=schedule_params)



Writing short_corridor_new_agent_preset.py


##### ***3.6 Run preset of the new agent on the new environment***

**`Complete code`**




In [0]:
???

# AI Week Workshop Solution

### ***Training with default parameters***

In [0]:
from rl_coach.agents.dqn_agent import DQNAgentParameters
from rl_coach.environments.gym_environment import GymEnvironmentParameters, Atari, atari_schedule
from rl_coach.graph_managers.graph_manager import VisualizationParameters
from rl_coach.graph_managers.basic_rl_graph_manager import BasicRLGraphManager


# creating graph manager
graph_mgr = BasicRLGraphManager(
    agent_params = DQNAgentParameters(), 
    env_params = Atari(level = 'Breakout-v0'), 
    schedule_params = atari_schedule, 
    vis_params = VisualizationParameters())

In [0]:
graph_mgr.improve()

Creating graph - name: BasicRLGraphManager
Creating agent - name: agent

The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.

Instructions for updating:
Use `tf.keras.layers.Conv2D` instead.
Instructions for updating:
Use keras.layers.flatten instead.
Instructions for updating:
Use keras.layers.dense instead.

Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where

simple_rl_graph: Starting heatup
2019-11-13-15:47:49.504635 Heatup - Name: main_level/agent Worker: 0 Episode: 1 Total reward: 0.0 Exploration: 1 Steps: 24 Training itera

KeyboardInterrupt: ignored

### ***Changing default parameters***

In [0]:
from rl_coach.agents.clipped_ppo_agent import ClippedPPOAgentParameters
from rl_coach.environments.gym_environment import GymVectorEnvironment
from rl_coach.graph_managers.basic_rl_graph_manager import BasicRLGraphManager
from rl_coach.graph_managers.graph_manager import SimpleSchedule
from rl_coach.architectures.embedder_parameters import InputEmbedderParameters

# Resetting tensorflow graph as the network has changed.
import tensorflow as tf
tf.reset_default_graph()

# define the environment parameters
bit_length = 10
env_params = GymVectorEnvironment(level='rl_coach.environments.toy_problems.bit_flip:BitFlip')
env_params.additional_simulator_parameters = {'bit_length': bit_length, 'mean_zero': True}

# Clipped PPO
agent_params = ClippedPPOAgentParameters()
agent_params.network_wrappers['main'].input_embedders_parameters = {
    'state': InputEmbedderParameters(scheme=[]),
    'desired_goal': InputEmbedderParameters(scheme=[])
}

graph_manager = BasicRLGraphManager(
    agent_params=agent_params,
    env_params=env_params,
    schedule_params=SimpleSchedule()
)

In [0]:
graph_manager.improve()

### ***Running Coach using preset***

When running Coach from the command line, we use a preset module to define the experiment parameters. As its name implies, a preset is a predefined set of parameters for running some agent on some environment. Coach has many predefined presets that follow the algorithm definitions in the published papers, which allows training some of the existing algorithms with essentially no coding at all. These presets can easily be run from the command line. For example:

**coach -p CartPole_DQN**

You can find all the predefined presets under the presets directory, or by listing them using the following command:

**coach -l**

Coach can also be used with an externally defined preset by passing the absolute path to the module and the name of the graph manager object which is defined in the preset:

**coach -p /home/my_user/my_agent_dir/my_preset.py:graph_manager**

Some presets are generic for multiple environment levels, and therefore require defining the specific level through the command line:

**coach -p Atari_DQN -lvl breakout**

There are plenty of other command line arguments you can use to customize the experiment. Full documentation of the available arguments can be found using the following command:

**coach -h**

In [0]:
from rl_coach.agents.clipped_ppo_agent import ClippedPPOAgentParameters
from rl_coach.architectures.layers import Dense
from rl_coach.base_parameters import VisualizationParameters, PresetValidationParameters, DistributedCoachSynchronizationType
from rl_coach.core_types import TrainingSteps, EnvironmentEpisodes, EnvironmentSteps
from rl_coach.environments.environment import SingleLevelSelection
from rl_coach.environments.gym_environment import GymVectorEnvironment, mujoco_v2
from rl_coach.exploration_policies.additive_noise import AdditiveNoiseParameters
from rl_coach.filters.filter import InputFilter
from rl_coach.filters.observation.observation_normalization_filter import ObservationNormalizationFilter
from rl_coach.graph_managers.basic_rl_graph_manager import BasicRLGraphManager
from rl_coach.graph_managers.graph_manager import ScheduleParameters
from rl_coach.schedules import LinearSchedule

####################
# Graph Scheduling #
####################

schedule_params = ScheduleParameters()
schedule_params.improve_steps = TrainingSteps(10000000)
schedule_params.steps_between_evaluation_periods = EnvironmentSteps(2048)
schedule_params.evaluation_steps = EnvironmentEpisodes(5)
schedule_params.heatup_steps = EnvironmentSteps(0)

#########
# Agent #
#########
agent_params = ClippedPPOAgentParameters()


agent_params.network_wrappers['main'].learning_rate = 0.0003
agent_params.network_wrappers['main'].input_embedders_parameters['observation'].activation_function = 'tanh'
agent_params.network_wrappers['main'].input_embedders_parameters['observation'].scheme = [Dense(64)]
agent_params.network_wrappers['main'].middleware_parameters.scheme = [Dense(64)]
agent_params.network_wrappers['main'].middleware_parameters.activation_function = 'tanh'
agent_params.network_wrappers['main'].batch_size = 64
agent_params.network_wrappers['main'].optimizer_epsilon = 1e-5
agent_params.network_wrappers['main'].adam_optimizer_beta2 = 0.999

agent_params.algorithm.clip_likelihood_ratio_using_epsilon = 0.2
agent_params.algorithm.clipping_decay_schedule = LinearSchedule(1.0, 0, 1000000)
agent_params.algorithm.beta_entropy = 0
agent_params.algorithm.gae_lambda = 0.95
agent_params.algorithm.discount = 0.99
agent_params.algorithm.optimization_epochs = 10
agent_params.algorithm.estimate_state_value_using_gae = True
# Distributed Coach synchronization type.
agent_params.algorithm.distributed_coach_synchronization_type = DistributedCoachSynchronizationType.SYNC

agent_params.input_filter = InputFilter()
agent_params.exploration = AdditiveNoiseParameters()
agent_params.pre_network_filter = InputFilter()
agent_params.pre_network_filter.add_observation_filter('observation', 'normalize_observation',
                                                       ObservationNormalizationFilter(name='normalize_observation'))

###############
# Environment #
###############
env_params = GymVectorEnvironment(level=SingleLevelSelection(mujoco_v2))
# Set the target success
env_params.target_success_rate = 1.0

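# Preset validation parameters (default values). Without this definition the
# call below would raise a NameError.
preset_validation_params = PresetValidationParameters()

graph_manager = BasicRLGraphManager(agent_params=agent_params, env_params=env_params,
                                    schedule_params=schedule_params, vis_params=VisualizationParameters(),
                                    preset_validation_params=preset_validation_params)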

In [0]:
!coach -l

### ***Adding a new environment***

In this section we will implement the short corridor environment from Sutton & Barto's book.

![short_corridor](https://drive.google.com/uc?id=1rYLI9dC92sfpF0BVxVENF964MfWJkxZq)

*   Three non-terminal states, each corresponding to a location of the agent
*   The observations are one-hot encodings of the states
*   Actions are reversed in the second state
*   Reward is -1 for each time step






##### ***Helper function*** 
The following code snippet defines some constants and a one-hot encoding helper function.

In [0]:
%%writefile short_corridor_env_helpper.py
import numpy as np

LEFT = 0
RIGHT = 1
START_STATE = 0
GOAL_STATE = 3
NUM_STATES = 4
REVERSE_STATE = 1

def to_one_hot(state):
    observation = np.zeros((NUM_STATES,))
    observation[state] = 1
    return observation

Overwriting short_corridor_env_helpper.py


##### ***Implement short corridor environment*** 
Complete the following functions:

1.   _is_done - returns a boolean that is True only at the terminal state
2.   reset - resets the environment to its initial state
3.   step - returns the next observation, the reward, and the boolean flag done





* **complete code**


In [0]:
%%writefile short_corridor_env.py
import numpy as np
import gym
from gym import spaces
from  short_corridor_env_helpper import *


class ShortCorridorEnv(gym.Env):

    def __init__(self):
        # Class constructor- Initializes class variables and sets initial state
        self.observation_space = spaces.Box(0, 1, shape=(NUM_STATES,))
        self.action_space = spaces.Discrete(2)
        self.reset()

    def reset(self):
        '''
        Resets the environment to start state
        '''
        # Boolean. True only if the goal state is reached
        self.goal_reached = False
        # An integer representing the state. Number between zero and three
        self.current_state = START_STATE
        observation = to_one_hot(self.current_state)
        return observation

    def _is_done(self, current_state):
        '''
        return done a Boolean- True only if we reached the goal state
        '''
        done = (current_state == GOAL_STATE)
        return done

    def step(self, action):
        '''
        Returns the next observation, reward, and the boolean flag done
        '''

        if action == LEFT:
            step = -1
        elif action == RIGHT:
            step = 1

        if self.current_state == REVERSE_STATE:
            # Replace step = -1 with step = 1 and vice versa
            step = -step

        self.current_state += step
        self.current_state = max(0, self.current_state)

        observation = to_one_hot(self.current_state)
        reward = -1
        done = self._is_done(self.current_state)

        return observation, reward, done, {}



Overwriting short_corridor_env.py


##### ***Write preset to run existing agent on the new environment***
*We will use the same preset as in the DQN example*.

Since our environment already uses the Gym API, we are almost good to go.

When selecting the environment parameters in the preset, use **GymEnvironmentParameters** and pass the path of the environment source code, in the form `module:ClassName`, using the level parameter.

In [0]:
%%writefile short_corridor_dqn_preset.py
from rl_coach.environments.gym_environment import GymEnvironmentParameters
from rl_coach.filters.filter import NoInputFilter, NoOutputFilter
from rl_coach.agents.dqn_agent import DQNAgentParameters
from rl_coach.graph_managers.basic_rl_graph_manager import BasicRLGraphManager
from rl_coach.graph_managers.graph_manager import SimpleSchedule
from rl_coach.memories.memory import MemoryGranularity


####################
# Graph Scheduling #
####################
schedule_params = SimpleSchedule()


#########
# Agent #
#########
agent_params = DQNAgentParameters()
agent_params.input_filter = NoInputFilter()
agent_params.output_filter = NoOutputFilter()
# DQN params
# ER size
agent_params.memory.max_size = (MemoryGranularity.Transitions, 40000)


###############
# Environment #
###############
env_params = GymEnvironmentParameters(level='short_corridor_env:ShortCorridorEnv')


#################
# Graph Manager #
#################
graph_manager = BasicRLGraphManager(agent_params=agent_params,
                                    env_params=env_params,
                                    schedule_params=schedule_params)


Overwriting short_corridor_dqn_preset.py


##### ***Run new preset***

In [0]:
!coach -p /content/short_corridor_dqn_preset.py:graph_manager


### ***Add new agent***
Coach's modularity makes adding an agent a clean and simple task.
It typically consists of four parts:


1.   Implement an agent-specific network head (and loss)
2.   Implement an exploration policy (optional)
3.   Define a new parameters class that extends `AgentParameters`
4.   Implement a preset to run the agent on some environment



##### ***Write stochastic output layer***
We use a stochastic policy: the network outputs only the probability of going left and the probability of going right, and the action is sampled from them.
This layer takes the output of the previous layer (the middleware) as input and outputs two numbers, one probability per action.

![Probabilistic output](https://drive.google.com/uc?id=1hB_AsKUlxlu43sMkPAFfLaK6Z5sz1I-n)
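
The two numbers become probabilities through a softmax over the raw scores (logits). The following is a minimal plain-Python sketch of that operation, independent of TensorFlow; the function name `softmax` and the example scores are chosen here for illustration only.

```python
import math


def softmax(scores):
    """Convert raw scores (logits) into probabilities that sum to 1."""
    # Subtracting the maximum keeps exp() from overflowing.
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


probs_equal = softmax([0.0, 0.0])   # equal scores: both actions equally likely
probs_right = softmax([0.0, 2.0])   # higher score for the second action
```

Equal scores give a fair coin, and raising one score shifts probability mass toward that action without ever making the other action impossible.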

In [0]:
%%writefile probabilistic_layer.py
import tensorflow as tf
from rl_coach.architectures.tensorflow_components.layers import Dense

class ProbabilisticLayer(object):
    def __init__(self, input_layer, num_actions):
        super().__init__()
        scores = Dense(num_actions)(input_layer, name='logit')
        self.event_probs = tf.nn.softmax(scores, name="policy")
        # define the categorical distribution of the policy over the actions
        self.policy_distribution = tf.contrib.distributions.Categorical(probs=self.event_probs)

    def log_prob(self, action):
        return self.policy_distribution.log_prob(action)

    def layer_output(self):
        return self.event_probs

Overwriting probabilistic_layer.py


##### ***Implement network head i.e. implement the loss***
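The head needs to inherit from the base class `Head`.

In order to maximize the sum of rewards, we minimize the loss $L(W) = -\frac{1}{N}\sum_i R_i \log(\pi(a_i|x_i))$. Gradient descent moves against its gradient $-\frac{1}{N}\sum_i R_i \nabla_W \log(\pi(a_i|x_i))$, i.e. in the direction that raises the log-probability of actions with high returns.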
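
To make the objective concrete, the loss can be evaluated by hand on a tiny batch. The sketch below computes the negative mean of return-weighted log-probabilities in plain Python; the function name `policy_gradient_loss` and the sample numbers are made up for illustration and are not part of Coach.

```python
import math


def policy_gradient_loss(action_probs, actions, returns):
    """Negative mean of return-weighted log-probabilities of the taken actions."""
    weighted = [
        ret * math.log(probs[action])
        for probs, action, ret in zip(action_probs, actions, returns)
    ]
    return -sum(weighted) / len(weighted)


# Two steps: the policy's probabilities, the action taken, and the return.
probs = [[0.5, 0.5], [0.2, 0.8]]
actions = [0, 1]
returns = [-2.0, -1.0]
loss = policy_gradient_loss(probs, actions, returns)
```

Because every reward in the corridor is negative, all returns are negative; minimizing this loss then lowers the probability of each taken action, and lowers it most for actions followed by the worst returns.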

`Complete code`



In [0]:
%%writefile simple_pg_head.py
import tensorflow as tf
from rl_coach.architectures.tensorflow_components.heads.head import Head
from rl_coach.base_parameters import AgentParameters
from rl_coach.spaces import SpacesDefinition
from probabilistic_layer import ProbabilisticLayer


class SimplePgHead(Head):
    def __init__(self, agent_parameters: AgentParameters,
                 spaces: SpacesDefinition, network_name: str,
                 head_idx: int = 0, is_local: bool = True):
        super().__init__(agent_parameters, spaces, network_name)

        self.exploration_policy = agent_parameters.exploration

    def _build_module(self, input_layer):
        # Define inputs
        actions = tf.placeholder(tf.int32, [None], name="actions")
        returns = tf.placeholder(tf.float32, [None], name="returns")

        # Two actions, left or right
        policy_distribution = ProbabilisticLayer(input_layer, num_actions=2)

        # calculate loss
        log_prob = policy_distribution.log_prob(actions)
        # We only want to encourage good actions, so we multiply the log probability with ...
        modulated_log_prob = returns * log_prob
        expected_modulated_log_prob = tf.reduce_mean(modulated_log_prob)

        ### Coach bookkeeping
        # List of placeholders for additional inputs to the stochastic head
        # (in addition to the middleware input)
        self.input.append(actions)
        # The output of the stochastic head, which is also the output of the network.
        self.output.append(policy_distribution.layer_output())
        # Placeholder for the target that we will use to train the network
        self.target = returns
        # The loss that we will use to train the network.
        # We take the gradient of this loss and move in the opposite direction
        self.loss = -expected_modulated_log_prob
        tf.losses.add_loss(self.loss)



Overwriting simple_pg_head.py


##### ***Define exploration policy*** 
At every step we sample an action from the network's output distribution, i.e. we toss a biased coin to decide the agent's actual move.
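
The coin toss itself needs nothing beyond a weighted random draw. Here is a small standalone sketch using the standard library; the name `sample_action` and the probabilities are illustrative assumptions, and the exploration policy in this workshop uses `np.random.choice` for the same purpose.

```python
import random


def sample_action(probabilities, rng=random):
    """Draw an action index with probability proportional to its weight."""
    actions = range(len(probabilities))
    return rng.choices(actions, weights=probabilities, k=1)[0]


# Seeded generator so the experiment is repeatable.
rng = random.Random(0)
draws = [sample_action([0.25, 0.75], rng) for _ in range(10_000)]
right_fraction = sum(draws) / len(draws)  # fraction of draws that chose action 1
```

Over many draws the fraction of "right" moves approaches the probability the network assigned to it, which is what keeps the agent exploring both actions.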

**`Complete code`**

In [0]:
%%writefile simple_pg_exploration.py

import numpy as np
from rl_coach.exploration_policies.exploration_policy import ExplorationPolicy, ExplorationParameters
from rl_coach.spaces import ActionSpace


class DiscreteExplorationParameters(ExplorationParameters):
    @property
    def path(self):
        return 'simple_pg_exploration:DiscreteExploration'


class DiscreteExploration(ExplorationPolicy):
    """
    The discrete exploration policy is intended for discrete action spaces. It expects the action values to
    represent a probability distribution over the actions.
    """
    def __init__(self, action_space: ActionSpace):
        """
        :param action_space: the action space used by the environment
        """
        super().__init__(action_space)

    def get_action(self, probabilities):
        # choose actions according to the probabilities
        chosen_action = np.random.choice(self.action_space.actions, p=probabilities)
        return chosen_action, probabilities


Overwriting simple_pg_exploration.py


##### ***Define new agent parameters***
Coach is modular!

Each class in Coach has a complementary parameters class that holds the parameters used to construct it.
This is also true for the agent, which has a complementary `AgentParameters` class. This class lets you select the parameters of the agent's submodules.

It consists of the following four parts:



1.   algorithm
2.   exploration
3.   memory
4.   networks



In [0]:
%%writefile simple_pg_params.py
from rl_coach.architectures.embedder_parameters import InputEmbedderParameters
from rl_coach.architectures.head_parameters import HeadParameters
from rl_coach.architectures.middleware_parameters import FCMiddlewareParameters
from rl_coach.base_parameters import NetworkParameters, AlgorithmParameters, \
    AgentParameters

from rl_coach.exploration_policies.additive_noise import AdditiveNoiseParameters
from rl_coach.exploration_policies.categorical import CategoricalParameters
from rl_coach.memories.episodic.single_episode_buffer import SingleEpisodeBufferParameters
from rl_coach.spaces import DiscreteActionSpace, BoxActionSpace
from rl_coach.agents.policy_optimization_agent import PolicyGradientRescaler
from simple_pg_exploration import DiscreteExplorationParameters

class SimplePgAgentParameters(AgentParameters):
    def __init__(self):
        super().__init__(algorithm=SimplePGAlgorithmParameters(),
                         #exploration=CategoricalParameters(),
                         exploration=DiscreteExplorationParameters(),
                         memory=SingleEpisodeBufferParameters(),
                         networks={"main": SimplePgTopology()})
    @property
    def path(self):
        #return 'simple_pg_agent:SimplePgAgent'
        return 'rl_coach.agents.policy_gradients_agent:PolicyGradientsAgent'

        
    
# Since we are adding a new head, we need to tell Coach the head's path
class SimplePgHeadParams(HeadParameters):
    def __init__(self):
        super().__init__(parameterized_class_name="AiWeekHead")

    @property
    def path(self):
        return 'simple_pg_head:SimplePgHead'


class SimplePgTopology(NetworkParameters):
    def __init__(self):
        super().__init__()
        self.input_embedders_parameters = {'observation': InputEmbedderParameters()}
        self.middleware_parameters = FCMiddlewareParameters()
        self.heads_parameters = [SimplePgHeadParams()]


class SimplePGAlgorithmParameters(AlgorithmParameters):
    """
    :param num_steps_between_gradient_updates: (int)
        The number of steps between calculating gradients for the collected data. In the A3C paper, this parameter is
        called t_max. Since this algorithm is on-policy, only the steps collected between each two gradient calculations
        are used in the batch.
    """
    def __init__(self):
        super().__init__()
        # TOTAL_RETURN
        # FUTURE_RETURN
        # FUTURE_RETURN_NORMALIZED_BY_EPISODE 
        # FUTURE_RETURN_NORMALIZED_BY_TIMESTEP
        # Q_VALUE
        # A_VALUE
        # TD_RESIDUAL
        # DISCOUNTED_TD_RESIDUAL
        # GAE
        self.policy_gradient_rescaler = PolicyGradientRescaler.FUTURE_RETURN
        self.num_steps_between_gradient_updates = 20000  # this is called t_max in all the papers






Overwriting simple_pg_params.py
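
The parameters above select `PolicyGradientRescaler.FUTURE_RETURN`. Assuming this means that each action is weighted by the (optionally discounted) sum of rewards collected from that step until the end of the episode, the quantity can be sketched in plain Python as follows. The function name `future_returns` and the `gamma` argument are illustrative and not part of Coach's API.

```python
def future_returns(rewards, gamma=1.0):
    """Return the discounted reward-to-go for every step of one episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step reuses the return of the step after it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# In the short corridor every step costs -1.
episode_returns = future_returns([-1, -1, -1])
```

With a reward of -1 per step, earlier steps of a long episode receive more negative returns, so shorter paths to the goal are preferred.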


##### ***Write preset to run new agent on short corridor***
* **Complete code**
* **Hint: look at the DQN preset**


In [0]:
%%writefile short_corridor_new_agent_preset.py
from rl_coach.base_parameters import VisualizationParameters
from rl_coach.core_types import EnvironmentEpisodes, EnvironmentSteps
from rl_coach.environments.gym_environment import GymEnvironmentParameters
from rl_coach.filters.filter import NoInputFilter, NoOutputFilter
from rl_coach.graph_managers.basic_rl_graph_manager import BasicRLGraphManager
from rl_coach.graph_managers.graph_manager import SimpleSchedule
from rl_coach.memories.memory import MemoryGranularity
from rl_coach.schedules import LinearSchedule
from simple_pg_params import SimplePgAgentParameters


####################
# Graph Scheduling #
####################
schedule_params = SimpleSchedule()


#########
# Agent #
#########
agent_params = SimplePgAgentParameters()
agent_params.input_filter = NoInputFilter()
agent_params.output_filter = NoOutputFilter()


###############
# Environment #
###############
env_params = GymEnvironmentParameters(level='short_corridor_env:ShortCorridorEnv')

#################
# Graph Manager #
#################
graph_manager = BasicRLGraphManager(agent_params=agent_params,
                                    env_params=env_params,
                                    schedule_params=schedule_params)



Writing short_corridor_new_agent_preset.py


##### ***Run preset of the new agent on the new environment***

**`Complete code`**




In [0]:
!coach -p /content/short_corridor_new_agent_preset.py:graph_manager