<a href="https://colab.research.google.com/github/dandanelbaz/ai_week/blob/master/ai_week.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Install Coach
Coach is available from PyPI, so a single pip command installs it.

In [1]:
pip install rl_coach

Collecting rl_coach
  Downloading https://files.pythonhosted.org/packages/2e/78/78df71ee5174c4db71deb1116c4ee7bc20e3fc3f6a9094e5c763df3cb099/rl-coach-1.0.1.tar.gz (374kB)
Collecting annoy>=1.8.3
  Downloading https://files.pythonhosted.org/packages/cc/66/eab272ae940d36d698994058e303fe7d1264d10ec120e0a508d0c8fb3ca5/annoy-1.16.2.tar.gz (636kB)
Collecting pygame>=1.9.3
  Downloading https://files.pythonhosted.org/packages/8e/24/ede6428359f913ed9cd1643dd5533aefeb5a2699cc95bea089de50ead586/pygame-1.9.6-cp36-cp36m-manylinux1_x86_64.whl (11.4MB)
Collecting gym==0.12.5
  Downloading https://files.pythonhosted.org/packages/0c/c4/307107c687f75267d645415d57db8c0a6e29e20ac30d8f4a10e8030b6737/gym-0.12.5.tar.gz (1.5MB)
Collecting kubernete

# AI Week Workshop 

### ***Add new environment***

In this section we will implement the short corridor environment from Sutton & Barto's book.

*   Three non-terminal states, each representing a location of the agent
*   The observations are one-hot encodings of the states
*   Actions are reversed in the second state
*   Reward is -1 for each time step






##### ***Helper function*** 
The following code snippet defines some constants and a one-hot encoding helper function.

In [3]:
%%writefile short_corridor_env_helpper.py
import numpy as np

LEFT = 0
RIGHT = 1
START_STATE = 0
GOAL_STATE = 3
NUM_STATES = 4
REVERSE_STATE = 1

def to_one_hot(state):
    observation = np.zeros((NUM_STATES,))
    observation[state] = 1
    return observation

Overwriting short_corridor_env_helpper.py
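
As a quick check of the helper, the sketch below repeats `to_one_hot` inline (so that it runs on its own, without the written file) and prints the observation for every state. The loop is only an illustration; nothing else in the notebook depends on it.

```python
import numpy as np

NUM_STATES = 4  # same value as in short_corridor_env_helpper.py


def to_one_hot(state):
    # Copy of the helper above, repeated so this snippet is self-contained.
    observation = np.zeros((NUM_STATES,))
    observation[state] = 1
    return observation


# Each state maps to a vector with a single 1 at the state's index.
for state in range(NUM_STATES):
    print(state, to_one_hot(state))
```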


##### ***Write short corridor environment*** 
Complete the following functions:

1.   _is_done - Returns a boolean that is True only at the termination state
2.   reset - Resets the environment to its initial state
3.   step - Returns the next observation, the reward, and the boolean flag done





* **complete code**


In [9]:
%%writefile short_corridor_env.py
import numpy as np
import gym
from gym import spaces
from  short_corridor_env_helpper import *


class ShortCorridorEnv(gym.Env):

    def __init__(self):
        # Class constructor- Initializes class variables and sets initial state
        self.observation_space = spaces.Box(0, 1, shape=(NUM_STATES,))
        self.action_space = spaces.Discrete(2)
        self.reset()

    def reset(self):
        '''
        Resets the environment to start state
        '''
        # Boolean. True only if the goal state is reached
        self.goal_reached = ???
        # An integer representing the state. Number between zero and three
        self.current_state = ???
        observation = to_one_hot(???)
        return observation

    def _is_done(self, current_state):
        '''
        Returns done, a boolean that is True only if the goal state was reached
        '''
        ???
        return done

    def step(self, action):
        '''
        Returns the next observation, reward, and the boolean flag done
        '''

        if action == LEFT:
            step = -1
        elif action == RIGHT:
            step = ???

        if self.current_state == REVERSE_STATE:
            # Replace step = -1 with step = 1 and vice versa
            ???

        self.current_state += step
        self.current_state = max(0, self.current_state)

        observation = to_one_hot(self.current_state)
        reward = ???
        done = self._is_done(self.current_state)

        return observation, reward, done, {}



Overwriting short_corridor_env.py
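
One possible way to fill in the blanks can be checked with a dependency-free sketch. The class `CorridorSketch` below is a hypothetical stand-in, not the workshop's reference solution: it mirrors the logic of `ShortCorridorEnv` but drops `gym` and NumPy so that it runs anywhere. The reward and the goal check follow the bullet list at the top of this section.

```python
LEFT, RIGHT = 0, 1
START_STATE, GOAL_STATE, REVERSE_STATE, NUM_STATES = 0, 3, 1, 4


class CorridorSketch:
    """Gym-free stand-in for ShortCorridorEnv (hypothetical, for illustration)."""

    def __init__(self):
        self.reset()

    def _one_hot(self, state):
        return [1 if i == state else 0 for i in range(NUM_STATES)]

    def reset(self):
        self.current_state = START_STATE
        return self._one_hot(self.current_state)

    def step(self, action):
        step = -1 if action == LEFT else 1
        if self.current_state == REVERSE_STATE:
            step = -step  # actions are reversed in the second state
        # Moving left from the start state leaves the agent in place.
        self.current_state = max(0, self.current_state + step)
        reward = -1  # every time step costs -1
        done = self.current_state == GOAL_STATE
        return self._one_hot(self.current_state), reward, done, {}


env = CorridorSketch()
env.reset()
# Shortest path to the goal: right, left (reversed in state 1), right.
for action in (RIGHT, LEFT, RIGHT):
    observation, reward, done, _ = env.step(action)
    print(observation, reward, done)
```

The same three decisions (initial state, reward, termination test) are what the `???` placeholders in `short_corridor_env.py` ask for.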


##### ***Write preset to run existing agent on the new environment***
*We will use the same preset as in the DQN example.*

Since our environment already uses the Gym API, we are almost good to go.

When selecting the environment parameters in the preset, use **GymEnvironmentParameters** and pass the path to the environment class through the `level` parameter, in the form `module:ClassName`.

In [5]:
%%writefile short_corridor_dqn_preset.py
from rl_coach.environments.gym_environment import GymEnvironmentParameters
from rl_coach.filters.filter import NoInputFilter, NoOutputFilter
from rl_coach.agents.dqn_agent import DQNAgentParameters
from rl_coach.graph_managers.basic_rl_graph_manager import BasicRLGraphManager
from rl_coach.graph_managers.graph_manager import SimpleSchedule
from rl_coach.memories.memory import MemoryGranularity


####################
# Graph Scheduling #
####################
schedule_params = SimpleSchedule()


#########
# Agent #
#########
agent_params = DQNAgentParameters()
agent_params.input_filter = NoInputFilter()
agent_params.output_filter = NoOutputFilter()
# DQN params
# ER size
agent_params.memory.max_size = (MemoryGranularity.Transitions, 40000)


###############
# Environment #
###############
env_params = GymEnvironmentParameters(level='short_corridor_env:ShortCorridorEnv')


#################
# Graph Manager #
#################
graph_manager = BasicRLGraphManager(agent_params=agent_params,
                                    env_params=env_params,
                                    schedule_params=schedule_params)


Writing short_corridor_dqn_preset.py
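
The `level` string in the preset has the form `module:ClassName`. As a rough illustration of that convention (not Coach's actual loader, whose internals are not shown in this notebook), the hypothetical helper `resolve_path` below splits such a string and imports the class with the standard library.

```python
import importlib


def resolve_path(path):
    """Hypothetical helper: turn 'module:ClassName' into the class object."""
    module_name, class_name = path.split(':')
    module = importlib.import_module(module_name)
    return getattr(module, class_name)


# Demonstrated on a standard-library class so the snippet runs anywhere.
ordered_dict_cls = resolve_path('collections:OrderedDict')
print(ordered_dict_cls)
```

With `short_corridor_env.py` on the import path, the same call with `'short_corridor_env:ShortCorridorEnv'` would be expected to return the environment class.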


##### ***Run new preset***

In [0]:
!coach -p /content/short_corridor_dqn_preset.py:graph_manager


### ***Add new agent***
Coach's modularity makes adding an agent a clean and simple task.
It typically consists of four parts:


1.   Implement an agent-specific network head (and loss)
2.   Implement an exploration policy (optional)
3.   Define a new parameters class that extends `AgentParameters`
4.   Implement a preset to run the agent on some environment



##### ***Write stochastic output layer***
We use a stochastic policy, meaning that the network only outputs the probabilities of going left and going right.
This layer takes the output of the previous layer (the middleware) as its input and outputs two numbers, one probability per action.

In [0]:
%%writefile probabilistic_layer.py
import tensorflow as tf
from rl_coach.architectures.tensorflow_components.layers import Dense

class ProbabilisticLayer(object):
    def __init__(self, input_layer, num_actions):
        super().__init__()
        scores = Dense(num_actions)(input_layer, name='logit')
        self.event_probs = tf.nn.softmax(scores, name="policy")
        # define the categorical distribution for the policy
        self.policy_distribution = tf.contrib.distributions.Categorical(probs=self.event_probs)

    def log_prob(self, action):
        return self.policy_distribution.log_prob(action)

    def layer_output(self):
        return self.event_probs

Writing probabilistic_layer.py
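
To make the softmax step concrete, here is a small plain-Python sketch (no TensorFlow) of how two logits become two action probabilities. The function name `softmax` and the example logits are illustrative only.

```python
import math


def softmax(logits):
    # Subtract the maximum logit for numerical stability; the result is unchanged.
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Equal logits: both actions are equally likely.
print(softmax([0.0, 0.0]))
# A larger first logit makes action 0 (LEFT) more likely than action 1 (RIGHT).
print(softmax([2.0, 0.0]))
```

Whatever the logits are, the two outputs are positive and sum to one, which is what the categorical distribution in `ProbabilisticLayer` needs.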


##### ***Implement network head i.e. implement the loss***
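The head needs to inherit from the base class `Head`.

In order to maximize the sum of rewards, we minimize the loss $-\sum_i A_i \log \pi(a_i|x_i)$, where $A_i$ is the advantage of sample $i$. Its gradient with respect to the weights $W$ is $-\sum_i A_i \nabla_W \log \pi(a_i|x_i)$, and each sample contributes the term

$-A_i \nabla_W \log \pi(a_i|x_i)$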
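
As a numeric sketch of the quantity being differentiated here, the plain-Python function below computes $-\frac{1}{N}\sum_i A_i \log \pi(a_i|x_i)$ for a small batch. The probabilities and advantages are made-up illustrative numbers, not taken from a run, and the mean over the batch (instead of the sum) is an assumption that matches the `tf.reduce_mean` call in the head's code.

```python
import math


def policy_gradient_loss(action_probs, actions, advantages):
    """Return -mean_i(A_i * log pi(a_i | x_i)) for a batch of transitions."""
    weighted = []
    for probs, action, advantage in zip(action_probs, actions, advantages):
        log_prob = math.log(probs[action])  # log pi(a_i | x_i)
        weighted.append(advantage * log_prob)
    return -sum(weighted) / len(weighted)


# Illustrative batch: per-step [P(left), P(right)], the action taken, and A_i.
action_probs = [[0.5, 0.5], [0.25, 0.75], [0.5, 0.5]]
actions = [1, 0, 1]
advantages = [-3.0, -2.0, -1.0]
print(policy_gradient_loss(action_probs, actions, advantages))
```

Minimizing this value raises the log-probability of actions with positive advantage and lowers it for actions with negative advantage.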

`Complete code`



In [0]:
%%writefile simple_pg_head.py
import tensorflow as tf
from rl_coach.architectures.tensorflow_components.heads.head import Head
from rl_coach.base_parameters import AgentParameters
from rl_coach.spaces import SpacesDefinition
from probabilistic_layer import ProbabilisticLayer


class SimplePgHead(Head):
    def __init__(self, agent_parameters: AgentParameters,
                 spaces: SpacesDefinition, network_name: str,
                 head_idx: int = 0, is_local: bool = True):
        super().__init__(agent_parameters, spaces, network_name)

        self.exploration_policy = agent_parameters.exploration

    def _build_module(self, input_layer):
        # Define inputs
        actions = tf.placeholder(tf.int32, [None], name="actions")
        advantages = tf.placeholder(tf.float32, [None], name="advantages")

        # Two actions, left or right
        policy_distribution = ProbabilisticLayer(input_layer, num_actions=2)

        # calculate loss
        log_prob = policy_distribution.log_prob(???)
        modulated_log_prob = ???
        expected_modulated_log_prob = tf.reduce_mean(modulated_log_prob)

        ### Coach bookkeeping
        # List of placeholders for additional inputs to the head
        # (other than the middleware input)
        self.input.append(???)
        # The output of the head, which is also the output of the network.
        self.output.append(???)
        # Placeholder for the target that we will use to train the network
        self.target = ???
        # The loss that we will use to train the network
        self.loss = ???
        tf.losses.add_loss(self.loss)



Overwriting simple_pg_head.py


##### ***Define exploration policy*** 
At every step we sample from the network's output distribution, i.e. we toss a biased coin to pick the agent's actual move.

**`Complete code`**

In [0]:
%%writefile simple_pg_exploration.py

import numpy as np
from rl_coach.exploration_policies.exploration_policy import ExplorationPolicy, ExplorationParameters
from rl_coach.spaces import ActionSpace


class DiscreteExplorationParameters(ExplorationParameters):
    @property
    def path(self):
        return 'simple_pg_exploration:DiscreteExploration'


class DiscreteExploration(ExplorationPolicy):
    """
    Discrete exploration policy is intended for discrete action spaces. It expects the action values to
    represent a probability distribution over the action
    """
    def __init__(self, action_space: ActionSpace):
        """
        :param action_space: the action space used by the environment
        """
        super().__init__(action_space)

    def get_action(self, probabilities):
        # choose actions according to the probabilities
        chosen_action = np.random.choice(self.action_space.actions, p=???)
        return chosen_action, probabilities


Overwriting simple_pg_exploration.py
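
The sampling step can be sketched without Coach or NumPy. The snippet below swaps `np.random.choice` for the standard library's `random.choices`; `sample_action` is a hypothetical helper name, and the probabilities are illustrative.

```python
import random


def sample_action(actions, probabilities, rng=random):
    # random.choices draws one item according to the given weights.
    return rng.choices(actions, weights=probabilities, k=1)[0]


rng = random.Random(0)  # seeded only to make the demo repeatable
samples = [sample_action([0, 1], [0.2, 0.8], rng) for _ in range(1000)]
# Empirical frequency of action 1 (RIGHT) over the 1000 draws.
print(samples.count(1) / len(samples))
```

Because the action is drawn from the policy's own probabilities, exploration shrinks naturally as the policy becomes more confident.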


##### ***Define new agent parameters***
Coach is modular!

Each class in Coach has a complementary parameters class which defines the arguments of its constructor.
This is also true for the agent: it has a complementary `AgentParameters` class, which lets you select the parameters of the agent's sub-modules.

It consists of the following four parts:



1.   algorithm
2.   exploration
3.   memory
4.   networks



In [0]:
%%writefile simple_pg_params.py
from rl_coach.architectures.embedder_parameters import InputEmbedderParameters
from rl_coach.architectures.head_parameters import HeadParameters
from rl_coach.architectures.middleware_parameters import FCMiddlewareParameters
from rl_coach.base_parameters import NetworkParameters, AlgorithmParameters, \
    AgentParameters

from rl_coach.exploration_policies.additive_noise import AdditiveNoiseParameters
from rl_coach.exploration_policies.categorical import CategoricalParameters
from rl_coach.memories.episodic.single_episode_buffer import SingleEpisodeBufferParameters
from rl_coach.spaces import DiscreteActionSpace, BoxActionSpace
from rl_coach.agents.policy_optimization_agent import PolicyGradientRescaler
from simple_pg_exploration import DiscreteExplorationParameters

class SimplePgAgentParameters(AgentParameters):
    def __init__(self):
        super().__init__(algorithm=SimplePGAlgorithmParameters(),
                         #exploration=CategoricalParameters(),
                         exploration=DiscreteExplorationParameters(),
                         memory=SingleEpisodeBufferParameters(),
                         networks={"main": SimplePgTopology()})
    @property
    def path(self):
        #return 'simple_pg_agent:SimplePgAgent'
        return 'rl_coach.agents.policy_gradients_agent:PolicyGradientsAgent'

        
    
# Since we are adding a new head, we need to tell Coach the head's path
class SimplePgHeadParams(HeadParameters):
    def __init__(self):
        super().__init__(parameterized_class_name="AiWeekHead")

    @property
    def path(self):
        return 'simple_pg_head:SimplePgHead'


class SimplePgTopology(NetworkParameters):
    def __init__(self):
        super().__init__()
        self.input_embedders_parameters = {'observation': InputEmbedderParameters()}
        self.middleware_parameters = FCMiddlewareParameters()
        self.heads_parameters = [SimplePgHeadParams()]


class SimplePGAlgorithmParameters(AlgorithmParameters):
    """
    :param num_steps_between_gradient_updates: (int)
        The number of steps between calculating gradients for the collected data. In the A3C paper, this parameter is
        called t_max. Since this algorithm is on-policy, only the steps collected between each two gradient calculations
        are used in the batch.
    """
    def __init__(self):
        super().__init__()
        # TOTAL_RETURN
        # FUTURE_RETURN
        # FUTURE_RETURN_NORMALIZED_BY_EPISODE 
        # FUTURE_RETURN_NORMALIZED_BY_TIMESTEP
        # Q_VALUE
        # A_VALUE
        # TD_RESIDUAL
        # DISCOUNTED_TD_RESIDUAL
        # GAE
        self.policy_gradient_rescaler = PolicyGradientRescaler.FUTURE_RETURN
        self.num_steps_between_gradient_updates = 20000  # this is called t_max in all the papers






Overwriting simple_pg_params.py
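
The parameters file selects `PolicyGradientRescaler.FUTURE_RETURN`. As a sketch of that quantity, assuming it means the discounted sum of rewards from the current step to the end of the episode, the function below computes it for one episode. The name `future_returns` and the `discount` argument are illustrative and not part of the Coach API.

```python
def future_returns(rewards, discount=1.0):
    """Return G_t = r_t + discount * G_{t+1} for every step of one episode."""
    returns = []
    running = 0.0
    # Walk the episode backwards so each return reuses the one after it.
    for reward in reversed(rewards):
        running = reward + discount * running
        returns.append(running)
    returns.reverse()
    return returns


# Shortest short-corridor episode: three steps, each with reward -1.
print(future_returns([-1, -1, -1]))  # → [-3.0, -2.0, -1.0]
```

Under this assumption, steps early in a long episode get a more negative weight than steps near the goal, so actions that prolong the episode are discouraged more strongly.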


##### ***Write preset to run new agent on short corridor***
* **Complete code**
* **Hint: look at the DQN preset**


In [0]:
%%writefile short_corridor_new_agent_preset.py
from rl_coach.base_parameters import VisualizationParameters
from rl_coach.core_types import EnvironmentEpisodes, EnvironmentSteps
from rl_coach.environments.gym_environment import GymEnvironmentParameters
from rl_coach.filters.filter import NoInputFilter, NoOutputFilter
from rl_coach.graph_managers.basic_rl_graph_manager import BasicRLGraphManager
from rl_coach.graph_managers.graph_manager import SimpleSchedule
from rl_coach.memories.memory import MemoryGranularity
from rl_coach.schedules import LinearSchedule
from simple_pg_params import SimplePgAgentParameters


####################
# Graph Scheduling #
####################
schedule_params = SimpleSchedule()


#########
# Agent #
#########
agent_params = ???
agent_params.input_filter = NoInputFilter()
agent_params.output_filter = NoOutputFilter()


###############
# Environment #
###############
env_params = GymEnvironmentParameters(level='short_corridor_env:ShortCorridorEnv')

#################
# Graph Manager #
#################
graph_manager = BasicRLGraphManager(agent_params=agent_params,
                                    env_params=env_params,
                                    schedule_params=schedule_params)



Writing short_corridor_new_agent_preset.py


##### ***Run preset of the new agent on the new environment***

**`Complete code`**




In [0]:
???