# Action space, Policy, Episode, Horizon

In this section, we will learn about several fundamental concepts in reinforcement learning.

## Action space
Consider the grid world environment shown below:

![title](Images/16.png)


In the above grid world environment, the goal of the agent is to reach state I starting from state A without visiting the shaded states. In each state, the agent performs one of four actions (up, down, left, or right) to achieve the goal. The set of all possible actions in the environment is called the action space. Thus, for the above grid world environment, the action space is [up, down, left, right].



We can categorize the action space into two:

* Discrete action space 
* Continuous action space

__Discrete action space__ - When the action space consists of a finite set of distinct actions, it is called a discrete action space. For instance, in the above grid world environment, the action space consists of four discrete actions (up, down, left, right), so it is a discrete action space.

__Continuous action space__ - When the action space consists of actions that take continuous values, it is called a continuous action space. For instance, suppose we are training an agent to drive a car. The action space then consists of continuous values such as the speed at which to drive the car and the number of degrees by which to rotate the steering wheel.
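The difference between the two kinds of action space can be sketched in a few lines of Python. This is only an illustration: the names `DISCRETE_ACTIONS` and `CONTINUOUS_BOUNDS` are hypothetical, and the steering range is an assumed value (the speed range of 0 to 150 kmph is the one used later in this section).

```python
import random

# Discrete action space: a finite set of actions, as in the grid world.
DISCRETE_ACTIONS = ["up", "down", "left", "right"]

# Continuous action space: each component is a real value within some bounds.
# The steering bounds are assumed purely for illustration.
CONTINUOUS_BOUNDS = {
    "speed_kmph": (0.0, 150.0),
    "steering_deg": (-30.0, 30.0),
}

def sample_discrete(rng):
    """Pick one of the finitely many actions uniformly at random."""
    return rng.choice(DISCRETE_ACTIONS)

def sample_continuous(rng):
    """Pick a real value uniformly within the bounds of each component."""
    return {
        name: rng.uniform(low, high)
        for name, (low, high) in CONTINUOUS_BOUNDS.items()
    }

rng = random.Random(0)
print(sample_discrete(rng))    # one of the four named actions
print(sample_continuous(rng))  # a dict of real-valued action components
```

A discrete action can be enumerated and compared by name, whereas a continuous action is a point in a range, so there are infinitely many possible values.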

## Policy

A policy defines the agent's behavior in an environment: it tells the agent what action to perform in each state. For instance, in the grid world environment, we have states A to I and four possible actions. A policy might tell the agent to perform the action down in state A, the action right in state D, and so on.

To interact with the environment for the first time, we initialize a random policy, that is, a policy that tells the agent to perform a random action in each state. Thus, in the initial iteration, the agent performs a random action in each state and tries to learn whether the action is good or bad based on the reward it obtains. Over a series of iterations, the agent learns to perform good actions in each state, that is, actions that yield a positive reward. In other words, over a series of iterations the agent learns a good policy.

This good policy is called an optimal policy. The optimal policy is the policy that gets the agent a good reward and helps the agent achieve its goal. For instance, in our grid world environment, the optimal policy tells the agent which action to perform in each state so that the agent can reach state I from state A without visiting the shaded states.

The optimal policy is shown in the figure below. As we can observe, the agent selects the action in each state based on the optimal policy and reaches the terminal state I from the starting state A without visiting the shaded states.


![title](Images/17.png)

Thus, the optimal policy tells the agent to perform the correct action in each state so that the agent can receive a good reward.

A policy can be classified into two:

* Deterministic Policy
* Stochastic Policy

### Deterministic Policy
The policy we just learned about is a deterministic policy. A deterministic policy tells the agent to perform one particular action in a state. Thus, it maps each state to one particular action and is often denoted by $\mu$. Given a state $s_t$ at time $t$, a deterministic policy tells the agent to perform one particular action $a_t$. It can be expressed as:

$$a_t = \mu(s_t) $$

For instance, in our grid world example, given state A, the deterministic policy $\mu$ tells the agent to perform the action down, which can be expressed as:

$$\mu (A) = \text{Down} $$

Thus, according to the deterministic policy, whenever the agent visits state A, it performs the action down. 
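As a minimal sketch, a deterministic policy over a small discrete state space can be represented as a plain lookup table. The dictionary `mu` and the helper `act` below are hypothetical names; only the entries for states A and D come from the text, and a complete policy would cover every state the agent can visit.

```python
# Deterministic policy mu: each state maps to exactly one action.
# Only the entries for A and D are taken from the text above.
mu = {
    "A": "down",
    "D": "right",
}

def act(state):
    """Return the single action the deterministic policy prescribes."""
    return mu[state]

# Visiting the same state always yields the same action.
print([act("A") for _ in range(3)])  # ['down', 'down', 'down']
```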

### Stochastic Policy
Unlike a deterministic policy, a stochastic policy does not map a state directly to one particular action; instead, it maps the state to a probability distribution over the action space.

Recall that, given a state, a deterministic policy tells the agent to perform one particular action, so the agent performs the same action every time it visits that state. A stochastic policy instead returns a probability distribution over the action space. Rather than performing the same action on every visit, the agent samples its action from that distribution, so it may perform a different action each time it visits the state.

Let's understand this with an example. We know that our grid world environment's action space consists of four actions: [up, down, left, right]. Given state A, the stochastic policy returns the probability distribution over the action space as [0.10, 0.70, 0.10, 0.10]. Now, whenever the agent visits state A, instead of selecting the same action every time, it selects the action up 10% of the time, down 70% of the time, left 10% of the time, and right 10% of the time.

The difference between a deterministic policy and a stochastic policy is shown below. As we can observe, the deterministic policy maps the state to one particular action, whereas the stochastic policy maps the state to a probability distribution over the action space:



![title](Images/18.png)

Thus, a stochastic policy maps the state to a probability distribution over the action space and is often denoted by $\pi$. Say we have a state $s$ and an action $a$ at a time $t$; then we can express the stochastic policy as:


$$a_t \sim \pi(s_t) $$

Or it can also be expressed as $\pi(a_t |s_t) $. 
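The grid world example above, where state A maps to the distribution [0.10, 0.70, 0.10, 0.10], can be sketched as follows. The names `pi` and `sample_action` are hypothetical, and the sketch only defines the distribution for state A because that is the only one given in the text.

```python
import random
from collections import Counter

ACTIONS = ["up", "down", "left", "right"]

# Stochastic policy pi: each state maps to a probability distribution
# over the action space. The distribution for A is the one from the text.
pi = {
    "A": [0.10, 0.70, 0.10, 0.10],
}

def sample_action(state, rng):
    """Draw a_t ~ pi(. | s_t)."""
    return rng.choices(ACTIONS, weights=pi[state], k=1)[0]

# Visit state A many times and measure how often each action is chosen.
rng = random.Random(0)
n_visits = 10_000
counts = Counter(sample_action("A", rng) for _ in range(n_visits))
for action in ACTIONS:
    print(action, counts[action] / n_visits)
```

Over many visits, the empirical frequencies should approach the probabilities in the distribution, with down chosen roughly 70% of the time.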

We can categorize the stochastic policy into two:

* Categorical policy
* Gaussian policy

### Categorical policy 
A stochastic policy is called a categorical policy when the action space is discrete. That is, the stochastic policy uses a categorical probability distribution (a discrete distribution) over the action space to select actions. For instance, in the grid world environment we have just seen, we select actions based on a categorical distribution because the action space of the environment is discrete. As shown below, given state A, we select an action based on the categorical probability distribution over the action space:



![title](Images/19.png)

### Gaussian policy
A stochastic policy is called a Gaussian policy when the action space is continuous. That is, the stochastic policy uses a Gaussian probability distribution over the action space to select actions. Let's understand this with a small example. Suppose we are training an agent to drive a car, and say we have one continuous action in our action space: the speed of the car, which ranges from 0 to 150 kmph. Then, the stochastic policy uses a Gaussian distribution over the action space to select an action, as shown below:

![title](Images/20.png)
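The speed example above can be sketched as sampling from a Gaussian and clipping the result to the valid range. This is an illustration under stated assumptions: `gaussian_policy` is a hypothetical placeholder that returns a fixed mean and standard deviation (60 and 10 kmph are made-up values), whereas a learned policy would compute them from the state.

```python
import random

SPEED_MIN, SPEED_MAX = 0.0, 150.0  # speed range in kmph, from the example

def gaussian_policy(state):
    """Return the (mean, std) of the speed distribution for a state.

    Hypothetical placeholder: the values are fixed and assumed here,
    not derived from the state.
    """
    return 60.0, 10.0

def sample_speed(state, rng):
    """Draw a speed from the Gaussian, then clip it to the valid range."""
    mean, std = gaussian_policy(state)
    speed = rng.gauss(mean, std)
    return min(max(speed, SPEED_MIN), SPEED_MAX)

rng = random.Random(0)
print([round(sample_speed("some_state", rng), 1) for _ in range(5)])
```

Clipping is one simple way to keep sampled actions inside the allowed range; it is a design choice of this sketch rather than part of the definition of a Gaussian policy.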


We will learn more about the Gaussian policy in the upcoming chapters.



## Episode
The agent interacts with the environment by performing actions, starting from the initial state and continuing until it reaches the final state. This agent-environment interaction from the initial state to the final state is called an episode. For instance, in a car racing video game, the agent plays the game by starting from the initial state (the starting point of the race) and reaching the final state (the endpoint of the race). This is considered an episode. An episode is also often called a trajectory (the path taken by the agent) and is denoted by $\tau$.

An agent can play the game for any number of episodes, and each episode is independent of the others. What is the use of playing the game for multiple episodes? The agent plays many episodes in order to learn the optimal policy, that is, the policy that tells the agent to perform the correct action in each state.

For example, let's say we are playing a car racing game. The first time, we may not win, so we play the game several times to understand it better and discover good strategies for winning. Similarly, the agent may not win in the first episode, so it plays several episodes to understand more about the game environment and learn good strategies for winning.



Say we begin the game from an initial state at time step $t=0$ and reach the final state at time step $T$. The episode information then consists of the agent-environment interaction (the states, actions, and rewards) from the initial state to the final state, that is, $(s_0, a_0, r_0, s_1, a_1, r_1, \dots, s_T)$.

An episode (or trajectory) is shown below:

![title](Images/21.png)


Let's strengthen our understanding of episodes and the optimal policy with the grid world environment. We learned that in the grid world environment, the goal of our agent is to reach the final state I starting from the initial state A without visiting the shaded states. The agent receives a reward of +1 when it visits an unshaded state and a reward of -1 when it visits a shaded state.

When we say "generate an episode", we mean going from the initial state to the final state. The agent generates the first episode using a random policy to explore the environment, and over several episodes it learns the optimal policy.
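Generating an episode with a random policy can be sketched as below. Several details are assumptions, because the grid itself is only shown in the figures: the row-major 3x3 layout, the set of shaded states in `SHADED`, the rule that moving off the grid leaves the agent where it is, and the `max_steps` cap that stops a wandering random agent. The rewards of +1 and -1 follow the text.

```python
import random

# Assumed 3x3 layout (row-major) and assumed shaded states.
GRID = [["A", "B", "C"],
        ["D", "E", "F"],
        ["G", "H", "I"]]
SHADED = {"B", "C", "G", "H"}
START, GOAL = "A", "I"
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
POSITION = {s: (r, c) for r, row in enumerate(GRID) for c, s in enumerate(row)}

def step(state, action):
    """Apply an action; moving off the grid leaves the state unchanged."""
    r, c = POSITION[state]
    dr, dc = MOVES[action]
    nr, nc = r + dr, c + dc
    if 0 <= nr < 3 and 0 <= nc < 3:
        state = GRID[nr][nc]
    reward = -1 if state in SHADED else 1
    return state, reward

def generate_episode(rng, max_steps=100):
    """Follow a random policy from START until GOAL or the step cap."""
    state, trajectory = START, []
    for _ in range(max_steps):
        action = rng.choice(list(MOVES))
        next_state, reward = step(state, action)
        trajectory.append((state, action, reward))  # (s_t, a_t, r_t)
        state = next_state
        if state == GOAL:
            break
    return trajectory, state

trajectory, final_state = generate_episode(random.Random(0))
print(trajectory[:5], "... final state:", final_state)
print("total reward:", sum(r for _, _, r in trajectory))
```

The list of `(s_t, a_t, r_t)` tuples together with the final state corresponds to the trajectory $(s_0, a_0, r_0, s_1, a_1, r_1, \dots, s_T)$.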

### Episode 1:

As shown below, in the first episode, the agent uses a random policy: it selects a random action in each state, from the initial state until the final state, and observes the reward:


![title](Images/22.png)


### Episode 2:

In the second episode, the agent tries a different policy to avoid the negative rewards it received in the previous episode. For instance, as we can observe in the previous episode, the agent selected the action right in state A and received a negative reward. So in this episode, instead of selecting the action right in state A, it tries a different action, say down, as shown below:


![title](Images/23.png)

### Episode n:

Thus, over a series of episodes, the agent learns the optimal policy, that is, the policy that takes the agent from state A to the final state I without visiting the shaded states, as shown below:


![title](Images/24.png)

## Episodic and Continuous tasks
A reinforcement learning task can be categorized into two:

* Episodic task
* Continuous task

__Episodic task__ - As the name suggests, an episodic task is one that has a terminal state. That is, episodic tasks are made up of episodes, and each episode ends in a terminal state. Example: a car racing game.

__Continuous task__ - Unlike episodic tasks, continuous tasks (also called continuing tasks) are not divided into episodes, so they have no terminal state. For example, the task of a personal assistance robot does not have a terminal state.


## Horizon
The horizon is the time step up to which the agent interacts with the environment. We can classify the horizon into two:

* Finite horizon
* Infinite horizon

__Finite horizon__ - If the agent-environment interaction stops at a particular time step, it is called a finite horizon. For instance, in an episodic task, the agent interacts with the environment starting from the initial state at time step $t=0$ and reaches the final state at time step $T$. Since the agent-environment interaction stops at time step $T$, it is considered a finite horizon.

__Infinite horizon__ - If the agent-environment interaction never stops, it is called an infinite horizon. For instance, we learned that a continuous task does not have a terminal state, so the agent-environment interaction never stops, and it is considered an infinite horizon.
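The two kinds of horizon can be illustrated with a generator of time steps. This is only a sketch: `interaction_steps` is a hypothetical helper, and each yielded value stands in for one step of agent-environment interaction.

```python
from itertools import count, islice

def interaction_steps(horizon=None):
    """Yield time steps t = 0, 1, 2, ...

    With a finite horizon T the interaction stops after step T;
    with horizon=None it never stops on its own.
    """
    if horizon is None:
        yield from count(0)
    else:
        yield from range(horizon + 1)

# Finite horizon: the loop ends by itself at T = 4.
print(list(interaction_steps(horizon=4)))    # [0, 1, 2, 3, 4]

# Infinite horizon: we can only ever look at a prefix of the interaction.
print(list(islice(interaction_steps(), 6)))  # [0, 1, 2, 3, 4, 5]
```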
