<a href="https://colab.research.google.com/github/deltorobarba/machinelearning/blob/master/bellman.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Bellman Equation (for Dynamic Programming in RL)**

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

## **Elements of Reinforcement Learning**

**Policy**

* A policy defines the learning agent’s way of behaving at a given time. Roughly speaking, a policy is a mapping from perceived states of the environment to actions to be taken when in those states. 

* The policy is the core of a reinforcement learning agent in the sense that it alone is sufficient to determine behavior. A policy is often denoted by the symbol 𝛑.

**Reward Function**

* A reward function defines the goal in a reinforcement learning problem. Roughly speaking, it maps each perceived state (or state-action pair) of the environment to a single number, a reward, indicating the intrinsic desirability of that state. 

* A reinforcement learning agent’s sole objective is to maximize the total reward it receives in the long run. The reward function defines which events are good and which are bad for the agent.

**Value Function**

* The **value** of a state is the total amount of reward an agent can expect to accumulate over the future, starting from that state. 

* Whereas **rewards** determine the immediate, intrinsic desirability of environmental states, values indicate the long-term desirability of states after taking into account the states that are likely to follow, and the rewards available in those states. 

* To make a human analogy, rewards are somewhat like pleasure (if high) and pain (if low), whereas values correspond to a more refined and farsighted judgment of how pleased or displeased we are that our environment is in a particular state.

**Model**

* A model is the agent’s representation of the environment. Learning can be of two types: model-based learning and model-free learning. 

* In **model-based learning**, the agent exploits previously learned information about how the environment behaves to accomplish a task, whereas in **model-free learning** the agent relies purely on trial-and-error experience to find the right action. 

* Say you want to reach your office from home faster. In model-based learning, you use previously learned knowledge (a map) to plan the fastest route, whereas in model-free learning you have no map, so you try different routes and settle on the fastest one.

## **Functions & Components of Reinforcement Learning**

**Rewards**

* Based on the action our agent performs, it receives a reward. A reward is just a numerical value, say, +1 for a good action and -1 for a bad action. An agent tries to maximize the total amount of reward (the cumulative reward) it receives from the environment rather than only the immediate reward. 

* The total amount of reward the agent receives from the environment is called the return. We can formulate the return as

```
R(t) = r(t+1)+r(t+2)+r(t+3)+r(t+4) ... +r(Τ)
```



where r(t+1) is the reward received after the action taken at time step t, r(t+2) is the reward after the following step, and so on up to the final time step Τ.


**Episodic and continuous tasks**

* Episodic tasks are tasks that have a terminal state (an end). For example, in a car racing game the end of the game is a terminal state. Once the game is over, you start the next episode by restarting the game, which is a whole new beginning. In the formula above, Τ is the final time step of the episode and r(Τ) is the reward received on reaching the terminal state.

* In a continuous task there is no terminal state. For example, a personal assistant does not have a terminal state.


**Discount factor**

* Since a continuous task has no final state, there is no last time step Τ and the return becomes an infinite sum



```
R(t) = r(t+1)+r(t+2)+r(t+3)+ ...   which can grow without bound (→ ∞)
```



* That’s why we introduce the notion of a discount factor. We can redefine our return with a discount factor, as follows



```
R(t) = r(t+1)+γr(t+2)+γ² r(t+3)+ ...
```



* The discount factor γ decides how much importance we give to future rewards relative to immediate rewards. Its value lies between 0 and 1. A low discount factor puts the weight on immediate rewards, while a high discount factor gives more weight to future rewards. With γ < 1 and bounded rewards, the infinite sum above stays finite. The best value is application dependent; values between 0.2 and 0.8 are suggested here as a starting range rather than a universal optimum.
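
The effect of the discount factor can be checked with a few lines of Python. This is a minimal sketch: the function name `discounted_return` and the reward sequence are assumptions made for illustration, not part of the original notebook.

```python
def discounted_return(rewards, gamma):
    """Sum the rewards r(t+1), r(t+2), ... weighted by gamma**k."""
    total = 0.0
    for k, r in enumerate(rewards):
        total += (gamma ** k) * r
    return total

# Hypothetical reward sequence: +1 at each of four steps.
rewards = [1.0, 1.0, 1.0, 1.0]

undiscounted = discounted_return(rewards, 1.0)  # 1 + 1 + 1 + 1
discounted = discounted_return(rewards, 0.5)    # 1 + 0.5 + 0.25 + 0.125
print(undiscounted, discounted)
```

With γ = 0.5 each later reward counts half as much as the one before it, so the same four rewards add up to less than their plain sum.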

**The policy function**

* As introduced in the policy section above, the policy function maps states to actions. It is denoted by π. So, basically, a policy function says what action to perform in each state. 

* Our ultimate goal is to find the optimal policy, which specifies the action to perform in each state that maximizes the expected return.
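
As a rough illustration, a policy can be stored as a plain mapping. The state names, action names, probabilities, and the helper `sample_action` below are all hypothetical; the sketch only shows the difference between a deterministic policy (one action per state) and a stochastic one (a probability per action).

```python
import random

# Hypothetical states and actions, used only for illustration.
deterministic_policy = {"s0": "right", "s1": "right", "s2": "stay"}

# A stochastic policy maps each state to a probability for every action.
stochastic_policy = {
    "s0": {"right": 0.8, "stay": 0.2},
    "s1": {"right": 0.5, "stay": 0.5},
}

def sample_action(policy, state):
    """Draw one action according to the probabilities stored for the state."""
    actions = list(policy[state])
    weights = [policy[state][a] for a in actions]
    return random.choices(actions, weights=weights, k=1)[0]

print(deterministic_policy["s0"])
print(sample_action(stochastic_policy, "s0"))
```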

**State value function**

* The state value function specifies how good it is for an agent to be in a particular state when following a policy π. It is often denoted by V(s) and gives the expected return from that state under the policy. 

* The state value function depends on the policy and it varies depending on the policy we choose.

> $V^{\pi}(s)=\mathbb{E}_{\pi}\left[\sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \mid s_{t}=s\right]$

**State-action value function (Q function)**

* A state-action value function is also called the Q function. It specifies how good it is for an agent to perform a particular action in a state when following a policy π. The Q function is denoted by Q(s, a). It denotes the value of taking action a in state s and following policy π afterwards.

* The difference between the value function and the Q function is that the value function specifies the goodness of a state, while the Q function specifies the goodness of an action in a state. Similar to a state value table, we can build a Q table that holds the value of every possible state-action pair. In this tabular setting, the value function V(s) and the Q function Q(s, a) are stored as the value table and the Q table.
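
A Q table can be sketched as a small NumPy array with one row per state and one column per action. The numbers below are made up, and the sketch assumes the table already holds optimal Q values, in which case the best action and the state value follow from a row-wise argmax and max.

```python
import numpy as np

# Hypothetical Q table: 3 states (rows) x 2 actions (columns).
q_table = np.array([[0.1, 0.5],
                    [0.7, 0.2],
                    [0.0, 0.0]])

# Best action in each state (ties resolve to the lowest action index).
greedy_actions = q_table.argmax(axis=1)

# If the table holds optimal Q values, V*(s) = max over a of Q*(s, a).
state_values = q_table.max(axis=1)

print(greedy_actions, state_values)
```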

> $Q^{\pi}(s, a)=\mathbb{E}_{\pi}\left[\sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \mid s_{t}=s, a_{t}=a\right]$

## **Markov Decision Process & Reinforcement Learning**

* The Markov property states that the future depends only on the present and not on the past. A Markov chain is a probabilistic model in which the next state depends solely on the current state and not on the previous states, that is, the future is conditionally independent of the past given the present.

* Moving from one state to another is called a transition, and its probability is called a transition probability. Any process in which the next state depends only on the present state can serve as an example.

* MDP is an extension of the Markov chain. It provides a mathematical framework for modeling decision-making situations. 
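
To make transition probabilities concrete, here is a minimal sketch of a two-state Markov chain. The states (sunny, rainy) and the probabilities are assumptions chosen for illustration. Each row of the matrix gives the distribution over next states, and it depends only on the current state.

```python
import numpy as np

# Hypothetical two-state chain: index 0 = sunny, index 1 = rainy.
# transition[i, j] is the probability of moving from state i to state j.
transition = np.array([[0.9, 0.1],
                       [0.5, 0.5]])

# Start in the sunny state with certainty.
dist = np.array([1.0, 0.0])

# Propagate the state distribution two steps forward.
for _ in range(2):
    dist = dist @ transition

print(dist)
```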


Almost all reinforcement learning problems can be modeled as an MDP. An MDP is represented by 5 important elements:

1. A set of states (S) the agent can be in.

2. A set of actions (A) that can be performed by the agent to move from one state to another.

3. A transition probability (Pᵃₛ₁ₛ₂), which is the probability of moving from state s₁ to state s₂ by performing action a.

4. A reward probability (Rᵃₛ₁ₛ₂), which despite its name is not a probability but the reward the agent is expected to receive for moving from state s₁ to state s₂ by performing action a.

5. A discount factor (γ), which controls the importance of immediate and future rewards. 
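
The five elements can be written down as ordinary Python data. The tiny two-state MDP below is hypothetical, and the nested-dictionary layout is just one possible choice; it loosely mirrors the `P[state][action]` transition table used in the simulation at the end of this notebook.

```python
# Hypothetical two-state MDP; all names and numbers are illustrative only.
states = [0, 1]      # S
actions = [0, 1]     # A
gamma = 0.9          # discount factor

# mdp[state][action] -> list of (transition probability, next state, reward)
mdp = {
    0: {0: [(1.0, 0, 0.0)], 1: [(0.8, 1, 1.0), (0.2, 0, 0.0)]},
    1: {0: [(1.0, 1, 0.0)], 1: [(1.0, 0, 0.0)]},
}

def probabilities_are_valid(table):
    """Check that the outcome probabilities of every (state, action) sum to 1."""
    return all(
        abs(sum(p for p, _, _ in table[s][a]) - 1.0) < 1e-12
        for s in table
        for a in table[s]
    )

print(probabilities_are_valid(mdp))
```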

## **Bellman Equation & Markov Decision Processes**

* The Bellman equation for dynamic programming helps to solve an MDP. To solve an MDP means to find the optimal policy and the optimal value function. The optimal value function V*(s) is the one that yields the maximum value in every state compared with all other value functions.

* According to the Bellman equation, the value of a state equals the maximum, over all actions available in that state, of the immediate reward for the action plus the discount factor multiplied by the value of the next state.

**Deterministic environment**

An environment is said to be deterministic when the outcome is fully determined by the current state and the chosen action. For instance, in a chess game, we know the exact board position that results from moving any piece.

* V(s) is the value of being in state s. V(s’) is the value of the next state s’ that we end up in after taking action a. 

* R(s, a) is the reward we get after taking action a in state s. Since several actions are available, we take the maximum over them, because the agent wants to pick the action that leads to the highest value. γ is the discount factor.
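
This deterministic backup, the maximum over actions of R(s, a) + γV(s’), can be traced on a tiny example. The sketch below assumes a hypothetical three-state chain in which state 2 is terminal and moving from state 1 into state 2 pays a reward of 1; the state layout, action names, and γ = 0.9 are illustrative choices.

```python
gamma = 0.9  # assumed discount factor

# Hypothetical deterministic chain: 0 -> 1 -> 2, where state 2 is terminal.
next_state = {0: {"stay": 0, "right": 1}, 1: {"stay": 1, "right": 2}}
reward = {0: {"stay": 0.0, "right": 0.0}, 1: {"stay": 0.0, "right": 1.0}}

# The terminal state keeps value 0; the other values start at 0 as well.
V = {0: 0.0, 1: 0.0, 2: 0.0}

# Repeatedly apply V(s) = max over a of (R(s, a) + gamma * V(s')).
for _ in range(50):
    for s in next_state:
        V[s] = max(reward[s][a] + gamma * V[next_state[s][a]]
                   for a in next_state[s])

print(V)
```

State 1 ends up with value 1 (the immediate reward), and state 0 with roughly 0.9, since the same reward is one step further away and is discounted once.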

> $V(s)=\max _{a}\left(R(s, a)+\gamma V\left(s^{\prime}\right)\right)$

**Stochastic environment**

An environment is said to be stochastic when the outcome cannot be determined with certainty from the current state and action. For example, we never know what number will show up when throwing a die.

* In a stochastic environment, taking an action does not guarantee a particular next state; instead, each possible next state is reached with some probability. 

* P(s, a, s’) is the probability of ending in state s’ from state s by taking action a. The sum runs over all possible next states s’, weighting each next state’s value by its probability. 

> $V(s)=\max _{a}\left(R(s, a)+\gamma \sum_{s^{\prime}} P\left(s, a, s^{\prime}\right) V\left(s^{\prime}\right)\right)$

* For example, suppose that by taking an action we can end up in 3 states s₁, s₂, and s₃ from state s with probabilities 0.2, 0.2 and 0.6. The Bellman equation then becomes



```
V(s) = maxₐ( R(s,a) + γ(0.2*V(s₁) + 0.2*V(s₂) + 0.6*V(s₃)) )
```

* We can solve the Bellman equation using a technique called **dynamic programming**.
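
To make the three-outcome example concrete, the expectation can be evaluated numerically. The probabilities come from the example; the reward, the discount factor, and the values of the three next states are assumed numbers picked for illustration, and only a single action is considered, so the max is trivial.

```python
gamma = 0.9                    # assumed discount factor
reward = 1.0                   # assumed R(s, a) for the single action considered
probs = [0.2, 0.2, 0.6]        # transition probabilities from the example
next_values = [1.0, 2.0, 3.0]  # assumed V(s1), V(s2), V(s3)

# Expected value of the next state: 0.2*1 + 0.2*2 + 0.6*3, about 2.4
expected_next = sum(p * v for p, v in zip(probs, next_values))

# Bellman backup for this action: 1 + 0.9 * 2.4, about 3.16
value = reward + gamma * expected_next

print(round(expected_next, 2), round(value, 2))
```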



**Dynamic Programming**

Dynamic programming (DP) is a technique for solving complex problems. In DP, instead of solving a complex problem in one go, we break it into simpler subproblems, and for each subproblem we compute and store the solution. If the same subproblem occurs again, we do not recompute it; instead, we reuse the stored solution.
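
The "compute once, then reuse" idea is easiest to see outside of reinforcement learning. The sketch below uses the Fibonacci numbers purely as a stand-in problem (it is not part of the original notebook) and Python's `functools.lru_cache` to store the solution of each subproblem.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def fib(n):
    """Fibonacci number n, with every subproblem solved only once."""
    if n < 2:
        return n
    # Both recursive calls hit the cache after their first evaluation.
    return fib(n - 1) + fib(n - 2)

print(fib(30))
```

Without the cache the same subproblems would be recomputed many times; with it, each `fib(k)` is evaluated once and looked up afterwards.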

We can solve the Bellman equation using two powerful algorithms:

* Value Iteration (In value iteration, we start off with an arbitrary value function, for example random values or all zeros. Since this initial value table is not optimal, we improve it iteratively until it converges.)

* Policy Iteration (In policy iteration, the actions the agent takes, that is the policy, are initialized first, the value table is computed for that policy, and the policy is then improved based on those values until it stops changing.)
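
Value iteration is implemented in the simulation below. For comparison, here is a minimal sketch of policy iteration on a hypothetical two-state MDP. The transition table, the rewards, γ = 0.9, and the fixed number of evaluation sweeps are all assumptions made for illustration.

```python
# Hypothetical MDP: mdp[state][action] -> list of (probability, next state, reward).
mdp = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    1: {0: [(1.0, 1, 2.0)], 1: [(1.0, 0, 0.0)]},
}
n_states, n_actions, gamma = 2, 2, 0.9

def q_value(V, s, a):
    """Expected reward plus discounted next-state value for action a in state s."""
    return sum(p * (r + gamma * V[s2]) for p, s2, r in mdp[s][a])

def policy_iteration(max_rounds=100, eval_sweeps=500):
    policy = [0] * n_states  # start with action 0 in every state
    V = [0.0] * n_states
    for _ in range(max_rounds):
        # Policy evaluation: approximate the value table of the current policy.
        V = [0.0] * n_states
        for _ in range(eval_sweeps):
            for s in range(n_states):
                V[s] = q_value(V, s, policy[s])
        # Policy improvement: act greedily with respect to that value table.
        new_policy = [max(range(n_actions), key=lambda a: q_value(V, s, a))
                      for s in range(n_states)]
        if new_policy == policy:
            break  # the policy is stable, so we stop
        policy = new_policy
    return policy, V

policy, V = policy_iteration()
print(policy, V)
```

In this toy problem the loop settles on action 1 in state 0 and action 0 in state 1, that is, move to state 1 and then keep collecting the reward of 2 there.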

## **Simulate**

In [1]:
import gym
import numpy as np

# Create the environment. FrozenLake has a discrete state and action space.
# Note: newer gym releases register this environment as 'FrozenLake-v1'.
env = gym.make('FrozenLake-v0')

# Number of discrete states and actions
states = env.observation_space.n
actions = env.action_space.n

def value_iterations(env, n_iterations, gamma=1.0, threshold=1e-30):
    # Initialize the value table with zeros (one entry per state)
    value_table = np.zeros(states)
    for i in range(n_iterations):
        # Keep the values of the previous sweep for the Bellman backup
        old_value_table = np.copy(value_table)
        for state in range(states):
            q_value = []
            for action in range(actions):
                next_state_reward = []
                # Each entry of P[state][action] is
                # (transition probability, next state, reward, done flag)
                for transition_prob, next_state, reward, _ in env.unwrapped.P[state][action]:
                    next_state_reward.append(
                        transition_prob * (reward + gamma * old_value_table[next_state]))
                q_value.append(np.sum(next_state_reward))
            # Bellman optimality backup: keep the value of the best action
            value_table[state] = max(q_value)

        # Stop once the value table no longer changes
        if np.sum(np.fabs(old_value_table - value_table)) <= threshold:
            break
    return value_table


def extract_policy(value_table, gamma=1.0):
    # For each state, pick the action with the highest expected value
    policy = np.zeros(states)
    for state in range(states):
        Q_table = np.zeros(actions)
        for action in range(actions):
            for transition_prob, next_state, reward, _ in env.unwrapped.P[state][action]:
                Q_table[action] += transition_prob * (reward + gamma * value_table[next_state])
        policy[state] = np.argmax(Q_table)
    return policy

value_table = value_iterations(env, 10000)
policy = extract_policy(value_table)
print(policy)

[0. 3. 3. 3. 0. 0. 0. 0. 3. 1. 0. 0. 0. 2. 1. 0.]
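
The printed policy is easier to read as a grid. The sketch below assumes the default 4x4 FrozenLake map and the usual action encoding (0 = left, 1 = down, 2 = right, 3 = up), and it copies the printed array by hand instead of re-running the environment.

```python
import numpy as np

# Policy copied from the output above (one action per state, row by row).
policy = np.array([0., 3., 3., 3., 0., 0., 0., 0.,
                   3., 1., 0., 0., 0., 2., 1., 0.])

# Assumed FrozenLake action encoding: 0 = left, 1 = down, 2 = right, 3 = up.
action_names = {0: "L", 1: "D", 2: "R", 3: "U"}

# Arrange the 16 actions on the 4x4 board.
grid = np.array([action_names[int(a)] for a in policy]).reshape(4, 4)
print(grid)
```

Entries for hole and goal states carry no meaning, because the episode ends there and no further action is taken.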
