# Markov Decision Process

To introduce MDPs, let's start with an example (which I took from the [Machine Learning Nanodegree from Udacity](https://www.udacity.com/course/machine-learning-engineer-nanodegree--nd009t)).

Let's suppose that we have a robot that collects empty soda cans without any human intervention. It must decide by itself whether or not it needs to recharge its battery, which means going to the charging area. We want to define this problem as an *MDP*.

So, **what are the actions?**

   * search for a can
   * recharge its battery
   * wait for a can to be collected
    
The set of all actions available to the agent within the context of this problem is called the **Action Space**, denoted $A$.

So, **what are the states?**

   * battery high
   * battery low
    
The set of all non-terminal states is called the **State Space**, denoted $S$. $S^+$ denotes the state space including terminal states.

If only a subset of the actions is available in some states, we can also use $A(s)$ to refer to the set of actions available in state $s \in S$.
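As a minimal sketch, the state space, the action space and $A(s)$ for the robot could be written down in Python as below. The names `STATES`, `ACTIONS`, `AVAILABLE_ACTIONS` and `actions_in` are made up for illustration, and which actions are available in which state is an assumption (here: recharging is not offered when the battery is already high).

```python
# State space S and action space A for the can-collecting robot.
STATES = {"high", "low"}                    # battery level
ACTIONS = {"search", "recharge", "wait"}

# A(s): the actions available in each state.
# ASSUMPTION: recharging is only offered when the battery is low.
AVAILABLE_ACTIONS = {
    "high": {"search", "wait"},
    "low": {"search", "wait", "recharge"},
}


def actions_in(state):
    """Return A(s), the set of actions available in `state`."""
    return AVAILABLE_ACTIONS[state]


print(sorted(actions_in("low")))
```

Storing $A(s)$ as a dictionary keyed by state keeps the lookup explicit and makes it easy to check that every $A(s)$ is a subset of the full action space.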



The following state diagram shows the probability of each transition (blue numbers) and the reward, in number of empty soda cans, that the transition brings (orange numbers).

![MDP diagram](images/markovDiagram.png)

### Markov Decision Process (MDP): the definition

A (finite) Markov Decision Process (MDP) is defined by:

   * a (finite) set of states $S$
   * a (finite) set of actions $A$
   * a (finite) set of rewards $R$
   * the one-step dynamics of the environment:
   $$p(s', r \mid s,a) = \mathbb{P}(S_{t+1} = s', R_{t+1} = r \mid S_t = s, A_t = a)$$
   for all $s$, $s'$, $a$ and $r$
   * a discount rate $\gamma \in [0, 1]$
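To make the one-step dynamics more concrete, here is a rough sketch of how $p(s', r \mid s, a)$ could be stored as a nested dictionary. All probabilities and rewards below are made-up placeholders, not the values from the diagram, and the names `DYNAMICS`, `p` and `is_valid` are hypothetical.

```python
# One-step dynamics stored as {(s, a): {(s_next, r): probability}}.
# ASSUMPTION: every number here is an illustrative placeholder,
# NOT a value read from the diagram.
DYNAMICS = {
    ("high", "search"): {("high", 4): 0.7, ("low", 4): 0.3},
    ("high", "wait"): {("high", 1): 1.0},
    ("low", "search"): {("low", 4): 0.8, ("high", -3): 0.2},
    ("low", "wait"): {("low", 1): 1.0},
    ("low", "recharge"): {("high", 0): 1.0},
}


def p(s_next, r, s, a):
    """Return p(s', r | s, a); outcomes that are not listed have probability 0."""
    return DYNAMICS.get((s, a), {}).get((s_next, r), 0.0)


def is_valid(dynamics, tol=1e-9):
    """Check that, for every (s, a), the outcome probabilities sum to 1."""
    return all(
        abs(sum(outcomes.values()) - 1.0) <= tol
        for outcomes in dynamics.values()
    )


print(is_valid(DYNAMICS))
```

The validity check reflects the fact that, for each fixed state-action pair, $p(\cdot, \cdot \mid s, a)$ is a probability distribution over next states and rewards.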
   
*Hint:* In practice, the discount rate should be greater than 0, and is often set close to 1, so that the agent does not become too short-sighted. As an example to remember the discount rate, let us take $\gamma = 0.9$. Our discounted return would then be:
$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \gamma^3 R_{t+4} + \cdots = R_{t+1} + 0.9 R_{t+2} + 0.81 R_{t+3} + 0.729 R_{t+4} + \cdots$$
where the sum continues without limit.
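As a quick numeric check, the discounted return can be computed for a finite list of rewards. This is only a sketch: the function name `discounted_return` and the reward sequence are made up for illustration, and a finite list truncates the infinite sum.

```python
def discounted_return(rewards, gamma=0.9):
    """Return sum over k of gamma**k * rewards[k] for a finite reward list.

    rewards[0] plays the role of R_{t+1}, rewards[1] of R_{t+2}, and so on.
    """
    g = 0.0
    # Work backwards using the recursion G_t = R_{t+1} + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# ASSUMPTION: a hypothetical run in which the robot collects one can per step.
rewards = [1, 1, 1, 1]
print(discounted_return(rewards))  # 1 + 0.9 + 0.81 + 0.729, about 3.439
```

With `gamma=0` only the first reward counts (the short-sighted agent), while with `gamma=1` all rewards count equally.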