<!--NAVIGATION-->
< [In Depth: MDP Wikipedia example](MDP-wikipedia.ipynb) | [In-Depth: MDP Exercise](MDP-2.ipynb) >



# Markov Decision Process

A Markov decision process (MDP) is a discrete-time stochastic control process. It provides a mathematical framework for modeling decision making in situations where outcomes are partly random and partly under the control of a decision maker.
At each time step, the process is in some state $s$, and the decision maker may choose any action $a$ that is available in state $s$. The process responds at the next time step by randomly moving into a new state $s'$ and giving the decision maker a corresponding reward $R_a(s,s')$.

The probability that the process moves into its new state $s'$ is influenced by the chosen action. Specifically, it is given by the state transition function $P_a(s,s')$. Thus, the next state $s'$ depends on the current state $s$ and the decision maker's action $a$. But given $s$ and $a$, it is conditionally independent of all previous states and actions; in other words, the state transitions of an MDP satisfy the Markov property.
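To make the Markov property concrete, here is a minimal sketch that samples transitions. It assumes $P_a(s,s')$ and $R_a(s,s')$ are stored as NumPy arrays indexed `P[a, s, s_next]` and `R[a, s, s_next]`; the two-state, two-action numbers and the helper `step` are hypothetical and are not taken from the figure in this notebook. Note that `step` only looks at the current `s` and `a`, never at earlier history.

```python
import numpy as np

# Hypothetical MDP with 2 states and 2 actions.
# P[a, s, s_next] = probability that action a in state s leads to s_next.
P = np.array([
    [[0.9, 0.1],   # action 0 from state 0
     [0.5, 0.5]],  # action 0 from state 1
    [[0.2, 0.8],   # action 1 from state 0
     [0.0, 1.0]],  # action 1 from state 1
])
# R[a, s, s_next] = reward for that transition (hypothetical values).
R = np.zeros_like(P)
R[1, 0, 1] = 1.0  # reward 1 for moving from state 0 to state 1 with action 1


def step(s, a, rng):
    """Sample (s_next, reward) using only the current state s and action a."""
    s_next = rng.choice(P.shape[2], p=P[a, s])
    return s_next, R[a, s, s_next]


rng = np.random.default_rng(0)
s = 0
for t in range(5):
    s_next, r = step(s, a=1, rng=rng)
    print(f"t={t}: s={s} -> s'={s_next}, reward={r}")
    s = s_next
```

Because the next state is drawn from `P[a, s]` alone, running the loop longer never requires storing the trajectory.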

A Markov decision process is a 4-tuple $(S, A, P_a, R_a)$, where:

- $S$ is a finite set of states,
- $A$ is a finite set of actions (alternatively, $A_s$ is the finite set of actions available from state $s$),
- $P_a(s,s') = \Pr(s_{t+1} = s' \mid s_t = s, a_t = a)$ is the probability that action $a$ in state $s$ at time $t$ will lead to state $s'$ at time $t+1$,
- $R_a(s,s')$ is the immediate reward (or expected immediate reward) received after transitioning from state $s$ to state $s'$, due to action $a$.
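One way to hold the 4-tuple in code is with dense NumPy arrays: states and actions become integer indices, and $P_a(s,s')$ and $R_a(s,s')$ become arrays indexed `[a, s, s']`. The sketch below is an assumption about representation (the class name `FiniteMDP` is hypothetical); its main job is to check that every $P_a(s,\cdot)$ is a valid probability distribution.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class FiniteMDP:
    """Hypothetical container for the 4-tuple (S, A, P_a, R_a)."""
    P: np.ndarray  # shape (n_actions, n_states, n_states): P[a, s, s'] = P_a(s, s')
    R: np.ndarray  # same shape: R[a, s, s'] = R_a(s, s')

    def __post_init__(self):
        if self.P.ndim != 3 or self.P.shape != self.R.shape:
            raise ValueError("P and R must both have shape (n_actions, n_states, n_states)")
        if self.P.shape[1] != self.P.shape[2]:
            raise ValueError("each P[a] must be a square matrix over states")
        if np.any(self.P < 0) or not np.allclose(self.P.sum(axis=2), 1.0):
            raise ValueError("each row P[a, s, :] must be a probability distribution")

    @property
    def states(self):  # S
        return range(self.P.shape[1])

    @property
    def actions(self):  # A
        return range(self.P.shape[0])


mdp = FiniteMDP(P=np.array([[[1.0, 0.0], [0.0, 1.0]],
                            [[0.0, 1.0], [1.0, 0.0]]]),
                R=np.zeros((2, 2, 2)))
print(list(mdp.states), list(mdp.actions))  # [0, 1] [0, 1]
```

This layout assumes every action is available in every state; the $A_s$ variant would need an extra mask of allowed actions per state.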

## Explanation

The value function for the MDP is defined as:

![](images/equation.png)
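Assuming the pictured equation is the standard value-iteration update $V_{i+1}(s) = \max_a \sum_{s'} P_a(s,s')\left(R_a(s,s') + \gamma V_i(s')\right)$, a minimal NumPy sketch looks like this. The arrays follow the `P[a, s, s']` convention; the function name `value_iteration`, the discount `gamma`, the tolerance `tol`, and the example MDP are hypothetical.

```python
import numpy as np


def value_iteration(P, R, gamma=0.9, tol=1e-8, max_iter=10_000):
    """Iterate V_{i+1}(s) = max_a sum_{s'} P[a,s,s'] * (R[a,s,s'] + gamma * V_i(s'))."""
    n_states = P.shape[1]
    V = np.zeros(n_states)  # V_0 = 0 for every state
    for _ in range(max_iter):
        # Q[a, s]: expected immediate reward plus discounted value of the next state
        Q = (P * (R + gamma * V)).sum(axis=2)
        V_new = Q.max(axis=0)  # value of the best action in each state
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
    return V


# Hypothetical 2-state, 2-action MDP: action 1 in state 0 earns reward 1 and stays put.
P = np.array([[[0.0, 1.0], [0.0, 1.0]],   # action 0: always move to state 1
              [[1.0, 0.0], [0.0, 1.0]]])  # action 1: stay where you are
R = np.zeros_like(P)
R[1, 0, 0] = 1.0
print(value_iteration(P, R, gamma=0.9))  # V(0) approaches 1 / (1 - 0.9) = 10; V(1) stays 0
```

The broadcast `R + gamma * V` adds $\gamma V_i(s')$ along the last axis, which is exactly the $s'$ index that the sum runs over.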

### The policy
The objective in the example MDP (the Wikipedia figure shown below) is to find the optimal policy for collecting the $+5$ reward. Intuitively, starting from $S_0$, we can see that the best route is to take the actions $a_1$, $a_1$, $a_0$, in that order.
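Once values are available, a policy can be read off greedily: in each state, pick the action whose one-step lookahead $\sum_{s'} P_a(s,s')\left(R_a(s,s') + \gamma V(s')\right)$ is largest. The sketch below assumes the `P[a, s, s']` array layout; the helper `greedy_policy` and the toy numbers are hypothetical and do not encode the figure's MDP.

```python
import numpy as np


def greedy_policy(P, R, V, gamma=0.9):
    """Return, for each state s, the index of the action maximizing the one-step lookahead."""
    Q = (P * (R + gamma * V)).sum(axis=2)  # Q[a, s]
    return Q.argmax(axis=0)  # ties go to the lowest action index


# Hypothetical 2-state MDP: in state 0, action 1 moves to state 1 and earns reward 5,
# while action 0 stays in state 0 with no reward. State 1 always returns to state 0.
P = np.array([[[1.0, 0.0], [1.0, 0.0]],
              [[0.0, 1.0], [1.0, 0.0]]])
R = np.zeros_like(P)
R[1, 0, 1] = 5.0
V = np.zeros(2)  # placeholder; normally the converged output of value iteration
print(greedy_policy(P, R, V))  # [1 0]
```

In state 1 both actions lead to the same place, so the tie is broken by `argmax` in favor of action 0.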

Breaking down the problem:

Suppose we start at $S_0$, with the following value vectors initialized to zero (one entry per state $S_0$, $S_1$, $S_2$):
$$V_1=\begin{pmatrix}0 \\ 0 \\ 0 \end{pmatrix}$$
$$V_2=\begin{pmatrix}0 \\ 0 \\ 0 \end{pmatrix}$$

We can take action $a_0$, which has a probability of $1/2$ of moving to $S_1$ and $1/2$ of returning to $S_0$.
According to the formula above, the new value of $S_0$ is the maximum over the two actions:
$$V_{i+1}(S_0) = \max_{a} \sum_{s'} P_a(S_0,s')\left(R_a(S_0,s') + \gamma V_i(s')\right)$$
where the term for action $a_0$ expands over its two possible next states:
$$a_0:\quad \sum_{s'} P_{a_0}(S_0,s')\left(R_{a_0}(S_0,s') + \gamma V_i(s')\right) = P_{a_0}(S_0,S_0)\left(R_{a_0}(S_0,S_0) + \gamma V_i(S_0)\right) + P_{a_0}(S_0,S_1)\left(R_{a_0}(S_0,S_1) + \gamma V_i(S_1)\right)$$
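Because the value vectors start at zero, the discounted term vanishes on the first update ($V_i(s') = 0$ for every $s'$), so each action's term reduces to its expected immediate reward:

$$
\sum_{s'} P_{a}(S_0,s')\left(R_{a}(S_0,s') + \gamma \cdot 0\right) = \sum_{s'} P_{a}(S_0,s')\, R_{a}(S_0,s'),
\qquad
V_{i+1}(S_0) = \max_{a} \sum_{s'} P_{a}(S_0,s')\, R_{a}(S_0,s').
$$

In other words, the first iteration only sees rewards one step away; rewards further along a route such as $a_1, a_1, a_0$ propagate back to $S_0$ over later iterations.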



Consider the following example (figure from [Wikipedia](https://en.wikipedia.org/wiki/Markov_decision_process)):
![](images/MDP.png)

```python
# Assumes `model` is a LinearRegression already fitted to one-dimensional data
print("Model slope:    ", model.coef_[0])
print("Model intercept:", model.intercept_)
```

```
Model slope:     2.02720881036
Model intercept: -4.99857708555
```


```python
import numpy as np  # NumPy
```

``LinearRegression`` can do more than fit a *straight line*: it also handles multidimensional linear models of the form:
$$
y = a_0 + a_1 x_1 + a_2 x_2 + \cdots
$$
where there are multiple $x$ values.

Such multidimensional regressions are harder to visualize, but we can still inspect the optimal coefficients the model finds. Here the data are built from known coefficients, so we can check that the model recovers them:

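```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(1)
X = 10 * rng.rand(100, 3)
y = 0.5 + np.dot(X, [1.5, -2., 1.])  # a_0 = 0.5, (a_1, a_2, a_3) = (1.5, -2, 1)

model = LinearRegression(fit_intercept=True)
model.fit(X, y)
print(model.intercept_)
print(model.coef_)
```

```
0.5
[ 1.5 -2.   1. ]
```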
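``LinearRegression`` fits these coefficients by ordinary least squares. As a cross-check, a NumPy-only sketch (not necessarily how scikit-learn computes it internally) recovers the same numbers by prepending a column of ones for the intercept $a_0$ and calling ``np.linalg.lstsq``:

```python
import numpy as np

rng = np.random.RandomState(1)
X = 10 * rng.rand(100, 3)
y = 0.5 + np.dot(X, [1.5, -2., 1.])

# Prepend a column of ones so the first coefficient plays the role of a_0
X_design = np.hstack([np.ones((X.shape[0], 1)), X])
coef, *_ = np.linalg.lstsq(X_design, y, rcond=None)
print(coef)  # close to [0.5, 1.5, -2.0, 1.0]
```

Since `y` here has no noise, the least-squares solution matches the generating coefficients up to floating-point error.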


## Polynomial basis functions

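One way to adapt linear regression to nonlinear relationships between variables is to transform the data with basis functions, for example with ``PolynomialFeatures``. The idea is to take the multidimensional linear model
$$
y = a_0 + a_1 x_1 + a_2 x_2 + a_3 x_3 + \cdots
$$
and build $x_1, x_2, x_3$, and so on, from our single input $x$.
That is, we let $x_n = f_n(x)$, where $f_n()$ is some function that transforms our data.

For example, if $f_n(x) = x^n$, our model becomes a polynomial regression:
$$
y = a_0 + a_1 x + a_2 x^2 + a_3 x^3 + \cdots
$$
This is still a linear model in the sense that matters for fitting: the coefficients $a_n$ never multiply or divide each other.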

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

x = np.array([2, 3, 4])
poly = PolynomialFeatures(3, include_bias=False)
poly.fit_transform(x[:, None])  # columns: x, x^2, x^3
```

```
array([[  2.,   4.,   8.],
       [  3.,   9.,  27.],
       [  4.,  16.,  64.]])
```

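```python
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Degree-7 polynomial features feeding an ordinary linear regression
poly_model = make_pipeline(PolynomialFeatures(7), LinearRegression())
```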

With this transformation, we can use a linear model to fit much more complicated relationships between $x$ and $y$.
For example:
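A minimal sketch, assuming noisy samples of $\sin(x)$ as hypothetical data, fits the degree-7 pipeline and evaluates it on a dense grid:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical data: a sine wave with Gaussian noise
rng = np.random.RandomState(1)
x = 10 * rng.rand(50)
y = np.sin(x) + 0.1 * rng.randn(50)

poly_model = make_pipeline(PolynomialFeatures(7), LinearRegression())
poly_model.fit(x[:, np.newaxis], y)

# Predictions on a dense grid, e.g. for plotting the fitted curve
xfit = np.linspace(0, 10, 1000)
yfit = poly_model.predict(xfit[:, np.newaxis])

# R^2 on the training data: values near 1 mean the curve follows the sine shape
print(poly_model.score(x[:, np.newaxis], y))
```

The model is still linear in its coefficients; all of the curvature comes from the `PolynomialFeatures` step.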


