# Markov Decision Processes (MDPs)
So far we have discussed k-armed bandit problems with stationary and non-stationary environments, and greedy and $\epsilon$-greedy agents. However, the k-armed bandit scenario that we implemented was not complex. The agent was always presented with the same situation: it stood in front of k arms, each of which was an independent action it could take. The scenario never really changed between interactions, and the only (slightly) complex part of the implementation was building the ***non-stationary*** environment.

MDPs were introduced in order to describe and handle more complex situations. A basic MDP framework consists of:

* States
* Actions
* Rewards 
* Transitions 

### States
A state is the fundamental information needed to make a decision (that is, to take an action). This is important because in MDPs each state is an isolated case. States aren't constrained to physical locations or objects, and there is no limit on the amount of information that a state can hold.

Here, we define new states based on what the agent chose in the previous interaction. We should also keep in mind that, even though there is no real limit on the number of states an environment can have, the more states there are, the harder it is for our agent to learn. *We aim for a good balance between the information provided in each state and the number of states that the environment has*.

### Actions
Think of actions as the way our agent interacts with the environment. The actions the agent can take depend on the state it finds itself in, and every state can have its own set of actions. Actions also define how we move from one state to another.
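As a minimal sketch of state-dependent action sets, the mapping below stores the actions available in each state. The state and action names (`start`, `corridor`, `goal`, and so on) and the helper `available_actions` are hypothetical and chosen only for illustration; they are not part of the environment built later.

```python
# Hypothetical states and actions, used only to illustrate the idea.
ACTIONS_BY_STATE = {
    "start": ["left", "right"],
    "corridor": ["forward", "back"],
    "goal": [],  # assumed terminal state: no actions left to take
}


def available_actions(state):
    """Return the list of actions the agent may take in `state`."""
    return ACTIONS_BY_STATE[state]


for state in ACTIONS_BY_STATE:
    print(state, "->", available_actions(state))
```

Keeping the action sets in a plain dictionary makes it easy to see that two states need not share any actions at all.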

### Rewards 
Rewards are scalar values that the agent receives after taking an action. A reward can be thought of as a measurement of how good or bad the agent is doing in the environment. Every time we take an action (and so follow a path), we get a reward that is either positive, negative or zero. The objective of the agent is to maximise the reward it receives by taking the right actions in the right states.

In RL, assigning rewards to an environment is one of the most difficult and crucial tasks. Since rewards shape the behaviour of the agent, any misalignment between the rewards and the behaviour we actually want can lead to unexpected behaviours.
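To make the objective concrete, here is a small sketch that compares two paths by the sum of the rewards collected along them. Reading "maximise the reward" as an undiscounted sum is an assumption made for this example, and the reward values and the helper `total_reward` are hypothetical.

```python
# Hypothetical rewards collected along two candidate paths.
path_a = [0, 0, -1, 10]  # a small penalty first, a large reward at the end
path_b = [1, 1, 1, 1]    # a small reward at every step


def total_reward(rewards):
    """Undiscounted sum of the rewards along a path (an assumption)."""
    return sum(rewards)


best_path = max([path_a, path_b], key=total_reward)
print(total_reward(path_a), total_reward(path_b))
```

Under this reading the agent should prefer `path_a`, even though it includes a negative reward along the way.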

### Transitions
Transition models represent the way the environment responds to the agent's interaction and determine the probabilities of going from one state to another.

A transition model is usually represented as a function named $p$ and is often written like this:

$$p(s',r|s,\alpha)$$

Here, $s'$ is the next state, $r$ is the reward, $s$ is the current state and $\alpha$ is the action taken in the current state. The expression gives the probability of reaching the next state ($s'$) and obtaining a reward ($r$), given that we are currently in some state ($s$) and have taken an action ($\alpha$).

We can think of the transition model as an accurate representation of the rules of the game.
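One way to picture $p(s',r|s,\alpha)$ is as a lookup table. The sketch below is a hypothetical two-state model, not the environment built later: the states `s0` and `s1`, the actions `go` and `stay`, the probabilities, and the helper names `p` and `sample_transition` are all assumptions made for illustration. For every state-action pair, the probabilities over the possible (next state, reward) outcomes sum to 1.

```python
import random

# Hypothetical model: P[(s, a)] lists (probability, next_state, reward).
P = {
    ("s0", "go"): [(0.8, "s1", 1.0), (0.2, "s0", 0.0)],
    ("s0", "stay"): [(1.0, "s0", 0.0)],
    ("s1", "go"): [(1.0, "s1", 0.0)],
    ("s1", "stay"): [(1.0, "s1", 0.0)],
}


def p(next_state, reward, state, action):
    """Probability of (next_state, reward) given (state, action)."""
    return sum(
        prob
        for prob, s2, r in P[(state, action)]
        if s2 == next_state and r == reward
    )


def sample_transition(state, action, rng=random):
    """Draw one (next_state, reward) outcome according to the table."""
    outcomes = P[(state, action)]
    weights = [prob for prob, _, _ in outcomes]
    _, s2, r = rng.choices(outcomes, weights=weights, k=1)[0]
    return s2, r


print(p("s1", 1.0, "s0", "go"))
print(sample_transition("s0", "go"))
```

The function `p` plays the role of the formula above, while `sample_transition` shows how an environment could use the same table to respond to the agent.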

### Load the required libraries

In [1]:
import gym 
from gym import spaces 
import numpy as np

We can now build our environment using the gym interface.

#### The `__init__()` function

In the initialisation function we add all the primary configuration that the env needs to work properly. Most of the time we mainly need to define two required properties: the ```action space``` and the ```observation space```, which define the actions available to the agent and what it should expect to receive as the state of the environment.

In [None]:
class MainEnv(gym.Env):
    '''
    This class works with 5 functions: __init__, next_observation, step, reset, render.'''
    
    