# Markov Decision Processes (MDPs)
So far we have discussed k-armed bandit problems with stationary and non-stationary environments, and greedy and $\epsilon$-greedy agents. However, the k-armed bandit scenario that we implemented was not complex. The agent was always presented with the same situation: k arms, each of which was an independent action it could take. The scenario never really changed after each interaction, and the only (slightly) complex part of the implementation was building the ***non-stationary*** environment.

MDPs were introduced in order to describe and handle more complex situations. A basic MDP framework consists of:

* States
* Actions
* Rewards 
* Transitions 

### States
A state is the fundamental information needed to make a decision (to take an action). This is important because in MDPs each state is treated as an isolated case. States aren't constrained to physical locations or objects. There is no limit on the amount of information that a state can hold.

Here, we define new states based on what the agent chose in the previous interaction. We should also keep in mind that even though there is no real limit on the number of states an environment can have, the more states there are, the harder it is for our agent to learn. *We aim for a good balance between the information provided in each state and the number of states that the environment has*.

### Actions
Think of the actions as the way that our agent interacts with the environment. The actions the agent can take depend on the state it finds itself in, and every state can have a unique set of actions. Actions also define how we move from one state to another.
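
As a minimal sketch of state-dependent action sets, the snippet below maps each state to the actions available there. The `ACTIONS` dictionary and the `sample_action` helper are hypothetical names introduced only for illustration; the sizes (3 and 4 actions) match the two-state environment built later in this notebook.

```python
import random

# Hypothetical mapping from a 0-indexed state to the actions valid in it.
# Sizes follow the environment built later: state 1 has 3 arms, state 2 has 4.
ACTIONS = {
    0: [0, 1, 2],      # state 1: three arms
    1: [0, 1, 2, 3],   # state 2: four arms
}

def sample_action(state, rng=random):
    """Pick a uniformly random action among those valid in `state`."""
    return rng.choice(ACTIONS[state])

for s in ACTIONS:
    print("state", s + 1, "-> sampled arm", sample_action(s) + 1)
```

An action index that is valid in one state (arm 4 in state 2) may simply not exist in another, which is why the lookup is keyed by state.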

### Rewards 
Rewards are scalar values that the agent receives after taking an action. A reward can be thought of as a measurement of how well or badly the agent is doing in the environment. Every time we take an action (and follow a path) we get a reward that is positive, negative or zero. The objective of the agent is to maximise the reward it collects by taking the right actions in the right states.

In RL, assigning rewards to an environment is one of the most difficult and crucial tasks. Since rewards shape the behaviour of the agent, any misalignment between the rewards and the behaviour we actually want can lead to unexpected behaviours.

### Transitions
Transition models represent the way the environment responds to the agent's interaction and determine the probabilities of going from one state to another.

A transition model is usually represented as a function named $p$ and is often written like this:

$$p(s',r|s,\alpha)$$

Here, $s'$ is the next state, $r$ is the reward, $s$ is the current state and $\alpha$ is the action taken in the current state. The formula gives the probability of reaching the next state ($s'$) and obtaining a reward ($r$), given that we are currently in some state ($s$) and have taken an action ($\alpha$).
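
Because $p$ is a probability distribution over the pairs $(s', r)$, its values sum to one for every state-action pair:

$$\sum_{s'}\sum_{r} p(s',r|s,\alpha) = 1$$

For instance, if an action leads to one outcome with probability $0.7$ and to another with probability $0.3$, those two entries account for all of the probability mass and every other $(s', r)$ pair has probability zero.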

We can think of the transition model as an accurate representation of the rules of the game.
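
To make this concrete, here is one possible way to write a transition model down as a lookup table. It is only a sketch: the `P` dictionary layout and the helper names `is_normalised` and `expected_reward` are assumptions made for illustration, while the probabilities and rewards are those of the two-state environment implemented later in this notebook.

```python
# P[(state, action)] -> list of (probability, reward, next_state).
# States and actions are 0-indexed; the layout itself is a hypothetical choice.
P = {
    (0, 0): [(1.0, 0.5, 0)],
    (0, 1): [(1.0, -1.0, 0)],
    (0, 2): [(0.7, -2.0, 1), (0.3, -0.5, 0)],
    (1, 0): [(1.0, -2.0, 1)],
    (1, 1): [(1.0, 1.2, 1)],
    (1, 2): [(1.0, -1.5, 1)],
    (1, 3): [(0.8, -2.0, 0), (0.2, -0.5, 1)],
}

def is_normalised(outcomes, tol=1e-9):
    """Check that the outcome probabilities of one (state, action) pair sum to 1."""
    return abs(sum(p for p, _, _ in outcomes) - 1.0) < tol

def expected_reward(state, action):
    """Probability-weighted average of the immediate reward."""
    return sum(p * r for p, r, _ in P[(state, action)])

for (s, a), outcomes in P.items():
    assert is_normalised(outcomes)
    print(f"state {s + 1}, arm {a + 1}: expected reward {expected_reward(s, a):.2f}")
```

Reading the table this way shows, for example, that the stochastic arm of state 1 has an expected immediate reward of $0.7 \times (-2) + 0.3 \times (-0.5) = -1.55$.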

### Load the required libraries

In [None]:
import gym 
from gym import spaces 
import numpy as np

We can now build our environment using the gym interface. 

#### The `__init__()` function

In the initialisation function we add all the primary configuration that the env needs to work properly. Most of the time we mainly need to define two required properties: the ```action space``` and the ```observation space```, which define the actions available to the agent and what it should expect to receive as the state of the environment. We use ```spaces.Discrete``` to define our states. For now we will create 2 states. State 1 will have 3 actions and state 2 will have 4 actions.

#### step() function
This is the function that our agent calls for each action within the environment.
The ```step``` function is responsible for evaluating each action, determining the reward that comes with it, and transitioning to the next state. 

In short, the agent takes an action that depends on its current position, and that action generates the ```next_state``` and some ```reward```.  
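
The core of such a step, drawing a reward and a next state according to fixed probabilities, can be sketched without gym. The names `ARM_3_STATE_1` and `sample_outcome` are hypothetical and exist only for this illustration; the numbers are those of arm 3 in state 1 of the environment defined in this notebook, and the standard-library `random` module is used here in place of NumPy.

```python
import random

# Outcomes of the stochastic arm in state 1, as (probability, reward, next_state).
# next_state is 0-indexed: 0 is "state 1", 1 is "state 2".
ARM_3_STATE_1 = [(0.7, -2.0, 1), (0.3, -0.5, 0)]

def sample_outcome(outcomes, rng):
    """Draw one (reward, next_state) pair according to the listed probabilities."""
    weights = [p for p, _, _ in outcomes]
    _, reward, next_state = rng.choices(outcomes, weights=weights, k=1)[0]
    return reward, next_state

rng = random.Random(0)  # seeded only so the sketch is reproducible
reward, next_state = sample_outcome(ARM_3_STATE_1, rng)
print("reward:", reward, "| next state:", next_state + 1)
```

Calling `sample_outcome` many times should send the agent to state 2 in roughly 70% of the draws, which is a quick way to check that the probabilities were entered correctly.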

#### reset function
This function resets the agent's position. Since ```reset``` returns a state, we simply reuse the ```next_state``` function.

#### render function
Here we define how we want to visualise the env. 

In [None]:
class MainEnv(gym.Env):
    '''
    This class works with 5 functions: __init__, next_state, step, reset, render.
    We will create a new environment for our agent using the spaces module of the gym interface. 
    The spaces module is used to create both the states and the actions for each state.
    '''

    metadata = {'render.modes': ['human']}
    
    def __init__(self):
        
        super(MainEnv, self).__init__()
        
        self.observation_space = spaces.Discrete(2) # define the states
        self.action_space = spaces.Tuple((spaces.Discrete(3), spaces.Discrete(4))) # tuple with actions for each state
        self.position = np.random.randint(2)
        
    def next_state(self):
        return {'state': self.position}
    
    def step(self, action):
        ''' 
        INPUT: action (A)
        We define the actions as:
        STATE (S) 1 is a 3-armed bandit:
        A1 = arm 1 -- goes to S 1 -- reward: 0.5
        A2 = arm 2 -- goes to S 1 -- reward: -1
        A3 = arm 3 -- goes to S 2 (70%) or goes to state 1 (30%) -- reward: (-2, -0.5)
        
        STATE (S) 2 is a 4-armed bandit:
        A1 = arm 1 -- goes to S 2 -- reward: -2
        A2 = arm 2 -- goes to S 2 -- reward: 1.2
        A3 = arm 3 -- goes to S 2 -- reward: -1.5
        A4 = arm 4 -- goes to S 1 (80%) or goes to S 2 (20%) -- reward: (-2, -0.5)
        
        OUTPUT:
        next state (observation) = the new position of the agent 
        reward = the amount of reward for the action taken 
        done = whether the episode has ended 
        info = diagnostics dict
        '''
        
        transitions_1 = {
            0: lambda: [0.5, 0],
            1: lambda: [-1, 0],
            2: lambda: [[-2, 1], [-0.5, 0]][np.random.choice(2, p=[0.7, 0.3])]
        }
        
        transitions_2 = {
            0: lambda: [-2, 1],
            1: lambda: [1.2, 1],
            2: lambda: [-1.5, 1],
            3: lambda: [[-2, 0], [-0.5, 1]][np.random.choice(2, p=[0.8, 0.2])] 
        }
        
        reward = None
        new_state = None 
        if (self.position == 0):
            reward, new_state = transitions_1[action]() # agent is at state 1
        else:
            reward, new_state = transitions_2[action]() # agent is at state 2
            
        self.position = new_state # move the agent to the next state
        return self.next_state(), reward, False, {}  
    
    def reset(self):
        # reset the agent's position
        self.position = np.random.randint(2)
        return self.next_state()
    
    def render(self, mode='human'):
        if mode == 'human':
            pretty_print_state = {
                0: "state 1",
                1: "state 2"
            }
            print('current state: {}'.format(pretty_print_state[self.position]))
        else:
            raise NotImplementedError()        

## Test the environment 

In [None]:
env = MainEnv()
this_state = env.position
steps = 10

In [None]:
print_a_0 = {
    0: "Arm 1",
    1: "Arm 2",
    2: "Arm 3"
}

print_a_1 = {
    0: "Arm 1",
    1: "Arm 2",
    2: "Arm 3",
    3: "Arm 4"
}

In [None]:
for _ in range(steps):
    env.render()
    a = env.action_space[this_state].sample() # choose an action from the current state
    print_action = None
    
    if this_state == 0:
        print_action = print_a_0[a]  # labels for the 3 arms of state 1
    else:
        print_action = print_a_1[a]  # labels for the 4 arms of state 2
        
    print('action taken: {}'.format(print_action))
    s_prime, reward, _,_ = env.step(a) # execute a step, get reward and move to the next state
    this_state = s_prime['state']
    print('reward obtained: {}'.format(reward))
    print('---------------------------')