# Markov Decision Processes

In this notebook, we're going to construct the environment for the Casinos scenario presented in [this article](https://medium.com/@alejandro.aristizabal24/understanding-reinforcement-learning-hands-on-markov-decision-processes-7d8469a8a782). Here, we're mainly going to cover the implementation details. For a deeper look into how MDPs work and how they are conceptualized, it's recommended to read the whole article.

## The Casinos Environment

The Casinos environment is a simple scenario I presented as a middle point between complex MDPs and the already familiar Multi-Armed Bandit scenario. It has two states, which represent two distinct casinos. On its own, each state behaves very much like a Multi-Armed Bandit: the agent can pull one of several arms, and those actions do not change the agent's current state. To allow transitions between the two casinos, an additional action called **Take a bus** was added to both states. This action is non-deterministic, which means that with some probability the environment responds differently from what's expected, and the agent stays in the casino it was already in. Here's a full diagram of the MDP.

<img src='assets/casinos_mdp.png'></img>

## Implementation

We're going to be using the gym package to construct our custom environment. We will be going step-by-step, and each step will add some detail to the environment.

### Gym Custom Environment Template
Below, you will find the backbone of a gym environment. It has all the methods needed for a valid environment. We will add details to this template step by step to build the environment we want.

In [81]:
import gym
from gym import spaces
import numpy as np

class CasinosEnv(gym.Env):
    """
    Casinos custom environment, built with the gym interface
    """
    metadata= {'render.modes': ['human']}
    
    def __init__(self):
        super(CasinosEnv, self).__init__()
        
        # Here, we declare our initial configuration.
        # Usually, this includes defining the action
        # and observation spaces
        pass
        
    def step(self, action):
        # This function executes a one-step simulation
        # in our environment. Receives the desired action
        # the agent has taken as input
        pass
        
    def reset(self):
        # Restarts the environment to a starting position.
        # Usually, changes in the environment will be undone,
        # and the environment is reseeded
        pass
        
    def render(self, mode='human'):
        # Display the environment in some way. The mode defines
        # in which way we desire to display it.
        pass

We will now go through each method individually. Before we do so, there are a few remarks to make about the template:

- It inherits from the `gym.Env` class.
- Our class contains a class-level attribute called `metadata`. It usually holds information meant for the developer. Most of the time, it includes a list of all the ways in which the environment can be rendered. In this case, only `'human'` is allowed.

### Initialization

The initialization is where we add any configuration the environment needs in order to work properly. Most of the time, we only need to define two required properties of the environment: the `action_space` and the `observation_space`. Together, they define which actions the agent can take and what it should expect to receive as an observation of the environment. As a side note, we use the name "observation" here because sometimes the whole state of the environment isn't accessible to the agent. In our case, observation and state are the same thing, which means that an agent interacting with the environment has full access to the state.

In [86]:
class CasinosEnv(gym.Env):
    """
    Casinos custom environment, built with the gym interface
    """
    metadata= {'render.modes': ['human']}
    
    def __init__(self):
        super(CasinosEnv, self).__init__()
        
        # Here, we declare our initial configuration.
        # Usually, this includes defining the action
        # and observation spaces
        
        self.observation_space = spaces.Discrete(2)
        self.action_space = spaces.Tuple((spaces.Discrete(3), spaces.Discrete(4)))
        self.agent_pos = np.random.randint(2)

Inside our initialization method we have declared both required properties. Gym offers many options for defining what our observation and action spaces look like. Most of the time, the problem at hand tells us which option to choose.

Our environment contains two distinct states (Casino 1, Casino 2). They are discrete, and so we use the `spaces.Discrete(2)` option.
For our actions, the MDP described previously has a different set of actions depending on which casino we find ourselves in. In each casino, we also have a discrete number of actions we can take. In the first casino we only have three actions, while in the second we have four. For this reason, we used a `spaces.Tuple`, which contains two discrete spaces, one for each state.

Lastly, we added an additional variable, which stores the agent's current position. This allows the environment to keep track of the agent's state, and it is also what the agent will see from any observation.
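
To make the idea of per-state action sets concrete, here is a minimal plain-Python sketch of what the tuple of discrete spaces encodes. It deliberately does not use gym, and the names `ACTIONS_PER_STATE`, `valid_actions` and `sample_action` are hypothetical helpers introduced only for illustration.

```python
import random

# Number of actions available in each state, mirroring
# Tuple((Discrete(3), Discrete(4))): 3 in Casino 1, 4 in Casino 2.
# (Hypothetical stand-in, not part of the gym API.)
ACTIONS_PER_STATE = (3, 4)


def valid_actions(state):
    """Return the action indices available in the given state."""
    return list(range(ACTIONS_PER_STATE[state]))


def sample_action(state, rng=random):
    """Pick a uniformly random valid action for the given state."""
    return rng.randrange(ACTIONS_PER_STATE[state])


print(valid_actions(0))  # [0, 1, 2]
print(valid_actions(1))  # [0, 1, 2, 3]
```

Indexing by the current state is the same pattern the simulation at the end of this notebook relies on when it samples from `env.action_space[curr_state]`.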

### Step
Our step function contains most of the interaction dynamics of the environment. This is the function our agent calls every time it wants to take an action in the environment. Because of this, the step function is in charge of evaluating that action and transitioning to the corresponding state. It also defines how much reward the agent should receive, the next observation that the agent will use to take future actions, and whether the agent reached an ending state.

We have all the necessary information in the diagram above, so we will turn it into code here.

In [67]:
class CasinosEnv(gym.Env):
    """
    Casinos custom environment, built with the gym interface
    """
    metadata= {'render.modes': ['human']}
    
    def _next_observation(self):
        return {'state': self.agent_pos}
    
    def step(self, action):
        """
        Our actions will be defined as such:
        - Casino 1 (state 0):
            - Action 0: Pull Arm 0, go to state 0, get reward 0.5
            - Action 1: Pull Arm 1, go to state 0, get reward -1
            - Action 2: Take a bus,
                70% of times: go to state 1, get reward -2
                30% of times: go to state 0, get reward -0.5
                
        - Casino 2 (state 1):
            - Action 0: Pull Arm 0, go to state 1, get reward -2
            - Action 1: Pull Arm 1, go to state 1, get reward 1.2
            - Action 2: Pull Arm 2, go to state 1, get reward -1.5
            - Action 3: Take a bus,
                80% of times: go to state 0, get reward -2
                20% of times: go to state 1, get reward -0.5
                
        Returns:
            observation (object): agent's observation of the current environment
            reward (float) : amount of reward returned after previous action
            done (bool): whether the episode has ended, in which case further step() calls will return undefined results
            info (dict): contains auxiliary diagnostic information (helpful for debugging, and sometimes learning)
        """
        
        # dictionary of functions that return [reward, next_state] transitions for each action
        state_0_transitions = {
            0: lambda: [0.5, 0],
            1: lambda: [-1, 0],
            2: lambda: [[-2, 1],[-0.5, 0]][np.random.choice(2,p=[0.7,0.3])]
        }
        
        state_1_transitions = {
            0: lambda: [-2, 1],
            1: lambda: [1.2, 1],
            2: lambda: [-1.5, 1],
            3: lambda: [[-2, 0],[-0.5, 1]][np.random.choice(2,p=[0.8,0.2])]
        }
        
        reward = None
        next_state = None
        if (self.agent_pos==0):
            # Agent is at Casino 1. Use state 0 transitions
            reward, next_state = state_0_transitions[action]()
        else:
            # Agent is at Casino 2. Use state 1 transitions
            reward, next_state = state_1_transitions[action]()
            
        # Transition the agent to the next state
        self.agent_pos = next_state
        # Return the data as defined by the gym interface
        return self._next_observation(), reward, False, {}

It may look complicated, but there's not much going on in here. We defined two dictionaries that describe how the environment reacts to the agent's actions. Then, depending on the state the agent is currently in, the environment responds and generates the `next_state` and `reward`. Lastly, we return the information the way the gym interface expects. We also defined another function, called `_next_observation`. It's a pretty basic function that was added mainly to keep our observations consistent.

Other remarks about what we return:

- the `done` variable is always set to False, since our environment doesn't have any termination state.
- the `info` variable is always empty. We do not need to add any auxiliary information with our environment.
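
The same dynamics can also be written down as a plain lookup table, which makes it easy to check the numbers. The following is a minimal sketch under that assumption: `TRANSITIONS` and `expected_reward` are hypothetical names, and the values are taken from the lambdas in `step`. It computes the expected immediate reward of every action.

```python
# state -> action -> list of (probability, next_state, reward) outcomes.
# Hypothetical table form of the dynamics coded in step().
TRANSITIONS = {
    0: {  # Casino 1
        0: [(1.0, 0, 0.5)],
        1: [(1.0, 0, -1.0)],
        2: [(0.7, 1, -2.0), (0.3, 0, -0.5)],  # Take a bus
    },
    1: {  # Casino 2
        0: [(1.0, 1, -2.0)],
        1: [(1.0, 1, 1.2)],
        2: [(1.0, 1, -1.5)],
        3: [(0.8, 0, -2.0), (0.2, 1, -0.5)],  # Take a bus
    },
}


def expected_reward(state, action):
    """Probability-weighted immediate reward of taking `action` in `state`."""
    return sum(p * r for p, _, r in TRANSITIONS[state][action])


for state, actions in TRANSITIONS.items():
    for action in actions:
        print(state, action, round(expected_reward(state, action), 2))
```

With these numbers, taking the bus has an expected immediate reward of about -1.55 from Casino 1 and about -1.7 from Casino 2, so it only pays off through the rewards available after arriving.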

### Reset

Our reset function will be pretty simple. The only thing we need to reset is the agent's position. Our reset function needs to return an observation, so we're reusing the `_next_observation` function declared above.

In [92]:
class CasinosEnv(gym.Env):
    """
    Casinos custom environment, built with the gym interface
    """
    metadata= {'render.modes': ['human']}
    
    def reset(self):
        # Reset the agent's position
        self.agent_pos = np.random.randint(2)
        return self._next_observation()

### Render

Inside the render function we define how we want to visualize the environment. The gym interface offers a list of modes that people might expect to find in an environment. It is up to us to define how our environment is displayed and which modes it supports. The `metadata` class attribute of our environment contains the list of supported modes, which in our case only contains `human`. The `human` mode means that whatever we render is intended for a human to look at. We will simply print the agent's current state.

In [93]:
class CasinosEnv(gym.Env):
    """
    Casinos custom environment, built with the gym interface
    """
    metadata= {'render.modes': ['human']}
    
    def render(self, mode='human'):
        if mode == 'human':
            pretty_print_state = {
                0: "Casino 1",
                1: "Casino 2"
            }
            print('Current State: {}'.format(pretty_print_state[self.agent_pos]))
        else:
            raise NotImplementedError()

## Full Implementation

Now that all of the required functionality has been explained, let's put it all together.

In [94]:
import gym
from gym import spaces
import numpy as np

class CasinosEnv(gym.Env):
    """
    Casinos custom environment, built with the gym interface
    """
    metadata= {'render.modes': ['human']}
    
    def __init__(self):
        super(CasinosEnv, self).__init__()
        
        self.observation_space = spaces.Discrete(2)
        self.action_space = spaces.Tuple((spaces.Discrete(3), spaces.Discrete(4)))
        self.agent_pos = np.random.randint(2)
        
    def _next_observation(self):
        return {'state': self.agent_pos}
        
    def step(self, action):
        # dictionary of functions that return [reward, next_state] transitions for each action
        state_0_transitions = {
            0: lambda: [0.5, 0],
            1: lambda: [-1, 0],
            2: lambda: [[-2, 1],[-0.5, 0]][np.random.choice(2,p=[0.7,0.3])]
        }
        
        state_1_transitions = {
            0: lambda: [-2, 1],
            1: lambda: [1.2, 1],
            2: lambda: [-1.5, 1],
            3: lambda: [[-2, 0],[-0.5, 1]][np.random.choice(2,p=[0.8,0.2])]
        }
        
        reward = None
        next_state = None
        if (self.agent_pos==0):
            # Agent is at Casino 1. Use state 0 transitions
            reward, next_state = state_0_transitions[action]()
        else:
            # Agent is at Casino 2. Use state 1 transitions
            reward, next_state = state_1_transitions[action]()
            
        # Transition the agent to the next state
        self.agent_pos = next_state
        # Return the data as defined by the gym interface
        return self._next_observation(), reward, False, {}
        
    def reset(self):
        # Reset the agent's position
        self.agent_pos = np.random.randint(2)
        return self._next_observation()
        
    def render(self, mode='human'):
        if mode == 'human':
            pretty_print_state = {
                0: "Casino 1",
                1: "Casino 2"
            }
            print('Current State: {}'.format(pretty_print_state[self.agent_pos]))
        else:
            raise NotImplementedError()

That's it! Note that some comments were removed for space. Our MDP is now implemented, and we can interact with it. Let's make a simple agent and run a short simulation. Our agent will take random actions and print each action it took, as well as the reward received.

In [101]:
env = CasinosEnv()
curr_state = env.agent_pos
steps = 10

pretty_print_a_0 = {
    0: "Arm 1",
    1: "Arm 2",
    2: "Take a bus",
}

pretty_print_a_1 = {
    0: "Arm 1",
    1: "Arm 2",
    2: "Arm 3",
    3: "Take a bus",
}
for _ in range(steps):
    env.render()
    action = env.action_space[curr_state].sample() # Sample an action from the current state
    print_a = None
    if curr_state==0:
        print_a = pretty_print_a_0[action]
    else:
        print_a = pretty_print_a_1[action]
    print("Action taken: {}".format(print_a))
    obs, reward, _, _ = env.step(action) # Execute a step, get observation and reward
    curr_state = obs['state']
    print("Reward obtained: {}".format(reward))
    print("===========================")
    

Current State: Casino 2
Action taken: Arm 1
Reward obtained: -2
Current State: Casino 2
Action taken: Arm 1
Reward obtained: -2
Current State: Casino 2
Action taken: Arm 2
Reward obtained: 1.2
Current State: Casino 2
Action taken: Arm 2
Reward obtained: 1.2
Current State: Casino 2
Action taken: Arm 3
Reward obtained: -1.5
Current State: Casino 2
Action taken: Take a bus
Reward obtained: -2
Current State: Casino 1
Action taken: Take a bus
Reward obtained: -2
Current State: Casino 2
Action taken: Arm 3
Reward obtained: -1.5
Current State: Casino 2
Action taken: Take a bus
Reward obtained: -0.5
Current State: Casino 2
Action taken: Arm 3
Reward obtained: -1.5
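
Ten steps are too few to see the bus probabilities at work. As a rough check, the sketch below samples the Casino 1 **Take a bus** outcome many times and reports how often the agent actually arrives at Casino 2. It uses the standard library's `random.choices` as a stand-in for `np.random.choice`, and `take_bus_from_casino_1` is a hypothetical helper, not part of the environment.

```python
import random


def take_bus_from_casino_1(rng):
    """Sample a (reward, next_state) pair for 'Take a bus' in Casino 1."""
    outcomes = [(-2, 1), (-0.5, 0)]  # arrive at Casino 2, or stay put
    return rng.choices(outcomes, weights=[0.7, 0.3])[0]


rng = random.Random(0)  # fixed seed so the run is repeatable
n = 10_000
arrivals = sum(1 for _ in range(n) if take_bus_from_casino_1(rng)[1] == 1)
print("Fraction of rides that reached Casino 2:", arrivals / n)
```

The printed fraction should land close to 0.7, the probability used in `step`.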


## Suggested Exercises

There are many changes we can make to our environment, and here are some simple ones that could be fun to attempt as an exercise:

- Make the Arms return random rewards based on a normal distribution, just like the Multi-Armed Bandits **(easy)**
- Add some more states **(medium)**
- Make the number of states and actions variable, so that the developer can define them when initializing the environment. Make use of the `__init__` function for this. No state should be isolated from the others **(hard)**