# Learning Markov Decision Process (MDP) Algorithms with the MDPToolBox Python Package

The MDPToolBox package can be installed with pip:

In [2]:
!pip install pymdptoolbox



In [3]:
import mdptoolbox.example
import mdptoolbox.mdp
import numpy as np

## Forest Management Example
* Trees can be either young, middle-aged, or old (states = 0, 1, 2)
* Each year that you wait, the trees get one stage older (S+1); old trees stay old.
* Each year that you wait, there is a 10% chance that the whole forest burns down!
* If the forest burns down, you get nothing, and the forest starts over as young (state 0).
* If you cut down the trees, you get 0 points for a young one, 1 point for a middle-aged one, and 2 points for an old one, and the forest starts over as young.
* If the forest reaches its oldest state, and you do not cut, you will receive 4 points!

What is the best strategy, given these facts?

In [5]:
# inputs
# S  : the number of states
# r1 : the reward for waiting (action 0) when the forest is in its oldest state
# r2 : the reward for cutting (action 1) when the forest is in its oldest state
# p  : the probability of a wildfire in any given year
#
# outputs
# P : the transition probability matrix, shape (A, S, S)
# R : the reward matrix, shape (S, A)

P, R = mdptoolbox.example.forest(S=3, r1=4, r2=2, p=0.1)

Exploring the probability transition matrix

In [6]:
P

array([[[0.1, 0.9, 0. ],
        [0.1, 0. , 0.9],
        [0.1, 0. , 0.9]],

       [[1. , 0. , 0. ],
        [1. , 0. , 0. ],
        [1. , 0. , 0. ]]])
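The printed array can be reproduced by hand from the forest rules, which makes it easier to read. The following is a minimal sketch in plain Python, not the toolbox's own implementation. `forest_transitions` is a hypothetical helper name, and the rules it encodes are read off the output above: waiting ages the forest one stage with probability 1 - p (the oldest stage stays oldest), a fire resets it to state 0, and cutting always resets it to state 0.

```python
def forest_transitions(S=3, p=0.1):
    """Hypothetical helper: build P[action][state][next_state] as nested lists."""
    wait = [[0.0] * S for _ in range(S)]
    cut = [[0.0] * S for _ in range(S)]
    for s in range(S):
        nxt = min(s + 1, S - 1)   # the oldest stage cannot age further
        wait[s][0] += p           # wildfire: back to the youngest stage
        wait[s][nxt] += 1.0 - p   # no fire: one stage older
        cut[s][0] = 1.0           # cutting always restarts the forest
    return [wait, cut]

P_by_hand = forest_transitions()
for row in P_by_hand[0]:
    print(row)  # rows of the wait matrix
```

Each row is a probability distribution over next states, so every row should sum to 1.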

In [7]:
P[0] # P[0] is the transition matrix for action 0 (keep/wait)

array([[0.1, 0.9, 0. ],
       [0.1, 0. , 0.9],
       [0.1, 0. , 0.9]])

In [8]:
# Example: what is the probability that a forest in its youngest state
# advances to the middle-aged state if we wait?
print(P[0][0][1]) # 0.9

0.9


In [9]:
# Example: what is the probability that a forest in its oldest state
# burns down if we wait?
P[0][2][0] # 0.1

0.1
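Single entries of P answer one-year questions. For a longer horizon under a fixed action, the steps can be chained: the two-year probabilities are the matrix product of P[0] with itself. A small sketch in plain Python, with the wait matrix copied from the output above and a hypothetical helper `matmul`:

```python
# Transition matrix for action 0 (wait), copied from the output above.
P_wait = [
    [0.1, 0.9, 0.0],
    [0.1, 0.0, 0.9],
    [0.1, 0.0, 0.9],
]

def matmul(A, B):
    """Multiply two square matrices given as nested lists."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

two_years = matmul(P_wait, P_wait)
# Probability that a young forest is old after waiting two years: 0.9 * 0.9.
print(round(two_years[0][2], 2))  # 0.81
```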

Exploring the reward matrix, which has shape (S, A): one row per state and one column per action.

In [10]:
R

array([[0., 0.],
       [0., 1.],
       [4., 2.]])
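The reward matrix follows the same rules. As a sketch only, and not the toolbox's code, the hypothetical helper `forest_rewards` below rebuilds it in plain Python. It assumes the pattern visible in the output: waiting pays r1 only in the oldest state, and cutting pays 0 for a young forest, 1 for an intermediate one, and r2 for the oldest.

```python
def forest_rewards(S=3, r1=4, r2=2):
    """Hypothetical helper: build R[state][action] as nested lists."""
    R = [[0.0, 0.0] for _ in range(S)]
    for s in range(1, S):
        R[s][1] = 1.0          # cutting a non-young forest pays 1 point
    R[S - 1][0] = float(r1)    # waiting pays only in the oldest state
    R[S - 1][1] = float(r2)    # cutting the oldest forest pays r2 instead
    return R

print(forest_rewards())  # [[0.0, 0.0], [0.0, 1.0], [4.0, 2.0]]
```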

In [13]:
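# Total reward for taking one action across a collection of forests,
# where states[i] is the number of forests currently in state i.
def compute_reward(action, states):
    return np.sum(np.multiply(R.T[action], states))

# Cutting (action 1) 10 young, 5 middle-aged, and 3 old forests: 10*0 + 5*1 + 3*2 = 11.
# The single-forest reward for waiting in the oldest state is R[2, 0], which is 4.
compute_reward(1, [10, 5, 3])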

11.0

## Finding the optimal "policy"

In [14]:
# A discount factor of 0.99 says that it is very likely
# that the scenario will continue into the future (long-term strategy)
model = mdptoolbox.mdp.PolicyIteration(P, R, 0.99)

In [15]:
model.run()

In [16]:
model.policy

(0, 0, 0)
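The policy tuple gives one action per state, so (0, 0, 0) means "wait" in every state. To see where such a result can come from, here is a minimal sketch of value iteration in plain Python. Note that this is a different algorithm from the PolicyIteration class used above; it is swapped in because it is shorter to write out, and for this small problem it arrives at the same policy. The arrays are copied from the outputs above, and `q_value` and `value_iteration` are hypothetical helper names.

```python
# P[action][state][next_state] and R[state][action], copied from the outputs above.
P = [
    [[0.1, 0.9, 0.0], [0.1, 0.0, 0.9], [0.1, 0.0, 0.9]],  # action 0: wait
    [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [1.0, 0.0, 0.0]],  # action 1: cut
]
R = [[0.0, 0.0], [0.0, 1.0], [4.0, 2.0]]

def q_value(V, s, a, discount):
    """Immediate reward plus the discounted expected value of the next state."""
    return R[s][a] + discount * sum(P[a][s][t] * V[t] for t in range(len(V)))

def value_iteration(discount, tol=1e-10, max_sweeps=100_000):
    """Hypothetical helper: return the greedy policy as a tuple of actions."""
    n_states, n_actions = len(R), len(R[0])
    V = [0.0] * n_states
    for _ in range(max_sweeps):
        new_V = [max(q_value(V, s, a, discount) for a in range(n_actions))
                 for s in range(n_states)]
        converged = max(abs(x - y) for x, y in zip(new_V, V)) < tol
        V = new_V
        if converged:
            break
    # In each state, pick the action with the highest value.
    return tuple(max(range(n_actions), key=lambda a: q_value(V, s, a, discount))
                 for s in range(n_states))

print(value_iteration(0.99))  # (0, 0, 0)
```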

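## Applying a lower discount factor

The discount factor is the third argument to `PolicyIteration`: a number between 0 and 1 by which a reward is multiplied for every year it lies in the future. A value near 1 makes future rewards count almost as much as immediate ones; a value near 0 makes the model care almost only about the next reward.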
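To make the discount factor concrete, the sketch below weights a reward received t years from now by the discount factor raised to the power t and sums the result. The function name `discounted_return` and the reward sequence are hypothetical, chosen only for illustration.

```python
def discounted_return(rewards, discount):
    """Sum of rewards where the reward t years ahead is weighted by discount**t."""
    return sum(r * discount ** t for t, r in enumerate(rewards))

# Hypothetical reward sequence: nothing for two years, then 4 points for an old forest.
rewards = [0, 0, 4]
print(round(discounted_return(rewards, 0.99), 4))  # 3.9204
print(round(discounted_return(rewards, 0.01), 4))  # 0.0004
```

With a factor of 0.99 the delayed 4 points keep almost all of their value; with 0.01 they are worth next to nothing today.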

In [17]:
# A discount factor of 0.01 says that it is very unlikely
# that the scenario will continue into the future (short-term strategy)
model = mdptoolbox.mdp.PolicyIteration(P, R, 0.01)

In [18]:
model.run()

In [19]:
model.policy

(0, 1, 0)
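The only change is in the middle-aged state, where the policy now cuts for an immediate 1 point instead of waiting for the 4 points an old forest pays. One way to check this is to evaluate both policies directly and compare the value of state 1 under each discount factor. The sketch below does that by repeatedly applying the Bellman equation for a fixed policy; `evaluate_policy` is a hypothetical helper, the arrays are copied from the outputs above, and the fixed number of sweeps is an assumption that is generous for these discount factors.

```python
# P[action][state][next_state] and R[state][action], copied from the outputs above.
P = [
    [[0.1, 0.9, 0.0], [0.1, 0.0, 0.9], [0.1, 0.0, 0.9]],  # action 0: wait
    [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [1.0, 0.0, 0.0]],  # action 1: cut
]
R = [[0.0, 0.0], [0.0, 1.0], [4.0, 2.0]]

def evaluate_policy(policy, discount, sweeps=5000):
    """Hypothetical helper: iterate V(s) = R[s][a] + discount * sum_t P[a][s][t] * V(t)."""
    V = [0.0] * len(R)
    for _ in range(sweeps):
        V = [R[s][a] + discount * sum(P[a][s][t] * V[t] for t in range(len(V)))
             for s, a in enumerate(policy)]
    return V

for discount in (0.99, 0.01):
    always_wait = evaluate_policy((0, 0, 0), discount)
    cut_middle = evaluate_policy((0, 1, 0), discount)
    # Value of a middle-aged forest (state 1) under each policy.
    print(discount, round(always_wait[1], 2), round(cut_middle[1], 2))
```

At 0.99 the always-wait policy gives the middle-aged state the higher value; at 0.01 the cut-in-the-middle policy does, which matches the two policies reported above.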