# APPLICATIONS OF MARKOV DECISION PROCESSES
---
In this notebook we will take a look at some indicative applications of Markov decision processes (MDPs).
We will cover content from [`mdp.py`](https://github.com/aimacode/aima-python/blob/master/mdp.py), for chapter 17 of Stuart Russell and Peter Norvig's book [*Artificial Intelligence: A Modern Approach*](http://aima.cs.berkeley.edu/).


## CONTENTS
- Simple MDP


## SIMPLE MDP

Markov Decision Processes (MDPs) are processes that satisfy the Markov property: "the future is independent of the past given the present".
MDPs formally describe environments for reinforcement learning, and we assume that the environment is *fully observable*.
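
The Markov property can also be written symbolically. The notation below is an assumption made for illustration ($S_t$ for the state at time $t$ and $A_t$ for the action taken at time $t$); it says that the next state depends only on the current state and action, not on the earlier history:

$$P(S_{t+1} \mid S_t, A_t) = P(S_{t+1} \mid S_1, A_1, \ldots, S_t, A_t)$$
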
Let us take a toy example MDP and solve it using the functions in `mdp.py`.
This is a simple example adapted from a [similar problem](http://www0.cs.ucl.ac.uk/staff/D.Silver/web/Teaching_files/MDP.pdf) by Dr. David Silver, tweaked to fit the limitations of the current functions.
![title](images/mdp-b.png)

Let's say you're a student attending lectures in a university.
There are three lectures you need to attend on a given day.
<br>
Attending the first lecture gives you 4 points of reward.
After the first lecture, you continue into the second one with probability 0.6, yielding 6 more points of reward.
But with probability 0.4, you get distracted and start using Facebook instead, for a reward of -1.
From then onwards, you really can't let go of Facebook, and there's just a 0.1 probability that you will turn your attention back to the lecture.
<br>
After the second lecture, you have an equal chance of attending the next lecture or just falling asleep.
Falling asleep is the terminal state and yields you no reward, but continuing on to the final lecture gives you a big reward of 10 points.
<br>
From there on, you have a 40% chance of going to study and reaching the terminal state,
but a 60% chance of going to the pub with your friends instead.
You end up drunk and don't know which lecture to attend, so you go to one of the lectures according to the probabilities shown in the diagram above.
<br> 
We now have an outline of our stochastic environment and we need to maximize our reward by solving this MDP.
<br>
<br>
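
As a rough guide to what "solving" means here, one standard way to state the goal is through the Bellman equation with a reward attached to each state. The symbols are assumptions for illustration: $U(s)$ is the utility of state $s$, $R(s)$ its reward, $\gamma$ the discount factor, $A(s)$ the actions available in $s$, and $P(s' \mid s, a)$ the transition probability:

$$U(s) = R(s) + \gamma \max_{a \in A(s)} \sum_{s'} P(s' \mid s, a)\, U(s')$$

For a terminal state with no successors, the sum is empty and $U(s) = R(s)$.
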
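We first have to define our transition matrix, written here as a nested dictionary that maps each state and action to a probability distribution over successor states: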

In [23]:
from mdp import *
from notebook import psource, pseudocode

'''t = {
    'leisure': {
                    'facebook': {'leisure':0.9, 'class1':0.1},
                    'quit': {'leisure':0.1, 'class1':0.9},
#                     'study': {},
#                     'sleep': {},
#                     'pub': {}
               },
    'class1': {
                    'study': {'class2':0.6, 'leisure':0.4},
                    'facebook': {'class2':0.4, 'leisure':0.6}
              },
    'class2': {
                    'study': {'class3':0.5, 'end':0.5},
                    'sleep': {'end':0.5, 'class3':0.5}
              },
    'class3': {
                    'study': {'end':0.6, 'class1':0.08, 'class2':0.16, 'class3':0.16},
                    'pub': {'end':0.4, 'class1':0.12, 'class2':0.24, 'class3':0.24}
              },
    'end': {}
}'''

init = 'class1'

terminals = ['end']

rewards = {
    'class1': 4,
    'class2': 6,
    'class3': 10,
    'leisure': -1,
    'end': 0
}

t = {
    'leisure': {
                    'facebook': {'leisure':1},
                    'quit': {'leisure':0.1, 'class1':0.9}
               },
    'class1': {
                    'study': {'class2':1},
                    'facebook': {'leisure':1}
              },
    'class2': {
                    'study': {'class3':1},
                    'sleep': {'end':1}
              },
    'class3': {
                    'study': {'end':1},
                    'pub': {'class1':0.2, 'class2':0.4, 'class3':0.4}
              },
    'end': {}
}
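
Each inner dictionary of a transition table like this one is meant to be a probability distribution over successor states, so its values should sum to 1. A small helper along the following lines can flag entries that do not. This is only a sketch: the name `find_unnormalized` and the `example` table are illustrative and are not part of `mdp.py`.

```python
import math

def find_unnormalized(t, tol=1e-9):
    """Return (state, action, total) for every action whose outgoing
    probabilities do not sum to 1 within `tol`."""
    problems = []
    for state, actions in t.items():
        for action, outcomes in actions.items():
            total = sum(outcomes.values())
            if not math.isclose(total, 1.0, abs_tol=tol):
                problems.append((state, action, total))
    return problems

# Hypothetical table: ('b', 'go') only accounts for 0.9 of the probability mass.
example = {
    'a': {'go': {'b': 0.5, 'a': 0.5}},
    'b': {'go': {'a': 0.9}},
    'end': {},
}
print(find_unnormalized(example))  # [('b', 'go', 0.9)]
```

An empty list means every listed action is a proper distribution; states with no actions, such as `'end'`, are skipped.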


In [18]:
class CustomMDP(MDP):

    def __init__(self, transition_matrix, rewards, terminals, init, gamma=.9):
        # All possible actions.
        actlist = []
        for state in transition_matrix.keys():
            actlist.extend(transition_matrix[state])
        actlist = list(set(actlist))
        print(actlist)

        MDP.__init__(self, init, actlist, terminals=terminals, gamma=gamma)
        self.t = transition_matrix
        self.reward = rewards
        for state in self.t:
            self.states.add(state)

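    def T(self, state, action):
        if action is None:
            return [(0.0, state)]
        else:
            return [(prob, new_state) for new_state, prob in self.t[state][action].items()]

    def actions(self, state):
        # Restrict each state to the actions defined for it in the transition
        # matrix. The inherited method offers every action in every
        # non-terminal state, and T would then fail with a KeyError on
        # pairs such as ('leisure', 'study').
        if state in self.terminals:
            return [None]
        return list(self.t[state])

In [19]:
our_mdp = CustomMDP(t, rewards, terminals, init, gamma=.9)

['quit', 'study', 'sleep', 'pub', 'facebook']


In [20]:
value_iteration(our_mdp)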
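
Not every action is defined in every state of this table (for example, `study` has no entry under `leisure`), so the maximization in value iteration has to range over each state's own actions. The following self-contained sketch illustrates that computation on a nested-dict table. It is not the `mdp.py` implementation: the function name `value_iteration_sketch`, the simple stopping threshold, and the three-state `fragment` (taken from the table above with the `pub` action left out so the arithmetic can be checked by hand) are all assumptions made for illustration.

```python
def value_iteration_sketch(t, rewards, gamma=0.9, epsilon=1e-6):
    """Return a dict of state utilities for the transition table `t`.

    t[state][action] maps successor states to probabilities; a state with
    no actions (an empty dict) is treated as terminal.
    """
    U = {s: 0.0 for s in t}
    while True:
        delta = 0.0
        U_next = {}
        for s in t:
            if t[s]:
                # Consider only the actions listed for this state.
                best = max(sum(p * U[s1] for s1, p in outcomes.items())
                           for outcomes in t[s].values())
            else:
                best = 0.0  # terminal state: nothing follows
            U_next[s] = rewards[s] + gamma * best
            delta = max(delta, abs(U_next[s] - U[s]))
        U = U_next
        # Simplified stopping rule: stop once no utility changed by epsilon.
        if delta < epsilon:
            return U

# Illustrative fragment of the lecture example (no 'pub' action).
fragment = {
    'class2': {'study': {'class3': 1}, 'sleep': {'end': 1}},
    'class3': {'study': {'end': 1}},
    'end': {},
}
fragment_rewards = {'class2': 6, 'class3': 10, 'end': 0}

utilities = value_iteration_sketch(fragment, fragment_rewards)
print(utilities)  # {'class2': 15.0, 'class3': 10.0, 'end': 0.0}
```

In this fragment, `class3` is worth its own reward of 10, and `class2` is worth 6 plus the discounted value of continuing to `class3` (0.9 × 10), which beats sleeping.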