# Frozen Lake: Naive Learning

In reinforcement learning, an agent acts in an environment and learns by trial and error to maximize its cumulative reward. The environment can be formalized as a Markov Decision Process (MDP), from which the optimal policy can be derived by applying dynamic programming algorithms. If the MDP is not known, however, the agent has to discover it by interacting with the environment. This notebook discusses a naive learning strategy in which the agent runs many episodes and collects samples to build an empirical MDP. This empirical MDP is then used to approximate the optimal policy with the dynamic programming methods.

## Import module ReinforcementLearning 

First the "ReinforcementLearning" module is imported, which also imports packages "numpy" as "np" and "matplotlib.pyplot" as "plt". Matplotlib is set to the interactive "notebook" mode:

In [1]:
from ReinforcementLearning import *
%matplotlib notebook

This module contains a class "NaiveStrategy" that implements the naive learning strategy discussed in the introduction. It uses a class "EmpiricalMDP" which is a subclass of "MarkovDecisionProcess". This "EmpiricalMDP" stores the samples taken by the agent, using them to approximate the state transition probability matrix "Psas" and the reward matrix "Rsas". Class "Agent" is used to create an agent, and the "FrozenLake" class to define a deterministic and a stochastic Frozen Lake environment to test the naive learning strategy.
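How an empirical MDP can be built from samples is sketched here with plain NumPy. This is not the code of "EmpiricalMDP"; the function name `estimate_mdp` and the sample format `(s, a, r, s_next)` are assumptions made for illustration. Transition probabilities are estimated as relative frequencies, rewards as the mean observed reward per transition:

```python
import numpy as np

def estimate_mdp(samples, n_states, n_actions):
    """Estimate Psas and Rsas from (s, a, r, s_next) samples (hypothetical helper)."""
    counts = np.zeros((n_states, n_actions, n_states))
    reward_sums = np.zeros((n_states, n_actions, n_states))
    for s, a, r, s_next in samples:
        counts[s, a, s_next] += 1
        reward_sums[s, a, s_next] += r
    # N(s, a): how often action a was taken in state s
    totals = counts.sum(axis=2, keepdims=True)
    # Relative frequencies; unvisited (s, a) pairs keep an all-zero row
    Psas = np.divide(counts, totals, out=np.zeros_like(counts), where=totals > 0)
    # Mean observed reward for each observed (s, a, s_next) transition
    Rsas = np.divide(reward_sums, counts, out=np.zeros_like(counts), where=counts > 0)
    return Psas, Rsas

# Toy data: three states, two actions
samples = [(0, 1, 0.0, 1), (0, 1, 0.0, 1), (0, 1, 1.0, 2), (1, 0, 1.0, 2)]
Psas, Rsas = estimate_mdp(samples, n_states=3, n_actions=2)
print(Psas[0, 1])  # estimated next-state distribution after action 1 in state 0
```

State-action pairs that were never sampled keep zero probabilities in this sketch, which is why the agent has to explore the whole environment before the estimate is usable.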

## Deterministic Frozen Lake Environment 

First the naive strategy is tested in a deterministic "FrozenLake" environment:

In [2]:
env = FrozenLake.make(is_slippery=False)

The agent needs a policy to run many episodes and to take samples. A uniform random policy seems appropriate here as the agent has to visit all states:

In [3]:
policy = UniformRandomPolicy(env)
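What a uniform random policy does can be sketched independently of the module. The names `policy_probs` and `sample_action` are hypothetical and are not part of "UniformRandomPolicy"; the 16 states and 4 actions correspond to the 4x4 Frozen Lake grid:

```python
import numpy as np

n_states, n_actions = 16, 4  # 4x4 grid, four move directions

# Every action gets the same probability in every state.
policy_probs = np.full((n_states, n_actions), 1.0 / n_actions)

def sample_action(state, rng):
    """Draw an action index for `state` according to the policy's probabilities."""
    return int(rng.choice(n_actions, p=policy_probs[state]))

rng = np.random.default_rng(seed=0)
actions = [sample_action(0, rng) for _ in range(1000)]
# Observed action frequencies in state 0; they should be roughly uniform
print(np.bincount(actions, minlength=n_actions) / len(actions))
```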

Now that the policy is defined, a "NaiveStrategy" object can be constructed:

In [4]:
strategy = NaiveStrategy(num_of_episodes=1000, policy=policy)

The first argument is the number of episodes the agent has to run. In this case, 1000 episodes will be sufficient to approximate the MDP. Finally, an "Agent" object is constructed that needs the "FrozenLake" object and the "NaiveStrategy" object:

In [5]:
agent = Agent(env, strategy)  # env is stored in attribute agent.env, strategy in attribute agent.strategy

Class "Agent" has a "learn" method to execute the given learning strategy: 

In [6]:
agent.learn()

The "NaiveStrategy" object has an attribute "mdp", which is an "EmpiricalMDP" object that holds the approximated "Psas" and "Rsas" matrices. To verify these matrices, a "GymMDP" object is constructed which defines the exact MDP:

In [7]:
mdp = GymMDP(env)

Because the probabilities are approximated using state and action frequencies, both MDPs must be identical in the deterministic case, provided the agent has visited all states:

In [8]:
print(np.all(strategy.mdp.Psas == mdp.Psas))  # must be True
print(np.all(strategy.mdp.Rsas == mdp.Rsas))  # must be True

True
True


Now the optimal policy and optimal value functions can be calculated using the "policy_iteration" or the "value_iteration" method of the "EmpiricalMDP" object, which it inherited from the "MarkovDecisionProcess" superclass:

In [9]:
policy, Vs, Qsa = strategy.mdp.value_iteration()
print(np.reshape(Vs, (4, 4), order="C"))
print(Qsa)
env.plot(policy=policy, values=Vs)

[[1. 1. 1. 1.]
 [1. 0. 1. 0.]
 [1. 1. 1. 0.]
 [0. 1. 1. 0.]]
[[1. 1. 1. 1.]
 [1. 0. 1. 1.]
 [1. 1. 1. 1.]
 [1. 0. 1. 1.]
 [1. 1. 0. 1.]
 [0. 0. 0. 0.]
 [0. 1. 0. 1.]
 [0. 0. 0. 0.]
 [1. 0. 1. 1.]
 [1. 1. 1. 0.]
 [1. 1. 0. 1.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 1. 1. 1.]
 [1. 1. 1. 1.]
 [0. 0. 0. 0.]]
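For reference, value iteration on arrays shaped like "Psas" and "Rsas" can be sketched as follows. This is a minimal illustration, not the module's "value_iteration" method; the discount factor `gamma`, the tolerance `tol`, and the representation of the policy as an array of greedy action indices are assumptions:

```python
import numpy as np

def value_iteration(Psas, Rsas, gamma=1.0, tol=1e-10, max_iter=10_000):
    """Return (policy, Vs, Qsa) for transition array Psas and reward array Rsas."""
    Vs = np.zeros(Psas.shape[0])
    for _ in range(max_iter):
        # Q(s, a) = sum over s' of P(s'|s, a) * (R(s, a, s') + gamma * V(s'))
        Qsa = np.sum(Psas * (Rsas + gamma * Vs), axis=2)
        Vs_new = Qsa.max(axis=1)
        converged = np.max(np.abs(Vs_new - Vs)) < tol
        Vs = Vs_new
        if converged:
            break
    Qsa = np.sum(Psas * (Rsas + gamma * Vs), axis=2)
    policy = Qsa.argmax(axis=1)  # greedy action per state
    return policy, Vs, Qsa

# Toy chain: states 0 and 1 are non-terminal, state 2 is terminal (all-zero rows).
# Action 0 stays in place, action 1 moves right; entering state 2 pays reward 1.
Psas = np.zeros((3, 2, 3))
Rsas = np.zeros((3, 2, 3))
Psas[0, 0, 0] = Psas[0, 1, 1] = Psas[1, 0, 1] = Psas[1, 1, 2] = 1.0
Rsas[1, 1, 2] = 1.0

policy, Vs, Qsa = value_iteration(Psas, Rsas, gamma=0.9)
print(Vs)
print(policy)
```

The values of 1 in the output above suggest that the notebook uses no discounting; the toy example uses `gamma=0.9` only to make the two non-terminal states distinguishable.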



## Stochastic Frozen Lake Environment 

The naive learning strategy can also be applied in a stochastic environment:

In [10]:
env = FrozenLake.make()

The steps are the same as in the deterministic case. A "NaiveStrategy" object is created first, this time with 10000 episodes and a uniform random policy. The environment and strategy objects are then used to create an "Agent" object:

In [11]:
strategy = NaiveStrategy(num_of_episodes=10000, 
                         policy=UniformRandomPolicy(env))
agent = Agent(env, strategy)

The agent executes the naive learning strategy by collecting samples while running the 10000 episodes. This is done by calling the "Agent" object's method "learn":

In [12]:
agent.learn()

To verify the agent's empirical MDP, class "GymMDP" is used to get the exact MDP from the "FrozenLake" environment:

In [13]:
mdp = GymMDP(env)

In the stochastic case, the empirical MDP is not the same as the exact MDP. Verifying the state transition probabilities of taking action 2 in state 9 gives, for instance:

In [14]:
state = 9
action = 2

print("Exact MDP:")
print(mdp.Psas[state, action, :])

print("\nEmpirical MDP:")
print(strategy.mdp.Psas[state, action, :])

Exact MDP:
[0.         0.         0.         0.         0.         0.33333333
 0.         0.         0.         0.         0.33333333 0.
 0.         0.33333333 0.         0.        ]

Empirical MDP:
[0.         0.         0.         0.         0.         0.34936709
 0.         0.         0.         0.         0.32658228 0.
 0.         0.32405063 0.         0.        ]


Although the empirical probabilities are not the same as the exact probabilities, the former seem to be a good approximation of the latter. To check if this is true for all states and actions, the optimal policy and optimal value functions from both exact and empirical MDP are calculated and compared.
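The size of the gap for state 9 and action 2 can be quantified directly from the printed numbers. The sketch below also shows the standard error of a relative frequency, sqrt(p(1 - p)/n), which indicates how the estimate is expected to tighten as the number of samples n grows. The sample counts in the loop are hypothetical; the notebook does not report how often this state-action pair was visited:

```python
import numpy as np

# Nonzero entries for state 9, action 2, copied from the output above
exact = np.array([1 / 3, 1 / 3, 1 / 3])
empirical = np.array([0.34936709, 0.32658228, 0.32405063])

max_error = np.max(np.abs(empirical - exact))
print(f"largest deviation: {max_error:.4f}")

def standard_error(p, n):
    """Standard error of a relative frequency estimating probability p from n samples."""
    return np.sqrt(p * (1.0 - p) / n)

for n in (100, 1000, 10000):  # hypothetical visit counts
    print(n, standard_error(1 / 3, n))
```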

First the optimal policy and optimal value functions for the exact MDP are calculated:

In [15]:
policy, Vs, Qsa = mdp.value_iteration()
print(np.reshape(Vs, (4, 4), order="C"))
print(Qsa)
env.plot(policy=policy, values=Vs)

[[0.82352941 0.82352941 0.82352941 0.82352941]
 [0.82352941 0.         0.52941176 0.        ]
 [0.82352941 0.82352941 0.76470588 0.        ]
 [0.         0.88235294 0.94117647 0.        ]]
[[0.82352941 0.82352941 0.82352941 0.82352941]
 [0.54901961 0.54901961 0.54901961 0.82352941]
 [0.7254902  0.7254902  0.7254902  0.82352941]
 [0.54901961 0.54901961 0.54901961 0.82352941]
 [0.82352941 0.54901961 0.54901961 0.54901961]
 [0.         0.         0.         0.        ]
 [0.52941176 0.25490196 0.52941176 0.2745098 ]
 [0.         0.         0.         0.        ]
 [0.54901961 0.54901961 0.54901961 0.82352941]
 [0.56862745 0.82352941 0.54901961 0.52941176]
 [0.76470588 0.58823529 0.49019608 0.45098039]
 [0.         0.         0.         0.        ]
 [0.         0.         0.         0.        ]
 [0.56862745 0.60784314 0.88235294 0.58823529]
 [0.8627451  0.94117647 0.90196078 0.88235294]
 [0.         0.         0.         0.        ]]



Next, the optimal policy and value functions for the empirical MDP are calculated using policy iteration:

In [16]:
policy, Vs, Qsa = strategy.mdp.policy_iteration(inner=500, outer=100)
print(np.reshape(Vs, (4, 4), order="C"))
print(Qsa)
env.plot(policy=policy, values=Vs)

[[0.80214142 0.80213684 0.80213386 0.80213219]
 [0.80214248 0.         0.538156   0.        ]
 [0.80214439 0.80214733 0.74868902 0.        ]
 [0.         0.86794655 0.92920833 0.        ]]
[[0.80214176 0.80214023 0.80214024 0.80213987]
 [0.52193272 0.52660362 0.52842586 0.80213728]
 [0.71141516 0.70904154 0.7126203  0.80213437]
 [0.54086684 0.55405068 0.52750776 0.80213274]
 [0.8021428  0.53414378 0.52237924 0.53194956]
 [0.         0.         0.         0.        ]
 [0.53815625 0.2563749  0.50514039 0.280567  ]
 [0.         0.         0.         0.        ]
 [0.53720916 0.52848956 0.53214432 0.80214467]
 [0.53372955 0.80214755 0.52576719 0.4852277 ]
 [0.74868921 0.57746083 0.49209324 0.43402234]
 [0.         0.         0.         0.        ]
 [0.         0.         0.         0.        ]
 [0.57993068 0.58158906 0.8679467  0.56434106]
 [0.84242238 0.92920842 0.89640225 0.88241545]
 [0.         0.         0.         0.        ]]



Note that the optional argument "inner" of the "policy_iteration" method is set to 500. The iterative solver is preferred in this case, because the matrix system is sometimes close to singular, and the direct solver then gives inaccurate results.
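The difference between the two solvers can be illustrated with a small policy evaluation problem. This is a sketch under assumptions, not the module's implementation: `P_pi` and `r_pi` denote the transition matrix and expected immediate reward under a fixed policy, and the `sweeps` parameter is assumed to play the role that "inner" plays in "policy_iteration":

```python
import numpy as np

def evaluate_direct(P_pi, r_pi, gamma=1.0):
    """Solve (I - gamma * P_pi) v = r_pi with a direct linear solver."""
    n = P_pi.shape[0]
    return np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)

def evaluate_iterative(P_pi, r_pi, gamma=1.0, sweeps=500):
    """Apply the Bellman expectation backup v <- r_pi + gamma * P_pi v repeatedly."""
    v = np.zeros(P_pi.shape[0])
    for _ in range(sweeps):
        v = r_pi + gamma * P_pi @ v
    return v

# State 0 stays or moves to state 1 with equal probability; state 1 moves to the
# terminal state 2 (all-zero row) and collects a reward of 1 on the way.
P_pi = np.array([[0.5, 0.5, 0.0],
                 [0.0, 0.0, 1.0],
                 [0.0, 0.0, 0.0]])
r_pi = np.array([0.0, 1.0, 0.0])

v_direct = evaluate_direct(P_pi, r_pi)
v_iter = evaluate_iterative(P_pi, r_pi)
print(v_direct, v_iter)

# If a state almost never leaves itself, I - P_pi is nearly singular without discounting.
eps = 1e-12
P_sticky = np.array([[1.0 - eps, eps, 0.0],
                     [0.0, 0.0, 1.0],
                     [0.0, 0.0, 0.0]])
cond = np.linalg.cond(np.eye(3) - P_sticky)
print(f"condition number: {cond:.1e}")
```

A large condition number means that small errors in the estimated probabilities are amplified by the direct solve, whereas the iterative backup only multiplies and adds.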

The optimal policy and value functions for the empirical MDP can also be calculated using value iteration:

In [17]:
policy, Vs, Qsa = strategy.mdp.value_iteration()
print(np.reshape(Vs, (4, 4), order="C"))
print(Qsa)
env.plot(policy=policy, values=Vs)

[[0.80215729 0.80215729 0.80215729 0.80215729]
 [0.80215729 0.         0.53816726 0.        ]
 [0.80215729 0.80215729 0.74869779 0.        ]
 [0.         0.86795361 0.92921224 0.        ]]
[[0.80215729 0.80215729 0.80215729 0.80215729]
 [0.52194459 0.52661651 0.5284403  0.80215729]
 [0.71143345 0.70906025 0.71264015 0.80215729]
 [0.5408832  0.55406741 0.52752427 0.80215729]
 [0.80215729 0.53415299 0.5223886  0.53195972]
 [0.         0.         0.         0.        ]
 [0.53816726 0.2563779  0.50515111 0.2805752 ]
 [0.         0.         0.         0.        ]
 [0.53721849 0.52849713 0.53215243 0.80215729]
 [0.53373581 0.80215729 0.52577234 0.48523443]
 [0.74869779 0.57746525 0.49209867 0.43402917]
 [0.         0.         0.         0.        ]
 [0.         0.         0.         0.        ]
 [0.57993656 0.58159255 0.86795361 0.56434564]
 [0.84242917 0.92921224 0.89640631 0.88242041]
 [0.         0.         0.         0.        ]]



The optimal policy derived from the empirical MDP is not the same as in the exact MDP case, but it is very close, because the agent ran a large number of episodes. The value functions derived from the empirical MDP are accurate approximations as well.
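The closeness of the value functions can be checked numerically with the state values printed above for the exact MDP and for the empirical MDP (value iteration). The helper `greedy_agreement` is a hypothetical addition that shows one way to compare two policies through their Q-tables; it is demonstrated on toy arrays only:

```python
import numpy as np

# State values copied from the outputs above (row-major 4x4 grid)
Vs_exact = np.array([
    0.82352941, 0.82352941, 0.82352941, 0.82352941,
    0.82352941, 0.0,        0.52941176, 0.0,
    0.82352941, 0.82352941, 0.76470588, 0.0,
    0.0,        0.88235294, 0.94117647, 0.0,
])
Vs_empirical = np.array([
    0.80215729, 0.80215729, 0.80215729, 0.80215729,
    0.80215729, 0.0,        0.53816726, 0.0,
    0.80215729, 0.80215729, 0.74869779, 0.0,
    0.0,        0.86795361, 0.92921224, 0.0,
])
max_diff = np.max(np.abs(Vs_exact - Vs_empirical))
print(f"largest value difference: {max_diff:.4f}")

def greedy_agreement(Q_ref, Q_other, tol=1e-6):
    """Fraction of states where the greedy action of Q_other is optimal under Q_ref."""
    best_other = Q_other.argmax(axis=1)
    value_of_other = Q_ref[np.arange(Q_ref.shape[0]), best_other]
    return float(np.mean(value_of_other >= Q_ref.max(axis=1) - tol))

# Toy Q-tables: the greedy actions agree in state 0 but not in state 1
Q_ref = np.array([[1.0, 0.0], [0.0, 1.0]])
Q_other = np.array([[0.9, 0.1], [0.6, 0.4]])
agreement = greedy_agreement(Q_ref, Q_other)
print(agreement)
```

Comparing through the reference Q-table rather than comparing action indices directly avoids counting ties as disagreements, which matters here because several states in the exact MDP have actions with equal value.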