# Introduction

Once again we're off to the casino, and this time it's situated in sunny Monte Carlo, made famous by its appearance in the classic movie _[Madagascar 3: Europe's Most Wanted](https://en.wikipedia.org/wiki/Madagascar_3:_Europe%27s_Most_Wanted)_ (although there's a slight chance that it was already famous).

In our last visit to a casino we looked at the _[multi-armed bandit](https://medium.com/towards-data-science/multi-armed-bandits-part-1-b8d33ab80697)_ and used this as a way to visualise the problem of how to choose the best action when confronted with many possible actions.

In terms of **Reinforcement Learning** the bandit problem can be thought of as representing a single state and the actions available within that state. _Monte Carlo_ methods extend this idea to cover multiple, interrelated states.

Additionally, in the previous problems we've looked at, we've always been given a full model of the environment. This model defines both the transition probabilities, which describe the chances of moving from one state to the next, and the reward received for making each transition.

In _Monte Carlo_ methods this isn't the case. No model is given and instead the agent must discover the properties of the environment through exploration, gathering information as it moves from one state to the next. In other words, <i>Monte Carlo methods learn from experience.</i>

<center><img src="images/green_babyrobot_small.gif"/></center>

___

This notebook accompanies the Towards Data Science article and is part of _[A Baby Robot's Guide To Reinforcement Learning](https://towardsdatascience.com/tagged/baby-robot-guide)_.

___

# Code Setup

The examples in this notebook use the **[Baby Robot Custom Gym Environment](https://medium.com/towards-data-science/creating-a-custom-gym-environment-for-jupyter-notebooks-e17024474617)**.

The source code for this can be found on _[GitHub](https://github.com/WhatIThinkAbout/BabyRobotGym)_.


In [1]:
# install Baby Robot Gym
# %pip install --upgrade babyrobot -q

import babyrobot
print(f"Baby Robot Version = {babyrobot.__version__}")

Baby Robot Version = 1.0.31


In [2]:
from copy import deepcopy

# Monte Carlo Prediction

In the prediction problem we want to find how good it is to be in a particular state of the environment. This "_goodness_" is represented by the state value, which is defined as the expected total reward (the return) that can be obtained when starting in that state and then following the current policy in all subsequent states.
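
Written out in the notation commonly used in Reinforcement Learning textbooks, the definition reads as follows. The symbols are assumptions introduced here for illustration: $G_t$ is the return from time step $t$, $R_{t+1}$ is the reward received after that step, $\pi$ is the policy and $\gamma$ is a discount factor, which this section does not otherwise mention (with $\gamma = 1$ the return is simply the sum of the rewards until the episode ends).

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$

$$v_\pi(s) = \mathbb{E}_\pi\left[\, G_t \mid S_t = s \,\right]$$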

When we have full knowledge about the environment, and know the transition probabilities and rewards, we can simply use _[Dynamic Programming](https://medium.com/towards-data-science/state-values-and-policy-evaluation-ceefdd8c2369#e996)_ to iteratively calculate the value for each state.

In practice, it's unlikely that a system's transition probabilities are known in advance. Instead, to estimate how likely it is to move from one state to another, we can observe multiple episodes and then take the average. This approach, of taking random samples to calculate estimates, is known as **_Monte Carlo Sampling_**.
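
As a minimal sketch of this idea, the code below estimates a transition probability purely from samples. The skid probability of 0.4, the state labels and the function names are all assumptions made for illustration; none of them are taken from the Baby Robot environment.

```python
import random

def sample_transition(rng, skid_probability=0.4):
    """Hypothetical environment step: return the state actually reached.

    The agent never sees skid_probability, only the sampled outcomes.
    """
    return "stay" if rng.random() < skid_probability else "target"

def estimate_transition_probabilities(num_samples, seed=0):
    """Estimate the transition probabilities by counting sampled outcomes."""
    rng = random.Random(seed)
    counts = {"stay": 0, "target": 0}
    for _ in range(num_samples):
        counts[sample_transition(rng)] += 1
    # the relative frequency of each outcome is its Monte Carlo estimate
    return {state: count / num_samples for state, count in counts.items()}

estimates = estimate_transition_probabilities(10_000)
print(estimates)
```

The more samples that are taken, the closer the estimated frequencies should get to the hidden probabilities.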


Consider the level, shown in Figure 1 below, where Baby Robot currently finds himself:

<center><img src="images/part4/glass_wall_level.png"/></center>
<center><i>Figure 1: A level containing a glass wall, with the coordinates of each cell on this level.</i></center>

At first glance this level appears to be rather simple, with a short path from the start of the level to the exit. However, there are 2 obstacles of note:

* In the top-middle square (coordinate (1,0)) there is a large puddle. As we've seen before, Baby Robot doesn't like puddles. They take longer to move through, incurring a negative reward of -4, and can cause him to skid.\
\
When a skid occurs Baby Robot won't reach the target state. Normally this would result in him moving to one of the other possible states, but in this case there are no other possible states, so he'll stay exactly where he is and receive another -4 penalty.\
\
If Baby Robot moves into this puddle there's a good chance he'll become stuck for several time periods and receive a large negative reward. It would be best to avoid this puddle!
<br>
<br>


* The thick blue line, between the cells (1,1) and (1,2), represents a glass wall. This is a new type of challenge that Baby Robot hasn't encountered before.\
\
Unlike standard walls, Baby Robot can't see glass walls and may therefore select an action that causes him to walk into one. When this happens he'll bounce off the wall and, rather than reaching the target state, end up in the state in the opposite direction. He'll also be given a penalty of -1 for the additional time required to make the move.\
\
In this level there are two ways in which he can walk into the glass wall:

  - if he moves South from cell (1,1) he'll instead end up in the puddle at (1,0) and receive a reward of -5 (-4 for moving into a puddle and -1 for hitting the glass wall).

  - if he moves North from (1,2), instead of arriving at (1,1) he'll actually bounce off the wall and end up at the exit. In this case he'll be given a reward of -2 (-1 wall penalty and -1 for moving to a dry square).
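
The reward arithmetic in the two glass-wall cases above can be checked with a small sketch. The reward values come from the text; the function name `step_reward` and its parameters are hypothetical, used purely for illustration, and are not part of the Baby Robot environment's API.

```python
DRY_REWARD = -1          # reward for moving to a dry square
PUDDLE_REWARD = -4       # reward for moving into (or staying in) the puddle
GLASS_WALL_PENALTY = -1  # extra penalty for bouncing off the glass wall

def step_reward(lands_in_puddle, hit_glass_wall):
    """Reward for a single move, based on where Baby Robot actually ends up."""
    reward = PUDDLE_REWARD if lands_in_puddle else DRY_REWARD
    if hit_glass_wall:
        reward += GLASS_WALL_PENALTY
    return reward

# South from (1,1): bounces off the wall and lands in the puddle at (1,0)
south_from_1_1 = step_reward(lands_in_puddle=True, hit_glass_wall=True)

# North from (1,2): bounces off the wall and lands on the dry exit square
north_from_1_2 = step_reward(lands_in_puddle=False, hit_glass_wall=True)

print(south_from_1_1, north_from_1_2)  # -5 -2
```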

### Code Example

In [3]:
# create the environment
basesetup = {'width':3,'height':4,'start':[0,1],'end': [1,3]}

# add a glass wall on the South side of cell (1,1)
basesetup['walls'] = [((1, 1),'S',{'color':'#00a9ff','width':10,'fit':True,'prob':0.0})]

# add the puddle at cell (1,0)
basesetup['puddles'] = [((1,0),2)]
basesetup['base_areas'] = [(0,0,1,1),(0,2,1,2),(2,0,1,1),(2,3,1,1)]

In [4]:
# create the test environment using the base setup
setup = deepcopy(basesetup)
env = babyrobot.make("BabyRobot-v0", **setup )
env.render()

MultiCanvas(height=260, sync_image_data=True, width=196)

In [5]:
# create a setup for information display
# - just show the graphical grid
info_setup = deepcopy(basesetup)
info_setup['show_start_text'] = False
info_setup['show_end_text'] = False
info_setup['robot'] = {'show': False}

In [6]:
setup = deepcopy(info_setup)
setup['add_compass'] = True
env = babyrobot.make("BabyRobot-v0", **setup )
info = {'coords': True}
env.show_info(info)
env.render()

MultiCanvas(height=260, sync_image_data=True, width=296)

<br>
<br>

As mentioned above, when we have complete information about a system, and know all of its transition probabilities and rewards, we can use _[Policy Evaluation](https://towardsdatascience.com/state-values-and-policy-evaluation-ceefdd8c2369)_ to calculate the state values.
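
To make the procedure concrete, here is a minimal sketch of iterative policy evaluation on a hypothetical two-state problem, not the level above. The function `evaluate_policy`, the dictionary layout and the toy transitions are all assumptions made for illustration, and this is not the implementation used by the library's `PolicyEvaluation` class.

```python
def evaluate_policy(transitions, policy, gamma=1.0, tolerance=1e-6, max_iterations=1000):
    """Iteratively compute state values for a fixed policy and a known model.

    transitions[state][action] is a list of (probability, next_state, reward).
    policy[state][action] is the probability of choosing that action.
    States with no entry in transitions are terminal and have a value of 0.
    """
    values = {state: 0.0 for state in transitions}
    for iteration in range(1, max_iterations + 1):
        new_values = {}
        delta = 0.0
        for state, actions in transitions.items():
            value = 0.0
            for action, outcomes in actions.items():
                for probability, next_state, reward in outcomes:
                    next_value = values.get(next_state, 0.0)
                    value += policy[state][action] * probability * (reward + gamma * next_value)
            new_values[state] = value
            delta = max(delta, abs(value - values[state]))
        values = new_values
        if delta < tolerance:
            return values, iteration
    return values, max_iterations

# toy model: every move costs -1 and the episode ends on reaching "exit"
transitions = {
    "A": {"forward": [(1.0, "B", -1)], "wait": [(1.0, "A", -1)]},
    "B": {"forward": [(1.0, "exit", -1)], "back": [(1.0, "A", -1)]},
}

# stochastic policy that picks each available action with equal probability
policy = {
    "A": {"forward": 0.5, "wait": 0.5},
    "B": {"forward": 0.5, "back": 0.5},
}

values, iterations = evaluate_policy(transitions, policy)
print({state: round(value, 2) for state, value in values.items()})
```

For this toy model the values can be checked by hand: solving the two Bellman equations gives -6 for state `A` and -4 for state `B`, and the iterative estimates approach these numbers.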

For this environment, under a stochastic policy that simply chooses randomly between the available actions in each state, _Policy Evaluation_ gives the following state values:

In [None]:
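# create a stochastic policy and a policy evaluation object for this policy
# (Policy and PolicyEvaluation are not defined in the cells above, so they
#  must be imported or defined before this cell is run)
policy = Policy(env)
policy_evaluation = PolicyEvaluation( env, policy )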

setup = deepcopy(info_setup)
env = babyrobot.make("BabyRobot-v0", render_mode=None, **setup)
steps_to_convergence = policy_evaluation.run_to_convergence(max_iterations = 1000)
print(f"Convergence in {steps_to_convergence} iterations")

# show the final state values after convergence
env = babyrobot.make("BabyRobot-v0",**setup)
info = {'text': policy_evaluation.end_values, 'precision': 0}
env.show_info(info) 
env.render()