# Hands On #1 - Skating the Frozen Lake

## Goal:
* Get familiar with programming OpenAI Gym and its interfaces
* Solve OpenAI Gym's Frozen Lake, given an optimal deterministic policy

## Steps:
1. Understand OpenAI Setup & interface
2. Examine Random Policy Run
3. Run the cells up to _5. Hands-On ToDo_
4. Create `aPolicy`, a dictionary that captures the deterministic policy
5. Use deterministic policy to skate to the goal
6. Metrics that can be programmed:
 * Episodes before solve
 * Solved: average reward of at least 0.78 over 100 consecutive episodes
 * Max score over 1000 episodes
 * Avg over 1000 episodes
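
The metrics listed above can be computed from a plain list of per-episode rewards. The following is a minimal sketch, not part of the original notebook: the helper name `summarize` and its parameters are hypothetical, and "solved" is assumed to mean that the mean reward over the last 100 episodes reaches 0.78.

```python
from collections import deque

def summarize(episode_rewards, window=100, threshold=0.78):
    """Compute the metrics from a sequence of per-episode rewards (hypothetical helper)."""
    recent = deque(maxlen=window)  # rewards of the most recent `window` episodes
    solved_at = None               # 1-based episode count at which the threshold is first met
    for i, reward in enumerate(episode_rewards, start=1):
        recent.append(reward)
        window_full = len(recent) == window
        if solved_at is None and window_full and sum(recent) / window >= threshold:
            solved_at = i
    n = len(episode_rewards)
    return {
        'episodes_before_solve': solved_at,  # None if never solved
        'max_score': max(episode_rewards) if n else None,
        'avg_score': sum(episode_rewards) / n if n else None,
    }

# Illustrative data only: a policy that wins every one of 1000 episodes
print(summarize([1.0] * 1000))
```

Collecting the final `reward` of each episode into a list inside the episode loop is enough to feed this helper.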

### 1. Install the required packages

* No esoteric requirements
* You can run the notebooks without Docker
* `pip install -r requirements.txt`
* Requirements
 * Python 3.6, PyTorch, OpenAI Gym, NumPy, Matplotlib
 * Anaconda is easier but not needed
 * Miniconda works fine

### 2.0. Define imports

python 3, numpy, matplotlib, torch, gym

In [1]:
# General imports
import gym

import numpy as np
import random
from collections import namedtuple, deque

import matplotlib.pyplot as plt
%matplotlib inline

# torch imports
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

### 2.1. Global Constants and other variables

In [2]:
# Constants Definitions
BUFFER_SIZE = int(1e5)  # replay buffer size
BATCH_SIZE = 64         # minibatch size
GAMMA = 0.99            # discount factor
TAU = 1e-3              # for soft update of target parameters
LR = 5e-4               # learning rate 
UPDATE_EVERY = 4        # how often to update the network
# Number of neurons in the layers of the Q Network
FC1_UNITS = 16
FC2_UNITS = 8
FC3_UNITS = 4
# Store models flag. Store during calibration runs and do not store during hyperparameter search
STORE_MODELS = False

### Work Area

In [3]:
# Work area to quickly test utility functions
import time
from datetime import datetime, timedelta
start_time = time.time()
time.sleep(10)
print('Elapsed : {}'.format(timedelta(seconds=time.time() - start_time)))

Elapsed : 0:00:10.002450


In [4]:
env = gym.make('FrozenLake-v0')
env.reset()
env.render()  # highlights the agent's current tile (bracketed in the output below)

[S]FFF
FHFH
FFFH
HFFG


In [5]:
print("State :")
print(" 0  1  2  3")
print(" 4  5  6  7")
print(" 8  9 10 11")
print("12 13 14 15")
print()
print(" S  F  F  F")
print(" F  H  F  H")
print(" F  F  F  H")
print(" H  F  F  G")

State :
 0  1  2  3
 4  5  6  7
 8  9 10 11
12 13 14 15

 S  F  F  F
 F  H  F  H
 F  F  F  H
 H  F  F  G
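
The state numbering above is row-major, so a state index and a grid position can be converted into each other with integer arithmetic. A small sketch follows; the helper names `to_row_col`, `to_state`, and `tile` are hypothetical and not part of Gym.

```python
LAKE = ["SFFF", "FHFH", "FFFH", "HFFG"]  # the 4x4 map printed above
N_COLS = len(LAKE[0])

def to_row_col(state, n_cols=N_COLS):
    """Convert a state index (0..15) to a (row, col) pair."""
    return divmod(state, n_cols)

def to_state(row, col, n_cols=N_COLS):
    """Convert a (row, col) pair back to a state index."""
    return row * n_cols + col

def tile(state):
    """Return the map letter (S, F, H or G) at a state index."""
    row, col = to_row_col(state)
    return LAKE[row][col]

print(to_row_col(14), to_state(3, 3), tile(5))
```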


In [6]:
print('[ 0 = Left, 1 = Down, 2 = Right, 3 = Up ]')
print(env.unwrapped.P[0])
print()
print(env.unwrapped.P[14])
print()
print(env.unwrapped.P[15])

[ 0 = Left, 1 = Down, 2 = Right, 3 = Up ]
{0: [(0.3333333333333333, 0, 0.0, False), (0.3333333333333333, 0, 0.0, False), (0.3333333333333333, 4, 0.0, False)], 1: [(0.3333333333333333, 0, 0.0, False), (0.3333333333333333, 4, 0.0, False), (0.3333333333333333, 1, 0.0, False)], 2: [(0.3333333333333333, 4, 0.0, False), (0.3333333333333333, 1, 0.0, False), (0.3333333333333333, 0, 0.0, False)], 3: [(0.3333333333333333, 1, 0.0, False), (0.3333333333333333, 0, 0.0, False), (0.3333333333333333, 0, 0.0, False)]}

{0: [(0.3333333333333333, 10, 0.0, False), (0.3333333333333333, 13, 0.0, False), (0.3333333333333333, 14, 0.0, False)], 1: [(0.3333333333333333, 13, 0.0, False), (0.3333333333333333, 14, 0.0, False), (0.3333333333333333, 15, 1.0, True)], 2: [(0.3333333333333333, 14, 0.0, False), (0.3333333333333333, 15, 1.0, True), (0.3333333333333333, 10, 0.0, False)], 3: [(0.3333333333333333, 15, 1.0, True), (0.3333333333333333, 10, 0.0, False), (0.3333333333333333, 13, 0.0, False)]}

{0: [(1.0, 15, 0,
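
Each entry of `env.unwrapped.P[state][action]` is a list of `(probability, next_state, reward, done)` tuples. As a rough illustration of how to read it, the sketch below copies the printed entries for state 14 into a literal dictionary (so it runs without Gym) and computes the expected immediate reward of each action. The names `P14` and `expected_reward` are made up for this example.

```python
# Transition entries for state 14, copied from the printed output above.
# Each tuple is (probability, next_state, reward, done).
P14 = {
    0: [(1/3, 10, 0.0, False), (1/3, 13, 0.0, False), (1/3, 14, 0.0, False)],
    1: [(1/3, 13, 0.0, False), (1/3, 14, 0.0, False), (1/3, 15, 1.0, True)],
    2: [(1/3, 14, 0.0, False), (1/3, 15, 1.0, True), (1/3, 10, 0.0, False)],
    3: [(1/3, 15, 1.0, True), (1/3, 10, 0.0, False), (1/3, 13, 0.0, False)],
}

def expected_reward(transitions):
    """Probability-weighted immediate reward of one action."""
    return sum(prob * reward for prob, _next_state, reward, _done in transitions)

for action, transitions in P14.items():
    print(action, round(expected_reward(transitions), 3))
```

Even the action that points straight at the goal (2 = Right) reaches it only one time in three, which is why the slippery version is much harder than the map suggests.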

### Slippery Slope !
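#### It is very slippery: as the transition table above shows, each action moves the agent in the intended direction with probability 1/3 and in each of the two perpendicular directions with probability 1/3. To start with, let us create an environment that is not slippery.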

In [7]:
from gym.envs.registration import register
register(
    id='FrozenLakeNotSlippery-v0',
    entry_point='gym.envs.toy_text:FrozenLakeEnv',
    kwargs={'map_name' : '4x4', 'is_slippery': False},
    max_episode_steps=100,
    reward_threshold=0.78, # optimum = .8196
)

In [8]:
import gym
env = gym.make('FrozenLakeNotSlippery-v0')
env.reset()
env.render()


[S]FFF
FHFH
FFFH
HFFG




### 3. Examine the State and Action Spaces

The state space has 16 discrete states (one per grid cell), and there are four actions: [ 0 = Left, 1 = Down, 2 = Right, 3 = Up ]

In [9]:
print(env.observation_space)
print(env.action_space)
act_space = list(range(env.action_space.n))
print(act_space)
# env.unwrapped.get_action_meanings() # AttributeError: 'FrozenLakeEnv' object has no attribute 'get_action_meanings'
print('[ 0 = Left, 1 = Down, 2 = Right, 3 = Up ]')

Discrete(16)
Discrete(4)
[0, 1, 2, 3]
[ 0 = Left, 1 = Down, 2 = Right, 3 = Up ]


In [10]:
#print(dir(env))
#print(dir(env.unwrapped))
print('States = {:d}'.format(env.unwrapped.nS))
print('Actions = {:d}'.format(env.unwrapped.nA))
print(env.unwrapped.P[0])

States = 16
Actions = 4
{0: [(1.0, 0, 0.0, False)], 1: [(1.0, 4, 0.0, False)], 2: [(1.0, 1, 0.0, False)], 3: [(1.0, 0, 0.0, False)]}
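
With `is_slippery=False`, every action has exactly one outcome, so the movement rule can be written down directly. The following is a simplified sketch of that rule, not Gym's implementation: it ignores rewards and terminal tiles, and the names `MOVES` and `next_state` are hypothetical.

```python
LAKE = ["SFFF", "FHFH", "FFFH", "HFFG"]

# (row delta, col delta) per action: 0 = Left, 1 = Down, 2 = Right, 3 = Up
MOVES = {0: (0, -1), 1: (1, 0), 2: (0, 1), 3: (-1, 0)}

def next_state(state, action, lake=LAKE):
    """Deterministic move on the grid; moving into a wall leaves the agent in place."""
    n_rows, n_cols = len(lake), len(lake[0])
    row, col = divmod(state, n_cols)
    d_row, d_col = MOVES[action]
    row = min(max(row + d_row, 0), n_rows - 1)  # clamp to the grid
    col = min(max(col + d_col, 0), n_cols - 1)
    return row * n_cols + col

print({action: next_state(0, action) for action in MOVES})
# {0: 0, 1: 4, 2: 1, 3: 0}
```

The printed next states agree with the `P[0]` entries shown above for the non-slippery environment.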


### Frozen Lake, but slippery no more !

### 4. Test the environment with Random Action

In [11]:
for i_episode in range(3):
    state = env.reset()
    while True:
        action = env.action_space.sample()
        next_state, reward, done, info = env.step(action)
        print('[',state,']',' -> ', action,' = [',next_state,']', 'R = ',reward)
        env.render()
        if done:
            print('End game! Reward: ', reward)
            print('You won :)\n' if reward > 0 else 'You lost :(\n')
            break
        else:
            state = next_state

[ 0 ]  ->  0  = [ 0 ] R =  0.0
  (Left)
[S]FFF
FHFH
FFFH
HFFG
[ 0 ]  ->  3  = [ 0 ] R =  0.0
  (Up)
[S]FFF
FHFH
FFFH
HFFG
[ 0 ]  ->  1  = [ 4 ] R =  0.0
  (Down)
SFFF
[F]HFH
FFFH
HFFG
[ 4 ]  ->  0  = [ 4 ] R =  0.0
  (Left)
SFFF
[F]HFH
FFFH
HFFG
[ 4 ]  ->  3  = [ 0 ] R =  0.0
  (Up)
[S]FFF
FHFH
FFFH
HFFG
[ 0 ]  ->  3  = [ 0 ] R =  0.0
  (Up)
[S]FFF
FHFH
FFFH
HFFG
[ 0 ]  ->  3  = [ 0 ] R =  0.0
  (Up)
[S]FFF
FHFH
FFFH
HFFG
[ 0 ]  ->  3  = [ 0 ] R =  0.0
  (Up)
[S]FFF
FHFH
FFFH
HFFG
[ 0 ]  ->  1  = [ 4 ] R =  0.0
  (Down)
SFFF
[F]HFH
FFFH
HFFG
[ 4 ]  ->  3  = [ 0 ] R =  0.0
  (Up)
[S]FFF
FHFH
FFFH
HFFG
[ 0 ]  ->  1  = [ 4 ] R =  0.0
  (Down)
SFFF
[F]HFH
FFFH
HFFG
[ 4 ]  ->  2  = [ 5 ] R =  0.0
  (Right)
SFFF
F[H]FH
FFFH
HFFG
End game! Reward:  0.0
You lost :(

[ 0 ]  ->  0  = [ 0 ] R =  0.0
  (Left)
[S]FFF
FHFH
FFFH
HFFG
[ 0 ]  ->  3  = [ 0 ] R =  0.0
  (Up)
[S]FFF
FHFH
FFFH
HFFG
[ 0 ]  ->  

### 5. Hands-On ToDo
----
#### Create a dictionary `aPolicy` that captures the deterministic policy from the picture below
##### e.g. to start from state S (which is 0) and go down, use aPolicy = {0: 1}
#### Add more entries to the dictionary to capture the good path from Start to Goal

In [12]:
# A Deterministic Optimal Policy
aPolicy = {0:1,4:1,8:2,9:1,13:2,14:2}
# [ 0 = Left, 1 = Down, 2 = Right, 3 = Up ]

<img src='Frozen_Lake_Policy.png'>
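
The policy dictionary can also be drawn as text, which makes it easy to compare against the picture. This is an illustrative sketch only; `render_policy` and `ARROWS` are hypothetical names, and ASCII characters stand in for arrows.

```python
LAKE = ["SFFF", "FHFH", "FFFH", "HFFG"]
ARROWS = {0: '<', 1: 'v', 2: '>', 3: '^'}  # 0 = Left, 1 = Down, 2 = Right, 3 = Up
aPolicy = {0: 1, 4: 1, 8: 2, 9: 1, 13: 2, 14: 2}

def render_policy(policy, lake=LAKE):
    """Return the map as text, with an arrow on every state the policy covers."""
    n_cols = len(lake[0])
    rows = []
    for r, line in enumerate(lake):
        cells = []
        for c, letter in enumerate(line):
            state = r * n_cols + c
            cells.append(ARROWS[policy[state]] if state in policy else letter)
        rows.append(' '.join(cells))
    return '\n'.join(rows)

print(render_policy(aPolicy))
# v F F F
# v H F H
# > v F H
# H > > G
```

States without an arrow are never visited when the policy is followed from the start on the non-slippery lake.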

### 6. Run the deterministic policy

In [13]:
for i_episode in range(2): # Should be over in 6 steps, try for 2 episodes
    state = env.reset()
    while True:
        policy_action = aPolicy.get(state,-1)
        if policy_action == -1 :
            action = env.action_space.sample()
        else:
            action = policy_action
        next_state, reward, done, info = env.step(action)
        print('[',state,']',' -> ', action,' = [',next_state,']', reward)
        env.render()
        if done:
            print('End game! Reward: ', reward)
            print('You won :)\n' if reward > 0 else 'You lost :(\n')
            break
        else:
            state = next_state

[ 0 ]  ->  1  = [ 4 ] 0.0
  (Down)
SFFF
[F]HFH
FFFH
HFFG
[ 4 ]  ->  1  = [ 8 ] 0.0
  (Down)
SFFF
FHFH
[F]FFH
HFFG
[ 8 ]  ->  2  = [ 9 ] 0.0
  (Right)
SFFF
FHFH
F[F]FH
HFFG
[ 9 ]  ->  1  = [ 13 ] 0.0
  (Down)
SFFF
FHFH
FFFH
H[F]FG
[ 13 ]  ->  2  = [ 14 ] 0.0
  (Right)
SFFF
FHFH
FFFH
HF[F]G
[ 14 ]  ->  2  = [ 15 ] 1.0
  (Right)
SFFF
FHFH
FFFH
HFF[G]
End game! Reward:  1.0
You won :)

[ 0 ]  ->  1  = [ 4 ] 0.0
  (Down)
SFFF
[F]HFH
FFFH
HFFG
[ 4 ]  ->  1  = [ 8 ] 0.0
  (Down)
SFFF
FHFH
[F]FFH
HFFG
[ 8 ]  ->  2  = [ 9 ] 0.0
  (Right)
SFFF
FHFH
F[F]FH
HFFG
[ 9 ]  ->  1  = [ 13 ] 0.0
  (Down)
SFFF
FHFH
FFFH
H[F]FG
[ 13 ]  ->  2  = [ 14 ] 0.0
  (Right)
SFFF
FHFH
FFFH
HF[F]G
[ 14 ]  ->  2  = [ 15 ] 1.0
  (Right)
SFFF
FHFH
FFFH
HFF[G]
End game! Reward:  1.0
You won :)
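
The run above can be cross-checked without Gym by following the policy on a hand-written copy of the non-slippery map. This is a simplified sketch under that assumption; `rollout` is a hypothetical helper, and the map is flattened row by row into one string.

```python
LAKE = "SFFFFHFHFFFHHFFG"  # the 4x4 map, flattened row by row
N = 4                      # grid width and height
MOVES = {0: (0, -1), 1: (1, 0), 2: (0, 1), 3: (-1, 0)}  # Left, Down, Right, Up
aPolicy = {0: 1, 4: 1, 8: 2, 9: 1, 13: 2, 14: 2}

def rollout(policy, start=0, max_steps=100):
    """Follow `policy` deterministically; return (visited states, final tile letter)."""
    state = start
    path = [state]
    for _ in range(max_steps):
        # Stop on a hole, on the goal, or when the policy has no entry for this state
        if LAKE[state] in 'HG' or state not in policy:
            break
        row, col = divmod(state, N)
        d_row, d_col = MOVES[policy[state]]
        row = min(max(row + d_row, 0), N - 1)  # walls keep the agent on the grid
        col = min(max(col + d_col, 0), N - 1)
        state = row * N + col
        path.append(state)
    return path, LAKE[state]

path, final_tile = rollout(aPolicy)
print(path, final_tile, 'steps:', len(path) - 1)
# [0, 4, 8, 9, 13, 14, 15] G steps: 6
```

The path matches the six transitions printed by the environment for each episode.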



In [14]:
env.close()

### That's All _Folks_ !