<a href="https://colab.research.google.com/github/constructor-s/aps1080_winter_2021/blob/main/A1/A1_FrozenLake.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import gym

In [None]:
gym.envs.register(
    id='FrozenLakeNotSlippery-v0',
    entry_point='gym.envs.toy_text:FrozenLakeEnv',
    kwargs={'map_name' : '4x4', 'is_slippery': False},
    max_episode_steps=100,
    reward_threshold=0.74
)

In [None]:
# Create the gridworld-like environment
env=gym.make('FrozenLakeNotSlippery-v0')
# Let's look at the model of the environment (i.e., P):
env.env.P
# Question: what is the data in this structure saying? Relate this to the course
# presentation of P

{0: {0: [(1.0, 0, 0.0, False)],
  1: [(1.0, 4, 0.0, False)],
  2: [(1.0, 1, 0.0, False)],
  3: [(1.0, 0, 0.0, False)]},
 1: {0: [(1.0, 0, 0.0, False)],
  1: [(1.0, 5, 0.0, True)],
  2: [(1.0, 2, 0.0, False)],
  3: [(1.0, 1, 0.0, False)]},
 2: {0: [(1.0, 1, 0.0, False)],
  1: [(1.0, 6, 0.0, False)],
  2: [(1.0, 3, 0.0, False)],
  3: [(1.0, 2, 0.0, False)]},
 3: {0: [(1.0, 2, 0.0, False)],
  1: [(1.0, 7, 0.0, True)],
  2: [(1.0, 3, 0.0, False)],
  3: [(1.0, 3, 0.0, False)]},
 4: {0: [(1.0, 4, 0.0, False)],
  1: [(1.0, 8, 0.0, False)],
  2: [(1.0, 5, 0.0, True)],
  3: [(1.0, 0, 0.0, False)]},
 5: {0: [(1.0, 5, 0, True)],
  1: [(1.0, 5, 0, True)],
  2: [(1.0, 5, 0, True)],
  3: [(1.0, 5, 0, True)]},
 6: {0: [(1.0, 5, 0.0, True)],
  1: [(1.0, 10, 0.0, False)],
  2: [(1.0, 7, 0.0, True)],
  3: [(1.0, 2, 0.0, False)]},
 7: {0: [(1.0, 7, 0, True)],
  1: [(1.0, 7, 0, True)],
  2: [(1.0, 7, 0, True)],
  3: [(1.0, 7, 0, True)]},
 8: {0: [(1.0, 8, 0.0, False)],
  1: [(1.0, 12, 0.0, True)],
  2: [(

---------
The data in this structure represents the *dynamics function $p$*. It maps each (state index, action) pair to a list of possible outcomes, where the action is one of four integers corresponding to $\leftarrow$, $\downarrow$, $\rightarrow$, $\uparrow$. Each outcome is a tuple containing:

1. The probability of that transition (this environment is deterministic, so it is always 1.0)

1. The next state (index)

1. The reward (1.0 when the agent steps onto the goal, state 15; 0.0 otherwise)

1. Whether the next state terminates the episode (`True`: hole or goal) or not (`False`: start or frozen tile)
---------
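As a small illustration of how to read this structure, the sketch below unpacks the tuples for one state. It uses only the entries for states 0 and 4 copied from the output above, so it runs without `gym`; the names `P_excerpt`, `ACTION_NAMES`, and `describe` are hypothetical helpers, not part of the environment.

```python
# Excerpt of env.env.P copied from the output above (states 0 and 4 only).
P_excerpt = {
    0: {0: [(1.0, 0, 0.0, False)],
        1: [(1.0, 4, 0.0, False)],
        2: [(1.0, 1, 0.0, False)],
        3: [(1.0, 0, 0.0, False)]},
    4: {0: [(1.0, 4, 0.0, False)],
        1: [(1.0, 8, 0.0, False)],
        2: [(1.0, 5, 0.0, True)],
        3: [(1.0, 0, 0.0, False)]},
}

# Hypothetical lookup for readable action names.
ACTION_NAMES = {0: "LEFT", 1: "DOWN", 2: "RIGHT", 3: "UP"}


def describe(P, state, action):
    """Return one readable line per possible outcome of (state, action)."""
    lines = []
    for prob, next_state, reward, done in P[state][action]:
        lines.append(
            f"s={state}, a={ACTION_NAMES[action]}: p={prob}, s'={next_state}, "
            f"r={reward}, terminal={done}"
        )
    return lines


# Print every outcome available from state 4.
for a in range(4):
    for line in describe(P_excerpt, 4, a):
        print(line)
```

Each inner list has exactly one tuple here because the non-slippery lake is deterministic; a stochastic environment would list several outcomes whose probabilities sum to 1.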

In [None]:
# Now let's investigate the observation space (i.e., S using our nomenclature),
# and confirm we see it is a discrete space with 16 locations
print(env.observation_space)

Discrete(16)


In [None]:
stateSpaceSize = env.observation_space.n
print(stateSpaceSize)

16


In [None]:
# Now let's investigate the action space (i.e., A) for the agent->environment
# channel
print(env.action_space)

Discrete(4)


In [None]:
# The gym environment has ...sample() functions that allow us to sample
# from the above spaces:
for g in range(1,10,1):
  print("sample from S:",env.observation_space.sample()," ... ","sample from A:",env.action_space.sample())



sample from S: 5  ...  sample from A: 2
sample from S: 15  ...  sample from A: 1
sample from S: 11  ...  sample from A: 2
sample from S: 7  ...  sample from A: 0
sample from S: 1  ...  sample from A: 3
sample from S: 8  ...  sample from A: 3
sample from S: 2  ...  sample from A: 2
sample from S: 6  ...  sample from A: 1
sample from S: 15  ...  sample from A: 0


In [None]:
# The environment also provides a helper to render (visualize) the environment
env.reset()
env.render()


[41mS[0mFFF
FHFH
FFFH
HFFG


In [None]:
# We can act as the agent, by selecting actions and stepping the environment
# through time to see its responses to our actions
env.reset()
while True:
  env.render()
  print("Enter the action as an integer from 0 to",env.action_space.n-1," (or exit): ")
  userInput=input()
  if userInput=="exit":
    break
  action=int(userInput)
  # env.step returns (observation, reward, done, info)
  (observation, reward, done, info) = env.step(action)
  print("--> The result of taking action",action,"is:")
  print("     S=",observation)
  print("     R=",reward)
  print("     p=",info)  # info dict, e.g. {'prob': 1.0}
  if done:
    # Stop once the episode has ended (hole, goal, or step limit)
    env.render()
    break


In [None]:
# Question: draw a table indicating the correspondence between the action
# you input (a number) and the logical action performed.
# Question: draw a table that illustrates what the symbols in the rendered
# image mean.
# Question: explain what the objective of the agent is in this environment.

------------

| Index | Action |
| ----------- | ----------- |
| `0` | LEFT |
| `1` | DOWN | 
| `2` | RIGHT | 
| `3` | UP | 

------------
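The action table can be cross-checked against the `P` output printed earlier. The sketch below assumes the states are numbered row by row on the 4x4 grid (which is what the `P` entries suggest) and that moving into a wall leaves the agent in place; `MOVES`, `next_state`, and `P_NEXT` are hypothetical names, and the function only models moves from non-terminal tiles.

```python
NROWS = NCOLS = 4

ACTION_NAMES = {0: "LEFT", 1: "DOWN", 2: "RIGHT", 3: "UP"}
# (row delta, column delta) for each action index.
MOVES = {0: (0, -1), 1: (1, 0), 2: (0, 1), 3: (-1, 0)}


def next_state(state, action):
    """Deterministic move on the grid, clipped at the walls."""
    row, col = divmod(state, NCOLS)
    d_row, d_col = MOVES[action]
    row = min(max(row + d_row, 0), NROWS - 1)
    col = min(max(col + d_col, 0), NCOLS - 1)
    return row * NCOLS + col


# (state, action) -> next state, copied from the env.env.P output above.
P_NEXT = {(0, 0): 0, (0, 1): 4, (0, 2): 1, (0, 3): 0, (3, 2): 3, (6, 1): 10}

for (s, a), expected in P_NEXT.items():
    print(s, ACTION_NAMES[a], "->", next_state(s, a), "| P says", expected)
```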

| Character | State |
| ----------- | ----------- |
| `S` | starting point, safe |
| `F` | frozen surface, safe |
| `H` | hole, fall to your doom |
| `G` | goal, where the frisbee is located |

------------
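To connect the symbols to state indices, the rendered map can be scanned row by row. This is a minimal sketch assuming row-major state numbering; `LAKE_MAP` is typed in by hand from the `env.render()` output and `tiles_by_symbol` is a hypothetical helper.

```python
# The 4x4 layout, copied from the env.render() output above.
LAKE_MAP = ["SFFF", "FHFH", "FFFH", "HFFG"]


def tiles_by_symbol(lake_map):
    """Group row-major state indices by the character on each tile."""
    groups = {}
    ncols = len(lake_map[0])
    for r, row in enumerate(lake_map):
        for c, ch in enumerate(row):
            groups.setdefault(ch, []).append(r * ncols + c)
    return groups


tiles = tiles_by_symbol(LAKE_MAP)
print("start:", tiles["S"])  # [0]
print("holes:", tiles["H"])  # [5, 7, 11, 12]
print("goal: ", tiles["G"])  # [15]
```

The hole indices agree with the entries in `P` whose fourth tuple element is `True` (for example, transitions into states 5, 7, and 12).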

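The objective of the agent is to travel from the start (`S`) to the goal (`G`) without falling into a hole (`H`). Only the step onto the goal yields a reward (1.0).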

------------

In [None]:
# Practical: Code up an AI that will employ random action selection in order
# to drive the agent. Test this random action selection agent with the
# above environment (i.e., code up a loop as I did above, but instead
# of taking input from a human user, take it from the AI you coded).

In [None]:
#%% An example AI that takes random moves and renders the result at each step
def play_agent(env, policy):
    """
    Function to test drive an agent against the environment. 
    Terminates after the same state is observed twice in a row. 

    Parameters
    ----------
    env : Env
        Gym environment
    policy : callable
        A function that takes `env` as input and returns a valid action
    """
    env.reset()
    previous_observation = None

    env.render()
    while True:
        # Ask the policy for the next action
        action = policy(env)
        # env.step returns (observation, reward, done, info)
        (observation, reward, done, info) = env.step(action)
        print("--> The result of taking action",action,"is:")
        print("     S=",observation)
        print("     R=",reward)
        print("     p=",info)  # info dict, e.g. {'prob': 1.0}
        env.render()

        # Stop if stuck in a state. Note: this also triggers when the agent
        # walks into a wall, not only in a terminal state (hole or goal).
        if observation == previous_observation:
            print("This episode has terminated")
            break
        else:
            previous_observation = observation

env.seed(5)
policy = lambda env: env.action_space.sample()
play_agent(env, policy)


[41mS[0mFFF
FHFH
FFFH
HFFG
--> The result of taking action 3 is:
     S= 0
     R= 0.0
     p= {'prob': 1.0}
  (Up)
[41mS[0mFFF
FHFH
FFFH
HFFG
--> The result of taking action 0 is:
     S= 0
     R= 0.0
     p= {'prob': 1.0}
  (Left)
[41mS[0mFFF
FHFH
FFFH
HFFG
This episode has terminated


In [None]:
# Now towards dynamic programming. Note that env.env.P has the model
# of the environment.
#
# Question: How would you represent the agent's policy function and value function?
# Practical: revise the above AI solver to use a policy function in which you
# code the random action selections in the policy function. Test this.
# Practical: Code the C-4 Policy Evaluation (Prediction) algorithm. You may use
# either the inplace or ping-pong buffer (as described in the lecture). Now
# randomly initialize your policy function, and compute its value function.
# Report your results: policy and value function. Ensure your prediction
# algo reports how many iterations it took.
#
# (Optional): Repeat the above for q.
#
# Policy Improvement:
# Question: How would you use P and your value function to improve an arbitrary
# policy, pi, per Chapter 4?
# Practical: Code the policy iteration process, and employ it to arrive at a
# policy that solves this problem. Show your testing results, and ensure
# it reports the number of iterations for each step: (a) overall policy
# iteration steps and (b) evaluation steps.
# Practical: Code the value iteration process, and employ it to arrive at a
# policy that solves this problem. Show your testing results, reporting
# the iteration counts.
# Comment on the difference between the iterations required for policy vs
# value iteration.
#
# Optional: instead of the above environment, use the "slippery" Frozen Lake via
# env = gym.make("FrozenLake-v0")

----------

The agent's policy function can be implemented as an $N$-element lookup table (array) of integer action indices, where $N$ is the number of states (16 here). This represents a deterministic policy: in each state, the stored action is chosen with probability 1. 

Similarly, the value function can be implemented as an $N$-element lookup table (array) of float values, one per state. 

----------
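A minimal sketch of these two lookup tables, using plain Python lists. The seed, the zero initialization of the values, and the names `policy_table`, `value_table`, and `action_probabilities` are assumptions made for illustration only.

```python
import random

N_STATES, N_ACTIONS = 16, 4

rng = random.Random(0)  # hypothetical seed, only for repeatability

# Deterministic policy: one integer action index per state.
policy_table = [rng.randrange(N_ACTIONS) for _ in range(N_STATES)]

# Value function: one float per state, initialized to zero here.
value_table = [0.0] * N_STATES


def action_probabilities(policy_table, state, n_actions=N_ACTIONS):
    """One-hot pi(a|s) implied by a deterministic lookup-table policy."""
    return [1.0 if a == policy_table[state] else 0.0 for a in range(n_actions)]


print("policy:", policy_table)
print("pi(.|s=0):", action_probabilities(policy_table, 0))
```

The one-hot list makes explicit that the table stores a policy which puts probability 1 on a single action per state; a stochastic policy would instead need an $N \times 4$ table of probabilities.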

In [None]:
# Practical: revise the above AI solver to use a policy function in which you
# code the random action selections in the policy function. Test this.

manual_policy = [
    1, 2, 1, 0, 
    1, 0, 1, 0, 
    2, 2, 1, 2, 
    2, 2, 2, 2
]
policy = lambda env: manual_policy[env.env.s]
play_agent(env, policy)


[41mS[0mFFF
FHFH
FFFH
HFFG
--> The result of taking action 1 is:
     S= 4
     R= 0.0
     p= {'prob': 1.0}
  (Down)
SFFF
[41mF[0mHFH
FFFH
HFFG
--> The result of taking action 1 is:
     S= 8
     R= 0.0
     p= {'prob': 1.0}
  (Down)
SFFF
FHFH
[41mF[0mFFH
HFFG
--> The result of taking action 2 is:
     S= 9
     R= 0.0
     p= {'prob': 1.0}
  (Right)
SFFF
FHFH
F[41mF[0mFH
HFFG
--> The result of taking action 2 is:
     S= 10
     R= 0.0
     p= {'prob': 1.0}
  (Right)
SFFF
FHFH
FF[41mF[0mH
HFFG
--> The result of taking action 1 is:
     S= 14
     R= 0.0
     p= {'prob': 1.0}
  (Down)
SFFF
FHFH
FFFH
HF[41mF[0mG
--> The result of taking action 2 is:
     S= 15
     R= 1.0
     p= {'prob': 1.0}
  (Right)
SFFF
FHFH
FFFH
HFF[41mG[0m
--> The result of taking action 2 is:
     S= 15
     R= 0
     p= {'prob': 1.0}
  (Right)
SFFF
FHFH
FFFH
HFF[41mG[0m
This episode has terminated
