| <p style="text-align: left;">Name</p>               | Matr.Nr. | <p style="text-align: right;">Date</p> |
| --------------------------------------------------- | -------- | ------------------------------------- |
| <p style="text-align: left">Fathy Shalaby</p> | 11701175 | 10.06.2020                            |

<h1 style="color:rgb(0,120,170)">Hands-on AI II</h1>
<h2 style="color:rgb(0,120,170)">Unit 9 (Assignment) -- Introduction to Reinforcement Learning -- Part II </h2>

<b>Authors</b>: Brandstetter, Schäfl <br>
<b>Date</b>: 08-06-2020

This file is part of the "Hands-on AI II" lecture material. The following copyright statement applies 
to all code within this file.

<b>Copyright statement</b>: <br>
This material, no matter whether in printed or electronic form, may be used for personal and non-commercial educational use only. Any reproduction of this manuscript, no matter whether as a whole or in parts, no matter whether in printed or in electronic form, requires explicit prior acceptance of the authors.

<h2>Exercise 0</h2>

- Import the same modules as discussed in the lecture notebook.
- Check if your module versions are correct.

In [18]:
from google.colab import drive
drive.mount('/gdrive')
%cd /gdrive

Drive already mounted at /gdrive; to attempt to forcibly remount, call drive.mount("/gdrive", force_remount=True).
/gdrive


In [19]:
%cd /gdrive/My Drive/as9

/gdrive/My Drive/as9


In [0]:
# your imports go here
import u9_utils as u9
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import numpy as np
import gym

from typing import Any, Dict, Tuple
from gym.envs.toy_text import FrozenLakeEnv

# Set Seaborn plotting style.
sns.set()

In [21]:
# your module version check goes here
u9.check_module_versions()

Installed Python version: 3.6 (✗)
Installed matplotlib version: 3.2.1 (✓)
Installed Pandas version: 1.0.4 (✓)
Installed Seaborn version: 0.10.1 (✓)
Installed OpenAI Gym version: 0.17.2 (✓)


All exercises in this assignment refer to the <i>FrozenLake-v0</i> environment of <a href="https://gym.openai.com"><i>OpenAI Gym</i></a>. This environment is described on its official <a href="https://gym.openai.com/envs/FrozenLake-v0/">OpenAI Gym website</a> as follows:<br>
<cite>Winter is here. You and your friends were tossing around a frisbee at the park when you made a wild throw that left the frisbee out in the middle of the lake. The water is mostly frozen, but there are a few holes where the ice has melted. If you step into one of those holes, you'll fall into the freezing water. At this time, there's an international frisbee shortage, so it's absolutely imperative that you navigate across the lake and retrieve the disc. However, the ice is slippery, so you won't always move in the direction you intend.</cite>


There are <i>four</i> types of surfaces described in this environment:
<ul>
    <li><code>S</code> $\rightarrow$ starting point (<span style="color:rgb(0,255,0)"><i>safe</i></span>)</li>
    <li><code>F</code> $\rightarrow$ frozen surface (<span style="color:rgb(0,255,0)"><i>safe</i></span>)</li>
    <li><code>H</code> $\rightarrow$ hole (<span style="color:rgb(255,0,0)"><i>fall to your doom</i></span>)</li>
    <li><code>G</code> $\rightarrow$ goal (<span style="color:rgb(255,0,255)"><i>frisbee location</i></span>)</li>
</ul>


If you have not done so already, see the lecture's notebook for more information on how to <i>install</i> and <i>import</i> the <code>gym</code> module.

<h3 style="color:rgb(0,120,170)">States and actions</h3>
Experiment with the <i>FrozenLake-v0</i> environment as discussed during the lecture and explained in the accompanying notebook.

In [0]:
lake_environment = FrozenLakeEnv()
u9.set_seed(environment=lake_environment, seed=42)

In [23]:
lake_environment.render(mode=r'human')
current_state_id = lake_environment.s
print(f'\nCurrent state ID: {current_state_id}')


[41mS[0mFFF
FHFH
FFFH
HFFG

Current state ID: 0


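The current position of the <i>disc-retrieving</i> agent is displayed as a filled <span style="color:rgb(255,0,0)"><i>red</i></span> rectangle. In plain-text output, the highlighted field appears wrapped in the ANSI escape codes <code>[41m</code> and <code>[0m</code>.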

As we want to tackle this problem using our renowned <i>random search</i> approach, we have to analyse its applicability beforehand. The number of possible <i>actions</i> and <i>states</i> is therefore crucial, as we don't want to get lost in a combinatorial explosion.
<ul>
    <li>Query the number of <i>actions</i> using the appropriate property of the lake environment.</li>
    <li>Query the number of <i>states</i> using the appropriate property of the lake environment.</li>
</ul>

In [24]:
num_actions = lake_environment.action_space.n
num_states = lake_environment.observation_space.n
print(f'The FrozenLake-v0 environment comprises <{num_actions}> actions and <{num_states}> states.')

The FrozenLake-v0 environment comprises <4> actions and <16> states.
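
The numbering of the 16 states can be made explicit with a short sketch. It assumes that state IDs are assigned row by row on the 4x4 map shown in the rendering above, which agrees with the animation outputs in this notebook (state 12 is the hole in the bottom-left corner, state 15 is the goal). The names `LAKE_MAP`, `state_to_position`, and `surface_at` are illustrative helpers and not part of `gym` or `u9_utils`.

```python
# Map layout copied from the rendered output above (4x4 grid).
LAKE_MAP = ["SFFF", "FHFH", "FFFH", "HFFG"]
N_COLS = len(LAKE_MAP[0])
N_STATES = len(LAKE_MAP) * N_COLS


def state_to_position(state_id: int) -> tuple:
    """Convert a flat state ID into (row, column), assuming row-major numbering."""
    return divmod(state_id, N_COLS)


def surface_at(state_id: int) -> str:
    """Return the surface letter (S, F, H or G) of a state."""
    row, col = state_to_position(state_id)
    return LAKE_MAP[row][col]


# Collect the terminal states: holes and the goal.
hole_states = [s for s in range(N_STATES) if surface_at(s) == "H"]
goal_states = [s for s in range(N_STATES) if surface_at(s) == "G"]
print(f"Holes: {hole_states}, goal: {goal_states}")
```

Under this assumption, the holes and the goal are exactly the states that can never be left, so their rows in a Q-table are never updated.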


<h2>Exercise 1</h2>

- Create a q_table for the frozen lake environment.
- Apply $Q$-learning as it was done in the lecture to solve the environment.
- Test the learned policy and animate one or more example episodes.
- What do you observe? Does the agent learn anything useful? Discuss if something strange happens. Hint: print the q_table during training to better understand what is going on during learning.
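
The update applied at every step can be written as a small helper. This is a minimal sketch, assuming the standard tabular $Q$-learning rule; `q_update` and its `gamma` parameter are illustrative names, and `gamma = 1.0` corresponds to an update without discounting. The transition used in the example is hypothetical.

```python
import numpy as np


def q_update(q_table, state, action, reward, next_state, alpha=0.1, gamma=1.0):
    """Move Q(state, action) a step of size alpha towards the TD target."""
    td_target = reward + gamma * np.max(q_table[next_state])
    q_table[state, action] = (1 - alpha) * q_table[state, action] + alpha * td_target
    return q_table[state, action]


# Hypothetical transition: action 2 in state 14 reaches state 15 with reward 1.
q = np.zeros((16, 4))
first = q_update(q, state=14, action=2, reward=1.0, next_state=15)
second = q_update(q, state=14, action=2, reward=1.0, next_state=15)
print(first, second)
```

With `alpha = 0.1`, the first update gives $0.9 \cdot 0 + 0.1 \cdot 1 = 0.1$ and the second gives $0.9 \cdot 0.1 + 0.1 \cdot 1 = 0.19$, so the entry approaches the target gradually.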

In [25]:
# q_table
q_table = np.zeros([lake_environment.observation_space.n, lake_environment.action_space.n])
q_table

array([[0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.]])

In [26]:
# shape of q_table
q_table.shape

(16, 4)

In [0]:
def apply_q_learning(environment: FrozenLakeEnv, alpha: float = 0.1):
    """
    Solve the lake environment by applying purely greedy Q-learning (no exploration).

    Updates the global ``q_table`` in place; no discount factor is applied.
    """
    for i in range(1, 10001):
        state = environment.reset()
        done = False

        while not done:
            # Greedy action selection; ties are resolved in favour of the lowest action ID.
            action = np.argmax(q_table[state])
            next_state, reward, done, info = environment.step(action)

            # Move the old estimate towards reward + max over next-state values.
            old_value = q_table[state, action]
            next_max = np.max(q_table[next_state])
            new_value = (1 - alpha) * old_value + alpha * (reward + next_max)
            q_table[state, action] = new_value

            state = next_state

        # Print the table every 100 episodes to monitor learning.
        if i % 100 == 0:
            clear_output(wait=True)
            print(f"Episode: {i}")
            print(q_table)

    print("Training finished.\n")

In [28]:
%%time
from IPython.display import clear_output
# train your agent
alpha = 0.1
apply_q_learning(lake_environment, alpha)

Episode: 10000
[[0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]]
Training finished.

CPU times: user 4.67 s, sys: 608 ms, total: 5.28 s
Wall time: 4.75 s


In [29]:
total_epochs, total_dives = 0, 0
episodes = 10000

captured_frames = [[] for _ in range(episodes)]

for episode in range(episodes):
    # test your method
    state = lake_environment.reset()
    epochs, penalties, reward = 0, 0, 0
    
    done = False
    
    while not done:
        action = np.argmax(q_table[state])
        state, reward, done, info = lake_environment.step(action)

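        # Note: FrozenLake only emits rewards of 0 and 1, so this check never
        # fires and the dive counter stays at zero.
        if reward == -10:
            penalties += 1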

        epochs += 1
        
        # Save rendering of current state.
        captured_frames[episode].append({
            r'frame': lake_environment.render(mode=r'ansi'),
            r'state': state,
            r'action': action,
            r'reward': reward
        })

    total_dives += penalties
    total_epochs += epochs


print(f"Results after {episodes} episodes:")
print(f"Average timesteps per episode: {total_epochs / episodes}")
print(f"Average dives per episode: {total_dives / episodes}")

Results after 100 episodes:
Average timesteps per episode: 17.19
Average dives per episode: 0.0


In [30]:
# animate some of the results
u9.animate_environment_search(frames=captured_frames[12], verbose=True, delay=0.4)

  (Left)
SFFF
FHFH
FFFH
[41mH[0mFFG

Step No.: 5
State ID: 12
Action ID: 0
Reward: 0.0
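
The all-zero table is consistent with how the greedy rule behaves on an untrained table. The following minimal sketch, using only NumPy, illustrates two effects: `np.argmax` breaks ties by returning the first index, so the agent always picks action 0 (Left), and an update with zero reward and a zero next-state maximum leaves the entry at zero. The variable names are illustrative.

```python
import numpy as np

# An untrained row of the Q-table: all four actions look equally good.
q_row = np.zeros(4)
greedy_action = int(np.argmax(q_row))  # ties resolve to the first index

# One update step with zero reward and an all-zero next state.
alpha, old_value, reward, next_max = 0.1, 0.0, 0.0, 0.0
new_value = (1 - alpha) * old_value + alpha * (reward + next_max)

print(f"Greedy action: {greedy_action}, updated value: {new_value}")
```

As long as no episode happens to reach the goal, no nonzero reward enters the table, which would explain why training has no visible effect here and why the animated episode ends in the hole at state 12.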


<h2>Exercise 2</h2>
Very likely, your training in Exercise 1 was not successful. Try to add exploration to your algorithm (you might have to write a new function):
<ul>
    <li><code>I</code> $\rightarrow$ Draw a uniform random number between 0 and 1.</li>
    <li><code>II</code> $\rightarrow$ If the number is smaller than 0.1, sample a random action.</li>
    <li><code>III</code> $\rightarrow$ Otherwise, choose your action as usual.</li>
</ul>

- Apply the modified $Q$-learning again to solve the environment.
- Test the learned policy and animate one (or more) exemplary episode.
- What do you observe? Does the agent learn now?
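
The three steps above amount to what is commonly called epsilon-greedy action selection. A minimal sketch, independent of the environment: `epsilon_greedy`, `epsilon`, and `rng` are illustrative names, and the Q-values in the example are made up.

```python
import numpy as np


def epsilon_greedy(q_row, epsilon=0.1, rng=None):
    """Pick a random action with probability epsilon, else the greedy one."""
    rng = np.random.default_rng() if rng is None else rng
    if rng.random() < epsilon:                 # steps I and II: explore
        return int(rng.integers(len(q_row)))
    return int(np.argmax(q_row))               # step III: act as usual


# Hypothetical Q-values for a single state; action 1 is the greedy choice.
rng = np.random.default_rng(0)
q_row = np.array([0.0, 0.5, 0.2, 0.1])
actions = [epsilon_greedy(q_row, epsilon=0.1, rng=rng) for _ in range(1000)]
share_greedy = actions.count(1) / len(actions)
print(f"Share of greedy choices: {share_greedy:.2f}")
```

With `epsilon = 0.1` and four actions, the greedy action should be chosen in roughly 92.5% of the steps (90% by exploitation plus a quarter of the random draws), while every other action is still tried occasionally.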

In [31]:
# q_table
q_table = np.zeros([lake_environment.observation_space.n, lake_environment.action_space.n])
q_table

array([[0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.]])

In [0]:
def apply_q_learning_exploration(environment: FrozenLakeEnv, alpha: float = 0.1, epsilon: float = 0.1):
    """
    Solve the lake environment by applying Q-learning with epsilon-greedy exploration.

    Updates the global ``q_table`` in place; no discount factor is applied.
    """
    for i in range(1, 10001):
        state = environment.reset()
        done = False
        while not done:
            # Draw a uniform random number in [0, 1) and explore with probability epsilon.
            if np.random.uniform(0, 1) < epsilon:
                action = environment.action_space.sample()
            else:
                # Otherwise choose the greedy action as usual.
                action = np.argmax(q_table[state])
            next_state, reward, done, info = environment.step(action)

            # Move the old estimate towards reward + max over next-state values.
            old_value = q_table[state, action]
            next_max = np.max(q_table[next_state])
            new_value = (1 - alpha) * old_value + alpha * (reward + next_max)
            q_table[state, action] = new_value

            state = next_state

        # Print the table every 100 episodes to monitor learning.
        if i % 100 == 0:
            clear_output(wait=True)
            print(f"Episode: {i}")
            print(q_table)
    print("Training finished.\n")

In [33]:
%%time
from IPython.display import clear_output
# train your agent
alpha = 0.1
apply_q_learning_exploration(lake_environment, alpha)

Episode: 10000
[[0.77222677 0.75566608 0.76416355 0.75991826]
 [0.48776354 0.31521277 0.43941593 0.69773066]
 [0.54905981 0.42002365 0.39975704 0.5184715 ]
 [0.23150219 0.07878616 0.09765571 0.07743555]
 [0.77732367 0.61322777 0.56736619 0.52477322]
 [0.         0.         0.         0.        ]
 [0.32684521 0.21219287 0.37495681 0.15066474]
 [0.         0.         0.         0.        ]
 [0.53642365 0.56613119 0.36771388 0.78449932]
 [0.7103599  0.78899253 0.52449693 0.44644416]
 [0.73175767 0.51156238 0.59075405 0.36403519]
 [0.         0.         0.         0.        ]
 [0.         0.         0.         0.        ]
 [0.61700706 0.70163724 0.84566247 0.52615417]
 [0.79876141 0.93116259 0.88754576 0.84422637]
 [0.         0.         0.         0.        ]]
Training finished.

CPU times: user 10.1 s, sys: 2.57 s, total: 12.6 s
Wall time: 9.71 s
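
The learned table is easier to read as a policy map. The sketch below copies the Q-values from the training output above (rounded to three decimals) and prints the greedy action for each field. It assumes row-by-row state numbering and the action IDs 0 = Left, 1 = Down, 2 = Right, 3 = Up; `render_policy` and `ARROWS` are illustrative names.

```python
import numpy as np

# Q-table copied from the training output above, rounded to three decimals.
Q_TABLE = np.array([
    [0.772, 0.756, 0.764, 0.760],
    [0.488, 0.315, 0.439, 0.698],
    [0.549, 0.420, 0.400, 0.518],
    [0.232, 0.079, 0.098, 0.077],
    [0.777, 0.613, 0.567, 0.525],
    [0.000, 0.000, 0.000, 0.000],
    [0.327, 0.212, 0.375, 0.151],
    [0.000, 0.000, 0.000, 0.000],
    [0.536, 0.566, 0.368, 0.784],
    [0.710, 0.789, 0.524, 0.446],
    [0.732, 0.512, 0.591, 0.364],
    [0.000, 0.000, 0.000, 0.000],
    [0.000, 0.000, 0.000, 0.000],
    [0.617, 0.702, 0.846, 0.526],
    [0.799, 0.931, 0.888, 0.844],
    [0.000, 0.000, 0.000, 0.000],
])
LAKE_MAP = ["SFFF", "FHFH", "FFFH", "HFFG"]
ARROWS = "←↓→↑"  # indexed by action ID: Left, Down, Right, Up


def render_policy(q_table, lake_map):
    """Return one string per map row, with an arrow for each non-terminal field."""
    n_cols = len(lake_map[0])
    rows = []
    for r, line in enumerate(lake_map):
        cells = []
        for c, surface in enumerate(line):
            if surface in "HG":  # terminal fields keep their letter
                cells.append(surface)
            else:
                cells.append(ARROWS[int(np.argmax(q_table[r * n_cols + c]))])
        rows.append("".join(cells))
    return rows


policy_rows = render_policy(Q_TABLE, LAKE_MAP)
print("\n".join(policy_rows))
```

Several arrows point away from the goal or towards a wall. On slippery ice this can be a sensible choice, because steering away from a neighbouring hole may matter more than moving directly towards the goal.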


In [34]:
total_epochs, total_dives = 0, 0
episodes = 10000

captured_frames = [[] for _ in range(episodes)]

for episode in range(episodes):
    # test your method
    state = lake_environment.reset()
    epochs, penalties, reward = 0, 0, 0
    
    done = False
    
    while not done:
        action = np.argmax(q_table[state])
        state, reward, done, info = lake_environment.step(action)

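        # Note: FrozenLake only emits rewards of 0 and 1, so this check never
        # fires and the dive counter stays at zero.
        if reward == -10:
            penalties += 1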

        epochs += 1
        
        # Save rendering of current state.
        captured_frames[episode].append({
            r'frame': lake_environment.render(mode=r'ansi'),
            r'state': state,
            r'action': action,
            r'reward': reward
        })

    total_dives += penalties
    total_epochs += epochs


print(f"Results after {episodes} episodes:")
print(f"Average timesteps per episode: {total_epochs / episodes}")
print(f"Average dives per episode: {total_dives / episodes}")

Results after 10000 episodes:
Average timesteps per episode: 43.5011
Average dives per episode: 0.0
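
The reported dive count of 0.0 stems from the check `reward == -10`, a value this environment never emits. One way to measure dives and successes is to look at the final state of each episode. The sketch below works on data in the format of `captured_frames`; the hole and goal IDs assume row-by-row numbering of the 4x4 map, `summarize` is an illustrative helper, and the demo episodes are made up.

```python
from typing import Dict, List

# Assumption: terminal state IDs of the 4x4 map, numbered row by row.
HOLE_STATES = {5, 7, 11, 12}
GOAL_STATE = 15


def summarize(episodes: List[List[Dict]]) -> Dict[str, float]:
    """Return the share of episodes that end in a hole or at the goal."""
    last_frames = [frames[-1] for frames in episodes if frames]
    total = max(len(last_frames), 1)  # avoid division by zero
    dives = sum(1 for last in last_frames if last["state"] in HOLE_STATES)
    goals = sum(1 for last in last_frames if last["state"] == GOAL_STATE)
    return {"dive_rate": dives / total, "success_rate": goals / total}


# Hypothetical episodes, reduced to the keys that are used here.
demo_episodes = [
    [{"state": 8, "reward": 0.0}, {"state": 12, "reward": 0.0}],   # fell into a hole
    [{"state": 14, "reward": 0.0}, {"state": 15, "reward": 1.0}],  # reached the goal
]
stats = summarize(demo_episodes)
print(stats)
```

In the notebook, `summarize(captured_frames)` could be called after the test loop to obtain both rates for the learned policy.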


In [35]:
u9.animate_environment_search(frames=captured_frames[4], verbose=True, delay=0.1)

  (Down)
SFFF
FHFH
FFFH
HFF[41mG[0m

Step No.: 102
State ID: 15
Action ID: 1
Reward: 1.0
