# In this project we will solve two simple environments using a Q-table and a Neural Network (Deep Q-learning).

# Subproject 1

Solve [`FrozenLake8x8-v0`](https://gym.openai.com/envs/FrozenLake8x8-v0/) using a Q-table.


1. Import Necessary Packages:


2. Instantiate the Environment and Agent

3. Set up the QTable:

4. The Q-Learning algorithm training

5. Evaluate how well your agent performs
* Render output of one episode
* Give an average episode return

## Step 1: Import libs

In [4]:
# Import packages
import numpy as np
import gym
import random
import matplotlib.pyplot as plt
%matplotlib inline


## Step 2: Instantiate the environment and agent
* Add the FrozenLake8x8-v0 environment.

OpenAI Gym is a library composed of many environments that we can use to train our agents.
In our case we choose to use Frozen Lake.



In [5]:
#Create Gym
from gym import wrappers
env = gym.make("FrozenLake8x8-v0")
env.render()


SFFFFFFF
FFFFFFFF
FFFHFFFF
FFFFFHFF
FFFHFFFF
FHHFFFHF
FHFFHFHF
FFFHFFFG


## Step 3: Set up the Q-table


* Now we'll create our Q-table. To know how many rows (states) and columns (actions) it needs, we calculate `state_size` and `action_size`.
* OpenAI Gym provides us a way to do that: `env.action_space.n` and `env.observation_space.n`


In [6]:
action_size = env.action_space.n
print("Action size ", action_size)

state_size = env.observation_space.n
print("State size ", state_size)


Action size  4
State size  64


In [7]:
qtable = np.zeros((state_size, action_size))
print(qtable)

[[0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]]


## Step 4: Train with the Q-learning algorithm
Here we specify the hyperparameters.


In [8]:
qtable_history = []
score_history = []
qtable = np.zeros((state_size, action_size))

total_episodes = 250000       # Total episodes
learning_rate = 0.8           # Learning rate
max_steps = 400               # Max steps per episode
gamma = 0.9                  # Discounting rate

# Exploration parameters
epsilon = 1.0                 # Exploration rate
max_epsilon = 1.0             # Exploration probability at start
min_epsilon = 0.001            # Minimum exploration probability 
decay_rate = 0.00005             # Exponential decay rate for exploration prob
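
To get a feel for the exploration parameters above, the decay schedule `min_epsilon + (max_epsilon - min_epsilon) * exp(-decay_rate * episode)`, which the training loop applies at the end of each episode, can be evaluated on its own. The sketch below reuses the notebook's values; the helper name `epsilon_at` and the sampled episode numbers are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

# Same exploration parameters as in the hyperparameter cell
max_epsilon = 1.0
min_epsilon = 0.001
decay_rate = 0.00005


def epsilon_at(episode):
    """Exploration probability after `episode` episodes (hypothetical helper)."""
    return min_epsilon + (max_epsilon - min_epsilon) * np.exp(-decay_rate * episode)


# Sample the schedule at a few arbitrarily chosen points of the training run
for episode in (0, 10_000, 50_000, 100_000, 250_000):
    print(f"episode {episode:>7}: epsilon = {epsilon_at(episode):.4f}")
```

With these values epsilon starts at 1.0 (pure exploration) and approaches `min_epsilon` towards the end of the 250,000 episodes, so late episodes are almost entirely greedy.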

### Now we implement the Q-learning algorithm
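
Before the full loop, it may help to look at a single Q-table update in isolation. The sketch applies the same rule as the training cell, Q(s,a) := Q(s,a) + lr [R(s,a) + gamma * max Q(s',a') - Q(s,a)], to a tiny made-up table. The 3x2 table, the transition, and the names `td_target` and `td_error` are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
import numpy as np

learning_rate = 0.8  # same value as in the hyperparameter cell
gamma = 0.9          # same discount rate as in the hyperparameter cell

# Hypothetical Q-table with 3 states and 2 actions
qtable = np.zeros((3, 2))
qtable[1] = [0.5, 1.0]  # pretend state 1 already has learned values

# One hypothetical transition: in state 0, take action 1, get reward 0, land in state 1
state, action, reward, new_state = 0, 1, 0.0, 1

# Target: immediate reward plus discounted best value of the next state
td_target = reward + gamma * np.max(qtable[new_state, :])  # 0.0 + 0.9 * 1.0
# Error: how far the current estimate is from the target
td_error = td_target - qtable[state, action]               # 0.9 - 0.0
# Move the estimate a fraction `learning_rate` of the way towards the target
qtable[state, action] += learning_rate * td_error          # about 0.72

print(f"Q(0, 1) after one update: {qtable[state, action]:.2f}")
```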

In [9]:
# List of rewards
rewards = []

# Loop over training episodes
for episode in range(total_episodes):
    # Reset the environment
    state = env.reset()
    step = 0
    done = False
    total_rewards = 0
    
    for step in range(max_steps):
        # Choose an action a in the current state s (epsilon-greedy)
        ## First we draw a random number in [0, 1]
        exp_exp_tradeoff = random.uniform(0, 1)
        
        ## If this number is greater than epsilon --> exploitation (take the action with the biggest Q-value for this state)
        if exp_exp_tradeoff > epsilon:
            action = np.argmax(qtable[state,:])

        # Else doing a random choice --> exploration
        else:
            action = env.action_space.sample()

        # Take the action (a) and observe the outcome state(s') and reward (r)
        new_state, reward, done, info = env.step(action)

        # Update Q(s,a):= Q(s,a) + lr [R(s,a) + gamma * max Q(s',a') - Q(s,a)]
        # qtable[new_state,:] : all the actions we can take from new state
        qtable[state, action] = qtable[state, action] + learning_rate * (reward + gamma * np.max(qtable[new_state, :]) - qtable[state, action])
        
        total_rewards += reward
        
        # The new state becomes the current state
        state = new_state
        
        # If done (we fell into a hole, reached the goal, or hit the time limit): finish the episode
        if done:
            break
        
    # Reduce epsilon (because we need less and less exploration)
    epsilon = min_epsilon + (max_epsilon - min_epsilon)*np.exp(-decay_rate*episode) 
    rewards.append(total_rewards)

print("Score over time: " + str(sum(rewards) / total_episodes))
print(qtable)

Score over time: 0.398708
[[1.97754429e-04 8.26350881e-04 1.93113721e-04 5.36319528e-04]
 [2.22505877e-04 2.07983858e-04 2.11095136e-04 2.88863031e-03]
 [3.17509777e-04 3.91013713e-03 5.21439205e-04 3.32209516e-04]
 [6.29004079e-04 6.75732393e-03 6.15383864e-04 6.22303982e-04]
 [8.48217419e-04 1.40045690e-03 5.60187590e-03 3.86229897e-03]
 [2.55895403e-03 2.69177106e-03 2.06022342e-02 2.73136021e-03]
 [1.09922028e-03 1.56098040e-02 2.71186276e-03 2.75363560e-03]
 [1.52041958e-02 1.52967922e-03 2.46201788e-03 9.15776815e-04]
 [1.47753164e-03 1.56233665e-04 1.59876817e-04 1.53334550e-04]
 [1.40909362e-04 1.33916900e-04 1.55234364e-04 2.42259055e-03]
 [2.01159673e-04 2.31868766e-04 1.28932743e-04 3.72589537e-03]
 [4.88996858e-04 2.04153837e-04 4.23194065e-04 7.30445565e-03]
 [1.24552691e-03 7.88396310e-04 6.00891418e-04 1.02733594e-02]
 [1.11163945e-03 2.40284456e-02 1.16521638e-03 1.18161698e-03]
 [1.41562543e-03 3.32593355e-03 3.15390376e-02 1.12475565e-03]
 [2.51205299e-02 1.73185531e-

## Step 5: Use our Q-table to play FrozenLake ! 👾
After 250,000 training episodes, our Q-table can be used as a "cheatsheet" to play FrozenLake.
By running this cell you can watch our agent play 10,000 evaluation episodes of FrozenLake.

In [10]:
env.reset()

for episode in range(10000):
    state = env.reset()
    step = 0
    done = False
    print("****************************************************")
    print("EPISODE ", episode)

    for step in range(max_steps):
        
        # Take the action (index) that has the maximum expected future reward given that state
        action = np.argmax(qtable[state,:])
        
        new_state, reward, done, info = env.step(action)
        
        if done:
            # Here, we only print the last state (to see whether our agent reached the goal or fell into a hole)
            env.render()
            
            # We print the number of step it took.
            print("Number of steps", step)
            break
        state = new_state
env.close()

Streaming output truncated to the last 5000 lines.
FFFFFFFF
FFFHFFFF
FFFFFHFF
FFFHFFFF
FHHFFFHF
FHFFHFHF
FFFHFFFG
Number of steps 110
****************************************************
EPISODE  9584
  (Left)
SFFFFFFF
FFFFFFFF
FFFHFFFF
FFFFFHFF
FFFHFFFF
FHHFFFHF
FHFFHFHF
FFFHFFFG
Number of steps 199
****************************************************
EPISODE  9585
  (Left)
SFFFFFFF
FFFFFFFF
FFFHFFFF
FFFFFHFF
FFFHFFFF
FHHFFFHF
FHFFHFHF
FFFHFFFG
Number of steps 30
****************************************************
EPISODE  9586
  (Down)
SFFFFFFF
FFFFFFFF
FFFHFFFF
FFFFFHFF
FFFHFFFF
FHHFFFHF
FHFFHFHF
FFFHFFFG
Number of steps 60
****************************************************
EPISODE  9587
  (Down)
SFFFFFFF
FFFFFFFF
FFFHFFFF
FFFFFHFF
FFFHFFFF
FHHFFFHF
FHFFHFHF
FFFHFFFG
Number of steps 81
****************************************************
EPISODE  9588
  (Down)
SFFFFFFF
FFFFFFFF
FFFHFFFF
FFFFFHFF
FFFHFFFF
FHHFFFHF
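
The outline for this subproject also asks for an average episode return, which the play loop above prints nothing about. A minimal sketch of how it could be computed with the greedy policy follows. The function name `average_return` is an assumption, and `CorridorEnv` is a hypothetical three-state stand-in that only exists to make the example runnable without Gym; it mimics the older Gym interface used in this notebook, where `reset()` returns a state and `step()` returns four values.

```python
import numpy as np


def average_return(env, qtable, n_episodes=100, max_steps=400):
    """Mean undiscounted return of the greedy policy over n_episodes."""
    returns = []
    for _ in range(n_episodes):
        state = env.reset()
        total = 0.0
        for _ in range(max_steps):
            # Always exploit: pick the action with the largest Q-value
            action = int(np.argmax(qtable[state, :]))
            state, reward, done, _info = env.step(action)
            total += reward
            if done:
                break
        returns.append(total)
    return float(np.mean(returns))


class CorridorEnv:
    """Hypothetical corridor: action 1 moves right, reaching state 2 pays 1."""

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        if action == 1:
            self.state += 1
        done = self.state == 2
        return self.state, float(done), done, {}


# Toy Q-table that prefers "move right" in the two non-terminal states
toy_qtable = np.array([[0.0, 1.0], [0.0, 1.0], [0.0, 0.0]])
toy_average = average_return(CorridorEnv(), toy_qtable, n_episodes=5, max_steps=10)
print("Average return on the toy corridor:", toy_average)
```

With the notebook's objects the call would be `average_return(env, qtable, n_episodes=10000, max_steps=max_steps)`. In FrozenLake the only non-zero reward is 1 for reaching the goal, so this average is also the fraction of episodes that end on `G`.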

# Subproject 2

Solve [LunarLander-v2](https://gym.openai.com/envs/LunarLander-v2/) using DQN.

**1. Import Necessary Packages:**


In [16]:


!apt install swig cmake
!pip install stable-baselines3[extra] box2d box2d-kengz


Reading package lists... Done
Building dependency tree       
Reading state information... Done
swig is already the newest version (3.0.12-1).
cmake is already the newest version (3.10.2-1ubuntu2.18.04.2).
The following package was automatically installed and is no longer required:
  libnvidia-common-460
Use 'apt autoremove' to remove it.
0 upgraded, 0 newly installed, 0 to remove and 40 not upgraded.
Collecting box2d
  Using cached Box2D-2.3.10-cp37-cp37m-manylinux1_x86_64.whl (1.3 MB)
Collecting box2d-kengz
  Using cached Box2D-kengz-2.3.3.tar.gz (425 kB)
Building wheels for collected packages: box2d-kengz
  Building wheel for box2d-kengz (setup.py) ... done
  Created wheel for box2d-kengz: filename=Box2D_kengz-2.3.3-cp37-cp37m-linux_x86_64.whl size=2052971 sha256=dabba85f1baf8999d4af67c7e532b464b098a79811d3af078035d52970b3f619
  Stored in directory: /root/.cache/pip/wheels/50/6d/6a/6ff76731fd9e8efbd1cdc6111e98b2dd0f1872184d7c28939c
Successfully built box2d-kengz
Installi

In [19]:
#Imports
import gym
import numpy as np
import matplotlib.pyplot as plt
from collections import deque
import tensorflow as tf
from tensorflow import keras
from gym import wrappers
from stable_baselines3 import DQN

**2. Instantiate the Environment**

In [24]:
model = DQN('MlpPolicy', 'LunarLander-v2', verbose=1, exploration_final_eps=0.1, target_update_interval=250)


Using cpu device
Creating environment from the given name 'LunarLander-v2'
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.


We load a helper function to evaluate the agent:

In [25]:
from stable_baselines3.common.evaluation import evaluate_policy

**3. Implement and instantiate the agent**



In [26]:
# Separate env for evaluation
eval_env = gym.make('LunarLander-v2')

# Random Agent, before training
mean_reward, std_reward = evaluate_policy(model, eval_env, n_eval_episodes=10, deterministic=True)

print(f"mean_reward={mean_reward:.2f} +/- {std_reward}")



mean_reward=-669.34 +/- 252.52553896565271


**4. Train the agent with DQN**

4.1 Show the episode return plot
  
  - Is the agent learning to solve the task?

4.2 Save the best model
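
For item 4.1, raw episode returns are usually too noisy to read directly, so a common choice is to smooth them with a moving average before plotting. The sketch below shows only that smoothing step. The helper `moving_average`, the window size, and the `episode_returns` values are hypothetical placeholders, not results from this notebook.

```python
import numpy as np


def moving_average(values, window=100):
    """Average each run of `window` consecutive values (hypothetical helper)."""
    values = np.asarray(values, dtype=float)
    if len(values) < window:
        return np.array([])
    kernel = np.ones(window) / window
    # mode="valid" keeps only positions where the full window fits
    return np.convolve(values, kernel, mode="valid")


# Made-up returns standing in for the logged per-episode returns
episode_returns = [-300, -250, -200, -100, 0, 50]
smoothed = moving_average(episode_returns, window=3)
print(smoothed)
```

The smoothed array can then be passed to `plt.plot(smoothed)`; a curve that trends upward over training suggests the agent is learning. One possible source for the real returns, assuming the `Monitor` wrapper that Stable-Baselines3 adds automatically, is `model.get_env().envs[0].get_episode_rewards()`, read before the model is deleted.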

In [27]:
# Train the agent
model.learn(total_timesteps=int(1e5))
# Save the agent
model.save("dqn_lunar")
del model  # delete trained model to demonstrate loading

----------------------------------
| rollout/            |          |
|    ep_len_mean      | 91.2     |
|    ep_rew_mean      | -167     |
|    exploration rate | 0.967    |
| time/               |          |
|    episodes         | 4        |
|    fps              | 2719     |
|    time_elapsed     | 0        |
|    total timesteps  | 365      |
----------------------------------
----------------------------------
| rollout/            |          |
|    ep_len_mean      | 97.9     |
|    ep_rew_mean      | -258     |
|    exploration rate | 0.93     |
| time/               |          |
|    episodes         | 8        |
|    fps              | 2983     |
|    time_elapsed     | 0        |
|    total timesteps  | 783      |
----------------------------------
----------------------------------
| rollout/            |          |
|    ep_len_mean      | 97.6     |
|    ep_rew_mean      | -254     |
|    exploration rate | 0.895    |
| time/               |          |
|    episodes       

**5. Load the model from the disk and run it in a loop**
- Hint: if you want to see the agent landing the Lunar Lander, call `env.render()` after `env.step()`.
- Because Colab does not cooperate with Gym rendering, you might want to download the trained model and run this loop on your own computer to visualise the behavior.
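
The loop described in this step could look roughly like the sketch below. It relies only on the `predict(obs, deterministic=True)` method, which Stable-Baselines3 models provide and which returns an action together with a recurrent state, plus the older Gym interface used in this notebook, where `step()` returns four values. The function name `run_episode` is an assumption, and `ConstantModel` and `CountdownEnv` are hypothetical stand-ins so that the example runs without Gym or a trained model.

```python
def run_episode(model, env, max_steps=1000, render=False):
    """Play one episode with the model's deterministic policy; return its total reward."""
    obs = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        # predict() returns (action, next_hidden_state); the latter is unused here
        action, _state = model.predict(obs, deterministic=True)
        obs, reward, done, _info = env.step(action)
        if render:
            env.render()
        total_reward += reward
        if done:
            break
    return total_reward


class ConstantModel:
    """Hypothetical stand-in for a trained model: always chooses action 0."""

    def predict(self, obs, deterministic=True):
        return 0, None


class CountdownEnv:
    """Hypothetical environment: ends after three steps, reward 1.0 per step."""

    def reset(self):
        self.steps_left = 3
        return self.steps_left

    def step(self, action):
        self.steps_left -= 1
        return self.steps_left, 1.0, self.steps_left == 0, {}

    def render(self):
        pass


toy_return = run_episode(ConstantModel(), CountdownEnv())
print("Toy episode return:", toy_return)
```

On a local machine the same function would be called with the real objects, for example `run_episode(DQN.load("dqn_lunar"), gym.make('LunarLander-v2'), render=True)`.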

**Helper functions**

In [28]:
model = DQN.load("dqn_lunar")

In [29]:
# Evaluate the trained agent
mean_reward, std_reward = evaluate_policy(model, eval_env, n_eval_episodes=10, deterministic=True)

print(f"mean_reward={mean_reward:.2f} +/- {std_reward}")



mean_reward=27.66 +/- 139.5156945536127
