# Q* Learning with FrozenLake 🕹️⛄
<br> 
In this Notebook, we'll implement an agent <b>that plays FrozenLake.</b>
<img src="frozenlake.png" alt="Frozen Lake"/>

The goal of this game is <b>to go from the starting state (S) to the goal state (G)</b> by walking only on frozen tiles (F) and avoiding holes (H). However, the ice is slippery, <b>so you won't always move in the direction you intend (stochastic environment).</b>
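
To make "slippery" concrete, here is a minimal sketch of the kind of transition rule involved. It assumes Gym's FrozenLake behaviour, where the agent goes in the intended direction with probability 1/3 and in each of the two perpendicular directions with probability 1/3. The helper name `slippery_move` is hypothetical and not part of Gym.

```python
import random

LEFT, DOWN, RIGHT, UP = 0, 1, 2, 3  # action encoding used by FrozenLake


def slippery_move(intended, rng=random):
    """Return the direction actually taken on slippery ice (hypothetical helper)."""
    # The intended direction and its two perpendicular neighbours are equally likely.
    return rng.choice([(intended - 1) % 4, intended, (intended + 1) % 4])


rng = random.Random(0)
counts = {LEFT: 0, DOWN: 0, RIGHT: 0, UP: 0}
for _ in range(3000):
    counts[slippery_move(RIGHT, rng)] += 1
print(counts)  # LEFT (the opposite of RIGHT) is never chosen
```

Because only about a third of the moves go where the agent wants, even a good policy will sometimes fall into a hole.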

# This is a notebook from [Deep Reinforcement Learning Course with Tensorflow](https://simoninithomas.github.io/Deep_reinforcement_learning_Course/)
<img src="https://raw.githubusercontent.com/simoninithomas/Deep_reinforcement_learning_Course/master/docs/assets/img/DRLC%20Environments.png" alt="Deep Reinforcement Course"/>
<br>
<p>The Deep Reinforcement Learning Course is a free series of articles and video tutorials 🆕 about Deep Reinforcement Learning, where <b>we'll learn the main algorithms (Q-learning, Deep Q Nets, Dueling Deep Q Nets, Policy Gradients, A2C, Proximal Policy Gradients…) and how to implement them with Tensorflow.</b>
<br><br>
📜 The articles explain the architectures from the big picture to the mathematical details behind them.
<br>
📹 The videos explain how to build the agents with Tensorflow.</p>
<br>
<p>This course will give you a <b>solid foundation for understanding and implementing future state-of-the-art algorithms</b>. You'll also build a strong professional portfolio by creating <b>agents that learn to play awesome environments</b>: Doom© 👹, Space Invaders 👾, Outrun, Sonic the Hedgehog©, Michael Jackson’s Moonwalker, agents that can navigate 3D environments with DeepMindLab (Quake), and agents that can walk with Mujoco.
<br><br>
</p>

## 📚 The complete [Syllabus HERE](https://simoninithomas.github.io/Deep_reinforcement_learning_Course/)


## Any questions 👨‍💻
<p> If you have any questions, feel free to ask me: </p>
<p> 📧: <a href="mailto:hello@simoninithomas.com">hello@simoninithomas.com</a>  </p>
<p> Github: https://github.com/simoninithomas/Deep_reinforcement_learning_Course </p>
<p> 🌐 : https://simoninithomas.github.io/Deep_reinforcement_learning_Course/ </p>
<p> Twitter: <a href="https://twitter.com/ThomasSimonini">@ThomasSimonini</a> </p>
<p> Don't forget to <b> follow me on <a href="https://twitter.com/ThomasSimonini">twitter</a>, <a href="https://github.com/simoninithomas/Deep_reinforcement_learning_Course">github</a> and <a href="https://medium.com/@thomassimonini">Medium</a> to be alerted of the new articles that I publish </b></p>
    
## How to help  🙌
3 ways:
- **Clap our articles and like our videos a lot**: Clapping on Medium means that you really like our articles, and the more claps we have, the more our articles are shared. Liking our videos helps them become much more visible to the deep learning community.
- **Share and speak about our articles and videos**: By sharing our articles and videos you help us spread the word.
- **Improve our notebooks**: if you find a bug or **a better implementation**, you can send a pull request.
<br>

## Prerequisites 🏗️
Before diving into the notebook **you need to understand**:
- The foundations of Reinforcement Learning (MC, TD, reward hypothesis...) [Article](https://medium.freecodecamp.org/an-introduction-to-reinforcement-learning-4339519de419)
- Q-learning [Article](https://medium.freecodecamp.org/diving-deeper-into-reinforcement-learning-with-q-learning-c18d0db58efe)
- In the [video version](https://www.youtube.com/watch?v=q2ZOEFAaaI0)  we implemented a Q-learning agent that learns to play OpenAI Taxi-v2 🚕 with Numpy.

In [1]:
from IPython.display import HTML
HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/q2ZOEFAaaI0?showinfo=0" frameborder="0" allow="autoplay; encrypted-media" allowfullscreen></iframe>')

## Step 0: Import the dependencies 📚
We use 3 libraries:
- `numpy` for our Q-table
- `gym` (OpenAI Gym) for our FrozenLake environment
- `random` to generate random numbers

In [1]:
import numpy as np
import gym
import random

## Step 1: Create the environment 🎮
- Here we'll create the FrozenLake environment. 
- OpenAI Gym is a library <b> composed of many environments that we can use to train our agents.</b>
- In our case we choose to use Frozen Lake.

In [2]:
env = gym.make("FrozenLake-v0")

In [3]:
env.render()


SFFF
FHFH
FFFH
HFFG
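
The grid above can be related to the integer states the environment returns. The following is a small sketch, assuming Gym's convention that states are numbered row by row (`state = row * n_cols + col`). The helper `state_to_tile` is hypothetical and only for illustration.

```python
GRID = ["SFFF", "FHFH", "FFFH", "HFFG"]  # the map rendered above
N_COLS = len(GRID[0])


def state_to_tile(state):
    """Map a state index to (row, col, tile letter); hypothetical helper."""
    row, col = divmod(state, N_COLS)
    return row, col, GRID[row][col]


print(state_to_tile(0))   # (0, 0, 'S')
print(state_to_tile(5))   # (1, 1, 'H')
print(state_to_tile(15))  # (3, 3, 'G')
```

So state 0 is the start, state 15 is the goal, and states 5, 7, 11 and 12 are holes.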


## Step 2: Create the Q-table and initialize it 🗄️
- Now we'll create our Q-table. To know how many rows (states) and columns (actions) we need, we calculate `state_size` and `action_size`.
- OpenAI Gym gives us a way to do that: `env.observation_space.n` and `env.action_space.n`

In [4]:
action_size = env.action_space.n
state_size = env.observation_space.n

In [5]:
qtable = np.zeros((state_size, action_size))
print(qtable)

[[0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]]


In [6]:
qtable.shape

(16, 4)
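
Each row of the table holds one value per action for a given state, and acting greedily means picking the column with the largest value. Here is a minimal sketch in plain Python. The numbers in `trained_row` are made up for illustration, and `greedy_action` is a hypothetical helper that mirrors what `np.argmax` does on a row.

```python
ACTION_NAMES = ["Left", "Down", "Right", "Up"]  # FrozenLake action order


def greedy_action(q_row):
    """Index of the largest value in one Q-table row (first one wins ties)."""
    return max(range(len(q_row)), key=lambda a: q_row[a])


untrained_row = [0.0, 0.0, 0.0, 0.0]
trained_row = [0.1, 0.4, 0.2, 0.05]  # made-up values for illustration

print(ACTION_NAMES[greedy_action(untrained_row)])  # Left
print(ACTION_NAMES[greedy_action(trained_row)])    # Down
```

With an all-zero table every state ties, so the greedy choice is always action 0. This is one reason exploration matters early in training.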

## Step 3: Create the hyperparameters ⚙️
- Here, we'll specify the hyperparameters

In [7]:
total_episodes = 15000        # Total episodes
learning_rate = 0.8           # Learning rate
max_steps = 99                # Max steps per episode
gamma = 0.95                  # Discounting rate

# Exploration parameters
epsilon = 1.0                 # Exploration rate
max_epsilon = 1.0             # Exploration probability at start
min_epsilon = 0.01            # Minimum exploration probability 
decay_rate = 0.005             # Exponential decay rate for exploration prob
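
To get a feel for the exploration parameters, this sketch evaluates the exponential decay schedule that the training cell later applies, `min_epsilon + (max_epsilon - min_epsilon) * exp(-decay_rate * episode)`, at a few episode numbers. The function name `epsilon_at` is hypothetical.

```python
import math

max_epsilon, min_epsilon, decay_rate = 1.0, 0.01, 0.005  # values from the cell above


def epsilon_at(episode):
    """Exploration probability at a given episode (hypothetical helper)."""
    return min_epsilon + (max_epsilon - min_epsilon) * math.exp(-decay_rate * episode)


for episode in (0, 100, 500, 1000, 5000):
    print(episode, round(epsilon_at(episode), 4))
```

The agent starts fully random and, with this decay rate, is already close to `min_epsilon` after roughly a thousand episodes.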

## Step 4: The Q-learning algorithm 🧠
- Now we implement the Q-learning algorithm:
<img src="qtable_algo.png" alt="Q algo"/>
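
As a worked example of the update rule, here is a sketch of a single Q-learning step for one (state, action) pair, using the learning rate and discount from Step 3. The function `q_update` is a hypothetical helper, not part of the notebook.

```python
learning_rate, gamma = 0.8, 0.95  # values from Step 3


def q_update(q_sa, reward, max_q_next):
    """One Q-learning update for a single (state, action) pair."""
    td_target = reward + gamma * max_q_next
    return q_sa + learning_rate * (td_target - q_sa)


# Stepping onto the goal from an untrained table: reward 1, next-state values all 0.
print(q_update(q_sa=0.0, reward=1.0, max_q_next=0.0))  # 0.8

# One tile further back: no reward yet, but the next state is now worth 0.8.
print(round(q_update(q_sa=0.0, reward=0.0, max_q_next=0.8), 3))  # 0.608
```

This is how the reward at G gradually propagates backwards through the table, discounted by `gamma` at each step.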

In [9]:
print(env.action_space)

print(env.observation_space)

Discrete(4)
Discrete(16)


In [10]:
min([env.observation_space.sample()
 for i in range(10000)])

0

In [11]:
action

4

In [14]:
from scipy import stats

In [16]:
rewards[-100:]

[1.0,
 1.0,
 1.0,
 1.0,
 0.0,
 0.0,
 1.0,
 1.0,
 0.0,
 1.0,
 1.0,
 1.0,
 0.0,
 1.0,
 1.0,
 1.0,
 1.0,
 1.0,
 0.0,
 1.0,
 0.0,
 1.0,
 1.0,
 1.0,
 1.0,
 0.0,
 1.0,
 0.0,
 1.0,
 1.0,
 0.0,
 1.0,
 0.0,
 1.0,
 0.0,
 0.0,
 0.0,
 1.0,
 0.0,
 1.0,
 0.0,
 1.0,
 1.0,
 1.0,
 1.0,
 0.0,
 0.0,
 1.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 1.0,
 0.0,
 1.0,
 1.0,
 0.0,
 0.0,
 0.0,
 1.0,
 0.0,
 0.0,
 0.0,
 1.0,
 1.0,
 1.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 1.0,
 0.0,
 1.0,
 0.0,
 0.0,
 0.0,
 1.0,
 1.0,
 0.0,
 1.0,
 1.0,
 0.0,
 1.0,
 0.0,
 1.0,
 1.0,
 0.0,
 1.0,
 1.0,
 0.0,
 0.0,
 1.0,
 1.0,
 0.0,
 1.0,
 1.0,
 1.0]

In [15]:
print(stats.describe(rewards))

DescribeResult(nobs=15000, minmax=(0.0, 1.0), mean=0.4942666666666667, variance=0.24998379447518723, skewness=0.0229348411703241, kurtosis=-1.9994739930604915)


In [12]:
# Precompute the exponentially decayed exploration rate for every episode
epsilon_list = min_epsilon + (max_epsilon - min_epsilon) * np.exp(-decay_rate * np.arange(total_episodes))
rewards = []
for i in range(total_episodes):
    observation = env.reset()
    current_episode_rewards = 0
    for _ in range(max_steps):
        # Epsilon-greedy: exploit the Q-table with probability 1 - epsilon, otherwise explore
        if random.random() > epsilon_list[i]:
            action = int(np.argmax(qtable[observation, :]))
        else:
            action = random.randint(0, action_size - 1)
        observation_new, reward, done, info = env.step(action)
        current_episode_rewards += reward
        # Q(s,a) := Q(s,a) + lr * [R + gamma * max_a' Q(s',a') - Q(s,a)]
        qtable[observation, action] += learning_rate * (
            reward + gamma * np.max(qtable[observation_new, :]) - qtable[observation, action]
        )
        observation = observation_new
        if done:
            break
    rewards.append(current_episode_rewards)
print("after episodes :{} reward over time: {}".format(total_episodes, sum(rewards) / total_episodes))

after episodes :15000 reward over time: 0.4942666666666667
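
The single average above mixes early, mostly random episodes with later, mostly greedy ones. One way to see progress over time is to average the rewards in consecutive blocks. This is a minimal sketch with made-up data, and `block_means` is a hypothetical helper.

```python
def block_means(values, block_size):
    """Average consecutive blocks of `values`; hypothetical helper."""
    means = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        means.append(sum(block) / len(block))
    return means


toy_rewards = [0, 0, 0, 1, 0, 1, 1, 1]  # made-up episode outcomes
print(block_means(toy_rewards, 4))       # [0.25, 0.75]
```

In the notebook, something like `block_means(rewards, 1000)` would give the success rate per thousand training episodes.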


## Step 5: Use our Q-table to play FrozenLake! 👾
- After 15,000 episodes, our Q-table can be used as a "cheatsheet" to play FrozenLake.
- By running this cell you can see our agent playing FrozenLake.

In [20]:
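for i_episode in range(4):
    observation = env.reset()
    episode_reward = 0  # separate name so the training `rewards` list is not overwritten
    for t in range(max_steps):
        # Always exploit: take the action with the highest Q-value for this state
        action = int(np.argmax(qtable[observation, :]))
        observation, reward, done, info = env.step(action)
        episode_reward += reward
        if done:
            # Render only the final frame to see whether we reached G or fell into H
            env.render()
            print("Episode finished after {} timesteps, rewards: {}".format(t + 1, episode_reward))
            break
#env.close()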

  (Down)
SFFF
FHFH
FFFH
HFFG
Episode finished after 28 timesteps, rewards: 0.0
  (Down)
SFFF
FHFH
FFFH
HFFG
Episode finished after 18 timesteps, rewards: 1.0
  (Down)
SFFF
FHFH
FFFH
HFFG
Episode finished after 88 timesteps, rewards: 1.0
  (Down)
SFFF
FHFH
FFFH
HFFG
Episode finished after 45 timesteps, rewards: 0.0
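
Four episodes are too few to judge a policy in a stochastic environment. Below is a sketch of a greedy evaluation loop that reports the success rate over many episodes. It assumes the same `reset()` / `step()` return shapes used in this notebook. `TinyCorridor` is a made-up stand-in environment so the example runs without Gym, and `success_rate` is a hypothetical helper.

```python
class TinyCorridor:
    """Stand-in environment with the same reset/step shape as the cells above."""

    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):
        # Action 1 moves one tile to the right, anything else stays put; tile 2 is the goal.
        if action == 1:
            self.pos += 1
        done = self.pos == 2
        return self.pos, float(done), done, {}


def success_rate(env, qtable, n_episodes=100, max_steps=99):
    """Fraction of greedy episodes that collect the goal reward."""
    total = 0.0
    for _episode in range(n_episodes):
        state = env.reset()
        for _step in range(max_steps):
            row = qtable[state]
            action = max(range(len(row)), key=lambda a: row[a])  # greedy choice
            state, reward, done, _info = env.step(action)
            total += reward
            if done:
                break
    return total / n_episodes


toy_qtable = [[0.0, 1.0], [0.0, 1.0], [0.0, 0.0]]  # made-up values: always prefer "right"
print(success_rate(TinyCorridor(), toy_qtable))     # 1.0
```

With the notebook's objects, a call such as `success_rate(env, qtable, n_episodes=1000)` should give a steadier estimate than the four runs shown above.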
