# Lesson 2 - Reinforcement Learning

## Tutorial: Q-Learning in the FrozenLake environment

### Prof. Paulo Caixeta (profpaulo.oliveria@fiap.com.br)

Acknowledgments: Prof. Ahirton

# Q* Learning with FrozenLake 4x4

In this notebook, we will implement an agent <b>that plays the FrozenLake challenge.</b>

![FrozenLake animation](https://www.gymlibrary.dev/_images/frozen_lake.gif)

The goal of this game is <b>to get from the starting state (S) to the goal state (G)</b> by walking only on frozen tiles (F) and avoiding holes (H). However, the ice is slippery, **so you will not always move in the direction you intend (stochastic environment).**
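To make the slippery dynamics concrete: with Gymnasium's default settings for `is_slippery=True`, the agent moves in the intended direction with probability 1/3 and in each of the two perpendicular directions with probability 1/3. The sketch below mimics only that rule. The names `slippery_outcomes` and `sample_direction` are illustrative and are not part of the Gymnasium API.

```python
import random
from collections import Counter

# Action encoding used by FrozenLake: 0=left, 1=down, 2=right, 3=up.
LEFT, DOWN, RIGHT, UP = 0, 1, 2, 3


def slippery_outcomes(action):
    """Directions the agent may actually move in: the intended one and its two perpendiculars."""
    return [(action - 1) % 4, action, (action + 1) % 4]


def sample_direction(action, rng):
    """Pick one of the three possible outcomes uniformly at random."""
    return rng.choice(slippery_outcomes(action))


rng = random.Random(0)  # seeded only so the sketch is repeatable
counts = Counter(sample_direction(RIGHT, rng) for _ in range(3000))
print(slippery_outcomes(RIGHT))  # [1, 2, 3], i.e. down, right, up
print(counts)                    # each of the three directions shows up about 1000 times
```

Because only one move in three goes where the agent wants, a good policy has to account for where it might slip, not just where it is aiming.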

## Prerequisites

Before we start, **you need to understand**:
- The fundamentals of reinforcement learning
- Q-learning

## Step 0: Installing the dependencies on Google Colab

In [1]:
!pip install gymnasium
!pip install numpy



## Preparation: Importing the libraries

We use 5 libraries:

- `NumPy` for our Q-table
- `Gymnasium` for our FrozenLake environment
- `random` to generate random numbers
- `time` to pause the visualization
- `IPython.display` to watch the game!

In [18]:
import numpy as np
import gymnasium as gym
import random
import time
from IPython.display import clear_output

## Step 1: Creating the environment

- Here we create the 4x4 *FrozenLake* environment.
- *Gymnasium* is a library <b>made up of many environments that we can use to train our agents.</b>

In [14]:
env = gym.make("FrozenLake-v1", desc=None, map_name="4x4", is_slippery=True, render_mode="ansi")

## Step 2: Creating the Q-table

- Now let's create our Q-table. To know how many rows (states) and columns (actions) it needs, we have to compute `action_size` and `state_size`.
- Gymnasium gives us a way to do that: `env.action_space.n` and `env.observation_space.n`
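As a rough illustration of where those two numbers come from: the 4x4 grid has 16 tiles, which FrozenLake numbers row by row (state = row * 4 + column), and there are four actions (left, down, right, up). The helper names `to_state` and `to_position` are hypothetical and exist only for this sketch.

```python
N_ROWS, N_COLS = 4, 4
# Action indices as documented for FrozenLake.
ACTIONS = {0: "left", 1: "down", 2: "right", 3: "up"}


def to_state(row, col):
    """Convert a grid position to the integer state index."""
    return row * N_COLS + col


def to_position(state):
    """Convert an integer state index back to (row, col)."""
    return divmod(state, N_COLS)


state_size = N_ROWS * N_COLS   # number of rows in the Q-table
action_size = len(ACTIONS)     # number of columns in the Q-table
print(state_size, action_size)  # 16 4
print(to_state(3, 3))           # 15, the bottom-right tile
print(to_position(5))           # (1, 1)
```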

In [5]:
action_size = env.action_space.n
state_size = env.observation_space.n

In [6]:
# Create the Q-table filled with zeros (16x4: 16 states, 4 actions)
qtable = np.zeros((state_size, action_size))
print(qtable)

[[0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]]


## Step 3: Defining the hyperparameters

In [7]:
total_episodes = 500         # Total episodes
learning_rate = 0.7          # Learning rate
max_steps = 99               # Max steps per episode
gamma = 0.95                 # Discounting rate

# Exploration parameters
epsilon = 1.0                 # Exploration rate
max_epsilon = 1.0             # Exploration probability at start
min_epsilon = 0.01            # Minimum exploration probability
decay_rate = 0.005            # Exponential decay rate for exploration prob
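
To see what these exploration settings imply, the sketch below evaluates the decay formula used in the training loop, `epsilon = min_epsilon + (max_epsilon - min_epsilon) * exp(-decay_rate * episode)`, at a few episodes. The function name `epsilon_at` is an assumption made for illustration.

```python
import math

max_epsilon = 1.0
min_epsilon = 0.01
decay_rate = 0.005


def epsilon_at(episode):
    """Exploration probability after the given episode under exponential decay."""
    return min_epsilon + (max_epsilon - min_epsilon) * math.exp(-decay_rate * episode)


for episode in (0, 100, 250, 499):
    print(episode, round(epsilon_at(episode), 3))
```

With 500 episodes and a decay rate of 0.005, epsilon ends near 0.09 rather than at `min_epsilon`, so the agent is still choosing a random action in roughly 9% of its steps when training stops.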


![alt text](https://miro.medium.com/v2/resize:fit:1400/1*tyIE_430xbBRzrrjUdYLQw.png)


## Step 4: Implementing the Q-learning algorithm
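
Before the full training loop, a single update can be worked by hand. The sketch below applies the Q-learning rule Q(s,a) := Q(s,a) + lr * [r + gamma * max Q(s',a') - Q(s,a)] to one state-action pair, using the learning rate (0.7) and discount (0.95) defined above. `q_update` is a hypothetical helper, not part of the notebook.

```python
def q_update(q_sa, reward, max_q_next, learning_rate=0.7, gamma=0.95):
    """One tabular Q-learning update for a single (state, action) pair."""
    td_target = reward + gamma * max_q_next          # what the step suggests Q(s,a) should be
    return q_sa + learning_rate * (td_target - q_sa)  # move part of the way toward it


# Stepping onto the goal: reward 1, and the terminal state has value 0.
print(q_update(0.0, 1.0, 0.0))  # 0.7

# One tile earlier: no reward yet, but the next state is now worth 0.7.
print(q_update(0.0, 0.0, 0.7))  # about 0.4655
```

This is how the single reward at the goal propagates backwards through the table, one visited state at a time.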


In [11]:
# List of rewards
rewards = []

# 2. Repeat for every episode (or until learning is stopped)
for episode in range(total_episodes):
    if episode % 50 == 0:  # every 50 episodes
        print("Episode:", episode)
    # Reset the environment
    state, _ = env.reset()
    state = int(state)
    done = False
    total_rewards = 0

    for step in range(max_steps):
        # 3. Choose an action a in the current world state (s)
        ## First we draw a random number
        exp_exp_tradeoff = random.uniform(0, 1)

        ## If this number is greater than epsilon --> exploitation (taking the biggest Q value for this state)
        if exp_exp_tradeoff > epsilon:
            action = np.argmax(qtable[state, :])

        # Otherwise take a random action --> exploration
        else:
            action = env.action_space.sample()

        # Take the action (a) and observe the outcome state (s') and reward (r)
        new_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated

        # Update Q(s,a) := Q(s,a) + lr * [R(s,a) + gamma * max Q(s',a') - Q(s,a)]
        # qtable[new_state, :] : all the actions we can take from the new state
        qtable[state, action] = qtable[state, action] + learning_rate * (reward + gamma * np.max(qtable[new_state, :]) - qtable[state, action])

        total_rewards += reward

        # The new state becomes the current state
        state = new_state

        # If done (we fell into a hole, reached the goal, or hit the step limit): finish the episode
        if done:
            break

    # Reduce epsilon (because we need less and less exploration)
    epsilon = min_epsilon + (max_epsilon - min_epsilon) * np.exp(-decay_rate * episode)
    rewards.append(total_rewards)


print("Score over time: " + str(sum(rewards) / total_episodes))
print(qtable)

Episode: 0
Episode: 50
Episode: 100
Episode: 150
Episode: 200
Episode: 250
Episode: 300
Episode: 350
Episode: 400
Episode: 450
Score over time: 0.098
[[2.39552864e-01 2.29260624e-01 2.21471612e-01 2.38244601e-01]
 [5.00865409e-03 6.22835670e-02 1.06427284e-01 2.21066113e-01]
 [6.29579437e-02 2.03090629e-01 8.61450464e-02 9.92475242e-02]
 [2.30886667e-02 5.49117601e-03 8.43013340e-02 1.10447998e-01]
 [2.41272102e-01 2.02722795e-01 2.78664985e-01 7.20789324e-02]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
 [1.48926964e-02 1.34077615e-02 4.21853538e-01 5.04672293e-04]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
 [2.76705900e-01 2.67423756e-02 2.74424439e-02 2.98479230e-01]
 [1.29037131e-01 3.88859742e-01 5.86097182e-02 1.08679580e-01]
 [4.35504971e-01 4.32150912e-01 3.52689643e-01 4.44483460e-01]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
 [5.16815861e-01 2

In [12]:
# [[2.39552864e-01 2.29260624e-01 2.21471612e-01 2.38244601e-01]
#  [5.00865409e-03 6.22835670e-02 1.06427284e-01 2.21066113e-01]
#  [6.29579437e-02 2.03090629e-01 8.61450464e-02 9.92475242e-02]
#  [2.30886667e-02 5.49117601e-03 8.43013340e-02 1.10447998e-01]
#  [2.41272102e-01 2.02722795e-01 2.78664985e-01 7.20789324e-02]
#  [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
#  [1.48926964e-02 1.34077615e-02 4.21853538e-01 5.04672293e-04]
#  [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
#  [2.76705900e-01 2.67423756e-02 2.74424439e-02 2.98479230e-01]
#  [1.29037131e-01 3.88859742e-01 5.86097182e-02 1.08679580e-01]
#  [4.35504971e-01 4.32150912e-01 3.52689643e-01 4.44483460e-01]
#  [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
#  [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
#  [5.16815861e-01 2.17806729e-01 5.86103870e-01 5.92147639e-01]
#  [7.81334360e-01 6.78413574e-01 9.68027119e-01 6.50978770e-01]
#  [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]]
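
One way to read this table is to take the best action in each row and lay the result out on the grid. The sketch below does that for the values printed above (rounded to three decimals). It assumes the default 4x4 FrozenLake layout, and `greedy_policy` and `render_policy` are illustrative names.

```python
# Q-table from the training run above, rounded to three decimals.
QTABLE = [
    [0.240, 0.229, 0.221, 0.238],
    [0.005, 0.062, 0.106, 0.221],
    [0.063, 0.203, 0.086, 0.099],
    [0.023, 0.005, 0.084, 0.110],
    [0.241, 0.203, 0.279, 0.072],
    [0.0, 0.0, 0.0, 0.0],
    [0.015, 0.013, 0.422, 0.001],
    [0.0, 0.0, 0.0, 0.0],
    [0.277, 0.027, 0.027, 0.298],
    [0.129, 0.389, 0.059, 0.109],
    [0.436, 0.432, 0.353, 0.444],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.517, 0.218, 0.586, 0.592],
    [0.781, 0.678, 0.968, 0.651],
    [0.0, 0.0, 0.0, 0.0],
]

MAP_4X4 = ["SFFF", "FHFH", "FFFH", "HFFG"]  # S=start, F=frozen, H=hole, G=goal
ARROWS = "<v>^"  # action indices 0=left, 1=down, 2=right, 3=up


def greedy_policy(qtable):
    """Index of the largest Q-value in each row (first index wins ties, like np.argmax)."""
    return [max(range(len(row)), key=row.__getitem__) for row in qtable]


def render_policy(policy):
    """Draw the policy as a grid, keeping hole and goal tiles as letters."""
    lines = []
    for r, map_row in enumerate(MAP_4X4):
        cells = [tile if tile in "HG" else ARROWS[policy[r * 4 + c]]
                 for c, tile in enumerate(map_row)]
        lines.append(" ".join(cells))
    return "\n".join(lines)


policy = greedy_policy(QTABLE)
print(render_policy(policy))
# < ^ v ^
# > H > H
# ^ v ^ H
# H ^ > G
```

The all-zero rows are the holes and the goal: episodes end there, so no action is ever taken from them and their values are never updated. On slippery ice the preferred action does not always point toward the goal, since steering away from a hole can be safer than heading straight for the target.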

## Step 5: Playing (and winning!?) FrozenLake!

- After 500 episodes, our Q-table can be used as a "cheat sheet" to play *FrozenLake*.
- By running this cell, you can watch our agent playing FrozenLake.

In [None]:
max_steps = 100

for episode in range(5):
    state, info = env.reset()
    state = int(state)
    print("****************************************************")
    print("EPISODE ", episode)

    for step in range(max_steps):

        # Take the action (index) that has the maximum expected future reward in this state
        action = np.argmax(qtable[state, :])

        new_state, reward, terminated, truncated, _ = env.step(action)

        done = terminated or truncated
        clear_output(wait=True)
        print(env.render())
        time.sleep(1)
        if done:
            # Here we only report the final state (to see whether our agent reached the goal or fell into a hole)
            if new_state == 15:
                print("We reached the goal!")
            elif terminated:
                print("We fell into a hole!")
            else:
                print("We ran out of steps!")

            # Print the number of steps it took (step is zero-based).
            print("Number of steps:", step + 1)
            time.sleep(5)
            break
        state = int(new_state)
env.close()
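
Five rendered episodes give a feel for the agent, but a success rate over many episodes is a steadier measure. The sketch below counts how often a greedy policy ends an episode with a positive reward. It assumes only the Gymnasium-style `reset()` and `step()` return values used above, and `evaluate_policy` is a hypothetical helper name.

```python
def evaluate_policy(env, policy, episodes=100, max_steps=100):
    """Fraction of episodes that end with a positive reward when following `policy`.

    `policy` is any sequence indexed by state that gives the action to take.
    """
    wins = 0
    for _ in range(episodes):
        state, _ = env.reset()
        for _ in range(max_steps):
            state, reward, terminated, truncated, _ = env.step(policy[int(state)])
            if terminated or truncated:
                if reward > 0:  # in FrozenLake only the goal gives a reward
                    wins += 1
                break
    return wins / episodes


# Possible usage in this notebook (assumes a freshly created env, since the one above was closed):
# eval_env = gym.make("FrozenLake-v1", map_name="4x4", is_slippery=True)
# print(evaluate_policy(eval_env, np.argmax(qtable, axis=1), episodes=1000))
```

No rendering or sleeping is involved, so a thousand episodes run quickly. The result varies from run to run because the environment is stochastic.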