## Solving 'FrozenLake-v1' with Q-Learning
This environment is fully documented in the [Gymnasium documentation](https://gymnasium.farama.org/environments/toy_text/frozen_lake/#frozen-lake).

### Set Up

In [2]:
import numpy as np
import matplotlib.pyplot as plt
import gymnasium as gym


### Playing the Game in 'human' mode

In [15]:
env = gym.make('FrozenLake-v1', desc=None, map_name="4x4", is_slippery=False, render_mode='human')
env.reset()
for _ in range(20):
    action = env.action_space.sample()
    observation, reward, terminated, truncated, info = env.step(action)
    print(f"obs: {observation}, rew: {reward}, term: {terminated}, trunc: {truncated}")
    if terminated or truncated:
        #env.reset()
        print('finished')
        break
    #time.sleep(0.5)
env.close()

obs: 0, rew: 0.0, term: False, trunc: False
obs: 0, rew: 0.0, term: False, trunc: False
obs: 4, rew: 0.0, term: False, trunc: False
obs: 5, rew: 0.0, term: True, trunc: False
finished


In [7]:
# Rendering in RGB format
# env = gym.make('FrozenLake-v1', desc=None, map_name="4x4", is_slippery=False, render_mode='rgb_array')

In [9]:
action_size = env.action_space.n # possible actions
state_size = env.observation_space.n # observation = state
print(action_size)
print(state_size)

4
16
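
The 16 states are flat indices into the 4x4 grid. The Gymnasium documentation gives the observation as `current_row * ncols + current_col`, so an index can be turned back into grid coordinates with integer division. A minimal sketch, where `state_to_row_col` is a hypothetical helper and not part of Gymnasium:

```python
N_COLS = 4  # width of the "4x4" map

def state_to_row_col(state, n_cols=N_COLS):
    """Convert a flat FrozenLake state index into (row, col) grid coordinates."""
    return divmod(state, n_cols)

for state in (0, 4, 5, 15):
    print(state, state_to_row_col(state))
```

State 5 maps to row 1, column 1, which is a hole in the default 4x4 map. That is consistent with the random rollout above ending at `obs: 5` with `term: True` and no reward.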


### Building the Q-Table

In [10]:
# rows: States      columns: Actions
q_table = np.zeros([state_size,action_size])
q_table

array([[0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.]])

The Q-learning update rule, derived from the Bellman optimality equation, is:


`Q(s, a) ← Q(s, a) + α [r + γ max_a' Q(s', a') - Q(s, a)]`

Where:
* `Q(s, a)` is the Q-value for a given state-action pair.
* `s` and `a` are the current state and action.
* `r` is the immediate reward received for taking action `a` in state `s`.
* `s'` is the next state reached after taking action `a`.
* `a'` ranges over the actions available in the next state `s'`; the `max` picks the one with the highest Q-value.
* `α` is the learning rate, which controls how far each update moves the Q-value toward the new estimate.
* `γ` is the discount factor, which controls how much future rewards count relative to immediate ones.
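
To make the update concrete, here is a small worked example with made-up transitions. The function name `q_update` and the sample rewards are assumptions for illustration; the learning rate and discount factor match the values used later in this notebook.

```python
# ALPHA and GAMMA match the notebook; the transitions below are hypothetical.
ALPHA = 0.8
GAMMA = 0.95

def q_update(old_q, reward, best_next_q, alpha=ALPHA, gamma=GAMMA):
    """One Q-learning update for a single state-action pair."""
    td_target = reward + gamma * best_next_q
    return old_q + alpha * (td_target - old_q)

# Step into the goal: reward 1, terminal next state (best_next_q = 0)
q_goal_step = q_update(old_q=0.0, reward=1.0, best_next_q=0.0)

# One step earlier: no reward yet, but the next state now has value q_goal_step
q_prev_step = q_update(old_q=0.0, reward=0.0, best_next_q=q_goal_step)

print(q_goal_step, q_prev_step)
```

The two values come out at roughly 0.8 and 0.608. This shows how the goal reward propagates backwards through the table, shrinking by the discount factor and the learning rate at each step.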

In [11]:
EPOCS = 2000 # number of episodes the agent plays
ALPHA = 0.8 # learning rate
GAMMA = 0.95 # discount rate

# exploitation and exploration params
epsilon = 1.0
max_epsilon = 1.0
min_epsilon = 0.01
decay_rate = 0.001

In [12]:
def epsilon_greedy_action_selection(epsilon, q_table, discrete_state):
    random_number = np.random.random()
    # Exploitation (choose the action that maximizes Q)
    if random_number > epsilon:
        state_row = q_table[discrete_state,:]
        action = np.argmax(state_row)
    # Exploration (choose a random action)
    else:
        action = env.action_space.sample()
    
    return action
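
One detail of the greedy branch: `np.argmax` returns the first maximal index, so on an untrained all-zero row it always picks action 0. With a high epsilon this rarely matters, but a possible variation is to break ties at random. The sketch below is not part of the original notebook, and `greedy_action_random_ties` is a hypothetical helper name.

```python
import numpy as np

def greedy_action_random_ties(q_row, rng):
    """Return the index of a maximal entry of q_row, chosen uniformly among ties."""
    best_actions = np.flatnonzero(q_row == np.max(q_row))
    return int(rng.choice(best_actions))

rng = np.random.default_rng(0)  # seed chosen arbitrarily for reproducibility
untrained_row = np.zeros(4)

print(np.argmax(untrained_row))  # first maximal index, so 0 on an all-zero row

# With random tie-breaking, repeated calls spread over all tied actions
picks = {greedy_action_random_ties(untrained_row, rng) for _ in range(200)}
print(sorted(picks))
```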

In [13]:
def compute_next_q_value(old_q_value, reward, next_optimal_q_value):
    return old_q_value + ALPHA * (reward + GAMMA * next_optimal_q_value - old_q_value)

In [14]:
def reduce_epsilon(epsilon, epoch):
    return min_epsilon + (max_epsilon - min_epsilon) * np.exp(-decay_rate*epoch)
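
To get a feel for this schedule, the snippet below evaluates the same formula at a few episode indices, using the hyperparameters defined above. `epsilon_at` is a hypothetical standalone copy of the formula, written with `math.exp` so it runs without the rest of the notebook.

```python
import math

# Values copied from the notebook's hyperparameter cell
max_epsilon = 1.0
min_epsilon = 0.01
decay_rate = 0.001

def epsilon_at(epoch):
    """Exponentially decayed exploration rate for a given episode index."""
    return min_epsilon + (max_epsilon - min_epsilon) * math.exp(-decay_rate * epoch)

for epoch in (0, 500, 1000, 2000):
    print(epoch, round(epsilon_at(epoch), 3))
```

With `decay_rate = 0.001`, epsilon is still about 0.37 after 1000 episodes and about 0.14 after 2000, so the agent keeps exploring for the whole training run.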

In [None]:
rewards = []
log_interval = 1000

env = gym.make('FrozenLake-v1', desc=None, map_name="4x4", is_slippery=False) 

# Agent plays the game
for epoch in range(EPOCS):
    state, _ = env.reset()
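    terminated = False
    truncated = False
    total_rewards = 0

    # Stop on a terminal state (hole or goal) or when the step limit truncates the episode
    while not (terminated or truncated):

        # ACTION
        action = epsilon_greedy_action_selection(epsilon, q_table, state)

        # what will be the impact of the action?
        new_state, reward, terminated, truncated, info = env.step(action)

        # Old (current) Q VALUE Q(s_t, a_t)
        old_q_value = q_table[state, action]

        # Best Q VALUE reachable from the next state: max over a' of Q(s_t+1, a')
        next_optimal_q_value = np.max(q_table[new_state,:])

        # Compute next Q VALUE
        next_q = compute_next_q_value(old_q_value, reward, next_optimal_q_value)

        # (Assumed completion of the loop, using the helpers and variables defined above)
        # Write the updated value back into the Q-table
        q_table[state, action] = next_q

        # Track the episode reward and move to the next state
        total_rewards += reward
        state = new_state

    # Decay epsilon and record the episode result
    epsilon = reduce_epsilon(epsilon, epoch)
    rewards.append(total_rewards)

    if (epoch + 1) % log_interval == 0:
        print(f"episode {epoch + 1}: mean reward over last {log_interval} episodes = {np.mean(rewards[-log_interval:]):.3f}")

env.close()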


In [16]:
np.random.random()

0.4329766478685064

In [17]:
env.action_space.sample()

2

In [20]:
env = gym.make('FrozenLake-v1', desc=None, map_name="4x4", is_slippery=False) 
env.reset()
env.step(2)

(1, 0.0, False, False, {'prob': 1.0})