# Quick overview of gymnasium: Q-learning and SARSA
[Gymnasium website](https://gymnasium.farama.org/)

In [1]:
import pandas as pd
import numpy as np
from tqdm.notebook import tqdm

In [2]:
import gymnasium as gym
import pygame

## Taxi driver
There are four designated locations in the grid world indicated by R(ed), G(reen), Y(ellow), and B(lue). When the episode starts, the taxi starts off at a random square and the passenger is at a random location. The taxi drives to the passenger’s location, picks up the passenger, drives to the passenger’s destination (another one of the four specified locations), and then drops off the passenger. Once the passenger is dropped off, the episode ends.

**Actions**
- 0: move south;
- 1: move north;
- 2: move east;
- 3: move west;
- 4: pick up the passenger;
- 5: drop off the passenger.

The environment has 500 discrete states: 25 taxi positions $\times$ 5 passenger locations (the four designated locations, plus inside the taxi) $\times$ 4 destinations.
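The state count can be checked by packing the components into a single integer. The sketch below follows the encoding formula given in the Gymnasium Taxi documentation; the helper names `encode_state` and `decode_state` are illustrative and are not part of the Gymnasium API. Passenger locations are assumed to be indexed 0 to 3 for R, G, Y, B and 4 for "in the taxi", and destinations 0 to 3 for R, G, Y, B.

```python
def encode_state(taxi_row, taxi_col, passenger_loc, destination):
    """Pack (row, col, passenger, destination) into one integer in [0, 500)."""
    return ((taxi_row * 5 + taxi_col) * 5 + passenger_loc) * 4 + destination


def decode_state(state):
    """Invert encode_state by peeling off the components in reverse order."""
    state, destination = divmod(state, 4)
    state, passenger_loc = divmod(state, 5)
    taxi_row, taxi_col = divmod(state, 5)
    return taxi_row, taxi_col, passenger_loc, destination


# Taxi at row 1, column 1; passenger waiting at G (1); destination R (0)
print(encode_state(1, 1, 1, 0))
print(decode_state(124))
# The largest index corresponds to the last of the 500 states
print(encode_state(4, 4, 4, 3))
```

The largest encodable value is 499, which matches the `Discrete(500)` observation space.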

Each step yields a reward of -1 unless another reward is triggered; successfully delivering the passenger yields +20; executing a pickup or dropoff action illegally yields -10.
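To make the reward scheme concrete, here is a small sketch that sums the rewards of one episode. The helper `episode_return` is hypothetical, and the reward sequence is made up for illustration: 12 ordinary steps followed by a successful dropoff, assuming the final step yields +20 in place of the usual -1.

```python
def episode_return(rewards, gamma=1.0):
    """Discounted sum of rewards, accumulated backwards from the last step."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Illustrative episode: 12 steps at -1 each, then a successful dropoff (+20)
rewards = [-1] * 12 + [20]
print(episode_return(rewards))             # undiscounted return
print(episode_return(rewards, gamma=0.7))  # discounted return
```

With `gamma=1.0` the return is simply the sum of the rewards; a smaller `gamma` shrinks the contribution of the final +20 the later it arrives.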

In [3]:
env = gym.make("Taxi-v3", render_mode='ansi')

In [4]:
env.reset()
print(env.render())

+---------+
|R: | : :G|
| : | : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+

(Terminal color codes stripped: the taxi is in row 3, column 2; the passenger waits at G; the destination is R.)




### Utility functions

In [5]:
print(env.env.metadata)
print("Action Space {}".format(env.action_space))
print("State Space {}".format(env.observation_space))
random_action = env.action_space.sample()
print(random_action)
print(env.step(random_action))
# env.step(action) returns (next_state, reward, terminated, truncated, info)

{'render_modes': ['human', 'ansi', 'rgb_array'], 'render_fps': 4}
Action Space Discrete(6)
State Space Discrete(500)
1
(124, -1, False, False, {'prob': 1.0, 'action_mask': array([1, 1, 0, 1, 0, 0], dtype=int8)})
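The `action_mask` entry in `info` flags which actions would change the state from the current position. A minimal sketch of reading such a mask, using the values printed above; `ACTION_NAMES` and `valid_actions` are illustrative names, not Gymnasium identifiers:

```python
# Action indices follow the list in the "Actions" section
ACTION_NAMES = ["south", "north", "east", "west", "pickup", "dropoff"]


def valid_actions(action_mask):
    """Return the indices of actions whose mask entry is nonzero."""
    return [a for a, allowed in enumerate(action_mask) if allowed]


mask = [1, 1, 0, 1, 0, 0]  # mask from the step output above
print(valid_actions(mask))
print([ACTION_NAMES[a] for a in valid_actions(mask)])
```

Gymnasium's `Discrete.sample` also accepts a mask argument, so `env.action_space.sample(info["action_mask"])` should restrict random sampling to these actions.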


## Example: use gym to implement Q-learning

In [6]:
from IPython.display import clear_output
import time

In [7]:
env = gym.make("Taxi-v3", render_mode='ansi')

# Initialize the q-table with zero values
q_table = np.zeros([env.observation_space.n, env.action_space.n])

# Hyperparameters
eta = 0.1  # learning rate
gamma = 0.7  # discount factor
epsilon = 0.1  # exploration probability (epsilon-greedy)
max_steps = 100  # maximum number of steps per episode
sleep_time = .05  # pause after rendering, in seconds

# Random generator
rng = np.random.default_rng()

for i in range(1000):
    state, info = env.reset()
    # Show the initial state of the episode
    clear_output(wait=True)
    print(env.render())
    time.sleep(sleep_time)

    for j in range(max_steps):
        # Epsilon-greedy action selection
        if rng.random() < epsilon:
            action = env.action_space.sample()  # Explore the action space
        else:
            action = np.argmax(q_table[state])  # Exploit learned values

        # Apply the action and observe the transition
        next_state, reward, terminated, truncated, info = env.step(action)

        # Q-learning update: move Q(s, a) toward the bootstrapped target
        q_table[state, action] = (1 - eta) * q_table[state, action] + eta * (
            reward + gamma * np.max(q_table[next_state]))

        # Advance to the next state
        state = next_state
        if terminated or truncated:
            break

+---------+
|R: | : :G|
| : | : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+

(Terminal color codes stripped: the taxi is on G in the top-right corner; the passenger waits at R; the destination is Y.)




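The update inside the training loop can be isolated and checked by hand. The sketch below applies the same rule to a tiny two-state, two-action table built from plain lists instead of NumPy; the function name `q_update` and the toy table are assumptions made for illustration.

```python
def q_update(q_table, state, action, reward, next_state, eta=0.1, gamma=0.7):
    """One tabular Q-learning step; returns the updated Q(state, action)."""
    # Bootstrapped target: immediate reward plus discounted best next value
    target = reward + gamma * max(q_table[next_state])
    q_table[state][action] = (1 - eta) * q_table[state][action] + eta * target
    return q_table[state][action]


# Toy table: 2 states x 2 actions, initialized to zero like q_table above
q = [[0.0, 0.0] for _ in range(2)]

# Same transition twice: state 0, action 1, reward -1, next state 1
first = q_update(q, 0, 1, -1, 1)
second = q_update(q, 0, 1, -1, 1)
print(first, second)
```

Starting from zeros, the first update gives 0.1 × (-1) = -0.1, and the second gives 0.9 × (-0.1) + 0.1 × (-1) = -0.19, up to floating-point rounding. The estimate drifts toward the target a fraction `eta` at a time.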