##**Smart Taxi** : https://gymnasium.farama.org/environments/toy_text/taxi/

In [1]:
!pip install numpy==1.23.5  # install this and restart notebook to fix the issue with gym library



In [2]:
!pip install gym==0.25.2




In [3]:
# Importing necessary libraries
import gym        # OpenAI Gym for RL environments
import numpy as np
import pickle, os



In [4]:
print("Gym version:", gym.__version__)


Gym version: 0.25.2


In [5]:
# Create the Taxi environment from OpenAI Gym
env = gym.make("Taxi-v3")



In [6]:
# Reset the environment and get the initial state - it will randomly pick a state
state = env.reset()



In [7]:
state # look at the state generated : this will initialize taxi at a random state

212

In [8]:
# Show the total number of states (500 in Taxi-v3)
env.observation_space.n


500

In [9]:
# Render the initial grid — taxi environment with walls, pickup/drop points
print(env.render(mode="ansi"))


+---------+
|[35mR[0m: | : :G|
| : | : : |
|[43m [0m: : : : |
| | : | : |
|Y| : |[34;1mB[0m: |
+---------+






In [10]:
# There are 6 possible actions in Taxi-v3:
# 0: South, 1: North, 2: East, 3: West, 4: Pickup, 5: Drop-off
n_states = env.observation_space.n
n_actions = env.action_space.n

n_actions  # Will output 6



6

In [11]:
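# You can also manually set a specific environment state for a demo.
# Note: in this run the grid printed below is identical to the one for state 212,
# so the assignment did not change the taxi's state (`env.env` is a wrapper here).
# Setting `env.unwrapped.s = 300` instead should reach the underlying Taxi environment.
env.env.s = 300
print(env.render(mode="ansi"))

# BLUE = PICKUP LOCATION
# PINK = DROP LOCATION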

+---------+
|[35mR[0m: | : :G|
| : | : : |
|[43m [0m: : : : |
| | : | : |
|Y| : |[34;1mB[0m: |
+---------+




###Actions:
* 0: Move south (down)

* 1: Move north (up)

* 2: Move east (right)

* 3: Move west (left)

* 4: Pickup passenger

* 5: Drop off passenger

In [12]:
# Take actions manually to see how the taxi behaves
env.step(0)   # Move south
print(env.render(mode="ansi"))



+---------+
|[35mR[0m: | : :G|
| : | : : |
| : : : : |
|[43m [0m| : | : |
|Y| : |[34;1mB[0m: |
+---------+
  (South)



In [13]:
env.step(2)   # Try to move east (blocked by the wall here, so the taxi stays in place)
print(env.render(mode="ansi"))



+---------+
|[35mR[0m: | : :G|
| : | : : |
| : : : : |
|[43m [0m| : | : |
|Y| : |[34;1mB[0m: |
+---------+
  (East)



In [14]:
env.step(0)   # Move south again
print(env.render(mode="ansi"))

+---------+
|[35mR[0m: | : :G|
| : | : : |
| : : : : |
| | : | : |
|[43mY[0m| : |[34;1mB[0m: |
+---------+
  (South)



In [27]:
# Let's try a random action and observe the outcome
env.step(env.action_space.sample())


# the output has the format: (next_state, reward, done, info)
# each of these values is explained below

(85,
 -10,
 True,
 {'prob': 1.0,
  'action_mask': array([1, 0, 0, 1, 1, 0], dtype=int8),
  'TimeLimit.truncated': True})


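###  Interpretation:
The walkthrough below uses the values `(447, -1, False, {...})` from a different run of the same cell, not the exact output printed above; the tuple has the same four fields either way.

#### 1. `447` → **Next State**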

* This is the encoded state **number**. In `Taxi-v3`, the state is a single integer (from 0 to 499) that encodes the taxi’s:

  * Row (5 possible)
  * Column (5 possible)
  * Passenger location (5 possible: at 4 locations or in the taxi)
  * Destination (4 possible)
* So, `447` is just a state ID.
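
To make the encoding concrete, here is a minimal sketch of how a state ID can be packed and unpacked. The function names `encode` and `decode` are my own; the formula assumes the usual Taxi-v3 layout of row, column, passenger location (0-3 for R, G, Y, B and 4 for "in the taxi") and destination (0-3).

```python
def encode(taxi_row, taxi_col, passenger_loc, destination):
    """Pack the four components into one state ID in the range 0..499."""
    state = taxi_row                    # 5 possible rows
    state = state * 5 + taxi_col        # 5 possible columns
    state = state * 5 + passenger_loc   # 5 possible passenger locations
    state = state * 4 + destination     # 4 possible destinations
    return state


def decode(state):
    """Unpack a state ID into (taxi_row, taxi_col, passenger_loc, destination)."""
    state, destination = divmod(state, 4)
    state, passenger_loc = divmod(state, 5)
    taxi_row, taxi_col = divmod(state, 5)
    return taxi_row, taxi_col, passenger_loc, destination


print(decode(447))         # (4, 2, 1, 3)
print(encode(4, 2, 1, 3))  # 447
print(decode(212))         # (2, 0, 3, 0)
```

The last line decodes the initial state from the start of the notebook: taxi in row 2, column 0, passenger at B (index 3), destination R (index 0), which agrees with the first rendered grid.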

#### 2. `-1` → **Reward**

* The environment gives **-1** for every ordinary step, **-10** for an illegal pickup or drop-off, and **+20** when the passenger is dropped at the destination.
* The per-step penalty encourages the agent to find the **shortest path**.
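
As a rough worked example of how these rewards add up over an episode, the sketch below computes the undiscounted total for an episode that ends with a successful drop-off. It assumes the standard Taxi-v3 reward scheme (-1 per ordinary step, -10 per illegal pickup or drop-off, +20 for the final drop-off); the helper name `episode_return` is hypothetical.

```python
STEP_REWARD = -1      # every ordinary step, including a legal pickup
ILLEGAL_REWARD = -10  # pickup or drop-off attempted at the wrong place
DROPOFF_REWARD = 20   # successful drop-off, which ends the episode


def episode_return(n_steps, n_illegal=0):
    """Total reward of an episode ending in a successful drop-off.

    n_steps:   ordinary steps taken before the final drop-off
    n_illegal: illegal pickup/drop-off attempts along the way
    """
    return n_steps * STEP_REWARD + n_illegal * ILLEGAL_REWARD + DROPOFF_REWARD


print(episode_return(12))               # 8
print(episode_return(12, n_illegal=2))  # -12
```

Two illegal actions are enough to turn an otherwise positive episode negative, which is why a random agent ends up with a large negative total.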

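#### 3. `False` → **Done**

* `False` means the episode is **not yet finished**.
* It becomes `True` when the taxi drops the passenger at the correct destination, or when the `TimeLimit` wrapper cuts the episode off (200 steps for `Taxi-v3`). In the second case `info` contains `'TimeLimit.truncated': True`, as in the output printed above.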

#### 4. `{'prob': 1.0, 'action_mask': array([0, 1, 0, 1, 0, 0], dtype=int8)}` → **Info Dictionary**

* This extra dictionary gives **metadata** about the environment transition.

Let's look into both keys:

##### a. `'prob': 1.0`

* The **probability** of the transition happening is 1.0.
* Taxi is a **deterministic** environment, so the result of an action in a given state is always the same; only the initial state is chosen at random.

##### b. `'action_mask': array([0, 1, 0, 1, 0, 0])`

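* This shows which **actions would currently change the state**: a `1` marks an action that leads to a new state, a `0` marks an action that would leave the state unchanged (for example, driving into a wall).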

* In `Taxi-v3`, there are **6 possible actions**:

  ```
  0 = south
  1 = north
  2 = east
  3 = west
  4 = pickup
  5 = dropoff
  ```

* The mask `[0, 1, 0, 1, 0, 0]` means:

  * Action 1 (north) and 3 (west) would **change the state** (`1`)
  * The others would **have no effect** (`0`); the environment still accepts them, but the state stays the same
  * For example, if the taxi is not at the passenger location, `pickup` (4) is invalid and masked out.
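
One way to use the mask is to restrict random exploration to the actions it marks with `1`. The sketch below does this with plain Python; `sample_masked_action` is a hypothetical helper, not part of Gym (newer Gymnasium releases, as far as I know, accept a mask directly in `action_space.sample`).

```python
import random


def sample_masked_action(action_mask, rng=random):
    """Pick a random action among those whose mask entry is 1."""
    valid = [a for a, allowed in enumerate(action_mask) if allowed]
    if not valid:
        # Fallback assumption: with nothing marked valid, sample from all actions
        return rng.randrange(len(action_mask))
    return rng.choice(valid)


mask = [0, 1, 0, 1, 0, 0]          # north and west are the only useful moves
print(sample_masked_action(mask))  # prints 1 or 3
```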

---

###  Why is `action_mask` Useful?

* It's helpful when using agents that need to know which actions are legal (e.g., in **masked reinforcement learning**).
* Prevents trying invalid moves like pickup when no passenger is there.

---

###  Summary

The output means:

* After taking an action, you're now in state `447`
* You got a penalty of `-1`
* The episode is not done yet
* Only `north` and `west` would change the state on the next step



### Just for testing purposes:
 - we reset the taxi to a random state and let it take random actions until it completes a drop-off
 - at the end we report the number of steps and the total reward

In [28]:
# ----------------------------------------------
# 🔍 Test: How well does a random agent perform?
# ----------------------------------------------

state = env.reset()
counter = 0
tot_reward = 0
reward = None


In [29]:
# Run until we get the max reward of 20 (i.e., successful drop-off)
while reward != 20:
    state, reward, done, info = env.step(env.action_space.sample())  # Random action
    counter += 1
    tot_reward += reward

print("Solved in {} Steps with a Total Reward of {}".format(counter, tot_reward))



Solved in 1370 Steps with a Total Reward of -5597


In [30]:
# As seen above, a random agent is unlikely to end up with a positive cumulative reward on this problem

In [31]:
# Using the above logic we just need to add an extra step to update Q-Matrix as shown below

In [32]:

# ----------------------------------------------
# ✅ Q-Learning Implementation
# ----------------------------------------------

# Initialize Q-table with all zeros (states × actions)
Q = np.zeros([n_states, n_actions])

Q.shape  # (500, 6) for Taxi-v3

episodes = 1000      # Number of training episodes
alpha = 0.7          # Learning rate (step size of each Q-update)
# Note: this loop applies no discount to the next-state value (discount factor = 1)

# Loop over episodes to train the agent
for episode in range(1, episodes + 1):
    done = False
    G, reward = 0, 0   # G accumulates the total reward of this episode
    state = env.reset()
    print("Initial State = {} ".format(state))

    # Run until successful drop-off (reward == 20)
    while reward != 20:
        # Choose best known action (greedy); the all-zero table is optimistic
        # compared with the negative step rewards, so untried actions get picked
        action = np.argmax(Q[state])
        # Take the action and observe the outcome
        state2, reward, done, info = env.step(action)
        # Update Q-table using Q-learning update rule (off-policy)
        Q[state, action] += alpha * (reward + np.max(Q[state2]) - Q[state, action])
        G += reward
        state = state2




Initial State = 389 
Initial State = 153 
Initial State = 364 
Initial State = 241 
Initial State = 86 
Initial State = 86 
Initial State = 31 
Initial State = 472 
Initial State = 348 
Initial State = 53 
Initial State = 368 
Initial State = 329 
Initial State = 193 
Initial State = 251 
Initial State = 341 
Initial State = 303 
Initial State = 21 
Initial State = 49 
Initial State = 267 
Initial State = 91 
Initial State = 152 
Initial State = 208 
Initial State = 108 
Initial State = 61 
Initial State = 81 
Initial State = 191 
Initial State = 166 
Initial State = 294 
Initial State = 22 
Initial State = 122 
Initial State = 146 
Initial State = 364 
Initial State = 326 
Initial State = 486 
Initial State = 148 
Initial State = 324 
Initial State = 492 
Initial State = 154 
Initial State = 189 
Initial State = 173 
Initial State = 103 
Initial State = 51 
Initial State = 314 
Initial State = 429 
Initial State = 428 
Initial State = 208 
Initial State = 226 
Initial State = 473 
Ini

In [33]:
Q

array([[ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ],
       [-6.56889478, -6.3       , -6.64152656, -6.3       , 11.        ,
        -7.        ],
       [-4.63106203, -4.2       , -4.2101871 , -4.2       , 15.        ,
        -7.        ],
       ...,
       [-2.8       , -2.457     , -2.8       , -3.187639  , -7.        ,
        -7.        ],
       [-4.9       , -4.89600181, -4.9       , -4.7463598 , -7.        ,
        -7.        ],
       [-1.4       , -1.4       , -1.4       ,  8.89      , -7.        ,
        -7.        ]])
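
The update in the training loop above scales the temporal-difference error by 0.7 and adds the best next-state value without discounting it. The sketch below writes the same rule with the step size and the discount as separate, explicitly named parameters (`alpha` and `gamma` are my naming, and the function itself is hypothetical). It uses plain lists so it runs without NumPy or Gym.

```python
def q_learning_update(Q, state, action, reward, next_state, alpha=0.7, gamma=1.0):
    """Apply one tabular Q-learning update in place and return the new value."""
    best_next = max(Q[next_state])          # value of the greedy next action
    td_target = reward + gamma * best_next  # what Q[state][action] should move towards
    Q[state][action] += alpha * (td_target - Q[state][action])
    return Q[state][action]


# Tiny demo table: 3 states x 6 actions, all zeros
Q_demo = [[0.0] * 6 for _ in range(3)]

# An illegal drop-off (reward -10) from a fresh table
print(q_learning_update(Q_demo, state=0, action=5, reward=-10, next_state=1))  # -7.0
```

This is consistent with the many `-7.` entries in the pickup and drop-off columns of the table above: a single illegal attempt from a zero-initialised entry gives 0.7 × (-10) = -7.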

In [34]:
# After training, use the Q-table to see how the taxi behaves
state = env.reset()
done = None






In [35]:
while not done:
    action = np.argmax(Q[state])
    state, reward, done, info = env.step(action)
    print(env.render(mode="ansi"))

+---------+
|R: | : :G|
| : | : : |
| : : :[43m [0m: |
| | : | : |
|[35mY[0m| : |[34;1mB[0m: |
+---------+
  (West)

+---------+
|R: | : :G|
| : | : : |
| : : : : |
| | : |[43m [0m: |
|[35mY[0m| : |[34;1mB[0m: |
+---------+
  (South)

+---------+
|R: | : :G|
| : | : : |
| : : : : |
| | : | : |
|[35mY[0m| : |[34;1m[43mB[0m[0m: |
+---------+
  (South)

+---------+
|R: | : :G|
| : | : : |
| : : : : |
| | : | : |
|[35mY[0m| : |[42mB[0m: |
+---------+
  (Pickup)

+---------+
|R: | : :G|
| : | : : |
| : : : : |
| | : |[42m_[0m: |
|[35mY[0m| : |B: |
+---------+
  (North)

+---------+
|R: | : :G|
| : | : : |
| : : :[42m_[0m: |
| | : | : |
|[35mY[0m| : |B: |
+---------+
  (North)

+---------+
|R: | : :G|
| : | : : |
| : :[42m_[0m: : |
| | : | : |
|[35mY[0m| : |B: |
+---------+
  (West)

+---------+
|R: | : :G|
| : | : : |
| :[42m_[0m: : : |
| | : | : |
|[35mY[0m| : |B: |
+---------+
  (West)

+---------+
|R: | : :G|
| : | : : |
|[42m_[0m: : : : |
| | : | 

# ----------------------------------------------
# ✅ SARSA Implementation (On-policy)
# ----------------------------------------------

In [None]:

# Set learning rate (α) and discount factor (γ)
alpha = 0.7
gamma = 0.7

In [None]:
# Reset environment and reinitialize Q-table
state = env.reset()
Q = np.zeros([n_states, n_actions])




In [None]:
# ε-greedy policy parameter — balances exploration vs exploitation
epsilon = 0.1  # 10% chance of exploring, 90% chance of exploiting



In [None]:
# Function to choose action based on ε-greedy policy
def choose_action(state):
    if np.random.uniform(0, 1) < epsilon:
        return env.action_space.sample()  # Exploration
    else:
        return np.argmax(Q[state, :])     # Exploitation


# A small cutoff (e.g. epsilon = 0.1) means mostly exploitation
# A large cutoff (e.g. epsilon = 0.9) means mostly exploration
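
To see how the cutoff changes behaviour, the following sketch counts how often an ε-greedy rule explores for a few values of ε. It is a standalone illustration: `epsilon_greedy` is a hypothetical variant of `choose_action` that takes the Q-values and ε as arguments, and the Q-row is copied from the last row of the Q-table printed earlier.

```python
import random


def epsilon_greedy(q_values, epsilon, rng):
    """Return (action, explored) for a single epsilon-greedy decision."""
    if rng.random() < epsilon:
        # Explore: uniform random action
        return rng.randrange(len(q_values)), True
    # Exploit: action with the highest Q-value (first one on ties)
    best = max(range(len(q_values)), key=q_values.__getitem__)
    return best, False


rng = random.Random(0)
q_row = [-1.4, -1.4, -1.4, 8.89, -7.0, -7.0]

for eps in (0.1, 0.5, 0.9):
    explored = sum(epsilon_greedy(q_row, eps, rng)[1] for _ in range(10_000))
    print(f"epsilon={eps}: explored on {explored / 10_000:.1%} of steps")
```

The explored fraction should land close to ε in each case, which is what the cutoff in `choose_action` controls.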

In [None]:
# SARSA update function: uses actual action taken next
def learn(state, stateNext, reward, action, actionNext):
    predict = Q[state, action]
    target = reward + gamma * Q[stateNext, actionNext]
    Q[state, action] = Q[state, action] + alpha * (target - predict)



## The name comes from the five values used in the update above: State, Action, Reward, next State, next Action (S-A-R-S-A)
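
The difference between this on-policy update and the earlier off-policy one lies only in the target. The sketch below puts the two side by side; the function names and the next-state Q-values are hypothetical, chosen so the arithmetic is easy to follow.

```python
def sarsa_target(q_next_row, next_action, reward, gamma):
    """On-policy target: uses the action the agent will actually take next."""
    return reward + gamma * q_next_row[next_action]


def q_learning_target(q_next_row, reward, gamma):
    """Off-policy target: uses the best next action, whatever is taken."""
    return reward + gamma * max(q_next_row)


q_next = [-2.0, 1.0, 4.0, 0.5, -7.0, -7.0]  # hypothetical Q-values of the next state

# Suppose epsilon-greedy happened to explore and picked action 0 next
print(sarsa_target(q_next, next_action=0, reward=-1, gamma=0.5))  # -2.0
print(q_learning_target(q_next, reward=-1, gamma=0.5))            # 1.0
```

When the next action is the greedy one, both targets coincide; they differ only on exploratory steps.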

In [None]:
# Train using SARSA
total_episodes = 100000

for episode in range(total_episodes):
    state = env.reset()
    action = choose_action(state)
    reward = 0

    while reward != 20:
        stateNext, reward, done, info = env.step(action)
        actionNext = choose_action(stateNext)
        learn(state, stateNext, reward, action, actionNext)
        state = stateNext
        action = actionNext

In [None]:
# Play using the SARSA-trained Q-table (greedy play)
state = env.reset()
done = None

while not done:
    action = np.argmax(Q[state])
    state, reward, done, info = env.step(action)
    print(env.render(mode="ansi"))


In [None]:
# ----------------------------------------------
# ❄️ FrozenLake: Optional Extra Assignment
# ----------------------------------------------

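# Load another classic Gym environment
# ('FrozenLake-v1' is the ID registered in gym 0.25.2; 'FrozenLake-v0' belongs to older releases)
env2 = gym.make('FrozenLake-v1')
env2.reset()                      # the environment must be reset before rendering
print(env2.render(mode="ansi"))   # render env2, not the Taxi env

env2.observation_space.n  # Number of states in FrozenLake

# Manually set state (on the unwrapped environment so the change takes effect)
env2.unwrapped.s = 4
print(env2.render(mode="ansi"))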