# Diffusion Q-Learning for Taxi Environment

The Taxi-v3 environment from Gymnasium (formerly OpenAI Gym) is a discrete reinforcement learning task in which an agent must pick up a passenger and drop them off at the correct location as efficiently as possible. Our objective is to adapt the Diffusion Q-Learning algorithm to this environment and validate its performance using our collected offline expert dataset, taxi_q_expert_dataset.csv.
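
For orientation, Taxi-v3 has 500 discrete states and 6 discrete actions, and each state integer packs the taxi row, taxi column, passenger location, and destination. The sketch below unpacks a state integer by hand. The helper `decode_taxi_state` and the `ACTION_NAMES` list are illustrative names, not part of this notebook or of Gymnasium, and the layout is assumed from the Taxi-v3 documentation.

```python
# Illustrative helper (not a Gymnasium API): unpack a Taxi-v3 state integer.
# Layout assumed from the Taxi-v3 documentation:
#   state = ((taxi_row * 5 + taxi_col) * 5 + passenger_loc) * 4 + destination
ACTION_NAMES = ["south", "north", "east", "west", "pickup", "dropoff"]


def decode_taxi_state(state):
    """Return (taxi_row, taxi_col, passenger_loc, destination) for a state id."""
    if not 0 <= state < 500:
        raise ValueError("Taxi-v3 states are integers in [0, 500)")
    state, destination = divmod(state, 4)
    state, passenger_loc = divmod(state, 5)
    taxi_row, taxi_col = divmod(state, 5)
    return taxi_row, taxi_col, passenger_loc, destination


print(decode_taxi_state(328))  # (3, 1, 2, 0)
```

Decoding a few states from the dataset this way is a quick sanity check that the state column really holds Taxi-v3 state ids.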


## Setup

### Imports

In [1]:
import torch
import gymnasium as gym
import pandas as pd
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import copy
from torch.optim import Adam
from torch.optim.lr_scheduler import CosineAnnealingLR
# Additional libraries as needed


### Load Dataset

In [2]:
#Load data
dataset = pd.read_csv('taxi_q_expert_dataset.csv')
#remove header
#dataset = dataset[1:]


#Define the environment
env = gym.make('Taxi-v3')

In [1]:
# Preview the dataset (the "Load Dataset" cell must be run first so that dataset is defined)
print(dataset.head())

## Data Preprocessing and Exploration

To effectively train the Diffusion Q-Learning model, we first need to preprocess and understand the offline expert data. This step involves exploring the distribution of states, actions, and rewards, and converting the data into a structured format for training and testing the model.
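
As a starting point for the exploration, the sketch below summarizes a collection of transitions using only the standard library. The function name `summarize_transitions` and the toy rows are made up for illustration; the tuple order is assumed to match the column order state, action, reward, next_state, done.

```python
from collections import Counter


def summarize_transitions(transitions):
    """Summarize (state, action, reward, next_state, done) tuples."""
    transitions = list(transitions)
    if not transitions:
        raise ValueError("no transitions to summarize")
    actions = Counter(a for _, a, _, _, _ in transitions)
    rewards = [r for _, _, r, _, _ in transitions]
    return {
        "num_transitions": len(transitions),
        "num_unique_states": len({s for s, _, _, _, _ in transitions}),
        "action_counts": dict(sorted(actions.items())),
        "mean_reward": sum(rewards) / len(rewards),
        "num_episode_ends": sum(1 for *_, d in transitions if d),
    }


# Toy rows standing in for the CSV contents (values are made up for illustration).
toy = [(328, 1, -1, 228, False), (228, 4, -1, 236, False), (236, 5, 20, 0, True)]
print(summarize_transitions(toy))
```

With the loaded DataFrame, the same function could be fed `dataset[['state', 'action', 'reward', 'next_state', 'done']].itertuples(index=False)`.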


### Data Preprocessing

#### Conversion from Integer observation to Analog Bits

In [4]:
def int2analog(x, n=8):
    # Convert an integer to a PyTorch tensor
    x_tensor = torch.tensor([x], dtype=torch.int32)

    # Convert integers into the corresponding binary bits.
    shifts = torch.arange(n - 1, -1, -1, dtype=x_tensor.dtype)
    x_tensor = torch.bitwise_right_shift(x_tensor, shifts)
    x_tensor = torch.remainder(x_tensor, 2)

    # Convert the binary bits into the corresponding analog values.
    x_tensor = x_tensor.type(torch.float32)
    x_tensor = 2 * x_tensor - 1

    return x_tensor  


def analog2int(x):
    # Convert an analog bit representation back to an integer
    x = (x + 1) / 2  # Convert from [-1, 1] to [0, 1]
    x = torch.round(x).type(torch.int32)  # Round and convert to int
    # Convert binary bits back to integer
    int_val = 0
    for i, bit in enumerate(reversed(x)):
        int_val += bit.item() * (2 ** i)
    return int_val

In [5]:
print(int2analog(12))
print(analog2int(int2analog(12)))

tensor([-1., -1., -1., -1.,  1.,  1., -1., -1.])
12


#### Preprocess Data

In [6]:
def preprocess_data(data, state_bit_length, action_bit_length):
    if not isinstance(data, pd.DataFrame):
        raise ValueError("Data must be a Pandas DataFrame")

    # Ensure required columns are present
    required_columns = ['state', 'action', 'reward', 'next_state', 'done']
    if not all(col in data.columns for col in required_columns):
        raise ValueError(f"Data must contain the following columns: {required_columns}")

    # Apply int2analog conversion
    states = data['state'].apply(lambda x: int2analog(x, n=state_bit_length))
    actions = data['action'].apply(lambda x: int2analog(x, n=action_bit_length))
    next_states = data['next_state'].apply(lambda x: int2analog(x, n=state_bit_length))
    # Keep rewards as float
    rewards = torch.tensor(data['reward'].values, dtype=torch.float32)
    # Keep done as boolean
    dones = torch.tensor(data['done'].values, dtype=torch.bool)
    # Combine into a single dataset
    processed_data = list(zip(states, actions, rewards, next_states, dones))
    return processed_data

# How many bits to use for the state and action representations
state_bit_length = 10
action_bit_length = 10

# Preprocess the data
processed_data = preprocess_data(dataset, state_bit_length, action_bit_length)


In [7]:
print(processed_data[0])

(tensor([-1., -1., -1.,  1.,  1.,  1.,  1., -1.,  1.,  1.]), tensor([-1., -1., -1., -1., -1., -1., -1., -1., -1.,  1.]), tensor(-1.), tensor([-1., -1., -1., -1., -1.,  1., -1.,  1.,  1.,  1.]), tensor(False))
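
Both bit lengths are set to 10 above, which is wider than strictly necessary: Taxi-v3's 500 states fit in 9 bits and its 6 actions fit in 3. The sketch below computes the minimal width and mirrors `int2analog` in pure Python. `bits_needed` and `int_to_analog_bits` are hypothetical helper names, and whether a narrower encoding trains better is an open question rather than a claim.

```python
def bits_needed(num_values):
    """Smallest bit width that can represent integers 0 .. num_values - 1."""
    if num_values < 1:
        raise ValueError("num_values must be positive")
    return max(1, (num_values - 1).bit_length())


def int_to_analog_bits(x, n):
    """Pure-Python mirror of int2analog: MSB-first bits mapped to -1.0 / +1.0."""
    if not 0 <= x < 2 ** n:
        raise ValueError(f"{x} does not fit in {n} bits")
    return [2.0 * ((x >> shift) & 1) - 1.0 for shift in range(n - 1, -1, -1)]


print(bits_needed(500), bits_needed(6))       # 9 3
print(int_to_analog_bits(5, bits_needed(6)))  # [1.0, -1.0, 1.0]
```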


## Model Architecture and Hyperparameters

Based on the paper's description, here we will define a similar (but simpler) architecture for the diffusion policy and Q networks. We will also determine the hyperparameters necessary for training the Diffusion Q-Learning model within the constraints of the Taxi environment.


### Define Model and Hyperparameters

#### Critic Network

In [8]:
class Critic(nn.Module):
    def __init__(self, state_dim, action_dim, hidden_dim=64):
        super(Critic, self).__init__()
        self.q1_model = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden_dim),
            nn.Mish(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.Mish(),
            nn.Linear(hidden_dim, 1))

        self.q2_model = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden_dim),
            nn.Mish(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.Mish(),
            nn.Linear(hidden_dim, 1))

    def forward(self, state, action):
        x = torch.cat([state, action], dim=-1)
        return self.q1_model(x), self.q2_model(x)
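
The two heads `q1_model` and `q2_model` make a clipped double-Q target possible: the bootstrap term uses the smaller of the two estimates for the next state-action pair. The scalar sketch below shows that target. The discount `gamma` and the idea that `next_q1` and `next_q2` come from a target copy of the critic are assumptions; neither is set anywhere in this notebook yet.

```python
def clipped_double_q_target(reward, next_q1, next_q2, done, gamma=0.99):
    """Bellman target using the smaller of two critic estimates.

    gamma is an assumed discount factor; terminal transitions do not bootstrap.
    """
    bootstrap = 0.0 if done else gamma * min(next_q1, next_q2)
    return reward + bootstrap


print(clipped_double_q_target(-1.0, 5.0, 3.0, done=False))
print(clipped_double_q_target(20.0, 5.0, 3.0, done=True))  # 20.0
```

The critic loss would then be the mean squared error between each head's prediction and this shared target.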

#### Discrete Diffusion Actor Network

In [9]:
class DiscreteDiffusionActor(nn.Module):
    def __init__(self, state_dim, action_dim, hidden_dim=64):
        super(DiscreteDiffusionActor, self).__init__()
        # Define the network layers
        self.network = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.Mish(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.Mish(),
            nn.Linear(hidden_dim, action_dim))

    def forward(self, state):
        return torch.softmax(self.network(state), dim=-1)  # Softmax for probability distribution


#### Corruption Process and noise schedule

In [10]:
def cosine_beta_schedule(timesteps, s=0.008):
    """
    Generates a cosine noise schedule.

    Args:
    - timesteps (int): The total number of timesteps.
    - s (float): Scale factor for the noise level.

    Returns:
    - torch.Tensor: The beta schedule tensor.
    """
    steps = torch.arange(timesteps, dtype=torch.float32) / timesteps
    beta_schedule = s * (1 + torch.cos(torch.pi * steps)) / 2
    return beta_schedule

def compute_alpha_bar(beta_schedule):
    alpha = 1. - beta_schedule
    alpha_bar = torch.cumprod(alpha, dim=0)
    return alpha, alpha_bar

def apply_noise(x, timestep, beta_schedule):
    """
    Applies forward-diffusion noise to a clean tensor x at a specific timestep.

    Args:
    - x (torch.Tensor): The clean (un-noised) tensor x_0, e.g. an analog-bit vector.
    - timestep (int): The specific timestep at which to apply noise.
    - beta_schedule (torch.Tensor): The beta schedule tensor.

    Returns:
    - torch.Tensor: The noised version x_t of x at the specified timestep.
    """
    # Ensure the timestep is within the range of the beta schedule
    if not 0 <= timestep < beta_schedule.size(0):
        raise ValueError("Timestep is out of range of the beta schedule")

    # Compute alpha_bar (the cumulative product of alpha)
    _, alpha_bar = compute_alpha_bar(beta_schedule)

    # Sample x_t directly from x_0:
    # x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * epsilon
    epsilon = torch.randn_like(x)
    xt = torch.sqrt(alpha_bar[timestep]) * x + torch.sqrt(1. - alpha_bar[timestep]) * epsilon

    return xt

# # Example usage
# T = 1000  # Number of timesteps
# beta_schedule = torch.linspace(1e-4, 0.02, T)  # Example beta schedule
# x0 = torch.tensor([1.0, -1.0, 1.0, 1.0, -1.0, -1.0, -1.0, -1.0, -1.0, 1.0])

# # Noise the clean sample x0 at increasing timesteps. apply_noise uses the
# # cumulative alpha_bar, so it must be given x0 each time, not the previous xt.
# x_to_plot = []
# for t in range(0, T, 100):
#     xt = apply_noise(x0, t, beta_schedule)
#     print("Timestep:", t, end=" | ")
#     print("xt:", xt)
#     x_to_plot.append(xt)
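
Note that `cosine_beta_schedule` as written produces betas that decrease from `s` toward 0. For comparison, the cosine schedule commonly used for diffusion models defines the cumulative product alpha_bar with a squared cosine and derives betas from it, which gives betas that increase with the timestep. The pure-Python sketch below follows that common formulation. The function name `standard_cosine_betas` and the clipping value `max_beta` are assumptions for illustration, and it is not necessarily the schedule the paper uses.

```python
import math


def standard_cosine_betas(timesteps, s=0.008, max_beta=0.999):
    """Betas derived from a squared-cosine alpha_bar curve (illustrative)."""

    def alpha_bar(t):
        return math.cos((t / timesteps + s) / (1 + s) * math.pi / 2) ** 2

    # beta_t = 1 - alpha_bar(t + 1) / alpha_bar(t), clipped for numerical stability
    return [
        min(1.0 - alpha_bar(t + 1) / alpha_bar(t), max_beta)
        for t in range(timesteps)
    ]


betas = standard_cosine_betas(1000)
alpha_bar_T = math.prod(1.0 - b for b in betas)  # signal remaining at the last step
print(betas[0], betas[-1], alpha_bar_T)
```

With this variant almost no signal remains at the final timestep, which is the property the reverse (denoising) process relies on when it starts from pure noise.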




#### Hyperparameters

In [12]:
state_dim = state_bit_length  # Length of the analog-bit state vector
action_dim = action_bit_length  # Length of the analog-bit action vector
hidden_dim = 128  # Smaller network for a simpler task

learning_rate = 0.001  # Learning rate for the optimizer
batch_size = 64  # Size of the batch used for training
train_epochs = 500  # Number of epochs to train for


#### Full Model

In [15]:
class DiscreteDiffusionQL:
    def __init__(self, state_dim, action_dim, hidden_dim, learning_rate, beta_schedule):
        # Initialize actor and critic networks
        self.actor = DiscreteDiffusionActor(state_dim, action_dim, hidden_dim)
        self.critic = Critic(state_dim, action_dim, hidden_dim)

        # Initialize optimizers for both networks
        self.actor_optimizer = torch.optim.Adam(self.actor.parameters(), lr=learning_rate)
        self.critic_optimizer = torch.optim.Adam(self.critic.parameters(), lr=learning_rate)

        # Store the beta schedule for the diffusion process
        self.beta_schedule = beta_schedule

        # Other initializations as needed (maybe device settings, target networks if used)

    def train_step(self, states, actions, rewards, next_states, dones, timestep):
        # Apply noise to the actions: the diffusion policy models actions conditioned on states
        noised_actions = apply_noise(actions, timestep, self.beta_schedule)

        # TODO: Implement the training logic
        # This includes updating the critic and actor networks based on the sampled batch
        # Compute losses, perform backward passes, and update network weights

        # Return any metrics or losses for logging

    # Other things as needed (e.g., for evaluation, saving/loading models)
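
As we understand the Diffusion Q-Learning formulation, the behavior cloning term trains the policy to predict the noise that was added to an expert action, conditioned on the state and the timestep. The pure-Python sketch below shows that loss for a single action vector. `noise_action`, `noise_prediction_loss`, and the zero-predicting stand-in denoiser are hypothetical; a real implementation would use the actor network and batched tensors.

```python
import math
import random


def noise_action(action_bits, alpha_bar_t, rng):
    """Forward diffusion of one action vector; returns (noisy_action, noise)."""
    noise = [rng.gauss(0.0, 1.0) for _ in action_bits]
    scale, sigma = math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
    noisy = [scale * a + sigma * e for a, e in zip(action_bits, noise)]
    return noisy, noise


def noise_prediction_loss(true_noise, predicted_noise):
    """Mean squared error between the sampled and the predicted noise."""
    squared = [(e - p) ** 2 for e, p in zip(true_noise, predicted_noise)]
    return sum(squared) / len(squared)


rng = random.Random(0)
action = [1.0, -1.0, 1.0]  # analog bits for action 5 when 3 bits are used
noisy, eps = noise_action(action, alpha_bar_t=0.5, rng=rng)
# A stand-in denoiser that always predicts zero noise, for illustration only.
print(noise_prediction_loss(eps, [0.0] * len(eps)))
```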


## Training the Diffusion Q-Learning Model

In this section, we implement the training loop for the Diffusion Q-Learning model using the taxi_q_expert_dataset. Training iteratively updates the policy (actor) and value function (critic) networks, minimizing both the behavior cloning loss and the Q-learning loss, as in the paper.
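
One way to combine the two objectives for the actor, sketched here under our reading of the paper, is to subtract a scaled Q-value term from the behavior cloning loss, with the scale normalized by the mean absolute Q-value so the two terms stay comparable. The name `actor_loss`, the weight `eta`, and the stabilizer `eps` are assumptions for illustration.

```python
def actor_loss(bc_loss, q_values, eta=1.0, eps=1e-8):
    """Behavior cloning loss minus a normalized Q-value improvement term.

    q_values are critic estimates for actions sampled from the policy.
    In a tensor implementation the normalizer would be treated as a constant
    (no gradient flows through it).
    """
    mean_q = sum(q_values) / len(q_values)
    mean_abs_q = sum(abs(q) for q in q_values) / len(q_values)
    return bc_loss - eta * mean_q / (mean_abs_q + eps)


print(actor_loss(0.5, [2.0, -2.0, 4.0]))
```

Setting `eta` to 0 recovers pure behavior cloning, which is a useful baseline for checking that the Q-term actually helps.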


### Define Replay Buffer and Training Loop


In [16]:
import random  # Needed by ReplayBuffer.sample


class ReplayBuffer:
    def __init__(self, max_size):
        self.buffer = []
        self.max_size = max_size
        self.position = 0

    def push(self, state, action, reward, next_state, done):
        """Saves a transition."""
        if len(self.buffer) < self.max_size:
            self.buffer.append(None)
        self.buffer[self.position] = (state, action, reward, next_state, done)
        self.position = (self.position + 1) % self.max_size

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        state, action, reward, next_state, done = map(torch.stack, zip(*batch))
        return state, action, reward, next_state, done

    def __len__(self):
        return len(self.buffer)
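
Because the dataset is fixed, an alternative to repeated random sampling is to visit every transition exactly once per epoch in shuffled order. Note also that `push` overwrites the oldest entries once `max_size` is reached, so a dataset longer than the buffer would be partially dropped. The sketch below generates shuffled index batches; `minibatch_indices` is a hypothetical helper, not part of the notebook.

```python
import random


def minibatch_indices(num_items, batch_size, rng=random):
    """Yield index lists covering range(num_items) once, in shuffled order."""
    order = list(range(num_items))
    rng.shuffle(order)
    for start in range(0, num_items, batch_size):
        yield order[start:start + batch_size]


batches = list(minibatch_indices(10, 4, random.Random(0)))
print([len(b) for b in batches])  # [4, 4, 2]
```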

### Training the model

In [19]:
import random
import numpy as np

def train_diffusion_ql(model, replay_buffer, epochs, batch_size, num_timesteps, beta_schedule):
    for epoch in range(epochs):
        for _ in range(len(replay_buffer) // batch_size):
            # Sample a batch from the replay buffer
            states, actions, rewards, next_states, dones = replay_buffer.sample(batch_size)

            # Apply noise to actions for the diffusion process
            timestep = np.random.randint(0, num_timesteps)
            noised_actions = apply_noise(actions, timestep, beta_schedule)

            # TODO: Update the critic network
            # Compute the critic loss and perform a backward pass

            # TODO: Update the actor (diffusion policy) network
            # Compute the actor loss and perform a backward pass

        # TODO: Additional logic for logging, validation, saving models, etc.

        print(f"Epoch {epoch+1}/{epochs} completed.")


# Train the model
num_timesteps = 1000  # Number of timesteps for the diffusion process
beta_schedule = cosine_beta_schedule(num_timesteps)  # Beta schedule

# Initialize your model (Diffusion Q-Learning model)
model = DiscreteDiffusionQL(state_dim, action_dim, hidden_dim, learning_rate, beta_schedule)

# Initialize the replay buffer and populate it
replay_buffer = ReplayBuffer(max_size=10000)
for state, action, reward, next_state, done in processed_data:
    replay_buffer.push(state, action, reward, next_state, done)


train_diffusion_ql(model, replay_buffer, epochs=train_epochs, batch_size=batch_size, num_timesteps=num_timesteps, beta_schedule=beta_schedule)



Epoch 1/500 completed.
Epoch 2/500 completed.
Epoch 3/500 completed.
...
Epoch 42/500 completed.
...




## Evaluating the Model

After training, we evaluate the Diffusion Q-Learning model using the "taxi_expert_test" dataset. The evaluation focuses on assessing how well the model replicates expert behavior and its effectiveness in achieving the goals of the Taxi environment.


### Model Evaluation

In [None]:
def evaluate_model(model, test_data):
    # TODO: Implement the logic to evaluate the model on the test data
    # This could be running the model on the test data and comparing its performance against some kind of metrics
    raise NotImplementedError("evaluate_model is not implemented yet")

In [None]:
# Evaluate the trained model
evaluate_model(model, taxi_expert_test)
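
One simple metric for how well the model replicates expert behavior is the fraction of test states on which the model's chosen action matches the expert's. The sketch below computes that agreement rate from two lists of integer actions. `action_agreement` is a hypothetical helper, and turning the model's analog-bit output into an integer action (for example with `analog2int`) is assumed to happen beforehand.

```python
def action_agreement(predicted_actions, expert_actions):
    """Fraction of positions where the predicted action equals the expert action."""
    if len(predicted_actions) != len(expert_actions):
        raise ValueError("action lists must have the same length")
    if not expert_actions:
        raise ValueError("cannot compute agreement on an empty dataset")
    matches = sum(p == e for p, e in zip(predicted_actions, expert_actions))
    return matches / len(expert_actions)


print(action_agreement([1, 4, 5, 0], [1, 4, 5, 2]))  # 0.75
```

Agreement alone does not measure task success, so it would be worth pairing with the average episode return obtained by rolling the policy out in the environment.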


## Conclusion and Future Work

I'll use this section to summarize the findings from the training and evaluation of the Diffusion Q-Learning model in the Taxi environment. I'll discuss the model's performance, potential areas of improvement, and opportunities for future research and application in more complex environments. This will serve as a useful reference when writing the thesis and/or paper.
