# Team information

| Team member 1     | Details  | Team member 2     | Details  |
| :---------------- | :------: | :---------------- | :------: |
| Name              |          | Name              |          |
| NUSNet (Exxxxxxx) |          | NUSNet (Exxxxxxx) |          |
| Matric (AxxxxxxxZ)|          | Matric (AxxxxxxxZ)|          |


In [None]:
# Connect to Google Drive to save your model, etc.

from google.colab import drive
drive.mount('/content/drive')

# Installation and setup

The gym environment requires an older version of NumPy (and correspondingly older versions of the packages that depend on it). <br>
The following cell writes the `requirements.txt` file used to set up the Python environment for the rest of this notebook.


In [2]:
%%writefile requirements.txt

cloudpickle==3.1.1
contourpy==1.3.0
cycler==0.12.1
filelock==3.18.0
fonttools==4.56.0
fsspec==2025.3.0
gym==0.26.2
gym-notices==0.0.8
importlib_metadata==8.6.1
importlib_resources==6.5.2
Jinja2==3.1.6
kiwisolver==1.4.7
MarkupSafe==3.0.2
matplotlib==3.9.4
mpmath==1.3.0
networkx==3.2.1
numpy==1.24.2
nvidia-cublas-cu12==12.4.5.8
nvidia-cuda-cupti-cu12==12.4.127
nvidia-cuda-nvrtc-cu12==12.4.127
nvidia-cuda-runtime-cu12==12.4.127
nvidia-cudnn-cu12==9.1.0.70
nvidia-cufft-cu12==11.2.1.3
nvidia-curand-cu12==10.3.5.147
nvidia-cusolver-cu12==11.6.1.9
nvidia-cusparse-cu12==12.3.1.170
nvidia-cusparselt-cu12==0.6.2
nvidia-nccl-cu12==2.21.5
nvidia-nvjitlink-cu12==12.4.127
nvidia-nvtx-cu12==12.4.127
packaging==24.2
pillow==11.1.0
ply==3.11
pygame==2.6.1
pyparsing==3.2.1
python-dateutil==2.9.0.post0
six==1.17.0
sympy==1.13.1
torch==2.6.0
tqdm==4.67.1
triton==3.2.0
zipp==3.21.0

Overwriting requirements.txt


Now install the requirements.

You may be asked to restart the session to load the installed versions of the packages. If so, restart the session and continue using the notebook.

In [3]:
!pip install -r requirements.txt

Collecting contourpy==1.3.0 (from -r requirements.txt (line 3))
  Using cached contourpy-1.3.0-cp311-cp311-win_amd64.whl.metadata (5.4 kB)
Collecting filelock==3.18.0 (from -r requirements.txt (line 5))
  Using cached filelock-3.18.0-py3-none-any.whl.metadata (2.9 kB)
Collecting fsspec==2025.3.0 (from -r requirements.txt (line 7))
  Using cached fsspec-2025.3.0-py3-none-any.whl.metadata (11 kB)
Collecting importlib_metadata==8.6.1 (from -r requirements.txt (line 10))
  Using cached importlib_metadata-8.6.1-py3-none-any.whl.metadata (4.7 kB)
Collecting importlib_resources==6.5.2 (from -r requirements.txt (line 11))
  Using cached importlib_resources-6.5.2-py3-none-any.whl.metadata (3.9 kB)
Collecting Jinja2==3.1.6 (from -r requirements.txt (line 12))
  Using cached jinja2-3.1.6-py3-none-any.whl.metadata (2.9 kB)
Collecting kiwisolver==1.4.7 (from -r requirements.txt (line 13))
  Using cached kiwisolver-1.4.7-cp311-cp311-win_amd64.whl.metadata (6.4 kB)
Collecting MarkupSafe==3.0.2 (from 

ERROR: Could not find a version that satisfies the requirement nvidia-nccl-cu12==2.21.5 (from versions: 0.0.1.dev5)
ERROR: No matching distribution found for nvidia-nccl-cu12==2.21.5


We will use a discretized version of
the [elevator domain](https://ataitler.github.io/IPPC2023/elevator.html) from the International Planning Competition, 2023.

Install the pyRDDLGym environment from the given repository.

In [4]:
!pip install -q git+https://github.com/tasbolat1/pyRDDLGym.git --force-reinstall

## Install other packages if needed

In [2]:
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F

import copy
import itertools
import numpy as np
import random
import tqdm
import matplotlib.pyplot as plt
from collections import deque

from pyRDDLGym.Visualizer.MovieGenerator import MovieGenerator # loads visualizer utilities
from IPython.display import Image, display, clear_output # for displaying gifs in colab
from pyRDDLGym.Elevator import Elevator # imports Discrete Elevator

## Add more imports here as required



# Environment Initialization

In [3]:
## IMPORTANT: Do not change the instance of the environment.
env = Elevator(instance = 5)

print('Discrete environment actions:')
print(env.disc_actions)
print('Continuous environment actions:')
print(env.base_env.action_space)
print(f"Observation space size for the discrete Elevator Environment: {len(env.disc_states)}")

d:\GitHub\cs4246_assignment_2\.venv\Lib\site-packages\pyRDDLGym\Examples d:\GitHub\cs4246_assignment_2\.venv\Lib\site-packages\pyRDDLGym\Examples\manifest.csv
Available example environment(s):
CartPole_continuous -> A simple continuous state-action MDP for the classical cart-pole system by Rich Sutton, with actions that describe the continuous force applied to the cart.
CartPole_discrete -> A simple continuous state MDP for the classical cart-pole system by Rich Sutton, with discrete actions that apply a constant force on either the left or right side of the cart.
Elevators -> The Elevator domain models evening rush hours when people from different floors in a building want to go down to the bottom floor using elevators.
HVAC -> Multi-zone and multi-heater HVAC control problem
MarsRover -> Multi Rover Navigation, where a group of agent needs to harvest mineral.
MountainCar -> A simple continuous MDP for the classical mountain car control problem.
NewLanguage -> Example with new languag

<op> is one of {<=, <, >=, >}
<rhs> is a deterministic function of non-fluents or constants only.
>> ( sum_{?f: floor} [ elevator-at-floor(?e, ?f) ] ) == 1


Discrete environment actions:
{0: ('e0_movcurdir_0',), 1: ('e0_movcurdir_1',), 2: ('e0_close_0',), 3: ('e0_close_1',), 4: ('e0_open_0',), 5: ('e0_open_1',)}
Continuous environment actions:
Dict('move-current-dir___e0': Discrete(2), 'open-door___e0': Discrete(2), 'close-door___e0': Discrete(2))
Observation space size for the discrete Elevator Environment: 225280


# Hyperparameters

In [3]:
# Define hyperparameters

## IMPORTANT: <BEGIN> DO NOT CHANGE THIS CODE!
## GENERAL HYPERPARAMS
num_episodes = 3000
## IMPORTANT: <END> DO NOT CHANGE THIS CODE!

learning_rate = 3e-4
batch_size = 64
clip_value = 1.0  # Gradient clipping value

## ALGO SPECIFIC HYPERPARAMS
# Update the hyperparams as necessary for your implementation



# Model Definition

Define your model here. You can rename the class `YourModel` appropriately and use it later in the code.
Note: In the case of actor-critic or other multi-component models, every component must subclass `nn.Module`.

- Your model should take in 11 inputs, which are derived from the `convert_state_to_list` function.
- Your model should return 6 values corresponding to action logits or probabilities.

In [None]:
class YourModel(nn.Module):
    def __init__(self):
        super(YourModel, self).__init__()
        # Your model layers and initializations here

    def forward(self, x):
        # x will be a tensor with shape [batch_size, 11]
        # Your forward pass logic here
        # Ensure the output has shape [batch_size, 6]
        return output
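
As one possible way to fill in the template above, the sketch below uses a small fully connected network that maps the 11 state features to 6 output values. The class name `ElevatorMLP`, the two hidden layers, and the hidden width of 64 are illustrative assumptions, not requirements of the assignment.

```python
import torch
import torch.nn as nn


class ElevatorMLP(nn.Module):  # hypothetical name standing in for YourModel
    def __init__(self, input_size=11, hidden_size=64, output_size=6):
        super().__init__()
        # Two hidden layers with ReLU; the sizes are assumptions for illustration
        self.net = nn.Sequential(
            nn.Linear(input_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, output_size),
        )

    def forward(self, x):
        # x: [batch_size, 11] -> output: [batch_size, 6]
        return self.net(x)


model = ElevatorMLP()
batch = torch.zeros(4, 11)   # dummy batch of 4 states
out = model(batch)
print(out.shape)
```

The outputs here are unnormalized; whether they are read as Q-values or as action logits depends on the algorithm you choose.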

# Feature Extraction

In [None]:
## IMPORTANT: DO NOT CHANGE THIS CODE!
env_features = list(env.observation_space.keys())

def convert_state_to_list(state, env_features):
    out = []
    for i in env_features:
        out.append(state[i])
    return out
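
To illustrate what `convert_state_to_list` does, the sketch below applies it to a toy dictionary. The feature names are made up for the example; in the notebook the real names come from `env.observation_space.keys()`. The function body is copied unchanged.

```python
def convert_state_to_list(state, env_features):
    out = []
    for i in env_features:
        out.append(state[i])
    return out


# Hypothetical feature names, for illustration only
toy_features = ["feature_a", "feature_b", "feature_c"]
toy_state = {"feature_c": 3, "feature_a": True, "feature_b": 0}

# Values are returned in the order given by toy_features, not dict order
features = convert_state_to_list(toy_state, toy_features)
print(features)
```

Because the order is fixed by `env_features`, every state is mapped to a list with the same layout, which is what lets it be fed to the network as a fixed-size tensor.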

# Neural Net Initialization

In [None]:
# Initialize the network and optimizer
input_size = len(env_features)
output_size = 6

# INITIALIZE OTHER NETWORK PARAMS HERE
hidden_size = ...

# INITIALIZE YOUR NETWORK HERE
your_network = YourModel()

# INIT OPTIMIZER - Adam is a good start, but you can try changing this as well
optimizer = optim.Adam(
    your_network.parameters(), lr=learning_rate
)

In [None]:
# Convert networks to CUDA if available
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
your_network.to(device)

# Define other constructs (replay buffers, etc) as necessary
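
If your algorithm learns from stored transitions, a replay buffer is one such construct. The sketch below is a minimal version built on `collections.deque`; the class name `ReplayBuffer`, its methods, and the capacity are assumptions for illustration and are not provided by the notebook.

```python
import random
from collections import deque


class ReplayBuffer:  # hypothetical helper, not part of the template
    def __init__(self, capacity):
        # A bounded deque drops the oldest transition once capacity is reached
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement
        batch = random.sample(list(self.buffer), batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)


buffer = ReplayBuffer(capacity=3)
for t in range(5):
    # Toy transitions: integer t stands in for a real state
    buffer.push(t, 0, -1.0, t + 1, False)

states, actions, rewards, next_states, dones = buffer.sample(2)
print(len(buffer), states)
```

In a training loop, sampling would typically start only once `len(buffer)` is at least `batch_size`.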

## Gradient Clipping (Optional, you can use torch's version as well)

In [None]:
# Define a function for gradient clipping
def clip_grads(model, clip_value):
    for param in model.parameters():
        if param.grad is not None:
            param.grad.data = torch.clamp(param.grad.data, -clip_value, clip_value)
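
The function above clamps every gradient element to `[-clip_value, clip_value]`. PyTorch's built-in `torch.nn.utils.clip_grad_value_` does the same thing, as the sketch below shows on a toy linear layer (the layer and the artificially large loss are assumptions made only to produce large gradients).

```python
import torch
import torch.nn as nn

layer = nn.Linear(3, 1)

# Scale the output by 100 so every gradient element is far above the clip value
loss = (layer(torch.ones(1, 3)) * 100.0).sum()
loss.backward()

# Element-wise clipping to [-1.0, 1.0], done in place on the .grad tensors
torch.nn.utils.clip_grad_value_(layer.parameters(), clip_value=1.0)

max_abs_grad = max(p.grad.abs().max().item() for p in layer.parameters())
print(max_abs_grad)
```

Either version has to be called after `loss.backward()` and before `optimizer.step()` to have any effect. `torch.nn.utils.clip_grad_norm_` is a related alternative that rescales all gradients by their combined norm instead of clamping each element.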

# Live Plotting Setup

In [None]:
# Create a figure for plotting
plt.style.use('ggplot')
fig, ax = plt.subplots(figsize=(10, 6))
plt.ion()

# Lists to store rewards and episode numbers
rewards_list = []
episodes = []

def exponential_smoothing(data, alpha=0.1):
    """Compute exponential smoothing."""
    smoothed = [data[0]]  # Initialize with the first data point
    for i in range(1, len(data)):
        st = alpha * data[i] + (1 - alpha) * smoothed[-1]
        smoothed.append(st)
    return smoothed

def live_plot(data_dict, figure, ylabel="Total Rewards"):
    """Plot the live graph."""
    clear_output(wait=True)
    ax.clear()
    for label, data in data_dict.items():
        if label == "Total Reward":
            ax.plot(data, label=label, color="yellow", linestyle='--')

            # Compute and plot the exponentially smoothed total reward
            ma = exponential_smoothing(data)
            ma_idx_start = len(data) - len(ma)
            ax.plot(range(ma_idx_start, len(data)), ma, label="Smoothed Value", linestyle="-", color="purple", linewidth=2)
        else:
            ax.plot(data, label=label)
    ax.set_ylabel(ylabel)
    ax.legend(loc='upper left')
    display(figure)
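
The smoothed curve follows the recurrence `s_t = alpha * x_t + (1 - alpha) * s_{t-1}` with `s_0 = x_0`. The short worked example below restates the same recurrence compactly and uses `alpha=0.5` on made-up values so the arithmetic is easy to follow.

```python
def exponential_smoothing(data, alpha=0.1):
    """Exponential smoothing, equivalent to the notebook's version."""
    smoothed = [data[0]]  # s_0 = x_0
    for value in data[1:]:
        smoothed.append(alpha * value + (1 - alpha) * smoothed[-1])
    return smoothed


# Toy input: s_1 = 0.5*10 + 0.5*0 = 5.0, s_2 = 0.5*10 + 0.5*5 = 7.5
smoothed = exponential_smoothing([0.0, 10.0, 10.0], alpha=0.5)
print(smoothed)  # [0.0, 5.0, 7.5]
```

A smaller `alpha`, such as the default 0.1, reacts more slowly to new episodes and gives a smoother curve.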


# RL Algorithm

In [None]:
# Define the loss calculation function
def calculate_loss(
    ## INCLUDE PARAMS YOU NEED HERE
    ):
    ## TODO - CALCULATE LOSS VALUE & RETURN IT
    return loss
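
The template does not prescribe an algorithm. Purely as an illustration, and assuming a DQN-style agent with a separate target network, a one-step temporal-difference loss could look like the sketch below. The function name `dqn_loss`, its parameters, the discount `gamma=0.99`, and the Huber loss are all assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def dqn_loss(q_net, target_net, states, actions, rewards, next_states, dones, gamma=0.99):
    # Q(s, a) for the actions that were actually taken
    q_values = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # Bootstrapped target; (1 - done) removes the bootstrap at episode end
        next_q = target_net(next_states).max(dim=1).values
        targets = rewards + gamma * (1.0 - dones) * next_q
    return F.smooth_l1_loss(q_values, targets)


# Stand-in networks and a dummy batch of 4 transitions (11 features, 6 actions)
q_net = nn.Linear(11, 6)
target_net = nn.Linear(11, 6)
target_net.load_state_dict(q_net.state_dict())

states = torch.randn(4, 11)
next_states = torch.randn(4, 11)
actions = torch.tensor([0, 5, 2, 3])
rewards = torch.tensor([-1.0, -1.0, 0.0, -3.0])
dones = torch.tensor([0.0, 0.0, 0.0, 1.0])

loss = dqn_loss(q_net, target_net, states, actions, rewards, next_states, dones)
loss.backward()
print(loss.item())
```

The target is computed under `torch.no_grad()`, so gradients flow only through `q_net`. Policy-gradient or actor-critic methods would need a different loss.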

In [None]:
def choose_action(
    ## INCLUDE PARAMS YOU NEED HERE
    ):
    ## TODO - RETURN AN INTEGER FROM 0 - 5 (both inclusive) based on your model training/testing strategy
    return 0
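
One common action-selection strategy for value-based agents is epsilon-greedy. The sketch below assumes the network outputs one value per action; the function name `epsilon_greedy` and its parameters are hypothetical, and a policy-based agent would instead sample from the action probabilities.

```python
import random

import torch
import torch.nn as nn


def epsilon_greedy(q_net, state_tensor, epsilon, num_actions=6):
    # Explore: with probability epsilon pick a uniformly random action
    if random.random() < epsilon:
        return random.randrange(num_actions)
    # Exploit: pick the action with the highest predicted value
    with torch.no_grad():
        q_values = q_net(state_tensor.unsqueeze(0))  # add a batch dimension
    return int(q_values.argmax(dim=1).item())


q_net = nn.Linear(11, 6)        # stand-in for the real model
state_tensor = torch.zeros(11)  # dummy state with 11 features

greedy_action = epsilon_greedy(q_net, state_tensor, epsilon=0.0)
random_action = epsilon_greedy(q_net, state_tensor, epsilon=1.0)
print(greedy_action, random_action)
```

During training, `epsilon` is usually decayed over episodes; at test time it is typically set to 0 so the agent acts greedily.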

## Training loop with live plotting

Use the graph generated here in your PDF submission.

In [None]:
plt.style.use('ggplot')
fig, ax = plt.subplots(figsize=(10, 6))
plt.ion()

# Create a tqdm progress bar
progress_bar = tqdm.tqdm(range(num_episodes), postfix={'Total Reward': 0, 'Loss': 0})

# RL algorithm training loop
for episode in progress_bar:
    total_reward = 0
    state = env.reset()

    while True:
        # Convert the original state to the suitable format for the network
        state_desc = env.disc2state(state)
        state_list = convert_state_to_list(state_desc, env_features)
        state_tensor = torch.tensor(state_list, dtype=torch.float32, device=device)

        action = choose_action(
            ## TODO: FILL IN PARAMS FOR CALLING choose_action
        )

        # Take the chosen action and observe the next state and reward
        next_state, reward, done, _ = env.step(action)

        # Convert the next state to the suitable format for the network
        next_state_desc = env.disc2state(next_state)
        next_state_list = convert_state_to_list(next_state_desc, env_features)
        next_state_tensor = torch.tensor(next_state_list, dtype=torch.float32, device=device)


        # Hint: You may want to collect experiences from the environment to update the agent in batches!

        loss = calculate_loss(
            ## TODO: FILL IN PARAMS FOR CALLING calculate_loss
        )

        optimizer.zero_grad()
        loss.backward()

        optimizer.step()

        state = next_state
        total_reward += reward

        if done:
            break


    rewards_list.append(total_reward)
    episodes.append(episode)

    live_plot({'Total Reward': rewards_list}, fig)

    # Saving the model
    if episode % 500 == 0:
        torch.save(your_network, 'model.pt')

    progress_bar.set_postfix({'Total Reward': total_reward, 'Loss': loss.item()})
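
The loop above saves the whole module object with `torch.save(your_network, 'model.pt')`. An alternative, shown here only as a sketch, is to save and restore just the weights through `state_dict()`. The stand-in network and the temporary file path are assumptions; check the submission requirements before changing the saved format.

```python
import os
import tempfile

import torch
import torch.nn as nn

net = nn.Linear(11, 6)  # stand-in for your_network

# Save only the parameters to a temporary location
path = os.path.join(tempfile.mkdtemp(), "model_state.pt")
torch.save(net.state_dict(), path)

# Restore into a freshly constructed network of the same architecture
restored = nn.Linear(11, 6)
restored.load_state_dict(torch.load(path))

same = torch.equal(net.weight, restored.weight)
print(same)
```

Loading a fully pickled module, as saved by the loop above, needs `torch.load(..., weights_only=False)` on PyTorch 2.6 and later, because the default of `weights_only` changed to `True` in that release.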

## Compute the mean rewards

Report the mean rewards obtained in your PDF submission.

In [None]:
print(f"\nMean Rewards: ...")

# close the environment
env.close()
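
One way to obtain the number to report is to average the per-episode totals collected in `rewards_list`. The sketch below uses toy values in place of real training results; whether to average over all episodes or only the most recent ones is left to you.

```python
import numpy as np

# Toy values standing in for the per-episode totals from the training loop
rewards_list = [-120.0, -80.0, -100.0]

mean_reward = float(np.mean(rewards_list))
print(f"Mean Rewards: {mean_reward:.2f}")  # Mean Rewards: -100.00
```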