# **CIS 6200 Spring 2024 Homework 6**


**Coding: Behavioral Cloning from a PPO agent**

1. Train a PPO agent using Stable Baselines and use it as the expert for Behavioral Cloning.
  * Explain all the parameters that are shown during the training of the PPO agent.
  * How are you creating the dataset for Behavioral Cloning?
  * How is the performance of the model compared to the expert?
2. Change the parameters of the CartPole environment and test your model.
  * How does the trained BCO model perform?
  * Do you think the PPO model will perform better in this new environment?
3. Describe two strategies for fine-tuning your BCO model for this new environment. Mention the specifics, i.e., algorithm, loss, etc.

**Optional:**
4. Implement one of your strategies and test your results!



**Note: Answers to the questions need to be submitted in the corresponding PDF submission along with this coding submission on gradescope.**

## Installing Dependencies and Imports

In [37]:
!pip install -q gymnasium[classic_control]
!pip install -q renderlab
!pip install -q stable_baselines3

In [38]:
from stable_baselines3 import PPO
import numpy as np
import math
import gymnasium as gym
import renderlab as rl

## Making the Environment
We import the CartPole model from gymnasium to use as our environment. You can get more information about the environment [here](https://gymnasium.farama.org/environments/classic_control/cart_pole/). We also add a wrapper to the environment to render the output.

In [39]:
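env = gym.make("CartPole-v1", render_mode="rgb_array")
env = rl.RenderFrame(env, "./output")
# Relax the termination limits of the underlying CartPoleEnv, reached through the wrapper chain.
env.env.env.env.env.theta_threshold_radians = 12 * 4 * math.pi / 360  # pole angle limit: 24 degrees
env.env.env.env.env.x_threshold = 4.0  # cart position limit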
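
The angle limit above is written in radians, which makes it hard to read at a glance. The short sketch below converts it to degrees and compares it with the default from the Gymnasium CartPole source (`12 * 2 * math.pi / 360`, with a default `x_threshold` of 2.4). The variable names are only illustrative.

```python
import math

# Default pole-angle limit in the Gymnasium CartPole source.
default_limit = 12 * 2 * math.pi / 360
# Relaxed limit set in the cell above.
relaxed_limit = 12 * 4 * math.pi / 360

print(f"default: {math.degrees(default_limit):.1f} deg, "
      f"relaxed: {math.degrees(relaxed_limit):.1f} deg")
# prints "default: 12.0 deg, relaxed: 24.0 deg"
```

With these settings an episode only terminates once the pole tilts past roughly 24 degrees or the cart moves more than 4.0 units from the center, so episodes tend to last longer than in the stock environment.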

## Training the Expert
We are going to use a PPO model from Stable Baselines as our expert. You can learn more about it [here](https://github.com/DLR-RM/stable-baselines3).

In [40]:
model = PPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=10000)
model.save("ppo_cartpole")


Using cpu device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
---------------------------------
| rollout/           |          |
|    ep_len_mean     | 32.8     |
|    ep_rew_mean     | 32.8     |
| time/              |          |
|    fps             | 87       |
|    iterations      | 1        |
|    time_elapsed    | 23       |
|    total_timesteps | 2048     |
---------------------------------
----------------------------------------
| rollout/                |            |
|    ep_len_mean          | 37.1       |
|    ep_rew_mean          | 37.1       |
| time/                   |            |
|    fps                  | 103        |
|    iterations           | 2          |
|    time_elapsed         | 39         |
|    total_timesteps      | 4096       |
| train/                  |            |
|    approx_kl            | 0.00809131 |
|    clip_fraction        | 0.0671     |
|    clip_range           | 0.2        |
|    entropy_loss         | -0.68
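
Two of the `train/` entries in the log, `approx_kl` and `clip_fraction`, are diagnostics computed from the ratio between the new and old policy probabilities of the sampled actions. The sketch below reproduces both quantities with NumPy, following the formulas used in the Stable Baselines3 PPO implementation. The function name and the sample log-probabilities are hypothetical. Note also that `ep_len_mean` equals `ep_rew_mean` because CartPole gives a reward of +1 per step, and that `entropy_loss` is the negative mean policy entropy, which for two actions cannot be below `-ln 2 ≈ -0.693`.

```python
import numpy as np

def ppo_diagnostics(old_log_prob, new_log_prob, clip_range=0.2):
    """Return (approx_kl, clip_fraction) for a batch of sampled actions."""
    log_ratio = new_log_prob - old_log_prob
    ratio = np.exp(log_ratio)
    # Low-variance KL estimator: mean((r - 1) - log r), always >= 0.
    approx_kl = float(np.mean((ratio - 1.0) - log_ratio))
    # Fraction of samples whose ratio falls outside [1 - clip, 1 + clip].
    clip_fraction = float(np.mean(np.abs(ratio - 1.0) > clip_range))
    return approx_kl, clip_fraction

# Hypothetical action probabilities before and after a policy update.
old_lp = np.log(np.array([0.5, 0.5, 0.5, 0.5]))
new_lp = np.log(np.array([0.5, 0.55, 0.45, 0.7]))

approx_kl, clip_fraction = ppo_diagnostics(old_lp, new_lp)
print(f"approx_kl={approx_kl:.4f}, clip_fraction={clip_fraction:.2f}")
```

Only the last sample has a ratio (1.4) outside the clipping interval, so one quarter of the batch is clipped. A small `approx_kl` and a low `clip_fraction` both indicate that the update kept the new policy close to the old one.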

We can use the trained model to run inference in our environment and see the output.

In [41]:
model = PPO.load("ppo_cartpole")
observation, info = env.reset()

while True:
  observation = np.array(observation).reshape(1,4)
  action, _states = model.predict(observation, deterministic=True)
  observation, reward, terminated, truncated, info = env.step(action.item())

  if terminated or truncated:
    break

env.play()

Moviepy - Building video temp-{start}.mp4.
Moviepy - Writing video temp-{start}.mp4
Moviepy - Done !
Moviepy - video ready temp-{start}.mp4


## Part 1:
We want to use the PPO model we trained as the expert for Behavioral Cloning. You will need to do two things to achieve this:


1.   Create your own dataset of state-action pairs based on the expert. Think about how you would create this dataset and ensure that you have a variety of observations and actions.
2.   Train your model using the created dataset. Feel free to use a simple MLP model to clone the behavior of the expert.

Validate the trained model on the same environment and see your results.
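
Since CartPole has two discrete actions, cloning the expert can be treated as a classification problem: the model outputs one logit per action and is trained with cross-entropy against the expert's action. The NumPy sketch below works through that loss on a tiny hand-made batch. The function names, logits, and labels are hypothetical and only illustrate the arithmetic, not a required implementation.

```python
import numpy as np

def bc_cross_entropy(logits, expert_actions):
    """Mean negative log-likelihood of the expert actions under softmax(logits)."""
    logits = np.asarray(logits, dtype=np.float64)
    # Subtract the row maximum for numerical stability before exponentiating.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    picked = log_probs[np.arange(len(expert_actions)), expert_actions]
    return float(-picked.mean())

def agreement(logits, expert_actions):
    """Fraction of states where the greedy cloned action matches the expert."""
    return float(np.mean(np.argmax(logits, axis=1) == expert_actions))

# Hypothetical batch: 3 states, 2 actions (0 = push left, 1 = push right).
logits = np.array([[2.0, 0.0], [0.0, 0.0], [-1.0, 1.0]])
expert_actions = np.array([0, 1, 1])

loss = bc_cross_entropy(logits, expert_actions)
match = agreement(logits, expert_actions)
print(f"loss={loss:.4f}, agreement={match:.2f}")
```

The loss falls as the model puts more probability on the expert's action. Agreement with the expert on held-out states is a useful sanity check, although it does not by itself guarantee long episodes when the clone is rolled out.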



In [42]:
# Create dataset
import random

model = PPO.load("ppo_cartpole")

# Set the CartPole physics to the default values plus optional offsets.
# Calling set_env() with no arguments restores the defaults.
def set_env(**kwargs):
  cartpole = env.env.env.env.env  # underlying CartPoleEnv
  cartpole.gravity = 9.8 + kwargs.get("gravity", 0)
  cartpole.masscart = 1.0 + kwargs.get("masscart", 0)
  cartpole.masspole = 0.1 + kwargs.get("masspole", 0)
  cartpole.total_mass = cartpole.masspole + cartpole.masscart
  cartpole.length = 0.5 + kwargs.get("length", 0)  # actually half the pole's length
  cartpole.polemass_length = cartpole.masspole * cartpole.length
  cartpole.force_mag = 10.0 + kwargs.get("force_mag", 0)


data_epoch = 10
data = []
# At most data_epoch * 500 = 5000 samples, since CartPole-v1 truncates episodes at 500 steps
for _ in range(data_epoch):
  # Randomize the physics for this episode so the dataset covers varied dynamics
  set_env(gravity = random.random() * 5,
      masscart = random.random() * 0.5,
      masspole = random.random() * 0.1,
      length = random.random() * 0.2,
      force_mag = random.random() * 8)

  observation, info = env.reset()

  while True:
    pre_observation = np.array(observation).reshape(1,4)
    action, _states = model.predict(pre_observation, deterministic=True)
    observation, reward, terminated, truncated, info = env.step(action.item())
    data.append((action, pre_observation))
    if terminated or truncated:
      break
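
Before training, it can help to check that the collected pairs actually contain a variety of observations and both actions. The sketch below summarizes a list of `(action, observation)` tuples in the same layout as `data` above. The helper name `summarize_dataset` and the toy samples are hypothetical.

```python
import numpy as np

def summarize_dataset(data, n_actions=2):
    """Count expert actions and report the per-dimension observation range."""
    actions = np.array([int(np.asarray(a).item()) for a, _ in data])
    observations = np.concatenate([np.asarray(o).reshape(1, -1) for _, o in data])
    return {
        "n_samples": len(data),
        "action_counts": np.bincount(actions, minlength=n_actions).tolist(),
        "obs_min": observations.min(axis=0),
        "obs_max": observations.max(axis=0),
    }

# Toy samples in the same (action, observation) layout as the cell above.
toy_data = [
    (np.array([0]), np.array([[0.0, 0.1, -0.02, 0.3]], dtype=np.float32)),
    (np.array([1]), np.array([[0.1, -0.2, 0.01, -0.1]], dtype=np.float32)),
    (np.array([1]), np.array([[0.2, 0.0, 0.03, 0.2]], dtype=np.float32)),
]

summary = summarize_dataset(toy_data)
print(summary["n_samples"], summary["action_counts"])
```

Running `summarize_dataset(data)` on the real dataset shows whether one action dominates or whether the observations only cover a narrow band around the upright position, both of which would limit what the clone can learn.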

In [43]:
# Train your model
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
import numpy as np
device = "cuda" if torch.cuda.is_available() else "cpu"
class MLP(nn.Module):
  def __init__(self, input_size, hidden_size, output_size):
    super(MLP, self).__init__()
    self.fc1 = nn.Linear(input_size, hidden_size)
    self.relu1 = nn.ReLU()
    self.fc2 = nn.Linear(hidden_size, hidden_size)
    self.relu2 = nn.ReLU()
    self.fc3 = nn.Linear(hidden_size, output_size)


  def forward(self, x):
    x = self.relu1(self.fc1(x))
    x = self.relu2(self.fc2(x))
    x = self.fc3(x)
    return x

input_size = 4
hidden_size = 10
output_size = 2
batch_size = 128
train_epoch = 10

clone_model = MLP(input_size, hidden_size, output_size).to(device)

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(clone_model.parameters(), lr=0.01)

dataloader = DataLoader([[x,y] for y, x in data], batch_size=batch_size, shuffle=True)

for e in range(train_epoch):
  total_loss = []
  for X, y in dataloader:
    X = X.to(device).reshape(-1,4)
    y = y.to(device).reshape(-1)

    output = clone_model(X)
    loss = criterion(output, y)
    total_loss.append(loss.detach().cpu())

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
  print(f"{e=} | {np.mean(total_loss)=}")

torch.save(clone_model, 'clone_model.pth')

e=0 | np.mean(total_loss)=0.6227964
e=1 | np.mean(total_loss)=0.3460347
e=2 | np.mean(total_loss)=0.29079083
e=3 | np.mean(total_loss)=0.2704614
e=4 | np.mean(total_loss)=0.25879872
e=5 | np.mean(total_loss)=0.23977599
e=6 | np.mean(total_loss)=0.23163234
e=7 | np.mean(total_loss)=0.21417525
e=8 | np.mean(total_loss)=0.2062499
e=9 | np.mean(total_loss)=0.19275641


In [44]:
set_env()

clone_model.eval()
observation, info = env.reset()

while True:
  # Move the observation to the same device as the model before the forward pass
  observation = torch.tensor(np.array(observation).reshape(1,4)).to(device)
  with torch.no_grad():
    action = clone_model(observation).argmax()
  observation, reward, terminated, truncated, info = env.step(action.item())

  if terminated or truncated:
    break
env.play()

Moviepy - Building video temp-{start}.mp4.
Moviepy - Writing video temp-{start}.mp4
Moviepy - Done !
Moviepy - video ready temp-{start}.mp4


## Part 2:
You want to see how well your model generalizes to an environment with different parameters. Two of the parameters are changed below. Feel free to look at the [code](https://github.com/Farama-Foundation/Gymnasium/blob/main/gymnasium/envs/classic_control/cartpole.py) base to see which other parameters you can change. Test your Behavioral Cloning model on this new environment and report your results.
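
A single rollout is a noisy measure of performance, so one option is to average the episode length over several episodes when comparing the clone with the expert. The sketch below assumes only the Gymnasium-style `reset`/`step` interface. The `evaluate` helper and the `ToyEnv` stand-in are hypothetical and exist only to keep the example self-contained.

```python
import statistics

def evaluate(policy, env, n_episodes=20, seed=0):
    """Run `policy` for several episodes; return (mean, std) of episode length."""
    lengths = []
    for episode in range(n_episodes):
        observation, info = env.reset(seed=seed + episode)
        steps = 0
        while True:
            action = policy(observation)
            observation, reward, terminated, truncated, info = env.step(action)
            steps += 1
            if terminated or truncated:
                break
        lengths.append(steps)
    return statistics.mean(lengths), statistics.pstdev(lengths)

class ToyEnv:
    """Stand-in env: terminates on action 0, truncates after max_steps."""
    def __init__(self, max_steps=5):
        self.max_steps = max_steps
        self.t = 0

    def reset(self, seed=None):
        self.t = 0
        return [0.0], {}

    def step(self, action):
        self.t += 1
        terminated = action == 0
        truncated = self.t >= self.max_steps
        return [float(self.t)], 1.0, terminated, truncated, {}

good_mean, good_std = evaluate(lambda obs: 1, ToyEnv(), n_episodes=3)
bad_mean, bad_std = evaluate(lambda obs: 0, ToyEnv(), n_episodes=3)
print(good_mean, bad_mean)
```

In the notebook, the same helper could be called with the real `env` and a small function that wraps `clone_model` or `model.predict`, which makes the comparison between the clone and the PPO expert under the changed parameters quantitative rather than based on one video.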

In [45]:
# Changing environment's parameters

env.env.env.env.env.force_mag = 30.0
env.env.env.env.env.gravity = 15.0

# Validate your model
observation, info = env.reset()

while True:
  # Move the observation to the same device as the model before the forward pass
  observation = torch.tensor(np.array(observation).reshape(1,4)).to(device)
  with torch.no_grad():
    action = clone_model(observation).argmax()
  observation, reward, terminated, truncated, info = env.step(action.item())

  if terminated or truncated:
    break
env.play()

Moviepy - Building video temp-{start}.mp4.
Moviepy - Writing video temp-{start}.mp4
Moviepy - Done !
Moviepy - video ready temp-{start}.mp4
