# Accelerating AI Through Human Knowledge
## Training & Tuning Notebook

As part of the workshop you will get the chance to record your own expert trajectories and train your own agents. This notebook covers the training and tuning portion. The first cell below contains two installation commands. Please run this cell within the first 10-15 minutes of the workshop: the installation takes a few minutes and can run in the background.

In [None]:
!mkdir recordings agents
!pip install swig
!pip install git+https://github.com/fhstp/MLPragueImitation

These are the imports. You should be able to run them without doing anything else, provided the install commands ran without an error.

In [None]:
from imitation_workshop.iqlearn import IQLearn
from imitation_gym_wrappers.recorder_wrapper import RecorderWrapper
import imitation_workshop.envs
import gymnasium as gym
import pickle
from stable_baselines3 import SAC, PPO
import numpy as np

To show progress in tensorboard, execute the following cell.

In [None]:
%load_ext tensorboard
%tensorboard --logdir runs

The next cell sets the device. It is set to `'cuda'` by default. If you have a GPU runtime available on Colab, you can leave this as is. If you can only get a CPU runtime, change it to `'cpu'`. The rest of the code works as intended, but training will take longer.

In [None]:
device = 'cuda'
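
If you are unsure which runtime you were given, the device can also be chosen automatically. The following is a minimal sketch, assuming PyTorch is available (it is installed as a dependency of `stable_baselines3`); `pick_device` is a hypothetical helper name and not part of the workshop code.

```python
def pick_device():
    """Return 'cuda' if PyTorch can see a GPU, otherwise 'cpu'."""
    try:
        import torch  # installed as a dependency of stable_baselines3
    except ImportError:
        # Assumed fallback for environments without PyTorch
        return 'cpu'
    return 'cuda' if torch.cuda.is_available() else 'cpu'


device = pick_device()
print(f'using device: {device}')
```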

## The MountainCar Environment

In the following cell you can try your hand at reward shaping on the MountainCar environment (https://gymnasium.farama.org/environments/classic_control/mountain_car_continuous/). If you have never seen or worked with MountainCar: it is a rather simple task in which a little cart can accelerate to the left and to the right. The task is to build enough momentum to reach the goal atop a steep slope (i.e. the mountain).

In the cell we create an instance of the environment and choose the algorithm for our agent. In this case we use the IQLearn algorithm to try to solve the task.

### Usage

In the `step` function there is a variable called `reward`. As the comment indicates, this is where you can try out your ideas on how the reward can be defined and shaped. To do so, you have the following information at your disposal:
+ the cart's previous position
+ the cart's previous velocity
+ the cart's current position
+ the cart's current velocity

Since the notebook has the package `numpy` loaded, you can implement any idea for the reward, as long as it can be expressed mathematically.

In [None]:
class ShapingWrapperMountaincar(gym.Wrapper):
    def __init__(self, env):
        super().__init__(env)
        self.last_obs = None

    def reset(self, **kwargs):
        obs, info = super().reset(**kwargs)
        self.last_obs = obs
        return obs, info

    def step(self, action):
        obs, reward, terminated, truncated, info = super().step(action)

        last_pos = self.last_obs[0]
        last_vel = self.last_obs[1]
        pos = obs[0]
        vel = obs[1]

        # change reward here
        reward = -1

        self.last_obs = obs
        return obs, reward, terminated, truncated, info

env = gym.wrappers.TransformReward(gym.make("MountainCarContinuous-v0"), lambda _: -1)
sac = IQLearn(ShapingWrapperMountaincar(env), sac_args={'use_targets': True, 'q_lr': 3e-2, 'policy_lr': 3e-2, 'autotune': True, 'buffer_size': 10000, 'device': device})
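
As a starting point for the `# change reward here` line, the sketch below shows one possible scheme: potential-based shaping, where the constant step penalty is extended by the change in a "potential" of the state. The helper names `potential` and `shaped_reward`, the weight `vel_weight` and the discount `gamma` are illustrative assumptions and not part of the workshop code. The `sin(3 * pos)` term assumes the hill profile used by Gymnasium's MountainCar.

```python
import math


def potential(pos, vel, vel_weight=100.0):
    """Energy-like score: height on the hill plus a weighted squared velocity.

    Both the height term and vel_weight are assumptions chosen for illustration.
    """
    return math.sin(3.0 * pos) + vel_weight * vel ** 2


def shaped_reward(last_pos, last_vel, pos, vel, gamma=0.99):
    """Step penalty of -1 plus the discounted change in potential."""
    return -1.0 + gamma * potential(pos, vel) - potential(last_pos, last_vel)


# Gaining speed at the bottom of the valley is rewarded more than standing still.
print(shaped_reward(-0.5, 0.0, -0.49, 0.02))
print(shaped_reward(-0.5, 0.0, -0.5, 0.0))
```

Inside `step`, this could be used as `reward = shaped_reward(last_pos, last_vel, pos, vel)`.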

Here we start training the agent. You can run this cell repeatedly if you wish: as long as you do not rerun the cell above, the agent will simply keep training.

In [None]:
sac.sac_learn(5000)

The next cell provides a simple form of evaluation. It lets your agent try to solve the task 20 times and reports the average number of steps per episode. Episodes that are cut off by the time limit before the goal is reached are included in this average.

In [None]:
avg_env = gym.make("MountainCarContinuous-v0")

steps = np.zeros((20,), dtype=np.uint32)
for i in range(20):
    obs, info = avg_env.reset()
    terminated = False
    truncated = False
    step = 0
    while not (terminated or truncated):
        action, _states = sac.predict(obs, deterministic=True)
        obs, rewards, terminated, truncated, info = avg_env.step(action)
        step += 1
    steps[i] = step
print(f"{steps.mean()} steps taken on average")
avg_env.close()
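
The mean episode length alone cannot distinguish an agent that reaches the goal late from one that runs into the time limit. A small helper that also reports the fraction of episodes ending with `terminated` (for MountainCar, the signal that the goal was reached) makes this visible. This is only a sketch: `evaluate` and the stub `CountdownEnv` are hypothetical names introduced here so that the example runs without Gymnasium.

```python
class CountdownEnv:
    """Stub environment (illustration only): terminates after `length` steps."""

    def __init__(self, length):
        self.length = length
        self.t = 0

    def reset(self):
        self.t = 0
        return 0.0, {}

    def step(self, action):
        self.t += 1
        terminated = self.t >= self.length
        return 0.0, -1.0, terminated, False, {}


def evaluate(env, predict, episodes=20):
    """Return (mean episode length, fraction of episodes that terminated)."""
    lengths = []
    n_terminated = 0
    for _ in range(episodes):
        obs, info = env.reset()
        terminated = truncated = False
        steps = 0
        while not (terminated or truncated):
            obs, reward, terminated, truncated, info = env.step(predict(obs))
            steps += 1
        lengths.append(steps)
        n_terminated += int(terminated)
    return sum(lengths) / episodes, n_terminated / episodes


mean_steps, success_rate = evaluate(CountdownEnv(5), lambda obs: 0.0, episodes=3)
print(f"{mean_steps} steps on average, {success_rate:.0%} of episodes terminated")
# With the notebook's objects (not run here):
# evaluate(avg_env, lambda obs: sac.predict(obs, deterministic=True)[0])
```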

The last cell saves the agent. If you wish, you can now load your reward-shaped agent in the frontend used in this workshop to see how well it solves Mountain Car.

In [None]:
agent_name = 'mountaincar_shaped'
save_name = f'agents/{agent_name}_{sac.n_updates}.agent'
with open(save_name, 'wb') as f:
    pickle.dump(sac, f)
print(f'saved agent as {save_name}')

## Imitation Learning

In this part of the notebook we start using imitation learning, first on the Mountain Car example and later on a Car Racing environment (https://gymnasium.farama.org/environments/box2d/car_racing/).

### Mountain Car imitation

First we define a regularizer, create an instance of Mountain Car and set IQLearn as the algorithm to be used.

In [None]:
def regularizer(x):
    return x**2 / 40

env = gym.wrappers.TransformReward(gym.make("MountainCarContinuous-v0"), lambda _: -1)
iqlearn = IQLearn(env, regularizer=regularizer, sac_args={'device': device})

Next we load the recordings made in the frontend and set them as the expert trajectories for our imitator.

In [None]:
recording_name = 'recordings/recording'
recorder = RecorderWrapper(env, 10000)
recorder.load_buffer(recording_name)
iqlearn.set_demonstration_buffer(recorder.get_sb3_buffer())

Here we train the IQLearn agent so that we can later use it to bias the actual agent towards the behaviour displayed in the recordings.

In [None]:
iqlearn.learn(5000)

Here we again include a way of evaluating the performance.

In [None]:
# Evaluate the imitation agent trained above (`iqlearn` must not be overwritten here)
avg_env = gym.make("MountainCarContinuous-v0")

steps = np.zeros((20,), dtype=np.uint32)
for i in range(20):
    obs, info = avg_env.reset()
    terminated = False
    truncated = False
    step = 0
    while not (terminated or truncated):
        action, _states = iqlearn.predict(obs, deterministic=True)
        obs, rewards, terminated, truncated, info = avg_env.step(action)
        step += 1
    steps[i] = step
print(f"{steps.mean()} steps taken on average")
avg_env.close()

This cell saves the agent so it can be loaded into the provided frontend to visualize its performance.

In [None]:
agent_name = 'mountaincar_imitation'
save_name = f'agents/{agent_name}_{iqlearn.n_updates}.agent'
with open(save_name, 'wb') as f:
    pickle.dump(iqlearn, f)
print(f'saved agent as {save_name}')

Now we train the actual agent, using the IQLearn agent as a bias.

In [None]:
sac = IQLearn(env, sac_args={'use_targets': False, 'buffer_size': 10000, 'autotune': False, 'device': device})
sac.set_bias_actor(iqlearn.actor)
sac.actor.load_state_dict(iqlearn.actor.state_dict())

In this cell you can fine-tune your agent by adapting the `alpha` value and retraining it.

In [None]:
sac.alpha=0.2
sac.sac_learn(5000)
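
To build intuition for `alpha`: in standard Soft Actor-Critic, `alpha` is the temperature that weighs the policy's log-probability against the Q-value in the actor objective. The sketch below illustrates only that textbook trade-off with made-up numbers. Whether the workshop's `IQLearn` class uses `alpha` in exactly this way once a bias actor is set is an assumption, and `sac_actor_objective` is a hypothetical helper.

```python
def sac_actor_objective(q_value, log_prob, alpha):
    """Textbook SAC per-sample actor objective (to be maximized): Q - alpha * log pi."""
    return q_value - alpha * log_prob


# Two made-up candidate actions for the same state:
confident = {'q_value': 1.0, 'log_prob': 0.5}   # slightly better Q, very peaked policy
exploring = {'q_value': 0.9, 'log_prob': -1.0}  # slightly worse Q, broader policy

for alpha in (0.05, 0.5):
    score_confident = sac_actor_objective(alpha=alpha, **confident)
    score_exploring = sac_actor_objective(alpha=alpha, **exploring)
    preferred = 'confident' if score_confident > score_exploring else 'exploring'
    print(f'alpha={alpha}: prefers the {preferred} action')
```

With a small `alpha` the Q-value dominates, while a larger `alpha` shifts the preference towards the broader policy.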

Here you can evaluate the agent.

In [None]:
avg_env = gym.make("MountainCarContinuous-v0")

steps = np.zeros((20,), dtype=np.uint32)
for i in range(20):
    obs, info = avg_env.reset()
    terminated = False
    truncated = False
    step = 0
    while not (terminated or truncated):
        action, _states = sac.predict(obs, deterministic=True)
        obs, rewards, terminated, truncated, info = avg_env.step(action)
        step += 1
    steps[i] = step
print(f"{steps.mean()} steps taken on average")
avg_env.close()

Just like before, this cell saves the agent so it can be visualized in the frontend.

In [None]:
agent_name = 'mountaincar_finetuned'
save_name = f'agents/{agent_name}_{sac.n_updates}.agent'
with open(save_name, 'wb') as f:
    pickle.dump(sac, f)
print(f'saved agent as {save_name}')

### Car Racing

In this section the cells basically follow the same steps as for Mountain Car, but this time using the Car Racing environment. First a regularizer, an instance of the environment and our biasing agent are defined.

In [None]:
def regularizer(x):
    return x**2 / 40

env = gym.make("InternalStateCarRacing-v0")
iqlearn = IQLearn(env, regularizer=regularizer, sac_args={'device': device})

Here the recordings saved in the frontend are loaded and set to be used by the biasing agent.

In [None]:
recording_name = 'recordings/recording'

recorder = RecorderWrapper(env, 10000)
recorder.load_buffer(recording_name)
iqlearn.set_demonstration_buffer(recorder.get_sb3_buffer())

Again we train the biasing agent.

In [None]:
iqlearn.learn(5000)

Here the biasing agent is saved so it can be loaded in the frontend.

In [None]:
agent_name = 'carracing_imitation'
save_name = f'agents/{agent_name}_{iqlearn.n_updates}.agent'
with open(save_name, 'wb') as f:
    pickle.dump(iqlearn, f)
print(f'saved agent as {save_name}')

If you do not want to use the most recently trained biasing agent but another saved version, you can use this cell to load it.

In [None]:

agent_name = 'carracing_imitation_5000.agent'
with open(f'agents/{agent_name}', 'rb') as f:
    iqlearn = pickle.load(f)
iqlearn.args.device = device
iqlearn.actor.to(device)
iqlearn.qf1.to(device)
iqlearn.qf2.to(device)

Here we instantiate a Car Racing environment to give you the opportunity to apply reward shaping to your Car Racing imitation agent.

#### Usage

As before you can use various information to shape your reward. The information is as follows:

- offset to the center of the road
- the angle of the road at various distances
- the car's angular velocity
- the car's velocity

As before you can use `numpy` to facilitate any calculations you might want to make.

In [None]:
class ShapingWrapperCarracing(gym.Wrapper):
    def __init__(self, env):
        super().__init__(env)

    def step(self, action):
        obs, reward, terminated, truncated, info = super().step(action)

        offset = obs[0]
        angle_to_road = obs[1]
        angle_to_2m = obs[2]
        angle_to_5m = obs[3]
        angle_to_10m = obs[4]
        angular_velocity = obs[5]
        velocity = obs[6]

        # set reward here
        middle_of_lane_reward = 1-np.abs(offset)/10
        velocity_reward = 1-(np.abs(velocity-50)/50)
        a = 1
        b = 1
        reward = a*middle_of_lane_reward + b*velocity_reward

        if np.abs(offset) > 10:
            terminated = True

        return obs, reward, terminated, truncated, info

env = gym.make("InternalStateCarRacing-v0")
sac = IQLearn(ShapingWrapperCarracing(env), sac_args={'use_targets': True, 'buffer_size': 10000, 'tau': 0.0005, 'autotune': False, 'device': device})
sac.set_bias_actor(iqlearn.actor)
sac.actor.load_state_dict(iqlearn.actor.state_dict())
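
Before starting a training run, it can help to evaluate the shaping terms on a few hand-picked values. The sketch below copies the two terms from the wrapper into a standalone function; `carracing_reward` is a hypothetical helper name, and the sample inputs are made up.

```python
def carracing_reward(offset, velocity, a=1.0, b=1.0):
    """Same formula as in ShapingWrapperCarracing.step, without the environment."""
    middle_of_lane_reward = 1 - abs(offset) / 10
    velocity_reward = 1 - abs(velocity - 50) / 50
    return a * middle_of_lane_reward + b * velocity_reward


# (offset, velocity) pairs chosen by hand for illustration
for offset, velocity in [(0, 50), (5, 25), (10, 0), (0, 120)]:
    print(offset, velocity, carracing_reward(offset, velocity))
```

Note that the velocity term becomes negative for velocities above 100, and that the lane term reaches 0 exactly at the offset beyond which the wrapper terminates the episode.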


Here the agent can be fine-tuned by adapting `alpha`.

In [None]:
sac.alpha=0.5
sac.sac_learn(5000)

This cell saves the agent to be visualized in the frontend.

In [None]:
agent_name = 'carracing_finetuned'
save_name = f'agents/{agent_name}_{sac.n_updates}.agent'
with open(save_name, 'wb') as f:
    pickle.dump(sac, f)
print(f'saved agent as {save_name}')

Here the agent can be evaluated.

In [None]:
avg_env = gym.make("InternalStateCarRacing-v0")

steps = np.zeros((20,), dtype=np.uint32)
np.random.seed(0)
for i in range(20):
    obs, info = avg_env.reset(seed=np.random.randint(2147483647))
    terminated = False
    truncated = False
    step = 0
    while not (terminated or truncated):
        # Evaluate the fine-tuned agent; use `iqlearn.predict` instead to evaluate the biasing agent
        action, _states = sac.predict(obs, deterministic=True)
        obs, rewards, terminated, truncated, info = avg_env.step(action)
        step += 1
    steps[i] = step
print(f"{steps.mean()} steps taken on average")
avg_env.close()
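
Because the reset seeds above are drawn from a generator seeded with 0, every run of the cell uses the same 20 seeds, which keeps results comparable between agents. The sketch below shows the same idea with Python's standard `random` module, generating the seeds up front so the list can be reused for several agents. `make_eval_seeds` is a hypothetical helper name, and the concrete seed values differ from those produced by `np.random`.

```python
import random


def make_eval_seeds(n=20, master_seed=0):
    """Return a reproducible list of n environment seeds."""
    rng = random.Random(master_seed)
    return [rng.randrange(2147483647) for _ in range(n)]


eval_seeds = make_eval_seeds()
# With the notebook's objects (not run here):
# for seed in eval_seeds:
#     obs, info = avg_env.reset(seed=seed)
print(len(eval_seeds), eval_seeds == make_eval_seeds())
```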