***Deep Learning Applications 2023** course, held by Professor **Andrew David Bagdanov** - University of Florence, Italy*

*Notebook and code created by **Giovanni Colombo** - Mat. 7092745*

Check the dedicated [Repository on GitHub](https://github.com/giovancombo/DLA-Labs/tree/main/lab3).

# Deep Learning Applications: Laboratory #3 - DRL

In this laboratory session we will hack an implementation of a navigation environment for Deep Reinforcement Learning written by one of your colleagues (Francesco Fantechi, from Ingegneria Informatica). The setup is fairly simple:

+ A simple 2D environment with a (limited) number of *obstacles* and a single *goal* is presented to the agent, which must learn how to navigate to the goal without hitting any obstacles.
+ The agent *observes* the environment via a set of 16 rays cast uniformly which return the distance to the first obstacle encountered, as well as the distance and direction to the goal.
+ The agent has three possible actions: `ROTATE LEFT`, `ROTATE RIGHT`, or `MOVE FORWARD`.

For each step of an episode, the agent receives a reward of:
+ -100 if hitting an obstacle (episode ends).
+ -100 if one hundred steps are reached without hitting the goal.
+ +500 if hitting the goal (episode ends)
+ A small *positive* reward if the distance to the goal is *reduced*.
+ A small *negative* reward if the distance to the goal is *increased*.
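
The reward scheme listed above can be sketched as a small function. This is only an illustration under stated assumptions: the name `step_reward`, its arguments, and the `shaping` magnitude of the small distance-based reward are hypothetical (the actual values live in the environment's source), and it is assumed that reaching the step limit also ends the episode.

```python
def step_reward(hit_obstacle, reached_goal, step, prev_dist, new_dist,
                max_steps=100, shaping=1.0):
    """Return (reward, done) for one step, following the list above.

    `shaping` is a hypothetical magnitude for the small distance-based
    reward; the real environment may use a different value or formula.
    """
    if hit_obstacle:
        return -100.0, True      # collision ends the episode
    if reached_goal:
        return 500.0, True       # success ends the episode
    if step >= max_steps:
        return -100.0, True      # timeout (assumed to end the episode)
    if new_dist < prev_dist:
        return shaping, False    # moved closer to the goal
    if new_dist > prev_dist:
        return -shaping, False   # moved away from the goal
    return 0.0, False            # distance unchanged (e.g. a pure rotation)
```

Note that under this sketch a rotation that leaves the distance unchanged earns nothing, so only forward motion is shaped.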

In the file `main.py` you will find an implementation of **Deep Q-Learning**.
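
As a reminder of what Deep Q-Learning optimizes, the sketch below computes the one-step temporal-difference target $r + \gamma \max_{a'} Q(s', a')$ that the Q-network is regressed towards. It uses plain Python lists instead of tensors, and the names `td_target` and `td_error` are illustrative; they are not taken from `main.py`.

```python
def td_target(reward, next_q_values, done, gamma=0.99):
    """One-step Q-learning target: r + gamma * max_a' Q(s', a').

    When the episode ended on this transition there is no next state,
    so the bootstrap term is dropped.
    """
    if done:
        return reward
    return reward + gamma * max(next_q_values)


def td_error(q_value, reward, next_q_values, done, gamma=0.99):
    """Difference between the target and the current estimate Q(s, a)."""
    return td_target(reward, next_q_values, done, gamma) - q_value
```

In a real implementation the squared (or Huber) TD error would be averaged over a batch of transitions and minimized by gradient descent.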

## Exercise 1: Testing the Environment

The first thing to do is verify that the environment is working in your Anaconda virtual environment. I had a weird problem with Tensorboard and had to downgrade it using:

    conda install -c conda-forge tensorboard=2.11.2
    
In any case, you should be able to run:

    python main.py
    
from the repository root and it will run episodes using a pretrained agent. To train an agent from scratch, you must modify `main.py` by setting `TRAIN = True` at the top. Running `main.py` again will then train an agent for 2000 episodes. To run the trained agent you will again have to modify `main.py`, on line 225, to load the last saved checkpoint:

    PATH = './checkpoints/last.pth'
    
and then run the script again (after setting `TRAIN = False`!).

Make sure you can run the demo agent and train one from scratch. If you don't have a GPU you can set the number of training episodes to a smaller number.

In [None]:
# Set TRAIN = True for training and then False for testing
!python main_dqn.py

Well, I guess I did it. The main script works. I let it train for 1000 episodes.

Qualitatively, the agent does not seem to have learned to find the goal very well. Many times it hits the walls or obstacles without even trying to change direction, sometimes right after spawning. Other times it finds its way towards the goal, only to stop right in front of it and change direction.

## Exercise 2: Stabilizing Q-Learning



Now that I have verified that the environment works, it is time to stabilize Q-Learning by tweaking the hyperparameters and the architecture. In the end, this amounts to an ablation study.

## Exercise 3: Going Deeper

As usual, pick **AT LEAST ONE** of the following exercises to complete.

### Exercise 3.1: Solving the environment with `REINFORCE`

Use my (or even better, improve on my) implementation of `REINFORCE` to solve the environment.

**Note**: There is a *design flaw* in the environment implementation that will lead to strange (but explainable) behavior in agents trained with `REINFORCE`. See if you can figure it out and fix it.

---

Coming up with a good reward structure is the main challenge of reinforcement learning. Your problem could be perfectly within the capabilities of the model, but if the reward structure is not set up correctly it may never learn.

The goal of the rewards is to encourage specific behavior. In our case we want to guide the agent towards the goal cell, defined by -1.

Similar to the layers and neurons in the network, and epsilon and its associated values, there can be many right (and many wrong) ways to define the reward structure.

There are two main types of reward structures:

+ **Sparse**: rewards are only given in a handful of states.
+ **Dense**: rewards are common throughout the state space.

With sparse rewards the agent has very little feedback to lead it. This would be like simply giving a set penalty for each step and providing one large reward if the agent reaches the goal.

The agent can certainly learn to reach the goal, but depending on the size of the state-space it can take much longer and may get stuck on a suboptimal strategy.

What would be a good way to reward the agent to move towards the goal more incrementally?

The first way is to return the negative of the Manhattan distance between the agent and the goal, so that the reward grows as the agent gets closer.
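
A minimal sketch of such a dense reward, assuming a grid world in which the agent and goal positions are available as `(row, col)` tuples (the function name and signature are illustrative):

```python
def manhattan_reward(agent_pos, goal_pos):
    """Dense reward: the negative Manhattan distance to the goal."""
    agent_row, agent_col = agent_pos
    goal_row, goal_col = goal_pos
    distance = abs(agent_row - goal_row) + abs(agent_col - goal_col)
    return -distance
```

The reward becomes less negative with every step towards the goal and reaches its maximum of zero on the goal cell itself.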

There are many things that can be improved in this implementation. Some things you can think about:

1. **Replay**. In the current implementation we execute an episode, and then immediately run an optimization step on all of the steps of the episode. Not only are we using *correlated* samples from a single episode, we are decidedly *not* taking advantage of parallelism via batch gradient descent. Note that `REINFORCE` does **not** require entire trajectories; all we need are the discounted rewards and log probabilities for *individual transitions*.
NB: In reality, a Replay Buffer cannot be used with REINFORCE! It is an on-policy method, so it cannot support the policy changing from one transition to the next.

2. **Exploration**. The model is probably overfitting (or perhaps remaining too *plastic*, which can explain the unstable convergence). Our policy is *always* stochastic in that we sample from the output distribution. It would be interesting to add a temperature parameter to the policy so that we can control this behavior, or even implement a deterministic policy sampler that always selects the action with max probability to evaluate the quality of the learned policy network.

3. **Discount Factor**: The discount factor (default $\gamma = 0.99$) is an important hyperparameter that has an effect on the stability of training. Try different values for $\gamma$ and see how it affects training. Can you think of other ways to stabilize training?

4. **Baseline Function**: subtracting a function that does not depend on the action greatly stabilizes the REINFORCE algorithm, which is otherwise typically unstable.
The function is arbitrary, but usually an *estimate of the state-value function V* is chosen, somewhat like what happens later in PPO.
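
Several of the points above (discounting, a baseline, and a temperature-controlled or deterministic policy) can be sketched in a few lines of plain Python. This is an illustration only, not the code in `reinforce.py` or `combo_reinforce.py`: all function names are hypothetical, lists stand in for tensors, and the mean return is used as a very simple action-independent baseline in place of a learned estimate of V.

```python
import math
import random


def discounted_returns(rewards, gamma=0.99):
    """Compute G_t = r_t + gamma * G_{t+1} for every step of one episode."""
    returns = []
    running = 0.0
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


def subtract_baseline(returns):
    """Subtract the mean return, a simple action-independent baseline."""
    baseline = sum(returns) / len(returns)
    return [g - baseline for g in returns]


def softmax_with_temperature(logits, temperature=1.0):
    """Turn logits into probabilities; a lower temperature is greedier."""
    scaled = [x / temperature for x in logits]
    largest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - largest) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def select_action(logits, temperature=1.0, deterministic=False, rng=random):
    """Sample an action, or pick the most likely one for evaluation."""
    if deterministic:
        return max(range(len(logits)), key=lambda i: logits[i])
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]
```

During training one would sample with `deterministic=False`, and switch to `deterministic=True` only to evaluate the learned policy.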


In [None]:
import gymnasium as gym
import torch
import matplotlib.pyplot as plt

import wandb
wandb.login()

from models import PolicyNet
from reinforce import reinforce
from combo_reinforce import combo_reinforce

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

Here, I try to solve the Navigation Goal environment using the provided REINFORCE implementation.

In the new version of Gymnasium you need different environments for rendering and no rendering.
Here we instantiate two versions of the environment, one that animates the episodes (which slows everything down), and another that does not.

In [None]:
config = dict(
    env_name = "gym_navigation:NavigationGoal-v0",
    hidden_size = 64,
    lr = 1e-4,
    gamma = 0.99,
    episodes = 1500,
    wandb_log = True,
    capture_video = True)

In [None]:
env = gym.make(config['env_name'], render_mode=None)
env_render = gym.make(config['env_name'], render_mode = 'human')
policy = PolicyNet(env, config['hidden_size']).to(device)

with wandb.init(project="DLA_Lab3_DRL", config = config, monitor_gym=True, save_code=True):
    config = wandb.config

    running  = reinforce(policy, env, env_render, device=device, lr=config['lr'], num_episodes=config['episodes'], wandb_log=config['wandb_log'])

plt.plot(running)
torch.save(policy, f"models/andyreinforce_newgymnavigation_lr{config['lr']}_gamma{config['gamma']}.pt")
env_render.close()
env.close()

I'll now try to implement my own version of the REINFORCE algorithm. Let's first test the environment.

In [None]:
env = gym.make(config['env_name'], render_mode = 'human')
policy = PolicyNet(env, config['hidden_size']).to(device)

with wandb.init(project="DLA_Lab3_DRL", config=config, monitor_gym=True, save_code=True):
    config = wandb.config

    running = combo_reinforce(env, policy, lr = config['lr'], gamma = config['gamma'], episodes = config['episodes'], device = device, wandb_log = config['wandb_log'])

plt.plot(running)
torch.save(policy, f"models/comboreinforce_gymnavigation_lr{config['lr']}_gamma{config['gamma']}.pt")
env.close()

### Exercise 3.2: Solving another environment

The [Gymnasium](https://gymnasium.farama.org/) framework has a ton of interesting and fun environments to work with. Pick one and try to solve it using any technique you like. The [Lunar Lander](https://gymnasium.farama.org/environments/box2d/lunar_lander/) environment is a fun one.

Ok, so, let's build a new LunarLander environment, with render mode so that we can graphically see progress of our lander.

In [None]:
config = dict(
    env_name = "MountainCar-v0",
    hidden_size = 64,
    lr = 1e-3,
    gamma = 0.99,
    episodes = 1500,
    wandb_log = True,
    capture_video = True)

env = gym.make(config['env_name'], render_mode='rgb_array' if config['capture_video'] else None)
if config['capture_video']:
    env = gym.wrappers.RecordVideo(env, f"videos/{config['env_name']}/comboreinforce-lr{config['lr']}-g{config['gamma']}",
                                   episode_trigger=lambda t: t % 25 == 0)
policy = PolicyNet(env, config['hidden_size']).to(device)

In [None]:
env = gym.make(config['env_name'], render_mode=None)
env_render = gym.make(config['env_name'], render_mode = 'human')
policy = PolicyNet(env, config['hidden_size']).to(device)

with wandb.init(project="DLA_Lab3_DRL", config = config, monitor_gym=True, save_code=True):
    config = wandb.config

    running  = reinforce(policy, env, env_render, device=device, lr=config['lr'], num_episodes=config['episodes'], wandb_log=config['wandb_log'])

plt.plot(running)
torch.save(policy, f"models/andyreinforce_{config['env_name']}_lr{config['lr']}_gamma{config['gamma']}.pt")
env_render.close()
env.close()

To set a baseline, I start with a lander that takes completely random actions at each time step. The total reward will be very poor.

In [None]:
for episode in range(10):
    (state, _) = env.reset()
    terminated, truncated = False, False
    score = 0

    while True:
        env.render()
        action = env.action_space.sample()
        state, reward, terminated, truncated, info = env.step(action)
        score += reward

        if terminated or truncated:
            print(f'Episode {episode+1}\tScore: {score:.2f}')
            break

env.close()

...As expected. Now I will try the REINFORCE algorithm. Running both versions showed that the episode-wise REINFORCE is the only one that "works" on this task, as opposed to the CartPole environment, where the interaction-wise REINFORCE gave better results.

In [None]:
with wandb.init(project="DLA_Lab3_DRL", config=config, monitor_gym=True, save_code=True):
    config = wandb.config

    running = combo_reinforce(env, policy, lr = config['lr'], gamma = config['gamma'], episodes = config['episodes'], device = device, wandb_log = config['wandb_log'])

plt.plot(running)
torch.save(policy, f"models/comboreinforce_{config['env_name']}_lr{config['lr']}_gamma{config['gamma']}.pt")
env.close()

And finally, the Deep Q-Learning technique.

To complete the experience, let's also try to solve the CartPole environment.

### Exercise 3.3: Advanced techniques 

The `REINFORCE` and Q-Learning approaches, though venerable, are not even close to the state-of-the-art. Try using an off-the-shelf implementation of [Proximal Policy Optimization (PPO)](https://arxiv.org/abs/1707.06347) to solve one (or more) of these environments. Compare your results with those of Q-Learning and/or REINFORCE.

PPO uses the Actor-Critic approach for the agent. This means that it uses two models, one called the Actor and the other called Critic.

The Actor model learns which action to take given a particular observed state of the environment. In the case of LunarLander-v2, it takes as input a list of eight values representing the current state of the lander, and outputs an action specifying which engine to fire.
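
For reference, the clipped surrogate objective from the PPO paper linked above, which is used to update the Actor, can be sketched as follows. This is a plain-Python illustration with hypothetical names, not the code in `combo_ppo`; the clipping range `clip_eps=0.2` is an assumed default, and the advantages are assumed to come from the Critic's value estimates.

```python
import math


def ppo_clip_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Average PPO clipped surrogate loss (to be minimized).

    For each transition the probability ratio between the new and the old
    policy is clipped to [1 - clip_eps, 1 + clip_eps], and the more
    pessimistic of the clipped and unclipped objectives is kept.
    """
    losses = []
    for new_lp, old_lp, adv in zip(new_log_probs, old_log_probs, advantages):
        ratio = math.exp(new_lp - old_lp)
        clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        objective = min(ratio * adv, clipped_ratio * adv)
        losses.append(-objective)  # negate: we minimize the loss
    return sum(losses) / len(losses)
```

The clipping removes any incentive to move the new policy far from the one that collected the data, which is what makes several optimization steps on the same batch reasonably safe.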

In [1]:
import gym
import torch
import wandb
wandb.login()

from combo_ppo import PPOAgent, PPOTrainer, train_ppo

config = dict(
    env_name = "CartPole-v1",
    hidden_size = 64,
    policy_lr = 3e-4,
    value_lr = 1e-3,
    target_kl_div = 0.02,
    max_policy_train_iters = 40,
    value_train_iters = 40,
    episodes = 1500,
    max_steps = 1000,
    log_frequency = 10,
    wandb_log = True,
    capture_video = False)

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

env = gym.make('CartPole-v1', render_mode = 'rgb_array' if config['capture_video'] else None)
if config['capture_video']:
    env = gym.wrappers.RecordVideo(env, f"videos/{config['env_name']}/ppo-plr{config['policy_lr']}-vlr{config['value_lr']}",
                                   episode_trigger=lambda t: t % 25 == 0)
model = PPOAgent(env.observation_space.shape[0], env.action_space.n, config['hidden_size']).to(device)
ppo = PPOTrainer(model,
                 policy_lr = config['policy_lr'],
                 value_lr = config['value_lr'],
                 target_kl_div = config['target_kl_div'],
                 max_policy_train_iters = config['max_policy_train_iters'],
                 value_train_iters = config['value_train_iters'],)


train_ppo(env, model, ppo, config['episodes'], config['max_steps'], config['log_frequency'], device,
          config['wandb_log'], config)


Episode 10 | Avg Reward 11.8
Episode 20 | Avg Reward 10.5
Episode 30 | Avg Reward 9.7
Episode 40 | Avg Reward 10.0
Episode 50 | Avg Reward 10.7
Episode 60 | Avg Reward 10.7
Episode 70 | Avg Reward 11.3
Episode 80 | Avg Reward 10.6
Episode 90 | Avg Reward 9.5
Episode 100 | Avg Reward 10.2
Episode 110 | Avg Reward 10.4
Episode 120 | Avg Reward 10.3
Episode 130 | Avg Reward 10.6
Episode 140 | Avg Reward 11.0
Episode 150 | Avg Reward 10.5
Episode 160 | Avg Reward 12.2
Episode 170 | Avg Reward 11.4
Episode 180 | Avg Reward 10.9
Episode 190 | Avg Reward 13.2
Episode 200 | Avg Reward 16.8
Episode 210 | Avg Reward 37.1
Episode 220 | Avg Reward 99.9
Episode 230 | Avg Reward 156.7
Episode 240 | Avg Reward 69.0
Episode 250 | Avg Reward 71.8
Episode 260 | Avg Reward 109.3
Episode 270 | Avg Reward 157.3
Episode 280 | Avg Reward 106.7
Episode 290 | Avg Reward 64.2
Episode 300 | Avg Reward 59.8
Episode 310 | Avg Reward 68.3
Episode 320 | Avg Reward 55.5
Episode 330 | Avg Reward 98.7
Episode 340 | Avg

Run history: score ▁▁▁▁▁▁▂▂▂▄▂▁▂▂▂▃▃▅▃▄▁▁▁▁▁▁▁▂▂▃▃▃▃███▃▃▃▁

Run summary: score 10.0
