***Deep Learning Applications 2023** course, held by Professor **Andrew David Bagdanov** - University of Florence, Italy*

*Notebook and code created by **Giovanni Colombo** - Mat. 7092745*

Check the dedicated [Repository on GitHub](https://github.com/giovancombo/DLA-Labs/tree/main/lab3).

# Deep Learning Applications: Laboratory #3 - DRL

In this laboratory session we will hack an implementation of a navigation environment for Deep Reinforcement Learning written by one of your colleagues (Francesco Fantechi, from Ingegneria Informatica). The setup is fairly simple:

+ A simple 2D environment with a (limited) number of *obstacles* and a single *goal* is presented to the agent, which must learn how to navigate to the goal without hitting any obstacles.
+ The agent *observes* the environment via 16 uniformly cast rays, each returning the distance to the first obstacle encountered, together with the distance and direction to the goal.
+ The agent has three possible actions: `ROTATE LEFT`, `ROTATE RIGHT`, or `MOVE FORWARD`.

For each step of an episode, the agent receives a reward of:
+ -100 if hitting an obstacle (episode ends).
+ -100 if one hundred steps are reached without hitting the goal.
+ +500 if hitting the goal (episode ends).
+ A small *positive* reward if the distance to the goal is *reduced*.
+ A small *negative* reward if the distance to the goal is *increased*.
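
The reward rules above can be summarised in a small function. This is only a sketch of the listed rules, not the environment's actual code: the function name, the order of the checks, the `shaping_scale` parameter, and the assumption that the timeout also ends the episode are all hypothetical.

```python
def step_reward(hit_obstacle, reached_goal, timed_out,
                prev_distance, new_distance, shaping_scale=1.0):
    """Return (reward, episode_done) for one step, following the rules listed above.

    shaping_scale is a hypothetical knob: the text only says that the
    distance-based reward is "small".
    """
    if hit_obstacle:
        return -100.0, True
    if reached_goal:
        return 500.0, True
    if timed_out:
        return -100.0, True  # assumption: reaching the step limit ends the episode
    # Positive when the agent moved closer to the goal, negative when it moved away.
    return shaping_scale * (prev_distance - new_distance), False


# One step that brings the agent closer to the goal by 0.5 units.
reward, done = step_reward(False, False, False, prev_distance=3.0, new_distance=2.5)
print(reward, done)
```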

In the file `main.py` you will find an implementation of **Deep Q-Learning**.

## Exercise 1: Testing the Environment

The first thing to do is verify that the environment is working in your Anaconda virtual environment. I had a weird problem with Tensorboard and had to downgrade it using:

    conda install -c conda-forge tensorboard=2.11.2
    
In any case, you should be able to run:

    python main.py
    
from the repository root and it will run episodes using a pretrained agent. To train an agent from scratch, modify `main.py` by setting `TRAIN = True` at the top. Running `main.py` again will then train an agent for 2000 episodes. To run the trained agent, modify `main.py` again on line 225 to load the last saved checkpoint:

    PATH = './checkpoints/last.pth'
    
and then run the script again (after setting `TRAIN = False`!).

Make sure you can run the demo agent and train one from scratch. If you don't have a GPU you can set the number of training episodes to a smaller number.

In [None]:
# Set TRAIN = True for training and then False for testing
!python Navigation_Goal_Deep_Q_Learning/main.py

Well, I guess I did it. The main script works. I let it train for 1000 episodes.

Qualitatively, the agent does not seem to have learned very well how to find the goal. Often it hits the walls or obstacles without even trying to change direction, sometimes right after spawning. At other times it finds its way towards the goal, only to stop right in front of it and change direction.

## Exercise 2: Stabilizing Q-Learning



Ok, so, now that I have verified that the environment works, it's time to stabilize Q-Learning by tweaking the hyperparameters and the architecture. In the end, this will just be an ablation study.
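
One simple way to organise such an ablation is to start from a baseline configuration and change one hyperparameter at a time. The sketch below only generates the configurations; the hyperparameter names and values are placeholders chosen for illustration, not the settings used in `main.py`.

```python
def ablation_configs(baseline, alternatives):
    """Yield the baseline first, then configs that differ from it in exactly one entry."""
    yield dict(baseline)
    for name, values in alternatives.items():
        for value in values:
            if value != baseline[name]:
                yield {**baseline, name: value}


# Hypothetical baseline and alternatives, for illustration only.
baseline = {"lr": 1e-3, "gamma": 0.99, "hidden_size": 64}
alternatives = {"lr": [1e-3, 1e-4], "gamma": [0.95, 0.99], "hidden_size": [64, 128]}

for cfg in ablation_configs(baseline, alternatives):
    print(cfg)
```

Changing a single entry per run keeps the comparison with the baseline easy to interpret, at the cost of not exploring interactions between hyperparameters.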

## Exercise 3: Going Deeper

As usual, pick **AT LEAST ONE** of the following exercises to complete.

### Exercise 3.1: Solving the environment with `REINFORCE`

Use my (or even better, improve on my) implementation of `REINFORCE` to solve the environment.

**Note**: There is a *design flaw* in the environment implementation that will lead to strange (but explainable) behavior in agents trained with `REINFORCE`. See if you can figure it out and fix it.

There are many things that can be improved in this implementation. Some things you can think about:

1. **Replay**. In the current implementation we execute an episode, and then immediately run an optimization step on all of the steps of the episode. Not only are we using *correlated* samples from a single episode, we are decidedly *not* taking advantage of parallelism via batch gradient descent. Note that `REINFORCE` does **not** require entire trajectories; all we need are the discounted rewards and log probabilities for *individual transitions*.

2. **Exploration**. The model is probably overfitting (or perhaps remaining too *plastic*, which can explain the unstable convergence). Our policy is *always* stochastic in that we sample from the output distribution. It would be interesting to add a temperature parameter to the policy so that we can control this behavior, or even implement a deterministic policy sampler that always selects the action with max probability to evaluate the quality of the learned policy network.

3. **Discount Factor**: The discount factor (default $\gamma = 0.99$) is an important hyperparameter that has an effect on the stability of training. Try different values for $\gamma$ and see how it affects training. Can you think of other ways to stabilize training?
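
The temperature idea from the Exploration point can be sketched in plain Python. The sketch assumes access to the policy's pre-softmax logits (the notebook's policy is used through its output probabilities, so this is an assumption); the function names are hypothetical.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Softmax of logits / temperature: low values sharpen, high values flatten."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_action(probs):
    """Deterministic evaluation policy: index of the most probable action."""
    return max(range(len(probs)), key=probs.__getitem__)


logits = [1.0, 2.0, 3.0]
print(softmax_with_temperature(logits, temperature=0.5))   # sharper than temperature=1.0
print(softmax_with_temperature(logits, temperature=5.0))   # closer to uniform
print(greedy_action(softmax_with_temperature(logits)))     # index of the largest logit
```

Sampling from the tempered distribution controls how much the agent explores during training, while `greedy_action` gives a deterministic way to evaluate the learned policy.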

So, what I will do now is the following:
1) Testing the REINFORCE algorithm as it is in the script
2) Implementing a REINFORCE algorithm that takes into account individual transitions instead of entire trajectories
3) Implementing a sort of epsilon-greedy approach for sampling actions from the distribution given by the policy
4) Trying different values for gamma and other hyperparameters
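
For point 3, one possible reading of "a sort of epsilon-greedy approach" is to mix the policy's distribution with a uniform one before sampling. This is a sketch under that assumption, with hypothetical function names; it returns the probability of the sampled action under the mixed distribution, since that is the distribution the action was actually drawn from.

```python
import random


def epsilon_mixed_probs(probs, epsilon):
    """Blend the policy distribution with a uniform one: (1 - eps) * p + eps / n."""
    n = len(probs)
    return [(1.0 - epsilon) * p + epsilon / n for p in probs]


def sample_action(probs, epsilon, rng=random):
    """Sample an action from the blended distribution; also return its probability."""
    mixed = epsilon_mixed_probs(probs, epsilon)
    action = rng.choices(range(len(mixed)), weights=mixed, k=1)[0]
    return action, mixed[action]


policy_probs = [0.7, 0.2, 0.1]           # example policy output for three actions
rng = random.Random(0)                   # seeded generator for reproducibility
action, prob = sample_action(policy_probs, epsilon=0.1, rng=rng)
print(action, prob)
```

With `epsilon=0` this reduces to sampling from the policy itself, and with `epsilon=1` it becomes uniformly random exploration.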

In [None]:
import numpy as np

rewards = [1,1,1,1,1]
gamma = 0.90

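def compute_returns(rewards, gamma):
    # Entry t is sum_{i >= t} gamma**(i+1) * rewards[i]: each reward is discounted relative to the
    # start of the episode, so this is not the reward-to-go G_t = sum_k gamma**k * rewards[t+k].
    return np.flip(np.cumsum([gamma**(i+1)*r for (i, r) in enumerate(rewards)][::-1]), 0).copy()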

print(compute_returns(rewards, gamma))

In [None]:
discountedRewards = []
for t in range(len(rewards)):
    G = 0.0
    for k, r in enumerate(rewards[t:]):
        G += (gamma ** k) * r
    discountedRewards.append(G)

print(np.array(discountedRewards))
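
The nested loop above recomputes every tail sum from scratch, which costs quadratic time in the episode length. The same reward-to-go values can be obtained in a single backward pass using the recursion $G_t = r_t + \gamma G_{t+1}$. The function name below is hypothetical; this is a sketch, not the code used in `reinforce.py`.

```python
def discounted_returns(rewards, gamma):
    """Reward-to-go G_t = sum_k gamma**k * rewards[t + k], computed back to front."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


print(discounted_returns([1, 1, 1, 1, 1], 0.90))
# approximately [4.0951, 3.439, 2.71, 1.9, 1.0]
```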

In [None]:
# Here I test the provided REINFORCE implementation. The environment below is CartPole;
# change the environment id to target the Navigation Goal environment instead.

import matplotlib.pyplot as plt
import gymnasium
import torch

from reinforce import *
from policy import *

# `device` is used below; define it here so this cell does not depend on later cells.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# In the new version of Gymnasium you need different environments for rendering and no rendering.
# Here we instantiate two versions of CartPole, one that animates the episodes (which slows everything
# down), and another that does not animate.
env = gymnasium.make('CartPole-v0', render_mode=None)
env_render = gymnasium.make('CartPole-v0', render_mode='human')

# Make a policy network.
policy = PolicyNet(env, 64).to(device)

# Train the agent.
running  = reinforce(policy, env, env_render, device=device, lr=1e-3, num_episodes=500)
#running += reinforce(policy, env, env_render, device=device, lr=1e-5, num_episodes=100)
plt.plot(running)

# Close up everything
env_render.close()
env.close()

I'll now try to implement my own version of the REINFORCE algorithm. Let's first test the environment.

In [None]:
import gymnasium as gym
import torch
from torch.distributions import Categorical
import matplotlib.pyplot as plt
import wandb

wandb.login()

from policy import *

plt.style.use('fivethirtyeight')
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

In [None]:
config = dict(
    env_name = 'CartPole-v0', #"LunarLander-v2",#"gym_navigation:NavigationGoal-v0",
    hidden_size = 64,
    lr = 1e-3,
    gamma = 0.99,
    episodes = 1000,)

env = gym.make(config['env_name'], render_mode='human') 
policy = PolicyNet(env, config['hidden_size']).to(device)
optimizer = torch.optim.Adam(policy.parameters(), lr=config['lr'])


with wandb.init(project="navigation-goal", config=config):
    config = wandb.config

    running_rewards = [0.0]
    policy.train()
    for episode in range(config['episodes']):
        (state, _) = env.reset()
        terminated, truncated = False, False
        states, actions, log_probs, rewards = [], [], [], []
        score = 0

        while True:
            env.render()
            action_probs = policy(torch.tensor(state, dtype = torch.float32, device = device))  # Policy output, used as action probabilities (one entry per action)
            dist = Categorical(probs = action_probs)    # Create a categorical distribution over the actions
            action = dist.sample().item()               # Sample an action from the policy

            states.append(state)
            state, reward, terminated, truncated, _ = env.step(action)

            # Log probability of the chosen action
            log_prob = dist.log_prob(torch.tensor(action, dtype = torch.long, device=device))
            log_probs.append(log_prob.reshape(1))
            actions.append(action)
            rewards.append(reward)
            score += reward

            if terminated or truncated:
                states.append(state)
                break
        
        discountedRewards = []
        for t in range(len(rewards)):
            G = 0.0
            for k, r in enumerate(rewards[t:]):
                G += (config['gamma'] ** k) * r
            discountedRewards.append(G)

        log_probs = torch.cat(log_probs).to(device)
        discountedRewards = torch.tensor(discountedRewards, dtype = torch.float32, device = device)
        running_rewards.append(0.005 * discountedRewards[0].item() + 0.995 * running_rewards[-1])
        discountedRewards = ((discountedRewards - discountedRewards.mean()) / (discountedRewards.std() + 1e-6))

        optimizer.zero_grad()
        policy_loss = (-log_probs * discountedRewards).sum()
        policy_loss.backward()
        optimizer.step()

        print(f'Episode {episode+1}, {len(rewards)}\tScore: {score:.2f}; Policy loss: {policy_loss:.2f}; Running reward: {running_rewards[-1]:.2f}')

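        # Log scalars only: the loss as a Python float and the latest running reward, not the whole list
        wandb.log({"score": score,
                "policy_loss": policy_loss.item(),
                "running_reward": running_rewards[-1]}, step=episode)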

    env.close()

In [None]:
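# `scores` is only filled by the next cell; plot the running reward tracked by the loop above
plt.plot(running_rewards)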

In [None]:
# with wandb.init(project="navigation-goal", config=config):
#     config = wandb.config

scores = []
policy.train()
for episode in range(config['episodes']):
    (state, _) = env.reset()
    terminated, truncated = False, False
    states, actions, log_probs, rewards = [], [], [], []
    score = 0

    while True:
        env.render()
        action_probs = policy(torch.tensor(state, dtype = torch.float32, device = device))  # Policy output, used as action probabilities (one entry per action)
        dist = Categorical(probs = action_probs)    # Create a categorical distribution over the actions
        action = dist.sample().item()               # Sample an action from the policy

        #log_probs.append(dist.log_prob(torch.tensor(action, dtype = torch.long, device=device)))     # Log probability of the chosen action
        states.append(state)

        state, reward, terminated, truncated, _ = env.step(action)
        actions.append(torch.tensor(action, dtype = torch.long, device=device))
        rewards.append(reward)
        score += reward

        if terminated or truncated:
            states.append(state)
            break
    
    discountedRewards = []
    for t in range(len(rewards)):
        G = 0.0
        for k, r in enumerate(rewards[t:]):
            G += (config['gamma'] ** k) * r
        discountedRewards.append(G)

    for state, action, G in zip(states, actions, discountedRewards):
        action_probs = policy(torch.tensor(state, dtype = torch.float32, device = device))
        dist = Categorical(probs = action_probs)    # Create a categorical distribution over the actions
        log_prob = dist.log_prob(action)

        policy_loss = -log_prob * G

        optimizer.zero_grad()
        policy_loss.backward()
        optimizer.step()

    print(f'Episode {episode+1}, {len(rewards)}\tScore: {score:.2f}')

    scores.append(score)

env.close()
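
The loop above takes one optimizer step per transition. In the spirit of the Replay suggestion, an alternative is to pool the `(state, action, return)` tuples of a few episodes and draw shuffled minibatches from the pool. The helper below is a hypothetical sketch of the batching only. Since `REINFORCE` is on-policy, it assumes the pooled episodes were all collected with the current policy and that the pool is discarded after the update.

```python
import random


def iter_minibatches(transitions, batch_size, rng):
    """Yield shuffled minibatches that together cover every transition exactly once."""
    indices = list(range(len(transitions)))
    rng.shuffle(indices)
    for start in range(0, len(indices), batch_size):
        yield [transitions[i] for i in indices[start:start + batch_size]]


# Toy pool of (state, action, discounted_return) tuples from several episodes.
pool = [((0.1 * i, 0.0), i % 2, float(10 - i)) for i in range(10)]
rng = random.Random(0)
for batch in iter_minibatches(pool, batch_size=4, rng=rng):
    states, actions, returns = zip(*batch)
    print(len(batch), actions)
```

Each minibatch could then be turned into tensors and used for a single gradient step on the mean of `-log_prob * G` over the batch.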

### Exercise 3.2: Solving another environment

The [Gymnasium](https://gymnasium.farama.org/) framework has a ton of interesting and fun environments to work with. Pick one and try to solve it using any technique you like. The [Lunar Lander](https://gymnasium.farama.org/environments/box2d/lunar_lander/) environment is a fun one.

Ok, so, let's build a new LunarLander environment, with render mode so that we can graphically see progress of our lander.

As a baseline, I start with a lander that takes random actions at every time step. The total reward is expected to be very poor.

In [None]:
import gymnasium as gym
import torch
import matplotlib.pyplot as plt
import wandb

wandb.login()

from policy import *

plt.style.use('fivethirtyeight')
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

config = dict(
    env_name = 'LunarLander-v2', # was 'CartPole-v0', which does not accept the LunarLander keyword arguments below
    hidden_size = 64,
    lr = 1e-3,
    gamma = 0.99,
    episodes = 1000,)

# gravity must lie strictly between -12 and 0 for LunarLander; -10.0 is the Gymnasium default.
env = gym.make(config['env_name'], render_mode='human', gravity=-10.0,
                                              enable_wind=False,
                                              wind_power=0.0,
                                              turbulence_power=0.0,)
policy = PolicyNet(env, config['hidden_size']).to(device)
optimizer = torch.optim.Adam(policy.parameters(), lr=config['lr'])

scores = []
policy.train()
for episode in range(config['episodes']):
    (state, _) = env.reset()
    terminated, truncated = False, False
    score = 0
    steps = 0   # number of environment steps taken in this episode

    while True:
        env.render()
        action = env.action_space.sample()   # random action: nothing is learned here
        state, reward, terminated, truncated, info = env.step(action)
        score += reward
        steps += 1
        if terminated or truncated:
            break
    
    print(f'Episode {episode+1}, {steps}\tScore: {score:.2f}')

env.close()

...As expected. Now I will try to use the REINFORCE algorithm.

In [None]:
from combo_reinforce import combo_reinforce

# with wandb.init(project="navigation-goal", config=config):
#     config = wandb.config

score = combo_reinforce(env, policy, lr = config['lr'], gamma = config['gamma'], episodes = config['episodes'], device = device)

# wandb.log({"score": score}, step=episode)

env.close()

And finally, the Deep Q-Learning technique.

### Exercise 3.3: Advanced techniques 

The `REINFORCE` and Q-Learning approaches, though venerable, are not even close to the state-of-the-art. Try using an off-the-shelf implementation of [Proximal Policy Optimization (PPO)](https://arxiv.org/abs/1707.06347) to solve one (or more) of these environments. Compare your results with those of Q-Learning and/or REINFORCE.
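
Before reaching for a library, it may help to recall what PPO changes with respect to `REINFORCE`. The core of the linked paper is the clipped surrogate objective, which limits how far a single update can move the policy away from the one that collected the data. The sketch below evaluates that objective for one sample in plain Python; the function name is hypothetical and `clip_eps=0.2` is just a commonly used value.

```python
def ppo_clipped_objective(ratio, advantage, clip_eps=0.2):
    """Per-sample PPO surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A).

    ratio is pi_new(a|s) / pi_old(a|s); advantage is the estimated advantage A.
    """
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(ratio * advantage, clipped_ratio * advantage)


# With a positive advantage, pushing the ratio above 1 + eps gives no extra gain.
print(ppo_clipped_objective(ratio=1.5, advantage=1.0))
# With a negative advantage, the unclipped (more pessimistic) term is kept.
print(ppo_clipped_objective(ratio=1.5, advantage=-1.0))
```

An off-the-shelf implementation maximises the average of this quantity over minibatches, usually together with a value-function loss and an entropy bonus.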