***Deep Learning Applications 2023** course, held by Professor **Andrew David Bagdanov** - University of Florence, Italy*

*Notebook and code created by **Giovanni Colombo** - Mat. 7092745*

Check the dedicated [Repository on GitHub](https://github.com/giovancombo/DLA-Labs/tree/main/lab3).

# Deep Learning Applications: Laboratory #3 - DRL

In this laboratory session we will hack one of your colleague's (Francesco Fantechi, from Ingegneria Informatica) implementation of a navigation environment for Deep Reinforcement Learning. The setup is fairly simple:

+ A simple 2D environment with a (limited) number of *obstacles* and a single *goal* is presented to the agent, which must learn how to navigate to the goal without hitting any obstacles.
+ The agent *observes* the environment via a set of 16 rays cast uniformly which return the distance to the first obstacle encountered, as well as the distance and direction to the goal.
+ The agent has three possible actions: `ROTATE LEFT`, `ROTATE RIGHT`, or `MOVE FORWARD`.

For each step of an episode, the agent receives a reward of:
+ -100 if hitting an obstacle (episode ends).
+ -100 if one hundred steps are reached without hitting the goal.
+ +100 if hitting the goal (episode ends)
+ A small *positive* reward if the distance to the goal is *reduced*.
+ A small *negative* reward if the distance to the goal is *increased*.

In the file `main.py` you will find an implementation of **Deep Q-Learning**.

## Exercise 1: Testing the Environment

The first thing to do is verify that the environment is working in your Anaconda virtual environment. I had a weird problem with Tensorboard and had to downgrade it using:

    conda install -c conda-forge tensorboard=2.11.2
    
In any case, you should be able to run:

    python main.py
    
from the repository root and it will run episodes using a pretrained agent. To train an agent from scratch, you must modify `main.py` setting `TRAIN = True` at the top. Then running `main.py` again will train an agent for 2000 episodes of training. To run the trained agent you will again have to modify `main.py` on line 225 to load the last saved checkpoint:

    PATH = './checkpoints/last.pth'
    
and then run the script again (after setting `TRAIN = False` !).

Make sure you can at run the demo agent and train one from scratch. If you don't have a GPU you can set the number of training episodes to a smaller number.

In [None]:
# Set TRAIN = True for training and then False for testing
!python Navigation_Goal_Deep_Q_Learning/main.py

Well, I guess I did it. The main script works. I let it train for 1000 episodes.

Qualitatively, it's possible to see that the agents looks like it has not learned so well to find the goal. Many times, the agent hits the walls or obstables without even trying to change direction, even just after the spawn. Some other times, the agent finds its way to the goal, until it stops right in front of it and changes direction.

## Exercise 2: Stabilizing Q-Learning



## Exercise 3: Going Deeper

As usual, pick **AT LEAST ONE** of the following exercises to complete.

### Exercise 3.1: Solving the environment with `REINFORCE`

Use my (or even better, improve on my) implementation of `REINFORCE` to solve the environment.

**Note**: There is a *design flaw* in the environment implementation that will lead to strange (by explainable) behavior in agents trained with `REINFORCE`. See if you can figure it out and fix it.

In [703]:
# Here, I try to solve the Navigation Goal environment using the provided REINFORCE implementation

import matplotlib.pyplot as plt
import gymnasium

from reinforce import *
from policy import *

# In the new version of Gymnasium you need different environments for rendering and no rendering.
# Here we instaintiate two versions of cartpole, one that animates the episodes (which slows everything
# down), and another that does not animate.
env = gymnasium.make('gym_navigation:NavigationGoal-v0', render_mode=None, track_id=1)
env_render = gymnasium.make('gym_navigation:NavigationGoal-v0', render_mode='human')

# Make a policy network.
policy = PolicyNet(env, 64).to(device)

# Train the agent.
running  = reinforce(policy, env, env_render, device=device, lr=1e-4, num_episodes=100)
running += reinforce(policy, env, env_render, device=device, lr=1e-5, num_episodes=100)
plt.plot(running)

# Close up everything
env_render.close()
env.close()

Running reward: -0.9353093063354492
Running reward: -1.3599613349380493
Running reward: -1.774640257511406
Running reward: -2.2050206725568566
Running reward: -2.5085219173080864
Running reward: -2.9942807756170535


KeyboardInterrupt: 

I'll now try to implement my own version of the REINFORCE algorithm. Let's first test the environment.

In [754]:
import gymnasium as gym
import torch
import matplotlib.pyplot as plt
import numpy as np
import wandb

wandb.login()

from policy import *

plt.style.use('fivethirtyeight')
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

env_name = "gym_navigation:NavigationGoal-v0"
hidden_size = 128

lr = 1e-4
gamma = 0.989
episodes = 1000

In [755]:
env = gym.make(env_name, render_mode='human') 
policy = PolicyNet(env, hidden_size).to(device)
optimizer = torch.optim.Adam(policy.parameters(), lr=lr)

with wandb.init(project="navigation-goal"):
    policy.train()
    for episode in range(episodes):
        (state, info) = env.reset()
        terminated, truncated = False, False
        states, actions, log_probs, rewards = [], [], [], []
        score = 0

        while True:
            env.render()
            state = torch.tensor(state, dtype = torch.float32, device = device)
            action_logits = policy(state)                               # Logits = Tensor of shape 3 (3 actions)
            action = torch.multinomial(action_logits.exp(), 1).item()   # Sample an action from the policy
            log_logits = torch.log(action_logits)                       # Log probability of each action

            states.append(state)
            actions.append(action)
            log_probs.append(log_logits[action])

            state, reward, terminated, truncated, info = env.step(action)
            rewards.append(reward)
            score += reward
                
            if terminated or truncated:
                break
        
        optimizer.zero_grad()
        # Compute the discounted rewards
        policy_loss = torch.tensor(0, dtype=torch.float32, device = device)
        for t in range(len(rewards)):
            gammas = gamma ** np.arange(len(rewards) - t)
            G = torch.tensor(np.sum(np.array(rewards[t:]) * gammas), dtype=torch.float32, device = device)
            policy_loss += log_probs[t] * G

        policy_loss /= len(rewards)
        #print(policy_loss)
        policy_loss.backward()
        optimizer.step()

        print(f'Episode {episode+1}, {len(rewards)}\tScore: {score:.4f}; Policy loss: {policy_loss:.2f}')

        wandb.log({"score": score, "policy_loss": policy_loss}, step=episode)

    env.close()

Episode 1, 13	Score: -101.1562; Policy loss: 127.44
Episode 2, 55	Score: -115.3443; Policy loss: 90.34
Episode 3, 39	Score: -69.4699; Policy loss: 80.17
Episode 4, 36	Score: -115.1223; Policy loss: 101.48
Episode 5, 26	Score: -100.0161; Policy loss: 92.19
Episode 6, 43	Score: -104.3072; Policy loss: 94.81
Episode 7, 51	Score: -104.6188; Policy loss: 92.28
Episode 8, 23	Score: -100.1336; Policy loss: 110.68
Episode 9, 32	Score: -102.6616; Policy loss: 91.62
Episode 10, 12	Score: -108.2232; Policy loss: 99.08
Episode 11, 22	Score: -102.3085; Policy loss: 124.85
Episode 12, 35	Score: -116.3669; Policy loss: 105.26
Episode 13, 36	Score: -114.3873; Policy loss: 95.19
Episode 14, 28	Score: -108.9274; Policy loss: 109.72
Episode 15, 13	Score: -93.9534; Policy loss: 94.80
Episode 16, 61	Score: -112.6524; Policy loss: 86.21
Episode 17, 9	Score: -95.1236; Policy loss: 119.69
Episode 18, 38	Score: -100.5372; Policy loss: 90.25
Episode 19, 65	Score: -112.7306; Policy loss: 88.68
Episode 20, 60	Sco

0,1
policy_loss,███████▁████████████████████████████████
score,▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁

0,1
policy_loss,92.43647
score,-102.40881


### Exercise 3.2: Solving another environment

The [Gymnasium](https://gymnasium.farama.org/) framework has a ton of interesting and fun environments to work with. Pick one and try to solve it using any technique you like. The [Lunar Lander](https://gymnasium.farama.org/environments/box2d/lunar_lander/) environment is a fun one.

Ok, so, let's build a new LunarLander environment, with render mode so that we can graphically see progress of our lander.

To set things, I start implementing a lander that takes totally random actions at each time tick. The total reward will be very bad.

In [None]:
import gymnasium as gym
import matplotlib.pyplot as plt

from reinforce import *

plt.style.use('fivethirtyeight')

env_name = "LunarLander-v2"
episodes = 20

env = gym.make(env_name, render_mode='human', gravity=0.0,
                                              enable_wind=False,
                                              wind_power=0.0,
                                              turbulence_power=0.0,)

for episode in range(1, episodes+1):
    state = env.reset()
    terminated = False
    truncated = False
    score = 0

    while True:
        env.render()
        action = env.action_space.sample()              # Scelta random tra i valori [0,1,2,3] delle azioni
        state, reward, terminated, truncated, info = env.step(action)
        score += reward
        if terminated or truncated:
            break
    print(f'Episode {episode},  score: {score:.4f}')

env.close()

...As expected. Now I will try to use the REINFORCE algorithm already implemented in Francesco's work.

In [None]:
# Make a policy network.
policy = PolicyNet(env).to(device)

# Train the agent.
running  = reinforce(policy, env, env_render, device=device, lr=1e-4, num_episodes=1000)
running += reinforce(policy, env, env_render, device=device, lr=1e-5, num_episodes=1000)
plt.plot(running)

# Close up everything
env_render.close()
env.close()

...And finally, with my own implementation of REINFORCE.

In [None]:
# Make a policy network.
policy = PolicyNet(env).to(device)

# Train the agent.
running  = reinforce(policy, env, env_render, device=device, lr=1e-4, num_episodes=1000)
running += reinforce(policy, env, env_render, device=device, lr=1e-5, num_episodes=1000)
plt.plot(running)

# Close up everything
env_render.close()
env.close()

### Exercise 3.3: Advanced techniques 

The `REINFORCE` and Q-Learning approaches, though venerable, are not even close to the state-of-the-art. Try using an off-the-shelf implementation of [Proximal Policy Optimization (PPO)](https://arxiv.org/abs/1707.06347) to solve one (or more) of these environments. Compare your results with those of Q-Learning and/or REINFORCE.