***Deep Learning Applications 2023** course, held by Professor **Andrew David Bagdanov** - University of Florence, Italy*

*Notebook and code created by **Giovanni Colombo** - Mat. 7092745*

Check the dedicated [Repository on GitHub](https://github.com/giovancombo/DLA-Labs/tree/main/lab3).

# Deep Learning Applications: Laboratory #3 - DRL

In this laboratory session we will hack an implementation of a navigation environment for Deep Reinforcement Learning written by one of your colleagues (Francesco Fantechi, from Ingegneria Informatica). The setup is fairly simple:

+ A simple 2D environment with a (limited) number of *obstacles* and a single *goal* is presented to the agent, which must learn how to navigate to the goal without hitting any obstacles.
+ The agent *observes* the environment via a set of 16 uniformly cast rays, each returning the distance to the first obstacle encountered, together with the distance and direction to the goal.
+ The agent has three possible actions: `ROTATE LEFT`, `ROTATE RIGHT`, or `MOVE FORWARD`.
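
To make the observation and action spaces concrete, here is a minimal sketch of how such an observation vector could be assembled. The names `ray_angles` and `build_observation`, the assumption that the rays cover a full circle, and the ordering of the vector are all hypothetical; the actual environment code may differ.

```python
import math

NUM_RAYS = 16  # number of rays, as stated in the description above
ACTIONS = ("ROTATE LEFT", "ROTATE RIGHT", "MOVE FORWARD")


def ray_angles(heading, num_rays=NUM_RAYS):
    """Absolute ray angles in radians.

    Assumption: the rays are spread evenly over a full circle,
    starting from the agent's current heading.
    """
    step = 2 * math.pi / num_rays
    return [(heading + k * step) % (2 * math.pi) for k in range(num_rays)]


def build_observation(ray_distances, goal_distance, goal_direction):
    """Concatenate the ray distances with the goal information.

    Assumption: the goal distance and direction are appended after the rays.
    """
    if len(ray_distances) != NUM_RAYS:
        raise ValueError(f"expected {NUM_RAYS} ray distances")
    return list(ray_distances) + [goal_distance, goal_direction]


obs = build_observation([1.0] * NUM_RAYS, goal_distance=5.0, goal_direction=0.3)
print(len(obs), "observation features,", len(ACTIONS), "discrete actions")
```

Under these assumptions the agent sees 18 numbers per step and picks one of three discrete actions.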

For each step of an episode, the agent receives a reward of:
+ -100 if hitting an obstacle (episode ends).
+ -100 if one hundred steps are reached without hitting the goal.
+ +100 if hitting the goal (episode ends).
+ A small *positive* reward if the distance to the goal is *reduced*.
+ A small *negative* reward if the distance to the goal is *increased*.
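
The reward rules above can be sketched as a single function. This is only an illustration: the function name `step_reward`, the order in which the conditions are checked, the assumption that the step limit also ends the episode, and the shaping term proportional to the change in distance are all assumptions, since the text only says the shaping reward is "small".

```python
def step_reward(hit_obstacle, reached_goal, steps_taken,
                prev_dist, new_dist, max_steps=100, shaping=1.0):
    """Return (reward, episode_done) for one step.

    Assumptions: terminal checks take priority over shaping, the step
    limit ends the episode, and the shaping reward is proportional to
    the reduction in distance to the goal (hypothetical scale).
    """
    if hit_obstacle:
        return -100.0, True
    if reached_goal:
        return 100.0, True
    if steps_taken >= max_steps:
        return -100.0, True
    # Positive when the agent moved closer, negative when it moved away.
    return shaping * (prev_dist - new_dist), False


print(step_reward(False, False, 5, prev_dist=2.0, new_dist=1.5))
print(step_reward(True, False, 5, prev_dist=2.0, new_dist=1.5))
```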

In the file `main.py` you will find an implementation of **Deep Q-Learning**.
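
As a reminder of the two ingredients at the heart of Q-Learning, the sketch below shows epsilon-greedy action selection and the temporal-difference target that a Q-network is trained to regress toward. It is written in plain Python on lists of Q-values; the function names are hypothetical, and the code in `main.py` may be organized differently (for example with a replay buffer and a separate target network).

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def td_target(reward, next_q_values, done, gamma=0.99):
    """Q-Learning target: r + gamma * max_a' Q(s', a'), or just r at episode end."""
    if done:
        return reward
    return reward + gamma * max(next_q_values)


# Example with three actions (ROTATE LEFT, ROTATE RIGHT, MOVE FORWARD).
action = epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.1)
target = td_target(reward=1.0, next_q_values=[0.0, 2.0, 1.0], done=False)
print(action, target)
```

In Deep Q-Learning the Q-values come from a neural network, and the loss is some distance (often Huber or squared error) between `Q(s, a)` and this target.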

## Exercise 1: Testing the Environment

The first thing to do is verify that the environment is working in your Anaconda virtual environment. I had a weird problem with Tensorboard and had to downgrade it using:

    conda install -c conda-forge tensorboard=2.11.2
    
In any case, you should be able to run:

    python main.py
    
from the repository root and it will run episodes using a pretrained agent. To train an agent from scratch, you must modify `main.py`, setting `TRAIN = True` at the top. Running `main.py` again will then train an agent for 2000 episodes. To run the trained agent you will again have to modify `main.py`, on line 225, to load the last saved checkpoint:

    PATH = './checkpoints/last.pth'
    
and then run the script again (after setting `TRAIN = False`!).

Make sure you can at least run the demo agent and train one from scratch. If you don't have a GPU you can set the number of training episodes to a smaller number.

In [None]:
# Set TRAIN = True for training and then False for testing
!python Navigation_Goal_Deep_Q_Learning/main.py

Well, I guess I did it. The main script works. I let it train for 1000 episodes.

Qualitatively, the agent does not appear to have learned to reach the goal reliably. It often hits walls or obstacles without even trying to change direction, sometimes right after spawning. At other times it finds its way toward the goal, only to stop right in front of it and change direction.

## Exercise 2: Stabilizing Q-Learning



## Exercise 3: Going Deeper

As usual, pick **AT LEAST ONE** of the following exercises to complete.

### Exercise 3.1: Solving the environment with `REINFORCE`

Use my (or even better, improve on my) implementation of `REINFORCE` to solve the environment.

**Note**: There is a *design flaw* in the environment implementation that will lead to strange (but explainable) behavior in agents trained with `REINFORCE`. See if you can figure it out and fix it.
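
For reference, `REINFORCE` weights the log-probability of each action by the discounted return that followed it. The sketch below computes those returns for one episode in plain Python; the function names and the optional standardization step are illustrative assumptions, and the provided `reinforce` module may compute them differently.

```python
def discounted_returns(rewards, gamma=0.99):
    """Return-to-go G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ..."""
    returns = []
    g = 0.0
    # Walk the episode backwards so each return reuses the next one.
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    return returns


def standardize(values, eps=1e-8):
    """Zero-mean, unit-variance rescaling, a common variance-reduction trick."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / (var ** 0.5 + eps) for v in values]


returns = discounted_returns([1.0, 1.0, 1.0], gamma=0.5)
print(returns)
print(standardize(returns))
```

The policy loss for one episode is then the negative sum of `log_prob(a_t) * G_t` over its steps, so the size and sign of the rewards an episode accumulates directly shape the gradient.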

In [4]:
import matplotlib.pyplot as plt
import gymnasium

from reinforce import *
from policy import *

# In the new version of Gymnasium you need different environments for rendering and no rendering.
# Here we instantiate two versions of the navigation environment: one that animates the episodes
# (which slows everything down), and another that does not animate.
env = gymnasium.make('gym_navigation:NavigationGoal-v0', render_mode=None, track_id=1)
env_render = gymnasium.make('gym_navigation:NavigationGoal-v0', render_mode='human')

# Make a policy network.
policy = PolicyNet(env).to(device)

# Train the agent.
running  = reinforce(policy, env, env_render, device=device, lr=1e-4, num_episodes=100)
running += reinforce(policy, env, env_render, device=device, lr=1e-5, num_episodes=100)
plt.plot(running)

# Close up everything
env_render.close()
env.close()

Running reward: -0.7287060056686402
Running reward: -1.1493442062921524
Running reward: -1.5693263594672346
Running reward: -1.9958526295864023
Running reward: -2.4143626318803646
Running reward: -2.8590494323073887
Running reward: -3.307590755885598
Running reward: -3.523532870312957
Running reward: -3.976472876554654
Running reward: -4.341953289881841
Running reward: -4.709449150873838
Running reward: -5.131456409514
Running reward: -5.3675004252264396
Running reward: -3.3445638325241354
Running reward: -3.7173847755685463
Running reward: -4.085489708074981
Running reward: -4.524758380750426
Running reward: -4.925999792704096
Running reward: -5.3047203567898915


KeyboardInterrupt: 

### Exercise 3.2: Solving another environment

The [Gymnasium](https://gymnasium.farama.org/) framework has a ton of interesting and fun environments to work with. Pick one and try to solve it using any technique you like. The [Lunar Lander](https://gymnasium.farama.org/environments/box2d/lunar_lander/) environment is a fun one.

Ok, so let's build a new LunarLander environment with rendering enabled, so that we can watch the lander's progress.

In [1]:
import gymnasium as gym
import matplotlib.pyplot as plt

from reinforce import *

plt.style.use('fivethirtyeight')

env_name = "LunarLander-v2"
episodes = 20

env = gym.make(env_name, render_mode='human')

To set a baseline, I start with a lander that takes completely random actions at each time step. The total reward is expected to be very poor.

In [7]:
for episode in range(1, episodes+1):
    state, info = env.reset()                           # Gymnasium's reset() returns (observation, info)
    terminated = False
    truncated = False
    score = 0

    while not terminated and not truncated:
        env.render()
        action = env.action_space.sample()              # Random choice among the action values [0, 1, 2, 3]
        state, reward, terminated, truncated, info = env.step(action)
        score += reward
    print('Episode {},  score: {}'.format(episode, score))

env.close()

Episode 1,  score: -62.1084151496713
Episode 2,  score: -194.84647116135045
Episode 3,  score: -76.65653307325597
Episode 4,  score: -68.78041603709626
Episode 5,  score: -306.6852013935006
Episode 6,  score: -114.56088488358084
Episode 7,  score: -2.5777693246189415
Episode 8,  score: -123.80920637421295
Episode 9,  score: -167.7293375767921
Episode 10,  score: -523.5022029599596
Episode 11,  score: -260.32935786394
Episode 12,  score: -137.38976611142658
Episode 13,  score: -99.43182030164184
Episode 14,  score: -163.50386089447278
Episode 15,  score: -97.87862314242409
Episode 16,  score: -83.57409230094022
Episode 17,  score: -329.40643678070023
Episode 18,  score: -146.48907410685518
Episode 19,  score: -103.10951506921988
Episode 20,  score: -148.4399761696688
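
To have a number to compare later agents against, the scores printed above can be summarized as follows. This is a small sketch using the twenty scores from this run, rounded to two decimals; the 200-point threshold is the score at which the Gymnasium documentation considers a Lunar Lander episode solved.

```python
from statistics import mean, stdev

# Scores of the random agent copied from the output above (rounded).
random_scores = [
    -62.11, -194.85, -76.66, -68.78, -306.69,
    -114.56, -2.58, -123.81, -167.73, -523.50,
    -260.33, -137.39, -99.43, -163.50, -97.88,
    -83.57, -329.41, -146.49, -103.11, -148.44,
]

SOLVED_THRESHOLD = 200  # per the Gymnasium Lunar Lander documentation

baseline_mean = mean(random_scores)
baseline_std = stdev(random_scores)

print(f"Random baseline: mean {baseline_mean:.1f}, std {baseline_std:.1f}")
print(f"Best episode: {max(random_scores):.1f}, worst: {min(random_scores):.1f}")
print(f"Episodes reaching the solved threshold: "
      f"{sum(s >= SOLVED_THRESHOLD for s in random_scores)}")
```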


...As expected. Now I will try to use the REINFORCE algorithm already implemented in Francesco's work.

In [None]:
# PolicyNet is defined in policy.py, which the LunarLander cells above did not import.
from policy import *

# In the new version of Gymnasium you need different environments for rendering and no rendering.
# Here we instantiate two versions of LunarLander: one that animates the episodes (which slows
# everything down), and another that does not animate.
env = gym.make(env_name, render_mode=None)
env_render = gym.make(env_name, render_mode='human')

# Make a policy network.
policy = PolicyNet(env).to(device)

# Train the agent.
running  = reinforce(policy, env, env_render, device=device, lr=1e-4, num_episodes=1000)
running += reinforce(policy, env, env_render, device=device, lr=1e-5, num_episodes=1000)
plt.plot(running)

# Close up everything
env_render.close()
env.close()

### Exercise 3.3: Advanced techniques 

The `REINFORCE` and Q-Learning approaches, though venerable, are not even close to the state-of-the-art. Try using an off-the-shelf implementation of [Proximal Policy Optimization (PPO)](https://arxiv.org/abs/1707.06347) to solve one (or more) of these environments. Compare your results with those of Q-Learning and/or REINFORCE.
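
To see what distinguishes PPO from plain `REINFORCE`, the sketch below evaluates PPO's clipped surrogate objective for a single sample in plain Python. The function name is hypothetical and `clip_eps=0.2` is just a commonly used value; an off-the-shelf implementation adds value-function and entropy terms, minibatching, and advantage estimation on top of this.

```python
import math


def ppo_clipped_objective(new_logp, old_logp, advantage, clip_eps=0.2):
    """Per-sample clipped surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A).

    r is the probability ratio between the updated and the old policy
    for the action that was actually taken.
    """
    ratio = math.exp(new_logp - old_logp)
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(ratio * advantage, clipped_ratio * advantage)


# The new policy doubles the action's probability: with a positive
# advantage the gain is capped, with a negative one it is not.
print(ppo_clipped_objective(math.log(2.0), 0.0, advantage=1.0))
print(ppo_clipped_objective(math.log(2.0), 0.0, advantage=-1.0))
```

The clipping removes the incentive to move the policy far from the one that collected the data, which is what makes several gradient steps on the same batch reasonably safe.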