# Reinforcement Learning - RIDER Project

<img src="logo.jpeg" style="float: left; width: 15%" />


2022-2023 Marc-Antoine Oudotte, Clément Garancini, Victor Barberteguy

# Deep Q-Learning

This notebook presents the Deep Q-Learning (DQN) algorithm, its settings and its results on the three levels of the Rider game.

## 1 - Environment and algorithm

We start by installing the necessary libraries and loading the Unity environment.

In [None]:
# If you don't have the libraries, run this cell
### THIS REQUIRES PYTHON 3.7 OR HIGHER AND IS LIKELY TO DOWNGRADE SOME OF YOUR LIBRARIES
### IF YOU JUST WANT TO SEE THE RESULTS, GO TO PART 4; OTHERWISE, CONSIDER RUNNING IN A VIRTUAL ENV
### You can also use the file RiderDQN.py, which is better suited to running the algorithm.

!pip install gymnasium imageio ipython ipywidgets nnfigs numpy pandas pygame seaborn torch tqdm matplotlib mlagents

In [None]:
# Libraries
from mlagents_envs.envs.unity_gym_env import UnityToGymWrapper
from mlagents_envs.environment import UnityEnvironment
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import numpy as np
import matplotlib.pyplot as plt
from collections import deque
import copy
from tqdm.notebook import tqdm
import random

In [None]:
# Input the number of the level that will be played by the algorithm

LEVEL = 3

In [None]:
# Wrap the environment

# Input True if you want the game to render graphically
Graphics = False

#Input the path to the Build
PATH = "/Users/victorbarberteguy/Desktop/3A/INF581/RL/Rider/Build_Level"+str(LEVEL)+".app"

uEnv = UnityEnvironment(PATH, worker_id=0, seed=1, no_graphics=not(Graphics))
env = UnityToGymWrapper(uEnv, uint8_visual = False, flatten_branched= True,  allow_multiple_obs=True)

We now define the constants of the DQN and save them at runtime in a .txt file (in the current working directory).

In [None]:
# Constants

EPISODES = 200
LR = 0.001
MEM_SIZE = 10000
BATCH_SIZE = 72
GAMMA = 0.95
EXPLORATION_MAX = 1.0
EXPLORATION_DECAY = 0.999
EXPLORATION_MIN = 0.001
sync_freq = 5

file = open("settingsDQN.txt", 'w')
settings = ['\nEPISODES  = ' + str(EPISODES),'\nLR  = ' + str(LR), '\nMEM_SIZE  = ' + str(MEM_SIZE),'\nBATCH_SIZE  = ' + str(BATCH_SIZE),'\nGAMMA  = ' + str(GAMMA),'\nEXPLORATION_MAX  = ' + str(EXPLORATION_MAX),'\nEXPLORATION_DECAY  = ' + str(EXPLORATION_DECAY),'\nEXPLORATION_MIN  = ' + str(EXPLORATION_MIN),'\nsync_freq = ' + str(sync_freq) ]
file.writelines(settings)
file.close()
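
As a rough check on the exploration schedule, the sketch below estimates how many decay steps are needed for epsilon to fall from `EXPLORATION_MAX` to `EXPLORATION_MIN`. It assumes epsilon is multiplied by `EXPLORATION_DECAY` once per learning step, as the `learn` method of the DQN class further down does. The helper name `steps_to_min_epsilon` is introduced here for illustration and is not part of the project code.

```python
import math

def steps_to_min_epsilon(eps_max, eps_min, decay):
    """Approximate number of multiplicative decay steps for epsilon to reach eps_min.

    Hypothetical helper: solves eps_max * decay**n <= eps_min for n.
    """
    if not (0 < decay < 1 and 0 < eps_min <= eps_max):
        raise ValueError("expected 0 < decay < 1 and 0 < eps_min <= eps_max")
    return math.ceil(math.log(eps_min / eps_max) / math.log(decay))

# Values copied from the constants cell above
n_steps = steps_to_min_epsilon(1.0, 0.001, 0.999)
print(n_steps)
```

With these values, epsilon reaches its floor after roughly 6,900 learning steps; how many episodes that represents depends on the episode length.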

We create a network class that will be a component of our DQN: it approximates the Q-values of the two actions from an observation.

In [None]:
class Network(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.input_shape = 5 # each observation has 5 components
        self.action_space = 2 # there are 2 discrete actions

        # We chose a small fully connected network with two hidden layers and ReLU activations
        
        self.fc1 = nn.Linear(self.input_shape, 1024)
        self.fc2 = nn.Linear(1024, 512)
        self.fc3 = nn.Linear(512, self.action_space)

        self.optimizer = optim.Adam(self.parameters(), lr=LR)
        self.loss = nn.MSELoss()
    
    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)

        return x
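
To make the size of this network concrete, the following sketch counts the trainable parameters of a fully connected network with the same layer widths (5, 1024, 512, 2). It is plain arithmetic on the layer sizes and does not depend on PyTorch. The name `count_mlp_parameters` is introduced here for illustration only.

```python
def count_mlp_parameters(layer_sizes):
    """Weights plus biases of consecutive fully connected layers."""
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out + n_out  # weight matrix + bias vector
    return total

# Layer widths taken from the Network class above: fc1, fc2, fc3
sizes = [5, 1024, 512, 2]
n_params = count_mlp_parameters(sizes)
print(n_params)
```

Almost all of the parameters sit in the 1024 x 512 layer that connects the two hidden layers.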

Another component of our DQN is a replay buffer. It stabilizes learning by storing the last `MEM_SIZE` transitions and training on minibatches of `BATCH_SIZE` transitions sampled uniformly at random from them.

In [None]:
class ReplayBuffer:
    def __init__(self):
        self.memory = deque(maxlen=MEM_SIZE)
    
    def add(self, experience):
        self.memory.append(experience)
    
    def sample(self):
        minibatch = random.sample(self.memory, BATCH_SIZE)

        state1_batch = torch.stack([s1 for (s1,a,r,s2,d) in minibatch])
        action_batch = torch.tensor([a for (s1,a,r,s2,d) in minibatch])
        reward_batch = torch.tensor([r for (s1,a,r,s2,d) in minibatch])
        state2_batch = torch.stack([s2 for (s1,a,r,s2,d) in minibatch])
        done_batch = torch.tensor([d for (s1,a,r,s2,d) in minibatch])

        return (state1_batch, action_batch, reward_batch, state2_batch, done_batch)
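
The two behaviours the buffer relies on can be seen on a toy example: a `deque` with `maxlen` silently drops its oldest entries, and `random.sample` draws distinct items uniformly without replacement. The integers below stand in for transitions and the sizes are deliberately tiny; they are illustrative values, not the notebook's settings.

```python
import random
from collections import deque

random.seed(0)  # only to make the illustration repeatable

toy_memory = deque(maxlen=5)      # plays the role of MEM_SIZE
for transition_id in range(8):    # 8 insertions into a buffer of size 5
    toy_memory.append(transition_id)

remaining = list(toy_memory)      # the three oldest entries (0, 1, 2) were evicted
minibatch = random.sample(remaining, 3)  # plays the role of BATCH_SIZE

print(remaining)
print(minibatch)
```

Sampling at random rather than in insertion order is what breaks the correlation between consecutive transitions.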

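We can now define our main class, DQN, with a second stabilizing feature: the target network. It is a copy of the main network, synchronized periodically, that is used to compute the next-state Q-values in the learning target.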

In [None]:
class DQN:
    def __init__(self):
        self.replay = ReplayBuffer() 
        self.exploration_rate = EXPLORATION_MAX
        self.network = Network()
        
        # Target network
        self.network2 = copy.deepcopy(self.network) 
        self.network2.load_state_dict(self.network.state_dict())


    def choose_action(self, observation):
        if random.random() < self.exploration_rate:
            return env.action_space.sample()

        # Convert observation to a PyTorch tensor (as_tensor also accepts an existing tensor without a copy warning)
        state = torch.as_tensor(observation).float().detach()
        #state = state.to(DEVICE)
        state = state.unsqueeze(0)

        # Get Q(s,.)
        q_values = self.network(state)

        # Choose the action to play
        action = torch.argmax(q_values).item()

        return action
    
    def learn(self):
        if len(self.replay.memory)< BATCH_SIZE:
            return

        # Sample minibatch s1, a1, r1, s1', done_1, ... , sn, an, rn, sn', done_n
        state1_batch, action_batch, reward_batch, state2_batch, done_batch = self.replay.sample()

        # Compute Q values
        q_values = self.network(state1_batch).squeeze()

        with torch.no_grad():
            # Compute next Q values
            next_q_values = self.network2(state2_batch).squeeze()

        batch_indices = np.arange(BATCH_SIZE, dtype=np.int64)

        predicted_value_of_now = q_values[batch_indices, action_batch]
        predicted_value_of_future = torch.max(next_q_values, dim=1)[0]

        # Compute the q_target
        q_target = reward_batch + GAMMA * predicted_value_of_future * (1-(done_batch).long())

        # Compute the loss (cf. self.network.loss)
        loss = self.network.loss(q_target, predicted_value_of_now)

        # Compute the gradient of the loss and update the network
        self.network.optimizer.zero_grad()
        loss.backward()
        self.network.optimizer.step()

        self.exploration_rate *= EXPLORATION_DECAY
        self.exploration_rate = max(EXPLORATION_MIN, self.exploration_rate)
        
    def returning_epsilon(self):
        return self.exploration_rate
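
The target built in `learn` is the reward plus `GAMMA` times the largest target-network Q-value of the next state, and just the reward when the transition is terminal. A minimal sketch in plain Python, with made-up rewards and Q-values, shows the arithmetic. `td_targets` is a hypothetical helper, not a function of the notebook.

```python
def td_targets(rewards, next_q_values, dones, gamma):
    """One-step Q-learning targets: r + gamma * max_a Q(s', a), or r if terminal."""
    targets = []
    for r, q_next, done in zip(rewards, next_q_values, dones):
        bootstrap = 0.0 if done else gamma * max(q_next)
        targets.append(r + bootstrap)
    return targets

# Made-up batch of two transitions with two actions each
rewards = [1.0, 200.0]
next_q_values = [[0.5, 2.0], [3.0, 1.0]]   # target-network outputs for s'
dones = [False, True]                      # the second transition ends the episode

targets = td_targets(rewards, next_q_values, dones, gamma=0.95)
print(targets)
```

The first target bootstraps from the best next action (1 + 0.95 x 2), while the second is just the reward because `done` masks the future term.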

## 2 - Training

We start the training of our DQN that will last `EPISODES` episodes

In [None]:
agent = DQN()

best_reward = 0
average_reward = 0
episode_number = []
average_reward_number = []

success = False # Success on one try

n_success_10_try = 0 #number of successes on 10 consecutive tries

average_success_10_try = [] #store the number of successes on 10 tries to plot it

j=0
dim = 5 # dimension of the observation space
print("BEGINNING TRAINING")

for i in tqdm(range(1, EPISODES+1)):

    print("episode " + str(i) + " playing...")
    
    state = env.reset()
    state = np.reshape(state, [1, dim])
    score = 0

    while True:
        j+=1

        action = agent.choose_action(state)

        state_, reward, done, info = env.step(action)
        state_ = np.reshape(state_, [1, dim])
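        # as_tensor handles both NumPy arrays and tensors (state is already a tensor after the first step)
        state = torch.as_tensor(state).float()
        state_ = torch.as_tensor(state_).float()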

        exp = (state, action, reward, state_, done)
        agent.replay.add(exp)
        
        # Learning only starts once the replay buffer holds at least BATCH_SIZE transitions
        agent.learn()

        state = state_
        score += reward

        # The Unity script agent.cs gives a reward of 200 when the goal is reached,
        #   so we use this value to detect a success
        
        if(reward == 200) :
            success = True
            
        # Synchronization of the target network
        if j % sync_freq == 0:
            agent.network2.load_state_dict(agent.network.state_dict())

        # Episode is finished
        if done:
            if score > best_reward:
                best_reward = score

            average_reward += score 
            
            if(success): #goal reached
                success = False
                n_success_10_try += 1


            if i%10==0:
                print("Episode {} Average Reward {} Best Reward {} Last Reward {} Epsilon {} Success = {}/10".format(i, average_reward/i, best_reward, score, agent.returning_epsilon(), n_success_10_try))
                average_success_10_try.append(n_success_10_try)
                n_success_10_try = 0

            break
            
        
    # Record one point per episode, once the episode has finished
    episode_number.append(i)
    average_reward_number.append(average_reward/i)

## 3 - Save the results

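We plot the average reward against the number of episodes and the number of successes in each block of 10 episodes, and save the figure in the current working directory.
On the second plot, we draw the line y = 7 (7 successes out of 10), since a success rate of about 68% is often used as a criterion to decide that training is finished.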

In [None]:
plt.subplot(1,2,1)
plt.plot(episode_number, average_reward_number)
plt.title("Average reward")
plt.xlabel('Episodes')
plt.ylabel('Average reward')

plt.subplot(1, 2, 2) # index 2
x2 = [10*(i+1) for i in range(int(EPISODES/10))]
plt.scatter(x2, average_success_10_try)
plt.plot(x2, [7 for i in range(len(x2))], color = 'r')
plt.ylim(-1,11)
plt.title("Successes per 10 episodes")
plt.xlabel('Episodes')
plt.ylabel('Number of successes')

plt.savefig("ResultsDQN")
env.close()
plt.show()
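
The second plot counts successes in consecutive blocks of 10 episodes. The same bookkeeping can be sketched independently of the training loop. The per-episode outcomes below are invented for illustration, and `successes_per_block` is a hypothetical helper.

```python
def successes_per_block(outcomes, block_size=10):
    """Count True values in consecutive, non-overlapping blocks of episodes."""
    return [
        sum(outcomes[start:start + block_size])
        for start in range(0, len(outcomes) - block_size + 1, block_size)
    ]

# Invented outcomes for 20 episodes: True means the goal was reached
outcomes = [False] * 8 + [True] * 2 + [True] * 7 + [False] * 3
counts = successes_per_block(outcomes)
reached_threshold = [c >= 7 for c in counts]  # same y = 7 line as in the plot

print(counts)
print(reached_threshold)
```

Only the second block of this invented run reaches the threshold of 7 successes.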

## 4 - Exploit the results

We display some of the results obtained after training our agent on the three levels, so that the reader can see some visuals without running the code (training can take 10 to 15 minutes with 200 episodes).
The interpretation of these results is given in the report (.pdf).

### 4.1 - Level 1

This level is quite simple. The main difficulty is to jump over a hole.
<img src="level1.png" alt="Level 1" style="float: right; width: 50%" />

In [2]:
# The settings of the displayed run are as follows:

EPISODES  = 200
LR  = 0.001 # dynamically changed during training
MEM_SIZE  = 10000
BATCH_SIZE  = 72
GAMMA  = 0.95
EXPLORATION_MAX  = 1.0
EXPLORATION_DECAY  = 0.999
EXPLORATION_MIN  = 0.001
sync_freq = 5

Here are the results obtained.

<img src="LV1OPTI-DYNAMIC.png" title="Results lvl1" style="display:block;float: center; width: 50%" />

Here is the final episode (when the agent is trained).

<img src="LVL1.gif" style="float: center; width: 50%" />

### 4.2 - Level 2

Level 2 is trickier: the agent can find the goal easily, but it fails when it tries to reach the goal faster, because it then collides with the upper slope. We wanted to study how the agent would adapt its speed to increase its reward.
<img src="level2.png" alt="Level 2 - Speed Control" style="float: right; width: 50%" />


In [1]:
# The settings of the displayed run are as follows:

EPISODES  = 200
LR  = 0.001
MEM_SIZE  = 10000
BATCH_SIZE  = 72
GAMMA  = 0.8
EXPLORATION_MAX  = 1.0
EXPLORATION_DECAY  = 0.999
EXPLORATION_MIN  = 0.001
sync_freq = 5

Here are the results we obtained.


<img src="LVL2OPTI.png" style="float: center; width: 50%" />

Here is the final episode (when the agent is trained).


<img src="LVL21.gif" style="float: center; width: 50%" />

### 4.3 - Level 3

Level 3 is the hardest level we designed, and the goal is to control spin. The agent has to land on the platforms correctly or it will collide with the slopes. Once it has landed, the optimal policy is to keep accelerating.
<img src="level3.png" style="float: right; width: 50%" />

In [None]:
# The settings of the displayed run are as follows:

EPISODES  = 200
LR  = 0.001
MEM_SIZE  = 10000
BATCH_SIZE  = 72
GAMMA  = 0.97
EXPLORATION_MAX  = 1.0
EXPLORATION_DECAY  = 0.999
EXPLORATION_MIN  = 0.001
sync_freq = 5

Here are the results we obtained.

<img src="LVL3OPTI.png" style="float: center; width: 50%" />

Here is the final episode (when the agent is trained).

<img src="LVL3.gif" style="float: center; width: 50%" />