# Customizing OpenAI Gym Environments and Implementing Reinforcement Learning Agents with Stable Baselines

### **Imports**

In [2]:
import gymnasium as gym
from stable_baselines3 import A2C, PPO, DQN
from sb3_contrib.ppo_mask import MaskablePPO
from sb3_contrib.ars import ARS
from sb3_contrib.qrdqn import QRDQN
from sb3_contrib.trpo import TRPO
from sb3_contrib.common.maskable.policies import MaskableActorCriticPolicy
from sb3_contrib.common.wrappers import ActionMasker
from stable_baselines3.common.evaluation import evaluate_policy
from gymnasium import Wrapper
import os

### **Frozen Lake**

After exploring the existing environments, we decided to pick Frozen Lake. 


#### **Analysing the chosen environment**

This environment is part of the Toy Text environments and consists of a 4x4 grid by default.

- ##### **Action Space**:

   The action space is Discrete(4) and its shape is (1,), which indicates that there are four possible actions the player can take.

   These are integers in the range {0, 3} and are:
   - 0 - Move left
   - 1 - Move down
   - 2 - Move right
   - 3 - Move up


- ##### **Observation Space**:

   The observation is a single integer representing the player's position. It starts at 0 and, when using the 4x4 map, goes up to 15, which corresponds to the goal state.

   The expression to calculate the actual position is ``current_row * number_of_columns + current_column``.

- ##### **Starting state**:

   The episode starts with the player at position 0, which corresponds to row 0 and column 0.

- ##### **Rewards**:

   There are three different kinds of tiles on the board, and the reward depends on which one the player reaches:
   - Reach goal: +1 reward
   - Reach hole: 0 reward
   - Reach frozen tile: 0 reward

- ##### **End of the episode**:

   The episode can end in two different ways:
   - Termination: when the player reaches a hole or the goal (always located in the last row and column).
   - Truncation: when the episode reaches the step limit set by the time limit wrapper, which is 100 for the 4x4 board and 200 for the 8x8.

- ##### **Information**:

   The step() function returns a tuple with 5 elements:
   - observation (int)
   - reward (0 or 1)
   - end_of_the_episode (bool)
   - truncation (bool)
   - probability_of_transition (dictionary):
      - {"prob": 1}   - if is_slippery is False
      - {"prob": 1/3} - if is_slippery is True

   The reset() function returns a tuple with 2 elements:
   - observation for the initial state (int)
   - probability_of_transition (dictionary): {"prob": 1}, since the starting position is always the same

- ##### **Arguments**:

   gym.make() can receive some arguments:
   - desc - a list of strings that specifies a custom map (it can also be a randomly generated map of a certain size). If None, the preloaded map selected by map_name is used.
      - Example: desc=["SFFF", "FHFH", "FFFH", "HFFG"].
   - map_name - ID of the preloaded map to use.
   - is_slippery - if True, the player moves in the intended direction with probability 1/3 and in each of the two perpendicular directions with probability 1/3.
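
To make the description above concrete, here is a small pure-Python sketch that does not import Gymnasium. It assumes the default 4x4 layout and the action encoding given in the Gymnasium Frozen Lake documentation (0 left, 1 down, 2 right, 3 up). The helper names `to_state`, `to_row_col` and `slippery_outcomes` are ours and exist only for illustration.

```python
NCOLS = 4  # width of the default 4x4 map

# Action encoding as documented for Frozen Lake in Gymnasium
ACTIONS = {0: "left", 1: "down", 2: "right", 3: "up"}


def to_state(row, col, ncols=NCOLS):
    """Convert a (row, col) position into the integer observation."""
    return row * ncols + col


def to_row_col(state, ncols=NCOLS):
    """Convert the integer observation back into (row, col)."""
    return divmod(state, ncols)


def slippery_outcomes(action):
    """Actions that may actually be executed when is_slippery=True.

    Assumption: the intended action and the two perpendicular ones
    are each chosen with probability 1/3.
    """
    return [(action - 1) % 4, action, (action + 1) % 4]


print(to_state(3, 3))                                # goal tile of the 4x4 map
print(to_row_col(6))                                 # row and column of state 6
print([ACTIONS[a] for a in slippery_outcomes(2)])    # possible moves when asking for "right"
```

Because the encoding goes left, down, right, up, the neighbours of an action modulo 4 are exactly its two perpendicular directions.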





### **Testing the random agent**

In [None]:
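env = gym.make('FrozenLake-v1', is_slippery=False, render_mode="human")
env.reset()

episodes = 10
for ep in range(episodes):
    env.reset()
    done = False
    while not done:
        env.render()
        # Take a random action at every step
        obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
        # The episode ends on termination (hole or goal) or truncation (time limit)
        done = terminated or truncated

env.close()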

As we can see the agent performs very poorly when playing randomly.
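
To put a rough number on "very poorly", the sketch below estimates how often a uniformly random agent reaches the goal. It is a pure-Python re-creation of the default 4x4 map with deterministic moves (slippery off) and a 100-step limit, not the Gymnasium environment itself. The function names and the number of episodes are our own choices.

```python
import random

# Default 4x4 layout: S start, F frozen, H hole, G goal
MAP = ["SFFF", "FHFH", "FFFH", "HFFG"]
# (row delta, col delta) for actions 0 left, 1 down, 2 right, 3 up
MOVES = {0: (0, -1), 1: (1, 0), 2: (0, 1), 3: (-1, 0)}


def run_random_episode(rng, max_steps=100):
    """Play one episode with random actions; return True if the goal is reached."""
    row, col = 0, 0
    for _ in range(max_steps):
        d_row, d_col = MOVES[rng.randrange(4)]
        # A move that would leave the grid keeps the player in place
        row = min(max(row + d_row, 0), 3)
        col = min(max(col + d_col, 0), 3)
        tile = MAP[row][col]
        if tile == "G":
            return True
        if tile == "H":
            return False
    return False  # step limit reached without finding the goal


def random_success_rate(episodes=2000, seed=0):
    """Fraction of episodes in which the random agent reaches the goal."""
    rng = random.Random(seed)
    wins = sum(run_random_episode(rng) for _ in range(episodes))
    return wins / episodes


print(f"Random agent success rate: {random_success_rate():.3f}")
```

Under these assumptions the random agent reaches the goal only in a small fraction of episodes, because most random walks hit one of the four holes first.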

### **Training models chosen**

After checking the reinforcement learning models that were compatible with our environment we decided to test all of the following:
- Advantage Actor Critic (A2C)
- Proximal Policy Optimization (PPO)
- Deep Q Network (DQN)
- Trust Region Policy Optimization (TRPO)
- Quantile Regression DQN (QR-DQN)
- Maskable PPO
- Augmented Random Search (ARS)

### **Modifications**

With the information above in mind, we noticed that the reward system is sparse, meaning that rewards are only given when a significant milestone is reached, in this case the goal. We intend to change it to a dense one in order to give the agent enough feedback to reach the optimal solution, minimizing the path to the goal and penalising moves into a hole. Since we are altering those rewards, we will also need to increase the reward for reaching the final state.
The initial idea is to subtract 1 from the reward for each step because, as it stands, nothing pushes the agent towards an optimal path and it can sometimes repeat actions that do not lead to the solution, which is unwanted behaviour.
When the agent moves into a hole, the reward needs to be a large negative number, so that taking a certain number of steps is never as bad as reaching a hole, since in the first case the agent still has the possibility of getting to the goal. We will try -100.
Since we are penalising the agent so heavily, we will increase the reward for reaching the final state to 100. The reasoning is that we want to encourage finding the solution and not only taking few steps.
Another modification we can implement is turning off the slippery tiles. That way the randomness factor is removed from the problem, and we can compare the two scenarios to see how it impacts the agent.
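
The effect of these values can be checked with a little arithmetic. The sketch below assumes the rule used by the wrapper in the next section: the last move of an episode receives the goal or hole reward instead of the step penalty. The function `episode_return` and the outcome labels are hypothetical names used only here.

```python
STEP_REWARD, HOLE_REWARD, GOAL_REWARD = -1, -100, 100


def episode_return(steps, outcome):
    """Total reward of an episode with `steps` moves.

    `outcome` is 'goal', 'hole' or 'timeout'. The final move gets the
    terminal reward instead of the step penalty.
    """
    if outcome == "goal":
        return (steps - 1) * STEP_REWARD + GOAL_REWARD
    if outcome == "hole":
        return (steps - 1) * STEP_REWARD + HOLE_REWARD
    return steps * STEP_REWARD


print(episode_return(6, "goal"))      # shortest path on the 4x4 map -> 95
print(episode_return(20, "goal"))     # a longer successful path -> 81
print(episode_return(2, "hole"))      # falling into a hole early -> -101
print(episode_return(100, "timeout")) # wandering until the time limit -> -100
```

With these numbers a shorter successful path always scores higher than a longer one, and any episode that ends in a hole scores lower than one that merely runs out of time.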

### **Implement customizations**

In [None]:
class CustomRewardWrapper(Wrapper):
    """Replace the sparse Frozen Lake reward with a dense one."""
    def __init__(self, env):
        super().__init__(env)

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)

        # State 15 is the goal on the default 4x4 map
        if terminated and obs != 15:
            reward = -100  # fell into a hole
        elif obs == 15:
            reward = 100   # reached the goal
        else:
            reward = -1    # every other step is penalised
        return obs, reward, terminated, truncated, info

env = gym.make('FrozenLake-v1', render_mode="human")
env = CustomRewardWrapper(env)
env.reset()

episodes = 10
for ep in range(episodes):
    env.reset()
    done = False
    while not done:
        env.render()
        obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
        # Stop on termination (hole or goal) or truncation (time limit)
        done = terminated or truncated
        print("reward ", reward)

env.close()

After researching the models, we came to the conclusion that some of them can be sensitive to rewards spread over a large range, so we decided to scale them down.

In [None]:
class CustomReward(Wrapper):
    """Dense reward with a smaller range than CustomRewardWrapper."""
    def __init__(self, env):
        super().__init__(env)

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)

        # State 15 is the goal on the default 4x4 map
        if terminated and obs != 15:
            reward = -10   # fell into a hole
        elif obs == 15:
            reward = 1.6   # reached the goal
        else:
            reward = -0.1  # every other step is penalised
        return obs, reward, terminated, truncated, info

env = gym.make('FrozenLake-v1', render_mode="human")
env = CustomReward(env)
env.reset()

episodes = 10
for ep in range(episodes):
    env.reset()
    done = False
    while not done:
        env.render()
        obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
        # Stop on termination (hole or goal) or truncation (time limit)
        done = terminated or truncated
        print("reward ", reward)

env.close()

### **Without Slippery**

#### **Training all models - (500 000 timesteps)**

We are going to train all the chosen models for a small number of timesteps to see which ones perform best.

##### **A2C**

In [None]:
env = gym.make('FrozenLake-v1', is_slippery=False, render_mode="rgb_array")
env = CustomReward(env)
env.reset()

log_path = os.path.join('Training','WithChange','SlipperyOff','Agent','Logs')
model = A2C('MlpPolicy', env, verbose=1, tensorboard_log=log_path)

TIMESTEPS = 500000
model.learn(total_timesteps=TIMESTEPS, reset_num_timesteps=False, tb_log_name="A2C")
model.save(f"Training/WithChange/SlipperyOff/Agent/models/A2C/{TIMESTEPS}")

###### **Image**

![Image](images/SlipperyOff/Agents/A2C.png)

##### **PPO**

In [None]:
env = gym.make('FrozenLake-v1', is_slippery=False, render_mode="rgb_array")
env = CustomReward(env)
env.reset()

log_path = os.path.join('Training','WithChange','SlipperyOff','Agent','Logs')
model = PPO('MlpPolicy', env, verbose=1, tensorboard_log=log_path)

TIMESTEPS = 500000
model.learn(total_timesteps=TIMESTEPS, reset_num_timesteps=False, tb_log_name="PPO")
model.save(f"Training/WithChange/SlipperyOff/Agent/models/PPO/{TIMESTEPS}")

###### **Image**

![Image](images/SlipperyOff/Agents/PPO.png)

##### **DQN**

In [None]:
env = gym.make('FrozenLake-v1', is_slippery=False, render_mode="rgb_array")
env = CustomReward(env)
env.reset()

log_path = os.path.join('Training','WithChange','SlipperyOff','Agent','Logs')
model = DQN('MlpPolicy', env, verbose=1, tensorboard_log=log_path)

TIMESTEPS = 500000
model.learn(total_timesteps=TIMESTEPS, reset_num_timesteps=False, tb_log_name="DQN")
model.save(f"Training/WithChange/SlipperyOff/Agent/models/DQN/{TIMESTEPS}")

###### **Image**

![Image](images/SlipperyOff/Agents/DQN.png)

##### **TRPO**

In [None]:
env = gym.make('FrozenLake-v1', is_slippery=False, render_mode="rgb_array")
env = CustomReward(env)
env.reset()

log_path = os.path.join('Training','WithChange','SlipperyOff','Agent','Logs')
model = TRPO('MlpPolicy', env, verbose=1, tensorboard_log=log_path)

TIMESTEPS = 500000
model.learn(total_timesteps=TIMESTEPS, reset_num_timesteps=False, tb_log_name="TRPO")
model.save(f"Training/WithChange/SlipperyOff/Agent/models/TRPO/{TIMESTEPS}")

###### **Image**

![Image](images/SlipperyOff/Agents/TRPO.png)

##### **QR-DQN**

In [None]:
env = gym.make('FrozenLake-v1', is_slippery=False, render_mode="rgb_array")
env = CustomReward(env)
env.reset()

log_path = os.path.join('Training','WithChange','SlipperyOff','Agent','Logs')
model = QRDQN('MlpPolicy', env, verbose=1, tensorboard_log=log_path)

TIMESTEPS = 500000
model.learn(total_timesteps=TIMESTEPS, reset_num_timesteps=False, tb_log_name="QRDQN")
model.save(f"Training/WithChange/SlipperyOff/Agent/models/QRDQN/{TIMESTEPS}")

###### **Image**

![Image](images/SlipperyOff/Agents/QRDQN.png)

##### **Maskable PPO**

In [None]:
def mask_fn(env):
    # Constant mask: actions 0, 1 and 2 are always allowed, action 3 never is
    return [True, True, True, False]

env = gym.make('FrozenLake-v1', is_slippery=False, render_mode="rgb_array")
env = ActionMasker(env, mask_fn)
env = CustomReward(env)

log_path = os.path.join('Training','WithChange','SlipperyOff','Agent','Logs')
model = MaskablePPO(MaskableActorCriticPolicy, env, verbose=1, tensorboard_log=log_path)

TIMESTEPS = 500000
model.learn(total_timesteps=TIMESTEPS, reset_num_timesteps=False, tb_log_name="Maskable PPO")
model.save(f"Training/WithChange/SlipperyOff/Agent/models/Maskable PPO/{TIMESTEPS}")

###### **Image**

![Image](images/SlipperyOff/Agents/MaskablePPO.png)
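
The `mask_fn` used above returns the same mask in every state. Purely as an illustration of what an action mask can express, and not what was used in these experiments, the sketch below builds a mask that depends on the player's position and disables moves that would run into the border. It assumes the Gymnasium action order (0 left, 1 down, 2 right, 3 up); `boundary_mask` is a hypothetical helper, and hooking it into `ActionMasker` would additionally require reading the current position from the environment.

```python
def boundary_mask(state, nrows=4, ncols=4):
    """Return [left, down, right, up] flags for the given state.

    A flag is False when the move would push the player against the
    border of the grid and leave the position unchanged.
    """
    row, col = divmod(state, ncols)
    return [col > 0, row < nrows - 1, col < ncols - 1, row > 0]


print(boundary_mask(0))   # top-left corner: only down and right are useful
print(boundary_mask(5))   # interior tile: every move changes the position
print(boundary_mask(15))  # bottom-right corner: only left and up are useful
```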

##### **ARS**

In [None]:
env = gym.make('FrozenLake-v1', is_slippery=False, render_mode="rgb_array")
env = CustomReward(env)
env.reset()

log_path = os.path.join('Training','WithChange','SlipperyOff','Agent','Logs')
model = ARS('MlpPolicy', env, verbose=1, tensorboard_log=log_path)

TIMESTEPS = 500000
model.learn(total_timesteps=TIMESTEPS, reset_num_timesteps=False, tb_log_name="ARS")
model.save(f"Training/WithChange/SlipperyOff/Agent/models/ARS/{TIMESTEPS}")

###### **Image**

![Image](images/SlipperyOff/Agents/ARS.png)

#### **Choosing the 3 best ones**

By looking at the graphs we can see that the number of timesteps was enough for the majority of models to stabilize their learning curve.

![Image](images/SlipperyOff/Agents/TODOS.png)

In the mean reward graph above we can see that three models reached an average reward of 1.1 or close to it, which is the maximum reward possible for each episode. This means they learned to play very effectively when the randomness is turned off.
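
The value 1.1 follows directly from the rewards defined in `CustomReward` and the 6-step shortest path on the default map: five ordinary steps followed by the step that reaches the goal. A minimal check, using exact fractions to avoid floating-point noise (the variable names are ours):

```python
from fractions import Fraction

STEP = Fraction(-1, 10)   # reward for an ordinary step
GOAL = Fraction(16, 10)   # reward for the step that reaches the goal
SHORTEST_PATH = 6         # moves needed on the default 4x4 map

# The last move gets the goal reward instead of the step penalty
best_return = (SHORTEST_PATH - 1) * STEP + GOAL
print(float(best_return))  # 1.1
```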

Based on the results we decided to pick the 3 best performing models to train with more timesteps:
- Advantage Actor Critic (A2C)
- Proximal Policy Optimization (PPO)
- Trust Region Policy Optimization (TRPO)

#### **Training the 3 best - (5 000 000 timesteps)**

Now we are going to train the chosen models for longer.

##### **A2C**

In [None]:
env = gym.make('FrozenLake-v1', is_slippery=False, render_mode="rgb_array")
env = CustomReward(env)
env.reset()

log_path = os.path.join('Training','WithChange','SlipperyOff','Top3_Agent','Logs')
model = A2C('MlpPolicy', env, verbose=1, tensorboard_log=log_path)

TIMESTEPS = 10000
iters = 1
while TIMESTEPS*iters < 5000000:
    model.learn(total_timesteps=TIMESTEPS, reset_num_timesteps=False, tb_log_name="A2C")
    model.save(f"Training/WithChange/SlipperyOff/Top3_Agent/models/A2C/{iters*TIMESTEPS}")
    iters+=1
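
The loop above saves one checkpoint per call to `learn`, named after the nominal timestep count. The sketch below reproduces only the loop condition, without any training, to list which checkpoint names get written. `TARGET` and `checkpoints` are names introduced here for illustration.

```python
TIMESTEPS = 10_000
TARGET = 5_000_000

checkpoints = []
iters = 1
# Same condition as the training loop, minus the learn() and save() calls
while TIMESTEPS * iters < TARGET:
    checkpoints.append(TIMESTEPS * iters)
    iters += 1

print(len(checkpoints), checkpoints[0], checkpoints[-1])
```

Because the comparison is strict, the last checkpoint is named 4990000 and no file named 5000000 is produced, which matters when choosing a checkpoint to load later.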

###### **Image**

![Image](images/SlipperyOff/Top3_Agent/A2C.png)

##### **PPO**

In [None]:
env = gym.make('FrozenLake-v1', is_slippery=False, render_mode="rgb_array")
env = CustomReward(env)
env.reset()

log_path = os.path.join('Training','WithChange','SlipperyOff','Top3_Agent','Logs')
model = PPO('MlpPolicy', env, verbose=1, tensorboard_log=log_path)

TIMESTEPS = 10000
iters = 1
while TIMESTEPS*iters < 5000000:
    model.learn(total_timesteps=TIMESTEPS, reset_num_timesteps=False, tb_log_name="PPO")
    model.save(f"Training/WithChange/SlipperyOff/Top3_Agent/models/PPO/{TIMESTEPS*iters}")
    iters += 1

###### **Image**

![Image](images/SlipperyOff/Top3_Agent/PPO.png)

##### **TRPO**

In [None]:
env = gym.make('FrozenLake-v1',is_slippery=False, render_mode="rgb_array")
env = CustomReward(env)
env.reset()

log_path = os.path.join('Training','WithChange','SlipperyOff','Top3_Agent','Logs')
model = TRPO('MlpPolicy', env, verbose=1, tensorboard_log=log_path)

TIMESTEPS = 10000
iters = 1
while TIMESTEPS*iters < 5000000:
    model.learn(total_timesteps=TIMESTEPS, reset_num_timesteps=False, tb_log_name="TRPO")
    model.save(f"Training/WithChange/SlipperyOff/Top3_Agent/models/TRPO/{TIMESTEPS*iters}")
    iters += 1

###### **Image**

![Image](images/SlipperyOff/Top3_Agent/TRPO.png)

#### **Choosing the best performer**

![Image](images/SlipperyOff/Top3_Agent/TODOS.png)

After training our models for 5 million timesteps, we analysed the graph and concluded that A2C is the best one in this case: it stabilizes at the top, which PPO does not, and it gets there earlier than TRPO.

#### **Testing the model**

With the following code we will see how the agent performs when following the predictions of the A2C model. We will run 10 episodes using the model trained with 500 thousand timesteps, since by then it had already stabilized at the maximum reward and the episode length was at its minimum.

![Image](images/SlipperyOff/Top3_Agent/A2C.png)

In [None]:
# Slippery is turned off to match the environment the model was trained on
env = gym.make('FrozenLake-v1', is_slippery=False, render_mode="human")
env = CustomReward(env)
env.reset()

models_dir = "Training/WithChange/SlipperyOff/Top3_Agent/models/A2C"
model_path = f"{models_dir}/4000000.zip"

model = A2C.load(model_path, env=env)

episodes = 10

for ep in range(episodes):
    obs, _ = env.reset()
    done = False
    while not done:
        env.render()
        action, _states = model.predict(obs)
        obs, reward, terminated, truncated, info = env.step(action.item())
        # Stop on termination (hole or goal) or truncation (time limit)
        done = terminated or truncated
        print(reward)


env.close()

The agent reached the goal in all episodes. In total it took 60 steps, an average of 6 steps per episode, which corresponds to the optimal solution. The mean reward was 1, which was the maximum possible. We can conclude that the model learnt to play the game perfectly, which is no surprise, as the environment never changes and the agent only needs to repeat the same actions in every episode. It is also worth noting that the number of timesteps it needed was rather small.

#### **Comparison with original environment**

In both environments the agent gets to the goal in all simulations with the optimal solution. The only noticeable difference is in the mean length and mean reward graphs; however, it does not have an impact on the agent's behaviour.

![Image](images/NoChange/A2C.png)

#### **Hyperparameters**

The only thing that can still improve relative to the previous model is the number of iterations needed to reach the minimum number of steps. We set some hyperparameters to make it learn faster:
 - learning rate: 0.01
 - gamma: 1

We used the following script to train and ran it for 10 million timesteps:

In [None]:
env = gym.make('FrozenLake-v1', is_slippery=False, render_mode="rgb_array")
env = CustomReward(env)
env.reset()

log_path = os.path.join('Training','Hyperparameter','Logs')
model = A2C('MlpPolicy', env, verbose=1, learning_rate=0.01, gamma=1, tensorboard_log=log_path)

TIMESTEPS = 10000
iters = 1
while TIMESTEPS*iters < 10000000:
    model.learn(total_timesteps=TIMESTEPS, reset_num_timesteps=False, tb_log_name="A2C")
    model.save(f"Training/Hyperparameter/models/A2C/{iters*TIMESTEPS}")
    iters+=1

#### **Evaluate**

Since the path to the solution is always the same, there is no need to extensively evaluate the model trained with these hyperparameters: past a certain number of timesteps, it is not possible to get a better result than the one we already have.

### **With Slippery**

#### **Training all models - (500 000 timesteps)**

We are going to train all the chosen models for a small number of timesteps to see which ones perform best.

##### **A2C**

In [None]:
env = gym.make('FrozenLake-v1', render_mode="rgb_array")
env = CustomReward(env)
env.reset()

log_path = os.path.join('Training','WithChange','SlipperyOn','Agent','Logs')
model = A2C('MlpPolicy', env, verbose=1, tensorboard_log=log_path)

TIMESTEPS = 500000
model.learn(total_timesteps=TIMESTEPS, reset_num_timesteps=False, tb_log_name="A2C")
model.save(f"Training/WithChange/SlipperyOn/Agent/models/A2C/{TIMESTEPS}")

###### **Image**

![Image](images/SlipperyOn/Agents/A2C.png)

##### **PPO**

In [None]:
env = gym.make('FrozenLake-v1', render_mode="rgb_array")
env = CustomReward(env)
env.reset()

log_path = os.path.join('Training','WithChange','SlipperyOn','Agent','Logs')
model = PPO('MlpPolicy', env, verbose=1, tensorboard_log=log_path)

TIMESTEPS = 500000
model.learn(total_timesteps=TIMESTEPS, reset_num_timesteps=False, tb_log_name="PPO")
model.save(f"Training/WithChange/SlipperyOn/Agent/models/PPO/{TIMESTEPS}")

###### **Image**

![Image](images/SlipperyOn/Agents/PPO.png)

##### **DQN**

In [None]:
env = gym.make('FrozenLake-v1', render_mode="rgb_array")
env = CustomReward(env)
env.reset()

log_path = os.path.join('Training','WithChange','SlipperyOn','Agent','Logs')
model = DQN('MlpPolicy', env, verbose=1, tensorboard_log=log_path)

TIMESTEPS = 500000
model.learn(total_timesteps=TIMESTEPS, reset_num_timesteps=False, tb_log_name="DQN")
model.save(f"Training/WithChange/SlipperyOn/Agent/models/DQN/{TIMESTEPS}")

###### **Image**

![Image](images/SlipperyOn/Agents/DQN.png)

##### **TRPO**

In [None]:
env = gym.make('FrozenLake-v1', render_mode="rgb_array")
env = CustomReward(env)
env.reset()

log_path = os.path.join('Training','WithChange','SlipperyOn','Agent','Logs')
model = TRPO('MlpPolicy', env, verbose=1, tensorboard_log=log_path)

TIMESTEPS = 500000
model.learn(total_timesteps=TIMESTEPS, reset_num_timesteps=False, tb_log_name="TRPO")
model.save(f"Training/WithChange/SlipperyOn/Agent/models/TRPO/{TIMESTEPS}")

###### **Image**

![Image](images/SlipperyOn/Agents/TRPO.png)

##### **QR-DQN**

In [None]:
env = gym.make('FrozenLake-v1', render_mode="rgb_array")
env = CustomReward(env)
env.reset()

log_path = os.path.join('Training','WithChange','SlipperyOn','Agent','Logs')
model = QRDQN('MlpPolicy', env, verbose=1, tensorboard_log=log_path)

TIMESTEPS = 500000
model.learn(total_timesteps=TIMESTEPS, reset_num_timesteps=False, tb_log_name="QRDQN")
model.save(f"Training/WithChange/SlipperyOn/Agent/models/QRDQN/{TIMESTEPS}")

###### **Image**

![Image](images/SlipperyOn/Agents/QRDQN.png)

##### **Maskable PPO**

In [None]:
def mask_fn(env):
    # Constant mask: actions 0, 1 and 2 are always allowed, action 3 never is
    return [True, True, True, False]

env = gym.make('FrozenLake-v1', render_mode="rgb_array")
env = ActionMasker(env, mask_fn)
env = CustomReward(env)

log_path = os.path.join('Training','WithChange','SlipperyOn','Agent','Logs')
model = MaskablePPO(MaskableActorCriticPolicy, env, verbose=1, tensorboard_log=log_path)

TIMESTEPS = 500000
model.learn(total_timesteps=TIMESTEPS, reset_num_timesteps=False, tb_log_name="Maskable PPO")
model.save(f"Training/WithChange/SlipperyOn/Agent/models/Maskable PPO/{TIMESTEPS}")

###### **Image**

![Image](images/SlipperyOn/Agents/MaskablePPO.png)

##### **ARS**

In [None]:
env = gym.make('FrozenLake-v1', render_mode="rgb_array")
env = CustomReward(env)
env.reset()

log_path = os.path.join('Training','WithChange','SlipperyOn','Agent','Logs')
model = ARS('MlpPolicy', env, verbose=1, tensorboard_log=log_path)

TIMESTEPS = 500000
model.learn(total_timesteps=TIMESTEPS, reset_num_timesteps=False, tb_log_name="ARS")
model.save(f"Training/WithChange/SlipperyOn/Agent/models/ARS/{TIMESTEPS}")

###### **Image**

![Image](images/SlipperyOn/Agents/ARS.png)

#### **Choosing the 3 best ones**

![Image](images/SlipperyOn/Agents/TODOS.png)

With the randomness turned on, the models do not perform as well as when it is turned off, where three models reached a mean reward of about 1.1, the maximum possible per episode. In the mean reward graph above we can see that the highest mean reward was very close to -4.6, which is not a very good result and leaves a lot of room for improvement.

Based on the results we decided to pick the 3 best performing models to train with more timesteps:
- Advantage Actor Critic (A2C)
- Proximal Policy Optimization (PPO)
- Trust Region Policy Optimization (TRPO)

#### **Training the 3 best - (5 000 000 timesteps)**

##### **A2C**

In [None]:
env = gym.make('FrozenLake-v1', render_mode="rgb_array")
env = CustomReward(env)
env.reset()

log_path = os.path.join('Training','WithChange','SlipperyOn','Top3_Agent','Logs')
model = A2C('MlpPolicy', env, verbose=1, tensorboard_log=log_path)

TIMESTEPS = 10000
iters = 1
while TIMESTEPS*iters < 5000000:
    model.learn(total_timesteps=TIMESTEPS, reset_num_timesteps=False, tb_log_name="A2C")
    model.save(f"Training/WithChange/SlipperyOn/Top3_Agent/models/A2C/{iters*TIMESTEPS}")
    iters+=1

###### **Image**

![Image](images/SlipperyOn/Top3_Agent/A2C.png)

##### **PPO**

In [None]:
env = gym.make('FrozenLake-v1', render_mode="rgb_array")
env = CustomReward(env)
env.reset()

log_path = os.path.join('Training','WithChange','SlipperyOn','Top3_Agent','Logs')
model = PPO('MlpPolicy', env, verbose=1, tensorboard_log=log_path)

TIMESTEPS = 10000
iters = 1
while TIMESTEPS*iters < 5000000:
    model.learn(total_timesteps=TIMESTEPS, reset_num_timesteps=False, tb_log_name="PPO")
    model.save(f"Training/WithChange/SlipperyOn/Top3_Agent/models/PPO/{TIMESTEPS*iters}")
    iters += 1

###### **Image**

![Image](images/SlipperyOn/Top3_Agent/PPO.png)

##### **TRPO**

In [None]:
env = gym.make('FrozenLake-v1', render_mode="rgb_array")
env = CustomReward(env)
env.reset()

log_path = os.path.join('Training','WithChange','SlipperyOn','Top3_Agent','Logs')
model = TRPO('MlpPolicy', env, verbose=1, tensorboard_log=log_path)

TIMESTEPS = 10000
iters = 1
while TIMESTEPS*iters < 5000000:
    model.learn(total_timesteps=TIMESTEPS, reset_num_timesteps=False, tb_log_name="TRPO")
    model.save(f"Training/WithChange/SlipperyOn/Top3_Agent/models/TRPO/{TIMESTEPS*iters}")
    iters += 1

###### **Image**

![Image](images/SlipperyOn/Top3_Agent/TRPO.png)

#### **Choosing the best performer**

Based on the graph, we can see that TRPO stays more consistently at a higher mean reward than the other models, despite having highs and lows like them.

#### **Testing the model**

With the following code we will see how the agent performs when following the predictions of the TRPO model. We will run 10 episodes using the model trained with 2.5 million timesteps, since it reached a mean reward close to the highest one and stayed at the top long enough.

In [6]:
env = gym.make('FrozenLake-v1', render_mode="human")
env = CustomReward(env)
env.reset()

models_dir = "Training/WithChange/SlipperyOn/Top3_Agent/models/TRPO"
model_path = f"{models_dir}/2500000.zip"

model = TRPO.load(model_path, env=env)

episodes = 10

for ep in range(episodes):
    obs, _ = env.reset()
    done = False
    while not done:
        env.render()
        action, _states = model.predict(obs)
        obs, reward, terminated, truncated, info = env.step(action.item())
        # Stop on termination (hole or goal) or truncation (time limit)
        done = terminated or truncated
        print(reward)


env.close()

Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
1.6
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
1.6
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
1.6
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
1.6
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0.1
-0

With the chosen model, the agent successfully reached the goal 9 times and fell into a hole once.
The agent took 335 steps in total, which gives a mean of 33.5 steps per episode. This number is considerably higher than the optimal one, since the shortest path is only 6 steps long, but because the direction of the movement is sometimes random, we cannot say it is a bad result.
In this case the mean reward was -2.81, mainly because each step costs 0.1.
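
The mean reward of -2.81 can be reproduced from the figures reported above (335 steps, 9 goals, 1 hole) and the rewards defined in `CustomReward`. The sketch assumes every episode ended in either the goal or a hole, so none was truncated. `mean_episode_reward` is a hypothetical helper, and exact fractions are used to avoid floating-point noise.

```python
from fractions import Fraction

STEP = Fraction(-1, 10)
HOLE = Fraction(-10)
GOAL = Fraction(16, 10)


def mean_episode_reward(total_steps, goals, holes):
    """Mean reward per episode, assuming each episode ends in a goal or a hole."""
    episodes = goals + holes
    # The final step of each episode gets the terminal reward, not the step penalty
    ordinary_steps = total_steps - episodes
    total = goals * GOAL + holes * HOLE + ordinary_steps * STEP
    return total / episodes


print(float(mean_episode_reward(total_steps=335, goals=9, holes=1)))  # -2.81
```

The 325 ordinary steps contribute -32.5 of the total of -28.1, more than the single hole does, which is why the step penalty dominates the result.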

#### **Comparison**

With no changes to the environment and slippery turned on, the agent wins 9 times out of 10 and takes 345 steps. The mean is 34.5 steps per episode, which is on average 1 step more than in the environment with our custom wrapper. This does not represent a big change, but it is still an improvement.

#### **Hyperparameter**

To fine-tune our agent's performance we decided to experiment with the learning rate and gamma, the discount factor that determines the relative weight of future rewards compared to immediate ones.
After testing some values for a few timesteps, we concluded that for our environment and the chosen algorithm the best values were:
 - learning rate: 0.01
 - gamma: 1
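
To illustrate what gamma does, the sketch below computes the discounted return of the optimal 6-step episode under the `CustomReward` values for a few discount factors. The function `discounted_return` is a generic textbook formulation written for this example, not code taken from Stable Baselines3.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards where the reward at step t is weighted by gamma ** t."""
    total = 0.0
    for t, reward in enumerate(rewards):
        total += (gamma ** t) * reward
    return total


# Five ordinary steps followed by the step that reaches the goal
optimal_episode = [-0.1] * 5 + [1.6]

for gamma in (1.0, 0.99, 0.9):
    print(gamma, round(discounted_return(optimal_episode, gamma), 4))
```

With gamma equal to 1 the goal reward counts in full no matter how late it arrives, so only the accumulated step penalties distinguish short paths from long ones. Lower values shrink the weight of the goal reward the further away it is.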

We used the following script to train and ran it for 10 million timesteps:

In [None]:
env = gym.make('FrozenLake-v1', render_mode="rgb_array")
env = CustomReward(env)
env.reset()

log_path = os.path.join('Training','Hyperparameter','Logs')
model = TRPO('MlpPolicy', env, verbose=1, learning_rate=0.01, gamma=1, tensorboard_log=log_path)

TIMESTEPS = 10000
iters = 1
while TIMESTEPS*iters < 10000000:
    model.learn(total_timesteps=TIMESTEPS, reset_num_timesteps=False, tb_log_name="TRPO")
    model.save(f"Training/Hyperparameter/models/TRPO/{iters*TIMESTEPS}")
    iters+=1

#### **Evaluate**

We will evaluate the model over 10 episodes to see whether the changes we made with the hyperparameters were positive or not:

In [26]:
env = gym.make('FrozenLake-v1', render_mode="rgb_array")
env = CustomReward(env)
env.reset()

# Load the trained model
models_dir = "Training/WithChange/SlipperyOn/Top3_Agent/models/TRPO"
model_path = f"{models_dir}/2500000.zip"
model = TRPO.load(model_path, env=env)

# Evaluate the model's performance
mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=10, deterministic=True, render=False)

print(f"The mean reward was {mean_reward}")
print(f"The standard deviation of the reward was {std_reward}")

Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
The mean reward was -2.290000034123659
The standard deviation of the reward was 2.438626699031943
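
The two numbers printed above summarise the total reward of each evaluation episode as a mean and a standard deviation. The sketch below shows that computation on its own, using the standard library instead of Stable Baselines3. The episode returns are made-up values for illustration only, not results from our runs.

```python
from statistics import fmean, pstdev

# Hypothetical per-episode returns, for illustration only (not measured results)
episode_returns = [0.3, -1.2, -10.4, 0.8, -2.5]

mean_reward = fmean(episode_returns)
# Population standard deviation, the same convention as numpy's default np.std
std_reward = pstdev(episode_returns)

print(f"mean={mean_reward:.2f} std={std_reward:.2f}")
```

A standard deviation of the same size as the mean, as in our evaluation, indicates that individual episodes vary a lot, so a comparison based on only a few episodes should be read with care.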


### **Conclusion**

In summary, our results are positive: both models improve with the changes made, both those to the environment and those to the hyperparameters.
Compared to the agent playing randomly there is a huge difference, since after training it is able to win consistently and without taking a very long and unnecessary path.
With slippery turned off the agent plays perfectly, which was expected.