# Reinforcement learning
In this notebook you will learn the basics of reinforcement learning in Python.

Sections:
- 1) Example 1
- 2) Example 2
- 3) Training and evaluating a model
- 4) Environment vectorization
- 5) Build a custom environment
- 6) Simple rendering

# 1) Example 1

This first example illustrates the concept of reinforcement learning using an existing environment.
The Lunar Lander environment simulates a lander descending to the moon under the force of gravity. The objective is to land inside the marked area as smoothly as possible while using as few engine corrections as possible.

In [1]:
import gymnasium as gym
# Gymnasium is a library for developing and comparing reinforcement learning algorithms

# Create the "LunarLander-v2" environment (Gymnasium 1.0 and later register it as "LunarLander-v3").
# render_mode="human" opens a window and draws every step, so the episode can be watched live.
env = gym.make("LunarLander-v2", render_mode="human")

# This sets a seed for the environment's action space to ensure reproducibility. 
# Seeding makes the random actions predictable and repeatable.
env.action_space.seed(42)

# This resets the environment to its initial state and returns the initial observation and some additional info. 
# The seed=42 ensures that the reset is reproducible.
observation, info = env.reset(seed=42)
episode_count = 0 # count number of episodes i.e. number of simulations

# Loop runs for 500 iterations, performing the following actions in each iteration
for _ in range(500):
    # env.action_space.sample(): Randomly samples an action from the action space.
    # env.step(action): Applies the sampled action to the environment and returns the following:
    # observation: The new state of the environment after the action is taken.
    # reward: The reward received after taking the action.
    # terminated: A boolean indicating if the episode has ended (e.g., the lander has crashed or landed successfully).
    # truncated: A boolean indicating if the episode was truncated (e.g., due to a time limit).
    # info: Additional diagnostic information.
    observation, reward, terminated, truncated, info = env.step(env.action_space.sample())

    # If the episode has ended (terminated or truncated is True), the environment is reset to the initial state.
    if terminated or truncated:
        episode_count += 1
        observation, info = env.reset()

# Closes the environment and performs any necessary cleanup.
env.close()
print(f"Total episodes: {episode_count}")

Total episodes: 5
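
The loop above depends only on the reset/step contract, so the same logic can be tried without a rendering window. The following is a minimal sketch, assuming a hypothetical stand-in environment `CountdownEnv` (not part of Gymnasium) whose episodes always last a fixed number of steps. The helper `count_episodes` is also an illustrative name.

```python
import random


class CountdownEnv:
    """Hypothetical stand-in that mimics the Gymnasium reset/step contract."""

    def __init__(self, episode_length=7):
        self.episode_length = episode_length
        self.steps_left = episode_length

    def reset(self, seed=None):
        # Return the initial observation and an (empty) info dict.
        self.steps_left = self.episode_length
        return self.steps_left, {}

    def step(self, action):
        # Every step costs one unit of reward; the episode terminates at zero.
        self.steps_left -= 1
        terminated = self.steps_left == 0
        truncated = False  # this toy environment has no time limit
        return self.steps_left, -1.0, terminated, truncated, {}


def count_episodes(env, n_steps, rng):
    """Run random actions for n_steps and count how many episodes finished."""
    env.reset()
    finished = 0
    for _ in range(n_steps):
        action = rng.choice([0, 1])  # stands in for env.action_space.sample()
        _, _, terminated, truncated, _ = env.step(action)
        if terminated or truncated:
            finished += 1
            env.reset()
    return finished


episodes = count_episodes(CountdownEnv(episode_length=7), n_steps=500, rng=random.Random(42))
print(f"Total episodes: {episodes}")
```

Only completed episodes are counted, so an episode that is still running when the step budget ends does not contribute to the total.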


# 2) Example 2

This second example uses another existing environment, CarRacing.
This environment aims to provide a simple yet interesting testbed for autonomous driving models. No episode ends within the 500 random steps run below, so the episode counter stays at 0.

In [2]:
# This line creates an environment for the "CarRacing-v2" game. 
# The render_mode="human" parameter is used to render the environment in a way that is viewable by humans.
env = gym.make("CarRacing-v2", render_mode="human")

env.action_space.seed(42)

observation, info = env.reset(seed=42)
episode_count = 0

for _ in range(500):
    observation, reward, terminated, truncated, info = env.step(env.action_space.sample())

    # If the episode has ended (terminated or truncated is True), the environment is reset to the initial state.
    if terminated or truncated:
        episode_count += 1
        observation, info = env.reset()

# Closes the environment and performs any necessary cleanup.
env.close()
print(f"Total episodes: {episode_count}")

Total episodes: 0


# 3) Training and evaluating a model

Several algorithms can be used to train our model. In this example we use DQN with "MlpPolicy", a policy network built from a multilayer perceptron.
The Stable-Baselines3 documentation gives a full overview of the implemented algorithms and their applicability.

In [3]:
from stable_baselines3 import DQN
from stable_baselines3.common.evaluation import evaluate_policy

# Create the environment
env = gym.make("LunarLander-v2", render_mode="human")

# Create the model: DQN
model = DQN('MlpPolicy', env, seed=42, verbose=1)

# Train the model
model.learn(total_timesteps=1000)  # Too few steps to learn a good policy; increase for real training

# Save the model
model.save("dqn_lunar_lander")

# Load a trained model
model = DQN.load("dqn_lunar_lander")

# Evaluate the model
mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=10)

print(f"Mean reward: {mean_reward} +/- {std_reward}")

env.close()

Using cpu device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
----------------------------------
| rollout/            |          |
|    ep_len_mean      | 85.5     |
|    ep_rew_mean      | -460     |
|    exploration_rate | 0.05     |
| time/               |          |
|    episodes         | 4        |
|    fps              | 46       |
|    time_elapsed     | 7        |
|    total_timesteps  | 342      |
| train/              |          |
|    learning_rate    | 0.0001   |
|    loss             | 5.42     |
|    n_updates        | 60       |
----------------------------------
----------------------------------
| rollout/            |          |
|    ep_len_mean      | 80.1     |
|    ep_rew_mean      | -358     |
|    exploration_rate | 0.05     |
| time/               |          |
|    episodes         | 8        |
|    fps              | 46       |
|    time_elapsed     | 13       |
|    total_timesteps  | 641      |
| train/              |        



Mean reward: -149.4235740008531 +/- 18.611171676267748
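
The exploration_rate column in the log above is already at 0.05 after 342 timesteps. A linear epsilon schedule explains this. The sketch below assumes the DQN defaults documented by Stable-Baselines3 (epsilon annealed from 1.0 to 0.05 over the first 10% of training); `linear_epsilon` is an illustrative helper, not a library function.

```python
def linear_epsilon(step, total_timesteps, start=1.0, end=0.05, fraction=0.1):
    """Linearly anneal epsilon from start to end over the first `fraction` of training."""
    progress = step / total_timesteps
    if progress >= fraction:
        return end
    return start + progress * (end - start) / fraction


# Assumed budget: the 1000 timesteps used in the training cell above.
for step in (0, 50, 100, 342):
    print(step, round(linear_epsilon(step, total_timesteps=1000), 3))
```

With a budget of 1000 timesteps the annealing phase lasts only 100 steps, so almost all of the run uses the final exploration rate.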


# Manual evaluation

In the previous section we evaluated the model with the evaluate_policy function. The same evaluation can be reproduced with the following for loop, which is a good exercise for understanding what happens under the hood.

In [4]:
# Create the environment
env = gym.make("LunarLander-v2", render_mode="human")

# Load the saved model
model = DQN.load("dqn_lunar_lander")

# Number of episodes to run
num_episodes = 10
episode_rewards = []

for episode in range(num_episodes):
    observation, info = env.reset()
    total_reward = 0
    terminated = False
    truncated = False
    
    while not terminated and not truncated:
        action, _states = model.predict(observation, deterministic=True)
        observation, reward, terminated, truncated, info = env.step(action)
        
        # With render_mode="human" the window is redrawn on every step,
        # so no explicit env.render() call is needed here
        
        # Accumulate the reward
        total_reward += reward
    
    # Append the total reward for this episode to the list
    episode_rewards.append(total_reward)
    print(f"Episode {episode + 1}: Total Reward = {total_reward}")

# Close the environment
env.close()

# Print the rewards for all episodes
print("Episode rewards:", episode_rewards)
print(f"Mean reward over {num_episodes} episodes: {sum(episode_rewards) / num_episodes}")


Episode 1: Total Reward = -156.4939233016318
Episode 2: Total Reward = -133.3727717892838
Episode 3: Total Reward = -162.60920327789626
Episode 4: Total Reward = -132.82254172894105
Episode 5: Total Reward = -129.79511089490825
Episode 6: Total Reward = -124.94995358858012
Episode 7: Total Reward = -133.52126040626428
Episode 8: Total Reward = -141.01516062459905
Episode 9: Total Reward = -127.47238629863033
Episode 10: Total Reward = -113.97364011031893
Episode rewards: [-156.4939233016318, -133.3727717892838, -162.60920327789626, -132.82254172894105, -129.79511089490825, -124.94995358858012, -133.52126040626428, -141.01516062459905, -127.47238629863033, -113.97364011031893]
Mean reward over 10 episodes: -135.6025952021054
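
The two numbers printed by evaluate_policy can be recomputed from the per-episode rewards listed above. This is a small sketch using only the standard library. It assumes the reported spread is the population standard deviation (what `np.std` computes by default), which is an assumption about evaluate_policy rather than something shown in this notebook.

```python
import statistics

# Per-episode rewards copied from the output of the manual evaluation loop
episode_rewards = [
    -156.4939233016318, -133.3727717892838, -162.60920327789626,
    -132.82254172894105, -129.79511089490825, -124.94995358858012,
    -133.52126040626428, -141.01516062459905, -127.47238629863033,
    -113.97364011031893,
]

mean_reward = statistics.fmean(episode_rewards)
std_reward = statistics.pstdev(episode_rewards)  # population standard deviation
print(f"Mean reward: {mean_reward:.2f} +/- {std_reward:.2f}")
```

The mean differs from the evaluate_policy run in the previous section because the two evaluations played different episodes.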


# 4) Environment vectorization

Vectorized environments run multiple independent copies of the same environment and expose them as a single batched environment: they take a batch of actions as input and return a batch of observations. DummyVecEnv, used below, steps the copies one after another in a single process, while SubprocVecEnv runs each copy in its own process. Batching is particularly useful when the policy is a neural network that operates over a batch of observations.
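
The batching idea can be sketched in a few lines. The classes below are hypothetical and simplified (a three-value step result instead of the full Gymnasium tuple); they show a single-process wrapper in the spirit of DummyVecEnv, not the library's implementation.

```python
class CounterEnv:
    """Hypothetical toy environment: the observation is a step counter."""

    def __init__(self, horizon):
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t

    def step(self, action):
        self.t += 1
        done = self.t >= self.horizon
        return self.t, float(action), done


class SequentialVecEnv:
    """Steps several environment copies one after another in a single process."""

    def __init__(self, env_fns):
        self.envs = [fn() for fn in env_fns]

    def reset(self):
        return [env.reset() for env in self.envs]

    def step(self, actions):
        observations, rewards, dones = [], [], []
        for env, action in zip(self.envs, actions):
            obs, reward, done = env.step(action)
            if done:
                obs = env.reset()  # auto-reset so the batch never stalls
            observations.append(obs)
            rewards.append(reward)
            dones.append(done)
        return observations, rewards, dones


vec_env = SequentialVecEnv([lambda: CounterEnv(horizon=2), lambda: CounterEnv(horizon=3)])
print(vec_env.reset())
print(vec_env.step([1, 0]))
print(vec_env.step([1, 0]))
```

Each call takes one action per copy and returns one observation, reward and done flag per copy. A copy that finishes is reset immediately, so its next observation is already the start of a new episode.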

In [32]:
import gymnasium as gym  # same package as in the previous sections
from stable_baselines3.common.vec_env import DummyVecEnv, VecFrameStack
from stable_baselines3 import PPO

# Create the LunarLander-v2 environment
def make_env():
    return gym.make('LunarLander-v2')

# Create a vectorized environment
num_envs = 4
# DummyVecEnv is used to create a vectorized environment
env = DummyVecEnv([make_env for _ in range(num_envs)])

# VecFrameStack stacks frames to provide the agent with a sequence of observations, 
# which can be useful for environments where the current state depends on previous states.
env = VecFrameStack(env, n_stack=4)

# Initialize the PPO model
model = PPO('MlpPolicy', env, verbose=1)

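# Train the model. PPO gathers n_steps (2048 by default) transitions per environment before each
# update, so with 4 environments training advances in blocks of 8192 timesteps.
model.learn(total_timesteps=1000)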

# Save the model
model.save("ppo_lunarlander_parallel")

# To load and use the model
model = PPO.load("ppo_lunarlander_parallel")

Using cpu device
-----------------------------
| time/              |      |
|    fps             | 3070 |
|    iterations      | 1    |
|    time_elapsed    | 2    |
|    total_timesteps | 8192 |
-----------------------------
-----------------------------------------
| time/                   |             |
|    fps                  | 1681        |
|    iterations           | 2           |
|    time_elapsed         | 9           |
|    total_timesteps      | 16384       |
| train/                  |             |
|    approx_kl            | 0.009970572 |
|    clip_fraction        | 0.0927      |
|    clip_range           | 0.2         |
|    entropy_loss         | -1.38       |
|    explained_variance   | 0.0119      |
|    learning_rate        | 0.0003      |
|    loss                 | 345         |
|    n_updates            | 10          |
|    policy_gradient_loss | -0.00755    |
|    value_loss           | 1.09e+03    |
-----------------------------------------

-----------------------------------------
| time/                   |             |
|    fps                  | 787         |
|    iterations           | 13          |
|    time_elapsed         | 135         |
|    total_timesteps      | 106496      |
| train/                  |             |
|    approx_kl            | 0.008156115 |
|    clip_fraction        | 0.0876      |
|    clip_range           | 0.2         |
|    entropy_loss         | -1.08       |
|    explained_variance   | 0.924       |
|    learning_rate        | 0.0003      |
|    loss                 | 4.87        |
|    n_updates            | 120         |
|    policy_gradient_loss | -0.00779    |
|    value_loss           | 11.5        |
-----------------------------------------
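
Frame stacking itself is a small bookkeeping step. The sketch below uses a hypothetical `FrameStacker` class, not the VecFrameStack implementation, and assumes that missing history is padded with zeros at the start of an episode. It keeps the last n observations and concatenates them.

```python
from collections import deque


class FrameStacker:
    """Keeps the last n observations and returns them as one flat list."""

    def __init__(self, n_stack, obs_size):
        self.n_stack = n_stack
        self.obs_size = obs_size
        self.frames = deque(maxlen=n_stack)

    def reset(self, first_obs):
        # Pad with zero observations so the stacked size is constant from the first step.
        self.frames.clear()
        for _ in range(self.n_stack - 1):
            self.frames.append([0.0] * self.obs_size)
        self.frames.append(list(first_obs))
        return self.stacked()

    def push(self, obs):
        # Appending to a full deque drops the oldest frame automatically.
        self.frames.append(list(obs))
        return self.stacked()

    def stacked(self):
        return [value for frame in self.frames for value in frame]


stacker = FrameStacker(n_stack=4, obs_size=2)
print(stacker.reset([1.0, 2.0]))
print(stacker.push([3.0, 4.0]))
```

With the 8-value LunarLander observation and n_stack=4, the policy would see 32 values per step under this scheme.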


In [33]:
# Create a vectorized environment
# Create a single instance of the LunarLander-v2 environment with frame stacking.
num_envs = 1
env = DummyVecEnv([make_env for _ in range(num_envs)])
env = VecFrameStack(env, n_stack=4)

# Evaluate the model
# Run the model for a specified number of episodes (in this case, 10) and print the total reward for each episode.
episodes = 10
for episode in range(episodes):
    obs = env.reset()
    done = False
    total_reward = 0
    while not done:
        action, _states = model.predict(obs)
        obs, reward, done, info = env.step(action)
        total_reward += reward
    print(f"Episode {episode + 1}: Total Reward = {total_reward}")

Episode 1: Total Reward = [132.08537]
Episode 2: Total Reward = [121.57703]
Episode 3: Total Reward = [144.12541]
Episode 4: Total Reward = [156.30675]
Episode 5: Total Reward = [93.6617]
Episode 6: Total Reward = [99.62338]
Episode 7: Total Reward = [148.38686]
Episode 8: Total Reward = [162.59474]
Episode 9: Total Reward = [143.88536]
Episode 10: Total Reward = [132.70992]


# 5) Build a custom environment

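In this section we create a custom environment and train a model on it. The environment models a dam: at each step the agent opens or closes the dam, and the goal is to keep the water level between 98 and 102 for 100 consecutive steps.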

In [5]:
import gymnasium as gym
from stable_baselines3 import PPO
from gymnasium import spaces
import numpy as np
from stable_baselines3.common.env_checker import check_env

class DamEnv(gym.Env):
    def __init__(self):
        super(DamEnv, self).__init__()
        self.action_space = spaces.Discrete(2)  # 0: Close dam, 1: Open dam
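        # Levels from 90 to 110 are reachable before an episode is cut off, so the bounds
        # cover that range plus one 0.1 step of margin
        self.observation_space = spaces.Box(low=89.9, high=110.1, shape=(1,), dtype=np.float32)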
        self.max_consecutive_steps = 100
        self.reset()

    def reset(self, seed=None, options=None):
        # gymnasium.Env.reset seeds self.np_random, and only when a seed is given,
        # so reset(seed=42) followed by reset() stays reproducible
        super().reset(seed=seed)
        self.water_level = self.np_random.uniform(96, 104)
        self.consecutive_steps_in_range = 0
        return np.array([self.water_level], dtype=np.float32), {}

    def step(self, action):
        if action == 0:
            self.water_level += 0.1  # Dam closed: the level rises by 0.1
        else:
            self.water_level -= 0.1  # Dam open: the level falls by 0.1
        
        in_range = 98 <= self.water_level <= 102

        if in_range:
            self.consecutive_steps_in_range += 1
            reward = 1  # Small positive reward for being in range
        else:
            self.consecutive_steps_in_range = 0
            reward = -1  # Negative reward for being out of range

        terminated = self.consecutive_steps_in_range >= self.max_consecutive_steps
        truncated = self.water_level < 90 or self.water_level > 110

        return np.array([self.water_level], dtype=np.float32), reward, terminated, truncated, {}

    def render(self):
        # Gymnasium's render() takes no arguments; print the level as a simple text rendering
        print(f"Water Level: {self.water_level:.2f}")

    def close(self):
        pass


env = DamEnv()
check_env(env)

# Define the model
model = PPO(
    'MlpPolicy', 
    env, 
    #learning_rate=0.0001,  # Smaller learning rate
    #batch_size=32,  # Adjusted batch size
    #gamma=0.99,  # Discount factor
    verbose=1
)

# Train the model
model.learn(total_timesteps=100000)

# Save the model
model.save("dam_dqn")

# Load the model
model = PPO.load("dam_dqn")

# Test the trained model
obs, info = env.reset()
for i in range(100):
    action, _states = model.predict(obs)
    # print("action", action)
    obs, reward, terminated, truncated, info = env.step(action)
    env.render()
    if terminated or truncated:
        break

Using cpu device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
---------------------------------
| rollout/           |          |
|    ep_len_mean     | 367      |
|    ep_rew_mean     | 85.3     |
| time/              |          |
|    fps             | 1543     |
|    iterations      | 1        |
|    time_elapsed    | 1        |
|    total_timesteps | 2048     |
---------------------------------
-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 495         |
|    ep_rew_mean          | -220        |
| time/                   |             |
|    fps                  | 1103        |
|    iterations           | 2           |
|    time_elapsed         | 3           |
|    total_timesteps      | 4096        |
| train/                  |             |
|    approx_kl            | 0.012386147 |
|    clip_fraction        | 0.0879      |
|    clip_range           | 0.2         |
|    entropy_loss   

-------------------------------------------
| rollout/                |               |
|    ep_len_mean          | 164           |
|    ep_rew_mean          | -109          |
| time/                   |               |
|    fps                  | 735           |
|    iterations           | 11            |
|    time_elapsed         | 30            |
|    total_timesteps      | 22528         |
| train/                  |               |
|    approx_kl            | 0.00013810821 |
|    clip_fraction        | 0.0197        |
|    clip_range           | 0.2           |
|    entropy_loss         | -0.402        |
|    explained_variance   | -7.15e-07     |
|    learning_rate        | 0.0003        |
|    loss                 | 47.9          |
|    n_updates            | 100           |
|    policy_gradient_loss | -7.43e-05     |
|    value_loss           | 87.8          |
-------------------------------------------

------------------------------------------
| rollout/                |              |
|    ep_len_mean          | 113          |
|    ep_rew_mean          | -74.6        |
| time/                   |              |
|    fps                  | 792          |
|    iterations           | 21           |
|    time_elapsed         | 54           |
|    total_timesteps      | 43008        |
| train/                  |              |
|    approx_kl            | 0.0015716945 |
|    clip_fraction        | 0.0118       |
|    clip_range           | 0.2          |
|    entropy_loss         | -0.251       |
|    explained_variance   | 0            |
|    learning_rate        | 0.0003       |
|    loss                 | 38.4         |
|    n_updates            | 200          |
|    policy_gradient_loss | -0.000414    |
|    value_loss           | 80           |
------------------------------------------

------------------------------------------
| rollout/                |              |
|    ep_len_mean          | 120          |
|    ep_rew_mean          | -73.8        |
| time/                   |              |
|    fps                  | 826          |
|    iterations           | 31           |
|    time_elapsed         | 76           |
|    total_timesteps      | 63488        |
| train/                  |              |
|    approx_kl            | 7.135881e-05 |
|    clip_fraction        | 0            |
|    clip_range           | 0.2          |
|    entropy_loss         | -0.237       |
|    explained_variance   | 0.0278       |
|    learning_rate        | 0.0003       |
|    loss                 | 49.4         |
|    n_updates            | 300          |
|    policy_gradient_loss | -5.39e-05    |
|    value_loss           | 92           |
------------------------------------------

------------------------------------------
| rollout/                |              |
|    ep_len_mean          | 115          |
|    ep_rew_mean          | -70.2        |
| time/                   |              |
|    fps                  | 841          |
|    iterations           | 40           |
|    time_elapsed         | 97           |
|    total_timesteps      | 81920        |
| train/                  |              |
|    approx_kl            | 0.0008460237 |
|    clip_fraction        | 0.00752      |
|    clip_range           | 0.2          |
|    entropy_loss         | -0.215       |
|    explained_variance   | 0.0722       |
|    learning_rate        | 0.0003       |
|    loss                 | 31.8         |
|    n_updates            | 390          |
|    policy_gradient_loss | -0.00018     |
|    value_loss           | 94.4         |
------------------------------------------

Water Level: 102.47
Water Level: 102.57
Water Level: 102.47
Water Level: 102.37
Water Level: 102.47
Water Level: 102.37
Water Level: 102.27
Water Level: 102.17
Water Level: 102.07
Water Level: 101.97
Water Level: 101.87
Water Level: 101.77
Water Level: 101.87
Water Level: 101.77
Water Level: 101.67
Water Level: 101.57
Water Level: 101.67
Water Level: 101.57
Water Level: 101.47
Water Level: 101.37
Water Level: 101.47
Water Level: 101.37
Water Level: 101.47
Water Level: 101.37
Water Level: 101.47
Water Level: 101.37
Water Level: 101.27
Water Level: 101.37
Water Level: 101.27
Water Level: 101.17
Water Level: 101.07
Water Level: 100.97
Water Level: 100.87
Water Level: 100.77
Water Level: 100.67
Water Level: 100.57
Water Level: 100.67
Water Level: 100.57
Water Level: 100.47
Water Level: 100.37
Water Level: 100.27
Water Level: 100.17
Water Level: 100.07
Water Level: 99.97
Water Level: 99.87
Water Level: 99.97
Water Level: 99.87
Water Level: 99.77
Water Level: 99.67
Water Level: 99.57
Water L
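
A fixed rule gives a useful reference point when judging the trained policy. The sketch below replays the dam dynamics from the environment above in plain Python and applies a simple threshold rule. The function name, the target of 100 and the step budget are illustrative assumptions, not part of the notebook.

```python
def simulate_threshold_controller(start_level, target=100.0, max_steps=500,
                                  steps_required=100):
    """Simulate the dam dynamics with a fixed rule instead of a learned policy."""
    level = start_level
    consecutive = 0
    total_reward = 0
    for step in range(1, max_steps + 1):
        # Action 1 opens the dam (level falls); action 0 closes it (level rises).
        action = 1 if level > target else 0
        level += -0.1 if action == 1 else 0.1
        if 98 <= level <= 102:
            consecutive += 1
            total_reward += 1
        else:
            consecutive = 0
            total_reward -= 1
        if consecutive >= steps_required:
            return step, total_reward  # goal reached
        if level < 90 or level > 110:
            break  # level left the allowed band
    return None, total_reward


steps, total_reward = simulate_threshold_controller(start_level=103.05)
print(f"Finished after {steps} steps with total reward {total_reward}")
```

The rule drives the level toward the target and then alternates between opening and closing, so the only penalties come from the steps needed to enter the 98 to 102 range. A trained policy that scores well below this on the same start level still has room to improve.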

# 6) Simple rendering

In this section we modify the render function to draw the water level after each step.

In [5]:
import matplotlib
matplotlib.use('TkAgg')  # Use TkAgg backend for separate windows
import gymnasium as gym
from stable_baselines3 import PPO
from gymnasium import spaces
import numpy as np
from stable_baselines3.common.env_checker import check_env
import matplotlib.pyplot as plt

class DamEnv(gym.Env):
    def __init__(self):
        super(DamEnv, self).__init__()
        self.action_space = spaces.Discrete(2)  # 0: Close dam, 1: Open dam
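        # Levels from 90 to 110 are reachable before an episode is cut off, so the bounds
        # cover that range plus one 0.1 step of margin
        self.observation_space = spaces.Box(low=89.9, high=110.1, shape=(1,), dtype=np.float32)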
        self.max_consecutive_steps = 50
        self.reset()
        self.fig, self.ax = plt.subplots()

    def reset(self, seed=None, options=None):
        # gymnasium.Env.reset seeds self.np_random, and only when a seed is given
        super().reset(seed=seed)
        self.water_level = self.np_random.uniform(99, 101)
        self.consecutive_steps_in_range = 0
        return np.array([self.water_level], dtype=np.float32), {}

    def step(self, action):
        if action == 0:
            self.water_level += 0.1  # Dam closed: the level rises by 0.1
        else:
            self.water_level -= 0.1  # Dam open: the level falls by 0.1
        
        in_range = 98 <= self.water_level <= 102

        if in_range:
            self.consecutive_steps_in_range += 1
            reward = 1  # Small positive reward for being in range
        else:
            self.consecutive_steps_in_range = 0
            reward = -1  # Negative reward for being out of range

        terminated = self.consecutive_steps_in_range >= self.max_consecutive_steps
        truncated = self.water_level < 90 or self.water_level > 110

        return np.array([self.water_level], dtype=np.float32), reward, terminated, truncated, {}

    def render(self):
        self.ax.clear()
        
        # Set the range for the water level
        self.ax.set_ylim(90, 110)
        self.ax.set_xlim(0, 1)
        
        # Draw the vertical bar for the water level range
        self.ax.bar(0.5, height=110-90, width=0.1, bottom=90, color='lightblue', edgecolor='blue')

        # Draw the point representing the current water level
        self.ax.plot(0.5, self.water_level, 'ro')  # 'ro' for red dot
        # Dashed lines mark the target range; draw them on this environment's own axes
        self.ax.axhline(y=98, color='r', linestyle='--', label='y = 98')
        self.ax.axhline(y=102, color='r', linestyle='--', label='y = 102')

    
        # Set labels
        self.ax.set_ylabel('Water Level (meters)')
        self.ax.set_xticks([])
        self.ax.set_title(f'Current Water Level: {self.water_level:.2f} meters')
        
        # Display the plot
        plt.pause(0.1)

    def close(self):
        # Release the matplotlib figure created in __init__
        plt.close(self.fig)


env = DamEnv()
check_env(env)

# Load the model
model = PPO.load("dam_dqn_best")

# Test the trained model
obs, info = env.reset()
for i in range(200):
    action, _states = model.predict(obs)
    # print("action", action)
    obs, reward, terminated, truncated, info = env.step(action)
    env.render()
    if terminated or truncated:
        break
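
The TkAgg backend needs a display, which is not always available (for example on a remote server). As a fallback, the same information can be drawn as text. The helper below is a hypothetical sketch, not part of the notebook's environment: it maps the 90 to 110 range onto a fixed number of character cells, marks the 98 and 102 limits with `|`, and marks the current level with `o`.

```python
def text_gauge(level, low=90.0, high=110.0, safe_low=98.0, safe_high=102.0, width=41):
    """Render the water level as one line of text."""

    def column(value):
        # Map a value in [low, high] to a column index in [0, width - 1].
        clipped = min(max(value, low), high)
        return round((clipped - low) / (high - low) * (width - 1))

    cells = ["-"] * width
    cells[column(safe_low)] = "|"
    cells[column(safe_high)] = "|"
    cells[column(level)] = "o"  # drawn last so the level is always visible
    return f"{low:.0f} [" + "".join(cells) + f"] {high:.0f}  level={level:.2f}"


print(text_gauge(100.0))
print(text_gauge(103.5))
```

Calling such a helper from render() instead of the matplotlib code would keep the test loop unchanged while removing the dependency on a GUI backend.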