## Implementing a Custom Reinforcement Learning Environment Using Gymnasium

# Install packages

In [1]:
!pip install gymnasium
!pip install stable-baselines3



# Create a Custom Environment Using Gymnasium
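The environment has 10 states (0, 1, 2, ..., 9).<br>
The agent can take 2 possible actions (0 and 1).<br>
The intended task is for the agent to receive a reward when it reaches state 2 by taking the correct sequence of actions. The skeleton below uses placeholder dynamics (a random next state and a fixed reward of 1.0) that you should replace with your own transition and reward logic.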
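
The description above does not say which action sequence counts as the "correct" one. As a minimal sketch under one assumed reading (action 1 moves one state to the right, action 0 moves one state to the left, and reaching state 2 yields a reward of 1.0 and ends the episode), the transition rule might look like this. The function name `transition`, the movement rule, and the reward values are illustrative assumptions, not part of the original environment.

```python
N_STATES = 10    # states 0..9, as in the description
GOAL_STATE = 2   # state that yields the reward

def transition(state: int, action: int) -> tuple[int, float, bool]:
    """Hypothetical dynamics: action 1 moves right, action 0 moves left."""
    if action not in (0, 1):
        raise ValueError(f"invalid action: {action}")
    step = 1 if action == 1 else -1
    # Clamp so the agent cannot leave the range 0..N_STATES-1
    next_state = min(max(state + step, 0), N_STATES - 1)
    terminated = next_state == GOAL_STATE
    reward = 1.0 if terminated else 0.0
    return next_state, reward, terminated

# From the initial state 0, the action sequence (1, 1) reaches state 2.
state = 0
for action in (1, 1):
    state, reward, terminated = transition(state, action)
```

Under these assumptions, the shortest rewarded sequence from state 0 is two consecutive `1` actions.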

In [2]:
import numpy as np
import gymnasium as gym
from gymnasium import spaces

class SimpleEnv(gym.Env):
    def __init__(self):
        super().__init__()

        # Define action and observation spaces
        self.observation_space = spaces.Discrete(10)  # 10 discrete states: 0 to 9
        self.action_space = spaces.Discrete(2)  # 2 discrete actions: 0 and 1

    def reset(self, seed=None, options=None):
        # Seed self.np_random, the environment's own random number generator
        super().reset(seed=seed)

        # Reset the environment to its initial state
        initial_observation = 0

        # Return the initial observation and an (empty) info dict
        return initial_observation, {}

    def step(self, action):
        # Placeholder dynamics: replace with your environment's transition logic
        observation = int(self.np_random.integers(self.observation_space.n))  # Example: random next state
        reward = 1.0  # Example: fixed reward
        terminated = False  # Example: never terminates (change to match your terminal-state logic)
        truncated = False  # No time limit is applied inside the environment
        info = {}
        return observation, reward, terminated, truncated, info

    def render(self):
        # Implement rendering logic if required
        pass


# Implement an RL Agent Using Stable-Baselines3

Now we will use the PPO (Proximal Policy Optimization) algorithm from Stable-Baselines3 to train an RL agent to interact with the SimpleEnv environment.

In [3]:
from stable_baselines3 import PPO

# Create the custom environment
env = SimpleEnv()

# Instantiate the agent
model = PPO('MlpPolicy', env, verbose=1)

# Train the agent
model.learn(total_timesteps=10000)

# Save the trained model
model.save("ppo_simple_env")

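# Test the trained agent
obs, _ = env.reset()
for _ in range(100):
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, _ = env.step(int(action))
    env.render()
    # An episode ends when it is either terminated or truncated
    if terminated or truncated:
        print("Episode finished, resetting the environment.")
        obs, _ = env.reset()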


Using cuda device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
-----------------------------
| time/              |      |
|    fps             | 583  |
|    iterations      | 1    |
|    time_elapsed    | 3    |
|    total_timesteps | 2048 |
-----------------------------
-----------------------------------------
| time/                   |             |
|    fps                  | 477         |
|    iterations           | 2           |
|    time_elapsed         | 8           |
|    total_timesteps      | 4096        |
| train/                  |             |
|    approx_kl            | 0.000980796 |
|    clip_fraction        | 0           |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.693      |
|    explained_variance   | -0.0229     |
|    learning_rate        | 0.0003      |
|    loss                 | 13.7        |
|    n_updates            | 10          |
|    policy_gradient_loss | -0.000752   |
|    value_loss        
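
The test loop above steps the environment for a fixed number of iterations. As a hedged sketch, a small helper can run a single episode against any object that follows the Gymnasium `reset`/`step` signatures, stopping when either `terminated` or `truncated` is set and returning the accumulated reward. The names `run_episode` and `CountdownEnv` are hypothetical; the stub environment exists only to keep the example self-contained.

```python
from typing import Any, Callable

def run_episode(env: Any, policy: Callable[[Any], int], max_steps: int = 100) -> tuple[float, int]:
    """Run one episode; return (total reward, number of steps taken)."""
    obs, _info = env.reset()
    total_reward = 0.0
    steps = 0
    for _ in range(max_steps):
        action = policy(obs)
        obs, reward, terminated, truncated, _info = env.step(action)
        total_reward += float(reward)
        steps += 1
        # Stop on a terminal state or on an externally imposed cutoff
        if terminated or truncated:
            break
    return total_reward, steps

class CountdownEnv:
    """Stub with Gymnasium-style reset/step that terminates after 3 steps."""
    def reset(self, seed=None, options=None):
        self.remaining = 3
        return 0, {}

    def step(self, action):
        self.remaining -= 1
        terminated = self.remaining == 0
        return 0, 1.0, terminated, False, {}

total, steps = run_episode(CountdownEnv(), policy=lambda obs: 0)
print(total, steps)  # 3.0 3
```

With the trained agent, the policy argument could be something like `lambda obs: int(model.predict(obs, deterministic=True)[0])`. Because `SimpleEnv` as written never terminates, the `max_steps` cap is what ends the episode in that case.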

