# **Custom Environment for Reinforcement Learning**

The code below is taken from Nicholas Renotte's tutorial on how to create custom environments for reinforcement learning: [tutorial](https://youtu.be/Mut_u40Sqz4?t=8940), [code on GitHub](https://github.com/nicknochnack/ReinforcementLearningCourse/blob/main/Project%203%20-%20Custom%20Environment.ipynb).

You are encouraged to visit the links above and check out the full code. In this lab, you will practice training a model.

**About the problem**

The task is to build an agent that regulates the shower temperature to give the best shower possible every time.

Based on the activity of other people in the building, the temperature fluctuates randomly. Assuming that our optimal temperature is between 37 and 39 degrees, we want to train an agent to automatically respond to changes in temperature and bring it back within the preferred range.

Note that the agent does not know the preferred range ahead of time, and should instead learn which adjustments earn a reward.

**Import libraries**

In [1]:
import os
# Avoid reinstalling packages that are available on edstem
if not os.getenv("ED_COURSE_ID"):
    !pip install tensorflow stable_baselines3 torch gym box2d-py --user

# Import gym libraries
import gym 
from gym import Env # the superclass to build our own environment
# All different types of spaces available in Gym
from gym.spaces import Discrete, Box, Dict, Tuple, MultiBinary, MultiDiscrete 

# Import helpers
import numpy as np
import random

# Import stable baselines libraries
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import DummyVecEnv
from stable_baselines3.common.evaluation import evaluate_policy

# Inspect types of spaces

There are four key types of Gym spaces: Box, Discrete, MultiBinary and MultiDiscrete.

There are two wrapper spaces, Tuple and Dict, that help group different spaces together.

These spaces can be used to create a simple environment, like the shower environment in the following example.

In [2]:
# Define a discrete space
disc = Discrete(3)

In [3]:
# Sample the discrete space for a value (between 0 and 2)
disc.sample()

1

In [4]:
# Define a box space
box = Box(0,1,shape=(3,3))

In [5]:
#TODO: Sample the box space for a value
box.sample()

array([[0.15143314, 0.34911984, 0.43493873],
       [0.03203558, 0.25423893, 0.34316856],
       [0.95327836, 0.65670663, 0.742812  ]], dtype=float32)

In [6]:
# Define a tuple space that combines a discrete space and a box space
tup = Tuple((Discrete(2), Box(0,100, shape=(1,))))

In [7]:
#TODO: Sample the tuple space for a value
tup.sample()

(0, array([78.21167], dtype=float32))

In [8]:
# Define a dict space and draw a sample from it (dic holds the sample, not the space)
dic = Dict({'height':Discrete(2), "speed":Box(0,100, shape=(1,))}).sample()

In [9]:
# Define a multibinary space
multibi = MultiBinary(4)

In [10]:
#TODO: Sample the multibinary space for a value
multibi.sample()

array([1, 1, 1, 0], dtype=int8)

In [11]:
# Define a multidiscrete space
multidi = MultiDiscrete([5,2,2])

In [12]:
#TODO: Sample the multidiscrete space for a value
multidi.sample()

array([0, 1, 1])
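
To make the sampled values above easier to interpret, here is a rough plain-Python analogue of what sampling from `Discrete(3)`, `Box(0, 1, shape=(3, 3))` and `MultiDiscrete([5, 2, 2])` produces. The helper names (`sample_discrete`, `sample_box`, `sample_multidiscrete`) are hypothetical and are not part of Gym; the sketch only mirrors the shapes and value ranges, not Gym's actual random number generation or dtypes.

```python
import random

def sample_discrete(n):
    """Return an integer in {0, ..., n-1}, similar to Discrete(n).sample()."""
    return random.randrange(n)

def sample_box(low, high, shape):
    """Return a nested list of floats in [low, high] with a 2-D shape."""
    rows, cols = shape
    return [[random.uniform(low, high) for _ in range(cols)] for _ in range(rows)]

def sample_multidiscrete(nvec):
    """Return one integer per entry of nvec, each in {0, ..., n-1}."""
    return [random.randrange(n) for n in nvec]

print(sample_discrete(3))               # a single integer: 0, 1 or 2
print(sample_box(0, 1, (3, 3)))         # a 3x3 grid of floats between 0 and 1
print(sample_multidiscrete([5, 2, 2]))  # three integers with different ranges
```

In this reading, a Discrete space fits a small set of choices (such as down, hold, up), while a Box space fits continuous measurements (such as a temperature).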

# Create a custom environment

In [13]:
# Define a shower environment class with four key functions
class ShowerEnv(Env):
    # Define a function to initialize the environment
    def __init__(self):
        # Define the discrete action space: 
        # Actions we can take, down, hold, up
        self.action_space = Discrete(3)
        # Define a temperature range from 0 to 100
        self.observation_space = Box(low=np.array([0]), high=np.array([100]))
        # Set initial state: starting temp is 38 +- 3
        self.state = 38 + random.randint(-3,3)
        # Set shower length: set to 60 seconds for testing
        self.shower_length = 60

    # Define the step function for what to do in one action step    
    def step(self, action):
        # Apply impact of the action on current state
        # 0 -1 = -1 temperature
        # 1 -1 = 0 
        # 2 -1 = 1 temperature 
        self.state += action -1 
        # Reduce shower length by 1 second at each action
        self.shower_length -= 1 
        
        # Calculate reward
        # If the temperature is within preferred range, the reward is positive
        if self.state >= 37 and self.state <= 39: 
            reward = 1 
        # If the temperature is outside the preferred range, the reward is negative
        else: 
            reward = -1 
        
        # Check if shower is done
        if self.shower_length <= 0: 
            done = True
        else:
            done = False
        
        # Set placeholder for info
        info = {}
        
        # Return step information
        return self.state, reward, done, info

    # For this lab, we will not implement a visualization of the environment
    def render(self):
        # Implement viz
        pass
    
    # Define function to reset the environment for the next run
    def reset(self):
        # Reset shower temperature to a random value between 35 and 41
        self.state = np.array([38 + random.randint(-3,3)]).astype(float)
        # Reset shower time
        self.shower_length = 60 
        return self.state
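
As a quick sanity check of the step logic, the sketch below reproduces the temperature update and reward rule from `step` in a standalone function, so a short action sequence can be traced by hand without Gym. The function name `shower_step` is hypothetical; it illustrates the same arithmetic and is not part of the environment class.

```python
def shower_step(temp, action):
    """Apply one action (0 = down, 1 = hold, 2 = up) and return (new_temp, reward)."""
    # Same update as in ShowerEnv.step: action - 1 maps to -1, 0 or +1 degrees
    new_temp = temp + action - 1
    # Same reward rule: +1 inside the 37 to 39 range, -1 outside it
    reward = 1 if 37 <= new_temp <= 39 else -1
    return new_temp, reward

temp = 38
for action in [2, 2, 0]:
    temp, reward = shower_step(temp, action)
    print(temp, reward)
# prints:
# 39 1
# 40 -1
# 39 1
```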

# Test the environment

In [14]:
# Initialize the environment
env=ShowerEnv()

  logger.warn(f"Box bound precision lowered by casting to {self.dtype}")


In [15]:
#TODO: Write code to sample the environment's observation space
env.observation_space.sample()

array([38.61166], dtype=float32)

In [16]:
#TODO: Write code to sample the environment's action space
env.action_space.sample()

1

In [17]:
# Reset the environment
env.reset()

array([39.])

In [18]:
# Test five episodes of taking random actions in the environment
episodes = 5
for episode in range(1, episodes+1):
    state = env.reset()
    done = False
    score = 0 
    
    while not done:
        env.render()
        action = env.action_space.sample()
        n_state, reward, done, info = env.step(action)
        score+=reward
    print('Episode:{} Score:{}'.format(episode, score))
    
env.close()

Episode:1 Score:-20
Episode:2 Score:-24
Episode:3 Score:-56
Episode:4 Score:-52
Episode:5 Score:20
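
For comparison with the random scores above, a hand-written controller that already knows the preferred range gives a rough upper bound on what a trained agent could achieve. The sketch below simulates the same update and reward rules in plain Python; `rule_based_action` and `run_episode` are hypothetical helpers, and unlike the learning agent this controller is told the 37 to 39 range in advance.

```python
def rule_based_action(temp):
    """Pick 2 (up) when too cold, 0 (down) when too hot, 1 (hold) otherwise."""
    if temp < 37:
        return 2
    if temp > 39:
        return 0
    return 1

def run_episode(start_temp, shower_length=60):
    """Simulate one episode with the environment's update and reward rules."""
    temp, score = start_temp, 0
    for _ in range(shower_length):
        temp += rule_based_action(temp) - 1
        score += 1 if 37 <= temp <= 39 else -1
    return score

# Starting temperatures 35..41 match 38 + random.randint(-3, 3)
for start in range(35, 42):
    print(start, run_episode(start))
```

Under these assumptions every starting temperature scores 58 or 60, so a well-trained agent should approach 60 per episode.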


# Earn Your Wings

Implement the rest of the reinforcement learning algorithm to train the model using MlpPolicy. Save the training logs to the log_path defined below, and evaluate the model at the end with render set to False. Add comments in your code to explain each step that you take in your implementation.


In [19]:
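# Define a path for where to output the training log files
log_path = os.path.join('ReinforcementLearning/ShowerEnvironment/Training', 'Logs')
# Create a PPO agent with a multilayer perceptron policy and log to TensorBoard
model = PPO("MlpPolicy", env, verbose=1, tensorboard_log=log_path)
# Train the agent for 400,000 timesteps
model.learn(total_timesteps=400000)
# Save the trained model to disk (written as PPO.zip)
model.save('PPO')
# Evaluate over 10 episodes without rendering; returns (mean reward, std of reward)
evaluate_policy(model, env, n_eval_episodes=10, render=False)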

Using cpu device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
Logging to ReinforcementLearning/ShowerEnvironment/Training/Logs/PPO_1
---------------------------------
| rollout/           |          |
|    ep_len_mean     | 60       |
|    ep_rew_mean     | -30.8    |
| time/              |          |
|    fps             | 1311     |
|    iterations      | 1        |
|    time_elapsed    | 1        |
|    total_timesteps | 2048     |
---------------------------------
-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 60          |
|    ep_rew_mean          | -30.6       |
| time/                   |             |
|    fps                  | 1592        |
|    iterations           | 2           |
|    time_elapsed         | 2           |
|    total_timesteps      | 4096        |
| train/                  |             |
|    approx_kl            | 0.012955134 |
|    clip_fraction        | 0.117



(-12.0, 58.787753826796276)
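
`evaluate_policy` returns the mean and the standard deviation of the episode rewards. As a rough way to read the result above, the sketch below checks one hypothetical set of ten episode scores that is consistent with it: four episodes that stay in range for all 60 steps and six that never do. This breakdown is an inference from the two numbers, not something the training log reports.

```python
import statistics

# Hypothetical per-episode scores: 4 episodes at +60 and 6 episodes at -60
scores = [60] * 4 + [-60] * 6

mean_reward = statistics.mean(scores)
# Population standard deviation, which matches NumPy's default np.std
std_reward = statistics.pstdev(scores)

print(mean_reward, std_reward)
```

If that reading is right, the large spread means the average hides two very different outcomes, so it is worth inspecting per-episode scores rather than the mean alone.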