# Stable Baselines 3

## Required installations for the environments to work, before installing Stable Baselines 3

### Swig
- [Install SWIG on Windows before installing Stable Baselines 3](https://gist.github.com/felix-tjernberg/8bc7313ad1a0de136789f11d7ae7acd3)
- Install SWIG on macOS before installing Stable Baselines 3: `brew install swig`
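
Before installing, it can help to confirm that `swig` is actually reachable on the `PATH`. The following is a minimal sketch using only the standard library; the helper name `find_swig` is an illustrative choice and not part of any package.

```python
import shutil


def find_swig():
    """Return the full path of the swig executable, or None if it is not on PATH."""
    return shutil.which("swig")


swig_path = find_swig()
if swig_path is None:
    print("swig not found on PATH; install it first using the steps above")
else:
    print(f"swig found at {swig_path}")
```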

## Tutorial reference
- [Video tutorial](https://www.youtube.com/playlist?list=PLQVvvaa0QuDf0O2DWwLZBfJeYY-JOeZB1)
- [Written tutorial](https://pythonprogramming.net/introduction-reinforcement-learning-stable-baselines-3-tutorial/)

## [Basic environment setup](https://www.youtube.com/watch?v=XbWhJdQgi7E)

In [1]:
import gym

environment = gym.make("CartPole-v1")  # Using CartPole-v1 because I could not get LunarLander to work on Windows :/
environment.reset()
print(
    f"""
    sample action: {environment.action_space.sample()}
    observation space: {environment.observation_space.shape}
    sample observation: {environment.observation_space.sample()}
    """
)
environment.close()


    sample action: 1
    observation space: (4,)
    sample observation: [3.9042451e+00 1.2297147e+38 3.3339074e-01 3.2759917e+36]
    


In [7]:
environment = gym.make("CartPole-v1")
environment.reset()
for step in range(100):
    environment.render()
    environment.step(environment.action_space.sample())
environment.close()



In [3]:
environment.reset()
for step in range(10):
    environment.render()
    obs, reward, done, info = environment.step(environment.action_space.sample())
    print(obs, reward, done, info)

environment.close()

[-0.02662069  0.20854338  0.04225498 -0.23203783] 1.0 False {}
[-0.02244982  0.40303686  0.03761422 -0.5110984 ] 1.0 False {}
[-0.01438908  0.59760934  0.02739225 -0.7916947 ] 1.0 False {}
[-0.0024369   0.7923447   0.01155836 -1.0756358 ] 1.0 False {}
[ 0.01341     0.987312   -0.00995436 -1.3646692 ] 1.0 False {}
[ 0.03315624  0.79231614 -0.03724774 -1.0751164 ] 1.0 False {}
[ 0.04900256  0.98791    -0.05875007 -1.379252  ] 1.0 False {}
[ 0.06876076  1.1837142  -0.08633511 -1.6897141 ] 1.0 False {}
[ 0.09243505  1.3797213  -0.12012939 -2.0079808 ] 1.0 False {}
[ 0.12002947  1.5758721  -0.160289   -2.3353195 ] 1.0 False {}
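
Each printed row above is the four-value result of `step`: observation, reward, done flag, and info dict. The loop in that cell always runs ten steps, but a loop would more typically stop once `done` becomes true. The sketch below illustrates that pattern with `CountdownEnv`, a hypothetical stand-in class written for this example (it is not part of gym), assuming the same older four-value `step` interface shown in the output.

```python
import random


class CountdownEnv:
    """Hypothetical toy environment mimicking the older gym reset/step interface."""

    def __init__(self, max_steps=5):
        self.max_steps = max_steps
        self.steps_taken = 0

    def reset(self):
        # Start a new episode and return the initial observation.
        self.steps_taken = 0
        return self.steps_taken

    def step(self, action):
        # Advance one step; every step earns a reward of 1.0, as in the output above.
        self.steps_taken += 1
        observation = self.steps_taken
        reward = 1.0
        done = self.steps_taken >= self.max_steps
        info = {}
        return observation, reward, done, info


env = CountdownEnv(max_steps=5)
observation = env.reset()
done = False
total_reward = 0.0
while not done:
    action = random.choice([0, 1])  # random action; this toy env ignores it
    observation, reward, done, info = env.step(action)
    total_reward += reward
print(total_reward)
```

Stopping on `done` matters because stepping an environment after an episode has ended generally requires a `reset()` first.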


In [4]:
from stable_baselines3 import A2C

A2C_model = A2C('MlpPolicy', environment, verbose=1)
A2C_model.learn(total_timesteps=1000)



Using cpu device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 19.1     |
|    ep_rew_mean        | 19.1     |
| time/                 |          |
|    fps                | 1018     |
|    iterations         | 100      |
|    time_elapsed       | 0        |
|    total_timesteps    | 500      |
| train/                |          |
|    entropy_loss       | -0.693   |
|    explained_variance | -0.153   |
|    learning_rate      | 0.0007   |
|    n_updates          | 99       |
|    policy_loss        | 1.92     |
|    value_loss         | 9.48     |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 23.3     |
|    ep_rew_mean        | 23.3     |
| time/                 |          |
|    fps                | 1042     |
|    iterations         | 200      |
|    time_elapsed 

<stable_baselines3.a2c.a2c.A2C at 0x1799f0bf610>

In [6]:
for episode in range(10):
    obs = environment.reset()
    done = False
    while not done:
        # pass observation to model to get predicted action
        action, _states = A2C_model.predict(obs)

        # pass action to environment and get info back
        obs, reward, done, info = environment.step(action)

        # show the environment on the screen
        environment.render()
environment.close()
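
The loop above renders the episodes but does not record how well the model performed. One way to quantify that is to sum the rewards of each episode and average them. The sketch below assumes the same interface as the cell above (a `predict(obs)` that returns an action and a state, and a four-value `step`); `mean_episode_reward`, `FixedLengthEnv`, and `ConstantPolicy` are hypothetical names introduced here so the example runs without gym installed. Stable Baselines 3 also provides `evaluate_policy` in `stable_baselines3.common.evaluation` for a similar purpose.

```python
def mean_episode_reward(model, env, episodes=10):
    """Average total reward per episode for a model/env pair with the interface used above."""
    totals = []
    for _ in range(episodes):
        obs = env.reset()
        done = False
        total = 0.0
        while not done:
            action, _state = model.predict(obs)
            obs, reward, done, info = env.step(action)
            total += reward
        totals.append(total)
    return sum(totals) / len(totals)


class FixedLengthEnv:
    """Hypothetical stand-in env: each episode lasts `length` steps with reward 1.0 per step."""

    def __init__(self, length):
        self.length = length
        self.t = 0

    def reset(self):
        self.t = 0
        return 0

    def step(self, action):
        self.t += 1
        return self.t, 1.0, self.t >= self.length, {}


class ConstantPolicy:
    """Hypothetical stand-in model that always picks action 0."""

    def predict(self, obs):
        return 0, None


score = mean_episode_reward(ConstantPolicy(), FixedLengthEnv(length=20), episodes=3)
print(score)
```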

## [Model saving and loading](https://www.youtube.com/watch?v=dLP-2Y6yu70)

In [38]:
from stable_baselines3 import A2C, PPO
import os

model_directories = ['models/A2C', 'models/PPO']
log_directory = 'logs'

for directory in model_directories + [log_directory]:
    os.makedirs(directory, exist_ok=True)  # no error if the directory already exists

# verbose=0 so we can train more without flooding the cell output
A2C_model = A2C('MlpPolicy', environment, verbose=0, tensorboard_log=log_directory)
PPO_model = PPO('MlpPolicy', environment, verbose=0, tensorboard_log=log_directory)

In [39]:
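TOTAL_TIMESTEPS = 10000

def train_and_save_model(model, model_directory):
    # 'models/PPO' -> 'PPO'; used as the TensorBoard run name
    model_name = model_directory.split('/')[1]
    # Train in 10 rounds and save a checkpoint after each round
    for index in range(1, 11):
        # reset_num_timesteps=False keeps the step counter running across rounds
        model.learn(total_timesteps=TOTAL_TIMESTEPS, reset_num_timesteps=False, tb_log_name=model_name)
        model.save(f'{model_directory}/{TOTAL_TIMESTEPS*index}')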
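
To see which checkpoint files the loop above is expected to produce, the naming scheme can be reproduced without any training. This is a small sketch; `checkpoint_paths` is a hypothetical helper, not part of Stable Baselines 3. The paths are listed without a file extension, as passed to `save`; the loading cell later reads `90000.zip`, which suggests the `.zip` suffix is added on save.

```python
def checkpoint_paths(model_directory, total_timesteps=10000, rounds=10):
    """List the save paths produced by a loop like the one above (without a file extension)."""
    return [f"{model_directory}/{total_timesteps * index}" for index in range(1, rounds + 1)]


paths = checkpoint_paths("models/PPO")
print(paths[0])   # models/PPO/10000
print(paths[-1])  # models/PPO/100000
```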

In [40]:
train_and_save_model(A2C_model, model_directories[0])
train_and_save_model(PPO_model, model_directories[1])

Run `pipenv shell`, then `tensorboard --logdir=agent/logs` in a terminal to serve the TensorBoard dashboard.

In [43]:
ppo_model_path = f'{model_directories[1]}/90000.zip'  # index 1 is 'models/PPO'

loaded_PPO_model = PPO.load(ppo_model_path, env=environment)

for episode in range(10):
    observation = environment.reset()
    done = False
    while not done:
        environment.render()
        action, _ = loaded_PPO_model.predict(observation)
        observation, reward, done, info = environment.step(action)
environment.close()
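
Instead of hard-coding a checkpoint name, the most recent checkpoint could be picked by its numeric file name. The following is a sketch under the assumption that checkpoints are named `<timesteps>.zip` as in the cells above; `latest_checkpoint` is a hypothetical helper, and the demonstration uses empty placeholder files in a temporary directory rather than real models.

```python
import tempfile
from pathlib import Path


def latest_checkpoint(model_directory):
    """Return the .zip checkpoint with the highest numeric name, or None if there is none."""
    candidates = [p for p in Path(model_directory).glob("*.zip") if p.stem.isdigit()]
    # Compare as integers: as plain strings, "90000" would sort after "100000".
    return max(candidates, key=lambda p: int(p.stem), default=None)


# Demonstration on a temporary directory with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for steps in (10000, 90000, 100000):
        (Path(tmp) / f"{steps}.zip").touch()
    newest_name = latest_checkpoint(tmp).name
print(newest_name)  # 100000.zip
```

The returned path could then be passed to `PPO.load` in place of the fixed `90000.zip`.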