# Stable Baselines 3

## Swig installation

Swig is only needed to run the Stable Baselines 3 base environments; it is not needed for custom game environments
- [Install swig on Windows before installing Stable Baselines 3](https://gist.github.com/felix-tjernberg/8bc7313ad1a0de136789f11d7ae7acd3)
- Install swig on macOS before installing Stable Baselines 3: `brew install swig`
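
To confirm that swig is actually reachable before installing Stable Baselines 3, a quick check from Python can help. This is a minimal sketch that only looks for a `swig` executable on `PATH`; the helper name `swig_is_installed` is made up for illustration:

```python
import shutil


def swig_is_installed() -> bool:
    """Return True when a `swig` executable can be found on PATH."""
    return shutil.which("swig") is not None


print("swig found" if swig_is_installed() else "swig not found")
```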

## Tutorial references
- [mustafa Video tutorial](https://www.youtube.com/watch?v=5dxJEXCjruE)
- [sentdex Video tutorial](https://www.youtube.com/playlist?list=PLQVvvaa0QuDf0O2DWwLZBfJeYY-JOeZB1)
- [sentdex Written tutorial](https://pythonprogramming.net/introduction-reinforcement-learning-stable-baselines-3-tutorial/)

## Basic environment setup

In this part the goal was to learn the basic environment setup, so the following cells are just code from the [first part of the sentdex video tutorial](https://www.youtube.com/watch?v=XbWhJdQgi7E&list=PLQVvvaa0QuDf0O2DWwLZBfJeYY-JOeZB1)

In [1]:
import gym

environment = gym.make("CartPole-v1") # Using CartPole-v1 as I could not get LunarLander to work on Windows :/
environment.reset()
print(
    f"""
    sample action: {environment.action_space.sample()}
    observation space: {environment.observation_space.shape}
    sample observation: {environment.observation_space.sample()}
    """
)
environment.close()


    sample action: 1
    observation space: (4,)
    sample observation: [3.9042451e+00 1.2297147e+38 3.3339074e-01 3.2759917e+36]
    


In [7]:
environment = gym.make("CartPole-v1")
environment.reset()
for step in range(100):
    environment.render()
    environment.step(environment.action_space.sample())
environment.close()



In [3]:
environment.reset()
for step in range(10):
	environment.render()
	obs, reward, done, info = environment.step(environment.action_space.sample())
	print(obs, reward, done, info)

environment.close()

[-0.02662069  0.20854338  0.04225498 -0.23203783] 1.0 False {}
[-0.02244982  0.40303686  0.03761422 -0.5110984 ] 1.0 False {}
[-0.01438908  0.59760934  0.02739225 -0.7916947 ] 1.0 False {}
[-0.0024369   0.7923447   0.01155836 -1.0756358 ] 1.0 False {}
[ 0.01341     0.987312   -0.00995436 -1.3646692 ] 1.0 False {}
[ 0.03315624  0.79231614 -0.03724774 -1.0751164 ] 1.0 False {}
[ 0.04900256  0.98791    -0.05875007 -1.379252  ] 1.0 False {}
[ 0.06876076  1.1837142  -0.08633511 -1.6897141 ] 1.0 False {}
[ 0.09243505  1.3797213  -0.12012939 -2.0079808 ] 1.0 False {}
[ 0.12002947  1.5758721  -0.160289   -2.3353195 ] 1.0 False {}
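
With random actions a CartPole episode usually ends after a few dozen steps, so a longer loop needs to call `reset()` whenever `done` becomes `True`. The following is a minimal sketch of that pattern, assuming the 4-tuple `step` return used in this notebook. `CountdownEnv` and its `sample_action` method are hypothetical stand-ins so the snippet runs without gym:

```python
import random


class CountdownEnv:
    """Hypothetical stand-in for a gym environment (4-tuple step API)."""

    def __init__(self, episode_length=3):
        self.episode_length = episode_length
        self.steps_left = episode_length

    def reset(self):
        self.steps_left = self.episode_length
        return [float(self.steps_left)]

    def step(self, action):
        # Every episode ends after a fixed number of steps.
        self.steps_left -= 1
        done = self.steps_left == 0
        return [float(self.steps_left)], 1.0, done, {}

    def sample_action(self):
        return random.choice([0, 1])


def random_rollout(env, total_steps):
    """Take random actions, resetting whenever an episode ends.

    Returns the number of completed episodes.
    """
    completed_episodes = 0
    env.reset()
    for _ in range(total_steps):
        _, _, done, _ = env.step(env.sample_action())
        if done:
            completed_episodes += 1
            env.reset()
    return completed_episodes


episodes = random_rollout(CountdownEnv(episode_length=3), total_steps=10)
print(episodes)  # 3
```

With a real gym environment, `env.sample_action()` would be replaced by `env.action_space.sample()`.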


In [4]:
from stable_baselines3 import A2C

A2C_model = A2C('MlpPolicy', environment, verbose=1)
A2C_model.learn(total_timesteps=1000)



Using cpu device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 19.1     |
|    ep_rew_mean        | 19.1     |
| time/                 |          |
|    fps                | 1018     |
|    iterations         | 100      |
|    time_elapsed       | 0        |
|    total_timesteps    | 500      |
| train/                |          |
|    entropy_loss       | -0.693   |
|    explained_variance | -0.153   |
|    learning_rate      | 0.0007   |
|    n_updates          | 99       |
|    policy_loss        | 1.92     |
|    value_loss         | 9.48     |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 23.3     |
|    ep_rew_mean        | 23.3     |
| time/                 |          |
|    fps                | 1042     |
|    iterations         | 200      |
|    time_elapsed 

<stable_baselines3.a2c.a2c.A2C at 0x1799f0bf610>

In [6]:
for episode in range(10):
	obs = environment.reset()
	done = False
	while not done:
		# pass observation to model to get predicted action
		action, _states = A2C_model.predict(obs)

		# pass action to environment and get info back
		obs, rewards, done, info = environment.step(action)

		# show the environment on the screen
		environment.render()
environment.close()

## Model saving and loading

[In part 2, sentdex shows how to save and load models and how to run a local TensorBoard website](https://www.youtube.com/watch?v=dLP-2Y6yu70&list=PLQVvvaa0QuDf0O2DWwLZBfJeYY-JOeZB1)

### Training and saving a model

In [38]:
from stable_baselines3 import A2C, PPO
import os

model_directories = ['models/A2C', 'models/PPO']
log_directory = 'logs'

# Create the model and log directories if they do not exist yet
for directory in model_directories + [log_directory]:
    os.makedirs(directory, exist_ok=True)

# Switching to verbose 0 so we can train more without flooding the cell output
A2C_model = A2C('MlpPolicy', environment, verbose=0, tensorboard_log=log_directory)
PPO_model = PPO('MlpPolicy', environment, verbose=0, tensorboard_log=log_directory)

In [39]:
TOTAL_TIMESTEPS = 10000
def train_and_save_model(model, model_directory):
    model_name = model_directory.split('/')[1]
    for index in range(1,11):
        model.learn(total_timesteps=TOTAL_TIMESTEPS, reset_num_timesteps=False, tb_log_name=model_name)
        model.save(f'{model_directory}/{TOTAL_TIMESTEPS*index}')

In [40]:
train_and_save_model(A2C_model, model_directories[0])
train_and_save_model(PPO_model, model_directories[1])
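
Because `reset_num_timesteps=False` keeps the timestep counter running across `learn` calls, each round adds another 10000 steps and the checkpoint is named after the cumulative total. As a rough sketch of the file names this loop produces (Stable Baselines 3 appends `.zip` when saving), here is a hypothetical helper `checkpoint_paths` that does not touch any model:

```python
from pathlib import Path


def checkpoint_paths(model_directory, timesteps_per_round, rounds):
    """List the save paths produced by `rounds` rounds of training."""
    return [
        Path(model_directory) / str(timesteps_per_round * index)
        for index in range(1, rounds + 1)
    ]


paths = checkpoint_paths("models/PPO", timesteps_per_round=10000, rounds=10)
print(paths[0].as_posix(), paths[-1].as_posix())  # models/PPO/10000 models/PPO/100000
```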

### Launching local tensorboard website

Run `pipenv shell` and then `tensorboard --logdir=logs` in a terminal to serve a local TensorBoard website

### Loading and running a model

In [43]:
ppo_model_path = f'{model_directories[1]}/90000.zip'  # models/PPO

loaded_PPO_model = PPO.load(ppo_model_path, environment)

for episode in range(10):
	observation = environment.reset()
	done = False
	while not done:
		environment.render()
		action, _ = loaded_PPO_model.predict(observation)
		observation, reward, done, info = environment.step(action)
environment.close()
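
Hard-coding a checkpoint name works, but the newest checkpoint can also be picked automatically. This is a minimal sketch, assuming checkpoints are named after their timestep count as in the training cell; `latest_checkpoint` is a hypothetical helper. The names are compared as integers because a plain string sort would rank `90000` above `100000`:

```python
import tempfile
from pathlib import Path


def latest_checkpoint(model_directory):
    """Return the .zip checkpoint with the largest numeric name, or None."""
    checkpoints = [
        path
        for path in Path(model_directory).glob("*.zip")
        if path.stem.isdigit()
    ]
    return max(checkpoints, key=lambda path: int(path.stem), default=None)


# Demonstration on a throwaway directory with empty placeholder files.
with tempfile.TemporaryDirectory() as directory:
    for timesteps in (10000, 90000, 100000):
        (Path(directory) / f"{timesteps}.zip").touch()
    newest = latest_checkpoint(directory).name

print(newest)  # 100000.zip
```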

## Making my snake game agent compatible

I [continued with part 3 from sentdex](https://www.youtube.com/watch?v=uKnjGn8fF70&list=PLQVvvaa0QuDf0O2DWwLZBfJeYY-JOeZB1) and also watched [this video from mustafa](https://www.youtube.com/watch?v=5dxJEXCjruE) to see what is required when setting up the training environment

What I noted from those videos was that they coded the snake game into the training environment class, which in my opinion makes the environment code very hard to read and edit

My goal then became to rewrite the snake game into a class that is agent compatible, i.e. it can be run from a training environment

First I tried adding a variable that says whether the game is agent driven while keeping the game playable by a human, which you can see in [this commit](https://github.com/felix-tjernberg/snake-reinforcement-learning-model/commit/16e9a5915c538b5bcdf9743f758c351f47f97f6f). My thought process was that it might be cool if you could start the game in either agent mode or human mode

In the [next commit](https://github.com/felix-tjernberg/snake-reinforcement-learning-model/commit/4526e9593bccd105e882c5b1e743be105b5880a0) I realized that it was simply better to have an agent-playable version and a human-playable version of the game. To show that an agent could play, I rewrote `if __name__ == "__main__":` so that it initializes the game and then tests whether agent input works

This accomplishes my goal of making both the snake game and training environment more readable and editable

## Version 1 of the training environment

The goal was to get the environment up and running in the most basic way, like mustafa and sentdex did in their videos. I combined mustafa's first observation with sentdex's observation code: `[head.x, head.y, food_delta_x, food_delta_y, body_x, body_y]`

I also added body_x and body_y as observations. These are the sums of all x and all y coordinates of the body, which in my mind create a vector describing body size, shape and position. The thought is that hopefully the model only needs information about the body and not the history of previous moves

I also chose to use food_delta from sentdex's video instead of the exact position from mustafa's, to help the model figure out whether the food is up, down, left or right relative to the head, which hopefully means the model does not have to learn that relation itself

The check environment scripts that sentdex showed were very useful, especially check_environment_visually (as I renamed his doublecheckenv script). When I ran this script I saw that none of the inputs worked as I intended

I solved the input issue by changing the Direction class in the snake game into a dictionary stored as a class attribute. The lesson I learned is that always checking the gym environment visually is very important, which sentdex also said in his video :)
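
To illustrate the idea of a direction lookup stored as a class attribute, here is a minimal sketch in which the environment can pass an integer action straight to the game. The class name `AgentSnakeGame`, the action numbering and the coordinate convention are assumptions for illustration, not the actual code from the repository:

```python
class AgentSnakeGame:
    """Hypothetical, stripped-down game that only tracks the head position."""

    # Maps an integer action to an (x, y) step. The numbering and the
    # convention that y grows downwards are assumptions.
    DIRECTIONS = {
        0: (0, -1),  # up
        1: (0, 1),   # down
        2: (-1, 0),  # left
        3: (1, 0),   # right
    }

    def __init__(self, start=(5, 5)):
        self.head = start

    def move(self, action):
        """Move the head one cell in the direction selected by `action`."""
        delta_x, delta_y = self.DIRECTIONS[action]
        self.head = (self.head[0] + delta_x, self.head[1] + delta_y)
        return self.head


game = AgentSnakeGame()
game.move(3)  # right
game.move(0)  # up
print(game.head)  # (6, 4)
```

A plain dictionary keyed by the action integer means an unknown action fails loudly with a `KeyError` instead of being silently ignored.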

### Observation and reward code

In [None]:
from numpy import array, float32


def observe_game_state(self):
    snake_head = self.snake_game.head
    food = self.snake_game.food
    food_delta_x = snake_head.x - food.x
    food_delta_y = snake_head.y - food.y

    all_body_x = [coordinate.x for coordinate in self.snake_game.body]
    all_body_y = [coordinate.y for coordinate in self.snake_game.body]

    return array(
        [
            snake_head.x,
            snake_head.y,
            food_delta_x,
            food_delta_y,
            sum(all_body_x),
            sum(all_body_y),
        ],
        dtype=float32,
    )
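
To make the six observation values concrete, here is a small worked example on a hand-made game state. It is only a sketch: `Point` is a hypothetical stand-in for the game's coordinate type, the body is assumed to include the head, and a plain list of floats is returned instead of a numpy array so the snippet has no dependencies:

```python
from collections import namedtuple

# Hypothetical stand-in for the game's coordinate type.
Point = namedtuple("Point", ["x", "y"])


def observe(head, food, body):
    """Build the six-value observation as a plain list of floats."""
    return [
        float(head.x),
        float(head.y),
        float(head.x - food.x),               # food_delta_x
        float(head.y - food.y),               # food_delta_y
        float(sum(part.x for part in body)),  # body_x
        float(sum(part.y for part in body)),  # body_y
    ]


head = Point(5, 5)
food = Point(8, 3)
body = [Point(5, 5), Point(4, 5), Point(3, 5)]
observation = observe(head, food, body)
print(observation)  # [5.0, 5.0, -3.0, 2.0, 12.0, 15.0]
```

Here the negative `food_delta_x` means the food has a larger x than the head, and the positive `food_delta_y` means it has a smaller y.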

In [None]:
# Reward the agent for eating food and punish for game over
if self.snake_game.game_over:
    self.reward = self.snake_game.score - 10
else:
    self.reward = self.snake_game.score
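
The same rule written as a standalone function makes it easy to see what the agent receives at each step. This is a minimal sketch with a made-up episode; `compute_reward` is a hypothetical name and `score` is assumed to be the running total of food eaten:

```python
def compute_reward(score, game_over):
    """Mirror the rule above: the current score, minus 10 on game over."""
    return score - 10 if game_over else score


# Hypothetical episode: (score, game_over) for each step.
steps = [(0, False), (1, False), (1, False), (1, True)]
rewards = [compute_reward(score, game_over) for score, game_over in steps]
print(rewards)  # [0, 1, 1, -9]
```

Under that assumption the agent keeps receiving the score on every step it survives after eating, not only on the step where the food is eaten.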

### Result

The results of the first version were not very good. It looked decent on [the graph](https://github.com/felix-tjernberg/snake-reinforcement-learning-model/blob/main/models/saved_models/PPO_v1_1653236185_960000.png), but when I ran the model it was just following some [spinning patterns](https://youtu.be/PGNiXGX2nLU?t=60), like mustafa warned it would, as it really does not want to die

But the goal was to get training and loading of models to work

I'm saving the best model and logs for each environment change for future reference