# Stable Baselines 3

## Swig installation

Swig is only needed to run the Stable Baselines 3 base environments; it is not needed for custom game environments
- [Install swig before installing baselines 3 on windows](https://gist.github.com/felix-tjernberg/8bc7313ad1a0de136789f11d7ae7acd3) 
- Install swig on mac before installing baselines 3 -> `brew install swig`

## Tutorial references
- [mustafa Video tutorial](https://www.youtube.com/watch?v=5dxJEXCjruE)
- [sentdex Video tutorial](https://www.youtube.com/playlist?list=PLQVvvaa0QuDf0O2DWwLZBfJeYY-JOeZB1)
- [sentdex Written tutorial](https://pythonprogramming.net/introduction-reinforcement-learning-stable-baselines-3-tutorial/)

## Basic environment setup

In this part the goal was to learn the basic environment setup, so the following cells are just code from the [first part of the sentdex video tutorial](https://www.youtube.com/watch?v=XbWhJdQgi7E&list=PLQVvvaa0QuDf0O2DWwLZBfJeYY-JOeZB1)

In [None]:
import gym

environment = gym.make("CartPole-v1") # Using CartPole-v1 as I could not get LunarLander to work on Windows :/
environment.reset()
print(
    f"""
    sample action: {environment.action_space.sample()}
    observation space: {environment.observation_space.shape}
    sample observation: {environment.observation_space.sample()}
    """
)
environment.close()

In [None]:
environment = gym.make("CartPole-v1")
environment.reset()
for step in range(100):
    environment.render()
    environment.step(environment.action_space.sample())
environment.close()

In [None]:
environment.reset()
for step in range(10):
    environment.render()
    obs, reward, done, info = environment.step(environment.action_space.sample())
    print(obs, reward, done, info)

environment.close()

In [None]:
from stable_baselines3 import A2C

A2C_model = A2C('MlpPolicy', environment, verbose=1)
A2C_model.learn(total_timesteps=1000)

In [None]:
for episode in range(10):
    obs = environment.reset()
    done = False
    while not done:
        # pass observation to model to get predicted action
        action, _states = A2C_model.predict(obs)

        # pass action to environment and get info back
        obs, rewards, done, info = environment.step(action)

        # show the environment on the screen
        environment.render()
environment.close()

## Model saving and loading

[In part 2 sentdex taught how to save and load models and run a local tensorboard website](https://www.youtube.com/watch?v=dLP-2Y6yu70&list=PLQVvvaa0QuDf0O2DWwLZBfJeYY-JOeZB1)

### Training and saving a model

In [None]:
from stable_baselines3 import PPO
import os

model_directories = ['models/A2C', 'models/PPO']
log_directory = 'logs'

# Create the model and log directories if they do not exist yet
for directory in [*model_directories, log_directory]:
    os.makedirs(directory, exist_ok=True)

A2C_model = A2C('MlpPolicy', environment, verbose=0, tensorboard_log=log_directory) # Switching to verbose 0 so we can train more without flooding the cell output
PPO_model = PPO('MlpPolicy', environment, verbose=0, tensorboard_log=log_directory)

In [None]:
TOTAL_TIMESTEPS = 10000
def train_and_save_model(model, model_directory):
    model_name = model_directory.split('/')[1]
    for index in range(1,11):
        model.learn(total_timesteps=TOTAL_TIMESTEPS, reset_num_timesteps=False, tb_log_name=model_name)
        model.save(f'{model_directory}/{TOTAL_TIMESTEPS*index}')

In [None]:
train_and_save_model(A2C_model, model_directories[0])
train_and_save_model(PPO_model, model_directories[1])

### Launching local tensorboard website

Run `pipenv shell` then `tensorboard --logdir=logs` in terminal to serve a tensorboard website

### Loading and running a model

In [None]:
ppo_model_path = f'{model_directories[1]}/90000.zip'

loaded_PPO_model = PPO.load(ppo_model_path, environment)

for episode in range(10):
    observation = environment.reset()
    done = False
    while not done:
        environment.render()
        action, _ = loaded_PPO_model.predict(observation)
        observation, reward, done, info = environment.step(action)
environment.close()

## Making my snake game agent compatible

I [continued to watch part 3 from sentdex](https://www.youtube.com/watch?v=uKnjGn8fF70&list=PLQVvvaa0QuDf0O2DWwLZBfJeYY-JOeZB1) and I also found [this video from mustafa](https://www.youtube.com/watch?v=5dxJEXCjruE) to see what was required when setting up the training environment

What I noted from those videos was that they coded the snake game into the training environment class, which in my opinion makes the environment code very hard to read and edit

My goal then became to rewrite the snake game into a class that is agent compatible, i.e. it can be run from a training environment

First I tried to add a variable that says if the game is agent driven and keep the human playability of the game which you can see in [this commit](https://github.com/felix-tjernberg/snake-reinforcement-learning-model/commit/16e9a5915c538b5bcdf9743f758c351f47f97f6f). My thought process was that it might be cool if you could start the game either in agent mode or in human mode

In the [next commit](https://github.com/felix-tjernberg/snake-reinforcement-learning-model/commit/4526e9593bccd105e882c5b1e743be105b5880a0) I realized that it was just better to have an agent playable version and a human playable version of the game. To show that an agent could play, I rewrote `if __name__ == "__main__":` so that it initializes the game and then tests if agent input works

This accomplishes my goal of making both the snake game and training environment more readable and editable

## Version 1 of the training environment

The goal was to get the environment up and running in the most basic way, like mustafa and sentdex did in their videos. I combined mustafa's first observation with sentdex's observation code: `[head.x, head.y, food_delta_x, food_delta_y, body_x, body_y]`

I also added body_x and body_y as observations. These are the sums of all body x and y coordinates, which in my mind create a vector for body size, shape and position. The thought is that the model hopefully only needs information about the body and not the history of previous moves

I also chose to use food_delta from sentdex's video instead of the exact position from mustafa, to help the model figure out if the food is up, down, left or right in relation to the head, which hopefully reduces the need for the model to figure that part out

The check environment scripts that sentdex showed were very useful, especially check_environment_visually (as I renamed his doublecheckenv script). When I ran this script I saw that none of the inputs worked as I intended

I solved the input issue by changing the Direction class in the snake game into a dictionary stored as a class attribute. The lesson I learned is that always checking the gym environment visually is very important, which sentdex also said in his video :)

### Observation and reward code

In [None]:
from numpy import array, float32  # used by the return statement below


def observe_game_state(self):
    snake_head = self.snake_game.head
    food = self.snake_game.food
    food_delta_x = snake_head.x - food.x
    food_delta_y = snake_head.y - food.y

    all_body_x = [coordinate.x for coordinate in self.snake_game.body]
    all_body_y = [coordinate.y for coordinate in self.snake_game.body]

    return array(
        [
            snake_head.x,
            snake_head.y,
            food_delta_x,
            food_delta_y,
            sum(all_body_x),
            sum(all_body_y),
        ],
        dtype=float32,
    )

In [None]:
# Reward the agent for eating food and punish for game over
if self.snake_game.game_over:
    self.reward = self.snake_game.score - 10
else:
    self.reward = self.snake_game.score

### Result

The results of the first version were not very good. It looked decent on [the graph](https://github.com/felix-tjernberg/snake-reinforcement-learning-model/blob/main/models/saved_models/PPO_v1_1653236185_960000.png), but when I ran the model it was just following some [spinning patterns](https://youtu.be/PGNiXGX2nLU?t=60), like mustafa warned it would, as it really does not want to die

But the goal was to get training and loading of models to work

I'm saving the best model and logs for each environment change for future reference

## Adding score to the logger object

As I'm going to modify how reward is awarded when trying to solve snake, it is not very useful to only compare ep_rew_mean, as rewards will be awarded differently between models. For example, some models might get more than 1 reward when eating food. Something that is better to compare is the actual game score, as that is what really matters anyway

This required me to create a [custom callback](https://stable-baselines3.readthedocs.io/en/master/guide/callbacks.html#custom-callback) that adds game_score to the logger object with the .record method

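To get the score I started adding snake_game.score to the infos object, which can then be recorded to the logger using self.locals in the `_on_step` method. (The infos object is a list because Stable Baselines 3 wraps the environment in a vectorized environment, so there is one info dict per environment)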

More interesting things to note:
1. ep_rew_mean is calculated over the last 100 episodes, for PPO at least
2. I might also want to investigate [adding an is_success](https://stable-baselines3.readthedocs.io/en/master/common/logger.html#eval) entry to the infos object to see if the agent reached max score. I might add this when I get closer to actually beating the game regularly
3. There is a lot more you can do with [custom callbacks](https://stable-baselines3.readthedocs.io/en/master/guide/callbacks.html#custom-callback): you have access to quite a lot of info through the self parameter, as described in the comments of the `__init__` function. For example, [this person saves the model when the agent reaches a new high score](https://github.com/DLR-RM/stable-baselines3/issues/506#issue-938981510) using the global property
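
The score logging described in this section could be sketched roughly like this. `record_game_score` and `StubLogger` are hypothetical names, and the `"score"` key is an assumption about how the environment fills its info dict. The helper only needs a logger with a `record(key, value)` method, which is what the Stable Baselines 3 logger offers.

```python
from statistics import mean


def record_game_score(logger, infos, key="game_score"):
    """Record the mean game score found in a list of info dicts.

    `infos` holds one dict per environment; dicts without a "score"
    entry are skipped. Returns the recorded value, or None if no
    score was found.
    """
    scores = [info["score"] for info in infos if "score" in info]
    if not scores:
        return None
    value = mean(scores)
    logger.record(key, value)
    return value


class StubLogger:
    """Stand-in logger for trying the helper outside of training."""

    def __init__(self):
        self.records = {}

    def record(self, key, value):
        self.records[key] = value


stub_logger = StubLogger()
record_game_score(stub_logger, [{"score": 4}, {"score": 6}, {}])
print(stub_logger.records)
```

Inside a custom callback this would be called from `_on_step`, for example as `record_game_score(self.logger, self.locals["infos"])`.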

## Add pygame.event.pump()

Something I did not notice before was that the game window would appear to freeze when running and training the model

I found out the issue is in the agent version of the game because it does not run pygame.event.get() as no key presses are expected

This means that pygame is not making any calls to the event queue. Another way to make calls to the event queue is to run [pygame.event.pump()](https://www.pygame.org/docs/ref/event.html#pygame.event.pump)

## Version 2 of the training environment

Now that I have set up the environment it is time to try different observations and rewards. I'm first going to try Mustafa's Manhattan distance approach and see what kind of results I can get

I will modify Mustafa's approach by only rewarding the agent once for each Manhattan distance closer to the food, and not punishing it for getting further away like Mustafa did

My thinking is that punishing the agent for getting further away from the food might discourage the agent from taking longer paths when the body gets longer, which makes it more likely to trap itself

One possible issue with this method is that the agent will figure out that it can go as far away from the food as possible to be able to collect all the Manhattan distance rewards. To prevent this, I will only reward the agent if it gets closer than the first Manhattan distance after food spawn. I'm also going to track the Manhattan distances so the agent can only be rewarded once per distance
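
A minimal sketch of this reward rule, assuming a reward of 1.0 per newly reached distance. `DistanceRewardTracker` and its method names are hypothetical and not taken from the environment code.

```python
class DistanceRewardTracker:
    """Rewards each new, closer Manhattan distance to the food only once."""

    def __init__(self, reward_per_distance=1.0):
        self.reward_per_distance = reward_per_distance  # assumed value
        self.spawn_distance = None
        self.rewarded_distances = set()

    def food_spawned(self, distance):
        """Call when a new food appears; `distance` is head-to-food at spawn."""
        self.spawn_distance = distance
        self.rewarded_distances = set()

    def reward_for(self, distance):
        """Return the reward for the current head-to-food distance."""
        if self.spawn_distance is None:
            raise RuntimeError("food_spawned() must be called first")
        is_closer_than_spawn = distance < self.spawn_distance
        if is_closer_than_spawn and distance not in self.rewarded_distances:
            self.rewarded_distances.add(distance)
            return self.reward_per_distance
        return 0.0


tracker = DistanceRewardTracker()
tracker.food_spawned(120)
# Moving closer, back out past the spawn distance, then closer again
rewards = [tracker.reward_for(d) for d in (110, 120, 130, 120, 110, 100)]
print(rewards)
```

Because distances at or beyond the spawn distance never pay out, moving away from the food first does not unlock any extra rewards.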

I will also remove the minus score from game over and only give a minus reward if the agent gets game over from eating itself (Mustafa's assignment)

I'm also throwing away the sum(all_body_x/y) observations, as I realized they do not give the intended body information. For example, a snake body occupying x positions \[1,2,3] gives the same sum as a body occupying only x position \[6] (1 + 2 + 3 == 6)

I'm also adding a maximum number of allowed steps to stop the [spinning patterns](https://youtu.be/PGNiXGX2nLU?t=60). The limit will be 42x42, because you only need to visit each coordinate once if you take the optimal path between foods when the snake body starts to fill up the whole screen. I'm also giving the number of steps taken to the agent as an observation so it can take that into account
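
A small sketch of such a step limit, assuming a 42 by 42 grid. `StepLimiter` is a hypothetical helper. Whether the counter resets per episode or per food is left to the caller through `reset()`.

```python
GRID_WIDTH = 42   # assumed number of cells per row
GRID_HEIGHT = 42  # assumed number of cells per column
MAX_STEPS = GRID_WIDTH * GRID_HEIGHT


class StepLimiter:
    """Counts steps and reports when the step limit has been reached."""

    def __init__(self, max_steps=MAX_STEPS):
        self.max_steps = max_steps
        self.steps_taken = 0  # can also be handed to the agent as an observation

    def reset(self):
        self.steps_taken = 0

    def step(self):
        """Register one step; return True once the limit is reached."""
        self.steps_taken += 1
        return self.steps_taken >= self.max_steps


limiter = StepLimiter(max_steps=3)
print([limiter.step() for _ in range(3)])
print(MAX_STEPS)
```

In an environment's `step` method the returned flag could be combined with the game over flag to end the episode.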

A friend also alerted me that the game will crash if you reach max score. When the body reaches its maximum size, the `_place_food` method ends up in an infinite loop trying to spawn a new food
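
One common way to avoid this kind of infinite loop is to pick the food position from the list of free cells and handle the case where that list is empty. The sketch below is an illustration under that assumption, not the fix used in the snake game. `place_food` and its parameters are hypothetical.

```python
import random


def place_food(body, grid_width, grid_height, rng=random):
    """Pick a random free cell for the food, or None if the board is full.

    `body` is an iterable of (x, y) cells occupied by the snake,
    including the head.
    """
    occupied = set(body)
    free_cells = [
        (x, y)
        for x in range(grid_width)
        for y in range(grid_height)
        if (x, y) not in occupied
    ]
    if not free_cells:
        return None  # no room left: the caller can treat this as a win
    return rng.choice(free_cells)


# 2x2 board with three cells taken: only (1, 1) is left for the food
print(place_food([(0, 0), (0, 1), (1, 0)], 2, 2))
# Completely full board: returns None instead of looping forever
print(place_food([(0, 0), (0, 1), (1, 0), (1, 1)], 2, 2))
```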

In [1]:
from snake_game.snake_game_agent_version import Coordinate, SnekGame
from time import sleep


def manhattan_distance(coordinate_one: Coordinate, coordinate_two: Coordinate):
    """Takes two coordinates and returns the manhattan distance between them"""
    return int(abs(coordinate_one.x - coordinate_two.x) + abs(coordinate_one.y - coordinate_two.y))


snake_game_instance = SnekGame()
snake_game_instance.display_ui = True


def test_distance(distance_function):
    def make_move(direction):
        snake_game_instance.agent_action = snake_game_instance.DIRECTION[direction]
        snake_game_instance.game_tick()
        head, food = snake_game_instance.head, snake_game_instance.food
        print(head, food)
        print(distance_function(head, food))
        sleep(1)

    for move in ["down", "left", "up", "right"]:
        make_move(move)


print("\nManhattan distance test:")
test_distance(manhattan_distance)
print(snake_game_instance.max_score)
snake_game_instance.quit_game()

pygame 2.1.2 (SDL 2.0.18, Python 3.9.9)
Hello from the pygame community. https://www.pygame.org/contribute.html

Manhattan distance test:
Coordinate(x=210, y=220) Coordinate(x=110, y=240)
120
Coordinate(x=200, y=220) Coordinate(x=110, y=240)
110
Coordinate(x=200, y=210) Coordinate(x=110, y=240)
120
Coordinate(x=210, y=210) Coordinate(x=110, y=240)
130
1761


### Results

Version 2 taught the agent to go for the food quite well. The agent has also learned some strategies to avoid its body, but they work very poorly

Some strategies I noted are making O's and trying to make lines close to foods

This is quite impressive as the only information the agent gets about the body is head position and body length

This environment might work eventually but I think it will take unreasonably long to train the model in this state 

The score did not converge after 45M steps, and averages around 6 points

Making changes incrementally to rewards and observations is definitely the way to go, as you can quite clearly see if the changes are positive or negative

Note to self: Model names should reflect the change made, distinguishing between models is hard when they are named PPO_v1_1653236185_960000: PPO_v1_linear_step_reward_1653236185_960000 is better

## Version 3 of the training environment

In this version I'm first going to make the reward between foods non-linear, even though the agent seemed to get to the food quite quickly in version two

The non-linear reward can later be flipped after the snake body has reached a larger size, which in turn will motivate the agent to take longer paths, which might help it avoid trapping itself
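
The exact shape of the non-linear reward is not spelled out here, so the following is only a sketch under assumptions: a power curve over how close the head is to the food, with a `flipped` switch for the later inverse version. `distance_reward`, `exponent` and `flipped` are hypothetical names.

```python
def distance_reward(distance, spawn_distance, exponent=2.0, flipped=False):
    """Map head-to-food distance to a reward between 0 and 1.

    closeness is 0 at the spawn distance (or further away) and 1 on
    the food. With exponent > 1 the reward grows faster the closer the
    head gets. flipped=True rewards being far away instead.
    """
    if spawn_distance <= 0:
        raise ValueError("spawn_distance must be positive")
    closeness = 1.0 - min(max(distance / spawn_distance, 0.0), 1.0)
    if flipped:
        closeness = 1.0 - closeness
    return closeness ** exponent


for d in (100, 50, 0):
    print(d, distance_reward(d, spawn_distance=100))
```

Passing `flipped=True` once the body is long would pay out for staying away from the food, which is one possible reading of flipping the reward.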

I'm also going to be adding a 5th action to see if snake takes more linear paths instead of diagonal paths, which again might help the agent when snake body gets longer


### Results

The non-linear reward seems to help the agent prioritize getting to the food more quickly. I have not yet implemented the inverse reward, as the snake body is not getting much bigger than 30 segments

After I added the 5th action the model seemed to perform worse, which is understandable as it might increase the model complexity unnecessarily

So I decided to reward the model for going in straight lines, either by rewarding the 5th action or by removing the 5th action and rewarding the agent for keeping its previous direction

Rewarding the agent for using the 5th action seemed to backfire quite badly, as the agent realized that if it just stays alive it can farm the reward for using the 5th action

The models that only had 4 actions performed better than the 5 action models again. Amazingly, none of those 3 models that I trained figured out the staying alive strategy

The 4 action models did however seem to take more linear paths to the food, so for now I will keep the reward for keeping direction and remove it if I see it becoming a problem in future versions
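
A minimal sketch of the keep-direction reward for the 4 action setup. The bonus size of 0.1 and the name `keep_direction_bonus` are assumptions made for illustration.

```python
def keep_direction_bonus(previous_direction, current_direction, bonus=0.1):
    """Return a small bonus when the agent keeps its previous direction."""
    if previous_direction is None:
        return 0.0  # first move of an episode: nothing to compare against
    return bonus if current_direction == previous_direction else 0.0


moves = ["up", "up", "left", "left", "left"]
previous = None
for move in moves:
    print(move, keep_direction_bonus(previous, move))
    previous = move
```

Keeping the bonus small compared to the food reward is one way to make it less attractive to farm, which was the problem seen with the rewarded 5th action.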

## Version 4 of the training environment

In this version I want to give the agent more information about the snake body. I'm going to start with the easiest method: giving the agent a 1761-step history of the head positions

I will be doing this by increasing the observation space to 1764: the history of steps is stored as floats that combine x and y. For example, x = 1, y = 3 becomes the float 1.3. Hopefully this is enough for the agent to figure out that left of the decimal point is x and right of it is y.

My thinking is that this halves the observation space, as I would need another 1761 values if I wanted to represent x and y individually. This is different from what people usually do
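
A rough sketch of this kind of packing, under two assumptions: positions are grid indices from 0 to 41, and y is padded to two decimal digits. With the padding, (1, 3) becomes 1.03 and (1, 30) becomes 1.30, so they stay distinct instead of both collapsing to 1.3. The function names are hypothetical.

```python
Y_DIGITS = 2  # assumed: grid indices 0..41 fit in two decimal digits


def encode_position(x: int, y: int) -> float:
    """Pack (x, y) into one float: x before the decimal point, y after."""
    return x + y / 10 ** Y_DIGITS


def decode_position(value: float) -> tuple:
    """Inverse of encode_position, rounding away floating point error."""
    x = int(value)
    y = round((value - x) * 10 ** Y_DIGITS)
    return x, y


print(encode_position(1, 3))
print(decode_position(encode_position(1, 30)))
```

The decode function is only there to check that no information is lost; the agent itself would just see the packed floats.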