# Reward Functions and State Shapes

### Intro
For me, one of the more interesting parts of this competition is how reward functions and modified state shapes can affect an agent's ability to perform well.

Some of this flies in the face of the entire purpose of reinforcement learning. In the case of rewards, I would like to present this snippet, taken from Richard Sutton and Andrew Barto's intro book on reinforcement learning:

> The reward signal is your way of communicating to the [agent] what you want it to achieve, not how you want it achieved (author emphasis).
> For example, a chess-playing agent should be rewarded only for actually winning, not for achieving subgoals such as taking its opponent's pieces or gaining control of the center.

Additionally,

>Newcomers to reinforcement learning are sometimes surprised that the rewards—which define of the goal of learning—are computed in the environment rather than in the agent...

>For example, if the goal concerns a robot’s internal energy reservoirs, then these are considered to
be part of the environment; if the goal concerns the positions of the robot’s limbs, then these too are considered to be part of the environment—that is, the agent’s boundary is drawn at the interface between the limbs and their
control systems. These things are considered internal to the robot but external to the learning agent. 

The simplest reward function would be 1 for winning and 0 for everything else.
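
As a minimal sketch of that sparse signal, assuming the halite totals of all players are available as a plain list at the end of an episode. The function name `sparse_win_reward` and its arguments are hypothetical and are not part of the `HaliteEnv` used later in this notebook.

```python
def sparse_win_reward(final_halite, player_index, done):
    """Return 1 only when the episode is over and the player strictly leads.

    final_halite: list of halite totals, one per player (assumed layout).
    player_index: position of our player in that list.
    done: True once the episode has finished.
    """
    if not done:
        return 0  # every intermediate step is unrewarded

    player = final_halite[player_index]
    opponents = [h for i, h in enumerate(final_halite) if i != player_index]
    return 1 if all(player > h for h in opponents) else 0
```

With a signal like this the agent gets no feedback until the very last step, which is what makes the learning problem discussed below so slow.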

### Motivation
I kept running into the problem (especially while training against the random agents) of my agents deciding the best thing to do would be to do nothing.

Against the random agent this makes sense. Generally speaking, the random agent will keep spawning new ships or converting them to shipyards (reducing its total score). In this scenario the player agent is content to just sit back and not spend halite on converting or spawning if it doesn't need to.

### Strategies

#### Find Better Agents to Play Against
This approach would at least result in games that my agent sometimes loses. We would still run into the issue of the agent only being rewarded at the end of a game.

This doesn't mean that the agent only learns at the end of a game, but it does mean that many, many full games need to be played out if the only reward is winning or losing.



#### Reward Shaping
See below.

### Reward Shaping

From [Andrew Y. Ng, Daishi Harada, Stuart Russell],
> These results shed light on the practice of reward shaping, a method used in reinforcement learning whereby additional training rewards are used to guide the learning agent. In particular, some well-known "bugs" in reward shaping procedures are shown to arise from non-potential-based rewards, and methods are given for constructing shaping potentials corresponding to distance-based and subgoal-based heuristics. We show that such potentials can lead to substantial reductions in learning time.

Additionally from this write-up,
https://medium.com/@BonsaiAI/deep-reinforcement-learning-models-tips-tricks-for-writing-reward-functions-a84fe525e8e0
> You want to instead shape rewards that get gradual feedback and let it know it’s getting better and getting closer. It helps it learn a lot faster
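
The "potential-based" rewards mentioned in the first quote have a specific form: the extra reward for a transition from state `s` to state `s'` is `gamma * phi(s') - phi(s)` for some potential function `phi`. Here is a rough sketch of that form. The potential `halite_potential` and its 0.5 weighting are purely illustrative assumptions on my part, not something taken from the paper or from this notebook's environment.

```python
def shaped_reward(env_reward, phi_prev, phi_next, gamma=0.99):
    """Add the potential-based shaping term gamma * phi(s') - phi(s)."""
    return env_reward + gamma * phi_next - phi_prev


def halite_potential(ship_halite, player_halite):
    # Hypothetical potential: banked halite counts fully,
    # halite still carried by the ship counts half.
    return 0.5 * ship_halite + player_halite


# With gamma = 1 the shaping terms telescope: summed over a trajectory
# they equal the last potential minus the first one.
potentials = [0.0, 1.0, 3.0, 6.0]
total_shaping = sum(
    shaped_reward(0.0, potentials[t], potentials[t + 1], gamma=1.0)
    for t in range(len(potentials) - 1)
)
```

Because the terms telescope, the agent cannot farm the shaping reward by going around in circles, which is the kind of "bug" the quote refers to.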

The focus of this notebook is on reward shaping. The goal is to see if we can nudge the agents to learn a bit faster. With better agents in hand, we can then train the final agent _against_ them, so that it actually has to react to its opponent in order to learn good moves.

The idea here is to add some sort of intermediate reward for behavior that typically leads to winning. What we don't want to do is overspecify the reward to the point where we are more or less writing a rule-based approach.

#### Reward Shaping Ideas: Ships

- (COLLECT/DEPOSIT) Total halite for that ship plus total halite for player
   - One drawback is that this ship would never try to CONVERT because it does not improve either objective
- (ATTACK) Subtract total halite mined by opponent ships
   - I have a feeling it would take a long time for the ship to learn that it should attack another ship only when it is carrying less halite than that ship
- (CONVERT) Reward a particular ship to shipyard ratio, or penalize ships with too much mined halite
   
#### Reward Shaping Ideas: Shipyards

- Reward the shipyard agents for achieving parity in ships between the player and the opponents
- Reward shipyards for ensuring that there are at least X ships per Y halite
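
To make the two shipyard ideas a little more concrete, here is a rough sketch. Everything in it is an assumption: the function names, the choice of the largest opposing fleet as the parity target, the reading of "Y halite" as the player's banked halite, and the default values of X and Y. None of it is wired into the `HaliteEnv` used below.

```python
def ship_parity_reward(player_ships, opponent_ship_counts):
    """Zero at parity with the largest opposing fleet, negative otherwise."""
    target = max(opponent_ship_counts, default=0)
    return -abs(player_ships - target)


def ships_per_halite_reward(player_ships, player_halite, ships_x=1, halite_y=1000):
    """+1 if we own at least X ships per Y banked halite, else -1."""
    # Number of complete Y-sized chunks of halite, times X ships per chunk.
    required_ships = ships_x * (player_halite // halite_y)
    return 1.0 if player_ships >= required_ships else -1.0
```

The parity version penalizes over-building as much as under-building, which may or may not be what we want against an opponent that spawns ships at random.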

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
import sys
import os

code_dir = os.environ.get('HALITE_PATH')

if not code_dir:
    code_dir = os.path.join(os.path.dirname(os.getcwd()), 'code')
sys.path.append(code_dir)

In [None]:
import numpy as np

from kaggle_environments import make
from kaggle_environments.envs.halite.helpers import Board, ShipAction, ShipyardAction, Observation

from halite_env import HaliteEnv
from ship_state_wrapper import ShipStateWrapper
from shipyard_state_wrapper import ShipYardStateWrapper
from agent import Agent
from game_runner_v2 import GameRunner

In [None]:
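def run_training_with_reward(reward_type, max_steps, episodes):
    """Train ship and shipyard agents against a random opponent.

    `reward_type` selects the ship reward used by HaliteEnv, and `max_steps`
    is passed to `GameRunner.play_episode` for each of the `episodes` games.
    Returns the per-episode scores, both agents, and the environment.
    """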
    ship_frame_stack_len = 2
    env = make("halite", debug=True)
    ship_state_wrapper = ShipStateWrapper(
        radius=4,
        max_frames=ship_frame_stack_len,
        map_size=int(env.configuration['size'])
    )

    shipyard_state_wrapper = ShipYardStateWrapper(
        radius=4,
        max_frames=1,
        map_size=int(env.configuration['size'])
    )

    print(env.configuration)

    print("Initialized state wrappers")

    ship_agent = Agent(
        alpha=0.99, gamma=0.5, n_actions=6,
        batch_size=32, epsilon=.9, input_dims=ship_state_wrapper.state_size
    )

    shipyard_agent = Agent(
        alpha=0.99, gamma=0.5, n_actions=2,
        batch_size=32, epsilon=.9, input_dims=shipyard_state_wrapper.state_size
    )

    print("Initialized agents")
    
    players = [None, "random"]

    trainer = env.train(players)

    print("Initialized trainer")
    
    halite_env = HaliteEnv(
        environment=env,
        opponents=len(players),
        ship_state_wrapper=ship_state_wrapper,
        shipyard_state_wrapper=shipyard_state_wrapper,
        radius=4,
        trainer=trainer,
        ship_reward_type=reward_type
    ) 
    
    game = GameRunner(
        configuration=env.configuration,
        env=halite_env,
        ship_agent=ship_agent,
        shipyard_agent=shipyard_agent,
        training=True,
        ship_frame_stack_len=ship_frame_stack_len
    )
    
    all_scores = []
    for episode in range(episodes):
        scores = game.play_episode(max_steps)
        all_scores.append(scores)

    return {
        'all_scores': all_scores,
        'ship_agent': ship_agent,
        'shipyard_agent': shipyard_agent,
        'env': env
    }

## Reward Shaping A

Our first attempt at reward shaping will encourage ships to collect and deposit halite. 

- Here we will deduct points at each timestep.
- We will also add points for the difference between the previous amount of halite the ship had and the current amount of halite it has. 
- In order to encourage an increase in player halite instead of just ship halite, we will multiply the new ship halite by 0.5.
- Finally, we will add the difference in total player halite. 
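
The four bullets above can be sketched as a single function. This is only my reading of the description: the per-step penalty of 1 and the choice to apply the 0.5 weight to the change in ship halite are assumptions, and the function name `collect_deposit_reward` is hypothetical. The reward that actually runs below lives inside `HaliteEnv`.

```python
def collect_deposit_reward(prev_ship_halite, ship_halite,
                           prev_player_halite, player_halite,
                           step_penalty=1.0, ship_weight=0.5):
    """Per-step reward for collecting and depositing halite."""
    # Change in the halite carried by this ship since the previous step.
    ship_gain = ship_halite - prev_ship_halite
    # Change in the player's banked halite since the previous step.
    player_gain = player_halite - prev_player_halite
    # Ship halite is down-weighted so that banking it is worth more
    # than merely carrying it.
    return -step_penalty + ship_weight * ship_gain + player_gain
```

Under these assumptions, mining 100 halite gives -1 + 50 = 49, depositing that cargo gives -1 - 50 + 100 = 49, and sitting still gives -1.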


In [None]:
def win_loss_reward(observation):
    # Each entry of observation.players starts with that player's halite total.
    player_halite = observation.players[observation.player][0]
    # Exclude our own entry so that we compare only against opponents.
    opponent_halites = [
        item[0]
        for index, item in enumerate(observation.players)
        if index != observation.player
    ]
    best_opponent_halite = max(opponent_halites)

    if player_halite > best_opponent_halite:
        return 500
    else:
        return -500

In [None]:
results = run_training_with_reward('total_halite', 150, 5)

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
all_scores = results['all_scores']
ship_agent = results['ship_agent']
shipyard_agent = results['shipyard_agent']
env = results['env']
rewards = all_scores

In [None]:
# One panel per episode, one line per score column.
for i, episode in enumerate(rewards[-5:], start=1):
    episode = np.array(episode)
    plt.subplot(3, 2, i)
    for j in range(episode.shape[1]):
        sns.lineplot(x=range(episode.shape[0]), y=episode[:, j])

In [None]:
env.render(mode="ipython",width=800, height=600)

In [None]:
results = run_training_with_reward('basic', 150, 5)

In [None]:
all_scores = results['all_scores']
ship_agent = results['ship_agent']
shipyard_agent = results['shipyard_agent']
env = results['env']
rewards = all_scores

In [None]:
# One panel per episode, one line per score column.
for i, episode in enumerate(rewards[-5:], start=1):
    episode = np.array(episode)
    plt.subplot(3, 2, i)
    for j in range(episode.shape[1]):
        sns.lineplot(x=range(episode.shape[0]), y=episode[:, j])

In [None]:
env.render(mode="ipython",width=800, height=600)

In [None]:
results = run_training_with_reward('collector', 150, 5)

In [None]:
all_scores = results['all_scores']
ship_agent = results['ship_agent']
shipyard_agent = results['shipyard_agent']
env = results['env']
rewards = all_scores

# One panel per episode, one line per score column.
for i, episode in enumerate(rewards[-5:], start=1):
    episode = np.array(episode)
    plt.subplot(3, 2, i)
    for j in range(episode.shape[1]):
        sns.lineplot(x=range(episode.shape[0]), y=episode[:, j])

In [None]:
env.render(mode="ipython",width=800, height=600)