# Reward Functions and State Shapes

### Intro
For me, one of the more interesting parts of this competition is how reward functions and modified state shapes can affect an agent's ability to perform well.

Some of this flies in the face of the entire purpose of reinforcement learning. On rewards, I would like to present this snippet, taken from Richard Sutton and Andrew Barto's intro book on reinforcement learning:

> The reward signal is your way of communicating to the [agent] what you want it to achieve, not how you want it achieved (author emphasis).
> For example, a chess-playing agent should be rewarded only for actually winning, not for achieving subgoals such as taking its opponent's pieces or gaining control of the center.

Additionally,

>Newcomers to reinforcement learning are sometimes surprised that the rewards—which define of the goal of learning—are computed in the environment rather than in the agent...

>For example, if the goal concerns a robot’s internal energy reservoirs, then these are considered to
be part of the environment; if the goal concerns the positions of the robot’s limbs, then these too are considered to be part of the environment—that is, the agent’s boundary is drawn at the interface between the limbs and their
control systems. These things are considered internal to the robot but external to the learning agent. 

The simplest reward function would be 1 for winning and 0 for everything else.
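
As a minimal sketch of that sparse reward, assuming the winner is whoever holds more halite when the episode ends (the function and parameter names here are hypothetical, not part of any competition API):

```python
def sparse_reward(done: bool, player_halite: float, opponent_halite: float) -> float:
    """Return 1.0 only on the final step of a game the player won, else 0.0.

    Assumption: "winning" means strictly more halite than the opponent
    at the end of the episode. Ties and every intermediate step give 0.0.
    """
    if done and player_halite > opponent_halite:
        return 1.0
    return 0.0


# Every step before the end of the game is silent, win or lose.
mid_game = sparse_reward(done=False, player_halite=5000, opponent_halite=4000)
final_win = sparse_reward(done=True, player_halite=5000, opponent_halite=4000)
final_loss = sparse_reward(done=True, player_halite=4000, opponent_halite=5000)
```

With this signal the agent receives no feedback at all until the last step, which is part of why learning from it can be slow.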

### Motivation
I kept running into the same problem, especially while training against the random agent: my agents decided that the best thing to do was nothing.

Against the random agent this makes sense. Generally speaking, the random agent will keep spawning new ships or converting ships to shipyards (reducing its total score). In this scenario the player agent is content to sit back and not spend halite on converting or spawning unless it needs to.

### Strategies

#### Make it possible to lose games
It is very important to either have an opponent increase their halite, or have the player decrease their halite (artificially). This will remove the incentive for the agent to sit around until it inevitably wins.

- improve the opponent agent
- by default, subtract N halite from the player when the game starts (the problem is that this affects the ability to spawn/convert)
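
One possible way around the side effect in the second bullet is to leave the in-game halite alone and apply the handicap only when the reward is computed. This is a sketch under that assumption; `handicap` and the other names are hypothetical, and the value 1000 is arbitrary:

```python
def handicapped_win_reward(
    done: bool,
    player_halite: float,
    opponent_halite: float,
    handicap: float = 1000.0,
) -> float:
    """Win/lose reward where the player must beat the opponent by `handicap`.

    The handicap is subtracted only inside the reward calculation, so the
    halite the agent can actually spend on spawning/converting is unchanged.
    """
    if not done:
        return 0.0
    return 1.0 if (player_halite - handicap) > opponent_halite else 0.0


# Sitting on 5000 halite no longer beats an opponent that ends on 4500.
idle_result = handicapped_win_reward(True, player_halite=5000, opponent_halite=4500)
```

Under this rule an idle agent can lose, which is the point of the section above.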

#### Reward Shaping
See below.

### Reward Shaping

From Andrew Y. Ng, Daishi Harada, and Stuart Russell, "Policy invariance under reward transformations: Theory and application to reward shaping":
> These results shed light on the practice of reward shaping, a method used in reinforcement learning whereby additional training rewards are used to guide the learning agent. In particular, some well-known "bugs" in reward shaping procedures are shown to arise from non-potential-based rewards, and methods are given for constructing shaping potentials corresponding to distance-based and subgoal-based heuristics. We show that such potentials can lead to substantial reductions in learning time.
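
The potential-based form that paper refers to adds a shaping term `gamma * potential(next_state) - potential(state)` to the environment reward. Below is a minimal sketch of that idea. The `halite_potential` function, the dictionary state layout, and the 5000 scaling constant are all hypothetical choices made for illustration, and treating the terminal potential as zero is a common convention rather than something stated in the quote.

```python
from typing import Any, Callable


def shaped_reward(
    env_reward: float,
    state: Any,
    next_state: Any,
    potential: Callable[[Any], float],
    gamma: float = 0.99,
    done: bool = False,
) -> float:
    """Environment reward plus a potential-based shaping term.

    Shaping term: gamma * potential(next_state) - potential(state).
    Assumption: the potential of a terminal state is treated as 0.
    """
    next_potential = 0.0 if done else potential(next_state)
    return env_reward + gamma * next_potential - potential(state)


def halite_potential(state: dict) -> float:
    """Hypothetical potential: the player's halite, scaled to roughly [0, 1]."""
    return state["halite"] / 5000.0


# Gaining halite between two steps yields a small positive shaped reward
# even though the environment reward itself is still 0.
step_reward = shaped_reward(
    env_reward=0.0,
    state={"halite": 1000},
    next_state={"halite": 2000},
    potential=halite_potential,
    gamma=1.0,
)
```

Because the shaping terms telescope along a trajectory, with `gamma=1.0` their sum depends only on the potentials of the first and last states, which is the intuition behind why this form avoids the "bugs" mentioned in the quote.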

Additionally, from this write-up:
https://medium.com/@BonsaiAI/deep-reinforcement-learning-models-tips-tricks-for-writing-reward-functions-a84fe525e8e0
> You want to instead shape rewards that get gradual feedback and let it know it’s getting better and getting closer. It helps it learn a lot faster.

The focus of this notebook is on reward shaping. The goal is to see whether we can nudge the agents to learn a bit faster. With better agents in hand, we may then be able to train the final agent _against_ those agents, so that it actually has to react and learn good moves.