*Properly used, positive reinforcement is extremely powerful. - [B. F. Skinner](https://www.brainyquote.com/authors/b-f-skinner-quotes)*

*once one of my pups found half a roast chicken in the corner of a parking lot and we had to visit that exact same corner every day for about fifty years because for dogs hope springs eternal when it comes to half a roast chicken - [darth](https://twitter.com/darth/status/1057075608063139840)* (possibly embed tweet)

Tic-Tac-Toe is a simple game. If both sides play perfectly, neither can win. But if one side plays imperfectly, the other can exploit the flaws in its strategy.

Does that sound a little like trading?

In this post, we will explore reinforcement learning, apply it to train an algorithm to play Tic-Tac-Toe, and then use it to learn to trade a slightly non-random price series.

### Tic-Tac-Toe With Simple Reinforcement Learning

Here's an algorithm that will learn an exploitive Tic-Tac-Toe strategy, and adapt over time if its opponent learns:

1) Make a big table of all possible Tic-Tac-Toe boards. 

2) Initialize the table to assign a value of 1.0 to each board where X has won, -1.0 where O has won, and 0 to every other board.

3) Play with your opponent. At each move, pick the best available move in your table, or if several are tied for best, pick one at random. Occasionally pick a move at random just to make sure you explore the whole state space, and keep your opponent on their toes. 

4) After every game, back up through all the boards you encountered and update your value table as follows:
	- When X wins, update each board's value part of the way to 1.
	- When O wins, update each board's value part of the way to -1.
	- When they tie, update each board's value part of the way to 0.
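The four steps above can be sketched in a few dozen lines of Python. This is a minimal illustration, not the code this post links to, and it makes several assumptions that are not in the text: the agent learns by self-play (both sides share one table), the table is filled lazily as boards are seen rather than enumerated up front, and the step size `alpha` and exploration rate `epsilon` are arbitrary.

```python
import random

WIN_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
             (0, 3, 6), (1, 4, 7), (2, 5, 8),
             (0, 4, 8), (2, 4, 6)]

def winner(board):
    """Return 'X' or 'O' if that side has three in a row, else None."""
    for a, b, c in WIN_LINES:
        if board[a] != ' ' and board[a] == board[b] == board[c]:
            return board[a]
    return None

def initial_value(board):
    """Step 2: 1.0 where X has won, -1.0 where O has won, 0 otherwise."""
    return {'X': 1.0, 'O': -1.0}.get(winner(board), 0.0)

def choose_move(board, player, values, rng, epsilon=0.1):
    """Step 3: best move in the table, random tie-break, occasional random move."""
    moves = [i for i, cell in enumerate(board) if cell == ' ']
    if rng.random() < epsilon:
        return rng.choice(moves)
    sign = 1.0 if player == 'X' else -1.0  # X wants high values, O wants low ones

    def score(move):
        nxt = board[:move] + player + board[move + 1:]
        return sign * values.get(nxt, initial_value(nxt))

    best = max(score(m) for m in moves)
    return rng.choice([m for m in moves if score(m) == best])

def play_and_learn(values, rng, alpha=0.2, epsilon=0.1):
    """Play one self-play game, then apply step 4 to every board visited."""
    board, player, visited = ' ' * 9, 'X', []
    while winner(board) is None and ' ' in board:
        move = choose_move(board, player, values, rng, epsilon)
        board = board[:move] + player + board[move + 1:]
        visited.append(board)
        player = 'O' if player == 'X' else 'X'
    outcome = initial_value(board)  # 1.0, -1.0, or 0.0 for a tie
    for b in visited:
        old = values.get(b, initial_value(b))
        values[b] = old + alpha * (outcome - old)  # move part of the way to the outcome
    return outcome

rng = random.Random(0)
values = {}  # board string -> value, filled in as boards are encountered
for _ in range(2000):
    play_and_learn(values, rng)
print(len(values), "boards seen")
```

Boards are stored as 9-character strings so they can be used directly as dictionary keys.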

This is a very crude brute-force algorithm. It knows almost nothing about the game of Tic-Tac-Toe. It can't reason about the game. It can't generalize to boards it hasn't seen (footnote: even if they are isomorphic to boards it has seen. When you think about it, there are only 3 distinct starting moves: board center, corner, and center side. Flipping or rotating the board shouldn't change the value of a position or how to play it.). It doesn't find the globally optimal strategy.
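To make the footnote concrete, here is a small sketch of how boards could be reduced to one representative per symmetry class. The function names and the 9-character string representation are illustrative assumptions; the algorithm described above does not do this.

```python
def rotate(board):
    """Rotate a 9-character board string 90 degrees clockwise."""
    # The cell at new position (r, c) comes from old position (2 - c, r).
    return ''.join(board[(2 - c) * 3 + r] for r in range(3) for c in range(3))

def mirror(board):
    """Flip the board left to right."""
    return ''.join(board[r * 3 + (2 - c)] for r in range(3) for c in range(3))

def symmetries(board):
    """All 8 rotations and reflections of a board (some may coincide)."""
    out = []
    b = board
    for _ in range(4):
        out.append(b)
        out.append(mirror(b))
        b = rotate(b)
    return out

def canonical(board):
    """Pick one fixed representative for a board and all its symmetric twins."""
    return min(symmetries(board))

# The 9 possible first moves collapse to 3 classes: center, corner, center side.
openings = {canonical(' ' * i + 'X' + ' ' * (8 - i)) for i in range(9)}
print(len(openings))  # 3
```

Keying a value table on `canonical(board)` instead of `board` would let every rotated or flipped copy of a position share one entry.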

But over time, this algorithm will learn to exploit flaws in its opponent's strategy, and if the opponent changes tactics, it will adapt.

This is *reinforcement learning*. (link to code)

More sophisticated reinforcement algorithms enable [robots to walk on four or two legs](https://www.youtube.com/watch?v=xXrDnq1RPzQ), [driverless cars to drive](https://www.youtube.com/watch?v=eRwTbRtnT1I), computers to play [Atari](https://deepsense.ai/playing-atari-with-deep-reinforcement-learning-deepsense-ais-approach/) and [poker](https://www.engadget.com/2017/02/10/libratus-ai-poker-winner/?guccounter=1) and [Go](https://deepmind.com/blog/article/alphago-zero-starting-scratch) in some cases better than humans. 

In the parts that follow, we'll extend the Tic-Tac-Toe example to more complex deep reinforcement learning, and try to build a reinforcement learning trading robot.

### Reinforcement Learning Concepts

But first, how does reinforcement learning in general work?

![agent-environment](RL1.png "Figure 1")

All these figures are from David Silver's UCL course: http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching.html

1) At time *t*, the *agent* observes the environment *state* *s<sub>t</sub>* (the Tic-Tac-Toe board).

2) From the set of available actions (the open squares), the agent takes *action* *a<sub>t</sub>* (the move that results in the best probability of winning) 

3) The environment updates at the next *timestep* *t+1* to a new state *s<sub>t+1</sub>*. In Tic-Tac-Toe this is the board that results from the agent's move. But in a complex environment like a car on a road, the new state may be partly determined by the agent's actions (you turned left and accelerated) and partly by visible or hidden complexities in the environment (a dog runs into the road). And the new state may not be deterministic: it may be stochastic, where some things occur randomly with probabilities dependent on the visible and hidden state and the actions of the agent.

4) The environment generates a *reward*. In Tic-Tac-Toe this happens when you win, lose, or draw. In Space Invaders, you get points awarded at various times when you hit different targets. When training a self-driving car, machine learning engineers have to design rewards for e.g. staying on the road, including negative rewards for e.g. collisions.

The technical name for this setup is a [Markov Decision Process](https://en.wikipedia.org/wiki/Markov_decision_process). Reinforcement learning always has states, actions, transitions between states, rewards, and an agent that chooses actions, all tied together in a cycle: observe the state, act, get a reward, and repeat with the new state.
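The observe, act, reward cycle can be written as a short loop. The sketch below uses a made-up toy environment (`LineWorld`, a walk along five positions with a chance of slipping the wrong way) purely to show the shape of the interaction. The class, its `step` method, and the reward values are assumptions for illustration, not part of any library.

```python
import random

class LineWorld:
    """Hypothetical toy MDP: positions 0..4, start in the middle, both ends terminal."""

    def __init__(self, rng, slip=0.2):
        self.rng, self.slip, self.state = rng, slip, 2

    def actions(self):
        return [-1, +1]

    def step(self, action):
        """Apply an action; with probability `slip` the move is reversed (stochastic transition)."""
        if self.rng.random() < self.slip:
            action = -action
        self.state += action
        reward = {0: -1.0, 4: 1.0}.get(self.state, 0.0)
        done = self.state in (0, 4)
        return self.state, reward, done

def run_episode(env, policy):
    """Observe the state, act, receive a reward and the next state, repeat until done."""
    state, total, done = env.state, 0.0, False
    trajectory = []
    while not done:
        action = policy(state, env.actions())
        next_state, reward, done = env.step(action)
        trajectory.append((state, action, reward, next_state))
        total += reward
        state = next_state
    return total, trajectory

def always_right(state, actions):
    """A fixed, non-learning policy used only to drive the loop."""
    return +1

total, trajectory = run_episode(LineWorld(random.Random(1)), always_right)
print(total, len(trajectory))
```

Each tuple in `trajectory` is one pass around the cycle: state, action, reward, next state.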


### Reinforcement Learning Variations

The agent always has a *policy function*, which chooses the best action based on the environment state. It *may* have the following components:

- *Model* - An internal representation of the environment. Our agent stores the board. It knows some state-action pairs result in the same state as other state-action pairs, so it's *model-based*. Other agents are *model-free*. They choose actions without explicitly storing an internal model of the state or modeling state transitions. Their understanding of the environment is implicit in the policy function (footnote: for instance, if for Tic-Tac-Toe, instead of a table of all possible boards, we used a table of boards plus following actions, then we wouldn't be modeling what happens after a move, but just evaluating state-action pairs.).

- *State value function* - The ability to score a state (our V table)

- *State-action value function* - The ability to score the value of an action in a given state, i.e. a state-action pair, commonly termed a *Q-value function*
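The difference between the last two components shows up when the agent picks an action. In the sketch below, with made-up states, values, and a hypothetical `transition` function, choosing greedily from a state value table needs a model to look one step ahead, while choosing from a Q table does not.

```python
def transition(state, action):
    """A known model of the environment: the next state for a state-action pair."""
    return state + action

# Illustrative numbers only.
V = {0: -1.0, 1: 0.0, 2: 0.5, 3: 1.0}   # state value function
Q = {(1, -1): -1.0, (1, +1): 0.5,        # state-action value function
     (2, -1): 0.0, (2, +1): 1.0}

def greedy_from_v(state, actions):
    """Needs the model: look one step ahead and score the resulting state."""
    return max(actions, key=lambda a: V[transition(state, a)])

def greedy_from_q(state, actions):
    """Needs no model: score each state-action pair directly."""
    return max(actions, key=lambda a: Q[(state, a)])

print(greedy_from_v(2, [-1, +1]), greedy_from_q(2, [-1, +1]))  # 1 1
```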

Based on which components a reinforcement algorithm uses to generate the workflow shown in Figure 1, it can be characterized as belonging to different flavors.

![taxonomy](RL3.png "Figure 2")

All reinforcement learning variations learn using a similar process:

1) Initialize the algorithm with a naive, possibly random policy.

2) Using the policy, take actions, observe states before and after actions, experience rewards.

3) Fit a model which improves the policy.

4) Go to 2) and iterate, collecting more experience with our improved policy, and continuing to improve it.
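As a rough illustration of this loop, here is a sketch on about the simplest possible problem: a few actions with unknown average rewards and no state at all (a multi-armed bandit). The problem, the function name, and all parameters are assumptions chosen to keep the example short. The "model" being fit is just a running average of the reward seen for each action.

```python
import random

def improve_by_experience(true_means, rounds=50, batch=20, epsilon=0.1, seed=0):
    """Alternate between collecting experience and refitting reward estimates."""
    rng = random.Random(seed)
    n_actions = len(true_means)
    estimates = [0.0] * n_actions   # 1) naive start: every action looks equally good
    counts = [0] * n_actions
    for _ in range(rounds):         # 4) iterate with the improved policy
        experience = []
        for _ in range(batch):      # 2) act with the current policy, record rewards
            if rng.random() < epsilon:
                action = rng.randrange(n_actions)
            else:
                best = max(estimates)
                action = rng.choice([a for a in range(n_actions) if estimates[a] == best])
            reward = rng.gauss(true_means[action], 1.0)
            experience.append((action, reward))
        for action, reward in experience:   # 3) fit: update running-average estimates
            counts[action] += 1
            estimates[action] += (reward - estimates[action]) / counts[action]
    return estimates, counts

estimates, counts = improve_by_experience([-1.0, 0.0, 3.0])
print(counts)
```

As the estimates improve, the policy spends more of its time on the action with the highest average reward, which in turn produces better data about that action.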

(find/make a flowchart)

As we circle the flowchart, we improve the algorithm.

### Reinforcement Learning In Context

In a [previous post](https://alphaarchitect.com/2017/09/27/machine-learning-investors-primer/) we reviewed types of machine learning:

Supervised learning: Any algorithm that learns from labeled data to predict a response. Regression predicts a continuous response variable. Classification predicts a discrete response variable.

Unsupervised learning: Any algorithm that summarizes or learns something about unlabeled data, such as clustering, dimensionality reduction. 

It's like those two books, [What They Teach You At Harvard Business School](https://www.amazon.com/What-Teach-Harvard-Business-School/dp/0141037865) and [What They Don't Teach You At Harvard Business School](https://www.amazon.com/What-Teach-Harvard-Business-School/dp/0553345834). Between the two of them, they must cover everything, right? Nevertheless reinforcement learning is considered the third major machine learning paradigm. 

- The agent doesn't have labeled data. It discovers data via an unsupervised process and figures out what to do.

- The rewards can be viewed as labels generated by a supervisor. But rewards aren't directly related to any specific prediction or action. If the agent shoots a target in Space Invaders, it has to figure out which action or actions, possibly several timesteps earlier, contributed to the reward.

- The agent's interactions with the environment *shape* that environment and generate a feedback loop. (A Space Invaders agent changes the world by shooting targets; a self-driving car doesn't modify the road, but its presence and behavior modifies how other vehicles behave, and what environment the algorithm encounters.)

- In supervised learning, the algorithm minimizes an error, like mean squared error or cross-entropy, by optimizing model parameters over training data. In reinforcement learning, the algorithm maximizes the expected cumulative reward generated by the Markov Decision Process (MDP) over time, by searching the state space and optimizing the model parameters.
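One way to write the contrast in the last point side by side is shown below. The notation is an assumption, not something defined earlier in this post: $\theta$ stands for the model parameters, $f_\theta$ for a supervised model fit to $N$ labeled examples $(x_i, y_i)$, $\pi_\theta$ for the agent's policy, $r_{t+1}$ for the reward received after acting at time $t$, and $\gamma \in [0, 1]$ for an optional discount factor that weights later rewards less.

$$
\text{Supervised: } \min_\theta \; \frac{1}{N}\sum_{i=1}^{N} \big(y_i - f_\theta(x_i)\big)^2
\qquad\qquad
\text{Reinforcement: } \max_\theta \; \mathbb{E}_{\pi_\theta}\Big[\sum_{t=0}^{T} \gamma^{t}\, r_{t+1}\Big]
$$

The supervised sum runs over a fixed training set. The reinforcement expectation runs over trajectories that the policy itself generates, which is where the feedback loop described above comes from.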

I view reinforcement learning as meta-supervised-learning. It's the application of supervised machine learning to [*optimal control*](https://en.wikipedia.org/wiki/Optimal_control). In reinforcement learning we predict the value of actions by modeling the environment, explicitly in model-based RL and implicitly in model-free RL, and generate complex behavior to maximize reward.

Reinforcement learning is the machine learning version of a type of complex problem-solving we study under different names across many disciplines. 

![taxonomy](RL2.png "Figure 3")


### Deep Reinforcement Learning

How do we get from our simple Tic-Tac-Toe algorithm to an algorithm that can drive a car or trade a stock?

We can view our table lookup-based algorithm as a model with a *linear value function approximator*. If we represent our board as a one-hot feature vector based on which known board it represents, and view the lookup table as a vector of values, then the dot product of the one-hot feature vector and the value vector picks out that board's value. That is a linear function of the features.
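A tiny sketch of that equivalence, with three made-up boards and made-up values standing in for the full table:

```python
boards = ['X        ', ' X       ', '    X    ']   # a tiny, hypothetical set of known boards
table = [0.1, -0.2, 0.4]                           # one learned value per known board

def one_hot(board):
    """Feature vector with a 1 in the slot for this board and 0 elsewhere."""
    return [1.0 if b == board else 0.0 for b in boards]

def linear_value(board):
    """Dot product of the features with the table, viewed as a weight vector."""
    features = one_hot(board)
    return sum(w * x for w, x in zip(table, features))

print(linear_value('    X    '))  # 0.4
```

The table entries play the role of the weights, so updating a board's value is the same as updating one weight of the linear model.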

Our linear function approximator takes a board, converts it to a feature vector, and outputs a linear function of that feature vector to generate a value for that board. We can swap that linear function for a nonlinear function, such as a neural network. When we do that, we get a first, very crude, deep reinforcement learning algorithm.

Our new algorithm is:

1) Initialize our neural network to random weights

2) Play a game with our opponent

3) At the end of the game, put each board we encountered into a 1x9 array (our predictors) associated with the outcome of the game (our response)

4) Fit the neural network to the recent predictors and responses we've seen (run one or more iterations of stochastic gradient descent)

5) Go to 2) and gather more data.
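Steps 1, 3, and 4 can be sketched without any libraries. Everything specific here is an assumption made for illustration: the encoding (X as 1, O as -1, empty as 0), the network shape (one hidden layer of 16 tanh units), the learning rate, and the single hand-written game used as data. It is not the code this post refers to, and a real version would use a deep learning library.

```python
import math
import random

def encode(board):
    """Step 3: turn a 9-character board into a 1x9 list of numbers."""
    return [{'X': 1.0, 'O': -1.0, ' ': 0.0}[c] for c in board]

class TinyNet:
    """One hidden tanh layer and a tanh output in [-1, 1], matching the game outcomes."""

    def __init__(self, rng, n_in=9, n_hidden=16):
        # Step 1: random initial weights.
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.w2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
        self.b2 = 0.0

    def forward(self, x):
        hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
                  for row, b in zip(self.w1, self.b1)]
        out = math.tanh(sum(w * h for w, h in zip(self.w2, hidden)) + self.b2)
        return hidden, out

    def predict(self, board):
        return self.forward(encode(board))[1]

    def sgd_step(self, board, target, lr=0.05):
        """Step 4: one stochastic gradient descent step on squared error."""
        x = encode(board)
        hidden, out = self.forward(x)
        d_out = (out - target) * (1.0 - out * out)   # gradient at the output pre-activation
        for j, h in enumerate(hidden):
            d_hidden = d_out * self.w2[j] * (1.0 - h * h)  # uses w2[j] before it is updated
            self.w2[j] -= lr * d_out * h
            self.b1[j] -= lr * d_hidden
            for i, xi in enumerate(x):
                self.w1[j][i] -= lr * d_hidden * xi
        self.b2 -= lr * d_out

net = TinyNet(random.Random(0))
# Hypothetical data: the boards from one finished game that X won, each labeled +1.
game = ['    X    ', 'O   X    ', 'O X X    ', 'OOX X    ', 'OOX X X  ']
before = sum((net.predict(b) - 1.0) ** 2 for b in game)
for _ in range(200):
    for b in game:
        net.sgd_step(b, 1.0)
after = sum((net.predict(b) - 1.0) ** 2 for b in game)
print(before, after)
```

In the full loop, `game` would be replaced after each game by the boards actually encountered, labeled with that game's outcome.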

This will work, although it takes a long time to train and just makes our initial brute-force method even more inefficient. (See code.)

But in a nutshell, that is how a self-driving car could work. 

- The state is represented by a giant array of inputs from all the onboard cameras and sensors.

- The actions are: turn the steering wheel, accelerate, and brake.

- Positive rewards come from staying on the road and arriving safely to the destination, and negative rewards from breaking traffic rules or colliding.

- The real world provides the state transitions.

- And we train a complex neural network to do all the right things involved in detecting and interpreting all the objects in the environment and navigating from point A to point B.


Finally, a trading example.

- Inspired by [Gordon Ritter's paper](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3015609).

- Let's create some stocks. Give each of them a random walk, so they are flat. Now make them springy: random walk plus momentum, plus attraction back to the mean, plus trend.

- Let reinforcement learning trade it.
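A "springy" series of this kind could be generated along the following lines. This is only a sketch of the idea in the notes above: the function name, the way the four ingredients are combined, and every coefficient are assumptions, and it is not the model from the paper.

```python
import random

def simulate_prices(n_steps, rng, start=100.0, mean=100.0,
                    trend=0.0, momentum=0.3, reversion=0.05, noise=1.0):
    """Hypothetical price series: random walk plus momentum, mean reversion, and trend."""
    prices = [start]
    last_change = 0.0
    for _ in range(n_steps):
        change = (trend                                # constant drift per step
                  + momentum * last_change             # carry over part of the last move
                  + reversion * (mean - prices[-1])    # pull back toward the mean
                  + rng.gauss(0.0, noise))             # random-walk shock
        prices.append(prices[-1] + change)
        last_change = change
    return prices

prices = simulate_prices(250, random.Random(0))
print(len(prices), round(min(prices), 2), round(max(prices), 2))
```

Setting `momentum`, `reversion`, and `trend` to zero recovers a plain random walk, which gives a baseline where there should be nothing for a trading agent to learn.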



### Conclusions

Similar paradigm to Andrew Lo's [adaptive markets](https://alo.mit.edu/book/adaptive-markets/).

Anecdotally, poker bots are taking over, and algorithms already own very short-term high-frequency trading. With algorithms like RL, they can move up the timescale.

There's a need to combine the brute-force RL approach with some reasoning about the world or the game, which AlphaGo and AlphaZero seem to do to some modest extent.

### Further Reading

- [UCL course by David Silver](http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching.html)
- [Stanford course (CS234)](http://web.stanford.edu/class/cs234/schedule.html)
- [Berkeley deep RL course](http://rail.eecs.berkeley.edu/deeprlcourse/)
- [Sutton and Barto, *Reinforcement Learning: An Introduction*, 2nd edition](http://incompleteideas.net/book/the-book-2nd.html)
- [Szepesvári, *Algorithms for Reinforcement Learning*](https://sites.ualberta.ca/~szepesva/papers/RLAlgsInMDPs.pdf)

