*once one of my pups found half a roast chicken in the corner of a parking lot and we had to visit that exact same corner every day for about fifty years because for dogs hope springs eternal when it comes to half a roast chicken - [darth](https://twitter.com/darth/status/1057075608063139840)*

*Properly used, positive reinforcement is extremely powerful. - [B. F. Skinner](https://www.brainyquote.com/authors/b-f-skinner-quotes)*

Tic-Tac-Toe is a simple game. If both sides play perfectly, neither can win. But if one plays imperfectly, the other can exploit the flaws in the other's strategy. 

Does that sound a little like trading?

In this post, we will explore reinforcement learning, and apply it, first to learn an algorithm to play Tic-Tac-Toe, and then learn to trade a moderately non-random price series.

### Tic-Tac-Toe With Simple Reinforcement Learning

Here's an algorithm that will learn an exploitive Tic-Tac-Toe strategy, and adapt over time if its opponent learns:

1) Make a big table of all possible Tic-Tac-Toe boards. 

2) Initialize the table to assign a value of 0 to each board, 1.0 where X has won, -1.0 where O has won.

3) Play with your opponent. At each move, pick the best available move in your table, or if several are tied for best, pick one at random. Occasionally, pick a move at random just to make sure you explore the whole state space, and to keep your opponent on their toes. 

4) After each game, back up through all the boards that were played. Update the value table like this:
	- When X wins, update each board's value part of the way to 1. 
	- When O wins, update part of the way to -1.  
	- When they tie, update part of the way to 0 
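
The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not the code from the linked notebook: the names `values`, `choose_move`, `update_after_game`, and the constants `ALPHA` and `EPSILON` are all assumptions, and the table is a dictionary filled in lazily (unseen boards default to 0) rather than a pre-built table of every board.

```python
import random

ALPHA = 0.1      # assumed step size: how far each update moves toward the outcome
EPSILON = 0.1    # assumed exploration rate

values = {}      # board (tuple of 9 cells: 'X', 'O', or ' ') -> estimated value

def value_of(board):
    """Look up a board's value, defaulting to 0 for boards not seen yet."""
    return values.get(board, 0.0)

def choose_move(board, player, epsilon=EPSILON):
    """Pick the best-valued move for `player`, breaking ties at random; occasionally explore."""
    moves = [i for i, cell in enumerate(board) if cell == ' ']
    if random.random() < epsilon:
        return random.choice(moves)
    sign = 1.0 if player == 'X' else -1.0   # X wants high values, O wants low ones

    def score(move):
        next_board = board[:move] + (player,) + board[move + 1:]
        return sign * value_of(next_board)

    best = max(score(m) for m in moves)
    return random.choice([m for m in moves if score(m) == best])

def update_after_game(boards_played, outcome):
    """Move each visited board's value part of the way toward the outcome.

    outcome is +1.0 if X won, -1.0 if O won, 0.0 for a tie.
    """
    for board in boards_played:
        values[board] = value_of(board) + ALPHA * (outcome - value_of(board))
```

A training loop would call `choose_move` for each turn, record the boards visited, and call `update_after_game` once the game ends.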

This is a profoundly dumb algorithm in the finest sense of the word. It knows almost nothing about the game of Tic-Tac-Toe, but it works. It can't reason about the game. It needs a lot of training. It can't generalize to boards it hasn't seen (footnote: even if they are isomorphic to boards it has seen. When you think about it, there are only 3 distinct starting moves: center, corner, and side. Flipping or rotating the board shouldn't change the value of a position or how to play it.). It doesn't learn the globally optimal strategy.

But over time, this algorithm learns, it exploits flaws in its opponent's strategy, and if the opponent changes tactics, it adapts.

This is *reinforcement learning*.  

More sophisticated reinforcement learning algorithms enable [robots to walk on four or two legs](https://www.youtube.com/watch?v=xXrDnq1RPzQ), [driverless cars to drive](https://www.youtube.com/watch?v=eRwTbRtnT1I), computers to play [Atari](https://deepsense.ai/playing-atari-with-deep-reinforcement-learning-deepsense-ais-approach/) and [poker](https://www.engadget.com/2017/02/10/libratus-ai-poker-winner/?guccounter=1) and [Go](https://deepmind.com/blog/article/alphago-zero-starting-scratch), in some cases better than humans. 

Here is some sample [Tic-Tac-Toe code](https://github.com/druce/rl/blob/master/Tic-Tac-Toe.ipynb). In this post, we'll extend the Tic-Tac-Toe example to deep reinforcement learning, and build a reinforcement learning trading robot.

### Reinforcement Learning Concepts

But first, how does reinforcement learning in general work?

![agent-environment](RL1.png "Figure 1")

All figures are from the [lectures of David Silver](http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching.html), a leading reinforcement learning researcher known for the [AlphaGo](https://en.wikipedia.org/wiki/AlphaGo) project, among others.

1) At time *t*, the *agent* observes the environment *state* *s<sub>t</sub>* (the Tic-Tac-Toe board). (Or o<sub>t</sub>, the observable part of the state, in the event the state is not fully observed, and there is some hidden state.)

2) From the set of available actions (the open board squares), the agent takes *action* *a<sub>t</sub>* (the best move). 

3) The environment updates at the next *timestep* *t+1* to a new state *s<sub>t+1</sub>*. In Tic-Tac-Toe this is the board resulting from the opponent's move. In a complex environment like a car on a road, the new state may be partly determined by the agent's actions (you turned left and accelerated) and partly by visible or hidden complexities in the environment (a dog runs into the road). And the new state may not be deterministic, it may be stochastic, where some things occur randomly with probabilities dependent on the visible and hidden state and the actions of the agent.

4) The environment generates a *reward*. In Tic-Tac-Toe you get a reward when you win, lose, or draw. In Space Invaders, you win points at various times when you hit different targets. When training a self-driving car, machine learning engineers design rewards for staying on the road, getting to the destination, including negative rewards for e.g. collisions.

The technical name for this setting is a *[Markov Decision Process](https://en.wikipedia.org/wiki/Markov_decision_process) (MDP)*. 

- It's based on a vanilla [Markov chain](https://en.wikipedia.org/wiki/Markov_chain), with states, and probabilities of transitions between states.
- The vanilla Markov process is extended with actions: at each state the agent chooses an action which impacts the transition probabilities. The transition probabilities are a function not just of *s<sub>t</sub>* but of *(s<sub>t</sub>, a<sub>t</sub>)*.
- Each state transition is associated with a *reward*.
- Finally, the agent chooses actions using a *policy function* *π(s<sub>t</sub>) = a<sub>t</sub>*.

Reinforcement learning always has an environment with states, actions, transitions between states, and rewards, and an agent that acts according to a policy, proceeding through a cycle of observing the state, acting, getting a reward and repeating forever, or until some terminal state is reached.
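
To make the observe, act, reward cycle concrete, here is a minimal sketch of that loop. The `Corridor` environment, its reward values, and the `reset`/`step` method names are hypothetical, chosen only to keep the example self-contained; real environments are far richer.

```python
import random

class Corridor:
    """Hypothetical toy environment: start at cell 0, reach the last cell to end the episode."""

    def __init__(self, length=5):
        self.length = length
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        """action is -1 (left) or +1 (right); returns (next_state, reward, done)."""
        self.state = min(max(self.state + action, 0), self.length - 1)
        done = self.state == self.length - 1
        reward = 1.0 if done else -0.1   # small cost per step, payoff at the goal
        return self.state, reward, done

def random_policy(state):
    """pi(s_t) = a_t; this placeholder ignores the state and acts at random."""
    return random.choice([-1, 1])

def run_episode(env, policy, max_steps=1000):
    """Observe the state, act, collect the reward, and repeat until a terminal state."""
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(state)                   # the agent acts
        state, reward, done = env.step(action)   # the environment transitions and rewards
        total_reward += reward
        if done:
            break
    return total_reward
```

Calling `run_episode(Corridor(), random_policy)` plays one episode; learning consists of replacing `random_policy` with something that improves from experience.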


### Reinforcement Learning Components and Variations

The agent always has a *policy function*, which chooses the best action based on the environment state. It *may* have the following components:

- *Model* - An internal representation of the environment. Our agent has a model of the board, and it knows some state-action pairs result in the same state as other state-action pairs. Therefore it could be considered *model-based*. A fully model-based algorithm explicitly models the full MDP with all transition probabilities, which this Tic-Tac-Toe algorithm doesn't do. Other agents may be *model-free*. They choose actions without explicitly storing an internal model of the state or modeling state transitions. The understanding of the environment is implicit in the policy function.

- (footnote. For instance, instead of a table with all possible Tic-Tac-Toe boards we could use a table mapping (board, action) pairs to values. Then we wouldn't be modeling internally what happens after a move, i.e. several (board, action) pairs arrive at the same board. We would just evaluate state, action pairs directly without any internal model. That would work pretty much the same, it would just be a bigger table and take longer to train.).

- *State value function* - A way to score a state (our big table mapping boards to values).

- *State-action value function* - A way to score the value of an action in a given state, i.e. a state-action pair, commonly termed a *Q-value function*.

Just as there are many algorithms for regression or classification, there are many reinforcement learning architectures. It's a fast-moving field with new approaches emerging constantly. Based on which components a reinforcement learning algorithm uses to generate the workflow shown in Figure 1, it can be characterized as belonging to different flavors of reinforcement learning.

![taxonomy](RL3.png "Figure 2")

All reinforcement learning variations learn using a similar workflow:

1) Initialize the algorithm with a naive, possibly random policy.

2) Using the policy, take actions, observe states before and after actions, experience rewards.

3) Fit a model which improves the policy.

4) Go to 2) and iterate, collecting more experience with the improved policy, and continuing to improve it.

As we continue to iterate, we improve the algorithm.

![Flowchart](flowchart.png)

### Reinforcement Learning In Context

In a [previous post](https://alphaarchitect.com/2017/09/27/machine-learning-investors-primer/) we discussed the difference between paradigms of machine learning:

*Supervised learning:* Any algorithm that predicts labeled data. Regression predicts a continuous response variable (next quarter's real GDP growth, next month's stock return).  Classification predicts a categorical response variable (recession or recovery, next month's return quintile). 

*Unsupervised learning:* Any algorithm that summarizes or learns about unlabeled data, such as clustering or dimensionality reduction. 

Every data set is either labeled or unlabeled, so between supervised and unsupervised, that must cover everything, right? It's like those two books, [What They Teach You At Harvard Business School](https://www.amazon.com/What-Teach-Harvard-Business-School/dp/0141037865) and [What They Don't Teach You At Harvard Business School](https://www.amazon.com/What-Teach-Harvard-Business-School/dp/0553345834). Between the two of them, they must cover all human knowledge, right?

Nevertheless reinforcement learning is considered the third major machine learning paradigm. Consider the Tic-Tac-Toe robot:

- The agent doesn't have fixed training data, it discovers data via an unsupervised process and learns a policy.

- The rewards can be viewed as labels generated by a supervisor. But rewards aren't always directly related to one specific prediction or action. If the agent shoots a target in Space Invaders, it has to figure out which action or sequence of actions possibly many timesteps earlier contributed to the reward (the *credit assignment* problem). 

- The agent's interactions with the environment *shape* that environment, help determine what data the learning algorithm subsequently encounters, and generate a *feedback loop*. A Space Invaders agent changes the world by shooting targets; a self-driving car doesn't modify the road, but its presence and behavior modify how other vehicles behave, and what environment the algorithm encounters.

- In supervised learning, the algorithm optimizes model parameters over training data to minimize a loss function, like mean squared error or cross-entropy. In reinforcement learning, the algorithm optimizes model parameters over the state space it encounters, to maximize the expected reward generated by the Markov Decision Process (MDP) over time.

In reinforcement learning, we move beyond *prediction* to *control*. Reinforcement learning could be viewed as meta-supervised-learning. It's the application of supervised machine learning to [*optimal control*](https://en.wikipedia.org/wiki/Optimal_control).  We apply supervised prediction methods such as classification and regression. But we use them to predict the best action to take within the *action space*. We use supervised methods to choose actions, and learn behavior policies to maximize reward in a complex dynamic environment.

Many disciplines have encountered problems like these and developed models and engineering methodologies to address them: 

- Business/Operations Research: Dynamic pricing of airline seats or other products to maximize profits under changing inventory, production, demand conditions.
- Economics: Optimal Fed interest rate policy to maintain full employment and low inflation in a dynamic economy.
- Engineering: Auto-pilots, spacecraft navigation, robots and industrial automation.
- Psychology: Stimulus-response, positive and negative reinforcement.
- Neuroscience: The brain's chemical reward loop, how children learn to walk and talk, or catch a ball.
- Mathematics: Control theory, game theory, optimization.

![connections](RL2.png "Figure 3")



### Deep Reinforcement Learning

How do we get from our simple Tic-Tac-Toe algorithm to an algorithm that can drive a car or trade a stock?

We can view our table lookup as a *linear value function approximator*. If we represent  our board as a one-hot feature vector based on which board it represents, and view the lookup table as a vector, then if we dot-multiply the one-hot feature vector by the lookup table values, we get a linear value function to choose the next move.

Our linear value function approximator takes a board, represents it as a feature vector with one feature for each possible board, and outputs a linear function of that feature vector, the value for that board. We can swap that linear function for a nonlinear function, such as a neural network. When we do that, we get our first, very crude, deep reinforcement learning algorithm.
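
A tiny numeric check of the "table lookup is a linear function" claim, using NumPy. The five-entry table and its values are made up for illustration; the real table has one entry per possible board.

```python
import numpy as np

n_boards = 5                                    # stand-in for the number of possible boards
table = np.array([0.0, 1.0, -1.0, 0.3, -0.2])   # made-up learned values, one per board

board_index = 3
features = np.zeros(n_boards)
features[board_index] = 1.0                     # one-hot encoding of "this is board 3"

linear_value = features @ table                 # dot product with the weight vector
assert linear_value == table[board_index]       # identical to a plain table lookup
```

The lookup table plays the role of the weight vector, which is why replacing it with a neural network is a natural next step.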

Our new deep Q-learning algorithm is:

1) Initialize our neural network to random weights

2) Play a game with our opponent

3) At the end of the game, append each board we encountered into a nx9 data array (our predictors are the state of each square) associated with the outcome of the game (our response)

4) Fit the neural network to the predictors and responses we've seen (run one or more iterations of stochastic gradient descent)

5) Go to 2), gather more data, and continue training.

We end up with a function that will take a board as input, and output its value as determined by the neural network. The more you play, the better the value function gets. This will work, although it takes a long time to train and is even less efficient than our initial brute force method. (see [code](https://github.com/druce/rl/blob/master/Tic-Tac-Toe.ipynb)). 

But in a nutshell, that is how a self-driving car could work. 

- The state is represented by a giant array of inputs from all the onboard cameras and sensors.

- The actions are: turn the steering wheel, accelerate, and brake.

- Positive rewards come from staying on the road and arriving safely at the destination, and negative rewards from breaking traffic laws or colliding.

- The real world provides the state transitions.

- And we train a complex neural network to do all the right things involved in detecting and interpreting all the objects in the environment and navigating from point A to point B.

Table lookup cannot scale to high dimensional or continuous action or state spaces. And a linear function approximator can't learn nonlinear behavior. With deep neural networks, reinforcement learning algorithms can learn complex emergent behavior. 


### Reinforcement Learning for Trading

In a trading context, reinforcement learning allows us to use a market signal to create a profitable trading strategy. 
- You need some better-than-random prediction to trade profitably. The signal can come from regression, predicting a continuous variable ([previously discussed here](https://alphaarchitect.com/2018/12/21/machine-learning-classification-methods-and-factor-investing/)) or classification, predicting a discrete variable such as outperform/underperform (binary classification) or quintiles (multinomial classification) ([previously discussed here](https://alphaarchitect.com/2018/06/05/machine-learning-financial-market-prediction-time-series-prediction-sklearn-keras/)). 
- The reward can be raw return or risk-adjusted return (Sharpe). 
- Reinforcement learning allows you to take a signal and find the optimal policy (trading strategy) to maximize the reward (return or risk-adjusted return).


Here's a simple example showing how one might trade using reinforcement learning. This approach is inspired by the paper ["Machine Learning For Trading" by Gordon Ritter](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3015609).

We are going to use simple simulated market data as a stepping stone to more complex trading environments. Let's create a market price time series as a simple sine wave.

![Simple harmonic motion 1](StocksSHM1.png "SHM 1")

- Initially we set the price at 102 and price momentum at 0.
- Set 100 as the price 'anchor'. Each timestep, the price accelerates toward 100 by an amount proportional to the distance from 100. If the price is 102:
    - the distance from 100 is 2
    - the new price momentum is old momentum - 2 * k
    - the new price is old price + momentum
    
This is simple harmonic motion, it describes the oscillation of a spring ([Hooke's law](https://en.wikipedia.org/wiki/Hooke%27s_law)), a pendulum under small oscillations, and many other periodic systems.

We can view this as an extremely simplified value/momentum model. 100 is the intrinsic value the stock tends toward. The farther away from intrinsic value, the stronger the acceleration back toward intrinsic value. And momentum means that if the stock is trending up or down, the trend takes time to be reversed.
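
The price dynamics described above fit in a few lines. This is a sketch: the function name `shm_prices` and the value of the spring constant `k` are assumptions, since the text does not specify them.

```python
import numpy as np

def shm_prices(n_steps=1000, start=102.0, anchor=100.0, k=0.01):
    """Simulated price that accelerates toward `anchor` in proportion to its distance.

    k is an assumed constant; smaller k gives slower oscillations.
    """
    prices = np.empty(n_steps)
    price, momentum = start, 0.0
    for t in range(n_steps):
        prices[t] = price
        momentum -= k * (price - anchor)   # new momentum = old momentum - distance * k
        price += momentum                  # new price = old price + momentum
    return prices
```

With these defaults the series oscillates between roughly 98 and 102 with a period of about 63 timesteps.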

To trade this stock, we use the REINFORCE algorithm, which is a Monte Carlo policy gradient-based method. (We could also use Q-learning, but policy gradient seems to work better.)

We will simulate many episodes of 1000 training days, observe the outcomes, and train our policy after each episode.

1) Initialize a neural network to choose actions based on the state.
  - 2 hidden layers of 16 units
  - 32 inputs: the last 16 market values (as deviations from 100 or intrinsic value), and the last 16 daily changes. Footnote: For our simple harmonic motion with no noise, one input of the last change would be sufficient. But we use a model we can apply later to a more complex example.
  - 3 outputs of the probabilities of 0, 1, or 2 for short, flat, long respectively (softmax activation)
  - Reward: we buy 1 share based on the model output
    - When we choose 2 (long), the next reward is the change in price at the next timestep
    - When we choose 1 (flat), the next reward is 0
    - When we choose 0 (short), the next reward is the opposite of the change in price at the next timestep
  - Choose the initial neural network θ values at random
  
2) Generate one episode trajectory using the current policy. At each timestep:
- Input the current state to the neural network and generate probabilities for short/flat/long. 
- Sample from the generated probability distribution and take the sampled action. 
- Store all the observed states, actions taken, and rewards.

3) At the end of the trajectory, back up and compute the discounted future reward observed at each timestep, using the action taken and following the current policy to the end of the episode. (Footnote: In this simple example, we can discount future rewards heavily (a discount factor well below 1) because in our model the action taken only impacts the next trading day. In a more complex environment where the current action can impact rewards far in the future, you want to take those rewards into account, and you would discount less, with a discount factor closer to 1.) Standardize the returns (discounted future rewards) by subtracting the mean and dividing by the standard deviation.

4) For each action taken
  - Compute the gradient of the log of that action's probability with respect to the policy thetas (neural network parameters)
  - Estimate the gradient of expected return w.r.t. theta as the average, over all actions taken, of that gradient times the standardized return
  - Update each theta by its estimated gradient, times a learning rate. This will update the policy so that
    - actions with above-average rewards become more probable
    - actions with below-average rewards become less probable

5) Return to 2) and iterate until the policy stops improving.
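
Steps 3 and 4 can be sketched in NumPy. To keep the example short it swaps the two-hidden-layer network for a *linear* softmax policy (`probs = softmax(states @ theta)`), so it illustrates the REINFORCE update rather than reproducing the model described above. The function names and the default `gamma` and `lr` values are assumptions.

```python
import numpy as np

def discounted_returns(rewards, gamma):
    """Back up from the last timestep: G_t = r_t + gamma * G_{t+1}."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def reinforce_step(theta, states, actions, rewards, gamma=0.5, lr=0.01):
    """One REINFORCE update for a linear softmax policy (an assumed simplification).

    theta:   (n_features, n_actions) policy parameters
    states:  (T, n_features) observed states
    actions: (T,) integer actions taken (0 short, 1 flat, 2 long)
    rewards: (T,) rewards received
    """
    returns = discounted_returns(rewards, gamma)
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)   # standardize
    probs = softmax(states @ theta)                  # shape (T, n_actions)
    one_hot = np.eye(theta.shape[1])[actions]        # shape (T, n_actions)
    # For a softmax policy, d log pi(a|s) / d logits = one_hot(a) - probs
    grad_logits = (one_hot - probs) * returns[:, None]
    grad_theta = states.T @ grad_logits / len(rewards)   # average over timesteps
    return theta + lr * grad_theta                   # gradient ascent on expected return
```

With actions coded 0/1/2 for short/flat/long, the per-step reward described above is `(action - 1) * next_price_change`, so a short position earns the opposite of the price change and a flat position earns 0.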

Here is a chart of total reward as we train over 1000 episodes. 

![Simple harmonic motion 2](StocksSHM2.png "SHM 2")

Finally, here is one sample episode with color coding by short/flat/long and reward over the course of the episode.

![Simple harmonic motion 3](StocksSHM3.png "SHM 3")


It's not perfect, there are a couple of days where it strays from the ideal policy, but it's pretty good!



For a more complex example, we take the simple harmonic motion dynamics and add noise + damping.

![Simple harmonic motion + noise](SHMplus1.png "SHM plus noise 1")

![Simple harmonic motion + noise training](SHMplus2.png "SHM plus noise 2")

![Simple harmonic motion + noise outcome](SHMplus3.png "SHM plus noise 3")


Finally let's try the Ornstein-Uhlenbeck (OU) process, which is used in the Ritter paper.

The OU process, like simple harmonic motion, has stronger mean reversion the farther away it is from the mean. But there is no momentum, so unlike simple harmonic motion, in the absence of noise it will not oscillate periodically but just revert asymptotically to the mean.
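
A discretized version of the OU dynamics might look like the sketch below. The function name `ou_prices` and the parameter values (`kappa`, `sigma`, the starting price) are assumptions for illustration and are not taken from the Ritter paper.

```python
import numpy as np

def ou_prices(n_steps=1000, start=102.0, mean=100.0, kappa=0.05, sigma=0.5, seed=0):
    """Discretized OU process: each step reverts a fraction kappa toward the mean, plus Gaussian noise."""
    rng = np.random.default_rng(seed)
    prices = np.empty(n_steps)
    prices[0] = start
    for t in range(1, n_steps):
        pull = kappa * (mean - prices[t - 1])   # stronger the farther from the mean
        prices[t] = prices[t - 1] + pull + sigma * rng.normal()
    return prices
```

Setting `sigma=0.0` shows the behavior described above: with no momentum term, the price decays smoothly toward the mean instead of oscillating around it.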

Here is OU process plus noise.

The Ritter paper does this with multiple stocks. This would be fairly straightforward from here by changing the input to have states from multiple stocks, adding multiple outputs for multiple stocks, and computing the reward as a portfolio return. The Ritter paper also uses a Sharpe reward, and finds that the algorithm successfully optimizes mean-variance, which is a nice result. All the model was given was rewards, and it was told nothing about how the stocks behaved or how portfolio return was computed. 

But I think this is long enough and sufficient to illustrate the fundamentals of reinforcement learning, and I'll stop here. 



### Deeper technical concepts

#### Monte Carlo vs. TD and the strange-loopy bootstrap

In our policy gradient algorithm:

- We run an episode.
- We back up from the final timestep to the beginning using observed rewards to compute discounted rewards over the full episode.
- We train by ascending the theta gradient that improves standardized rewards.

A self-driving car algorithm doesn't have short episodes like a trading day, so we can't easily do that. An alternative is temporal difference learning (TD):

- We use a value function which estimates the expected future reward from this state, following the current policy.
- We run one timestep
- We back up one timestep and compute the difference between the expected value we saw at the last timestep and the value after we took this action (the reward from this action, plus the discounted current expected value)
- This improvement is *advantage*, and we train by ascending the theta gradient that improves the probability of the most advantaged actions.

This is actually a slightly strange magical recursive loop, because at the outset our policy has random thetas. So we are training on the improvement from our fairly random value to the slightly less random value at the next timestep where we know one reward. Nevertheless, as we do this many, many times, the influence of rewards further in the future filters back one step at a time.
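
Here is a minimal tabular sketch of the value-function half of that loop, showing how a reward filters back one step at a time. The function name `td0_update`, the dictionary-based value table, and the `alpha` and `gamma` defaults are assumptions; the policy update that would use the TD error as an advantage signal is omitted.

```python
def td0_update(values, state, reward, next_state, alpha=0.1, gamma=0.9, done=False):
    """One TD(0) step: nudge V(state) toward reward + gamma * V(next_state)."""
    bootstrap = 0.0 if done else gamma * values.get(next_state, 0.0)
    target = reward + bootstrap
    td_error = target - values.get(state, 0.0)   # better or worse than we expected?
    values[state] = values.get(state, 0.0) + alpha * td_error
    return td_error

# A two-state chain: A -> B -> end, with a reward of 1 only on the last transition.
chain_values = {}
for sweep in range(2):
    td0_update(chain_values, 'A', 0.0, 'B')
    td0_update(chain_values, 'B', 1.0, 'end', done=True)
```

After the first sweep only `B` has a nonzero value, because `A` bootstrapped from an estimate that was still zero. On the second sweep `A` picks up a small value from `B`, even though `A` never receives a reward directly.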

When we train on the improvement using our policy between now and the next step, it's called one-step TD, or TD(0). We can also train on the improvement 2 steps into the future (two-step TD), 3 steps, and so on; these are *n-step* TD methods. If we let n grow to &infin; we are projecting through the end of the episode, however long it may be, and we are back to Monte Carlo learning. 

Finally, there's a temporal difference mode known as TD(&lambda;) where we effectively use an exponentially weighted average of all the n-step returns. Setting &lambda; to 0 is effectively TD(0), setting &lambda; to 1 is effectively Monte Carlo, and calibrating &lambda; determines how far into the future we want to peek.
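
The spectrum from one-step TD to Monte Carlo can be seen in a single helper that looks ahead a chosen number of steps and then falls back on the value estimate. The name `n_step_return` and the convention that `values[i]` is the estimated value of the state at time `i` are assumptions for this sketch.

```python
def n_step_return(rewards, values, t, n, gamma=0.9):
    """Sum n discounted rewards from time t, then bootstrap from the value estimate.

    If the lookahead reaches the end of the episode, no bootstrap is added,
    which makes the result the plain Monte Carlo return.
    """
    T = len(rewards)
    end = min(t + n, T)
    g = sum(gamma ** (i - t) * rewards[i] for i in range(t, end))
    if end < T:
        g += gamma ** (end - t) * values[end]   # estimated value of where we landed
    return g
```

With `n=1` this is the one-step TD target; as `n` grows past the remaining episode length, the value estimates drop out entirely and only observed rewards remain.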

#### Revisiting value-based v. policy-based methods

If we do TD learning with only a state-action value (Q-value) neural network function approximator, and our policy is to choose the action with the best Q-value, this is called deep Q-learning, and the network is a Deep Q-Network (DQN). If we use a value function and a policy function, and train the policy function separately so that it improves the value function as much as possible, this is actor-critic learning.

#### The exploration vs. exploitation tradeoff

When we do Q learning, our policy is to choose the action with the best resulting state-value. However there is a strong possibility that early in our training one action is always best in some part of the state space. So, since we never try the other actions, our training never surfaces situations where they may be better. 

To avoid this, we do what is called &epsilon;-greedy exploration. Early on, we follow our policy say 60% of the time, and take a random action 40% of the time. This allows us to search the whole action space. 40% is the &epsilon; parameter. In practice since our policy network is random at the beginning, we typically start with &epsilon; at 100% and gradually reduce it to a very small value. 
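
A minimal sketch of &epsilon;-greedy action selection with a decaying &epsilon;. The linear decay schedule and its `start`, `end`, and `decay_steps` values are assumptions; other schedules (exponential, stepwise) are also common.

```python
import random

def epsilon_greedy(q_values, epsilon):
    """With probability epsilon pick a random action, otherwise the highest-valued one."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def decayed_epsilon(step, start=1.0, end=0.01, decay_steps=10_000):
    """Linear decay from `start` to `end` over `decay_steps` training steps (assumed values)."""
    fraction = min(step / decay_steps, 1.0)
    return start + fraction * (end - start)
```

At each training step one would call `epsilon_greedy(q_values, decayed_epsilon(step))`, so early actions are fully random and later actions mostly follow the learned values.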

#### On-policy vs. off-policy learning

Q-learning with &epsilon;-greedy exploration is 'off-policy' learning: the agent sometimes acts differently from the greedy policy it is learning about, and still trains on the resulting outcomes. This contrasts with 'on-policy' algorithms, which learn about the same policy they use to act. Policy gradient algorithms sample actions from a probability distribution. Bad actions never have a strictly zero probability, they just get reduced over time, so they implicitly trade off exploration of new actions, vs. exploitation of the best known actions.

#### TRPO and PPO, or how to avoid falling off a cliff

Finally, we have noticed that sometimes training will fall off a cliff. Through the beauty of extreme nonlinearity of neural networks, a small gradient descent update may generate a large change in the policy outcomes. Sometimes the policy is much worse and has trouble climbing back up. One could avoid that with very small steps, i.e. a small learning rate, but then training takes forever. Two popular variations that address this issue are Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO). Essentially they avoid or penalize updates which change the output policy too much in one update.
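
As an illustration of how PPO limits the size of an update, here is a sketch of its clipped surrogate objective in NumPy. The function and argument names are assumptions, and a full PPO implementation involves much more (a value function, multiple epochs over each batch, and so on).

```python
import numpy as np

def ppo_clipped_objective(new_probs, old_probs, advantages, clip=0.2):
    """PPO's clipped surrogate objective (to be maximized), averaged over sampled actions.

    new_probs / old_probs: probability of each taken action under the updated / previous policy
    advantages:            how much better than expected each action turned out
    """
    new_probs = np.asarray(new_probs, dtype=float)
    old_probs = np.asarray(old_probs, dtype=float)
    advantages = np.asarray(advantages, dtype=float)
    ratio = new_probs / old_probs                       # how much the policy changed
    clipped = np.clip(ratio, 1.0 - clip, 1.0 + clip)    # cap the change we give credit for
    return np.mean(np.minimum(ratio * advantages, clipped * advantages))
```

Because of the `minimum`, the objective stops rewarding the optimizer once the probability ratio moves more than `clip` away from 1 in the direction the advantage favors, which removes the incentive for one large, destabilizing policy change.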

You are not expected to understand all of this. But this may give a flavor of the many possible variations and an understanding of a table such as this one from [Wikipedia](https://en.wikipedia.org/wiki/Reinforcement_learning#Comparison_of_reinforcement_learning_algorithms)

![Table of algorithms](RLtable.png)



### Takeaways

Like the man in the Molière play who discovers he's been speaking in prose his whole life, you may have been doing reinforcement learning and optimal control your whole life without realizing it.

Reinforcement learning bridges the gap between prediction and trading. Sometimes one may find that a very small predictive R-squared can lead to high returns. Hypothetically, suppose you have a stock market that yields a 5% return. And the single best day each year is up 5%. Suppose your predictive algorithm always gets that one day right, and is random the rest of the time. With a fraction of a percent increase in R-squared you almost double your expected return. On the other hand, a 1-day-per-year improvement in forecasting an up vs. down day, evenly distributed, gives a much smaller improvement (roughly the difference between the expected return on an up day and the expected return over all days).

How is reinforcement learning different from backtesting? Backtesting exhaustively searches a parameter space for a parameter vector that gives best out-of-sample performance. When we use reinforcement learning with a neural network and gradient descent, we can use much more complex models where there are too many parameter combinations to backtest, and still get a good parameter vector.

Training big models end-to-end simultaneously for prediction and control results in complex emergent behavior that can display almost human-seeming intelligence.

Additionally, reinforcement learning works in an inherently online fashion, as opposed to approaches like walk-forward time-based cross-validation that iteratively fit and test models out-of-sample.

A few issues with reinforcement learning ([further discussion](https://www.alexirpan.com/2018/02/14/rl-hard.html)):

  - RL is very data-hungry or sample-inefficient, more suited to intraday trading.
  - High model complexity makes interpretability challenging.
  - RL can get stuck at local optima/fall off a cliff. You may have to take special care to not just train on recent experience but also important but rare special cases, the way pilots train for equipment failure and upset recovery.

There is a parallel between reinforcement learning and the [adaptive market hypothesis](https://alo.mit.edu/book/adaptive-markets/) of [Andrew Lo](https://blogs.cfainstitute.org/investor/2017/12/18/the-adaptive-markets-hypothesis-a-financial-ecosystems-survival-guide/). Markets may not be perfectly efficient at all times, but they tend to move in that direction via an evolutionary learning process based on experience.

JPMorgan and others have reported using RL to trade in the real world, see for instance this [paper](https://arxiv.org/pdf/1802.03042.pdf) and more readable [article](https://informaconnect.com/the-latest-in-loxm-and-why-we-shouldnt-be-using-single-stock-algos/). 

AI algorithms can sometimes be exploited. An adversarial sticker can make image recognition think [a banana is a toaster](https://medium.com/deep-learning-cafe/neural-networks-easily-fooled-e19bf575b527) or an adversarial temporary tattoo can [defeat face recognition](https://cvdazzle.com/). This may be a problem for trading with reinforcement learning. If a market maker algorithm trades on patterns, adversarial algorithms can learn to front-run it or paint the tape to make it do bad trades. But over time the algorithms should adapt to each other and arrive at a new, more efficient price equilibrium. (Or create mad algorithm-induced oscillations? We can't necessarily know.)

I'm not sure self-driving vehicles on the streets of New York or New Delhi are likely in the near future, without changes like protected lanes for self-driving vehicles. If pedestrians know that the other driver is always going to stop for them no matter what, they will learn to just cross at the red light, never mind traffic. They can even wear a stop sign on a T-shirt. It's not a matter of how good the self-driving technology is, it's a question of game theory. Knowing that the other driver is a fallible human who at best may be angry and honk and give you the finger, and at worst may be on a cell phone and not even see you tends to concentrate the mind. 

Nevertheless, anecdotally, poker bots seem to be taking over small-stakes online poker, and it seems like a safe bet reinforcement learning will be increasingly adopted in financial markets. Controversial statement alert: short-term technical traders who look for setups lasting minutes to hours will be no match for algos that adapt quickly to patterns across all markets.

There's a need to combine brute force RL with models that understand real-world finance and economics, just as chess algorithms, AlphaGo, and AlphaZero model their games.

Further reading:

Courses
- [UCL course by David Silver (videos and lecture notes)](http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching.html)
- [Stanford course](http://web.stanford.edu/class/cs234/schedule.html)
- [Berkeley course](http://rail.eecs.berkeley.edu/deeprlcourse/)

Books
- [Reinforcement Learning: An Introduction, Richard S. Sutton and Andrew G. Barto](http://incompleteideas.net/book/the-book-2nd.html)
- https://www.amazon.com/Deep-Reinforcement-Learning-Python-Hands/dp/0135172381
- [Algorithms for Reinforcement Learning, Csaba Szepesvári](https://sites.ualberta.ca/~szepesva/papers/RLAlgsInMDPs.pdf)
- http://www.deeplearningbook.org/
- https://www.amazon.com/Artificial-Intelligence-Modern-Approach-4th/dp/0134610997/ref=dp_ob_title_bk

Key papers and blog posts
- http://karpathy.github.io/2016/05/31/rl/
- https://arxiv.org/pdf/1802.03042.pdf

- https://spinningup.openai.com/en/latest/spinningup/keypapers.html
- https://www.econstor.eu/bitstream/10419/183139/1/1032172355.pdf
