# Reinforcement Learning References

© 2019-2022, Anyscale. All Rights Reserved

![Anyscale Academy](../images/AnyscaleAcademyLogo.png)

RL is a deep topic and a focus of intense research. We can only scratch the surface here, but the following references may be useful. See also links in the [Introduction to Reinforcement Learning](01-Introduction-to-Reinforcement-Learning.ipynb) lesson and other lessons.

## Books

Several books are available on RL:

* [*Reinforcement Learning: An Introduction*](https://mitpress.mit.edu/books/reinforcement-learning-second-edition), by Richard S. Sutton and Andrew G. Barto, MIT Press, 2018. This is the definitive textbook. Deep, but highly recommended. See this independent [repo of Python code](https://github.com/Pulkit-Khandelwal/Reinforcement-Learning-Notebooks).
* [*Practical Reinforcement Learning*](https://www.endtoend.ai/practical-rl/), by Seungjae Ryan Lee.
* [*Hands-On Reinforcement Learning with Python*](https://learning.oreilly.com/library/view/hands-on-reinforcement-learning/9781788836524/), by Sudharsan Ravichandiran, Packt, 2018.
* [*Hands-On Reinforcement Learning for Games*](https://www.packtpub.com/game-development/hands-on-game-ai-with-python), by Micheal Lanham, Packt, 2020.
* [*Grokking Deep Reinforcement Learning*](https://www.manning.com/books/grokking-deep-reinforcement-learning), by Miguel Morales, Manning, 2020 (preview). "Deep RL" means using deep learning as part of the training system.

## Blogs

Several blog posts and series provide concise introductions to RL:

* [Intro to RLlib: Example Environments](https://medium.com/distributed-computing-with-ray/intro-to-rllib-example-environments-3a113f532c70).
* [Anatomy of a custom environment for RLlib](https://medium.com/distributed-computing-with-ray/anatomy-of-a-custom-environment-for-rllib-327157f269e5).
* [A Reinforcement Learning Cheat Sheet](https://towardsdatascience.com/reinforcement-learning-cheat-sheet-2f9453df7651).
* [Reinforcement Learning Explained](https://www.oreilly.com/radar/reinforcement-learning-explained/), Junling Hu, 2016. A gentle introduction to the ideas of RL.
* [A Beginner's Guide to Deep Reinforcement Learning](https://pathmind.com/wiki/deep-reinforcement-learning), Pathmind, 2019. Pathmind uses RLlib for its products and services. The post ends with many good references.
* [An Outsider's Tour of Reinforcement Learning](http://www.argmin.net/2018/06/25/outsider-rl/), Ben Recht, 2018. A series of posts on technical aspects of RL.
* [Exploration Strategies in Deep Reinforcement Learning](https://lilianweng.github.io/lil-log/2020/06/07/exploration-strategies-in-deep-reinforcement-learning.html), Lilian Weng, Jun 7, 2020.
* [The 32 Implementation Details of Proximal Policy Optimization (PPO) Algorithm](https://costa.sh/blog-the-32-implementation-details-of-ppo.html), Costa Huang, 2019.

## Other Tutorials and Academic Courses on RL

* [OSU: Distributed AI with Ray](http://web.engr.oregonstate.edu/~afern/distributed-AI-labs/osu-distributed-ai.html)
* [University College London COMPM050/COMPGI13](https://www.davidsilver.uk/teaching/)
* [UC Berkeley CS 285](http://rail.eecs.berkeley.edu/deeprlcourse/)
* [CS 294 Deep Reinforcement Learning, Spring 2017](http://rll.berkeley.edu/deeprlcourse/)
* [A Tutorial on Reinforcement Learning I - YouTube](https://www.youtube.com/watch?v=fIKkhoI1kF4)
* [A Tutorial on Reinforcement Learning II - YouTube](https://www.youtube.com/watch?v=8hK0NnG_DhY)
* [ICML 2017 Tutorial](https://sites.google.com/view/icml17deeprl)
* [Ray Summit 2021 RLlib Tutorial](https://www.anyscale.com/events/2021/06/24/hands-on-reinforcement-learning-with-rays-rllib)

## Papers

Here is a small sample of papers on various RL topics.

### General Reinforcement Learning

#### PPO and TRPO

* John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, Oleg Klimov, "Proximal Policy Optimization Algorithms", July 2017, [arxiv](https://arxiv.org/abs/1707.06347). The paper that introduced PPO (Proximal Policy Optimization). It is a variant of _Trust Region Policy Optimization_ (TRPO), described in the next paper.
* John Schulman, Sergey Levine, Philipp Moritz, Michael I. Jordan, Pieter Abbeel, "Trust Region Policy Optimization", February 2015, [arxiv](https://arxiv.org/abs/1502.05477).
* OpenAI, "Proximal Policy Optimization", [blog post](https://openai.com/blog/openai-baselines-ppo/). An accessible introduction to PPO.

#### DQN and Variants

* Volodymyr Mnih, Koray Kavukcuoglu, David Silver, et al., "Playing Atari with Deep Reinforcement Learning", December 2013, [arxiv](https://arxiv.org/abs/1312.5602). The original paper describing DQN.
* Volodymyr Mnih, Koray Kavukcuoglu, David Silver, et al., "Human-level control through deep reinforcement learning", Nature 518, 529–533 (2015), [Nature](https://doi.org/10.1038/nature14236).
* Dan Horgan, John Quan, David Budden, et al., "Distributed Prioritized Experience Replay", March 2018, [arxiv](https://arxiv.org/abs/1803.00933).
* Matteo Hessel, Joseph Modayil, Hado van Hasselt, et al., "Rainbow: Combining Improvements in Deep Reinforcement Learning", October 2017, [arxiv](https://arxiv.org/abs/1710.02298).
* Hado van Hasselt, Arthur Guez, and David Silver, "Deep Reinforcement Learning with Double Q-learning", December 2015, [arxiv](https://arxiv.org/abs/1509.06461).
* Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver, "Prioritized Experience Replay", November 2015, [arxiv](https://arxiv.org/abs/1511.05952).
* Ziyu Wang, Tom Schaul, Matteo Hessel, et al., "Dueling Network Architectures for Deep Reinforcement Learning", November 2015, [arxiv](https://arxiv.org/abs/1511.06581).

#### Recommender Systems Using RL

* Ken Goldberg, Theresa Roeder, Dhruv Gupta, Chris Perkins, "Eigentaste: A Constant Time Collaborative Filtering Algorithm", *Information Retrieval*, 4(2), 133-151 (July 2001) [pdf](https://goldberg.berkeley.edu/pubs/eigentaste.pdf).
* Julian McAuley and Jure Leskovec, "From Amateurs to Connoisseurs: Modeling the Evolution of User Expertise through Online Reviews", [arxiv](https://arxiv.org/abs/1303.4402) (March 18, 2013).


### Multi-Armed Bandits

* Djallel Bouneffouf, Irina Rish, "A Survey on Practical Applications of Multi-Armed and Contextual Bandits", [arxiv](https://arxiv.org/abs/1904.10040).

### Upper Confidence Bound

* Lihong Li, Wei Chu, John Langford, Robert E. Schapire, "A contextual-bandit approach to personalized news article recommendation", Proceedings of the 19th International Conference on World Wide Web (WWW 2010), [arxiv](https://arxiv.org/abs/1003.0146).
* Wei Chu, Lihong Li, Lev Reyzin, Robert E. Schapire, "Contextual bandits with linear payoff functions", Proceedings of the 14th International Conference on Artificial Intelligence and Statistics (AISTATS 2011), [link](http://proceedings.mlr.press/v15/chu11a.html).
* T. L. Lai, Herbert Robbins, “Asymptotically efficient adaptive allocation rules”, Advances in Applied Mathematics, Volume 6, Issue 1 (1985), pp. 4-22, [link](https://doi.org/10.1016/0196-8858(85)90002-8).
* M. N. Katehakis and H. Robbins, “Sequential choice from several populations”, Proc. Natl. Acad. Sci. U.S.A., 1995 Sep 12; 92(19): 8584–8585, [link](https://doi.org/10.1073/pnas.92.19.8584).
* Warrick Masson, Pravesh Ranchod, George Konidaris, "Reinforcement Learning with Parameterized Actions", [arxiv](https://arxiv.org/abs/1509.01644).

### Thompson Sampling

* William R. Thompson, "On the likelihood that one unknown probability exceeds another in view of the evidence of two samples". Biometrika, 25(3–4):285–294, 1933, [pdf](https://www.dropbox.com/s/yhn9prnr5bz0156/1933-thompson.pdf).
* Daniel J. Russo, Benjamin Van Roy, Abbas Kazerouni, Ian Osband and Zheng Wen, "A Tutorial on Thompson Sampling", Foundations and Trends in Machine Learning: Vol. 11: No. 1, pp 1-96, 2018, [pdf](https://web.stanford.edu/~bvr/pubs/TS_Tutorial.pdf).
* Carlos Riquelme, George Tucker, Jasper Snoek, “Deep Bayesian Bandits Showdown: An Empirical Comparison of Bayesian Deep Networks for Thompson Sampling”, ICLR 2018, [arxiv](https://arxiv.org/abs/1802.09127). This paper introduces the _WheelBandit_ used in our Thompson Sampling MAB lesson and also discusses exploration algorithms not discussed in this tutorial.
* Shipra Agrawal, Navin Goyal, "Analysis of Thompson Sampling for the Multi-armed Bandit Problem", JMLR: Workshop and Conference Proceedings vol 23 (2012) 39.1–39.26, [pdf](http://proceedings.mlr.press/v23/agrawal12/agrawal12.pdf).
* Shipra Agrawal, Navin Goyal, "Thompson Sampling for Contextual Bandits with Linear Payoffs", Proceedings of the 30th International Conference on Machine Learning, Atlanta, Georgia, USA, 2013, [pdf](http://proceedings.mlr.press/v28/agrawal13.pdf).

## RISELab

The RISE Lab at U.C. Berkeley has many useful tutorials, videos, and other resources:

* [RISE Lab YouTube channel](https://www.youtube.com/channel/UCP2-wiA964pif0secCpPbfw/videos)
* [RISE Camp 2019](https://risecamp.berkeley.edu/)