
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 1200px">
</div>

# Introduction to Reinforcement Learning

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) In this lesson you learn:<br>
 - Types of Machine Learning problems
 - Reinforcement Learning problem
 - Agent
 - Environment
 - RL vocabulary
 - RL shortcomings
 
## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Audience
* Primary Audience: This course is ideal for data scientists who are interested in learning about next-level algorithms

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Prerequisites
* Web browser: **Chrome**
* A cluster configured with **8 cores** and **DBR 7.0 ML**
* Experience with Python, numpy, and pandas is required
* Familiarity with Probability Theory, Linear Algebra, and Machine Learning is required
* Suggested Courses from <a href="https://academy.databricks.com/" target="_blank">Databricks Academy</a>:
  - DB 096 - Just Enough Python for Apache Spark™
  

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) References
* [David Silver lecture](https://www.youtube.com/watch?v=2pWv7GOvuf0&list=PLqYmG7hTraZDM-OYHWgPebj2MfCFzFObQ)
* Sutton and Barto, *Reinforcement Learning: An Introduction*, Chapter 1

### Where does RL fit in Machine Learning?
<br/><br/>
![RL in ML](https://files.training.databricks.com/images/rl/rl.png)
<br>
### RL's Interaction with Other Fields
![RL application](https://files.training.databricks.com/images/rl/RL_application.png)

### RL's Characteristics
<br>
0. Trial and error (like a child learning to walk)
0. Sequential process, so time matters (data is not i.i.d.)
0. Current decision affects later outcome
0. Reward signal

### Examples:

0. [Helicopter](https://www.youtube.com/watch?v=VCdxqn0fcnE)
0. [Atari game](https://www.youtube.com/watch?v=V1eYniJ0Rnk)

### Reinforcement Learning Setup
<br>
![RL agent and environment](https://files.training.databricks.com/images/rl/RL_agent_env.png)

**At each time step \\(t\\), the agent:**
0. Receives an observation \\(O\_{t}\\).
0. Takes an action \\(A\_{t}\\), i.e. the decision made by the agent (the algorithm).
0. Receives a reward \\(R\_{t}\\): a scalar feedback signal that indicates how well the agent is doing.

**At each time step \\(t\\), the environment:**
0. Receives an action \\(A\_{t}\\), i.e. the decision made by the agent (the algorithm).
0. Emits an observation \\(O\_{t+1}\\): the response to the agent's action.
0. Emits a reward \\(R\_{t+1}\\): a scalar feedback signal that indicates how well the agent is doing.

**RL assumption: All goals can be described by the maximization of expected cumulative reward**
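
The agent and environment steps above can be sketched as a simple loop. This is a minimal illustration, not part of the original course material: `CoinFlipEnv`, `always_heads`, and `run_episode` are hypothetical names, and the environment (guessing a biased coin) is an assumption chosen only because it is small.

```python
import random


class CoinFlipEnv:
    """Hypothetical toy environment: the agent guesses a biased coin flip."""

    def __init__(self, p_heads=0.7, horizon=10, seed=0):
        self.p_heads = p_heads
        self.horizon = horizon
        self.rng = random.Random(seed)
        self.t = 0

    def reset(self):
        """Start an episode and return the initial observation O_0."""
        self.t = 0
        return 0  # no information yet

    def step(self, action):
        """Receive A_t, then emit (O_{t+1}, R_{t+1}, done)."""
        outcome = 1 if self.rng.random() < self.p_heads else 0
        reward = 1.0 if action == outcome else 0.0
        self.t += 1
        done = self.t >= self.horizon
        return outcome, reward, done


def always_heads(observation):
    """A trivial policy: ignore the observation and always guess heads (1)."""
    return 1


def run_episode(env, policy):
    """Run one episode and return the cumulative (undiscounted) reward."""
    observation = env.reset()
    total_reward = 0.0
    done = False
    while not done:
        action = policy(observation)                  # agent: O_t -> A_t
        observation, reward, done = env.step(action)  # env: A_t -> O_{t+1}, R_{t+1}
        total_reward += reward
    return total_reward


episode_return = run_episode(CoinFlipEnv(), always_heads)
print("cumulative reward:", episode_return)
```

The quantity returned by `run_episode` is the cumulative reward that the RL assumption refers to; an agent would try to pick a policy that maximizes its expected value.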

### Exercise ###
- Discuss some examples of agent, environment, action, observation, and reward.

**Things to keep in mind:**
0. What actions should we take at each step?
0. Long vs. short term impact of an action?
0. Should we wait or take immediate best action?
0. Reward might be delayed? Examples?

### History and State
<br>
0. All observable variables up to time \\(t\\)
    * \\(H\_{t} =  O\_{0}, A\_{0}, R\_{1}, O\_{1}, A\_{1}, R\_{2},..., O\_{t-1}, A\_{t-1}, R\_{t}, O\_{t}\\)
0. What happens next depends on history. How?
0. **State** is the information used to determine what happens next (depends on the history)
    * \\(S_t = f(H_t)\\)
0. Agent's vs. environment's state (\\(S_t^a\\) and \\(S_t^e\\)): the state representation of agent and environment, respectively
0. Most of the time, the environment state is not visible to the agent.
0. **Information state** contains all useful information from the history. The state is **Markov** if the future is independent of the past given the present, i.e. we only need the current state:
$$ P[ S\_{t+1} \bigm\vert S\_{t} ] = P[ S\_{t+1} \bigm\vert S\_{1},S\_{2},...,S\_{t} ] $$
0. Examples of Markov state? 
    * \\(S_t^e\\)?
    * \\(H_t\\)?

### State definition matters!
<br>
![rat, cheese, lever](https://files.training.databricks.com/images/rl/rat_cheese_lever.jpg)
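
A minimal sketch of why the choice of \\(f\\) in \\(S_t = f(H_t)\\) matters. The two histories and the helper names below are hypothetical and are not taken from the figure; they only show that two state definitions can disagree about whether two histories describe the same situation.

```python
from collections import Counter

# Two hypothetical observation histories that differ only in the first event.
history_a = ["light", "light", "lever", "bell"]
history_b = ["bell", "light", "lever", "bell"]


def state_last_three(history):
    """State = the last three observations."""
    return tuple(history[-3:])


def state_counts(history):
    """State = how often each observation has occurred so far."""
    return tuple(sorted(Counter(history).items()))


# Under the "last three" definition the two histories look identical...
print(state_last_three(history_a) == state_last_three(history_b))
# ...but under the "counts" definition they are different states.
print(state_counts(history_a) == state_counts(history_b))
```

An agent that predicts the next outcome from its state would make the same prediction for both histories under the first definition and could make different predictions under the second.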

### Fully vs. Partially Observable Environments 
<br>
0. Full observability: the agent directly observes the environment state. Formally, this is a Markov Decision Process. 

    * Examples?
    * \\(O\_{t} = S\_{t}^a = S\_{t}^e\\)
0. Partial observability: the agent indirectly observes the environment. 

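    * Requires memory. Possible agent state representations: 
     0. Complete history: \\(S\_{t}^a = H\_{t}\\)
     0. Beliefs about the environment state: \\(S\_{t}^a = (P[S\_{t}^e=s^1],...,P[S\_{t}^e=s^n])\\)
     0. Recurrent neural network (RNN)
     0. Examples?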
     
Note: The remainder of this course focuses on **Markov Decision Processes**.

### Components of an RL Agent 
<br>
0. Policy: agent's behavior 
    * Deterministic : \\(\pi(s)\\)
    * Stochastic: \\(\pi(a \bigm\vert s) = P[A_t = a \bigm\vert S_t = s]\\)
0. Value function: how good is each state/action? Think of it as the expected value of all discounted future rewards given the current state.
    * \\(v\_\pi(s) = E\_\pi[R\_{t+1} + \gamma R\_{t+2} + \gamma^2R\_{t+3}+... \bigm\vert S_t = s] \\)
0. Model: agent's representation of the environment

    * Predicts what the environment will do next 
    * Predicts the next state \\(\rho\_{ss'}^a = P[S_{t+1} = s' \bigm\vert S_t = s, A_t = a]\\) 
    * Predicts the next reward \\(R\_{s}^a = E[R_{t+1} \bigm\vert S_t = s, A_t = a]\\)
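
The policy and value-function definitions above can be made concrete with a short sketch. The state names, action names, and probabilities are hypothetical; `discounted_return` computes the quantity inside the expectation of \\(v\_\pi(s)\\) for one sampled reward sequence.

```python
import random

# Deterministic policy pi(s): one action per state (hypothetical states/actions).
deterministic_policy = {"s0": "right", "s1": "right"}

# Stochastic policy pi(a | s): a probability for each action in each state.
stochastic_policy = {
    "s0": {"left": 0.2, "right": 0.8},
    "s1": {"left": 0.5, "right": 0.5},
}


def sample_action(policy, state, rng):
    """Draw an action a with probability pi(a | s)."""
    actions, probabilities = zip(*policy[state].items())
    return rng.choices(actions, weights=probabilities, k=1)[0]


def discounted_return(rewards, gamma):
    """Compute R_{t+1} + gamma * R_{t+2} + gamma^2 * R_{t+3} + ... for one trajectory."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


rng = random.Random(0)
print(sample_action(stochastic_policy, "s0", rng))
print(discounted_return([1, 1, 1], gamma=0.5))  # 1 + 0.5 + 0.25 = 1.75
```

Averaging `discounted_return` over many trajectories that start in the same state and follow the same policy gives an estimate of the value of that state.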

### Exercise
<br>
Consider the following grid. The arrows indicate the policy.
0. What are the states?
0. Given the policy (arrows) find the value of each state? 
    * Assume reward is -1 per time step
    * Assume the value of terminal state is 0
    * Assume \\(\gamma = 1\\)
0. What is the agent's representation of the environment, i.e. what is the model? What is the reward?
<br><br>
<img src="https://files.training.databricks.com/images/rl/maze.png" width="700">
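
As a hint for the exercise, here is a minimal sketch of evaluating a fixed deterministic policy under the same assumptions (reward of -1 per time step, terminal value 0, \\(\gamma = 1\\)). It uses a hypothetical four-cell corridor rather than the maze in the figure, and `evaluate_policy` is an illustrative helper, not a library function.

```python
def evaluate_policy(next_state, terminal, reward=-1.0, gamma=1.0,
                    tol=1e-9, max_sweeps=1000):
    """Iteratively evaluate a deterministic policy.

    next_state maps each non-terminal state to the state the policy leads to.
    With gamma = 1 the policy must eventually reach the terminal state,
    otherwise the values keep decreasing until max_sweeps is hit.
    """
    values = {state: 0.0 for state in next_state}
    values[terminal] = 0.0  # the terminal state has value 0 by assumption
    for _ in range(max_sweeps):
        delta = 0.0
        for state, successor in next_state.items():
            new_value = reward + gamma * values[successor]
            delta = max(delta, abs(new_value - values[state]))
            values[state] = new_value
        if delta < tol:
            break
    return values


# Hypothetical corridor: cells 0, 1, 2 and a goal cell 3; the policy always moves right.
corridor_policy = {0: 1, 1: 2, 2: 3}
corridor_values = evaluate_policy(corridor_policy, terminal=3)
print(corridor_values)  # {0: -3.0, 1: -2.0, 2: -1.0, 3: 0.0}
```

Under these assumptions the value of a state is simply the negative of the number of steps the policy needs to reach the goal from that state.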

### Types of RL agents
![RL Taxonomy](https://files.training.databricks.com/images/rl/rl_taxonomy.png)

### Types of problems in sequential decision making
<br>
1. RL
    - The environment is initially unknown
    - The agent interacts with the environment
    - The policy is improved iteratively through this interaction
    - Example?
2. Planning
    - A model of the environment is known
    - The agent computes with the model, without any interaction with the environment
    - The agent improves its policy
    - Example?

### Exploration vs. Exploitation
<br>
1. RL is trial-and-error learning. It is similar to how a child learns to walk
2. How does an agent learn a good policy?
    - Its past experiences. It uses information it learned from the environment to maximize the reward. This is called **exploitation**.
        * Example?
    - New experiences. It needs to find more information about the environment. This is called **exploration**.
        * Example?
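
One common way to balance the two is an epsilon-greedy rule, which the course has not introduced at this point; it is shown here only as a hedged illustration. The sketch assumes a hypothetical multi-armed bandit with Gaussian rewards, and `epsilon_greedy` and `run_bandit` are illustrative names.

```python
import random


def epsilon_greedy(estimates, epsilon, rng):
    """With probability epsilon explore (random arm), otherwise exploit (best estimate)."""
    if rng.random() < epsilon:
        return rng.randrange(len(estimates))                      # exploration
    return max(range(len(estimates)), key=lambda a: estimates[a])  # exploitation


def run_bandit(true_means, epsilon=0.1, steps=2000, seed=0):
    """Estimate the mean reward of each arm while acting epsilon-greedily."""
    rng = random.Random(seed)
    estimates = [0.0] * len(true_means)
    counts = [0] * len(true_means)
    for _ in range(steps):
        arm = epsilon_greedy(estimates, epsilon, rng)
        reward = rng.gauss(true_means[arm], 1.0)  # noisy reward from the chosen arm
        counts[arm] += 1
        # Incremental average of the rewards observed for this arm.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return estimates, counts


estimates, counts = run_bandit(true_means=[0.1, 0.5, 0.9])
print("estimated means:", estimates)
print("pulls per arm:", counts)
```

Setting `epsilon=0` gives a purely exploiting agent that can lock onto a poor arm early, while `epsilon=1` gives a purely exploring agent that never uses what it has learned.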

### Prediction vs. Control
0. Prediction problem's goal: given a policy, find its value function.
0. Control problem's goal: find the best policy over all possible policies and, accordingly, the optimal value function.
0. Example on the gridworld problem: <br><br><br><br>
![Prediction](https://files.training.databricks.com/images/rl/prediction.png)<br><br><br><br>
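
The difference between the two problems can be sketched on a small example. This is a minimal illustration that assumes the model is known (the planning setting) and uses a hypothetical four-cell corridor with a reward of -1 per step and \\(\gamma = 0.9\\), not the gridworld in the figure. `predict` evaluates a given policy; `control` searches for the best one using value iteration.

```python
GOAL = 3                           # terminal cell
STATES = [0, 1, 2]                 # non-terminal cells
ACTIONS = {"left": -1, "right": 1}
GAMMA = 0.9


def step(state, action):
    """Known model: deterministic move, clipped to the corridor, reward -1."""
    next_state = min(max(state + ACTIONS[action], 0), GOAL)
    return next_state, -1.0


def backup(values, state, action):
    """One-step lookahead: immediate reward plus discounted value of the next state."""
    next_state, reward = step(state, action)
    return reward + GAMMA * values[next_state]


def predict(policy, sweeps=200):
    """Prediction: compute the value function of a given stochastic policy."""
    values = {s: 0.0 for s in STATES + [GOAL]}
    for _ in range(sweeps):
        for s in STATES:
            values[s] = sum(p * backup(values, s, a) for a, p in policy[s].items())
    return values


def control(sweeps=200):
    """Control: compute the optimal value function and a greedy policy for it."""
    values = {s: 0.0 for s in STATES + [GOAL]}
    for _ in range(sweeps):
        for s in STATES:
            values[s] = max(backup(values, s, a) for a in ACTIONS)
    greedy = {s: max(ACTIONS, key=lambda a: backup(values, s, a)) for s in STATES}
    return values, greedy


random_policy = {s: {"left": 0.5, "right": 0.5} for s in STATES}
random_values = predict(random_policy)
optimal_values, optimal_policy = control()
print("value of random policy:", random_values)
print("optimal values:", optimal_values)
print("optimal policy:", optimal_policy)
```

In this corridor the optimal policy moves right in every cell, and its value in each state is at least as high as the value of the random policy.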

&copy; 2020 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>