<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 1200px">
</div>

# Model-Free Prediction  

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) In this lesson you learn:<br>
 - Monte-Carlo Learning for Prediction Task
 - Temporal-Difference Learning
 - \\(TD(\lambda)\\)

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) References
* [David Silver lecture](https://www.youtube.com/watch?v=2pWv7GOvuf0&list=PLqYmG7hTraZDM-OYHWgPebj2MfCFzFObQ)
* Sutton book - Chapter 5, 6, 7, 8

### What is Monte-Carlo (MC) method?
<br>
 - Monte Carlo methods, or Monte Carlo experiments, are a broad class of computational algorithms that rely on repeated random sampling to obtain numerical results.
  - Sampling
  - Estimation
  - Optimization
 - Simple to implement, and the estimate improves as more samples are drawn
 - Used in many branches of science (physical processes, operations research, etc.) and in practice, e.g. option pricing

In [4]:
import numpy as np

def is_in_unit_circle(x, y, r):
  """Return True if the point (x, y) lies inside or on the circle of radius r centered at the origin."""

  return x**2 + y**2 <= r**2


def MCM(a, r, episodes):
  """Sample `episodes` points uniformly from the square [-a, a] x [-a, a] and count those inside the circle of radius r."""

  inside_count = 0
  for _ in range(episodes):
    # draw scalars rather than length-1 arrays
    x = np.random.uniform(low=-a, high=a)
    y = np.random.uniform(low=-a, high=a)
    if is_in_unit_circle(x, y, r):
      inside_count += 1

  return inside_count

In [5]:
# The fraction of points inside the unit circle approximates
# (circle area) / (square area) = pi / 4; multiply by 4 to estimate pi.
MCM(1, 1, 1000) / 1000

### MC Reinforcement Learning ###
<br>
 - MC methods learn directly from episodes of experience
 - MC is model-free: no knowledge of MDP transitions / rewards 
 - MC learns from complete episodes: no bootstrapping
 - MC uses the simplest possible idea: value = mean return
 - Caveat: can only apply MC to episodic MDPs
  - Terminal state(s) is required

### Monte-Carlo for Policy Evaluation ###

#### Questions ####

Given that \\(v\_{\pi}(s) = E\_{\pi} \big[G\_{t} \bigm\vert S\_{t} = s \big]\\), can we use Monte-Carlo to estimate it? If so, how? Discuss this with your neighbors.

There are multiple ways to use Monte-Carlo for value estimation:

1. First-Visit: In each episode, accumulate the return (and increment the visit count) from **only the first time** state s is visited, then estimate V(s) as the empirical mean of those returns. Is there any guarantee that the empirical mean converges to the true mean?
2. Every-Visit: Accumulate the return (and increment the visit count) **every time** state s is visited, possibly multiple times in an episode, then estimate V(s) as the empirical mean.
3. Can we run 1 and 2 more efficiently, i.e. without keeping track of all the returns?
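
The first-visit and every-visit estimators can be sketched in a few lines. This is a minimal illustration rather than a reference implementation: the function name `mc_prediction`, the episode format (a list of `(state, reward)` pairs, where the reward is the one received on leaving that state), and the toy episode are all assumptions made for the example. The running mean is updated incrementally, so no list of returns is stored.

```python
from collections import defaultdict

def mc_prediction(episodes, gamma=1.0, first_visit=True):
    """Estimate V(s) as the mean return observed from s (hypothetical helper)."""
    values = defaultdict(float)
    counts = defaultdict(int)
    for episode in episodes:
        # Walk backwards so each return G_t is built from G_{t+1}.
        g = 0.0
        returns = []
        for state, reward in reversed(episode):
            g = reward + gamma * g
            returns.append((state, g))
        returns.reverse()  # back to chronological order

        seen = set()
        for state, g in returns:
            if first_visit and state in seen:
                continue  # first-visit: ignore later visits in this episode
            seen.add(state)
            counts[state] += 1
            # Incremental mean: V <- V + (G - V) / N
            values[state] += (g - values[state]) / counts[state]
    return dict(values)

# Toy episode (an assumption for illustration): A and B are each visited twice.
episodes = [[("A", 0.0), ("B", 1.0), ("A", 0.0), ("B", 2.0)]]
first = mc_prediction(episodes, first_visit=True)
every = mc_prediction(episodes, first_visit=False)
print(first, every)
```

The two estimators differ only in whether repeat visits within an episode contribute a return.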

### Temporal-Difference Learning ###

- MC methods learn offline: you have to wait until the end of an episode before updating the value estimates
- TD methods learn directly from episodes of experience and can update online, after every step
- TD methods are model-free, like MC methods
- TD methods can learn from incomplete episodes (unlike MC methods)
- TD updates a guess towards a guess

### How does TD work? ###

- Same goal: learn the value function \\(v\_{\pi}\\)
- Instead of \\(V(S\_{t}) \longleftarrow V(S\_{t}) + \alpha \Bigg(G\_{t} - V(S\_{t})\Bigg)\\) (updating the value toward the actual return)
- Do  \\(V(S\_{t}) \longleftarrow V(S\_{t}) + \alpha \Bigg(R\_{t+1} +\gamma V(S\_{t+1})- V(S\_{t})\Bigg)\\) (updating the value toward the estimated return)
- This update is called TD(0), the simplest form of \\(TD(\lambda)\\) (more on this later)
- \\(R\_{t+1} +\gamma V(S\_{t+1})\\) is called the TD target
- \\(\delta\_{t} = R\_{t+1} +\gamma V(S\_{t+1})- V(S\_{t})\\) is called the TD error
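
A single TD(0) update can be written as a short function. This is a sketch under stated assumptions: the name `td0_update`, the dictionary-based value table, and the `terminal` flag (which treats the value of a terminal successor as 0) are illustrative choices, not part of the lesson.

```python
def td0_update(values, state, reward, next_state, alpha=0.1, gamma=1.0, terminal=False):
    """Apply one TD(0) update to `values` in place and return the TD error."""
    # By convention the value of a terminal state is 0.
    next_value = 0.0 if terminal else values.get(next_state, 0.0)
    td_target = reward + gamma * next_value        # R_{t+1} + gamma * V(S_{t+1})
    td_error = td_target - values.get(state, 0.0)  # delta_t
    values[state] = values.get(state, 0.0) + alpha * td_error
    return td_error

# Hypothetical transition: leave A with reward 1 and arrive in B.
values = {"A": 0.0, "B": 1.0}
delta = td0_update(values, "A", reward=1.0, next_state="B", alpha=0.5)
print(delta, values)
```

Because the update needs only one transition, it can be applied after every step instead of waiting for the episode to end.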

### Questions ###

0. What are the advantages of TD?
0. What can you say about Bias/variance trade off for TD and MC?

### Batch MC and TD ###
- MC and TD converge as we get more samples, i.e. as we gain more experience
- What if experience is limited, i.e. you only have a finite sample of episodes?

### Questions ###

Consider the following 5 episodes in which the agent visits two states (A, B). Each episode is written as a sequence of states, each followed by the reward received on leaving it. What are V(A) and V(B)?

0. A, 1, B, 2
0. A, 2
0. B, 1
0. B, 1
0. B, 1
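
One way to check your answer is to run both methods on this fixed batch. The sketch below makes several assumptions: rewards are undiscounted (`gamma = 1`), each episode is encoded as `(state, reward)` pairs, and the names `batch_mc` and `batch_td0` are hypothetical. Batch TD(0) here means sweeping repeatedly over the same transitions with a small step size until the values stop changing.

```python
# The five episodes above, encoded as (state, reward) pairs.
episodes = [
    [("A", 1.0), ("B", 2.0)],
    [("A", 2.0)],
    [("B", 1.0)],
    [("B", 1.0)],
    [("B", 1.0)],
]

def batch_mc(episodes, gamma=1.0):
    """Average the observed returns from each state (no state repeats within an episode here)."""
    totals, counts = {}, {}
    for episode in episodes:
        g = 0.0
        for state, reward in reversed(episode):
            g = reward + gamma * g
            totals[state] = totals.get(state, 0.0) + g
            counts[state] = counts.get(state, 0) + 1
    return {s: totals[s] / counts[s] for s in totals}

def batch_td0(episodes, gamma=1.0, alpha=0.01, sweeps=5000):
    """Repeat TD(0) over the fixed batch, applying summed increments after each sweep."""
    values = {s: 0.0 for episode in episodes for s, _ in episode}
    for _ in range(sweeps):
        increments = {s: 0.0 for s in values}
        for episode in episodes:
            for i, (state, reward) in enumerate(episode):
                # The successor's value is 0 when the episode ends after this step.
                next_value = values[episode[i + 1][0]] if i + 1 < len(episode) else 0.0
                increments[state] += alpha * (reward + gamma * next_value - values[state])
        for s in values:
            values[s] += increments[s]
    return values

mc_values = batch_mc(episodes)
td_values = batch_td0(episodes)
print("MC:", mc_values)
print("TD:", td_values)
```

The two methods agree on one state and disagree on the other; working out why is the point of the exercise.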

### Putting all we have learned together in one picture ###

<br><br><br><br>
![Dimensions of Reinforcement Learning](https://slideplayer.com/slide/4856063/15/images/28/Dimensions+of+Reinforcement+Learning.jpg)

### Questions ###

0. Bootstrapping refers to updates that involve an estimate. Which techniques bootstrap?
0. Sampling refers to updates that use a sampled realization. Which techniques sample?

### n-step Return & n-step temporal-difference learning ###

You are not limited to a one-step estimate; any n-step return can be used as the target. For example:

- 1-step return: \\(G\_{t}^{1} = R\_{t+1} + \gamma V(S\_{t+1})\\)
- 2-step return: \\(G\_{t}^{2} = R\_{t+1} + \gamma R\_{t+2} + \gamma^2 V(S\_{t+2})\\)
- n-step return: \\(G\_{t}^{n} = R\_{t+1} + \gamma R\_{t+2} + ... + \gamma^{n-1} R\_{t+n} + \gamma^n V(S\_{t+n})\\)
- n-step TD update: \\(V(S\_{t}) \longleftarrow V(S\_{t}) + \alpha \Bigg(G\_{t}^{n} - V(S\_{t})\Bigg)\\)
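
As a rough sketch, an n-step return can be computed from a recorded episode as follows. In the standard definition the first n terms are discounted rewards and only the final term bootstraps from \\(V(S\_{t+n})\\). The function name `n_step_return` and the list-based episode format are assumptions for illustration: `rewards[k]` holds \\(R\_{k+1}\\), `values[k]` holds \\(V(S\_{k})\\), and the episode terminates after `len(rewards)` steps.

```python
def n_step_return(rewards, values, t, n, gamma=1.0):
    """Return G_t^n: n discounted rewards plus a bootstrapped tail value."""
    T = len(rewards)        # the episode terminates at time T
    end = min(t + n, T)     # do not look past the end of the episode
    g = sum(gamma ** (k - t) * rewards[k] for k in range(t, end))
    if t + n < T:
        # Bootstrap from the current estimate of the state reached after n steps.
        g += gamma ** n * values[t + n]
    return g

# Hypothetical three-step episode with made-up value estimates.
rewards = [1.0, 2.0, 3.0]
values = [10.0, 20.0, 30.0]
one_step = n_step_return(rewards, values, t=0, n=1, gamma=0.5)
two_step = n_step_return(rewards, values, t=0, n=2, gamma=0.5)
full = n_step_return(rewards, values, t=0, n=3, gamma=0.5)
print(one_step, two_step, full)
```

When n reaches or passes the end of the episode, nothing is left to bootstrap from and the n-step return equals the Monte-Carlo return.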

### Averaging n-Step Returns ###

- A natural question is whether one can combine n-step returns for different values of n. For example, you might want to do something like:
$$ \frac{1}{3} G^4 + \frac{1}{3} G^5 + \frac{1}{3} G^6$$
- This is doable in practice. One can weight the different n-step returns so that the larger n is, the less weight its return gets
- Mathematically, this can be represented as:
$$ G^\lambda\_{t} = (1-\lambda) \sum\_{n=1}^{\infty}\lambda^{n-1}G\_{t}^n$$
$$ V(S\_{t}) \longleftarrow V(S\_{t}) + \alpha \Bigg(G\_{t}^{\lambda} - V(S\_{t})\Bigg) $$
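
For an episode that terminates, the infinite sum can be evaluated exactly: every n-step return that reaches the end of the episode equals the full Monte-Carlo return, so the remaining weight collapses onto that single term. The sketch below relies on that assumption. The name `lambda_return` and the list-based episode format (`rewards[k]` is \\(R\_{k+1}\\), `values[k]` is \\(V(S\_{k})\\)) are hypothetical.

```python
def lambda_return(rewards, values, t, lam, gamma=1.0):
    """Return G_t^lambda for an episode that terminates after len(rewards) steps."""
    T = len(rewards)

    def n_step(n):
        # n discounted rewards, plus a bootstrapped value if the episode has not ended
        end = min(t + n, T)
        g = sum(gamma ** (k - t) * rewards[k] for k in range(t, end))
        if t + n < T:
            g += gamma ** n * values[t + n]
        return g

    g_lambda = 0.0
    for n in range(1, T - t):
        g_lambda += (1 - lam) * lam ** (n - 1) * n_step(n)
    # All remaining weight, lam ** (T - t - 1), goes to the full return.
    g_lambda += lam ** (T - t - 1) * n_step(T - t)
    return g_lambda

# Hypothetical three-step episode with made-up value estimates.
rewards = [1.0, 2.0, 3.0]
values = [10.0, 20.0, 30.0]
g_td0 = lambda_return(rewards, values, t=0, lam=0.0, gamma=0.5)
g_mid = lambda_return(rewards, values, t=0, lam=0.5, gamma=0.5)
g_mc = lambda_return(rewards, values, t=0, lam=1.0, gamma=0.5)
print(g_td0, g_mid, g_mc)
```

Setting `lam` to 0 recovers the 1-step TD target, and setting it to 1 recovers the Monte-Carlo return.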

### Questions ###
0. How does weight change over time? Can you plot that for \\(\lambda = 0.5 \\) for \\(t \in [0, 30]\\)?
0. What is the total area under the curve?

In [18]:
import matplotlib.pyplot as plt
import numpy as np

def plot_weight(lambda_value = 0.5, max_time = 30):
  """Plot the decaying weights (1 - lambda) * lambda**time for time = 0, ..., max_time - 1."""
  
  time = np.arange(0, max_time, 1)

  # using the formula mentioned above; time plays the role of n - 1
  weight = (1 - lambda_value) * np.power(lambda_value, time)

  # plot the result, labelling it with the lambda actually used
  fig = plt.figure()
  plt.plot(time, weight)
  fig.suptitle(rf'weight over time ($\lambda$ = {lambda_value})', fontsize=20)
  plt.xlabel('time', fontsize=18)
  plt.ylabel('weight', fontsize=16)

  # display the plot
  plt.show()
  display()  # Databricks notebook call; remove when running outside Databricks

plot_weight()

In [19]:
def calculate_area(lambda_value = 0.5, max_time = 30):
  """Sum the weights (1 - lambda) * lambda**i for i = 0, ..., max_time - 1."""
  
  area = 0
  for i in range(max_time):
    area += (1 - lambda_value) * pow(lambda_value, i)
  return area

# The geometric sum equals 1 - lambda_value**max_time, which approaches 1 as max_time grows
calculate_area()

### Forward View TD(\\(\lambda\\))
- Forward view: you look forward in time, because computing \\(G\_{t}^{\lambda}\\) requires the rewards and states that come after time t
- Downside? Like MC, it needs complete episodes: the future is only known once the whole episode has been observed

### Backward View TD(\\(\lambda\\))
- The backward view provides a mechanism for updating online, at every step, from incomplete sequences
- To do so, we combine two heuristics for assigning credit to states:
 - Frequency: assign credit to the most frequently visited states (why?)
 - Recency: assign credit to the most recently visited states (why?)
- Eligibility traces combine both:
 - \\(E\_{0}(s) = 0\\)
 - \\(E\_{t}(s) = \gamma\lambda E\_{t-1}(s) + 1(S\_{t} = s)\\), where \\(1(\cdot)\\) is an indicator function
- Keep an eligibility trace for every state s
- Update V(s) for every state s, in proportion to the TD error and the eligibility trace:
$$\delta\_{t} = R\_{t+1} + \gamma V(S\_{t+1}) - V(S\_{t}) $$
$$ V(s) \longleftarrow V(s) + \alpha \delta\_{t}E\_{t}(s) $$
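
The backward-view update can be sketched for one episode as below. This is an illustrative sketch, not a definitive implementation: the name `td_lambda_episode`, the `(state, reward, next_state)` transition format, and the use of `None` to mark termination are assumptions. Each step decays every trace by \\(\gamma\lambda\\), bumps the trace of the current state by 1, and then moves every state's value in proportion to its trace.

```python
from collections import defaultdict

def td_lambda_episode(values, transitions, alpha=0.1, gamma=1.0, lam=0.9):
    """Run online backward-view TD(lambda) over one episode, updating `values` in place."""
    traces = defaultdict(float)  # E_0(s) = 0 for every state
    for state, reward, next_state in transitions:
        next_value = 0.0 if next_state is None else values[next_state]
        delta = reward + gamma * next_value - values[state]  # TD error

        # E_t(s) = gamma * lambda * E_{t-1}(s) + 1(S_t = s)
        for s in traces:
            traces[s] *= gamma * lam
        traces[state] += 1.0

        # Every state with a non-zero trace shares in the TD error.
        for s, e in traces.items():
            values[s] += alpha * delta * e
    return values

# Hypothetical two-step episode: A -> B -> terminal, with reward 1 at the end.
transitions = [("A", 0.0, "B"), ("B", 1.0, None)]
v_half = td_lambda_episode(defaultdict(float), transitions, alpha=0.5, lam=0.5)
v_zero = td_lambda_episode(defaultdict(float), transitions, alpha=0.5, lam=0.0)
print(dict(v_half), dict(v_zero))
```

With `lam=0.0` only the state just visited has a non-zero trace, so the update reduces to TD(0). With a larger `lam`, the reward at the end of the episode also reaches state A.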

### Questions ###
0. What happens when \\(\lambda = 0\\)? Why?
0. What happens when \\(\lambda = 1\\)? Why?

### Extra (out of scope) ###
One can show that the sum of offline updates is identical for forward-view and backward-view TD(\\(\lambda\\)), i.e. $$ \sum\_{t = 1}^{T} \alpha \delta\_{t} E\_{t}(s) = \sum\_{t =1}^{T}\alpha\Bigg(G\_{t}^{\lambda} - V(S\_{t})\Bigg) 1(S\_{t} = s) $$

### Offline and Online Updates ###
- Offline updates are accumulated within an episode but applied only at the end of the episode. In this case forward-view and backward-view TD(\\(\lambda\\)) are equivalent
- Online updates are applied at each step within an episode. In this case forward-view and backward-view TD(\\(\lambda\\)) differ slightly

### Questions ###
Discuss offline and online updates for forward and backward TD for different values of \\(\lambda\\). In what scenarios are they equivalent?

### Further reading ###
- Exact online \\(TD(\lambda)\\): [ICML paper](http://proceedings.mlr.press/v32/seijen14.pdf)

&copy; 2020 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>