<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Dynamic-Programming-Refresher-&amp;-Dynamic-Programming-for-Reinforcement-Learning" data-toc-modified-id="Dynamic-Programming-Refresher-&amp;-Dynamic-Programming-for-Reinforcement-Learning-1">Dynamic Programming Refresher &amp; Dynamic Programming for Reinforcement Learning</a></span></li><li><span><a href="#Learning-Outcomes" data-toc-modified-id="Learning-Outcomes-2">Learning Outcomes</a></span></li><li><span><a href="#What-does-&quot;Dynamic-Programming&quot;-(DP)-mean?-" data-toc-modified-id="What-does-&quot;Dynamic-Programming&quot;-(DP)-mean?--3">What does "Dynamic Programming" (DP) mean? </a></span></li><li><span><a href="#How-does-Dynamic-Programming-(DP)-work?" data-toc-modified-id="How-does-Dynamic-Programming-(DP)-work?-4">How does Dynamic Programming (DP) work?</a></span></li><li><span><a href="#Check-for-understanding" data-toc-modified-id="Check-for-understanding-5">Check for understanding</a></span></li><li><span><a href="#-Student-Activity" data-toc-modified-id="-Student-Activity-6"> Student Activity</a></span></li><li><span><a href="#The-connection-between-general-Dynamic-Programming-and-Reinforcement-Learning" data-toc-modified-id="The-connection-between-general-Dynamic-Programming-and-Reinforcement-Learning-7">The connection between general Dynamic Programming and Reinforcement Learning</a></span></li><li><span><a href="#Dynamic-programming-for-Reinforcement-Learning" data-toc-modified-id="Dynamic-programming-for-Reinforcement-Learning-8">Dynamic programming for Reinforcement Learning</a></span></li><li><span><a href="#Preview-Lab-3:-Pyramid-Escape" data-toc-modified-id="Preview-Lab-3:-Pyramid-Escape-9">Preview Lab 3: Pyramid Escape</a></span></li><li><span><a href="#Takeaways" data-toc-modified-id="Takeaways-10">Takeaways</a></span></li></ul></div>

<center><h2>Dynamic Programming Refresher & Dynamic Programming for Reinforcement Learning</h2></center>
<br>
<center><img src="https://imgs.xkcd.com/comics/travelling_salesman_problem.png" width="100%"/></center>

<center><h2>Learning Outcomes</h2></center>

__By the end of this session, you should be able to__:

- Write a dynamic programming function in Python.
- Explain in your own words how dynamic programming (DP) is useful in Reinforcement Learning.

<center><h2>What does "Dynamic Programming" (DP) mean? </h2></center>
 
Dynamic (vs. static) - The problem has a sequential component. A dynamical system involves time dependence (e.g., pendulum, flow of water, ecosystems).

Programming - Mathematical programming (aka, mathematical optimization). 

Try to find the best solution to a problem.

See also {Integer, Linear, Convex, …} Programming.

Dynamic Programming is an optimization technique for certain types of sequential problems.

<center><h2>How does Dynamic Programming (DP) work?</h2></center>

A method for solving an entire problem by:

1. Breaking down the complete problem into smaller related subproblems. 
1. Remembering the optimal solution to the subproblems.
1. Putting the optimal subproblem solutions back together to find a solution to the complete problem.

The breaking down of the problem is often recursive. 

Recall that a recursive solution can be rewritten as an iterative one by managing the stack manually.
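As a rough illustration of managing the stack manually, the sketch below computes a binomial coefficient with Pascal's rule twice: once recursively and once with an explicit stack plus a dictionary of solved subproblems. The function names `binom_recursive` and `binom_stack` are made up for this example, and both assume `0 <= k <= n`.

```python
def binom_recursive(n, k):
    "Binomial coefficient via Pascal's rule, using the call stack."
    if k == 0 or k == n:
        return 1
    return binom_recursive(n - 1, k - 1) + binom_recursive(n - 1, k)


def binom_stack(n, k):
    "Same recurrence, but the pending subproblems live on an explicit stack."
    memo = {}            # solved subproblems: (n, k) -> value
    stack = [(n, k)]     # subproblems still waiting to be solved
    while stack:
        i, j = stack[-1]
        if (i, j) in memo:
            stack.pop()                      # already solved (pushed twice)
        elif j == 0 or j == i:
            memo[(i, j)] = 1                 # base case
            stack.pop()
        else:
            parts = [(i - 1, j - 1), (i - 1, j)]
            missing = [p for p in parts if p not in memo]
            if missing:
                stack.extend(missing)        # solve the dependencies first
            else:
                memo[(i, j)] = memo[parts[0]] + memo[parts[1]]
                stack.pop()
    return memo[(n, k)]


assert binom_stack(5, 2) == binom_recursive(5, 2) == 10
```

The explicit-stack version does not depend on Python's recursion limit, and the `memo` dictionary is where the "remembering" step of DP happens.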

Problem elements:

- Related subproblems - Divide the complete problem into overlapping subproblems.
- Optimal substructure - Solving the subproblems optimally automatically provides an optimal solution to the original complete problem.


DP reduces computation by storing and reusing partial results (caching).
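To get a feel for how much work caching can save, here is a small sketch that counts function calls with and without a cache. The example problem (counting monotone paths through a grid) and the names `paths_plain`, `paths_cached`, and `calls` are assumptions chosen for illustration only.

```python
from functools import lru_cache

calls = {"plain": 0, "cached": 0}  # how often each function body runs


def paths_plain(rows, cols):
    "Count right/down paths through a grid, recomputing every subproblem."
    calls["plain"] += 1
    if rows == 0 or cols == 0:
        return 1
    return paths_plain(rows - 1, cols) + paths_plain(rows, cols - 1)


@lru_cache(maxsize=None)
def paths_cached(rows, cols):
    "Same recurrence, but each (rows, cols) pair is computed only once."
    calls["cached"] += 1
    if rows == 0 or cols == 0:
        return 1
    return paths_cached(rows - 1, cols) + paths_cached(rows, cols - 1)


# Both versions return the same answer; only the amount of work differs.
assert paths_plain(8, 8) == paths_cached(8, 8)
print(calls)
```

The cached version can run its body at most once per distinct `(rows, cols)` pair, while the plain version repeats the same subproblems many times.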

Caching is one of the most important concepts in software engineering.

[A nonprofit spent all fundraising money on a data bill because it did not cache](https://news.ycombinator.com/item?id=20020095)

Dynamic programming is another example of a divide-and-conquer approach, with the difference that the subproblems overlap, so their solutions are stored and reused.

<center><h2>Check for understanding</h2></center>

What are examples of problems that can be solved with dynamic programming (DP)?

General types …   
Specific examples …

1. Shortest path - From one point to another (e.g., graph path)
1. Generating sequences that depend on previous elements (e.g., Fibonacci)
1. Maximizing a sequence (e.g., cumulative sum with constraints)
1. Scheduling problems (e.g., weighted interval scheduling)
1. String algorithms (e.g., sequence alignment for DNA)
1. Markov decision process (MDP)

<center><h2> Student Activity</h2></center>

Solve the following problems with DP:

1. Generate a Fibonacci Sequence: `0, 1, 1, 2, 3, 5, 8, 13`
1. Find maximum cumulative sum with the constraint of not taking two numbers in a row.
```python
assert max_constrained([1, 2, 3, 1])     ==  4
assert max_constrained([2, 1, 1, 2])     ==  4
assert max_constrained([2, 7, 9, 3, 1])  == 12
```
1. Find the Length of the Longest Increasing Subsequence (LIS) 
```python
assert len_lis([10, 9]) == 1
assert len_lis([10, 22, 9]) == 2
assert len_lis([10, 22, 9, 10, 21, 50]) == 4
```

Other common DP problems:

- Given two strings, find the longest common substring.
- Find the longest increasing subsequence.
- Determine the minimum number of coins (or the number of distinct ways) to make n cents, given coins of denominations less than n cents.
- Given a knapsack with a total weight capacity and a list of items with weight w(i) and value v(i), determine the max total value you can carry.
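As one possible sketch of the knapsack problem from the list above, the function below keeps a table of the best value achievable for each capacity. The name `knapsack_max_value` is hypothetical, and the sketch assumes non-negative integer weights and that each item can be taken at most once (the 0/1 variant).

```python
def knapsack_max_value(weights, values, capacity):
    "Max total value of items that fit in `capacity`, each item used at most once."
    # best[c] = best value achievable with total weight <= c using the items seen so far
    best = [0] * (capacity + 1)
    for w, v in zip(weights, values):
        # Go from high to low capacity so each item is counted at most once
        for c in range(capacity, w - 1, -1):
            best[c] = max(best[c], best[c - w] + v)
    return best[capacity]


assert knapsack_max_value([10, 20, 30], [60, 100, 120], 50) == 220
assert knapsack_max_value([1, 3, 4, 5], [1, 4, 5, 7], 7) == 9
```

The table `best` plays the same role as the cache in the other examples: each entry is the stored optimal solution to a smaller subproblem.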

```python
from functools import lru_cache

@lru_cache(maxsize=None)  # Cache every result so each subproblem is computed only once
def fib_dp(n_th):
    "Calculate nth Fibonacci number using dynamic programming"
    if n_th == 0: return 0
    if n_th == 1: return 1
    return fib_dp(n_th-1) + fib_dp(n_th-2)

[fib_dp(n_th) for n_th in range(10)]
# Output: [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]
```

[Brian’s other solutions to generate Fibonacci Sequence](https://github.com/brianspiering/fibonacci_sequences)

There are many ways of applying DP to the problem.

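```python
def max_constrained(nums):
    "Get maximum cumulative sum with the constraint of not taking two numbers in a row."
    # total_current: best sum using the numbers seen so far
    # total_previous: best sum that excludes the most recent number
    total_current, total_previous = 0, 0

    for n in nums:
        # Either take n (on top of the best sum that skipped the previous number) or skip n
        total_previous, total_current = total_current, max(total_previous + n, total_current)

    return total_current

assert max_constrained([1, 2, 3, 1])     ==  4
assert max_constrained([2, 1, 1, 2])     ==  4
assert max_constrained([2, 7, 9, 3, 1])  == 12
```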

```python
def len_lis(nums):
    "Find the length of the longest increasing subsequence (LIS)."
    n = len(nums)
    seq_len = [1]*n # seq_len[i] is the length of the LIS ending at index i

    # Compute the length of each increasing subsequence
    for i in range(1, n):
        for j in range(0, i):
            # Extend the subsequence ending at j if nums[i] is larger and the result is longer
            if (nums[i] > nums[j]) and (seq_len[i] < (seq_len[j] + 1)):
                seq_len[i] = seq_len[j]+1

    return max(seq_len) # Find longest subsequence

assert len_lis([10, 9]) == 1
assert len_lis([10, 22, 9]) == 2
assert len_lis([10, 22, 9, 10, 21, 50]) == 4
```

[Original Source](https://www.geeksforgeeks.org/python-program-for-longest-increasing-subsequence/)

<center><h2>The connection between general Dynamic Programming and Reinforcement Learning</h2></center>

The goal of RL is to maximize the cumulative sum of discounted future rewards.

A Fibonacci sequence is also a type of cumulative sum.

The House Robber problem (the constrained maximum sum solved by `max_constrained`) just adds a deterministic selection function, `max`, to pick the value moving forward.

In fact, some types of Reinforcement Learning extend the House Robber problem by adding stochastic selection or even a choice (e.g., bandit) to maximize the cumulative sum.

Just swap out `max` with your own favorite custom function.
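One way to picture the swap is to pass the selection function in as a parameter. The sketch below is only an illustration: `constrained_total` and `epsilon_greedy` are hypothetical names, and the epsilon-greedy selector is one assumed example of a stochastic choice, not a full RL algorithm.

```python
import random


def constrained_total(nums, select=max):
    "House Robber recurrence with a pluggable selection function."
    total_current, total_previous = 0, 0
    for n in nums:
        # select(take, skip) decides which running total to carry forward
        total_previous, total_current = total_current, select(total_previous + n, total_current)
    return total_current


def epsilon_greedy(epsilon, rng):
    "Build a selector that picks at random with probability epsilon, else greedily."
    def select(take, skip):
        if rng.random() < epsilon:
            return rng.choice((take, skip))
        return max(take, skip)
    return select


nums = [2, 7, 9, 3, 1]
assert constrained_total(nums) == 12                                    # deterministic max
assert constrained_total(nums, epsilon_greedy(0.0, random.Random(0))) == 12
noisy = constrained_total(nums, epsilon_greedy(0.5, random.Random(0)))
assert noisy <= 12                                                      # exploring never beats the optimum
```

With `epsilon=0.0` the selector reduces to `max`, so the deterministic answer is recovered; larger values trade some of the total for random exploration.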

<center><h2>Dynamic programming for Reinforcement Learning</h2></center>

Recall the Value Function:

<center><img src="images/dp_formula.png" width="75%"/></center>

The value function is defined for:

- a policy π
- a time step k
- a state s

It is equal to the current reward plus the discount rate times a sum over all possible next states, where each term combines:

- the probability of transitioning to that state (this requires a complete model of the transition function)
- the future value of that state

DP needs to visit all possible states (i.e., futures) at least once:

<center><img src="images/complete.png" width="75%"/></center>
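The sweep over all states can be sketched as iterative policy evaluation on a toy MDP. This is a minimal sketch under stated assumptions: it uses the standard Bellman expectation backup (current reward plus discounted expected value of the next state), which may differ in notation from the formula in the image, and the function name, the dictionary layout, and the three-state MDP are all invented for illustration.

```python
def policy_evaluation(states, policy, transitions, rewards, gamma=0.9, sweeps=100):
    "Estimate the value of each state under a fixed policy by repeated full sweeps."
    values = {s: 0.0 for s in states}
    for _ in range(sweeps):
        new_values = {}
        for s in states:  # every state is visited in every sweep
            total = 0.0
            for action, action_prob in policy[s].items():
                # Expected value of the next state under the full transition model
                expected_next = sum(
                    p * values[s_next]
                    for s_next, p in transitions[(s, action)].items()
                )
                total += action_prob * (rewards[(s, action)] + gamma * expected_next)
            new_values[s] = total
        values = new_values
    return values


# Hypothetical toy MDP: A -> (B or end) -> end, with a single action "go"
states = ["A", "B", "end"]
policy = {s: {"go": 1.0} for s in states}
transitions = {
    ("A", "go"): {"B": 0.5, "end": 0.5},
    ("B", "go"): {"end": 1.0},
    ("end", "go"): {"end": 1.0},
}
rewards = {("A", "go"): 1.0, ("B", "go"): 2.0, ("end", "go"): 0.0}

values = policy_evaluation(states, policy, transitions, rewards, gamma=0.5)
assert abs(values["B"] - 2.0) < 1e-9
assert abs(values["A"] - 1.5) < 1e-9
```

Each sweep reuses the values stored from the previous sweep, which is the caching idea from earlier applied to states instead of function arguments.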

<center><h2>Preview Lab 3: Pyramid Escape</h2></center>

- Not a real Reinforcement Learning problem. It is a simplified problem because the model is deterministic.
    - It would be RL if there were randomness in the reward. Each cell could be a probability distribution that generates a value.
- It is a binary tree problem represented in a list / array. Use indexing to "walk" the tree.
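To illustrate walking a tree with indexing, the sketch below assumes a heap-style layout where the children of index `i` sit at `2*i + 1` and `2*i + 2`. The lab's actual data layout and scoring are not shown here, so both the layout and the function name `best_path_value` are assumptions.

```python
def best_path_value(tree):
    "Best root-to-leaf sum in a non-empty binary tree stored as a flat list."
    best = list(tree)  # best[i] = best sum from node i down to a leaf
    # Work bottom-up so the children are solved before their parent
    for i in range(len(tree) - 1, -1, -1):
        left, right = 2 * i + 1, 2 * i + 2
        children = [best[c] for c in (left, right) if c < len(tree)]
        if children:
            best[i] += max(children)
    return best[0]


#         1
#       /   \
#      2     3
#     / \   / \
#    4   5 6   7
assert best_path_value([1, 2, 3, 4, 5, 6, 7]) == 11   # 1 + 3 + 7
assert best_path_value([5, 9, 1, 1, 1, 1, 20]) == 26  # 5 + 1 + 20
```

The second example shows why a greedy walk is not enough: stepping to the larger child (9) first misses the best path.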

[Demo of DP for gridworld](https://cs.stanford.edu/people/karpathy/reinforcejs/gridworld_dp.html)

<center><h2>Takeaways</h2></center>

- Dynamic Programming (DP) is a general method to solve specific kinds of divide-and-conquer problems.
- Dynamic Programming (DP) is useful for finding the optimal policy in an MDP by recursively, completely, and efficiently searching the space for a policy that maximizes the discounted cumulative reward.

