# Dynamic Programming

Dynamic programming breaks a problem into a series of ordered subproblems and combines the solutions to smaller subproblems to form the solution to a larger one.

## Fibonacci sequence 

Using dynamic programming to calculate the *n*th member of the
Fibonacci sequence greatly improves performance. Here is a naïve
implementation, based directly on the mathematical definition:

`   `**`function`**` fib(n)`\
`       `**`if`**` n <= 1 `**`return`**` n`\
`       `**`return`**` fib(n − 1) + fib(n − 2)`

Notice that if we call, say, `fib(5)`, we produce a call tree that calls
the function on the same value many different times:

1.  `fib(5)`
2.  `fib(4) + fib(3)`
3.  `(fib(3) + fib(2)) + (fib(2) + fib(1))`
4.  `((fib(2) + fib(1)) + (fib(1) + fib(0))) + ((fib(1) + fib(0)) + fib(1))`
5.  `(((fib(1) + fib(0)) + fib(1)) + (fib(1) + fib(0))) + ((fib(1) + fib(0)) + fib(1))`

In particular, `fib(2)` was calculated three times from scratch. In
larger examples, many more values of `fib`, or *subproblems*, are
recalculated, leading to an exponential time algorithm.
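
The repeated work can be made visible with a small Python sketch of the naïve function. The `calls` counter is an addition for illustration only and is not part of the pseudocode above:

```python
from collections import Counter

calls = Counter()  # number of times fib is invoked for each argument


def fib(n: int) -> int:
    """Naive recursive Fibonacci, instrumented to count calls."""
    calls[n] += 1
    if n <= 1:
        return n
    return fib(n - 1) + fib(n - 2)


result = fib(5)
print(result)               # 5
print(calls[2])             # 3: fib(2) is computed three times
print(sum(calls.values()))  # 15 calls in total for fib(5)
```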

Now, suppose we have a simple map object, *m*, which maps each value
of `fib` that has already been calculated to its result, and we modify
our function to use it and update it. The resulting function requires
only O(*n*) time instead of exponential time (but requires
O(*n*) space):

`   `**`var`**` m := `***`map`***`(0 → 0, 1 → 1)`\
`   `**`function`**` fib(n)`\
`       `**`if`**` n `**`is not in`**` m`\
`           m[n] := fib(n − 1) + fib(n − 2)`\
`       `**`return`**` m[n]`

This technique of saving values that have already been calculated is
called *memoization*; this is the top-down approach, since we first
break the problem into subproblems and then calculate and store values.
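
As a rough Python equivalent of the memoized pseudocode, a dictionary can play the role of the map *m*. The name `memo` is chosen here for illustration; this is a sketch, not the only way to memoize:

```python
memo = {0: 0, 1: 1}  # base cases, playing the role of the map m


def fib(n: int) -> int:
    """Top-down Fibonacci: compute each value once, then reuse it."""
    if n not in memo:
        memo[n] = fib(n - 1) + fib(n - 2)
    return memo[n]


print(fib(50))  # 12586269025
```

Because the function is still recursive, a very large `n` can exceed Python's default recursion limit. The standard library's `functools.lru_cache` decorator offers a similar cache without a hand-written dictionary.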

In the **bottom-up** approach, we calculate the smaller values of `fib`
first, then build larger values from them. This method also uses O(*n*)
time since it contains a loop that repeats n − 1 times, but it only
takes constant (O(1)) space, in contrast to the top-down approach which
requires O(*n*) space to store the map.

`   `**`function`**` fib(n)`\
`       `**`if`**` n = 0`\
`           `**`return`**` 0`\
`       `**`else`**\
`           `**`var`**` previousFib := 0, currentFib := 1`\
`           `**`repeat`**` n − 1 `**`times`**` `*`// loop is skipped if n = 1`*\
`               `**`var`**` newFib := previousFib + currentFib`\
`               previousFib := currentFib`\
`               currentFib  := newFib`\
`       `**`return`**` currentFib`

In both examples, we only calculate `fib(2)` one time, and then use it
to calculate both `fib(4)` and `fib(3)`, instead of computing it every
time either of them is evaluated.
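
The bottom-up pseudocode translates almost line for line into Python. The following is a minimal sketch that keeps only the two most recent values, which is what gives the constant space bound:

```python
def fib(n: int) -> int:
    """Bottom-up Fibonacci using O(n) time and O(1) extra space."""
    if n == 0:
        return 0
    previous_fib, current_fib = 0, 1
    for _ in range(n - 1):  # loop is skipped if n == 1
        previous_fib, current_fib = current_fib, previous_fib + current_fib
    return current_fib


print([fib(i) for i in range(10)])  # [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]
```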

## A Dynamic Decision Problem 

Let the state at time $t$ be $x_t$. For a decision that begins at time
0, we take as given the initial state $x_0$. At any time, the set of
possible actions depends on the current state; we can write this as
$a_{t} \in \Gamma (x_t)$, where the action $a_t$ represents one or more
control variables. We also assume that the state changes from $x$ to a
new state $T(x,a)$ when action $a$ is taken, and that the current payoff
from taking action $a$ in state $x$ is $F(x,a)$. Finally, we assume
impatience, represented by a discount factor $0<\beta<1$.

Under these assumptions, an infinite-horizon decision problem takes the
following form:

$$V(x_0) \; = \; \max_{ \left \{ a_{t} \right \}_{t=0}^{\infty} }  \sum_{t=0}^{\infty} \beta^t F(x_t,a_{t}),$$

subject to the constraints

$$a_{t} \in \Gamma (x_t), \; x_{t+1}=T(x_t,a_t), \; \forall t = 0, 1, 2, \dots$$

Notice that we have defined notation $V(x_0)$ to denote the optimal
value that can be obtained by maximizing this objective function subject
to the assumed constraints. This function is the *value function*. It is
a function of the initial state variable $x_0$, since the best value
obtainable depends on the initial situation.
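
To make the notation concrete, the sketch below evaluates the discounted objective for one given plan of actions. The "cake-eating" setting is an assumption made purely for illustration and does not come from the text: the state $x$ is the number of units of cake left, the action $a$ is the number of units eaten, $\Gamma(x)=\{0,\dots,x\}$, $T(x,a)=x-a$, and $F(x,a)=\sqrt{a}$. The value of $\beta$ is likewise arbitrary.

```python
import math
from typing import Sequence

BETA = 0.5  # discount factor (assumed value), 0 < beta < 1


def feasible_actions(x: int) -> range:
    """Gamma(x): eat any whole number of units from 0 up to x."""
    return range(x + 1)


def transition(x: int, a: int) -> int:
    """T(x, a): cake left after eating a units."""
    return x - a


def payoff(x: int, a: int) -> float:
    """F(x, a): assumed payoff from eating a units now."""
    return math.sqrt(a)


def discounted_payoff(x0: int, actions: Sequence[int]) -> float:
    """Sum of beta^t * F(x_t, a_t) along the path generated by the plan."""
    x, total = x0, 0.0
    for t, a in enumerate(actions):
        if a not in feasible_actions(x):
            raise ValueError(f"action {a} is not feasible in state {x}")
        total += BETA ** t * payoff(x, a)
        x = transition(x, a)
    return total


print(discounted_payoff(3, [1, 1, 1]))  # 1.75
print(discounted_payoff(3, [3]))        # sqrt(3), roughly 1.732
```

The value function $V(x_0)$ is the largest such total over all feasible plans. Comparing plans one at a time like this quickly becomes impractical, which is what motivates the recursive approach.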

### Bellman's principle of optimality

The dynamic programming method breaks this decision problem into smaller
subproblems. Bellman's *principle of optimality* describes how to do
this:

> Principle of Optimality: An optimal policy has the property that
> whatever the initial state and initial decision are, the remaining
> decisions must constitute an optimal policy with regard to the state
> resulting from the first decision.  

In computer science, a problem that can be broken apart like this is
said to have optimal substructure. In the context of dynamic game
theory, this principle is analogous to the concept of subgame perfect
equilibrium, although what constitutes an optimal policy in this case
is conditioned on the decision-maker's opponents choosing similarly
optimal policies from their points of view.

As suggested by the *principle of optimality*, we will consider the
first decision separately, setting aside all future decisions (we will
start afresh from time 1 with the new state $x_1$). Collecting the
future decisions in brackets on the right, the above infinite-horizon
decision problem is equivalent to:

$$\max_{ a_0 } \left \{ F(x_0,a_0)
+ \beta  \left[ \max_{ \left \{ a_{t} \right \}_{t=1}^{\infty} }
\sum_{t=1}^{\infty} \beta^{t-1} F(x_t,a_{t}):
a_{t} \in \Gamma (x_t), \; x_{t+1}=T(x_t,a_t), \; \forall t \geq 1 \right] \right \}$$

subject to the constraints

$$a_0 \in \Gamma (x_0), \; x_1=T(x_0,a_0).$$

Here we are choosing $a_0$, knowing that our choice will cause the time
1 state to be $x_1=T(x_0,a_0)$. That new state will then affect the
decision problem from time 1 on. The whole future decision problem
appears inside the square brackets on the right.

## The Bellman Equation (Recursive Definition)


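So far we have only separated the first decision from the future
decisions. The key observation is that the expression in square brackets
in the previous section is itself the value of the decision problem that
starts at time 1 from state $x_1=T(x_0,a_0)$, which is $V(x_1)$ by the
definition of the value function. The problem can therefore be rewritten
as a recursive definition of the value function:

$$V(x_0) = \max_{ a_0 } \{ F(x_0,a_0) + \beta V(x_1) \}$$

subject to the constraints

$$a_0 \in \Gamma (x_0), \; x_1=T(x_0,a_0).$$

This is the Bellman equation. It can be simplified even further if we
drop time subscripts and plug in the value of the next state: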

$$V(x) = \max_{a \in \Gamma (x) } \{ F(x,a) + \beta V(T(x,a)) \}.$$

The Bellman equation is classified as a functional equation, because
solving it means finding the unknown function *V*, which is the *value
function*. Recall that the value function describes the best possible
value of the objective, as a function of the state *x*. By calculating
the value function, we will also find the function *a*(*x*) that
describes the optimal action as a function of the state; this is called
the *policy function*.
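
One common way to solve the Bellman equation numerically is value iteration: start from a guess for $V$ and apply the right-hand side repeatedly until the guess stops changing. The sketch below does this for a small, hypothetical "cake-eating" problem that is assumed here for illustration only: states are whole units of cake, $\Gamma(x)=\{0,\dots,x\}$, $T(x,a)=x-a$, $F(x,a)=\sqrt{a}$ and $\beta=0.9$. None of these choices come from the text above.

```python
import math

BETA = 0.9    # discount factor (assumed value), 0 < beta < 1
MAX_CAKE = 5  # states are x = 0, 1, ..., MAX_CAKE (assumed)


def feasible_actions(x):
    """Gamma(x): eat any whole number of units from 0 up to x."""
    return range(x + 1)


def transition(x, a):
    """T(x, a): cake left after eating a units."""
    return x - a


def payoff(x, a):
    """F(x, a): assumed payoff from eating a units now."""
    return math.sqrt(a)


def bellman_rhs(V, x, a):
    """F(x, a) + beta * V(T(x, a)) for a candidate value function V."""
    return payoff(x, a) + BETA * V[transition(x, a)]


def value_iteration(tol=1e-12, max_iter=10_000):
    """Iterate V <- max_a {F + beta * V(T)} until it stops changing."""
    states = range(MAX_CAKE + 1)
    V = [0.0] * (MAX_CAKE + 1)  # initial guess
    for _ in range(max_iter):
        new_V = [max(bellman_rhs(V, x, a) for a in feasible_actions(x))
                 for x in states]
        converged = max(abs(n - o) for n, o in zip(new_V, V)) < tol
        V = new_V
        if converged:
            break
    # Policy function a(x): the action that attains the maximum in each state.
    policy = [max(feasible_actions(x), key=lambda a: bellman_rhs(V, x, a))
              for x in states]
    return V, policy


V, policy = value_iteration()
print(V)       # value function, one entry per state
print(policy)  # optimal action in each state
```

In this example `V[2]` comes out as 1.9: eating one unit now and one unit next period yields $1 + 0.9 \cdot 1$, which beats eating both at once for $\sqrt{2} \approx 1.414$. The returned `policy` list is the policy function *a*(*x*) restricted to this grid of states.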