<h1><center>Dynamic Programming</center></h1>

Dynamic programming had a long history well before Reinforcement Learning became popular. It is used to take actions sequentially, based on past decisions, current conditions, and expected future rewards. We will consider a Markov Decision Process (MDP), where the current conditions result from the decisions made in the past and from how the environment reacted to those decisions (we cannot go back in time and change our decisions). In an MDP, we take a decision based on the current conditions, choosing the action that gives the maximum expected reward (not just the immediate reward but the long-term reward).

To simplify, we will consider a discrete-time environment, where actions are taken in discrete time slots rather than in continuous time. If the system is continuous, we can always discretize time (for example, take a decision every 5 minutes). In driving, for example, we take decisions when we reach a city (a node in our model).

THE books on Dynamic Programming (DP) and Reinforcement Learning (RL) are:
1. http://athenasc.com/dpbook.html  (DP)
2. http://www.incompleteideas.net/book/the-book-2nd.html (RL)

These books are very well written and make the concepts easy to follow. This tutorial only aims to give an idea of dynamic programming using one example.

In this brief introduction, we will go through some examples of simple dynamic programming problems and lay the background for modern-day Reinforcement Learning. First, the most important figure for all of RL. In DP, the agent is the user (manual or automated) who takes decisions for the next step.

<img src='figures/RL.png'></img>
<center>Figure 1. Reinforcement Learning</center>

This figure shows what we are trying to achieve in DP when taking sequential decisions. We aim for long-term rewards accumulated over time (for example, winning at the end of the game, rather than chasing an immediate reward such as capturing an opponent's pawn while exposing ourselves to the opponent).
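The difference between immediate and accumulated reward can be sketched with a few lines of Python. The reward values and the names `greedy_rewards`, `patient_rewards`, and `total_reward` are hypothetical and chosen only for illustration; they do not come from the figure or from the books.

```python
# Hypothetical per-step rewards for two strategies (illustrative numbers only).
greedy_rewards = [3, -1, -1, -10]   # early gain (e.g. capturing a pawn), game lost later
patient_rewards = [0, 0, 0, 10]     # no early gain, game won at the end


def total_reward(rewards):
    """Accumulate the rewards over the whole sequence of decisions."""
    return sum(rewards)


print('greedy :', total_reward(greedy_rewards))
print('patient:', total_reward(patient_rewards))
```

Under these assumed numbers, the strategy with the better first step ends up with the lower total, which is the trade-off sequential decision making has to handle.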

<h2>Traveling Salesman problem</h2>

Consider a traveling salesman problem. In this problem, a salesman has to visit n cities or locations starting from a warehouse and come back after visiting each city. We can also think of it as an Amazon delivery route: the driver has to deliver all the packages in the city. The salesman could consider any possible sequence in which to visit the cities.

In the example, if we have four cities, there are 4! = 24 orders in which the cities can be covered (each city has to be visited exactly once). In the example from the DP book, the distances between the four cities are given in matrix form. The objective is to find a sequence that minimizes the total distance.
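As a quick check of the 4! = 24 count, the following minimal sketch enumerates the orderings with the standard library. The list `cities` uses the same labels as the code cells in this notebook; the larger city counts in the loop are only there to illustrate how fast the number of orderings grows.

```python
from itertools import permutations
from math import factorial

cities = ['A', 'B', 'C', 'D']

# Each ordering visits every city exactly once.
orderings = list(permutations(cities))
print(len(orderings), factorial(len(cities)))

# The number of orderings grows factorially with the number of cities.
for n in (4, 6, 8, 10):
    print(n, 'cities ->', factorial(n), 'orderings')
```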

In [3]:
# distance[i][j] is the distance between cities[i] and cities[j]
# (the matrix is symmetric with zeros on the diagonal)

distance = [
    [0,5,1,15],
    [5,0,20,4],
    [1,20,0,3],
    [15,4,3,0]
]
cities = ['A','B','C','D']

In [6]:
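def cost_function(seq):
    # total distance travelled when visiting the cities in the order given by seq
    # (sums consecutive legs only; the leg back to the starting city is not included)
    cost = 0

    for j in range(1, len(seq)):
        c1 = cities.index(seq[j-1])
        c2 = cities.index(seq[j])
        cost += distance[c1][c2]
    return cost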

In [9]:
# generating all possible sequences

# BRUTE FORCE
# we can directly calculate the cost here for all possible sequences (but see how fast the number of choices increases)
# for 4 cities, the nested loops check 4^4 = 256 options
# enumerating only valid orderings would check just 4! = 24 sequences, but this grows very fast as well

best_cost = float('inf')
best_seq  = []

for i in range(4):
    for j in range(4):
        for k in range(4):
            for l in range(4):
                seq  = [cities[i],cities[j],cities[k],cities[l]] 
                
                # the following line checks that no city is visited twice
                if len(set(seq)) == 4:
                    cost = cost_function(seq)
                    if cost < best_cost:
                        best_cost = cost
                        best_seq  = seq
                     
print('best sequence:', best_seq, 'cost of best sequence:', best_cost)

best sequence: ['A', 'C', 'D', 'B'] cost of best sequence: 8
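The comment in the brute-force cell notes that only 24 sequences really need to be checked. A minimal sketch of that idea, assuming the same `distance` matrix and `cities` list as above, uses `itertools.permutations` to generate each valid ordering exactly once. The helper name `path_cost` is introduced here for illustration; like `cost_function`, it does not add the leg back to the starting city.

```python
from itertools import permutations

# Same symmetric distance matrix and city labels as in the cells above.
distance = [
    [0, 5, 1, 15],
    [5, 0, 20, 4],
    [1, 20, 0, 3],
    [15, 4, 3, 0],
]
cities = ['A', 'B', 'C', 'D']


def path_cost(seq):
    """Sum the distances between consecutive cities (no return leg)."""
    idx = [cities.index(c) for c in seq]
    return sum(distance[a][b] for a, b in zip(idx, idx[1:]))


# permutations() yields each ordering once, so no duplicate check is needed.
candidates = list(permutations(cities))
best_seq = min(candidates, key=path_cost)

print('sequences checked:', len(candidates))
print('best sequence:', list(best_seq), 'cost of best sequence:', path_cost(best_seq))
```

This checks 24 sequences instead of 256 and finds the same sequence and cost as the nested loops, but the number of candidates still grows factorially with the number of cities.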
