<img src="img/bigsem.png" width="40%" align="right">
<img src="img/logo_wiwi.png" width="20%" align="left">





<br><br><br><br>

# Dynamic Programming Models in Combinatorial Optimization
**Winter Term 2021/22**


# 1. Introduction to Dynamic Programming (Models)

<img src="img/decision_analytics_logo.png" width="17%" align="right">


<br>

<br>
<br>

**J-Prof. Dr. Michael Römer |  Decision Analytics Group**
                                                    


In [4]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from numba import njit

# Overview
- Combinatorial Optimization Problems and Greedy Algorithms
- From Greedy to Dynamic Programming (Models)
- Online Value Function Approximation
- Wrapping up and Outlook

## Combinatorial optimization problems

- combinatorial optimization (CO) problems are discrete optimization problems, that is, optimization problems in which the set of feasible solutions is discrete
- well-known examples for CO problems are
  - the travelling salesperson problem (TSP)
  - the 0/1 knapsack problem
  - the set covering problem
- many CO problems are NP-hard, that is, there is no known algorithm that solves them in polynomial time
- in this course, we will learn 
  - how (certain) CO problems can be cast as dynamic programming models
  - and how such models can be useful for solving CO problems approximately or exactly
  - in very different ways

## Greedy Approaches to Combinatorial Optimization Problems
- given that exactly solving (large) CO problems is often not (practically) tractable, one often resorts to heuristic approaches that aim at
  - finding high-quality solutions
  - in an acceptable amount of time
- one class of heuristic approaches that is usually *very* fast (but does not always yield high-quality solutions) are so-called **greedy algorithms**
- greedy algorithms solve a CO problem by constructing a solution step by step
- in every step, a decision is made according to a **greedy criterion**:
  - in general, in case of different feasible options, the one that is (locally) optimal with respect to the greedy criterion is taken
   - example: the nearest neighbor algorithm for solving the TSP

## Example: the 0/1 knapsack problem

Given 
- a knapsack with a capacity $W$ 
- and a set of items, each with a weight $w_i$ and a value $p_i$
- determine the subset of the items to put in the knapsack such that
  - the total value of the items in the knapsack is maximal and
  - the total weight of the items in the knapsack does not exceed $W$

**Example:**

<img src="./img/greedy/07.png" width="20%" align="right">

Assume you are a thief and you are about to steal the three items depicted below from an apartment. However, your backpack can only hold 35 lbs. Which items should you take?



<img src="./img/greedy/08.png" width="40%" align="left ">

## A Greedy Approach for the Knapsack Problem

- start with some item: If it (still) fits in the backpack, put it in the backpack
- repeat for the remaining items

..you never take out an item once it has been packed in the knapsack

#### In Python:

In [46]:
def greedy_knapsack(values, weights, capacity):
    solution = [] # solution array
    obj_val = 0 # accumulated objective
    total_weight = 0 # accumulated weight
    
    for i, weight in enumerate(weights): 
        if total_weight + weight <= capacity: ## if the item still fits..
            solution.append(i) ## add it and 
            total_weight+= weight # update the accumulated weight
            obj_val += values[i] # as well as the accumulated objective value
    return obj_val, solution

..let us try it!

In [50]:
values = [3000,2000,1500]
weights = [30,20,15]
capacity = 35

greedy_knapsack(values, weights, capacity)

(3000, [0])

## Let us try larger instances

..there are many instance sets for the 0/1 KP
- as an example, there are some instances from D. Pisinger, see the instances folder in the repository associated with this notebook
- on the following website, you will find optimal objective function values:

http://artemisa.unicauca.edu.co/~johnyortega/instances_01_KP/

.. you find some instances in the GitHub repository in which this notebook resides
- if you download the zip with this notebook (or clone the repository), you will have them in the folder `problems/knapsack/instances`


## Format of the knapsack instances

the instance files have the following format:

- first row: `number_of_items` `capacity`
- every further row contains the information for one item: `value` `weight`

Our toy instance would look like this: a first row `3 35`, followed by the item rows `3000 30`, `2000 20` and `1500 15`.

## Reading in the instances

The following function reads an instance file

In [39]:
def read_knapsack_instance(filename):
    weights=[]
    values=[]
    with open(filename) as f: # open the file
        line = f.readline().split()  # split first row
        number_of_items = int(line[0]) # read number of items
        capacity = int(line[1]) # read capacity
        for i in range(number_of_items): # read rows for the items
            line = f.readline().split() # split row
            values.append(int(line[0])) # read value
            weights.append(int(line[1])) # read weight
    return np.array(values), np.array(weights), capacity

... let us try with a 5000-item instance and solve:

In [53]:
filename = "./../problems/knapsack/instances/knapPI_1_5000_1000_1" # optimal value: 276457
values, weights, capacity  = read_knapsack_instance(filename)

obj_value, _ = greedy_knapsack(values, weights, capacity)
obj_value


33727

## Exercise: Improving the greedy approach by sorting items

- one way to improve the performance of this greedy algorithm for the knapsack problem is to sort items
- which sorting criteria do you consider promising?
- sort the items accordingly and try applying the greedy algorithm to the sorted items

#### Hint:
In numpy, there is the function `argsort`, which does not return the sorted values of an array, but an array of the indexes that would sort it!
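One possible sorting criterion, given here only as a sketch and not as the intended solution of the exercise, is the value-to-weight ratio. The function name `greedy_knapsack_by_ratio` and the small three-item instance are illustrative assumptions; the sketch also assumes strictly positive weights.

```python
import numpy as np


def greedy_knapsack_by_ratio(values, weights, capacity):
    """Greedy knapsack that considers items in order of decreasing value/weight ratio."""
    values = np.asarray(values)
    weights = np.asarray(weights)
    # argsort sorts ascending, so we negate the ratios to obtain a descending order
    order = np.argsort(-values / weights)

    solution = []
    obj_val = 0
    total_weight = 0
    for i in order:
        if total_weight + weights[i] <= capacity:  # item still fits
            solution.append(int(i))
            total_weight += weights[i]
            obj_val += values[i]
    return int(obj_val), solution


# hypothetical instance: ratios are 6, 5 and 4
values = [60, 100, 120]
weights = [10, 20, 30]
capacity = 50
result = greedy_knapsack_by_ratio(values, weights, capacity)
print(result)  # (160, [0, 1])
```

On this instance the ratio-sorted greedy packs items 0 and 1 for a value of 160, although packing items 1 and 2 would yield 220: sorting helps, but the approach remains a heuristic.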


## The Travelling Salesperson Problem

<img src="https://pup-assets.imgix.net/onix/images/9780691163529.jpg" width="20%" align="right">


**Informal problem statement:** Given a set of cities and the distances between the cities, find a minimum-cost round-trip that visits each city exactly once.

**More formally:** Given a complete graph and distances between each pair of nodes in the graph, find a cost-minimal Hamiltonian cycle in the graph


- one of the best-known combinatorial optimization problems 
- **A nice book on the TSP:**  [In Pursuit of the Traveling Salesman](https://press.princeton.edu/books/paperback/9780691163529/in-pursuit-of-the-traveling-salesman)
 - the story of the TSP presented by one of its protagonists (William Cook)
- TSP website: https://www.math.uwaterloo.ca/tsp/index.html
- there are a lot of instances
    - in particular, there is a full library of instances, the so-called [TSPLib](http://comopt.ifi.uni-heidelberg.de/software/TSPLIB95/)
    - some of them are part of the git repository for the course material
    - [here](http://comopt.ifi.uni-heidelberg.de/software/TSPLIB95/STSP.html) you find optimal objective values for many instances

**..and there is even a Python library dedicated to solving the TSP: [`python-tsp`](https://github.com/fillipe-gsm/python-tsp)** 

## Nearest Neighbor: A greedy algorithm for the TSP

- we assume that the cities are indexed from 0 to $N-1$

Goal: create a list forming a permutation of the city indexes representing a tour with a small total distance 
- start with some node and add it to the list
- find a node that is not yet in the list and that is nearest to the most recently added node and add it to the list
- repeat until the list has length $N$

## A helper function that computes the nearest neighbor 

In [56]:
def get_nearest_neighbor(distance_matrix, permutation):
    
    # the last node in the permutation
    node = permutation[len(permutation)-1]
    
    smallest_distance = 9999999999 ## some large value
    nearest_neighbor = 0
    
    #number of nodes = dimension of the distance matrix
    for neighbor in range(len(distance_matrix)):
        
        if neighbor in permutation: continue # skip if already visited
        
        #update the nearest neighbor if needed
        if distance_matrix[node][neighbor] < smallest_distance: 
            nearest_neighbor = neighbor
            smallest_distance = distance_matrix[node][neighbor]            
       
    return nearest_neighbor, smallest_distance 

## The full algorithm in Python

In [57]:
def tsp_nearest_neighbor(distance_matrix, permutation):
    
    total_distance = 0
    
    #as long as the list is not "full"
    while len(permutation) < len(distance_matrix):
        
        node, distance = get_nearest_neighbor(distance_matrix, permutation)
        
        permutation.append(node)
        total_distance += distance
        
    total_distance += distance_matrix[permutation[len(permutation)-1], permutation[0]] # final
    return permutation, total_distance

# Let us try it:

In [59]:
distance_matrix = np.array([
    [0,  5, 4, 10],
    [5,  0, 8,  5],
    [4,  8, 0,  3],
    [10, 5, 3,  0]
])

tsp_nearest_neighbor(distance_matrix, [0])

([0, 2, 3, 1], 17)

## The Python library `python-tsp`

see: https://github.com/fillipe-gsm/python-tsp

### offers:
- functions to read TSP instances in the tsplib-format
  
  
  

In [64]:
from python_tsp.distances import tsplib_distance_matrix

#tsplib_file = "./../problems/tsp/instances/a280.tsp" # optimal solution 2579 (according to http://comopt.ifi.uni-heidelberg.de/software/TSPLIB95/STSP.html)
tsplib_file = "./../problems/tsp/instances/brazil58.tsp" # optimal solution 25395 (according to http://comopt.ifi.uni-heidelberg.de/software/TSPLIB95/STSP.html)
#tsplib_file = "./../problems/tsp/instances/berlin52.tsp" # optimal solution  7542 (according to http://comopt.ifi.uni-heidelberg.de/software/TSPLIB95/STSP.html)

distance_matrix = tsplib_distance_matrix(tsplib_file)

permutation, distance = tsp_nearest_neighbor(distance_matrix, [0])
distance

30774

- and heuristic as well as exact TSP algorithms
  - e.g. local search, simulated annealing and dynamic programming (exact: careful, may take very long)

In [61]:
from python_tsp.heuristics import solve_tsp_local_search, solve_tsp_simulated_annealing

#permutation, distance = solve_tsp_local_search(distance_matrix)

permutation, distance = solve_tsp_simulated_annealing(distance_matrix)
distance

25810

## Improving nearest neighbor: multi-start

- one way to improve a greedy heuristic such as nearest neighbor that relies on parameters (here: the start city) is to call the greedy algorithm multiple times with different parameters
- in general, many greedy algorithms are so fast that calling them multiple times is a perfectly feasible approach

#### Task: Write a function that calls the nearest-neighbor algorithm for every possible start node

In [66]:
def tsp_multi_start_nearest_neighbor(distance_matrix):

    return permutation, total_distance

Try it:

In [None]:
permutation, distance = tsp_multi_start_nearest_neighbor(distance_matrix)
distance
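A minimal sketch of the multi-start idea, assuming a compact re-implementation of nearest neighbor so that the block runs on its own; the names `nearest_neighbor_tour` and `multi_start_nearest_neighbor` are illustrative and not part of the course code. It is one possible way to approach the task, not the reference solution.

```python
import numpy as np


def nearest_neighbor_tour(distance_matrix, start):
    """Build a tour greedily, always moving to the closest unvisited city."""
    n = len(distance_matrix)
    tour = [start]
    total = 0
    while len(tour) < n:
        last = tour[-1]
        candidates = [j for j in range(n) if j not in tour]
        nxt = min(candidates, key=lambda j: distance_matrix[last][j])
        total += distance_matrix[last][nxt]
        tour.append(nxt)
    total += distance_matrix[tour[-1]][tour[0]]  # return to the start city
    return tour, int(total)


def multi_start_nearest_neighbor(distance_matrix):
    """Run nearest neighbor from every start city and keep the best tour."""
    best_tour, best_total = None, float("inf")
    for start in range(len(distance_matrix)):
        tour, total = nearest_neighbor_tour(distance_matrix, start)
        if total < best_total:
            best_tour, best_total = tour, total
    return best_tour, best_total


toy_matrix = np.array([
    [0,  5, 4, 10],
    [5,  0, 8,  5],
    [4,  8, 0,  3],
    [10, 5, 3,  0]
])
best_tour, best_total = multi_start_nearest_neighbor(toy_matrix)
```

Since each nearest-neighbor run is cheap, the total effort grows only by a factor of $N$ compared to a single run.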

## A generic wrapper function for calling a function solving the TSP
- to simplify our further experiments let us write a function that can call any function solving a TSP
- we use `*args` to allow passing arguments (e.g. start node) to the function
- the wrapper function
  - reads the distance matrix
  - solves the problem
  - checks the solution for feasibility
  - prints instance name, function name, total distance and time spent for solving

In [32]:
import timeit
from timeit import default_timer as timer

instance_name = "brazil58"
def solve_tsp_using_function(instance_name, tsp_function, *args):
    tsplib_file = f"./../problems/tsp/instances/{instance_name}.tsp" 
    distance_matrix = tsplib_distance_matrix(tsplib_file)
    starttime = timer()  
    permutation, distance = tsp_function(distance_matrix, *args)
    # every city must appear exactly once
    if len(permutation) != len(distance_matrix) or set(permutation) != set(range(len(distance_matrix))):
        print ("Not a proper permutation!")
    
    print(f"{instance_name}, {tsp_function.__name__}, distance: {distance}, time: {timer()- starttime:0.3f}")

## Trying out our generic wrapper

..with a function from `python-tsp`:

In [34]:
solve_tsp_using_function(instance_name, solve_tsp_simulated_annealing)    

brazil58, solve_tsp_simulated_annealing, distance: 25627, time: 4.981


..with our nearest neighbor function

In [35]:
solve_tsp_using_function(instance_name, tsp_nearest_neighbor, [0])    

brazil58, tsp_nearest_neighbor, distance: 30774, time: 0.005


..and finally, with our multi-start function

In [None]:
solve_tsp_using_function(instance_name, tsp_multi_start_nearest_neighbor)    

## Speeding up with numba

**`numba`**

- among other things, numba allows us to **just-in-time** compile Python code
- this can make Python code much faster
- but it only applies to a certain subset of Python
- see https://numba.pydata.org/ for more information

- the simplest way to apply numba is to put the *decorator* `@njit` above the function to be just-in-time compiled
- let us try this out with the function `tsp_multi_start_nearest_neighbor` and re-do the timing 
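As a hedged illustration of the decorator, the sketch below applies `@njit` to an array-based variant of nearest neighbor. It assumes that numba tends to work most smoothly with NumPy arrays, which is why the tour is stored in an array and visited cities in a boolean mask; the function name `nearest_neighbor_arrays` is hypothetical, and the `try`/`except` fallback is only there so that the block also runs where numba is not installed.

```python
import numpy as np

try:
    from numba import njit
except ImportError:  # fallback: run as plain Python if numba is unavailable
    def njit(func):
        return func


@njit
def nearest_neighbor_arrays(distance_matrix, start):
    """Nearest neighbor using only arrays and loops (friendly to JIT compilation)."""
    n = distance_matrix.shape[0]
    visited = np.zeros(n, dtype=np.bool_)
    tour = np.empty(n, dtype=np.int64)
    tour[0] = start
    visited[start] = True
    total = 0.0
    for pos in range(1, n):
        node = tour[pos - 1]
        best = -1
        best_dist = np.inf
        for cand in range(n):
            if not visited[cand] and distance_matrix[node, cand] < best_dist:
                best = cand
                best_dist = distance_matrix[node, cand]
        tour[pos] = best
        visited[best] = True
        total += best_dist
    total += distance_matrix[tour[n - 1], tour[0]]  # return to the start city
    return tour, total


toy_matrix = np.array([
    [0.0,  5.0, 4.0, 10.0],
    [5.0,  0.0, 8.0,  5.0],
    [4.0,  8.0, 0.0,  3.0],
    [10.0, 5.0, 3.0,  0.0]
])
tour, total = nearest_neighbor_arrays(toy_matrix, 0)
```

Note that the first call includes the compilation time; timing comparisons are more meaningful from the second call onwards.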



# From Greedy to Dynamic Programming (Models)

## Modeling a discrete multi-stage transition system

<img src="./img/deterministic_multistage_problem.png" width="60%">



- $k$ the current step / stage (e.g. the number of cities visited so far), out of $N$ stages.
- $x_k$ the current state needed to calculate the next step and the cost
    - e.g. the cities visited so far and the current city
    - the start state is defined as $x_0$
- $u_k$ a decision from the set $U_k(x_k)$ of feasible decisions when being in stage $k$ and in state $x_k$ 
  - e.g. a city that was not visited so far  
- $g(x_k, u_k)$ the cost of choosing decision $u_k$ when being in state $x_k$
  - e.g. the distance to the next city
- $f(x_k, u_k)$ a transition function that computes $x_{k+1}$ from $x_k$ and the decision $u_k$ 
  - e.g. an augmentation of the cities visited so far and an update of the current city



## A dynamic programming model

A model for such a discrete system as defined on the previous slide along with the optimization problem:

$$\min_{u_0,..,u_k,..u_{N-1}} \sum_{k=0}^{N-1} g_k(x_k,u_k)$$


..will be referred to as a **dynamic programming model** in the remainder of this course, and we will refer to this generic problem as $DP$ in what follows

#### Observe:
- here, we assume a minimization problem, but it is straightforward to obtain a corresponding maximization problem
- we also assume the costs are "stage-wise" additive (but: they can be state-dependent!)
- we assume there are no terminal costs $g(x_N)$ (would be straightforward to include)
- there can be far more general DP models, but for now we stick to classes that can be represented as displayed above


## Example: A DP model for the Knapsack Problem

Given:
- a knapsack instance with $N$ items with weights $w_k$ and profits $p_k$ (zero-indexed) and capacity $W$ 

- state $x_k$: accumulated weight after deciding on the first $k$ items (items $0, \ldots, k-1$), $x_0 = 0$
- decision $u_k \in \{0, 1\}$ (0: do not add item $k$ to the knapsack; 1: add item $k$)
- $U_k(x_k) = \begin{cases} 
                \{0,1\} \quad \mathrm{if} \quad x_k + w_k \leq W \\
                \{0 \} \quad \mathrm{else}
\end{cases}$
- $f(x_k, u_k) = x_k + w_k u_k $
- $g(x_k, u_k) = p_k u_k$

We have a maximization-objective:

$$\max_{u_0,..,u_k,..u_{N-1}} \sum_{k=0}^{N-1} g_k(x_k,u_k)$$
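The model components above can be written down almost literally in Python. The following is a minimal sketch; the function names `feasible_decisions`, `transition`, `reward` and `evaluate` are illustrative assumptions, and the instance is the toy instance from the greedy example.

```python
def feasible_decisions(k, x, weights, capacity):
    """U_k(x_k): item k may only be added if it still fits."""
    return [0, 1] if x + weights[k] <= capacity else [0]


def transition(k, x, u, weights):
    """f(x_k, u_k) = x_k + w_k * u_k"""
    return x + weights[k] * u


def reward(k, u, values):
    """g(x_k, u_k) = p_k * u_k"""
    return values[k] * u


def evaluate(decisions, values, weights, capacity):
    """Simulate the system for a given decision sequence u_0, ..., u_{N-1}."""
    x = 0       # x_0 = 0: nothing packed yet
    total = 0
    for k, u in enumerate(decisions):
        if u not in feasible_decisions(k, x, weights, capacity):
            raise ValueError(f"decision {u} is infeasible in stage {k}")
        total += reward(k, u, values)
        x = transition(k, x, u, weights)
    return total, x


values = [3000, 2000, 1500]
weights = [30, 20, 15]
capacity = 35

print(evaluate([1, 0, 0], values, weights, capacity))  # (3000, 30)
print(evaluate([0, 1, 1], values, weights, capacity))  # (3500, 35)
```

The first decision sequence is the one found by the unsorted greedy algorithm; the second one skips the first item and obtains a higher total value.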


## Example: A DP model for the TSP

Given:
- a TSP instance with $N$ cities and distances $d_{i,j}$ between cities $i,j$
  - let us denote with $\mathcal{N} = \{1, \ldots, N \}$ the set of cities 



- state $x_k$: sequence / ordered set of cities visited so far, $x_0 = i^0$ where $i^0$ is the first city
  - let us define $l(x_k)$ as the last element in the ordered set, that is, the "current" city

- decision $u_k \in \{1, \ldots, N\}$: city to visit next 
- $U_k(x_k) = \mathcal{N} \setminus x_k$
- $f(x_k, u_k) = x_k + u_k$  (here, with $+$ we mean appending $u_k$ to the sequence / ordered set $x_k$)
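- $g(x_k, u_k) = \begin{cases} 
                d_{l(x_k), u_k} \quad \mathrm{if} \quad k < N-2 \\
               d_{l(x_k), u_k} +  d_{u_k, i^0} \quad  \mathrm{if} \quad k = N-2
\end{cases}$
  - since $x_0$ already contains the first city, only $N-1$ decisions $u_0, \ldots, u_{N-2}$ remain; the last one also pays for the return to $i^0$

We have a minimization objective:

$$\min_{u_0,..,u_k,..u_{N-2}} \sum_{k=0}^{N-2} g_k(x_k,u_k)$$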



## Greedy as a myopic policy 

- we will see later how to solve a $DP$ model to optimality

- in general, we refer to a function $\pi$ that maps a state $x_k$ to a decision $u_k$ as a **policy**
- in a deterministic problem, given a policy $\pi$, we can obtain a solution to $DP$ by 
  - starting from $x_k := x_0$ and selecting the $u_k$ according to the policy
  - applying the state transition $x_{k+1} = f(x_k, u_k)$
  - and continue until $k:= N -1$

Greedy as a policy:
- we can view the greedy algorithm as being based on a policy that selects a $u_k$ that minimizes the transition costs $g$:

$$u_k = \underset{u_k \in U_k}{\operatorname{argmin}} \, g(x_k, u_k)$$

Observe: This policy is **myopic** since it does not account for how deciding for a certain $u_k$ affects the quality of the remaining solution process - hence its name: *greedy*.
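To make the notion of a policy concrete, the sketch below applies a policy stage by stage to a DP model and uses the myopic (greedy) policy on the TSP model. All names (`run_policy`, `U`, `f`, `g`, `greedy_policy`) are illustrative assumptions. Since the start state already contains the first city, the sketch runs $N-1$ decision stages, and the last decision also pays for the return to the start city.

```python
def run_policy(x0, num_stages, policy, f, g):
    """Apply a policy stage by stage and accumulate the stage costs."""
    x, total, decisions = x0, 0, []
    for k in range(num_stages):
        u = policy(k, x)      # u_k = pi(x_k)
        total += g(k, x, u)   # add g(x_k, u_k)
        x = f(k, x, u)        # x_{k+1} = f(x_k, u_k)
        decisions.append(u)
    return decisions, total


# toy TSP instance; cities are indexed from 0 here
D = [[0, 5, 4, 10],
     [5, 0, 8, 5],
     [4, 8, 0, 3],
     [10, 5, 3, 0]]
N = len(D)


def U(k, x):
    """Feasible decisions: all cities not yet in the partial tour x."""
    return [u for u in range(N) if u not in x]


def f(k, x, u):
    """Transition: append city u to the partial tour."""
    return x + (u,)


def g(k, x, u):
    """Stage cost: distance to u, plus the way back home for the last city."""
    cost = D[x[-1]][u]
    if len(x) == N - 1:
        cost += D[u][x[0]]
    return cost


def greedy_policy(k, x):
    """Myopic policy: pick the decision with the smallest immediate cost."""
    return min(U(k, x), key=lambda u: g(k, x, u))


decisions, total = run_policy((0,), N - 1, greedy_policy, f, g)
print(decisions, total)  # [2, 3, 1] 17
```

Swapping `greedy_policy` for any other function with the same signature changes the solution method without touching the model.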

## Accounting for the future: The value function 


- to quantify the future value (also called cost-to-go) of a state $x_k$, we use the so-called value function $J(x_k)$ 
  - given a state $x_k$, $J(x_k)$ represents the cost / value obtained by solving the residual problem from stages $k$ to $N-1$.
  - the corresponding problem starting at $k$ is also referred to as the **tail subproblem**.
  
 

 
   
Given a value function, we can compute the decision to take in stage  $k$ as:

$$u_k = \underset{u_k \in U_k(x_k)}{\operatorname{argmin}} \, \Big( g(x_k, u_k) + J(f(x_k, u_k)) \Big) $$


Observe that for the greedy policy for the knapsack, $J(x_{k}) = 0$ 

## Q-values / Q-factors

- in some cases, in particular in some reinforcement learning approaches, it is convenient to use so-called Q-factors,  Q-values or Q-functions

$$Q_k(x_k, u_k) = g(x_k, u_k) + J\big( f(x_k, u_k ) \big)$$

..using these Q-factors, we can re-write the problem of selecting the next decision / control / action as:

$$u_k = \underset{u_k \in U_k(x_k)}{\operatorname{argmin}} \, Q_k(x_k, u_k) $$
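For a tiny instance, the exact cost-to-go and the resulting Q-factors can be computed by brute-force recursion. The sketch below does this for the four-city toy TSP; the names `cost_to_go` and `q_factors` are illustrative, and for brevity the return to the start city is treated as a terminal cost, which is an assumption that deviates slightly from the stage costs defined on the slides.

```python
from functools import lru_cache

D = ((0, 5, 4, 10),
     (5, 0, 8, 5),
     (4, 8, 0, 3),
     (10, 5, 3, 0))
N = len(D)


@lru_cache(maxsize=None)
def cost_to_go(x):
    """J(x): minimal cost of completing the partial tour x (a tuple of cities)."""
    if len(x) == N:
        return D[x[-1]][x[0]]  # only the way back to the start is left
    return min(D[x[-1]][u] + cost_to_go(x + (u,))
               for u in range(N) if u not in x)


def q_factors(x):
    """Q(x, u) = g(x, u) + J(f(x, u)) for every feasible decision u."""
    return {u: D[x[-1]][u] + cost_to_go(x + (u,))
            for u in range(N) if u not in x}


q = q_factors((0,))
print(q)                  # {1: 17, 2: 17, 3: 26}
print(min(q, key=q.get))  # 1
```

The immediate distances from city 0 are 5, 4 and 10, so the myopic choice is city 2; the Q-factors show that cities 1 and 2 are equally good first steps once the future is accounted for, while city 3 is clearly worse. The recursion enumerates all partial tours and is therefore only viable for very small $N$.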

## Optimal and approximate value functions


#### The optimal (exact) value function
- we denote the exact / optimal value function with $J^*$.
- if we have access to $J^*$, then the greedy policy based on $g(x_k, u_k) + J^*(f(x_k, u_k))$ gives us an optimal solution
- the problem: computing $J^*$ is typically prohibitively hard

#### Approximate value functions

- we denote an approximate value function with $\tilde{J}$
- a greedy policy based on $\tilde{J}$ is suboptimal,  but can be much faster to compute
- approximate value functions $\tilde{J}$ can be determined in various ways:
  - using offline training / learning
  - using problem simplification or aggregation (solve an approximate tail problem)
  - using online techniques (e.g. rollout), see later

## Exact Dynamic Programming

How to obtain an exact value function?

In general,  this requires exactly solving the DP model
- there are various ways to do this (backward or forward), we will consider this in somewhat more detail next week
- for large-scale CO problems, exactly solving the DP model usually takes prohibitively long

## Exact Dynamic Programming: An Illustration of the Knapsack Case

One approach to exactly solve a DP model is to
- create the state transition graph and
- compute the shortest (longest) path in the graph


<img src="./img/reaching_05.png" width="60%">
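A minimal sketch of this idea for the knapsack model, assuming a forward ("reaching") pass over the state-transition graph: each layer stores, for every reachable accumulated weight, the best value with which that state can be reached. The function name `knapsack_by_reaching` is an illustrative assumption, and only the optimal value (not the packed items) is returned.

```python
def knapsack_by_reaching(values, weights, capacity):
    """Longest path through the layered state-transition graph of the knapsack DP."""
    layer = {0: 0}  # stage 0: state x_0 = 0 reached with value 0
    for k in range(len(values)):
        next_layer = {}
        for x, value_so_far in layer.items():
            for u in (0, 1):
                if u == 1 and x + weights[k] > capacity:
                    continue  # decision not in U_k(x_k)
                x_next = x + weights[k] * u              # f(x_k, u_k)
                candidate = value_so_far + values[k] * u  # add g(x_k, u_k)
                # keep only the best way of reaching each state
                if candidate > next_layer.get(x_next, float("-inf")):
                    next_layer[x_next] = candidate
        layer = next_layer
    return max(layer.values())


print(knapsack_by_reaching([3000, 2000, 1500], [30, 20, 15], 35))  # 3500
```

Because states with the same accumulated weight are merged, each layer has at most $W + 1$ states for integer weights, which is what makes the approach pseudo-polynomial rather than exponential.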



# Approximate value functions / approximation in value space


### How can we obtain an approximate value function?

#### Offline value function approximation

- training machine learning models using given solutions obtained from
  - exact or heuristic solution approaches
  - or "self-generated solutions" obtained using reinforcement learning methods


#### Online value function approximation

- instead of pre-training, one obtains $\tilde{J}$ by heuristically solving the tail subproblem using
  - so-called rollout with a base heuristic
  - possibly combined with so-called multi-step lookahead
- this is what we will discuss now.




# Online approximation in value space

## Online approximation in value space

We will consider the following approaches now:
- rollout with greedy as base heuristic
- approximate one-step minimization: simplified rollout
- two-step lookahead with rollout
- multi-step lookahead with rollout 



#### Running case study: TSP
- how much are we able to improve upon a simple (multi-start) nearest neighbor algorithm?
- will we be able to "beat" Python-TSP?


## Rollout with a Base Heuristic

- recall that given an approximate value function $\tilde{J}$, we can construct a policy that takes the decision according to the best approximate $Q$-value $\tilde{Q}_k(x_k, u_k)$

$$u_k = \underset{u_k \in U_k(x_k)}{\operatorname{argmin}} \, \tilde{Q}_k(x_k, u_k) = \underset{u_k \in U_k(x_k)}{\operatorname{argmin}} \, \Big( g(x_k, u_k) +  \tilde{J}(f(x_k, u_k)) \Big) $$
  
  
- key idea of rollout:  run a (simple and fast) base heuristic on the tail subproblem starting from $x_{k+1} = f(x_k, u_k)$ to obtain a cost / value $H(f(x_k, u_k))$, and use that value as value function approximation:
  - $\tilde{J}(x_{k+1}) = H (x_{k+1})$


<img src="./img/rollout_general.png" width="60%">

..we will illustrate this now for the TSP, using nearest neighbor as the base heuristic.

## Rollout with a Base Heuristic for the TSP



## Rollout with a base heuristic for the TSP: getting the next city



In [3]:
@njit
def get_next_city_rollout_nn(distance_matrix, permutation):
    
    node = permutation[len(permutation)-1] 
    best_q_value = 1000000
    best_node = node        
    
    # loop over all u_k
    for next_node in range(len(distance_matrix)):        
        if next_node in permutation: continue # skip if infeasible (not in U(x_k))       
            
        # compute x_k+1 = f(x,u)
        state_next_stage = permutation + [next_node]        
        # compute J-tilde(x_k+1) using nn as base heuristic (_, means we ignore the first return value)
        _, nn_value = tsp_nearest_neighbor(distance_matrix, state_next_stage)
        
        # (approximate) q_value = g(x_k,u_k) + J-tilde
        q_value = distance_matrix[node,next_node] + nn_value
        if q_value < best_q_value:
            best_node = next_node
            best_q_value = q_value  
   
    return best_node, distance_matrix[node,best_node]



## Rollout with a base heuristic for the TSP: the main function

In [None]:
@njit
def tsp_rollout_nn(distance_matrix, permutation):
        
    total_distance = 0
    
    #while the solution is not complete
    while len(permutation) < len(distance_matrix):    
        
        next_node, distance = get_next_city_rollout_nn(distance_matrix, permutation)
        permutation.append(next_node)
        total_distance += distance     
        
    total_distance += distance_matrix[permutation[len(permutation)-1],permutation[0]]
    return permutation, total_distance

In [None]:
#solve_tsp_using_function(instance_name, tsp_rollout_nn, [0])    

## Rollout: Some comments


- applying rollout is always at least as good as running only the base heuristic
  - (as long as certain very natural conditions are satisfied)
- the base heuristic is not restricted to greedy heuristics, but may also involve
  - ML-based policies, e.g. policies resulting from applying a "policy (neural) network"
  - multiple different base heuristics
- given that our policy involves one "exact" step before running the rollout, we call this approach one-step lookahead minimization with rollout
- one-step lookahead minimization also works with other value function approximations (e.g. those obtained from offline training)


## Approximate one-step minimization: Simplified rollout


- applying rollout to every feasible decision may take a lot of time
- this will be even more true for multi-step lookahead


- in order to speed up the solution process, we can approximate the minimization step by not considering every single $u_k \in U_k(x_k)$ but only the "most promising"
  - as an example, we can use the greedy criterion $g(x_k,u_k)$ as criterion for restricting the decisions to consider.
  




## Simplified rollout: getting the next city

In [15]:
@njit
def get_next_city_simplified_rollout_nn(distance_matrix, permutation,  max_number_of_neighbors_rollout):
    
    node = permutation[len(permutation)-1] 
                                    
    best_q_value = 1000000
    best_node = node
   
    # sort neighbors according to greedy criterion
    sorted_neighbors = np.argsort(distance_matrix[node])    
    number_of_neighbors_rollout = 0
    
    for next_node in sorted_neighbors:
        if next_node in permutation: continue            
        
        number_of_neighbors_rollout += 1        
        if number_of_neighbors_rollout > max_number_of_neighbors_rollout:
            break     
                                    
        #from now on: same as in non-simplified rollout     
        state_next_stage = permutation + [next_node]        
        # compute J-tilde(x_k+1) using nn as base heuristic (_, means we ignore the first return value)
        _, nn_value = tsp_nearest_neighbor(distance_matrix, state_next_stage)
        
        # (approximate) q_value = g(x_k,u_k) + J-tilde
        q_value = distance_matrix[node,next_node] + nn_value
        if q_value < best_q_value:
            best_node = next_node
            best_q_value = q_value  

    return best_node, distance_matrix[node,best_node]


## Simplified rollout: the main function

In [None]:
@njit
def tsp_simplified_rollout_nn(distance_matrix, permutation):
    
    # please complete this function!
    
    return permutation, total_distance

In [None]:
#solve_tsp_using_function(instance_name, tsp_simplified_rollout_nn, [0])   

## Multi-step lookahead

- in (exact) dynamic programming (by reaching), we (somewhat) construct a full state-transition graph
- in rollout, in each iteration, only the first step is "exact", the rest of the graph is approximated
- in multi-step lookahead, we partially expand the tree (for more than one stage) to have more "exact" steps before using a value function approximation for selection
- below: multi-step lookahead with rollout for value function approximation

<img src="./img/multistep_lookahead.png" width="60%">

#### Very important:
- at each step $k$, only a single $u_k$ is selected - all the computations of the $u_{k+1}$ are only performed to get a better $\tilde{J}$!
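The following pure-Python sketch illustrates multi-step lookahead with rollout on the toy TSP instance. It is one possible way to organize the computation, not the reference solution of the exercises: all function names are hypothetical, the candidate restriction of the simplified variant is left out, and with `lookahead_steps = 1` the sketch reduces to plain rollout with nearest neighbor as base heuristic.

```python
def nn_completion_cost(D, tour):
    """Base heuristic H: complete the partial tour by nearest neighbor and
    return the cost of the added legs, including the way back to the start."""
    tour = list(tour)
    n = len(D)
    cost = 0
    while len(tour) < n:
        last = tour[-1]
        nxt = min((j for j in range(n) if j not in tour), key=lambda j: D[last][j])
        cost += D[last][nxt]
        tour.append(nxt)
    return cost + D[tour[-1]][tour[0]]


def lookahead_value(D, tour, steps):
    """Approximate cost-to-go: expand `steps` further stages exactly, then roll out."""
    n = len(D)
    if steps == 0 or len(tour) == n:
        return nn_completion_cost(D, tour)
    last = tour[-1]
    return min(D[last][u] + lookahead_value(D, tour + [u], steps - 1)
               for u in range(n) if u not in tour)


def next_city(D, tour, lookahead_steps):
    """Select a single u_k; the deeper stages only serve to improve J-tilde."""
    last = tour[-1]
    candidates = [u for u in range(len(D)) if u not in tour]
    return min(candidates,
               key=lambda u: D[last][u] + lookahead_value(D, tour + [u], lookahead_steps - 1))


def tsp_lookahead(D, start, lookahead_steps):
    tour, total = [start], 0
    while len(tour) < len(D):
        u = next_city(D, tour, lookahead_steps)
        total += D[tour[-1]][u]
        tour.append(u)
    return tour, total + D[tour[-1]][tour[0]]


D = [[0, 5, 4, 10],
     [5, 0, 8, 5],
     [4, 8, 0, 3],
     [10, 5, 3, 0]]
tour_1, total_1 = tsp_lookahead(D, 0, 1)  # rollout (one-step lookahead)
tour_3, total_3 = tsp_lookahead(D, 0, 3)  # lookahead covers all remaining stages
```

The number of partial tours expanded grows exponentially with `lookahead_steps`, which is the motivation for restricting the candidates at each level in the simplified variants.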

## Two-Step lookahead with rollout: getting the next city

- let us start with two-step lookahead
- we will directly start with a simplified version
- please fill the gap in the function below!

In [5]:
@njit
def get_next_city_simplified_two_step_lookahead(distance_matrix, permutation, max_number_of_neighbors):
    
    node = permutation[-1] 
    best_q_value = 1000000
    best_node = node
    
    # caution: we cannot do more lookahead than there are steps left
    number_of_lookahead_steps = min(2, len(distance_matrix) - len(permutation))    
    
    ## everything below is code needed for iterating over the first feasible next cities 
    ## (see above)
    sorted_neighbors = np.argsort(distance_matrix[node])
    number_of_neighbors = 0
    for next_node in sorted_neighbors:
        if next_node in permutation: continue     
        number_of_neighbors += 1
        if number_of_neighbors > max_number_of_neighbors:
            break                     
            
        
        state_next_stage = permutation + [next_node]
        
        ## how do we obtain an approximate value function in this case?
        
        # PLEASE FILL the gap here

        q_value = distance_matrix[node,next_node] + value
        
        if q_value < best_q_value:
            best_node = next_node
            best_q_value = q_value
    
   
    return best_node, distance_matrix[node,best_node]


## Two-Step lookahead with rollout: main function



In [None]:
@njit
def tsp_simplified_two_step_lookahead(distance_matrix, permutation, max_number_of_neighbors):
    

    total_distance = 0    
    node = permutation[len(permutation)-1]
    
    while len(permutation) < len(distance_matrix):   
        
        # two-step lookahead only makes sense if at least two cities are left
        if len(permutation) < len(distance_matrix) -1:
            node, distance = get_next_city_simplified_two_step_lookahead(distance_matrix, permutation, max_number_of_neighbors)
        else:
            node, distance = get_next_city_simplified_rollout_nn(distance_matrix, permutation, max_number_of_neighbors)
            
        permutation.append(node)
        total_distance += distance     
    

    
    total_distance += distance_matrix[permutation[len(permutation)-1],permutation[0]]
    return permutation, total_distance

In [None]:
#solve_tsp_using_function(instance_name, tsp_simplified_two_step_lookahead, [0])   

## (Generic) multi-step lookahead

..how can we generically implement a multi-step lookahead version of the `get_next_city..`-function?

- please fill the gap in the function below


In [6]:
@njit
def get_next_city_simplified_multi_step_lookahead(distance_matrix, permutation, max_number_of_neighbors, number_of_lookahead_steps):
    node = permutation[-1] 
    best_q_value = 1000000
    best_node = node
    
    # caution: we cannot do more lookahead than there are steps left
    number_of_lookahead_steps = min(number_of_lookahead_steps, len(distance_matrix) - len(permutation))    
    
    ## everything below is code needed for iterating over the first feasible next cities 
    ## (see above)
    sorted_neighbors = np.argsort(distance_matrix[node])
    number_of_neighbors = 0
    for next_node in sorted_neighbors:
        if next_node in permutation: continue     
        number_of_neighbors += 1
        if number_of_neighbors > max_number_of_neighbors:
            break                     
        
        state_next_stage = permutation + [next_node]
        
        ## how do we obtain an approximate value function in this case?
        ##PLEASE FILL the gap here
        
        q_value = distance_matrix[node,next_node] + value

        if q_value < best_q_value:
            best_node = next_node
            best_q_value = q_value
    
   
    return best_node, distance_matrix[node,best_node]

## Generic multi-step lookahead: the main function

In [22]:
@njit
def tsp_simplified_multi_step_lookahead(distance_matrix, permutation, max_number_of_neighbors, number_of_lookahead_steps):
    

    total_distance = 0
    
    node = permutation[len(permutation)-1]   

    while len(permutation) < len(distance_matrix):    
        
        node, distance = get_next_city_simplified_multi_step_lookahead(distance_matrix,
                                                                      permutation,                                                                      
                                                                      max_number_of_neighbors,
                                                                      number_of_lookahead_steps)
        permutation.append(node)
        total_distance += distance     
     
    
    total_distance += distance_matrix[permutation[len(permutation)-1],permutation[0]]
    return permutation, total_distance

In [None]:
# number_of_lookahead_steps (and max_number_of_neighbors) still need to be chosen before running the call below
#solve_tsp_using_function(instance_name, tsp_simplified_multi_step_lookahead, [0], 2)   

# Wrapping up and Outlook

## Approximation in policy space

- so far, we discussed approximation in value space (approximating the value function $J$)
- we then always obtained a policy by doing one-step lookahead minimization using the approximate value function $\tilde{J}$:
 
 $$u_k = \underset{u_k \in U_k(x_k)}{\operatorname{argmin}} \, \Big( g(x_k, u_k) + \tilde{J}(x_{k+1}) \Big)$$

- however, it is also possible to approximate policies $\pi$ that directly give us a decision (without that minimization):

$$u_k = \pi(x_k)$$

As an example, such an approximate $\pi$ can be obtained via offline training:
- optimizing a parameterized policy function (e.g. a linear decision rule)
- in case of discrete decisions: training / learning a classification model based on given policies or by reinforcement learning (self-learning)




**Observe:** 

- a (learned) approximate policy can be used as a base policy for a rollout algorithm
- an offline learned policy can often be substantially improved by including some lookahead and rollout steps ("online play" in games)

## Simplified AlphaZero architecture as an example for a hybrid approach

<img src="./img/alpha_zero_sketch.png" width="60%">

- "Position evaluator" is a value function approximation
- rollout is not fully performed but truncated; at the end of the rollout, an approximate value function is used to account for the future value

## References / Going deeper

<img src="./img/lessons_az.jpg" width="20%" align="right">

- much of the notation and most figures are taken from presentations and books from D. Bertsekas
- Bertsekas has many books, his most recent one (see on the right) is available for free
- he also has a couple of lectures and courses available online
- you can find links to all his materials on his website https://www.mit.edu/~dimitrib/home.html
  - in particular in the section http://web.mit.edu/dimitrib/www/RLbook.html



## Conclusions and Outlook


#### This week, we...
- got used to the concept of a DP model and dynamic programming
- learned how to use simple greedy heuristics to derive relatively powerful heuristic solution approaches
- maybe got a first or different perspective on the relation of DP and reinforcement learning

#### Next week, we...
- will have a closer look at exact approaches for solving DP models
- will discuss so-called Decision Diagrams which
  - provide an exact technique for reducing the state-transition graph of a DP model
  - provide a generic mechanism for obtaining combinatorial relaxations from DP models
  - and, building on that, allow constructing an interesting Branch-and-Bound scheme that does not rely on LP relaxations