<a href="https://colab.research.google.com/github/changsin/AI/blob/main/rl_value_iteration.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Value Iteration

In reinforcement learning, two classical methods for finding the optimal policy are value iteration and policy iteration. In value iteration, the optimal policy is obtained from the Bellman equation:

$$ \pi^*(s) = \underset{a}{\operatorname{argmax}} \, [r(s, a) + \gamma V^*(\delta(s, a))] $$

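- $ \pi^*(s) $: optimal policy, i.e. the best action in state $ s $
- $ r(s, a) $: immediate reward for taking action $ a $ in state $ s $
- $ \gamma V^*(\delta(s, a)) $: discounted optimal value of the successor state $ \delta(s, a) $, where $ \gamma $ is the discount factor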
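
As a small illustration of the argmax above, the sketch below scores each action in a single state by its immediate reward plus the discounted value of the state it leads to, then picks the highest-scoring action. The action names and numbers are hypothetical and chosen only to make the arithmetic easy to follow.

```python
# Hypothetical one-state example: action names and numbers are illustrative only.
gamma = 0.9

# For each action a available in state s:
# (immediate reward r(s, a), value of the successor state V(delta(s, a)))
candidates = {
    "left": (1.0, 2.0),
    "up": (0.0, 3.0),
    "right": (-1.0, 4.0),
}


def lookahead(reward, next_value, gamma):
    """One-step lookahead score: r(s, a) + gamma * V(delta(s, a))."""
    return reward + gamma * next_value


scores = {a: lookahead(r, v, gamma) for a, (r, v) in candidates.items()}
best_action = max(scores, key=scores.get)  # the argmax over actions
print(scores, best_action)
```

Here "left" wins (1 + 0.9 · 2 = 2.8) even though "right" leads to the most valuable successor, because the immediate reward of -1 outweighs the discounted gain.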

The optimal policy attains the maximum value $ V^* $, which satisfies:
$$ V^*(s) = \underset{a}{\max} \, [r(s, a) + \gamma V^*(\delta(s, a))] $$
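
To see how this equation can serve as an update rule, here is a minimal sketch on a hypothetical two-state problem. The `MODEL` dictionary, the state and action names, and the tolerance are assumptions made for illustration. Each sweep recomputes every state's value from the previous sweep's values and stops once the largest change falls below the tolerance.

```python
GAMMA = 0.9

# Hypothetical deterministic model: state -> {action: (next_state, reward)}
MODEL = {
    "A": {"go": ("B", 1.0), "stay": ("A", 0.0)},
    "B": {"back": ("A", 0.0), "stay": ("B", 0.0)},
}


def backup(state, values):
    """One application of V(s) = max_a [r(s, a) + gamma * V(delta(s, a))]."""
    return max(r + GAMMA * values[nxt] for nxt, r in MODEL[state].values())


def iterate(tol=1e-6, max_sweeps=1000):
    """Repeat the backup for all states until the largest change is below tol."""
    values = {s: 0.0 for s in MODEL}
    for sweep in range(1, max_sweeps + 1):
        new_values = {s: backup(s, values) for s in MODEL}
        change = max(abs(new_values[s] - values[s]) for s in MODEL)
        values = new_values
        if change < tol:
            return values, sweep
    return values, max_sweeps


values, sweeps = iterate()
print(values, sweeps)
```

The fixed point can be checked by hand: $ V(A) = 1 + 0.9 V(B) $ and $ V(B) = 0.9 V(A) $ give $ V(A) = 1 / (1 - 0.81) \approx 5.26 $.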

Value iteration applies this equation as an update rule: each state's value is recomputed repeatedly until no value changes by more than a small tolerance. The following code implements the algorithm using the example on p. 297 of Ertel (2017).

In [None]:
import numpy as np
from enum import Enum

class Action:
  def __init__(self, from_pos, move, reward):
    self.from_pos = from_pos
    self.move = move
    self.reward = reward

class Move(Enum):
  L = 1
  R = 2
  U = 3
  D = 4

class Grid:
  # Values are treated as converged once a sweep changes no value
  # by more than PRECISION.
  PRECISION = 0.001
  MOVES = [Move.L, Move.R, Move.U, Move.D]

  def __init__(self, state, actions, gamma = 0.9, rows=3, columns=3):
    self.state = state
    self.actions = actions
    self.gamma = gamma
    self.rows = rows
    self.columns = columns
  
  def __repr__(self):
    rows = ""
    for row in range(self.rows):
      columns = ""
      for col in range(self.columns):
        columns += " {:03.02f} ".format(self.state[row][col])
      rows += "\n" + columns
      # print(row_values)
    return rows

  def converge(self, limit=100):
    for i in range(limit):
      if self.update_state() == 0:
        print("Converged at ", i)
        return
    
    print("Did not converge after ", limit)

  def update_state(self):
    updated = 0
    # update the state from bottom to top, left to right
    for row in range(self.rows - 1, -1, -1):
      for col in range(0, self.columns, 1):
        value_original = self.state[row][col]
        self.state[row][col] = self.calc_optimal_value((row, col))

        if abs(value_original - self.state[row][col]) > self.PRECISION:
          updated += 1
    print(self)
    return updated
  
  def to_target_pos(self, pos, move):
    cur_row, cur_col = pos
    if Move.L == move:
      return (cur_row, cur_col - 1)
    elif Move.R == move:
      return (cur_row, cur_col + 1)
    elif Move.U == move:
      return (cur_row - 1, cur_col)
    elif Move.D == move:
      return (cur_row + 1, cur_col)
    else:
      raise Exception("Invalid move")

  def is_valid_move(self, pos, move):
    target_row, target_col = self.to_target_pos(pos, move)
    return target_row < self.rows and target_col < self.columns and \
      target_row >= 0 and target_col >= 0

  def get_immediate_reward(self, pos, move):
    return self.actions.get((pos, move), 0)

  def get_state_value(self, pos):
    return self.state[pos[0]][pos[1]]

  def calc_optimal_value(self, pos):
    """
    Return the best one-step lookahead value for pos:
    V(s) = max_a [r(s, a) + gamma * V(delta(s, a))]
    """
    possible_values = []
    for move in self.MOVES:
      if self.is_valid_move(pos, move):
        target_pos = self.to_target_pos(pos, move)
        value = self.get_immediate_reward(pos, move) + self.gamma*self.get_state_value(target_pos)
        possible_values.append(value)
    return max(possible_values)

In [None]:
grid_state = [[0, 0, 0],
              [0, 0, 0],
              [0, 0, 0]]

actions = {}
actions[(2, 0), Move.R] = -1
actions[(2, 1), Move.R] = -1
actions[(2, 1), Move.L] = 1
actions[(2, 2), Move.L] = 1


grid = Grid(grid_state, actions)
# print(grid)
# grid.update_state()
grid.converge()
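
The run above produces state values, but the policy itself is the argmax over moves in each state. The following is a minimal, self-contained sketch of that last step for the same 3x3 example. It does not reuse the `Grid` class: `STEPS`, `REWARDS`, `successors`, `value_iteration`, and `greedy_policy` are hypothetical names introduced here, and the values are recomputed with a synchronous sweep rather than the in-place sweep used above.

```python
GAMMA = 0.9
ROWS, COLS = 3, 3
STEPS = {"L": (0, -1), "R": (0, 1), "U": (-1, 0), "D": (1, 0)}
# Same rewards as the `actions` dictionary above; all other moves give 0.
REWARDS = {
    ((2, 0), "R"): -1,
    ((2, 1), "R"): -1,
    ((2, 1), "L"): 1,
    ((2, 2), "L"): 1,
}


def successors(pos):
    """Yield (move, target, reward) for every move that stays on the grid."""
    for move, (dr, dc) in STEPS.items():
        target = (pos[0] + dr, pos[1] + dc)
        if 0 <= target[0] < ROWS and 0 <= target[1] < COLS:
            yield move, target, REWARDS.get((pos, move), 0)


def value_iteration(tol=1e-6, max_sweeps=1000):
    """Synchronous sweeps until the largest change is below tol."""
    values = {(r, c): 0.0 for r in range(ROWS) for c in range(COLS)}
    for _ in range(max_sweeps):
        new_values = {
            pos: max(rew + GAMMA * values[tgt] for move, tgt, rew in successors(pos))
            for pos in values
        }
        change = max(abs(new_values[p] - values[p]) for p in values)
        values = new_values
        if change < tol:
            break
    return values


def greedy_policy(values):
    """Pick, for each state, the move maximizing r(s, a) + gamma * V(delta(s, a))."""
    return {
        pos: max(successors(pos), key=lambda s: s[2] + GAMMA * values[s[1]])[0]
        for pos in values
    }


values = value_iteration()
policy = greedy_policy(values)
for r in range(ROWS):
    print(" ".join(policy[(r, c)] for c in range(COLS)))
```

In the bottom row the resulting policy moves left to collect the two +1 rewards and then goes up from the corner to loop back around. Where two moves score equally, `max` returns the first one in the iteration order of `STEPS`, so the choice there is arbitrary.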