<img src="ost_logo.png" width="240" height="240" align="right"/>
<div style="text-align: left"> <b> Machine Learning </b> <br> MSE FTP MachLe <br> 
<a href="mailto:christoph.wuersch@ost.ch"> Christoph Würsch </a> </div>

# Lab13: A2 Value Iteration GridWorld

## Reinforcement Learning — Implement Grid World

### Introduction to Value Iteration
[Based on code by Jeremy Zhang](https://towardsdatascience.com/reinforcement-learning-implement-grid-world-from-scratch-c5963765ebff)


Gridworld is the most basic and classic problem in reinforcement learning, and implementing it on your own is, I believe, the best way to understand the basics of reinforcement learning.

The rules are simple.
- Your agent/robot starts at the bottom-left corner (the ‘start’ sign) and ends at either +1 or -1, which is the corresponding reward.
- At each step, the agent has 4 possible actions: up, down, left and right. The black block is a wall that the agent cannot pass through.
- If the agent hits the wall or the border of the grid, it remains at the same position.
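
The movement rules above can be sketched as a small standalone function. The grid size (3 rows, 4 columns) and the wall at `(1, 1)` match the constants used later in this lab; the names `step`, `MOVES` and `WALL` are illustrative and not part of the lab code.

```python
ROWS, COLS = 3, 4   # grid size, assumed from the lab's board
WALL = (1, 1)       # blocked cell, assumed from the lab's board

# (row, col) offsets; row 0 is the top row, so "up" decreases the row index
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step(state, action):
    """Return the next position, or the current one if the move is illegal."""
    row = state[0] + MOVES[action][0]
    col = state[1] + MOVES[action][1]
    inside = 0 <= row < ROWS and 0 <= col < COLS
    if inside and (row, col) != WALL:
        return (row, col)
    return state

print(step((2, 0), "up"))     # (1, 0): a free cell
print(step((2, 0), "left"))   # (2, 0): blocked by the border
print(step((1, 0), "right"))  # (1, 0): blocked by the wall
```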




Grid Board
---
<img src="board2.png" alt="drawing" width="600"/>

In [1]:
import numpy as np

In [2]:
BOARD_ROWS = 3
BOARD_COLS = 4
WIN_STATE = (0, 3)
LOSE_STATE = (1, 3)
START = (2, 0)
DETERMINISTIC = True

When our agent takes an action, the `Environment` needs a function that accepts the action and returns a legal position for the next state.

In [8]:
class Environment:
    def __init__(self, state=START):
        self.board = np.zeros([BOARD_ROWS, BOARD_COLS])
        self.board[WIN_STATE[0],WIN_STATE[1]]=+1
        self.board[LOSE_STATE[0],LOSE_STATE[1]]=-1
        self.board[1, 1] = -1  # the wall (printed as -1 by showBoard)
        self.state = state
        self.isEnd = False
        self.determine = DETERMINISTIC
        
    def giveReward(self):
        if self.state == WIN_STATE:
            return 1
        elif self.state == LOSE_STATE:
            return -1
        else:
            return 0
    
    def isEndFunc(self):
        if (self.state == WIN_STATE) or (self.state == LOSE_STATE):
            self.isEnd = True
    
    def nxtPosition(self, action):
        """
        Return the position reached by an action (up, down, left, right).
        Rows are counted from the top, columns from the left:
        -------------
        0 | 1 | 2| 3|
        1 |
        2 |
        An illegal move (off the board or into the wall) returns the
        current position unchanged.
        """
        # only the deterministic case is implemented
        if self.determine:
            if action == "up":
                nxtState = (self.state[0]-1, self.state[1])
            elif action == "down":
                nxtState = (self.state[0]+1, self.state[1])
            elif action == "left":
                nxtState = (self.state[0], self.state[1]-1)
            else:
                nxtState = (self.state[0], self.state[1]+1)
            # accept the move only if it stays on the board and avoids the wall
            if ((nxtState[0] >= 0) and (nxtState[0] <= (BOARD_ROWS-1))):
                if ((nxtState[1] >= 0) and (nxtState[1] <= (BOARD_COLS-1))):
                    if nxtState != (1, 1):
                        return nxtState
            return self.state
    
    def showBoard(self):
        self.board[self.state] = 1  # mark the agent's current position (printed as +1)
        for i in range(0, BOARD_ROWS):
            print('-----------------')
            out = '| '
            for j in range(0, BOARD_COLS):
                if self.board[i, j] == 1:
                    token = '+1'
                if self.board[i, j] == -1:
                    token = '-1'
                if self.board[i, j] == 0:
                    token = '0'
                out += token + ' | '
            print(out)
        print('-----------------')    

In [9]:
env = Environment()
env.state

(2, 0)

In [10]:
env.showBoard()

-----------------
| 0 | 0 | 0 | +1 | 
-----------------
| 0 | -1 | 0 | -1 | 
-----------------
| +1 | 0 | 0 | 0 | 
-----------------


## Agent and Value Iteration

This is the artificial intelligence part: our agent should be able to learn from experience, much as a human would. The key to the magic is value iteration.

### Value Iteration

What our agent will finally learn is a policy $\pi$. A policy is a mapping from state $s$ to action $a$ that simply tells the agent what to do in each state. In our case, instead of learning a mapping from state to action directly, we will leverage **value iteration** to learn a mapping from state to value (the estimated reward). Based on this estimate, the agent chooses in each state the action that leads to the highest estimated reward.
There is not going to be any cranky, head-scratching math involved, as the core of value iteration is amazingly concise.
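
Choosing the best action from a value table can be sketched in a few lines. The value estimates and the successor table below are made up for illustration, and `greedy_action` is a hypothetical helper, not part of the lab code.

```python
# Hypothetical value estimates for the cells reachable from (2, 2)
state_values = {(1, 2): 0.4, (2, 1): 0.1, (2, 3): -0.2, (2, 2): 0.2}

# Successor of (2, 2) under each action (moving down hits the border)
successors = {"up": (1, 2), "down": (2, 2), "left": (2, 1), "right": (2, 3)}

def greedy_action(successors, state_values):
    """Pick the action whose successor state has the highest estimated value."""
    return max(successors, key=lambda a: state_values[successors[a]])

print(greedy_action(successors, state_values))  # up
```

The policy is never stored explicitly here; it is read off the value table whenever an action is needed.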

$$\displaystyle V(s_t) \leftarrow V(s_t) + \alpha \left[R+ \gamma V(s_{t+1})-V(s_t) \right] $$

$$\displaystyle V(s_t) \leftarrow (1-\alpha) \cdot V(s_t) + \alpha \left[R+ \gamma V(s_{t+1}) \right] $$

$$\displaystyle V(s_t) \leftarrow (1-\alpha) \cdot V(s_t) + \alpha \hat{V}(s_{t+1}) $$

This is the essence of value iteration. Updates of this form appear in almost all reinforcement learning problems. As its name suggests, value iteration updates the value (estimated reward) at each iteration (here, at the end of each game).
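
For comparison, the textbook dynamic-programming form of value iteration does not sample games at all: it sweeps over every state and replaces its value by the best one-step lookahead, $V(s) \leftarrow \max_a \left[R + \gamma V(s')\right]$ for deterministic moves. The following is a minimal sketch of that variant for this grid, not the method implemented in this lab. The discount factor `GAMMA = 0.9` is an assumption (the lab does not specify one), and all function names are illustrative.

```python
ROWS, COLS = 3, 4
WIN, LOSE, WALL = (0, 3), (1, 3), (1, 1)
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right
GAMMA = 0.9  # assumed discount factor; not specified in the lab

def step(state, move):
    """Deterministic move; illegal moves leave the state unchanged."""
    nxt = (state[0] + move[0], state[1] + move[1])
    if 0 <= nxt[0] < ROWS and 0 <= nxt[1] < COLS and nxt != WALL:
        return nxt
    return state

def reward(state):
    """Reward paid on entering a state."""
    return {WIN: 1.0, LOSE: -1.0}.get(state, 0.0)

def value_iteration(tol=1e-6):
    states = [(r, c) for r in range(ROWS) for c in range(COLS) if (r, c) != WALL]
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            if s in (WIN, LOSE):
                continue  # terminal states keep value 0 in this convention
            best = max(reward(step(s, m)) + GAMMA * V[step(s, m)] for m in MOVES)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:  # stop once a full sweep changes nothing noticeable
            return V

V = value_iteration()
print(round(V[(0, 2)], 3), round(V[(2, 0)], 3))
```

In this sketch the cell next to the goal gets value 1, and every additional step away from the goal multiplies the value by `GAMMA`.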

At first, our agent knows nothing about the grid world (environment), so it simply initialises all state values to 0. Then it starts to explore the world by walking around randomly. It will surely fail a lot at the beginning, but that is totally fine. Once it reaches the end of the game, with either reward +1 or reward -1, the whole game resets and the reward **propagates backwards, so that the estimated values of all states along the way are updated based on the formula above**.
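
The backward propagation can be illustrated on a short, made-up path that ends next to the +1 state. The sketch mirrors the update loop in the `play` method of this lab (learning rate 0.2, values rounded to three decimals); the function name `backpropagate` and the path are hypothetical.

```python
def backpropagate(state_values, visited, final_reward, lr=0.2):
    """Walk the visited states in reverse and move each value towards
    the value just computed for its successor."""
    target = final_reward
    for s in reversed(visited):
        target = state_values[s] + lr * (target - state_values[s])
        state_values[s] = round(target, 3)
    return state_values

# Hypothetical path along the top row; (0, 2) is the cell next to the +1 state
path = [(0, 0), (0, 1), (0, 2)]
values = {s: 0.0 for s in path}

for episode in range(2):
    backpropagate(values, path, final_reward=1)
    print(values)
```

After the first game the value of the last cell rises to 0.2, and each cell further back receives a fifth of its successor's value. Repeating the same path pushes all values further towards the reward.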


The $V(s_t)$ on the left is the updated value of that state, the one on the right is the current, not-yet-updated value, and $\alpha$ is the *learning rate*. $R$ is the reward received and $\gamma$ is the discount factor. The formula simply says that **the updated value of a state equals the current value plus a temporal difference** scaled by $\alpha$: what the agent learned from this iteration of game playing minus the previous estimate. In the last line, $\hat{V}(s_{t+1})$ abbreviates the new estimate $R + \gamma V(s_{t+1})$.
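
A small numeric check makes the formula concrete. The sketch below applies the update three times to a state whose successor has value 1 and which pays no reward itself. It assumes $\alpha = 0.2$ (the learning rate used in this lab) and $\gamma = 1$ (the lab code applies no discount); the function names are illustrative.

```python
def td_update(v_s, reward, v_next, alpha=0.2, gamma=1.0):
    """V(s) <- V(s) + alpha * (R + gamma * V(s') - V(s))"""
    td_error = reward + gamma * v_next - v_s
    return v_s + alpha * td_error

def td_update_blend(v_s, reward, v_next, alpha=0.2, gamma=1.0):
    """Equivalent form: V(s) <- (1 - alpha) * V(s) + alpha * (R + gamma * V(s'))"""
    return (1 - alpha) * v_s + alpha * (reward + gamma * v_next)

v = 0.0
for _ in range(3):
    v = td_update(v, reward=0.0, v_next=1.0)
    print(round(v, 3))  # prints 0.2, then 0.36, then 0.488
```

Each update closes 20% of the remaining gap between the current value and the target, so the estimate approaches 1 geometrically.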

In [11]:
class Agent:
    
    def __init__(self):
        self.states = []
        self.actions = ["up", "down", "left", "right"]
        self.State = Environment()
        self.isEnd = self.State.isEnd
        self.lr = 0.2       # alpha
        self.exp_rate = 0.3 # epsilon
        
        # initialise the value of every state to 0
        self.state_values = {}
        for i in range(BOARD_ROWS):
            for j in range(BOARD_COLS):
                self.state_values[(i, j)] = 0
    
    def chooseAction(self):
        # epsilon-greedy: explore with probability exp_rate, otherwise act greedily
        # start below any reachable value so that negative values are handled too
        mx_nxt_reward = -np.inf
        action = ""
        
        if np.random.uniform(0, 1) <= self.exp_rate:
            action = np.random.choice(self.actions)
        else:
            # greedy action: pick the move whose next state has the highest value
            for a in self.actions:
                nxt_reward = self.state_values[self.State.nxtPosition(a)]
                if nxt_reward >= mx_nxt_reward:
                    action = a
                    mx_nxt_reward = nxt_reward
            # print("current pos: {}, greedy action: {}".format(self.State.state, action))
        return action
    
    def takeAction(self, action):
        position = self.State.nxtPosition(action)
        return Environment(state=position)     
    
    def reset(self):
        self.states = []
        self.State = Environment()
        self.isEnd = self.State.isEnd
    
    def play(self, rounds=10):
        i = 0
        while i < rounds:
            # at the end of a game, propagate the reward backwards
            if self.State.isEnd:
                reward = self.State.giveReward()
                # explicitly assign the reward as the value of the end state
                self.state_values[self.State.state] = reward
                print("Game End Reward", reward)
                # self.states ends with the terminal state, so skip it and
                # move each earlier state's value towards its successor's value
                for s in reversed(self.states[:-1]):
                    reward = self.state_values[s] + self.lr*(reward - self.state_values[s])
                    self.state_values[s] = round(reward, 3)
                self.reset()
                i += 1
            else:
                action = self.chooseAction()
                # append trace
                self.states.append(self.State.nxtPosition(action))
                print("current position {} action {}".format(self.State.state, action))
                # by taking the action, it reaches the next state
                self.State = self.takeAction(action)
                # mark is end
                self.State.isEndFunc()
                print("nxt state", self.State.state)
                print("---------------------")
                self.isEnd = self.State.isEnd
    

In [12]:
ag = Agent()

ag.play(50)

current position (2, 0) action right
nxt state (2, 1)
---------------------
current position (2, 1) action right
nxt state (2, 2)
---------------------
current position (2, 2) action down
nxt state (2, 2)
---------------------
current position (2, 2) action right
nxt state (2, 3)
---------------------
current position (2, 3) action up
nxt state (1, 3)
---------------------
Game End Reward -1
current position (2, 0) action left
nxt state (2, 0)
---------------------
current position (2, 0) action left
nxt state (2, 0)
---------------------
current position (2, 0) action down
nxt state (2, 0)
---------------------
current position (2, 0) action left
nxt state (2, 0)
---------------------
current position (2, 0) action left
nxt state (2, 0)
---------------------
current position (2, 0) action left
nxt state (2, 0)
---------------------
current position (2, 0) action up
nxt state (1, 0)
---------------------
current position (1, 0) action right
nxt state (1, 0)
---------------------
...

---------------------
current position (0, 0) action down
nxt state (1, 0)
---------------------
current position (1, 0) action up
nxt state (0, 0)
---------------------
current position (0, 0) action left
nxt state (0, 0)
---------------------
current position (0, 0) action left
nxt state (0, 0)
---------------------
current position (0, 0) action left
nxt state (0, 0)
---------------------
current position (0, 0) action left
nxt state (0, 0)
---------------------
current position (0, 0) action left
nxt state (0, 0)
---------------------
current position (0, 0) action left
nxt state (0, 0)
---------------------
current position (0, 0) action left
nxt state (0, 0)
---------------------
current position (0, 0) action left
nxt state (0, 0)
---------------------
current position (0, 0) action left
nxt state (0, 0)
---------------------
current position (0, 0) action left
nxt state (0, 0)
---------------------
current position (0, 0) action left
nxt state (0, 0)
---------------------
...

In [13]:
ag.showValues()

AttributeError: 'Agent' object has no attribute 'showValues'
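
The call fails because the `Agent` class defined above has no `showValues` method. A minimal sketch of a standalone helper that prints a value table in grid layout is given below; the name `show_values` and its layout are assumptions, and the demo values are made up rather than taken from the run above.

```python
def show_values(state_values, rows=3, cols=4):
    """Print a value table laid out like the grid, one grid row per line."""
    lines = []
    for i in range(rows):
        cells = [str(state_values[(i, j)]).ljust(6) for j in range(cols)]
        lines.append("| " + " | ".join(cells) + " |")
    separator = "-" * len(lines[0])
    print(separator)
    for line in lines:
        print(line)
        print(separator)
    return lines

# Hypothetical values, for illustration only (not results from the run above)
demo = {(i, j): 0 for i in range(3) for j in range(4)}
demo[(0, 3)] = 1
demo[(1, 3)] = -1
table = show_values(demo)
```

With a trained agent from this lab, the same helper could be called as `show_values(ag.state_values)`.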