# Toy Problem for RL Pathfinding

This notebook is a simplified exploration of an RL-based pathfinding algorithm. We set up a 10x10 grid and implement Q-learning as our algorithm.

### Fundamentals of RL
Reinforcement learning (RL) is a machine learning technique in which an agent interacts with a specified environment. At each step the agent observes the current state $s$ of the environment and takes an action $a$. The agent learns to choose the actions that maximise the cumulative reward it receives.

### Learning Algorithms
The learning algorithm that we'll be using is Q-learning, defined by the update rule:

$$
Q_\textrm{new}(s, a) \leftarrow (1-\alpha)Q(s,a)+\alpha(r+\gamma\cdot \max_{a^\prime}Q(s^\prime, a^\prime))
$$

- $Q(s, a)$ is the current $Q$-value, aka the quality, of taking action $a$ in state $s$.
- $\alpha$ is the learning rate.
- $r$ is the reward received for taking action $a$.
- $\gamma$ is the discount factor, which sets how much we value future rewards.
- $\max_{a^\prime}Q(s^\prime, a^\prime)$ is the maximum $Q$-value over all actions $a^\prime$ available in the next state $s^\prime$, i.e. the state reached by taking action $a$.
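
As a quick check of the update rule above, here is a minimal sketch that applies it to made-up numbers. The function names `q_update_blended` and `q_update_incremental` are hypothetical, and the inputs are illustrative only, not taken from a training run. It also shows that the weighted-average form in the equation is algebraically the same as the incremental form $Q + \alpha(\text{target} - Q)$.

```python
def q_update_blended(old_q, reward, max_next_q, alpha, gamma):
    """Update written as in the equation: a weighted average of old value and target."""
    target = reward + gamma * max_next_q
    return (1 - alpha) * old_q + alpha * target


def q_update_incremental(old_q, reward, max_next_q, alpha, gamma):
    """Same update, rearranged as old value plus a step towards the target."""
    target = reward + gamma * max_next_q
    return old_q + alpha * (target - old_q)


# Illustrative inputs (assumptions, chosen by hand).
old_q, reward, max_next_q = 2.0, -1.0, 5.0
alpha, gamma = 0.1, 0.9

blended = q_update_blended(old_q, reward, max_next_q, alpha, gamma)
incremental = q_update_incremental(old_q, reward, max_next_q, alpha, gamma)

# target = -1 + 0.9 * 5 = 3.5, so both forms give approximately 2.15.
print(round(blended, 6), round(incremental, 6))
```

With a learning rate of 0.1 the $Q$-value moves only a tenth of the way from 2.0 towards the target of 3.5, which is why many visits are needed before the table settles.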

Coming from an applied mathematics background, the Q-learning update reminds me of classical operational research optimisation schemes.

In [2]:
import numpy as np

## The World

In [4]:
grid_size = 10
grid = np.zeros(shape=(grid_size, grid_size))

start = (0, 0)
end = (9, 9)

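# Human-readable action names mapped to action indices.
human_map = {'Up': 0, 'Down': 1, 'Left': 2, 'Right': 3}
# Action indices mapped to (row_change, col_change) offsets.
agent_map = {
    0: (-1, 0),  # Up
    1: (1, 0),   # Down
    2: (0, -1),  # Left
    3: (0, 1),   # Right
}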

## The Agent

In [5]:
alpha = 0.1
gamma = 0.9

In [6]:
class Agent:
    def __init__(self, curr_pos:tuple =start):
        self.curr_pos = curr_pos
        self.Q_table = np.zeros(shape=(grid_size, grid_size, len(human_map)))

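    def move(self, action:int):
        # Look up the (row, col) offset for this action.
        row_change, col_change = agent_map[action]

        # The modulo wraps a position that leaves one edge back onto the
        # opposite edge, so the grid is toroidal rather than walled.
        row_pos = (self.curr_pos[0] + row_change) % grid_size
        col_pos = (self.curr_pos[1] + col_change) % grid_size

        # Return the new position; the caller updates self.curr_pos.
        return (row_pos, col_pos)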
    
    def choose_action(self, epsilon):
        crit = np.random.uniform(0, 1)

        if crit < epsilon:
            action = np.random.choice([0, 1, 2, 3])
            return int(action)
        else:
            q_vals = self.Q_table[self.curr_pos[0], self.curr_pos[1]]
            action = np.argmax(q_vals)
            return int(action)
        
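    def update_q_table(self, action, reward, new_pos):
        # Must be called before self.curr_pos is moved to new_pos.
        row, col = self.curr_pos
        old_q = self.Q_table[row, col, action]
        # Best Q-value over all actions available from the next position.
        max_q = np.max(self.Q_table[new_pos[0], new_pos[1]])
        # Incremental form of the update; algebraically equal to
        # (1 - alpha)*old_q + alpha*(reward + gamma*max_q).
        new_q = old_q + alpha * (reward + gamma * max_q - old_q)
        self.Q_table[row, col, action] = new_q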


### The $Q$-table

The $Q$-table stores one $Q$-value for every (position, action) pair. The value at index `(3, 4, 1)` is the $Q$-value of action 1, moving down, from position `(3, 4)`. Similarly, `(3, 4, 3)` is the $Q$-value of action 3, moving right, from position `(3, 4)`.

This example is very low dimensional: the table has shape `(10, 10, 4)`, i.e. only 400 entries. As such, we don't need a neural network to approximate the $Q$-function and can store and update every value explicitly.
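
To make the indexing concrete, here is a small sketch that builds a table of the same shape and reads it back. The name `q_table` and the two values written into it are assumptions made purely for illustration; they are not learned values.

```python
import numpy as np

grid_size, n_actions = 10, 4
q_table = np.zeros((grid_size, grid_size, n_actions))

# Hypothetical values, set by hand to illustrate indexing.
q_table[3, 4, 1] = 0.5  # action 1 (Down) from position (3, 4)
q_table[3, 4, 3] = 2.0  # action 3 (Right) from position (3, 4)

# Indexing with a position alone returns one Q-value per action.
q_vals = q_table[3, 4]
greedy_action = int(np.argmax(q_vals))

print(q_table.size, q_vals.shape, greedy_action)  # 400 (4,) 3
```

Slicing by position and taking the `argmax` is the same lookup the agent performs when it acts greedily.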

## Training

In [10]:
episodes = 5000

reward_end  = 100
reward_step = -1

epsilon = 1.0
epsilon_decay = 0.9995

bob = Agent()

for ep in range(episodes):
    print(f"Starting Episode {ep}")
    convergence_criteria = 10000  # cap on the number of steps per episode
    iter_count = 0
    bob.curr_pos = start

    while bob.curr_pos != end and iter_count <= convergence_criteria:
        iter_count += 1

        action = bob.choose_action(epsilon)
        new_pos = bob.move(action)

        if new_pos == end:
            reward = reward_end
            print(f"Agent found the end point at iteration {iter_count}")
        else:
            reward = reward_step
            
        bob.update_q_table(action, reward, new_pos)
        bob.curr_pos = new_pos
        
    
    epsilon *= epsilon_decay

Starting Episode 0
Agent found the end point at iteration 342
Starting Episode 1
Agent found the end point at iteration 16
Starting Episode 2
Agent found the end point at iteration 132
Starting Episode 3
Agent found the end point at iteration 102
Starting Episode 4
Agent found the end point at iteration 100
Starting Episode 5
Agent found the end point at iteration 208
Starting Episode 6
Agent found the end point at iteration 4
Starting Episode 7
Agent found the end point at iteration 102
Starting Episode 8
Agent found the end point at iteration 88
Starting Episode 9
Agent found the end point at iteration 214
Starting Episode 10
Agent found the end point at iteration 114
Starting Episode 11
Agent found the end point at iteration 44
Starting Episode 12
Agent found the end point at iteration 134
Starting Episode 13
Agent found the end point at iteration 2
Starting Episode 14
Agent found the end point at iteration 30
Starting Episode 15
Agent found the end point at iteration 110
Starting E

### Testing Learned Policy

In [15]:
bob.curr_pos = start
path = [start]

conv_crit = 100
iter_count = 0
while bob.curr_pos != end and iter_count < conv_crit:
    iter_count += 1
    action  = bob.choose_action(0.0)
    new_pos = bob.move(action)
    path.append(new_pos)
    bob.curr_pos = new_pos

print(f"Steps taken = {iter_count}\nPath = {path}")

Steps taken = 2
Path = [(0, 0), (9, 0), (9, 9)]


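As expected, because `move` wraps positions with `% grid_size`, the grid is toroidal and the best way to get from `(0, 0)` to `(9, 9)` takes only two steps, wrapping around the grid once along each axis. I.e. move up once, `(0 - 1) % 10 = 9`, and move left once, `(0 - 1) % 10 = 9`. On a walled grid the same trip would take 18 steps.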
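
The wrap-around arithmetic can be checked independently of the agent. The sketch below uses two hypothetical helpers, `wrapped_steps` and `toroidal_distance`, which are not part of the notebook, to compute the fewest moves between two cells when each axis wraps.

```python
def wrapped_steps(a: int, b: int, size: int) -> int:
    """Fewest single-cell moves from index a to index b on a ring of `size` cells."""
    forward = (b - a) % size             # steps needed going in the increasing direction
    return min(forward, size - forward)  # going the other way may be shorter


def toroidal_distance(start: tuple, end: tuple, size: int) -> int:
    """Sum of the per-axis wrapped step counts (up/down/left/right moves only)."""
    return sum(wrapped_steps(s, e, size) for s, e in zip(start, end))


print((0 - 1) % 10)                               # 9
print(toroidal_distance((0, 0), (9, 9), 10))      # 2
print(abs(9 - 0) + abs(9 - 0))                    # 18, the distance without wrapping
```

The result of 2 matches the number of steps taken by the learned policy, which suggests the agent found a shortest path on this grid.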