# Toy Problem for RL Pathfinding

This notebook is a simplified exploration of an RL-based pathfinding algorithm. We set up a 10x10 grid and implement Q-learning as the learning algorithm.

### Fundamentals of RL
Reinforcement learning (RL) is a machine learning technique in which an agent interacts with an environment. At each step the agent observes the current state $s$ of the environment and takes an action $a$, for which it receives a reward $r$. The agent learns to choose the actions that maximise its cumulative reward over time.

### Learning Algorithms
The learning algorithm we'll use is Q-learning, whose update rule is:

$$
Q_\textrm{new}(s, a) \leftarrow (1-\alpha)Q(s,a)+\alpha(r+\gamma\cdot \max_{a^\prime}Q(s^\prime, a^\prime))
$$

- $Q(s, a)$ is the current $Q$-value (the "quality") of taking action $a$ in state $s$.
- $\alpha$ is the learning rate.
- $r$ is the reward received for taking action $a$ in state $s$.
- $\gamma$ is the discount factor, which controls how much we value future rewards.
- $\max_{a^\prime}Q(s^\prime, a^\prime)$ is the largest $Q$-value over all actions $a^\prime$ available in the next state $s^\prime$.
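
To make the update concrete, here is a small worked example. The values of `old_q`, `reward` and `max_next_q` are made up for illustration; `alpha` and `gamma` match the values set later in this notebook. It also checks that the weighted-average form of the rule equals the incremental form `old_q + alpha * (target - old_q)`.

```python
import math

alpha, gamma = 0.1, 0.9   # learning rate and discount factor
old_q = 0.5               # assumed current Q(s, a)
reward = -1.0             # assumed reward r for taking action a
max_next_q = 2.0          # assumed max over a' of Q(s', a')

# Target the estimate is pulled towards: r + gamma * max Q(s', a').
target = reward + gamma * max_next_q

# Weighted-average form, as written in the equation above.
weighted_form = (1 - alpha) * old_q + alpha * target

# Incremental form: step a fraction alpha towards the target.
incremental_form = old_q + alpha * (target - old_q)

print(weighted_form, incremental_form)
```

Both forms give roughly 0.53: the estimate moves a tenth of the way from 0.5 towards the target of 0.8.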

Coming from an applied mathematics background, I find that the Q-learning update resembles classical operational research optimisation schemes.

In [2]:
import numpy as np

## The World

In [None]:
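grid_size = 10
grid = np.zeros(shape=(grid_size, grid_size))

start = (0, 0)
end = (grid_size - 1, grid_size - 1)

# Human-readable action names mapped to integer action indices.
human_map = {'Up': 0, 'Down': 1, 'Left': 2, 'Right': 3}
# Action indices mapped to (row_change, col_change) steps on the grid.
agent_map = {
    0: (-1, 0),
    1: (1, 0),
    2: (0, -1),
    3: (0, 1),
}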



## The Agent

In [None]:
alpha = 0.1
gamma = 0.9

In [None]:
class Agent:
    """Tabular Q-learning agent on the grid world."""

    def __init__(self, curr_pos=start):
        self.curr_pos = curr_pos
        # One Q-value per (row, column, action) triple.
        self.Q_table = np.zeros(shape=(grid_size, grid_size, len(human_map)))

    def move(self, action: int):
        """Return the position reached by `action`; does not update `curr_pos`."""
        row_change, col_change = agent_map[action]

        # The modulo wraps the agent around to the opposite edge of the grid.
        row_pos = (self.curr_pos[0] + row_change) % grid_size
        col_pos = (self.curr_pos[1] + col_change) % grid_size

        return (row_pos, col_pos)

    def choose_action(self, epsilon):
        """Epsilon-greedy: explore with probability `epsilon`, otherwise exploit."""
        if np.random.uniform(0, 1) < epsilon:
            return np.random.choice(len(human_map))

        q_vals = self.Q_table[self.curr_pos[0], self.curr_pos[1]]
        return np.argmax(q_vals)

    def update_q_table(self, action, reward, new_pos):
        """Apply the Q-learning update for the move from `curr_pos` to `new_pos`."""
        old_q = self.Q_table[self.curr_pos[0], self.curr_pos[1], action]
        max_q = np.max(self.Q_table[new_pos[0], new_pos[1]])
        # Algebraically the same as (1 - alpha) * old_q + alpha * (reward + gamma * max_q).
        new_q = old_q + alpha * (reward + gamma * max_q - old_q)
        self.Q_table[self.curr_pos[0], self.curr_pos[1], action] = new_q
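
The cells above define the agent but not yet how it is trained. Below is a minimal, standalone sketch of one way the pieces might be driven in a training loop. It does not reuse the `Agent` class, so it can run on its own, but it mirrors the logic of `move`, `choose_action` and `update_q_table`. The reward scheme (`STEP_REWARD`, `GOAL_REWARD`), the episode count, the step cap, the value of `EPSILON` and the helper names `step`, `train` and `greedy_path` are all assumptions made for illustration, not part of the original notebook.

```python
import numpy as np

grid_size = 10
start, end = (0, 0), (grid_size - 1, grid_size - 1)
alpha, gamma = 0.1, 0.9
# Same action encoding as agent_map: Up, Down, Left, Right.
moves = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}

# Assumed values, chosen for illustration only.
STEP_REWARD, GOAL_REWARD = -1.0, 10.0
EPISODES, MAX_STEPS, EPSILON = 2000, 200, 0.1


def step(pos, action):
    """Move with wraparound at the grid edges, mirroring Agent.move."""
    d_row, d_col = moves[int(action)]
    return ((pos[0] + d_row) % grid_size, (pos[1] + d_col) % grid_size)


def train(seed=0):
    """Run epsilon-greedy Q-learning episodes from `start` and return the Q-table."""
    rng = np.random.default_rng(seed)
    q_table = np.zeros((grid_size, grid_size, len(moves)))
    for _ in range(EPISODES):
        pos = start
        for _ in range(MAX_STEPS):
            # Epsilon-greedy action selection.
            if rng.random() < EPSILON:
                action = int(rng.integers(len(moves)))
            else:
                action = int(np.argmax(q_table[pos]))
            new_pos = step(pos, action)
            reward = GOAL_REWARD if new_pos == end else STEP_REWARD
            # Q-values at the goal are never updated (the episode stops there),
            # so they stay at zero and act as a terminal value.
            idx = pos + (action,)
            target = reward + gamma * np.max(q_table[new_pos])
            q_table[idx] += alpha * (target - q_table[idx])
            pos = new_pos
            if pos == end:
                break
    return q_table


def greedy_path(q_table, max_steps=grid_size * grid_size):
    """Follow the highest-valued action from `start` until `end` or the step cap."""
    pos, path = start, [start]
    while pos != end and len(path) <= max_steps:
        pos = step(pos, np.argmax(q_table[pos]))
        path.append(pos)
    return path


q_table = train()
path = greedy_path(q_table)
print(path)
```

Because movement wraps around the grid edges, the goal at `(9, 9)` is only two moves from `(0, 0)` (for example Up, then Left), so a well-trained greedy path should be very short. Clamping positions at the walls instead of wrapping would make this a more conventional pathfinding task.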
