# Q-Learning

In [None]:
import numpy as np

np.random.seed(0)

In [None]:
def create_q(state_size: int = 3, action_size: int = 2):
    return np.zeros((state_size, action_size))


create_q()

array([[0., 0.],
       [0., 0.],
       [0., 0.]])

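"Q" stands for quality: how useful a given action is for gaining future reward.

- The `Q` function can be implemented as a table with one row per state and one
  column per action
- Q-learning is
  - **Model free**: it learns from observed transitions without a model of the
    environment
  - **Value based**: it learns action values and derives its behaviour from them
  - **Off policy**: it learns the value of the greedy policy even while acting
    with an exploratory one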

In [None]:
class Environment:
    """Our game is split into three states, 0, 1, 2. The player starts at state
    0, and our goal is to move the player to state 2
    """
    def __init__(self) -> None:
        self.state = 0

    def _assert_state(self):
        state = self.state
        assert state == 0 or state == 1 or state == 2

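    def next_state(self, action):
        """Return the state reached by `action`: 0 moves left, 1 moves right."""
        state = self.state

        if action == 0:
            return np.clip(state - 1, 0, 2)

        if action == 1:
            return np.clip(state + 1, 0, 2)

        raise ValueError(f"unknown action: {action}")

    def reward(self, state) -> float:
        """Penalise states 0 and 1; only the goal state 2 pays a positive reward."""
        if state == 0:
            return -2.0

        if state == 1:
            return -1.0

        return 0.9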
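
To make the transition and reward rules concrete, here is a short walk through the game. It is a minimal sketch that uses standalone copies of `next_state` and `reward` as plain functions (an assumption made so the snippet runs without the class above), and the action sequence is chosen only for illustration.

```python
import numpy as np


def next_state(state: int, action: int) -> int:
    """Standalone copy of Environment.next_state: 0 moves left, 1 moves right."""
    step = -1 if action == 0 else 1
    return int(np.clip(state + step, 0, 2))


def reward(state: int) -> float:
    """Standalone copy of Environment.reward."""
    return {0: -2.0, 1: -1.0}.get(state, 0.9)


state = 0
trajectory = []
for action in [0, 1, 1]:  # left (blocked at the edge), right, right
    new_state = next_state(state, action)
    trajectory.append((state, action, new_state, reward(new_state)))
    state = new_state

# Each tuple is (state, action, new_state, reward)
print(trajectory)
# [(0, 0, 0, -2.0), (0, 1, 1, -1.0), (1, 1, 2, 0.9)]
```

Moving left from state 0 leaves the player in place and still costs `-2.0`, so the only way to collect the positive reward is to move right twice.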

## Training

At each step, a Q-learning agent does one of two things:

- Explore: try a random action
- Exploit: take the action with the highest known Q value

In [None]:
def simple_train():
    epsilon = 0.5

    random = np.random.random((1,))
    print(f"random = {random[0]}")

    if random[0] < epsilon:
        print("explore")
    else:
        print("exploit")


simple_train()

random = 0.5488135039273248
exploit


We balance exploration and exploitation with $ \epsilon $ (epsilon): the agent
explores with probability $ \epsilon $ and exploits otherwise

### Exploit

In [None]:
def exploit(q: np.ndarray, state: int) -> np.int64:
    action_values = q[state]
    return np.argmax(action_values) 


exploit(np.array([[0, 1]]), 0)

1

We look up the values of all possible actions for the given state, then the
agent selects the action with the highest value
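
One detail worth seeing in code is what `np.argmax` does when several actions share the highest value, which is the case for every state of a freshly created table. The sketch below assumes the same 3 x 2 table shape as `create_q`; the value `0.5` is made up for illustration.

```python
import numpy as np

q = np.zeros((3, 2))  # untrained table: both actions tie at 0 in every state

# np.argmax returns the first index when values tie
tie_action = int(np.argmax(q[0]))
print(tie_action)  # 0

q[0, 1] = 0.5  # illustrative value: action 1 now looks better in state 0
best_action = int(np.argmax(q[0]))
print(best_action)  # 1
```

With an all-zero table, pure exploitation would keep choosing action 0, which never leaves state 0 in this game. That is one reason the agent also needs to explore.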

### Explore

In [None]:
def explore():
    actions = np.array([0, 1], dtype=np.int64)
    return np.random.choice(actions)


explore()

1

Select an action uniformly at random
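
The two strategies can be combined into a single epsilon-greedy choice. This is a minimal sketch: `choose_action` is a hypothetical helper that is not part of the notebook, and it uses NumPy's `default_rng` generator instead of the global `np.random` functions used elsewhere here.

```python
import numpy as np


def choose_action(q, state, epsilon, rng):
    """Hypothetical helper: explore with probability epsilon, else exploit."""
    if rng.random() < epsilon:
        return int(rng.integers(q.shape[1]))  # explore: uniform random action
    return int(np.argmax(q[state]))  # exploit: best known action


rng = np.random.default_rng(0)
q = np.array([[0.0, 1.0]])  # one state, action 1 has the higher value

# epsilon = 0 never explores, so the greedy action is always chosen
always_exploit = [choose_action(q, 0, 0.0, rng) for _ in range(5)]
print(always_exploit)  # [1, 1, 1, 1, 1]

# epsilon = 1 always explores, so both actions show up over many draws
always_explore = [choose_action(q, 0, 1.0, rng) for _ in range(200)]
print(sorted(set(always_explore)))
```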

## Updating Q Table

In [None]:
def update_q(q: np.ndarray, state, new_state, action, reward):
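    lr = 0.5  # learning rate (alpha)
    gamma = 0.85  # discount factor

    # Move Q(state, action) towards reward + discounted best value of new_state
    q[state, action] = q[state, action] \
        + lr * (reward + gamma * np.max(q[new_state, :]) - q[state, action])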

1. The agent starts in a state (s1), takes an action (a1), and receives a reward (r1)
2. The agent selects each action either from the Q table or at random
3. The Q value for that state and action is updated
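
Written as an equation, the assignment in `update_q` is the usual Q-learning update. Here $ \alpha $ stands for `lr`, $ \gamma $ for `gamma`, $ s $ and $ a $ for `state` and `action`, $ r $ for `reward`, and $ s' $ for `new_state`:

$$
Q(s, a) \leftarrow Q(s, a) + \alpha \left( r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right)
$$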

An update occurs after every step (action), and the loop ends when the episode is
done.

- "done" means the agent has reached a terminal state
- A terminal state can be, for example, the end of a game

The agent will not learn much from any single episode, but with enough
exploration the Q values will converge
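
A single update worked by hand can make the rule easier to follow. The sketch below assumes a fresh all-zero table and the step from state 1 into the goal state 2, with the same `lr` and `gamma` constants as `update_q`.

```python
import numpy as np

lr, gamma = 0.5, 0.85  # same constants as update_q
q = np.zeros((3, 2))
state, action, new_state, reward = 1, 1, 2, 0.9  # step right from 1 into the goal

# Target: immediate reward plus the discounted best value of the new state
target = reward + gamma * np.max(q[new_state])  # 0.9 + 0.85 * 0 = 0.9

# Move the old estimate (0.0) halfway towards the target
q[state, action] += lr * (target - q[state, action])
print(q[state, action])  # 0.45
```

Only the entry for the visited state and action changes; every other cell of the table stays at zero.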

### Parameters

- **Learning Rate** (`lr`): also known as **alpha** ($ \alpha $), how much weight
  the new estimate gets relative to the old value
- **Gamma** (`gamma`): a discount factor that balances immediate and future
  reward. Values are typically in the range `0.8` to `0.99`
- **Reward**: the value received after completing a certain action
- **Max**: take the maximum Q value available in the next state and add it,
  discounted by `gamma`, to the reward for the current step. This lets possible
  future reward influence the current action, which is the beauty of
  Q-learning: we allocate future reward to current actions to help the agent
  select the highest-return action in any given state.
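
To get a feel for the learning rate, the sketch below repeats the same update ten times for a few values of `lr`. It assumes the step into the goal state (reward `0.9`, next-state values all zero); the three learning rates are chosen only for illustration.

```python
reward, gamma = 0.9, 0.85
next_best = 0.0  # assumed: the terminal state's Q values stay at zero

after_ten = {}
for lr in (0.1, 0.5, 1.0):
    value = 0.0
    for _ in range(10):
        # Same rule as update_q, applied to a single table entry
        value += lr * (reward + gamma * next_best - value)
    after_ten[lr] = round(value, 4)

print(after_ten)  # {0.1: 0.5862, 0.5: 0.8991, 1.0: 0.9}
```

A small `lr` approaches the target of `0.9` slowly, while `lr = 1.0` overwrites the old value with the target in one step. With `lr = 0.5` each update halves the remaining gap, which gives `0.8991` after ten updates.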

## Putting Everything Together

In [None]:
def train(epochs: int = 10, epsilon: float = 0.7) -> np.ndarray:
    q = create_q()

    for epoch in range(epochs):
        env = Environment()
        step = 0

        # print(f"epoch {epoch}")

        while True:
            # print(f"  step {step}")
            train_explore = np.random.random((1,))

            if train_explore[0] < epsilon:
                # explore
                # print(f"    explore")
                action = explore()
            else:
                # exploit
                # print(f"    exploit")
                action = exploit(q, env.state)

            old_state = env.state
            new_state = env.next_state(action)

            update_q(q, old_state, new_state, action, env.reward(new_state))

            # print(f"    {old_state} to {new_state}")
            env.state = new_state

            step += 1

            if env.state == 2:
                break 

    return q


trained_q = train()
trained_q

array([[-2.2354126 , -0.24272949],
       [-1.        ,  0.89912109],
       [ 0.        ,  0.        ]])
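
Once the table is trained, the learned behaviour can be read off by taking the best action in each state. The sketch below copies the values from the output above into a literal array so that it runs by itself; `policy` is a name introduced here for illustration.

```python
import numpy as np

# Values copied from the trained table printed above
trained_q = np.array([
    [-2.2354126, -0.24272949],
    [-1.0, 0.89912109],
    [0.0, 0.0],
])

# Greedy policy: index of the highest-valued action in each row (state)
policy = np.argmax(trained_q, axis=1)
print(policy)  # [1 1 0]
```

States 0 and 1 both prefer action 1 (move right), which leads to the goal. The entry for state 2 carries no information: the episode ends there, so its row is never updated and the tie resolves to action 0.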

# Resources

- [Simple Reinforcement Learning: Q-learning](https://towardsdatascience.com/simple-reinforcement-learning-q-learning-fcddc4b6fe56)
- [A Beginners Guide to Q-Learning](https://towardsdatascience.com/a-beginners-guide-to-q-learning-c3e2a30a653c)