# Value-based Learning Methods

### 1. Q-Learning

Update rule: $Q({s}_t, {a}_t) \leftarrow Q({s}_t, {a}_t) + \alpha \left(r_t + \gamma \max_{{a}} Q({s}_{t+1}, {a}) - Q({s}_t, {a}_t)\right)$
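As a quick numeric check of the update rule, the sketch below applies a single update to a tiny hand-made table. The function name `q_update`, the table contents, and the values of `alpha` and `gamma` are illustrative assumptions and are not taken from the training code in this section.

```python
def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """Apply one tabular Q-learning update in place and return the new value."""
    td_target = r + gamma * max(Q[s_next])  # bootstrap from the best next action
    td_error = td_target - Q[s][a]
    Q[s][a] += alpha * td_error
    return Q[s][a]


# Two states, two actions; all numbers are made up for illustration.
Q = [[0.0, 0.0],
     [0.5, 1.0]]
new_value = q_update(Q, s=0, a=1, r=0.0, s_next=1)
print(new_value)
```

Here the TD target is `0.0 + 0.99 * 1.0`, so the entry moves one tenth of the way from 0 toward 0.99, which is roughly 0.099.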

~~~python
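    # Method of the QLearning class; assumes `import numpy as np` at module level.
    def train(self, num_episodes):
        for episode in range(num_episodes):
            state = self.env.reset()
            done = False

            while not done:
                action = self.choose_action(state)
                next_state, reward, done, _ = self.env.step(action)

                # TD target bootstraps from the greedy value of the next state.
                td_target = reward + self.discount_factor * np.max(self.Q[next_state, :])
                self.Q[state, action] += self.learning_rate * (td_target - self.Q[state, action])

                state = next_state

            # Decay exploration once per episode, with a floor of 0.01.
            self.epsilon = max(self.epsilon * self.epsilon_decay, 0.01)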
~~~
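The training loop calls `self.choose_action`, which is not shown here. Since the loop decays `self.epsilon`, an epsilon-greedy policy is a plausible reading; the following is a minimal standalone sketch under that assumption. The function signature, the `rng` parameter, and the random tie-breaking are hypothetical choices, not the repository's implementation.

```python
import random


def choose_action(q_row, epsilon, rng=random):
    """Epsilon-greedy: explore with probability epsilon, otherwise act greedily."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))  # uniform random action
    best = max(q_row)
    # Break ties at random so all-zero rows do not always select action 0.
    return rng.choice([a for a, q in enumerate(q_row) if q == best])


# With epsilon = 0 the choice is purely greedy.
action = choose_action([0.1, 0.7, 0.3, 0.0], epsilon=0.0)
print(action)
```

Random tie-breaking matters early in training, when most rows of the table are still all zeros and a plain argmax would keep picking the first action.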

~~~python
from alg.q_learning import QLearning
from env.frozen_lake_env import FrozenLakeEnv

env = FrozenLakeEnv()

num_episode = 10000

alg = QLearning(env)
alg.train(num_episode)
alg.Q
~~~


~~~
array([[9.35247280e-01, 9.50990050e-01, 8.69988459e-01, 9.33521887e-01],
       [9.29859777e-01, 0.00000000e+00, 5.19787217e-03, 2.06153569e-01],
       [1.22640043e-01, 0.00000000e+00, 2.78610065e-08, 0.00000000e+00],
       [5.34706186e-07, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       [9.46705831e-01, 9.60596010e-01, 0.00000000e+00, 9.19128576e-01],
       [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       [9.50486774e-01, 0.00000000e+00, 9.70299000e-01, 9.23475662e-01],
       [9.46827467e-01, 9.80100000e-01, 9.58237247e-01, 0.00000000e+00],
       [1.26197473e-01, 9.89758554e-01, 0.00000000e+00, 0.00000000e+00],
       [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       [0.00000000e+00, 9.77929235e-01, 9.90000000e
~~~

### 2. Deep Q-Learning

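Optimization objective: $\min_{\theta} \; \mathbb{E}_{({s}_t, {a}_t, r_t, {s}_{t+1}) \sim \mathcal{D}} \left[\left(r_t + \gamma \max_{{a}_{t+1}} Q^{(\theta^{-})}({s}_{t+1}, {a}_{t+1}) - Q^{(\theta)}({s}_{t}, {a}_{t})\right)^2\right]$

Here $\mathcal{D}$ is the replay buffer, $\theta$ are the parameters of the Q-network, and $\theta^{-}$ are the parameters of the target network.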

#### 2.1 Define two networks

~~~python
    self.q_net = MLP(state_dim, self.action_dim, self.hidden_dim).to(device)
    self.target_q_net = MLP(state_dim, self.action_dim, self.hidden_dim).to(device)
~~~

#### 2.2 Replay buffer

~~~python
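import collections
import random

import numpy as np


class ReplayBuffer:
    """Fixed-size FIFO store of transitions for experience replay."""

    def __init__(self, capacity):
        # A deque with maxlen drops the oldest transition once capacity is reached.
        self.buffer = collections.deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement.
        transitions = random.sample(self.buffer, batch_size)
        state, action, reward, next_state, done = zip(*transitions)
        return np.array(state), action, reward, np.array(next_state), done

    def size(self):
        return len(self.buffer)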
~~~
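The buffer relies on two standard-library behaviors: a bounded `collections.deque` evicts its oldest entry when full, and `random.sample` draws without replacement. The sketch below shows both on placeholder transitions, using only the standard library. The capacity of 3 and the transition values are made up for the demonstration.

```python
import collections
import random

buffer = collections.deque(maxlen=3)  # capacity of 3 chosen only for the demo
for t in range(5):
    # (state, action, reward, next_state, done) with placeholder values
    buffer.append((t, 0, 0.0, t + 1, False))

oldest_state = buffer[0][0]  # transitions with state 0 and 1 have been evicted
batch = random.sample(list(buffer), 2)  # two distinct transitions
states, actions, rewards, next_states, dones = zip(*batch)
print(len(buffer), oldest_state, states)
```

Because sampling is uniform over whatever is currently stored, consecutive transitions from one episode are unlikely to land in the same minibatch once the buffer is reasonably full.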

#### 2.3 Update Q-net

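Key notes:
- Train the Q-network by regressing its prediction $Q^{(\theta)}({s}_t, {a}_t)$ toward the TD target $r_t + \gamma \max_{{a}_{t+1}} Q^{(\theta^{-})}({s}_{t+1}, {a}_{t+1})$.
- Compute the TD target with a separate target network whose weights are copied from the Q-network only every `target_update` steps, so the target stays fixed between copies.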


~~~python
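    # Q(s_t, a_t) for the actions actually taken; `actions` has shape (batch, 1).
    q_values = self.q_net(states).gather(1, actions)
    # Greedy next-state value from the target network.
    max_next_q_values = self.target_q_net(next_states).max(1)[0].view(-1, 1)
    # No bootstrapping past terminal transitions (dones == 1).
    td_targets = rewards + self.gamma * max_next_q_values * (1 - dones)
    # F.mse_loss already averages over the batch.
    dqn_loss = F.mse_loss(q_values, td_targets)
    self.optimizer.zero_grad()
    dqn_loss.backward()
    self.optimizer.step()

    # Periodically copy the online weights into the target network.
    if self.count % self.target_update == 0:
        self.target_q_net.load_state_dict(self.q_net.state_dict())
    self.count += 1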
~~~
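To see what the `(1 - dones)` factor does on concrete numbers, the following plain-Python sketch computes TD targets for a two-transition batch. The helper name `td_targets_for_batch`, the batch values, and `gamma=0.99` are illustrative assumptions. The second transition is terminal, so its target reduces to the reward alone.

```python
def td_targets_for_batch(rewards, max_next_q, dones, gamma=0.99):
    """r + gamma * max_a' Q_target(s', a'), with the bootstrap term zeroed at episode end."""
    return [r + gamma * q * (1 - d) for r, q, d in zip(rewards, max_next_q, dones)]


targets = td_targets_for_batch(rewards=[0.0, 1.0], max_next_q=[2.0, 5.0], dones=[0, 1])
print(targets)
```

The first target is `0.0 + 0.99 * 2.0`, while the second ignores the next-state value of 5.0 because the episode has ended.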