# Chapter 18: Reinforcement Learning

**Reference:** Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow (Aurélien Géron)

---

## 1. Chapter Introduction

Reinforcement Learning (RL) is one of the most exciting fields of Machine Learning today. It has been around since the 1950s, but a revolution took place in 2013 when DeepMind demonstrated a system that could learn to play Atari games from scratch, eventually outperforming humans in most of them, using only raw pixels as inputs and without any prior knowledge of the rules. This was the first of a series of feats that culminated in AlphaGo's victories over world champions in the game of Go (Lee Sedol in 2016 and Ke Jie in 2017).

In Reinforcement Learning, a software **agent** makes **observations** and takes **actions** within an **environment**, and in return it receives **rewards**. Its objective is to learn to act in a way that will maximize its expected long-term rewards. If you don't mind a bit of anthropomorphism, you can think of positive rewards as pleasure and negative rewards as pain (or punishment). The agent acts in the environment and learns by trial and error to maximize its pleasure and minimize its pain.

In this chapter, we will look at the fundamental concepts of RL, including Markov Decision Processes (MDPs), Q-Learning, and Policy Gradients. We will then dive into Deep Reinforcement Learning, implementing the algorithms that allowed DeepMind to beat Atari games.

## 2. Learning to Optimize Rewards

In RL, there is no supervisor to tell the agent what is right or wrong. The agent only gets a reward signal (which may be delayed). For example, in a maze, the agent might only get a reward when it escapes. 

The algorithm used by the agent to determine its actions is called its **policy**. The policy can be a neural network taking observations as inputs and outputting the action to take.

### Policy Search
How do we find a good policy? 
* **Genetic Algorithms:** Randomly generate a population of policies, evaluate them, keep the best, and mutate them.
* **Policy Gradients:** Evaluate the gradients of the rewards with respect to the policy parameters, then tweak the parameters to follow the gradient toward higher rewards.
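
As a rough illustration of the genetic approach, here is a minimal sketch that evolves a parameter vector. The `evaluate` function is a hypothetical stand-in for "run the policy in the environment and total up its rewards"; the population size, number of survivors, and mutation scale are arbitrary assumptions.

```python
import numpy as np

def evaluate(params):
    # Hypothetical fitness: stands in for the total reward a policy would earn.
    target = np.array([1.0, -2.0, 0.5])
    return -np.sum((params - target) ** 2)

rng = np.random.default_rng(0)
population = rng.normal(size=(20, 3))  # 20 random "policies" with 3 parameters each
initial_best = max(evaluate(p) for p in population)

for generation in range(50):
    scores = np.array([evaluate(p) for p in population])
    elite = population[np.argsort(scores)[-5:]]  # keep the 5 best policies
    # Each survivor produces 3 mutated children; the survivors are kept unchanged.
    children = np.repeat(elite, 3, axis=0) + rng.normal(scale=0.1, size=(15, 3))
    population = np.concatenate([elite, children])

final_best = max(evaluate(p) for p in population)
print("Best fitness before:", initial_best, "after:", final_best)
```

Because the survivors are carried over unchanged, the best fitness can never decrease from one generation to the next.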

## 3. OpenAI Gym

OpenAI Gym is a toolkit for developing and comparing Reinforcement Learning algorithms. It provides a wide variety of environments (Atari, board games, physics simulations).

Let's create a simple environment: **CartPole**. The goal is to balance a pole on a moving cart.

In [None]:
import gym
import numpy as np
import tensorflow as tf
from tensorflow import keras

# 1. Create the Environment
env = gym.make("CartPole-v1")

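# 2. Reset the environment (returns the initial observation)
# Note: this notebook assumes gym < 0.26. From 0.26 on, reset() returns
# (obs, info) and step() returns (obs, reward, terminated, truncated, info).
obs = env.reset()
print("Initial Observation:", obs)
# Observation: [Cart Position, Cart Velocity, Pole Angle, Pole Angular Velocity]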

# 3. Render (Note: Rendering often requires a display, which may not work in headless notebooks)
# env.render()

# 4. Action Space
# Discrete(2) means possible actions are 0 (left) and 1 (right)
print("Action Space:", env.action_space)

# 5. Step
action = 1 # Push right
obs, reward, done, info = env.step(action)
print("New Observation:", obs)
print("Reward:", reward)
print("Done:", done)

**Hardcoded Policy:**
A simple strategy: if the pole is tilting left, push the cart left; if right, push right.

In [None]:
def basic_policy(obs):
    angle = obs[2]
    return 0 if angle < 0 else 1

totals = []
for episode in range(50):
    episode_rewards = 0
    obs = env.reset()
    for step in range(200):
        action = basic_policy(obs)
        obs, reward, done, info = env.step(action)
        episode_rewards += reward
        if done:
            break
    totals.append(episode_rewards)

print("Mean Reward (Basic Policy):", np.mean(totals), "Std:", np.std(totals))

## 4. Neural Network Policies

Ideally, we want a neural network to learn the policy. This network will take the observation as input and output the probability of taking each action.

* **Input:** Observation (4 dimensions).
* **Output:** Probability $p$ of action 0 (Left). The probability of action 1 (Right) is $1 - p$.
* **Selection:** We sample an action based on this probability. Sampling is better than always picking the highest probability action because it allows the agent to **explore** new strategies.
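
The selection step can be sketched without any neural network. In this minimal example, `sample_action` is a hypothetical helper and `left_proba = 0.8` is a made-up network output; the point is only that the action is drawn at random rather than chosen by thresholding.

```python
import numpy as np

def sample_action(left_proba, rng):
    """Return 0 (left) with probability left_proba, otherwise 1 (right)."""
    return 0 if rng.random() < left_proba else 1

rng = np.random.default_rng(0)
left_proba = 0.8  # pretend the network output p = 0.8 for this observation
actions = [sample_action(left_proba, rng) for _ in range(10_000)]
left_fraction = actions.count(0) / len(actions)
print(f"Fraction of left actions: {left_fraction:.3f}")
```

A greedy rule would pick left every time here; sampling still tries right in roughly one step out of five, which is what lets the agent discover whether right is sometimes the better choice.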

In [None]:
n_inputs = 4
model = keras.models.Sequential([
    keras.layers.Dense(5, activation="elu", input_shape=[n_inputs]),
    keras.layers.Dense(1, activation="sigmoid")
])

print("Model built.")

## 5. Evaluating Actions: The Credit Assignment Problem

If an agent manages to balance the pole for 100 steps, how do we know *which* of those 100 actions were good and which were bad? This is the **Credit Assignment Problem**.

To solve this, we use the **Discount Factor ($\gamma$)**. We evaluate an action based on the sum of all future rewards it led to, but we discount rewards that occur far in the future.

$$ G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} $$

If $\gamma$ is close to 0, the agent cares only about immediate rewards. If close to 1, it cares about the long term.
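
To make the formula concrete, here is a small worked example. It assumes `rewards[t]` holds the reward received right after the action at step $t$ (that is, $R_{t+1}$); the reward values and $\gamma = 0.8$ are chosen only for illustration.

```python
def discounted_returns(rewards, gamma):
    """Return G_t for every step t, computed backward from the last reward."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

returns = discounted_returns([10, 0, -50], gamma=0.8)
# Last action:   -50
# Middle action:  0 + 0.8 * (-50) = -40
# First action:  10 + 0.8 * (-40) = -22
print(returns)
```

Even though the first action earned an immediate reward of +10, its return is negative because it was followed by a large penalty two steps later.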

## 6. Policy Gradients (REINFORCE Algorithm)

The REINFORCE algorithm works as follows:
1.  Let the neural network play the game several times. At each step, calculate gradients that would make the chosen action *more likely*, but don't apply them yet.
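2.  Compute each action's return (the sum of the discounted rewards that follow it), then normalize these returns across all episodes by subtracting the mean and dividing by the standard deviation. The result is each action's advantage.
3.  If an action's advantage is positive, apply the gradients computed earlier to make that action more likely. If it is negative, apply the opposite gradients to make it less likely. In practice, multiply each gradient vector by the corresponding advantage.
4.  Compute the mean of the resulting gradient vectors over all steps of all episodes and use it to update the weights.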

In [None]:
def play_one_step(env, obs, model, loss_fn):
    with tf.GradientTape() as tape:
        # Predict probability of going Left (0)
        left_proba = model(obs[np.newaxis])
        
        # Sample action (0 or 1)
        action = (tf.random.uniform([1, 1]) > left_proba)
        y_target = tf.constant([[1.]]) - tf.cast(action, tf.float32)
        loss = tf.reduce_mean(loss_fn(y_target, left_proba))

    grads = tape.gradient(loss, model.trainable_variables)
    obs, reward, done, info = env.step(int(action[0, 0].numpy()))
    return obs, reward, done, grads

def play_multiple_episodes(env, n_episodes, n_max_steps, model, loss_fn):
    all_rewards = []
    all_grads = []
    for episode in range(n_episodes):
        current_rewards = []
        current_grads = []
        obs = env.reset()
        for step in range(n_max_steps):
            obs, reward, done, grads = play_one_step(env, obs, model, loss_fn)
            current_rewards.append(reward)
            current_grads.append(grads)
            if done:
                break
        all_rewards.append(current_rewards)
        all_grads.append(current_grads)
    return all_rewards, all_grads

def discount_rewards(rewards, discount_factor):
    # Copy as floats so the in-place updates below also work with integer rewards
    discounted = np.array(rewards, dtype=np.float64)
    for step in range(len(rewards) - 2, -1, -1):
        discounted[step] += discounted[step + 1] * discount_factor
    return discounted

def discount_and_normalize_rewards(all_rewards, discount_factor):
    all_discounted_rewards = [discount_rewards(rewards, discount_factor) for rewards in all_rewards]
    flat_rewards = np.concatenate(all_discounted_rewards)
    reward_mean = flat_rewards.mean()
    reward_std = flat_rewards.std()
    return [(discounted_rewards - reward_mean) / reward_std for discounted_rewards in all_discounted_rewards]

optimizer = keras.optimizers.Adam(learning_rate=0.01)
loss_fn = keras.losses.binary_crossentropy

print("Policy Gradient functions defined.")
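
The functions above collect gradients and advantages but stop short of combining them. The following NumPy sketch shows one way the weighting and averaging could be done; `mean_weighted_grads` is a hypothetical helper, and the toy gradients and advantages are made up for illustration.

```python
import numpy as np

def mean_weighted_grads(all_grads, all_advantages):
    """Average advantage-weighted gradients over every step of every episode.

    all_grads[episode][step] is a list with one array per trainable variable;
    all_advantages[episode][step] is the matching normalized advantage.
    """
    n_vars = len(all_grads[0][0])
    result = []
    for var_index in range(n_vars):
        weighted = [
            advantage * all_grads[ep][step][var_index]
            for ep, advantages in enumerate(all_advantages)
            for step, advantage in enumerate(advantages)
        ]
        result.append(np.mean(weighted, axis=0))
    return result

# Toy data: 2 episodes (2 steps and 1 step), one "variable" of shape (2,)
all_grads = [[[np.array([1.0, 0.0])], [np.array([0.0, 1.0])]],
             [[np.array([1.0, 1.0])]]]
all_advantages = [[1.0, -1.0], [0.5]]
mean_grads = mean_weighted_grads(all_grads, all_advantages)
print(mean_grads)
```

In the TensorFlow setting above, the same computation could use `tf.reduce_mean` in place of `np.mean`, and the result would be applied with `optimizer.apply_gradients(zip(mean_grads, model.trainable_variables))`.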

## 7. Markov Decision Processes (MDPs)

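An MDP consists of a set of states $S$, a set of actions $A$, transition probabilities $T(s, a, s')$ (the probability of moving from state $s$ to state $s'$ after taking action $a$), and rewards $R(s, a, s')$ received on those transitions. Future rewards are discounted by the factor $\gamma$.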

**Bellman Optimality Equation:**
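The optimal value of a state, $V^*(s)$, is the expected sum of discounted future rewards when the agent starts in $s$ and acts optimally. It equals, for the best action $a$, the expected immediate reward plus the discounted optimal value of the next state.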

$$ V^*(s) = \max_a \sum_{s'} T(s, a, s') [R(s, a, s') + \gamma V^*(s')] $$

**Q-Values:**
Instead of valuing states ($V$), we value state-action pairs ($Q$). The optimal Q-Value $Q^*(s, a)$ is the expected return of taking action $a$ in state $s$ and acting optimally thereafter.

$$ Q^*(s, a) = \sum_{s'} T(s, a, s') [R(s, a, s') + \gamma \max_{a'} Q^*(s', a')] $$
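
When $T$ and $R$ are known, the Q-Value equation can be applied repeatedly until the estimates stop changing (Q-Value Iteration). Here is a minimal sketch on a hypothetical two-state, two-action MDP; the transitions, rewards, and $\gamma = 0.5$ are invented so that the result can be checked by hand.

```python
import numpy as np

# Hypothetical MDP: T[s, a, s2] is the transition probability, R[s, a, s2] the reward.
T = np.zeros((2, 2, 2))
R = np.zeros((2, 2, 2))
T[0, 0, 0] = 1.0  # state 0, action 0: stay in state 0 ...
R[0, 0, 0] = 1.0  # ... and receive reward 1
T[0, 1, 1] = 1.0  # state 0, action 1: move to state 1, reward 0
T[1, 0, 0] = 1.0  # state 1, action 0: move to state 0, reward 0
T[1, 1, 1] = 1.0  # state 1, action 1: stay in state 1 ...
R[1, 1, 1] = 3.0  # ... and receive reward 3
gamma = 0.5

Q = np.zeros((2, 2))
for _ in range(100):
    V = Q.max(axis=1)  # current estimate of max_a' Q(s', a') for each s'
    # Sum over s' of T(s, a, s') * [R(s, a, s') + gamma * V(s')]
    Q = np.sum(T * (R + gamma * V), axis=2)

print(Q)  # converges toward [[2.5, 3.0], [1.5, 6.0]]
```

The optimal policy simply picks the action with the highest Q-Value in each state: here, move to state 1 and stay there.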

## 8. Q-Learning

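If we don't know the transition probabilities $T$ or the rewards $R$ in advance, we can use **Temporal Difference (TD) Learning**. The agent explores the environment and, after each observed transition $(s, a, r, s')$, nudges its Q-Value estimate toward the observed reward plus the discounted value of the next state, using a learning rate $\alpha$.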

**Update Rule:**
$$ Q(s, a) \leftarrow (1 - \alpha) Q(s, a) + \alpha (r + \gamma \max_{a'} Q(s', a')) $$

To explore, we use an **$\epsilon$-greedy policy**: with probability $\epsilon$, pick a random action; otherwise, pick the best action ($\text{argmax}_a Q(s, a)$).
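
The update rule and the $\epsilon$-greedy policy can be sketched in tabular form. The deterministic two-state MDP below is hypothetical (its optimal Q-Values can be worked out by hand as `[[2.5, 3], [1.5, 6]]`), and the values of $\alpha$, $\gamma$, and $\epsilon$ are arbitrary assumptions. The agent only ever sees sampled transitions, never the tables as a model.

```python
import numpy as np

# Hypothetical deterministic environment: next_state[s, a] and reward[s, a].
next_state = np.array([[0, 1], [0, 1]])
reward = np.array([[1.0, 0.0], [0.0, 3.0]])
gamma, alpha, epsilon = 0.5, 0.5, 0.3

rng = np.random.default_rng(42)
Q = np.zeros((2, 2))
state = 0
for _ in range(20_000):
    # Epsilon-greedy: explore with probability epsilon, otherwise exploit
    if rng.random() < epsilon:
        action = int(rng.integers(2))
    else:
        action = int(np.argmax(Q[state]))
    s_next = next_state[state, action]
    r = reward[state, action]
    # Q(s, a) <- (1 - alpha) * Q(s, a) + alpha * (r + gamma * max_a' Q(s', a'))
    td_target = r + gamma * Q[s_next].max()
    Q[state, action] = (1 - alpha) * Q[state, action] + alpha * td_target
    state = s_next

print(Q)
```

With a constant learning rate this converges cleanly only because the toy environment is deterministic; in stochastic environments the learning rate is usually decayed over time.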

## 9. Deep Q-Learning (DQN)

In complex environments (like Atari games with raw pixels), the number of states is far too large to store in a Q-table. Instead, we use a deep neural network, called a Deep Q-Network (DQN), to approximate the Q-Value function: $Q(s, a) \approx Q(s, a; \theta)$, where $\theta$ denotes the network's weights.

**Training Loop Stability Tricks:**
1.  **Experience Replay:** Instead of training on the latest experience, we store experiences $(s, a, r, s', done)$ in a **Replay Buffer** and sample a random batch for training. This breaks correlations between consecutive steps.
2.  **Target Network:** We use two networks. The **Online Model** learns and selects actions. The **Target Model** is used only to compute the target values ($r + \gamma \max_{a'} Q_{target}(s', a')$). The Target Model's weights are copied from the Online Model only periodically (e.g., every 1000 steps). Without this, every update would also shift the targets the network is chasing, creating a feedback loop that can destabilize training.
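
The target values mentioned above can be sketched in NumPy. The `td_targets` helper is hypothetical, and `next_q_values` stands in for what a target model would predict for a sampled batch of next states. Note how the `done` flag stored with each experience switches off the future term for terminal transitions.

```python
import numpy as np

def td_targets(rewards, dones, next_q_values, gamma):
    """Compute r + gamma * max_a' Q_target(s', a'), without bootstrapping on terminal steps."""
    max_next_q = next_q_values.max(axis=1)
    not_done = 1.0 - dones.astype(np.float64)
    return rewards + not_done * gamma * max_next_q

rewards = np.array([1.0, 1.0, 1.0])
dones = np.array([False, False, True])
# Made-up target-network outputs: one row per experience, one column per action
next_q_values = np.array([[2.0, 4.0], [5.0, 3.0], [9.0, 9.0]])
targets = td_targets(rewards, dones, next_q_values, gamma=0.5)
print(targets)  # 1 + 0.5 * 4 = 3.0, 1 + 0.5 * 5 = 3.5, and 1.0 for the terminal step
```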

### Implementation of a DQN Agent

In [None]:
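from collections import deque

# Q-network: one output per action. It replaces the single-output policy
# network from Section 4. The layer sizes are illustrative assumptions.
n_outputs = 2  # CartPole has two actions
model = keras.models.Sequential([
    keras.layers.Dense(32, activation="elu", input_shape=[4]),
    keras.layers.Dense(32, activation="elu"),
    keras.layers.Dense(n_outputs)
])

batch_size = 32
discount_factor = 0.95
optimizer = keras.optimizers.Adam(learning_rate=1e-3)
loss_fn = keras.losses.mean_squared_error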

def epsilon_greedy_policy(state, epsilon=0):
    if np.random.rand() < epsilon:
        return np.random.randint(2)
    else:
        Q_values = model.predict(state[np.newaxis])
        return np.argmax(Q_values[0])

replay_buffer = deque(maxlen=2000)

def sample_experiences(batch_size):
    indices = np.random.randint(len(replay_buffer), size=batch_size)
    batch = [replay_buffer[index] for index in indices]
    states, actions, rewards, next_states, dones = [
        np.array([experience[field_index] for experience in batch])
        for field_index in range(5)]
    return states, actions, rewards, next_states, dones

def play_one_step_dqn(env, state, epsilon):
    action = epsilon_greedy_policy(state, epsilon)
    next_state, reward, done, info = env.step(action)
    replay_buffer.append((state, action, reward, next_state, done))
    return next_state, reward, done, info

def training_step(batch_size):
    experiences = sample_experiences(batch_size)
    states, actions, rewards, next_states, dones = experiences
    
    # Compute target Q values using the Target Model (not implemented here, using model for simplicity)
    # In full DQN, next_Q_values = target_model.predict(next_states)
    next_Q_values = model.predict(next_states)
    max_next_Q_values = np.max(next_Q_values, axis=1)
    target_Q_values = (rewards + (1 - dones) * discount_factor * max_next_Q_values)
    
    # Compute gradients
    with tf.GradientTape() as tape:
        all_Q_values = model(states)
        # Extract Q-value for the specific action taken
        one_hot_actions = tf.one_hot(actions, depth=2)
        Q_values = tf.reduce_sum(all_Q_values * one_hot_actions, axis=1)
        loss = tf.reduce_mean(loss_fn(target_Q_values, Q_values))
        
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))

print("DQN functions defined.")

## 10. DQN Variants

**Double DQN:**
Standard DQN tends to overestimate Q-values because the `max` operator uses the same values to select and evaluate an action. Double DQN uses the Online Model to *select* the best action and the Target Model to *evaluate* it.

$$ y = r + \gamma \, Q_{target}(s', \text{argmax}_{a'} Q_{online}(s', a')) $$

Here $y$ is the training target for $Q_{online}(s, a)$.
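
A minimal NumPy sketch of this target computation follows. The `double_dqn_targets` helper and the Q-Value arrays are hypothetical; the `dones` mask is borrowed from the DQN code above rather than from the formula.

```python
import numpy as np

def double_dqn_targets(rewards, dones, online_next_q, target_next_q, gamma):
    best_actions = online_next_q.argmax(axis=1)      # select with the online model
    rows = np.arange(len(best_actions))
    next_values = target_next_q[rows, best_actions]  # evaluate with the target model
    not_done = 1.0 - dones.astype(np.float64)
    return rewards + not_done * gamma * next_values

rewards = np.array([1.0, 0.0])
dones = np.array([False, True])
online_next_q = np.array([[0.2, 0.9], [0.5, 0.1]])  # made-up online-model outputs
target_next_q = np.array([[3.0, 2.0], [7.0, 8.0]])  # made-up target-model outputs
targets = double_dqn_targets(rewards, dones, online_next_q, target_next_q, gamma=0.5)
print(targets)
```

For the first experience, the online model prefers action 1, so the target uses the target model's value for that action (2.0) rather than its maximum (3.0). A standard DQN target would be 2.5 instead of 2.0.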

**Dueling DQN:**
The Q-Value of a state-action pair can be decomposed into the value of the state, $V(s)$, and the advantage of taking action $a$ in that state, $A(s, a)$.
$$ Q(s, a) = V(s) + A(s, a) $$
In Dueling DQN, the network splits into two streams (one for $V$, one for $A$) and merges them at the end. This allows the network to learn which states are valuable without having to learn the effect of every action for every state.
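
The merge step can be sketched in NumPy. One detail not stated above, and treated here as an assumption, is that the raw advantages are commonly shifted (for example by subtracting the maximum advantage) so that the best action has an advantage of 0 and $V(s)$ equals the best Q-Value. The `dueling_q_values` helper and its inputs are hypothetical.

```python
import numpy as np

def dueling_q_values(state_values, raw_advantages):
    """Combine V(s) of shape (batch, 1) with A(s, a) of shape (batch, n_actions)."""
    # Shift advantages so the best action in each state has advantage 0
    advantages = raw_advantages - raw_advantages.max(axis=1, keepdims=True)
    return state_values + advantages

state_values = np.array([[10.0], [-2.0]])            # made-up output of the V stream
raw_advantages = np.array([[1.0, 3.0], [0.5, 0.0]])  # made-up output of the A stream
q_values = dueling_q_values(state_values, raw_advantages)
print(q_values)
```

With this shift, the highest Q-Value in each row equals the state's value, so the two streams have distinct roles instead of being able to trade a constant back and forth.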