# **Chapter 18**
# **Reinforcement Learning**

**Introduction to Reinforcement Learning**

This subchapter introduces Reinforcement Learning (RL) as a learning paradigm where an agent learns by interacting with an environment. Unlike supervised learning, RL does not rely on labeled datasets. Instead, the agent receives rewards or penalties based on the actions it takes.

The objective of the agent is to learn a policy that maximizes the cumulative reward over time. At each step, the agent observes the current state, chooses an action, receives a reward, and transitions to a new state. This interaction loop continues until a terminal condition is reached.
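The loop and objective described above can be written compactly. The symbols below ($s_t$ for the state, $a_t$ for the action, $r_t$ for the reward, $\pi$ for the policy, $T$ for the final step) are notation assumed here for illustration; the text itself does not fix any symbols.

```latex
a_t \sim \pi(\cdot \mid s_t), \qquad
(s_t, a_t) \longmapsto (r_t,\, s_{t+1}), \qquad t = 0, 1, \dots, T

J(\pi) = \mathbb{E}_{\pi}\!\left[\, \sum_{t=0}^{T} r_t \,\right]
```

Under this notation, learning amounts to searching for a policy $\pi$ that makes $J(\pi)$ as large as possible.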

Key challenges in RL include:

- The exploration vs. exploitation trade-off
- Delayed rewards
- Large or continuous state spaces

RL is widely used in robotics, game playing (e.g., AlphaGo), recommendation systems, and autonomous systems.

**Policy Search**

This section explains policy search methods, which optimize the policy directly instead of first learning value functions.

A policy can be:

- A simple rule-based system
- A parameterized function (e.g., a neural network)

A policy is evaluated by running it for several episodes and measuring the average reward. The policy parameters are then adjusted to improve that average.

In [1]:
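import numpy as np

def policy(obs, theta):
    # Linear threshold policy: choose action 0 when the weighted sum of the
    # observation is negative, and action 1 otherwise.
    return 0 if np.dot(obs, theta) < 0 else 1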
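
To make the "run it several times, average the reward, then adjust the parameters" idea concrete, here is a minimal sketch. It assumes a hypothetical one-dimensional balancing task (`run_episode`) and uses plain random search as the update rule; both are illustrative choices, not something the chapter prescribes.

```python
import numpy as np

def policy(obs, theta):
    # Same linear threshold policy as in the cell above.
    return 0 if np.dot(obs, theta) < 0 else 1

def run_episode(theta, rng, max_steps=50):
    """Hypothetical toy task: keep a point inside [-1, 1].

    Action 1 pushes the point right, action 0 pushes it left.
    The reward is +1 for every step the point stays inside the interval.
    """
    x = rng.uniform(-0.5, 0.5)
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(np.array([x]), theta)
        x += (0.1 if action == 1 else -0.1) + rng.normal(0.0, 0.05)
        if abs(x) > 1.0:
            break  # the point left the interval: episode over
        total_reward += 1.0
    return total_reward

def evaluate(theta, rng, n_episodes=20):
    # Average reward over several episodes, since a single run is noisy.
    return np.mean([run_episode(theta, rng) for _ in range(n_episodes)])

rng = np.random.default_rng(0)
best_theta, best_score = None, -np.inf
for _ in range(20):
    candidate = rng.normal(size=1)       # random search: try a new parameter
    score = evaluate(candidate, rng)
    if score > best_score:               # keep it only if it scores better
        best_theta, best_score = candidate, score

print("best theta:", best_theta, "average reward:", best_score)
```

Random search ignores any structure in the problem, which is why gradient-based updates are usually preferred once the policy has more than a handful of parameters.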


**Introduction to OpenAI Gym**

OpenAI Gym provides a standardized environment interface for RL experiments. It allows agents to interact with simulated environments using a simple API.
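To illustrate the shape of that API without depending on a particular Gym release, the sketch below defines a hypothetical `ToyCorridorEnv` class that mimics the classic interface: `reset()` returns an initial observation, and `step(action)` returns `(obs, reward, done, info)`. The class is an assumption made for illustration and is not part of Gym; newer Gym and Gymnasium releases return additional values from both calls.

```python
import numpy as np

class ToyCorridorEnv:
    """Hypothetical stand-in that mimics the classic Gym reset/step interface."""

    def __init__(self, length=5, max_steps=20):
        self.length = length          # position of the goal
        self.max_steps = max_steps    # episode time limit
        self.position = 0
        self.steps = 0

    def reset(self):
        # Start a new episode and return the first observation.
        self.position = 0
        self.steps = 0
        return np.array([self.position], dtype=float)

    def step(self, action):
        # Action 1 moves right, any other action moves left (not below 0).
        move = 1 if action == 1 else -1
        self.position = max(0, self.position + move)
        self.steps += 1
        reached_goal = self.position >= self.length
        done = reached_goal or self.steps >= self.max_steps
        reward = 1.0 if reached_goal else 0.0
        obs = np.array([self.position], dtype=float)
        return obs, reward, done, {}

env = ToyCorridorEnv()
obs = env.reset()
total_reward, done = 0.0, False
while not done:
    action = 1  # fixed policy for the demo: always move right
    obs, reward, done, info = env.step(action)
    total_reward += reward

print("steps:", env.steps, "total reward:", total_reward)
```

Because every environment exposes the same two calls, the interaction loop at the bottom stays the same when the toy class is swapped for a real simulated environment.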

**Neural Network Policies**

This subchapter introduces neural networks as policies, enabling the agent to handle complex and high-dimensional inputs.

The network outputs either action probabilities, from which an action is sampled, or action values, from which the highest-valued action is selected greedily.

In [7]:
import tensorflow as tf

n_inputs = 4   # size of each observation vector
n_hidden = 4   # units in each hidden layer
n_outputs = 1  # a single sigmoid output, read as an action probability

initializer = tf.keras.initializers.VarianceScaling()

model = tf.keras.models.Sequential([
    tf.keras.Input(shape=(n_inputs,)),  # declare the input size explicitly
    tf.keras.layers.Dense(n_hidden, activation="elu", kernel_initializer=initializer),
    tf.keras.layers.Dense(n_hidden, activation="elu", kernel_initializer=initializer),
    tf.keras.layers.Dense(n_outputs, activation="sigmoid", kernel_initializer=initializer)
])
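
The difference between sampling an action and selecting it greedily can be sketched with NumPy alone. The example assumes the single sigmoid output is read as the probability of action 0 (called `p_left` here); that interpretation and the helper names are assumptions for illustration.

```python
import numpy as np

def sample_action(p_left, rng):
    # Stochastic choice: action 0 with probability p_left, otherwise action 1.
    return 0 if rng.random() < p_left else 1

def greedy_action(p_left):
    # Deterministic choice: always take the more probable action.
    return 0 if p_left > 0.5 else 1

rng = np.random.default_rng(42)
p_left = 0.7  # stand-in for the network's output on some observation

sampled = [sample_action(p_left, rng) for _ in range(10_000)]
frac_left = 1.0 - np.mean(sampled)  # fraction of sampled actions equal to 0

print("greedy action:", greedy_action(p_left))
print("fraction of sampled actions that were 0:", frac_left)
```

Sampling keeps the agent trying the less likely action some of the time, which is one simple way to maintain exploration; greedy selection always exploits the current estimate.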



**Evaluating Actions: The Policy Gradient**

This section introduces the policy gradient approach, where gradients are computed to adjust the policy in the direction that increases expected rewards.

The key idea is:

- Actions leading to higher rewards are reinforced
- Actions leading to lower rewards are discouraged

The algorithm samples trajectories, computes rewards, and updates parameters accordingly.
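One common way to write this update is the REINFORCE-style estimator below. The notation is assumed for illustration: $\pi_\theta$ is the policy with parameters $\theta$, $\gamma$ is the discount rate, $\alpha$ is the learning rate, $N$ is the number of sampled episodes, and $G_t^{(i)}$ is the discounted return from step $t$ of episode $i$.

```latex
G_t = \sum_{k=0}^{T-t} \gamma^{k}\, r_{t+k}
    \quad\text{(equivalently } G_t = r_t + \gamma\, G_{t+1}\text{)}

\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=0}^{T_i}
    G_t^{(i)} \, \nabla_\theta \log \pi_\theta\!\left(a_t^{(i)} \mid s_t^{(i)}\right)

\theta \leftarrow \theta + \alpha \, \nabla_\theta J(\theta)
```

An action followed by a large positive return pushes $\theta$ toward making that action more likely; a negative return pushes the other way.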

**Implementing Policy Gradients**

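This subchapter explains how to implement the REINFORCE algorithm. A first building block is a helper that computes the discounted return at each step: the reward at that step plus the discounted sum of all later rewards.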

In [8]:
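def discount_rewards(rewards, discount_rate):
    # Use a float array so integer rewards do not truncate the discounted values.
    discounted = np.zeros(len(rewards), dtype=float)
    cumulative = 0.0
    # Walk backwards so each step adds its reward to the discounted future sum.
    for step in reversed(range(len(rewards))):
        cumulative = rewards[step] + cumulative * discount_rate
        discounted[step] = cumulative
    return discounted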
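
A short worked example may help check the arithmetic. The sketch below repeats the helper so it runs on its own, then adds a hypothetical `normalize_across_episodes` function; normalizing returns across episodes is a common variance-reduction step, included here as an assumption rather than something the text describes.

```python
import numpy as np

def discount_rewards(rewards, discount_rate):
    discounted = np.zeros(len(rewards), dtype=float)
    cumulative = 0.0
    for step in reversed(range(len(rewards))):
        cumulative = rewards[step] + cumulative * discount_rate
        discounted[step] = cumulative
    return discounted

def normalize_across_episodes(all_rewards, discount_rate):
    # Discount each episode, then rescale with the mean and standard
    # deviation computed over every step of every episode.
    all_discounted = [discount_rewards(r, discount_rate) for r in all_rewards]
    flat = np.concatenate(all_discounted)
    mean, std = flat.mean(), flat.std()
    return [(d - mean) / std for d in all_discounted]

# Rewards 10, 0, -50 with a discount rate of 0.8:
#   last step:   -50
#   middle step:   0 + 0.8 * (-50) = -40
#   first step:   10 + 0.8 * (-40) = -22
example = discount_rewards([10, 0, -50], discount_rate=0.8)
print(example)

normalized = normalize_across_episodes([[10, 0, -50], [10, 20]], discount_rate=0.8)
print(normalized)
```

The first action receives a negative return even though its immediate reward was positive, which is how delayed consequences are credited back to earlier actions.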


**Training the Policy**

During training:

1. Multiple episodes are run
2. Rewards are collected
3. Gradients are computed
4. Model parameters are updated

In [12]:
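import tensorflow as tf

# Generic gradient step on random data. It shows the GradientTape mechanics
# that a policy-gradient update relies on; it is not an RL training loop.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(10, activation="relu"),
    tf.keras.layers.Dense(1)
])

# Random inputs and targets stand in for real training data.
X = tf.random.normal([32, 5])
y = tf.random.normal([32, 1])

optimizer = tf.keras.optimizers.Adam(learning_rate=0.01)

# Record the forward pass so the loss can be differentiated.
with tf.GradientTape() as tape:
    y_pred = model(X, training=True)
    loss = tf.reduce_mean(tf.keras.losses.mse(y, y_pred))

# Compute gradients of the loss with respect to every trainable variable.
grads = tape.gradient(loss, model.trainable_variables)

# Apply one optimizer step.
optimizer.apply_gradients(zip(grads, model.trainable_variables))

print("Loss:", loss.numpy())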


Loss: 1.580145
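
The cell above runs a single gradient step on random data. To show how the four training steps fit together for a policy, here is a minimal REINFORCE-style sketch in plain NumPy. It assumes a hypothetical toy task (`run_episode`) and a logistic policy with parameters `theta`; the steps within an episode are independent in this task, so discounting is included only to mirror the mechanics.

```python
import numpy as np

def discount_rewards(rewards, discount_rate):
    discounted = np.zeros(len(rewards), dtype=float)
    cumulative = 0.0
    for step in reversed(range(len(rewards))):
        cumulative = rewards[step] + cumulative * discount_rate
        discounted[step] = cumulative
    return discounted

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def run_episode(theta, rng, n_steps=5):
    """Hypothetical task: reward +1 if the action matches the sign of obs[0], else -1."""
    grads, rewards = [], []
    for _ in range(n_steps):
        obs = np.array([rng.uniform(-1.0, 1.0), 1.0])  # one feature plus a bias term
        p_right = sigmoid(obs @ theta)                 # probability of action 1
        action = 1 if rng.random() < p_right else 0    # sample from the policy
        target = 1 if obs[0] > 0 else 0
        rewards.append(1.0 if action == target else -1.0)
        # Gradient of log pi(action | obs) for a logistic policy.
        grads.append((action - p_right) * obs)
    return grads, rewards

rng = np.random.default_rng(0)
theta = np.zeros(2)
learning_rate, discount_rate = 0.05, 0.5

for episode in range(2000):
    grads, rewards = run_episode(theta, rng)             # 1. run an episode, 2. collect rewards
    returns = discount_rewards(rewards, discount_rate)
    update = sum(g * G for g, G in zip(grads, returns))  # 3. return-weighted gradients
    theta += learning_rate * update                      # 4. gradient ascent on the policy

# Check the greedy policy against the rule that defines the reward.
xs = np.linspace(-1.0, 1.0, 201)
greedy = (sigmoid(theta[0] * xs + theta[1]) > 0.5).astype(int)
accuracy = np.mean(greedy == (xs > 0).astype(int))
print("theta:", theta, "greedy accuracy:", accuracy)
```

With a neural network policy, the hand-written gradient of the log-probability would be replaced by gradients recorded with `tf.GradientTape`, as in the cell above.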


**Advantages and Limitations of Policy Gradients**

Advantages:

- Works well with continuous action spaces
- Directly optimizes the policy
- Can learn stochastic behaviors

Limitations:

- High variance in gradient estimates
- Requires many episodes
- Sensitive to hyperparameters

**Summary of Reinforcement Learning Concepts**

This chapter provides a foundational understanding of reinforcement learning, focusing on:

- Agent-environment interaction
- Policy-based learning
- Policy gradients and neural policies
- Practical implementation using OpenAI Gym and TensorFlow

These concepts prepare the reader for more advanced algorithms such as Actor-Critic, Deep Q-Learning, and Proximal Policy Optimization (PPO) in later chapters.