# Chapter 22 - Reinforcement Learning

*In which we see how experiencing rewards and punishments can teach an agent how to
maximize rewards in the future.* - Stuart Russell and Peter Norvig, Artificial Intelligence: A Modern Approach

## **22.1 Learning from Rewards**  
- **Supervised learning challenges in complex environments:**  Applying supervised learning to complex tasks like chess is hard: the state space is vast, and the available grandmaster games cover only a tiny fraction of it, so labeled examples of "correct" moves are scarce. 
- **Introduction to reinforcement learning (RL):**  RL involves an agent learning from interactions with its environment through rewards, aiming to maximize the expected sum of these rewards. Unlike an agent that is handed a fully specified MDP to solve, an RL agent may not know the environment's transition model or reward function ahead of time and must discover them through experience. 
- **Benefits of reinforcement learning:**  Providing reward signals is usually simpler and requires less expertise than supplying labeled examples for supervised learning. Learning is possible even when rewards are sparse (informative signals are rare), and supplying additional intermediate rewards can make learning much easier. 
- **Versatility and applications of RL:**  Reinforcement learning is a flexible approach that has been applied successfully in various domains, including video games, robotics, and strategic games like poker. It can be enhanced with deep learning techniques for even broader applications. 
- **RL algorithms and categories:**  The chapter outlines two main types of reinforcement learning strategies: 
  - **Model-based RL,**  where the agent uses or learns a model of the environment to interpret rewards and make decisions. This approach often involves learning a utility function based on the sum of rewards. 
  - **Model-free RL,**  which does not rely on understanding the environment's model but directly learns how to act. This category includes action-utility learning (like Q-learning, where the agent learns a Q-function to evaluate the sum of rewards for state-action pairs) and policy search (learning a direct mapping from states to actions). 
- **Structure of the chapter:**  The chapter starts with passive reinforcement learning, where the agent's policy is fixed, and then moves to active reinforcement learning, where the agent must explore and learn how to act. It then covers the use of inductive and deep learning to enhance RL, the use of intermediate rewards, and the hierarchical organization of behavior. It concludes with a discussion of apprenticeship learning and real-world applications of RL.

## **22.2 Passive Reinforcement Learning**  
- **Overview:**  Passive reinforcement learning involves an agent with a fixed policy π(s) learning the utility function Uπ(s) in a fully observable environment with a defined set of actions and states. This utility function represents the expected total discounted reward from following policy π starting in state s. 
- **Difference from policy evaluation:**  While passive reinforcement learning shares similarities with policy evaluation in policy iteration, the key difference is the passive learner's ignorance of the transition model P(s′|s,a) and the reward function R(s,a,s′), which define probabilities of state transitions and rewards for transitions, respectively. 
- **Learning without knowing transition and reward functions:**  The agent executes trials within the environment, following its fixed policy, and observes sequences of state transitions and rewards without prior knowledge of the transition or reward functions. The aim is to use these observations to learn the expected utility Uπ(s) for each nonterminal state. 
- **Example of learning process:**  Using the 4×3 world from Chapter 17 as an example, the agent conducts trials starting from an initial state and moving through the environment until reaching a terminal state. Each transition during the trials is annotated with the action taken and the reward received, which the agent uses to update its understanding of the utility of each state. 
- **Calculation of expected utility:**  The expected utility Uπ(s) is the expected sum of discounted rewards received by following the policy π from state s. A discount factor γ weights future rewards relative to immediate ones; the 4×3 example uses γ = 1, meaning no discounting.
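
The definition above can be written compactly as follows. Here S_t is an added symbol for the (random) state reached at time t when executing π from s; the other symbols are those already used in this section.

$$
U^\pi(s) = E\left[\sum_{t=0}^{\infty} \gamma^t \, R\big(S_t, \pi(S_t), S_{t+1}\big)\right], \qquad S_0 = s
$$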

This section lays the foundation for passive reinforcement learning: an agent following a fixed policy can estimate state utilities from direct experience, without prior knowledge of the transition model or reward function.

### **22.2.1 Direct Utility Estimation**  
- **Concept:**  Direct utility estimation defines the utility of a state as the expected total reward from that state onward, known as the expected reward-to-go. Each trial in the learning process provides a sample of this reward for each visited state. 
- **Method:**  At the end of each trial, the algorithm computes the observed reward-to-go for each visited state and updates a running average for that state. In the limit of infinitely many trials, the running average converges to the true expected utility. 
- **Reduction to supervised learning problem:**  This approach effectively reduces reinforcement learning to a supervised learning problem, where each data point is a pair consisting of a state and its corresponding reward-to-go. While this reduction allows the use of powerful supervised learning algorithms, it overlooks the dependencies between states and their successor states. 
- **Ignoring Bellman equations:**  The direct utility estimation method does not account for the Bellman equations, which articulate that the utility of a state is influenced by both the immediate reward and the expected utility of successor states. This oversight limits the method's efficiency by ignoring the inherent connections between state utilities. 
- **Drawbacks:**  By neglecting the relationships between states as described by the Bellman equations, direct utility estimation misses out on learning opportunities and may converge slowly. The approach treats the utility estimation problem as if searching within a hypothesis space larger than necessary, including many potential utility functions that violate the Bellman equations.
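
As a rough illustration of the running-average idea described above, here is a minimal sketch. The function name `direct_utility_estimation`, the trial format (a list of `(state, reward)` pairs, with the reward being the one received on leaving that state), and the toy states `A`, `B`, `C` are assumptions made for this example and are not taken from the chapter.

```python
from collections import defaultdict


def direct_utility_estimation(trials, gamma=1.0):
    """Average the observed reward-to-go over every visit to each state.

    Assumed format: each trial is a list of (state, reward) pairs, where
    reward is the reward received on the transition out of that state.
    """
    totals = defaultdict(float)  # Sum of reward-to-go samples per state
    counts = defaultdict(int)    # Number of samples per state
    for trial in trials:
        reward_to_go = 0.0
        # Walk backward so each sample builds on the one after it
        for state, reward in reversed(trial):
            reward_to_go = reward + gamma * reward_to_go
            totals[state] += reward_to_go
            counts[state] += 1
    return {state: totals[state] / counts[state] for state in totals}


# Two hypothetical trials under the same fixed policy
trials = [
    [("A", -1.0), ("B", -1.0), ("C", 10.0)],
    [("A", -1.0), ("C", 10.0)],
]
estimates = direct_utility_estimation(trials)
print(estimates)
```

Note that each state's estimate is computed from its own samples only; nothing ties the estimate for `A` to the estimates for its successors, which is the weakness the Bellman-equation discussion above points out.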

### **22.2.2 Adaptive Dynamic Programming**  
- **Definition and approach:**  Adaptive Dynamic Programming (ADP) integrates learning the transition model of the environment with solving the Markov decision process (MDP) via dynamic programming. This method leverages the interconnectedness of state utilities by learning the transition probabilities P(s′|s,π(s)) and observed rewards R(s,π(s),s′) to compute state utilities using Bellman equations. 
- **Use of linear algebra and modified policy iteration:**  Because the policy is fixed, the Bellman equations form a linear system and can be solved with standard linear algebra software. Alternatively, ADP can follow the approach of modified policy iteration, applying a simplified value-iteration update (with the policy held fixed) after each change to the learned model. Since the model usually changes only slightly with each observation, these updates can start from the previous utility estimates and typically converge quickly. 
- **Learning the transition model:**  In fully observable environments, learning the transition model becomes a straightforward supervised learning task, using state–action pairs as inputs and resulting states as outputs. This model is often represented as a table, with transition probabilities estimated from observed transitions. 
- **Efficiency and limitations:**  The ADP agent's performance is primarily constrained by its ability to accurately learn the transition model. While ADP sets a benchmark for evaluating other reinforcement learning algorithms due to its direct approach to solving the MDP, it becomes impractical for very large state spaces, such as those in complex games like backgammon, due to the computational challenge of solving an enormous number of equations.
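
To make the "linear system" point concrete, the following sketch solves the fixed-policy Bellman equations exactly for a tiny, hand-written model. It is only an illustration under stated assumptions: the helper names `solve_linear` and `evaluate_policy`, the model format, and the two-state example are hypothetical, and the solver is a plain-Python Gaussian elimination that assumes the system is nonsingular (in practice a linear algebra library would normally be used).

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # Augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):  # Back substitution
        known = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - known) / m[r][r]
    return x


def evaluate_policy(states, model, gamma):
    """Solve U(s) = sum_s' P(s'|s,pi(s)) * (R(s,pi(s),s') + gamma * U(s')).

    Assumed format: model[s] = {next_state: (probability, reward)} under the
    fixed policy; terminal states map to an empty dict, giving U = 0.
    """
    index = {s: i for i, s in enumerate(states)}
    n = len(states)
    a = [[0.0] * n for _ in range(n)]
    b = [0.0] * n
    for s in states:
        i = index[s]
        a[i][i] += 1.0
        for next_state, (prob, reward) in model[s].items():
            a[i][index[next_state]] -= gamma * prob  # Move U(s') terms to the left side
            b[i] += prob * reward                    # Expected immediate reward
    return dict(zip(states, solve_linear(a, b)))


# Hypothetical model: from A the action loops back (reward -1) or ends in T (reward 10)
model = {"A": {"A": (0.5, -1.0), "T": (0.5, 10.0)}, "T": {}}
utilities = evaluate_policy(["A", "T"], model, gamma=1.0)
print(utilities)
```

The self-loop at `A` is why a direct solve is convenient here: U(A) appears on both sides of its own equation, so it cannot be read off in a single pass.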

### Passive Adaptive Dynamic Programming (ADP) Learner in Python

Implementing a Passive Adaptive Dynamic Programming (ADP) Learner involves several key components: 
1. **Initialization** : Setting up the environment, including states, actions, policy, and initial estimates of the transition model and utilities. 
2. **Learning the Transition Model** : Updating the transition model based on observed transitions. 
3. **Estimating Utilities** : Using the learned transition model and observed rewards to update utilities, typically by solving the Bellman equations. 
4. **Utility Update Method** : Solving the Bellman equations can be done using linear algebra for the entire system or iteratively with a form of value iteration.

Let's consider a simplified environment for clarity. We'll implement a passive ADP learner for a grid world, where the agent has a fixed policy π(s) and learns utilities of states by observing transitions and rewards.

This example assumes a very basic environment setup for demonstration purposes. In more complex scenarios, you would need to expand this framework significantly.

```python
class PassiveADPLearner:
    def __init__(self, states, actions, policy, gamma=0.9):
        self.states = states  # List of states
        self.actions = actions  # List of actions (kept for reference; the policy is fixed)
        self.policy = policy  # Fixed policy: state -> action
        self.gamma = gamma  # Discount factor
        self.rewards = {}  # Reward function: (state, action, next_state) -> reward
        self.transitions = {}  # Transition counts: (state, action) -> {next_state: count}
        self.utilities = {state: 0.0 for state in states}  # State utilities

    def observe_transition(self, state, action, next_state, reward):
        # Record the reward and increment the count for the observed (s, a, s', r)
        self.rewards[(state, action, next_state)] = reward
        outcomes = self.transitions.setdefault((state, action), {})
        outcomes[next_state] = outcomes.get(next_state, 0) + 1

    def update_utilities(self, max_sweeps=1000, tolerance=1e-9):
        # Policy evaluation: repeat the fixed-policy Bellman update with the
        # learned model until the utilities stop changing
        for _ in range(max_sweeps):
            delta = 0.0
            for state in self.states:
                action = self.policy[state]
                outcomes = self.transitions.get((state, action), {})
                total_transitions = sum(outcomes.values())
                if total_transitions == 0:
                    continue  # No data yet (e.g., a terminal state): keep the current estimate
                new_utility = sum(
                    (count / total_transitions)
                    * (self.rewards[(state, action, next_state)]
                       + self.gamma * self.utilities[next_state])
                    for next_state, count in outcomes.items()
                )
                delta = max(delta, abs(new_utility - self.utilities[state]))
                self.utilities[state] = new_utility
            if delta < tolerance:
                break

# Example usage
states = ['A', 'B', 'C', 'D']  # Simplified states
actions = ['left', 'right']  # Simplified actions
policy = {'A': 'right', 'B': 'left', 'C': 'right', 'D': 'left'}  # Example policy

learner = PassiveADPLearner(states, actions, policy)
# Assume some transitions and rewards have been observed
learner.observe_transition('A', 'right', 'B', 1)
learner.observe_transition('B', 'left', 'C', -1)
learner.observe_transition('C', 'right', 'D', 2)

learner.update_utilities()
# Utilities come out at approximately A: 1.72, B: 0.8, C: 2.0, D: 0.0
print(learner.utilities)
```