# Chapter 22 - Reinforcement Learning

*In which we see how experiencing rewards and punishments can teach an agent how to
maximize rewards in the future.* - Stuart Russell and Peter Norvig, Artificial Intelligence: A Modern Approach

<img src="https://raw.githubusercontent.com/ValRCS/RBS_PBM773_Introduction_to_AI/main/img/ch22_reinforcement_learning/DALL%C2%B7E%202024-03-23%2022.35.53%20-%20Visualize%20a%20compelling%20scene%20where%20a%20human%20chess%20player%20is%20deeply%20engrossed%20in%20a%20game%20of%20chess%20against%20a%20neural%20network.%20The%20human%2C%20a%20figure%20of%20concen.webp" width="500">

A learning agent being rewarded for playing

AlphaGo Zero, a version of the AlphaGo program that learned to play Go by playing against itself, is an example of a reinforcement learning system. The system is trained by playing games against itself, with the goal of maximizing the probability of winning. The system is rewarded for winning games and penalized for losing games, and it uses these rewards and penalties to adjust its strategy and improve its performance over time.

Similarly for chess: AlphaZero, a generalization of AlphaGo Zero that handles chess, shogi, and Go, is an example of a reinforcement learning system. It learned chess purely by playing games against itself, rewarded for wins and penalized for losses, and used these signals to adjust its strategy and improve its performance over time.

Famously, AlphaZero beat Stockfish in a 2017 match, which was a big deal because Stockfish was one of the strongest chess engines at the time. AlphaZero achieved this by learning from scratch, given only the rules of the game and no human game data or guidance. 
There is some discussion about whether AlphaZero is actually stronger than Stockfish under equal conditions, but it is clear that AlphaZero is a very strong chess engine.

Source on AlphaZero: https://en.wikipedia.org/wiki/AlphaZero

## **22.1 Learning from Rewards**  
- **Supervised learning challenges in complex environments:**  Applying supervised learning to complex tasks like chess is challenging due to the vast number of possible states and the difficulty in defining "correct" moves solely based on past grandmaster games. 
- **Introduction to reinforcement learning (RL):**  RL involves an agent learning from interactions with its environment through rewards, aiming to maximize the expected sum of these rewards. Unlike an agent that is handed a fully specified Markov decision process (MDP) to solve, the agent in RL may not know the environment's transition model or reward function ahead of time. 
- **Benefits of reinforcement learning:**  Providing reward signals is usually simpler and requires less expertise than supplying labeled examples for supervised learning. Even sparse rewards (where informative signals are rare) can be beneficial, and additional intermediate rewards can significantly aid learning. 
- **Versatility and applications of RL:**  Reinforcement learning is a flexible approach that has been applied successfully in various domains, including video games, robotics, and strategic games like poker. It can be enhanced with deep learning techniques for even broader applications. 
- **RL algorithms and categories:**  The chapter outlines two main types of reinforcement learning strategies: 
  - **Model-based RL,**  where the agent uses or learns a model of the environment to interpret rewards and make decisions. This approach often involves learning a utility function based on the sum of rewards. 
  - **Model-free RL,**  which does not rely on understanding the environment's model but directly learns how to act. This category includes action-utility learning (like Q-learning, where the agent learns a Q-function to evaluate the sum of rewards for state-action pairs) and policy search (learning a direct mapping from states to actions). 
- **Structure of the chapter:**  The chapter progresses from discussing passive reinforcement learning, where the agent's policy is predetermined, through active reinforcement learning that involves exploration and learning how to act within an environment. It explores the use of inductive and deep learning to enhance RL, the concept of providing intermediate rewards, and organizing behavior hierarchically. It concludes with a discussion on apprenticeship learning and real-world applications of RL.

## **22.2 Passive Reinforcement Learning**  
- **Overview:**  Passive reinforcement learning involves an agent with a fixed policy π(s) learning the utility function Uπ(s) in a fully observable environment with a defined set of actions and states. This utility function represents the expected total discounted reward from following policy π starting in state s. 
- **Difference from policy evaluation:**  While passive reinforcement learning shares similarities with policy evaluation in policy iteration, the key difference is the passive learner's ignorance of the transition model P(s′|s,a) and the reward function R(s,a,s′), which define probabilities of state transitions and rewards for transitions, respectively. 
- **Learning without knowing transition and reward functions:**  The agent executes trials within the environment, following its fixed policy, and observes sequences of state transitions and rewards without prior knowledge of the transition or reward functions. The aim is to use these observations to learn the expected utility Uπ(s) for each nonterminal state. 
- **Example of learning process:**  Using the 4×3 world from Chapter 17 as an example, the agent conducts trials starting from an initial state and moving through the environment until reaching a terminal state. Each transition during the trials is annotated with the action taken and the reward received, which the agent uses to update its understanding of the utility of each state. 
- **Calculation of expected utility:**  The expected utility Uπ(s) is calculated as the expected sum of discounted rewards received by following the policy π from state s. A discount factor γ is included in the calculation to account for the time value of rewards, with γ = 1 indicating no discounting in the example 4×3 world.

This section emphasizes the foundational aspects of passive reinforcement learning by illustrating how an agent can learn about an environment's dynamics and rewards through direct experience, even without initial knowledge of the environment's structure.
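
Written out, the expected utility described in the list above takes the following standard form. The random variable \( S_t \), denoting the state reached at time \( t \) when executing π from \( s \), is notation assumed here rather than taken from the notes:

\[
U^\pi(s) = E\left[\, \sum_{t=0}^{\infty} \gamma^t \, R\big(S_t, \pi(S_t), S_{t+1}\big) \;\middle|\; S_0 = s \right]
\]

With γ = 1, as in the 4×3 world, this is simply the expected sum of rewards collected until a terminal state is reached.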

### **22.2.1 Direct Utility Estimation**  
- **Concept:**  Direct utility estimation defines the utility of a state as the expected total reward from that state onward, known as the expected reward-to-go. Each trial in the learning process provides a sample of this reward for each visited state. 
- **Method:**  The algorithm updates the estimated utility for each state by calculating the observed reward-to-go at the end of each sequence and maintaining a running average for each state. With an infinite number of trials, this method will converge to the true expected utility as defined by the reinforcement learning model. 
- **Reduction to supervised learning problem:**  This approach effectively reduces reinforcement learning to a supervised learning problem, where each data point is a pair consisting of a state and its corresponding reward-to-go. While this reduction allows the use of powerful supervised learning algorithms, it overlooks the dependencies between states and their successor states. 
- **Ignoring Bellman equations:**  The direct utility estimation method does not account for the Bellman equations, which articulate that the utility of a state is influenced by both the immediate reward and the expected utility of successor states. This oversight limits the method's efficiency by ignoring the inherent connections between state utilities. 
- **Drawbacks:**  By neglecting the relationships between states as described by the Bellman equations, direct utility estimation misses out on learning opportunities and may converge slowly. The approach treats the utility estimation problem as if searching within a hypothesis space larger than necessary, including many potential utility functions that violate the Bellman equations.
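
The running-average idea can be sketched in a few lines of Python. This is a minimal illustration, not the textbook's code: the trial format (a list of `(state, reward)` pairs, where the reward is the one received on leaving that state) and the function name `direct_utility_estimation` are assumptions made for the example.

```python
from collections import defaultdict

def direct_utility_estimation(trials, gamma=1.0):
    """Estimate U(s) as the average observed reward-to-go over all visits to s.

    Each trial is a list of (state, reward) pairs, where reward is the reward
    received on the transition out of that state (a representation assumed here).
    """
    totals = defaultdict(float)  # Sum of observed reward-to-go samples per state
    visits = defaultdict(int)    # Number of samples per state
    for trial in trials:
        reward_to_go = 0.0
        # Walk the trial backwards so each reward-to-go reuses the one after it
        for state, reward in reversed(trial):
            reward_to_go = reward + gamma * reward_to_go
            totals[state] += reward_to_go
            visits[state] += 1
    return {state: totals[state] / visits[state] for state in totals}

# Two hypothetical trials
trials = [
    [('A', -1.0), ('B', -1.0), ('C', 10.0)],
    [('A', -1.0), ('C', 10.0)],
]
estimates = direct_utility_estimation(trials)
print(estimates)  # {'C': 10.0, 'B': 9.0, 'A': 8.5}
```

Note that the estimate for `A` is computed without ever looking at the estimates for `B` or `C`, which is exactly the missed connection between neighboring states that the Bellman equations would supply.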

### **22.2.2 Adaptive Dynamic Programming**  
- **Definition and approach:**  Adaptive Dynamic Programming (ADP) integrates learning the transition model of the environment with solving the Markov decision process (MDP) via dynamic programming. This method leverages the interconnectedness of state utilities by learning the transition probabilities P(s′|s,π(s)) and observed rewards R(s,π(s),s′) to compute state utilities using Bellman equations. 
- **Use of linear algebra and modified policy iteration:**  Given that the Bellman equations form a linear system when the policy is fixed, they can be solved using linear algebra software. Alternatively, ADP can follow the approach of modified policy iteration, using a simplified value iteration process to update utility estimates efficiently after each incremental change to the learned model. 
- **Learning the transition model:**  In fully observable environments, learning the transition model becomes a straightforward supervised learning task, using state–action pairs as inputs and resulting states as outputs. This model is often represented as a table, with transition probabilities estimated from observed transitions. 
- **Efficiency and limitations:**  The ADP agent's performance is primarily constrained by its ability to accurately learn the transition model. While ADP sets a benchmark for evaluating other reinforcement learning algorithms due to its direct approach to solving the MDP, it becomes impractical for very large state spaces, such as those in complex games like backgammon, due to the computational challenge of solving an enormous number of equations.
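
To make the linear-algebra route concrete, here is a small sketch using NumPy. For a fixed policy the Bellman equations read U = r + γPU, which rearranges to (I − γP)U = r. The matrix layout, the function name `evaluate_policy_linear`, and the three-state chain are all assumptions chosen for illustration.

```python
import numpy as np

def evaluate_policy_linear(P, r, gamma):
    """Solve U = r + gamma * P @ U for U, i.e. (I - gamma * P) U = r.

    P[i, j] is the estimated probability of moving from state i to state j
    under the fixed policy; r[i] is the expected immediate reward in state i.
    """
    n = P.shape[0]
    return np.linalg.solve(np.eye(n) - gamma * P, r)

# Hypothetical 3-state chain: 0 -> 1 -> 2, where state 2 is terminal
P = np.array([
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.0, 0.0, 0.0],  # Terminal state: no outgoing transitions
])
r = np.array([1.0, -1.0, 0.0])
U = evaluate_policy_linear(P, r, gamma=0.9)
print(np.round(U, 3))
```

A direct solve costs roughly cubic time in the number of states, which is one way to see why this approach stops being practical for very large state spaces.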

### Passive Adaptive Dynamic Programming (ADP) Learner in Python

Implementing a Passive Adaptive Dynamic Programming (ADP) Learner involves several key components: 
1. **Initialization** : Setting up the environment, including states, actions, policy, and initial estimates of the transition model and utilities. 
2. **Learning the Transition Model** : Updating the transition model based on observed transitions. 
3. **Estimating Utilities** : Using the learned transition model and observed rewards to update utilities, typically by solving the Bellman equations. 
4. **Utility Update Method** : Solving the Bellman equations can be done using linear algebra for the entire system or iteratively with a form of value iteration.

Let's consider a simplified environment for clarity. We'll implement a passive ADP learner for a grid world, where the agent has a fixed policy π(s) and learns utilities of states by observing transitions and rewards.

This example assumes a very basic environment setup for demonstration purposes. In more complex scenarios, you would need to expand this framework significantly.

In [1]:
import numpy as np

# print numpy version
print(f"NumPy version: {np.__version__}")

class PassiveADPLearner:
    def __init__(self, states, actions, policy, gamma=0.9):
        self.states = states  # List of states
        self.actions = actions  # List of actions
        self.policy = policy  # Fixed policy: state -> action
        self.gamma = gamma  # Discount factor
        self.rewards = {}  # Reward function: (state, action, next_state) -> reward
        self.transitions = {}  # Transition counts: (state, action) -> {next_state: count}
        self.utilities = {state: 0.0 for state in states}  # State utilities

    def observe_transition(self, state, action, next_state, reward):
        # Record the reward (assumed deterministic, so the first observation is kept)
        # and increment the count for the observed (s, a, s') transition
        self.rewards.setdefault((state, action, next_state), reward)
        next_state_counts = self.transitions.setdefault((state, action), {})
        next_state_counts[next_state] = next_state_counts.get(next_state, 0) + 1

    def update_utilities(self, tolerance=1e-6, max_sweeps=1000):
        # Iterative policy evaluation: repeatedly apply the Bellman equation for the
        # fixed policy, using the estimated model, until the utilities stop changing
        for _ in range(max_sweeps):
            max_change = 0.0
            for state in self.states:
                action = self.policy[state]
                action_transitions = self.transitions.get((state, action), {})
                total_transitions = sum(action_transitions.values())
                if total_transitions == 0:
                    continue  # No data for this state yet; keep its current estimate
                new_utility = 0.0
                for next_state, count in action_transitions.items():
                    transition_prob = count / total_transitions
                    reward = self.rewards[(state, action, next_state)]
                    new_utility += transition_prob * (reward + self.gamma * self.utilities[next_state])
                max_change = max(max_change, abs(new_utility - self.utilities[state]))
                self.utilities[state] = new_utility
            if max_change < tolerance:
                break

# Example usage
states = ['A', 'B', 'C', 'D']  # Simplified states
actions = ['left', 'right']  # Simplified actions
policy = {'A': 'right', 'B': 'left', 'C': 'right', 'D': 'left'}  # Example policy

learner = PassiveADPLearner(states, actions, policy)
# Assume some transitions and rewards have been observed
learner.observe_transition('A', 'right', 'B', 1)
learner.observe_transition('B', 'left', 'C', -1)
learner.observe_transition('C', 'right', 'D', 2)

learner.update_utilities()
print({state: round(utility, 2) for state, utility in learner.utilities.items()})

NumPy version: 1.26.4
{'A': 1.72, 'B': 0.8, 'C': 2.0, 'D': 0.0}


## **22.3 Active Reinforcement Learning**  
- **Transition from Passive to Active Learning:**  While a passive learning agent operates under a fixed policy, an active learning agent has the autonomy to choose its actions. This shift demands a broader learning scope, including a comprehensive transition model for all possible actions, not just those dictated by a predetermined policy. 
- **Learning a Complete Transition Model:**  Active reinforcement learning requires the agent to acquire a full transition model that encompasses outcome probabilities for every action across all states. This model facilitates the understanding of the environment's dynamics beyond the constraints of a fixed policy. 
- **Incorporating Choice of Actions:**  Active learning introduces the complexity of decision-making, where the agent must determine the most beneficial actions to take. The objective is to learn the utilities aligned with the optimal policy, adhering to the Bellman equations. These utilities reflect the highest expected returns from any given state, considering all available actions and their consequent states. 
- **Solving for Optimal Utilities:**  The utility function U, indicative of the optimal policy, can be derived through algorithms like value iteration or policy iteration, as outlined in previous chapters. These methods systematically solve the Bellman equations to identify the actions that maximize expected utility from each state. 
- **Determining Actions with Optimal Utility:**  With an established utility function, the agent can engage in one-step look-ahead to identify the action that maximizes expected utility, effectively deciding its next move based on the learned model's predictions. If the agent utilizes policy iteration, the optimal policy is explicitly defined, streamlining the action selection process. 
- **The Dilemma of Exploration vs. Exploitation:**  A critical question for active reinforcement learning agents is whether to follow the optimal action suggested by the current model or to explore alternative actions that might yield better long-term benefits. This dilemma highlights the importance of balancing immediate rewards with the potential discovery of more advantageous policies through exploration.

Active reinforcement learning emphasizes an agent's ability to independently navigate its environment, making informed decisions based on a combination of learned models and strategic exploration. This approach not only requires comprehensive modeling of environmental dynamics but also necessitates a deliberate balance between exploiting known paths to rewards and exploring new possibilities to enhance the agent's understanding and performance.
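
The one-step look-ahead mentioned above can be sketched as follows. This is an illustrative sketch only: the dictionary-based model, the function name `greedy_action`, and the numbers are hypothetical, chosen to mirror the `(state, action) -> {next_state: ...}` layout used by the ADP learner earlier in these notes.

```python
def greedy_action(state, actions, transition_model, reward, utilities, gamma=0.9):
    """Return the action that maximizes expected utility under a one-step look-ahead."""
    def expected_utility(action):
        outcomes = transition_model.get((state, action), {})
        return sum(prob * (reward[(state, action, next_state)] + gamma * utilities[next_state])
                   for next_state, prob in outcomes.items())
    return max(actions, key=expected_utility)

# Hypothetical learned model for a single state 'A'
transition_model = {
    ('A', 'left'): {'B': 1.0},
    ('A', 'right'): {'B': 0.2, 'C': 0.8},
}
reward = {('A', 'left', 'B'): 0.0, ('A', 'right', 'B'): 0.0, ('A', 'right', 'C'): 0.0}
utilities = {'A': 0.0, 'B': 1.0, 'C': 5.0}

best = greedy_action('A', ['left', 'right'], transition_model, reward, utilities)
print(best)  # right
```

Because the choice depends entirely on the learned model and utilities, any error in either one is inherited by the action, which is what motivates the exploration question.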

### **22.3.1 Exploration**  
- **Exploration vs. Exploitation:**  Active reinforcement learning faces the challenge of balancing exploration (discovering new information) and exploitation (utilizing current knowledge to maximize rewards). A greedy agent, which always chooses what seems best based on current information, may miss the optimal policy due to insufficient exploration. 
- **Greedy Agent Limitations:**  A greedy strategy can lead to suboptimal performance when the learned model of the environment is incomplete or inaccurate. Optimal actions within an incomplete model may not be truly optimal, leading the agent to settle prematurely on suboptimal policies. 
- **Informational Value of Actions:**  Actions provide value not just through immediate rewards but also by offering information that can lead to better future decisions. This highlights the necessity for a strategy that considers both immediate gains and the potential for discovering more rewarding options. 
- **GLIE (Greedy in the Limit with Infinite Exploration):**  A GLIE scheme ensures that every action is explored infinitely over time, guaranteeing that the agent eventually discovers the optimal policy. Such schemes balance exploration and exploitation over the long term, allowing for a comprehensive understanding of the environment. 
- **Implementation of Exploration Strategies:**  Practical exploration strategies may include random action selection with diminishing probability over time or prioritizing actions that have been less explored. Adjusting the utility calculation to favor less-explored actions encourages a more thorough exploration of the state space. 
- **Optimistic Initial Values:**  Starting with optimistic estimates of state utilities encourages exploration by initially biasing the agent towards unexplored states, under the assumption that unknown areas could offer high rewards. This optimism is gradually adjusted based on actual experiences, guiding the agent towards more beneficial areas of the state space. 
- **Exploration Functions:**  Exploration functions modulate the trade-off between the desire for high utility (exploitation) and the inclination to explore less-tried actions (exploration). By adjusting the utility estimates for state-action pairs based on the frequency of their selection, these functions ensure a balanced approach to learning the environment. 
- **Consequences of Effective Exploration:**  Properly implemented exploration strategies lead to quicker convergence to optimal or near-optimal policies, as the agent learns not only the immediate utility of actions but also acquires a broad understanding of the environment. This comprehensive learning approach ensures that less rewarding states are explored less frequently, optimizing the learning process towards the most valuable experiences.

Exploration in active reinforcement learning is crucial for overcoming the limitations of greedy decision-making and for ensuring that an agent can discover the best possible policy within a complex environment. By intelligently balancing the need to explore unknown options with the use of current knowledge to gain rewards, an agent can achieve optimal performance over time.
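
Two of the strategies listed above are short enough to sketch directly. The first is a simple exploration function that reports an optimistic value until an action has been tried a minimum number of times. The second is random action selection with probability 1/t at step t, one simple schedule whose exploration probability diminishes over time. The parameter names and default values (`optimistic_reward`, `min_visits`) are assumptions for this example.

```python
import random

def exploration_function(utility_estimate, visit_count, optimistic_reward=2.0, min_visits=5):
    """Return an optimistic value until the action has been tried min_visits times."""
    if visit_count < min_visits:
        return optimistic_reward
    return utility_estimate

def decaying_epsilon_greedy(action_values, step, rng=random):
    """Pick a random action with probability 1/step, otherwise the greedy one.

    action_values maps each action to its current value estimate; step counts from 1.
    """
    epsilon = 1.0 / step
    if rng.random() < epsilon:
        return rng.choice(list(action_values))
    return max(action_values, key=action_values.get)

# An action tried only twice still looks attractive; after five tries the estimate is used
print(exploration_function(0.3, visit_count=2))  # 2.0
print(exploration_function(0.3, visit_count=5))  # 0.3
print(decaying_epsilon_greedy({'up': 0.4, 'down': 0.1}, step=100))
```

The optimistic value should be at least as large as the best reward obtainable in any state, otherwise it does not actually make untried actions look attractive.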


<img src="https://raw.githubusercontent.com/ValRCS/RBS_PBM773_Introduction_to_AI/main/img/ch22_reinforcement_learning/DALL%C2%B7E%202024-03-23%2023.17.34%20-%20Depict%20a%20serene%20underwater%20scene%20featuring%20a%20baby%20sunfish%2C%20also%20known%20as%20a%20Mola%20Mola%2C%20swimming%20gracefully%20in%20the%20ocean.%20The%20baby%20sunfish%2C%20characterize.webp" width="500">

If you are a baby sunfish, your probability of surviving to adulthood is about 0.00000001

### **22.3.2 Safe Exploration**  
- **Real-world Constraints on Exploration:**  Unlike simulations or games where mistakes can be easily corrected, real-world exploration by agents (such as robots or self-driving cars) must consider the risk of irreversible actions or entering absorbing states with severe negative consequences. 
- **Challenges of Safe Exploration:**  Safe exploration is crucial when negative outcomes can lead to significant harm, such as physical damage in the case of a self-driving car, or even the termination of the agent's ability to act, as in the case of entering an absorbing state from which no recovery is possible. 
- **Bayesian Reinforcement Learning:**  This approach utilizes a probabilistic model over possible hypotheses of the environment, updating beliefs based on observed evidence. The optimal policy maximizes expected utility across all hypotheses, weighted by their probabilities. However, this method may not always prevent risky exploratory actions that could lead to dangerous outcomes. 
- **Exploration POMDPs:**  In scenarios where future learning is anticipated, the problem of choosing an optimal policy becomes more complex. An exploration POMDP (partially observable Markov decision process) formulates the problem to include the impact of future observations on the agent's model of the environment, though solving this POMDP is often impractical. 
- **Robust Control Theory:**  This method does not assign probabilities to different models but instead considers a set of plausible models. It seeks an optimal policy that performs best in the worst-case scenario across all considered models. While providing a safety net by preparing for adverse outcomes, it may lead to overly cautious behavior that restricts the agent's effectiveness. 
- **Leveraging Human Expertise:**  Incorporating human knowledge and experience can enhance safety in reinforcement learning. This could involve using recorded actions from experienced operators as initial policies or defining explicit constraints on the agent's actions to prevent entering dangerous states. 
- **Trade-offs in Safe Exploration:**  Navigating the balance between exploration and safety involves making trade-offs between the potential for learning and the risk of negative outcomes. Strategies like Bayesian reinforcement learning and robust control theory provide frameworks for managing these risks but often require careful consideration of their assumptions and limitations.

Safe exploration in reinforcement learning addresses the challenge of learning in environments where mistakes can have serious repercussions. By carefully balancing the need for exploration with the imperative of avoiding irreversible harm, reinforcement learning systems can be designed to navigate complex real-world environments more safely.
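
The contrast between the Bayesian and the robust-control criteria can be seen in a toy calculation. Everything here is hypothetical: two candidate policies, three plausible environment models, and made-up utilities and probabilities. It is a sketch of the two selection rules, not of how the utilities would be obtained.

```python
# Hypothetical utilities of two candidate policies under three plausible environment models
policy_utilities = {
    'cautious': [4.0, 4.0, 3.0],
    'bold':     [9.0, 8.0, -20.0],
}
model_probabilities = [0.6, 0.3, 0.1]  # Assumed posterior probabilities of the three models

def bayesian_choice(policy_utilities, probabilities):
    """Pick the policy with the highest probability-weighted expected utility."""
    return max(policy_utilities,
               key=lambda p: sum(w * u for w, u in zip(probabilities, policy_utilities[p])))

def robust_choice(policy_utilities):
    """Pick the policy whose worst-case utility across the models is highest."""
    return max(policy_utilities, key=lambda p: min(policy_utilities[p]))

print(bayesian_choice(policy_utilities, model_probabilities))  # bold
print(robust_choice(policy_utilities))                         # cautious
```

The Bayesian rule accepts a 10% chance of a very bad outcome because the average is still favorable, while the worst-case rule avoids it at the cost of giving up most of the upside.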

### **22.3.3 Temporal-Difference Q-learning**  
- **Overview of Active TD Learning:**  Active temporal-difference (TD) learning, like its adaptive dynamic programming (ADP) counterpart, must acquire a transition model for decision-making. However, TD's learning process and updates remain model-free, focusing on direct utility adjustments based on observed transitions. 
- **Introduction to Q-Learning:**  Q-learning is a model-free method that learns an action-utility function, Q(s,a), representing the expected total discounted reward for taking action a in state s and acting optimally afterward. This method enables optimal action selection without the need for a model or look-ahead, simplifying the decision-making process. 
- **TD Update for Q-Values:**  The Q-learning TD update rule adjusts Q-values based on the difference between the current estimate and the combination of immediate reward plus the discounted maximum future Q-value. This process does not require a model of the environment's transitions, making Q-learning applicable in complex or unknown domains. 
- **Model-Free Nature of Q-Learning:**  Unlike ADP, Q-learning operates without needing to learn or use a transition model, P(s′|s,a), focusing instead on learning from the outcomes of taken actions. This feature makes Q-learning particularly suitable for environments where building an accurate model is challenging or infeasible. 
- **Exploration in Q-Learning:**  Q-learning agents can incorporate exploration strategies to avoid local optima and discover more rewarding actions. The exploration function, similar to that used in ADP, guides the agent in balancing between exploring new actions and exploiting known rewards. 
- **SARSA vs. Q-Learning:**  SARSA, a close relative of Q-learning, updates Q-values using the action the agent actually takes in the next state rather than the action with the highest Q-value there. This distinction makes SARSA an on-policy learning algorithm: it evaluates the policy it is actually following, including the outcomes of exploratory actions. 
- **Off-Policy Learning in Q-Learning:**  Q-learning is an off-policy learner, focusing on the hypothetical value of actions if the agent were to follow the optimal policy from that point forward. This approach allows Q-learning to remain effective under various exploration strategies, making it flexible across different learning scenarios. 
- **Performance Comparison:**  Both Q-learning and SARSA can learn optimal policies, albeit more slowly than ADP methods. This slower convergence is attributed to the local nature of their updates, which may not immediately reflect the broader state-action space's dynamics.

Temporal-difference Q-learning offers a powerful, model-free approach for agents to learn optimal policies in environments where acquiring a comprehensive model is difficult. By focusing directly on the utility of actions and leveraging exploration strategies, Q-learning enables agents to navigate complex domains effectively, learning from their interactions to make increasingly informed decisions.
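
The difference between the Q-learning and SARSA updates comes down to one term in the TD target. The sketch below uses a plain dictionary keyed by `(state, action)`; the helper names and the numbers are assumptions for illustration.

```python
def q_learning_target(q, reward, next_state, actions, gamma):
    """Off-policy target: assume the best available action is taken in next_state."""
    return reward + gamma * max(q[(next_state, a)] for a in actions)

def sarsa_target(q, reward, next_state, next_action, gamma):
    """On-policy target: use the action the agent actually takes in next_state."""
    return reward + gamma * q[(next_state, next_action)]

def td_update(q, state, action, target, alpha):
    """Move Q(state, action) a fraction alpha of the way toward the target."""
    q[(state, action)] += alpha * (target - q[(state, action)])

actions = ['a1', 'a2']
q = {('s1', 'a1'): 0.0, ('s1', 'a2'): 0.0, ('s2', 'a1'): 1.0, ('s2', 'a2'): 3.0}

# Same experience for both: in s1 take a1, receive reward 1, land in s2,
# where the agent (exploring) then takes the lower-valued action a1
print(q_learning_target(q, 1.0, 's2', actions, gamma=0.5))  # 2.5
print(sarsa_target(q, 1.0, 's2', 'a1', gamma=0.5))          # 1.5

td_update(q, 's1', 'a1', q_learning_target(q, 1.0, 's2', actions, gamma=0.5), alpha=0.5)
print(q[('s1', 'a1')])  # 1.25
```

When the agent acts greedily the two targets coincide; they differ only on steps where the action taken in the next state is not the one with the highest Q-value.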

### Sample Q-learning Agent in Python

Implementing an exploratory Q-learning agent involves several key components: 
1. **Initialization** : Setting up the environment, actions, states, and initial Q-values. 
2. **Learning** : Updating Q-values based on the agent's experiences using the Q-learning update rule. 
3. **Exploration Strategy** : Incorporating an exploration mechanism (e.g., ε-greedy) to balance between exploration and exploitation. 
4. **Action Selection** : Choosing actions based on the current Q-values and the exploration strategy.

Below is a simple Python implementation of an exploratory Q-learning agent. This example assumes a discrete state and action space and utilizes an ε-greedy strategy for exploration.

In [2]:
import numpy as np

class QLearningAgent:
    def __init__(self, states, actions, alpha=0.1, gamma=0.99, epsilon=0.1):
        self.states = states  # List of states
        self.actions = actions  # List of actions
        self.alpha = alpha  # Learning rate
        self.gamma = gamma  # Discount factor
        self.epsilon = epsilon  # Exploration rate
        self.q_values = np.zeros((len(states), len(actions)))  # Initialize Q-values

    def choose_action(self, state):
        if np.random.rand() < self.epsilon:
            # Explore: choose a random action
            return np.random.choice(self.actions)
        else:
            # Exploit: choose the best action based on current Q-values
            state_index = self.states.index(state)
            return self.actions[np.argmax(self.q_values[state_index])]

    def update_q_values(self, state, action, reward, next_state):
        # Convert state and action to indices
        state_index = self.states.index(state)
        action_index = self.actions.index(action)
        next_state_index = self.states.index(next_state)

        # Q-learning update: move Q(s, a) toward reward + gamma * max over a' of Q(s', a')
        best_next_q = np.max(self.q_values[next_state_index])  # Value of the best next action
        td_target = reward + self.gamma * best_next_q
        td_delta = td_target - self.q_values[state_index, action_index]
        self.q_values[state_index, action_index] += self.alpha * td_delta

# Example usage
states = ['s1', 's2', 's3']  # Example states
actions = ['a1', 'a2']  # Example actions

agent = QLearningAgent(states, actions)

# Simulate a step taken in the environment
current_state = 's1'
action_taken = agent.choose_action(current_state)
# Assume the reward and next state from the environment after taking the action
reward = 1
next_state = 's2'

# Update Q-values based on the experience
agent.update_q_values(current_state, action_taken, reward, next_state)

# After many such updates, the Q-values would guide the agent to take optimal actions
print(agent.q_values)

[[0.1 0. ]
 [0.  0. ]
 [0.  0. ]]


## **22.4 Generalization in Reinforcement Learning**  
- **Limits of Tabular Representation:**  Tabular approaches to representing utility functions and Q-functions are viable for environments with up to approximately one million states. However, for real-world applications or complex games like backgammon, which can have vastly more states, tabular methods are insufficient due to the impracticality of visiting and learning from each state. 
- **Function Approximation for Utility and Q-functions:**  To address the limitations of tabular representations, function approximation is used to construct a compact, approximate representation of the true utility or Q-function. The simplest choice represents the utility or Q-function as a weighted linear combination of features of the state, so the agent learns a small number of weights rather than one value per state. 
- **Example of Function Approximation:**  A common form of function approximation is a linear combination of features, where each feature \( f_i(s) \) of a state \( s \) is weighted by a parameter \( \theta_i \), and the learning process involves adjusting these weights to best approximate the true utility function. This approach enables the learning of a small set of parameters instead of a utility value for each possible state. 
- **Generalization Across States:**  Function approximation allows for inductive generalization, enabling agents to infer the utility of unvisited states based on similarities to previously encountered states. This generalization is crucial for learning in environments with large or continuous state spaces where exhaustive exploration is infeasible. 
- **Integration with Look-ahead Search:**  Combining approximate utility functions with look-ahead search can enhance decision-making, allowing for effective behavior based on a simpler utility approximator. This combination can produce competent strategies with significantly fewer experiences by projecting the outcomes of possible actions from the current state. 
- **Real-world Application Success:**  The practical application of function approximation has been demonstrated in domains such as backgammon, where reinforcement learning algorithms have achieved human champion-level performance. This success was accomplished despite the algorithm exploring only a minuscule fraction of the game's total state space, showcasing the power of function approximation to enable learning and generalization in complex environments.

In summary, function approximation in reinforcement learning facilitates the representation and learning of utility or Q-functions in environments with vast state spaces, overcoming the scalability issues of tabular methods. By enabling agents to generalize from limited experiences to a broader range of situations, function approximation is a key technique for developing capable reinforcement learning systems in complex, real-world settings.

### 22.4.1 Approximating direct utility estimation

- Direct utility estimation in reinforcement learning creates trajectories in the state space to derive utilities based on observed rewards from a state until termination.
- Utilizes a simple linear function to approximate utilities, where features could be the coordinates (x, y) of a state, leading to an equation like \( \hat{U}_\theta(x,y) = \theta_0 + \theta_1x + \theta_2y \).
- Standard linear regression or online learning algorithms are used to adjust parameters (\(\theta\)) after each trial to minimize the squared error between predicted and actual total rewards.
- Implements the Widrow–Hoff rule, or delta rule, for online least-squares to adjust parameters in a way that decreases the error between observed and predicted utility values.
- The approach allows for generalization across states by updating parameters based on the gradient of the error, suggesting that function approximation enables reinforcement learners to generalize from their experiences.
- The effectiveness of direct utility estimation is enhanced with function approximation, especially when the hypothesis space includes functions that fit well with the true utility function.
- While improvements in small state spaces like the 4×3 world are modest, they are far larger in bigger spaces, such as a 10×10 world with a +1 reward at (10,10).
- Challenges arise when the true utility function is non-linear in the chosen features, for example when the +1 reward is placed at (5,5) so that utility peaks in the middle of the grid and a linear function of x and y cannot fit it; however, adding a nonlinear feature such as the distance to the goal can significantly enhance performance.
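
The Widrow–Hoff update described above can be sketched in a few lines. This is a minimal illustration, assuming the linear form \( \hat{U}_\theta(x,y) = \theta_0 + \theta_1x + \theta_2y \); the function names, the learning rate, and the training data are made up for the example and are not the textbook's values.

```python
def u_hat(theta, x, y):
    """Linear utility estimate: theta0 + theta1*x + theta2*y."""
    return theta[0] + theta[1] * x + theta[2] * y


def widrow_hoff_update(theta, x, y, observed_return, alpha=0.01):
    """One delta-rule step that moves u_hat(x, y) toward the observed reward-to-go."""
    error = observed_return - u_hat(theta, x, y)
    features = (1.0, x, y)  # partial derivatives of u_hat with respect to theta
    return [t + alpha * error * f for t, f in zip(theta, features)]


# Hypothetical data: noise-free reward-to-go samples from a made-up utility
# 0.4 + 0.1*x + 0.05*y on a 4x3 grid (illustrative numbers only).
samples = [(x, y, 0.4 + 0.1 * x + 0.05 * y) for x in range(1, 5) for y in range(1, 4)]

theta = [0.0, 0.0, 0.0]
for _ in range(3000):  # repeated passes over the same trials
    for x, y, g in samples:
        theta = widrow_hoff_update(theta, x, y, g)

print([round(t, 3) for t in theta])
```

Because every update changes the shared parameters, a sample observed in one state also shifts the estimate for every other state, which is where the generalization comes from.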

<img src="https://raw.githubusercontent.com/ValRCS/RBS_PBM773_Introduction_to_AI/main/img/ch22_reinforcement_learning/DALL%C2%B7E%202024-03-23%2022.35.55%20-%20Illustrate%20an%20intense%20scene%20where%20a%20robot-controlled%20car%20is%20swerving%20off%20the%20road.%20The%20car%2C%20a%20futuristic%2C%20sleek%20design%20with%20visible%20robotic%20elements%20l.webp" width="500">

Catastrophic forgetting is a problem in reinforcement learning where the agent forgets previously learned information when learning new information. This can be a significant issue in real-world applications where the agent must continually adapt to new environments or tasks. If a driving agent "learns too well" that the middle of the road is safest and forgets that the sides of the road are dangerous, the consequences can be catastrophic.


### **22.4.2 Approximating Temporal-Difference Learning**  
- **Adapting Function Approximation to TD Learning:**  Temporal-difference (TD) learning, like direct utility estimation, can benefit from function approximation to manage large or continuous state spaces. This involves adjusting parameters to minimize the temporal difference between successive states, applying modified versions of the TD and Q-learning update equations. 
- **TD and Q-Learning with Function Approximation:**  The updated rules for adjusting parameters in function approximation-based TD and Q-learning involve correcting the parameters based on the temporal difference error. These updates aim to bring the estimated utilities or Q-values closer to their true values by minimizing the error between predicted and actual outcomes. 
- **Convergence and Stability:**  For passive TD learning with linear function approximators, the update rules can converge to a function that closely approximates the true utility function. However, when using active learning and more complex approximators, like neural networks, the learning process can become unstable, with parameters potentially diverging. 
- **Catastrophic Forgetting:**  A significant challenge in applying function approximation to reinforcement learning is catastrophic forgetting, where an agent forgets previously learned values due to overfitting to recent experiences. This can lead to suboptimal or even dangerous behaviors, as learned knowledge about less frequently encountered states becomes outdated. 
- **Experience Replay:**  One strategy to mitigate catastrophic forgetting is experience replay, which involves periodically revisiting old experiences to reinforce the learned values across the state space. This technique helps maintain a more balanced and comprehensive understanding of the environment. 
- **Model-based Reinforcement Learning and Function Approximation:**  Function approximation is also beneficial for model-based reinforcement learning, where it aids in learning a model of the environment. By approximating the transition dynamics and rewards, the agent can perform look-ahead search and internal simulations to refine its policies without relying solely on real-world interactions. 
- **Learning in Partially Observable Environments:**  In environments where not all variables are observable, learning the model becomes more complex. Techniques like dynamic Bayesian networks and deep recurrent neural networks can be employed to infer the hidden structure and dynamics of the environment, enabling better decision-making under uncertainty.
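
As a rough sketch of the parameter updates mentioned above, the following assumes a linear approximator, so the gradient with respect to each parameter is just the corresponding feature value. The function names, feature vectors, and step sizes are illustrative assumptions.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def td_update(theta, features, reward, next_features, gamma=0.9, alpha=0.1):
    """TD step for a linear utility U(s) = theta . f(s).

    Pass next_features=None when s' is terminal (its utility is taken as 0).
    """
    u_next = 0.0 if next_features is None else dot(theta, next_features)
    delta = reward + gamma * u_next - dot(theta, features)  # temporal-difference error
    return [t + alpha * delta * f for t, f in zip(theta, features)]


def q_update(theta, phi_sa, reward, next_phis, gamma=0.9, alpha=0.1):
    """Same idea for Q(s, a) = theta . phi(s, a).

    next_phis lists phi(s', a') for every action a' available in s'
    (an empty list means s' is terminal).
    """
    best_next = max((dot(theta, p) for p in next_phis), default=0.0)
    delta = reward + gamma * best_next - dot(theta, phi_sa)
    return [t + alpha * delta * f for t, f in zip(theta, phi_sa)]


# One illustrative transition with made-up feature vectors.
theta = td_update([0.0, 0.0], [1.0, 2.0], reward=1.0, next_features=[1.0, 3.0])
print(theta)
```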

Function approximation in temporal-difference learning facilitates the extension of reinforcement learning techniques to environments with large or complex state spaces. By enabling generalization and mitigating issues like catastrophic forgetting, function approximation enhances the agent's ability to learn efficient policies in a variety of settings. However, challenges such as parameter divergence and the complexity of learning in partially observable environments underscore the need for careful implementation and ongoing research in this area.
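
Experience replay is often realized as a fixed-size buffer of past transitions from which training batches are drawn at random. The class below is one possible sketch; the class name, the transition layout, and the capacity are assumptions made for illustration.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of past transitions, sampled uniformly at random."""

    def __init__(self, capacity, seed=0):
        self.buffer = deque(maxlen=capacity)  # oldest transitions drop out first
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        """Return up to batch_size distinct stored transitions."""
        k = min(batch_size, len(self.buffer))
        return self.rng.sample(list(self.buffer), k)

    def __len__(self):
        return len(self.buffer)


# Store five made-up transitions in a buffer that only holds three.
buf = ReplayBuffer(capacity=3)
for i in range(5):
    buf.add(state=i, action=0, reward=0.0, next_state=i + 1, done=False)

batch = buf.sample(2)
print(len(buf), batch)
```

Training on such mixed batches, rather than only on the latest transition, keeps rarely visited parts of the state space in the training data.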

### **22.4.3 Deep Reinforcement Learning**  
- **Beyond Linear Function Approximators:**  Deep reinforcement learning (deep RL) moves beyond linear function approximators due to two main challenges: the absence of a linear function that can accurately approximate the utility or Q-function, and the difficulty in designing the necessary features, especially in new or complex domains. 
- **Need for Nonlinear Function Approximators:**  Traditional linear models may fall short in environments where the relationships between state variables and utilities are complex. Deep neural networks, with their capacity for modeling nonlinear relationships and automatically discovering relevant features from raw inputs like images, offer a potent solution. 
- **Deep Neural Networks in Reinforcement Learning:**  Deep RL employs deep neural networks as function approximators, leveraging their ability to self-identify useful features for making decisions. These networks, composed of multiple layers of interconnected neurons, can represent complex utility or Q-functions far beyond the capability of linear approximators. 
- **Parameterization and Optimization:**  In deep RL, the function approximator (the neural network) is parameterized by the weights and biases of the network. Adjusting these parameters to minimize prediction errors involves calculating gradients, a process facilitated by back-propagation, similar to supervised learning. 
- **Achievements of Deep RL:**  Deep RL has been responsible for breakthroughs across various applications, from mastering a broad spectrum of video games to outperforming human champions in complex games like Go, and advancing robot autonomy. These accomplishments highlight deep RL's ability to handle diverse and complex decision-making tasks. 
- **Challenges and Research Frontiers:**  Despite its achievements, deep RL faces hurdles in achieving consistent performance and predictability, particularly when trained models encounter environments slightly different from their training scenarios. These challenges, alongside the unpredictability of trained systems, limit deep RL's immediate commercial applicability but also mark it as a vibrant area of ongoing research.

Deep reinforcement learning represents a significant evolution in the field, enabling the development of intelligent systems capable of learning sophisticated behaviors in complex environments. While challenges remain, the continual advancement in deep RL promises further breakthroughs in creating autonomous systems with advanced decision-making capabilities.
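
To make the role of back-propagation concrete, here is a deliberately tiny value network written without any deep learning library. It is only a sketch: the architecture (one tanh hidden layer), the class name, the initialization range, and the step size are all assumptions, and real deep RL systems use far larger networks and automatic differentiation.

```python
import math
import random


class TinyValueNet:
    """One-hidden-layer network: U(s) = w2 . tanh(W1 s + b1) + b2."""

    def __init__(self, n_inputs, n_hidden, seed=0):
        rng = random.Random(seed)
        self.W1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs)]
                   for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.w2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
        self.b2 = 0.0

    def forward(self, s):
        """Return the value estimate and the hidden activations."""
        h = [math.tanh(sum(w * x for w, x in zip(row, s)) + b)
             for row, b in zip(self.W1, self.b1)]
        return sum(w * hj for w, hj in zip(self.w2, h)) + self.b2, h

    def value(self, s):
        return self.forward(s)[0]

    def update(self, s, target, alpha=0.05):
        """One gradient step on 0.5 * (target - U(s))^2 via back-propagation."""
        u, h = self.forward(s)
        err = target - u
        for j, hj in enumerate(h):
            # dU/d(pre-activation j), computed before w2[j] is changed
            grad_pre = self.w2[j] * (1.0 - hj * hj)
            self.w2[j] += alpha * err * hj
            self.b1[j] += alpha * err * grad_pre
            for i, x in enumerate(s):
                self.W1[j][i] += alpha * err * grad_pre * x
        self.b2 += alpha * err
        return err


# One TD-style step on a made-up transition (s, reward, s_next).
net = TinyValueNet(n_inputs=2, n_hidden=4, seed=0)
s, reward, s_next, gamma = [1.0, 0.0], 1.0, [0.0, 1.0], 0.9
target = reward + gamma * net.value(s_next)  # target held fixed during the step
error_before = target - net.value(s)
net.update(s, target)
error_after = target - net.value(s)
print(abs(error_before), abs(error_after))
```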

### **22.4.4 Reward Shaping**  
- **The Credit Assignment Problem:**  In many real-world scenarios, rewards are sparse, meaning significant sequences of actions might be required before encountering a meaningful reward. This sparsity presents the credit assignment problem, where it becomes challenging to determine which actions contributed to receiving the reward, especially in complex tasks like robot soccer. 
- **Introduction to Reward Shaping:**  Reward shaping addresses the credit assignment problem by providing additional, intermediate rewards (pseudorewards) to guide the agent's learning process. These pseudorewards reward "progress" towards the goal, facilitating faster learning by offering more immediate feedback than the sparse rewards from the environment. 
- **Benefits and Risks of Reward Shaping:**  While reward shaping can significantly accelerate learning by breaking down the path to success into more attainable milestones, it carries the risk of the agent optimizing for these pseudorewards at the expense of the actual goal. An agent might learn to exploit these additional rewards in a way that does not align with the overall objective. 
- **Modifying Reward Functions:**  A mathematical approach to reward shaping involves adjusting the reward function to incorporate a potential function \( \Phi(s) \), which captures desirable state properties or subgoals. The modified reward function, \( R'(s,a,s') = R(s,a,s') + \gamma\Phi(s') - \Phi(s) \), aims to maintain the same optimal policy while encouraging behaviors that are deemed beneficial towards achieving the final goal. 
- **Potential Functions in Reward Shaping:**  The potential function \( \Phi \) can be designed to reward the agent for achieving subgoals or making progress towards the final objective. For instance, in soccer, a potential function might reward possession of the ball or advancement towards the opponent's goal, encouraging strategic play that supports the team's success without detracting from the ultimate aim of winning the game.

Reward shaping provides a practical tool for tackling the credit assignment problem in reinforcement learning, especially in environments with sparse rewards. By carefully designing pseudorewards and potential functions, reward shaping can enhance learning efficiency and guide agents towards effective strategies. However, it requires thoughtful implementation to avoid inadvertently incentivizing suboptimal or undesired behaviors.

### 22.4.5 Hierarchical reinforcement learning

**Hierarchical Reinforcement Learning (HRL)**

- Breaks down long action sequences into smaller, manageable tasks.
- Similar to HTN planning, with tasks such as scoring a goal in soccer divided into obtaining possession, passing, dribbling, and shooting.
- Utilizes a "keepaway" game within the RoboCup 2D simulator to illustrate HRL concepts, focusing on possession and team strategies.

**Partial Program**

- HRL begins with a partial program that outlines the agent's hierarchical behavior structure.
- Extends programming languages with primitives for unspecified choices to be learned.
- Includes a simple example of a continuous choice-making loop based on the current state and available actions.

**Choice and Learning Process**

- The learning process involves associating a Q-function with each choice, optimizing behavior by selecting actions with the highest Q-value.
- Detailed example with a "keeper" team player in "keepaway", deciding actions based on ball possession.

**Joint State Space and Choice State**

- Introduces the concept of joint state space, combining physical state (s) and machine state (m), where the machine state includes internal states of the agent program.
- A choice state is identified at points where the agent program makes a choice, leading to a Markovian decision problem based on choice states, available actions, rewards, and transitions.

**Effectiveness and Applications**

- Demonstrates HRL's effectiveness with the "keepaway" example, learning strategies that outperform standard policies significantly.
- Emphasizes that lower-level skills adapt based on the overall internal state of the agent, allowing for behavior customization according to context.

**Shaping Rewards and Additive Decomposition**

- Shaping rewards in HRL offers opportunities for fine-tuning learning objectives based on the joint state space, allowing for internal constructs like "passing" and "recipient" which are absent in the physical state.
- Explains how the hierarchical structure facilitates a natural additive decomposition of the utility function, enhancing learning efficiency by focusing on variables relevant to the current higher-level task.
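
The idea of a partial program with learned choice points can be sketched as follows. This is a loose, simplified rendering: the option names, the coarse physical-state label, and the plain Q-learning update between choice states are assumptions for illustration, and the update ignores how much time elapses between choices.

```python
import random
from collections import defaultdict


class ChoiceLearner:
    """Q-values attached to choice points, keyed by joint state (m, s) and option."""

    def __init__(self, alpha=0.1, gamma=0.9, epsilon=0.0, seed=0):
        self.q = defaultdict(float)
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon
        self.rng = random.Random(seed)

    def choose(self, m, s, options):
        """The choice primitive: pick an option at choice state (m, s)."""
        if self.rng.random() < self.epsilon:
            return self.rng.choice(options)  # occasional exploration
        return max(options, key=lambda o: self.q[(m, s, o)])

    def update(self, m, s, option, reward, next_m, next_s, next_options):
        """Q-learning step between consecutive choice states.

        `reward` is the reward accumulated between the two choices.
        """
        best_next = max((self.q[(next_m, next_s, o)] for o in next_options),
                        default=0.0)
        key = (m, s, option)
        self.q[key] += self.alpha * (reward + self.gamma * best_next - self.q[key])


WITH_BALL = ["pass", "hold", "dribble"]
WITHOUT_BALL = ["stay", "move", "intercept"]


def keeper_choice(learner, has_ball, s):
    """Partial program for one keeper: structure is fixed, choices are learned."""
    if has_ball:
        return "with_ball", learner.choose("with_ball", s, WITH_BALL)
    return "without_ball", learner.choose("without_ball", s, WITHOUT_BALL)


learner = ChoiceLearner()
s = "taker_near"  # hypothetical coarse label for the physical state
first = keeper_choice(learner, True, s)    # all Q-values are 0: ties go to the first option
learner.update("with_ball", s, "hold", 1.0, "with_ball", s, WITH_BALL)
second = keeper_choice(learner, True, s)   # "hold" now has the highest Q-value
print(first, second)
```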

## **22.5 Policy Search**  
- **Concept of Policy Search:**  Policy search simplifies reinforcement learning by directly adjusting the policy—defined as a function mapping states to actions—based on its performance. Unlike methods that focus on learning the utility or Q-function, policy search aims to improve the policy iteratively until no further improvements are observed. 
- **Parameterized Policies:**  Policies in policy search are often represented in a parameterized form, significantly reducing the number of parameters compared to the number of states. This representation can include linear functions or nonlinear models like deep neural networks, where the action with the highest predicted value is chosen at each state. 
- **Differentiation from Q-learning:**  Policy search is distinct from Q-learning, even when Q-functions are used for policy representation. The objective is to find parameter values that yield effective performance, which may not necessarily align with achieving closeness to the optimal Q-function. 
- **Challenges with Discrete Actions:**  Discontinuities in policy functions, especially with discrete actions, can complicate gradient-based search methods due to abrupt changes in chosen actions with minor parameter adjustments. Stochastic policy representations, like softmax functions, help mitigate this by providing differentiable and continuous action selection probabilities. 
- **Stochastic Policy Representation:**  Stochastic policies express the likelihood of selecting each action in a given state, offering a smoother landscape for applying gradient-based optimization techniques and enhancing the policy's adaptability to varying states. 
- **Improving Policies:**  In deterministic settings, policy value gradients can guide parameter adjustments. However, nondeterministic environments introduce variability in reward outcomes, necessitating numerous trials to reliably assess policy improvements. Techniques like correlated sampling and algorithms such as REINFORCE offer strategies for estimating policy gradients and comparing policies under variable conditions with reduced trial counts. 
- **Correlated Sampling and PEGASUS:**  Correlated sampling, as implemented in the PEGASUS algorithm, demonstrates a practical approach to evaluating and comparing policies by eliminating variability in experimental conditions, thereby facilitating stable and effective policy improvement.
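
A softmax policy and a REINFORCE-style update can be illustrated on the smallest possible problem. The sketch below assumes a single state with two actions and one preference parameter per action; the payoff structure, step size, and function names are made up for the example.

```python
import math
import random


def softmax(prefs):
    """Numerically stable softmax: action probabilities from preferences."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def grad_log_pi(prefs, action):
    """Gradient of log pi(action) w.r.t. the preferences: one-hot(action) - pi."""
    probs = softmax(prefs)
    return [(1.0 if a == action else 0.0) - p for a, p in enumerate(probs)]


def reinforce_step(prefs, action, ret, alpha=0.1):
    """Move the parameters along ret * grad log pi(action)."""
    return [p + alpha * ret * g for p, g in zip(prefs, grad_log_pi(prefs, action))]


# Hypothetical one-state problem: action 1 always pays 1, action 0 pays 0.
rng = random.Random(0)
prefs = [0.0, 0.0]
for _ in range(2000):
    action = rng.choices([0, 1], weights=softmax(prefs))[0]
    ret = 1.0 if action == 1 else 0.0
    prefs = reinforce_step(prefs, action, ret)

print(softmax(prefs))
```

Because the action probabilities vary smoothly with the parameters, a small parameter change produces a small change in behavior, which is what makes the gradient usable.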

Policy search represents a direct and intuitive approach to reinforcement learning, focusing on optimizing the policy itself rather than the underlying utility or Q-values. By leveraging parameterized representations and stochastic policies, policy search navigates the complexities of reinforcement learning environments, employing advanced techniques to address the challenges of action selection and policy evaluation in the face of uncertainty and variability.
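
Correlated sampling can be sketched by fixing the random events in advance and replaying the same events for each candidate policy. The toy task below (a one-parameter controller that tries to keep a randomly perturbed state near zero) is an assumption made for illustration, written in the spirit of PEGASUS rather than as an implementation of it.

```python
import random
import statistics


def rollout(gain, rng, horizon=20):
    """Hypothetical noisy task: keep a randomly perturbed state x near zero.

    The policy is a single parameter `gain`; the return is minus the sum of x^2.
    """
    x, total = 0.0, 0.0
    for _ in range(horizon):
        x = x - gain * x + rng.gauss(0.0, 1.0)
        total -= x * x
    return total


def evaluate(gain, seeds):
    """Run one episode per seed, so the random events are fixed in advance."""
    return [rollout(gain, random.Random(seed)) for seed in seeds]


seeds = range(200)
a = evaluate(0.5, seeds)
b_same = evaluate(0.6, seeds)                          # same random events as `a`
b_other = evaluate(0.6, [s + 10_000 for s in seeds])   # independent random events

paired = [y - x for x, y in zip(a, b_same)]
unpaired = [y - x for x, y in zip(a, b_other)]
print(statistics.mean(paired), statistics.pvariance(paired))
print(statistics.mean(unpaired), statistics.pvariance(unpaired))
```

With shared seeds, most of the luck cancels out in the per-episode differences, so far fewer trials are needed to tell which policy is better.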


## **22.6 Apprenticeship and Inverse Reinforcement Learning**  
- **Challenges in Defining Rewards:**  Defining a comprehensive reward function for complex tasks, such as driving, is challenging due to the multitude of factors involved, including safety, legality, comfort, and efficiency. Omitting any aspect can lead to undesirable or extreme behavior. 
- **Apprenticeship Learning:**  Apprenticeship learning involves learning desirable behaviors by observing experts. This approach aims to infer the underlying principles or reward structures that guide expert actions, bypassing the need for explicitly defined reward functions. 
- **Imitation Learning:**  One form of apprenticeship learning, imitation learning, directly mimics observed state-action pairs through supervised learning. However, this approach can be brittle and limited to replicating the observed performance without understanding the underlying decisions. 
- **Inverse Reinforcement Learning (IRL):**  IRL seeks to deduce the reward function that an expert seems to be optimizing based on their actions. By understanding the rewards driving expert behavior, IRL aims to derive robust policies that can potentially surpass the expert's performance. 
- **Determining the Expert's Reward Function:**  IRL involves identifying a reward function under which the observed expert behavior appears rational and optimal. This process is challenged by the existence of multiple reward functions that can explain the same observed behavior, including trivial or uninformative ones. 
- **Feature Matching in IRL:**  A practical approach within IRL, feature matching, approximates the expert's reward function as a linear combination of known features. The goal is to adjust the parameters such that the induced policy's feature expectations align with those observed from the expert, facilitating learning from a minimal set of demonstrations. 
- **Robots Learning from Humans:**  Robots can utilize IRL to learn complex tasks by observing human experts, gaining insights into effective policies and even the strategies of other agents within a multiagent setting. This approach extends beyond robotics to understanding biological behaviors and decision-making processes. 
- **Assumptions and Limitations:**  A fundamental assumption of IRL is the near-optimality of the expert's behavior within a single-agent framework. This assumption may not hold if the expert is aware of being observed and alters their behavior to facilitate learning, indicating the need for a more nuanced understanding of such interactions as assistance games.

Apprenticeship and inverse reinforcement learning represent advanced methodologies for deriving effective behaviors and decision-making strategies in complex domains. By focusing on understanding and replicating expert performance, these approaches offer pathways to robust and potentially superior policy formulation without the need for explicitly defined reward functions.
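
The quantity that feature matching compares can be computed directly from demonstrations. The sketch below assumes a reward that is linear in the features, \( R(s) = \theta \cdot f(s) \); the driving features, the trajectories, and the weights are invented for illustration.

```python
def feature_expectations(trajectories, feature_fn, gamma):
    """Average discounted feature counts: mu = E[ sum_t gamma^t f(s_t) ]."""
    n = len(feature_fn(trajectories[0][0]))
    mu = [0.0] * n
    for trajectory in trajectories:
        for t, state in enumerate(trajectory):
            for i, f in enumerate(feature_fn(state)):
                mu[i] += (gamma ** t) * f
    return [m / len(trajectories) for m in mu]


def policy_value(theta, mu):
    """With R(s) = theta . f(s), the expected discounted return is theta . mu."""
    return sum(t * m for t, m in zip(theta, mu))


def features(state):
    """Hypothetical driving state: (speed in m/s, metres from lane centre, collided)."""
    speed, offset, collided = state
    return [speed / 30.0, abs(offset), float(collided)]


expert = [[(30, 0.0, 0), (30, 0.5, 0)], [(15, 0.0, 0), (30, 0.0, 0)]]
learner = [[(30, 1.0, 0), (30, 2.0, 1)], [(30, 1.0, 0), (30, 1.0, 0)]]

mu_expert = feature_expectations(expert, features, gamma=0.5)
mu_learner = feature_expectations(learner, features, gamma=0.5)
gap = [e - l for e, l in zip(mu_expert, mu_learner)]  # feature matching drives this to zero
print(mu_expert, mu_learner, gap)
```

If the learner's feature expectations match the expert's, the two policies have the same value under every reward that is linear in these features, whatever the true weights are.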

## 22.7 Applications of Reinforcement Learning
We now turn to applications of reinforcement learning. These include game playing, where
the transition model is known and the goal is to learn the utility function, and robotics, where
the model is initially unknown.

<img src="https://raw.githubusercontent.com/ValRCS/RBS_PBM773_Introduction_to_AI/main/img/ch22_reinforcement_learning/DALL%C2%B7E%202024-03-23%2022.31.49%20-%20Visualize%20a%20human%20teacher%20sitting%20at%20a%20table%2C%20intensely%20focused%20on%20a%20game%20of%20backgammon%20against%20a%20computer%20neural%20network.%20The%20computer%20is%20represented.webp" width="400">

Gerry Tesauro's backgammon program, TD-Gammon, was one of the first successful applications of reinforcement learning in game playing.

### **22.7.1 Applications in Game Playing**  
- **Early Developments:**  Arthur Samuel's pioneering work on checkers in 1952 marked one of the first instances of applying reinforcement learning (RL) to game playing, laying the groundwork for future developments in the field. 
- **Advancements with NEUROGAMMON:**  Gerry Tesauro's NEUROGAMMON, developed in the late 1980s, represented a novel approach to learning in backgammon through a form of imitation learning. By analyzing games played against himself, Tesauro trained a neural network to evaluate game positions, leading to a system that won the 1989 Computer Olympiad. 
- **Breakthrough with TD-GAMMON:**  Building on Sutton's temporal-difference learning method, Tesauro created TD-GAMMON. This system used a neural network with a single hidden layer to learn the game's evaluation function, achieving a level of play comparable to the world's top human players. 
- **Deep Q-Network (DQN) by DeepMind:**  In a significant leap towards applying RL to raw perceptual inputs, DeepMind's DQN learned to play various Atari video games directly from image data, achieving human expert-level performance across many games. This was a groundbreaking step in demonstrating the potential of deep RL in learning complex tasks from high-dimensional sensory inputs. 
- **Challenges with Sparse Rewards:**  Despite its successes, DQN struggled with games requiring long-term planning and where rewards were sparse, such as Montezuma's Revenge. This highlighted the limitations of RL systems in environments where exploratory behavior and extended strategy formulation were essential. 
- **Overcoming Limitations:**  Further advancements in deep RL led to systems capable of engaging in more sophisticated exploratory behaviors, addressing challenges presented by games with sparse rewards and complex strategy requirements. 
- **ALPHAGO's Historic Victory:**  DeepMind's ALPHAGO, utilizing both deep reinforcement learning and advanced search techniques, achieved a monumental victory over the world's best human Go players. By learning effective value and Q-functions, ALPHAGO demonstrated the capability of deep RL to tackle games of profound complexity and strategic depth.

The evolution of RL applications in game playing, from early experiments with checkers to mastering Go with ALPHAGO, showcases the significant progress and expanding capabilities of reinforcement learning techniques. These advancements not only highlight RL's potential in understanding and navigating complex strategic environments but also underscore ongoing challenges and the necessity for innovative approaches to learning and decision-making.

<img src="https://raw.githubusercontent.com/ValRCS/RBS_PBM773_Introduction_to_AI/main/img/ch22_reinforcement_learning/DALL%C2%B7E%202024-03-23%2022.30.43%20-%20Illustrate%20a%20dynamic%20scene%20showcasing%20the%20problem%20of%20balancing%20a%20long%20pole%20on%20top%20of%20a%20moving%20cart.%20The%20cart%20is%20on%20a%20flat%20surface%2C%20moving%20left%20to%20righ.webp" width="400">

The inverted pendulum problem is a classic example in control theory and reinforcement learning, where the goal is to balance a pole on a moving cart.


###  **22.7.2 Application to Robot Control**  
- **Inverted Pendulum Challenge:**  The cart-pole balancing problem, or the inverted pendulum, is a classic control problem where the goal is to keep a pole balanced on a moving cart by applying discrete left or right forces. This problem is a staple in reinforcement learning (RL) and control theory research, notable for its continuous state variables (the cart's position, the pole's angle, and their velocities) and discrete action space. 
- **Early Experiments:**  The first significant experiment on this problem was conducted by Michie and Chambers in 1968 with a physical setup. Their BOXES algorithm, which discretized the state space into a finite number of "boxes," demonstrated the capability to balance the pole for an extended period after a relatively short training phase, highlighting the potential of RL in handling control problems. 
- **Advancements in Function Approximation:**  Enhancements in generalization and learning speed for the cart-pole problem have been achieved through adaptive state space partitioning and the use of continuous-state, nonlinear function approximators, such as neural networks. These advancements have enabled the balancing of more complex systems, like a triple inverted pendulum, showcasing RL's ability to solve highly challenging control tasks. 
- **Helicopter Flight Control:**  Reinforcement learning has been applied to the control of radio-controlled helicopters, a task that involves navigating large Markov decision processes (MDPs) and often incorporates elements of imitation learning and inverse reinforcement learning from human experts. This work has pushed the boundaries of autonomous control in complex, dynamic environments. 
- **Interpreting Human Behavior with Inverse RL:**  Inverse RL has been used to understand and predict human behaviors in various contexts, including taxi driver route selection from GPS data and pedestrian movement patterns from video observations. These applications demonstrate inverse RL's utility in deriving meaningful insights from observed actions. 
- **Robotics Applications:**  Robotics represents a significant area of application for both reinforcement learning and inverse RL. For instance, the LittleDog quadruped robot learned to navigate challenging terrains from a single expert demonstration, highlighting the effectiveness of RL techniques in enabling robots to perform complex physical tasks in uncertain environments.

The application of reinforcement learning to robot control illustrates the field's broad utility in solving practical problems that involve complex decision-making under uncertainty. From basic challenges like the inverted pendulum to advanced tasks like autonomous helicopter flight and behavior prediction, RL techniques continue to drive innovation in robot control and behavioral modeling.
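
The BOXES idea of discretizing a continuous state can be sketched as a mapping from a cart-pole state to a single box index, which can then index an ordinary table of values. The thresholds below are hypothetical and are not the ones used by Michie and Chambers.

```python
from bisect import bisect_right

# Hypothetical bin edges for (x, x_dot, theta, theta_dot); units are metres,
# metres per second, radians, and radians per second.
EDGES = [
    [-0.8, 0.8],                           # cart position: 3 bins
    [-0.5, 0.5],                           # cart velocity: 3 bins
    [-0.105, -0.017, 0.0, 0.017, 0.105],   # pole angle: 6 bins
    [-0.87, 0.87],                         # pole angular velocity: 3 bins
]

NUM_BOXES = 1
for edges in EDGES:
    NUM_BOXES *= len(edges) + 1


def box_index(state):
    """Map a continuous state (x, x_dot, theta, theta_dot) to one box number."""
    index = 0
    for value, edges in zip(state, EDGES):
        index = index * (len(edges) + 1) + bisect_right(edges, value)
    return index


# A table with one row of action values per box (left push, right push).
q_table = [[0.0, 0.0] for _ in range(NUM_BOXES)]

upright = box_index((0.0, 0.0, 0.0, 0.0))
nearly_upright = box_index((0.1, 0.0, 0.001, 0.0))
falling = box_index((0.1, 0.0, 0.2, 1.0))
print(NUM_BOXES, upright, nearly_upright, falling)
```

Nearby states fall into the same box and therefore share a table entry, which is a crude form of generalization; the finer function approximators mentioned above replace the fixed boxes with a learned, continuous mapping.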

## Chapter 22 Summary

**Summary of Reinforcement Learning** 

This chapter delves into how agents can learn to act effectively in unknown environments through reinforcement learning (RL), a paradigm critical for developing intelligent systems. Key takeaways include: 
- **Agent Design and Information Learning:**  Depending on whether an agent is model-based or model-free, it will focus on acquiring different types of information. Model-based agents learn a transition model and utility function, whereas model-free agents may learn an action-utility function (Q-function) or a policy directly. 
- **Learning Utilities:**  Multiple methods are available for learning utilities, including direct utility estimation, adaptive dynamic programming (ADP), and temporal-difference (TD) learning. Each has its strengths, with ADP and TD methods particularly noted for their ability to adjust utility estimates based on successor states. 
- **Learning Q-Functions:**  Q-functions, which are crucial for determining the utility of actions in specific states, can be learned without a model of the environment, simplifying the process but also posing challenges in complex scenarios. 
- **Exploration vs. Exploitation:**  Active learning agents must balance the value of actions against the potential for gaining new information, navigating the risk of premature failure while seeking to learn effectively. 
- **Approximate Function Representation:**  In large or complex state spaces, RL algorithms use functional approximations (e.g., deep neural networks) for generalization. Deep reinforcement learning, in particular, has shown significant success across challenging domains. 
- **Enhancing Learning:**  Techniques like reward shaping and hierarchical reinforcement learning assist in learning complex behaviors, especially when direct rewards are infrequent. 
- **Policy Search:**  This method directly improves policy performance through observed outcomes, with challenges arising in stochastic domains that can be mitigated by techniques like correlated sampling. 
- **Apprenticeship and Inverse Reinforcement Learning:**  Learning from expert behavior, either through imitation or by inferring the underlying reward function, offers strategies for acquiring effective policies when defining explicit rewards is difficult.

Reinforcement learning stands out as a dynamic field of AI, pushing the boundaries of how agents can learn from interactions within their environment without explicit instruction. The choice between model-based and model-free methods, the balance of exploration and exploitation, and the use of function approximation for handling complex state spaces are central themes. As environments grow in complexity, the benefits of model-based approaches, which encapsulate some knowledge of the environment, become increasingly evident. This exploration emphasizes the diverse strategies available in RL for tackling a wide range of problems, from game playing to robot control, highlighting both the achievements and ongoing challenges in the field.

<img src="https://raw.githubusercontent.com/ValRCS/RBS_PBM773_Introduction_to_AI/main/img/ch22_reinforcement_learning/DALL%C2%B7E%202024-03-23%2022.37.04%20-%20Illustrate%20a%20historical%20scene%20featuring%20Nobel%20Prize-winning%20scientist%20Ivan%20Pavlov%20conducting%20his%20famous%20experiments%20on%20classical%20conditioning%20with%20dog.webp" width="400">

Ivan Pavlov's experiments with dogs were among the first to demonstrate the principles of classical conditioning, a form of learning from experience that forms part of the psychological background of reinforcement learning. More on Ivan Pavlov and his dog experiments can be found in the [Wikipedia article](https://en.wikipedia.org/wiki/Classical_conditioning).

## Historical and Bibliographical Notes

### **Early Foundations:** 
- Ivan Pavlov, Nobel Prize in 1904 for work on conditioned reflexes.
- Edward Thorndike, "Animal Intelligence" (1911), laid foundational ideas for reinforcement learning (RL). 
### **Key Contributions in Computing:** 
- Alan Turing (1948, 1950) proposed RL as a method for teaching computers.
- Arthur Samuel’s checkers program (1959, 1967) was an early machine learning success, incorporating many modern RL ideas. 
### **Significant Developments:** 
- Researchers in adaptive control theory (Widrow and Hoff, 1960), building on work by Hebb (1949), trained simple networks using the delta rule, an early precursor of function approximation in RL.
- Connection between RL and Markov decision processes highlighted by Werbos (1977) and Ian Witten (1977).
- University of Massachusetts in the early 1980s (Barto et al., 1981) played a pivotal role in the development of RL.
- Rich Sutton (1988) provided mathematical insights into temporal-difference methods.
- DYNA architecture by Sutton (1990), Q-learning by Chris Watkins (1989), and SARSA by Rummery and Niranjan (1994).
- Prioritized sweeping introduced by Moore and Atkeson (1993) and Peng and Williams (1993). 
### **Function Approximation in RL:** 
- Early use by Arthur Samuel (1959), with neural networks becoming popular in the 1980s.
- Gerry Tesauro’s TD-Gammon program (1992, 1995) showcased the power of neural networks in RL.
- Deep RL has emerged as a dominant approach with systems like DQN (Mnih et al., 2015) and ALPHAZERO (Silver et al., 2018). 
### **Exploration and Policy Search:** 
- Exploration methods discussed by Barto et al. (1995), Kearns and Singh (1998), and Bayesian reinforcement learning (Dearden et al., 1998, 1999).
- REINFORCE algorithm by Williams (1992) and policy search developments by Marbach and Tsitsiklis (1998) and others. 
### **Apprenticeship and Inverse RL:** 
- Behavioral cloning in AI and fragile learning policies noted by Sammut et al. (1992) and Camacho and Michie (1995).
- Inverse RL introduced by Russell (1998), with algorithms developed by Ng and Russell (2000). 
### **Hierarchical RL:** 
- Initial attempts at state abstraction and temporal abstraction developed in the late 1990s (Parr and Russell, 1998; Andre and Russell, 2002; Sutton et al., 2000).
- Keepaway game and HRL solution by Bai and Russell (2017). 
### **Safe RL and Open-source Platforms:** 
- Safe RL surveyed by García and Fernández (2015) and algorithms for safe exploration by Munos et al. (2017).
- Open-source environments like the Arcade Learning Environment (ALE), DeepMind Lab, and OpenAI Gym have facilitated RL research. 
### **Literature and Conferences:** 
- Influential texts include works by Sutton and Barto (2018), Kochenderfer (2015), Szepesvari (2010), and Bertsekas and Tsitsiklis (1996).
- Key journals and conferences for RL research: Machine Learning, Journal of Machine Learning Research, ICML, and NeurIPS.

## Learning More

### **Books:**
- "Reinforcement Learning: An Introduction" by Sutton and Barto (2018) provides a comprehensive overview of RL concepts and algorithms.
- "Algorithms for Reinforcement Learning" by Szepesvari (2010) offers a detailed exploration of RL algorithms and their theoretical foundations.

### **Courses:**

- David Silver's RL course (UCL), available under an open license: https://www.davidsilver.uk/teaching/
- Sergey Levine's Deep RL course (UC Berkeley), which provides in-depth instruction on RL concepts and applications: https://rail.eecs.berkeley.edu/deeprlcourse/