# Reinforcement Learning introduction

## What is Reinforcement Learning?

- **Reinforcement Learning (RL)** is a key pillar of machine learning, though it's not as widely applied commercially as supervised learning. It focuses on learning to make decisions by interacting with the environment.
  
- **Example**: Autonomous helicopter control.
  - The **Stanford Autonomous Helicopter** was used to demonstrate RL. This helicopter can perform complex maneuvers, such as flying upside down, guided by RL algorithms.
  
- **Key Concepts**:
  - **State (s)**: The current condition or status of the system, e.g., the helicopter's position, speed, and orientation.
  - **Action (a)**: The decision or control that changes the state, e.g., moving the helicopter's control sticks.
  - **Reward**: A feedback signal indicating how well the system is performing. A positive reward encourages good behavior (like staying balanced), while a negative reward penalizes poor performance (like crashing).
  
- **Supervised Learning Limitation**: In RL scenarios, it is hard to get a labeled dataset (input-output pairs), making supervised learning impractical for tasks like autonomous flying.
  
- **Why RL?**:
  - RL doesn't require exact instructions for every state-action pair. Instead, it needs a **reward function** that defines success or failure. The system learns by trial and error, similar to training a dog (good dog = positive reward, bad dog = negative reward).

- **Application Examples**:
  - RL has been successfully applied to:
    - **Robotics** (e.g., controlling helicopters, robotic dogs).
    - **Factory optimization** (e.g., maximizing efficiency).
    - **Stock trading** (e.g., optimizing trade execution to minimize price impact).
    - **Gaming** (e.g., playing chess, Go, video games).
  
- **Key Takeaway**: RL shifts the focus from telling the system exactly how to perform a task (as in supervised learning) to defining a reward system that guides the algorithm toward desirable actions. 


## Mars rover example

This simplified example provides a great way to explain the formalism of reinforcement learning. Here's a breakdown of the key concepts:

1. **State (S):** The position or situation the agent (in this case, the Mars rover) is in at a given time. In the example, there are six states, representing different positions of the rover.

2. **Action (A):** At each step, the rover can take an action, either moving left or right, which will transition it to a new state.

3. **Reward (R):** Each state has a corresponding reward, which signals how good or bad the state is. In this example, states 1 and 6 have rewards of 100 and 40 respectively, while states 2, 3, 4, and 5 have rewards of zero.

4. **Terminal State:** States where the episode ends. Once the rover reaches either state 1 or 6, the day is over, and no more rewards are collected.

5. **Policy:** The strategy the agent uses to decide which actions to take from each state. The goal is to maximize cumulative rewards, so the rover should learn to head for state 1 for the highest reward, without wasting steps.

6. **State Transition:** The process of moving from one state to another as a result of taking an action, e.g., moving left from state 4 leads to state 3.

By exploring the possible actions and rewards, the rover can learn an optimal policy, aiming to achieve the highest possible reward while minimizing wasted time. In reinforcement learning algorithms, the goal is to maximize the total **return**, which will be discussed in the next step. This return reflects the sum of rewards collected over time, providing a guide for long-term decision-making.

## The Return in reinforcement learning

This lecture explains the concept of **return** in reinforcement learning (RL), which is used to compare different sets of rewards over time. The return is the sum of all rewards, weighted by a **discount factor (γ)**, which reduces the value of rewards received in the future. A higher discount factor means rewards received later are almost as valuable as immediate rewards, while a lower discount factor makes the agent "impatient" by devaluing future rewards more heavily.

### Key Concepts:
- **Discount Factor (γ):** A number less than 1, used to weigh future rewards. Common values are 0.9 or 0.99. In financial terms, it's akin to the time value of money, where future dollars are worth less.
- **Return Formula:**  
  $$G = R_1 + \gamma R_2 + \gamma^2 R_3 + \gamma^3 R_4 + \dots$$
  This formula sums the rewards, discounting each reward based on how many time steps away it is.

### Example:
In the Mars Rover example:
- If the robot starts at **state 4** and moves left, it passes through states 3 and 2 and collects a final reward of 100 at **state 1**. With a discount factor γ = 0.5, the return is calculated as:
  $$G = 0 + 0.5 \times 0 + 0.5^2 \times 0 + 0.5^3 \times 100 = 12.5$$
- Moving right instead ends with a reward of 40 at **state 6**, giving a return of $0 + 0.5 \times 0 + 0.5^2 \times 40 = 10$ from state 4.

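The return depends on both the actions taken and the discount factor. For instance, with γ = 0.5, always moving left yields a return of only 6.25 from state 5 ($0.5^4 \times 100$), whereas moving right from state 5 yields 20 ($0.5 \times 40$). A policy that moves left from states 2, 3, and 4 but right from state 5 therefore does at least as well from every state as always moving left or always moving right.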
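
As a quick check of the arithmetic above, here is a minimal sketch that computes a discounted return from a list of rewards. The function name `discounted_return` and the reward lists are illustrative assumptions; the first entry of each list is the reward of the starting state, which is not discounted.

```python
def discounted_return(rewards, gamma):
    """Sum rewards, weighting the k-th reward (k = 0, 1, 2, ...) by gamma**k."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))


# Rewards collected along each path, starting with the reward of the start state.
left_from_4 = [0, 0, 0, 100]   # states 4 -> 3 -> 2 -> 1
right_from_4 = [0, 0, 40]      # states 4 -> 5 -> 6

print(discounted_return(left_from_4, 0.5))   # 12.5
print(discounted_return(right_from_4, 0.5))  # 10.0
```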

### Key Insight:
- **Impatience in RL:** Future rewards are weighted less heavily, making immediate rewards more desirable. This bias encourages agents to act in ways that maximize near-term rewards.
- **Negative Rewards:** If rewards are negative (e.g., paying a fine), the agent will try to delay those penalties as much as possible, as future negative rewards are discounted and feel less severe.

In conclusion, **return** in RL reflects the long-term value of actions by considering both immediate and future rewards, controlled by the discount factor.

## Making decisions: Policies in reinforcement learning

In reinforcement learning, a **policy (π)** defines the strategy for choosing actions based on the current state. The policy is a function that maps a given state $s$ to an action $a$, which the agent should take when in that state. The main goal of reinforcement learning is to discover an optimal policy that maximizes the **return** (the total future reward).

### Key Concepts:
- **Policy (π):** A mapping from states $s$ to actions $a$. It tells the agent which action to take in any given state.
  $$
  \pi(s) = a
  $$
- **Deterministic vs Stochastic Policies:**
  - **Deterministic policy:** For each state, there is a single action: $\pi(s) = a$.
  - **Stochastic policy:** The policy assigns probabilities to actions. For a state $s$, there is a probability distribution over actions.
  $$
  \pi(a|s) = P(a|s)
  $$
- **Objective of Reinforcement Learning:** Find a policy that maximizes the expected return (total rewards over time).

### Example:
- In the Mars Rover scenario, a policy could be:
  - If the rover is in state 2, go left.
  - If the rover is in state 5, go right.
  
  The policy is used to navigate and take actions that eventually maximize the accumulated rewards (return).
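
A deterministic policy for this example can be sketched as a plain lookup table. The snippet below is an illustration only: the actions for states 3 and 4 are assumptions (the example above fixes only states 2 and 5), and `rollout` is a hypothetical helper that follows the policy until a terminal state and accumulates the discounted return.

```python
REWARDS = {1: 100, 2: 0, 3: 0, 4: 0, 5: 0, 6: 40}
TERMINAL = {1, 6}

# One action per non-terminal state (states 3 and 4 are assumed for illustration).
policy = {2: "left", 3: "left", 4: "left", 5: "right"}


def rollout(start, policy, gamma):
    """Follow the policy from `start` to a terminal state; return (path, return)."""
    state, path, total, discount = start, [start], 0.0, 1.0
    while True:
        total += discount * REWARDS[state]
        if state in TERMINAL:
            return path, total
        state += -1 if policy[state] == "left" else 1
        path.append(state)
        discount *= gamma


print(rollout(4, policy, gamma=0.5))  # ([4, 3, 2, 1], 12.5)
print(rollout(5, policy, gamma=0.5))  # ([5, 6], 20.0)
```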

### Summary:
- A **policy (π)** is central to reinforcement learning, guiding the agent’s actions based on its current state.
- The ultimate goal of a reinforcement learning algorithm is to learn the **optimal policy**, which tells the agent which actions to take in any state to maximize the long-term return.

## Review of key concepts

Let's review the key concepts of reinforcement learning (RL) using the Mars rover example and extend the formalism to other applications. Reinforcement learning revolves around defining and interacting with an environment through a series of **states**, **actions**, **rewards**, and a **policy** that guides decision-making to maximize long-term rewards, also called the **return**. Here’s a breakdown:

### Key Concepts in Reinforcement Learning

1. **State (s):** A state represents the current situation or configuration of the system. In the Mars rover example, we had six numbered states (1 to 6). In other applications, like a helicopter or a chessboard, the state could represent the current position, orientation, and speed of the helicopter, or the arrangement of pieces on the chessboard.
   
2. **Action (a):** Actions are the possible moves or decisions the agent can take. For the Mars rover, the actions were to move **left** or **right**. For an autonomous helicopter, the actions would involve adjusting the control stick, while in chess, the actions are legal moves in the game.

3. **Reward (r):** The reward function provides feedback based on the outcome of an action. In the Mars rover example, the rewards were:
   - +100 for reaching the leftmost state,
   - +40 for the rightmost state, 
   - and 0 for all in-between states.
   
   In other applications, the rewards could vary, such as:
   - **Helicopter:** +1 for flying well, and a large negative reward (e.g., -1000) for crashing.
   - **Chess:** +1 for winning, -1 for losing, and 0 for a tie.

4. **Discount Factor (γ):** The discount factor determines how much future rewards are worth relative to immediate rewards. In the Mars rover case, a discount factor of 0.5 was used. For long-term tasks like chess, a high discount factor (e.g., 0.99) is typically chosen to encourage focus on future outcomes.

5. **Return (G):** The return is the total accumulated reward, discounted over time. It’s computed using the formula:
   $$
   G_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots
   $$
   The return measures how good a particular sequence of actions was, taking into account the future rewards.

6. **Policy (π):** A policy is a strategy that determines which action to take in a given state. The goal in RL is to find an optimal policy that maximizes the return.

### Extending to Other Applications

The same formalism can be applied to a variety of applications:

- **Autonomous Helicopter:**  
   - **State (s):** Position, orientation, speed of the helicopter.
   - **Actions (a):** How to move the control stick.
   - **Rewards (r):** +1 for stable flight, large negative reward (e.g., -1000) for crashing.
   - **Discount factor (γ):** Typically high (e.g., 0.99), as long-term stability is important.
   
- **Chess:**  
   - **State (s):** Position of the chess pieces on the board.
   - **Actions (a):** Legal moves.
   - **Rewards (r):** +1 for winning, -1 for losing, 0 for a draw.
   - **Discount factor (γ):** Close to 1, e.g., 0.99 or higher, to focus on long-term success in the game.

### Markov Decision Process (MDP)

The **Markov Decision Process (MDP)** is a formalism that captures the structure of reinforcement learning problems. An MDP consists of:
- **States (S)**
- **Actions (A)**
- **Rewards (R)**
- **Transition probabilities** (i.e., how likely the system is to transition from one state to another after taking an action)
- **Discount factor (γ)**

The **policy (π)** is not part of the MDP itself; it is what the agent chooses in order to act within it. The key characteristic of MDPs is the **Markov property**: the future depends only on the current state and action, not on the sequence of events that preceded it. The agent therefore does not need to remember its history, because the current state contains everything needed to decide what to do next.

### Next Steps: State-Action Value Function

The next step in developing reinforcement learning algorithms is defining the **state-action value function (Q-value)**. The Q-value represents how good a particular action is in a given state, taking into account the rewards the agent is likely to receive in the future if it follows a certain policy. It’s a critical component for learning optimal actions and policies in reinforcement learning.

By learning this function, the agent can begin to systematically improve its policy to make better decisions over time.



# State-action value function

## State-action value function definition

### Key Concepts in Reinforcement Learning: State-Action Value Function (Q-Function)

1. **State-Action Value Function (Q-Function)**: 
   - This is denoted by **Q(s, a)**, where **s** is the state and **a** is the action. 
   - **Q(s, a)** represents the total expected return if you start in state **s**, take action **a**, and then behave optimally afterward (i.e., follow the best possible actions after taking action **a**).

2. **Return Calculation**: 
   - The return is the total future reward you get, taking into account a **discount factor** (usually denoted by **γ**).
   - The **discount factor** reduces the value of future rewards, meaning rewards received sooner are more valuable than rewards received later.

3. **Circularity of Definition**:
   - Initially, this definition seems circular because it depends on knowing the optimal actions to calculate the Q-values, and vice versa. 
   - This circularity will be resolved by using specific reinforcement learning algorithms (like Q-learning or value iteration) that learn Q-values even before the optimal policy is fully known.

4. **Example: Mars Rover**:
   - For the Mars Rover example, consider **Q(s, a)** for different states and actions. 
   - If the discount factor **γ** is 0.5, the Q-values are calculated based on the rewards received for moving left or right and behaving optimally afterward.
   - Example: For **Q(2, right)**, you get 12.5, and for **Q(2, left)**, you get 50, showing that going left in state 2 results in a higher return.

5. **Optimal Policy**:
   - The **optimal policy** **π(s)** is the action that maximizes the Q-value for a given state. 
   - For example, in state 4, if **Q(4, left)** is 12.5 and **Q(4, right)** is 10, the optimal action is **left** because it results in a higher return.

6. **Generalizing the Q-Function**:
   - Once you have Q-values for all states and actions, you can easily determine the optimal action to take in any given state by choosing the action that maximizes **Q(s, a)**.
   - This is why computing the Q-function is critical in reinforcement learning algorithms: it helps guide the agent to take actions that maximize the expected return.

7. **Markov Decision Process (MDP)**:
   - This formalism describes the decision-making problem where future states depend only on the current state and the action taken (Markov property).
   - Reinforcement learning problems can be modeled as MDPs, where the Q-function helps in finding optimal policies within the MDP framework.
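
To make the step from Q-values to a policy concrete, here is a small sketch for the Mars rover with γ = 0.5. The values for states 2 and 4 are the ones quoted above; the values for states 3 and 5 are filled in by applying the same definition and should be read as assumptions of this sketch. `greedy_policy` is a hypothetical helper name.

```python
# Q(s, a) for the non-terminal states, gamma = 0.5.
Q = {
    (2, "left"): 50.0, (2, "right"): 12.5,
    (3, "left"): 25.0, (3, "right"): 6.25,
    (4, "left"): 12.5, (4, "right"): 10.0,
    (5, "left"): 6.25, (5, "right"): 20.0,
}
ACTIONS = ("left", "right")


def greedy_policy(q_table, states, actions):
    """For each state, pick the action with the largest Q-value."""
    return {s: max(actions, key=lambda a: q_table[(s, a)]) for s in states}


policy = greedy_policy(Q, states=[2, 3, 4, 5], actions=ACTIONS)
print(policy)  # {2: 'left', 3: 'left', 4: 'left', 5: 'right'}
```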

### Applications of Q-Function:
1. **Autonomous Helicopter**: 
   - The state could represent the helicopter’s position, speed, and orientation.
   - Actions could be how the controls are moved, and rewards could be positive if the helicopter is flying well and negative if it crashes.
   - Q-learning can help in developing a policy for controlling the helicopter to maximize the flying quality.

2. **Chess Playing**:
   - The state is the arrangement of pieces on the board.
   - Actions are the legal moves, and the rewards are +1 for winning, -1 for losing, and 0 for a draw.
   - Q-learning can guide the system to choose moves that maximize the chances of winning.

Understanding and computing the Q-function is a foundational step in building efficient reinforcement learning systems across various applications, from robots to games and real-world autonomous systems. The goal is always to use the Q-values to derive a policy that maximizes the expected long-term reward.

## State-action value function example

In this reinforcement learning lab, you'll be exploring how the **state-action value function (Q function)** and the **optimal policy** change when modifying key parameters such as rewards and the discount factor (γ) in a Mars Rover example. Here's a quick breakdown of what happens when you play around with these elements:

1. **Q(s,a) Function Overview**:
   - The state-action value function $Q(s, a)$ is a measure of how good it is to take a particular action $a$ in state $s$ and then follow the optimal policy.
   - As seen in the previous example, this function provides numerical values for different states and actions, telling you the total expected reward (return) for choosing that action and then behaving optimally.

2. **Optional Lab Setup**:
   - The lab provides you with a Mars Rover scenario with six states and two possible actions (left or right) from each state.
   - Rewards are assigned to terminal states (left: 100, right: 40) and the intermediate states provide zero rewards.

3. **Gamma (γ) – Discount Factor**:
   - The **discount factor** determines how much the agent values future rewards. A γ closer to 1 makes the agent more patient (it values future rewards more), while a lower γ makes the agent impatient (it prefers immediate rewards).
   - By changing γ, you can observe how the behavior of the Mars Rover changes, and how Q(s,a) adapts accordingly.
   
   **Example:**
   - With **γ = 0.9**, the agent is more patient, waiting longer to accumulate higher future rewards. The rover is willing to go left even from state 5, as the long-term reward is larger (100) compared to 40.
   - With **γ = 0.3**, the agent becomes impatient and favors quicker, smaller rewards. The Mars Rover opts to go right from state 4, as it doesn't want to wait for the larger, distant reward of 100.

4. **Reward Function**:
   - By modifying the rewards (e.g., lowering the terminal right reward from 40 to 10), you can observe how the optimal policy shifts. For instance, with γ = 0.5 the rover now prefers going left from every state: even in state 5, going right is worth only $0.5 \times 10 = 5$, less than the 6.25 earned by heading left toward the reward of 100.

5. **Key Takeaways**:
   - **Q(s,a)** changes based on how rewards and discounting are set, reflecting how much the agent values future rewards.
   - The **optimal policy** is derived by selecting actions that maximize the Q value from each state.
   - Playing with different values helps sharpen your understanding of how these quantities interact in reinforcement learning problems.
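
The experiments described above can be reproduced without the lab using a few lines of code. The sketch below is not the lab's implementation; it is a minimal value-iteration loop under the assumption that moves are deterministic and that a terminal state is worth exactly its own reward. The names `q_values` and `optimal_policy` are illustrative.

```python
def q_values(gamma, rewards=(100, 0, 0, 0, 0, 40), sweeps=100):
    """Compute Q(s, a) for the six-state rover by repeated Bellman backups."""
    n = len(rewards)
    # v[i] holds max_a Q(state i+1, a); terminal states keep their own reward.
    v = [float(rewards[0])] + [0.0] * (n - 2) + [float(rewards[-1])]
    q = {}
    for _ in range(sweeps):
        for i in range(1, n - 1):              # interior states only
            q[(i + 1, "left")] = rewards[i] + gamma * v[i - 1]
            q[(i + 1, "right")] = rewards[i] + gamma * v[i + 1]
        for i in range(1, n - 1):
            v[i] = max(q[(i + 1, "left")], q[(i + 1, "right")])
    return q


def optimal_policy(q):
    """Pick the action with the larger Q-value in every non-terminal state."""
    states = sorted({s for s, _ in q})
    return {s: max(("left", "right"), key=lambda a: q[(s, a)]) for s in states}


for gamma in (0.9, 0.5, 0.3):
    print(gamma, optimal_policy(q_values(gamma)))

# Lower the right-hand terminal reward from 40 to 10 (gamma = 0.5).
print(optimal_policy(q_values(0.5, rewards=(100, 0, 0, 0, 0, 10))))
```

Running it shows the rover heading left from state 5 when γ = 0.9, turning right already at state 4 when γ = 0.3, and going left everywhere once the right-hand reward drops to 10.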

### Next Steps:
After experimenting with the lab, you’ll explore the **Bellman equation**, the core concept in reinforcement learning. It provides a recursive way of computing Q(s,a) by breaking it down into immediate rewards and future Q values.


## Bellman Equation

The **Bellman equation** gives a recursive way to compute the state-action value function in **reinforcement learning**. Here's a concise breakdown:

### Key Concepts:
1. **State-Action Value Function (Q(S,A))**:  
   - The value of taking action **A** in state **S** and then behaving optimally afterward.

2. **Bellman Equation**:  
   - Defines the recursive relationship for **Q(S,A)**:
     $$Q(S,A) = R(S) + \gamma \cdot \max_{A'} Q(S', A')$$
     Where:
     - **R(S)**: Immediate reward at state **S**.
     - **$\gamma$**: Discount factor (how much future rewards are valued compared to immediate rewards).
     - **S'**: Next state after taking action **A**.
     - **A'**: Possible actions in state **S'**.
     - **$\max_{A'} Q(S', A')$**: Optimal future rewards starting from state **S'**.

### Example Walkthrough:
- For **Q(2, right)**:
  - Current state: **S = 2**.
  - Action: **A = right**.
  - Reward: **R(2) = 0**.
  - Next state: **S' = 3**.
  - Future possible actions: **max(Q(3, right), Q(3, left))**.
  - **Q(2, right)** becomes:
    $$
    Q(2, \text{right}) = 0 + 0.5 \cdot \max(6.25, 25) = 12.5
    $$
  
- Similarly, for **Q(4, left)**:
  - Current state: **S = 4**.
  - Action: **A = left**.
  - Reward: **R(4) = 0**.
  - Next state: **S' = 3**.
  - Optimal next actions: **max(Q(3, right), Q(3, left))**.
  - **Q(4, left)** becomes:
    $$
    Q(4, left) = 0 + 0.5 \cdot 25 = 12.5
    $$

### Intuition:
- The **Bellman equation** breaks down rewards into two parts:
  1. **Immediate reward** from taking action **A** in state **S**.
  2. **Discounted future rewards** from optimal behavior starting from the next state **S'**.

In essence, this equation allows the agent to compute **Q(S,A)** values and thus select actions that maximize expected returns by considering both immediate and future rewards.

Once the **Q(S,A)** values are calculated, selecting the best action is simply choosing the action **A** that maximizes **Q(S,A)** for a given state.
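
It can help to unroll the recursion once by hand. The derivation below is a worked sketch for $Q(2, \text{right})$ with $\gamma = 0.5$, assuming (as in the rover example) that the value of a terminal state is just its reward:

$$
\begin{aligned}
Q(2, \text{left}) &= R(2) + \gamma \cdot R(1) = 0 + 0.5 \cdot 100 = 50 \\
Q(3, \text{left}) &= R(3) + \gamma \cdot \max_{A'} Q(2, A') = 0 + 0.5 \cdot 50 = 25 \\
Q(2, \text{right}) &= R(2) + \gamma \cdot \max_{A'} Q(3, A') = 0 + 0.5 \cdot 25 = 12.5
\end{aligned}
$$

Substituting each line into the next gives $Q(2, \text{right}) = 0 + 0.5 \cdot 0 + 0.5^2 \cdot 0 + 0.5^3 \cdot 100 = 12.5$, which is exactly the discounted return of the path 2 → 3 → 2 → 1: step right once, then behave optimally.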



## Random (stochastic environment)

In this lesson, we're introduced to **stochastic reinforcement learning problems**, where actions in the environment have uncertain outcomes. For instance, in the case of the Mars rover, when you command it to move in a certain direction (e.g., left), it might not always follow perfectly due to environmental factors like slippery surfaces. There’s a probability (say 90%) that it behaves as expected, and a smaller probability (10%) that it moves in an unintended direction. This randomness makes it a **stochastic environment**.

### Key Concepts:

1. **Stochastic Environment**:
   - When actions don't always yield the same results, we call it a stochastic environment. In the Mars rover example, the rover has a 90% chance of moving in the intended direction and a 10% chance of slipping and moving in the opposite direction.

2. **Policy**:
   - A policy defines the action to take in each state. In stochastic environments, even if you follow the same policy, different outcomes can occur due to the randomness of the environment.

3. **Expected Return**:
   - In stochastic reinforcement learning, since outcomes are random, the goal is not to maximize the return from one particular episode but to maximize the **expected return**, which is the average return over many episodes.
   - Mathematically, the expected return is the **sum of discounted rewards** averaged over many trials. It's represented as:
     $$
     E[R_1 + \gamma R_2 + \gamma^2 R_3 + \dots]
     $$
     where $ \gamma $ is the discount factor.

4. **Bellman Equation**:
   - In a stochastic environment, the Bellman equation is modified. Since the next state $s'$ after taking action $a$ in state $s$ is random, the equation takes an expected value of future rewards. The equation becomes:
     $$Q(s, a) = R(s) + \gamma \cdot E\left[\max_{a'} Q(s', a')\right]$$
     where the expectation is taken over the random next state $s'$.

5. **Misstep Probability**:
   - The **misstep probability** models the randomness in the environment. For example, if the rover has a 10% chance of slipping, then the misstep probability is 0.1. As this probability increases, the **Q-values** and **optimal returns** decrease because the agent has less control over the environment.
   - Increasing the misstep probability to, say, 40% (meaning the rover only follows commands 60% of the time) leads to even lower Q-values, reflecting diminished control over the robot.
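
The effect of the misstep probability can be sketched numerically. The code below is an illustration rather than the lab's implementation: it assumes that a misstep always sends the rover one step in the opposite direction, so the expectation over the next state is a two-term weighted sum. The name `stochastic_q` is hypothetical.

```python
def stochastic_q(gamma, misstep, rewards=(100, 0, 0, 0, 0, 40), sweeps=500):
    """Q(s, a) when the rover moves the wrong way with probability `misstep`."""
    n = len(rewards)
    # v[i] holds max_a Q(state i+1, a); terminal states keep their own reward.
    v = [float(rewards[0])] + [0.0] * (n - 2) + [float(rewards[-1])]
    q = {}
    for _ in range(sweeps):
        for i in range(1, n - 1):
            left, right = v[i - 1], v[i + 1]
            # Expected value of the next state under each commanded action.
            q[(i + 1, "left")] = rewards[i] + gamma * ((1 - misstep) * left + misstep * right)
            q[(i + 1, "right")] = rewards[i] + gamma * ((1 - misstep) * right + misstep * left)
        for i in range(1, n - 1):
            v[i] = max(q[(i + 1, "left")], q[(i + 1, "right")])
    return q


for p in (0.0, 0.1, 0.4):
    q = stochastic_q(gamma=0.5, misstep=p)
    print(p, round(q[(2, "left")], 2), round(q[(4, "left")], 2))
```

With `misstep=0.0` the deterministic values (50 and 12.5) are recovered, and both Q-values shrink as the misstep probability grows.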

### Practical Implications:
   - **Maximizing Control**: The effectiveness of the policy depends on how much control the agent has over the environment. In highly stochastic environments, the agent’s expected return will be lower, as it cannot reliably execute the intended actions.
   - **Larger State Spaces**: Although the Mars rover example has only six states, real-world applications often involve much larger or even continuous state spaces. These present greater challenges for reinforcement learning algorithms, and we'll cover these in future lessons.

### Summary:
In stochastic reinforcement learning, the goal is to **maximize the expected sum of discounted rewards**. The randomness in the environment leads to variability in outcomes, and reinforcement learning algorithms aim to find a policy that works well on average, considering the stochastic nature of state transitions. By adjusting the **misstep probability**, you can observe how the level of randomness impacts the agent’s performance.

# Continuous state spaces

## Example of continuous state space applications

In robotic control applications, like the lunar lander simulation in the practice lab, continuous state spaces are a common feature. Continuous state spaces differ from discrete state spaces in that the robot’s position or state isn't limited to a fixed set of values but can be any value within a range. For instance, in the Mars rover example, rather than being confined to six specific positions, the rover could be anywhere between 0 and 6 kilometers, such as 2.7 km or 4.8 km. 

This concept generalizes to more complex robots as well. For example, with a self-driving car or truck, the state might include:
- The car’s **x** position (left-right position),
- **y** position (forward-backward position),
- **θ** orientation (angle at which the car is facing),
- Speed along the x and y directions (**ẋ** and **ẏ**), and
- The rate of turning, or angular velocity (**θ̇**).

This state representation extends further in more advanced scenarios like controlling a helicopter. For a helicopter, the state includes:
- Position in 3D space (**x**, **y**, **z**),
- Orientation captured by **roll**, **pitch**, and **yaw** (denoted as **ϕ**, **θ**, and **ψ** respectively),
- Speeds along these axes (**ẋ**, **ẏ**, **ż**),
- Rates of angular change for roll, pitch, and yaw (angular velocities **ϕ̇**, **θ̇**, **ψ̇**).

In total, the helicopter’s state is represented by 12 continuous numbers that a policy must process to decide the appropriate action for controlling the helicopter. This state-action formulation extends the traditional reinforcement learning framework for more complex, real-world systems.

In the lunar lander application from the practice lab, you will work on a simulated environment that also uses continuous states, requiring you to apply these principles to control the landing accurately on the moon.

## Lunar lander

The lunar lander simulation is a great example of applying reinforcement learning to a real-world scenario, and it involves navigating a continuous state space. In this application, you control a lunar lander as it descends toward the moon's surface, with the goal of landing it safely on a designated landing pad. 

### Key Components of the Lunar Lander Simulation

**Actions:**
You have four possible actions at each time step:
1. **Nothing**: Do not fire any thrusters; gravity and inertia will pull the lander down.
2. **Left**: Fire the left thruster, which will push the lander to the right.
3. **Main**: Fire the main engine, which thrusts downward and pushes the lander up.
4. **Right**: Fire the right thruster, which will push the lander to the left.

**State Space:**
The state of the lunar lander is represented by several variables:
- **Position (X, Y)**: The horizontal and vertical positions of the lander.
- **Velocity (Vx, Vy)**: The horizontal and vertical speeds.
- **Angle (θ)**: The tilt of the lander.
- **Angular Velocity (θ̇)**: How quickly the angle is changing.
- **Ground Contact (l, r)**: Two binary variables indicating whether the left and right landing legs are on the ground.

**Reward Function:**
The reward structure encourages desired behaviors:
- Achieving a soft landing rewards between **100 and 140** points, depending on the landing precision.
- Moving closer to the landing pad provides positive rewards, while drifting away incurs negative rewards.
- Crashing results in a **-100** penalty.
- Each grounded leg (left or right) grants **+10** points.
- Firing the main engine incurs a **-0.3** penalty, while firing side thrusters incurs a **-0.03** penalty.

This nuanced reward function is designed to encourage the lander to land safely while minimizing unnecessary fuel expenditure.

### Objective:
The goal is to learn a policy $\pi$ that, given a state $S$, selects an action $A = \pi(S)$ to maximize the expected return, which is the sum of discounted rewards. In this case, a high value for the discount factor $\gamma$ (around **0.985**) is typically used, meaning future rewards are almost as valuable as immediate ones.


## Learning the state-value function

The Lunar Lander problem is a classic example for applying reinforcement learning (RL). Here's a condensed summary of the key points:

### Overview of Lunar Lander
- **Objective**: Control a lunar lander to safely land on a designated pad by managing its thrusters.
- **Actions**: Four possible actions:
  1. **Nothing** (do not fire any thrusters)
  2. **Left** (fire left thruster)
  3. **Main** (fire main engine)
  4. **Right** (fire right thruster)

### State Representation
- The state is described by eight variables: 
  - Position (x, y)
  - Velocity (ẋ, ẏ)
  - Angle (θ) and angular velocity (θ̇)
  - Ground contact indicators (l for left leg, r for right leg)

### Reward Function
- **Positive rewards**:
  - Safe landing (100–140 points)
  - Proximity to landing pad (+ reward for approaching)
  - Grounded legs (+10 points per leg)
- **Negative rewards**:
  - Crash (-100 points)
  - Fuel consumption (e.g., -0.3 for the main engine, -0.03 for side thrusters)

### Learning Q-Function
- **Neural Network Structure**:
  - Input: State-action pair (12 features: 8 state variables + a 4-element one-hot encoding of the action)
  - Output: Estimated Q-value (Q(s, a))

### Training Process
1. **Exploration**: Take random actions to gather state-action-reward-next state tuples.
2. **Bellman Equation**: Use the equation $Q(s, a) = R(s) + \gamma \max_{a'} Q(s', a')$ to compute target values.
3. **Training**: Use supervised learning to adjust neural network weights to minimize the difference between predicted and target Q-values.
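
The three steps above amount to building a supervised dataset of inputs `x` and targets `y`. Below is a minimal sketch of that construction. It assumes a hypothetical `q_fn` that stands in for the current neural network (taking the 12-feature input and returning a number), and it adds a `done` flag to each tuple so that terminal transitions use the reward alone; neither is prescribed by the notes.

```python
ACTIONS = ("nothing", "left", "main", "right")


def encode_input(state, action):
    """Concatenate the 8 state variables with a 4-element one-hot action code."""
    one_hot = [1.0 if a == action else 0.0 for a in ACTIONS]
    return list(state) + one_hot


def build_training_set(experiences, q_fn, gamma):
    """Turn (s, a, r, s_next, done) tuples into (x, y) pairs via the Bellman equation."""
    xs, ys = [], []
    for state, action, reward, next_state, done in experiences:
        if done:
            target = reward  # no future rewards after a terminal state
        else:
            best_next = max(q_fn(encode_input(next_state, a)) for a in ACTIONS)
            target = reward + gamma * best_next
        xs.append(encode_input(state, action))
        ys.append(target)
    return xs, ys


# Usage with a dummy Q-function that always predicts 2.0.
s = [0.0] * 8
experiences = [(s, "main", -0.3, s, False), (s, "nothing", -100.0, s, True)]
xs, ys = build_training_set(experiences, q_fn=lambda x: 2.0, gamma=0.985)
print(len(xs[0]), ys)
```

The network is then fit so that its prediction for each `x` moves toward the corresponding `y`.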

### Replay Buffer
- Store the most recent 10,000 experience tuples to use for training, promoting stable learning.
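
One simple way to keep only the most recent 10,000 tuples is a fixed-length deque, which silently discards the oldest entry once it is full. The class below is a sketch; the `ReplayBuffer` name, its methods, and the batch size of 64 are illustrative assumptions.

```python
import random
from collections import deque


class ReplayBuffer:
    """Keeps the most recent `capacity` experience tuples."""

    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)  # oldest items are dropped automatically

    def add(self, experience):
        self.buffer.append(experience)

    def sample(self, batch_size, rng=random):
        """Return `batch_size` distinct experiences chosen uniformly at random."""
        return rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


buf = ReplayBuffer()
for i in range(10_005):
    buf.add(i)          # integers stand in for (s, a, r, s') tuples
print(len(buf))         # 10000
batch = buf.sample(64)
```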

### Algorithm Summary
- Initialize the neural network with random weights.
- Collect experience tuples through exploration.
- Train the network on these tuples to refine the Q-function.
- Update the network iteratively, improving estimates over time.

### DQN (Deep Q-Network)
- Combines deep learning with Q-learning: a neural network approximates $Q(s, a)$, which makes it possible to handle the continuous state space of the Lunar Lander problem.


## Algorithm refinement: Improved neural network architecture

Here’s a concise recap of the main points:

### Improved Neural Network Architecture for DQN
1. **Single Output Layer**: Instead of creating a separate Q value for each action by running inference four times, the network now outputs all four Q values simultaneously from a single input.
   - **Input**: Eight numbers (state features).
   - **Hidden Layers**: Two layers with 64 units each.
   - **Output Layer**: Four units corresponding to $Q(s, \text{nothing})$, $Q(s, \text{left})$, $Q(s, \text{main})$, and $Q(s, \text{right})$.

2. **Efficiency**:
   - Running inference only once per state instead of four times reduces computational overhead.
   - Facilitates quick selection of the action that maximizes Q, enhancing decision-making speed.

3. **Bellman's Equation**: This architecture allows for simultaneous calculation of $Q(s', a')$ for all actions, streamlining the process of computing the right-hand side of Bellman's equation.
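
To illustrate the shape of this architecture, here is a dependency-free sketch of a forward pass through an 8 → 64 → 64 → 4 network. The ReLU hidden activations, the linear output layer, the random initialization, and the sample state are assumptions made for illustration; a real implementation would use a deep learning framework and trained weights.

```python
import random

ACTIONS = ("nothing", "left", "main", "right")


def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases for one dense layer."""
    weights = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense(x, layer, relu):
    """Apply one dense layer, optionally followed by ReLU."""
    weights, biases = layer
    out = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, biases)]
    return [max(0.0, o) for o in out] if relu else out


def q_network(state, layers):
    """Map an 8-number state to four Q-values, one per action."""
    h = dense(state, layers[0], relu=True)
    h = dense(h, layers[1], relu=True)
    return dense(h, layers[2], relu=False)


rng = random.Random(0)
layers = [init_layer(8, 64, rng), init_layer(64, 64, rng), init_layer(64, 4, rng)]

state = [0.0, 1.4, 0.0, -0.5, 0.0, 0.0, 0.0, 0.0]  # made-up lander state
q = q_network(state, layers)
best_action = ACTIONS[max(range(len(q)), key=lambda i: q[i])]
print(q, best_action)
```

A single call yields all four Q-values, so both action selection and the $\max_{a'} Q(s', a')$ term need only one forward pass per state.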

### Epsilon-Greedy Policy
- **Definition**: A strategy to balance exploration and exploitation during training.
  - **Exploration**: With probability $\epsilon$, choose a random action to explore the state space.
  - **Exploitation**: With probability $1 - \epsilon$, choose the action that maximizes the Q value.
  
- **Decay Over Time**: Start with a higher $\epsilon$ for more exploration, gradually reducing it as the agent becomes more confident in its Q values. This helps avoid getting stuck in local optima and allows for better learning of the environment dynamics.

This combination of an efficient neural network design and an epsilon-greedy policy significantly enhances the performance and learning efficiency of the DQN algorithm. 

## Algorithm refinement: ϵ-greedy policy

#### Neural Network Architecture
- **Previous Architecture**: Input of state $s$ requires four separate inferences for Q-values ($Q(s,a)$).
- **Improved Architecture**: Single neural network outputs all Q-values simultaneously for actions (e.g., left, right, main, nothing).
  - **Input**: 8 numbers (state representation).
  - **Hidden Layers**: 2 layers, each with 64 units.
  - **Output**: 4 units representing $Q(s, \text{nothing}), Q(s, \text{left}), Q(s, \text{main}), Q(s, \text{right})$.
- **Efficiency**: Allows inference to run once per state, providing all action Q-values at once. Facilitates quicker selection of the action with the maximum Q-value. 

#### Bellman's Equation
- The improved architecture allows simultaneous computation of $Q(s', a')$ for all actions, simplifying the calculation in Bellman's equation.
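
As a rough sketch of why this helps, the right-hand side of Bellman's equation, $R(s) + \gamma \max_{a'} Q(s', a')$, becomes a row-wise maximum over the network's four outputs. The function name `bellman_targets`, the `done` flag for terminal transitions, and the default `gamma` value are assumptions for illustration.

```python
import numpy as np

def bellman_targets(rewards, next_q, done, gamma=0.99):
    """Compute y = R + gamma * max_a' Q(s', a') for a batch of transitions.

    next_q has one row per transition and one column per action.
    On terminal transitions (done == 1) the target is just the reward.
    """
    rewards = np.asarray(rewards, dtype=float)
    next_q = np.asarray(next_q, dtype=float)
    done = np.asarray(done, dtype=float)
    return rewards + gamma * next_q.max(axis=1) * (1.0 - done)

# Made-up Q(s', a') values for two transitions.
next_q = np.array([[1.0, 2.0, 0.5, -1.0],
                   [0.0, 0.0, 3.0, 1.0]])
y = bellman_targets(rewards=[1.0, -100.0], next_q=next_q, done=[0, 1], gamma=0.5)
# First target: 1.0 + 0.5 * 2.0 = 2.0. Second transition is terminal, so the target is -100.0.
print(y)
```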

#### Epsilon-Greedy Policy
- **Purpose**: Balances exploration and exploitation during learning.
- **Implementation** (here $\epsilon = 0.05$):
  1. With probability $1 - \epsilon$ (95%), select the action that maximizes $Q(s,a)$ (the greedy action).
  2. With probability $\epsilon$ (5%), select an action at random (exploration).
  
- **Rationale**: Without occasional random actions, an action that the network initially underrates (e.g., firing the main thruster) would never be tried, so the agent could never learn that it is actually useful. Exploration keeps the agent from locking into such poor early estimates.
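
The selection rule above might look like the following sketch. The function name `epsilon_greedy` and the example Q-values are made up for illustration, and the random action is assumed to be drawn uniformly from all four actions.

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng):
    """Pick a random action with probability epsilon, else the greedy one."""
    q_values = np.asarray(q_values)
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))  # explore: uniform random action
    return int(np.argmax(q_values))              # exploit: action with the highest Q

rng = np.random.default_rng(0)
q = [0.1, 0.4, 2.0, -0.3]  # made-up Q-values; index 2 is the greedy action
picks = [epsilon_greedy(q, 0.05, rng) for _ in range(10_000)]
greedy_fraction = picks.count(2) / len(picks)
print(greedy_fraction)
```

Because the random draw can also land on the greedy action, the greedy action is chosen slightly more often than 95% of the time (about $0.95 + 0.05/4$ under these assumptions).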

#### Exploration vs. Exploitation
- **Exploration**: Trying random actions to discover new strategies.
- **Exploitation**: Choosing the best-known action to maximize returns.
- **Trade-Off**: Epsilon-greedy policy addresses this trade-off by introducing randomness in action selection.

#### Epsilon Decay Strategy
- **Start High**: Begin with high epsilon (e.g., 1.0) for maximum exploration.
- **Gradually Decrease**: Decrease epsilon over time to reduce random actions (e.g., down to 0.01).
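
One common way to implement this schedule is multiplicative decay with a floor. The notes only give the start (1.0) and end (0.01) values; the decay factor of 0.995 per episode and the name `decay_epsilon` are assumptions chosen for this sketch.

```python
def decay_epsilon(epsilon, decay=0.995, epsilon_min=0.01):
    """Shrink epsilon by a constant factor, never going below epsilon_min."""
    return max(epsilon_min, decay * epsilon)

epsilon = 1.0  # start fully exploratory
schedule = []
for episode in range(2000):
    schedule.append(epsilon)  # epsilon used during this episode
    epsilon = decay_epsilon(epsilon)

print(schedule[0], schedule[100], schedule[-1])
```

With these assumed settings the floor of 0.01 is reached after roughly 900 episodes; a smaller decay factor would reach it sooner at the cost of less exploration.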

#### Hyperparameter Sensitivity
- **Reinforcement Learning (RL) Complexity**: RL algorithms are more sensitive to hyperparameter settings than supervised learning algorithms.
  - Even a small misconfiguration can make learning take 10x or 100x longer.
  
#### Additional Refinements (Optional)
- **Mini Batching**: Refines the learning process by processing multiple samples simultaneously.
- **Soft Updates**: Make the updates to the Q-function more gradual, which helps the algorithm converge more reliably (not essential for an initial implementation).

### Conclusion
- A more efficient neural network architecture and an epsilon-greedy policy both improve the DQN algorithm in environments like Lunar Lander. Balancing exploration and exploitation and tuning hyperparameters carefully are crucial for successful reinforcement learning.

## Algorithm refinement: Mini-batch and soft updates (optional)

#### Mini-Batch Gradient Descent
- **Concept**: Instead of using the entire dataset for every iteration in gradient descent, mini-batch gradient descent selects a smaller subset.
- **Application**: Works for both supervised and reinforcement learning.
- **Example**: With a large dataset (e.g., 100 million examples), computing each gradient step on a mini-batch (e.g., 1,000 examples) instead of the full dataset makes every iteration far cheaper.
- **Process**:
  1. Instead of computing the gradient over the entire dataset, use a mini-batch of data.
  2. Each iteration only requires processing the mini-batch, making training faster.
  3. Each step uses a noisy estimate of the gradient, so the path to convergence is less smooth, but each iteration is computationally much cheaper.
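
The process above can be sketched on a toy supervised problem. Everything here is an assumption made for illustration: the synthetic data ($y = 3x + 2$ plus noise), the dataset size, the learning rate `alpha`, and the number of steps.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data, made up for this sketch: y = 3x + 2 plus a little noise.
m = 100_000
X = rng.uniform(-1.0, 1.0, size=m)
y = 3.0 * X + 2.0 + rng.normal(0.0, 0.1, size=m)

w, b = 0.0, 0.0
alpha, batch_size = 0.1, 1_000

for step in range(500):
    idx = rng.integers(0, m, size=batch_size)  # pick a random mini-batch
    xb, yb = X[idx], y[idx]
    err = w * xb + b - yb                      # prediction error on the mini-batch only
    w -= alpha * np.mean(err * xb)             # gradient of 0.5 * mean squared error w.r.t. w
    b -= alpha * np.mean(err)                  # gradient w.r.t. b

print(round(float(w), 2), round(float(b), 2))
```

Each step touches 1,000 examples rather than all 100,000, yet the parameters still end up close to the values used to generate the data.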

#### Benefits of Mini-Batches
- **Speed**: Reduces the time per iteration, especially beneficial with large datasets.
- **Example in Reinforcement Learning**: Use a subset of stored tuples from a replay buffer to train the neural network, speeding up learning.
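
In the RL setting, a mini-batch is a random sample of stored experience tuples. The sketch below assumes a buffer capacity of 10,000 and a batch size of 64; those numbers, the helper names `store` and `sample_mini_batch`, and the dummy transitions are all hypothetical.

```python
import random
from collections import deque

BUFFER_SIZE = 10_000  # assumed capacity
BATCH_SIZE = 64       # assumed mini-batch size

replay_buffer = deque(maxlen=BUFFER_SIZE)  # oldest tuples are dropped once full
rng = random.Random(0)

def store(state, action, reward, next_state, done):
    """Append one experience tuple (s, a, R, s', done) to the buffer."""
    replay_buffer.append((state, action, reward, next_state, done))

def sample_mini_batch(batch_size=BATCH_SIZE):
    """Draw a random subset of stored tuples and split it into columns."""
    batch = rng.sample(list(replay_buffer), batch_size)
    states, actions, rewards, next_states, dones = zip(*batch)
    return states, actions, rewards, next_states, dones

# Fill the buffer with made-up transitions (8-number states, 4 possible actions).
for t in range(500):
    state = [float(t)] * 8
    next_state = [float(t + 1)] * 8
    store(state, rng.randrange(4), -1.0, next_state, False)

states, actions, rewards, next_states, dones = sample_mini_batch()
print(len(states), len(replay_buffer))
```

Training on the sampled columns instead of the whole buffer keeps each gradient step cheap regardless of how many tuples have been stored.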

#### Soft Updates
- **Purpose**: Prevents abrupt changes to the Q-function that could destabilize learning.
- **Implementation**:
  - Instead of directly setting parameters $W$ and $B$ to new values $W_{\text{new}}$ and $B_{\text{new}}$, use:
    - $W = 0.01 \times W_{\text{new}} + 0.99 \times W$
    - $B = 0.01 \times B_{\text{new}} + 0.99 \times B$
- **Mechanism**: This gradual update (soft update) blends the new and old parameters, reducing the risk of learning instability from poor new estimates.
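
The update rule above can be written as a small helper. This is a sketch only: the name `soft_update`, the constant name `TAU` for the 0.01 weight, and the toy parameter values are assumptions.

```python
import numpy as np

TAU = 0.01  # weight on the new parameters; the old ones get 1 - TAU = 0.99

def soft_update(params, new_params, tau=TAU):
    """Blend each parameter array: tau * new + (1 - tau) * old."""
    return [tau * p_new + (1.0 - tau) * p for p, p_new in zip(params, new_params)]

# Toy parameters: current values are all zeros, newly trained values are all ones.
W, B = np.zeros((2, 2)), np.zeros(2)
W_new, B_new = np.ones((2, 2)), np.ones(2)

W, B = soft_update([W, B], [W_new, B_new])
print(W[0, 0], B[0])  # each entry has moved 1% of the way toward the new value
```

Calling `soft_update` with `tau=1.0` reduces to the hard update $W = W_{\text{new}}$, which makes the trade-off between stability and speed explicit.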

#### Hyperparameters
- **Control Update Aggressiveness**: The two weights (0.01 and 0.99, which sum to 1) control how far each update moves the parameters toward the new values.
- **Trade-Off**: Putting all the weight on the new values (i.e., setting $W = W_{\text{new}}$) recovers the original hard update and its instability, while putting too little weight on them slows learning.

#### Conclusion
- **Mini-batches** make each training iteration cheaper, and **soft updates** make learning more stable, particularly in challenging environments like Lunar Lander.
- Mini-batching applies to both supervised and reinforcement learning, while soft updates help the reinforcement learning algorithm converge more reliably.


## The state of reinforcement learning

#### Overview
- **Research Momentum**: Reinforcement learning (RL) is a dynamic field with significant research activity and potential applications.
- **Personal Experience**: The speaker's PhD thesis focused on RL, highlighting a long-standing interest.

#### Hype vs. Reality
- **Simulation vs. Real-World Applications**:
  - Many successes in RL come from simulated environments (e.g., games).
  - Implementing RL in real-world scenarios, such as robotics, poses significant challenges.
  - Developers often find it easier to achieve results in simulations than in practical applications.

#### Current Utility of Reinforcement Learning
- **Applications Compared to Other Learning Types**:
  - Fewer real-world applications of RL exist compared to supervised and unsupervised learning.
  - For most practical applications, supervised and unsupervised learning techniques are more likely to be effective.
- **Speaker's Experience**:
  - While RL has been applied in robotic control tasks, the speaker predominantly uses supervised and unsupervised learning in day-to-day work.

#### Future Potential
- Despite current limitations, RL holds substantial promise for future applications and remains a key component of the machine learning landscape.
- Having RL available as one of several frameworks makes it easier to choose the right approach when building machine learning systems.

#### Conclusion
- The materials on RL, especially practical tasks like the Lunar Lander, are designed to engage learners and provide hands-on experience.
- Learners are encouraged to enjoy the process and the satisfaction of getting an RL algorithm to work.