## A Super-Condensed Intro to RL

#### Basic Terminology and the Agent-Environment Framework

At each time step, the agent observes the current **state** ($s$) of the environment, selects an **action** ($a$) according to its policy ($\pi$), and receives a **reward** ($r$) and a new **state** ($s'$) from the environment. Note that the state describes the agent's current situation within the environment: the environment is the entire world in which the agent lives, and the state is the particular situation in that world the agent currently finds itself in. This agent-environment interaction is formally modeled as a **Markov Decision Process (MDP)**.

1. **Agent**: The learner or decision-maker.
2. **Environment**: Everything the agent interacts with.
3. **State ($S$)**: A representation of the environment at a specific time.
4. **Action ($A$)**: A decision made by the agent.
5. **Reward ($R$)**: A scalar feedback signal from the environment to the agent.
6. **Episode**: A sequence of states, actions, and rewards that ends in a terminal state.
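
The interaction loop described above can be sketched in a few lines of Python. This is only an illustration: `CorridorEnv`, its `reset`/`step` methods, and `run_episode` are hypothetical names invented for this example, not part of any particular library.

```python
class CorridorEnv:
    """Hypothetical toy environment: positions 0..4, the episode ends at position 4."""

    def __init__(self, length=5):
        self.length = length
        self.state = 0

    def reset(self):
        # Start every episode at the left end of the corridor.
        self.state = 0
        return self.state

    def step(self, action):
        # action: 0 = move left, 1 = move right (clipped to the corridor bounds).
        move = 1 if action == 1 else -1
        self.state = min(max(self.state + move, 0), self.length - 1)
        done = self.state == self.length - 1  # terminal state reached?
        reward = 1.0 if done else 0.0         # scalar feedback signal
        return self.state, reward, done


def run_episode(env, policy, max_steps=100):
    """Run one episode and return the list of (s, a, r, s') transitions."""
    trajectory = []
    state = env.reset()
    for _ in range(max_steps):
        action = policy(state)                       # agent picks an action
        next_state, reward, done = env.step(action)  # environment responds
        trajectory.append((state, action, reward, next_state))
        state = next_state
        if done:
            break
    return trajectory


# A trivial deterministic policy: always move right.
episode = run_episode(CorridorEnv(), lambda s: 1)
```

With the always-right policy, the episode consists of four transitions and only the last one carries a non-zero reward.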

#### Policy

- **Definition**: A policy defines the agent's behavior at a given time, i.e., it tells the agent which action to take in the current state.
- **Types of Policies**:
  - **Deterministic**: $\pi(s) = a$, where the action $a$ is fixed for a given state $s$.
  - **Stochastic**: $\pi(a|s) = P(A_t = a \mid S_t = s)$, where the policy outputs a probability distribution over actions for each state.
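
As a rough illustration of the two policy types, the sketch below stores a deterministic policy as a state-to-action table and a stochastic policy as a state-to-distribution table. The state names, action names, and helper functions are made up for this example.

```python
import random

# Deterministic policy: each state maps to exactly one action, pi(s) = a.
deterministic_policy = {"s0": "right", "s1": "right", "s2": "left"}

# Stochastic policy: each state maps to a distribution over actions, pi(a|s).
stochastic_policy = {
    "s0": {"left": 0.1, "right": 0.9},
    "s1": {"left": 0.5, "right": 0.5},
    "s2": {"left": 0.8, "right": 0.2},
}


def act_deterministic(policy, state):
    """Return the single action prescribed for this state."""
    return policy[state]


def act_stochastic(policy, state, rng=random):
    """Sample an action according to the probabilities pi(a|s)."""
    actions = list(policy[state].keys())
    probs = list(policy[state].values())
    return rng.choices(actions, weights=probs, k=1)[0]
```

Calling `act_deterministic(deterministic_policy, "s0")` always returns `"right"`, whereas repeated calls to `act_stochastic(stochastic_policy, "s0")` return `"right"` roughly 90% of the time.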

#### Task Types

1. **Episodic (Terminal)**:
   - The interaction ends after a finite number of steps; each such run is called an episode.
   - Example: Games with a defined end, like Super Mario Bros.
2. **Continuing** (sometimes called continuous):
   - The interaction does not have a defined endpoint.
   - Example: Autonomous driving.

#### Exploration vs. Exploitation Tradeoff

- **Exploration**: Trying new or rarely used actions to discover their effects, in the hope that they yield higher rewards than the actions tried so far.
- **Exploitation**: Choosing the actions currently believed to be best in order to maximize reward.
- **Challenge**: Balancing exploration and exploitation to ensure long-term success.
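
One common way to balance the two is the epsilon-greedy rule, which is not described above and is shown here only as one possible strategy: with probability `epsilon` the agent explores by picking a random action, otherwise it exploits the action with the highest estimated value. The function name and the list-of-estimates representation are assumptions for this sketch.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick an action index from a list of estimated action values.

    With probability epsilon: explore (uniformly random action).
    Otherwise: exploit (action with the highest current estimate).
    """
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


estimates = [0.2, 0.8, 0.5]                       # hypothetical value estimates
greedy_action = epsilon_greedy(estimates, epsilon=0.0)  # pure exploitation
```

Setting `epsilon=0.0` gives pure exploitation and `epsilon=1.0` gives pure exploration; values in between trade one off against the other.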

#### Discount Factor ($\gamma$)

- The discount factor takes values in the range $0 \leq \gamma \leq 1$.
- It determines how future rewards are weighted relative to immediate rewards:
  - $\gamma = 0$: The agent is **short-sighted** and only considers immediate rewards.
  - $\gamma = 1$: The agent is **far-sighted** and values long-term rewards equally with immediate rewards.
- In practice, $0 < \gamma < 1$ balances these extremes: a reward received $k$ steps in the future is weighted by $\gamma^k$, so nearer rewards count more than distant ones.
- The discount factor is used to calculate the discounted return: $G_t = \sum_{k=0}^\infty \gamma^k R_{t+k+1}$.
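
To make the effect of $\gamma$ concrete, here is a small sketch that computes the discounted return of a finite reward sequence. The function name and the example rewards are illustrative only; the backward accumulation relies on the recursion $G_t = R_{t+1} + \gamma G_{t+1}$.

```python
def discounted_return(rewards, gamma):
    """Compute sum_k gamma^k * rewards[k] for a finite list of rewards.

    Works backwards through the list using G = r + gamma * G_next.
    """
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


rewards = [1.0, 1.0, 1.0]                  # hypothetical rewards R_{t+1}, R_{t+2}, R_{t+3}
print(discounted_return(rewards, 0.0))     # 1.0  (short-sighted: only the first reward)
print(discounted_return(rewards, 0.5))     # 1.75 (1 + 0.5 + 0.25)
print(discounted_return(rewards, 1.0))     # 3.0  (far-sighted: plain sum)
```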

#### Reward Hypothesis

All goals can be expressed as the maximization of the expected cumulative discounted reward (the return). Mathematically, the return is written as:
$$G_t = \sum_{k=0}^\infty \gamma^k R_{t+k+1}$$

#### Transition Dynamics

- **Definition**: The probabilistic behavior of the environment.
- **Transition Function**: $P(s', r | s, a)$, the probability of moving to state $s'$ and receiving reward $r$ given state $s$ and action $a$.
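
For a small, finite MDP, the transition function can be stored as a plain table. The sketch below is one possible representation, chosen for this example: `P[(s, a)]` holds a list of `(probability, next_state, reward)` triples, and the states, action, and numbers are invented.

```python
import random

# P[(s, a)] lists every possible outcome (probability, s', r) of taking a in s.
P = {
    ("s0", "go"): [(0.8, "s1", 0.0), (0.2, "s0", 0.0)],  # may slip and stay in s0
    ("s1", "go"): [(1.0, "terminal", 1.0)],               # always reaches the goal
}


def is_valid(P, tol=1e-12):
    """Check that the outcome probabilities sum to 1 for every (s, a) pair."""
    return all(abs(sum(p for p, _, _ in outs) - 1.0) < tol for outs in P.values())


def sample_transition(P, state, action, rng=random):
    """Draw (s', r) according to P(s', r | s, a)."""
    outcomes = P[(state, action)]
    weights = [p for p, _, _ in outcomes]
    _, next_state, reward = rng.choices(outcomes, weights=weights, k=1)[0]
    return next_state, reward
```

`sample_transition(P, "s1", "go")` always returns `("terminal", 1.0)`, while from `"s0"` the agent moves on to `"s1"` about 80% of the time.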

#### Taxonomy of RL Methods

Reinforcement Learning methods can be categorized into the following groups:

1. **Model-Free vs. Model-Based**:
   - **Model-Free**: Learn directly from environment interactions without a model (e.g., Q-Learning).
   - **Model-Based**: Build a model of the environment's dynamics for planning (e.g., Dyna-Q).
2. **Value-Based vs. Policy-Based**:
   - **Value-Based**: Learn value functions (e.g., Q-Learning).
   - **Policy-Based**: Learn policies directly (e.g., Policy Gradient).
   - **Actor-Critic**: Combines value-based and policy-based approaches.
3. **On-Policy vs. Off-Policy**:
   - **On-Policy**: Evaluate and improve the same policy that is used to select actions (e.g., SARSA).
   - **Off-Policy**: Learn about a target policy from data generated by a different behavior policy (e.g., Q-Learning).
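
The on-policy/off-policy distinction shows up directly in the update targets of SARSA and Q-Learning. The sketch below uses the standard tabular forms of both updates; the function names, the nested-list `Q` table, and the step size `alpha` are assumptions made for this illustration.

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    """On-policy: the target uses the action a_next actually chosen by the current policy."""
    target = r + gamma * Q[s_next][a_next]
    Q[s][a] += alpha * (target - Q[s][a])


def q_learning_update(Q, s, a, r, s_next, alpha, gamma):
    """Off-policy: the target uses the greedy action in s_next, whatever was actually taken."""
    target = r + gamma * max(Q[s_next])
    Q[s][a] += alpha * (target - Q[s][a])


# Same transition, two different targets (hypothetical numbers).
Q_sarsa = [[0.0, 0.0], [1.0, 2.0]]
Q_qlearn = [[0.0, 0.0], [1.0, 2.0]]
sarsa_update(Q_sarsa, s=0, a=0, r=1.0, s_next=1, a_next=0, alpha=0.5, gamma=0.9)
q_learning_update(Q_qlearn, s=0, a=0, r=1.0, s_next=1, alpha=0.5, gamma=0.9)
```

Here SARSA bootstraps from `Q[1][0] = 1.0` because that is the action the agent took next, while Q-Learning bootstraps from `max(Q[1]) = 2.0`, so the two tables end up with different values for `Q[0][0]`.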

#### Value of a State and an Action

- **Value of a State** $V_\pi(s)$: Expected return starting from state $s$ and following policy $\pi$:

$$V_\pi(s) = \mathbb{E}_\pi \left[ G_t \middle| S_t = s \right]$$

- **Value of an Action** $Q_\pi(s, a)$: Expected return starting from state $s$, taking action $a$, and following policy $\pi$ thereafter:

$$Q_\pi(s, a) = \mathbb{E}_\pi \left[ G_t \; \middle| \; S_t = s, A_t = a \right]$$

#### Value Functions

1. **State-Value Function $V(s)$**: Value of being in state $s$.
2. **Action-Value Function $Q(s, a)$**: Value of taking action $a$ in state $s$.

#### Bellman Equations

The Bellman equations provide a recursive way to compute value functions by breaking down the expected return into immediate rewards and the value of subsequent states.

$$\boxed{V_\pi(s) = \sum_{a} \pi(a|s) \sum_{s', r} P(s', r|s, a) \left[ r + \gamma V_\pi(s') \right]} \quad \rightarrow \quad \text{State-Value Function}$$


$$\boxed{Q_\pi(s, a) = \sum_{s', r} P(s', r|s, a) \left[ r + \gamma \sum_{a'} \pi(a'|s') Q_\pi(s', a') \right]} \quad \rightarrow \quad \text{Action-Value Function}$$

Note: $\sum_{s', r}$ is shorthand for the double sum $\sum_{s'} \sum_r$.
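
The state-value Bellman equation can be turned into an iterative procedure by using it repeatedly as an update rule until the values stop changing (often called iterative policy evaluation). The sketch below is a minimal version under several assumptions: a tiny hand-made MDP, transitions stored as `(probability, next_state, reward)` triples, and any state missing from the table (here `"T"`) treated as terminal with value 0.

```python
def policy_evaluation(states, actions, P, policy, gamma, tol=1e-10):
    """Approximate V_pi by sweeping the Bellman equation until convergence."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            # sum_a pi(a|s) * sum_{s', r} P(s', r | s, a) * [r + gamma * V(s')]
            v_new = sum(
                policy[s][a]
                * sum(p * (r + gamma * V.get(s2, 0.0)) for p, s2, r in P[(s, a)])
                for a in actions
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            return V


# Hypothetical two-state MDP; "T" is a terminal state with value 0.
states, actions = ["A", "B"], ["left", "right"]
P = {
    ("A", "left"): [(1.0, "A", 0.0)],
    ("A", "right"): [(1.0, "B", 1.0)],
    ("B", "left"): [(1.0, "A", 0.0)],
    ("B", "right"): [(1.0, "T", 2.0)],
}
uniform = {s: {"left": 0.5, "right": 0.5} for s in states}
V = policy_evaluation(states, actions, P, uniform, gamma=0.5)
```

For this example the Bellman equations can also be solved by hand, giving $V_\pi(A) = 12/11$ and $V_\pi(B) = 14/11$, which the iteration reproduces up to the tolerance.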

#### Derivation of the State-Value Bellman Equation

\begin{align*}
V_\pi(s) &= \mathbb{E}_\pi \left[ G_t \middle| S_t = s \right] \\ \\
\because \; G_t &= \sum_{k=0}^\infty \gamma^k R_{t+k+1} = R_{t+1} + \gamma \sum_{k=0}^\infty \gamma^k R_{t+k+2} = R_{t+1} + \gamma G_{t+1} \\ \\
V_\pi(s) &= \mathbb{E}_\pi \left[ R_{t+1} \, + \, \gamma G_{t+1} \; \middle| \; S_t = s \right] \\ \\
&= \mathbb{E}_\pi \left[ R_{t+1} \, + \, \gamma V_\pi(S_{t+1}) \; \middle| \; S_t = s \right] \\ \\
&= \sum_{a} \pi(a|s) \sum_{s'} \sum_r P(s', r|s, a) \left[ r + \gamma V_\pi(s') \right] \qquad \blacksquare
\end{align*}

#### Derivation of the Action-Value Bellman Equation

\begin{align*}
Q_\pi(s,a) &= \mathbb{E}_\pi \left[ G_t \; \middle| \; S_t = s, A_t = a \right] \\ \\
&= \sum_{s'} \sum_r P(s', r|s, a) \left[ r + \gamma \mathbb{E}_\pi \left[ G_{t+1} \; \middle| \; S_{t+1} = s' \right] \right] \\ \\
&= \sum_{s'} \sum_r P(s', r|s, a) \left[ r + \gamma \sum_{a'} \pi(a' | s') \mathbb{E}_\pi \left[ G_{t+1} \; \middle| \; S_{t+1} = s', A_{t+1} = a' \right] \right] \\ \\
&= \sum_{s'} \sum_r P(s', r|s, a) \left[ r + \gamma \sum_{a'} \pi(a'|s') Q_\pi(s', a') \right] \qquad \blacksquare
\end{align*}
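
As a cross-check on the derivation above, the step that expands over $a'$ is the law of total expectation applied to $A_{t+1}$. Written in terms of the two value functions, it is the identity that links them:

$$V_\pi(s') = \sum_{a'} \pi(a'|s') \, Q_\pi(s', a')$$

Substituting this identity into $r + \gamma V_\pi(s')$ gives the bracketed term of the action-value Bellman equation.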

#### Optimality Bellman Equations

The optimality Bellman equations define the value of a state or action under an optimal policy, i.e., a policy that maximizes the expected return. They are the basis for:
1. Finding the optimal policy in RL problems.
2. Deriving algorithms like Q-Learning and Value Iteration.

$$\boxed{V_*(s) = \max_a \sum_{s', r} P(s', r|s, a) \left[ r + \gamma V_*(s') \right]} \quad \rightarrow \quad \text{Optimal State-Value Function}$$

$$\boxed{Q_*(s, a) = \sum_{s', r} P(s', r|s, a) \left[ r + \gamma \max_{a'} Q_*(s', a') \right]} \quad \rightarrow \quad \text{Optimal Action-Value Function}$$
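
Value Iteration applies the optimal state-value equation as an update rule and then reads off a greedy policy. The sketch below is a minimal illustration under assumed conventions: a tiny hand-made MDP, transitions stored as `(probability, next_state, reward)` triples, and any state missing from the table (here `"T"`) treated as terminal with value 0.

```python
def value_iteration(states, actions, P, gamma, tol=1e-10):
    """Approximate V_* and return it together with a greedy policy."""
    V = {s: 0.0 for s in states}

    def q(s, a):
        # One-step lookahead: sum_{s', r} P(s', r | s, a) * [r + gamma * V(s')]
        return sum(p * (r + gamma * V.get(s2, 0.0)) for p, s2, r in P[(s, a)])

    while True:
        delta = 0.0
        for s in states:
            v_new = max(q(s, a) for a in actions)  # the max over actions
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            break

    # Greedy policy with respect to the converged values.
    greedy = {s: max(actions, key=lambda a: q(s, a)) for s in states}
    return V, greedy


# Hypothetical two-state MDP; "T" is a terminal state with value 0.
states, actions = ["A", "B"], ["left", "right"]
P = {
    ("A", "left"): [(1.0, "A", 0.0)],
    ("A", "right"): [(1.0, "B", 1.0)],
    ("B", "left"): [(1.0, "A", 0.0)],
    ("B", "right"): [(1.0, "T", 2.0)],
}
V_star, pi_star = value_iteration(states, actions, P, gamma=0.5)
```

In this example the optimal values are $V_*(A) = V_*(B) = 2$, and the greedy policy chooses `"right"` in both states.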