## 1. Policy Gradient Theorem

**Theorem Statement:**
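The policy gradient theorem states that for any differentiable policy π(a|s;θ), the gradient of the performance objective J(θ) with respect to the policy parameters θ is:

∇θ J(θ) = E_{s∼d^π, a∼π}[∇θ log π(a|s;θ) Q^π(s,a)]

Here the expectation is over states s drawn from the on-policy state distribution d^π and actions a drawn from π(·|s;θ). Depending on how d^π is normalized, the equality holds up to a constant factor.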

**Significance:**
- Provides an **exact expression** for the policy gradient that does not involve the gradient of the state distribution, even though that distribution depends on θ
- Enables **temporal credit assignment** - connects actions to long-term consequences via Q-values
- Forms the **foundation for all policy gradient methods** (REINFORCE, Actor-Critic, PPO, etc.)
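- Allows **online learning** - because the gradient is an expectation under the current policy, it can be estimated from sampled experience, so policies can be improved while collecting it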
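
As a sanity check of the identity above, here is a minimal sketch under simplifying assumptions: a single-state problem with three actions, a softmax policy whose parameters θ are the logits themselves, and known action values. The function names and all numbers are illustrative, not part of the original notes. It compares the score-function form E[∇θ log π(a) Q(a)] with the gradient of J(θ) = Σ_a π(a) Q(a) computed directly.

```python
import numpy as np

def softmax(logits):
    # Subtract the max for numerical stability.
    e = np.exp(logits - np.max(logits))
    return e / e.sum()

def grad_log_softmax(logits, action):
    # For a softmax over logits: d/d(theta) log pi(a) = onehot(a) - pi.
    g = -softmax(logits)
    g[action] += 1.0
    return g

def exact_policy_gradient(logits, q_values):
    # Score-function form, with the expectation over actions computed exactly.
    pi = softmax(logits)
    return sum(pi[a] * grad_log_softmax(logits, a) * q_values[a]
               for a in range(len(logits)))

def sampled_policy_gradient(logits, q_values, n_samples, rng):
    # Same quantity, estimated from actions sampled from the policy.
    actions = rng.choice(len(logits), size=n_samples, p=softmax(logits))
    return np.mean([grad_log_softmax(logits, a) * q_values[a] for a in actions],
                   axis=0)

def direct_gradient(logits, q_values):
    # Gradient of J = sum_a pi(a) Q(a) with respect to the logits.
    pi = softmax(logits)
    return pi * (q_values - pi @ q_values)

# Hypothetical logits and action values, chosen only for illustration.
logits = np.array([0.5, -1.0, 2.0])
q_values = np.array([1.0, 0.0, 3.0])
rng = np.random.default_rng(0)

exact = exact_policy_gradient(logits, q_values)
direct = direct_gradient(logits, q_values)
sampled = sampled_policy_gradient(logits, q_values, 20000, rng)
```

In this single-state case the exact score-function gradient and the direct gradient agree, and the sampled estimate approaches both as the number of samples grows.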

## 2. Parameterized Policies

**Softmax Policy Example:**
For discrete actions, a parameterized softmax policy can be defined as:

π(a|s;θ) = exp(h(s,a;θ)) / Σ_{a'} exp(h(s,a';θ))

Where:
- h(s,a;θ) are the **action preferences** (logits) parameterized by θ
- For linear approximation: h(s,a;θ) = θ^T φ(s,a) where φ(s,a) is a feature vector
- For neural networks: h(s,a;θ) is the output of a neural network

Actions with higher preferences receive higher probability, while every action keeps a nonzero probability, which preserves some exploration.
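
The linear case above can be sketched in a few lines. This is a minimal illustration, assuming the feature vectors φ(s,a) for one state are stacked as rows of a matrix; the function name `softmax_policy` and the example values of θ and φ are hypothetical.

```python
import numpy as np

def softmax_policy(theta, features):
    """Return pi(.|s; theta) for one state.

    theta:    parameter vector of shape (d,)
    features: matrix of shape (n_actions, d); row a is phi(s, a)
    """
    prefs = features @ theta          # h(s, a; theta) = theta^T phi(s, a)
    prefs = prefs - prefs.max()       # stabilize the exponentials
    exp_prefs = np.exp(prefs)
    return exp_prefs / exp_prefs.sum()

# Hypothetical parameters and features for a state with three actions.
theta = np.array([1.0, -0.5])
phi = np.array([[1.0, 0.0],
                [0.0, 1.0],
                [1.0, 1.0]])
probs = softmax_policy(theta, phi)
```

Here the preferences are 1.0, -0.5, and 0.5, so the first action is the most likely, but all three keep positive probability.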

## 3. Actor-Critic vs. Pure Policy Gradient

**Two Advantages:**

1. **Reduced Variance**: Actor-Critic methods use a value function (critic) to estimate the advantage, which typically has much lower variance than the Monte Carlo returns used in pure policy gradient methods like REINFORCE, at the cost of some bias from the learned critic.

2. **Sample Efficiency**: By learning a value function, Actor-Critic can learn from incomplete episodes and reuse experience more effectively through bootstrapping, whereas pure policy gradients typically require complete episodes.
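
To make the bootstrapping idea concrete, here is a minimal sketch of a one-step actor-critic update. It assumes a tabular critic `v`, a tabular softmax actor with logits `theta[s]`, and the TD error used as the advantage estimate. The function name, step sizes, and the example transition are illustrative assumptions, not a definitive implementation.

```python
import numpy as np

def actor_critic_step(theta, v, s, a, r, s_next, done,
                      gamma=0.99, alpha_actor=0.1, alpha_critic=0.1):
    """Update actor logits `theta` and critic values `v` in place
    from a single transition (s, a, r, s_next)."""
    # Bootstrapped target: no need to wait for the end of the episode.
    target = r + (0.0 if done else gamma * v[s_next])
    td_error = target - v[s]          # serves as the advantage estimate

    # Critic: move V(s) toward the one-step target.
    v[s] += alpha_critic * td_error

    # Actor: softmax policy gradient step weighted by the TD error.
    logits = theta[s] - theta[s].max()
    pi = np.exp(logits) / np.exp(logits).sum()
    grad_log_pi = -pi
    grad_log_pi[a] += 1.0
    theta[s] += alpha_actor * td_error * grad_log_pi
    return td_error

# Hypothetical problem with two states and two actions.
theta = np.zeros((2, 2))
v = np.zeros(2)
delta = actor_critic_step(theta, v, s=0, a=1, r=1.0, s_next=1, done=False)
```

A positive TD error raises the preference for the action that was taken and lowers the others, after a single transition rather than a full episode.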

## 4. Imitation Learning

**Behavior Cloning vs. Inverse Reinforcement Learning:**

- **Behavior Cloning**: Supervised learning approach that directly maps states to actions using expert demonstrations. Treats imitation as a classification/regression problem. **Limitation**: It does not reason about why actions are taken and is susceptible to compounding errors, since small mistakes can lead the agent into states the demonstrations never covered.

- **Inverse Reinforcement Learning**: Learns the underlying reward function that explains expert behavior, then finds an optimal policy for that reward. **Advantage**: Can generalize better to situations not seen in the demonstrations, because it recovers the intent behind the actions rather than the actions themselves.
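
The supervised-learning view of behavior cloning can be sketched as a softmax classifier fitted to expert (state, action) pairs. This is a minimal sketch under strong assumptions: one-dimensional states, discrete actions, a linear model with a bias term, and a tiny made-up expert dataset. The names `fit_behavior_cloning` and `act` are hypothetical.

```python
import numpy as np

def fit_behavior_cloning(states, actions, n_actions, lr=0.5, steps=200):
    """Fit a linear softmax policy to expert data by gradient descent
    on the cross-entropy loss."""
    X = np.column_stack([states, np.ones(len(states))])  # append a bias feature
    Y = np.eye(n_actions)[actions]                       # one-hot expert actions
    W = np.zeros((X.shape[1], n_actions))
    for _ in range(steps):
        logits = X @ W
        logits -= logits.max(axis=1, keepdims=True)
        P = np.exp(logits)
        P /= P.sum(axis=1, keepdims=True)
        W -= lr * X.T @ (P - Y) / len(X)                 # cross-entropy gradient
    return W

def act(W, state):
    # Greedy action of the cloned policy for a scalar state.
    x = np.append(np.atleast_1d(state), 1.0)
    return int(np.argmax(x @ W))

# Made-up expert: chooses action 1 for positive states, action 0 otherwise.
states = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])
actions = np.array([0, 0, 0, 1, 1, 1])
W = fit_behavior_cloning(states, actions, n_actions=2)
```

The cloned policy only imitates the state-to-action mapping. Nothing in the fit encodes a reward, which is the gap inverse reinforcement learning tries to fill.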

## 5. Reward Shaping

**Definition**: Reward shaping modifies the original reward function by adding additional shaping rewards to provide more frequent learning signals and guide the agent toward desired behaviors.

**Potential-Based Reward Shaping**: A specific form where shaping rewards are defined as:
F(s,a,s') = γΦ(s') - Φ(s)

where Φ is a real-valued potential function over states and γ is the discount factor.

**Why Useful**:
- **Policy invariance guarantee**: Doesn't change the optimal policy, only affects learning speed
- **Provides intermediate guidance**: Helps with credit assignment in sparse reward environments
- **Mathematically sound**: Preserves convergence guarantees of original RL algorithm
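
One way to see why the optimal policy is unaffected: along any trajectory the discounted shaping rewards telescope to γ^T Φ(s_T) - Φ(s_0), which depends only on the start and end states, not on the path between them. The sketch below checks this numerically. The potential (negative distance to a goal at position 10) and the trajectory are made-up assumptions for illustration.

```python
import math

def shaping_reward(phi, s, s_next, gamma):
    # F(s, a, s') = gamma * Phi(s') - Phi(s); the action does not enter.
    return gamma * phi(s_next) - phi(s)

def discounted_shaping_sum(phi, states, gamma):
    # Sum of gamma^t * F(s_t, a_t, s_{t+1}) along a trajectory of states.
    return sum(gamma ** t * shaping_reward(phi, states[t], states[t + 1], gamma)
               for t in range(len(states) - 1))

# Hypothetical potential: negative distance to a goal at position 10.
phi = lambda s: -abs(10 - s)
gamma = 0.9
states = [0, 3, 2, 7, 10]

total = discounted_shaping_sum(phi, states, gamma)
T = len(states) - 1
telescoped = gamma ** T * phi(states[-1]) - phi(states[0])
```

Both `total` and `telescoped` come out to 10 (up to floating-point error) for this trajectory, even though the agent moved away from the goal at one step.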

## 6. Maximum Entropy Principle

**Why Maximizing Entropy Helps Exploration:**

1. **Prevents Premature Convergence**: By encouraging higher entropy (more random policies), the agent continues exploring and avoids getting stuck in suboptimal deterministic policies early in training.

2. **Robustness to Uncertainty**: In partially observable or noisy environments, maximum entropy policies are more robust because they distribute probability mass across multiple plausible actions rather than over-committing to one.

3. **Better Coverage of State Space**: High-entropy policies visit more diverse states, leading to better value function estimates and discovering potentially better strategies.

4. **Connection to Optimality**: In the maximum entropy framework, the optimal policy explicitly balances reward maximization with entropy, leading to policies that are optimal while maintaining stochasticity where multiple good actions exist.

**Mathematical Insight**: The maximum entropy objective J(θ) = E[Σ_t γ^t (r_t + α H(π(·|s_t)))] explicitly trades off reward and entropy, where the temperature α ≥ 0 controls the exploration-exploitation balance. Setting α = 0 recovers the standard expected-return objective.
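
As a small worked example of this objective, the sketch below evaluates the entropy-augmented discounted return for one trajectory. It assumes discrete actions, Shannon entropy in nats, and known per-step action distributions. The helper names `entropy` and `soft_return` and the numbers are illustrative assumptions.

```python
import math
import numpy as np

def entropy(probs):
    # Shannon entropy H(p) = -sum_a p(a) log p(a), treating 0 log 0 as 0.
    p = np.asarray(probs, dtype=float)
    nz = p[p > 0]
    return float(-(nz * np.log(nz)).sum())

def soft_return(rewards, action_probs, gamma, alpha):
    # Sum over t of gamma^t * (r_t + alpha * H(pi(.|s_t))) for one trajectory.
    return sum(gamma ** t * (r + alpha * entropy(p))
               for t, (r, p) in enumerate(zip(rewards, action_probs)))

# Hypothetical two-step trajectory: a uniform policy at the first step,
# a deterministic one at the second.
rewards = [1.0, 1.0]
action_probs = [[0.5, 0.5], [1.0, 0.0]]

with_bonus = soft_return(rewards, action_probs, gamma=0.5, alpha=0.1)
without_bonus = soft_return(rewards, action_probs, gamma=0.5, alpha=0.0)
```

Only the first step contributes an entropy bonus (α ln 2), since the deterministic second step has zero entropy. With α = 0 the value reduces to the ordinary discounted return of 1.5.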