## Unit 5. Reinforcement Learning (2 weeks)


## Lecture 17. Reinforcement Learning 1

A **Markov decision process (MDP)** is defined by:

- States: a set of states $ s \in S $; 
- Actions: a set of actions $ a \in A $;
- Action-dependent transition probabilities:

$$
T(s, a, s') = P(s' \mid s, a), \quad \text{so that for each state } s \text{ and action } a, \quad \sum_{s' \in S} T(s, a, s') = 1.
$$

- Reward function $ R(s, a, s') $, representing the reward for starting in state $ s $, taking action $ a $, and ending up in state $ s' $ after one step. (The reward function may also depend only on $ s $, or only on $ s $ and $ a $.)

MDPs satisfy the **Markov property** in that the transition probabilities and rewards depend only on the current state and action, and remain unchanged regardless of the history (i.e. past states and actions) that leads to the current state.
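The four ingredients above map naturally onto a small data structure. The following is only a sketch: the `MDP` class, its field names, and the two-state "sunny/rainy" example are hypothetical and not part of the lecture. It stores $ T $ as a dictionary keyed by $ (s, a) $ and checks the normalization condition $ \sum_{s'} T(s, a, s') = 1 $.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

State = str
Action = str


@dataclass
class MDP:
    states: List[State]
    actions: List[Action]
    # T[(s, a)] maps each next state s' to P(s' | s, a).
    T: Dict[Tuple[State, Action], Dict[State, float]]
    # R[(s, a, s')] is the reward for that transition (illustrative storage choice).
    R: Dict[Tuple[State, Action, State], float]

    def check_transitions(self, tol: float = 1e-9) -> bool:
        """Return True if, for every (s, a), the probabilities over s' sum to 1."""
        return all(
            abs(sum(self.T[(s, a)].values()) - 1.0) <= tol
            for s in self.states
            for a in self.actions
        )


# Hypothetical two-state example, for illustration only.
toy = MDP(
    states=["sunny", "rainy"],
    actions=["stay", "move"],
    T={
        ("sunny", "stay"): {"sunny": 0.9, "rainy": 0.1},
        ("sunny", "move"): {"sunny": 0.5, "rainy": 0.5},
        ("rainy", "stay"): {"rainy": 0.8, "sunny": 0.2},
        ("rainy", "move"): {"sunny": 0.6, "rainy": 0.4},
    },
    R={("sunny", "stay", "sunny"): 1.0},
)
print(toy.check_transitions())
```

Because `T` is indexed only by the current state and action, the Markov property is built into the representation: there is no place to store a dependence on earlier history.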



### 1. **Finite horizon based utility**
The utility function is the sum of rewards after acting for a fixed number $ n $ of steps. For example, when the rewards depend only on the states, the utility function is defined as:

$$
U[s_0, s_1, \dots, s_n] = \sum_{i=0}^n R(s_i) \quad \text{for some fixed number of steps } n.
$$

In particular, rewards collected after step $ n $ do not count, so for any positive integer $ m $:

$$
U[s_0, s_1, \dots, s_{n+m}] = U[s_0, s_1, \dots, s_n].
$$

---

### 2. **(Infinite horizon) discounted reward based utility**
In this setting, the reward one step into the future is **discounted** by a factor $ \gamma $, the reward two steps ahead by $ \gamma^2 $, and so on. The goal is to continue acting (without an end) while maximizing the expected discounted reward. The discounting allows us to focus on near-term rewards and control this focus by changing $ \gamma $.

For example, if the rewards depend only on the states, the utility function is defined as:

$$
U[s_0, s_1, \dots] = \sum_{k=0}^\infty \gamma^k R(s_k).
$$
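Both utility definitions can be checked on a short reward sequence. This is a minimal sketch, assuming state-only rewards given as a list `rewards` where `rewards[k]` stands for $ R(s_k) $; the function names are illustrative, and the discounted sum is truncated to the finite list supplied.

```python
from typing import Sequence


def finite_horizon_utility(rewards: Sequence[float], n: int) -> float:
    """Sum R(s_0) .. R(s_n); rewards after step n are ignored."""
    return sum(rewards[: n + 1])


def discounted_utility(rewards: Sequence[float], gamma: float) -> float:
    """Sum gamma^k * R(s_k) over the given (truncated) reward sequence."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))


rewards = [0.0, 0.0, 1.0, 1.0, 1.0]
print(finite_horizon_utility(rewards, n=2))    # 1.0
print(discounted_utility(rewards, gamma=0.5))  # 0.25 + 0.125 + 0.0625 = 0.4375
```

With `n=2` the last two rewards are dropped entirely, while discounting keeps them but with geometrically shrinking weight.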

---

### Key Terms
- $ \gamma $: Discount factor where $ 0 \leq \gamma < 1 $ (with bounded rewards, $ \gamma < 1 $ keeps the infinite sum finite).
   - $ \gamma $ close to 0: Emphasizes immediate rewards (short-term focus).
   - $ \gamma $ close to 1: Considers future rewards more heavily (long-term focus).
- $ R(s_k) $: Reward at state $ s_k $.
- $ k $: Time step.

---

### Summary Table

| **Utility Type**                 | **Definition**                                  | **Time Horizon** | **Key Characteristic**               |
|----------------------------------|-----------------------------------------------|------------------|--------------------------------------|
| **Finite Horizon Utility**       | Sum of rewards over a fixed \( n \) steps.     | Finite           | Ends after \( n \) steps.            |
| **Discounted Reward Utility**    | Sum of discounted rewards over infinite steps. | Infinite         | Applies a discount factor \( \gamma \). |


Recall from lecture that the **Bellman equations** are:

$$
V^*(s) = \max_a Q^*(s, a)
$$

$$
Q^*(s, a) = \sum_{s'} T(s, a, s') \left( R(s, a, s') + \gamma V^*(s') \right)
$$


where

- the **value function** $ V^*(s) $ is the expected reward from starting at state $ s $ and acting optimally;
- the **Q-function** $ Q^*(s, a) $ is the expected reward from starting at state $ s $, then acting with action $ a $, and acting optimally afterwards.


Recall from lecture the **value iteration update rule**:

$$
V_{k+1}^*(s) = \max_a \left[ \sum_{s'} T(s, a, s') \left( R(s, a, s') + \gamma V_k^*(s') \right) \right],
$$

where $ V_k^*(s) $ is the expected reward from state $ s $ after acting optimally for $ k $ steps.
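The update rule translates almost line by line into code. The sketch below is one possible implementation, not the lecture's own: it assumes `T[(s, a)]` is a dictionary from next states to probabilities and `R` is a callable, and the two-state MDP used to exercise it (states `A` and `B`, reward 1 for landing in `B`) is hypothetical.

```python
def value_iteration(states, actions, T, R, gamma, num_iters):
    """Apply the value iteration update num_iters times, starting from V_0 = 0.

    T[(s, a)] is a dict {s': P(s' | s, a)}; R(s, a, s') is a callable.
    Returns [V_0, V_1, ..., V_num_iters], each a dict mapping state -> value.
    """
    V = {s: 0.0 for s in states}
    history = [V]
    for _ in range(num_iters):
        V_next = {}
        for s in states:
            # max over actions of the expected (reward + discounted next value)
            V_next[s] = max(
                sum(p * (R(s, a, s2) + gamma * V[s2]) for s2, p in T[(s, a)].items())
                for a in actions
            )
        V = V_next
        history.append(V)
    return history


# Hypothetical deterministic MDP: "stay" keeps the state, "go" switches it.
states = ["A", "B"]
actions = ["stay", "go"]
other = {"A": "B", "B": "A"}
T = {}
for s in states:
    T[(s, "stay")] = {s: 1.0}
    T[(s, "go")] = {other[s]: 1.0}


def R(s, a, s_next):
    return 1.0 if s_next == "B" else 0.0


history = value_iteration(states, actions, T, R, gamma=0.5, num_iters=3)
for k, V in enumerate(history):
    print(k, V)
```

Each new estimate is computed from the previous one only (`V` is not updated in place), which matches the $ V_k^* \to V_{k+1}^* $ form of the rule.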

---

### Example
Recall the example discussed in the lecture:

| Agent's starting state |   |   |   | **+1** |
|------------------------|---|---|---|--------|

An agent is trying to navigate a one-dimensional grid consisting of **5 cells**. At each step, the agent has only **one action** to choose from (i.e., it moves to the cell on the immediate right).

---

### Reward Function:
- The reward function is defined as \( R(s, a, s') = R(s) \).
- \( R(s = 5) = 1 \) and \( R(s) = 0 \) otherwise.

**Note**: The agent receives the reward **when leaving the current state**. When it reaches the rightmost cell (cell 5), it stays for one more step, receives a reward of **+1**, and comes to a halt.

---

### Value Function:
- Let \( V^*(i) \) denote the value function of state \( i \) (the \( i^{th} \) cell starting from the left).
- \( V_k^*(i) \) is the value function estimate at state \( i \) at the \( k^{th} \) step of the value iteration algorithm.
- \( V_0^*(i) \) is the initialization of this estimate.

---

### Discount Factor:
Use the discount factor \( \gamma = 0.5 \).

---

### Notation:
We write the functions \( V_k^* \) as arrays below, i.e., as:

\[
\left[ V_k^*(1) \quad V_k^*(2) \quad V_k^*(3) \quad V_k^*(4) \quad V_k^*(5) \right].
\]

---

### Initialization:
Initialize by setting \( V_0^*(i) = 0 \) for all \( i \):

\[
V_0^* = [0 \quad 0 \quad 0 \quad 0 \quad 0].
\]

---

### Value Iteration:
Using the value iteration update rule, we get:

\[
V_1^* = [0 \quad 0 \quad 0 \quad 0 \quad 1],
\]

\[
V_2^* = [0 \quad 0 \quad 0 \quad 0.5 \quad 1].
\]

---

### Final Note:
**Note**: As soon as the agent takes the first action to reach **cell 5**, it stays for one more step, halts, and does not take any more action. Therefore:

\[
V_{k+1}^*(5) = V_k^*(5) \quad \text{for all } k \geq 1.
\]
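The iterations can be reproduced with a few lines of code. This is a sketch under one modeling assumption: halting after cell 5 is treated as moving to a terminal state of value 0, so that $ V_{k+1}^*(5) = R(5) = 1 $. With a single action, the max over actions disappears and the update reduces to $ V_{k+1}^*(i) = R(i) + \gamma V_k^*(i+1) $.

```python
GAMMA = 0.5
NUM_CELLS = 5


def reward(cell: int) -> float:
    """R(s): 1 when leaving cell 5, 0 otherwise."""
    return 1.0 if cell == NUM_CELLS else 0.0


def value_iteration_step(V):
    """One update; V[i - 1] holds the current estimate for cell i."""
    V_next = []
    for cell in range(1, NUM_CELLS + 1):
        if cell == NUM_CELLS:
            future = 0.0      # assumption: the agent halts, so nothing follows
        else:
            future = V[cell]  # estimate for cell + 1 (0-based index `cell`)
        V_next.append(reward(cell) + GAMMA * future)
    return V_next


V = [0.0] * NUM_CELLS
for k in range(1, 6):
    V = value_iteration_step(V)
    print(f"V_{k} = {V}")
```

The first two printed lines match $ V_1^* $ and $ V_2^* $ above. After five iterations the estimate is `[0.0625, 0.125, 0.25, 0.5, 1.0]`, that is $ \gamma^{5-i} $ for cell $ i $, and further updates leave it unchanged.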


## Lecture 18. Reinforcement Learning 2

### Estimating Inputs for RL algorithm
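The Q-value iteration update rule is derived from the Bellman equations in much the same way as the value iteration update rule.

- Bellman equations: First, recall the Bellman equations:

$$
V^*(s) = \max_a Q^*(s, a)
$$

$$
Q^*(s, a) = \sum_{s'} T(s, a, s') \left( R(s, a, s') + \gamma V^*(s') \right)
$$

- Derivation of the Q-value update rule: Plugging the first equation into the second, we get:

$$
Q^*(s, a) = \sum_{s'} T(s, a, s') \left( R(s, a, s') + \gamma \max_{a'} Q^*(s', a') \right)
$$

- Explanation:
  1. $ Q^*(s, a) $: the expected reward from state $ s $ when taking action $ a $ and then following the optimal policy.
  2. $ T(s, a, s') $: the transition probability of moving from state $ s $ to $ s' $ after taking action $ a $.
  3. $ R(s, a, s') $: the immediate reward received when transitioning from $ s $ to $ s' $.
  4. $ \gamma $: the discount factor, controlling how future rewards are weighted.
  5. $ \max_{a'} Q^*(s', a') $: the optimal value of the next state $ s' $, obtained by taking the best possible action $ a' $ there.

- Iterative computation: To compute $ Q^* $ iteratively:
  1. Initialize $ Q_0^*(s, a) $ for all states $ s $ and actions $ a $ (for example to 0, as in the value iteration example).
  2. Update using the Q-value iteration update rule:

$$
Q_{k+1}^*(s, a) = \sum_{s'} T(s, a, s') \left( R(s, a, s') + \gamma \max_{a'} Q_k^*(s', a') \right)
$$

  3. Repeat until the values converge.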
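The initialize, update, and repeat steps can be sketched as follows. The update used is the Bellman equation for $ Q^* $ with $ V^*(s') $ replaced by $ \max_{a'} Q(s', a') $. Everything specific here is an assumption for illustration: the zero initialization, the stopping tolerance `tol`, the dictionary layout of `T`, and the hypothetical two-state MDP used as a test case.

```python
def q_value_iteration(states, actions, T, R, gamma, tol=1e-8, max_iters=10_000):
    """Iterate the Q-value update until the largest change falls below tol.

    T[(s, a)] is a dict {s': P(s' | s, a)}; R(s, a, s') is a callable.
    """
    Q = {(s, a): 0.0 for s in states for a in actions}  # step 1: initialize
    for _ in range(max_iters):
        Q_next = {}
        for s in states:
            for a in actions:
                # step 2: expected reward plus discounted best next Q-value
                Q_next[(s, a)] = sum(
                    p * (R(s, a, s2) + gamma * max(Q[(s2, a2)] for a2 in actions))
                    for s2, p in T[(s, a)].items()
                )
        delta = max(abs(Q_next[key] - Q[key]) for key in Q)
        Q = Q_next
        if delta < tol:  # step 3: stop once the values have converged
            break
    return Q


# Hypothetical deterministic MDP: "stay" keeps the state, "go" switches it,
# and landing in state B pays reward 1.
states = ["A", "B"]
actions = ["stay", "go"]
other = {"A": "B", "B": "A"}
T = {}
for s in states:
    T[(s, "stay")] = {s: 1.0}
    T[(s, "go")] = {other[s]: 1.0}


def R(s, a, s_next):
    return 1.0 if s_next == "B" else 0.0


Q = q_value_iteration(states, actions, T, R, gamma=0.5)
policy = {s: max(actions, key=lambda a: Q[(s, a)]) for s in states}
print(policy)  # {'A': 'go', 'B': 'stay'}
```

One practical reason to iterate on $ Q $ rather than $ V $ is visible in the last two lines: the greedy policy is read off directly as the action with the largest Q-value in each state, without another pass through $ T $ and $ R $.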

### Reinforcement Learning: Key Concepts and Topics

1. Multi-Armed Bandit (MAB) Problem
- Goal: Maximize cumulative reward while balancing exploration against exploitation.
- Strategies:
  - Random Selection: Chooses actions uniformly at random, without learning from the rewards observed.
  - Explore-Then-Commit (ETC): Runs a fixed exploration phase, then commits to the arm that looked best.
  - ϵ-Greedy: With a fixed probability ϵ picks a random arm; otherwise picks the arm with the best estimated reward.
  - Upper Confidence Bound (UCB): Explores optimistically by picking the arm with the highest upper confidence bound on its reward.
- Regret Analysis: Quantifies the gap between the reward collected and the reward of always playing the best arm.
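As an illustration of one of these strategies, here is a minimal ϵ-greedy sketch. The Bernoulli reward model, the arm means, the seed, and the function name are all assumptions made for the example. Regret is computed as the expected shortfall relative to always pulling the best arm, which requires knowing the true means and is therefore only available in simulation.

```python
import random


def epsilon_greedy_bandit(arm_means, epsilon, num_steps, seed=0):
    """Simulate epsilon-greedy on Bernoulli arms with the given success rates.

    Returns (pull counts, estimated means, expected regret).
    """
    rng = random.Random(seed)
    num_arms = len(arm_means)
    counts = [0] * num_arms
    estimates = [0.0] * num_arms
    for _ in range(num_steps):
        if rng.random() < epsilon:
            arm = rng.randrange(num_arms)  # explore: uniform random arm
        else:
            arm = max(range(num_arms), key=lambda i: estimates[i])  # exploit
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        # incremental update of the running average for this arm
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    best = max(arm_means)
    regret = sum(c * (best - m) for c, m in zip(counts, arm_means))
    return counts, estimates, regret


counts, estimates, regret = epsilon_greedy_bandit(
    arm_means=[0.2, 0.5, 0.8], epsilon=0.1, num_steps=5000
)
print(counts, [round(e, 3) for e in estimates], round(regret, 1))
```

Setting `epsilon=1.0` recovers random selection, and `epsilon=0.0` gives a purely greedy agent that never deliberately explores.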

2. Sequential Decision Making
- Markov Decision Process (MDP): Framework for RL with states, actions, transitions, and rewards.
- Goal: Maximize cumulative rewards using a policy $ \pi $ that maps states to actions.
- Policy Gradient: Optimizing policies using gradient-based methods, suitable for continuous spaces.

3. Policy Gradient Methods
- REINFORCE: Vanilla policy gradient that raises the probability of actions in proportion to the return that followed them.
- Discount: Reduces variance by discounting future rewards.
- Baselines: Subtracts a baseline from the return, giving advantage values, to reduce variance.
- Actor-Critic: Combines value function estimation (critic) with policy optimization (actor).
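The "Discount" and "Baselines" items both concern the weight that multiplies each action's log-probability gradient in REINFORCE. The sketch below computes those weights for a single episode. It assumes a constant baseline equal to the episode's mean return, which is just one simple choice (an actor-critic method would use a learned value estimate instead); the function names are illustrative.

```python
def discounted_returns(rewards, gamma):
    """Return G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... for every t."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def advantages(returns):
    """Subtract a constant baseline (the mean return) from each return."""
    baseline = sum(returns) / len(returns)
    return [g - baseline for g in returns]


episode_rewards = [0.0, 0.0, 1.0]
G = discounted_returns(episode_rewards, gamma=0.5)
A = advantages(G)
print(G)  # [0.25, 0.5, 1.0]
print(A)
```

After subtracting the baseline, the weights are centered around zero, so actions followed by below-average returns are pushed down rather than merely pushed up less.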

4. Reward Shaping
- Sparse vs. Dense Rewards: Sparse rewards provide limited feedback; dense rewards guide agents efficiently.
- Challenges: Agents may get stuck in local optima or exploit flaws in the reward function (reward hacking).
- Solutions: Penalties for obstacles, sub-goals for navigation tasks.

5. Learning from Demonstration (LfD)
- Enables agents to learn by mimicking expert demonstrations, reducing learning time and resource costs.
- Behavior Cloning (BC): Treats learning as supervised learning but struggles with covariate shift.
- DAgger: Iterative training to address covariate shift by querying experts for new states.

6. Bridging Simulation and Reality
- Domain Randomization: Generalizes policies by randomizing simulation parameters.
- System Identification: Fine-tunes policies by inferring real-world system parameters for adaptability.

### Main Themes
1. Balancing Exploration and Exploitation: Core to decision-making in RL (MAB and policy optimization).
2. Handling Reward Design: Reward shaping is crucial but challenging for task success.
3. Bridging Simulation-Real World Gap: Techniques like domain randomization and system identification enhance real-world applicability.

## Lecture 19: Applications: Natural Language Processing

## **1. What is Natural Language Processing (NLP)?**
- **Definition**: NLP enables machines to process and understand human language, from basic string matching to deep contextual comprehension.  
- **Origins**: The **Turing Test** (1950s) introduced the idea of evaluating a machine's intelligence through human-like conversation.
- **Early Systems**: MIT's *ELIZA* (1966) demonstrated simplistic dialogue, showing how humans could interpret repetitive machine outputs as meaningful.

---

## **2. Evolution of NLP Approaches**  
### **Symbolic vs. Statistical Methods**  
- **Symbolic**: Focuses on hand-crafted grammar and rules to understand language.  
- **Statistical**: Relies on data-driven learning to detect patterns without requiring deep understanding.

### **Key Milestones**  
- **Hidden Markov Model**: Fred Jelinek introduced **Hidden Markov Models** for speech processing, marking a major statistical breakthrough.
- **Penn Treebank**: Released in 1993, the **Penn Treebank** enabled machine learning models to train on syntactically parsed text.

---

## **3. Applications of NLP**  
NLP has revolutionized many areas, including:  
- **Practical Applications**: Search engines, machine translation, sentiment analysis, information extraction, and text generation.
- **Societal Impact**: NLP powers essay grading for GRE/SAT, structures databases, and automates news summaries.
- **Current Capabilities**: Tasks such as spam detection, named entity recognition (NER), and fact-based question answering are well-solved.

---

## **4. Challenges in NLP**  
### **Ambiguity**  
- Examples: *"Iraqi head seeks arms"* or *"A computer that understands you like your mother"*.  
  - Resolving syntactic and semantic ambiguity remains a challenge.

### **Data Dependency**  
- Performance heavily depends on the availability of **task-specific training data**.  
- Example: Machine translation works well for news text but fails for recipes.

### **Complex Tasks**  
- **Unconstrained Question Answering**: Requires logical reasoning and data analysis, making it highly challenging.  
- **Advanced Dialogue Systems**: Effective in narrow domains but struggle with general conversation.  
- **Text Summarization**: Summarizing books or articles with key points remains difficult.

---

## **5. Future Directions**  
- Improving **machine reasoning** and contextual comprehension is a major focus of ongoing research.  
- **Summarization and Evidence-based NLP**: NLP holds potential for tasks like summarizing medical studies or resolving contradictory information.  
   - Example: Resolving debates such as *"Does coffee cause cancer?"* by analyzing various research findings.

---

## **Key Takeaways**  
- **NLP Progress**: Modern NLP systems excel at specific tasks like translation and sentiment analysis but struggle with general reasoning and complex discourse.  
- **Role of Machine Learning**: Machine learning has revolutionized NLP by enabling systems to learn from vast data instead of relying on hand-coded rules.  
- **Remaining Challenges**: Ambiguity, data limitations, and logical reasoning tasks remain key hurdles for NLP advancements.  