### Introduction to Machine Teaching

### University of Virginia
### Reinforcement Learning
#### Last updated: April 11, 2025

---


### SOURCES 

- Mastering Reinforcement Learning with Python, Enes Bilgin. Chapter 10

### LEARNING OUTCOMES

- Explain the benefit of Machine Teaching (MT)
- Describe different methods for MT
- Apply Reward Shaping to teach an agent to reach a goal
- Describe a method for preventing impossible / unwanted actions from a given state
- Explain the idea of Curriculum Learning

### CONCEPTS

- Machine Teaching
- Concept (part of the skillset)
- Reward Shaping
- Action Masking
- Curriculum Learning

---  

### I. Machine Teaching

RL can require a large amount of data because the agent may be exposed only to training examples, with no guidance

*Machine Teaching (MT)* focuses on extracting knowledge from a *teacher*

A teacher is a subject-matter expert on a topic

This can be more efficient (fewer samples / less training / less compute needed)

This is certainly true for humans learning from a teacher as well

The teacher infuses knowledge into the machine

But how to do this?

We will touch on several approaches

---

### II. Concept

A *concept* is a part of the skillset needed to solve a problem

Breaking a problem down into concepts can help, since each concept is easier to learn than the full task

Consider learning:

- How to play chess if rewards are only given at the end.  
  Can be helpful to have **intermediate rewards** (capturing a queen, having a strong opening)

- How to play basketball if rewards are based on final score. Can be helpful for learning skills such as:
  - Passing
  - Rebounding
  - Shooting
  - Dribbling
 
---

### III. Reward Shaping

Several properties of a problem can make it hard for the agent to learn the optimal policy, such as:

- **Sparse rewards.** If feedback doesn't come often, it can be hard for the agent to learn if it's taking the right actions

- **Attribution Problem.** When a sparse reward does arrive, it is hard to know which action led to it (this is also known as the credit assignment problem)

- **Qualitative objectives.** A task such as *walking* can be hard to define

- **Multi-objective task.** It may be necessary for the agent to learn to balance multiple priorities.  
    Autonomous driving involves learning how to balance things like:
    - speed
    - safety
    - fuel efficiency
  
    Ultimately, these components need to be weighted and combined into a scalar value.
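
A minimal sketch of this weighting in Python. The component scores, their scaling to $[0,1]$, the function name, and the weights are all hypothetical and chosen only for illustration.

```python
def scalar_reward(speed, safety, fuel_efficiency, weights=(0.3, 0.5, 0.2)):
    """Combine three objective scores into a single scalar reward.

    Each score is assumed to be pre-scaled to [0, 1] (higher is better).
    The default weights are illustrative assumptions, not tuned values.
    """
    w_speed, w_safety, w_fuel = weights
    return w_speed * speed + w_safety * safety + w_fuel * fuel_efficiency


# A fast but risky step versus a slower, safer step
risky = scalar_reward(speed=0.9, safety=0.2, fuel_efficiency=0.5)
safe = scalar_reward(speed=0.5, safety=0.9, fuel_efficiency=0.6)
print(risky, safe)
```

Changing the weights changes which behavior the agent prefers, so the weights are part of the reward design.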

---

*Reward shaping* involves designing a function that moves the agent towards success states and away from failure states. 

Positive rewards are given for moving toward good states  
Negative rewards are given for moving toward bad states 

There are important considerations for this to work well, such as:
- Providing the right incentive to reach a goal (and not linger near the goal)
- Providing the right incentive to end ASAP when needed
- Taking into account the relative size of rewards
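
The lingering issue can be illustrated with a small calculation. This is a sketch under assumed numbers: the step rewards, goal bonus, episode lengths, and discount factor are hypothetical.

```python
def discounted_return(rewards, gamma=0.99):
    """Return the discounted sum of a reward sequence."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total


def episode_rewards(step_reward, steps, goal_bonus=None):
    """Rewards for `steps` steps near the goal, optionally ending at the goal."""
    rewards = [step_reward] * steps
    if goal_bonus is not None:
        rewards.append(goal_bonus)
    return rewards


# Design A: +1 per step for being near the goal, +10 for reaching it
linger_a = discounted_return(episode_rewards(1.0, 100))
finish_a = discounted_return(episode_rewards(1.0, 5, goal_bonus=10.0))

# Design B: -0.1 per step, +10 for reaching the goal
linger_b = discounted_return(episode_rewards(-0.1, 100))
finish_b = discounted_return(episode_rewards(-0.1, 5, goal_bonus=10.0))

print(linger_a > finish_a)  # lingering pays more under design A
print(finish_b > linger_b)  # finishing pays more under design B
```

Under design A the agent is paid to stay near the goal without finishing. The small step penalty in design B gives it a reason to end the episode quickly.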

**Warning:**  
The agent learns behavior based on the rewards.  
Sometimes the agent learns behavior that maximizes reward but isn't what the designer had in mind. 

**Question:** Suppose we design a reward function that gives a constant reward for all states. Will the agent learn? Explain your answer.

---

Next, we look at reward shaping examples.

**Example 1: Shaped Reward Function**

Imagine a robot whose position lies in $[-1,1]$  
The figure below shows a reward function for moving the agent toward the goal state at position 1  
As the agent moves from 0 to 1, the reward grows larger (by the square of the state value)  
As the agent moves from 0 to -1, the penalty grows larger 


<img src="./shaped_reward1.png">
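
One function with this shape is sketched below. It assumes the penalty on the negative side is also quadratic; the exact form used in the figure is not given here, and the function name is hypothetical.

```python
def shaped_reward(position):
    """Signed quadratic reward on [-1, 1].

    Positive and growing as the position approaches the goal at 1,
    negative and growing in magnitude as the position approaches -1.
    The quadratic penalty side is an assumption.
    """
    if not -1.0 <= position <= 1.0:
        raise ValueError("position must lie in [-1, 1]")
    return position * abs(position)


for x in (-1.0, -0.5, 0.0, 0.5, 1.0):
    print(x, shaped_reward(x))
```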

**Example 2: Designing Rewards to Prevent Sepsis**

In the paper *Deep Reinforcement Learning for Sepsis Treatment* by Raghu et al., the task is to prevent sepsis by controlling **SOFA** and **Lactate**. 

These measures are proxies for overall patient health.

This is a multi-objective task.

The reward function is composed of intermediate rewards and terminal rewards: 

<img src="./raghu_sepsis.png" height="100" width="800">

---

### IV. Action Masking

From a given state, taking certain actions may be unwanted or even impossible.  

Examples:
- Prescribing a medication dose which is likely to put a patient at high risk
- Moving a robot off a cliff

We can prevent the agent from taking such actions with *action masking*

This can avoid transitions to bad states and also limit the possible action space

How to do this?

- For value-based algorithms like Q-Learning, assign a value of $-\infty$ to the masked state-action pairs: $Q(s,a)=-\infty$
- For policy gradient algorithms, set the logits of the masked actions to $-\infty$. The softmax then assigns probability zero to the unwanted actions given the state.
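
Both cases can be sketched in plain Python. The function names and the boolean mask format (`True` means the action is allowed) are assumptions for illustration, and at least one action must be allowed.

```python
import math


def apply_mask(values, mask):
    """Replace the value of every disallowed action with -infinity."""
    if not any(mask):
        raise ValueError("at least one action must be allowed")
    return [v if allowed else -math.inf for v, allowed in zip(values, mask)]


def masked_greedy_action(q_values, mask):
    """Value-based case: pick the best action among the allowed ones."""
    masked = apply_mask(q_values, mask)
    return max(range(len(masked)), key=masked.__getitem__)


def masked_softmax(logits, mask):
    """Policy case: softmax over masked logits; masked actions get probability 0."""
    masked = apply_mask(logits, mask)
    top = max(masked)  # subtract the max for numerical stability
    exps = [math.exp(z - top) for z in masked]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


mask = [True, False, True]  # action 1 is not allowed in this state
print(masked_greedy_action([1.0, 5.0, 2.0], mask))
print(masked_softmax([0.5, 3.0, 0.5], mask))
```

Action 1 has the highest Q-value and the highest logit, but the mask keeps it from ever being selected.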

---

### V. Curriculum Learning

First we review the [Mountain Car Problem](https://gymnasium.farama.org/environments/classic_control/mountain_car/)

This is a moderately difficult problem in RL

A car is placed stochastically at the bottom of a sinusoidal valley (see below)

State space:
- position of the car along the x-axis
- velocity of the car

Action space: 

0: Accelerate to the left  
1: Don’t accelerate  
2: Accelerate to the right


Goal: reach the flag at the top of the mountain on the right

<img src="./mtn_car.png">

**Curriculum Learning Strategy**

Start with easy environment configurations for the agent and successively increase the difficulty

Examples:
- For the task of a robot arm grasping an object, the first lesson can situate the arm close to the object 
- For the mountain car problem, the first lesson can place the car closer to the flag (the standard problem already starts the car near the valley dip)

For each task, iteratively make the environment more challenging until it matches the full problem
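
A minimal sketch of a lesson schedule for the mountain car, assuming that starts nearer the flag are easier. The final range $[-0.6, -0.4]$ is the default start range in the Gymnasium documentation. The earlier ranges, the promotion threshold, and the function names are hypothetical.

```python
import random

# Start-position ranges per lesson, from easiest (assumed) to the full problem.
# Only the last range is the environment's default; the others are illustrative.
LESSONS = [
    (0.2, 0.4),    # lesson 0: start close to the flag
    (-0.1, 0.2),   # lesson 1
    (-0.4, -0.1),  # lesson 2
    (-0.6, -0.4),  # lesson 3: full problem
]


def sample_start(lesson, rng=random):
    """Sample an initial car position for the given lesson."""
    low, high = LESSONS[lesson]
    return rng.uniform(low, high)


def next_lesson(lesson, recent_successes, threshold=0.8):
    """Advance one lesson once the recent success rate reaches the threshold."""
    if not recent_successes:
        return lesson
    rate = sum(recent_successes) / len(recent_successes)
    if rate >= threshold and lesson < len(LESSONS) - 1:
        return lesson + 1
    return lesson


lesson = 0
lesson = next_lesson(lesson, [True] * 9 + [False])  # 90% success: promote
print(lesson, sample_start(lesson))
```

In a training loop, `recent_successes` would hold the outcomes of the latest evaluation episodes, and the sampled position would be used to reset the environment.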

**Caveat:** The "what got you here won't get you there" problem.

The solution for the easy problem may not work for the harder problem.

---