# Notes

Many reinforcement learning environments rely on well-defined extrinsic rewards. When those rewards are unavailable, agents often fall back on intrinsic rewards.

With no extrinsic reward at all, an agent can rely on an intrinsic reward such as curiosity alone. Humans seem to use curiosity as their primary optimization function during early development.

> Indeed, there is evidence that pre-training an agent on a given environment using only intrinsic rewards allows it to learn much faster when fine-tuned to a novel task in a novel environment.

Training with curiosity as the sole objective appears to be a good form of pre-training, with value when fine-tuning on specific tasks later on.

> The central idea is to represent intrinsic reward as the error in predicting the consequence of the agent’s action given its current state, i.e., the prediction error of learned forward-dynamics of the agent.

This closely resembles the free energy principle. Minimizing free energy is about minimizing prediction error with an active element, which is essentially what curiosity-driven learning is doing.

> To ensure stable online training of dynamics, we argue that the desired embedding space should: (a) be compact in terms of dimensionality, (b) preserve sufficient information about the observation, and (c) be a stationary function of the observations.

The paper contributes:

1. A large-scale study of the efficacy of curiosity-based learning across different tasks and environments.
2. A comparison of the different feature spaces used for learning dynamics-based curiosity.
3. An exploration of the limitations of the prediction-error-based curiosity formulation.

### Dynamics-Based Curiosity Driven Learning

> We want to incentivize the agent with a reward $r_t$ relating to how informative the state transition was.

This is accomplished with a network that embeds observations into representations $\phi(x)$, and a forward dynamics network that predicts the representation of the next state from the current state and action, $p(\phi(x_{t+1})|x_t, a_t)$.

Then, the exploration reward is given by the **surprisal** of the model given a transition tuple: $r_t = -\log p(\phi(x_{t+1})|x_t, a_t)$.

This means that the model will favor maximally exploring areas of the environment that it least understands, which may be areas that are unexplored or areas with complex dynamics.

This determines the reward function for the RL environment, which can then be optimized with any deep RL training method.
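As a rough sketch of the surprisal reward: if we assume (this is an assumption for illustration, not something stated in these notes) that the forward model predicts a Gaussian with fixed variance over the next features, then $-\log p$ reduces to a scaled squared prediction error plus a constant. The function name `surprisal_reward` and the parameter `sigma` are hypothetical.

```python
import numpy as np

def surprisal_reward(pred_next_feat, next_feat, sigma=1.0):
    """Surprisal -log p(phi(x_{t+1}) | x_t, a_t) for each transition.

    Assumption: the forward model outputs the mean of an isotropic
    Gaussian with fixed standard deviation `sigma`.
    """
    pred_next_feat = np.asarray(pred_next_feat, dtype=float)
    next_feat = np.asarray(next_feat, dtype=float)
    d = next_feat.shape[-1]
    # Squared prediction error, summed over the feature dimension.
    sq_err = np.sum((next_feat - pred_next_feat) ** 2, axis=-1)
    # Log-normalizer of the Gaussian; constant across transitions.
    const = 0.5 * d * np.log(2.0 * np.pi * sigma ** 2)
    return sq_err / (2.0 * sigma ** 2) + const

# Two transitions with 3-dimensional features: one predicted perfectly,
# one with a large error in a single feature.
pred = np.array([[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]])
actual = np.array([[0.0, 0.0, 0.0], [1.0, 3.0, 1.0]])
rewards = surprisal_reward(pred, actual)
```

The badly predicted transition gets the larger reward, which is the "favor what you least understand" behavior described above. Since the constant term does not depend on the transition, it does not change which transitions are preferred.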

**1. Feature spaces for forward dynamics**

> A good choice of feature space can make the prediction task more tractable and filter out irrelevant aspects of the observation space.

A good feature space should be compact (the relevant information can be modeled in a lower-dimensional space), sufficient (the representation contains all the information needed to predict the dynamics), and stable (the reward is already non-stationary because what is novel becomes old and boring over time; features that also change as they are learned add a second source of non-stationarity, which should be minimized).

They evaluate a few different models for the feature space:

1. **Pixels** - the identity transformation where $\phi(x) = x$. This is stable, but the high-dimensional observation space makes the dynamics hard to learn.
2. **Random Features** - the embedding network is fixed after random initialization. Stable and more complex than the identity, but may still be insufficient.
3. **VAEs** - fit latent variable generative models $p(x, z)$ for observed data $x$ and latent variable $z$ with prior $p(z)$ using variational inference. The mapping to the mean can be used as the embedding network $\phi$. This filters out more noise, but the features keep changing as the VAE trains.
4. **Inverse Dynamics Features (IDFs)** - given $(s_t, s_{t+1}, a_t)$, the inverse dynamics task is to predict the action $a_t$ given the previous and next states $s_t$ and $s_{t+1}$. The embedding learned for this task is used as $\phi$.
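To make the first two options concrete, here is a minimal sketch of the pixel and random-feature embeddings. It assumes flattened observations and a single random linear layer with a ReLU, which is a simplification chosen for illustration rather than the architecture used in the paper; `RandomFeatures` and `pixels` are hypothetical names.

```python
import numpy as np

def pixels(x):
    """Identity embedding: phi(x) = x."""
    return np.asarray(x, dtype=float)

class RandomFeatures:
    """Fixed random embedding phi(x) = relu(x W^T + b).

    W and b are sampled once at construction and never updated,
    so the feature space is stationary by design.
    """

    def __init__(self, obs_dim, feat_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(0.0, 1.0 / np.sqrt(obs_dim), size=(feat_dim, obs_dim))
        self.b = np.zeros(feat_dim)

    def __call__(self, x):
        x = np.asarray(x, dtype=float)
        return np.maximum(x @ self.W.T + self.b, 0.0)

# A batch of 4 flattened observations with 64 values each.
obs = np.random.default_rng(1).random((4, 64))
phi_rf = RandomFeatures(obs_dim=64, feat_dim=8)
feats = phi_rf(obs)
```

Calling `phi_rf` twice on the same observation always gives the same features, which is the stability property; whether 8 random features keep enough information is exactly the sufficiency question raised above.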

**2. Practical considerations in training an agent driven purely by curiosity**

- They use PPO for all their experiments
- They normalize the scale of rewards so the value function can learn quickly
- They normalize the advantages in PPO
- They normalize the observations when training
- They use many parallel actors
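The normalization steps in the list above can be sketched roughly as follows. The exact statistics used are an assumption here: advantages are standardized per batch, rewards are divided by a standard deviation of returns supplied by the caller, and observations are standardized with a given mean and standard deviation. All function names are hypothetical.

```python
import numpy as np

def normalize_advantages(adv, eps=1e-8):
    """Standardize a batch of advantages to zero mean and unit variance."""
    adv = np.asarray(adv, dtype=float)
    return (adv - adv.mean()) / (adv.std() + eps)

def scale_rewards(rewards, returns_std, eps=1e-8):
    """Rescale rewards by an estimate of the standard deviation of returns.

    Only the scale changes; rewards are not mean-centered.
    """
    return np.asarray(rewards, dtype=float) / (returns_std + eps)

def normalize_observations(obs, mean, std, eps=1e-8):
    """Standardize observations with precomputed statistics."""
    return (np.asarray(obs, dtype=float) - mean) / (std + eps)

adv = np.array([1.0, 2.0, 3.0, 4.0])
norm_adv = normalize_advantages(adv)
scaled = scale_rewards([2.0, 4.0], returns_std=2.0)
norm_obs = normalize_observations([10.0, 20.0], mean=15.0, std=5.0)
```

Keeping reward scale roughly constant matters more than usual here because the intrinsic reward shrinks as the forward model improves, so the value function would otherwise be chasing a target whose magnitude keeps changing.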

**3. “Death is not the end:” discounted curiosity with infinite horizon**

The done signal leaks information about the external environment to the agent, because it indicates that a goal has been reached or that the agent has died. To remove this bias, they make the game loop back to the beginning on done, so the only signal left is curiosity.

The agent still learns to avoid dying, simply because death puts it back at the start of the game, a region it already predicts well and that therefore offers little curiosity reward.
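One way to see the difference is in how discounted returns are computed. The sketch below is an illustration under simple assumptions (a single rollout, no value bootstrap at the end); `discounted_returns` and its `use_done` flag are hypothetical names. With `use_done=True` the return is cut at episode boundaries, and with `use_done=False` the boundary is ignored and reward keeps flowing across the reset.

```python
import numpy as np

def discounted_returns(rewards, dones, gamma, use_done):
    """Discounted return at each step of a rollout.

    dones[t] is True when the episode ended after step t.
    """
    rewards = np.asarray(rewards, dtype=float)
    returns = np.zeros_like(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        if use_done and dones[t]:
            # Episodic setting: nothing after a terminal step counts.
            running = 0.0
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

rewards = [1.0, 1.0, 1.0, 1.0]
dones = [False, True, False, False]  # the agent dies after step 1

episodic = discounted_returns(rewards, dones, gamma=0.5, use_done=True)
infinite = discounted_returns(rewards, dones, gamma=0.5, use_done=False)
```

In the infinite-horizon version, dying is only bad insofar as the steps after the reset yield low curiosity reward, which matches the explanation above.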

### Experiments

**1. Curiosity-driven learning without extrinsic rewards**

![Screenshot 2024-11-05 at 5.18.15 PM.png](../../../images/Screenshot_2024-11-05_at_5.18.15_PM.png)

> A pure curiosity-driven agent can learn to obtain external rewards even without using any extrinsic rewards during training.

RF and IDF models performed best.

> We found that an IDF-curious agent collects more game reward than a random agent in 75% of the Atari games, an RF-curious agent does better in 70%.

![Screenshot 2024-11-05 at 5.20.29 PM.png](../../../images/Screenshot_2024-11-05_at_5.20.29_PM.png)

> This result suggests that the performance of a purely curiosity-driven agent would improve as the training of base RL algorithm (which is PPO in our case) gets better.

> We see from the episode length that the agent learns to have more and longer rallies over time, learning to play pong without any teacher – purely by curiosity on both sides.

In Pong with a curiosity-driven agent on each side, both agents learn to play longer rallies over time, with no extrinsic reward at all.

**2. Generalization across novel levels in Super Mario Bros.**

![Screenshot 2024-11-05 at 5.25.19 PM.png](../../../images/Screenshot_2024-11-05_at_5.25.19_PM.png)

> These results might suggest that while random features perform well on training environments, learned features appear to generalize better to novel levels.

> Overall, we find some promising evidence showing that skills learned by curiosity help our agent explore efficiently in novel environments.

**3. Curiosity with Sparse External Reward**

Combining curiosity with an extrinsic reward can help to maximize reward faster (though they use a contrived example in this paper).

This is likely what humans use.

However, combining extrinsic and intrinsic rewards isn’t new; achieving this balance is the goal of dual descent entropy maximization training methods, and the purpose of introducing $\epsilon$ (as in $\epsilon$-greedy exploration) in Q-learning.

### Discussion

> There are some Atari games where exploring the environment does not correspond to extrinsic reward.

> More generally, these results suggest that, in environments designed by humans, the extrinsic reward is perhaps often aligned with the objective of seeking novelty.

This makes sense especially in video games, since curiosity is one of the main drives that lead humans to interact with their environments.

> A more serious potential limitation is the handling of stochastic dynamics. If the transitions in the environment are random, then even with a perfect dynamics model, the expected reward will be the entropy of the transition, and the agent will seek out transitions with the highest entropy.

Seeking out entropy is only useful to a point. The agent may converge on sources of noise, which are not a valuable source of learning. It would be interesting to see a form of curiosity-driven learning that isn’t just focused on unpredicted data but on new signal that has structure.
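The claim in the quote can be checked numerically for a toy case. This sketch assumes discrete next states and a dynamics model that matches the true transition distribution exactly; `expected_surprisal` is a hypothetical helper. Under those assumptions the expected reward is the entropy of the transition, so a uniformly random "channel" pays out forever while a deterministic transition pays nothing.

```python
import numpy as np

def expected_surprisal(true_probs, model_probs):
    """Expected value of -log model_probs under true_probs.

    With model_probs == true_probs this is the entropy of the transition.
    """
    true_probs = np.asarray(true_probs, dtype=float)
    model_probs = np.asarray(model_probs, dtype=float)
    # Outcomes with zero probability contribute nothing to the expectation.
    mask = true_probs > 0
    return float(-np.sum(true_probs[mask] * np.log(model_probs[mask])))

deterministic = [1.0, 0.0, 0.0, 0.0]  # next state is fully predictable
noisy_tv = [0.25, 0.25, 0.25, 0.25]   # next state is uniformly random

# Perfect model in both cases: the model equals the true distribution.
r_det = expected_surprisal(deterministic, deterministic)
r_tv = expected_surprisal(noisy_tv, noisy_tv)
```

Even with nothing left to learn, the noisy source keeps an expected reward of $\log 4$, which is why a prediction-error reward cannot tell irreducible noise from learnable structure.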

![Screenshot 2024-11-05 at 5.33.14 PM.png](../../../images/Screenshot_2024-11-05_at_5.33.14_PM.png)

When they add a noisy TV to the training environment, the agent just gets stuck there forever and learning halts. This needs to be addressed.
