<div align="center">
    <img src="https://www.sharif.ir/documents/20124/0/logo-fa-IR.png/4d9b72bc-494b-ed5a-d3bb-e7dfd319aec8?t=1609608338755" alt="Logo" width="200">
    <p><b> Reinforcement Learning Course, Dr. Rohban</b></p>
</div>


*Full Name:*

*Student Number:*


# Random Network Distillation 

## Overview

RND (Random Network Distillation) was first proposed in [Exploration by Random Network Distillation](https://arxiv.org/abs/1810.12894), which introduces an exploration bonus for deep reinforcement learning that is easy to implement and adds minimal computational overhead. The bonus is the error of a neural network that predicts features of the observations produced by a fixed, randomly initialized neural network. The authors report that, to their knowledge, RND is the first method to achieve better-than-average human performance on Montezuma’s Revenge without using demonstrations or having access to the underlying state of the game.

## Quick Facts

1. The insight behind exploration approaches is that we first establish a way to measure the **novelty of a state**, that is, how well we know the state or how many times we have visited states similar to it. We then assign an exploration reward in proportion to that novelty measure. If a state is more novel (it has been explored only a few times), the agent receives a larger intrinsic reward. Conversely, if the agent is more familiar with a state (it has been explored many times), the agent receives a smaller intrinsic reward.

2. RND is a **prediction-error-based** exploration approach that can be applied in non-tabular cases. The main idea of prediction-error-based approaches is to define the intrinsic reward as the prediction error on a problem related to the agent’s transitions, such as learning a forward dynamics model, learning an inverse dynamics model, or even a randomly generated problem, which is the case in the RND algorithm.

3. RND involves **two neural networks**: a fixed and randomly initialized target network which sets the prediction problem, and a predictor network trained on data collected by the agent.

4. In the RND paper, the underlying base RL algorithm is PPO, which is an on-policy method. More generally, the RND intrinsic reward model can be combined with many other RL algorithms, such as DDPG, TD3, and **SAC**.
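
The two-network idea in item 3 can be illustrated with a small NumPy sketch. This is not the assignment's `Brain/model.py` implementation: the linear "networks", the dimensions, the learning rate, and the split into `visited` and `novel` states are all assumptions chosen to keep the example short. It only shows that the predictor's error shrinks on states it was trained on and stays large on states it has never seen.

```python
import numpy as np

rng = np.random.default_rng(0)
obs_dim, feat_dim = 8, 4  # hypothetical sizes for illustration

# Target network: fixed random weights, never trained (linear here for brevity).
W_target = rng.normal(size=(obs_dim, feat_dim))
# Predictor network: starts at zero and is trained to imitate the target.
W_pred = np.zeros((obs_dim, feat_dim))

def intrinsic_reward(obs, W_pred, W_target):
    """Squared prediction error per observation, used as the exploration bonus."""
    err = obs @ W_pred - obs @ W_target
    return (err ** 2).sum(axis=1)

# "Visited" states only vary in the first four dimensions;
# "novel" states only vary in the last four, which the predictor never sees.
visited = rng.normal(size=(256, obs_dim))
visited[:, 4:] = 0.0
novel = rng.normal(size=(256, obs_dim))
novel[:, :4] = 0.0

# Train the predictor on visited states with plain gradient descent on the MSE.
lr = 0.05
for _ in range(300):
    err = visited @ W_pred - visited @ W_target
    W_pred -= lr * 2.0 * visited.T @ err / len(visited)

bonus_visited = intrinsic_reward(visited, W_pred, W_target).mean()
bonus_novel = intrinsic_reward(novel, W_pred, W_target).mean()
print(f"visited: {bonus_visited:.6f}, novel: {bonus_novel:.3f}")
```

In a real agent both networks would be nonlinear and the predictor would be trained online on the observations the policy collects, but the role of each network is the same.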

## Key Equations or Key Graphs

The following two graphs are from OpenAI’s blog. The overall sketch of RND is as follows:

### Random Network Distillation

<div style="text-align: center; margin: 20px 0;">
    <img src="https://opendilab.github.io/DI-engine/_images/rnd.png" 
         alt="RND Architecture Diagram" 
         style="max-width: 70%; border: 1px solid #ddd; box-shadow: 0 4px 8px rgba(0,0,0,0.1);">
    <p style="font-style: italic; color: #666;">Random Network Distillation (RND) Architecture</p>
</div>

The overall sketch of the next-state prediction exploration method is as follows:


<div style="text-align: center; margin: 20px 0;">
    <img src="https://opendilab.github.io/DI-engine/_images/rnd.png" 
         alt="RND Architecture Diagram" 
         style="max-width: 70%; border: 1px solid #ddd; box-shadow: 0 4px 8px rgba(0,0,0,0.1);">
    <p style="font-style: italic; color: #666;">Random Network Distillation (RND) Architecture</p>
</div>




## Prediction Error Factors in RND

In the RND paper, the authors point out that prediction errors can be attributed to the following factors:

1. **Amount of training data**  
   Prediction error is high where few similar examples were seen by the predictor.

2. **Stochasticity**  
   Prediction error is high because the target function is stochastic. Stochastic transitions are a source of such error for forward dynamics prediction.

3. **Model misspecification**  
   Prediction error is high because information necessary for the prediction is missing, or the model class of predictors is too limited to fit the complexity of the target function.

---



## Implementation Details That Matter

### 1. Intrinsic Reward Normalization and Weight Factors
- **Normalization Method**: Min-max normalization  
  `normalized_reward = (reward - batch_min) / (batch_max - batch_min)`  
  (Scales intrinsic reward to [0,1] range)
  
- **Weight Factors**:
  - For MiniGrid: Last non-zero positive reward × 1000
  - General case: Can use max game length as weight factor
  - Critical for balancing exploration vs exploitation
  - Experiments show proper weighting is essential for good performance in MiniGrid environments
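
As a rough illustration of the min-max formula above, the following sketch normalizes one batch of intrinsic rewards. The function name `min_max_normalize` and the `eps` term are assumptions added here (the `eps` guards against a batch in which every reward is identical); the weight factor is deliberately left out because its exact use depends on the environment.

```python
import numpy as np

def min_max_normalize(rewards, eps=1e-8):
    """Scale a batch of intrinsic rewards to the [0, 1] range.

    eps is an assumption added to avoid division by zero when the
    batch minimum equals the batch maximum.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    batch_min, batch_max = rewards.min(), rewards.max()
    return (rewards - batch_min) / (batch_max - batch_min + eps)

normalized = min_max_normalize([2.0, 4.0, 6.0])  # spread-out batch
constant = min_max_normalize([3.0, 3.0, 3.0])    # degenerate batch maps to zeros
print(normalized, constant)
```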

### 2. Observation Normalization
- **Process**:
  1. Subtract running mean
  2. Divide by running standard deviation
  3. Clip values to [-5, 5] range
- **Initialization**:
  - Use random agent to collect normalization parameters
  - Small number of steps before training begins
- **Application**:
  - Same normalization for predictor and target networks
  - Different normalization for policy network
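
The three-step process above can be sketched as a small running normalizer. The class name `RunningObsNorm`, its attributes, and the batch-wise variance merge are assumptions made for illustration and are not taken from the assignment code; only the subtract-mean, divide-by-std, clip-to-[-5, 5] steps come from the list above.

```python
import numpy as np

class RunningObsNorm:
    """Tracks a running mean/variance and normalizes observations with clipping."""

    def __init__(self, shape, clip=5.0, eps=1e-8):
        self.mean = np.zeros(shape)
        self.var = np.ones(shape)
        self.count = 0
        self.clip = clip
        self.eps = eps

    def update(self, batch):
        """Merge the statistics of a new batch into the running statistics."""
        batch = np.asarray(batch, dtype=np.float64)
        b_mean, b_var, b_n = batch.mean(axis=0), batch.var(axis=0), batch.shape[0]
        delta = b_mean - self.mean
        total = self.count + b_n
        # Combine the two sums of squared deviations (parallel variance formula).
        m2 = self.var * self.count + b_var * b_n + delta ** 2 * self.count * b_n / total
        self.mean = self.mean + delta * b_n / total
        self.var = m2 / total
        self.count = total

    def __call__(self, obs):
        """Subtract the running mean, divide by the running std, clip to [-clip, clip]."""
        z = (np.asarray(obs, dtype=np.float64) - self.mean) / np.sqrt(self.var + self.eps)
        return np.clip(z, -self.clip, self.clip)

# Usage: initialize from observations gathered by a random agent (simulated here).
rng = np.random.default_rng(0)
warmup_obs = rng.normal(loc=3.0, scale=2.0, size=(1000, 4))
obs_norm = RunningObsNorm(shape=(4,))
obs_norm.update(warmup_obs[:500])
obs_norm.update(warmup_obs[500:])
clipped = obs_norm(np.full((1, 4), 1e6))  # an extreme observation gets clipped
```

The same `obs_norm` instance would be applied to the inputs of both the predictor and the target network.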

### 3. Non-Episodic Intrinsic Reward and Two Value Heads
- **Non-Episodic Setting**:
  - Returns continue across episodes ("game over" doesn't reset)
  - Leads to more exploration without extrinsic rewards
- **Dual Value Heads**:
  - Recommended when combining episodic extrinsic returns with non-episodic intrinsic returns, so that each return stream is fitted by its own value head
- **Discount Factors**:
  - Extrinsic rewards: γ = 0.999 (higher for better performance)
  - Intrinsic rewards: γ = 0.99 (lower to maintain exploration)
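
To make the episodic versus non-episodic distinction concrete, here is a minimal sketch that computes plain discounted returns for the two reward streams. The helper `discounted_returns`, the toy reward sequences, and the convention that `dones[t]` marks the last step of an episode are assumptions for illustration; the actual training code may use GAE or a different layout.

```python
def discounted_returns(rewards, dones, gamma, episodic):
    """Discounted return at every step of a rollout.

    If episodic is True, the return is cut at episode boundaries
    (dones[t] is True on the last step of an episode). If False,
    the return keeps accumulating across "game over" events.
    """
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        if episodic and dones[t]:
            running = 0.0  # nothing after a terminal step counts
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

dones = [False, True, False, True]   # two short episodes in one rollout
ext_rewards = [0.0, 1.0, 0.0, 1.0]   # toy extrinsic rewards
int_rewards = [1.0, 1.0, 1.0, 1.0]   # toy intrinsic rewards

# Extrinsic stream: episodic, gamma = 0.999. Target for the first value head.
ext_returns = discounted_returns(ext_rewards, dones, gamma=0.999, episodic=True)
# Intrinsic stream: non-episodic, gamma = 0.99. Target for the second value head.
int_returns = discounted_returns(int_rewards, dones, gamma=0.99, episodic=False)
# For comparison only: what the intrinsic return would be if it were episodic.
int_returns_episodic = discounted_returns(int_rewards, dones, gamma=0.99, episodic=True)
```

In this toy rollout the non-episodic intrinsic return at the first step is larger than its episodic counterpart, because novelty found after the "game over" still counts toward it.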



# Setup Code
Before getting started we need to run some boilerplate code to set up our environment. You'll need to rerun this setup code each time you start the notebook.

First, run this cell to load the [autoreload](https://ipython.readthedocs.io/en/stable/config/extensions/autoreload.html?highlight=autoreload) extension. This allows us to edit `.py` source files and re-import them into the notebook for a seamless editing and debugging experience.

In [None]:
%load_ext autoreload
%autoreload 2

#### In the following cell, you will mount your Google Drive and point the notebook to the assignment folder, if you are using Google Colab (the preferred setup)

In [None]:
# ----------------------------
# 1. Mount Google Drive
# ----------------------------
from google.colab import drive
drive.mount('/content/drive')

# ----------------------------
# 2. Go to the project directory
# ----------------------------
import os

# TODO: Fill in the Google Drive path where you uploaded the assignment
# Example: If you create a 2020FA folder and put all the files under A1 folder, then '2020FA/A1'
# GOOGLE_DRIVE_PATH_AFTER_MYDRIVE = '2020FA/A1'
GOOGLE_DRIVE_PATH_AFTER_MYDRIVE = 
GOOGLE_DRIVE_PATH = os.path.join('drive', 'My Drive', GOOGLE_DRIVE_PATH_AFTER_MYDRIVE)
print(os.listdir(GOOGLE_DRIVE_PATH))

# ----------------------------
# 2. Install dependencies
# ----------------------------

In [None]:
!pip install -r requirements.txt

# ----------------------------
# 3. Introduction
# ----------------------------
Welcome to the Random Network Distillation (RND) + PPO Homework!

In this assignment, you will:
- Understand and implement parts of the RND-based exploration method.
- Train an agent using PPO + RND in the MiniGrid environment.
- Analyze learning curves and evaluate agent performance.

Modules to implement:
- Complete missing parts in `Brain/brain.py` (intrinsic reward, RND loss).
- Complete missing parts in `Brain/model.py` (TargetModel, PredictorModel).


# ----------------------------
# 4. Student Instructions
# ----------------------------
> Please open and edit the following files:
- `Brain/brain.py`
- `Brain/model.py`

> Specifically, look for `TODO` markers in the code and complete the necessary parts.

After you have filled in the missing parts, you can proceed to train the agent.

# ----------------------------
# 5. Train the Agent
# ----------------------------

Now that you've completed the TODOs, let's train your agent!
This will launch the main script with training from scratch.



In [None]:
!python main.py --train_from_scratch

# ----------------------------
# 6. Visualize Logs
# ----------------------------
Launch TensorBoard to monitor your training logs.



In [None]:
# Start TensorBoard
%load_ext tensorboard
%tensorboard --logdir Logs

# --------------------------------------------------
# End of Starter Notebook
# --------------------------------------------------