<a href="https://colab.research.google.com/github/asia281/rl2023/blob/main/Asia_of_Lab_09_Imitation_Learning_(with_gaps).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<center><img src='https://i.postimg.cc/TPR1n1rp/AI-Tech-PL-RGB.png' height="60"></center>

AI TECH - Academy of Innovative Applications of Digital Technologies. Operational Programme Digital Poland 2014-2020
<hr>

<center><img src='https://i.postimg.cc/Gpq2KRQz/logotypy-aitech.jpg'></center>

<center>
Project co-financed by the European Union from the European Regional Development Fund
Operational Programme Digital Poland 2014-2020,
Priority Axis 3 "Digital competences of society", Measure 3.2 "Innovative solutions for digital activation"
Project title: „Akademia Innowacyjnych Zastosowań Technologii Cyfrowych (AI Tech)”
</center>

# Lab 03: Imitation Learning

In this lab, we look into the problem of learning from expert demonstrations.

- Find a policy $\pi(a | s)$ that best imitates the expert policy $\pi^*(a | s)$ in the given environment.
- Note that we don't need access to the environment rewards.
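
One common way to make "best imitates" precise, stated here as an assumption rather than as this lab's definition, is to minimize an expected per-state loss $\ell$ between the two policies under some state distribution $d$:

$$\hat{\pi} = \arg\min_{\pi} \; \mathbb{E}_{s \sim d}\left[\ell\big(\pi(\cdot \mid s), \pi^*(\cdot \mid s)\big)\right]$$

The techniques listed below differ mainly in which states $s$ they collect: only states visited by the expert, or also states visited by the imitator itself.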

Major Imitation Learning techniques are:

1. Behavioural Cloning,
1. Imitation Learning via Interactive Demonstrator e.g. SMILe (Ross and Bagnell, 2010) or DAgger (Ross et al., 2011),
1. Inverse Reinforcement Learning -- out of scope of this lab.

We will solve the Ant problem, shown below, examining the first two approaches.

In [1]:
#@title Mount your Google Drive

#@markdown Your work will be stored in a folder called `rl_lab_2022` by default.

#@markdown Run each section with Shift+Enter

#@markdown Double-click on section headers to show code.

import os
from google.colab import drive
drive.mount('/content/gdrive')

LAB_PATH = '/content/gdrive/MyDrive/rl_lab_2022/imitation_learning'
if not os.path.exists(LAB_PATH):
  %mkdir -p $LAB_PATH

MJC_PATH = '{}/mujoco'.format(LAB_PATH)
if not os.path.exists(MJC_PATH):
    %mkdir $MJC_PATH

Mounted at /content/gdrive


In [2]:
#@title Install requirements

!apt -q update 
!apt install -q -y --no-install-recommends \
        build-essential \
        curl \
        git \
        gnupg2 \
        make \
        cmake \
        ffmpeg \
        swig \
        libz-dev \
        unzip \
        zlib1g-dev \
        libglfw3 \
        libglfw3-dev \
        libxrandr2 \
        libxinerama-dev \
        libxi6 \
        libxcursor-dev \
        libgl1-mesa-dev \
        libgl1-mesa-glx \
        libglew-dev \
        libosmesa6-dev \
        lsb-release \
        ack-grep \
        patchelf \
        wget \
        xpra \
        xserver-xorg-dev \
        xvfb \
        python-opengl \
        ffmpeg
!pip -q install gdown

# Installing dependencies for visualization
!apt-get -qq -y install libcusparse8.0 libnvrtc8.0 libnvtoolsext1 > /dev/null
!ln -snf /usr/lib/x86_64-linux-gnu/libnvrtc-builtins.so.8.0 /usr/lib/x86_64-linux-gnu/libnvrtc-builtins.so
!apt-get -qq -y install xvfb freeglut3-dev ffmpeg> /dev/null
!pip -q install -U gym==0.19
!pip -q install pyglet
!pip -q install pyopengl
!pip -q install pyvirtualdisplay



In [3]:
#@title Download MuJoCo

if not os.path.exists(f'{MJC_PATH}/mujoco210'):
    %cd $MJC_PATH
    !wget -q https://mujoco.org/download/mujoco210-linux-x86_64.tar.gz
    !tar -xzf mujoco210-linux-x86_64.tar.gz
    %rm mujoco210-linux-x86_64.tar.gz

In [4]:
#@title Install MuJoCo

import os

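# Use .get() so this also works when LD_LIBRARY_PATH is not set yet
os.environ['LD_LIBRARY_PATH'] = os.environ.get('LD_LIBRARY_PATH', '') + ':{}/mujoco210/bin'.format(MJC_PATH)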
os.environ['MUJOCO_PY_MUJOCO_PATH'] = '{}/mujoco210'.format(MJC_PATH)

# Installation on colab does not find *.so files in LD_LIBRARY_PATH,
# copy over manually instead.
!cp $MJC_PATH/mujoco210/bin/*.so /usr/lib/x86_64-linux-gnu/

In [5]:
#@title Clone and install mujoco-py

if not os.path.exists(f'{MJC_PATH}/mujoco-py'):
    %cd $MJC_PATH
    !git clone https://github.com/openai/mujoco-py.git

%cd $MJC_PATH/mujoco-py
!git checkout f1312cceeeebbba17e78d5d77fbffa091eed9a3a # Tested version
%pip install -e .

# Compile at the first import
os.environ['LD_LIBRARY_PATH'] += ':/usr/lib/nvidia'
os.environ['LD_LIBRARY_PATH'] += ':{}/mujoco210/bin'.format(MJC_PATH)
import mujoco_py

/content/gdrive/MyDrive/rl_lab_2022/imitation_learning/mujoco/mujoco-py
M	.dockerignore
M	.gitignore
M	mujoco_py/tests/test_substep.py
M	scripts/gen_wrappers.py
M	vendor/Xdummy-entrypoint
HEAD is now at f1312cc Bump version for release
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Obtaining file:///content/gdrive/MyDrive/rl_lab_2022/imitation_learning/mujoco/mujoco-py
  Installing build dependencies ... done
  Checking if build backend supports build_editable ... done
  Getting requirements to build editable ... done
  Installing backend dependencies ... done
  Preparing editable metadata (pyproject.toml) ... done
Collecting glfw>=1.4.0 (from mujoco-py==2.1.2.14)
  Using cached glfw-2.5.9-py2.py27.py3.py30.py31.py32.py33.py34.py35.py36.py37.py38-none-manylinux2014_x86_64.whl (207 kB)
Collecting fasteners~=0.15 (from mujoco-py==2.1.2.14)
  Using cached fasteners-0.18-py3-none

In [6]:
#@title Download the expert checkpoint

if not os.path.exists(f'{LAB_PATH}/expert_checkpoint'):
    %cd $LAB_PATH
    !gdown --id 1CNhGwvqsLd-H0dwh-4L9rEqIo04CyOLW
    !unzip expert_checkpoint.zip

In [7]:
import glob
import time

from functools import partial

import gym
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf

from base64 import b64encode
from IPython.display import HTML
from pyvirtualdisplay import Display

# Start virtual display
display = Display(visible=0, size=(1024, 768))
display.start()

# Seed random generators
tf.random.set_seed(42)
np.random.seed(42)

# Helpers

def show_video(file_name):
    """Embeds the given MP4 file in the notebook output."""
    with open(file_name, 'rb') as f:
        mp4 = f.read()
    data_url = "data:video/mp4;base64," + b64encode(mp4).decode()
    return HTML("""
    <video width=480 controls>
        <source src="%s" type="video/mp4">
    </video>
    """ % data_url)

def mlp(input_shape, output_size, hidden_sizes=(256, 256), hidden_activation=tf.tanh, output_activation=None, l2_weight=0.0001):
    """Creates MLP with the specified parameters."""
    model = tf.keras.Sequential()

    model.add(tf.keras.Input(shape=input_shape))
    for h in hidden_sizes:
        model.add(tf.keras.layers.Dense(units=h,
                                        activation=hidden_activation,
                                        kernel_regularizer=tf.keras.regularizers.L2(l2_weight)))
    model.add(tf.keras.layers.Dense(units=output_size, activation=output_activation))

    return model

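def run_policy(env, model, total_steps=10000, verbose=True):
    """Runs `model` in `env` for `total_steps` steps; returns observations, actions, rewards, and done flags."""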
    obs_array = np.empty([total_steps, *env.observation_space.shape])
    act_array = np.empty([total_steps, env.action_space.shape[0]])
    rew_array = np.empty([total_steps, 1])
    done_array = np.empty([total_steps, 1])

    iter_time = time.time()
    done = True
    for i in range(total_steps):
        if verbose and (i + 1) % 1000 == 0:
            steps_per_second = 1000 / (time.time() - iter_time)
            print(f'Step {i + 1}/{total_steps}, Steps per second: {steps_per_second}')
            iter_time = time.time()


        if done:
            obs = env.reset()

        act = model(tf.expand_dims(obs, axis=0))[0]
        obs_, rew, done, _ = env.step(act)
        
        obs_array[i] = obs
        act_array[i] = act
        rew_array[i] = rew
        done_array[i] = float(done)

        obs = obs_

    return obs_array, act_array, rew_array, done_array

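def calculate_returns(rew, done):
    """Computes the return of every finished episode from flat reward and done arrays."""
    # Select episode ends by the done flags, not by non-zero cumulative reward
    ends = np.asarray(done).reshape(-1).astype(bool)
    ret_cumsum_trimmed = np.cumsum(rew)[ends]
    # Differences of the cumulative reward at consecutive episode ends
    return np.diff(ret_cumsum_trimmed, prepend=0.0)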

def evaluate_agent(env, model, verbose=False):
    _, _, rew, done = run_policy(env, model, total_steps=25000, verbose=verbose)
    rets = calculate_returns(rew, done)

    print(f'Num. episodes: {len(rets)}')
    print(f'Avg. return: {np.mean(rets)}')
    print(f'Max. return: {np.max(rets)}')
    print(f'Min. return: {np.min(rets)}')

def render_agent(env, model):
    envw = gym.wrappers.Monitor(env, "./", force=True)
    o, d = envw.reset(), False
    while not d:
        envw.render()
        o, _, d, _ = envw.step(model(tf.expand_dims(o, axis=0))[0])
    # envw.close()

    file_name = glob.glob('openaigym.video.*.mp4')[0]
    return show_video(file_name)

class Expert:
    """Streamlined Off-Policy (SOP) actor"""

    def __init__(self, ckpt_path):
        self.model = tf.keras.models.load_model(ckpt_path)

    def __call__(self, obs, exploratory=False):
        # We need to add one more dim. for this model
        mu, pi = self.model(tf.expand_dims(obs, axis=0))
        return pi[0] if exploratory else mu[0]
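
The episode-return bookkeeping used by `calculate_returns` can be checked on a tiny hand-made rollout. The sketch below re-implements the same idea under a hypothetical name, `episode_returns`, and the arrays `rew_toy` and `done_toy` are made up for illustration.

```python
import numpy as np

def episode_returns(rew, done):
    """Sums rewards between consecutive episode ends (hypothetical helper)."""
    ends = np.asarray(done).reshape(-1).astype(bool)
    ret_at_ends = np.cumsum(rew)[ends]
    return np.diff(ret_at_ends, prepend=0.0)

# Five steps: episodes end after steps 2 and 4, the last step is unfinished.
rew_toy = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
done_toy = np.array([[0.0], [1.0], [0.0], [1.0], [0.0]])

print(episode_returns(rew_toy, done_toy))  # [3. 7.]
```

Rewards collected after the last `done` flag belong to an unfinished episode and are not counted, which is why the reward of 5 does not appear in the result.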

## 0. Ant

Ant is a three-dimensional quadrupedal robot.

- Observations are 111-dim. vectors that describe the kinematic properties of the robot,
- Actions are 8-dim. vectors which specify torques to be applied on the robot joints,
- The goal is to run forward as fast as possible without falling over.

In [8]:
%cd $LAB_PATH
expert = Expert('expert_checkpoint')
env = gym.make('Ant-v2')


/content/gdrive/MyDrive/rl_lab_2022/imitation_learning




In [9]:
#render_agent(env, expert)



In [10]:
evaluate_agent(env, expert, verbose=True)

Step 1000/25000, Steps per second: 46.90224513742614
Step 2000/25000, Steps per second: 46.566633082426506
Step 3000/25000, Steps per second: 72.02296682688126
Step 4000/25000, Steps per second: 67.07188968180645
Step 5000/25000, Steps per second: 65.82485282486876
Step 6000/25000, Steps per second: 68.52544502408746
Step 7000/25000, Steps per second: 69.31887048311172
Step 8000/25000, Steps per second: 62.95522813304552
Step 9000/25000, Steps per second: 71.70750777728155
Step 10000/25000, Steps per second: 67.61475761648656
Step 11000/25000, Steps per second: 63.26709373636307
Step 12000/25000, Steps per second: 65.89712240865465
Step 13000/25000, Steps per second: 68.15130859714378
Step 14000/25000, Steps per second: 68.64579727092014
Step 15000/25000, Steps per second: 65.53514497115546
Step 16000/25000, Steps per second: 70.04066422526238
Step 17000/25000, Steps per second: 66.5699382774431
Step 18000/25000, Steps per second: 71.89982793118887
Step 19000/25000, Steps per second: 6

## 1. Behavioural Cloning

1. Collect the expert data.
2. Fit the model (classifier/regressor) to the expert data.
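
The two steps above amount to ordinary supervised learning. As a minimal sketch, assuming a hypothetical linear expert (`W_expert`, `toy_expert`) and a least-squares regressor in place of the neural network used in this lab:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical expert: a fixed linear map from 4-dim. observations to 2-dim. actions.
W_expert = rng.normal(size=(4, 2))

def toy_expert(obs):
    return obs @ W_expert

# 1. Collect the expert data (here: random observations labelled by the expert).
obs_toy = rng.normal(size=(1000, 4))
act_toy = toy_expert(obs_toy)

# 2. Fit a regressor to the expert data (least squares minimizes the MSE).
W_imitator, *_ = np.linalg.lstsq(obs_toy, act_toy, rcond=None)

mse = np.mean((obs_toy @ W_imitator - act_toy) ** 2)
print(f'Training MSE: {mse:.2e}')
```

Because the toy expert is linear and noise-free, the regressor recovers it almost exactly. The Ant expert is neither, which is why the lab fits an MLP instead.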

In [11]:
# Collect the expert data
# obs, act, _, _ = run_policy(env, expert, total_steps=100000)

import gdown
gdown.download('https://drive.google.com/uc?id=1-0FtkebJvIZ0NUTftMyRavqkV_b0nDVl')

with np.load('expert_greedy_data.npz') as data:
    obs, act, _, _ = data.values()

Downloading...
From: https://drive.google.com/uc?id=1-0FtkebJvIZ0NUTftMyRavqkV_b0nDVl
To: /content/gdrive/MyDrive/rl_lab_2022/imitation_learning/expert_greedy_data.npz
100%|██████████| 96.8M/96.8M [00:01<00:00, 50.5MB/s]


In [12]:
# EXERCISE: Create the imitator model observations -> actions
imitator = mlp(111, 8)

# We will start our experiments from the same weights for the fair comparison
init_weights = imitator.get_weights()

In [13]:
# EXERCISE: Fit the model to the expert data
imitator.set_weights(init_weights)

# ANSWER
imitator.compile(loss=tf.keras.losses.MeanSquaredError())
imitator.fit(obs, act, epochs=25)
# END ANSWER

evaluate_agent(env, imitator)

Epoch 1/25
Epoch 2/25
Epoch 3/25
Epoch 4/25
Epoch 5/25
Epoch 6/25
Epoch 7/25
Epoch 8/25
Epoch 9/25
Epoch 10/25
Epoch 11/25
Epoch 12/25
Epoch 13/25
Epoch 14/25
Epoch 15/25
Epoch 16/25
Epoch 17/25
Epoch 18/25
Epoch 19/25
Epoch 20/25
Epoch 21/25
Epoch 22/25
Epoch 23/25
Epoch 24/25
Epoch 25/25
Num. episodes: 48
Avg. return: 1474.6218163591839
Max. return: 4793.094609015083
Min. return: -1272.697136366507


### Exercise

Discuss the questions

1. In principle, do we need the expert policy for BC?

  > Answer: Behavioural cloning only requires a dataset of expert demonstrations, i.e. state-action pairs; it doesn't query the expert policy during training. The expert policy is used solely to generate the demonstration data.

1. What are the problems with BC?

  > Answer: 1. Covariate shift: BC assumes that the training and test data come from the same distribution, but at test time the states are generated by the imitator, not by the expert. 2. Error amplification: small errors left after training can compound at test time. 3. Fragility to changes: the model is sensitive to changes in the input data.

1. How can we help BC do better?

  > Answer: We can use DAgger (Dataset Aggregation), which collects new data by running the current BC policy, gather more expert demonstrations, or add exploration to the data collection (e.g. action noise or eps-greedy).
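
The error amplification mentioned in the answers above has a well-known quantitative form. As a hedged summary of the analysis in Ross and Bagnell (2010) and Ross et al. (2011), under their assumptions (per-step costs in $[0, 1]$, horizon $T$, and a learned policy $\hat{\pi}$ whose expected 0-1 loss under the expert's state distribution is $\epsilon$), the expected total cost $J$ satisfies:

$$J(\hat{\pi}) \le J(\pi^*) + T^2 \epsilon$$

The quadratic dependence on $T$ reflects that one mistake can take the imitator to states the expert never visited, where further mistakes become more likely. Interactive methods such as DAgger are designed to bring this growth down to roughly linear in $T$ under additional assumptions.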

In [22]:
# Collect the exploratory data
def exploratory(obs):
    """Adds the Gaussian noise to the expert actions."""
    act = expert(obs)
    return act + 0.29 * tf.random.normal(tf.shape(act))
# obs_expl, act_expl, rew_expl, done_expl = run_policy(env, exploratory, total_steps=100000)

if not os.path.exists('expert_exploratory_data.npz'):
    !gdown --id 1-9C1hdDY7Q3ckBY3ToVF2VF-BtcoMhQN
    
with np.load('expert_exploratory_data.npz') as data:
    obs_expl, act_expl, rew_expl, done_expl = data.values()

In [23]:
# EXERCISE: Run BC on the exploratory data

imitator.set_weights(init_weights)

# ANSWER
imitator.compile(loss=tf.keras.losses.MeanSquaredError())
imitator.fit(obs_expl, act_expl, epochs=25)
# END ANSWER

evaluate_agent(env, imitator)

Epoch 1/25
Epoch 2/25
Epoch 3/25
Epoch 4/25
Epoch 5/25
Epoch 6/25
Epoch 7/25
Epoch 8/25
Epoch 9/25
Epoch 10/25
Epoch 11/25
Epoch 12/25
Epoch 13/25
Epoch 14/25
Epoch 15/25
Epoch 16/25
Epoch 17/25
Epoch 18/25
Epoch 19/25
Epoch 20/25
Epoch 21/25
Epoch 22/25
Epoch 23/25
Epoch 24/25
Epoch 25/25
Num. episodes: 26
Avg. return: 5142.769298211675
Max. return: 5691.032028480955
Min. return: 1744.9846619234231


### Exercise

Answer the questions

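1. Why does it do better?

  > Answer: The noise added to the expert's actions pushes the agent into states the expert would not otherwise visit, so the dataset covers states beyond the expert's usual behaviour and the imitator learns what to do there.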

1. How can we use the expert to further improve the data?

  > Hint: Noisy actions help in collecting more diverse data, but we don't want to learn the exploratory actions.

  > Answer: We can fit the imitator to the exploratory observations labelled with the expert's noise-free actions instead of the noisy actions that were executed.
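
The effect of relabelling can be illustrated on a toy problem. This is a sketch under stated assumptions: the linear `toy_expert`, the helper `fit_linear`, and the directly sampled observations are all hypothetical stand-ins, and only the noise scale of 0.29 is taken from the cell above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical linear expert, used only for illustration.
W_expert = rng.normal(size=(4, 2))

def toy_expert(obs):
    return obs @ W_expert

# Stand-in for states visited while acting with noise. For simplicity they are
# sampled directly here instead of being produced by a rollout.
obs_noisy = rng.normal(size=(500, 4))

# Labels as executed: expert action plus Gaussian noise.
act_executed = toy_expert(obs_noisy) + 0.29 * rng.normal(size=(500, 2))
# Relabelled: the noise-free expert action for the very same states.
act_relabelled = toy_expert(obs_noisy)

def fit_linear(obs, act):
    """Least-squares fit of a linear policy obs -> act."""
    weights, *_ = np.linalg.lstsq(obs, act, rcond=None)
    return weights

err_executed = np.abs(fit_linear(obs_noisy, act_executed) - W_expert).max()
err_relabelled = np.abs(fit_linear(obs_noisy, act_relabelled) - W_expert).max()
print(f'Max. weight error, noisy labels:      {err_executed:.4f}')
print(f'Max. weight error, relabelled labels: {err_relabelled:.4f}')
```

The state coverage is identical in both fits; only the labels differ, so any gap between the two errors comes from the label noise alone.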

In [None]:
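# EXERCISE: Infer the expert actions on the exploratory observations
#           and run BC on it.

imitator.set_weights(init_weights)

# ANSWER
# Relabel the exploratory observations with noise-free expert actions
act_relabelled = np.asarray(expert(obs_expl))
imitator.compile(loss=tf.keras.losses.MeanSquaredError())
imitator.fit(obs_expl, act_relabelled, epochs=25)
# END ANSWER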

evaluate_agent(env, imitator)

Epoch 1/25
Epoch 2/25
Epoch 3/25
Epoch 4/25
Epoch 5/25
Epoch 6/25
Epoch 7/25
Epoch 8/25
Epoch 9/25
Epoch 10/25
Epoch 11/25
Epoch 12/25
Epoch 13/25
Epoch 14/25
Epoch 15/25
Epoch 16/25
Epoch 17/25

### Exercise

Answer the questions

1. Did it help? Why?

  > Answer: ...

1. How can you extend this idea?

  > Hint: How can we get more exploratory data?

  > Answer: ...

## 2. Imitation Learning via Interactive Demonstrator

[DAgger](https://www.ri.cmu.edu/pub_files/2011/4/Ross-AISTATS11-NoRegret.pdf)

1. Collect the expert data.
2. Fit the model (classifier/regressor) to the expert data.
3. Collect the imitator data.
4. Infer the expert actions on the imitator data.
5. Fit the model to the extended dataset.
6. Repeat from 3.
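
The six steps above can be sketched end to end on a toy problem. Everything in this block is an assumption made for illustration: the 1-dim. environment inside `rollout`, the `toy_expert`, the polynomial regressor in `fit`, and all constants. It is not the Ant setup used in this lab.

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_expert(obs):
    """Hypothetical expert: steers the state towards zero."""
    return -np.tanh(obs)

def rollout(policy, steps=200):
    """Runs `policy` in a toy 1-dim. environment and returns the visited states."""
    obs = rng.uniform(-2.0, 2.0, size=1)
    visited = []
    for _ in range(steps):
        visited.append(obs)
        # Noisy additive dynamics; the state is clipped to keep it bounded.
        obs = np.clip(obs + policy(obs) + 0.1 * rng.normal(size=1), -3.0, 3.0)
    return np.stack(visited)

def fit(obs, act, degree=3):
    """Fits a polynomial regressor obs -> act and returns it as a policy."""
    coeffs = np.polyfit(obs[:, 0], act[:, 0], degree)
    return lambda o: np.polyval(coeffs, o)

# 1.-2. Collect the expert data and fit the model to it.
obs_data = rollout(toy_expert)
act_data = toy_expert(obs_data)
imitator_toy = fit(obs_data, act_data)

for i in range(5):
    # 3. Collect the imitator data.
    obs_new = rollout(imitator_toy)
    # 4. Infer the expert actions on the imitator data.
    act_new = toy_expert(obs_new)
    # 5. Fit the model to the extended (aggregated) dataset.
    obs_data = np.concatenate((obs_data, obs_new))
    act_data = np.concatenate((act_data, act_new))
    imitator_toy = fit(obs_data, act_data)
    # 6. Repeat from 3.

print(f'Aggregated dataset size: {len(obs_data)}')
```

Note that the imitator chooses which states are visited in step 3, while the expert only supplies labels in step 4. This is what distinguishes the loop from behavioural cloning on a fixed dataset.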

In [18]:
# We will pre-train on less expert data to keep the same dataset size
obs_ = obs[:30000,:]
act_ = act[:30000,:]

In [19]:
# EXERCISE: Pretrain for 25 epochs

imitator.set_weights(init_weights)

# ANSWER
imitator.compile(loss=tf.keras.losses.MeanSquaredError())
imitator.fit(obs_, act_, epochs=25)
# END ANSWER

evaluate_agent(env, imitator)

Epoch 1/25
Epoch 2/25
Epoch 3/25
Epoch 4/25
Epoch 5/25
Epoch 6/25
Epoch 7/25
Epoch 8/25
Epoch 9/25
Epoch 10/25
Epoch 11/25
Epoch 12/25
Epoch 13/25
Epoch 14/25
Epoch 15/25
Epoch 16/25
Epoch 17/25
Epoch 18/25
Epoch 19/25
Epoch 20/25
Epoch 21/25
Epoch 22/25
Epoch 23/25
Epoch 24/25
Epoch 25/25
Num. episodes: 29
Avg. return: 1366.7703375326773
Max. return: 4127.6208425388395
Min. return: -456.14414318140007


In [21]:
# Exercise: Implement DAgger

for i in range(7):
    print(f'\n### Iter. {i+1} ###')

    # ANSWER
    print('\n1. Data collection')
    obs_extra, _, _, _ = run_policy(env, imitator, total_steps=10000) # Collect 10k steps
   
    obs_ = np.concatenate((obs_, obs_extra))
    act_ = np.concatenate((act_, expert(obs_extra)))

    print('\n2. Training')
    imitator.set_weights(init_weights)

    imitator.compile(loss=tf.keras.losses.MeanSquaredError())
    imitator.fit(obs_, act_)
    
    # END ANSWER

    print('\n3. Evaluation')
    evaluate_agent(env, imitator)




### Iter. 1 ###

1. Data collection
Step 1000/10000, Steps per second: 189.53014481240632
Step 2000/10000, Steps per second: 230.13139912602495
Step 3000/10000, Steps per second: 179.28946417323903
Step 4000/10000, Steps per second: 243.87317420899592
Step 5000/10000, Steps per second: 231.95903968839
Step 6000/10000, Steps per second: 205.98316609914085
Step 7000/10000, Steps per second: 181.52724150310453
Step 8000/10000, Steps per second: 219.12998847691767
Step 9000/10000, Steps per second: 222.70904805419818
Step 10000/10000, Steps per second: 185.46181176336063

2. Training

3. Evaluation
Num. episodes: 26
Avg. return: -70.53151077255657
Max. return: 193.51980550007193
Min. return: -136.69397660355162

### Iter. 2 ###

1. Data collection
Step 1000/10000, Steps per second: 239.47327462990245
Step 2000/10000, Steps per second: 203.87849508522686
Step 3000/10000, Steps per second: 183.31867633228904
Step 4000/10000, Steps per second: 231.36570894604395
Step 5000/10000, Steps per se

### Note

Training the expert with the SOP algorithm (Wang et al., 2020) took 3M data samples (environment interactions). Here, we nearly match it with only 100k samples, 30 times fewer! Learning from an expert can be much more sample-efficient than reinforcement learning.