In [2]:
from src.utils import *

In [2]:
values_N = [20,80,300,1000,10000]

# Question 1

We ran three simulations to address Q1: how much do RS-BC and RS-KT improve
over BC and MIMIC-MD?

## Exp 0:

We consider:
- A small environment $S,A=2,2$;
- A short horizon $H=5$;
- Approximation error: $\theta>\rho$;
- Non-Markovian expert.

In [3]:
folder = 'results/exp0/'

# load the results
results = np.load(folder+'.npy',allow_pickle=True)

# show
show_results(results,values_N)

N =  20
RS-BC: 0.081±0.039
RS-KT: 0.095±0.036
BC: 0.099±0.056
MIMIC-MD: 0.127±0.062

N =  80
RS-BC: 0.038±0.016
RS-KT: 0.049±0.017
BC: 0.076±0.054
MIMIC-MD: 0.086±0.055

N =  300
RS-BC: 0.022±0.013
RS-KT: 0.03±0.013
BC: 0.072±0.056
MIMIC-MD: 0.074±0.056

N =  1000
RS-BC: 0.012±0.005
RS-KT: 0.019±0.007
BC: 0.069±0.058
MIMIC-MD: 0.07±0.057

N =  10000
RS-BC: 0.005±0.002
RS-KT: 0.011±0.006
BC: 0.068±0.058
MIMIC-MD: 0.068±0.058


Observations:
- RS-BC and RS-KT outperform BC and MIMIC-MD, as expected. With few
  trajectories, all four methods perform comparably: intuitively, BC and
  MIMIC-MD have smaller hypothesis spaces, which helps when data are scarce.
  However, their bias becomes evident at $N=10000$ trajectories, where they
  are drastically outperformed by RS-BC and RS-KT.
- We do not observe a drastic improvement in sample complexity from using RS-KT
  instead of RS-BC because the state-action space is small (recall our
  theorems). Intuitively, the number of trajectories required to accurately
  estimate $\mathbb{P}^{\pi^E}(a|s,g)$ at all $s,g$ is comparable to that
  required to accurately estimate $\eta^{\pi^E}$, since $S,A$ are small.
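
To put a number on the saturation visible in the output above, one rough check is the slope of the mean error against $N$ on a log-log scale: a slope near $-1/2$ is what a $1/\sqrt{N}$ statistical rate would give, while a slope near $0$ points to a bias floor. The sketch below hard-codes the Exp 0 means for RS-BC and BC; the helper `loglog_slope` is illustrative and not part of `src.utils`.

```python
import math

def loglog_slope(ns, errs):
    """Least-squares slope of log(err) against log(n)."""
    xs = [math.log(n) for n in ns]
    ys = [math.log(e) for e in errs]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

# Mean errors copied from the Exp 0 output above.
n_values = [20, 80, 300, 1000, 10000]
rs_bc = [0.081, 0.038, 0.022, 0.012, 0.005]
bc = [0.099, 0.076, 0.072, 0.069, 0.068]

slope_rs_bc = loglog_slope(n_values, rs_bc)
slope_bc = loglog_slope(n_values, bc)
print(f"RS-BC slope: {slope_rs_bc:.2f}")
print(f"BC slope:    {slope_bc:.2f}")
```

A markedly flatter slope for BC than for RS-BC is the numerical counterpart of the bias described in the first bullet.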

## Exp 3:

We increase the sizes of the state and action spaces to $S,A=50,5$ to
understand how the algorithms perform in this case. We keep all other
parameters the same; in particular, $\theta>\rho$.

In [4]:
folder = 'results/exp3/'

# load the results
results = np.load(folder+'.npy',allow_pickle=True)

# show
show_results(results,values_N)

N =  20
RS-BC: 0.101±0.041
RS-KT: 0.164±0.04
BC: 0.104±0.041
MIMIC-MD: 0.139±0.058

N =  80
RS-BC: 0.059±0.017
RS-KT: 0.084±0.021
BC: 0.058±0.018
MIMIC-MD: 0.079±0.028

N =  300
RS-BC: 0.032±0.011
RS-KT: 0.052±0.013
BC: 0.035±0.012
MIMIC-MD: 0.045±0.016

N =  1000
RS-BC: 0.016±0.006
RS-KT: 0.033±0.01
BC: 0.024±0.01
MIMIC-MD: 0.029±0.012

N =  10000
RS-BC: 0.005±0.002
RS-KT: 0.02±0.006
BC: 0.018±0.009
MIMIC-MD: 0.019±0.009


Observations:
- The increase in $S,A$ is not large enough to yield a reduction in sample
  complexity for RS-KT relative to RS-BC.
- With larger $S,A$, the approximation error due to discretization (controlled
  by $\theta$) increases. However, we observe that this mostly degrades
  RS-KT, while RS-BC remains the best method. In Q2, we will see that RS-KT
  performs as well as RS-BC once the approximation error is removed, even for
  larger $S,A$.
- The computational time of RS-KT increases significantly.
- As the number of trajectories increases, the error keeps decreasing, as
  expected.

## Exp 6:

We increase the horizon to $H=20$ to understand how the algorithms perform in
this case. We keep all other parameters the same; in particular, $\theta>\rho$.

In [5]:
folder = 'results/exp6/'

# load the results
results = np.load(folder+'.npy',allow_pickle=True)

# show
show_results(results,values_N)

N =  20
RS-BC: 0.193±0.086
RS-KT: 0.223±0.066
BC: 0.208±0.08
MIMIC-MD: 0.265±0.106

N =  80
RS-BC: 0.087±0.035
RS-KT: 0.115±0.034
BC: 0.162±0.083
MIMIC-MD: 0.18±0.086

N =  300
RS-BC: 0.046±0.019
RS-KT: 0.072±0.023
BC: 0.156±0.086
MIMIC-MD: 0.159±0.084

N =  1000
RS-BC: 0.027±0.01
RS-KT: 0.053±0.017
BC: 0.151±0.086
MIMIC-MD: 0.153±0.087

N =  10000
RS-BC: 0.012±0.005
RS-KT: 0.041±0.018
BC: 0.151±0.085
MIMIC-MD: 0.15±0.085


Observations:
- Same pattern as for small $H$: RS-BC and RS-KT keep reducing the error
  while BC and MIMIC-MD saturate. However, slightly more samples are required,
  as predicted by our theorems, and the approximation error is larger, since
  it accumulates over $H$.
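
A toy illustration of why discretization error can accumulate with $H$, assuming (this notebook does not spell out the scheme) that each per-step reward is rounded to the nearest multiple of a grid width $\theta$. Each step then shifts the return by at most $\theta/2$, so the return of a trajectory moves by at most $H\theta/2$. The uniform reward model and all names below are hypothetical.

```python
import random

def discretized_return_shift(rewards, theta):
    """Absolute change in the return when every reward is rounded to the theta-grid."""
    rounded = [round(r / theta) * theta for r in rewards]
    return abs(sum(rewards) - sum(rounded))

random.seed(0)
theta = 0.1
worst_shift = {}
for H in (5, 20):
    shifts = [
        discretized_return_shift([random.random() for _ in range(H)], theta)
        for _ in range(1000)
    ]
    worst_shift[H] = max(shifts)
    # Compare the worst observed shift with the H * theta / 2 bound.
    print(H, round(worst_shift[H], 3), "bound:", H * theta / 2)
```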

# Question 2

Q2 concerns the dependence on the parameter $\theta$. We conducted four different simulations.

## Exp 1:

We let $S,A,H=2,2,5$, a non-Markovian expert's policy, but increase $\rho$ so
that $\rho=\theta$ in order to reduce the approximation error.

In [6]:
folder = 'results/exp1/'

# load the results
results = np.load(folder+'.npy',allow_pickle=True)

# show
show_results(results,values_N)

N =  20
RS-BC: 0.081±0.036
RS-KT: 0.095±0.042
BC: 0.104±0.053
MIMIC-MD: 0.13±0.063

N =  80
RS-BC: 0.035±0.016
RS-KT: 0.043±0.019
BC: 0.076±0.048
MIMIC-MD: 0.085±0.046

N =  300
RS-BC: 0.019±0.011
RS-KT: 0.024±0.012
BC: 0.068±0.048
MIMIC-MD: 0.071±0.046

N =  1000
RS-BC: 0.011±0.005
RS-KT: 0.013±0.005
BC: 0.066±0.049
MIMIC-MD: 0.068±0.049

N =  10000
RS-BC: 0.004±0.002
RS-KT: 0.004±0.002
BC: 0.065±0.049
MIMIC-MD: 0.065±0.049


Observations:
- Same trend as for the baseline.
- Very mild or negligible improvement from using $\rho=\theta$. Intuitively,
  since the horizon $H=5$ is quite small, the error accumulates over only a
  few timesteps.

## Exp 4:

We use a very poor choice of $\theta=5\times 10^{-1}$ while keeping
$\rho=3\times 10^{-2}$ small, to see how the approximation error increases.

In [7]:
folder = 'results/exp4/'

# load the results
results = np.load(folder+'.npy',allow_pickle=True)

# show
show_results(results,values_N)

N =  20
RS-BC: 0.087±0.04
RS-KT: 0.144±0.053
BC: 0.103±0.057
MIMIC-MD: 0.132±0.065

N =  80
RS-BC: 0.051±0.022
RS-KT: 0.119±0.039
BC: 0.08±0.053
MIMIC-MD: 0.09±0.055

N =  300
RS-BC: 0.035±0.015
RS-KT: 0.109±0.04
BC: 0.072±0.056
MIMIC-MD: 0.076±0.055

N =  1000
RS-BC: 0.027±0.016
RS-KT: 0.108±0.038
BC: 0.069±0.058
MIMIC-MD: 0.071±0.057

N =  10000
RS-BC: 0.022±0.016
RS-KT: 0.106±0.039
BC: 0.068±0.058
MIMIC-MD: 0.068±0.058


Observations:
- The errors of RS-BC and RS-KT are larger than in the baseline (Exp 0).
- RS-KT tends to suffer from the approximation error more than RS-BC.

## Exp 5:

We increase the state and action space sizes to $S,A=20,3$ and set
$\theta=\rho$, to show that, without approximation error, RS-KT does not
exhibit the bias it seemed to have in Exp 3.

In [8]:
folder = 'results/exp5/'

# load the results
results = np.load(folder+'.npy',allow_pickle=True)

# show
show_results(results,values_N)

N =  20
RS-BC: 0.088±0.026
RS-KT: 0.135±0.037
BC: 0.092±0.032
MIMIC-MD: 0.118±0.042

N =  80
RS-BC: 0.048±0.022
RS-KT: 0.068±0.02
BC: 0.054±0.029
MIMIC-MD: 0.066±0.028

N =  300
RS-BC: 0.025±0.011
RS-KT: 0.034±0.01
BC: 0.038±0.023
MIMIC-MD: 0.044±0.023

N =  1000
RS-BC: 0.012±0.005
RS-KT: 0.019±0.005
BC: 0.031±0.022
MIMIC-MD: 0.033±0.021

N =  10000
RS-BC: 0.004±0.002
RS-KT: 0.006±0.002
BC: 0.028±0.024
MIMIC-MD: 0.029±0.024


Observations:
- Clearly, for large $N$, RS-KT performs on par with RS-BC, so the bias is gone.

## Exp 8:

We increase the horizon to $H=20$ and set $\theta=\rho$, to show that, without
approximation error, RS-KT does not exhibit the bias it seemed to have in Exp 6.

In [9]:
folder = 'results/exp8/'

# load the results
results = np.load(folder+'.npy',allow_pickle=True)

# show
show_results(results,values_N)

N =  20
RS-BC: 0.177±0.067
RS-KT: 0.224±0.083
BC: 0.196±0.104
MIMIC-MD: 0.246±0.115

N =  80
RS-BC: 0.091±0.038
RS-KT: 0.108±0.039
BC: 0.159±0.102
MIMIC-MD: 0.174±0.103

N =  300
RS-BC: 0.047±0.018
RS-KT: 0.057±0.018
BC: 0.148±0.103
MIMIC-MD: 0.151±0.103

N =  1000
RS-BC: 0.023±0.008
RS-KT: 0.031±0.01
BC: 0.145±0.104
MIMIC-MD: 0.145±0.104

N =  10000
RS-BC: 0.008±0.003
RS-KT: 0.011±0.004
BC: 0.144±0.106
MIMIC-MD: 0.144±0.106


Observations:
- As $N$ increases, the error of RS-KT keeps decreasing, showing that the
  residual error in Exp 6 was only approximation error.

# Question 3

What happens if the expert's policy is Markovian?

## Exp 2:

We consider a Markovian expert instead of a non-Markovian one. The goal is to
understand how much BC and MIMIC-MD improve relative to RS-BC and RS-KT.

In [10]:
folder = 'results/exp2/'

# load the results
results = np.load(folder+'.npy',allow_pickle=True)

# show
show_results(results,values_N)

N =  20
RS-BC: 0.102±0.031
RS-KT: 0.118±0.036
BC: 0.085±0.035
MIMIC-MD: 0.132±0.052

N =  80
RS-BC: 0.052±0.015
RS-KT: 0.059±0.017
BC: 0.041±0.016
MIMIC-MD: 0.06±0.022

N =  300
RS-BC: 0.026±0.008
RS-KT: 0.031±0.009
BC: 0.021±0.008
MIMIC-MD: 0.03±0.01

N =  1000
RS-BC: 0.015±0.005
RS-KT: 0.021±0.007
BC: 0.012±0.005
MIMIC-MD: 0.016±0.006

N =  10000
RS-BC: 0.004±0.001
RS-KT: 0.01±0.004
BC: 0.003±0.002
MIMIC-MD: 0.005±0.002


Observations:
- Now BC and MIMIC-MD are expressive enough. Based on the insight of Foster et
  al., "Is Behavior Cloning all you need? Understanding Horizon in Imitation
  Learning", we know that the BC objective corresponds to mimicking the whole
  trajectory distribution, and hence also the return distribution. Thus, we
  see that BC and MIMIC-MD perform well, and that their error keeps decreasing
  as the number of trajectories $N$ increases.
- BC, in particular, is the best method, because its hypothesis space is
  smaller than those of RS-BC and RS-KT.
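
For intuition about the smaller hypothesis space, note that a Markovian policy in a tabular problem has one action distribution per (timestep, state) pair, and a count-based BC estimate fills that table directly. The sketch below assumes trajectories are lists of `(state, action)` pairs indexed by timestep; this data layout and the function name are assumptions, not the BC implementation used in this notebook.

```python
from collections import Counter, defaultdict

def tabular_bc(trajectories):
    """Estimate pi_h(a | s) by empirical action frequencies at each (h, s)."""
    counts = defaultdict(Counter)
    for traj in trajectories:
        for h, (s, a) in enumerate(traj):
            counts[(h, s)][a] += 1
    policy = {}
    for key, action_counts in counts.items():
        total = sum(action_counts.values())
        policy[key] = {a: c / total for a, c in action_counts.items()}
    return policy

# Three toy trajectories of horizon 2 (hypothetical data).
demos = [
    [(0, 1), (1, 0)],
    [(0, 1), (0, 0)],
    [(0, 0), (1, 0)],
]
policy = tabular_bc(demos)
print(policy[(0, 0)])  # action frequencies at timestep 0, state 0
```

Such a table has at most $S\cdot A\cdot H$ entries. A policy that additionally conditions on a statistic $g$, as in $\mathbb{P}^{\pi^E}(a|s,g)$, needs a distribution for every $(s,g)$ pair, which is a larger table to fill from the same data.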

# Question 4

How consistent is the reduction in sample complexity of RS-KT against RS-BC?

## Exp 7:

We increase the size of the state-action space drastically to $S,A=300,5$, as
suggested by our theoretical results. To avoid solving an LP with a very large
number of variables and constraints (inside RS-KT), we replace the execution of
RS-KT by simply comparing $\eta^{\pi^E}$ with an estimate $\widehat{\eta}$
computed from $\mathcal{D}^E$. Intuitively, by the triangle inequality, we know
that
$\mathcal{W}(\eta^{\pi^E},\eta^{\text{RS-KT}})\le 2\mathcal{W}(\eta^{\pi^E},\widehat{\eta})$.
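
One way to obtain the factor 2, under the assumption (not stated in this notebook) that RS-KT returns a policy whose return distribution is at least as close to $\widehat{\eta}$ as $\eta^{\pi^E}$ is, i.e. $\mathcal{W}(\widehat{\eta},\eta^{\text{RS-KT}})\le\mathcal{W}(\widehat{\eta},\eta^{\pi^E})$:

$$
\mathcal{W}(\eta^{\pi^E},\eta^{\text{RS-KT}})
\le \mathcal{W}(\eta^{\pi^E},\widehat{\eta}) + \mathcal{W}(\widehat{\eta},\eta^{\text{RS-KT}})
\le 2\,\mathcal{W}(\eta^{\pi^E},\widehat{\eta}).
$$

The first step is the triangle inequality; the second uses the assumption above together with the symmetry of $\mathcal{W}$.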

In [11]:
folder = 'results/exp7/'

# load the results
results = np.load(folder+'.npy',allow_pickle=True)

# show
show_results(results,values_N)

N =  20
RS-BC: 0.169±0.079
BC: 0.168±0.078
eta_hat: 0.169±0.049

N =  80
RS-BC: 0.168±0.079
BC: 0.166±0.078
eta_hat: 0.08±0.018

N =  300
RS-BC: 0.165±0.081
BC: 0.169±0.085
eta_hat: 0.043±0.01

N =  1000
RS-BC: 0.165±0.081
BC: 0.177±0.091
eta_hat: 0.024±0.006

N =  10000
RS-BC: 0.166±0.081
BC: 0.174±0.093
eta_hat: 0.008±0.002


Observations:
- While for very small values of $N$ like 20 or 80 RS-BC and RS-KT (through
  its proxy $\widehat{\eta}$) perform comparably (recall that the `eta_hat`
  error must be doubled in the worst case), starting from $N=300$ the
  performance of RS-KT is drastically better.
- Observe also that for such a large value of $S$, RS-BC and BC require a much
  larger number of samples, since even at $N=10000$ their error does not
  decrease.
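
For readers who want to compute a distance of the kind reported in the `eta_hat` row, the sketch below evaluates the 1-Wasserstein distance between two empirical one-dimensional distributions by integrating the absolute difference of their CDFs. It assumes that $\mathcal{W}$ is the 1-Wasserstein distance (the Exp 9 output labels its metric `W1`) and that return distributions are available as samples; the function name is illustrative.

```python
def wasserstein1(xs, ys):
    """W1 between the empirical distributions of two 1-D samples."""
    sx, sy = sorted(xs), sorted(ys)
    points = sorted(set(sx) | set(sy))
    total = 0.0
    i = j = 0
    for left, right in zip(points, points[1:]):
        # Advance the pointers so that i/len(sx) and j/len(sy) are the
        # empirical CDF values on the interval [left, right).
        while i < len(sx) and sx[i] <= left:
            i += 1
        while j < len(sy) and sy[j] <= left:
            j += 1
        total += abs(i / len(sx) - j / len(sy)) * (right - left)
    return total

# Hypothetical return samples: all mass at 0 versus all mass at 1.
print(wasserstein1([0.0, 0.0], [1.0, 1.0]))
# Half of the mass has to travel a distance of 1.
print(wasserstein1([0.0], [0.0, 1.0]))
```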

# Question 5

How much better are our algorithms than risk-sensitive IL algorithms that match
the CVaR at level $\alpha$ in addition to the mean return?
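
As a reminder of the quantity involved, the sketch below computes an empirical CVaR from sampled returns, using the lower-tail convention (the mean of the worst $\alpha$-fraction of returns). The convention used by the compared algorithm is not stated in this notebook, so this choice and the function name should be read as assumptions.

```python
import math

def empirical_cvar(returns, alpha):
    """Mean of the lowest ceil(alpha * n) returns (lower-tail CVaR at level alpha)."""
    if not 0 < alpha <= 1:
        raise ValueError("alpha must lie in (0, 1]")
    ordered = sorted(returns)
    k = max(1, math.ceil(alpha * len(ordered)))
    return sum(ordered[:k]) / k

sample_returns = [4.0, 1.0, 3.0, 2.0]  # hypothetical returns
print(empirical_cvar(sample_returns, 0.5))  # mean of the two worst returns
print(empirical_cvar(sample_returns, 1.0))  # plain mean
```

Two policies can share the same mean return and the same CVaR at one level while having different return distributions, which is why the comparison here is reported in terms of a distance between distributions.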

## Exp 9:

We compare the W-RS-GAIL algorithm with RS-BC, averaging over 20 environments.

In [27]:
folder = 'results/exp9/'

# load the results
results = np.load(folder+'.npy',allow_pickle=True)

# show
values_N = [100,1000]
show_results3(results,values_N)

########## N =  100
***  RS-BC:
W1:  0.045 ± 0.022
***  W-RS-GAIL, alpha=0.3
W1:  0.226 ± 0.143
***  W-RS-GAIL, alpha=0.7
W1:  0.202 ± 0.122
########## N =  1000
***  RS-BC:
W1:  0.025 ± 0.017
***  W-RS-GAIL, alpha=0.3
W1:  0.22 ± 0.147
***  W-RS-GAIL, alpha=0.7
W1:  0.197 ± 0.123
