# Inferring Rewards from Preferences in the Mountain Car Environment Using the APReL Library

---

In this notebook, we show a simple use case of active learning for inferring reward models from preferences. In particular, we use the [APReL library](https://github.com/Stanford-ILIAD/APReL) on the Mountain Car environment. Most of this notebook is based on [this example](https://github.com/Stanford-ILIAD/APReL/blob/main/examples/simple.py).

Before running the code, ensure you have installed the required packages. If not, you can install them via:

In [1]:
! git clone https://github.com/Stanford-ILIAD/APReL.git --quiet
! cd APReL && pip install -r requirements.txt --quiet && pip install -e . --quiet

fatal: destination path 'APReL' already exists and is not an empty directory.
  Preparing metadata (setup.py) ... done


We now import the APReL library along with OpenAI Gym. Then, we load the Mountain Car environment.

In [22]:
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

import aprel
import numpy as np
import gym

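# `np.asscalar` was removed in NumPy 1.23. Restore it as a thin wrapper
# around `.item()` for any dependency that still calls it.
def patch_asscalar(a):
    return a.item()

setattr(np, "asscalar", patch_asscalar)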


gym_env = gym.make('MountainCarContinuous-v0')
np.random.seed(0)
gym_env.seed(0)

[0]

In preference-based reward learning, a trajectory feature function must accompany the environment. In APReL, this is a user-provided function that takes a trajectory as a list of state-action pairs and returns an array of features. For Mountain Car, where each state consists of a position and a velocity, we use the minimum position, the maximum position, and the average speed as our features.

In [23]:
def feature_func(traj):
    r"""Returns the features of the given MountainCar trajectory, i.e. \Phi(traj).

    Args:
        traj: List of state-action tuples, e.g. [(state0, action0), (state1, action1), ...]

    Returns:
        features: a numpy vector corresponding to the features of the trajectory
    """
    states = np.array([pair[0] for pair in traj])
    actions = np.array([pair[1] for pair in traj[:-1]])  # not used by the features below
    # Each state is [position, velocity].
    min_pos, max_pos = states[:,0].min(), states[:,0].max()
    mean_speed = np.abs(states[:,1]).mean()
    # Hard-coded constants that shift and scale each feature.
    mean_vec = [-0.703, -0.344, 0.007]
    std_vec = [0.075, 0.074, 0.003]
    return (np.array([min_pos, max_pos, mean_speed]) - mean_vec) / std_vec

features_dim=3
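
As a quick sanity check, the sketch below applies the same feature computation to a small hand-made trajectory. The three states and the `None` actions are made-up values for illustration, not data from the environment, and the function body is a condensed copy of the cell above so that the snippet runs on its own.

```python
import numpy as np

def feature_func(traj):
    """Condensed copy of the feature function above (same constants)."""
    states = np.array([pair[0] for pair in traj])
    min_pos, max_pos = states[:, 0].min(), states[:, 0].max()
    mean_speed = np.abs(states[:, 1]).mean()
    mean_vec = [-0.703, -0.344, 0.007]
    std_vec = [0.075, 0.074, 0.003]
    return (np.array([min_pos, max_pos, mean_speed]) - mean_vec) / std_vec

# Hypothetical trajectory: (state, action) pairs with state = [position, velocity].
toy_traj = [
    (np.array([-0.5, 0.00]), None),
    (np.array([-0.6, 0.01]), None),
    (np.array([-0.4, -0.02]), None),
]

toy_features = feature_func(toy_traj)
print(toy_features)  # shifted and scaled [min_pos, max_pos, mean_speed]
```

Because the features are shifted and scaled, a value near zero means the raw feature is close to the corresponding entry of `mean_vec`.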

We are now ready to wrap the environment into an APReL environment along with the feature function:

In [24]:
env = aprel.Environment(gym_env, feature_func)

To gather preferences, we need to interact with a real or simulated human. Although APReL supports real users, here we use a simulated one: a softmax user whose weights favor trajectories with a low `min_pos`, a high `max_pos`, and a high `mean_speed`. You could also interact with the library yourself and enter your own preferences by setting the user to `true_user = aprel.HumanUser()`.

In [25]:
true_user = aprel.SoftmaxUser(params_dict={"weights": np.array([-0.3, 0.7, 0.6]), "beta": 3., "beta_D": 3.})
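
To make the softmax response model more concrete, here is a minimal sketch of how such a user might choose between trajectories. It assumes the standard Boltzmann form, where the probability of choosing option i is proportional to exp(beta * w · Φ_i); APReL's internal implementation may differ in detail. The function `softmax_choice_probs` and the two feature vectors are hypothetical and exist only for this illustration.

```python
import numpy as np

def softmax_choice_probs(weights, features, beta):
    """Probability of picking each option under an assumed Boltzmann model.

    weights:  (d,) reward weights
    features: (k, d) one feature vector per option in the query
    beta:     rationality coefficient; larger values mean less noisy choices
    """
    logits = beta * (features @ weights)
    logits = logits - logits.max()  # subtract the max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

weights = np.array([-0.3, 0.7, 0.6])       # same weights as the simulated user
features = np.array([[0.0, 1.0, 1.0],      # hypothetical trajectory A
                     [1.0, -1.0, 0.0]])    # hypothetical trajectory B

probs = softmax_choice_probs(weights, features, beta=3.0)
chosen = np.random.default_rng(0).choice(len(probs), p=probs)
print(probs, chosen)
```

With these weights, trajectory A has the higher reward, so it receives most of the probability mass; lowering `beta` would make the simulated answers noisier.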

Now, to learn the reward function, we consider another softmax response model. We will learn a reward function that is linear in trajectory features. Let's initialize this model with a random vector of weights.

In [11]:
params = {'weights': aprel.util_funs.get_random_normalized_vector(features_dim)}
user_model = aprel.SoftmaxUser(params)

After defining our user model, we now create a belief distribution over the parameters we want to learn. We will be learning only the weights, so let's use the same dictionary of parameters. If we wanted to learn the other parameters of the softmax model, we would pass them here.

In [30]:
belief = aprel.SamplingBasedBelief(user_model, [], params)
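
A sampling-based belief represents the posterior over the weights with a set of samples. The sketch below illustrates the idea with plain importance weighting: draw unit-norm weight vectors, weight each one by the likelihood of the observed preferences, and average. This is a simplified stand-in and not APReL's sampler; the function names, the uniform prior on the unit sphere, and the toy data are all assumptions made for illustration.

```python
import numpy as np

def sample_unit_vectors(n, dim, rng):
    """Draw n vectors uniformly from the unit sphere in `dim` dimensions."""
    v = rng.normal(size=(n, dim))
    return v / np.linalg.norm(v, axis=1, keepdims=True)

def preference_likelihood(samples, feats_chosen, feats_rejected, beta=1.0):
    """Softmax likelihood of a pairwise preference, for every weight sample."""
    return 1.0 / (1.0 + np.exp(-beta * (samples @ (feats_chosen - feats_rejected))))

def posterior_mean(samples, data, beta=1.0):
    """Importance-weighted mean of the samples, rescaled to unit norm."""
    log_w = np.zeros(len(samples))
    for feats_chosen, feats_rejected in data:
        log_w += np.log(preference_likelihood(samples, feats_chosen, feats_rejected, beta))
    w = np.exp(log_w - log_w.max())  # stabilize before normalizing
    w /= w.sum()
    mean = w @ samples
    return mean / np.linalg.norm(mean)

rng = np.random.default_rng(0)
samples = sample_unit_vectors(5000, 3, rng)

# Hypothetical data: the user preferred the option with the larger first feature, five times.
data = [(np.array([1.0, 0.0, 0.0]), np.array([0.0, 0.0, 0.0]))] * 5

estimate = posterior_mean(samples, data)
print(estimate)
```

Starting from an empty data list, as in the cell above, corresponds to all samples having equal weight.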

We now create an optimizer that selects the next query based on the current belief. We can use the `Discrete` optimizer, which searches only over a finite set of trajectories. Here we randomly generate some trajectories and use them as the search set for this optimizer.

In [13]:
trajectory_set = aprel.generate_trajectories_randomly(env, num_trajectories=30,
                                                      max_episode_length=300,
                                                      file_name="mountain_car",
                                                      seed=0)
query_optimizer = aprel.QueryOptimizerDiscreteTrajectorySet(trajectory_set)

Moviepy - Building video aprel_trajectories/clips/mountain_car_0.mp4.
Moviepy - Writing video aprel_trajectories/clips/mountain_car_0.mp4
Moviepy - Done !
Moviepy - video ready aprel_trajectories/clips/mountain_car_0.mp4
...
Moviepy - Building video aprel_trajectories/clips/mountain_car_29.mp4.
Moviepy - Writing video aprel_trajectories/clips/mountain_car_29.mp4
Moviepy - Done !
Moviepy - video ready aprel_trajectories/clips/mountain_car_29.mp4
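
To illustrate what searching over a finite trajectory set can look like, the sketch below enumerates every pair of candidate trajectories and keeps the pair whose answer is expected to be most informative. It scores pairs by mutual information between the answer and the weights, estimated from belief samples. This is one possible scoring rule chosen for the illustration, not necessarily what APReL does internally, and it differs from the acquisition function used in the loop further below. All names and the random toy data are hypothetical.

```python
import itertools
import numpy as np

def choice_probs(weight_samples, feats_a, feats_b, beta=1.0):
    """P(A) and P(B) under an assumed softmax model, one row per weight sample."""
    p_a = 1.0 / (1.0 + np.exp(-beta * (weight_samples @ (feats_a - feats_b))))
    return np.stack([p_a, 1.0 - p_a], axis=-1)

def entropy(p):
    """Shannon entropy in bits along the last axis."""
    p = np.clip(p, 1e-12, 1.0)
    return -(p * np.log2(p)).sum(axis=-1)

def mutual_information(weight_samples, feats_a, feats_b):
    """Entropy of the averaged prediction minus the average per-sample entropy."""
    probs = choice_probs(weight_samples, feats_a, feats_b)
    return entropy(probs.mean(axis=0)) - entropy(probs).mean()

def best_pair(features, weight_samples):
    """Exhaustively search all index pairs (i, j) with i < j."""
    pairs = itertools.combinations(range(len(features)), 2)
    return max(pairs, key=lambda ij: mutual_information(
        weight_samples, features[ij[0]], features[ij[1]]))

rng = np.random.default_rng(0)
features = rng.normal(size=(6, 3))                 # toy features for 6 trajectories
weight_samples = rng.normal(size=(200, 3))
weight_samples /= np.linalg.norm(weight_samples, axis=1, keepdims=True)

pair = best_pair(features, weight_samples)
print(pair)
```

Exhaustive search over pairs grows quadratically with the size of the set, which is manageable for the 30 trajectories generated above.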


We need to query the user to elicit their preferences. To do so, we first initialize a dummy query; the query optimizer will then optimize a query of the same kind. For example, let's create a dummy preference query (of the form "do you prefer trajectory A or B?") from the first two trajectories in the trajectory set:

In [27]:
dummy_preference_query = aprel.PreferenceQuery(trajectory_set[:2])

In the next for-loop, we repeatedly do three things: (i) optimize a query, (ii) ask the user for a response to the optimized query, (iii) update the belief distribution with the response.

In [29]:
for query_no in range(30):
    queries, objective_values = query_optimizer.optimize('thompson', belief, dummy_preference_query)
    # queries and objective_values are lists even when we do not request a batch of queries.
    print('Objective Value: ' + str(objective_values[0]))

    responses = true_user.respond(queries[0])
    belief.update(aprel.Preference(queries[0], responses[0]))
    print('Estimated user parameters: ' + str(belief.mean))

Objective Value: 1.0
Estimated user parameters: {'weights': array([0.44892678, 0.85546837, 0.25814456])}
Objective Value: 1.0
Estimated user parameters: {'weights': array([0.14783111, 0.92862676, 0.3402915 ])}
Objective Value: 1.0
Estimated user parameters: {'weights': array([-0.23832594,  0.74516819,  0.62283635])}
Objective Value: 1.0
Estimated user parameters: {'weights': array([-0.01630045,  0.94133304,  0.33708516])}
Objective Value: 1.0
Estimated user parameters: {'weights': array([0.21861662, 0.40585441, 0.88740575])}
Objective Value: 1.0
Estimated user parameters: {'weights': array([-0.10018402,  0.62537663,  0.77386512])}
Objective Value: 1.0
Estimated user parameters: {'weights': array([0.12047672, 0.61532803, 0.77901013])}
Objective Value: 1.0
Estimated user parameters: {'weights': array([-0.04771811,  0.91350677,  0.4040153 ])}
Objective Value: 1.0
Estimated user parameters: {'weights': array([-0.07546247,  0.95724754,  0.27925358])}
Objective Value: 1.0
Estimated user para

In [37]:
print('Estimated user parameters: ' + str(belief.mean))

print("Real user parameters: " + str(true_user.params))

Estimated user parameters: {'weights': array([-0.16936477,  0.79984326,  0.57581797])}
Real user parameters: {'weights': array([-0.3,  0.7,  0.6]), 'beta': 3.0, 'beta_D': 3.0, 'delta': 0.1}
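
One simple way to judge how close the estimate is to the true weights is the cosine similarity between the two vectors. The true weights printed above are not unit-norm while the estimate appears to be, so comparing directions rather than raw values seems the fairer comparison. The `alignment` helper below is a hypothetical name for this sketch, and the two arrays are copied from the output above.

```python
import numpy as np

def alignment(estimated, true):
    """Cosine similarity: 1 means same direction, -1 means opposite."""
    return float(estimated @ true / (np.linalg.norm(estimated) * np.linalg.norm(true)))

estimated_weights = np.array([-0.16936477, 0.79984326, 0.57581797])
true_weights = np.array([-0.3, 0.7, 0.6])

score = alignment(estimated_weights, true_weights)
print(f"Alignment: {score:.3f}")
```

A score close to 1 indicates that the learned reward ranks trajectories in nearly the same way as the simulated user's reward.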
