In [5]:
%load_ext autoreload
%autoreload 2
%matplotlib inline
import torch
import sys, os
import pystk
import ray
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
print('device = ', device)
# ignore_reinit_error=True lets this cell be re-run without ray.init raising a RuntimeError
ray.init(logging_level=50, ignore_reinit_error=True)

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload
device =  cpu


In [33]:
from utils.actors import new_action_net, Actor, GreedyActor
from utils.utils import show_agent, rollout_many
import numpy as np

In [34]:
action_net = new_action_net()
show_agent(Actor(action_net))



In [35]:
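# Sample 100 randomly initialized action networks.
many_action_nets = [new_action_net() for _ in range(100)]

# Roll each of them out with n_steps=600.
data = rollout_many([Actor(action_net) for action_net in many_action_nets], n_steps=600)

# Keep the network whose kart covered the least distance: a deliberately bad starting point.
bad_initialization = many_action_nets[np.argmin([d[-1]['kart_info'].overall_distance for d in data])]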

In [None]:
# good_initialization is never defined in this notebook; show the network selected above instead.
show_agent(Actor(bad_initialization))

Recall what we're trying to do in RL: maximize the expected return of a policy $\pi$ (or, equivalently, minimize a loss $L$):
$$
-L = E_{\tau \sim P_\pi}[R(\tau)],
$$
where $\tau = \{s_0, a_0, s_1, a_1, \ldots\}$ is a trajectory of states and actions.
The return of a trajectory is then defined as the sum of individual rewards $R(\tau) = \sum_k r(s_k)$ (we won't discount in this assignment).

Policy gradient computes the gradient of the loss $L$ using the log-derivative trick:
$$
\nabla_\pi L = -E_{\tau \sim P_\pi}[\sum_k r(s_k) \nabla_\pi \sum_i \log \pi(a_i | s_i)].
$$
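
As a reminder of where this identity comes from, here is a short sketch of the derivation. It assumes the initial state distribution $P(s_0)$ and the dynamics $P(s_{i+1} | s_i, a_i)$ do not depend on the policy (these two symbols are introduced here for illustration only):
$$
\nabla_\pi E_{\tau \sim P_\pi}[R(\tau)] = \int \nabla_\pi P_\pi(\tau) R(\tau) \, d\tau = \int P_\pi(\tau) \nabla_\pi \log P_\pi(\tau) R(\tau) \, d\tau = E_{\tau \sim P_\pi}[R(\tau) \nabla_\pi \log P_\pi(\tau)].
$$
The trajectory probability factorizes as $\log P_\pi(\tau) = \log P(s_0) + \sum_i \log \pi(a_i | s_i) + \sum_i \log P(s_{i+1} | s_i, a_i)$, so only the policy terms have a nonzero gradient:
$$
\nabla_\pi \log P_\pi(\tau) = \nabla_\pi \sum_i \log \pi(a_i | s_i).
$$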
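Since the reward $r(s_k)$ depends only on actions $a_i$ taken in the past ($i < k$), rewards earned before step $i$ contribute nothing to the gradient term of action $a_i$ in expectation. We can drop them and, as a further approximation, truncate the remaining sum to a horizon of $T$ steps: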
$$
\nabla_\pi L = -E_{\tau \sim P_\pi}\left[\sum_i \left(\nabla_\pi \log \pi(a_i | s_i)\right)\left(\sum_{k=i}^{i+T} r(s_k) \right)\right].
$$
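
The inner sum $\sum_{k=i}^{i+T} r(s_k)$ can be computed per time step from the list of rewards in a rollout. The following is a minimal sketch, not the code in `utils.reinforce`; the helper name `windowed_returns` is hypothetical, and the window is assumed to be clipped at the end of the trajectory.

```python
def windowed_returns(rewards, T):
    """For each step i, sum rewards[i], ..., rewards[i + T] (clipped at the end).

    `windowed_returns` is a hypothetical helper written for illustration.
    """
    n = len(rewards)
    # Python slices clip automatically when i + T + 1 runs past the end of the list.
    return [sum(rewards[i:i + T + 1]) for i in range(n)]


rewards = [1.0, 0.0, 2.0, 3.0]
print(windowed_returns(rewards, T=1))  # [1.0, 2.0, 5.0, 3.0]
```

This version costs $O(nT)$ per trajectory, which is fine for short horizons such as `T=10`; a prefix-sum variant would bring it down to $O(n)$.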
We will implement an estimator for this gradient below. There are a few steps that we need to follow:

 * The expectation $E_{\tau \sim P_\pi}$ is estimated by averaging over rollouts of our policy
 * The log probability $\log \pi(a_i | s_i)$ is computed with `Categorical.log_prob` (from `torch.distributions`)
 * The gradient is computed by calling `.backward()` on the estimated loss
 * The gradient $\nabla_\pi L$ is then passed to a standard optimizer
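
The steps above can be sketched as a single update in PyTorch. This is an illustration under stated assumptions, not the implementation in `utils.reinforce`: the linear `policy`, the random `states`, and the random `returns` are made-up stand-ins for the action network, the rollout data, and the windowed rewards.

```python
import torch
from torch.distributions import Categorical

torch.manual_seed(0)

# Hypothetical stand-in for the action network: 4 state features -> 3 discrete actions.
policy = torch.nn.Linear(4, 3)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)

# Made-up rollout data: 8 visited states, the actions sampled there, and a return per step.
states = torch.randn(8, 4)
with torch.no_grad():
    actions = Categorical(logits=policy(states)).sample()
returns = torch.randn(8)

# log pi(a_i | s_i) for the actions that were actually taken.
log_prob = Categorical(logits=policy(states)).log_prob(actions)

# Monte Carlo estimate of L. Using mean instead of sum only rescales the gradient.
loss = -(log_prob * returns).mean()

optimizer.zero_grad()
loss.backward()   # fills .grad on the policy parameters
optimizer.step()  # one gradient step on the policy
```

The returns are treated as constants: gradients flow only through `log_prob`, which matches the estimator above.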

In [36]:
from utils.reinforce import reinforce
import copy

# good_initialization = best_action_net
action_net = copy.deepcopy(bad_initialization)
best_action_net = reinforce(action_net, n_epochs=5, n_iterations=200, n_trajectories=500, n_validations=100, T=10)

tensor([1., 1., 2.,  ..., 1., 1., 0.])
epoch = 0  dist = 781.8216491699219, best_dist 14.5169677734375 = 
tensor([1., 1., 0.,  ..., 1., 1., 0.])
epoch = 1  dist = 846.720703125, best_dist 783.486083984375 = 
tensor([2., 2., 0.,  ..., 1., 1., 0.])
epoch = 2  dist = 767.7766723632812, best_dist 842.1952117919922 = 
tensor([1., 1., 0.,  ..., 0., 0., 0.])
epoch = 3  dist = 767.8648071289062, best_dist 843.1003100585938 = 
tensor([1., 1., 0.,  ..., 1., 1., 0.])
epoch = 4  dist = 605.0930786132812, best_dist 843.1003100585938 = 


In [37]:
show_agent(GreedyActor(best_action_net))



In [38]:
from utils.utils import Rollout
viz_rollout = Rollout.remote(400, 300, track='hacienda')
show_agent(GreedyActor(best_action_net), rollout=viz_rollout)

