# Understanding Policy Gradient

## Policy Gradient & Imitation Learning

The policy gradient formula closely resembles the maximum likelihood formula used in imitation learning. In imitation learning, we maximize the likelihood of the expert's actions under the parameterized policy; in policy gradient, we maximize the reinforcement learning objective, i.e. the expected return.

Policy Gradient

$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^N \left(\sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_{i,t}|s_{i,t})\right)\left(\sum_{t=1}^T r(s_{i,t}, a_{i,t})\right)$

Maximum Likelihood

$\nabla_\theta J_{\mathrm{ML}}(\theta) \approx \frac{1}{N} \sum_{i=1}^N \left(\sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_{i,t}|s_{i,t})\right)$

Policy gradient therefore looks like a weighted version of maximum likelihood. In policy gradient there is no expert data to teach the policy, so it relies on the cumulative return to tell it whether the sampled actions were good or bad.
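The "weighted maximum likelihood" reading can be checked on a toy case. The sketch below is not taken from the homework code: it assumes a state-independent softmax policy over two actions, and the helper names (`grad_log_prob`, `likelihood_gradient`) and the sample data are hypothetical. The same function produces both gradients; only the per-trajectory weight changes.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def grad_log_prob(logits, action):
    """Gradient of log softmax(logits)[action] with respect to the logits."""
    probs = softmax(logits)
    return [(1.0 if k == action else 0.0) - p for k, p in enumerate(probs)]


def likelihood_gradient(logits, trajectories, weights):
    """(1/N) * sum_i weight_i * sum_t grad log pi(a_{i,t})."""
    n = len(trajectories)
    grad = [0.0] * len(logits)
    for actions, w in zip(trajectories, weights):
        for a in actions:
            g = grad_log_prob(logits, a)
            grad = [acc + w * gk / n for acc, gk in zip(grad, g)]
    return grad


# Hypothetical data: two trajectories of two actions each, with their total returns.
logits = [0.0, 0.0]
trajectories = [[0, 0], [1, 1]]
returns = [1.0, 3.0]

# Maximum likelihood: every trajectory gets weight 1.
ml_grad = likelihood_gradient(logits, trajectories, [1.0, 1.0])
# Policy gradient: every trajectory is weighted by its return.
pg_grad = likelihood_gradient(logits, trajectories, returns)

print("maximum likelihood gradient:", ml_grad)
print("policy gradient:            ", pg_grad)
```

With equal weights the two trajectories cancel out, while the return weighting pushes probability toward action 1, which earned the higher return.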

The two methods therefore also differ in how actions are sampled and used, as the following implementations show:

- Policy Gradient

In [None]:
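import itertools

import numpy as np
import torch
import torch.nn.functional as F
from torch import distributions, nn, optim

# `ptu` is the course's pytorch_util helper module (build_mlp, device, from_numpy,
# to_numpy); its import path depends on the repository layout.


class MLPPolicy(nn.Module):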
    """Base MLP policy, which can take an observation and output a distribution over actions.

    This class should implement the `forward` and `get_action` methods. The `update` method should be written in the
    subclasses, since the policy update rule differs for different algorithms.
    """

    def __init__(
        self,
        ac_dim: int,
        ob_dim: int,
        discrete: bool,
        n_layers: int,
        layer_size: int,
        learning_rate: float,
    ):
        super().__init__()

        if discrete:
            self.logits_net = ptu.build_mlp(
                input_size=ob_dim,
                output_size=ac_dim,
                n_layers=n_layers,
                size=layer_size,
            ).to(ptu.device)
            parameters = self.logits_net.parameters()
        else:
            self.mean_net = ptu.build_mlp(
                input_size=ob_dim,
                output_size=ac_dim,
                n_layers=n_layers,
                size=layer_size,
            ).to(ptu.device)
            self.logstd = nn.Parameter(
                torch.zeros(ac_dim, dtype=torch.float32, device=ptu.device)
            )
            parameters = itertools.chain([self.logstd], self.mean_net.parameters())

        self.optimizer = optim.Adam(
            parameters,
            learning_rate,
        )

        self.discrete = discrete

    @torch.no_grad()
    def get_action(self, obs: np.ndarray) -> np.ndarray:
        """Takes a single observation (as a numpy array) and returns a single action (as a numpy array)."""
        obs = ptu.from_numpy(obs)
        action_distribution = self.forward(obs)
        # No gradient is needed when acting, so plain sampling is enough for both
        # the discrete (Categorical) and continuous (Normal) cases.
        action = action_distribution.sample()

        return ptu.to_numpy(action)

    def forward(self, obs: torch.FloatTensor):
        """
        This function defines the forward pass of the network.  You can return anything you want, but you should be
        able to differentiate through it. For example, you can return a torch.FloatTensor. You can also return more
        flexible objects, such as a `torch.distributions.Distribution` object. It's up to you!
        """
        if self.discrete:
            # Discrete action space: a categorical distribution parameterized by logits.
            logits = self.logits_net(obs)
            dist = distributions.Categorical(logits=logits)
        else:
            # Continuous action space: a Gaussian with a state-dependent mean and a
            # learned, state-independent standard deviation.
            mean = self.mean_net(obs)
            std = torch.exp(self.logstd)
            dist = distributions.Normal(mean, std)
        return dist

    def update(self, obs: np.ndarray, actions: np.ndarray, *args, **kwargs) -> dict:
        """Performs one iteration of gradient descent on the provided batch of data."""
        raise NotImplementedError


class MLPPolicyPG(MLPPolicy):
    """Policy subclass for the policy gradient algorithm."""

    def update(
        self,
        obs: np.ndarray,
        actions: np.ndarray,
        advantages: np.ndarray,
    ) -> dict:
        """Implements the policy gradient actor update."""
        obs = ptu.from_numpy(obs)
        actions = ptu.from_numpy(actions)
        advantages = ptu.from_numpy(advantages)

        # Policy gradient actor update: minimize the negative advantage-weighted log-likelihood.
        self.optimizer.zero_grad()
        action_distribution = self.forward(obs)
        if self.discrete:
            log_prob = action_distribution.log_prob(actions)
        else:
            # Normal.log_prob is per action dimension; sum to get the joint log-probability.
            log_prob = action_distribution.log_prob(actions).sum(dim=-1)

        loss = -(log_prob * advantages).mean()
        loss.backward()
        self.optimizer.step()

        return {
            "Actor Loss": ptu.to_numpy(loss),
        }

- Imitation Learning

In [None]:
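import abc
from typing import Any

# `BasePolicy` and `build_mlp` come from the course's starter code.


class MLPPolicySL(BasePolicy, nn.Module, metaclass=abc.ABCMeta):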
    """
    Defines an MLP for supervised learning which maps observations to continuous
    actions.

    Attributes
    ----------
    mean_net: nn.Sequential
        A neural network that outputs the mean for continuous actions
    logstd: nn.Parameter
        A separate parameter to learn the standard deviation of actions

    Methods
    -------
    forward:
        Runs a differentiable forwards pass through the network
    update:
        Trains the policy with a supervised learning objective
    """
    def __init__(self,
                 ac_dim,
                 ob_dim,
                 n_layers,
                 size,
                 learning_rate=1e-4,
                 training=True,
                 nn_baseline=False,
                 **kwargs
                 ):
        super().__init__(**kwargs)

        # init vars
        self.ac_dim = ac_dim
        self.ob_dim = ob_dim
        self.n_layers = n_layers
        self.size = size
        self.learning_rate = learning_rate
        self.training = training
        self.nn_baseline = nn_baseline

        self.mean_net = build_mlp(
            input_size=self.ob_dim,
            output_size=self.ac_dim,
            n_layers=self.n_layers, size=self.size,
        )
        self.mean_net.to(ptu.device)
        # The parameter is created directly on the target device.
        self.logstd = nn.Parameter(
            torch.zeros(self.ac_dim, dtype=torch.float32, device=ptu.device)
        )
        self.optimizer = optim.Adam(
            itertools.chain([self.logstd], self.mean_net.parameters()),
            self.learning_rate
        )

    def save(self, filepath):
        """
        :param filepath: path to save MLP
        """
        torch.save(self.state_dict(), filepath)

    def forward(self, observation: torch.FloatTensor) -> Any:
        """
        Defines the forward pass of the network

        :param observation: observation(s) to query the policy
        :return:
            action: sampled action(s) from the policy
        """
        # Build a Gaussian over actions and return a reparameterized sample, so the
        # result stays differentiable with respect to the mean network and logstd.
        observation = observation.float().to(ptu.device)
        mean = self.mean_net(observation)
        std = torch.exp(self.logstd)
        dist = distributions.Normal(mean, std)
        # rsample() (unlike sample()) keeps the gradient path that update() relies on.
        action = dist.rsample()
        return action

    def update(self, observations, actions):
        """
        Updates/trains the policy

        :param observations: observation(s) to query the policy
        :param actions: actions we want the policy to imitate
        :return:
            dict: 'Training Loss': supervised learning loss
        """
        # Supervised update: regress the policy's sampled action onto the expert action.
        self.optimizer.zero_grad()
        action_pred = self.forward(observations)
        loss = F.mse_loss(action_pred, actions)
        loss.backward()
        self.optimizer.step()
        return {
            # You can add extra logging information here, but keep this line
            'Training Loss': ptu.to_numpy(loss),
        }
        
    def get_action(self, obs: np.ndarray) -> np.ndarray:
        obs = torch.tensor(obs,dtype=torch.float32).to(ptu.device)
        with torch.no_grad():
            action = ptu.to_numpy(self.forward(obs))
        return action

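The two implementations differ in how they produce actions and how they compute the loss. The imitation learning policy's `forward` returns a sampled action directly and is trained with an MSE loss against the expert's actions. The policy gradient policy's `forward` returns a distribution and is trained with an advantage-weighted negative log-likelihood, whose gradient is the estimate of $\nabla_\theta J(\theta)$.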
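The MSE loss can still be tied back to the maximum likelihood formula. For a Gaussian policy with a fixed standard deviation, the negative log-likelihood of an expert action and the squared error between that action and the policy mean differ only by a scale factor and an additive constant. The sketch below illustrates this for a scalar action; it assumes the prediction is the policy mean (not a sampled action as in the `forward` above), and the function names and numbers are hypothetical.

```python
import math


def gaussian_nll(action, mean, std):
    """Negative log-density of a scalar Normal(mean, std) evaluated at `action`."""
    return (
        0.5 * ((action - mean) / std) ** 2
        + math.log(std)
        + 0.5 * math.log(2.0 * math.pi)
    )


def squared_error(action, mean):
    """Squared error between the expert action and the predicted mean."""
    return (action - mean) ** 2


# Hypothetical expert action, fixed std, and two candidate policy means.
expert_action = 1.0
std = 0.5
mean_a, mean_b = 0.0, 0.8

# Differences between candidates: the constant terms cancel, leaving only a scale factor.
nll_gap = gaussian_nll(expert_action, mean_a, std) - gaussian_nll(expert_action, mean_b, std)
mse_gap = squared_error(expert_action, mean_a) - squared_error(expert_action, mean_b)

print("NLL gap:                ", nll_gap)
print("MSE gap / (2 * std**2): ", mse_gap / (2.0 * std ** 2))
```

Under these assumptions, ranking candidate means by squared error and ranking them by likelihood give the same ordering.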

## Analysis of Policy Gradient

For the analysis, I think the critical skills to master are:

- computing the expectation of the policy gradient estimator

- computing the variance of the policy gradient estimator

Only with these mathematical tools can we reason about how to reduce the variance of the method.
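Both quantities can be computed exactly on a small example. The sketch below assumes a one-step problem (a two-armed bandit) with a softmax policy and looks at the single-sample estimator for the first logit, with and without a constant baseline subtracted from the reward. The baseline is a standard variance-reduction device that is not introduced in the text above; the rewards and the name `estimator_stats` are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def estimator_stats(logits, rewards, baseline):
    """Exact mean and variance of g(a) = d/d(logit_0) log pi(a) * (r(a) - baseline), a ~ pi."""
    probs = softmax(logits)
    outcomes = []
    for a, (p, r) in enumerate(zip(probs, rewards)):
        # Derivative of log softmax(logits)[a] with respect to logit 0.
        score = (1.0 if a == 0 else 0.0) - probs[0]
        outcomes.append((p, score * (r - baseline)))
    mean = sum(p * g for p, g in outcomes)
    var = sum(p * (g - mean) ** 2 for p, g in outcomes)
    return mean, var


logits = [0.0, 0.0]   # uniform policy over two actions
rewards = [1.0, 3.0]  # hypothetical reward of each action

mean_no_b, var_no_b = estimator_stats(logits, rewards, baseline=0.0)
mean_b, var_b = estimator_stats(logits, rewards, baseline=2.0)

print("no baseline:   mean =", mean_no_b, " variance =", var_no_b)  # mean -0.5, variance 1.0
print("with baseline: mean =", mean_b, " variance =", var_b)        # mean -0.5, variance 0.0
```

The expectation is unchanged by the baseline, while the variance drops; here it drops all the way to zero because the baseline equals the average reward under a uniform two-action policy.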

The objective function of policy gradient is defined as:

$J(\theta) = \mathbb{E}_{\tau\sim p_\theta(\tau)} \left[ \sum_t r(s_t, a_t) \right] \approx \frac{1}{N} \sum_i \sum_t r(s_{i,t}, a_{i,t})$
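The sample-based approximation on the right-hand side is just an average of per-trajectory returns. A minimal sketch, assuming the rewards of each sampled trajectory are already collected in a list (the data and the name `estimate_objective` are hypothetical):

```python
def estimate_objective(reward_trajectories):
    """J(theta) ~= (1/N) * sum_i sum_t r(s_{i,t}, a_{i,t})."""
    returns = [sum(rewards) for rewards in reward_trajectories]
    return sum(returns) / len(returns)


# Hypothetical rewards from N = 3 sampled trajectories of different lengths.
sampled_rewards = [
    [1.0, 0.0, 2.0],
    [0.5, 0.5],
    [0.0, 1.0, 1.0, 1.0],
]

j_hat = estimate_objective(sampled_rewards)  # (3 + 1 + 3) / 3
print("estimated J:", j_hat)
```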

Intuitively, policy gradient makes good outcomes (high-return trajectories) more likely and bad outcomes less likely.

Let's compute the gradient of this objective function:

$\theta^* = \underset{\theta}{\mathrm{arg\,max}} \, \mathbb{E}_{\tau\sim p_\theta(\tau)} \left[ \sum_t r(s_t, a_t) \right]$

$J(\theta) = \mathbb{E}_{\tau\sim p_\theta(\tau)} [r(\tau)] = \int p_\theta(\tau)r(\tau) d\tau$

$\nabla_\theta J(\theta) = \int \nabla_\theta p_\theta(\tau)r(\tau) d\tau = \int p_\theta(\tau) \nabla_\theta \log p_\theta(\tau)r(\tau) d\tau = \mathbb{E}_{\tau\sim p_\theta(\tau)} [\nabla_\theta \log p_\theta(\tau)r(\tau)]$

The second equality uses a convenient identity (the log-derivative trick):

$p_\theta(\tau) \nabla_\theta \log p_\theta(\tau) = p_\theta(\tau) \frac{\nabla_\theta p_\theta(\tau)}{p_\theta(\tau)} = \nabla_\theta p_\theta(\tau)$
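The derivation can be sanity-checked numerically. The sketch below assumes a one-step problem where a "trajectory" is a single action drawn from a softmax policy, so the expectation can be enumerated exactly. It compares the score-function form $\mathbb{E}[\nabla_\theta \log p_\theta \, r]$ with a finite-difference gradient of $J(\theta)$. All names and numbers are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def objective(logits, rewards):
    """J(theta) = sum_a p_theta(a) * r(a)."""
    return sum(p * r for p, r in zip(softmax(logits), rewards))


def score_function_gradient(logits, rewards):
    """E_{a ~ p_theta}[grad log p_theta(a) * r(a)], enumerated exactly."""
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for a, (p, r) in enumerate(zip(probs, rewards)):
        for k in range(len(logits)):
            score = (1.0 if k == a else 0.0) - probs[k]
            grad[k] += p * score * r
    return grad


def finite_difference_gradient(logits, rewards, eps=1e-6):
    """Central-difference approximation of grad J(theta)."""
    grad = []
    for k in range(len(logits)):
        up = list(logits)
        down = list(logits)
        up[k] += eps
        down[k] -= eps
        grad.append((objective(up, rewards) - objective(down, rewards)) / (2.0 * eps))
    return grad


logits = [0.2, -0.1, 0.4]
rewards = [1.0, 3.0, -2.0]

analytic = score_function_gradient(logits, rewards)
numeric = finite_difference_gradient(logits, rewards)

print("score-function gradient:   ", analytic)
print("finite-difference gradient:", numeric)
```

The two gradients should agree up to the finite-difference error, which is what the identity predicts.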
