About the operation in updating policy net #16

Closed
qingyue2014 opened this issue Aug 20, 2019 · 3 comments
Comments

@qingyue2014

Can you explain the goal of computing the value loss and the action loss when you update the policy net? I don't think the way the net is updated is consistent with the formula in your paper.

Or am I misunderstanding something?

@eric-xw
Owner

eric-xw commented Aug 30, 2019

Hi,

The RL part is a common implementation of policy gradient with a baseline, and the overall implementation is aligned with the formulation in the paper. Can you be more specific about your questions?
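
For reference, here is a rough, generic sketch of what policy gradient with a baseline typically looks like (the function and variable names are illustrative, not the exact code in this repository):

    import torch

    def policy_gradient_loss(log_probs, rewards, baseline, mask):
        # log_probs: (batch, T) log-probabilities of the sampled tokens
        # rewards:   (batch, T) per-step rewards (a sequence-level reward can be broadcast)
        # baseline:  (batch, T) value estimates used to reduce variance
        # mask:      (batch, T) 1 for real tokens, 0 for padding
        advantage = (rewards - baseline).detach()  # no gradient flows through the advantage
        action_loss = -(log_probs * advantage * mask).sum() / mask.sum()
        # the baseline (value head) regresses toward the observed reward
        value_loss = (((baseline - rewards) ** 2) * mask).sum() / mask.sum()
        return action_loss, value_loss

Minimizing action_loss with gradient descent is the same as doing gradient ascent on the expected reward, and value_loss trains the baseline.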

Thanks,

@qingyue2014
Author

qingyue2014 commented Aug 30, 2019

Sorry, my previous statement was not clear. My questions are as follows.

  1. About the loss of the reward net. In your paper, the objective of the reward function is to minimize the expectation of the reward under the empirical distribution minus the expectation under the policy network's distribution, but in your code the sign of the loss (train_AREL.py, line 138) is the opposite:
    loss = -torch.sum(gt_score) + torch.sum(gen_score)
    Why is that?
  2. About the loss of the policy net. The variable opt.rl_weight is used in calculating the loss. What is the meaning of the variables loss and tf_loss in
    loss = opt.rl_weight * loss + (1 - opt.rl_weight) * tf_loss

Looking forward to your reply! Thanks.

@eric-xw
Owner

eric-xw commented Aug 30, 2019

  1. In the paper, we show the objective functions to be maximized (gradient ascent). In practice, we usually minimize the corresponding loss functions with gradient descent instead, but the two are equivalent.
  2. tf_loss is the cross-entropy loss, which helps stabilize the training.
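
As a rough sketch of both points above (illustrative names and dummy values, not the exact lines in train_AREL.py):

    import torch

    # 1) Reward model: the paper states the objective to be maximized
    #    (roughly E_data[r] - E_policy[r]); the code minimizes its negation,
    #    which gives the same update direction.
    gt_score = torch.randn(8)    # reward scores for ground-truth stories (dummy values)
    gen_score = torch.randn(8)   # reward scores for generated stories (dummy values)
    reward_loss = -torch.sum(gt_score) + torch.sum(gen_score)

    # 2) Policy update: blend the policy-gradient (RL) loss with the
    #    cross-entropy loss; rl_weight balances the two terms.
    rl_weight = 0.9                   # placeholder value
    rl_loss = torch.tensor(1.23)      # placeholder policy-gradient loss
    tf_loss = torch.tensor(0.45)      # placeholder cross-entropy loss
    loss = rl_weight * rl_loss + (1 - rl_weight) * tf_loss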

@eric-xw closed this as completed Aug 30, 2019