
Question about the weight for correction in the importance sampling #7

Closed
miracle24 opened this issue Dec 9, 2019 · 4 comments

@miracle24

miracle24 commented Dec 9, 2019

Hello. Thanks for your great work, from which I have learned a lot about reinforcement learning. I am confused about the computation of the correction weight in importance sampling.

According to the paper "Top-K Off-Policy Correction for a REINFORCE Recommender System", the correction weight is pi_theta(a_t | s_t) / beta(a_t | s_t), in which a_t, I think, is an action sampled from the behavior policy, i.e. a_t ~ beta(· | s_t). The reward of a sequence is therefore corrected by dividing the likelihood of action a_t given s_t under the updated policy pi_theta by the likelihood of the same action a_t given s_t under beta. Note that the same a_t is fed into both pi_theta and beta. However, in your implementation, I found that the actions are not the same for pi and beta; see the function "pi_beta_sample" here.
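For concreteness, here is a minimal sketch of this reading (illustrative code, not from the repository, assuming categorical pi and beta over the item catalogue):

import torch
from torch.distributions import Categorical

# illustrative only: per-item logits of the updated policy pi and the
# behavior policy beta at the same state s_t (100 items, batch of one)
pi = Categorical(logits=torch.randn(1, 100))
beta = Categorical(logits=torch.randn(1, 100))

# a_t comes from the behavior (logged) policy, and the SAME a_t is
# scored under both pi and beta
a_t = beta.sample()

# correction weight: pi(a_t | s_t) / beta(a_t | s_t)
weight = torch.exp(pi.log_prob(a_t) - beta.log_prob(a_t))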

Am I wrong about this?

Thanks!

@awarebayes
Owner

But I will consider changing it as you said. That will probably be a better solution.

@awarebayes
Owner

awarebayes commented Dec 9, 2019

Alright, I'm about to release a commit for this.

The solution is simple: substitute the historical action with the one generated by the beta network.

  • Why do we need it? Because we need the probabilities.
    P.S. The historical a_t is just a number, a movie_id if you like. For everything else a plain number is not enough; we need at least something like softmax probs or a log_prob. You could translate that historical number into a one-hot vector (see the sketch after this list), but I think beta is less hassle and simply better. Either way, the idea is the same.

  • Can we do that? Since the entire purpose of beta is to emulate the historical action, it can give us the probabilities we need, as well as that historically accurate action you were talking about.
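A rough sketch of that one-hot route (illustrative only, not repository code):

import torch
import torch.nn.functional as F

num_items = 100                       # illustrative catalogue size
historical_action = torch.tensor(7)   # the logged movie_id, just a number

# translate the plain id into a one-hot vector ...
one_hot = F.one_hot(historical_action, num_classes=num_items).float()

# ... so it can be scored against the softmax probs a policy head outputs
pi_probs = torch.softmax(torch.randn(num_items), dim=-1)
log_prob = torch.log((pi_probs * one_hot).sum())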

I have added a parameter called 'action_source', and so far I have come up with:

class DiscreteActor(nn.Module):
    def __init__(self, input_dim, action_dim, hidden_size, init_w=0):
        super(DiscreteActor, self).__init__()
        ....
        # What's action source? See this issue: https://github.com/awarebayes/RecNN/issues/7
        # by default {pi: pi, beta: beta}
        # you can change it to be like {pi: beta, beta: beta} as miracle24 suggested
        self.action_source = {'pi': 'pi', 'beta': 'beta'}

    def pi_beta_sample(self, state, beta, action, **kwargs):
        ....
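        # (steps 1 and 2, elided above, are assumed to build pi_categorical and
        # beta_categorical from the pi forward pass and the given beta probs)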

        # 3. sample the actions
        # See this issue: https://github.com/awarebayes/RecNN/issues/7
        # usually it works like:
        # pi_action = pi_categorical.sample(); beta_action = beta_categorical.sample();
        # but changing the action_source to {pi: beta, beta: beta} can be configured to be:
        # pi_action = beta_categorical.sample(); beta_action = beta_categorical.sample();
        available_actions = {'pi': pi_categorical.sample(), 'beta': beta_categorical.sample()}
        pi_action = available_actions[self.action_source['pi']]
        beta_action = available_actions[self.action_source['beta']]

policy_net = recnn.nn.DiscreteActor(1290, num_items, 2048).to(cuda)
# as miracle24 has suggested https://github.com/awarebayes/RecNN/issues/7
# you can enable this to make it behave more like the paper's authors intended:
# policy_net.action_source = {'pi': 'beta', 'beta': 'beta'}
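As a rough standalone illustration of what that switch changes (names here are assumptions, not repository code):

import torch
from torch.distributions import Categorical

pi_categorical = Categorical(logits=torch.randn(10))
beta_categorical = Categorical(logits=torch.randn(10))

action_source = {'pi': 'beta', 'beta': 'beta'}   # the setting suggested above
available_actions = {'pi': pi_categorical.sample(),
                     'beta': beta_categorical.sample()}
pi_action = available_actions[action_source['pi']]
beta_action = available_actions[action_source['beta']]

# both log-probs now refer to the same (beta-sampled) action, so the
# importance weight pi(a|s) / beta(a|s) matches the paper's correction
weight = torch.exp(pi_categorical.log_prob(pi_action) -
                   beta_categorical.log_prob(beta_action))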

@awarebayes
Owner

I have released a commit for this; you can look at it here.

Write me back to let me know whether I can close the issue.

@miracle24
Author

miracle24 commented Dec 9, 2019

Actually, I do not know which one is correct or better. Maybe I can try both in my task. Thanks a lot.
