
explore_eval gives systematically different average loss than cb_explore_adf against the same "simulator" data #2621

Closed
maxpagels opened this issue Nov 1, 2020 · 3 comments
Labels
Bug Bug in learning semantics, critical by default

Comments


maxpagels commented Nov 1, 2020

Describe the bug

explore_eval gives systematically different average loss than cb_explore_adf on a bandit dataset synthetically constructed from a supervised dataset.

I've created a dataset as follows:

  • Ten actions per round
  • All actions share 2 features
  • Actions 0-4 always receive a cost of -1, actions 5-9 always receive a cost of 1
  • Each action is played and logged with probability 0.1 (10%)
  • In other words, the CB module should learn that the shared features don't matter here; only the per-action indicator feature does

See attached dataset: ldf-test.txt.gz
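For reference, here is a minimal sketch of how a dataset of this shape could be generated. The file name, feature names, and the assumption that the attached file uses the standard VW ADF contextual-bandit text format (a 0:cost:probability label on the chosen action's line) are mine; the attached ldf-test.txt.gz is the authoritative source.

```python
import random

# Sketch of a generator for a dataset of this shape (assumes the standard
# VW ADF contextual-bandit text format, where the chosen action's line
# carries a "0:cost:probability" label).
random.seed(1)
NUM_ROUNDS = 10_000
NUM_ACTIONS = 10

with open("ldf-test.txt", "w") as f:
    for _ in range(NUM_ROUNDS):
        chosen = random.randrange(NUM_ACTIONS)   # uniform logging policy, p = 0.1
        cost = -1 if chosen < 5 else 1           # actions 0-4 cost -1, actions 5-9 cost +1
        f.write("shared |s f1 f2\n")             # two shared features per round
        for a in range(NUM_ACTIONS):
            label = f"0:{cost}:0.1 " if a == chosen else ""
            f.write(f"{label}|a action_{a}\n")   # one indicator feature per action
        f.write("\n")                            # blank line ends the multiline example
```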

To Reproduce

Steps to reproduce the behavior:

  1. Run vw --cb_explore_adf -d ldf-test.txt.gz --cb_type ips --epsilon 1.0
  2. Observe an average loss of around 0 (this is expected: the system is exploring 100% of the time, and half of the feedback has cost +1 and the other half cost -1)
  3. Run vw --explore_eval -d ldf-test.txt.gz --cb_type ips --epsilon 1.0
  4. Observe loss of -0.098420

Expected behavior

I'd expect the loss when running explore_eval with 100% random exploration (--epsilon 1.0) to be similar, in this particular case, to the loss from --cb_explore_adf with the same exploration. Interestingly, running the commands above with --progress 1 and looking at the printouts, it seems that explore_eval accepts fewer of the examples with +1 cost and more of those with -1 cost; that proportion not being 50/50 would explain the behaviour, though the root cause is a mystery to me.
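For readers unfamiliar with the technique, here is a minimal sketch of the rejection-sampling idea I believe explore_eval is based on. This is a generic illustration under my own assumptions, not VW's actual implementation: a logged event is accepted with probability proportional to the ratio of the evaluated exploration policy's action probability to the logging probability, and the accepted events are then scored as if they had been collected on-policy.

```python
import random

def explore_eval_sketch(logged_events, eval_policy_prob, ratio_bound):
    """Generic rejection-sampling policy evaluator (illustration only).

    logged_events: iterable of (context, action, cost, logging_prob) tuples
    eval_policy_prob(context, action): probability that the exploration
        policy under evaluation would play `action` given `context`
    ratio_bound: an upper bound on eval_prob / logging_prob, so that the
        acceptance probability below never exceeds 1
    """
    total_cost, accepted = 0.0, 0
    for context, action, cost, logging_prob in logged_events:
        accept_prob = eval_policy_prob(context, action) / (ratio_bound * logging_prob)
        if random.random() < accept_prob:
            # Accepted events are distributed as if the evaluated policy had
            # done the exploration itself, so their plain average cost is an
            # on-policy estimate of that policy's loss.
            total_cost += cost
            accepted += 1
    return total_cost / accepted if accepted else float("nan")
```

If that is roughly what happens under the hood, then with a uniform logging policy over ten actions and a uniform evaluated policy (--epsilon 1.0) the ratio is 1 everywhere, every event should be accepted, and the estimate should reduce to the plain average of the logged costs, i.e. close to 0 here.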

Observed Behavior

Losses are different, and in this case I would expect them to be approximately the same.

Changing the costs to e.g. 0 and 1 doesn't change things (the average loss is expected to be 0.5, which it is with --cb_explore_adf but not with --explore_eval). It seems like the average loss in explore_eval is either systematically overestimated, or the loss interpretation is different. If the latter, some explanation of why this happens would help, or a "for dummies" tutorial explaining the rejection sampling approach I think is used under the hood.

Environment

Version: 8.8.1 (git commit: 5ff219e)
OS: macOS Catalina
Reproduced via CLI

maxpagels added the Bug label (Bug in learning semantics, critical by default) on Nov 1, 2020
cheng-tan (Collaborator) commented:

Hi @maxpagels. In this case, the dataset has a uniform probability for each played action (1/#actions) and the (stochastic) policy being evaluated is also the uniform distribution (--epsilon 1), so explore_eval should return the on-policy estimate, which is the average of the costs in the file. With the provided dataset, the average loss from explore_eval should equal 0.00138. We found a bug while investigating this issue and will get it fixed shortly.
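For what it's worth, a quick way to sanity-check that number against the attached file, assuming the same 0:cost:probability label convention sketched earlier in the thread:

```python
import gzip

# Average the logged costs: the chosen action's line carries a label such as
# "0:-1:0.1", i.e. the text before the first "|" that contains a ":".
costs = []
with gzip.open("ldf-test.txt.gz", "rt") as f:
    for line in f:
        label = line.split("|")[0].strip()
        if label and ":" in label:
            costs.append(float(label.split(":")[1]))

print(sum(costs) / len(costs))   # should come out near the 0.00138 quoted above
```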

maxpagels (Author) commented:

@cheng-tan thanks! So it wasn't just me imagining that the loss should be approximately 0 in this case.

jackgerrits (Member) commented:

Looks like this was fixed in #2631, which made it into the 8.9.0 release. Let us know if there are any issues here still.
