Describe the bug
explore_eval gives a systematically different average loss than cb_explore_adf on a bandit dataset synthetically constructed from a supervised dataset. See attached dataset: ldf-test.txt.gz

To Reproduce
Steps to reproduce the behavior:
Run vw --cb_explore_adf -d ldf-test.txt.gz --cb_type ips --epsilon 1.0
Observe an average loss of around 0 (expected, because the system explores 100% of the time and half of the feedback is +1, the other half -1)
Run vw --explore_eval -d ldf-test.txt.gz --cb_type ips --epsilon 1.0
Observe an average loss of -0.098420
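For intuition on why cb_explore_adf reports a loss near 0 here, the following is a toy simulation of the IPS estimate under 100% uniform exploration. It is a sketch, not VW's code; the 2 actions per example and the even ±1 cost split are assumptions standing in for the actual dataset:

```python
import random

random.seed(0)
k = 2          # assumed number of actions per example
n = 100_000    # number of simulated examples

total = 0.0
for _ in range(n):
    a = random.randrange(k)   # logged action under uniform exploration
    p_log = 1.0 / k           # probability the logger gave to that action
    # Assumed cost distribution: half the feedback is +1, half -1.
    cost = 1.0 if random.random() < 0.5 else -1.0
    p_target = 1.0 / k        # evaluated policy is also uniform (--epsilon 1.0)
    # IPS term: cost weighted by target probability over logging probability.
    total += cost * p_target / p_log

avg = total / n
print(round(avg, 3))  # a value close to 0
```

Because the target probability equals the logging probability, every importance weight is 1 and the IPS estimate collapses to the plain mean of the costs, which is near 0 for a balanced ±1 split.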
Expected behavior
I'd expect the loss when running explore_eval with 100% random exploration (--epsilon 1.0) to be similar, in this particular case, to that of --cb_explore_adf with the same exploration. Interestingly, running vw --explore_eval -d ldf-test.txt.gz --cb_type ips --epsilon 1.0 with --progress 1 and looking at the printout, it seems that explore_eval accepts fewer of the examples with +1 cost and more with -1 cost; this proportion not being 50/50 would explain the behaviour, though the root cause is a mystery to me.
Observed Behavior
Losses are different; in this case I would expect them to be approximately the same.
Changing the costs to e.g. 0 and 1 doesn't change things (the average loss is expected to be 0.5, which it is with --cb_explore_adf but not with --explore_eval). It seems that the average loss in explore_eval is either systematically overestimated, or the loss interpretation is different. If the latter, some explanation of why this happens would help, or perhaps a "for dummies" tutorial explaining the rejection sampling approach I think is used under the hood.
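As a sketch of the rejection-sampling idea that I believe explore_eval uses (this is an illustration, not VW's implementation): each logged example is accepted with probability proportional to the ratio of the evaluated policy's probability to the logging probability. When both policies are uniform, that ratio is identically 1, so acceptance should be independent of the cost, the accepted examples should keep the 50/50 cost split, and their average cost should stay near 0:

```python
import random

random.seed(1)
k, n = 2, 200_000   # assumed: 2 actions per example, 200k simulated examples

accepted = []
for _ in range(n):
    p_log = 1.0 / k       # uniform logging policy
    p_target = 1.0 / k    # uniform evaluated policy (--epsilon 1.0)
    # Assumed cost distribution: half +1, half -1.
    cost = 1.0 if random.random() < 0.5 else -1.0
    # Rejection sampling: keep the example with probability
    # p_target / (c * p_log), where c bounds the ratio. Here the ratio is 1.
    c = 1.0
    if random.random() < p_target / (c * p_log):
        accepted.append(cost)

print(len(accepted) == n)  # True: with ratio 1, everything is accepted
print(round(sum(accepted) / len(accepted), 2))  # close to 0
```

If acceptance were somehow correlated with the cost, as the --progress 1 printout above suggests, the average cost of the accepted examples would be biased exactly as reported.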
Environment
Version: 8.8.1 (git commit: 5ff219e)
OS: macOS Catalina
Reproduced via CLI
Hi @maxpagels, in this case the dataset has uniform probabilities for each played action (1/#actions) and the (stochastic) policy being evaluated is the uniform distribution (--epsilon 1), so explore_eval should return the on-policy estimate, which is the average of the costs in the file. With the provided dataset, the average loss of explore_eval should equal 0.00138. We found a bug while investigating this issue and will get it fixed shortly.
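To see why the on-policy estimate reduces to the plain average of the logged costs in this setup, a small deterministic check (the costs and probabilities below are illustrative values, not the real dataset): when the logging probabilities and the evaluated policy's probabilities coincide, each importance weight is 1, so the IPS estimate and the cost average are identical.

```python
# Each tuple is (cost, p_log, p_target); both policies are uniform over
# 2 actions here, so every importance weight p_target / p_log is 1.
logged = [
    (+1.0, 0.5, 0.5),
    (-1.0, 0.5, 0.5),
    (+1.0, 0.5, 0.5),
    (-1.0, 0.5, 0.5),
]

ips = sum(c * (pt / pl) for c, pl, pt in logged) / len(logged)
plain = sum(c for c, _, _ in logged) / len(logged)
print(ips, plain)  # 0.0 0.0 -- the two estimates coincide
```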