Python: Saving and loading a Vowpal Wabbit model with --save_resume and --cb_explore fails with RuntimeError: Model content is corrupted #3062

Closed
alxbar75 opened this issue Jun 11, 2021 · 7 comments · Fixed by #3063
Labels: Bug (Bug in learning semantics, critical by default)

Comments

@alxbar75

Describe the bug

When a vw model is initialized with --cb_explore and --save_resume, saving it with vw.save("model.vw") and then loading the saved file raises an exception. This only happens if vw.learn(...) was called before saving.

Omitting --save_resume avoids the exception, but model performance is not as good.

To Reproduce

Code example to reproduce:

from vowpalwabbit import pyvw

print('# test some save/load behavior')
example = "feature1:f feature2:f feature3:y feature4:f feature5:f feature6:f feature7:c feature8:b feature9:h feature10:e feature11:b feature12:k feature13:k feature14:b feature15:b feature16:p feature17:w feature18:o feature19:l feature20:h feature21:v feature22:g"
vw = pyvw.vw("--cb_explore 2 --quiet --save_resume") # removing --save_resume will prevent exception
vw.learn(f"1:1:0.25 | {example}")
before_save = vw.predict(f"| {example}")
print('before saving, prediction =', before_save)
vw.save("model.vw")

# now re-start vw by loading that model
vw = pyvw.vw("--quiet -i model.vw")
after_save = vw.predict(f"| {example}")
print(' after saving, prediction =', after_save)

Expected behavior

No Exception, model loaded and training can be continued.
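
The expected round trip could be checked with something like the following. This is only a sketch: `predictions_match` is a hypothetical helper name, not part of vowpalwabbit, and the tolerances are arbitrary assumptions. It accepts either a single float or a sequence of floats, since the shape of the prediction depends on the reduction in use.

```python
import math


def predictions_match(before, after, rel_tol=1e-6, abs_tol=1e-9):
    """Return True if two predictions (a float or a sequence of floats) agree.

    Hypothetical helper for illustration; the tolerances are assumptions.
    """
    # Normalise scalars to one-element lists so both shapes are handled alike.
    b = list(before) if hasattr(before, "__iter__") else [before]
    a = list(after) if hasattr(after, "__iter__") else [after]
    # Lengths must agree, and every pair must be close within tolerance.
    return len(a) == len(b) and all(
        math.isclose(x, y, rel_tol=rel_tol, abs_tol=abs_tol) for x, y in zip(b, a)
    )
```

In the reproduction script above, this would be called as `predictions_match(before_save, after_save)` once the reload succeeds.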

Observed Behavior

Error: Model content is corrupted, weight vector index 1079738368 must be less than total vector length 262144
Traceback (most recent call last):
  File "/home/alex/projects/experiment/vw_minimal_example_fail.py", line 13, in <module>
    vw = pyvw.vw("--quiet -i model.vw")
  File "/home/alex/projects/datascience/lib/python3.8/site-packages/vowpalwabbit/pyvw.py", line 347, in __init__
    super(vw, self).__init__(" ".join(l))
RuntimeError: Model content is corrupted, weight vector index 1079738368 must be less than total vector length 262144
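
The two numbers in the message can be sanity-checked with a little arithmetic. The sketch below only uses the values printed in the error. Reading the vector length as a power of two, and so as a count of weight bits, is my assumption about how to interpret it (I believe 18 bits is VW's default), not something the message itself states.

```python
total_length = 262144    # "total vector length" from the error message
bad_index = 1079738368   # "weight vector index" from the error message

# If the length is an exact power of two, this recovers the exponent.
bits = total_length.bit_length() - 1
assert 2 ** bits == total_length

print("implied weight bits:", bits)
print("index within bounds:", bad_index < total_length)
print("index / length ratio:", bad_index // total_length)
```

The index is thousands of times larger than the vector, so the loader appears to be reading a value that was never a weight index at all, which fits the "content is corrupted" wording.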

Environment

vowpalwabbit==8.10.1
Python 3.8.5
Ubuntu 20.04.1

Additional context

I would like to perform online training with many saves and loads in a distributed environment, ideally without saving to a file on disk. Preferably, I would like to serialize the model as a binary string that can be sent around. However, to my understanding, the Python bindings can only save directly to a file, so I have vw write to a named temporary file and then read that file back to retrieve the binary string.

Models that I serialized that way still converge, but the performance is not as good as without serialization. Using "--save_resume" leads to the aforementioned exception.
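
A minimal sketch of the temp-file workaround described above. The helper names `save_to_bytes` and `load_from_bytes` are hypothetical, and the helpers take callables so they do not depend on vowpalwabbit themselves. The only vw calls assumed are the ones already used in this issue: `vw.save(path)` and `pyvw.vw("--quiet -i <path>")`.

```python
import os
import tempfile


def save_to_bytes(save_fn):
    """Call save_fn(path) on a temporary path and return the file's bytes."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = os.path.join(tmp_dir, "model.vw")
        save_fn(path)  # e.g. vw.save
        with open(path, "rb") as f:
            return f.read()


def load_from_bytes(blob, load_fn):
    """Write blob to a temporary path and return load_fn(path)."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = os.path.join(tmp_dir, "model.vw")
        with open(path, "wb") as f:
            f.write(blob)
        # load_fn must finish reading the file before the directory is removed.
        return load_fn(path)


# Assumed usage with the calls from the reproduction script:
#   blob = save_to_bytes(vw.save)
#   vw2 = load_from_bytes(blob, lambda p: pyvw.vw(f"--quiet -i {p}"))
```

A temporary directory is used instead of a named temporary file so that the path can be reopened on platforms that do not allow opening the same temp file twice. Note that a temp path containing spaces would break the `-i {p}` argument string.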

alxbar75 added the Bug label (Bug in learning semantics, critical by default) Jun 11, 2021
@jackgerrits
Member

jackgerrits commented Jun 11, 2021

Hi @alxbar75, I was able to repro this on 8.10.1 also. But it seems as though this is fixed on master. Can you please try installing a wheel file produced by CI? See here for instructions: https://github.com/VowpalWabbit/vowpal_wabbit/wiki/Python#bleeding-edge-latest-commit-on-master

@alxbar75
Author

Hi @jackgerrits, I will give it a shot.

@jackgerrits
Member

So I realized now why that helps. There's a new codepath in master (where cb is converted to cb_adf) that means this bug isn't hit, but the bug still exists there.

I've found the bug and will put out a fix soon. I'll patch 8.10 and release 8.10.2 with the fix also.

@jackgerrits
Member

@alxbar75 8.10.2 has now been released on PyPI.

@alxbar75
Author

Tested vowpalwabbit-8.10.2. Model converged with "--save_resume" properly.

@big-c-note

Having this issue with 9.6.0.

@big-c-note

Oh, I see: I was trying to use copy.deepcopy().
