Training stopped working (probably because of a change Microsoft applied) #87

Closed
ilanshib opened this issue Apr 17, 2023 · 3 comments

@ilanshib

Whenever I run training (e.g. python -m vall_e.train yaml=config/libri/ar.yml), I get the following error:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/local/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/notebooks/vall-e/vall_e/train.py", line 128, in <module>
    main()
  File "/notebooks/vall-e/vall_e/train.py", line 119, in main
    trainer.train(
  File "/notebooks/vall-e/vall_e/utils/trainer.py", line 125, in train
    engines = engines_loader()
  File "/notebooks/vall-e/vall_e/train.py", line 21, in load_engines
    model=trainer.Engine(
  File "/notebooks/vall-e/vall_e/utils/engines.py", line 22, in __init__
    super().__init__(None, *args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 264, in __init__
    self._do_sanity_check()
  File "/usr/local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 988, in _do_sanity_check
    if self.optimizer_name() is not None:
  File "/usr/local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 648, in optimizer_name
    return (self.client_optimizer.__class__.__name__ if self.client_optimizer else self._config.optimizer_name)
AttributeError: 'NoneType' object has no attribute 'optimizer_name'

At first glance, it seems that a change was made to Microsoft's DeepSpeed code. When the DeepSpeed engine is initialized, it looks for a config object that contains the attribute optimizer_name.

vall_e uses DeepSpeed and initializes it as part of the class 'Engine' in utils/engines.py, but it does not pass the required config parameter.
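
To illustrate the failure mode, here is a minimal sketch that reproduces the same AttributeError outside DeepSpeed. `FakeEngine` and `reproduce_error` are hypothetical names, not DeepSpeed's or vall_e's actual code; only the expression inside `optimizer_name` is taken from the traceback, and the sketch assumes `_config` ends up as None when no config is passed.

```python
class FakeEngine:
    """Hypothetical stand-in for the engine; NOT DeepSpeed's real class."""

    def __init__(self, config=None, client_optimizer=None):
        # Assumption: with no config passed in, the engine's _config stays None.
        self._config = config
        self.client_optimizer = client_optimizer

    def optimizer_name(self):
        # Same expression as the last frame of the traceback above.
        return (self.client_optimizer.__class__.__name__
                if self.client_optimizer else self._config.optimizer_name)


def reproduce_error():
    """Return the AttributeError message, or None if nothing was raised."""
    try:
        FakeEngine().optimizer_name()
    except AttributeError as exc:
        return str(exc)
    return None


if __name__ == "__main__":
    print(reproduce_error())
```

With neither a client optimizer nor a config, the fallback branch dereferences None, which matches the error in the report. Passing either one avoids that branch in this sketch; whether that is the right fix for vall_e itself is a separate question.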

I suspect this is the DeepSpeed commit that caused the problem: microsoft/DeepSpeed@47f9f13

Can anyone help?

@ashokgit

Following

@ilanshib
Author

ilanshib commented Apr 18, 2023

More information: if my assumption is correct, then as a temporary fix it should be possible to use a DeepSpeed version earlier than v0.9.0. Does anyone know how to force vall-e to pull in a DeepSpeed dependency version earlier than v0.9.0?

Note: vall_e has a file called setup.py that contains the line "deepspeed>=0.7.7". Maybe we can try updating it.

@ilanshib
Author

Found this temporary fix: in "/notebooks/vall-e/setup.py" (don't get confused, there is another setup.py file under /notebooks/setup.py), I changed the line "deepspeed>=0.7.7" to "deepspeed>=0.7.7,<0.9.0".

This forces the installation to use an older DeepSpeed version that doesn't contain the change that caused the problem.

NOTE THAT THIS IS ONLY A TEMPORARY FIX. The right thing to do is to change vall-e's code to suit the updated DeepSpeed code. I hope someone is up to this challenge!
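
For anyone who wants to confirm that the pin took effect, here is a rough sketch of a check for whether the installed DeepSpeed falls inside ">=0.7.7,<0.9.0". The helper names (`parse_release`, `in_supported_range`, `check_installed_deepspeed`) are made up for this example, and the parsing is a simplification that only looks at the leading numeric components, not a full version-specifier implementation.

```python
from importlib import metadata


def parse_release(version: str) -> tuple:
    """Extract up to three leading numeric components, e.g. '0.8.3+cu117' -> (0, 8, 3)."""
    parts = []
    for piece in version.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    # Pad so that '0.9' compares the same as '0.9.0'.
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def in_supported_range(version: str) -> bool:
    """True if version satisfies >=0.7.7,<0.9.0 (simplified comparison)."""
    return (0, 7, 7) <= parse_release(version) < (0, 9, 0)


def check_installed_deepspeed():
    """Return True/False for the installed deepspeed, or None if it is not installed."""
    try:
        installed = metadata.version("deepspeed")
    except metadata.PackageNotFoundError:
        return None
    return in_supported_range(installed)


if __name__ == "__main__":
    print(check_installed_deepspeed())
```

Running this after reinstalling vall-e should report True if the constraint in setup.py was picked up, and False if a 0.9.x release is still installed.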
