Another opt bug fix ('init_model' not being set correctly) #3162
Conversation
@@ -325,6 +325,9 @@ def create_agent_from_opt_file(opt: Opt):
            opt_from_file[k] = v

    opt_from_file['model_file'] = model_file  # update model file path
    if opt.get('init_model') is not None:
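For context, a minimal sketch of the merge logic this hunk is adding. This is not ParlAI's actual code; the helper name and the bare dicts standing in for `Opt` are illustrative:

```python
def merge_init_model(opt, opt_from_file, model_file):
    """Illustrative stand-in for the relevant lines of
    create_agent_from_opt_file: prefer the live opt's init_model (e.g. a
    checkpoint path set by the train script) over the stale value that was
    saved in the model's .opt file."""
    opt_from_file['model_file'] = model_file  # update model file path
    if opt.get('init_model') is not None:
        # the train script may have pointed init_model at a checkpoint;
        # without this, the value saved at model-creation time would win
        opt_from_file['init_model'] = opt['init_model']
    return opt_from_file
```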
a couple quick questions:
- why is this not handled in lines 309-315?
- why is this not handled in lines 323-325?
- When train_model.py sets opt['init_model'] to the checkpoint, it does not add it to override. Adding it to override could be one possible solution; what do you think?
- It isn't handled here in the specific case where init_model did exist in opt_from_file (i.e. case (1) in the PR description). Ex: finetuning a model on BST that's initialized with Reddit, and the job gets requeued.
I don't know which is better, but I'm very slightly leaning towards adding it to override in train model; it feels marginally better than carving out a special case for init_model in these opt functions. Do you have an opinion either way?
I don't really know if I like either. I think adding to override is hacky and can't be trusted lol. But also carving out this special exception isn't very nice either... @stephenroller ?
Both seem equally gross to me. This solution, with comments explaining wtf is going on, and a test checking that we do it correctly, seems sufficient.
Should we kill override? Then we'd at least catch doing it wrong in the future.
Could we add a test that would catch this?
Add a comment discussing what's going on and why, please.
Patch description
`create_agent_from_opt_file` was not properly copying over `'init_model'` as set by `opt`. This causes issues when, for example, we have a model that has been (1) initialized with some model file, (2) requeued, and has (3) `load_from_checkpoint = True`. When requeued, the `'init_model'` will revert back to the model set in (1) inside `create_agent_from_opt_file`, instead of using the checkpoint file set by the train script (which is contained in `opt['init_model']`). This in turn leads to TorchAgent loading the model file from the best valid instead of the checkpoint.

Thanks to @wyshi for finding this bug!!
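A self-contained sketch of the regression test the reviewers asked for. The merge helper below is a hypothetical stand-in for the relevant part of `create_agent_from_opt_file`, not ParlAI's real API; paths are made up:

```python
def merge_opt_from_file(opt, opt_from_file, model_file):
    # stand-in for the fixed merge logic: the live opt's init_model
    # (the checkpoint set by the train script) must win over the stale
    # init_model stored in the .opt file on disk
    opt_from_file['model_file'] = model_file
    if opt.get('init_model') is not None:
        opt_from_file['init_model'] = opt['init_model']
    return opt_from_file


def test_requeue_prefers_checkpoint():
    # scenario from the PR: (1) initialized from a Reddit model,
    # then requeued with load_from_checkpoint = True
    opt = {'init_model': '/runs/model.checkpoint'}       # set by train script
    opt_from_file = {'init_model': '/zoo/reddit_model'}  # stale value from (1)
    merged = merge_opt_from_file(opt, opt_from_file, '/runs/model')
    assert merged['init_model'] == '/runs/model.checkpoint'
    assert merged['model_file'] == '/runs/model'


test_requeue_prefers_checkpoint()
```

Before the fix, the stale `/zoo/reddit_model` value from disk would survive the merge and the first assertion would fail.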