
Training resume feature isn't available due to removal in upstream ColBERT #148

Closed
eercanayar opened this issue Feb 20, 2024 · 1 comment


@eercanayar

Hello all,

I would like to resume training from the last checkpoint and last batch ID to handle training interruptions.

Resuming training/fine-tuning from a batch_id isn't possible because the feature was removed from the upstream ColBERT repository at some point. I raised stanford-futuredata/ColBERT#307 but haven't gotten any traction so far, so I'm now raising the same question here to discuss how we can bring this feature back. Any ideas?

For reference, quoting the full text of stanford-futuredata/ColBERT#307 below:

Hello all,

I would like to resume training from the last checkpoint and last batch ID to handle training interruptions. I see some remnants of a possible implementation here, but they're commented out.

https://github.com/stanford-futuredata/ColBERT/blob/7be0114f00dc938aca4a3a5929bef5bbb99485e6/colbert/training/training.py#L81-L83
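For illustration, the kind of resume logic I have in mind would roughly fast-forward the data reader past the batches that were already processed before the interruption. A minimal sketch in a plain PyTorch-style loop; the names `skip_to_batch`, `reader`, and `start_batch` are mine, not ColBERT's:

```python
import itertools

def skip_to_batch(reader, start_batch):
    # Fast-forward any iterable of batches past the ones already processed
    # before the interruption; `start_batch` would come from the checkpoint.
    return itertools.islice(reader, start_batch, None)

# Inside the training loop:
# for batch_idx, batch in enumerate(skip_to_batch(reader, start_batch), start=start_batch):
#     train_step(batch)
```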

Also, #43 mentions that resume_optimizer is implemented; however, there is no other reference to the parsed argument:

```
grep -r "resume_optimizer" .
./colbert/utils/parser.py:        # NOTE: Providing a checkpoint is one thing, --resume is another, --resume_optimizer is yet another.
./colbert/utils/parser.py:        self.add_argument('--resume_optimizer', dest='resume_optimizer', default=False, action='store_true')
```
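For what it's worth, actually wiring that flag up would presumably just mean persisting and restoring the optimizer's state dict alongside the model weights. A minimal sketch of the standard PyTorch pattern; the file path and the stand-in model are my assumptions, not ColBERT's actual code:

```python
import torch

model = torch.nn.Linear(8, 2)  # stand-in for the real ColBERT model
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-6)

# On save: persist optimizer state next to the model checkpoint.
torch.save(optimizer.state_dict(), "optimizer.pt")  # hypothetical path

# On resume: restore it, gated on the parsed flag.
resume_optimizer = True  # would be args.resume_optimizer from the parser
if resume_optimizer:
    optimizer.load_state_dict(torch.load("optimizer.pt"))
```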

So, it seems this feature was removed after these implementations. I tried to dig into this and found that it was removed in the commit "Initial commit with the new API and residual compression" (October 13th, 2021, 7:40 PM) by @okhat. Reference: https://github.com/stanford-futuredata/ColBERT/blame/7be0114f00dc938aca4a3a5929bef5bbb99485e6/colbert/training/training.py#L81-L83

Could you help me figure out how to implement resume and resume_optimizer again? That way I can handle training interruptions in my pipeline, and also contribute back to the repository with examples.
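For reference, the rough save/resume cycle I have in mind looks like this in plain PyTorch. Every name here (`save_checkpoint`, `batch_idx`, the checkpoint keys) is illustrative rather than taken from ColBERT:

```python
import os
import torch

def save_checkpoint(path, model, optimizer, batch_idx):
    # Persist everything needed to resume mid-epoch: model weights,
    # optimizer state (e.g. Adam moments), and the last completed batch index.
    torch.save({
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "batch_idx": batch_idx,
    }, path)

def load_checkpoint(path, model, optimizer):
    # Return the batch index to resume from (0 if there is no checkpoint yet).
    if not os.path.exists(path):
        return 0
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    return ckpt["batch_idx"] + 1

# Loop sketch, combined with the batch-skip helper above:
# start = load_checkpoint("ckpt.pt", model, optimizer)
# for batch_idx, batch in enumerate(skip_to_batch(reader, start), start=start):
#     train_step(batch)
#     if batch_idx % 1000 == 0:
#         save_checkpoint("ckpt.pt", model, optimizer, batch_idx)
```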

@bclavie (Owner) commented Feb 22, 2024

Hey,

Thank you for raising this! I'll be closing the issue here, since it isn't directly RAGatouille-related and I'm managing bug fixes via the issues here.

I'm not familiar with the OG training runs of ColBERT and the reasons why (if any, other than time) resuming isn't supported right now. Sorry about that!

bclavie closed this as completed Feb 22, 2024