
Training resume feature isn't available due to removal in upstream ColBERT #148

Closed
eercanayar opened this issue Feb 20, 2024 · 1 comment


@eercanayar

Hello all,

I would like to resume training from the last checkpoint and last batch ID to handle training interruptions.

Resuming training/fine-tuning from a batch_id isn't possible because the feature was removed from the upstream ColBERT repository at some point. I raised stanford-futuredata/ColBERT#307 but haven't gotten any traction so far, so I'm now raising the same question here to discuss how we can bring this feature back. Any ideas?

For reference, quoting the full text of stanford-futuredata/ColBERT#307 below:

Hello all,

I would like to resume training from the last checkpoint and last batch ID to handle training interruptions. I see some remnants of a possible implementation here, but they're commented out.

https://github.com/stanford-futuredata/ColBERT/blob/7be0114f00dc938aca4a3a5929bef5bbb99485e6/colbert/training/training.py#L81-L83
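For illustration, the kind of resume logic I have in mind would roughly fast-forward the data reader past the batches that were already processed before the interruption. A minimal sketch in a plain PyTorch-style loop; the names `skip_to_batch`, `reader`, and `start_batch` are mine, not ColBERT's:

```python
import itertools

def skip_to_batch(reader, start_batch):
    # Fast-forward any iterable of batches past the ones already processed
    # before the interruption; `start_batch` would come from the checkpoint.
    return itertools.islice(reader, start_batch, None)

# Inside the training loop:
# for batch_idx, batch in enumerate(skip_to_batch(reader, start_batch), start=start_batch):
#     train_step(batch)
```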

Also, #43 mentions that resume_optimizer is implemented; however, there is no other reference to the parsed argument:

```
grep -r "resume_optimizer" .
./colbert/utils/parser.py:        # NOTE: Providing a checkpoint is one thing, --resume is another, --resume_optimizer is yet another.
./colbert/utils/parser.py:        self.add_argument('--resume_optimizer', dest='resume_optimizer', default=False, action='store_true')
```
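For what it's worth, actually wiring that flag up would presumably just mean persisting and restoring the optimizer's state dict alongside the model weights. A minimal sketch of the standard PyTorch pattern; the file path and the stand-in model are my assumptions, not ColBERT's actual code:

```python
import torch

model = torch.nn.Linear(8, 2)  # stand-in for the real ColBERT model
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-6)

# On save: persist optimizer state next to the model checkpoint.
torch.save(optimizer.state_dict(), "optimizer.pt")  # hypothetical path

# On resume: restore it, gated on the parsed flag.
resume_optimizer = True  # would be args.resume_optimizer from the parser
if resume_optimizer:
    optimizer.load_state_dict(torch.load("optimizer.pt"))
```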

So, it seems this feature was removed after these implementations. I tried to dig into this and found that it was removed in the commit "Initial commit with the new API and residual compression" (October 13th, 2021, 7:40 PM) by @okhat. Reference: https://github.com/stanford-futuredata/ColBERT/blame/7be0114f00dc938aca4a3a5929bef5bbb99485e6/colbert/training/training.py#L81-L83

Could you help me figure out how to implement resume and resume_optimizer again? That way I can handle training interruptions in my pipeline, and also contribute back to the repository with examples.
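For reference, the rough save/resume cycle I have in mind looks like this in plain PyTorch. Every name here (`save_checkpoint`, `batch_idx`, the checkpoint keys) is illustrative rather than taken from ColBERT:

```python
import os
import torch

def save_checkpoint(path, model, optimizer, batch_idx):
    # Persist everything needed to resume mid-epoch: model weights,
    # optimizer state (e.g. Adam moments), and the last completed batch index.
    torch.save({
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "batch_idx": batch_idx,
    }, path)

def load_checkpoint(path, model, optimizer):
    # Return the batch index to resume from (0 if there is no checkpoint yet).
    if not os.path.exists(path):
        return 0
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    return ckpt["batch_idx"] + 1

# Loop sketch, combined with the batch-skip helper above:
# start = load_checkpoint("ckpt.pt", model, optimizer)
# for batch_idx, batch in enumerate(skip_to_batch(reader, start), start=start):
#     train_step(batch)
#     if batch_idx % 1000 == 0:
#         save_checkpoint("ckpt.pt", model, optimizer, batch_idx)
```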

@bclavie (Owner) commented Feb 22, 2024

Hey,

Thank you for raising this! I'll be closing the issue here, since it isn't directly RAGatouille-related and I'm managing bug fixes via the issues here.

I'm not familiar with the OG training runs of ColBERT and the reasons why (if any, other than time) resuming isn't supported right now. Sorry about that!

bclavie closed this as completed Feb 22, 2024