Trainer stuck #210

Open

jack-lac opened this issue May 2, 2024 · 8 comments
jack-lac commented May 2, 2024

Hi!
I tried to run the basic_training notebook on Google Colab.
The trainer.train() phase has been stuck in what looks like an infinite loop for hours, after printing only "#> Starting...".
Here is the traceback displayed when I interrupt the execution:

/usr/local/lib/python3.10/dist-packages/ragatouille/RAGTrainer.py in train(self, batch_size, nbits, maxsteps, use_ib_negatives, learning_rate, dim, doc_maxlen, use_relu, warmup_steps, accumsteps)
    236         )
    237 
--> 238         return self.model.train(data_dir=self.data_dir, training_config=training_config)

/usr/local/lib/python3.10/dist-packages/ragatouille/models/colbert.py in train(self, data_dir, training_config)
    450             )
    451 
--> 452             trainer.train(checkpoint=self.checkpoint)
    453 
    454     def _colbert_score(self, Q, D_padded, D_mask):

/usr/local/lib/python3.10/dist-packages/colbert/trainer.py in train(self, checkpoint)
     29         launcher = Launcher(train)
     30 
---> 31         self._best_checkpoint_path = launcher.launch(self.config, self.triples, self.queries, self.collection)
     32 
     33 

/usr/local/lib/python3.10/dist-packages/colbert/infra/launcher.py in launch(self, custom_config, *args)
     70         # TODO: If the processes crash upon join, raise an exception and don't block on .get() below!
     71 
---> 72         return_values = sorted([return_value_queue.get() for _ in all_procs])
     73         return_values = [val for rank, val in return_values]
     74 

/usr/local/lib/python3.10/dist-packages/colbert/infra/launcher.py in <listcomp>(.0)
     70         # TODO: If the processes crash upon join, raise an exception and don't block on .get() below!
     71 
---> 72         return_values = sorted([return_value_queue.get() for _ in all_procs])
     73         return_values = [val for rank, val in return_values]
     74 

/usr/lib/python3.10/multiprocessing/queues.py in get(self, block, timeout)
    101         if block and timeout is None:
    102             with self._rlock:
--> 103                 res = self._recv_bytes()
    104             self._sem.release()
    105         else:

/usr/lib/python3.10/multiprocessing/connection.py in recv_bytes(self, maxlength)
    214         if maxlength is not None and maxlength < 0:
    215             raise ValueError("negative maxlength")
--> 216         buf = self._recv_bytes(maxlength)
    217         if buf is None:
    218             self._bad_message_length()

/usr/lib/python3.10/multiprocessing/connection.py in _recv_bytes(self, maxsize)
    412 
    413     def _recv_bytes(self, maxsize=None):
--> 414         buf = self._recv(4)
    415         size, = struct.unpack("!i", buf.getvalue())
    416         if size == -1:

/usr/lib/python3.10/multiprocessing/connection.py in _recv(self, size, read)
    377         remaining = size
    378         while remaining > 0:
--> 379             chunk = read(handle, remaining)
    380             n = len(chunk)
    381             if n == 0:
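
The traceback bottoms out in multiprocessing's Connection._recv: the parent process is blocked on return_value_queue.get(), waiting for a worker process that never reports back. A minimal, generic sketch of that failure mode (plain multiprocessing, not RAGatouille or ColBERT code) reproduces the same symptom:

import multiprocessing as mp

def worker(return_queue):
    # Simulate a worker that dies (or never really starts) before it
    # can put its result on the queue.
    raise SystemExit(1)

if __name__ == "__main__":
    return_queue = mp.Queue()
    proc = mp.Process(target=worker, args=(return_queue,))
    proc.start()
    proc.join()

    # Nothing was ever put on the queue, so this blocks indefinitely --
    # the same symptom as hanging right after "#> Starting...".
    result = return_queue.get()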

shubham526 commented May 2, 2024

Facing the same issue; any help would be much appreciated.

I get the following when I force-stop it:

Traceback (most recent call last):
  File "/home/user/PycharmProjects/neural_doc_ranking/src/cadr/RAGatouille/train.py", line 102, in <module>
    main()
  File "/home/user/PycharmProjects/neural_doc_ranking/src/cadr/RAGatouille/train.py", line 83, in main
    train(
  File "/home/user/PycharmProjects/neural_doc_ranking/src/cadr/RAGatouille/train.py", line 51, in train
    return model.train(data_dir=data_dir, training_config=training_config)
  File "/home/user/anaconda3/envs/ikat-env/lib/python3.10/site-packages/ragatouille/models/colbert.py", line 452, in train
    trainer.train(checkpoint=self.checkpoint)
  File "/home/user/anaconda3/envs/ikat-env/lib/python3.10/site-packages/colbert/trainer.py", line 31, in train
    self._best_checkpoint_path = launcher.launch(self.config, self.triples, self.queries, self.collection)
  File "/home/user/anaconda3/envs/ikat-env/lib/python3.10/site-packages/colbert/infra/launcher.py", line 72, in launch
    return_values = sorted([return_value_queue.get() for _ in all_procs])
  File "/home/user/anaconda3/envs/ikat-env/lib/python3.10/site-packages/colbert/infra/launcher.py", line 72, in <listcomp>
    return_values = sorted([return_value_queue.get() for _ in all_procs])
  File "/home/user/anaconda3/envs/ikat-env/lib/python3.10/multiprocessing/queues.py", line 103, in get
    res = self._recv_bytes()
  File "/home/user/anaconda3/envs/ikat-env/lib/python3.10/multiprocessing/connection.py", line 216, in recv_bytes
    buf = self._recv_bytes(maxlength)
  File "/home/user/anaconda3/envs/ikat-env/lib/python3.10/multiprocessing/connection.py", line 414, in _recv_bytes
    buf = self._recv(4)
  File "/home/user/anaconda3/envs/ikat-env/lib/python3.10/multiprocessing/connection.py", line 379, in _recv
    chunk = read(handle, remaining)

bclavie (Owner) commented May 2, 2024

Hey! This seems to be a multiprocessing issue. I'll try and figure out what it is, but this can often be finicky with the upstream ColBERT lib 🤔

@shubham526 are you making sure to run your code within if __name__ == "__main__":? In both cases, the problem appears to be that the individual processes spun up by colbert-ai's multiprocessing handler never actually start.
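
For reference, a minimal sketch of that guard, loosely following the basic_training notebook (the model name and training pairs below are placeholders, not the notebook's exact data):

from ragatouille import RAGTrainer

# Placeholder (query, relevant passage) pairs; real training needs a proper dataset.
pairs = [
    ("What is ColBERT?", "ColBERT is a late-interaction retrieval model ..."),
    ("What does RAGatouille do?", "RAGatouille wraps ColBERT for training and indexing ..."),
]

if __name__ == "__main__":
    # Everything that triggers training must live under this guard: with the
    # spawn start method, each worker process re-imports the main module, so
    # unguarded top-level training code would be re-executed in every child.
    trainer = RAGTrainer(
        model_name="MyColBERT",
        pretrained_model_name="colbert-ir/colbertv2.0",
    )
    trainer.prepare_training_data(raw_data=pairs, data_out_path="./data/")
    trainer.train(batch_size=32)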

bclavie added the "bug (Something isn't working)" and "Upstream ColBERT Interaction" labels on May 2, 2024
@shubham526

@bclavie Yes, I am running inside if __name__ == "__main__":

bclavie (Owner) commented May 2, 2024

Could you share some more information about your environment, then? That might help point us in the right direction! (torch version, CUDA version, Python version, pip freeze / overall environment (Ubuntu?))

@shubham526

I have attached the output of pip freeze: requirements.txt

  • OS: Ubuntu 22.04 LTS.
  • Python version: '3.10.10 (main, Mar 21 2023, 18:45:11) [GCC 11.2.0]'
  • PyTorch version: 2.0.1+cu117
  • CUDA version: 11.7

@shubham526

I tried this on another server and it worked. No idea why though.

@localcitizen

Faced the same issue.

@jack-lac have you managed to resolve the problem?
@shubham526 do you run your source code on Google Colab?


viantirreau commented Jul 5, 2024

Also facing the same issue on Google Colab.
I thought setting something along the lines of

trainer.model.config.nranks = 1
trainer.model.config.gpus = 1

would solve it, but it's not working either (still stuck at "Starting"). The hang is entirely upstream in ColBERT's training code, and may be due to Colab being more locked down in terms of permissions. Skimming over the code, I noticed that it needs to open some ports for communication between the worker processes.
I don't think wrapping the Colab code in an if __name__ == '__main__': block would solve this.

edit: this issue could be interesting for further debugging, maybe it's data-related.
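
Worth noting for anyone digging further: the upstream launcher.py even carries a TODO about not blocking on .get() when a worker crashes (visible in the tracebacks above). A hypothetical, generic pattern for collecting results without blocking forever (an illustration only, not ColBERT's actual code) could look like:

import multiprocessing as mp
import queue

def collect_results(return_queue, procs, poll_seconds=5.0):
    # Collect one result per worker, but fail fast if every worker has
    # already exited without reporting back (instead of hanging forever).
    results = []
    while len(results) < len(procs):
        try:
            results.append(return_queue.get(timeout=poll_seconds))
        except queue.Empty:
            if all(not p.is_alive() for p in procs):
                exit_codes = [p.exitcode for p in procs]
                raise RuntimeError(
                    f"Worker processes exited (exit codes: {exit_codes}) "
                    "before returning their results."
                )
    return results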
