Trainer stuck #210

Open

jack-lac opened this issue May 2, 2024 · 8 comments
jack-lac commented May 2, 2024

Hi!
I tried to run the basic_training notebook on Google Colab.
The trainer.train() phase has been stuck in what looks like an infinite loop for hours, after printing only "#> Starting...".
Here is the traceback displayed when I interrupt the execution:

/usr/local/lib/python3.10/dist-packages/ragatouille/RAGTrainer.py in train(self, batch_size, nbits, maxsteps, use_ib_negatives, learning_rate, dim, doc_maxlen, use_relu, warmup_steps, accumsteps)
    236         )
    237 
--> 238         return self.model.train(data_dir=self.data_dir, training_config=training_config)

/usr/local/lib/python3.10/dist-packages/ragatouille/models/colbert.py in train(self, data_dir, training_config)
    450             )
    451 
--> 452             trainer.train(checkpoint=self.checkpoint)
    453 
    454     def _colbert_score(self, Q, D_padded, D_mask):

/usr/local/lib/python3.10/dist-packages/colbert/trainer.py in train(self, checkpoint)
     29         launcher = Launcher(train)
     30 
---> 31         self._best_checkpoint_path = launcher.launch(self.config, self.triples, self.queries, self.collection)
     32 
     33 

/usr/local/lib/python3.10/dist-packages/colbert/infra/launcher.py in launch(self, custom_config, *args)
     70         # TODO: If the processes crash upon join, raise an exception and don't block on .get() below!
     71 
---> 72         return_values = sorted([return_value_queue.get() for _ in all_procs])
     73         return_values = [val for rank, val in return_values]
     74 

/usr/local/lib/python3.10/dist-packages/colbert/infra/launcher.py in <listcomp>(.0)
     70         # TODO: If the processes crash upon join, raise an exception and don't block on .get() below!
     71 
---> 72         return_values = sorted([return_value_queue.get() for _ in all_procs])
     73         return_values = [val for rank, val in return_values]
     74 

/usr/lib/python3.10/multiprocessing/queues.py in get(self, block, timeout)
    101         if block and timeout is None:
    102             with self._rlock:
--> 103                 res = self._recv_bytes()
    104             self._sem.release()
    105         else:

/usr/lib/python3.10/multiprocessing/connection.py in recv_bytes(self, maxlength)
    214         if maxlength is not None and maxlength < 0:
    215             raise ValueError("negative maxlength")
--> 216         buf = self._recv_bytes(maxlength)
    217         if buf is None:
    218             self._bad_message_length()

/usr/lib/python3.10/multiprocessing/connection.py in _recv_bytes(self, maxsize)
    412 
    413     def _recv_bytes(self, maxsize=None):
--> 414         buf = self._recv(4)
    415         size, = struct.unpack("!i", buf.getvalue())
    416         if size == -1:

/usr/lib/python3.10/multiprocessing/connection.py in _recv(self, size, read)
    377         remaining = size
    378         while remaining > 0:
--> 379             chunk = read(handle, remaining)
    380             n = len(chunk)
    381             if n == 0:
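
The traceback bottoms out in multiprocessing's Connection._recv: the parent process is blocked on return_value_queue.get(), waiting for a worker process that never reports back. A minimal, generic sketch of that failure mode (plain multiprocessing, not RAGatouille or ColBERT code) reproduces the same symptom:

import multiprocessing as mp

def worker(return_queue):
    # Simulate a worker that dies (or never really starts) before it
    # can put its result on the queue.
    raise SystemExit(1)

if __name__ == "__main__":
    return_queue = mp.Queue()
    proc = mp.Process(target=worker, args=(return_queue,))
    proc.start()
    proc.join()

    # Nothing was ever put on the queue, so this blocks indefinitely --
    # the same symptom as hanging right after "#> Starting...".
    result = return_queue.get()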

shubham526 commented May 2, 2024

Facing the same issue; any help would be much appreciated.

I get the following when I force-stop it:

Traceback (most recent call last):
  File "/home/user/PycharmProjects/neural_doc_ranking/src/cadr/RAGatouille/train.py", line 102, in <module>
    main()
  File "/home/user/PycharmProjects/neural_doc_ranking/src/cadr/RAGatouille/train.py", line 83, in main
    train(
  File "/home/user/PycharmProjects/neural_doc_ranking/src/cadr/RAGatouille/train.py", line 51, in train
    return model.train(data_dir=data_dir, training_config=training_config)
  File "/home/user/anaconda3/envs/ikat-env/lib/python3.10/site-packages/ragatouille/models/colbert.py", line 452, in train
    trainer.train(checkpoint=self.checkpoint)
  File "/home/user/anaconda3/envs/ikat-env/lib/python3.10/site-packages/colbert/trainer.py", line 31, in train
    self._best_checkpoint_path = launcher.launch(self.config, self.triples, self.queries, self.collection)
  File "/home/user/anaconda3/envs/ikat-env/lib/python3.10/site-packages/colbert/infra/launcher.py", line 72, in launch
    return_values = sorted([return_value_queue.get() for _ in all_procs])
  File "/home/user/anaconda3/envs/ikat-env/lib/python3.10/site-packages/colbert/infra/launcher.py", line 72, in <listcomp>
    return_values = sorted([return_value_queue.get() for _ in all_procs])
  File "/home/user/anaconda3/envs/ikat-env/lib/python3.10/multiprocessing/queues.py", line 103, in get
    res = self._recv_bytes()
  File "/home/user/anaconda3/envs/ikat-env/lib/python3.10/multiprocessing/connection.py", line 216, in recv_bytes
    buf = self._recv_bytes(maxlength)
  File "/home/user/anaconda3/envs/ikat-env/lib/python3.10/multiprocessing/connection.py", line 414, in _recv_bytes
    buf = self._recv(4)
  File "/home/user/anaconda3/envs/ikat-env/lib/python3.10/multiprocessing/connection.py", line 379, in _recv
    chunk = read(handle, remaining)

bclavie (Owner) commented May 2, 2024

Hey! This seems to be a multiprocessing issue. I'll try and figure out what it is, but this can often be finicky with the upstream ColBERT lib 🤔

@shubham526 are you making sure to run your code within if __name__ == "__main__":? In both cases, the problem appears to be that the individual processes spun up by colbert-ai's multiprocessing handler never actually start.
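
For reference, a minimal sketch of that guard, loosely following the basic_training notebook (the model name and training pairs below are placeholders, not the notebook's exact data):

from ragatouille import RAGTrainer

# Placeholder (query, relevant passage) pairs; real training needs a proper dataset.
pairs = [
    ("What is ColBERT?", "ColBERT is a late-interaction retrieval model ..."),
    ("What does RAGatouille do?", "RAGatouille wraps ColBERT for training and indexing ..."),
]

if __name__ == "__main__":
    # Everything that triggers training must live under this guard: with the
    # spawn start method, each worker process re-imports the main module, so
    # unguarded top-level training code would be re-executed in every child.
    trainer = RAGTrainer(
        model_name="MyColBERT",
        pretrained_model_name="colbert-ir/colbertv2.0",
    )
    trainer.prepare_training_data(raw_data=pairs, data_out_path="./data/")
    trainer.train(batch_size=32)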

bclavie added the "bug (Something isn't working)" and "Upstream ColBERT Interaction" labels on May 2, 2024
@shubham526

@bclavie Yes, I am running inside if __name__ == "__main__":

bclavie (Owner) commented May 2, 2024

Could you share some more information about your environment, then? That might help point us in the right direction! (torch version, CUDA version, Python version, pip freeze / overall environment (Ubuntu?))

@shubham526

I have attached the output of pip freeze: requirements.txt

  • OS: Ubuntu 22.04 LTS.
  • Python version: '3.10.10 (main, Mar 21 2023, 18:45:11) [GCC 11.2.0]'
  • PyTorch version: 2.0.1+cu117
  • CUDA version: 11.7

@shubham526

I tried this on another server and it worked. No idea why though.

@localcitizen

Faced the same issue.

@jack-lac have you managed to resolve the problem?
@shubham526 do you run your source code on Google Colab?


viantirreau commented Jul 5, 2024

Also facing the same issue on Google Colab.
I thought setting something along the lines of

trainer.model.config.nranks = 1
trainer.model.config.gpus = 1

would solve it, but it's not working either (still stuck at "Starting"). The hang is entirely upstream in ColBERT's training code, and may be due to Colab being more locked down in terms of permissions. Skimming over the code, I noticed that it needs to open some ports for communication between the worker processes.
I don't think wrapping the Colab code in an if __name__ == '__main__': block would solve this.

edit: this issue could be interesting for further debugging, maybe it's data-related.
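
Worth noting for anyone digging further: the upstream launcher.py even carries a TODO about not blocking on .get() when a worker crashes (visible in the tracebacks above). A hypothetical, generic pattern for collecting results without blocking forever (an illustration only, not ColBERT's actual code) could look like:

import multiprocessing as mp
import queue

def collect_results(return_queue, procs, poll_seconds=5.0):
    # Collect one result per worker, but fail fast if every worker has
    # already exited without reporting back (instead of hanging forever).
    results = []
    while len(results) < len(procs):
        try:
            results.append(return_queue.get(timeout=poll_seconds))
        except queue.Empty:
            if all(not p.is_alive() for p in procs):
                exit_codes = [p.exitcode for p in procs]
                raise RuntimeError(
                    f"Worker processes exited (exit codes: {exit_codes}) "
                    "before returning their results."
                )
    return results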
