
DeepRec hangs in distributed mode. #125

Open
silingtong123 opened this issue Mar 31, 2023 · 0 comments
silingtong123 commented Mar 31, 2023

Current behavior

In distributed mode, DeepRec trains fine on one hour of data, but hangs when training on one day of data or more.
Log: (screenshot attached)
nvidia-smi: (screenshot attached)
CPU usage: (screenshot attached)

Expected behavior

DeepRec trains to completion in distributed mode.
Log: (screenshot attached)

System information

  • GPU model and memory: two Tesla T4 GPUs, 15109 MiB memory
  • OS Platform: x86_64 GNU/Linux
  • Docker version: Docker version 20.10.8, build 3967b7d
  • GCC/CUDA/cuDNN version: CUDA 11.4 / cuDNN 8
  • Python/conda version: Python 3.6
  • TensorFlow/PyTorch version: DeepRec deeprec2302, HybridBackend a832b4e

Code to reproduce

    import tensorflow as tf

    sess_config = tf.ConfigProto(
        # If the specified device does not exist, allow TF to assign one automatically
        allow_soft_placement=True,
        log_device_placement=False,  # whether to log op-to-device placement
    )
    sess_config.gpu_options.force_gpu_compatible = True
    sess_config.gpu_options.allow_growth = True

    # self.__ckpt_dir is the checkpoint directory set elsewhere in the training class
    with tf.train.MonitoredTrainingSession(
            master="", checkpoint_dir=self.__ckpt_dir, config=sess_config) as sess:
        ...  # training loop (omitted in the report)
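The snippet above does not show how the distributed cluster is wired up. For context, below is a minimal sketch of the vanilla TF 1.x between-graph setup that feeds MonitoredTrainingSession; the host:port addresses, the TASK_TYPE/TASK_INDEX environment variable names, and the checkpoint path are placeholders, not taken from the actual job configuration (which the report does not include), and HybridBackend's own distribution mechanism may differ from this pattern.

    import os

    import tensorflow as tf

    # Hypothetical cluster layout; real host:port values come from the job launcher.
    cluster = tf.train.ClusterSpec({
        "ps": ["ps0.example.com:2222"],
        "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
    })
    task_type = os.environ.get("TASK_TYPE", "worker")     # assumed env var names
    task_index = int(os.environ.get("TASK_INDEX", "0"))

    server = tf.train.Server(cluster, job_name=task_type, task_index=task_index)

    if task_type == "ps":
        server.join()  # parameter servers block here and serve variables
    else:
        # Place variables on the ps job, compute ops on the local worker.
        with tf.device(tf.train.replica_device_setter(cluster=cluster)):
            global_step = tf.train.get_or_create_global_step()
            # model / optimizer definition goes here

        with tf.train.MonitoredTrainingSession(
                master=server.target,           # worker connects to its own server
                is_chief=(task_index == 0),
                checkpoint_dir="/tmp/ckpt",     # placeholder path
                config=sess_config) as sess:    # sess_config as defined above
            pass  # training loop

Note that with master="" as in the reproduce snippet, MonitoredTrainingSession runs an in-process session rather than connecting to a tf.train.Server; the sketch only illustrates the standard TF 1.x path and is not necessarily how this job is launched.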

Willing to contribute

Yes
