
DeepRec hangs in distributed mode. #125

Open
silingtong123 opened this issue Mar 31, 2023 · 0 comments
silingtong123 commented Mar 31, 2023

Current behavior

In distributed mode, DeepRec trains fine on one hour of data, but hangs when training on one day of data or more.
Log: (screenshot attached)
nvidia-smi: (screenshot attached)
CPU usage: (screenshot attached)

Expected behavior

DeepRec trains to completion in distributed mode.
Log: (screenshot attached)

System information

  • GPU model and memory: two Tesla T4 GPUs, 15109 MiB memory
  • OS Platform: x86_64 GNU/Linux
  • Docker version: Docker version 20.10.8, build 3967b7d
  • GCC/CUDA/cuDNN version: CUDA 11.4 / cuDNN 8
  • Python/conda version: Python 3.6
  • TensorFlow/PyTorch version: DeepRec deeprec2302, HybridBackend a832b4e

Code to reproduce

    import tensorflow as tf

    sess_config = tf.ConfigProto(
        # If the specified device does not exist, allow TF to assign one automatically
        allow_soft_placement=True,
        log_device_placement=False,  # whether to log op-to-device placement
    )
    sess_config.gpu_options.force_gpu_compatible = True
    sess_config.gpu_options.allow_growth = True

    # self.__ckpt_dir is the checkpoint directory set elsewhere in the training class
    with tf.train.MonitoredTrainingSession(
            master="", checkpoint_dir=self.__ckpt_dir, config=sess_config) as sess:
        ...  # training loop (omitted in the report)
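The snippet above does not show how the distributed cluster is wired up. For context, below is a minimal sketch of the vanilla TF 1.x between-graph setup that feeds MonitoredTrainingSession; the host:port addresses, the TASK_TYPE/TASK_INDEX environment variable names, and the checkpoint path are placeholders, not taken from the actual job configuration (which the report does not include), and HybridBackend's own distribution mechanism may differ from this pattern.

    import os

    import tensorflow as tf

    # Hypothetical cluster layout; real host:port values come from the job launcher.
    cluster = tf.train.ClusterSpec({
        "ps": ["ps0.example.com:2222"],
        "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
    })
    task_type = os.environ.get("TASK_TYPE", "worker")     # assumed env var names
    task_index = int(os.environ.get("TASK_INDEX", "0"))

    server = tf.train.Server(cluster, job_name=task_type, task_index=task_index)

    if task_type == "ps":
        server.join()  # parameter servers block here and serve variables
    else:
        # Place variables on the ps job, compute ops on the local worker.
        with tf.device(tf.train.replica_device_setter(cluster=cluster)):
            global_step = tf.train.get_or_create_global_step()
            # model / optimizer definition goes here

        with tf.train.MonitoredTrainingSession(
                master=server.target,           # worker connects to its own server
                is_chief=(task_index == 0),
                checkpoint_dir="/tmp/ckpt",     # placeholder path
                config=sess_config) as sess:    # sess_config as defined above
            pass  # training loop

Note that with master="" as in the reproduce snippet, MonitoredTrainingSession runs an in-process session rather than connecting to a tf.train.Server; the sketch only illustrates the standard TF 1.x path and is not necessarily how this job is launched.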

Willing to contribute

Yes
