[BUG] Symlinks to model.* files do not point to local paths on Aliyun when DP-GEN training runs on Tencent Slurm V100  #551

@bjyxffj

Description

Summary
When DP-GEN submits GPU training jobs to a Tencent Slurm V100 cluster, the symlinks created for the model.* files in the 00.train stage point to the job's working directory on the Tencent Slurm V100 machine instead of to a local path on Aliyun. This is the wrong path.
Example of the actual link: model.ckpt.data-00000-of-00001 -> /root/dpgen_work/e39e2ed2-89cd-4137-b5d1-2e7f70b37455/000/model.ckpt-800000.data-00000-of-00001
The correct link would be: model.ckpt.data-00000-of-00001 -> ../../../../iter.000001/00.train/000/model.ckpt.data-00000-of-00001
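
As a possible local workaround (not DP-GEN's own fix), the broken links can be detected and re-created as relative links after the training results are downloaded. The sketch below is illustrative only: the helper names `find_broken_absolute_links` and `relink_relative` are hypothetical and not part of DP-GEN, and it assumes the real checkpoint file already exists somewhere on the local file system.

```python
import os


def find_broken_absolute_links(directory):
    """Return the names of symlinks in `directory` whose target is an
    absolute path that does not exist on this machine."""
    broken = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.islink(path):
            target = os.readlink(path)
            if os.path.isabs(target) and not os.path.exists(target):
                broken.append(name)
    return broken


def relink_relative(link_path, local_target):
    """Replace `link_path` with a symlink to `local_target`, written
    relative to the directory that contains the link."""
    link_dir = os.path.dirname(os.path.abspath(link_path))
    rel_target = os.path.relpath(os.path.abspath(local_target), start=link_dir)
    # lexists is True even for a dangling symlink, so the old link is removed.
    if os.path.lexists(link_path):
        os.remove(link_path)
    os.symlink(rel_target, link_path)
    return rel_target
```

A relative target keeps the link valid when the whole work directory is copied or moved between machines, which an absolute path such as /root/dpgen_work/... cannot do.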

DPGEN Version and Platform

Job submission and computing cluster configuration

Expected Behavior

Actual Behavior

Steps to Reproduce

Further Information, Files, and Links

Labels: bug (Something isn't working)
