Error during SFT fine-tuning #23
Comments
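As an aside, I suggest first setting a small save_step, a small epoch count or a small max_step, and using only a few dozen samples, so that you run the whole pipeline end to end before starting the real training. Otherwise you may finish training only to find that the model cannot be saved, and all that work is wasted.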
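A minimal sketch of such a dry-run configuration follows. It assumes the training script uses the Hugging Face `Trainer`, as the traceback discussed in this thread suggests. The dictionary keys are real `TrainingArguments` field names, but the helper names `smoke_test_overrides` and `take_subset` and the concrete values are illustrative assumptions, not part of this project.

```python
def smoke_test_overrides(output_dir: str) -> dict:
    """Hypothetical helper: small values for a quick end-to-end dry run.

    The keys mirror TrainingArguments fields, so the dict could be passed
    as TrainingArguments(**smoke_test_overrides("out_smoke")).
    """
    return {
        "output_dir": output_dir,
        "max_steps": 20,        # stop after a handful of optimizer steps
        "save_steps": 5,        # force several checkpoint saves early on
        "logging_steps": 1,     # log every step so failures are easy to spot
        "save_total_limit": 2,  # also exercises checkpoint rotation
    }


def take_subset(rows, n: int = 50) -> list:
    """Hypothetical helper: keep only the first n samples for the dry run."""
    return list(rows)[:n]


if __name__ == "__main__":
    print(smoke_test_overrides("out_smoke"))
    print(len(take_subset(range(1000))))
```

With values like these, the first checkpoint is written after 5 steps instead of 5000, so a failure in the save path shows up within minutes.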
The permissions screenshot you uploaded,
That's strange, it works fine on my side. I went and looked at ....
# output_dir `......../checkpoint-5000` already exists and the folder is not empty
if os.path.exists(output_dir) and len(os.listdir(output_dir)) > 0:
    logger.warning(
        f"Checkpoint destination directory {output_dir} already exists and is non-empty."
        "Saving will proceed but saved results may be invalid."
    )
    staging_output_dir = output_dir
else:
    # Your run took this branch
    staging_output_dir = os.path.join(run_dir, f"tmp-{checkpoint_folder}")
.....
# Then go through the rewriting process, only renaming and rotating from main process(es)
if self.is_local_process_zero() if self.args.save_on_each_node else self.is_world_process_zero():
    if staging_output_dir != output_dir:  # These two differ and the staging_output_dir folder exists, so the block below runs
        if os.path.exists(staging_output_dir):
            os.rename(staging_output_dir, output_dir)
            # Ensure rename completed in cases where os.rename is not atomic
            fd = os.open(output_dir, os.O_RDONLY)  # line 2418,
            # This is where your run raises the error: after tmp-{checkpoint_folder} is renamed to output_dir, it cannot be opened again because of insufficient permissions
            # As for why the permission is missing, I don't know 😂
            os.fsync(fd)
            os.close(fd)
    # Maybe delete some older checkpoints.
    if self.args.should_save:
        self._rotate_checkpoints(use_mtime=True, output_dir=run_dir)
In the test code you wrote, all the paths use … Finally, my suggestion is as follows: …
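The main problem is that I can't reproduce it 😂. Both Win11 and WSL work fine here, so surely it can't be a Win10 problem? Other people doing SFT don't seem to have hit this either, see #issuecomment-1897843741.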
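To check whether a given machine reproduces the failure outside of training, the rename-then-fsync step from the excerpt above can be run in isolation. The sketch below is only a probe, not the library's code: `rename_and_sync` and `run_probe` are hypothetical names, and the directory names are placeholders. One possible explanation, which this thread does not confirm, is that opening a directory with `os.open` is not permitted on some platforms and Python versions. If so, the probe reports the error instead of `ok`.

```python
import os
import tempfile


def rename_and_sync(staging_dir: str, output_dir: str) -> str:
    """Rename staging_dir to output_dir, then try to fsync the directory.

    Returns "ok", or a short description of the OSError that was raised.
    """
    os.rename(staging_dir, output_dir)
    try:
        # Same call as the line the traceback points at.
        fd = os.open(output_dir, os.O_RDONLY)
        try:
            os.fsync(fd)
        finally:
            os.close(fd)
    except OSError as exc:  # PermissionError is a subclass of OSError
        return f"{type(exc).__name__}: {exc}"
    return "ok"


def run_probe() -> str:
    """Build a throwaway staging folder and run the rename + fsync step."""
    with tempfile.TemporaryDirectory() as run_dir:
        staging = os.path.join(run_dir, "tmp-checkpoint-5")  # placeholder name
        final = os.path.join(run_dir, "checkpoint-5")        # placeholder name
        os.makedirs(staging)
        with open(os.path.join(staging, "dummy.bin"), "wb") as fh:
            fh.write(b"\0" * 16)
        result = rename_and_sync(staging, final)
        # The rename itself should have succeeded either way.
        assert os.path.isdir(final) and not os.path.exists(staging)
        return result


if __name__ == "__main__":
    print(run_probe())
```

Running it on both the failing Win10 machine and a working one would show whether the difference lies in the operating system and Python environment or in the training setup.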
OK,
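Operating system: Windows 10
After the model is saved for the first time, at save_steps = 5000, a permission-denied error is raised:
I have tried twice with the same result, and the second run was with administrator privileges.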