Error during SFT fine-tuning #23

Closed
aoguai opened this issue Jan 30, 2024 · 11 comments

aoguai commented Jan 30, 2024

Operating system: Windows 10

After the model is saved for the first time, when training reaches save_steps = 5000, a permission-denied error is raised:

Saving model checkpoint to D:/sj/project/python/ChatLM-mini-Chinese/model_save/sft\tmp-checkpoint-5000
Configuration saved in D:/sj/project/python/ChatLM-mini-Chinese/model_save/sft\tmp-checkpoint-5000\config.json
Configuration saved in D:/sj/project/python/ChatLM-mini-Chinese/model_save/sft\tmp-checkpoint-5000\generation_config.json
Model weights saved in D:/sj/project/python/ChatLM-mini-Chinese/model_save/sft\tmp-checkpoint-5000\model.safetensors
tokenizer config file saved in D:/sj/project/python/ChatLM-mini-Chinese/model_save/sft\tmp-checkpoint-5000\tokenizer_config.json
Special tokens file saved in D:/sj/project/python/ChatLM-mini-Chinese/model_save/sft\tmp-checkpoint-5000\special_tokens_map.json
Traceback (most recent call last):
  File "D:\sj\project\python\ChatLM-mini-Chinese\sft_train.py", line 127, in <module>
    sft_train(config)
  File "D:\sj\project\python\ChatLM-mini-Chinese\sft_train.py", line 114, in sft_train
    trainer.train(
  File "D:\sj\project\python\ChatLM-mini-Chinese\.venv\lib\site-packages\transformers\trainer.py", line 1539, in train
    return inner_training_loop(
  File "D:\sj\project\python\ChatLM-mini-Chinese\.venv\lib\site-packages\accelerate\utils\memory.py", line 136, in decorator
    return function(batch_size, *args, **kwargs)
  File "D:\sj\project\python\ChatLM-mini-Chinese\.venv\lib\site-packages\transformers\trainer.py", line 1929, in _inner_training_loop
    self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
  File "D:\sj\project\python\ChatLM-mini-Chinese\.venv\lib\site-packages\transformers\trainer.py", line 2300, in _maybe_log_save_evaluate
    self._save_checkpoint(model, trial, metrics=metrics)
  File "D:\sj\project\python\ChatLM-mini-Chinese\.venv\lib\site-packages\transformers\trainer.py", line 2418, in _save_checkpoint
    fd = os.open(output_dir, os.O_RDONLY)
PermissionError: [Errno 13] Permission denied: 'D:/sj/project/python/ChatLM-mini-Chinese/model_save/sft\\checkpoint-5000'

I have tried twice with the same result, and the second run was started with administrator privileges.
image

Owner

charent commented Jan 30, 2024

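This is most likely a problem with your own environment. Check whether the user you are logged in as has write permission on drive D: and on your file path (right-click -> Properties -> Security -> select your login user name -> check for write permission; both the folder and drive D: need to be checked). I cannot reproduce your problem. Write a small Python script that writes a file, and use it to check whether your path D:/sj/project/python/ChatLM-mini-Chinese/model_save/sft\\checkpoint-5000 is writable (note the difference between \\ and /).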

image

Owner

charent commented Jan 30, 2024

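I also suggest first setting a small save_steps, a small number of epochs or a small max_steps, and using only a few dozen samples, so that the whole pipeline runs end to end before you start the real training. Otherwise you may finish training only to find that the model cannot be saved, and all that work is wasted.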
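
A minimal sketch of such a dry run, assuming a hypothetical `TrainSettings` dataclass whose field names mirror the `save_steps`, `max_steps` and `num_train_epochs` options of transformers' `TrainingArguments`. The project's real config class, the `max_samples` field and the numbers below are assumptions, not taken from the repository:

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class TrainSettings:
    """Hypothetical stand-in for the project's training config."""
    output_dir: str
    save_steps: int = 5000
    max_steps: int = -1                  # -1: train for num_train_epochs instead
    num_train_epochs: int = 3
    max_samples: Optional[int] = None    # None: use the whole dataset


def smoke_test(settings: TrainSettings) -> TrainSettings:
    """Return a copy that reaches its first checkpoint save after a few steps."""
    return replace(
        settings,
        save_steps=5,
        max_steps=10,
        num_train_epochs=1,
        max_samples=50,
    )


full = TrainSettings(output_dir="./model_save/sft")
quick = smoke_test(full)
print(quick)
```

Keeping `save_steps` below `max_steps` matters here: the point of the dry run is to trigger at least one checkpoint save, since that is the step that failed.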

Author

aoguai commented Jan 30, 2024

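I have checked and confirmed that my logged-in user has write permission on drive D: and on the file path:
image
I also wrote the following Python script to test it: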

import os


def check_write_permission(path):
    try:
        # Try to create a temporary file under the given path
        with open(os.path.join(path, 'test_file.tmp'), 'w') as f:
            f.write('Testing write permission.')
        # The file was created, so the write-permission check passed
        print(f"Write-permission check passed: {path}")
        return True
    except Exception as e:
        # An exception while creating the file means the check failed
        print(f"Write-permission check failed: {e}")
        return False


# Paths to check for write permission
path_to_check_list = [
    r'D:\sj\project\python\ChatLM-mini-Chinese\model_save\sft\checkpoint-5000',
    r'D:\sj\project\python\ChatLM-mini-Chinese\model_save\sft\\checkpoint-5000',
    r'D:/sj/project/python/ChatLM-mini-Chinese/model_save/sft/checkpoint-5000',
    r'D:/sj/project/python/ChatLM-mini-Chinese/model_save/sft//checkpoint-5000',
    r'D:/sj/project/python/ChatLM-mini-Chinese/model_save/sft\\checkpoint-5000',
]

for path in path_to_check_list:
    check_write_permission(path)

All of the checks passed.
image


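Notably, when I paste D:/sj/project/python/ChatLM-mini-Chinese/model_save/sft\\checkpoint-5000 directly into the File Explorer address bar, the folder cannot be opened. D:\sj\project\python\ChatLM-mini-Chinese\model_save\sft\checkpoint-5000 works normally.

But this path is generated automatically by the program, isn't it?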

Owner

charent commented Jan 30, 2024

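In the permissions screenshot you uploaded, the "Group or user names" list contains quite a few other users. Select the others as well: the logged-in user is usually not the first entry, so check every user.
Since files can be written, could it be that folders cannot be created? Try creating a folder under your /sft directory from code and see whether that works.

The checkpoint-5000 folder is created automatically by the trainer when it saves a checkpoint; the trainer appends checkpoint-5000 to the output_dir you defined.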
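
The mixed separators in the reported path are consistent with how such a path is built: joining a forward-slash `output_dir` with a folder name on Windows inserts a backslash. A small sketch using `ntpath` (the Windows implementation of `os.path`, importable on any platform); the `root` value is illustrative and taken from the traceback, not from the project config:

```python
import ntpath  # on Windows, os.path is this module

root = "D:/sj/project/python/ChatLM-mini-Chinese"   # illustrative project root
output_dir = root + "/model_save/sft"

# Joining with the Windows path module inserts a backslash between the parts.
checkpoint_dir = ntpath.join(output_dir, "checkpoint-5000")
print(checkpoint_dir)

# normpath rewrites every separator to a backslash; both spellings name the same folder.
print(ntpath.normpath(checkpoint_dir))
```

Windows file APIs generally accept either separator, so the mixed form by itself is unlikely to explain a failure to open the folder.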

Author

aoguai commented Jan 30, 2024

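My output_dir is defined as output_dir: str = PROJECT_ROOT + '/model_save/sft'. I have not changed it.

I checked the permissions of every user and all of them have write permission.
I also added a folder-creation function to the test code: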

import os


def check_write_permission(path):
    try:
        # Try to create a temporary file under the given path
        with open(os.path.join(path, 'test_file.tmp'), 'w') as f:
            f.write('Testing write permission.')
        # The file was created, so the write-permission check passed
        print(f"Write-permission check passed: {path}")
        return True
    except Exception as e:
        # An exception while creating the file means the check failed
        print(f"Write-permission check failed: {e}")
        return False


def test_create_directory(path):
    try:
        # Try to create a temporary folder
        os.makedirs(path, exist_ok=True)
        # The folder was created, so the folder-creation test passed
        print(f"Folder-creation test passed: {path}")
        return True
    except Exception as e:
        # An exception while creating the folder means the test failed
        print(f"Folder-creation test failed: {e}")
        return False


# Paths to check for write permission
path_to_check_list = [
    r'D:\sj\project\python\ChatLM-mini-Chinese\model_save\sft\checkpoint-5000',
    r'D:\sj\project\python\ChatLM-mini-Chinese\model_save\sft\\checkpoint-5000',
    r'D:/sj/project/python/ChatLM-mini-Chinese/model_save/sft/checkpoint-5000',
    r'D:/sj/project/python/ChatLM-mini-Chinese/model_save/sft//checkpoint-5000',
    r'D:/sj/project/python/ChatLM-mini-Chinese/model_save/sft\\checkpoint-5000',
]

# Paths to use for the folder-creation test
path_to_create_list = [
    r'D:\sj\project\python\ChatLM-mini-Chinese\model_save\sft\checkpoint-test0',
    r'D:\sj\project\python\ChatLM-mini-Chinese\model_save\sft\\checkpoint-test1',
    r'D:/sj/project/python/ChatLM-mini-Chinese/model_save/sft/checkpoint-test2',
    r'D:/sj/project/python/ChatLM-mini-Chinese/model_save/sft//checkpoint-test3',
    r'D:/sj/project/python/ChatLM-mini-Chinese/model_save/sft\\checkpoint-test4',
]

for path in path_to_check_list:
    check_write_permission(path)

for path in path_to_create_list:
    test_create_directory(path)

I ran it in the same virtual environment as the training (as I did for the earlier test), and everything still passed.
image
image

If the user lacked the permission, I would expect the Python code above to fail as well (I did not even run cmd as administrator).

Owner

charent commented Jan 30, 2024

Strange. It works fine on my side.

I looked at the source of transformers\trainer.py. Here is an excerpt, annotated at the point where your error occurs:

....
# Case: output_dir (`......../checkpoint-5000`) already exists and the folder is not empty
if os.path.exists(output_dir) and len(os.listdir(output_dir)) > 0:
    logger.warning(
        f"Checkpoint destination directory {output_dir} already exists and is non-empty."
        "Saving will proceed but saved results may be invalid."
    )
    staging_output_dir = output_dir
else:
    # Your run took this branch
    staging_output_dir = os.path.join(run_dir, f"tmp-{checkpoint_folder}")
.....

# Then go through the rewriting process, only renaming and rotating from main process(es)
if self.is_local_process_zero() if self.args.save_on_each_node else self.is_world_process_zero():
    if staging_output_dir != output_dir:  # the two differ and staging_output_dir exists, so the block below runs
        if os.path.exists(staging_output_dir):
            os.rename(staging_output_dir, output_dir)

            # Ensure rename completed in cases where os.rename is not atomic
            fd = os.open(output_dir, os.O_RDONLY)  # line 2418
            # This is where your run raises the error: after tmp-{checkpoint_folder} is renamed
            # to output_dir, it cannot be opened again because permission is denied.
            # As for why permission is denied, I don't know 😂

            os.fsync(fd)
            os.close(fd)

    # Maybe delete some older checkpoints.
    if self.args.should_save:
        self._rotate_checkpoints(use_mtime=True, output_dir=run_dir)

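In your test code, every path is written as an r'' raw string. With the r prefix, escape sequences inside the path are disabled, so \\ counts as two backslashes rather than one escaped backslash. The paths in transformers\trainer.py are ordinary strings with escapes, not r'' raw strings, so I am not sure whether the r prefix makes a difference.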
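
A short sketch of the raw-string point, using `ntpath` (the Windows implementation of `os.path`) so it behaves the same on any platform. It also shows that a doubled backslash in an error message is how Python's `repr` displays a single backslash, so the path in the traceback likely contains just one:

```python
import ntpath

raw = r"sft\\checkpoint-5000"      # raw string: two literal backslashes
escaped = "sft\\checkpoint-5000"   # ordinary string: \\ is an escape for one backslash

print(len(raw) - len(escaped))     # the raw string is one character longer
print(repr(escaped))               # repr doubles the backslash, as error messages do

# Windows path normalization collapses repeated separators,
# so both spellings resolve to the same folder.
print(ntpath.normpath(raw) == ntpath.normpath(escaped))
```

This may be why the test results do not change with or without the r prefix: the extra separator is ignored when the path is resolved.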

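My final suggestions: delete sft and its subfolders, restart the computer, then set a small save_steps, a small number of epochs or a small max_steps, with a few dozen samples, so that the run finishes in tens of seconds, and see whether it works. If it still fails, change output_dir to another path, such as ./model_sft or PROJECT_ROOT + '/model_sft'.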
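
The write and folder-creation tests earlier in the thread do not exercise the call that appears in the traceback, `os.open(output_dir, os.O_RDONLY)` on a directory. A minimal sketch that isolates that rename-then-open step in a temporary folder follows; the helper name `rename_then_open` and the folder names are hypothetical. On Windows, `os.open` on a directory is expected to fail with `PermissionError` whatever the folder's ACLs are, while POSIX systems return a descriptor. Treat that as an assumption to verify on the affected machine:

```python
import os
import tempfile


def rename_then_open(parent: str) -> str:
    """Mimic the trainer's step: rename a staging folder, then os.open the result."""
    staging = os.path.join(parent, "tmp-checkpoint-test")
    final = os.path.join(parent, "checkpoint-test")
    os.makedirs(staging)
    with open(os.path.join(staging, "dummy.txt"), "w") as f:
        f.write("x")
    os.rename(staging, final)
    try:
        fd = os.open(final, os.O_RDONLY)   # same call as in the traceback
    except PermissionError as exc:
        return f"failed: {exc.strerror}"
    os.close(fd)
    return "ok"


with tempfile.TemporaryDirectory() as parent:
    result = rename_then_open(parent)
    print(os.name, result)
```

If this prints a failure on the affected machine even though the write tests pass, the error comes from the call itself rather than from file permissions, and reporting it to transformers or trying a different transformers version would be the natural next step.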

Author

aoguai commented Jan 30, 2024

With the r'' prefixes removed, the test script still passes in every case.

image
image


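I have tried restarting as well, and the two fine-tuning runs were a day apart.
I will keep investigating. Thank you very much 😂 Perhaps I should report this to transformers?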

Author

aoguai commented Jan 30, 2024

I tested again with a small dataset and everything worked.
image


Also, the line loss_log.to_csv(f"./logs/sft_train_log_{time.strftime('%Y%m%d-%H%M')}.csv") in sft_train.py raises OSError: Cannot save file into a non-existent directory: 'logs' when there is no logs directory. It only works after the directory is created by hand. (Perhaps a trivial bug.)
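
A minimal sketch of one way to avoid this, by creating the directory before writing. It uses only the standard library; `save_log` and its arguments are hypothetical, and in sft_train.py the same `mkdir` call would simply precede `loss_log.to_csv(...)`:

```python
import csv
import tempfile
import time
from pathlib import Path


def save_log(rows, log_dir="./logs"):
    """Write rows to a timestamped CSV, creating log_dir first if it is missing."""
    log_path = Path(log_dir)
    log_path.mkdir(parents=True, exist_ok=True)   # no error if it already exists
    out_file = log_path / f"sft_train_log_{time.strftime('%Y%m%d-%H%M')}.csv"
    with out_file.open("w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(rows)
    return out_file


# Usage: the logs folder does not exist yet inside the temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    written = save_log([["step", "loss"], [1, 2.5]], log_dir=Path(tmp) / "logs")
    existed = written.exists()
    print(written.name, existed)
```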

Owner

charent commented Jan 30, 2024

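The main problem is that I cannot reproduce it 😂. It works on both Windows 11 and WSL here; surely it cannot be a Windows 10 problem? Other people running SFT have not hit this either, see #issuecomment-1897843741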

Owner

charent commented Jan 30, 2024

OK, I will fix the logs issue shortly.

Author

aoguai commented Jan 30, 2024

  • 61af2fe
    No solution found. The logs issue has been fixed, so I am closing this for now.

aoguai closed this as not planned Jan 30, 2024