Description
I am new to working with LLMs and need some help fine-tuning your model "deepseek-ai/deepseek-coder-6.7b-instruct" on my small dataset.
Server specs:
nvidia-smi
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.78 Driver Version: 550.78 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA H100 PCIe Off | 00000000:01:00.0 Off | 0 |
| N/A 43C P0 52W / 350W | 26MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA H100 PCIe Off | 00000000:02:00.0 Off | 0 |
| N/A 43C P0 49W / 350W | 23MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 2 NVIDIA RTX A4000 Off | 00000000:C1:00.0 Off | Off |
| 41% 43C P8 15W / 140W | 497MiB / 16376MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 53937 G /usr/lib/xorg/Xorg 4MiB |
| 0 N/A N/A 2771092 G /usr/lib/xorg/Xorg 4MiB |
| 1 N/A N/A 53937 G /usr/lib/xorg/Xorg 4MiB |
| 1 N/A N/A 2771092 G /usr/lib/xorg/Xorg 4MiB |
| 2 N/A N/A 53937 G /usr/lib/xorg/Xorg 73MiB |
| 2 N/A N/A 2771092 G /usr/lib/xorg/Xorg 78MiB |
| 2 N/A N/A 2771368 G /usr/bin/gnome-shell 117MiB |
+-----------------------------------------------------------------------------------------+
I am working in a conda environment with Python 3.11:
conda create --name deepseekEnv -c conda-forge python=3.11
After cloning the git repository and installing all packages listed in "finetune/requirements.txt", I end up with the following:
conda list
_libgcc_mutex 0.1 conda_forge conda-forge
_openmp_mutex 4.5 2_gnu conda-forge
accelerate 0.24.1 pypi_0 pypi
aiohttp 3.9.5 pypi_0 pypi
aiosignal 1.3.1 pypi_0 pypi
annotated-types 0.7.0 pypi_0 pypi
attrdict 2.0.1 pypi_0 pypi
attrs 23.2.0 pypi_0 pypi
bzip2 1.0.8 hd590300_5 conda-forge
ca-certificates 2024.6.2 hbcca054_0 conda-forge
certifi 2024.6.2 pypi_0 pypi
charset-normalizer 3.3.2 pypi_0 pypi
cmake 3.29.5.1 pypi_0 pypi
datasets 2.14.7 pypi_0 pypi
deepspeed 0.12.2 pypi_0 pypi
dill 0.3.7 pypi_0 pypi
filelock 3.15.1 pypi_0 pypi
frozenlist 1.4.1 pypi_0 pypi
fsspec 2023.10.0 pypi_0 pypi
hjson 3.1.0 pypi_0 pypi
huggingface-hub 0.16.4 pypi_0 pypi
idna 3.7 pypi_0 pypi
jinja2 3.1.4 pypi_0 pypi
ld_impl_linux-64 2.40 hf3520f5_3 conda-forge
libexpat 2.6.2 h59595ed_0 conda-forge
libffi 3.4.2 h7f98852_5 conda-forge
libgcc-ng 13.2.0 h77fa898_8 conda-forge
libgomp 13.2.0 h77fa898_8 conda-forge
libnsl 2.0.1 hd590300_0 conda-forge
libsqlite 3.46.0 hde9e2c9_0 conda-forge
libuuid 2.38.1 h0b41bf4_0 conda-forge
libxcrypt 4.4.36 hd590300_1 conda-forge
libzlib 1.3.1 h4ab18f5_1 conda-forge
lit 18.1.7 pypi_0 pypi
markupsafe 2.1.5 pypi_0 pypi
mpmath 1.3.0 pypi_0 pypi
multidict 6.0.5 pypi_0 pypi
multiprocess 0.70.15 pypi_0 pypi
ncurses 6.5 h59595ed_0 conda-forge
networkx 3.3 pypi_0 pypi
ninja 1.11.1.1 pypi_0 pypi
numpy 1.26.4 pypi_0 pypi
nvidia-cublas-cu11 11.10.3.66 pypi_0 pypi
nvidia-cuda-cupti-cu11 11.7.101 pypi_0 pypi
nvidia-cuda-nvrtc-cu11 11.7.99 pypi_0 pypi
nvidia-cuda-runtime-cu11 11.7.99 pypi_0 pypi
nvidia-cudnn-cu11 8.5.0.96 pypi_0 pypi
nvidia-cufft-cu11 10.9.0.58 pypi_0 pypi
nvidia-curand-cu11 10.2.10.91 pypi_0 pypi
nvidia-cusolver-cu11 11.4.0.1 pypi_0 pypi
nvidia-cusparse-cu11 11.7.4.91 pypi_0 pypi
nvidia-nccl-cu11 2.14.3 pypi_0 pypi
nvidia-nvtx-cu11 11.7.91 pypi_0 pypi
openssl 3.3.1 h4ab18f5_0 conda-forge
packaging 24.1 pypi_0 pypi
pandas 2.2.2 pypi_0 pypi
pip 24.0 pyhd8ed1ab_0 conda-forge
protobuf 5.27.1 pypi_0 pypi
psutil 5.9.8 pypi_0 pypi
py-cpuinfo 9.0.0 pypi_0 pypi
pyarrow 16.1.0 pypi_0 pypi
pyarrow-hotfix 0.6 pypi_0 pypi
pydantic 2.7.4 pypi_0 pypi
pydantic-core 2.18.4 pypi_0 pypi
pynvml 11.5.0 pypi_0 pypi
python 3.11.9 hb806964_0_cpython conda-forge
python-dateutil 2.9.0.post0 pypi_0 pypi
pytz 2024.1 pypi_0 pypi
pyyaml 6.0.1 pypi_0 pypi
readline 8.2 h8228510_1 conda-forge
regex 2024.5.15 pypi_0 pypi
requests 2.32.3 pypi_0 pypi
safetensors 0.4.3 pypi_0 pypi
setuptools 70.0.0 pyhd8ed1ab_0 conda-forge
six 1.16.0 pypi_0 pypi
sympy 1.12.1 pypi_0 pypi
tensorboardx 2.6.2.2 pypi_0 pypi
tk 8.6.13 noxft_h4845f30_101 conda-forge
tokenizers 0.14.0 pypi_0 pypi
torch 2.0.1 pypi_0 pypi
tqdm 4.66.4 pypi_0 pypi
transformers 4.35.0 pypi_0 pypi
triton 2.0.0 pypi_0 pypi
typing-extensions 4.12.2 pypi_0 pypi
tzdata 2024.1 pypi_0 pypi
urllib3 2.2.1 pypi_0 pypi
wheel 0.43.0 pyhd8ed1ab_1 conda-forge
xxhash 3.4.1 pypi_0 pypi
xz 5.2.6 h166bdaf_0 conda-forge
yarl 1.9.4 pypi_0 pypi
nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Wed_Apr_17_19:19:55_PDT_2024
Cuda compilation tools, release 12.5, V12.5.40
Build cuda_12.5.r12.5/compiler.34177558_0
`print(torch.version.cuda)`
11.7
`print(torch.__version__)`
2.0.1+cu117
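Alongside the two version checks above, it may help to compare each GPU's compute capability with the architectures the installed PyTorch build was compiled for. The following is a minimal sketch, not a definitive test: `is_arch_listed` and `LOGGED_ARCHS` are hypothetical names, the architecture list is copied from the `sm_90` warning in the error log, and the exact-match check deliberately ignores PTX forward compatibility and same-major binary compatibility.

```python
def is_arch_listed(capability, arch_list):
    """Return True if a (major, minor) compute capability has an exact sm_XY entry.

    Simplification: PTX forward compatibility and same-major binary
    compatibility are ignored, so this is only a first-pass check.
    """
    major, minor = capability
    return f"sm_{major}{minor}" in arch_list


# Architecture list copied from the UserWarning in the error log.
LOGGED_ARCHS = "sm_37 sm_50 sm_60 sm_70 sm_75 sm_80 sm_86".split()

# H100 reports sm_90 in the log; RTX A4000 is an Ampere card (capability 8.6).
print("H100 (9, 0):", is_arch_listed((9, 0), LOGGED_ARCHS))       # False
print("RTX A4000 (8, 6):", is_arch_listed((8, 6), LOGGED_ARCHS))  # True

# Optional live check, only if torch and a CUDA device are present.
try:
    import torch
except ImportError:
    torch = None

if torch is not None and torch.cuda.is_available():
    archs = torch.cuda.get_arch_list()
    for i in range(torch.cuda.device_count()):
        cap = torch.cuda.get_device_capability(i)
        print(i, torch.cuda.get_device_name(i), cap, is_arch_listed(cap, archs))
```

If the live check prints `False` for a device, the build probably ships no kernels for it, which would be consistent with a "no kernel image is available" error.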
Running the script:
I created a shell script based on the one suggested in the section "How to Fine-tune DeepSeek-Coder", along with a corresponding dataset.
runtrain.sh
DATA_PATH="pathToMyJsonlFile/file.jsonl"
OUTPUT_PATH="finetune/test"
MODEL="deepseek-ai/deepseek-coder-6.7b-instruct"
cd finetune && deepspeed finetune_deepseekcoder.py \
--model_name_or_path $MODEL \
--data_path $DATA_PATH \
--output_dir $OUTPUT_PATH \
--num_train_epochs 3 \
--model_max_length 1024 \
--per_device_train_batch_size 16 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 4 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 100 \
--save_total_limit 100 \
--learning_rate 1e-5 \
--warmup_steps 10 \
--logging_steps 1 \
--lr_scheduler_type "cosine" \
--gradient_checkpointing True \
--report_to "tensorboard" \
--deepspeed configs/ds_config_zero3.json \
--bf16 True
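Because the script passes no device selection, the launcher uses every GPU it finds, including the RTX A4000. As a rough sketch, the DeepSpeed launcher's `--include` option could restrict the run to the two H100s. The helper `build_include_arg` and the list `SMI_NAMES` are hypothetical; the names are copied from the `nvidia-smi` table, and the sketch assumes CUDA enumerates the devices in the same order as `nvidia-smi`, which is not guaranteed. It does not address the `sm_90` warning.

```python
def build_include_arg(device_names, wanted="H100", host="localhost"):
    """Build a value for `deepspeed --include` from a list of GPU names.

    Assumes list positions match the CUDA device indices.
    """
    indices = [str(i) for i, name in enumerate(device_names) if wanted in name]
    if not indices:
        raise ValueError(f"no device name contains {wanted!r}")
    return f"{host}:{','.join(indices)}"


# GPU names in the order shown by nvidia-smi above.
SMI_NAMES = ["NVIDIA H100 PCIe", "NVIDIA H100 PCIe", "NVIDIA RTX A4000"]

include = build_include_arg(SMI_NAMES)
print(include)  # localhost:0,1
print(f"deepspeed --include {include} finetune_deepseekcoder.py ...")
```

With two ranks instead of three, the effective batch size per optimizer step also changes, since it scales with the number of processes.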
Error message:
[2024-06-14 09:24:20,542] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-06-14 09:24:22,114] [WARNING] [runner.py:203:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2024-06-14 09:24:22,114] [INFO] [runner.py:570:main] cmd = /home/c8501207/.conda/envs/deepseekEnv/bin/python3.11 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMl19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None finetune_deepseekcoder.py --model_name_or_path deepseek-ai/deepseek-coder-6.7b-instruct --data_path data_nosync/totalText_TinyQA_wizardlm13b_withoutRep_deepseekFormat.jsonl --output_dir finetune/test --num_train_epochs 3 --model_max_length 1024 --per_device_train_batch_size 16 --per_device_eval_batch_size 1 --gradient_accumulation_steps 4 --evaluation_strategy no --save_strategy steps --save_steps 100 --save_total_limit 100 --learning_rate 1e-5 --warmup_steps 10 --logging_steps 1 --lr_scheduler_type cosine --gradient_checkpointing True --report_to tensorboard --deepspeed configs/ds_config_zero3.json --bf16 True
[2024-06-14 09:24:23,520] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-06-14 09:24:24,649] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0, 1, 2]}
[2024-06-14 09:24:24,649] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=3, node_rank=0
[2024-06-14 09:24:24,649] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2]})
[2024-06-14 09:24:24,649] [INFO] [launch.py:163:main] dist_world_size=3
[2024-06-14 09:24:24,649] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0,1,2
/home/c8501207/.conda/envs/deepseekEnv/lib/python3.11/site-packages/torch/cuda/__init__.py:173: UserWarning:
NVIDIA H100 PCIe with CUDA capability sm_90 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70 sm_75 sm_80 sm_86.
If you want to use the NVIDIA H100 PCIe GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/
warnings.warn(incompatible_device_warn.format(device_name, capability, " ".join(arch_list), device_name))
/home/c8501207/.conda/envs/deepseekEnv/lib/python3.11/site-packages/torch/cuda/__init__.py:173: UserWarning:
NVIDIA H100 PCIe with CUDA capability sm_90 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70 sm_75 sm_80 sm_86.
If you want to use the NVIDIA H100 PCIe GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/
warnings.warn(incompatible_device_warn.format(device_name, capability, " ".join(arch_list), device_name))
/home/c8501207/.conda/envs/deepseekEnv/lib/python3.11/site-packages/torch/cuda/__init__.py:173: UserWarning:
NVIDIA H100 PCIe with CUDA capability sm_90 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70 sm_75 sm_80 sm_86.
If you want to use the NVIDIA H100 PCIe GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/
warnings.warn(incompatible_device_warn.format(device_name, capability, " ".join(arch_list), device_name))
[2024-06-14 09:24:27,538] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-06-14 09:24:27,815] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-06-14 09:24:27,820] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-06-14 09:24:28,619] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-06-14 09:24:28,929] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-06-14 09:24:29,349] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-06-14 09:24:29,349] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
====================================================================================================
TrainingArguments(
_n_gpu=1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
bf16=True,
bf16_full_eval=False,
cache_dir=None,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_pin_memory=True,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=configs/ds_config_zero3.json,
disable_tqdm=False,
dispatch_batches=None,
do_eval=False,
do_predict=False,
do_train=False,
eval_accumulation_steps=None,
eval_delay=0,
eval_steps=None,
evaluation_strategy=IntervalStrategy.NO,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False},
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
gradient_accumulation_steps=4,
gradient_checkpointing=True,
gradient_checkpointing_kwargs=None,
greater_is_better=None,
group_by_length=False,
half_precision_backend=auto,
hub_always_push=False,
hub_model_id=None,
hub_private_repo=False,
hub_strategy=HubStrategy.EVERY_SAVE,
hub_token=<HUB_TOKEN>,
ignore_data_skip=False,
include_inputs_for_metrics=False,
include_tokens_per_second=False,
jit_mode_eval=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=1e-05,
length_column_name=length,
load_best_model_at_end=False,
local_rank=0,
log_level=passive,
log_level_replica=warning,
log_on_each_node=True,
logging_dir=finetune/test/runs/Jun14_09-24-27_hopper-c850,
logging_first_step=False,
logging_nan_inf_filter=True,
logging_steps=1.0,
logging_strategy=IntervalStrategy.STEPS,
lr_scheduler_type=SchedulerType.COSINE,
max_grad_norm=1.0,
max_steps=-1,
metric_for_best_model=None,
model_max_length=1024,
mp_parameters=,
neftune_noise_alpha=None,
no_cuda=False,
num_train_epochs=3.0,
optim=OptimizerNames.ADAMW_TORCH,
optim_args=None,
output_dir=finetune/test,
overwrite_output_dir=False,
past_index=-1,
per_device_eval_batch_size=1,
per_device_train_batch_size=16,
prediction_loss_only=False,
push_to_hub=False,
push_to_hub_model_id=None,
push_to_hub_organization=None,
push_to_hub_token=<PUSH_TO_HUB_TOKEN>,
ray_scope=last,
remove_unused_columns=True,
report_to=['tensorboard'],
resume_from_checkpoint=None,
run_name=finetune/test,
save_on_each_node=False,
save_safetensors=True,
save_steps=100,
save_strategy=IntervalStrategy.STEPS,
save_total_limit=100,
seed=42,
skip_memory_metrics=True,
split_batches=False,
tf32=None,
torch_compile=False,
torch_compile_backend=None,
torch_compile_mode=None,
torchdynamo=None,
tpu_metrics_debug=False,
tpu_num_cores=None,
use_cpu=False,
use_ipex=False,
use_legacy_prediction_loop=False,
use_mps_device=False,
warmup_ratio=0.0,
warmup_steps=10,
weight_decay=0.0,
)
PAD Token: <|end▁of▁sentence|> 32014
BOS Token <|begin▁of▁sentence|> 32013
EOS Token <|EOT|> 32021
PAD Token: <|end▁of▁sentence|> 32014
BOS Token <|begin▁of▁sentence|> 32013
EOS Token <|EOT|> 32021
PAD Token: <|end▁of▁sentence|> 32014
BOS Token <|begin▁of▁sentence|> 32013
EOS Token <|EOT|> 32021
Load tokenizer from deepseek-ai/deepseek-coder-6.7b-instruct over.
Traceback (most recent call last):
File "/home/c8501207/DeepSeek-Coder/finetune/finetune_deepseekcoder.py", line 193, in <module>
train()
File "/home/c8501207/DeepSeek-Coder/finetune/finetune_deepseekcoder.py", line 144, in train
model = transformers.AutoModelForCausalLM.from_pretrained(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/c8501207/.conda/envs/deepseekEnv/lib/python3.11/site-packages/transformers/models/auto/auto_factory.py", line 566, in from_pretrained
return model_class.from_pretrained(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/c8501207/.conda/envs/deepseekEnv/lib/python3.11/site-packages/transformers/modeling_utils.py", line 3236, in from_pretrained
model = cls(config, *model_args, **model_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/c8501207/.conda/envs/deepseekEnv/lib/python3.11/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 458, in wrapper
f(module, *args, **kwargs)
File "/home/c8501207/.conda/envs/deepseekEnv/lib/python3.11/site-packages/transformers/models/llama/modeling_llama.py", line 961, in __init__
self.model = LlamaModel(config)
^^^^^^^^^^^^^^^^^^
File "/home/c8501207/.conda/envs/deepseekEnv/lib/python3.11/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 458, in wrapper
f(module, *args, **kwargs)
File "/home/c8501207/.conda/envs/deepseekEnv/lib/python3.11/site-packages/transformers/models/llama/modeling_llama.py", line 822, in __init__
self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/c8501207/.conda/envs/deepseekEnv/lib/python3.11/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 458, in wrapper
f(module, *args, **kwargs)
File "/home/c8501207/.conda/envs/deepseekEnv/lib/python3.11/site-packages/torch/nn/modules/sparse.py", line 144, in __init__
self.reset_parameters()
File "/home/c8501207/.conda/envs/deepseekEnv/lib/python3.11/site-packages/torch/nn/modules/sparse.py", line 153, in reset_parameters
init.normal_(self.weight)
File "/home/c8501207/.conda/envs/deepseekEnv/lib/python3.11/site-packages/torch/nn/init.py", line 155, in normal_
return _no_grad_normal_(tensor, mean, std)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/c8501207/.conda/envs/deepseekEnv/lib/python3.11/site-packages/torch/nn/init.py", line 19, in _no_grad_normal_
return tensor.normal_(mean, std)
^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
[2024-06-14 09:24:30,961] [INFO] [partition_parameters.py:347:__exit__] finished initializing model - num_params = 0, num_elems = 0.00B
Traceback (most recent call last):
File "/home/c8501207/DeepSeek-Coder/finetune/finetune_deepseekcoder.py", line 193, in <module>
train()
File "/home/c8501207/DeepSeek-Coder/finetune/finetune_deepseekcoder.py", line 144, in train
model = transformers.AutoModelForCausalLM.from_pretrained(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/c8501207/.conda/envs/deepseekEnv/lib/python3.11/site-packages/transformers/models/auto/auto_factory.py", line 566, in from_pretrained
return model_class.from_pretrained(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/c8501207/.conda/envs/deepseekEnv/lib/python3.11/site-packages/transformers/modeling_utils.py", line 3236, in from_pretrained
model = cls(config, *model_args, **model_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/c8501207/.conda/envs/deepseekEnv/lib/python3.11/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 458, in wrapper
f(module, *args, **kwargs)
File "/home/c8501207/.conda/envs/deepseekEnv/lib/python3.11/site-packages/transformers/models/llama/modeling_llama.py", line 961, in __init__
self.model = LlamaModel(config)
^^^^^^^^^^^^^^^^^^
File "/home/c8501207/.conda/envs/deepseekEnv/lib/python3.11/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 458, in wrapper
f(module, *args, **kwargs)
File "/home/c8501207/.conda/envs/deepseekEnv/lib/python3.11/site-packages/transformers/models/llama/modeling_llama.py", line 822, in __init__
self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/c8501207/.conda/envs/deepseekEnv/lib/python3.11/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 458, in wrapper
f(module, *args, **kwargs)
File "/home/c8501207/.conda/envs/deepseekEnv/lib/python3.11/site-packages/torch/nn/modules/sparse.py", line 144, in __init__
self.reset_parameters()
File "/home/c8501207/.conda/envs/deepseekEnv/lib/python3.11/site-packages/torch/nn/modules/sparse.py", line 153, in reset_parameters
init.normal_(self.weight)
File "/home/c8501207/.conda/envs/deepseekEnv/lib/python3.11/site-packages/torch/nn/init.py", line 155, in normal_
return _no_grad_normal_(tensor, mean, std)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/c8501207/.conda/envs/deepseekEnv/lib/python3.11/site-packages/torch/nn/init.py", line 19, in _no_grad_normal_
return tensor.normal_(mean, std)
^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
[2024-06-14 09:24:31,652] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 1699039
Traceback (most recent call last):
File "/home/c8501207/DeepSeek-Coder/finetune/finetune_deepseekcoder.py", line 193, in <module>
train()
File "/home/c8501207/DeepSeek-Coder/finetune/finetune_deepseekcoder.py", line 144, in train
model = transformers.AutoModelForCausalLM.from_pretrained(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/c8501207/.conda/envs/deepseekEnv/lib/python3.11/site-packages/transformers/models/auto/auto_factory.py", line 566, in from_pretrained
return model_class.from_pretrained(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/c8501207/.conda/envs/deepseekEnv/lib/python3.11/site-packages/transformers/modeling_utils.py", line 3236, in from_pretrained
model = cls(config, *model_args, **model_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/c8501207/.conda/envs/deepseekEnv/lib/python3.11/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 458, in wrapper
f(module, *args, **kwargs)
File "/home/c8501207/.conda/envs/deepseekEnv/lib/python3.11/site-packages/transformers/models/llama/modeling_llama.py", line 961, in __init__
self.model = LlamaModel(config)
^^^^^^^^^^^^^^^^^^
File "/home/c8501207/.conda/envs/deepseekEnv/lib/python3.11/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 458, in wrapper
f(module, *args, **kwargs)
File "/home/c8501207/.conda/envs/deepseekEnv/lib/python3.11/site-packages/transformers/models/llama/modeling_llama.py", line 822, in __init__
self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/c8501207/.conda/envs/deepseekEnv/lib/python3.11/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 465, in wrapper
self._post_init_method(module)
File "/home/c8501207/.conda/envs/deepseekEnv/lib/python3.11/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 989, in _post_init_method
self._zero_init_param(param)
File "/home/c8501207/.conda/envs/deepseekEnv/lib/python3.11/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 945, in _zero_init_param
dist.broadcast(param, 0, self.get_dp_process_group())
File "/home/c8501207/.conda/envs/deepseekEnv/lib/python3.11/site-packages/deepspeed/comm/comm.py", line 117, in log_wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/c8501207/.conda/envs/deepseekEnv/lib/python3.11/site-packages/deepspeed/comm/comm.py", line 224, in broadcast
return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/c8501207/.conda/envs/deepseekEnv/lib/python3.11/site-packages/deepspeed/comm/torch.py", line 196, in broadcast
return torch.distributed.broadcast(tensor=tensor, src=src, group=group, async_op=async_op)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/c8501207/.conda/envs/deepseekEnv/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 1451, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/c8501207/.conda/envs/deepseekEnv/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 1566, in broadcast
work = default_pg.broadcast([tensor], opts)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: [2] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Connection reset by peer. This may indicate a possible application crash on rank 0 or a network set up issue.
[2024-06-14 09:24:31,780] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 1699040
[2024-06-14 09:24:31,780] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 1699041
[2024-06-14 09:24:31,987] [ERROR] [launch.py:321:sigkill_handler] ['/home/c8501207/.conda/envs/deepseekEnv/bin/python3.11', '-u', 'finetune_deepseekcoder.py', '--local_rank=2', '--model_name_or_path', 'deepseek-ai/deepseek-coder-6.7b-instruct', '--data_path', 'data_nosync/totalText_TinyQA_wizardlm13b_withoutRep_deepseekFormat.jsonl', '--output_dir', 'finetune/test', '--num_train_epochs', '3', '--model_max_length', '1024', '--per_device_train_batch_size', '16', '--per_device_eval_batch_size', '1', '--gradient_accumulation_steps', '4', '--evaluation_strategy', 'no', '--save_strategy', 'steps', '--save_steps', '100', '--save_total_limit', '100', '--learning_rate', '1e-5', '--warmup_steps', '10', '--logging_steps', '1', '--lr_scheduler_type', 'cosine', '--gradient_checkpointing', 'True', '--report_to', 'tensorboard', '--deepspeed', 'configs/ds_config_zero3.json', '--bf16', 'True'] exits with return code = 1
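The `--world_info` argument in the launcher command near the top of the log looks like base64-encoded JSON. A small sketch for decoding it follows, assuming URL-safe base64 over a JSON object (an assumption that matches the `WORLD INFO DICT` line printed right after it). `decode_world_info` is a hypothetical helper name.

```python
import base64
import json

# Value copied from the --world_info argument in the log.
WORLD_INFO = "eyJsb2NhbGhvc3QiOiBbMCwgMSwgMl19"


def decode_world_info(encoded):
    """Decode the launcher's world_info string into a host -> GPU index mapping."""
    return json.loads(base64.urlsafe_b64decode(encoded))


world_info = decode_world_info(WORLD_INFO)
print(world_info)  # {'localhost': [0, 1, 2]}

# Number of ranks the launcher will start (matches dist_world_size=3 in the log).
print(sum(len(gpus) for gpus in world_info.values()))  # 3
```

The decoded mapping confirms that all three GPUs, including index 2, take part in the run.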
Solution?
I tried several workarounds for the dependency conflicts between the packages (mainly caused by deepspeed), torch, and my CUDA version, but could not solve the problem. I am also aware that PyTorch does not provide a pre-built version for CUDA 12.4, but to my knowledge CUDA versions should be backward compatible.
I would really appreciate any sort of support/advice here.
Thank you in advance!