RuntimeError: CUDA error: no kernel image is available for execution on the device

I am new to working with LLMs and require some support with fine-tuning your model “deepseek-ai/deepseek-coder-6.7b-instruct” with my small dataset.

# Server-specs:
`nvidia-smi`
```
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.78                 Driver Version: 550.78         CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H100 PCIe               Off |   00000000:01:00.0 Off |                    0 |
| N/A   43C    P0             52W /  350W |      26MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA H100 PCIe               Off |   00000000:02:00.0 Off |                    0 |
| N/A   43C    P0             49W /  350W |      23MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA RTX A4000               Off |   00000000:C1:00.0 Off |                  Off |
| 41%   43C    P8             15W /  140W |     497MiB /  16376MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A     53937      G   /usr/lib/xorg/Xorg                              4MiB |
|    0   N/A  N/A   2771092      G   /usr/lib/xorg/Xorg                              4MiB |
|    1   N/A  N/A     53937      G   /usr/lib/xorg/Xorg                              4MiB |
|    1   N/A  N/A   2771092      G   /usr/lib/xorg/Xorg                              4MiB |
|    2   N/A  N/A     53937      G   /usr/lib/xorg/Xorg                             73MiB |
|    2   N/A  N/A   2771092      G   /usr/lib/xorg/Xorg                             78MiB |
|    2   N/A  N/A   2771368      G   /usr/bin/gnome-shell                          117MiB |
+-----------------------------------------------------------------------------------------+
```

I am working in a conda environment with Python 3.11:
`conda create --name deepseekEnv -c conda-forge python=3.11`

After cloning the git repository and installing all packages listed in "finetune/requirements.txt", I end up with the following:
`conda list`
```
_libgcc_mutex             0.1                 conda_forge    conda-forge
_openmp_mutex             4.5                       2_gnu    conda-forge
accelerate                0.24.1                   pypi_0    pypi
aiohttp                   3.9.5                    pypi_0    pypi
aiosignal                 1.3.1                    pypi_0    pypi
annotated-types           0.7.0                    pypi_0    pypi
attrdict                  2.0.1                    pypi_0    pypi
attrs                     23.2.0                   pypi_0    pypi
bzip2                     1.0.8                hd590300_5    conda-forge
ca-certificates           2024.6.2             hbcca054_0    conda-forge
certifi                   2024.6.2                 pypi_0    pypi
charset-normalizer        3.3.2                    pypi_0    pypi
cmake                     3.29.5.1                 pypi_0    pypi
datasets                  2.14.7                   pypi_0    pypi
deepspeed                 0.12.2                   pypi_0    pypi
dill                      0.3.7                    pypi_0    pypi
filelock                  3.15.1                   pypi_0    pypi
frozenlist                1.4.1                    pypi_0    pypi
fsspec                    2023.10.0                pypi_0    pypi
hjson                     3.1.0                    pypi_0    pypi
huggingface-hub           0.16.4                   pypi_0    pypi
idna                      3.7                      pypi_0    pypi
jinja2                    3.1.4                    pypi_0    pypi
ld_impl_linux-64          2.40                 hf3520f5_3    conda-forge
libexpat                  2.6.2                h59595ed_0    conda-forge
libffi                    3.4.2                h7f98852_5    conda-forge
libgcc-ng                 13.2.0               h77fa898_8    conda-forge
libgomp                   13.2.0               h77fa898_8    conda-forge
libnsl                    2.0.1                hd590300_0    conda-forge
libsqlite                 3.46.0               hde9e2c9_0    conda-forge
libuuid                   2.38.1               h0b41bf4_0    conda-forge
libxcrypt                 4.4.36               hd590300_1    conda-forge
libzlib                   1.3.1                h4ab18f5_1    conda-forge
lit                       18.1.7                   pypi_0    pypi
markupsafe                2.1.5                    pypi_0    pypi
mpmath                    1.3.0                    pypi_0    pypi
multidict                 6.0.5                    pypi_0    pypi
multiprocess              0.70.15                  pypi_0    pypi
ncurses                   6.5                  h59595ed_0    conda-forge
networkx                  3.3                      pypi_0    pypi
ninja                     1.11.1.1                 pypi_0    pypi
numpy                     1.26.4                   pypi_0    pypi
nvidia-cublas-cu11        11.10.3.66               pypi_0    pypi
nvidia-cuda-cupti-cu11    11.7.101                 pypi_0    pypi
nvidia-cuda-nvrtc-cu11    11.7.99                  pypi_0    pypi
nvidia-cuda-runtime-cu11  11.7.99                  pypi_0    pypi
nvidia-cudnn-cu11         8.5.0.96                 pypi_0    pypi
nvidia-cufft-cu11         10.9.0.58                pypi_0    pypi
nvidia-curand-cu11        10.2.10.91               pypi_0    pypi
nvidia-cusolver-cu11      11.4.0.1                 pypi_0    pypi
nvidia-cusparse-cu11      11.7.4.91                pypi_0    pypi
nvidia-nccl-cu11          2.14.3                   pypi_0    pypi
nvidia-nvtx-cu11          11.7.91                  pypi_0    pypi
openssl                   3.3.1                h4ab18f5_0    conda-forge
packaging                 24.1                     pypi_0    pypi
pandas                    2.2.2                    pypi_0    pypi
pip                       24.0               pyhd8ed1ab_0    conda-forge
protobuf                  5.27.1                   pypi_0    pypi
psutil                    5.9.8                    pypi_0    pypi
py-cpuinfo                9.0.0                    pypi_0    pypi
pyarrow                   16.1.0                   pypi_0    pypi
pyarrow-hotfix            0.6                      pypi_0    pypi
pydantic                  2.7.4                    pypi_0    pypi
pydantic-core             2.18.4                   pypi_0    pypi
pynvml                    11.5.0                   pypi_0    pypi
python                    3.11.9          hb806964_0_cpython    conda-forge
python-dateutil           2.9.0.post0              pypi_0    pypi
pytz                      2024.1                   pypi_0    pypi
pyyaml                    6.0.1                    pypi_0    pypi
readline                  8.2                  h8228510_1    conda-forge
regex                     2024.5.15                pypi_0    pypi
requests                  2.32.3                   pypi_0    pypi
safetensors               0.4.3                    pypi_0    pypi
setuptools                70.0.0             pyhd8ed1ab_0    conda-forge
six                       1.16.0                   pypi_0    pypi
sympy                     1.12.1                   pypi_0    pypi
tensorboardx              2.6.2.2                  pypi_0    pypi
tk                        8.6.13          noxft_h4845f30_101    conda-forge
tokenizers                0.14.0                   pypi_0    pypi
torch                     2.0.1                    pypi_0    pypi
tqdm                      4.66.4                   pypi_0    pypi
transformers              4.35.0                   pypi_0    pypi
triton                    2.0.0                    pypi_0    pypi
typing-extensions         4.12.2                   pypi_0    pypi
tzdata                    2024.1                   pypi_0    pypi
urllib3                   2.2.1                    pypi_0    pypi
wheel                     0.43.0             pyhd8ed1ab_1    conda-forge
xxhash                    3.4.1                    pypi_0    pypi
xz                        5.2.6                h166bdaf_0    conda-forge
yarl                      1.9.4                    pypi_0    pypi

```
`nvcc --version`
```
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Wed_Apr_17_19:19:55_PDT_2024
Cuda compilation tools, release 12.5, V12.5.40
Build cuda_12.5.r12.5/compiler.34177558_0
```

```
>>> print(torch.version.cuda)
11.7
>>> print(torch.__version__)
2.0.1+cu117
```
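
The warning in the error log further down names a mismatch: the H100 reports `sm_90`, while this `torch` build lists kernels only up to `sm_86`. Below is a minimal sketch of how that match could be checked. It assumes the usual CUDA rule that an `sm_XY` binary runs on devices of the same major version with an equal or higher minor version, and that a `compute_XY` (PTX) entry can be JIT-compiled for any equal or newer device. The helper names `parse_arch`, `build_supports` and `report_local_gpus` are hypothetical; `torch.cuda.get_arch_list()` and `torch.cuda.get_device_capability()` are the PyTorch calls that read the real values. The demo values are copied from the warning in the log; capability 8.6 for the RTX A4000 is an assumption based on its Ampere architecture.

```python
import re


def parse_arch(arch):
    """Split an entry such as 'sm_86' or 'compute_90' into (kind, major, minor)."""
    m = re.fullmatch(r"(sm|compute)_(\d+)(\d)[a-z]?", arch)
    if m is None:
        raise ValueError(f"unrecognised arch entry: {arch!r}")
    return m.group(1), int(m.group(2)), int(m.group(3))


def build_supports(capability, arch_list):
    """Return True if a build with `arch_list` has a usable kernel image
    for a device with compute capability `capability` = (major, minor)."""
    major, minor = capability
    for arch in arch_list:
        kind, a_major, a_minor = parse_arch(arch)
        # Compiled binaries only run within the same major architecture.
        if kind == "sm" and a_major == major and a_minor <= minor:
            return True
        # PTX can be JIT-compiled for any equal or newer architecture.
        if kind == "compute" and (a_major, a_minor) <= (major, minor):
            return True
    return False


def report_local_gpus():
    """Print the check for every GPU visible to the installed torch build."""
    import torch  # imported lazily so the helpers above work without torch

    arch_list = torch.cuda.get_arch_list()
    for idx in range(torch.cuda.device_count()):
        cap = torch.cuda.get_device_capability(idx)
        name = torch.cuda.get_device_name(idx)
        print(idx, name, cap, "ok" if build_supports(cap, arch_list) else "NO KERNEL IMAGE")


# Arch list copied from the UserWarning in the error log.
LOGGED_ARCH_LIST = "sm_37 sm_50 sm_60 sm_70 sm_75 sm_80 sm_86".split()

print("H100 (9, 0):", build_supports((9, 0), LOGGED_ARCH_LIST))
print("RTX A4000 (8, 6), assumed:", build_supports((8, 6), LOGGED_ARCH_LIST))
```

Calling `report_local_gpus()` inside the conda environment prints one line per visible GPU.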

# Running the script:
I created a shell script based on the one suggested in the section "How to Fine-tune DeepSeek-Coder", together with a corresponding dataset.
`runtrain.sh`

        DATA_PATH="pathToMyJsonlFile/file.jsonl"
        OUTPUT_PATH="finetune/test"
        MODEL="deepseek-ai/deepseek-coder-6.7b-instruct"
        
        cd finetune && deepspeed finetune_deepseekcoder.py \
            --model_name_or_path $MODEL \
            --data_path $DATA_PATH \
            --output_dir $OUTPUT_PATH \
            --num_train_epochs 3 \
            --model_max_length 1024 \
            --per_device_train_batch_size 16 \
            --per_device_eval_batch_size 1 \
            --gradient_accumulation_steps 4 \
            --evaluation_strategy "no" \
            --save_strategy "steps" \
            --save_steps 100 \
            --save_total_limit 100 \
            --learning_rate 1e-5 \
            --warmup_steps 10 \
            --logging_steps 1 \
            --lr_scheduler_type "cosine" \
            --gradient_checkpointing True \
            --report_to "tensorboard" \
            --deepspeed configs/ds_config_zero3.json \
            --bf16 True

# ERROR-Message:
```
[2024-06-14 09:24:20,542] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-06-14 09:24:22,114] [WARNING] [runner.py:203:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2024-06-14 09:24:22,114] [INFO] [runner.py:570:main] cmd = /home/c8501207/.conda/envs/deepseekEnv/bin/python3.11 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMl19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None finetune_deepseekcoder.py --model_name_or_path deepseek-ai/deepseek-coder-6.7b-instruct --data_path data_nosync/totalText_TinyQA_wizardlm13b_withoutRep_deepseekFormat.jsonl --output_dir finetune/test --num_train_epochs 3 --model_max_length 1024 --per_device_train_batch_size 16 --per_device_eval_batch_size 1 --gradient_accumulation_steps 4 --evaluation_strategy no --save_strategy steps --save_steps 100 --save_total_limit 100 --learning_rate 1e-5 --warmup_steps 10 --logging_steps 1 --lr_scheduler_type cosine --gradient_checkpointing True --report_to tensorboard --deepspeed configs/ds_config_zero3.json --bf16 True
[2024-06-14 09:24:23,520] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-06-14 09:24:24,649] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0, 1, 2]}
[2024-06-14 09:24:24,649] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=3, node_rank=0
[2024-06-14 09:24:24,649] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2]})
[2024-06-14 09:24:24,649] [INFO] [launch.py:163:main] dist_world_size=3
[2024-06-14 09:24:24,649] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0,1,2
/home/c8501207/.conda/envs/deepseekEnv/lib/python3.11/site-packages/torch/cuda/__init__.py:173: UserWarning:
NVIDIA H100 PCIe with CUDA capability sm_90 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70 sm_75 sm_80 sm_86.
If you want to use the NVIDIA H100 PCIe GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/

warnings.warn(incompatible_device_warn.format(device_name, capability, " ".join(arch_list), device_name))
/home/c8501207/.conda/envs/deepseekEnv/lib/python3.11/site-packages/torch/cuda/__init__.py:173: UserWarning:
NVIDIA H100 PCIe with CUDA capability sm_90 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70 sm_75 sm_80 sm_86.
If you want to use the NVIDIA H100 PCIe GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/

warnings.warn(incompatible_device_warn.format(device_name, capability, " ".join(arch_list), device_name))
/home/c8501207/.conda/envs/deepseekEnv/lib/python3.11/site-packages/torch/cuda/__init__.py:173: UserWarning:
NVIDIA H100 PCIe with CUDA capability sm_90 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70 sm_75 sm_80 sm_86.
If you want to use the NVIDIA H100 PCIe GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/

warnings.warn(incompatible_device_warn.format(device_name, capability, " ".join(arch_list), device_name))
[2024-06-14 09:24:27,538] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-06-14 09:24:27,815] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-06-14 09:24:27,820] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-06-14 09:24:28,619] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-06-14 09:24:28,929] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-06-14 09:24:29,349] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-06-14 09:24:29,349] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl

====================================================================================================
TrainingArguments(
_n_gpu=1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
bf16=True,
bf16_full_eval=False,
cache_dir=None,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_pin_memory=True,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=configs/ds_config_zero3.json,
disable_tqdm=False,
dispatch_batches=None,
do_eval=False,
do_predict=False,
do_train=False,
eval_accumulation_steps=None,
eval_delay=0,
eval_steps=None,
evaluation_strategy=IntervalStrategy.NO,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False},
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
gradient_accumulation_steps=4,
gradient_checkpointing=True,
gradient_checkpointing_kwargs=None,
greater_is_better=None,
group_by_length=False,
half_precision_backend=auto,
hub_always_push=False,
hub_model_id=None,
hub_private_repo=False,
hub_strategy=HubStrategy.EVERY_SAVE,
hub_token=<HUB_TOKEN>,
ignore_data_skip=False,
include_inputs_for_metrics=False,
include_tokens_per_second=False,
jit_mode_eval=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=1e-05,
length_column_name=length,
load_best_model_at_end=False,
local_rank=0,
log_level=passive,
log_level_replica=warning,
log_on_each_node=True,
logging_dir=finetune/test/runs/Jun14_09-24-27_hopper-c850,
logging_first_step=False,
logging_nan_inf_filter=True,
logging_steps=1.0,
logging_strategy=IntervalStrategy.STEPS,
lr_scheduler_type=SchedulerType.COSINE,
max_grad_norm=1.0,
max_steps=-1,
metric_for_best_model=None,
model_max_length=1024,
mp_parameters=,
neftune_noise_alpha=None,
no_cuda=False,
num_train_epochs=3.0,
optim=OptimizerNames.ADAMW_TORCH,
optim_args=None,
output_dir=finetune/test,
overwrite_output_dir=False,
past_index=-1,
per_device_eval_batch_size=1,
per_device_train_batch_size=16,
prediction_loss_only=False,
push_to_hub=False,
push_to_hub_model_id=None,
push_to_hub_organization=None,
push_to_hub_token=<PUSH_TO_HUB_TOKEN>,
ray_scope=last,
remove_unused_columns=True,
report_to=['tensorboard'],
resume_from_checkpoint=None,
run_name=finetune/test,
save_on_each_node=False,
save_safetensors=True,
save_steps=100,
save_strategy=IntervalStrategy.STEPS,
save_total_limit=100,
seed=42,
skip_memory_metrics=True,
split_batches=False,
tf32=None,
torch_compile=False,
torch_compile_backend=None,
torch_compile_mode=None,
torchdynamo=None,
tpu_metrics_debug=False,
tpu_num_cores=None,
use_cpu=False,
use_ipex=False,
use_legacy_prediction_loop=False,
use_mps_device=False,
warmup_ratio=0.0,
warmup_steps=10,
weight_decay=0.0,
)
PAD Token: <｜end▁of▁sentence｜> 32014
BOS Token <｜begin▁of▁sentence｜> 32013
EOS Token <|EOT|> 32021
PAD Token: <｜end▁of▁sentence｜> 32014
BOS Token <｜begin▁of▁sentence｜> 32013
EOS Token <|EOT|> 32021
PAD Token: <｜end▁of▁sentence｜> 32014
BOS Token <｜begin▁of▁sentence｜> 32013
EOS Token <|EOT|> 32021
Load tokenizer from deepseek-ai/deepseek-coder-6.7b-instruct over.
Traceback (most recent call last):
  File "/home/c8501207/DeepSeek-Coder/finetune/finetune_deepseekcoder.py", line 193, in <module>
    train()
  File "/home/c8501207/DeepSeek-Coder/finetune/finetune_deepseekcoder.py", line 144, in train
    model = transformers.AutoModelForCausalLM.from_pretrained(
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/c8501207/.conda/envs/deepseekEnv/lib/python3.11/site-packages/transformers/models/auto/auto_factory.py", line 566, in from_pretrained
    return model_class.from_pretrained(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/c8501207/.conda/envs/deepseekEnv/lib/python3.11/site-packages/transformers/modeling_utils.py", line 3236, in from_pretrained
    model = cls(config, *model_args, **model_kwargs)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/c8501207/.conda/envs/deepseekEnv/lib/python3.11/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 458, in wrapper
    f(module, *args, **kwargs)
  File "/home/c8501207/.conda/envs/deepseekEnv/lib/python3.11/site-packages/transformers/models/llama/modeling_llama.py", line 961, in __init__
    self.model = LlamaModel(config)
                 ^^^^^^^^^^^^^^^^^^
  File "/home/c8501207/.conda/envs/deepseekEnv/lib/python3.11/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 458, in wrapper
    f(module, *args, **kwargs)
  File "/home/c8501207/.conda/envs/deepseekEnv/lib/python3.11/site-packages/transformers/models/llama/modeling_llama.py", line 822, in __init__
    self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/c8501207/.conda/envs/deepseekEnv/lib/python3.11/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 458, in wrapper
    f(module, *args, **kwargs)
  File "/home/c8501207/.conda/envs/deepseekEnv/lib/python3.11/site-packages/torch/nn/modules/sparse.py", line 144, in __init__
    self.reset_parameters()
  File "/home/c8501207/.conda/envs/deepseekEnv/lib/python3.11/site-packages/torch/nn/modules/sparse.py", line 153, in reset_parameters
    init.normal_(self.weight)
  File "/home/c8501207/.conda/envs/deepseekEnv/lib/python3.11/site-packages/torch/nn/init.py", line 155, in normal_
    return _no_grad_normal_(tensor, mean, std)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/c8501207/.conda/envs/deepseekEnv/lib/python3.11/site-packages/torch/nn/init.py", line 19, in _no_grad_normal_
    return tensor.normal_(mean, std)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

[2024-06-14 09:24:30,961] [INFO] [partition_parameters.py:347:__exit__] finished initializing model - num_params = 0, num_elems = 0.00B
Traceback (most recent call last):
  File "/home/c8501207/DeepSeek-Coder/finetune/finetune_deepseekcoder.py", line 193, in <module>
    train()
  File "/home/c8501207/DeepSeek-Coder/finetune/finetune_deepseekcoder.py", line 144, in train
    model = transformers.AutoModelForCausalLM.from_pretrained(
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/c8501207/.conda/envs/deepseekEnv/lib/python3.11/site-packages/transformers/models/auto/auto_factory.py", line 566, in from_pretrained
    return model_class.from_pretrained(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/c8501207/.conda/envs/deepseekEnv/lib/python3.11/site-packages/transformers/modeling_utils.py", line 3236, in from_pretrained
    model = cls(config, *model_args, **model_kwargs)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/c8501207/.conda/envs/deepseekEnv/lib/python3.11/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 458, in wrapper
    f(module, *args, **kwargs)
  File "/home/c8501207/.conda/envs/deepseekEnv/lib/python3.11/site-packages/transformers/models/llama/modeling_llama.py", line 961, in __init__
    self.model = LlamaModel(config)
                 ^^^^^^^^^^^^^^^^^^
  File "/home/c8501207/.conda/envs/deepseekEnv/lib/python3.11/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 458, in wrapper
    f(module, *args, **kwargs)
  File "/home/c8501207/.conda/envs/deepseekEnv/lib/python3.11/site-packages/transformers/models/llama/modeling_llama.py", line 822, in __init__
    self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/c8501207/.conda/envs/deepseekEnv/lib/python3.11/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 458, in wrapper
    f(module, *args, **kwargs)
  File "/home/c8501207/.conda/envs/deepseekEnv/lib/python3.11/site-packages/torch/nn/modules/sparse.py", line 144, in __init__
    self.reset_parameters()
  File "/home/c8501207/.conda/envs/deepseekEnv/lib/python3.11/site-packages/torch/nn/modules/sparse.py", line 153, in reset_parameters
    init.normal_(self.weight)
  File "/home/c8501207/.conda/envs/deepseekEnv/lib/python3.11/site-packages/torch/nn/init.py", line 155, in normal_
    return _no_grad_normal_(tensor, mean, std)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/c8501207/.conda/envs/deepseekEnv/lib/python3.11/site-packages/torch/nn/init.py", line 19, in _no_grad_normal_
    return tensor.normal_(mean, std)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

[2024-06-14 09:24:31,652] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 1699039
Traceback (most recent call last):
  File "/home/c8501207/DeepSeek-Coder/finetune/finetune_deepseekcoder.py", line 193, in <module>
    train()
  File "/home/c8501207/DeepSeek-Coder/finetune/finetune_deepseekcoder.py", line 144, in train
    model = transformers.AutoModelForCausalLM.from_pretrained(
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/c8501207/.conda/envs/deepseekEnv/lib/python3.11/site-packages/transformers/models/auto/auto_factory.py", line 566, in from_pretrained
    return model_class.from_pretrained(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/c8501207/.conda/envs/deepseekEnv/lib/python3.11/site-packages/transformers/modeling_utils.py", line 3236, in from_pretrained
    model = cls(config, *model_args, **model_kwargs)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/c8501207/.conda/envs/deepseekEnv/lib/python3.11/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 458, in wrapper
    f(module, *args, **kwargs)
  File "/home/c8501207/.conda/envs/deepseekEnv/lib/python3.11/site-packages/transformers/models/llama/modeling_llama.py", line 961, in __init__
    self.model = LlamaModel(config)
                 ^^^^^^^^^^^^^^^^^^
  File "/home/c8501207/.conda/envs/deepseekEnv/lib/python3.11/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 458, in wrapper
    f(module, *args, **kwargs)
  File "/home/c8501207/.conda/envs/deepseekEnv/lib/python3.11/site-packages/transformers/models/llama/modeling_llama.py", line 822, in __init__
    self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/c8501207/.conda/envs/deepseekEnv/lib/python3.11/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 465, in wrapper
    self._post_init_method(module)
  File "/home/c8501207/.conda/envs/deepseekEnv/lib/python3.11/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 989, in _post_init_method
    self._zero_init_param(param)
  File "/home/c8501207/.conda/envs/deepseekEnv/lib/python3.11/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 945, in _zero_init_param
    dist.broadcast(param, 0, self.get_dp_process_group())
  File "/home/c8501207/.conda/envs/deepseekEnv/lib/python3.11/site-packages/deepspeed/comm/comm.py", line 117, in log_wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/c8501207/.conda/envs/deepseekEnv/lib/python3.11/site-packages/deepspeed/comm/comm.py", line 224, in broadcast
    return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/c8501207/.conda/envs/deepseekEnv/lib/python3.11/site-packages/deepspeed/comm/torch.py", line 196, in broadcast
    return torch.distributed.broadcast(tensor=tensor, src=src, group=group, async_op=async_op)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/c8501207/.conda/envs/deepseekEnv/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 1451, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/c8501207/.conda/envs/deepseekEnv/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 1566, in broadcast
    work = default_pg.broadcast([tensor], opts)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: [2] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Connection reset by peer. This may indicate a possible application crash on rank 0 or a network set up issue.
[2024-06-14 09:24:31,780] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 1699040
[2024-06-14 09:24:31,780] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 1699041
[2024-06-14 09:24:31,987] [ERROR] [launch.py:321:sigkill_handler] ['/home/c8501207/.conda/envs/deepseekEnv/bin/python3.11', '-u', 'finetune_deepseekcoder.py', '--local_rank=2', '--model_name_or_path', 'deepseek-ai/deepseek-coder-6.7b-instruct', '--data_path', 'data_nosync/totalText_TinyQA_wizardlm13b_withoutRep_deepseekFormat.jsonl', '--output_dir', 'finetune/test', '--num_train_epochs', '3', '--model_max_length', '1024', '--per_device_train_batch_size', '16', '--per_device_eval_batch_size', '1', '--gradient_accumulation_steps', '4', '--evaluation_strategy', 'no', '--save_strategy', 'steps', '--save_steps', '100', '--save_total_limit', '100', '--learning_rate', '1e-5', '--warmup_steps', '10', '--logging_steps', '1', '--lr_scheduler_type', 'cosine', '--gradient_checkpointing', 'True', '--report_to', 'tensorboard', '--deepspeed', 'configs/ds_config_zero3.json', '--bf16', 'True'] exits with return code = 1

```
# Solution?
I tried several workarounds for the dependency conflicts between deepspeed, torch, and my CUDA version (mostly triggered by deepspeed), but was not able to solve the problem. I am also aware that PyTorch does not provide a pre-built version for CUDA 12.4, but to my knowledge CUDA versions should be backwards compatible.
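
To make the backward-compatibility assumption concrete, here is a rough sketch that compares the CUDA version reported by the driver (12.4 in `nvidia-smi`) with the CUDA runtime the `torch` wheel was built against (taken from the `+cu117` suffix). The function names `wheel_cuda_version` and `driver_can_run` are hypothetical, and the comparison is a simplification: it only checks that the driver is at least as new as the runtime. It says nothing about which GPU architectures the wheel contains kernels for, which is a separate question.

```python
import re


def wheel_cuda_version(torch_version):
    """Extract (major, minor) from a version string such as '2.0.1+cu117'.

    Returns None when the string carries no '+cuXYZ' suffix.
    """
    m = re.search(r"\+cu(\d+)(\d)$", torch_version)
    return (int(m.group(1)), int(m.group(2))) if m else None


def driver_can_run(driver_cuda, runtime_cuda):
    """Simplified rule: a driver supports runtimes up to its own CUDA version."""
    return driver_cuda >= runtime_cuda


runtime = wheel_cuda_version("2.0.1+cu117")  # value from torch.__version__ above
driver = (12, 4)                             # "CUDA Version" shown by nvidia-smi

print("runtime:", runtime)
print("driver new enough:", driver_can_run(driver, runtime))
```

Under this simplified rule the 12.4 driver is new enough for the 11.7 runtime, so the driver/runtime pairing alone would not explain the error.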

I would really appreciate any sort of support/advice here.
Thank you in advance!



