Segfault when training large GPT2 models on single GPU

I'm trying to use DeepSpeed to fine-tune GPT-2 models on a single RTX 3090 GPU. Using the scripts included with huggingface-transformers, I have been able to get it working up through the 774M model, and the ZeRO optimizations let me double the batch size. However, the CPU Adam optimizer segfaults when I try to train the 1558M model. I am using Ubuntu 20.04, CUDA 11.2, NVIDIA driver 460.32.03, and the current git master versions of PyTorch, Transformers, and DeepSpeed.

Here is the script I used:

```
export BATCH_SIZE=1

export CUDA_VISIBLE_DEVICES=0
export CUDA_HOME=/usr/local/cuda-11.2
export TOKENIZERS_PARALLELISM=false
export MP_SIZE=1
export NUM_WORKERS=1
export NUM_GPUS_PER_WORKER=1

rm -r test_output

USE_TF=0 deepspeed --num_gpus=1 ../../src/transformers/examples/language-modeling/run_clm.py --output_dir=test_output --model_type=gpt2 --model_name_or_path=gpt2-xl --do_train --train_file=pofo-corpus.txt --per_device_train_batch_size $BATCH_SIZE --per_device_eval_batch_size $BATCH_SIZE --fp16 --deepspeed ds_config.json
```

pofo-corpus.txt is the Poetry Foundation collection in a single text file (around 18MB). Here is the config file:

```
{
    "fp16": {
        "enabled": true,
        "loss_scale": 0,
        "loss_scale_window": 100,
        "hysteresis": 2,
        "min_loss_scale": 1e-24,
        "initial_scale_power": -2
    },

    "zero_allow_untested_optimizer": true,
    "zero_optimization": {
        "stage": 2,
        "allgather_partitions": true,
        "allgather_bucket_size": 1.8e7,
        "reduce_scatter": true,
        "reduce_bucket_size": 1.8e7,
        "overlap_comm": true,
        "contiguous_gradients": true,
        "cpu_offload": true
    },

    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": 1e-6,
            "warmup_max_lr": 5e-5,
            "warmup_num_steps": 500
        }
    }
}
```

I've tried changing a number of these settings, but none of them seems to affect the issue. Here is the output:

```
rm: cannot remove 'test_output': No such file or directory
[2021-01-18 14:10:29,800] [WARNING] [runner.py:117:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2021-01-18 14:10:29,815] [INFO] [runner.py:358:main] cmd = /usr/bin/python3 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1 --master_port=29500 ../../src/transformers/examples/language-modeling/run_clm.py --output_dir=test_output --model_type=gpt2 --model_name_or_path=gpt2-xl --do_train --train_file=pofo-corpus.txt --per_device_train_batch_size 1 --per_device_eval_batch_size 1 --fp16 --deepspeed ds_config.json
[2021-01-18 14:10:30,261] [INFO] [launch.py:78:main] WORLD INFO DICT: {'localhost': [0]}
[2021-01-18 14:10:30,261] [INFO] [launch.py:84:main] nnodes=1, num_local_procs=1, node_rank=0
[2021-01-18 14:10:30,261] [INFO] [launch.py:99:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
[2021-01-18 14:10:30,261] [INFO] [launch.py:100:main] dist_world_size=1
[2021-01-18 14:10:30,261] [INFO] [launch.py:102:main] Setting CUDA_VISIBLE_DEVICES=0
[2021-01-18 14:10:31,069] [INFO] [distributed.py:39:init_distributed] Initializing torch distributed with backend: nccl
WARNING:__main__:Process rank: 0, device: cuda:0, n_gpu: 1distributed training: True, 16-bits training: True
Using custom data configuration default
Reusing dataset text (/home/jechk/.cache/huggingface/datasets/text/default-82f776b31993d586/0.0.0/daf90a707a433ac193b369c8cc1772139bb6cca21a9c7fe83bdd16aad9b9b6ab)
[INFO|configuration_utils.py:445] 2021-01-18 14:10:31,547 >> loading configuration file https://huggingface.co/gpt2-xl/resolve/main/config.json from cache at /home/jechk/.cache/huggingface/transformers/d2de8fec009fa9b9196047559bcac6c1f02a9c500718b4346bc516354965b1ca.81d9c13b9ee3f2b22faaba04ca49e09b13f9fea3a7910768ed6664ec141e3c8b
[INFO|configuration_utils.py:481] 2021-01-18 14:10:31,547 >> Model config GPT2Config {
  "_num_labels": 1,
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "gradient_checkpointing": false,
  "id2label": {
    "0": "LABEL_0"
  },
  "initializer_range": 0.02,
  "label2id": {
    "LABEL_0": 0
  },
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 1600,
  "n_head": 25,
  "n_inner": null,
  "n_layer": 48,
  "n_positions": 1024,
  "output_past": true,
  "resid_pdrop": 0.1,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "task_specific_params": {
    "text-generation": {
      "do_sample": true,
      "max_length": 50
    }
  },
  "transformers_version": "4.3.0.dev0",
  "use_cache": true,
  "vocab_size": 50257
}

[INFO|configuration_utils.py:445] 2021-01-18 14:10:31,624 >> loading configuration file https://huggingface.co/gpt2-xl/resolve/main/config.json from cache at /home/jechk/.cache/huggingface/transformers/d2de8fec009fa9b9196047559bcac6c1f02a9c500718b4346bc516354965b1ca.81d9c13b9ee3f2b22faaba04ca49e09b13f9fea3a7910768ed6664ec141e3c8b
[INFO|configuration_utils.py:481] 2021-01-18 14:10:31,625 >> Model config GPT2Config {
  "_num_labels": 1,
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "gradient_checkpointing": false,
  "id2label": {
    "0": "LABEL_0"
  },
  "initializer_range": 0.02,
  "label2id": {
    "LABEL_0": 0
  },
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 1600,
  "n_head": 25,
  "n_inner": null,
  "n_layer": 48,
  "n_positions": 1024,
  "output_past": true,
  "resid_pdrop": 0.1,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "task_specific_params": {
    "text-generation": {
      "do_sample": true,
      "max_length": 50
    }
  },
  "transformers_version": "4.3.0.dev0",
  "use_cache": true,
  "vocab_size": 50257
}

[INFO|tokenization_utils_base.py:1766] 2021-01-18 14:10:31,989 >> loading file https://huggingface.co/gpt2-xl/resolve/main/vocab.json from cache at /home/jechk/.cache/huggingface/transformers/8560a2df03f812b276794ae6935255d0590522553a4c8103155472b07591a21b.c7ed1f96aac49e745788faa77ba0a26a392643a50bb388b9c04ff469e555241f
[INFO|tokenization_utils_base.py:1766] 2021-01-18 14:10:31,989 >> loading file https://huggingface.co/gpt2-xl/resolve/main/merges.txt from cache at /home/jechk/.cache/huggingface/transformers/18fe27e0b70062b3e45fc4e827d5449d9fe85875937594da927e48cb657366d1.5d12962c5ee615a4c803841266e9c3be9a691a924f72d395d3a6c6c81157788b
[INFO|tokenization_utils_base.py:1766] 2021-01-18 14:10:31,989 >> loading file https://huggingface.co/gpt2-xl/resolve/main/tokenizer.json from cache at /home/jechk/.cache/huggingface/transformers/aabb8839163cd911f810ab23f5ae8c966b9b9ea60622c429020611caa389b04b.cf2d0ecb83b6df91b3dbb53f1d1e4c311578bfd3aa0e04934215a49bf9898df0
[INFO|modeling_utils.py:1027] 2021-01-18 14:10:32,129 >> loading weights file https://huggingface.co/gpt2-xl/resolve/main/pytorch_model.bin from cache at /home/jechk/.cache/huggingface/transformers/96569b907e56747ce3e593c6a13d8475b8c733a64aab8af8f602b90d94c4af71.8fbbcdf404c82c5967934d411f1462fa0574d639f2aa398aa3754fced1bb26c0
[INFO|modeling_utils.py:1143] 2021-01-18 14:10:54,131 >> All model checkpoint weights were used when initializing GPT2LMHeadModel.

[INFO|modeling_utils.py:1151] 2021-01-18 14:10:54,131 >> All the weights of GPT2LMHeadModel were initialized from the model checkpoint at gpt2-xl.
If your task is similar to the task the model of the checkpoint was trained on, you can already use GPT2LMHeadModel for predictions without further training.
Loading cached processed dataset at /home/jechk/.cache/huggingface/datasets/text/default-82f776b31993d586/0.0.0/daf90a707a433ac193b369c8cc1772139bb6cca21a9c7fe83bdd16aad9b9b6ab/cache-d5e960aa227f7b5e.arrow
Loading cached processed dataset at /home/jechk/.cache/huggingface/datasets/text/default-82f776b31993d586/0.0.0/daf90a707a433ac193b369c8cc1772139bb6cca21a9c7fe83bdd16aad9b9b6ab/cache-1b9b3e2f092a373d.arrow
[INFO|trainer.py:442] 2021-01-18 14:10:55,458 >> The following columns in the training set don't have a corresponding argument in `GPT2LMHeadModel.forward` and have been ignored: .
[INFO|trainer.py:359] 2021-01-18 14:10:55,458 >> Using amp fp16 backend
[INFO|integrations.py:323] 2021-01-18 14:10:55,459 >> Keeping the `scheduler` config from ds_config.json intact, ignoring any scheduler-specific cl args
[INFO|integrations.py:368] 2021-01-18 14:10:55,459 >> Keeping the `fp16` config from ds_config.json intact, ignoring any fp16-specific cl args
[2021-01-18 14:10:55,459] [INFO] [logging.py:60:log_dist] [Rank 0] DeepSpeed info: version=0.3.10+7b07e12, git-hash=7b07e12, git-branch=master
[2021-01-18 14:10:55,472] [INFO] [engine.py:73:_initialize_parameter_parallel_groups] data_parallel_size: 1, parameter_parallel_size: 1
Using /home/jechk/.cache/torch_extensions as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/jechk/.cache/torch_extensions/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module cpu_adam...
Time to load cpu_adam op: 0.20410561561584473 seconds
Adam Optimizer #0 is created with scalar arithmetic capability.
Config: alpha=0.000050, betas=(0.900000, 0.999000), weight_decay=0.000000, adam_w=1
[2021-01-18 14:10:57,968] [INFO] [engine.py:540:_configure_optimizer] Using DeepSpeed Optimizer param name adamw as basic optimizer
[2021-01-18 14:10:57,968] [INFO] [engine.py:545:_configure_optimizer] DeepSpeed Basic Optimizer = DeepSpeedCPUAdam (
Parameter Group 0
    amsgrad: False
    betas: [0.9, 0.999]
    bias_correction: True
    eps: 1e-08
    lr: 5e-05
    weight_decay: 0.0
)
Checking ZeRO support for optimizer=DeepSpeedCPUAdam type=<class 'deepspeed.ops.adam.cpu_adam.DeepSpeedCPUAdam'>
[2021-01-18 14:10:57,968] [INFO] [engine.py:661:_configure_zero_optimizer] Creating fp16 ZeRO stage 2 optimizer
Using /home/jechk/.cache/torch_extensions as PyTorch extensions root...
Emitting ninja build file /home/jechk/.cache/torch_extensions/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module utils...
Time to load utils op: 0.1086723804473877 seconds
[2021-01-18 14:10:58,077] [INFO] [stage2.py:130:__init__] Reduce bucket size 18000000.0
[2021-01-18 14:10:58,077] [INFO] [stage2.py:131:__init__] Allgather bucket size 18000000.0
[2021-01-18 14:10:58,077] [INFO] [stage2.py:132:__init__] CPU Offload: True
group 0 param 0 = 1557611200
[2021-01-18 14:11:03,591] [INFO] [stage2.py:399:__init__] optimizer state initialized
[2021-01-18 14:11:03,591] [INFO] [engine.py:575:_configure_optimizer] DeepSpeed Final Optimizer = <deepspeed.runtime.zero.stage2.FP16_DeepSpeedZeroOptimizer object at 0x7fe0f0994d30>
[2021-01-18 14:11:03,591] [INFO] [engine.py:405:_configure_lr_scheduler] DeepSpeed using configured LR scheduler = WarmupLR
[2021-01-18 14:11:03,591] [INFO] [logging.py:60:log_dist] [Rank 0] DeepSpeed LR Scheduler = <deepspeed.runtime.lr_schedules.WarmupLR object at 0x7fdfbc497ee0>
[2021-01-18 14:11:03,591] [INFO] [logging.py:60:log_dist] [Rank 0] step=0, skipped=0, lr=[5e-05], mom=[[0.9, 0.999]]
[2021-01-18 14:11:03,591] [INFO] [config.py:733:print] DeepSpeedEngine configuration:
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print]   activation_checkpointing_config  <deepspeed.runtime.activation_checkpointing.config.DeepSpeedActivationCheckpointingConfig object at 0x7fe0f0994a60>
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print]   allreduce_always_fp32 ........ False
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print]   amp_enabled .................. False
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print]   amp_params ................... False
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print]   checkpoint_tag_validation_enabled  True
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print]   checkpoint_tag_validation_fail  False
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print]   disable_allgather ............ False
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print]   dump_state ................... False
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print]   dynamic_loss_scale_args ...... {'init_scale': 0.25, 'scale_window': 100, 'delayed_shift': 2, 'min_scale': 1e-24}
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print]   elasticity_enabled ........... False
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print]   flops_profiler_config ........ <deepspeed.profiling.config.DeepSpeedFlopsProfilerConfig object at 0x7fe0f0994ac0>
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print]   fp16_enabled ................. True
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print]   global_rank .................. 0
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print]   gradient_accumulation_steps .. 1
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print]   gradient_clipping ............ 1.0
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print]   gradient_predivide_factor .... 1.0
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print]   initial_dynamic_scale ........ 0.25
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print]   loss_scale ................... 0
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print]   memory_breakdown ............. False
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print]   optimizer_legacy_fusion ...... False
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print]   optimizer_name ............... adamw
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print]   optimizer_params ............. {'lr': 5e-05, 'betas': [0.9, 0.999], 'eps': 1e-08, 'weight_decay': 0.0}
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print]   pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print]   pld_enabled .................. False
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print]   pld_params ................... False
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print]   prescale_gradients ........... False
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print]   scheduler_name ............... WarmupLR
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print]   scheduler_params ............. {'warmup_min_lr': 1e-06, 'warmup_max_lr': 5e-05, 'warmup_num_steps': 500}
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print]   sparse_attention ............. None
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print]   sparse_gradients_enabled ..... False
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print]   steps_per_print .............. 10
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print]   tensorboard_enabled .......... False
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print]   tensorboard_job_name ......... DeepSpeedJobName
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print]   tensorboard_output_path ...... 
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print]   train_batch_size ............. 1
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print]   train_micro_batch_size_per_gpu  1
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print]   wall_clock_breakdown ......... False
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print]   world_size ................... 1
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print]   zero_allow_untested_optimizer  True
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print]   zero_config .................. {
    "allgather_bucket_size": 18000000.0,
    "allgather_partitions": true,
    "contiguous_gradients": true,
    "cpu_offload": true,
    "elastic_checkpoint": true,
    "load_from_fp32_weights": true,
    "overlap_comm": true,
    "reduce_bucket_size": 18000000.0,
    "reduce_scatter": true,
    "stage": 2
}
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print]   zero_enabled ................. True
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print]   zero_optimization_stage ...... 2
[2021-01-18 14:11:03,592] [INFO] [config.py:739:print]   json = {
    "fp16":{
        "enabled":true,
        "hysteresis":2,
        "initial_scale_power":-2,
        "loss_scale":0,
        "loss_scale_window":100,
        "min_loss_scale":1e-24
    },
    "gradient_accumulation_steps":1,
    "gradient_clipping":1.0,
    "optimizer":{
        "params":{
            "betas":[
                0.9,
                0.999
            ],
            "eps":1e-08,
            "lr":5e-05,
            "weight_decay":0.0
        },
        "type":"AdamW"
    },
    "scheduler":{
        "params":{
            "warmup_max_lr":5e-05,
            "warmup_min_lr":1e-06,
            "warmup_num_steps":500
        },
        "type":"WarmupLR"
    },
    "train_micro_batch_size_per_gpu":1,
    "zero_allow_untested_optimizer":true,
    "zero_optimization":{
        "allgather_bucket_size":18000000.0,
        "allgather_partitions":true,
        "contiguous_gradients":true,
        "cpu_offload":true,
        "overlap_comm":true,
        "reduce_bucket_size":18000000.0,
        "reduce_scatter":true,
        "stage":2
    }
}
Using /home/jechk/.cache/torch_extensions as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.00028705596923828125 seconds
[INFO|trainer.py:810] 2021-01-18 14:11:03,643 >> ***** Running training *****
[INFO|trainer.py:811] 2021-01-18 14:11:03,643 >>   Num examples = 4917
[INFO|trainer.py:812] 2021-01-18 14:11:03,643 >>   Num Epochs = 3
[INFO|trainer.py:813] 2021-01-18 14:11:03,643 >>   Instantaneous batch size per device = 1
[INFO|trainer.py:814] 2021-01-18 14:11:03,643 >>   Total train batch size (w. parallel, distributed & accumulation) = 1
[INFO|trainer.py:815] 2021-01-18 14:11:03,643 >>   Gradient Accumulation steps = 1
[INFO|trainer.py:816] 2021-01-18 14:11:03,643 >>   Total optimization steps = 14751
2021-01-18 14:11:03.737646: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
  0%|                                                                                                 | 0/14751 [00:00<?, ?it/s][W reducer.cpp:1042] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters, consider turning this flag off. Note that this warning may be a false positive your model has flow control causing later iterations to have unused parameters. (function operator())
```

The program then exits abruptly. The segfault is reported in dmesg:

```
[ 9250.120732] python3[10345]: segfault at 7fde4685c850 ip 00007fdfbc2057e0 sp 00007fdf8cd3fe40 error 6
[ 9250.120738] python3[10349]: segfault at 7fde846a9f70 ip 00007fdfbc2057e0 sp 00007fdf8ad3be40 error 6
[ 9250.120743] python3[10344]: segfault at 7fde370c9288 ip 00007fdfbc2057e0 sp 00007fdfbcce3e40 error 6
[ 9250.120745] python3[10348]: segfault at 7fde74f169a8 ip 00007fdfbc2057e0 sp 00007fdf8b53ce40 error 6
[ 9250.120749] python3[10347]: segfault at 7fde657833e0 ip 00007fdfbc2057e0 sp 00007fdf8bd3de40 error 6
[ 9250.120752]  in cpu_adam.so[7fdfbc203000+16000]
[ 9250.120754]  in cpu_adam.so[7fdfbc203000+16000]
[ 9250.120755]  in cpu_adam.so[7fdfbc203000+16000]
[ 9250.120761] Code: ff ff 7f c5 f9 7e c8 81 e1 00 80 00 00 4a 03 74 db 30 81 ff ff ff 7f 7f 0f 86 8c 00 00 00 c5 79 7e d0 81 ff 00 00 80 7f 74 30 <89> 06 c4 c1 7a 11 0c 94 c4 c1 7a 11 64 95 00 c4 c1 7a 11 1c 96 48
[ 9250.120763] Code: ff ff 7f c5 f9 7e c8 81 e1 00 80 00 00 4a 03 74 db 30 81 ff ff ff 7f 7f 0f 86 8c 00 00 00 c5 79 7e d0 81 ff 00 00 80 7f 74 30 <89> 06 c4 c1 7a 11 0c 94 c4 c1 7a 11 64 95 00 c4 c1 7a 11 1c 96 48
[ 9250.120764] Code: ff ff 7f c5 f9 7e c8 81 e1 00 80 00 00 4a 03 74 db 30 81 ff ff ff 7f 7f 0f 86 8c 00 00 00 c5 79 7e d0 81 ff 00 00 80 7f 74 30 <89> 06 c4 c1 7a 11 0c 94 c4 c1 7a 11 64 95 00 c4 c1 7a 11 1c 96 48
[ 9250.120766]  in cpu_adam.so[7fdfbc203000+16000]
[ 9250.120767]  in cpu_adam.so[7fdfbc203000+16000]
[ 9250.120772] Code: ff ff 7f c5 f9 7e c8 81 e1 00 80 00 00 4a 03 74 db 30 81 ff ff ff 7f 7f 0f 86 8c 00 00 00 c5 79 7e d0 81 ff 00 00 80 7f 74 30 <89> 06 c4 c1 7a 11 0c 94 c4 c1 7a 11 64 95 00 c4 c1 7a 11 1c 96 48
[ 9250.120778] Code: ff ff 7f c5 f9 7e c8 81 e1 00 80 00 00 4a 03 74 db 30 81 ff ff ff 7f 7f 0f 86 8c 00 00 00 c5 79 7e d0 81 ff
```
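To make the dmesg lines easier to act on, here is a minimal sketch that decodes one of them. It assumes the usual x86-64 Linux format, where the addresses and the error code are hexadecimal and the low three bits of the error code mean "page present", "write access", and "user mode". The function name `decode_segfault` and the dictionary keys are made up for illustration. The resulting offset is relative to the start of the mapping that dmesg reports, which is not necessarily the start of the `cpu_adam.so` file, so treat it as a starting point for `addr2line` or `objdump` rather than an exact file offset.

```python
import re

# Lines copied from the dmesg output above.
SEGFAULT_LINE = ("python3[10345]: segfault at 7fde4685c850 "
                 "ip 00007fdfbc2057e0 sp 00007fdf8cd3fe40 error 6")
MAPPING_LINE = "in cpu_adam.so[7fdfbc203000+16000]"


def decode_segfault(segfault_line, mapping_line):
    """Decode an x86-64 Linux segfault report (hypothetical helper)."""
    fault = re.search(
        r"segfault at ([0-9a-f]+) ip ([0-9a-f]+) sp ([0-9a-f]+) error ([0-9a-f]+)",
        segfault_line,
    )
    mapping = re.search(r"in (\S+)\[([0-9a-f]+)\+([0-9a-f]+)\]", mapping_line)
    if fault is None or mapping is None:
        raise ValueError("unrecognized dmesg format")

    # All four fields are printed in hexadecimal.
    fault_addr, ip, _sp, error = (int(g, 16) for g in fault.groups())
    base, size = int(mapping.group(2), 16), int(mapping.group(3), 16)

    return {
        "library": mapping.group(1),
        "fault_addr": fault_addr,
        # Instruction pointer relative to the start of the reported mapping.
        "offset_in_mapping": ip - base,
        "ip_inside_mapping": base <= ip < base + size,
        # Page-fault error code bits (assumed x86 layout).
        "page_present": bool(error & 1),
        "was_write": bool(error & 2),
        "from_user_mode": bool(error & 4),
    }


info = decode_segfault(SEGFAULT_LINE, MAPPING_LINE)
print(info["library"], hex(info["offset_in_mapping"]))
```

Under those assumptions, `error 6` reads as a user-mode write to an address with no mapped page, and every thread faults at the same instruction inside `cpu_adam.so`.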

I tried training similar models using the DeepSpeed version of Megatron-LM instead of huggingface-transformers, and the same thing happens: it works correctly up to a certain number of parameters, but segfaults with sufficiently large models.
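Since the failure depends on parameter count, a rough size estimate may help anyone trying to reproduce it. The sketch below uses the parameter count from the log line `group 0 param 0 = 1557611200`. It assumes that the offloaded optimizer keeps three fp32 buffers per parameter (master weights, momentum, and variance), which is the usual accounting for mixed-precision Adam; the actual layout inside DeepSpeed may differ. The function name `estimate_bytes` and the dictionary keys are made up for illustration.

```python
PARAM_COUNT = 1_557_611_200  # from the log line "group 0 param 0 = 1557611200"
FP32_BYTES = 4
FP16_BYTES = 2


def estimate_bytes(param_count):
    """Rough buffer sizes in bytes (hypothetical helper, assumed layout)."""
    fp32_buffer = param_count * FP32_BYTES
    return {
        # fp16 copy of the weights used for forward/backward on the GPU.
        "fp16_weights_gpu": param_count * FP16_BYTES,
        # A single flattened fp32 buffer covering every parameter.
        "one_fp32_buffer": fp32_buffer,
        # Assumption: fp32 master weights + momentum + variance on the host.
        "adam_state_cpu": 3 * fp32_buffer,
    }


sizes = estimate_bytes(PARAM_COUNT)
for name, num_bytes in sizes.items():
    print(f"{name}: {num_bytes / 2**30:.2f} GiB")
```

This only sizes the buffers; it does not by itself establish why the optimizer crashes.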
