GPT3XL training #109

loretoparisi · 2021-01-06T09:24:35Z

It's not clear to me how to train the GPT3XL via GPU/Colab.
Could you add more details?

Thank you.

srulikbd · 2021-01-06T11:51:34Z

There is an incompatibility between the tokenizers and transformers versions (it installs the current transformers version, but an old tokenizers version).

Which versions should we use?

loretoparisi · 2021-01-08T11:05:09Z

@srulikbd I asked Thomas Wolf from HF about this, and he suggested using the latest version of both. Could you be more specific about the tokenizers version issue?
Thank you.
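
When reporting a version mismatch like this, it helps to state exactly which versions are installed. A minimal sketch, assuming Python 3.8+ (where `importlib.metadata` is in the standard library); `installed_version` is a hypothetical helper name, not part of GPTNeo:

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Print the versions of the two packages discussed in this thread.
for name in ("transformers", "tokenizers"):
    print(name, installed_version(name))
```

On older interpreters `importlib.metadata` is not available, so `pip show tokenizers transformers` in a notebook cell is an alternative.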

srulikbd · 2021-01-08T17:50:20Z

Hey, it seems to be working now.
I collected some small changes needed to get the Colab example running:

Change the tokenizers version in the requirements file to 0.9.4, or add the command
!pip install tokenizers==0.9.4

In the "# Tokenize Data" cell, rename the argument "base_dir" to "input_dir".

Delete the argument "--use_gpt2_tokenizer", because the GPT-2 fast tokenizer is used by default.

I might add more changes soon if needed.
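
The first change above (pinning tokenizers to 0.9.4 in the requirements file) can be sketched as a small text rewrite. This is only an illustration: `pin_requirement` is a hypothetical helper, and the sample requirements text, including the old version number, is made up rather than taken from the repository.

```python
import re


def pin_requirement(requirements_text, package, version):
    """Return requirements text with `package` pinned to `version`.

    An existing line for the package is replaced; if none exists, a pinned
    line is appended.
    """
    pinned = f"{package}=={version}"
    # Match the bare package name, optionally followed by a version specifier.
    pattern = re.compile(rf"^{re.escape(package)}\s*([=<>!~].*)?$", re.IGNORECASE)
    lines = requirements_text.splitlines()
    found = False
    for i, line in enumerate(lines):
        if pattern.match(line.strip()):
            lines[i] = pinned
            found = True
    if not found:
        lines.append(pinned)
    return "\n".join(lines) + "\n"


# Made-up requirements content, for illustration only.
example = "transformers\ntokenizers==0.8.1\ntensorflow\n"
print(pin_requirement(example, "tokenizers", "0.9.4"))
```

In a Colab notebook, running the `!pip install tokenizers==0.9.4` command from the list above after installing the requirements should have the same effect.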

StellaAthena · 2021-01-08T19:27:48Z

Great! Can you put these changes on a branch and open a PR? That way we can verify that it doesn’t break anything on the TPUs and merge it.

srulikbd · 2021-01-08T21:17:23Z

Yeah, of course. I'll do that as soon as possible.
Well done on your awesome work!

srulikbd · 2021-01-09T08:36:29Z

@StellaAthena hey.
I got to the training stage, but it got stuck for some reason. Do you have any idea why?
I was able to run train_enwik8 easily with the gpt-neox library... what is the difference between the two packages?

Here is the output after running the GPTNeo example on Google Colab:

2021-01-08 22:33:49.795424: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib64-nvidia
2021-01-08 22:33:49.795465: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/compat/v2_compat.py:96: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
Current step 0
Saving config to /content/GPTNeo/model_weights
2021-01-08 22:33:53.177601: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-01-08 22:33:53.177746: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2021-01-08 22:33:53.177944: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-08 22:33:53.178523: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties:
pciBusID: 0000:00:04.0 name: Tesla T4 computeCapability: 7.5
coreClock: 1.59GHz coreCount: 40 deviceMemorySize: 14.73GiB deviceMemoryBandwidth: 298.08GiB/s
2021-01-08 22:33:53.178667: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib64-nvidia
2021-01-08 22:33:53.178760: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcublas.so.11'; dlerror: libcublas.so.11: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib64-nvidia
2021-01-08 22:33:53.178842: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcublasLt.so.11'; dlerror: libcublasLt.so.11: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib64-nvidia
2021-01-08 22:33:53.180363: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2021-01-08 22:33:53.180792: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2021-01-08 22:33:53.182284: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2021-01-08 22:33:53.182413: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcusparse.so.11'; dlerror: libcusparse.so.11: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib64-nvidia
2021-01-08 22:33:53.182497: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudnn.so.8'; dlerror: libcudnn.so.8: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib64-nvidia
2021-01-08 22:33:53.182519: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1757] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
2021-01-08 22:33:53.285094: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1261] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-01-08 22:33:53.285162: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1267] 0
2021-01-08 22:33:53.285182: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1280] 0: N
2021-01-08 22:33:53.291654: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:196] None of the MLIR optimization passes are enabled (registered 0 passes)
Done!
params = defaultdict(<function fetch_model_params.<locals>.<lambda> at 0x7f9ed91c0158>, {'n_head': 32, 'n_vocab': 50260, 'embed_dropout': 0, 'lr': 0.0002, 'lr_decay': 'cosine', 'warmup_steps': 3000, 'beta1': 0.9, 'beta2': 0.95, 'epsilon': 1e-08, 'opt_name': 'adam', 'weight_decay': 0.1, 'train_batch_size': 1, 'attn_dropout': 0, 'train_steps': 1, 'eval_steps': 0, 'predict_steps': 1, 'res_dropout': 0, 'eval_batch_size': 64, 'predict_batch_size': 1, 'iterations': 1, 'n_embd': 2048, 'datasets': [['openwebtexts', 21, 'documents_random', 1.0]], 'model': 'GPT', 'model_path': '/content/GPTNeo/model_weights', 'n_ctx': 2048, 'n_layer': 24, 'scale_by_depth': True, 'scale_by_in': False, 'attention_types': ['global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global'], 'mesh_shape': 'x:4,y:2', 'layout': 'intermediate_expanded:x,heads:x,vocab:x,memory_length:y,embd:y', 'activation_function': 'gelu', 'recompute_grad': True, 'gradient_clipping': 1.0, 'tokens_per_mb_per_replica': 2048, 'padding_id': 50257, 'eos_id': 50256, 'dataset_configs': {'openwebtexts': {'path': '/content/GPTNeo/openwebtext-small/bundestag_*.tfrecords', 'eval_path': '', 'n_vocab': 50256, 'tokenizer_is_pretrained': True, 'tokenizer_path': 'gpt2', 'eos_id': 50256, 'padding_id': 50257}}, 'mlm_training': False, 'causal': True, 'num_cores': 8, 'auto_layout': False, 'auto_layout_and_mesh_shape': False, 'use_tpu': False, 'gpu_ids': ['device:GPU:0'], 'steps_per_checkpoint': 2, 'predict': False, 'export': False, 'sampling_use_entmax': False, 'moe_layers': None, 'slow_sampling': False})
Using config: {'_model_dir': '/content/GPTNeo/model_weights', '_tf_random_seed': None, '_save_summary_steps': 1, '_save_checkpoints_steps': None, '_save_checkpoints_secs': None, '_session_config': allow_soft_placement: true graph_options { rewrite_options { meta_optimizer_iterations: ONE } } , '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': None, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_checkpoint_save_graph_def': True, '_service': None, '_cluster_spec': ClusterSpec({}), '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1, '_tpu_config': TPUConfig(iterations_per_loop=1, num_shards=8, num_cores_per_replica=1, per_host_input_for_training=4, tpu_job_name=None, initial_infeed_sleep_secs=None, input_partition_dims=None, eval_training_input_configuration=2, experimental_host_call_every_n_steps=1, experimental_allow_per_host_v2_parallel_get_next=False, experimental_feed_hook=None), '_cluster': None}
_TPUContext: eval_on_tpu True
eval_on_tpu ignored because use_tpu is False.
From /usr/local/lib/python3.6/dist-packages/tensorflow/python/training/training_util.py:236: Variable.initialized_value (from tensorflow.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Use Variable.read_value. Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) contexts.
WARNING:root:Changing batch size with sequential_input() will result in some data being skipped or repeated. Please ensure your batch size stays constant throughout training.
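
The output above reports several CUDA 11 libraries that could not be loaded, followed by "Skipping registering GPU devices...", so the run may not have been using the GPU at all. Whether that explains the hang is not established here. A minimal sketch for pulling the missing and loaded library names out of a saved log; `summarize_cuda_libs` is a hypothetical helper, and the sample text is shortened from the lines quoted above:

```python
import re

# Patterns for the two dso_loader messages that appear in the quoted log.
MISSING_LIB = re.compile(r"Could not load dynamic library '([^']+)'")
LOADED_LIB = re.compile(r"Successfully opened dynamic library (\S+)")


def summarize_cuda_libs(log_text):
    """Return (missing, loaded) lists of library names found in a TensorFlow log."""
    missing = sorted(set(MISSING_LIB.findall(log_text)))
    loaded = sorted(set(LOADED_LIB.findall(log_text)))
    return missing, loaded


# Shortened excerpts of the messages quoted above.
sample = (
    "W dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: ...\n"
    "I dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10\n"
    "W dso_loader.cc:60] Could not load dynamic library 'libcudnn.so.8'; dlerror: ...\n"
)
missing, loaded = summarize_cuda_libs(sample)
print("missing:", missing)
print("loaded:", loaded)
```

In the full log the missing names end in `.so.11`, `.so.11.0`, or `.so.8`, while the libraries that did load end in `.so.10` or `.so.1`, which suggests a mismatch between the installed TensorFlow build and the CUDA libraries on the runtime.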

StellaAthena · 2021-01-09T14:05:09Z

Where are you running this code? Are you using your own GPUs?

srulikbd · 2021-01-09T14:23:58Z

I tried both the GPU and TPU runtimes on Colab. I also tried an AWS AMI instance with a V100.
The output I quoted is from the Colab GPU runtime.

StellaAthena · 2021-01-23T07:23:19Z

Sorry this slipped through the cracks. I assume you got everything working based on your PR?

srulikbd · 2021-01-24T05:45:51Z

Actually, it might still not work. I saw that you are focused on gpt-neox, so I switched over there :)

StellaAthena added the documentation label Jan 6, 2021

StellaAthena closed this as completed Feb 15, 2021
