This repository has been archived by the owner on Feb 25, 2022. It is now read-only.

GPT3XL training #109

Closed
loretoparisi opened this issue Jan 6, 2021 · 10 comments
Labels
documentation Improvements or additions to documentation.

Comments

@loretoparisi

It's not clear to me how to train the GPT3XL via GPU/Colab.
Could you add more details?

Thank you.

@srulikbd
Contributor

srulikbd commented Jan 6, 2021

There is an incompatibility between the tokenizers and transformers versions (it installs the current transformers version, but an old tokenizers one).

  1. Which versions should we use?

@StellaAthena added the documentation label Jan 6, 2021
@loretoparisi
Author

@srulikbd I asked Thomas Wolf from HF about this, and his suggestion was to use the latest version of both. Could you be more specific about the tokenizers version issue?
Thank you.

@srulikbd
Contributor

srulikbd commented Jan 8, 2021

Hey, it seems to be working now.
I collected some small changes needed to get the Colab example running:

  1. Change the tokenizers version in the requirements file to 0.9.4, or add the command
    !pip install tokenizers==0.9.4
  2. In the # Tokenize Data line: rename the argument “base_dir” to “input_dir”.
  3. Delete the argument “--use_gpt2_tokenizer”, because the GPT-2 fast tokenizer is used by default.

I might add more changes soon if needed.
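
The pin from step 1 can be checked from a notebook cell before tokenizing. This is a minimal sketch, assuming Python 3.8+ for `importlib.metadata`; the helper name `check_pins` is hypothetical and not part of GPTNeo, and the only pin taken from this thread is tokenizers 0.9.4.

```python
from importlib import metadata


def check_pins(pins, get_version=metadata.version):
    """Return {package: (expected, installed)} for every pin that is not met.

    `installed` is None when the package is not installed at all.
    """
    mismatches = {}
    for package, expected in pins.items():
        try:
            installed = get_version(package)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != expected:
            mismatches[package] = (expected, installed)
    return mismatches


if __name__ == "__main__":
    # The only pin reported in this thread is tokenizers==0.9.4.
    problems = check_pins({"tokenizers": "0.9.4"})
    for package, (expected, installed) in problems.items():
        print(f"{package}: expected {expected}, found {installed}")
```

An empty result means the environment matches the pin; anything printed points at a package to reinstall before running the example.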

@StellaAthena
Member

Great! Can you put these changes on a branch and open a PR? That way we can verify that it doesn’t break anything on the TPUs and merge it.

@srulikbd
Contributor

srulikbd commented Jan 8, 2021

Yeah, of course. I'll do that as soon as possible.
Well done on your awesome work!

@srulikbd
Contributor

srulikbd commented Jan 9, 2021

@StellaAthena Hey.
I got to the training stage, but it got stuck for some reason. Do you have any idea why?
I was able to run train_enwik8 easily with the gpt-neox library. What is the difference between the two packages?

Here is the output after running the GPTNeo example on Google Colab:

2021-01-08 22:33:49.795424: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib64-nvidia
2021-01-08 22:33:49.795465: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/compat/v2_compat.py:96: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
Current step 0
Saving config to /content/GPTNeo/model_weights
2021-01-08 22:33:53.177601: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-01-08 22:33:53.177746: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2021-01-08 22:33:53.177944: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-08 22:33:53.178523: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties:
pciBusID: 0000:00:04.0 name: Tesla T4 computeCapability: 7.5
coreClock: 1.59GHz coreCount: 40 deviceMemorySize: 14.73GiB deviceMemoryBandwidth: 298.08GiB/s
2021-01-08 22:33:53.178667: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib64-nvidia
2021-01-08 22:33:53.178760: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcublas.so.11'; dlerror: libcublas.so.11: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib64-nvidia
2021-01-08 22:33:53.178842: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcublasLt.so.11'; dlerror: libcublasLt.so.11: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib64-nvidia
2021-01-08 22:33:53.180363: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2021-01-08 22:33:53.180792: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2021-01-08 22:33:53.182284: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2021-01-08 22:33:53.182413: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcusparse.so.11'; dlerror: libcusparse.so.11: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib64-nvidia
2021-01-08 22:33:53.182497: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudnn.so.8'; dlerror: libcudnn.so.8: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib64-nvidia
2021-01-08 22:33:53.182519: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1757] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
2021-01-08 22:33:53.285094: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1261] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-01-08 22:33:53.285162: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1267] 0
2021-01-08 22:33:53.285182: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1280] 0: N
2021-01-08 22:33:53.291654: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:196] None of the MLIR optimization passes are enabled (registered 0 passes)
Done!
params = defaultdict(<function fetch_model_params.<locals>.<lambda> at 0x7f9ed91c0158>, {'n_head': 32, 'n_vocab': 50260, 'embed_dropout': 0, 'lr': 0.0002, 'lr_decay': 'cosine', 'warmup_steps': 3000, 'beta1': 0.9, 'beta2': 0.95, 'epsilon': 1e-08, 'opt_name': 'adam', 'weight_decay': 0.1, 'train_batch_size': 1, 'attn_dropout': 0, 'train_steps': 1, 'eval_steps': 0, 'predict_steps': 1, 'res_dropout': 0, 'eval_batch_size': 64, 'predict_batch_size': 1, 'iterations': 1, 'n_embd': 2048, 'datasets': [['openwebtexts', 21, 'documents_random', 1.0]], 'model': 'GPT', 'model_path': '/content/GPTNeo/model_weights', 'n_ctx': 2048, 'n_layer': 24, 'scale_by_depth': True, 'scale_by_in': False, 'attention_types': ['global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global'], 'mesh_shape': 'x:4,y:2', 'layout': 'intermediate_expanded:x,heads:x,vocab:x,memory_length:y,embd:y', 'activation_function': 'gelu', 'recompute_grad': True, 'gradient_clipping': 1.0, 'tokens_per_mb_per_replica': 2048, 'padding_id': 50257, 'eos_id': 50256, 'dataset_configs': {'openwebtexts': {'path': '/content/GPTNeo/openwebtext-small/bundestag_*.tfrecords', 'eval_path': '', 'n_vocab': 50256, 'tokenizer_is_pretrained': True, 'tokenizer_path': 'gpt2', 'eos_id': 50256, 'padding_id': 50257}}, 'mlm_training': False, 'causal': True, 'num_cores': 8, 'auto_layout': False, 'auto_layout_and_mesh_shape': False, 'use_tpu': False, 'gpu_ids': ['device:GPU:0'], 'steps_per_checkpoint': 2, 'predict': False, 'export': False, 'sampling_use_entmax': False, 'moe_layers': None, 'slow_sampling': False})
Using config: {'_model_dir': '/content/GPTNeo/model_weights', '_tf_random_seed': None, '_save_summary_steps': 1, '_save_checkpoints_steps': None, '_save_checkpoints_secs': None, '_session_config': allow_soft_placement: true graph_options { rewrite_options { meta_optimizer_iterations: ONE } } , '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': None, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_checkpoint_save_graph_def': True, '_service': None, '_cluster_spec': ClusterSpec({}), '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1, '_tpu_config': TPUConfig(iterations_per_loop=1, num_shards=8, num_cores_per_replica=1, per_host_input_for_training=4, tpu_job_name=None, initial_infeed_sleep_secs=None, input_partition_dims=None, eval_training_input_configuration=2, experimental_host_call_every_n_steps=1, experimental_allow_per_host_v2_parallel_get_next=False, experimental_feed_hook=None), '_cluster': None}
_TPUContext: eval_on_tpu True
eval_on_tpu ignored because use_tpu is False.
From /usr/local/lib/python3.6/dist-packages/tensorflow/python/training/training_util.py:236: Variable.initialized_value (from tensorflow.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Use Variable.read_value. Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) contexts.
WARNING:root:Changing batch size with sequential_input() will result in some data being skipped or repeated. Please ensure your batch size stays constant throughout training.
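
The output above reports several CUDA 11 libraries that could not be loaded and then "Skipping registering GPU devices", which suggests TensorFlow did not register the Tesla T4 for this run. Whether that explains the stall is not confirmed in this thread. A minimal sketch for pulling those two signals out of a saved log follows; the function name `summarize_tf_log` is hypothetical, and the sample string is an abridged copy of the messages above.

```python
import re

# Patterns taken from the TensorFlow messages quoted in this comment.
MISSING_LIB = re.compile(r"Could not load dynamic library '([^']+)'")
GPU_SKIPPED = "Skipping registering GPU devices"


def summarize_tf_log(log_text):
    """Return (sorted unique missing library names, whether GPU registration was skipped)."""
    missing = sorted(set(MISSING_LIB.findall(log_text)))
    return missing, GPU_SKIPPED in log_text


if __name__ == "__main__":
    # Abridged fragments of the log; the dlerror details are omitted here.
    sample = (
        "Could not load dynamic library 'libcudart.so.11.0'; "
        "Could not load dynamic library 'libcudnn.so.8'; "
        "Skipping registering GPU devices..."
    )
    missing, skipped = summarize_tf_log(sample)
    print("missing:", missing)
    print("GPU registration skipped:", skipped)
```

In a live notebook, `tf.config.list_physical_devices("GPU")` gives a more direct answer: an empty list means TensorFlow sees no GPU.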

@StellaAthena
Member

Where are you running this code? Are you using your own GPUs?

@srulikbd
Contributor

srulikbd commented Jan 9, 2021

I tried both GPU and TPU on Colab. I also tried an AWS AMI instance with a V100.
The log I quoted is from the GPU Colab runtime.
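
One more thing that may be worth checking on a single-GPU runtime: the params printed in the quoted log still carry 'mesh_shape': 'x:4,y:2' and 'num_cores': 8 alongside a single entry in 'gpu_ids'. Whether that mismatch contributes to the stall is an assumption, not something established in this thread. A small sketch to compare the two, where the helper name `mesh_size` is hypothetical and `math.prod` assumes Python 3.8+:

```python
from math import prod


def mesh_size(mesh_shape):
    """Multiply the dimension sizes in a string such as 'x:4,y:2'."""
    sizes = [int(dim.split(":")[1]) for dim in mesh_shape.split(",")]
    return prod(sizes)


if __name__ == "__main__":
    # Values copied from the params printed in the quoted log.
    mesh_shape = "x:4,y:2"
    gpu_ids = ["device:GPU:0"]
    print(f"mesh spans {mesh_size(mesh_shape)} devices, config lists {len(gpu_ids)}")
```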

@StellaAthena
Member

Sorry this slipped through the cracks. I assume you got everything working based on your PR?

@srulikbd
Contributor

Actually, it might still not work. I saw that you are focused on gpt-neox, so I switched over there :)


3 participants