
Training rushes through all epochs after error while decoding to model/output_dev #156

Open
M0rica opened this issue May 8, 2020 · 10 comments


@M0rica

M0rica commented May 8, 2020

First of all my specs:
GTX 1070 Ti, 8 GB VRAM
16 GB RAM
Ryzen 7 2700
training on an M.2 SSD

My issue is that the model fails to decode to the model/output_dev file while training (at different steps each time, most often after 5k or 10k steps), which causes it to rush through all remaining epochs with the same error instantly and then finish training. I've read about someone who had the same issue and solved it by decreasing the batch size; I tried that as well, but nothing helped:

decoding to output model/output_dev_5000
2020-05-08 18:17:53.781721: W tensorflow/core/common_runtime/bfc_allocator.cc:419] Allocator (GPU_0_bfc) ran out of memory trying to allocate 64.0KiB (rounded to 65536). Current allocation summary follows.
2020-05-08 18:17:53.786389: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (256): Total Chunks: 25, Chunks in use: 24. 6.3KiB allocated for chunks. 6.0KiB in use in bin. 118B client-requested in use in bin.
2020-05-08 18:17:53.791567: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (512): Total Chunks: 1, Chunks in use: 0. 768B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2020-05-08 18:17:53.796353: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (1024): Total Chunks: 3, Chunks in use: 3. 3.8KiB allocated for chunks. 3.8KiB in use in bin. 3.0KiB client-requested in use in bin.
2020-05-08 18:17:53.802074: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (2048): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2020-05-08 18:17:53.806886: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (4096): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2020-05-08 18:17:53.811956: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (8192): Total Chunks: 20, Chunks in use: 20. 160.0KiB allocated for chunks. 160.0KiB in use in bin. 160.0KiB client-requested in use in bin.
2020-05-08 18:17:53.816760: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (16384): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2020-05-08 18:17:53.821948: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (32768): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2020-05-08 18:17:53.827046: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (65536): Total Chunks: 7, Chunks in use: 7. 539.5KiB allocated for chunks. 539.5KiB in use in bin. 494.0KiB client-requested in use in bin.
2020-05-08 18:17:53.831938: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (131072): Total Chunks: 556, Chunks in use: 556. 86.89MiB allocated for chunks. 86.89MiB in use in bin. 59.73MiB client-requested in use in bin.
2020-05-08 18:17:53.837532: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (262144): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2020-05-08 18:17:53.842321: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (524288): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2020-05-08 18:17:53.847328: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (1048576): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2020-05-08 18:17:53.851795: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (2097152): Total Chunks: 9, Chunks in use: 9. 22.00MiB allocated for chunks. 22.00MiB in use in bin. 22.00MiB client-requested in use in bin.
2020-05-08 18:17:53.857609: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (4194304): Total Chunks: 1, Chunks in use: 1. 5.97MiB allocated for chunks. 5.97MiB in use in bin. 3.00MiB client-requested in use in bin.
2020-05-08 18:17:53.862976: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (8388608): Total Chunks: 576, Chunks in use: 576. 5.16GiB allocated for chunks. 5.16GiB in use in bin. 5.15GiB client-requested in use in bin.
2020-05-08 18:17:53.868138: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (16777216): Total Chunks: 1, Chunks in use: 1. 16.14MiB allocated for chunks. 16.14MiB in use in bin. 9.16MiB client-requested in use in bin.
2020-05-08 18:17:53.873656: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (33554432): Total Chunks: 1, Chunks in use: 1. 55.00MiB allocated for chunks. 55.00MiB in use in bin. 55.00MiB client-requested in use in bin.
2020-05-08 18:17:53.878707: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (67108864): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2020-05-08 18:17:53.883832: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (134217728): Total Chunks: 6, Chunks in use: 6. 885.64MiB allocated for chunks. 885.64MiB in use in bin. 842.45MiB client-requested in use in bin.
2020-05-08 18:17:53.889042: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (268435456): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2020-05-08 18:17:53.894022: I tensorflow/core/common_runtime/bfc_allocator.cc:885] Bin for 64.0KiB was 64.0KiB, Chunk State:
2020-05-08 18:17:53.897269: I tensorflow/core/common_runtime/bfc_allocator.cc:898] Next region of size 6667798272
2020-05-08 18:17:57.863482: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 00000008926C5600 next 18446744073709551615 of size 16920832
2020-05-08 18:17:57.866990: I tensorflow/core/common_runtime/bfc_allocator.cc:914] Summary of in-use Chunks by size:
2020-05-08 18:17:57.869504: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 24 Chunks of size 256 totalling 6.0KiB
2020-05-08 18:17:57.872427: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 3 Chunks of size 1280 totalling 3.8KiB
2020-05-08 18:17:57.874727: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 20 Chunks of size 8192 totalling 160.0KiB
2020-05-08 18:17:57.877055: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 5 Chunks of size 65536 totalling 320.0KiB
2020-05-08 18:17:57.879407: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 1 Chunks of size 112128 totalling 109.5KiB
2020-05-08 18:17:57.882387: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 1 Chunks of size 112640 totalling 110.0KiB
2020-05-08 18:17:57.884772: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 277 Chunks of size 131072 totalling 34.63MiB
2020-05-08 18:17:57.887408: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 1 Chunks of size 149504 totalling 146.0KiB
2020-05-08 18:17:57.890274: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 278 Chunks of size 196608 totalling 52.13MiB
2020-05-08 18:17:57.892768: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 5 Chunks of size 2097152 totalling 10.00MiB
2020-05-08 18:17:57.895113: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 4 Chunks of size 3145728 totalling 12.00MiB
2020-05-08 18:17:57.898162: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 1 Chunks of size 6257152 totalling 5.97MiB
2020-05-08 18:17:57.900500: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 14 Chunks of size 8388608 totalling 112.00MiB
2020-05-08 18:17:57.903097: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 553 Chunks of size 9600512 totalling 4.94GiB
2020-05-08 18:17:57.905463: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 1 Chunks of size 12258048 totalling 11.69MiB
2020-05-08 18:17:57.908372: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 1 Chunks of size 12467456 totalling 11.89MiB
2020-05-08 18:17:57.910728: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 6 Chunks of size 12582912 totalling 72.00MiB
2020-05-08 18:17:57.913084: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 1 Chunks of size 16633088 totalling 15.86MiB
2020-05-08 18:17:57.915528: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 1 Chunks of size 16920832 totalling 16.14MiB
2020-05-08 18:17:57.918616: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 1 Chunks of size 57671680 totalling 55.00MiB
2020-05-08 18:17:57.920980: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 5 Chunks of size 153606144 totalling 732.45MiB
2020-05-08 18:17:57.923380: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 1 Chunks of size 160628736 totalling 153.19MiB
2020-05-08 18:17:57.926221: I tensorflow/core/common_runtime/bfc_allocator.cc:921] Sum Total of in-use chunks: 6.21GiB
2020-05-08 18:17:57.928590: I tensorflow/core/common_runtime/bfc_allocator.cc:923] total_region_allocated_bytes_: 6667798272 memory_limit_: 6667798446 available bytes: 174 curr_region_allocation_bytes_: 13335597056
2020-05-08 18:17:57.932841: I tensorflow/core/common_runtime/bfc_allocator.cc:929] Stats:
Limit: 6667798446
InUse: 6667797248
MaxInUse: 6667798016
NumAllocs: 666955
MaxAllocSize: 489619712

2020-05-08 18:17:57.938305: W tensorflow/core/common_runtime/bfc_allocator.cc:424] ****************************************************************************************************
Exception in thread Thread-5:
Traceback (most recent call last):
File "C:\Users\Der Gerät\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\client\session.py", line 1365, in _do_call
return fn(*args)
File "C:\Users\Der Gerät\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\client\session.py", line 1350, in _run_fn
target_list, run_metadata)
File "C:\Users\Der Gerät\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\client\session.py", line 1443, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
(0) Internal: Dst tensor is not initialized.
[[{{node dynamic_seq2seq/decoder/decoder/while/BasicDecoderStep/cond/embedding_lookup}}]]
(1) Internal: Dst tensor is not initialized.
[[{{node dynamic_seq2seq/decoder/decoder/while/BasicDecoderStep/cond/embedding_lookup}}]]
[[dynamic_seq2seq/decoder/decoder/while/BasicDecoderStep/All/_221]]
0 successful operations.
0 derived errors ignored.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "C:\Users\Der Gerät\AppData\Local\Programs\Python\Python37\lib\threading.py", line 926, in _bootstrap_inner
self.run()
File "C:\Users\Der Gerät\AppData\Local\Programs\Python\Python37\lib\threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
File "train.py", line 88, in nmt_train
tf.app.run(main=nmt.main, argv=[os.getcwd() + '\nmt\nmt\nmt.py'] + unparsed)
File "C:\Users\Der Gerät\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\platform\app.py", line 40, in run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "C:\Users\Der Gerät\AppData\Local\Programs\Python\Python37\lib\site-packages\absl\app.py", line 299, in run
_run_main(main, args)
File "C:\Users\Der Gerät\AppData\Local\Programs\Python\Python37\lib\site-packages\absl\app.py", line 250, in _run_main
sys.exit(main(argv))
File "C:\Users\Der Gerät\Desktop\nmt-chatbot (Small Dataset)/nmt\nmt\nmt.py", line 701, in main
run_main(FLAGS, default_hparams, train_fn, inference_fn)
File "C:\Users\Der Gerät\Desktop\nmt-chatbot (Small Dataset)/nmt\nmt\nmt.py", line 694, in run_main
train_fn(hparams, target_session=target_session, summary_callback=summary_callback)
File "C:\Users\Der Gerät\Desktop\nmt-chatbot (Small Dataset)/nmt\nmt\train.py", line 518, in train
sample_tgt_data, avg_ckpts, summary_callback=summary_callback)
File "C:\Users\Der Gerät\Desktop\nmt-chatbot (Small Dataset)/nmt\nmt\train.py", line 351, in run_full_eval
summary_writer, avg_ckpts, summary_callback=summary_callback)
File "C:\Users\Der Gerät\Desktop\nmt-chatbot (Small Dataset)/nmt\nmt\train.py", line 288, in run_internal_and_external_eval
summary_callback=summary_callback)
File "C:\Users\Der Gerät\Desktop\nmt-chatbot (Small Dataset)/nmt\nmt\train.py", line 177, in run_external_eval
avg_ckpts=avg_ckpts)
File "C:\Users\Der Gerät\Desktop\nmt-chatbot (Small Dataset)/nmt\nmt\train.py", line 740, in _external_eval
infer_mode=hparams.infer_mode)
File "C:\Users\Der Gerät\Desktop\nmt-chatbot (Small Dataset)/nmt\nmt\utils\nmt_utils.py", line 60, in decode_and_evaluate
nmt_outputs, _ = model.decode(sess)
File "C:\Users\Der Gerät\Desktop\nmt-chatbot (Small Dataset)/nmt\nmt\model.py", line 692, in decode
output_tuple = self.infer(sess)
File "C:\Users\Der Gerät\Desktop\nmt-chatbot (Small Dataset)/nmt\nmt\model.py", line 680, in infer
return sess.run(output_tuple)
File "C:\Users\Der Gerät\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\client\session.py", line 956, in run
run_metadata_ptr)
File "C:\Users\Der Gerät\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\client\session.py", line 1180, in _run
feed_dict_tensor, options, run_metadata)
File "C:\Users\Der Gerät\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\client\session.py", line 1359, in _do_run
run_metadata)
File "C:\Users\Der Gerät\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\client\session.py", line 1384, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
(0) Internal: Dst tensor is not initialized.
[[node dynamic_seq2seq/decoder/decoder/while/BasicDecoderStep/cond/embedding_lookup (defined at C:\Users\Der Gerät\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\framework\ops.py:1748) ]]
(1) Internal: Dst tensor is not initialized.
[[node dynamic_seq2seq/decoder/decoder/while/BasicDecoderStep/cond/embedding_lookup (defined at C:\Users\Der Gerät\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\framework\ops.py:1748) ]]
[[dynamic_seq2seq/decoder/decoder/while/BasicDecoderStep/All/_221]]
0 successful operations.
0 derived errors ignored.

Original stack trace for 'dynamic_seq2seq/decoder/decoder/while/BasicDecoderStep/cond/embedding_lookup':
File "C:\Users\Der Gerät\AppData\Local\Programs\Python\Python37\lib\threading.py", line 890, in _bootstrap
self._bootstrap_inner()
File "C:\Users\Der Gerät\AppData\Local\Programs\Python\Python37\lib\threading.py", line 926, in _bootstrap_inner
self.run()
File "C:\Users\Der Gerät\AppData\Local\Programs\Python\Python37\lib\threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
File "train.py", line 88, in nmt_train
tf.app.run(main=nmt.main, argv=[os.getcwd() + '\nmt\nmt\nmt.py'] + unparsed)
File "C:\Users\Der Gerät\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\platform\app.py", line 40, in run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "C:\Users\Der Gerät\AppData\Local\Programs\Python\Python37\lib\site-packages\absl\app.py", line 299, in run
_run_main(main, args)
File "C:\Users\Der Gerät\AppData\Local\Programs\Python\Python37\lib\site-packages\absl\app.py", line 250, in _run_main
sys.exit(main(argv))
File "C:\Users\Der Gerät\Desktop\nmt-chatbot (Small Dataset)/nmt\nmt\nmt.py", line 701, in main
run_main(FLAGS, default_hparams, train_fn, inference_fn)
File "C:\Users\Der Gerät\Desktop\nmt-chatbot (Small Dataset)/nmt\nmt\nmt.py", line 694, in run_main
train_fn(hparams, target_session=target_session, summary_callback=summary_callback)
File "C:\Users\Der Gerät\Desktop\nmt-chatbot (Small Dataset)/nmt\nmt\train.py", line 477, in train
infer_model = model_helper.create_infer_model(model_creator, hparams, scope)
File "C:\Users\Der Gerät\Desktop\nmt-chatbot (Small Dataset)/nmt\nmt\model_helper.py", line 228, in create_infer_model
extra_args=extra_args)
File "C:\Users\Der Gerät\Desktop\nmt-chatbot (Small Dataset)/nmt\nmt\attention_model.py", line 64, in init
extra_args=extra_args)
File "C:\Users\Der Gerät\Desktop\nmt-chatbot (Small Dataset)/nmt\nmt\model.py", line 95, in init
res = self.build_graph(hparams, scope=scope)
File "C:\Users\Der Gerät\Desktop\nmt-chatbot (Small Dataset)/nmt\nmt\model.py", line 393, in build_graph
self._build_decoder(self.encoder_outputs, encoder_state, hparams))
File "C:\Users\Der Gerät\Desktop\nmt-chatbot (Small Dataset)/nmt\nmt\model.py", line 587, in _build_decoder
scope=decoder_scope)
File "C:\Users\Der Gerät\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\contrib\seq2seq\python\ops\decoder.py", line 469, in dynamic_decode
swap_memory=swap_memory)
File "C:\Users\Der Gerät\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\ops\control_flow_ops.py", line 2753, in while_loop
return_same_structure)
File "C:\Users\Der Gerät\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\ops\control_flow_ops.py", line 2245, in BuildLoop
pred, body, original_loop_vars, loop_vars, shape_invariants)
File "C:\Users\Der Gerät\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\ops\control_flow_ops.py", line 2170, in _BuildLoop
body_result = body(*packed_vars_for_body)
File "C:\Users\Der Gerät\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\ops\control_flow_ops.py", line 2705, in
body = lambda i, lv: (i + 1, orig_body(*lv))
File "C:\Users\Der Gerät\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\contrib\seq2seq\python\ops\decoder.py", line 412, in body
decoder_finished) = decoder.step(time, inputs, state)
File "C:\Users\Der Gerät\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\contrib\seq2seq\python\ops\basic_decoder.py", line 145, in step
sample_ids=sample_ids)
File "C:\Users\Der Gerät\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\contrib\seq2seq\python\ops\helper.py", line 627, in next_inputs
lambda: self._embedding_fn(sample_ids))
File "C:\Users\Der Gerät\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\util\deprecation.py", line 507, in new_func
return func(*args, **kwargs)
File "C:\Users\Der Gerät\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\ops\control_flow_ops.py", line 1235, in cond
orig_res_f, res_f = context_f.BuildCondBranch(false_fn)
File "C:\Users\Der Gerät\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\ops\control_flow_ops.py", line 1061, in BuildCondBranch
original_result = fn()
File "C:\Users\Der Gerät\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\contrib\seq2seq\python\ops\helper.py", line 627, in
lambda: self._embedding_fn(sample_ids))
File "C:\Users\Der Gerät\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\contrib\seq2seq\python\ops\helper.py", line 579, in
lambda ids: embedding_ops.embedding_lookup(embedding, ids))
File "C:\Users\Der Gerät\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\ops\embedding_ops.py", line 317, in embedding_lookup
transform_fn=None)
File "C:\Users\Der Gerät\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\ops\embedding_ops.py", line 135, in _embedding_lookup_and_transform
array_ops.gather(params[0], ids, name=name), ids, max_norm)
File "C:\Users\Der Gerät\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\util\dispatch.py", line 180, in wrapper
return target(*args, **kwargs)
File "C:\Users\Der Gerät\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\ops\array_ops.py", line 3956, in gather
params, indices, axis, name=name)
File "C:\Users\Der Gerät\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\ops\gen_array_ops.py", line 4082, in gather_v2
batch_dims=batch_dims, name=name)
File "C:\Users\Der Gerät\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\framework\op_def_library.py", line 794, in _apply_op_helper
op_def=op_def)
File "C:\Users\Der Gerät\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\util\deprecation.py", line 507, in new_func
return func(*args, **kwargs)
File "C:\Users\Der Gerät\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\framework\ops.py", line 3357, in create_op
attrs, op_def, compute_device)
File "C:\Users\Der Gerät\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\framework\ops.py", line 3426, in _create_op_internal
op_def=op_def)
File "C:\Users\Der Gerät\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\framework\ops.py", line 1748, in init
self._traceback = tf_stack.extract_stack()

I have a vocab size of 75k, and I am trying to train a model with ~10.7 million pairs. I trained a smaller model with around 800k pairs before with no issues. The person from the other issue report says that it's a memory issue, that the fault lies with too large a batch size, and that a batch size of 16 worked for him; but even a batch size of 4 causes this error (at step 40k) in my case, which is interesting because he has the same graphics card. I also tried decreasing the vocab size to 15k, but got the same error. Can someone help me?
Thanks.

@M0rica M0rica changed the title Training rushes through all epochs after error while decoding to output_dev Training rushes through all epochs after error while decoding to model/output_dev May 8, 2020
@Nathan-Chell

When you say 'I also tried to decrease the vocab-size to 15k', did you re-run prepare_data.py? If not, then you need to. The error you are experiencing is indeed due to your GPU running out of memory. Reducing the batch size will fix this. What is the size of your model?
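
A quick sanity check for whether a changed vocab size actually took effect is to count the lines in the regenerated vocab file. A minimal sketch, using a hypothetical path for wherever prepare_data.py writes the vocabulary in your checkout:

VOCAB_FILE = "data/vocab.from"  # hypothetical path, adjust to your setup

with open(VOCAB_FILE, encoding="utf-8") as f:
    vocab_size = sum(1 for _ in f)

# if this still prints the old size, prepare_data.py was not re-run
print("vocabulary entries:", vocab_size)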

@M0rica

M0rica commented May 8, 2020

I always ran prepare_data.py after changing any settings, and as I said, the error occurred even with a batch size of 4, just a few external evaluations later than with higher ones.
Edit: I also run a program called HWInfo64 while training, which monitors the usage of all hardware, and it showed that the GPU memory never went over 92%, so in theory there should be enough memory, but I don't know how accurate the program is...

@Nathan-Chell

HWInfo is very good, and quite accurate too. What is the size of your model, and what GPU are you training it on?

@M0rica

M0rica commented May 8, 2020

I train on a GTX 1070 Ti, and what exactly do you mean by the size of the model? Everything is default except vocab-size=75,000, and I have 10.7 million training pairs (3.8 GB of text files in total).

@Nathan-Chell

I mean the number of neurons and layers you have in your network. You should easily be able to fit the model you have described, with a batch size of 4, in 8 GB of VRAM. Are you updating the settings in settings.py, and do you have "override existing settings" set to true?

@M0rica

M0rica commented May 9, 2020

I have the standard model size of num_layers=2 with num_units=512 and override_loaded_hparams=True, and I set the settings in settings.py.
These are all the hparams in settings.py:
"attention": "scaled_luong",
"num_train_steps": 10000000,
"num_layers": 2,
#"num_encoder_layers": 2,
#"num_decoder_layers": 2,
"num_units": 512,
"batch_size": 4,
"override_loaded_hparams": True,
#"decay_scheme": "luong234"
#"residual": True,
"optimizer": "adam",
"encoder_type": "bi",
"learning_rate": 0.001,
"beam_width": 20,
"length_penalty_weight": 1.0,
"num_translations_per_input": 20,
#"num_keep_ckpts": 5,

@M0rica

M0rica commented May 24, 2020

Small update:
So I tried lots of different settings over the last few days, and literally NOTHING seems to work: the smallest model I tried training had 2 layers, 256 units, a vocab size of 15,000 and a batch size of 1, but even this very low configuration caused the same error. Interestingly, HWInfo says that no matter what batch size, vocab size, etc. I use, the GPU memory usage stays the same at around 7.2 GB every single time. By the way, I'm using Python 3.7.6 and tensorflow-gpu 1.15.2 (I also tried 1.14, but nothing changed, so I don't think TensorFlow is the problem) with CUDA 10 and cuDNN 7.6.4.
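
Side note on the flat ~7.2 GB reading: TensorFlow 1.x reserves most of the GPU's memory up front by default, so external monitors show roughly the same usage regardless of batch or vocab size. A minimal sketch of switching to on-demand allocation, assuming you can pass a session config wherever the project creates its tf.Session; this is not part of nmt-chatbot's settings and may or may not help with the eval crash:

import tensorflow as tf

# allocate GPU memory on demand instead of reserving almost all of it at startup,
# so tools like HWInfo report the memory actually in use
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
sess = tf.Session(config=config)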

@Nathan-Chell

How much system RAM do you have?

@M0rica

M0rica commented May 25, 2020

I have 16 GB of DDR4 3000 MHz RAM; the training uses about 4 GB, and total RAM usage is at about 70% during training.

@M0rica

M0rica commented Jun 4, 2020

So it seems like I fixed the issue, but the solution is not perfect:
I just added 'steps_per_external_eval': 10000000000 to the hparams in settings.py, which prevents external evals from running. This way it works with a batch size of 16 and a model with 2 layers and 512 units, though I think the BLEU score won't be updated, which is not a big problem. Once I stopped the training, it ran an external eval on the next startup, which again caused a crash. To prevent this, I commented out "run_full_eval" under "#First evaluation" (l. 514) in the train.py inside the nmt folder, and it works!
