Training crashes at end #4

Closed
devinbostIL opened this issue Jun 7, 2018 · 7 comments

@devinbostIL

I set up a conda environment with PyTorch 0.4 and installed this fork of OpenNMT-py, as I mentioned in the original issue I posted: OpenNMT#743

I then ran:

python preprocess.py -train_src C:\src\torchevere-offensive-classifier\training\character\train_src.txt -train_tgt C:\src\torchevere-offensive-classifier\training\character\train_dst.txt -valid_src C:\src\torchevere-offensive-classifier\training\character\val_src.txt -valid_tgt C:\src\torchevere-offensive-classifier\training\character\val_dst.txt -save_data data/character/tc-offense-classifier-character_v3 -src_seq_length 5000 -tgt_seq_length 5000

python train.py -data data/character/tc-offense-classifier-character_v3 -save_model tc-offense-classifier-character_v3 -gpuid 0 -layers 3 -learning_rate_decay 0.99 -train_steps 10000 -rnn_size 500

Everything ran smoothly until it crashed at the end.
Here's the output:

(pyTorchOffensive) C:\src\pyopennmt\ubiqus\OpenNMT-py>python train.py -data data/character/tc-offense-classifier-character_v3 -save_model tc-offense-classifier-character_v3 -gpuid 0 -layers 3 -learning_rate_decay 0.99 -train_steps 10000 -rnn_size 500
		Loading train dataset from data/character\tc-offense-classifier-character_v3.train.1.pt, number of examples: 3384
		 * vocabulary size. source = 165; target = 6
		Building model...
		Intializing model parameters.
		NMTModel(
		  (encoder): RNNEncoder(
		    (embeddings): Embeddings(
		      (make_embedding): Sequential(
		        (emb_luts): Elementwise(
		          (0): Embedding(165, 500, padding_idx=1)
		        )
		      )
		    )
		    (rnn): LSTM(500, 500, num_layers=3, dropout=0.3)
		  )
		  (decoder): InputFeedRNNDecoder(
		    (embeddings): Embeddings(
		      (make_embedding): Sequential(
		        (emb_luts): Elementwise(
		          (0): Embedding(6, 500, padding_idx=1)
		        )
		      )
		    )
		    (dropout): Dropout(p=0.3)
		    (rnn): StackedLSTM(
		      (dropout): Dropout(p=0.3)
		      (layers): ModuleList(
		        (0): LSTMCell(1000, 500)
		        (1): LSTMCell(500, 500)
		        (2): LSTMCell(500, 500)
		      )
		    )
		    (attn): GlobalAttention(
		      (linear_in): Linear(in_features=500, out_features=500, bias=False)
		      (linear_out): Linear(in_features=1000, out_features=500, bias=False)
		      (softmax): Softmax()
		      (tanh): Tanh()
		    )
		  )
		  (generator): Sequential(
		    (0): Linear(in_features=500, out_features=6, bias=True)
		    (1): LogSoftmax()
		  )
		)
		* number of parameters: 13862506
		encoder:  6094500
		decoder:  7768006
		Making optimizer for training.
		
		Start training...
		Loading train dataset from data/character\tc-offense-classifier-character_v3.train.1.pt, number of examples: 3384
		Step 50, 10000; acc:  27.34; ppl:  16.07; xent:   2.78; lr: 1.00000; 14099 / 3200 tok/s;      4 sec
		GPU 0: for information we completed an epoch at step 54
		Loading train dataset from data/character\tc-offense-classifier-character_v3.train.1.pt, number of examples: 3384
		Step 100, 10000; acc:  74.22; ppl:   1.45; xent:   0.37; lr: 1.00000; 26876 / 2614 tok/s;      6 sec
		GPU 0: for information we completed an epoch at step 107
		
		. . . 
		
		Loading train dataset from data/character\tc-offense-classifier-character_v3.train.1.pt, number of examples: 3384
		Step 9950, 10000; acc: 100.00; ppl:   1.00; xent:   0.00; lr: 1.00000; 16583 / 2462 tok/s;    616 sec
		GPU 0: for information we completed an epoch at step 9965
		Loading train dataset from data/character\tc-offense-classifier-character_v3.train.1.pt, number of examples: 3384
		Step 10000, 10000; acc: 100.00; ppl:   1.00; xent:   0.00; lr: 1.00000; 19345 / 2416 tok/s;    619 sec
		Loading valid dataset from data/character\tc-offense-classifier-character_v3.valid.1.pt, number of examples: 376
		Traceback (most recent call last):
		  File "train.py", line 41, in <module>
		    main(opt)
		  File "train.py", line 28, in main
		    single_main(opt)
		  File "C:\src\pyopennmt\ubiqus\OpenNMT-py\train_single.py", line 120, in main
		    opt.valid_steps)
		  File "C:\src\pyopennmt\ubiqus\OpenNMT-py\onmt\trainer.py", line 176, in train
		    valid_stats = self.validate(valid_iter)
		  File "C:\src\pyopennmt\ubiqus\OpenNMT-py\onmt\trainer.py", line 208, in validate
		    for batch in valid_iter:
		  File "C:\src\pyopennmt\ubiqus\OpenNMT-py\onmt\inputters\inputter.py", line 423, in __iter__
		    for batch in self.cur_iter:
		  File "C:\Anaconda3\envs\pyTorchOffensive\lib\site-packages\torchtext\data\iterator.py", line 151, in __iter__
		    self.train)
		  File "C:\Anaconda3\envs\pyTorchOffensive\lib\site-packages\torchtext\data\batch.py", line 27, in __init__
		    setattr(self, name, field.process(batch, device=device, train=train))
		  File "C:\Anaconda3\envs\pyTorchOffensive\lib\site-packages\torchtext\data\field.py", line 188, in process
		    tensor = self.numericalize(padded, device=device, train=train)
		  File "C:\Anaconda3\envs\pyTorchOffensive\lib\site-packages\torchtext\data\field.py", line 287, in numericalize
		    arr = [[self.vocab.stoi[x] for x in ex] for ex in arr]
		  File "C:\Anaconda3\envs\pyTorchOffensive\lib\site-packages\torchtext\data\field.py", line 287, in <listcomp>
		    arr = [[self.vocab.stoi[x] for x in ex] for ex in arr]
		  File "C:\Anaconda3\envs\pyTorchOffensive\lib\site-packages\torchtext\data\field.py", line 287, in <listcomp>
		    arr = [[self.vocab.stoi[x] for x in ex] for ex in arr]
		KeyError: '🏻'
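
For readers following the traceback: the last frames are a plain dictionary lookup, `self.vocab.stoi[x]`, on a character that is not a key of the vocabulary. The sketch below reproduces that behaviour with a toy vocabulary. The token list, `UNK_INDEX`, and the `numericalize` helper are illustrative assumptions, not torchtext's code. It only contrasts a plain `dict` with a `defaultdict` that falls back to an unknown-token index.

```python
from collections import defaultdict

# Hypothetical toy vocabulary; index 0 stands in for the unknown-token slot.
UNK_INDEX = 0
tokens = ["<unk>", "<blank>", "a", "b", "c"]

# A plain dict raises KeyError for any token it has never seen.
plain_stoi = {tok: idx for idx, tok in enumerate(tokens)}

# A defaultdict falls back to the unknown-token index instead.
fallback_stoi = defaultdict(lambda: UNK_INDEX, plain_stoi)


def numericalize(batch, stoi):
    """Map every token of every example to its index (same shape as the failing line)."""
    return [[stoi[tok] for tok in example] for example in batch]


# The last token is the character reported in the KeyError above.
batch = [["a", "b"], ["c", "🏻"]]

try:
    numericalize(batch, plain_stoi)
    plain_result = "no error"
except KeyError as err:
    plain_result = f"KeyError: {err}"

fallback_result = numericalize(batch, fallback_stoi)
print(plain_result)
print(fallback_result)
```

With the plain mapping the lookup fails on the unseen character, while the fallback mapping assigns it the unknown-token index.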

Luckily, a model file had been saved halfway through the specified number of training steps, and I was able to use it with translate.py. Thankfully, it solved the problem of blanks in the output file that I described in the original issue (OpenNMT#743).

However, it's still an issue to have the training crash at the end of the process, so I'm reporting this bug.
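
A quick way to check whether the validation file contains tokens that never occur in the training file is sketched below. It assumes whitespace-separated tokens and a vocabulary built only from the training source. The function name `unseen_tokens` and the toy lines are hypothetical stand-ins for the contents of train_src.txt and val_src.txt.

```python
from collections import Counter


def unseen_tokens(train_lines, valid_lines):
    """Count validation tokens that never occur in the training lines."""
    train_vocab = {tok for line in train_lines for tok in line.split()}
    return Counter(
        tok
        for line in valid_lines
        for tok in line.split()
        if tok not in train_vocab
    )


# Toy stand-ins for the real files (character-level, space-separated).
train = ["h e l l o", "w o r l d"]
valid = ["h e l p", "w o r d s"]

missing = unseen_tokens(train, valid)
print(missing.most_common())
```

Any token this reports would be missing from a vocabulary built from the training data alone, so it is a candidate for the failing lookup.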

@devinbostIL
Author

Update: I spoke a little too soon on the blanks issue. There are still a very small number of blanks in the output file. However, it's a major improvement. Instead of nearly half of the outputs being blank, out of 1374971 input lines, only 211 resulted in blank values.
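
For anyone who wants to repeat this count on their own output file, here is a minimal sketch. The helper name `blank_stats` and the sample labels are made up for illustration. In practice the lines would come from the file written by translate.py.

```python
def blank_stats(lines):
    """Return (blank_count, total_count, blank_fraction) for an iterable of lines."""
    total = 0
    blank = 0
    for line in lines:
        total += 1
        if not line.strip():  # empty or whitespace-only counts as blank
            blank += 1
    fraction = blank / total if total else 0.0
    return blank, total, fraction


# Toy stand-in for the output file; the labels are hypothetical.
sample = ["offensive\n", "\n", "clean\n", "   \n"]
blank, total, fraction = blank_stats(sample)
print(f"{blank} of {total} lines blank ({fraction:.2%})")

# The figures quoted above: 211 blanks out of 1374971 lines.
reported_fraction = 211 / 1374971
print(f"{reported_fraction:.4%}")
```

With the numbers quoted above, the blank rate works out to roughly 0.015% of the input lines.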

@vince62s

vince62s commented Jun 8, 2018

Yes, we faced a similar issue; see pytorch/text#337.

I was able to train a big model without any issue, but evidently this happens in some situations.
We'll try to look into it.

@vince62s

vince62s commented Jun 8, 2018

@devinbostIL, can you pull and try again?
Thanks.

@hughperkins

Might be working now :)

$ python train.py -data data/demo -save_model demo-model -gpuid 0 -train_steps 100
Start training...
Loading train dataset from data/demo.train.1.pt, number of examples: 10000
Step 50,   100; acc:   6.27; ppl: 10845.86; xent:   9.29; lr: 1.00000; 4726 / 5988 tok/s;     13 sec
Step 100,   100; acc:   6.66; ppl: 271385.80; xent:  12.51; lr: 1.00000; 6241 / 6769 tok/s;     26 sec
GPU 0: for information we completed an epoch at step 101
Loading train dataset from data/demo.train.1.pt, number of examples: 10000

(Before, I got the error below, but that was when training over 10,000 steps, which I haven't tried yet.)

  File "/toknas/hugh/condap4/lib/python3.6/site-packages/torchtext/data/field.py", line 287, in <listcomp>
    arr = [[self.vocab.stoi[x] for x in ex] for ex in arr]
KeyError: 'notebooks'

@hughperkins

Update: saving now seems to work without crashing for me. I'm not sure whether I was encountering the same issue as @devinbostIL, though.

GPU 0: for information we completed an epoch at step 9892
Loading train dataset from data/demo.train.1.pt, number of examples: 10000
Step 9900, 100000; acc:  59.22; ppl:   5.49; xent:   1.70; lr: 1.00000; 4926 / 4436 tok/s;   2637 sec
Step 9950, 100000; acc:  64.17; ppl:   4.26; xent:   1.45; lr: 1.00000; 7627 / 6924 tok/s;   2650 sec
Step 10000, 100000; acc:  64.87; ppl:   4.44; xent:   1.49; lr: 1.00000; 4288 / 4225 tok/s;   2663 sec
Loading valid dataset from data/demo.valid.1.pt, number of examples: 2819
Validation perplexity: 8379.74
Validation accuracy: 16.3435
Saving checkpoint demo-model_step_10000.pt
GPU 0: for information we completed an epoch at step 10049
Loading train dataset from data/demo.train.1.pt, number of examples: 10000
Step 10050, 100000; acc:  52.43; ppl:   8.35; xent:   2.12; lr: 1.00000; 5995 / 5589 tok/s;   2683 sec
Step 10100, 100000; acc:  93.80; ppl:   1.41; xent:   0.35; lr: 1.00000; 2254 / 3393 tok/s;   2695 sec
Step 10150, 100000; acc:  52.90; ppl:   8.33; xent:   2.12; lr: 1.00000; 5719 / 5469 tok/s;   2708 sec
Step 10200, 100000; acc:  73.04; ppl:   2.95; xent:   1.08; lr: 1.00000; 5330 / 6146 tok/s;   2722 sec
GPU 0: for information we completed an epoch at step 10206
Loading train dataset from data/demo.train.1.pt, number of examples: 10000

@vince62s

Closing this; reopen it if needed.

@devinbostIL
Author

It's running to the end without crashing now.
