Sampling chooses vocab index that does not exist with certain random seeds #866

Closed
bricksdont opened this issue Sep 2, 2020 · 21 comments

@bricksdont
Contributor

bricksdont commented Sep 2, 2020

Running into the following error while sampling with certain seeds:

Traceback (most recent call last):
  File "/usr/lib/python3.5/runpy.py", line 184, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.5/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/net/cephfs/scratch/mathmu/map-volatility/venvs/sockeye3/lib/python3.5/site-packages/sockeye/translate.py", line 269, in <module>
    main()
  File "/net/cephfs/scratch/mathmu/map-volatility/venvs/sockeye3/lib/python3.5/site-packages/sockeye/translate.py", line 46, in main
    run_translate(args)
  File "/net/cephfs/scratch/mathmu/map-volatility/venvs/sockeye3/lib/python3.5/site-packages/sockeye/translate.py", line 155, in run_translate
    input_is_json=args.json_input)
  File "/net/cephfs/scratch/mathmu/map-volatility/venvs/sockeye3/lib/python3.5/site-packages/sockeye/translate.py", line 237, in read_and_translate
    chunk_time = translate(output_handler, chunk, translator)
  File "/net/cephfs/scratch/mathmu/map-volatility/venvs/sockeye3/lib/python3.5/site-packages/sockeye/translate.py", line 260, in translate
    trans_outputs = translator.translate(trans_inputs)
  File "/net/cephfs/scratch/mathmu/map-volatility/venvs/sockeye3/lib/python3.5/site-packages/sockeye/inference.py", line 861, in translate
    results.append(self._make_result(trans_input, translation))
  File "/net/cephfs/scratch/mathmu/map-volatility/venvs/sockeye3/lib/python3.5/site-packages/sockeye/inference.py", line 963, in _make_result
    target_tokens = [self.vocab_target_inv[target_id] for target_id in target_ids]
  File "/net/cephfs/scratch/mathmu/map-volatility/venvs/sockeye3/lib/python3.5/site-packages/sockeye/inference.py", line 963, in <listcomp>
    target_tokens = [self.vocab_target_inv[target_id] for target_id in target_ids]
KeyError: 7525

I am calling Sockeye with a script such as

OMP_NUM_THREADS=1 python -m sockeye.translate \
                -i $data_sub/$corpus.pieces.src \
                -o $samples_sub_sub/$corpus.pieces.$seed.trg \
                -m $model_path \
                --sample \
                --seed $seed \
                --length-penalty-alpha 1.0 \
                --device-ids 0 \
                --batch-size 64 \
                --disable-device-locking

Sockeye and MXNet versions:

[2020-08-25:17:03:03:INFO:sockeye.utils:log_sockeye_version] Sockeye version 2.1.17, commit 92a020a25cbe75935c700ce2f29b286b31a87189, path /net/cephfs/scratch/mathmu/map-volatility/venvs/sockeye3/lib/python3.5/site-packages/sockeye/__init__.py
[2020-08-25:17:03:03:INFO:sockeye.utils:log_mxnet_version] MXNet version 1.6.0, path /net/cephfs/scratch/mathmu/map-volatility/venvs/sockeye3/lib/python3.5/site-packages/mxnet/__init__.py

Details that may be relevant:

  • The vocabulary does not have this index:

[INFO:sockeye.vocab] Vocabulary (7525 words) loaded from "/net/cephfs/scratch/mathmu/map-volatility/models/bel-eng/baseline/vocab.src.0.json"
[INFO:sockeye.vocab] Vocabulary (7525 words) loaded from "/net/cephfs/scratch/mathmu/map-volatility/models/bel-eng/baseline/vocab.trg.0.json"

I suspect that the sampling procedure somehow assumes 1-based indexing, whereas the vocabulary is 0-indexed. This would mean that there is a small chance that max_vocab_id+1 is picked as the next token.

Looking at the inference code, I am not sure yet why this happens.

@fhieber
Contributor

fhieber commented Sep 2, 2020

Interesting, thanks for opening this issue. Can you disable hybridization and check for the presence of 7525 after this line? That is, see whether this value is sampled there directly?

@bricksdont
Contributor Author

@fhieber I assume you mean something like this?

# Sample from the target distributions over words, then get the corresponding values from the cumulative scores
best_word_indices = F.random.multinomial(target_dists, get_prob=False)

# warn if 7525 was sampled here
if F.sum(best_word_indices == 7525) > 0:
    print("7525 actually in best_word_indices!")

Yes, 7525 is actually sampled there.

@fhieber
Contributor

fhieber commented Sep 2, 2020

Very strange, what is the input shape of target_dists, specifically the size of the last dimension?

@bricksdont
Contributor Author

bricksdont commented Sep 2, 2020

Shape of target_dists:
(320, 7525)

Which I think is correct and should mean that index 7525 cannot be sampled (valid indices being 0 to 7524).

By the way, I have not found a minimal reproducible example for this yet, since this only happens for certain models and only for 1 random seed out of 30. Also, this index appears only once in best_word_indices.

@fhieber
Contributor

fhieber commented Sep 2, 2020

That points very much to a bug in the implementation of random.multinomial then. Do you want to submit an issue to MXNet?

You could consider trying MXNet 1.7 to see whether the operator behaves differently there: https://pypi.org/project/mxnet-cu102/

@bricksdont
Contributor Author

bricksdont commented Sep 2, 2020

Sure, I'll try to isolate the problem a bit more (i.e. strip away Sockeye) and then open an MXNet issue.

@bricksdont
Contributor Author

I am at a loss (not a benign cross-entropy one!) with this bug:

  • It only happens on GPU
  • It only happens with exactly this input file. If I split the file up into several parts, the error is not thrown for any chunk of the input, even though taken together, the lines are the same.
  • If I save the target_dists that causes the error, then load it in isolation and call random.multinomial, the error never happens, no matter the device context or random seed.

Here is a gist that I can run on a clean instance and that throws the error:

https://gist.github.com/bricksdont/58dfa0964201c83961a30f23406baa5d

I would be glad to know if someone can reproduce this error at all.

@fhieber
Contributor

fhieber commented Sep 2, 2020

Is there a way to store the random generator state of MXNet somehow? I guess you cannot reproduce it in isolation because it depends on the previous inputs. If you ran multinomial on smaller random input data, say a million times, always checking that the sampled index is not out of bounds, would you see the error?
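
A minimal version of such a stress test might look like this (a sketch added for illustration, not part of the original comment; it assumes MXNet 1.x with numpy available and calls mx.nd.random.multinomial directly):

import mxnet as mx
import numpy as np

vocab_size = 8
ctx = mx.cpu()  # switch to mx.gpu(0) to exercise the GPU kernel

for i in range(1000000):
    # build a small, properly normalized distribution
    probs = np.random.dirichlet(np.ones(vocab_size)).astype('float32')
    dist = mx.nd.array(probs, ctx=ctx)
    idx = int(mx.nd.random.multinomial(dist).asscalar())
    if idx >= vocab_size:
        print("out-of-bounds index {} at iteration {}".format(idx, i))
        break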

@bricksdont
Contributor Author

I think I finally found the problem: mx.random.multinomial consistently returns impossible values if its inputs are not proper distributions. For instance:

>>> dist = mx.nd.array([1.2, 9.4, 8.8])
>>> mx.random.multinomial(dist)
[3]
<NDArray 1 @cpu(0)>

MXNet also does not check whether its inputs are actually distributions. If I re-normalize target_dists:

# Map the negative logprobs to probabilities so as to have a distribution
target_dists = F.exp(-target_dists)

# normalize to make sure values are proper distributions
target_dists = F.softmax(target_dists, axis=-1)

then the problem never occurs. Does that make sense as an explanation / is applying softmax in that spot a good solution?
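
For reference, a self-contained version of that experiment could look like the following (a sketch added for illustration, not part of the original comment; it assumes MXNet 1.x and uses mx.nd.random.multinomial):

import mxnet as mx

# an unnormalized "distribution": the values sum to 19.4, not 1.0
dist = mx.nd.array([1.2, 9.4, 8.8])

# valid indices are 0, 1 and 2, but with unnormalized input the operator
# can return 3 (== len(dist)), which is what later triggers the KeyError
print([int(mx.nd.random.multinomial(dist).asscalar()) for _ in range(10)])

# renormalizing first keeps every sampled index in range
dist_norm = dist / dist.sum().asscalar()
print([int(mx.nd.random.multinomial(dist_norm).asscalar()) for _ in range(10)])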

@fhieber
Contributor

fhieber commented Sep 2, 2020

That's a good find! This is actually also part of the documentation for multinomial: ".. note:: The input distribution must be normalized, i.e. data must sum to 1 along its last dimension."

But what I don't understand yet is why target_dists does not contain proper distributions, i.e. ones summing to 1. Isn't the line above aimed at exactly that?

Edit: are you somehow skipping the softmax operation earlier (beam-size 1)?

Edit2: or this is some VERY unlikely issue with numerical precision of log_softmax -> exp, followed by sampling from a distribution that does not EXACTLY sum to 1.
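
For illustration, the kind of float32 deviation involved can be seen with a few lines (a sketch added for illustration, not from the original comment; it assumes MXNet 1.x and the mx.nd.log_softmax operator):

import mxnet as mx

# random logits standing in for decoder scores
logits = mx.nd.random.uniform(-10, 10, shape=(4, 7525))

# negative log-probabilities as produced by log_softmax, mapped back to
# probabilities with exp, mirroring the step before sampling
neg_logprobs = -mx.nd.log_softmax(logits, axis=-1)
probs = mx.nd.exp(-neg_logprobs)

# in float32 the row sums hover around 1.0 but are generally not exactly 1.0
print(probs.sum(axis=-1).asnumpy())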

@mjpost
Contributor

mjpost commented Sep 2, 2020

Yes, it's a somewhat confusing implicit assumption of multinomial. But it's a reasonable one—you wouldn't want the code to have to waste time ensuring a slice sums to 1.

I haven't looked at the code (that I wrote :)), but it seems clear to me that you want to sample from the output distribution at the current step, and you should not be incorporating the summed scores so far. This might not actually be what I wrote. If that is the case, I don't know why I did it that way.

@fhieber
Contributor

fhieber commented Sep 2, 2020

Thanks @mjpost. The code is correct, as far as I can tell. Given that this seems to be such an unlikely event, I would think this is a numerical precision issue.

@bricksdont
Contributor Author

bricksdont commented Sep 2, 2020

"but it seems clear to me that you want to sample from the output distribution at the current step, and you should not be incorporating the summed scores so far"

@mjpost The shape of the distributions as input for multinomial sampling is exactly (batch_size * beam_size, target_vocab_size) in my case, which confirms that this is the correct array, i.e. distributions of the current time step.


"Edit: are you somehow skipping the softmax operation earlier (beam-size 1)?"

@fhieber beam_size is at its default value (5), but I'm not sure if the behaviour is different for sampling (if sampling skips softmax or enforces beam=1 that could be an explanation).

" I would think this is a numerical precision issue."

After I re-normalize with softmax, some distributions still do not sum to 1.0 exactly, but mx.random.multinomial does the right thing. This might mean that before my "second" softmax, values do not actually sum to 1.

@mjpost
Contributor

mjpost commented Sep 2, 2020

@bricksdont Another thing to check—are the values sorted from highest to lowest? If the algorithm works the way I suspect, they would have to be sorted.

Edit: the documentation says nothing about sorting, and the examples suggest it's not necessary.

If so, then I'd be curious to see the cdf of the slice. Output distributions are typically pretty peaked, so it seems very unlikely the last item would get selected.

@bricksdont
Contributor Author

bricksdont commented Sep 2, 2020

@mjpost It's not the last item that gets selected; if mx.random.multinomial is fed malformed input, it often returns max_possible_index+1, regardless of the probability of the actual last entry in the input array.

I can still check whether the values are sorted, of course.

@fhieber
Contributor

fhieber commented Sep 2, 2020

So, can you confirm that in your case most distributions clearly do not sum to 1? Could you add a line that asserts on the sum with some epsilon?

@bricksdont
Contributor Author

bricksdont commented Sep 3, 2020

All distributions are within a small tolerance of 1.0; the following print statement is never executed (I hope I got this test right):

# Map the negative logprobs to probabilities so as to have a distribution
target_dists = F.exp(-target_dists)

# numpy isclose: absolute(a - b) <= (atol + rtol * absolute(b))
# b is the reference value = not symmetric
atol = 0.00000001
rtol = 0.00001
one_array = F.array([1.0])

target_dists_sums = F.sum(target_dists, axis=-1)

if any(F.abs(target_dists_sums - one_array) > (atol + rtol * F.abs(one_array))):
    print("target_dists_sums contains non-distributions")

Sorry for jumping to conclusions about this! A different explanation is that re-normalizing the values (applying softmax in the SampleK function) changes them slightly, which causes mx.random.multinomial to behave differently.

Edit: the problem re-appeared even with distributions that I re-normalized in SampleK.

@bricksdont
Contributor Author

Update: the behaviour is exactly the same with MXNet 1.7.0.

@SamuelLarkin
Contributor

I came across the same error in Sockeye 1.18.115.

I had started to write an issue/ticket when I realized this one was about the same thing. I'll add what I was in the process of writing here in case it helps.

Invalid token id during n-best translation

Hi,
I'm trying to produce n-best translations and I occasionally encounter a KeyError: a target token id was produced that isn't in the target vocabulary.
To be more precise, that id is one past the last target token.
I'm using a beam size of 10, an n-best size of 10, and I want to sample 5.

Observations

While writing the bug report, I noticed that the largest id in the target vocabulary is 15017 and the KeyError is for 15018, but in the logs I see

[INFO:sockeye.vocab] Vocabulary (15018 words) loaded from "model/vocab.trg.0.json"
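
(As a toy illustration of this failure mode, not taken from Sockeye: the inverse vocabulary maps ids 0..N-1 back to tokens, so an id equal to N raises exactly this kind of KeyError.)

vocab = {"<pad>": 0, "<unk>": 1, "<s>": 2, "</s>": 3, "-RCB-": 4}
vocab_inv = {idx: tok for tok, idx in vocab.items()}
vocab_inv[5]  # KeyError: 5 -- one past the largest valid id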

Translation Command

   time python \
      -m sockeye.translate \
         --disable-device-locking \
         --batch-size 32 \
         --strip-unknown-words \
         --models model/params/best \
         --input big_input \
         --beam-size=10 \
         --nbest-size=10 \
         --sample=5 \
      2> logs/log.trans.big_input \
      > translation.json

Error Message

[INFO:__main__] Translating...
[ERROR:root] Uncaught exception
Traceback (most recent call last):
  File "/project/WMT20/opt/miniconda3/envs/sockeye-1.18.115_cu101/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/project/WMT20/opt/miniconda3/envs/sockeye-1.18.115_cu101/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/project/WMT20/opt/miniconda3/envs/sockeye-1.18.115_cu101/lib/python3.7/site-packages/sockeye-1.18.115-py3.7.egg/sockeye/translate.py", line 273, in <module>
    main()
  File "/project/WMT20/opt/miniconda3/envs/sockeye-1.18.115_cu101/lib/python3.7/site-packages/sockeye-1.18.115-py3.7.egg/sockeye/translate.py", line 41, in main
    run_translate(args)
  File "/project/WMT20/opt/miniconda3/envs/sockeye-1.18.115_cu101/lib/python3.7/site-packages/sockeye-1.18.115-py3.7.egg/sockeye/translate.py", line 159, in run_translate
    input_is_json=args.json_input)
  File "/project/WMT20/opt/miniconda3/envs/sockeye-1.18.115_cu101/lib/python3.7/site-packages/sockeye-1.18.115-py3.7.egg/sockeye/translate.py", line 241, in read_and_translate
    chunk_time = translate(output_handler, chunk, translator)
  File "/project/WMT20/opt/miniconda3/envs/sockeye-1.18.115_cu101/lib/python3.7/site-packages/sockeye-1.18.115-py3.7.egg/sockeye/translate.py", line 264, in translate
    trans_outputs = translator.translate(trans_inputs)
  File "/project/WMT20/opt/miniconda3/envs/sockeye-1.18.115_cu101/lib/python3.7/site-packages/sockeye-1.18.115-py3.7.egg/sockeye/inference.py", line 1620, in translate
    results.append(self._make_result(trans_input, translation))
  File "/project/WMT20/opt/miniconda3/envs/sockeye-1.18.115_cu101/lib/python3.7/site-packages/sockeye-1.18.115-py3.7.egg/sockeye/inference.py", line 1733, in _make_result
    target_tokens_list = [[self.vocab_target_inv[id] for id in ids] for ids in nbest_target_ids]
  File "/project/WMT20/opt/miniconda3/envs/sockeye-1.18.115_cu101/lib/python3.7/site-packages/sockeye-1.18.115-py3.7.egg/sockeye/inference.py", line 1733, in <listcomp>
    target_tokens_list = [[self.vocab_target_inv[id] for id in ids] for ids in nbest_target_ids]
  File "/project/WMT20/opt/miniconda3/envs/sockeye-1.18.115_cu101/lib/python3.7/site-packages/sockeye-1.18.115-py3.7.egg/sockeye/inference.py", line 1733, in <listcomp>
    target_tokens_list = [[self.vocab_target_inv[id] for id in ids] for ids in nbest_target_ids]
KeyError: 15018

model/vocab.trg.0.json

{
    "<pad>": 0,
    "<unk>": 1,
    "<s>": 2,
    "</s>": 3,
    "<BT>": 4,
    "<TAG01>": 5,
	...
    "-RSB-": 15015,
    "-LCB-": 15016,
    "-RCB-": 15017
}

Translate Log

[INFO:sockeye.utils] Sockeye version 1.18.115, commit 41772bcb68b0e4bfb216f6541b5dde72c8c53f7a, path /project/WMT20/opt/miniconda3/envs/sockeye-1.18.115_cu101/lib/python3.7/site-packages/sockeye-1.18.115-py3.7.egg/sockeye/__init__.py
[INFO:sockeye.utils] MXNet version 1.6.0, path /project/WMT20/opt/miniconda3/envs/sockeye-1.18.115_cu101/lib/python3.7/site-packages/mxnet/__init__.py
[INFO:sockeye.utils] Command: /project/WMT20/opt/miniconda3/envs/sockeye-1.18.115_cu101/lib/python3.7/site-packages/sockeye-1.18.115-py3.7.egg/sockeye/translate.py --disable-device-locking --batch-size 32 --strip-unknown-words --models model --input news.2019.de.shuffled.deduped_1200k.12/source.de.bpe --beam-size=10 --nbest-size=10 --sample=5
[INFO:sockeye.utils] Arguments: Namespace(avoid_list=None, batch_size=32, beam_prune=0, beam_search_stop='all', beam_size=10, brevity_penalty_constant_length_ratio=0.0, brevity_penalty_type='none', brevity_penalty_weight=1.0, bucket_width=10, checkpoints=None, chunk_size=None, config=None, device_ids=[-1], disable_device_locking=True, ensemble_mode='linear', input='news.2019.de.shuffled.deduped_1200k.02/source.de.bpe', input_factors=None, json_input=False, length_penalty_alpha=1.0, length_penalty_beta=0.0, lock_dir='/tmp', loglevel='INFO', max_input_len=None, max_output_length_num_stds=2, models=['model'], nbest_size=10, no_logfile=False, output=None, output_type='translation', override_dtype=None, quiet=False, restrict_lexicon=None, restrict_lexicon_topk=None, sample=5, seed=None, skip_topk=False, softmax_temperature=None, strip_unknown_words=False, sure_align_threshold=0.9, use_cpu=False)
[WARNING:__main__] For nbest translation, you must specify `--output-type 'json'; overriding your setting of 'translation'.
[WARNING:sockeye.utils] Sockeye currently does not respect CUDA_VISIBLE_DEVICE settings when locking GPU devices.
[INFO:sockeye.utils] Attempting to acquire 1 GPUs of 1 GPUs.
[INFO:__main__] Translate Device: gpu(0)
[INFO:sockeye.inference] Loading 1 model(s) from ['model'] ...
[INFO:sockeye.vocab] Vocabulary (15023 words) loaded from "model/vocab.src.0.json"
[INFO:sockeye.vocab] Vocabulary (15018 words) loaded from "model/vocab.trg.0.json"
[INFO:sockeye.inference] Model version: 1.18.115
[INFO:sockeye.model] ModelConfig loaded from "model/config"
[INFO:sockeye.inference] Disabling dropout layers for performance reasons
[INFO:sockeye.model] Config[_frozen=True, config_data=Config[_frozen=True, data_statistics=Config[_frozen=True, average_len_target_per_bucket=[6.25482373282954, 11.892531980174182, 19.073197720870297, 26.527372049663935, 34.26083732096971, 42.12438040952975, 50.09595801659126, 58.10145174518264, 66.00924772501672, 74.14691173890164, 82.42240797028039, 91.38442822384417, 99.4557937846303, 108.12328767123283, 115.82057562767925, 124.02948402948415, 132.54601226993876, 140.0126050420168, 147.45000000000013, 156.14912280701765, 164.29687499999997, 171.9240506329113, 180.2716981132075, 188.9595141700405, 195.46896551724134, 201.0], buckets=[(10, 8), (20, 16), (30, 24), (40, 32), (50, 40), (60, 48), (70, 56), (80, 64), (90, 72), (100, 80), (110, 88), (120, 96), (130, 104), (140, 112), (150, 120), (160, 128), (170, 136), (180, 144), (190, 152), (200, 160), (201, 168), (201, 176), (201, 184), (201, 192), (201, 200), (201, 201)], length_ratio_mean=0.8325109757761393, length_ratio_stats_per_bucket=[(0.8700030596380449, 0.1851696052412553), (0.8473665592088504, 0.17201207474914348), (0.8170229505520449, 0.14491530614255255), (0.8149323443509092, 0.13291835554954182), (0.8215325578146861, 0.1295657278136262), (0.8293434993330212, 0.12631407865092517), (0.8385051454171029, 0.12531272036219746), (0.851452400207557, 0.12876879907862093), (0.8599266729886459, 0.13141393459725778), (0.8877793928981718, 0.14618004142830063), (0.9214403439662803, 0.1601146281820892), (0.9602362130596434, 0.17428407805892418), (0.9833917929841112, 0.16728597881644056), (1.0021719944585867, 0.17916335979053893), (0.9910454401200192, 0.14480671253359975), (0.999556313437337, 0.12327089340893257), (1.0526777885269414, 0.2021022084047301), (1.0556842543995564, 0.12353050809867584), (1.0269060332691413, 0.1138430347535324), (1.0744121479510391, 0.1179606715644975), (1.1298447627160189, 0.1909577604664158), (1.0961347956331, 0.22572763587713024), (1.1223604301072536, 0.20324548343424761), (1.0899442747764476, 0.08077486023781334), (1.1040011403845262, 0.09554039706492333), (1.079371703266767, 0.03331348410566477)], length_ratio_std=0.1531593009215404, max_observed_len_source=201, max_observed_len_target=201, num_discarded=1153, num_sents=7056012, num_sents_per_bucket=[468465, 2118500, 1866151, 1224479, 664763, 341758, 175879, 93887, 47363, 22796, 11844, 6165, 4151, 2628, 1633, 1628, 815, 714, 560, 456, 320, 395, 265, 247, 145, 5], num_tokens_source=190338708, num_tokens_target=155687988, num_unks_source=8, num_unks_target=2685, size_vocab_source=15023, size_vocab_target=15018], max_seq_len_source=201, max_seq_len_target=201, num_source_factors=1, source_with_eos=True], config_decoder=Config[_frozen=True, act_type=relu, attention_heads=8, conv_config=None, dropout_act=0.0, dropout_attention=0.0, dropout_prepost=0.0, dtype=float32, feed_forward_num_hidden=2048, lhuc=False, max_seq_len_source=201, max_seq_len_target=201, model_size=512, num_layers=6, positional_embedding_type=fixed, postprocess_sequence=dr, preprocess_sequence=n, use_lhuc=False], config_embed_source=Config[_frozen=True, dropout=0.0, dtype=float32, factor_configs=None, num_embed=512, num_factors=1, source_factors_combine=concat, vocab_size=15023], config_embed_target=Config[_frozen=True, dropout=0.0, dtype=float32, factor_configs=None, num_embed=512, num_factors=1, source_factors_combine=concat, vocab_size=15018], config_encoder=Config[_frozen=True, act_type=relu, attention_heads=8, conv_config=None, dropout_act=0.0, dropout_attention=0.0, dropout_prepost=0.0, 
dtype=float32, feed_forward_num_hidden=2048, lhuc=False, max_seq_len_source=201, max_seq_len_target=201, model_size=512, num_layers=6, positional_embedding_type=fixed, postprocess_sequence=dr, preprocess_sequence=n, use_lhuc=False], config_length_task=None, config_length_task_loss=None, config_loss=Config[_frozen=True, label_smoothing=0.1, length_task_link=None, length_task_weight=1.0, name=cross-entropy, normalization_type=valid, vocab_size=15018], lhuc=False, num_pointers=0, vocab_source_size=15023, vocab_target_size=15018, weight_normalization=False, weight_tying=False, weight_tying_type=None]
[INFO:sockeye.encoder] sockeye.encoder.EncoderSequence dtype: float32
[INFO:sockeye.encoder] sockeye.encoder.AddSinCosPositionalEmbeddings dtype: float32
[INFO:sockeye.encoder] sockeye.encoder.TransformerEncoder dtype: float32
[INFO:sockeye.decoder] sockeye.decoder.TransformerDecoder dtype: float32
[INFO:sockeye.encoder] sockeye.encoder.AddSinCosPositionalEmbeddings dtype: float32
[INFO:sockeye.encoder] sockeye.encoder.Embedding dtype: float32
[INFO:sockeye.encoder] sockeye.encoder.Embedding dtype: float32
[INFO:sockeye.model] Loaded params from "model/params.best"
[INFO:sockeye.inference] 1 model(s) loaded in 4.6880s
[INFO:sockeye.inference] Translator (1 model(s) beam_size=10 beam_prune=off beam_search_stop=all nbest_size=10 ensemble_mode=None max_batch_size=32 buckets_source=[10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, 200, 201] avoiding=0)
[INFO:__main__] Translating...
[ERROR:root] Uncaught exception
Traceback (most recent call last):
  File "/project/WMT20/opt/miniconda3/envs/sockeye-1.18.115_cu101/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/project/WMT20/opt/miniconda3/envs/sockeye-1.18.115_cu101/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/project/WMT20/opt/miniconda3/envs/sockeye-1.18.115_cu101/lib/python3.7/site-packages/sockeye-1.18.115-py3.7.egg/sockeye/translate.py", line 273, in <module>
    main()
  File "/project/WMT20/opt/miniconda3/envs/sockeye-1.18.115_cu101/lib/python3.7/site-packages/sockeye-1.18.115-py3.7.egg/sockeye/translate.py", line 41, in main
    run_translate(args)
  File "/project/WMT20/opt/miniconda3/envs/sockeye-1.18.115_cu101/lib/python3.7/site-packages/sockeye-1.18.115-py3.7.egg/sockeye/translate.py", line 159, in run_translate
    input_is_json=args.json_input)
  File "/project/WMT20/opt/miniconda3/envs/sockeye-1.18.115_cu101/lib/python3.7/site-packages/sockeye-1.18.115-py3.7.egg/sockeye/translate.py", line 241, in read_and_translate
    chunk_time = translate(output_handler, chunk, translator)
  File "/project/WMT20/opt/miniconda3/envs/sockeye-1.18.115_cu101/lib/python3.7/site-packages/sockeye-1.18.115-py3.7.egg/sockeye/translate.py", line 264, in translate
    trans_outputs = translator.translate(trans_inputs)
  File "/project/WMT20/opt/miniconda3/envs/sockeye-1.18.115_cu101/lib/python3.7/site-packages/sockeye-1.18.115-py3.7.egg/sockeye/inference.py", line 1620, in translate
    results.append(self._make_result(trans_input, translation))
  File "/project/WMT20/opt/miniconda3/envs/sockeye-1.18.115_cu101/lib/python3.7/site-packages/sockeye-1.18.115-py3.7.egg/sockeye/inference.py", line 1733, in _make_result
    target_tokens_list = [[self.vocab_target_inv[id] for id in ids] for ids in nbest_target_ids]
  File "/project/WMT20/opt/miniconda3/envs/sockeye-1.18.115_cu101/lib/python3.7/site-packages/sockeye-1.18.115-py3.7.egg/sockeye/inference.py", line 1733, in <listcomp>
    target_tokens_list = [[self.vocab_target_inv[id] for id in ids] for ids in nbest_target_ids]
  File "/project/WMT20/opt/miniconda3/envs/sockeye-1.18.115_cu101/lib/python3.7/site-packages/sockeye-1.18.115-py3.7.egg/sockeye/inference.py", line 1733, in <listcomp>
    target_tokens_list = [[self.vocab_target_inv[id] for id in ids] for ids in nbest_target_ids]
KeyError: 15018

Sockeye Version

1.18.115

Conda Environment.yaml

name: sockeye-1.18.115_cu101
channels:
  - defaults
dependencies:
  - _libgcc_mutex=0.1=main
  - _tflow_select=2.3.0=mkl
  - absl-py=0.9.0=py37_0
  - alabaster=0.7.12=py37_0
  - asn1crypto=1.3.0=py37_0
  - astor=0.8.0=py37_0
  - babel=2.8.0=py_0
  - blas=1.0=mkl
  - blinker=1.4=py37_0
  - c-ares=1.15.0=h7b6447c_1001
  - ca-certificates=2020.1.1=0
  - cachetools=3.1.1=py_0
  - certifi=2020.4.5.1=py37_0
  - cffi=1.14.0=py37h2e261b9_0
  - chardet=3.0.4=py37_1003
  - click=7.1.1=py_0
  - cryptography=2.8=py37h1ba5d50_0
  - cudatoolkit=10.1.243=h6bb024c_0
  - cudnn=7.6.5=cuda10.1_0
  - cycler=0.10.0=py37_0
  - dbus=1.13.12=h746ee38_0
  - docutils=0.16=py37_0
  - expat=2.2.6=he6710b0_0
  - fontconfig=2.13.0=h9420a91_0
  - freetype=2.9.1=h8a8886c_1
  - gast=0.2.2=py37_0
  - glib=2.63.1=h5a9c865_0
  - google-auth=1.13.1=py_0
  - google-auth-oauthlib=0.4.1=py_2
  - google-pasta=0.2.0=py_0
  - grpcio=1.27.2=py37hf8bcb03_0
  - gst-plugins-base=1.14.0=hbbd80ab_1
  - gstreamer=1.14.0=hb453b48_1
  - h5py=2.10.0=py37h7918eee_0
  - hdf5=1.10.4=hb1b8bf9_0
  - icu=58.2=h9c2bf20_1
  - idna=2.9=py_1
  - imagesize=1.2.0=py_0
  - intel-openmp=2020.0=166
  - jinja2=2.11.1=py_0
  - jpeg=9b=h024ee3a_2
  - keras-applications=1.0.8=py_0
  - keras-preprocessing=1.1.0=py_1
  - kiwisolver=1.1.0=py37he6710b0_0
  - ld_impl_linux-64=2.33.1=h53a641e_7
  - libedit=3.1.20181209=hc058e9b_0
  - libffi=3.2.1=hd88cf55_4
  - libgcc-ng=9.1.0=hdf63c60_0
  - libgfortran-ng=7.3.0=hdf63c60_0
  - libpng=1.6.37=hbc83047_0
  - libprotobuf=3.11.4=hd408876_0
  - libstdcxx-ng=9.1.0=hdf63c60_0
  - libuuid=1.0.3=h1bed415_2
  - libxcb=1.13=h1bed415_1
  - libxml2=2.9.9=hea5a465_1
  - markdown=3.1.1=py37_0
  - markupsafe=1.1.1=py37h7b6447c_0
  - matplotlib=3.1.3=py37_0
  - matplotlib-base=3.1.3=py37hef1b27d_0
  - mkl=2020.0=166
  - mkl-service=2.3.0=py37he904b0f_0
  - mkl_fft=1.0.15=py37ha843d7b_0
  - mkl_random=1.1.0=py37hd6b4f25_0
  - ncurses=6.2=he6710b0_0
  - numpy=1.18.1=py37h4f9e942_0
  - numpy-base=1.18.1=py37hde5b4d6_1
  - oauthlib=3.1.0=py_0
  - openssl=1.1.1f=h7b6447c_0
  - opt_einsum=3.1.0=py_0
  - packaging=20.3=py_0
  - pcre=8.43=he6710b0_0
  - pip=20.0.2=py37_1
  - protobuf=3.11.4=py37he6710b0_0
  - pyasn1=0.4.8=py_0
  - pyasn1-modules=0.2.7=py_0
  - pycparser=2.20=py_0
  - pygments=2.6.1=py_0
  - pyjwt=1.7.1=py37_0
  - pyopenssl=19.1.0=py37_0
  - pyparsing=2.4.6=py_0
  - pyqt=5.9.2=py37h05f1152_2
  - pysocks=1.7.1=py37_0
  - python=3.7.7=hcf32534_0_cpython
  - python-dateutil=2.8.1=py_0
  - pytz=2019.3=py_0
  - qt=5.9.7=h5867ecd_1
  - readline=8.0=h7b6447c_0
  - requests=2.23.0=py37_0
  - requests-oauthlib=1.3.0=py_0
  - rsa=4.0=py_0
  - scipy=1.4.1=py37h0b6359f_0
  - setuptools=46.1.3=py37_0
  - sip=4.19.8=py37hf484d3e_0
  - six=1.14.0=py37_0
  - snowballstemmer=2.0.0=py_0
  - sphinx=2.4.4=py_0
  - sphinxcontrib-applehelp=1.0.2=py_0
  - sphinxcontrib-devhelp=1.0.2=py_0
  - sphinxcontrib-htmlhelp=1.0.3=py_0
  - sphinxcontrib-jsmath=1.0.1=py_0
  - sphinxcontrib-qthelp=1.0.3=py_0
  - sphinxcontrib-serializinghtml=1.1.4=py_0
  - sqlite=3.31.1=h7b6447c_0
  - tensorboard=2.1.0=py3_0
  - tensorflow=2.1.0=mkl_py37h80a91df_0
  - tensorflow-base=2.1.0=mkl_py37h6d63fb7_0
  - tensorflow-estimator=2.1.0=pyhd54b08b_0
  - termcolor=1.1.0=py37_1
  - tk=8.6.8=hbc83047_0
  - tornado=6.0.4=py37h7b6447c_1
  - urllib3=1.25.8=py37_0
  - werkzeug=1.0.1=py_0
  - wheel=0.34.2=py37_0
  - wrapt=1.12.1=py37h7b6447c_1
  - xz=5.2.4=h14c3975_4
  - zlib=1.2.11=h7b6447c_3
  - pip:
    - decorator==4.4.2
    - mxboard==0.1.0
    - mxnet-cu101mkl==1.6.0
    - mxnet-mkl==1.5.1
    - networkx==2.0
    - pillow==7.1.1
    - portalocker==1.7.0
    - pudb==2019.2
    - python-graphviz==0.8.4
    - sentencepiece==0.1.85
    - sockeye==1.18.115
    - subword-nmt==0.3.7
    - typing==3.7.4.1
    - urwid==2.1.0
prefix: /project/WMT20/opt/miniconda3/envs/sockeye-1.18.115_cu101

args.yaml

allow_missing_params: false
attention_based_copying: false
batch_size: 8192
batch_type: word
bucket_width: 10
checkpoint_interval: 4752
cnn_activation_type: glu
cnn_hidden_dropout: 0.2
cnn_kernel_width:
- 3
- 3
cnn_num_hidden: 512
cnn_positional_embedding_type: learned
cnn_project_qkv: false
config: model_config.yaml
conv_embed_add_positional_encodings: false
conv_embed_dropout: 0.0
conv_embed_max_filter_width: 8
conv_embed_num_filters:
- 200
- 200
- 250
- 250
- 300
- 300
- 300
- 300
conv_embed_num_highway_layers: 4
conv_embed_output_dim: null
conv_embed_pool_stride: 5
decode_and_evaluate: -1
decode_and_evaluate_device_id: null
decode_and_evaluate_use_cpu: false
decoder: transformer
decoder_only: false
device_ids:
- -4
disable_device_locking: true
dry_run: false
embed_dropout: &id001
- 0.0
- 0.0
embed_weight_init: default
encoder: transformer
fixed_param_names: []
fixed_param_strategy: null
gradient_clipping_threshold: 1.0
gradient_clipping_type: abs
gradient_compression_threshold: 0.5
gradient_compression_type: null
initial_learning_rate: 0.0001
keep_initializations: false
keep_last_params: 2
kvstore: device
label_smoothing: 0.1
layer_normalization: false
learning_rate_decay_optimizer_states_reset: 'off'
learning_rate_decay_param_reset: false
learning_rate_half_life: 10
learning_rate_reduce_factor: 0.7
learning_rate_reduce_num_not_improved: 8
learning_rate_schedule: null
learning_rate_scheduler_type: plateau-reduce
learning_rate_warmup: 0
length_task: null
length_task_layers: 1
length_task_weight: 1.0
lhuc: null
lock_dir: /tmp
loglevel: INFO
loss: cross-entropy
loss_normalization_type: valid
max_checkpoints: null
max_num_checkpoint_not_improved: 32
max_num_epochs: 1000
max_samples: null
max_seconds: null
max_seq_len:
- 200
- 200
max_updates: null
metrics:
- perplexity
- accuracy
min_num_epochs: null
min_samples: null
min_updates: null
momentum: null
monitor_pattern: null
monitor_stat_func: mx_default
no_bucketing: false
no_logfile: false
num_embed:
- null
- null
num_layers:
- 6
- 6
num_words:
- 0
- 0
optimized_metric: bleu
optimizer: adam
optimizer_params: null
output: model
overwrite_output: false
pad_vocab_to_multiple_of: null
params: ../../teacher.wmt.6.bpe-dr-t-src/de2cs.bpe.15k.bpe-dr/model/params.best
prepared_data: null
quiet: false
rnn_attention_coverage_max_fertility: 2
rnn_attention_coverage_num_hidden: 1
rnn_attention_coverage_type: count
rnn_attention_in_upper_layers: false
rnn_attention_mhdot_heads: null
rnn_attention_num_hidden: null
rnn_attention_type: mlp
rnn_attention_use_prev_word: false
rnn_cell_type: lstm
rnn_context_gating: false
rnn_decoder_hidden_dropout: 0.2
rnn_decoder_state_init: last
rnn_dropout_inputs: *id001
rnn_dropout_recurrent: *id001
rnn_dropout_states: *id001
rnn_enc_last_hidden_concat_to_embedding: false
rnn_encoder_reverse_input: false
rnn_first_residual_layer: 2
rnn_forget_bias: 0.0
rnn_h2h_init: orthogonal
rnn_num_hidden: 1024
rnn_residual_connections: false
rnn_scale_dot_attention: false
seed: 13
shared_vocab: false
source: ../../../corpora/preprocessing.multilingual.wmt.6/bpe.15k/bt.mono_hsb.backtrans.1/hsb-de_x12+witaj+si+web.bt_tag.bpe-dr-de/train.hsb-de.de.gz
source_factor_vocabs: []
source_factors: []
source_factors_combine: concat
source_factors_num_embed: []
source_vocab: ../../teacher.wmt.6.bpe-dr-t-src/de2cs.bpe.15k.bpe-dr/model/vocab.src.0.json
stop_training_on_decoder_failure: false
target: ../../../corpora/preprocessing.multilingual.wmt.6/bpe.15k/bt.mono_hsb.backtrans.1/hsb-de_x12+witaj+si+web.bt_tag.bpe-dr-de/train.hsb-de.hsb.gz
target_vocab: ../../teacher.wmt.6.bpe-dr-t-src/de2cs.bpe.15k.bpe-dr/model/vocab.trg.0.json
transformer_activation_type: relu
transformer_attention_heads:
- 8
- 8
transformer_dropout_act: 0.1
transformer_dropout_attention: 0.1
transformer_dropout_prepost: 0.1
transformer_feed_forward_num_hidden:
- 2048
- 2048
transformer_model_size:
- 512
- 512
transformer_positional_embedding_type: fixed
transformer_postprocess:
- dr
- dr
transformer_preprocess:
- n
- n
update_interval: 1
use_cpu: false
validation_source: ../../../corpora/preprocessing.multilingual.wmt.6/bpe.15k/bt.mono_hsb.backtrans.1/hsb-de_x12+witaj+si+web.bt_tag.bpe-dr-de/devel.hsb-de.de
validation_source_factors: []
validation_target: ../../../corpora/preprocessing.multilingual.wmt.6/bpe.15k/bt.mono_hsb.backtrans.1/hsb-de_x12+witaj+si+web.bt_tag.bpe-dr-de/devel.hsb-de.hsb
weight_decay: 0.0
weight_init: xavier
weight_init_scale: 3.0
weight_init_xavier_factor_type: avg
weight_init_xavier_rand_type: uniform
weight_normalization: false
weight_tying: false
weight_tying_type: trg_softmax
word_min_count:
- 1
- 1

@bricksdont
Contributor Author

I still believe this is an MXNet bug, but I don't know how to reduce the problem to the single RNG state and input that cause random.multinomial to misbehave. As @fhieber said, that would be possible if the RNG state could be saved somehow.

@KellenSunderland we could use some MXNet expertise here, if you are interested in tackling this.

@mjdenkowski
Contributor

Closing for now as this applies to an older version of Sockeye.
