
Source factors #275

Merged: 77 commits merged into master from source_factors on Feb 19, 2018

Conversation

@mjpost (Contributor) commented Jan 18, 2018

Added source factors, as described in:

Linguistic Input Features Improve Neural Machine Translation.
Rico Sennrich & Barry Haddow
In Proceedings of the First Conference on Machine Translation. Berlin, Germany, pp. 83-91.

Source factors are enabled by passing --source-factors file1 [file2 ...] (-sf), where file1, etc. are token-parallel to the source (-s).
This option can be passed either to sockeye.train or, if data sharding is used, in the data preparation step.
An analogous parameter, --validation-source-factors, is used to pass factors for validation data.
The flag --source-factors-num-embed D1 [D2 ...] denotes the embedding dimensions.
These are concatenated with the source word dimension (--num-embed), which can continue to be tied to the target (--weight-tying --weight-tying-type=src_trg).

At test time, the input sentence and its factors can be passed via multiple parallel files (--input and --input-factors) or through stdin with token-level annotations separated by |. Alternatively, a string-serialized JSON object can be sent to the CLI through stdin; it must have a top-level key 'text' and may optionally have a key 'factors' of type List[str].
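
For illustration, here is a minimal sketch of building the two stdin formats described above (the tokens and the case factor are made-up placeholders; the exact schema accepted by the CLI is the one described in this PR):

import json

tokens = ["An", "exam@@", "ple"]
case_factor = ["O", "B", "I"]  # hypothetical token-parallel source factor

# JSON input: a top-level 'text' key plus an optional 'factors' key of type List[str]
json_line = json.dumps({"text": " ".join(tokens), "factors": [" ".join(case_factor)]})
print(json_line)   # {"text": "An exam@@ ple", "factors": ["O B I"]}

# token-level annotation input: factor values attached to each token with '|'
print(" ".join("|".join(pair) for pair in zip(tokens, case_factor)))  # An|O exam@@|B ple|I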

Pull Request Checklist

  • Changes are complete (if posting work-in-progress code, prefix your pull request title with '[WIP]'
    until you can check this box).
  • Unit tests pass (pytest)
  • System tests pass (pytest test/system)
  • Passed code style checking (./pre-commit.sh or manual run of pylint & mypy)
  • You have considered writing a test
  • Updated major/minor version in sockeye/__init__.py. Major version bump if this is a backwards incompatible change.
  • Updated CHANGELOG.md

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

(source_embed,
source_embed_length,
source_embed_seq_len) = self.embedding_source.encode(source, source_length, source_seq_len, source_factors)

Contributor Author:

Another way to do this would be to encode factors as additional dimensions on the tensor; then the logic would go into assembling source and choose the right subclass for the embedder, and the calls here would be the same. This would probably be cleaner...

Contributor:

I think this would be a good idea. If there are no factors, the data iterator should yield source data of shape (batch, time, 1) or (batch, time) [the latter is the current behavior]. If source factors are present, it should yield batches with source data of shape (batch, time, 1+num_factors).
Since we elevated the embedding operation to a core model operation (by making it a model component along with the encoder & decoder), you can implement the embedding encoder class to split on the third dimension if num_factors > 0.
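
A minimal sketch of that shape handling with MXNet symbols (the function and parameter names are only illustrative, not the Sockeye implementation):

import mxnet as mx

def embed_factored_source(source, num_factors, vocab_sizes, embed_dims):
    # source: (batch, time, 1 + num_factors); column 0 holds the word ids,
    # columns 1..num_factors hold the factor ids.
    columns = mx.sym.split(data=source, num_outputs=1 + num_factors,
                           axis=2, squeeze_axis=True)  # each output: (batch, time)
    embedded = [mx.sym.Embedding(data=columns[i],
                                 input_dim=vocab_sizes[i],
                                 output_dim=embed_dims[i],
                                 name="source_embed_factor%d" % i)
                for i in range(1 + num_factors)]
    # concatenate word and factor embeddings along the feature dimension
    return mx.sym.concat(*embedded, dim=2)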

new_tokens = self.tokens[i:i + chunk_size]
new_sent = ' '.join(new_tokens)
new_factors = [self.factors[f][i: i + chunk_size] for f in range(self.num_factors)]
yield TranslatorInput(id=self.id, chunk_id=chunk_id, sentence=new_sent, tokens=new_tokens, factors=new_factors)
Contributor Author:

Since we have to carry factors around now, splitting for sentences that are too long is a bit more complicated, so this is now a class instead of a namedtuple. I also got rid of InputChunk, rolling it into TranslatorInput, which has an optional additional field, chunk_id.

source_factors = [mx.sym.Variable(C.SOURCE_NAME + "_factor_" + str(i)) for i in range(self.config.num_factors)]
(source_embed,
source_embed_length,
source_embed_seq_len) = self.embedding_source.encode(source, source_length, source_seq_len, source_factors)
Contributor Author:

See note below for training.py: it might be better to encode factors as additional dimensions on the source ndarray, which FactoredEmbedding would know to look for. We could also do away with FactoredEmbedding entirely, and have the embedding receive the number of factors when it is instantiated.

Contributor:

I think it might be a good idea to add the factor functionality to the existing Embedding class (via arguments in its constructor)

@fhieber (Contributor) left a comment:

I will take a longer look tomorrow. Would it be possible to remove the new scripts folder from this PR to make the diff a little bit smaller? (We can think about adding those later, if needed.)

@@ -642,8 +676,12 @@ def get_prepared_data_iters(prepared_data_dir: str,
for shard_fname in shard_fnames:
check_condition(os.path.exists(shard_fname), "Shard %s does not exist." % shard_fname)

source_vocab_fname = os.path.join(prepared_data_dir, C.VOCAB_SRC_NAME) + C.JSON_SUFFIX
target_vocab_fname = os.path.join(prepared_data_dir, C.VOCAB_TRG_NAME) + C.JSON_SUFFIX
# MJP: Why was this hard-coded instead of reading from the data config?
Contributor:

because these entries are None in the case where the vocab is created on the fly and not given on the commandline

@mjpost (Contributor Author), Jan 19, 2018:

This means that the prepared data's data.config file has "null" entries for the vocab, however. Is this intended? It seems more sensible for the data.config file to list the actual locations of the vocabularies, and then to load them from those locations, rather than hard-coding the path in the code and ignoring the value written to the config.

Here's a data config, for example, under current master:

shared_vocab: false
source: /redacted/path/train.bpe.lc.en.10
source_factor_vocabs:
- null
source_factors:
- /redacted/path/train.bpe.en.factor_case.10
target: /redacted/path/train.bpe.de.10
vocab_source: null
vocab_target: null

What is the purpose of the vocabulary paths? To record where a pre-built vocabulary came from?

@@ -741,6 +795,7 @@ def get_training_data_iters(source: str, target: str,

source_sentences = SequenceReader(source, vocab_source, add_bos=False)
target_sentences = SequenceReader(target, vocab_target, add_bos=True)
source_factor_sentences = [SequenceReader(f, v, add_bos=False) for f,v in zip(source_factors, source_factor_vocabs)]
Contributor:

please reformat according to pep8: f,v -> f, v

:return: Encoded versions of input data (data, data_length, seq_len).
"""

embeddings = [ super().encode(data, data_length, seq_len)[0] ]
Contributor:

I think it would be nicer to get the length return from the word embeddings and return that, something like:

word_embeddings, embed_length, embed_seq_len = super().encode(data, data_length, seq_len)
factor_embeddings = []
[...]
embedding = mx.sym.concat(word_embeddings, factor_embeddings, ...)
return embedding, embed_length, embed_seq_len

source_factors = [mx.sym.Variable(C.SOURCE_NAME + "_factor_" + str(i)) for i in range(self.config.num_factors)]
(source_embed,
source_embed_length,
source_embed_seq_len) = self.embedding_source.encode(source, source_length, source_seq_len, source_factors)
Contributor:

I think it might be a good idea to add the factor functionality to the existing Embedding class (via arguments in its constructor)

def load_source_factor_vocabs(model_folder: str) -> List[Dict[str,int]]:
"""
Loads source factor vocabs of the form "vocab.src.N" until one can no longer be found.
This could alternately be done by first reading the number of factors from the config file.
Contributor:

I would suggest only trying to load as many vocabs as you expect from the config.

sockeye/utils.py Outdated
:return: The streams, pasted together as a single string.
"""
for sentences in zip(*streams):
yield '\t'.join([x.replace('\t', ' ').rstrip() for x in sentences]) + ('\n' if add_eol else '')
Contributor:

I think you can even get away with () around the list comprehension as join anyway would consume that generator.
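
Applying that suggestion, the yield could simply read (same behavior, generator instead of a list):

yield '\t'.join(x.replace('\t', ' ').rstrip() for x in sentences) + ('\n' if add_eol else '')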

sockeye/vocab.py Outdated
return build_from_paths(paths=[data],
num_words=num_words,
min_count=word_min_count) if vocab_path is None else vocab_from_json(vocab_path)
min_count=word_min_count) if not os.path.exists(vocab_path) else vocab_from_json(vocab_path)
Contributor:

why this change?

Contributor Author:

There is hard-coded logic about the vocabulary names scattered through the code (data_io, prepare_data, and train[ing]). This logic is often triggered when the vocabulary file names are set to None, since that is used to determine if a file needs to be read from a vocab file or built from a path. It made more sense to me to specify the vocab names at the top level and then key the logic to file existence. However, I went through and reverted all this.

@mjdenkowski (Contributor):

Very nice addition! I like using the paste format to avoid issues with token-level factor delimiters or other markup. Would it make sense to also use this format for training? We could have a single flag like --source-factors present for both train and translate that enables parsing this format.

@mjpost (Contributor Author) commented Jan 22, 2018:

@mjdenkowski Yes, this is a good idea. It also makes things easier for the workflow, because then there is only a single source factors file to deal with instead of a variable number of them.

@fhieber (Contributor) left a comment:

Great feature, thanks!

I have three points we should discuss:

  1. Can we find a way to abstract away the factor code in the model classes? The factor configuration can be moved into a sub Config object for EmbeddingConfig so that ModelConfig would not need to know about it directly.

  2. I like the current way of providing multiple factors through different files since it follows the API of providing source and target through 2 different commandline arguments. It also does not require the user to paste multiple factor files together beforehand, which might be error-prone. Regarding inference, it's ok to use the tab input, but I think it would be good to try to define an API that accepts metadata (including source factors) for a translation request instead of moving the parsing of the input into the translator.

  3. This change needs to have unit tests, a system test, and should be tested with various model architectures and configurations. Transformer models will probably fail with the current version, as they expect embedding size to be equal to their 'model size' (which is not the case if factor embeddings are concatenated).

@@ -40,6 +41,7 @@ class CheckpointDecoder:
:param inputs: Path to file containing input sentences.
:param references: Path to file containing references.
:param model: Model to load.
:param input_factors: Path to files containing input factors.
Contributor:

Paths to files

@@ -272,18 +277,32 @@ class Embedding(Encoder):
def __init__(self,
config: EmbeddingConfig,
prefix: str,
embed_weight: Optional[mx.sym.Symbol] = None) -> None:
embed_weight: Optional[mx.sym.Symbol] = None,
factor_vocab_sizes: Optional[List[int]] = [],
Contributor:

why are these needed? Aren't they already part of the config passed in this constructor?

@@ -253,11 +253,16 @@ class EmbeddingConfig(Config):
def __init__(self,
vocab_size: int,
num_embed: int,
dropout: float) -> None:
dropout: float,
source_factor_vocab_sizes: Optional[List[int]] = [],
Contributor:

these two arguments are optional lists of the same length, correct?
Would it make sense to structure this a little bit differently, for example by defining a FactorConfig object containing the vocab_size and the dim_size?
The argument to EmbeddingConfig would then be an optional list of FactorConfigs.
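
A minimal sketch of the suggested structure (class and attribute names here simply mirror the comment; the merged Sockeye classes may differ):

from typing import List, Optional

class FactorConfig:
    def __init__(self, vocab_size: int, num_embed: int) -> None:
        self.vocab_size = vocab_size
        self.num_embed = num_embed

class EmbeddingConfig:
    def __init__(self,
                 vocab_size: int,
                 num_embed: int,
                 dropout: float,
                 factor_configs: Optional[List[FactorConfig]] = None) -> None:
        self.vocab_size = vocab_size
        self.num_embed = num_embed
        self.dropout = dropout
        self.factor_configs = factor_configs
        # the source surface form counts as the first (primary) factor
        self.num_factors = 1 + (len(factor_configs) if factor_configs else 0)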

@@ -314,6 +314,8 @@ def shard_data(source_fname: str,
target_fname: str,
vocab_source: Dict[str, int],
vocab_target: Dict[str, int],
source_factors: List[str],
Contributor:

missing docstring

@@ -314,6 +314,8 @@ def shard_data(source_fname: str,
target_fname: str,
vocab_source: Dict[str, int],
vocab_target: Dict[str, int],
source_factors: List[str],
source_factor_vocabs: List[Dict[str, int]],
Contributor:

typo: the type should be Optional[...]

@@ -159,9 +160,10 @@ def sym_gen(source_seq_len: int):
source_length = utils.compute_lengths(source)

# source embedding
source_factors = [mx.sym.Variable(C.SOURCE_NAME + "_factor_" + str(i)) for i in range(self.config.num_factors)]
Contributor:

[nit]: I would move variable declaration up to line 159, next to the source variable.

Contributor Author:

done

:param model_folder: The location of the model folder
:return: The list of source factor vocabs found in the model folder.
"""
return [vocab.vocab_from_json_or_pickle(os.path.join(model_folder, C.VOCAB_SRC_NAME) + "." + str(fi)) for fi in range(num_factors)]
Contributor:

Not that important but I'd prefer string formatting over concatenation :)

Contributor Author:

agreed, done

cache_output_layer_w_b=cache_output_layer_w_b)
models.append(model)
num_factors = model.SockeyeModel.load_config(os.path.join(model_folder, C.CONFIG_NAME)).num_factors
source_factor_vocabs = load_source_factor_vocabs(model_folder, num_factors)
Contributor:

this is strange: for multiple models it will return the last source_factor_vocabs, whereas for source/target vocabs we are returning the first. There should also be a check that all factor vocabs across models are identical (as in line 424).

Also, you do not seem to need the num_factors variable so you can avoid reloading the config from disk. It's available from the model instance.

Contributor Author:

done

:param tokens: List of input tokens.
:param factors: List of zero or more input factors.
"""
def __init__(self, id, sentence, tokens, chunk_id = 0, factors = []):
Contributor:

typing information missing.
id shadows a builtin, maybe we can rename it?

Contributor Author:

this is used elsewhere as well (e.g., TranslatedChunk); does it matter if it's just a parameter name in a function def? If we change it, we also have to change it in the calls where the ID is passed.

Contributor:

That'd be easy with an IDE :)
Agreed that it is not crucial in this case. Your decision whether you want to change it.

def __str__(self):
return '[%d.%d] %s' % (self.id, self.chunk_id, self.sentence)

def chunks(self, chunk_size: int):
Contributor:

return value missing

Contributor Author:

done

@fhieber (Contributor) left a comment:

I think moving the factors into the source tensors for training (and then also for inference) would simplify the code quite a bit, as you said yourself.

@@ -411,6 +418,9 @@ def load(self,
for (source_len, target_len), num_samples in zip(self.buckets, num_samples_per_bucket)]
data_target = [np.full((num_samples, target_len), self.pad_id, dtype=self.dtype)
for (source_len, target_len), num_samples in zip(self.buckets, num_samples_per_bucket)]
data_source_factor_sources = [[np.full((num_samples, source_len), self.pad_id, dtype=self.dtype)
Contributor:

could we also move the factor data into data_source, making it 3d?
So without factors it would be (num_samples, source_len, 1), and with n factors (num_samples, source_len, 1+n).
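
For illustration, a small numpy sketch of that 3-d allocation (function and argument names are hypothetical; only the shapes follow the comment above):

import numpy as np

def allocate_source_buffers(buckets, num_samples_per_bucket, num_factors,
                            pad_id=0, dtype='int32'):
    # one 3-d buffer per bucket: words in channel 0, factors in channels 1..num_factors
    return [np.full((num_samples, source_len, 1 + num_factors), pad_id, dtype=dtype)
            for (source_len, target_len), num_samples in zip(buckets, num_samples_per_bucket)]

# e.g. two buckets of source length 10 and 20, one source factor:
buffers = allocate_source_buffers(buckets=[(10, 12), (20, 22)],
                                  num_samples_per_bucket=[64, 32],
                                  num_factors=1)
# buffers[0].shape == (64, 10, 2), buffers[1].shape == (32, 20, 2)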

source = data[:n]
target = data[n:2 * n]
label = data[2 * n:]
num_parts = num_source_factors + 3
Contributor:

If you move the factor data into source, it's cleaner and simplifies the loading logic a lot

@mjdenkowski (Contributor):

Re: factor input formats, I think we have some competing design goals:

  • Specifying options and input files should be consistent with the rest of the codebase.
  • Formats should be consistent between train and decode.
  • Formats should be robust and easy to validate (avoid human error or at least fail fast).
  • Formats should be as simple as possible given the above.

Here are a few ideas to consider:

If we like multiple files, what about folding factors into existing options? Something like:

--source corpus.en corpus.en.factor2 corpus.en.factor3
--target corpus.de
--num-embed 512,64,64:512

If we add target factors, we don't need any additional args to extend this.

Paste being error-prone is a good point, especially if some tabs sneak into training files. A user could get "pre-tokenized" data somewhere that isn't totally whitespace sanitized, try to build a system, and get confusing errors. As reluctant as I am to suggest it, what about Moses-style factor delimiters?

An|O exam@@|B ple|I

The difference could be that we only scan for them if we're running in factored mode, so we don't crash on random characters when users don't want factored input. (Factored mode could be defined as having multiple input files or a specific flag in training and based on model config at decode time). This would give us a decode-time solution and we could support both multi-file input or word-level factor-delimited input for training (both ease-of-use and consistency). As soon as we see a single token with the wrong number of factors, we throw an error and report the exact problem to the user.

@fhieber (Contributor) commented Jan 24, 2018:

Thanks Michael, I think that for training time moving the factor files into the --source argument is nice, but it complicates the --num-embed flag parsing: you have to parse it, store the components in different variables, and pass them around in train.py (in addition to args) to check for equal length of embedding sizes and input files, etc.
It also makes the behavior of --source rather implicit. I would vote for being explicit and keeping the extra --source-factors and num-embed flags for factors. What does the rest of the team think about this (@tdomhan @davvil)?

Regarding decoding input formats: I do not care so much about type of separator for command line input but more about how we build it into the framework. The inference code (including the translator) should NOT do any kind of parsing but expose a well-defined API that accepts input strings and corresponding metadata in a structured way. This would cover the library use case.
For the CLI, we should have a set of input parsing functions/classes (similar to the OutputHandler concept) that take care of preparing structured input for the translate API. This would allow rigorous unit testing of these parsers.
Failing and informing the user should happen at two places: when parsing errors occur (e.g. two adjacent separator strings), and when an input request without metadata is passed to a model/Translator that expects metadata, and vice versa.

@davvil (Contributor) commented Jan 24, 2018:

I would prefer to include the factors in the input file (at least for the CLI), i.e. the "An|O exam@@|B ple|I" proposed by Michael. I think this is less error prone. There should be a flag to specify the use of factors and an option to define the separator. The parsing of the input lines can be left out of the API and be done in the translate tool.

@mjpost (Contributor Author) commented Jan 24, 2018:

I kind of like @mjdenkowski's suggestion of adding factors to the source list, and doing the same with num-embed. It would not be hard to separate this internally and do a check to ensure the lists are the same length. In general I like multi-purpose flags instead of new flags if their use cases are clear. But I am also fine with sticking with --source-factors and --source-factor-dims. Michael's thought about later fitting multi-source decoding into this API is a good one. I think this is an argument for keeping --source-factors separate: multiple sources would then be multiple arguments to --source. So I guess my vote is to keep them separate.

For input, I would strongly prefer tab-delimited data to the Moses-style.

  • Tab is already a metadata character, and it is better to make use of that than to introduce new metadata characters that we'd then have to ensure are handled and undone by Sockeye or wrapper scripts. The Moses support mailing list is full of people wondering why their decoder died when it hit a '|'.
  • It fits more easily with research workflows, where you generate multiple factors separately and store them in files. With the Moses format, you need an intervening script to generate it from the factors. Not a big deal, but one of those minor issues that help make life nicer.
  • Note that my paste() function strips tabs out of each of the input streams before pasting them, so there is no worry about errant tabs in the input. Since tabs should not be there anyway, this function does no harm and succinctly ensures the data is clean (at least from the tab perspective).

At test time, it seems clean to me to similarly rely only on pasted data on STDIN. We should also add support for --source-factors there for people passing input via -s instead.

@fhieber (Contributor) commented Jan 24, 2018:

I agree with Matt's points on choosing tab over "|", but making it configurable would be the best option, as suggested by @davvil. We can set the default to tab.
I guess it would then be nice to have something like a SourceFactorInputHandler class that takes the separator string as a parameter and implements an interface that 'decodes' from a string to a TranslateInput (an object of string + metadata).

For training I am not strongly opposed to either of the options. If we leave in --source-factors, I suggest renaming --source-factor-dims to --source-factors-num-embed to be consistent with num-embed for the 'primary' factor (words).

I am not fully convinced that multi-source input is a use case we have to support in the future.
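
A rough sketch of the kind of handler being proposed, for the tab-separated (stream-parallel) input variant; the class and method names are only illustrative of the suggestion, not an actual Sockeye API:

from typing import List, NamedTuple, Optional

class TranslateInput(NamedTuple):
    tokens: List[str]
    factors: List[List[str]]  # one list of values per factor

class SourceFactorInputHandler:
    """Decodes a factored input line into tokens plus metadata."""
    def __init__(self, num_factors: int, separator: str = '\t') -> None:
        self.num_factors = num_factors
        self.separator = separator

    def decode(self, line: str) -> Optional[TranslateInput]:
        fields = line.rstrip('\n').split(self.separator)
        if len(fields) != 1 + self.num_factors:
            return None  # caller logs the error instead of crashing
        streams = [f.split() for f in fields]
        if any(len(s) != len(streams[0]) for s in streams):
            return None  # factor streams must be token-parallel
        return TranslateInput(tokens=streams[0], factors=streams[1:])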

@fhieber (Contributor) commented Feb 8, 2018:

Let's decide whether we want this to be backwards compatible or breaking. If it is the latter, we can also change the name of the source vocabs from vocab.src.json to vocab.src.i.json, where i corresponds to the factor. This would make the loading and saving logic for vocabs cleaner.

@tdomhan (Contributor) commented Feb 9, 2018:

I'm fine with making it backwards-incompatible. It's a fairly large change. If we do so then we could think about pulling some things out of DataConfig that are there only for documentation purposes (unlike the length ratios). We could have a DataInfo object that is also stored in the model folder, but not used at inference time.

@fhieber (Contributor) commented Feb 9, 2018:

I created a PR to separate these two objects in #293.
UPDATE: merged.

fhieber and others added 4 commits February 9, 2018 12:03
* BREAKING: Removed max_seq_len_{source,target} from ModelConfig

* BREAKING: Separate data statistics relevant for inference from data information relevant only for training.
@fhieber (Contributor) commented Feb 9, 2018:

As far as I can tell, I addressed all comments. @davvil @tdomhan I would merge once both of you approve.

utils.check_condition(vocab.are_identical(*target_vocabs), "Target vocabulary ids do not match")
for fi in range(len(source_vocabs[0])):
Contributor:

maybe for clarity source_vocabs[0] could be pulled out to:

first_model_vocabs = source_vocabs[0]

Contributor:

done

Returns a TranslatorInput object from a JSON object, serialized as a string.

:param sentence_id: An integer id.
:param json_string: A JSON object serialized as a string.
Contributor:

It might make sense to briefly mention the format in which the data is expected to be? Namely, the structure where there is one entry per factor (rather than something like a list of tuples [("token", "factor1", ...)]), etc.

Contributor:

done


tokens = list(data_io.get_tokens(tokens))
factors = jobj.get(C.JSON_FACTORS_KEY)
if isinstance(factors, list):
Contributor:

why not the same "is None" check?

factors = [[] for _ in range(model_num_source_factors - 1)] # type: List[Tokens]
for token_id, token in enumerate(data_io.get_tokens(factored_string)):
pieces = token.split(delimiter)
utils.check_condition(all(pieces) and len(pieces) == model_num_source_factors,
Contributor:

So far no input to inference could cause the sockeye process to crash; e.g. empty inputs were just returned. All words are mapped to some word id (<unk> if nothing matches). I really think that we should continue to do so. Specifically for the factors I would very much prefer to fill the factors with <unk> rather than raising an exception here. In order to make the user aware, we should log to stderr that we failed to parse the input. This way it is not possible for a single wrongly formatted input to crash the process. For training this is a different story.
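
A sketch of that lenient behavior for the token-level '|' format (names such as parse_factored_tokens and UNK are placeholders; the actual handling landed in the commit noted below):

import logging
from typing import List, Tuple

logger = logging.getLogger(__name__)
UNK = '<unk>'  # placeholder for the unknown-word symbol

def parse_factored_tokens(line: str, num_factors: int,
                          delimiter: str = '|') -> Tuple[List[str], List[List[str]]]:
    """Split 'word|f1|f2 ...' into tokens and factor streams.
    Malformed tokens are padded with UNK instead of raising, and the problem is logged."""
    tokens = []  # type: List[str]
    factors = [[] for _ in range(num_factors)]  # type: List[List[str]]
    for token in line.split():
        pieces = token.split(delimiter)
        if len(pieces) != 1 + num_factors or not all(pieces):
            logger.error("Failed to parse factored token '%s'; filling factors with %s", token, UNK)
            pieces = [pieces[0] or UNK] + [UNK] * num_factors
        tokens.append(pieces[0])
        for i in range(num_factors):
            factors[i].append(pieces[1 + i])
    return tokens, factors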

fhieber and others added 2 commits February 9, 2018 17:46
* Do not throw exceptions but return empty outputs and log the errors.
decode_params.add_argument('--json-input',
action='store_true',
default=False,
help="If given, the CLI expects string-serialized json objects as input."
Contributor:

as this is the only place where a user gets information about the input format without looking at the source code, we should fully describe the format here: namely, which key/value pairs are expected, and maybe a tiny example with factors.

Contributor:

done

else:
self.input_sentences, self.target_sentences = input_sentences, target_sentences
self.inputs_sentences, self.target_sentences = inputs_sentences, target_sentences
Contributor:

as the check_condition was pulled into parallel_subsample it is no longer done for this branch. I would move the check_condition back out to do the check in either case.

Contributor:

done

:param target_sentences: Target sentences.
:param max_seq_len_source: Maximum source sequence length.
:param max_seq_len_target: Maximum target sequence length.
:return: The number of sentences as well as the mean and standard deviation of target to source length ratios.
"""
mean_and_variance = OnlineMeanAndVariance()

for target, source in zip(target_sentences, source_sentences):
source_len = len(source)
for target, (sources) in zip(target_sentences, zip(*sources_sentences)):
Contributor:

do the parentheses actually have an effect?

Contributor:

no, they can be removed

source_len = len(source)
for target, (sources) in zip(target_sentences, zip(*sources_sentences)):
check_condition(are_token_parallel(sources),
"Source sequences are not token-parallel: %s" % (str(sources)))
Contributor:

wouldn't this print all of the source sentences? I think the message would be fine; if anything, it would be great to point to the line which is not token-parallel.

"""
if not sequences:
return True
return all(len(s) == len(sequences[0]) for s in sequences)
Contributor:

you could also return True if sequences contains a single element.

Contributor:

good point, done!
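
For reference, a sketch of the check with that suggestion applied (mirroring the snippet above rather than quoting the merged code):

from typing import Sequence, Sized

def are_token_parallel(sequences: Sequence[Sized]) -> bool:
    """Returns True if all given sequences have the same length."""
    if not sequences or len(sequences) == 1:
        return True
    return all(len(s) == len(sequences[0]) for s in sequences)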

@@ -640,7 +798,6 @@ def _concat_translations(translations: List[Translation], start_id: int, stop_id
target_ids = [start_id]
attention_matrices = []
for idx, translation in enumerate(translations):
assert translation.target_ids[0] == start_id
Contributor:

why was this removed?

Contributor:

reinserted

jobj = json.loads(json_string, encoding=C.JSON_ENCODING)
tokens = jobj[C.JSON_TEXT_KEY]
tokens = list(data_io.get_tokens(tokens))
factors = jobj.get(C.JSON_FACTORS_KEY)
Contributor:

we need a check for the factors having as many elements as expected here as well, just like in the other two methods, no?

Contributor:

added

total_source_factor_size = sum(args.source_factors_num_embed)
if total_source_factor_size > 0:
adjusted_transformer_encoder_model_size = args.num_embed[0] + total_source_factor_size
check_condition(adjusted_transformer_encoder_model_size % 2 == 0 and
Contributor:

Where does the adjusted_transformer_encoder_model_size % 2 == 0 stem from? It's not mentioned in the message below.

Contributor:

positional embeddings require an even number.

Contributor:

but I fixed the error message as well.

:param input_factors: Source factor files.
:return: TranslatorInput objects.
"""
if input is None:
Contributor:

might make sense to name input differently as input is a builtin.

Contributor:

renamed to inp (and inp_factors)

output_handler=output_handler,
chunk_size=args.chunk_size,
input=args.input,
input_factors=args.input_factors,
Contributor:

would it make sense to add a check for the number of input_factors?

Contributor:

ok

sockeye/train.py Outdated
@@ -90,8 +90,8 @@ def check_arg_compatibility(args: argparse.Namespace):
adjusted_transformer_encoder_model_size = args.num_embed[0] + total_source_factor_size
check_condition(adjusted_transformer_encoder_model_size % 2 == 0 and
adjusted_transformer_encoder_model_size % args.transformer_attention_heads == 0,
"Sum of source factor sizes (%d) has to be a multiple of attention heads (%d)" % (
total_source_factor_size, args.transformer_attention_heads))
"Sum of source factor sizes (%d) has to be even and a multiple of attention heads (%d)" % (
Contributor:

I'm not sure this error message would be clear to a user. Maybe mention the embedding size "The source embedding size plus all source factor embeddings sizes (%d) has to be even and a ...".

Contributor:

Changed it to make it clearer AND to reflect the fact that we regard the source surface form as the first & primary factor:
Sum of source factor sizes, i.e. num-embed plus source-factors-num-embed, (%d) has to be even and a multiple of attention heads (%d)

@fhieber merged commit 5dcb60d into master Feb 19, 2018
@fhieber deleted the source_factors branch February 19, 2018 08:53