
Source factors #275

Merged: 77 commits merged into master from source_factors on Feb 19, 2018

Conversation

@mjpost (Contributor) commented Jan 18, 2018

Added source factors, as described in:

Linguistic Input Features Improve Neural Machine Translation.
Rico Sennrich & Barry Haddow
In Proceedings of the First Conference on Machine Translation. Berlin, Germany, pp. 83-91.

Source factors are enabled by passing --source-factors file1 [file2 ...] (-sf), where file1, etc. are token-parallel to the source (-s).
This option can be passed either to sockeye.train or, if data sharding is used, in the data preparation step.
An analogous parameter, --validation-source-factors, is used to pass factors for validation data.
The flag --source-factors-num-embed D1 [D2 ...] denotes the embedding dimensions.
These are concatenated with the source word dimension (--num-embed), which can continue to be tied to the target (--weight-tying --weight-tying-type=src_trg).

At test time, the input sentence and its factors can be passed via multiple parallel files (--input and --input-factors) or through stdin with token-level annotations separated by |. Alternatively, a string-serialized JSON object can be sent to the CLI through stdin; it must have a top-level key 'text' and may optionally have a key 'factors' of type List[str].
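
For illustration, here is a minimal sketch of building the two stdin formats described above (the tokens and the case factor are made-up placeholders; the exact schema accepted by the CLI is the one described in this PR):

import json

tokens = ["An", "exam@@", "ple"]
case_factor = ["O", "B", "I"]  # hypothetical token-parallel source factor

# JSON input: a top-level 'text' key plus an optional 'factors' key of type List[str]
json_line = json.dumps({"text": " ".join(tokens), "factors": [" ".join(case_factor)]})
print(json_line)   # {"text": "An exam@@ ple", "factors": ["O B I"]}

# token-level annotation input: factor values attached to each token with '|'
print(" ".join("|".join(pair) for pair in zip(tokens, case_factor)))  # An|O exam@@|B ple|I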

Pull Request Checklist

  • Changes are complete (if posting work-in-progress code, prefix your pull request title with '[WIP]'
    until you can check this box).
  • Unit tests pass (pytest)
  • System tests pass (pytest test/system)
  • Passed code style checking (./pre-commit.sh or manual run of pylint & mypy)
  • You have considered writing a test
  • Updated major/minor version in sockeye/__init__.py. Major version bump if this is a backwards incompatible change.
  • Updated CHANGELOG.md

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

(source_embed,
source_embed_length,
source_embed_seq_len) = self.embedding_source.encode(source, source_length, source_seq_len, source_factors)

Contributor Author:

Another way to do this would be to encode factors as additional dimensions on the tensor; then the logic would go into assembling source and choose the right subclass for the embedder, and the calls here would be the same. This would probably be cleaner...

Contributor:

I think this would be a good idea. If there are no factors, the data iterator should yield source data of shape (batch, time, 1) or (batch, time) [the latter is the current behavior]. If source factors are present, it should yield batches with source data of shape (batch, time, 1+num_factors).
Since we elevated the embedding operation to a core model operation (by making it a model component along with the encoder & decoder), you can implement the embedding encoder class to split on the third dimension if num_factors > 0.
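
A minimal sketch of that shape handling with MXNet symbols (the function and parameter names are only illustrative, not the Sockeye implementation):

import mxnet as mx

def embed_factored_source(source, num_factors, vocab_sizes, embed_dims):
    # source: (batch, time, 1 + num_factors); column 0 holds the word ids,
    # columns 1..num_factors hold the factor ids.
    columns = mx.sym.split(data=source, num_outputs=1 + num_factors,
                           axis=2, squeeze_axis=True)  # each output: (batch, time)
    embedded = [mx.sym.Embedding(data=columns[i],
                                 input_dim=vocab_sizes[i],
                                 output_dim=embed_dims[i],
                                 name="source_embed_factor%d" % i)
                for i in range(1 + num_factors)]
    # concatenate word and factor embeddings along the feature dimension
    return mx.sym.concat(*embedded, dim=2)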

new_tokens = self.tokens[i:i + chunk_size]
new_sent = ' '.join(new_tokens)
new_factors = [self.factors[f][i: i + chunk_size] for f in range(self.num_factors)]
yield TranslatorInput(id=self.id, chunk_id=chunk_id, sentence=new_sent, tokens=new_tokens, factors=new_factors)
Contributor Author:

Since we have to carry factors around now, splitting for sentences that are too long is a bit more complicated, so this is now a class instead of a namedtuple. I also got rid of InputChunk, rolling it into TranslatorInput, which has an optional additional field, chunk_id.

source_factors = [mx.sym.Variable(C.SOURCE_NAME + "_factor_" + str(i)) for i in range(self.config.num_factors)]
(source_embed,
source_embed_length,
source_embed_seq_len) = self.embedding_source.encode(source, source_length, source_seq_len, source_factors)
Contributor Author:

See note below for training.py: it might be better to encode factors as additional dimensions on the source ndarray, which FactoredEmbedding would know to look for. We could also do away with FactoredEmbedding entirely, and have the embedding receive the number of factors when it is instantiated.

Contributor:

I think it might be a good idea to add the factor functionality to the existing Embedding class (via arguments in its constructor)

@fhieber (Contributor) left a comment:

I will take a longer look tomorrow. Would it be possible to remove the new scripts folder from this PR to make the diff a little bit smaller? (We can think about adding those later, if needed.)

@@ -642,8 +676,12 @@ def get_prepared_data_iters(prepared_data_dir: str,
for shard_fname in shard_fnames:
check_condition(os.path.exists(shard_fname), "Shard %s does not exist." % shard_fname)

source_vocab_fname = os.path.join(prepared_data_dir, C.VOCAB_SRC_NAME) + C.JSON_SUFFIX
target_vocab_fname = os.path.join(prepared_data_dir, C.VOCAB_TRG_NAME) + C.JSON_SUFFIX
# MJP: Why was this hard-coded instead of reading from the data config?
Contributor:

because these entries are None in the case where the vocab is created on the fly and not given on the commandline

@mjpost (Contributor Author), Jan 19, 2018:

This means that the prepared data's data.config file has "null" entries for the vocab, however. Is this intended? It seems more sensible for the data.config file to list the actual locations of the vocabularies, and then to load them from those locations, rather than hard-coding the path in the code and ignoring the value written to the config.

Here's a data config, for example, under current master:

shared_vocab: false
source: /redacted/path/train.bpe.lc.en.10
source_factor_vocabs:
- null
source_factors:
- /redacted/path/train.bpe.en.factor_case.10
target: /redacted/path/train.bpe.de.10
vocab_source: null
vocab_target: null

What is the purpose of the vocabulary paths? To record where a pre-built vocabulary came from?

@@ -741,6 +795,7 @@ def get_training_data_iters(source: str, target: str,

source_sentences = SequenceReader(source, vocab_source, add_bos=False)
target_sentences = SequenceReader(target, vocab_target, add_bos=True)
source_factor_sentences = [SequenceReader(f, v, add_bos=False) for f,v in zip(source_factors, source_factor_vocabs)]
Contributor:

please reformat according to pep8: f,v -> f, v

:return: Encoded versions of input data (data, data_length, seq_len).
"""

embeddings = [ super().encode(data, data_length, seq_len)[0] ]
Contributor:

I think it would be nicer to get the length return from the word embeddings and return that, something like:

word_embeddings, embed_length, embed_seq_len = super().encode(data, data_length, seq_len)
factor_embeddings = []
[...]
embedding = mx.sym.concat(word_embeddings, factor_embeddings, ...)
return embedding, embed_length, embed_seq_len

source_factors = [mx.sym.Variable(C.SOURCE_NAME + "_factor_" + str(i)) for i in range(self.config.num_factors)]
(source_embed,
source_embed_length,
source_embed_seq_len) = self.embedding_source.encode(source, source_length, source_seq_len, source_factors)
Contributor:

I think it might be a good idea to add the factor functionality to the existing Embedding class (via arguments in its constructor)

def load_source_factor_vocabs(model_folder: str) -> List[Dict[str,int]]:
"""
Loads source factor vocabs of the form "vocab.src.N" until one can no longer be found.
This could alternately be done by first reading the number of factors from the config file.
Contributor:

I would suggest only trying to load as many vocabs as you expect from the config.

sockeye/utils.py Outdated
:return: The streams, pasted together as a single string.
"""
for sentences in zip(*streams):
yield '\t'.join([x.replace('\t', ' ').rstrip() for x in sentences]) + ('\n' if add_eol else '')
Contributor:

I think you can even get away with () around the list comprehension as join anyway would consume that generator.
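
Applying that suggestion, the yield could simply read (same behavior, generator instead of a list):

yield '\t'.join(x.replace('\t', ' ').rstrip() for x in sentences) + ('\n' if add_eol else '')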

sockeye/vocab.py Outdated
return build_from_paths(paths=[data],
num_words=num_words,
min_count=word_min_count) if vocab_path is None else vocab_from_json(vocab_path)
min_count=word_min_count) if not os.path.exists(vocab_path) else vocab_from_json(vocab_path)
Contributor:

why this change?

Contributor Author:

There is hard-coded logic about the vocabulary names scattered through the code (data_io, prepare_data, and train[ing]). This logic is often triggered when the vocabulary file names are set to None, since that is used to determine if a file needs to be read from a vocab file or built from a path. It made more sense to me to specify the vocab names at the top level and then key the logic to file existence. However, I went through and reverted all this.

@mjdenkowski (Contributor):

Very nice addition! I like using the paste format to avoid issues with token-level factor delimiters or other markup. Would it make sense to also use this format for training? We could have a single flag like --source-factors present for both train and translate that enables parsing this format.

@mjpost (Contributor Author) commented Jan 22, 2018:

@mjdenkowski Yes, this is a good idea. It also makes things easier for the workflow, because then there is only a single source factors file to deal with instead of a variable number of them.

@fhieber (Contributor) left a comment:

Great feature, thanks!

I have three points we should discuss:

  1. Can we find a way to abstract away the factor code in the model classes? The factor configuration can be moved into a sub Config object for EmbeddingConfig so that ModelConfig would not need to know about it directly.

  2. I like the current way of providing multiple factors through different files since it follows the API of providing source and target through 2 different commandline arguments. It also does not require the user to paste multiple factor files together beforehand, which might be error-prone. Regarding inference, it's ok to use the tab input, but I think it would be good to try to define an API that accepts metadata (including source factors) for a translation request instead of moving the parsing of the input into the translator.

  3. This change needs to have unit tests, a system test, and should be tested with various model architectures and configurations. Transformer models will probably fail with the current version, as they expect embedding size to be equal to their 'model size' (which is not the case if factor embeddings are concatenated).

@@ -40,6 +41,7 @@ class CheckpointDecoder:
:param inputs: Path to file containing input sentences.
:param references: Path to file containing references.
:param model: Model to load.
:param input_factors: Path to files containing input factors.
Contributor:

Paths to files

@@ -272,18 +277,32 @@ class Embedding(Encoder):
def __init__(self,
config: EmbeddingConfig,
prefix: str,
embed_weight: Optional[mx.sym.Symbol] = None) -> None:
embed_weight: Optional[mx.sym.Symbol] = None,
factor_vocab_sizes: Optional[List[int]] = [],
Contributor:

why are these needed? Aren't they already part of the config passed in this constructor?

@@ -253,11 +253,16 @@ class EmbeddingConfig(Config):
def __init__(self,
vocab_size: int,
num_embed: int,
dropout: float) -> None:
dropout: float,
source_factor_vocab_sizes: Optional[List[int]] = [],
Contributor:

these two arguments are optional lists of the same length, correct?
Would it make sense to structure this a little bit differently, for example by defining a FactorConfig object containing the vocab_size and the dim_size?
The argument to EmbeddingConfig would then be an optional list of FactorConfigs.
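
A minimal sketch of the suggested structure (class and attribute names here simply mirror the comment; the merged Sockeye classes may differ):

from typing import List, Optional

class FactorConfig:
    def __init__(self, vocab_size: int, num_embed: int) -> None:
        self.vocab_size = vocab_size
        self.num_embed = num_embed

class EmbeddingConfig:
    def __init__(self,
                 vocab_size: int,
                 num_embed: int,
                 dropout: float,
                 factor_configs: Optional[List[FactorConfig]] = None) -> None:
        self.vocab_size = vocab_size
        self.num_embed = num_embed
        self.dropout = dropout
        self.factor_configs = factor_configs
        # the source surface form counts as the first (primary) factor
        self.num_factors = 1 + (len(factor_configs) if factor_configs else 0)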

@@ -314,6 +314,8 @@ def shard_data(source_fname: str,
target_fname: str,
vocab_source: Dict[str, int],
vocab_target: Dict[str, int],
source_factors: List[str],
Contributor:

missing docstring

@@ -314,6 +314,8 @@ def shard_data(source_fname: str,
target_fname: str,
vocab_source: Dict[str, int],
vocab_target: Dict[str, int],
source_factors: List[str],
source_factor_vocabs: List[Dict[str, int]],
Contributor:

typo: the type should be Optional[...]

@@ -159,9 +160,10 @@ def sym_gen(source_seq_len: int):
source_length = utils.compute_lengths(source)

# source embedding
source_factors = [mx.sym.Variable(C.SOURCE_NAME + "_factor_" + str(i)) for i in range(self.config.num_factors)]
Contributor:

[nit]: I would move variable declaration up to line 159, next to the source variable.

Contributor Author:

done

:param model_folder: The location of the model folder
:return: The list of source factor vocabs found in the model folder.
"""
return [vocab.vocab_from_json_or_pickle(os.path.join(model_folder, C.VOCAB_SRC_NAME) + "." + str(fi)) for fi in range(num_factors)]
Contributor:

Not that important but I'd prefer string formatting over concatenation :)

Contributor Author:

agreed, done

cache_output_layer_w_b=cache_output_layer_w_b)
models.append(model)
num_factors = model.SockeyeModel.load_config(os.path.join(model_folder, C.CONFIG_NAME)).num_factors
source_factor_vocabs = load_source_factor_vocabs(model_folder, num_factors)
Contributor:

this is strange: for multiple models it will return the last source_factor_vocabs, whereas for source/target vocabs we are returning the first. There should also be a check that all factor vocabs across models are identical (as in line 424).

Also, you do not seem to need the num_factors variable so you can avoid reloading the config from disk. It's available from the model instance.

Contributor Author:

done

:param tokens: List of input tokens.
:param factors: List of zero or more input factors.
"""
def __init__(self, id, sentence, tokens, chunk_id = 0, factors = []):
Contributor:

typing information missing.
id shadows a builtin, maybe we can rename it?

Contributor Author:

this is used elsewhere as well (e.g., TranslatedChunk); does it matter if it's just a parameter name in a function def? If we change it, we also have to change it in the calls where the ID is passed.

Contributor:

That'd be easy with an IDE :)
Agreed that it is not crucial in this case. Your decision whether you want to change it.

def __str__(self):
return '[%d.%d] %s' % (self.id, self.chunk_id, self.sentence)

def chunks(self, chunk_size: int):
Contributor:

return value missing

Contributor Author:

done

@fhieber (Contributor) left a comment:

I think moving the factors into the source tensors for training (and then also for inference) would simplify the code quite a bit, as you said yourself.

@@ -411,6 +418,9 @@ def load(self,
for (source_len, target_len), num_samples in zip(self.buckets, num_samples_per_bucket)]
data_target = [np.full((num_samples, target_len), self.pad_id, dtype=self.dtype)
for (source_len, target_len), num_samples in zip(self.buckets, num_samples_per_bucket)]
data_source_factor_sources = [[np.full((num_samples, source_len), self.pad_id, dtype=self.dtype)
Contributor:

could we also move the factor data into data_source, making it 3d?
So without factors it would be (num_samples, source_len, 1), and with n factors (num_samples, source_len, 1+n).
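
For illustration, a small numpy sketch of that 3-d allocation (function and argument names are hypothetical; only the shapes follow the comment above):

import numpy as np

def allocate_source_buffers(buckets, num_samples_per_bucket, num_factors,
                            pad_id=0, dtype='int32'):
    # one 3-d buffer per bucket: words in channel 0, factors in channels 1..num_factors
    return [np.full((num_samples, source_len, 1 + num_factors), pad_id, dtype=dtype)
            for (source_len, target_len), num_samples in zip(buckets, num_samples_per_bucket)]

# e.g. two buckets of source length 10 and 20, one source factor:
buffers = allocate_source_buffers(buckets=[(10, 12), (20, 22)],
                                  num_samples_per_bucket=[64, 32],
                                  num_factors=1)
# buffers[0].shape == (64, 10, 2), buffers[1].shape == (32, 20, 2)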

source = data[:n]
target = data[n:2 * n]
label = data[2 * n:]
num_parts = num_source_factors + 3
Contributor:

If you move the factor data into source, it's cleaner and simplifies the loading logic a lot

@mjdenkowski (Contributor):

Re: factor input formats, I think we have some competing design goals:

  • Specifying options and input files should be consistent with the rest of the codebase.
  • Formats should be consistent between train and decode.
  • Formats should be robust and easy to validate (avoid human error or at least fail fast).
  • Formats should be as simple as possible given the above.

Here are a few ideas to consider:

If we like multiple files, what about folding factors into existing options? Something like:

--source corpus.en corpus.en.factor2 corpus.en.factor3
--target corpus.de
--num-embed 512,64,64:512

If we add target factors, we don't need any additional args to extend this.

Paste being error-prone is a good point, especially if some tabs sneak into training files. A user could get "pre-tokenized" data somewhere that isn't totally whitespace sanitized, try to build a system, and get confusing errors. As reluctant as I am to suggest it, what about Moses-style factor delimiters?

An|O exam@@|B ple|I

The difference could be that we only scan for them if we're running in factored mode, so we don't crash on random characters when users don't want factored input. (Factored mode could be defined as having multiple input files or a specific flag in training and based on model config at decode time). This would give us a decode-time solution and we could support both multi-file input or word-level factor-delimited input for training (both ease-of-use and consistency). As soon as we see a single token with the wrong number of factors, we throw an error and report the exact problem to the user.

@fhieber (Contributor) commented Jan 24, 2018:

Thanks Michael, I think that for training time moving the factor files into the --source argument is nice, but it complicates the --num-embed flag parsing: you have to parse it, store the components in different variables, and pass them around in train.py (in addition to args) to check for equal length of embedding sizes and input files, etc.
It also makes the behavior of --source rather implicit. I would vote for being explicit and keeping the extra --source-factors and num-embed flags for factors. What does the rest of the team think about this (@tdomhan @davvil)?

Regarding decoding input formats: I do not care so much about type of separator for command line input but more about how we build it into the framework. The inference code (including the translator) should NOT do any kind of parsing but expose a well-defined API that accepts input strings and corresponding metadata in a structured way. This would cover the library use case.
For the CLI, we should have a set of input parsing functions/classes (similar to the OutputHandler concept) that take care of preparing structured input for the translate API. This would allow rigorous unit testing of these parsers.
Failing and informing the user should happen at two places: when parsing errors occur (e.g. two adjacent separator strings), and when an input request without metadata is passed to a model/Translator that expects metadata, and vice versa.

@davvil (Contributor) commented Jan 24, 2018:

I would prefer to include the factors in the input file (at least for the CLI), i.e. the "An|O exam@@|B ple|I" proposed by Michael. I think this is less error prone. There should be a flag to specify the use of factors and an option to define the separator. The parsing of the input lines can be left out of the API and be done in the translate tool.

@mjpost (Contributor Author) commented Jan 24, 2018:

I kind of like @mjdenkowski's suggestion of adding factors to the source list, and doing the same with num-embed. It would not be hard to separate this internally and do a check to ensure the lists are the same length. In general I like multi-purpose flags instead of new flags if their use cases are clear. But I am also fine with sticking with --source-factors and --source-factor-dims. Michael's thought about later fitting multi-source decoding into this API is a good one. I think this is an argument for keeping --source-factors separate: multiple sources would then be multiple arguments to --source. So I guess my vote is to keep them separate.

For input, I would strongly prefer tab-delimited data to the Moses-style.

  • Tab is already a metadata character, and it is better to make use of that than to introduce new metadata characters that we'd then have to ensure are handled and undone by Sockeye or wrapper scripts. The Moses support mailing list is full of people wondering why their decoder died when it hit a '|'.
  • It fits more easily with research workflows, where you generate multiple factors separately and store them in files. With the Moses format, you need an intervening script to generate it from the factors. Not a big deal, but one of those minor issues that help make life nicer.
  • Note that my paste() function strips tabs out of each of the input streams before pasting them, so there is no worry about errant tabs in the input. Since tabs should not be there anyway, this function does no harm and succinctly ensures the data is clean (at least from the tab perspective).

At test time, it seems clean to me to similarly rely only on pasted data on STDIN. We should also add support for --source-factors there for people passing input via -s instead.

@fhieber (Contributor) commented Jan 24, 2018:

I agree with Matt's points on choosing tab over "|", but making it configurable would be the best option, as suggested by @davvil. We can set the default to tab.
I guess it would then be nice to have something like a SourceFactorInputHandler class that takes the separator string as a parameter and implements an interface that 'decodes' from a string to a TranslateInput (an object of string + metadata).

For training I am not strongly opposed to either of the options. If we leave in --source-factors, I suggest renaming --source-factor-dims to --source-factors-num-embed to be consistent with num-embed for the 'primary' factor (words).

I am not fully convinced that multi-source input is a use case we have to support in the future.
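
A rough sketch of the kind of handler being proposed, for the tab-separated (stream-parallel) input variant; the class and method names are only illustrative of the suggestion, not an actual Sockeye API:

from typing import List, NamedTuple, Optional

class TranslateInput(NamedTuple):
    tokens: List[str]
    factors: List[List[str]]  # one list of values per factor

class SourceFactorInputHandler:
    """Decodes a factored input line into tokens plus metadata."""
    def __init__(self, num_factors: int, separator: str = '\t') -> None:
        self.num_factors = num_factors
        self.separator = separator

    def decode(self, line: str) -> Optional[TranslateInput]:
        fields = line.rstrip('\n').split(self.separator)
        if len(fields) != 1 + self.num_factors:
            return None  # caller logs the error instead of crashing
        streams = [f.split() for f in fields]
        if any(len(s) != len(streams[0]) for s in streams):
            return None  # factor streams must be token-parallel
        return TranslateInput(tokens=streams[0], factors=streams[1:])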

@fhieber (Contributor) commented Feb 8, 2018:

Let's decide whether we want this to be backwards compatible or breaking. If it is the latter, we can also change the name of the source vocabs from vocab.src.json to vocab.src.i.json, where i corresponds to the factor. This would make the loading and saving logic for vocabs cleaner.

@tdomhan (Contributor) commented Feb 9, 2018:

I'm fine with making it backwards-incompatible. It's a fairly large change. If we do so then we could think about pulling some things out of DataConfig that are there only for documentation purposes (unlike the length ratios). We could have a DataInfo object that is also stored in the model folder, but not used at inference time.

@fhieber (Contributor) commented Feb 9, 2018:

I created a PR to separate these two objects in #293.
UPDATE: merged.

fhieber and others added 4 commits February 9, 2018 12:03
* BREAKING: Removed max_seq_len_{source,target} from ModelConfig

* BREAKING: Separate data statistics relevant for inference from data information relevant only for training.
@fhieber (Contributor) commented Feb 9, 2018:

As far as I can tell, I addressed all comments. @davvil @tdomhan I would merge once both of you approve.

utils.check_condition(vocab.are_identical(*target_vocabs), "Target vocabulary ids do not match")
for fi in range(len(source_vocabs[0])):
Contributor:

maybe for clarity source_vocabs[0] could be pulled out to:

first_model_vocabs = source_vocabs[0]

Contributor:

done

Returns a TranslatorInput object from a JSON object, serialized as a string.

:param sentence_id: An integer id.
:param json_string: A JSON object serialized as a string.
Contributor:

It might make sense to briefly mention the format in which the data is expected to be? Namely, the structure where there is one entry per factor (rather than something like a list of tuples [("token", "factor1", ...)]), etc.

Contributor:

done


tokens = list(data_io.get_tokens(tokens))
factors = jobj.get(C.JSON_FACTORS_KEY)
if isinstance(factors, list):
Contributor:

why not the same "is None" check?

factors = [[] for _ in range(model_num_source_factors - 1)] # type: List[Tokens]
for token_id, token in enumerate(data_io.get_tokens(factored_string)):
pieces = token.split(delimiter)
utils.check_condition(all(pieces) and len(pieces) == model_num_source_factors,
Contributor:

So far no input to inference could cause the sockeye process to crash; e.g. empty inputs were just returned. All words are mapped to some word id (<unk> if nothing matches). I really think that we should continue to do so. Specifically for the factors I would very much prefer to fill the factors with <unk> rather than raising an exception here. In order to make the user aware, we should log to stderr that we failed to parse the input. This way it is not possible for a single wrongly formatted input to crash the process. For training this is a different story.
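
A sketch of that lenient behavior for the token-level '|' format (names such as parse_factored_tokens and UNK are placeholders; the actual handling landed in the commit noted below):

import logging
from typing import List, Tuple

logger = logging.getLogger(__name__)
UNK = '<unk>'  # placeholder for the unknown-word symbol

def parse_factored_tokens(line: str, num_factors: int,
                          delimiter: str = '|') -> Tuple[List[str], List[List[str]]]:
    """Split 'word|f1|f2 ...' into tokens and factor streams.
    Malformed tokens are padded with UNK instead of raising, and the problem is logged."""
    tokens = []  # type: List[str]
    factors = [[] for _ in range(num_factors)]  # type: List[List[str]]
    for token in line.split():
        pieces = token.split(delimiter)
        if len(pieces) != 1 + num_factors or not all(pieces):
            logger.error("Failed to parse factored token '%s'; filling factors with %s", token, UNK)
            pieces = [pieces[0] or UNK] + [UNK] * num_factors
        tokens.append(pieces[0])
        for i in range(num_factors):
            factors[i].append(pieces[1 + i])
    return tokens, factors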

fhieber and others added 2 commits February 9, 2018 17:46
* Do not throw exceptions but return empty outputs and log the errors.
decode_params.add_argument('--json-input',
action='store_true',
default=False,
help="If given, the CLI expects string-serialized json objects as input."
Contributor:

as this is the only place where a user gets information about the input format without looking at the source code, we should fully describe the format here: namely, which key/value pairs are expected, and maybe a tiny example with factors.

Contributor:

done

else:
self.input_sentences, self.target_sentences = input_sentences, target_sentences
self.inputs_sentences, self.target_sentences = inputs_sentences, target_sentences
Contributor:

as the check_condition was pulled into parallel_subsample it is no longer done for this branch. I would move the check_condition back out to do the check in either case.

Contributor:

done

:param target_sentences: Target sentences.
:param max_seq_len_source: Maximum source sequence length.
:param max_seq_len_target: Maximum target sequence length.
:return: The number of sentences as well as the mean and standard deviation of target to source length ratios.
"""
mean_and_variance = OnlineMeanAndVariance()

for target, source in zip(target_sentences, source_sentences):
source_len = len(source)
for target, (sources) in zip(target_sentences, zip(*sources_sentences)):
Contributor:

do the parentheses actually have an effect?

Contributor:

no, they can be removed

source_len = len(source)
for target, (sources) in zip(target_sentences, zip(*sources_sentences)):
check_condition(are_token_parallel(sources),
"Source sequences are not token-parallel: %s" % (str(sources)))
Contributor:

wouldn't this print all of the source sentences? I think the message would be fine; if anything, it would be great to point to the line which is not token-parallel.

"""
if not sequences:
return True
return all(len(s) == len(sequences[0]) for s in sequences)
Contributor:

you could also return True if sequences contains a single element.

Contributor:

good point, done!
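
For reference, a sketch of the check with that suggestion applied (mirroring the snippet above rather than quoting the merged code):

from typing import Sequence, Sized

def are_token_parallel(sequences: Sequence[Sized]) -> bool:
    """Returns True if all given sequences have the same length."""
    if not sequences or len(sequences) == 1:
        return True
    return all(len(s) == len(sequences[0]) for s in sequences)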

@@ -640,7 +798,6 @@ def _concat_translations(translations: List[Translation], start_id: int, stop_id
target_ids = [start_id]
attention_matrices = []
for idx, translation in enumerate(translations):
assert translation.target_ids[0] == start_id
Contributor:

why was this removed?

Contributor:

reinserted

jobj = json.loads(json_string, encoding=C.JSON_ENCODING)
tokens = jobj[C.JSON_TEXT_KEY]
tokens = list(data_io.get_tokens(tokens))
factors = jobj.get(C.JSON_FACTORS_KEY)
Contributor:

we need a check for the factors having as many elements as expected here as well, just like in the other two methods, no?

Contributor:

added

total_source_factor_size = sum(args.source_factors_num_embed)
if total_source_factor_size > 0:
adjusted_transformer_encoder_model_size = args.num_embed[0] + total_source_factor_size
check_condition(adjusted_transformer_encoder_model_size % 2 == 0 and
Contributor:

Where does the adjusted_transformer_encoder_model_size % 2 == 0 stem from? It's not mentioned in the message below.

Contributor:

positional embeddings require an even number.

Contributor:

but I fixed the error message as well.

:param input_factors: Source factor files.
:return: TranslatorInput objects.
"""
if input is None:
Contributor:

might make sense to name input differently as input is a builtin.

Contributor:

renamed to inp (and inp_factors)

output_handler=output_handler,
chunk_size=args.chunk_size,
input=args.input,
input_factors=args.input_factors,
Contributor:

would it make sense to add a check for the number of input_factors?

Contributor:

ok

sockeye/train.py Outdated
@@ -90,8 +90,8 @@ def check_arg_compatibility(args: argparse.Namespace):
adjusted_transformer_encoder_model_size = args.num_embed[0] + total_source_factor_size
check_condition(adjusted_transformer_encoder_model_size % 2 == 0 and
adjusted_transformer_encoder_model_size % args.transformer_attention_heads == 0,
"Sum of source factor sizes (%d) has to be a multiple of attention heads (%d)" % (
total_source_factor_size, args.transformer_attention_heads))
"Sum of source factor sizes (%d) has to be even and a multiple of attention heads (%d)" % (
Contributor:

I'm not sure this error message would be clear to a user. Maybe mention the embedding size "The source embedding size plus all source factor embeddings sizes (%d) has to be even and a ...".

Contributor:

Changed it to make it clearer AND to reflect the fact that we regard the source surface form as the first & primary factor:
Sum of source factor sizes, i.e. num-embed plus source-factors-num-embed, (%d) has to be even and a multiple of attention heads (%d)

@fhieber merged commit 5dcb60d into master Feb 19, 2018
@fhieber deleted the source_factors branch February 19, 2018 08:53