Source factors #275
Conversation
sockeye/training.py
Outdated
(source_embed,
 source_embed_length,
 source_embed_seq_len) = self.embedding_source.encode(source, source_length, source_seq_len, source_factors)
Another way to do this would be to encode factors as additional dimensions on the tensor; then the logic would go into assembling source
and choose the right subclass for the embedder, and the calls here would be the same. This would probably be cleaner...
I think this would be a good idea. If there are no factors, the data iterator should yield source data of shape (batch, time, 1) or (batch, time) [the latter is the current behavior]. If there are source factors present, it should yield batches with source data of shape (batch, time, 1+num_factors).
Since we elevated the embedding operation to a core model operation (by making it a model component along with encoder & decoder), you can implement the embedding encoder class to split on the third dimension if num_factors > 0.
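The shape convention proposed here can be sketched in plain numpy (illustrative only, not sockeye code):

```python
import numpy as np

# Sketch of the proposed convention: the data iterator yields source data of
# shape (batch, time, 1 + num_factors), where slice 0 holds the word ids and
# slices 1..num_factors hold one factor stream each.
batch, time, num_factors = 2, 5, 2
source = np.arange(batch * time * (1 + num_factors)).reshape(batch, time, 1 + num_factors)

# The embedding encoder can then split on the third dimension:
words = source[:, :, 0]                                      # (batch, time) word ids
factors = [source[:, :, 1 + i] for i in range(num_factors)]  # one (batch, time) slice per factor
```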
sockeye/inference.py
Outdated
new_tokens = self.tokens[i:i + chunk_size]
new_sent = ' '.join(new_tokens)
new_factors = [self.factors[f][i: i + chunk_size] for f in range(self.num_factors)]
yield TranslatorInput(id=self.id, chunk_id=chunk_id, sentence=new_sent, tokens=new_tokens, factors=new_factors)
Since we have to carry factors around now, splitting for sentences that are too long is a bit more complicated, so this is now a class instead of a namedtuple. I also got rid of InputChunk, rolling it into TranslatorInput, which has an optional additional field, chunk_id.
sockeye/inference.py
Outdated
source_factors = [mx.sym.Variable(C.SOURCE_NAME + "_factor_" + str(i)) for i in range(self.config.num_factors)]
(source_embed,
 source_embed_length,
 source_embed_seq_len) = self.embedding_source.encode(source, source_length, source_seq_len, source_factors)
See note below for training.py: it might be better to encode factors as additional dimensions on the source ndarray, which FactoredEmbedding would know to look for. We could also do away with FactoredEmbedding entirely, and have the embedding receive the number of factors when it is instantiated.
I think it might be a good idea to add the factor functionality to the existing Embedding class (via arguments in its constructor).
I will take a longer look tomorrow. Would it be possible to remove the new scripts folder from this PR to make the diff a little bit smaller? (We can think about adding those later, if needed.)
sockeye/data_io.py
Outdated
@@ -642,8 +676,12 @@ def get_prepared_data_iters(prepared_data_dir: str,
for shard_fname in shard_fnames:
    check_condition(os.path.exists(shard_fname), "Shard %s does not exist." % shard_fname)

source_vocab_fname = os.path.join(prepared_data_dir, C.VOCAB_SRC_NAME) + C.JSON_SUFFIX
target_vocab_fname = os.path.join(prepared_data_dir, C.VOCAB_TRG_NAME) + C.JSON_SUFFIX
# MJP: Why was this hard-coded instead of reading from the data config?
Because these entries are None in the case where the vocab is created on the fly and not given on the command line.
This means that the prepared data data.config file has "null" entries for the vocab, however. Is this intended? It seems more sensible for the data.config file to list the actual locations of the vocabularies, and then to load them from those locations, rather than hard-coding the path in the code and ignoring the value written to the config.
Here's a data config, for example, under current master:
shared_vocab: false
source: /redacted/path/train.bpe.lc.en.10
source_factor_vocabs:
- null
source_factors:
- /redacted/path/train.bpe.en.factor_case.10
target: /redacted/path/train.bpe.de.10
vocab_source: null
vocab_target: null
What is the purpose of the vocabulary paths? To record where a pre-built vocabulary came from?
sockeye/data_io.py
Outdated
@@ -741,6 +795,7 @@ def get_training_data_iters(source: str, target: str,

source_sentences = SequenceReader(source, vocab_source, add_bos=False)
target_sentences = SequenceReader(target, vocab_target, add_bos=True)
source_factor_sentences = [SequenceReader(f, v, add_bos=False) for f,v in zip(source_factors, source_factor_vocabs)]
Please reformat according to PEP 8: f,v -> f, v
sockeye/encoder.py
Outdated
:return: Encoded versions of input data (data, data_length, seq_len).
"""

embeddings = [ super().encode(data, data_length, seq_len)[0] ]
I think it would be nicer to get the length return from the word embeddings and return that, something like:
word_embeddings, embed_length, embed_seq_len = super().encode(data, data_length, seq_len)
factor_embeddings = []
[...]
embedding = mx.sym.concat(word_embeddings, factor_embeddings, ...)
return embedding, embed_length, embed_seq_len
sockeye/inference.py
Outdated
source_factors = [mx.sym.Variable(C.SOURCE_NAME + "_factor_" + str(i)) for i in range(self.config.num_factors)]
(source_embed,
 source_embed_length,
 source_embed_seq_len) = self.embedding_source.encode(source, source_length, source_seq_len, source_factors)
I think it might be a good idea to add the factor functionality to the existing Embedding class (via arguments in its constructor).
sockeye/inference.py
Outdated
def load_source_factor_vocabs(model_folder: str) -> List[Dict[str,int]]:
    """
    Loads source factor vocabs of the form "vocab.src.N" until one can no longer be found.
    This could alternately be done by first reading the number of factors from the config file.
I would suggest only trying to load as many vocabs as you expect from the config.
sockeye/utils.py
Outdated
:return: The streams, pasted together as a single string.
"""
for sentences in zip(*streams):
    yield '\t'.join([x.replace('\t', ' ').rstrip() for x in sentences]) + '\n' if add_eol else ''
I think you can even get away with () around the list comprehension, as join would consume that generator anyway.
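For illustration (not sockeye code), both forms below produce the same joined string, so the intermediate list built by the comprehension is unnecessary:

```python
# str.join happily consumes a generator expression, so wrapping the
# comprehension in [] only builds a throwaway list.
pieces = ["a\tb ", "1 "]
as_list = '\t'.join([x.replace('\t', ' ').rstrip() for x in pieces])
as_gen = '\t'.join(x.replace('\t', ' ').rstrip() for x in pieces)
```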
sockeye/vocab.py
Outdated
return build_from_paths(paths=[data],
                        num_words=num_words,
                        min_count=word_min_count) if vocab_path is None else vocab_from_json(vocab_path)
                        min_count=word_min_count) if not os.path.exists(vocab_path) else vocab_from_json(vocab_path)
why this change?
There is hard-coded logic about the vocabulary names scattered through the code (data_io, prepare_data, and train[ing]). This logic is often triggered when the vocabulary file names are set to None, since that is used to determine if a file needs to be read from a vocab file or built from a path. It made more sense to me to specify the vocab names at the top level and then key the logic to file existence. However, I went through and reverted all this.
Very nice addition! I like using the paste format to avoid issues with token-level factor delimiters or other markup. Would it make sense to also use this format for training? We could have a single flag like
@mjdenkowski Yes, this is a good idea. It also makes things easier for the workflow, because then there is only a single source factors file to deal with instead of a variable number of them.
Great feature, thanks!
I have two points we should discuss:

- Can we find a way to abstract away the factor code in the model classes? The factor configuration can be moved into a sub Config object for EmbeddingConfig so that ModelConfig would not need to know about it directly.
- I like the current way of providing multiple factors through different files since it follows the API of providing source and target through 2 different commandline arguments. It also does not require the user to paste multiple factor files together before-hand, which might be error-prone. Regarding inference, it's ok to use the tab input, but I think it would be good to try to define an API that accepts metadata (including source factors) for a translation request instead of moving the parsing of the input into the translator.
- This change needs to have unit tests, a system test, and should be tested with various model architectures and configurations. Transformer models will probably fail with the current version, as they expect the embedding size to be equal to their 'model size' (which is not the case if factor embeddings are concatenated).
sockeye/checkpoint_decoder.py
Outdated
@@ -40,6 +41,7 @@ class CheckpointDecoder:
:param inputs: Path to file containing input sentences.
:param references: Path to file containing references.
:param model: Model to load.
:param input_factors: Path to files containing input factors.
Paths to files
sockeye/encoder.py
Outdated
@@ -272,18 +277,32 @@ class Embedding(Encoder):
def __init__(self,
             config: EmbeddingConfig,
             prefix: str,
             embed_weight: Optional[mx.sym.Symbol] = None) -> None:
             embed_weight: Optional[mx.sym.Symbol] = None,
             factor_vocab_sizes: Optional[List[int]] = [],
why are these needed? Aren't they already part of the config passed in this constructor?
sockeye/encoder.py
Outdated
@@ -253,11 +253,16 @@ class EmbeddingConfig(Config):
def __init__(self,
             vocab_size: int,
             num_embed: int,
             dropout: float) -> None:
             dropout: float,
             source_factor_vocab_sizes: Optional[List[int]] = [],
These two arguments are optional lists of the same length, correct?
Would it make sense to structure this a little bit differently, for example by saying there is a FactorConfig object containing the vocab_size and the dim_size? The argument to EmbeddingConfig would then be an optional list of FactorConfigs.
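The suggested structure could be sketched like this (plain dataclasses standing in for sockeye's Config base class; the names vocab_size and dim_size come from the comment above):

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class FactorConfig:
    # One entry per source factor, pairing its vocabulary size with its embedding size.
    vocab_size: int
    dim_size: int

@dataclass
class EmbeddingConfig:
    vocab_size: int
    num_embed: int
    dropout: float
    # Instead of two parallel Optional[List[int]] arguments, a single optional
    # list of FactorConfigs; no factors is simply None.
    factor_configs: Optional[List[FactorConfig]] = None

config = EmbeddingConfig(vocab_size=30000, num_embed=512, dropout=0.1,
                         factor_configs=[FactorConfig(vocab_size=4, dim_size=8)])
```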
sockeye/data_io.py
Outdated
@@ -314,6 +314,8 @@ def shard_data(source_fname: str,
               target_fname: str,
               vocab_source: Dict[str, int],
               vocab_target: Dict[str, int],
               source_factors: List[str],
missing docstring
sockeye/data_io.py
Outdated
@@ -314,6 +314,8 @@ def shard_data(source_fname: str,
               target_fname: str,
               vocab_source: Dict[str, int],
               vocab_target: Dict[str, int],
               source_factors: List[str],
               source_factor_vocabs: List[Dict[str, int]],
typo: the type should be Optional[...]
sockeye/inference.py
Outdated
@@ -159,9 +160,10 @@ def sym_gen(source_seq_len: int):
source_length = utils.compute_lengths(source)

# source embedding
source_factors = [mx.sym.Variable(C.SOURCE_NAME + "_factor_" + str(i)) for i in range(self.config.num_factors)]
[nit]: I would move variable declaration up to line 159, next to the source variable.
done
sockeye/inference.py
Outdated
:param model_folder: The location of the model folder
:return: The list of source factor vocabs found in the model folder.
"""
return [vocab.vocab_from_json_or_pickle(os.path.join(model_folder, C.VOCAB_SRC_NAME) + "." + str(fi)) for fi in range(num_factors)]
Not that important but I'd prefer string formatting over concatenation :)
agreed, done
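For illustration, the two styles discussed in this nit produce the same path (a minimal sketch; "vocab.src" stands in for C.VOCAB_SRC_NAME):

```python
import os

model_folder, fi = "model", 0
vocab_src_name = "vocab.src"  # stands in for C.VOCAB_SRC_NAME

# Concatenation, as in the original line:
by_concat = os.path.join(model_folder, vocab_src_name) + "." + str(fi)
# String formatting, as preferred by the reviewer:
by_format = os.path.join(model_folder, "%s.%d" % (vocab_src_name, fi))
```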
sockeye/inference.py
Outdated
cache_output_layer_w_b=cache_output_layer_w_b)
models.append(model)
num_factors = model.SockeyeModel.load_config(os.path.join(model_folder, C.CONFIG_NAME)).num_factors
source_factor_vocabs = load_source_factor_vocabs(model_folder, num_factors)
This is strange: for multiple models it will return the last source_factor_vocabs, whereas for source/target vocabs we are returning the first. There should also be a check that all factor vocabs across models are identical (as in line 424).
Also, you do not seem to need the num_factors variable, so you can avoid reloading the config from disk. It's available from the model instance.
done
sockeye/inference.py
Outdated
:param tokens: List of input tokens.
:param factors: List of zero or more input factors.
"""
def __init__(self, id, sentence, tokens, chunk_id = 0, factors = []):
Typing information missing. Also, id is a reserved word; maybe we can rename it?
this is used elsewhere as well (e.g., TranslatedChunk), does this matter if it's just a function def? If you change it, we also have to change it in calls where the ID is passed
That'd be easy with an IDE :)
Agreed that it is not crucial in this case. Your decision whether you want to change it.
sockeye/inference.py
Outdated
def __str__(self):
    return '[%d.%d] %s' % (self.id, self.chunk_id, self.sentence)

def chunks(self, chunk_size: int):
return value missing
done
I think moving the factors into the source tensors for training (and then also for inference) would simplify the code quite a bit, as you said yourself.
sockeye/data_io.py
Outdated
@@ -411,6 +418,9 @@ def load(self,
               for (source_len, target_len), num_samples in zip(self.buckets, num_samples_per_bucket)]
data_target = [np.full((num_samples, target_len), self.pad_id, dtype=self.dtype)
               for (source_len, target_len), num_samples in zip(self.buckets, num_samples_per_bucket)]
data_source_factor_sources = [[np.full((num_samples, source_len), self.pad_id, dtype=self.dtype)
Could we also move the factor data into data_source, making it 3d? So without factors it would be (num_samples, source_len, 1), and with n factors (num_samples, source_len, 1+n).
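The proposed 3d buffer could be sketched like this (made-up sizes; not sockeye's actual loading code):

```python
import numpy as np

# One shared buffer instead of separate per-factor buffers: slice 0 holds the
# surface form, slices 1..n hold the factors, everything pre-filled with pad_id.
num_samples, source_len, num_factors, pad_id = 4, 10, 2, 0
data_source = np.full((num_samples, source_len, 1 + num_factors), pad_id, dtype='int32')

# Writing one sentence's words and factor ids into the shared buffer:
words = [5, 6, 7]
factor_ids = [[1, 1, 2], [3, 3, 3]]
data_source[0, :len(words), 0] = words
for k, ids in enumerate(factor_ids, start=1):
    data_source[0, :len(ids), k] = ids
```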
sockeye/data_io.py
Outdated
source = data[:n]
target = data[n:2 * n]
label = data[2 * n:]
num_parts = num_source_factors + 3
If you move the factor data into source, it's cleaner and it simplifies the loading logic a lot.
Re: factor input formats, I think we have some competing design goals:
Here are a few ideas to consider: if we like multiple files, what about folding factors into existing options? Something like:
If we add target factors, we don't need any additional args to extend this. Paste being error-prone is a good point, especially if some tabs sneak into training files. A user could get "pre-tokenized" data somewhere that isn't totally whitespace-sanitized, try to build a system, and get confusing errors. As reluctant as I am to suggest it, what about Moses-style factor delimiters?
The difference could be that we only scan for them if we're running in factored mode, so we don't crash on random characters when users don't want factored input. (Factored mode could be defined as having multiple input files or a specific flag in training, and based on model config at decode time.) This would give us a decode-time solution, and we could support both multi-file input and word-level factor-delimited input for training (both ease-of-use and consistency). As soon as we see a single token with the wrong number of factors, we throw an error and report the exact problem to the user.
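A strict version of the Moses-style delimiter scanning described above could look like this (illustrative helper, not sockeye code; the input string is the example from the thread):

```python
# Scan Moses-style factored input ("word|factor1|factor2"), only meaningful in
# factored mode, failing loudly on any token with the wrong number of factors.
def split_factored_line(line: str, num_factors: int, delimiter: str = '|'):
    tokens, factor_streams = [], [[] for _ in range(num_factors)]
    for position, token in enumerate(line.split()):
        pieces = token.split(delimiter)
        if len(pieces) != 1 + num_factors or not all(pieces):
            raise ValueError("Token %d (%r) does not have %d factor(s)" % (position, token, num_factors))
        tokens.append(pieces[0])
        for stream, factor in zip(factor_streams, pieces[1:]):
            stream.append(factor)
    return tokens, factor_streams

tokens, factors = split_factored_line("An|O exam@@|B ple|I", num_factors=1)
# tokens == ['An', 'exam@@', 'ple'], factors == [['O', 'B', 'I']]
```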
Thanks Michael, I think for training time moving the factor files to the
Regarding decoding input formats: I do not care so much about the type of separator for command line input but more about how we build it into the framework. The inference code (including the translator) should NOT do any kind of parsing but expose a well-defined API that accepts input strings and corresponding metadata in a structured way. This would cover the library use case.
I would prefer to include the factors in the input file (at least for the CLI), i.e. the "An|O exam@@|B ple|I" proposed by Michael. I think this is less error prone. There should be a flag to specify the use of factors and an option to define the separator. The parsing of the input lines can be left out of the API and be done in the translate tool.
I kind of like @mjdenkowski's suggestions for adding factors to the source list, and doing the same with num embed. It would not be hard to separate this internally and do a check to ensure the lists were the same length. In general I like multi-purpose flags instead of new flags if their use cases are clear. But I am also fine to stick with
For input, I would strongly prefer tab-delimited data to the Moses-style.
At test time, it seems clean to me to similarly rely only on pasted data on STDIN. We should also add support for
I agree with Matt's points on choosing tab over "|", but making it configurable would be the best option, as suggested by @davvil. We can set the default to tab. For training I am not strongly opposed to either of the options. If we leave in
I am not fully convinced that multi-source input is a use case we have to support in the future.
…acking in some functions
Let's decide whether we want this to be backwards compatible or breaking. If it is the latter, we can also change the name of the source vocabs from
I'm fine with making it backwards-incompatible. It's a fairly large change. If we do so, then we could think about pulling some things out of DataConfig that are there only for documentation purposes (unlike the length ratios). We could have a DataInfo object that is also stored in the model folder, but not used at inference time.
I created a PR to separate these two objects in #293
* BREAKING: Removed max_seq_len_{source,target} from ModelConfig
* BREAKING: Separate data statistics relevant for inference from data information relevant only for training.
sockeye/inference.py
Outdated
utils.check_condition(vocab.are_identical(*target_vocabs), "Target vocabulary ids do not match")
for fi in range(len(source_vocabs[0])):
maybe for clarity source_vocabs[0] could be pulled out to:
first_model_vocabs = source_vocabs[0]
done
sockeye/inference.py
Outdated
Returns a TranslatorInput object from a JSON object, serialized as a string.

:param sentence_id: An integer id.
:param json_string: A JSON object serialized as a string.
It might make sense to briefly mention the format in which the data is expected to be, namely the structure where there is one entry per factor (rather than something like a list of tuples [("token", "factor1", ...)]), etc.
done
sockeye/inference.py
Outdated
tokens = list(data_io.get_tokens(tokens))
factors = jobj.get(C.JSON_FACTORS_KEY)
if isinstance(factors, list):
why not the same "is None" check?
sockeye/inference.py
Outdated
factors = [[] for _ in range(model_num_source_factors - 1)] # type: List[Tokens] | ||
for token_id, token in enumerate(data_io.get_tokens(factored_string)): | ||
pieces = token.split(delimiter) | ||
utils.check_condition(all(pieces) and len(pieces) == model_num_source_factors, |
So far, no input to inference could cause the sockeye process to crash; e.g. empty inputs were just returned. All words are mapped to some word id (`<unk>` if nothing matches). I really think that we should continue to do so. Specifically for the factors, I would very much prefer to fill the factors with `<unk>` rather than raising an exception here. In order to make the user aware, we should log to stderr that we failed parsing the input. This way it is not possible for a single wrongly formatted input to crash the process. For training this is a different story.
* Do not throw exceptions but return empty outputs and log the errors.
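A sketch of that policy with an illustrative helper (not sockeye's actual function): malformed factored input is logged and mapped to an empty result instead of raising, so a single bad line cannot crash a long-running decode process.

```python
import logging

logger = logging.getLogger(__name__)

def parse_factored_input(line: str, num_factors: int, delimiter: str = '|'):
    """Return (tokens, factors); on malformed input, log the error and return
    empty lists instead of raising an exception."""
    tokens, factor_streams = [], [[] for _ in range(num_factors)]
    for token in line.split():
        pieces = token.split(delimiter)
        if len(pieces) != 1 + num_factors or not all(pieces):
            logger.error("Malformed factored token %r (expected %d factor(s)); returning empty input",
                         token, num_factors)
            return [], [[] for _ in range(num_factors)]
        tokens.append(pieces[0])
        for stream, factor in zip(factor_streams, pieces[1:]):
            stream.append(factor)
    return tokens, factor_streams
```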
decode_params.add_argument('--json-input',
                           action='store_true',
                           default=False,
                           help="If given, the CLI expects string-serialized json objects as input."
As this is the only place where a user gets information about the input format without looking at the source code, we should fully describe the format here: namely, which key/value pairs are expected, and maybe a tiny example with factors.
done
else:
    self.input_sentences, self.target_sentences = input_sentences, target_sentences
    self.inputs_sentences, self.target_sentences = inputs_sentences, target_sentences
As the check_condition was pulled into parallel_subsample, it is no longer done for this branch. I would move the check_condition back out to do the check in either case.
done
sockeye/data_io.py
Outdated
:param target_sentences: Target sentences.
:param max_seq_len_source: Maximum source sequence length.
:param max_seq_len_target: Maximum target sequence length.
:return: The number of sentences as well as the mean and standard deviation of target to source length ratios.
"""
mean_and_variance = OnlineMeanAndVariance()

for target, source in zip(target_sentences, source_sentences):
source_len = len(source)
for target, (sources) in zip(target_sentences, zip(*sources_sentences)):
Do the parentheses actually have an effect?
no, they can be removed
sockeye/data_io.py
Outdated
source_len = len(source)
for target, (sources) in zip(target_sentences, zip(*sources_sentences)):
    check_condition(are_token_parallel(sources),
                    "Source sequences are not token-parallel: %s" % (str(sources)))
Wouldn't this print all of the source sentences? Otherwise I think the message would be fine; if anything, it would be great to point to the line which is not token-parallel.
sockeye/data_io.py
Outdated
"""
if not sequences:
    return True
return all(len(s) == len(sequences[0]) for s in sequences)
You could also return True if sequences contains a single element.
good point, done!
sockeye/inference.py
Outdated
@@ -640,7 +798,6 @@ def _concat_translations(translations: List[Translation], start_id: int, stop_id
target_ids = [start_id]
attention_matrices = []
for idx, translation in enumerate(translations):
    assert translation.target_ids[0] == start_id
why was this removed?
reinserted
jobj = json.loads(json_string, encoding=C.JSON_ENCODING)
tokens = jobj[C.JSON_TEXT_KEY]
tokens = list(data_io.get_tokens(tokens))
factors = jobj.get(C.JSON_FACTORS_KEY)
we need a check for the factors having as many elements as expected here as well, just like in the other two methods, no?
added
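For illustration, the JSON parsing plus the requested length check might look like this (hypothetical helper; the key names 'text' and 'factors' follow the PR description):

```python
import json

def parse_json_input(json_string: str, expected_factors: int):
    """Parse a request like {"text": "...", "factors": ["...", ...]} and
    check that the number of factor strings matches what the model expects."""
    jobj = json.loads(json_string)
    tokens = jobj["text"].split()
    factors = jobj.get("factors") or []
    if len(factors) != expected_factors:
        raise ValueError("Expected %d factor string(s), got %d" % (expected_factors, len(factors)))
    return tokens, [f.split() for f in factors]

tokens, factors = parse_json_input('{"text": "an example", "factors": ["O B"]}', expected_factors=1)
# tokens == ['an', 'example'], factors == [['O', 'B']]
```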
total_source_factor_size = sum(args.source_factors_num_embed)
if total_source_factor_size > 0:
    adjusted_transformer_encoder_model_size = args.num_embed[0] + total_source_factor_size
    check_condition(adjusted_transformer_encoder_model_size % 2 == 0 and
Where does the adjusted_transformer_encoder_model_size % 2 == 0 stem from? It's not mentioned in the message below.
positional embeddings require an even number.
But I fixed the error message as well.
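The constraint under discussion can be sketched as a standalone check (illustrative function name; sockeye's actual check lives in check_arg_compatibility in train.py): with factor embeddings concatenated, the transformer's effective model size is num-embed plus all factor embedding sizes, and it must be even (for the sinusoidal positional embeddings) and divisible by the number of attention heads.

```python
def check_transformer_embed_sizes(num_embed: int, factor_sizes, attention_heads: int) -> None:
    # Effective transformer model size after concatenating factor embeddings.
    model_size = num_embed + sum(factor_sizes)
    if model_size % 2 != 0 or model_size % attention_heads != 0:
        raise ValueError(
            "num-embed plus source-factors-num-embed (%d) must be even and a "
            "multiple of attention heads (%d)" % (model_size, attention_heads))

check_transformer_embed_sizes(num_embed=504, factor_sizes=[8], attention_heads=8)  # 512: valid
```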
sockeye/translate.py
Outdated
:param input_factors: Source factor files.
:return: TranslatorInput objects.
"""
if input is None:
Might make sense to name input differently, as input is a builtin.
renamed to inp (and inp_factors)
sockeye/translate.py
Outdated
output_handler=output_handler,
chunk_size=args.chunk_size,
input=args.input,
input_factors=args.input_factors,
would it make sense to add a check for the number of input_factors?
ok
sockeye/train.py
Outdated
@@ -90,8 +90,8 @@ def check_arg_compatibility(args: argparse.Namespace):
adjusted_transformer_encoder_model_size = args.num_embed[0] + total_source_factor_size
check_condition(adjusted_transformer_encoder_model_size % 2 == 0 and
                adjusted_transformer_encoder_model_size % args.transformer_attention_heads == 0,
                "Sum of source factor sizes (%d) has to be a multiple of attention heads (%d)" % (
                total_source_factor_size, args.transformer_attention_heads))
                "Sum of source factor sizes (%d) has to be even and a multiple of attention heads (%d)" % (
I'm not sure this error message would be clear to a user. Maybe mention the embedding size "The source embedding size plus all source factor embeddings sizes (%d) has to be even and a ...".
Changed it to make it clearer AND to reflect the fact that we regard the source surface form as the first & primary factor:
Sum of source factor sizes, i.e. num-embed plus source-factors-num-embed, (%d) has to be even and a multiple of attention heads (%d)
Added source factors, as described in.

Source factors are enabled by passing --source-factors file1 [file2 ...] (-sf), where file1, etc. are token-parallel to the source (-s). This option can be passed both to sockeye.train or in the data preparation step, if data sharding is used. An analogous parameter, --validation-source-factors, is used to pass factors for validation data.

The flag --source-factors-num-embed D1 [D2 ...] denotes the embedding dimensions. These are concatenated with the source word dimension (--num-embed), which can continue to be tied to the target (--weight-tying --weight-tying-type=src_trg).

At test time, the input sentence and its factors can be passed by multiple parallel files (--input and --input-factors) or through stdin with token-level annotations, separated by |. Another way is to send a string-serialized JSON object to the CLI through stdin, which needs to have a top-level key called 'text' and optionally a key 'factors' of type List[str].

Pull Request Checklist
- ... until you can check this box.
- Unit tests pass (pytest)
- System tests pass (pytest test/system)
- Passed code style checking (./pre-commit.sh or manual run of pylint & mypy)
- Updated major/minor version in sockeye/__init__.py. Major version bump if this is a backwards incompatible change.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.