Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding an end of sentence symbol to the source side. #392

Merged
merged 8 commits into from
May 15, 2018
Merged

Conversation

tdomhan
Copy link
Contributor

@tdomhan tdomhan commented May 15, 2018

With this change we add an additional EOS symbol on the source side. From the CLI/user perspective the --max-seq-len is the maximum length of the raw tokens without special symbols. Internally lengths are now based on the lengths including the special symbols like EOS. The change is backwards compatible in that the inference logic for existing models did not change.

System tests are currently running.

Pull Request Checklist

  • Changes are complete (if posting work-in-progress code, prefix your pull request title with '[WIP]'
    until you can check this box.
  • Unit tests pass (pytest)
  • Were system tests modified? If so did you run these at least 5 times to account for the variation across runs?
  • System tests pass (pytest test/system)
  • Passed code style checking (./style-check.sh)
  • You have considered writing a test
  • Updated major/minor version in sockeye/__init__.py. Major version bump if this is a backwards incompatible change.
  • Updated CHANGELOG.md

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

Copy link
Contributor

@fhieber fhieber left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for making this backwards compatible!

yield sequence


def create_sequence_readers(sources, target, vocab_sources, vocab_target):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could add type annotations.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@@ -423,6 +427,11 @@ def load_models(context: mx.context.Context,
utils.check_condition(vocab.are_identical(*[source_vocabs[i][fi] for i in range(len(source_vocabs))]),
"Source vocabulary ids do not match. Factor %d" % fi)

source_with_eos = models[0].source_with_eos
utils.check_condition(all(source_with_eos == m.source_with_eos for m in models),
"All models must match either take an additional EOS symbol on the source side or not. "
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

grammar, maybe: "All models must agree on using source-side EOS symbols or not"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds good, changed.

@@ -591,6 +600,13 @@ def chunks(self, chunk_size: int) -> Generator['TranslatorInput', None, None]:
factors=factors,
chunk_id=chunk_id)

def with_eos(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

missing docstring

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added

sockeye/train.py Outdated
@@ -258,12 +260,14 @@ def create_data_iters_and_vocabs(args: argparse.Namespace,
Create the data iterators and the vocabularies.

:param args: Arguments as returned by argparse.
:param max_seq_len_source: Source maximum sequence length.
:param max_seq_len_target: Target maximum sequence length.
:param shared_vocab: Whether to create a shared vocabulary.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

double docstring for shared_vocab

@fhieber
Copy link
Contributor

fhieber commented May 15, 2018

1 unit test is failing (test_arguments.py)

@tdomhan
Copy link
Contributor Author

tdomhan commented May 15, 2018

true, should be fixed now.

@tdomhan
Copy link
Contributor Author

tdomhan commented May 15, 2018

they seem to work now :)

Copy link
Contributor

@fhieber fhieber left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lgtm, thanks!

@fhieber
Copy link
Contributor

fhieber commented May 15, 2018

Feel free to merge when travis or system tests are ready.

@tdomhan tdomhan merged commit bb0b782 into master May 15, 2018
@tdomhan tdomhan deleted the eos_bos branch May 15, 2018 16:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants