
[bpe] Support BPE dropout #3232

Merged · merged 8 commits on Nov 11, 2020

Conversation

stephenroller (Contributor):

Patch description
This currently works only for BART and SlowByteLevelBPE; support for HuggingFace Tokenizers still needs to be added.

Testing steps
Internal experiments.
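(For readers new to the technique: a minimal sketch of the BPE-dropout idea this patch enables (Provilkov et al., 2020), not ParlAI's actual implementation. The function name and merge-table format here are illustrative. During training, each candidate merge is skipped with probability `dropout`, so the same word can segment differently across examples.)

```python
import random

def bpe_encode(word, merges, dropout=0.0):
    # `merges` maps a symbol pair to its merge rank (lower = applied earlier).
    symbols = list(word)
    while len(symbols) > 1:
        # Collect mergeable adjacent pairs, randomly dropping each candidate
        # with probability `dropout` (only nonzero at training time).
        pairs = [
            (merges[(a, b)], i)
            for i, (a, b) in enumerate(zip(symbols, symbols[1:]))
            if (a, b) in merges and random.random() >= dropout
        ]
        if not pairs:
            break
        _, i = min(pairs)  # apply the best-ranked surviving merge
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols

# With dropout=0.0 this is ordinary greedy BPE; with dropout=0.1 at train
# time, roughly 10% of merges are skipped, yielding varied segmentations.
```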

```diff
@@ -694,21 +697,6 @@ def sort(self, trim=True):
        assert len(self.freq) == len(self.ind2tok) == len(self.tok2ind)
        return sorted_pairs

    def parse(self, txt_or_vec, vec_type=list):
```
stephenroller (Contributor, Author):

This is an unrelated change, but just killing a bad method.

Contributor:

Nit: Probably best to put this in a separate change for rubber stamping?

@stephenroller stephenroller changed the base branch from master to gpt2specialtok November 8, 2020 22:54
@stephenroller stephenroller marked this pull request as ready for review November 8, 2020 22:58
stephenroller (Contributor, Author):

Should be finished now.

```python
        return parser

    def set_training_mode(self, mode: bool):
```
Contributor:

Nit: Make the function name more descriptive and reference BPE explicitly?

(The likelihood of set_training_mode being a name collision seems decently high.)

stephenroller (Contributor, Author):

That was intentional. BPE dropout is just one training-time decision you might make.

Contributor:

If this can take multiple values beyond just setting BPE to true, it should either be changed from a bool to an enum, or mode should be renamed use_bpe or something.

(All kinds of nits, but from a "someone else looking at the code, quickly figuring out what's supposed to happen here" perspective...)
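(Stepping outside the thread: a minimal sketch of the bool-flag pattern as it appears in this diff. Everything except the set_training_mode signature is hypothetical, not ParlAI's actual API.)

```python
class DictSketch:
    """Hypothetical dictionary agent illustrating the flag under discussion."""

    def __init__(self, bpe_dropout: float = 0.1):
        self.bpe_dropout = bpe_dropout  # only applied while training
        self._training = False

    def set_training_mode(self, mode: bool):
        # BPE dropout is one training-time behavior gated by this flag;
        # other train-only tokenization choices could hang off it too.
        self._training = mode

    def tokenize(self, text: str) -> list:
        dropout = self.bpe_dropout if self._training else 0.0
        # Stand-in tokenizer; a real one would run BPE with `dropout`,
        # as in the bpe_encode sketch above.
        return text.split()
```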


```python
        # possibly change tokenization methodology based on if this is a
        # training example
        if hasattr(self.dict, 'set_training_mode'):
```
Contributor:

I'm lightly confused by this piece of code: set_training_mode is a function on DictionaryAgent, but this if seems to be looking for it as a member variable?

(Noticing that this is stacked on top of another diff, so maybe it's defined there, but that makes it a tad tricky to review if so.)

stephenroller (Contributor, Author):

Functions are also attributes.
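(Concretely: methods are class attributes found by ordinary attribute lookup, so hasattr sees them. Dummy here is just an illustration:)

```python
class Dummy:
    def set_training_mode(self, mode: bool):
        self._training = mode

# hasattr() reports methods just like data members.
assert hasattr(Dummy(), 'set_training_mode')
```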

Contributor:

Are there cases where self.dict isn't a DictionaryAgent? Because otherwise, this will always return True, no?

moyapchen (Contributor) left a review:

Nits, but otherwise seems fine

Base automatically changed from gpt2specialtok to master November 11, 2020 00:17