
Add LRU cache, add faster tokenization #37

Merged (11 commits, Aug 5, 2021)
Conversation

huu4ontocord (Contributor):

Note that this supports LRU caching in Python 3 only. In Python 2, the caching is removed from BPE altogether.

Adding an LRU cache and speeding up tokenization. Removing the _old method. Note that the Chinese token processing is optional and is not currently used in training.
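
(For context, a minimal sketch of the general idea, with illustrative class and merge-table names only; this is not the PR's actual diff. functools.lru_cache only exists in the Python 3 standard library, which is why caching is dropped for Python 2.)

# Illustrative sketch only -- assumed class and merge table, not this PR's code.
from functools import lru_cache


class ToyBPE:
    def __init__(self, merge_ranks):
        # merge_ranks: {('l', 'o'): 0, ('lo', 'w'): 1, ...}; lower rank merges first
        self.merge_ranks = merge_ranks

    @lru_cache(maxsize=65536)  # Python 3 only: repeated tokens skip the merge loop
    def bpe(self, token):
        # Repeatedly merge the lowest-ranked adjacent pair until none remain.
        word = list(token)
        while len(word) > 1:
            pairs = [(word[i], word[i + 1]) for i in range(len(word) - 1)]
            best = min(pairs, key=lambda p: self.merge_ranks.get(p, float('inf')))
            if best not in self.merge_ranks:
                break
            i = pairs.index(best)
            word[i:i + 2] = [word[i] + word[i + 1]]
        return ' '.join(word)


# ToyBPE({('l', 'o'): 0, ('lo', 'w'): 1}).bpe('low')  # -> 'low'; repeat calls hit the cache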
@thomasw21 (Member) left a comment:
Let's remove the Chinese code! Otherwise looks good! Some small modifications to improve the quality of the code, and I think we're good to go.

Also, I think it's safe to say that we won't be running Python 2. Typically that's enforced in setup.py.
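
(For reference, a minimal sketch of how that is typically declared; the package name and version bound below are illustrative, not taken from this repo.)

# setup.py sketch -- illustrative name and version bound only.
from setuptools import setup, find_packages

setup(
    name='megatron-deepspeed',       # assumed name for illustration
    packages=find_packages(),
    python_requires='>=3.6',         # pip refuses to install under Python 2
)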

Comment on lines 283 to 286
def normalize_token_py3(self, token):
token = ''.join(self.byte_encoder[b] for b in token.encode('utf-8'))
ret = [bpe_token for bpe_token in self.bpe(token).split(' ')]
return ret
Member:
Suggested change
def normalize_token_py3(self, token):
token = ''.join(self.byte_encoder[b] for b in token.encode('utf-8'))
ret = [bpe_token for bpe_token in self.bpe(token).split(' ')]
return ret
def normalize_token(self, token):
# I think it's safe to say that sys.version_info[0] is constant across the whole run.
if sys.version_info[0] != 2:
token = ''.join(self.byte_encoder[ord(b)] for b in token)
else:
token = ''.join(self.byte_encoder[b] for b in token.encode('utf-8'))
ret = [bpe_token for bpe_token in self.bpe(token).split(' ')]
return ret

Contributor (author):
I think having the if statement outside the for loop is better.

Comment on lines +295 to +296
token = ''.join(self.byte_encoder[ord(b)] for b in token)
bpe_tokens.extend(bpe_token for bpe_token in self.bpe(token).split(' '))
Member:
Suggested change
token = ''.join(self.byte_encoder[ord(b)] for b in token)
bpe_tokens.extend(bpe_token for bpe_token in self.bpe(token).split(' '))
bpe_tokens.extend(self.normalize_token(token))

Contributor (author):
I think having the if statement outside the for loop is better.

Member:
I think a better version would be to just ignore that case, since Python 2 is not supported.

Contributor (author):
Yeah, but then it would be inconsistent, since we STILL have code that tests for Python 2.

@stas00 (Member), Aug 4, 2021:
remove that code and anywhere else where similar branches are found?

@huu4ontocord (Contributor, author):
Should I create another PR and close this one?

@thomasw21 (Member):
No, you can keep working on that branch, and the changes will be available in this PR directly.

@sbmaruf (Collaborator) commented on Aug 3, 2021:

Just a small comment: I recently processed the mC4 dataset. I found that if you have a machine with a fair number of CPUs (~65-128 cores), the given pre-processing is fairly fast; the processing speed is around 25-30 MiB/s.
Are we also benchmarking these faster tokenization methods? If yes, then I can benchmark them on various systems.
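
(A rough throughput benchmark along these lines could be shared across machines; a minimal sketch, where the tokenize callable and the corpus path are placeholders.)

# Rough throughput benchmark sketch; 'tokenize' and the corpus path are placeholders.
import time


def measure_mib_per_s(tokenize, path, encoding='utf-8'):
    total_bytes = 0
    start = time.perf_counter()
    with open(path, encoding=encoding) as f:
        for line in f:
            total_bytes += len(line.encode(encoding))
            tokenize(line)
    elapsed = time.perf_counter() - start
    return total_bytes / (1024 * 1024) / elapsed


# print(measure_mib_per_s(my_tokenizer.tokenize, 'mc4_shard.txt'))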

@stas00 (Member) commented on Aug 3, 2021:

Note: on JZ we have 40 CPU cores per node.

Also note that we don't need to support Python 2. Almost no project does any longer.

@thomasw21 (Member):
@sbmaruf hopefully with #18 we should obtain higher speeds; it usually reaches 40-45 MB/s on JZ using 40 physical CPU cores. When @ontocord and I tried to combine this PR with #18, we obtained 60 MB/s (though there's still the task of merging datasets afterwards). I'll check again once #18 is merged.

We've benchmarked some preprocessing, though that branch is gone now. Basically, this tokenizer was much faster than the HF tokenizer (even the fast version) thanks to the cache mechanism.

@huu4ontocord (Contributor, author) commented on Aug 4, 2021:

FYI, I think I saw this in another branch, but I don't know when this was changed in preprocess_data. This line:

from megatron.data.indexed_dataset import best_fitting_dtype

needs to come after this line:

sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__),
                                              os.path.pardir)))

This doesn't have anything to do with the tokenizer; I can push it in this PR too if you want, but it's a different issue.

The path needs to be set before we can find the "megatron" package.
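
(A sketch of the ordering described above; not the exact contents of the preprocessing script.)

# Ordering sketch for the top of the preprocessing script (not the exact file).
import os
import sys

# Make the repository root importable *before* any megatron import.
sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__),
                                              os.path.pardir)))

from megatron.data.indexed_dataset import best_fitting_dtype  # noqa: E402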
Adding comments about max_token_len_cache
@huu4ontocord (Contributor, author) commented on Aug 4, 2021:

Also, as a note so we have a record of it: in the new GPT-2 tokenize method we use the class variable max_token_len_cache, currently fixed at 9, to decide whether to cache a normalized token. We can change it with tokenizer.max_token_len_cache = X to test performance versus cache usage if we want.

Thanks to Tal Perry for the suggestion to memoize the token.
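
(A minimal sketch of that length-gated memoization idea; the surrounding class and the dict-based cache are illustrative, not the PR's exact mechanism.)

# Illustrative sketch of length-gated caching; not the PR's exact implementation.
class ToyTokenizer:
    max_token_len_cache = 9  # only tokens up to this length get memoized

    def __init__(self):
        self.cache = {}

    def normalize_token(self, token):
        # Placeholder for the real byte-encode + BPE step.
        return [token]

    def tokenize_token(self, token):
        if len(token) > self.max_token_len_cache:
            return self.normalize_token(token)       # long token: skip the cache
        if token not in self.cache:
            self.cache[token] = self.normalize_token(token)
        return self.cache[token]


# tok = ToyTokenizer()
# tok.max_token_len_cache = 12  # trade memory for more cache hits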

@huu4ontocord (Contributor, author):
@thomasw21 @stas00 lmk if you need anything else or otherwise we can merge.

@thomasw21 (Member) left a comment:
Some minor comments. Otherwise LGTM! Thanks!

Co-authored-by: Thomas Wang <24695242+thomasw21@users.noreply.github.com>
@huu4ontocord changed the title from "Add LRU cache, add faster tokenization, and add optional Chinese tokenization" to "Add LRU cache, add faster tokenization" on Aug 4, 2021.
@stas00 (Member) left a comment:
Thank you, @ontocord

Left a few minor style suggestions.

huu4ontocord and others added 2 commits August 4, 2021 17:00
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
@stas00 (Member) commented on Aug 4, 2021:

Looks like GitHub hides suggestions when they are made on code that has already been resolved. Please also merge this one:
https://github.com/bigscience-workshop/Megatron-DeepSpeed/pull/37/files#r682848490
Thanks.

@huu4ontocord (Contributor, author):
Can we merge this now?

@huu4ontocord merged commit 3628457 into bigscience-workshop:main on Aug 5, 2021.