Add LRU cache, add faster tokenization #37
Adding an LRU cache and speeding up tokenization.
Removing the _old method. Note that the Chinese token processing is optional and not currently used in training.
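For context, a minimal sketch of the caching idea (class and method names here are illustrative, not the exact code in this PR): the expensive step is the per-token BPE merge loop, and since the same words recur constantly in natural text, memoizing its result pays off.

from functools import lru_cache

class CachedBPESketch:
    # Illustrative only: memoize the per-token BPE computation.

    @lru_cache(maxsize=65536)  # bounded cache, least-recently-used eviction
    def bpe(self, token):
        # The real method repeatedly fuses the highest-priority byte
        # pair; elided here because only the caching is the point.
        return ' '.join(self._apply_merges(token))

    def _apply_merges(self, token):
        # placeholder for the actual merge algorithm
        return [token]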
Let's remove the Chinese code! Otherwise looks good! Some small modifications to improve the quality of the code, and I think we're good to go.
Also, I think it's safe to say that we won't be running Python 2. Typically that's enforced in setup.py.
def normalize_token_py3(self, token):
    token = ''.join(self.byte_encoder[b] for b in token.encode('utf-8'))
    ret = [bpe_token for bpe_token in self.bpe(token).split(' ')]
    return ret
Suggested change:

def normalize_token(self, token):
    # I think it's safe to say that sys.version_info[0] is constant across the whole run.
    if sys.version_info[0] == 2:
        token = ''.join(self.byte_encoder[ord(b)] for b in token)
    else:
        token = ''.join(self.byte_encoder[b] for b in token.encode('utf-8'))
    ret = [bpe_token for bpe_token in self.bpe(token).split(' ')]
    return ret
I think having the if statement outside the for loop is better.
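For what it's worth, a hedged sketch of the hoisted version (function names are illustrative, not the PR's actual code): pick the normalization function once, then call it unconditionally inside the loop.

import sys

def _normalize_py2(byte_encoder, token):
    # Python 2: iterating a str yields 1-char strings, hence ord().
    return ''.join(byte_encoder[ord(b)] for b in token)

def _normalize_py3(byte_encoder, token):
    # Python 3: iterating a bytes object yields ints directly.
    return ''.join(byte_encoder[b] for b in token.encode('utf-8'))

# Chosen once; sys.version_info cannot change mid-run.
_normalize = _normalize_py2 if sys.version_info[0] == 2 else _normalize_py3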
token = ''.join(self.byte_encoder[ord(b)] for b in token)
bpe_tokens.extend(bpe_token for bpe_token in self.bpe(token).split(' '))
Suggested change:

bpe_tokens.extend(self.normalize_token(token))
I think having the if statement outside the for loop is better.
I think a better version would be to just ignore that case, since Python 2 is not supported.
Yeah, but then it would be inconsistent, since we STILL have code that tests for Python 2.
Remove that code, and anywhere else similar branches are found?
Should I create another PR and close this one?
No, you can keep working on that branch, and the changes will be available in this PR directly.
Just a small comment: I recently processed the mC4 dataset. I found that on a machine with a fair amount of CPUs (~65-128), the given pre-processing is fairly fast. The processing speed is around 25-30 MiB/s.
Note: on JZ we have 40 CPU cores per node. Also note, we don't need to support Python 2. Almost no project does any longer.
@sbmaruf hopefully with #18 we should obtain higher speeds; it usually reaches 40-45 MB/s on JZ using 40 physical CPU cores. When @ontocord and I tried to combine this PR with #18, we obtained 60 MB/s (though there's still the task of merging datasets afterwards). I'll check again once #18 is merged. We've benchmarked some preprocessing, though that branch is gone now. Basically, this tokenizer was much faster than the HF tokenizer (even the fast version) thanks to the cache mechanism.
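For anyone re-running the comparison, a rough harness along these lines is enough to see the cache effect (the tokenize callable and the sample path are placeholders, not names from this repo):

import time

def throughput_mb_s(tokenize, path):
    # Returns MB/s for tokenizing the file at `path` line by line.
    n_bytes = 0
    start = time.perf_counter()
    with open(path, encoding='utf-8') as f:
        for line in f:
            n_bytes += len(line.encode('utf-8'))
            tokenize(line)
    return n_bytes / (time.perf_counter() - start) / 1e6

Calling it twice with the same tokenizer instance shows cold-cache vs. warm-cache throughput.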
FYI, I think I saw this in another branch, but I don't know when this was changed in preprocess_data: this line

from megatron.data.indexed_dataset import best_fitting_dtype

needs to come after this line:

sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__),

This doesn't have anything to do with the tokenizer. I can push it in this PR too if you guys want. It's a different issue.
The path needs to be set before we can find the "megatron" package.
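Concretely, the top of preprocess_data.py would have to look something like this (the exact relative path appended is an assumption; use whatever the script already appends):

import os
import sys

# The repo root must be on sys.path *before* any `megatron` import runs,
# otherwise `from megatron.data.indexed_dataset import ...` cannot resolve.
sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__),
                                             os.path.pardir)))

from megatron.data.indexed_dataset import best_fitting_dtype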
Adding comments about max_token_len_cache
Also, as a note so we have a record of it: in the new gpt2 tokenize method, we use the class variable max_token_len_cache, fixed at 9 for now, to determine whether to cache a normalized token. We can change this by setting tokenizer.max_token_len_cache = X to test performance vs. cache usage if we want. Thanks to Tal Perry for the suggestion to memoize the token.
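A sketch of that idea (the attribute name follows the comment above; everything else, including the plain-dict cache, is assumed): only short tokens are worth memoizing, since long tokens rarely repeat and would only evict useful entries.

class LengthGatedCacheSketch:
    max_token_len_cache = 9  # tune via tokenizer.max_token_len_cache = X

    def __init__(self, normalize_token):
        self._cache = {}
        self._normalize_token = normalize_token  # e.g. the method shown earlier

    def normalize_cached(self, token):
        # Long tokens are mostly unique: skip the cache entirely.
        if len(token) > self.max_token_len_cache:
            return self._normalize_token(token)
        hit = self._cache.get(token)
        if hit is None:
            hit = self._cache[token] = self._normalize_token(token)
        return hit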
@thomasw21 @stas00 lmk if you need anything else; otherwise we can merge.
Some minor comments. Otherwise LGTM! Thanks!
Co-authored-by: Thomas Wang <24695242+thomasw21@users.noreply.github.com>
Thank you, @ontocord
Left a few minor style suggestions.
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
Looks like GitHub hides suggestions when they are made on code that has already been resolved; please also merge this:
Can we merge this now?
Note that this supports LRU caching in Python 3 only. In Python 2, the caching is removed from BPE altogether.
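A minimal sketch of how such version-gating could be wired (assumed structure, not the exact diff): apply functools.lru_cache where it exists, and leave the function untouched on Python 2, whose stdlib lacks it.

try:
    from functools import lru_cache  # Python 3
except ImportError:                  # Python 2: no stdlib lru_cache
    lru_cache = None

def maybe_lru_cache(maxsize=65536):
    # Decorator: cache on Python 3, plain passthrough on Python 2.
    def decorate(fn):
        if lru_cache is not None:
            return lru_cache(maxsize=maxsize)(fn)
        return fn
    return decorate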