Faster preprocessing #18

Merged · 15 commits · Aug 4, 2021

Conversation

@thomasw21 thomasw21 commented Jul 25, 2021

The main idea is to have the workers write to disk themselves, which removes the communication between processes. It should work quite nicely with the ability to merge. Using a sample I have, I'm running at Processed 987700 documents (21903.35820268358 docs/s, 50.4128344196837 MB/s). with 48 workers. I haven't tested extensively, but I'd say it's pretty nice compared to 15 MB/s :D
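
For context, a minimal sketch of that shard-per-worker pattern, assuming a hypothetical process_shard worker and output naming scheme (the actual script's details differ):

import multiprocessing

def process_shard(shard):
    # Hypothetical worker: tokenize one slice of the input and write a partial
    # output file directly to disk, so tokenized data never has to be sent
    # back to the parent process.
    shard_id, lines = shard
    output_path = f"my-gpt2_worker_{shard_id}.bin"  # hypothetical naming scheme
    with open(output_path, "wb") as f:
        for line in lines:
            # ... tokenize `line` and append the token ids here ...
            pass
    return output_path

if __name__ == "__main__":
    shards = [(i, []) for i in range(48)]  # placeholder input slices
    with multiprocessing.Pool(processes=48) as pool:
        partial_files = pool.map(process_shard, shards)
    # `partial_files` can then be merged into a single indexed dataset.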

Also, GPT2Tokenizer has a memory issue where it holds a cache that can become arbitrarily big, so I ran everything using the HF tokenizer ("gpt2"), which was slightly slower (probably due to the non-existent cache). Please check #35 for more details.
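
For reference, the swap amounts to something like this (a sketch using the transformers API, not the exact integration in the script):

from transformers import AutoTokenizer

# Fast "gpt2" tokenizer from the transformers library; used here to sidestep
# the unbounded cache growth of the GPT2Tokenizer path (see #35).
tokenizer = AutoTokenizer.from_pretrained("gpt2")
token_ids = tokenizer("Hello world")["input_ids"]
print(token_ids)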

TODO

  • preprocess oscar and report time/speed
real	499m11.132s
user	27829m36.869s
sys	74m30.285s
  • check same output as merge variant
  • preprocess + run few steps small model

Review thread on tools/preprocess_data.py (outdated, resolved)
@thomasw21 thomasw21 mentioned this pull request Jul 26, 2021
@thomasw21 thomasw21 requested a review from stas00 July 26, 2021 14:49
Review threads on tools/preprocess_data.py (outdated, resolved)

stas00 commented Jul 26, 2021

This is really odd, I get exactly the same speed processing https://huggingface.co/bigscience/misc-test-data/resolve/main/stas/oscar-1GB.jsonl.xz with 8 workers on my setup, as compared to the script in master. I checked out your branch and did:

$ time python tools/preprocess_data_fast.py --input oscar-1GB.jsonl --output-prefix my-gpt2 \
--vocab  gpt2-vocab.json --dataset-impl mmap --tokenizer-type GPT2BPETokenizer \
--merge-file gpt2-merges.txt --append-eod --workers 8
[...]
real    1m38.991s
user    9m35.102s
sys     0m5.212s

or was I supposed to merge #17 into it?


stas00 commented Jul 26, 2021

Also the new script leaves behind a lot of temp files - but no need to deal with those until we see the speed is better and that this PR goes in.

thomasw21 (Member, Author) commented:

Can you try launching with a lot more workers, like 48? I obtained a substantial speedup on prepost.


stas00 commented Jul 28, 2021

Just to update this thread, as we were discussing this on slack: adding more workers wasn't helping. I only have 12 cores on my desktop, and things get slower once context switching kicks in.


thomasw21 commented Jul 31, 2021

I had memory issues with GPT2Tokenizer (#35). I was able to run this on oscar (1.2TB) on a single node (40 physical cores) using 60 workers and the HF version of the tokenizer:

real	499m11.132s
user	27829m36.869s
sys	74m30.285s

On average it handled 44 MB/s, making it about three times faster than the existing implementation. It can also benefit from the chunking strategy to run faster.
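
As a rough sanity check (assuming the 1.2TB figure is the on-disk jsonl size), the wall-clock time above implies a throughput in the same ballpark as the reported average:

# Back-of-the-envelope check from the timings above; assumes 1.2 TB ~= 1.2e6 MB.
total_mb = 1.2e6
wall_clock_s = 499 * 60 + 11  # real 499m11.132s
print(f"{total_mb / wall_clock_s:.0f} MB/s")  # ~40 MB/s, close to the reported 44 MB/s average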

cc @ontocord

@stas00 stas00 (Member) left a comment:

I re-ran your new version vs. the original and I'm getting no speed difference.

I tested on my home dev box (nvme) and also on JZ (both slow disk and SSD), with 10 and 40 workers. 40 workers is about 5% faster than 10. SSD was ~2% faster with your version.

Not sure how our setups differ such that you see a huge speed improvement. I have a feeling IO saturation might be the main bottleneck.

So we overcame it by using multiple nodes (not workers) on JZ.

Perhaps if we identify how yours is different, we can document it?

Review threads on tools/preprocess_data_fast.py (outdated, resolved)

thomasw21 commented Aug 2, 2021

Are you sure you're using the right script, i.e. preprocess_data_fast.py instead of preprocess_data.py?


stas00 commented Aug 3, 2021

I am.

So there is no doubt, here is what I'm running (changing only the number of workers):

# fast
python tools/preprocess_data_fast.py --input data/oscar-1GB.jsonl --output-prefix data/meg-gpt2-oscar-en-500-p3 --dataset-impl mmap --tokenizer-type PretrainedFromHF --tokenizer-name-or-path gpt2 --merge-file data/gpt2-merges.txt --vocab data/gpt2-vocab.json --append-eod --workers 12
# current
python tools/preprocess_data.py --input data/oscar-1GB.jsonl --output-prefix data/meg-gpt2-oscar-en-500-p3 --dataset-impl mmap --tokenizer-type PretrainedFromHF --tokenizer-name-or-path gpt2 --merge-file data/gpt2-merges.txt --vocab data/gpt2-vocab.json --append-eod --workers 12

Update: you shared on slack that you were using a different tokenizer.

I was using: --tokenizer-type GPT2BPETokenizer - you were using --tokenizer-type PretrainedFromHF --tokenizer-name-or-path gpt2

After updating the script to use your tokenizer:

  • on JZ with 40 cpu cores:

    • 60 workers: (fast) 0:19 vs (current) 0:42
    • 10 workers: (fast) 1:37 vs (current) 1:28
  • on my home machine using your tokenizer I get a slower outcome

    • 12 workers: (fast) 1:48 vs (current) 1:37

As you can see, the speedup only works when a machine can handle a lot of workers, i.e. has lots of cores. Otherwise, the new script is either on par or slower.

I'm processing a 1GB oscar slice.

So perhaps we keep both scripts and document which to use when. In which case perhaps the second needs to be called preprocess_data_tons_of_cpu_cores.py :)


thomasw21 commented Aug 3, 2021

So it's great you were able to obtain similar performance! (I've been mostly testing my script on JZ).

Concerning the usage of the tokenizer, I'm still unclear: I was able to obtain similar performance with the other tokenizer (GPT2BPETokenizer); there was just a memory issue that I've documented in #35.

I agree the naming wasn't great. I'll switch it and modify the README to reflect that we have two versions, depending on the number of workers you can provide.


stas00 commented Aug 3, 2021

Also, @adammoody in microsoft/DeepSpeedExamples#120 suggests working directly with HF datasets, skipping the jsonl stage, which is probably a great idea, since converting to jsonl isn't fast either - why handle the data twice if we could do it in one go? It would also dramatically reduce data storage needs.

Perhaps in another PR?

@stas00 stas00 self-requested a review August 3, 2021 21:15
@stas00 stas00 (Member) left a comment:

With a few suggestions, looks good.

Are we happy with the script name? Not too long? "with"/"of" don't contribute much other than length, so perhaps preprocess_data_many_cpus.py would be a good enough mnemonic? Not attached to the name though.

Review threads on README.md (outdated, resolved)
- workers >= 20
- cpus >= 20 (logical cores)
- large inputs: size >= 1GB

For example, using a setup with 40 physical cores (80 logical cores), we can run 60 workers on oscar (1.2TB) to increase the speed of preprocessing.
Member:

We aren't using logical cores on JZ (SBATCH --hint=nomultithread) and it'd just be confusing to the user; simply saying 40 cpu cores is the simplest IMHO, since the scripts say: #SBATCH --cpus-per-task=40

Member Author:

I think it's just confusing to say that there are 40 cpu cores, when you could run:

from multiprocessing import cpu_count
print(cpu_count())

and obtain 80. I think making the distinction is important for understanding why 60 workers make sense, despite being higher than 40.
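
For what it's worth, here is one way to make the distinction visible from inside a job (psutil is an extra dependency used only for the physical-core count; it is not something the script relies on):

import os
import psutil  # third-party; only needed for the physical-core count below

print(os.cpu_count())                   # logical cores the OS exposes (80 on this node)
print(len(os.sched_getaffinity(0)))     # cores this process may actually use (what Slurm grants)
print(psutil.cpu_count(logical=False))  # physical cores (40 on this node)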

Member:

Oh, I see - I thought SBATCH --hint=nomultithread guaranteed that virtual cores aren't used.

In which case, why aren't we using workers=80?

Member:

python -c "from multiprocessing import cpu_count; print(cpu_count())"
80

on:

#SBATCH --cpus-per-task=40
#SBATCH --hint=nomultithread

I'm not quite sure then what --hint=nomultithread is for. Do you know?

Member Author:

So if you run scontrol show node {node in cpu_p1} you see:

  • CPUTot=80
  • CoresPerSocket=20
  • Sockets=2
  • ThreadsPerCore=2

CoresPerSocket refers to physical cores, and I think ThreadsPerCore refers to hardware threads, also known as virtual cores. We could maybe ask for 60 cpus with --hint=multithread, though I have never tested that (I don't know how the billing works in this case, i.e. does it force you to obtain the threads on the same cpu cores?)
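
In other words, the 80 in CPUTot follows directly from those fields:

# Derived from the scontrol fields quoted above.
sockets, cores_per_socket, threads_per_core = 2, 20, 2
physical_cores = sockets * cores_per_socket        # 2 * 20 = 40
logical_cores = physical_cores * threads_per_core  # 40 * 2 = 80, i.e. CPUTot
print(physical_cores, logical_cores)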

@stas00 stas00 (Member), Aug 4, 2021:

most JZ nodes have only 40 cores (cpu or gpu-enabled node - it's the same)

I can't quite find where this is stated, but you can see here:

http://www.idris.fr/eng/jean-zay/cpu/jean-zay-cpu-exec_partition_slurm-eng.html

from:

Warning: Since October 11, 2019, any job requiring more than one node runs in exclusive mode: The nodes are not shared. Consequently, the use of a part of a node results in the entire node being counted. For example, the reservation of 41 cores (or 1 node + 1 core) results in the invoicing of 80 cores (or 2 nodes). On the other hand, the total memory of the reserved nodes is available (on the order of 160 usable GBs per node).

I guess it's covered better here:

http://www.idris.fr/eng/jean-zay/cpu/jean-zay-cpu-exec_alloc-mem-eng.html

On the cpu_p1 partition

Each node of the cpu_p1 partition offers 160 GB of usable memory. The memory allocation is automatically determined on the basis of:

4 GB per CPU core if hyperthreading is deactivated (Slurm option --hint=nomultithread).

For example, a job specifying --ntasks=1 --cpus-per-task=5 on the cpu_p1 partition has access to 1 x 5 x 4 GB = 20 GB of memory if hyperthreading is deactivated (if not, half of that memory).

So the hyperthreading deactivation seems to be a useless option, other than controlling the memory allocation as explained in the paragraph above.

If hyperthreading were deactivated, then we shouldn't get 80 cores with --cpus-per-task=40, but 40.

Member:

And if we have 80 cores, then we should use them all and not 60 - if the IO doesn't become saturated, that is.

@thomasw21 thomasw21 (Member, Author), Aug 4, 2021:

You can find some info here, though it doesn't show which partition:

http://www.idris.fr/eng/jean-zay/cpu/jean-zay-cpu-hw-eng.html

Concerning the hyperthreading options, I'd say the memory allocation is a direct consequence of the distinction between physical cores and logical cores. You have 40 physical cores and 160 GB of memory, so when you share proportionally you get 160 / 40 = 4 GB per physical core; when you activate hyperthreading you access logical cores, and there are twice as many, so 160 / 80 = 2 GB per logical core.

What I suggest is that we first merge this, and then start benchmarking whether we can use 80 cores without saturating I/O, or update some slurm scripts. Another option is simply not to ask for 40 cores.

Member:

I still miss the point of the whole --hint=nomultithread - clearly it makes no difference to actual usage core-wise, since the software always sees 80 cores and never 40.

If I understand correctly, --hint=nomultithread really only tells the SLURM scheduler whether to allocate x or 2x memory per core, and nothing else. So specifying:

#SBATCH --cpus-per-task=40
#SBATCH --hint=nomultithread

and:

#SBATCH --cpus-per-task=80
#SBATCH --hint=multithread

should be identical then.

Member Author:

Yeah that's what I'm not clear about yet ... Maybe we can ask Remy?


thomasw21 commented Aug 4, 2021

> Also, @adammoody in microsoft/DeepSpeedExamples#120 suggests working directly with HF datasets, skipping the jsonl stage, which is probably a great idea, since converting to jsonl isn't fast either - why handle the data twice if we could do it in one go? It would also dramatically reduce data storage needs.
>
> Perhaps in another PR?

It's probably a very good idea, though my intuition is that we're likely going to want some preprocessing on the raw dataset, for example selecting specific subsets, shuffling, or applying sampling strategies, which makes it much harder to handle via a script alone. I think a better solution would be to create some sort of dataset.to_megatron_dataset() method, where dataset would be an already processed dataset, with all the script arguments available as parameters. That method would either write the dataset, write and load it, or write/load a cached dataset; I think this will depend on the usage.
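
To illustrate (to_megatron_dataset and its parameters are purely hypothetical and do not exist today; this is only a sketch of the proposal):

from datasets import load_dataset

# An already processed dataset: subset selection, shuffling, sampling, etc.
dataset = load_dataset("oscar", "unshuffled_deduplicated_en", split="train")
dataset = dataset.shuffle(seed=42)

# Hypothetical method, sketched for discussion only. The idea is that it would
# accept the same options as tools/preprocess_data.py and write (or write and
# load) the indexed output directly:
#
# dataset.to_megatron_dataset(
#     output_prefix="my-gpt2",
#     tokenizer_type="PretrainedFromHF",
#     tokenizer_name_or_path="gpt2",
#     dataset_impl="mmap",
#     append_eod=True,
# )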

Either way, I think we should do this in another PR, and try to better understand the usage.

@stas00
Copy link
Member

stas00 commented Aug 4, 2021

> shuffling

Shuffling is very slow in datasets; that's why we use terashuf in step 3 of https://github.com/bigscience-workshop/bigscience/tree/master/data/oscar#how-pre-processing-was-done

The other mindset is to have many intermediate stages, as we just did with oscar, which makes it easier to debug and speed up each stage separately.


print("Opening", args.input)
simple_queue = multiprocessing.Queue(1_000) # we can also limit the number of elements to reduce the memory footprint.
chunk_size = 25
Contributor:

How about this variable?
Chunk size is a very important variable for speed.

Member Author:

Do you want it exposed via arguments? If so, feel free to open a PR, as indeed we could have fun optimising that value!
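
For reference, exposing it could look roughly like this (a sketch, not what the current script does):

import argparse

parser = argparse.ArgumentParser()
# ... the existing preprocess_data_fast.py arguments would go here ...
parser.add_argument("--chunk-size", type=int, default=25,
                    help="Number of documents handed to each worker at a time; "
                         "worth tuning, since it has a big impact on throughput.")
args = parser.parse_args()

chunk_size = args.chunk_size  # would replace the hard-coded `chunk_size = 25` above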

adammoody pushed a commit to adammoody/Megatron-DeepSpeed that referenced this pull request Dec 20, 2021
* cl update

* script update
janEbert pushed a commit to janEbert/Megatron-DeepSpeed that referenced this pull request Dec 13, 2022
…8s_action_runner

Feature/add k8s action runner