Faster preprocessing #18

Merged · 15 commits · Aug 4, 2021

Conversation

@thomasw21 thomasw21 commented Jul 25, 2021

The main idea is to have the workers write to disk themselves, which removes the communication between processes. It should work quite nicely with the ability to merge. Using a sample I have, I'm running at Processed 987700 documents (21903.35820268358 docs/s, 50.4128344196837 MB/s). with 48 workers. I haven't tested extensively, but I'd say it's pretty nice compared to 15 MB/s :D
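
For context, a minimal sketch of that shard-per-worker pattern, assuming a hypothetical process_shard worker and output naming scheme (the actual script's details differ):

import multiprocessing

def process_shard(shard):
    # Hypothetical worker: tokenize one slice of the input and write a partial
    # output file directly to disk, so tokenized data never has to be sent
    # back to the parent process.
    shard_id, lines = shard
    output_path = f"my-gpt2_worker_{shard_id}.bin"  # hypothetical naming scheme
    with open(output_path, "wb") as f:
        for line in lines:
            # ... tokenize `line` and append the token ids here ...
            pass
    return output_path

if __name__ == "__main__":
    shards = [(i, []) for i in range(48)]  # placeholder input slices
    with multiprocessing.Pool(processes=48) as pool:
        partial_files = pool.map(process_shard, shards)
    # `partial_files` can then be merged into a single indexed dataset.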

Also, GPT2Tokenizer has a memory issue where it holds a cache that can become arbitrarily big, so I ran everything using the HF tokenizer ("gpt2"), which was slightly slower (probably due to the non-existent cache). Please check #35 for more details.
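
For reference, the swap amounts to something like this (a sketch using the transformers API, not the exact integration in the script):

from transformers import AutoTokenizer

# Fast "gpt2" tokenizer from the transformers library; used here to sidestep
# the unbounded cache growth of the GPT2Tokenizer path (see #35).
tokenizer = AutoTokenizer.from_pretrained("gpt2")
token_ids = tokenizer("Hello world")["input_ids"]
print(token_ids)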

TODO

  • preprocess oscar and report time/speed
real	499m11.132s
user	27829m36.869s
sys	74m30.285s
  • check same output as merge variant
  • preprocess + run few steps small model

Review thread on tools/preprocess_data.py (outdated, resolved)
@thomasw21 thomasw21 mentioned this pull request Jul 26, 2021
@thomasw21 thomasw21 requested a review from stas00 July 26, 2021 14:49
Review threads on tools/preprocess_data.py (outdated, resolved)

stas00 commented Jul 26, 2021

This is really odd, I get exactly the same speed processing https://huggingface.co/bigscience/misc-test-data/resolve/main/stas/oscar-1GB.jsonl.xz with 8 workers on my setup, as compared to the script in master. I checked out your branch and did:

$ time python tools/preprocess_data_fast.py --input oscar-1GB.jsonl --output-prefix my-gpt2 \
--vocab  gpt2-vocab.json --dataset-impl mmap --tokenizer-type GPT2BPETokenizer \
--merge-file gpt2-merges.txt --append-eod --workers 8
[...]
real    1m38.991s
user    9m35.102s
sys     0m5.212s

or was I supposed to merge #17 into it?


stas00 commented Jul 26, 2021

Also the new script leaves behind a lot of temp files - but no need to deal with those until we see the speed is better and that this PR goes in.

thomasw21 (Member, Author) commented:

Can you try launching with a lot more workers, like 48? I obtained a substantial speedup on prepost.


stas00 commented Jul 28, 2021

Just to update this thread, as we were discussing this on slack: adding more workers wasn't helping. I only have 12 cores on my desktop, and things get slower once context switching kicks in.


thomasw21 commented Jul 31, 2021

I had memory issues with GPT2Tokenizer (#35). I was able to run this on oscar (1.2TB) on a single node (40 physical cores) using 60 workers and the HF version of the tokenizer:

real	499m11.132s
user	27829m36.869s
sys	74m30.285s

On average it handled 44 MB/s, making it about three times faster than the existing implementation. It can also benefit from the chunking strategy to run faster.
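
As a rough sanity check (assuming the 1.2TB figure is the on-disk jsonl size), the wall-clock time above implies a throughput in the same ballpark as the reported average:

# Back-of-the-envelope check from the timings above; assumes 1.2 TB ~= 1.2e6 MB.
total_mb = 1.2e6
wall_clock_s = 499 * 60 + 11  # real 499m11.132s
print(f"{total_mb / wall_clock_s:.0f} MB/s")  # ~40 MB/s, close to the reported 44 MB/s average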

cc @ontocord

@stas00 stas00 (Member) left a comment:

I re-ran your new version vs. the original and I'm getting no speed difference.

I tested on my home dev box (nvme) and also on JZ (both slow disk and SSD), with 10 and 40 workers. 40 workers is about 5% faster than 10. SSD was ~2% faster with your version.

Not sure how our setups differ such that you see a huge speed improvement. I have a feeling IO saturation might be the main bottleneck.

So we overcame it by using multiple nodes (not workers) on JZ.

Perhaps if we identify how yours is different, we can document it?

Review threads on tools/preprocess_data_fast.py (outdated, resolved)

thomasw21 commented Aug 2, 2021

Are you sure you're using the right script, i.e. preprocess_data_fast.py instead of preprocess_data.py?


stas00 commented Aug 3, 2021

I am.

So there is no doubt, here is what I'm running (changing only the number of workers):

# fast
python tools/preprocess_data_fast.py --input data/oscar-1GB.jsonl --output-prefix data/meg-gpt2-oscar-en-500-p3 --dataset-impl mmap --tokenizer-type PretrainedFromHF --tokenizer-name-or-path gpt2 --merge-file data/gpt2-merges.txt --vocab data/gpt2-vocab.json --append-eod --workers 12
# current
python tools/preprocess_data.py --input data/oscar-1GB.jsonl --output-prefix data/meg-gpt2-oscar-en-500-p3 --dataset-impl mmap --tokenizer-type PretrainedFromHF --tokenizer-name-or-path gpt2 --merge-file data/gpt2-merges.txt --vocab data/gpt2-vocab.json --append-eod --workers 12

Update: you shared on slack that you were using a different tokenizer.

I was using: --tokenizer-type GPT2BPETokenizer - you were using --tokenizer-type PretrainedFromHF --tokenizer-name-or-path gpt2

After updating the script to use your tokenizer:

  • on JZ with 40 cpu cores:

    • 60 workers: (fast) 0:19 vs (current) 0:42
    • 10 workers: (fast) 1:37 vs (current) 1:28
  • on my home machine using your tokenizer I get a slower outcome

    • 12 workers: (fast) 1:48 vs (current) 1:37

As you can see, the speedup only works when a machine can handle a lot of workers, i.e. has lots of cores. Otherwise, the new script is either on par or slower.

I'm processing a 1GB oscar slice.

So perhaps we keep both scripts and document which to use when. In which case perhaps the second needs to be called preprocess_data_tons_of_cpu_cores.py :)


thomasw21 commented Aug 3, 2021

So it's great you were able to obtain similar performance! (I've been mostly testing my script on JZ).

Concerning the usage of the tokenizer, I'm still unclear: I was able to obtain similar performance with the other tokenizer (GPT2BPETokenizer); there was just a memory issue that I've documented in #35.

I agree the naming wasn't great. I'll switch it and modify the README to reflect that we have two versions, depending on the number of workers you can provide.


stas00 commented Aug 3, 2021

Also, @adammoody in microsoft/DeepSpeedExamples#120 suggests working directly with HF datasets, skipping the jsonl stage, which is probably a great idea, since converting to jsonl isn't fast either - why handle the data twice if we could do it in one go? It would also dramatically reduce data storage needs.

Perhaps in another PR?

@stas00 stas00 self-requested a review August 3, 2021 21:15
@stas00 stas00 (Member) left a comment:

With a few suggestions, looks good.

Are we happy with the script name? Not too long? "with"/"of" don't contribute much other than length, so perhaps preprocess_data_many_cpus.py would be a good enough mnemonic? Not attached to the name though.

Review threads on README.md (outdated, resolved)
- workers >= 20
- cpus >= 20 (logical cores)
- large inputs: size >= 1GB

For example, using a setup with 40 physical cores (80 logical cores), we can run 60 workers on oscar (1.2TB) to increase the speed of preprocessing.
Member:

We aren't using logical cores on JZ (SBATCH --hint=nomultithread) and it'd just be confusing to the user; simply saying 40 cpu cores is the simplest IMHO, since the scripts say: #SBATCH --cpus-per-task=40

Member Author:

I think it's just confusing to say that there are 40 cpu cores, when you could run:

from multiprocessing import cpu_count
print(cpu_count())

and obtain 80. I think making the distinction is important for understanding why 60 workers make sense, despite being higher than 40.
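
For what it's worth, here is one way to make the distinction visible from inside a job (psutil is an extra dependency used only for the physical-core count; it is not something the script relies on):

import os
import psutil  # third-party; only needed for the physical-core count below

print(os.cpu_count())                   # logical cores the OS exposes (80 on this node)
print(len(os.sched_getaffinity(0)))     # cores this process may actually use (what Slurm grants)
print(psutil.cpu_count(logical=False))  # physical cores (40 on this node)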

Member:

Oh, I see - I thought SBATCH --hint=nomultithread guaranteed that virtual cores aren't used.

In which case, why aren't we using workers=80?

Member:

python -c "from multiprocessing import cpu_count; print(cpu_count())"
80

on:

#SBATCH --cpus-per-task=40
#SBATCH --hint=nomultithread

I'm not quite sure then what --hint=nomultithread is for. Do you know?

Member Author:

So if you run scontrol show node {node in cpu_p1} you see:

  • CPUTot=80
  • CoresPerSocket=20
  • Sockets=2
  • ThreadsPerCore=2

CoresPerSocket refers to physical cores, and I think ThreadsPerCore refers to hardware threads, also known as virtual cores. We could maybe ask for 60 cpus with --hint=multithread, though I have never tested that (I don't know how the billing works in this case, i.e. does it force you to obtain the threads on the same cpu cores?)
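
In other words, the 80 in CPUTot follows directly from those fields:

# Derived from the scontrol fields quoted above.
sockets, cores_per_socket, threads_per_core = 2, 20, 2
physical_cores = sockets * cores_per_socket        # 2 * 20 = 40
logical_cores = physical_cores * threads_per_core  # 40 * 2 = 80, i.e. CPUTot
print(physical_cores, logical_cores)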

@stas00 stas00 (Member), Aug 4, 2021:

most JZ nodes have only 40 cores (cpu or gpu-enabled node - it's the same)

I can't quite find where this is stated, but you can see here:

http://www.idris.fr/eng/jean-zay/cpu/jean-zay-cpu-exec_partition_slurm-eng.html

from:

Warning: Since October 11, 2019, any job requiring more than one node runs in exclusive mode: The nodes are not shared. Consequently, the use of a part of a node results in the entire node being counted. For example, the reservation of 41 cores (or 1 node + 1 core) results in the invoicing of 80 cores (or 2 nodes). On the other hand, the total memory of the reserved nodes is available (on the order of 160 usable GBs per node).

I guess it's covered better here:

http://www.idris.fr/eng/jean-zay/cpu/jean-zay-cpu-exec_alloc-mem-eng.html

On the cpu_p1 partition

Each node of the cpu_p1 partition offers 160 GB of usable memory. The memory allocation is automatically determined on the basis of:

4 GB per CPU core if hyperthreading is deactivated (Slurm option --hint=nomultithread).

For example, a job specifying --ntasks=1 --cpus-per-task=5 on the cpu_p1 partition has access to 1 x 5 x 4 GB = 20 GB of memory if hyperthreading is deactivated (if not, half of that memory).

So the hyperthreading deactivation seems to be a useless option, other than controlling the memory allocation as explained in the paragraph above.

If hyperthreading were deactivated, then we shouldn't get 80 cores with --cpus-per-task=40, but 40.

Member:

And if we have 80 cores, then we should use them all and not 60 - if the IO doesn't become saturated, that is.

@thomasw21 thomasw21 (Member, Author), Aug 4, 2021:

You can find some info here, though it doesn't show which partition:

http://www.idris.fr/eng/jean-zay/cpu/jean-zay-cpu-hw-eng.html

Concerning the hyperthreading options, I'd say the memory allocation is a direct consequence of the distinction between physical cores and logical cores. You have 40 physical cores and 160 GB of memory, so when you share proportionally you get 160 / 40 = 4 GB per physical core; when you activate hyperthreading you access logical cores, and there are twice as many, so 160 / 80 = 2 GB per logical core.

What I suggest is that we first merge this, and then start benchmarking whether we can use 80 cores without saturating I/O, or update some slurm scripts. Another option is simply not to ask for 40 cores.

Member:

I still miss the point of the whole --hint=nomultithread - clearly it makes no difference to actual usage core-wise, since the software always sees 80 cores and never 40.

If I understand correctly, --hint=nomultithread really only tells the SLURM scheduler whether to allocate x or 2x memory per core, and nothing else. So specifying:

#SBATCH --cpus-per-task=40
#SBATCH --hint=nomultithread

and:

#SBATCH --cpus-per-task=80
#SBATCH --hint=multithread

should be identical then.

Member Author:

Yeah that's what I'm not clear about yet ... Maybe we can ask Remy?


thomasw21 commented Aug 4, 2021

> Also, @adammoody in microsoft/DeepSpeedExamples#120 suggests working directly with HF datasets, skipping the jsonl stage, which is probably a great idea, since converting to jsonl isn't fast either - why handle the data twice if we could do it in one go? It would also dramatically reduce data storage needs.
>
> Perhaps in another PR?

It's probably a very good idea, though my intuition is that we're likely going to want some preprocessing on the raw dataset, for example selecting specific subsets, shuffling, or applying sampling strategies, which makes it much harder to handle via a script alone. I think a better solution would be to create some sort of dataset.to_megatron_dataset() method, where dataset would be an already processed dataset, with all the script arguments available as parameters. That method would either write the dataset, write and load it, or write/load a cached dataset; I think this will depend on the usage.
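
To illustrate (to_megatron_dataset and its parameters are purely hypothetical and do not exist today; this is only a sketch of the proposal):

from datasets import load_dataset

# An already processed dataset: subset selection, shuffling, sampling, etc.
dataset = load_dataset("oscar", "unshuffled_deduplicated_en", split="train")
dataset = dataset.shuffle(seed=42)

# Hypothetical method, sketched for discussion only. The idea is that it would
# accept the same options as tools/preprocess_data.py and write (or write and
# load) the indexed output directly:
#
# dataset.to_megatron_dataset(
#     output_prefix="my-gpt2",
#     tokenizer_type="PretrainedFromHF",
#     tokenizer_name_or_path="gpt2",
#     dataset_impl="mmap",
#     append_eod=True,
# )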

Either way, I think we should do this in another PR, and try to better understand the usage.

@stas00
Copy link
Member

stas00 commented Aug 4, 2021

> shuffling

Shuffling is very slow in datasets; that's why we use terashuf in step 3 of https://github.com/bigscience-workshop/bigscience/tree/master/data/oscar#how-pre-processing-was-done

The other mindset is to have many intermediate stages, as we just did with oscar, which makes it easier to debug and speed up each stage separately.


print("Opening", args.input)
simple_queue = multiprocessing.Queue(1_000) # we can also limit the number of elements to reduce the memory footprint.
chunk_size = 25
Contributor:

How about this variable?
Chunk size is a very important variable for speed.

Member Author:

Do you want it exposed via arguments? If so, feel free to open a PR, as indeed we could have fun optimising that value!
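
For reference, exposing it could look roughly like this (a sketch, not what the current script does):

import argparse

parser = argparse.ArgumentParser()
# ... the existing preprocess_data_fast.py arguments would go here ...
parser.add_argument("--chunk-size", type=int, default=25,
                    help="Number of documents handed to each worker at a time; "
                         "worth tuning, since it has a big impact on throughput.")
args = parser.parse_args()

chunk_size = args.chunk_size  # would replace the hard-coded `chunk_size = 25` above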

adammoody pushed a commit to adammoody/Megatron-DeepSpeed that referenced this pull request Dec 20, 2021
* cl update

* script update
janEbert pushed a commit to janEbert/Megatron-DeepSpeed that referenced this pull request Dec 13, 2022
…8s_action_runner

Feature/add k8s action runner