Faster preprocessing #18
Conversation
This is really odd, I get exactly the same processing speed for https://huggingface.co/bigscience/misc-test-data/resolve/main/stas/oscar-1GB.jsonl.xz with 8 workers on my setup, as compared to the script in master. I checked out your branch and did:
Or was I supposed to merge #17 into it?

Also, the new script leaves behind a lot of temp files - but no need to deal with those until we see that the speed is better and that this PR goes in.

Can you try launching with a lot more workers? Like 48? I obtained a substantial speed-up on
Just to update this thread, as we were discussing this on Slack: getting more workers wasn't helping. I only have 12 cores on my desktop and things get slower once context switching kicks in.
I had memory issues with GPT2Tokenizer (#35). I was able to run this on oscar (1.2TB) with a single node (40 physical cores) using 60 workers and the HF version of the tokenizer:
On average it handled 44 MB/s, making it about three times faster than the existing implementation. It can also benefit from the chunking strategy to run faster. cc @ontocord
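(For reference, the quoted figures check out with some back-of-the-envelope arithmetic; the 1.2 TB corpus size, the ~44 MB/s rate, and the ~15 MB/s baseline are the numbers mentioned in this thread.)

```python
# Back-of-the-envelope check of the numbers quoted in this thread:
# a 1.2 TB corpus, ~44 MB/s with the new script vs ~15 MB/s before.
MB = 1024 ** 2
TB = 1024 ** 4

corpus_bytes = 1.2 * TB
new_rate = 44 * MB   # bytes/s
old_rate = 15 * MB   # bytes/s

speedup = new_rate / old_rate
hours_new = corpus_bytes / new_rate / 3600
hours_old = corpus_bytes / old_rate / 3600

print(f"speedup: {speedup:.1f}x")                         # ~2.9x, i.e. "about three times"
print(f"new: {hours_new:.1f} h, old: {hours_old:.1f} h")  # ~7.9 h vs ~23.3 h
```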
I re-ran your new version vs the original and I'm getting no speed difference.
I tested on my home dev box (NVMe) and also on JZ (both slow disk and SSD), with 10 and 40 workers. 40 workers was about 5% faster than 10. SSD was ~2% faster with your version.
I'm not sure how our setups differ such that you see a huge speed improvement. I have a feeling IO saturation might be the main bottleneck. So we overcame it by using multiple nodes (not workers) on JZ.
Perhaps once we identify how yours is different we can document it?
Are you sure you're using the right script?
I am. So there is no doubt, here is what I'm running (just changing the number of workers):
Update: you shared on Slack that you were using a different tokenizer. I was using:
After updating the script to use your tokenizer:
As you can see, the speed-up only works when a machine can handle a lot of workers, i.e. has lots of cores. Otherwise, the new script is either on par or slower. I'm processing a 1GB oscar slice. So perhaps we keep both scripts and document which to use when. In which case, perhaps the second needs to be called
So it's great you were able to obtain similar performance! (I've been mostly testing my script on JZ.) Concerning the usage of the tokenizer, I'm still unclear: I was able to obtain similar performance with the other tokenizer (GPT2BPETokenizer); there was just an issue with memory that I've documented in #35. I agree the naming wasn't great; I'll switch it and modify the README to reflect that we have two versions depending on the number of workers you can provide.
Also @adammoody here: microsoft/DeepSpeedExamples#120 suggests working directly with HF. Perhaps in another PR?
With a few suggestions, looks good.
Are we happy with the script name? Not too long? "with"/"of" don't contribute much other than length, so perhaps `preprocess_data_many_cpus.py` would be a good enough mnemonic? Not attached to the name though.
- workers >= 20
- cpus >= 20 (logical cores)
- large inputs: size >= 1GB
For example, on a setup with 40 physical cores (80 logical cores), we can run 60 workers on oscar (1.2T) to increase the speed of preprocessing.
We aren't using logical cores on JZ (`SBATCH --hint=nomultithread`), and it'd just be confusing to the user; saying 40 cpu cores is simplest IMHO, since the scripts say: `#SBATCH --cpus-per-task=40`
I think it's just confusing to say that there are 40 cpu cores when you could run:

```python
from multiprocessing import cpu_count
print(cpu_count())
```

and obtain 80. I think making the distinction is important to understand why 60 workers makes sense, despite being higher than 40.
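As a side note, `os.sched_getaffinity` (Linux-only) reports the CPUs the process is actually allowed to run on, which under SLURM typically reflects the job's allocation rather than the whole node. A quick sketch (the exact value depends on the cluster's cgroup/affinity setup):

```python
import os
from multiprocessing import cpu_count

# cpu_count() reports every logical core the OS exposes (80 on such a
# node), regardless of what SLURM actually allocated to this job.
logical = cpu_count()

# sched_getaffinity(0) is Linux-only: it returns the set of CPUs this
# process may run on, which under SLURM typically reflects
# --cpus-per-task. Fall back to cpu_count() on other platforms.
if hasattr(os, "sched_getaffinity"):
    usable = len(os.sched_getaffinity(0))
else:
    usable = logical

print(f"logical cores: {logical}, usable by this process: {usable}")
```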
Oh, I see, I thought `SBATCH --hint=nomultithread` guaranteed not to use virtual cores. In which case, why aren't we using `workers=80`?
```
$ python -c "from multiprocessing import cpu_count; print(cpu_count())"
80
```

on:

```
#SBATCH --cpus-per-task=40
#SBATCH --hint=nomultithread
```

I'm not quite sure then what `--hint=nomultithread` is for. Do you know?
So if you run `scontrol show node {node in cpu_p1}` you see:

```
CPUTot=80
CoresPerSocket=20
Sockets=2
ThreadsPerCore=2
```

`CoresPerSocket` refers to physical cores, and I think `ThreadsPerCore` refers to hardware threads, also known as virtual cores. We could maybe ask for 60 cpus with `--hint=multithread`, though I have never tested that (I don't know how the billing works in this case, i.e. does it force you to obtain threads on the same physical cores?)
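(These scontrol figures are internally consistent: total logical CPUs is sockets × cores per socket × threads per core.)

```python
# Sanity check on the `scontrol show node` figures quoted above.
sockets = 2
cores_per_socket = 20
threads_per_core = 2

physical = sockets * cores_per_socket   # 40 physical cores
logical = physical * threads_per_core   # 80 logical CPUs (CPUTot)

print(physical, logical)  # 40 80
```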
Most JZ nodes have only 40 cores (cpu or gpu-enabled node - it's the same).
I can't quite find where this is stated, but you can see here:
http://www.idris.fr/eng/jean-zay/cpu/jean-zay-cpu-exec_partition_slurm-eng.html
from:

> Warning: Since October 11, 2019, any job requiring more than one node runs in exclusive mode: The nodes are not shared. Consequently, the use of a part of a node results in the entire node being counted. For example, the reservation of 41 cores (or 1 node + 1 core) results in the invoicing of 80 cores (or 2 nodes). On the other hand, the total memory of the reserved nodes is available (on the order of 160 usable GBs per node).
I guess it's covered better here:
http://www.idris.fr/eng/jean-zay/cpu/jean-zay-cpu-exec_alloc-mem-eng.html

> On the cpu_p1 partition
> Each node of the cpu_p1 partition offers 160 GB of usable memory. The memory allocation is automatically determined on the basis of:
> 4 GB per CPU core if hyperthreading is deactivated (Slurm option --hint=nomultithread).
> For example, a job specifying --ntasks=1 --cpus-per-task=5 on the cpu_p1 partition has access to 1 x 5 x 4 GB = 20 GB of memory if hyperthreading is deactivated (if not, half of that memory).
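A minimal sketch of that allocation rule (the function name is mine, not an IDRIS API):

```python
# cpu_p1 memory rule as quoted above: 4 GB per reserved core with
# --hint=nomultithread, half of that (2 GB) with hyperthreading on.
def cpu_p1_memory_gb(ntasks, cpus_per_task, multithread=False):
    per_core_gb = 2 if multithread else 4
    return ntasks * cpus_per_task * per_core_gb

print(cpu_p1_memory_gb(1, 5))                    # 20, matching the docs' example
print(cpu_p1_memory_gb(1, 5, multithread=True))  # 10
print(cpu_p1_memory_gb(1, 40))                   # 160, a whole node's memory
```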
So the hyperthreading deactivation seems to be a useless option, other than controlling the memory allocation as explained in the paragraph above. If hyperthreading were deactivated, then we shouldn't get 80 cores with `--cpus-per-task=40`, but 40.
And if we have 80 cores, then we should use them all and not 60 - provided the IO doesn't become saturated, that is.
You can find some info here, though it doesn't show which partition:
http://www.idris.fr/eng/jean-zay/cpu/jean-zay-cpu-hw-eng.html
Concerning the hyperthreading options, I'd say the memory allocation is a direct consequence of the physical-core vs. logical-core distinction. You have 40 physical cores and 160 GB of memory, so when you share proportionally you get 160 / 40 = 4 GB per physical core; when you activate hyperthreading you access logical cores, and there are twice as many, so 160 / 80 = 2 GB per logical core.
What I suggest is we first merge this, and then start benchmarking whether we can use 80 cores without I/O saturation, or update some slurm scripts. Another option is just not to ask for 40 cores.
I still miss the point of the whole `--hint=nomultithread` - clearly it makes no difference to the actual usage core-wise, since the software always sees 80 cores and never 40.
If I understand correctly, `--hint=nomultithread` really only indicates to the SLURM scheduler whether to allocate x or 2x memory per core and nothing else. So specifying:

```
#SBATCH --cpus-per-task=40
#SBATCH --hint=nomultithread
```

and:

```
#SBATCH --cpus-per-task=80
#SBATCH --hint=multithread
```

should be identical then.
Yeah that's what I'm not clear about yet ... Maybe we can ask Remy?
It's probably a very good idea, though my intuition is that we're likely going to have some preprocessing on the raw dataset - for example selecting specific subsets, shuffling, or sampling strategies - making it much harder to handle via script only. I think a better solution would be to create some sort of
Either way, I think we should do this in another PR, and try to better understand the usage.

Shuffling is very slow in
The other mindset is having many intermediate stages, as we just did with oscar, which makes it easier to debug and speed up each stage separately.
```python
print("Opening", args.input)
simple_queue = multiprocessing.Queue(1_000)  # we can also limit the number of elements to reduce the memory footprint.
chunk_size = 25
```
How about this variable? Chunk size is a very important variable for speed.
Do you want it exposed via arguments? If so, feel free to open a PR, as indeed we could have fun optimising that value!
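A possible sketch, if we go that route (the flag name, default, and help text are hypothetical, not the script's current interface):

```python
import argparse

# Hypothetical --chunk-size flag; name and help text are illustrative.
parser = argparse.ArgumentParser()
parser.add_argument("--chunk-size", type=int, default=25,
                    help="documents pulled from the queue per worker step; "
                         "larger chunks reduce queue overhead but increase "
                         "memory use")

args = parser.parse_args(["--chunk-size", "100"])
print(args.chunk_size)  # 100
```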
The main idea is to use workers to write to disk directly, thus removing communication between processes. It should work quite nicely with the ability to merge. Using a sample I have, I'm running at:

Processed 987700 documents (21903.35820268358 docs/s, 50.4128344196837 MB/s).

with 48 workers. I haven't tested extensively, but I'd say it's pretty nice compared to 15 MB/s :D Also, GPT2Tokenizer has a memory issue where it holds a cache that can become arbitrarily big, so I ran everything using the HF tokenizer ("gpt2"), which was slightly slower (probably due to the nonexistent cache). Please check #35 for more details.
TODO