C4-mC4 pre-processing #9
base: main
Conversation
Here are some stats for the data currently being processed.
Note: for some reason, writing large files to disk fails (i.e., ru, es, fr) on my system. Hopefully in a few days, when I'm free, I can look into this in more detail.
UPDATE: All data has been processed except es, fr, ru. I cannot …
let's talk about what goes where:
The following is a possible approach:
As mentioned above, I'm happy to discuss other layouts. Thank you!
@stas00
Thank you for the feedback, @sbmaruf! OK, let's leave the … And let's start a top-level …
This repo/project is currently very experimental, so please don't be afraid to experiment. This is all under git, so we can easily revert or change things if an experiment doesn't stick. Since we don't quite have a multi-month spec on what will be done, the layout and structure will be evolving as the code emerges. Please don't hesitate to tag me or others if you're not sure about something you're inspired to try.
Question for @sbmaruf and @stas00: I see you are using the mt5 tokenizer. I thought we were doing gpt-style modeling, either auto-regressive or prefix_lm. How would it work to feed mt5 tokens into the gpt model? I've played around with mt5 a bit and I think it's wonderful, except there are about 250K tokens as I recall, and some of the tokens could be removed, like all the emoji stuff, some of the formatting stuff, and tokens for languages we aren't modeling. Maybe I'm missing something? I know we are waiting for the tokenizer WG, so maybe they will have a decision on the tokenizer.

In the meantime, if we want to feed mt5 tokens into a gpt-style model for testing, I recommend trimming down the number of tokens to the languages we need, e.g. keeping the top N tokens for each of the mC4 languages we are testing on, and keeping shorter tokens to fill in any OOV tokens. Also, this tokenizer uses the HF sentencepiece tokenizer, which, while fast, is not easy to modify to shrink the number of tokens, although we could rewrite some of this to use a more vanilla Python tokenizer (which I've done in the past and can share). I suspect we can probably shrink the number of tokens down to a number similar to gpt's.
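The "keep the top N tokens for each language" idea suggested above could be sketched roughly like this. All names here are illustrative, not from the PR's code, and a real run would count ids over the actual tokenized mC4 shards:

```python
from collections import Counter

def top_tokens_per_language(tokenized_corpora, n_per_lang):
    """Union of the n most frequent token ids for each language.

    `tokenized_corpora` maps a language code to an iterable of
    token-id sequences (illustrative interface, not the PR's API).
    """
    keep = set()
    for lang, sequences in tokenized_corpora.items():
        counts = Counter()
        for seq in sequences:
            counts.update(seq)
        keep.update(tok for tok, _ in counts.most_common(n_per_lang))
    return keep

# Toy example with two "languages" over a shared id space.
corpora = {
    "en": [[1, 1, 2, 3], [1, 2]],
    "fr": [[4, 4, 5], [4, 5, 5]],
}
kept = top_tokens_per_language(corpora, n_per_lang=2)
print(sorted(kept))  # [1, 2, 4, 5]
```

The rare token id 3 is dropped here; in practice the surviving ids would then be remapped to a dense range before shrinking the embedding matrix.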
|
Here is a hack that helps with shrinking an spm vocab: it may or may not help, as I was only seeking to make it much smaller.
@stas00 Thank you for the great resources re sentencepiece. @sbmaruf I didn't realize we were considering running the gpt model with an mt5-size token space, but as you said, any tokenization mechanism can be used. I just wonder what happens to the tokens/embeddings that are rarely, if ever, seen. Good luck, and I'd love to see the results!
That's a massive contribution, @sbmaruf! Thank you!
In this review I've only focused on the usage doc and hope others will be able to review the code.
And I left a few small clarification suggestions.
Given that this is quite a complex process and may take many days to complete, any mistake can be quite costly. Therefore, is it possible to come up with a very small version of everything, so one can run the whole process quickly and ensure all the parts connect together?
For example, when we first worked with openwebtext, I created openwebtext-10k specifically for testing the processing pipeline. I wonder if we could create, say, 2 small datasets with 2 languages and do the full run on that first, ensure all the parts stack up, and then unleash the bigger version?
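Carving a tiny smoke-test split out of a huge stream, in the spirit of openwebtext-10k, can be as simple as the sketch below (names are illustrative; the real subset would be written back out as a small dataset):

```python
from itertools import islice

def take_subset(examples, k):
    """Materialize only the first k examples of a (possibly huge)
    stream, so the full pipeline can be exercised in minutes.
    Illustrative helper, not part of the PR's code."""
    return list(islice(examples, k))

# A fake million-document stream; only the first 3 are ever touched.
stream = ({"text": f"doc {i}"} for i in range(1_000_000))
sample = take_subset(stream, k=3)
print(len(sample), sample[0]["text"])  # 3 doc 0
```

Because `islice` is lazy, the remaining 999,997 documents are never generated, which is what makes this cheap even against terabyte-scale sources.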
```
bash scripts/c4_mc4_processing/cache_c4_mc4.sh
```

Running this script may require bandwidth of `~30MB/s` per language. You may run this script multiple times to make the caching faster. The script `tools/c4_mc4/c4_mc4_cache.py` performs caching only if a caching folder for a language doesn't exist, so running the script multiple times with the same cache folder is fine. Make sure you add your desired languages in the script.
I don't quite understand what "You may run this script multiple times to make the caching faster" means - should it be run first on several instances in parallel?
The new version:

> Running this script may require bandwidth of `~30MB/s` per language. You may run this script multiple times. Sometimes there is a cap for a single download request, so in that case you can just run the script multiple times to do parallel downloading and processing. Also, once the data is downloaded, it starts caching. Caching may take a lot of time; during that time, the other processes can keep downloading. The script `tools/c4_mc4/c4_mc4_cache.py` will perform caching if a caching folder for a language doesn't exist, so running the script multiple times with the same cache folder is fine. Make sure you add your desired languages in the script.
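The skip-if-cached behavior described here could be sketched as follows. The function and callback names are hypothetical stand-ins for the real logic in `tools/c4_mc4/c4_mc4_cache.py`, whose actual interface is not quoted in this thread:

```python
import os
import tempfile

def cache_language(lang, cache_root, build_cache):
    """Cache one language only if its folder doesn't exist yet.

    `build_cache` stands in for the real download-and-cache step.
    Returns True if this call did the work, False if it was skipped.
    """
    cache_dir = os.path.join(cache_root, lang)
    if os.path.isdir(cache_dir):
        return False  # another run already created (or is creating) it
    os.makedirs(cache_dir)
    build_cache(lang, cache_dir)
    return True

# Demo with a throwaway directory and a fake build step.
root = tempfile.mkdtemp()
built = []
first = cache_language("sw", root, lambda lang, d: built.append(lang))
second = cache_language("sw", root, lambda lang, d: built.append(lang))
print(first, second, built)  # True False ['sw']
```

Note the check is purely "does the directory exist", so two concurrent runs racing on the same language, or an aborted run that left a half-built directory, can still confuse it.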
I'm not quite familiar with this HF datasets feature - how does it know to download a different shard if you start identical processes to download the same dataset? Do you have a pointer to the corresponding doc?
What I do see is that one can use the `num_proc` argument to parallelize the download, which sounds like a much more sensible approach, since the user can then specify how many concurrent download threads to use. I'm referring to:
https://huggingface.co/docs/datasets/_modules/datasets/utils/download_manager.html

```python
# Default to using 16 parallel thread for downloading
# Note that if we have less than 16 files, multi-processing is not activated
if download_config.num_proc is None:
    download_config.num_proc = 16
```
Also, could it be error-prone to treat the existence of the cache dir as a sign that caching has completed? Won't it create that cache dir right away, and if the process is aborted before completing, won't the cache dir remain there?
Or does it use a temp dir and only rename it to the final destination once `datasets` knows the cache build has completed?
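The temp-dir-then-rename pattern being asked about here looks roughly like this. This is a sketch of the general pattern, not of `datasets`' actual internals, and all names are illustrative:

```python
import os
import tempfile

def build_cache_atomically(cache_dir, build_fn):
    """Build into a '<dir>.incomplete' sibling and rename it into
    place only after build_fn finishes, so an aborted run never
    leaves a directory that looks like a finished cache."""
    tmp_dir = cache_dir + ".incomplete"
    os.makedirs(tmp_dir, exist_ok=True)
    build_fn(tmp_dir)               # if this raises, cache_dir never appears
    os.replace(tmp_dir, cache_dir)  # atomic rename on POSIX filesystems

# Demo: write one file into the temp dir, then publish it.
root = tempfile.mkdtemp()
dest = os.path.join(root, "mc4_sw")
build_cache_atomically(
    dest, lambda d: open(os.path.join(d, "data.bin"), "w").close()
)
print(os.path.isdir(dest), os.path.exists(dest + ".incomplete"))  # True False
```

With this scheme, "the cache dir exists" really does imply "the build finished", because the final name only ever appears via the last rename.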
> Also once the data is downloaded, it starts caching. Caching may take a lot of time. In that time other process can keep downloading their stuffs.

Normally I'd propose switching to a lower-level API, with one queue for downloads: as soon as one dataset finishes downloading, its cache build is handed off to another worker. That is, instead of using `load_dataset`, which performs both the download and the caching. But this won't work on JZ.
Most likely the approach we can successfully deploy on JZ is to handle each stage completely separately. I'd just write a script that first downloads all the datasets, then processes them one by one. This is how we did it with OSCAR.
Please remember that on JZ slurm partitions have no internet, and while we have a few with internet, they are usually useless for data processing since they are heavily "abused" by other users because they are "free". So we can't do any serious processing there, and have to use normal CPU partitions that have no internet. We are also limited to 20h jobs for many things, but can do 100h at other times.
So the process on JZ won't work efficiently unless we download the data first on slurm partition A, and then process it in a different job on slurm partition B.
Please let me know if that helps to understand the proposal.
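The lower-level queue idea mentioned above (the single-machine variant, before the JZ caveat) might be sketched like this. `download` and `build_cache` are hypothetical stand-ins for the real steps:

```python
from concurrent.futures import ThreadPoolExecutor

def download_then_cache(langs, download, build_cache, cache_workers=2):
    """Keep one serial download stream, but hand each finished
    download to a small pool of cache builders, so caching one
    language overlaps downloading the next."""
    with ThreadPoolExecutor(max_workers=cache_workers) as pool:
        futures = []
        for lang in langs:
            path = download(lang)  # serial: one download at a time
            futures.append(pool.submit(build_cache, lang, path))
        # Wait for all cache builds before returning.
        return [f.result() for f in futures]

# Demo with fake steps that just record what they were given.
results = download_then_cache(
    ["am", "sw", "ca"],
    download=lambda lang: f"/tmp/raw/{lang}",
    build_cache=lambda lang, path: (lang, path),
)
print(results)
```

On JZ this whole function would instead be split in two: the download loop runs as one job on an internet-connected partition, and the cache-build pool runs later as a separate job on a CPU partition.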
- I didn't know that `HF-datasets` can parallelize the download. I knew that it can process data with `multi-processing`.
- Actually, we download each language by providing a `Subset`. It automatically creates a new folder and starts the download.
- If the caching fails, it raises an exception: https://github.com/sbmaruf/Megatron-DeepSpeed/blob/c4-mc4-pre_processing/tools/c4_mc4/c4_mc4_cache.py#L30 So the user has to look into the logs. If caching fails, the user has to clean the folder and start again.

For me, to maintain caching, logging and monitoring, it was really convenient to log each of them with a separate process. If it creates too much confusion, I can make it simpler by removing these additional conditions.
I hope you didn't miss my comment above about how this won't work, unless we separate the downloads from processing: https://github.com/bigscience-workshop/Megatron-DeepSpeed/pull/9/files/8383432b9a1bf76e32cafa1d857a807c9495c382..9a588e8bdf84cfb1ca19f4b9057c11ffc8980344#r686400563
- Data has already been processed and uploaded to the GC bucket.
- Data download and processing are done in two different scripts. 1, 2
- But even after downloading the data, HF-datasets performs one small internal processing step, which may take a significant amount of time for terabyte-scale datasets.
- Another lower-level implementation is already available in this pull.
@yongzx I find some numbers inconsistent. Especially that 86GB of raw data becomes 28GB of binary, while for English 784GB of raw data becomes 763GB of binary. There is no pattern in the raw-to-binary size ratio, which worries me most. Maybe that's how it should be; not sure at this point.
# Data Binarization Stat
If you tokenize English with the `t5` tokenizer, `784GB` of raw data becomes `344GB` (`*.bin` size, including validation). But if you tokenize the same `784GB` of raw data with `mt5`, it becomes `756GB`. This is not what we were expecting at first. Earlier we were expecting 50% English and 50% remaining languages, but after binarization (a.k.a. tokenization) that calculation doesn't hold. For most of the languages, the size reduced drastically after binarization. The stats are below:
> If you tokenize English with the `t5` tokenizer, `784GB` of raw data becomes `344GB` (`*.bin` size, including validation). But if you tokenize the same `784GB` of raw data with `mt5`, it becomes `756GB`. This is not what we were expecting at first.
I'm pretty sure this is caused by this:
https://github.com/bigscience-workshop/Megatron-DeepSpeed/blob/main/megatron/data/indexed_dataset.py#L25-L29
Essentially, the t5 tokenizer has 32128 tokens and mt5 has 250112 tokens. So I'm pretty sure the 2x memory footprint is because one uses `int16` and the other `int32`.
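The size gap follows directly from the dtype cutoff in the `indexed_dataset.py` lines linked above. A paraphrased sketch (the real code picks a numpy dtype; here we just count bytes):

```python
def bytes_per_token(vocab_size):
    """Paraphrase of the dtype choice in megatron's indexed_dataset:
    token ids fit in a 16-bit int while the vocab stays under the
    65500 cutoff used there, and need a 32-bit int otherwise."""
    return 2 if vocab_size < 65500 else 4

t5_bytes = bytes_per_token(32128)    # t5 vocab  -> 2 bytes/token
mt5_bytes = bytes_per_token(250112)  # mt5 vocab -> 4 bytes/token
print(t5_bytes, mt5_bytes)  # 2 4
# Same token count on disk, so the mt5 .bin is ~2x the t5 .bin.
```

This explains the factor of two, though not all of the 344GB vs 756GB gap; the rest comes from mt5 producing a different number of tokens for the same text.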
Thanks for sharing. Will take a look at this.
Thanks Maruf for downloading all the necessary language files. I went through the code repo and, overall, it looks good.
There are some functions (`calc_multinomial_sampling_prob_with_penalty`, `print_stat`, `get_size_stats`) that have identical names and similar/identical outputs in `data_resize.py` and `calc_iterator_prob.py`. I don't see them causing any issues right now, but if we are going to use these utility functions in the future, I suggest we move them out into a file like `util.py`.
Another potential issue is the lack of docstrings/comments, so it's harder to follow the code in `tools/c4_mc4/`. However, I don't think we need to add them now because this tool is for internal use (correct me if I'm wrong).
Edit: @sbmaruf has addressed my concerns about the functions and added docstrings.
mC4 data is too large: for the 13 selected languages it's around 18TB of data. I excluded the English data since Teven already processed it.
Arabic, Swahili (Bantu), Chinese, Catalan, English, French, Indic (Hindi, Urdu, Bangla), Indonesian, Portuguese, Spanish, Russian, Japanese, Amharic
This pull adds pre-processing code for mC4 data.