C4-mC4 pre processing #9

Open
wants to merge 42 commits into base: main

Conversation

@sbmaruf (Collaborator) commented Jul 23, 2021

mC4 data is too large. For the 13 selected languages it's around 18TB of data. I excluded the English data since Teven already processed it.

Arabic, Swahili (Bantu), Chinese, Catalan, English, French, Indic (Hindi,Urdu,Bangla), Indonesian, Portuguese, Spanish, Russian, Japanese, Amharic

This PR adds pre-processing code for mC4 data.
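
(Not part of this PR's scripts, but for anyone who wants to peek at one mC4 language without pulling terabytes locally, here is a minimal sketch using the HF `datasets` streaming mode; the language choice is just an example.)

```python
# Minimal sketch (not from this PR): inspect a single mC4 language with
# HF datasets in streaming mode, avoiding a full multi-terabyte download.
from datasets import load_dataset

mc4_sw = load_dataset("mc4", "sw", split="train", streaming=True)
for i, example in enumerate(mc4_sw):
    print(example["url"], example["text"][:80])
    if i >= 2:  # just peek at a few documents
        break
```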

@sbmaruf changed the title from "C4 mc4 pre processing" to "C4-mC4 pre processing" on Jul 23, 2021
@sbmaruf (Collaborator, Author) commented Jul 27, 2021

Here are some stats for the data currently being processed:

Language : Data Size in mC4 (GB)
--------------------
zh-Latn :   0.65
am :   1.25
ru-Latn :   2.65
sw :   3.16
ur :   10.4
ca :   42.3
bn :   43.39
hi :   127.86
zh :   148.15
id :   248.72
ar :   251.31
pt :   478.09
ja :   773.92
fr :   1050.29
es :   1492.72
ru :   3905.98
--------------------
Total size : 8580.84
Expected size after resizing : 576
Per language allocated size : 36.0
Low resource languages (<36.0) : sw(3.16) zh-Latn(0.65) ur(10.4) ru-Latn(2.65) am(1.25)
Total size consumed by low resource languages : 18.11
For high resource languages, predefined (user-given) minimum allocation size : 12 GB, maximum allocation size : 100 GB
Sampling high resource languages based on a multinomial distribution with alpha 0.01.
--------------------------------------------------------------------------------
Language : ar, Sampling prob : 0.09 , (251.31 -> 23 GB)
Language : zh, Sampling prob : 0.09 , (148.15 -> 13 GB)
Language : ca, Sampling prob : 0.09 -> 0.28, (42.3 -> 12 GB)
Language : fr, Sampling prob : 0.09 , (1050.29 -> 97 GB)
Language : hi, Sampling prob : 0.09 -> 0.09, (127.86 -> 12 GB)
Language : bn, Sampling prob : 0.09 -> 0.28, (43.39 -> 12 GB)
Language : id, Sampling prob : 0.09 , (248.72 -> 23 GB)
Language : pt, Sampling prob : 0.09 , (478.09 -> 44 GB)
Language : es, Sampling prob : 0.09 -> 0.07, (1492.72 -> 100 GB)
Language : ru, Sampling prob : 0.09 -> 0.03, (3905.98 -> 100 GB)
Language : ja, Sampling prob : 0.09 , (773.92 -> 71 GB)
Expected high resource size 557.89, Total Size : 505.848648173082
Performing adjustment ...

Final Breakdown
---------------
Language : ar, Sampling prob : 0.13, Data resized : (251.31 -> 31.78 GB)
Language : sw, Sampling prob : 1.0, Data resized : (3.16 -> 3.16 GB)
Language : zh, Sampling prob : 0.15, Data resized : (148.15 -> 22.36 GB)
Language : zh-Latn, Sampling prob : 1.0, Data resized : (0.65 -> 0.65 GB)
Language : ca, Sampling prob : 0.28, Data resized : (42.3 -> 12.0 GB)
Language : fr, Sampling prob : 0.1, Data resized : (1050.29 -> 105.59 GB)
Language : hi, Sampling prob : 0.09, Data resized : (127.86 -> 12.0 GB)
Language : ur, Sampling prob : 1.0, Data resized : (10.4 -> 10.4 GB)
Language : bn, Sampling prob : 0.28, Data resized : (43.39 -> 12.0 GB)
Language : id, Sampling prob : 0.13, Data resized : (248.72 -> 31.55 GB)
Language : pt, Sampling prob : 0.11, Data resized : (478.09 -> 51.66 GB)
Language : es, Sampling prob : 0.07, Data resized : (1492.72 -> 100.0 GB)
Language : ru, Sampling prob : 0.03, Data resized : (3905.98 -> 100.0 GB)
Language : ru-Latn, Sampling prob : 1.0, Data resized : (2.65 -> 2.65 GB)
Language : ja, Sampling prob : 0.1, Data resized : (773.92 -> 78.95 GB)
Language : am, Sampling prob : 1.0, Data resized : (1.25 -> 1.25 GB)
Expected resource size 576, Total Size : 576.0
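
For readers who want to reproduce these figures, here is a rough sketch of the alpha-smoothed sampling calculation (a reconstruction based on this log, not the exact code in the PR's `data_resize.py`):

```python
# Rough reconstruction of the sampling calculation above. The clipping and
# final adjustment details are inferred from the log, not copied from the PR.
sizes_gb = {  # high-resource languages only (low-resource ones are kept fully)
    "ar": 251.31, "zh": 148.15, "ca": 42.3, "fr": 1050.29, "hi": 127.86,
    "bn": 43.39, "id": 248.72, "pt": 478.09, "es": 1492.72, "ru": 3905.98,
    "ja": 773.92,
}
alpha = 0.01
min_gb, max_gb = 12.0, 100.0

total = sum(sizes_gb.values())
# p_i ∝ (n_i / N)^alpha: a small alpha flattens the distribution towards
# uniform, which is why every language starts near 1/11 ≈ 0.09 in the log.
weights = {lang: (n / total) ** alpha for lang, n in sizes_gb.items()}
z = sum(weights.values())

for lang, n in sizes_gb.items():
    prob = weights[lang] / z
    kept = min(max(prob * n, min_gb), max_gb)  # clip to [12, 100] GB
    print(f"{lang}: prob {prob:.2f}, {n} -> {kept:.0f} GB")
# A final pass then redistributes the unused budget among the un-capped
# languages so the total matches the 576 GB target ("Performing adjustment").
```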

Note: For some reason, writing large files to disk is failing (i.e., ru, es, fr) on my system. Hopefully in a few days, when I'm free, I can look into this in more detail.

@ibeltagy mentioned this pull request on Jul 27, 2021
@sbmaruf (Collaborator, Author) commented Aug 2, 2021

UPDATE: All data has been processed except es, fr, ru. I cannot load, randomly shuffle, and sample these 3 languages on my current system. I am using the huggingface datasets library. I spent a full day debugging errors on these 3 languages. A possible alternative method could be using Allen AI's GitHub LFS data, which is split into small parts. I am trying that now.
I am uploading the data to HF-datasets. Unfortunately I have a very low upload speed of ~1.5MiB/s. It will take 1-2 days to upload the full data.

@stas00 (Member) commented Aug 3, 2021

Let's talk about what goes where:

  • examples isn't the best place - we now fully own this repo, so we want logical placements - we should probably remove this folder altogether.
  • tools is not the place to put libraries; this dir is for project maintenance scripts - think janitor's closet.

The following is a possible approach:

  1. Put all data processing libraries under megatron/data/ perhaps megatron/data/c4 in this case?
  2. Create top-level scripts and probably arrange the .sh scripts for this PR in scripts/data/c4?

As mentioned above, I'm happy to discuss other layouts.

Thank you!

@sbmaruf (Collaborator, Author) commented Aug 3, 2021

@stas00
tools is not the place to put libraries, this dir is for project maintenance scripts - think janitor closet.

  • Actually, I found that the openwebtext processing scripts are in the tools folder; that's why I put the c4_mc4 processing code in the tools folder. I also like your idea of putting them in megatron/data/c4.

examples aren't the best place - we now fully own this repo - so we want logical placements - probably should remove this folder altogether.

Create top-level scripts and probably arrange the .sh scripts for this PR in scripts/data/c4?

  • I actually proposed something similar in our last archi meeting. All my projects have a top-level scripts folder to track the runs, but I didn't create one here because I thought we were putting the scripts in the examples folder. Since I have never maintained a large project, I didn't want to interfere with the development cycle (creating new folders, adding new stuff to .gitignore, etc.). I am open to any proposal.

@stas00 (Member) commented Aug 3, 2021

Thank you for the feedback, @sbmaruf!

OK, let's leave tools as it is for now, and then we can move the whole thing at once if that makes more sense.

And let's start a top-level scripts folder and start migrating from examples - and eventually probably remove examples altogether. Examples are for the multitude of Megatron-LM users who need examples. We are now the owners of this repo and need concrete solutions, not examples.

Since I have never maintained a large project, I didn't want to interfere with the development cycle (creating new folders, adding new stuff to .gitignore, etc.). I am open to any proposal.

This repo/project is currently very experimental, so please don't be afraid to experiment. This is all under git, so we can easily revert or change things if an experiment doesn't stick.

Since we don't quite have a multi-months spec on what will be done, the layout and structure will be evolving as the code emerges.

Please don't hesitate to tag me or others if you're not sure about something you're inspired to try.

@huu4ontocord (Contributor) commented Aug 6, 2021

Question @sbmaruf and @stas00: I see you are using the mt5 tokenizer. I thought we were doing gpt style modeling, either auto-regressive or prefix_lm. How would it work to feed mt5 tokens into the gpt model? I've played around with mt5 a bit and I think it's wonderful, except there are like 250K tokens as I recall, and some of the tokens could be removed, like all the emoji stuff and some of the formatting stuff, and there are tokens for languages we aren't modeling. Maybe I'm missing something? I know we are waiting for the tokenizer WG, so maybe they will have a decision on the tokenizer. In the meantime, if we want to feed mt5 tokens to a gpt style model for testing, I recommend trimming down the number of tokens to the languages we need, like keeping the top N tokens for each of the languages from mc4 we are testing on, and keeping shorter tokens to fill in any OOV tokens? Also, this tokenizer uses the HF sentencepiece tokenizer, which, while fast, is not easy to modify to shrink the number of tokens, although we could rewrite some of this to use a more vanilla python tokenizer (which I've done in the past and can share). I suspect we can probably shrink down the number of tokens to a similar number as gpt.

@sbmaruf (Collaborator, Author) commented Aug 7, 2021

I see you are using the mt5 tokenizer. I thought we were doing gpt style modeling, either auto-regressive or prefix_lm

  • Any tokenizer can be used for an auto-regressive model. We actually discussed this in the meeting.
  • We wanted to train our own tokenizer, but it seems like the tokenization WG will ultimately propose all the hyperparameters of the tokenizer, so we decided to use the regular mT5 tokenizer.

In the meantime, if we want to feed mt5 tokens to a gpt style model for testing, I recommend trimming down the number of tokens to the languages we need, like keeping the top N tokens for each of the languages from mc4 we are testing on, and keeping shorter tokens to fill in any OOV tokens?

  • sentencepiece tokenizers are not straightforward to resize and remap. In addition, it is complicated to identify the source language of a subword, since there is virtually no way to identify the language of typographically identical tokens.

except there are like 250K tokens as I recall, and some of the tokens could be removed, like all the emoji stuff and some of the formatting stuff

  • Actually, the scaling law doesn't depend on the vocabulary size of the tokenizer.

although we could rewrite some of this to use a more vanilla python tokenizer (which I've done in the past and can share). I suspect we can probably shrink down the number of tokens to a similar number as gpt.

  • Shrinking the tokenizer won't bring any theoretical improvement; the improvement we would get is a speedup of the lookup operation. For me, it's easier to manually shrink a tokenizer than to train a new one, but while shrinking a tokenizer, many things can go wrong.

@ontocord

@stas00 (Member) commented Aug 7, 2021

sentencepiece tokenizers are not straightforward to resize and remap.

Here is a hack that helps with shrinking an spm vocab:
https://discuss.huggingface.co/t/tokenizer-shrinking-recipes/8564

It may or may not help, as I was only seeking to make it much smaller.
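
For reference, one possible way to prune an spm vocab with the sentencepiece protobuf bindings looks roughly like this (a sketch only; the file names and the keep-set construction are illustrative, not taken from the linked recipe or from this PR):

```python
# Sketch: keep only the pieces actually produced on a sample corpus.
# Paths and the choice of sample corpus are placeholders.
import sentencepiece as spm
from sentencepiece import sentencepiece_model_pb2 as sp_model

sp = spm.SentencePieceProcessor(model_file="mt5.model")
keep = set()
with open("sample_corpus.txt", encoding="utf-8") as f:
    for line in f:
        keep.update(sp.encode(line, out_type=str))

m = sp_model.ModelProto()
with open("mt5.model", "rb") as f:
    m.ParseFromString(f.read())

pruned = sp_model.ModelProto()
pruned.CopyFrom(m)
del pruned.pieces[:]
for piece in m.pieces:
    # Always retain control/special pieces (unk, pad, eos, ...).
    if piece.type != sp_model.ModelProto.SentencePiece.NORMAL or piece.piece in keep:
        pruned.pieces.add().CopyFrom(piece)

with open("mt5-shrunk.model", "wb") as f:
    f.write(pruned.SerializeToString())
# Caveat: token ids shift, so any existing embedding matrix would need remapping.
```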

@huu4ontocord (Contributor) commented

@stas00 Thank you for the great resources re sentencepiece.

@sbmaruf I didn't realize we were considering running the gpt model with an mt5-size token space, but as you said, any tokenization mechanism can be used. I just wonder what happens to the tokens/embeddings that are rarely if ever seen. Good luck and I'd love to see the results!

@sbmaruf (Collaborator, Author) commented Aug 10, 2021

  • Done with processing.
  • Here is the readme.
  • Processed data is here.
  • Raw data (except English) is here.
  • Please review this part carefully.
  • Final iterator selection probability is here; you need to choose an alpha from here.
  • All the JSON files for the iterator selection probability are here.

Awaiting review, @stas00 @ibeltagy @yongzx

@sbmaruf requested a review from @yongzx on August 10, 2021
@stas00 (Member) left a comment

That's a massive contribution, @sbmaruf! Thank you!

In this review I've only focused on the usage doc and hope others will be able to review the code.

And I left a few small clarification suggestions.


Given that this is quite a complex process and may take many days to complete, any mistake can be quite costly. Therefore, is it possible to come up with a very small version of everything, so one can run the whole process quickly and ensure all the parts connect together?

For example, when we first worked with openwebtext I created openwebtext-10k specifically for testing the processing pipeline. I wonder if we could create, say, 2 small datasets with 2 languages and do the full run on those first, ensure all the parts stack up, and then unleash the bigger version?

scripts/c4_mc4_processing/iterator_selection_prob.out.txt (outdated, resolved)
scripts/c4_mc4_processing/README.md (outdated, resolved)
```
bash scripts/c4_mc4_processing/cache_c4_mc4.sh
```
Running this script may require bandwidth `~30MB/s` per language. You may run this script multiple times to make the caching faster. The script `tools/c4_mc4/c4_mc4_cache.py` will perform caching if a caching folder for a language doesn't exist. So running the script multiple times with the same cache folder will be ok. Make sure you add your desired language in the script.

@stas00 (Member) commented Aug 10, 2021

I don't quite understand what "You may run this script multiple times to make the caching faster" means - should it be run first on several instances in parallel?

@sbmaruf (Collaborator, Author)

The new version:

Running this script may require bandwidth of ~30MB/s per language. You may run this script multiple times. Sometimes there is a cap for a single download request, so in that case you can just run the script multiple times to do parallel downloading and processing. Also, once the data is downloaded, it starts caching. Caching may take a lot of time; in that time other processes can keep downloading their data. The script tools/c4_mc4/c4_mc4_cache.py will perform caching if a caching folder for a language doesn't exist, so running the script multiple times with the same cache folder will be ok. Make sure you add your desired language in the script.

@stas00 (Member) commented Aug 10, 2021

I'm not quite familiar with this HF datasets feature - how does it know to download a different shard if you start identical processes to download the same dataset? Do you have a pointer to the corresponding doc?

What I do see is that one can use the num_proc argument to parallelize the download, which sounds like a much more sensible approach, since the user can then specify how many concurrent download threads to use. I'm referring to:
https://huggingface.co/docs/datasets/_modules/datasets/utils/download_manager.html

        # Default to using 16 parallel thread for downloading
        # Note that if we have less than 16 files, multi-processing is not activated
        if download_config.num_proc is None:
            download_config.num_proc = 16
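
For example, usage might look roughly like this (a sketch; the exact import location of DownloadConfig differs between datasets versions):

```python
# Sketch: explicitly set the number of parallel download threads.
from datasets import load_dataset, DownloadConfig
# (in older datasets releases: from datasets.utils import DownloadConfig)

dl_config = DownloadConfig(num_proc=8)  # 8 concurrent download threads
ds = load_dataset("mc4", "sw", split="train", download_config=dl_config)
```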

@stas00 (Member)

Also, could it be error-prone to treat the existence of the cache dir as a sign that caching has completed? Won't it create that cache dir right away, and if the process is aborted before completing, won't the cache dir remain there?

Or does it use a temp dir and only rename it to the final destination once datasets knows the cache building has completed?

@stas00 (Member)

Also, once the data is downloaded, it starts caching. Caching may take a lot of time; in that time other processes can keep downloading their data.

Normally I'd propose switching to a lower-level API and having one queue for downloads, so that as soon as one dataset finishes downloading, it hands the cache build off to another process - that is, instead of using load_dataset, which performs both the download and the caching. But this won't work on JZ.


Most likely the approach we can successfully deploy on JZ is to handle each stage completely separately. I'd just write a script that first downloads all the datasets. Then I'd process them one by one. This is how we did it with Oscar.

Please remember that on JZ the slurm partitions have no internet, and while we have a few with internet access, they are usually useless for data processing since they are heavily "abused" by other users because they are "free". So we can't do any serious processing there and have to use the normal CPU partitions, which have no internet. We are also limited to 20h jobs for many things, but can do 100h at other times.

So the process on JZ won't work efficiently unless we download the data first on slurm partition A, and then process it in a different process on slurm partition B.

Please let me know if that helps to understand the proposal.

@sbmaruf (Collaborator, Author) commented Aug 11, 2021

For me, to maintain caching, logging and monitoring, it was really convenient to log each of them with a separate process. If it creates too much confusion, I can simplify things by removing these additional conditions.

@stas00 (Member)

I hope you didn't miss my comment above about how this won't work, unless we separate the downloads from processing: https://github.com/bigscience-workshop/Megatron-DeepSpeed/pull/9/files/8383432b9a1bf76e32cafa1d857a807c9495c382..9a588e8bdf84cfb1ca19f4b9057c11ffc8980344#r686400563

@sbmaruf (Collaborator, Author)

  • The data has already been processed and uploaded to the GC bucket.
  • Data download and processing are done in two different scripts: 1, 2
  • But even after downloading the data, HF-datasets performs one small internal processing step, which may take a significant amount of time for terabyte-scale datasets.
  • Another lower-level implementation is already available with this pull.

scripts/c4_mc4_processing/README.md (5 outdated review threads, resolved)
@yongzx (Collaborator) commented Aug 11, 2021

@sbmaruf For the "Please review this part carefully", is there any particular component to pay extra attention to? For instance, the reported figures?

@sbmaruf (Collaborator, Author) commented Aug 11, 2021

@sbmaruf For the "Please review this part carefully", is there any particular component to pay extra attention to? For instance, the reported figures?

@yongzx I find some numbers inconsistent. Especially that 86 GB of raw data becomes 28 GB of binary, while for English 784 GB of raw data becomes 763 GB of binary. There is no pattern in the raw-size-to-binary-size ratio, which worries me most. Maybe that's how it should be. Not sure at this point.


# Data Binarization Stat

If you tokenize English with the `t5` tokenizer, `784GB` of raw data becomes `344GB` (`*.bin` size, including validation). But if you tokenize the same `784GB` of raw data with `mt5`, it becomes `756GB`. This is not what we were expecting at first. Earlier we were expecting 50% English and 50% remaining languages, but now after binarization (a.k.a. tokenization) that calculation doesn't hold. For most of the languages, the size is reduced drastically after binarization. The stats are below,
@thomasw21 (Member) commented Aug 11, 2021

If you tokenize English with t5 tokenizer, 784GB raw data becomes 344GB (*.bin size, including validation). But if you tokenize the same 784GB raw data with mt5 it becomes 756GB This is not what we were expecting at first

I'm pretty sure this is caused by this:
https://github.com/bigscience-workshop/Megatron-DeepSpeed/blob/main/megatron/data/indexed_dataset.py#L25-L29

Essentially t5 tokenizer has 32128 tokens and mt5 has 250112 tokens. So I'm pretty sure the x2 memory footprint is because one uses int16 and the other int32.
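
In other words (paraphrasing the linked lines rather than quoting them): the binarized `.bin` files store each token id in the smallest integer type that fits the vocab, so crossing the 16-bit boundary roughly doubles the bytes per token.

```python
import numpy as np

# Paraphrase of the dtype selection in megatron/data/indexed_dataset.py
# (the exact threshold is in the linked source; ~65500 shown here).
def best_fitting_dtype(vocab_size=None):
    if vocab_size is not None and vocab_size < 65500:
        return np.uint16   # 2 bytes per token -- enough for t5's 32128-token vocab
    return np.int32        # 4 bytes per token -- needed for mt5's 250112-token vocab

print(best_fitting_dtype(32128))   # -> <class 'numpy.uint16'>
print(best_fitting_dtype(250112))  # -> <class 'numpy.int32'>
```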

@sbmaruf (Collaborator, Author)

Thanks for sharing. Will take a look into this.

@yongzx (Collaborator) left a comment

Thanks Maruf for downloading all the necessary language files. I went through the code repo and, overall, it looks good.

There are some functions (calc_multinomial_sampling_prob_with_penalty, print_stat, get_size_stats) that have identical names and similar/identical outputs in data_resize.py and calc_iterator_prob.py. I don't see that they are causing any issue right now, but if we are going to use these utility functions in the future, I suggest we move them out and put them into a file like util.py.

Another potential issue is the lack of docstrings/comments, so it's harder to follow the code in tools/c4_mc4/. However, I don't think we need to add them now because this tool is for internal use (correct me if I am wrong).

Edit: @sbmaruf has addressed my concerns about the functions and added docstrings.

tools/c4_mc4/c4_mc4_cache.py (outdated, resolved)