C4-mC4 pre-processing #9
base: main
Conversation
Here are some stats for the data currently being processed.
Note: for some reason, writing large files to disk fails (i.e., ru, es, fr) on my system. Hopefully in a few days, when I'm free, I can look into this in more detail.
UPDATE: All data has been processed except es, fr, ru. I cannot …
let's talk about what goes where:
The following is a possible approach:
As mentioned above, I'm happy to discuss other layouts. Thank you!
@stas00
Thank you for the feedback, @sbmaruf! OK, let's leave the … And let's start a top-level …
This repo/project is currently very experimental, so please don't be afraid to experiment. This is all under git, so we can easily revert or change things if an experiment doesn't stick. Since we don't quite have a multi-month spec on what will be done, the layout and structure will be evolving as the code emerges. Please don't hesitate to tag me or others if you're not sure about something you're inspired to try.
Question for @sbmaruf and @stas00: I see you are using the mt5 tokenizer. I thought we were doing gpt-style modeling, either auto-regressive or prefix_lm. How would it work to feed mt5 tokens into the gpt model? I've played around with mt5 a bit and I think it's wonderful, except there are about 250K tokens as I recall, and some of the tokens could be removed, like all the emoji stuff, some of the formatting stuff, and tokens for languages we aren't modeling. Maybe I'm missing something? I know we are waiting for the tokenizer WG, so maybe they will have a decision on the tokenizer.

In the meantime, if we want to feed mt5 tokens into a gpt-style model for testing, I recommend trimming down the number of tokens to the languages we need, e.g. keeping the top N tokens for each of the mC4 languages we are testing on, and keeping shorter tokens to fill in any OOV tokens. Also, this tokenizer uses the HF sentencepiece tokenizer, which, while fast, is not easy to modify to shrink the number of tokens, although we could rewrite some of this to use a more vanilla Python tokenizer (which I've done in the past and can share). I suspect we can probably shrink the number of tokens down to a number similar to gpt's.
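The "keep the top N tokens for each language" idea suggested above could be sketched roughly like this. All names here are illustrative, not from the PR's code, and a real run would count ids over the actual tokenized mC4 shards:

```python
from collections import Counter

def top_tokens_per_language(tokenized_corpora, n_per_lang):
    """Union of the n most frequent token ids for each language.

    `tokenized_corpora` maps a language code to an iterable of
    token-id sequences (illustrative interface, not the PR's API).
    """
    keep = set()
    for lang, sequences in tokenized_corpora.items():
        counts = Counter()
        for seq in sequences:
            counts.update(seq)
        keep.update(tok for tok, _ in counts.most_common(n_per_lang))
    return keep

# Toy example with two "languages" over a shared id space.
corpora = {
    "en": [[1, 1, 2, 3], [1, 2]],
    "fr": [[4, 4, 5], [4, 5, 5]],
}
kept = top_tokens_per_language(corpora, n_per_lang=2)
print(sorted(kept))  # [1, 2, 4, 5]
```

The rare token id 3 is dropped here; in practice the surviving ids would then be remapped to a dense range before shrinking the embedding matrix.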
|
Here is a hack that helps with shrinking an spm vocab: it may or may not help, as I was only seeking to make it much smaller.
@stas00 Thank you for the great resources re sentencepiece. @sbmaruf I didn't realize we were considering running the gpt model with an mt5-size token space, but as you said, any tokenization mechanism can be used. I just wonder what happens to the tokens/embeddings that are rarely, if ever, seen. Good luck, and I'd love to see the results!
That's a massive contribution, @sbmaruf! Thank you!
In this review I've only focused on the usage doc and hope others will be able to review the code.
And I left a few small clarification suggestions.
Given that this is quite a complex process and may take many days to complete, any mistake can be quite costly. Therefore, is it possible to come up with a very small version of everything, so one can run the whole process quickly and ensure all the parts connect together?
For example, when we first worked with openwebtext, I created openwebtext-10k specifically for testing the processing pipeline. I wonder if we could create, say, 2 small datasets with 2 languages and do the full run on that first, ensure all the parts stack up, and then unleash the bigger version?
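Carving a tiny smoke-test split out of a huge stream, in the spirit of openwebtext-10k, can be as simple as the sketch below (names are illustrative; the real subset would be written back out as a small dataset):

```python
from itertools import islice

def take_subset(examples, k):
    """Materialize only the first k examples of a (possibly huge)
    stream, so the full pipeline can be exercised in minutes.
    Illustrative helper, not part of the PR's code."""
    return list(islice(examples, k))

# A fake million-document stream; only the first 3 are ever touched.
stream = ({"text": f"doc {i}"} for i in range(1_000_000))
sample = take_subset(stream, k=3)
print(len(sample), sample[0]["text"])  # 3 doc 0
```

Because `islice` is lazy, the remaining 999,997 documents are never generated, which is what makes this cheap even against terabyte-scale sources.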
```
bash scripts/c4_mc4_processing/cache_c4_mc4.sh
```

Running this script may require bandwidth of `~30MB/s` per language. You may run this script multiple times to make the caching faster. The script `tools/c4_mc4/c4_mc4_cache.py` performs caching only if a caching folder for a language doesn't exist, so running the script multiple times with the same cache folder is fine. Make sure you add your desired languages in the script.
I don't quite understand what "You may run this script multiple times to make the caching faster" means - should it be run first on several instances in parallel?
The new version:

> Running this script may require bandwidth of `~30MB/s` per language. You may run this script multiple times. Sometimes there is a cap for a single download request, so in that case you can just run the script multiple times to do parallel downloading and processing. Also, once the data is downloaded, it starts caching. Caching may take a lot of time; during that time, the other processes can keep downloading. The script `tools/c4_mc4/c4_mc4_cache.py` will perform caching if a caching folder for a language doesn't exist, so running the script multiple times with the same cache folder is fine. Make sure you add your desired languages in the script.
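The skip-if-cached behavior described here could be sketched as follows. The function and callback names are hypothetical stand-ins for the real logic in `tools/c4_mc4/c4_mc4_cache.py`, whose actual interface is not quoted in this thread:

```python
import os
import tempfile

def cache_language(lang, cache_root, build_cache):
    """Cache one language only if its folder doesn't exist yet.

    `build_cache` stands in for the real download-and-cache step.
    Returns True if this call did the work, False if it was skipped.
    """
    cache_dir = os.path.join(cache_root, lang)
    if os.path.isdir(cache_dir):
        return False  # another run already created (or is creating) it
    os.makedirs(cache_dir)
    build_cache(lang, cache_dir)
    return True

# Demo with a throwaway directory and a fake build step.
root = tempfile.mkdtemp()
built = []
first = cache_language("sw", root, lambda lang, d: built.append(lang))
second = cache_language("sw", root, lambda lang, d: built.append(lang))
print(first, second, built)  # True False ['sw']
```

Note the check is purely "does the directory exist", so two concurrent runs racing on the same language, or an aborted run that left a half-built directory, can still confuse it.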
I'm not quite familiar with this HF datasets feature - how does it know to download a different shard if you start identical processes to download the same dataset? Do you have a pointer to the corresponding doc?
What I do see is that one can use the `num_proc` argument to parallelize the download, which sounds like a much more sensible approach, since the user can then specify how many concurrent download threads to use. I'm referring to:
https://huggingface.co/docs/datasets/_modules/datasets/utils/download_manager.html

```python
# Default to using 16 parallel thread for downloading
# Note that if we have less than 16 files, multi-processing is not activated
if download_config.num_proc is None:
    download_config.num_proc = 16
```
Also, could it be error-prone to treat the existence of the cache dir as a sign that caching has completed? Won't it create that cache dir right away, and if the process is aborted before completing, won't the cache dir remain there?
Or does it use a temp dir and only rename it to the final destination once `datasets` knows the cache build has completed?
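The temp-dir-then-rename pattern being asked about here looks roughly like this. This is a sketch of the general pattern, not of `datasets`' actual internals, and all names are illustrative:

```python
import os
import tempfile

def build_cache_atomically(cache_dir, build_fn):
    """Build into a '<dir>.incomplete' sibling and rename it into
    place only after build_fn finishes, so an aborted run never
    leaves a directory that looks like a finished cache."""
    tmp_dir = cache_dir + ".incomplete"
    os.makedirs(tmp_dir, exist_ok=True)
    build_fn(tmp_dir)               # if this raises, cache_dir never appears
    os.replace(tmp_dir, cache_dir)  # atomic rename on POSIX filesystems

# Demo: write one file into the temp dir, then publish it.
root = tempfile.mkdtemp()
dest = os.path.join(root, "mc4_sw")
build_cache_atomically(
    dest, lambda d: open(os.path.join(d, "data.bin"), "w").close()
)
print(os.path.isdir(dest), os.path.exists(dest + ".incomplete"))  # True False
```

With this scheme, "the cache dir exists" really does imply "the build finished", because the final name only ever appears via the last rename.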
> Also once the data is downloaded, it starts caching. Caching may take a lot of time. In that time other process can keep downloading their stuffs.

Normally I'd propose switching to a lower-level API, with one queue for downloads: as soon as one dataset finishes downloading, its cache build is handed off to another worker. That is, instead of using `load_dataset`, which performs both the download and the caching. But this won't work on JZ.
Most likely the approach we can successfully deploy on JZ is to handle each stage completely separately. I'd just write a script that first downloads all the datasets, then processes them one by one. This is how we did it with OSCAR.
Please remember that on JZ slurm partitions have no internet, and while we have a few with internet, they are usually useless for data processing since they are heavily "abused" by other users because they are "free". So we can't do any serious processing there, and have to use normal CPU partitions that have no internet. We are also limited to 20h jobs for many things, but can do 100h at other times.
So the process on JZ won't work efficiently unless we download the data first on slurm partition A, and then process it in a different job on slurm partition B.
Please let me know if that helps to understand the proposal.
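The lower-level queue idea mentioned above (the single-machine variant, before the JZ caveat) might be sketched like this. `download` and `build_cache` are hypothetical stand-ins for the real steps:

```python
from concurrent.futures import ThreadPoolExecutor

def download_then_cache(langs, download, build_cache, cache_workers=2):
    """Keep one serial download stream, but hand each finished
    download to a small pool of cache builders, so caching one
    language overlaps downloading the next."""
    with ThreadPoolExecutor(max_workers=cache_workers) as pool:
        futures = []
        for lang in langs:
            path = download(lang)  # serial: one download at a time
            futures.append(pool.submit(build_cache, lang, path))
        # Wait for all cache builds before returning.
        return [f.result() for f in futures]

# Demo with fake steps that just record what they were given.
results = download_then_cache(
    ["am", "sw", "ca"],
    download=lambda lang: f"/tmp/raw/{lang}",
    build_cache=lambda lang, path: (lang, path),
)
print(results)
```

On JZ this whole function would instead be split in two: the download loop runs as one job on an internet-connected partition, and the cache-build pool runs later as a separate job on a CPU partition.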
- I didn't know that `HF-datasets` can parallelize the download. I knew that it can process data with `multi-processing`.
- Actually, we download each language by providing a `Subset`. It automatically creates a new folder and starts the download.
- If the caching fails, it raises an exception: https://github.com/sbmaruf/Megatron-DeepSpeed/blob/c4-mc4-pre_processing/tools/c4_mc4/c4_mc4_cache.py#L30 So the user has to look into the logs. If caching fails, the user has to clean the folder and start again.

For me, to maintain caching, logging and monitoring, it was really convenient to log each of them with a separate process. If it creates too much confusion, I can make it simpler by removing these additional conditions.
I hope you didn't miss my comment above about how this won't work, unless we separate the downloads from processing: https://github.com/bigscience-workshop/Megatron-DeepSpeed/pull/9/files/8383432b9a1bf76e32cafa1d857a807c9495c382..9a588e8bdf84cfb1ca19f4b9057c11ffc8980344#r686400563
- Data has already been processed and uploaded to the GC bucket.
- Data download and processing are done in two different scripts. 1, 2
- But even after downloading the data, HF-datasets performs one small internal processing step, which may take a significant amount of time for terabyte-scale datasets.
- Another lower-level implementation is already available in this pull.
@yongzx I find some numbers inconsistent. Especially that 86GB of raw data becomes 28GB of binary, while for English 784GB of raw data becomes 763GB of binary. There is no pattern in the raw-to-binary size ratio, which worries me most. Maybe that's how it should be; not sure at this point.
# Data Binarization Stat
If you tokenize English with the `t5` tokenizer, `784GB` of raw data becomes `344GB` (`*.bin` size, including validation). But if you tokenize the same `784GB` of raw data with `mt5`, it becomes `756GB`. This is not what we were expecting at first. Earlier we were expecting 50% English and 50% remaining languages, but after binarization (a.k.a. tokenization) that calculation doesn't hold. For most of the languages, the size reduced drastically after binarization. The stats are below:
> If you tokenize English with the `t5` tokenizer, `784GB` of raw data becomes `344GB` (`*.bin` size, including validation). But if you tokenize the same `784GB` of raw data with `mt5`, it becomes `756GB`. This is not what we were expecting at first.
I'm pretty sure this is caused by this:
https://github.com/bigscience-workshop/Megatron-DeepSpeed/blob/main/megatron/data/indexed_dataset.py#L25-L29
Essentially, the t5 tokenizer has 32128 tokens and mt5 has 250112 tokens. So I'm pretty sure the 2x memory footprint is because one uses `int16` and the other `int32`.
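The size gap follows directly from the dtype cutoff in the `indexed_dataset.py` lines linked above. A paraphrased sketch (the real code picks a numpy dtype; here we just count bytes):

```python
def bytes_per_token(vocab_size):
    """Paraphrase of the dtype choice in megatron's indexed_dataset:
    token ids fit in a 16-bit int while the vocab stays under the
    65500 cutoff used there, and need a 32-bit int otherwise."""
    return 2 if vocab_size < 65500 else 4

t5_bytes = bytes_per_token(32128)    # t5 vocab  -> 2 bytes/token
mt5_bytes = bytes_per_token(250112)  # mt5 vocab -> 4 bytes/token
print(t5_bytes, mt5_bytes)  # 2 4
# Same token count on disk, so the mt5 .bin is ~2x the t5 .bin.
```

This explains the factor of two, though not all of the 344GB vs 756GB gap; the rest comes from mt5 producing a different number of tokens for the same text.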
Thanks for sharing. Will take a look at this.
Thanks Maruf for downloading all the necessary language files. I went through the code repo and, overall, it looks good.
There are some functions (`calc_multinomial_sampling_prob_with_penalty`, `print_stat`, `get_size_stats`) that have identical names and similar/identical outputs in `data_resize.py` and `calc_iterator_prob.py`. I don't see them causing any issues right now, but if we are going to use these utility functions in the future, I suggest we move them out into a file like `util.py`.
Another potential issue is the lack of docstrings/comments, so it's harder to follow the code in `tools/c4_mc4/`. However, I don't think we need to add them now because this tool is for internal use (correct me if I'm wrong).
Edit: @sbmaruf has addressed my concerns about the functions and added docstrings.
mC4 data is too large: for the 13 selected languages it's around 18TB of data. I excluded the English data since Teven already processed it.
Arabic, Swahili (Bantu), Chinese, Catalan, English, French, Indic (Hindi, Urdu, Bangla), Indonesian, Portuguese, Spanish, Russian, Japanese, Amharic
This pull adds pre-processing code for mC4 data.