
Parallelizing the pre-processing of the dataset. #117

Merged (5 commits) on Sep 28, 2020

Conversation

@jayhpark530 (Contributor) commented Aug 8, 2020

  1. The current "process_one_file" function takes a long time because it runs on a single core.
    This part of the pre-processing takes the longest time.

    I modified it so that the work can be parallelized in units of day files.
    This parallelization reduces the function's execution time dramatically.
    In particular, on the Terabyte dataset it can cut the processing time from a few days to several hours.

    ConvertDict, which interferes with parallelization, was handled as follows (a minimal sketch of the idea follows this list):
    (1) Dictionary keys (convertDictDay) are created per day.
    (2) To create convertDict, the convertDictDay produced in parallel are merged in order of the day.
      (The third commit reflects this modification.)

  2. The second-longest part of the data pre-processing is the "processCriteoAdData" function.
    Parallelizing it can likewise reduce the processing time of this part from a few days to several hours when using the Terabyte dataset.

  3. Finally, I added the '--dataset_multiprocessing' argument so that the feature is only used on machines with sufficient resources.
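
As a rough, illustrative sketch of the per-day parallelization idea: one worker process per day file, each building its own per-day dictionary of unique hash values. The names (process_one_day, convert_dicts_day) and the toy data are assumptions for illustration only, not the actual code in data_utils.py.

from multiprocessing import Manager, Process

def process_one_day(day, hashes, convert_dicts_day):
    # Each worker builds its own per-day dictionary of unique hash values,
    # preserving the order in which they first appear within that day.
    seen = {}
    for h in hashes:
        if h not in seen:
            seen[h] = 1
    convert_dicts_day[day] = list(seen)

if __name__ == "__main__":
    # Toy stand-in for one categorical column of three day files.
    days = {0: ["a", "b", "a", "c"], 1: ["c", "d", "b"], 2: ["e", "a"]}
    manager = Manager()
    convert_dicts_day = manager.dict()
    procs = [Process(target=process_one_day, args=(d, h, convert_dicts_day))
             for d, h in days.items()]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    print(dict(convert_dicts_day))  # per-day keys, merged later in day order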

@facebook-github-bot

Hi @jayhpark530!

Thank you for your pull request and welcome to our community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file.

In order for us to review and merge your code, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

If you have received this in error or have any questions, please contact us at cla@fb.com. Thanks!

@facebook-github-bot added the "CLA Signed" label Aug 8, 2020
@facebook-github-bot

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Facebook open source project. Thanks!

@jayhpark530 changed the title from Parallelizing the "process_one_file" function. to Parallelizing the pre-processing of the dataset. on Aug 12, 2020
@mnaumovfb (Contributor) commented Aug 13, 2020

Thank you for taking a stab at making the pre-processing stage faster. This is an important part of the code.

Can you please help me understand how the ConvertDictDay dictionaries are converted into ConvertDict without causing the issue explained in the following comment on #58? Quoting:
"I wanted to make an important point that in general the processing of multiple days is not embarrassingly parallel. We use dictionaries to associate an index with each unique hash value, and we process the hash values in order of appearance over multiple days. Therefore, if the days are processed completely independently the indices might be assigned out of order, but more importantly the same hash value might be assigned a different index on different days."

@jayhpark530 (Contributor Author)

Thank you for reviewing my request.
The key idea is to create a ConvertDictDay in each parallelized process (one per day) and then build ConvertDict by merging them in day order.

ConvertDict here refers to the step of creating dictionary keys in the "process_one_file" function, not the step of creating dictionary values and files (that step is relatively fast and is not parallelized).

After the parallel execution, the ConvertDictDay created in each process are merged into ConvertDict in day order. The result is equivalent to creating ConvertDict's dictionary keys serially, so it does not cause issues in the later step of assigning dictionary values.
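
To make the merge step concrete, here is a minimal sketch (illustrative names and toy data, not the actual data_utils.py code) of combining per-day key lists into a single dictionary in day order; each hash value receives the same index it would get from a serial pass over the days.

def merge_in_day_order(convert_dicts_day, num_days):
    # Walk the days strictly in order; within a day, keep per-day
    # appearance order. A hash value gets an index only the first
    # time it is seen, exactly as in serial processing.
    convert_dict = {}
    for day in range(num_days):
        for h in convert_dicts_day[day]:
            if h not in convert_dict:
                convert_dict[h] = len(convert_dict)
    return convert_dict

per_day = {0: ["a", "b", "c"], 1: ["c", "d", "b"], 2: ["e", "a"]}
print(merge_in_day_order(per_day, 3))
# -> {'a': 0, 'b': 1, 'c': 2, 'd': 3, 'e': 4}, same as a serial run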

@jayhpark530 (Contributor Author)

Finally, I have confirmed that the output of pre-processing with parallelization is exactly the same as the output of the traditional serial execution.
The one exception is when the --data-sub-sample-rate parameter is used: the pre-processed file may then differ depending on the random values of rand_u in the process_one_file function. To get identical results, use the same random seed in process_one_file.
For example:

np.random.seed(split)
rand_u = np.random.uniform(low=0.0, high=1.0, size=num_data_in_split)
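
A small, self-contained illustration of why per-split seeding makes the random draws (and hence the sub-sampled output) reproducible; the function name and the values below are made up for the example.

import numpy as np

def reproducible_rand(split, num_data_in_split):
    # Seeding with the split (day) index makes each worker's random draws
    # identical across runs, regardless of process scheduling.
    np.random.seed(split)
    return np.random.uniform(low=0.0, high=1.0, size=num_data_in_split)

a = reproducible_rand(split=3, num_data_in_split=5)
b = reproducible_rand(split=3, num_data_in_split=5)
assert (a == b).all()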

@amirstar (Contributor) commented Sep 9, 2020

Thank you @jayhpark530 for your PR. I am trying to verify the correctness of the code on a machine with 114 GB of RAM. Running with the Kaggle dataset does not seem to be successful.
This is the command:

dlrm_extra_option="--dataset-multiprocessing"
python dlrm_s_pytorch.py --arch-sparse-feature-size=16 --arch-mlp-bot="13-512-256-64-16" --arch-mlp-top="512-256-1" --data-generation=dataset --data-set=kaggle --raw-data-file=./input/train.txt --processed-data-file=./input/kaggleAdDisplayChallenge_processed.npz --loss-function=bce --round-targets=True --learning-rate=0.1 --mini-batch-size=128 --print-freq=1024 --print-time --test-mini-batch-size=16384 --test-num-workers=16 $dlrm_extra_option

Output:
.
.
.
Load 6417687/6548660 (98%) Split: 0 Label True: 0 Stored: 0
Load 6483174/6548660 (99%) Split: 0 Label True: 0 Stored: 0

Saved ./input/train_day_0.npz!
Constructing convertDicts Split: 0
Constructing convertDicts Split: 1
Constructing convertDicts Split: 2
Constructing convertDicts Split: 3
Constructing convertDicts Split: 4
Constructing convertDicts Split: 5
Constructing convertDicts Split: 6
Total number of samples: 45840617
Divided into days/splits:
[6548660, 6548660, 6548660, 6548660, 6548659, 6548659, 6548659]
Not existing ./input/train_day_1_processed.npz
Processed ./input/train_day_1_processed.npz
Not existing ./input/train_day_5_processed.npz
Processed ./input/train_day_5_processed.npz
Not existing ./input/train_day_0_processed.npz
Processed ./input/train_day_0_processed.npz
Not existing ./input/train_day_3_processed.npz
Processed ./input/train_day_3_processed.npz
Not existing ./input/train_day_4_processed.npz
Processed ./input/train_day_4_processed.npz
Not existing ./input/train_day_2_processed.npz
Processed ./input/train_day_2_processed.npz
Not existing ./input/train_day_6_processed.npz
Processed ./input/train_day_6_processed.npz

After this line, the program crashed. The crash happens during the loop at line 1110 of data_utils.py.

I also want to mention that running the above command without the "--dataset-multiprocessing" argument works fine.

@jayhpark530 (Contributor Author)

Thank you @amirstar for verifying the correctness. The code can run with 64 GB of RAM, so it doesn't seem to be a memory issue.

When I ran it there was no problem, so I need more information about the crash. The output 'Constructing convertDicts Split: 0~6' is produced after line 1110. What error was reported at line 1110?

I'm not sure, but the behavior of the multiprocessing Process part of Python may differ between versions. Which version of Python are you using? (For reference, I'm using 3.7.7.)

@amirstar (Contributor)

The code works fine with Python 3.8. Thank you for your great work! We will merge it.

@jayhpark530 jayhpark530 reopened this Sep 18, 2020
@jayhpark530 (Contributor Author)

> The code works fine with Python 3.8. Thank you for your great work! We will merge it.

Thank you @amirstar for verifying and merging.

@mnaumovfb mnaumovfb merged commit 78d7045 into facebookresearch:master Sep 28, 2020
tginart added a commit to tginart/dlrm that referenced this pull request Sep 29, 2020
Parallelizing the pre-processing of the dataset. (facebookresearch#117)
YoungsukKim12 pushed a commit to YoungsukKim12/dlrm that referenced this pull request Oct 29, 2023
* Parallelizing the "process_one_file" function.

* Parallelizing the "processCriteoAdData" function.

* Fix convertDicts construction by day order

* Deleted print for debugging.

* Changed memory requirements.