
Parallelizing the pre-processing of the dataset. #117

Merged (5 commits) on Sep 28, 2020

Conversation

@jayhpark530 (Contributor) commented Aug 8, 2020

  1. The current "process_one_file" function takes a long time because it runs on a single core.
    This part of the pre-processing takes the longest time.

    I modified it so that the work can be parallelized in units of day files.
    This parallelization reduces the function's execution time dramatically.
    In particular, on the Terabyte dataset it can cut the processing time from a few days to several hours.

    ConvertDict, which interferes with parallelization, was handled as follows (a minimal sketch of the idea follows this list):
    (1) Dictionary keys (convertDictDay) are created per day.
    (2) To create convertDict, the convertDictDay produced in parallel are merged in order of the day.
      (The third commit reflects this modification.)

  2. The second-longest part of the data pre-processing is the "processCriteoAdData" function.
    Parallelizing it can likewise reduce the processing time of this part from a few days to several hours when using the Terabyte dataset.

  3. Finally, I added the '--dataset_multiprocessing' argument so that the feature is only used on machines with sufficient resources.
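
As a rough, illustrative sketch of the per-day parallelization idea: one worker process per day file, each building its own per-day dictionary of unique hash values. The names (process_one_day, convert_dicts_day) and the toy data are assumptions for illustration only, not the actual code in data_utils.py.

from multiprocessing import Manager, Process

def process_one_day(day, hashes, convert_dicts_day):
    # Each worker builds its own per-day dictionary of unique hash values,
    # preserving the order in which they first appear within that day.
    seen = {}
    for h in hashes:
        if h not in seen:
            seen[h] = 1
    convert_dicts_day[day] = list(seen)

if __name__ == "__main__":
    # Toy stand-in for one categorical column of three day files.
    days = {0: ["a", "b", "a", "c"], 1: ["c", "d", "b"], 2: ["e", "a"]}
    manager = Manager()
    convert_dicts_day = manager.dict()
    procs = [Process(target=process_one_day, args=(d, h, convert_dicts_day))
             for d, h in days.items()]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    print(dict(convert_dicts_day))  # per-day keys, merged later in day order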

@facebook-github-bot

Hi @jayhpark530!

Thank you for your pull request and welcome to our community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file.

In order for us to review and merge your code, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

If you have received this in error or have any questions, please contact us at cla@fb.com. Thanks!

@facebook-github-bot added the "CLA Signed" label Aug 8, 2020
@facebook-github-bot

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Facebook open source project. Thanks!

@jayhpark530 changed the title from Parallelizing the "process_one_file" function. to Parallelizing the pre-processing of the dataset. on Aug 12, 2020
@mnaumovfb (Contributor) commented Aug 13, 2020

Thank you for taking a stab at making the pre-processing stage faster. This is an important part of the code.

Can you please help me understand how the ConvertDictDay dictionaries are converted into ConvertDict without causing the issue explained in the following comment on #58? Quoting:
"I wanted to make an important point that in general the processing of multiple days is not embarrassingly parallel. We use dictionaries to associate an index with each unique hash value, and we process the hash values in order of appearance over multiple days. Therefore, if the days are processed completely independently the indices might be assigned out of order, but more importantly the same hash value might be assigned a different index on different days."

@jayhpark530 (Contributor Author)

Thank you for reviewing my request.
The key idea is to create a ConvertDictDay in each parallelized process (one per day) and then build ConvertDict by merging them in day order.

ConvertDict here refers to the step of creating dictionary keys in the "process_one_file" function, not the step of creating dictionary values and files (that step is relatively fast and is not parallelized).

After the parallel execution, the ConvertDictDay created in each process are merged into ConvertDict in day order. The result is equivalent to creating ConvertDict's dictionary keys serially, so it does not cause issues in the later step of assigning dictionary values.
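
To make the merge step concrete, here is a minimal sketch (illustrative names and toy data, not the actual data_utils.py code) of combining per-day key lists into a single dictionary in day order; each hash value receives the same index it would get from a serial pass over the days.

def merge_in_day_order(convert_dicts_day, num_days):
    # Walk the days strictly in order; within a day, keep per-day
    # appearance order. A hash value gets an index only the first
    # time it is seen, exactly as in serial processing.
    convert_dict = {}
    for day in range(num_days):
        for h in convert_dicts_day[day]:
            if h not in convert_dict:
                convert_dict[h] = len(convert_dict)
    return convert_dict

per_day = {0: ["a", "b", "c"], 1: ["c", "d", "b"], 2: ["e", "a"]}
print(merge_in_day_order(per_day, 3))
# -> {'a': 0, 'b': 1, 'c': 2, 'd': 3, 'e': 4}, same as a serial run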

@jayhpark530 (Contributor Author)

Finally, I have confirmed that the output of pre-processing with parallelization is exactly the same as the output of the traditional serial execution.
The one exception is when the --data-sub-sample-rate parameter is used: the pre-processed file may then differ depending on the random values of rand_u in the process_one_file function. To get identical results, use the same random seed in process_one_file.
For example:

np.random.seed(split)
rand_u = np.random.uniform(low=0.0, high=1.0, size=num_data_in_split)
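
A small, self-contained illustration of why per-split seeding makes the random draws (and hence the sub-sampled output) reproducible; the function name and the values below are made up for the example.

import numpy as np

def reproducible_rand(split, num_data_in_split):
    # Seeding with the split (day) index makes each worker's random draws
    # identical across runs, regardless of process scheduling.
    np.random.seed(split)
    return np.random.uniform(low=0.0, high=1.0, size=num_data_in_split)

a = reproducible_rand(split=3, num_data_in_split=5)
b = reproducible_rand(split=3, num_data_in_split=5)
assert (a == b).all()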

@amirstar (Contributor) commented Sep 9, 2020

Thank you @jayhpark530 for your PR. I am trying to verify the correctness of the code on a machine with 114 GB of RAM. Running with the Kaggle dataset does not seem to be successful.
This is the command:

dlrm_extra_option="--dataset-multiprocessing"
python dlrm_s_pytorch.py --arch-sparse-feature-size=16 --arch-mlp-bot="13-512-256-64-16" --arch-mlp-top="512-256-1" --data-generation=dataset --data-set=kaggle --raw-data-file=./input/train.txt --processed-data-file=./input/kaggleAdDisplayChallenge_processed.npz --loss-function=bce --round-targets=True --learning-rate=0.1 --mini-batch-size=128 --print-freq=1024 --print-time --test-mini-batch-size=16384 --test-num-workers=16 $dlrm_extra_option

Output:
.
.
.
Load 6417687/6548660 (98%) Split: 0 Label True: 0 Stored: 0
Load 6483174/6548660 (99%) Split: 0 Label True: 0 Stored: 0

Saved ./input/train_day_0.npz!
Constructing convertDicts Split: 0
Constructing convertDicts Split: 1
Constructing convertDicts Split: 2
Constructing convertDicts Split: 3
Constructing convertDicts Split: 4
Constructing convertDicts Split: 5
Constructing convertDicts Split: 6
Total number of samples: 45840617
Divided into days/splits:
[6548660, 6548660, 6548660, 6548660, 6548659, 6548659, 6548659]
Not existing ./input/train_day_1_processed.npz
Processed ./input/train_day_1_processed.npz
Not existing ./input/train_day_5_processed.npz
Processed ./input/train_day_5_processed.npz
Not existing ./input/train_day_0_processed.npz
Processed ./input/train_day_0_processed.npz
Not existing ./input/train_day_3_processed.npz
Processed ./input/train_day_3_processed.npz
Not existing ./input/train_day_4_processed.npz
Processed ./input/train_day_4_processed.npz
Not existing ./input/train_day_2_processed.npz
Processed ./input/train_day_2_processed.npz
Not existing ./input/train_day_6_processed.npz
Processed ./input/train_day_6_processed.npz

After this line, the program crashed. The crash happens during the loop at line 1110 of data_utils.py.

I also want to mention that running the above command without the "--dataset-multiprocessing" argument works fine.

@jayhpark530 (Contributor Author)

Thank you @amirstar for verifying the correctness. The code can run with 64 GB of RAM, so it doesn't seem to be a memory issue.

When I ran it there was no problem, so I need more information about the crash. The output 'Constructing convertDicts Split: 0~6' is produced after line 1110. What error was reported at line 1110?

I'm not sure, but the behavior of the multiprocessing Process part of Python may differ between versions. Which version of Python are you using? (For reference, I'm using 3.7.7.)

@amirstar (Contributor)

The code works fine with Python 3.8. Thank you for your great work! We will merge it.

@jayhpark530 jayhpark530 reopened this Sep 18, 2020
@jayhpark530 (Contributor Author)

> The code works fine with Python 3.8. Thank you for your great work! We will merge it.

Thank you @amirstar for verifying and merging.

@mnaumovfb mnaumovfb merged commit 78d7045 into facebookresearch:master Sep 28, 2020
tginart added a commit to tginart/dlrm that referenced this pull request Sep 29, 2020
Parallelizing the pre-processing of the dataset. (facebookresearch#117)
YoungsukKim12 pushed a commit to YoungsukKim12/dlrm that referenced this pull request Oct 29, 2023
* Parallelizing the "process_one_file" function.

* Parallelizing the "processCriteoAdData" function.

* Fix convertDicts construction by day order

* Deleted print for debugging.

* Changed memory requirements.