Parallelizing the pre-processing of the dataset. #117
Conversation
Hi @jayhpark530! Thank you for your pull request and welcome to our community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. In order for us to review and merge your code, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA. If you have received this in error or have any questions, please contact us at cla@fb.com. Thanks!
Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Facebook open source project. Thanks!
Thank you for taking a stab at making the pre-processing stage faster. This is an important part of the code. Can you please help me understand how the per-day dictionaries (ConvertDictDay) are converted to ConvertDict without causing the issue explained in the following comment: #58 (comment)? Quoting:
Thank you for reviewing my request. |
Finally, I have confirmed that the output of pre-processing with parallelization is exactly the same as the output of the original serial execution.
Thank you @jayhpark530 for your PR. I am trying to verify the correctness of the code on a machine with 114 GB RAM. Running with the Kaggle dataset does not seem to be successful. dlrm_extra_option="--dataset-multiprocessing" Output: Saved ./input/train_day_0.npz! After this line, the program crashed. The crash happens during the loop at line 1110 of the file data_utils.py. I want to mention that running the above command without the argument "--dataset-multiprocessing" works fine.
Thank you @amirstar for verifying the correctness. It can run on 64 GB RAM, so it doesn't seem to be a memory issue. When I ran it, there was no problem, so I need more information about the crash. The message 'Constructing convertDicts Split: 0~6' is printed after line 1110 is executed. What error was printed at line 1110? I'm not sure, but it seems that the behavior of Python's multiprocessing Process may differ between versions. Which version of Python are you using? (For reference, I'm using version 3.7.7.)
The code works fine with Python 3.8. Thank you for your great work! We will merge it. |
Thank you @amirstar for verifying and merging. |
Parallelizing the pre-processing of the dataset. (facebookresearch#117)
* Parallelizing the "process_one_file" function.
* Parallelizing the "processCriteoAdData" function.
* Fix convertDicts construction by day order
* Deleted print for debugging.
* Changed memory requirements.
The current "process_one_file" function takes a long time using a single core.
This part of the pre-processing takes the longest time.
I modified it to enable parallelization in units of day files.
This parallelizing reduce the execution time of function dramatically.
In particular, using the Terabyte dataset can reduce the processing time from a few days to several hours.
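As a rough illustration of the per-day parallelization described above, here is a minimal sketch (not the PR's actual code): the worker is a stand-in for data_utils.py's process_one_file, and the file-naming scheme and arguments are assumptions.

```python
from multiprocessing import Process, Manager

def process_one_day(day, raw_file, out_prefix, day_dicts):
    """Stand-in for process_one_file: pre-process a single day file.

    In the real code this parses the raw Criteo file for one day, writes a
    per-day .npz, and records every categorical value it sees so that the
    dictionaries can be built afterwards.
    """
    # ... parse raw_file, write "<out_prefix>_day_<day>.npz" ...
    day_dicts[day] = {}  # placeholder for the categorical values seen on this day

def preprocess_days_in_parallel(num_days, raw_prefix, out_prefix):
    with Manager() as manager:
        day_dicts = manager.dict()  # shared container for the per-day results
        procs = [
            Process(
                target=process_one_day,
                args=(d, "{}_{}".format(raw_prefix, d), out_prefix, day_dicts),
            )
            for d in range(num_days)
        ]
        for p in procs:
            p.start()  # one worker per day file, all days run concurrently
        for p in procs:
            p.join()
        # copy the results out before the manager shuts down
        return [day_dicts[d] for d in range(num_days)]
```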
The convertDict structure, which interferes with parallelization, was handled as follows:
(1) A per-day dictionary (convertDictDay) is created for each day.
(2) To create convertDict, the convertDictDay dictionaries produced in parallel are merged in order of the day.
(The third commit reflects this modification.)
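A minimal sketch of step (2): merging the per-day dictionaries in day order so that every categorical value receives the same index it would get from a serial pass. The layout (one dictionary per sparse feature) is an assumption based on the description above, not the repository's exact structure.

```python
def merge_convert_dicts(convert_dicts_day, num_sparse_features):
    """Merge per-day dictionaries (convertDictDay) into a single convertDict.

    Iterating the days in order keeps the index assignment deterministic,
    which is why the parallel output can match the serial output exactly.
    """
    convert_dicts = [{} for _ in range(num_sparse_features)]
    for day_dicts in convert_dicts_day:  # day 0, day 1, ... in order
        for j in range(num_sparse_features):
            for category in day_dicts[j]:
                if category not in convert_dicts[j]:
                    # assign the next free index the first time a value is seen
                    convert_dicts[j][category] = len(convert_dicts[j])
    return convert_dicts
```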
The second-longest part of the data pre-processing is the "processCriteoAdData" function.
Parallelizing it can likewise reduce the processing time of this part from a few days to several hours when using the Terabyte dataset.
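The same Process-per-day pattern can be applied here. A hedged sketch follows: processCriteoAdData is the repository's function, but the argument list shown below is only an assumption for illustration, so the function is passed in as a parameter.

```python
from multiprocessing import Process

def run_criteo_ad_data_in_parallel(days, d_path, d_file, npzfile,
                                   convertDicts, counts, processCriteoAdData):
    """Run processCriteoAdData for every day file in parallel.

    Each day file is rewritten independently, so no merge step is needed
    after the workers finish.
    """
    procs = [
        Process(target=processCriteoAdData,
                args=(d_path, d_file, npzfile, i, convertDicts, counts))
        for i in range(days)
    ]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```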
Finally, I added the '--dataset-multiprocessing' argument so that parallel pre-processing is only enabled on machines with sufficient resources.
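For completeness, a hedged sketch of how such a flag is typically declared with argparse; the real definition lives in the repository's own scripts, and the help text here is illustrative only.

```python
import argparse

parser = argparse.ArgumentParser(description="Criteo pre-processing options (sketch)")
parser.add_argument(
    "--dataset-multiprocessing",
    action="store_true",
    default=False,
    help="pre-process the dataset with one worker process per day file "
         "(needs enough RAM, e.g. roughly 64 GB or more, as discussed above)",
)
args = parser.parse_args()
print(args.dataset_multiprocessing)  # False unless the flag is passed
```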