How long will the preprocessing step take? #6
Hi Chujie,
The preprocessing step can be run in parallel, as it processes one read at a time. Can you split the input data?
I have cc'd Pablo, Akanksha, and AJ, who have run CHEUI with large samples and might be able to provide additional suggestions.
Best,
Eduardo
…On Tue, 8 Nov 2022 at 13:16, ssscj ***@***.***> wrote:
Hi,
Thanks for developing CHEUI. I ran CHEUI on my own data and the nanopolish result file is about 700 GB. I have been running the preprocess_m6A step using 20 CPUs for 10 days and it hasn't finished yet. How long does this step normally take, and is there any way to accelerate it? Thank you.
Chujie
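Eduardo's suggestion to split the input can be sketched as follows. This is a generic illustration, not part of CHEUI: it splits a tab-separated nanopolish eventalign file into chunks so each chunk can be preprocessed in parallel, keeping all rows of one read in the same chunk. The column name `read_index` is an assumption about the nanopolish output header; adjust it to match your file.

```python
# Split a nanopolish eventalign TSV into n chunks without splitting a read
# across files. Rows for the same read are consecutive in eventalign output,
# so grouping consecutive rows by the read column is enough.
import csv
import itertools

def split_by_read(eventalign_path, n_chunks, prefix="chunk"):
    with open(eventalign_path, newline="") as fh:
        reader = csv.reader(fh, delimiter="\t")
        header = next(reader)
        read_col = header.index("read_index")  # assumed column name
        outs, writers = [], []
        for i in range(n_chunks):
            out = open(f"{prefix}_{i}.tsv", "w", newline="")
            w = csv.writer(out, delimiter="\t")
            w.writerow(header)  # every chunk keeps its own header line
            outs.append(out)
            writers.append(w)
        # deal whole reads out to the chunks round-robin
        groups = itertools.groupby(reader, key=lambda r: r[read_col])
        for i, (_, rows) in enumerate(groups):
            for row in rows:
                writers[i % n_chunks].writerow(row)
        for out in outs:
            out.close()
```

Each chunk can then be fed to the preprocessing script as a separate job.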
Hi,
Sorry about the issue. We are working on making the preprocessing faster. In the meantime, is it possible for you to try a really large number for -n, say -n 400? By setting the -n flag you also define how many small files are created from the input file. The number of parallel processes is limited by the number of CPUs you have, but since each small file finishes faster, the overall time may be reduced.
Thanks,
Hi,
The preprocessing is multi-threaded, so it should be possible to run it faster.
We also have a C++ version that is much faster than the Python version. Did you try it?
I hope this helps,
E.
…On Sun, 7 May 2023 at 20:16, baishengjun ***@***.***> wrote:
Hi,
The nanopolish result file I generated is about 4.2 TB, and I use 20 CPUs for the preprocessing step, but it seems to use only a single core. It is taking far too long; any suggestions?
|
Hi, The preprocessing step will first create a new folder and generate some temp files. The number of temp files is the same as the number of CPUs you choose in your command line. This step will use only one CPU. After all the temp files are generated, the C++ program will run in parallel with multi-threads. And the temp files will be removed in the end. Upgrading your GCC compiler to a later version can speed up. For C++ version, I recommend setting the Hope this helps, |
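The two-phase behaviour described above can be illustrated with a minimal sketch. This is not CHEUI's code, only the pattern: a single process writes N temp files, a pool of worker threads then processes them in parallel, and the temp files are removed at the end.

```python
# Phase 1 (single CPU): deal records into n temp files.
# Phase 2 (parallel): a thread pool processes one temp file per worker.
# Cleanup: temp files are removed when all workers finish.
import os
from concurrent.futures import ThreadPoolExecutor

def make_temp_files(records, n):
    """Single-CPU phase: write n temp files, round-robin over records."""
    paths = [f"tmp_part_{i}.txt" for i in range(n)]
    handles = [open(p, "w") for p in paths]
    for i, rec in enumerate(records):
        handles[i % n].write(rec + "\n")
    for h in handles:
        h.close()
    return paths

def process_one(path):
    """Parallel phase: count records in one temp file (stand-in for real work)."""
    with open(path) as fh:
        return sum(1 for _ in fh)

def run(records, n):
    paths = make_temp_files(records, n)
    try:
        with ThreadPoolExecutor(max_workers=n) as pool:
            counts = list(pool.map(process_one, paths))
    finally:
        for p in paths:  # temp files removed at the end
            os.remove(p)
    return sum(counts)
```

The practical consequence, as the comment explains, is that the single-CPU split phase can dominate on very large inputs even though the second phase scales with the number of CPUs.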
Hi there,
A faster preprocessing solution: …
Hope this helps,
Hi, thanks a lot!
Hi Eileen, thanks a lot!
Hi Bai, to combine the split files, please run combine_binary_file.py in the scripts folder as below: you need to provide the path of the folder containing all the split files and the output file name.
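A minimal sketch of what the combine step does, assuming each split file is a pickled dictionary mapping a read/site ID to its processed signal. That format is an assumption for illustration; the authoritative implementation is scripts/combine_binary_file.py in the CHEUI repository.

```python
# Merge several pickled dicts into one output file, combining both keys
# and values. Files are loaded one at a time in sorted order.
import os
import pickle

def combine_split_files(folder, out_path):
    combined = {}
    for name in sorted(os.listdir(folder)):
        with open(os.path.join(folder, name), "rb") as fh:
            part = pickle.load(fh)   # one split file at a time
        combined.update(part)        # merge keys AND their values
    with open(out_path, "wb") as fh:
        pickle.dump(combined, fh)
    return len(combined)             # number of processed signals
```

As noted further down the thread, the combined file should contain as many processed signals as the individual split files together, which is a useful sanity check before deleting anything.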
Hi Akanksha, big thanks for your help. Thanks a lot,
Hi Bai, yes, we noted that as well in our test case for the script, but it could be because it is a binary file. Also, the number of processed signals in the combined file equals the sum of the processed signals in the individual files, so it should be fine for the next step. Still, I would recommend not deleting the individual split files until you have the final results. Thanks,
Hi Akanksha, the combined file does not seem to work in the next step; it throws the following error: … Thanks a lot,
Hi Bai, sorry, there was a bug in the code: it was only combining the keys and not the values. Thanks,
Hi Akanksha, yes, it works now, but it consumes too much memory. Thanks,
Hi Bai, sorry, could you please try the latest updated version of the script? It should solve the memory issue. Thanks,
Hi Akanksha, big thanks to you; the memory issue is solved. Thanks,