File locking blocks indefinitely in writePileUps
#6
Comments
Hi Randy, I don't have a clue so far. It looks like some kind of memory leak in the writer for the binary pile-ups format. Here are some questions to shed some light on the issue:
|
Hi Arne,
CC: Tim Smith
I’ll try to send you the entire directory. I’m running tar on it now. The .las file is 7.2 G. We are trying this on an SGE cluster. I think, from memory and random peeks while it was running, that the memory use on the node running the collect phase increased to about 50G. After some time, I logged into the node. “ps” showed that the process was still running but “top” didn’t show that it was using any CPU. I should have saved the output of “ps” and “lsof” but I didn’t think of that before I deleted the job. I’d be happy to rerun it and do anything to diagnose the problem.
Regards,
Randy
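For the offered rerun, a few standard Linux commands along these lines should capture the state of a stuck collect process before it is killed (the <PID> placeholder is hypothetical; substitute the actual dentist process ID):
# process state, wait channel, memory and runtime
ps -o pid,stat,wchan:32,rss,etime,cmd -p <PID>
# open files (and file locks) held by the process
lsof -p <PID>
# kernel stack of the process (needs root); a process blocked on a lock or NFS call usually shows it here
cat /proc/<PID>/stack
# system-wide list of POSIX/flock locks
cat /proc/locks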
[randy.bradley@arsnecla0fshome:~ 09:47:26]tail -15 dentist-test/joblog/dentistCore.out
Finished job 1308.
1237 of 1243 steps (100%) done
[Fri May 22 23:57:40 2020]
checkpoint collect:
input: workdir/dentist.json, workdir/scaffolds_FINAL.dam, workdir/.scaffolds_FINAL.bps, workdir/.scaffolds_FINAL.hdr, workdir/.scaffolds_FINAL.idx, workdir/haplotype-Simmental_dam.reformat.dam, workdir/.haplotype-Simmental_dam.reformat.bps, workdir/.haplotype-Simmental_dam.reformat.hdr, workdir/.haplotype-Simmental_dam.reformat.idx, workdir/scaffolds_FINAL.haplotype-Simmental_dam.reformat.las, workdir/.scaffolds_FINAL.dentist-self.anno, workdir/.scaffolds_FINAL.dentist-self.data, workdir/.scaffolds_FINAL.tan.anno, workdir/.scaffolds_FINAL.tan.data, workdir/.scaffolds_FINAL.dentist-reads.anno, workdir/.scaffolds_FINAL.dentist-reads.data
output: workdir/pile-ups.db
log: logs/collect.log
jobid: 1303
reason: Missing output files: workdir/pile-ups.db; Input files updated by another job: workdir/.scaffolds_FINAL.dentist-self.data, workdir/scaffolds_FINAL.haplotype-Simmental_dam.reformat.las, workdir/.scaffolds_FINAL.dentist-reads.data, workdir/.scaffolds_FINAL.dentist-self.anno, workdir/.scaffolds_FINAL.dentist-reads.anno
threads: 8
Downstream jobs will be updated after completion.
dentist collect --config=workdir/dentist.json --threads=4 --auxiliary-threads=2 - - - - 2> logs/collect.log
Submitted job 1303 with external jobid 'Your job 377542 ("dentist.collect") has been submitted'.
[randy.bradley@arsnecla0fshome:~ 09:41:49]cat dentist-test/workdir/dentist.json
{"__default__": {"read-coverage": 73.0, "max-coverage-self": 3, "verbose": 2, "reference": "workdir/scaffolds_FINAL.dam", "reads": "workdir/haplotype-Simmental_dam.reformat.dam", "result": "gap-closed.fasta", "ref-vs-reads-alignment": "workdir/scaffolds_FINAL.haplotype-Simmental_dam.reformat.las", "mask": ["dentist-self", "tan", "dentist-reads"], "pile-ups": "workdir/pile-ups.db", "insertions": "workdir/insertions.db"}, "output": {"fasta-line-width": 80}, "mask-repetitive-regions": {"reads": null}}[randy.bradley@arsnecla0fshome:~ 09:42:33]ll -h dentist-test/workdir/
total 13G
-rw-rw-r-- 1 randy.bradley randy.bradley 503 May 22 15:02 dentist.json
-rw-rw-r-- 1 randy.bradley randy.bradley 25K May 22 17:00 haplotype-Simmental_dam.reformat.dam
-rw-rw-r-- 1 randy.bradley randy.bradley 0 May 23 01:31 pile-ups.db
-rw-rw-r-- 1 randy.bradley randy.bradley 421 May 22 15:03 scaffolds_FINAL.dam
-r--r--r-- 1 randy.bradley randy.bradley 7.2G May 22 23:41 scaffolds_FINAL.haplotype-Simmental_dam.reformat.las
-r--r--r-- 1 randy.bradley randy.bradley 9.6G May 22 19:40 scaffolds_FINAL.scaffolds_FINAL.las
-rw-rw-r-- 1 randy.bradley randy.bradley 12 May 22 15:03 TAN.scaffolds_FINAL.las
|
Hi Arne,
We’ve been having problems trying to upload the entire workdir to your owncloud due to the filesize. Aaron had the idea of splitting the .tar file up into chunks that you could cat back together. Is that OK?
Thanks.
If he is willing to handle a split file, I could potentially upload it to his ownCloud in chunks rather than Box. That might be better for him.
Randy, do you want to see if he would take that? I'd use split to break up the file and provide md5sums for each part and for the full file. He'd have to cat the parts back together before decompressing.
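A minimal sketch of that approach, assuming a 14G chunk size (which matches the part sizes listed further down in this thread):
# split into 14G chunks named dentist-workdir.tar.gz.part-aa, .part-ab, ...
split -b 14G dentist-workdir.tar.gz dentist-workdir.tar.gz.part-
# checksum the original file and every part
md5sum dentist-workdir.tar.gz dentist-workdir.tar.gz.part-* > dentist.md5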
From: "Rogge, Aaron" <aaron.rogge@usda.gov>
Date: Wednesday, May 27, 2020 at 11:39 AM
To: "Bradley, Randy" <randy.bradley@usda.gov>
Cc: "Anderson, Phil" <phil.anderson@usda.gov>, "Smith, Tim - ARS" <tim.smith2@usda.gov>
Subject: RE: [a-ludi/dentist] Hi Arne, (#6)
Second try crashed out at about the same point (6.5 MB difference). I think the file is too large for his ownCloud.
<?xml version="1.0" encoding="utf-8"?>
<d:error xmlns:d="DAV:" xmlns:s="http://sabredav.org/ns">
<s:exception>Sabre\DAV\Exception\BadRequest</s:exception>
<s:message>expected filesize 67073086930 got 18478247936</s:message>
</d:error>
Summary:
* Source file is 63G (gzipped)
* File is too large for Box – 15G max file size
* Upload crashes out at just over 17G transferring to ownCloud – max file size?
* Globus isn’t licensed for shared end points
* OneDrive is internal only and also has max file size issues
Any other ideas? I can split the file into chunks and upload to Box but they would have to be reassembled on the other end….
From: Rogge, Aaron
Sent: Wednesday, May 27, 2020 11:10 AM
To: Bradley, Randy <randy.bradley@usda.gov>
Cc: Anderson, Phil <phil.anderson@usda.gov>; Smith, Tim - ARS <tim.smith2@usda.gov>
Subject: RE: [a-ludi/dentist] Hi Arne, (#6)
After uploading all night the upload “timed out”. I am beginning to suspect a file size limit. The file I’m trying to upload is 63G.
Below is the curl command. Sadly I am not getting a progress bar, so I can’t monitor progress. My first transfer died at 17.2G. I am trying again to see if it drops out at about the same point.
curl -k -u "Aa4Jf89vjEDVMKV:" -H 'X-Requested-With: XMLHttpRequest' "https://cloud.mpi-cbg.de/public.php/webdav/<https://gcc02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fcloud.mpi-cbg.de%2Fpublic.php%2Fwebdav%2F&data=02%7C01%7C%7Cf81ceada8bee4d696e1b08d8025c8a62%7Ced5b36e701ee4ebc867ee03cfa0d4697%7C0%7C0%7C637261943762448765&sdata=LibxbKpCS22eSrj8Tfe2iWUG3bn6Girw3KIrlIBk67k%3D&reserved=0>" -T dentist-workdir.tar.gz
Here is the error I received the first time
<?xml version="1.0" encoding="utf-8"?>
<d:error xmlns:d="DAV:" xmlns:s="http://sabredav.org/ns<https://gcc02.safelinks.protection.outlook.com/?url=http%3A%2F%2Fsabredav.org%2Fns&data=02%7C01%7C%7Cf81ceada8bee4d696e1b08d8025c8a62%7Ced5b36e701ee4ebc867ee03cfa0d4697%7C0%7C0%7C637261943762458761&sdata=Bzmi6HQ%2B%2Bw%2F%2B%2Bp53Oy%2FDtbJwiNeSCTElDNAP%2FNFGghs%3D&reserved=0>">
<s:exception>Sabre\DAV\Exception\BadRequest</s:exception>
<s:message>expected filesize 67073086930 got 18485112832</s:message>
</d:error>
From: Bradley, Randy <randy.bradley@usda.gov>
Sent: Wednesday, May 27, 2020 8:45 AM
To: Rogge, Aaron <aaron.rogge@usda.gov>
Cc: Anderson, Phil <phil.anderson@usda.gov>; Smith, Tim - ARS <tim.smith2@usda.gov>
Subject: Re: [a-ludi/dentist] Hi Arne, (#6)
I’m beginning to see why the scientists are so frustrated trying to transfer data!
Maybe I should have asked if he had a CLI command to upload data to ownCloud. Then we might be able to use our data transfer node to send it over I2. I did search a little but didn't find anything simple. So I investigated using Globus a little but kept hitting roadblocks, though I did finally get 2FA working from home. I knew they were going to drop the data sharing feature of Globus. I must be missing something because they keep recommending Globus to transfer data on SciNet. ☹
From: "Rogge, Aaron" <aaron.rogge@usda.gov<mailto:aaron.rogge@usda.gov>>
Date: Wednesday, May 27, 2020 at 8:22 AM
To: "Bradley, Randy" <randy.bradley@usda.gov<mailto:randy.bradley@usda.gov>>
Subject: RE: [a-ludi/dentist] Hi Arne, (#6)
Still slowly uploading.
From: Bradley, Randy <randy.bradley@usda.gov>
Sent: Tuesday, May 26, 2020 5:48 PM
To: Rogge, Aaron <aaron.rogge@usda.gov>
Subject: Re: [a-ludi/dentist] Hi Arne, (#6)
OK. Thanks!
From: "Rogge, Aaron" <aaron.rogge@usda.gov<mailto:aaron.rogge@usda.gov>>
Date: Tuesday, May 26, 2020 at 5:40 PM
To: "Bradley, Randy" <randy.bradley@usda.gov<mailto:randy.bradley@usda.gov>>
Subject: RE: [a-ludi/dentist] Hi Arne, (#6)
The file is still uploading. I’ll send you an update in the morning.
If it fails, I may be able to move it in chunks via Box (15G max file size limit) and try uploading from home.
Have a good evening,
aaron
|
Sure, perfect! |
Arne,
Would you check to see if all the files made it? You should have the following files (sizes given in bytes and GiB):
382b 382 dentist.md5
15032385536b 14G dentist-workdir.tar.gz.part-aa
15032385536b 14G dentist-workdir.tar.gz.part-ab
15032385536b 14G dentist-workdir.tar.gz.part-ac
15032385536b 14G dentist-workdir.tar.gz.part-ad
6943544786b 6.5G dentist-workdir.tar.gz.part-ae
The md5 file has the md5sums of the individual files and the original file before splitting:
fb81a55c846be3fdbe36c83b0a49c22d dentist-workdir.tar.gz (unsplit file)
c8fa2aa5fb4cf346bda64b683e56e917 dentist-workdir.tar.gz.part-aa
5bb012d5b2b7ff7a0ec18370b8cc53ae dentist-workdir.tar.gz.part-ab
f33179f01fc3478fe015882f7e775ebb dentist-workdir.tar.gz.part-ac
1db5be4178c3da3f24c195dfe99aeb49 dentist-workdir.tar.gz.part-ad
1f16bfff96d491b747a545cfc8276a5b dentist-workdir.tar.gz.part-ae
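On the receiving end, reassembly and verification would look roughly like this (assuming all parts and dentist.md5 are in one directory and the md5 file is plain md5sum output):
# put the parts back together in order
cat dentist-workdir.tar.gz.part-a? > dentist-workdir.tar.gz
# check every checksum listed in dentist.md5 (the parts and the reassembled file)
md5sum -c dentist.md5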
|
Hey Randy, got the files and they are OK. I will write to you once I have news. |
So, in my setup it worked. It took 08:55 hours with 2 CPUs and consumed max. RSS of 86.5 GB. Most of the time (6.7h) was spent reading the large alignment file. The routine is not very optimized, I have to admit. Now that you know how many resources are required, can you try again? I ran the job with 24h time limit and max allowed RSS of 128G. |
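For reference, an SGE submission matching those limits might look roughly like the sketch below; the resource names (h_rt, h_vmem) and the parallel environment are site-specific assumptions, and whether h_vmem counts per job or per slot depends on the cluster configuration:
# 2 slots, 24h wall-clock limit, ~128G memory for the collect step
qsub -N dentist.collect -pe smp 2 -l h_rt=24:00:00 -l h_vmem=128G -b y \
  dentist collect --config=workdir/dentist.json --threads=4 --auxiliary-threads=2 - - - -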
Hi Arne,
I'm pretty sure I've got it narrowed down to NFS options. Yesterday, I copied the workdir to a directory on another file system (mounted with different NFS options) and ran "dentist collect" from there. It didn't finish, but it was able to write to pile-ups.db. It looks like it got hung up at 1:10 am this morning. I'm running it again today from an SSD array that is mounted with only the default options.
…-rw-rw-r-- 1 randy.bradley randy.bradley 5171172 Jun 7 01:10 pile-ups.db
I’ll let you know.
Thanks,
Randy
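For comparing the NFS options between the filesystem that works and the ones that hang, standard tools show the effective (negotiated) mount options:
# effective NFS mount options per mount
nfsstat -m
# or, equivalently
findmnt -t nfs,nfs4 -o TARGET,SOURCE,OPTIONS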
|
Hi Arne,
Some good news: I was able to get the pipeline to run to completion by using a directory that resides directly on the master node. This may be speculative, but I think the problem was an MTU (packet size) mismatch between interfaces. Our home directory file server was left at the default MTU of 1500 while our cluster and compute nodes are set to 9000. It could also be that in combination with NFS options such as rsize, wsize, sync, etc. I have more testing to do but I'll keep you posted.
Thanks,
Randy
Tail of job log:
[Mon Jun 8 15:17:11 2020]
rule output:
input: workdir/dentist.json, workdir/scaffolds_FINAL.dam, workdir/.scaffolds_FINAL.bps, workdir/.scaffolds_FINAL.hdr, workdir/.scaffolds_FINAL.idx, workdir/insertions.db
output: gap-closed.fasta
log: logs/output.log
jobid: 17
reason: Missing output files: gap-closed.fasta; Input files updated by another job: workdir/.scaffolds_FINAL.hdr, workdir/scaffolds_FINAL.dam, workdir/insertions.db, workdir/.scaffolds_FINAL.bps, workdir/.scaffolds_FINAL.idx
dentist output --config=workdir/dentist.json - - - 2> logs/output.log
Submitted job 17 with external jobid 'Your job 383493 ("dentist.output") has been submitted'.
[Mon Jun 8 15:18:11 2020]
Finished job 17.
1306 of 1307 steps (100%) done
[Mon Jun 8 15:18:11 2020]
localrule ALL:
input: gap-closed.fasta
jobid: 0
reason: Input files updated by another job: gap-closed.fasta
[Mon Jun 8 15:18:11 2020]
Finished job 0.
1307 of 1307 steps (100%) done
Complete log: /ext/randy.bradley/dentist-test/.snakemake/log/2020-06-08T093217.330730.snakemake.log
|
Hi Arne,
The complete snakemake pipeline ran in less than 6 hours on one NFS filesystem. On two other NFS filesystems, the pipeline gets stuck in "dentist collect". I've tried to make the mount options on the failing NFS filesystems identical to those of the successful one, but the symptoms remain the same. "dentist collect" runs for about 2 hours. During that time, I periodically log into the compute node it is running on and see that the top processes are "dentist collect" and "LAdump". Then the "pile-ups.db" file gets created but remains empty. Comparing the logs, could it be that it gets stuck trying to write-protect the pile-ups.db file?
/ext: works
dentist collect --config=workdir/dentist.json --threads=4 --auxiliary-threads=2 - - - - 2> logs/collect.log
Submitted job 120 with external jobid 'Your job 383483 ("dentist.collect") has been submitted'.
Write-protecting output file workdir/pile-ups.db.
Updating job 5 (extend_dentist_config_for_merge).
Updating job 121 (process).
Updating job 119 (merge).
[Mon Jun 8 15:14:51 2020]
/mnt/shared_tmp: doesn’t work
dentist collect --config=workdir/dentist.json --threads=4 --auxiliary-threads=2 - - - - 2> logs/collect.log
Submitted job 1303 with external jobid 'Your job 387584 ("dentist.collect") has been submitted'.
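Since the symptom points at file locking, a rough check of whether locking works at all on the failing mount (the test file and the <file-server> placeholder are hypothetical) could narrow it down further:
# is the lock manager registered on the file server?
rpcinfo -p <file-server> | grep -E 'nlockmgr|status'
# try to take an exclusive lock with a 10-second timeout; on a working mount this returns immediately
touch /mnt/shared_tmp/locktest
flock -x -w 10 /mnt/shared_tmp/locktest -c 'echo lock acquired'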
Thanks,
Randy
|
Hi Randy, the write-protection is likely not the cause of your issue because it is made by snakemake after dentist finished successfully. You may double-check this by verifying whether dentist is still active. I guess it might be file-locking: dentist tries to lock files it reads or writes via flockfile (https://manpage.me/?q=flockfile). If an error occurs (like it does on our cluster because file locking is not implemented) it will just open the file without locking and continue. In contrast, it will just get stuck if the file lock cannot be acquired for some reason. I created another version of dentist that allows skipping the locking step entirely. I hope it just works because I did not test it at all: dentist.v1.0.0-beta.1-6-gce64df8.x86_64.tar.gz (https://github.com/a-ludi/dentist/files/4770897/dentist.v1.0.0-beta.1-6-gce64df8.x86_64.tar.gz) |
Thanks Arne! Do I need to pass dentist an option to skip the locking?
|
Hi Arne,
I applied an update to nfs-utils on our home directory server. I also made sure that rpcbind, nfs-lock, nfs-server, and nfs-idmap were started on it. Now, “dentist collect” works there too.
The third NFS file server that I tried it on is a CentOS 8 OS with SSD drives. It still doesn’t work there but the nfs-utils package for CentOS 8 doesn’t include the nfs-lock and nfs-idmap daemons. We just use that system for temp storage.
Thanks,
Randy
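For reference, on the CentOS 7 style servers the services mentioned above can be enabled and checked roughly like this; on CentOS 8 the nfs-lock and nfs-idmap units are presumably replaced by rpc-statd and nfs-idmapd, which would explain why they are not listed there:
# CentOS 7 unit names, as used above
systemctl enable --now rpcbind nfs-server nfs-lock nfs-idmap
systemctl status rpcbind nfs-server nfs-lock nfs-idmap
# confirm the lock manager and status daemon are registered with rpcbind
rpcinfo -p | grep -E 'nlockmgr|status'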
|
Sorry, I forgot to mention: SKIP_FILE_LOCKING=1 should do the trick. |
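Presumably the variable is set in the environment of the dentist call, e.g. for the collect step:
SKIP_FILE_LOCKING=1 dentist collect --config=workdir/dentist.json --threads=4 --auxiliary-threads=2 - - - - 2> logs/collect.log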
Yes, that worked perfectly.
Thanks!
|
writePileUps
Originally posted by @BradleyRan in #3 (comment)