Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Continuously increasing RAM demand #21

Open
thackl opened this issue Feb 18, 2021 · 14 comments
Open

Continuously increasing RAM demand #21

thackl opened this issue Feb 18, 2021 · 14 comments

Comments

@thackl
Copy link

thackl commented Feb 18, 2021

Hi Katelyn,

thanks for PHANOTATE, great tool!

I'm using it for larger sets of viral contigs, and noticed that it seems to not free up RAM between contigs. On my current set with 100+ sequence, towards the end, it needs >5GB RAM. Obviously, I can split the input file to get around that, but I'd also suspect, it wouldn't be too difficult to clean the RAM after a contig has been processed. Might make sense as an improvement for future versions.

Cheers
Thomas

@deprekate
Copy link
Owner

Hm, that is strange. In the current version I reset everything between contigs (which I probably shouldn't do since contigs in the same file should probably be treated as the same genome by default; and to treat them as separate genomes use a --meta flag).

I wonder if pythons automatic garbage collector isn't working correctly between contigs to clear old memory. Or perhaps I have a memory leak somewhere.

Thanks for info, I will try to track down the cause.

If you have the time, can you check whether the issue happens with my newer code?
I have been developing a 2+ version. The changes in 2.0 alpha are mostly only code optimizations (it should run in half the time), but I did have to make some changes to the way ORFs and gaps/overlaps are scored, and I haven't dialed them in yet, so output might be slightly different.
To run the develop branch you can manually install via:

git clone https://github.com/deprekate/PHANOTATE.git
cd PHANOTATE
git checkout develop
python3 setup.py install

phanotate.py tests/NC_001416.1.fasta

@thackl
Copy link
Author

thackl commented Feb 19, 2021

Interesting thought regarding --meta flag. Makes sense to treat contigs from the same genome together - especially shorter ones could profit from any "training" performed also on larger ones. Not sure though if people usually split their phage genomes (or even bacterial bins) into separate files after binning. But if properly documented, the definitely could.

I tried to run the new code. It worked on the test data, but then failed on may contigs with the following error:

phanotate.py  -f tabular pt-loci-r1.fna > pt-loci-r1-phanotate-v2.0.tsv
Traceback (most recent call last):
  File "/home/thackl/software//PHANOTATE/PHANOTATE/phanotate.py", line 47, in <module>
    contig_features.add_feature( trna )
  File "/home/thackl/.local/lib/python3.8/site-packages/phanotate-2.0-py3.8-linux-x86_64.egg/phanotate/features.py", line 100, in add_feature
    self.feature_at[ feature.as_edge()[:2] ] = feature
AttributeError: 'tRNA' object has no attribute 'as_edge'

@deprekate
Copy link
Owner

Yeah, the problem with making it default is that most people won't read the fulls docs, and so will run MAG files through as the same genome, which will skew the results.

The tRNA error should be fixed, I forgot to push a change I made to the code.

@thackl
Copy link
Author

thackl commented Feb 19, 2021

Yeah, users ;). You could enforce a choice with a required mode argument, something like phanotate meta [opts] infile / phanotate single [opts] infile, with the former going contig-wise and the letter expecting a single genome with 1+ contigs...

Caught another error...

phanotate.py  -f tabular pt-loci-r1.fna > pt-loci-r1-phanotate-v2.0.tsv
Traceback (most recent call last):
  File "/home/thackl/software//PHANOTATE/phanotate.py", line 84, in <module>
    shortest_path = fz.get_path(source= contig_features.source_node(), target= contig_features.target_node())[1:-1]
ValueError: Graph contains negative weight cycle

@deprekate
Copy link
Owner

Ah, that error is caused by the weights not being dialed in yet. I will see if I can get the weights sorted out, and/or track down the cause of the RAM usage. Thanks so for you help and feedback.

@thackl
Copy link
Author

thackl commented Feb 19, 2021

Sure. Happy to help. Debugging my own code as we speak. Let me know if you want me to give it another try

@deprekate
Copy link
Owner

The negative weight cycle is caused when the cost to go backwards via an overlap is not enough in relation to a really good gene (in this case most likely a tRNA), so during the path solving step, the Bellman-Ford will go through that gene, then do a backwards overlap, then back through the same gene, infinite times.
I reduced the tRNA from -20 to -1, and pushed the change to github, which should prevent the infinite loop for now

@thackl
Copy link
Author

thackl commented Feb 20, 2021

Hmm, I still get the negative weight cycle error. I'm currently trying to find out, which sequence exactly is causing the error. If I find it and it would help, I can send it to you.

@thackl
Copy link
Author

thackl commented Feb 20, 2021

OK, now it is getting bizarre. When I split the file into one file per seq, and run phanotate on each - everything runs through fine. But if I run phanotate on the file with all sequences, I get the error after a few sequences... Any ideas?

If it would help, I can share the data, assuming you would handle it confidentially

@deprekate
Copy link
Owner

Ah, one of the improvements I made with version 2 is moving the ORF to ORF connections creation step to Cpython. It looks like I forgot to empty the data structure between contigs. I will have to get that fixed.

@ilyavs
Copy link

ilyavs commented Dec 25, 2022

Hi, I am also running the dev branch because the master doesn't work on some of my phage sequences. I also get the error:
ValueError: Graph contains negative weight cycle
Is there anything that can be done?

@deprekate
Copy link
Owner

The "negative weight cycle" error is the reason why I have not pushed this version 2 to the main. I mentioned in this version I had to simplify the weighting of the ORFs but kept the tRNAs the same weight (-20). If I drop the weight to -1 it should run fine, but I have to tune it properly, which I havent done. One reason is that I dont have a contig that generates the error.
Are you able to share the contig that causes the error?

@ilyavs
Copy link

ilyavs commented Jan 2, 2023

Unfortunately I can't share the contigs. I am pretty sure there are public sequences that would have the error, I just don't have the capacity to analyze them until I hit the error.

@deprekate
Copy link
Owner

deprekate commented Apr 4, 2023

Hi, I am also running the dev branch because the master doesn't work on some of my phage sequences. I also get the error: ValueError: Graph contains negative weight cycle Is there anything that can be done?

I am revisiting this issue, can you clarify if you got the 'negative cycle' error with the develop or main branch? If it was only the dev branch, what was the error that caused the main branch not to work on some phages?

I also set the weight of tRNAs to 0, so neither beneficial or penalized, with the idea that I will add ALL of them back in post processing. Can you test out the develop branch on the contig that gave you the error?

git clone https://github.com/deprekate/PHANOTATE.git
git checkout develop
pip install ../PHANOTATE

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants