Issue with low_completion.fna and 4mer #46

Open
michoug opened this issue Feb 27, 2020 · 1 comment
michoug commented Feb 27, 2020

Hi,
When running the "lc" part of your software, I got this error:

        ____________________________________________________
            Calculating 4mer frequencies for
            redundant bin low_completion.fna
        ____________________________________________________
    kmer frequency calculated in 15164.698575735092 seconds

        ____________________________________________________
            Creating Profile for
            redundant bin low_completion.fna
        ____________________________________________________
    Combined profile created in 167.75693249702454 seconds

        ____________________________________________________
            Reclustering redundant bin low_completion.fna
        ____________________________________________________
    Preference: -25
    Maximum Iterations: 4000
    Convergence Iterations: 400
    Contig Cut-Off: 1000
    Damping Factor: 0.95
    Coverage File: HC_HiSeq_BinSaniy_cov.cov.x100.lognorm
    Fasta File: low_completion.fna
    Kmer: 4
    (300263, 266)
    BinSanity failed when refining you genomes :/. The Bin that it failed at was the following bin: low_completion.fna

Any ideas? Can I use the bins in the REFINED-BINS folder?

edgraham (Owner) commented

Hello,

This error is ultimately related to memory. In the past, runs with around 300,000 contigs have used roughly 600GB of RAM. The first thing I usually advise for people running into this type of memory-related issue is to reconsider the contig cut-off. While I have tested BinSanity down to contigs of 1000bp, I find that setting a cut-off at ~2000bp typically speeds up the run significantly without reducing bin quality. Contigs below 2000bp can often be useful, but they also tend to have more variable coverage profiles and composition metrics that may not align with the actual source genome; when I include contigs that small, most end up unbinned, or I end up doing quite a lot more manual genome refinement in Anvi'o to confirm contig assignments. Increasing your cut-off to 2000bp would be the quickest way to speed up the run and reduce complexity.

Hopefully this means that it refined all of the genomes except that last "low_completion.fna" file. The genomes in 'REFINED-BINS', 'high_completion', and 'strain_redundancy' are ultimately where we would pull our final set of genomes from. Having said that, losing out on those low_completion genomes would be a bummer, so if you don't want to raise your contig cut-off there is a workaround, with the caveat that in the past when I have done this I have sacrificed some amount of bin quality, so you should assess the results closely.

From what you have told me, it seems that 'Binsanity-lc' finished refining all of the 'high_redundancy' genomes but failed when it hit that last group of contigs in the 'low_completion.fna' fasta file. So first, take all the current genomes from 'REFINED-BINS', 'high_completion', and 'strain_redundancy' and move them to a directory called 'Final-Genomes', as these are finished processing. Then take the 'low_completion.fna' fasta file and run it through 'Binsanity-lc' on its own with the same parameters; hopefully it will succeed with just that subset. If it doesn't, another thing you could try is reducing the requirements for refinement in the source code. To do this, find this function in the code:

import pandas as pd

def checkm_analysis(file_, fasta, path, prefix):
    df = pd.read_csv(file_, sep="\t")
    highCompletion = list(set(
        list(df.loc[(df['Completeness'] >= 95) & (df['Contamination'] <= 10), 'Bin Id'].values)
        + list(df.loc[(df['Completeness'] >= 80) & (df['Contamination'] <= 5), 'Bin Id'].values)
        + list(df.loc[(df['Completeness'] >= 40) & (df['Contamination'] <= 2), 'Bin Id'].values)))  # <-- change this 40
    lowCompletion = list(set(
        df.loc[(df['Completeness'] <= 40) & (df['Contamination'] <= 2), 'Bin Id'].values))  # <-- change this 40
    strainRedundancy = list(set(
        df.loc[(df['Completeness'] >= 50) & (df['Contamination'] >= 10)
               & (df['Strain heterogeneity'] >= 90), 'Bin Id'].values))

Change the two marked 40s; they are the completeness thresholds that split high-completion from low-completion bins. We can play around with this more if necessary, but it is a good starting place.
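To make the effect of lowering those thresholds concrete, here is a small demonstration against a toy table mimicking the CheckM columns that `checkm_analysis` reads (the values are made up for illustration, and `high_completion_ids` is a hypothetical helper, not BinSanity code):

```python
import pandas as pd

# Toy CheckM-style table: two bins at 35% completeness, low contamination.
df = pd.DataFrame({
    "Bin Id": ["bin_1", "bin_2"],
    "Completeness": [35.0, 35.0],
    "Contamination": [1.0, 1.0],
})

def high_completion_ids(df, floor=40):
    """Bins passing the lowest-tier completeness threshold in checkm_analysis."""
    mask = (df["Completeness"] >= floor) & (df["Contamination"] <= 2)
    return list(df.loc[mask, "Bin Id"])
```

With the default floor of 40 neither toy bin qualifies; dropping the floor to 30 pulls both into the high-completion set, which is exactly what relaxing the marked numbers does to your real bins.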

Let me know how it goes.

-Elaina
