CONCOCT command is not following best practice #76
Comments
Hey Johannes! Thanks for messaging me about this. I actually wondered why CONCOCT wasn't doing so hot on larger data sets. That makes sense now. It's unfortunate this information was not immediately apparent to users. I definitely want to fix this in metaWRAP, as CONCOCT offers a very fresh binning approach compared to the other binners I tested, so it could contribute many good bins when incorporated with the bin_refinement module of metaWRAP. It already does pretty well on smaller data, so I imagine it will do even better once we fix this. I see the scripts you mention do not come with the standard Bioconda distribution of CONCOCT, so it looks like I will need to include them in the metawrap-scripts folder. Also, you probably noticed that the read alignment stage is done before the binning stages, so the contig depth files are already ready to go before concoct starts. Would it be OK to split the larger contigs and assume that all of their fragments have the same read depth? In other words, could you fix the pipeline by changing things only after line 403, having only the original fasta file and the existing depth files to work with? I will not have much time to work on this in the coming months, so I would greatly appreciate it if you could help me, especially since you know exactly how this works. Do you mind filling in the rest? Thanks, and I look forward to fixing this with you!
Hi Gherman, yes, I am aware that the origin of this mess is concoct. I'm also hoping the performance will be improved. We also have a new version of concoct coming (already available in the develop branch) which should be several times faster and make better use of threads. I am afraid that simply assuming the same coverage would not be optimal. I'm not sure how bad it would be, as I haven't tried it, but at least the number of data points per bin would then be better distributed. The best alternative would be to split the contigs before mapping the reads, but I totally agree with you that this is a major hassle. The second-best alternative, in my view, would be to set the coverage for each sub-contig based on some probability distribution using the main contig's mean coverage and coverage variance, as given by the script you are using to get the coverage values. But that would still require splitting up the fasta file with the current version of concoct, and a script to perform this generation remains to be written. Maybe this is the easiest way forward for concoct within metaWRAP? I will try to include those missing scripts in the bioconda package (which, by the way, is kindly contributed by other users and not by any of the developers). I'm not sure either how much time I can spend on this problem, but I am indeed keen on getting it fixed. 👍 Thanks!
Great! I'm really looking forward to the concoct that can take advantage of multi-threading effectively! Right now it's just not scalable to most larger sequencing projects. I do not really like the idea of faking the coverage variance through modeling... The data is right there, so why not use it? Why do you think that assuming the coverage from the overall average coverage is bad? If we have a long contig from a genome of a given abundance that we then cut up into short fragments, wouldn't the abundance of each of its pieces be expected to be identical? In practice, the read coverage will deviate some due to random chance and coverage biases, but the true value - the abundance of that genome in that sample - does not change from us cutting up the contig. Isn't coverage just a proxy for the abundance of each fragment in that sample? The longer the fragment, the more accurate that estimation. Sorry in advance if I am misunderstanding something here. If we have to re-estimate the abundance of each fragment, I would prefer to be efficient and not repeat the alignment step. Instead, the .bam files from the alignments are right there. In theory, we could just split the fasta file and then hack into the .bam files to make the read mappings consistent with the fragmented contig naming, then run the coverage estimation tool on the new "split" .bam files. The downside is that writing a program to rewrite the bam files would be more work, so unless you want to help me integrate this, the easiest way would be to just brute-force it and re-run the alignment stage on the cut-up contigs. This will be inefficient (doing the alignment twice), but one could argue that if users cannot afford to re-align the reads then they should not be attempting to bin with CONCOCT, given how inefficient it is on large data. Let me know which way you want to try!
Yes, it really needs a speed-up, which I am confident in saying we will be able to achieve. And I am afraid I wasn't clear enough: I am not aiming at faking the variance. Instead, I want to use the variance and the coverage value from the long contig to build a probability distribution for each long contig. This distribution would then be used to produce slightly varying coverage values for the sub-contigs. Does that make sense? I think this would result in more realistic values for the short contigs than just using the same coverage value for all sub-contigs. However, maybe it's worth trying the same coverage values for all sub-contigs first, since I haven't tried that. The idea of creating the script you describe has passed my mind a few times, but I'm not sure how efficient I would be able to make such a script. Do you have a good idea of how it would be done? Creating a few hundred thousand random distributions and subsampling ~10-100 values from each sounds like an easier programming task to me. But yes, one would then not get the true coverage for each sub-contig.
Can you provide the code that you would use to cut up the contigs, align the reads, bin, and then merge them? I am trying to figure out the easiest way to modify the code for now. If you work with
Hey @alneberg, I saw you released the new version of CONCOCT.
I also saw that the cut-up contig coverage does not have to be re-estimated now. Is there any clear-cut guide on how to cut up the contigs, use coverage from the main contigs, perform the binning, and then get final bins with the original contigs? I feel like, given just two variables/files - a coverage file
Hi @ursky, sorry for not getting back to you about this! The speed should be improved in the new version, but the scalability with increasing data size is probably the same (except for improved parallelism). The commands given in the readme should be enough, but for you that would still mean additionally running the cut-up command and the concoct script to generate the coverage table. The big change is that the mapping doesn't have to be redone against the cut-up contigs. However, the cut-up contigs are still needed as input for concoct (the composition file). You should also add a value for
The clustering results on the cut-up contigs need to be merged to correspond to the original contigs. This command is also given in the readme. The scripts used in the readme are added to the path when concoct is installed. Please let me know if you need any other information from me; it would be very nice to have the concoct installation work as well as it can. Cheers,
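For reference, a sketch of the CONCOCT ≥ 1.0 workflow described above, following the commands in the CONCOCT README (file names such as `original_contigs.fa` and `mapping/*.sorted.bam` are placeholders):

```bash
# Cut contigs into ~10 kb chunks and record their coordinates in a BED file
cut_up_fasta.py original_contigs.fa -c 10000 -o 0 --merge_last -b contigs_10K.bed > contigs_10K.fa

# Build the sub-contig coverage table from BAM files mapped against the ORIGINAL contigs
concoct_coverage_table.py contigs_10K.bed mapping/*.sorted.bam > coverage_table.tsv

# Run concoct on the cut-up contigs
concoct --composition_file contigs_10K.fa --coverage_file coverage_table.tsv -b concoct_output/

# Merge the sub-contig clustering back onto the original contigs
merge_cutup_clustering.py concoct_output/clustering_gt1000.csv > concoct_output/clustering_merged.csv

# Optionally, write one fasta file per bin using the original contigs
mkdir concoct_output/fasta_bins
extract_fasta_bins.py original_contigs.fa concoct_output/clustering_merged.csv --output_path concoct_output/fasta_bins
```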
Excellent, thank you. I was able to integrate the contig splitting relatively easily, and it seems to work very well! However, I am unable to use the
Ok, concoct should now be able to use the
After some testing, all features of CONCOCT appear to be functional under the metaWRAP wrapper. I am happy to report that
Fantastic! I browsed through the code and it looks great! A small side note (only for your convenience): the extra
will create the directory
Feel free to close this issue as you like. Great work!
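A minimal sketch of this side note, under the assumption that it refers to concoct's `-b` output basename: a basename with a trailing slash is treated as an output directory and created by concoct itself, so a separate mkdir would be redundant (the `${out}/work_files/concoct_out` path is hypothetical):

```bash
# mkdir -p ${out}/work_files/concoct_out     # likely unnecessary (assumption)
concoct --composition_file contigs_10K.fa \
        --coverage_file coverage_table.tsv \
        -b ${out}/work_files/concoct_out/    # concoct creates this directory itself
```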
Thanks, I incorporated the changes. Unfortunately there is another issue now:
It's something to do with the scikit-learn/scipy/numpy versions, but I am not sure which exactly. My environment:
Ah, that's bad. I think it's explained by the version of scikit-learn, which should be > 0.18.0 to have the
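A quick way to check whether the installed scikit-learn meets that requirement, and to bump it if not (the environment name `metawrap-env` is hypothetical, and the `>=0.18` pin simply mirrors the version mentioned above):

```bash
conda activate metawrap-env
python -c "import sklearn; print(sklearn.__version__)"

# Upgrade if the printed version is older than 0.18
conda install "scikit-learn>=0.18"
```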
Ok, that makes sense. Let me know when you update the recipe so I can test it.
The recipe was updated just now: bioconda/bioconda-recipes#13820
Thanks! I just tested and got a new error:
Looks like you need to enforce some more dependencies: https://stackoverflow.com/questions/36659453/intel-mkl-fatal-error-cannot-load-libmkl-avx2-so-or-libmkl-def-so
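For context, one workaround discussed in that Stack Overflow thread (treat this as a sketch rather than the definitive fix) is to switch the environment to the non-MKL builds of the numerical stack:

```bash
# Replace the MKL-linked builds with nomkl variants, then drop the MKL packages
conda install nomkl numpy scipy scikit-learn numexpr
conda remove mkl mkl-service
```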
I just installed nomkl. The following NEW packages will be INSTALLED: nomkl pkgs/free/linux-64::nomkl-3.0-0
Hmm, I guess I could. But this time it works in a clean conda installation (at least for me), and I'm therefore already quite impressed by how many different (and potentially conflicting) packages you've managed to bundle in one conda environment - I'm struggling to put together just one single package. But yes, great that it's working again!
Understandable. I updated the recipe on my end and it seems to work flawlessly. And yes, it's a pain - hence the hundreds of little unique issues I am constantly bombarded with... By the way, feel free to advertise
Hi, this might not be directly related to the concoct issues discussed here, but I was not sure whether I should start another thread for it. In the current code, the inputs for the concoct command are:
Could these be changed to "${out}/work_files/..."? Thank you
Thanks for pointing that out. I'll fix up the outdated variable handling before metawrap
Hello! I have been playing around with the metaWRAP bin refinement module and I obtained some interesting results. Note that I implemented the binning tools myself and only make use of the refinement module of metaWRAP. For a typical sample I get the following:
I understand that I have used quite lax completeness and contamination parameters (40 and 25, respectively), and I plan to filter out bins at a later time. It looks like if I use a cutoff of 80% completeness for filtering, I am better off using the original CONCOCT bins, although it does appear that metaWRAP does reduce some contamination. Do you think my bin refinement parameters are too lax for the refinement to work properly? I am interested in hearing both of your opinions @alneberg @ursky. Best wishes,
It is entirely possible that refinement may not improve results in every scenario. In your example, if metabat2 already has all the bins that concoct/maxbin2 produce, then no new information is being added, so the refinement module does not improve the result. It's not common, but seeing how few bins you have makes me think this is a simpler binning challenge, so this would make sense. In general, metaWRAP will strongly prioritize low contamination, so you can see that it did so at some expense of completion. One thing I can say for sure is to run the refinement module with the completion and contamination parameters that you actually want to use later, since these parameters define which bin qualities to prioritize. There is no such thing as "I'll filter them out later" - that's not how the refinement module is meant to be used. If you intend to only trust bins with 80% comp and 10% cont, then run with those thresholds from the start, roughly as sketched below.
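A sketch of what that would look like, assuming the elided flags are the module's completeness/contamination options (`-c`/`-x`) and using placeholder paths for the bin folders:

```bash
# Run refinement with the thresholds you actually intend to trust (80% completeness, 10% contamination)
metawrap bin_refinement -o BIN_REFINEMENT -t 16 \
    -A concoct_bins/ -B metabat2_bins/ -C maxbin2_bins/ \
    -c 80 -x 10
```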
Thanks for the quick response! I will re-run with more stringent parameters as you suggest. I look forward to trying out the bin reassembly module next, which looks very promising! Best wishes,
Hello,
First, I must thank you for what seems to be an amazing resource! I can see that the performance of CONCOCT is not that good in the figure shown in the README, and likewise in the figure in the paper. I had a look at the command used to run CONCOCT:
metaWRAP/bin/metawrap-modules/binning.sh, lines 394 to 417 at commit e7f740d
And from what I can see, it is not using contigs cut up into smaller pieces. Is this correct?
CONCOCT will not perform well on this, and the general instruction is to use the cut-up script to chop up the longest contigs:
https://github.com/BinPro/CONCOCT/blob/develop/scripts/cut_up_fasta.py
There is then a script to translate the clustering of the cut-up contigs back into a clustering of the original contigs here:
https://github.com/EnvGen/toolbox/blob/master/scripts/concoct/merge_cutup_clustering.py
I know this is a big hassle and I am indeed hoping to fix this in a future release of CONCOCT (Yes, it's true, it is still alive!!!), but I am afraid this is the current procedure.
Do you think it would be possible to implement this in metaWRAP? It's a bit sad to see the poor performance displayed on the readme... :(
Let me know if you need any assistance!