New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reproducibility: Avoid fileDate= and other variable results in VCF output #256
Comments
The same holds, arguably, for putting the used command line in the VCF headers as the input files can be the same, but have different dirnames. I'd rather just replicate the options given, but not the input files by name. Makes for easier testing too. |
Agreed. Feebayes produces deterministic output, but this should be made completely Would you please provide a patch to enable this behavior? If not, I will We can enable it by default unless there are objections. Surely the This will also help ensure that outputs really are completely deterministic If I remember there might've been a problem with slightly different
|
This is more complicated, and seems complex. To do this we will need to Maybe removing these lines altogether is the only way to provide the Have you considered just removing these lines from the header with grep or
|
@ekg Point is that more and more 'automatic' tools are appearing that make use of the deterministic output idea. Purging headers is an extra step - and there is really no need for that. As for the command line: it belongs in the metadata 'above' the output file. To have it in the headers is the poor-mans solution and only aimed at people running the command line by hand. It is now an option in bio-vcf, where it belongs. E.g., https://github.com/pjotrp/bioruby-vcf/blob/master/bin/bio-vcf#L123. Arguably you can have an option to switch it off, but at least having the option is a good (read informed) idea. |
Many users of freebayes are running it on the command line, often in an
|
As a regular user of freebayes who works with a variety of people doing exploratory analysis I am strongly in favour of retaining the full command line and list of targeted files in the header along with a run date as a default option. Having all the information available makes it much easier to compare results produced by different people working on the same dataset or subsets of a data set and work out exactly what they did and when they did it. Metadata on the filesystem attributes is simply not reliable enough when files are manipulated moved between storage systems and transferred between operating systems. The options for adding those metadata should be enabled by default because the occasions when they are needed are the most likely times that the user who used the software will not remember to enable a couple of extra command line options. While for the cases when you don't want them such as automated and standardised pipelines it is much easier to simply add an additional argument to the program to disable them and have it apply every time the software is run. Secondly if the maintenance burden of maintaining extra options is problematic then having the pipeline based systems include a 'grep -v -f headers.txt' command catching the STDOUT VCF stream from freebayes is relatively easy to implement with minimal compute cost. |
Sure, I appreciate what y'all are saying. But when you think of pipelines including some 50 odd tools and CLI invocations you may also appreciate that parsing out all the inconsequential stuff is superfluous AND therefore error prone. I am just signaling the way we are heading. Exploratory research and reproducibility are not fixed by adding evidence in the headers, even if it helps you now ;). Freebayes would do well to cater for determinism :) A simple switch will do the job. Please see it as a useful feature request. My 2cts. BTW we are adding freebayes to GNU Guix - a deterministic software deployment system which may end up in quite a few pipelines. See https://github.com/genenetwork/guix-bioinformatics/blob/master/gn/packages/bioinformatics.scm#L102 |
This issue is marked stale because it has been open 120 days with no activity. Remove stale label or comment or this will be closed in 5 days |
What I'll do is add a command line switch for deterministic output. |
Since freebayes produces textual VCF output it is trivial to filter on fileDate with |
Thanks for freebayes. I think it is a great variant caller. Meanwhile:
Modern pipelines (CWL, Arvados, Galaxy) are working towards reproducible analysis. Ideally, on the same set of inputs, the output should be identical. Having a time stamp in the output is therefore a bad idea (and really belongs in metadata on files). I suggest to remove it by default and have a switch to add it back in when people want it. The same I have done for the bio-vcf tool.
The text was updated successfully, but these errors were encountered: