Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reproducibility: Avoid fileDate= and other variable results in VCF output #256

Closed
pjotrp opened this issue Mar 8, 2016 · 10 comments
Closed
Assignees
Milestone

Comments

@pjotrp
Copy link
Collaborator

pjotrp commented Mar 8, 2016

Thanks for freebayes. I think it is a great variant caller. Meanwhile:

Modern pipelines (CWL, Arvados, Galaxy) are working towards reproducible analysis. Ideally, on the same set of inputs, the output should be identical. Having a time stamp in the output is therefore a bad idea (and really belongs in metadata on files). I suggest to remove it by default and have a switch to add it back in when people want it. The same I have done for the bio-vcf tool.

@pjotrp
Copy link
Collaborator Author

pjotrp commented Mar 8, 2016

The same holds, arguably, for putting the used command line in the VCF headers as the input files can be the same, but have different dirnames. I'd rather just replicate the options given, but not the input files by name.

Makes for easier testing too.

@pjotrp pjotrp changed the title Avoid fileDate= in VCF output Reproducibility: Avoid fileDate= and other variable results in VCF output Mar 8, 2016
@ekg
Copy link
Collaborator

ekg commented Mar 8, 2016

Agreed.

Feebayes produces deterministic output, but this should be made completely
true to ensure that checksums are identical.

Would you please provide a patch to enable this behavior? If not, I will
get to it soon.

We can enable it by default unless there are objections. Surely the
timestamp on the file will be enough.

This will also help ensure that outputs really are completely deterministic
(the algorithm is, but the output might not be).

If I remember there might've been a problem with slightly different
representations of extremely small numbers due to the math library used for
big number representation. If this is the case then there are a few
possible solutions.
On Mar 8, 2016 9:08 AM, "Pjotr Prins" notifications@github.com wrote:

Thanks for freebayes. I think it is a great variant caller. Meanwhile:

Modern pipelines (CWL, Arvados, Galaxy) are working towards reproducible
analysis. Ideally, on the same set of inputs, the output should be
identical. Having a time stamp in the output is therefore a bad idea (and
really belongs in metadata on files). I suggest to remove it by default and
have a switch to add it back in when people want it. The same I have done
for the bio-vcf tool.


Reply to this email directly or view it on GitHub
#256.

@ekg
Copy link
Collaborator

ekg commented Mar 8, 2016

This is more complicated, and seems complex. To do this we will need to
remove any file referring options from the header. There are many.

Maybe removing these lines altogether is the only way to provide the
desired behavior. However, removing the command line by default seems to
run counter to the objective of reproducible research.

Have you considered just removing these lines from the header with grep or
a script when checking that the output is deterministic?
On Mar 8, 2016 9:11 AM, "Pjotr Prins" notifications@github.com wrote:

The same holds, arguably, for putting the used command line in the VCF
headers as the input files can be the same, but have different dirnames.
I'd rather just replicate the options given, but not the input files by
name.

Makes for easier testing too.


Reply to this email directly or view it on GitHub
#256 (comment).

@pjotrp
Copy link
Collaborator Author

pjotrp commented Mar 8, 2016

@ekg Point is that more and more 'automatic' tools are appearing that make use of the deterministic output idea. Purging headers is an extra step - and there is really no need for that.

As for the command line: it belongs in the metadata 'above' the output file. To have it in the headers is the poor-mans solution and only aimed at people running the command line by hand. It is now an option in bio-vcf, where it belongs. E.g., https://github.com/pjotrp/bioruby-vcf/blob/master/bin/bio-vcf#L123.

Arguably you can have an option to switch it off, but at least having the option is a good (read informed) idea.

@ekg
Copy link
Collaborator

ekg commented Mar 8, 2016

Many users of freebayes are running it on the command line, often in an
exploratory way. Removing the header lines describing how the tool was run
seems only problematic for these users. Adding an option to remove them is
fine, but is more complicated for freebayes development and maintenance
than filtering them out in cases where they are not needed. If this is
important to your application would you please provide a patch to do so? I
will be happy to merge it.
On Mar 8, 2016 9:33 AM, "Pjotr Prins" notifications@github.com wrote:

@ekg https://github.com/ekg Point is that more and more 'automatic'
tools are appearing that make use of the deterministic output idea. Purging
headers is an extra step - and there is really no need for that.

As for the command line: it belongs in the metadata 'above' the output
file. To have it in the headers is the poor-mans solution and only aimed at
people running the command line by hand. It is now an option in bio-vcf,
where it belongs. E.g.,
https://github.com/pjotrp/bioruby-vcf/blob/master/bin/bio-vcf#L123.

Arguably you can have an option to switch it off, but at least having the
option is a good (read informed) idea.


Reply to this email directly or view it on GitHub
#256 (comment).

@aeonsim
Copy link

aeonsim commented Mar 8, 2016

As a regular user of freebayes who works with a variety of people doing exploratory analysis I am strongly in favour of retaining the full command line and list of targeted files in the header along with a run date as a default option.

Having all the information available makes it much easier to compare results produced by different people working on the same dataset or subsets of a data set and work out exactly what they did and when they did it. Metadata on the filesystem attributes is simply not reliable enough when files are manipulated moved between storage systems and transferred between operating systems.

The options for adding those metadata should be enabled by default because the occasions when they are needed are the most likely times that the user who used the software will not remember to enable a couple of extra command line options.

While for the cases when you don't want them such as automated and standardised pipelines it is much easier to simply add an additional argument to the program to disable them and have it apply every time the software is run. Secondly if the maintenance burden of maintaining extra options is problematic then having the pipeline based systems include a 'grep -v -f headers.txt' command catching the STDOUT VCF stream from freebayes is relatively easy to implement with minimal compute cost.

@pjotrp
Copy link
Collaborator Author

pjotrp commented Mar 8, 2016

Sure, I appreciate what y'all are saying. But when you think of pipelines including some 50 odd tools and CLI invocations you may also appreciate that parsing out all the inconsequential stuff is superfluous AND therefore error prone. I am just signaling the way we are heading. Exploratory research and reproducibility are not fixed by adding evidence in the headers, even if it helps you now ;). Freebayes would do well to cater for determinism :) A simple switch will do the job. Please see it as a useful feature request. My 2cts.

BTW we are adding freebayes to GNU Guix - a deterministic software deployment system which may end up in quite a few pipelines. See https://github.com/genenetwork/guix-bioinformatics/blob/master/gn/packages/bioinformatics.scm#L102

@github-actions
Copy link

This issue is marked stale because it has been open 120 days with no activity. Remove stale label or comment or this will be closed in 5 days

@github-actions github-actions bot added the Stale label Dec 11, 2020
@pjotrp
Copy link
Collaborator Author

pjotrp commented Dec 11, 2020

What I'll do is add a command line switch for deterministic output.

@pjotrp pjotrp self-assigned this Dec 11, 2020
@pjotrp pjotrp added this to the v1.3.4 milestone Dec 13, 2020
@pjotrp pjotrp modified the milestones: v1.3.4, v1.3.5 Jan 29, 2021
@pjotrp
Copy link
Collaborator Author

pjotrp commented Jan 31, 2021

Since freebayes produces textual VCF output it is trivial to filter on fileDate with grep -v. I am leaving it like that for now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants