Your vcf2tsv has a really annoying bug!!! #206

Closed
deniseduma opened this issue Jul 30, 2017 · 27 comments

Comments

@deniseduma

deniseduma commented Jul 30, 2017

You claim that your vcf2tsv outputs one record per allele rather than one output per SNP but you are actually messing up the output by not separating the per-allele records by a new-line character!!!! This is a really elementary mistake and really annoying because the output records are messed up and, contrary to expectations, they equal the number of input records! Could you please fix this bug and test your vcf2tsv code?

Moreover, you are randomly changing the order of the input INFO fields in the output tsv file whereas it would be preferable to keep it the same.

Here is an example concatenated output:

1 17380465 rs138979875 G A 0 . . . . . . RCV000132258.2 2 Hereditary_cancer-predisposing_syndrome MedGen:SNOMED_CT C0027672:699346009 NC_000001.10:g.17380465G>T 1 single 0 . . . . . . SDHB:6390 . . . . . . . . . . . . . . . . . . . . 138979875 17380465 . . 1 . 0 . . . . SNV . 0x050060000a05040002100100 1 . 1341 17380465 rs138979875 G T 0 . . . . . . RCV000132258.2 2 Hereditary_cancer-predisposing_syndrome MedGen:SNOMED_CT C0027672:699346009 NC_000001.10:g.17380465G>T 1 single 0 . . . . . . SDHB:6390 . . . . . . . . . . . . . . . . . . . . 138979875 17380465 . . 1 . 0 . . . . SNV . 0x050060000a05040002100100 1 . 134

@zeeev
Collaborator

zeeev commented Jul 30, 2017

@deniseduma I was unable to reproduce the "bug" you reported. Here's an example of the test I ran. There were two lines in the output, one for each alt allele.

bin/vcf2tsv samples/1kg-phaseIII-v5a.20130502.genotypes.chr22-16-16.5mb.vcf.gz | grep "rs62224611"

@tseemann
Contributor

tseemann commented Jul 31, 2017

@deniseduma I find the tone of your writing to be aggressive and disrespectful. I believe GitHub follows the "Open Code of Conduct" (see http://todogroup.org/opencodeofconduct/), which I would argue you are not following here.

@tseemann
Contributor

@deniseduma Now back to your issue. Did you copy the file between Windows or Mac and a Unix system by any chance? This is a common problem when handling different line endings.

@deniseduma
Author

I'm using a Mac so most likely you are not handling the newline character correctly on the Mac. I'm sure I'm not wrong because I've tested this many times and compared your output with that of another tool which does not split records by alleles and the two outputs have exactly the same number of lines.

I'm sorry if you were offended by my tone but I find your tool which apparently you published to be quite immature and bug-ridden. For instance, there was another bug when I tried to install vcflib on my Mac which took me forever to figure out! In my opinion, the installation of the software at least should work smoothly.

@gringer

gringer commented Jul 31, 2017

Could you please provide an example input file that demonstrates this behaviour? If you're using a Mac and 'hexdump' is also installed, the output of the last few lines of hexdump -C <output_file> (preferably mixing lines that have the problem and ones that don't) would also be informative.
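
For readers without hexdump at hand, the same question (which line terminators does the file actually contain?) can be answered with a few lines of Python. This is only a minimal sketch; the function name `count_line_endings` is made up for illustration and is not part of vcflib.

```python
from collections import Counter


def count_line_endings(data: bytes) -> Counter:
    """Count LF, CRLF and lone-CR line terminators in raw bytes."""
    counts = Counter({"LF": 0, "CRLF": 0, "CR": 0})
    i = 0
    while i < len(data):
        byte = data[i]
        if byte == 0x0D:  # carriage return
            if i + 1 < len(data) and data[i + 1] == 0x0A:
                counts["CRLF"] += 1
                i += 1  # skip the LF that belongs to this CRLF pair
            else:
                counts["CR"] += 1
        elif byte == 0x0A:  # bare line feed (Unix style)
            counts["LF"] += 1
        i += 1
    return counts
```

Reading the output file in binary mode, for example `open(path, "rb").read()`, and passing the bytes to this function should show whether the file uses LF, CRLF, or lone CR terminators.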

@deniseduma
Author

deniseduma commented Jul 31, 2017 via email

@tseemann
Contributor

tseemann commented Jul 31, 2017

@deniseduma

  1. I am not the author of the software. I am just a bioinformatician who helps people in my community wherever I can.

  2. This software is NOT published in a journal as far as I know. It is only on Github.

  3. You need to provide test data for the authors to replicate the bug. This is a good guide: https://www.chiark.greenend.org.uk/~sgtatham/bugs.html

  4. I am not sure you are really "sorry" about your tone.

  5. You are clearly frustrated by this tool. There are alternatives out there, or simple libraries for doing it yourself by parsing the VCF.

@tseemann
Contributor

tseemann commented Jul 31, 2017

@deniseduma there does not seem to be a test.vcf file attached? It works best to attach it via the GitHub web site rather than email. Does it work for you @gringer ?

I would try running mac2unix on the original file and see if it fixes the problem.

@gringer

gringer commented Jul 31, 2017

No, I'm not getting github emails about this, and also can't see any attachments on the issue page.

@deniseduma
Author

deniseduma commented Jul 31, 2017 via email

@deniseduma
Author

Here are the input and output files, I've changed their extension to .txt because Github won't let me upload as .vcf and .tsv

test_tsv.txt
test_vcf.txt

@gringer

gringer commented Jul 31, 2017

It looks like you uploaded the same file. After I downloaded them, these files were identical.

Edit: I think what I did wrong with the download was to just replace the file name, rather than the file number. github doesn't seem to care what the file names are set to. For example, here they are "renamed" to the correct extension:

https://github.com/vcflib/vcflib/files/1186892/test.tsv
https://github.com/vcflib/vcflib/files/1186893/test.vcf

In any case, I do see that there is no line break or carriage return character in the output:

00000000  23 43 48 52 4f 4d 09 50  4f 53 09 49 44 09 52 45  |#CHROM.POS.ID.RE|
00000010  46 09 41 4c 54 09 51 55  41 4c 09 46 49 4c 54 45  |F.ALT.QUAL.FILTE|
00000020  52 09 52 53 09 52 53 50  4f 53 09 64 62 53 4e 50  |R.RS.RSPOS.dbSNP|
00000030  42 75 69 6c 64 49 44 0a  31 09 39 34 39 35 32 33  |BuildID.1.949523|
00000040  09 72 73 37 38 36 32 30  31 30 30 35 09 43 09 54  |.rs786201005.C.T|
00000050  09 30 09 2e 09 37 38 36  32 30 31 30 30 35 09 39  |.0...786201005.9|
00000060  34 39 35 32 33 09 31 34  34 0a 31 09 39 38 35 38  |49523.144.1.9858|
00000070  32 36 09 72 73 31 37 31  36 30 37 37 35 09 47 09  |26.rs17160775.G.|
00000080  41 09 30 09 2e 09 31 37  31 36 30 37 37 35 09 39  |A.0...17160775.9|
00000090  38 35 38 32 36 09 31 32  33 31 09 39 38 35 38 32  |85826.1231.98582| << 123//1
000000a0  36 09 72 73 31 37 31 36  30 37 37 35 09 47 09 54  |6.rs17160775.G.T|
000000b0  09 30 09 2e 09 31 37 31  36 30 37 37 35 09 39 38  |.0...17160775.98|
000000c0  35 38 32 36 09 31 32 33  0a                       |5826.123.|
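
One way to spot such fused records without reading a hexdump is to compare each row's field count against the header. The following is a rough sketch, assuming a tab-separated file whose first line is the header; `find_overlong_rows` is a hypothetical helper, not a vcflib function. The sample text reproduces the bytes from the dump above.

```python
from typing import List, Tuple


def find_overlong_rows(tsv_text: str) -> List[Tuple[int, int]]:
    """Return (line_number, field_count) for rows wider than the header."""
    lines = tsv_text.splitlines()
    if not lines:
        return []
    expected = len(lines[0].split("\t"))  # header defines the column count
    overlong = []
    for number, line in enumerate(lines[1:], start=2):
        count = len(line.split("\t"))
        if count > expected:
            overlong.append((number, count))
    return overlong


# Same content as the hexdump: the third line holds two records with
# "123" and the following "1" fused into "1231".
SAMPLE = (
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tRS\tRSPOS\tdbSNPBuildID\n"
    "1\t949523\trs786201005\tC\tT\t0\t.\t786201005\t949523\t144\n"
    "1\t985826\trs17160775\tG\tA\t0\t.\t17160775\t985826\t1231\t985826\t"
    "rs17160775\tG\tT\t0\t.\t17160775\t985826\t123\n"
)
print(find_overlong_rows(SAMPLE))  # [(3, 19)]
```

A fused row has one field fewer than two full records (19 rather than 20 here), because the last field of the first record and the first field of the second are merged.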

@deniseduma
Author

I've re-downloaded test_vcf.txt and test_tsv.txt that I've uploaded via the Github interface on my computer and they are different, the first is the .vcf file and the second is the resulting .tsv file, I'm not sure what files you find identical.

@tseemann
Contributor

I confirm there are two different files.

@tseemann
Contributor

Here's the C++ file for vcf2tsv, if you can see anything that would help the developers fix your issue that would be very helpful.

https://github.com/vcflib/vcflib/blob/master/src/vcf2tsv.cpp

@deniseduma
Author

I thought that the way this works is the users point out the issues and the developers fix them, not that the users both point out the issues and fix them by themselves! Besides, I'm not familiar with C++ so I cannot help.

kblin added a commit to kblin/vcflib that referenced this issue Jul 31, 2017
As discussed in issue vcflib#206, it seems like for some inputs loadInfoSS() writes multiple entries
to the output stringstream, without appending a newline. Fixing this allows to remove the special
case handling of the newline in main() for all I can see.

Signed-off-by: Kai Blin <kblin@biosustain.dtu.dk>
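
To illustrate the kind of change the commit message describes, here is a simplified Python sketch of per-allele output in which every record carries its own newline. It is not the vcflib C++ code; the function `expand_per_allele`, its parameters, and the reduced column set are assumptions made for illustration.

```python
from typing import Iterator, List


def expand_per_allele(
    chrom: str,
    pos: int,
    ref: str,
    alts: List[str],
    info_per_alt: List[List[str]],
) -> Iterator[str]:
    """Yield one newline-terminated TSV row per ALT allele."""
    for alt, info in zip(alts, info_per_alt):
        fields = [chrom, str(pos), ref, alt, *info]
        # The newline is appended here, per record, so the caller never
        # has to remember to separate rows itself.
        yield "\t".join(fields) + "\n"


rows = list(expand_per_allele("1", 985826, "G", ["A", "T"], [["123"], ["123"]]))
print("".join(rows), end="")
```

Terminating each record where it is produced, rather than once in the caller, is what keeps a multi-allelic site from collapsing into a single line.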
@kblin
Contributor

kblin commented Jul 31, 2017

This is my first contribution to vcflib and I don't write C++ code. I'm not even using vcflib directly, though I guess some tools I use might. It took me about 5 minutes to find and fix this (see the pull request referenced above), maybe this is useful for you.

I would like to point out that you are mistaken in your impression on how "this" works. As a maintainer of multiple open source tools myself, here is my take:

People spend time writing software, and share it with other people in the hopes it will be useful.
If the software is useful, other people will start using it, and are happy that they don't need to implement all the functionality from scratch. If a bug is identified, users can file a bug report and/or fix the issue themselves. If the developers have some time to spare, they also try and fix reported bugs.

At no point in this interaction is a user entitled to bug-free software (check out the license, it says THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.). At no point in this interaction is it ok for anybody to yell at or to be rude to anybody.

Free software empowers users because it places their copy under their own control, so they can go and fix problems themselves. If you want someone to yell at for software that doesn't work, buy a support contract.

@deniseduma
Author

What exactly makes you think that I have the time required to fix bugs in free software??

Have you thought that my job description might entail many other responsibilities and after having spent the weekend trying to figure out 1. why the software installation fails and 2. why vcf2tsv doesn't output what it's supposed to output, I cannot afford to spend a workday on the same issue???

I haven't yelled at anybody although I can tell I'm very frustrated and if a simple functionality like this turns out to have bugs, I don't want to think what other more complex features that vcflib is supposed to offer might look like! I was planning to use vcflib for my work but at this point, I don't think it's a safe decision anymore!

@wdecoster

But if you don't like this code, why don't you just write your own?

@kblin
Contributor

kblin commented Jul 31, 2017

I appreciate that text communication is difficult, but this is how I read this: Your initial comment on this bug contains 5 exclamation marks and one dot, not counting the pasted output example. The subject contains three exclamation marks. That looks like yelling to me. I understand that you might be frustrated because you wanted to do something that looks easy, and turned out to be much more time consuming than initially thought. But that doesn't entitle you to unload the frustration on other people.

What exactly makes you think that the people who initially wrote the software they provide to you for free have the time to fix bugs? Have you thought that their job description might entail many other responsibilities?

I have no idea of the constraints on your time, but I'm willing to go out on a limb to say that it'll be faster to fix a bug or two in an existing implementation than to write a new one from scratch. But you're in the best position to decide this for yourself.

@simonohanlon101

simonohanlon101 commented Jul 31, 2017

Moreover, at least have the good grace to say "hey thanks!" to the developer who - despite being unconnected to the project or your work in any way - took time out of their working day to help you (and others) out. @kblin Thank you.

@ihh

ihh commented Jul 31, 2017

@kblin thanks for the fix!
@deniseduma I have little to add to what the other commenters have said, other than:

  1. I do hope you learn from this experience, which looks set to become a textbook example of how not to file a github issue.
  2. In future, if you find your work blocked by a bug in free software that your busy schedule does not permit you to fix yourself, I would be happy to put you in touch with one of several software consultancy firms who will be happy to fix it for cash, and may even (as a result of that cash) be more tolerant of you venting your frustration when things don't immediately work the way you need them to.

@deniseduma
Author

In the future, if I find my work blocked by an elementary mistake in free software, I'll make sure to use better (free) software out there, but thanks for your unnecessary advice!

@ttriche

ttriche commented Jul 31, 2017 via email

@serine

serine commented Aug 1, 2017

I can't believe that this is happening. I've heard about events like this, but haven't seen them, now I have.
While some developers and/or maintainers can be more or less proactive, and the docs can be less up to date, most often this comes down to the FREE time that people have and are willing to invest in the project.
I mean, I can watch a movie or play with my kids or lie flat on my couch and many more, or... OR I can help the science community to be better. Open source isn't just about the code, it's about docs, tests, bug issues, bug fixes, just a good vibe from folks...

@deniseduma are you helping the science community to be better by filing an issue? Good on ya!

I felt that @tseemann and @kblin in particular are doing a great job at facilitating this issue, but everyone else is also adding a great deal to this project, great job everybody!

open source is extremely important on many fronts and github is a great place for it.

@zeeev
Collaborator

zeeev commented Aug 1, 2017

@deniseduma @kblin's fix has been pulled into the master branch. I'm sorry for your frustrating experience. Just for some context, I introduced the bug while doing a code cleanup, where I fixed several other bugs. I tested the code before committing and releasing it, even for multi-allelic states.

Thank you for reporting the bug.

--Zev

@deniseduma
Author

Thank you for fixing it and letting me know!


10 participants