Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Making a vcf file by hand: ## WARNING: ## The SNP Report table_1.csv has no header. #33

Closed
padpadpadpad opened this issue Jul 13, 2020 · 3 comments

Comments

@padpadpadpad
Copy link

Hi SNPGenie team

I am trying to get snpgenie.pl to work, but am getting stuck at The SNP Report table_1.csv has no header.

I think is likely because I am using a SNP report table that I made by hand, as the way I picked SNPs was a bit long winded and I ended up filtering them in R and want to now put them into SNPGenie.

I created a vcf file that I think follows the specifications of a VCF file and those documented in the README of SNPGenie, but maybe there is something SNPGenie does that means it does not work.

Any thoughts would be much appreciated.

Many thanks
Dan

snp_genie_prep.vcf.zip

@singing-scientist
Copy link
Contributor

Greetings, Dan! Thanks for your interest in SNPGenie. It is difficult to pinpoint the problem without a full example that includes a reference FASTA sequence and a GTF file. However, upon examining the file you sent, two things I notice are (1) the file seems to be a hybrid CSV/VCF format in that the "sample" columns are surrounded by quotation marks; and (2) the file uses Windows (CRLF) line endings, not Unix (LF) line endings. I am attaching a version where quotations have been removed and line endings have been changed to Unix (LF). If either of these two issues is the culprit, this should work. Otherwise please provide a full reproducible example (all input and exact command line call) and we can take it from there!

Chase

snp_genie_prep.vcf.zip

@padpadpadpad
Copy link
Author

Hi Chase

Sorry I feel I was a bit flustered yesterday and my initial issue comment was not very informative.

Thank you for your response. A few hours later I managed to parse a better looking VCF that did indeed work with SNPGenie. Did this in R from a data frame which I think may be a nice inclusion to help anyone else with the same problem. I think SNPGenie is a really nice tool. I ended up shortening the GTF file to just include the genes where my 12 variants were. I did then stumble into the problem of forward and reverse strands but that is for another day.

I do have some questions about the calculation of pi in poolseq data. I cannot for the life of me find a strict definition of how it is calculated at a per site basis.

So for a given sample I have the proportion of each variant. These were high confidence SNPs in pool-seq data that were confirmed via sequencing of individual clones. So to comply with the format, I then arbitrarily made everything have a depth of 1000 and then calculated the correct number of reads for each allele.

#CHROM POS REF ALT QUAL FORMAT SAMPLE-Compost_1
NC_012660 1816564 GT G 1000 AD:DP 1000,0:1000
NC_012660 1823344 T G 1000 AD:DP 900,100:1000
NC_012660 1824609 TA T 1000 AD:DP 1000,0:1000
NC_012660 1829303 T TC 1000 AD:DP 500,500:1000
NC_012660 2474703 C T 1000 AD:DP 1000,0:1000
NC_012660 4496109 C T 1000 AD:DP 1000,0:1000
NC_012660 4560835 G A 1000 AD:DP 1000,0:1000
NC_012660 4568375 T C 1000 AD:DP 1000,0:1000
NC_012660 4823073 C T 1000 AD:DP 900,100:1000
NC_012660 4915904 TCTG T 1000 AD:DP 1000,0:1000
NC_012660 6041420 T C 1000 AD:DP 1000,0:1000
NC_012660 6249053 G A 1000 AD:DP 1000,0:1000

For each site, how exactly is pi calculated? Would be interested in helping write up some documentation and a workflow to improve understanding of how these metrics are calculated for pool-seq data. I find the documentation in both Popoolation2 and PopGenome rather high level, whereas in reality I think everyone would benefit from working through a real world, simple example.

Sorry this has gone very off piece. I am happy to close this Issue and start a new one if needs be.

@singing-scientist
Copy link
Contributor

singing-scientist commented Jul 15, 2020

Thanks for these details! I am glad to hear it's working. Regarding pi, it is simply calculated as the sum of the mean number of differences per site divided by the number of sites. For example, suppose there are 4 sequences of length 10 bp. Site 1 is polymorphic, with seq1 having C and the other 3 seqs having T, and the remaining 9 sites are not polymorphic. Because there are 4 sequences, there are 4choose2, or (4^2-4)/2=6, possible comparisons between them — that is, seq1 vs. seq2, seq1 vs. seq3, seq1 vs. seq4, seq2 vs. seq3, seq2 vs. seq4, and seq3 vs. seq4. Because seq1 differs from the others, comparisons involving seq1 (bold) are mismatches, while the other comparisons are matches. Thus, of the 6 possible comparisons, 3 represent differences. As a result, the mean number of pairwise differences at this site is 3/6=0.5. If you were only interested in this site, the value of pi would be 0.5/1=0.5 for this site. If you were interested in the whole sequence of 10 bp, the value of pi would be (0.5+0+0+0+0+0+0+0+0+0+0)/10 = 0.5/10 = 0.05.

When it comes to coding regions, this is done on a codon-by-codon basis, where nonsynonymous and synonymous differences and sites are partitioned into piN and piS, respectively. We give a full explanation in Nelson et al. 2015. Please don't hesitate to write again if there are any questions not addressed therein.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants