Making a vcf file by hand: ## WARNING: ## The SNP Report table_1.csv has no header. #33

padpadpadpad · 2020-07-13T11:21:59Z

Hi SNPGenie team

I am trying to get snpgenie.pl to work, but am getting stuck at The SNP Report table_1.csv has no header.

I think is likely because I am using a SNP report table that I made by hand, as the way I picked SNPs was a bit long winded and I ended up filtering them in R and want to now put them into SNPGenie.

I created a vcf file that I think follows the specifications of a VCF file and those documented in the README of SNPGenie, but maybe there is something SNPGenie does that means it does not work.

Any thoughts would be much appreciated.

Many thanks
Dan

snp_genie_prep.vcf.zip

The text was updated successfully, but these errors were encountered:

singing-scientist · 2020-07-14T06:59:13Z

Greetings, Dan! Thanks for your interest in SNPGenie. It is difficult to pinpoint the problem without a full example that includes a reference FASTA sequence and a GTF file. However, upon examining the file you sent, two things I notice are (1) the file seems to be a hybrid CSV/VCF format in that the "sample" columns are surrounded by quotation marks; and (2) the file uses Windows (CRLF) line endings, not Unix (LF) line endings. I am attaching a version where quotations have been removed and line endings have been changed to Unix (LF). If either of these two issues is the culprit, this should work. Otherwise please provide a full reproducible example (all input and exact command line call) and we can take it from there!

Chase

snp_genie_prep.vcf.zip

padpadpadpad · 2020-07-14T09:12:22Z

Hi Chase

Sorry I feel I was a bit flustered yesterday and my initial issue comment was not very informative.

Thank you for your response. A few hours later I managed to parse a better looking VCF that did indeed work with SNPGenie. Did this in R from a data frame which I think may be a nice inclusion to help anyone else with the same problem. I think SNPGenie is a really nice tool. I ended up shortening the GTF file to just include the genes where my 12 variants were. I did then stumble into the problem of forward and reverse strands but that is for another day.

I do have some questions about the calculation of pi in poolseq data. I cannot for the life of me find a strict definition of how it is calculated at a per site basis.

So for a given sample I have the proportion of each variant. These were high confidence SNPs in pool-seq data that were confirmed via sequencing of individual clones. So to comply with the format, I then arbitrarily made everything have a depth of 1000 and then calculated the correct number of reads for each allele.

#CHROM	POS	REF	ALT	QUAL	FORMAT	SAMPLE-Compost_1
NC_012660	1816564	GT	G	1000	AD:DP	1000,0:1000
NC_012660	1823344	T	G	1000	AD:DP	900,100:1000
NC_012660	1824609	TA	T	1000	AD:DP	1000,0:1000
NC_012660	1829303	T	TC	1000	AD:DP	500,500:1000
NC_012660	2474703	C	T	1000	AD:DP	1000,0:1000
NC_012660	4496109	C	T	1000	AD:DP	1000,0:1000
NC_012660	4560835	G	A	1000	AD:DP	1000,0:1000
NC_012660	4568375	T	C	1000	AD:DP	1000,0:1000
NC_012660	4823073	C	T	1000	AD:DP	900,100:1000
NC_012660	4915904	TCTG	T	1000	AD:DP	1000,0:1000
NC_012660	6041420	T	C	1000	AD:DP	1000,0:1000
NC_012660	6249053	G	A	1000	AD:DP	1000,0:1000

For each site, how exactly is pi calculated? Would be interested in helping write up some documentation and a workflow to improve understanding of how these metrics are calculated for pool-seq data. I find the documentation in both Popoolation2 and PopGenome rather high level, whereas in reality I think everyone would benefit from working through a real world, simple example.

Sorry this has gone very off piece. I am happy to close this Issue and start a new one if needs be.

singing-scientist · 2020-07-15T05:56:05Z

Thanks for these details! I am glad to hear it's working. Regarding pi, it is simply calculated as the sum of the mean number of differences per site divided by the number of sites. For example, suppose there are 4 sequences of length 10 bp. Site 1 is polymorphic, with seq1 having C and the other 3 seqs having T, and the remaining 9 sites are not polymorphic. Because there are 4 sequences, there are 4choose2, or (4^2-4)/2=6, possible comparisons between them — that is, seq1 vs. seq2, seq1 vs. seq3, seq1 vs. seq4, seq2 vs. seq3, seq2 vs. seq4, and seq3 vs. seq4. Because seq1 differs from the others, comparisons involving seq1 (bold) are mismatches, while the other comparisons are matches. Thus, of the 6 possible comparisons, 3 represent differences. As a result, the mean number of pairwise differences at this site is 3/6=0.5. If you were only interested in this site, the value of pi would be 0.5/1=0.5 for this site. If you were interested in the whole sequence of 10 bp, the value of pi would be (0.5+0+0+0+0+0+0+0+0+0+0)/10 = 0.5/10 = 0.05.

When it comes to coding regions, this is done on a codon-by-codon basis, where nonsynonymous and synonymous differences and sites are partitioned into piN and piS, respectively. We give a full explanation in Nelson et al. 2015. Please don't hesitate to write again if there are any questions not addressed therein.

singing-scientist closed this as completed Aug 7, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Making a vcf file by hand: ## WARNING: ## The SNP Report table_1.csv has no header. #33

Making a vcf file by hand: ## WARNING: ## The SNP Report table_1.csv has no header. #33

padpadpadpad commented Jul 13, 2020

singing-scientist commented Jul 14, 2020

padpadpadpad commented Jul 14, 2020

singing-scientist commented Jul 15, 2020 •

edited

Making a vcf file by hand: ## WARNING: ## The SNP Report table_1.csv has no header. #33

Making a vcf file by hand: ## WARNING: ## The SNP Report table_1.csv has no header. #33

Comments

padpadpadpad commented Jul 13, 2020

singing-scientist commented Jul 14, 2020

padpadpadpad commented Jul 14, 2020

singing-scientist commented Jul 15, 2020 • edited

singing-scientist commented Jul 15, 2020 •

edited