-
Notifications
You must be signed in to change notification settings - Fork 37
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Making a vcf file by hand: ## WARNING: ## The SNP Report table_1.csv has no header. #33
Comments
Greetings, Dan! Thanks for your interest in SNPGenie. It is difficult to pinpoint the problem without a full example that includes a reference FASTA sequence and a GTF file. However, upon examining the file you sent, two things I notice are (1) the file seems to be a hybrid CSV/VCF format in that the "sample" columns are surrounded by quotation marks; and (2) the file uses Windows (CRLF) line endings, not Unix (LF) line endings. I am attaching a version where quotations have been removed and line endings have been changed to Unix (LF). If either of these two issues is the culprit, this should work. Otherwise please provide a full reproducible example (all input and exact command line call) and we can take it from there! Chase |
Hi Chase Sorry I feel I was a bit flustered yesterday and my initial issue comment was not very informative. Thank you for your response. A few hours later I managed to parse a better looking VCF that did indeed work with SNPGenie. Did this in R from a data frame which I think may be a nice inclusion to help anyone else with the same problem. I think SNPGenie is a really nice tool. I ended up shortening the GTF file to just include the genes where my 12 variants were. I did then stumble into the problem of forward and reverse strands but that is for another day. I do have some questions about the calculation of pi in poolseq data. I cannot for the life of me find a strict definition of how it is calculated at a per site basis. So for a given sample I have the proportion of each variant. These were high confidence SNPs in pool-seq data that were confirmed via sequencing of individual clones. So to comply with the format, I then arbitrarily made everything have a depth of 1000 and then calculated the correct number of reads for each allele.
For each site, how exactly is pi calculated? Would be interested in helping write up some documentation and a workflow to improve understanding of how these metrics are calculated for pool-seq data. I find the documentation in both Popoolation2 and PopGenome rather high level, whereas in reality I think everyone would benefit from working through a real world, simple example. Sorry this has gone very off piece. I am happy to close this Issue and start a new one if needs be. |
Thanks for these details! I am glad to hear it's working. Regarding pi, it is simply calculated as the sum of the mean number of differences per site divided by the number of sites. For example, suppose there are 4 sequences of length 10 bp. Site 1 is polymorphic, with seq1 having C and the other 3 seqs having T, and the remaining 9 sites are not polymorphic. Because there are 4 sequences, there are 4choose2, or (4^2-4)/2=6, possible comparisons between them — that is, seq1 vs. seq2, seq1 vs. seq3, seq1 vs. seq4, seq2 vs. seq3, seq2 vs. seq4, and seq3 vs. seq4. Because seq1 differs from the others, comparisons involving seq1 (bold) are mismatches, while the other comparisons are matches. Thus, of the 6 possible comparisons, 3 represent differences. As a result, the mean number of pairwise differences at this site is 3/6=0.5. If you were only interested in this site, the value of pi would be 0.5/1=0.5 for this site. If you were interested in the whole sequence of 10 bp, the value of pi would be (0.5+0+0+0+0+0+0+0+0+0+0)/10 = 0.5/10 = 0.05. When it comes to coding regions, this is done on a codon-by-codon basis, where nonsynonymous and synonymous differences and sites are partitioned into piN and piS, respectively. We give a full explanation in Nelson et al. 2015. Please don't hesitate to write again if there are any questions not addressed therein. |
Hi SNPGenie team
I am trying to get
snpgenie.pl
to work, but am getting stuck atThe SNP Report table_1.csv has no header.
I think is likely because I am using a SNP report table that I made by hand, as the way I picked SNPs was a bit long winded and I ended up filtering them in R and want to now put them into SNPGenie.
I created a vcf file that I think follows the specifications of a VCF file and those documented in the README of SNPGenie, but maybe there is something SNPGenie does that means it does not work.
Any thoughts would be much appreciated.
Many thanks
Dan
snp_genie_prep.vcf.zip
The text was updated successfully, but these errors were encountered: