Output Formats

By default ipyrad writes all of the output formats it is capable of generating. Converting between the various formats is very fast, but if you want to save CPU time and disk space, you can enable only specific output formats with the output_formats parameter.

Variant Call Format *.vcf.gz

VCF is a standard format for storing and manipulating sequence data. The format is too complicated to go into here, but you can see a good explanation on the 1000 Genomes Project site (http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-40). The VCF file output by ipyrad includes full genotype information for all bases in all loci, including information about genotype quality. Many useful conversions and filtering options for this format are available in the software vcftools.

    #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 1A_0 1B_0 1C_0 1D_0 2E_0 2F_0 2G_0 2H_0 3I_0 3J_0 3K_0 3L_0
    0 0 . G . 13 PASS NS=12;DP=235 GT:CATG 0/0:0,0,0,19 0/0:0,0,0,22 0/0:0,0,0,20 0/0:0,0,0,19 0/0:0,0,0,18 0/0:0,0,0,22 0/0:0,0,0,20 0/0:0,0,0,21 0/0:0,0,0,15 0/0:0,0,0,14 0/0:0,0,0,24 0/0:0,0,0,21
    0 1 . T . 13 PASS NS=12;DP=235 GT:CATG 0/0:0,0,19,0 0/0:0,0,22,0 0/0:0,0,20,0 0/0:0,0,19,0 0/0:0,0,18,0 0/0:0,0,22,0 0/0:0,0,20,0 0/0:0,0,21,0 0/0:0,0,15,0 0/0:0,0,14,0 0/0:0,0,24,0 0/0:0,0,21,0
    0 2 . T . 13 PASS NS=12;DP=235 GT:CATG 0/0:0,0,19,0 0/0:0,0,22,0 0/0:0,0,20,0 0/0:0,0,19,0 0/0:0,0,18,0 0/0:0,0,22,0 0/0:0,0,19,1 0/0:0,0,21,0 0/0:0,0,15,0 0/0:0,0,14,0 0/0:0,0,24,0 0/0:0,0,21,0
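As a rough illustration of how the GT:CATG sample fields could be unpacked, the Python sketch below parses one record from the example above. It assumes the four comma-separated numbers are read depths for the bases C, A, T, G in that order, as the field name suggests. The function name parse_vcf_record is hypothetical and not part of ipyrad.

```python
def parse_vcf_record(header_line, record_line):
    """Return (fixed_fields, per_sample) for one whitespace-delimited VCF record."""
    columns = header_line.lstrip("#").split()
    values = record_line.split()
    # The first nine columns are the fixed VCF fields (CHROM ... FORMAT).
    fixed = dict(zip(columns[:9], values[:9]))
    format_keys = fixed["FORMAT"].split(":")
    samples = {}
    for name, field in zip(columns[9:], values[9:]):
        entry = dict(zip(format_keys, field.split(":")))
        # Assumption: CATG holds depths for bases C, A, T, G in that order.
        depths = [int(n) for n in entry["CATG"].split(",")]
        entry["CATG"] = dict(zip("CATG", depths))
        samples[name] = entry
    return fixed, samples


header = "#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 1A_0 1B_0"
record = "0 0 . G . 13 PASS NS=12;DP=235 GT:CATG 0/0:0,0,0,19 0/0:0,0,0,22"
fixed, samples = parse_vcf_record(header, record)
print(fixed["REF"], samples["1A_0"]["GT"], samples["1A_0"]["CATG"]["G"])
```

For anything beyond a quick inspection, an established VCF tool such as vcftools is likely the safer choice.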

ipyrad format *.loci

This is a custom format that is easy to read, showing each locus with its variable sites marked. Custom scripts can easily parse this file to select loci with a given amount of taxon coverage or number of variable sites. It is also the easiest file to read when checking that your analyses are working properly. A (-) indicates a variable site, and a (*) indicates that the site is phylogenetically informative. Integers enclosed by | indicate the locus number. Example:

    1A_0    GTTATCCGTAGCGATTATCACCTCAGTAAGATAAACCCATGGATAACGGGGGGGACAGCGCTAGATTGTTGGGTCGGACGCAGCTAGTC
    1B_0    GTTATCCGTAGCGATTATCACCTCAGTAAGATAAACCCATGGATAACGGGGGGGACAGCGCTAGATTGTTGGGTCGGACGCAGCTAGTC
    1C_0    GTTATCCGTAGCGATTATTACCTCAGTAAGATAAACCCATGGATAACGGGGGGGACAGCGCTAGATTGTTGGGTCGGACGCAGCTAGTC
    1D_0    GTTATCCGTAGCGATTATCACCTCAGTTAKATAAACCCATGGATAACGGGGGGGACAGCGCTAGATTGTTGGGGCGGACGCAGCTAGTC
    2E_0    GTTATCCGTAGCGATTATTACCTCAGTTAGATAAACCCATGGATAACGGGGGGGACAGCGCTAGATTGTTGGGTCGGACGCAGCTAGTC
    2F_0    GTTATCCGTAGCGATTATCACCTCAGTTAGATAAACCCATGGATAACGGGGGGGACAGCGCTAGATTGTTGGGTCGKACGCAGCTAGTC
    2G_0    GTTATCCGTAGCGATTATCACCTCAGTTAGATAAACCCATGGATAACGGGGGGGACAGCGCTAGATTGTTGGGTCGGACGCAGCTAGTC
    2H_0    GTTATCCGTAGCGATTATCACCTCAGTTAGATAAACCCATGGATAACGGGGGGGACAGCGCTAGATTGTTGGGTCGGACGCAGCTAGTC
    3I_0    GTTATCCGTAGCGATTATCACCTCAGTTAGATAAACCCATGGATAACGGGGGGGACAGCGCTAGATTGTTGGGTCGGACGCAGCTAGTC
    3J_0    GTTATCCGTAGCGATTATCACCTCAGTTAGATAAACCCATGGATAACGGGGSGGACAGCGCTAGATTGTTGGGTCGGACGCAGCTAGTC
    3K_0    GTTATCCGTAGCGATTATCACCTCAGTTAGATAAACCCATGGATAACGGGGGGGACAGCGCTAGATTGTTGGGTCGGACGCAGCTAGTC
    3L_0    GTTATCGGTAGCGATTATCACCTCAGTTAGATAAACCCATGGATAACGGGGGGGACAGCGCTAGATTGTTGGGTCGGACACAGCTAGTC
    // - * * - - - - -
    1A_0    ACAGCTCTGTTACATGCATCTGTCCATACTCCCTGGTTCGCAATAATAGGGCTCCATATCAAGTGATCAGCTAGGCTTCGAGTCGTATC
    1B_0    ACAGCTCTGTTACATGCATCTGTCCATACTCCCTGGTTCGCAATAATAGGGCTCCATATCAAGTGATMAGCTAGGCTTCGAGTCGTATC
    1C_0    ACAGCTCTGTTACATGCATCTGTCCATACTCCCTGGTTCGCAATAATAGGGCTCCATATCAAGTGATCAGCTAGGCTTCGAGTCGTATC
    1D_0    ACAGCTCTGTTACATRCATCTGTCCATACTCCCTGGTTCGTAATCATAGGGCTCCATATCAAGTGATCAGCTAGGCTTCGAGTCGTATC
    2E_0    ACAGCTCTATTACATGCATCTGTCCATACTCCCTGGTTCGCAATCATAGGGCTCCATATCAAGTGATCAGCTAGGCTTCGAGTCGTATC
    2F_0    ACAGCTCTATTACATGCATCTGTCCATACTCCCTGGTTCGCAATCATAGGGCTCCATATCAAGTGATCAGCTAGGCTTCGAGTCGTATC
    2G_0    ACAGCTCTATTACATGCATCTGTCCATACTCCCTGGTTCGCAATCATAGGGCTCCATATCAAGTGATCAGCTAGGCTTCGAGTCGTATC
    2H_0    ACAGCTCTGTTACATGCATCTGTCCATACTCCCTGGTTCGCAATCATAGGGCTCCATATCAAGYGATCAGCTAGGCTTCGAGTCGTATS
    3I_0    ACAGCTCTGTTACATGCATCTGTCCATACTCCCTGGTTCGCAATCATAGGGCTCCATATCAAGTGATCAGCTAGGCTTCGAGTCGTATC
    3J_0    ACAGCTCTGTTACATGCATCTGTCCATACTCCCTGGTTCGCAATCATAGGGCTCCATATCAAGTGATCAGCTAGGCTTCGAGTCGTATC
    3K_0    ACAGCTCTGTTACATGCATCTGTCMATACTCCCTGGTTCGCAATCATAGGGCTCCATATCAAGTGATCAGCTAGGCTTCGAGTCGTATC
    3L_0    ACAGCTCTGTTACATGCATCTGTCCATACTCCCTGGTTCGCAATCATAGGGCTCCATATCAAGTGATCAGCTAGGCTTCGAGTCGTAYC
    // * - - - * - - --

For paired-end data the two linked loci are shown separated by a 'nnnn' separator. Merged reads do not contain the 'nnnn'.

    1A0    GATAGCGGACGAAGCTTCCTGGATCAACATATCCGTTTGACAGTTTATATGTCAACAAGTAAGGAGCTGGACTGGGAGGTGCTATTAnnnnACTCTAAGATACCAAACCCTGTCCCAGCATTACGTCCCTGTATCCGACCT
    1B0    GATAGCGGACGAAGCTTCCTGGATCAACATATCCGTTTGACAGTTTATATGTCAACAAGTAAGGAGCTGGACTGGGAGGTGCTATTAnnnnACTCTAAGATACCAAACCCTGTCCCAGCATTACGTCCCTGTATCCGACCT
    1C0    GATAGCGGACGAAGCTTCCTGGATCAACATATCCGTTTGACAGTTTATATGTCAACAAGAAAGGAGCTGGACTGGGAGGTGCTATTGnnnnACTCTAAGATACCAAACCCTGTCCCAGCATTACGTCCCTGTATCCGACCT
    1D0    GATAGCGGACGAAGCTTCCTGGATCAACATATCCGTTTGACAGTTTATATGTCAACAAGTAAGGAGCTGGACTGGGAGGTGCTATTGnnnnACTCTAAGATACCAAACCCTGTCCCAGCATTACGTCCCTGTATCCGACCT
    2E0    GATAGCGGACGAAGCTTCCTGGATCAACATATCCGTTTGACAGTTTATATGTCAACAAGTAAGGAGCTGGACTGGGAGGTGCTATTSnnnnACTCTAAGATACCAAACCCTGTCCCAGCATTACGTCCCGGTATCCGACCT
    2F0    GATAGCGGACGAAGCTTCCTGGATCAACATATCCGTTTGACAGTTTATATGTCAACAAGTAAGGAGCTGGACTGGGAGGTGCTATTGnnnnACTCTAAGATACCAAACCCTGTCCCAGCATTACGTCCCGGTATCCGACCT
    2G0    GATAGCGGACGAAGCTTCCTGGATCAACATATCCGTTTGACAGTTTATATGTCAACAAGTAAGGAGCTGGACTGGGAGGTGCTATTGnnnnACTTTAAGATACCAAACCCTGTCCCAGCATTACGTCCCGGTATCCGACCT
    2H0    GATAGCGGACGAAGCTTCCTGGATCAACATATCCGTTTGACAGTTTATATGTCAACAAGTAAGGAGCTGGACTGGGAGGTGCTATTGnnnnACTCTAAGATACCAAACCCTGTCCCAGCATTACTTCCCGGTATCCGACCT
    3I0    GATAGCGGACGAAGCTTCCTGGATCAACATATCCGTTTGACAGTTTATATGTCAACAAGTAAGGAGCTGGACTGGGAGGTGCTATTGnnnnACTCTAAGATACCAAACCCTGACCCAGCATTACGTCCCTGTATCCGACCT
    3J0    GATAGCGGACGAAGCTTCCTGGATCAACATATCCGTTTGACAGTTTATATGTCAACAAGTAAGGAGCTGGACTGGGAGGTGCTATTGnnnnACTCTAAGATACCAAACCCTGACCCAGCATTACGTCCCTGTATCCGACCT
    3K0    GATAGCGGACGAAGCTTCCTGGATCAACATATCCGTTTGACAGTTTATATGTCAACAAGTAAGGAGCTGGACTGGGAGGTGCTATTGnnnnACTCTAAGATACCAAACCCTGACCCAGCATTACGTCCCTGTATCCGACCT
    3L0    GATAGCGGACGAAGCTTCCTGGATCAACATATCCGTTTGACAGTTTATATGTCAACAAGTAAGGAGCTGGACTGGGAGGTGCTATTGnnnnACTCTAAGAGACYAAACCCTGTCCCAGCATTACGTCCCTGTATCCGACCT
    // - * - - - * - *
    1A0    GACAAATCTTACATTACAGTAATTGGTACTTATCACATACTAAGTTGTCAGAGACTTATTTGACAATATTCGGGGTCTTTGGCCATGnnnnGTAGTTAGGCTATTTTGCGCGTACCAAACGCAGGTGGAGGACCCAAGAAC
    1B0    GACAAATCTTAGATTACAGTAATTGGTACTTATCACATACTAAGTTGTCAGAGACTTATTTGACAATATTCGGGGTCTTTGGCCATGnnnnGTAGTTAGGCTATTTTGCGCGTACCAAACGCAGGTGGAGGACCCAAGAAC
    1C0    GACAAATCTTAGATTACAGTAATTGGTACTTATCACATACTAAGTTGTCAGAGACTAATTTGACAATATTCGGGGTCTTTGGCCATGnnnnGTAGTTAGGCTATTTTGCGCGTACCAAACGCAGGTGGAGGACCCAAGAAC
    1D0    GACAAATCTTAGTTTACAGTAATTGGTACTTATCACATACTAAGTTGTCAGAGACTTATTTGACAATATTCGGGGTCTTTGGCCATGnnnnGTAGTTAGGCTATTTTGCGCGAACCAAACGCAGGTGGAGGACCCAAGAAC
    2E0    GACAAATCTTAGATGACAGTAATTGGTACTTATCACATACTAAGTTGTCAGAGACTTATTTGACAATATTCGGGGTCTTTGGCCATGnnnnGTAGTTAGGCTATTTTGCGCGTACCAAACGCAGGTGGAGGACCCAAGAAC
    2F0    GACAAATCTTAGATGACAGTAATTGGTACTTATCACATACTAAGTTGTCAGAGACTTATTTGACAATATTCGGGGTCTTTGGCCATGnnnnGTAGTTAGGCTATTTTGCGCGTACCAAACGCAGGTGGAGGACCCAAGAAC
    2G0    GACAAATCTTAGATGACAGTAATTGGTACTTATCACATACTAAGTTGTCAGAGACTTATTTGACAATATTCGGGGTCTTTGGCCATGnnnnGTAGTTAGGCTATTTTGCGCGTACCAAACGCAGGTGGAGGACCCAAGAAC
    2H0    GACAAATCTTAGATGACAGTAATTGGTACTTATCACATACTAAGTTGTCAGAGACTTATTTGACAATATTCGGGGTCTTTGGCCATGnnnnGTAGTTAGGCTATTTTGCGCGTACCAAACGCAGGTGGAGGACCCAAGAAC
    3I0    GACAAATCTTAGATGACAGTAATTGGTACTTATCACATACTAAGTTGTCAGAGACTTATTTGACAATATTCGGGGTCTTTGGCCATGnnnnGTAGTTAGGCTATTTTGCGCGTACCAAACGCAGGTGGAGGACCCAAGAAC
    3J0    GACAAATCTCAGATGACAGTAATTGGTACTTATCACATACTAAGTTGTCAGAGACTTATTTGACAATATTCGGGGTCTTTGGCCATGnnnnGTAGTTAGGCTATTTTGCGCGTACCAAACGCAGGTGGAGGACCCAAGAAC
    3K0    GACAAATCTTAGATGACAGTAATTGGTACTTATCACATACTAAGTTGTCAGAGACTTATTTGACAATATTCGGGGTCTTTGGCCATGnnnnGTAGTTAGGCTATTTTGCGCGTACCAAACGCAGGTGGAGGACCCAAGAAC
    3L0    GACAAATCTTAGATGACAGTAATTGGTACTTATCACATACTAAGTTGTCAGAGACTTATTTGACAATATTCGGGGTCTTTGGCCATGnnnnGTAGTTAGGCTATTTTGCGCGTACCAAACGCAGGTGGAGGACCCAAGAAC
    // - -- * - -
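The kind of custom script mentioned above might look like the following Python sketch. It assumes one "name sequence" pair per line and a line starting with // closing each locus, as in the examples. The helper names parse_loci and split_pair are hypothetical, and the input here is shortened toy data rather than real ipyrad output.

```python
def parse_loci(text):
    """Split .loci text into a list of {sample_name: sequence} dicts."""
    loci, current = [], {}
    for line in text.splitlines():
        if line.startswith("//"):
            # The // line closes a locus (it also carries the SNP markers).
            if current:
                loci.append(current)
            current = {}
        elif line.strip():
            name, seq = line.split()
            current[name] = seq
    return loci


def split_pair(seq, separator="nnnn"):
    """Return (read1, read2); read2 is None when the reads were merged."""
    first, found, second = seq.partition(separator)
    return (first, second) if found else (first, None)


example = """\
1A0  GATAGCnnnnACTCTA
1B0  GATAGCnnnnACTCTA
//        -
1A0  GACAAATC
1B0  GACAAATC
1C0  GACAAATC
//
"""
loci = parse_loci(example)
# Keep only loci sampled in at least three taxa.
well_sampled = [locus for locus in loci if len(locus) >= 3]
print(len(loci), len(well_sampled), split_pair(loci[0]["1A0"]))
```

The same loop could be extended to count the - and * markers on each // line when filtering by the number of variable sites.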

PHYLIP *.phy

This is a PHYLIP-formatted data file that contains all of the loci from the .loci file concatenated into a supermatrix, with missing data for any sample filled in with N's. This format is used by RAxML, among other phylogenetic programs. The header here indicates there are 12 samples and 89023 bases in each sequence. Because the sequences are so long, the output is truncated here for clarity (indicated by the ellipses).

    12 89023
    1A_0    GTTATCCGTAGCGATTATCACCTCAGTAAGATAAACCCATGGATAACGGGGGGGACAGCGCTAGATTGTTGGGTCGGACGCAGCTAGTCACAGCTCTGTTACATGCATCTGTCCATACTCCCTGGTTCGCAATAAT...
    1B_0    GTTATCCGTAGCGATTATCACCTCAGTAAGATAAACCCATGGATAACGGGGGGGACAGCGCTAGATTGTTGGGTCGGACGCAGCTAGTCACAGCTCTGTTACATGCATCTGTCCATACTCCCTGGTTCGCAATAAT...
    1C_0    GTTATCCGTAGCGATTATTACCTCAGTAAGATAAACCCATGGATAACGGGGGGGACAGCGCTAGATTGTTGGGTCGGACGCAGCTAGTCACAGCTCTGTTACATGCATCTGTCCATACTCCCTGGTTCGCAATAAT...
    1D_0    GTTATCCGTAGCGATTATCACCTCAGTTAKATAAACCCATGGATAACGGGGGGGACAGCGCTAGATTGTTGGGGCGGACGCAGCTAGTCACAGCTCTGTTACATRCATCTGTCCATACTCCCTGGTTCGTAATCAT...
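To make the idea of a supermatrix concrete, here is a minimal Python sketch of how per-locus alignments could be concatenated, padding with N's wherever a sample is absent from a locus. This is an illustration of the concept only, not ipyrad's own code. The function names and the toy loci are made up for the example.

```python
def concatenate_loci(loci, samples, missing="N"):
    """Join per-locus alignments into one sequence per sample, padding gaps with N."""
    matrix = {name: [] for name in samples}
    for locus in loci:
        # All sequences within a locus are aligned, so any one gives the length.
        length = len(next(iter(locus.values())))
        for name in samples:
            matrix[name].append(locus.get(name, missing * length))
    return {name: "".join(parts) for name, parts in matrix.items()}


def to_phylip(matrix):
    """Format the supermatrix with an 'nsamples nsites' header line."""
    nsites = len(next(iter(matrix.values())))
    width = max(len(name) for name in matrix) + 2
    lines = [f"{len(matrix)} {nsites}"]
    lines += [f"{name:<{width}}{seq}" for name, seq in matrix.items()]
    return "\n".join(lines)


# Toy data: sample 1B_0 is missing from the second locus.
loci = [{"1A_0": "GTTA", "1B_0": "GTTA"}, {"1A_0": "ACAGC"}]
matrix = concatenate_loci(loci, ["1A_0", "1B_0"])
print(to_phylip(matrix))
```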

*.snps.phy & *.u.snps.phy

Additionally, we provide two PHYLIP-formatted files that include only variable sites (SNPs). Paired loci are treated as a single locus, meaning SNPs from the two reads are not separated in these files (they are linked). The *.snps.phy file contains all SNPs from all loci concatenated together, with missing values filled by N's. The *.u.snps.phy file contains one SNP sampled from each locus. If a locus has multiple SNPs, the SNP site with the least missing data across taxa is sampled; if several sites have equal amounts of missing data, one of them is sampled at random. The header indicates this file contains 12 samples and 990 bases per sample. The output below is truncated for clarity.

    12 990
    1A_0    GAATGACATCCTCAAACACCCTGGATACGGACAACGAAATTGCACTCATCAGACAAAGAAATTACWGAGGAACCCATGAGAGACCGCCTYCARYA...
    1B_0    GAAASRCATACTCAAACACCCTKGATACGGACAACGAAATTGCACTCATCAGACAAAGAAATTACAGAGGAACCCAAGAGAGACCGCCTTCAATA...
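The sampling rule described above (least missing data first, random choice among ties) can be sketched in Python as follows. This is an illustrative reading of the rule, not ipyrad's implementation. It assumes that only N counts as missing, and the name pick_unlinked_snp is hypothetical.

```python
import random


def pick_unlinked_snp(columns, rng, missing="N"):
    """Pick the index of one SNP column from a locus.

    columns: list of strings, one per SNP, each holding one base per sample.
    Columns with the fewest missing calls win; ties are broken at random.
    """
    counts = [col.count(missing) for col in columns]
    fewest = min(counts)
    candidates = [i for i, n in enumerate(counts) if n == fewest]
    return rng.choice(candidates)


rng = random.Random(42)  # fixed seed only so the example is repeatable
locus_snps = ["ACNA", "GGGT", "TNNT", "CCAC"]  # toy data: 4 SNPs x 4 samples
chosen = pick_unlinked_snp(locus_snps, rng)
print(chosen)  # index 1 or 3, the two columns with no missing calls
```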

MAP/PARTITION (*.snps.map)

Because the concatenated SNPs file does not include information about which SNPs come from which locus, we provide a .snps.map file with this information. This is used by the program tetrad to randomly sample single SNPs from among loci.

    1 rad0_snp0 0 1
    1 rad0_snp1 0 2
    1 rad0_snp2 0 3
    1 rad0_snp3 0 4
    1 rad0_snp4 0 5
    2 rad1_snp0 0 6
    2 rad1_snp1 0 7
    2 rad1_snp2 0 8
    2 rad1_snp3 0 9
    3 rad2_snp0 0 10
    3 rad2_snp1 0 11
    3 rad2_snp2 0 12
    3 rad2_snp3 0 13
    3 rad2_snp4 0 14
    3 rad2_snp5 0 15
    3 rad2_snp6 0 16
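A short Python sketch of how such a map could be used to draw one SNP per locus is given below. It assumes the first column is the locus number and the last column is the 1-based position of the SNP in the concatenated SNP matrix, which is how the example above reads. The function names are hypothetical and this is not tetrad's code.

```python
import random
from collections import defaultdict


def read_snps_map(text):
    """Group SNP column indices by locus from .snps.map text."""
    by_locus = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            continue
        locus, _name, _unused, position = line.split()
        # Assumption: the last column is the 1-based SNP column in .snps.phy.
        by_locus[int(locus)].append(int(position) - 1)
    return dict(by_locus)


def sample_one_per_locus(by_locus, rng):
    """Draw one SNP column index at random from every locus."""
    return [rng.choice(cols) for _, cols in sorted(by_locus.items())]


map_text = """\
1 rad0_snp0 0 1
1 rad0_snp1 0 2
2 rad1_snp0 0 6
2 rad1_snp1 0 7
3 rad2_snp0 0 10
"""
by_locus = read_snps_map(map_text)
picked = sample_one_per_locus(by_locus, random.Random(1))
print(by_locus, picked)
```

The returned indices are 0-based, so they can be used directly to slice the sequences of a .snps.phy file read into Python strings.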

EIGENSTRAT *.geno & *.u.geno

This is a SNP-based format. Each line corresponds to one SNP, with one column per sample. The value in the sample column indicates the number of copies of the reference allele each individual has, and 9 indicates missing data. Below you will see standard .geno output from the simulated data, so there are 12 columns, one per sample. This format is used by EIGENSTRAT, SMARTPCA, and ADMIXTURE, among other programs.

There is an additional *.u.geno file output that includes only unlinked SNPs, with one SNP being randomly chosen per locus and the rest ignored.

    222222222220
    220202222222
    000222222222
    222122222222
    222222222122
    222022222222
    222221222222
    222222222220
    222200022222
    222122222222
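Because each row is just a string of single-digit codes, the file is simple to read into a matrix. The Python sketch below does that and reports the fraction of missing calls (code 9) per sample. The function names are hypothetical, and the last row of the toy input was altered to include a 9 so that the missing-data branch is exercised.

```python
def read_geno(text):
    """Return a list of rows; each row is a list of ints, one per sample."""
    return [[int(ch) for ch in line.strip()]
            for line in text.splitlines() if line.strip()]


def missing_fraction(rows, missing=9):
    """Fraction of missing calls for each sample (column)."""
    nsnps = len(rows)
    nsamples = len(rows[0])
    return [sum(row[i] == missing for row in rows) / nsnps
            for i in range(nsamples)]


# Toy input: three SNPs (rows) by twelve samples (columns).
geno_text = """\
222222222220
220202222222
000222222229
"""
rows = read_geno(geno_text)
print(rows[1][:4])  # [2, 2, 0, 2]
print(missing_fraction(rows)[-1])
```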

G-PhoCS *.gphocs

This is a full sequence based format that is very similar to the native ipyrad .loci format. It is appropriate for use with the Bayesian MCMC demographic inference program G-PhoCS: http://compgen.cshl.edu/GPhoCS/

    499

    locus0 10 90
    A_0 CTACGATAGAGAAATCACTCTTTTCTTCAGGGSTAGACTCACACGGCGGCGCAATTGTCACGAAAGTAAACCAATAGTCACGT
    B_0 CTACGATAGAGAAATCACTCTTTTCTTCAGGGGTAGACTCACACGGCGGCGCAATTGTCACGAAAGTAAACCAATAGTCACGT
    C_0 CTACGATAGAGAAATCACTCTTTTCTTCAGGGGTAGACTCACACGGCGGCGCAATTGTCACGAAAGTAAACCAATAGTCACGT
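For readers who want to inspect such a file programmatically, a minimal Python reader might look like this. It assumes the layout suggested by the example: a first line with the number of loci, then for each locus a header line "name nsamples length" followed by one "sample sequence" line per sample. The function name read_gphocs and the toy input are made up for illustration.

```python
def read_gphocs(text):
    """Parse G-PhoCS sequence text into (nloci, [(name, length, {sample: seq})])."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    nloci = int(lines[0])
    loci, i = [], 1
    while i < len(lines):
        # Locus header: name, number of samples, alignment length.
        name, nsamples, length = lines[i].split()
        block = lines[i + 1 : i + 1 + int(nsamples)]
        seqs = dict(ln.split() for ln in block)
        loci.append((name, int(length), seqs))
        i += 1 + int(nsamples)
    return nloci, loci


toy = """\
2

locus0 2 8
A_0 CTACGATA
B_0 CTACGATA

locus1 3 6
A_0 GGSTAG
B_0 GGGTAG
C_0 GGGTAG
"""
nloci, loci = read_gphocs(toy)
print(nloci, [name for name, _, _ in loci])
```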

STRUCTURE *.str & *.u.str

This is another SNP-based format that includes either all variable sites (*.str) or one randomly selected variable site per locus (*.u.str). These files are suitable input for the population structure analysis program STRUCTURE, as well as a few others. The output below is truncated for clarity.

    1A_0 3 3 0 2 2 1 2 2 2 2 3 3 0 1 3 1 3 0 ...
    1A_0 3 3 0 2 2 1 2 2 2 2 3 3 0 1 3 1 3 0 ...
    1B_0 3 3 0 2 2 1 2 2 2 2 3 3 0 1 3 1 3 0 ...
    1B_0 3 3 0 2 2 1 2 2 2 2 3 3 0 1 0 1 3 0 ...
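Each individual occupies two consecutive rows, one per allele copy. The following Python sketch pairs those rows up and lists the sites at which the two copies differ. It assumes the layout shown above, with a sample name followed only by integer allele codes and no extra label columns. The function name read_structure is hypothetical, and the toy input is shortened.

```python
def read_structure(text):
    """Pair up the two rows per individual: {name: [(allele1, allele2), ...]}."""
    rows = [ln.split() for ln in text.splitlines() if ln.strip()]
    genotypes = {}
    for first, second in zip(rows[0::2], rows[1::2]):
        if first[0] != second[0]:
            raise ValueError(f"rows for {first[0]} and {second[0]} are not paired")
        alleles = zip(map(int, first[1:]), map(int, second[1:]))
        genotypes[first[0]] = list(alleles)
    return genotypes


str_text = """\
1A_0 3 3 0 2 2 1
1A_0 3 3 0 2 2 1
1B_0 3 3 0 2 0 1
1B_0 3 3 0 2 3 1
"""
genotypes = read_structure(str_text)
# Sites where the two allele copies of an individual differ.
heterozygous = {name: [i for i, (a, b) in enumerate(calls) if a != b]
                for name, calls in genotypes.items()}
print(heterozygous)  # {'1A_0': [], '1B_0': [4]}
```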

NEXUS *.nex

This is a NEXUS-formatted data file that contains all of the loci from the .loci file concatenated into a supermatrix, printed in an interleaved format, with missing data for any sample filled in with N's and with data information prepended as a header. This format is used by BEAST, among other phylogenetic programs.

<TODO: Unimplemented>