By default ipyrad writes every output format it is capable of
generating. Converting between the various formats is very fast, but
if you want to save CPU time and disk space, you can enable
only specific output formats with the output_formats parameter.
VCF is a standard format for storing and manipulating sequence variation data. The format is too complicated to go into here, but you can see a good explanation on the `1000 Genomes Project <http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-40>`_ site. The VCF file written by ipyrad includes full genotype information for all bases in all loci, including information about genotype quality. Many useful conversions and filtering options for this format are available in the software vcftools.
    #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 1A_0 1B_0 1C_0 1D_0 2E_0 2F_0 2G_0 2H_0 3I_0 3J_0 3K_0 3L_0
    0 0 . G . 13 PASS NS=12;DP=235 GT:CATG 0/0:0,0,0,19 0/0:0,0,0,22 0/0:0,0,0,20 0/0:0,0,0,19 0/0:0,0,0,18 0/0:0,0,0,22 0/0:0,0,0,20 0/0:0,0,0,21 0/0:0,0,0,15 0/0:0,0,0,14 0/0:0,0,0,24 0/0:0,0,0,21
    0 1 . T . 13 PASS NS=12;DP=235 GT:CATG 0/0:0,0,19,0 0/0:0,0,22,0 0/0:0,0,20,0 0/0:0,0,19,0 0/0:0,0,18,0 0/0:0,0,22,0 0/0:0,0,20,0 0/0:0,0,21,0 0/0:0,0,15,0 0/0:0,0,14,0 0/0:0,0,24,0 0/0:0,0,21,0
    0 2 . T . 13 PASS NS=12;DP=235 GT:CATG 0/0:0,0,19,0 0/0:0,0,22,0 0/0:0,0,20,0 0/0:0,0,19,0 0/0:0,0,18,0 0/0:0,0,22,0 0/0:0,0,19,1 0/0:0,0,21,0 0/0:0,0,15,0 0/0:0,0,14,0 0/0:0,0,24,0 0/0:0,0,21,0
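The record layout above can be unpacked with a few lines of Python. This is a minimal sketch rather than anything shipped with ipyrad: the function name `parse_vcf_record` is hypothetical, fields are assumed to be whitespace-delimited, the FORMAT column is assumed to be exactly `GT:CATG`, and the four counts are assumed to be in C, A, T, G order (which is consistent with the example, where the REF base G carries the fourth count). The header and record are shortened to three samples for readability.

```python
def parse_vcf_record(header_line, record_line):
    """Return (fixed_fields, per_sample) for one ipyrad-style VCF record.

    Assumes FORMAT is GT:CATG and that counts are ordered C, A, T, G.
    """
    columns = header_line.lstrip("#").split()
    values = record_line.split()
    # The first nine columns are the fixed VCF fields (CHROM ... FORMAT).
    fixed = dict(zip(columns[:9], values[:9]))
    samples = {}
    for name, entry in zip(columns[9:], values[9:]):
        genotype, counts = entry.split(":")
        depth = dict(zip("CATG", (int(c) for c in counts.split(","))))
        samples[name] = {"GT": genotype, "CATG": depth}
    return fixed, samples


header = "#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 1A_0 1B_0 1C_0"
record = ("0 0 . G . 13 PASS NS=12;DP=235 GT:CATG "
          "0/0:0,0,0,19 0/0:0,0,0,22 0/0:0,0,0,20")
fixed, samples = parse_vcf_record(header, record)
print(fixed["REF"], samples["1B_0"]["CATG"]["G"])
```

For anything beyond quick inspection, a dedicated tool such as vcftools is likely the safer choice.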
The .loci file is a custom format that is easy to read, showing each individual
locus with variable sites indicated. Custom scripts can easily parse this file
to find loci with a given amount of taxon coverage or number of variable sites.
It is also the most easily readable file for checking that your analyses are
working properly. A (-) indicates a variable site, and a (*) indicates that the
site is phylogenetically informative. Integers enclosed by pipes (|) indicate
the locus number. Example:
    1A_0    GTTATCCGTAGCGATTATCACCTCAGTAAGATAAACCCATGGATAACGGGGGGGACAGCGCTAGATTGTTGGGTCGGACGCAGCTAGTC
    1B_0    GTTATCCGTAGCGATTATCACCTCAGTAAGATAAACCCATGGATAACGGGGGGGACAGCGCTAGATTGTTGGGTCGGACGCAGCTAGTC
    1C_0    GTTATCCGTAGCGATTATTACCTCAGTAAGATAAACCCATGGATAACGGGGGGGACAGCGCTAGATTGTTGGGTCGGACGCAGCTAGTC
    1D_0    GTTATCCGTAGCGATTATCACCTCAGTTAKATAAACCCATGGATAACGGGGGGGACAGCGCTAGATTGTTGGGGCGGACGCAGCTAGTC
    2E_0    GTTATCCGTAGCGATTATTACCTCAGTTAGATAAACCCATGGATAACGGGGGGGACAGCGCTAGATTGTTGGGTCGGACGCAGCTAGTC
    2F_0    GTTATCCGTAGCGATTATCACCTCAGTTAGATAAACCCATGGATAACGGGGGGGACAGCGCTAGATTGTTGGGTCGKACGCAGCTAGTC
    2G_0    GTTATCCGTAGCGATTATCACCTCAGTTAGATAAACCCATGGATAACGGGGGGGACAGCGCTAGATTGTTGGGTCGGACGCAGCTAGTC
    2H_0    GTTATCCGTAGCGATTATCACCTCAGTTAGATAAACCCATGGATAACGGGGGGGACAGCGCTAGATTGTTGGGTCGGACGCAGCTAGTC
    3I_0    GTTATCCGTAGCGATTATCACCTCAGTTAGATAAACCCATGGATAACGGGGGGGACAGCGCTAGATTGTTGGGTCGGACGCAGCTAGTC
    3J_0    GTTATCCGTAGCGATTATCACCTCAGTTAGATAAACCCATGGATAACGGGGSGGACAGCGCTAGATTGTTGGGTCGGACGCAGCTAGTC
    3K_0    GTTATCCGTAGCGATTATCACCTCAGTTAGATAAACCCATGGATAACGGGGGGGACAGCGCTAGATTGTTGGGTCGGACGCAGCTAGTC
    3L_0    GTTATCGGTAGCGATTATCACCTCAGTTAGATAAACCCATGGATAACGGGGGGGACAGCGCTAGATTGTTGGGTCGGACACAGCTAGTC
    // - * * - - - - - |0|
    1A_0    ACAGCTCTGTTACATGCATCTGTCCATACTCCCTGGTTCGCAATAATAGGGCTCCATATCAAGTGATCAGCTAGGCTTCGAGTCGTATC
    1B_0    ACAGCTCTGTTACATGCATCTGTCCATACTCCCTGGTTCGCAATAATAGGGCTCCATATCAAGTGATMAGCTAGGCTTCGAGTCGTATC
    1C_0    ACAGCTCTGTTACATGCATCTGTCCATACTCCCTGGTTCGCAATAATAGGGCTCCATATCAAGTGATCAGCTAGGCTTCGAGTCGTATC
    1D_0    ACAGCTCTGTTACATRCATCTGTCCATACTCCCTGGTTCGTAATCATAGGGCTCCATATCAAGTGATCAGCTAGGCTTCGAGTCGTATC
    2E_0    ACAGCTCTATTACATGCATCTGTCCATACTCCCTGGTTCGCAATCATAGGGCTCCATATCAAGTGATCAGCTAGGCTTCGAGTCGTATC
    2F_0    ACAGCTCTATTACATGCATCTGTCCATACTCCCTGGTTCGCAATCATAGGGCTCCATATCAAGTGATCAGCTAGGCTTCGAGTCGTATC
    2G_0    ACAGCTCTATTACATGCATCTGTCCATACTCCCTGGTTCGCAATCATAGGGCTCCATATCAAGTGATCAGCTAGGCTTCGAGTCGTATC
    2H_0    ACAGCTCTGTTACATGCATCTGTCCATACTCCCTGGTTCGCAATCATAGGGCTCCATATCAAGYGATCAGCTAGGCTTCGAGTCGTATS
    3I_0    ACAGCTCTGTTACATGCATCTGTCCATACTCCCTGGTTCGCAATCATAGGGCTCCATATCAAGTGATCAGCTAGGCTTCGAGTCGTATC
    3J_0    ACAGCTCTGTTACATGCATCTGTCCATACTCCCTGGTTCGCAATCATAGGGCTCCATATCAAGTGATCAGCTAGGCTTCGAGTCGTATC
    3K_0    ACAGCTCTGTTACATGCATCTGTCMATACTCCCTGGTTCGCAATCATAGGGCTCCATATCAAGTGATCAGCTAGGCTTCGAGTCGTATC
    3L_0    ACAGCTCTGTTACATGCATCTGTCCATACTCCCTGGTTCGCAATCATAGGGCTCCATATCAAGTGATCAGCTAGGCTTCGAGTCGTAYC
    // * - - - * - - --|1|
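As an illustration of the kind of custom script mentioned above, the following sketch splits .loci text into loci and counts samples and variable sites per locus. It is not part of ipyrad: the function name `parse_loci` and the toy alignment are made up for the example. It assumes each sample line is a name followed by a sequence, that each locus ends with a line starting with `//`, and that informative sites (*) also count as variable sites.

```python
def parse_loci(text):
    """Split .loci text into a list of dicts, one per locus."""
    loci, seqs = [], {}
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.startswith("//"):
            # Marker line: site markers, then the locus number between pipes.
            markers = line[2:line.index("|")]
            loci.append({
                "number": int(line.split("|")[1]),
                "seqs": seqs,
                "n_variable": markers.count("-") + markers.count("*"),
                "n_informative": markers.count("*"),
            })
            seqs = {}
        else:
            name, seq = line.split()
            seqs[name] = seq
    return loci


# Hypothetical toy data in the same layout as the example above.
example = """\
1A_0     GTTATCC
1B_0     GTTATCC
1C_0     GTTATTC
1D_0     GTTATTC
//            *  |0|
1A_0     ACAGCTK
1B_0     ACAGCTC
//             - |1|
"""

loci = parse_loci(example)
# Keep only loci that are present in at least three samples.
keep = [locus["number"] for locus in loci if len(locus["seqs"]) >= 3]
print(keep)
```

The same loop can be adapted to filter on `n_variable` or `n_informative` instead of sample coverage.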
For paired-end data, the two linked loci are shown separated by an 'nnnn' separator. Merged reads do not contain the 'nnnn' separator.
    1A0    GATAGCGGACGAAGCTTCCTGGATCAACATATCCGTTTGACAGTTTATATGTCAACAAGTAAGGAGCTGGACTGGGAGGTGCTATTAnnnnACTCTAAGATACCAAACCCTGTCCCAGCATTACGTCCCTGTATCCGACCT
    1B0    GATAGCGGACGAAGCTTCCTGGATCAACATATCCGTTTGACAGTTTATATGTCAACAAGTAAGGAGCTGGACTGGGAGGTGCTATTAnnnnACTCTAAGATACCAAACCCTGTCCCAGCATTACGTCCCTGTATCCGACCT
    1C0    GATAGCGGACGAAGCTTCCTGGATCAACATATCCGTTTGACAGTTTATATGTCAACAAGAAAGGAGCTGGACTGGGAGGTGCTATTGnnnnACTCTAAGATACCAAACCCTGTCCCAGCATTACGTCCCTGTATCCGACCT
    1D0    GATAGCGGACGAAGCTTCCTGGATCAACATATCCGTTTGACAGTTTATATGTCAACAAGTAAGGAGCTGGACTGGGAGGTGCTATTGnnnnACTCTAAGATACCAAACCCTGTCCCAGCATTACGTCCCTGTATCCGACCT
    2E0    GATAGCGGACGAAGCTTCCTGGATCAACATATCCGTTTGACAGTTTATATGTCAACAAGTAAGGAGCTGGACTGGGAGGTGCTATTSnnnnACTCTAAGATACCAAACCCTGTCCCAGCATTACGTCCCGGTATCCGACCT
    2F0    GATAGCGGACGAAGCTTCCTGGATCAACATATCCGTTTGACAGTTTATATGTCAACAAGTAAGGAGCTGGACTGGGAGGTGCTATTGnnnnACTCTAAGATACCAAACCCTGTCCCAGCATTACGTCCCGGTATCCGACCT
    2G0    GATAGCGGACGAAGCTTCCTGGATCAACATATCCGTTTGACAGTTTATATGTCAACAAGTAAGGAGCTGGACTGGGAGGTGCTATTGnnnnACTTTAAGATACCAAACCCTGTCCCAGCATTACGTCCCGGTATCCGACCT
    2H0    GATAGCGGACGAAGCTTCCTGGATCAACATATCCGTTTGACAGTTTATATGTCAACAAGTAAGGAGCTGGACTGGGAGGTGCTATTGnnnnACTCTAAGATACCAAACCCTGTCCCAGCATTACTTCCCGGTATCCGACCT
    3I0    GATAGCGGACGAAGCTTCCTGGATCAACATATCCGTTTGACAGTTTATATGTCAACAAGTAAGGAGCTGGACTGGGAGGTGCTATTGnnnnACTCTAAGATACCAAACCCTGACCCAGCATTACGTCCCTGTATCCGACCT
    3J0    GATAGCGGACGAAGCTTCCTGGATCAACATATCCGTTTGACAGTTTATATGTCAACAAGTAAGGAGCTGGACTGGGAGGTGCTATTGnnnnACTCTAAGATACCAAACCCTGACCCAGCATTACGTCCCTGTATCCGACCT
    3K0    GATAGCGGACGAAGCTTCCTGGATCAACATATCCGTTTGACAGTTTATATGTCAACAAGTAAGGAGCTGGACTGGGAGGTGCTATTGnnnnACTCTAAGATACCAAACCCTGACCCAGCATTACGTCCCTGTATCCGACCT
    3L0    GATAGCGGACGAAGCTTCCTGGATCAACATATCCGTTTGACAGTTTATATGTCAACAAGTAAGGAGCTGGACTGGGAGGTGCTATTGnnnnACTCTAAGAGACYAAACCCTGTCCCAGCATTACGTCCCTGTATCCGACCT
    // - * - - - * - * |0|
    1A0    GACAAATCTTACATTACAGTAATTGGTACTTATCACATACTAAGTTGTCAGAGACTTATTTGACAATATTCGGGGTCTTTGGCCATGnnnnGTAGTTAGGCTATTTTGCGCGTACCAAACGCAGGTGGAGGACCCAAGAAC
    1B0    GACAAATCTTAGATTACAGTAATTGGTACTTATCACATACTAAGTTGTCAGAGACTTATTTGACAATATTCGGGGTCTTTGGCCATGnnnnGTAGTTAGGCTATTTTGCGCGTACCAAACGCAGGTGGAGGACCCAAGAAC
    1C0    GACAAATCTTAGATTACAGTAATTGGTACTTATCACATACTAAGTTGTCAGAGACTAATTTGACAATATTCGGGGTCTTTGGCCATGnnnnGTAGTTAGGCTATTTTGCGCGTACCAAACGCAGGTGGAGGACCCAAGAAC
    1D0    GACAAATCTTAGTTTACAGTAATTGGTACTTATCACATACTAAGTTGTCAGAGACTTATTTGACAATATTCGGGGTCTTTGGCCATGnnnnGTAGTTAGGCTATTTTGCGCGAACCAAACGCAGGTGGAGGACCCAAGAAC
    2E0    GACAAATCTTAGATGACAGTAATTGGTACTTATCACATACTAAGTTGTCAGAGACTTATTTGACAATATTCGGGGTCTTTGGCCATGnnnnGTAGTTAGGCTATTTTGCGCGTACCAAACGCAGGTGGAGGACCCAAGAAC
    2F0    GACAAATCTTAGATGACAGTAATTGGTACTTATCACATACTAAGTTGTCAGAGACTTATTTGACAATATTCGGGGTCTTTGGCCATGnnnnGTAGTTAGGCTATTTTGCGCGTACCAAACGCAGGTGGAGGACCCAAGAAC
    2G0    GACAAATCTTAGATGACAGTAATTGGTACTTATCACATACTAAGTTGTCAGAGACTTATTTGACAATATTCGGGGTCTTTGGCCATGnnnnGTAGTTAGGCTATTTTGCGCGTACCAAACGCAGGTGGAGGACCCAAGAAC
    2H0    GACAAATCTTAGATGACAGTAATTGGTACTTATCACATACTAAGTTGTCAGAGACTTATTTGACAATATTCGGGGTCTTTGGCCATGnnnnGTAGTTAGGCTATTTTGCGCGTACCAAACGCAGGTGGAGGACCCAAGAAC
    3I0    GACAAATCTTAGATGACAGTAATTGGTACTTATCACATACTAAGTTGTCAGAGACTTATTTGACAATATTCGGGGTCTTTGGCCATGnnnnGTAGTTAGGCTATTTTGCGCGTACCAAACGCAGGTGGAGGACCCAAGAAC
    3J0    GACAAATCTCAGATGACAGTAATTGGTACTTATCACATACTAAGTTGTCAGAGACTTATTTGACAATATTCGGGGTCTTTGGCCATGnnnnGTAGTTAGGCTATTTTGCGCGTACCAAACGCAGGTGGAGGACCCAAGAAC
    3K0    GACAAATCTTAGATGACAGTAATTGGTACTTATCACATACTAAGTTGTCAGAGACTTATTTGACAATATTCGGGGTCTTTGGCCATGnnnnGTAGTTAGGCTATTTTGCGCGTACCAAACGCAGGTGGAGGACCCAAGAAC
    3L0    GACAAATCTTAGATGACAGTAATTGGTACTTATCACATACTAAGTTGTCAGAGACTTATTTGACAATATTCGGGGTCTTTGGCCATGnnnnGTAGTTAGGCTATTTTGCGCGTACCAAACGCAGGTGGAGGACCCAAGAAC
    // - -- * - - |1|
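When working with paired-end loci in a script, it can be convenient to recover the two reads on either side of the separator. A minimal sketch follows, assuming the separator is exactly the lowercase string 'nnnn' and occurs at most once per sequence; the helper name `split_paired` is hypothetical.

```python
def split_paired(seq, separator="nnnn"):
    """Return (read1, read2); read2 is None for merged reads without a separator."""
    if separator in seq:
        first, second = seq.split(separator, 1)
        return first, second
    return seq, None


print(split_paired("GCTATTAnnnnACTCTAAG"))
print(split_paired("GCTATTAACTCTAAG"))
```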
This is a PHYLIP-formatted data file that contains all of the loci from the .loci file concatenated into a supermatrix, with missing data for any sample filled in with N's. This format is used by RAxML, among other phylogenetic programs. The header here indicates that there are 12 samples and 89023 bases in each sequence. Because the sequences are so long, the output is truncated here for clarity (indicated by the ellipses).
    12 89023
    1A_0    GTTATCCGTAGCGATTATCACCTCAGTAAGATAAACCCATGGATAACGGGGGGGACAGCGCTAGATTGTTGGGTCGGACGCAGCTAGTCACAGCTCTGTTACATGCATCTGTCCATACTCCCTGGTTCGCAATAAT...
    1B_0    GTTATCCGTAGCGATTATCACCTCAGTAAGATAAACCCATGGATAACGGGGGGGACAGCGCTAGATTGTTGGGTCGGACGCAGCTAGTCACAGCTCTGTTACATGCATCTGTCCATACTCCCTGGTTCGCAATAAT...
    1C_0    GTTATCCGTAGCGATTATTACCTCAGTAAGATAAACCCATGGATAACGGGGGGGACAGCGCTAGATTGTTGGGTCGGACGCAGCTAGTCACAGCTCTGTTACATGCATCTGTCCATACTCCCTGGTTCGCAATAAT...
    1D_0    GTTATCCGTAGCGATTATCACCTCAGTTAKATAAACCCATGGATAACGGGGGGGACAGCGCTAGATTGTTGGGGCGGACGCAGCTAGTCACAGCTCTGTTACATRCATCTGTCCATACTCCCTGGTTCGTAATCAT...
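The header makes the file easy to sanity-check from a script. The sketch below reads a sequential (non-interleaved) PHYLIP text, verifies it against the header, and reports the fraction of N's per sample. The function names and the toy alignment are hypothetical, and it assumes sample names contain no whitespace.

```python
def read_phylip(text):
    """Parse sequential PHYLIP text into {name: sequence}, checking the header."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    ntaxa, nsites = (int(x) for x in lines[0].split())
    seqs = {}
    for line in lines[1:]:
        name, seq = line.split()
        seqs[name] = seq
    if len(seqs) != ntaxa:
        raise ValueError(f"header says {ntaxa} samples, found {len(seqs)}")
    for name, seq in seqs.items():
        if len(seq) != nsites:
            raise ValueError(f"{name}: expected {nsites} sites, found {len(seq)}")
    return seqs


def missing_fraction(seqs):
    """Fraction of sites filled with N for each sample."""
    return {name: seq.count("N") / len(seq) for name, seq in seqs.items()}


# Hypothetical toy alignment: 3 samples, 8 sites.
toy = """3 8
A    ACGTNNAC
B    ACGTACGT
C    NNNNACGT
"""
print(missing_fraction(read_phylip(toy)))
```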
Additionally, we provide two different PHYLIP-formatted files that
include only variable sites (SNPs). Paired loci are treated as a single
locus, meaning SNPs from the two reads are not separated in this file
(they're linked). The *.snps.phy file contains all SNPs from all
loci concatenated together, with missing values filled by N's. The
*.u.snps.phy file contains one SNP sampled from each locus. If a locus
contains multiple SNPs, the SNP site with the least missing data across
taxa is sampled; if several sites have equal amounts of missing data, one
of them is sampled at random. The header indicates that this file contains
12 samples and 990 bases per sample. The output below is truncated for clarity.
    12 990
    1A_0    GAATGACATCCTCAAACACCCTGGATACGGACAACGAAATTGCACTCATCAGACAAAGAAATTACWGAGGAACCCATGAGAGACCGCCTYCARYA...
    1B_0    GAAASRCATACTCAAACACCCTKGATACGGACAACGAAATTGCACTCATCAGACAAAGAAATTACAGAGGAACCCAAGAGAGACCGCCTTCAATA...
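The per-locus sampling rule described above (least missing data first, ties broken at random) can be sketched as follows. This illustrates the rule as stated in the text and is not ipyrad's own code. The function name `sample_unlinked` and the input layout (one list of SNP columns per locus, each column a string with one character per sample) are assumptions for the example, and only N is counted as missing.

```python
import random


def sample_unlinked(snp_columns_by_locus, rng=None):
    """Pick one SNP index per locus: fewest N's wins, ties are broken at random."""
    rng = rng or random.Random()
    chosen = []
    for columns in snp_columns_by_locus:
        fewest = min(col.count("N") for col in columns)
        best = [i for i, col in enumerate(columns) if col.count("N") == fewest]
        chosen.append(rng.choice(best))
    return chosen


# Hypothetical data: two loci, four samples per SNP column.
loci = [
    ["ACNA", "GGGT", "NNCC"],  # second column has no missing data
    ["TTNA"],                  # single SNP, chosen by default
]
print(sample_unlinked(loci))
```

Passing a seeded `random.Random` instance makes the tie-breaking reproducible.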
MAP/PARTITION (*.snps.map)
Because the concatenated SNPs file does not include information about which SNPs come from which locus, we provide a _map_ file with this information. The program _tetrad_ uses this file to randomly sample single SNPs from among loci.
    1 rad0_snp0 0 1
    1 rad0_snp1 0 2
    1 rad0_snp2 0 3
    1 rad0_snp3 0 4
    1 rad0_snp4 0 5
    2 rad1_snp0 0 6
    2 rad1_snp1 0 7
    2 rad1_snp2 0 8
    2 rad1_snp3 0 9
    3 rad2_snp0 0 10
    3 rad2_snp1 0 11
    3 rad2_snp2 0 12
    3 rad2_snp3 0 13
    3 rad2_snp4 0 14
    3 rad2_snp5 0 15
    3 rad2_snp6 0 16
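A map file like the one above can be used to group SNP columns by locus before sampling. The sketch below is only an illustration and does not reproduce what tetrad does internally. The column meanings are inferred from the example (first column: locus number; last column: 1-based position in the concatenated SNPs file), and the function names are hypothetical.

```python
import random
from collections import defaultdict


def read_snps_map(text):
    """Group SNP positions by locus number: {locus: [positions]}."""
    by_locus = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            continue
        locus, _name, _unused, position = line.split()
        by_locus[int(locus)].append(int(position))
    return dict(by_locus)


def sample_one_per_locus(by_locus, seed=None):
    """Draw one SNP position uniformly at random from each locus."""
    rng = random.Random(seed)
    return [rng.choice(positions) for _, positions in sorted(by_locus.items())]


snps_map = """\
1 rad0_snp0 0 1
1 rad0_snp1 0 2
2 rad1_snp0 0 6
2 rad1_snp1 0 7
"""
by_locus = read_snps_map(snps_map)
picks = sample_one_per_locus(by_locus, seed=42)
print(by_locus, picks)
```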
This is a SNP-based format. Each line corresponds to one SNP, with one column
per sample. The value in each sample column indicates the number of copies of
the reference allele that individual has; 9 indicates missing data. Below is
standard .geno output from the simulated data, so there are 12 columns, one per
sample. This format is used by EIGENSTRAT, SMARTPCA, and ADMIXTURE, among other
programs.
There is an additional *.u.geno file output that includes only unlinked
SNPs, with one SNP being randomly chosen per locus and the rest ignored.
    222222222220
    220202222222
    000222222222
    222122222222
    222222222122
    222022222222
    222221222222
    222222222220
    222200022222
    222122222222
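Because each character is a count of reference alleles, the file converts directly into numbers. The following sketch reads .geno text and computes the reference allele frequency per SNP. It assumes diploid samples (so each called genotype contributes two alleles) and treats 9 as missing, as described above; the function names are hypothetical.

```python
def read_geno(text, missing=9):
    """Return one list per SNP, with None in place of missing genotypes."""
    rows = []
    for line in text.split():
        rows.append([None if int(ch) == missing else int(ch) for ch in line])
    return rows


def reference_allele_frequency(row):
    """Reference allele frequency among called genotypes, assuming diploids."""
    called = [g for g in row if g is not None]
    if not called:
        return None
    return sum(called) / (2 * len(called))


geno = """\
222222222220
220202222222
"""
rows = read_geno(geno)
freqs = [reference_allele_frequency(row) for row in rows]
print(freqs)
```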
This is a full-sequence format that is very similar to the native ipyrad .loci format. It is appropriate for use with G-PhoCS, a program for Bayesian MCMC demographic inference: http://compgen.cshl.edu/GPhoCS/
    499
    locus0 10 90
    A_0    CTACGATAGAGAAATCACTCTTTTCTTCAGGGSTAGACTCACACGGCGGCGCAATTGTCACGAAAGTAAACCAATAGTCACGT
    B_0    CTACGATAGAGAAATCACTCTTTTCTTCAGGGGTAGACTCACACGGCGGCGCAATTGTCACGAAAGTAAACCAATAGTCACGT
    C_0    CTACGATAGAGAAATCACTCTTTTCTTCAGGGGTAGACTCACACGGCGGCGCAATTGTCACGAAAGTAAACCAATAGTCACGT
This is another SNP-based format that includes either all variable
sites (*.str) or one randomly selected variable site per locus
(*.u.str). These files are suitable input files for the population
structure analysis program STRUCTURE, as well as a few others. The output
below is truncated for clarity.
    1A_0 3 3 0 2 2 1 2 2 2 2 3 3 0 1 3 1 3 0 ...
    1A_0 3 3 0 2 2 1 2 2 2 2 3 3 0 1 3 1 3 0 ...
    1B_0 3 3 0 2 2 1 2 2 2 2 3 3 0 1 3 1 3 0 ...
    1B_0 3 3 0 2 2 1 2 2 2 2 3 3 0 1 0 1 3 0 ...
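In the example above each sample occupies two rows. Assuming those two rows hold the two alleles of a diploid individual, the sketch below pairs them up site by site and counts heterozygous sites. The function name `read_structure` and the toy input are hypothetical, and the rows are assumed to contain only a sample name followed by integer codes.

```python
from collections import defaultdict


def read_structure(text):
    """Return {sample: [(allele1, allele2), ...]} from two-row-per-sample text."""
    rows = defaultdict(list)
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        rows[fields[0]].append([int(x) for x in fields[1:]])
    # Transpose each sample's rows so that alleles at the same site are paired.
    return {name: list(zip(*alleles)) for name, alleles in rows.items()}


toy = """\
1A_0 3 3 0
1A_0 3 3 0
1B_0 3 3 0
1B_0 3 0 0
"""
genotypes = read_structure(toy)
n_het = {name: sum(a != b for a, b in sites) for name, sites in genotypes.items()}
print(n_het)
```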
This is a NEXUS-formatted data file that contains all of the loci from the .loci file concatenated into a supermatrix, printed in an interleaved format, with missing data for any sample filled in with N's and with information about the data placed at the beginning of the file. This format is used by BEAST, among other phylogenetic programs.
<TODO: Unimplemented>