Sub-sampling data sets

In this tutorial we show how to subsample both the number of taxa in an Assembly, and the amount of sequence data. Again, we use the 13 taxa Pedicularis data set from Eaton and Ree (2013) for our example.

Download the empirical example data set (Pedicularis)

These data are archived on the NCBI sequence read archive (SRA) under accession id SRP021469. For convenience, the data are also hosted at a public Dropbox link which is a bit easier to access. Run the code below to download and decompress the fastq data files, which will save them into a directory called example_empirical_data/. The compressed file size is approximately 1.1GB.

## download fastq data from the SRA database
>>> ipyrad --download SRP021469 example_empirical_data/

Setup a base params file

We start by using the -n argument to create a new named Assembly. I'll use the name base to indicate that this is the base assembly from which we will later create several branches.

>>> ipyrad -n "base"

New file 'params-base.txt' created in /home/deren/Downloads

The data come to us already demultiplexed so we are going to simply set the sorted_fastq_path to tell ipyrad the location of the data files, and also set a project_dir, which will group all of our analyses into a single directory. For the latter I use the name of our study organism, "pedicularis".

## Use your text editor to enter the following values: ## The wildcard () tells ipyrad to select all files ending in .gz pedicularis ## [1] [project_dir] ... example_empirical_rad/.gz ## [4] [sorted_fastq_path] ...

For now we'll leave the remaining parameters at their default values.

Load the fastq Sample data

When the data location is entered as a sorted_fastq_path step 1 simply counts the number of reads for each Sample and parses the file names to extract names for each Sample. For example, the file 29154_superba.fastq.gz will be assigned to Sample 29154_superba. Now, run step 1 (-s 1) and tell ipyrad to print the results when it is finished (-r).

>>> ipyrad -p params-base.txt -s 1 -r

--------------------------------------------------ipyrad [v.0.2.5] Interactive assembly and analysis of RADseq data --------------------------------------------------New Assembly: base ipyparallel setup: Local connection to 4 Engines

Step1: Linking sorted fastq data to Samples

Linking to demultiplexed fastq files in:: /home/deren/Downloads/example_empirical_rad/*.gz

13 new Samples created in 'base'. 13 fastq files linked to 13 new Samples.

Saving Assembly.

Summary stats of Assembly base

state reads_raw

29154_superba 1 696994 30556_thamno 1 1452316 30686_cyathophylla 1 1253109 32082_przewalskii 1 964244 33413_thamno 1 636625 33588_przewalskii 1 1002923 35236_rex 1 1803858 35855_rex 1 1409843 38362_rex 1 1391175 39618_rex 1 822263 40578_rex 1 1707942 41478_cyathophylloides 1 2199740 41954_cyathophylloides 1 2199613

Sub-sampling methods

Assembling this full data set takes around 3 hours on a 4-core laptop, which is actually pretty fast. However, for very large data sets you may be interested in running an even faster analysis by using just a subset of your data. This would allow you to more easily explore the affect of many different parameter settings on your results before running the full data set. Two forms of subsampling are possible: first, subsampling the number of Samples in your analysis, and second, subselecting the number of reads in your analysis.

Note

Importantly, no matter what you do in ipyrad, it will never delete or modify your original fastq data files. Assembly objects simply store information about Samples, and Samples simply contain statistics about data files. Samples can be discarded from an Assembly, in which case the Assembly loses some information, however, this does not delete any data files. Nevertheless, to retain Sample information ipyrad only allows Samples to be discarded during branching, so that Sample information is always retained in the parent branch. See the example below.

Subselecting samples: You can subselect Samples by creating a new branch. Here we call the new branch "sub4", and pass it a list of Sample names in addition to the new branch name. This does NOT delete any files (see above), but simply copies a subset of information from "base" to the new assembly "sub4". If you accidentally discarded the wrong Samples you could re-create "sub4" by simply branching "base" again with a different list of Samples.

## Create new branch of base Assembly named sub4 and pass it
## the names of four Samples (if no names it keeps all Samples)
## names should be separated by a comma and spaces are optional. 
## The '\' character simply continues our list across a line-break
## for easier viewing.

>>> ipyrad -p params-base.txt -b sub4 29154_superba 30556_thamno \
                                      30686_cyathophylla 32082_przewalskii

loading Assembly: base from saved path: ~/Documents/ipyrad/tests/cli/cli.json Sample name not found: 4 Creating a new branch called 'sub4' with 4 Samples Writing new params file to params-sub4.txt

## print stats for sub4 to confirm that Samples were discarded
>>> ipyrad -p params-sub4.txt -r

Summary stats of Assembly sub4 ------------------------------------------------state reads_raw 39618_rex 1 822263 40578_rex 1 1707942 41478_cyathophylloides 1 2199740 41954_cyathophylloides 1 2199613

Running step 2:

## run step2 
>>> ipyrad -p params-sub4.txt -s 2 -r

--------------------------------------------------ipyrad [v.0.2.5] Interactive assembly and analysis of RADseq data --------------------------------------------------loading Assembly: base from saved path: ~/Downloads/pedicularis/base.json ipyparallel setup: Local connection to 4 Engines

Step2: Filtering reads [####################] 100% processing reads | 0:02:48 Saving Assembly.

Summary stats of Assembly base

state reads_raw reads_filtered

29154_superba 2 696994 92448 30556_thamno 2 1452316 93666 30686_cyathophylla 2 1253109 89122 32082_przewalskii 2 964244 92016

Run step 3 (clustering and aligning)

This is generally one of the longest running steps, depending on how many unique clusters (loci) there are in each sample. Using more processors will allow it run much faster. On my laptop with 4 cores this step finishes in approximately 30 minutes. From the results you can see that there are many clusters found in each sample (clusters_total), but very few are recovered at high depth (clusters_hidepth). The coverage would of course be much better if we did not subsample the data set in step2. Also, this data set has fairly low coverage to begin with. We can either lower the mindepth setting to allow us to use more of this low depth data, or we can decide to go ahead with our mindepth setting (currently at the default of 6) and simply discard most of our data. I know, how about we create a branch so that we can do both!

## create a lowdepth branch
>>> ipyrad -p params-sub4.txt -b sub4-lowdepth.

loading Assembly: base from saved path: ~/Downloads/pedicularis/base.json Creating a branch of assembly base called sub4 Writing new params file to params-sub4.txt

## create a lowdepth branch
>>> ipyrad -p params-base.txt -s 3

--------------------------------------------------ipyrad [v.0.2.5] Interactive assembly and analysis of RADseq data --------------------------------------------------loading Assembly: base from saved path: ~/Downloads/pedicularis/base.json ipyparallel setup: Local connection to 4 Engines

Step3: Clustering/Mapping reads [####################] 100% dereplicating | 0:00:01 [####################] 100% clustering | 0:01:01 [####################] 100% chunking | 0:00:00 [####################] 100% aligning | 0:25:44 [####################] 100% concatenating | 0:00:05 Saving Assembly.

Use a text editor to enter the following new mindepth_majrule value in the file params-sub4-lowdepth.txt:

## 2 ## [mindepth_majrule] ...

Steps 4-5 (joint estimation of error rate & heterozygosity)

As you can see in the results the error rate is about 10X the heterozygosity estimate. The latter does not vary significantly across samples. With data of greater depth the estimates will be more accurate.

>>> ipyrad -p params-sub4.txt          -s 45 -r 
>>> ipyrad -p params-sub4-lowdepth.txt -s 45 -r

... add output here

Run step 5 (consensus base calls)

This is another step that can be computationally intensive. Here it takes about 15 minutes on 4 cores. Although many clusters are filtered out at this step (especially due to low depth) their information is retained for the VCF output later so that the coverage/depth of excluded reads can be examined.

ipyrad -p params-base.txt -s 5
ipyrad -p params-base.txt -r

-------------------------------------------------- ipyrad [v.0.1.70]

Interactive assembly and analysis of RADseq data

-------------------------------------------------- loading Assembly: base [~/Downloads/pedicularis/base.json]

ipyparallel setup: Local connection to 4 Engines

Step5: Consensus base calling: Diploid base calls and paralog filter (max haplos = 2) error rate (mean, std): 0.00703, 0.00331 heterozyg. (mean, std): 0.04071, 0.00396 Saving Assembly.

Summary stats of Assembly base

state reads_raw reads_filtered clusters_total

29154_superba 5 696994 92448 45531 30556_thamno 5 1452316 93666 45745 30686_cyathophylla 5 1253109 89122 50306 32082_przewalskii 5 964244 92016 44242 33413_thamno 5 636625 89428 52053 33588_przewalskii 5 1002923 92418 46674 35236_rex 5 1803858 92807 57801 35855_rex 5 1409843 92883 45139 38362_rex 5 1391175 93363 41580 39618_rex 5 822263 92096 47295 40578_rex 5 1707942 93386 45295 41478_cyathophylloides 5 2199740 93846 41965 41954_cyathophylloides 5 2199613 91756 47735

clusters_hidepth hetero_est error_est reads_consens

29154_superba 978 0.038530 0.006630 821 30556_thamno 987 0.038266 0.006009 810 30686_cyathophylla 757 0.044680 0.004627 606 32082_przewalskii 686 0.046796 0.007077 523 33413_thamno 728 0.041466 0.004528 597 33588_przewalskii 904 0.041445 0.011253 709 35236_rex 767 0.042423 0.005119 629 35855_rex 1106 0.035123 0.012086 844 38362_rex 1140 0.041206 0.004702 943 39618_rex 1258 0.040696 0.009077 1011 40578_rex 832 0.045177 0.002789 689 41478_cyathophylloides 992 0.041085 0.004468 872 41954_cyathophylloides 1307 0.032387 0.013090 983

Full stats files

step 1: None step 2: ./pedicularis/base_edits/s2_rawedit_stats.txt step 3: ./pedicularis/base_clust_0.85/s3_cluster_stats.txt step 4: ./pedicularis/base_clust_0.85/s4_joint_estimate.txt step 5: ./pedicularis/base_consens/s5_consens_stats.txt step 6: None step 7: None

Step 6 (clustering and aligning across samples)

This step clusters consensus loci across Samples using the same threshold for sequence similarity as used in step3.

ipyrad -p params-base.txt -s 6

-------------------------------------------------- ipyrad [v.0.1.70]

Interactive assembly and analysis of RADseq data

-------------------------------------------------- loading Assembly: base [~/Downloads/pedicularis/base.json]

ipyparallel setup: Local connection to 4 Engines

Step6: Clustering across 13 samples at 0.85 similarity: Saving Assembly.

Branch the assembly

Here we will branch the assembly to create different assemblies that we will use as our final outputs. The main parameter we will focus on is the min_samples_locus, which is the minimum number of samples that must have data at a locus for the locus to be retained in the data set. We create a min4, min8, and min12 data sets.

ipyrad -p params-base.txt -b min4
ipyrad -p params-base.txt -b min8
ipyrad -p params-base.txt -b min12

loading Assembly: base [~/Downloads/pedicularis/base.json] Creating a branch of assembly base called min4 Writing new params file to params-min4.txt

loading Assembly: base [~/Downloads/pedicularis/base.json] Creating a branch of assembly base called min8 Writing new params file to params-min8.txt

loading Assembly: base [~/Downloads/pedicularis/base.json] Creating a branch of assembly base called min12 Writing new params file to params-min12.txt

Change the parameter settings in params.txt for each assembly

## Enter the changes below into the params files using a text editor

## in the file params-min4.txt 4 ## [21] [min_samples_locus] ...

## in the file params-min8.txt 8 ## [21] [min_samples_locus] ...

## in the file params-min12.txt 12 ## [21] [min_samples_locus] ...

Step 7 (final filtering and create output files)

Filter and create output files for the three assemblies with different values for the parameter min_samples_locus.

ipyrad -p params-min4.txt -s 7 
ipyrad -p params-min8.txt -s 7 
ipyrad -p params-min12.txt -s 7

-------------------------------------------------- ipyrad [v.0.1.70]

Interactive assembly and analysis of RADseq data

-------------------------------------------------- loading Assembly: min4 [~/Downloads/pedicularis/min4.json]

ipyparallel setup: Local connection to 4 Engines

Step7: Filter and write output files for 13 Samples.: Outfiles written to: ~/Downloads/pedicularis/min4_outfiles Saving Assembly.

-------------------------------------------------- ipyrad [v.0.1.70]

Interactive assembly and analysis of RADseq data

-------------------------------------------------- loading Assembly: min8 [~/Downloads/pedicularis/min8.json]

ipyparallel setup: Local connection to 4 Engines

Step7: Filter and write output files for 13 Samples.: Outfiles written to: ~/Downloads/pedicularis/min8_outfiles Saving Assembly.

-------------------------------------------------- ipyrad [v.0.1.70]

Interactive assembly and analysis of RADseq data

-------------------------------------------------- loading Assembly: min12 [~/Downloads/pedicularis/min12.json]

ipyparallel setup: Local connection to 4 Engines

Step7: Filter and write output files for 13 Samples.: Outfiles written to: ~/Downloads/pedicularis/min12_outfiles Saving Assembly.

Take a look at the stats summary

Each assembly that finishes step 7 will create a stats.txt output summary in the 'assembly_name'_outfiles/ directory. This includes information about which filters removed data from the assembly, how many loci were recovered per sample, how many samples had data for each locus, and how many variable sites are in the assembled data.

cat ./pedicularis/min4_outfiles/min4_stats.txt

## The number of loci caught by each filter. ## ipyrad API location: [assembly].statsfiles.s7_filters

locus_filtering

total_prefiltered_loci 1206 filtered_by_rm_duplicates 2 filtered_by_max_indels 159 filtered_by_max_snps 0 filtered_by_max_hetero 15 filtered_by_min_sample 921 filtered_by_edge_trim 0 total_filtered_loci 221

## The number of loci recovered for each Sample. ## ipyrad API location: [assembly].stats_dfs.s7_samples

sample_coverage

29154_superba 151 30556_thamno 120 30686_cyathophylla 118 32082_przewalskii 132 33413_thamno 155 33588_przewalskii 89 35236_rex 154 35855_rex 145 38362_rex 154 39618_rex 156 40578_rex 90 41478_cyathophylloides 156 41954_cyathophylloides 97

## The number of loci for which N taxa have data. ## ipyrad API location: [assembly].stats_dfs.s7_loci

locus_coverage sum_coverage

1 NaN 0 2 NaN 0 3 NaN 0 4 65 65 5 36 101 6 16 117 7 10 127 8 10 137 9 6 143 10 2 145 11 12 157 12 7 164 13 57 221

## The distribution of SNPs (var and pis) across loci. ## pis = parsimony informative site (minor allele in >1 sample) ## var = all variable sites (pis + autapomorphies) ## ipyrad API location: [assembly].stats_dfs.s7_snps

var sum_var pis sum_pis

0 260 260 1140 1140 1 130 390 45 1185 2 130 520 12 1197 3 133 653 3 1200 4 90 743 4 1204 5 110 853 0 1204 6 77 930 0 1204 7 70 1000 2 1206 8 64 1064 0 1206 9 32 1096 0 1206 10 31 1127 0 1206 11 30 1157 0 1206 12 16 1173 0 1206 13 14 1187 0 1206 14 5 1192 0 1206 15 6 1198 0 1206 16 2 1200 0 1206 17 4 1204 0 1206 18 2 1206 0 1206

Take a peek at the .loci (easily human-readable) output

This is one fo the first places I usually look when an assembly finishes. It provides a clean view of the data with variable sites (-) and parsimony informative SNPs () highlighted. Use the unix commandsless* or head to look at this file briefly.

## head -n 50 prints just the first 50 lines of the file to stdout
head -n 50 pedicularis/min4_outfiles/min4.loci

29154_superba CCTTGGTSACCTTMGCWCCWGAYGGRTCCTTCTTCTCCACACTCTTKATRACACCAACAGCAACAGTC 32082_przewalskii CCTTGGTSACCTTRGCWCCWGAYGG-TCCTTCTTCTCCACACTCTTGATRACACCAACAGCAACAGTC 35236_rex CCTTGGTCACCTTAGCACCTGATGG-TCCTTCTTCTCCACACTCTTGATGACACCAACAGCAACAGTC 38362_rex CCTTGGTCACCTTAGCACCTGATGGRTCCTTCTTCTCCACACTCTTGATGACACCAACAGCAAC-GTC // - - - - - - - - | 33413_thamno TAGACAACCAGTGCCTTCTTGTCTATCAGTCTCACACCTGTCTTCGGTACTTGCGGTACTTAGAAGCA 33588_przewalskii GAGACAACCAGTGCCTTCTTGTCTATCAGCCTCACACCTGTCTTCGGTACTTTCGGTACTTAGAAGCA 35855_rex TAGACAACCAGTGCCGTCTTGTCTATCAGTCTCACACCTGTCTTCGGTACTTGCGGTACTTAGAAGCA 38362_rex TAGACAACCAGTGCCTTCTTGTCTATCAGTCTCACACCTGTCTTCGGTACTTGCGGTACTTAGAAGCA 39618_rex TAGACAACCAGTGCCTTCTTGTCTATCAGTCTCACACCTGTCTTCGGTACTTGCGGTACTTAGAAGCA 40578_rex TAGACAACCAGTGCCTTCTTGTCTATCAGTCTCACACCTGTCTTCGGTACTTGCGGTACTTAGAAGCA 41478_cyathophylloides TAGACAACCAGTGCCGTCTTGTCTATCAGTCTCACACCTGTCTTCGGTACTTGCGGTACTTAGAAGCA 41954_cyathophylloides TAGACAACCAGTGCCGTCTTGTCTATCAGTCTCACACCTGTCTTCGGTACTTGCGGTACTTAGAAGCA // - * - - | 29154_superba AGCAAGCGAAGAAAACGTAAGGGCGCGCGTTAGCACTCCTGCAAGAAAACGGC-CTAGCTAACGCGCCC 30556_thamno AGCAAGCGAAGAAAACGTAAGGGCGCGCGTTAGCACTCCTGCAAGAAAACGGC-CTAGCTAACGCGCCC 30686_cyathophylla AGCAAGCGAAGAAAACGTAAGGGCGCGCGTTAGCACTCCTGCAAGAAAACGGC-CTAGCTAACGCGCCC 32082_przewalskii AGCAAGCGAAGAAAACGTAAGGGCGCGCGTTAGCACTCCTGCAAGAAAACGGC-CTAGCTAACGCGCCC 33413_thamno AGCAAGCGAAGAAAACGTAAGGGCGCGCGTTAGCACTCCTGCAAGAAAACGGC-CTAGCTAACGCGCCC 33588_przewalskii AGCAAGCGAAGAAAACGTAAGGGCGCGCGTTAGCACTCCTGCAAGAAAACGGC-CTAGCTAACGCGCCC 35236_rex AGCAAGCGAAGAAAACGTAAGGGCGCGCGTTAGCACTCCTGCAAGAAAACGGC-CTAGCTAACGCGCCC 35855_rex AGCAAGCGAAGAAAACGTAAGGGCGCGCGTTAGCACTCCTGCAAGAAAACGGC-CTAGCTAACGCGCCC 38362_rex AGCAAGCGAAGAAAACGTAAGGGCGCGCGTTAGCACTCCTGCAAGAAAACGGC-CTAGCTAACGCGCCC 39618_rex AGCAAGCGAAGAAAACGTAAGGGCGCGCGTTAGCACTCCTGCAAGAAAACGGC-CTAGCTAACGCGCCC 40578_rex AGCAAGCGAAGAAAACGTAAGGGCGCGCGTTAGCACTCCTGCAAGAAAACGGC-CTAGCTAACGCGCCC 41478_cyathophylloides AGCAAGCGAAGAAAACGTAAGGGCGCGCGTTAGCACTCCTGCAAGAAAACGGC-CTAGCTAACGCGCCC 41954_cyathophylloides AGCAAGCGAAGAAAACGTAAGGGCGCGCGTTAGCACTCCTGCAAGAAAACGGCNCTAGCTAACGCGCCC // | 30686_cyathophylla TAGCAATAAATGCAAGAATATTTACTTCCATAATTTCGTCGGTTTTTTAATTCGCAATAACTCGGGAT 32082_przewalskii TAGCAATAAATGCAAGAATATTGACTTCCATAATTTCGTCGGTTTTTTAATTCGCAATAACTCGGGAT 33588_przewalskii TAGCAATAAATGCAAGAATATTGACTTCCATAATTTCGTCGGTTTTTTAATTCGCAATAACTCGGGAT 35236_rex TAGCAATAAATGCAAGAATATTKACTTCCATAATTTCGTCKGTTTTTTAATTCGCAATAACTCGGGAT 38362_rex TAGCAATAAATGCAAGAATATTTACTTCCATAATTTCGTCTGTTTTTTAATTCGCAATAACTCGGGAT 39618_rex TAGCAATAAATGCAAGAATATTTACTTCCATAATTTCGTCTGTTTTTTAATTCGCAATAACTCGGGAT 40578_rex TAGCAATAAATGCAAGAATATTTACTTCCATAATTTCGTCTGTTTTTTAATTCGCAATAACTCGGGAT 41478_cyathophylloides TAGCAATAAATGCAAGAATATTT-CTTCCATAATTTCGTCGGTTTTTTAATTCGCAATAACTCGGGAT // * * | 35236_rex CTCTAGGTGGAGCTCCAGCTGGGTCTGAACCAGATCCTCCGTAAKCGGATCATCATGTGCGAGTTGAC 35855_rex CTCTAGGTGGAGCTCCAGCTGGGTCTGAACCAGATCCTCCGTAAGCGGATCATCATGTGCGAGTGGAC 38362_rex CTCTAGGTGGAGCTCCAGCTGGGTCTGAACCAGATCCTCCGTAAGCGGATCATCATGTGCGAGTTGAC 39618_rex CTCTAGGTGGAGCTCCAGCTGGGTCTGAACCAGATCCTCCGTAAGCGGATCATCATGTGCGAGTTGAC // - - | 30556_thamno C-TTCTGATTAATCTG-AAATTGTAATCAAATGAAATYAAACAGCAAAAACAATGACTSGATAAACTA 33413_thamno CTTTCTGWTTAATCTGMAAATTGTAATCAAATGAAATCAAACARCAAAAACAATGACTYGAYAAWCYR 35236_rex C-TTCTGATTAATCTG-AAATTGTAATCAAATGAAATCAAACA-CAAAAACAATGACT-GATAAACTA 41478_cyathophylloides CTTTCTGATTAATCTGCAAATTGTAATCAAATGAAATCAAACAGCAAAAACAATAACTTGATAAAATA // - - - - - - - ----| 30556_thamno GAAAGATWT-AYTGTAGACGTAWTKGATCRSAGGWKGAGGTGATGWATCATAWTCAT-ATCAGAGGAG 38362_rex GAAAGATTTCACTGTAGACGTAATGGATCAGAGGTTGAGGTGATGRATCATAATCATGATCAGAGGWG 39618_rex GAAAGATTTCACTGTAGACGTAWWGGATCMSAGGWKGAGGTGATGRATCATAATCATKATCAGAGGAG

peek at the .phy files

This is the concatenated sequence file of all loci in the data set. It is typically used in phylogenetic analyses, like in the program raxml.

## cut -c 1-80 prints only the first 80 characters of the file
cut -c 1-80 pedicularis/min4_outfiles/min4.phy

13 15034 29154_superba CCTTGGTSACCTTMGCWCCWGAYGGRTCCTTCTTCTCCACACTCTTKATRACA 30556_thamno NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN 30686_cyathophylla NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN 32082_przewalskii CCTTGGTSACCTTRGCWCCWGAYGGNTCCTTCTTCTCCACACTCTTGATRACA 33413_thamno NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN 33588_przewalskii NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN 35236_rex CCTTGGTCACCTTAGCACCTGATGGNTCCTTCTTCTCCACACTCTTGATGACA 35855_rex NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN 38362_rex CCTTGGTCACCTTAGCACCTGATGGRTCCTTCTTCTCCACACTCTTGATGACA 39618_rex NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN 40578_rex NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN 41478_cyathophylloides NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN 41954_cyathophylloides NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

peek at the .snp file

This is similar to the phylip file format, but only variable site columns are included. All SNPs are the file, in contrast to the .usnps file, which selects only a single SNP per locus.

## cut -c 1-80 prints only the first 80 characters of the file
cut -c 1-80 pedicularis/min4_outfiles/min4.snp

13 711 29154_superba SMWWYRKRNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGYATNYGAMGT 30556_thamno NNNNNNNNNNNNNNNNA-YGGSTACTACWYWTKRSWKWW-TAGTAT-NNNNGT 30686_cyathophylla NNNNNNNNNNNNTGNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNCRACTT 32082_przewalskii SRWWY-GRNNNNGGNNNNNNNNNNNNNNNNNNNNNNNNNNNNGTATNNNNNGG 33413_thamno NNNNNNNNTTTGNNNNWMCRGYYWCYRYNNNNNNNNNNNNNNNNNNNNNNNGT 33588_przewalskii NNNNNNNNGTCTGGNNNNNNNNNNNNNNNNNNNNNNNNNNNNRTRKNNNNNGG 35236_rex CAATT-GGNNNNKKKTA-C-G-TACTACNNNNNNNNNNNNNNG-ATMNNNNGT 35855_rex NNNNNNNNTGTGNNGGNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNCGRCGT 38362_rex CAATTRGGTTTGTTGTNNNNNNNNNNNNTCATGAGTTRAGTWNNNNMCGACGT 39618_rex NNNNNNNNTTTGTTGTNNNNNNNNNNNNTCWWGMSWKRAKTAGTATMCGRCGT 40578_rex NNNNNNNNTTTGTTNNNNNNNNNNNNNNNNNNNNNNNNNNNNGTATNNNNNGT 41478_cyathophylloides NNNNNNNNTGTGTGNNACCGATTAATACWKATKRSWKRAGYAGTATNCRACGT 41954_cyathophylloides NNNNNNNNTGTGNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGT

peek at the .snp file for the min12 assembly

Similar to above but you can see that there is much less missing data (Ns).

## cut -c 1-80 prints only the first 80 characters of the file
cut -c 1-80 pedicularis/min12_outfiles/min12.snp

13 44 29154_superba GTTGACGCGGTTCTCCCCCGCGCAGTGCAGTCCGCTAANNAGTT 30556_thamno GTTGACGCGGTTCTCACCCGCGCAGCGCGGTACACTAAGAAATT 30686_cyathophylla TTTGACGCGGTTCTCACCCGCGCAGCGCGGTCCACTAAGAAATT 32082_przewalskii GGTGATGCAACCCCTACCCGCGTGGCGTARYCAGTTARGAGACC 33413_thamno GTTGACGCGGTTTTCACCCGCGCAGCGCGGTACACTAAGAAATT 33588_przewalskii GGTGATGCAACCCCTACCCGCGTG-TC-ARYCAGTTARGAGACC 35236_rex GTTGACGCGGTTCTCACCCGCGCAGCGCGRYACACTWAKMAATT 35855_rex GTKRMYKMGGTTCTCACCCGCGCAGCGCGGTACACTAAGANNNT 38362_rex GTTGACGCGGTTCTCACCCGCGCAGCGCGGTCCACTAAGAAATT 39618_rex GTTGACGCGGTTCTCAYYYRYRCAGCGCGGTCCACTAAGAAATT 40578_rex GTTGACGCGGTTCTCACCCGCGCAGCGCGGTACACTAAGAAATT 41478_cyathophylloides GTTGACGCGGTTCTCACCCGCGCAATGCAGTCCGCCAAGAAATT 41954_cyathophylloides GTTGACGCGGTTCTCACCCGCGCAATGCAGTCCGCCAAGAAATT

downstream analyses

...

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

pedicularis_cli.rst

pedicularis_cli.rst

Sub-sampling data sets

Download the empirical example data set (Pedicularis)

Setup a base params file

Load the fastq Sample data

Summary stats of Assembly base

Sub-sampling methods

Summary stats of Assembly base

Run step 3 (clustering and aligning)

Steps 4-5 (joint estimation of error rate & heterozygosity)

Run step 5 (consensus base calls)

Summary stats of Assembly base

Full stats files

Step 6 (clustering and aligning across samples)

Branch the assembly

Change the parameter settings in params.txt for each assembly

Step 7 (final filtering and create output files)

Take a look at the stats summary

Take a peek at the .loci (easily human-readable) output

peek at the .phy files

peek at the .snp file

peek at the .snp file for the min12 assembly

downstream analyses

Files

pedicularis_cli.rst

Latest commit

History

pedicularis_cli.rst

File metadata and controls

Sub-sampling data sets

Download the empirical example data set (Pedicularis)

Setup a base params file

Load the fastq Sample data

Summary stats of Assembly base

Sub-sampling methods

Summary stats of Assembly base

Run step 3 (clustering and aligning)

Steps 4-5 (joint estimation of error rate & heterozygosity)

Run step 5 (consensus base calls)

Summary stats of Assembly base

Full stats files

Step 6 (clustering and aligning across samples)

Branch the assembly

Change the parameter settings in params.txt for each assembly

Step 7 (final filtering and create output files)

Take a look at the stats summary

Take a peek at the .loci (easily human-readable) output

peek at the .phy files

peek at the .snp file

peek at the .snp file for the min12 assembly

downstream analyses