Skip to content

Latest commit

 

History

History
316 lines (227 loc) · 11.4 KB

outline.rst

File metadata and controls

316 lines (227 loc) · 11.4 KB

Assembly Steps and Branching

The typical workflow to move from fastq formatted input data to assembled output files in ipyrad involves seven sequential steps <seven_steps>. The reason we separate assembly into distinct steps is to create a modular workflow that can be easily restarted if interrupted, and can be easily branched<branching_workflow> at different points to create assemblies under different combinations of parameter settings.

Basic workflow

The simplest use of ipyrad is to assemble a data set under a single set of parameters defined in a params file. Step 1 assigns data to each of the Samples, Steps 2-5 process data for each Sample, Step 6 clusters data across Samples, and Step 7 filters these data and formats them for downstream analyses.

image

The code to run a basic workflow is quite simple:

## create an initial Assembly params file
>>> ipyrad -n data1 

## fill in the new params file with your text editor
## ... editing params-data1.txt

## run steps 1-7 with the params file for this Assembly
>>> ipyrad -p params-data1.txt -s 1234567

-------------------A more effective way to use ipyrad, however, is to create branching assemblies in which multiple data sets are assembled under different parameter settings. The schematic below shows an example where an assembly is branched at step3. The new branch will inherit file paths and statistics from the first Assembly, but can then apply different parameters going forward. Branching does not create hard copies of existing data files, and so is not an "expensive" action in terms of disk space or time. We suggest it be used quite liberally whenever applying a new set of parameters.

image

The code to run a branching workflow is only a bit more complex than the basic workflow. You can find more branching examples in the advanced tutorial<tutorial_advanced_cli> and cookbook<cookbook> sections.

## create an initial Assembly and params file, here called 'data1'
>>> ipyrad -n data1 

## edit the params file for data1 with your text editor
## ... editing params-data1.txt

## run steps 1-2 with the params file
>>> ipyrad -p params-data1.txt -s 12

## create a new branch of 'data1' before step3, here called 'data2'.
>>> ipyrad -p params-data1.txt -b data2

## edit the params file for data2 using a text editor
## ... editing params-data2.txt

## run steps 3-7 for both assemblies
>>> ipyrad -p params-data1.txt -s 34567
>>> ipyrad -p params-data2.txt -s 34567

Add or drop samples by branching

Another point where branching is useful is for adding or dropping samples from an assembly, either to analyze a subset of samples separately from others, or to exclude samples with low coverage. The branching and merging fuctions in ipyrad make this easy. By requiring a branching process in order to drop samples from an assembly ipyrad inherently forces you to retain the parent assembly as a copy. This provides a nice fail safe so that you can mess around with your new branched assembly without affecting it's pre-branched parent assembly.

Examples using the ipyrad CLI

## branch and only keep 3 samples from assembly data1
>>> ipyrad -p params-data1.txt -b data2 1A0 1B0 1C0

## and/or, branch and only exclude 3 samples from assembly data1
>>> ipyrad -p params-data1.txt -b data3 - 1A0 1B0 1C0

Examples using the ipyrad Python API

## branch and only keep 3 samples from assembly data1
>>> data1.branch("data2", subsamples=["1A0", "1B0", "1C0"])

## and/or, branch and only exclude 3 samples from assembly data1
>>> keep_list = [i for i in data1.samples.keys() if i not in ["1A0", "1B0", "1C0"]]
>>> data1.branch("data3", subsamples=keep_list)

Seven Steps

1. Demultiplexing / Loading fastq files

Step 1 loads sequence data into a named Assembly<Assembly> and assigns reads to Samples<Samples> (individuals). If the data are not yet demultiplexed then step 1 uses information from a barcodes file<barcodes_file> to demultiplex the data, otherwise, it simply reads the data for each Sample.

The following parameters<parameters> are potentially used or required (*) for step1:

  • *assembly_name<assembly_name>
  • *project_dir<project_dir>
  • raw_fastq_path<raw_fastq_path>
  • barcodes_path<barcodes_path>
  • sorted_fastq_path<sorted_fastq_path>
  • *datatype<datatype>
  • restriction_overhang<restriction_overhang>
  • max_barcode_mismatch<max_barcode_mismatch>

2. Filtering / Editing reads

Step 2 uses the quality score recorded in the fastQ data files to filter low quality base calls. Sites with a score below a set value are changed into “N”s, and reads with more than the number of allowed “N”s are discarded. The threshold for inclusion is set with the phred_Qscore_offset<phred_Qscore_offset> parameter. An optional filter can be applied to remove adapters/primers (see filter_adapters<filter_adapters>), and there is an optional filter to clean up the edges of poor quality reads (see edit_cutsites<edit_cutsites>).

The following parameters<parameters> are potentially used or required (*) for step2:

  • *assembly_name<assembly_name>
  • *project_dir<project_dir>
  • barcodes_path<barcodes_path>
  • *datatype<datatype>
  • restriction_overhang<restriction_overhang>
  • max_low_qual_bases<max_low_qual_bases>
  • filter_adapters<filter_adapters>
  • filter_min_trim_len<filter_min_trim_len>
  • edit_cut_sites<edit_cut_sites>

3. Clustering / Mapping reads within Samples and alignment

Step 3 first dereplicates the sequences from step 2, recording the number of times each unique read is observed. If the data are paired-end, it then uses vsearch to merge paired reads which overlap. The resulting data are then either de novo clustered (using vsearch) or mapped to a reference genome (using smalt and bedtools), depending on the selected assembly method. In either case, reads are matched together on the basis of sequence similarity and the resulting clusters are aligned using muscle.

The following parameters<parameters> are potentially used or required (*) for step3:

  • *assembly_name<assembly_name>
  • *project_dir<project_dir>
  • *assembly_method<assembly_method>
  • *datatype<datatype>
  • *clust_threshold<clust_threshold>

4. Joint estimation of heterozygosity and error rate

Step4 jointly estimates sequencing error rate and heterozygosity based on counts of site patterns across clustered reads. These estimates are used in step5 for consensus base calling. If the max_alleles_consens is set to 1 (haploid) then heterozygosity is fixed to 0 and only error rate is estimated. For all other settings of max_alleles_consens a diploid model is used (i.e., two alleles are expected to occur equally).

The following parameters<parameters> are potentially used or required (*) for step4:

  • *assembly_name<assembly_name>
  • *project_dir<project_dir>
  • *datatype<datatype>
  • *restriction_overhang<restriction_overhang>
  • *max_alleles_consens<max_alleles_consens>

5. Consensus base calling and filtering

Step5 estimates consensus allele sequences from clustered reads given the estimated parameters from step 4 and a binomial model. During this step we filter for maximum number of undetermined sites (Ns) per locus (max_Ns_consens). The number of alleles at each locus is recorded, but a filter for max_alleles is not applied until step7. Read depth information is also stored at this step for the VCF output in step7.

The following parameters<parameters> are potentially used or required (*) for step5:

  • *assembly_name<assembly_name>
  • *project_dir<project_dir>
  • *datatype<datatype>
  • *max_Ns_consens<max_Ns_consens>

6. Clustering / Mapping reads among Samples and alignment

Step6 clusters consensus sequences across Samples using the same assembly method as in step 3. One allele is randomly sampled before clustering so that ambiguous characters have a lesser effect on clustering, but the resulting data retain information for heterozygotes. The clustered sequences are then aligned using muscle.

The following parameters<parameters> are potentially used or required (*) for step6:

  • *assembly_name<assembly_name>
  • *project_dir<project_dir>
  • *datatype<datatype>

7. Filtering and formatting output files

Step7 applies filters to the final alignments and saves the final data in a number of possible output formats<full_output_formats>. This step is most often repeated at several different settings for the parameter min_samples_locus to create different assemblies with different proportions of missing data (see branching_workflow).

The following parameters<parameters> are potentially used or required (*) for step7:

  • *assembly_name<assembly_name>
  • *project_dir<project_dir>
  • *datatype<datatype>
  • min_samples_locus<min_samples_locus>
  • max_Indels_locus<max_Indels_locus>
  • max_shared_Hs_locus<max_shared_Hs_locus>
  • max_alleles_consens<max_alleles_consens>
  • trim_overhang<trim_overhang>
  • output_formats<output_formats>
  • pop_assign_file<pop_assign_file>

Example CLI branching workflow

## create a params.txt file and rename it data1, and then use a text editor
## to edit the parameter settings in data1-params.txt
ipyrad -n data1

## run steps 1-2 using the default settings
ipyrad -p params-data1.txt -s 12

## branch to create a 'copy' of this assembly named data2
ipyrad -p params-data1.txt -b data2

## edit data2-params.txt to a different parameter settings in a text editor,
## for example, change the clustering threshold from 0.85 to 0.90

## now run the remaining steps (3-7) on each data set
ipyrad -p params-data1.txt -s 34567
ipyrad -p params-data2.txt -s 34567

Example Python API branching workflow

## import ipyrad 
import ipyrad as ip

## create an Assembly and modify some parameter settings
data1 = ip.Assembly("data1")
data1.set_params("project_dir", "example")
data1.set_params("raw_fastq_path", "data/*.fastq")
data1.set_params("barcodes_path", "barcodes.txt")   

## run steps 1-2
data1.run("12")

## create a new branch of this Assembly named data2
## and change some parameter settings 
data2 = data1.branch("data2")
data2.set_params("clust_threshold", 0.90)

## run steps 3-7 for the two Assemblies
data1.run("34567")
data2.run("34567")