Tutorial: advanced data input with ipyrad command-line

There are several ways to enter your data into ipyrad, and we've tried to make it as simple as possible to combine data from multiple libraries and multiple plates. Here we demonstrate how to demultiplex and load data under different scenarios:

  1. One Library One Lane of Sequencing
  2. One Library Multiple Lanes of Sequencing
  3. Multiple Libraries Multiple Lanes of Sequencing
  4. Separate Multiple Libraries from One Lane of Sequencing
  5. Alternative: Using the ipyrad API to do these things

1. One Library One Lane of Sequencing

First create a new Assembly; here we'll call it demux1. Then edit the params file to enter the raw_fastq_path and barcodes_path values that are needed to demultiplex the data. You can do this in a text editor; to automate the edit, I use the command-line program sed here to substitute in the new values.

## create a new assembly
ipyrad -n demux1

New file 'params-demux1.txt' created in /home/deren/Documents/ipyrad/tests

## edit the params file to enter your raw_fastq_path and barcodes path
sed -i '/\[2] /c\ipsimdata/rad_example_R1_.fastq.gz  ## [2] ' params-demux1.txt
sed -i '/\[3] /c\ipsimdata/rad_example_barcodes.txt  ## [3] ' params-demux1.txt
## run step 1 to demultiplex the data
ipyrad -p params-demux1.txt -s 1 

 -------------------------------------------------------------
  ipyrad [v.0.5.15]
  Interactive assembly and analysis of RAD-seq data
 -------------------------------------------------------------
  loading Assembly: demux1
  from saved path: ~/Documents/ipyrad/tests/demux1.json
  New Assembly: demux1
  host compute node: [40 cores] on tinus

  Step 1: Demultiplexing fastq data to Samples

  [####################] 100%  sorting reads         | 0:00:06
  [####################] 100%  writing/compressing   | 0:00:00
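
The sed substitutions used above replace any line containing a parameter tag such as `[2] ` with a new value followed by that tag. For readers without GNU sed, here is a minimal Python sketch of the same edit. The helper name `set_param` and the three-line mock params file are assumptions made for illustration; a real params file written by `ipyrad -n` has more lines and longer descriptions.

```python
import tempfile
from pathlib import Path


def set_param(params_file, index, value):
    """Replace each line containing '[index] ' with 'value  ## [index] '.

    Mirrors the sed 'c\\' (change line) commands used in this tutorial.
    Returns the number of lines that were replaced.
    """
    tag = f"[{index}] "
    path = Path(params_file)
    lines = path.read_text().splitlines()
    replaced = 0
    for i, line in enumerate(lines):
        if tag in line:
            lines[i] = f"{value}  ## {tag}"
            replaced += 1
    path.write_text("\n".join(lines) + "\n")
    return replaced


# Hypothetical stand-in for a params file (not real ipyrad output).
mock = "demux1  ## [0] name\n  ## [2] raw_fastq_path\n  ## [3] barcodes_path\n"

with tempfile.TemporaryDirectory() as tmp:
    params = Path(tmp) / "params-demux1.txt"
    params.write_text(mock)
    set_param(params, 2, "ipsimdata/rad_example_R1_.fastq.gz")
    set_param(params, 3, "ipsimdata/rad_example_barcodes.txt")
    edited = params.read_text().splitlines()

print("\n".join(edited))
```

Like the sed version, this overwrites the whole matching line, so the parameter description after the tag is dropped.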

The demultiplexed data is now located in the directory <project_dir>/<assembly_name>_fastqs/, which in this case is ./demux1_fastqs/. The Assembly demux1 knows the location of the data, so from here you can proceed in either of two ways: (1) continue on to step 2 using this Assembly (demux1), or (2) create a new Assembly that starts by reading in the demultiplexed files through the sorted_fastq_path parameter. The latter is sometimes clearer because it keeps the demultiplexing steps separate from the assembly steps. It makes no difference in this example, where we have only one library and one lane of data, but as you will see in the examples below, it is sometimes easier to create multiple separate demux Assemblies that are then merged into a single Assembly for assembling.

## option 1: continue to assemble this data set
ipyrad -p params-demux1.txt -s 234567
## OR, option 2: create a new Assembly and enter path to the demux data
ipyrad -n New

## enter path to the demultiplexed files as the 'sorted_fastq_path' in params
sed -i '/\[4] /c\./demux1_fastqs/*.gz  ## [4] ' params-New.txt

## assemble this data set 
ipyrad -p params-New.txt -s 1234567

2. One Library Multiple Lanes of Sequencing

There are two options for joining multiple lanes of sequence data that come from the same library (i.e., there is only one barcodes file). (1) The simplest way is to put the multiple raw fastq data files into the same directory and select them all with a wildcard when entering the raw_fastq_path (e.g., "*.fastq.gz"). (2) The second way is to create two separate demux Assemblies and then merge them, which I demonstrate below. Because the two demultiplexed lanes use the same barcodes file, the Samples will have identical names. ipyrad will recognize this during merging and read both input files for each Sample in step 2.
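
For option (1), it can help to preview which files a wildcard will select before entering it as the raw_fastq_path. The sketch below uses Python's standard glob module on a temporary directory. The file names are hypothetical, and this only illustrates shell-style wildcard matching; it does not show how ipyrad itself resolves the path.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as raw_dir:
    # Hypothetical file names: two lanes of one library plus an unrelated file.
    for name in ("lane1_R1_.fastq.gz", "lane2_R1_.fastq.gz", "notes.txt"):
        open(os.path.join(raw_dir, name), "w").close()

    # The same kind of pattern you would enter as raw_fastq_path.
    pattern = os.path.join(raw_dir, "*.fastq.gz")
    matched = sorted(os.path.basename(p) for p in glob.glob(pattern))

print(matched)
```

Only the two fastq.gz files match, so both lanes would be read together while the unrelated file is ignored.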

## create demux Assembly object for lane 1 
ipyrad -n lane1raws 

New file 'params-lane1raws.txt' created in /home/deren/Documents/ipyrad/tests

## create demux Assembly object for lane 2 
ipyrad -n lane2raws

New file 'params-lane2raws.txt' created in /home/deren/Documents/ipyrad/tests

## edit the params file for lane1 to enter its raw_fastq_path and barcodes file
sed -i '/\[2] /c\ipsimdata/rad_example_R1_.fastq.gz  ## [2] ' params-lane1raws.txt
sed -i '/\[3] /c\ipsimdata/rad_example_barcodes.txt  ## [3] ' params-lane1raws.txt

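## edit the params file for lane2 to enter its raw_fastq_path and barcodes file
## (this demo reuses the same simulated file for both lanes; with real data, point lane2 at its own file)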
sed -i '/\[2] /c\ipsimdata/rad_example_R1_.fastq.gz  ## [2] ' params-lane2raws.txt
sed -i '/\[3] /c\ipsimdata/rad_example_barcodes.txt  ## [3] ' params-lane2raws.txt
## demultiplex lane1
ipyrad -p params-lane1raws.txt -s 1 

 -------------------------------------------------------------
  ipyrad [v.0.5.15]
  Interactive assembly and analysis of RAD-seq data
 -------------------------------------------------------------
  New Assembly: lane1raws
  host compute node: [40 cores] on tinus

  Step 1: Demultiplexing fastq data to Samples

  [####################] 100%  sorting reads         | 0:00:06
  [####################] 100%  writing/compressing   | 0:00:01

## demultiplex lane2
ipyrad -p params-lane2raws.txt -s 1 

 -------------------------------------------------------------
  ipyrad [v.0.5.15]
  Interactive assembly and analysis of RAD-seq data
 -------------------------------------------------------------
  New Assembly: lane2raws
  host compute node: [40 cores] on tinus

  Step 1: Demultiplexing fastq data to Samples

  [####################] 100%  sorting reads         | 0:00:06
  [####################] 100%  writing/compressing   | 0:00:00

## merge the two lanes into one Assembly named both
ipyrad -m both params-lane1raws.txt params-lane2raws.txt

 -------------------------------------------------------------
  ipyrad [v.0.5.15]
  Interactive assembly and analysis of RAD-seq data
 -------------------------------------------------------------

  Merging assemblies: ['params-lane1raws.txt', 'params-lane2raws.txt']
  loading Assembly: lane1raws
  from saved path: ~/Documents/ipyrad/tests/lane1raws.json
  loading Assembly: lane2raws
  from saved path: ~/Documents/ipyrad/tests/lane2raws.json

  Merging succeeded. New params file for merged assembly:

    params-both.txt

## print merged stats of new Assembly
ipyrad -p params-both.txt -r 

Summary stats of Assembly both

      state  reads_raw
1A_0      1      39724
1B_0      1      40086
1C_0      1      40272
1D_0      1      39932
2E_0      1      40034
2F_0      1      39866
2G_0      1      40060
2H_0      1      40398
3I_0      1      39770
3J_0      1      39644
3K_0      1      39930
3L_0      1      40016

Full stats files

step 1: ./lane1raws_fastqs/s1_demultiplex_stats.txt
step 2: None
step 3: None
step 4: None
step 5: None
step 6: None
step 7: None

## run remaining steps on the merged assembly
ipyrad -p params-both.txt -s 234567

3. Multiple Libraries Multiple Lanes of Sequencing

The recommended way to combine multiple lanes of data from multiple libraries is the same as demonstrated above. The difference is that the Samples in each Assembly come from a different library, so they have different names. Imagine that each lane of sequencing contains a library with 48 Samples. In the example above (one library, multiple lanes), the Samples are combined so that you have 48 Samples, each with data from two fastq files. In this case, by contrast, merging combines the two libraries of different Samples into a single data set with 96 Samples, each with one lane of data.
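
The difference between the two merging outcomes can be modeled with plain dictionaries. This is a minimal sketch of the bookkeeping only, assuming hypothetical sample and file names; `merge_samples` and `make_assembly` are illustrative helpers, not ipyrad functions, and this is not how ipyrad implements merging internally.

```python
from collections import defaultdict


def make_assembly(library, lane, n_samples=48):
    """Build a toy 'Assembly': sample name -> list of fastq files."""
    return {
        f"{library}_s{i:02d}": [f"{library}_s{i:02d}_{lane}.fastq.gz"]
        for i in range(n_samples)
    }


def merge_samples(*assemblies):
    """Union the sample names; samples sharing a name pool their files."""
    merged = defaultdict(list)
    for assembly in assemblies:
        for sample, files in assembly.items():
            merged[sample].extend(files)
    return dict(merged)


# One library sequenced on two lanes: identical sample names are combined.
same_library = merge_samples(
    make_assembly("lib1", "lane1"), make_assembly("lib1", "lane2")
)

# Two libraries, one lane each: different sample names are kept side by side.
two_libraries = merge_samples(
    make_assembly("lib1", "lane1"), make_assembly("lib2", "lane2")
)

print(len(same_library), len(same_library["lib1_s00"]))
print(len(two_libraries), len(two_libraries["lib1_s00"]))
```

In the first case the merged set has 48 samples with two files each; in the second it has 96 samples with one file each.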

4. Separate Multiple Libraries from One Lane of Sequencing

When Samples from several projects were sequenced together on one lane, demultiplex the whole lane once, then create a separate branch for each project that contains only that project's Samples.

## create new Assembly named lib1
ipyrad -n lib1 

## enter raw_fastq_path and barcodes_path into params
sed -i '/\[2] /c\ipsimdata/rad_example_R1_.fastq.gz  ## [2] ' params-lib1.txt
sed -i '/\[3] /c\ipsimdata/rad_example_barcodes.txt  ## [3] ' params-lib1.txt

## demultiplex the lane of data
ipyrad -p params-lib1.txt -s 1 

## create a new branch with only the Samples for project 1
ipyrad -p params-lib1.txt -b project1 1A_0 1B_0 1C_0 1D_0 

## create another branch with only the Samples for project 2
ipyrad -p params-lib1.txt -b project2 2E_0 2F_0 2G_0 2H_0 
## assemble project 1 
ipyrad -p params-project1.txt -s 234567
## assemble project 2
ipyrad -p params-project2.txt -s 234567
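
With many projects, typing each branching command by hand becomes error-prone. The sketch below builds the same `ipyrad -p ... -b ...` command strings shown above from a mapping of branch names to Sample names. The helper `branch_commands` is hypothetical; it only formats text and does not run ipyrad.

```python
import shlex


def branch_commands(params_file, projects):
    """Return one 'ipyrad -p <params> -b <branch> <samples...>' string per project."""
    commands = []
    for branch_name, samples in projects.items():
        args = ["ipyrad", "-p", params_file, "-b", branch_name, *samples]
        # Quote each argument so the string is safe to paste into a shell.
        commands.append(" ".join(shlex.quote(arg) for arg in args))
    return commands


# Sample names taken from the example above.
projects = {
    "project1": ["1A_0", "1B_0", "1C_0", "1D_0"],
    "project2": ["2E_0", "2F_0", "2G_0", "2H_0"],
}

commands = branch_commands("params-lib1.txt", projects)
for command in commands:
    print(command)
```

Keeping the project-to-Sample mapping in one place also documents which Samples belong to which project.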

5. Alternative: Using the ipyrad API to do these things

Using the ipyrad API is an alternative to using the command-line-interface (CLI) above. As you can see below, writing code with the Python API can be much simpler and more elegant. We recommend using the API inside a Jupyter-notebook.

## import ipyrad
import ipyrad as ip
## one lane one library
data1 = ip.Assembly("data1")
data1.set_params("raw_fastq_path", "ipsimdata/rad_example_R1_.fastq.gz")
data1.set_params("barcodes_path", "ipsimdata/rad_example_barcodes.txt")
data1.run("1234567")
## one library multiple lanes
lib1lane1 = ip.Assembly("lib1lane1")
lib1lane1.set_params("raw_fastq_path", "ipsimdata/rad_example_R1_.fastq.gz")
lib1lane1.set_params("barcodes_path", "ipsimdata/rad_example_barcodes.txt")
lib1lane1.run("1")

lib1lane2 = ip.Assembly("lib1lane2")
lib1lane2.set_params("raw_fastq_path", "ipsimdata/rad_example_R1_.fastq.gz")
lib1lane2.set_params("barcodes_path", "ipsimdata/rad_example_barcodes.txt")
lib1lane2.run("1")

merged = ip.merge("lib1-2lanes", [lib1lane1, lib1lane2])
merged.run("234567")
## multiple libraries multiple lanes
lib1lane1 = ip.Assembly("lib1lane1")
lib1lane1.set_params("raw_fastq_path", "ipsimdata/lib1_lane1_R1_.fastq.gz")
lib1lane1.set_params("barcodes_path", "ipsimdata/lib1_barcodes.txt")
lib1lane1.run("1")

lib1lane2 = ip.Assembly("lib1lane2")
lib1lane2.set_params("raw_fastq_path", "ipsimdata/lib1_lane2.fastq.gz")
lib1lane2.set_params("barcodes_path", "ipsimdata/lib1_barcodes.txt")
lib1lane2.run("1")

lib2lane1 = ip.Assembly("lib2lane1")
lib2lane1.set_params("raw_fastq_path", "ipsimdata/lib2_lane1.fastq.gz")
lib2lane1.set_params("barcodes_path", "ipsimdata/lib2_barcodes.txt")
lib2lane1.run("1")

lib2lane2 = ip.Assembly("lib2lane2")
lib2lane2.set_params("raw_fastq_path", "ipsimdata/lib2_lane2_.fastq.gz")
lib2lane2.set_params("barcodes_path", "ipsimdata/lib2_barcodes.txt")
lib2lane2.run("1")

fulldata = ip.merge("fulldata", [lib1lane1, lib1lane2, lib2lane1, lib2lane2])
fulldata.run("234567")
## splitting a library into different projects
project1 = ["sample1", "sample2", "sample3"]
project2 = ["sample4", "sample5", "sample6"]

proj1 = fulldata.branch("proj1", subsamples=project1)
proj2 = fulldata.branch("proj2", subsamples=project2)

proj1.run("234567", force=True)
proj2.run("234567", force=True)
## print stats of project 1
print(proj1.stats)