### Assignment: assemble an ipyrad example data set

Follow the instructions at http://ipyrad.readthedocs.io/API_user-guide.html to assemble a dataset using the ipyrad API. Download the dataset as instructed below; it is different from the one in the linked tutorial. Be sure to download the data into your scratch space, and to set the project directory for your ipyrad analysis to your scratch directory. You can use any of the datasets in the downloaded directory. If you have questions, read the ipyrad docs and/or ask in the gitter chatroom.

**When finished, copy this notebook to your assignments/ dir, push it, and make a pull request.**

In [9]:
import ipyrad as ip
import ipyparallel as ipp

### Download the data
You will probably want to move the data to your scratch directory. You can run the code below in this notebook to download it, or run the same commands from a terminal.

In [10]:
%%bash
## The curl command needs a capital O, not a zero
curl -LkO https://github.com/dereneaton/ipyrad/raw/master/tests/ipsimdata.tar.gz
tar -xvzf ipsimdata.tar.gz

./ipsimdata/
./ipsimdata/pairgbs_example_R2_.fastq.gz
./ipsimdata/pairgbs_wmerge_example_barcodes.txt
./ipsimdata/rad_example_genome.fa
./ipsimdata/pairddrad_example_genome.fa
./ipsimdata/pairgbs_example_R1_.fastq.gz
./ipsimdata/pairgbs_wmerge_example_R2_.fastq.gz
./ipsimdata/rad_example_genome.fa.fai
./ipsimdata/pairddrad_example_R2_.fastq.gz
./ipsimdata/pairddrad_example_genome.fa.sma
./ipsimdata/pairddrad_example_genome.fa.fai
./ipsimdata/pairgbs_wmerge_example_genome.fa
./ipsimdata/pairddrad_wmerge_example_genome.fa
./ipsimdata/pairddrad_example_genome.fa.smi
./ipsimdata/pairgbs_wmerge_example_R1_.fastq.gz
./ipsimdata/rad_example_genome.fa.smi
./ipsimdata/gbs_example_barcodes.txt
./ipsimdata/pairgbs_example_barcodes.txt
./ipsimdata/pairddrad_example_R1_.fastq.gz
./ipsimdata/pairddrad_wmerge_example_barcodes.txt
./ipsimdata/rad_example_barcodes.txt
./ipsimdata/pairddrad_wmerge_example_R1_.fastq.gz
./ipsimdata/pairddrad_wmerge_example_R2_.fastq.gz
./ipsimdata/gbs_example_R1_.fastq.gz

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   147  100   147    0     0   1035      0 --:--:-- --:--:-- --:--:--  1027
100 11.8M  100 11.8M    0     0  19.4M      0 --:--:-- --:--:-- --:--:-- 19.4M
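
The same download can be done from Python, which makes it easier to point the data at a scratch directory. The following is a minimal sketch using only the standard library; the URL is the one from the curl command above, while the helper names `extract_archive` and `download_and_extract` and the scratch path in the usage comment are hypothetical.

```python
import tarfile
import urllib.request
from pathlib import Path

# URL copied from the curl command above
DATA_URL = "https://github.com/dereneaton/ipyrad/raw/master/tests/ipsimdata.tar.gz"


def extract_archive(archive_path, dest_dir):
    """Unpack a .tar.gz archive into dest_dir and return the sorted member names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        names = sorted(tar.getnames())
        tar.extractall(dest)
    return names


def download_and_extract(url, dest_dir):
    """Download url into dest_dir, unpack it there, and return the member names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    archive = dest / url.rsplit("/", 1)[-1]
    urllib.request.urlretrieve(url, str(archive))
    return extract_archive(archive, dest)


# Usage (the scratch location is an assumption; substitute your own):
# scratch = Path.home() / "scratch"
# for name in download_and_extract(DATA_URL, scratch):
#     print(name)
```

With this layout the `ipsimdata/` directory ends up inside the scratch directory, so the paths passed to `set_params` later would need that prefix.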


In [11]:
ls ipsimdata/

gbs_example_barcodes.txt               pairgbs_example_barcodes.txt
gbs_example_genome.fa                  pairgbs_example_R1_.fastq.gz
gbs_example_R1_.fastq.gz               pairgbs_example_R2_.fastq.gz
pairddrad_example_barcodes.txt         pairgbs_wmerge_example_barcodes.txt
pairddrad_example_genome.fa            pairgbs_wmerge_example_genome.fa
pairddrad_example_genome.fa.fai        pairgbs_wmerge_example_R1_.fastq.gz
pairddrad_example_genome.fa.sma        pairgbs_wmerge_example_R2_.fastq.gz
pairddrad_example_genome.fa.smi        rad_example_barcodes.txt
pairddrad_example_R1_.fastq.gz         rad_example_genome.fa
pairddrad_example_R2_.fastq.gz         rad_example_genome.fa.fai
pairddrad_wmerge_example_barcodes.txt  rad_example_genome.fa.sma
pairddrad_wmerge_example_genome.fa     rad_example_genome.fa.smi
pairddrad_wmerge_example_R1_.fastq.gz  rad_example_R1_.fast

### Connect to an ipcluster instance

In [12]:
# in terminal window do:
#ipcluster start --n=4

In [13]:
# connect to a running ipcluster instance first; the %px magic is only
# registered once an ipyparallel Client has been created
ipyclient = ipp.Client()

# run the imports on every engine
%px import time, os
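
Engines can take a few seconds to register after `ipcluster start`, so it can help to wait for them before running anything. Below is a minimal sketch that polls the client's `ids` list (the list of connected engine ids on an ipyparallel `Client`); the helper name `wait_for_engines` and its defaults are hypothetical.

```python
import time


def wait_for_engines(client, expected, timeout=60.0, poll=1.0):
    """Block until client.ids lists at least `expected` engines.

    Returns the number of connected engines, or raises TimeoutError
    if they have not all registered within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        connected = len(client.ids)
        if connected >= expected:
            return connected
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"only {connected} of {expected} engines connected after {timeout}s"
            )
        time.sleep(poll)


# Usage against a cluster started with `ipcluster start --n=4`:
# import ipyparallel as ipp
# ipyclient = ipp.Client()
# wait_for_engines(ipyclient, expected=4)
```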


### Assemble the dataset from step 1 to step 7

In [14]:
# step 1: Demultiplexing / Loading fastq files
# step 2: Filtering / Editing reads
# step 3: Clustering / Mapping reads within Samples and alignment
# step 4: Joint estimation of heterozygosity and error rate
# step 5: Consensus base calling and filtering
# step 6: Clustering / Mapping reads among Samples and alignment
# step 7: Filtering and formatting output files
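
The step numbers above are what gets passed to `run()` as a string such as `"12"` or `"34567"`. As a small illustration, the sketch below expands such a string into the descriptions listed above; `STEPS` and `describe_steps` are hypothetical names for this example and are not part of ipyrad.

```python
# Step descriptions copied from the list above
STEPS = {
    "1": "Demultiplexing / Loading fastq files",
    "2": "Filtering / Editing reads",
    "3": "Clustering / Mapping reads within Samples and alignment",
    "4": "Joint estimation of heterozygosity and error rate",
    "5": "Consensus base calling and filtering",
    "6": "Clustering / Mapping reads among Samples and alignment",
    "7": "Filtering and formatting output files",
}


def describe_steps(steps):
    """Expand a steps string like "12" into one description per step."""
    unknown = [s for s in steps if s not in STEPS]
    if unknown:
        raise ValueError(f"unknown step(s): {', '.join(unknown)}")
    return [f"step {s}: {STEPS[s]}" for s in steps]


for line in describe_steps("12"):
    print(line)
```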

In [15]:
## create an Assembly and modify some parameter settings
data1 = ip.Assembly("ipsimdata")
data1.set_params("project_dir", "ipsimdata")
data1.set_params("raw_fastq_path", "ipsimdata/pairgbs_wmerge_example_R1_.fastq.gz")
data1.set_params("barcodes_path", "ipsimdata/pairgbs_wmerge_example_barcodes.txt")
data1.set_params("clust_threshold", "0.90")

data1.get_params()

New Assembly: ipsimdata
0   assembly_name               ipsimdata                                    
1   project_dir                 ./ipsimdata                                  
2   raw_fastq_path              ./ipsimdata/pairgbs_wmerge_example_R1_.fastq.gz
3   barcodes_path               ./ipsimdata/pairgbs_wmerge_example_barcodes.txt
4   sorted_fastq_path                                                        
5   assembly_method             denovo                                       
6   reference_sequence                                                       
7   datatype                    rad                                          
8   restriction_overhang        ('TGCAG', '')                                
9   max_low_qual_bases          5                                            
10  phred_Qscore_offset         33                                           
11  mindepth_statistical        6                                            
12  mindepth_majrule            6   

In [16]:
## other parameters to look into
#data.set_params("project_dir", "analysis-ipyrad")
#data.set_params("sorted_fastq_path", "fastqs-Ped/*.fastq.gz")
#data.set_params("clust_threshold", "0.90")
#data.set_params("filter_adapters", "2")
#data.set_params("max_Hs_consens", (5, 5))
#data.set_params("trim_loci", (0, 5, 0, 0))
#data.set_params("output_formats", "psvnkua")
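
When several of these settings are changed together, it can be tidier to keep them in one dictionary and apply them in a loop. This is a minimal sketch that relies only on the `set_params(name, value)` call used above; the helper name `apply_params` is hypothetical, and the values are the ones from the commented-out lines.

```python
def apply_params(assembly, params):
    """Call assembly.set_params(name, value) for each entry, in order."""
    for name, value in params.items():
        assembly.set_params(name, value)
    return assembly


# Values taken from the commented-out examples above
extra_params = {
    "clust_threshold": "0.90",
    "filter_adapters": "2",
    "max_Hs_consens": (5, 5),
    "trim_loci": (0, 5, 0, 0),
    "output_formats": "psvnkua",
}

# Usage with the Assembly created earlier in this notebook:
# apply_params(data1, extra_params)
# data1.get_params()
```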

In [None]:
## run steps 1-2
data1.run("12")

Assembly: ipsimdata


In [None]:
## access the stats of the assembly (so far) from the .stats attribute
data1.stats

In [None]:
## create a new branch of this Assembly named data2
## and change some parameter settings
data2 = data1.branch("data2")
data2.set_params("clust_threshold", 0.95)

In [None]:
## run steps 3-7 for the two Assemblies
data1.run("34567")
data2.run("34567")

### Print the final assembly stats

In [None]:
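## we can access the stats summary of each Assembly as a pandas DataFrame
## (min4 is not defined in this notebook, so use data1 and data2 from above)
print(data1.stats)
print(data2.stats)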

### Show the location of your assembled output files

In [None]:
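## min4 is not defined in this notebook, so use an Assembly created above;
## force=True re-runs step 7 and overwrites the existing output files
data1.run("7", force=True)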