# _ipyrad_ testing tutorial

### Getting started
Import _ipyrad_ and remove previous test files if they are already present.

In [1]:
## import modules
import ipyrad as ip      ## for RADseq assembly
print ip.__version__     ## print version

## clear data from test directory if it already exists
import shutil
import os
import subprocess
#if os.path.exists("./test_refseq/"):
#    shutil.rmtree("./test_refseq/")

0.0.65


### Initialize smalt (index reference sequence)
This step indexes the reference sequence in preparation for read mapping. It only ever needs to be done once, so it should be tested during initialization.

`smalt index zf-ref ../zf/zf.sm.fa`

There is an optional -s flag (step size) that could improve mapping accuracy. We should pick the best default. It is probably not worth letting people pass it in; if they want to tune it, they can index their own reference.

In [3]:
# hack the binary paths cuz the current egg doesn't have them in it
#data1.muscle=
#data1.vsearch
#data1.smalt

# Reference sequence directory (gzipped fasta files)
# TODO: set this as a parameter
# e.g., data1.set_params('refseq', "./data/zf.fa.gz")

# TODO: push this example file to the data/ dir 
REFSEQ = "./data/zf.fa.gz"

# Set the step size to 4 (default is 13)
# This will slow down read mapping, but increase accuracy
SMALT_INDEX_FLAGS = " -s 4 "

# TODO: create and link a dir/ to the Assembly object for the reference data files
data1.dirs.reference = '...'

# TODO: create and link index files to Sample objects
data1.samples['1A_0'].files.index_smi = '...'
data1.samples['1A_0'].files.index_sma = '...'

# Test if reference sequence is already indexed
# Only index if the .smi and .sma files don't exist, saves lots of time
if not (os.path.isfile(REFSEQ + ".smi") and os.path.isfile(REFSEQ + ".sma")):
    # smalt indexing will create two files called REFSEQ.smi and .sma
    # in the same directory as the reference sequence. 
    cmd = data1.smalt + " index " + SMALT_INDEX_FLAGS + REFSEQ + " " + REFSEQ
    print cmd
    subprocess.check_call(cmd, shell=True,
                            stderr=subprocess.STDOUT,
                            stdout=subprocess.PIPE)
    #output = subprocess.check_output( " ".join(cmd), shell=True)
else:
    print "Reference sequence index exists"

KeyError: '1A_0'
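
The check-then-index logic in the cell above can be sketched as a standalone helper. This is a minimal sketch, not ipyrad code: `smalt_index_cmd` and `index_exists` are hypothetical names, and it assumes, as the comments above say, that smalt writes `.smi` and `.sma` files named after the index. Building the command as a list also avoids the need for `shell=True`.

```python
import os


def index_exists(refseq):
    """True only if both smalt index files sit next to the reference."""
    return all(os.path.isfile(refseq + ext) for ext in (".smi", ".sma"))


def smalt_index_cmd(smalt_bin, refseq, step=None):
    """Build 'smalt index [-s step] <index_name> <reference>' as a list.

    The reference path doubles as the index name, as in the cell above.
    """
    cmd = [smalt_bin, "index"]
    if step is not None:
        cmd += ["-s", str(step)]
    cmd += [refseq, refseq]
    return cmd
```

If `index_exists(REFSEQ)` is false, the list returned by `smalt_index_cmd` could be passed straight to `subprocess.check_call`.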

### Assembly and Sample objects

Assembly and Sample objects are used by _ipyrad_ to access data stored on disk and to manipulate it. Each biological sample in a data set is represented in a Sample object, and these Samples are stored inside Assembly objects. The Assembly object contains functions to assemble the data, and stores a log of all steps performed and the resulting statistics of those steps. Assembly objects can be copied or merged to allow branching events where different parameters are applied to assemblies. 

To create an Assembly object call ip.Assembly and pass it a name for the data set. We could imagine that we planned to assemble and later combine data from multiple sequencing runs, but before combining them each group of samples has to be analyzed under a different set of parameters. As an example, we could call two data sets "2014_data" and "2015_data". These initially do not contain any Samples. Sample objects are created either by linking fastq files to the Assembly object or by running step 1 to demultiplex raw data files. 

In [4]:
## create an Assembly object called data1. 
## It takes a name for the data set as its argument
data1 = ip.Assembly("2014_data")
data2 = ip.Assembly("2015_data")

print "Assembly object named", data1.name
print "Assembly object named", data2.name


[]
0 new Samples created in 2014_data.
0 fastq files linked to Samples.
[]
0 new Samples created in 2015_data.
0 fastq files linked to Samples.
Assembly object named 2014_data
Assembly object named 2015_data


### Modifying assembly parameters
All of the parameter settings are linked to an Assembly object, which has a set of default parameters when it is created. These can be viewed using the `get_params()` function. To get more detailed information about all parameters use `ip.get_params_info()`, or for a single parameter use `ip.get_params_info(3)`. Assembly objects have a function `set_params()` that can be used to modify parameters. 

In [5]:
## modify parameters for this Assembly object
data1.set_params(1, "./test_refseq")
data1.set_params(2, "./data/sim_rad_test_R1_.fastq.gz")
data1.set_params(3, "./data/sim_rad_test_barcodes.txt")
data1.set_params(7, 3)
data1.set_params(10, 'rad')
#data1.set_params(27, '/Volumes/WorkDrive/ipyrad/refhacking/MusChr1.fa')

## print the new parameters to screen
data1.get_params()

  1   working_directory             ./test_refseq                                
  2   raw_fastq_path                ./data/sim_rad_test_R1_.fastq.gz             
  3   barcodes_path                 ./data/sim_rad_test_barcodes.txt             
  4   sorted_fastq_path                                                          
  5   restriction_overhang          ('TGCAG', '')                                
  6   max_low_qual_bases            5                                            
  7   N_processors                  3                                            
  8   mindepth_statistical          6                                            
  9   mindepth_majrule              6                                            
  10  datatype                      rad                                          
  11  clust_threshold               0.85                                         
  12  minsamp                       4                                            
  13  max_shared
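
Parameters are addressed by number in the cell above, while later cells in this notebook also pass a name such as `'working_directory'`. The following is a minimal sketch of how one table could serve both styles. It is a toy model rather than ipyrad's implementation: `PARAMS`, `resolve_key`, `get_param` and `set_param` are hypothetical names, and only the first three parameters from the output above are included.

```python
from collections import OrderedDict

# Toy parameter table; names and values copied from the output above.
PARAMS = OrderedDict([
    ("working_directory", "./test_refseq"),
    ("raw_fastq_path", "./data/sim_rad_test_R1_.fastq.gz"),
    ("barcodes_path", "./data/sim_rad_test_barcodes.txt"),
])


def resolve_key(params, key):
    """Translate a 1-based parameter number or a name into a name."""
    if isinstance(key, int):
        if not 1 <= key <= len(params):
            raise KeyError("no parameter number %d" % key)
        return list(params)[key - 1]
    if key not in params:
        raise KeyError("no parameter named %r" % key)
    return key


def get_param(params, key):
    return params[resolve_key(params, key)]


def set_param(params, key, value):
    params[resolve_key(params, key)] = value
```

Looking parameters up by name keeps calling code working even if the numbering is later reordered.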

### Starting data assembly and Sample objects
If the data are already demultiplexed then fastq files can be linked directly to the Assembly object, which in turn will create a Sample object for each fastq file (or pair of fastq files for paired data). The files may be gzip compressed. If the data are not demultiplexed then you will have to run the `step1` function below to demultiplex the raw data.

In [6]:
## This would link fastq files from the 'sorted_fastq_path' if present
## Here it does nothing b/c there are no files in the sorted_fastq_path
data1.link_fastqs()

[]
0 new Samples created in 2014_data.
0 fastq files linked to Samples.
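
How one Sample comes to own one fastq file, or a pair of them, can be sketched as a grouping step over file names. This is an illustration under an assumed naming scheme (`<sample>_R1_.gz` and `<sample>_R2_.gz`, modeled on the `1A_0_R1_.gz` path used later in this notebook). `group_fastqs` is a hypothetical helper, not an ipyrad function.

```python
import os
import re
from collections import defaultdict

# Assumed read tag in file names, e.g. "1A_0_R1_.gz" or "1A_0_R2_.gz".
READ_TAG = re.compile(r"_R([12])_")


def group_fastqs(paths):
    """Group fastq paths by sample name, keeping R1 and R2 together."""
    samples = defaultdict(dict)
    for path in paths:
        fname = os.path.basename(path)
        match = READ_TAG.search(fname)
        if match is None:
            continue  # not a recognizable fastq name, skip it
        name = fname[:match.start()]
        samples[name]["R" + match.group(1)] = path
    return dict(samples)
```

With an empty list of paths the function returns an empty dict, which corresponds to the "0 new Samples created" message above.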


### Step 1: Demultiplex the raw data files
This uses the barcodes information to demultiplex reads in data files found in the 'raw_fastq_path'. It will create a Sample object for each sample that will be stored in the Assembly object. The state of each sample will be set to 1, meaning that the sample has completed step 1 of the _ipyrad_ assembly.

In [7]:
## run step 1 to demultiplex the data
data1.step1()

## print the results for each Sample in data1
print data1.stats

      state  reads_raw  reads_filtered  clusters_total  clusters_kept  \
1A_0      1      20099             NaN             NaN            NaN   
1B_0      1      19977             NaN             NaN            NaN   
1C_0      1      20114             NaN             NaN            NaN   
1D_0      1      19895             NaN             NaN            NaN   
2E_0      1      19928             NaN             NaN            NaN   
2F_0      1      19934             NaN             NaN            NaN   
2G_0      1      20026             NaN             NaN            NaN   
2H_0      1      19936             NaN             NaN            NaN   
3I_0      1      20084             NaN             NaN            NaN   
3J_0      1      20011             NaN             NaN            NaN   
3K_0      1      20117             NaN             NaN            NaN   
3L_0      1      19901             NaN             NaN            NaN   

      hetero_est  error_est  reads_consens  
1A_0 
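
The idea behind demultiplexing can be sketched in a few lines: each read is assigned to the sample whose barcode it starts with, and reads are counted per sample. This is a toy version that assumes exact matches and barcodes that are not prefixes of one another. `demultiplex` is a hypothetical name, and it says nothing about how `step1` is actually implemented.

```python
def demultiplex(reads, barcodes):
    """Count reads per sample by exact barcode match at the read start.

    reads    -- iterable of sequence strings
    barcodes -- dict mapping sample name to barcode string
    Returns (counts per sample, number of unmatched reads).
    """
    counts = {name: 0 for name in barcodes}
    unmatched = 0
    for seq in reads:
        for name, barcode in barcodes.items():
            if seq.startswith(barcode):
                counts[name] += 1
                break
        else:
            # no barcode matched this read
            unmatched += 1
    return counts, unmatched
```

The per-sample counts play the role of the `reads_raw` column in the stats table above.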

### Step 2: Filter reads 
If for some reason we wanted to execute on just a subsample of our data, we could do this by selecting only certain samples to call the `step2` function on. Because `step2` is a function of `data1`, it will always execute with the parameters that are linked to `data1`. 

In [None]:
## example of ways to run step 2 to filter and trim reads
#data1.step2("1A_0")            ## run on a single sample
data1.step2(["1B_0", "1C_0"])  ## run on one or more samples
#data1.step2()                  ## run on all samples, skipping finished ones

## print the results
print data1.stats
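
The three call styles in the cell above (one sample name, a list of names, or no argument) can be reduced to one selection rule. The sketch below assumes that a Sample's state is the number of the last step it completed, as described for step 1, and that a sample counts as finished for a step once its state has reached that step. `select_samples` is a hypothetical helper, not part of ipyrad.

```python
def select_samples(states, names=None, step=2):
    """Return the sample names that still need to run `step`.

    states -- dict mapping sample name to its current state
    names  -- None (all samples), a single name, or a list of names
    """
    if names is None:
        names = sorted(states)
    elif isinstance(names, str):
        names = [names]
    missing = [name for name in names if name not in states]
    if missing:
        raise KeyError("unknown samples: %s" % ", ".join(missing))
    # skip samples whose state shows the step is already done
    return [name for name in names if states[name] < step]
```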

### Do the read mapping (SE)
Here's an example cmdline run with args explained below:

    smalt map -f sam -n 8 -l pp -o Arremon.sam zf-ref ../MarTum-fasta/ArremonR1.fa ../MarTum-fasta/ArremonR2.fa

* -f sam sets the output format. You can also output 'bam', but that requires installing bambamc (explained in the smalt docs), which seems annoying, especially since samtools will do the conversion for us.
* -n sets the number of threads (here 8), which dramatically increases speed.
* -l pp tells smalt the orientation of the paired reads. Here pp means both reads are on the same strand in the 5' to 3' direction. I think the second read was originally from the second strand and pyrad reverse complemented it.
* -o is the outfile.
* Next come the indexed reference sequence and the files containing reads.

Other options to look into:
* -y minid filters output alignments by a threshold in the number of exactly matching nucleotides.
* -r seed determines how reads or mate pairs with multiple best mappings are reported.
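
The example command line above can be assembled from its parts in Python. This is a minimal sketch: `smalt_map_cmd` is a hypothetical helper that uses only the flags described above (-f, -n, -l, -o) and leaves out -y and -r.

```python
def smalt_map_cmd(smalt_bin, index, reads, out_sam, threads=8, pair_type=None):
    """Build a 'smalt map' command as an argument list.

    index     -- name of the smalt index (e.g. "zf-ref")
    reads     -- list with one reads file (single end) or two (paired)
    pair_type -- value for -l, e.g. "pp"; omitted when None
    """
    cmd = [smalt_bin, "map", "-f", "sam", "-n", str(threads)]
    if pair_type is not None:
        cmd += ["-l", pair_type]
    cmd += ["-o", out_sam, index]
    cmd += list(reads)
    return cmd
```

Joining the list returned for the paired-end example with spaces gives back the command line shown above, and the list itself can be handed to `subprocess.call` without `shell=True`.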

In [None]:
#data1.paramsdict["working_directory"]
data1.dirs

In [None]:
output = "/tmp/wat"

# Check the input files
SMALT_CMD = "check "
## the read1 demultiplexed reads file
fr1 = data1.get_params(1)+"/fastq/1A_0_R1_.gz"
#data1.smalt = "/usr/local/bin/smalt"
cmd = data1.smalt + " " + SMALT_CMD + " " + fr1
print cmd
subprocess.call(cmd, shell=True,
                     stderr=subprocess.STDOUT,
                     stdout=subprocess.PIPE)

SMALT_CMD = "map -f sam -n 8 -o " + output
## the read1 demultiplexed reads file

## TODO: I recommend using parameter descriptions rather than numbers
## in the code so it is more robust to potential reordering of parameters
fr1 = data1.get_params('working_directory')+"/fastq/1A_0_R1_.gz"

cmd = data1.smalt + " " + SMALT_CMD + " " + REFSEQ + " " + fr1
print cmd
subprocess.call(cmd, shell=True,
                     stderr=subprocess.STDOUT,
                     stdout=subprocess.PIPE)

### Get mapped and unmapped reads

First get some info about our mapping.

    samtools flagstat <yoursam>

Get only the mapped reads. 0x4 is the flag bit for 'unmapped' reads, and -F excludes every read that has this bit set. In both cases -b writes the output as bam.

    samtools view -b -F 0x4 <your.sam> > mapped.bam

Same as above, but in this case -f means just give me the ones with this flag set.

    samtools view -b -f 0x4 <your.sam> > unmapped.bam
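
What the -F and -f filters do with the 0x4 bit can be shown with plain integer arithmetic. In this small sketch the flag values are made-up examples, and `is_unmapped` and `split_by_mapping` are hypothetical names, not samtools or pysam functions.

```python
UNMAPPED = 0x4  # SAM flag bit meaning "this read is unmapped"


def is_unmapped(flag):
    """True if the unmapped bit is set in a SAM flag value."""
    return bool(flag & UNMAPPED)


def split_by_mapping(flags):
    """Mimic '-F 0x4' (mapped) and '-f 0x4' (unmapped) on a list of flags."""
    mapped = [flag for flag in flags if not is_unmapped(flag)]
    unmapped = [flag for flag in flags if is_unmapped(flag)]
    return mapped, unmapped
```

A flag of 77 (64 + 8 + 4 + 1) has the 0x4 bit set and lands in the unmapped list, while a flag of 16 (reverse strand only) stays with the mapped reads.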

Sort the mapped reads and convert them back to fastq:

    samtools sort -T /tmp/wat -O bam test.mapped.bam > test.mapped.sorted.bam
    samtools bam2fq test.mapped.sorted.bam

In [11]:
import pysam

#This is junk

print data1.muscle
print data1.vsearch
print data1.smalt
print data1.samples["1B_0"].files.edits
#bam2py("")
#pysam.view("-b", "-S", "-o") #, INDIVIDUALS_WORK_DIR+species+"/"+ind+"-"+refseq.split("/")[-1]+".bam", INDIVIDUALS_WORK_DIR+species+"/"+ind+"-"+refseq.split("/")[-1]+".sam", catch_stdout=False)
#pysam.sort( "-O", "bam", "-o", INDIVIDUALS_WORK_DIR+species+"/"+ind+"-"+refseq.split("/")[-1]+".bam", "-T", "tempfile", INDIVIDUALS_WORK_DIR+species+"/"+ind+"-"+refseq.split("/")[-1]+".bam", catch_stdout=False)
#pysam.index( INDIVIDUALS_WORK_DIR+species+"/"+ind+"-"+refseq.split("/")[-1]+".bam", catch_stdout=False)

/home/deren/Dropbox/ipyrad/bin/muscle3.8.31_i86linux64
/home/deren/Dropbox/ipyrad/bin/vsearch-1.1.3-linux-x86_64
/home/deren/Dropbox/ipyrad/bin/smalt-0.7.6-linux-x86_64
[]


### Step 3: clustering within-samples

In [None]:
## run step 3 to cluster reads within samples using vsearch
#data1.step3(preview=1) #["2H_0", "2G_0"], preview=1)
data1.step3(["1B_0", "1C_0"], preview=1)
## print the results
print data1.stats

### Example of plotting with _ipyrad_
There are a few simple plotting functions in _ipyrad_ that are useful for visualizing results. These are in the module `ipyrad.plotting`. Below is an interactive plot for visualizing the distributions of coverages across the 12 samples in the test data set.  

In [None]:
import ipyrad as ip
import ipyrad.plotting as iplot

## reload autosaved data. In case you quit and came back 
#data1 = ip.load_dataobj("test_rad/2014_data.dataobj")

## plot for one or more selected samples
iplot.depthplot(data1, ["1A_0", "1B_0"])

## plot for all samples in data1
#iplot.depthplot(data1)

## save plot as pdf and html
iplot.depthplot(data1, outprefix="testfig")

### Step 4: Joint estimation of heterozygosity and error rate


In [None]:
## run step 4
data1.step4() #"2H_0", "2G_0")

## print the results
print data1.stats

### Step 5: Consensus base calls


In [None]:
## run step 5
data1.step5(["2H_0"])

## print the results
print data1.stats

### Quick parameter explanations are always on-hand

In [None]:
ip.get_params_info(10)

### Log history 
A common problem after struggling through an analysis is that you find you've completely forgotten what parameters you used at what point, and when you changed them. The log history time stamps all calls to `set_params()`, as well as calls to `step` methods. It also records copies/branching of data objects.  

In [None]:
for i in data1.log:
    print i

### Saving Assembly objects
Assembly objects can be saved and loaded so that interactive analyses can be started, stopped, and returned to quite easily. The format of these saved files is a serialized 'dill' object used by Python. Individual Sample objects are saved within Assembly objects. These objects do not contain the actual sequence data, but only link to it, and so are not very large. The information contained includes parameters and the log of Assembly objects, and the statistics and state of Sample objects. Assembly objects are autosaved each time an assembly `step` function is called, but you can also create your own checkpoints with the `save` command. 

In [None]:
## save assembly object
#ip.save_assembly("data1.p")

## load assembly object
#data = ip.load_assembly("data1.p")
#print data.name
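
The reason saved objects stay small is that they hold paths and statistics rather than reads. The sketch below shows that round trip with the standard-library `pickle` module standing in for dill. The `state` dict is a made-up stand-in for an Assembly object, and `save_state` and `load_state` are hypothetical helpers.

```python
import os
import pickle
import shutil
import tempfile


def save_state(state, path):
    """Serialize a small state object to disk."""
    with open(path, "wb") as handle:
        pickle.dump(state, handle, protocol=2)


def load_state(path):
    """Read a state object back from disk."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


# Stand-in for an Assembly: parameters plus per-sample state and file links.
state = {
    "name": "2014_data",
    "params": {"working_directory": "./test_refseq", "datatype": "rad"},
    "samples": {"1A_0": {"state": 1, "fastq": "./test_refseq/fastq/1A_0_R1_.gz"}},
}

tmpdir = tempfile.mkdtemp()
try:
    path = os.path.join(tmpdir, "data1.p")
    save_state(state, path)
    restored = load_state(path)
finally:
    shutil.rmtree(tmpdir)  # remove the temporary checkpoint
```

After the round trip `restored` is equal to `state`, including the link to the fastq file, even though no sequence data was written.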

In [None]:
from ipyrad import assemble

assemble.cluster_within.derep_and_sort( data1, data1.samples["3L_0"], 0 )

In [None]:
#sample = data1.samples["3L_0"]
#handle = sample.files["edits"]
#print handle.replace(".fasta", ".derep")
data1.vsearch = "/home/isaac/ipyrad-refseq/bin/vsearch-1.1.3-linux-x86_64"
data1.muscle = "/home/isaac/ipyrad-refseq/bin/muscle3.8.31_i86linux64"
data1.smalt = "/home/isaac/ipyrad-refseq/bin/smalt-0.7.6-linux-x86_64"

In [None]:
assemble.cluster_within.muscle_align( data1, data1.samples["1D_0"])

In [None]:
#print data1.paramsdict["sorted_fastq_path"]
data1.stats
data1.get_params()

In [None]:
print data1.get_params(1)

In [None]:
dir()