# Reference sequence mapping horserace

ipyrad can incorporate a reference sequence to aid assembly. It offers 4 assembly methods, 3 of which use a reference sequence in some way. Here we test reference-assisted assembly for ipyrad, stacks, and dDocent. Though AftrRAD performs nicely on empirical data and does allow for reference-assisted assembly, we consider its runtimes prohibitive and so exclude it from the analysis here.

Ideas for datasets (all have data in SRA):

    Selection and sex-biased dispersal in a coastal shark: the influence of philopatry on adaptive variation
    - 134 individuals (paper assembled w/ ddocent)
    
    Genome-wide data reveal cryptic diversity and genetic introgression in an Oriental cynopterine fruit bat radiation
    - < 45 samples, 2 reference genomes in the same family
    
    Beyond the Coral Triangle: high genetic diversity and near panmixia in Singapore's populations of the broadcast spawning sea star Protoreaster nodosus
    - 77 samples, it's a passerine, so there must be something reasonably close
    
    PSMC (pairwise sequentially Markovian coalescent) analysis of RAD (restriction site associated DNA) sequencing data
    - 17 sticklebacks, they used stacks
    
For this analysis we chose the paired-end ddRAD dataset from Lah et al 2016 (http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0162792#sec002):
    Spatially Explicit Analysis of Genome-Wide SNPs Detects Subtle Population Structure in a Mobile Marine Mammal, the Harbor Porpoise
    - 49 samples from 3 populations of European Harbor Porpoise


In [12]:
import subprocess
import ipyrad as ip
import shutil
import glob
import sys
import os

## Set the default directories for exec and data. 
WORK_DIR="/home/iovercast/manuscript-analysis/"
REFMAP_EMPIRICAL_DIR=os.path.join(WORK_DIR, "Phocoena_empirical/")
REFMAP_FASTQS=os.path.join(REFMAP_EMPIRICAL_DIR, "Final_Files_forDryad/Bbif_ddRADseq/fastq/")
IPYRAD_DIR=os.path.join(WORK_DIR, "ipyrad/")
STACKS_DIR=os.path.join(WORK_DIR, "stacks/")
DDOCENT_DIR=os.path.join(WORK_DIR, "dDocent/")

## (empirical data dir will be created for us when we untar it)
## Use `d` rather than `dir` so we don't shadow the Python builtin
for d in [WORK_DIR, REFMAP_EMPIRICAL_DIR, IPYRAD_DIR, STACKS_DIR, DDOCENT_DIR]:
    if not os.path.exists(d):
        os.makedirs(d)

## Fetch the Phocoena raw sequence data 
We will use the sra-toolkit command `fastq-dump` to pull the paired-end reads out of SRA. 
This may not be the best or the quickest way, but it gets the job done. It takes 
~30 minutes and requires ~70GB of space. After downloading the fastq files I looked at a
couple of random samples in FastQC to get an idea of where to trim in step 2.

In [None]:
os.chdir(REFMAP_EMPIRICAL_DIR)
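!mkdir -p raws
## Note: a `!cd` does not persist between shell escapes, so we stay in REFMAP_EMPIRICAL_DIR
## Grab the sra-toolkit pre-built binaries to download from SRA
## This works; comment out the wget/tar lines after the first run so they don't keep redownloading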
!wget http://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/2.8.0/sratoolkit.2.8.0-ubuntu64.tar.gz
!tar -xvzf sratoolkit*
FQ_DUMP = os.path.join(REFMAP_EMPIRICAL_DIR, "sratoolkit.2.8.0-ubuntu64/bin/fastq-dump")
res = subprocess.check_output(FQ_DUMP + " --version", shell=True)

## The SRR numbers for the samples from this bioproject range from SRR4291662 to SRR4291705
## so go fetch them one by one
for samp in range(662, 706):
    print("Doing {}\t".format(samp)),
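    ## --gzip and --outdir write compressed files into raws/, where the renaming step below expects them
    res = subprocess.check_output(FQ_DUMP + " --gzip --outdir raws --split-files SRR4291" + str(samp), shell=True)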


Doing 662	Doing 663	Doing 664	Doing 665	Doing 666	Doing 667	Doing 668	Doing 669	Doing 670	Doing 671	Doing 672	Doing 673	Doing 674	Doing 675	Doing 676	Doing 677	Doing 678	Doing 679	Doing 680	Doing 681	Doing 682	Doing 683	Doing 684	Doing 685	Doing 686	Doing 687	Doing 688	Doing 689	Doing 690	Doing 691	Doing 692	Doing 693	Doing 694	Doing 695	Doing 696	Doing 697	Doing 698	Doing 699	Doing 700	Doing 701	Doing 702	
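
As a small aside, the accession range described in the download cell can be built up front and checked before any download starts. This is only a sketch: `srr_accessions` is a hypothetical helper, not part of sra-toolkit or ipyrad, and it assumes the run numbers are contiguous, as the comment in the cell states.

```python
def srr_accessions(first, last, prefix="SRR"):
    """Return the accession IDs from `first` to `last`, inclusive."""
    if last < first:
        raise ValueError("last must be >= first")
    return ["{}{}".format(prefix, number) for number in range(first, last + 1)]


# The bioproject range quoted in the download cell.
accessions = srr_accessions(4291662, 4291705)
print(len(accessions), accessions[0], accessions[-1])
```

Building the list first makes it easy to confirm how many runs will be fetched before committing to a long download.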

In [48]:
## fastq-dump writes files named like SRR1234_1.fastq.gz, but ipyrad expects SRR1234_R1_.fastq.gz,
## so we have to fix the filenames. Filename hax...
for f in glob.glob(os.path.join(REFMAP_EMPIRICAL_DIR, "raws", "*.fastq.gz")):
    ## e.g. "SRR1234_1.fastq.gz" -> ["SRR1234", "1.fastq.gz"]
    splits = os.path.basename(f).split("_")
    newf = os.path.join(REFMAP_EMPIRICAL_DIR, "raws", splits[0] + "_R" + splits[1].split(".")[0] + "_.fastq.gz")
    os.rename(f, newf)
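
The renaming rule can also be written as a pure function, which makes it easy to check on a few names before touching any files. This is a sketch only: `ipyrad_style_name` is a hypothetical helper, and it assumes every input follows the `<accession>_<1|2>.fastq.gz` pattern that `fastq-dump --split-files` produces.

```python
import os


def ipyrad_style_name(path):
    """Map '<dir>/SRR1234_1.fastq.gz' to '<dir>/SRR1234_R1_.fastq.gz'."""
    dirname, fname = os.path.split(path)
    accession, rest = fname.split("_", 1)
    read_number = rest.split(".")[0]
    # Refuse anything that is not a plain read number, e.g. a file
    # that has already been renamed to the ipyrad convention.
    if read_number not in ("1", "2"):
        raise ValueError("unexpected file name: {}".format(fname))
    return os.path.join(dirname, "{}_R{}_.fastq.gz".format(accession, read_number))


print(ipyrad_style_name(os.path.join("raws", "SRR4291662_1.fastq.gz")))
```

Raising on already-renamed files means that re-running the cell cannot mangle the names a second time.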

## Fetch the bottlenose dolphin genome
We use the Tursiops truncatus (bottlenose dolphin) reference genome. Divergence time between dolphin and porpoise is approximately 15 Mya, which is on the order of the divergence between humans and orangutans. There is also a genome for the Minke whale, which is much more deeply diverged (~30 Mya); it could be interesting to try both to see how it works.

Minke whale - http://www.nature.com/ng/journal/v46/n1/full/ng.2835.html#accessions

SRA Data table for converting fq files to sample name as used in the paper: file:///home/chronos/u-b20882bda0f801c7265d4462f127d0cb4376d46d/Downloads/Tune/SraRunTable.txt

In [30]:
## `!cd` does not persist between shell escapes, so create the directory and chdir from Python
REF_DIR = os.path.join(REFMAP_EMPIRICAL_DIR, "TurtrunRef")
if not os.path.exists(REF_DIR):
    os.makedirs(REF_DIR)
os.chdir(REF_DIR)
!wget ftp://ftp.ensembl.org/pub/release-87/fasta/tursiops_truncatus/dna/Tursiops_truncatus.turTru1.dna_rm.toplevel.fa.gz

## Ensembl distributes gzip'd reference sequence files, but samtools really wants it to be bgzipped or uncompressed
!gunzip Tursiops_truncatus.turTru1.dna_rm.toplevel.fa.gz

--2016-12-18 10:29:57--  ftp://ftp.ensembl.org/pub/release-87/fasta/tursiops_truncatus/dna/Tursiops_truncatus.turTru1.dna_rm.toplevel.fa.gz
           => ‘Tursiops_truncatus.turTru1.dna_rm.toplevel.fa.gz’
Resolving ftp.ensembl.org (ftp.ensembl.org)... 193.62.203.85
Connecting to ftp.ensembl.org (ftp.ensembl.org)|193.62.203.85|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD (1) /pub/release-87/fasta/tursiops_truncatus/dna ... done.
==> SIZE Tursiops_truncatus.turTru1.dna_rm.toplevel.fa.gz ... 453168562
==> PASV ... done.    ==> RETR Tursiops_truncatus.turTru1.dna_rm.toplevel.fa.gz ... done.
Length: 453168562 (432M) (unauthoritative)


2016-12-18 10:33:32 (2.03 MB/s) - ‘Tursiops_truncatus.turTru1.dna_rm.toplevel.fa.gz’ saved [453168562]



## Trim reads w/ cutadapt
To reduce any potential bias introduced by differences in trimming and filtering methods, we will trim and filter the raw reads with cutadapt and use this QC'd dataset as the starting point for assembly for all 3 programs. If you are wondering, you __can__ run fastqc from the command line, like this:

    fastqc -o fastqc_out/ SRR4291662_1.fastq SRR4291662_2.fastq
    
We'll trim R1 and R2 to 85bp following Lah et al 2016. The `-l` flag for cutadapt specifies the length to which each read will be trimmed.
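
To make the effect of fixed-length trimming concrete, here is a minimal pure-Python sketch of what shortening a read to a given length means for one FASTQ record. It is illustrative only and is not cutadapt's implementation; `hard_trim` and the toy record are hypothetical.

```python
def hard_trim(record, length):
    """Shorten one FASTQ record to at most `length` bases.

    `record` is a (header, sequence, separator, quality) tuple. Reads
    already shorter than `length` are returned unchanged.
    """
    if length < 0:
        raise ValueError("length must be non-negative")
    header, seq, sep, qual = record
    # Sequence and quality strings must stay the same length.
    return (header, seq[:length], sep, qual[:length])


read = ("@read1", "TGCAGACGTACGTA", "+", "IIIIIIIIIIIIII")
print(hard_trim(read, 5))
```

The sequence and quality strings are cut at the same position, so the trimmed record is still valid FASTQ.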

In [53]:
%%bash -s "$REFMAP_EMPIRICAL_DIR"
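cd $1
mkdir -p trimmed
## -l 85 shortens each read to 85bp (needs a cutadapt version that supports -l/--length);
## -o writes the output file, compressed according to its .gz extension
for i in `ls raws`; do echo $i; cutadapt -l 85 -o trimmed/$i raws/$i; done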

SRR4291662_R1_.fastq.gz
SRR4291662_R2_.fastq.gz
SRR4291663_R1_.fastq.gz
SRR4291663_R2_.fastq.gz
SRR4291664_R1_.fastq.gz
SRR4291664_R2_.fastq.gz
SRR4291665_R1_.fastq.gz
SRR4291665_R2_.fastq.gz
SRR4291666_R1_.fastq.gz
SRR4291666_R2_.fastq.gz
SRR4291667_R1_.fastq.gz
SRR4291667_R2_.fastq.gz
SRR4291668_R1_.fastq.gz
SRR4291668_R2_.fastq.gz
SRR4291669_R1_.fastq.gz
SRR4291669_R2_.fastq.gz
SRR4291670_R1_.fastq.gz
SRR4291670_R2_.fastq.gz
SRR4291671_R1_.fastq.gz
SRR4291671_R2_.fastq.gz
SRR4291672_R1_.fastq.gz
SRR4291672_R2_.fastq.gz
SRR4291673_R1_.fastq.gz
SRR4291673_R2_.fastq.gz
SRR4291674_R1_.fastq.gz
SRR4291674_R2_.fastq.gz
SRR4291675_R1_.fastq.gz
SRR4291675_R2_.fastq.gz
SRR4291676_R1_.fastq.gz
SRR4291676_R2_.fastq.gz
SRR4291677_R1_.fastq.gz
SRR4291677_R2_.fastq.gz
SRR4291678_R1_.fastq.gz
SRR4291678_R2_.fastq.gz
SRR4291679_R1_.fastq.gz
SRR4291679_R2_.fastq.gz
SRR4291680_R1_.fastq.gz
SRR4291680_R2_.fastq.gz
SRR4291681_R1_.fastq.gz
SRR4291681_R2_.fastq.gz
SRR4291682_R1_.fastq.gz
SRR4291682_R2_.f

mkdir: cannot create directory ‘trimmed’: File exists
cutadapt version 1.11
Copyright (C) 2010-2016 Marcel Martin <marcel.martin@scilifelab.se>

cutadapt removes adapter sequences from high-throughput sequencing reads.

Usage:
    cutadapt -a ADAPTER [options] [-o output.fastq] input.fastq

For paired-end reads:
    cutadapt -a ADAPT1 -A ADAPT2 [options] -o out1.fastq -p out2.fastq in1.fastq in2.fastq

Replace "ADAPTER" with the actual sequence of your 3' adapter. IUPAC wildcard
characters are supported. The reverse complement is *not* automatically
searched. All reads from input.fastq will be written to output.fastq with the
adapter sequence removed. Adapter matching is error-tolerant. Multiple adapter
sequences can be given (use further -a options), but only the best-matching
adapter will be removed.

Input may also be in FASTA format. Compressed input and output is supported and
auto-detected from the file name (.gz, .xz, .bz2). Use the file name '-' for
standard input/output. Witho

In [39]:
IPYRAD_REFMAP_DIR = os.path.join(REFMAP_EMPIRICAL_DIR, "ipyrad")
if not os.path.exists(IPYRAD_REFMAP_DIR):
    os.makedirs(IPYRAD_REFMAP_DIR)
os.chdir(IPYRAD_REFMAP_DIR)

## Make a new assembly and set some assembly parameters
data = ip.Assembly("refmap-empirical")
## Start from the trimmed reads, the shared QC'd input for all 3 programs
data.set_params("sorted_fastq_path", REFMAP_EMPIRICAL_DIR + "trimmed/*.fastq.gz")
data.set_params("project_dir", "reference-assembly")
data.set_params("assembly_method", "reference")
## The reference was gunzipped after download, so point at the uncompressed .fa
data.set_params("reference_sequence", REFMAP_EMPIRICAL_DIR + "TurtrunRef/Tursiops_truncatus.turTru1.dna_rm.toplevel.fa")
data.set_params("datatype", "pairddrad")
data.set_params("restriction_overhang", ("TGCAG", "CGG"))
data.set_params("clust_threshold", 0.9)
data.set_params('max_low_qual_bases', 30)
data.set_params('min_samples_locus', 3)
data.set_params('trim_overhang', (0,15,0,15))

data.write_params(force=True)

cmd = "ipyrad -p params-refmap-empirical.txt -s 1 --force"
print(cmd)
!time $cmd

  New Assembly: refmap-empirical
ipyrad -p params-refmap-empirical.txt -s 1 --force

 -------------------------------------------------------------
  ipyrad [v.0.5.10]
  Interactive assembly and analysis of RAD-seq data
 -------------------------------------------------------------
  New Assembly: refmap-empirical
  local compute node: [40 cores] on node001

  Step 1: Loading sorted fastq data to Samples

  Encountered an unexpected error (see ./ipyrad_log.txt)
  Error message is below -------------------------------
float division by zero


real	0m12.615s
user	0m1.395s
sys	0m0.344s


## Do ipyrad refmap empirical

## Do Stacks refmap empirical

In [None]:
STACKS_REFMAP_DIR = os.path.join(REFMAP_EMPIRICAL_DIR, "stacks")
if not os.path.exists(STACKS_REFMAP_DIR):
    os.makedirs(STACKS_REFMAP_DIR)
os.chdir(STACKS_REFMAP_DIR)



## Do dDocent refmap empirical

In [None]:
DDOCENT_REFMAP_DIR = os.path.join(REFMAP_EMPIRICAL_DIR, "ddocent")
if not os.path.exists(DDOCENT_REFMAP_DIR):
    os.makedirs(DDOCENT_REFMAP_DIR)
os.chdir(DDOCENT_REFMAP_DIR)