# Analysing simulated and empirical datasets with stacks, pyrad, and ipyrad

This ipython notebook will provide a completely reproducible record of
all the analyses for the ipyrad manuscript.

All analyses were performed on a 40 core Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz 
Silicon Mechanics compute node with 120GB of main memory running CentOS Linux release 7.1.1503.

* Insert thanks to the CCNY scientific computing cluster folks here


### Notebook Tunnels
Mostly notes for myself, since my cluster config is complicated. To get a connection to a notebook on a compute node I have to bounce off 2 different machines: a gateway to get on campus, then another gateway (the head hpc node) so I can talk to the cluster. The ssh commands will spawn processes in the background to shuttle communication between the ports. Kinda fancy.

On the cluster compute node (where you actually want to do the work) run:

    ipython notebook --no-browser --port=8888 &
  
On the cluster head node run this to forward connections from the campus to a node inside the hpc infrastructure:

    ssh -N -f -L localhost:8888:localhost:8888 <username>@<compute_node>

On your gateway box run:

    ssh -N -f -L localhost:8887:localhost:8888 <username>@<head_node>
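
The two forwarding hops above follow the same pattern, so they can be generated from one helper. This is a minimal sketch for keeping the port numbers straight; the user and host names are hypothetical placeholders, and `tunnel_command` is an illustrative name rather than anything used elsewhere in this notebook.

```python
def tunnel_command(local_port, remote_port, user, host):
    """Return an ssh command that forwards local_port to remote_port on host."""
    # -N: run no remote command, -f: go to background, -L: local port forward
    return "ssh -N -f -L localhost:{}:localhost:{} {}@{}".format(
        local_port, remote_port, user, host)

# Hypothetical user and host names, for illustration only
hops = [
    ("head node -> compute node", tunnel_command(8888, 8888, "alice", "compute-node")),
    ("gateway -> head node", tunnel_command(8887, 8888, "alice", "head-node")),
]
for label, cmd in hops:
    print("{}: {}".format(label, cmd))
```

With both tunnels up, the notebook should be reachable on port 8887 of the gateway box, since that is the local port of the last hop.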


## Prerequisites (download and install data and executables)

In [1]:
import shutil
import glob
import sys
import os

## Set the default directories for exec and data. 
WORK_DIR="/home/iovercast/manuscript-analysis/"
EMPERICAL_DATA_DIR=os.path.join(WORK_DIR, "example_empirical_rad/")
SIMULATION_DATA_DIR=os.path.join(WORK_DIR, "simulated_data")
IPYRAD_DIR=os.path.join(WORK_DIR, "ipyrad/")
PYRAD_DIR=os.path.join(WORK_DIR, "pyrad/")
STACKS_DIR=os.path.join(WORK_DIR, "stacks/")
AFTRRAD_DIR=os.path.join(WORK_DIR, "aftrRAD/")
DDOCENT_DIR=os.path.join(WORK_DIR, "dDocent/")

## (empirical data dir will be created for us when we untar it)
for dir in [WORK_DIR, IPYRAD_DIR, PYRAD_DIR, STACKS_DIR, AFTRRAD_DIR, DDOCENT_DIR]:
    if not os.path.exists(dir):
        os.makedirs(dir)

## Empirical output directories
IPYRAD_OUTPUT=os.path.join(IPYRAD_DIR, "REALDATA/")
PYRAD_OUTPUT=os.path.join(PYRAD_DIR, "REALDATA/")
STACKS_OUTPUT=os.path.join(STACKS_DIR, "REALDATA/")
STACKS_GAP_OUT=os.path.join(STACKS_OUTPUT, "gapped/")
STACKS_UNGAP_OUT=os.path.join(STACKS_OUTPUT, "ungapped/")
STACKS_DEFAULT_OUT=os.path.join(STACKS_OUTPUT, "default/")
AFTRRAD_OUTPUT=os.path.join(AFTRRAD_DIR, "REALDATA")
DDOCENT_OUTPUT=os.path.join(DDOCENT_DIR, "REALDATA")

## Make the empirical output directories if they don't already exist
for dir in [IPYRAD_OUTPUT, PYRAD_OUTPUT, STACKS_OUTPUT,\
            STACKS_GAP_OUT, STACKS_UNGAP_OUT, STACKS_DEFAULT_OUT, AFTRRAD_OUTPUT, DDOCENT_OUTPUT]:
    if not os.path.exists(dir):
        os.makedirs(dir)

os.chdir(WORK_DIR)


In [4]:
### Fetch the pedicularis data

## curl grabs the data from a public dropbox url
## the curl command uses an upper-case O argument, not a zero.
!curl -LkO https://dl.dropboxusercontent.com/u/2538935/example_empirical_rad.tar.gz

## the tar command decompresses the data directory
!tar -xvzf example_empirical_rad.tar.gz


  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 1042M  100 1042M    0     0  17.8M      0  0:00:58  0:00:58 --:--:-- 14.5M
example_empirical_rad/
example_empirical_rad/38362_rex.fastq.gz
example_empirical_rad/32082_przewalskii.fastq.gz
example_empirical_rad/40578_rex.fastq.gz
example_empirical_rad/30686_cyathophylla.fastq.gz
example_empirical_rad/39618_rex.fastq.gz
example_empirical_rad/41954_cyathophylloides.fastq.gz
example_empirical_rad/41478_cyathophylloides.fastq.gz
example_empirical_rad/33588_przewalskii.fastq.gz
example_empirical_rad/35855_rex.fastq.gz
example_empirical_rad/35236_rex.fastq.gz
example_empirical_rad/29154_superba.fastq.gz
example_empirical_rad/30556_thamno.fastq.gz
example_empirical_rad/33413_thamno.fastq.gz


In [195]:
## The original Heliconius analysis from pyrad v3
## Fetch the heliconius genome and rad data
##
## Davey, John W., et al. "Major improvements to the Heliconius melpomene 
## genome assembly used to confirm 10 chromosome fusion events in 6 
## million years of butterfly evolution." G3: Genes| Genomes| Genetics 6.3
## (2016): 695-708.

!curl -LkO http://butterflygenome.org/sites/default/files/Hmel2-0_Release_20160201.tgz
!tar -xvzf Hmel2-0_Release_20160201.tgz

## Several RAD datasets are available
##
## Heliconius Genome Consortium. (2012). Butterfly genome reveals 
## promiscuous exchange of mimicry adaptations among species. 
## Nature, 487(7405), 94-98.

## European Nucleotide Archive, Accession ERP000991
## Heliconius melpomene melpomene x Heliconius melpomene rosina - ERP000993

## And from the Davey et al 2016
## European Nucleotide Archive (ENA), accession PRJEB11288

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 93.6M  100 93.6M    0     0  12.1M      0  0:00:07  0:00:07 --:--:-- 14.8M
./._Hmel2
Hmel2/
Hmel2/._annotation
Hmel2/annotation/
Hmel2/._ChangeLog.txt
Hmel2/ChangeLog.txt
Hmel2/._Hmel2.fa
Hmel2/Hmel2.fa
Hmel2/._Hmel_mtDNA.fa
Hmel2/Hmel_mtDNA.fa
Hmel2/._maps
Hmel2/maps/
Hmel2/._README.txt
Hmel2/README.txt
Hmel2/._repeats
Hmel2/repeats/
Hmel2/._transfer
Hmel2/transfer/
Hmel2/transfer/._Hmel1-1_Hmel2.chain
Hmel2/transfer/Hmel1-1_Hmel2.chain
Hmel2/transfer/._Hmel2_broken.gff
Hmel2/transfer/Hmel2_broken.gff
Hmel2/transfer/._Hmel2_removed.gff
Hmel2/transfer/Hmel2_removed.gff
Hmel2/transfer/._Hmel2_transfer_new.tsv
Hmel2/transfer/Hmel2_transfer_new.tsv
Hmel2/transfer/._Hmel2_transfer_old.tsv
Hmel2/transfer/Hmel2_transfer_old.tsv
Hmel2/repeats/._Hmel.all.named.final.1-31.lib
Hmel2/repeats/Hmel.all.named.final.1-31.lib
Hmel2/maps/._Hme

## Prereqs - Install ipyrad

Full install details for all platforms are here: http://ipyrad.readthedocs.io/installation.html

In [69]:
%%bash -s "$WORK_DIR"
## Must always export the new miniconda path in each bash cell. We do this
## so the analysis pipeline is totally self contained: it doesn't read from
## or write to anywhere but the WORK_DIR.
## There's probably a magic way to do this but I can't figure it out
export PATH="$1/miniconda/bin:$PATH"; export "WORK_DIR=$1"

## Fetch the latest miniconda installer
wget --quiet https://repo.continuum.io/miniconda/Miniconda-latest-Linux-x86_64.sh

## Install miniconda silently to the work directory.
## -b means "batch" mode, -f is force overwrite and -p is the install dir
bash Miniconda-latest-Linux-x86_64.sh -b -f -p $WORK_DIR/miniconda

conda update -y conda                 ## updates conda
conda install -y -c ipyrad ipyrad     ## installs the latest release (silently `-y`)

/home/iovercast/manuscript-analysis//miniconda/bin:/home/iovercast/opt/miniconda/bin:/home/iovercast/opt/bin:/cm/shared/apps/slurm/14.11.6/sbin:/cm/shared/apps/slurm/14.11.6/bin:/cm/local/apps/gcc/5.1.0/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/usr/sbin:/home/iovercast/.local/bin:/home/iovercast/bin
PREFIX=/home/iovercast/manuscript-analysis/miniconda
installing: _cache-0.0-py27_x0 ...
installing: python-2.7.11-0 ...
installing: conda-env-2.4.5-py27_0 ...
installing: openssl-1.0.2g-0 ...
installing: pycosat-0.6.1-py27_0 ...
installing: pyyaml-3.11-py27_1 ...
installing: readline-6.2-2 ...
installing: requests-2.9.1-py27_0 ...
installing: sqlite-3.9.2-0 ...
installing: tk-8.5.18-0 ...
installing: yaml-0.1.6-0 ...
installing: zlib-1.2.8-0 ...
installing: conda-4.0.5-py27_0 ...
installing: pycrypto-2.6.1-py27_0 ...
installing: pip-8.1.1-py27_1 ...
installing: wheel-0.29.0-py27_0 ...
installing: setuptools-20.3-py27_0 ...
creating default environment...
installation fini

Python 2.7.11 :: Continuum Analytics, Inc.


In [70]:
%%bash -s "$WORK_DIR"
export PATH="$1/miniconda/bin:$PATH"
which ipyrad

/home/iovercast/manuscript-analysis/miniconda/bin/ipyrad
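
Each bash cell above re-exports `PATH` by hand. One alternative, sketched here under the assumption that running commands through `subprocess` is acceptable, is to build the environment once in Python and pass it to every call. `conda_env` and `run` are hypothetical helper names, and the work directory in the example is made up.

```python
import os
import subprocess

def conda_env(work_dir, base_env=None):
    """Copy an environment and put <work_dir>/miniconda/bin first on PATH."""
    env = dict(os.environ if base_env is None else base_env)
    conda_bin = os.path.join(work_dir, "miniconda", "bin")
    env["PATH"] = conda_bin + os.pathsep + env.get("PATH", "")
    env["WORK_DIR"] = work_dir
    return env

def run(cmd, work_dir):
    """Run a shell command with the work-dir miniconda first on PATH."""
    return subprocess.check_output(cmd, shell=True, env=conda_env(work_dir),
                                   stderr=subprocess.STDOUT)

# Hypothetical work dir; this only inspects PATH and runs nothing external
env = conda_env("/tmp/manuscript-analysis", base_env={"PATH": "/usr/bin"})
print(env["PATH"])
```

A call such as `run("which ipyrad", WORK_DIR)` would then resolve binaries from the self-contained miniconda install first, which is the same effect the `export PATH=...` line has inside each bash cell.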


## Prereqs - Install pyrad


In [71]:
%%bash -s "$WORK_DIR"
export PATH="$1/miniconda/bin:$PATH"; export "WORK_DIR=$1"

## Fetch and install prerequisites. This takes a few minutes so be patient

## Should be unnecessary because numpy and scipy already installed by conda
conda install -y numpy scipy
wget http://www.drive5.com/muscle/downloads3.8.31/muscle3.8.31_i86linux64.tar.gz
tar -xvzf muscle*.tar.gz

## This unpacks the muscle binary in the current working directory
## You can test it to see if it runs:
./muscle3.8.31_i86linux64 -h

## Copy to miniconda/bin so it will be in your path
## Maybe not technically 'correct' but it'll work for our purposes
cp muscle3.8.31_i86linux64 $WORK_DIR/miniconda/bin/muscle

## Download and install vsearch
wget https://github.com/torognes/vsearch/releases/download/v2.0.3/vsearch-2.0.3-linux-x86_64.tar.gz
tar xzf vsearch-2.0.3-linux-x86_64.tar.gz
cp vsearch-2.0.3-linux-x86_64/bin/vsearch $WORK_DIR/miniconda/bin/vsearch

## Fetch pyrad source from the git repository
git clone https://github.com/dereneaton/pyrad.git
cd pyrad

## If you are using anaconda you can simply run setup.py to
## install pyrad into your conda environment
python setup.py install

which muscle
which vsearch
which pyrad

Fetching package metadata .......
Solving package specifications: ..........

# All requested packages already installed.
# packages in environment at /home/iovercast/manuscript-analysis/miniconda:
#
numpy                     1.11.1                   py27_0  
scipy                     0.18.0              np111py27_0  
muscle3.8.31_i86linux64
running install
running bdist_egg
running egg_info
creating pyrad.egg-info
writing requirements to pyrad.egg-info/requires.txt
writing pyrad.egg-info/PKG-INFO
writing top-level names to pyrad.egg-info/top_level.txt
writing dependency_links to pyrad.egg-info/dependency_links.txt
writing entry points to pyrad.egg-info/entry_points.txt
writing manifest file 'pyrad.egg-info/SOURCES.txt'
reading manifest file 'pyrad.egg-info/SOURCES.txt'
writing manifest file 'pyrad.egg-info/SOURCES.txt'
installing library code to build/bdist.linux-x86_64/egg
running install_lib
running build_py
creating build
creating build/lib
creating build/lib/pyrad
copying pyrad/Dt

Using Anaconda API: https://api.anaconda.org
--2016-09-01 17:05:22--  http://www.drive5.com/muscle/downloads3.8.31/muscle3.8.31_i86linux64.tar.gz
Resolving www.drive5.com (www.drive5.com)... 199.195.116.69
Connecting to www.drive5.com (www.drive5.com)|199.195.116.69|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 487906 (476K) [application/x-gzip]
Saving to: ‘muscle3.8.31_i86linux64.tar.gz.1’

     0K .......... .......... .......... .......... .......... 10%  961K 0s
    50K .......... .......... .......... .......... .......... 20% 1.93M 0s
   100K .......... .......... .......... .......... .......... 31% 1.93M 0s
   150K .......... .......... .......... .......... .......... 41% 2.72M 0s
   200K .......... .......... .......... .......... .......... 52% 3.05M 0s
   250K .......... .......... .......... .......... .......... 62% 2.52M 0s
   300K .......... .......... .......... .......... .......... 73% 2.54M 0s
   350K .......... .......... .......... .....

### Prereqs - Download and install stacks

**NB:** Stacks is trickier to build on OSX. I had to compile it on a different 
machine and then scp the binaries to the box I was working on.

Stacks is picky about where stuff installs to. If you don't have permission to install to /usr/local (true on most HPC systems) then you need to provide the `--prefix` argument to `./configure`.

In [74]:
%%bash -s "$WORK_DIR"
export PATH="$1/miniconda/bin:$PATH"; export "WORK_DIR=$1"
cd $WORK_DIR

## 1.42 is the version that crashed during cstacks on the simdata. Julian fixed
## the bug and pushed 1.43 which seems to work great.
#wget http://catchenlab.life.illinois.edu/stacks/source/stacks-1.42.tar.gz
wget http://catchenlab.life.illinois.edu/stacks/source/stacks-1.43.tar.gz
tar -xvzf stacks-1.43.tar.gz
cd stacks-1.43
./configure --prefix=$WORK_DIR/miniconda
make
make install

cd $WORK_DIR
which process_radtags

stacks-1.42/
stacks-1.42/acinclude.m4
stacks-1.42/aclocal.m4
stacks-1.42/autogen.sh
stacks-1.42/ChangeLog
stacks-1.42/config/
stacks-1.42/config.h.in
stacks-1.42/configure
stacks-1.42/configure.ac
stacks-1.42/htslib/
stacks-1.42/INSTALL
stacks-1.42/LICENSE
stacks-1.42/Makefile.am
stacks-1.42/Makefile.in
stacks-1.42/php/
stacks-1.42/README
stacks-1.42/scripts/
stacks-1.42/sql/
stacks-1.42/src/
stacks-1.42/tests/
stacks-1.42/tests/kmer_filter.t
stacks-1.42/tests/process_radtags.t
stacks-1.42/tests/pstacks.t
stacks-1.42/tests/ustacks.t
stacks-1.42/src/aln_utils.cc
stacks-1.42/src/aln_utils.h
stacks-1.42/src/BamI.h
stacks-1.42/src/BamUnalignedI.h
stacks-1.42/src/bootstrap.h
stacks-1.42/src/BowtieI.h
stacks-1.42/src/BustardI.h
stacks-1.42/src/catalog_utils.cc
stacks-1.42/src/catalog_utils.h
stacks-1.42/src/clean.cc
stacks-1.42/src/clean.h
stacks-1.42/src/clone_filter.cc
stacks-1.42/src/clone_filter.h
stacks-1.42/src/cmb.cc
stacks-1.42/src/cmb.h
stacks-1.42/src/constants.h
stacks-1.42/src/cs

--2016-09-01 17:11:52--  http://catchenlab.life.illinois.edu/stacks/source/stacks-1.42.tar.gz
Resolving catchenlab.life.illinois.edu (catchenlab.life.illinois.edu)... 130.126.48.94
Connecting to catchenlab.life.illinois.edu (catchenlab.life.illinois.edu)|130.126.48.94|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 897951 (877K) [application/x-gzip]
Saving to: ‘stacks-1.42.tar.gz.1’

     0K .......... .......... .......... .......... ..........  5%  790K 1s
    50K .......... .......... .......... .......... .......... 11% 1.00M 1s
   100K .......... .......... .......... .......... .......... 17%  777K 1s
   150K .......... .......... .......... .......... .......... 22%  637K 1s
   200K .......... .......... .......... .......... .......... 28% 91.4M 1s
   250K .......... .......... .......... .......... .......... 34%  559K 1s
   300K .......... .......... .......... .......... .......... 39% 92.6M 1s
   350K .......... .......... .......... .......... ....

### Prereqs - Download and install aftrRAD

Lots of gnarly prereqs. Also, the `Genotype.pl` script prompts the user a couple of times 
to confirm whether to remove individuals with lots of missing data. I
modified the script for the simulated data so it just passes through without prompting,
which lets me automate testing.

In [338]:
%%bash -s "$WORK_DIR" "$AFTRRAD_DIR"
export PATH="$1/miniconda/bin:$PATH"; export "WORK_DIR=$1"; export "AFTRRAD_DIR=$2"
cd $AFTRRAD_DIR

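## Get a copy of acana and unpack it
wget http://www.niehs.nih.gov/research/resources/assets/docs/acana_linux_x64tgz.tgz
tar -xvzf acana_linux_x64tgz.tgz
## ACANA_DIR stands in for whatever directory the archive unpacks to
cp ACANA_DIR/ACANA .
cp ACANA_DIR/dnaMatrix .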

## ... and mafft
wget http://mafft.cbrc.jp/alignment/software/mafft-7.305-with-extensions-src.tgz
tar -xvzf mafft-7.305-with-extensions-src.tgz
cd mafft-7.305-with-extensions/core
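## Point the Makefile PREFIX at our miniconda dir. Double quotes are needed so
## the shell expands $WORK_DIR inside the sed expression.
## This won't work on mac (`-i` flag acts different)
sed -i "s|PREFIX = /usr/local|PREFIX = $WORK_DIR/miniconda|" Makefile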
make; make install

## ... and R
## R turns out to be a complete nightmare to install if you don't have root
## using brew makes it easier, but still not v straightforward
## Brew installing R takes a while so be patient (>10 minutes)
cd $AFTRRAD_DIR
git clone https://github.com/Linuxbrew/brew.git
./brew/bin/brew install homebrew/science/r
## Copy the R binary to the miniconda bin dir so it will be available
cp brew/bin/R $WORK_DIR/miniconda/bin/

## Some hackish bullshit to get cpan working in the event you are _not_ root
## which is most everybody
wget -O- http://cpanmin.us | perl - -l ~/perl5 App::cpanminus local::lib
eval `perl -I ~/perl5/lib/perl5 -Mlocal::lib`
echo 'eval `perl -I $WORK_DIR/perl5/lib/perl5 -Mlocal::lib`' >> $WORK_DIR/.profile
echo 'export MANPATH=$WORK_DIR/perl5/man:$MANPATH' >> $WORK_DIR/.profile
source $WORK_DIR/.profile
cpanm Parallel::ForkManager

## Actually get aftrRAD
git clone https://github.com/mikesovic/AftrRAD.git

Process is interrupted.


### Install dDocent

dDocent installs a bunch of business into its working directory so you have to be
sure to update the PATH in a similar way to how we do it with miniconda, i.e.:
        
        %%bash -s "$WORK_DIR" "$DDOCENT_DIR"
        export PATH="$2/:$1/miniconda/bin:$PATH";

From the dDocent docs: "Now if you are using a Mac computer, things get a little trickier." This part of the manuscript pipeline is only tested and known to work on Linux.

There is an install shell script that tries to install most of the deps, but several failed and I had to install them by hand (freebayes, bwa, samtools, seqtk, cd-hit-est). Drag. Freebayes turned out to be a nightmare of deps as well, so it took hours to figure out how to install. Ultimately I had to compile it on another machine and copy the binaries over. I also had to update my LD_LIBRARY_PATH to get the vcflib tools to work: 

        export LD_LIBRARY_PATH=/home/iovercast/manuscript-analysis/dDocent/freebayes-src/vcflib/tabixpp/htslib/:$LD_LIBRARY_PATH
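
Since several dependencies had to be installed by hand, it can help to check which ones are visible before launching dDocent. The following is a minimal sketch, assuming Python 3 (for `shutil.which`); `missing_tools` is a hypothetical helper, and the tool list is just the set of hand-installed programs named above.

```python
import os
import shutil

# Dependencies named in the text above as needing manual installation
DDOCENT_DEPS = ["freebayes", "bwa", "samtools", "seqtk", "cd-hit-est"]

def missing_tools(tools, extra_dirs=()):
    """Return the tools that cannot be found on PATH (plus extra_dirs)."""
    search_path = os.pathsep.join(list(extra_dirs) + [os.environ.get("PATH", "")])
    return [t for t in tools if shutil.which(t, path=search_path) is None]

missing = missing_tools(DDOCENT_DEPS)
print("missing:", missing if missing else "none")
```

Passing the dDocent directory through `extra_dirs` mimics prepending it to `PATH`, so the check sees the same binaries dDocent itself would find.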

In [435]:
#%%bash -s "$WORK_DIR" -s "$DDOCENT_DIR"
#export PATH="$1/miniconda/bin:$PATH"; export "WORK_DIR=$1"; export "DDOCENT_DIR=$2"
os.chdir(WORK_DIR)
#force = True
force = ""
if force:
    shutil.rmtree(DDOCENT_DIR)
    cmd = "git clone https://github.com/jpuritz/dDocent.git"
    !$cmd

    os.chdir("dDocent")
    ## This will run for a long time and install a bunch of binaries to the dDocent dir
    cmd = "sh install_dDocent_requirements " + DDOCENT_DIR
    !$cmd
    cmd = "chmod 777 {}/dDocent".format(DDOCENT_DIR)
    !$cmd

## test
cmd = "export PATH={}:$PATH; dDocent".format(DDOCENT_DIR)
!$cmd

/home/iovercast/manuscript-analysis/dDocent/:/home/iovercast/opt/miniconda/bin:/home/iovercast/opt/bin:/cm/shared/apps/slurm/14.11.6/sbin:/cm/shared/apps/slurm/14.11.6/bin:/cm/local/apps/gcc/5.1.0/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/usr/sbin:/home/iovercast/.local/bin:/home/iovercast/bin
dDocent 2.24 

Contact jpuritz@gmail.com with any problems 

 
Checking for required software
The dependency freebayes is not installed or is not in your $PATH.
The dependency bwa is not installed or is not in your $PATH.
The dependency samtools is not installed or is not in your $PATH.
The dependency seqtk is not installed or is not in your $PATH.
The dependency cd-hit-est is not installed or is not in your $PATH.
/home/iovercast/manuscript-analysis/dDocent/dDocent: line 46: [: : integer expression expected
The version of Samtools installed in your $PATH is not optimized for dDocent.
Please install at least version 1.3.0


# Simulation Analysis

We will use `simrrls` to generate some simulated RAD-seq data for testing. This is a program that was written by Deren Eaton and is available on github: github.com/dereneaton/simrrls.git. `simrrls` requires the python egglib module, which is somewhat painful to install. Short version:
* Install egglib (painful)
 * Good directions: http://wjidea.github.io/2016/installEgglib.html
* Install simrrls
 * git clone https://github.com/dereneaton/simrrls.git
 * cd simrrls
 * python setup.py install

In [None]:
%%bash -s "$WORK_DIR"
export PATH="$1/miniconda/bin:$PATH"; export "WORK_DIR=$1"
cd $WORK_DIR

## This is the long and painful version of installing egglib.
## This is what I did. It's not guaranteed to work on 
## different platforms, but it should be at least a guide to follow.

## no cmake on my hpc compute node, so we have to install it ourselves
## grab the silent installer and tell it to write into the 
## miniconda dir, since we're already adding that to path
wget https://cmake.org/files/v3.6/cmake-3.6.1-Linux-x86_64.sh
bash cmake-3.6.1-Linux-x86_64.sh --prefix=$WORK_DIR/miniconda --exclude-subdir
cmake --version

## Install bio++. Similar move here, we'll install to the miniconda
## directory because we don't have any assurance of root on the box
mkdir bpp
cd bpp
wget http://biopp.univ-montp2.fr/repos/sources/bpp-core-2.2.0.tar.gz
wget http://biopp.univ-montp2.fr/repos/sources/bpp-seq-2.2.0.tar.gz
wget http://biopp.univ-montp2.fr/repos/sources/bpp-popgen-2.2.0.tar.gz
tar -xvzf bpp-core-2.2.0.tar.gz
tar -xvzf bpp-seq-2.2.0.tar.gz
tar -xvzf bpp-popgen-2.2.0.tar.gz
cd bpp-core-2.2.0
cmake -DCMAKE_INSTALL_PREFIX=$WORK_DIR/miniconda .
make
make install
cd ../bpp-seq-2.2.0
cmake -DCMAKE_INSTALL_PREFIX=$WORK_DIR/miniconda .
make
make install
cd ../bpp-popgen-2.2.0
cmake -DCMAKE_INSTALL_PREFIX=$WORK_DIR/miniconda .
make
make install

## install egglib-cpp and egglib python module
## NB: simrrls requires egglib 2 
wget http://mycor.nancy.inra.fr/egglib/releases/2.1.11/egglib-cpp-2.1.11.tar.gz
wget http://mycor.nancy.inra.fr/egglib/releases/2.1.11/egglib-py-2.1.11.tar.gz
tar -xvzf egglib-cpp-2.1.11.tar.gz
tar -xvzf egglib-py-2.1.11.tar.gz
cd egglib-cpp-2.1.11
./configure --prefix=$WORK_DIR/miniconda
make
make install
cd ../egglib-py-2.1.11/
python setup.py build --prefix=$WORK_DIR/miniconda
python setup.py install --prefix=$WORK_DIR/miniconda

## Install simrrls
git clone https://github.com/dereneaton/simrrls.git
cd simrrls
python setup.py install
simrrls --help

## Simulating different RAD-datasets
Both pyRAD and stacks have undergone a lot of work since the original pyrad analysis. Because improvements have been made, we want to test the performance of all the current pipelines and be able to compare current to past performance. We'll follow the original pyRAD manuscript analysis (Eaton 2013) by simulating modest sized datasets with variable amounts of indels. We'll also simulate one much larger dataset. Also, because stacks has since added an option for gapped assembly, we'll test both gapped and ungapped assembly.

### Tuning simrrls indel parameter
The `-I` parameter for simrrls has changed since the initial pyrad manuscript, so we had to explore new values for this parameter that approximate the number of indels we are after. I figured out a way to run simrrls and pipe the output to muscle to get a quick idea of the indel variation for different params:

        simrrls -n 1 -L 1 -I 1 -r1 $RANDOM 2>&1 | grep 0 -A 1 | tr '@' '>' | muscle | grep T | head -n 60

If you run it like this then you can get an idea of how many indel bearing seqs are generated:

        for i in {1..50}; do simrrls -n 1 -L 1 -I .05 -r1 $RANDOM 2>&1 | grep 0 -A 1 | tr '@' '>' | muscle | grep T | head -n 40 >> rpt.txt; done
        grep "-" rpt.txt | wc -l
        
From experimentation:

-I value -- %loci w/ indels
* 0.02 -- ~10%
* 0.05 -- ~15%
* 0.10 -- ~25%
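
The `grep "-" rpt.txt | wc -l` step above counts alignment lines that contain a gap character. A minimal Python sketch of the same tally follows, assuming `rpt.txt` holds one aligned sequence fragment per line. The function names and the toy rows are illustrative only, and the result is a per-line proportion, which only approximates the per-locus percentages listed above.

```python
def count_gapped_lines(lines):
    """Count alignment lines that contain at least one gap character ('-')."""
    return sum(1 for line in lines if "-" in line)

def gapped_fraction(lines):
    """Fraction of non-empty lines containing a gap; 0.0 for empty input."""
    rows = [line.strip() for line in lines if line.strip()]
    if not rows:
        return 0.0
    return count_gapped_lines(rows) / float(len(rows))

# Toy alignment rows standing in for the contents of rpt.txt
example = ["TGCAGTT-ACGT", "TGCAGTTAACGT", "TGCAG--AACGT", "TGCAGTTAACGT"]
print(count_gapped_lines(example), gapped_fraction(example))
```

To apply it to a real report, pass `open("rpt.txt")` in place of `example`.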

The simulated data will live in these directories:
* SIM_NO_DIR = WORK_DIR/simulated_data/simno
* SIM_LO_DIR = WORK_DIR/simulated_data/simlo
* SIM_HI_DIR = WORK_DIR/simulated_data/simhi
* SIM_LARGE_DIR = WORK_DIR/simulated_data/simlarge

Timing:
* 10K loci -- ~8MB -- ~ 2 minutes
* 100K loci -- ~80MB -- ~ 20 minutes
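
The two timing points above are consistent with roughly linear scaling in the number of loci. As a back-of-the-envelope sketch, assuming that linearity holds beyond these two rough measurements (which is not tested here), file size and runtime for other dataset sizes could be estimated like this:

```python
# Rates derived from the rough 10K-loci data point listed above
MB_PER_LOCUS = 8.0 / 10000        # ~8MB for 10K loci
MINUTES_PER_LOCUS = 2.0 / 10000   # ~2 minutes for 10K loci

def estimate(n_loci):
    """Return (size in MB, runtime in minutes) assuming linear scaling."""
    return n_loci * MB_PER_LOCUS, n_loci * MINUTES_PER_LOCUS

for n in (10000, 100000):
    size_mb, minutes = estimate(n)
    print("{:>7} loci: ~{:.0f}MB, ~{:.0f} minutes".format(n, size_mb, minutes))
```

The constants come only from the approximate figures in the list, so treat the estimates as order-of-magnitude guidance.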

In [326]:
import subprocess
import shutil

force = True

## Directories for the simulation data
SIM_NO_DIR=os.path.join(SIMULATION_DATA_DIR, "simno")
SIM_LO_DIR=os.path.join(SIMULATION_DATA_DIR, "simlo")
SIM_HI_DIR=os.path.join(SIMULATION_DATA_DIR, "simhi")
SIM_LARGE_DIR=os.path.join(SIMULATION_DATA_DIR, "simlarge")

for dir in [SIMULATION_DATA_DIR, SIM_NO_DIR, SIM_LO_DIR,\
            SIM_HI_DIR, SIM_LARGE_DIR]:
    if force and os.path.exists(dir):
        shutil.rmtree(dir)
    if not os.path.exists(dir):
        os.makedirs(dir)


## Tree input doesn't work right. Egglib errors.
## These are the trees from pyrad v3 manu, not sure if they ever worked there either
# newick = "(((((A:2,B:2):2,C:4):4,D:8):4,(((E:2,F:2):2,G:4):4,H:8):4):4,(((I:2,J:2):2,K:4):4,L:8):8,X:16):16;"
newick = "(((((A,B),C),D),(((E,F),G),H)),(((I,J),K),L));"
treefile = os.path.join(SIMULATION_DATA_DIR, "simtree.txt")
with open(treefile, 'w') as outfile:
    outfile.write(newick)

cmd = """\
export PATH={workdir}/miniconda/bin:$PATH; time simrrls -o {odir}/{oname} -ds 2 -L 10000 -I {indels} -r1 1
"""

## simulate no indels
call = cmd.format(workdir=WORK_DIR, odir=SIMULATION_DATA_DIR, oname="simno/simno", indels=0, tree=treefile)
print(call)
output = subprocess.check_output(call, shell=True, stderr=subprocess.STDOUT)
print(output)

## simulate low indels
call = cmd.format(workdir=WORK_DIR, odir=SIMULATION_DATA_DIR, oname="simlo/simlo", indels=0.02)
print(call)
output = subprocess.check_output(call, shell=True, stderr=subprocess.STDOUT)
print(output)

## simulate high indels
call = cmd.format(workdir=WORK_DIR, odir=SIMULATION_DATA_DIR, oname="simhi/simhi", indels=0.05)
print(call)
output = subprocess.check_output(call, shell=True, stderr=subprocess.STDOUT)
print(output)

## simulate large dataset, no indels but w/ allele dropout from mutations to cutsites
call = "export PATH={workdir}/miniconda/bin:$PATH; time simrrls -o {odir}/{oname} -ds 2 -L 100000 -mc 1 -r1 1"\
    .format(workdir=WORK_DIR, odir=SIMULATION_DATA_DIR, oname="simlarge/simlarge")
print(call)
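## The large simulation takes ~20 minutes. Uncomment both lines to run it.
## (Printing `output` without re-running would just repeat the simhi result.)
#output = subprocess.check_output(call, shell=True, stderr=subprocess.STDOUT)
#print(output)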

export PATH=/home/iovercast/manuscript-analysis//miniconda/bin:$PATH; time simrrls -o /home/iovercast/manuscript-analysis/simulated_data/simno/simno -ds 2 -L 10000 -I 0 -r1 1


real	1m56.606s
user	1m56.326s
sys	0m0.257s

export PATH=/home/iovercast/manuscript-analysis//miniconda/bin:$PATH; time simrrls -o /home/iovercast/manuscript-analysis/simulated_data/simlo/simlo -ds 2 -L 10000 -I 0.02 -r1 1


real	2m5.706s
user	2m5.544s
sys	0m0.183s

export PATH=/home/iovercast/manuscript-analysis//miniconda/bin:$PATH; time simrrls -o /home/iovercast/manuscript-analysis/simulated_data/simhi/simhi -ds 2 -L 10000 -I 0.05 -r1 1


real	2m8.086s
user	2m7.912s
sys	0m0.191s

export PATH=/home/iovercast/manuscript-analysis//miniconda/bin:$PATH; time simrrls -o /home/iovercast/manuscript-analysis/simulated_data/simlarge/simlarge -ds 2 -L 100000 -mc 1 -r1 1




## ipyrad simulated data assembly

For simulated datasets with 10000 100bp loci ipyrad runs in:
* v.0.3.41 ~6 minutes.
* v.0.3.42 ~8 minutes.
* v.0.4.0  ~4 minutes.

Large dataset (100000 100bp loci)
* v.0.4.0 ~24 minutes.

In [6]:
import ipyrad as ip

## Set up directory structures. change the force flag if you want to
## blow everything away and restart
#force = True
force = ""
IPYRAD_SIMOUT=os.path.join(IPYRAD_DIR, "SIMDATA/")
if force and os.path.exists(IPYRAD_SIMOUT):
    shutil.rmtree(IPYRAD_SIMOUT)
if not os.path.exists(IPYRAD_SIMOUT):
    os.makedirs(IPYRAD_SIMOUT)
for outdir in ["no/", "lo/", "hi/", "large/"]:
    tmpdir = os.path.join(IPYRAD_SIMOUT, "sim"+outdir)
    if not os.path.exists(tmpdir):
        os.makedirs(tmpdir)

## go to the ipyrad simulated data output directory
for dir in ["no", "lo", "hi", "large"]:
    
    ## Set barcode and raw data files
    simout_dir = os.path.join(IPYRAD_SIMOUT, "sim"+dir)
    simdata_dir = os.path.join(SIMULATION_DATA_DIR, "sim"+dir)
    bcodes = os.path.join(simdata_dir, "sim{}_barcodes.txt".format(dir))
    fastqs = os.path.join(simdata_dir, "sim{}_R1_.fastq.gz".format(dir))
    os.chdir(simout_dir)
    ## Make a new assembly
    data = ip.Assembly("sim"+dir)
    
    ## Set params
    data.set_params('raw_fastq_path', fastqs)
    data.set_params('barcodes_path', bcodes)
    data.set_params('max_low_qual_bases', 4)
    data.set_params('filter_min_trim_len', 69)
    data.set_params('max_Ns_consens', (99,99))
    data.set_params('max_Hs_consens', (99,99))
    data.set_params('max_SNPs_locus', (100, 100))
    data.set_params('min_samples_locus', 2)
    data.set_params('max_Indels_locus', (99,99))
    data.set_params('max_shared_Hs_locus', 99)
    data.set_params('trim_overhang', (2,2,2,2))

    data.write_params(force=True)

    cmd = "ipyrad -p params-sim{}.txt -s 1234567 --force".format(dir)
    print(cmd)
    !time $cmd

  New Assembly: simlarge
ipyrad -p params-simlarge.txt -s 1234567 --force

 -------------------------------------------------------------
  ipyrad [v.0.3.42]
  Interactive assembly and analysis of RAD-seq data
 -------------------------------------------------------------
  New Assembly: simlarge
  local compute node: [40 cores] on node001

  Step 1: Demultiplexing fastq data to Samples
  [####################] 100%  chunking large files  | 0:01:06 
  [####################] 100%  sorting reads         | 0:01:00 
  [####################] 100%  writing/compressing   | 0:00:11 


  Step 2: Filtering reads 
  [####################] 100%  processing reads      | 0:00:40 


  Step 3: Clustering/Mapping reads
  [####################] 100%  dereplicating         | 0:00:05 
  [####################] 100%  clustering            | 0:00:32 
  [####################] 100%  building clusters     | 0:00:10 
  [####################] 100%  chunking              | 0:00:02 
  [####################] 100%  a

## pyrad simulated data assembly
Simulated assembly approximate runtimes:

* 10000 100bp loci pyrad runs in ~11 minutes.
* 100000 100bp loci pyrad runs in ~90 minutes.

### Set the default pyrad params file.

In [8]:
## Here we load the default params info into `pyrad_params`. Each simulation
## iteration will slightly modify this and write it out for their own use.

## I zipped and uuencoded the params file so it wouldn't take up so much visual room. Lol
pyrad_params = 'begin 666 <data>\nM>)R-5^]OVCH4_<Y?<:5]&&%NFA^%MFR95*W3.FFKMG;2^UBYL2%>$SNS30?3\nM^^/?=9P *Q >\'TH@^/CF^)QS;[-L-(*::EIQRS4(62^L@9G24*_NKJ[AF6LC\nME(0TC,))"C :90=>0&<SGEO.P%A>0Y8-PM.[CU=?KJ]^7,&!UZM7$(?PC])/\nM0LZ!"8T 2J\\._7S]&M*R# 8[6$D(7U0>@IJ!5/*$\\6I16E&7?(E5S43)#0R%\nMNV>A%))#?!$@EHGW0*4;J$>J<\\5X [!G_799#NK9<%Q0[\'_:LQ!R5554,A@Z\nMCJDM K *<GP@Z%:Z.PM_\'7C8E)A),*@6)L<2]B./^Y /K6R0SX/!CT\\?KC[M\nM91J1)R\'<<6.UR*U3@D))%!2/:\\C#>4C@P[]^]<E[:"Z"C@IBDF 0IWMA6^CS\nM$&ZAUBKGQBB-I^.46):\\# ZNZ@Y_TH=[$<)7(1FO;3&%2DBD!JNF<]XHFT)>\nM+HR3^PLRSH@9!X.S/N1+K/C[@I8(2Y?P"HRPJ*K?PA;P"[^&=Y!$S0$8SKU(\nMDB@8.B;"B_%AV#A"&S153;OBG"-LH;DI5,F &BR;\\5Q4N,FZ8*\\+35D/,AKL\nMFEIJ5S6? OZ4S!\\-J:G0W3MC[MNAL@5Z?>KJ9BHW@:<YZ2,C3AJ>/ZAGS[*A\nM5>U,AI<4[2*QU%+E"_."9A1<KRIB--]7NKR_\\23C0;8,FX)J=\'+AHDHUU+^ \nMS2_&;%(E=7J3WNYNX:#1@-\\TGXDE2(R\\1@Z^4+6P&\'TPE A<4Q1D\\#=T$W&J\nM=A; 7S>):>"1E^HWGI(P_JP/9>-V3.[F9 \\3XQ"W#*>HIA+7@%D\\&HY%ULTC\nMC$#)<M7>4_HOSZ#D3ES9/=B3%ILR=H*^\'B(#<ZT6-<8&75(8EL)8<%\'B-]OF\nM8T*.8)^WV\'R):L;L/(+XXAC#\\%34*U3EZ;J#K"\\>.!/6G(X>[N*\'T>$"+MH"\nMRC;$&3_9;@8,\';%_<>/5GB>[;(%1F5^GF%Z5,!6U>>%EW_8*3#+&9QG$P3;P\nM3I_9!DZB%K@NG,:_FUQICI7/F@-OT-)T;RP>J3B)6V#L7^B;*3BL*&M2#&+_\nM_H8R6N--$T*2^:C_\'\\!))R \\4*&T@(_DIBTU"J,H)O@W)M!V3&PA&%ZH^P"!\nMQ[W Z8;C6Y\\!MVVHY$H:%/POO\\WX)1_\'@,\\VP&VX^"SYLYJ\';90WS6)GDV/ \nMG4_K4@FV\\M!-(]N\'F+Q=A^RZZ?1@3S9%W]]^,U,/$D=1@%F&^8UJZ;XA[MLM\nM-GI-FIQO@#]CLRP1VL6LD"=-"R(TU\\H8_Z\'9(H247%X&Z[&A![OSG\\:!1%4@\nM%]4C;YHB:Y\'B.$EV!.T[VC\'LSH*HTVHSCI1\\9HD6\\P(C1FZZCR!.\\,.(1($G\nMY/DP<MIYL&T&V!U0L4A+322AQ)!GLB"65.2)S,D(X;:.L;?D-%Y3_3/$D##<\nMSV740C.DO%^^J]IYQ=.SS+K/1]V2=C9\\XMA.\'"<5DJPY96T*.9U\\E&[<<3VZ\nMY\')NB]"SW>OO=,N&V*MH_H06^>-&8&F#AM7,W1J.47<5I_)-,KJ_[@;67N"U\nM#86\\YIK7FS[!W,>VV[_+,%YS50MN_&[Q4>#.AC@]0R%PXL,16B#3VS-5PV\\6\nM$<R^U;K+#]T<U0/<>1"KXWAH%35/&ZPX>\\T0_C7@_U"%8@2B3*HUL)_1>K [\nM&W;COZ/;#7[N &L\\M9_JL=UH\\A8VDMM@_S6:8,MS7;;IY*=Y29%4:HR8RXI+\n/NSNQ;&EX=U89_ >8=H2-\n \nend\n'
pyrad_params = pyrad_params.decode("uu").decode("zlib")
print(pyrad_params)

## If you want to make changes to the params file you can generate a new uuencoded string
## by editing the pyrad/params-pyrad.txt file and running these lines. If you run this
## at a terminal inside python shell then it'll print out nicely as one big fat line. 
#with open(os.path.join(PYRAD_DIR, "params-pyrad.txt")) as infile:
#    infile.read().encode("zlib").encode("uu")

<WORKDIR>                        ## 1. Working directory                                 (all)
<RAWFILE>              ## 2. Loc. of non-demultiplexed files (if not line 18)  (s1)
<BARCODES>              ## 3. Loc. of barcode file (if not line 18)             (s1)
vsearch                   ## 4. command (or path) to call vsearch (or usearch)    (s3,s6)
muscle                    ## 5. command (or path) to call muscle                  (s3,s7)
TGCAG                     ## 6. Restriction overhang (e.g., C|TGCAG -> TGCAG)     (s1,s2)
13                         ## 7. N processors (parallel)                           (all)
6                         ## 8. Mindepth: min coverage for a cluster              (s4,s5)
4                         ## 9. NQual: max # sites with qual < 20 (or see line 20)(s2)
.85                       ## 10. Wclust: clustering threshold as a decimal        (s3,s6)
rad                       ## 11. Datatype: rad,gbs,pairgbs,pairddrad,(others:see docs)(all)
2                 

In [9]:
import subprocess

## Set up directory structures. change the force flag if you want to
## blow everything away and restart
# force = True
force = ""
PYRAD_SIMOUT=os.path.join(PYRAD_DIR, "SIMDATA/")
if force and os.path.exists(PYRAD_SIMOUT):
    shutil.rmtree(PYRAD_SIMOUT)
if not os.path.exists(PYRAD_SIMOUT):
    os.makedirs(PYRAD_SIMOUT)
for outdir in ["no/", "lo/", "hi/", "large/"]:
    tmpdir = os.path.join(PYRAD_SIMOUT, "sim"+outdir)
    if not os.path.exists(tmpdir):
        os.makedirs(tmpdir)

## go to the pyrad simulated data output directory
for dir in ["no", "lo", "hi", "large"]:
    ## Set barcode and raw data files
    simdata_dir = os.path.join(SIMULATION_DATA_DIR, "sim"+dir)
    bcodes = os.path.join(simdata_dir, "sim{}_barcodes.txt".format(dir))
    fastqs = os.path.join(simdata_dir, "sim{}_R1_.fastq.gz".format(dir))
    
    ## Set output directory and params file
    simout_dir = os.path.join(PYRAD_SIMOUT, "sim"+dir)
    my_params_file = os.path.join(simout_dir, "params{}-pyrad.txt".format(dir))
    
    ## Grab the generic pyrad params file and modify it for this run
    my_params = pyrad_params.replace("<WORKDIR>", simout_dir)
    my_params = my_params.replace("<RAWFILE>", fastqs)
    my_params = my_params.replace("<BARCODES>", bcodes)

    ## Write the params out to a file
    with open(my_params_file, 'w') as outfile:
        outfile.write(my_params)

    cmd = "export PATH={workdir}/miniconda/bin:$PATH; ".format(workdir=WORK_DIR)
    cmd += "time pyrad -p {params_file} -s 1234567".format(params_file=my_params_file)

    ## Good sub for testing
    #cmd += "time pyrad --version"

    print(cmd)
    output = subprocess.check_output(cmd, shell=True, stderr=subprocess.STDOUT)
    print(output)

export PATH=/home/iovercast/manuscript-analysis//miniconda/bin:$PATH; time pyrad -p /home/iovercast/manuscript-analysis/pyrad/SIMDATA/simno/paramsno-pyrad.txt -s 1234567


     ------------------------------------------------------------
      pyRAD : RADseq for phylogenetics & introgression analyses
     ------------------------------------------------------------


	step 1: sorting reads by barcode
	 .	step 2: editing raw reads 
	............
	de-replicating files for clustering...

	step 3: within-sample clustering of 12 samples at 
	        '.85' similarity. Running 12 parallel jobs
	 	with up to 6 threads per job. If needed, 
		adjust to avoid CPU and MEM limits

	sample 1D_0 finished, 10000 loci
	sample 3I_0 finished, 10000 loci
	sample 3J_0 finished, 10000 loci
	sample 3K_0 finished, 10000 loci
	sample 1A_0 finished, 10000 loci
	sample 2F_0 finished, 10000 loci
	sample 1B_0 finished, 10000 loci
	sample 2E_0 finished, 10000 loci
	sample 2H_0 finished, 10000 loci
	sample 1C_0 fini

## stacks simulated data assembly
Output for stacks will be a bit trickier because we'll run gapped, ungapped, and default on all sim data
```
WORK_DIR
    |____stacks
            |___SIMOUT
                    |___gapped
                           |____simno
                           |____simlo
                           |____simhi
                    |____ungapped
                    |
                    |____default
```

We'll make calls to stacks that will look kind of like this:
```
* process_radtags args
 * -f input file
 * -i gzfastq
 * -b barcodes file
 * -e pstI
 * -o output dir

process_radtags -f simno_R1_.fastq.gz -i gzfastq -b simno_barcodes.txt -e pstI -o simout

* denovo_map.pl args
 * -m 2: Minimum depth to make a stack
 * -M 10: Number of mismatches per locus within individuals
 * -N 10: Number of allowed mismatches in secondary reads
 * -n 10: Number of mismatches when clustering across individuals 
 
 Housekeeping flags:
 * -T 40: Number of threads
 * -b 1: Batch id (Can be anything)
 * -S: Disable sql

denovo_map.pl -m 2 -M 10 -N 10 -n 10 -T 40 -b 1 -S --gapped  -X 'populations:--vcf --genepop --structure --phase --fastphase --phylip ' -X 'populations:-m 6' -X 'ustacks:--max_locus_stacks 2'  -o /home/iovercast/manuscript-analysis/stacks/REALDATA/gapped/ -s <clipped>
```

In [335]:
## These are the python variables for the main stacks simout paths: 
## STACKS_GAP_SIMOUT, STACKS_UNGAP_SIMOUT, STACKS_DEFAULT_SIMOUT

STACKS_SIMOUT=os.path.join(STACKS_DIR, "SIMDATA/")
STACKS_GAP_SIMOUT=os.path.join(STACKS_SIMOUT, "gapped/")
STACKS_UNGAP_SIMOUT=os.path.join(STACKS_SIMOUT, "ungapped/")
STACKS_DEFAULT_SIMOUT=os.path.join(STACKS_SIMOUT, "default/")

force = True
#force = ""
if force and os.path.exists(STACKS_SIMOUT):
    shutil.rmtree(STACKS_SIMOUT)
if not os.path.exists(STACKS_SIMOUT):
    os.makedirs(STACKS_SIMOUT)

## Make the simulation output directories if they don't already exist
for dir in [STACKS_GAP_SIMOUT, STACKS_UNGAP_SIMOUT, STACKS_DEFAULT_SIMOUT]:
    if not os.path.exists(dir):
        os.makedirs(dir)
    for outdir in ["no/", "lo/", "hi/", "large/"]:
        tmpdir = os.path.join(dir, "sim"+outdir)
        if not os.path.exists(tmpdir):
            os.makedirs(tmpdir)

## Do the stacks sim analysis
for dir in [STACKS_GAP_SIMOUT, STACKS_UNGAP_SIMOUT, STACKS_DEFAULT_SIMOUT]:

    for sim in ["no", "lo", "hi", "large"]:
        print("##################################")
        print("Doing - " + dir + " - sim" + sim)
        print("##################################")
        ## Where simulation results will be written
        simout_dir = os.path.join(dir, "sim"+sim)
        
        ## Where the rawdata and barcodes live
        simdata_dir = os.path.join(SIMULATION_DATA_DIR, "sim"+sim)
        ## We can just use the sim barcodes file as the stacks population map
        ## because all it really is doing is reading the first column, which
        ## is the names of samples to use in denovo_map.pl
        ipyrad_bcodes = os.path.join(simdata_dir, "sim{}_barcodes.txt".format(sim))
        fastqs = os.path.join(simdata_dir, "sim{}_R1_.fastq.gz".format(sim))

        ## Munge the ipyrad barcodes file into the format stacks wants for the population map
        ## file (ie sample name then population). Here we just assign all samples to be from 
        ## population 1
        popmap = os.path.join(simout_dir, "stacks_popmap.txt")
        with open(ipyrad_bcodes) as infile:
            with open(popmap, 'w') as outfile:
                lines = infile.readlines()
                for line in lines:
                    ## This is just setting all samples to population 1 
                    outfile.write(line.split()[0]+"\t1\n")
        
        ## Munge the sim barcodes file into the format that stacks wants
        ## basically the inverse of what ipyrad uses barcode first then sample name
        ## There is probably a fancier/smarter way to do this, but this works.
        bcodes = os.path.join(simout_dir, "stacks_barcodes.txt")
        with open(ipyrad_bcodes, 'r') as infile:
            with open(bcodes, 'w') as outfile:
                lines = infile.readlines()
                for line in lines:
                    if line:
                        outfile.write("\t".join(line.split()[::-1])+"\n")

        ## Export the path so we can actually find the binaries
        cmd = "export PATH={workdir}/miniconda/bin:$PATH; ".format(workdir=WORK_DIR)

        ## Add the command to process_radtags.
        ## Yes we don't have to do this over and over, but it's fast, and also ipyrad/pyrad
        ## are paying the same demultiplex penalty for every run.
        cmd += "time process_radtags -f {simdata} -i gzfastq -b {barcodes} -e pstI -o {simout}; "\
            .format(simdata=fastqs, barcodes=bcodes, simout=simout_dir)
   
        ## Toggle the dryrun flag for testing
        #DRYRUN="-d"
        DRYRUN=""

        ## Don't add all the fancy stuff if you just want to do the default assembly
        ADDITIONAL_PARAMS = ""
        GAPPED = ""
        if not dir == STACKS_DEFAULT_SIMOUT:
            ADDITIONAL_PARAMS = " -m 2 -M 10 -N 10 -n 10 "
        if dir == STACKS_GAP_SIMOUT:
            ADDITIONAL_PARAMS += " --gapped "
            
        OUTPUT_FORMATS = "--vcf --genepop --structure --phase --fastphase --phylip "
        cmd += "time denovo_map.pl -T 40 -b 1 -S " + DRYRUN + ADDITIONAL_PARAMS\
        + " -X \'populations:" + OUTPUT_FORMATS + "\'"\
        + " -o " + simout_dir + " --samples " + simout_dir + " -O " + popmap 

        print(cmd)
        output = subprocess.check_output(cmd, shell=True, stderr=subprocess.STDOUT)
        print(output)


##################################
Doing - /home/iovercast/manuscript-analysis/stacks/SIMDATA/gapped/ - simno
##################################
export PATH=/home/iovercast/manuscript-analysis//miniconda/bin:$PATH; time process_radtags -f /home/iovercast/manuscript-analysis/simulated_data/simno/simno_R1_.fastq.gz -i gzfastq -b /home/iovercast/manuscript-analysis/stacks/SIMDATA/gapped/simno/stacks_barcodes.txt -e pstI -o /home/iovercast/manuscript-analysis/stacks/SIMDATA/gapped/simno; time denovo_map.pl -T 40 -b 1 -S  -m 2 -M 10 -N 10 -n 10  --gapped  -X 'populations:--vcf --genepop --structure --phase --fastphase --phylip ' -o /home/iovercast/manuscript-analysis/stacks/SIMDATA/gapped/simno --samples /home/iovercast/manuscript-analysis/stacks/SIMDATA/gapped/simno -O /home/iovercast/manuscript-analysis/stacks/SIMDATA/gapped/simno/stacks_popmap.txt
Processing single-end data.
Using Phred+33 encoding for quality scores.
Found 1 input file(s).
Searching for single-end, inlined barcodes.
Loa

## aftrRAD simulated data assembly
aftrRAD throws out any read containing one or more bases with a phred
quality score below the `minQual` setting. You can't really tune how many 
low quality bases to retain. For the simulated data we are simulating
all qscores arbitrarily high, so this isn't a problem. For the
empirical data we might have to think about lowering minQual...
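
To make the filtering rule concrete, here is a minimal sketch of discarding every read that has at least one base below a quality cutoff. This is not aftrRAD's actual code (aftrRAD is written in Perl). It assumes Phred+33 encoded quality strings, and the function names and the cutoff of 20 are hypothetical.

```python
def min_phred(qual_string, offset=33):
    """Lowest phred score in a FASTQ quality string (Phred+33 assumed)."""
    return min(ord(ch) - offset for ch in qual_string)


def keep_read(qual_string, min_qual=20):
    """Keep a read only if every base is at or above min_qual.

    min_qual=20 is an illustrative value, not necessarily aftrRAD's default.
    """
    return min_phred(qual_string) >= min_qual


# Toy reads as (sequence, quality string) pairs
reads = [
    ("TGCAGAAC", "IIIIIIII"),   # every base is Q40, so the read is kept
    ("TGCAGAAC", "IIII#III"),   # one Q2 base ('#'), so the whole read is dropped
]
kept = [seq for seq, qual in reads if keep_read(qual)]
print(len(kept))
```

A single bad base is enough to lose the read, which is why this kind of filter can discard a lot of empirical data.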

Actually, throwing out many reads seems to be the aftrRAD strategy. I think
it's trying to reduce the number of potential reads to match (reduce singletons
from error).

Performance on 1000 100bp simulated dataset:
* With params set as below: AftrRAD.py ~50minutes

In [3]:
import subprocess
import gzip

## Set up directory structures. change the force flag if you want to
## blow everything away and restart
# force = True
force = ""
AFTRRAD_SIMOUT=os.path.join(AFTRRAD_DIR, "SIMDATA/")
if force and os.path.exists(AFTRRAD_SIMOUT):
    shutil.rmtree(AFTRRAD_SIMOUT)
if not os.path.exists(AFTRRAD_SIMOUT):
    os.makedirs(AFTRRAD_SIMOUT)
for outdir in ["no/", "lo/", "hi/", "large/"]:
    tmpdir = os.path.join(AFTRRAD_SIMOUT, "sim"+outdir)
    if not os.path.exists(tmpdir):
        os.makedirs(tmpdir)

## go to the aftrRAD simulated data output directory
for dir in ["no", "lo", "hi", "large"]:
    print("Doing - {}".format(dir))
    ## Set barcode and raw data files
    simdata_dir = os.path.join(SIMULATION_DATA_DIR, "sim"+dir)
    ipyrad_bcodes = os.path.join(simdata_dir, "sim{}_barcodes.txt".format(dir))
    fastqs = os.path.join(simdata_dir, "sim{}_R1_.fastq.gz".format(dir))

    simout_dir = os.path.join(AFTRRAD_SIMOUT, "sim"+dir)
    if not os.path.exists(simout_dir):
        os.mkdir(simout_dir)
    bcodes = os.path.join(simout_dir, "aftrrad-{}-barcodes.txt".format(dir))
    ## Munge the ipyrad barcodes file to the format aftrrad wants
    ## (barcode\tsample name)
    with open(ipyrad_bcodes, 'r') as infile:
        with open(bcodes, 'w') as outfile:
            lines = infile.readlines()
            for line in lines:
                if line:
                    outfile.write("\t".join(line.split()[::-1])+"\n")

    ## Make data and barcodes directories
    data_dir = os.path.join(simout_dir, "Data")
    bcodes_dir = os.path.join(simout_dir, "Barcodes")
    for tmp in [data_dir, bcodes_dir]:
        if not os.path.exists(tmp):
            os.mkdir(tmp)
    ## Copy the fastq to the aftrrad/Data dir and the barcodes to the
    ## aftrrad/Barcodes dir. These files have to be in these directories
    ## and they have to have the exact same name, derp.
    aftrrad_fastqs = os.path.join(data_dir, "sim-{}.txt".format(dir))
    with gzip.open(fastqs, 'rb') as f_in, open(aftrrad_fastqs, 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)
    shutil.copy2(bcodes, os.path.join(bcodes_dir, "sim-{}.txt".format(dir)))
    
    ## AftrRAD is incredibly picky about where scripts and binaries are so you
    ## have to do a bunch of annoying housekeeping.
    cmd = "cp -rf {}/* {}".format(os.path.join(AFTRRAD_DIR, "AftrRAD/AftrRADv5.0"), simout_dir)
    os.system(cmd)
    shutil.copy2(os.path.join(AFTRRAD_DIR, "dnaMatrix"), simout_dir)
    shutil.copy2(os.path.join(AFTRRAD_DIR, "ACANA"), simout_dir)

    ## This first line of nonsense is to make it so perl can find the Parallel::ForkManager library
    base = "export PATH={}perl5/bin:$PATH; cpanm --local-lib={}/perl5 local::lib && eval $(perl -I {}/perl5/lib/perl5/ -Mlocal::lib); ".format(WORK_DIR, WORK_DIR, WORK_DIR)

    ## Good sub for testing
    #cmd = base + "time perl AftrRAD.pl -h"

    ## numIndels-3 What do here?
    ## stringLength-99 We don't care about homopolymers here, so set it very high
    ## maxH-90 ??
    ## AftrRAD manu analysis settings - P2: noP2;  minDepth: 5; numIndel: 3; MinReads: 5
    ## Note the trailing space after stringLength-99, without it the next
    ## argument gets glued onto it (stringLength-99DataPath-Data/)
    cmd = base + "time perl AftrRAD.pl re-TGCAG minDepth-6 P2-noP2 minIden-85 " \
               + "stringLength-99 " \
               + "DataPath-Data/ BarcodePath-Barcodes/ maxProcesses-40"
    os.chdir(simout_dir)
    print(cmd)
    output = ""
    try:
        #output = subprocess.check_output(cmd, shell=True, stderr=subprocess.STDOUT)
        pass
    except Exception as inst:
        print(inst.output)
    print(output)

    ## MinReads is mindepth_statistical
    cmd = base + "time perl Genotype.pl MinReads-6 subset-0 maxProcesses-40"
    #cmd = base + "perl Genotype.pl -h"
    print(cmd)
    #output = subprocess.check_output(cmd, shell=True, stderr=subprocess.STDOUT)
    print(output)

    ## pctScored is min_samples_locus as a percent of total samples, here since
    ##    we use 2 for min_samples_locus and there are 12 simulated samples we use 17%
    cmd = base + "time perl FilterSNPs.pl pctScored-17"
    print(cmd)
    #output = subprocess.check_output(cmd, shell=True, stderr=subprocess.STDOUT)

Doing - no
export PATH=/home/iovercast/manuscript-analysis/perl5/bin:$PATH; cpanm --local-lib=/home/iovercast/manuscript-analysis//perl5 local::lib && eval $(perl -I /home/iovercast/manuscript-analysis//perl5/lib/perl5/ -Mlocal::lib); time perl AftrRAD re-TGCAG minDepth-6 p2-noP2 minIden-85 stringLength-99DataPath-Data/ BarcodePath-Barcodes/ maxProcesses-40

export PATH=/home/iovercast/manuscript-analysis/perl5/bin:$PATH; cpanm --local-lib=/home/iovercast/manuscript-analysis//perl5 local::lib && eval $(perl -I /home/iovercast/manuscript-analysis//perl5/lib/perl5/ -Mlocal::lib); perl Genotype.pl MinReads-6 subset-0 maxProcesses-40

export PATH=/home/iovercast/manuscript-analysis/perl5/bin:$PATH; cpanm --local-lib=/home/iovercast/manuscript-analysis//perl5 local::lib && eval $(perl -I /home/iovercast/manuscript-analysis//perl5/lib/perl5/ -Mlocal::lib); perl FilterSNPs.pl pctScored-17
Doing - lo
export PATH=/home/iovercast/manuscript-analysis/perl5/bin:$PATH; cpanm --local-lib=/home/iover

### Post-processing aftrrad
aftrrad doesn't provide vcf as output so we have to make it ourselves
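
As a rough illustration of what pulling SNPs out of an alignment involves, the sketch below finds the variable columns in a set of aligned sequences. This is not how snp-sites is implemented. The function name and toy sequences are made up, and treating `N` and `-` as missing data is an assumption.

```python
def variable_sites(seqs, missing="N-"):
    """Return 0-based alignment columns where more than one base is observed."""
    sites = []
    for pos, column in enumerate(zip(*seqs)):
        # Collect the distinct bases in this column, ignoring missing data
        bases = {b.upper() for b in column} - set(missing)
        if len(bases) > 1:
            sites.append(pos)
    return sites


# Four toy aligned sequences: columns 0 and 4 vary, column 2 only has an N
alignment = ["ACGTAC", "ACGTTC", "ACNTAC", "GCGTAC"]
print(variable_sites(alignment))
```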

In [None]:
%%bash -s "$WORK_DIR" "$AFTRRAD_DIR"
export PATH="$1/miniconda/bin:$PATH"; export "WORK_DIR=$1"; export "AFTRRAD_DIR=$2"

cd $AFTRRAD_DIR
./brew/bin/brew tap homebrew/science
./brew/bin/brew install snp-sites

## Copy the snp-sites binary to the miniconda bin dir so it will be available
cp brew/Cellar/snp-sites/2.2.0/bin/snp-sites $WORK_DIR/miniconda/bin/

cd SIMDATA
for d in sim*;
do
    echo $d
    cd $d/Formatting
    perl OutputFasta.pl SNPsOnly-1
    snp-sites -v -o $d.vcf SNPMatrix_*.All.fasta
    ## Return to SIMDATA so the next iteration can find its directory
    cd ../..
done


## dDocent simulated data assembly
Simulated assembly approximate runtimes:

* 10000 100bp loci dDocent runs in ~2 minutes.
* 100000 100bp loci pyrad runs in ~15 minutes.

~~This would be what you would do if dDocent actually could run unsupervised, but it REQUIRES
you to sit there and peck at a couple keys, so this part of the script just sets up the
input files and stuff for you, you're responsible for running this by hand in each sim directory.~~

Actually never mind, I found the two sections of the dDocent script that prompt for user input. These map more or less to mindepth_statistical and min_samples_locus. I hard-coded two params on dDocent shell script lines 546 and 585, setting CUTOFF=6 and CUTOFF2=2, to align w/ these ipyrad settings.

Actually never mind, it's still doing some kind of pipe to stdout that's broken when running inside the notebook, so still run these by hand. I'm leaving the hard-coded values tho.

Also, on line 379 there's a hardcoded cutoff for final output where it only retains snps called in 90% of individuals:

    echo "Using VCFtools to parse SNPS.vcf for SNPs that are called in at least 90% of individuals"
    vcftools --vcf TotalRawSNPs.vcf --geno 0.9 --out Final --counts --recode --non-ref-af 0.001 --max-non-ref-af 0.9999 --mac 1 --minQ 30 --recode-INFO-all &>VCFtools.log

Also, the final vcf files that are output contain complex genotypes rather than just snps (this is the default for freebayes), so you have to run some extra code-jams to get just
the snps (`vcfallelicprimitives` and then `vcftools --remove-indels`).
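
For reference, the distinction that the final filtering step relies on can be sketched as follows. The vcftools documentation describes an indel as any variant that alters the length of the REF allele, so a record passes `--remove-indels` only when every ALT allele has the same length as REF. The function name and toy records are hypothetical. This is a sketch of the criterion, not a replacement for `vcfallelicprimitives` or `vcftools`.

```python
def is_indel(ref, alt_field):
    """True if any comma-separated ALT allele differs in length from REF."""
    return any(len(alt) != len(ref) for alt in alt_field.split(","))


# Toy (REF, ALT) pairs as they would appear in VCF columns 4 and 5
records = [
    ("A", "G"),      # plain SNP
    ("AT", "A"),     # deletion
    ("C", "T,CA"),   # multi-allelic site where one allele is an insertion
    ("GA", "TC"),    # same-length complex variant (MNP), not an indel
]
not_indels = [(ref, alt) for ref, alt in records if not is_indel(ref, alt)]
print(not_indels)
```

The last toy record shows why the decomposition step comes first: a same-length complex genotype is not an indel, but it still isn't a single snp until it has been split up.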

In [None]:
import ipyrad as ip
import subprocess
import glob

## Set up directory structures. change the force flag if you want to
## blow everything away and restart
# force = True
force = ""
DDOCENT_SIMOUT=os.path.join(DDOCENT_DIR, "SIMDATA/")
if force and os.path.exists(DDOCENT_SIMOUT):
    shutil.rmtree(DDOCENT_SIMOUT)
if not os.path.exists(DDOCENT_SIMOUT):
    os.makedirs(DDOCENT_SIMOUT)
for outdir in ["no/", "lo/", "hi/", "large/"]:
    tmpdir = os.path.join(DDOCENT_SIMOUT, "sim"+outdir)
    if not os.path.exists(tmpdir):
        os.makedirs(tmpdir)

        
## go to the dDocent simulated data output directory
for dir in ["no", "lo", "hi", "large"]:
    print("Doing - {}".format(dir))
    ## Set barcode and raw data files
    simdata_dir = os.path.join(SIMULATION_DATA_DIR, "sim"+dir)
    bcodes = os.path.join(simdata_dir, "sim{}_barcodes.txt".format(dir))
    fastqs = os.path.join(simdata_dir, "sim{}_R1_.fastq.gz".format(dir))

    simout_dir = os.path.join(DDOCENT_SIMOUT, "sim"+dir)
    if not os.path.exists(simout_dir):
        os.mkdir(simout_dir)
    os.chdir(simout_dir)

    ## dDocent doesn't do demultiplexing, so we'll use ipyrad for it real quick
    data = ip.Assembly("sim"+dir)
    
    data.set_params('raw_fastq_path', fastqs)
    data.set_params('barcodes_path', bcodes)
    data.write_params(force=True)
    
    cmd = "ipyrad -p params-sim{}.txt -s 1".format(dir)
    print(cmd)
    !time $cmd
    
    ## Now we have to rename all the files in the way dDocent expects them:
    ## 1A_0_R1_.fastq.gz -> Pop1_Sample1.F.fq.gz
    ##
    ## Have to cd to the directory cuz we have to be in the data dir when we run dDocent anyway
    os.chdir("sim{}_fastqs".format(dir))
    fq_files = glob.glob("./*.fastq.gz")
    for f in fq_files:
        newname = "Pop{}_Sample{}.F.fq.gz".format(f[2], f[3])
        print("Renaming {} -> {}".format(f, newname))
        os.rename(f, newname)

    ## Write out the config file for this run.
    ## Compacted the config file into one long line here to make it not take up so much room
    config_file = "sim{}-config.txt".format(dir)
    with open(config_file, 'w') as outfile:
        outfile.write('Number of Processors\n40\nMaximum Memory\n0\nTrimming\nyes\nAssembly?\nyes\nType_of_Assembly\nSE\nClustering_Similarity%\n0.85\nMapping_Reads?\nyes\nMapping_Match_Value\n1\nMapping_MisMatch_Value\n3\nMapping_GapOpen_Penalty\n5\nCalling_SNPs?\nyes\nEmail\nwatdo@mailinator.com\n')

    cmd = "export LD_LIBRARY_PATH={}/freebayes-src/vcflib/tabixpp/htslib/:$LD_LIBRARY_PATH; ".format(DDOCENT_DIR)
    cmd += "export PATH={}:$PATH; time dDocent {}".format(DDOCENT_DIR, config_file)
    print(cmd)
    #!$cmd
    
    ## You have to post-process the vcf files to decompose complex genotypes 
    ## and remove indels
    ## Use os.path.join here because simdata_dir has no trailing slash
    fullvcf = os.path.join(simdata_dir, "TotalRawSNPs.vcf")
    filtvcf = os.path.join(simdata_dir, "Final.recode.vcf")
    for f in [fullvcf, filtvcf]:
        print("Finalizing - {}".format(f))
        ## Naming the new outfiles as <curname>.snps.vcf
        outfile = os.path.join(simdata_dir, f.split("/")[-1].split(".vcf")[0] + ".snps.vcf")
        cmd = "export PATH={}:$PATH; vcfallelicprimitives {} > ddoc-tmp.vcf".format(DDOCENT_DIR, f)
        print(cmd)
        #!$cmd
        cmd = "export PATH={}:$PATH; vcftools --vcf ddoc-tmp.vcf --remove-indels --recode --recode-INFO-all --out {}".format(DDOCENT_DIR, outfile)
        #!$cmd
        !rm ddoc-tmp.vcf


Doing - no
  New Assembly: simno
ipyrad -p params-simno.txt -s 1

 -------------------------------------------------------------
  ipyrad [v.0.4.0]
  Interactive assembly and analysis of RAD-seq data
 -------------------------------------------------------------
  New Assembly: simno
  local compute node: [40 cores] on node001

  Step 1: Demultiplexing fastq data to Samples
  [####################] 100%  chunking large files  | 0:00:00 
  [####################] 100%  sorting reads         | 0:00:44 
  [####################] 100%  writing/compressing   | 0:00:02 



real	0m56.558s
user	0m1.724s
sys	0m0.437s
Renaming ./1A_0_R1_.fastq.gz -> Pop1_SampleA.F.fq.gz
Renaming ./3J_0_R1_.fastq.gz -> Pop3_SampleJ.F.fq.gz
Renaming ./1B_0_R1_.fastq.gz -> Pop1_SampleB.F.fq.gz
Renaming ./3K_0_R1_.fastq.gz -> Pop3_SampleK.F.fq.gz
Renaming ./2F_0_R1_.fastq.gz -> Pop2_SampleF.F.fq.gz
Renaming ./1D_0_R1_.fastq.gz -> Pop1_SampleD.F.fq.gz
Renaming ./1C_0_R1_.fastq.gz -> Pop1_SampleC.F.fq.gz
Renaming ./3L_0

# Empirical Analysis
An empirical analysis was conducted on the published data set from Eaton & Ree (2013), 
available de-multiplexed in FASTQ format from the NCBI Sequence Read Archive (SRA072507). 
Because the data are already demultiplexed, demultiplexing with ipyrad step 1 and stacks 
`process_radtags` is skipped. Because Stacks and ipyrad/PyRAD use different quality filtering 
methods, we employed filtering by ipyrad only. This way quality filtering does not affect the 
comparison between the programs. After trimming the barcode and adapter, 
read lengths are 69bp. At .85 similarity this is equivalent to 10 base differences. 
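
The arithmetic behind that equivalence, as a quick check (the variable names are just for illustration):

```python
read_length = 69          # bp remaining after trimming barcode and adapter
clust_threshold = 0.85    # clustering similarity threshold

# Fraction of sites allowed to differ, times read length, rounded down
max_differences = int(read_length * (1 - clust_threshold))
print(max_differences)
```

This matches the value of 10 used for the stacks `-M`, `-N`, and `-n` flags elsewhere in this notebook.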


## ipyrad
Some representative runs to give you an expectation of how long things should take:

Step 1 - Linking samples:
```
real    0m33.881s
user    0m24.455s
sys     0m1.877s
```
Step 2 - Filtering Raw Reads:
```
real    7m48.600s
user    0m15.081s
sys     0m5.252s
```
Step 3 - Clustering within samples (`-c 13`):
```
real    29m56.718s
user    0m24.510s
sys     0m1.098s
```
Step 3 - Clustering within samples (`-c 40`):
```
## There is some overhead associated with parallelization so the
## speedup isn't quite 2x, but it is still considerable. With a larger
## number of samples the speed gain from massive parallelization
## is more pronounced.

real    18m59.515s
user    0m23.856s
sys     0m1.058s
```

All steps 1-7 with `-c 40`:
```
real    50m25.647s
user    9m34.733s
sys     0m20.489s
```

All steps 1-7 with `-c 13`:
```
real    69m26.015s
user    9m1.946s
sys     0m20.285s
```
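
The speedups implied by the `real` times above can be computed directly. The helper for parsing `time` output below is a hypothetical convenience, not part of ipyrad.

```python
import re


def to_seconds(timestr):
    """Convert a bash `time` value such as '29m56.718s' to seconds."""
    match = re.match(r"(\d+)m([\d.]+)s$", timestr)
    if match is None:
        raise ValueError("unrecognized time string: {}".format(timestr))
    return int(match.group(1)) * 60 + float(match.group(2))


# Wall-clock ('real') times copied from the runs above, keyed by the -c value
step3 = {"13": "29m56.718s", "40": "18m59.515s"}
all_steps = {"13": "69m26.015s", "40": "50m25.647s"}

speedups = {}
for label, times in [("step 3", step3), ("steps 1-7", all_steps)]:
    speedups[label] = to_seconds(times["13"]) / to_seconds(times["40"])
    print("{}: {:.2f}x faster with -c 40 than -c 13".format(label, speedups[label]))
```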



### ipyrad params file (params-example.txt)
Next, we write out the ipyrad params file to the ipyrad working directory. 
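
The cell that follows relies on the Python 2 `str.decode("uu")` and `str.decode("zlib")` codecs, which are not available on Python 3 strings. A rough Python 3 equivalent of the round trip, using only `zlib` and `binascii` from the standard library, might look like the sketch below. The function names are hypothetical, and the `begin 666 <data>` header simply mimics the strings stored in this notebook.

```python
import binascii
import zlib


def pack_params(text):
    """zlib-compress text, then uuencode it in 45-byte chunks."""
    raw = zlib.compress(text.encode("utf-8"))
    body = b"".join(binascii.b2a_uu(raw[i:i + 45])
                    for i in range(0, len(raw), 45))
    return b"begin 666 <data>\n" + body + b" \nend\n"


def unpack_params(blob):
    """Inverse of pack_params: uudecode each body line, then decompress."""
    lines = blob.split(b"\n")
    # Skip the 'begin' header and stop at the 'end' trailer
    body = lines[1:lines.index(b"end")]
    raw = b"".join(binascii.a2b_uu(line) for line in body)
    return zlib.decompress(raw).decode("utf-8")


example = "rad    ## [7] [datatype]: Datatype (see docs)\n"
packed = pack_params(example)
print(unpack_params(packed) == example)
```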

In [198]:
## I zipped and uuencoded the params file so it wouldn't take up so much visual room
## Lol
params = 'begin 666 <data>\nM>)R55EUOVS@0?->O6" /9_=LV9:3-"W: XRVUQ1HTR:7>RH*@I9HFX4D*B25\nMQ/WU-R0E1W$3VR<$"+FD9[@?L^1PZ#^2U5KSC"JN>6%H(7-!O=MX\'$_CR4E_\nM>/@777V8?7X_NY[1,]_1$7T?_Z#OW!A1S/,U*WDA?KRF63,G-X_I7R,RLLK/\nM2-6VJBUE4HO4*BT%3J@TM1!DK*A,%(_V<6\\.,,$!*JU^ HX!%?3?PLQQ4*_@\nMF2!94EIK-Y<+*I6E2@LC2MN/=L(\'@@0$FM^Q!3?VAE7<KL#Q6:7<2E626A 6\nM 5H.,U\'4N955+N[AL-_NHV\\.89F"9<YUJC)AGB)IUSQB%,<C<<\\+4#%15%++\nME.<,21^]B)>_#F [!IM1VHKL>;<>N3,*NQ]YE8E2W:J=/"?=ZBB$7:FL6Q_!\nM0KV -" M%D*+,A4#"J8_?[,,-Y:#DG?JDM?^@AEQ4[O!=@+;#=1N"%%V*MJ#\nM_Q+X&;?<KBN\'^KX94L\\(09E*3?\\UZB,;T\')NX$+FQ\\*F<73]\\=WLXV G^)D_\nMO+%(KSLJ4[="KWBY!-\'5@YE:,_72VDX&?8*>PA!5;Y-^=+P_3J] 5?![EJL[\nM=E.CF.;<" .B+_R>8"1GE\'9-SDZHMMQ0[_)-,NX[<7&$D&?1=+J79^+Z1;72\nM*+Q+DRHMF%HLC+!@\\E:Z)&^F8*:>*E$G/+="E\\C7K7A[>MR/3O<[-\'%]H9!E\nM)BJ[8@9ID<8ZF3B?<&1O]ZVGL_;@G2R7![$D79:"_]1U+GYC@!V=SJZ\';O4Q\nMQV2,;P_\'-.3&XS492?,:C5(W\'\'?2KD!H?$,PT3@^.]EW;M< / BS"+M9J=P)\nM\\UV Q<%H8_8>H(=ZK;=:CG:<><-QTM14T[I8(4W!;=KZ4-;%\'"Y ?XB%NN-S\nMQ*;=@CX\'A]J>%TT.8\'-*AVIQ?,8S7N&_*]^_O25<,8UU5&E9X#_U)DXIR=N@\nM)*%15Z_V\\[Q\\X$\'B&7Y;L%R43=(Q6B(COJGPS!!?./J&FMSF*#G F;,F=(B,\nM0$I9JDK<5ZT>&RM5@#32AOLM[*C-IH>9:#J@78KT3*WP+[9)+OY A.K2%2ID\nMZ=M!_S%1[PH=Y@K]Y0">9-SPG&_SG -H)1 =]6N]5/9YE@/BEC2B9XT4T,[2\nMVC2I.6H%XN/F5WQ=A!>)4^* )CO$Z F2QHU_+KYUT.\'&$3E3!_I_1:>1./N$\nM3I)O Z.8I+<_B;X/^KB!-BONNN[Y-OI#[%WMH)J>I$%L=DG>,SF]BTQ:AEO\'\nM X\'C ^;N%AH&Y!:O<S]&R8":O^>AG;B]TCI7X#7FG;MO ^A(_AK0FT#E1W#@\nMQ8[3MRQ.VJ$:& H#C<AY\\#4\\6!M#]^"[$0.D$W*E*H;&*9<E<\\\\*]T+%:\\N]\n3B+%2Y^$9$C84>)2&M\\=_GRU$J0  \n \nend\n'
print(params.decode("uu").decode("zlib"))
with open(os.path.join(IPYRAD_DIR, "params-example.txt"), 'w') as outfile:
    outfile.write(params.decode("uu").decode("zlib"))

## If you want to make changes to the params file you can generate a new uuencoded string
## by editing the ipyrad/params-example.txt file and running these lines. If you run this
## at a terminal inside python shell then it'll print out nicely as one big fat line. 
#with open(os.path.join(IPYRAD_DIR, "params-example.txt")) as infile:
#    infile.read().encode("zlib").encode("uu")

------- ipyrad params file (v.0.3.15)-------------------------------------------
REALDATA                        ## [0] [assembly_name]: Assembly name. Used to name output directories for assembly steps
./REALDATA                             ## [1] [project_dir]: Project dir (made in curdir if not present)
                               ## [2] [raw_fastq_path]: Location of raw non-demultiplexed fastq files
                               ## [3] [barcodes_path]: Location of barcodes file
../example_empirical_rad/*.gz                               ## [4] [sorted_fastq_path]: Location of demultiplexed/sorted fastq files
denovo                         ## [5] [assembly_method]: Assembly method (denovo, reference, denovo+reference, denovo-reference)
                               ## [6] [reference_sequence]: Location of reference sequence file
rad                            ## [7] [datatype]: Datatype (see docs): rad, gbs, ddrad, etc.
TGCAG,                         ## [8] [restriction_overhan

### ipyrad - Import samples and filter raw data
This step should take 5-10 minutes depending on the available resources.

In [76]:
%%bash -s "$WORK_DIR" "$IPYRAD_DIR"
export PATH="$1/miniconda/bin:$PATH"; export "IPYRAD_DIR=$2"
cd $IPYRAD_DIR

## Import demultiplexed samples into a new assembly and filter the raw data
time ipyrad -p params-example.txt -s 12


 -------------------------------------------------------------
  ipyrad [v.0.3.36]
  Interactive assembly and analysis of RAD-seq data
 -------------------------------------------------------------
  New Assembly: REALDATA
  local compute node: [40 cores] on node001

  Step 1: Linking sorted fastq data to Samples
    Linking to demultiplexed fastq files in: /home/iovercast/manuscript-analysis/example_empirical_rad/*.gz
    13 new Samples created in 'REALDATA'.
    13 fastq files linked to 13 new Samples.

  Step 2: Filtering reads 
  [                    ]   0%  processing reads      | 0:00:06 


real	6m1.327s
user	0m39.101s
sys	0m6.733s


In [199]:
%%bash -s "$WORK_DIR" "$IPYRAD_DIR"
export PATH="$1/miniconda/bin:$PATH"; export "IPYRAD_DIR=$2"
cd $IPYRAD_DIR

## Full run with 40 cores
time ipyrad -p params-example.txt -s 34567 -f -c 40


 -------------------------------------------------------------
  ipyrad [v.0.3.36]
  Interactive assembly and analysis of RAD-seq data
 -------------------------------------------------------------
  loading Assembly: REALDATA
  from saved path: ~/manuscript-analysis/ipyrad/REALDATA/REALDATA.json
  local compute node: [40 cores] on node001

  Step 3: Clustering/Mapping reads
  [####################] 100%  dereplicating         | 0:00:07 
  [                    ]   0%  clustering            | 0:00:00 


real	44m47.378s
user	9m24.622s
sys	0m17.537s


## pyRAD v.3
In all cases with pyrad analysis we set `N processors` = 13.

Step 2:
```
real	1m37.142s
user	12m39.945s
sys	0m4.051s
```

Running steps 3-7:
```
real    250m58.693s
user    1180m7.893s
sys     12m1.399s
```
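
A rough way to read these timings: dividing CPU time (`user` + `sys`) by wall-clock time (`real`) gives the average number of cores kept busy. The calculation below is only a back-of-the-envelope sketch using the steps 3-7 numbers above, and the helper name is made up.

```python
def minutes(m, s):
    """Convert a (minutes, seconds) pair from `time` output to minutes."""
    return m + s / 60.0


# Values copied from the pyRAD steps 3-7 run above
real = minutes(250, 58.693)
user = minutes(1180, 7.893)
sys_time = minutes(12, 1.399)

# CPU minutes consumed per wall-clock minute
busy_cores = (user + sys_time) / real
print("average cores busy: {:.1f} of 13 requested".format(busy_cores))
```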

### pyRAD params file (params-pyrad.txt)


In [99]:
## I zipped and uuencoded the params file so it wouldn't take up so much visual room
## Lol
params = 'begin 666 <data>\nM>)R-5^]OVCH4_<Y?<:5]&&%NFA^%MFR95*W3.FFKMG;2^UBYL2%>$SNS30?3\nM^^/?=9P *Q >\'TH@^/CF^)QS;[-L-(*::EIQRS4(62^L@9G24*_NKJ[AF6LC\nME(0TC,))"C :90=>0&<SGEO.P%A>0Y8-PM.[CU=?KJ]^7,&!UZM7$(?PC])/\nM0LZ!"8T 2J\\._7S]&M*R# 8[6$D(7U0>@IJ!5/*$\\6I16E&7?(E5S43)#0R%\nMNV>A%))#?!$@EHGW0*4;J$>J<\\5X [!G_799#NK9<%Q0[\'_:LQ!R5554,A@Z\nMCJDM K *<GP@Z%:Z.PM_\'7C8E)A),*@6)L<2]B./^Y /K6R0SX/!CT\\?KC[M\nM91J1)R\'<<6.UR*U3@D))%!2/:\\C#>4C@P[]^]<E[:"Z"C@IBDF 0IWMA6^CS\nM$&ZAUBKGQBB-I^.46):\\# ZNZ@Y_TH=[$<)7(1FO;3&%2DBD!JNF<]XHFT)>\nM+HR3^PLRSH@9!X.S/N1+K/C[@I8(2Y?P"HRPJ*K?PA;P"[^&=Y!$S0$8SKU(\nMDB@8.B;"B_%AV#A"&S153;OBG"-LH;DI5,F &BR;\\5Q4N,FZ8*\\+35D/,AKL\nMFEIJ5S6? OZ4S!\\-J:G0W3MC[MNAL@5Z?>KJ9BHW@:<YZ2,C3AJ>/ZAGS[*A\nM5>U,AI<4[2*QU%+E"_."9A1<KRIB--]7NKR_\\23C0;8,FX)J=\'+AHDHUU+^ \nMS2_&;%(E=7J3WNYNX:#1@-\\TGXDE2(R\\1@Z^4+6P&\'TPE A<4Q1D\\#=T$W&J\nM=A; 7S>):>"1E^HWGI(P_JP/9>-V3.[F9 \\3XQ"W#*>HIA+7@%D\\&HY%ULTC\nMC$#)<M7>4_HOSZ#D3ES9/=B3%ILR=H*^\'B(#<ZT6-<8&75(8EL)8<%\'B-]OF\nM8T*.8)^WV\'R):L;L/(+XXAC#\\%34*U3EZ;J#K"\\>.!/6G(X>[N*\'T>$"+MH"\nMRC;$&3_9;@8,\';%_<>/5GB>[;(%1F5^GF%Z5,!6U>>%EW_8*3#+&9QG$P3;P\nM3I_9!DZB%K@NG,:_FUQICI7/F@-OT-)T;RP>J3B)6V#L7^B;*3BL*&M2#&+_\nM_H8R6N--$T*2^:C_\'\\!))R \\4*&T@(_DIBTU"J,H)O@W)M!V3&PA&%ZH^P"!\nMQ[W Z8;C6Y\\!MVVHY$H:%/POO\\WX)1_\'@,\\VP&VX^"SYLYJ\';90WS6)GDV/ \nMG4_K4@FV\\M!-(]N\'F+Q=A^RZZ?1@3S9%W]]^,U,/$D=1@%F&^8UJZ;XA[MLM\nM-GI-FIQO@#]CLRP1VL6LD"=-"R(TU\\H8_Z\'9(H247%X&Z[&A![OSG\\:!1%4@\nM%]4C;YHB:Y\'B.$EV!.T[VC\'LSH*HTVHSCI1\\9HD6\\P(C1FZZCR!.\\,.(1($G\nMY/DP<MIYL&T&V!U0L4A+322AQ)!GLB"65.2)S,D(X;:.L;?D-%Y3_3/$D##<\nMSV740C.DO%^^J]IYQ=.SS+K/1]V2=C9\\XMA.\'"<5DJPY96T*.9U\\E&[<<3VZ\nMY\')NB]"SW>OO=,N&V*MH_H06^>-&8&F#AM7,W1J.47<5I_)-,KJ_[@;67N"U\nM#86\\YIK7FS[!W,>VV[_+,%YS50MN_&[Q4>#.AC@]0R%PXL,16B#3VS-5PV\\6\nM$<R^U;K+#]T<U0/<>1"KXWAH%35/&ZPX>\\T0_C7@_U"%8@2B3*HUL)_1>K [\nM&W;COZ/;#7[N &L\\M9_JL=UH\\A8VDMM@_S6:8,MS7;;IY*=Y29%4:HR8RXI+\n/NSNQ;&EX=U89_ >8=H2-\n \nend\n'
print(params.decode("uu").decode("zlib"))
with open(os.path.join(PYRAD_DIR, "params-pyrad.txt"), 'w') as outfile:
    outfile.write(params.decode("uu").decode("zlib"))

## If you want to make changes to the params file you can generate a new uuencoded string
## by editing the pyrad/params-pyrad.txt file and running these lines. If you run this
## at a terminal inside python shell then it'll print out nicely as one big fat line. 
#with open(os.path.join(PYRAD_DIR, "params-pyrad.txt")) as infile:
#    infile.read().encode("zlib").encode("uu")

./REALDATA                        ## 1. Working directory                                 (all)
              ## 2. Loc. of non-demultiplexed files (if not line 18)  (s1)
              ## 3. Loc. of barcode file (if not line 18)             (s1)
vsearch                   ## 4. command (or path) to call vsearch (or usearch)    (s3,s6)
muscle                    ## 5. command (or path) to call muscle                  (s3,s7)
TGCAG                     ## 6. Restriction overhang (e.g., C|TGCAG -> TGCAG)     (s1,s2)
13                         ## 7. N processors (parallel)                           (all)
6                         ## 8. Mindepth: min coverage for a cluster              (s4,s5)
4                         ## 9. NQual: max # sites with qual < 20 (or see line 20)(s2)
.85                       ## 10. Wclust: clustering threshold as a decimal        (s3,s6)
rad                       ## 11. Datatype: rad,gbs,pairgbs,pairddrad,(others:see docs)(all)
2                         ## 12. Min
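
The cell above relies on the Python 2 string codecs (`"uu"` and `"zlib"` through `str.decode`). In Python 3, `str` has no `.decode` and `bytes.decode` only accepts text encodings, so that exact call will not run there. Below is a minimal sketch of the same "pack a params file into one printable string" idea for Python 3. It swaps uuencoding for base64, so its output is not interchangeable with the blob above, and the helper names `pack_params` and `unpack_params` are made up for illustration.

```python
import base64
import zlib


def pack_params(text):
    """Compress text with zlib and wrap it in base64 so it fits on one ASCII line."""
    return base64.b64encode(zlib.compress(text.encode("utf-8"))).decode("ascii")


def unpack_params(blob):
    """Reverse pack_params: base64-decode, then zlib-decompress back to text."""
    return zlib.decompress(base64.b64decode(blob)).decode("utf-8")


# Two lines borrowed from the params file printed above, used as toy input
example = (
    "TGCAG                     ## 6. Restriction overhang\n"
    ".85                       ## 10. Wclust: clustering threshold as a decimal\n"
)
blob = pack_params(example)
restored = unpack_params(blob)
print(restored == example)
```

To regenerate a packed string after editing the params file, read the file and pass its contents to `pack_params`.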

In [98]:
%%bash -s "$WORK_DIR" "$PYRAD_DIR"
export PATH="$1/miniconda/bin:$PATH"; export "PYRAD_DIR=$2"
cd $PYRAD_DIR

## Initial filtering
##
## pyrad v.3 does not get faster when given more cores than there are
## samples in the dataset, so the timings here (13 cores for 13 samples)
## are also representative of larger values of
## param 7 - N Processors (Parallel)
time pyrad -p params-pyrad.txt -s 2



     ------------------------------------------------------------
      pyRAD : RADseq for phylogenetics & introgression analyses
     ------------------------------------------------------------

	sorted .fastq from /home/iovercast/manuscript-analysis/ipyrad/REALDATA/REALDATA_edits/*_R1_* being used
	step 2: editing raw reads 
	.............
real	1m37.142s
user	12m39.945s
sys	0m4.051s


In [197]:
%%bash -s "$WORK_DIR" "$PYRAD_DIR"
export PATH="$1/miniconda/bin:$PATH"; export "PYRAD_DIR=$2"
cd $PYRAD_DIR

## All steps 3-7. We do not include step 2 because we are using
## the ipyrad filtered data
time pyrad -p params-pyrad.txt -s 34567


	finished clustering
	ingroup 29154_superba_R1_,30556_thamno_R1_,30686_cyathophylla_R1_,32082_przewalskii_R1_,33413_thamno_R1_,33588_przewalskii_R1_,35236_rex_R1_,35855_rex_R1_,38362_rex_R1_,39618_rex_R1_,40578_rex_R1_,41478_cyathophylloides_R1_,41954_cyathophylloides_R1_
	addon 
	exclude 
	
	final stats written to:
	 /home/iovercast/manuscript-analysis/pyrad/REALDATA/stats/c85d6m2p3H3N3.stats
	output files being written to:
	 /home/iovercast/manuscript-analysis/pyrad/REALDATA/outfiles/ directory





     ------------------------------------------------------------
      pyRAD : RADseq for phylogenetics & introgression analyses
     ------------------------------------------------------------


	de-replicating files for clustering...

	step 3: within-sample clustering of 13 samples at 
	        '.85' similarity. Running 13 parallel jobs
	 	with up to 6 threads per job. If needed, 
		adjust to avoid CPU and MEM limits

	sample 29154_superba_R1_ finished, 117743 loci
	sample 33413_thamno_R1_ finished, 149210 loci
	sample 39618_rex_R1_ finished, 131379 loci
	sample 32082_przewalskii_R1_ finished, 132189 loci
	sample 33588_przewalskii_R1_ finished, 139491 loci
	sample 38362_rex_R1_ finished, 118064 loci
	sample 30556_thamno_R1_ finished, 186957 loci
	sample 30686_cyathophylla_R1_ finished, 203841 loci
	sample 41478_cyathophylloides_R1_ finished, 152744 loci
	sample 40578_rex_R1_ finished, 198186 loci
	sample 35855_rex_R1_ finished, 155901 loci
	sample 41954_cyathophylloides_R1_ finis

## Stacks empirical analysis

Runtime expectations:

Ungapped, with all flags set as below and 13 cores: runtime exceeded 12 hours.

Gapped alignment with parameters set as below and 40 cores:
```
real	466m45.829s
user	17063m15.498s
sys	    4m24.564s
```

Ungapped all default assembly parameters:
```
real	14m49.323s
user	20m29.222s
sys	    0m49.756s
```
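
One way to read the `time` output above is to divide CPU time (user + sys) by wall-clock time (real), which gives a rough count of how many cores were kept busy on average. The sketch below does that arithmetic for the two runs reported above. The function names are hypothetical, and the ratio is only a coarse indicator that ignores I/O waits.

```python
import re


def to_seconds(stamp):
    """Convert a bash `time` stamp such as '466m45.829s' to seconds."""
    match = re.fullmatch(r"(\d+)m([\d.]+)s", stamp.strip())
    if match is None:
        raise ValueError("unrecognised time stamp: {!r}".format(stamp))
    return int(match.group(1)) * 60 + float(match.group(2))


def cores_kept_busy(real, user, sys_time):
    """Average number of busy cores: (user + sys) / real."""
    return (to_seconds(user) + to_seconds(sys_time)) / to_seconds(real)


# Timings copied from the two runs reported above
gapped = cores_kept_busy("466m45.829s", "17063m15.498s", "4m24.564s")
default = cores_kept_busy("14m49.323s", "20m29.222s", "0m49.756s")
print(round(gapped, 1), round(default, 1))
```

With these timings the gapped run kept roughly 36.6 of the 40 cores busy, while the default run averaged about 1.4.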

Explanations of all the stacks parameters. These settings are (close to being) analogous to the .85 clustering similarity used in PyRAD, given that reads are 100 bp long.
* -m 2: Minimum depth to make a stack
* -M 10: Number of mismatches per locus within individuals
* -N 10: Number of allowed mismatches in secondary reads
* -n 10: Number of mismatches when clustering across individuals

Housekeeping flags:
* -T 40: Number of threads 
* -b 1: Batch id (Can be anything)
* -S: Disable sql

Populations flags:
* -m 6: Minimum stack depth per locus when clustering across samples

ustacks flags:
* --max_locus_stacks 2: Maximum number of stacks allowed at a single de novo locus
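
To see how close the mismatch settings above are to a similarity threshold, here is a small back-of-the-envelope sketch. It assumes every mismatch is a single differing base in a 100 bp read and ignores indels. Both function names are made up for illustration.

```python
def mismatches_to_similarity(mismatches, read_length):
    """Fraction of identical bases when `mismatches` bases differ."""
    return 1.0 - float(mismatches) / read_length


def similarity_to_mismatches(similarity, read_length):
    """Number of differing bases allowed at a given similarity threshold."""
    return int(round((1.0 - similarity) * read_length))


# -M 10 on 100 bp reads, and the .85 threshold used for pyrad
print(mismatches_to_similarity(10, 100))
print(similarity_to_mismatches(0.85, 100))
```

Under those assumptions, 10 mismatches in 100 bp corresponds to 0.90 similarity, and a .85 threshold corresponds to 15 mismatches. That is why the settings are close to, rather than exactly, analogous.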


In [182]:
## We build the stacks command w/ python because it'd be a pain
## to do what the glob() call is doing in bash. Then we call
## the command inside a bash cell because denovo_map expects
## all the submodules to be in the PATH

## Read in the filtered fastq from ipyrad step 2 
IPYRAD_EDITS_DIR = os.path.join(IPYRAD_OUTPUT, "REALDATA_edits/")
infiles = ["-s "+ff+" " for ff in glob.glob(IPYRAD_EDITS_DIR+"*_R1_*")]

## Toggle the dryrun flag for testing
DRYRUN=""
DRYRUN="-d"

OUTPUT_FORMATS = "--vcf --genepop --structure --phase --fastphase --phylip "
cmd = "denovo_map.pl -m 2 -M 10 -N 10 -n 10 -T 40 -b 1 -S " + DRYRUN\
        + " -X \'populations:" + OUTPUT_FORMATS + "\'"\
        + " -X \'populations:-m 6\' -X \'ustacks:--max_locus_stacks 2\' "\
        + " -o " + STACKS_UNGAP_OUT + " " + "".join(infiles)
print(cmd)

denovo_map.pl -m 2 -M 10 -N 10 -n 10 -T 13 -b 1 -S -d -X 'populations:--vcf --genepop --structure --phase --fastphase --phylip ' -X 'populations:-m 6' -X 'ustacks:--max_locus_stacks 2'  -o /home/iovercast/manuscript-analysis/stacks/REALDATA/ungapped/ -s /home/iovercast/manuscript-analysis/ipyrad/REALDATA/REALDATA_edits/30556_thamno_R1_.fastq -s /home/iovercast/manuscript-analysis/ipyrad/REALDATA/REALDATA_edits/32082_przewalskii_R1_.fastq -s /home/iovercast/manuscript-analysis/ipyrad/REALDATA/REALDATA_edits/29154_superba_R1_.fastq -s /home/iovercast/manuscript-analysis/ipyrad/REALDATA/REALDATA_edits/40578_rex_R1_.fastq -s /home/iovercast/manuscript-analysis/ipyrad/REALDATA/REALDATA_edits/33588_przewalskii_R1_.fastq -s /home/iovercast/manuscript-analysis/ipyrad/REALDATA/REALDATA_edits/33413_thamno_R1_.fastq -s /home/iovercast/manuscript-analysis/ipyrad/REALDATA/REALDATA_edits/41478_cyathophylloides_R1_.fastq -s /home/iovercast/manuscript-analysis/ipyrad/REALDATA/REALDATA_edits/35855_rex_

In [179]:
%%bash -s "$WORK_DIR" "$STACKS_DIR" "$cmd"
export PATH="$1/miniconda/bin:$PATH"; export "STACKS_DIR=$2"; export "cmd=$3"

## The command contains single-quoted arguments, and bash does not re-parse
## quotes that come out of a variable expansion, so running $cmd directly
## truncates the command at the first single tick. Instead we write the
## command to a file and then exec that file. Hassle.
cd $STACKS_DIR
echo $cmd > stacks.sh; chmod 777 stacks.sh
time ./stacks.sh

/home/iovercast/opt/bin/denovo_map.pl
Identifying unique stacks; file   1 of  13 [30556_thamno_R1_]
Identifying unique stacks; file   2 of  13 [32082_przewalskii_R1_]
Identifying unique stacks; file   3 of  13 [29154_superba_R1_]
Identifying unique stacks; file   4 of  13 [40578_rex_R1_]
Identifying unique stacks; file   5 of  13 [33588_przewalskii_R1_]
Identifying unique stacks; file   6 of  13 [33413_thamno_R1_]
Identifying unique stacks; file   7 of  13 [41478_cyathophylloides_R1_]
Identifying unique stacks; file   8 of  13 [35855_rex_R1_]
Identifying unique stacks; file   9 of  13 [39618_rex_R1_]
Identifying unique stacks; file  10 of  13 [38362_rex_R1_]
Identifying unique stacks; file  11 of  13 [35236_rex_R1_]
Identifying unique stacks; file  12 of  13 [41954_cyathophylloides_R1_]
Identifying unique stacks; file  13 of  13 [30686_cyathophylla_R1_]


Found 13 sample file(s).
  /home/iovercast/opt/bin/ustacks -t fastq -f /home/iovercast/manuscript-analysis/ipyrad/REALDATA/REALDATA_edits/30556_thamno_R1_.fastq -o REALDATA -i 1 -r -m 2 -M 10 -N 10 -p 13 --max_locus_stacks 2 2>&1
  /home/iovercast/opt/bin/ustacks -t fastq -f /home/iovercast/manuscript-analysis/ipyrad/REALDATA/REALDATA_edits/32082_przewalskii_R1_.fastq -o REALDATA -i 2 -r -m 2 -M 10 -N 10 -p 13 --max_locus_stacks 2 2>&1
  /home/iovercast/opt/bin/ustacks -t fastq -f /home/iovercast/manuscript-analysis/ipyrad/REALDATA/REALDATA_edits/29154_superba_R1_.fastq -o REALDATA -i 3 -r -m 2 -M 10 -N 10 -p 13 --max_locus_stacks 2 2>&1
  /home/iovercast/opt/bin/ustacks -t fastq -f /home/iovercast/manuscript-analysis/ipyrad/REALDATA/REALDATA_edits/40578_rex_R1_.fastq -o REALDATA -i 4 -r -m 2 -M 10 -N 10 -p 13 --max_locus_stacks 2 2>&1
  /home/iovercast/opt/bin/ustacks -t fastq -f /home/iovercast/manuscript-analysis/ipyrad/REALDATA/REALDATA_edits/33588_przewalskii_R1_.fastq -o REALDATA
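
The quoting trouble described above can also be sidestepped by keeping the command as a list of arguments and letting `shlex.quote` add the shell quoting. This is only a sketch under assumptions: it targets Python 3, the flags are copied from the cell above, and the helper name and the placeholder file names are hypothetical.

```python
import shlex


def build_denovo_map_args(sample_files, out_dir, threads=40, dry_run=False):
    """Return the denovo_map.pl call as a list, one element per argument."""
    args = ["denovo_map.pl", "-m", "2", "-M", "10", "-N", "10", "-n", "10",
            "-T", str(threads), "-b", "1", "-S"]
    if dry_run:
        args.append("-d")
    # Each -X value must reach denovo_map.pl as ONE argument, spaces included
    args += ["-X", "populations:--vcf --genepop --structure --phase --fastphase --phylip"]
    args += ["-X", "populations:-m 6", "-X", "ustacks:--max_locus_stacks 2"]
    args += ["-o", out_dir]
    for path in sorted(sample_files):
        args += ["-s", path]
    return args


# Placeholder inputs, not the real sample files
args = build_denovo_map_args(["b_R1_.fastq", "a_R1_.fastq"], "ungapped/", dry_run=True)
# shlex.quote wraps any argument containing spaces in single quotes
script_line = " ".join(shlex.quote(a) for a in args)
print(script_line)
```

The resulting `script_line` is safe to write to a shell script. The list itself could instead be handed to `subprocess.check_call(args)` with no shell involved, provided the stacks binaries are on the `PATH` of the notebook process.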

In [200]:
## Do Gapped alignment

## Read in the filtered fastq from ipyrad step 2 
IPYRAD_EDITS_DIR = os.path.join(IPYRAD_OUTPUT, "REALDATA_edits/")
infiles = ["-s "+ff+" " for ff in glob.glob(IPYRAD_EDITS_DIR+"*_R1_*")]

## Toggle the dryrun flag for testing
DRYRUN=""
#DRYRUN=" -d "

OUTPUT_FORMATS = "--vcf --genepop --structure --phase --fastphase --phylip "
cmd = "denovo_map.pl -m 2 -M 10 -N 10 -n 10 -T 40 -b 1 -S --gapped " + DRYRUN\
        + " -X \'populations:" + OUTPUT_FORMATS + "\'"\
        + " -X \'populations:-m 6\' -X \'ustacks:--max_locus_stacks 2\' "\
        + " -o " + STACKS_GAP_OUT + " " + "".join(infiles)
print(cmd)

denovo_map.pl -m 2 -M 10 -N 10 -n 10 -T 40 -b 1 -S --gapped  -X 'populations:--vcf --genepop --structure --phase --fastphase --phylip ' -X 'populations:-m 6' -X 'ustacks:--max_locus_stacks 2'  -o /home/iovercast/manuscript-analysis/stacks/REALDATA/gapped/ -s /home/iovercast/manuscript-analysis/ipyrad/REALDATA/REALDATA_edits/30556_thamno_R1_.fastq -s /home/iovercast/manuscript-analysis/ipyrad/REALDATA/REALDATA_edits/32082_przewalskii_R1_.fastq -s /home/iovercast/manuscript-analysis/ipyrad/REALDATA/REALDATA_edits/29154_superba_R1_.fastq -s /home/iovercast/manuscript-analysis/ipyrad/REALDATA/REALDATA_edits/40578_rex_R1_.fastq -s /home/iovercast/manuscript-analysis/ipyrad/REALDATA/REALDATA_edits/33588_przewalskii_R1_.fastq -s /home/iovercast/manuscript-analysis/ipyrad/REALDATA/REALDATA_edits/33413_thamno_R1_.fastq -s /home/iovercast/manuscript-analysis/ipyrad/REALDATA/REALDATA_edits/41478_cyathophylloides_R1_.fastq -s /home/iovercast/manuscript-analysis/ipyrad/REALDATA/REALDATA_edits/35855

In [201]:
%%bash -s "$WORK_DIR" "$STACKS_DIR" "$cmd"
export PATH="$1/miniconda/bin:$PATH"; export "STACKS_DIR=$2"; export "cmd=$3"

## Run the gapped alignment
cd $STACKS_DIR
echo $cmd > stacks.sh; chmod 777 stacks.sh
time ./stacks.sh

Identifying unique stacks; file   1 of  13 [30556_thamno_R1_]
Identifying unique stacks; file   2 of  13 [32082_przewalskii_R1_]
Identifying unique stacks; file   3 of  13 [29154_superba_R1_]
Identifying unique stacks; file   4 of  13 [40578_rex_R1_]
Identifying unique stacks; file   5 of  13 [33588_przewalskii_R1_]
Identifying unique stacks; file   6 of  13 [33413_thamno_R1_]
Identifying unique stacks; file   7 of  13 [41478_cyathophylloides_R1_]
Identifying unique stacks; file   8 of  13 [35855_rex_R1_]
Identifying unique stacks; file   9 of  13 [39618_rex_R1_]
Identifying unique stacks; file  10 of  13 [38362_rex_R1_]
Identifying unique stacks; file  11 of  13 [35236_rex_R1_]
Identifying unique stacks; file  12 of  13 [41954_cyathophylloides_R1_]
Identifying unique stacks; file  13 of  13 [30686_cyathophylla_R1_]


Found 13 sample file(s).
  /home/iovercast/opt/bin/ustacks -t fastq -f /home/iovercast/manuscript-analysis/ipyrad/REALDATA/REALDATA_edits/30556_thamno_R1_.fastq -o /home/iovercast/manuscript-analysis/stacks/REALDATA/gapped -i 1 -r -m 2 -M 10 -N 10 -p 40 --gapped  --max_locus_stacks 2 2>&1
  /home/iovercast/opt/bin/ustacks -t fastq -f /home/iovercast/manuscript-analysis/ipyrad/REALDATA/REALDATA_edits/32082_przewalskii_R1_.fastq -o /home/iovercast/manuscript-analysis/stacks/REALDATA/gapped -i 2 -r -m 2 -M 10 -N 10 -p 40 --gapped  --max_locus_stacks 2 2>&1
  /home/iovercast/opt/bin/ustacks -t fastq -f /home/iovercast/manuscript-analysis/ipyrad/REALDATA/REALDATA_edits/29154_superba_R1_.fastq -o /home/iovercast/manuscript-analysis/stacks/REALDATA/gapped -i 3 -r -m 2 -M 10 -N 10 -p 40 --gapped  --max_locus_stacks 2 2>&1
  /home/iovercast/opt/bin/ustacks -t fastq -f /home/iovercast/manuscript-analysis/ipyrad/REALDATA/REALDATA_edits/40578_rex_R1_.fastq -o /home/iovercast/manuscript-analysis/st

In [189]:
## Do all default settings
IPYRAD_EDITS_DIR = os.path.join(IPYRAD_OUTPUT, "REALDATA_edits/")
infiles = ["-s "+ff+" " for ff in glob.glob(IPYRAD_EDITS_DIR+"*_R1_*")]

## Toggle the dryrun flag for testing
DRYRUN=""
#DRYRUN="-d"

OUTPUT_FORMATS = "--vcf --genepop --structure --phase --fastphase --phylip "
cmd = "denovo_map.pl -T 40 -b 1 -S " + DRYRUN\
        + " -X \'populations:" + OUTPUT_FORMATS + "\'"\
        + " -o " + STACKS_DEFAULT_OUT + " " + "".join(infiles)
print(cmd)

denovo_map.pl -T 40 -b 1 -S  -X 'populations:--vcf --genepop --structure --phase --fastphase --phylip ' -o /home/iovercast/manuscript-analysis/stacks/REALDATA/default/ -s /home/iovercast/manuscript-analysis/ipyrad/REALDATA/REALDATA_edits/30556_thamno_R1_.fastq -s /home/iovercast/manuscript-analysis/ipyrad/REALDATA/REALDATA_edits/32082_przewalskii_R1_.fastq -s /home/iovercast/manuscript-analysis/ipyrad/REALDATA/REALDATA_edits/29154_superba_R1_.fastq -s /home/iovercast/manuscript-analysis/ipyrad/REALDATA/REALDATA_edits/40578_rex_R1_.fastq -s /home/iovercast/manuscript-analysis/ipyrad/REALDATA/REALDATA_edits/33588_przewalskii_R1_.fastq -s /home/iovercast/manuscript-analysis/ipyrad/REALDATA/REALDATA_edits/33413_thamno_R1_.fastq -s /home/iovercast/manuscript-analysis/ipyrad/REALDATA/REALDATA_edits/41478_cyathophylloides_R1_.fastq -s /home/iovercast/manuscript-analysis/ipyrad/REALDATA/REALDATA_edits/35855_rex_R1_.fastq -s /home/iovercast/manuscript-analysis/ipyrad/REALDATA/REALDATA_edits/396

In [190]:
%%bash -s "$WORK_DIR" "$STACKS_DIR" "$cmd"
export PATH="$1/miniconda/bin:$PATH"; export "STACKS_DIR=$2"; export "cmd=$3"

## Run with all defaults
cd $STACKS_DIR
echo $cmd > stacks.sh; chmod 777 stacks.sh
time ./stacks.sh

Identifying unique stacks; file   1 of  13 [30556_thamno_R1_]
Identifying unique stacks; file   2 of  13 [32082_przewalskii_R1_]
Identifying unique stacks; file   3 of  13 [29154_superba_R1_]
Identifying unique stacks; file   4 of  13 [40578_rex_R1_]
Identifying unique stacks; file   5 of  13 [33588_przewalskii_R1_]
Identifying unique stacks; file   6 of  13 [33413_thamno_R1_]
Identifying unique stacks; file   7 of  13 [41478_cyathophylloides_R1_]
Identifying unique stacks; file   8 of  13 [35855_rex_R1_]
Identifying unique stacks; file   9 of  13 [39618_rex_R1_]
Identifying unique stacks; file  10 of  13 [38362_rex_R1_]
Identifying unique stacks; file  11 of  13 [35236_rex_R1_]
Identifying unique stacks; file  12 of  13 [41954_cyathophylloides_R1_]
Identifying unique stacks; file  13 of  13 [30686_cyathophylla_R1_]


Found 13 sample file(s).
  /home/iovercast/opt/bin/ustacks -t fastq -f /home/iovercast/manuscript-analysis/ipyrad/REALDATA/REALDATA_edits/30556_thamno_R1_.fastq -o /home/iovercast/manuscript-analysis/stacks/REALDATA/default -i 1 -r  -p 40 2>&1
  /home/iovercast/opt/bin/ustacks -t fastq -f /home/iovercast/manuscript-analysis/ipyrad/REALDATA/REALDATA_edits/32082_przewalskii_R1_.fastq -o /home/iovercast/manuscript-analysis/stacks/REALDATA/default -i 2 -r  -p 40 2>&1
  /home/iovercast/opt/bin/ustacks -t fastq -f /home/iovercast/manuscript-analysis/ipyrad/REALDATA/REALDATA_edits/29154_superba_R1_.fastq -o /home/iovercast/manuscript-analysis/stacks/REALDATA/default -i 3 -r  -p 40 2>&1
  /home/iovercast/opt/bin/ustacks -t fastq -f /home/iovercast/manuscript-analysis/ipyrad/REALDATA/REALDATA_edits/40578_rex_R1_.fastq -o /home/iovercast/manuscript-analysis/stacks/REALDATA/default -i 4 -r  -p 40 2>&1
  /home/iovercast/opt/bin/ustacks -t fastq -f /home/iovercast/manuscript-analysis/ipyrad/REALDAT

## aftrRAD empirical analysis

In [None]:
import subprocess
import gzip

## Set up directory structures. change the force flag if you want to
## blow everything away and restart
# force = True
force = ""

if force and os.path.exists(AFTRRAD_OUTPUT):
    shutil.rmtree(AFTRRAD_OUTPUT)
if not os.path.exists(AFTRRAD_OUTPUT):
    os.makedirs(AFTRRAD_OUTPUT)
os.chdir(AFTRRAD_OUTPUT)

## Prep empirical data by copying it to a local folder called DemultiplexedFiles
## in this way you don't need the Barcodes/Data dir
demux_dir = os.path.join(AFTRRAD_OUTPUT, "DemultiplexedFiles/")
if force and os.path.exists(demux_dir):
    shutil.rmtree(demux_dir)
if not os.path.exists(demux_dir):
    os.makedirs(demux_dir)

IPYRAD_EDITS_DIR = os.path.join(IPYRAD_OUTPUT, "REALDATA_edits/")
infiles = [ff for ff in glob.glob(IPYRAD_EDITS_DIR+"*_R1_*")]
## Don't recopy the data over and over again while testing
if not os.path.exists(demux_dir+infiles[0].split("/")[-1]):
    for f in infiles:
        shutil.copy2(f, demux_dir+f.split("/")[-1])

## AftrRAD is incredibly picky about where scripts and binaries are so you
## have to do a bunch of annoying housekeeping.
cmd = "cp -rf {}/* {}".format(os.path.join(AFTRRAD_DIR, "AftrRAD/AftrRADv5.0"), AFTRRAD_OUTPUT)
os.system(cmd)
shutil.copy2(os.path.join(AFTRRAD_DIR, "dnaMatrix"), AFTRRAD_OUTPUT)
shutil.copy2(os.path.join(AFTRRAD_DIR, "ACANA"), AFTRRAD_OUTPUT)

## This first line of nonsense is to make it so perl can find the Parallel::ForkManager library
base = "export PATH={}perl5/bin:$PATH; cpanm --local-lib={}/perl5 local::lib && eval $(perl -I {}/perl5/lib/perl5/ -Mlocal::lib); ".format(WORK_DIR, WORK_DIR, WORK_DIR)

## Good sub for testing
#cmd = base + "time perl AftrRAD.pl -h"
#output = subprocess.check_output(cmd, shell=True, stderr=subprocess.STDOUT)
#print(output)

## numIndels-3 What do here?
## stringLength-99 We don't care about homopolymers here, so set it very high
## maxH-90 ??
## AftrRAD manu analysis settings - P2: noP2;  minDepth: 5; numIndel: 3; MinReads: 5
cmd = base + "time perl AftrRAD.pl minDepth-6 P2-noP2 minIden-85 " \
           + "re-TGCAG stringLength-99 dplexedData-1 maxProcesses-40"
print(cmd)
output = ""
output = subprocess.check_output(cmd, shell=True, stderr=subprocess.STDOUT)
print(output)

## MinReads is mindepth_statistical
cmd = base + "time perl Genotype.pl MinReads-6 subset-0 maxProcesses-40"
print(cmd)
output = subprocess.check_output(cmd, shell=True, stderr=subprocess.STDOUT)
print(output)

## pctScored is min_samples_locus as a percent of total samples, here since
##    we use 2 for min_samples_locus and there are 12 simulated samples we use 17%
cmd = base + "time perl FilterSNPs.pl pctScored-17"
print(cmd)
output = subprocess.check_output(cmd, shell=True, stderr=subprocess.STDOUT)

print(output)

export PATH=/home/iovercast/manuscript-analysis/perl5/bin:$PATH; cpanm --local-lib=/home/iovercast/manuscript-analysis//perl5 local::lib && eval $(perl -I /home/iovercast/manuscript-analysis//perl5/lib/perl5/ -Mlocal::lib); time perl AftrRAD.pl minDepth-6 P2-noP2 minIden-85 re-TGCAG stringLength-99 dplexedData-1 maxProcesses-40


### Post-process aftrRAD output so we actually can get a vcf file

In [None]:
%%bash -s "$WORK_DIR" "$AFTRRAD_DIR"
export PATH="$1/miniconda/bin:$PATH"; export "WORK_DIR=$1"; export "AFTRRAD_DIR=$2"
cd $AFTRRAD_DIR/REALDATA/Formatting

perl OutputFasta.pl SNPsOnly-1
snp-sites -v -o REALDATA.vcf SNPMatrix_*.All.fasta

## dDocent empirical analysis
Runtime: ~ 15 minutes

In [None]:
import ipyrad as ip
import subprocess
import glob

## Set up directory structures. change the force flag if you want to
## blow everything away and restart
# force = True
force = ""

if force and os.path.exists(DDOCENT_OUTPUT):
    shutil.rmtree(DDOCENT_OUTPUT)
if not os.path.exists(DDOCENT_OUTPUT):
    os.makedirs(DDOCENT_OUTPUT)
os.chdir(DDOCENT_OUTPUT)

## Have to copy all the raw files here so ddocent can find them
IPYRAD_EDITS_DIR = os.path.join(IPYRAD_OUTPUT, "REALDATA_edits/")
infiles = [ff for ff in glob.glob(IPYRAD_EDITS_DIR+"*_R1_*")]
## Don't recopy the data over and over again while testing
if not os.path.exists(infiles[0].split("/")[-1]):
    for f in infiles:
        shutil.copy2(f, f.split("/")[-1])

## Now we have to rename all the files in the way dDocent expects them:
## 1A_0_R1_.fastq.gz -> Pop1_Sample1.F.fq.gz
name_mapping = {
    "29154_superba_R1_.fastq": "Pop1_Sample1.F.fq",
    "32082_przewalskii_R1_.fastq": "Pop2_Sample1.F.fq",
    "33588_przewalskii_R1_.fastq": "Pop2_Sample2.F.fq",
    "35236_rex_R1_.fastq": "Pop3_Sample1.F.fq",
    "39618_rex_R1_.fastq": "Pop3_Sample2.F.fq",
    "35855_rex_R1_.fastq": "Pop3_Sample3.F.fq",
    "40578_rex_R1_.fastq": "Pop3_Sample4.F.fq",
    "38362_rex_R1_.fastq": "Pop3_Sample5.F.fq",
    "41954_cyathophylloides_R1_.fastq": "Pop4_Sample1.F.fq",
    "30686_cyathophylla_R1_.fastq": "Pop4_Sample2.F.fq",
    "41478_cyathophylloides_R1_.fastq": "Pop4_Sample3.F.fq",
    "30556_thamno_R1_.fastq": "Pop5_Sample1.F.fq",
    "33413_thamno_R1_.fastq": "Pop5_Sample2.F.fq",
}
for k,v in name_mapping.items():
    os.rename(k, v)

## Only runs on gzip files.
!gzip *.fq

## Write out the config file for this run.
## Compacted the config file into one long line here to make it not take up so much room
config_file = "empirical-config.txt"
with open(config_file, 'w') as outfile:
    outfile.write('Number of Processors\n40\nMaximum Memory\n0\nTrimming\nyes\nAssembly?\nyes\nType_of_Assembly\nSE\nClustering_Similarity%\n0.85\nMapping_Reads?\nyes\nMapping_Match_Value\n1\nMapping_MisMatch_Value\n3\nMapping_GapOpen_Penalty\n5\nCalling_SNPs?\nyes\nEmail\nwatdo@mailinator.com\n')

cmd = "export LD_LIBRARY_PATH={}/freebayes-src/vcflib/tabixpp/htslib/:$LD_LIBRARY_PATH; ".format(DDOCENT_DIR)
cmd += "export PATH={}:$PATH; time dDocent {}".format(DDOCENT_DIR, config_file)
print(cmd)
## Have to run the printed command by hand from the ddocent REALDATA dir because it doesn't like running in the notebook
#!$cmd

## NB: Must rename all the samples in the output vcf and then use vcf-shuffle-cols
## perl script in the vcf/perl directory to reorder the vcf file to match
## the output of stacks and ipyrad for pca/heatmaps to work.


Doing - no
  New Assembly: simno
ipyrad -p params-simno.txt -s 1

 -------------------------------------------------------------
  ipyrad [v.0.4.0]
  Interactive assembly and analysis of RAD-seq data
 -------------------------------------------------------------
  New Assembly: simno
  local compute node: [40 cores] on node001

  Step 1: Demultiplexing fastq data to Samples
  [####################] 100%  chunking large files  | 0:00:00 
  [####################] 100%  sorting reads         | 0:00:44 
  [####################] 100%  writing/compressing   | 0:00:02 



real	0m56.558s
user	0m1.724s
sys	0m0.437s
Renaming ./1A_0_R1_.fastq.gz -> Pop1_SampleA.F.fq.gz
Renaming ./3J_0_R1_.fastq.gz -> Pop3_SampleJ.F.fq.gz
Renaming ./1B_0_R1_.fastq.gz -> Pop1_SampleB.F.fq.gz
Renaming ./3K_0_R1_.fastq.gz -> Pop3_SampleK.F.fq.gz
Renaming ./2F_0_R1_.fastq.gz -> Pop2_SampleF.F.fq.gz
Renaming ./1D_0_R1_.fastq.gz -> Pop1_SampleD.F.fq.gz
Renaming ./1C_0_R1_.fastq.gz -> Pop1_SampleC.F.fq.gz
Renaming ./3L_0
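
The dDocent config written above is just alternating "setting name" and "value" lines. Here is a sketch that builds the same text from a list of pairs, which is easier to edit than the one long string. The values are copied from the cell above, and the names `settings` and `render_config` are hypothetical.

```python
# (setting name, value) pairs in the order they appear in the config file
settings = [
    ("Number of Processors", "40"),
    ("Maximum Memory", "0"),
    ("Trimming", "yes"),
    ("Assembly?", "yes"),
    ("Type_of_Assembly", "SE"),
    ("Clustering_Similarity%", "0.85"),
    ("Mapping_Reads?", "yes"),
    ("Mapping_Match_Value", "1"),
    ("Mapping_MisMatch_Value", "3"),
    ("Mapping_GapOpen_Penalty", "5"),
    ("Calling_SNPs?", "yes"),
    ("Email", "watdo@mailinator.com"),
]


def render_config(pairs):
    """Write each setting name on one line and its value on the next."""
    return "".join("{}\n{}\n".format(name, value) for name, value in pairs)


config_text = render_config(settings)
print(config_text)
```

Writing `config_text` to `empirical-config.txt` should produce the same file as the one-line version.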

dDocent makes you rename your samples and then orders the vcf columns by
these new sample names, so the column order differs from ipyrad/pyrad/stacks.
That breaks the downstream plotting, so we have to relabel all the columns
with the familiar names and then reorder them to match the other programs.

In [None]:
## There's probably a better way to do this
for f in ["Final.recode.vcf", "TotalRawSNPs.vcf"]:
    vcffile = os.path.join(DDOCENT_OUTPUT, f)
    infile = open(vcffile,'r')
    filedata = infile.readlines()
    infile.close()
    
    outfile = open(vcffile,'w')
    for line in filedata:
        if "CHROM" in line:
            for ipname, ddname in name_mapping.items():
                ipname = ipname.split("_R1")[0]
                ddname = ddname.split(".")[0]
                line = line.replace(ddname, ipname)
        outfile.write(line)
    outfile.close()
    
    ## Now we have to reorder the genotype columns
    IPYRAD_VCF = os.path.join(IPYRAD_OUTPUT, "REALDATA_outfiles/REALDATA.vcf")
    os.chdir(os.path.join(DDOCENT_DIR, "vcftools_0.1.11/perl"))
    tmpvcf = os.path.join(DDOCENT_DIR, "ddocent-tmp.vcf")
    cmd = "perl vcf-shuffle-cols -t {} {} > {}".format(IPYRAD_VCF, vcffile, tmpvcf)
    print(cmd)
    os.system(cmd)
    os.rename(tmpvcf, vcffile)
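
For comparison, here is a sketch of the same relabel-and-reorder step done in pure Python on a list of VCF lines. It is a stand-in for the string replace plus `vcf-shuffle-cols` combination above, not what was actually run. It assumes tab-separated lines with the 9 standard fixed columns before the genotype columns. The function name and the toy records are hypothetical.

```python
FIXED_COLS = 9  # CHROM POS ID REF ALT QUAL FILTER INFO FORMAT


def relabel_and_reorder(lines, rename, target_order):
    """Rename the sample columns of a VCF and emit them in target_order."""
    out = []
    picks = None
    for line in lines:
        if line.startswith("##") or not line.strip():
            out.append(line)  # meta lines and blank lines pass through untouched
            continue
        fields = line.rstrip("\n").split("\t")
        if line.startswith("#CHROM"):
            # Rename the sample columns, then record where each target name sits
            samples = [rename.get(name, name) for name in fields[FIXED_COLS:]]
            fields[FIXED_COLS:] = samples
            picks = [FIXED_COLS + samples.index(name) for name in target_order]
        if picks is None:
            raise ValueError("data line seen before the #CHROM header")
        out.append("\t".join(fields[:FIXED_COLS] + [fields[i] for i in picks]) + "\n")
    return out


# Toy two-sample VCF, not real data
demo = [
    "##fileformat=VCFv4.1\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tPop1_Sample1\tPop5_Sample1\n",
    "locus_1\t7\t.\tA\tG\t.\tPASS\t.\tGT\t0/1\t1/1\n",
]
rename = {"Pop1_Sample1": "29154_superba", "Pop5_Sample1": "30556_thamno"}
result = relabel_and_reorder(demo, rename, ["30556_thamno", "29154_superba"])
print("".join(result))
```

Because the header is split on tabs, a name such as `Pop3_Sample1` can never be replaced by accident inside a longer name, which a plain `str.replace` on the header line cannot guarantee in general.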

# Results


## Addendum

This is me carping about organization of the stacks manual. The walkthrough doesn't mention
that you _don't_ need to use mysql until somewhere in the late-middle of the process, long
after you actually have to do the install in order to get it running.
```
## NB: If you pass the `-S` flag to denovo_map.pl then it won't require the sql backend. 
## This message is totally buried in the docs.

## Also `--overw_db` and `--create_db` flags would have been good to know about.

## denovo_map.pl refused to run w/o mysql (command not found error) so I had to brew install it
brew install mysql
mysql.server start
mysql -u root
> GRANT ALL ON *.* TO 'stacks_user'@'localhost' IDENTIFIED BY 'stackspassword';
> create database stacks;

## Populate the mysql database with the right tables
mysql -u root stacks < /usr/local/share/stacks/sql/stacks.sql

## Edit the /usr/local/share/stacks/sql/mysql.cnf.dist file to reflect the username and password as above
## then copy this file to /usr/local/share/stacks/sql/mysql.cnf
```
