# Pipeline Overview

The following data were generated by extracting DNA from soil samples using the MoBio HTP Soil kit, amplifying an arthropod COI fragment with the new JOH primers, then sequencing 250 bp paired-end (PE) reads on an Illumina HiSeq2500. RunStats for these data can be found [here] [link2]. Note that these 8 samples were pooled with many other samples as part of a single run shared among several separate projects.

See the [UFITS GitHub page] [link1] for an overview of installing and running the code. Because the program is still under development, rely only on the instructions in its README.md file.

This notebook describes a modification of the traditional UFITS pipeline, which uses Robert Edgar's UPARSE algorithm to merge PE reads, cluster OTUs, etc. Instead, we'll use a more recently adopted algorithm called DADA2 (see the [paper] [link3] and [github page] [link4]), which can do the same things as the original UFITS pipeline and also error-correct reads. The advantage here is that we can likely keep samples with lower numbers of reads, because DADA2 error-corrects the Illumina reads before the merged PE reads are clustered. Jon Palmer continues to advise against doing so, but in any event this algorithm seems to do a better job of calling what's there and leaving out what's not.

[link1]:https://github.com/nextgenusfs/ufits
[link3]:http://www.nature.com/articles/nmeth.3869.epdf?author_access_token=hfTtC2mxuI5t44WUhsz05NRgN0jAjWel9jnR3ZoTv0N5gu3rLNk61gF4j2hXPcagLe964qdfd3GRw8OwyZxfEgsul8lwR1lEWykR3lWF30Dl_bZWMvTPwOdrwuiBUeYa
[link4]:https://github.com/benjjneb/dada2
[link2]:http://cobb.unh.edu/161103_DevonP1_Lane1_A_DemuxStats.html

## Dataset information

We use 'pandas' to import the data tables shown below. It was already installed via 'pip install pandas' along with its dependencies. See the [pandas documentation] [link5], some [useful pandas code] [link6] (though some of it may be outdated), and the [pandas cookbook] [link7].

[link5]:http://pandas.pydata.org/pandas-docs/version/0.18.1/options.html
[link6]:http://www.swegler.com/becky/blog/2014/08/06/useful-pandas-snippets/
[link7]:http://pandas.pydata.org/pandas-docs/stable/cookbook.html

In [2]:
import os
import pandas as pd
from pandas import DataFrame, read_csv

### Sample information

In [4]:
sampleinfo = r'/Users/devonorourke/Documents/Lab.Foster/guano/soil_mt/sampleinfo.csv'
df = pd.read_csv(sampleinfo)
df.columns = ['SampleID', 'WellNumber', 'CoreID', 'SampleNum', 'TopDepth(cm)', 'BottomDepth(cm)', 'Description']
df = df.sort_values(by="SampleID")
pd.set_option('expand_frame_repr', False)
print(df)

          SampleID WellNumber   CoreID SampleNum TopDepth(cm) BottomDepth(cm)                             Description
1         mud.9062        A02  CH10_sh         2          0.5               1      no sediment remaining in whirl pak
2         mud.9063        A03  CH10_sh         3            1             1.5         sediment remaining in whirl pak
3         mud.9064        A04  CH10_sh         4          1.5               2      no sediment remaining in whirl pak
4         mud.9065        A05  CH10_sh         5            2             2.5      no sediment remaining in whirl pak
5         mud.9066        A06  CH07_sh         1            0             0.5         sediment remaining in whirl pak
6         mud.9067        A07  CH07_sh         2          0.5             1.5         sediment remaining in whirl pak
7         mud.9068        A08  CH07_sh         3          1.5               2  little sediment remaining in whirl pak
8         mud.9069        A09  CH07_sh         4        

### Runstats information

In [27]:
runstats = r'/Users/devonorourke/Documents/Lab.Foster/guano/soil_mt/mudrunstats.csv'
df = pd.read_csv(runstats, 
                sep = ",", 
                names = ["SampleName", "Index", "Yield(MB)", "#Reads", "%PerfectIndexRead", "MeanPhred"])
pd.set_option('expand_frame_repr', False)
print(df)

                             SampleName   Index Yield(MB)     #Reads  %PerfectIndexRead  MeanPhred
SampleName Index              Yield(Mb)  #Reads    Q30>=%  MeanPhred                NaN        NaN
mud_9062   CGAAGTAT-ACGACGTG         15  58,146     90.04      35.71                NaN        NaN
mud_9063   TAGCAGCT-ACGACGTG          1   3,634     92.27      36.55                NaN        NaN
mud_9064   TCTCTATG-ACGACGTG          6  23,458     90.45      35.87                NaN        NaN
mud_9065   GATCTACG-ACGACGTG          1   2,980     89.99      35.74                NaN        NaN
mud_9066   GTAACGAG-ACGACGTG          0     198     85.96       34.4                NaN        NaN
mud_9067   ACGTGCGC-ACGACGTG          3  10,610     90.63         36                NaN        NaN
mud_9068   ATAGTACC-ACGACGTG          0   1,130     87.41      34.82                NaN        NaN
mud_9069   GCGTATAC-ACGACGTG          0     160      82.2       32.9                NaN        NaN
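
The printed table above is shifted: the file's own header row shows up as a data row, the sample name and index have become the row index, and the last two columns are all NaN. One possible cause (an assumption, since the raw CSV is not shown here) is that the file already has a header line and that each row carries two empty trailing fields. The sketch below uses a small in-memory stand-in for `mudrunstats.csv` with that hypothetical layout; the values are copied from the table above.

```python
import io
import pandas as pd

# Hypothetical stand-in for mudrunstats.csv: a header line, quoted read counts
# with thousands separators, and two empty trailing fields per row.
csv_text = '''SampleName,Index,Yield(Mb),#Reads,Q30>=%,MeanPhred,,
mud_9062,CGAAGTAT-ACGACGTG,15,"58,146",90.04,35.71,,
mud_9063,TAGCAGCT-ACGACGTG,1,"3,634",92.27,36.55,,
mud_9066,GTAACGAG-ACGACGTG,0,"198",85.96,34.4,,
'''

runstats_df = pd.read_csv(
    io.StringIO(csv_text),
    header=0,                 # the first line is the header, not data
    usecols=list(range(6)),   # ignore the empty trailing fields
    thousands=",",            # read "58,146" as the integer 58146
)
print(runstats_df)
```

If the real file matches this layout, the same three arguments can be passed alongside the path in the `pd.read_csv(runstats, ...)` call, and the read counts become numeric and sortable.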


# Running UFITS

Running version 0.5.5.  
New COI database installed with this version.

    #install COI db for taxonomy calls if needed
    ufits install -i COI

    #can check to determine which db installed by entering
    ufits taxonomy

## Part 1: data cleaning

One of the first things to do with the raw data is to move exactly the files you want into a dedicated directory. This sequencing run included more than just the MT_mud samples, so you have to separate the files you want from the files you don't. In this case, it's just a matter of using a wildcard that matches the SampleName:

In [None]:
mv Project_DevonP1_Lane1/Sample_mud_906*/*.gz ./mudMT/

Copy the relevant negative control samples as well (we copy rather than move here because these are also relevant to another, unrelated project):

In [None]:
cp Project_DevonP1_Lane1/Sample_negunemud_um*/*.gz ./mudMT/

This will leave behind the .csv files you don't need and keep all the sample .fastq.gz files in a single directory for UFITS to work with, as intended. Once that's complete, you need to rename the files, because the three-part naming scheme used to separate them out in the first place doesn't jibe with how the UFITS scripts want to parse things. UFITS defaults to looking at the first underscore and chopping off everything after it, so the samples all get merged together into one big morass, which you don't want at all.  

Instead, rename the files so that a hyphen replaces that first underscore, which UFITS is okay with.
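
The original renaming command is not included in this notebook. As a minimal sketch of the idea, assuming only the first underscore in each file name needs to change, the Python below swaps it for a hyphen. The file name in the demonstration is made up, and both function names are hypothetical helpers rather than part of UFITS.

```python
import os
import tempfile

def hyphenate_sample_name(filename):
    """Replace only the first underscore with a hyphen: mud_9062_... -> mud-9062_..."""
    return filename.replace("_", "-", 1)

def rename_fastq_files(directory):
    """Rename every .gz file in `directory` and return the list of new names."""
    renamed = []
    for name in sorted(os.listdir(directory)):
        if name.endswith(".gz") and "_" in name:
            new_name = hyphenate_sample_name(name)
            os.rename(os.path.join(directory, name),
                      os.path.join(directory, new_name))
            renamed.append(new_name)
    return renamed

# Demonstration in a temporary directory with a made-up file name
with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, "mud_9062_R1.fastq.gz"), "w").close()
    new_names = rename_fastq_files(tmp)
    print(new_names)
```

Pointing `rename_fastq_files` at the real data directory (for example `./mudMT/`) would rename the files in place, so it is worth trying it on a copy first.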

Once everything is properly named, you can then proceed.  

This process can take a considerable amount of time, so increase the number of cpus whenever possible.  
For example, with 8 cpus on 14 samples (containing 17 Gb of data), it took ~1 hour to process. 

Data are stored in various subdirectories within the parent directory at __'/leo/devon/projects/bri2016/ufits_test_data'__.  

Output from this script will contain several new files:  

1. In the directory in which the script was executed:
    - (output_name).mergedPE.log containing information about PE merging. Note that information in this single file includes the summary of all individual merged pairs.
    - (output_name)-filenames.txt containing information about files used in this pipeline (such as index-pair combinations)
    - (output_name).demux.fq containing a concatenated file of all trimmed and PE merged sequences of all samples listed in the '-filenames.txt' file above __(this one is to be used in the subsequent OTU clustering step)__

2. In the (output_name) directory named in the 'ufits illumina ...' argument provided above:  
    - (output_name).ufits-process.log containing information about the demultiplexing and read trimming processes
    - a single sample_name.fq file which is the PE merged file from each sample_name...R1/R2 pair of raw fastq inputs
    - a single sample_name.demux.fq file for each PE merged sample_name.fq file having been trimmed as defined  

Output will also contain a summary of information regarding the total number of reads processed as well as the number of reads processed per sample. Note in this example how there's about a 1,000-fold difference in reads per sample! We won't be using every one of those reads - in fact, we'll likely just use the top 3.

At this point it's worth pointing out that you may want to exclude samples which contain too few reads. These can be discarded now, before moving forward to the OTU clustering part. Jon Palmer recommends shooting for samples with about 50,000 reads, though as few as 10,000 reads is fine. In the example above, you may consider removing all samples except **'mud-9062'**, **'mud-9064'**, and **'mud-9067'** because all other samples contain about as many reads as any of our contaminant samples.  
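
To make the cutoff concrete, here is a small sketch that applies the 10,000-read floor to the raw read counts from the runstats table above. These are raw counts rather than the merged counts UFITS reports, so the numbers differ from the per-sample summary, and `samples_to_keep` is a hypothetical helper.

```python
# Raw read counts per sample, copied from the runstats table above
read_counts = {
    "mud_9062": 58146, "mud_9063": 3634, "mud_9064": 23458, "mud_9065": 2980,
    "mud_9066": 198, "mud_9067": 10610, "mud_9068": 1130, "mud_9069": 160,
}

def samples_to_keep(counts, min_reads=10000):
    """Return the sorted names of samples with at least `min_reads` reads."""
    return sorted(name for name, n in counts.items() if n >= min_reads)

keep = samples_to_keep(read_counts)
print(" ".join(keep))
```

With this threshold the three surviving samples are the same three named above (with underscores here, since these are the pre-renaming names).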

You can either provide a list of samples to keep or to remove - in this instance it's easier to just specify which ones to keep in a list directly in the command line rather than in a file.

In [None]:
ufits select \
-i MTmud.demux.fq \
-l mud-9062 mud-9064 mud-9067 \
-o MTmud_cleaned_merged.demux.fq

You'll generate a new {output.name}.fq file containing only the samples you wanted to keep (that is, without the samples you just dropped). To double check that the new .fq file contains just what you want, run the following command:

In [None]:
ufits show -i MTmud_cleaned_merged.demux.fq

This should produce a list of the barcoded samples that remain; confirm you have the samples you want and don't have any samples you intended to discard. Once that's done it's on to OTU clustering.

In [None]:
----------------------------------
Found 3 barcoded samples
              Sample:  Count
            mud-9062:  28960
            mud-9064:  11691
            mud-9067:  5295
----------------------------------

## Part 2: OTU clustering

See Jon Palmer's [notes] [link1] about DADA2 if you're curious about what's required to get to this point.  

You're going to first have to run a command to get rid of any possible N's in the cleaned, merged, and demux'd fastq file - DADA2 will crash if it detects any nucleotide character other than A/C/G/T:

[link1]:https://github.com/nextgenusfs/ufits
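
The UFITS command used for this step is not shown in this notebook. As an illustration of what the filtering amounts to, the plain-Python sketch below drops any record whose sequence has a character other than A/C/G/T. It assumes standard 4-line FASTQ records; `drop_ambiguous_reads` is a hypothetical helper and the two reads are made up.

```python
import io

def drop_ambiguous_reads(fastq_in, fastq_out):
    """Copy 4-line FASTQ records whose sequence contains only A/C/G/T.

    Returns a (kept, dropped) tuple of record counts.
    """
    kept = dropped = 0
    while True:
        record = [fastq_in.readline() for _ in range(4)]
        if not record[0]:              # empty string means end of file
            break
        sequence = record[1].strip().upper()
        if set(sequence) <= set("ACGT"):
            fastq_out.writelines(record)
            kept += 1
        else:
            dropped += 1
    return kept, dropped

# Tiny in-memory example: the second read contains an N and is dropped
example = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nACNT\n+\nIIII\n")
cleaned = io.StringIO()
counts = drop_ambiguous_reads(example, cleaned)
print(counts)
```

The same function works on real files by passing handles from `open(...)` in place of the `io.StringIO` objects.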

You'll see a really short output indicating that you dropped a few sequences with those pesky 'N' characters:

With those N characters removed, you can run DADA2:

Here's a sample readout from this command:

This will produce two files you want to use in the subsequent 'ufits filter' command below:
   - **dada2_output.cluster.otus.fa** (use this fasta file in the next command)
   - **dada2_output.cluster.otu_table.txt** (use this OTU table in the next command)

It also produces two very similar-looking files that represent the inferred sequences (iSeqs) themselves, before they were further clustered at 97% identity (the clustering is there so we can make some sort of sense of it when assigning taxonomy). These aren't necessary for downstream analyses as of now:
   - **dada2_output.iSeqs.fa**
   - **dada2_output.otu_table.txt**

You can see exactly which of those iSeqs ended up clustering together into a single OTU with this file:
   - **iSeqs 2 OTUs: dada2_output.iSeqs2clusters.txt**

There are also a few log files generated:
   - **dada2_output.dada2.Rscript.log**  (to ensure the DADA2 program was run successfully)
   - **dada2_output.ufits-dada2.log**  (tracks the overall processing of this wrapper 'ufits dada2' script)


Next up is to filter the OTU table with the fasta file listed above:

## Part 3: filtering OTU table

See the comments in the 'ufits_standardpipeline_bridata' workflow. Note that all we're using here is the fixed index-bleed workflow, but future analyses should use the mock community data once it becomes available.  We're going to run using the default parameters, including the **--min_reads_otu** flag being set to its default (2).

It doesn't appear that this fixed index-bleed filter removed any potential OTUs:

You get a bunch of new files, but the two relevant ones to use in assigning taxonomy next are:  
   - **dada2_r97m2.filtered.otus.fa**
   - **dada2_r97m2.final.binary.csv**

## Part 4: Assigning Taxonomy to OTUs

Make sure to have already downloaded the necessary database (COI).

In [None]:
ufits taxonomy -i dada2_r97m2.final.binary.csv -f dada2_r97m2.filtered.otus.fa -d COI

Which produces the following summary (note the OTU classification is performed with UTAX, using a database acquired through BOLD; see J. Palmer for details about its creation):

## What do we find?

See the R script 'ufits_2016_OTUtable_analysis.R' for generating the following observations.  

Transfer out the **"dada2_r97m2.otu_table.taxonomy.txt"** file with rsync for import into that R script.

# BLASTing some of those unknowns 

We can use the following scripts to run NCBI BLAST. This will happen in three steps:

- A. Clean up the file to pull out just the UTAX-tagged OTUs and convert into a single-line fasta
- B. Build a blast database (*note: we're not going to use this step at the moment*)
- C. BLAST the file

**A.** First, use fastx toolkit to convert 'failedreads.fasta' into oneliner fasta

Then grep only the records with "UTAX" in the header, and remove the "--" separator lines that grep leaves where non-matching lines were skipped
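
The fastx and grep commands for step A are not shown in this notebook. As a rough Python equivalent of both sub-steps, the sketch below joins wrapped sequence lines into one line per record and keeps only records whose header contains "UTAX". The header format in the example is made up, and both function names are hypothetical.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs, joining wrapped sequence lines into one."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def keep_tagged_records(lines, tag="UTAX"):
    """Return the one-line records whose header contains `tag`."""
    return [(h, s) for h, s in read_fasta(lines) if tag in h]

# Made-up records: only the first header carries the UTAX tag
example = [">OTU1 UTAX", "ACGT", "ACGT", ">OTU2 other", "GGCC"]
utax_records = keep_tagged_records(example)
for header, sequence in utax_records:
    print(header)
    print(sequence)
```

Because records are handled as pairs, there are no separator lines to clean up afterwards; passing an open file handle instead of the list works the same way.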

**B.** We're going to skip the step of setting up a BLAST database for the moment - there's already the complete 'nr' database installed on Pinky, so unless we want to set up a custom database with specific sequences we don't have available through NCBI, then I wouldn't worry about this step. If you did want to do it, you'd run the following command:

**C.** Finally, we're going to run a BLAST search on the UTAX-specific reads using NCBI's nt database. We're going to specify a few other flags described in the code below. Specifically, we're going to keep only hits above a certain alignment length and percent identity, then filter these results to choose the best possible taxonomic description for those collapsed reads.  

See [here] [link8] for the BLAST manual with the command line options used below.

[link8]:http://www.ncbi.nlm.nih.gov/books/NBK279675/

In [None]:
blastn \
-query /leo/devon/projects/mudMT/utaxOutput_only.fa \
-db /opt/ncbi-blast-2.2.29+/db/nt \
-outfmt '6 qseqid sseqid pident length bitscore evalue staxids' \
-num_threads 8 \
-perc_identity 79.9 \
-max_target_seqs 10 \
-out /leo/devon/projects/mudMT/blastout/rd1out.txt
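
The post-BLAST filtering described in step C could look something like the pandas sketch below. The column names follow the `-outfmt` string in the command above; the length and identity cutoffs are placeholders (the actual values are not stated in this notebook), `best_hits` is a hypothetical helper, and the three hits are made up.

```python
import io
import pandas as pd

# Column names follow the -outfmt string in the blastn command above
BLAST_COLUMNS = ["qseqid", "sseqid", "pident", "length",
                 "bitscore", "evalue", "staxids"]

def best_hits(blast_table, min_length=100, min_pident=97.0):
    """Keep hits passing both cutoffs, then the highest-bitscore hit per query.

    min_length and min_pident are placeholder values, not the author's.
    """
    hits = pd.read_csv(blast_table, sep="\t", names=BLAST_COLUMNS)
    passing = hits[(hits["length"] >= min_length) & (hits["pident"] >= min_pident)]
    ranked = passing.sort_values(["qseqid", "bitscore"], ascending=[True, False])
    return ranked.drop_duplicates("qseqid").reset_index(drop=True)

# Made-up hits for two queries; OTU2's only hit fails both cutoffs
example = io.StringIO(
    "OTU1\tsubjA\t99.0\t180\t320\t1e-90\t1234\n"
    "OTU1\tsubjB\t98.0\t180\t300\t1e-85\t5678\n"
    "OTU2\tsubjC\t85.0\t60\t90\t1e-20\t9012\n"
)
result = best_hits(example)
print(result)
```

For the real output, the path given to `-out` can be passed as `blast_table`. Ranking by bitscore is one reasonable choice of "best"; ranking by e-value or percent identity would be equally easy to swap in.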