# Sample background 

The following sequence data were generated from DNA extracted outside of the Foster Lab. Arthropod COI fragments were produced using new JOH primers, then 250 bp paired-end (PE) sequencing on an Illumina HiSeq2500 generated our reads. A single lane was used in generating these data as indicated in the run statistics shown [here] [link2]; note that sequence data specific to this project were generated among other projects on the same Illumina lane.

[link2]:http://cobb.unh.edu/170125_DevonP2L2_A_DemuxStats.html

# Pipeline Overview

We use the tools developed by Jon Palmer in the [**amptk**] [link1] program which will trim, filter, cluster, and assign taxonomy to reads. The core clustering process employs an algorithm called DADA2 (see [paper] [link3] and [github page] [link4]) which has an added benefit of error correcting reads (even singletons), though each identified cluster is then reclustered at a 97% similarity (as done conventionally in many OTU-clustering approaches) for downstream community analyses. Taxonomic assignment is completed in part using the curated Barcode of Life Database [(BOLD)] [link7], while also leveraging Robert Edgar's [SINTAX] [link5] and [UTAX] [link6] algorithms to provide additional taxonomic information for reads which are not matched in BOLD.

[link1]:https://github.com/nextgenusfs/amptk
[link3]:http://www.nature.com/articles/nmeth.3869.epdf?author_access_token=hfTtC2mxuI5t44WUhsz05NRgN0jAjWel9jnR3ZoTv0N5gu3rLNk61gF4j2hXPcagLe964qdfd3GRw8OwyZxfEgsul8lwR1lEWykR3lWF30Dl_bZWMvTPwOdrwuiBUeYa
[link4]:https://github.com/benjjneb/dada2
[link5]:http://www.biorxiv.org/content/early/2016/09/09/074161
[link6]:http://www.drive5.com/usearch/manual/utax_algo.html
[link7]:http://v4.boldsystems.org/

## Other information

Installation requirements are found on the [amptk] [link1] site. This project was completed with the following dependencies:

- amptk v0.8.5 
- vsearch v2.3.4
- usearch v9.2.64 (linked as "usearch9", not "usearch")  

In addition to the basic tools we're also going to need to have installed a few other items:  
- a fasta file of the mock community (used in the sequence index bleed filtering step)
    - we are using the IM3 mock community provided by Michelle Jusino
- the COI index (either default from 'amptk' or a curated one of your own)

To install the amptk default COI database:

[link1]:https://github.com/nextgenusfs/amptk/blob/master/docs/ubuntu_install.md

In [None]:
amptk install -i COI

# Part 1: Data migration and cleaning

Initial steps involve moving, renaming, and filtering data as follows.

First, move all relevant directories from Cobb to Pinky servers with rsync. For each project, there will likely be three sets of samples to migrate: 
- the samples
- the negative controls
- one or more mock community members  

This will occur for every lane for every project. Where a project's samples were split across multiple lanes, each lane's data is processed with its own mock community sample so that the index-bleed estimate is lane-specific. Data for an entire project is merged only after OTUs have been determined; this is done in R.  

## generic migration

In [None]:
rsync -avzr foster@cobb.unh.edu:/path2data/Sample_*/*.gz /copy2here/fqRaw/.

## generic renaming

Rename the files to remove the unnecessary underscore. As there are several different prefixes to deal with, we'll run this command as many times as needed for each unique prefix type (while implementing wildcards when possible).

In [None]:
#for true samples
rename 's/{name}_/{name}\-/' *

#for any contaminant sample
rename 's/{name}-contaminated_/{name}une-contaminated/' *

#for negative controls
rename 's/neg{name}_/negu{name}\-/' *

#for mock community
rename 's/mock_IM3/mockIM3/' *

Once everything is properly named, you can proceed. If you get errors in downstream processes, circle back and check whether you've kept any stray underscores or other unexpected characters in the file names. See the [amptk homepage] [link1] if you're unclear about how the sequence data is expected to be named.
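
If you'd like to preview what a rename pattern will do before touching any files, the substitution can be sketched in Python. This is only an illustration: the function name `hyphenate_prefix` and the example file names are hypothetical stand-ins for whatever `{name}` prefixes your run uses.

```python
import re

def hyphenate_prefix(filename, prefix):
    """Replace the first '<prefix>_' in a file name with '<prefix>-'."""
    # re.escape guards against prefixes containing regex metacharacters;
    # the lambda keeps the replacement text literal.
    return re.sub(re.escape(prefix) + "_", lambda m: prefix + "-", filename, count=1)

# Hypothetical file names, for illustration only
examples = ["batA_9056_R1.fastq.gz", "mock_IM3_R1.fastq.gz"]
for name in examples:
    print(name, "->", hyphenate_prefix(name, "batA"))
```

Only the first match is replaced, which mirrors a `s/old/new/` expression without the `g` modifier.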

[link1]:https://github.com/nextgenusfs/amptk

## One final pre-processing note

If you have any files that contain zero bytes of data (literally no indices were detected during the Illumina run), you need to delete those fastq files before running the first step in the 'amptk' pipeline (it crashes when it tries to parse data that doesn't exist!). To delete files with zero bytes of data, enter the following command within the directory containing all the raw fastq files:

In [None]:
find . -name "*.fastq" -size -1c -delete
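
If you'd rather look at the list of empty files before deleting anything, the same check can be sketched in Python. The function name `find_empty_fastq` is hypothetical, and the `*.fastq` pattern is simply carried over from the command above; adjust it if your raw files use a different extension.

```python
from pathlib import Path

def find_empty_fastq(directory, pattern="*.fastq"):
    """Return files matching `pattern` under `directory` that contain zero bytes."""
    return sorted(
        p for p in Path(directory).rglob(pattern)
        if p.is_file() and p.stat().st_size == 0
    )

if __name__ == "__main__":
    for path in find_empty_fastq("."):
        # Review the list first; call path.unlink() here to actually delete.
        print(path)
```

Printing first and deleting second is a deliberate choice here, since `find ... -delete` gives no chance to review.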

# Read processing

This process can take a considerable amount of time; increasing the number of cpus when possible is advised. For example, with 8 cpus on about 17 Gb of data, it took ~1 hour to process.  

If a project contains data across multiple lanes, add a lane suffix such as "L1" to the name passed to the **-o** flag so that each lane is processed separately. If there were multiple Illumina runs, these would be further specified by project number (ex. P1L1, P3L2, etc.). The data for this project came from a single lane, so the command below uses a plain output name.

In [None]:
#run in parent directory of {path.to}/fqRaw
amptk illumina \
-i /leo/devon/projects/guano/sean/p2_data/fqRaw \
-o sean \
--rescue_forward on \
--require_primer off \
--min_len 160 \
--full_length \
--cpus 20 \
--read_length 250 \
-f GGTCAACAAATCATAAAGATATTGG \
-r GGWACTAATCAATTTCCAAATCC

Output from this script will contain several new files:  

1. In the directory in which the script was executed:
    - (output_name).mergedPE.log containing information about PE merging. Note that information in this single file includes the summary of all individual merged pairs.
    - (output_name)-filenames.txt containing information about files used in this pipeline (such as index-pair combinations)
    - (output_name).demux.fq containing a concatenated file of all trimmed and PE merged sequences of all samples listed in the '-filenames.txt' file above __(this one is to be used in the subsequent OTU clustering step)__

2. In the (output_name) directory named in the 'amptk illumina ...' argument provided above:  
    - (output_name).ufits-process.log containing information about the demultiplexing and read trimming processes
    - a single sample_name.fq file which is the PE merged file from each sample_name...R1/R2 pair of raw fastq inputs
    - a single sample_name.demux.fq file for each PE merged sample_name.fq file having been trimmed as defined  

The file labeled **'{-o name}.ufits-demux.log'** will also contain a summary of information regarding the total number of reads processed as well as the number of reads processed per sample (at the very bottom of the file). This is critical in evaluating which reads to keep, and the names of those files are essential in the next step for keeping/dropping samples.

## Dropping/keeping certain samples

Very few samples provided enough sequencing data for analysis. There are a variety of possible reasons, but given that these samples have been sequenced twice using two different primer sets, it appears that either there is an extreme amount of PCR inhibitors present in the DNA extract, or there is simply not enough arthropod DNA present in the extract for our primers to amplify successfully.

Jon Palmer (the author behind this pipeline) recommends shooting for samples with about 50,000 reads, though as low as 10,000 reads is fine. It also depends on the distribution of the reads across all samples: if you have 20 samples with >100,000 reads each and 20 samples with ~10,000, you might not want to keep anything with fewer than 10,000 reads. However, if you have 20 samples with ~20,000 reads each, another 10 samples with 5,000 reads, and another 10 samples with ~100 reads, then you might want to keep those with 5,000 (so going lower than the 10,000 read threshold). Keeping samples with fewer reads creates other problems downstream, especially with regard to community analysis metrics, which can be sensitive to our final presence/absence OTU matrix. You also risk including OTUs in a sample which are the result of index bleed rather than a true representation of the per-sample OTU composition. In short, it's better to be conservative with what samples you include.

There's an equivalent pair of commands you can use to keep or remove samples: if it's faster to enter just a few samples to drop, use the '**amptk remove**' command; if it's faster to specify just the few samples you want to keep, use the '**amptk select**' command.  

Here's an example of the kind of range you might see in a typical run:

In [None]:
Sample:  Count
9056:  36628
9001:  27041
9023:  25831
9040:  15422
negsample1:  1201
9030:  15
9033:  13

In the case above, you're likely going to just keep the top 4 samples, while dropping the bottom three. To drop those three, you'd enter the following generic command:

In [None]:
amptk remove \
-i somepath/trimd.demux.fq \
-l negsample1 9030 9033 \
-o somepath/nau_L1_merged.demux.fq

In the case of our specific dataset, I kept all samples with > 5,000 reads. The negative control (NTC) with the highest read count reached 877, and most other negative controls contained fewer than 500 reads total. Thus, while we're going lower than I'd like to for most data sets, the average NTC is so low that we likely won't see much in the way of contaminant OTUs bleeding into the sample analysis.  

I'd argue that these data may be useful for providing qualitative observations, but nothing else. No community-level analyses should be performed when our per-sample read distribution is this low.
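
Picking the samples that clear a read threshold can be scripted rather than done by eye. The sketch below parses a per-sample summary in the "Sample:  Count" layout shown earlier (using the same illustrative counts) and keeps everything above 5,000 reads. The helper name `parse_counts` is hypothetical, and the assumption that the keep-list file holds one sample name per line should be checked against the amptk documentation.

```python
def parse_counts(text):
    """Parse 'sample:  count' lines into {sample: count}, skipping the header."""
    counts = {}
    for line in text.strip().splitlines():
        name, _, value = line.partition(":")
        value = value.strip()
        if value.isdigit():  # the 'Sample:  Count' header fails this test
            counts[name.strip()] = int(value)
    return counts

# Illustrative counts copied from the example range shown earlier
summary = """
Sample:  Count
9056:  36628
9001:  27041
9023:  25831
9040:  15422
negsample1:  1201
9030:  15
9033:  13
"""

counts = parse_counts(summary)
min_reads = 5000  # the threshold used for this data set
keep = [sample for sample, n in counts.items() if n > min_reads]
print("\n".join(keep))
```

Redirecting the printed names into a text file gives a list that could serve as the keep file, under the one-name-per-line assumption above.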

In [None]:
amptk select \
-i sean.demux.fq \
-f samples2keep.txt \
-o sean-dropd.demux.fq
    #kept 306337 of 417815 reads (73.3% of all reads)
    #kept 14 of 171 true samples (not counting 22 NTCs and 1 mock community)

To double check that the new .fq file contains just what you want and that you didn't miss anything, count the reads per sample in the new file (for example with `amptk show -i sean-dropd.demux.fq`).

We're good to start with the next part - clustering OTUs.

# Part 2: OTU clustering

See Jon Palmer's [notes] [link1] about DADA2 if you're curious about what's required to get to this point.  

We used to have to clean up reads before passing into DADA2 using the vsearch program, but now the code in amptk deals with this problem just fine.
[link1]:https://github.com/nextgenusfs/ufits

In [None]:
amptk dada2 \
--fastq sean-dropd.demux.fq \
--out sean \
--length 180 \
--platform illumina \
--uchime_ref COI

A brief summary of the results:

In [None]:
-------------------------------------------------------
[09:37:33 AM]: Loading FASTQ Records
[09:37:33 AM]: 306,337 reads (117.8 MB)
[09:37:42 AM]: 289,019 reads passed
[09:38:36 AM]: 296 total inferred sequences (iSeqs)
[09:38:36 AM]: 23 denovo chimeras removed
[09:38:36 AM]: 273 valid iSeqs
[09:38:36 AM]: Chimera Filtering (VSEARCH) using COI DB
[09:38:40 AM]: 255 iSeqs passed, 18 ref chimeras removed
[09:38:40 AM]: Mapping reads to DADA2 iSeqs
[09:38:56 AM]: 285,564 reads mapped to iSeqs (93%)
[09:38:56 AM]: Clustering iSeqs at 97% to generate biological OTUs
[09:38:56 AM]: 172 OTUs generated
[09:38:56 AM]: Mapping reads to OTUs
[09:39:06 AM]: 284,112 reads mapped to OTUs (93%)
-------------------------------------------------------

This will produce two files you want to use in the subsequent 'amptk filter' command below:
   - **dada2_output.cluster.otus.fa** (use this fasta file in the next command)
   - **dada2_output.cluster.otu_table.txt** (use this OTU table in the next command)

It also produces two very similar looking files that represent the inferred sequences (iSeqs) before they were clustered at 97% identity (the clustering is what lets us make some sort of sense of them when assigning taxonomy); these aren't necessary for downstream analyses as of now:
   - **dada2_output.iSeqs.fa**
   - **dada2_output.otu_table.txt**

You can see exactly which of those iSeqs ended up clustering together into a single OTU with this file:
   - **iSeqs 2 OTUs: dada2_output.iSeqs2clusters.txt**

There are also a few log files generated:
   - **dada2_output.dada2.Rscript.log**  (to ensure the DADA2 program was run successfully)
   - **dada2_output.ufits-dada2.log**  (tracks the overall processing of this wrapper 'amptk dada2' script)


Next up is to filter the OTU table with the fasta file listed above:

# Part 3: Filtering the OTU table

Jon Palmer recommends applying an index-bleed filter using the '-b mock' flag, which specifies the use of a mock community to calculate the index-bleed percentage. The idea behind this filtering step is to count the instances in which a read mapped to an OTU which is *not supposed to be in the mock community* is found in the mock community sample. The percentage compares the number of reads in the mock sample from OTUs that aren't supposed to be there with the number of reads that map to mock OTUs (ie. that are supposed to be in there). The alternative approach is to just trim down reads by some defined (yet arbitrary) percent across all samples, based on examples from other data sets. I don't like that because I've found that every data set is unique, so having a mock community to judge this by is better.  
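
To make the idea concrete, here is a minimal sketch of one plausible way to express index bleed into the mock: the fraction of reads in the mock sample that belong to OTUs not expected there. This is an assumption about the calculation for illustration only; amptk's own implementation may define the ratio differently. The function name, OTU names, and read counts are all hypothetical.

```python
def bleed_into_mock(mock_counts, expected_otus):
    """Fraction of reads in the mock sample that map to OTUs not expected in the mock."""
    total = sum(mock_counts.values())
    if total == 0:
        return 0.0
    unexpected = sum(n for otu, n in mock_counts.items() if otu not in expected_otus)
    return unexpected / total

# Hypothetical read counts for the mock sample, keyed by OTU
mock_counts = {"OTU1": 9000, "OTU3": 850, "OTU7": 100, "OTU9": 50}
expected = {"OTU1", "OTU3"}  # OTUs that match the mock community fasta

rate = bleed_into_mock(mock_counts, expected)
print("index bleed into mock: {:.1%}".format(rate))
```

With these made-up numbers, 150 of 10,000 reads in the mock come from OTUs that shouldn't be there.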

We're going to use our mock community to calculate index bleed. We're also going to (potentially) incorporate a subtraction value in which *if* the scenario occurs such that an unwanted OTU remains in the mock community following the application of the index-bleed filter, we can detect how many reads that maximum value should be and subtract from there to ensure that **zero** normalized reads remain in the highest 'bleeding' OTU in the mock sample. This value is subtracted from *all* reads from *all* samples per OTU, not just the mock community, so it ultimately drops a lot of OTUs with low read thresholds.  

There are intermediate files which are very useful in determining exactly how many reads of which unwanted OTUs are creeping into your mock community, so you'll notice we're passing a **"--debug"** flag which is used to generate these files; without that flag, you'll miss the normalized read counts used in calculating the index bleed filter as well as the subsequent "--subtract" value.

You might want to play with the **--subtract** threshold and examine how many overall OTUs are retained. In the two examples below, we're going to test exactly what filtering parameters should be applied.  

- Our first filtering step uses several flags: 
    - the index-bleed flag **-b** specifies which sample is the mock community and goes with the **--mc** flag to specify the path with which you can find the associated fasta file with that mock community
    - the **calculate in** flag specifies that we're only going to calculate index bleed into the mock (not back out of the mock); this is done because we are using a biological mock community and there is a chance that our true samples also contain the same species (OTUs really); if left to the default (**calculate all**) *and we had samples that contained these mock OTUs too* then our index bleed rate would be super high, which we don't want
    - the **--debug** flag provides the intermediate files needed to determine which OTUs and how many reads are finding their way into the mock community sample
    - the **--subtract auto** flag will calculate the number of normalized reads in OTUs unwanted in the mock community and then subtract that value from all reads. When applied this results in a completely clean mock, but drops a lot of OTUs
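
The effect of the subtraction step can be sketched as follows. This is a simplified reading of the description above (take the largest count among unwanted OTUs in the mock, subtract it from every count, floor at zero, drop OTUs left empty), not amptk's actual code. The function names and the table values are hypothetical.

```python
def auto_subtract_value(table, mock, expected_otus):
    """Largest count in the mock sample among OTUs that should not be there."""
    unexpected = [row[mock] for otu, row in table.items() if otu not in expected_otus]
    return max(unexpected, default=0)

def subtract_from_all(table, value):
    """Subtract `value` from every count (floored at zero) and drop empty OTUs."""
    result = {}
    for otu, row in table.items():
        new_row = {sample: max(count - value, 0) for sample, count in row.items()}
        if any(new_row.values()):
            result[otu] = new_row
    return result

# Hypothetical normalized counts: {OTU: {sample: reads}}
table = {
    "OTU1": {"mockIM3": 9000, "s1": 0, "s2": 20},
    "OTU2": {"mockIM3": 40, "s1": 5000, "s2": 3000},
    "OTU5": {"mockIM3": 0, "s1": 30, "s2": 0},
}

value = auto_subtract_value(table, "mockIM3", expected_otus={"OTU1"})
filtered = subtract_from_all(table, value)
print(value, sorted(filtered))
```

Note how the low-abundance OTU5 disappears entirely once the subtraction is applied to every sample, which is why a large subtract value removes so many OTUs.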

First example allows for an automatic detection and application of index-bleed and OTU subtraction (if necessary):

In [None]:
amptk filter \
-i sean.cluster.otu_table.txt \
-f sean.cluster.otus.fa \
-b mockIM3 \
--keep_mock \
--calculate in \
--mc /leo/devon/projects/guano/mock-fa/CFMR_insect_mock3.fa \
--debug \
--subtract auto \
-o testfilt

A brief summary of the output:

In [None]:
-------------------------------------------------------
[09:39:53 AM]: OTU table contains 172 OTUs
[09:39:53 AM]: Mapping OTUs to Mock Community (USEARCH)
[09:39:53 AM]: Index bleed, samples into mock: 1.366000%.
[09:39:53 AM]: Auto subtract filter set to 1277
[09:39:53 AM]: Filtering OTU table down to 49 OTUs
-------------------------------------------------------

The resulting output will show you that:  

- there was an index bleed of **1.4%** applied (that's pretty standard for most of my data sets, though still kind of high), and 
- the auto subtract filter was used at a level of **1277**. That's really high!

In other words, after filtering down all reads by 1.4% on a per-OTU basis, an additional 1277 reads were subtracted from each sample's count for every OTU. If you used those default parameters it would result in a dramatic reduction in the resulting OTUs to keep. For instance in this filtering regime we go from **172 OTUs** down to **49 OTUs** - that's a lot of OTUs to throw away. The question is whether or not that **--subtract auto** filter needs to be quite as high. We're going to figure that out in a minute.  

To investigate what's going on, we're going to check out the **testfilt.normalized.num.txt** file and compare it with the **testfilt.final.txt** file. We need to figure out which OTUs in the mock community sample contain the greatest number of *unwanted* reads - that is, reads from OTUs which should not be in the mock community. We're going to clean up the needed files just a bit so that basic linux tools work properly (the space in the header's first field, '#OTU ID', splits the first row into one more field than all the rows below).

Run the following commands:

In [None]:
sed -i 's/#OTU ID/OTUid/' testfilt.final.txt
sed -i 's/#OTU ID/OTUid/' testfilt.normalized.num.txt
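
Why the header needs renaming is easy to see with a toy example. Tools that split on any whitespace treat '#OTU ID' as two fields, so the header ends up one field longer than the data rows. The sample names and counts below are hypothetical.

```python
header = "#OTU ID\tmockIM3\t9056\t9001"
row = "OTU1\t9000.0\t0.0\t12.0"

# Whitespace splitting breaks '#OTU ID' in two, so the header has an extra field
print(len(header.split()), len(row.split()))

# After renaming the first column, header and rows split into the same number of fields
fixed = header.replace("#OTU ID", "OTUid")
print(len(fixed.split()), len(row.split()))
```

Splitting strictly on tabs would also avoid the mismatch, but renaming the column keeps every downstream tool happy regardless of its default separator.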

You don't know two things at this point: 
- how many fields are there?
- which OTUs contain the most reads among OTUs that **don't** belong?  

To answer the *number of fields question*, substitute **file.name.txt** with whatever file you want:

In [None]:
awk -F '\t' '{print NF; exit}' file.name.txt

For example:

In [None]:
#same command for L1 and L2 data:
awk -F '\t' '{print NF; exit}' testfilt.final.txt

We see there are **16** fields, indicating that there are 15 total samples to deal with. Our current data set is named such that the second field is the mock community, and the first field is the list of OTUs in each sample. We're going to sort through the text file and just print out the fields we want.  

The following command should give you a sense of how many reads may need to be subtracted from a given OTU that's unwanted. In the first command you can view the OTU in question, the mock sample, and another real sample in terms of how many reads are in each OTU.  

In [None]:
#for the .normalized.num.txt file:
cut -f 1,2,7 testfilt.normalized.num.txt | sort -k2,2n | awk '$2 != "0.0" {print $0}' 
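
The same selection can be sketched in Python for anyone who prefers to see the steps spelled out: take the OTU column, the mock column, and one other sample, keep rows with reads in the mock, and sort by the mock count. The function name and the demo table are hypothetical, and unlike the shell pipeline this version skips the header line.

```python
import csv
import io

def nonzero_in_mock(handle, mock_col=1, other_col=6):
    """Return (otu, mock_reads, other_reads) for rows with reads in the mock, sorted ascending."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader)  # skip the header line
    rows = [(r[0], float(r[mock_col]), float(r[other_col])) for r in reader if r]
    return sorted((r for r in rows if r[1] != 0.0), key=lambda r: r[1])

# Hypothetical table with seven tab-separated columns
demo_text = (
    "OTUid\tmockIM3\ts1\ts2\ts3\ts4\ts5\n"
    "OTU1\t9000.0\t0.0\t0.0\t0.0\t0.0\t12.0\n"
    "OTU2\t1277.0\t2180.0\t12165.0\t855.0\t14369.0\t0.0\n"
    "OTU8\t0.0\t5.0\t0.0\t0.0\t0.0\t40.0\n"
)

for otu, mock_reads, other_reads in nonzero_in_mock(io.StringIO(demo_text)):
    print(otu, mock_reads, other_reads)
```

Column indices are zero-based here, so fields 1, 2, and 7 of the shell command correspond to indices 0, 1, and 6.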

Notice how there is a single OTU (**OTU2**) that shouldn't be in there that's in abundance, while there are a handful of other OTUs that shouldn't be there but contain only a fraction of the reads? After looking at multiple sets of samples all sequenced on this Project 2 run, it appears that some fly DNA is present in one of the reagents at some very low level (probably in the PCR mix, but possibly anywhere). You can search for this OTU using the second command below. Because this project set includes only bat DNA we know we aren't going to find that fly in it, so we can drop this OTU altogether.

If, for example, you wanted to see how many reads are in every sample of the data set for some specific OTU, say  **OTU2**, type:

In [None]:
grep "\\bOTU2\\b" testfilt.normalized.num.txt

Which will result in the following output:  

**For L2 data**  
OTU2&nbsp;&nbsp;  1277.0&nbsp;&nbsp;  2180.0&nbsp;&nbsp;  12165.0&nbsp;&nbsp;  855.0&nbsp;&nbsp;  14369.0&nbsp;&nbsp; 0.0&nbsp;&nbsp;  23976.0&nbsp;&nbsp;  29880.0&nbsp;&nbsp;  4.0&nbsp;&nbsp;  42267.0&nbsp;&nbsp;  21530.0&nbsp;&nbsp;  45.0&nbsp;&nbsp;  19330.0&nbsp;&nbsp;  76081.0&nbsp;&nbsp;  50417.0

It's important to note the order of what you're seeing above: the first value (**1277.0**) represents the number of normalized reads for that OTU in the *mock community*. This is not expected to be in such high numbers (it's not expected at all, actually). The remaining values are the number of normalized reads for that OTU for all other samples. This is the only project where those values are extremely high. If I had not also sequenced many other separate projects, which makes me confident this is a contaminant from the PCR step, I would likely guess that these were real values and not a source of contamination. However, I believe what is happening here is that each of the true samples is so low in target arthropod DNA that this little bit of contaminant DNA is getting amplified significantly because of the lack of targeted template DNA in each sample. In the cases where there are very few reads (or zero), the contaminant didn't get amplified.  

It's useful to identify (as best as possible) the taxonomic information of these potential contaminant OTUs in question. You can do this by looking into the fasta file for each of those OTUs and then manually BLASTing them. For example:

In [None]:
grep "\\bOTU2\\b" sean.cluster.otus.fa -A 3

This produces the sequence needed; manually BLASTing this specific OTU against NCBI's nr database shows that it's a common housefly. I'm unclear whether this species is expected in the guano samples, but given the number of reads detected among these successfully amplified samples of guano, I doubt that between half and two-thirds of their DNA is from house fly. Rather, it's likely what I described above: a case of poor quality template and a failed reaction to the target sample providing opportunity for low-level contaminants to rise in concentration during PCR.  

Given that information, it'll be easiest to just drop that OTU in question and then refilter. We're going to want to keep the **--subtract auto** feature on for now for both data sets because there were additional low-level reads popping up in their respective mock community samples.

*AMPtk* has a feature to do just that for weird scenarios like this!

In [None]:
amptk drop \
-i sean.cluster.otus.fa \
-r sean-dropd.demux.fq \
-l OTU2 \
-o sean

First, a brief summary of the results:

In [None]:
-------------------------------------------------------
[09:57:29 AM]: Loading 172 OTUs
[09:57:29 AM]: Dropping 1 OTUs
[09:57:29 AM]: Mapping Reads to OTUs and Building OTU table
[09:57:44 AM]: 171 OTUs remaining
[09:57:44 AM]: 262,156 reads mapped to OTUs (86%)
-------------------------------------------------------

Mapped reads fell from 284,112 (93%) to 262,156 (86%), so that single contaminant OTU was accounting for roughly 7% of all reads among the samples we kept.  

The above command will result in a single OTU being dropped from each .fq file and generate two output files:  
- **sean.cleaned.otu_table.txt** 
- **sean.cleaned.otus.fa**  

Next up is to run another filtering step to ensure that the OTU in question has been removed, and see how the removal of that OTU influences what the **--subtract** value is set to.

In [None]:
amptk filter \
-i sean.cleaned.otu_table.txt \
-f sean.cleaned.otus.fa \
-b mockIM3 \
--keep_mock \
--calculate in \
--mc /leo/devon/projects/guano/mock-fa/CFMR_insect_mock3.fa \
--subtract auto \
--debug \
-o lasttest

We find that dropping that one OTU results in the lowering of the **--subtract auto** value from the very high range to a more modest number of reads per sample (from **1277** to **48** reads). The **index-bleed** value also drops significantly down to just **0.1%**.

While there are still a few additional samples bleeding into the mock community, these all occur at a very low abundance. We can therefore apply a final filtering step where we keep only true samples (we drop the mock community and its uniquely associated OTUs), and filter down with a more modest subtraction filter.

In [None]:
amptk filter \
-i sean.cleaned.otu_table.txt \
-f sean.cleaned.otus.fa \
-b mockIM3 \
--calculate in \
--mc /leo/devon/projects/guano/mock-fa/CFMR_insect_mock3.fa \
--subtract auto \
-o sean

So what do we see? By dropping just that one contaminant OTU we greatly reduce the value used in the **--subtract auto** option.  

- We go from a subtract value of **1277** to just **48**; this results in retaining **130** OTUs instead of the **49** that were left when we kept that one contaminant OTU. Big difference!

Notably, 11 of those 171 OTUs are specifically associated with the mock community, which is now removed. When you look at the abundance of reads per OTU it's pretty clear that many of these OTUs are very rare (present only in one or a few of the samples and absent in others), yet when we see OTUs containing thousands of reads they tend to be present in many of the samples. To be conservative it's certainly justifiable to cut at the rate with which we eliminate index bleed from our mock community, and that's what we'll do here.  

The arguments above will produce the following files:  
- **sean.filtered.otus.fa** is the final filtered fasta file (say that five times fast)
- **sean.final.binary.txt** is the presence/absence OTU table after filtering
- **sean.stats.txt** is the number of OTUs in each sample before and after filtering
- **sean.final.txt** is the OTU table with normalized read counts (normalized to the number of reads in each sample)
- **sean.amptk-filter.log**  

The last step in this pipeline is to assign taxonomy by using the **sean.filtered.otus.fa** and **sean.final.binary.txt** files.

# Part 4: Assigning Taxonomy to OTUs

Make sure to have already downloaded the necessary database (COI). If you haven't, just type:

In [None]:
amptk install -i COI

Note that the classification of each OTU can be executed using three potential programs: USEARCH using our database acquired through BOLD; UTAX (trained through BOLD); and SINTAX. I am ultimately going to present only the OTU calls from the BOLD database in this project, but I will create a taxonomy profile using all three. The BOLD-only calls are filtered out using an R script later in this workflow.

In [None]:
amptk taxonomy \
-i sean.final.binary.txt \
-f sean.filtered.otus.fa \
-d COI \
-o sean

## What do we find?

Some files of immediate interest:

- **sean.otu_table.taxonomy.txt**: the file containing the binary presence (1) or absence (0) matrix with OTUs as rows and samples and taxonomy information as columns. This file is used in the subsequent R script for generating plots and tables.
- **sean.otus.taxonomy.fa**: the fasta file of all OTUs classified in the OTU table

Not necessary on the front end but also of potential interest:

- **sean.taxonomy.txt**: a five-field file containing the OTU number and the taxonomic information for each classifier used. Could be useful in comparing these relative to a subsequent BLAST search too.
- **sean.otu_table.txt**: another matrix with samples as columns but no taxonomy info. Rows are not OTUs but rather the individual iSeqs before 97% clustering. Could be helpful if you want to conduct a BLAST search of all unique sequences prior to the minimum number of reads filtering step. Could help identify bird reads, for example, which may be much lower than the **48** value used in the subtract filtering stage.

See the R script **'OTU_analysis.R'** for subsequent analysis of taxa present in these samples.

# Part 5: BLASTing our filtered info (post R script)

Decided to do a little further trimming and digging. Uploaded the **OTUlist.110minReads** output from the R script to our Pinky server, then ran the following code from our Pinky server to create a fasta file containing OTUs that met that 110 min read threshold:

In [None]:
python \
~/scripts/guano_scripts/Python_scripts/fasta_subset.py \
sean.filtered.otus.fa \
OTUlist.110minReads.txt > min110Reads.fa
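
The **fasta_subset.py** script itself isn't shown here, so the following is only a hypothetical stand-in for what such a script might do: read a fasta file, keep the records whose ID appears in a list, and print them. The function names and the toy sequences are made up for illustration.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def subset_fasta(fasta_lines, wanted_ids):
    """Keep records whose ID (first word of the header) is in wanted_ids."""
    wanted = set(wanted_ids)
    return [
        (h, s) for h, s in read_fasta(fasta_lines)
        if (h.split() or [""])[0] in wanted
    ]

# Toy input, for illustration only
fasta = [">OTU1\n", "ACGT\n", "ACGT\n", ">OTU2\n", "GGCC\n", ">OTU3\n", "TTAA\n"]
for header, seq in subset_fasta(fasta, ["OTU1", "OTU3"]):
    print(">{}\n{}".format(header, seq))
```

Matching on the first word of the header means any description text after the OTU ID is ignored when deciding what to keep.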

Performed a local BLAST from the command line using the 'nt' database from NCBI to generate output:

In [None]:
blastn \
-query /leo/devon/projects/guano/sean/p2_data/min110Reads.fa \
-db /opt/ncbi-blast-2.2.29+/db/nt \
-outfmt '6 qseqid sseqid pident length bitscore evalue staxids' \
-num_threads 8 \
-perc_identity 96.9 \
-max_target_seqs 12 \
-out /leo/devon/projects/guano/sean/p2_data/blastout.txt

To get the taxonomy information from these taxids we're going to use an R package called [taxize] [link1] to convert the *staxids* value into taxonomic information (kingdom, phylum,...species).  The only information we need to provide to **taxize** is a list of the taxids.  

However, we're going to sort through our blast output a little bit first.

[link1]:https://github.com/ropensci/taxize

- Removed redundant rows that contained identical OTU id's, bit scores, and taxIDs. 
- I also then removed any rows in which the alignment length of a match was less than 150 base pairs
- I then filtered any redundant row with common OTUid and TAXid, then sorted by bit score (highest first)

In [None]:
sort -u -k1,1 -k5,5nr -k7,7 blastout.txt | \
awk '$4 > 149 {print $0}' | \
sort -u -k1,1 -k7,7 | \
sort -k1,1 -k5,5nr > cleanBlastout.txt
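
For readers less comfortable with chained `sort` calls, the filtering logic can be sketched in Python. This is not a line-for-line translation: it keeps the single highest-scoring hit per OTU and taxid pair among alignments of at least 150 bp, which is a slightly stricter choice than relying on whichever duplicate `sort -u` happens to keep. The function name and the example hits are hypothetical; the column order follows the `-outfmt` string used above.

```python
def clean_blast(lines, min_length=150):
    """Keep the best-scoring hit per (query, taxid) among alignments >= min_length."""
    best = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 7:
            continue  # skip blank or malformed lines
        # Columns: qseqid sseqid pident length bitscore evalue staxids
        qseqid, length, bitscore, taxid = fields[0], int(fields[3]), float(fields[4]), fields[6]
        if length < min_length:
            continue
        key = (qseqid, taxid)
        if key not in best or bitscore > best[key][0]:
            best[key] = (bitscore, fields)
    # Sort by query ID, then highest bit score first
    ordered = sorted(best.values(), key=lambda item: (item[1][0], -item[0]))
    return ["\t".join(fields) for _, fields in ordered]

# Hypothetical hits, for illustration only
hits = [
    "OTU1\tgi|1\t99.5\t180\t320\t1e-90\t1111",
    "OTU1\tgi|2\t99.0\t180\t310\t1e-88\t1111",  # same taxid, lower score
    "OTU1\tgi|3\t97.0\t120\t200\t1e-50\t2222",  # alignment too short
    "OTU4\tgi|4\t98.0\t175\t300\t1e-85\t3333",
]
for row in clean_blast(hits):
    print(row)
```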

The next step is to import these TAXid values into R and use a package called **taxize**.  See **taxifying_blastout.R** for details. The **cleanBlastout.txt** file is uploaded to a local machine where R scripts are executed. The output of this script is titled **OTUidTaxaClassifications.csv**, and contains a list of our best alignments for a given OTU using the *nt* database from NCBI; a few of these matches may improve taxonomic resolution beyond what the BOLD database alone provides.