# Sample background 

The following sequence data were generated from DNA extracted following guano extraction standards developed in the Foster Lab. Arthropod COI fragments were amplified using the new JOH primers, then sequenced as 250 bp paired-end (PE) reads on an Illumina HiSeq2500. A single lane was used to generate these data, and multiple projects were pooled together on this lane. Each project from this single lane was independently analyzed to identify potential chimeric sequences, tag-jumping instances, and any consistent sources of cross contamination. In addition, though these samples were subject to two independent sequencing runs (the first run did not generate many reads per sample), only reads acquired through the latter round of sequencing are reflected in these analyses. Run statistics are indicated [here] [link1]. 

[link1]:http://cobb.unh.edu/170125_DevonP2L1_A_DemuxStats.html

# Pipeline Overview

We use the tools developed by Jon Palmer in the [**amptk**] [link1] program, which trims, filters, clusters, and assigns taxonomy to reads. The core clustering process employs the DADA2 algorithm (see [paper] [link3] and [github page] [link4]), which has the added benefit of error-correcting reads (even singletons); each identified cluster is then reclustered at 97% similarity (as done conventionally in many OTU-clustering approaches) for downstream community analyses. Taxonomic assignment is completed in part using the curated Barcode of Life Database [(BOLD)] [link7], while also leveraging Robert Edgar's [SINTAX] [link5] and [UTAX] [link6] algorithms to provide additional taxonomic information for reads which are not matched in BOLD.

[link1]:https://github.com/nextgenusfs/amptk
[link3]:http://www.nature.com/articles/nmeth.3869.epdf?author_access_token=hfTtC2mxuI5t44WUhsz05NRgN0jAjWel9jnR3ZoTv0N5gu3rLNk61gF4j2hXPcagLe964qdfd3GRw8OwyZxfEgsul8lwR1lEWykR3lWF30Dl_bZWMvTPwOdrwuiBUeYa
[link4]:https://github.com/benjjneb/dada2
[link5]:http://www.biorxiv.org/content/early/2016/09/09/074161
[link6]:http://www.drive5.com/usearch/manual/utax_algo.html
[link7]:http://v4.boldsystems.org/

## Other information

Installation requirements are found on the [amptk] [link1] site. This project was completed with the following dependencies:

- amptk v0.8.5 
- vsearch v2.3.4
- usearch v9.2.64 (linked as "usearch9", not "usearch")  

In addition to the basic tools, a few other items need to be available:  
- a fasta file of the mock community (used in the sequence index bleed filtering step)
    - we are using the IM3 mock community provided by Michelle Jusino
- the COI index (either default from 'amptk' or a curated one of your own)

To install the amptk default COI database:

[link1]:https://github.com/nextgenusfs/amptk/blob/master/docs/ubuntu_install.md

In [None]:
amptk install COI

# Part 1: data migration and cleaning

Initial steps involve moving, renaming, and filtering data as follows.

First, move all relevant directories from Cobb to Pinky servers with rsync. For each project, there will likely be three sets of samples to migrate: 
- the samples
- the negative controls
- one or more mock community members  

This will occur for every lane for every project. Where a project's samples were split across multiple lanes, each lane's data is treated with its own mock community samples so that index bleed is estimated per lane. Data for an entire project is merged only after OTUs have been determined; this is done in R.  

## generic migration

In [None]:
rsync -avzr foster@cobb.unh.edu:/path2data/Sample_*/*.gz /copy2here/fqRaw/.

## generic renaming

Rename the files to remove the unnecessary underscore. As there are several different prefixes to deal with, we'll run this command as many times as needed for each unique prefix type (while implementing wildcards when possible).

In [None]:
#for true samples
rename 's/{name}_/{name}\-/' *

#for any contaminant sample
rename 's/{name}-contaminated_/{name}une-contaminated/' *

#for negative controls
rename 's/neg{name}_/negu{name}\-/' *

#for mock community
rename 's/mock_IM3_/mockIM3\-/' *
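Before renaming in bulk it can help to preview what would change. The following is a minimal sketch, not part of amptk: `preview_rename` is a hypothetical helper, and the file names in the demo are assumptions about what raw files might look like. It only prints the old and new names and never touches the files.

```shell
# Preview an underscore-to-hyphen rename without touching any files.
# Usage: preview_rename PREFIX FILE...
preview_rename() {
    prefix=$1
    shift
    for f in "$@"; do
        # Replace the first "PREFIX_" with "PREFIX-" (same idea as the rename calls above).
        new=$(printf '%s\n' "$f" | sed "s/${prefix}_/${prefix}-/")
        if [ "$new" != "$f" ]; then
            printf '%s -> %s\n' "$f" "$new"
        fi
    done
}

# Hypothetical file names, for illustration only.
preview_rename mud mud_9064_R1.fastq.gz mud_9064_R2.fastq.gz
```

Once the preview looks right, the same loop could perform the rename by swapping the `printf` line for `mv -- "$f" "$new"`.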

Once everything is properly named, you can then proceed. If you get errors in downstream processes circle back and see if you've kept any weird underscores or other characters in the file names. See the [amptk homepage] [link1] if you're unclear about how the sequence data is expected to be named.

[link1]:https://github.com/nextgenusfs/amptk

## One final pre-processing note

If you have any files that contain zero bytes of data (literally no indices were detected during the Illumina run), you need to delete those fastq files before running the first step in the 'amptk' pipeline (it crashes when it tries to parse data that doesn't exist!). To delete files with zero bytes of data, enter the following command within the directory containing all the raw fastq files (adjust the `-name` pattern if your files use a different extension, such as `.fq`):

In [None]:
 find . -name "*.fastq" -size -1c -delete
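Because `-delete` is irreversible, it may be worth listing the empty files first. This is a small sketch under the assumption that the raw files end in `.fastq`; `list_empty_fastq` is a hypothetical helper name, and the demo files are made up.

```shell
# List zero-byte fastq files under a directory without deleting anything.
# "-size 0" matches only files that occupy zero blocks, i.e. empty files.
list_empty_fastq() {
    find "$1" -type f -name "*.fastq" -size 0 -print | sort
}

# Demo on a throwaway directory (file names here are invented).
demo_dir=$(mktemp -d)
: > "$demo_dir/mud-0001_R1.fastq"                                 # empty file
printf '@r1\nACGT\n+\nIIII\n' > "$demo_dir/mud-0002_R1.fastq"    # one read
list_empty_fastq "$demo_dir"
rm -r "$demo_dir"
```

Only the empty file is listed; if the list matches expectations, the delete command above can be run with more confidence.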

# Read processing

This process can take a considerable amount of time; increasing the number of CPUs is advised when possible. For example, with 8 CPUs on about 17 GB of data, it took ~1 hour to process.  

Note that when a project contains data across multiple lanes, a lane suffix such as "L1" is added to the name given to the **-o** flag. If there were multiple Illumina runs, these would be further specified by project number (ex. P1L1, P3L2, etc.). Because only a single lane is analyzed here, the output name below is simply `mud`.

In [None]:
#run in parent directory of {path.to}/fqRaw
amptk illumina \
-i /leo/devon/projects/guano/mudMT/p2_data/fqRaw \
-o mud \
--rescue_forward on \
--require_primer off \
--min_len 160 \
--full_length \
--cpus 20 \
--read_length 250 \
-f GGTCAACAAATCATAAAGATATTGG \
-r GGWACTAATCAATTTCCAAATCC

Output from this script will contain several new files:  

1. In the directory in which the script was executed:
    - (output_name).mergedPE.log containing information about PE merging. Note that information in this single file includes the summary of all individual merged pairs.
    - (output_name)-filenames.txt containing information about files used in this pipeline (such as index-pair combinations)
    - (output_name).demux.fq containing a concatenated file of all trimmed and PE merged sequences of all samples listed in the '-filenames.txt' file above __(this one is to be used in the subsequent OTU clustering step)__

2. In the (output_name) directory named in the 'amptk illumina ...' argument provided above:  
    - (output_name).ufits-process.log containing information about the demultiplexing and read trimming processes
    - a single sample_name.fq file which is the PE merged file from each sample_name...R1/R2 pair of raw fastq inputs
    - a single sample_name.demux.fq file for each PE merged sample_name.fq file having been trimmed as defined  

The file labeled **'{-o name}.ufits-demux.log'** will also contain a summary of information regarding the total number of reads processed as well as the number of reads processed per sample (at the very bottom of the file). This is critical in evaluating which reads to keep, and the names of those files are essential in the next step for keeping/dropping samples.

## Dropping/keeping certain samples

You may want to exclude certain samples which contain too few reads. These can be discarded at this point before moving forward to the OTU clustering part. Jon Palmer recommends shooting for samples with about 50,000 reads, though as few as 10,000 reads is fine. It also depends on the distribution of the reads across all samples; if you have 20 samples with >100,000 reads each, and 20 samples with ~10,000, you might not want to keep anything with less than 10,000 reads. However, if you have 20 samples with ~20,000 reads each, another 10 samples with 5,000 reads, and another 10 samples with ~100 reads, then you might want to keep those with 5,000 (so going lower than the 10,000 reads threshold).

There's an equivalent pair of commands for keeping or removing samples: if it's faster to enter just a few samples to drop, use the '**amptk remove**' command; if it's faster to specify the few samples you want to keep, use the '**amptk select**' command.  

In this specific run with just 9 samples we see a distribution like this:

In [None]:
Sample:  Count
mockIM3:  554502
mud-9064:  292799
mud-9062:  73355
mud-9067:  60846
mud-9063:  26222
mud-9065:  22458
mud-9068:  4400
mud-9066:  1974
mud-9069:  54

So we clearly don't want to keep that last sample, **mud-9069**, and the question is what to do about the other two low samples. One of the things to check out are other negative controls from the run, but there weren't any specific to this round of sequencing for these samples; however there were other negative controls pooled in this same lane and they generally topped out around ~1500 reads. In other words, we likely want to drop the bottom two samples total, and keeping that third from last sample, **mud-9068** is probably not a great idea if you're doing community analyses and need to compare between samples, but in this project we're just trying to figure out what's possibly in there of interest. If we find something interesting that's unique to **mud-9068** that might be a bit suspicious, but in general we should be able to keep it in there to get a sense of consistency of OTUs among all samples to find trends.  
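Picking out low-read samples by eye works for nine samples but not for hundreds. Below is a minimal sketch that lists samples under a cutoff from a `Sample:  Count` listing like the one above. The helper name `samples_below` and the cutoff of 2000 reads are assumptions chosen for illustration (just above the ~1500 reads seen in the negative controls), not an amptk feature or a recommended threshold.

```shell
# Print the names of samples whose read count is below a cutoff.
# Usage: samples_below CUTOFF COUNTS_FILE
# COUNTS_FILE has a header line followed by "name:  count" lines.
samples_below() {
    awk -F':' -v min="$1" 'NR > 1 && ($2 + 0) < min { print $1 }' "$2"
}

# Demo using the per-sample counts reported above.
counts=$(mktemp)
cat > "$counts" <<'EOF'
Sample:  Count
mockIM3:  554502
mud-9064:  292799
mud-9062:  73355
mud-9067:  60846
mud-9063:  26222
mud-9065:  22458
mud-9068:  4400
mud-9066:  1974
mud-9069:  54
EOF
samples_below 2000 "$counts"
rm -f "$counts"
```

With this cutoff the listing contains mud-9066 and mud-9069, the same two samples singled out in the discussion above.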

To remove those last two samples:

In [None]:
amptk remove \
-i mud.demux.fq \
-l mud-9066 mud-9069 \
-o mud_filt.demux.fq
    #kept 1034582 of 1036610 (99.8%) of all reads
    #kept 6 of 8 mud samples
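The read totals reported by the remove step can be cross-checked against the per-sample counts listed earlier. This is a small sketch with a hypothetical helper, `kept_after_drop`, which sums the counts and reports what should remain after dropping the named samples; it reads the same `name:  count` listing and is independent of amptk.

```shell
# Report how many reads remain after dropping the named samples.
# Usage: kept_after_drop COUNTS_FILE SAMPLE...
kept_after_drop() {
    counts_file=$1
    shift
    awk -F':' -v drop="$*" '
        BEGIN { n = split(drop, d, " "); for (i = 1; i <= n; i++) gone[d[i]] = 1 }
        NR > 1 { total += $2; if (!($1 in gone)) kept += $2 }
        END { printf "kept %d of %d (%.1f%%)\n", kept, total, 100 * kept / total }
    ' "$counts_file"
}

# Demo using the per-sample counts from this run.
counts=$(mktemp)
cat > "$counts" <<'EOF'
Sample:  Count
mockIM3:  554502
mud-9064:  292799
mud-9062:  73355
mud-9067:  60846
mud-9063:  26222
mud-9065:  22458
mud-9068:  4400
mud-9066:  1974
mud-9069:  54
EOF
kept_after_drop "$counts" mud-9066 mud-9069
rm -f "$counts"
```

The two dropped samples account for 1974 + 54 = 2028 reads, which is the difference between the two totals in the remove output.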

To double check that the new .fq file contains just what you want, run the following command.

In [None]:
amptk show -i {newly.named}.demux.fq

We're good to start with the next part - clustering OTUs.

# Part 2: OTU clustering

See Jon Palmer's [notes] [link1] about DADA2 if you're curious about what's required to get to this point.  

We used to have to clean up reads before passing into DADA2 using the vsearch program, but now the code in amptk deals with this problem just fine.
[link1]:https://github.com/nextgenusfs/ufits

In [None]:
#cluster the filtered reads (the mockIM3 mock community is still included)
amptk dada2 \
--fastq mud_filt.demux.fq \
--out mud \
--length 180 \
--platform illumina \
--uchime_ref COI

Output highlights include:

-------------------------------------------------------
- 1,022,415 reads passed
- 282 total inferred sequences (iSeqs)
- 81 denovo chimeras removed
- 201 valid iSeqs
- Chimera Filtering (VSEARCH) using COI DB
- 198 iSeqs passed, 3 ref chimeras removed
- 1,027,131 reads mapped to iSeqs (99%)
- 151 OTUs generated
- 1,022,800 reads mapped to OTUs (99%)
-------------------------------------------------------

This will produce two files you want to use in the subsequent 'amptk filter' command below:
   - **{name}.cluster.otus.fa** (use this fasta file in the next command)
   - **{name}.cluster.otu_table.txt** (use this OTU table in the next command)

It also produces two very similar-looking files that represent the inferred sequences before they were further clustered at 97% identity (the clustering is done so we can make some sort of sense of it when assigning taxonomy). These aren't necessary for downstream analyses as of now:
   - **{name}.iSeqs.fa**
   - **{name}.otu_table.txt**

You can see exactly which of those iSeq files ended up clustering together into a single OTU with this file:
   - **iSeqs 2 OTUs: {name}.iSeqs2clusters.txt**

There are also a few log files generated:
   - **{name}.dada2.Rscript.log**  (to ensure the DADA2 program was run successfully)
   - **{name}.ufits-dada2.log**  (tracks the overall processing of the 'amptk dada2' wrapper script)


Next up is to filter the OTU table with the fasta file listed above:

# Part 3: filtering OTU table

Jon Palmer advocates using an index-bleed filter via the '-b mock' flag, which specifies the mock community sample used to calculate the index-bleed percentage. The idea behind this filtering step is to count the reads in the mock community sample that map to OTUs which are *not supposed to be in the mock community*. The percentage compares the number of reads from those unwanted OTUs with the number of reads that map to mock OTUs (i.e. those that are supposed to be in there). The alternative approach is to trim down reads by some defined (yet arbitrary) percent across all samples, based on examples from other data sets. I don't like that because I've found that every data set is unique, so having a mock community to judge this by is better.  
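To make the idea of an index-bleed percentage concrete, here is a rough sketch of one way such a value could be computed by hand. It is an illustration only: the helper `bleed_percent`, the two input files, and the choice to express unwanted reads as a share of all reads in the mock sample are assumptions, and amptk's own calculation may differ in its details.

```shell
# Share of reads in the mock sample that belong to OTUs not expected in the mock.
# Usage: bleed_percent EXPECTED_IDS_FILE MOCK_COUNTS_FILE
#   EXPECTED_IDS_FILE: one expected mock OTU id per line
#   MOCK_COUNTS_FILE:  tab-separated "OTU id <TAB> reads in the mock sample"
bleed_percent() {
    awk -F'\t' '
        NR == FNR { expected[$1] = 1; next }       # first file: expected mock OTUs
        { total += $2; if (!($1 in expected)) unwanted += $2 }
        END { if (total > 0) printf "%.2f\n", 100 * unwanted / total }
    ' "$1" "$2"
}

# Demo with invented ids and counts.
ids=$(mktemp)
tab=$(mktemp)
printf 'MockA\nMockB\n' > "$ids"
printf 'MockA\t6000\nMockB\t3900\nOTU7\t100\n' > "$tab"
bleed_percent "$ids" "$tab"
rm -f "$ids" "$tab"
```

In the demo, 100 of 10,000 reads in the mock sample come from an OTU that does not belong there, so the sketch reports 1.00 percent.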

We're going to use our mock community to calculate index bleed. We're also going to (potentially) incorporate a subtraction value in which *if* the scenario occurs such that an unwanted OTU remains in the mock community following the application of the index-bleed filter, we can detect how many reads that maximum value should be and subtract from there to ensure that **zero** normalized reads remain in the highest 'bleeding' OTU in the mock sample. This value is subtracted from *all* reads from *all* samples per OTU, not just the mock community, so it ultimately drops a lot of OTUs with low read thresholds.  

There are intermediate files which are very useful in determining exactly how many reads of which unwanted OTUs are creeping into your mock community, so you'll notice we're passing a **"--debug"** flag which is used to generate these files; without that flag, you'll miss the normalized read counts used in calculating the index bleed filter as well as the subsequent "--subtract" value.

You might want to play with the **--subtract** threshold and examine how many overall OTUs are retained. In the two examples below, we're going to compare the most conservative approach (firstfilt) with a slightly less harsh filter (mud).  
- **firstfilt** uses two flags: the index-bleed flag, and the **--subtract auto** flag. This second flag will calculate the total number of normalized reads in OTUs unwanted in the mock community and then subtract that value from all reads. This results in a completely clean mock, but drops a lot of OTUs
- **mud** uses the same two flags, but produces a different output format than the default csv and removes the mock community information.

First example allows for an automatic detection and application of index-bleed and OTU subtraction (if necessary):

In [None]:
amptk filter \
-i mud.cluster.otu_table.txt \
-f mud.cluster.otus.fa \
-b mockIM3 \
--keep_mock \
--calculate in \
--mc /leo/devon/projects/guano/mock-fa/CFMR_insect_mock3.fa \
--debug \
--subtract auto \
-o firstfilt

The resulting output will show you that:
- there was an index bleed of **1.4%** identified (that's pretty high!), representing the proportion of an unwanted OTU from other samples bleeding into the mock (it's the worst-case scenario, so it's the highest possible instance). 
- the auto subtract filter was used at a level of **1250**; that's a huge value which needs to be addressed.  

In other words, after filtering down all reads by 1.4% on a per-OTU basis, an additional 1250 reads were subtracted from each sample's OTUs. This results in dropping the total number of OTUs present in the table from **151 OTUs** down to just **36 OTUs**.  
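The effect of a flat subtraction on an OTU table is easy to underestimate, so here is a small sketch of the subtraction step described above. It is not amptk's code: `subtract_reads` is a hypothetical helper, the table values are invented, and clamping negative results to zero and dropping OTUs left with no reads are assumptions about how such a filter would behave.

```shell
# Subtract a fixed number of reads from every cell of a tab-separated OTU table,
# clamp at zero, and drop OTUs that end up with no reads in any sample.
# Usage: subtract_reads N TABLE_FILE   (first row is a header, first column the OTU id)
subtract_reads() {
    awk -F'\t' -v OFS='\t' -v n="$1" '
        NR == 1 { print; next }
        {
            keep = 0
            for (i = 2; i <= NF; i++) {
                $i = ($i > n) ? $i - n : 0
                if ($i > 0) keep = 1
            }
            if (keep) print
        }
    ' "$2"
}

# Demo with an invented three-OTU table.
table=$(mktemp)
printf 'OTUid\tmock\tmud-A\n' > "$table"
printf 'OTU1\t5000\t0\n' >> "$table"
printf 'OTU2\t1250\t271\n' >> "$table"
printf 'OTU3\t0\t1300\n' >> "$table"
subtract_reads 1250 "$table"
rm -f "$table"
```

In the demo, OTU2 disappears entirely because none of its counts exceeds the subtraction value, which is the same mechanism that removes many low-abundance OTUs from the real table.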

The next consideration is whether or not the OTUs could be kept if we were imposing a less stringent standard. If you run the same command (above) without passing the **--subtract auto** flag, you'll keep nearly all original OTUs. This indicates that it's this **--subtract** flag that's dropping most of our information.  

To investigate what's going on we're going to check out the **{name}.normalized.num.txt** file and compare it with the **.final.txt** file. We need to figure out which *unwanted* OTUs (that is, OTUs which should not be in the mock community) have the greatest number of reads in the mock community. We're going to clean up the needed files just a bit so that basic Linux tools work properly (this is because the space in the first line of the files shifts the first row into one more field than all the rows below).

Run the following commands:

In [None]:
sed -i 's/#OTU ID/OTUid/' firstfilt.final.txt
sed -i 's/#OTU ID/OTUid/' firstfilt.normalized.num.txt

You don't know two things at this point: 
- how many fields are there?
- which OTUs contain the most reads among OTUs that **don't** belong?  

To answer the *number of fields question*, substitute **file.name.txt** with whatever file you want:

In [None]:
awk -F '\t' '{print NF; exit}' file.name.txt

For example:

In [None]:
awk -F '\t' '{print NF; exit}' firstfilt.final.txt

The result indicates that there are **8** fields: the OTU ID column plus 7 sample columns. That's important when you next want to figure out which columns to sort. With our current data set the first field is the OTU ID and the second field is the mock community. So we're going to sort through this text file and print out the fields we want.  
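Knowing the number of fields is only half the picture; it also helps to see which sample sits in which column. The sketch below numbers the header fields of a tab-separated table. `list_columns` is a hypothetical helper and the three-column header in the demo is invented, so the real column order should be read from your own file.

```shell
# Print each header field of a tab-separated table with its column number.
# Usage: list_columns TABLE_FILE
list_columns() {
    head -n 1 "$1" | tr '\t' '\n' | awk '{ print NR ": " $0 }'
}

# Demo with an invented header row.
table=$(mktemp)
printf 'OTUid\tmockIM3\tmud-9062\n' > "$table"
printf 'OTU1\t10\t0\n' >> "$table"
list_columns "$table"
rm -f "$table"
```

The column numbers printed here are the ones to pass to `cut -f` and `sort -k` in the commands that follow.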

The following commands should give you a sense of how many reads may need to be subtracted from a given OTU that's unwanted. With the first command you can view the OTU in question, the mock sample, and another real sample in terms of how many reads are in each OTU.  
- Notice how there is a complete separation between reads that are in the mock and *not in the real sample* and vice versa. This is a good thing.
- Notice in the second command how, once we've applied those filters, there is zero background noise in our real sample, yet all of our mock community members are maintained (there are 32 unique 'MockIM' OTUs while 34 sequences are listed in our fasta, because in two cases a pair of sequences forms a single OTU cluster, as they are just variant sequences of the same species).

In [None]:
#for the .normalized.num.txt file (options placed before the file name for portability):
cut -f 1,2,3 firstfilt.normalized.num.txt | sort -k2,2n | awk '$2 != "0.0" {print $0}'

#for the .final.txt file:
cut -f 1,2,3 firstfilt.final.txt | sort -k2,2n | awk '$2 != "0" {print $0}'

And what do you find? There's a single OTU contaminating these samples. It's likely a contaminant present in the PCR mix or the DNA extraction materials because it's present in both the mock community as well as some real samples.  

Fortunately, it's the only *unwanted* OTU present in the mock community in significant numbers of reads. Thus, dropping that single OTU across all samples should help alleviate contamination concerns.  

It's valuable to think about what that OTU may represent. If, for example, you wanted to see how many reads are in every sample of the data set for some specific OTU, say  **OTU16**, just type:

In [None]:
grep "\\bOTU16\\b" firstfilt.normalized.num.txt

Which will result in the following output:

**OTU16&nbsp;&nbsp;  1250.0&nbsp;  271.0&nbsp;&nbsp;  1488.0&nbsp;&nbsp;  38.0&nbsp;&nbsp;  0.0&nbsp;&nbsp;  3.0&nbsp;&nbsp;  0.0&nbsp;&nbsp;**

This can be useful in instances in which there are one or two OTUs that clearly are unwanted but are much higher in background noise than all the others. This can result in dropping a lot of real OTUs on account of a few unwanted ones. This is often an indication of index-bleed, as chimeric sequences would typically have low read abundance in many samples (the opposite of what we see above). But that's just not what's going on with this dataset... things look pretty normal. You can run a similar *grep* command for each OTU of concern and you'll notice that each case gives the same impression of likely index-bleed.  

Another useful thing to check is identifying (as best as possible) the taxonomic information of these potential contaminant OTUs in question. They may be chimeric or index-bleed and this is another way to identify what's going on. You can do this by looking into the fasta file for each of those OTUs and then manually BLASTing them. For example:

In [None]:
grep -A 3 "OTU16" mud.cluster.otus.fa

This produces the sequence needed; manually BLASTing this specific OTU in NCBI's nr database shows that it's a common housefly. It's expected in other projects that were pooled together with this project (that is, this run consisted of not only these 8 mud samples, but also hundreds of other bird and bat fecal samples... and birds eat flies). Thus it's not a chimeric OTU, though it's possible that the fly contaminant OTU is the result of index bleed from other non-mud sample reads present in a handful of samples. However, in this case it's clearly not something we can say *should* be in the dataset, and it's the only OTU behaving so differently from all other reads, so we're going to just drop it.  
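One caveat with a plain grep on the fasta file is that the pattern OTU16 would also match headers such as OTU160, and a fixed number of context lines may cut a wrapped sequence short. The sketch below pulls out exactly one record by id. `fasta_record` is a hypothetical helper, the sequences in the demo are invented, and the assumption that an id ends at the first space, tab, or semicolon in the header may need adjusting for your files.

```shell
# Print the single fasta record whose id matches exactly.
# Usage: fasta_record ID FASTA_FILE
fasta_record() {
    awk -v id="$1" '
        /^>/ {
            hdr = substr($0, 2)           # header without the leading ">"
            sub(/[ \t;].*/, "", hdr)      # keep only the id part
            show = (hdr == id)
        }
        show
    ' "$2"
}

# Demo with invented sequences.
fa=$(mktemp)
printf '>OTU16\nACGTACGT\nTTGA\n>OTU160\nGGGGCCCC\n' > "$fa"
fasta_record OTU16 "$fa"
rm -f "$fa"
```

The output contains the OTU16 header and both of its sequence lines, and nothing from OTU160, so it can be pasted directly into a BLAST search.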

The final filter is then applied by using the "auto-subtract" function again, but only after dropping the specific contaminant OTU:

In [None]:
amptk drop \
-i mud.cluster.otus.fa \
-r mud_filt.demux.fq \
-l OTU16 \
-o mud

The above command will result in a single OTU being dropped from each .fq file and generate two output files:  
- **mud.cleaned.otu_table.txt** 
- **mud.cleaned.otus.fa**  

Next up is to run another filtering step to ensure that the OTU in question has been removed, and see how the removal of that OTU influences what the **--subtract** value is set to.

In [None]:
amptk filter \
-i mud.cleaned.otu_table.txt \
-f mud.cleaned.otus.fa \
-b mockIM3 \
--keep_mock \
--calculate in \
--mc /leo/devon/projects/guano/mock-fa/CFMR_insect_mock3.fa \
--subtract auto \
--debug \
-o lasttest

Big difference. We find that dropping that one OTU results in the lowering of the **--subtract auto** value from the very high range to a more modest number of reads per sample: from **1250** to **17** reads.

While there are still a few additional OTUs bleeding into the mock community, these all occur at a very low abundance. We can therefore apply a final filtering step where we keep only true samples (we drop the mock community and its uniquely associated OTUs), and filter down with a more modest subtraction filter.

In [None]:
amptk filter \
-i mud.cleaned.otu_table.txt \
-f mud.cleaned.otus.fa \
-b mockIM3 \
--calculate in \
--mc /leo/devon/projects/guano/mock-fa/CFMR_insect_mock3.fa \
--subtract auto \
-o mud

So what do we see? By dropping just that one contaminant OTU from each data set we greatly reduce the value used in the **--subtract auto** option. We go from a subtract value of **1250** to just **17**; this results in retaining **117** OTUs instead of the **36** that were left when we kept that one contaminant OTU. Big difference!
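A quick way to confirm OTU totals like these for yourself is to count the records in a filtered fasta file. This is a minimal sketch; `count_otus` is a hypothetical helper and the demo file is invented.

```shell
# Count the records (header lines) in a fasta file.
# Usage: count_otus FASTA_FILE
count_otus() {
    grep -c '^>' "$1"
}

# Demo with an invented two-record fasta.
fa=$(mktemp)
printf '>OTU1\nACGT\n>OTU2\nTTGA\n' > "$fa"
count_otus "$fa"
rm -f "$fa"
```

Running it on the fasta output of each filter run gives the before and after OTU counts to compare.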

Notably, 11 of those OTUs are specifically associated with the mock community that is now removed. When you look at the abundance of reads per OTU it's pretty clear that many of these dropped OTUs are very rare (present only in one or a few of the samples and absent in others), whereas OTUs containing thousands of reads tend to be present in many of the samples. 

The arguments above will produce the following files:  
- **mud.filtered.otus.fa** is the final filtered fasta file (say that five times fast)
- **mud.final.binary.txt** is the presence/absence OTU table after filtering
- **mud.stats.txt** is the number of OTUs in each sample before and after filtering
- **mud.final.txt** is the OTU table with normalized read counts (normalized to the number of reads in each sample)
- **mud.amptk-filter.log**  

The next step is to assign taxonomy by using the **mud.filtered.otus.fa** and **mud.final.binary.txt** files. 

# Part 4: Assigning Taxonomy to OTUs

Make sure to have already downloaded the necessary database (COI).

In [None]:
amptk taxonomy -i mud.final.binary.txt -f mud.filtered.otus.fa -d COI

## What do we find?

See email and GitHub page for more info.