# Sample background 

The following sequence data was generated either from DNA extracted by NAU or through following guano extraction standards developed in the Foster Lab (described elsewhere). Arthropod COI fragments were produced using new JOH primers, then 250 PE sequencing on an Illumina HiSeq2500 generated our reads.  Multiple lanes were used in generating these data as indicated in the run statistics shown [here] [link1] and [here] [link2]. 

[link1]:http://cobb.unh.edu/170125_DevonP2L1_A_DemuxStats.html
[link2]:http://cobb.unh.edu/170125_DevonP2L2_A_DemuxStats.html

# Pipeline Overview

We use the tools developed by Jon Palmer in the [**amptk**] [link1] program which will trim, filter, cluster, and assign taxonomy to reads. The core clustering process employs an algorithm called DADA2 (see [paper] [link3] and [github page] [link4]) which has an added benefit of error correcting reads (even singletons), though each identified cluster is then reclustered at a 97% similarity (as done conventionally in many OTU-clustering approaches) for downstream community analyses. Taxonomic assignment is completed in part using the curated Barcode of Life Database [(BOLD)] [link7], while also leveraging Robert Edgar's [SINTAX] [link5] and [UTAX] [link6] algorithms to provide additional taxonomic information for reads which are not matched in BOLD.

[link1]:https://github.com/nextgenusfs/amptk
[link3]:http://www.nature.com/articles/nmeth.3869.epdf?author_access_token=hfTtC2mxuI5t44WUhsz05NRgN0jAjWel9jnR3ZoTv0N5gu3rLNk61gF4j2hXPcagLe964qdfd3GRw8OwyZxfEgsul8lwR1lEWykR3lWF30Dl_bZWMvTPwOdrwuiBUeYa
[link4]:https://github.com/benjjneb/dada2
[link5]:http://www.biorxiv.org/content/early/2016/09/09/074161
[link6]:http://www.drive5.com/usearch/manual/utax_algo.html
[link7]:http://v4.boldsystems.org/

## Other information

Installation requirements are found on the [amptk] [link1] site. This project was completed with the following dependencies:

- amptk v0.8.5 
- vsearch v2.3.4
- usearch v9.2.64 (linked as "usearch9", not "usearch")  
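
Before starting, it can save time to confirm that each dependency is reachable from your shell. The following is a minimal sketch; the function name `missing_tools` is hypothetical, and the tool names are simply the ones listed above (including the `usearch9` link name).

```shell
# Report any required tool that is not on the PATH.
# 'missing_tools' is a hypothetical helper, not part of amptk.
missing_tools() {
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            printf 'missing: %s\n' "$tool"
        fi
    done
}
```

Calling `missing_tools amptk vsearch usearch9` prints nothing when all three are installed, and one `missing:` line per tool that isn't.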

In addition to the basic tools we're also going to need to have installed a few other items:  
- a fasta file of the mock community (used in the sequence index bleed filtering step)
    - we are using the IM3 mock community provided by Michelle Jusino
- the COI index (either default from 'amptk' or a curated one of your own)

To install the amptk default COI database:

[link1]:https://github.com/nextgenusfs/amptk/blob/master/docs/ubuntu_install.md

In [None]:
amptk install -i COI

# part 1 - data migration and cleaning

Initial steps involve moving, renaming, and filtering data as follows.

First, move all relevant directories from Cobb to Pinky servers with rsync. For each project, there will likely be three sets of samples to migrate: 
- the samples
- the negative controls
- one or more mock community members  

This will occur for every lane for every project. Where a project's samples were split across multiple lanes, each lane's data will be treated with separate mock community samples so that index bleed is lane-specific. Data for an entire project is only merged after OTUs have been determined; this is done in R.  

## generic migration

In [None]:
rsync -avzr foster@cobb.unh.edu:/path2data/Sample_*/*.gz /copy2here/fqRaw/.

## generic renaming

Rename the files to remove the unnecessary underscore. As there are several different prefixes to deal with, we'll run this command as many times as needed for each unique prefix type (while implementing wildcards when possible).

In [None]:
#for true samples
rename 's/{name}_/{name}\-/' *

#for any contaminant sample
rename 's/{name}-contaminated_/{name}une-contaminated/' *

#for negative controls
rename 's/neg{name}_/negu{name}\-/' *

#for mock community
rename 's/mock_IM3_/mockIM3\-/' *

Once everything is properly named, you can then proceed. If you get errors in downstream processes circle back and see if you've kept any weird underscores or other characters in the file names. See the [amptk homepage] [link1] if you're unclear about how the sequence data is expected to be named.
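
Since a bad pattern can mangle a whole directory of file names, it may help to preview the result before renaming anything. The sketch below only prints what the substitution would do; `preview_rename` is a hypothetical helper, and the file name in the usage note is an invented example of Illumina-style naming rather than one of our actual files.

```shell
# Print "old -> new" for each file name given, replacing the first "<prefix>_"
# with "<prefix>-". Nothing is renamed; this only previews the result.
# 'preview_rename' is a hypothetical helper.
preview_rename() {
    prefix="$1"
    shift
    for old in "$@"; do
        new=$(printf '%s\n' "$old" | sed "s/${prefix}_/${prefix}-/")
        printf '%s -> %s\n' "$old" "$new"
    done
}
```

For example, `preview_rename nau nau_7132_S5_L001_R1_001.fastq.gz` shows the first underscore becoming a hyphen while the rest of the name is left alone. The Perl `rename` used above should also accept `-n` for a dry run, though check your local version.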

[link1]:https://github.com/nextgenusfs/amptk

## One final pre-processing note

If you have any files that contain zero bytes of data (literally no indices were detected during the Illumina run), you need to delete those .fq files ahead of time before trimming with the first step in the 'amptk' pipeline (it crashes when it tries to parse data that doesn't exist!). To delete files with zero bytes of data, enter the following command (within the directory containing all the raw fastq files):

In [None]:
 find . -name "*.fastq" -size -1c -delete

# Read processing

This process can take a considerable amount of time, so increase the number of cpus when possible. For example, with 8 cpus on about 17 Gb of data, it took ~1 hour to process.  

Note the suffix "L1" in the **-o** flag. This is done in this particular circumstance because this project contains data across two lanes. If there were multiple Illumina runs, these would be further specified by project number (ex. P1L1, P3L2, etc.).

In [None]:
#same command for both L1 and L2 data, except alter output name for L1 or L2 "-i" and "-o" flags as needed
#run in parent directory of {path.to}/fqRaw
amptk illumina \
-i /leo/devon/projects/guano/nau/p2_data/L{1|2}_data/fqRaw \
-o nauallL{1|2} \
--rescue_forward on \
--require_primer off \
--min_len 160 \
--full_length \
--cpus 20 \
--read_length 250 \
-f GGTCAACAAATCATAAAGATATTGG \
-r GGWACTAATCAATTTCCAAATCC
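
Note that `L{1|2}` in the command above is shorthand for "L1 or L2", not shell syntax, so it has to be filled in by hand for each lane. As a rough sketch, the two concrete calls can be printed (not executed) like this; the function name `illumina_cmd` is hypothetical, and every path and flag is copied from the command above.

```shell
# Print the 'amptk illumina' call for one lane (1 or 2) without running it.
# Paths, flags, and primers are copied from the command above;
# 'illumina_cmd' is a hypothetical helper.
illumina_cmd() {
    lane="$1"
    printf 'amptk illumina -i %s -o %s --rescue_forward on --require_primer off --min_len 160 --full_length --cpus 20 --read_length 250 -f %s -r %s\n' \
        "/leo/devon/projects/guano/nau/p2_data/L${lane}_data/fqRaw" \
        "nauallL${lane}" \
        "GGTCAACAAATCATAAAGATATTGG" \
        "GGWACTAATCAATTTCCAAATCC"
}

# Show the command for each lane so it can be checked before running
for lane in 1 2; do
    illumina_cmd "$lane"
done
```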

Output from this script will contain several new files:  

1. In the directory in which the script was executed:
    - (output_name).mergedPE.log containing information about PE merging. Note that information in this single file includes the summary of all individual merged pairs.
    - (output_name)-filenames.txt containing information about files used in this pipeline (such as index-pair combinations)
    - (output_name).demux.fq containing a concatenated file of all trimmed and PE merged sequences of all samples listed in the '-filenames.txt' file above __(this one is to be used in the subsequent OTU clustering step)__

2. In the (output_name) directory named in the 'amptk illumina ...' argument provided above:  
    - (output_name).ufits-process.log containing information about the demultiplexing and read trimming processes
    - a single sample_name.fq file which is the PE merged file from each sample_name...R1/R2 pair of raw fastq inputs
    - a single sample_name.demux.fq file for each PE merged sample_name.fq file having been trimmed as defined  

The file labeled **'{-o name}.ufits-demux.log'** will also contain a summary of information regarding the total number of reads processed as well as the number of reads processed per sample (at the very bottom of the file). This is critical in evaluating which reads to keep, and the names of those files are essential in the next step for keeping/dropping samples.

## Dropping/keeping certain samples

You may want to exclude certain samples which contain too few reads. These can be discarded at this point before moving forward in the OTU clustering part. Jon Palmer recommends shooting for samples with about 50,000 reads, though as low as 10,000 reads is fine. It also depends on the distribution of the reads across all samples; if you have 20 samples with >100,000 reads each, and 20 samples with ~10,000 you might not want to keep anything less than 10,000 reads. However, if you have 20 samples with ~20,000 reads each and another 10 samples with 5,000 reads, and another 10 samples with ~100 reads, then you might want to keep those with 5,000 (so going lower than the 10,000 reads threshold).

There's an equivalent pair of commands which you can use to keep or remove samples: if it's faster to enter just a few samples to drop, use the '**amptk remove**' command; if it's faster to just specify the few samples you want to keep, use the '**amptk select**' command.  

Here's an example of the kind of range you might see in a typical run:

In [None]:
Sample:  Count
9056:  36628
9001:  27041
9023:  25831
9040:  15422
negsample1:  1201
9030:  15
9033:  13

In the case above, you're likely going to just keep the top 4 samples, while dropping the bottom three. To drop those three, you'd enter the following generic command:

In [None]:
amptk remove \
-i somepath/trimd.demux.fq \
-l negsample1 9030 9033 \
-o somepath/nau_L1_merged.demux.fq

In the case of our specific dataset, NAU runs were split among two lanes. We need to consider how these reads fit within the entirety of the run (for each lane) before trimming. 
- For **L1 samples** I kept all samples with >11,000 reads; note the negative control with the highest read count reached 6920, though most negative controls only contained between 500 and 1200 reads total.
- For **L2 samples** I kept all samples with >9,800 reads; note that the only negative control sample in this run contained about 500 reads total.
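
With dozens of samples it's easier to build the keep-list from the read counts than to type it out. Here is a minimal sketch, assuming the per-sample counts have been copied into a file in the `Sample:  Count` layout shown earlier; `samples_to_keep` and the file name `counts.txt` are hypothetical, and I'm assuming `amptk select -f` takes one sample name per line.

```shell
# Print the names of samples whose read count is at least the threshold ($1).
# Reads "name:  count" lines on stdin; the "Sample:  Count" header is skipped.
# 'samples_to_keep' is a hypothetical helper.
samples_to_keep() {
    awk -F':' -v min="$1" '
        $1 == "Sample" { next }          # skip the header line
        ($2 + 0) >= min { print $1 }     # keep names at or above the threshold
    '
}
```

For the L1 threshold that would be something like `samples_to_keep 11000 < counts.txt > files2keep.txt`. Check the resulting list by eye, since a negative control that happens to clear the threshold would be kept too.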

In [None]:
#for L1 data:
amptk select \
-i nauallL1.demux.fq \
-f files2keep.txt \
-o nau-dropdL1.demux.fq
    #kept about 99.4% of all reads
    #kept 54 of 101 true samples

#for L2 data:
amptk remove \
-i nauallL2.demux.fq \
-l nau-7132 nau-7131 negnau-41H03 nau-7126 \
-o nau-dropdL2.demux.fq
    #kept about 99.3% of all reads
    #kept 6 of 9 true samples

To double check that the new .fq file contains just what you want, list the samples and read counts it holds and confirm you didn't miss anything.
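
One way to do that check with plain awk is sketched below. It assumes each read header carries the sample name as a `;key=value;` annotation; the key name is an assumption (look at `head -n 1` of your own file), and `reads_per_sample` is a hypothetical helper. amptk also appears to have a `show` subcommand for this purpose, so see `amptk --help` before relying on the sketch.

```shell
# Count reads per sample in a demultiplexed FASTQ file.
# $1 = FASTQ file, $2 = header key that holds the sample name.
# The default key "barcodelabel" is an assumption about the header format.
reads_per_sample() {
    awk -v key="${2:-barcodelabel}" '
        NR % 4 == 1 {                               # header line of each 4-line record
            n = split($0, parts, ";")
            for (i = 1; i <= n; i++) {
                if (index(parts[i], key "=") == 1) {
                    counts[substr(parts[i], length(key) + 2)]++
                }
            }
        }
        END { for (s in counts) print s "\t" counts[s] }
    ' "$1" | sort -k2,2nr
}
```

Running `reads_per_sample nau-dropdL1.demux.fq` should then list only the samples you meant to keep, largest first.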

We're good to start with the next part - clustering OTUs.

## part 2 - OTU clustering

See Jon Palmer's [notes] [link1] about DADA2 if you're curious about what's required to get to this point.  

We used to have to clean up reads before passing into DADA2 using the vsearch program, but now the code in amptk deals with this problem just fine.
[link1]:https://github.com/nextgenusfs/ufits

In [None]:
#similar command for L1 and L2; note the specification in the '--out' flag
amptk dada2 \
--fastq nau-dropdL{1|2}.demux.fq \
--out nauL{1|2} \
--length 180 \
--platform illumina \
--uchime_ref COI

This will produce two files you want to use in the subsequent 'amptk filter' command below:
   - **dada2_output.cluster.otus.fa** (use this fasta file in the next command)
   - **dada2_output.cluster.otu_table.txt** (use this OTU table in the next command)

It also produces two very similar looking files that represent the inferred sequences (iSeqs) before they were further clustered at 97% identity (that clustering is so we can make some sort of sense of it when assigning taxonomy) - these aren't necessary for downstream analyses as of now:
   - **dada2_output.iSeqs.fa**
   - **dada2_output.otu_table.txt**

You can see exactly which of those iSeqs ended up clustering together into a single OTU with this file:
   - **iSeqs 2 OTUs: dada2_output.iSeqs2clusters.txt**

There are also a few log files generated:
   - **dada2_output.dada2.Rscript.log**  (to ensure the DADA2 program was run successfully)
   - **dada2_output.ufits-dada2.log**  (tracks the overall processing of this wrapper 'amptk dada2' script)


Next up is to filter the OTU table with the fasta file listed above:

# Part 3: filtering OTU table

Jon Palmer advocates an index-bleed filter using the '-b mock' flag, which uses a mock community to calculate the index-bleed percentage. The idea behind this filtering step is to count the reads in the mock community sample that map to an OTU which is *not supposed to be in the mock community*. The percentage compares the number of reads from OTUs that aren't supposed to be there against the number of reads that map to mock OTUs (i.e. that are supposed to be in there). The alternative approach is to just trim down reads by some defined (yet arbitrary) percent across all samples, borrowed from examples in other data sets. I don't like that because I've found that every data set is unique, so having a mock community to judge this by is better.  
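
To make the idea concrete, here is a toy calculation. It assumes the bleed rate is taken as the share of all reads in the mock sample that fall in OTUs that don't belong there; amptk's own calculation may differ in detail. The three-column input layout, the read counts in the usage note, and the `bleed_percent` name are all invented for illustration.

```shell
# Toy index-bleed calculation for a single mock sample.
# Input (stdin): tab-separated lines of  OTU <tab> reads_in_mock <tab> yes|no
# where the third column says whether the OTU belongs in the mock community.
# Assumption: bleed = stray reads / all reads in the mock sample.
bleed_percent() {
    awk -F'\t' '
        { total += $2 }
        $3 == "no" { stray += $2 }
        END { if (total > 0) printf "%.2f\n", 100 * stray / total }
    '
}
```

With 9,860 reads in expected mock OTUs and 140 reads in an unexpected one, this reports a 1.40 percent bleed.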

We're going to use our mock community to calculate index bleed. We're also going to (potentially) incorporate a subtraction value in which *if* the scenario occurs such that an unwanted OTU remains in the mock community following the application of the index-bleed filter, we can detect how many reads that maximum value should be and subtract from there to ensure that **zero** normalized reads remain in the highest 'bleeding' OTU in the mock sample. This value is subtracted from *all* reads from *all* samples per OTU, not just the mock community, so it ultimately drops a lot of OTUs with low read thresholds.  

There are intermediate files which are very useful in determining exactly how many reads of which OTUs are creeping into your mock community (that are unwanted OTUs), so you'll notice we're passing a **"--debug"** flag which is used to generate these files; without that flag, you'll miss the normalized read counts used in calculating the index bleed filter as well as the subsequent "--subtract" value.

You might want to play with the **--subtract** threshold and examine how many overall OTUs are retained. In the two examples below, we're going to test exactly what filtering parameters should be applied.  

- Our first filtering step uses several flags: 
    - the index-bleed flag **-b** specifies which sample is the mock community and goes with the **--mc** flag to specify the path with which you can find the associated fasta file with that mock community
    - the **calculate in** flag specifies that we're only going to calculate index bleed into the mock (not back out of the mock); this is done because we are using a biological mock community and there is a chance that our true samples also contain the same species (OTUs really); if left to the default (**calculate all**) *and we had samples that contained these mock OTUs too* then our index bleed rate would be super high, which we don't want
    - the **--debug** flag provides the intermediate files needed to determine which OTUs and how many reads are finding their way into the mock community sample
    - the **--subtract auto** flag will calculate the total number of normalized reads in OTUs unwanted in the mock community and then subtract that value from all reads. When applied this results in a completely clean mock, but drops a lot of OTUs
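
To see why a large subtract value is so costly, the effect can be mimicked on any tab-separated OTU table. This is only a sketch of the idea described above (subtract a fixed value from every count, floor at zero, drop OTUs left with nothing); it is not amptk's code, and `subtract_reads` is a hypothetical helper.

```shell
# Subtract a fixed value ($1) from every count in a tab-separated OTU table
# read on stdin, floor at zero, and print only OTUs that still have reads
# in at least one sample. The header row is passed through unchanged.
# 'subtract_reads' is a hypothetical helper, not amptk's implementation.
subtract_reads() {
    awk -F'\t' -v OFS='\t' -v sub_val="$1" '
        NR == 1 { print; next }
        {
            keep = 0
            for (i = 2; i <= NF; i++) {
                $i = $i - sub_val
                if ($i < 0) $i = 0
                if ($i > 0) keep = 1
            }
            if (keep) print
        }
    '
}
```

Piping a table through `subtract_reads 1250 | tail -n +2 | wc -l` and then through `subtract_reads 17 | tail -n +2 | wc -l` gives a quick feel for how many OTUs each value would leave.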

First example allows for an automatic detection and application of index-bleed and OTU subtraction (if necessary):

In [None]:
#same command for L1 and L2 data; alter input/output names as needed
amptk filter \
-i nauL{1|2}.cluster.otu_table.txt \
-f nauL{1|2}.cluster.otus.fa \
-b mockIM3 \
--keep_mock \
--calculate in \
--mc /leo/devon/projects/guano/mock-fa/CFMR_insect_mock3.fa \
--debug \
--subtract auto \
-o testfiltL{1|2}

The resulting output will show you that:  

*For Lane 1 data*
- there was an index bleed of **1.4%** applied (that's pretty standard for most of my data sets, though still kind of high), and 
- the auto subtract filter was used at a level of **1250**. That's really high!

In other words, for *Lane 1 data*, after filtering down all reads by 1.4% on a per-OTU basis, an additional 1250 reads were then subtracted from each OTU in each sample. If you used those default parameters it would result in a dramatic reduction in the OTUs kept. For instance in this filtering regime we go from **685 OTUs** down to **136 OTUs** - that's a lot of OTUs to throw away. The question is whether or not that **--subtract auto** filter needs to be quite as high. We're going to figure that out in a minute.  


*For Lane 2 data*
- there was an index bleed of **1.4%** applied, just like in Lane 1, despite being completely independent. At least we're consistent...
- the auto subtract filter was used at a level of **1277**.  That's really high too (again, at least we're consistent though, as the same problem might be popping up and require the same solution).

To investigate what's going on in both data sets we're going to check out the **testfiltL{1|2}.normalized.num.txt** file and compare it with the **testfiltL{1|2}.final.txt** file (one for each lane). We need to figure out which of the *unwanted* OTUs in the mock community - that is, OTUs which should not be in the mock community - contain the greatest number of reads. We're going to clean up the needed files just a bit so that our basic linux search tools work properly (this is because the space in the first line of the files shifts the first row into one more field than all the rows below).

Run the following commands:

In [None]:
#Same command for L1 and L2 data
sed -i 's/#OTU ID/OTUid/' testfiltL{1|2}.final.txt
sed -i 's/#OTU ID/OTUid/' testfiltL{1|2}.normalized.num.txt

You don't know two things at this point: 
- how many fields are there?
- which OTUs contain the most reads among OTUs that **don't** belong?  

To answer the *number of fields question*, substitute **file.name.txt** with whatever file you want:

In [None]:
awk -F '\t' '{print NF; exit}' file.name.txt

For example:

In [None]:
#same command for L1 and L2 data:
awk -F '\t' '{print NF; exit}' testfiltL{1|2}.final.txt

For example, entering that command *for Lane 1*, there are **55** fields, indicating that there are 54 total samples to deal with. Our current data set is named such that the second field is the mock community, and the first field is the list of OTUs in each sample. So we're going to just sort through this big array of a text file and just print out the fields we want.  
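
Rather than relying on the mock community happening to sit in field 2, you can look its column up by name. A minimal sketch, assuming the header has already been cleaned up with the `sed` command above so that fields are purely tab-separated; `col_index` is a hypothetical helper.

```shell
# Print the 1-based column number whose header equals the given name.
# $1 = column name (e.g. the mock sample), $2 = tab-separated table.
# 'col_index' is a hypothetical helper.
col_index() {
    awk -F'\t' -v name="$1" '
        NR == 1 {
            for (i = 1; i <= NF; i++) if ($i == name) print i
            exit
        }
    ' "$2"
}
```

For example, `col_index mockIM3 testfiltL1.final.txt` should print `2` for the layout described here, and the same call works for picking out any true sample to compare against.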

The following command should give you a sense of how many reads may need to be subtracted from a given OTU that's unwanted. In the first command you can view the OTU in question, the mock sample, and another real sample in terms of how many reads are in each OTU.  

In [None]:
#for the .normalized.num.txt file:
cut testfiltL{1|2}.normalized.num.txt -f 1,2,7 | sort -k2,2n | awk '$2 != "0.0" {print $0}' 

#for the .final.txt file:
cut testfiltL{1|2}.final.txt -f 1,2 | sort -k2,2n | awk '$2 != "0" {print $0}'
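
A small variation on the commands above keeps the header row on top and lists the largest values first, which makes the outlier OTU easier to spot. This is just a sketch; `nonzero_desc` is a hypothetical helper, and it assumes tab-separated input.

```shell
# Show the header, then every row with a non-zero value in the chosen column,
# largest first. $1 = tab-separated table, $2 = 1-based column number.
# 'nonzero_desc' is a hypothetical helper.
nonzero_desc() {
    head -n 1 "$1" | cut -f "1,$2"
    tail -n +2 "$1" | cut -f "1,$2" \
        | awk -F'\t' '($2 + 0) > 0' \
        | sort -t "$(printf '\t')" -k2,2nr
}
```

For instance, `nonzero_desc testfiltL1.normalized.num.txt 2` lists the OTUs present in the mock community column in descending order of normalized reads.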

The same effect seems to be occurring in both data sets:  

Notice how there is a single OTU that shouldn't be in there that's in abundance while there are a handful of other OTUs that shouldn't be there but contain only a fraction of the reads (it's **OTU74** for L1 data and **OTU22** for L2 data that are much higher than all others)? After looking at multiple sets of samples all sequenced on this Project 2 run, I think some fly DNA is present in one of the reagents at some very low level (probably in the PCR mix, but possibly anywhere). You can search for this OTU using the second command below. Because this project set includes only bat guano samples we know we aren't going to find that fly in it, so we can drop this OTU altogether.

If, for example, you wanted to see how many reads are in every sample of the data set for some specific OTU, say  **OTU74** for the L1 dataset, just type:

In [None]:
grep "\\bOTU74\\b" testfiltL1.normalized.num.txt

Which will result in the following output:  

**For L1 data**  
OTU147&nbsp;&nbsp;&nbsp;&nbsp;  84.0&nbsp;&nbsp;  16.0&nbsp;&nbsp;  0.0&nbsp;&nbsp;  119.0&nbsp;&nbsp;  151.0&nbsp;&nbsp;  0.0&nbsp;&nbsp;  5.0&nbsp;&nbsp;  1.0&nbsp;&nbsp;  3.0&nbsp;&nbsp;  0.0&nbsp;&nbsp;  2.0&nbsp;&nbsp;  0.0&nbsp;&nbsp;  9.0&nbsp;&nbsp;  1.0&nbsp;&nbsp;  10.0&nbsp;&nbsp;  724.0&nbsp;&nbsp;  0.0&nbsp;&nbsp;  0.0&nbsp;&nbsp;  0.0&nbsp;&nbsp;  2.0&nbsp;&nbsp;  2.0&nbsp;&nbsp;  0.0&nbsp;&nbsp;  2.0&nbsp;&nbsp;  0.0&nbsp;&nbsp;  0.0&nbsp;&nbsp;  0.0&nbsp;&nbsp;  0.0&nbsp;&nbsp;  0.0&nbsp;&nbsp;  7.0&nbsp;&nbsp;  1.0&nbsp;&nbsp;  0.0&nbsp;&nbsp;  0.0&nbsp;&nbsp;  0.0&nbsp;&nbsp;  0.0&nbsp;&nbsp;  0.0&nbsp;&nbsp;  0.0&nbsp;&nbsp;  0.0&nbsp;&nbsp;  18.0&nbsp;&nbsp;  0.0&nbsp;&nbsp;  0.0&nbsp;&nbsp;  2.0&nbsp;&nbsp;  0.0&nbsp;&nbsp;  167.0&nbsp;&nbsp;  0.0&nbsp;&nbsp;  62.0&nbsp;&nbsp;  0.0&nbsp;&nbsp;  5.0&nbsp;&nbsp;  4.0&nbsp;&nbsp;  27.0&nbsp;&nbsp;  9.0&nbsp;&nbsp;  462.0&nbsp;&nbsp;  45.0&nbsp;&nbsp;  4.0&nbsp;&nbsp;  2.0  


**For L2 data**  
OTU22&nbsp;&nbsp;  1277.0&nbsp;&nbsp;  4698.0&nbsp;&nbsp;  1.0&nbsp;&nbsp;  2.0&nbsp;&nbsp;  5.0&nbsp;&nbsp;  214.0&nbsp;&nbsp;  4.0

Though there are different numbers of samples we see a similar trend in both data sets: just a few samples contain the majority of the reads for this OTU; notably, the mock community has one of the highest read counts (the highest in the *L1 data*) along with a handful of other samples containing hundreds of reads. This could be an indication of index-bleed, as chimeric sequences would typically show low read abundance in many samples (the opposite of what we see above).  
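
Eyeballing a long row of numbers gets harder as the sample count grows, so a one-line summary per OTU can help. The sketch below counts how many samples have any reads, how many reach a cutoff (100 by default, chosen only for illustration), and the largest value; `otu_spread` is a hypothetical helper and assumes a tab-separated table.

```shell
# Summarise how one OTU's reads are spread across samples.
# $1 = OTU name, $2 = tab-separated table, $3 = optional cutoff (default 100).
# 'otu_spread' is a hypothetical helper.
otu_spread() {
    awk -F'\t' -v id="$1" -v cutoff="${3:-100}" '
        $1 == id {
            for (i = 2; i <= NF; i++) {
                v = $i + 0
                if (v > 0) nonzero++
                if (v >= cutoff) high++
                if (v > max) max = v
            }
            printf "%s samples=%d nonzero=%d ge%d=%d max=%g\n", id, NF - 1, nonzero, cutoff, high, max
        }
    ' "$2"
}
```

Applied to the L2 row shown above, it reports 7 samples, all non-zero, 3 of them at 100 reads or more, with a maximum of 4698.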

**For both lanes**, it'll be useful to identify (as best as possible) the taxonomic information of these potential contaminant OTUs in question. They may be chimeric or index-bleed and this is another way to identify what's going on. It could also be a species that you are sure shouldn't be in there at all, which you can then consider just outright contamination in a reagent somewhere in your prep work. You can do this by looking into the fasta file for each of those OTUs and then manually BLASTing them. For example:

In [None]:
#For L1 data (word boundaries keep OTU74 from also matching e.g. OTU740):
grep -A 3 "\\bOTU74\\b" nauL1.cluster.otus.fa

#for L2 data:
grep -A 3 "\\bOTU22\\b" nauL2.cluster.otus.fa
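
`grep -A 3` assumes the sequence never spans more than three lines. If your fasta is wrapped differently, a small awk filter that prints from the matching header up to the next header is a safer bet. This is a sketch; `fasta_record` is a hypothetical helper, and stripping anything after a `;` in the header is an assumption about how the OTU headers might be annotated.

```shell
# Print one complete fasta record (header plus all sequence lines) by exact ID.
# $1 = record ID without ">", $2 = fasta file.
# 'fasta_record' is a hypothetical helper.
fasta_record() {
    awk -v id="$1" '
        /^>/ {
            name = substr($1, 2)      # header up to the first whitespace, minus ">"
            sub(/;.*$/, "", name)     # drop any ";key=value" annotations (assumed format)
            printing = (name == id)
        }
        printing
    ' "$2"
}
```

So `fasta_record OTU74 nauL1.cluster.otus.fa` returns only that record, ready to paste into BLAST.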

This produces the sequence needed; manually BLASTing these specific OTUs against NCBI's nr database shows the same result for both: it's a common housefly. It's unexpected in this dataset because we wouldn't imagine that bats in South America are dining in huge numbers on North American house flies. However, the L1 dataset was generated with other samples in addition to these guano samples and included some bird guano; in those samples the housefly was in abundance and therefore expected to find its way into the mock community through index bleed. This leads me to the following conclusions:  
- **For Lane 1 and Lane 2** it's probably not a chimeric OTU
- **For Lane 1** it's possibly the result of index bleed given the huge number of reads present in a handful of bird guano samples. 
- **For Lane 1 and Lane 2** it's probable that it's the result of a contaminant in a reagent (could be at any of the levels of DNA extraction, PCR amplification, or PicoGreen quantification), unless there was some way in which reads are being swapped between lanes of a flow cell.

Given that information, it'll be easiest to just drop that OTU in question and then refilter. We're going to want to keep the **--subtract auto** feature on for now for both data sets because there were additional low-level reads popping up in their respective mock community samples.

*AMPtk* has a feature to do just that for weird scenarios like this!

In [None]:
#For L1 data set:
amptk drop \
-i nauL1.cluster.otus.fa \
-r nau-dropdL1.demux.fq \
-l OTU74 \
-o nauL1

#For L2 data set:
amptk drop \
-i nauL2.cluster.otus.fa \
-r nau-dropdL2.demux.fq \
-l OTU22 \
-o nauL2

The above commands will result in a single OTU being dropped from each data set and generate two output files per lane:  
- **nauL{1|2}.cleaned.otu_table.txt** 
- **nauL{1|2}.cleaned.otus.fa**  

Next up is to run another filtering step to ensure that the OTU in question has been removed, and see how the removal of that OTU influences what the **--subtract** value is set to.

In [None]:
#Same command for L1 and L2 data:
amptk filter \
-i nauL{1|2}.cleaned.otu_table.txt \
-f nauL{1|2}.cleaned.otus.fa \
-b mockIM3 \
--keep_mock \
--calculate in \
--mc /leo/devon/projects/guano/mock-fa/CFMR_insect_mock3.fa \
--subtract auto \
--debug \
-o lasttest

In the case of both data sets, we find that dropping that one OTU results in the lowering of the **--subtract auto** value from the very high range to a more modest number of reads per sample:
- For **L1 data** we see a drop from **1250** to **17**
- For **L2 data** we see a drop from **1277** to **48**

While there are still a few additional samples bleeding into the mock community these all occur at a very low abundance. We can therefore apply a final filtering step where we keep only true samples (we drop the mock community and its uniquely associated OTUs), and filter down with a more modest subtraction filter.

In [None]:
#Same command for L1 and L2 data:
amptk filter \
-i nauL{1|2}.cleaned.otu_table.txt \
-f nauL{1|2}.cleaned.otus.fa \
-b mockIM3 \
--calculate in \
--mc /leo/devon/projects/guano/mock-fa/CFMR_insect_mock3.fa \
--subtract auto \
--delimiter tsv \
-o nauL{1|2}

So what do we see? By dropping just that one contaminant OTU from each data set we greatly reduce the value used in the **--subtract auto** option.  

- **For the L1 data set** we go from a subtract value of **1250** to just **17**; this results in retaining **488** OTUs instead of the **136** that were left when we kept that one contaminant OTU. Big difference!
- **For the L2 data set** we ultimately remove quite a few OTUs after filtering - about two-thirds, from **122** down to **43**

Notably for both datasets, 11 OTUs of those 122 are specifically associated with the mock community that is now removed. When you look at the abundance of reads per OTU it's pretty clear that many of these OTUs are very rare (are present only in one or a few of the samples and absent in others), yet often when we see OTUs containing thousands of reads they tend to be present in many of the samples. To be conservative it's certainly justifiable to cut at the rate with which we eliminate index bleed from our mock community and that's what we'll do here.  

The arguments above will produce the following files for each lane:  
- **nauL{1|2}.filtered.otus.fa** is the final filtered fasta file (say that five times fast)
- **nauL{1|2}.final.binary.txt** is the presence/absence OTU table after filtering
- **nauL{1|2}.stats.txt** is the number of OTUs in each sample before and after filtering
- **nauL{1|2}.final.txt** is the OTU table with normalized read counts (normalized to the number of reads in each sample)
- **nauL{1|2}.amptk-filter.log**  

One option is to assign taxonomy by using the **nauL{1|2}.filtered.otus.fa** and **nauL{1|2}.final.binary.txt** files for each run; however, you will get different OTU numbers assigned to each sequence, so you can't just concatenate the two OTU tables. If you do assign taxonomy directly, then you can see what's in a specific sample, but you can't make group comparisons without combining all reads and clustering together. We're going to do that now, but see ***"Part 5: Assigning Taxonomy to OTUs"*** below if you want the code to do that directly first.

# Part 4: Clustering and Filtering the entire project

Because these data were split among two lanes for sequencing, each lane needed to be analyzed independently to check for lane-specific chimeras, contamination, and index-bleed proportions. Fortunately the characteristics of both lanes appear very similar with respect to contamination and index-bleed, therefore we will proceed with the clustering and filtering of the entire project's worth of reads using a single dataset.  

The first step is to combine all reads from Lane1 and Lane2:

In [None]:
cat /leo/devon/projects/guano/nau/p2_data/L1_data/nauallL1.demux.fq /leo/devon/projects/guano/nau/p2_data/L2_data/nauallL2.demux.fq > nauall.demux.fq
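
It's worth confirming that nothing was lost or truncated in the concatenation. A quick sketch, assuming standard four-line FASTQ records with no wrapped sequence lines; `count_reads` is a hypothetical helper.

```shell
# Count the reads in a FASTQ file, assuming exactly four lines per record.
# 'count_reads' is a hypothetical helper.
count_reads() {
    awk 'END { printf "%d\n", NR / 4 }' "$1"
}
```

The count for `nauall.demux.fq` should equal the sum of the counts for `nauallL1.demux.fq` and `nauallL2.demux.fq`.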

Next, drop all samples required (~10,000 reads per sample or more). In this case we're keeping sample '7024' despite it having slightly less than 10,000 reads because it was from Lane2, which contained negative control values with very low read depth indicating that background contamination was low in this batch. However, we are dropping '7020' because it has about as many reads as the next negative control sample, both from Lane1 runs.  

To drop the necessary samples:

In [None]:
amptk select \
-i nauall.demux.fq \
-f samples2keep.txt \
-o nauall-dropd.demux.fq

This retains 10532919 of 10599704 reads (99.4%) after removing 53 true samples and (all) 10 negative controls containing too few reads. 60 true samples remain for further analysis.  

Next up is to cluster the newly concatenated and sorted dataset using the same DADA2 script. This will ensure that all reads from both lanes have OTU identities that match the same sequences.

In [None]:
amptk dada2 \
--fastq nauall-dropd.demux.fq \
--out nauall \
--length 180 \
--platform illumina \
--uchime_ref COI

And we see the following output:

In [None]:
-------------------------------------------------------
[01:56:14 PM]: 10,431,958 reads passed
[02:16:53 PM]: 2,486 total inferred sequences (iSeqs)
[02:16:53 PM]: 1,426 denovo chimeras removed
[02:16:53 PM]: 1,060 valid iSeqs
[02:17:01 PM]: 1,024 iSeqs passed, 36 ref chimeras removed
[02:28:00 PM]: 759 OTUs generated
[02:35:26 PM]: 10,428,657 reads mapped to OTUs (99%)
-------------------------------------------------------

Good. The numbers of reads clustered from the independent runs of L1 (9,266,769) and L2 (1,161,890) add up to almost exactly what we see here (just 2 reads different).  

We're going to apply the same filtering strategy on the entire pooled dataset as we did on the individual Lanes; however, we're going to apply the more conservative approach from the independent runs with regards to the **auto-subtract** value rather than the one indicated by the filtering parameter below. This is because our mock community reads were concatenated from *both* datasets, which effectively inflates the number of mock reads that are present in our analysis - it's double! - relative to the number of reads we see in our samples.  We don't expect the index bleed parameter to change as it is based on the true samples, which are only represented once in this concatenated pool of sequences.  

To ensure that there is just one spurious OTU (the housefly) in the entire dataset in vastly greater read numbers than all other low-level contamination:

In [None]:
amptk filter \
-i nauall.cluster.otu_table.txt \
-f nauall.cluster.otus.fa \
-b mockIM3 \
--keep_mock \
--calculate in \
--mc /leo/devon/projects/guano/mock-fa/CFMR_insect_mock3.fa \
--debug \
--subtract auto \
-o testfilt

Right. As expected, there's a high **index-bleed** of about 1.4%, and a super high **subtract** value of 1255. This is likely because of that one contaminant fly OTU we saw in the independent runs. To confirm that's the case:

In [None]:
sed -i 's/#OTU ID/OTUid/' testfilt.final.txt
sed -i 's/#OTU ID/OTUid/' testfilt.normalized.num.txt

awk -F '\t' '{print NF; exit}' testfilt.final.txt
    #there are 61 fields (the OTU identity field, then the 1 mock community, then our 59 true samples)

#for the .normalized.num.txt file:
cut testfilt.normalized.num.txt -f 1,2,61 | sort -k2,2n | awk '$2 != "0.0" {print $0}'     

So we see that there is just a single OTU rising above all other unwanted OTUs in our mock community (**OTU72** in this case). This was as expected given the independent results from Lanes 1 and 2.  

To confirm this is the same fly DNA, first ensure this OTU isn't particularly high among all other samples:

In [None]:
grep "\\bOTU72\\b" testfilt.normalized.num.txt

No, it's not. There are usually 0.0 reads per sample. However, there are eight total samples where there are at least 100 reads. Note this is normalized relative to 100,000 reads, so it's still not that high in broad context, but it just indicates that this is likely a low-level contaminant we want to remove from the entire dataset.  

Next, find its sequence and then perform a BLAST search to confirm its taxonomy.

In [None]:
grep "\\bOTU72\\b" nauall.cluster.otus.fa -A 3

Yep, same fly. We can therefore proceed to drop that single OTU from the entire dataset.

In [None]:
amptk drop \
-i nauall.cluster.otus.fa \
-r nauall-dropd.demux.fq \
-l OTU72 \
-o nauall

And now retest to check the **auto-subtract** value and **index-bleed** value without that single OTU:

In [None]:
amptk filter \
-i nauall.cleaned.otu_table.txt \
-f nauall.cleaned.otus.fa \
-b mockIM3 \
--calculate in \
--mc /leo/devon/projects/guano/mock-fa/CFMR_insect_mock3.fa \
--subtract auto \
--debug \
-o lasttest

Great - the **index-bleed** value drops to a very low value of ~ **0.1%**, and the **auto-subtract** value has also dropped substantially from **1255** to just **17**. We're going to proceed with the more conservative **subtract** value of **48** calculated in the independent Lane 2 samples. This will result in losing some OTUs, but it should also ensure that we're only keeping reads that we're highly confident are real.  

We can now apply a final filtering step, removing the mock community from our dataset.

In [None]:
amptk filter \
-i nauall.cleaned.otu_table.txt \
-f nauall.cleaned.otus.fa \
-b mockIM3 \
--calculate in \
--mc /leo/devon/projects/guano/mock-fa/CFMR_insect_mock3.fa \
--subtract 48 \
-o nauall

This produces an OTU table and fasta file that are used to assign taxonomy next. Notably, we have a final dataset consisting of **389 OTUs**.

## Part 5: Assigning Taxonomy to OTUs

Make sure to have already downloaded the necessary database (COI). If you haven't, just type:

In [None]:
amptk install -i COI

Note that the classification of each OTU can be executed using three potential programs: USEARCH using our database acquired through BOLD; UTAX (trained through BOLD); and SINTAX. I am ultimately only going to present the OTU calls from the BOLD database in this project, but I will create a taxonomy profile using all three. The BOLD-only files are filtered out using an R script later in this workflow.

In [None]:
#run on the pooled dataset; for a single lane, swap in that lane's input and output names:
amptk taxonomy \
-i nauall.final.binary.txt \
-f nauall.filtered.otus.fa \
-d COI \
-o nauall

## What do we find?

Some files of immediate interest:
- **nauall.otu_table.taxonomy.txt**: the file containing the binary presence (1) or absence (0) matrix with OTUs as rows and samples and taxonomy information as columns. This file is used in the subsequent R script for generating plots and tables.  
- **nauall.otus.taxonomy.fa**: the fasta file of all OTUs classified in the OTU table

Not necessary on the front end but also of potential interest:
- **nauall.taxonomy.txt**: a five-field file containing the OTU number and the taxonomic information for each classifier used. Could be useful in comparing these relative to a subsequent BLAST search too.
- **nauall.otu_table.txt**: another matrix with samples as columns but no taxonomy info. rows are not OTUs, rather, the individual iSeq reads pre-97% clustering. Could be helpful if you want to conduct a blast search of all unique sequences prior to the minimum number of read filtering step. Could help identify bat reads, for example, which may be much lower than the **48** value used in the subtract filtering stage.

See the R script **'OTU_analysis.R'** for subsequent analysis of taxa present in these samples.

# Additional items: appending metadata to otu_table.taxonomy.txt

Created a metadata sheet in Excel and transferred it into the working directory. File was named as **meta_p{x}l{y}** (ex. for Project 2, Lane 1 samples it's labeled *meta_p2l1*) containing the following fields:
- #OTU ID
- LocationName
- CollectionDate
- BatID
- BatSpecies
- Sex

Then ran the following code:

In [None]:
amptk meta \
-i nauall.otu_table.taxonomy.txt \
-m meta_p2l1.csv \
-o naupivot.csv