# Sample background 

Individual guano samples were processed in two separate sequencing runs. See the Jupyter notebooks titled "bri-P1-notes" and "bri-P2-notes" for initial processing details. This notebook begins by concatenating the fastq files from the two completed sequencing runs, removes the reads belonging to any samples we want to discard, and then clusters all remaining reads without the mock community. Index-bleed filters and auto-subtract values are assigned to this common pool of clustered reads, using the more conservative value where applicable. See below for details.

## Other information

- Running core programs:
    - AMPtk 0.8.6
    - vsearch v2.3.4_linux_x86_64
    - usearch v9.2.64_i86linux32
    - python 2.7.6
- Installed the necessary (COI) databases

## Final cleanups before OTU clustering

#### Concatenating
The samples from Project1 and Project2 were combined by concatenating the two files:

In [None]:
#to concatenate two {}.demux.fq files
cat \
/leo/devon/projects/guano/bri2016/p1_data/bri-chunk1/cleaned.demux.fq \
/leo/devon/projects/guano/bri2016/p2_data/bridropd.demux.fq > prebriall.demux.fq
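
A quick sanity check after concatenating is to confirm that the read count of the combined file equals the sum of the two inputs. The sketch below assumes strict four-line FASTQ records (no wrapped sequence lines) and uses tiny made-up stand-in files; the `p1`, `p2`, and `all` file names are placeholders, not the real paths.

```bash
# Count reads in a FASTQ file, assuming exactly four lines per record.
count_reads() {
    awk 'END { print NR / 4 }' "$1"
}

# Tiny stand-in files (hypothetical; substitute the real demux.fq paths).
tmp=$(mktemp -d)
printf '@r1\nACGT\n+\nIIII\n' > "$tmp/p1.demux.fq"
printf '@r2\nTTGA\n+\nIIII\n@r3\nGGCC\n+\nIIII\n' > "$tmp/p2.demux.fq"
cat "$tmp/p1.demux.fq" "$tmp/p2.demux.fq" > "$tmp/all.demux.fq"

n1=$(count_reads "$tmp/p1.demux.fq")
n2=$(count_reads "$tmp/p2.demux.fq")
nall=$(count_reads "$tmp/all.demux.fq")

# The combined file should hold exactly the reads of both inputs.
if [ "$nall" -eq $((n1 + n2)) ]; then
    echo "OK: $nall reads in the combined file"
else
    echo "Mismatch: $n1 + $n2 != $nall" >&2
fi
rm -rf "$tmp"
```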

#### Sample read removal
When the individual sequencing runs were first analyzed, samples in Project1 were kept if they had more than 5,000 reads, while samples in Project2 were kept only if they had 10,000 reads (for various reasons including NTC read depth, mock community contamination, etc.). I wanted to be as uniform as possible when combining the two datasets, so I took the more conservative approach and required 10,000 reads for any sample in this project. As a result, 28 additional samples from Project1 were dropped, as listed in the file **morefiles2drop.txt**. 
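
A drop list like **morefiles2drop.txt** could be generated from a table of per-sample read counts. This is only a sketch: the two-column counts table and the sample names are hypothetical, and it assumes the drop list is simply one sample name per line.

```bash
# Hypothetical per-sample read counts: sample name, tab, read count.
counts=$(mktemp)
printf 'bri-001\t4200\nbri-002\t15873\nbri-003\t9999\nbri-004\t10000\n' > "$counts"

min_reads=10000   # unified threshold described above

# Print the names of samples that fall below the threshold, one per line.
droplist=$(awk -F '\t' -v min="$min_reads" '$2 < min { print $1 }' "$counts")
echo "$droplist"
rm -f "$counts"
```

Redirecting the output to a file would give a list that can be passed to `amptk remove -f`.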

In [None]:
amptk remove -i prebriall.demux.fq -f morefiles2drop.txt -o briall.demux.fq

The combined dataset from both projects contains **210 samples**. We are highly confident that each of these samples contains enough reads for robust community analyses, while keeping type-1 error in the forthcoming OTU table as low as possible.

## Part 2: OTU clustering

In [None]:
#to cluster the combined (P1 + P2) data:
amptk dada2 \
--fastq briall.demux.fq \
--out briall \
--length 180 \
--platform illumina \
--uchime_ref COI

Here's the output using the default settings with DADA2:

Next up is to filter the OTU table with the fasta file listed above:

## Part 3: Filtering OTU table

Note that there is something a bit unique about this analysis: the two independent sequencing runs (P1 and P2) each had their own mock community. Normally I'd take the more conservative estimate from either run and apply it here, but some subtleties required keeping the mock community in this dataset. The mock communities are analyzed now to answer a few questions:  

1. What's the **index-bleed** threshold we want to apply? 
    Earlier work from the independent runs suggested somewhere between 0.1 - 0.3%

2. What's the **subtract** threshold we want to apply?
    Independent sequencing runs reported values of 47 and 86; thus the more conservative measure would be 86. However, this higher value (from the Project 1 sequencing run) allowed three unintended OTUs to remain in the dataset at low abundance; it's unclear whether these result from index-bleed from actual *BRI samples*, from another sample placed on the same P1 lane (for example, O'Rourke samples, or others), or from low-level contamination incorporated at the PCR step earlier in the workflow.  

3. How many OTUs will be dropped if necessary to balance the **subtract** value with overall OTU retention?  

To address these questions we'll run preliminary filtering steps and assess their output, as described below.

In [None]:
amptk filter \
-i briall.cluster.otu_table.txt \
-f briall.cluster.otus.fa \
-b mockIM3 \
--keep_mock \
--calculate in \
--mc /leo/devon/projects/guano/mock-fa/CFMR_insect_mock3.fa \
--debug \
--subtract auto \
-o testfilt

And the relevant output:

In other words, we have a fairly high index bleed (but not awful), and a HUGE **subtract** value. Based on earlier work on each sequencing run (P1 and P2), this is likely due to a handful of samples. To identify which samples contain the highest number of reads, see the basic notes outlined in "bri-P2-notes". Bioinformatic steps and subsequent results are briefly explained below.  

First, identifying the number of contaminant OTUs in our combined mock community:

In [None]:
sed -i 's/#OTU ID/OTUid/' testfilt.final.txt
sed -i 's/#OTU ID/OTUid/' testfilt.normalized.num.txt
awk -F '\t' '{print NF; exit}' testfilt.final.txt
    #212

cut testfilt.normalized.num.txt -f 1,211 | sort -k2,2n | awk '$2 != "0.0" {print $0}'
    #OTU249 is the culprit

The above script produces a long list (see below) of all OTUs present in our mock community. Nearly all of the unwanted OTUs are in very low abundance, while, importantly, all the OTUs that should be in our mock community are there in relatively high numbers. Note, however, that one OTU (OTU249) contains a much higher number of reads than any other unwanted OTU. This is in one sense reassuring, as it implies that the high **subtract** value found in our first-pass filtering step is due to just a single OTU, and by dropping just that one OTU we should be able to retain far more reads/OTUs when our final filter is applied.  

- ...*(more values in list containing OTUs with fewer than 36.0 reads)*...
- OTU497	36.0
- OTU151	41.0
- OTU21	48.0
- MockIM34_pident=100.0_OTU349	682.0
- **OTU249	977.0**
- MockIM49_pident=100.0_OTU207	3960.0
- MockIM6_pident=100.0_OTU188	4976.0
- MockIM15_pident=99.4_OTU161	7036.0  
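
The column index (211) used above was found by hand. As an alternative sketch, the mock column can be looked up by its header name. This assumes the mock sample's column header is exactly `mockIM3` (the name passed to `-b`); the small table and its sample names are invented for illustration.

```bash
# Invented table standing in for testfilt.normalized.num.txt.
table=$(mktemp)
printf 'OTUid\tsampleA\tmockIM3\tsampleB\n'  > "$table"
printf 'OTU1\t12.0\t0.0\t3.0\n'            >> "$table"
printf 'OTU249\t5.0\t977.0\t0.0\n'         >> "$table"
printf 'OTU21\t0.0\t48.0\t7.0\n'           >> "$table"

tab=$(printf '\t')

# Print "OTUid <tab> count" for every OTU with a non-zero count in the
# named column, sorted from lowest to highest count.
mock_counts() {
    awk -F '\t' -v name="$2" '
        NR == 1 { for (i = 1; i <= NF; i++) if ($i == name) col = i; next }
        col && $col + 0 > 0 { print $1 "\t" $col }
    ' "$1" | sort -t "$tab" -k2,2n
}

mock_hits=$(mock_counts "$table" mockIM3)
echo "$mock_hits"
rm -f "$table"
```

Looking the column up by name avoids silently reading the wrong sample if the column order changes between runs.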

The next question is whether or not **OTU249** persists among many samples in addition to our mock community:

In [None]:
grep "\\bOTU249\\b" testfilt.normalized.num.txt

The result indicates that OTU249 is highest in our mock community (recall its value was **977** reads); only 14 other samples have more than 100 reads (most are 0.0). The average was ~31 reads, with a deviation of about 119. See **otu249counts.txt** for a table of these counts. I further investigated whether any particular factor (index, PCR or DNA plate/well, etc.) was associated with contamination; there doesn't appear to be one among the most contaminated samples. See **otu249contamtable.txt** for these results.  
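
Summary numbers like these can be computed directly from the grep output. The sketch below is one way to do it, using an invented row of counts (not the real OTU249 values) and assuming the first field is the OTU id and every remaining tab-separated field is one sample's normalized count. It reports the sample standard deviation, which may or may not be the deviation quoted above.

```bash
# Hypothetical row: OTU id followed by normalized counts per sample.
row=$(printf 'OTU249\t0.0\t0.0\t150.0\t0.0\t977.0\t3.0')

summary=$(printf '%s\n' "$row" | awk -F '\t' '
{
    n = NF - 1                       # number of samples in the row
    for (i = 2; i <= NF; i++) {
        sum   += $i
        sumsq += $i * $i
        if ($i > 100) above++        # samples with more than 100 reads
    }
    mean = sum / n
    sd = sqrt((sumsq - n * mean * mean) / (n - 1))   # sample standard deviation
    printf "n=%d mean=%.1f sd=%.1f above100=%d\n", n, mean, sd, above
}')
echo "$summary"
```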

To check whether this OTU is the same contaminant observed in the individual analyses performed earlier, a BLAST search was performed on its sequence: 

In [None]:
grep "\\bOTU249\\b" briall.cluster.otus.fa -A 3

Alignment results indicate it's *Fannia canicularis*, the lesser house fly, which matches what was observed in the earlier independent analyses of the P1 data set. It's unclear whether the contamination came from index bleed from other samples not part of this dataset (e.g. other bird sequencing projects), or from primer contamination. Regardless, dropping this one OTU is likely of little consequence to broader interpretation, as it is not particularly common in our dataset.  

To remove OTU249:

In [None]:
amptk drop \
-i briall.cluster.otus.fa \
-r briall.demux.fq \
-l OTU249 \
-o bri

The above steps likely address the problem of OTU contamination in our mock community. However, there is a single negative control in this dataset that should be dropped: sample "**negbri-41C03**". This sample wasn't removed initially because it's useful to see how many OTUs are present, and to what degree (in terms of relative/normalized reads). In certain cases where negative controls show an appreciable number of reads amidst true samples, we find that a single OTU or two should be dropped from all samples (like above). However, this sample doesn't appear to be a typical cross-contamination event; rather, I think this sample represents one of two scenarios:  
- (a) A guano sample placed in an adjacent well split and fell into this negative control well, and thus represents a pseudoreplicate.
- (b) A guano sample was misplaced directly into the negative control well, and thus represents a true sample that is simply mislabeled.  

A quick analysis of this sample suggests the following:  
1. There is a single OTU that contains the majority of the reads associated with this sample: OTU2 (see file **negconcounts.txt** for all OTU counts associated with this sample). This OTU exists in just a handful of all true samples, with only 7 samples containing >100 reads (though 5 samples contain >10,000 reads). This OTU best aligns to one of two *crane flies*, which are fairly large and may take up a significant proportion of a single guano sample.
2. There are just 5 other OTUs that contain >100 reads:
    - OTU48, OTU154, and OTU34 follow a pattern similar to the one above: only a few samples contain many reads associated with this particular OTU, and there's a huge variance in total number of reads. 
    - OTU48 is another crane fly; OTU154 is a caddisfly; OTU34 is a northeastern moth; OTU150 is a leafroller moth. In other words, this sample contains insects we'd expect in a piece of bat guano (so you can rule out contamination from other guano, say bird guano, from other projects).
3. I created another file, **contamprobs.txt** that looks at these top 6 OTUs for all samples. It doesn't fit any single other sample perfectly, but it matches to loads of other samples qualitatively. In other words, plenty of samples contain some number of reads for these OTUs in question, but no single sample has at least 100 reads for every one of these OTUs.  

Observation #3 above strongly suggests this is a true sample, though it's unclear whether it's a case of a guano sample splitting into the wrong well (as in '**a**') or a true sample that was mislabeled (as in '**b**'). Dropping the OTUs associated with this sample is likely unnecessary. Rather, dropping this sample from further analysis makes more sense, as we'd like to retain these OTUs that are present in the data.  

This will be done in R after taxonomic classification. If this information had been available a priori, we could have dropped the sample earlier with the ***amptk remove*** step.

We'll re-run our filtering program this time without that contaminant OTU:

In [None]:
amptk filter \
-i bri.cleaned.otu_table.txt \
-f bri.cleaned.otus.fa \
-b mockIM3 \
--keep_mock \
--calculate in \
--mc /leo/devon/projects/guano/mock-fa/CFMR_insect_mock3.fa \
--debug \
--subtract auto \
-o testfilt2

The results are as follows:

Results indicate that our **index-bleed** can be set at **0.05%** (a typical value) and the auto-subtract value can be set to **48**. Note that this is a less conservative subtract value than one of the two runs reported (P2 was **47** while P1 was **86**). However, the fact that we summed up *three* mock communities is potentially artificially inflating our overall counts for any OTU associated with a mock community (despite the fact that we're normalizing reads when making these calculations).

Notably, dropping OTU249 increased the number of OTUs retained relative to our initial (default) filtering. Removing that single OTU reduced the **subtract** value by about 900, which more than doubled the number of OTUs retained (**984 OTUs** rather than the earlier **406**).  
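
To see why a smaller subtract value retains more OTUs, here is a toy model. It is only an illustration and not AMPtk's actual implementation: it assumes the threshold is subtracted from every cell of the table, that negative results count as zero, and that an OTU is retained if any sample still has reads. The table and the `otus_retained` helper are invented for this example.

```bash
# Invented table: OTU id, then normalized counts for three hypothetical samples.
otus=$(mktemp)
printf 'OTUa\t1200\t30\t0\n'  > "$otus"
printf 'OTUb\t60\t500\t45\n' >> "$otus"
printf 'OTUc\t40\t0\t12\n'   >> "$otus"

# Count OTUs that still have at least one read after subtracting a
# threshold from every cell (results at or below zero count as no reads).
otus_retained() {
    awk -F '\t' -v sub_val="$2" '
        { for (i = 2; i <= NF; i++) if ($i - sub_val > 0) { kept++; break } }
        END { print kept + 0 }
    ' "$1"
}

kept_low=$(otus_retained "$otus" 48)
kept_high=$(otus_retained "$otus" 977)
echo "subtract=48:  $kept_low OTUs retained"
echo "subtract=977: $kept_high OTUs retained"
rm -f "$otus"
```

In this toy table only the most abundant OTU survives the larger threshold, which mirrors the direction of the change reported above.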

A final filtering step is applied whereby the mock community reads are removed leaving us with just true samples (and that unknown sample masquerading as a negative control sample).

In [None]:
amptk filter \
-i bri.cleaned.otu_table.txt \
-f bri.cleaned.otus.fa \
-b mockIM3 \
--calculate in \
--mc /leo/devon/projects/guano/mock-fa/CFMR_insect_mock3.fa \
--subtract auto \
-o bri

The arguments above will produce the following files:  
- **bri.filtered.otus.fa** is the final filtered fasta file (say that five times fast)
- **bri.final.binary.txt** is the presence/absence OTU table after filtering
- **bri.stats.txt** is the number of OTUs in each sample before and after filtering
- **bri.final.txt** is the OTU table with normalized read counts (normalized to the number of reads in each sample)
- **bri.amptk-filter.log**  

The next step is to assign taxonomy by using the **bri.filtered.otus.fa** and **bri.final.binary.txt** files. 

## Part 4: Assigning Taxonomy to OTUs

Make sure to have already downloaded the necessary database (COI). Need to run this command for every instance in which you've generated a uniquely filtered dataset from Parts 1-3. Here's just a pair of examples you'd run in their respective directories:

The OTU classification is performed with UTAX and SINTAX in addition to UPARSE; all rely to some degree on the custom database acquired through BOLD (see J. Palmer for details about its creation).  

## What do we find?

See the R script 'ufits_2016_OTUtable_analysis.R' for generating the following observations.  

Move the files as follows:

# BLASTing some of those unknowns 

We can use the following scripts to run NCBI BLAST. This will happen in three steps:

- A. Clean up the file to pull out just the UTAX-tagged OTUs and convert into a single-line fasta
- B. Build a blast database (*note: we're not going to use this step at the moment*)
- C. BLAST the file

**A.** First, use the fastx toolkit to convert '{name}.otus.taxonomy.fa' into a single-line fasta

Then grep only the lines with "UTAX" or "SINTAX" in the header, and remove the dashed separator lines left where non-matching lines were removed
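
As a rough sketch of step A, both parts can be done in a single awk pass, which is a swapped-in alternative to the fastx toolkit plus grep and leaves no separator lines to clean up. The header layout in the example FASTA is hypothetical; the only assumption that matters is that the string "UTAX" or "SINTAX" appears somewhere in the header line.

```bash
# Hypothetical multi-line FASTA standing in for '{name}.otus.taxonomy.fa'.
fa=$(mktemp)
cat > "$fa" <<'EOF'
>OTU1;tax=UTAX;k:Animalia,p:Arthropoda
ACGTAC
GTTA
>OTU2;tax=GS|100.0|ABC123;k:Animalia
GGGCCC
>OTU3;tax=SINTAX;k:Animalia
TTTT
AAAA
EOF

# Join wrapped sequence lines and keep only records whose header
# mentions UTAX or SINTAX.
utax_sintax_oneline() {
    awk '
        /^>/ { if (keep) print hdr "\n" seq
               hdr = $0; seq = ""; keep = ($0 ~ /UTAX|SINTAX/); next }
        { seq = seq $0 }
        END  { if (keep) print hdr "\n" seq }
    ' "$1"
}

oneline=$(utax_sintax_oneline "$fa")
echo "$oneline"
rm -f "$fa"
```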

**B.** We're going to skip the step of setting up a BLAST database for the moment - there's already the complete 'nr' database installed on Pinky, so unless we want to set up a custom database with specific sequences we don't have available through NCBI, then I wouldn't worry about this step. If you did want to do it, you'd run the following command:

**C.** Finally, we're going to run a BLAST search on the UTAX and SINTAX sequences using NCBI's nr database. We're going to specify a few other flags described in the code below. Specifically, we're going to keep only hits above a certain alignment length and a certain percent identity, and then filter these to choose the best possible taxonomic description for those collapsed reads.  

See [here][link1] for the BLAST manual with the command line options used below.

[link1]: http://www.ncbi.nlm.nih.gov/books/NBK279675/

To get the taxonomy information from these taxids we're going to use an R package called [taxize][link2] to convert the *staxids* value into taxonomic information (kingdom, phylum,...species).  The only information we need to provide to **taxize** is a list of the taxids.  

However, we're going to sort through our blast output a little bit first.

[link2]: https://github.com/ropensci/taxize

#1. Because I had included a BLAST result column with taxonomy names that didn't work and printed "N/A", I removed it and created tab-separated fields as follows:

#2. I then removed redundant rows that contained identical OTU ids, bit scores, and taxIDs. I also removed any rows in which the alignment length of a match was less than 150 base pairs:

#3. I then filtered out any redundant rows with a common OTUid and TAXid, then sorted by bit score (highest first).
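
The filtering in steps #2 and #3 can be sketched as a short pipeline. The column layout here (OTU id, taxid, alignment length, percent identity, bit score) is an assumption made for illustration, and the rows and taxids are made up; adjust the field numbers to match the real BLAST output.

```bash
# Made-up BLAST rows; assumed columns:
# OTUid  taxid  alignment_length  percent_identity  bitscore
blast=$(mktemp)
printf 'OTU5\t1111\t180\t99.4\t320\n'  > "$blast"
printf 'OTU5\t1111\t180\t98.9\t310\n' >> "$blast"
printf 'OTU5\t2222\t120\t97.0\t200\n' >> "$blast"
printf 'OTU9\t3333\t175\t96.0\t290\n' >> "$blast"
printf 'OTU9\t3333\t175\t96.0\t290\n' >> "$blast"

tab=$(printf '\t')

filter_blast() {
    # 1) drop alignments shorter than 150 bp
    # 2) sort by bit score, highest first
    # 3) keep only the first (best-scoring) row per OTUid/taxid pair
    awk -F '\t' '$3 >= 150' "$1" |
        sort -t "$tab" -k5,5nr |
        awk -F '\t' '!seen[$1 FS $2]++'
}

filtered=$(filter_blast "$blast")
echo "$filtered"
rm -f "$blast"
```

Because the rows are sorted before de-duplicating, the row kept for each OTUid/taxid pair is the one with the highest bit score.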

This results in filtering down from the initial **2295 rows** to **903 rows**. We'll use these results to investigate whether or not the taxonomy information assigned by UTAX/SINTAX could be updated or not.  

The next step is to import these TAXid values into R and use a package called **taxize**.  For ease of use, I'm going to keep all the remaining information in this final file as a single dataframe, then pull that specific row in R out to make the necessary taxonomic classifications.

In [None]:
## taxize R work



In [None]:
library(taxize)



You can compare what you're trying to clean up with what's already been assigned by the original SINTAX or UTAX classification in the .fa file from the UFITS taxonomy output. You're going to need to get it into a format that R won't hate you for though:

Check out this link for details about why we're filtering how we are: https://edwards.sdsu.edu/research/percent-similarity-at-different-taxonomic-levels/

# Round 2 - redoing the analyses with the updated DB

See the extensive notes above for explanations about how and why each of the following commands are executed. 

## Cluster again

Start at Part 2 - clustering with DADA2

Output differs slightly from before - more chimeras removed; otherwise pretty much identical output.

## Filter again

Determine if the OTUs in the mock look similar and if similar **index-bleed** and **subtract** values are warranted.

Output looks identical to previous run...

Evaluate these files:

where are the problems?

shows the exact same number of normalized counts as the earlier data set. The third OTU listed here was further analyzed with an NCBI BLAST search from the **cluster.otus.fa** file. This confirms that OTU55 from our first clustering in the previous data set is the same sequence as OTU54 here.  

In other words, renaming these new sequences to our database didn't change anything about the mock community (as expected!).

We're going to apply the exact same filtering parameters as before, given our mock community looks identical:

And we get identical outputs as before:

Then assign taxonomy.