## What to consider

There is a tradeoff between *Quantity*, *Quality*, and the *number of retained reads* that will remain after quality control. This program
will attempt to balance these three variables and provide the user with potential combinations of DADA2 parameters.

The DADA2 parameters we focus on here are **trimming** and **maxEE**.

### How to do this:

1. Perform "obvious" trimming
    + Iterate along the average quality scores and trim the ends that fall below a threshold (default is 30)

2. Examining the relationship between trim values/read length and quality
    + Understanding the effect that different trim values have on the overall quality of the reads
    + The plot generated provides empirical support for the parameter choice

3. Examining the relationship between trim values and retained reads
    + maxEE is a DADA2 parameter: a read is discarded if its total expected error (the sum of the expected errors at each position) is higher than this threshold (default = 2.0). This automatically penalizes longer reads.
    + Here we show the effect that maxEE will have on reads that have not been trimmed and on reads that have only been trimmed using the "obvious" trimming values.
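
For reference, the expected error that maxEE is compared against follows from the standard Phred relationship between a quality score and an error probability. The notation below is introduced here only for illustration: $Q_i$ is the quality score at position $i$ and $L$ is the read length.

$$
\mathrm{EE} = \sum_{i=1}^{L} 10^{-Q_i/10}, \qquad \text{discard the read if } \mathrm{EE} > \mathrm{maxEE}
$$

Every position contributes a non-negative term, so EE can only grow as $L$ grows, which is why longer reads are penalized.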


## Performing "Obvious" Trimming

Current industry practice is to inspect a barplot of average quality scores and pick trim values where the score drops
noticeably at either end of the read. Here we start with the same concept in order to reduce the
search space.

The trim values (the indices at which trimming is done) are determined by the first position, moving outward from the
center index, at which the average quality score falls below a threshold. The search starts at the center because the
highest quality scores tend to be in the middle of the reads.

It is also important to note that scores, as output in FASTQ files, are on a scale from 0 to 42 as per Phred quality
score standards.
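
The average scores that the trimming step iterates over first have to be decoded from the FASTQ quality lines. The following is a minimal sketch of that decoding, assuming the common Phred+33 ASCII encoding. The function names `phred_scores` and `average_scores` and the toy quality lines are hypothetical and not taken from this program.

```python
def phred_scores(quality_line, offset=33):
    """Decode one FASTQ quality line into integer Phred scores.

    Assumes Phred+33 encoding; pass a different offset for other encodings.
    """
    return [ord(ch) - offset for ch in quality_line]


def average_scores(quality_lines, offset=33):
    """Average the Phred score at each position across reads.

    Reads may differ in length; each position is averaged over the
    reads that are long enough to cover it.
    """
    totals, counts = [], []
    for line in quality_lines:
        for i, q in enumerate(phred_scores(line, offset)):
            if i == len(totals):
                totals.append(0)
                counts.append(0)
            totals[i] += q
            counts[i] += 1
    return [t / c for t, c in zip(totals, counts)]


# Toy quality lines (made up for illustration): '5' = 20, '?' = 30, 'I' = 40
example = ["5?I?5", "5?I?"]
print(average_scores(example))  # [20.0, 30.0, 40.0, 30.0, 20.0]
```

The resulting list of per-position averages has the same shape as the list of scores used in the trimming walkthrough.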

Here is a step-by-step process of how this form of trimming is performed.

In [1]:
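# Step 1: set threshold value
threshold = 30

# Step 2: middle index is determined
avg_scores = [15, 18, 18, 30, 30, 40, 30, 30, 19, 17, 15]

mid_index = len(avg_scores) // 2

# Step 3: travel from center to left and find the first average score below threshold
trim_left_index = 0  # default: no left trimming if no score falls below threshold
current_index = mid_index
while current_index >= 0:
    # if value at current index is below threshold
    if avg_scores[current_index] < threshold:
        # the left trim site is the position just to the right of the low score
        trim_left_index = current_index + 1
        break
    current_index -= 1

# Step 4: similar to step 3, travel from center to right to find the right index value
trim_right_index = len(avg_scores)  # default: no right trimming
current_index = mid_index
while current_index < len(avg_scores):
    if avg_scores[current_index] < threshold:
        # the right trim site is the first position with a low score
        trim_right_index = current_index
        break
    current_index += 1

# Step 5: tuple containing left and right trim sites is produced. For the list in this example, the value
# would be (3, 8)
trim_sites = (trim_left_index, trim_right_index)
print(trim_sites)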

## Read Length vs Quality

A tsv is generated using **GetTrimParameters.py**, in which different index combinations are chosen and the average expected error per position along the read is calculated for each.

The average expected error per position was first calculated for each read using the FastqEntry function **get_expected_error()**. For each combination of trim parameters, these per-read averages were then averaged to give an overall average expected error per position, and a heat map was generated using R. In addition, a plot with read length on the x axis and average expected error per position on the y axis was created to juxtapose these two variables and further inform the user of the potential effect that different trim, truncation, and read-length choices would have on the overall quality of the reads and samples.

Note: it is important to understand that multiple combinations of index pairs give the same read length. For example, indexing a read from 0 to 10 results in a read of the same length as indexing it from 1 to 11.
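
The calculation described in this section can be sketched for a single read as follows. This is not the code in **GetTrimParameters.py** or the actual **get_expected_error()** implementation: the helper names, the end-exclusive `(left, right)` convention, and the example scores are assumptions made for illustration.

```python
from collections import defaultdict


def error_probabilities(scores):
    """Convert Phred scores to per-position error probabilities."""
    return [10 ** (-q / 10) for q in scores]


def avg_error_per_position(scores, left, right):
    """Mean error probability over the kept slice scores[left:right]."""
    kept = error_probabilities(scores[left:right])
    return sum(kept) / len(kept)


def by_read_length(scores, min_length=1):
    """Group every (left, right) index pair by the read length it yields."""
    groups = defaultdict(list)
    n = len(scores)
    for left in range(n):
        for right in range(left + min_length, n + 1):
            groups[right - left].append(
                (left, right, avg_error_per_position(scores, left, right))
            )
    return groups


scores = [20, 30, 40, 30, 20]  # made-up Phred scores for one read
groups = by_read_length(scores)
# Three different index pairs all yield a read of length 3
for left, right, err in groups[3]:
    print(left, right, round(err, 4))
```

In this toy read, the pair `(1, 4)` gives the lowest average error among the length-3 options because it drops both low-quality ends. This is why index pairs with the same read length still need to be compared individually.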

## Read Length vs Number of Retained Reads
Overall, it is impossible to calculate exactly how many reads will be discarded during quality control. However, you can see the effect that different trim values might have on the number of reads whose expected error falls above the maximum expected error threshold (the DADA2 maxEE parameter).

Here we show how many reads will be discarded should no trimming occur and how many reads will be discarded should only obvious trimming occur. This was done using a histogram in R. 
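
The comparison can be sketched in Python as below, counting how many reads exceed the maxEE threshold with and without trimming. The function names, the trim sites, and the reads are made up for illustration. This is neither the program's own code nor DADA2's implementation.

```python
def expected_error(scores):
    """Sum of per-position error probabilities for one read."""
    return sum(10 ** (-q / 10) for q in scores)


def count_discarded(reads, max_ee=2.0, left=0, right=None):
    """Count reads whose expected error exceeds max_ee after slicing [left:right]."""
    return sum(1 for scores in reads if expected_error(scores[left:right]) > max_ee)


# Made-up Phred scores for three reads of length 7
reads = [
    [2, 2, 30, 40, 30, 2, 2],        # poor ends, good middle
    [30, 30, 40, 40, 40, 30, 30],    # good throughout
    [1, 1, 1, 1, 1, 1, 1],           # poor throughout
]

print(count_discarded(reads))                   # no trimming: 2
print(count_discarded(reads, left=2, right=5))  # trimmed to [2:5]: 1
```

Trimming rescues the read with poor ends, while the read that is poor throughout is discarded either way.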