### Diatom Analysis Pipeline - OTU

#### A notebook for the diatom analysis pipeline using OTU classification.

### 1. Create BLAST reference database

In [1]:
%%time
!makeblastdb -in diatoms.sequences.FINAL2017.fasta -out diatoms -dbtype nucl



Building a new DB, current time: 06/25/2019 13:29:11
New DB name:   /code/diatoms
New DB title:  diatoms.sequences.FINAL2017.fasta
Sequence type: Nucleotide
Keep Linkouts: T
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 2701 sequences in 0.189666 seconds.
CPU times: user 20 ms, sys: 0 ns, total: 20 ms
Wall time: 1.04 s
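
The record count reported by `makeblastdb` (2701 here) can be cross-checked against the FASTA file itself. The snippet below is a minimal sketch of such a check, not part of the pipeline; `count_fasta_records` is a hypothetical helper, and the in-memory example stands in for the real file.

```python
def count_fasta_records(lines):
    """Count FASTA records by counting header lines (those starting with '>')."""
    return sum(1 for line in lines if line.startswith(">"))


# Tiny in-memory example. For the real reference file, pass an open handle:
#   with open("diatoms.sequences.FINAL2017.fasta") as handle:
#       print(count_fasta_records(handle))
example = [">seq1", "ACGT", "ACGT", ">seq2", "GGCC"]
n_records = count_fasta_records(example)
print(n_records)
```

If the number printed for the real file differs from the `added ... sequences` line, the database was probably built from a different or truncated file.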


### 2. Quality Control
* *Cutadapt:* Trim primers from sequences
* *SicklePE:* Trim off bad quality 3' bases
* *Pear:* Merge R1 and R2 reads
* *SickleSE:* Remove post-merging bad quality sequences
* *Histogram Generation:* Plots sequence length against number of sequences
* *QIIME Prep:* Prepare the files that passed QC for QIIME processing
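
To illustrate the histogram step in the list above, here is a minimal sketch of how read lengths might be tallied from a FASTQ file. It is an assumption about what `ampliconQC.py` does rather than its actual code, and `length_histogram` is a hypothetical helper.

```python
from collections import Counter


def length_histogram(fastq_lines):
    """Map read length -> number of reads, assuming 4-line FASTQ records."""
    histogram = Counter()
    for index, line in enumerate(fastq_lines):
        if index % 4 == 1:  # the second line of each record holds the bases
            histogram[len(line.strip())] += 1
    return histogram


# Three made-up reads: two of length 4 and one of length 6.
example = [
    "@r1", "ACGT", "+", "IIII",
    "@r2", "ACGTAC", "+", "IIIIII",
    "@r3", "TTGA", "+", "IIII",
]
histogram = length_histogram(example)
for length in sorted(histogram):
    print(length, histogram[length])
```

Plotting the sorted lengths against their counts gives the "sequence length against number of sequences" view described above.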

In [None]:
%%time
!python ./ampliconQC.py --data sequences --forward ATGCGTTGGAGAGARCGTTTC --reverse GATCACCTTCTAATTTACCWACAACTG --threads 8 --histograms --qiime

This is cutadapt 1.9.1 with Python 2.7.16
Command line parameters: -e 0.047619047619 -a ATGCGTTGGAGAGARCGTTTC -o /code/sequences/2426.R1.fastq.gz.trimmed.fastq.gz /code/sequences/2426.R1.fastq.gz
Trimming 1 adapter with at most 4.8% errors in single-end mode ...
Finished in 1.80 s (35 us/read; 1.70 M reads/minute).

=== Summary ===

Total reads processed:                  51,003
Reads with adapters:                    41,789 (81.9%)
Reads written (passing filters):        51,003 (100.0%)

Total basepairs processed:    13,773,439 bp
Total written (filtered):      2,067,622 bp (15.0%)

=== Adapter 1 ===

Sequence: ATGCGTTGGAGAGARCGTTTC; Type: regular 3'; Length: 21; Trimmed: 41789 times.

No. of allowed errors:
0-21 bp: 0

Bases preceding removed adapters:
  A: 0.0%
  C: 0.2%
  G: 0.0%
  T: 0.0%
  none/other: 99.7%

Overview of removed sequences
length	count	expect	max.err	error counts
3	102	796.9	0	102
40	1	0.0	0	1
41	3	0.0	0	3
44	22	0.0	0	22
45	4	0.0	0

185	1	0.0	0	1
187	2	0.0	0	2
188	2	0.0	0	2
189	1	0.0	0	1
193	4	0.0	0	4
196	3	0.0	0	3
197	1	0.0	0	1
198	27	0.0	0	27
200	5	0.0	0	5
205	3	0.0	0	3
206	6	0.0	0	6
213	1	0.0	0	1
216	2	0.0	0	2
217	7	0.0	0	7
218	3	0.0	0	3
220	1	0.0	0	1
222	5	0.0	0	5
224	1	0.0	0	1
226	1	0.0	0	1
228	18	0.0	0	18
234	3	0.0	0	3
238	3	0.0	0	3
242	2	0.0	0	2
248	2	0.0	0	2
254	1	0.0	0	1
264	3	0.0	0	3
265	14	0.0	0	14
266	4	0.0	0	4
269	1	0.0	0	1
274	4	0.0	0	4
276	2	0.0	0	2
282	3	0.0	0	3
289	2	0.0	0	2
290	4	0.0	0	4
292	1	0.0	0	1
293	1	0.0	0	1
294	1	0.0	0	1
296	23	0.0	0	23
297	80	0.0	0	80
298	313	0.0	0	313
299	219	0.0	0	219
300	1651	0.0	0	1651
301	65537	0.0	0	65537

This is cutadapt 1.9.1 with Python 2.7.16
Command line parameters: -e 0.047619047619 -a ATGCGTTGGAGAGARCGTTTC -o /code/sequences/2593.R1.fastq.gz.trimmed.fastq.gz /code/sequences/2593.R1.fastq.gz
Trimming 1 adapter with at most 4.8% errors in single-end mode ...
Finished in 4.42 s (35 us/read; 1.72 M reads/minute).
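
The `-e 0.047619047619` value in the cutadapt log is 1/21 truncated to twelve decimal places, where 21 is the primer length. The sketch below works through the arithmetic, assuming cutadapt derives the allowed error count by flooring the error rate times the matched length. That assumption is consistent with the `0-21 bp: 0` line in the log, but it is not taken from the cutadapt source.

```python
import math


def allowed_errors(error_rate, length):
    """Allowed mismatches, assuming floor(error_rate * length)."""
    return int(math.floor(error_rate * length))


forward_primer = "ATGCGTTGGAGAGARCGTTTC"
rate_from_log = 0.047619047619  # value passed via -e in the log above

print(len(forward_primer))                                 # primer length
print(rate_from_log * len(forward_primer))                 # just below 1.0
print(allowed_errors(rate_from_log, len(forward_primer)))  # 0 under the floor assumption
print(allowed_errors(0.05, len(forward_primer)))           # 1 with a slightly higher rate
```

Under this assumption the truncated rate permits no mismatches over the full 21 bp primer, whereas a rate of 0.05 would permit one.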


### 3. Generate the sequence counts file used in the final reporting step

In [None]:
%%time
!for file in sequences/*.passedQC.fastq; \
do \
  awk 'NR%4==2{sum+=1}END{print FILENAME,sum+0}' "$file" >> sequences/diatomSequenceCounts.txt; \
done
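
The awk one-liner counts one read for every line whose number satisfies `NR % 4 == 2`, that is, the sequence line of each 4-line FASTQ record. The same logic can be sketched in Python; `count_fastq_reads` is a hypothetical helper shown only to make the counting rule explicit.

```python
def count_fastq_reads(lines):
    """Count reads as the awk command does: one per line with NR % 4 == 2."""
    return sum(1 for number, _ in enumerate(lines, start=1) if number % 4 == 2)


# Two made-up FASTQ records (4 lines each).
example = [
    "@r1", "ACGT", "+", "IIII",
    "@r2", "GGCCA", "+", "IIIII",
]
n_reads = count_fastq_reads(example)
print(n_reads)
```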

### 4. Assign similar sequences to OTUs using user-defined similarity threshold
#### No similarity threshold is passed here, so the pick_otus.py default of 0.97 applies. The default clustering algorithm, UCLUST, is a greedy O(m\*n) method, so its results depend on the order of sequences in the readyForQiime.allsamples.fasta file.

In [None]:
%%time
!pick_otus.py -i sequences/readyForQiime.allsamples.fasta -o sequences/picked_otus_97
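
To show why a greedy clustering pass depends on input order, here is a toy sketch. It is not UCLUST: it compares equal-length strings position by position and uses a threshold of 0.75 so that 4-base examples are meaningful, whereas the pipeline clusters at 0.97. The function names are hypothetical.

```python
def identity(a, b):
    """Fraction of matching positions between two sequences."""
    matches = sum(x == y for x, y in zip(a, b))
    return float(matches) / max(len(a), len(b))


def greedy_cluster(seqs, threshold):
    """Assign each sequence to the first centroid it matches, else start a new cluster."""
    clusters = {}   # centroid -> member sequences
    centroids = []  # centroids in order of creation
    for seq in seqs:
        for centroid in centroids:
            if identity(seq, centroid) >= threshold:
                clusters[centroid].append(seq)
                break
        else:
            centroids.append(seq)
            clusters[seq] = [seq]
    return clusters


a, b, c = "AAAA", "AAAT", "AATT"
order_one = greedy_cluster([a, b, c], threshold=0.75)
order_two = greedy_cluster([b, a, c], threshold=0.75)
print(len(order_one))  # 2 clusters: c is too far from centroid a
print(len(order_two))  # 1 cluster: both a and c are close enough to centroid b
```

The same three sequences produce two OTUs in one order and one OTU in the other, so the order of the input FASTA should be kept fixed if runs need to be comparable.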

### 5. Pick a representative set of sequences. For each OTU, one sequence is used in subsequent analysis

In [None]:
%%time
!pick_rep_set.py -i sequences/picked_otus_97/readyForQiime.allsamples_otus.txt \
  -f sequences/readyForQiime.allsamples.fasta \
  -o sequences/repset.fasta
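
As a rough illustration of this step, the sketch below picks the first listed sequence of each OTU as its representative. It assumes the OTU map is tab-separated with the OTU ID followed by its member sequence IDs, and that the first member is chosen, which I believe is the `pick_rep_set.py` default. The sequence IDs are made up and the helper name is hypothetical.

```python
def pick_first_representatives(otu_map_lines):
    """Return {otu_id: first member sequence ID} from a tab-separated OTU map."""
    representatives = {}
    for line in otu_map_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank lines and OTUs without members
        otu_id, members = fields[0], fields[1:]
        representatives[otu_id] = members[0]
    return representatives


# Made-up OTU map lines using the <sample>_<n> ID style.
example_map = ["0\t2426_1\t2426_7\t2593_3", "1\t2593_12"]
representatives = pick_first_representatives(example_map)
print(representatives)
```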

### 6. Query BLAST database with OTU representatives

In [None]:
%%time
!blastn -db diatoms -query sequences/repset.fasta \
  -out sequences/repset.diatoms.blastn \
  -task blastn -max_target_seqs 1 -num_threads 8 -outfmt 6 -evalue 0.01
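
With `-outfmt 6` and no custom column list, `blastn` writes twelve tab-separated columns per hit. The sketch below parses one such line into a dictionary; the example line is invented for illustration and `parse_blast_line` is a hypothetical helper.

```python
# Default column order for BLAST+ tabular output (-outfmt 6).
BLAST_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]
INT_COLUMNS = ("length", "mismatch", "gapopen", "qstart", "qend", "sstart", "send")
FLOAT_COLUMNS = ("pident", "evalue", "bitscore")


def parse_blast_line(line):
    """Parse one tabular BLAST line into a dict with typed numeric fields."""
    hit = dict(zip(BLAST_COLUMNS, line.rstrip("\n").split("\t")))
    for name in INT_COLUMNS:
        hit[name] = int(hit[name])
    for name in FLOAT_COLUMNS:
        hit[name] = float(hit[name])
    return hit


# Invented example line: query "otu1" hitting reference "ref42".
example_line = "otu1\tref42\t98.50\t300\t4\t1\t1\t300\t5\t304\t1e-150\t520"
hit = parse_blast_line(example_line)
print(hit["qseqid"], hit["sseqid"], hit["pident"])
```

Parsing the file this way makes it easy to inspect the percent identity (`pident`) that the next step thresholds on.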

In [None]:
!mkdir -p sequences/assigned_taxonomy

### 7. Create taxonomy assignments from BLAST outputs

In [None]:
%%time
!python ./create_taxonomy_assignments_from_blast.py --taxonomy diatoms.taxonomy.FINAL2017.txt \
  --percid 95.0 --blast sequences/repset.diatoms.blastn --output sequences/assigned_taxonomy/repset.taxonomy.txt 
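
The internals of `create_taxonomy_assignments_from_blast.py` are not shown in this notebook. The sketch below is only a guess at the core idea: a query takes the taxonomy of its BLAST hit if the percent identity reaches the `--percid` threshold, and is otherwise left unassigned. The `None` label, the taxon names, and the function name are all assumptions made for illustration.

```python
def assign_taxonomy(hits, taxonomy, min_percid=95.0, unassigned="None"):
    """Map each query ID to its hit's taxonomy when identity >= min_percid."""
    assignments = {}
    for query_id, subject_id, percid in hits:
        if percid >= min_percid and subject_id in taxonomy:
            assignments[query_id] = taxonomy[subject_id]
        else:
            assignments[query_id] = unassigned  # below threshold or unknown reference
    return assignments


# Invented hits as (query ID, reference ID, percent identity) and placeholder taxa.
example_hits = [("otu1", "ref42", 98.5), ("otu2", "ref7", 91.0)]
example_taxonomy = {"ref42": "TaxonA", "ref7": "TaxonB"}
assignments = assign_taxonomy(example_hits, example_taxonomy, min_percid=95.0)
print(assignments)
```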

### 8. Report how often an OTU is found in each sample and add the taxonomic predictions for each OTU

In [None]:
%%time
!make_otu_table.py -i sequences/picked_otus_97/readyForQiime.allsamples_otus.txt \
  -t sequences/assigned_taxonomy/repset.taxonomy.txt \
  -o sequences/otu_table.biom
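
Conceptually, the OTU table counts how many member sequences of each OTU come from each sample. The sketch below builds those counts from OTU map lines, assuming sequence IDs have the form `<sample>_<n>` so that the sample is the text before the last underscore. It does not write BIOM format, and the helper name and IDs are hypothetical.

```python
from collections import Counter


def build_otu_counts(otu_map_lines):
    """Return {otu_id: {sample_id: count}} from a tab-separated OTU map."""
    table = {}
    for line in otu_map_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank lines and OTUs without members
        otu_id, members = fields[0], fields[1:]
        # Assumption: the sample ID is everything before the last underscore.
        samples = Counter(seq_id.rsplit("_", 1)[0] for seq_id in members)
        table[otu_id] = dict(samples)
    return table


# Made-up OTU map lines.
example_map = ["0\t2426_1\t2426_7\t2593_3", "1\t2593_12"]
otu_counts = build_otu_counts(example_map)
print(otu_counts)
```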

### 9. Filter the OTU table by taxonomic metadata, excluding specific taxa

In [None]:
%%time
!filter_taxa_from_otu_table.py -i sequences/otu_table.biom \
  -o sequences/otu_table.diatomsonly.biom \
  -n MARINE,NOT_DIATOM,Yellow_green_Algae,None
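
The `-n` option lists taxa to exclude. As a sketch of the idea, the code below drops any OTU whose taxonomy string contains one of the excluded labels as a whole level. Exact, case-sensitive matching on semicolon-separated levels is an assumption made here for simplicity and may differ from how QIIME matches; `keep_otu` is a hypothetical helper.

```python
EXCLUDED_TAXA = ["MARINE", "NOT_DIATOM", "Yellow_green_Algae", "None"]


def keep_otu(taxonomy, excluded=EXCLUDED_TAXA):
    """Return True if no semicolon-separated level of taxonomy is excluded."""
    levels = [level.strip() for level in taxonomy.split(";")]
    return not any(level in excluded for level in levels)


# Placeholder taxonomy strings keyed by OTU ID.
example = {"0": "TaxonA", "1": "MARINE", "2": "None", "3": "TaxonB; NOT_DIATOM"}
kept = sorted(otu_id for otu_id, taxonomy in example.items() if keep_otu(taxonomy))
print(kept)
```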

### 10. Sort OTU table by sample id

In [None]:
%%time
!sort_otu_table.py -i sequences/otu_table.diatomsonly.biom \
  -o sequences/otu_table.diatomsonly.biom

### 11. Summarize the representation of taxonomic groups within each sample

In [None]:
%%time
!summarize_taxa.py -L 1 \
  -i sequences/otu_table.diatomsonly.biom \
  -o sequences/visualised_taxonomy -a
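
With `-L 1` the table is collapsed to the first taxonomy level, and `-a` keeps absolute counts rather than relative abundances. The sketch below shows that collapse on plain dictionaries; it assumes semicolon-separated taxonomy strings, uses placeholder taxa, and `summarize_level` is a hypothetical helper, not QIIME code.

```python
from collections import defaultdict


def summarize_level(otu_counts, otu_taxonomy, level=1):
    """Sum per-sample counts over OTUs sharing the first `level` taxonomy levels."""
    summary = defaultdict(lambda: defaultdict(int))
    for otu_id, per_sample in otu_counts.items():
        lineage = [part.strip() for part in otu_taxonomy[otu_id].split(";")]
        key = ";".join(lineage[:level])
        for sample, count in per_sample.items():
            summary[key][sample] += count
    return {taxon: dict(samples) for taxon, samples in summary.items()}


# Made-up counts and placeholder taxonomy.
example_counts = {"0": {"2426": 2, "2593": 1}, "1": {"2593": 4}, "2": {"2426": 3}}
example_taxonomy = {"0": "TaxonA", "1": "TaxonA", "2": "TaxonB"}
summary = summarize_level(example_counts, example_taxonomy, level=1)
print(summary)
```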

### 12. Produce Diatom reports

In [None]:
%%time
!python ./produceDiatomReports.py --folder sequences --lookup lookuptable.txt

### 13. Inspect the produced Diatom reports

In [None]:
import pandas as pd

In [None]:
pd.read_csv('sequences/Abundances.fail.csv')

In [None]:
pd.read_csv('sequences/Abundances.pass.csv')