### Diatom Analysis Pipeline - OTU

#### A notebook for the diatom analysis pipeline using OTU classification.

### 1. Create BLAST reference database

In [1]:
%%time
!makeblastdb -in diatoms.sequences.FINAL2017.fasta -out diatoms -dbtype nucl



Building a new DB, current time: 06/04/2019 13:06:08
New DB name:   /code/diatoms
New DB title:  diatoms.sequences.FINAL2017.fasta
Sequence type: Nucleotide
Keep Linkouts: T
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 2701 sequences in 0.198415 seconds.
CPU times: user 30 ms, sys: 20 ms, total: 50 ms
Wall time: 2.4 s


### 2. Quality Control
* *Cutadapt:* Trim primers from sequences
* *SicklePE:* Trim off bad quality 3' bases
* *Pear:* Merge R1 and R2 reads
* *SickleSE:* Remove post-merging bad quality sequences
* *Histogram Generation:* Plots sequence length against number of sequences
* *QIIME Prep:* Prepare the files that passed QC for QIIME processing
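
Both primers passed to `ampliconQC.py` in the next cell contain IUPAC degenerate bases (`R` in the forward primer, `W` in the reverse primer). As a small illustration of what those codes stand for, the sketch below expands a primer into a regular expression. The helper name `primer_to_regex` is hypothetical and not part of the pipeline; cutadapt does its own matching, so this is only meant to make the notation concrete.

```python
import re

# Standard IUPAC nucleotide codes mapped to the bases they represent.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC", "B": "CGT", "D": "AGT",
    "H": "ACT", "V": "ACG", "N": "ACGT",
}

def primer_to_regex(primer):
    """Hypothetical helper: turn an IUPAC primer into a regex pattern string."""
    parts = []
    for base in primer.upper():
        bases = IUPAC[base]
        # Unambiguous bases stay literal; degenerate bases become a character class.
        parts.append(bases if len(bases) == 1 else "[" + bases + "]")
    return "".join(parts)

forward = "ATGCGTTGGAGAGARCGTTTC"  # forward primer from the command below
pattern = re.compile(primer_to_regex(forward))
print(pattern.pattern)
print(bool(pattern.search("ATGCGTTGGAGAGAACGTTTC")))  # R read as A
print(bool(pattern.search("ATGCGTTGGAGAGAGCGTTTC")))  # R read as G
```

The regex requires an exact match apart from the degenerate positions, so it is stricter than a trimmer that tolerates a configurable error rate.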

In [2]:
%%time
!python ./ampliconQC.py --data sequences --forward ATGCGTTGGAGAGARCGTTTC --reverse GATCACCTTCTAATTTACCWACAACTG --threads 8 --histograms --qiime

This is cutadapt 1.9.1 with Python 2.7.16
Command line parameters: -e 0.047619047619 -b ATGCGTTGGAGAGARCGTTTC -b GATCACCTTCTAATTTACCWACAACTG -o /code/sequences/2377.R2.fastq.gz.trimmed.fastq.gz /code/sequences/2377.R2.fastq.gz
Trimming 2 adapters with at most 4.8% errors in single-end mode ...
Finished in 1.73 s (25 us/read; 2.37 M reads/minute).

=== Summary ===

Total reads processed:                  68,270
Reads with adapters:                     1,448 (2.1%)
Reads written (passing filters):        68,270 (100.0%)

Total basepairs processed:    19,434,171 bp
Total written (filtered):     19,429,790 bp (100.0%)

=== Adapter 1 ===

Sequence: ATGCGTTGGAGAGARCGTTTC; Type: variable 5'/3'; Length: 21; Trimmed: 1372 times.
8 times, it overlapped the 5' end of a read
1364 times, it overlapped the 3' end or was within the read

No. of allowed errors:
0-21 bp: 0

Overview of removed sequences (5')
length	count	expect	max.err	error counts
3	7	1066.7	0	7
5	1	66.7	0	1


FastQ paired records kept: 193702 (96851 pairs)
FastQ single records kept: 2719 (from PE1: 2252, from PE2: 467)
FastQ paired records discarded: 430 (215 pairs)
FastQ single records discarded: 2719 (from PE1: 467, from PE2: 2252)



FastQ paired records kept: 134916 (67458 pairs)
FastQ single records kept: 8258 (from PE1: 7317, from PE2: 941)
FastQ paired records discarded: 2124 (1062 pairs)
FastQ single records discarded: 8258 (from PE1: 941, from PE2: 7317)



FastQ paired records kept: 51332 (25666 pairs)
FastQ single records kept: 9827 (from PE1: 8750, from PE2: 1077)
FastQ paired records discarded: 45336 (22668 pairs)
FastQ single records discarded: 9827 (from PE1: 1077, from PE2: 8750)



FastQ paired records kept: 123904 (61952 pairs)
FastQ single records kept: 16116 (from PE1: 15064, from PE2: 1052)
FastQ paired records discarded: 3856 (1928 pairs)
FastQ single records discarded: 16116 (from PE1: 1052, from PE2: 15064)



FastQ paired records kept: 121272 (60636 pairs)
FastQ si

 ____  _____    _    ____ 
|  _ \| ____|  / \  |  _ \
| |_) |  _|   / _ \ | |_) |
|  __/| |___ / ___ \|  _ <
|_|   |_____/_/   \_\_| \_\

PEAR v0.9.11 [Nov 5, 2017]

Citation - PEAR: a fast and accurate Illumina Paired-End reAd mergeR
Zhang et al (2014) Bioinformatics 30(5): 614-620 | doi:10.1093/bioinformatics/btt593

Forward reads file.................: /code/sequences/2933.sickle.trimmed.R1.fastq.gz
Reverse reads file.................: /code/sequences/2933.sickle.trimmed.R2.fastq.gz
PHRED..............................: 33
Using empirical frequencies........: NO
Statistical method.................: OES
Maximum assembly length............: 999999
Minimum assembly length............: 50
p-value............................: 0.010000
Quality score threshold (trimming).: 0
Minimum read size after trimming...: 1
Maximal ratio of uncalled bases....: 1.000000
Minimum overlap....................: 10
Scoring method.....................: Scaled score
Threads............................: 8

Allo

 ____  _____    _    ____ 
|  _ \| ____|  / \  |  _ \
| |_) |  _|   / _ \ | |_) |
|  __/| |___ / ___ \|  _ <
|_|   |_____/_/   \_\_| \_\

PEAR v0.9.11 [Nov 5, 2017]

Citation - PEAR: a fast and accurate Illumina Paired-End reAd mergeR
Zhang et al (2014) Bioinformatics 30(5): 614-620 | doi:10.1093/bioinformatics/btt593

Forward reads file.................: /code/sequences/2377.sickle.trimmed.R1.fastq.gz
Reverse reads file.................: /code/sequences/2377.sickle.trimmed.R2.fastq.gz
PHRED..............................: 33
Using empirical frequencies........: NO
Statistical method.................: OES
Maximum assembly length............: 999999
Minimum assembly length............: 50
p-value............................: 0.010000
Quality score threshold (trimming).: 0
Minimum read size after trimming...: 1
Maximal ratio of uncalled bases....: 1.000000
Minimum overlap....................: 10
Scoring method.....................: Scaled score
Threads............................: 8

Allo

### 3. Generate the sequence counts file used in the final step to produce reports

In [3]:
%%time
!for file in sequences/*.passedQC.fastq; \
do \
  awk 'NR%4==2{sum+=1}END{print FILENAME,sum}' $file >> sequences/diatomSequenceCounts.txt; \
done

CPU times: user 40 ms, sys: 40 ms, total: 80 ms
Wall time: 4.96 s
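
The `awk` one-liner above counts every line whose number satisfies `NR%4==2`, that is, the sequence line of each four-line FASTQ record. A rough Python equivalent is sketched here, assuming uncompressed FASTQ with exactly four lines per record; the function name `count_fastq_records` and the demo file are made up for illustration.

```python
import os
import tempfile

def count_fastq_records(path):
    """Count sequence lines (line 2 of every 4-line record), like NR%4==2 in awk."""
    count = 0
    with open(path) as handle:
        for line_number, _ in enumerate(handle, start=1):
            if line_number % 4 == 2:
                count += 1
    return count

# Tiny made-up FASTQ file, used only to exercise the function.
demo = "@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nIIII\n"
with tempfile.NamedTemporaryFile("w", suffix=".fastq", delete=False) as tmp:
    tmp.write(demo)
demo_count = count_fastq_records(tmp.name)
os.remove(tmp.name)
print(demo_count)
```

Note that the cell appends to `diatomSequenceCounts.txt` with `>>`, so re-running it adds duplicate lines unless the file is removed first.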


### 4. Assign similar sequences to OTUs using a user-defined similarity threshold

In [4]:
%%time
!pick_otus.py -i sequences/readyForQiime.allsamples.fasta -o sequences/picked_otus_97

CPU times: user 2.78 s, sys: 420 ms, total: 3.2 s
Wall time: 3min 25s


### 5. Pick a representative set of sequences: one sequence per OTU is used in subsequent analysis

In [5]:
%%time
!pick_rep_set.py -i sequences/picked_otus_97/readyForQiime.allsamples_otus.txt \
  -f sequences/readyForQiime.allsamples.fasta \
  -o sequences/repset.fasta

CPU times: user 160 ms, sys: 90 ms, total: 250 ms
Wall time: 15.5 s
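
To make the idea of a representative set concrete, here is a minimal sketch. It assumes the OTU map is a tab-separated file with the OTU id in the first column followed by the ids of its member sequences, and it assumes the first listed member is taken as the representative. Both points are assumptions for illustration and may not match what the installed `pick_rep_set.py` does; the function name and the example ids are hypothetical.

```python
def first_member_per_otu(otu_map_lines):
    """Return {otu_id: first listed sequence id} from OTU-map-style lines."""
    representatives = {}
    for line in otu_map_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank lines or OTUs with no members
        representatives[fields[0]] = fields[1]
    return representatives

# Made-up OTU map lines; sample and read ids are illustrative only.
example_map = ["0\t2377_1\t2379_5\t2377_9\n", "1\t2379_2\n"]
reps = first_member_per_otu(example_map)
print(reps)
```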


### 6. Query BLAST database with OTU representatives

In [6]:
%%time
!blastn -db diatoms -query sequences/repset.fasta \
  -out sequences/repset.diatoms.blastn \
  -task blastn -max_target_seqs 1 -num_threads 8 -outfmt 6 -evalue 0.01

CPU times: user 30.2 s, sys: 5.74 s, total: 35.9 s
Wall time: 39min 20s
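
With `-outfmt 6`, BLAST+ writes a header-less, tab-separated table with twelve default columns. The sketch below loads such a table with pandas and keeps the highest-bitscore row per query, since a query can still have several rows (for example one per HSP). The example rows and the helper names are made up for illustration and are not taken from `repset.diatoms.blastn`.

```python
import io
import pandas as pd

# Default column order for BLAST+ tabular output (-outfmt 6).
BLAST6_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]

def read_blast6(handle):
    """Load a header-less outfmt 6 table into a DataFrame."""
    return pd.read_csv(handle, sep="\t", header=None, names=BLAST6_COLUMNS)

def best_hit_per_query(hits):
    """Keep the single highest-bitscore row for each query sequence."""
    ordered = hits.sort_values("bitscore", ascending=False)
    return ordered.drop_duplicates("qseqid").set_index("qseqid")

# Made-up rows for illustration; not taken from the real output file.
example = io.StringIO(
    "otu1\trefA\t98.5\t300\t4\t0\t1\t300\t1\t300\t1e-150\t520\n"
    "otu1\trefB\t91.0\t300\t27\t0\t1\t300\t1\t300\t1e-110\t400\n"
    "otu2\trefC\t88.2\t290\t34\t0\t1\t290\t5\t294\t1e-90\t350\n"
)
best = best_hit_per_query(read_blast6(example))
print(best[["sseqid", "pident"]])
```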


In [7]:
!mkdir -p sequences/assigned_taxonomy

### 7. Create taxonomy assignments from BLAST outputs

In [8]:
%%time
!python ./create_taxonomy_assignments_from_blast.py --taxonomy diatoms.taxonomy.FINAL2017.txt \
  --percid 95.0 --blast sequences/repset.diatoms.blastn --output sequences/assigned_taxonomy/repset.taxonomy.txt 

CPU times: user 0 ns, sys: 30 ms, total: 30 ms
Wall time: 2.3 s
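
The script `create_taxonomy_assignments_from_blast.py` is not listed in this notebook, so the following is only a guess at the kind of logic `--percid 95.0` implies: accept a query's best hit when its percent identity reaches the threshold, otherwise leave the query unassigned. The function name, the dictionary layouts, and the `"None"` placeholder label are all assumptions made for this sketch.

```python
def assign_taxonomy(best_hits, taxonomy, min_percent_identity=95.0):
    """Map each query to a taxonomy label, or "None" below the identity cut-off.

    best_hits: {query_id: (reference_id, percent_identity)}
    taxonomy:  {reference_id: taxonomy label}
    """
    assignments = {}
    for query_id, (reference_id, percent_identity) in best_hits.items():
        if percent_identity >= min_percent_identity and reference_id in taxonomy:
            assignments[query_id] = taxonomy[reference_id]
        else:
            assignments[query_id] = "None"  # assumed placeholder label
    return assignments

# Made-up identifiers and labels for illustration only.
example_hits = {"otu1": ("refA", 98.5), "otu2": ("refC", 88.2)}
example_taxonomy = {"refA": "TaxonA", "refC": "TaxonC"}
assigned = assign_taxonomy(example_hits, example_taxonomy)
print(assigned)
```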


### 8. Build an OTU table reporting how often each OTU is found in each sample, with the taxonomic prediction for each OTU

In [9]:
%%time
!make_otu_table.py -i sequences/picked_otus_97/readyForQiime.allsamples_otus.txt \
  -t sequences/assigned_taxonomy/repset.taxonomy.txt \
  -o sequences/otu_table.biom

CPU times: user 70 ms, sys: 40 ms, total: 110 ms
Wall time: 7.27 s


### 9. Filter the OTU table on taxonomic metadata, excluding specific taxa

In [11]:
%%time
!filter_taxa_from_otu_table.py -i sequences/otu_table.biom \
  -o sequences/otu_table.diatomsonly.biom \
  -n MARINE,NOT_DIATOM,Yellow_green_Algae,None

CPU times: user 140 ms, sys: 0 ns, total: 140 ms
Wall time: 8.55 s
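
The `-n` option lists taxa to exclude. Conceptually this amounts to dropping the rows whose taxonomy label is in the exclusion list. The sketch below shows that idea in pandas on a made-up table; it is not the QIIME implementation, and exact matching on a single label column named `taxonomy` is a simplifying assumption.

```python
import pandas as pd

# Labels excluded by the -n option in the cell above.
EXCLUDED = ["MARINE", "NOT_DIATOM", "Yellow_green_Algae", "None"]

def drop_excluded_taxa(table, excluded=EXCLUDED, taxonomy_column="taxonomy"):
    """Return only the rows whose taxonomy label is not in the excluded list."""
    keep = ~table[taxonomy_column].isin(excluded)
    return table[keep]

# Made-up counts and labels for illustration only.
example_table = pd.DataFrame(
    {
        "2377": [50, 3, 7],
        "2379": [24, 0, 2],
        "taxonomy": ["168", "MARINE", "None"],
    },
    index=["otu1", "otu2", "otu3"],
)
diatoms_only = drop_excluded_taxa(example_table)
print(diatoms_only)
```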


### 10. Sort OTU table by sample id

In [12]:
%%time
!sort_otu_table.py -i sequences/otu_table.diatomsonly.biom \
  -o sequences/otu_table.diatomsonly.biom

CPU times: user 140 ms, sys: 0 ns, total: 140 ms
Wall time: 8.88 s


### 11. Summarise the representation of taxonomic groups within each sample

In [13]:
%%time
!summarize_taxa.py -L 1 \
  -i sequences/otu_table.diatomsonly.biom \
  -o sequences/visualised_taxonomy -a

CPU times: user 130 ms, sys: 30 ms, total: 160 ms
Wall time: 11 s


### 12. Produce Diatom reports

In [49]:
data = pd.read_csv("sequences/visualised_taxonomy/otu_table.diatomsonly_L1.txt",header=1,sep='\t',index_col=0)
data[['2377','2379']]

#OTU ID,2377,2379
168,50.0,24.0
199,21.0,0.0
208,8.0,3.0
212,41.0,34.0
252,44.0,54.0
257,323.0,9410.0
27,11.0,0.0
29,0.0,0.0
308,0.0,0.0
311,0.0,1.0


In [56]:
%%time
!python ./produceDiatomReports.py --folder sequences --lookup lookuptable.txt

Reports completed
CPU times: user 0 ns, sys: 10 ms, total: 10 ms
Wall time: 906 ms
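
According to the script listing at the end of this notebook, the lookup file holds tab-separated pairs of FERA sample name and EA sample name. How the script applies the lookup is not visible in the truncated listing, so the sketch below only illustrates one plausible use: renaming the sample columns of the count table. The two pairs used here are inferred from matching column values in this notebook (columns `2377` and `2379` of the L1 table against `Abundances.pass.csv`) and should be treated as illustrative; `read_lookup` is a hypothetical helper.

```python
import io
import pandas as pd

def read_lookup(handle):
    """Parse 'FERA name<TAB>EA name' lines into a dict."""
    lookup = {}
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2:
            lookup[fields[0]] = fields[1]
    return lookup

# Illustrative lookup content; the real lookuptable.txt is not shown in the notebook.
lookup_text = io.StringIO("2377\t101210740\n2379\t101527498\n")
lookup = read_lookup(lookup_text)

# First two rows of the L1 table for the two samples displayed above.
counts = pd.DataFrame(
    {"2377": [50.0, 21.0], "2379": [24.0, 0.0]},
    index=pd.Index([168, 199], name="#OTU ID"),
)
renamed = counts.rename(columns=lookup)
print(renamed)
```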


### 13. Inspect produced Diatom reports

In [15]:
import pandas as pd

In [57]:
pd.read_csv('sequences/Abundances.fail.csv')

Unnamed: 0,#OTU ID,101210740,101527498,101403272,101563446,101611446,101799933,101353981,101522698,101496665,20560447,101353095,101318072


In [58]:
pd.read_csv('sequences/Abundances.pass.csv')

Unnamed: 0,#OTU ID,101403272,101353095,101611446,101563446,101496665,101318072,101799933,101527498,101522698,101353981,101210740,20560447
0,168,0.0,0.0,1.0,0.0,0.0,50.0,0.0,24.0,61.0,0.0,50.0,5.0
1,199,0.0,0.0,1.0,0.0,0.0,8.0,7.0,0.0,4.0,1.0,21.0,0.0
2,208,0.0,9.0,0.0,0.0,0.0,75.0,0.0,3.0,25.0,4.0,8.0,1.0
3,212,0.0,0.0,1.0,0.0,0.0,0.0,0.0,34.0,68.0,0.0,41.0,55.0
4,252,0.0,0.0,14.0,0.0,0.0,0.0,0.0,54.0,139.0,0.0,44.0,23.0
5,257,0.0,0.0,356.0,0.0,0.0,0.0,0.0,9410.0,952.0,1.0,323.0,920.0
6,27,0.0,0.0,1.0,0.0,8.0,14.0,8.0,0.0,5.0,6.0,11.0,0.0
7,29,0.0,0.0,0.0,0.0,7.0,0.0,57.0,0.0,0.0,1.0,0.0,0.0
8,308,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,311,0.0,18.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,37.0,0.0,0.0


### 14. List sequence files used

In [18]:
!ls sequences/raw_data

2377.R1.fastq.gz  2609.R1.fastq.gz  2790.R1.fastq.gz  3008.R1.fastq.gz
2377.R2.fastq.gz  2609.R2.fastq.gz  2790.R2.fastq.gz  3008.R2.fastq.gz
2379.R1.fastq.gz  2733.R1.fastq.gz  2828.R1.fastq.gz  3068.R1.fastq.gz
2379.R2.fastq.gz  2733.R2.fastq.gz  2828.R2.fastq.gz  3068.R2.fastq.gz
2587.R1.fastq.gz  2745.R1.fastq.gz  2933.R1.fastq.gz  3140.R1.fastq.gz
2587.R2.fastq.gz  2745.R2.fastq.gz  2933.R2.fastq.gz  3140.R2.fastq.gz


In [19]:
!cat produceDiatomReports.py

import argparse
import pandas
import re
from collections import defaultdict

def main():
	options = parseArguments()
	resultsFile = options.folder + "/visualised_taxonomy/otu_table.diatomsonly_L1.txt"
	countsFile = options.folder + "/diatomSequenceCounts.txt"

	# Import the sequence counts
	counts = {}
	for line in open(countsFile, "rU"):
		linelist = re.split(' ',line)
		filename = linelist[0]
		# NOTE: the count is hard-coded here; the value parsed from the counts file is commented out
		count = int(45367) #int(linelist[1].rstrip())
		# filename will be sample.passedQC.fastq
		filenamesplit = re.split(r'\.',filename)
		sample = filenamesplit[0].replace(options.folder,'').replace('/','')
		counts[sample] = count

	# Import the lookup if there
	if (options.lookup):
		lookup = {}
		# FERA sample name \t EA sample name
		for line in open(options.lookup, "rU"):
			linelist = re.split('\t',line)
			lookup[linelist[0]] = linelist[1].rstrip()

	# Import the results into a pandas dataframe
	data = pandas.read_csv(resultsFile,header=1,sep='\t',index_col=0)
	