## Pooling demultiplexed files from multiple sequencing runs

### These files come from the demultiplexing notebooks [1](./demultiplex_seq1.ipynb), [2](./demultiplex_seq2.ipynb), [3](./demultiplex_seq3.ipynb), [4](./demultiplex_seq4.ipynb), [5](./demultiplex_seq5.ipynb).

In [1]:
!cat /var/seq_data/priming_exp/data/*.demult.fastq > /var/seq_data/priming_exp/data/allseq.assembled.demult.fq

In [2]:
!head /var/seq_data/priming_exp/data/allseq.assembled.demult.fq

@13C.000.28.07.NA_0seq1 orig_name=M02465:63:000000000-A9G5P:1:1101:18364:1748
CCGCCTTGCGACCGTACTCCCCAGGCGGAGTCCTTCATGCGTTAGCTTCGGCACGGCCCCGATCGCTCCCCGCCCCACCAAGCACTCATCGTTTAGGGCTAGGACTACCGGGGTATCTAATCCTGTTTGCTCCCCACGCTTTCGTGCCTCAGCGTCAGTGTTGATCCAGATGGCCGCCTTCGCCACGGATGTTCCTCCCGATCTCTACGCATTTCACCGCTACACCGGGAATTCCGCCATCCTCTATCACACTCTAGCTGGCCAGTATCCATCGCCATTCCCAGGTTGAGCCCAGGGCTTTCACGACAGACTTAACCAACCGCCTACGCACGCTTTACGCCCAGTAATTCCGATTAACGCTTGCACCCTTCGTA
+
6,66ACE;B@+@F+=9B<EFGGD,6:77++8,CAE6,<5A>>C=6=FFF5:=6@,+@++++*1@8,:,@CF+>H+@4IID68G3CII1D@IIG?28D@HIIGICDIIII<ICII<II2?3CIIF6IG>0I6IIII6I3=A0EIGIFIIIIHFIFII@DIIHI97FIIIB7?:DIIBI@III>IIIIC?3IFI2IIIIIICDIIIIIII?IFIIIIGIIIIIGIGIIIDIIG6III8IIE8IIIIIB,IC?EIIIIGFIC<IIIGGF?III@2IFIIB?<IIIEIFIIGIFIIDIIAEIFIFGGGEFFFEGGACCB7CEGGGEGGEFFB@CGDGGGGGGGGGGFCFFF@FGFFF@@GGGGGFCGGFDFFC9CC<C
@13C.700.45.03.NA_1seq1 orig_name=M02465:63:000000000-A9G5P:1:1101:14576:1759
CACTCTTGCGACCCTCCTCCCCCTTCCCACTGCTTCCTTCGTTCTCTTCAGCCCTTCCCTGCGCACACCCTCTAACACTTAGCACTC
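
The pooled headers carry the sample name, a per-read suffix, and the original instrument read name. As a minimal sketch for pulling those parts apart, the helper below assumes that the sample name is everything before the last underscore of the read ID; that convention is inferred from the headers shown above, and `parse_header` is a hypothetical name, not part of the demultiplexing notebooks.

```python
def parse_header(line):
    """Split a pooled read header into (read_id, sample, orig_name).

    Assumption: the sample name is the read ID up to its last underscore,
    e.g. '13C.000.28.07.NA_0seq1' -> '13C.000.28.07.NA'.
    """
    # Drop the FASTQ '@' or FASTA '>' marker, then split ID from comment.
    read_id, _, comment = line.strip().lstrip("@>").partition(" ")
    sample = read_id.rsplit("_", 1)[0]
    # The comment is expected to look like 'orig_name=<instrument read name>'.
    orig_name = comment.partition("orig_name=")[2] or None
    return read_id, sample, orig_name


header = "@13C.000.28.07.NA_0seq1 orig_name=M02465:63:000000000-A9G5P:1:1101:18364:1748"
read_id, sample, orig_name = parse_header(header)
print(read_id, sample, orig_name)
```

Grouping reads by `sample` is one quick way to check that every sequencing run contributed reads to the pooled file.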

## Discard sequences that exceed the maximum expected error threshold.

In [3]:
%%bash

nprocs=20
maxee=1

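tmpdir1=$(mktemp -d)
tmpdir2=$(mktemp -d)
# A single trap covers both directories; a second trap on the same signals would replace the first.
trap 'rm -rf "$tmpdir1" "$tmpdir2"' 1 2 3 15
# 2,000,000 lines is a multiple of 4, so no FASTQ record is split across blocks.
split -d -l 2000000 /var/seq_data/priming_exp/data/allseq.assembled.demult.fq $tmpdir1/Block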
ls $tmpdir1/Block?? | parallel --gnu -j $nprocs -k "usearch -fastq_filter {} \
-fastq_maxee $maxee \
-fastaout $tmpdir2/{#}.fasta >/dev/null 2>&1 && cat $tmpdir2/{#}.fasta" > /var/seq_data/priming_exp/data/tmp/maxee$maxee.fasta
rm -r $tmpdir2 $tmpdir1

grep -c ">" /var/seq_data/priming_exp/data/tmp/maxee$maxee.fasta
head -n 8 /var/seq_data/priming_exp/data/tmp/maxee$maxee.fasta

25094893
>12C.100.28.03.NA_53seq1 orig_name=M02465:63:000000000-A9G5P:1:1101:17937:1855
CACACTTGCGTGCGTACTCCCCAGGCGGGTCACTTAACGCGTTAGCTACGGCACCGAGGGGGTCGATACCCCCGACACCTAGTGACCATCGTTTACAGCGTGGACTACCAGGGTATCTAATCCTGTTTGCTCCCCACGCTTTCGTACCTCAGCGTCAGAAACGGCCCAGAAGGTCGCCTTCGCCACCGGTGTTCTTCCGAATATCTACGCATTTCACCGCTACACCAGGAATTCCATTCTCCTCTTCCGGACTCTAGCCAGAAAGTTTCCACCGACAGCTCGACGTTGAGCCTCGAGTTTTCACAGCGGACTTTTCTGGCCGCCTACACGCGCTTTACGCCCAATGATTCCGGACAACGCTTGCCCCCTACGTA
>12C.700.28.03.NA_54seq1 orig_name=M02465:63:000000000-A9G5P:1:1101:14269:1856
TAATCTTGCGACCGTACTCCCCAGGCGGGATGCTTAATGCGTTAGCTGCGCCACTGAAGGGTAAACCCACCAACGGCTAGCATCCATCGTTTACGGCGTGGACTACCAGGGTATCTAATCCTGTTTGCTCCCCACGCTTTCGCACCTCAGCGTCAGTACCGGGCCAGTGAGCCGCCTTCGCCACTGGTGTTCTTGCGAATATCTACGAATTTCACCTCTACACCCGCAGTTCCACTTACCTCTTCCGGTCTCAAGCTCTACAGTATCGAAGGCAATTCTGTGGTTGAGCCACAGGCTTTCACCACCGACTTACAAGGCCGCCTACGCGCCCTTTACGCCCAGTGATTCCGAACAACGCTAGCCCCCTTCGTA
>13C.100.28.07.NA_61seq1 orig_name=M02465:63:000000000-A9G5P:1:1101:9868:1862
TAGCCTT
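
The expected error of a read is the sum of the per-base error probabilities implied by its quality scores, and the filter above keeps a read only if that sum does not exceed `maxee`. The sketch below illustrates the calculation; it assumes Phred+33 quality encoding (consistent with the quality characters in the FASTQ output above), and the function names are hypothetical, not the usearch implementation.

```python
def expected_errors(qual, offset=33):
    """Sum of per-base error probabilities, P = 10 ** (-Q / 10).

    Assumption: qualities are Phred-encoded with the given ASCII offset.
    """
    return sum(10 ** (-(ord(c) - offset) / 10) for c in qual)


def passes_maxee(qual, maxee=1.0):
    """Keep a read only if its expected error count is at most maxee."""
    return expected_errors(qual) <= maxee


# 'I' encodes Q40 (error probability 1e-4); '+' encodes Q10 (probability 0.1).
print(expected_errors("I" * 10))
print(passes_maxee("+" * 11))
```

Because the score is a sum, a long read with uniformly good qualities can pass while a short read with a handful of low-quality bases fails.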

## Remove sequences containing N characters.

In [4]:
%%bash
bioawk -c fastx '{if ($seq !~ /N/){print ">" $name " " $4 "\n" $seq}}' /var/seq_data/priming_exp/data/tmp/maxee1.fasta > \
/var/seq_data/priming_exp/data/tmp/maxee1.0.noN.fasta
grep -c ">" /var/seq_data/priming_exp/data/tmp/maxee1.fasta
grep -c ">" /var/seq_data/priming_exp/data/tmp/maxee1.0.noN.fasta

25094893
25094861
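
The two counts differ by 32, so only 32 sequences contained an N. For readers without bioawk, the same filter can be sketched in plain Python. This is an illustrative equivalent rather than the code used here; `read_fasta` and `drop_ambiguous` are hypothetical helper names, and the match is case-sensitive like the `/N/` pattern above.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def drop_ambiguous(records):
    """Keep only records whose sequence has no uppercase 'N'."""
    return [(name, seq) for name, seq in records if "N" not in seq]


example = [">a orig_name=x\n", "ACGT\n", ">b\n", "ACNT\n", ">c\n", "AC\n", "GT\n"]
for name, seq in drop_ambiguous(read_fasta(example)):
    print(">" + name)
    print(seq)
```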


# Alignment-based QC with Mothur

In [5]:
!ionice mothur "#unique.seqs(fasta=/var/seq_data/priming_exp/data/tmp/maxee1.0.noN.fasta)" \
>mothur.out

In [6]:
!grep -c ">" /var/seq_data/priming_exp/data/tmp/maxee1.0.noN.fasta
!grep -c ">" /var/seq_data/priming_exp/data/tmp/maxee1.0.noN.unique.fasta

25094861
14712014
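
Dereplication collapses identical sequences into one representative and records which reads it stands for, which is why about 14.7 million unique sequences represent the 25.1 million reads. The sketch below shows the idea on toy data. It assumes the two-column names layout (representative, then a comma-separated list of all member names); `dereplicate` is a hypothetical helper, not mothur's implementation.

```python
def dereplicate(records):
    """Collapse identical sequences.

    Returns (unique_records, names_lines). Each names line is
    '<representative>\\t<comma-separated member names>', with the
    representative listed first among its members.
    """
    groups = {}  # sequence -> read names, in first-seen order
    for name, seq in records:
        groups.setdefault(seq, []).append(name)
    unique = [(names[0], seq) for seq, names in groups.items()]
    names_lines = ["%s\t%s" % (names[0], ",".join(names)) for names in groups.values()]
    return unique, names_lines


reads = [("r1", "ACGT"), ("r2", "TTGA"), ("r3", "ACGT")]
unique, names_lines = dereplicate(reads)
print(unique)
print(names_lines)
```

Aligning only the representatives and carrying the names lines along keeps the alignment step proportional to the number of unique sequences.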


##### Note: the next step needs all sequences (not just the unique ones) to create the data file

In [8]:
!mkdir -p /var/seq_data/priming_exp/data/tmp/db

In [9]:
%%bash
if ! [ -e /var/seq_data/priming_exp/data/silva_ref_aln_mothur.fasta ]; then
    curl -o /var/seq_data/priming_exp/data/tmp/db/silva_B.zip http://www.mothur.org/w/images/9/98/Silva.bacteria.zip 
    curl -o /var/seq_data/priming_exp/data/tmp/db/silva_E.zip http://www.mothur.org/w/images/1/1a/Silva.eukarya.zip 
    curl -o /var/seq_data/priming_exp/data/tmp/db/silva_A.zip http://www.mothur.org/w/images/3/3c/Silva.archaea.zip 
fi

In [10]:
%%bash
if ! [ -e /var/seq_data/priming_exp/data/silva_ref_aln_mothur.fasta ]; then
    unzip /var/seq_data/priming_exp/data/tmp/db/silva_A.zip -d /var/seq_data/priming_exp/data/tmp/db/
    unzip /var/seq_data/priming_exp/data/tmp/db/silva_B.zip -d /var/seq_data/priming_exp/data/tmp/db/
    unzip /var/seq_data/priming_exp/data/tmp/db/silva_E.zip -d /var/seq_data/priming_exp/data/tmp/db/
fi

In [11]:
%%bash
if ! [ -e /var/seq_data/priming_exp/data/silva_ref_aln_mothur.fasta ]; then
    cat /var/seq_data/priming_exp/data/tmp/db/silva.bacteria/silva.bacteria.fasta \
    /var/seq_data/priming_exp/data/tmp/db/silva.eukarya/silva.eukarya.fasta \
    /var/seq_data/priming_exp/data/tmp/db/Silva.archaea/silva.archaea.fasta \
    > /var/seq_data/priming_exp/data/silva_ref_aln_mothur.fasta
fi

In [12]:
%%bash
mothur "#filter.seqs(vertical=t, \
fasta=/var/seq_data/priming_exp/data/silva_ref_aln_mothur.fasta, \
processors=10)" > /dev/null

In [None]:
%%bash
mothur "#align.seqs(candidate=/var/seq_data/priming_exp/data/tmp/maxee1.0.noN.unique.fasta, \
template=/var/seq_data/priming_exp/data/silva_ref_aln_mothur.fasta, \
processors=20, \
flip=T)" > /dev/null

#### We can filter out vertical gaps...

In [None]:
%%bash
mothur "#filter.seqs(vertical=t, \
fasta=/var/seq_data/priming_exp/data/tmp/maxee1.0.noN.unique.align, \
processors=25)" > /dev/null
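
Vertical filtering drops every alignment column that holds only gap characters, which shortens the alignment without changing any sequence. A minimal sketch of that idea follows. It assumes that both `-` and `.` count as gaps and that all rows have the same length; `filter_vertical` is a hypothetical name used only for illustration.

```python
def filter_vertical(aligned):
    """Remove alignment columns in which every row has a gap ('-' or '.')."""
    keep = [
        i
        for i, column in enumerate(zip(*aligned))
        if any(c not in "-." for c in column)
    ]
    return ["".join(seq[i] for i in keep) for seq in aligned]


rows = ["A--C.", "A-GC.", ".-GC-"]
for row in filter_vertical(rows):
    print(row)
```

Columns 2 and 5 of the toy alignment contain only gaps, so the filtered rows have three columns each.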

### Here is what our seqs look like.

In [20]:
%%bash
mothur "#summary.seqs(fasta=/var/seq_data/priming_exp/data/tmp/maxee1.0.noN.unique.filter.fasta, \
processors=25, \
name=/var/seq_data/priming_exp/data/tmp/maxee1.0.noN.names)"

[H[2J





mothur v.1.32.1
Last updated: 10/16/2013

by
Patrick D. Schloss

Department of Microbiology & Immunology
University of Michigan
pschloss@umich.edu
http://www.mothur.org

When using, please cite:
Schloss, P.D., et al., Introducing mothur: Open-source, platform-independent, community-supported software for describing and comparing microbial communities. Appl Environ Microbiol, 2009. 75(23):7537-41.

Distributed under the GNU General Public License

Type 'help()' for information on the commands that are available

Type 'quit()' to exit program



mothur > summary.seqs(fasta=/var/seq_data/priming_exp/data/tmp/maxee1.0.noN.unique.filter.fasta, processors=25, name=/var/seq_data/priming_exp/data/tmp/maxee1.0.noN.names)

Using 25 processors.



### Remove sequences with homopolymers longer than 8 and screen out sequences that don't align to the amplicon region

In [24]:
%%bash

mothur "#screen.seqs(fasta=/var/seq_data/priming_exp/data/tmp/maxee1.0.noN.unique.filter.fasta, \
processors=25, \
name=/var/seq_data/priming_exp/data/tmp/maxee1.0.noN.names, \
start=878, \
end=2780, \
maxhomop=8, minlength=333)" > /dev/null
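
The screening criteria can be read as four per-sequence checks: the alignment must start at or before `start`, end at or after `end`, contain no homopolymer run longer than `maxhomop`, and have at least `minlength` bases. The sketch below is a hedged illustration of those checks on an aligned string, assuming 1-based alignment positions and statistics computed on the ungapped bases; `summarize` and `passes_screen` are hypothetical helpers, not mothur code.

```python
from itertools import groupby

GAPS = "-."


def summarize(aligned):
    """Return (start, end, nbases, longest_homopolymer) for one aligned sequence."""
    positions = [i for i, c in enumerate(aligned, start=1) if c not in GAPS]
    bases = "".join(c for c in aligned if c not in GAPS)
    longest = max((len(list(run)) for _, run in groupby(bases)), default=0)
    start = positions[0] if positions else 0
    end = positions[-1] if positions else 0
    return start, end, len(bases), longest


def passes_screen(aligned, start, end, maxhomop, minlength):
    """Apply the four screening checks described above."""
    s, e, nbases, homop = summarize(aligned)
    return (
        nbases > 0
        and s <= start
        and e >= end
        and homop <= maxhomop
        and nbases >= minlength
    )


print(summarize("..AC--GGGT.."))
print(passes_screen("..AC--GGGT..", start=3, end=10, maxhomop=3, minlength=6))
```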

In [25]:
%%bash

mothur "#filter.seqs(fasta=/var/seq_data/priming_exp/data/tmp/maxee1.0.noN.unique.filter.good.fasta, \
processors=25, \
vertical=T)" > /dev/null

In [26]:
%%bash

mothur "#summary.seqs(fasta=/var/seq_data/priming_exp/data/tmp/maxee1.0.noN.unique.filter.good.filter.fasta, \
name=/var/seq_data/priming_exp/data/tmp/maxee1.0.noN.good.names)"

[H[2J





mothur v.1.32.1
Last updated: 10/16/2013

by
Patrick D. Schloss

Department of Microbiology & Immunology
University of Michigan
pschloss@umich.edu
http://www.mothur.org

When using, please cite:
Schloss, P.D., et al., Introducing mothur: Open-source, platform-independent, community-supported software for describing and comparing microbial communities. Appl Environ Microbiol, 2009. 75(23):7537-41.

Distributed under the GNU General Public License

Type 'help()' for information on the commands that are available

Type 'quit()' to exit program



mothur > summary.seqs(fasta=/var/seq_data/priming_exp/data/tmp/maxee1.0.noN.unique.filter.good.filter.fasta, name=/var/seq_data/priming_exp/data/tmp/maxee1.0.noN.good.names)

Using 1 processors.

		Start	End	NBases	Ambigs	Polymer	NumSeqs
Minimum:	1	1711	334	0	3	1
2.5%-tile:	219	1711	370	0	4	623218
25%-tile:	219	1711	372	0	4	6232175
Median: 	219	1711	374	0	5	12464349
75%-tile:	219	1711	374	0	5	18696523
97.5%-tile:	219	1711	377	0	8	243

## Expand data, remove gaps, and copy into the data directory

In [None]:
%%bash
mothur "#deunique.seqs(fasta=/var/seq_data/priming_exp/data/tmp/maxee1.0.noN.unique.filter.good.fasta, \
name=/var/seq_data/priming_exp/data/tmp/maxee1.0.noN.good.names)" > /dev/null

In [None]:
!sed '/>/! s/-//g;/>/! s/\.//g' /var/seq_data/priming_exp/data/tmp/maxee1.0.noN.redundant.fasta \
> /var/seq_data/priming_exp/data/finalQC.fasta

In [34]:
!head /var/seq_data/priming_exp/data/finalQC.fasta

>12C.100.28.03.NA_53seq1
TACGTAGGGGGCAAGCGTTGTCCGGAATCATTGGGCGTAAAGCGCGTGTAGGCGGCCAGAAAAGTCCGCTGTGAAAACTCGAGGCTCAACGTCGAGCTGTCGGTGGAAACTTTCTGGCTAGAGTCCGGAAGAGGAGAATGGAATTCCTGGTGTAGCGGTGAAATGCGTAGATATTCGGAAGAACACCGGTGGCGAAGGCGACCTTCTGGGCCGTTTCTGACGCTGAGGTACGAAAGCGTGGGGAGCAAACAGGATTAGATACCCTGGTAGTCCACGCTGTAAACGATGGTCACTAGGTGTCGGGGGTATCGACCCCCTCGGTGCCGTAGCTAACGCGTTAAGTGACCCGCCTGGGGAGTACGCACGCAAGTGTG
>12C.700.28.03.NA_54seq1
TACGAAGGGGGCTAGCGTTGTTCGGAATCACTGGGCGTAAAGGGCGCGTAGGCGGCCTTGTAAGTCGGTGGTGAAAGCCTGTGGCTCAACCACAGAATTGCCTTCGATACTGTAGAGCTTGAGACCGGAAGAGGTAAGTGGAACTGCGGGTGTAGAGGTGAAATTCGTAGATATTCGCAAGAACACCAGTGGCGAAGGCGGCTCACTGGCCCGGTACTGACGCTGAGGTGCGAAAGCGTGGGGAGCAAACAGGATTAGATACCCTGGTAGTCCACGCCGTAAACGATGGATGCTAGCCGTTGGTGGGTTTACCCTTCAGTGGCGCAGCTAACGCATTAAGCATCCCGCCTGGGGAGTACGGTCGCAAGATTA
>13C.100.28.07.NA_61seq1
TACGTAGGGTCCGAGCGTTGTCCGGAGTTACTGGGCGTAAAGCGCGCGTAGGCGGCGGTGCTGGCCCGGCGTGAAAGCCCCCGGCTCAACCGGGGAGGGTCGTCGGGGACCGCACCGCTTGAGGGCGGTAGGGGCTGGTGGAATGCCTGGTGTAGTGGTGAAATGCGTAGAG
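
Expanding the data reverses the earlier dereplication: each representative sequence is written once per read it stands for, and the alignment gap characters are then stripped from the sequence lines. The sketch below illustrates both steps on toy data, assuming the two-column names layout (representative, then comma-separated members); `deunique` and `strip_gaps` are hypothetical helpers, not the mothur or sed implementations.

```python
def deunique(unique_records, names_lines):
    """Yield one (name, sequence) pair per original read."""
    members = {}
    for line in names_lines:
        representative, all_names = line.rstrip("\n").split("\t")
        members[representative] = all_names.split(",")
    for representative, seq in unique_records:
        # Fall back to the representative alone if it has no names entry.
        for name in members.get(representative, [representative]):
            yield name, seq


def strip_gaps(lines):
    """Remove '-' and '.' from every line that does not contain '>'."""
    for line in lines:
        yield line if ">" in line else line.replace("-", "").replace(".", "")


unique = [("r1", "A-C."), ("r3", "GG")]
names = ["r1\tr1,r2", "r3\tr3"]
for name, seq in deunique(unique, names):
    print(">" + name)
    print(next(strip_gaps([seq])))
```

Header lines are left untouched, which matters here because the read names themselves contain dots.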

#### Total Number of QC'd sequences

In [35]:
!grep -c ">" /var/seq_data/priming_exp/data/finalQC.fasta

24928697
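
To see where reads were lost, the counts reported in this notebook can be compared step by step. The snippet below is a small worked calculation using only the three totals printed above; the step labels and the `step_losses` helper are illustrative names.

```python
# Read counts copied from the grep -c outputs in this notebook.
counts = [
    ("maxee filter", 25094893),
    ("no-N filter", 25094861),
    ("final QC", 24928697),
]


def step_losses(counts):
    """Return (label, reads_removed) for each step after the first."""
    return [
        (label, previous - current)
        for (_, previous), (label, current) in zip(counts, counts[1:])
    ]


for label, lost in step_losses(counts):
    print(label, lost)

retained = counts[-1][1] / counts[0][1]
print("fraction retained after maxee filtering: %.4f" % retained)
```

Nearly all of the loss after expected-error filtering comes from the alignment-based screening rather than from the N filter.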
