# One big biome table

Since I am hoping to compare the environmental patterns of wood and leaf endophytes, I need to combine these datasets early in the bioinformatics pipeline, to make them as comparable as possible. We'll try to stick to the [usearch (uparse)](http://drive5.com/usearch/) pipeline as much as possible.

------------

That was a year ago. 

Since then my car has been stolen, with my laptop in it, and the manuscripts based on these scripts have been submitted. Somehow, the versions of this notebook and the following analysis notebook that were on GitHub lost most of their graphic outputs: charts, maps and so on. While most of the important graphics were backed up, I lost some of the computationally expensive intermediate files needed to repopulate the notebooks.

So to make this notebook useable to reviewers and readers, I'll be picking through this process again, and the downstream analysis, if I don't go insane in the meantime. Maybe even if I do.

## Table of contents

[Work environment](#environment)  
[Rearranging barcodes](#rearrange)  
[Trimming reads](#trim)  
[Merging paired-end reads](#mergepairs) 
- [Leaf reads](#mergeLeaf)
- [Wood reads](#mergeWood)

[Visualizing merged read qualities](#visMerge)  
- [Make quality score charts](#makeQcharts)  
- [Leaf read qualities](#leafReadScores)
- [Wood read qualities](#woodReadScores)

[Quality filtering reads](#qf)  
[Convert fastq files to fasta format](#fastq2fasta)  
[Demultiplex leaf reads](#demult)  
[Clip primers](#clipPrimers)  
[Checking for chimeras](#chimeras)  
[Finding ITS1 region](#ITSx)  

[OTU clustering](#OTUclust)  
- [Dereplication and Sorting of reads](#derep)  
- [Cluster reads](#Cluster)
- [Assign unique names to OTU clusters](#addOTUtag)
- [Assign taxonomy](#assTax)
- [Make biom table](#makeBiom)


[Formatting Biom table and adding metadata](#formBiom)  
- [Change biom taxonomy metadata format](#formTax)
- [Add sample metadata](#addMetadata)




<a id='environment'><h3>Work environment</h3></a>

Working directory, on my machine:

In [2]:
cd /home/daniel/Documents/Taiwan_data/combined/combo_biome



We'll be using the [usearch (uparse)](http://drive5.com/usearch/) pipeline, version v8.0.1623_i86linux32, on the University of Oregon's Talapas computing cluster.

<a id='rearrange'><h3>Rearranging barcodes</h3></a>

We need to merge the paired-end sequences of the leaves and wood. But before we can do this, there are several steps. First, the leaf study reads use a split 6+6 bp barcode scheme for identifying reads, so the barcode halves need to be clipped from one read and combined on the other. I wrote a python script for this:

In [4]:
cat scripts/BCunsplit.py

#!/usr/bin/env python3

## lets try to take a two unpaired read files, cut out the bp from the reverse,
## and tack it onto the forward.
## have to preserve the fastq format so that pandaseq can do unsplit3.py forward_reads reverse_reads

#The first six bps and quality ratings of the reverse reads should be chopped off and placed after
#the first six bps of the forward reads and quality ratings. For use with fastq files. It will spit
#out two files, with the names: "rearranged_[your orignial forward and reverse read file names].fastq".

import itertools ##to let us jump around 
from sys import argv

script, forward_file, reverse_file = argv

forwardlabels=[]
reverselabels=[]
forwardreads=[]
forwardreadsq=[]
reversereads=[]
reversereadsq=[]
forwardBC=[]
forwardBCq=[]
reverseBC=[]
reverseBCq=[]

with open(forward_file) as foop:

	#labels:

	for h in itertools.islice(foop, 0, None, 4):
		forwardlabels.append(h)

##forward sequencies, barcodes: 

	foop.seek(0)
    
	for i in itertools.isli

I have details on how I used this [here](https://github.com/danchurch/taiwan_dada2/blob/master/dada2pipeline.ipynb). 

This outputs two files, "rearranged_Roo_R1.fastq" and "rearranged_Roo_R2.fastq". I did this in another directory, so we'll add some symlinks here for convenience:

In [8]:
## leaves
ln -s /home/daniel/Documents/taiwan/taiwan_dada2/rearranged_leafR1.fastq reLeafR1.fastq
ln -s /home/daniel/Documents/taiwan/taiwan_dada2/rearranged_leafR2.fastq reLeafR2.fastq
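
For readers who don't want to dig through the full script, the core of the barcode rearrangement can be sketched on a single read pair. This is a minimal illustration, not the script itself: the function name `rearrange_pair` and the default `bc_len=6` are my own shorthand for the 6+6 bp scheme described above.

```python
def rearrange_pair(fwd_seq, fwd_qual, rev_seq, rev_qual, bc_len=6):
    """Move the reverse-read barcode in behind the forward-read barcode.

    Hypothetical helper for illustration only. Bases and quality strings
    are sliced identically so they stay the same length.
    """
    if len(fwd_seq) != len(fwd_qual) or len(rev_seq) != len(rev_qual):
        raise ValueError("sequence and quality strings must match in length")
    # forward barcode + reverse barcode + rest of the forward read
    new_fwd_seq = fwd_seq[:bc_len] + rev_seq[:bc_len] + fwd_seq[bc_len:]
    new_fwd_qual = fwd_qual[:bc_len] + rev_qual[:bc_len] + fwd_qual[bc_len:]
    # the reverse read simply loses its barcode
    return (new_fwd_seq, new_fwd_qual), (rev_seq[bc_len:], rev_qual[bc_len:])
```

After this step the forward read starts with the full 12 bp barcode and the reverse read starts directly with amplicon sequence.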

<a id='trim'><h3>Trimming reads</h3></a>

Next we trim a little to make sure we are doing our alignments with high-quality base calls. The sites for trimming are decided by looking at the raw reads [(see below)](#visMerge), and finding where quality begins to drop off. 
To trim, we'll use the [FASTX-toolkit](http://hannonlab.cshl.edu/fastx_toolkit/).

Our wood reads are already demultiplexed, so we don't have a single forward and reverse read file for all of our wood samples, like we do above with the leaves. So let's make a script for this:

In [None]:
## trims.sh
#####################################################


## wood reads live here:
wooddir=/home/daniel/Documents/taiwan/woodreads/

## working directory is here:
cd /home/daniel/Documents/taiwan/taiwan_combined_biom

###### R1 reads:

## home for trimmed R1 wood reads here:
R1trimdir='/home/daniel/Documents/taiwan/taiwan_combined_biom/trimmed_wood/R1/'

## trim just the R1s, output to their new home with new filename:
for i in $wooddir*_R1_*; do
    echo $i
    out=$R1trimdir$(basename ${i/_001\.fastq/_trimmed\.fastq})
    fastx_trimmer -l 255 -i $i -o $out && echo $out 
done

###### R2 reads:

## home for trimmed R2 wood reads here:
R2trimdir='/home/daniel/Documents/taiwan/taiwan_combined_biom/trimmed_wood/R2/'

## trim just the R2s, output to their new home with new filename:
for j in $wooddir*_R2_*; do
    echo $j
    out=$R2trimdir$(basename ${j/_001\.fastq/_trimmed\.fastq})
    fastx_trimmer -l 210 -i $j -o $out && echo $out 
done

In [2]:
## leaves. These lengths were decided by Roo. They are all still in one pile:
fastx_trimmer -l 263 -i reLeafR1.fastq -o Roo_R1_trimmed.fastq
fastx_trimmer -l 170 -i reLeafR2.fastq -o Roo_R2_trimmed.fastq
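
The trim lengths above were picked by eye from the quality charts. A rough programmatic equivalent, sketched here under my own assumptions (Phred+33 quality encoding, an arbitrary mean-quality cutoff of 30, hypothetical function names), is to find the first position where the mean quality drops below a threshold and hand that number to `fastx_trimmer -l`, which keeps bases up to that position:

```python
def mean_quality_by_position(quals, offset=33):
    """Mean Phred score at each read position, given FASTQ quality strings."""
    totals, counts = [], []
    for qual in quals:
        for i, ch in enumerate(qual):
            if i == len(totals):
                totals.append(0)
                counts.append(0)
            totals[i] += ord(ch) - offset
            counts[i] += 1
    return [t / c for t, c in zip(totals, counts)]


def suggest_trim_length(quals, min_mean=30, offset=33):
    """Number of leading bases to keep before mean quality falls below min_mean."""
    means = mean_quality_by_position(quals, offset)
    for i, mean in enumerate(means):
        if mean < min_mean:
            return i
    return len(means)
```

This is only a starting point; the cutoff is arbitrary, and the reads still need enough length left to overlap when merged.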



<a id='mergepairs'><h3>Merging paired-end reads</h3></a> 

Unfortunately, I no longer have access to the 64-bit version of usearch 8.1. So let's use the 32-bit version:

In [7]:
usearch

usearch v8.1.1861_i86linux32, 4.0Gb RAM (8.0Gb total), 4 cores
(C) Copyright 2013-15 Robert C. Edgar, all rights reserved.
http://drive5.com/usearch

License: danchurchthomas@gmail.com



<a id='mergeLeaf'><h4>Leaf reads</h4></a>

The leaf files are too large to be handled by the 32-bit version. We don't want to demultiplex yet, because that would be messy with the split barcodes. So let's break up the leaf files into smaller ones and merge these.

In [None]:
cd /home/daniel/Documents/submissions/taibioinfo/taiwan_combined_biom/trim_leaves

for i in Roo_R*_*; do
    ls $i
    split -d -l 500000 $i ${i/fastq/split\.fastq}
done

Makes a lot of files:

In [10]:
cat leafSplitLs.txt

aln_3.txt
leafSplitLs.txt
Roo_R1_trimmed.split.fastq00
Roo_R1_trimmed.split.fastq01
Roo_R1_trimmed.split.fastq02
Roo_R1_trimmed.split.fastq03
Roo_R1_trimmed.split.fastq04
Roo_R1_trimmed.split.fastq05
Roo_R1_trimmed.split.fastq06
Roo_R1_trimmed.split.fastq07
Roo_R1_trimmed.split.fastq08
Roo_R1_trimmed.split.fastq09
Roo_R1_trimmed.split.fastq10
Roo_R1_trimmed.split.fastq11
Roo_R1_trimmed.split.fastq12
Roo_R1_trimmed.split.fastq13
Roo_R1_trimmed.split.fastq14
Roo_R1_trimmed.split.fastq15
Roo_R1_trimmed.split.fastq16
Roo_R1_trimmed.split.fastq17
Roo_R1_trimmed.split.fastq18
Roo_R1_trimmed.split.fastq19
Roo_R1_trimmed.split.fastq20
Roo_R1_trimmed.split.fastq21
Roo_R1_trimmed.split.fastq22
Roo_R1_trimmed.split.fastq23
Roo_R1_trimmed.split.fastq24
Roo_R1_trimmed.split.fastq25
Roo_R1_trimmed.split.fastq26
Roo_R1_trimmed.split.fastq27
Roo_R1_trimmed.split.fastq28
Roo_R1_trimmed.split.fastq29
Roo_R1_trimmed.split.fastq30
Roo_R1_trimmed.split.fastq31
Roo_R1_trimmed.split.fastq32
Roo_R1_trimmed.sp

Roo_R2_trimmed.split.fastq9040
Roo_R2_trimmed.split.fastq9041
Roo_R2_trimmed.split.fastq9042
Roo_R2_trimmed.split.fastq9043
Roo_R2_trimmed.split.fastq9044
Roo_R2_trimmed.split.fastq9045
Roo_R2_trimmed.split.fastq9046
Roo_R2_trimmed.split.fastq9047
Roo_R2_trimmed.split.fastq9048
Roo_R2_trimmed.split.fastq9049
Roo_R2_trimmed.split.fastq9050
Roo_R2_trimmed.split.fastq9051
Roo_R2_trimmed.split.fastq9052
Roo_R2_trimmed.split.fastq9053
Roo_R2_trimmed.split.fastq9054
Roo_R2_trimmed.split.fastq9055
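
Splitting with `split -l 500000` is safe here because a FASTQ record is four lines and 500000 is a multiple of four, so no record is cut in half. As long as R1 and R2 list their reads in the same order, chunk *n* of R1 still pairs with chunk *n* of R2. A minimal sketch of that logic, with a hypothetical helper name and an in-memory list standing in for the real files:

```python
def split_fastq_lines(lines, lines_per_chunk=500000):
    """Split FASTQ lines into chunks without breaking 4-line records."""
    if lines_per_chunk % 4 != 0:
        raise ValueError("lines_per_chunk must be a multiple of 4")
    return [lines[i:i + lines_per_chunk]
            for i in range(0, len(lines), lines_per_chunk)]
```

Every chunk then starts on a header line, which is what lets the merge step treat each pair of chunks as an ordinary pair of FASTQ files.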


Merge these:

In [None]:
for forward in *_R1_*; do
    #ls $forward
    reverse=${forward/_R1_/_R2_}
    ## oops, fix this (-alnout) if reused
    usearch -fastq_mergepairs $forward \
        -reverse $reverse \
        -fastq_maxdiffpct 40 \
        -alnout aln_3.txt \
        -report ../reports/$forward.report.txt \
        -fastqout ../merged/$forward.merged.fastq
    #ls $reverse
done
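
To make the merge step less of a black box, here is a toy version of what merging a read pair means: reverse-complement the reverse read, slide it against the end of the forward read, and join the two at the overlap with the fewest mismatches. This is a sketch under my own assumptions, not usearch's algorithm (which also uses the quality scores); the names `merge_pair` and `min_overlap` are hypothetical, and `max_diff_pct` only mimics my understanding of `-fastq_maxdiffpct` as the largest tolerated percentage of mismatches in the overlap.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Reverse complement of a DNA string (A, C, G, T, N only)."""
    return seq.translate(COMPLEMENT)[::-1]


def merge_pair(forward, reverse, min_overlap=10, max_diff_pct=40.0):
    """Merge a read pair at the overlap with the lowest mismatch percentage.

    Returns the merged sequence, or None if no overlap is good enough.
    """
    rc = reverse_complement(reverse)
    best = None  # (percent mismatches, overlap length)
    for olen in range(min_overlap, min(len(forward), len(rc)) + 1):
        diffs = sum(a != b for a, b in zip(forward[-olen:], rc[:olen]))
        pct = 100.0 * diffs / olen
        # on ties, prefer the longer overlap
        if pct <= max_diff_pct and (best is None or pct <= best[0]):
            best = (pct, olen)
    if best is None:
        return None
    return forward + rc[best[1]:]
```

A tolerance as loose as 40% will accept some spurious overlaps in a toy like this, which is one reason to follow merging with a separate quality filter.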

And put them back together.

In [None]:
cd /home/daniel/Documents/submissions/taibioinfo/taiwan_combined_biom/trim_leaves/merged

cat * > leaf_trimmed_merged.fastq

<a id='mergeWood'><h4>Wood reads</h4></a>

We don't have the same issue with the wood reads, because they are already demultiplexed, and the 32-bit version of usearch can handle these smaller files fine:

In [None]:
for forward in *fastq; do
    ls -l $forward
    reverse="../R2/${forward/R1/R2}"
    ## oops, fix this (-alnout) if reused
    usearch -fastq_mergepairs $forward \
        -reverse $reverse \
        -fastq_maxdiffpct 40 \
        -alnout aln_3.txt \
        -report ../reports/$forward.report.txt \
        -fastqout ./$forward.merged.fastq
    echo $reverse
done

<a id='visMerge'><h3>Visualizing merged read qualities</h3></a>

<a id='makeQcharts'><h4>Make quality score charts</h4></a>  
Let's make some charts of our read quality, using fastx tools. First, compile the stats on each basepair:

In [None]:
#!/usr/bin/env bash

## leaf reads

cd /home/daniel/Documents/submissions/taibioinfo/taiwan_combined_biom/qCharts/leaf

rawLeafReadsR1="/home/daniel/Documents/taiwan_supp/roo_reads/TaiwanFA_R1.fastq"
rawLeafReadsR2="/home/daniel/Documents/taiwan_supp/roo_reads/TaiwanFA_R2.fastq"
trimmedLeafReadsR1="/home/daniel/Documents/submissions/taibioinfo/taiwan_combined_biom/trim_leaves/Roo_R1_trimmed.fastq"
trimmedLeafReadsR2="/home/daniel/Documents/submissions/taibioinfo/taiwan_combined_biom/trim_leaves/Roo_R2_trimmed.fastq"
leafmerg="/home/daniel/Documents/submissions/taibioinfo/taiwan_combined_biom/trim_leaves/merged/leaf_trimmed_merged.fastq"

## leaf quality stats:
fastx_quality_stats -i $rawLeafReadsR1 -o rawLeafReadsR1_fastxstats.txt
fastx_quality_stats -i $rawLeafReadsR2 -o rawLeafReadsR2_fastxstats.txt
fastx_quality_stats -i $trimmedLeafReadsR1 -o trimmedLeafReadsR1_fastxstats.txt
fastx_quality_stats -i $trimmedLeafReadsR2 -o trimmedLeafReadsR2_fastxstats.txt
fastx_quality_stats -i $leafmerg -o leafmerged_fastxstats.txt

For the wood, to visualize the reads as a whole, we'll concatenate the per-sample files from each of the steps we've done so far:

In [None]:
cat /home/daniel/Documents/taiwan_supp/wood_reads/*R1* > /home/daniel/Documents/submissions/taibioinfo/taiwan_combined_biom/rawWoodReadsR1.fastq

cat /home/daniel/Documents/taiwan_supp/wood_reads/*R2* > /home/daniel/Documents/submissions/taibioinfo/taiwan_combined_biom/rawWoodReadsR2.fastq

cat /home/daniel/Documents/submissions/taibioinfo/taiwan_combined_biom/trimmed_wood/R1/* > /home/daniel/Documents/submissions/taibioinfo/taiwan_combined_biom/trimmedWoodR1.fastq

cat /home/daniel/Documents/submissions/taibioinfo/taiwan_combined_biom/trimmed_wood/R2/* > /home/daniel/Documents/submissions/taibioinfo/taiwan_combined_biom/trimmedWoodR2.fastq

cat /home/daniel/Documents/submissions/taibioinfo/taiwan_combined_biom/trimmed_wood/merged/* > /home/daniel/Documents/submissions/taibioinfo/taiwan_combined_biom/woodMerged.fastq


Compile the stats for these combined wood files:

In [None]:
#!/usr/bin/env bash

cd /home/daniel/Documents/submissions/taibioinfo/taiwan_combined_biom

fastx_quality_stats -i rawWoodReadsR1.fastq -o qCharts/wood/rawWoodReadsR1_fastxstats.txt
fastx_quality_stats -i rawWoodReadsR2.fastq -o qCharts/wood/rawWoodReadsR2_fastxstats.txt
fastx_quality_stats -i trimmedWoodR1.fastq -o qCharts/wood/trimmedWoodR1_fastxstats.txt
fastx_quality_stats -i trimmedWoodR2.fastq -o qCharts/wood/trimmedWoodR2_fastxstats.txt
fastx_quality_stats -i woodMerged.fastq -o qCharts/wood/woodMerged_fastxstats.txt

Then make the actual graphics.

In [None]:
cd /home/daniel/Documents/submissions/taibioinfo/taiwan_combined_biom/qCharts

cd leaf

for i in *; do
    ../dan_fastx_plot.sh -i $i -o ${i/\.txt/\.png}
done

cd ../wood

for i in *; do
    ../dan_fastx_plot.sh -i $i -o ${i/\.txt/\.png}
done

<a id='leafReadScores'><h4>Leaf read qualities</h4></a>

Forward raw leaf reads:
![](rawLeafReadsR1_fastxstats.png)

Reverse raw leaf reads:
![](rawLeafReadsR2_fastxstats.png)

Forward trimmed leaf reads:
![](trimmedLeafReadsR1_fastxstats.png)

Reverse trimmed leaf reads:
![](trimmedLeafReadsR2_fastxstats.png)

Merged leaf reads:
![](leafmerged_fastxstats.png)

<a id='woodReadScores'><h4>Wood read qualities</h4></a>

Wood raw forward reads:
![](rawWoodReadsR1_fastxstats.png)

Wood raw reverse reads:
![](rawWoodReadsR2_fastxstats.png)

Wood trimmed forward reads:
![](trimmedWoodR1_fastxstats.png)

Wood trimmed reverse reads:
![](trimmedWoodR2_fastxstats.png)

Wood merged reads:
![](woodMerged_fastxstats.png)

<a id="qf"><h3>Quality filtering reads</h3></a>

USEARCH does quite a bit of filtering in the merging process, I think, based on how many reads from our wood samples were removed. But let's also use the USEARCH filtering program on both wood and leaf reads.

<h4>Filter leaves</h4>

In [None]:
## leaves

cd /home/daniel/Documents/submissions/taibioinfo/taiwan_combined_biom/trim_leaves/merged

for i in *; do
    out=${i/.merged.fastq/\.merged\.filt\.fastq}
    #echo $i $out
    usearch -fastq_filter $i -fastq_maxee_rate .01 -fastqout "../filtered/"$out -notrunclabels &>> ../filtered/leaf_filterStdout.txt
done
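
For anyone unfamiliar with `-fastq_maxee_rate`: it filters on expected errors per base. Each quality score Q implies an error probability of 10^(-Q/10); the expected number of errors in a read is the sum of those probabilities, and the rate divides that sum by the read length. A minimal sketch of the calculation, assuming Phred+33 encoding and using hypothetical function names:

```python
def expected_errors(qual, offset=33):
    """Sum of per-base error probabilities for one FASTQ quality string."""
    return sum(10 ** (-(ord(ch) - offset) / 10) for ch in qual)


def passes_maxee_rate(qual, max_rate=0.01, offset=33):
    """True if expected errors per base do not exceed max_rate."""
    if not qual:
        return False
    return expected_errors(qual, offset) / len(qual) <= max_rate
```

With a cutoff of .01, a read passes if it averages no more than one expected error per 100 bases, so a 300 bp merged read can carry up to about 3 expected errors.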


In [4]:
cat leaf_filterStdout.txt

usearch v8.1.1861_i86linux32, 4.0Gb RAM (12.1Gb total), 4 cores
(C) Copyright 2013-15 Robert C. Edgar, all rights reserved.
http://drive5.com/usearch

License: danchurchthomas@gmail.com

00:01  32Mb  100.0% Filtering, 97.5% passed
    102021  FASTQ recs (102.0k)            
     99492  Converted (99.5k, 97.5%)
usearch v8.1.1861_i86linux32, 4.0Gb RAM (12.1Gb total), 4 cores
(C) Copyright 2013-15 Robert C. Edgar, all rights reserved.
http://drive5.com/usearch

License: danchurchthomas@gmail.com

00:01  32Mb  100.0% Filtering, 98.2% passed
    104711  FASTQ recs (104.7k)            
    102793  Converted (102.8k, 98.2%)
usearch v8.1.1861_i86linux32, 4.0Gb RAM (12.1Gb total), 4 cores
(C) Copyright 2013-15 Robert C. Edgar, all rights reserved.
http://drive5.com/usearch

License: danchurchthomas@gmail.com

00:00  32Mb  100.0% Filtering, 98.1% passed
    104701  FASTQ recs (104.7k)            
    102708  Converted (102.7k, 98.1%)
usearch v8.1.1861_i86linux32, 4.0Gb RAM (12.1Gb total), 4 core

http://drive5.com/usearch

License: danchurchthomas@gmail.com

00:01  32Mb  100.0% Filtering, 97.6% passed
    102999  FASTQ recs (103.0k)            
    100516  Converted (100.5k, 97.6%)
usearch v8.1.1861_i86linux32, 4.0Gb RAM (12.1Gb total), 4 cores
(C) Copyright 2013-15 Robert C. Edgar, all rights reserved.
http://drive5.com/usearch

License: danchurchthomas@gmail.com

00:00  32Mb  100.0% Filtering, 97.1% passed
    100736  FASTQ recs (100.7k)            
     97824  Converted (97.8k, 97.1%)
usearch v8.1.1861_i86linux32, 4.0Gb RAM (12.1Gb total), 4 cores
(C) Copyright 2013-15 Robert C. Edgar, all rights reserved.
http://drive5.com/usearch

License: danchurchthomas@gmail.com

00:01  32Mb  100.0% Filtering, 98.2% passed
    105062  FASTQ recs (105.1k)            
    103173  Converted (103.2k, 98.2%)
usearch v8.1.1861_i86linux32, 4.0Gb RAM (12.1Gb total), 4 cores
(C) Copyright 2013-15 Robert C. Edgar, all rights reserved.
http://drive5.com/usearch

License: danchurchthomas@gmail.com



00:02  32Mb  100.0% Filtering, 96.8% passed
    104980  FASTQ recs (105.0k)            
    101625  Converted (101.6k, 96.8%)
usearch v8.1.1861_i86linux32, 4.0Gb RAM (12.1Gb total), 4 cores
(C) Copyright 2013-15 Robert C. Edgar, all rights reserved.
http://drive5.com/usearch

License: danchurchthomas@gmail.com

00:02  32Mb  100.0% Filtering, 96.4% passed
    103707  FASTQ recs (103.7k)            
     99960  Converted (100.0k, 96.4%)
usearch v8.1.1861_i86linux32, 4.0Gb RAM (12.1Gb total), 4 cores
(C) Copyright 2013-15 Robert C. Edgar, all rights reserved.
http://drive5.com/usearch

License: danchurchthomas@gmail.com

00:02  32Mb  100.0% Filtering, 94.5% passed 
     98488  FASTQ recs (98.5k)             
     93055  Converted (93.1k, 94.5%)
usearch v8.1.1861_i86linux32, 4.0Gb RAM (12.1Gb total), 4 cores
(C) Copyright 2013-15 Robert C. Edgar, all rights reserved.
http://drive5.com/usearch

License: danchurchthomas@gmail.com

00:02  32Mb  100.0% Filtering, 96.5% passed
    103949  FAST

     99443  Converted (99.4k, 98.2%)
usearch v8.1.1861_i86linux32, 4.0Gb RAM (12.1Gb total), 4 cores
(C) Copyright 2013-15 Robert C. Edgar, all rights reserved.
http://drive5.com/usearch

License: danchurchthomas@gmail.com

00:01  32Mb  100.0% Filtering, 98.2% passed
    102558  FASTQ recs (102.6k)            
    100694  Converted (100.7k, 98.2%)
usearch v8.1.1861_i86linux32, 4.0Gb RAM (12.1Gb total), 4 cores
(C) Copyright 2013-15 Robert C. Edgar, all rights reserved.
http://drive5.com/usearch

License: danchurchthomas@gmail.com

00:02  32Mb  100.0% Filtering, 97.6% passed
     98963  FASTQ recs (99.0k)             
     96545  Converted (96.5k, 97.6%)
usearch v8.1.1861_i86linux32, 4.0Gb RAM (12.1Gb total), 4 cores
(C) Copyright 2013-15 Robert C. Edgar, all rights reserved.
http://drive5.com/usearch

License: danchurchthomas@gmail.com

00:01  32Mb  100.0% Filtering, 97.4% passed
     98644  FASTQ recs (98.6k)             
     96079  Converted (96.1k, 97.4%)
usearch v8.1.1861_i86linux

(C) Copyright 2013-15 Robert C. Edgar, all rights reserved.
http://drive5.com/usearch

License: danchurchthomas@gmail.com

00:02  32Mb  100.0% Filtering, 98.7% passed 
    102947  FASTQ recs (102.9k)            
    101629  Converted (101.6k, 98.7%)
usearch v8.1.1861_i86linux32, 4.0Gb RAM (12.1Gb total), 4 cores
(C) Copyright 2013-15 Robert C. Edgar, all rights reserved.
http://drive5.com/usearch

License: danchurchthomas@gmail.com

00:02  32Mb  100.0% Filtering, 98.8% passed
    103653  FASTQ recs (103.7k)            
    102394  Converted (102.4k, 98.8%)
usearch v8.1.1861_i86linux32, 4.0Gb RAM (12.1Gb total), 4 cores
(C) Copyright 2013-15 Robert C. Edgar, all rights reserved.
http://drive5.com/usearch

License: danchurchthomas@gmail.com

00:01  32Mb  100.0% Filtering, 98.2% passed
     98820  FASTQ recs (98.8k)             
     97054  Converted (97.1k, 98.2%)
usearch v8.1.1861_i86linux32, 4.0Gb RAM (12.1Gb total), 4 cores
(C) Copyright 2013-15 Robert C. Edgar, all rights reserved.
h

http://drive5.com/usearch

License: danchurchthomas@gmail.com

00:02  32Mb  100.0% Filtering, 97.8% passed 
     99553  FASTQ recs (99.6k)             
     97330  Converted (97.3k, 97.8%)
usearch v8.1.1861_i86linux32, 4.0Gb RAM (12.1Gb total), 4 cores
(C) Copyright 2013-15 Robert C. Edgar, all rights reserved.
http://drive5.com/usearch

License: danchurchthomas@gmail.com

00:01  32Mb  100.0% Filtering, 98.0% passed
    100658  FASTQ recs (100.7k)            
     98680  Converted (98.7k, 98.0%)
usearch v8.1.1861_i86linux32, 4.0Gb RAM (12.1Gb total), 4 cores
(C) Copyright 2013-15 Robert C. Edgar, all rights reserved.
http://drive5.com/usearch

License: danchurchthomas@gmail.com

00:02  32Mb  100.0% Filtering, 97.7% passed
     99534  FASTQ recs (99.5k)             
     97261  Converted (97.3k, 97.7%)
usearch v8.1.1861_i86linux32, 4.0Gb RAM (12.1Gb total), 4 cores
(C) Copyright 2013-15 Robert C. Edgar, all rights reserved.
http://drive5.com/usearch

License: danchurchthomas@gmail.com



<h4>Filter wood</h4>

In [None]:
## wood
cd /home/daniel/Documents/submissions/taibioinfo/taiwan_combined_biom/trimmed_wood/merged

for i in *; do
    out=${i/.fastq.merged.fastq/\.merge\.filt\.fastq}
    usearch -fastq_filter $i -fastq_maxee_rate .01 -fastqout $out -notrunclabels &>> wood_filterStdout.txt
done


In [2]:
cat wood_filterStdout.txt

usearch v8.1.1861_i86linux32, 4.0Gb RAM (12.1Gb total), 4 cores
(C) Copyright 2013-15 Robert C. Edgar, all rights reserved.
http://drive5.com/usearch

License: danchurchthomas@gmail.com

00:00  32Mb  100.0% Filtering, 100.0% passed
     61717  FASTQ recs (61.7k)              
     61717  Converted (61.7k, 100.0%)
usearch v8.1.1861_i86linux32, 4.0Gb RAM (12.1Gb total), 4 cores
(C) Copyright 2013-15 Robert C. Edgar, all rights reserved.
http://drive5.com/usearch

License: danchurchthomas@gmail.com

00:00  32Mb  100.0% Filtering, 100.0% passed
     44977  FASTQ recs (45.0k)              
     44975  Converted (45.0k, 100.0%)
usearch v8.1.1861_i86linux32, 4.0Gb RAM (12.1Gb total), 4 cores
(C) Copyright 2013-15 Robert C. Edgar, all rights reserved.
http://drive5.com/usearch

License: danchurchthomas@gmail.com

00:01  32Mb  100.0% Filtering, 99.8% passed
     37537  FASTQ recs (37.5k)             
     37462  Converted (37.5k, 99.8%)
usearch v8.1.1861_i86linux32, 4.0Gb RAM (12.1Gb total), 4 

(C) Copyright 2013-15 Robert C. Edgar, all rights reserved.
http://drive5.com/usearch

License: danchurchthomas@gmail.com

00:00  32Mb  100.0% Filtering, 99.9% passed
     43527  FASTQ recs (43.5k)             
     43484  Converted (43.5k, 99.9%)
usearch v8.1.1861_i86linux32, 4.0Gb RAM (12.1Gb total), 4 cores
(C) Copyright 2013-15 Robert C. Edgar, all rights reserved.
http://drive5.com/usearch

License: danchurchthomas@gmail.com

00:01  32Mb  100.0% Filtering, 100.0% passed
     56663  FASTQ recs (56.7k)              
     56656  Converted (56.7k, 100.0%)
usearch v8.1.1861_i86linux32, 4.0Gb RAM (12.1Gb total), 4 cores
(C) Copyright 2013-15 Robert C. Edgar, all rights reserved.
http://drive5.com/usearch

License: danchurchthomas@gmail.com

00:00  32Mb  100.0% Filtering, 100.0% passed
     55294  FASTQ recs (55.3k)              
     55293  Converted (55.3k, 100.0%)
usearch v8.1.1861_i86linux32, 4.0Gb RAM (12.1Gb total), 4 cores
(C) Copyright 2013-15 Robert C. Edgar, all rights reserved

http://drive5.com/usearch

License: danchurchthomas@gmail.com

00:00  32Mb  100.0% Filtering, 100.0% passed
     42057  FASTQ recs (42.1k)              
     42053  Converted (42.1k, 100.0%)
usearch v8.1.1861_i86linux32, 4.0Gb RAM (12.1Gb total), 4 cores
(C) Copyright 2013-15 Robert C. Edgar, all rights reserved.
http://drive5.com/usearch

License: danchurchthomas@gmail.com

00:01  32Mb  100.0% Filtering, 100.0% passed
     38095  FASTQ recs (38.1k)              
     38085  Converted (38.1k, 100.0%)
usearch v8.1.1861_i86linux32, 4.0Gb RAM (12.1Gb total), 4 cores
(C) Copyright 2013-15 Robert C. Edgar, all rights reserved.
http://drive5.com/usearch

License: danchurchthomas@gmail.com

00:01  32Mb  100.0% Filtering, 100.0% passed
     34455  FASTQ recs (34.5k)              
     34443  Converted (34.4k, 100.0%)
usearch v8.1.1861_i86linux32, 4.0Gb RAM (12.1Gb total), 4 cores
(C) Copyright 2013-15 Robert C. Edgar, all rights reserved.
http://drive5.com/usearch

License: danchurchthomas@gma


License: danchurchthomas@gmail.com

00:01  32Mb  100.0% Filtering, 100.0% passed
     35103  FASTQ recs (35.1k)              
     35091  Converted (35.1k, 100.0%)
usearch v8.1.1861_i86linux32, 4.0Gb RAM (12.1Gb total), 4 cores
(C) Copyright 2013-15 Robert C. Edgar, all rights reserved.
http://drive5.com/usearch

License: danchurchthomas@gmail.com

00:00  32Mb  100.0% Filtering, 99.9% passed
     26257  FASTQ recs (26.3k)             
     26224  Converted (26.2k, 99.9%)
usearch v8.1.1861_i86linux32, 4.0Gb RAM (12.1Gb total), 4 cores
(C) Copyright 2013-15 Robert C. Edgar, all rights reserved.
http://drive5.com/usearch

License: danchurchthomas@gmail.com

00:01  32Mb  100.0% Filtering, 99.9% passed 
     36638  FASTQ recs (36.6k)             
     36597  Converted (36.6k, 99.9%)
usearch v8.1.1861_i86linux32, 4.0Gb RAM (12.1Gb total), 4 cores
(C) Copyright 2013-15 Robert C. Edgar, all rights reserved.
http://drive5.com/usearch

License: danchurchthomas@gmail.com

00:00  32Mb  100.0% Fil
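
Both filter logs are long, so rather than reading every line it can help to total them up. Here is a small sketch that sums the `FASTQ recs` and `Converted` counts from a usearch filter log. The function name and the regular expressions are my own, written against the line layout shown in the output above, so they may need adjusting for other usearch versions.

```python
import re

# Patterns follow the log lines printed above, e.g.
# "    102021  FASTQ recs (102.0k)" and "     99492  Converted (99.5k, 97.5%)"
RECS = re.compile(r"^\s*(\d+)\s+FASTQ recs")
CONVERTED = re.compile(r"^\s*(\d+)\s+Converted")


def tally_filter_log(text):
    """Return (total reads in, total reads passed) summed over a filter log."""
    total = passed = 0
    for line in text.splitlines():
        match = RECS.match(line)
        if match:
            total += int(match.group(1))
            continue
        match = CONVERTED.match(line)
        if match:
            passed += int(match.group(1))
    return total, passed
```

Reading `leaf_filterStdout.txt` or `wood_filterStdout.txt` into a string and passing it to `tally_filter_log` gives the overall counts, from which the overall pass rate is just `passed / total`.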

<a id="fastq2fasta"><h3>Convert fastq files to fasta format</h3></a>

Let's use [BBMap tools](https://sourceforge.net/projects/bbmap/) to do the conversion. The FASTX-toolkit has something for this also, but FASTX is a little brittle when dealing with FASTQ files that use modern Illumina quality-score encodings, and sometimes generates funny errors. 

<h4>Leaf reads</h4>

In [None]:
## drop bbtools into a nearby directory. Java, so can't really put in bin folders. 
bb=/home/daniel/Documents/submissions/taibioinfo/taiwan_combined_biom/bbmap

## just one large leaf file
$bb/reformat.sh in=leaf_merged_filt.fastq \
    out=leaf_merged_filt.fasta \
    fastawrap=0 \
    &> makeLeafFasta.txt

Outputs from bbmap for this:

In [10]:
cat makeLeafFasta.txt

java -ea -Xmx200m -cp /home/daniel/Documents/submissions/taibioinfo/taiwan_combined_biom/bbmap/current/ jgi.ReformatReads in=leaf_merged_filt.fastq out=leaf_merged_filt.fasta fastawrap=0
Executing jgi.ReformatReads [in=leaf_merged_filt.fastq, out=leaf_merged_filt.fasta, fastawrap=0]

Input is being processed as unpaired
Input:                  	14372164 reads          	4600246850 bases
Output:                 	14372164 reads (100.00%) 	4600246850 bases (100.00%)

Time:                         	151.584 seconds.
Reads Processed:      14372k 	94.81k reads/sec
Bases Processed:       4600m 	30.35m bases/sec
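
The conversion itself is simple: keep the header and sequence lines of each record, and drop the `+` and quality lines. A minimal sketch, assuming plain four-line FASTQ records with unwrapped sequences (the function name is hypothetical, and this is not how BBMap implements it):

```python
def fastq_to_fasta(lines):
    """Convert 4-line FASTQ records to unwrapped FASTA lines."""
    lines = [line.rstrip("\n") for line in lines]
    if len(lines) % 4 != 0:
        raise ValueError("expected a multiple of 4 lines")
    fasta = []
    for i in range(0, len(lines), 4):
        header, seq = lines[i], lines[i + 1]
        if not header.startswith("@"):
            raise ValueError("record %d does not start with '@'" % (i // 4))
        fasta.append(">" + header[1:])  # swap the '@' for '>'
        fasta.append(seq)               # one line per sequence, no wrapping
    return fasta
```

Each sequence stays on a single line here, which is what I take `fastawrap=0` to be doing in the BBMap call above.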


<h4>Wood reads</h4>

In [5]:
## wood to fasta, lots of smaller files:
wfd=/home/daniel/Documents/submissions/taibioinfo/taiwan_combined_biom/trimmed_wood/woodFasta/

cd /home/daniel/Documents/submissions/taibioinfo/taiwan_combined_biom/trimmed_wood/filtered

for i in *; do
    $bb/reformat.sh in=$i \
        out=$wfd${i/_R1_trimmed.merge.filt.fastq/.fasta} \
        fastawrap=0 \
        &>> ../../woodFastaStdout.txt
done

In [11]:
cat woodFastaStdout.txt

java -ea -Xmx200m -cp /home/daniel/Documents/submissions/taibioinfo/taiwan_combined_biom/bbmap/current/ jgi.ReformatReads in=lane1-s160-index-AAGCACTG-GTGATCCANNNN-Dc-X_S160_L001_R1_trimmed.merge.filt.fastq out=/home/daniel/Documents/submissions/taibioinfo/taiwan_combined_biom/trimmed_wood/woodFasta/lane1-s160-index-AAGCACTG-GTGATCCANNNN-Dc-X_S160_L001.fasta fastawrap=0
Executing jgi.ReformatReads [in=lane1-s160-index-AAGCACTG-GTGATCCANNNN-Dc-X_S160_L001_R1_trimmed.merge.filt.fastq, out=/home/daniel/Documents/submissions/taibioinfo/taiwan_combined_biom/trimmed_wood/woodFasta/lane1-s160-index-AAGCACTG-GTGATCCANNNN-Dc-X_S160_L001.fasta, fastawrap=0]

Input is being processed as unpaired
Input:                  	61717 reads          	15788419 bases
Output:                 	61717 reads (100.00%) 	15788419 bases (100.00%)

Time:                         	0.638 seconds.
Reads Processed:       61717 	96.76k reads/sec
Bases Processed:      15788k 	24.75m bases/sec
java -ea -Xmx200m -cp /home/da


Input is being processed as unpaired
Input:                  	29049 reads          	7036714 bases
Output:                 	29049 reads (100.00%) 	7036714 bases (100.00%)

Time:                         	0.284 seconds.
Reads Processed:       29049 	102.32k reads/sec
Bases Processed:       7036k 	24.79m bases/sec
java -ea -Xmx200m -cp /home/daniel/Documents/submissions/taibioinfo/taiwan_combined_biom/bbmap/current/ jgi.ReformatReads in=lane1-s169-index-GAGGACTT-GACACAGTNNNN-7w_S169_L001_R1_trimmed.merge.filt.fastq out=/home/daniel/Documents/submissions/taibioinfo/taiwan_combined_biom/trimmed_wood/woodFasta/lane1-s169-index-GAGGACTT-GACACAGTNNNN-7w_S169_L001.fasta fastawrap=0
Executing jgi.ReformatReads [in=lane1-s169-index-GAGGACTT-GACACAGTNNNN-7w_S169_L001_R1_trimmed.merge.filt.fastq, out=/home/daniel/Documents/submissions/taibioinfo/taiwan_combined_biom/trimmed_wood/woodFasta/lane1-s169-index-GAGGACTT-GACACAGTNNNN-7w_S169_L001.fasta, fastawrap=0]

Input is being processed as unpaired
I

Executing jgi.ReformatReads [in=lane1-s177-index-GAGGACTT-GCATAACGNNNN-17w_S177_L001_R1_trimmed.merge.filt.fastq, out=/home/daniel/Documents/submissions/taibioinfo/taiwan_combined_biom/trimmed_wood/woodFasta/lane1-s177-index-GAGGACTT-GCATAACGNNNN-17w_S177_L001.fasta, fastawrap=0]

Input is being processed as unpaired
Input:                  	49393 reads          	11925772 bases
Output:                 	49393 reads (100.00%) 	11925772 bases (100.00%)

Time:                         	0.364 seconds.
Reads Processed:       49393 	135.73k reads/sec
Bases Processed:      11925k 	32.77m bases/sec
java -ea -Xmx200m -cp /home/daniel/Documents/submissions/taibioinfo/taiwan_combined_biom/bbmap/current/ jgi.ReformatReads in=lane1-s178-index-ACCATCCA-GCATAACGNNNN-18w_S178_L001_R1_trimmed.merge.filt.fastq out=/home/daniel/Documents/submissions/taibioinfo/taiwan_combined_biom/trimmed_wood/woodFasta/lane1-s178-index-ACCATCCA-GCATAACGNNNN-18w_S178_L001.fasta fastawrap=0
Executing jgi.ReformatReads [in=l

Reads Processed:       43556 	121.95k reads/sec
Bases Processed:      10767k 	30.15m bases/sec
java -ea -Xmx200m -cp /home/daniel/Documents/submissions/taibioinfo/taiwan_combined_biom/bbmap/current/ jgi.ReformatReads in=lane1-s186-index-ACCATCCA-ACAGAGGTNNNN-28w_S186_L001_R1_trimmed.merge.filt.fastq out=/home/daniel/Documents/submissions/taibioinfo/taiwan_combined_biom/trimmed_wood/woodFasta/lane1-s186-index-ACCATCCA-ACAGAGGTNNNN-28w_S186_L001.fasta fastawrap=0
Executing jgi.ReformatReads [in=lane1-s186-index-ACCATCCA-ACAGAGGTNNNN-28w_S186_L001_R1_trimmed.merge.filt.fastq, out=/home/daniel/Documents/submissions/taibioinfo/taiwan_combined_biom/trimmed_wood/woodFasta/lane1-s186-index-ACCATCCA-ACAGAGGTNNNN-28w_S186_L001.fasta, fastawrap=0]

Input is being processed as unpaired
Input:                  	43484 reads          	11231013 bases
Output:                 	43484 reads (100.00%) 	11231013 bases (100.00%)

Time:                         	0.372 seconds.
Reads Processed:       43484 	116

Executing jgi.ReformatReads [in=lane1-s194-index-ACCATCCA-CCACTAAGNNNN-38w_S194_L001_R1_trimmed.merge.filt.fastq, out=/home/daniel/Documents/submissions/taibioinfo/taiwan_combined_biom/trimmed_wood/woodFasta/lane1-s194-index-ACCATCCA-CCACTAAGNNNN-38w_S194_L001.fasta, fastawrap=0]

Input is being processed as unpaired
Input:                  	36142 reads          	9145609 bases
Output:                 	36142 reads (100.00%) 	9145609 bases (100.00%)

Time:                         	0.317 seconds.
Reads Processed:       36142 	113.86k reads/sec
Bases Processed:       9145k 	28.81m bases/sec
java -ea -Xmx200m -cp /home/daniel/Documents/submissions/taibioinfo/taiwan_combined_biom/bbmap/current/ jgi.ReformatReads in=lane1-s195-index-CAACACCT-CCACTAAGNNNN-39w_S195_L001_R1_trimmed.merge.filt.fastq out=/home/daniel/Documents/submissions/taibioinfo/taiwan_combined_biom/trimmed_wood/woodFasta/lane1-s195-index-CAACACCT-CCACTAAGNNNN-39w_S195_L001.fasta fastawrap=0
Executing jgi.ReformatReads [in=lan

Reads Processed:       34837 	108.75k reads/sec
Bases Processed:       8267k 	25.81m bases/sec
java -ea -Xmx200m -cp /home/daniel/Documents/submissions/taibioinfo/taiwan_combined_biom/bbmap/current/ jgi.ReformatReads in=lane1-s203-index-CAACACCT-TGTTCCGTNNNN-56w_S203_L001_R1_trimmed.merge.filt.fastq out=/home/daniel/Documents/submissions/taibioinfo/taiwan_combined_biom/trimmed_wood/woodFasta/lane1-s203-index-CAACACCT-TGTTCCGTNNNN-56w_S203_L001.fasta fastawrap=0
Executing jgi.ReformatReads [in=lane1-s203-index-CAACACCT-TGTTCCGTNNNN-56w_S203_L001_R1_trimmed.merge.filt.fastq, out=/home/daniel/Documents/submissions/taibioinfo/taiwan_combined_biom/trimmed_wood/woodFasta/lane1-s203-index-CAACACCT-TGTTCCGTNNNN-56w_S203_L001.fasta, fastawrap=0]

Input is being processed as unpaired
Input:                  	43974 reads          	10301118 bases
Output:                 	43974 reads (100.00%) 	10301118 bases (100.00%)

Time:                         	0.373 seconds.
Reads Processed:       43974 	117


Input is being processed as unpaired
Input:                  	8913 reads          	2016016 bases
Output:                 	8913 reads (100.00%) 	2016016 bases (100.00%)

Time:                         	0.234 seconds.
Reads Processed:        8913 	38.04k reads/sec
Bases Processed:       2016k 	8.60m bases/sec
java -ea -Xmx200m -cp /home/daniel/Documents/submissions/taibioinfo/taiwan_combined_biom/bbmap/current/ jgi.ReformatReads in=lane1-s212-index-CTAGGTGA-AGCCGTAANNNN-68w_S212_L001_R1_trimmed.merge.filt.fastq out=/home/daniel/Documents/submissions/taibioinfo/taiwan_combined_biom/trimmed_wood/woodFasta/lane1-s212-index-CTAGGTGA-AGCCGTAANNNN-68w_S212_L001.fasta fastawrap=0
Executing jgi.ReformatReads [in=lane1-s212-index-CTAGGTGA-AGCCGTAANNNN-68w_S212_L001_R1_trimmed.merge.filt.fastq, out=/home/daniel/Documents/submissions/taibioinfo/taiwan_combined_biom/trimmed_wood/woodFasta/lane1-s212-index-CTAGGTGA-AGCCGTAANNNN-68w_S212_L001.fasta, fastawrap=0]

Input is being processed as unpaired
I

Executing jgi.ReformatReads [in=lane1-s220-index-CTAGGTGA-CTCCTGAANNNN-76w_S220_L001_R1_trimmed.merge.filt.fastq, out=/home/daniel/Documents/submissions/taibioinfo/taiwan_combined_biom/trimmed_wood/woodFasta/lane1-s220-index-CTAGGTGA-CTCCTGAANNNN-76w_S220_L001.fasta, fastawrap=0]

Input is being processed as unpaired
Input:                  	42984 reads          	11182364 bases
Output:                 	42984 reads (100.00%) 	11182364 bases (100.00%)

Time:                         	0.371 seconds.
Reads Processed:       42984 	115.80k reads/sec
Bases Processed:      11182k 	30.13m bases/sec
java -ea -Xmx200m -cp /home/daniel/Documents/submissions/taibioinfo/taiwan_combined_biom/bbmap/current/ jgi.ReformatReads in=lane1-s221-index-ACGACTTG-CTCCTGAANNNN-79w_S221_L001_R1_trimmed.merge.filt.fastq out=/home/daniel/Documents/submissions/taibioinfo/taiwan_combined_biom/trimmed_wood/woodFasta/lane1-s221-index-ACGACTTG-CTCCTGAANNNN-79w_S221_L001.fasta fastawrap=0
Executing jgi.ReformatReads [in=l

Reads Processed:       62721 	139.92k reads/sec
Bases Processed:      16050k 	35.81m bases/sec
java -ea -Xmx200m -cp /home/daniel/Documents/submissions/taibioinfo/taiwan_combined_biom/bbmap/current/ jgi.ReformatReads in=lane1-s229-index-ACGACTTG-ACGAATCCNNNN-89w_S229_L001_R1_trimmed.merge.filt.fastq out=/home/daniel/Documents/submissions/taibioinfo/taiwan_combined_biom/trimmed_wood/woodFasta/lane1-s229-index-ACGACTTG-ACGAATCCNNNN-89w_S229_L001.fasta fastawrap=0
Executing jgi.ReformatReads [in=lane1-s229-index-ACGACTTG-ACGAATCCNNNN-89w_S229_L001_R1_trimmed.merge.filt.fastq, out=/home/daniel/Documents/submissions/taibioinfo/taiwan_combined_biom/trimmed_wood/woodFasta/lane1-s229-index-ACGACTTG-ACGAATCCNNNN-89w_S229_L001.fasta, fastawrap=0]

Input is being processed as unpaired
Input:                  	49122 reads          	12916117 bases
Output:                 	49122 reads (100.00%) 	12916117 bases (100.00%)

Time:                         	0.389 seconds.
Reads Processed:       49122 	126


Input is being processed as unpaired
Input:                  	24703 reads          	6149228 bases
Output:                 	24703 reads (100.00%) 	6149228 bases (100.00%)

Time:                         	0.278 seconds.
Reads Processed:       24703 	88.84k reads/sec
Bases Processed:       6149k 	22.12m bases/sec
java -ea -Xmx200m -cp /home/daniel/Documents/submissions/taibioinfo/taiwan_combined_biom/bbmap/current/ jgi.ReformatReads in=lane1-s238-index-GCATACAG-AATGGTCGNNNN-101w_S238_L001_R1_trimmed.merge.filt.fastq out=/home/daniel/Documents/submissions/taibioinfo/taiwan_combined_biom/trimmed_wood/woodFasta/lane1-s238-index-GCATACAG-AATGGTCGNNNN-101w_S238_L001.fasta fastawrap=0
Executing jgi.ReformatReads [in=lane1-s238-index-GCATACAG-AATGGTCGNNNN-101w_S238_L001_R1_trimmed.merge.filt.fastq, out=/home/daniel/Documents/submissions/taibioinfo/taiwan_combined_biom/trimmed_wood/woodFasta/lane1-s238-index-GCATACAG-AATGGTCGNNNN-101w_S238_L001.fasta, fastawrap=0]

Input is being processed as unp

Executing jgi.ReformatReads [in=lane1-s246-index-GCATACAG-CGCTACATNNNN-115w_S246_L001_R1_trimmed.merge.filt.fastq, out=/home/daniel/Documents/submissions/taibioinfo/taiwan_combined_biom/trimmed_wood/woodFasta/lane1-s246-index-GCATACAG-CGCTACATNNNN-115w_S246_L001.fasta, fastawrap=0]

Input is being processed as unpaired
Input:                  	30651 reads          	7604374 bases
Output:                 	30651 reads (100.00%) 	7604374 bases (100.00%)

Time:                         	0.331 seconds.
Reads Processed:       30651 	92.52k reads/sec
Bases Processed:       7604k 	22.96m bases/sec
java -ea -Xmx200m -cp /home/daniel/Documents/submissions/taibioinfo/taiwan_combined_biom/bbmap/current/ jgi.ReformatReads in=lane1-s247-index-TGCGAACT-CGCTACATNNNN-121w_S247_L001_R1_trimmed.merge.filt.fastq out=/home/daniel/Documents/submissions/taibioinfo/taiwan_combined_biom/trimmed_wood/woodFasta/lane1-s247-index-TGCGAACT-CGCTACATNNNN-121w_S247_L001.fasta fastawrap=0
Executing jgi.ReformatReads [in=


Time:                         	0.560 seconds.
Reads Processed:       37523 	67.01k reads/sec
Bases Processed:       9979k 	17.82m bases/sec
java -ea -Xmx200m -cp /home/daniel/Documents/submissions/taibioinfo/taiwan_combined_biom/bbmap/current/ jgi.ReformatReads in=lane1-s255-index-TGCGAACT-CCTAAGTCNNNN-Neg_S255_L001_R1_trimmed.merge.filt.fastq out=/home/daniel/Documents/submissions/taibioinfo/taiwan_combined_biom/trimmed_wood/woodFasta/lane1-s255-index-TGCGAACT-CCTAAGTCNNNN-Neg_S255_L001.fasta fastawrap=0
Executing jgi.ReformatReads [in=lane1-s255-index-TGCGAACT-CCTAAGTCNNNN-Neg_S255_L001_R1_trimmed.merge.filt.fastq, out=/home/daniel/Documents/submissions/taibioinfo/taiwan_combined_biom/trimmed_wood/woodFasta/lane1-s255-index-TGCGAACT-CCTAAGTCNNNN-Neg_S255_L001.fasta, fastawrap=0]

Input is being processed as unpaired
Input:                  	1514 reads          	371147 bases
Output:                 	1514 reads (100.00%) 	371147 bases (100.00%)

Time:                         	0.156 se

<a id="demult"><h3>Demultiplex leaf reads</h3></a>

The leaf reads are in one massive file that needs to be taken apart and organized by the sample barcodes, which are the first 12 bps of each read, after we did [some rearranging above](#rearrange). 

In [None]:
ld="/home/daniel/Documents/submissions/taibioinfo/taiwan_combined_biom/trim_leaves/demult/"

(cat leaf_merged_filt.fasta | fastx_barcode_splitter.pl \
    --bcfile leafread_fastx_map.txt \
    --prefix $ld"leaf_"  \
    --suffix ".fa"  \
    --bol --mismatches 1 --partial 1 \
    &>> "leaf_demult_log.txt" &) &

In [12]:
cat leaf_demult_log.txt

Barcode	Count	Location
1	237434	/home/daniel/Documents/submissions/taibioinfo/taiwan_combined_biom/trim_leaves/demult/leaf_1.fa
100	138715	/home/daniel/Documents/submissions/taibioinfo/taiwan_combined_biom/trim_leaves/demult/leaf_100.fa
101	117532	/home/daniel/Documents/submissions/taibioinfo/taiwan_combined_biom/trim_leaves/demult/leaf_101.fa
102	122279	/home/daniel/Documents/submissions/taibioinfo/taiwan_combined_biom/trim_leaves/demult/leaf_102.fa
103	58121	/home/daniel/Documents/submissions/taibioinfo/taiwan_combined_biom/trim_leaves/demult/leaf_103.fa
104	85418	/home/daniel/Documents/submissions/taibioinfo/taiwan_combined_biom/trim_leaves/demult/leaf_104.fa
105	35436	/home/daniel/Documents/submissions/taibioinfo/taiwan_combined_biom/trim_leaves/demult/leaf_105.fa
106	8	/home/daniel/Documents/submissions/taibioinfo/taiwan_combined_biom/trim_leaves/demult/leaf_106.fa
107	91405	/home/daniel/Documents/submissions/taibioinfo/taiwan_combined_biom/trim_leaves/demult/leaf_107.fa
108	34447

5	7887	/home/daniel/Documents/submissions/taibioinfo/taiwan_combined_biom/trim_leaves/demult/leaf_5.fa
50	91324	/home/daniel/Documents/submissions/taibioinfo/taiwan_combined_biom/trim_leaves/demult/leaf_50.fa
51	179957	/home/daniel/Documents/submissions/taibioinfo/taiwan_combined_biom/trim_leaves/demult/leaf_51.fa
52	100442	/home/daniel/Documents/submissions/taibioinfo/taiwan_combined_biom/trim_leaves/demult/leaf_52.fa
53	29372	/home/daniel/Documents/submissions/taibioinfo/taiwan_combined_biom/trim_leaves/demult/leaf_53.fa
54	10777	/home/daniel/Documents/submissions/taibioinfo/taiwan_combined_biom/trim_leaves/demult/leaf_54.fa
55	138725	/home/daniel/Documents/submissions/taibioinfo/taiwan_combined_biom/trim_leaves/demult/leaf_55.fa
56	16369	/home/daniel/Documents/submissions/taibioinfo/taiwan_combined_biom/trim_leaves/demult/leaf_56.fa
57	13611	/home/daniel/Documents/submissions/taibioinfo/taiwan_combined_biom/trim_leaves/demult/leaf_57.fa
58	32997	/home/daniel/Documents/submissions/ta

2,645,675 reads unmatched. That's a lot. Hmmm.
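
For a sense of scale, the unmatched count can be set against the total straight from the splitter log. A minimal sketch, assuming the log keeps the tab-separated `unmatched` and `total` rows that `fastx_barcode_splitter.pl` prints after the per-barcode rows; `demult_summary` is a made-up helper name:

```shell
# Hypothetical helper: summarize a fastx_barcode_splitter.pl log.
# Assumes tab-separated rows of "label<TAB>count[<TAB>path]", including
# rows labelled "unmatched" and "total".
demult_summary() {
    awk -F'\t' '
        $1 == "unmatched" { unmatched = $2 }
        $1 == "total"     { total = $2 }
        END {
            if (total > 0)
                printf "unmatched: %d of %d reads (%.1f%%)\n",
                       unmatched, total, 100 * unmatched / total
        }' "$1"
}

# Usage (log name taken from the cell above):
# demult_summary leaf_demult_log.txt
```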

<a id='clipPrimers'><h3>Clip primers</h3></a>

The leaves still have primers and barcodes on them: 

In [14]:
cd demult

In [16]:
## ITS2 was our linker primer:
grep GCTGCGTTCTTCATCGATGC leaf_94.fa | wc -l
grep GCTGCGTTCTTCATCGATGC <(head -n 20 leaf_94.fa)

87084
CGTGATAAGACGGCTGCGTTCTTCATCGATGCCAGAACCAAGAGATCCGTTGTTAAAAGTTTTAATTATTTGCTTGTGCCACTCAGAAGAGACGTCGTGTAAATAGAGTTTGGTTTCCTCCGGCGGGCGCCCCGTCCCCGTGGTGGGGGCCGGCGCCGGGAGGGGAGGCCCGCGAGAGGCTTCCCCTGCCCGCCGAAGCAACGGTTAGGTACGTTCACAAAGGGTTATAGAGCGGTAACTCAGTAATGATCCCTCCGCTGGTTCACCAACGGAGACCTTGTTACGACTTTTACTTCCTCTAAATNACCAAGCGTCTTATCACG
CGTGATCAGACGGCTGCGTTCTTCATCGATGCCAGAACCAAGAGATCCGTTGTTAAAAGTTTTAATTATTTGCTTGTGCCACTCAGAAGAGACGTCGTGTAAATAGAGTTTGGTTTCCTCCGGCGGGCGCCCCGTCCCCCTGGTGGGGGCCGGCGCCGGGAGGGGAGGCCCGCGAGAGGCTTCCCCTGCCCGCCGAAGCAACGGTTAGGTATGTTCACAAAGGGTTATAGAGCGGTAACTCAGTAATGATCCCTCCGCTGGTTCACCAACGGAGACCTTGTTACGACTCGTACTTCCTCTAAATNACCAAGCGTCTGATCACG
CGTGATAAGACAGCTGCGTTCTTCATCGATGCCAGAACCAAGAGATCCGTTGTTAAAAGTTTTGATTATTTGCTTGTACCACTCAGAAGAAACGTCGTTAAATCAGAGTTTGGTTATCCTCCGGCGGGCGCCGACCCGCCCGGGGGCGGGAGGCCGGGAGGGTCACGGAGACCCTACCCGCCGAAGCAACAGTTATAGGTATGTTCACAAAGGGTTGTAGAGCGTAAACTCAGTAATGATCCCTCCGCTGGTTCACCAACGGAGACCTTGTTACGACTTTTACTTCCTCTAAATN

This forward barcode plus primer is 32 bp long:

In [17]:
aa=CGTGATAAGACGGCTGCGTTCTTCATCGATGC
echo ${#aa}

32


The reverse primer (ITS1f) was much more degraded, not sure why. So unless we put a bunch of wildcards in our search, we don't turn it up as often. But it is still definitely present, and we can look for its reverse complement in these merged files to confirm how much we need to clip.

In [18]:
grep TTACTTCCTCTAAATGACCAAG leaf_94.fa | wc -l

grep TTACTTCCTCTAAATGACCAAG <(head -n 1000 leaf_94.fa)

34561
CGTGATAAGACGGCTGCGTTCTTCATCGATGCCAGAACCAAGAGATCCGTTGTTGAAAGTTTTGATTCATTCTTCATCAAACCGACGCATCAAAACCGCGTTGGAAAGGTCCACCGGGGGCGCGGGTCTCGCGTCCCCCGAGGAAACAAGGGTATTCATACAAAAGGGTGGGAGGTCGGGCCTGGGGCCCTCACTCGGTAATGATCCCTCCGCAGGTTCACCTACGGAGACCTTGTTACGACTTTTACTTCCTCTAAATGACCAAGCGTCTTATCACG
CGTGATAAGACGGCTGCGTTCTTCATCGATGCCAGAACCAAGAGATCCGTTGTTGAAAGTTTTAATCAATTAAATGATATATCAGGACTTCACAAAATGAATTCTTGAGTTTTGTATACTGGCGGGCACTTAGCCGGGCGTCCTGGCCAGTTAAGGCTGGGGGCGCCGGCCGCCTGGGTCGGAACCAGGTCGACCCGCCAAAGCAACATAGTGAGTAGACTTTTACTTCCTCTAAATGACCAAGCGTCTTATCACG
CGTGATAAGTCGGCTGCGTTCTTCATCGATGCTGGAGCCAAGAGATCCGTTGTTAAAAGTTTTGACAGTTCGCTAAGAACACTCAGAAGTATCGTCGGGTTCGAAAACAGAGATTCTGATGAGACCGGCGGGCACCCTCGCGGGCGCCGCCGAAGCAACAGGTATAATAGTTCACAAAGGGTAGAGAGTATAGTACTCATTAATGATCCCTCCGCTGGTTCACCAACGGAGACCTTGTTACGACTTTTACTTCCTCTAAATGACCAAGCGACTTATCACG
CGTGATAAGACGGCTGCGTTCTTCATCGATGCTAGAGCCAAGAGATCCGTTGTTGAAAGTTTTAACAGTTCGCTTTGGAACACTCAGAGGTAACTCATAGAGAAACAGGAGATTCTGAACACCGGC

In [19]:
bb=TTACTTCCTCTAAATGACCAAGCGACTTATCACG
echo ${#bb}

34


So we need to clip 32 bp off of the 5' end of our reads, and 34 bp off of the 3' end. Makes sense: barcode (12 bp) + ITS2 (20 bp) = 32 bp, and barcode (12 bp) + ITS1f (22 bp) = 34 bp.

We'll use fastx again:

In [22]:
## leaves:
cd /home/daniel/Documents/taiwan/taiwan_combined_biom/demult

for i in *; do
    fastx_trimmer -i $i -f 33 | fastx_trimmer -t 34 -o ../leafNoPrim/${i/leaf/leafNoPrim}
done
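
As a sanity check on the two clip lengths, the same trimming can be mimicked on a single string in bash. This is only a sketch: `clip_read` is a hypothetical helper, the 10 bp insert is made up, and the real work is still done by `fastx_trimmer` as in the cell above.

```shell
# The barcode+primer strings measured above (32 bp and 34 bp).
fwd=CGTGATAAGACGGCTGCGTTCTTCATCGATGC
rev=TTACTTCCTCTAAATGACCAAGCGACTTATCACG

# Hypothetical helper: drop the first 32 and last 34 bases of a sequence,
# mirroring `fastx_trimmer -f 33` followed by `fastx_trimmer -t 34`.
clip_read() {
    local seq=$1
    seq=${seq:32}                        # keep from base 33 onward
    printf '%s\n' "${seq:0:${#seq}-34}"  # drop the last 34 bases
}

# Made-up 10 bp insert, for illustration only.
clip_read "${fwd}ACGTTGCAAC${rev}"
```

If the lengths are right, only the insert between the two primers should come back.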

<a id='chimeras'><h3>Checking for chimeras</h3></a>

Let's look for and remove chimeric sequences. For the USEARCH pipeline, we'll use the [ITS1 reference files from UNITE](https://unite.ut.ee/repository.php). 

In [None]:
## leaf reads:
ITS1_ref='/home/daniel/Documents/submissions/taibioinfo/UNITE/uchime_reference_dataset_28.06.2017/ITS1_ITS2_datasets/uchime_reference_dataset_ITS1_28.06.2017.fasta'

cd /home/daniel/Documents/submissions/taibioinfo/taiwan_combined_biom/trim_leaves/leafNoPrim

( for i in *; do
echo $i
j="../leafNotChim/"${i/NoPrim/NotChim}
k=${j/\.fa/\.log}
echo $j
echo $k
usearch -uchime_ref $i \
-db $ITS1_ref \
-nonchimeras $j \
-uchimeout $k \
-strand plus \
-notrunclabels \
&>> ../leafNotChim/leafUchime_stdout.txt
done &) &

With leaf reads, 14,372,161 out of 14,372,164 reads were non-chimeric. So a loss of three reads.

In [None]:
## wood reads

ITS1_ref='/home/daniel/Documents/submissions/taibioinfo/UNITE/uchime_reference_dataset_28.06.2017/ITS1_ITS2_datasets/uchime_reference_dataset_ITS1_28.06.2017.fasta'


cd /home/daniel/Documents/submissions/taibioinfo/taiwan_combined_biom/trimmed_wood/woodFasta

(for i in *; do
echo $i
j="../woodNotChim/"${i/\.fasta/\.notChim\.fasta}
k=${j/\.fasta/\.log}
echo $j
echo $k
usearch -uchime_ref $i \
-db $ITS1_ref \
-nonchimeras $j \
-uchimeout $k \
-strand plus \
-notrunclabels \
&>> woodUchime_stdout.txt
done &) &

With wood reads, 3,732,153 out of 3,743,135 reads were non-chimeric, so 10,982 (0.3%) reads were chimeric. 
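
These before/after counts can be recomputed from the FASTA files themselves. A small sketch; `count_seqs` and `loss_report` are hypothetical helper names, not part of USEARCH:

```shell
# Count FASTA records (header lines) across one or more files.
count_seqs() {
    cat "$@" | grep -c '^>'
}

# Report how many reads a filtering step removed.
# Arguments: reads before, reads after.
loss_report() {
    awk -v before="$1" -v after="$2" 'BEGIN {
        printf "%d of %d reads removed (%.2f%%)\n",
               before - after, before, 100 * (before - after) / before
    }'
}

# The wood numbers quoted above:
loss_report 3743135 3732153
```

With the files on disk, something like `loss_report "$(count_seqs *.fasta)" "$(count_seqs ../woodNotChim/*.fasta)"` would give the same report without copying numbers out of the logs.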

<a id='ITSx'><h3>Finding ITS1 region</h3></a>

Even though we trimmed the primers, to maximize the accuracy of the OTU clustering we should also get rid of regions that are highly conserved among all fungi, i.e. the small subunit (SSU) and the 5.8S. Bits of both are in our reads, since our forward primer is seated in the SSU and our reverse in the 5.8S. We can estimate their locations with the ITSx tool. This is computationally a very expensive process, so we'll just look at the first read from each sample and estimate from those.

I guess we could have done this earlier, after demultiplexing, and skipped our primer clipping step...

<h4> Leaf ITS1 region </h4>

In [None]:
## look at leaf reads:
cd /home/daniel/Documents/submissions/taibioinfo/taiwan_combined_biom/trim_leaves/leafNotChim

## we gotta get rid of the linebreaks, made a script
#wget https://raw.githubusercontent.com/danchurch/taiwan_combined_biom/master/scripts/fasta_remove_linebreaks.py

for i in *; do
echo $i
j=checkITS/${i/\.fa/_noLB\.fa}
#echo checkITS/${i/\.fa/_noLB\.fa}
fasta_remove_linebreaks.py $i $j
head -n 2 $j >> checkITS/allFirstReads.fa
done

## what's next? check ITS for all of these. ITSx binaries are in the working directory.

../../ITSx_1.0.11/ITSx \
-i checkITS/allFirstReads.fa \
--preserve T \
--allow_single_domain \
-t F \
-o checkITS/allFirstLeafReads
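
The `fasta_remove_linebreaks.py` script is linked above but not reproduced here. The same unwrapping can be sketched in awk; this is an illustrative stand-in under that assumption, not the script itself:

```shell
# Hypothetical stand-in for fasta_remove_linebreaks.py:
# join wrapped sequence lines so each record is a header line
# followed by exactly one sequence line.
unwrap_fasta() {
    awk '
        /^>/ { if (seq != "") print seq; seq = ""; print; next }
             { seq = seq $0 }
        END  { if (seq != "") print seq }
    ' "$1"
}

# Usage:
# unwrap_fasta leafNotChim_94.fa > checkITS/leafNotChim_94_noLB.fa
```

Once records are two lines each, `head -n 2` really does return the first complete read, which is what the loop above relies on.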


In [3]:
cat allFirstLeafReads.positions.txt

HWI-M01380:62:000000000-A65GR:1:1101:18554:1494	257 bp.	SSU: 1-46	ITS1: 47-227	5.8S: No end	ITS2: Not found	LSU: Not found	Broken or partial sequence, only partial 5.8S! 
HWI-M01380:62:000000000-A65GR:1:1101:20607:1531	215 bp.	SSU: 1-46	ITS1: 47-185	5.8S: No end	ITS2: Not found	LSU: Not found	Broken or partial sequence, only partial 5.8S! 
HWI-M01380:62:000000000-A65GR:1:1101:17026:1493	238 bp.	SSU: 1-46	ITS1: 47-208	5.8S: No end	ITS2: Not found	LSU: Not found	Broken or partial sequence, only partial 5.8S! 
HWI-M01380:62:000000000-A65GR:1:1101:13354:1572	274 bp.	SSU: 1-46	ITS1: 47-244	5.8S: No end	ITS2: Not found	LSU: Not found	Broken or partial sequence, only partial 5.8S! 
HWI-M01380:62:000000000-A65GR:1:1101:11460:1564	218 bp.	SSU: 1-46	ITS1: 47-188	5.8S: No end	ITS2: Not found	LSU: Not found	Broken or partial sequence, only partial 5.8S! 
HWI-M01380:62:000000000-A65GR:1:1101:10872:1538	306 bp.	SSU: 1-46	ITS1: 47-276	5.8S: No end	ITS2: Not found	LSU: Not found	Broken or partial sequ

HWI-M01380:62:000000000-A65GR:1:1101:10989:1632	256 bp.	SSU: 1-46	ITS1: 47-226	5.8S: No end	ITS2: Not found	LSU: Not found	Broken or partial sequence, only partial 5.8S! 
HWI-M01380:62:000000000-A65GR:1:1101:14272:1645	229 bp.	SSU: 1-46	ITS1: 47-199	5.8S: No end	ITS2: Not found	LSU: Not found	Broken or partial sequence, only partial 5.8S! 
HWI-M01380:62:000000000-A65GR:1:1101:17796:1515	267 bp.	SSU: 1-46	ITS1: 47-237	5.8S: No end	ITS2: Not found	LSU: Not found	Broken or partial sequence, only partial 5.8S! 
HWI-M01380:62:000000000-A65GR:1:1101:14431:1642	217 bp.	SSU: 1-46	ITS1: 47-187	5.8S: No end	ITS2: Not found	LSU: Not found	Broken or partial sequence, only partial 5.8S! 
HWI-M01380:62:000000000-A65GR:1:1101:18084:1490	268 bp.	SSU: 1-46	ITS1: 47-238	5.8S: No end	ITS2: Not found	LSU: Not found	Broken or partial sequence, only partial 5.8S! 
HWI-M01380:62:000000000-A65GR:1:1101:21934:1698	218 bp.	SSU: 1-46	ITS1: 47-188	5.8S: No end	ITS2: Not found	LSU: Not found	Broken or partial sequ

HWI-M01380:62:000000000-A65GR:1:1101:11941:1573	215 bp.	SSU: 1-46	ITS1: 47-188	5.8S: No end	ITS2: Not found	LSU: Not found	Broken or partial sequence, only partial 5.8S! 
HWI-M01380:62:000000000-A65GR:1:1101:25693:14022	252 bp.	SSU: 1-46	ITS1: 47-222	5.8S: No end	ITS2: Not found	LSU: Not found	Broken or partial sequence, only partial 5.8S! 
HWI-M01380:62:000000000-A65GR:1:1101:11938:1520	213 bp.	SSU: 1-46	ITS1: 47-183	5.8S: No end	ITS2: Not found	LSU: Not found	Broken or partial sequence, only partial 5.8S! 
HWI-M01380:62:000000000-A65GR:1:1101:17524:1531	261 bp.	SSU: 1-46	ITS1: 47-231	5.8S: No end	ITS2: Not found	LSU: Not found	Broken or partial sequence, only partial 5.8S! 
HWI-M01380:62:000000000-A65GR:1:1101:10720:1599	315 bp.	SSU: 1-46	ITS1: 47-286	5.8S: No end	ITS2: Not found	LSU: Not found	Broken or partial sequence, only partial 5.8S! 
HWI-M01380:62:000000000-A65GR:1:1101:10075:1488	218 bp.	SSU: 1-46	ITS1: 47-188	5.8S: No end	ITS2: Not found	LSU: Not found	Broken or partial seq

<h4>Wood ITS1 region </h4>

In [None]:
cd /home/daniel/Documents/submissions/taibioinfo/taiwan_combined_biom/trimmed_wood/woodNotChim

for i in *; do
echo $i
j=checkITS/${i/\.fasta/_noLB\.fa}
#echo $j
fasta_remove_linebreaks.py $i $j
head -n 2 $j >> checkITS/allFirstWoodReads.fa
done

../../ITSx_1.0.11/ITSx \
-i checkITS/allFirstWoodReads.fa \
--preserve T \
--allow_single_domain \
-t F \
-o checkITS/allFirstWoodReads

In [5]:
cat allFirstWoodReads.positions.txt

M01498:244:000000000-ANT97:1:1101:16239:1160	256 bp.	SSU: 1-46	ITS1: 47-226	5.8S: No end	ITS2: Not found	LSU: Not found	Broken or partial sequence, only partial 5.8S! 
M01498:244:000000000-ANT97:1:1101:13588:1168	220 bp.	SSU: 1-46	ITS1: 47-190	5.8S: No end	ITS2: Not found	LSU: Not found	Broken or partial sequence, only partial 5.8S! 
M01498:244:000000000-ANT97:1:1101:9388:1167	256 bp.	SSU: 1-46	ITS1: 47-226	5.8S: No end	ITS2: Not found	LSU: Not found	Broken or partial sequence, only partial 5.8S! 
M01498:244:000000000-ANT97:1:1101:11226:1106	275 bp.	SSU: 1-46	ITS1: 47-245	5.8S: No end	ITS2: Not found	LSU: Not found	Broken or partial sequence, only partial 5.8S! 
M01498:244:000000000-ANT97:1:1101:20154:1278	224 bp.	SSU: 1-46	ITS1: 47-194	5.8S: No end	ITS2: Not found	LSU: Not found	Broken or partial sequence, only partial 5.8S! 
M01498:244:000000000-ANT97:1:1101:19378:1196	257 bp.	SSU: 1-46	ITS1: 47-227	5.8S: No end	ITS2: Not found	LSU: Not found	Broken or partial sequence, only partial 

M01498:244:000000000-ANT97:1:1101:17861:1169	277 bp.	SSU: 1-46	ITS1: 47-247	5.8S: No end	ITS2: Not found	LSU: Not found	Broken or partial sequence, only partial 5.8S! 
M01498:244:000000000-ANT97:1:1101:13783:1129	216 bp.	SSU: 1-46	ITS1: 47-186	5.8S: No end	ITS2: Not found	LSU: Not found	Broken or partial sequence, only partial 5.8S! 
M01498:244:000000000-ANT97:1:1101:14904:1772	253 bp.	SSU: 1-46	ITS1: 47-223	5.8S: No end	ITS2: Not found	LSU: Not found	Broken or partial sequence, only partial 5.8S! 
M01498:244:000000000-ANT97:1:1101:14432:1196	247 bp.	SSU: 1-46	ITS1: 47-217	5.8S: No end	ITS2: Not found	LSU: Not found	Broken or partial sequence, only partial 5.8S! 
M01498:244:000000000-ANT97:1:1101:10402:1209	222 bp.	SSU: 1-46	ITS1: 47-192	5.8S: No end	ITS2: Not found	LSU: Not found	Broken or partial sequence, only partial 5.8S! 
M01498:244:000000000-ANT97:1:1101:14952:1145	306 bp.	SSU: 1-97	ITS1: 98-276	5.8S: No end	ITS2: Not found	LSU: Not found	Broken or partial sequence, only partial

So in both the leaves and the wood, we see that the small subunit (SSU) usually ends at bp 46 of the read, and the 5.8S usually begins 30 bp before the end of the read. There are some exceptions, but the ITSx algorithms are computationally expensive: run on our entire data set, they can take days to weeks. So we'll do our best here to reduce the role of the more highly conserved regions of the read (the 18S and 5.8S) in our OTU clustering, which is intended to capture species-ish diversity.

We do this by clipping 46 bp off of the 5' end of our reads, and 30 bp off of the 3':

In [None]:
## woods
cd /home/daniel/Documents/submissions/taibioinfo/taiwan_combined_biom/trimmed_wood/woodNotChim

for i in *fasta; do
#echo $i
echo ${i/notChim/ITSonly}
fastx_trimmer -i <(fasta_formatter -i $i)  -f 47 | fastx_trimmer -t 30 -o ../woodITSonly/${i/notChim/ITSonly}
done

## and leaves:
cd /home/daniel/Documents/submissions/taibioinfo/taiwan_combined_biom/trim_leaves/leafNotChim

for i in *fa; do
echo $i
#echo ../leafITSonly/${i/notChim/ITSonly}
fastx_trimmer -i <(fasta_formatter -i $i)  -f 47 | fastx_trimmer -t 30 -o ../leafITSonly/${i/notChim/ITSonly}
done

## combine these into a single, big file of both leaf and wood reads:

leafITSonly=/home/daniel/Documents/submissions/taibioinfo/taiwan_combined_biom/trim_leaves/leafITSonly
woodITSonly=/home/daniel/Documents/submissions/taibioinfo/taiwan_combined_biom/trimmed_wood/woodITSonly

cat $leafITSonly/* $woodITSonly/* > allReads.fasta
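
The 46 bp and 30 bp clip lengths used above were read off the ITSx position tables by eye. A rough way to tally them instead, assuming the tab-separated layout shown in the `*.positions.txt` output (read length in column 2, SSU in column 3, ITS1 in column 4); `its_flanks` is a hypothetical helper, not part of ITSx:

```shell
# Tally (SSU end, bases after ITS1) pairs from an ITSx positions file.
# Output columns: number of reads, SSU end position, 3' flank length.
its_flanks() {
    awk -F'\t' '
        $3 ~ /[0-9]+-[0-9]+/ && $4 ~ /[0-9]+-[0-9]+/ {
            split($3, ssu, "-"); split($4, its, "-")
            len = $2 + 0                      # "257 bp." -> 257
            count[(ssu[2] + 0) "\t" (len - its[2])]++
        }
        END {
            for (k in count) print count[k] "\t" k
        }' "$1" | sort -k1,1nr
}

# Usage:
# its_flanks checkITS/allFirstLeafReads.positions.txt
```

The most common pair lands on the first line, and the exceptions show up as the low-count rows beneath it.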

<h3><a id='OTUclust'>OTU clustering</a></h3>

With labels simplified and reads reduced to the ITS1 region, we can cluster. This is three steps, actually: dereplication, sorting, and clustering. 

<a id='derep'><h4>Dereplication and Sorting of reads</h4></a>

We've been doing this pipeline with USEARCH, 32-bit. Not sure how this works, but this free version of USEARCH restricts RAM usage to 4 gig. If loading databases of reads takes more than this, we're punted, with a message that basically says "buy the 64 bit version, you bum."

So I went to Oslo and got a copy of [VSEARCH](https://github.com/torognes/vsearch). This is an open-source, freely available parallel to USEARCH. Since I am trying to rebuild a pipeline previously built with USEARCH 64-bit (not available to me now), we'll use VSEARCH sparingly, when memory limits are a problem. 

In [None]:
## get rid of singletons for clustering while we're at it
vsearch --derep_fulllength allReads.fasta \
--output allReads_derep.fasta \
--sizeout \
--minseqlength 1 \
--minuniquesize 2 \
&> derep_stdout.log

In [7]:
cat derep_stdout.log

vsearch v2.8.0_linux_x86_64, 11.5GB RAM, 4 cores
https://github.com/torognes/vsearch

Reading file allReads.fasta 100%
3203788852 nt in 18101955 seqs, min 2, max 373, avg 177
Dereplicating 100%
Sorting 100%
2961231 unique sequences, avg cluster 6.1, median 1, max 472437
Writing output file 100%
748827 uniques written, 2212404 clusters discarded (74.7%)
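
The log's numbers hang together: 2,961,231 unique sequences minus 748,827 written leaves 2,212,404 discarded (74.7%). As a toy illustration of what dereplication with abundance labels and a minimum size of 2 does, here is a sketch in awk. It is not how vsearch works internally, it assumes two-line FASTA records, and `toy_derep` is a made-up name:

```shell
# Toy dereplication: collapse identical sequences, keep those seen at
# least `min` times (default 2), and label each with its abundance,
# most abundant first.
# Assumes two-line FASTA records (header, then one sequence line).
toy_derep() {
    awk -v min="${2:-2}" '
        /^>/     { header = substr($0, 2); next }
        $0 == "" { next }
        {
            if (!($0 in size)) first[$0] = header   # label of first copy
            size[$0]++
        }
        END {
            for (s in size)
                if (size[s] >= min)
                    print size[s] "\t" first[s] "\t" s
        }' "$1" |
    sort -k1,1nr |
    awk -F'\t' '{ print ">" $2 ";size=" $1 ";"; print $3 }'
}
```

Sequences seen only once never reach the output, which is the effect of `--minuniquesize 2` in the cell above.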


To sort, we go back to USEARCH...

In [None]:
usearch -sortbysize allReads_derep.fasta -fastaout allReads_sorted.fasta &> usearch_sort_stdout.log

In [9]:
cat usearch_sort_stdout.log

usearch v8.1.1861_i86linux32, 4.0Gb RAM (12.1Gb total), 4 cores
(C) Copyright 2013-15 Robert C. Edgar, all rights reserved.
http://drive5.com/usearch

License: danchurchthomas@gmail.com

00:02 224Mb  100.0% Reading allReads_derep.fasta
00:02 191Mb Getting sizes                       
00:03 197Mb Sorting 748827 sequences
00:06 200Mb  100.0% Writing output


<a id='Cluster'><h4>Cluster reads</h4></a>

In [None]:
usearch -cluster_smallmem allReads_sorted.fasta \
-id 0.95 \
-centroids otus_95_combo.fasta \
-sizein \
-sizeout \
-sortedby size \
|& tee clust_stdout.log

In [11]:
cat clust_stdout.log

usearch v8.1.1861_i86linux32, 4.0Gb RAM (12.1Gb total), 4 cores
(C) Copyright 2013-15 Robert C. Edgar, all rights reserved.
http://drive5.com/usearch

License: danchurchthomas@gmail.com

00:48  59Mb  100.0% 12601 clusters, max size 681167, avg 1261.0
00:48  59Mb  100.0% Writing centroids to otus_95_combo.fasta   
                                                            
      Seqs  748827 (748827 ()
  Clusters  12601 (12601 (12601 (126)
  Max size  681167 (681167 (681167 (681167 (6)
  Avg size  1261.0
  Min size  2
Singletons  0, 0.0% of seqs, 0.0% of clusters
   Max mem  59Mb
      Time  48.0s
Throughput  15.6k seqs/sec.



12,601 clusters (OTUs), ~2,000 more than our last pipeline. Not sure why, though I did several things differently, including NOT seeding this pipeline with Roo's hand-curated *Xylaria* stromata. We also had many more unique sequences going into this step this time around: ~750,000 seqs, compared to ~450,000 last time I did this, with the same data. Hmmm....

<a id='addOTUtag'><h4>Assign unique names to OTU clusters</h4></a>

In this older version of USEARCH, with the `-cluster_smallmem` command, there was no option to give unique identifiers to the OTU clusters. So I made a script, which gives a label of your choice (here "OTU") plus a number, from the order in which the clusters were created. 

In [None]:
./addOTUtag.py otus_95_combo.fasta OTU otus_95_combo_relab.fasta
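
The `addOTUtag.py` script itself isn't shown here. Judging from the relabelled headers that turn up later in this notebook (e.g. `>OTU6797:leafNotChim_111;size=16;...`), it prefixes each header with the tag and a running number. A minimal awk sketch of that idea, offered as an assumption about its behavior rather than the script's actual code:

```shell
# Hypothetical stand-in for addOTUtag.py: prefix every FASTA header with
# a tag plus a running number, e.g. ">OTU1:<original label>".
add_otu_tag() {
    awk -v tag="$2" '
        /^>/ { n++; print ">" tag n ":" substr($0, 2); next }
             { print }
    ' "$1"
}

# Usage, mirroring the call above:
# add_otu_tag otus_95_combo.fasta OTU > otus_95_combo_relab.fasta
```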

<a id=assTax><h4>Assign taxonomy</h4></a>

We'll do a mass, low-confidence assignment of taxonomy to our OTU clusters using UTAX and UNITE. These aren't to be used for any real analysis, just a quick first glance. If we become interested in an OTU, we'll need to take some time to look for a higher confidence taxonomic assignment. 

In [None]:
## get the suggested version of UNITE. 
wget https://drive5.com/utax/data/utax_unite_v7.tar.gz
tar -xzvf utax_unite_v7.tar.gz

## some useful shortcuts:
ITS1db=/home/daniel/Documents/submissions/taibioinfo/UNITE/utaxref/unite_v7/fasta/refdb.fa
ITS1tf=/home/daniel/Documents/submissions/taibioinfo/UNITE/utaxref/unite_v7/taxconfs/its1.tc
OTUs=/home/daniel/Documents/submissions/taibioinfo/taiwan_combined_biom/OTUclust/otus_95_combo_relab.fasta

## make a USEARCH database out of UNITE
usearch -makeudb_utax $ITS1db  -taxconfsin $ITS1tf -output ITS1tax.udb |& tee assTax.log

## use this database to assign taxonomy to our OTUs
usearch -utax $OTUs -db ITS1tax.udb -fastaout OTUs_95_assTaxed.fasta -strand plus |& tee OtuAssTaxed.log

<a id='makeBiom'><h3>Make biom table</h3></a>

We can now assemble the biom table. We'll make the first-generation format of biom tables, which is JSON. I prefer this to the later HDF5 format because it's human-readable. Taxonomy assignments get thrown in for free during this step of biom table construction via uparse. 

USEARCH won't parse sample names that contain "-" dashes, so we need to get rid of them. So... more sed...

In [None]:
allReads=/home/daniel/Documents/submissions/taibioinfo/taiwan_combined_biom/OTUclust/allReads.fasta

sed -E '/>Dc-X/ s/(Dc)-(X)/\1_\2/g' $allReads |\
sed -E '/>Dc-PosG/ s/(Dc)-(PosG)/\1_\2/' |\
sed -E '/>Dc-PosI/ s/(Dc)-(PosI)/\1_\2/' |\
sed -E '/>Dc-Neg/ s/(Dc)-(Neg)/\1_\2/' > allReads_con.fasta
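
The four substitutions above follow one pattern, so they could be folded into a single expression. A sketch only: `fix_dashes` is a made-up name, and unlike the original it anchors on header lines and applies the `g` flag to every label.

```shell
# Replace the dash in "Dc-X...", "Dc-PosG", "Dc-PosI" and "Dc-Neg" labels
# with an underscore, on FASTA header lines only.
fix_dashes() {
    sed -E '/^>/ s/Dc-(X|PosG|PosI|Neg)/Dc_\1/g' "$1"
}

# Usage:
# fix_dashes allReads.fasta > allReads_con.fasta
```

Afterwards, `grep -c '^>.*-' allReads_con.fasta` is a quick way to see whether any dashed headers are left.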

While we're at it, let's get rid of our unmatched reads:

In [None]:
## operate on the file written by the cell above
sed -i '/>leafNotChim_unmatched/,+1d' allReads_con.fasta
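
The `,+1d` address is a GNU sed extension, and it only removes a whole record when every sequence sits on a single line. A more forgiving sketch in awk that drops records by header prefix however the sequence is wrapped (`drop_records` is a hypothetical helper name):

```shell
# Drop every FASTA record whose header starts with a given prefix,
# whether or not the sequence is wrapped over several lines.
drop_records() {
    awk -v prefix=">$2" '
        /^>/ { skip = (index($0, prefix) == 1) }
        !skip
    ' "$1"
}

# Usage (output name is made up):
# drop_records allReads_con.fasta leafNotChim_unmatched > allReads_matched.fasta
```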

Make the table:

In [33]:
allReadsCon=/home/daniel/Documents/submissions/taibioinfo/taiwan_combined_biom/assTax/allReads_con.fasta

usearch -usearch_global $allReadsCon -db OTUs_95_assTaxed.fasta -strand plus -id 0.95 -biomout combo_otu.biom |& tee makebiom.log

In [34]:
cat makebiom.log

usearch v8.1.1861_i86linux32, 4.0Gb RAM (12.1Gb total), 4 cores
(C) Copyright 2013-15 Robert C. Edgar, all rights reserved.
http://drive5.com/usearch

License: danchurchthomas@gmail.com

00:00  44Mb  100.0% Reading ./assTax/OTUs_95_assTaxed.fasta
00:00  10Mb  100.0% Masking                                
00:00  11Mb  100.0% Word stats
00:00  11Mb  100.0% Alloc rows
00:00  19Mb  100.0% Build index
07:25  90Mb  100.0% Searching allReads_con.fasta, 100.4% matched
15442054 / 15456309 mapped to OTUs (99.9%)                      
07:25  90Mb Writing combo_otu.biom
07:25  90Mb Writing combo_otu.biom ...done.


Well, that's nice. But now we have to do a lot of cleaning up, and add metadata.

<h2><a id='formBiom'>Formatting Biom table and adding metadata</a></h2>

<h3><a id='formTax'>Change biom taxonomy metadata format</a></h3>

Let's change the standard USEARCH taxonomy formatting to Greengenes database formatting. Our usearch output looks like this:

In [35]:
grep rows combo_otu.biom -A 5

	"rows":[
		{"id":"OTU19:leafNotChim_100", "metadata":{"taxonomy":"d:Fungi,p:Ascomycota(0.3128),c:Eurotiomycetes(0.2065),o:Onygenales(0.1099),f:Arthrodermataceae(0.0449),g:Arthroderma(0.0202),s:Arthroderma_melis_SH007610.07FU(0.0022)"}},
		{"id":"OTU108:leafNotChim_100", "metadata":{"taxonomy":"d:Fungi,p:Ascomycota(0.1084),c:Eurotiomycetes(0.0620),o:Pyrenulales(0.0299),f:Massariaceae(0.0138),g:Massaria(0.0060)"}},
		{"id":"OTU1:leafNotChim_100", "metadata":{"taxonomy":"d:Fungi,p:Ascomycota(0.9897),c:Sordariomycetes(0.7674),o:Hypocreales(0.5816),f:Hypocreales_fam_Incertae_sedis(0.3979),g:Myrothecium(0.2279)"}},
		{"id":"OTU202:leafNotChim_100", "metadata":{"taxonomy":"d:Fungi,p:Ascomycota(0.4325),c:Dothideomycetes(0.2853),o:Dothideomycetidae_ord_Incertae_sedis(0.1601),f:Strigulaceae(0.0654),g:Strigula(0.0309),s:Strigula_smaragdula_SH211054.07FU(0.0045)"}},
		{"id":"OTU426:leafNotChim_100", "metadata":{"taxonomy":"d:Fungi,p:Ascomycota(0.2072),c:Dothideomycetes(0.1317)"}}

Let's edit this with sed:

In [38]:
sed '/taxonomy/ s/([0-1]\.[0-9]*)//g' combo_otu.biom |\
sed -E 's/("taxonomy")(:")/\1:[/' |\
sed -E 's/"}}/,]}}/' |\
sed -E '/taxonomy/ s/(d:)([^,]*)/"k__\2"/' |\
sed -E '/taxonomy/ s/(p:)([^,]*)/"p__\2"/' |\
sed -E '/taxonomy/ s/(c:)([^,]*)/"c__\2"/' |\
sed -E '/taxonomy/ s/(o:)([^,]*)/"o__\2"/' |\
sed -E '/taxonomy/ s/(f:)([^,]*)/"f__\2"/' |\
sed -E '/taxonomy/ s/(g:)([^,]*)/"g__\2"/' |\
sed -E '/taxonomy/ s/(s:)([^,]*)/"s__\2"/' |\
sed -E '/taxonomy/ s/,]}}/]}}/' > combo_otu_relab.biom
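
To see what this chain does to a single taxonomy string, here is a condensed sketch of the same transformation. It is not a drop-in replacement for the pipeline above (it works on a bare taxonomy string, not on the biom file), and `utax_to_gg` is a made-up name:

```shell
# Convert one UTAX-style taxonomy string to a Greengenes-style list:
#   1. drop the "(0.1234)" confidence scores
#   2. rename the domain prefix "d:" to the kingdom prefix "k:"
#   3. quote each rank and switch "x:" to "x__"
utax_to_gg() {
    printf '%s\n' "$1" | sed -E \
        -e 's/\([0-9.]+\)//g' \
        -e 's/(^|,)d:/\1k:/' \
        -e 's/(^|,)([kpcofgs]):([^,]*)/\1"\2__\3"/g'
}

utax_to_gg 'd:Fungi,p:Ascomycota(0.1084),c:Eurotiomycetes(0.0620)'
```

Trying the conversion on one string like this is a cheap way to check the regular expressions before letting them loose on the whole table.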

In [39]:
grep rows combo_otu_relab.biom -A 5

	"rows":[
		{"id":"OTU19:leafNotChim_100", "metadata":{"taxonomy":["k__Fungi","p__Ascomycota","c__Eurotiomycetes","o__Onygenales","f__Arthrodermataceae","g__Arthroderma","s__Arthroderma_melis_SH007610.07FU"]}},
		{"id":"OTU108:leafNotChim_100", "metadata":{"taxonomy":["k__Fungi","p__Ascomycota","c__Eurotiomycetes","o__Pyrenulales","f__Massariaceae","g__Massaria"]}},
		{"id":"OTU1:leafNotChim_100", "metadata":{"taxonomy":["k__Fungi","p__Ascomycota","c__Sordariomycetes","o__Hypocreales","f__Hypocreales_fam_Incertae_sedis","g__Myrothecium"]}},
		{"id":"OTU202:leafNotChim_100", "metadata":{"taxonomy":["k__Fungi","p__Ascomycota","c__Dothideomycetes","o__Dothideomycetidae_ord_Incertae_sedis","f__Strigulaceae","g__Strigula","s__Strigula_smaragdula_SH211054.07FU"]}},
		{"id":"OTU426:leafNotChim_100", "metadata":{"taxonomy":["k__Fungi","p__Ascomycota","c__Dothideomycetes"]}},


Looks good. But the leafNotChim thing is annoying, especially with our sample ids. So change this:

In [40]:
sed -E -i 's/(leaf)(NotChim_)([0-9]*)/\3\1/g' combo_otu_relab.biom

In [41]:
grep rows -A 10 combo_otu_relab.biom

	"rows":[
		{"id":"OTU19:100leaf", "metadata":{"taxonomy":["k__Fungi","p__Ascomycota","c__Eurotiomycetes","o__Onygenales","f__Arthrodermataceae","g__Arthroderma","s__Arthroderma_melis_SH007610.07FU"]}},
		{"id":"OTU108:100leaf", "metadata":{"taxonomy":["k__Fungi","p__Ascomycota","c__Eurotiomycetes","o__Pyrenulales","f__Massariaceae","g__Massaria"]}},
		{"id":"OTU1:100leaf", "metadata":{"taxonomy":["k__Fungi","p__Ascomycota","c__Sordariomycetes","o__Hypocreales","f__Hypocreales_fam_Incertae_sedis","g__Myrothecium"]}},
		{"id":"OTU202:100leaf", "metadata":{"taxonomy":["k__Fungi","p__Ascomycota","c__Dothideomycetes","o__Dothideomycetidae_ord_Incertae_sedis","f__Strigulaceae","g__Strigula","s__Strigula_smaragdula_SH211054.07FU"]}},
		{"id":"OTU426:100leaf", "metadata":{"taxonomy":["k__Fungi","p__Ascomycota","c__Dothideomycetes"]}},
		{"id":"OTU1905:100leaf", "metadata":{"taxonomy":["k__Fungi","p__Ascomycota"]}},
		{"id":"OTU10:100leaf", "metadata":{"taxonomy":["k__Fungi","

In [42]:
grep columns -A 10 combo_otu_relab.biom

	"columns":[
		{"id":"100leaf", "metadata":null},
		{"id":"101leaf", "metadata":null},
		{"id":"102leaf", "metadata":null},
		{"id":"103leaf", "metadata":null},
		{"id":"104leaf", "metadata":null},
		{"id":"105leaf", "metadata":null},
		{"id":"106leaf", "metadata":null},
		{"id":"107leaf", "metadata":null},
		{"id":"108leaf", "metadata":null},
		{"id":"109leaf", "metadata":null},


Looks okay. Check to see if the biom package can find problems:

In [43]:
biom validate-table -i combo_otu_relab.biom

Invalid format 'Biological Observation Matrix 1.0', must be '1.0.0'
'id' in {'metadata': {'taxonomy': ['k__Fungi', 'p__Ascomycota', 'c__Sordariomycetes', 'o__Ophiostomatales', 'f__Ophiostomataceae', 'g__Pesotum']}, 'id': ''} appears empty
Bad value at idx 0: [0, 0, 81511]
Timestamp does not appear to be ISO 8601
The input file is not a valid BIOM-formatted file.



I don't know why this OTU's id was left blank, but it has happened before. Find the OTU in the table as it was before relabeling (`combo_otu.biom`) and fill in the name manually:

In [44]:
grep "g:Pesotum" combo_otu.biom -A 1 -B 1

		{"id":"OTU10613:leafNotChim_111", "metadata":{"taxonomy":"d:Fungi"}},
		{"id":"", "metadata":{"taxonomy":"d:Fungi,p:Ascomycota(0.3128),c:Sordariomycetes(0.2065),o:Ophiostomatales(0.1099),f:Ophiostomataceae(0.0449),g:Pesotum(0.0202)"}},
		{"id":"OTU7760:leafNotChim_111", "metadata":{"taxonomy":"d:Fungi,p:Ascomycota(0.2072),c:Schizosaccharomycetes(0.1317),o:Schizosaccharomycetales(0.0620),f:Schizosaccharomycetaceae(0.0273),g:Schizosaccharomyces(0.0108)"}},


Let's find this in our OTU clusters that have had taxonomy assignments:

In [45]:
grep "g:Pesotum(0.0202)" OTUs_95_assTaxed.fasta

>OTU6797:leafNotChim_111;size=16;tax=d:Fungi,p:Ascomycota(0.3128),c:Sordariomycetes(0.2065),o:Ophiostomatales(0.1099),f:Ophiostomataceae(0.0449),g:Pesotum(0.0202);


Fill this into our OTU table with sed:

In [46]:
sed -i '/"id":""/ s/"id":"",/"id":"OTU6797:111leaf",/' combo_otu_relab.biom
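
A quick sanity check for this kind of fix, sketched on a throwaway file rather than the real table: count the empty `"id"` fields before and after the substitution. The file contents and the temporary path are made up for illustration.

```shell
# Throwaway file with one blank id, standing in for the real biom table
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
{"id":"OTU10613:111leaf", "metadata":null},
{"id":"", "metadata":null},
{"id":"OTU7760:111leaf", "metadata":null},
EOF

# Count lines with an empty id before the fix
before=$(grep -c '"id":""' "$tmp")

# Same substitution as above, written to a new file instead of edited in place
sed '/"id":""/ s/"id":"",/"id":"OTU6797:111leaf",/' "$tmp" > "$tmp.fixed"

# grep -c prints 0 but exits non-zero when nothing matches, hence "|| true"
after=$(grep -c '"id":""' "$tmp.fixed" || true)

echo "empty ids before: $before, after: $after"
rm -f "$tmp" "$tmp.fixed"
```

If the "after" count is anything other than zero, there is more than one blank id and a single hard-coded name will not be enough.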

In [47]:
biom validate-table -i combo_otu_relab.biom

Invalid format 'Biological Observation Matrix 1.0', must be '1.0.0'
Bad value at idx 0: [0, 0, 81511]
Timestamp does not appear to be ISO 8601
The input file is not a valid BIOM-formatted file.



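Beh. The remaining complaints are about the format string, one data value, and the timestamp, not about missing ids. They go away further down: the file that `biom add-metadata` writes passes validation. Onward...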

<h3><a id='addMetadata'>Add sample metadata</a></h3>

We have a spreadsheet with our sample metadata on it. We can use this to assign useful info about each of our samples, with the `biom add-metadata` command:

In [17]:
head meta_2018.06.14.tsv

#SampleID	Library	SorC	X	Y	Forest_Type	Host_family	Host_genus	Host_species	Host_genus_species	stream_distance	vegcom
Dc_X	W	Control	NA	NA	NA	NA	NA	NA	NA	NA	NA
Dc_PosG	W	Control	NA	NA	NA	NA	NA	NA	NA	NA	NA
Dc_PosI	W	Control	NA	NA	NA	NA	NA	NA	NA	NA	NA
Dc_Neg	W	Control	NA	NA	NA	NA	NA	NA	NA	NA	NA
1w	W	Sample	360	220	7	Juglandaceae	Engelhardtia	roxburghiana	Engelhardtia_roxburghiana	24.11897	2
2w	W	Sample	360	221	7	Theaceae	Pyrenaria	shinkoensis	Pyrenaria_shinkoensis	23.22664	2
3w	W	Sample	361	221	7	Proteaceae	Helicia	formosana	Helicia_formosana	22.77525	2
4w	W	Sample	361	220	7	Theaceae	Pyrenaria	shinkoensis	Pyrenaria_shinkoensis	23.66758	2
5w	W	Sample	363	220	7	Theaceae	Pyrenaria	shinkoensis	Pyrenaria_shinkoensis	22.7648	2
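
`biom add-metadata` matches rows of the mapping file to samples by id, so before running it, it can be worth checking that every column id in the table has a row in the spreadsheet. The sketch below does that on two small made-up files. The file names and ids are placeholders, and pulling ids out with `sed` assumes one `{"id": ...}` entry per line, as in the relabeled table.

```shell
# Fixed collation so sort and comm agree on ordering
export LC_ALL=C

# Work in a scratch directory with made-up stand-ins for the real files
work=$(mktemp -d)

# Toy mapping file: header line starting with '#', then one row per sample
printf '#SampleID\tLibrary\n1w\tW\n2w\tW\n100leaf\tL\n' > "$work/meta.tsv"

# Toy column block, one entry per line
cat > "$work/columns.txt" <<'EOF'
{"id":"100leaf", "metadata":null},
{"id":"1w", "metadata":null},
{"id":"3w", "metadata":null},
EOF

# Ids in the mapping file: first column, header skipped
grep -v '^#' "$work/meta.tsv" | cut -f1 | sort > "$work/meta_ids.txt"

# Ids in the table: strip everything but the id value
sed -E 's/.*"id":"([^"]*)".*/\1/' "$work/columns.txt" | sort > "$work/table_ids.txt"

# comm -13 keeps lines found only in the second file: table ids with no metadata row
missing=$(comm -13 "$work/meta_ids.txt" "$work/table_ids.txt")

echo "samples without metadata: $missing"
rm -rf "$work"
```

In the real table the column entries sit inside a larger JSON file alongside the row entries, so the ids would first have to be pulled from the `"columns"` section only.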


In [1]:
biom add-metadata -i combo_otu_relab.biom -o combo_otu_wMeta.biom --sample-metadata-fp meta_2018.06.14.tsv --output-as-json

In [2]:
## just checking:
biom validate-table -i combo_otu_wMeta.biom


The input file is a valid BIOM-formatted file.


This command adds metadata to each of our samples, but it mashes our biom file into a single line, making it really hard to read.

In [3]:
head -c 1000 combo_otu_wMeta.biom; tail -c 1000 combo_otu_wMeta.biom

{"id": "None","format": "Biological Observation Matrix 1.0.0","format_url": "http://biom-format.org","matrix_type": "sparse","generated_by": "BIOM-Format 2.1.6","date": "2018-06-17T13:56:51.200393","type": "OTU table","matrix_element_type": "float","shape": [11588, 232],"data": [[0,0,81511.0],[0,1,178.0],[0,3,1145.0],[0,4,3.0],[0,27,1.0],[0,35,226.0],[0,37,939.0],[0,51,1.0],[0,54,2.0],[0,93,2.0],[0,114,57.0],[0,122,1.0],[0,125,43.0],[0,127,13282.0],[0,128,2.0],[0,129,1.0],[1,0,26184.0],[1,3,7.0],[1,4,3504.0],[1,33,1.0],[1,128,2.0],[2,0,2735.0],[2,1,6795.0],[2,2,586.0],[2,3,4290.0],[2,4,2898.0],[2,5,5424.0],[2,6,1.0],[2,7,3.0],[2,8,859.0],[2,9,504.0],[2,10,435.0],[2,11,1191.0],[2,12,2.0],[2,13,259.0],[2,14,9.0],[2,15,7.0],[2,16,530.0],[2,17,25.0],[2,18,6454.0],[2,19,8038.0],[2,20,2138.0],[2,21,6263.0],[2,22,6891.0],[2,23,3186.0],[2,24,1138.0],[2,25,6137.0],[2,26,7641.0],[2,27,1.0],[2,28,65417.0],[2,29,3333.0],[2,30,848.0],[2,31,886.0],[2,32,5517.0],[2,33,6203.0],[2,34,484.0],[2,35,742.0

Let's use [js-beautify](https://www.npmjs.com/package/js-beautify) to re-render it as a multi-line JSON file:

In [4]:
js-beautify combo_otu_wMeta.biom > combo_otu_wMeta_pretty.biom

In [5]:
biom validate-table -i combo_otu_wMeta_pretty.biom


The input file is a valid BIOM-formatted file.


In [6]:
head -n 20 combo_otu_wMeta_pretty.biom

{
    "id": "None",
    "format": "Biological Observation Matrix 1.0.0",
    "format_url": "http://biom-format.org",
    "matrix_type": "sparse",
    "generated_by": "BIOM-Format 2.1.6",
    "date": "2018-06-17T13:56:51.200393",
    "type": "OTU table",
    "matrix_element_type": "float",
    "shape": [11588, 232],
    "data": [
        [0, 0, 81511.0],
        [0, 1, 178.0],
        [0, 3, 1145.0],
        [0, 4, 3.0],
        [0, 27, 1.0],
        [0, 35, 226.0],
        [0, 37, 939.0],
        [0, 51, 1.0],
        [0, 54, 2.0],


Our sample metadata is here:

In [7]:
grep columns combo_otu_wMeta_pretty.biom -A 20

    "columns": [{
        "id": "100leaf",
        "metadata": {
            "vegcom": "2",
            "stream_distance": "25.97654",
            "Host_genus": "Helicia",
            "Host_genus_species": "Helicia_formosana",
            "Library": "L",
            "Forest_Type": "7",
            "Host_species": "formosana",
            "X": "183",
            "Host_family": "Proteaceae",
            "SorC": "Sample",
            "Y": "20"
        }
    }, {
        "id": "101leaf",
        "metadata": {
            "vegcom": "3",
            "stream_distance": "18.36984",
            "Host_genus": "Helicia",


So we should be good: the biom table is constructed, clean of major errors, and has taxonomic and sample metadata attached.

We will do most of our downstream manipulations of this biom table with [phyloseq](https://joey711.github.io/phyloseq/), a package made for handling microbial community data in R. 

Let's check to see if phyloseq likes our biom table:

In [1]:
library("phyloseq")

biom95_meta <- import_biom("combo_otu_wMeta.biom", parseFunction=parse_taxonomy_greengenes)

In [3]:
sample_data(biom95_meta)

| | vegcom | stream_distance | Host_genus | Host_genus_species | Library | Forest_Type | Host_species | X | Host_family | SorC | Y |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 100leaf | 2 | 25.97654 | Helicia | Helicia_formosana | L | 7 | formosana | 183 | Proteaceae | Sample | 20 |
| 101leaf | 3 | 18.36984 | Helicia | Helicia_formosana | L | 7 | formosana | 180 | Proteaceae | Sample | 27 |
| 102leaf | 3 | 21.3725 | Cleyera | Cleyera_japonica | L | 7 | japonica | 187 | Theaceae | Sample | 27 |
| 103leaf | 3 | 11.08831 | Helicia | Helicia_formosana | L | 7 | formosana | 180 | Proteaceae | Sample | 35 |
| 104leaf | 3 | 1.409998 | Helicia | Helicia_formosana | L | 7 | formosana | 180 | Proteaceae | Sample | 51 |
| 105leaf | 3 | 22.46722 | Limlia | Limlia_uraiana | L | 7 | uraiana | 180 | Fagaceae | Sample | 83 |
| 106leaf | 2 | 82.49734 | Helicia | Helicia_formosana | L | 3 | formosana | 180 | Proteaceae | Sample | 147 |
| 107leaf | 1 | 64.85876 | Blastus | Blastus_cochinchinensis | L | 3 | cochinchinensis | 307 | Melastomataceae | Sample | 147 |
| 108leaf | 1 | 19.02113 | Cleyera | Cleyera_japonica | L | 3 | japonica | 243 | Theaceae | Sample | 83 |
| 109leaf | 3 | 13.46815 | Meliosma | Meliosma_squamulata | L | 7 | squamulata | 209 | Sabiaceae | Sample | 49 |


In [4]:
otu_table(biom95_meta)

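| | 100leaf | 101leaf | 102leaf | 103leaf | 104leaf | 105leaf | 106leaf | 107leaf | 108leaf | 109leaf | ⋯ | 124w | 125w | 128w | 129w | 130w | 131w | 133w | Neg | PosG | PosI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| OTU19:100leaf | 81511 | 178 | 0 | 1145 | 3 | 0 | 0 | 0 | 0 | 0 | ⋯ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| OTU108:100leaf | 26184 | 0 | 0 | 7 | 3504 | 0 | 0 | 0 | 0 | 0 | ⋯ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| OTU1:100leaf | 2735 | 6795 | 586 | 4290 | 2898 | 5424 | 1 | 3 | 859 | 504 | ⋯ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| OTU202:100leaf | 3214 | 0 | 0 | 6330 | 1 | 0 | 0 | 0 | 0 | 0 | ⋯ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| OTU426:100leaf | 5943 | 0 | 0 | 0 | 38 | 0 | 0 | 0 | 0 | 0 | ⋯ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| OTU1905:100leaf | 98 | 0 | 71 | 35 | 0 | 0 | 0 | 0 | 0 | 0 | ⋯ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| OTU10:100leaf | 6618 | 22297 | 36 | 6276 | 12852 | 6258 | 1 | 0 | 1 | 81 | ⋯ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| OTU10433:17leaf | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ⋯ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| OTU17:100leaf | 2877 | 2600 | 6 | 3782 | 5298 | 491 | 0 | 0 | 51 | 2 | ⋯ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| OTU429:100leaf | 559 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ⋯ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |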


Seems okay. We'll work on the revised stats pipeline in a separate notebook.