# One big biome table

Since I am hoping to compare the environmental patterns of wood and leaf endophytes, I need to combine these datasets early in the bioinformatics pipeline, to make them as comparable as possible. We'll try to stick to the [usearch (uparse)](http://drive5.com/usearch/) pipeline as much as possible.

------------

That was a year ago. 

Since then my car has been stolen, with my laptop in it. And the manuscripts based on these notebooks have been submitted. Somehow, the versions of this notebook and the following analysis notebook that were on github lost most of their graphic outputs, charts and maps and stuff. While most of the important graphics, etc., were backed up, I lost some of the computationally expensive intermediate files needed to repopulate the notebooks.

So to make this notebook useable to reviewers and readers, I'll be picking through this process again, and the downstream analysis, if I don't go insane in the meantime. Maybe even if I do.

## Table of contents

[Work environment](#environment)  
[Rearranging barcodes](#rearrange)  
[Trimming reads](#trim)  
[Merging paired-end reads](#mergepairs)  
- [Leaf reads](#mergeLeaf)
- [Wood reads](#mergeWood)

[Visualizing merged read qualities](#visMerge)  
- [Make quality score charts](#makeQcharts)  
- [Leaf read qualities](#leafReadScores)
- [Wood read qualities](#woodReadScores)

[Quality filtering reads](#qf)  
[Convert fastq files to fasta format](#fastq2fasta)  

<a id='environment'><h3>Work environment</h3></a>

Working directory, on my machine:

In [2]:
cd /home/daniel/Documents/Taiwan_data/combined/combo_biome



We'll be using the [usearch (uparse)](http://drive5.com/usearch/) pipeline, version v8.0.1623_i86linux32, on the University of Oregon's Talapas computing cluster.

<a id='rearrange'></a>
### Rearranging barcodes

We need to merge the paired-end sequences of the leaves and wood. But before we can do this, there are several steps. First, the leaf study reads use a split 6+6 bp barcode scheme for identifying reads, so the barcode bases need to be clipped from one read and added to the other. I wrote a python script for this:

In [4]:
cat scripts/BCunsplit.py

#!/usr/bin/env python3

## let's try to take two unpaired read files, cut out the bp from the reverse,
## and tack it onto the forward.
## have to preserve the fastq format so that pandaseq can do unsplit3.py forward_reads reverse_reads

#The first six bps and quality ratings of the reverse reads should be chopped off and placed after
#the first six bps of the forward reads and quality ratings. For use with fastq files. It will spit
#out two files, with the names: "rearranged_[your original forward and reverse read file names].fastq".

import itertools ##to let us jump around 
from sys import argv

script, forward_file, reverse_file = argv

forwardlabels=[]
reverselabels=[]
forwardreads=[]
forwardreadsq=[]
reversereads=[]
reversereadsq=[]
forwardBC=[]
forwardBCq=[]
reverseBC=[]
reverseBCq=[]

with open(forward_file) as foop:

	#labels:

	for h in itertools.islice(foop, 0, None, 4):
		forwardlabels.append(h)

##forward sequences, barcodes: 

	foop.seek(0)
    
	for i in itertools.isli

I have details on how I used this [here](https://github.com/danchurch/taiwan_dada2/blob/master/dada2pipeline.ipynb). 

This outputs two files, "rearranged_Roo_R1.fastq" and "rearranged_Roo_R2.fastq". I did this in another directory, so we'll add some symlinks here for convenience:

In [8]:
## leaves
ln -s /home/daniel/Documents/taiwan/taiwan_dada2/rearranged_leafR1.fastq reLeafR1.fastq
ln -s /home/daniel/Documents/taiwan/taiwan_dada2/rearranged_leafR2.fastq reLeafR2.fastq
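
The listing of `BCunsplit.py` above is cut off, so here is a minimal sketch of the rearrangement its header comments describe, not the script itself. It assumes each read is held as a `(label, sequence, separator, quality)` tuple and that each barcode half is exactly 6 bp. The function name `move_barcode` and its `bc_len` parameter are made up for illustration.

```python
from typing import Tuple

# One FASTQ record: (label, sequence, separator, quality)
Record = Tuple[str, str, str, str]


def move_barcode(forward: Record, reverse: Record, bc_len: int = 6) -> Tuple[Record, Record]:
    """Clip the first bc_len bases (and qualities) off the reverse read and
    place them right after the first bc_len bases of the forward read."""
    f_label, f_seq, f_sep, f_qual = forward
    r_label, r_seq, r_sep, r_qual = reverse

    # Forward read becomes: forward barcode half + reverse barcode half + rest of forward read
    new_f_seq = f_seq[:bc_len] + r_seq[:bc_len] + f_seq[bc_len:]
    new_f_qual = f_qual[:bc_len] + r_qual[:bc_len] + f_qual[bc_len:]

    # Reverse read simply loses its barcode half
    new_forward = (f_label, new_f_seq, f_sep, new_f_qual)
    new_reverse = (r_label, r_seq[bc_len:], r_sep, r_qual[bc_len:])
    return new_forward, new_reverse
```

Sequence and quality strings are sliced in exactly the same way, so the output records stay valid FASTQ for the merging step.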

<a id='trim'><h3>Trimming reads</h3></a>

Next we trim a little to make sure we are doing our alignments with high-quality base calls. The sites for trimming are decided by looking at the raw reads [(see below)](#visMerge), and finding where quality begins to drop off. 
To trim, we'll use the [FASTX-toolkit](http://hannonlab.cshl.edu/fastx_toolkit/).

Our wood reads are already demultiplexed, so we don't have a single forward and reverse read file for all of our wood samples, like we do above with the leaves. So let's make a script for this:

In [None]:
## trims.sh
#####################################################


## wood reads live here:
wooddir=/home/daniel/Documents/taiwan/woodreads/

## working directory is here:
cd /home/daniel/Documents/taiwan/taiwan_combined_biom

###### R1 reads:

## home for trimmed R1 wood reads here:
R1trimdir='/home/daniel/Documents/taiwan/taiwan_combined_biom/trimmed_wood/R1/'

## trim just the R1s, output to their new home with new filename:
for i in $wooddir*_R1_*; do
    echo $i
    out=$R1trimdir$(basename ${i/_001\.fastq/_trimmed\.fastq})
    fastx_trimmer -l 255 -i $i -o $out && echo $out 
done

###### R2 reads:

## home for trimmed R2 wood reads here:
R2trimdir='/home/daniel/Documents/taiwan/taiwan_combined_biom/trimmed_wood/R2/'

## trim just the R2s, output to their new home with new filename:
for j in $wooddir*_R2_*; do
    echo $j
    out=$R2trimdir$(basename ${j/_001\.fastq/_trimmed\.fastq})
    fastx_trimmer -l 210 -i $j -o $out && echo $out 
done

In [2]:
## leaves. These lengths were decided by Roo. They are all still in one pile:
fastx_trimmer -l 263 -i reLeafR1.fastq -o Roo_R1_trimmed.fastq
fastx_trimmer -l 170 -i reLeafR2.fastq -o Roo_R2_trimmed.fastq
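
For reference, `fastx_trimmer -l N` keeps bases 1 through N of every read and drops the rest, with the quality string cut to match. A rough Python equivalent for a single record, assuming the record has already been parsed into sequence and quality strings (the function name `trim_record` is hypothetical):

```python
from typing import Tuple


def trim_record(sequence: str, quality: str, last_base: int) -> Tuple[str, str]:
    """Keep only the first last_base positions of a read and its qualities.
    Reads already shorter than last_base are returned unchanged."""
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality strings differ in length")
    if last_base < 1:
        raise ValueError("last_base must be at least 1")
    return sequence[:last_base], quality[:last_base]
```

With the lengths used above, the wood R1 reads would go through `trim_record(seq, qual, 255)` and the wood R2 reads through `trim_record(seq, qual, 210)`.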



<a id='mergepairs'><h3>Merging paired-end reads</h3></a> 

Unfortunately, I no longer have access to the 64-bit version of usearch 8.1. So let's use the 32-bit version:

In [7]:
usearch

usearch v8.1.1861_i86linux32, 4.0Gb RAM (8.0Gb total), 4 cores
(C) Copyright 2013-15 Robert C. Edgar, all rights reserved.
http://drive5.com/usearch

License: danchurchthomas@gmail.com



<a id='mergeLeaf'><h4>Leaf reads</h4></a>

The leaf files are too large to be handled by the 32-bit version. We don't want to demultiplex yet, because that will be messy with the split barcodes. So let's break up the leaf files into smaller ones and merge these.

In [None]:
cd /home/daniel/Documents/submissions/taibioinfo/taiwan_combined_biom/trim_leaves

for i in Roo_R*_*; do
    ls $i
    split -d -l 500000 $i ${i/fastq/split\.fastq}
done

Makes a lot of files:

In [10]:
cat leafSplitLs.txt

aln_3.txt
leafSplitLs.txt
Roo_R1_trimmed.split.fastq00
Roo_R1_trimmed.split.fastq01
Roo_R1_trimmed.split.fastq02
Roo_R1_trimmed.split.fastq03
Roo_R1_trimmed.split.fastq04
Roo_R1_trimmed.split.fastq05
Roo_R1_trimmed.split.fastq06
Roo_R1_trimmed.split.fastq07
Roo_R1_trimmed.split.fastq08
Roo_R1_trimmed.split.fastq09
Roo_R1_trimmed.split.fastq10
Roo_R1_trimmed.split.fastq11
Roo_R1_trimmed.split.fastq12
Roo_R1_trimmed.split.fastq13
Roo_R1_trimmed.split.fastq14
Roo_R1_trimmed.split.fastq15
Roo_R1_trimmed.split.fastq16
Roo_R1_trimmed.split.fastq17
Roo_R1_trimmed.split.fastq18
Roo_R1_trimmed.split.fastq19
Roo_R1_trimmed.split.fastq20
Roo_R1_trimmed.split.fastq21
Roo_R1_trimmed.split.fastq22
Roo_R1_trimmed.split.fastq23
Roo_R1_trimmed.split.fastq24
Roo_R1_trimmed.split.fastq25
Roo_R1_trimmed.split.fastq26
Roo_R1_trimmed.split.fastq27
Roo_R1_trimmed.split.fastq28
Roo_R1_trimmed.split.fastq29
Roo_R1_trimmed.split.fastq30
Roo_R1_trimmed.split.fastq31
Roo_R1_trimmed.split.fastq32
Roo_R1_trimmed.sp

Roo_R2_trimmed.split.fastq9040
Roo_R2_trimmed.split.fastq9041
Roo_R2_trimmed.split.fastq9042
Roo_R2_trimmed.split.fastq9043
Roo_R2_trimmed.split.fastq9044
Roo_R2_trimmed.split.fastq9045
Roo_R2_trimmed.split.fastq9046
Roo_R2_trimmed.split.fastq9047
Roo_R2_trimmed.split.fastq9048
Roo_R2_trimmed.split.fastq9049
Roo_R2_trimmed.split.fastq9050
Roo_R2_trimmed.split.fastq9051
Roo_R2_trimmed.split.fastq9052
Roo_R2_trimmed.split.fastq9053
Roo_R2_trimmed.split.fastq9054
Roo_R2_trimmed.split.fastq9055
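
The chunk size matters here: 500000 is a multiple of 4, so every chunk holds whole 4-line FASTQ records, and chunk *n* of the R1 file lines up with chunk *n* of the R2 file as long as both files hold the same reads in the same order. A small sketch of that logic, with a hypothetical function name (`split_fastq_lines`) and lists of lines standing in for files:

```python
from typing import List


def split_fastq_lines(lines: List[str], lines_per_chunk: int = 500000) -> List[List[str]]:
    """Cut a FASTQ file (as a list of lines) into chunks of whole records."""
    if lines_per_chunk <= 0 or lines_per_chunk % 4 != 0:
        # A chunk size that is not a multiple of 4 would cut records in half
        raise ValueError("lines_per_chunk must be a positive multiple of 4")
    return [lines[i:i + lines_per_chunk] for i in range(0, len(lines), lines_per_chunk)]
```

Applying the same chunk size to both read files keeps the forward and reverse chunks in register, which is what the merge step relies on.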


Merge these:

In [None]:
for forward in *_R1_*; do
    #ls $forward
    reverse=${forward/_R1_/_R2_}
    ## oops, fix the -alnout line if reused
    usearch -fastq_mergepairs "$forward" \
        -reverse "$reverse" \
        -fastq_maxdiffpct 40 \
        -alnout aln_3.txt \
        -report "../reports/$forward.report.txt" \
        -fastqout "../merged/$forward.merged.fastq"
    #ls $reverse
done

And put them back together.

In [None]:
cd /home/daniel/Documents/submissions/taibioinfo/taiwan_combined_biom/trim_leaves/merged

cat * > leaf_trimmed_merged.fastq

<a id='mergeWood'><h4>Wood reads</h4></a>

We don't have the same issue with the wood reads, because they are already demultiplexed, and the 32-bit version of usearch can handle these smaller files fine:

In [None]:
for forward in *fastq; do
    ls -l "$forward"
    reverse="../R2/${forward/R1/R2}"
    ## oops, fix the -alnout line if reused
    usearch -fastq_mergepairs "$forward" \
        -reverse "$reverse" \
        -fastq_maxdiffpct 40 \
        -alnout aln_3.txt \
        -report "../reports/$forward.report.txt" \
        -fastqout "./$forward.merged.fastq"
    echo "$reverse"
done
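
Both merge loops derive the reverse file name from the forward one by string substitution, which only works if every forward file has a mate with the matching name. A hedged sketch of a pre-flight check for that, written in Python rather than bash; `pair_reads` and its arguments are illustrative names, and the default `_R1_`/`_R2_` tags follow the leaf file names used above:

```python
from typing import Iterable, List, Tuple


def pair_reads(filenames: Iterable[str],
               fwd_tag: str = "_R1_",
               rev_tag: str = "_R2_") -> Tuple[List[Tuple[str, str]], List[str]]:
    """Return (forward, reverse) file name pairs, plus forward files with no mate."""
    names = set(filenames)
    pairs: List[Tuple[str, str]] = []
    orphans: List[str] = []
    for name in sorted(names):
        if fwd_tag not in name:
            continue  # not a forward file
        # Replace only the first occurrence, like ${forward/_R1_/_R2_} in bash
        mate = name.replace(fwd_tag, rev_tag, 1)
        if mate in names:
            pairs.append((name, mate))
        else:
            orphans.append(name)
    return pairs, orphans
```

Running something like this on `os.listdir(".")` before the loop would show any chunk or sample whose mate is missing, instead of finding out partway through the merge.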

<a id='visMerge'><h3>Visualizing merged read qualities</h3></a>

<a id='makeQcharts'><h4>Make quality score charts</h4></a>  
Let's make some charts of our read quality, using fastx tools. First, compile the quality stats for each base position:

In [None]:
#!/usr/bin/env bash

## leaf reads

cd /home/daniel/Documents/submissions/taibioinfo/taiwan_combined_biom/qCharts/leaf

rawLeafReadsR1="/home/daniel/Documents/taiwan_supp/roo_reads/TaiwanFA_R1.fastq"
rawLeafReadsR2="/home/daniel/Documents/taiwan_supp/roo_reads/TaiwanFA_R2.fastq"
trimmedLeafReadsR1="/home/daniel/Documents/submissions/taibioinfo/taiwan_combined_biom/trim_leaves/Roo_R1_trimmed.fastq"
trimmedLeafReadsR2="/home/daniel/Documents/submissions/taibioinfo/taiwan_combined_biom/trim_leaves/Roo_R2_trimmed.fastq"
leafmerg="/home/daniel/Documents/submissions/taibioinfo/taiwan_combined_biom/trim_leaves/merged/leaf_trimmed_merged.fastq"

## leaf quality stats:
fastx_quality_stats -i $rawLeafReadsR1 -o rawLeafReadsR1_fastxstats.txt
fastx_quality_stats -i $rawLeafReadsR2 -o rawLeafReadsR2_fastxstats.txt
fastx_quality_stats -i $trimmedLeafReadsR1 -o trimmedLeafReadsR1_fastxstats.txt
fastx_quality_stats -i $trimmedLeafReadsR2 -o trimmedLeafReadsR2_fastxstats.txt
fastx_quality_stats -i $leafmerg -o leafmerged_fastxstats.txt
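
`fastx_quality_stats` writes a table of summary statistics for each position along the reads, and these tables are what get plotted. As a rough picture of what one column of such a table means, here is a sketch that computes the mean quality per position, assuming Phred+33 encoded quality strings. `mean_quality_by_position` is a made-up name, and the real tool reports more than the mean.

```python
from typing import Iterable, List


def mean_quality_by_position(quality_strings: Iterable[str], offset: int = 33) -> List[float]:
    """Mean Phred score at each read position, over reads that reach that position."""
    totals: List[int] = []
    counts: List[int] = []
    for qual in quality_strings:
        for pos, char in enumerate(qual):
            if pos == len(totals):
                # First read long enough to reach this position
                totals.append(0)
                counts.append(0)
            totals[pos] += ord(char) - offset
            counts[pos] += 1
    return [total / count for total, count in zip(totals, counts)]
```

Because reads differ in length after trimming and merging, the later positions are averaged over fewer reads than the early ones.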

For the wood, to visualize the reads as a whole, we'll concatenate the per-sample files from each of the steps we've done so far:

In [None]:
cat /home/daniel/Documents/taiwan_supp/wood_reads/*R1* > /home/daniel/Documents/submissions/taibioinfo/taiwan_combined_biom/rawWoodReadsR1.fastq

cat /home/daniel/Documents/taiwan_supp/wood_reads/*R2* > /home/daniel/Documents/submissions/taibioinfo/taiwan_combined_biom/rawWoodReadsR2.fastq

cat /home/daniel/Documents/submissions/taibioinfo/taiwan_combined_biom/trimmed_wood/R1/* > /home/daniel/Documents/submissions/taibioinfo/taiwan_combined_biom/trimmedWoodR1.fastq

cat /home/daniel/Documents/submissions/taibioinfo/taiwan_combined_biom/trimmed_wood/R2/* > /home/daniel/Documents/submissions/taibioinfo/taiwan_combined_biom/trimmedWoodR2.fastq

cat /home/daniel/Documents/submissions/taibioinfo/taiwan_combined_biom/trimmed_wood/merged/* > /home/daniel/Documents/submissions/taibioinfo/taiwan_combined_biom/woodMerged.fastq


Compile the stats for these combined wood files:

In [None]:
#!/usr/bin/env bash

cd /home/daniel/Documents/submissions/taibioinfo/taiwan_combined_biom

fastx_quality_stats -i rawWoodReadsR1.fastq -o qCharts/wood/rawWoodReadsR1_fastxstats.txt
fastx_quality_stats -i rawWoodReadsR2.fastq -o qCharts/wood/rawWoodReadsR2_fastxstats.txt
fastx_quality_stats -i trimmedWoodR1.fastq -o qCharts/wood/trimmedWoodR1_fastxstats.txt
fastx_quality_stats -i trimmedWoodR2.fastq -o qCharts/wood/trimmedWoodR2_fastxstats.txt
fastx_quality_stats -i woodMerged.fastq -o qCharts/wood/woodMerged_fastxstats.txt

Then make the actual graphics.

In [None]:
cd /home/daniel/Documents/submissions/taibioinfo/taiwan_combined_biom/qCharts

cd leaf

for i in *; do
    ../dan_fastx_plot.sh -i $i -o ${i/\.txt/\.png}
done

cd ../wood

for i in *; do
    ../dan_fastx_plot.sh -i $i -o ${i/\.txt/\.png}
done

<a id='leafReadScores'><h4>Leaf read qualities</h4></a>

Leaf raw forward reads:
![](rawLeafReadsR1_fastxstats.png)

Leaf raw reverse reads:
![](rawLeafReadsR2_fastxstats.png)

Leaf trimmed forward reads:
![](trimmedLeafReadsR1_fastxstats.png)

Leaf trimmed reverse reads:
![](trimmedLeafReadsR2_fastxstats.png)

Leaf merged reads:
![](leafmerged_fastxstats.png)

<a id='woodReadScores'><h4>Wood read qualities</h4></a>

Wood raw forward reads:
![](rawWoodReadsR1_fastxstats.png)

Wood raw reverse reads:
![](rawWoodReadsR2_fastxstats.png)

Wood trimmed forward reads:
![](trimmedWoodR1_fastxstats.png)

Wood trimmed reverse reads:
![](trimmedWoodR2_fastxstats.png)

Wood merged reads:
![](woodMerged_fastxstats.png)

<a id="qf"><h3>Quality filtering reads</h3></a>

USEARCH seems to do quite a bit of filtering during merging, judging by how many reads from our wood samples were removed. But let's also run the USEARCH filtering program on both the wood and leaf reads.
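
The filtering commands in this section use `-fastq_maxee_rate .01`, which discards reads whose expected number of errors per base exceeds 0.01. The expected number of errors in a read is the sum of the per-base error probabilities implied by the Phred scores, `P = 10^(-Q/10)`. A minimal sketch of that calculation, assuming Phred+33 encoded quality strings; the function names are hypothetical and this is not USEARCH's own code:

```python
def expected_error_rate(quality: str, offset: int = 33) -> float:
    """Expected errors per base for one read, from its quality string."""
    if not quality:
        raise ValueError("empty quality string")
    # Convert each quality character to an error probability, P = 10^(-Q/10)
    error_probs = [10 ** (-(ord(char) - offset) / 10) for char in quality]
    return sum(error_probs) / len(error_probs)


def passes_filter(quality: str, max_rate: float = 0.01) -> bool:
    """True if the read would be kept at the given expected-error rate."""
    return expected_error_rate(quality) <= max_rate
```

A rate of 0.01 corresponds to an average of one expected error per 100 bases, so a 300 bp merged read is allowed up to about 3 expected errors.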

<h4>Filter leaves</h4>

In [None]:
## leaves

cd /home/daniel/Documents/submissions/taibioinfo/taiwan_combined_biom/trim_leaves/merged

for i in *; do
    out=${i/.merged.fastq/\.merged\.filt\.fastq}
    #echo $i $out
    usearch -fastq_filter $i -fastq_maxee_rate .01 -fastqout "../filtered/"$out -notrunclabels &>> ../filtered/leaf_filterStdout.txt
done


In [4]:
cat leaf_filterStdout.txt

usearch v8.1.1861_i86linux32, 4.0Gb RAM (12.1Gb total), 4 cores
(C) Copyright 2013-15 Robert C. Edgar, all rights reserved.
http://drive5.com/usearch

License: danchurchthomas@gmail.com

00:01  32Mb  100.0% Filtering, 97.5% passed
    102021  FASTQ recs (102.0k)            
     99492  Converted (99.5k, 97.5%)
usearch v8.1.1861_i86linux32, 4.0Gb RAM (12.1Gb total), 4 cores
(C) Copyright 2013-15 Robert C. Edgar, all rights reserved.
http://drive5.com/usearch

License: danchurchthomas@gmail.com

00:01  32Mb  100.0% Filtering, 98.2% passed
    104711  FASTQ recs (104.7k)            
    102793  Converted (102.8k, 98.2%)
usearch v8.1.1861_i86linux32, 4.0Gb RAM (12.1Gb total), 4 cores
(C) Copyright 2013-15 Robert C. Edgar, all rights reserved.
http://drive5.com/usearch

License: danchurchthomas@gmail.com

00:00  32Mb  100.0% Filtering, 98.1% passed
    104701  FASTQ recs (104.7k)            
    102708  Converted (102.7k, 98.1%)
usearch v8.1.1861_i86linux32, 4.0Gb RAM (12.1Gb total), 4 cores
(C) Copyright 2013-15 Robert C. Edgar, all rights reserved.
http://drive5.com/usearch

License: danchurchthomas@gmail.com

00:01  32Mb  100.0% Filtering, 97.6% passed
    102999  FASTQ recs (103.0k)            
    100516  Converted (100.5k, 97.6%)
usearch v8.1.1861_i86linux32, 4.0Gb RAM (12.1Gb total), 4 cores
(C) Copyright 2013-15 Robert C. Edgar, all rights reserved.
http://drive5.com/usearch

License: danchurchthomas@gmail.com

00:00  32Mb  100.0% Filtering, 97.1% passed
    100736  FASTQ recs (100.7k)            
     97824  Converted (97.8k, 97.1%)
usearch v8.1.1861_i86linux32, 4.0Gb RAM (12.1Gb total), 4 cores
(C) Copyright 2013-15 Robert C. Edgar, all rights reserved.
http://drive5.com/usearch

License: danchurchthomas@gmail.com

00:01  32Mb  100.0% Filtering, 98.2% passed
    105062  FASTQ recs (105.1k)            
    103173  Converted (103.2k, 98.2%)
usearch v8.1.1861_i86linux32, 4.0Gb RAM (12.1Gb total), 4 cores
(C) Copyright 2013-15 Robert C. Edgar, all rights reserved.
http://drive5.com/usearch

License: danchurchthomas@gmail.com



00:02  32Mb  100.0% Filtering, 96.8% passed
    104980  FASTQ recs (105.0k)            
    101625  Converted (101.6k, 96.8%)
usearch v8.1.1861_i86linux32, 4.0Gb RAM (12.1Gb total), 4 cores
(C) Copyright 2013-15 Robert C. Edgar, all rights reserved.
http://drive5.com/usearch

License: danchurchthomas@gmail.com

00:02  32Mb  100.0% Filtering, 96.4% passed
    103707  FASTQ recs (103.7k)            
     99960  Converted (100.0k, 96.4%)
usearch v8.1.1861_i86linux32, 4.0Gb RAM (12.1Gb total), 4 cores
(C) Copyright 2013-15 Robert C. Edgar, all rights reserved.
http://drive5.com/usearch

License: danchurchthomas@gmail.com

00:02  32Mb  100.0% Filtering, 94.5% passed 
     98488  FASTQ recs (98.5k)             
     93055  Converted (93.1k, 94.5%)
usearch v8.1.1861_i86linux32, 4.0Gb RAM (12.1Gb total), 4 cores
(C) Copyright 2013-15 Robert C. Edgar, all rights reserved.
http://drive5.com/usearch

License: danchurchthomas@gmail.com

00:02  32Mb  100.0% Filtering, 96.5% passed
    103949  FAST

     99443  Converted (99.4k, 98.2%)
usearch v8.1.1861_i86linux32, 4.0Gb RAM (12.1Gb total), 4 cores
(C) Copyright 2013-15 Robert C. Edgar, all rights reserved.
http://drive5.com/usearch

License: danchurchthomas@gmail.com

00:01  32Mb  100.0% Filtering, 98.2% passed
    102558  FASTQ recs (102.6k)            
    100694  Converted (100.7k, 98.2%)
usearch v8.1.1861_i86linux32, 4.0Gb RAM (12.1Gb total), 4 cores
(C) Copyright 2013-15 Robert C. Edgar, all rights reserved.
http://drive5.com/usearch

License: danchurchthomas@gmail.com

00:02  32Mb  100.0% Filtering, 97.6% passed
     98963  FASTQ recs (99.0k)             
     96545  Converted (96.5k, 97.6%)
usearch v8.1.1861_i86linux32, 4.0Gb RAM (12.1Gb total), 4 cores
(C) Copyright 2013-15 Robert C. Edgar, all rights reserved.
http://drive5.com/usearch

License: danchurchthomas@gmail.com

00:01  32Mb  100.0% Filtering, 97.4% passed
     98644  FASTQ recs (98.6k)             
     96079  Converted (96.1k, 97.4%)
usearch v8.1.1861_i86linux32, 4.0Gb RAM (12.1Gb total), 4 cores
(C) Copyright 2013-15 Robert C. Edgar, all rights reserved.
http://drive5.com/usearch

License: danchurchthomas@gmail.com

00:02  32Mb  100.0% Filtering, 98.7% passed 
    102947  FASTQ recs (102.9k)            
    101629  Converted (101.6k, 98.7%)
usearch v8.1.1861_i86linux32, 4.0Gb RAM (12.1Gb total), 4 cores
(C) Copyright 2013-15 Robert C. Edgar, all rights reserved.
http://drive5.com/usearch

License: danchurchthomas@gmail.com

00:02  32Mb  100.0% Filtering, 98.8% passed
    103653  FASTQ recs (103.7k)            
    102394  Converted (102.4k, 98.8%)
usearch v8.1.1861_i86linux32, 4.0Gb RAM (12.1Gb total), 4 cores
(C) Copyright 2013-15 Robert C. Edgar, all rights reserved.
http://drive5.com/usearch

License: danchurchthomas@gmail.com

00:01  32Mb  100.0% Filtering, 98.2% passed
     98820  FASTQ recs (98.8k)             
     97054  Converted (97.1k, 98.2%)
usearch v8.1.1861_i86linux32, 4.0Gb RAM (12.1Gb total), 4 cores
(C) Copyright 2013-15 Robert C. Edgar, all rights reserved.
http://drive5.com/usearch

License: danchurchthomas@gmail.com

00:02  32Mb  100.0% Filtering, 97.8% passed 
     99553  FASTQ recs (99.6k)             
     97330  Converted (97.3k, 97.8%)
usearch v8.1.1861_i86linux32, 4.0Gb RAM (12.1Gb total), 4 cores
(C) Copyright 2013-15 Robert C. Edgar, all rights reserved.
http://drive5.com/usearch

License: danchurchthomas@gmail.com

00:01  32Mb  100.0% Filtering, 98.0% passed
    100658  FASTQ recs (100.7k)            
     98680  Converted (98.7k, 98.0%)
usearch v8.1.1861_i86linux32, 4.0Gb RAM (12.1Gb total), 4 cores
(C) Copyright 2013-15 Robert C. Edgar, all rights reserved.
http://drive5.com/usearch

License: danchurchthomas@gmail.com

00:02  32Mb  100.0% Filtering, 97.7% passed
     99534  FASTQ recs (99.5k)             
     97261  Converted (97.3k, 97.7%)
usearch v8.1.1861_i86linux32, 4.0Gb RAM (12.1Gb total), 4 cores
(C) Copyright 2013-15 Robert C. Edgar, all rights reserved.
http://drive5.com/usearch

License: danchurchthomas@gmail.com



<h4>Filter wood</h4>

In [None]:
## wood
cd /home/daniel/Documents/submissions/taibioinfo/taiwan_combined_biom/trimmed_wood/merged

for i in *; do
    out=${i/.fastq.merged.fastq/\.merge\.filt\.fastq}
    usearch -fastq_filter $i -fastq_maxee_rate .01 -fastqout $out -notrunclabels &>> wood_filterStdout.txt
done


In [2]:
cat wood_filterStdout.txt

usearch v8.1.1861_i86linux32, 4.0Gb RAM (12.1Gb total), 4 cores
(C) Copyright 2013-15 Robert C. Edgar, all rights reserved.
http://drive5.com/usearch

License: danchurchthomas@gmail.com

00:00  32Mb  100.0% Filtering, 100.0% passed
     61717  FASTQ recs (61.7k)              
     61717  Converted (61.7k, 100.0%)
usearch v8.1.1861_i86linux32, 4.0Gb RAM (12.1Gb total), 4 cores
(C) Copyright 2013-15 Robert C. Edgar, all rights reserved.
http://drive5.com/usearch

License: danchurchthomas@gmail.com

00:00  32Mb  100.0% Filtering, 100.0% passed
     44977  FASTQ recs (45.0k)              
     44975  Converted (45.0k, 100.0%)
usearch v8.1.1861_i86linux32, 4.0Gb RAM (12.1Gb total), 4 cores
(C) Copyright 2013-15 Robert C. Edgar, all rights reserved.
http://drive5.com/usearch

License: danchurchthomas@gmail.com

00:01  32Mb  100.0% Filtering, 99.8% passed
     37537  FASTQ recs (37.5k)             
     37462  Converted (37.5k, 99.8%)
usearch v8.1.1861_i86linux32, 4.0Gb RAM (12.1Gb total), 4 cores
(C) Copyright 2013-15 Robert C. Edgar, all rights reserved.
http://drive5.com/usearch

License: danchurchthomas@gmail.com

00:00  32Mb  100.0% Filtering, 99.9% passed
     43527  FASTQ recs (43.5k)             
     43484  Converted (43.5k, 99.9%)
usearch v8.1.1861_i86linux32, 4.0Gb RAM (12.1Gb total), 4 cores
(C) Copyright 2013-15 Robert C. Edgar, all rights reserved.
http://drive5.com/usearch

License: danchurchthomas@gmail.com

00:01  32Mb  100.0% Filtering, 100.0% passed
     56663  FASTQ recs (56.7k)              
     56656  Converted (56.7k, 100.0%)
usearch v8.1.1861_i86linux32, 4.0Gb RAM (12.1Gb total), 4 cores
(C) Copyright 2013-15 Robert C. Edgar, all rights reserved.
http://drive5.com/usearch

License: danchurchthomas@gmail.com

00:00  32Mb  100.0% Filtering, 100.0% passed
     55294  FASTQ recs (55.3k)              
     55293  Converted (55.3k, 100.0%)
usearch v8.1.1861_i86linux32, 4.0Gb RAM (12.1Gb total), 4 cores
(C) Copyright 2013-15 Robert C. Edgar, all rights reserved.
http://drive5.com/usearch

License: danchurchthomas@gmail.com

00:00  32Mb  100.0% Filtering, 100.0% passed
     42057  FASTQ recs (42.1k)              
     42053  Converted (42.1k, 100.0%)
usearch v8.1.1861_i86linux32, 4.0Gb RAM (12.1Gb total), 4 cores
(C) Copyright 2013-15 Robert C. Edgar, all rights reserved.
http://drive5.com/usearch

License: danchurchthomas@gmail.com

00:01  32Mb  100.0% Filtering, 100.0% passed
     38095  FASTQ recs (38.1k)              
     38085  Converted (38.1k, 100.0%)
usearch v8.1.1861_i86linux32, 4.0Gb RAM (12.1Gb total), 4 cores
(C) Copyright 2013-15 Robert C. Edgar, all rights reserved.
http://drive5.com/usearch

License: danchurchthomas@gmail.com

00:01  32Mb  100.0% Filtering, 100.0% passed
     34455  FASTQ recs (34.5k)              
     34443  Converted (34.4k, 100.0%)
usearch v8.1.1861_i86linux32, 4.0Gb RAM (12.1Gb total), 4 cores
(C) Copyright 2013-15 Robert C. Edgar, all rights reserved.
http://drive5.com/usearch

License: danchurchthomas@gmail.com

00:01  32Mb  100.0% Filtering, 100.0% passed
     35103  FASTQ recs (35.1k)              
     35091  Converted (35.1k, 100.0%)
usearch v8.1.1861_i86linux32, 4.0Gb RAM (12.1Gb total), 4 cores
(C) Copyright 2013-15 Robert C. Edgar, all rights reserved.
http://drive5.com/usearch

License: danchurchthomas@gmail.com

00:00  32Mb  100.0% Filtering, 99.9% passed
     26257  FASTQ recs (26.3k)             
     26224  Converted (26.2k, 99.9%)
usearch v8.1.1861_i86linux32, 4.0Gb RAM (12.1Gb total), 4 cores
(C) Copyright 2013-15 Robert C. Edgar, all rights reserved.
http://drive5.com/usearch

License: danchurchthomas@gmail.com

00:01  32Mb  100.0% Filtering, 99.9% passed 
     36638  FASTQ recs (36.6k)             
     36597  Converted (36.6k, 99.9%)
usearch v8.1.1861_i86linux32, 4.0Gb RAM (12.1Gb total), 4 cores
(C) Copyright 2013-15 Robert C. Edgar, all rights reserved.
http://drive5.com/usearch

License: danchurchthomas@gmail.com

00:00  32Mb  100.0% Fil

<a id="fastq2fasta"><h3>Convert fastq files to fasta format</h3></a>

Let's use [BBMap tools](https://sourceforge.net/projects/bbmap/) to do the conversion. The FASTX-toolkit has something for this also, but FASTX is a little brittle when dealing with fastq files that use modern Illumina quality scores, etc., and sometimes generates funny errors. 

In [None]:
## drop bbtools into a nearby directory. Java, so can't really put in bin folders. 
bb=/home/daniel/Documents/submissions/taibioinfo/taiwan_combined_biom/bbmap

## just one large leaf file
$bb/reformat.sh in=leaf_merged_filt.fastq \
    out=leaf_merged_filt.fasta \
    fastawrap=0 \
    &> makeLeafFasta.txt

Outputs from leaf sample conversion to fasta:

In [8]:
pwd
#cat makeLeafFasta.txt

/home/daniel/Documents/taiwan/taiwan_combined_biom


In [5]:
## wood to fasta, lots of smaller files:
wfd=/home/daniel/Documents/submissions/taibioinfo/taiwan_combined_biom/trimmed_wood/woodFasta/

cd /home/daniel/Documents/submissions/taibioinfo/taiwan_combined_biom/trimmed_wood/filtered

for i in *; do
$bb/reformat.sh in=$i \
    out=$wfd${i/_R1_trimmed.merge.filt.fastq/.fasta} \
    fastawrap=0 \
    &>> ../../woodFastaStdout.txt
done
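
Conceptually, the conversion keeps only the label and sequence of each record and drops the quality line; `fastawrap=0` is presumably there to keep each sequence on a single line. A rough Python sketch of the same transformation, assuming strict 4-line FASTQ records (the function name `fastq_to_fasta` is hypothetical, and this is an illustration, not a replacement for BBMap):

```python
from typing import Iterable, Iterator, List


def fastq_to_fasta(lines: Iterable[str]) -> Iterator[str]:
    """Yield FASTA lines (label, then unwrapped sequence) from 4-line FASTQ records."""
    record: List[str] = []
    for line in lines:
        record.append(line.rstrip("\n"))
        if len(record) == 4:
            label, sequence, _separator, _quality = record
            if not label.startswith("@"):
                raise ValueError(f"unexpected FASTQ label: {label!r}")
            yield ">" + label[1:]  # swap the leading '@' for '>'
            yield sequence         # one line per sequence, no wrapping
            record = []
    if record:
        raise ValueError("input ended in the middle of a FASTQ record")
```

It can be fed an open file handle directly, for example `for out_line in fastq_to_fasta(open(path)): ...`, since it only reads the input one line at a time.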