# One big biome table

Since I am hoping to make comparisons of wood and leaf endophyte environmental patterns, I need to combine these datasets early in the biomformatics pipeline, to make them as comparable as possible. We'll try to stick to the [usearch (uparse)](http://drive5.com/usearch/) pipeline for the process, as much as possible.

## Table of contents

[Work environment](#environment)

[Merge paired ends](#mergepairs)

[Quality filtering](#qf)

[Convert form fastq to fasta format](#fastq2fasta)

[Demultiplex leaf reads](#demult)

[Remove primers](#removeprimers)

[Floating primers](#defloat)

[Chimera checking](#chimera)

[Combining fasta files from all 3 studies](#combine)

[OTU clustering](#otuclust)

[Customizing UNITE database](#UNITE)

[Assign taxonomy](#asstax)

[Make and tidy up biom table](#makebiom)

[Add metadata](#addmetadata)

<a id='environment'></a>

### Work environment

Working directory, on my machine:

In [2]:
cd /home/daniel/Documents/Taiwan_data/combined/combo_biome



We'll be using the [usearch (uparse)](http://drive5.com/usearch/) pipeline, version usearch v8.1.1861_i86linux32 on my personal machine, and the equivalent 64-bit version on our computing cluster [ACISS](http://aciss-computing.uoregon.edu/). These programs are abbreviated (soft linked) to "usearch81" and "usearch", respectively. 

<a id='mergepairs'></a> 

### Merging paired-end reads

Let's re-pair all readsets in the same manner, except for Roo's stromatal readset, which he aligned by hand. 

First, the leaf study reads include a split 6+6 bp barcode scheme for identifying reads, so these need to be clipped from one read and combined on the other. I wrote a python script for this, let's see if it works:

In [1]:
./BCunsplit4.py Roo_R2.fastq Roo_R1.fastq



This outputs two files, "rearranged_Roo_R2.fastq" and "rearranged_Roo_R2.fastq"

Next we trim a little to make sure we doing our alignments with high quality base calls. The sites for trimming are decided by looking at the raw reads [(see below)](#quality), and finding where quality begins to drop off. 
To trim, we'll use the [FASTX-toolkit](http://hannonlab.cshl.edu/fastx_toolkit/).

In [2]:
## wood
fastx_trimmer -l 255 -i woodR1.fastq -o woodR1_trimmed.fastq
fastx_trimmer -l 210 -i woodR2.fastq -o woodR2_trimmed.fastq

## leaves. These lengths were decided by Roo
fastx_trimmer -l 263 -i rearranged_Roo_R1.fastq -o Roo_R1_trimmed.fastq
fastx_trimmer -l 170 -i rearranged_Roo_R2.fastq -o Roo_R2_trimmed.fastq



Do the actual pairing:

In [4]:
## wood, let's pair both trimmed and untrimmed just to compare:
usearch -fastq_mergepairs woodR1.fastq -reverse woodR2.fastq  -fastqout woodtrimmedmerged.fastq -notrunclabels
usearch -fastq_mergepairs woodR1.fastq -reverse woodR2.fastq -fastqout woodmerged.fastq -notrunclabels



The "-notrunclabels" tag above asks usearch to keep the entire label of the forward reads, which is necessary because the wood reads, which are more recently sequenced than the leaf reads, contain sample info in their identifier lines. The leaves do not require this, their sample info is still in the sequence itself, to be use to [demultiplex](#demult) them, later.

** typical report from these:**

usearch v8.1.1803_i86linux64, 74.2Gb RAM, 12 cores
(C) Copyright 2013-15 Robert C. Edgar, all rights reserved.
http://drive5.com/usearch

Licensed to: ronh@molbio.uoregon.edu

<br>03:02 925Mb  100.0% 95.6% merged<br>5567799  Pairs (5.6M)<br>5323274  Merged (5.3M, 95.61%)<br>2067236  Alignments with zero diffs (37.13%)<br>0  Fwd tails Q <= 2 trimmed (0.00%)<br>15  Rev tails Q <= 2 trimmed (0.00%)<br>244525  No alignment found (4.39%)<br>0  Alignment too short (< 16) (0.00%)<br>4512849  Staggered pairs (81.05%) merged & trimmed<br>239.63  Mean alignment length
259.16  Mean merged read length<br>3.81  Mean fwd expected errors<br>6.27  Mean rev expected errors<br>1.57  Mean merged expected errors

These numbers look pretty good. Many of the erroneous reads will be taken out below, in a [quality filtering](#qf) step

In [5]:
## leaves
usearch -fastq_mergepairs Roo_R2.fastq -reverse Roo_R1.fastq -fastqout leafmerged.fastq
usearch -fastq_mergepairs Roo_R2_trimmed.fastq -reverse Roo_R1_trimmed.fastq -fastqout leaftrimmedmerged.fastq



What did trimming do to our reads? Let's take a look. To plot these we'll use the [fastx wrapper](http://hannonlab.cshl.edu/fastx_toolkit/commandline.html#fastq_quality_boxplot_usage) for [gnuplot](http://www.gnuplot.info/), which I've tinkered with just a little to change up the crowded axes of the original setup. Fastx requires that we first compile the quality data from the fastq files:

In [18]:
## wood quality stats:
fastx_quality_stats -i woodR1.fastq -o woodR1_fastxstats.txt
fastx_quality_stats -i woodR2.fastq -o woodR2_fastxstats.txt
fastx_quality_stats -i woodR1_trimmed.fastq -o woodR1_fastxstats.txt
fastx_quality_stats -i woodR2_trimmed.fastq -o woodR2_fastxstats.txt
fastx_quality_stats -i woodmerged.fastq -o woodmerged_fastxstats.txt
fastx_quality_stats -i woodtrimmedmerged.faestq -o woodtrimmedmerged_fastxstats.txt



In [None]:
## leaf quality stats:
fastx_quality_stats -i Roo_R1.fastq -o Roo_R1_fastxstats.txt
fastx_quality_stats -i Roo_R2.fastq -o Roo_R2_fastxstats.txt
fastx_quality_stats -i Roo_R1_trimmed.fastq -o Roo_R1_trimmed_fastxstats.txt
fastx_quality_stats -i Roo_R2_trimmed.fastq -o Roo_R2_trimmed_fastxstats.txt
fastx_quality_stats -i leafmerged.fastq -o leafmerged_fastxstats.txt


Make the graphics:

In [20]:
## wood graphics
./dan_plot.sh -i woodR1_fastxstats.txt -o woodR1_quality.png
./dan_plot.sh -i woodR2_fastxstats.txt -o woodR2_quality.png
./dan_plot.sh -i  woodR1_trimmed_fastxstats.txt -o woodR1_trimmed_quality.png
./dan_plot.sh -i  woodR2_trimmed_fastxstats.txt -o woodR2_trimmed_quality.png
./dan_plot.sh -i woodmerged_fastxstats.txt -o woodmerged_quality.png
./dan_plot.sh -i  woodtrimmedmerged_fastxstats.txt -o woodtrimmedmerged_quality.png

## leaf graphics
./dan_plot.sh -i  Roo_R1_trimmed_fastxstats.txt -o Roo_R1_trimmed_quality.png
./dan_plot.sh -i  Roo_R2_trimmed_fastxstats.txt -o Roo_R2_trimmed_quality.png
./dan_plot.sh -i leaftrimmedmerged_fastxstats.txt -o leaftrimmedmerged_quality.png



<a id='quality'></a>

The **untrimmed wood R1** reads look like this:

<img src='woodR1_quality.png'>

Compare to **trimmed wood R1** reads:

<img src='woodR1_trimmed_quality.png'>

The **untrimmed wood R2** reads look like this:

<img src='woodR2_quality.png'>

Compare to **trimmed wood R2** reads:

<img src='woodR2_trimmed_quality.png'>

And the **untrimmed, merged wood** file looks like this:

<img src='woodmerged_quality.png'>

Obviously some problems here. So compare to the **merged, trimmed wood** reads:

<img src='woodtrimmedmerged_quality.png'>

Looks much better, but still a large dip in quality around 15 bp. 

Roo has already decided the trimming sites for his leaf data [(see above)](#mergepairs). Skip to the trimmed leaf R1 reads:

<img src='Roo_R1_trimmed_quality.png'>

The trimmed leaf R2 reads:

<img src='Roo_R2_trimmed_quality.png'>

Trimmed, merged leaf reads:

<img src='leaftrimmedmerged_quality.png'>

In both leaves and wood, I think we've vastly improved the situation by merging. The nice thing about using the usearch merging algorithms is that they use a bayesian approach to calculating the q-scores of paired reads, so that agreements on base calls between R1 and R2 improve Q scores, and disagreements reduce them. The reduction by disagreement is proportional to the confidence of the two base calls at a site, so if the disagreement occurs at the end of a read, where quality is lower, the higher (more reliable) Q has a much greater influence on the final q-score of a base call. Check out the explanation [here](http://drive5.com/usearch/manual/exp_errs.html).

<a id='qf'></a>

## Quality filtering

Continuing with the usearch pipeline, let's do some quality filtering. We'll use the [expected error approach](http://drive5.com/usearch/manual/exp_errs.html). We can set error cutoff of 1% of all bases in a read, meaning that a read of length 400 bp is thrown out if it likely contains 4 or more erroneous bases. I think this is permissable, given our OTU clustering will ultimately be done at 95% similarity.

In [25]:
## wood sequences
usearch -fastq_filter woodtrimmedmerged.fastq -fastq_maxee_rate .01 -fastqout wood_filtered.fastq -notrunclabels



usearch v8.1.1803_i86linux64, 74.2Gb RAM, 12 cores
(C) Copyright 2013-15 Robert C. Edgar, all rights reserved.
http://drive5.com/usearch

Licensed to: ronh@molbio.uoregon.edu

01:04 857Mb  100.0% Filtering, 85.4% passed
   5323274  FASTQ recs (5.3M)              
   4548698  Converted (4.5M, 85.4%)

In [26]:
## leaves
usearch -fastq_filter leaftrimmedmerged.fastq -fastq_maxee_rate 0.01 -fastqout leaf_filtered.fastq



usearch v8.1.1803_i86linux64, 529Gb RAM, 32 cores
(C) Copyright 2013-15 Robert C. Edgar, all rights reserved.
http://drive5.com/usearch

Licensed to: ronh@molbio.uoregon.edu

07:01 2.4Gb  100.0% Filtering, 90.7% passed
  16701565  FASTQ recs (16.7M)             
  15145323  Converted (15.1M, 90.7%)

We can inspect these graphically as above, to see if avg read quality has improved. Not placing this here because the graphs look basically the same before and after filtering. But notice that we drop 15% of wood reads, and 9% of the leaf reads. 

<a id='fastq2fasta'></a>

## Convert from fastq to fasta format

Now that we've merged paired ends and done some quality filtering to ensure that we've hopefully mostly eliminated sequencer error (hah! see [index bleed](#bleed) below), let's convert to fasta as required by most downstream steps. Using FASTX toolkit again:

In [5]:
fastq_to_fasta -n -i leaf_filtered.fastq -o  leaf.fasta
fastq_to_fasta -n -i wood_filtered.fastq -o  wood.fasta



The "-n" flag tells fastx to retain sequences with "N" basecalls. Otherwise, these are removed, by default. Since ~1/2 of our leaf reads contain an "N", we need these. A single N basecall is an acceptable loss of information, OTU clustering and taxonomic assignments should be to deal with this. 

<a id="demult"></a>

## Demultiplex leaf reads

Roo's leaf reads were prepped at an earlier date than the wood reads. At the time of their sequencing, the standard method for denoting sample identities was to look for the presence of 12 bp golay barcodes, the ones that we cut and pasted when we [merged paired end reads](#mergepairs). For probably the only time in this pipeline, we will use a [qiime](http://qiime.org/) script, ["demultiplex_fasta.py"](http://qiime.org/scripts/demultiplex_fasta.html) that was made to parse samples by these golay barcodes.

This script requires a mapping file that lists the barcodes and their accompanying sample info, plus a "linkerprimersequence". I use a map file supplied by Roo. I do not know what a "linkerprimersequence" is, I believe this is supplied by the illumina software. The script seems to prefer .tsv format to .csv, and looks like this:

In [9]:
head leaf_sample_map.tsv

#SampleID	BarcodeSequence	LinkerPrimerSequence
1Leaf	ACCCATATATCC	GCTGCGTTCTTCATCGATGC
2Leaf	ACCCATAAGACG	GCTGCGTTCTTCATCGATGC
3Leaf	TCGCCAGAACCA	GCTGCGTTCTTCATCGATGC
4Leaf	ACCCATATCAAA	GCTGCGTTCTTCATCGATGC
5Leaf	ACCCATATAGTA	GCTGCGTTCTTCATCGATGC
6Leaf	ACCCATCTACAG	GCTGCGTTCTTCATCGATGC
7Leaf	ACCCATCATACC	GCTGCGTTCTTCATCGATGC
8Leaf	ACCCATCATTAT	GCTGCGTTCTTCATCGATGC
9Leaf	ACCCATCTATCT	GCTGCGTTCTTCATCGATGC


In [8]:
demultiplex_fasta.py -m leaf_sample_map.tsv -f leaf.fasta -o ./leaf_demult



This produces a folder with a log file and a file called "demultiplexed_seqs.fna". I will rename this to "leaf.fna" and bring it into our working directory. We can remove primers from the sequences now. 

<a id='removeprimers'></a>

## Remove primers

PCR Primers in our wood reads were not included in these sequences, because they are used as part of the sequencing primers. The leaf reads, however, still contain our forward and reverse PCR primers. To remove these, we just clip the appropriate number of BPs from each end. We'll use the FASTX toolkit again. We will trim these reads further [below](#trim3) when combining the three studies (stromata, leaf endophyte, and wood endophyte) into a single fasta file. 

How long are our primers?

In [15]:
##ITS1F
expr length "CTTGGTCATTTAGAGGAAGTAA"
## ITS2
expr length "GCTGCGTTCTTCATCGATGC"

22
20


In [16]:
fastx_trimmer -f 23 -i leaf.fna | fastx_trimmer -t 20 -o leaf_noprim.fna



<a id='defloat'></a>

## Floating primers

In both of our read sets, "floating" primer sequences appear. This happens in other studies, as indicated by [Bálint et al. (2014)](http://onlinelibrary.wiley.com/doi/10.1002/ece3.1107/abstract;jsessionid=FBCBBBE428CAA870889926051DBC9927.f04t02). As advised by these authors, I remove the sequences that contain these floating primers with a script.

In [10]:
## ITS1F
grep CTTGGTCATTTAGAGGAAGTAA wood.fasta | wc -l



129

In [11]:
## ITS2
grep GCTGCGTTCTTCATCGATGC wood.fasta | wc -l



18

Not many. This doesn't include the reverse complements that also occur. To remove these floating primers:

In [13]:
## wood
./floatingprimers.py wood.fasta wood_defloat.fasta CTTGGTCATTTAGAGGAAGTAA GCTGCGTTCTTCATCGATGC
## leaves
./floatingprimers.py leaf.fna leaf_defloat.fna CTTGGTCATTTAGAGGAAGTAA GCTGCGTTCTTCATCGATGC



Arguments for this script: floatingprimers.py input_fasta_file output_fasta_file forward_primer reverse_primer. 

Reverse compliments are checked automatically from the forward and reverse primer sequences that are given as arguments.

When doing this, we lose 230 reads from the wood sequences (0.005%), and 312 from the leaves (0.002%). There were many more in the raw reads, but I think they were removed by the quality filtering.

<a id='chimera'></a>

## Chimera checking

Time to look for chimeras. We'll use the [uchime](http://www.drive5.com/usearch/manual/cmd_uchime_ref.html) algorithm, another step in the uparse/usearch pipeline. This is actually just the first of two checks for chimeras, the other being part of the [otu clustering](#otus).

Get the latest [UNITE](https://unite.ut.ee/repository.php) database for usearch:

In [14]:
wget https://unite.ut.ee/sh_files/uchime_reference_dataset_01.01.2016.zip

--2016-08-21 18:27:34--  https://unite.ut.ee/sh_files/uchime_reference_dataset_01.01.2016.zip
Resolving unite.ut.ee (unite.ut.ee)... 2001:bb8:2002:500:ec4:7aff:fe0a:37b2, 193.40.5.164
Connecting to unite.ut.ee (unite.ut.ee)|2001:bb8:2002:500:ec4:7aff:fe0a:37b2|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 8233914 (7.9M) [application/zip]
Saving to: ‘uchime_reference_dataset_01.01.2016.zip’


2016-08-21 18:27:37 (3.95 MB/s) - ‘uchime_reference_dataset_01.01.2016.zip’ saved [8233914/8233914]



Unzip this, then use the ITS1 only database to check for chimeras:

In [17]:
## wood
usearch -uchime_ref wood_defloat.fasta -db /home6/dthomas/combobiom/uchime_reference_dataset_01.01.2016/ITS1_ITS2_datasets/uchime_sh_refs_dynamic_develop_985_01.01.2016.ITS1.fasta -nonchimeras wood_notchim.fasta -strand plus -uchimeout woodchim_log.txt -notrunclabels

## leaves
usearch -uchime_ref leaf_defloat.fna -db /home6/dthomas/combobiom/uchime_reference_dataset_01.01.2016/ITS1_ITS2_datasets/uchime_sh_refs_dynamic_develop_985_01.01.2016.ITS1.fasta -nonchimeras leaf_notchim.fna -strand plus -uchimeout leafchim_log.txt



How many reads were chimeric? In the wood:

In [18]:
grep '>' wood_defloat.fasta | wc -l
grep '>' wood_notchim.fasta | wc -l



4548468<br>
4503552<br>
Looks like we lost 44,916 reads (~1%) of the wood reads from the previous step.

Leaves:

In [19]:
grep '>' leaf_defloat.fna | wc -l
grep '>' leaf_notchim.fna | wc -l 



10820734<br>
10684953<br>
We lost 135,781 reads (~1%) of leaf reads from the previous step.

<a id='combine'></a>

## Combining fasta files from all 3 studies

So, the whole point of this exercise is get our three types of sequence files (Leaf endophytes, wood endophytes, and stromata) into a single fasta file that can be used to create OTU clusters. To help this, we'll simplify our leaf and wood identifier labels, extract ITS1 from our stromata, trim our illumina reads to ITS1 region manually, and concatenate the three fasta files. 

### Simplifying leaf endophytes identifiers:

Note: I omitted a step in this section. The utax command, which we use to generate our [biom table](#makebiom), only allows alpha-numeric characters in names of samples. This is an issue for us, the leaf reads have periods in some of the names. Stromata sequences also require a single sample name to differentiate them from the other types of sequences Corrected [here](#underscores) -dan.

Leaf read identifiers look like this:

In [23]:
head -n 1 leaf_notchim.fna



\>78Leaf_5

Use gnu [SED](https://www.gnu.org/software/sed/) to strip these down and reformat them:

In [24]:
sed s/_.*//g leaf_notchim.fna > leaf_relab.fna



Leaf identifiers just contain sample number and study (host) info. They look like this:

In [25]:
head -n 1 leaf_relab.fna



\>78Leaf

### Simplifying wood endophytes identifiers:

Wood identifiers look this:

In [20]:
head -n 1 wood_notchim.fasta



\>M01498:244:000000000-ANT97:1:1101:17999:1109 1:N:0:160

Here's a series of regexes that seems to work for simplifying them:

In [26]:
sed s/^\>.*[0-9]\ //g wood_notchim.fasta | sed s/1:[YN]:0:/\>/g | sed '/^>/ s/$/wood/g' > wood_relab.fasta



Looks like this:

In [27]:
head -n 1 wood_relab.fasta



\>160wood

### Extract ITS1 from stromata full ITS sequences

These stromatal reads are hand-aligned and curated by Roo, but their origins vary, as do their primer sites. To make them comparable during otu clustering, we'll extract the ITS1 region from them. The ITS1 region from stromata reads will be extracted using [Bengtsson-Palme et al.'s (Nilsson's) ITS extractor](http://microbiology.se/software/itsx/). 

The installation of the ITS extractor was theoretically simple but a little bit of a pain in the ass. It was simple in that you simply need to download a compressed, archived package from the [ITS extractor website](http://microbiology.se/software/itsx/), and it's main dependency, the [HMMER package](http://hmmer.org/). 

Each of these packages contains binaries that worked with my linux setups, but initial attempts at using the command line programs returned a lot of errors, so had do several things: delete old databases [(details here)](http://microbiology.se/2013/07/08/metaxa-and-hmmer-3-1b/) and use "--reset T" flag to create new dbs. I also had use the "hmmpress" command from the HMMER package with every one of these .hmm files, as a superuser, to create databases with correct formats. 


In [2]:
./ITSx -i ../Final_Stromata_Ref_Sequences.fasta -o strom --preserve T --allow_single_domain -t F



Some notes about the settings used here:

--preserve T  == keep our identifier lines, don't let ITSx create new ones 

--allow_single_domain == if ITSx can only find one conserved region (18s, 5.8, or 28s) to anchor into, this is enough (usually require two)

--t F == look at fungal reads only

--multi_thread 12 == use 12 cores, cuz we got 'em.

ITS1 from all 51 stromatal sequences were extracted. Problems reported by ITSx:

In [5]:
cat strom.problematic.txt

A_purpureonitens_consesus	End of SSU sequence not found; Start of LSU sequence not found
H_perforatum_China	End of SSU sequence not found
X_enterogena_Ecuador	Start of LSU sequence not found
X_flabelliforme_australia_thai	Start of LSU sequence not found
X_flabelliforme_china	Start of LSU sequence not found


<a id='trim3'></a>

### Trimming leaf and wood illumina reads to ITS1 region

Now we need to get rid of SSU and 5.8s region sequences from our Illumina reads. Manual inspection of our stromata sequences, and the ITSx mapping of a random subset of our leaf and wood illumina datasets both show a constant distance of 46 base pairs from the end of our forward (ITS1F) primer and the beginning of the ITS region, and 30 bp after the end of the ITS1 region we hit our ITS2 primer site. 

Each run of ITSx generates a positions file which tells us where SSU, 5.8s, and LSU regions (not present in our illumina reads) are located:

In [7]:
## get 100 randomly selected reads from our illumina fastas, after removing linebreaks
./subset_fasta.py leaf_relab_nolb.fna 100 leaf_sub.fna
./subset_fasta.py wood_relab_nolb.fasta 100 wood_sub.fasta
cat leaf_sub.fna wood_sub.fasta > combo_sub.fasta
## have ITSx program take a look at them...
./ITSx -i ../combo_sub.fasta -o combo_sub --preserve T --allow_single_domain -t F



In [2]:
head combo_sub.positions.txt; tail combo_sub.positions.txt

130Leaf	330 bp.	SSU: 1-46	ITS1: 47-300	5.8S: No end	ITS2: Not found	LSU: Not found	Broken or partial sequence, only partial 5.8S! 
38Leaf	315 bp.	SSU: 1-46	ITS1: 47-285	5.8S: No end	ITS2: Not found	LSU: Not found	Broken or partial sequence, only partial 5.8S! 
127Leaf	315 bp.	SSU: 1-46	ITS1: 47-285	5.8S: No end	ITS2: Not found	LSU: Not found	Broken or partial sequence, only partial 5.8S! 
102Leaf	262 bp.	SSU: 1-46	ITS1: 47-226	5.8S: No end	ITS2: Not found	LSU: Not found	Broken or partial sequence, only partial 5.8S! 
95Leaf	257 bp.	SSU: 1-46	ITS1: 47-227	5.8S: No end	ITS2: Not found	LSU: Not found	Broken or partial sequence, only partial 5.8S! 
96Leaf	315 bp.	SSU: 1-46	ITS1: 47-227	5.8S: No end	ITS2: Not found	LSU: Not found	Broken or partial sequence, only partial 5.8S! 
24Leaf	256 bp.	SSU: 1-46	ITS1: 47-226	5.8S: No end	ITS2: Not found	LSU: Not found	Broken or partial sequence, only partial 5.8S! 
117Leaf	254 bp.	SSU: 1-46	ITS1: 47-224	5.8S: No end	ITS2: Not found	LSU: Not fou

And so on. All reads examined contain 46 bp of SSU DNA, then ITS1 region starts. Also notice that the ITS1 region ends 30 bp before the end of the read, so our ITS2 primers are located 30 bp into the 5.8s region. Knowing this, we can trim these ends to just the ITS1 region:

In [5]:
## The wood:
fastx_trimmer -f 47 -i wood_relab_nolb.fasta | fastx_trimmer -t 30 -o wood.ITS1.fasta

## the leaves:
fastx_trimmer -f 47 -i leaf_relab_nolb.fna | fastx_trimmer -t 30 -o leaf.ITS1.fasta



### Combining wood and leaf fasta files

<a id='underscores'></a>

Now we combine our leaf and wood endophyte fastas into a single fasta. We will do this in two ways: 

(1) all reads from all three libraries will be combined, this is our data, which will await taxonomic assignments and sample data. For this let's add a uniform "Strom" label to the strom' to all the stromatal sequences, this is important for making the biom table [below](#makebiom). We'll also change out the periods in the leaf reads for underscores, this messes with the utax command. This probably should have been done when we [simplified illumina identifiers](#combine). Anyway, sed to rescue again:

In [38]:
sed '/^>/  s/>/>Strom;/' Final_Stromata_Ref_Sequences.fasta > strom_add_to_combo.fasta



In [None]:
sed '/>/ s/\./_/' leaf.ITS1.fasta -i

Now combine them:

In [None]:
cat wood.ITS1.fasta leaf.ITS1.fasta strom_add_to_combo.fasta > combo.fasta

(2) However we also need to generate an OTU repset as part of our tool box for making taxonomic assignments. As part of this, leaf and wood reads need to be combined, dereplicated, then stromatal sequences added after removing singletons from our illumina sets. For our OTU representative set, we wait to add in our stromata, because we don't want to lose these in the singleton removal [below](#removesings).

In [13]:
cat leaf.ITS1.fasta wood.ITS1.fasta > comboLW.fasta



<a id='otuclust'></a>

## OTU clustering

We're following the [UPARSE pipeline](http://drive5.com/usearch/manual/uparse_pipeline.html) recommendations as much as possible. They recommend, for otu clustering, to remove singletons from our reads. They can be added back in later, during the final steps of [making our biome table](), [according to drive5](http://drive5.com/usearch/manual/mapreadstootus.html). But here, as we generate our OTU rep set, they can form spurious OTUs. So out they go. We're not removing our carefully curated stromatal sequences, though, as the point of excluding singletons here is that they have a high probability of being erroneous. Also, we want our stromata to inform the clustering process.

First we [dereplicate](http://drive5.com/usearch/manual/cmd_derep_fulllength.html) our reads, takes a few seconds:

In [9]:
usearch -derep_fulllength comboLW.fasta -fastaout combo_derepLW.fasta -sizeout



usearch v8.1.1861_i86linux64, 74.2Gb RAM, 12 cores
(C) Copyright 2013-15 Robert C. Edgar, all rights reserved.
http://drive5.com/usearch

Licensed to: ronh@molbio.uoregon.edu

00:36 3.7Gb  100.0% Reading comboLW.fasta
00:44 7.7Gb 15186063 seqs, 2286802 uniques, 1822327 singletons (79.7%)
00:44 7.7Gb Min size 1, median 1, max 416662, avg 6.64
00:58 4.8Gb  100.0% Writing combo_derepLW.fasta

<a id='removesings'></a>

Next we [sort](http://drive5.com/usearch/manual/cmd_sortbysize.html) these dereplicated reads by size. In the process, we remove singletons.

In [10]:
usearch -sortbysize combo_derepLW.fasta -fastaout combo_sorted.fasta -minsize 2



usearch v8.1.1861_i86linux64, 74.2Gb RAM, 12 cores
(C) Copyright 2013-15 Robert C. Edgar, all rights reserved.
http://drive5.com/usearch

Licensed to: ronh@molbio.uoregon.edu

00:06 635Mb  100.0% Reading combo_derepLW.fasta
00:06 601Mb Getting sizes                      
00:09 620Mb Sorting 464475 sequences
00:10 622Mb  100.0% Writing output

How many of our sequences were singletons? 

In [14]:
grep '>' combo_derepLW.fasta | wc -l
grep '>' combo_sortedLW.fasta | wc -l



2286802<br>
464475

Wow, a lot of singletons. 

Time to add the stromata back into this fasta file. Let's change the stromata labels to match the dereplicated, sorted illumina read labels. 

In [15]:
## sed, add "size 1" to end of each identifier
sed '/^>/ s/$/;size=1;/' strom.ITS1.fasta > strom.ITS1.relab.fasta



In [17]:
head strom.ITS1.relab.fasta -n 2

>A_aff_atroroseum;size=1;
GCGAGTTAGCAAAACTCCAAAACCCTTTGTGAACCTTACCGTCGTTGCCTCGGCGTGTGCCGCGGCTACCCTGGAGTAGTTACCCTGGACAGGTTACCCTATAGGGGCTACCCTGGAGGGGTTCCTACCCTGGAAGCCGGCACCCGGCCCGCCAAAGGACCCGTACAAAATTCTGTCTTACCAGTGTATCTCTGAATGCTTCAACTGAAATAAGTTA


In [19]:
cat combo_sortedLW.fasta strom.ITS1.relab.fasta > combo_sorted.fasta



And then give this combined file to usearch to chew on. We'll do a .97 and .95 similarity radius. Since we're going outside of the normal 0.97 radius, and because Roo's other analyses in this paper were done using an older version of usearch  (uclust) at the .95 radius, we'll use the "usearch -cluster_smallmem" command. I believe this is the closest descendant of the older algorithms used by usearch, and the newer algorithms haven't been tested for other radii. See discussion [here](http://drive5.com/usearch/manual/uparse_otu_radius.html).

In [None]:
usearch -cluster_smallmem combo_sorted.fasta -id 0.95 -centroids otus_95_combo.fasta -sizein -sizeout -sortedby size
usearch -cluster_smallmem combo_sorted.fasta -id 0.97 -centroids otus_97_combo.fasta -sizein -sizeout -sortedby size

That was freakishly fast, a couple seconds each with the computing cluster. But I can't see any problems in the output. I think removing all of the singletons for this process vastly decreases the processing time (we went from ~ 1 million unique reads to ~1/4 million).

<a id='UNITE'></a>

# Customize UNITE database

[UNITE](http://www2.dpes.gu.se/project/unite/UNITE_intro.htm) is an attempt by mycologists to get some quality control on fungal accessions in large public databases like GenBank. We will use it to identify our sequences. 

It's handy, but last I checked, there were some problems with sequences only identified to very low taxonomic resolution, e. g. accessions identified only to phylum. The cause problems because BLAST may display these matches over other very close matches that are more completely identified, if they match ever-so-slightly-better to the poorly identified sequences.

I will use usearch's utax algorithm. So let's get the UNITE database made expressly for this:

In [None]:
wget https://unite.ut.ee/sh_files/utax_reference_dataset_31.01.2016.zip

I had lot of trouble getting my python interpreter to work on the database due issues with special characters like umlauts and other symbols (especially "ë" and "×" that are used in some names of fungi in UNITE. My versions of sed and iconv and my python interpeter would not work with UNITE names. My $LANG = en_US.UTF-8, which seems like it should be able to handle these characters, but character encoding is a complex subject which I do not understand. Most of the database is in simple unicode8, though, so I was able to use my favorite text editor VIM to substitute these characters out:

In [None]:
unzip utax_reference_dataset_31.01.2016.zip
vim utax_reference_dataset_31.01.2016.fasta

Inside vim command mode (press ':'):

In [None]:
:%s/ë/e/g
:%s/×/x/g

This fixed it for me. Good luck with your unicode. Maybe mac users won't have this issue. Exit vim, back in the shell. Here I've written another python script, "UNITE_with_class2.py", for only keeping those UNITE records with class identification or better. This is in the repository. 

In [None]:
./UNITE_with_class2.py utax_reference_dataset_31.01.2016.fasta utax_ref_class.fasta

We'll need to append our curated stromata to this database. The usearch pipeline has a standard heading needed to create a searchable database from fasta files, described [here](http://drive5.com/usearch/manual/tax_annot.html). Since the UNITE database as we downloaded it is already formatted to these specs, we just need to reformat our stromata sequence fasta. More SED magic to add in most of the taxonomy:

In [11]:
sed '/^>/ s/>/,s:/g' Final_Stromata_Ref_Sequences.fasta | sed '/,s:/ s/^/>RefStrom;tax=d:Fungi,p:Ascomycota,c:Sordariomycetes,o:Xylariales,f:Xylariaceae/' > strom_add_to_UNITE.fasta



Expand our genus names:

In [12]:
sed 's/s:A_/g:Annulohypoxylon,s:/g' strom_add_to_UNITE.fasta -i
sed 's/s:B_/g:Biscogniauxia,s:/g' strom_add_to_UNITE.fasta -i
sed 's/s:H_/g:Hypoxylon,s:/g' strom_add_to_UNITE.fasta -i
sed 's/s:K_/g:Kretzschmaria,s:/g' strom_add_to_UNITE.fasta -i
sed 's/s:N_/g:Nemania,s:/g' strom_add_to_UNITE.fasta -i
sed 's/s:W_/g:Whalleya,s:/g' strom_add_to_UNITE.fasta -i
sed 's/s:X_/g:Xylaria,s:/g' strom_add_to_UNITE.fasta -i



Our new headers look like this:

In [22]:
sed '1~2p' strom_add_to_UNITE.fasta -n | head -n 5 
sed '1~2p' strom_add_to_UNITE.fasta -n | tail -n 5

>RefStrom;tax=d:Fungi,p:Ascomycota,c:Sordariomycetes,o:Xylariales,f:Xylariaceae,g:Annulohypoxylon,s:aff_atroroseum
>RefStrom;tax=d:Fungi,p:Ascomycota,c:Sordariomycetes,o:Xylariales,f:Xylariaceae,g:Annulohypoxylon,s:aff_stygium
>RefStrom;tax=d:Fungi,p:Ascomycota,c:Sordariomycetes,o:Xylariales,f:Xylariaceae,g:Annulohypoxylon,s:atroroseum
>RefStrom;tax=d:Fungi,p:Ascomycota,c:Sordariomycetes,o:Xylariales,f:Xylariaceae,g:Annulohypoxylon,s:bovei_var_microspora
>RefStrom;tax=d:Fungi,p:Ascomycota,c:Sordariomycetes,o:Xylariales,f:Xylariaceae,g:Annulohypoxylon,s:moriforme
>RefStrom;tax=d:Fungi,p:Ascomycota,c:Sordariomycetes,o:Xylariales,f:Xylariaceae,g:Xylaria,s:sp_nov_1_long
>RefStrom;tax=d:Fungi,p:Ascomycota,c:Sordariomycetes,o:Xylariales,f:Xylariaceae,g:Xylaria,s:sp_nov_1_short
>RefStrom;tax=d:Fungi,p:Ascomycota,c:Sordariomycetes,o:Xylariales,f:Xylariaceae,g:Xylaria,s:sp_nov_2
>RefStrom;tax=d:Fungi,p:Ascomycota,c:Sordariomycetes,o:Xylariales,f:Xylariaceae,g:Xylaria,s:telfairii
>RefSt

Compare this to the already-formatted UNITE identifiers:

In [27]:
sed '1~2p' utax_ref_class.fasta -n | head -n 2

>EU821669|SH188517.07FU;tax=d:Fungi,p:Basidiomycota,c:Agaricomycetes,o:Agaricales,f:Cortinariaceae,g:Cortinarius,s:Cortinarius_balaustinus_SH188517.07FU;
>AF201716|SH204808.07FU;tax=d:Fungi,p:Ascomycota,c:Sordariomycetes,o:Xylariales,f:Xylariaceae,g:Annulohypoxylon,s:Annulohypoxylon_truncatum_SH204808.07FU;
sed: couldn't write 145 items to stdout: Broken pipe


Okay, add this to our UNITE database:

In [23]:
cat strom_add_to_UNITE.fasta utax_ref_class.fasta > utax-strom_ref.fasta



And convert this "classy" database to a format that usearch/utax likes:

In [None]:
usearch -makeudb_utax utax-strom_ref.fasta -output utax-strom_ref.udb -report dbmake_report.txt

<a id='asstax'></a>

# Assign taxonomy to otus

Time to put some names on the otu reference set. To do this with the usearch pipeline, we need to give all of our otu clusters a unique name/number. Since we used [usearch -cluster_smallmem](http://www.drive5.com/usearch/manual/cmd_cluster_smallmem.html) instead of [usearch -cluster_otus](http://www.drive5.com/usearch/manual/cmd_cluster_otus.html) due to our .95 similarity radius for our otus, [see here for discussion](http://www.drive5.com/usearch/manual/uparse_otu_radius.html), we need to add our otu names ourselves. As everyone knows, I like to do things the hard way. Anyway, wrote a python script for this, called "addOTUtag.py", it's in the repository. 

In [32]:
./addOTUtag.py otus_95_combo.fasta OTU otus_95_combo_relabel.fasta
./addOTUtag.py otus_97_combo.fasta OTU otus_97_combo_relabel.fasta



Now use the database created above to assign taxonomy. We'll use the [usearch -utax](http://drive5.com/usearch/manual/cmd_utax.html) command.

In [33]:
usearch -utax otus_95_combo_relabel.fasta -db utax-strom_ref.udb -strand both -fastaout otus_95_combo_asstax.fasta
usearch -utax otus_97_combo_relabel.fasta -db utax-strom_ref.udb -strand both -fastaout otus_97_combo_asstax.fasta



What does this look like? 

In [36]:
head otus_95_combo_asstax.fasta

>OTU1:79Leaf;size=651589;tax=d:Fungi(1.0000),p:Ascomycota(1.0000),c:Sordariomycetes(1.0000),o:Incertae_sedis(1.0000),f:Glomerellaceae(1.0000),g:Colletotrichum(0.9869),s:Colletotrichum_aotearoa_SH375577.07FU;
CTGAGTTTACGCTCTACAACCCTTTGTGAACATACCTATAACTGTTGCTTCGGCGGGTAGGGTCTCCGTGACCCTCCCGG
CCTCCCGCCCCCGGGCGGGTCGGCGCCCGCCGGAGGATAACCAAACTCTGATTTAACGACGTTTCTTCTGAGTGGTACAA
GCAAATAATCA
>OTU3:27Leaf;size=600849;tax=d:Fungi(1.0000),p:Ascomycota(1.0000),c:Dothideomycetes(0.9919),o:Capnodiales(0.9878),f:Mycosphaerellaceae(0.9911),g:Pseudocercospora(0.9784),s:Pseudocercospora_lyoniae_SH406155.07FU;
CTGAGTGAGGGCTCACGCCCGACCTCCAACCCTTTGTGAACACATCTTGTTGCTTCGGGGGCGACCCTGCCGGCACTTCG
TCGCCGGGCGCCCCCGAAGGTCTCCAAACACTGCATCTTTGCGTCGGAGTTTAAACAAATTAAACA
>OTU2:7Leaf;size=603746;tax=d:Fungi(0.2931),p:Ascomycota(0.0207),c:Eurotiomycetes(0.0221),o:Chaetothyriales(0.0087);
CTTGACCCCAGCGTAAGCTGGGGAATTGCATCACACAAATTGACCTATCCTTTGTTTGCCTCGGTGGGCGGCTCAGCTGA
GCCCCTGGACCCGAAAGGGCGCTCACCGTTGGACCAGTCTTGTTTGAATCT

<a id='makebiom'></a>

## Make and tidy up biom tables

So now we take these OTU clusters, with taxonomic assignments, and map our reads to them to create a biom table. Using the [usearch -usearch_global](http://drive5.com/usearch/manual/mapreadstootus.html) command. This command outputs the [first generation format](http://biom-format.org/documentation/biom_format.html)  of biom tables, a json table that is human readable. 

In [None]:
usearch -usearch_global combo.fasta -db otus_97_combo_asstax.fasta -strand both -id 0.97 -biomout combo_otu_97.biom
usearch -usearch_global combo.fasta -db otus_95_combo_asstax.fasta -strand both -id 0.95 -biomout combo_otu_95.biom

This makes our two biome tables. Log files for these report that 99.5% of our quality filtered reads mapped to an OTU in the 97% radius OTU rep set and 99.9% mapped to an OTU in our 95% radius OTU rep set. 

What do these look like? 

In [61]:
head -n 13 combo_otu_95.biom

{
	"id":"combo_otu_95.biom",
	"format": "Biological Observation Matrix 1.0",
	"format_url": "http://biom-format.org",
	"generated_by": "usearch8.1.1861",
	"type": "OTU table",
	"date": "Thu Sep  8 21:40:16 2016",
	"matrix_type": "sparse",
	"matrix_element_type": "float",
	"shape": [10269,232],
	"rows":[
		{"id":"OTU17:114Leaf", "metadata":{"taxonomy":"d:Fungi(1.0000),p:Ascomycota(1.0000),c:Sordariomycetes(0.9972),o:Xylariales(0.9992),f:Xylariaceae(0.9993),g:Xylaria(0.9990),s:hypoxylon_Oregon"}},
		{"id":"OTU7811:160wood", "metadata":{"taxonomy":"d:Fungi(1.0000),p:Ascomycota(0.9855),c:Dothideomycetes(0.9551),o:Pleosporales(0.7729)"}},


In [62]:
head -n 13 combo_otu_97.biom

{
	"id":"combo_otu_97.biom",
	"format": "Biological Observation Matrix 1.0",
	"format_url": "http://biom-format.org",
	"generated_by": "usearch8.1.1861",
	"type": "OTU table",
	"date": "Thu Sep  8 21:40:29 2016",
	"matrix_type": "sparse",
	"matrix_element_type": "float",
	"shape": [12558,232],
	"rows":[
		{"id":"OTU16:114Leaf", "metadata":{"taxonomy":"d:Fungi(1.0000),p:Ascomycota(1.0000),c:Sordariomycetes(0.9972),o:Xylariales(0.9992),f:Xylariaceae(0.9993),g:Xylaria(0.9990),s:hypoxylon_Oregon"}},
		{"id":"OTU9350:160wood", "metadata":{"taxonomy":"d:Fungi(1.0000),p:Ascomycota(0.9855),c:Dothideomycetes(0.9551),o:Pleosporales(0.7729)"}},


### clean up unidentified OTUs

We can check initial, obvious issues with our biome table by using the "biom validate-table command", part of the [biom-format python package](http://biom-format.org/index.html):

In [63]:
biom validate-table -i combo_otu_95.biom

Invalid format 'Biological Observation Matrix 1.0', must be '1.0.0'
'id' in {'id': '', 'metadata': {'taxonomy': 'd:Fungi(1.0000),p:Basidiomycota(0.9808),c:Agaricomycetes(0.9300),o:Agaricales(0.9049),f:Marasmiaceae(0.8875),g:Marasmius(0.7400),s:Marasmius_rotula_SH190961.07FU'}} appears empty
Bad value at idx 0: [0, 0, 70749]
Timestamp does not appear to be ISO 8601
The input file is not a valid BIOM-formatted file.


All of this appears to be pretty trivial, a missing id and a "Bad value". The "Bad value" happens often when I make biom tables, and the biom authors seem to think it is likely minor, due to having a floating decimal instead of an integer at some place in the table, discussion [here](https://github.com/biocore/biom-format/issues/701). The missing OTU is weird, but this seems to also happen every time I make a biom table. Let's see if we can find it:

In [64]:
grep '"id":""' combo_otu_95.biom

		{"id":"", "metadata":{"taxonomy":"d:Fungi(1.0000),p:Basidiomycota(0.9808),c:Agaricomycetes(0.9300),o:Agaricales(0.9049),f:Marasmiaceae(0.8875),g:Marasmius(0.7400),s:Marasmius_rotula_SH190961.07FU"}},


We can search our OTU taxonomic classifications to figure out what OTU this was:

In [65]:
grep 'd:Fungi(1.0000),p:Basidiomycota(0.9808),c:Agaricomycetes(0.9300),o:Agaricales(0.9049),f:Marasmiaceae(0.8875),g:Marasmius(0.7400),s:Marasmius_rotula_SH190961.07FU' otus_95_combo_asstax.fasta

>OTU1166:3Leaf;size=661;tax=d:Fungi(1.0000),p:Basidiomycota(0.9808),c:Agaricomycetes(0.9300),o:Agaricales(0.9049),f:Marasmiaceae(0.8875),g:Marasmius(0.7400),s:Marasmius_rotula_SH190961.07FU;


So one OTU pops up, it's name is "OTU1166:3Leaf". We need to plug this into our json file. What would I do without SED?:

In [66]:
sed '/"id":""/ s/"id":"",/"id":"OTU1166:3Leaf",/' combo_otu_95.biom -i



Did that work?

In [67]:
grep '"id":""' combo_otu_95.biom | wc -l

0


In [68]:
biom validate-table -i combo_otu_95.biom

Invalid format 'Biological Observation Matrix 1.0', must be '1.0.0'
Bad value at idx 0: [0, 0, 70749]
Timestamp does not appear to be ISO 8601
The input file is not a valid BIOM-formatted file.


Grep and biom-format can't find any other empty ids. Still not sure why this happens, some minor bug in the usearch code? Something weird in my data? As mentioned above, I think the "Bad value" is a minor issue, but let's look at it:

In [70]:
grep \\[0,0,70749\\] -A 10 -B 10 combo_otu_95.biom

		{"id":"67Leaf", "metadata":null},
		{"id":"112_1Leaf", "metadata":null},
		{"id":"113_2Leaf", "metadata":null},
		{"id":"74Leaf", "metadata":null},
		{"id":"113_1Leaf", "metadata":null},
		{"id":"126_2Leaf", "metadata":null},
		{"id":"91Leaf", "metadata":null},
		{"id":"Strom", "metadata":null}
	],
	"data": [
		[0,0,70749],
		[0,3,3],
		[0,4,8],
		[0,5,58],
		[0,13,50],
		[0,14,87],
		[0,17,6],
		[0,21,12],
		[0,26,15],
		[0,27,67],
		[0,34,15],


Looks fine to me. This is a plentiful OTU ("OTU17", in sample 114leaf, 99,865 reads) and the sample (160w) has slightly more reads in it than this amount (71,220 total reads), so it seems like this value is possible. Not sure. Our .97 biom table also had an unidentified OTU and a Bad Value at the [0,0] (first data entry). I fixed the unidentified OTU in the same manner as above, don't know why the bad value report. We'll keep going for now, but we should check on this "Bad value" OTU and sample downstream, cuz something might be fishy.  

### Change biom taxonomy metadata format

Another clean-up step for adapting the uparse-outputted biom tables for downstream applications is reformatting the taxonomy info. Formatting these JSON tables is a little tricky, but there is a good example that I will follow [here](http://biom-format.org/documentation/adding_metadata.html). I wrote a script for this, "format_tax.py", in the repository. 

In [86]:
./format_tax.py combo_otu_95.biom combo_95_relab.biom
./format_tax.py combo_otu_97.biom combo_97_relab.biom



And the taxonomy metadata now looks like this:

In [87]:
grep 'rows' -A 5 combo_95_relab.biom

	"rows":[
		{"id":"OTU17:114Leaf", "metadata":{"taxonomy": ["k__Fungi", "p__Ascomycota", "c__Sordariomycetes", "o__Xylariales", "f__Xylariaceae", "g__Xylaria", "s__hypoxylon_Oregon"]}},
		{"id":"OTU7811:160wood", "metadata":{"taxonomy": ["k__Fungi", "p__Ascomycota", "c__Dothideomycetes", "o__Pleosporales"]}},
		{"id":"OTU2412:53Leaf", "metadata":{"taxonomy": ["k__Fungi", "p__Ascomycota", "c__Eurotiomycetes", "o__Chaetothyriales"]}},
		{"id":"OTU494:114Leaf", "metadata":{"taxonomy": ["k__Fungi", "p__Ascomycota", "c__Dothideomycetes", "o__Pleosporales"]}},
		{"id":"OTU66:65Leaf", "metadata":{"taxonomy": ["k__Fungi", "p__Ascomycota", "c__Dothideomycetes", "o__Capnodiales", "f__Mycosphaerellaceae", "g__Mycosphaerella", "s__Mycosphaerella_tassiana_SH216250.07FU"]}},


<a id='addmetadata'></a>

## Add metadata

So, theoretically, these biom tables should now parse well in downstream applications, like phyloseq. Let's add some information about our samples first. For the moment, we'll just get real sample numbers onto our columns. Up to this point, the wood samples have been identified using the numbers given by illumina software when they dumultiplexed them. These numbers are meaningless to our study, except to keep track of samples. We can map them by [adding metadata with biom scripts](http://biom-format.org/documentation/adding_metadata.html). The leaf reads were given correct sample names in their identifiers when we [demultiplexed them](#demult), but we'll need to fill this info into their metadata, also.

For this we need a "metadata mapping file." For the moment, I will use a very simple map, looks like this:  

In [96]:
head combo_meta.tsv; tail combo_meta.tsv

#SampleID	SampleNumber	Library
160wood	Dc-X	W
161wood	Dc-PosG	W
162wood	Dc-PosI	W
163wood	Dc-Neg	W
164wood	1	W
165wood	2	W
166wood	3	W
167wood	4	W
168wood	5	W
127Leaf	127	L
128Leaf	128	L
129Leaf	129	L
130Leaf	130	L
131Leaf	131	L
132Leaf	132	L
133Leaf	133	L
NC_1Leaf	NC_1	L
NC_2Leaf	NC_2	L
Strom	S	S


So we use this map to add column metadata:

In [97]:
biom add-metadata -i combo_95_relab.biom -o combo_95_wMeta.biom -m combo_meta.tsv --output-as-json

biom add-metadata -i combo_97_relab.biom -o combo_97_wMeta.biom -m combo_meta.tsv --output-as-json



Did this work? A preliminary check with biom scripts.

In [99]:
biom validate-table -i combo_95_wMeta.biom


The input file is a valid BIOM-formatted file.


In [100]:
biom validate-table -i combo_97_wMeta.biom


The input file is a valid BIOM-formatted file.


Okay! Looks like we have a biom table to run some stats on. We'll do this in a different notebook, just to keep things organized. 