Additional post-processing scripts for metagenome analyses using metannotate
Perl R
Switch branches/tags
Nothing to show
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Failed to load latest commit information.
README.md
compareToPrimer.pl
phylogeneticNovelty.R
taxPredictionsWithErrorBars.R

README.md

metannotateScripts

This folder contains various post-processing scripts for metannotate analyses of metagenomes.

MetAnnotate can be found here http://metannotate.uwaterloo.ca/ and here https://bitbucket.org/doxeylab/metannotate

Some basic R commands are below:

Loading and processing output files


Set your working directory. For example, this may be the directory containing the .tsv MetAnnotate results file. E.g.,

setwd("~/Desktop/MetAnnotateResults/")

Load your dataset (here, it is a tab-separated file called data.tsv):

tb <- read.delim("data.tsv",sep='\t',header=T)

You may now want to look at the counts (# hits for all HMMs) for each dataset. To normalize these counts, divide them by the total number of reads (DNA) in each dataset.

sort(table(tb[,1]))

If there are datasets with very few (e.g., 50 or less) counts, you may want to remove them:

print("Removing...")
which(table(tb[,1]) < 50)

for (d in names(which(table(tb[,1]) < 50)))
{
	tb <- tb[-which(tb[,1] == d),]
}

tb[,1] <- as.character(tb[,1])

Plotting

Barplots

The first thing you need to decide on is the data column you want to analyze. For order level taxon assignments predicted using the tree-based method ...

k <- "Closest.Homolog.Order" # can be Species, Genus, Family, Order, Class...

Now run the following code to make your barplot

# a color scheme
pal12 = c("#A6CEE3", "#1F78B4", "#B2DF8A", "#33A02C", "#FB9A99", 
"#E31A1C", "#FDBF6F", "#FF7F00", "#CAB2D6", "#6A3D9A", 
"#FFFF99", "#B15928") 
cols <- colorRampPalette(pal12)(length(levels(as.factor(tb[,k]))))

par(mfrow=c(1,2))

tb.2 <- t(table(tb[,c(1,which(colnames(tb) == k))]))
for (i in 1:ncol(tb.2))
{
	tb.2[,i] <- tb.2[,i] / sum(tb.2[,i])
}

tb.2 <- t(na.omit(t(tb.2)))


## suppose you want to reorder the barplot and show only a subset of datasets in a specific order. Uncomment the following two lines
# datasetOrder <- c("4446153","4477804","4537195","SRR908273a")  # a hypothetical list of datasets
# tb.2 <- tb.2[,match(datasetOrder,colnames(tb.2))]

barplot(tb.2,col=cols,las=3, cex.names=0.5)
plot.new()

legend("left", inset=.05, title="Class",legend =
  	c(levels(as.factor(tb[,k]))), fill = cols,cex=0.2)

Clustered Heatmap

You will need to have the pheatmap package installed.

library(pheatmap)

As before, you need to pick a taxonomic level. This time it is called sColumn

k <- "Closest.Homolog.Genus" #species column for which to do the species breakdown/analysis

Since showing all taxa may be overwhelming, we can choose to show only the most frequent species -- those present greater than some threshold in at least one dataset. This is done with the mpt parameter

mpt <- 0.02 #max fraction of a taxa in any dataset required for inclusion in heat map

The following parameter can also be set to less to reduce the size of labels in the heatmap

cexVal <- 1

Now run...

matr <- table(tb[,k],tb[,1])
matr <- sweep(matr, 2, colSums(matr), "/")
maxPerTaxa <- apply(matr,1,max)
cexVal <- 1 #sets size of labels
ll <- pheatmap((matr[which(maxPerTaxa > mpt),]),cex=cexVal)

Principal coordinates analysis (PCOA)

PCOA is really easy in R. First load the vegan and ape packages

library(vegan)
library(ape)

Just run:

D <- vegdist(t(matr[which(maxPerTaxa > mpt),]))   ## works better when rare species are removed
biplot(pcoa(D))

Generating an OTU table

otu.tb1 <- table(tb[,k],tb[,2])
otu.tb2 <- data.frame(rownames(otu.tb1),as.data.frame.matrix(otu.tb1))
colnames(otu.tb2)[1] <- "OTU"
write.table(otu.tb2,file="otutable.tsv",sep="\t",row.names=F)