01. Filtering
===
**Pre-cleaning of microbiome data**  
Prepare 16S data for downstream analyses such as beta diversity and statistical tests.
1. Add variable types to the metadata.
2. Filter out features/samples that should be excluded from particular analyses.
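
Step 1 can be sketched roughly as follows. This is a minimal base R illustration, not the notebook's own code: the toy data frame reuses the column names and a few values from the sample metadata printed later in this notebook, and treating every column as a factor is an assumption based on the `<fct>` column types in that output.

```r
# Toy metadata with the same columns as the sample data in this notebook.
meta <- data.frame(
  Sample_Biosource = c("saft", "corpus", "saft"),
  Subject          = c(1, 10, 10),
  Sample_Group     = c("PPI_FGP", "PPI", "PPI"),
  row.names        = c("1_S", "10_C", "10_S"),
  stringsAsFactors = FALSE
)

# Assumption: all three variables are categorical, so convert each to a factor.
meta[] <- lapply(meta, factor)

str(meta)
```

Converting `Subject` to a factor keeps downstream models from treating subject identifiers as a numeric covariate.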


In [2]:
library(phyloseq)
library(stringr)
library(ggplot2)
source("/mnt/work/flatberg/projects/GCF-2019-658/analysis/src/microbiome/rules/analysis/notebooks/src/microfiltR/microfiltR_source_code.R")

**Load the QIIME2-based taxonomy from a phyloseq object (rds file)**

In [4]:
rds <- readRDS(snakemake@input[[1]])
#rds <- readRDS("../../../../../data/tmp/microbiome/quant/qiime2/silva/physeq.rds")
df <- data.frame(as(sample_data(rds), "matrix"))
head(df)

ERROR: Error in readRDS(snakemake@input[[1]]): object 'snakemake' not found


In [5]:
MODEL = snakemake@config$models[[snakemake@wildcards$model]]

ERROR: Error in eval(expr, envir, enclos): object 'snakemake' not found
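
The `object 'snakemake' not found` errors above appear when the cells are run interactively, because the `snakemake` S4 object only exists when Snakemake executes the notebook. One possible workaround is to define a stand-in object with the same slots, as in the minimal sketch below. The class name `MockSnakemake`, the file paths and the model name `example` are hypothetical. The threshold values are the ones reported in the filtering output later in this notebook.

```r
library(methods)

# Only define a stand-in when the notebook is run outside Snakemake.
if (!exists("snakemake")) {
  setClass("MockSnakemake", representation(
    input = "list", output = "list", params = "list",
    wildcards = "list", config = "list"
  ))
  snakemake <- new("MockSnakemake",
    input     = list("physeq.rds"),            # hypothetical path
    output    = list("physeq_filtered.rds"),   # hypothetical path
    params    = list(prevalence_threshold = 0.05,
                     abundance_threshold  = 2.5e-6,
                     min_lib              = 500),
    wildcards = list(model = "example"),       # hypothetical model name
    config    = list(models = list(example = list(subsets = NULL)))
  )
}

# Same lookup as in the cell above; with the stand-in, no subsets are defined.
MODEL <- snakemake@config$models[[snakemake@wildcards$model]]
```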


In [None]:
subset.ids <- function(rds, params){
    # set defaults for missing params
    if (!"axis" %in% names(params)) params$axis <- "column"
    if (!"name" %in% names(params)) params$name <- "Subset"
    if (!"selection" %in% names(params)) params$selection <- "keep"
    # axis "column" subsets samples (sample_data); anything else subsets taxa (tax_table)
    if (params$axis == "column"){
        df <- data.frame(as(sample_data(rds), "matrix"))
        all.ids <- sample_names(rds)
    } else{
        df <- data.frame(as(tax_table(rds), "matrix"))
        all.ids <- taxa_names(rds)
    }
    if ("ids" %in% names(params)){
        # explicit list of sample/taxa ids to keep
        keep <- all.ids %in% params$ids
        if (sum(keep) == 0) stop("no overlap in subset ids from model.yaml")
    } else{
        if (!params$name %in% colnames(df)){
            stop(paste("missing column:", params$name, "in metadata."))
        }
        subset_col <- as.character(df[,params$name])
        # treat missing values as not selected
        keep <- !is.na(subset_col) & subset_col == as.character(params$selection)
        if (sum(keep) == 0) stop(paste0("no overlap in selection ", params$selection, ": ", params$name))
    }
    if (params$axis == "column"){
        rds <- prune_samples(keep, rds)
    } else{
        rds <- prune_taxa(keep, rds)
    }
    # return the pruned phyloseq object explicitly
    rds
}
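
To make the expected shape of `params` more concrete, the sketch below applies a simplified version of the selection step to a plain data frame instead of a phyloseq object. The two specifications `by_value` and `by_ids` and the helper `keep_rows()` are hypothetical and only illustrate how a value-based selection and an `ids`-based selection might look when read from the model configuration.

```r
# Toy metadata using column names and values from this notebook's sample data.
df <- data.frame(
  Sample_Biosource = c("saft", "corpus", "saft"),
  Sample_Group     = c("PPI_FGP", "PPI", "PPI"),
  row.names        = c("1_S", "10_C", "10_S"),
  stringsAsFactors = FALSE
)

# Two hypothetical subset specifications.
by_value <- list(axis = "column", name = "Sample_Biosource", selection = "saft")
by_ids   <- list(axis = "column", ids = c("10_C", "10_S"))

# Simplified selection step: returns a logical vector over the rows of df.
keep_rows <- function(df, params) {
  if ("ids" %in% names(params)) {
    rownames(df) %in% params$ids
  } else {
    as.character(df[, params$name]) == as.character(params$selection)
  }
}

keep_rows(df, by_value)
keep_rows(df, by_ids)
```

In the notebook itself, the resulting logical vector is what gets passed to `prune_samples()` or `prune_taxa()`.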

In [4]:
if (!is.null(MODEL$subsets)){
    for (p in MODEL$subsets){
        rds <- subset.ids(rds, p)
    }
}
rds

| | Sample_Biosource | Subject | Sample_Group |
|---|---|---|---|
| | `<fct>` | `<fct>` | `<fct>` |
| 1_S | saft | 1 | PPI_FGP |
| 10_C | corpus | 10 | PPI |
| 10_S | saft | 10 | PPI |
| 11_C | corpus | 11 | PPI_FGP |
| 11_P | polypp | 11 | PPI_FGP |
| 11_S | saft | 11 | PPI_FGP |


phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 2467 taxa and 36 samples ]
sample_data() Sample Data:       [ 36 samples by 3 sample variables ]
tax_table()   Taxonomy Table:    [ 2467 taxa by 7 taxonomic ranks ]
phy_tree()    Phylogenetic Tree: [ 2467 tips and 2433 internal nodes ]
refseq()      DNAStringSet:      [ 2467 reference sequences ]

**Taxonomy requirements**  
We require taxa to be assigned to the kingdom Bacteria, and we remove mitochondria and chloroplasts.

In [5]:
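# Keep Bacteria; drop taxa labelled as mitochondria or chloroplast
# at the Family, Class or Phylum rank.
rds.f <- subset_taxa(rds,
    Kingdom == "Bacteria" &
    Family  != "mitochondria" &
    Class   != "Chloroplast" &
    Phylum  != "Cyanobacteria/Chloroplast"
  )
# Remove taxa that are left with no reads.
rds.f <- prune_taxa(taxa_sums(rds.f) > 0, rds.f)
rds.f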


phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 2212 taxa and 36 samples ]
sample_data() Sample Data:       [ 36 samples by 3 sample variables ]
tax_table()   Taxonomy Table:    [ 2212 taxa by 7 taxonomic ranks ]
phy_tree()    Phylogenetic Tree: [ 2212 tips and 2178 internal nodes ]
refseq()      DNAStringSet:      [ 2212 reference sequences ]
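
One detail of this kind of taxonomy filter is easy to miss: a comparison such as `Family != "mitochondria"` evaluates to `NA` for taxa with no Family assignment. To my understanding, `subset_taxa()` evaluates its condition with base `subset()` on the taxonomy table, so such taxa are dropped as well. The base R sketch below illustrates the effect on a made-up taxonomy table; the taxon names and assignments are invented for illustration.

```r
# Toy taxonomy table (made-up taxa).
tax <- data.frame(
  Kingdom = c("Bacteria", "Bacteria", "Bacteria", "Archaea"),
  Family  = c("Lachnospiraceae", "mitochondria", NA, "Methanobacteriaceae"),
  row.names = c("asv1", "asv2", "asv3", "asv4"),
  stringsAsFactors = FALSE
)

# A comparison with NA gives NA, and subset() drops those rows.
strict <- subset(tax, Kingdom == "Bacteria" & Family != "mitochondria")

# Keeping taxa with an unassigned Family requires an explicit is.na() check.
lenient <- subset(tax, Kingdom == "Bacteria" &
                       (is.na(Family) | Family != "mitochondria"))

rownames(strict)
rownames(lenient)
```

Whether unassigned taxa should be kept depends on the analysis; the point is only that the strict form removes them implicitly.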

**Independent feature filters**  
The functions for tuning the independent-filtering parameters are from Bryan Brown's microfiltR: https://github.com/itsmisterbrown/microfiltR

In [6]:
# Each range is written as c(from:to, by), matching the ranges echoed in the output below.
as.threshold <- estimate.ASthreshold(ps=rds.f, WST=NULL, minLIB=500,
                                     Prange = c(0.025:0.15, 0.025),
                                     CVrange = c(1:10, 0.5),
                                     RArange = c(5e-6:0.9e-3, 1e-5))

Removing 0 samples with read count < 500

Estimating filtering statistics from relative abundance thresholds 5e-06 to 9e-04 by 1e-05

Estimating filtering statistics from CV thresholds 1 to 10 by 0.5



 OTU abundance data must have non-zero dimensions. 
 OTU abundance data must have non-zero dimensions. 
 OTU abundance data must have non-zero dimensions. 
 OTU abundance data must have non-zero dimensions. 
 OTU abundance data must have non-zero dimensions. 
 OTU abundance data must have non-zero dimensions. 
 OTU abundance data must have non-zero dimensions. 
 OTU abundance data must have non-zero dimensions. 


Estimating filtering statistics from prevalence thresholds 0.025 to 0.15 by 0.025



In [7]:
e3.f <- microfilter(ps=rds.f, WST=NULL, 
                    PFT=snakemake@params$prevalence_threshold,
                    RAT=snakemake@params$abundance_threshold , 
                    minLIB=snakemake@params$min_lib, 
                    return.all = TRUE)
e3.f$filtered.phyloseq

Removing 0 samples with read count < 500

Applying relative abundance threshold of 2.5e-06

Applying prevalence threshold of 0.05



phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 821 taxa and 36 samples ]
sample_data() Sample Data:       [ 36 samples by 3 sample variables ]
tax_table()   Taxonomy Table:    [ 821 taxa by 7 taxonomic ranks ]
phy_tree()    Phylogenetic Tree: [ 821 tips and 814 internal nodes ]
refseq()      DNAStringSet:      [ 821 reference sequences ]
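
The idea behind the prevalence and relative abundance thresholds can be illustrated on a toy count matrix. The sketch below is a simplified base R illustration under my own definitions (prevalence as the fraction of samples with a non-zero count, relative abundance as a taxon's share of all reads). It is not microfiltR's implementation, whose exact definitions may differ, and the counts and the thresholds `pft` and `rat` are made up.

```r
# Toy count matrix: taxa in rows, samples in columns (made-up numbers).
counts <- matrix(
  c(500, 400, 600, 450,   # taxA: abundant, present in every sample
      0,   3,   0,   0,   # taxB: rare, present in one sample
     20,   0,  30,  25,   # taxC: moderate, present in three samples
      1,   0,   0,   0),  # taxD: a single read
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("tax", LETTERS[1:4]), paste0("s", 1:4))
)

# Fraction of samples in which each taxon is observed.
prevalence <- rowSums(counts > 0) / ncol(counts)
# Share of all reads in the dataset that belong to each taxon.
rel_abund <- rowSums(counts) / sum(counts)

# Hypothetical thresholds chosen for the toy data.
pft <- 0.5
rat <- 0.001

# Keep taxa that pass both thresholds.
keep <- prevalence >= pft & rel_abund >= rat
filtered <- counts[keep, , drop = FALSE]
rownames(filtered)
```

With these toy values only `taxA` and `taxC` pass: `taxB` is abundant enough but seen in too few samples, and `taxD` fails both thresholds.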

----------------

**Save the filtered SILVA-based phyloseq object**

In [8]:
saveRDS(e3.f$filtered.phyloseq, snakemake@output[[1]])
cat("Saved ", snakemake@output[[1]])

Saved  data/tmp/microbiome/analysis/GCF-2019-658-juice/physeq_filtered.rds
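
A quick way to check that a saved object can be read back unchanged is a round trip through `readRDS()`. The sketch below uses a small list as a hypothetical stand-in for the filtered phyloseq object and a temporary file instead of the Snakemake output path.

```r
# Hypothetical stand-in for the filtered object.
obj <- list(taxa = c("asv1", "asv2"), n_samples = 36L)

# Write to a temporary file rather than the real output path.
path <- tempfile(fileext = ".rds")
saveRDS(obj, path)

# Reload and confirm the object survived serialization unchanged.
reloaded <- readRDS(path)
stopifnot(identical(obj, reloaded))
```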

