# Preprocessing of updated OTU data from SILVA / DADA2
- Data from Cliff, used DADA2 assignTaxonomy() and SILVA version 138.

**Steps are:**

**1)** Preprocessing: sort samples west to east; keep OTUs whose total count across the 168 samples is >= 500
- `Silva_OTU_PP.txt`

**2)** DESeq2 normalization: variance-stabilized counts per million
- `Silva_OTU_VSTcpm.txt`
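
As a rough illustration of step 1, the sort-then-filter logic might look like the sketch below. This is not the code in `1_OTU_preprocessing.r` (which is not shown here). The toy table, the `EW_index` column, and the reading of the cutoff as a minimum row total are all assumptions made for the example.

```r
# Hypothetical toy OTU table: rows are OTUs, columns are samples.
otu <- data.frame(S_east = c(400, 10, 0),
                  S_west = c(200, 5, 600),
                  row.names = c("otu1", "otu2", "otu3"))

# Hypothetical metadata with a west-to-east ordering index (1 = westernmost).
meta <- data.frame(Sample = c("S_east", "S_west"),
                   EW_index = c(2, 1),
                   stringsAsFactors = FALSE)

# Reorder the sample columns by the metadata index.
ordered_samples <- meta$Sample[order(meta$EW_index)]
otu_sorted <- otu[, ordered_samples, drop = FALSE]

# Keep OTUs whose total count across all samples meets the cutoff (assumed meaning).
min_total <- 500
otu_kept <- otu_sorted[rowSums(otu_sorted) >= min_total, , drop = FALSE]
print(otu_kept)
```

The real module also parses taxonomy, which this sketch leaves out.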

# 1) OTU table preprocessing 

In [3]:
#source("../modules/1_OTU_preprocess_module_0.2.r")
source("../modules/1_OTU_preprocessing.r")

### Small test shows a problem with sample ordering:
- The Sandmound_Cattail sites were supposed to be dropped (only 2 cores, not a very important habitat).
- They are still there, and "Sandmound_Tule_C_D1" and "Sandmound_Tule_C_D2" now land at the end, after the `Consensus.lineage` column.
- Those two samples have an extra underscore before "C_D1" / "C_D2" compared with the metadata names; fix it.
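
A mismatch like this can be caught before preprocessing by comparing the OTU table's column names against the metadata sample IDs. The sketch below uses made-up stand-in vectors (`SiteA_D1` is hypothetical); with the real objects the inputs would be `names(otu_raw)` and `metaDB$Sample`.

```r
# Toy stand-ins for the OTU table columns and the metadata sample IDs.
otu_cols <- c("SiteA_D1", "Sandmound_Tule_C_D1", "Sandmound_Tule_C_D2",
              "Consensus.lineage")
meta_ids <- c("SiteA_D1", "Sandmound_TuleC_D1", "Sandmound_TuleC_D2")

# Sample columns in the OTU table that have no metadata row
# (the taxonomy column is not a sample, so it is excluded from the check).
unmatched_otu <- setdiff(otu_cols, c(meta_ids, "Consensus.lineage"))

# Metadata rows that have no column in the OTU table.
unmatched_meta <- setdiff(meta_ids, otu_cols)

print(unmatched_otu)
print(unmatched_meta)
```

Names that show up in both lists with near-identical spellings are usually renaming candidates rather than truly missing samples.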

In [4]:
### IMPORT Sample mapping
metaDB <- read.table("../data/meta/SF_sal_meta_FIX3.txt", sep="\t", header=TRUE)   # import mapping, keeping all parameters
row.names(metaDB) <- metaDB$Sample                                                 # row names are samples, for phyloseq
metaDB = metaDB[,-1]                                                               # drop only the old index column, keep everything else
# colnames(metaDB)

## IMPORT OTU TABLE (This is the exact same OTU table as the iTagger (i.e., 79380 OTUs) but with SILVA instead of Greengenes)          
otu_raw <- read.table("SF_Salinity_gradient_OTU_table_SILVA.txt", sep='\t', header=TRUE, row.names = 1)      # add to fxn below?
#otu_tax <-data.frame(OTU= row.names(otu_raw), Taxonomy = otu_raw$Consensus.lineage)      

# Fix sample names to match metadata !!   - discovered with small data test
oldnames = c('Sandmound_Tule_C_D1','Sandmound_Tule_C_D2')
newnames = c('Sandmound_TuleC_D1','Sandmound_TuleC_D2')
for(i in seq_along(oldnames)) names(otu_raw)[names(otu_raw) == oldnames[i]] = newnames[i]

# PREPROCESS OTU TABLE (site sort, filter, taxonomy)
otu_PP = otu_t_preproc(otu_raw, 500, metaDB, "Sample", "EWsiteHyd_index")
# otu_PP = otu_t_preproc(otu_raw, 50, metaDB, "Sample", "EWsiteHyd_index")
# dim(otu_PP); head(otu_PP) # names(otu_PP)

# for unknown reasons, need to make OTU counts NUMERIC
exclude <- c("OTU", 'Consensus.lineage', 'Kingdom', 'Phylum', 'Class', 'Order', 'Family', 'Genus')
all_cols <- colnames(otu_PP)
sites = setdiff(all_cols, exclude)    # set difference; `all_cols != exclude` would recycle `exclude` element-wise
otu_PP[sites] <- lapply(otu_PP[sites], as.numeric)
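
When count columns are picked by excluding annotation names, note that `!=` compares element by element and recycles the shorter vector, whereas `setdiff()` and `%in%` test set membership. A toy sketch with made-up column names shows the difference:

```r
exclude  <- c("OTU", "Consensus.lineage", "Kingdom")
all_cols <- c("OTU", "Site1", "Site2", "Kingdom", "Consensus.lineage")

# Element-wise comparison: `exclude` is recycled along `all_cols`, so each
# column name is compared with only one entry of `exclude`.
# (R warns here because 5 is not a multiple of 3.)
recycled <- suppressWarnings(all_cols[all_cols != exclude])

# Set-based selection: each column name is checked against all of `exclude`.
by_set <- setdiff(all_cols, exclude)

print(recycled)  # "Kingdom" slips through
print(by_set)
```

With a filter like this, an annotation column can silently end up among the columns passed to `as.numeric`.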

In [5]:
# head(otu_PP)

In [6]:
# write table
write.table(otu_PP, "Silva_OTU_PP.txt", sep='\t')
# write.table(otu_PP, "Silva_OTU_PP_50.txt", sep='\t')

# 2) DESeq2 normalize data
- VST_CPM here is the variance-stabilizing transform, returned as counts per million
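
For reference, the "per million" part is a per-sample rescaling so that each column sums to 1e6. The sketch below shows only that generic rescaling on a made-up matrix. How the module combines it with the DESeq2 transform is not shown in this notebook, so nothing here should be read as the module's method.

```r
# Hypothetical 3-OTU x 2-sample matrix of non-negative abundances.
mat <- matrix(c(10, 30, 60,
                 5,  5, 90),
              nrow = 3,
              dimnames = list(paste0("OTU", 1:3), c("S1", "S2")))

# Divide each column by its total, then scale to one million.
cpm <- sweep(mat, 2, colSums(mat), "/") * 1e6

print(cpm)
print(colSums(cpm))  # every sample now sums to 1e6
```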

In [7]:
source("../modules/2_OTU_table_to_DESeq2_and_VST_cpm_0.7.r")

In [8]:
# Reimport from file - this is the one with the 500 cutoff
otu_PP = read.table("Silva_OTU_PP.txt", sep='\t', row.names = 1) %>%
  dplyr::mutate_if(is.character, as.factor)
# otu_PP50 = read.table("Silva_OTU_PP_50.txt", sep='\t')


### IMPORT Sample mapping
metaDB <- read.table("../data/meta/SF_sal_meta_FIX3.txt", sep="\t", header=TRUE)   # import mapping, keeping all parameters
row.names(metaDB) <- metaDB$Sample                                                 # row names are samples, for phyloseq
metaDB = metaDB[,-1]                                                               # drop only the old index column, keep everything else

In [9]:
# oversight in the module, looks like fxn 4 needs to take physeq, pass to fxn 5 too.
# just add to global env here, sigh.  or fix in v0.7 new vs v0.6
# physeq = make_Phyloseq_data(otu_PP, metaDB)

### Get DESeq2 normalized counts per million ###
OTU_vstCPM = calc_DESeq2_CPM(otu_PP, metaDB)

“Coercing from data.frame class to character matrix 
prior to building taxonomyTable. 
This could introduce artifacts. 
Check your taxonomyTable, or coerce to matrix manually.”


phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 6275 taxa and 168 samples ]
sample_data() Sample Data:       [ 168 samples by 66 sample variables ]
tax_table()   Taxonomy Table:    [ 6275 taxa by 8 taxonomic ranks ]


converting counts to integer mode

“the design is ~ 1 (just an intercept). is this intended?”
estimating size factors

estimating dispersions

gene-wise dispersion estimates

mean-dispersion relationship

final dispersion estimates

fitting model and testing

-- replacing outliers and refitting for 3683 genes
-- DESeq argument 'minReplicatesForReplace' = 7 
-- original counts are preserved in counts(dds)

estimating dispersions

fitting model and testing



[1] "Deseq2 finished computing"


In [10]:
# head(OTU_vstCPM)

In [11]:
# export it
write.table(OTU_vstCPM, "Silva_OTU_VSTcpm.txt", sep='\t')