## UVA MSDS Capstone Project Data Processing Pipeline Step 1
- This notebook uses R to read the 150NT .qza files and export them as .csv files for ease of use with Python
- It also uses the Silva taxonomy training set to assign taxonomy to the sequences and saves the resulting table as a .csv file

### Loading Bioinformatics R Packages
- The chunk below installs the required R packages, including "dada2" and "qiime2R", and then loads them

In [2]:
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install(version = "3.15")
BiocManager::install(c("dada2"))

if (!requireNamespace("devtools", quietly = TRUE)){install.packages("devtools")}
devtools::install_github("jbisanz/qiime2R")

library("dada2")
library("yaml")
library("qiime2R")

"package 'BiocManager' was built under R version 4.2.1"
'getOption("repos")' replaces Bioconductor standard repositories, see
'?repositories' for details

replacement repositories:
    CRAN: https://cran.r-project.org


Bioconductor version 3.15 (BiocManager 1.30.18), R 4.2.0 (2022-04-22 ucrt)

Old packages: 'aplot', 'broom', 'bslib', 'callr', 'car', 'caret', 'classInt',
  'cli', 'cluster', 'DBI', 'dbplyr', 'dendextend', 'desc', 'devtools', 'dials',
  'dplyr', 'DT', 'dtplyr', 'FactoMineR', 'fontawesome', 'forcats', 'furrr',
  'future', 'future.apply', 'gam', 'gargle', 'gbm', 'GenomeInfoDb', 'gert',
  'ggfun', 'gh', 'gitcreds', 'globals', 'googlesheets4', 'gtable', 'gtools',
  'hardhat', 'haven', 'hms', 'htmltools', 'httpuv', 'httr', 'infer', 'kernlab',
  'klaR', 'labelled', 'lifecycle', 'lme4', 'lobstr', 'MASS', 'Matrix',
  'MatrixModels', 'modeldata', 'modelr', 'multcomp', 'nlme', 'openssl', 'ottr',
  'paletteer', 'parallelly', 'parsnip', 'patchwork', 'pkgload', 'pls',
  'prismatic', 

## Setting Up Directories and Reading in Sample CSV

In [9]:
## This notebook sits in the following working directory (code may need to be tweaked to fit your own directory)
getwd()

In [10]:
## Displaying all of the processed feature tables in .qza format that will be converted
list.files(file.path(getwd(), "150NT_files"))

In [12]:
## Loading the sample survey information file
info <- read.csv("metadata_files/sample_info.csv")
head(info)

Unnamed: 0_level_0,bmi_cat,bowel_movement_frequency,bowel_movement_quality,pregnant,race,sample_type,sample_name,sex,vegetable_frequency,kidney_disease,⋯,cdiff,liver_disease,lung_disease,diet_type,whole_grain_frequency,meat_eggs_frequency,milk_cheese_frequency,prepared_meals_frequency,age_cat,diagnosis
Unnamed: 0_level_1,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,⋯,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
1,Normal,Three,,No,Caucasian,feces,10317.000001792,female,Daily,"Diagnosed by a medical professional (doctor, physician assistant)",⋯,I do not have this condition,I do not have this condition,"Diagnosed by a medical professional (doctor, physician assistant)",Omnivore,Never,Daily,Never,Never,50s,kidney_disease
2,Obese,Two,I tend to have normal formed stool - Type 3 and 4,No,Caucasian,feces,10317.00000354,female,Regularly (3-5 times/week),"Diagnosed by a medical professional (doctor, physician assistant)",⋯,"Diagnosed by a medical professional (doctor, physician assistant)",I do not have this condition,"Diagnosed by a medical professional (doctor, physician assistant)",Omnivore,Never,Regularly (3-5 times/week),Rarely (less than once/week),Rarely (less than once/week),50s,kidney_disease
3,Overweight,One,I tend to be constipated (have difficulty passing stool) - Type 1 and 2,No,Caucasian,feces,10317.00000356,female,Daily,"Diagnosed by a medical professional (doctor, physician assistant)",⋯,I do not have this condition,I do not have this condition,I do not have this condition,Omnivore,Rarely (less than once/week),Daily,Occasionally (1-2 times/week),Occasionally (1-2 times/week),40s,kidney_disease
4,Normal,Two,I tend to have normal formed stool - Type 3 and 4,No,Caucasian,feces,10317.000007157,male,Daily,"Diagnosed by a medical professional (doctor, physician assistant)",⋯,,I do not have this condition,"Diagnosed by a medical professional (doctor, physician assistant)",Omnivore,Rarely (less than once/week),Daily,Occasionally (1-2 times/week),Rarely (less than once/week),60s,kidney_disease
5,Normal,Two,I tend to have normal formed stool - Type 3 and 4,No,Caucasian,feces,10317.000010099,male,Regularly (3-5 times/week),"Diagnosed by a medical professional (doctor, physician assistant)",⋯,I do not have this condition,I do not have this condition,I do not have this condition,Omnivore,Occasionally (1-2 times/week),Daily,Regularly (3-5 times/week),Regularly (3-5 times/week),40s,kidney_disease
6,Underweight,Less than one,I tend to be constipated (have difficulty passing stool) - Type 1 and 2,No,Caucasian,feces,10317.000012046,male,Occasionally (1-2 times/week),"Diagnosed by a medical professional (doctor, physician assistant)",⋯,I do not have this condition,I do not have this condition,I do not have this condition,Omnivore,Occasionally (1-2 times/week),Regularly (3-5 times/week),Rarely (less than once/week),Rarely (less than once/week),40s,kidney_disease


## Getting the Taxonomy Assignments and writing to CSV

In [27]:
### This cell takes a long time to run; to skip it, use the csv it writes to metadata_files

### Assigning taxonomy to ~70k sequences
df <- read_qza("150NT_files/130252_feature-table.qza")$data
seqs = as.data.frame(rownames(df))
seqs$abundance = 1
colnames(seqs) = c("sequence", "abundance")
taxa <- assignTaxonomy(seqs, "metadata_files/silva_nr_v132_train_set.fa.gz", multithread=TRUE)

In [28]:
### Displaying the first 6 sequences with taxonomy applied
head(taxa)

sequence,Kingdom,Phylum,Class,Order,Family,Genus
TACGTAGGTGGCAAGCGTTATCCGGAATTATTGGGCGTAAAGCGCGCGTAGGCGGTTTTTTAAGTCTGATGTGAAAGCCCACGGCTCAACCGTGGAGGGTCATTGGAAACTGGAAAACTTGAGTGCAGAAGAGGAAAGTGGAATTCCATG,Bacteria,Firmicutes,Bacilli,Bacillales,Staphylococcaceae,Staphylococcus
TACGTAGGGTGCGAGCGTTGTCCGGAATTACTGGGCGTAAAGGGCTCGTAGGTGGTTTGTCGCGTCGTCTGTGAAATTCCGGGGCTTAACTCCGGGCGTGCAGGCGATACGGGCATAACTTGAGTACTGTAGGGGTAACTGGAATTCCTG,Bacteria,Actinobacteria,Actinobacteria,Corynebacteriales,Corynebacteriaceae,Corynebacterium_1
TACGGAGGGTGCGAGCGTTAATCGGAATTACTGGGCGTAAAGCGTACGCAGGCGGTTTGTTAAGCGAGATGTGAAAGCCCCGGGCTCAACCTGGGAACTGCATTTCGAACTGGCAAACTAGAGTGTGATAGAGGGTGGTAGAATTTCAGG,Bacteria,Proteobacteria,Gammaproteobacteria,Alteromonadales,Pseudoalteromonadaceae,Pseudoalteromonas
TACGTAGGGTGCGAGCGTTGTCCGGAATTACTGGGCGTAAAGAGCTCGTAGGTGGTTTGTCGCGTCGTTTGTGTAAGCCCGCAGCTTAACTGCGGGACTGCAGGCGATACGGGCATAACTTGAGTGCTGTAGGGGAGACTGGAATTCCTG,Bacteria,Actinobacteria,Actinobacteria,Corynebacteriales,Corynebacteriaceae,Corynebacterium_1
TACGTAGGGTGCGAGCGTTGTCCGGAATTACTGGGCGTAAAGAGCTCGTAGGCGGTTTGTCACGTCGTCTGTGAAATCCTAGGGCTTAACCCTGGACGTGCAGGCGATACGGGCTGACTTGAGTACTACAGGGGAGACTGGAATTTCTGG,Bacteria,Actinobacteria,Actinobacteria,Corynebacteriales,Corynebacteriaceae,Lawsonella
TACGTAGGGTGCGAGCGTTGTCCGGAATTACTGGGCGTAAAGAGCTCGTAGGTGGTTTGTCGCGTCGTCTGTGAAATCCCGGGGCTTAACTTCGGGCGTGCAGGCGATACGGGCATAACTAGAGTGCTGTAGGGGAGACTGGAATTCCTG,Bacteria,Actinobacteria,Actinobacteria,Corynebacteriales,Corynebacteriaceae,Corynebacterium_1


In [29]:
### Writing taxa information to metadata files folder to avoid running the above code again
write.csv(taxa, "metadata_files/big_tax_table.csv")
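
Because the taxonomy assignment is slow, later runs can reuse the saved CSV instead of recomputing it. The following is a minimal sketch of that idea, not part of the original notebook: `load_or_compute` is a hypothetical helper name, and the commented usage assumes the `seqs` object and file paths from the cells above.

```r
## Hypothetical helper: return the cached table if the CSV exists,
## otherwise run the slow computation once and save its result
load_or_compute <- function(csv_path, compute_fn) {
    if (file.exists(csv_path)) {
        ## row.names = 1 restores the row names that write.csv() stored in the first column
        return(as.matrix(read.csv(csv_path, row.names = 1)))
    }
    result <- compute_fn()
    write.csv(result, csv_path)
    result
}

## Usage sketch (commented out; assumes `seqs` and the Silva file from the cell above):
## taxa <- load_or_compute("metadata_files/big_tax_table.csv", function()
##     assignTaxonomy(seqs, "metadata_files/silva_nr_v132_train_set.fa.gz", multithread = TRUE))
```

The computation is passed as a function so that it only runs when the CSV is missing.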

### Reading all .qza files and writing them to a folder as .csv files

In [41]:
list.files(file.path(getwd(), "150NT_files"))

In [47]:
## Looping through all of the .qza files and converting them to .csv files
qza_files <- list.files(file.path(getwd(), "150NT_files"))
for (f in qza_files) {
    ## Read the feature table stored in the .qza file
    df <- as.data.frame(read_qza(file.path("150NT_files", f))$data)
    ## Write it to csv, named after the first 6 characters of the file name (e.g. 130252.csv)
    ## file.path() replaces the Windows-only "\\" separator so the loop runs on any OS
    write.csv(df, file.path("150NT_csvs", paste0(substr(f, 1, 6), ".csv")))
}

In [1]:
list.files(file.path(getwd(), "150NT_csvs"))
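
Rather than comparing the two directory listings by eye, the conversion can be checked programmatically. This is a small sketch under the assumption that every output file is named with the first six characters of its source file plus `.csv`, as in the loop above; `missing_csvs` is a hypothetical helper name, not part of the original notebook.

```r
## Hypothetical helper: list the .csv files the conversion loop should have
## produced but that are not present in the output folder
missing_csvs <- function(qza_dir, csv_dir) {
    qza_files <- list.files(qza_dir, pattern = "\\.qza$")
    ## Same naming rule as the loop: first six characters of the file name plus ".csv"
    expected <- paste0(substr(qza_files, 1, 6), ".csv")
    setdiff(expected, list.files(csv_dir))
}

## Usage, assuming the folder names used in this notebook
missing_csvs("150NT_files", "150NT_csvs")
```

An empty result means every .qza file has a matching .csv file.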

## Summary of notebook outputs
- 150NT_csvs folder: contains all the feature tables as .csvs
- metadata_files folder: contains the sample survey information and the taxonomy table
- This notebook is written in R; the other notebooks in this pipeline use Python for analysis