# Project: Janani's rpob sequences demultiplexing

Cassandra Wattenburger, 07/08/21

### Notes:
* Modified from script written by Roli Wilhelm
* Running script using QIIME2 v2021.4
* Demultiplexes the rpob amplicons for Janani that were included on the SFA2 rep1 trial run

# Introduction

### Pipeline to process raw sequences into phyloseq object with DADA2 ###
* Prep for Import to QIIME2  (Combine two index files)
* Import to QIIME2
* Demultiplex
* Denoise and Merge
* Prepare OTU Tables and Rep Sequences  *(Note: sample names starting with a digit will break this step)*
* Classify Seqs

*100% Appropriated from the "Atacama Desert Tutorial" for QIIME2*

### Pipeline can handle both 16S rRNA gene and ITS sequences####
* Tested on 515f and 806r
* Tested on ITS1

### Commands to install dependencies ####
##### || QIIME2 ||
**Note:** QIIME2 is still in active development and new releases come out frequently. Check for the most up-to-date conda install file at <https://docs.qiime2.org/2017.11/install/native/#install-qiime-2-within-a-conda-environment>

* wget https://data.qiime2.org/distro/core/qiime2-2020.2-py35-linux-conda.yml
* conda env create -n qiime2-2020.2 --file qiime2-2020.2-py35-linux-conda.yml
* source activate qiime2-2020.2

##### || Copyrighter rrn database ||
* The script will automatically install the curated GreenGenes rrn attribute database
* https://github.com/fangly/AmpliCopyrighter

##### || rpy2 (don't use conda version) ||
* pip install rpy2  

##### || phyloseq ||
* conda install -c r r-igraph 
* Rscript -e "source('http://bioconductor.org/biocLite.R');biocLite('phyloseq')" 

##### || R packages ||
* ape   (natively installed in conda environment)

### Citations ###
* Caporaso, J. G., Kuczynski, J., Stombaugh, J., Bittinger, K., Bushman, F. D., Costello, E. K., *et al.* (2010). QIIME allows analysis of high-throughput community sequencing data. Nature methods, 7(5), 335-336.

* McMurdie and Holmes (2013) phyloseq: An R Package for Reproducible Interactive Analysis and Graphics of Microbiome Census Data. PLoS ONE. 8(4):e61217

* Paradis E., Claude J. & Strimmer K. 2004. APE: analyses of phylogenetics and evolution in R language. Bioinformatics 20: 289-290.

* Angly, F. E., Dennis, P. G., Skarshewski, A., Vanwonterghem, I., Hugenholtz, P., & Tyson, G. W. (2014). CopyRighter: a rapid tool for improving the accuracy of microbial community profiles through lineage-specific gene copy number correction. Microbiome, 2(1), 11.


# Step 0: Prep the sequence data

### Instructions

Done on the command line.

1. Copy/paste the raw sequence data to a new location on the server (never modify original data!)
1. Decompress files to .fastq
1. Run 'truncate_seqid.sh' on the read1, read2, index1, and index2 fastq files
1. Recompress the read1 and read2 files to .fastq.gz (keep the index files uncompressed for a later step)

This script removes a portion of the sequence ID that is incompatible with QIIME2. I think it is due to intricacies involved with the BRC's sequencing methods.

Will work on incorporating directly into pipeline in future.
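
The decompress and recompress steps above can also be done from Python instead of the shell. This is only a minimal sketch using the standard `gzip` and `shutil` modules; the function names are made up for illustration, and it does not replace 'truncate_seqid.sh', which still has to run on the uncompressed files in between.

```python
import gzip
import shutil


def decompress(gz_path, fastq_path):
    """Write the decompressed contents of gz_path to fastq_path."""
    with gzip.open(gz_path, "rb") as src, open(fastq_path, "wb") as dst:
        shutil.copyfileobj(src, dst)


def recompress(fastq_path, gz_path):
    """Write a gzip-compressed copy of fastq_path to gz_path."""
    with open(fastq_path, "rb") as src, gzip.open(gz_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
```

Both functions stream the data in chunks, so large read files are not loaded into memory all at once. Neither one deletes its input file.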

# Step 1: User Input

### Metadata requirements
* Must be located in the project directory
* Must be .tsv format 
* First column named "SampleID" for samples
* One column named "BarcodeSequence" with the relevant barcode sequences (reverse complement of the reverse barcode, concatenated with the forward barcode sequence)

The output directory will be created inside the project directory when you run the script.
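
Since the demultiplexing step is slow, it can be worth checking the metadata file against the requirements above before starting. The sketch below is one possible check, not part of the original pipeline: `check_metadata` is a hypothetical helper, and it assumes barcodes contain only A, C, G and T.

```python
import csv


def check_metadata(path):
    """Return a list of problems found in a metadata .tsv file."""
    problems = []
    with open(path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        header = next(reader, [])
        if not header or header[0] != "SampleID":
            problems.append('first column must be named "SampleID"')
        if "BarcodeSequence" not in header:
            problems.append('missing "BarcodeSequence" column')
            return problems
        barcode_col = header.index("BarcodeSequence")
        for row in reader:
            if not row or row[0].startswith("#"):
                continue  # skip blank lines and comment rows
            barcode = row[barcode_col] if len(row) > barcode_col else ""
            # Assumption: a valid barcode is non-empty and uses only A, C, G, T
            if not barcode or set(barcode.upper()) - set("ACGT"):
                problems.append("bad barcode for sample " + row[0])
    return problems
```

An empty list means none of these particular checks failed; it does not guarantee that QIIME2 will accept the file.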

In [1]:
import os, re, numpy as np

# Prepare an object with the name of the library and all related file paths
# datasets = [['name', 'project directory path', 'output directory name', 'modified raw data directory', 'read1 file name', 'read2 file name', 'index1 file name', 'index2 file name', 'metadata file name', 'domain of life'], ...]
datasets = [['jananirpob', '/home/cassi/SFAgrowthrate/data_amplicon/janani_rpob', 'output', 
             '/home/backup_files/raw_reads/SFA2.cassi.2021/rep1.trial/modified', 
             'read1_mod.fastq.gz', 'read2_mod.fastq.gz', 'index1_mod.fastq.gz', 'index2_mod.fastq.gz', 
             'jananirpob_metadata.tsv', 'bacteria']]

# Set number of processors (10 is usually good)
processors = 10
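
The cells below pull fields out of each dataset by position (for example `dataset[8]` for the metadata file), which is easy to get wrong if the list changes. One alternative, shown here only as a sketch, is a `namedtuple` whose field order mirrors the list above. The field names `index1`, `index2` and `domain` are labels chosen for this example; the other names follow the variables used in the later cells.

```python
from collections import namedtuple

# Field order mirrors the positions used in the `datasets` list above.
Dataset = namedtuple("Dataset", [
    "name", "directory", "output", "raw",
    "read1", "read2", "index1", "index2",
    "metadata", "domain",
])

example = Dataset(*[
    "jananirpob", "/home/cassi/SFAgrowthrate/data_amplicon/janani_rpob", "output",
    "/home/backup_files/raw_reads/SFA2.cassi.2021/rep1.trial/modified",
    "read1_mod.fastq.gz", "read2_mod.fastq.gz",
    "index1_mod.fastq.gz", "index2_mod.fastq.gz",
    "jananirpob_metadata.tsv", "bacteria",
])
```

Positional access still works on a namedtuple, so `example[8]` and `example.metadata` return the same value and the existing cells would not need to change.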

# Step 2: Concatenate barcodes

### Instructions

Done on command line.

1. Run 'concatenate_barcodes_mod.py' on the modified index1 and index2 files
1. Recompress all files to .fastq.gz (including the barcodes.fastq file that you just created)

Will work on incorporating this directly into pipeline in future.

# Step 2b: Move data to output directory

In [2]:
for dataset in datasets:
    directory = dataset[1]
    output = dataset[2]
    raw = dataset[3]
    read1 = dataset[4]
    read2 = dataset[5]
    
    # Create output directory if it doesn't exist already
    if not os.path.isdir(os.path.join(directory, output)):
        !mkdir $directory/$output
    
    # Create a symbolic link to the read data (files too big, copy/paste a waste of space)
    # QIIME2 import requires a directory containing files named: forward.fastq.gz, reverse.fastq.gz and barcodes.fastq.gz 
    !ln -s $raw/$read1 $directory/$output/forward.fastq.gz
    !ln -s $raw/$read2 $directory/$output/reverse.fastq.gz
    
    # Copy concatenated barcodes to the output directory
    !cp $raw/barcodes.fastq.gz $directory/$output/
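
Re-running the cell above makes `ln -s` complain because the links already exist. A possible way around that, sketched here with a hypothetical helper `link_if_missing`, is to create the link only when nothing is at the target path yet.

```python
import os


def link_if_missing(source, link_path):
    """Create a symlink at link_path pointing to source unless something is already there.

    Returns True if a new link was created, False if link_path already existed.
    """
    if os.path.lexists(link_path):
        return False
    os.symlink(source, link_path)
    return True
```

`os.path.lexists` is used instead of `os.path.exists` so that a broken link left over from an earlier run is also detected. In the cell above it could be called as `link_if_missing(os.path.join(raw, read1), os.path.join(directory, output, "forward.fastq.gz"))`.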

# Step 3: Import into QIIME2

In [3]:
for dataset in datasets:
    name = dataset[0]
    directory = dataset[1]
    output = dataset[2]
    
    os.system(' '.join([
        "qiime tools import",
        "--type EMPPairedEndSequences",
        "--input-path "+directory+"/"+output,
        "--output-path "+directory+"/"+output+"/"+name+".qza"
    ]))
    
    # This more direct command is broken by the fact QIIME uses multiple dashes in their arguments (is my theory)
    #!qiime tools import --type EMPPairedEndSequences --input-path $directory/output --output-path $directory/output/$name.qza
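
`os.system` returns the exit status but the cells here ignore it, so a failed import can go unnoticed until a later step. As a sketch of one alternative (not what this notebook actually runs), the same command can be passed to `subprocess.run` as a list of arguments with `check=True`, which raises an error on failure. The helper names `import_command` and `run_checked` are made up for this example; the flags are the ones used in the cell above.

```python
import subprocess


def import_command(directory, output, name):
    """Build the `qiime tools import` call as an argument list (no shell involved)."""
    out_dir = directory + "/" + output
    return [
        "qiime", "tools", "import",
        "--type", "EMPPairedEndSequences",
        "--input-path", out_dir,
        "--output-path", out_dir + "/" + name + ".qza",
    ]


def run_checked(command):
    """Run a command and raise CalledProcessError if it exits with a non-zero status."""
    subprocess.run(command, check=True)
```

Inside the loop this would be used as `run_checked(import_command(directory, output, name))`. Passing a list also avoids problems with spaces in paths, since no shell parses the command.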
     

# Step 4: Demultiplex

### Notes
* The barcode you supply to QIIME is now a concatenation of your forward and reverse barcode
* Your 'forward' barcode is actually the reverse complement of your reverse barcode and the 'reverse' is your forward barcode
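
The barcode rule in the notes above can be written out in a few lines of Python. This is a sketch for building or double-checking the "BarcodeSequence" column, not code from the original pipeline; the function names are hypothetical and the barcodes in the usage line are invented.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def combined_barcode(forward_barcode, reverse_barcode):
    """Reverse complement of the reverse barcode, followed by the forward barcode."""
    return reverse_complement(reverse_barcode) + forward_barcode
```

For example, `combined_barcode("GGGG", "AACG")` returns `"CGTTGGGG"`: the reverse barcode `AACG` becomes `CGTT`, and the forward barcode is appended after it.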

In [4]:
### SLOW STEP

for dataset in datasets:
    name = dataset[0]
    directory = dataset[1]
    output = dataset[2]
    metadata = dataset[8]
    
    os.system(' '.join([
        "qiime demux emp-paired",
        "--m-barcodes-file "+directory+"/"+metadata,
        "--m-barcodes-column BarcodeSequence",
        "--p-no-golay-error-correction",
        "--i-seqs "+directory+"/"+output+"/"+name+".qza",
        "--o-per-sample-sequences "+directory+"/"+output+"/"+name+".demux.qza",
        "--o-error-correction-details "+directory+"/"+output+"/"+name+".demux-details.qza"
    ]))

In [5]:
print("Demultiplexing is complete.")

Demultiplexing is complete.


# Step 5: Visualize quality scores

Drag the .demux.qzv file produced by the command below into https://view.qiime2.org to view the quality plots.

In [6]:
for dataset in datasets:
    name = dataset[0]
    directory = dataset[1]
    output = dataset[2]
    
    os.system(' '.join([
        "qiime demux summarize",
        "--i-data "+directory+"/"+output+"/"+name+".demux.qza",
        "--o-visualization "+directory+"/"+output+"/"+name+".demux.qzv"
    ]))

# Step 6: Export demultiplexed sequences from QIIME2

In [7]:
for dataset in datasets:
    name = dataset[0]
    directory = dataset[1]
    output = dataset[2]
    metadata = dataset[8]
    
    # Export demultiplexed sequences
    os.system(' '.join([
        "qiime tools export",
        "--input-path "+directory+"/"+output+"/"+name+".demux.qza",
        "--output-path "+directory+"/"+output+"/"
    ]))
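
After the export it can be handy to see which fastq files belong to which sample. The sketch below groups the exported files by sample ID. It ASSUMES the exported files are named like `<sample>_<barcode-id>_L001_R1_001.fastq.gz`, which should be checked against the actual contents of the output directory; `exported_reads_by_sample` is a hypothetical helper.

```python
import glob
import os
from collections import defaultdict


def exported_reads_by_sample(export_dir):
    """Map sample ID -> sorted list of exported .fastq.gz file names.

    ASSUMPTION: files are named <sample>_<barcode-id>_L001_R<1|2>_001.fastq.gz.
    """
    by_sample = defaultdict(list)
    for path in sorted(glob.glob(os.path.join(export_dir, "*.fastq.gz"))):
        file_name = os.path.basename(path)
        # Split from the right so sample IDs containing underscores stay intact
        parts = file_name.rsplit("_", 4)
        if len(parts) != 5:
            continue  # not in the assumed pattern (e.g. forward.fastq.gz)
        by_sample[parts[0]].append(file_name)
    return dict(by_sample)
```

Files that do not fit the assumed pattern, such as the forward.fastq.gz, reverse.fastq.gz and barcodes.fastq.gz files already sitting in the output directory, are skipped.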

# Done!