# Pipeline 1 Tutorial
last modified 28 December 2021<br>
Code by Gary Olds and Jessie Berta-Thompson<br>
Instructions by Gary Olds and Andrew Wilson

### <font color="blue">Quick Reference Guide to Keyboard Commands</font>
Shift-Return = Run code in cell<br>
Option-Return = Run code in cell and add cell below<br>
    
**Esc = Enter Command Mode**<br>
While in command mode
    B = Add cell below<br>
    A = Add cell above<br>
    DD = Delete Cell<br>
    X = Cut Cell <br>
    M = **Markdown** mode<br>
    E = Delete Cell

## Unzip and relocate your .fastq.gz files
Once you receive your .fastq.gz files from the GBRC, unzip them and organize the files so you can analyze the sequences. Make a working directory for all of your analysis and place the unzipped .fastq files in it.
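
If you would rather unzip the files from Python than by hand, the step above can be sketched with the standard library. This is a minimal sketch, not part of the original pipeline: the function name `gunzip_fastqs` and both directory arguments are hypothetical.

```python
import gzip
import shutil
from pathlib import Path

def gunzip_fastqs(source_dir, work_dir):
    """Decompress every *.fastq.gz in source_dir into work_dir; return the new paths.

    Hypothetical helper: both directory names are assumptions, not pipeline settings.
    """
    source_dir, work_dir = Path(source_dir), Path(work_dir)
    work_dir.mkdir(parents=True, exist_ok=True)       # create the working directory if needed
    written = []
    for gz_path in sorted(source_dir.glob("*.fastq.gz")):
        out_path = work_dir / gz_path.name[:-len(".gz")]   # sample_R1.fastq.gz -> sample_R1.fastq
        with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
            shutil.copyfileobj(src, dst)              # stream the data instead of loading it all
        written.append(out_path)
    return written
```

Calling `gunzip_fastqs("downloads", MAINdir)` would leave the compressed originals in place and write the .fastq copies into the working directory, where `downloads` stands for wherever you saved the files.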

Before analyzing anything, set up your workspace: set the file path for working on sequences, load the libraries that provide the commands for analyzing your sequences, and define the functions used to run the pipeline.

### Setup
Start setting up your workspace.

In [None]:
MAINdir = "/Users/andrew.wilson/Russula_BC"
TRIMMEDdir = f"{MAINdir}/NoPrimers"

In [None]:
%cd {MAINdir}

In [9]:
##    BASICS
 
import os                #command-line like functions, for operating system interface (finding files)
import subprocess        #recommended way of running command line programs from within python
import numpy as np       #for math and arrays
import shutil            #used to move files from directory to directory
import sys               #access to interpreter settings and standard input/output streams
from tqdm import tqdm    #a progress bar for for-loops (lets you see progress in actively running loops)
from time import time    #use to track time and measure how long chunks are taking to run
import pandas as pd      #manipulate data tables
 
##    FOR DNA
 
from Bio import Seq                 #reading in and manipulating sequence data
from Bio import SeqIO               #for reading and writing fasta/qs
from Bio.SeqRecord import SeqRecord #creating sequence records that are objects and not just strings
from Bio.SeqUtils import GC         #for calculating GC content
 
##    FOR FIGURES
 
import matplotlib.pyplot as plt          #basic plotting tools that let you do most of what you'll need to do, set up as a shortcut
import matplotlib                        #get ALL of matplotlib for fancier tools
                                         #add specific other matplotlib imports as needed
matplotlib.rcParams['pdf.fonttype'] = 42 #make saved vector pdf files use fonts instead of lines/shapes for text
                                         #(essential weird line for saving vector pdfs for downstream editing in illustrator)

### Define Functions: Allows for downstream evaluation of sequence files.

In [None]:
def table(alist):
    """
    Description: Define a function that prints and stores a frequency table for 
    a list's contents. This is useful for quick summarizing and quality-control steps. 
    Input:   alist
    Output: (1) printed frequency table for how many instances of each value are in list
            (2) pair of lists, unique items list and counts list, order-matching. 
    """
    alist = list(alist) #make sure a list
    uniqueitems = sorted(list(set(list(alist)))) # get unique entries, make a list again, sort list
    counts = []#place to store frequency counts, one count for each unique item in input list
    print("value\tinstances") #print header for table
    for item in uniqueitems:
        counts.append(alist.count(item))
    for i in range(len(uniqueitems)):
        print(str(uniqueitems[i])+"\t"+str(counts[i]))
    return(uniqueitems, counts)

In [None]:
# Test that the table function written above is working. Shows frequency in the given list
TestList=[1,1,1,2,2,2,3,4,5,5]
table(TestList)
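
For long lists, the same frequency table can be built in a single pass with `collections.Counter`. This is a sketch of an alternative, not a replacement the pipeline requires: `table_counter` is a hypothetical name, and it returns the same pair of order-matched lists as `table`.

```python
from collections import Counter

def table_counter(alist):
    """Print and return a frequency table, counting all items in one pass."""
    counts_by_item = Counter(alist)              # maps each unique item to its count
    uniqueitems = sorted(counts_by_item)         # sorted unique values
    counts = [counts_by_item[item] for item in uniqueitems]
    print("value\tinstances")                    # same header as table()
    for item, count in zip(uniqueitems, counts):
        print(f"{item}\t{count}")
    return uniqueitems, counts

table_counter([1, 1, 1, 2, 2, 2, 3, 4, 5, 5])
```

For the test list this returns `([1, 2, 3, 4, 5], [3, 3, 1, 1, 2])`, matching what `table` reports.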

In [None]:
def summary_fastqs(files, outfilename):
    """
    Define a function for getting fastq file summaries
    Warning, it does math with quality scores so it's slow.
    input a list of files (full or relative path)
    output a tab delimited file of summary stats 
    about the sequences in the files
    """
    #Initialize summary stat table, one line per fastq file
    perfile = {}
    #Looking at the whole file
    perfile['path'] = []
    perfile['file'] = []
    perfile['n sequences'] = []
    perfile['total bases'] = []
    #Looking at the lengths of the sequences
    perfile['mean sequence length'] = []
    perfile['standard deviation sequence length'] = []
    perfile['median sequence length'] = []
    perfile['minimum sequence length'] = []
    perfile['maximum sequence length'] = []
    #Looking at the pool of individual base quality scores
    perfile['mean total base quality score'] = []
    perfile['standard deviation total base quality score'] = []
    perfile['min total base quality score'] = []
    perfile['max total base quality score'] = []
    #Looking at the read average quality scores
    perfile['mean read average quality score'] = []
    perfile['standard deviation read average quality score'] = []
    perfile['min read average quality score'] = []
    perfile['max read average quality score'] = []
    
    #Loop through files to collect information
    for file in tqdm(files):
        lengths = [] #read lengths
        qualities = []#qual values for all bases
        mean_qualities = []#read mean qual averages
        for rec in SeqIO.parse(file, 'fastq'):
            lengths.append(len(str(rec.seq)))
            seqqualities = rec.letter_annotations["phred_quality"] 
            qualities.extend(seqqualities)
            mean_qualities.append(np.mean(seqqualities))
            
        #Calculate and store summary stats. 
        perfile['path'].append(file)
        perfile['file'].append(file.split("/")[-1])
        perfile['n sequences'].append(len(lengths))
        perfile['total bases'].append(np.sum(lengths))
        
        #Looking at the lengths of the sequences
        perfile['mean sequence length'].append(np.mean(lengths))
        perfile['standard deviation sequence length'].append(np.std(lengths))
        perfile['median sequence length'].append(np.median(lengths))
        perfile['minimum sequence length'].append(np.min(lengths))
        perfile['maximum sequence length'].append(np.max(lengths))
        
        #Looking at the pool of individual base quality scores
        perfile['mean total base quality score'].append(np.mean(qualities))
        perfile['standard deviation total base quality score'].append(np.std(qualities))
        perfile['min total base quality score'].append(np.min(qualities))
        perfile['max total base quality score'].append(np.max(qualities))
        
        #Looking at the read average quality scores
        perfile['mean read average quality score'].append(np.mean(mean_qualities))
        perfile['standard deviation read average quality score'].append(np.std(mean_qualities))
        perfile['min read average quality score'].append(np.min(mean_qualities))
        perfile['max read average quality score'].append(np.max(mean_qualities))
    
    # Save data to csv file
    print(f"Saving fastqs summary to {outfilename}")
    df = pd.DataFrame(perfile)
    df.to_csv(outfilename, sep="\t", index=False)
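
The summary reports two different quality averages: one pooled over every base in the file, and one taken over the per-read means. The sketch below shows how they differ on two made-up reads. It assumes standard Phred+33 quality encoding, which is what Biopython's `"fastq"` format decodes; the function names are hypothetical.

```python
from statistics import mean

def phred33_scores(quality_line):
    """Decode one FASTQ quality line, assuming Phred+33 encoding."""
    return [ord(char) - 33 for char in quality_line]

def quality_summaries(quality_lines):
    """Return (pooled per-base mean, mean of per-read means)."""
    per_read = [phred33_scores(line) for line in quality_lines]
    pooled = [score for scores in per_read for score in scores]   # every base in one list
    return mean(pooled), mean(mean(scores) for scores in per_read)

# Two made-up reads: four bases at Q40 ("I") and two bases at Q0 ("!")
pooled_mean, read_mean = quality_summaries(["IIII", "!!"])
print(pooled_mean, read_mean)
```

The pooled mean is 160/6, about 26.7, while the mean of read averages is 20, because the short low-quality read counts as much as the long one in the second figure. This is why the two columns in the summary table can disagree.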

### File Evaluation: Checks to make sure all forward and reverse read data match up.

In [None]:
#Locate raw input files (includes already split replicates)
allrawdir = os.listdir(MAINdir)
forward_fastqs = []
reverse_fastqs = []
for file in allrawdir:
    if file.endswith("_R1.fastq"):
        forward_fastqs.append(file)
    elif file.endswith("_R2.fastq"):
        reverse_fastqs.append(file)

#Sort to make sure pairs are in the same order in these lists
forward_fastqs = sorted(forward_fastqs)
reverse_fastqs = sorted(reverse_fastqs)

#Report back on findings (and read to make sure sensible)
print(f"Found {len(forward_fastqs)} forward files with {len(set(forward_fastqs))} unique names, like {forward_fastqs[0]}")
print(f"Found {len(reverse_fastqs)} reverse files with {len(set(reverse_fastqs))} unique names, like {reverse_fastqs[0]}")

#Check pair order on first few
for i in range(min(3, len(forward_fastqs), len(reverse_fastqs))):
    print(forward_fastqs[i].split("/")[-1], reverse_fastqs[i].split("/")[-1])    
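
Sorting alone does not prove that every forward file has a partner. A stricter check, sketched here with the same `_R1.fastq` / `_R2.fastq` naming used above, lists any file whose mate is missing; the function name `unpaired_reads` is hypothetical.

```python
def unpaired_reads(forward_files, reverse_files):
    """Return (reverse files that are missing, reverse files with no forward mate)."""
    expected_reverse = {f.replace("_R1.fastq", "_R2.fastq") for f in forward_files}
    reverse_set = set(reverse_files)
    missing_reverse = sorted(expected_reverse - reverse_set)   # forward file has no mate
    orphan_reverse = sorted(reverse_set - expected_reverse)    # reverse file has no mate
    return missing_reverse, orphan_reverse

# Made-up file names for illustration only
missing, orphans = unpaired_reads(
    ["A_R1.fastq", "B_R1.fastq"],
    ["A_R2.fastq", "C_R2.fastq"],
)
print(missing, orphans)   # ['B_R2.fastq'] ['C_R2.fastq']
```

Both lists should be empty before you continue; with the real data you would call `unpaired_reads(forward_fastqs, reverse_fastqs)`.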


### Summarize Files: Evaluates all the reads in your files and gives summary data.

In [None]:
#####     SUMMARY ONLY STEP

summary_fastqs(forward_fastqs, "1_Raw_forward_fastqs_summary.txt")
summary_fastqs(reverse_fastqs, "1_Raw_reverse_fastqs_summary.txt")
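
Once both summary files exist, a quick way to confirm that forward and reverse data match up is to compare the `n sequences` column pair by pair. This sketch uses only the standard `csv` module and the column names written by `summary_fastqs`; the function names are hypothetical.

```python
import csv

def read_counts(summary_path):
    """Map each fastq file name to its read count from a tab-delimited summary file."""
    with open(summary_path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return {row["file"]: int(row["n sequences"]) for row in reader}

def mismatched_pairs(forward_summary, reverse_summary):
    """Return (forward file, forward count, reverse count) for every pair that disagrees."""
    fwd = read_counts(forward_summary)
    rev = read_counts(reverse_summary)
    mismatches = []
    for ffile, n_fwd in sorted(fwd.items()):
        rfile = ffile.replace("_R1.fastq", "_R2.fastq")
        n_rev = rev.get(rfile)                 # None if the reverse file was not summarized
        if n_rev != n_fwd:
            mismatches.append((ffile, n_fwd, n_rev))
    return mismatches
```

An empty list from `mismatched_pairs("1_Raw_forward_fastqs_summary.txt", "1_Raw_reverse_fastqs_summary.txt")` means every pair has the same number of reads.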

### Define Primers

In [None]:
#Forward primer (ITS7f) - without spacer (on 5' end, distal to cuts), with R for degenerate base (cutadapt fine with that)
forward_primer = Seq.Seq("GTGARTCATCGAATCTTTG")            #converting to sequence object for handy complementing
forward_primer_complement = forward_primer.reverse_complement()

#Reverse primer (ITS4) - without spacer (on 5' end) and without 5 Ns for barcode region (between spacer and primer) 
reverse_primer = Seq.Seq("TCCTCCGCTTATTGATATGC")
reverse_primer_complement = reverse_primer.reverse_complement()

#Make copies of those in simple string form for feeding to cutadapt
fprimer = str(forward_primer)
fprimer_rc = str(forward_primer_complement)
rprimer = str(reverse_primer)
rprimer_rc = str(reverse_primer_complement)
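
To see what `reverse_complement()` produces for these primers, including the degenerate base R, here is a plain-Python sketch that needs no Biopython. It is for illustration only; the translation table covers the IUPAC nucleotide codes, and the function name is hypothetical.

```python
# Complement table for IUPAC nucleotide codes (R = A/G pairs with Y = C/T, and so on)
IUPAC_COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")

def reverse_complement(seq):
    """Complement every base, then reverse the string."""
    return seq.upper().translate(IUPAC_COMPLEMENT)[::-1]

print(reverse_complement("GTGARTCATCGAATCTTTG"))    # CAAAGATTCGATGAYTCAC
print(reverse_complement("TCCTCCGCTTATTGATATGC"))   # GCATATCAATAAGCGGAGGA
```

When a read runs through the whole amplicon, the opposite primer shows up at its 3' end in this reverse-complemented form, which is why the reverse complements are kept alongside the primers themselves.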

### Run CutAdapt
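This removes read-through primers, and anything after them such as spacers, from the 3' end of the sequence reads. The `-m 200` option sets a minimum read length of 200 bases.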

In [None]:
#Save stdout to a file along the way - cutadapt prints lots of reporting information here
with open("2_cutadapt_noprimers.txt", "w") as stdouthandle:
    
    #Loop over pairs
#    for ffile in tqdm(range(len(forward_fastqs))):
    for ffile in tqdm(forward_fastqs):
        
        rfile = ffile.replace("_R1.fastq","_R2.fastq")
        foutfastq = ffile.replace("R1.fastq","R1_noprimer.fastq")
        routfastq = rfile.replace("R2.fastq","R2_noprimer.fastq")
        
#        #Compose command
#        #-g, -G removes from 5' end for f and r reads
#        cmd = f"cutadapt -g {fprimer} -G {rprimer} -m 200 -o {foutfastq} -p {routfastq} {ffile} {rfile}"

        #Compose command
        #-a, -A removes given primer from 3' end for f and r reads
        cmd = f"cutadapt -a {rprimer_rc} -A {fprimer_rc} -m 200 -o {foutfastq} -p {routfastq} {ffile} {rfile}"

#        print(cmd+"\n")
        
        #Run command
        subprocess.call(cmd, stdin=None, stdout=stdouthandle, stderr=subprocess.STDOUT, shell=True)        
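
The same cutadapt call can also be written as an argument list, which avoids `shell=True` and copes with file names that contain spaces. This is a sketch under that assumption, not the original pipeline code: `build_cutadapt_cmd` and `run_cutadapt` are hypothetical helpers that use only the flags already shown above.

```python
import subprocess

def build_cutadapt_cmd(ffile, rfile, fprimer_rc, rprimer_rc, min_length=200):
    """Build the paired-end cutadapt command as a list of arguments."""
    foutfastq = ffile.replace("R1.fastq", "R1_noprimer.fastq")
    routfastq = rfile.replace("R2.fastq", "R2_noprimer.fastq")
    return [
        "cutadapt",
        "-a", rprimer_rc,          # 3' primer on forward reads
        "-A", fprimer_rc,          # 3' primer on reverse reads
        "-m", str(min_length),     # minimum read length
        "-o", foutfastq,
        "-p", routfastq,
        ffile, rfile,
    ]

def run_cutadapt(cmd, log_handle):
    """Run the command, send all output to log_handle, and return the exit code.

    Raises FileNotFoundError if cutadapt is not installed or not on the PATH.
    """
    result = subprocess.run(cmd, stdout=log_handle, stderr=subprocess.STDOUT)
    return result.returncode
```

A non-zero return value from `run_cutadapt` signals that cutadapt failed for that pair, which the loop above would otherwise pass over silently.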

### No Primers Folder
Make a folder to put your primer-free sequences in.

In [None]:
#Make a directory for files without primers & adapters, cutadapt results:
%mkdir NoPrimers

TOTAL = os.listdir()                  #take everything in the current directory and call it "TOTAL"
MAIN = f"{MAINdir}/"                 #name the full directory path to the main directory "MAIN"
NoPrimer = f"{TRIMMEDdir}/"          #name the destination directory "NoPrimer"
for file in TOTAL:                    #loop over every file in the current directory
    if ("_noprimer.fastq" in file):   #select files with "_noprimer.fastq" in their name (the forward and reverse cutadapt outputs)
        src = MAIN+file               #define the source location of the file in question
        dst = NoPrimer+file           #define the destination location of the file in question
        shutil.move(src,dst)          #move the file from source to destination
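
The move step can also be wrapped in a function that creates the destination folder if it is missing, so the cell can be rerun without `mkdir` complaining. This is a sketch using `pathlib`; `move_trimmed` is a hypothetical name and the suffix is the one produced by the cutadapt step.

```python
import shutil
from pathlib import Path

def move_trimmed(main_dir, trimmed_dir, suffix="_noprimer.fastq"):
    """Move every file ending in suffix from main_dir to trimmed_dir; return moved names."""
    main_dir, trimmed_dir = Path(main_dir), Path(trimmed_dir)
    trimmed_dir.mkdir(parents=True, exist_ok=True)    # no error if the folder already exists
    moved = []
    for path in sorted(main_dir.iterdir()):
        if path.is_file() and path.name.endswith(suffix):
            shutil.move(str(path), str(trimmed_dir / path.name))
            moved.append(path.name)
    return moved
```

With the variables from Setup, `move_trimmed(MAINdir, TRIMMEDdir)` would do the same job as the loop above and report which files were moved.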

### Run read files through DADA2.
The next steps run in RStudio, so continue the analysis there.