## Bioinformatics Workshop
### Kavli Institute for Theoretical Physics
#### Gita Mahmoudabadi | Phillips Lab | Caltech | August 2015

In this part of the exercise, we will perform quality control on Next Generation Sequencing data. Our first filter will eliminate any read that contains a base with a Phred quality score below 30 (apart from the first and last two bases of each read, which we skip for reasons explained in the code comments). A Phred score of 30 corresponds to an expected error rate of 1 in 1000 base calls.
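
As a quick check of that correspondence, the standard Phred definition relates a score Q to an error probability P = 10^(-Q/10). The short sketch below is only illustrative; the function name `phred_to_error_prob` is our own and is not part of Biopython.

```python
def phred_to_error_prob(q):
    """Convert a Phred quality score to the probability that the base call is wrong."""
    # Standard Phred definition: P = 10^(-Q/10)
    return 10 ** (-q / 10.0)

# Print the expected error rate for a few common score thresholds
for q in (10, 20, 30, 40):
    one_in = round(1 / phred_to_error_prob(q))
    print("Q%d -> expected error rate of 1 in %d" % (q, one_in))
```

Raising the threshold by 10 Phred units makes the filter ten times stricter, which is why a cutoff of 30 is a common compromise between read quality and read count.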

The input file, "oral_phages.fastq" (stored in the variable main_file in the code), contains phage sequences obtained from the oral cavities of different individuals. The output files will be "highq_main.fastq" for high quality reads and "lowq_main.fastq" for low quality reads.

Let's start by importing the relevant modules and opening some files. But first, make sure you have moved "oral_phages.fastq" to the directory containing this IPython notebook. Otherwise, you will need the os module to change directories from within your Python code.
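
If the file lives somewhere else, one option is to build its full path with the os module rather than changing directories. The helper below is a minimal sketch; the name `locate_input` and its `data_dir` argument are hypothetical and are not used elsewhere in this notebook.

```python
import os

def locate_input(filename, data_dir=None):
    """Return the path to filename, optionally looking inside data_dir.

    Hypothetical helper: raises FileNotFoundError early, so a misplaced
    input file is reported before any parsing starts.
    """
    path = filename if data_dir is None else os.path.join(data_dir, filename)
    if not os.path.isfile(path):
        raise FileNotFoundError("Could not find %s" % os.path.abspath(path))
    return path
```

With the file in the notebook's directory, `locate_input('oral_phages.fastq')` simply returns the file name unchanged, and the result can be passed to `open` as in the cell that follows.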

In [26]:
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
import re
import os

#this is the file containing all joined paired end reads
main_file = 'oral_phages.fastq'

#creating a file handle to parse through the fastq records
h_main = open(main_file)
records = SeqIO.parse(h_main, 'fastq')

#opening a file to keep all high quality reads in
#note: mode 'a' appends, so delete any old output files before re-running this cell
h_highq = open('highq_main.fastq', 'a')

#opening a file to keep all low quality reads in
h_lowq = open('lowq_main.fastq', 'a')


We will use a for loop to examine each "record", or read. Inside that outer loop, we need an inner loop to examine every single base within a record. If a record contains any low quality base, we write it to "lowq_main.fastq"; otherwise, we write it to "highq_main.fastq". Throughout, we also keep a count of low and high quality reads.

In [27]:
#initializing counters. ch will keep count of high quality reads and cl will keep count of low quality ones.
ch = 0
cl = 0

#looping through each fastq record in the mainfile
for record in records: 
    #creating a list of the Phred scores, while ignoring the first and last two bases of every read. These bases typically have low Phred scores, but errors there can be resolved using error detection codes for barcode sequences
    q_list = record.letter_annotations["phred_quality"][2:-2]
    #note: a read with 4 or fewer bases gives an empty list, so value stays 0 and the read is written to neither file
    value = 0
    #looping through the list of Phred scores for each read
    for q in q_list:
        #if the score is greater than 29 (a Phred score of 30 means an error rate of 1 in 1000), value is 1, and we can continue stepping through the loop
        if q > 29:
            value = 1
        else:
            #one low quality base is enough to reject the read, so we write it out, count it, and leave the inner loop
            value = 0
            SeqIO.write(record, h_lowq, 'fastq')
            cl = cl + 1
            break
    #at the end of the inner loop, if value is still 1, the list of Phred scores for that record contained no low quality bases, and we can write the record to the highq_main.fastq file
    if value == 1:
        SeqIO.write(record, h_highq, 'fastq')
        #here, we are keeping count of high quality records
        ch = ch + 1
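
The per-base test in the inner loop can also be phrased as a single comparison: a read passes if the smallest score among its trimmed bases is at least 30. The sketch below works on a plain list of Phred scores so that it can be tried without a FASTQ file. The function name `is_high_quality` and its parameters are hypothetical, and treating reads too short to survive trimming as low quality is an assumption made for this sketch, not the behavior of the loop above.

```python
def is_high_quality(phred_scores, threshold=30, trim=2):
    """Return True if every base, after trimming, has a score >= threshold."""
    # Drop `trim` bases from each end, mirroring the [2:-2] slice used above
    trimmed = phred_scores[trim:len(phred_scores) - trim]
    # Assumption for this sketch: reads with no bases left count as low quality
    if not trimmed:
        return False
    # The read passes only if its worst remaining base meets the threshold
    return min(trimmed) >= threshold

# Low scores at the ends are ignored; a low score in the middle is not
print(is_high_quality([10, 10, 35, 40, 30, 5, 5]))
print(is_high_quality([40, 40, 29, 40, 40]))
```

Inside the record loop, the same function could be called on `record.letter_annotations["phred_quality"]` with `trim=2` to decide which output file a record belongs in.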

Let's not forget to close our open files!

In [28]:
h_highq.close()
h_lowq.close() 
h_main.close()

Finally, the counters will tell us how many reads we ended up with after quality filtering. 

In [29]:
print('There were %d high quality sequences and %d low quality sequences found.' % (ch, cl))

There were 7502516 high quality sequences and 764156 low quality sequences found.
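
To put these counts in perspective, it can help to express them as a fraction of all classified reads. The snippet below is a small sketch that copies the two numbers from the output above; the variable names `n_high` and `n_low` are ours, chosen so that the notebook's own counters are not overwritten.

```python
# Counts copied from the output above
n_high = 7502516
n_low = 764156

total = n_high + n_low
frac_high = n_high / float(total)

print('%.1f%% of the %d classified reads passed the quality filter.' % (100 * frac_high, total))
```

Roughly nine out of ten reads survive this filter, so the strict per-base threshold removes only a modest share of the data.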
