# __Homework 4:__ Practical analysis with BioPython

For the homework, you are going to extend the code from the analysis of our FASTQ file in lectures 8 and 9.
Recall that the FASTQ file contains reads from a real sequencing run of influenza virus HA and NA genes.

---
The __actual sequences__ are as follows:

    5'-[end of HA]-AGGCGGCCGC-[16 X N barcode]-3'
or 

    5'-[end of NA]-AGGCGGCCGC-[16 X N barcode]-3'
---


__The end of NA is__ `...CACGATAGATAAATAATAGTGCACCAT`
    
__The end of HA is__ `...CCGGATTTGCATATAATGATGCACCAT`

---    

    
The __sequencing reads__ start from the reverse end of the molecules (in 5'>3' orientation), so each read is the reverse complement of the actual sequence:

    5'-[reverse complement of 16 X N barcode]-GCGGCCGCCT-[reverse complement of the end of HA]-3'
or

    5'-[reverse complement of 16 X N barcode]-GCGGCCGCCT-[reverse complement of the end of NA]-3'

---   
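
To make the orientation concrete, here is a minimal sketch that builds one toy HA molecule and the read it would produce. The barcode `AAAACCCCGGGGTTTA` is made up for illustration, and `reverse_complement` is a small pure-Python stand-in (assuming only A/C/G/T) for the `reverse_complement()` method that Biopython sequence objects provide.

```python
# Pure-Python stand-in for Biopython's reverse_complement()
# (assumption: sequences contain only uppercase A, C, G, T).
COMPLEMENT = str.maketrans('ACGT', 'TGCA')

def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

end_of_ha = 'CCGGATTTGCATATAATGATGCACCAT'  # from the assignment text
linker = 'AGGCGGCCGC'                       # from the assignment text
toy_barcode = 'AAAACCCCGGGGTTTA'            # hypothetical 16-nt barcode

# The actual molecule: 5'-[end of HA]-AGGCGGCCGC-[barcode]-3'
molecule = end_of_ha + linker + toy_barcode

# The sequencer reads from the other end, so the read is the reverse complement.
read = reverse_complement(molecule)

# The read begins with the reverse-complemented barcode, then GCGGCCGCCT.
print(read[:16], read[16:26])

# Reverse-complementing the read restores the original orientation.
print(reverse_complement(read) == molecule)
```

This is why step 1 of the analysis takes the reverse complement of every read before any pattern matching.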
    
The reads can originate from **either** HA or NA, and the two are distinguished by the 3' end of the read.
In the in-class exercise we did not distinguish between reads matching HA and NA, because we did not look far enough into the read to tell them apart.

For the homework, your goal is to write code that extends the material from lectures 8 and 9 to also distinguish between HA and NA.
This homework can be completed almost entirely by re-using code from lecture 9. You will need to set up your analysis to do the following:
 1. Get the reverse complement of each read.
 2. Determine if it matches the expected pattern for HA or NA, and if so which one.
 3. If it matches, extract the barcode and add it to a dictionary to keep track of counts.
 4. Determine the number and distribution of barcodes for HA and NA separately.

Please include code to address each of the following questions. Please include code comments to explain what your code is attempting to accomplish. Don't forget to include references to the sources you used to obtain your answer, including your classmates (if you are working in groups).  
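
Steps 1 to 3 listed above could be sketched on toy data roughly as follows. This is one possible approach, not the lecture's code: it uses a single regular expression that requires the gene end, the linker, and the barcode to be contiguous and anchored at the end of the reverse-complemented read, which is stricter than searching for each piece separately and may give different counts on real data. The helper names (`classify`, `count_barcodes`, `make_read`) and the toy barcodes are hypothetical.

```python
import re
from collections import Counter

# Gene ends taken from the assignment text.
GENE_ENDS = {
    'CCGGATTTGCATATAATGATGCACCAT': 'HA',
    'CACGATAGATAAATAATAGTGCACCAT': 'NA',
}

# One pattern: gene end, linker, then a 16-nt barcode at the very end.
PATTERN = re.compile(
    '(?P<end>' + '|'.join(GENE_ENDS) + ')'
    + 'AGGCGGCCGC(?P<barcode>[ACGT]{16})$'
)

COMPLEMENT = str.maketrans('ACGT', 'TGCA')

def reverse_complement(seq):
    # Stand-in for Biopython's reverse_complement(); other letters (e.g. N)
    # are left unchanged by this simple translation table.
    return seq.translate(COMPLEMENT)[::-1]

def classify(read):
    """Return (gene, barcode) if the read matches the pattern, else None."""
    match = PATTERN.search(reverse_complement(read))
    if match is None:
        return None
    return GENE_ENDS[match.group('end')], match.group('barcode')

def count_barcodes(reads):
    """Tally barcodes separately for HA and NA."""
    counts = {'HA': Counter(), 'NA': Counter()}
    for read in reads:
        hit = classify(read)
        if hit is not None:
            gene, barcode = hit
            counts[gene][barcode] += 1
    return counts

def make_read(gene_end, barcode):
    # Build a toy read the way the sequencer would see the molecule.
    return reverse_complement(gene_end + 'AGGCGGCCGC' + barcode)

ha_end, na_end = list(GENE_ENDS)
toy_reads = [
    make_read(ha_end, 'AAAACCCCGGGGTTTA'),
    make_read(ha_end, 'AAAACCCCGGGGTTTA'),
    make_read(na_end, 'TTTTGGGGCCCCAAAC'),
    make_read(na_end, 'TTTTGGGGCCCCAAAN'),  # N makes the barcode invalid
]

counts = count_barcodes(toy_reads)
print(counts['HA'])
print(counts['NA'])
```

Capturing the gene end in a named group lets a single search both classify the read and extract its barcode.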

In [1]:
import Bio.SeqIO
import Bio.Seq
import re

In [2]:
# read in the FASTQ file using Bio.SeqIO and store the records in a list
# reference - lecture 9
reads = Bio.SeqIO.parse('barcodes_R1.fastq', format='fastq')
seqreads = list(reads)

In [3]:
# set up regex patterns for ends of NA and HA sequences
pattern_NA = re.compile('CACGATAGATAAATAATAGTGCACCAT')
pattern_HA = re.compile('CCGGATTTGCATATAATGATGCACCAT')

# set up barcode regex pattern
pattern_barcode = re.compile('AGGCGGCCGC(?P<barcode>[ATGC]{16})')

# initialize barcode dictionaries
barcodes_NA = {}
barcodes_HA = {}

# initialize NA and HA counts
count_NA = 0
count_HA = 0

# go through sequences
for seqread in seqreads:
    # get reverse complement of sequence
    revseq = str(seqread.seq.reverse_complement())
    # check if NA
    if pattern_NA.search(revseq):
        # add one to count for NA
        count_NA += 1
        # get barcode
        barcode_search = pattern_barcode.search(revseq)
        if barcode_search:
            barcode = barcode_search.group('barcode')
        else:
            # if a valid barcode isn't found, count it in the dictionary as "No valid barcode"
            barcode = 'No valid barcode'
        # add barcode to dictionary
        if barcode in barcodes_NA:
            barcodes_NA[barcode] += 1
        else:
            barcodes_NA[barcode] = 1
    # check if HA
    elif pattern_HA.search(revseq):
        # add one to count for HA
        count_HA += 1
        # get barcode
        barcode_search = pattern_barcode.search(revseq)
        if barcode_search:
            barcode = barcode_search.group('barcode')
        else:
            # if a valid barcode isn't found, count it in the dictionary as "No valid barcode"
            barcode = 'No valid barcode'
        # add barcode to dictionary
        if barcode in barcodes_HA:
            barcodes_HA[barcode] += 1
        else:
            barcodes_HA[barcode] = 1
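
Step 4 of the task list also asks for the distribution of barcodes. Once a dictionary like `barcodes_HA` has been filled by the loop above, one way to summarize it is to count how many distinct barcodes were seen once, twice, and so on. The sketch below runs on a made-up dictionary; `count_distribution` is a hypothetical helper, not something from the lectures.

```python
from collections import Counter

def count_distribution(barcode_counts, skip='No valid barcode'):
    """Map 'times observed' to 'number of distinct barcodes seen that often'."""
    return Counter(
        n for barcode, n in barcode_counts.items() if barcode != skip
    )

# Toy stand-in for barcodes_HA / barcodes_NA (values are invented).
toy_counts = {
    'AAAACCCCGGGGTTTA': 3,
    'TTTTGGGGCCCCAAAC': 1,
    'ACACACACGTGTGTGT': 1,
    'No valid barcode': 5,
}

distribution = count_distribution(toy_counts)
for times_seen in sorted(distribution):
    print(times_seen, 'read(s):', distribution[times_seen], 'barcode(s)')

# Number of distinct valid barcodes.
print('distinct barcodes:', sum(distribution.values()))
```

Applied to `barcodes_HA` and `barcodes_NA`, the same function would give the two distributions separately.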

1. How many reads map to HA, and how many reads map to NA?

In [4]:
print(count_HA, 'sequences map to HA')
print(count_NA, 'sequences map to NA')

5409 sequences map to HA
4122 sequences map to NA


2. How many HA sequences did not have a valid barcode? Also answer the same question for NA.

In [5]:
# .get() returns 0 instead of raising KeyError if every read had a valid barcode
print(barcodes_HA.get('No valid barcode', 0), 'HA sequences do not have a valid barcode')
print(barcodes_NA.get('No valid barcode', 0), 'NA sequences do not have a valid barcode')

160 HA sequences do not have a valid barcode
213 NA sequences do not have a valid barcode


3. What is the HA barcode with the most counts (and how many counts)? Also answer the same question for NA.

    _Hint: you will need to find the key associated with the maximum value in your dictionary. There are many ways to do this._

In [6]:
# Initialize max count and corresponding barcode variables
max_count_HA = 0
max_barcode_HA = ''
# Iterate through barcodes in dictionary
for barcode in barcodes_HA:
    # ignore the 'No valid barcode' entry in dict
    if barcode == 'No valid barcode':
        continue
    # Check if larger than current largest barcode count
    if barcodes_HA[barcode] > max_count_HA:
        max_count_HA = barcodes_HA[barcode]
        max_barcode_HA = barcode

# Initialize max count and corresponding barcode variables
max_count_NA = 0
max_barcode_NA = ''
# Iterate through barcodes in dictionary
for barcode in barcodes_NA:
    # ignore the 'No valid barcode' entry in dict
    if barcode == 'No valid barcode':
        continue
    # Check if larger than current largest barcode count
    if barcodes_NA[barcode] > max_count_NA:
        max_count_NA = barcodes_NA[barcode]
        max_barcode_NA = barcode

print("The HA barcode with the most counts is", max_barcode_HA, "with", max_count_HA, "counts")
print("The NA barcode with the most counts is", max_barcode_NA, "with", max_count_NA, "counts")

The HA barcode with the most counts is CCCGACCCGACATTAA with 155 counts
The NA barcode with the most counts is ACCAGTTCTCCCCGGG with 152 counts


I also tried to find a shorter way to do this:

In [10]:
# for this, used a solution from https://www.geeksforgeeks.org/python-get-key-with-maximum-value-in-dictionary/

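# drop the 'No valid barcode' entry first; note that pop() modifies the
# dictionaries in place, so the counts printed in cell 5 are no longer stored
# https://stackoverflow.com/questions/11277432/how-can-i-remove-a-key-from-a-python-dictionary
barcodes_HA.pop('No valid barcode', None)
barcodes_NA.pop('No valid barcode', None)

# zip() pairs each count with its barcode as (count, barcode) tuples;
# max() compares tuples by count first, so it returns the highest-count pair
max_HA = max(zip(barcodes_HA.values(), barcodes_HA.keys()))
max_NA = max(zip(barcodes_NA.values(), barcodes_NA.keys()))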

print("The HA barcode with the most counts is", max_HA[1], "with", max_HA[0], "counts")
print("The NA barcode with the most counts is", max_NA[1], "with", max_NA[0], "counts")

The HA barcode with the most counts is CCCGACCCGACATTAA with 155 counts
The NA barcode with the most counts is ACCAGTTCTCCCCGGG with 152 counts
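
A third option, sketched here on made-up data, finds the top barcode without removing anything from the dictionary. It filters out the placeholder key into a temporary copy and then uses `max()` with `key=` so that barcodes are compared by their counts. `top_barcode` is a hypothetical helper name. On ties, this version returns the first barcode encountered, whereas the `zip()` approach returns the alphabetically largest one.

```python
def top_barcode(barcode_counts, skip='No valid barcode'):
    """Return (barcode, count) for the most frequent barcode, ignoring `skip`."""
    # Build a filtered copy so the original dictionary is left untouched.
    valid = {b: n for b, n in barcode_counts.items() if b != skip}
    if not valid:
        return None
    # key=valid.get makes max() compare barcodes by their counts.
    best = max(valid, key=valid.get)
    return best, valid[best]

# Toy stand-in for barcodes_HA (values are invented).
toy_counts = {
    'AAAACCCCGGGGTTTA': 3,
    'TTTTGGGGCCCCAAAC': 1,
    'No valid barcode': 5,
}

barcode, count = top_barcode(toy_counts)
print("The barcode with the most counts is", barcode, "with", count, "counts")

# The placeholder entry is still available afterwards.
print(toy_counts['No valid barcode'])
```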
