# Pattern matching in sequence

---
## Learning Objectives

* Download files programmatically
* Read & process FASTA files
* Read & process GFF files
* Pattern matching

---
## Background

Bacteria use pattern matching in a variety of processes. These include restriction enzymes, which cut DNA at exact sequences, as well as other factors that allow more flexibility in the pattern match. A currently popular example is CRISPR-Cas9, where a 20 bp guide sequence is used to scan the genome and direct cutting of the DNA. One of the earliest known sequence patterns in bacteria is the Shine-Dalgarno sequence. This pattern was identified in the mid-1970s in _E. coli_, where the 3' end of the 16S rRNA was found to recognize a complementary sequence (AGGAGGU) upstream of the start codon (AUG). This sequence has been shown to be important for the initiation of translation by bacterial ribosomes.

Our goal today will be to identify genes with the exact matching sequence AGGAGGU (written AGGAGGT in the DNA sequence) in the 50 bp upstream of genes in the _Bacillus subtilis_ genome.

### FASTA Format


Recall from the lecture slides that the FASTA format (typically with the file extension .fa) has the following format:
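As a small illustration (the record names and sequences below are invented, not taken from the course data), a FASTA file is a series of records, each made of a header line starting with `>` followed by one or more sequence lines. A minimal sketch that holds such text in a string and splits it into records:

```python
# Hypothetical two-record FASTA text; names and sequences are invented for illustration.
example_fasta = """\
>seq0 first example record
GGCAGATTCC
CCCTAGACCC
>seq1 second example record
ATGCATGC
"""

records = {}
current = None
for line in example_fasta.splitlines():
    if line.startswith(">"):
        current = line[1:]        # header text without the leading '>'
        records[current] = ""
    elif line:
        records[current] += line  # sequence lines of one record are concatenated

for header, sequence in records.items():
    print(header, len(sequence))
```

Note that a sequence may be wrapped over several lines, so a reader has to join lines until it meets the next header.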


### GFF Format

Also recall from the slides that the GFF format (or Generic Feature Format) is used to annotate features in the genome. For our use case here, we will be using an annotation of the genome of _B. subtilis_ from NCBI to identify the locations of coding sequences. GFF has the following format:
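In short, each GFF feature line has nine tab-separated columns: seqid, source, type, start, end, score, strand, phase, and attributes. A minimal sketch that splits one such line into named fields; the line itself is a made-up example, not a record from the NCBI annotation:

```python
# The nine standard GFF columns, in order.
GFF_COLUMNS = ["seqid", "source", "type", "start", "end",
               "score", "strand", "phase", "attributes"]

# Hypothetical feature line, invented for illustration.
line = "contig1\texample\tCDS\t410\t1750\t.\t+\t0\tID=cds1;Name=geneA"

fields = dict(zip(GFF_COLUMNS, line.split("\t")))

# start and end are 1-indexed and inclusive, so convert them to int before doing arithmetic.
start, end = int(fields["start"]), int(fields["end"])
print(fields["type"], start, end, fields["strand"])
```

Lines beginning with `#` are comments or directives and carry no feature.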

---
## Imports

In [1]:
import gzip

---
## Get the data


Go to NCBI and search for _Bacillus subtilis_ subsp. _subtilis_ str. 168 using the 'Genome' dropdown, then copy the links to both the genome FASTA file and the GFF annotation file.

Now use `wget` to download the files into the `data/` folder.

If `wget` is not already installed on your system, you can install it with `conda install wget`.
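If `wget` is not available, the same download can be done from Python with the standard library. This is a minimal sketch, not the required approach: the helper name `download` is hypothetical, and the placeholders in the usage comments must be replaced with the links copied from NCBI.

```python
import os
import urllib.request


def download(url, dest_dir="data"):
    """Download url into dest_dir unless the file already exists; return the local path."""
    os.makedirs(dest_dir, exist_ok=True)
    dest = os.path.join(dest_dir, os.path.basename(url))
    if not os.path.exists(dest):
        urllib.request.urlretrieve(url, dest)
    return dest


# Usage (placeholders, paste the links copied from NCBI):
# genome_path = download("<genome FASTA link>")
# gff_path = download("<GFF annotation link>")
```

Skipping files that already exist keeps repeated runs of the notebook from downloading the genome again.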

<center><img src='./figures/bsubtilis_screenshot.png'/></center>

---
## Implement FASTA and GFF reader

We first need to build generic functions for handling sequence data (FASTA) and annotations of coordinates on that sequence (GFF). Because the _B. subtilis_ genome FASTA file has only a single entry, we have provided an example FASTA file (data/example.fa) so you can check that your code handles multiple sequences appropriately.

In [2]:
#FASTA READER

def get_fasta(file):
    '''Generator to lazily get all the fasta entries from a fasta file

    Args:
        file (str): /path/and/name/to.fa[.gz] 
            (file may be gzipped)

    Yields:
        header (str): header sequence of fasta entry (excludes '>')
        seq (str): concatenated string sequence of the fasta entry
        
    Example:
        >>> for name, seq in get_fasta("data/example.fa.gz"): print(name, seq) #doctest: +ELLIPSIS
        seq0 GGCAG...
    '''
    # Choose the opener from the file extension: gzip.open for gzipped files,
    # the built-in open otherwise. "rt" reads text (str) rather than bytes.
    opener = gzip.open if file.endswith(".gz") else open

    name = None
    seq_parts = []
    with opener(file, "rt") as file_open:
        for line in file_open:
            line = line.strip()
            if line.startswith(">"):
                # A new header closes the previous entry, so yield it first.
                # Entries without any sequence lines are skipped.
                if name is not None and seq_parts:
                    yield name, "".join(seq_parts)
                name = line[1:]  # drop the leading ">"
                seq_parts = []
            elif line and name is not None:
                # Sequence lines are collected and joined once per entry.
                seq_parts.append(line)

    # The last entry has no following header to trigger the yield above.
    if name is not None and seq_parts:
        yield name, "".join(seq_parts)
    

In [3]:
#GFF READER
# This class does not need to be edited
from functools import total_ordering

@total_ordering
class GffEntry:
    ''' Main class for handling GFF entries
    
    Truncates the GFF entry to required data only. Also, class is totally ordered.
    This means that comparison operators can be used on GffEntry objects
    
    Attributes:
        seqid (str): The contig that the GFF entry is associated with
        type (str): The type of object that the GFF entry classifies as
        start (int): The left-most nucleotide position of the GFF entry relative to seqid (1-indexed)
        end (int): The right-most nucleotide position of the GFF entry relative to seqid (1-indexed)
        strand (str): Whether the entry is on the forward (+), backward (-) strand or N/A (.)
    '''
    
    # __slots__ (with double underscores) restricts instances to these attributes
    __slots__ = ('seqid', 'type', 'start', 'end', 'strand')
    
    def __init__(self, args):
        """Initialize the object.
        
        Aggregates all GFF entry columns, and selectively assigns them to attributes
        
        Args:
            args (list): the complete stripped and split GFF entry line
        """
        self.seqid = args[0]
        self.type = args[2]
        self.start = int(args[3])
        self.end = int(args[4])
        self.strand = args[6]
    
    def __str__(self):
        """Determines how GffEntry appear when `print()` is called on them"""
        return f'{self.seqid}\t{self.type}\t{self.start}\t{self.end}\t{self.strand}'
    
    def __len__(self):
        """Determines how GffEntry reports when `len()` is called on them"""
        # start and end are 1-indexed and inclusive, hence the + 1
        return self.end - self.start + 1

    def __eq__(self, other):
        if not isinstance(other, GffEntry):
            return NotImplemented
        self_check = (self.seqid, self.type, self.start, self.end, self.strand)
        other_check = (other.seqid, other.type, other.start, other.end, other.strand)
        return self_check == other_check

    def __lt__(self, other):
        # Order by contig, then start, then end (tuples compare element by element)
        return (self.seqid, self.start, self.end) < (other.seqid, other.start, other.end)
        
def get_gff(gff_file):
    '''Generator that lazily reports each of the GFF entries within the GFF file
    
    Args:
        gff_file (str): /path/and/name/to.gff[.gz]
    
    Yields:
        (GffEntry): A GFF entry object with the attributes of seqid, type, start, end, and strand

    '''
    
    # gzip.open for gzipped files, the built-in open otherwise; read as text
    opener = gzip.open if gff_file.endswith('.gz') else open
    with opener(gff_file, 'rt', encoding='utf-8') as handle:
        for entry in handle:
            entry = entry.strip()
            # Skip blank lines and comment/directive lines
            if not entry or entry.startswith('#'):
                continue
            yield GffEntry(entry.split('\t'))
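The ordering behaviour of `GffEntry` relies on `functools.total_ordering`, which derives the remaining comparison operators from `__eq__` and `__lt__`. A minimal sketch with a hypothetical `Interval` class (a stand-in, not part of the assignment code) shows the effect:

```python
from functools import total_ordering


@total_ordering
class Interval:
    """Hypothetical stand-in for a GFF entry: a contig name plus start/end."""

    def __init__(self, seqid, start, end):
        self.seqid, self.start, self.end = seqid, start, end

    def _key(self):
        return (self.seqid, self.start, self.end)

    def __eq__(self, other):
        return self._key() == other._key()

    def __lt__(self, other):
        return self._key() < other._key()


a = Interval("chr1", 100, 200)
b = Interval("chr1", 150, 180)

# Only __eq__ and __lt__ are defined; total_ordering supplies <=, > and >=.
print(a <= b, a > b, sorted([b, a])[0] is a)
```

Because the objects are comparable, a list of entries can be passed directly to `sorted()` to put them in genome order.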

---
## Extract promoter regions

Given a genome and a list of genes, we are interested in studying the regions just upstream of each gene, which we loosely call promoter regions here. We now need to write a function for extracting sequence from genomic coordinates. Specifically, we would like to extract the 50 bp upstream of the translation start site (TSS, the start codon) for every gene. Because the start codon is the first position of the coding sequence, use the 'CDS' annotation from the GFF file. Remember that for genes on the - strand the upstream region lies to the right of the gene in genome coordinates, and you will need to take the reverse complement!
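The coordinate arithmetic is easy to get off by one, because GFF positions are 1-indexed and inclusive while Python slices are 0-indexed and end-exclusive. A minimal sketch on an invented toy sequence, with a hypothetical gene at positions 4 to 6, may help:

```python
seq = "AAACCCGATTTT"  # toy contig; positions 1..12 in GFF terms
size = 3              # number of upstream bases to extract
start, end = 4, 6     # hypothetical gene at GFF positions 4..6 ("CCC")

# '+' strand: the upstream bases are GFF positions 1..3, i.e. slice [0:3].
upstream_plus = seq[start - 1 - size:start - 1]

# '-' strand: the gene begins at `end`, so its upstream region is GFF
# positions 7..9, i.e. slice [6:9], reverse-complemented to read 5'->3'.
complement = str.maketrans("ACGT", "TGCA")
upstream_minus = seq[end:end + size].translate(complement)[::-1]

print(upstream_plus, upstream_minus)  # AAA ATC
```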

In [4]:
def reverse_complement(seq):
    '''Get the reverse complement of a nucleotide sequence

    Returns the reverse complement of the input string representing a DNA 
    sequence. Works only with DNA sequences consisting solely of  A, C, G, T or N 
    characters. Preserves the case of the input sequence.

    Args:
        seq (str): a DNA sequence string

    Returns:
        (str): The reverse complement of the input DNA sequence string.
        
    Example:
        >>> reverse_complement("ATTAG") #doctest
        'CTAAT'
    '''
    # Translation table mapping each base to its complement; case is preserved.
    complement = str.maketrans('ACGTNacgtn', 'TGCANtgcan')

    # Complement every base, then reverse the string.
    return seq.translate(complement)[::-1]


In [None]:
#sanity checks below:
# dna = "AATTGGCCNNaattggnn"
# reverse_complement(dna)


In [5]:
def get_promoter_seq(seq, gene_start, gene_end, strand, size):
    '''Get the desired sub sequence from genomic coordinates
    
    Args:
        seq (str): nucleotide sequence
        gene_start (int): left-most position of gene CDS
        gene_end (int): right-most position of gene CDS
        strand (str): Whether the entry is on the forward (+), backward (-) strand or N/A (.)
        size (int): the size of the region to return
    
    Returns:
        desired_seq (str): the `size` bases immediately upstream of the gene, read 5'->3' on the gene's own strand
        
    Example:
        >>> get_promoter_seq("ATTATATATATA", 2, 4, '-', 3) #doctest
        'ATA'
        
    '''

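    # GFF coordinates are 1-indexed and inclusive; Python slices are 0-indexed
    # and end-exclusive.
    if strand == "-":
        # On the - strand the gene begins at gene_end, so its upstream region is
        # the `size` bases to the right, reverse-complemented to read 5'->3'.
        desired_seq = reverse_complement(seq[gene_end:(gene_end + size)])
    else:
        # '+' and unstranded ('.') entries: the `size` bases left of gene_start.
        # max() stops a negative index from wrapping around when the gene lies
        # within `size` bases of the start of the sequence.
        upstream_start = max(gene_start - size - 1, 0)
        desired_seq = seq[upstream_start:(gene_start - 1)]

    return desired_seq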



In [None]:
# #sanity checks below:
# get_promoter_seq("ATTATATATATA", 2, 4, '-', 3)

In [None]:
# get_promoter_seq("ATTATATATATA", 5, 8, '+', 3)

In [None]:
# #sanity check below
# gene = "ABCDEFG"
# gene[1:4]

---
## Use pattern matching to locate Shine-Dalgarno sequences

Finally, we will do very basic pattern matching using regular expressions. Below, implement a function to count how many of the upstream regions from our GFF file contain the Shine-Dalgarno sequence (AGGAGGT in DNA). We define the promoters as the 50 bp upstream of all genes. Note: you have the option of using Python's `re` (regular expressions) module here to describe more complex patterns than a simple string.
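To illustrate what the `re` module adds over a plain substring test, here is a minimal sketch on invented upstream regions. The relaxed pattern (AGGAGG followed by any base) is only an assumed example of a more flexible motif, not a definition used in this assignment.

```python
import re

# Hypothetical upstream regions, invented for illustration.
regions = ["TTAGGAGGTGAAACATG", "CCCCAGGAGGAAATATG", "AAAGGAGGTTTTAGGAGGT"]

exact = re.compile("AGGAGGT")         # same result as `"AGGAGGT" in region`
relaxed = re.compile("AGGAGG[ACGT]")  # assumed relaxed motif: any base in the last position

# Count regions with at least one match, so a region with two hits counts once.
exact_hits = sum(1 for region in regions if exact.search(region))
relaxed_hits = sum(1 for region in regions if relaxed.search(region))

print(exact_hits, relaxed_hits)  # 2 3
```

The third region contains the motif twice but contributes one to the count, which matches counting genes rather than motif occurrences.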

In [6]:
def count_Shine_Dalgarno(seq_file="data/GCF_000009045.1_ASM904v1_genomic.fna.gz", gff_file="data/GCF_000009045.1_ASM904v1_genomic.gff.gz", motif="AGGAGGT"):
    '''Function to count the genes with a motif in the 50 bp upstream of their CDS
    
    Args:
        seq_file (str): FASTA genome file
        gff_file (str): GFF format file for gene annotations
        motif (str): motif to search for
    
    Returns:
        count (int): the number of CDS entries whose 50 bp upstream region contains the motif
        
    Example:
        >>> count_Shine_Dalgarno() #doctest
        212
        
    '''
    # Index sequences by contig ID. By convention this is the first word of the
    # FASTA header, which is what the GFF seqid column refers to.
    genome = {(name.split() or [name])[0]: seq for name, seq in get_fasta(seq_file)}

    total = 0
    for entry_gff in get_gff(gff_file):
        # Only coding sequences (CDS) on a contig present in the FASTA file are considered.
        if entry_gff.type != "CDS" or entry_gff.seqid not in genome:
            continue
        # Extract the 50 bp upstream of the CDS, oriented to the gene's strand.
        promoter_region = get_promoter_seq(genome[entry_gff.seqid], entry_gff.start,
                                           entry_gff.end, entry_gff.strand, 50)
        # Count each region at most once, even if the motif occurs several times in it.
        if motif in promoter_region:
            total += 1
    return total



In [7]:
# Doctest to validate function output
import doctest
doctest.testmod()

**********************************************************************
File "__main__", line 13, in __main__.count_Shine_Dalgarno
Failed example:
    count_Shine_Dalgarno() #doctest
Expected:
    212
Got:
    211
**********************************************************************
1 items had failures:
   1 of   1 in __main__.count_Shine_Dalgarno
***Test Failed*** 1 failures.


TestResults(failed=1, attempted=4)