# Consensus motifs

---
# Before Class
In class today we will build position weight matrices (PWMs) and use them to identify binding sites in a sequence.

Prior to class, please do the following:
1. Install `pdf2svg` on your machine:
    1. On linux: `$ sudo apt-get install pdf2svg`
    1. On mac: `$ brew install pdf2svg`
1. Install `seqlogo` in your b529 environment using: `$ pip install seqlogo`
1. For some machines, you will also need ghostscript: `$ conda install ghostscript`
1. Review slides on PWMs in detail
1. Review the structure of a gene and why we reverse complement 
1. Review or read up on Python syntax for:
      1. numpy ndarray operations (https://docs.scipy.org/doc/numpy-1.15.0/user/quickstart.html)
      1. str.maketrans() and str.translate (https://docs.python.org/3/library/stdtypes.html#str.translate)
      1. seqlogo (https://github.com/betteridiot/seqlogo)
      1. functions from previous class


---
## Learning Objectives

1. Build motifs from consensus sequences
1. Plot motifs as a sequence logo
1. Use motifs to score sequences and identify matches

---
## Background

As seen in the lecture, one type of transcriptional regulation is control by transcription factors. These bind specific sequences known as transcription factor binding sites (TFBS). Unlike the exact patterns we matched in the previous class, these sites are degenerate: a factor binds a family of related sequences rather than one specific sequence, so basic pattern matching with regular expressions is not sufficient to describe them.
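
To illustrate the limitation, here is a small sketch using a made-up degenerate pattern and made-up candidate sites (none of these come from the class data). A regular expression can list the tolerated bases at each position, but it only answers yes or no: it cannot say that one matching site is closer to the preferred sequence than another, or that a near miss is almost as good.

```python
import re

# Hypothetical degenerate site: positions 2 and 5 each tolerate two bases
pattern = re.compile(r"T[AG]GA[CT]A")

# Made-up candidate sites for illustration only
candidates = ["TAGACA", "TGGATA", "TAGAAA"]

# A regex gives a binary answer for each candidate, with no ranking
results = {site: pattern.fullmatch(site) is not None for site in candidates}

for site, matched in results.items():
    print(site, matched)
```

The first two candidates match and are treated as equally good, while the third is rejected outright even though it differs at a single position. A PWM replaces this binary decision with a score.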

Our goal today will be to use PWMs to identify binding sites for a specific transcription factor.

We will need to install the seqlogo package to complete today's assignment:

```$ pip install seqlogo```

We also need the PDF-to-SVG converter for the graphics:
1. On Linux: ```$ sudo apt-get install pdf2svg```
1. On Mac: ```$ brew install pdf2svg```

---
## Imports

In [None]:
import numpy as np

# Import functions from the previous class for building the sequence motif & identifying seqs matching the motif
from data_readers import get_fasta
from seq_ops import reverse_complement

---
## Build sequence motif

Given a provided list of sequence k-mers we are going to build a consensus motif. Recall from lecture slides...

$PFM = \begin{bmatrix}
    x_{A1} & x_{A2} & x_{A3} & \dots & x_{An} \\
    x_{C1} & x_{C2} & x_{C3} & \dots & x_{Cn} \\
    x_{G1} & x_{G2} & x_{G3} & \dots & x_{Gn} \\
    x_{T1} & x_{T2} & x_{T3} & \dots & x_{Tn}
\end{bmatrix}$

$PWM_{ij} = \log_{2} \Big(\frac{x_{ij} + p_{i}}{\sum_{k \in \{A,C,G,T\}}x_{kj}+\sum_{k \in \{A,C,G,T\}}p_{k}}\Big) - \log_{2}(b_{i})$

Where, $x_{ij}$ is the number of times nucleotide $i$ is observed at position $j$, $p_{i}$ is the pseudocount or Laplace estimator, and $b_{i}$ is the expected probability ($\textit{a priori}$) of observing nucleotide $i$ overall. In this assignment, we will use a pseudocount of $0.25$ and a uniform background probability of $0.25$.
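
As a worked example of the formula, the sketch below computes the PWM entries for a single motif position. The counts are made up for illustration (10 sequences, mostly A at this position); the pseudocount and background are the values used in this assignment.

```python
import math

# Hypothetical counts at one motif position, built from 10 sequences
counts = {"A": 8, "C": 1, "G": 1, "T": 0}

p = 0.25  # pseudocount added to every nucleotide
b = 0.25  # uniform background probability

# Denominator: column total plus the four pseudocounts (10 + 1 = 11)
total = sum(counts.values()) + 4 * p

weights = {}
for base, x in counts.items():
    prob = (x + p) / total                          # smoothed probability
    weights[base] = math.log2(prob) - math.log2(b)  # log-odds vs. background
    print(base, round(prob, 3), round(weights[base], 2))
```

Note that T still gets a finite (strongly negative) weight even though it was never observed. Without the pseudocount, its probability would be 0 and the logarithm would be undefined.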

Using your code developed in the previous class and the FASTA file in data/class4.fa write a program to read in the raw sequence reads, convert these into a Position Frequency Matrix, and then convert this into a PWM. 

In [None]:
def build_pfm(sequences, length):
    """Function to build a PFM using entries from a fasta file

    Args:
        sequences (list): list of sequence strings
        length (int): size of pfm we are building

    Returns:
        pfm (numpy array): dimensions are 4xlength
        
    Pseudocode:
        Initialize 4xlength numpy array as pfm (standard row order is alphabetical: A, C, G, T)
        for each sequence:
            for j in 1 to sequence length:
                increment pfm[base, j]
    """
    
    # Initialize an empty numpy array
    pfm = np.zeros((4,length), dtype=int)

    # Add base-wise counts to the numpy array to build PFM matrix
    for seq in sequences:

        # Counter will track our position along the sequence (j)
        counter = 0

        # For each sequence we count which bases we see at each position (j)
        for char in list(seq):

            # An if/switch statement would work here, but this is an ASCII trick
            #  that converts A, C, G, T to integers 0, 1, 3, 2 respectively, based
            #  on their binary representations. It is included for those interested.
            #  A more standard way in Python would be to use:
            #  dna_to_index = str.maketrans('ACGT', '0123')
            #  for char in seq.translate(dna_to_index): ...
            pfm[((ord(char) >> 1 & 3), counter)] += 1

            # Increment counter along the sequence
            counter+=1

    # Because our trick above puts G in row 3 and T in row 2,
    #  we swap those two rows so the final order is A, C, G, T
    pfm[[2, 3]] = pfm[[3, 2]]

    # return PFM
    return pfm
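
The bit trick used in `build_pfm` can be checked directly. This short sketch prints the ASCII code of each base in binary together with the index the expression produces, and then shows the `str.maketrans` alternative mentioned in the comments. The example string `GATTACA` is arbitrary.

```python
# Shift the ASCII code right by one bit, then keep the lowest two bits
trick = {}
for base in "ACGT":
    code = ord(base)
    trick[base] = (code >> 1) & 3
    print(base, format(code, "08b"), trick[base])

# The translation-table alternative maps bases straight to row order A, C, G, T
dna_to_index = str.maketrans("ACGT", "0123")
indices = [int(c) for c in "GATTACA".translate(dna_to_index)]
print(indices)

# Caution: the bit trick does not reject other characters
n_index = (ord("N") >> 1) & 3
print("N maps to", n_index)
```

One design consequence: with the bit trick, an ambiguous base such as `N` is silently counted in one of the four rows, whereas the translation-table version would raise an error when `int()` meets an untranslated character.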

In [None]:
def build_pwm(pfm):
    """Function to build a PWM from a PFM

    Args:
        pfm (numpy array): dimensions are 4xlength

    Returns:
        pwm (numpy array): dimensions are 4xlength
        
    Pseudocode:
        Initialize 4xlength numpy array as pwm (standard row order is alphabetical: A, C, G, T)
        Calculate column sums as sums
        for i in A,C,G,T:
            for j in 1 to pfm length:
                pwm[i, j] = log2( (pfm[i,j] + p) / (sums[j] + p*4) ) - log2(background probability)
    """

    # Initialize an empty numpy array
    pwm = np.zeros((4,len(pfm[0])), dtype=float)
    
    # Calculate the sums of each column (Xij)
    sums = np.sum(pfm, axis = 0)
    
    # For simplicity, pseudocounts and background are set to .25
    p = .25
    bg = .25
    
    # For each position in the matrix, apply the PWM formula
    for i in range(4):
        for j in range(len(pwm[0])):
            pwm[i,j] = np.log2((pfm[i,j] + p) / (sums[j]+p*4)) - np.log2(bg)

    # Return pwm
    return pwm
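
The nested loops in `build_pwm` can also be written with NumPy broadcasting. The sketch below is one possible equivalent, not a required solution; `build_pwm_vectorized` and the toy PFM are hypothetical names and data introduced here for illustration.

```python
import numpy as np

def build_pwm_vectorized(pfm, pseudocount=0.25, background=0.25):
    """Hypothetical vectorized version of the nested-loop PWM calculation."""
    pfm = np.asarray(pfm, dtype=float)

    # Column sums have shape (length,), so they broadcast across the 4 rows
    col_sums = pfm.sum(axis=0)

    # Smoothed probability of each base at each position
    probs = (pfm + pseudocount) / (col_sums + 4 * pseudocount)

    # Log-odds relative to the uniform background
    return np.log2(probs) - np.log2(background)

# Toy 4x2 PFM (rows A, C, G, T), made up for illustration
toy_pfm = np.array([[8, 0],
                    [1, 5],
                    [1, 5],
                    [0, 0]])
toy_pwm = build_pwm_vectorized(toy_pfm)
print(np.round(toy_pwm, 2))
```

Both versions apply the same formula to every cell; the vectorized form simply lets NumPy handle the iteration.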

In [None]:
# File containing the list of sequences for our PWM
file_name = 'data/class3.fa'

# Read sequences from file into list
sequences = []
for name, seq in get_fasta(file_name):
    sequences.append(seq)

# Build PFM
pfm = build_pfm(sequences, 20)
print(pfm)

# Build PWM
pwm = build_pwm(pfm)
print(pwm)

---
## Plot motif as logo

A typical way to display PWMs is a sequence logo. This is a plot displaying the information content at each base position in the motif that we just generated. An example is below:
<center><img src='./figures/demoLogo.png' width='600px'/></center>

See documentation for this package here: https://github.com/betteridiot/seqlogo
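
To make "information content" concrete, here is a minimal sketch that computes it for a single column of counts, assuming a uniform background and no small-sample correction. `column_information` is a hypothetical helper, not part of seqlogo, and seqlogo's own calculation may differ in detail.

```python
import math

def column_information(counts):
    """Information content in bits for one column of A, C, G, T counts.

    Assumes a uniform background: 2 bits minus the Shannon entropy.
    """
    total = sum(counts)
    entropy = 0.0
    for x in counts:
        if x > 0:  # a base that never occurs contributes nothing
            freq = x / total
            entropy -= freq * math.log2(freq)
    return 2.0 - entropy

conserved = column_information([10, 0, 0, 0])  # one base only
two_way = column_information([5, 5, 0, 0])     # two bases equally likely
uniform = column_information([5, 5, 5, 5])     # no preference at all

print(conserved, two_way, uniform)
```

A fully conserved position carries the maximum of 2 bits, and a position with no base preference carries 0 bits, which is why conserved positions appear as tall stacks in a logo.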

In [None]:
import seqlogo

# Use the seqlogo package to plot the PFM built above.
# You will not need to edit anything here.
# seqlogo expects the transpose of our pfm matrix (one row per position);
# on a numpy array this is available as the attribute T: pfm.T

seqlogo.seqlogo(seqlogo.CompletePm(pfm = pfm.T))


---
## Identify sequence matches to motif

We are now going to scan the promoter regions (identified from last class) for matches to our motif. As we scan, consider each k-length sequence substring in both forward and reverse orientation for the entirety of each promoter.
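
The scanning step can be sketched as a sliding window that yields each k-mer together with its reverse complement. This is only an illustration on a toy sequence: `revcomp` is a hypothetical stand-in for the `reverse_complement` function from `seq_ops`, and `windows` is a helper name introduced here.

```python
# Hypothetical stand-in for reverse_complement from seq_ops
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(seq):
    """Complement each base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]

def windows(seq, k):
    """Yield (start, forward k-mer, reverse-complement k-mer) for each window."""
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        yield i, kmer, revcomp(kmer)

# Toy sequence and window size for illustration
hits = list(windows("ACGTTG", 4))
for start, fwd, rev in hits:
    print(start, fwd, rev)
```

A sequence of length n contains n - k + 1 windows of length k, and each window is considered twice, once per strand. The start position reported is always the left-most base on the forward strand.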

To score a k-mer, we sum the PWM entries that match its base at each position (the bold entries below). This works because each entry is a log-likelihood ratio for observing that base at that position relative to background, so adding the logs is equivalent to multiplying the per-position ratios:

k-mer: ACTAG

| Base | 1 | 2 | 3 | 4 | 5 |
|:---:|:---:|:---:|:---:|:---:|:---:|
| A | **0.26** | 1.26 | -1.32 | **1.49** | -0.32 |
| C | -0.32 | **-0.32** | -1.32 | -1.32 | -1.32 |
| G | -1.32 | -1.32 | 1.49 | -1.32 | **1.0** |
| T | 0.68 | -1.32 | **-1.32** | -1.32 | -1.32 |
| Matched entry | 0.26 | -0.32 | -1.32 | 1.49 | 1.0 |

Score = 0.26 + (-0.32) + (-1.32) + 1.49 + 1.0 = 1.11
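
The same arithmetic can be reproduced in a few lines. This sketch hard-codes the example PWM from the table above as a dictionary of rows (an illustrative layout, not the numpy layout used in the rest of the notebook) and picks out one entry per position.

```python
# Example PWM from the table; each list holds the values for positions 1-5
pwm_rows = {
    "A": [0.26, 1.26, -1.32, 1.49, -0.32],
    "C": [-0.32, -0.32, -1.32, -1.32, -1.32],
    "G": [-1.32, -1.32, 1.49, -1.32, 1.0],
    "T": [0.68, -1.32, -1.32, -1.32, -1.32],
}

kmer = "ACTAG"

# For each position j, take the entry in the row of the observed base
terms = [pwm_rows[base][j] for j, base in enumerate(kmer)]
score = sum(terms)

print(terms)
print(round(score, 2))
```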

In [None]:
# Write functions to score a string with the PWM generated above.
# Expected input: DNA string
# Expected output: highest score and location of the best PWM match in the sequence
# Note: Remember to also scan the reverse complement of the sequence!
def score_kmer(seq, pwm):
    """Function to score a kmer with a pwm
        kmer length is expected to be the same as pwm length

    Args:
        seq(str): kmer to score
        pwm (numpy array): pwm for scoring

    Returns:
        score (float): PWM score for kmer
        
    Pseudocode:
        score = 0
        for j in seq length
            score = score + pwm[seq[j], j]
    """
    
    # Initialize score to 0
    score = 0
    
    if len(seq) != len(pwm[0]):
        raise ValueError('K-mer and PWM are different lengths!')
    
    # Translator for DNA to numeric indices
    dna_to_index = str.maketrans('ACGT', '0123')
    
    # Iterate across kmer and sum log likelihoods
    for j, val in enumerate(list(seq.translate(dna_to_index)), 0):
        score += pwm[int(val), j]

    # Return score
    return score

def score_sequence(seq, pwm):
    """Function to score a long sequence with a pwm
        This will scan sequence and score all 
        subsequences of length k with a pwm
        and return the maximum score

    Args:
        seq (str): nmer to score
        pwm (numpy array): pwm for scoring

    Returns:
        score (float): PWM score for nmer
        position (int): 0-based index of the best match location
         Note: for negative strand still report left-most base position
        strand (int): 0 for positive strand, 1 for negative strand
        
    Pseudocode:
        max_score = -infinity
        for each pwm-length kmer starting at i in seq:
            if score(kmer) > max_score:
                keep score, i, strand 0
            if score(reverse complement of kmer) > max_score:
                keep score, i, strand 1

    """

    # Initialize the best score to -infinity so that any real score replaces it
    max_score = -np.inf
    max_index = None
    max_strand = None

    # Get PWM length
    pwm_len = len(pwm[0])

    # Make sure that our full sequence is at least the PWM length
    if len(seq) < pwm_len:
        raise ValueError('N-mer shorter than PWM!')

    # Slide a pwm_len window along the sequence and keep the best-scoring hit
    # Note: the comparison is strict, so only the first max hit will be recorded
    for i in range(len(seq) - pwm_len + 1):
        kmer = seq[i:i + pwm_len]

        # Score the subsequence starting at i on the positive strand
        fwd_score = score_kmer(kmer, pwm)
        if fwd_score > max_score:
            max_score, max_index, max_strand = fwd_score, i, 0

        # Also test the reverse complement (negative strand)
        rev_score = score_kmer(reverse_complement(kmer), pwm)
        if rev_score > max_score:
            max_score, max_index, max_strand = rev_score, i, 1

    # Return maximum score, index, and strand
    return max_score, max_index, max_strand

In [None]:
# Testing your functions. The following should output (7.477562910794718, 6, 0)
print(score_sequence('TAGAGAACAACCAAAAGAGGGGACAAGGGTATA', pwm))


In [None]:
# Now apply this to the code from last class to score the promoters
# You will need to extract all of the promoter regions (as done in the previous class)
# and then score these regions using your score_sequence() function above. Please output (print)
# the sequence, score of the best hit, position of the best hit, and the strand of the
# best hit.

# Import all of our class 2 functions
from data_readers import *
from seq_ops import *

seq_file = "../class_02/data/GCF_000009045.1_ASM904v1_genomic.fna.gz"
gff_file = "../class_02/data/GCF_000009045.1_ASM904v1_genomic.gff.gz"

# From class 2, loop through the data to extract promoters
for name, seq in get_fasta(seq_file):
    for gff_entry in get_gff(gff_file):
        if gff_entry.type == 'CDS':
            promoter_seq = get_seq(seq, gff_entry.start, gff_entry.end, gff_entry.strand, 50)

            # Now score each promoter sequence with our pwm
            score, index, strand = score_sequence(promoter_seq, pwm)
            print(promoter_seq, score, index, strand)

