In order to define a motif scoring function taking this into account, first note that every column of Profile(Motifs) corresponds to a probability distribution, or a collection of nonnegative numbers that sum to 1. For example, the second column in the figure in the first step corresponds to the probabilities 0.2, 0.6, 0.0, and 0.2 for A, C, G, and T, respectively.

Entropy is a measure of the uncertainty of a probability distribution $(p_1, \ldots, p_N)$, and is defined as

$$H(p_1, \ldots, p_N) = -\sum_{i=1}^{N} p_i \cdot \log_2 p_i$$
For example, the entropy of the probability distribution (0.2, 0.6, 0.0, 0.2) corresponding to the second column of the profile matrix in the figure in the first step is

$$-(0.2 \log_2 0.2 + 0.6 \log_2 0.6 + 0.0 \log_2 0.0 + 0.2 \log_2 0.2) \approx 1.371$$
whereas the entropy of the more conserved final column (0.0, 0.6, 0.0, 0.4) is

$$-(0.0 \log_2 0.0 + 0.6 \log_2 0.6 + 0.0 \log_2 0.0 + 0.4 \log_2 0.4) \approx 0.971$$
and the entropy of the highly conserved fifth column (0.0, 0.0, 0.9, 0.1) is

$$-(0.0 \log_2 0.0 + 0.0 \log_2 0.0 + 0.9 \log_2 0.9 + 0.1 \log_2 0.1) \approx 0.469$$
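Note that, technically, $\log_2 0$ is not defined. In the computation of entropy, however, we assume that $0 \cdot \log_2 0$ is equal to 0, which is consistent with the fact that $p \cdot \log_2 p$ approaches 0 as $p$ approaches 0.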
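As a minimal sketch of the definition above, the following helper computes the entropy of a single probability distribution, skipping zero entries to apply the convention that 0 · log2(0) is 0. The name `column_entropy` is hypothetical and is not part of the notebook's own code.

```python
import math

def column_entropy(probabilities):
    """Entropy, in bits, of one probability distribution (hypothetical helper)."""
    # Zero probabilities are skipped: by convention 0 * log2(0) counts as 0.
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

# The three profile columns discussed in the text.
print(round(column_entropy([0.2, 0.6, 0.0, 0.2]), 3))  # second column
print(round(column_entropy([0.0, 0.6, 0.0, 0.4]), 3))  # final column
print(round(column_entropy([0.0, 0.0, 0.9, 0.1]), 3))  # fifth column
```

Running this is a quick way to check the three hand computations: the more conserved a column is, the smaller the value it prints.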

STOP and Think: What are the maximum and minimum possible values for the entropy of a probability distribution containing four values?

In [2]:
# Input:  A list of kmers Motifs
# Output: the count matrix of Motifs, as a dictionary of lists.
def Count(Motifs):
    count = {}  # maps each nucleotide to a list of per-column counts
    # Initialize each nucleotide with an empty list.
    for nucleotide in ["A", "C", "G", "T"]:
        count[nucleotide] = []
    for ind in range(len(Motifs[0])):
        # Start every nucleotide's count for this column at 0.
        for nucleotide in ["A", "C", "G", "T"]:
            count[nucleotide].append(0)
        # Tally the nucleotide that each motif has in this column.
        for motif in range(len(Motifs)):
            count[Motifs[motif][ind]][ind] += 1
    return count

# Input:  A list of kmers Motifs
# Output: the profile matrix of Motifs, as a dictionary of lists.
def Profile(Motifs):
    t = len(Motifs)  # number of motifs
    profile = {}
    counts = Count(Motifs)
    for nucleotide in ["A", "C", "G", "T"]:
        # Divide each count by the total number of motifs to get a frequency.
        profile[nucleotide] = [count / float(t) for count in counts[nucleotide]]
    return profile


In [3]:
Motifs = ["TCGGGGGTTTTT",
"CCGGTGACTTAC",
"ACGGGGATTTTC",
"TTGGGGACTTTT",
"AAGGGGACTTCC",
"TTGGGGACTTCC",
"TCGGGGATTCAT",
"TCGGGGATTCCT",
"TAGGGGAACTAC",
"TCGGGTATAACC"]

prof = Profile(Motifs)

In [8]:
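import math

# Input:  A profile matrix, as a dictionary of lists
# Output: the total entropy of the profile, summed over all columns
def Entropy(profile):
    H = 0.0
    n_entries = len(profile["A"])  # number of columns in the profile
    for n in range(n_entries):
        temp = 0.0  # running sum of p * log2(p) for column n
        for key in profile.keys():
            p = profile[key][n]
            if p > 0.0:  # skip zeros: 0 * log2(0) is taken to be 0
                temp += p * math.log(p, 2)
        H += temp
    return -H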

Entropy(prof)

9.916290005356972
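
The total above hides which positions contribute most of the uncertainty. As a rough sketch, the block below breaks the same motif matrix down into per-column entropies. It is a self-contained alternative that uses `collections.Counter` instead of the notebook's `Count` and `Profile` functions, and the names `column_entropies` and `per_column` are hypothetical.

```python
import math
from collections import Counter

def column_entropies(motifs):
    """Return a list with the entropy, in bits, of each motif column."""
    t = len(motifs)
    entropies = []
    for column in zip(*motifs):  # zip(*motifs) yields one tuple per column
        counts = Counter(column)
        # Only observed nucleotides appear in counts, so no probability is 0.
        # p * log2(1/p) is the same quantity as -p * log2(p).
        h = sum((c / t) * math.log2(t / c) for c in counts.values())
        entropies.append(h)
    return entropies

motifs = ["TCGGGGGTTTTT", "CCGGTGACTTAC", "ACGGGGATTTTC", "TTGGGGACTTTT",
          "AAGGGGACTTCC", "TTGGGGACTTCC", "TCGGGGATTCAT", "TCGGGGATTCCT",
          "TAGGGGAACTAC", "TCGGGTATAACC"]

per_column = column_entropies(motifs)
for position, h in enumerate(per_column, start=1):
    print(position, round(h, 3))
print("total:", round(sum(per_column), 3))
```

Fully conserved columns, such as the third and fourth here, contribute an entropy of 0, and the per-column values add up to the total reported by `Entropy(prof)`.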