# Matching Random Motifs

## Background Info

* Promoter: A region of DNA that initiates the transcription of a gene. It is usually located shortly before the start of its gene and contains specific intervals of DNA that provide an initial binding site for RNA polymerase. Finding a promoter is usually the second step in gene prediction, after establishing the presence of an ORF (Open Reading Frame).

There's no quick rule for identifying promoters, as they vary by species and may include additional intervals that bind specific proteins or change the intensity of transcription. Eukaryotic promoters are generally hard to characterize; most have a TATA box, preceded by an interval called a B recognition element (BRE), typically located within 40 bp of the start of transcription, and can hold additional "regulatory" intervals.


## Aim of problem

To determine the probability with which a given motif (e.g., a promoter) occurs in a randomly constructed genome.
* Hint: For any event $A$, $P(A) + P(A^c) = 1$, where $A^c$ is the complement of $A$.

## Problem

* **Given**: A positive integer $N \leq 100000$, a number $x$ between 0 and 1, and a DNA string $s$ of length at most 10 bp.
* **Return**: The probability that if $N$ random DNA strings having the same length as $s$ are constructed with GC-content $x$, then at least one of the strings equals $s$. The same random string may be created more than once.

## Solution Explanation

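We want $P(\text{at least 1 s in N sequences})$. Using the hint given in the Aim section, we can work with the complement instead. Because the $N$ strings are generated independently, the probability that none of them equals $s$ is the single-string probability raised to the power $N$: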
<br><br>
\begin{equation}
\begin{aligned}
P(\text{at least 1 s in N sequences}) &= 1 - P(\text{no s in N sequences}), \text{ where} \\
P(\text{no s in N sequences}) &= P(\text{no s})^N, \text{ where} \\
P(\text{no s}) &= 1 - P(s)
\end{aligned}
\end{equation}

Based on this, we can conclude that<br>
\begin{equation}
P(\text{at least 1 s in N sequences}) = 1 - (1 - P(s))^N
\end{equation}
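
When $P(s)$ is very small, evaluating $(1 - P(s))^N$ directly can lose a little floating-point precision. A minimal sketch of a more careful evaluation of the closed form above, assuming a hypothetical helper name `prob_at_least_one` (not part of the original notebook) that takes $P(s)$ as an argument:

```python
import math

def prob_at_least_one(p_s, n):
    """Return 1 - (1 - p_s)**n for a per-string match probability p_s.

    Hypothetical helper for illustration; p_s is P(s) and n is N.
    """
    if not 0.0 <= p_s <= 1.0:
        raise ValueError("p_s must be a probability between 0 and 1")
    if p_s == 1.0:
        # log1p(-1) is undefined; a certain match gives probability 1.
        return 1.0
    # (1 - p)^n = exp(n * log(1 - p)); log1p and expm1 stay accurate
    # when p is tiny and the result is close to 0.
    return -math.expm1(n * math.log1p(-p_s))

# Example with an illustrative value of P(s) and N
print(prob_at_least_one(1.296e-5, 90000))
```

For the input sizes in this problem the direct expression gives essentially the same result, so this variant is optional.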

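Now, let's calculate $P(s)$. With GC-content $x$, each base is drawn independently: G and C each occur with probability $x/2$, and A and T each occur with probability $(1 - x)/2$. $P(s)$ is the product of these per-base probabilities, which the function below computes.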

In [1]:
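def prob_s(s, x):
    """Return the probability that a random DNA string with GC-content x equals s."""
    p = 1.0
    p_gc = x / 2        # probability of G, and likewise of C
    p_at = (1 - x) / 2  # probability of A, and likewise of T
    for base in s:
        if base in 'AT':
            p *= p_at
        else:
            # any other symbol is treated as G or C
            p *= p_gc
    return p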

Now that we've implemented the function to calculate $P(s)$, we can obtain $P(\text{at least 1 s in N sequences})$. Let's check whether we get the correct answer using the example values given in Rosalind.
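
As a hand check, the sample values are $x = 0.6$, $s = \text{ATAGCCGA}$ and $N = 90000$. The string contains four A/T bases and four G/C bases, so a worked sketch of the arithmetic (assuming independent bases, as in the problem statement) is:

\begin{equation}
P(s) = \left(\frac{1 - x}{2}\right)^{4} \left(\frac{x}{2}\right)^{4} = 0.2^{4} \cdot 0.3^{4} = 1.296 \times 10^{-5}
\end{equation}

\begin{equation}
P(\text{at least 1 s in N sequences}) = 1 - \left(1 - 1.296 \times 10^{-5}\right)^{90000} \approx 0.689
\end{equation}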

In [6]:
x = 0.6
s = 'ATAGCCGA'
N = 90000

In [8]:
print(1 - (1 - prob_s(s, x))**N)

0.6885160784606543


We get 0.689, which is the correct answer!
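
The closed form can also be sanity-checked by simulation. The sketch below is not part of the Rosalind solution; the helper names `random_dna` and `simulate_match_rate`, the fixed seed, and the small demo inputs are all assumptions chosen so the loop runs quickly. It draws random strings with the stated GC-content and counts how often at least one of them equals the motif.

```python
import random

def random_dna(length, x, rng):
    """Draw one DNA string whose bases are independent with GC-content x."""
    # Weights follow the order A, C, G, T:
    # A and T each have probability (1 - x)/2; C and G each have x/2.
    weights = [(1 - x) / 2, x / 2, x / 2, (1 - x) / 2]
    return "".join(rng.choices("ACGT", weights=weights, k=length))

def simulate_match_rate(s, x, n, trials, seed=0):
    """Estimate P(at least one of n random strings equals s) by simulation."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    hits = 0
    for _ in range(trials):
        if any(random_dna(len(s), x, rng) == s for _ in range(n)):
            hits += 1
    return hits / trials

# Small illustrative inputs (not the Rosalind dataset).
s_demo, x_demo, n_demo = "AT", 0.5, 10
# With x = 0.5 every base has probability 0.25, so P(s) = 0.25 * 0.25.
exact = 1 - (1 - 0.25 * 0.25) ** n_demo
estimate = simulate_match_rate(s_demo, x_demo, n_demo, trials=20000)
print(exact, estimate)
```

The two printed values should agree to roughly two decimal places; the estimate fluctuates with the seed and the number of trials.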

## Actual Dataset

In [10]:
x = 0.534174
s = 'ACTACCGT'
N = 92550

In [11]:
print(1 - (1 - prob_s(s, x))**N)

0.7499282838670328


## Problem solved!