Dan Shea  
2021-06-15  

#### Problem
Our aim in this problem is to determine the probability with which a given motif (a known promoter, say) occurs in a randomly constructed genome. Unfortunately, finding this probability is tricky; instead of forming a long genome, we will form a large collection of smaller random strings having the same length as the motif; these smaller strings represent the genome's substrings, which we can then test against our motif.

Given a probabilistic event $A$, the complement of $A$ is the collection $A^c$ of outcomes not belonging to $A$. Because $A^c$ takes place precisely when $A$ does not, we may also call $A^c$ "not A."

For a simple example, if $A$ is the event that a rolled die is 2 or 4, then $P(A)=\frac{1}{3}$. $A^c$ is the event that the die is 1, 3, 5, or 6, and $P(A^c)=\frac{2}{3}$. In general, for any event we will have the identity that $P(A)+P(A^c)=1$.
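The die example can be checked by simple counting. The following is a minimal sketch that assumes a fair six-sided die; the names `outcomes`, `event_a`, and `complement` are illustrative and not part of the problem statement.

```python
from fractions import Fraction

# Sample space of a fair six-sided die (assumed fair for this illustration)
outcomes = {1, 2, 3, 4, 5, 6}

# Event A: the die shows 2 or 4
event_a = {2, 4}

# The complement of A is every outcome not in A
complement = outcomes - event_a

p_a = Fraction(len(event_a), len(outcomes))
p_not_a = Fraction(len(complement), len(outcomes))

# The two probabilities always sum to 1
print(p_a, p_not_a, p_a + p_not_a)  # 1/3 2/3 1
```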

__Given:__ A positive integer $N \leq 100000$, a number $x$ between $0$ and $1$, and a DNA string $s$ of length at most 10 bp.

__Return:__ The probability that if $N$ random DNA strings having the same length as $s$ are constructed with GC-content $x$ (see “Introduction to Random Strings”), then at least one of the strings equals $s$. We allow for the same random string to be created more than once.

##### Sample Dataset
```
90000 0.6
ATAGCCGA
```
##### Sample Output
```
0.689
```

In [1]:
def parse_file(filename):
    with open(filename, 'r') as fh:
        N, gc_content = next(fh).strip().split(' ')
        N = int(N)
        gc_content = float(gc_content)
        kmer = next(fh).strip()
        return (N, gc_content, kmer)

In [2]:
parse_file('sample.txt')

(90000, 0.6, 'ATAGCCGA')

The binomial probability of exactly $k$ successes in $n$ independent trials, each with two possible outcomes and success probability $p$ (commonly called a binomial experiment), is
$$\binom{n}{k}\cdot p^k \cdot (1-p)^{n-k}$$
Here each of the $N$ random strings is one trial, and $p$ is the probability that a single random string equals $s$.  
We first compute $P(0)$, the probability that none of the $N$ strings equals $s$; by the complement rule, the probability of at least one success is then $1-P(0)$.  
With zero successes, the binomial coefficient and the $p^0$ factor both reduce to 1, leaving $P(0)=(1-p)^N$.  
We then return $1-(1-p)^N$ as the probability of at least one success.
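
As a rough check against the sample dataset (arithmetic rounded; this assumes G and C each occur with probability $x/2$ and A and T each with probability $(1-x)/2$): `ATAGCCGA` contains four A/T bases and four G/C bases, so with $x=0.6$ and $N=90000$
$$p = 0.2^4 \cdot 0.3^4 = 1.296\times10^{-5}$$
$$1-(1-p)^{N} = 1-\left(1-1.296\times10^{-5}\right)^{90000} \approx 1-0.31148 \approx 0.689$$
which agrees with the sample output.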

In [3]:
def compute_prob(N, gc_content, kmer):
    GCprob = gc_content / 2.0
    ATprob = (1 - gc_content) / 2.0
    kmer_prob = 1
    for k in kmer:
        if k in ['A','T']:
            kmer_prob *= ATprob
        else:
            kmer_prob *= GCprob
    # (1 - kmer_prob)**N is the binomial probability of no successes;
    # its complement is the probability of at least one match
    print(f'{1-(1-kmer_prob)**N:0.3f}')

In [4]:
compute_prob(*parse_file('sample.txt'))

0.689
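
When the per-string probability is very small, `(1 - p)**N` can lose precision because `1 - p` rounds toward 1. The following is a hedged sketch of an alternative that returns the value instead of printing it and evaluates $1-(1-p)^N$ as $-\mathrm{expm1}(N \cdot \mathrm{log1p}(-p))$. The function names `motif_probability` and `prob_at_least_one` are illustrative and do not appear in the notebook above; for inputs within the problem's limits the direct formula gives the same rounded answer.

```python
import math


def motif_probability(gc_content, kmer):
    """Probability that one random string with the given GC-content equals kmer."""
    gc_prob = gc_content / 2.0          # probability of G, and of C
    at_prob = (1.0 - gc_content) / 2.0  # probability of A, and of T
    p = 1.0
    for base in kmer:
        p *= gc_prob if base in "GC" else at_prob
    return p


def prob_at_least_one(n, gc_content, kmer):
    """Probability that at least one of n random strings equals kmer."""
    p = motif_probability(gc_content, kmer)
    if p >= 1.0:
        # Guard: log1p(-1) is undefined; a certain match needs no computation
        return 1.0
    # 1 - (1 - p)**n, written as -expm1(n * log1p(-p)) for small p
    return -math.expm1(n * math.log1p(-p))


print(f"{prob_at_least_one(90000, 0.6, 'ATAGCCGA'):0.3f}")
```

Returning the number rather than printing it also makes the result easier to reuse or test.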


In [5]:
compute_prob(*parse_file('rosalind_rstr.txt'))

0.295
