## Theory

# Analysis of DNA Sequences with Markov Models

## 1. Markov Models for DNA Sequences

A **Markov model** is a probabilistic framework for analyzing sequences in which the probability of each state depends only on a fixed number of preceding states. In DNA sequence analysis, a **first-order Markov model** assumes that the probability of each nucleotide depends only on the immediately preceding nucleotide.

- **States**: The four DNA nucleotides — A, C, G, and T.
- **First-Order Property**:

  $$
  P(X_{t+1} = b \mid X_t = a, X_{t-1}, \ldots) = P(X_{t+1} = b \mid X_t = a)
  $$

- **Purpose**: This model captures local dependencies in DNA, such as patterns in protein-binding sites.
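One consequence of the first-order property is that the probability of a whole sequence factorizes into an initial term and a product of one-step transitions. The sketch below illustrates this under stated assumptions: the function name `sequence_log_prob`, the uniform initial distribution, and the numbers in `toy_transitions` are all hypothetical and are not estimated from any data in this notebook.

```python
import math

# Hypothetical transition table P(next | current); each row sums to 1.
# The numbers are illustrative only.
toy_transitions = {
    "A": {"A": 0.10, "C": 0.20, "G": 0.60, "T": 0.10},
    "C": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    "G": {"A": 0.30, "C": 0.30, "G": 0.10, "T": 0.30},
    "T": {"A": 0.70, "C": 0.10, "G": 0.10, "T": 0.10},
}


def sequence_log_prob(seq, transitions, initial=None):
    """Log-probability of `seq` under a first-order Markov chain."""
    if not seq:
        raise ValueError("sequence must contain at least one nucleotide")
    if initial is None:
        # Assumption: every base is equally likely at the first position.
        initial = {base: 0.25 for base in "ACGT"}
    log_p = math.log(initial[seq[0]])
    # Each later base contributes only P(next | current).
    for curr, nxt in zip(seq, seq[1:]):
        p = transitions[curr][nxt]
        if p == 0:
            return float("-inf")  # an impossible transition
        log_p += math.log(p)
    return log_p


# P(AGT) = P(A) * P(G | A) * P(T | G)
log_p = sequence_log_prob("AGT", toy_transitions)
print(math.exp(log_p))
```

Working in log space avoids numerical underflow when the same product is taken over long sequences.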

---

## 2. Transition Matrix: Definition and Construction

The transition matrix is a 4×4 table whose entries give the probability of moving from one nucleotide to the next:

- **Rows**: Current nucleotide \( a \in \{A, C, G, T\} \)
- **Columns**: Next nucleotide \( b \in \{A, C, G, T\} \)
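- **Entry**: The conditional probability that \( b \) follows \( a \), estimated from \( \text{Count}(ab) \), the number of times the dinucleotide \( ab \) occurs in the sequence: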

  $$
  P(b \mid a) = \frac{\text{Count}(ab)}{\sum_{y \in \{A, C, G, T\}} \text{Count}(ay)}
  $$

Each row is normalized so that its entries sum to 1:

$$
\sum_{b \in \{A,C,G,T\}} P(b \mid a) = 1
$$

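Each row is therefore a probability distribution over the next nucleotide, given the current one. A row is undefined when its nucleotide never has a successor in the sequence, because the denominator is then zero.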
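The estimate can be worked through on a short sequence. The following is a minimal sketch using only the standard library; the names `example_seq`, `pair_counts`, and `transition_prob` are introduced here for illustration, and returning 0 for a base with no successors is an assumption about how to treat the undefined case.

```python
from collections import Counter

example_seq = "AGACGTAGCT"  # the example sequence used later in this notebook
bases = "ACGT"

# Count every overlapping pair of adjacent nucleotides.
pair_counts = Counter(zip(example_seq, example_seq[1:]))


def transition_prob(a, b):
    """Estimate P(b | a) = Count(ab) / sum over y of Count(ay)."""
    row_total = sum(pair_counts[(a, y)] for y in bases)
    if row_total == 0:
        return 0.0  # assumption: report 0 when `a` is never followed by a base
    return pair_counts[(a, b)] / row_total


# A is followed by G twice and by C once, so P(G | A) = 2/3.
print(pair_counts[("A", "G")], pair_counts[("A", "C")])
print(transition_prob("A", "G"))
```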

---

## 3. Comparison with Dinucleotide Frequency Models

A **dinucleotide frequency model** considers the overall frequency of each dinucleotide:

$$
\text{Frequency}(ab) = \frac{\text{Count}(ab)}{\sum_{x,y \in \{A, C, G, T\}} \text{Count}(xy)}
$$

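In contrast, the **Markov model** focuses on the *conditional probability* of \( b \) given \( a \): each count is divided by the number of times \( a \) is followed by any base, not by the total number of dinucleotides. This makes the dependence on the previous base explicit.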

| Feature                             | Dinucleotide Frequency Model        | Markov Model (First-Order)                |
|-------------------------------------|-------------------------------------|-------------------------------------------|
| Quantity estimated                  | Joint frequency of the pair \( ab \) | Conditional probability \( P(b \mid a) \) |
| Normalization                       | Global (all 16 entries sum to 1)    | Row-wise (each row sums to 1)             |
| Models dependence on previous base? | Not explicitly                      | Yes                                       |
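The two normalizations can be compared directly on the same count table. The sketch below assumes NumPy is available and hard-codes the adjacent-pair counts for the sequence `AGACGTAGCT`; the variable names are illustrative. It also shows that row-normalizing the global frequencies gives the same transition matrix as row-normalizing the raw counts, since the global scale factor cancels.

```python
import numpy as np

# Adjacent-pair counts for "AGACGTAGCT".
# Rows: first base, columns: second base, both in the order A, C, G, T.
counts = np.array([
    [0, 1, 2, 0],
    [0, 0, 1, 1],
    [1, 1, 0, 1],
    [1, 0, 0, 0],
], dtype=float)

# Dinucleotide frequency model: divide by the total number of pairs.
dinuc_freq = counts / counts.sum()

# First-order Markov model: divide each row by its own total.
row_totals = counts.sum(axis=1, keepdims=True)
transition = np.divide(
    counts, row_totals, out=np.zeros_like(counts), where=row_totals > 0
)

# Row-normalizing the frequencies recovers the same transition matrix.
freq_totals = dinuc_freq.sum(axis=1, keepdims=True)
from_freq = np.divide(
    dinuc_freq, freq_totals, out=np.zeros_like(dinuc_freq), where=freq_totals > 0
)

print(dinuc_freq.sum())        # all entries together sum to 1
print(transition.sum(axis=1))  # each row sums to 1
print(np.allclose(transition, from_freq))
```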

---

In [3]:
import numpy as np
import pandas as pd

In [4]:
sequence = input("Enter DNA sequence: ").upper().strip()
nucleotides = ['A', 'C', 'G', 'T']

Enter DNA sequence:  AGACGTAGCT


## Initialize 4x4 transition count matrix

In [6]:
transition_counts = pd.DataFrame(0, index=nucleotides, columns=nucleotides)
#print(transition_counts)

In [7]:
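# Count each adjacent nucleotide pair (AG, GA, AC, ...) in the sequence
for i in range(len(sequence) - 1):
    curr = sequence[i]
    next_ = sequence[i + 1]
    # Skip pairs containing characters other than A, C, G, T (e.g. N)
    if curr in nucleotides and next_ in nucleotides:
        transition_counts.loc[curr, next_] += 1

# Global dinucleotide frequencies, stored separately so the raw counts are preserved
dinucleotide_freqs = transition_counts / transition_counts.to_numpy().sum()
#print(dinucleotide_freqs)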
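The counting loop above can also be written without an explicit loop. The following is a sketch of one alternative using `pd.crosstab`, shown on the example sequence; the names `seq`, `bases`, `current`, `following`, and `pair_table` are introduced here for illustration and are not part of the notebook's own cells.

```python
import pandas as pd

bases = ["A", "C", "G", "T"]
seq = "AGACGTAGCT"

# Pair each nucleotide with the one that follows it.
current = pd.Series(list(seq[:-1]), name="current")
following = pd.Series(list(seq[1:]), name="next")

# Cross-tabulate the pairs, then force a full 4x4 layout: reindex adds
# missing bases as zero rows/columns and drops any non-ACGT symbols.
pair_table = pd.crosstab(current, following).reindex(
    index=bases, columns=bases, fill_value=0
)
print(pair_table)
```

The `reindex` step matters for short sequences, where some bases may never appear and would otherwise be absent from the table.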

## Convert to transition matrix (normalizing rows)

In [9]:
# Divide each row by its total so that every row sums to 1
transition_probs = transition_counts.div(transition_counts.sum(axis=1), axis=0)
# A base with no successors has a zero row total, giving 0/0 = NaN; set those rows to 0
transition_probs = transition_probs.fillna(0)
print("\nMarkov Transition Matrix (First-Order):")
print(transition_probs)


Markov Transition Matrix (First-Order):
          A         C         G         T
A  0.000000  0.333333  0.666667  0.000000
C  0.000000  0.000000  0.500000  0.500000
G  0.333333  0.333333  0.000000  0.333333
T  1.000000  0.000000  0.000000  0.000000
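A quick sanity check on a matrix like the one printed above is to confirm that every row sums to 1, up to rounding. The sketch below retypes the printed values into a plain dictionary named `printed_matrix` (a name chosen here for illustration); the tolerance of `1e-5` is an assumption that allows for the six-decimal rounding of the printout.

```python
# Values copied from the printed matrix above, rounded to six decimals.
printed_matrix = {
    "A": {"A": 0.000000, "C": 0.333333, "G": 0.666667, "T": 0.000000},
    "C": {"A": 0.000000, "C": 0.000000, "G": 0.500000, "T": 0.500000},
    "G": {"A": 0.333333, "C": 0.333333, "G": 0.000000, "T": 0.333333},
    "T": {"A": 1.000000, "C": 0.000000, "G": 0.000000, "T": 0.000000},
}

row_sums = {base: sum(row.values()) for base, row in printed_matrix.items()}
for base, total in row_sums.items():
    # The tolerance covers the rounding in the printed values.
    status = "ok" if abs(total - 1.0) < 1e-5 else "check"
    print(f"{base}: row sum = {total:.6f} ({status})")
```

A row that sums to 0 instead of 1 would indicate a base that never had a successor in the input, since such rows are filled with zeros.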


### REFERENCES:
1. Markov Process: https://en.wikipedia.org/wiki/Markov_chain
2. Markov Transition Matrices: https://math.libretexts.org/Bookshelves/Applied_Mathematics/Applied_Finite_Mathematics_(Sekhon_and_Bloom)/10%3A_Markov_Chains/10.01%3A_Introduction_to_Markov_Chains