<a href="https://colab.research.google.com/github/ericodle/mrbayes_primer/blob/main/mutation_models.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In DNA sequence analysis, substitution models (often loosely called mutation models) describe the patterns and rates at which one nucleotide replaces another over time. Some of the most common are:

Jukes-Cantor (JC) Model: The Jukes-Cantor model assumes that all four nucleotides (A, C, G, T) are equally frequent and that every substitution occurs at the same rate. It is the simplest of the commonly used models, which makes it a convenient baseline, but it ignores both base composition and the difference between transitions and transversions.

Kimura 2-Parameter (K2P) Model: The Kimura 2-Parameter model distinguishes transitions (substitutions within the purines A/G or within the pyrimidines C/T) from transversions (substitutions between a purine and a pyrimidine) and gives each class its own rate. Like JC, it assumes equal base frequencies.

Hasegawa-Kishino-Yano (HKY) Model: The HKY model extends the K2P model by allowing unequal base frequencies. Each nucleotide (A, C, G, T) has its own stationary frequency, and transitions and transversions still have separate rates.

General Time Reversible (GTR) Model: The General Time Reversible model is the most general of the time-reversible models. It combines four base frequencies with six substitution rates, one for each unordered pair of nucleotides. GTR is often considered the most realistic of these models, but it has more parameters to estimate and therefore requires more data and more computation.

Tamura-Nei (TN) Model: The Tamura-Nei model extends the HKY model by allowing two transition rates, one for purines (between A and G) and one for pyrimidines (between C and T), alongside a single transversion rate and unequal base frequencies. It is more general than HKY and is itself a special case of GTR.
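The models above are nested: each can be written as a rate matrix whose off-diagonal entry is a symmetric exchangeability multiplied by the frequency of the target base. The sketch below illustrates that idea under stated assumptions. The helper names `rate_matrix` and `exchangeabilities`, the transition weights 2.0 and 3.0, and the unequal base frequencies are hypothetical choices made for illustration, not estimates from data. The six GTR values reuse the placeholder matrix from the GTR cell later in this notebook.

```python
import numpy as np

BASES = "ACGT"

def rate_matrix(exchange, freqs):
    """Return Q with Q[i, j] = exchange[i, j] * freqs[j] for i != j; rows sum to 0."""
    exchange = np.asarray(exchange, dtype=float)
    freqs = np.asarray(freqs, dtype=float)
    q = exchange * freqs                  # broadcasting scales column j by freqs[j]
    np.fill_diagonal(q, 0.0)
    np.fill_diagonal(q, -q.sum(axis=1))   # diagonal makes every row sum to zero
    return q

def exchangeabilities(ag=1.0, ct=1.0, transversion=1.0):
    """Symmetric 4x4 matrix with separate weights for A<->G, C<->T and transversions."""
    s = np.full((4, 4), transversion, dtype=float)
    a, c, g, t = (BASES.index(b) for b in BASES)
    s[a, g] = s[g, a] = ag
    s[c, t] = s[t, c] = ct
    return s

equal = np.full(4, 0.25)
unequal = np.array([0.2, 0.225, 0.275, 0.3])   # assumed frequencies (A, C, G, T)

# Placeholder exchangeabilities copied from the GTR cell of this notebook
gtr_exchange = np.array([
    [0, 1, 2, 3],
    [1, 0, 4, 5],
    [2, 4, 0, 6],
    [3, 5, 6, 0],
])

models = {
    "JC":  rate_matrix(exchangeabilities(), equal),                    # one rate, equal frequencies
    "K2P": rate_matrix(exchangeabilities(ag=2.0, ct=2.0), equal),      # transitions != transversions
    "HKY": rate_matrix(exchangeabilities(ag=2.0, ct=2.0), unequal),    # K2P + unequal frequencies
    "TN":  rate_matrix(exchangeabilities(ag=3.0, ct=2.0), unequal),    # HKY + two transition rates
    "GTR": rate_matrix(gtr_exchange, unequal),                         # six rates, unequal frequencies
}

for name, q in models.items():
    print(name)
    print(np.round(q, 3))
```

Reading the dictionary from top to bottom, each entry relaxes one constraint of the entry before it, which is why JC, K2P, HKY and TN can all be treated as special cases of GTR.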

# Jukes-Cantor Model

In [1]:
import numpy as np

# Example DNA sequences
sequences = [
    "ATGCTGCTGA",
    "ATGCTCCTGA",
    "ATGTTGCTGA",
    "ATGCCGCTGA"
]

# Number of sequences
n = len(sequences)

# Length of sequences
sequence_length = len(sequences[0])

# Compute the JC model distance matrix
distance_matrix = np.zeros((n, n))

for i in range(n):
    for j in range(i+1, n):
        differences = sum(sequences[i][k] != sequences[j][k] for k in range(sequence_length))
        p = differences / sequence_length
        distance = -0.75 * np.log(1 - (4/3) * p)
        distance_matrix[i, j] = distance_matrix[j, i] = distance

# Print the distance matrix
for row in distance_matrix:
    print(row)

[0.         0.10732563 0.10732563 0.10732563]
[0.10732563 0.         0.2326162  0.2326162 ]
[0.10732563 0.2326162  0.         0.2326162 ]
[0.10732563 0.2326162  0.2326162  0.        ]


We initialize an n-by-n matrix of zeros, where n is the number of sequences, and iterate over each unordered pair of sequences with nested loops. For each pair, we count the positions at which the two sequences differ (differences) and divide by the sequence length to obtain the proportion of differing sites, p. The Jukes-Cantor distance is then d = -3/4 * ln(1 - 4p/3), which corrects p for multiple substitutions at the same site. Because the distance is symmetric, each value is written to both [i, j] and [j, i]. The code assumes that all sequences are aligned and of equal length.
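The correction above is only defined while p < 0.75, since the argument of the logarithm reaches zero at that point. A minimal sketch of a reusable helper that makes this explicit follows. The function name `jc69_distance` and the choice to return infinity for saturated pairs are assumptions made here, not part of the original cell.

```python
import math

def jc69_distance(seq_a, seq_b):
    """Jukes-Cantor distance between two aligned sequences of equal length."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    # Proportion of sites at which the two sequences differ
    p = sum(a != b for a, b in zip(seq_a, seq_b)) / len(seq_a)
    if p >= 0.75:
        # The logarithm is undefined here: the pair is saturated
        return math.inf
    return -0.75 * math.log(1 - 4 * p / 3)

print(jc69_distance("ATGCTGCTGA", "ATGCTCCTGA"))
```

Returning infinity rather than raising keeps a full distance matrix computable even when a few pairs are too divergent for the model.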

# Kimura 2-Parameter Model

In [3]:
import numpy as np

# Example DNA sequences
sequences = [
    "ATGCTGCTGA",
    "ATGCTCCTGA",
    "ATGTTGCTGA",
    "ATGCCGCTGA"
]

# Number of sequences
n = len(sequences)

# Length of sequences
sequence_length = len(sequences[0])

# Compute the K2P model distance matrix
distance_matrix = np.zeros((n, n))

for i in range(n):
    for j in range(i+1, n):
        transitions = 0
        transversions = 0
        for k in range(sequence_length):
            if sequences[i][k] != sequences[j][k]:
                # A transition stays within the purines (A, G) or within the pyrimidines (C, T)
                if {sequences[i][k], sequences[j][k]} in ({"A", "G"}, {"C", "T"}):
                    transitions += 1
                else:
                    transversions += 1
        p = transitions / sequence_length
        q = transversions / sequence_length
        distance = -0.5 * np.log(1 - 2*p - q) - 0.25 * np.log(1 - 2*q)
        distance_matrix[i, j] = distance_matrix[j, i] = distance

# Print the distance matrix
for row in distance_matrix:
    print(row)

[0.         0.10846615 0.11157178 0.11157178]
[0.10846615 0.         0.23412336 0.23412336]
[0.11157178 0.23412336 0.         0.25541281]
[0.11157178 0.23412336 0.25541281 0.        ]


We compute the distance matrix using the Kimura 2-Parameter (K2P) model. As before, we initialize an n-by-n matrix of zeros and iterate over each unordered pair of sequences. For each pair, we classify every differing site as a transition or a transversion and count the two classes separately. Dividing the counts by the sequence length gives the proportion of transitional differences (p) and of transversional differences (q), and the K2P distance is d = -1/2 * ln(1 - 2p - q) - 1/4 * ln(1 - 2q). Pairs that differ only by transitions or only by transversions now receive different distances, which is why this matrix is no longer identical to the Jukes-Cantor one.
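The same calculation can be packaged as a function that works on one pair at a time. This is a sketch, assuming sequences that contain only A, C, G and T; the names `k2p_distance` and `TRANSITIONS` are introduced here for illustration, and returning infinity when a logarithm argument is not positive is a design choice of this example.

```python
import math

# Unordered base pairs that count as transitions
TRANSITIONS = ({"A", "G"}, {"C", "T"})

def k2p_distance(seq_a, seq_b):
    """Kimura 2-parameter distance between two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    n = len(seq_a)
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != b]
    transitions = sum(1 for a, b in pairs if {a, b} in TRANSITIONS)
    transversions = len(pairs) - transitions
    p, q = transitions / n, transversions / n
    w1, w2 = 1 - 2 * p - q, 1 - 2 * q
    if w1 <= 0 or w2 <= 0:
        return math.inf   # too divergent for the correction to be defined
    return -0.5 * math.log(w1) - 0.25 * math.log(w2)

# One transition and one transversion over ten sites
print(k2p_distance("ATGCTCCTGA", "ATGTTGCTGA"))
```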

# Hasegawa-Kishino-Yano (HKY) Model

In [4]:
import numpy as np

# Example DNA sequences
sequences = [
    "ATGCTGCTGA",
    "ATGCTCCTGA",
    "ATGTTGCTGA",
    "ATGCCGCTGA"
]

# Number of sequences
n = len(sequences)

# Length of sequences
sequence_length = len(sequences[0])

# Frequencies of each nucleotide
base_frequencies = {
    "A": 0,
    "C": 0,
    "G": 0,
    "T": 0
}

# Compute the frequencies of each nucleotide
for sequence in sequences:
    for base in base_frequencies:
        base_frequencies[base] += sequence.count(base)

total_bases = sum(base_frequencies.values())

for base in base_frequencies:
    base_frequencies[base] /= total_bases

# Compute the HKY model distance matrix
distance_matrix = np.zeros((n, n))

for i in range(n):
    for j in range(i+1, n):
        transitions = 0
        transversions = 0
        for k in range(sequence_length):
            if sequences[i][k] != sequences[j][k]:
                # A transition stays within the purines (A, G) or within the pyrimidines (C, T)
                if {sequences[i][k], sequences[j][k]} in ({"A", "G"}, {"C", "T"}):
                    transitions += 1
                else:
                    transversions += 1
        p = transitions / sequence_length
        q = transversions / sequence_length
        # Note: base_frequencies is not used in this expression
        distance = -0.5 * np.log((1 - 2*p - 2*q) * np.sqrt(1 - 2*q))
        distance_matrix[i, j] = distance_matrix[j, i] = distance

# Print the distance matrix
for row in distance_matrix:
    print(row)

[0.         0.16735766 0.11157178 0.11157178]
[0.16735766 0.         0.3111987  0.3111987 ]
[0.11157178 0.3111987  0.         0.25541281]
[0.11157178 0.3111987  0.25541281 0.        ]


First, we compute the frequency of each nucleotide (A, C, G, T) by counting its occurrences across all sequences and dividing by the total number of bases.

We then fill the distance matrix. We initialize an n-by-n matrix of zeros and iterate over each unordered pair of sequences. For each pair, we count transitions and transversions, convert the counts to the proportions p and q, and evaluate -1/2 * ln((1 - 2p - 2q) * sqrt(1 - 2q)). Two points deserve attention. The expression never uses the base frequencies computed in the first step, and its first factor is (1 - 2p - 2q) where the K2P formula has (1 - 2p - q). The result should therefore be read as a simplified, K2P-like illustration and not as a true HKY distance. In practice, HKY parameters are usually estimated by maximum likelihood or Bayesian inference, not from a simple pairwise formula.
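In the HKY model the base frequencies enter through the rate matrix, which in turn determines the probability of observing base j after time t given base i. The sketch below shows one way to compute those probabilities with NumPy only. The function names, the value kappa = 2.0 and the branch length t = 0.1 are assumptions chosen for illustration; the frequencies are the pooled base frequencies of the four example sequences (A 8/40, C 9/40, G 11/40, T 12/40).

```python
import numpy as np

BASES = "ACGT"

def hky_rate_matrix(freqs, kappa):
    """HKY rate matrix, scaled to one expected substitution per unit time."""
    freqs = np.asarray(freqs, dtype=float)
    q = np.tile(freqs, (4, 1))                 # q[i, j] = freqs[j]
    for a, b in (("A", "G"), ("C", "T")):      # transitions get the extra factor kappa
        i, j = BASES.index(a), BASES.index(b)
        q[i, j] *= kappa
        q[j, i] *= kappa
    np.fill_diagonal(q, 0.0)
    np.fill_diagonal(q, -q.sum(axis=1))        # rows sum to zero
    return q / -(freqs * np.diag(q)).sum()

def transition_probabilities(q, freqs, t):
    """P(t) = exp(Q t), computed through the symmetric form of a reversible Q."""
    root = np.sqrt(np.asarray(freqs, dtype=float))
    sym = root[:, None] * q / root[None, :]    # Pi^(1/2) Q Pi^(-1/2)
    sym = (sym + sym.T) / 2                    # remove rounding asymmetry
    eigenvalues, vectors = np.linalg.eigh(sym)
    exp_sym = (vectors * np.exp(eigenvalues * t)) @ vectors.T
    return exp_sym * root[None, :] / root[:, None]

freqs = np.array([0.2, 0.225, 0.275, 0.3])     # A, C, G, T
q = hky_rate_matrix(freqs, kappa=2.0)          # kappa is an assumed value
p_t = transition_probabilities(q, freqs, t=0.1)
print(np.round(p_t, 4))
```

Each row of the printed matrix sums to 1, and for very large t every row approaches the base frequencies, which is what "stationary frequencies" means for this model.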

# General Time Reversible (GTR) Model

In [7]:
import numpy as np

# Example DNA sequences
sequences = [
    "ATGCTGCTGA",
    "ATGCTCCTGA",
    "ATGTTGCTGA",
    "ATGCCGCTGA"
]

# Number of sequences
n = len(sequences)

# Length of sequences
sequence_length = len(sequences[0])

# Substitution rate matrix (A, C, G, T)
substitution_rates = np.array([
    [0, 1, 2, 3],
    [1, 0, 4, 5],
    [2, 4, 0, 6],
    [3, 5, 6, 0]
])

# Compute the GTR-like model distance matrix
distance_matrix = np.zeros((n, n))

for i in range(n):
    for j in range(i + 1, n):
        dist = 0
        for k in range(sequence_length):
            base_i = sequences[i][k]
            base_j = sequences[j][k]
            rate = substitution_rates['ACGT'.index(base_i), 'ACGT'.index(base_j)]
            dist += rate
        distance_matrix[i, j] = distance_matrix[j, i] = dist

# Print the distance matrix
for row in distance_matrix:
    print(row)

[0. 4. 5. 5.]
[4. 0. 9. 9.]
[ 5.  9.  0. 10.]
[ 5.  9. 10.  0.]


Here we assume a fixed substitution rate matrix (substitution_rates) for the four nucleotides A, C, G, and T. It is stored as a symmetric 2D NumPy array in which entry [i, j] is the weight assigned to a mismatch between nucleotides i and j. The six values 1 through 6 are placeholders chosen for illustration, not rates estimated from data, and the diagonal is zero so that matching sites contribute nothing.

We then compute the distance matrix based on this simplified GTR-like scheme. We initialize an n-by-n matrix of zeros and iterate over each unordered pair of sequences. For each pair, we look up the weight for the nucleotide pair at every position and accumulate the total in the dist variable, which is then written to both [i, j] and [j, i]. The result is a weighted mismatch count, not an evolutionary distance in substitutions per site. A full GTR analysis would instead estimate the six rates and the four base frequencies from the data.
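For comparison, a distance in substitutions per site can be obtained under any time-reversible model without fixing the rates in advance. If F is the symmetrised matrix of site-pattern frequencies for a pair of sequences and Pi is the diagonal matrix of its row sums, then P = Pi^-1 F estimates exp(Qt), and the expected number of substitutions per site is d = -trace(Pi log(Pi^-1 F)). The sketch below is one way to evaluate this with NumPy. The function names are hypothetical, the input is assumed to contain only A, C, G and T, and with sequences as short as the ten-site examples the estimate is very noisy.

```python
import numpy as np

BASES = "ACGT"

def pair_frequencies(seq_a, seq_b):
    """Symmetrised 4x4 matrix F: F[i, j] is the share of sites showing bases i and j."""
    counts = np.zeros((4, 4))
    for a, b in zip(seq_a, seq_b):
        counts[BASES.index(a), BASES.index(b)] += 1
    counts = (counts + counts.T) / 2      # reversibility: the order within a pair is irrelevant
    return counts / counts.sum()

def gtr_distance(f):
    """Expected substitutions per site, d = -trace(Pi log(Pi^-1 F))."""
    f = np.asarray(f, dtype=float)
    pi = f.sum(axis=1)                    # base frequencies implied by F
    if np.any(pi <= 0):
        raise ValueError("every base must occur at least once in the pair")
    root = np.sqrt(pi)
    sym = f / np.outer(root, root)        # Pi^(-1/2) F Pi^(-1/2), a symmetric matrix
    eigenvalues, vectors = np.linalg.eigh(sym)
    if np.any(eigenvalues <= 0):
        return float("inf")               # matrix logarithm undefined: pair too divergent
    log_sym = (vectors * np.log(eigenvalues)) @ vectors.T
    return float(-(pi * np.diag(log_sym)).sum())

sequences = ["ATGCTGCTGA", "ATGCTCCTGA", "ATGTTGCTGA", "ATGCCGCTGA"]
for i in range(len(sequences)):
    for j in range(i + 1, len(sequences)):
        d = gtr_distance(pair_frequencies(sequences[i], sequences[j]))
        print(i, j, round(d, 5))
```

When F has equal base frequencies and equal off-diagonal entries, this expression reduces to the Jukes-Cantor formula, which is a convenient sanity check.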

# Tamura-Nei (TN) Model

In [8]:
import numpy as np

# Example DNA sequences
sequences = [
    "ATGCTGCTGA",
    "ATGCTCCTGA",
    "ATGTTGCTGA",
    "ATGCCGCTGA"
]

# Number of sequences
n = len(sequences)

# Length of sequences
sequence_length = len(sequences[0])

# Frequencies of each nucleotide
base_frequencies = {
    "A": 0,
    "C": 0,
    "G": 0,
    "T": 0
}

# Compute the frequencies of each nucleotide
for sequence in sequences:
    for base in base_frequencies:
        base_frequencies[base] += sequence.count(base)

total_bases = sum(base_frequencies.values())

for base in base_frequencies:
    base_frequencies[base] /= total_bases

# Transition/transversion rate matrix (A, C, G, T)
transition_transversion_rates = np.array([
    [0, 1, 2, 1],
    [1, 0, 1, 2],
    [2, 1, 0, 1],
    [1, 2, 1, 0]
])

# Compute the TN model distance matrix
distance_matrix = np.zeros((n, n))

for i in range(n):
    for j in range(i+1, n):
        dist = 0
        for k in range(sequence_length):
            base_i = sequences[i][k]
            base_j = sequences[j][k]
            if base_i != base_j:
                rate = transition_transversion_rates['ACGT'.index(base_i), 'ACGT'.index(base_j)]
                freq_i = base_frequencies[base_i]
                freq_j = base_frequencies[base_j]
                dist += (1 / (freq_i * freq_j)) * rate
        distance_matrix[i, j] = distance_matrix[j, i] = dist

# Print the distance matrix
for row in distance_matrix:
    print(row)

[ 0.         16.16161616 29.62962963 29.62962963]
[16.16161616  0.         45.79124579 45.79124579]
[29.62962963 45.79124579  0.         59.25925926]
[29.62962963 45.79124579 59.25925926  0.        ]


First, we compute the frequency of each nucleotide (A, C, G, T) by counting its occurrences across all sequences and dividing by the total number of bases.

We then compute a distance matrix in the spirit of the Tamura-Nei (TN) model. We initialize an n-by-n matrix of zeros and iterate over each unordered pair of sequences. For each pair, we consider only the differing positions. Each one adds rate / (freq_i * freq_j) to the dist variable, where rate comes from transition_transversion_rates (2 for transitions, 1 for transversions) and freq_i and freq_j are the frequencies of the two nucleotides involved. Transitions therefore count twice as much as transversions, and mismatches between rare bases count more than mismatches between common ones. The result is a frequency-weighted mismatch score, not the published Tamura-Nei distance, which is why the values are far larger than those produced by the Jukes-Cantor and K2P cells.
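For reference, the Tamura-Nei model does have a closed-form pairwise distance that treats purine transitions, pyrimidine transitions and transversions separately. The sketch below implements that estimator as it is commonly stated; it should be checked against a reference before being relied on. The function name `tn93_distance` is introduced here, the input is assumed to contain only A, C, G and T, and returning infinity for saturated pairs is a choice of this example. The frequencies are the pooled base frequencies of the four example sequences (A 8/40, C 9/40, G 11/40, T 12/40).

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def tn93_distance(seq_a, seq_b, freqs):
    """Tamura-Nei distance; freqs maps each base to its frequency."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    n = len(seq_a)
    p1 = p2 = q = 0
    for a, b in zip(seq_a, seq_b):
        if a == b:
            continue
        if {a, b} == PURINES:
            p1 += 1            # A<->G transition
        elif {a, b} == PYRIMIDINES:
            p2 += 1            # C<->T transition
        else:
            q += 1             # transversion
    p1, p2, q = p1 / n, p2 / n, q / n
    g_a, g_c, g_g, g_t = (freqs[base] for base in "ACGT")
    g_r, g_y = g_a + g_g, g_c + g_t        # purine and pyrimidine totals
    k1 = 2 * g_a * g_g / g_r
    k2 = 2 * g_t * g_c / g_y
    k3 = 2 * (g_r * g_y - g_a * g_g * g_y / g_r - g_t * g_c * g_r / g_y)
    w1 = 1 - p1 / k1 - q / (2 * g_r)
    w2 = 1 - p2 / k2 - q / (2 * g_y)
    w3 = 1 - q / (2 * g_r * g_y)
    if min(w1, w2, w3) <= 0:
        return math.inf        # too divergent for the correction to be defined
    return -k1 * math.log(w1) - k2 * math.log(w2) - k3 * math.log(w3)

pooled = {"A": 0.2, "C": 0.225, "G": 0.275, "T": 0.3}
print(tn93_distance("ATGCTGCTGA", "ATGTTGCTGA", pooled))
```

With equal base frequencies and equal proportions of the two transition types, the expression collapses to the K2P formula, which ties this model back to the simpler ones above.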