# Genetic distance calculation

## Fast pairwise distance estimation

A fast implementation is available for a limited number of evolutionary models. `available_distances()` lists them, along with the molecular types each one supports.

In [1]:
from cogent3 import available_distances

available_distances()

Abbreviation,Suitable for moltype
paralinear,"dna, rna, protein"
logdet,"dna, rna, protein"
jc69,"dna, rna"
tn93,"dna, rna"
hamming,"dna, rna, protein, text"
percent,"dna, rna, protein, text"
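
As a rough illustration of what the simplest of these measures compute, the sketch below implements the textbook Hamming distance, the proportion of differing sites (p-distance) and the Jukes-Cantor (1969) correction in plain Python. The function names and the two toy sequences are made up for this example, and the code does not reproduce how `cogent3` treats gaps or ambiguous characters.

```python
import math


def hamming(seq1, seq2):
    """Count the aligned positions at which two sequences differ."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a != b for a, b in zip(seq1, seq2))


def p_distance(seq1, seq2):
    """Proportion of aligned positions that differ."""
    return hamming(seq1, seq2) / len(seq1)


def jc69(seq1, seq2):
    """Jukes-Cantor distance: corrects p for multiple substitutions per site."""
    p = p_distance(seq1, seq2)
    if p >= 0.75:
        raise ValueError("JC69 distance is undefined when p >= 0.75")
    return -0.75 * math.log(1 - 4 * p / 3)


# hypothetical toy sequences, not taken from the BRCA1 alignment
s1 = "ACGTACGTAC"
s2 = "ACGTTCGTAA"
print(hamming(s1, s2), p_distance(s1, s2), jc69(s1, s2))
```

The JC69 value is always at least as large as the p-distance, because it accounts for substitutions that have been overwritten by later changes at the same site.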


## Computing genetic distances using the `Alignment` object

Any abbreviation listed by `available_distances()` can be passed as the `calc` argument of the alignment's `distance_matrix()` method, i.e. `distance_matrix(calc=<abbreviation>)`.

In [2]:
from cogent3 import load_aligned_seqs
aln = load_aligned_seqs('../data/primate_brca1.fasta', moltype="dna")
dists = aln.distance_matrix(calc="tn93", show_progress=False)
dists

Seq1 \ Seq2,Chimpanzee,Galago,Gorilla,HowlerMon,Human,Orangutan,Rhesus
Chimpanzee,0.0,0.1921,0.0054,0.0704,0.0089,0.014,0.0396
Galago,0.1921,0.0,0.1923,0.2157,0.1965,0.1944,0.1962
Gorilla,0.0054,0.1923,0.0,0.07,0.0086,0.0137,0.0393
HowlerMon,0.0704,0.2157,0.07,0.0,0.0736,0.0719,0.0736
Human,0.0089,0.1965,0.0086,0.0736,0.0,0.0173,0.0423
Orangutan,0.014,0.1944,0.0137,0.0719,0.0173,0.0,0.0411
Rhesus,0.0396,0.1962,0.0393,0.0736,0.0423,0.0411,0.0
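
A distance matrix is symmetric with a zero diagonal, so each pair of sequences needs to be examined only once. The sketch below copies the TN93 values printed above into plain Python lists and picks out the closest and most distant pairs. It works on the printed numbers only and does not use any method of the `cogent3` matrix object; the variable names are chosen for this example.

```python
names = ["Chimpanzee", "Galago", "Gorilla", "HowlerMon", "Human", "Orangutan", "Rhesus"]

# TN93 distances copied from the output above (rows follow the order of `names`)
matrix = [
    [0.0, 0.1921, 0.0054, 0.0704, 0.0089, 0.014, 0.0396],
    [0.1921, 0.0, 0.1923, 0.2157, 0.1965, 0.1944, 0.1962],
    [0.0054, 0.1923, 0.0, 0.07, 0.0086, 0.0137, 0.0393],
    [0.0704, 0.2157, 0.07, 0.0, 0.0736, 0.0719, 0.0736],
    [0.0089, 0.1965, 0.0086, 0.0736, 0.0, 0.0173, 0.0423],
    [0.014, 0.1944, 0.0137, 0.0719, 0.0173, 0.0, 0.0411],
    [0.0396, 0.1962, 0.0393, 0.0736, 0.0423, 0.0411, 0.0],
]

# keep each unordered pair once by walking the upper triangle
pairs = {
    (names[i], names[j]): matrix[i][j]
    for i in range(len(names))
    for j in range(i + 1, len(names))
}

closest = min(pairs, key=pairs.get)
most_distant = max(pairs, key=pairs.get)
print("closest:", closest, pairs[closest])
print("most distant:", most_distant, pairs[most_distant])
```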


## Using the distance calculator directly

In [3]:
from cogent3 import load_aligned_seqs, get_distance_calculator
aln = load_aligned_seqs('../data/primate_brca1.fasta', moltype="dna")  # tn93 needs a nucleic acid moltype
dist_calc = get_distance_calculator("tn93", alignment=aln)
dist_calc

<cogent3.evolve.fast_distance.TN93Pair at 0x11501b550>

In [4]:
dist_calc.run(show_progress=False)
dists = dist_calc.get_pairwise_distances()
dists

Seq1 \ Seq2,Chimpanzee,Galago,Gorilla,HowlerMon,Human,Orangutan,Rhesus
Chimpanzee,0.0,0.1921,0.0054,0.0704,0.0089,0.014,0.0396
Galago,0.1921,0.0,0.1923,0.2157,0.1965,0.1944,0.1962
Gorilla,0.0054,0.1923,0.0,0.07,0.0086,0.0137,0.0393
HowlerMon,0.0704,0.2157,0.07,0.0,0.0736,0.0719,0.0736
Human,0.0089,0.1965,0.0086,0.0736,0.0,0.0173,0.0423
Orangutan,0.014,0.1944,0.0137,0.0719,0.0173,0.0,0.0411
Rhesus,0.0396,0.1962,0.0393,0.0736,0.0423,0.0411,0.0


The distance calculator object can also report further information, such as the standard errors of the estimates.

In [5]:
dist_calc.stderr

Seq1 \ Seq2,Galago,HowlerMon,Rhesus,Orangutan,Gorilla,Human,Chimpanzee
Galago,0.0,0.0103,0.0096,0.0095,0.0095,0.0096,0.0095
HowlerMon,0.0103,0.0,0.0054,0.0053,0.0053,0.0054,0.0053
Rhesus,0.0096,0.0054,0.0,0.0039,0.0039,0.004,0.0039
Orangutan,0.0095,0.0053,0.0039,0.0,0.0022,0.0025,0.0023
Gorilla,0.0095,0.0053,0.0039,0.0022,0.0,0.0018,0.0014
Human,0.0096,0.0054,0.004,0.0025,0.0018,0.0,0.0018
Chimpanzee,0.0095,0.0053,0.0039,0.0023,0.0014,0.0018,0.0
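
One common use of standard errors is to attach an approximate confidence interval to each estimate. The sketch below assumes the estimates are roughly normally distributed, which is an assumption made for this example rather than something the output guarantees, and uses the usual `estimate ± 1.96 × stderr` rule for a 95% interval. The distances and standard errors are copied from the TN93 outputs above; note that the two tables list the sequences in different orders, so values are matched by name. The helper `approx_ci` is hypothetical and not part of `cogent3`.

```python
def approx_ci(estimate, stderr, z=1.96):
    """Approximate confidence interval under a normal assumption, clipped at zero."""
    half_width = z * stderr
    return max(0.0, estimate - half_width), estimate + half_width


# (distance, standard error) pairs copied from the printed TN93 outputs
estimates = {
    ("Human", "Chimpanzee"): (0.0089, 0.0018),
    ("Human", "Gorilla"): (0.0086, 0.0018),
    ("Galago", "HowlerMon"): (0.2157, 0.0103),
}

intervals = {pair: approx_ci(d, se) for pair, (d, se) in estimates.items()}
for pair, (lower, upper) in intervals.items():
    print(pair, round(lower, 4), round(upper, 4))
```

Under this approximation the Human/Chimpanzee and Human/Gorilla intervals overlap almost entirely, so these two distances cannot be distinguished on this evidence alone.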


## Likelihood based pairwise distance estimation

The standard `cogent3` likelihood function can also be used to estimate distances. Because this requires numerical optimisation, it can be significantly slower than the fast estimation approach above.

The following will use the F81 nucleotide substitution model and perform numerical optimisation.

In [6]:
from cogent3 import load_aligned_seqs, get_model
from cogent3.evolve import distance

aln = load_aligned_seqs('../data/primate_brca1.fasta', moltype="dna")
d = distance.EstimateDistances(aln, submodel=get_model("F81"))
d.run(show_progress=False)
dists = d.get_pairwise_distances()
dists

Seq1 \ Seq2,Chimpanzee,Galago,Gorilla,HowlerMon,Human,Orangutan,Rhesus
Chimpanzee,0.0,0.1892,0.0054,0.0697,0.0089,0.014,0.0395
Galago,0.1892,0.0,0.1891,0.2112,0.1934,0.1915,0.193
Gorilla,0.0054,0.1891,0.0,0.0693,0.0086,0.0136,0.0391
HowlerMon,0.0697,0.2112,0.0693,0.0,0.0729,0.0713,0.0729
Human,0.0089,0.1934,0.0086,0.0729,0.0,0.0173,0.0421
Orangutan,0.014,0.1915,0.0136,0.0713,0.0173,0.0,0.041
Rhesus,0.0395,0.193,0.0391,0.0729,0.0421,0.041,0.0


All `cogent3` substitution models can be used for distance calculation via this approach, with the caveat that identifiability issues mean this is not possible for some non-stationary model classes.
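
To give a feel for what "numerical optimisation" means for a pairwise distance, the sketch below maximises a log-likelihood over the distance with a golden-section search. It deliberately swaps in the simpler JC69 model rather than F81, because JC69 has a closed-form answer to check against. The site counts are invented, and neither the search routine nor the function names reflect how `cogent3` performs its optimisation.

```python
import math


def jc69_log_likelihood(d, n_sites, n_diff):
    """Log-likelihood of distance d for one sequence pair under JC69."""
    e = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 + 0.75 * e  # probability a site shows the same base
    p_diff = 0.25 - 0.25 * e  # probability of one specific different base
    # 0.25 is the equilibrium frequency of the base in the first sequence
    return (n_sites - n_diff) * math.log(0.25 * p_same) + n_diff * math.log(0.25 * p_diff)


def golden_section_max(f, lo, hi, tol=1e-8):
    """Locate the maximum of a unimodal function on [lo, hi]."""
    inv_phi = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    while b - a > tol:
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if f(c) > f(d):
            b = d  # maximum lies in [a, d]
        else:
            a = c  # maximum lies in [c, b]
    return (a + b) / 2


# hypothetical data: 100 differences observed across 1000 aligned sites
n_sites, n_diff = 1000, 100
mle = golden_section_max(lambda d: jc69_log_likelihood(d, n_sites, n_diff), 1e-9, 5.0)

# closed-form JC69 distance for the same data
p = n_diff / n_sites
closed_form = -0.75 * math.log(1 - 4 * p / 3)
print(mle, closed_form)
```

The search evaluates the likelihood many times to reach a value that the closed-form expression gives in a single step, which is the cost difference between the two approaches in miniature. For richer models no closed form exists and the optimisation cannot be avoided.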