# Sequence Similarity and Identity

### Checking for Similarity Between Sequences

- Sequence Alignment
 + Dynamic Programming (Global/Local/needle/water)
 + Dotplot
- Similarity: the degree of resemblance between the two sequences being compared
 + often quantified inversely, as a distance
 + edit distance: the minimal number of edit operations (insertions, deletions, and substitutions) needed to transform one sequence into an exact copy of the other
- Identity: the number of characters that match EXACTLY between two sequences
 + Gaps are not counted
 + The measurement is relative to the shorter of the two sequences
 + As a result, sequence identity is not transitive
 + if A is 100% identical to B and B is 100% identical to C, A is not necessarily 100% identical to C
 

- A: AAGGCTT
- B: AAGGC
- C: AAGGCAT
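
The edit-distance definition above can be made concrete with a small dynamic-programming sketch. This is a plain-Python illustration, not part of Biopython; the function name `edit_distance` and the unit cost for every operation are assumptions chosen for the example.

```python
def edit_distance(s: str, t: str) -> int:
    """Minimal number of insertions, deletions and substitutions turning s into t."""
    # prev[j] holds the distance between the processed prefix of s and t[:j].
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, start=1):
        curr = [i]  # turning s[:i] into the empty string takes i deletions
        for j, b in enumerate(t, start=1):
            cost = 0 if a == b else 1
            curr.append(min(
                prev[j] + 1,         # delete a from s
                curr[j - 1] + 1,     # insert b into s
                prev[j - 1] + cost,  # keep a match or substitute a with b
            ))
        prev = curr
    return prev[-1]


seq_a, seq_b, seq_c = "AAGGCTT", "AAGGC", "AAGGCAT"
print("A vs B:", edit_distance(seq_a, seq_b))  # 2 (delete the trailing TT)
print("A vs C:", edit_distance(seq_a, seq_c))  # 1 (substitute one T with A)
print("B vs C:", edit_distance(seq_b, seq_c))  # 2 (insert AT)
```

A smaller distance means higher similarity, so under this measure A and C are the closest pair.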

In [2]:
from Bio.Seq import Seq

In [4]:
seqA = Seq("AAGGCTT")
seqB = Seq("AAGGC")
seqC = Seq("AAGGCAT")

In [11]:
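# Note: Bio.pairwise2 is deprecated in recent Biopython releases
# in favour of Bio.Align.PairwiseAligner; it is used here as in the original notebook.
from Bio import pairwise2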

In [16]:
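# localxx: local alignment, a match scores 1, no mismatch or gap penalties.
# score_only=True returns just the score, which here is the number of matching characters.

# A versus B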
AvsB = pairwise2.align.localxx(seqA, seqB, one_alignment_only=True, score_only=True)

# A versus C
AvsC = pairwise2.align.localxx(seqA, seqC, one_alignment_only=True, score_only=True)

# B versus C
BvsC = pairwise2.align.localxx(seqB, seqC, one_alignment_only=True, score_only=True)
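
Because `localxx` rewards each match with 1 and charges nothing for mismatches or gaps, the best score equals the length of the longest common subsequence of the two sequences. The sketch below reproduces the three scores without Biopython; `lcs_length` is a hypothetical helper written for this illustration, not a library function.

```python
def lcs_length(s: str, t: str) -> int:
    """Length of the longest common subsequence of s and t."""
    # prev[j] holds the LCS length of the processed prefix of s and t[:j].
    prev = [0] * (len(t) + 1)
    for a in s:
        curr = [0]
        for j, b in enumerate(t, start=1):
            if a == b:
                curr.append(prev[j - 1] + 1)            # extend a match
            else:
                curr.append(max(prev[j], curr[j - 1]))  # skip a character (free gap)
        prev = curr
    return prev[-1]


print("A vs B:", lcs_length("AAGGCTT", "AAGGC"))    # 5
print("A vs C:", lcs_length("AAGGCTT", "AAGGCAT"))  # 6
print("B vs C:", lcs_length("AAGGC", "AAGGCAT"))    # 5
```

These counts (5, 6, 5) are the numerators used in the identity percentages that follow.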


#### Whether they are Identical or not

+ Seq A and B are 100% identical (relative to the length of B, the shorter sequence)

In [18]:
print("A versus B: ", AvsB / len(seqB) * 100)

A versus B:  100.0


+ Seq B and C are 100% identical when the score is divided by the length of B (the shorter sequence)

In [19]:
print("B versus C: ", BvsC / len(seqB) * 100)

B versus C:  100.0


+ Seq B and C are only 71.43% identical when the score is divided by the length of C (the longer sequence)

In [21]:
print("B versus C: ", BvsC / len(seqC) * 100)

B versus C:  71.42857142857143


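+ Seq A and C are 85.71% identical (both sequences have length 7, so the choice of denominator does not matter)

In [23]: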
print("A versus C: ", AvsC / len(seqC) * 100)

A versus C:  85.71428571428571
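
The calculations above can be gathered into one helper that makes the choice of denominator explicit. This is a minimal sketch: `percent_identity` and its `relative_to` parameter are names invented for this example, and the match counts (5, 5 and 6) are the ones implied by the percentages printed above.

```python
def percent_identity(matches: int, len_x: int, len_y: int,
                     relative_to: str = "shorter") -> float:
    """Percent identity given a match count and the two sequence lengths."""
    if relative_to == "shorter":
        denominator = min(len_x, len_y)
    elif relative_to == "longer":
        denominator = max(len_x, len_y)
    else:
        raise ValueError("relative_to must be 'shorter' or 'longer'")
    return matches / denominator * 100


# (matches, length of first sequence, length of second sequence)
pairs = {"A vs B": (5, 7, 5), "B vs C": (5, 5, 7), "A vs C": (6, 7, 7)}

for name, (matches, len_x, len_y) in pairs.items():
    shorter = percent_identity(matches, len_x, len_y, "shorter")
    longer = percent_identity(matches, len_x, len_y, "longer")
    print(f"{name}: {shorter:.2f}% (shorter), {longer:.2f}% (longer)")
# A vs B: 100.00% (shorter), 71.43% (longer)
# B vs C: 100.00% (shorter), 71.43% (longer)
# A vs C: 85.71% (shorter), 85.71% (longer)
```

Relative to the shorter sequence, A matches B and B matches C at 100%, yet A matches C at only 85.71%, which is the non-transitivity described earlier.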


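#### Whether they are the same or not

+ `==` compares the full sequences character by character, so 100% identity relative to the shorter sequence does not imply equality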

In [25]:
print(f"seqA == seqB: {seqA == seqB}\t seqA == seqC: {seqA == seqC}\t seqB == seqC: {seqB == seqC}")

seqA == seqB: False	 seqA == seqC: False	 seqB == seqC: False


# Well Done!