# Investigating how sequence identity behaves under different reduced alphabets for sequences with similar structure

## Overall goal
1. Get sequences from ECOD that share a similar structure but differ in sequence and are not closely related evolutionarily, to avoid bias
2. Cluster them down to 70% sequence identity using MMseqs2
3. Recode them with different reduced alphabets (e.g. Dayhoff-6) and calculate sequence identity using BLAST
4. Plot identity scores by alphabet size to see whether any reduced alphabet captures structural similarity to some degree
5. Also plot the correlation between the original identity and the identity after recoding with each reduced alphabet

In [25]:
from bin.alphafold_exploration.alphabets.dayhoff_recoding import dayhoff_6_recode
from bin.alphafold_exploration.data.load_fasta import load_fasta
from itertools import combinations

from Bio import pairwise2

In [26]:
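def average_sequence_identity(seqs):
    # NOTE: despite the name, this returns the mean raw localxx score (number of
    # matched positions) over all sequence pairs, not a percent identity.
    scores = [pairwise2.align.localxx(a, b, score_only=True) for a, b in combinations(seqs, 2)]
    return sum(scores) / len(scores)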
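
The value returned by `localxx` with `score_only=True` is a count of matched positions, so it grows with sequence length and can exceed 100. The sketch below shows one way to turn that count into a fraction between 0 and 1. It relies on the fact that with match = 1, mismatch = 0, and no gap penalties, the best alignment score equals the length of the longest common subsequence, so no Biopython call is needed. The function names and the choice to normalize by the shorter sequence are assumptions; other conventions (alignment length, longer sequence) are equally common, and BLAST identities will generally be lower because BLAST penalizes gaps.

```python
from itertools import combinations


def match_count(a, b):
    """Longest-common-subsequence length of a and b.

    This is the best achievable score when matches score 1 and
    mismatches and gaps score 0.
    """
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def average_fractional_identity(seqs):
    """Mean over all pairs of matches / length of the shorter sequence.

    Assumes every sequence is non-empty.
    """
    pairs = list(combinations(seqs, 2))
    if not pairs:
        raise ValueError("need at least two sequences")
    fractions = [match_count(a, b) / min(len(a), len(b)) for a, b in pairs]
    return sum(fractions) / len(fractions)
```

For example, `average_fractional_identity(["ACDE", "ACE"])` gives `1.0`, because all three residues of the shorter sequence can be matched once gaps are free.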

## First try: sperm whale myoglobin and clam hemoglobin I (TM-score = 0.86) with a handful of alphabets

In [27]:
similar_struct_seqs = list(load_fasta('../data/similar_struct_different_seq/whale_myoglobin_clam_hemoglobin.fasta'))

In [28]:
average_sequence_identity(similar_struct_seqs)

58.0

In [29]:
dayhoff6_similar_struct_seqs = [dayhoff_6_recode(seq) for seq in similar_struct_seqs]

In [30]:
average_sequence_identity(dayhoff6_similar_struct_seqs)

84.0

## Compare to sequences with different structures

In [31]:
different_struct_seqs = list(load_fasta('../data/similar_struct_different_seq/random_clam_hemoglobin.fasta'))

In [32]:
average_sequence_identity(different_struct_seqs)

91.0

In [33]:
dayhoff6_different_struct_seqs = [dayhoff_6_recode(seq) for seq in different_struct_seqs]

In [34]:
average_sequence_identity(dayhoff6_different_struct_seqs)

134.0
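
As a quick sanity check on the four scores reported above, the relative increase under Dayhoff-6 can be computed for each pair. This is only arithmetic on the numbers already shown; the labels and the helper name `fold_change` are illustrative assumptions.

```python
# Raw scores copied from the cell outputs above: (original, Dayhoff-6).
scores = {
    "similar structure": (58.0, 84.0),
    "different structure": (91.0, 134.0),
}


def fold_change(original, recoded):
    """Ratio of the recoded score to the original score."""
    return recoded / original


for label, (original, recoded) in scores.items():
    print(f"{label}: {fold_change(original, recoded):.2f}x")
# similar structure: 1.45x
# different structure: 1.47x
```

Both pairs rise by a similar factor under this raw score, so the scores would likely need to be normalized by sequence length before any comparison between the two pairs is drawn.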

### Helpful links
- https://zhanggroup.org/TM-align/