# Mutations

### Problem

To test our folding methods on different BM3 mutants, we need to replicate the target sequence's mutations in the starting sequence. The PyRosetta mutate tool requires the residue number (the numbering differs between poses built from different BM3 structures) and the one-letter code of the amino acid to change it to. The numbering system we use starts at ```MTIKEM...```, but not all the structures contain these residues, so their numbering is different. It would be useful to translate any BM3 numbering system to ours, so we can easily identify the position of mutations.


### Pseudocode

Not real code, but it's sometimes useful to make sketches like this. It's a simplified version of what I think a test iteration might look like. First, the starting and target poses are made and their sequences are extracted. I think the sequences are taken directly from the structure, so they might not be the complete biological sequence. Then we have to find the mutations between the two sequences, which is the part I've been having trouble with. I'll put some things that I tried below.

```python
starting_pose = pose("BM3wt.pdb")
starting_sequence = get_seq(starting_pose)

target_pose = pose("BM3mutant.pdb")
target_sequence = get_seq(target_pose) # numbering may not be in sync with starting_sequence

mutations = find_mutations(starting_sequence, target_sequence)
# the pyrosetta mutation tool requires the amino acid number
# and the residue to change to

mutant_pose = copy(starting_pose)
mutant_pose.mutate(mutations) # something like that

mutant_pose.fold()

score(mutant_pose, target_pose)
```


### Aim
* To make a function that reliably identifies mutations between two BM3 sequences
* To return the mutation(s), where the numbering system is adjusted as necessary
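
As a rough sketch of how these two aims could fit together: if we assume there are no insertions or deletions between the two sequences, and that we already know where each one starts in our numbering, the comparison is just a position-by-position lookup. The names `number_residues` and `find_mutations` and the `*_first` parameters are made up for illustration, and the toy sequences (including the F to W change) are invented, not real BM3 mutants.

```python
def number_residues(sequence, first_position=1):
    """Map each residue to its position in our numbering.

    first_position is the number (in our numbering) of the first residue
    present in the sequence; it is an assumed input here.
    """
    return {first_position + i: aa for i, aa in enumerate(sequence)}


def find_mutations(starting_sequence, target_sequence,
                   starting_first=1, target_first=1):
    """Return (old_aa, position, new_aa) for positions present in both sequences.

    Assumes no insertions or deletions, only substitutions.
    """
    start = number_residues(starting_sequence, starting_first)
    target = number_residues(target_sequence, target_first)
    shared_positions = sorted(start.keys() & target.keys())
    return [(start[p], p, target[p])
            for p in shared_positions
            if start[p] != target[p]]


# Toy example: the starting structure begins at residue 2, the target at residue 4.
print(find_mutations("TIKEMPQPKTFGELK", "KEMPQPKTWGELK",
                     starting_first=2, target_first=4))
```

Positions that only exist in one of the two structures are ignored, so a truncated N-terminus does not get reported as a run of mutations.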

I've dumped in some functions that I was trying yesterday. The sequences of our (current) mutant group are at ```BM3-Design-PyRosetta/data/sequences/```, so I made some pandas functions to get them. I also looked at using alignments, but I've no idea if that's a good idea or a waste of time. Maybe it would be simpler to just find some conserved residues and search for them to find our sequence offset? Or maybe finding the mutations directly from the sequence isn't a good use of our time and we'd be better off just finding the mutations in literature?
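
The "search for some conserved residues" idea could look something like the sketch below, which uses only string search. It assumes the first few residues of a structure's sequence are unmutated and appear exactly once in the reference. `REFERENCE_START`, `numbering_offset`, `to_our_numbering` and `anchor_length` are hypothetical names, and the reference string is pieced together from the ```MTIKEM...``` start plus the N-terminal residues seen in the structure sequences, so it should be checked against the real full-length sequence.

```python
# Assumption: start of our reference numbering (M = residue 1).
REFERENCE_START = "MTIKEMPQPKTFGELKNLPLLNTDKPVQALMKIADELGEIFKFEAPG"


def numbering_offset(structure_seq, reference_seq=REFERENCE_START, anchor_length=8):
    """Number of reference residues that come before the structure's first residue.

    Returns None if the first anchor_length residues of the structure are not
    found in the reference (e.g. because one of them is mutated).
    """
    anchor = structure_seq[:anchor_length]
    position = reference_seq.find(anchor)
    return None if position == -1 else position


def to_our_numbering(pose_number, offset):
    # Assumes the pose is numbered 1, 2, 3, ... with no internal gaps.
    return pose_number + offset


offset = numbering_offset("KEMPQPKTFGELKNLPLL")  # structure missing M, T, I
print(offset, to_our_numbering(1, offset))
```

This only handles missing residues at the N-terminus. Missing loops in the middle of a structure would shift the numbering again further along, which is where an alignment might still be needed.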


### Pandas functions
These ones are probably fine. They read in fasta data as a pandas Series or DataFrame. The DataFrame is good for alignments, because each aa gets a cell, so we can use pandas operations to find conservation etc.

```python
import pandas as pd

def FastaToSeries(path):
    # opens a fasta file, outputs a Series indexed by record name
    # full sequence string in one cell
    with open(path, 'r') as file:
        data = file.read()
    data = [i.split('\n') for i in data.split('>')]
    index = [i[0] for i in data][1:] # first item is ''
    values = [''.join(i[1:]) for i in data][1:] # first item is []
    series = pd.Series(values, index=index)
    return series

FastaToSeries('../data/sequences/Sequences.fasta').head()
```

```
3ekb_clean.pdb    TIKEMPQPKTFGELKNLPLLNTDKPVQALMKIADELGEIFKFEAPG...
2nnb_clean.pdb    KEMPQPKTFGELKNLPLLNTDKPVQALMKIADELGEIFKFEAPGRV...
3ben_clean.pdb    EMPQPKTFGELKNLPLLNTDKPVQALMKIADELGEIFKFEAPGRVT...
1yqo_clean.pdb    KEMPQPKTFGELKNLPLLNTDKPVQALMKIADELGEIFKFEAPGRV...
1p0v_clean.pdb    KEMPQPKTFGELKNLPLLNTDKPVQALMKIADELGEIFKFEAPGRV...
dtype: object
```

```python
def FastaToDataFrame(path):
    # opens a fasta file, outputs a DataFrame with one residue per cell
    # good for aligned sequences
    with open(path, 'r') as file:
        data = file.read()
    data = [i.split('\n') for i in data.split('>')]
    index = [i[0] for i in data][1:] # first item is ''
    values = [list(''.join(i[1:])) for i in data][1:] # first item is []
    df = pd.DataFrame(values, index=index)
    return df

FastaToDataFrame('../data/sequences/Sequences_msa.fasta').head()
```

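| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 469 | 470 | 471 | 472 | 473 | 474 | 475 | 476 | 477 | 478 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4dqk_clean.pdb | M | H | G | A | F | S | T | N | V | V | ... | A | - | - | - | - | - | - | - | - | - |
| 4dql_clean.pdb | - | - | G | A | F | S | T | N | V | V | ... | A | G | - | - | - | - | - | - | - | - |
| 3qi8_clean.pdb | - | - | - | - | - | - | - | - | - | - | ... | K | K | I | P | L | G | G | I | P | S |
| 3cbd_clean.pdb | - | - | - | - | - | - | - | - | - | - | ... | K | K | I | P | L | - | - | - | - | - |
| 3psx_clean.pdb | - | - | - | - | - | - | - | - | - | - | ... | K | K | I | P | L | - | - | - | - | - |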
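
With one residue per cell, per-column conservation is a short pandas operation. The sketch below is one possible definition, not a standard one: for each alignment column it returns the fraction of non-gap sequences that share the most common residue. `column_conservation` is a hypothetical helper and the three-row alignment is a toy, not our MSA.

```python
import pandas as pd


def column_conservation(msa_df, gap='-'):
    """Fraction of non-gap residues in each column that match the commonest one."""
    def fraction_most_common(column):
        # drop gaps and the None padding that shorter rows get
        residues = column[column != gap].dropna()
        if residues.empty:
            return float('nan')
        # value_counts() sorts by count, largest first
        return residues.value_counts().iloc[0] / len(residues)

    return msa_df.apply(fraction_most_common, axis=0)


toy_msa = pd.DataFrame([list('MKE-'), list('-KEA'), list('-KDA')],
                       index=['seq_a', 'seq_b', 'seq_c'])
conservation = column_conservation(toy_msa)
fully_conserved = list(conservation.index[conservation == 1.0])
print(conservation)
print(fully_conserved)
```

Runs of fully conserved columns might make reasonable anchors for working out a numbering offset, although columns covered by only one sequence also score 1.0 under this definition, so coverage would need checking as well.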


### Alignments
Still not sure this is a good idea! We can put the aligned sequences into a DataFrame with
```python
pd.DataFrame([list(seq1), list(seq2)])
```

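and we see that the mutations are offset from each other: a substitution shows up as a gap in each sequence, in neighbouring columns, rather than as two different letters in one column. Maybe this is solvable?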

```python
from Bio import pairwise2

sequences = FastaToSeries('../data/sequences/Sequences.fasta')


def AlignAgainstReferenceSequence(reference_seq, query_seq):
    # globalxx: match = 1, no mismatch or gap penalties; makes several alignments
    alignments = pairwise2.align.globalxx(reference_seq, query_seq)
    # each alignment is (seqA, seqB, score, begin, end)
    # max() keeps the first of the top-scoring alignments
    top_alignment = max(alignments, key=lambda aln: aln[2])

    dictionary = {'aln_reference_seq': top_alignment[0],
                  'aln_query_seq': top_alignment[1],
                  'aln_score': top_alignment[2]}
    return dictionary

# .iloc selects by position; the Series itself is indexed by file name
AlignAgainstReferenceSequence(sequences.iloc[0], sequences.iloc[1])
```

{'aln_reference_seq': 'TIKEMPQPKTFGELKNLPLLNTDKPVQALMKIADELGEIFKFEAPGRVTRYLSSQRLIKEACDESRFDKNLSQALKFVRDFAGDGLFTSWTHEKNWKKAHNILLPSFSQQAMKGYHAMMVDIAVQLVQKWERLNADEHIEVPEDMTRLTLDTIGLCGFNYRFNSFYRDQPHPFITSMVRALDEAMNKLQRAN--QFQEDIKVMNDLVDKIIADRKA---QSDDLLTHMLNGKDPETGEPLDDENIRYQIITFLIC-GHETTSGLLSFALYFLVKNPHVLQKAAEEAARVLVDPVPSYKQVKQLKYVGMVLNEALRLWPTAPAFSLYAKEDTVLGGEYPLEKGDELMVLIPQLHRDKTIWGDDVEEFRPERFENPSAIPQHAFKPFGNGQRACIGQ-QFALHEATLVLGMMLKHFDFEDHTNYELDIKETLTLKPEGFVVKAKSKKIPLG',
 'aln_query_seq': '--KEMPQPKTFGELKNLPLLNTDKPVQALMKIADELGEIFKFEAPGRVTRYLSSQRLIKEACDESRFDKNLSQALKFVRDFAGDGLFTSWTHEKNWKKAHNILLPSFSQQAMKGYHAMMVDIAVQLVQKWERLNADEHIEVPEDMTRLTLDTIGLCGFNYRFNSFYRDQPHPFITSMVRALDEAMNKL---NKRQFQEDIKVMNDLVDKIIADRKASGEQSDDLLTHMLNGKDPETGEPLDDENIRYQIITFLI-AGHETTSGLLSFALYFLVKNPHVLQKAAEEAARVLVDPVPSYKQVKQLKYVGMVLNEALRLWPTAPAFSLYAKEDTVLGGEYPLEKGDELMVLIPQLHRDKTIWGDDVEEFRPERFENPSAIPQHAFKPFGNGQRACIG-KQFALHEATLVLGMMLKHFDFEDHTNYELDIKETLTLKPEGFVVKAKSKKIPL-',
 'aln_score': 434.0}
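
In the output above, a substitution such as ```LIC-GHE``` / ```LI-AGHE``` ends up as two gaps rather than one mismatched column. That is probably because `globalxx` gives mismatches and gaps the same score of zero. A scoring scheme with gap penalties (for example `pairwise2.align.globalms`, with values that would need tuning) may keep substitutions in a single column. Assuming an alignment like that, reading off the mutations in the reference's numbering could be sketched as below. `mutations_from_alignment` is a hypothetical helper and the sequences are toy strings.

```python
def mutations_from_alignment(aln_reference_seq, aln_query_seq, gap='-'):
    """List substitutions as strings like 'E5A', numbered along the reference.

    Position 1 is the first residue of the (ungapped) reference sequence.
    Columns containing a gap are skipped, so insertions and deletions
    are not reported.
    """
    if len(aln_reference_seq) != len(aln_query_seq):
        raise ValueError('aligned sequences must be the same length')

    mutations = []
    reference_position = 0
    for ref_aa, query_aa in zip(aln_reference_seq, aln_query_seq):
        if ref_aa != gap:
            reference_position += 1  # only real reference residues are counted
        if gap in (ref_aa, query_aa):
            continue
        if ref_aa != query_aa:
            mutations.append(f'{ref_aa}{reference_position}{query_aa}')
    return mutations


print(mutations_from_alignment('MTIKEMPQ', '--IKAMPQ'))
print(mutations_from_alignment('MT-IK', 'MTGIR'))
```

If the reference sequence itself is missing residues at the N-terminus, its offset to our numbering still has to be added to the reported positions.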

```python
alnmnt = AlignAgainstReferenceSequence(sequences.iloc[0], sequences.iloc[1])

def ResidueConservation(seq1, seq2):
    # takes two aligned sequences, which must be the same length
    if len(seq1) != len(seq2):
        raise ValueError('aligned sequences must be the same length')
    # treat spaces as gaps
    seq1, seq2 = seq1.replace(' ', '-'), seq2.replace(' ', '-')
    # count alignment columns where both sequences have the same character
    conserved_residue_count = sum(a == b for a, b in zip(seq1, seq2))
    frac_conserved = conserved_residue_count / len(seq1)
    return frac_conserved


ResidueConservation(alnmnt['aln_reference_seq'], alnmnt['aln_query_seq'])
```

0.9665924276169265