# Train Data Contains Mutations
An important observation is that the train data contains mutations. This means that (1) we need to use GroupKFold in our CV, and (2) we need to focus model training on learning how mutations affect melting point (i.e. target `tm`). We can then apply that knowledge to the test data, because the test data is all mutations!

This notebook finds the train mutations using RAPIDS cuDF string functions. Specifically, we use `edit_distance_matrix()` to compute the Levenshtein edit distance between all pairs of rows. This finds mutations (substitutions), insertions, and deletions. The RAPIDS cuDF API docs are [here][1].
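To make the distance concrete, here is a minimal pure-Python sketch of Levenshtein distance and of an all-pairs matrix. It is only an illustration of what the distance measures, not how cuDF computes it on the GPU. The helper names `levenshtein` and `edit_distance_matrix_cpu` and the toy sequences are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character substitutions, insertions,
    and deletions needed to turn string a into string b."""
    # prev[j] is the distance between the prefix of a processed so far and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]


def edit_distance_matrix_cpu(seqs):
    """All-pairs distances as a list of lists (hypothetical CPU stand-in)."""
    return [[levenshtein(s, t) for t in seqs] for s in seqs]


# Toy sequences (made up): a base, a point mutation, and a deletion
seqs = ["MADEEK", "MADEEA", "MADEK"]
matrix = edit_distance_matrix_cpu(seqs)
for row in matrix:
    print(row)
```

A point mutation and a single deletion both sit at distance 1 from the base sequence, which is why one threshold on this distance can catch all three kinds of change.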

This notebook displays some groups and saves a new file `train_with_groups.csv` with a new column `group`, which identifies groups of similar proteins that are mutations, insertions, or deletions of each other. Before using the column `group` with GroupKFold, we must assign every row with `group=-1` to its own unique group number (otherwise all the `group=-1` rows will land in the same fold even though they are unrelated proteins). For example: `nrow = train.loc[train.group==-1].shape[0]`, then `mx = train.group.max()`, and finally `train.loc[train.group==-1,'group'] = np.arange(nrow) + mx + 1`.
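The three inline statements above can be put together as a small runnable sketch. The toy DataFrame is made up for illustration, and pandas is used here on the assumption that the saved CSV is read back on the CPU.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the saved train data; only the `group` column matters here.
train = pd.DataFrame({
    "seq_id": [0, 1, 2, 3, 4, 5],
    "group":  [-1, 0, 0, -1, 1, -1],
})

# Rows that were not matched to any other protein
ungrouped = train.group == -1
nrow = int(ungrouped.sum())
mx = train.group.max()

# Give each ungrouped row its own id, starting right after the largest real group
train.loc[ungrouped, "group"] = np.arange(nrow) + mx + 1

print(train.group.tolist())
```

After this step no row has `group=-1`, and every formerly ungrouped row is a group of size one.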

Enjoy!

[1]: https://docs.rapids.ai/api/cudf/stable/api_docs/api/cudf.core.column.string.StringMethods.edit_distance_matrix.html#cudf.core.column.string.StringMethods.edit_distance_matrix

# Load Train

In [1]:
import cudf, numpy as np
print('RAPIDS version',cudf.__version__)

RAPIDS version 21.10.01


In [2]:
train = cudf.read_csv('../input/novozymes-enzyme-stability-prediction/train.csv')
print('Train shape:', train.shape )
train.head()

Train shape: (31390, 5)


Unnamed: 0,seq_id,protein_sequence,pH,data_source,tm
0,0,AAAAKAAALALLGEAPEVVDIWLPAGWRQPFRVFRLERKGDGVLVG...,7.0,doi.org/10.1038/s41592-020-0801-4,75.7
1,1,AAADGEPLHNEEERAGAGQVGRSLPQESEEQRTGSRPRRRRDLGSR...,7.0,doi.org/10.1038/s41592-020-0801-4,50.5
2,2,AAAFSTPRATSYRILSSAGSGSTRADAPQVRRLHTTRDLLAKDYYA...,7.0,doi.org/10.1038/s41592-020-0801-4,40.5
3,3,AAASGLRTAIPAQPLRHLLQPAPRPCLRPFGLLSVRAGSARRSGLL...,7.0,doi.org/10.1038/s41592-020-0801-4,47.2
4,4,AAATKSGPRRQSQGASVRTFTPFYFLVEPVDTLSVRGSSVILNCSA...,7.0,doi.org/10.1038/s41592-020-0801-4,49.5


# Find Mutations, Insertions, Deletions
Below we find train mutations, insertions, and deletions. We loop over the observed protein sequence lengths, from most to least frequent. For each length, we compare all still-ungrouped rows whose length is within plus or minus `D_THRESHOLD` of it. There are 1965 unique lengths, so the for-loop has at most 1965 iterations (it stops early once a window holds at most one ungrouped row). For each iteration, we display the count `ct` of rows with exactly that length, then the count `ct2` of ungrouped rows within plus or minus `D_THRESHOLD` of that length, and lastly `dist_ct`, the number of rows that have another row at Levenshtein distance 1, 2, 3, and so on up to `M_THRESHOLD`.
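Before the full loop, here is a small sketch of the counting and grouping rule applied to a single window. The 6x6 distance matrix is made up, and `M_THRESHOLD` is set to a toy value of 2. The variable names mirror the notebook's loop, but this is an illustration on NumPy arrays, not the cuDF code itself.

```python
import numpy as np

M_THRESHOLD = 2  # toy value for this sketch

# Hypothetical symmetric distance matrix for 6 sequences in one length window
x = np.array([
    [0, 1, 2, 9, 9, 9],
    [1, 0, 1, 9, 9, 9],
    [2, 1, 0, 9, 9, 9],
    [9, 9, 9, 0, 1, 9],
    [9, 9, 9, 1, 0, 9],
    [9, 9, 9, 9, 9, 0],
])

# dist_ct: how many rows have at least one other row at distance exactly kk
mutation = [len(np.unique(np.where(x == kk)[0])) for kk in range(1, M_THRESHOLD + 1)]

# Rows with at least one neighbor within the threshold
y = np.unique(np.where((x > 0) & (x <= M_THRESHOLD))[0])

group = np.full(len(x), -1)
grp = 0
seen = set()
for j in y:
    if int(j) in seen:
        continue
    # Every row within M_THRESHOLD of row j (row j itself has distance 0)
    i = np.where(x[j] <= M_THRESHOLD)[0]
    seen.update(i.tolist())
    group[i] = grp
    grp += 1

print('dist_ct =', mutation)
print('group   =', group.tolist())
```

Rows 0 to 2 form one group, rows 3 and 4 form another, and the isolated last row keeps `group=-1`. Note that a group is built from the neighbors of its first unseen row only, so it is not a full transitive closure over the distance graph.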

In [3]:
train['x'] = train.protein_sequence.str.len()
vc = train.x.value_counts()
vc.head()

164    856
231    828
148    370
155    274
448    250
Name: x, dtype: int32

In [4]:
train['group'] = -1
grp = 0

# MUTATION THRESHOLD
M_THRESHOLD = 10
# INSERTION DELETION THRESHOLD
D_THRESHOLD = 3

for k in range(len(vc)):
    c = vc.index[k]
    # SUBSET OF TRAIN DATA WITH SAME PROTEIN LENGTH PLUS MINUS D_THRESHOLD
    tmp = train.loc[(train.x>=c-D_THRESHOLD)&(train.x<=c+D_THRESHOLD)&(train.group==-1)]
    if len(tmp)<=1: break
    # COMPUTE LEVENSHTEIN DISTANCE
    x = tmp.protein_sequence.str.edit_distance_matrix()
    x = np.array( x.to_pandas().values.tolist() )
    # COUNT HOW MANY MUTATIONS WE SEE
    mutation = []
    for kk in range(1,M_THRESHOLD+1):
        mutation.append( len( np.unique( np.where( x==kk )[0] ) ) )
    # FIND RELATED ROWS IN TRAIN WITH M_THRESHOLD MUTATIONS OR LESS
    y = np.unique( np.where( (x>0)&(x<=M_THRESHOLD) )[0] )
    seen = []
    for j in y:
        if j in seen: continue
        i = np.where( np.array(x[j,])<=M_THRESHOLD )[0]
        seen += list(i)
        idx = tmp.iloc[i].index
        train.loc[idx,'group'] = grp
        grp += 1
    ct = vc.iloc[k]
    ct2 = len(tmp)
    print(f'k={k} len={c} ct={ct} ct2={ct2} dist_ct={mutation}')
    #if k==9: break

k=0 len=164 ct=856 ct2=1140 dist_ct=[732, 878, 0, 0, 0, 0, 0, 2, 0, 63]
k=1 len=231 ct=828 ct2=1110 dist_ct=[776, 820, 0, 0, 0, 0, 0, 0, 0, 0]
k=2 len=148 ct=370 ct2=817 dist_ct=[371, 552, 0, 0, 2, 0, 0, 2, 0, 2]
k=3 len=155 ct=274 ct2=767 dist_ct=[351, 516, 0, 0, 2, 0, 0, 0, 50, 0]
k=4 len=448 ct=250 ct2=468 dist_ct=[145, 210, 0, 0, 0, 0, 0, 0, 0, 0]
k=5 len=455 ct=245 ct2=478 dist_ct=[157, 241, 0, 0, 0, 0, 0, 0, 0, 0]
k=6 len=157 ct=232 ct2=389 dist_ct=[114, 132, 0, 0, 0, 0, 0, 2, 0, 0]
k=7 len=268 ct=224 ct2=483 dist_ct=[154, 199, 0, 0, 0, 0, 0, 0, 2, 0]
k=8 len=246 ct=193 ct2=493 dist_ct=[101, 151, 0, 0, 2, 0, 0, 0, 0, 0]
k=9 len=159 ct=172 ct2=260 dist_ct=[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
k=10 len=150 ct=171 ct2=253 dist_ct=[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
k=11 len=170 ct=169 ct2=395 dist_ct=[30, 165, 0, 0, 2, 0, 0, 0, 0, 0]
k=12 len=96 ct=168 ct2=338 dist_ct=[62, 193, 0, 0, 0, 2, 0, 0, 0, 0]
k=13 len=537 ct=162 ct2=313 dist_ct=[133, 147, 0, 0, 136, 2, 0, 0, 0, 0]
k=14 len=142 ct=155 c

# Display Groups

In [5]:
for k in range(10):
    print('#'*25)
    print(f'### GROUP {k}')
    print('#'*25)
    display( train.loc[train.group==k] )

#########################
### GROUP 0
#########################


Unnamed: 0,seq_id,protein_sequence,pH,data_source,tm,x,group
3632,3632,MADEEALPPGWEKRMSRSSGRVYYFNHITNASQWERPSGNSSSGGK...,7.0,10.1002/pro.172,59.4,163,0
3633,3633,MADEEALPPGWEKRMSRSSGRVYYFNHITNASQWERPSGNSSSGGK...,7.0,,40.0,163,0
3635,3635,MADEEKAPPGWEKRMSRSSGRVYYFNHITNASQWERPSGNSSSGGK...,7.0,10.1002/pro.172,37.8,163,0
3636,3636,MADEEKIPPGWEKRMSRSSGRVYYFNHITNASQWERPSGNSSSGGK...,7.0,,40.0,163,0
3637,3637,MADEEKIPPGWEKRMSRSSGRVYYFNHITNASQWERPSGNSSSGGK...,7.0,10.1002/pro.172,49.3,163,0
...,...,...,...,...,...,...,...
3689,3689,MADEEKLPPGWEKRMSRSSGRVYYFNHITNGSQWERPSGNSSSGGK...,7.0,10.1002/pro.172,40.9,163,0
3690,3690,MADEEKLPPGWEKRMSRSSGRVYYLNHITNASQWERPSGNSSSGGK...,7.0,10.1002/pro.172,42.5,163,0
3691,3691,MADEEKLPPGWEKRMSRSSGRVYYYNHITNASQWERPSGNSSSGGK...,7.0,10.1002/pro.172,62.0,163,0
3692,3692,MADEEKNPPGWEKRMSRSSGRVYYFNHITNASQWERPSGNSSSGGK...,7.0,10.1002/pro.172,48.8,163,0


#########################
### GROUP 1
#########################


Unnamed: 0,seq_id,protein_sequence,pH,data_source,tm,x,group
4313,4313,MAETVADTRRLITKPQNLNDAYGPPSNFLEIDVSNPQTVGVGRGRF...,7.0,doi.org/10.1038/s41592-020-0801-4,45.1,161,1
4314,4314,MAETVADTRRLITKPQNLNDAYGPPSNFLEIDVSNPQTVGVGRGRF...,7.0,doi.org/10.1038/s41592-020-0801-4,46.8,161,1


#########################
### GROUP 2
#########################


Unnamed: 0,seq_id,protein_sequence,pH,data_source,tm,x,group
4502,4502,MAFVTTAEVCDANQELIRSGQLRALQPIFQIYGRRQIFSGPVVTVK...,7.0,doi.org/10.1038/s41592-020-0801-4,47.7,165,2
4503,4503,MAFVTTAEVCDANQEMIRSGQLRALQPVFQIYGRRQIFSGPVVTVK...,7.0,doi.org/10.1038/s41592-020-0801-4,46.6,165,2


#########################
### GROUP 3
#########################


Unnamed: 0,seq_id,protein_sequence,pH,data_source,tm,x,group
5770,5770,MAQEEEDVRDYNLTEEQKATKDKYPPVNRKYEYLDHTADVQLHAWG...,7.0,doi.org/10.1038/s41592-020-0801-4,47.5,167,3
5771,5771,MAQEEEDVRDYNLTEEQKATKDKYPPVNRKYEYLDHTADVQLHAWG...,7.0,doi.org/10.1038/s41592-020-0801-4,44.8,166,3


#########################
### GROUP 4
#########################


Unnamed: 0,seq_id,protein_sequence,pH,data_source,tm,x,group
6004,6004,MARPKVFFDLTAGGNPVGRVVMELRADVVPRTAENFRQLCTGQPGY...,7.0,doi.org/10.1038/s41592-020-0801-4,45.8,163,4
6005,6005,MARPKVFFDLTAGGSPVGRVVMELRADVVPRTAENFRQLCTGQPGY...,7.0,doi.org/10.1038/s41592-020-0801-4,46.9,163,4


#########################
### GROUP 5
#########################


Unnamed: 0,seq_id,protein_sequence,pH,data_source,tm,x,group
18020,18020,MNAFEMLRIDERLRLKIYKDTEGYYTIGIGHLLTKSPSLNAAKSEL...,2.0,10.1021/bi00535a054,38.1,164,5
18021,18021,MNAFEMLRIDERLRLKIYKDTEGYYTIGIGHLLTKSPSLNAAKSEL...,4.2,,53.3,164,5
18022,18022,MNAFEMLRIDERLRLKIYKDTEGYYTIGIGHLLTKSPSLNAAKSEL...,2.0,10.1038/334406a0,38.1,164,5
18023,18023,MNAFEMLRIDERLRLKIYKDTEGYYTIGIGHLLTKSPSLNAAKSEL...,6.5,10.1038/334406a0,62.9,164,5
18060,18060,MNCFEMLRIDERLRLKIYKDTEGYYTIGIGHLLTKSPSLNAAKSEL...,2.0,,41.9,164,5
...,...,...,...,...,...,...,...
19949,19949,MNWFEMLRIDERLRLKIYKDTEGYYTIGIGHLLTKSPSLNAAKSEL...,6.5,10.1038/334406a0,56.7,164,5
19950,19950,MNWFEMLRIDERLRLKIYKDTEGYYTIGIGHLLTKSPSLNAAKSEL...,2.0,10.1038/334406a0,25.5,164,5
19964,19964,MNYFEMLRIDERLRLKIYKDTEGYYTIGIGHLLTKSPSLNAAKSEL...,2.0,10.1038/334406a0,32.4,164,5
19965,19965,MNYFEMLRIDERLRLKIYKDTEGYYTIGIGHLLTKSPSLNAAKSEL...,6.5,10.1038/334406a0,58.8,164,5


#########################
### GROUP 6
#########################


Unnamed: 0,seq_id,protein_sequence,pH,data_source,tm,x,group
30804,30804,TMVAPFNGLKSSAAFPATRKANNDITSITSNGGRVNCMQVWPPIGK...,7.0,doi.org/10.1038/s41592-020-0801-4,63.9,161,6
30805,30805,TMVAPFTGLKSSASFPVTRKANNDITSITSNGGRVSCMKVWPPIGK...,7.0,doi.org/10.1038/s41592-020-0801-4,66.1,162,6


#########################
### GROUP 7
#########################


Unnamed: 0,seq_id,protein_sequence,pH,data_source,tm,x,group
13838,13838,MKFLQVLPALIPAALAQTSCDQWATFTGNGYTVSNNLWGASAGSGF...,8.0,10.1110/ps.0237703,54.9,234,7
13839,13839,MKFLQVLPALIPAALAQTSCDQWATFTGNGYTVSNNLWGASAGSGF...,8.0,10.1110/ps.0237703,56.9,234,7
13840,13840,MKFLQVLPALIPAALAQTSCDQWATFTGNGYTVSNNLWGASAGSGF...,8.0,10.1110/ps.0237703,54.5,234,7
13841,13841,MKFLQVLPALIPAALAQTSCDQWATFTGNGYTVSNNLWGASAGSGF...,8.0,10.1110/ps.0237703,54.9,234,7
13842,13842,MKFLQVLPALIPAALAQTSCDQWATFTGNGYTVSNNLWGASAGSGF...,8.0,10.1110/ps.0237703,54.7,234,7
13843,13843,MKFLQVLPALIPAALAQTSCDQWATFTGNGYTVSNNLWGASAGSGF...,8.0,10.1110/ps.03220403,56.5,234,7
13844,13844,MKFLQVLPALIPAALAQTSCDQWATFTGNGYTVSNNLWGASAGSGF...,8.0,10.1110/ps.03220403,58.3,234,7
13845,13845,MKFLQVLPALIPAALAQTSCDQWATFTGNGYTVSNNLWGASAGSGF...,8.0,10.1110/ps.03220403,54.5,234,7
13846,13846,MKFLQVLPALIPAALAQTSCDQWATFTGNGYTVSNNLWGASAGSGF...,8.0,10.1110/ps.0237703,54.9,234,7
13847,13847,MKFLQVLPALIPAALAQTSCDQWATFTGNGYTVSNNLWGASAGSGF...,8.0,10.1110/ps.0237703,54.6,234,7


#########################
### GROUP 8
#########################


Unnamed: 0,seq_id,protein_sequence,pH,data_source,tm,x,group
14056,14056,MKIAGADEAGRGPVIGPMVIAAVVVDENSLPKLEELKVRDSKKLTP...,9.0,10.1016/j.jmb.2008.02.039,79.3,228,8
14058,14058,MKIAGIDAAGRGPVIGPMVIAAVVVDENSLPKLEELKVRDSKKLTP...,9.0,,50.0,228,8
14059,14059,MKIAGIDAAGRGPVIGPMVIAAVVVDENSLPKLEELKVRDSKKLTP...,9.0,10.1021/bi060907v,93.2,228,8
14060,14060,MKIAGIDEAGRGPVAGPMVIAAVVVDENSLPKLEELKVRDSKKLTP...,9.0,10.1016/j.jmb.2008.02.039,82.3,228,8
14061,14061,MKIAGIDEAGRGPVIGPMVAAAVVVDENSLPKLEELKVRDSKKLTP...,9.0,10.1016/j.jmb.2008.02.039,76.1,228,8
14062,14062,MKIAGIDEAGRGPVIGPMVIAAVVVDENSLPKAEELKVRDSKKLTP...,9.0,10.1016/j.jmb.2008.02.039,72.6,228,8
14063,14063,MKIAGIDEAGRGPVIGPMVIAAVVVDENSLPKLEELKVRDSKKLTP...,9.0,10.1016/j.jmb.2008.02.039,75.8,228,8
14064,14064,MKIAGIDEAGRGPVIGPMVIAAVVVDENSLPKLEELKVRDSKKLTP...,9.0,10.1016/j.jmb.2008.02.039,78.9,228,8
14065,14065,MKIAGIDEAGRGPVIGPMVIAAVVVDENSLPKLEELKVRDSKKLTP...,9.0,10.1016/j.jmb.2008.02.039,73.8,228,8
14066,14066,MKIAGIDEAGRGPVIGPMVIAAVVVDENSLPKLEELKVRDSKKLTP...,9.0,10.1021/bi060907v,89.4,228,8


#########################
### GROUP 9
#########################


Unnamed: 0,seq_id,protein_sequence,pH,data_source,tm,x,group
16540,16540,MLVMTEYLLSAGICMAIVSILLIGMAISNVSKGQYAKRFFFFATSC...,7.0,,20.0,231,9
16541,16541,MLVMTEYLLSAGICMAIVSILLIGMAISNVSKGQYAKRFFFFATSC...,7.0,,20.0,231,9
16542,16542,MLVMTEYLLSAGICMAIVSILLIGMAISNVSKGQYAKRFFFFATSC...,7.0,10.1021/bi00006a025,50.9,231,9
16543,16543,MLVMTEYLLSAGICMAIVSILLIGMAISNVSKGQYAKRFFFFATSC...,7.0,,20.0,231,9
16544,16544,MLVMTEYLLSAGICMAIVSILLIGMAISNVSKGQYAKRFFFFATSC...,5.4,,40.0,231,9
...,...,...,...,...,...,...,...
17318,17318,MLVMTEYLLSAGICMAIVSILLIGMAISNVSKGQYAKRFFFFATSC...,7.0,,20.0,231,9
17319,17319,MLVMTEYLLSAGICMAIVSILLIGMAISNVSKGQYAKRFFFFATSC...,7.0,,20.0,231,9
17320,17320,MLVMTEYLLSAGICMAIVSILLIGMAISNVSKGQYAKRFFFFATSC...,7.0,,20.0,231,9
17321,17321,MLVMTEYLLSAGICMAIVSILLIGMAISNVSKGQYAKRFFFFATSC...,7.0,,20.0,231,9


# Save Groups

In [6]:
train = train.drop('x',axis=1)
train.to_csv('train_with_groups.csv',index=False)
train.head()

Unnamed: 0,seq_id,protein_sequence,pH,data_source,tm,group
0,0,AAAAKAAALALLGEAPEVVDIWLPAGWRQPFRVFRLERKGDGVLVG...,7.0,doi.org/10.1038/s41592-020-0801-4,75.7,-1
1,1,AAADGEPLHNEEERAGAGQVGRSLPQESEEQRTGSRPRRRRDLGSR...,7.0,doi.org/10.1038/s41592-020-0801-4,50.5,-1
2,2,AAAFSTPRATSYRILSSAGSGSTRADAPQVRRLHTTRDLLAKDYYA...,7.0,doi.org/10.1038/s41592-020-0801-4,40.5,-1
3,3,AAASGLRTAIPAQPLRHLLQPAPRPCLRPFGLLSVRAGSARRSGLL...,7.0,doi.org/10.1038/s41592-020-0801-4,47.2,-1
4,4,AAATKSGPRRQSQGASVRTFTPFYFLVEPVDTLSVRGSSVILNCSA...,7.0,doi.org/10.1038/s41592-020-0801-4,49.5,-1
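As a hedged sketch of how the `group` column might then be used, the example below runs scikit-learn's `GroupKFold` on a toy frame and checks that no group is split between the train and validation sides of a fold. The toy values are made up, and it assumes that rows with `group=-1` have already been given unique ids. In practice the frame would come from `pd.read_csv('train_with_groups.csv')`.

```python
import pandas as pd
from sklearn.model_selection import GroupKFold

# Toy stand-in for the saved file; `group` has no -1 values left
df = pd.DataFrame({
    "tm":    [59.4, 40.0, 37.8, 45.1, 46.8, 47.7, 47.5, 45.8, 38.1, 53.3],
    "group": [0,    0,    0,    1,    1,    2,    3,    4,    5,    5],
})

gkf = GroupKFold(n_splits=3)
folds = []
for fold, (trn_idx, val_idx) in enumerate(gkf.split(df, df.tm, groups=df.group)):
    trn_groups = set(df.group.iloc[trn_idx])
    val_groups = set(df.group.iloc[val_idx])
    # A protein family must never appear on both sides of the split
    assert not (trn_groups & val_groups)
    folds.append((trn_idx, val_idx))
    print(f'fold {fold}: validation groups = {sorted(val_groups)}')
```

Keeping each family in a single fold means validation scores reflect how well the model handles proteins it has not seen mutations of.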
