
Questions about evaluation metrics for sequence alignment #153

Closed
hughplay opened this issue Jan 11, 2024 · 4 comments

Comments

@hughplay

Hi,

Your work on sequence alignment is excellent and inspiring.

Recently, I tested DeepBLAST on MALIDUP and MALISAM and found that the results are indeed great. However, I am confused about how the F1 score in Table 2 is computed. I tried to reproduce the score with my own evaluation pipeline, as well as by computing the F1 from the tp, fp, and fn returned by the alignment_score function, but both results are far from the value given in the table. I think there must be a mistake in my evaluation code.

The code for evaluating one sample based on alignment_score looks like this:

# Imports are my assumption -- adjust the module paths to wherever
# load_model and alignment_score live in your deepblast install.
from deepblast.utils import load_model
from deepblast.score import alignment_score

EPS = 1e-8  # guard against division by zero

model = load_model("deepblast-v3.ckpt", "prot_t5_xl_uniref50").cuda()

true_alignment = ...  # ground-truth alignment for the pair
pred_alignment = model.align(primary_sequence, target_sequence)

# alignment_score returns per-pair counts; the first three are tp, fp, fn
scores = alignment_score(true_alignment, pred_alignment)
tp, fp, fn = scores[0], scores[1], scores[2]

precision = tp / (tp + fp + EPS)  # EPS added here too, to avoid ZeroDivisionError
recall = tp / (tp + fn + EPS)
f1 = 2 * precision * recall / (precision + recall + EPS)

Could you please provide guidance on the correct method for calculating the F1 score?

Thank you!

@mortonjt
Collaborator

Hi, it looks like you are using the same methods that I used for evaluation. The original notebooks can be found in the Zenodo record linked in the paper: https://doi.org/10.5281/zenodo.7731163

@hughplay
Author

hughplay commented Jan 18, 2024

Thank you very much! I have reproduced the results.

The reason I got wrong scores is that I first converted the alignment to another format; my conversion function was not well tested, and I obtained wrong alignment states for computing the scores. 😭

@hughplay
Author

hughplay commented Jan 19, 2024

I'm back again.

I find that the alignment score seems to be weird in some cases. From what I have observed, it happens when the alignment state string starts with "21", for example (MALIDUP, d1knca):

manual
SSITRSSVLDQEQLWGTLLASAAATRNPQVLADIGAEATDH-LSAAARHAALGAAAIMGMNNVFYRGRGFLE
:::::::::::::::::::::::::::::::::::::::::1::::::::::::::::::::::::::::::
MNIIANPGIPKANFELWSFAVSAINGCSHCLVAHEHTLRTVGVDREAIFEALKAAAIVSGVAQALATIEALS

deepblast
S-SITRSSVLDQEQLWGTLLASAAATRNPQVLADIGAEATDH-LSAAARHAALGAAAIM-GMNNVFYRGRGFLE
21::::::::::::::::::::::::::::::::::::::::1::::::::::::::::1:::::::::::2::
-MNIIANPGIPKANFELWSFAVSAINGCSHCLVAHEHTLRTVGVDREAIFEALKAAAIVSGVAQALATIEA-LS

true_edges:
[(0, 0), (1, 1), (2, 2), (3, 3), (4, 4), (5, 5), (6, 6), (7, 7), (8, 8), (9, 9), (10, 10), (11, 11), (12, 12), (13, 13), (14, 14), (15, 15), (16, 16), (17, 17), (18, 18), (19, 19), (20, 20), (21, 21), (22, 22), (23, 23), (24, 24), (25, 25), (26, 26), (27, 27), (28, 28), (29, 29), (30, 30), (31, 31), (32, 32), (33, 33), (34, 34), (35, 35), (36, 36), (37, 37), (38, 38), (39, 39), (40, 40), (41, 40), (42, 41), (43, 42), (44, 43), (45, 44), (46, 45), (47, 46), (48, 47), (49, 48), (50, 49), (51, 50), (52, 51), (53, 52), (54, 53), (55, 54), (56, 55), (57, 56), (58, 57), (59, 58), (60, 59), (61, 60), (62, 61), (63, 62), (64, 63), (65, 64), (66, 65), (67, 66), (68, 67), (69, 68), (70, 69), (71, 70)]
pred_edges:
[(0, 0), (1, 0), (2, 1), (3, 2), (4, 3), (5, 4), (6, 5), (7, 6), (8, 7), (9, 8), (10, 9), (11, 10), (12, 11), (13, 12), (14, 13), (15, 14), (16, 15), (17, 16), (18, 17), (19, 18), (20, 19), (21, 20), (22, 21), (23, 22), (24, 23), (25, 24), (26, 25), (27, 26), (28, 27), (29, 28), (30, 29), (31, 30), (32, 31), (33, 32), (34, 33), (35, 34), (36, 35), (37, 36), (38, 37), (39, 38), (40, 39), (41, 40), (42, 40), (43, 41), (44, 42), (45, 43), (46, 44), (47, 45), (48, 46), (49, 47), (50, 48), (51, 49), (52, 50), (53, 51), (54, 52), (55, 53), (56, 54), (57, 55), (58, 56), (59, 56), (60, 57), (61, 58), (62, 59), (63, 60), (64, 61), (65, 62), (66, 63), (67, 64), (68, 65), (69, 66), (70, 67), (70, 68), (71, 69), (72, 70)]

DeepBLAST predicts pretty well in this case, but the F1 score is 0. I am confused about the evaluation method. What are the edges? Why do we need to compute the edges first? And why is the F1 score 0 in this case?

@hughplay reopened this Jan 19, 2024
@mortonjt
Collaborator

Hi, the edges are the match coordinates between the two sequences.
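
For concreteness, here is a rough sketch (mine, not the deepblast implementation) of what "computing the edges first" looks like: decode the state string into (i, j) coordinate pairs, then score the overlap of the predicted and true edge sets. The helper names are hypothetical, the state conventions are inferred from the example above (':' = match, '2' = residue in the top sequence only, '1' = residue in the bottom sequence only), and as the printed lists show, the actual deepblast routine also emits entries for gap columns and may order the tuples differently.

# A rough sketch, not the deepblast implementation.
def states_to_edges(states):
    """Decode a state string into (i, j) match coordinates."""
    edges, i, j = [], 0, 0
    for s in states:
        if s == ':':    # match: both sequences consume a residue
            edges.append((i, j))
            i += 1
            j += 1
        elif s == '2':  # residue in the top sequence only
            i += 1
        elif s == '1':  # residue in the bottom sequence only
            j += 1
    return edges

def edge_f1(true_edges, pred_edges):
    """F1 over edge sets: an edge is a true positive only on an exact match."""
    true_set, pred_set = set(true_edges), set(pred_edges)
    tp = len(true_set & pred_set)
    fp = len(pred_set - true_set)
    fn = len(true_set - pred_set)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)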

Regarding the F1 score, if there is an off-by-one error, the F1 score can be zero even if the structural similarity is preserved. This is why F1 isn't a great metric (TM-score is more robust).
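
As a toy illustration of this failure mode, using the hypothetical edge_f1 sketch above: shifting every predicted edge by a single position empties the intersection with the true edge set, so tp = 0 and the F1 score collapses to zero even though the two alignments are nearly identical.

true_edges = [(i, i) for i in range(5)]      # a perfect diagonal alignment
pred_edges = [(i + 1, i) for i in range(5)]  # the same alignment shifted by one row
print(edge_f1(true_edges, pred_edges))       # 0.0 -- no edge matches exactly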

Regarding the edge alignments, indeed there are weird edge cases. This is partially due to the quirks surrounding indels -- the current gap-position-specific scoring setup isn't ideal. And we don't have a concept of affine gap scoring (it turns out to be highly non-trivial to set up for differentiable dynamic programming). See the DEDAL paper for a discussion of this.

Despite these setbacks, these edge cases don't seem to strongly affect the TM-score, since the superposition is still roughly the same.
