Given a pair of reads aligned to the reference/truth label:
read 1: ACT-G
read 2: ACT-G
label:  ACTAG
we leak the presence of an extra nucleotide in the label during training. Given only the two reads, there would be no need for the - gap in the reads.
During inference this gap would never be seen; the input would be:
read 1: ACTG
read 2: ACTG
but during training, the model could learn to predict the gap (and the exact base, from sequence context).
Therefore, the requirement is: a column that contains a gap (-) in every read but a base in the label (that is, a column that exists only because of the label) should not be sent during training (and will not occur during inference).
This requires tracking, for each row in the matrix, the coordinates that map it back to the truth.
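A minimal sketch of the filtering described above, assuming the reads and the label are available as equal-length gapped strings. The function name `drop_label_only_columns` and the returned `truth_coords` list are hypothetical, chosen for illustration, and are not taken from the project's code. It drops every column where all reads have a gap but the label has a base, and records for each kept column the corresponding position in the ungapped truth sequence.

```python
from typing import List, Optional, Tuple

GAP = "-"  # assumed gap character, matching the example in this issue


def drop_label_only_columns(
    reads: List[str], label: str, gap: str = GAP
) -> Tuple[List[str], str, List[Optional[int]]]:
    """Remove columns that exist only because of the label.

    Hypothetical helper: returns the filtered reads, the filtered label,
    and, for each kept column, its index in the ungapped truth sequence
    (None where the label itself has a gap).
    """
    if not reads:
        raise ValueError("at least one read is required")
    if any(len(r) != len(label) for r in reads):
        raise ValueError("reads and label must have the same aligned length")

    kept_reads: List[List[str]] = [[] for _ in reads]
    kept_label: List[str] = []
    truth_coords: List[Optional[int]] = []

    truth_pos = 0  # position in the ungapped truth sequence
    for col, label_base in enumerate(label):
        label_has_base = label_base != gap
        all_reads_gapped = all(r[col] == gap for r in reads)

        # A column is label-driven when only the label has a base there.
        if not (label_has_base and all_reads_gapped):
            for out, r in zip(kept_reads, reads):
                out.append(r[col])
            kept_label.append(label_base)
            truth_coords.append(truth_pos if label_has_base else None)

        # Advance the truth coordinate even for dropped columns, so kept
        # columns still map back to the right place in the truth.
        if label_has_base:
            truth_pos += 1

    return ["".join(r) for r in kept_reads], "".join(kept_label), truth_coords


reads, label, coords = drop_label_only_columns(["ACT-G", "ACT-G"], "ACTAG")
print(reads, label, coords)
```

For the example in this issue, the A column is dropped from both the reads and the label, and the coordinate list skips truth position 3, which is how the remaining columns can still be mapped back to the truth.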
This was brought to my attention by @danielecook