leakage and bias from training vs inference #2

brentp · 2023-02-10T19:56:13Z

This was brought to my attention by @danielecook

Given a pair of reads aligned to the reference/truth label:

read 1: ACT-G
read 2: ACT-G
label:  ACTAG

we are leaking the presence of an extra nucleotide in the label during training. Given only the 2 reads, there would be no need for the - gap in the reads.
During inference, this gap would never be seen. The input would be:

read 1: ACTG
read 2: ACTG

but during training, the model could learn to predict the gap (and the exact base from sequence context).

Therefore, a requirement is that a column with a gap (-) in every read but not in the label (a column that is therefore driven by the label alone) should not be sent during training (and will not occur during inference).
This requires tracking, for each row in the matrix, the coordinates that map it back to the truth.
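
A minimal sketch of that filter, assuming the reads and the label are plain strings of equal aligned length with `-` as the gap character. The function name `drop_label_only_columns` and the return shape are hypothetical and not taken from this repository; what to do with the truth base in a dropped column is not decided here.

```python
GAP = "-"


def drop_label_only_columns(reads, label):
    """Drop alignment columns in which every read has a gap.

    Returns the filtered reads, the filtered label, and the indices of the
    kept columns, so that each kept column can still be mapped back to its
    position in the original alignment (and from there to the truth).
    """
    if any(len(r) != len(label) for r in reads):
        raise ValueError("reads and label must share one aligned length")

    # A column is kept if at least one read has a real base there.
    keep = [
        i for i in range(len(label))
        if any(r[i] != GAP for r in reads)
    ]

    filtered_reads = ["".join(r[i] for i in keep) for r in reads]
    filtered_label = "".join(label[i] for i in keep)
    return filtered_reads, filtered_label, keep


reads, label, kept = drop_label_only_columns(["ACT-G", "ACT-G"], "ACTAG")
print(reads, label, kept)
```

For the example above, the column holding the extra `A` is dropped, so the training input has the same shape as what would be seen during inference. A column where only some reads have a gap is kept, since it is supported by the reads themselves.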
