leakage and bias from training vs inference #2

Open
brentp opened this issue Feb 10, 2023 · 0 comments

brentp commented Feb 10, 2023

This was brought to my attention by @danielecook

Given a pair of reads aligned to the reference/truth label:

read 1: ACT-G
read 2: ACT-G
label : ACTAG

we are leaking the presence of an extra nucleotide in the label during training. Given only the 2 reads, there would be no need for the - gap in the reads.
During inference, this gap would never be seen; the input would be:

read 1: ACTG
read 2: ACTG

but during training, the model could learn to predict the gap (and the exact base from sequence context).

Therefore, a requirement: a column where every read has a gap (-) and only the label has a base (so the column exists only because of the label) should not be sent during training (and will not occur during inference).
This requires tracking the coordinates for each row in the matrix so that they map back to the truth.
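
A minimal sketch of the filtering described above, assuming the reads and the label are equal-length aligned strings with `-` as the gap character. The function name `drop_label_only_columns` and the returned `kept` index list are hypothetical, not part of the existing code; `kept` is one possible way to keep coordinates that map back to the truth. What to do with the truth base in a dropped column is left open here.

```python
GAP = "-"

def drop_label_only_columns(reads, label):
    """Drop alignment columns in which every read is a gap.

    Such columns exist only because the label has a base there, so they
    would never appear at inference time. Returns the filtered reads, the
    filtered label, and the original column index of each kept column.
    """
    width = len(label)
    if any(len(r) != width for r in reads):
        raise ValueError("reads and label must share one alignment width")
    # Keep a column only if at least one read has a real base in it.
    kept = [i for i in range(width) if any(r[i] != GAP for r in reads)]
    filtered_reads = ["".join(r[i] for i in kept) for r in reads]
    filtered_label = "".join(label[i] for i in kept)
    return filtered_reads, filtered_label, kept

reads, label, kept = drop_label_only_columns(["ACT-G", "ACT-G"], "ACTAG")
print(reads)  # ['ACTG', 'ACTG']
print(label)  # ACTG
print(kept)   # [0, 1, 2, 4]
```

A column where only some reads have a gap (e.g. `AC-G` next to `ACTG`) is kept, since that gap is driven by the reads themselves and can occur during inference.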
