Filter low-complexity cloneIDs #41
I agree with you on using entropy, and from eyeballing the data the threshold looks good. I am going to play devil's advocate here and ask: should we set 1 as the threshold, or set it according to some other variable? For example, if we say that having fewer than 7 bases read is already too little, we could consider a cloneID with (30-7) bases of the same value and the remaining bases as different as possible. The entropy would then be:
So the entropy threshold is set either manually or taken from this minimum number of nucleotides read (I am not too fond of this last option). On a second note, how is the entropy of missing nucleotides computed?
I think we should just add an explicit filter for the number of nucleotides that we want to have in the cloneID. We shouldn’t try to set the entropy threshold so that it can work as such a length filter, we should just try to make it give acceptable results for what we consider low-complexity sequences. IMO, removal of sequences with many
Missing nucleotides are encoded as
These are good points, but I would prefer to leave the tuning of this to a subsequent PR, in particular because there are other, more pressing issues to work on. I’ll open an issue about this.
PR #36 had some code to filter out low-complexity cloneIDs. The suggested method is to filter a cloneID if `len(set(clone_id)) <= 1`. This discards cloneIDs that consist of only a single repeated nucleotide, such as `AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA`.
Maybe we can do a bit better and instead also remove cloneIDs like these (these come from real data):
A typical way to do this would be to compute the Shannon entropy for the frequency distribution of the characters in the string and to then discard cloneIDs for which the entropy is below a threshold.
To get an idea of what the threshold could be, here are some entropies for real data:
It seems that a threshold of 1 would be a pretty safe bet: We would definitely exclude the cloneIDs with a single repeated nucleotide, but also cover a couple of other cases that very much look like artifacts.
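A minimal sketch of how such a filter might look, assuming the entropy is measured in bits (log base 2) over the characters of the cloneID. The function names `shannon_entropy` and `is_low_complexity` and the example sequences are made up for illustration and are not taken from PR #36:

```python
import math
from collections import Counter


def shannon_entropy(s: str) -> float:
    """Shannon entropy (in bits) of the character frequency distribution of s."""
    if not s:
        return 0.0
    n = len(s)
    # Sum -p * log2(p) over the relative frequency p of each distinct character.
    return sum(-(count / n) * math.log2(count / n) for count in Counter(s).values())


def is_low_complexity(clone_id: str, threshold: float = 1.0) -> bool:
    """Return True if the cloneID should be discarded (entropy below threshold)."""
    return shannon_entropy(clone_id) < threshold


# Hypothetical cloneIDs, not real data
for clone_id in ["A" * 30, "A" * 28 + "CG", "ACGT" * 7 + "AC"]:
    print(clone_id, round(shannon_entropy(clone_id), 3), is_low_complexity(clone_id))
```

With this definition, a cloneID consisting of a single repeated nucleotide has entropy 0, and one using all four nucleotides equally often has entropy 2, so a threshold of 1 sits halfway between the two extremes.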