Filter low-complexity cloneIDs #41

Closed
marcelm opened this issue Oct 11, 2023 · 2 comments · Fixed by #42


@marcelm
Contributor

marcelm commented Oct 11, 2023

PR #36 had some code to filter out low-complexity cloneIDs. The suggested method is to filter a cloneID if len(set(clone_id)) <= 1. This discards cloneIDs that consist of only a single repeated nucleotide, such as AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA.

Maybe we can do a bit better and instead also remove cloneIDs like these (these come from real data):

GGAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAACAAAAAAAAA

A typical way to do this would be to compute the Shannon entropy for the frequency distribution of the characters in the string and to then discard cloneIDs for which the entropy is below a threshold.

To get an idea of what the threshold could be, here are some entropies (in bits) for real data:

AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA  0.0
AAAAAAAAAAAAAAAAAAAACAAAAAAAAA  0.21084230031853213
GGAAAAAAAAAAAAAAAAAAAAAAAAAAAA  0.35335933502142136
TCTAAAAAAAAAAAAAAAAAAAAAAAAAAA  0.5608251769947301
AAAAAAAAAAAAAAAAAAAAAAATTTCATA  0.7703437707962479
AAGGGGGAAGCAAGAAAATGGCCAAGGGAA  1.5376437149543918
TCTTGCGGCCGAGCTTAAAGTGGAATTTCC  1.9838711132181526

It seems that a threshold of 1 would be a pretty safe bet: We would definitely exclude the cloneIDs with a single repeated nucleotide, but also cover a couple of other cases that very much look like artifacts.
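
A minimal sketch of the proposed filter, using only the standard library. The function names `shannon_entropy` and `is_low_complexity` are hypothetical and not taken from the code base; the default threshold of 1 is the value suggested above. The entropy is computed in bits (base 2) over the character frequencies, which reproduces the values in the table.

```python
import math
from collections import Counter


def shannon_entropy(clone_id: str) -> float:
    """Shannon entropy (in bits) of the character frequencies in clone_id."""
    if not clone_id:
        return 0.0
    total = len(clone_id)
    counts = Counter(clone_id)
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())


def is_low_complexity(clone_id: str, threshold: float = 1.0) -> bool:
    """Return True if the entropy of clone_id is below the threshold."""
    return shannon_entropy(clone_id) < threshold


if __name__ == "__main__":
    for cid in (
        "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA",
        "GGAAAAAAAAAAAAAAAAAAAAAAAAAAAA",
        "AAGGGGGAAGCAAGAAAATGGCCAAGGGAA",
    ):
        print(cid, shannon_entropy(cid), is_low_complexity(cid))
```

With this threshold, the first five cloneIDs in the table would be discarded and the last two kept.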

@acorbat
Collaborator

acorbat commented Oct 11, 2023

I agree with using entropy, and from eyeballing the data, the threshold looks good.

I am going to be the devil's advocate here and ask: Should we set 1 as the threshold, or derive it from some other variable?

For example, if we say that fewer than 7 bases read is already too few, we can consider a cloneID of length 30 in which (30-7) bases are identical and the remaining 7 bases are as varied as possible (split 2, 2 and 3 over the other three nucleotides). Its entropy would be:

>>> from scipy.stats import entropy
>>> entropy([(30-7)/30, 2/30, 2/30, 3/30], base=2)
1.146996845892693

So the entropy threshold is either set manually or derived from this minimum number of nucleotides read (I am not too fond of the latter option).
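
The calculation above can be generalized to other cloneID lengths and minimum base counts. This is only a sketch: the function name `entropy_with_dominant_base` and its parameters are made up for illustration, and it assumes a four-letter alphabet with the non-dominant bases spread as evenly as possible.

```python
import math


def entropy_with_dominant_base(
    length: int, other_bases: int, alphabet_size: int = 4
) -> float:
    """Entropy (in bits) of a sequence in which length - other_bases positions
    hold one nucleotide and the remaining other_bases positions are spread as
    evenly as possible over the other alphabet_size - 1 nucleotides."""
    if not 0 <= other_bases <= length:
        raise ValueError("other_bases must be between 0 and length")
    dominant = length - other_bases
    # Distribute the remaining positions as evenly as possible, e.g. 7 -> 3, 2, 2
    q, r = divmod(other_bases, alphabet_size - 1)
    counts = [dominant] + [q + 1] * r + [q] * (alphabet_size - 1 - r)
    # Characters that do not occur contribute nothing to the entropy
    return sum(-(n / length) * math.log2(n / length) for n in counts if n > 0)


if __name__ == "__main__":
    # Same case as the scipy call above: 23 identical bases, 7 others
    print(entropy_with_dominant_base(30, 7))
```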

On a second note, how is the entropy of missing nucleotides computed?

  • If we consider them as an extra base, then entropy increases. (not good in my opinion)
  • If we remove them, entropy decreases and we are a bit more strict, or we could rescale entropy.
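
To make the two options concrete, here is a small sketch that computes the entropy both ways. It assumes that missing positions are marked with `-`; the constant `MISSING_CHARS` and the function names are hypothetical. Rescaling is not covered.

```python
import math
from collections import Counter

MISSING_CHARS = "-"  # assumption: placeholder character for unread positions


def shannon_entropy(chars: str) -> float:
    """Shannon entropy (in bits) of the character frequencies in chars."""
    if not chars:
        return 0.0
    total = len(chars)
    return sum(-(n / total) * math.log2(n / total) for n in Counter(chars).values())


def entropy_missing_as_symbol(clone_id: str) -> float:
    """Option 1: treat the missing-nucleotide placeholder as an extra symbol."""
    return shannon_entropy(clone_id)


def entropy_missing_removed(clone_id: str) -> float:
    """Option 2: drop missing positions before computing the entropy."""
    kept = "".join(c for c in clone_id if c not in MISSING_CHARS)
    return shannon_entropy(kept)


if __name__ == "__main__":
    cid = "A" * 20 + "-" * 10
    print(entropy_missing_as_symbol(cid))  # higher: "-" counts as a symbol
    print(entropy_missing_removed(cid))    # zero: only A remains
```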

@marcelm
Contributor Author

marcelm commented Oct 11, 2023

I think we should just add an explicit filter for the number of nucleotides that we want to have in the cloneID. We shouldn’t try to set the entropy threshold so that it doubles as a length filter; we should just try to make it give acceptable results for what we consider low-complexity sequences.

IMO, removal of sequences with many - characters is just a side-effect of computing the entropy on the entire cloneID. It’s not what was intended, but it’s ok to filter them because we want to remove them anyway.
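
A rough sketch of what keeping the two filters separate could look like. The name `passes_filters` and its parameters are hypothetical; `min_nucleotides` has no default because no value has been agreed on here. Missing positions are assumed to be encoded as `-` or `0`, and the entropy is computed on the entire cloneID, including those characters.

```python
import math
from collections import Counter

MISSING_CHARS = "-0"  # assumption: characters that mark missing nucleotides


def shannon_entropy(chars: str) -> float:
    """Shannon entropy (in bits) of the character frequencies in chars."""
    if not chars:
        return 0.0
    total = len(chars)
    return sum(-(n / total) * math.log2(n / total) for n in Counter(chars).values())


def passes_filters(
    clone_id: str, min_nucleotides: int, min_entropy: float = 1.0
) -> bool:
    """Keep a cloneID only if it has enough nucleotides and is not low-complexity."""
    # Explicit length filter: count positions that are not missing
    n_read = sum(1 for c in clone_id if c not in MISSING_CHARS)
    if n_read < min_nucleotides:
        return False
    # Separate complexity filter on the entire cloneID
    return shannon_entropy(clone_id) >= min_entropy


if __name__ == "__main__":
    print(passes_filters("TCTTGCGGCCGAGCTTAAAGTGGAATTTCC", min_nucleotides=7))
    print(passes_filters("ACGT" + "-" * 26, min_nucleotides=7))
```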

> On a second note, how is the entropy of missing nucleotides computed?
>
>   • If we consider them as an extra base, then entropy increases. (not good in my opinion)

Missing nucleotides are encoded as - or 0, so they make entropy go up. But this is not an issue because the point of this filter is not to catch every bad cloneID, but to catch some more than before (and the baseline is that we only filter out those with entropy 0). If the entropy is "too high", we just don’t filter the sequence, which means we’re not doing worse than before.

>   • If we remove them, entropy decreases and we are a bit more strict, or we could rescale entropy.

These are good points, but I would prefer to leave the tuning of this to a subsequent PR, in particular because there are other, more pressing issues to work on. I’ll open an issue about this.
