Filter low-complexity cloneIDs #41

marcelm · 2023-10-11T11:23:29Z

PR #36 had some code to filter out low-complexity cloneIDs. The suggested method is to filter a cloneID if len(set(clone_id)) <= 1. This discards cloneIDs that consist of only a single repeated nucleotide, such as AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA.

Maybe we can do a bit better and also remove cloneIDs like these (taken from real data):

GGAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAACAAAAAAAAA

A typical way to do this would be to compute the Shannon entropy for the frequency distribution of the characters in the string and to then discard cloneIDs for which the entropy is below a threshold.
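A minimal sketch of that computation, using only the Python standard library. The function name `shannon_entropy` is made up for illustration and is not taken from PR #36. Entropy is measured in bits (log base 2), and an empty string is assumed to have entropy 0.

```python
from collections import Counter
from math import log2


def shannon_entropy(s: str) -> float:
    """Shannon entropy (in bits) of the character frequencies in s."""
    if not s:
        return 0.0  # assumption: treat an empty string as zero entropy
    n = len(s)
    # A character that occurs c times has probability p = c / n and
    # contributes p * log2(1 / p) to the entropy.
    return sum((c / n) * log2(n / c) for c in Counter(s).values())


print(shannon_entropy("A" * 30))         # a single repeated nucleotide
print(shannon_entropy("GG" + "A" * 28))  # two deviating nucleotides
```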

To get an idea of what the threshold could be, here are some entropies for real data:

AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA  0.0
AAAAAAAAAAAAAAAAAAAACAAAAAAAAA  0.21084230031853213
GGAAAAAAAAAAAAAAAAAAAAAAAAAAAA  0.35335933502142136
TCTAAAAAAAAAAAAAAAAAAAAAAAAAAA  0.5608251769947301
AAAAAAAAAAAAAAAAAAAAAAATTTCATA  0.7703437707962479
AAGGGGGAAGCAAGAAAATGGCCAAGGGAA  1.5376437149543918
TCTTGCGGCCGAGCTTAAAGTGGAATTTCC  1.9838711132181526

It seems that a threshold of 1 would be a pretty safe bet: We would definitely exclude the cloneIDs with a single repeated nucleotide, but also cover a couple of other cases that very much look like artifacts.
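As a rough sketch of what such a filter could look like: the names `is_low_complexity` and `ENTROPY_THRESHOLD` are hypothetical, and whether a cloneID with entropy exactly equal to the threshold is discarded is an open choice (here it is kept).

```python
from collections import Counter
from math import log2

ENTROPY_THRESHOLD = 1.0  # the value proposed in this issue


def entropy_bits(s: str) -> float:
    """Shannon entropy (in bits) of the character frequencies in s."""
    n = len(s)
    return sum((c / n) * log2(n / c) for c in Counter(s).values()) if n else 0.0


def is_low_complexity(clone_id: str, threshold: float = ENTROPY_THRESHOLD) -> bool:
    """True if the cloneID's entropy is strictly below the threshold."""
    return entropy_bits(clone_id) < threshold


# The seven cloneIDs listed in this comment
clone_ids = [
    "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA",
    "AAAAAAAAAAAAAAAAAAAACAAAAAAAAA",
    "GGAAAAAAAAAAAAAAAAAAAAAAAAAAAA",
    "TCTAAAAAAAAAAAAAAAAAAAAAAAAAAA",
    "AAAAAAAAAAAAAAAAAAAAAAATTTCATA",
    "AAGGGGGAAGCAAGAAAATGGCCAAGGGAA",
    "TCTTGCGGCCGAGCTTAAAGTGGAATTTCC",
]
kept = [c for c in clone_ids if not is_low_complexity(c)]
print(kept)
```

With a threshold of 1, only the last two cloneIDs in the list are kept.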

acorbat · 2023-10-11T12:49:44Z

I agree with you on using entropy, and from eyeballing the data, the threshold looks good.

I am going to be the devil's advocate here and ask: Should we set 1 as a threshold or set it according to some other variable?

For example, if we say that having fewer than 7 bases read is already too few, we could consider a cloneID in which (30-7) bases have the same value and the remaining 7 bases are as different as possible. Its entropy would be:

>>> from scipy.stats import entropy
>>> entropy([(30-7)/30, 2/30, 2/30, 3/30], base=2)
1.146996845892693

So the entropy threshold would either be set manually or be derived from this minimum number of nucleotides read (I am not too fond of the latter option).
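To make the second option concrete, here is a small sketch that generalizes the calculation above without scipy. The function name `entropy_at_min_bases` and its parameters are hypothetical. It assumes one base fills `length - min_bases` positions and the other positions are spread as evenly as possible over the remaining three bases.

```python
from math import log2


def entropy_at_min_bases(length: int = 30, min_bases: int = 7,
                         alphabet_size: int = 4) -> float:
    """Entropy (bits) of a sequence with `length - min_bases` copies of one
    base and `min_bases` positions spread evenly over the other bases."""
    others = alphabet_size - 1
    base, extra = divmod(min_bases, others)
    # The first `extra` of the other bases get one additional position each
    counts = [length - min_bases]
    counts += [base + (1 if i < extra else 0) for i in range(others)]
    return sum((c / length) * log2(length / c) for c in counts if c > 0)


# Same setup as the scipy call above: counts 23, 3, 2, 2 out of 30
print(entropy_at_min_bases(30, 7))
```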

On a second note, how is the entropy of missing nucleotides computed?

If we consider them as an extra base, then entropy increases. (not good in my opinion)
If we remove them, entropy decreases and we are a bit more strict, or we could rescale entropy.
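A short sketch comparing the first two options. It assumes missing nucleotides are encoded as - or 0; the helper names and the example cloneID are made up. Rescaling is not shown because its exact form is not specified here.

```python
from collections import Counter
from math import log2

MISSING = "-0"  # assumption: characters that stand for missing nucleotides


def entropy_bits(s: str) -> float:
    """Shannon entropy (in bits) of the character frequencies in s."""
    n = len(s)
    return sum((c / n) * log2(n / c) for c in Counter(s).values()) if n else 0.0


def entropy_without_missing(clone_id: str) -> float:
    """Entropy computed only over the nucleotides that were actually read."""
    observed = "".join(ch for ch in clone_id if ch not in MISSING)
    return entropy_bits(observed)


clone_id = "A" * 20 + "-" * 10  # made-up example, not from real data
print(entropy_bits(clone_id))             # missing counted as an extra symbol
print(entropy_without_missing(clone_id))  # missing removed first
```

For this example the first value is larger than the second, which illustrates the concern that counting missing nucleotides as an extra base raises the entropy.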

marcelm · 2023-10-11T13:17:15Z

I think we should just add an explicit filter for the number of nucleotides that we want to have in the cloneID. We shouldn’t try to set the entropy threshold so that it doubles as such a length filter; we should just make it give acceptable results for what we consider low-complexity sequences.
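Such an explicit filter could be as simple as the sketch below. The name `has_enough_nucleotides` is hypothetical, the default of 7 is only borrowed from the example in the previous comment, and missing nucleotides are assumed to be encoded as - or 0.

```python
MISSING = "-0"  # assumption: characters that stand for missing nucleotides


def has_enough_nucleotides(clone_id: str, min_bases: int = 7) -> bool:
    """True if at least `min_bases` positions of the cloneID were read."""
    n_read = sum(ch not in MISSING for ch in clone_id)
    return n_read >= min_bases


print(has_enough_nucleotides("ACGTACG" + "-" * 23))  # 7 nucleotides read
print(has_enough_nucleotides("ACGTAC" + "-" * 24))   # 6 nucleotides read
```

Keeping this check separate from the entropy filter means each threshold can be tuned on its own.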

IMO, removal of sequences with many - characters is just a side-effect of computing the entropy on the entire cloneID. It’s not what was intended, but it’s ok to filter them because we want to remove them anyway.

> On a second note, how is the entropy of missing nucleotides computed?
>
> If we consider them as an extra base, then entropy increases. (not good in my opinion)

Missing nucleotides are encoded as - or 0, so they make entropy go up. But this is not an issue because the point of this filter is not to catch every bad cloneID, but to catch some more than before (and the baseline is that we only filter out those with entropy 0). If the entropy is "too high", we just don’t filter the sequence, which means we’re not doing worse than before.

> If we remove them, entropy decreases and we are a bit more strict, or we could rescale entropy.

These are good points, but I would prefer to leave the tuning of this to a subsequent PR, in particular because there are other, more pressing issues to work on. I’ll open an issue about this.

marcelm mentioned this issue Oct 11, 2023

Add low-complexity filtering #42

Merged

marcelm mentioned this issue Oct 11, 2023

Tune low-complexity filtering #43

Open

marcelm closed this as completed in #42 Nov 1, 2023
