Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cov1 Multiple Sequence Alignment #36

Closed
ababaian opened this issue Apr 13, 2020 · 9 comments
Closed

cov1 Multiple Sequence Alignment #36

ababaian opened this issue Apr 13, 2020 · 9 comments
Labels
Bioinformatics Bioinformatics task

Comments

@ababaian
Copy link
Owner

ababaian commented Apr 13, 2020

Preamble

See #27

  1. cov0r.fa
    • generate suffix trees for faster subsequnce searching (python implementation)
    • compute for forward sequences only (we can reverse the incoming SAM reads to check for reverse matches)
    • compute for unique sequences only (out of 33,296 forward sequences, only 24,865 are unique)

Originally posted by @victorlin in #27 (comment)

The patched version of cov1r now removes perfect-match duplicates of sequences.

Objective

To further optimize cov1r, generate a multiple-sequence alignment of the current forward sequences and prune sequences that are >99% (or any arbitrary cut-off) similar. Since we don't need an exact match to 'detect' CoV reads, we can narrow the search space while retaining high levels of sensitivity.

@ababaian ababaian added the Bioinformatics Bioinformatics task label Apr 13, 2020
@victorlin
Copy link
Collaborator

The concept of MSA is still new to me, but it seems that there are several MSA tools available. However, the filtering methods seem more advanced than a % similarity cutoff (see parameters for CLUSTALW).

@ababaian
Copy link
Owner Author

Running any of those MSA on 24K sequences is non trivial. I'm suggesting we use some sort of similarity % as a heuristic to remove sequences that are highly repetitive.

As an aside: In theory a graph-based pan-genome is what we want but the over-head to make something like that is very large.

@victorlin
Copy link
Collaborator

victorlin commented Apr 14, 2020

Levenshtein distance (edit distance) could be used calculate similarity between two sequences. (and trivially for 24K sequences, non-efficiently). Example:

def compare(seq1, seq2, threshold=0.99):
    max_len = max(len(seq1), len(seq2))
    similarity = 1 - edit_distance(seq1, seq2) / max_len
    if similarity >= threshold:
        return seq1
    else:
        return (seq1, seq2)

Summary:
If seq1 and seq2 are similar under threshold, keep seq1.
Else, keep both sequences.

There are some decisions to make:

  1. Which sequence to keep if similarity is detected? Sequence length might be a good way to decide.
  2. Given similarity(A, B) = 0.99, similarity(B, C) = 0.99, similarity(A, C) = 0.98 : what should be filtered out? With this simple example, results would be different based on order of comparison:
compare(A,B) -> A
compare(A,C) -> A, C

result: {A, C}
compare(B,C) -> B
compare(B,A) -> A

result: {A}
compare(A,C) -> A, C
compare(B,A) -> B
compare(B,C) -> B

result: {B}

@ababaian
Copy link
Owner Author

ababaian commented Apr 14, 2020

Agreed that keeping the longer sequence is the correct choice.

How does edit distance work with sub-sequences that are not at the same start site?
i.e.

AATAACGATAGACCCATA
vs.
-------ATACA-----------

we don't know without first performing a global alignment how the two sequences line up

@victorlin
Copy link
Collaborator

Edit distance actually shows the domain for all potential best-case alignments. You can try your example using this online tool: http://www.let.rug.nl/~kleiweg/lev/.

The number in the bottom right of the matrix is the resulting edit distance metric.

Here is an example alignment:

AATAACGATAGACCCATA
       AT  A  C  A

@ababaian
Copy link
Owner Author

Hrmm, not sure if that's appropriate if it can't deal with mis-matches. It might be a quick/dirty heuristic so maybe worth checking to get rid of 'obvious' similar sequences

AATAACGATAGACCCATA
-------ATAcA--------

@ababaian ababaian changed the title cov1r Multiple Sequence Alignment cov1 Multiple Sequence Alignment Apr 15, 2020
@ababaian
Copy link
Owner Author

Note to self (or anyone working on this):

The most important question to consider is, "Are we able to pick up divergent sequences". Using bowtie2 --very-sensitive-local we pick up 80% of reads at 90% similarity and 10% of reads at 70% similarity (see figure 1 here ). This first pass alignment is essentially used to determine if a SRA library contains potential CoV. We will go back to it and do either de novo or directed transcriptome assembly to create CoV contigs which we can then align against the pan-CoV genome again to validate new CoV species.

If you interpret this curve, I now want to prune CoV sequences that are say 95%+ similar to one another to reduce the search space and speed up alignment. This should have a negligible cost to overall sensitivity.

Another approach would be to N-mask any perfect repeats of say 200-300 nt in the pan-genome, retaining only a single copy. We in a way 'destroy' the genomes that are there but it may work as a much more efficient 'bait' sequence for determining CoV+ or CoV- libraries.

@brietaylor
Copy link
Collaborator

Just gonna put it out there that I LOVE the way you documented that experiment. Super easy to follow your reasoning through methods through to results, especially as a non-scientist. Super inspiring and makes me want to strive to do things better.

@ababaian
Copy link
Owner Author

ababaian commented Apr 18, 2020

So this exactly can be solved with USEARCH . We have permission to use the 32bit binary from here

From Robert Edgar

#The clustered database was made with usearch:

usearch -sortbylength cov1.fa \
   -minseqlength 100 \
   -fastaout cov1.bylength.fa \

usearch -cluster_smallmem cov1.bylength.fa \
   -id 0.99 \
   -maxaccepts 4 \
   -maxrejects 64 \
   -maxhits 1 \
   -uc cov1.id99.uc \
   -centroids cov1.id99.fa

Need to implement this still and probabaly update version to cov2/cov2r

@mathemage mathemage added this to Open Tasks in TODO List via automation Apr 18, 2020
@rcedgar rcedgar closed this as completed Apr 20, 2020
TODO List automation moved this from Open Tasks to Completed Tasks Apr 20, 2020
@ababaian ababaian removed this from Completed Tasks in TODO List May 15, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bioinformatics Bioinformatics task
Projects
None yet
Development

No branches or pull requests

4 participants