cov1 Multiple Sequence Alignment #36

ababaian · 2020-04-13T20:28:57Z

Preamble

cov0r.fa
- generate suffix trees for faster subsequnce searching (python implementation)
- compute for forward sequences only (we can reverse the incoming SAM reads to check for reverse matches)
- compute for unique sequences only (out of 33,296 forward sequences, only 24,865 are unique)

Originally posted by @victorlin in #27 (comment)

The patched version of cov1r now removes perfect-match duplicates of sequences.

Objective

To further optimize cov1r, generate a multiple-sequence alignment of the current forward sequences and prune sequences that are >99% (or any arbitrary cut-off) similar. Since we don't need an exact match to 'detect' CoV reads, we can narrow the search space while retaining high levels of sensitivity.

The text was updated successfully, but these errors were encountered:

victorlin · 2020-04-14T05:22:22Z

The concept of MSA is still new to me, but it seems that there are several MSA tools available. However, the filtering methods seem more advanced than a % similarity cutoff (see parameters for CLUSTALW).

ababaian · 2020-04-14T17:55:47Z

Running any of those MSA on 24K sequences is non trivial. I'm suggesting we use some sort of similarity % as a heuristic to remove sequences that are highly repetitive.

As an aside: In theory a graph-based pan-genome is what we want but the over-head to make something like that is very large.

victorlin · 2020-04-14T19:36:58Z

Levenshtein distance (edit distance) could be used calculate similarity between two sequences. (and trivially for 24K sequences, non-efficiently). Example:

def compare(seq1, seq2, threshold=0.99):
    max_len = max(len(seq1), len(seq2))
    similarity = 1 - edit_distance(seq1, seq2) / max_len
    if similarity >= threshold:
        return seq1
    else:
        return (seq1, seq2)

Summary:
If seq1 and seq2 are similar under threshold, keep seq1.
Else, keep both sequences.

There are some decisions to make:

Which sequence to keep if similarity is detected? Sequence length might be a good way to decide.
Given similarity(A, B) = 0.99, similarity(B, C) = 0.99, similarity(A, C) = 0.98 : what should be filtered out? With this simple example, results would be different based on order of comparison:

compare(A,B) -> A
compare(A,C) -> A, C

result: {A, C}

compare(B,C) -> B
compare(B,A) -> A

result: {A}

compare(A,C) -> A, C
compare(B,A) -> B
compare(B,C) -> B

result: {B}

ababaian · 2020-04-14T19:43:16Z

Agreed that keeping the longer sequence is the correct choice.

How does edit distance work with sub-sequences that are not at the same start site?
i.e.

AATAACGATAGACCCATA
vs.
-------ATACA-----------

we don't know without first performing a global alignment how the two sequences line up

victorlin · 2020-04-14T20:07:27Z

Edit distance actually shows the domain for all potential best-case alignments. You can try your example using this online tool: http://www.let.rug.nl/~kleiweg/lev/.

The number in the bottom right of the matrix is the resulting edit distance metric.

Here is an example alignment:

AATAACGATAGACCCATA
       AT  A  C  A

ababaian · 2020-04-14T20:11:49Z

Hrmm, not sure if that's appropriate if it can't deal with mis-matches. It might be a quick/dirty heuristic so maybe worth checking to get rid of 'obvious' similar sequences

AATAACGATAGACCCATA
-------ATAcA--------

ababaian · 2020-04-17T18:37:51Z

Note to self (or anyone working on this):

The most important question to consider is, "Are we able to pick up divergent sequences". Using bowtie2 --very-sensitive-local we pick up 80% of reads at 90% similarity and 10% of reads at 70% similarity (see figure 1 here ). This first pass alignment is essentially used to determine if a SRA library contains potential CoV. We will go back to it and do either de novo or directed transcriptome assembly to create CoV contigs which we can then align against the pan-CoV genome again to validate new CoV species.

If you interpret this curve, I now want to prune CoV sequences that are say 95%+ similar to one another to reduce the search space and speed up alignment. This should have a negligible cost to overall sensitivity.

Another approach would be to N-mask any perfect repeats of say 200-300 nt in the pan-genome, retaining only a single copy. We in a way 'destroy' the genomes that are there but it may work as a much more efficient 'bait' sequence for determining CoV+ or CoV- libraries.

brietaylor · 2020-04-17T19:45:33Z

Just gonna put it out there that I LOVE the way you documented that experiment. Super easy to follow your reasoning through methods through to results, especially as a non-scientist. Super inspiring and makes me want to strive to do things better.

ababaian · 2020-04-18T00:09:03Z

So this exactly can be solved with USEARCH . We have permission to use the 32bit binary from here

From Robert Edgar

#The clustered database was made with usearch:

usearch -sortbylength cov1.fa \
   -minseqlength 100 \
   -fastaout cov1.bylength.fa \

usearch -cluster_smallmem cov1.bylength.fa \
   -id 0.99 \
   -maxaccepts 4 \
   -maxrejects 64 \
   -maxhits 1 \
   -uc cov1.id99.uc \
   -centroids cov1.id99.fa

Need to implement this still and probabaly update version to cov2/cov2r

ababaian added the Bioinformatics Bioinformatics task label Apr 13, 2020

ababaian changed the title ~~cov1r Multiple Sequence Alignment~~ cov1 Multiple Sequence Alignment Apr 15, 2020

mathemage added this to Open Tasks in TODO List via automation Apr 18, 2020

victorlin mentioned this issue Apr 20, 2020

cov2 pan-genome refinements #41

Closed

rcedgar closed this as completed Apr 20, 2020

TODO List automation moved this from Open Tasks to Completed Tasks Apr 20, 2020

ababaian removed this from Completed Tasks in TODO List May 15, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

cov1 Multiple Sequence Alignment #36

cov1 Multiple Sequence Alignment #36

ababaian commented Apr 13, 2020 •

edited

Loading

victorlin commented Apr 14, 2020

ababaian commented Apr 14, 2020

victorlin commented Apr 14, 2020 •

edited

Loading

ababaian commented Apr 14, 2020 •

edited

Loading

victorlin commented Apr 14, 2020

ababaian commented Apr 14, 2020

ababaian commented Apr 17, 2020

brietaylor commented Apr 17, 2020

ababaian commented Apr 18, 2020 •

edited

Loading

cov1 Multiple Sequence Alignment #36

cov1 Multiple Sequence Alignment #36

Comments

ababaian commented Apr 13, 2020 • edited Loading

Preamble

Objective

victorlin commented Apr 14, 2020

ababaian commented Apr 14, 2020

victorlin commented Apr 14, 2020 • edited Loading

ababaian commented Apr 14, 2020 • edited Loading

victorlin commented Apr 14, 2020

ababaian commented Apr 14, 2020

ababaian commented Apr 17, 2020

brietaylor commented Apr 17, 2020

ababaian commented Apr 18, 2020 • edited Loading

ababaian commented Apr 13, 2020 •

edited

Loading

victorlin commented Apr 14, 2020 •

edited

Loading

ababaian commented Apr 14, 2020 •

edited

Loading

ababaian commented Apr 18, 2020 •

edited

Loading