Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Differentiate gap and non-gap misassemblies #46

Closed
sjackman opened this Issue Jan 31, 2018 · 30 comments

Comments

Projects
None yet
3 participants
@sjackman
Copy link
Contributor

sjackman commented Jan 31, 2018

In the # misassemblies count in misassemblies_report.txt, I'd like to know how many of the # relocations occurred in scaffold gaps. In the mis_contigs.info report, it'd be helpful to report which Extensive misassembly (relocation) were found in scaffold gaps, and the size of the scaffold gap size error. I'm curious which misassemblies may have been classified as a scaffold gap size mis. rather than an extensive misassembly had the --scaffold-gap-max-size threshold been higher.

@sjackman

This comment has been minimized.

Copy link
Contributor Author

sjackman commented Jan 31, 2018

Can you confirm that 10 or more Ns in the assembly is considered a scaffold gap, for the purpose of classifying scaffold gap size mis.?

@alexeigurevich

This comment has been minimized.

Copy link
Contributor

alexeigurevich commented Jan 31, 2018

Well, technically yes, 10 or more Ns may be considered a scaffold gap. This is not the only condition of course. Lets assume there are two aligned fragments in a scaffold, A1 and A2. To consider a scaffold gap misassembly between them, we check:

  • there is a misassembly between A1 and A2 (using regular misassembly definition), the size of inconsistency between reference and scaffold positions of A1 and A2 is less than scaffolds_gap_threshold (10 kbp, by default)
  • both A1 and A2 align on the same strand of the same chromosome (i.e., the misassembly is NOT an inversion or a translocation)
  • there is a sequence of nucleotides between A1 and A2 in the scaffold, the sequence should have length at least 10 (here it is!) and at least 95% of the nucleotides should be Ns.
@sjackman

This comment has been minimized.

Copy link
Contributor Author

sjackman commented Jan 31, 2018

and at least 95% of the nucleotides should be Ns

Ahah! That explains the behaviour that I'm seeing. I have a scaffold gap of 100 Ns (arbitrary number of Ns, because the size of the gap is not estimated). There's a single SNV 10 bp from the end of A1, which prevents that 10 bp from aligning to the reference. So the unaligned sequence is 100 bp composed of 10 nucleotides and 100 Ns. This is being categorized as a misassembly rather than a scaffold gap size mis. because 100/110 = 91% < 95%.

With the default ---min-identity=95 up to 20 nucleotides at the end of an alignment can be unaligned, due to a single SNV. If 20 nucleotides are unaligned from both sides, that's 40 unaligned nucleotides. To exceed the 95% N threshold, the gap would have to be larger than 800 Ns. Shorter gaps will be considered a misassembly.

@alexeigurevich

This comment has been minimized.

Copy link
Contributor

alexeigurevich commented Jan 31, 2018

Well, we can add some absolute thresholds in addition to 95% to exclude penalising of such small scaffold gaps. Do you think this is reasonable?
Actually, we expected larger stretches of Ns from scaffolders when they are not sure about the gap size (like 1000 or 2000 Ns).

@sjackman

This comment has been minimized.

Copy link
Contributor Author

sjackman commented Jan 31, 2018

It looks to me like this code:

return gap_in_contig.count('N') / len(gap_in_contig) >= qconfig.gap_filled_ns_threshold

should match this code:
if count_ns / float(unaligned_len) < qconfig.gap_filled_ns_threshold and unaligned_len - count_ns >= qconfig.unaligned_part_size:

and check that the number of unaligned nucleotides is larger than qconfig.unaligned_part_size.

@sjackman

This comment has been minimized.

Copy link
Contributor Author

sjackman commented Jan 31, 2018

In my opinion, I think the ratio check can be dropped. If there's at least 10 Ns, and the number of unaligned non-N nucleotides, is less than a fixed threshold, it should be considered a scaffold gap.

@sjackman

This comment has been minimized.

Copy link
Contributor Author

sjackman commented Feb 1, 2018

Here's a case where QUAST is incorrectly identifying a scaffold gap as a relocation misassembly, but it's not due to failing the Ns ration test.

A gist of the sequence (42 kbp):
https://gist.github.com/sjackman/80d7021b76a2d82b2f1e1e50c1649282

The minimap2 alignment:

10686884	42259	14361	42254	-	CM000669.2	159345973	4313215	4341083	23453	27909	60	tp:A:P	cm:i:2193	s1:i:23442	s2:i:542
10686884	42259	16	4311	-	CM000669.2	159345973	4341196	4345492	3787	4309	60	tp:A:P	cm:i:367	s1:i:3781	s2:i:77

The QUAST report:

10686884_32279_0_470812__..._10458231
Extensive misassembly (relocation, inconsistency = -9937) between 4319 1 and 42259 14331
4341189	4345508	4319	1	CM000669.2_Homo_sapiens_chromosome_7__GRCh38_reference_primary_assembly	10686884_32279_0_470812__..._10458231	99.12		True
relocation, inconsistency = -9937
4313211	4341114	42259	14331	CM000669.2_Homo_sapiens_chromosome_7__GRCh38_reference_primary_assembly	10686884_32279_0_470812__..._10458231	99.56		True
CONTIG	10686884_32279_0_470812__..._10458231	42259	misassembled

The unaligned sequence of the contig between 10686884:4320-14330 is composed of 11 nucleotides and 10,000 Ns. So it passes the 95% Ns ration test. Any idea why it's not being classified as a scaffold gap?

@sjackman

This comment has been minimized.

Copy link
Contributor Author

sjackman commented Feb 1, 2018

I believe this line of code has a bug with Python 2:

return gap_in_contig.count('N') / len(gap_in_contig) >= qconfig.gap_filled_ns_threshold

It's doing an integer division so the result is an integer, either 0 or 1. This test only passes if the sequence in the gap is composed entirely 100% of Ns. This bug wouldn't affect Python 3, where this division would yield a float.

May I suggest removing the ratio test, and replacing it with:

return gap_in_contig.count('N') >= qconfig.Ns_break_threshold
@sjackman

This comment has been minimized.

Copy link
Contributor Author

sjackman commented Feb 1, 2018

I've opened PR #47 to address this issue.

@alexeigurevich

This comment has been minimized.

Copy link
Contributor

alexeigurevich commented Feb 1, 2018

Thanks for debugging this issue! Yes, it was a division bug, worked correctly in Python 3 only.
The proper fix is to add
from __future__ import division
in the beginning of the module but since we decide to drop the ratio test, it is not needed here (but probably needed in other places, e.g. in your example from quast/quast_libs/ca_utils/analyze_contigs.py).

As for the suggested fix -- I do not fully agree with it. E.g., if you have a 10 000 bp long gap (inconsistency with reference) in the contig and there is just 10 Ns there (even not necessary consecutive, just a single N per 1 kbp) this will be treated as "scaffold gap size mis" according to your fix. And I am not sure that this is desired behaviour :)

I suggest to change to something like this:

gap_len = len(gap_in_contig)
if gap_len < qconfig.Ns_break_threshold:
    return False`
side_len = min(qconfig.MAX_INDEL_LENGTH, (gap_len - qconfig.Ns_break_threshold) // 2)
return gap_in_contig[X:-X] == len() * 'N')

That is, we ask for at least 10 Ns ("Ns_break_threshold") in the gap and also allow up to 50 ("MAX_INDEL_LENGTH") non-Ns at both sides of the gap. At the same time, all nucleotides inside the gap should be strictly Ns, which is a typical pattern for scaffold gaps.
Do you agree?

@sjackman

This comment has been minimized.

Copy link
Contributor Author

sjackman commented Feb 1, 2018

If there were a 10 kbp gap, I'd expect the 10 kbp of sequence to align somewhere else in the reference, and so it would be classified as a misassembly (either a translocation or relocation). If there's sequence that does not align to the reference at all, it's more likely a reference issue then an assembly issue, either sequence found in this individual and missing from the reference, or a >5% diverged mobile genetic element.

MAX_INDEL_LENGTH (50 bp) is an appropriate threshold in the middle of an alignment, but it's too small at the ends of alignments. For example, requiring 95% alignment identity, a 10 bp indel and a single SNV (normal individual variation) at the end of contig can cause 50 bp to go unaligned, and so incorrectly be classified as a misassembly rather than a scaffold gap.

I feel that an alignment gap should be classified as a scaffold gap as long as the gap contains more than 10 consecutive Ns, and the alignment blocks to the left and right of the alignment gap are collinear.

@sjackman

This comment has been minimized.

Copy link
Contributor Author

sjackman commented Feb 1, 2018

If there is an unaligned segment in the middle of a contig, I believe it's detected by this function process_unaligned_part, which increments possible_misassemblies.

if count_ns / float(unaligned_len) < qconfig.gap_filled_ns_threshold and unaligned_len - count_ns >= qconfig.unaligned_part_size:
possible_misassemblies = 1

Where does that unaligned part get counted in the metrics? My guess is either # unaligned mis. contigs or unaligned contigs (partial). Which one would it be? Does it also get counted as # misassembled contigs and/or # misassemblies?

@sjackman

This comment has been minimized.

Copy link
Contributor Author

sjackman commented Feb 1, 2018

even not necessary consecutive, just a single N per 1 kbp

Good catch. I've fixed up my PR #47:

def is_gap_filled_ns(contig_seq, align1, align2):
    gap_in_contig = contig_seq[align1.end(): align2.start() - 1]
    return 'N' * qconfig.Ns_break_threshold in gap_in_contig
@sjackman

This comment has been minimized.

Copy link
Contributor Author

sjackman commented Feb 1, 2018

gap_in_contig = contig_seq[align1.end(): align2.start() - 1]

PAF uses 0-based coordinates. Does QUAST use 1-based coordinates internally?
https://github.com/lh3/miniasm/blob/master/PAF.md

@alexeigurevich

This comment has been minimized.

Copy link
Contributor

alexeigurevich commented Feb 1, 2018

Will answer to all questions here:

If there were a 10 kbp gap, I'd expect the 10 kbp of sequence to align somewhere else in the reference, and so it would be classified as a misassembly (either a translocation or relocation). If there's sequence that does not align to the reference at all, it's more likely a reference issue then an assembly issue, either sequence found in this individual and missing from the reference, or a >5% diverged mobile genetic element.

I feel that an alignment gap should be classified as a scaffold gap as long as the gap contains more than 10 consecutive Ns, and the alignment blocks to the left and right of the alignment gap are collinear.

This totally make sense but note that this code is also used in MetaQUAST which deals with some complex cases like absence of some reference genomes, presence of very close reference genomes, and so on. I am afraid mostly about this use case. However, probably your suggestion make sense in the metagenomic case, too. I applied your PR locally and started the test on a medium-size metagenomic dataset and will compare results after/before.

Good catch. I've fixed up my PR #47:

Now I like it more :) Lets wait until my metagenomic test is finished and we probably merge this PR.

Does QUAST use 1-based coordinates internally?

Yes, we use 1-based coords since QUAST was originally based on Nucmer/Mummer which reports 1-based alignments. After recent switch to minimap2 we decided not to change internal representations (it is very painful!)

Where does that unaligned part get counted in the metrics? My guess is either # unaligned mis. contigs or unaligned contigs (partial). Which one would it be? Does it also get counted as # misassembled contigs and/or # misassemblies?

Most of the answers are in our manual (just search for the corresponding metric names). Briefly speaking: possible misassembly is a separate event which is reported in MetaQUAST mode only. However, it is always happen in partially unaligned contigs (but the opposite is not always true -- not all partially unaligned contigs have possible misassemblies). As for the # misassemblies -- no, they include only relocations/translocations/inversions and do NOT include local misassemblies, possible misassemblies, unaligned mis. contigs, scaffold gap size misassemblies, structural variations, etc.

@alexeigurevich

This comment has been minimized.

Copy link
Contributor

alexeigurevich commented Feb 2, 2018

My metagenomic test of the PR passed correctly, so I merged it. Thanks, @sjackman !

By the way, are you still interested in the topic of this issue (reporting number of relocations which could be possibly counted as scaffold gap errors)?

@sjackman

This comment has been minimized.

Copy link
Contributor Author

sjackman commented Feb 2, 2018

Yes, I'd be interested to see # misassembles and possibly its three subcategories broken down by whether they're involved in a scaffold gap or not. It's helpful to know whether the error was introduced by the unitigger or the scaffolder. I found that the size of each gap size estimation error is in contigs_report_*.stdout, when the gap size error is between [extensive misassembly size, scaffold gap size] ([7k, 10k] by default). It'd be helpful to print a similar message in .stdout when the inconsistency is <7k or >10k.

@sidorov-si

This comment has been minimized.

Copy link

sidorov-si commented Feb 2, 2018

@sjackman

This comment has been minimized.

Copy link
Contributor Author

sjackman commented Feb 2, 2018

@sidorov-si ABySS-Samtobreak reports scaffolding errors. It takes a SAM file of aligned scaftigs and produces a SAM file of non-colinear aligned scaftigs. It may be useful to you. It's built by default when you compile ABySS when you have ghc in your PATH.

sudo apt-get install ghc
# or
brew install ghc

https://github.com/bcgsc/abyss/tree/master/Misc
Cut your scaffolds into scaftigs using seqtk cutN -n1 and then align them with either bwa mem -x intractg or minimap2 -x asm10.

I'm looking forward to similar functionality from QUAST! =)

@sidorov-si

This comment has been minimized.

Copy link

sidorov-si commented Feb 2, 2018

@sjackman

This comment has been minimized.

Copy link
Contributor Author

sjackman commented Feb 2, 2018

IGV can also view PAF files, though it doesn't currently visualize the alignment in the CIGAR strings, only the alignment blocks.

@alexeigurevich

This comment has been minimized.

Copy link
Contributor

alexeigurevich commented Feb 2, 2018

Ok, this feature looks useful for you and probably for the whole community, too :)
We will add it soon!

Few notes:

Yes, I'd be interested to see # misassembles and possibly its three subcategories broken down by whether they're involved in a scaffold gap or not.

Hmm, only relocations could be potentially a scaffold gap (in case of increased --scaffold-gap-max-size), while inversions and translocations could not. At least using our definition of scaffold gaps. Am I wrong?

I found that the size of each gap size estimation error is in contigs_report_*.stdout, when the gap size error is between [extensive misassembly size, scaffold gap size] ([7k, 10k] by default). It'd be helpful to print a similar message in .stdout when the inconsistency is <7k or >10k.

Are you talking about messages like "inconsistency = ..."? Actually we report them already for <7k and >10k. For local misassemblies (<7k by default in --large mode and <1k by default, otherwise) they look like
Gap between these two alignments (local misassembly). Inconsistency = -190
For relocations they look like:
Extensive misassembly (relocation, inconsistency = 1886198) between these two alignments
(note the number is huge).
In case of scaffold gap misassemblies we report "gap length difference".
We do NOT report inconsistency in case of inversions and translocations because it cannot be properly defined in this case.

@sjackman

This comment has been minimized.

Copy link
Contributor Author

sjackman commented Feb 2, 2018

Yes, I'd be interested to see # misassembles and possibly its three subcategories broken down by whether they're involved in a scaffold gap or not.

Hmm, only relocations could be potentially a scaffold gap (in case of increased --scaffold-gap-max-size), while inversions and translocations could not. At least using our definition of scaffold gaps. Am I wrong?

Say in the assembly there's a scaffold composed of two contigs A and B with a gap of 10 or more Ns between them. If A and B map for then 10,000 bp apart, it's a scaffolding relocation misassembly. If they map to different chromosomes, it's a scaffolding translocation misassembly, and if they map to the same chromosome with different orientations, it's also a scaffold orientation misassembly. You could perhaps define a scaffold inversion misassembly as three colinear contigs A B' C, where is B is inverted with respect to A and C.

@sjackman

This comment has been minimized.

Copy link
Contributor Author

sjackman commented Feb 2, 2018

ca_output.stdout_f.write('\t\t\t Incorrectly estimated size of scaffold gap between these two alignments: ')

This message is reported when there's a scaffold gap inconsistency in the range of [7k, 10k].

Incorrectly estimated size of scaffold gap between these two alignments: gap length difference = 8509

For relocations they look like:
Extensive misassembly (relocation, inconsistency = 1886198) between these two alignments

This message doesn't specify whether there's a scaffold gap of more than 10 Ns on the query between the two alignment blocks. It could be either a contig or a scaffolding error.

We do NOT report inconsistency in case of inversions and translocations because it cannot be properly defined in this case.

Whether the alignment break involves a scaffold gap of Ns or not can be reported.

@alexeigurevich

This comment has been minimized.

Copy link
Contributor

alexeigurevich commented Feb 4, 2018

Ok, got it.
Note that the currently implemented scaffold gap size misassemblies are excluded from # misassemblies, since we consider them as not severe errors (like local misassemblies).
Suggested novel metrics (scaffold relocations, translocations, inversions) will BE counted in # misassemblies. In fact it is an extended classification of them in addition to regular relocations, translocation, inversions. I hope this handful of novel metrics will not be confusing in the main report :)

... if they map to the same chromosome with different orientations, it's also a scaffold orientation misassembly. You could perhaps define a scaffold inversion misassembly as three colinear contigs A B' C, where is B is inverted with respect to A and C.

Do you mind if we call your scaffold orientation misassembly as scaffold inversion for consistency with our regular inversions? And if we also do not report your scaffold inversion misassembly, since all our current misassemblies are "breakpoint-like", i.e. defined as an even between two adjacent blocks.

@sjackman

This comment has been minimized.

Copy link
Contributor Author

sjackman commented Feb 5, 2018

Yep, that all makes sense me to me.

I hope this handful of novel metrics will not be confusing in the main report :)

I don't believe this breakdown needs to be in the main report, since relocations, translocations, inversions aren't already reported in the main report. I'd suggest they go in misassemblies_report.txt. Perhaps something like…

# misassemblies              1200
  sequence                   600
    # relocations            300
    # translocations         200
    # inversions             100
  scaffold                   600
    # relocations            300
    # translocations         200
    # inversions             100
@alexeigurevich

This comment has been minimized.

Copy link
Contributor

alexeigurevich commented Feb 6, 2018

Note that relocations, translocations, inversions aren't in report.txt but they are present in report.html (after pressing "extended report" button). However, your suggested report structure looks pretty nice to me, I think we can report them in a similar fashion in HTML report, too.

@sjackman

This comment has been minimized.

Copy link
Contributor Author

sjackman commented Feb 6, 2018

Great! contig may be a better name than sequence for the subsection.

@alexeigurevich

This comment has been minimized.

Copy link
Contributor

alexeigurevich commented Jun 9, 2018

We are finalizing v.5.0 release and this feature was added there in the following format.

  1. The detailed log file (contigs_report_*.stdout) now includes additional labels "scaffold gap is present" for misassemblies containing the gap but not marked as "scaffold gap mis size" misassembly (e.g. because of too large inconsistency).
    Example:
Real Alignment 1: 3533883 3536114 | 2199 1 | 2232 2199 | 95.18 | gi_49175990_ref_NC_000913.2__Escherichia_coli_str._K_12_substr._MG1655__complete_genome Contig_390
       Extensive misassembly (relocation, scaffold gap is present, inconsistency = 10184) between these two alignments
Real Alignment 2: 3523090 3523261 | 2808 2637 | 172 172 | 100.0 | gi_49175990_ref_NC_000913.2__Escherichia_coli_str._K_12_substr._MG1655__complete_genome Contig_390
  1. The detailed misassemblies report (.../contigs_reports/misassemblies_report.txt) now counts contig and scaffold misassemblies separately.
    Example (for metaquast run, so interspecies translocations are present):
Assembly                              meta_contigs_1  meta_contigs_2
# misassemblies                       3               4             
  # contig misassemblies              2               4             
    # c. relocations                  1               1             
    # c. translocations               0               0             
    # c. inversions                   0               0             
    # c. interspecies translocations  1               3             
  # scaffold misassemblies            1               0             
    # s. relocations                  0               0             
    # s. translocations               0               0             
    # s. inversions                   0               0             
    # s. interspecies translocations  1               0             
# misassembled contigs                2               2 
....
  1. The rest reports remain unchanged. E.g. HTML report does not distinguish between contig and scaffold misassemblies. Probably we will add this separation there too but let's test this feature with the text report only.
  2. Not directly related to this issue but you may be also interested in this: we also added "scaffold gap size local misassemblies" which are a combination of "scaffold gap size misassemblies" and "local misassemblies". Previously, if there was a scaffold gap size misassembly but inconsistency is small (less than extensive misassembly threshold, e.g. 1 kbp by default) we reported a local misassembly. Now we report "scaffold gap loc. mis." in such cases and these misassemblies are excluded from the number of local misassemblies.
@sjackman

This comment has been minimized.

Copy link
Contributor Author

sjackman commented Jun 12, 2018

Nice! Thanks! I'm excited for the upcoming 5.0 release of QUAST.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.
You signed in with another tab or window. Reload to refresh your session. You signed out in another tab or window. Reload to refresh your session.